
Paper deep dive

Explaining Black Box Text Modules in Natural Language with Language Models

Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin Yu, Jianfeng Gao

Year: 2023 · Venue: NeurIPS 2023 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 84

Models: BERT-base-uncased, GPT-3 (text-davinci-003), Instructor-XL, LLaMA-30B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 7:40:36 PM

Summary

The paper introduces 'Summarize and Score' (SASC), an automated method for generating natural language explanations for black-box text modules. SASC operates in two steps: summarizing top-activating ngrams using a pre-trained LLM to generate candidate explanations, and scoring these candidates by testing their reliability on synthetic data. The method is validated on synthetic modules, BERT transformer factors, and fMRI brain region responses, demonstrating high accuracy and reliability compared to human-provided explanations.

Entities (5)

BERT · model · 100%
SASC · method · 100%
GPT-3 · model · 95%
fMRI · technology · 95%
Text Module · concept · 90%

Relation Signals (4)

SASC explains Text Module

confidence 95% · SASC, a method that takes in a text module and returns a natural language explanation

SASC uses GPT-3

confidence 95% · We use GPT-3 (text-davinci-003, Feb. 2023) as the helper LLM

SASC evaluated on BERT

confidence 90% · we use SASC to explain modules found within a pre-trained BERT model

SASC evaluated on fMRI

confidence 90% · SASC can generate explanations for the response of individual fMRI voxels

Cypher Suggestions (2)

Identify helper models used by SASC · confidence 95% · unvalidated

MATCH (s:Method {name: 'SASC'})-[:USES]->(helper:Model) RETURN helper.name

Find all models explained by SASC · confidence 90% · unvalidated

MATCH (m:Method {name: 'SASC'})-[:EVALUATED_ON]->(target) RETURN target.name, target.entity_type

Abstract

Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs/outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in 3 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is made available on Github.

Tags

ai-safety (imported, 100%)
empirical (suggested, 88%)
mechanistic-interp (suggested, 92%)

Links

Open PDF directly →

Full Text

83,552 characters extracted from source content.


EXPLAINING BLACK BOX TEXT MODULES IN NATURAL LANGUAGE WITH LANGUAGE MODELS

Chandan Singh*,1, Aliyah R. Hsu*,1,2, Richard Antonello3, Shailee Jain3, Alexander G. Huth3, Bin Yu1, Jianfeng Gao1

1 Microsoft Research · 2 University of California, Berkeley · 3 The University of Texas at Austin · * Equal contribution

arXiv:2305.09863v2 [cs.AI] 15 Nov 2023

ABSTRACT

Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs/outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in 3 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is made available on Github. [1]

1 INTRODUCTION

Large language models (LLMs) have demonstrated remarkable predictive performance across a growing range of diverse tasks (Brown et al., 2020; Devlin et al., 2018). However, the inability to effectively interpret these models has led them to be characterized as black boxes. This opaqueness has debilitated their use in high-stakes applications such as medicine (Kornblith et al., 2022), and raised issues related to regulatory pressure (Goodman & Flaxman, 2016), safety (Amodei et al., 2016), and alignment (Gabriel, 2020). This lack of interpretability is particularly detrimental in scientific fields, such as neuroscience (Huth et al., 2016) or social science (Ziems et al., 2023), where trustworthy interpretation itself is the end goal.

To ameliorate these issues, we propose Summarize and Score (SASC). SASC produces natural language explanations for text modules. We define a text module f as any function that maps text to a scalar continuous value, e.g. a neuron in a pre-trained LLM. [2] Given f, SASC returns a short natural language explanation describing what elicits the strongest response from f. SASC requires only black-box access to the module (it does not require access to the module internals) and no human intervention.

[1] Scikit-learn-compatible API available at github.com/csinva/imodelsX; code for experiments along with all generated explanations is available at github.com/microsoft/automated-explanations.
[2] Note that a neuron in an LLM typically returns a sequence-length vector rather than a scalar, so a transformation (e.g. averaging) is required before interpretation.

Figure 1: SASC pipeline for obtaining a natural language explanation given a module f. (1) SASC first generates candidate explanations (using a pre-trained LLM) based on the ngrams that elicit the most positive response from f. (2) SASC then evaluates each candidate explanation by generating synthetic data based on the explanation and testing the response of f to the data.

SASC uses two steps to ground explanations in the responses of f (Fig. 1). In the first step, SASC derives explanation candidates by sorting f's responses to ngrams and summarizing the top ngrams using a pre-trained LLM. In the second step, SASC evaluates each candidate explanation by generating synthetic text based on the explanation (again with a pre-trained LLM) and testing the response of f to the text; these responses to synthetic text are used to assign an explanation score to each explanation that rates its reliability. Decomposing explanation into these separate steps helps mitigate issues with LLM hallucination when generating and evaluating explanations.

We evaluate SASC in two contexts. In our main evaluation, we evaluate SASC on synthetic modules and find that it can often recover ground truth explanations under different experimental conditions (Sec. 3). In our second evaluation, we use SASC to explain modules found within a pre-trained BERT model after applying dictionary learning (details in Sec. 4), and find that SASC explanations are often of comparable quality to human-given explanations (without the need for manual annotation). Furthermore, we verify that BERT modules which are useful for downstream text-classification tasks often yield explanations related to the task.

The recovered explanations yield interesting insights. Modules found within BERT respond to a variety of different phenomena, from individual words to broad semantic concepts. Additionally, we apply SASC to modules that are trained to predict the response of individual brain regions to language stimuli, as measured by fMRI. We find that explanations for fMRI modules pertain more to social concepts (e.g. relationships and family) than BERT modules, suggesting possible different emphases between modules in BERT and in the brain. These explanations also provide fine-grained hypotheses about the selectivity of different brain regions to semantic concepts.

2 METHOD

SASC aims to interpret a text module f, which maps text to a scalar continuous value. For example, f could be the output probability for a single token in an LLM, or the output of a single neuron extracted from a vector of LLM activations. SASC returns a short explanation describing what elicits the strongest response from f, along with an explanation score, which rates how reliable the explanation is. In the process of explanation, SASC uses a pre-trained helper LLM to perform summarization and to generate synthetic text. To mitigate potential hallucination introduced by the helper LLM, SASC decomposes the explanation process into 2 steps (Fig. 1) that greatly simplify the task performed by the helper LLM:
Step 1: Summarization. The first step generates candidate explanations by summarizing ngrams. All unique ngrams are extracted from a pre-specified corpus of text and fed through the module f. The ngrams that elicit the largest positive response from f are then fed through the helper LLM for summarization. To avoid over-reliance on the very top ngrams, we select a random subset of the top ngrams in the summarization step. This step is similar to prior works which summarize ngrams using manual inspection/parse trees (Kádár et al., 2017; Na et al., 2019), but the use of the helper LLM enables flexible, automated summarization.

The computational bottleneck of SASC is computing f's response to the corpus ngrams. This computation requires two choices: the corpus underlying the extracted ngrams, and the length of ngrams to extract. Using a larger corpus or higher-order ngrams can make SASC more accurate, but the computational cost grows linearly with the number of unique ngrams in the corpus. The corpus should be large enough to include relevant ngrams, as the corpus limits which generated explanations are possible (e.g. it is difficult to recover mathematical explanations from a corpus that contains no math). To speed up computation, ngrams can be subsampled from the corpus.

Step 2: Synthetic scoring. The second step aims to evaluate each candidate explanation and select the most reliable one. SASC generates synthetic data based on each candidate explanation, again using the helper LLM. Intuitively, if the explanation accurately describes f, then f should output large values for text related to the explanation (Text+) compared to unrelated synthetic text (Text-). [3] We then compute the explanation score as follows:

    Explanation score = E[f(Text+) - f(Text-)], with units σ_f,    (1)

where a larger score corresponds to a more reliable explanation. We report the score in units of σ_f, the standard deviation of f's response to the corpus. An explanation score of 1σ_f means that synthetic text related to the explanation increased the mean module response by one standard deviation compared to unrelated text. SASC returns the candidate explanation that maximizes this difference, along with the synthetic data score. The selection of the highest-scoring explanation is similar to the reranking step used in some prompting methods, e.g. (Shin et al., 2020), but differs in that it maximizes f's response to synthetic data rather than optimizing the likelihood of a pre-specified dataset.

[3] The unrelated synthetic text should be neutral text that omits the relevant explanation, but may introduce bias into the scoring if the helper LLM improperly generates negative synthetic texts. Instead of synthetic texts, a large set of neutral texts may be used for Text-, e.g. samples from a generic corpus.

Limitations and hyperparameter settings. While effective, the explanation pipeline described here has some clear limitations. First and foremost, SASC assumes that f can be concisely described in a natural language string. This excludes complex functions or modules that respond to a non-coherent set of inputs. Second, SASC only describes the inputs that elicit the largest responses from f, rather than its full behavior. Finally, SASC requires that the pre-trained LLM can faithfully perform its required tasks (summarization and generation). If an LLM is unable to perform these tasks sufficiently well, users may treat the output of SASC as candidate explanations to be vetted by a human, rather than as final explanations. We use GPT-3 (text-davinci-003, Feb. 2023) (Brown et al., 2020) as the helper LLM (see LLM prompts in Appendix A.1). In the summarization step, we use word-level trigrams, choose 30 random ngrams from the top 50, and generate 5 candidate explanations. In the synthetic scoring step, we generate 20 synthetic strings (each a sentence) for each candidate explanation, half of which are related to the explanation.
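Given only black-box access to f, the score in Eq. (1) takes a few lines to compute. Below is a minimal Python sketch, not the paper's implementation: the module f, the synthetic texts, and the corpus sample used to estimate σ_f are toy placeholders.

import numpy as np

def explanation_score(f, texts_related, texts_unrelated, corpus_sample):
    # Eq. (1): E[f(Text+) - f(Text-)], reported in units of sigma_f.
    #   f               -- black-box module mapping a string to a float
    #   texts_related   -- synthetic texts generated from the explanation (Text+)
    #   texts_unrelated -- synthetic texts unrelated to the explanation (Text-)
    #   corpus_sample   -- reference texts used to estimate sigma_f
    sigma_f = np.std([f(t) for t in corpus_sample])
    mean_pos = np.mean([f(t) for t in texts_related])
    mean_neg = np.mean([f(t) for t in texts_unrelated])
    return (mean_pos - mean_neg) / sigma_f

# Toy usage with a trivial "module" that responds to the word "sports":
f = lambda text: float("sports" in text.lower())
print(explanation_score(
    f,
    texts_related=["i love watching sports", "the sports team won"],
    texts_unrelated=["the recipe needs flour", "it rained all day"],
    corpus_sample=["sports news", "stock prices fell", "a nice poem", "music awards"],
))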
3 RECOVERING GROUND TRUTH EXPLANATIONS FOR SYNTHETIC MODULES

This section describes our main evaluation of SASC: its ability to recover explanations for synthetic modules with a known ground truth explanation.

Experimental setup for synthetic modules. We construct 54 synthetic modules based on the pre-trained Instructor embedding model (Su et al., 2022) (hkunlp/instructor-xl). Each module is based on a dataset from a recent diverse collection (Zhong et al., 2021; 2022) that admits a simple, verifiable keyphrase description of each underlying dataset, e.g. "related to math" (full details in Table A2). Each module is constructed to return high values for text related to the module's groundtruth keyphrase and low values otherwise. Specifically, the module computes the Instructor embedding for an input text and for the groundtruth keyphrase; it then returns the negative Euclidean distance between the embeddings (a minimal code sketch of this construction follows below). We find that the synthetic modules reliably produce large values for text related to the desired keyphrase (Fig. A3).

We test SASC's ability to recover accurate explanations for each of our 54 modules in 3 settings: (1) The Default setting extracts ngrams for summarization from the dataset corresponding to each module, which contains relevant ngrams for the ground truth explanation. (2) The Restricted corpus setting checks the impact of the underlying corpus on the performance of SASC. To do so, we restrict the ngrams used for generating explanation candidates to a corpus from a random dataset among the 54, which potentially contains less relevant ngrams. (3) The Noisy module setting adds Gaussian noise with standard deviation 3σ_f to all module responses in the summarization step.

Baselines and evaluation metrics. We compare SASC to three baselines: (1) ngram summarization, which summarizes top ngrams with an LLM but does not use explanation scores to select among candidate explanations (essentially SASC without the scoring step); (2) gradient-based explanations (Poerner et al., 2018), which use the gradients of f with respect to the input to generate maximally activating inputs; and (3) topic modeling (Blei et al., 2003), which learns a 100-component dictionary over ngrams using latent Dirichlet allocation.

We evaluate the similarity of the recovered explanation and the groundtruth explanation in two ways: (1) Accuracy: verifying whether the ground truth is essentially equivalent to the recovered explanation via manual inspection, and (2) BERT-score (Zhang et al., 2019). [4] We find that these two metrics, when averaged over the datasets studied here, have a perfect rank correlation, i.e. every increase in average accuracy corresponds to an increase in average BERT score. For topic modeling, accuracy is evaluated by taking the top-30 scoring ngrams for the module (as is done with SASC), finding the 5 topics with the highest scores for these ngrams, and manually checking whether there is a match between the groundtruth and any of the top-5 words in any of these topics.

[4] BERT-score is calculated with the base model microsoft/deberta-xlarge-mnli (He et al., 2021).
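The synthetic-module construction described above is straightforward to reproduce. The sketch below uses the InstructorEmbedding package and the instruction prompt reported in Fig. A3; treat the exact package interface as an assumption rather than the paper's code.

import numpy as np
from InstructorEmbedding import INSTRUCTOR  # pip install InstructorEmbedding

model = INSTRUCTOR("hkunlp/instructor-xl")
INSTRUCTION = "Represent the short phrase for clustering:"  # prompt reported in Fig. A3

def make_synthetic_module(keyphrase):
    # Returns f: text -> negative Euclidean distance to the keyphrase embedding,
    # so text related to the keyphrase elicits larger (less negative) responses.
    key_emb = model.encode([[INSTRUCTION, keyphrase]])[0]
    def f(text):
        text_emb = model.encode([[INSTRUCTION, text]])[0]
        return -float(np.linalg.norm(text_emb - key_emb))
    return f

f_math = make_synthetic_module("math")
print(f_math("the integral of x squared"))  # comparatively large response
print(f_math("my dog likes the park"))      # comparatively small response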
SASC can recover ground truth descriptions. Table 1 shows the performance of SASC at recovering ground truth explanations. In the Default setting, SASC successfully identifies 88% of the ground truth explanations. In the two noisy settings, SASC still manages to recover explanations 67% and 68% of the time for the Restricted corpus and Noisy module settings, respectively. In all cases, SASC outperforms the ngram-summarization baseline.

Table 2 shows the results for the Default setting when varying different modeling choices. Performance is similar across various choices, such as using bigrams or 4-grams rather than trigrams in the summarization step, or using the LLaMA-2 13-billion-parameter model (Touvron et al., 2023b) as the helper LLM rather than GPT-3. Additionally, we find that explanation performance increases with the capabilities of the helper LLM used for summarization/generation (Fig. A1). Table 2 also shows that the gradient-based baseline fails to accurately identify the underlying groundtruth text, consistent with previous work in prompting (Singh et al., 2022b; Shin et al., 2020), and that topic modeling performs poorly, largely because the topic model fails to construct topics relevant to each specific module, as the same input ngrams are shared across all modules.

Table 3 shows examples of correct and incorrect recovered explanations along with the ground truth explanation. For some modules, SASC finds perfect keyword matches, e.g. sports, or slight paraphrases, e.g. definition → defining or explaining something. For the incorrect examples, the generated explanation is often similar to the ground truth explanation, e.g. derogatory → negative language and criticism, but occasionally SASC fails to correctly identify the underlying pattern, e.g. ungrammatical → language. Some failures may be due to the inability of ngrams to capture the underlying explanation, whereas others may be due to the constructed module imperfectly representing the ground truth explanation.

Fig. 2 shows the cumulative accuracy at recovering the ground truth explanation as a function of the explanation score. Across all settings, accuracy increases as a function of explanation score, suggesting that higher explanation scores indicate more reliable explanations. This also helps validate that the helper LLM is able to successfully generate useful synthetic texts for evaluation.

Table 1: Explanation recovery performance. For both metrics, higher is better. Each value is averaged over 54 modules and 3 random seeds; errors show standard error of the mean.

Setting | SASC Accuracy | SASC BERT Score | Baseline (ngram summarization) Accuracy | Baseline BERT Score
Default | 0.883±0.03 | 0.712±0.02 | 0.753±0.02 | 0.622±0.05
Restricted corpus | 0.667±0.04 | 0.639±0.02 | 0.540±0.02 | 0.554±0.05
Noisy module | 0.679±0.04 | 0.669±0.02 | 0.456±0.02 | 0.565±0.06
Average | 0.743 | 0.673 | 0.582 | 0.580

Table 2: Explanation recovery accuracy when varying hyperparameters for the Default setting; averaged over 54 modules and 3 random seeds.

Variant | Accuracy | BERT Score
SASC (Original) | 0.883±0.03 | 0.712±0.02
SASC (Bigrams) | 0.815±0.04 | 0.690±0.03
SASC (4-grams) | 0.889±0.03 | 0.714±0.02
SASC (LLaMA-2 summarizer) | 0.870±0.03 | 0.705±0.02
SASC (LLaMA-2 generator) | 0.852±0.04 | 0.701±0.02
Baseline (Gradient-based) | 0.093±0.01 | 0.351±0.01
Baseline (Topic modeling) | 0.111±0.01 | 0.388±0.01
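The BERT-score numbers in Tables 1 and 2 compare each generated explanation against its groundtruth keyphrase. A minimal sketch with the bert-score package, using the deberta-xlarge-mnli backbone named in footnote 4 (the call options shown are assumptions, not the paper's exact evaluation code):

from bert_score import score  # pip install bert-score

candidates = ["defining or explaining something", "negative language and criticism"]
references = ["definition", "derogatory"]

# Compare each candidate explanation to its groundtruth keyphrase; F1 is the usual summary.
P, R, F1 = score(
    candidates,
    references,
    model_type="microsoft/deberta-xlarge-mnli",  # backbone reported in footnote 4
    lang="en",
)
print(F1.tolist())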
Table 3: Examples of recovered explanations for different modules in the Default setting.

Groundtruth Explanation | SASC Explanation
Correct:
atheistic | atheism and related topics, such as theism, religious beliefs, and atheists
environmentalism | environmentalism and climate action
crime | crime and criminal activity
sports | sports
definition | defining or explaining something
facts | information or knowledge
Incorrect:
derogatory | negative language and criticism
ungrammatical | language
subjective | art and expression

Figure 2: Cumulative accuracy at recovering the ground truth explanation increases as a function of explanation score (percentile), shown for the Default, Restricted corpus, and Noisy module settings. Error bars show standard error of the mean.

4 GENERATING EXPLANATIONS FOR BERT TRANSFORMER FACTORS

Next, we evaluate SASC using explanations for modules within BERT (Devlin et al., 2018) (bert-base-uncased). In the absence of ground truth explanations, we evaluate the explanations by (i) comparing them to human-given explanations and (ii) checking their relevance to downstream tasks.

BERT transformer factor modules. One can interpret any module within BERT, e.g. a single neuron or an expert in an MOE (Fedus et al., 2022); here, we choose to interpret transformer factors, following a previous study that suggests they are amenable to interpretation (Yun et al., 2021). Transformer factors learn a transformation of activations across layers via dictionary learning (details in Appendix A.3; the corpus used is the WikiText dataset (Merity et al., 2016)). Each transformer factor is a module that takes as input a text sequence and yields a scalar dictionary coefficient, after averaging over the input's sequence length. There are 1,500 factors, and their coefficients vary for each of BERT's 13 encoding layers. A hedged sketch of one such module is given below.
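The following sketch shows the shape of one transformer-factor module: embed text with BERT, sparse-code the token activations against a learned dictionary, and average the chosen factor's coefficient over tokens. The dictionary D would come from the dictionary-learning step of Yun et al. (2021); here it is a random placeholder, and the layer choice and sparse-coding settings are assumptions.

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import SparseCoder

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

# Placeholder dictionary: 1,500 factors x 768 hidden dims. In the paper this is
# learned over WikiText activations (Yun et al., 2021); here it is random.
rng = np.random.default_rng(0)
D = rng.standard_normal((1500, 768))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm dictionary atoms
coder = SparseCoder(dictionary=D, transform_algorithm="lasso_lars", transform_alpha=1.0)

def transformer_factor_module(text, factor_idx, layer=4):
    # f: text -> scalar coefficient of one dictionary factor, averaged over tokens.
    with torch.no_grad():
        inputs = tokenizer(text, return_tensors="pt")
        hidden = bert(**inputs).hidden_states[layer][0].numpy()  # (seq_len, 768)
    codes = coder.transform(hidden)  # (seq_len, 1500) sparse coefficients
    return float(codes[:, factor_idx].mean())

print(transformer_factor_module("the park was quiet", factor_idx=42))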
Mapping explained modules to text-classification tasksWe now investigate whether the learned SASC explanations are useful for informing which downstream tasks a module is useful for. Given a classification dataset where the inputXis a list ofnstrings and the outputyis a list ofnclass labels, we first convertXto a matrix of transformer factor coefficientsX T F ∈R n×19,500 , where each row contains the concatenated factor coefficients across layers. We then fit a sparse logistic regression model to(X T F ,y), and analyze the explanations for the factors with the 25 largest coefficients across all classes. Ideally, these explanations would be relevant to the text-classification task; we evaluate what fraction of the 25 explanations are relevant for each task via manual inspection. We study 3 widely used text-classification datasets:emotion(Saravia et al., 2018) (classifying tweet emotion as sadness, joy, love, anger, fear or surprise),ag-news(Zhang et al., 2015) (classifying news headlines as world, sports, business, or sci/tech), andSST2(Socher et al., 2013) (classifying movie review sentiment as positive or negative). Table 5 shows results evaluating the BERT trans- former factor modules selected by a sparse linear model fit to these datasets. A large fraction of the explanations for selected modules are, in fact, relevant to their usage in downstream tasks, ranging from 0.35 forEmotionto 0.96 forAG News. TheAG Newstask has a particularly large fraction of relevant explanations, with many explanations corresponding very directly to class labels, e.g.pro- fessional sports teams→sportsorfinancial investments→business. See the full set of generated explanations in Appendix A.3. Patterns in SASC explanationsSASC provides 1,500 explanations for transformer factors in 13 layers of BERT. Fig. 3 shows that the explanation score decreases with increasing layer depth, suggesting that SASC better explains factors at lower layers. The mean explanation score across all layers is 1.77σ f . To understand the breakdown of topics present in the explanations, we fit a topic model (with Latent Dirichlet Allocation (Blei et al., 2003)) to the remaining explanations. The topic model has 10 topics 6 Table 5: BERT modules selected by a sparse linear model fit to text-classification tasks. First row shows the fraction of explanations for the selected modules which are relevant to the downstream task. Second row shows test accuracy for the fitted linear models. Bottom section shows sample explanations for modules selected by the linear model which are relevant to the downstream task. Values are averaged over 3 random linear model fits (error bars show the standard error of the mean). EmotionAG NewsSST2 Fraction relevant 0.35±0.0820.96±0.0330.44±0.086 Test accuracy 0.75±0.0010.81±0.0010.84±0.001 Sample relevant explanations negative emotions such as hatred,disgust,disdain, rage, and horror people, places, or things related to japan a negative statement, usu- ally in the form of not or nor injury or impairmentprofessional sports teamshatred and violence humorgeographyharm, injury, or damage romancefinancial investmentssomething being incorrect or wrong and preprocesses each explanation by converting it to a vector of word counts. We exclude all factors that do not attain an explanation score of at least1σ f from the topic model, as they are less likely to be correct. Fig. 4 shows each topic along with the proportion of modules whose largest topic coefficient is for that topic. 
5 GENERATING EXPLANATIONS FOR FMRI-VOXEL MODULES

fMRI voxel modules. A central challenge in neuroscience is understanding how and where semantic concepts are represented in the brain. To meet this challenge, one line of study predicts the response of different brain voxels (i.e. small regions in the brain) to natural language stimuli (Huth et al., 2016; Jain & Huth, 2018). We analyze data from (LeBel et al., 2022) and (Tang et al., 2023), which consists of fMRI responses for 3 human subjects as they listen to 20+ hours of narrative stories from podcasts. We fit modules to predict the fMRI response in each voxel from the text that the subject was hearing by extracting text embeddings with a pre-trained LLaMA model (decapoda-research/llama-30b-hf) (Touvron et al., 2023a). After fitting the modules on the training split and evaluating them on the test split using bootstrapped ridge regression, we generate SASC explanations for 1,500 well-predicted voxel modules, distributed evenly among the three human subjects and diverse cortical areas (see details on the fMRI experimental setup in Appendix A.4.1; a minimal sketch of such an encoding model is given below).

Voxel explanations. Table 6 shows examples of explanations for individual voxels, along with three top ngrams used to derive each explanation. Each explanation unifies fairly different ngrams under a common theme, e.g. sliced cucumber, cut the apples, sauteed shiitake... → food preparation. In some cases, the explanations recover language concepts similar to known selectivity in sensory modalities, e.g. face selectivity in IFSFP (Tsao et al., 2008) and selectivity for non-speech sounds such as laughter in primary auditory cortex (Hamilton et al., 2021). The ngrams also provide more fine-grained hypotheses for selectivity (e.g. physical injury or pain) compared to the coarse semantic categories proposed in earlier language studies (e.g. emotion (Huth et al., 2016; Binder et al., 2009; Mitchell et al., 2008)).

Fig. 4 shows the topics that fMRI explanations best fit into, compared with BERT transformer factors. The proportions for many topics are similar, but the fMRI explanations yield a much greater proportion for the topic consisting of social words (e.g. relationships, communication, family) and perceptual words (e.g. action, movement, physical). This is consistent with prior knowledge, as the largest axis of variation for fMRI voxels is known to separate social concepts from physical concepts (Huth et al., 2016).

Figure 3: Explanation score for BERT (blue, by layer 0–12) and fMRI (orange, by ROI) modules. As the BERT layer increases, the explanation score tends to decrease, implying modules are harder to explain with SASC. Across regions, explanation scores for fMRI voxel modules are generally lower than scores for BERT modules in early layers and comparable to scores for the final layers. Boxes show the median and interquartile range. ROI abbreviations: premotor ventral hand area (PMvh), anterior temporal face patch (ATFP), auditory cortex (AC), parietal operculum (PO), inferior frontal sulcus face patch (IFSFP), Broca's area (Broca).
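For reference, here is a minimal sketch of a voxelwise encoding model of the kind described above: LLaMA embeddings of the stimulus text predict each voxel's response via ridge regression. The paper uses bootstrapped ridge regression over the full story dataset; this stand-in uses scikit-learn's RidgeCV, random placeholder data, and an assumed embedding dimension.

import numpy as np
from sklearn.linear_model import RidgeCV

# Placeholder design: one text embedding per fMRI timepoint (the paper extracts
# embeddings from decapoda-research/llama-30b-hf; random stand-ins here).
rng = np.random.default_rng(0)
n_train, n_test, emb_dim, n_voxels = 1000, 200, 4096, 50
X_train = rng.standard_normal((n_train, emb_dim))
X_test = rng.standard_normal((n_test, emb_dim))
Y_train = rng.standard_normal((n_train, n_voxels))
Y_test = rng.standard_normal((n_test, n_voxels))

# One ridge model predicting all voxels; alpha chosen by cross-validation
# (the paper uses bootstrapped ridge regression instead).
model = RidgeCV(alphas=np.logspace(0, 4, 10))
model.fit(X_train, Y_train)
pred = model.predict(X_test)

# Test-set correlation per voxel; "well-predicted" voxels (e.g. r > 0.3 in
# Table 6) become the modules that SASC explains.
r = np.array([np.corrcoef(pred[:, v], Y_test[:, v])[0, 1] for v in range(n_voxels)])
print(r.round(2))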
Figure 4: Topics found by LDA for explanations of BERT factors and fMRI voxels. Topic proportion is calculated by assigning each explanation to the topic with the largest coefficient. Topic proportions for BERT/fMRI explanations largely overlap, although the topic consisting of physical/social words is much more prevalent in fMRI explanations. The 10 topics (top words): action, movement, relationships, physical, communication, family, idea · specific, type, year, place, person, government, particular · words, time, phrases, location, specific, movement, period · sports, physical, activity, locations, media, objects, action · time, dates, preposition, different, mathematics, past, type · history, culture, concept, numbers, way, language, nouns · people, places, things, names, end, military, letter · technology, related, music, andor, education, activities, science · verb, form, military, specifically, measurements, naval, architecture · word, end, contain, followed, number, phrase, noun.

The selected 1,500 voxels often achieve explanation scores considerably greater than zero for their explanations (mean explanation score 1.27σ_f ± 0.029). Fig. 3 (bottom) shows the mean explanation score for the six most common fMRI regions of interest (ROIs) among the voxels we study here. Across regions, the fMRI voxel modules generally attain explanation scores that are slightly lower than BERT modules in early layers and slightly higher than BERT modules in the final layers. We also find some evidence that the generated fMRI voxel explanations can explain not just the fitted module, but also brain responses to unseen data (see Appendix A.4.2). This suggests that the voxel explanations here can serve as hypotheses for followup experiments to affirm the fine-grained selectivity of specific brain voxels.

Table 6: Examples of recovered explanations for individual fMRI voxel modules. All achieve an fMRI predicted correlation greater than 0.3 and an explanation score of at least 1σ_f. The third column shows 3 of the ngrams used to derive the explanation in the SASC summarization step.

Explanation | ROI | Example top ngrams
looking or staring in some way | IFSFP | eyed her suspiciously, wink at, locks eyes with
relationships and loss | ATFP | girlfriend now ex, lost my husband, was a miscarriage
physical injury or pain | Broca | infections and gangrene, pulled a muscle, burned the skin
counting or measuring time | PMvh | count down and, weeks became months, three more seconds
food preparation | ATFP | sliced cucumber, cut the apples, sauteed shiitake
laughter or amusement | ATFP, AC | started to laugh, funny guy, chuckled and

6 RELATED WORK

Explaining modules in natural language. A few related works study generating natural language explanations. MILAN (Hernandez et al., 2022) uses patch-level information of visual features to generate descriptions of neuron behavior in vision models. iPrompt (Singh et al., 2022b) uses automated prompt engineering, and D5 (Zhong et al., 2023; 2022) / GSClip (Zhu et al., 2022) use LLMs to describe patterns in a dataset (as opposed to describing a module, as we study here). In concurrent work, Bills et al. (2023) propose an algorithm similar to SASC that explains individual neurons in an LLM by predicting token-level neuron activations.
Two closely related works use top-activating ngrams/sentences to construct explanations: (1) Kádár et al. (2017) build an explanation by manually inspecting the top ngrams eliciting the largest module responses from a corpus, using an omission-based approach. (2) Na et al. (2019) similarly extract the top sentences from a corpus but summarize them using a parse tree. Alternatively, Poerner et al. (2018) use a gradient-based method to generate maximally activating text inputs.

Explaining neural-network predictions. Most prior works have focused on the problem of explaining a single prediction with natural language, rather than an entire module, e.g. for text classification (Camburu et al., 2018; Rajani et al., 2019; Narang et al., 2020) or computer vision (Hendricks et al., 2016; Zellers et al., 2019). Besides natural language explanations, some works explain individual predictions via feature importances (e.g. LIME (Ribeiro et al., 2016) / SHAP (Lundberg et al., 2019)), feature-interaction importances (Morris et al., 2023; Singh et al., 2019; Tsang et al., 2017), or extractive rationales (Zaidan & Eisner, 2008; Sha et al., 2021). These are not directly comparable to SASC, as they work at the prediction level and do not produce a natural-language explanation.

Explaining neural-network representations. We build on a long line of recent work that explains neural-network representations, e.g. via probing (Conneau et al., 2018; Liu & Avci, 2019), via visualization (Zeiler & Fergus, 2014; Karpathy et al., 2015), by categorizing neurons (Bau et al., 2017; 2018; 2020; Dalvi et al., 2019; Gurnee et al., 2023), by localizing knowledge in an LLM (Meng et al., 2022; Dai et al., 2021), or by distilling information into a transparent model (Tan et al., 2018; Ha et al., 2021; Singh et al., 2022a).

Natural language representations in fMRI. Using the representations from LLMs to help predict brain responses to natural language has become common among neuroscientists studying language processing in recent years (Jain & Huth, 2018; Wehbe et al., 2014; Schrimpf et al., 2021; Toneva & Wehbe, 2019; Antonello et al., 2021; Goldstein et al., 2022). This paradigm of using "encoding models" (Wu et al., 2006) to better understand how the brain processes language has been applied to help understand the cortical organization of language timescales (Jain et al., 2020; Chen et al., 2023), examine the relationship between visual and semantic information in the brain (Popham et al., 2021), and explore to what extent syntax, semantics, or discourse drives brain activity (Caucheteux et al., 2021; Kauf et al., 2023; Reddy & Wehbe, 2020; Pasquiou et al., 2023; Aw & Toneva, 2022; Kumar et al., 2022; Oota et al., 2022; Tuckute et al., 2023).

7 DISCUSSION

SASC could potentially enable much better mechanistic interpretability for LLMs, allowing for automated analysis of submodules present in LLMs (e.g. attention heads, transformer factors, or experts in an MOE), along with an explanation score that helps inform when an explanation is reliable. Trustworthy explanations could help audit increasingly powerful LLMs for undesired behavior or improve the distillation of smaller task-specific modules. SASC could also be a useful tool in many scientific pipelines. The fMRI analysis performed here generates many explanations which can be directly tested via followup fMRI experiments to understand the fine-grained selectivity of brain regions.
SASC could also be used to generate explanations in a variety of domains, such as the analysis of text models in computational social science or in medicine.

While effective, SASC has many limitations. SASC only explains a module's top responses, but it could be extended to explain the entirety of the module's responses (e.g. by selecting top ngrams differently). Additionally, due to its reliance on ngrams, SASC fails to capture low-level text patterns or patterns requiring long context, e.g. patterns based on position in a sequence. Future explanations could consider adding information beyond ngrams, and could also probe the relationships between different modules to explain circuits of modules rather than modules in isolation.

REFERENCES

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Richard Antonello, Javier S Turek, Vy Vo, and Alexander Huth. Low-dimensional structure in the space of language representations is reflected in brain responses. Advances in Neural Information Processing Systems, 34:8332–8344, 2021.

Khai Loong Aw and Mariya Toneva. Training language models for deeper understanding improves brain alignment, December 2022. URL http://arxiv.org/abs/2212.10898. arXiv:2212.10898 [cs, q-bio].

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549, 2017.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B Tenenbaum, William T Freeman, and Antonio Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020.

Steven Bills, Nick Cammarata, Dan Mossing, William Saunders, Jeff Wu, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, and Jan Leike. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

Jeffrey R. Binder, Rutvik H. Desai, William W. Graves, and Lisa L. Conant. Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cerebral Cortex, 19(12):2767–2796, December 2009. doi: 10.1093/cercor/bhp055.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. e-SNLI: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31, 2018.

Charlotte Caucheteux, Alexandre Gramfort, and Jean-Rémi King. Disentangling syntax and semantics in the brain with deep networks. In Proceedings of the 38th International Conference on Machine Learning, pp. 1336–1348. PMLR, July 2021. URL https://proceedings.mlr.press/v139/caucheteux21a.html.
Charlotte Caucheteux, Alexandre Gramfort, and Jean-Rémi King. Deep language algorithms predict semantic comprehension from brain activity. Scientific Reports, 12(1):16327, 2022.

Catherine Chen, Tom Dupré la Tour, Jack Gallant, Daniel Klein, and Fatma Deniz. The cortical representation of language timescales is shared between reading and listening. bioRxiv, pp. 2023–01, 2023.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070, 2018.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, Anthony Bau, and James Glass. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6309–6317, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

William Fedus, Jeff Dean, and Barret Zoph. A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667, 2022.

Bruce Fischl. FreeSurfer. Neuroimage, 62(2):774–781, 2012.

Iason Gabriel. Artificial intelligence, values, and alignment. Minds and Machines, 30(3):411–437, 2020.

Ariel Goldstein, Zaid Zada, Eliav Buchnik, Mariano Schain, Amy Price, Bobbi Aubrey, Samuel A. Nastase, Amir Feder, Dotan Emanuel, Alon Cohen, Aren Jansen, Harshvardhan Gazula, Gina Choe, Aditi Rao, Catherine Kim, Colton Casto, Lora Fanda, Werner Doyle, Daniel Friedman, Patricia Dugan, Lucia Melloni, Roi Reichart, Sasha Devore, Adeen Flinker, Liat Hasenfratz, Omer Levy, Avinatan Hassidim, Michael Brenner, Yossi Matias, Kenneth A. Norman, Orrin Devinsky, and Uri Hasson. Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3):369–380, March 2022. doi: 10.1038/s41593-022-01026-4. URL https://www.nature.com/articles/s41593-022-01026-4.

Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". arXiv preprint arXiv:1606.08813, 2016.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023.

Wooseok Ha, Chandan Singh, Francois Lanusse, Srigokul Upadhyayula, and Bin Yu. Adaptive wavelet distillation from neural networks through interpretations. Advances in Neural Information Processing Systems, 34, 2021.

Liberty S Hamilton, Yulia Oganian, Jeffery Hall, and Edward F Chang. Parallel and distributed encoding of speech across human auditory cortex. Cell, 184(18):4626–4639, 2021.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=XPZIaotutsD.

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In European Conference on Computer Vision, pp. 3–19. Springer, 2016.
Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2022.

Alexander G Huth, Wendy A De Heer, Thomas L Griffiths, Frédéric E Theunissen, and Jack L Gallant. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600):453–458, 2016.

Shailee Jain and Alexander Huth. Incorporating context into language encoding models for fMRI. Advances in Neural Information Processing Systems, 31, 2018.

Shailee Jain, Vy Vo, Shivangi Mahto, Amanda LeBel, Javier S Turek, and Alexander Huth. Interpretable multi-timescale models for predicting fMRI responses to continuous natural speech. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 13738–13749. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/9e9a30b74c49d07d8150c8c83b1ccf07-Paper.pdf.

Ákos Kádár, Grzegorz Chrupała, and Afra Alishahi. Representation of linguistic form and function in recurrent neural networks. Computational Linguistics, 43(4):761–780, 2017.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

Carina Kauf, Greta Tuckute, Roger Levy, Jacob Andreas, and Evelina Fedorenko. Lexical semantic content, not syntactic structure, is the main contributor to ANN-brain similarity of fMRI responses in the language network. bioRxiv, pp. 2023–05, 2023.

Aaron E. Kornblith, Chandan Singh, Gabriel Devlin, Newton Addo, Christian J. Streck, James F. Holmes, Nathan Kuppermann, Jacqueline Grupp-Phelan, Jeffrey Fineman, Atul J. Butte, and Bin Yu. Predictability and stability testing to assess clinical decision instrument performance for children after blunt torso trauma. PLOS Digital Health, 2022. doi: 10.1371/journal.pdig.0000076. URL https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000076.

Sreejan Kumar, Theodore R. Sumers, Takateru Yamakoshi, Ariel Goldstein, Uri Hasson, Kenneth A. Norman, Thomas L. Griffiths, Robert D. Hawkins, and Samuel A. Nastase. Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model. Technical report, bioRxiv, June 2022. URL https://www.biorxiv.org/content/10.1101/2022.06.08.495348v1.

Amanda LeBel, Lauren Wagner, Shailee Jain, Aneesh Adhikari-Desai, Bhavin Gupta, Allyson Morgenthal, Jerry Tang, Lixiang Xu, and Alexander G Huth. A natural language fMRI dataset for voxelwise encoding models. bioRxiv, pp. 2022–09, 2022.

Frederick Liu and Besim Avci. Incorporating priors with feature attribution on text classification. arXiv preprint arXiv:1906.08286, 2019.

Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. Explainable AI for trees: From local explanations to global understanding. arXiv preprint arXiv:1905.04610, 2019.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual knowledge in GPT. arXiv preprint arXiv:2202.05262, 2022.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.

Tom M. Mitchell, Svetlana V. Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L. Malave, Robert A. Mason, and Marcel Adam Just. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195, May 2008. doi: 10.1126/science.1152876.
John X Morris, Chandan Singh, Alexander M Rush, Jianfeng Gao, and Yuntian Deng. Tree prompting: Efficient task adaptation without fine-tuning. arXiv preprint arXiv:2310.14034, 2023.

Seil Na, Yo Joong Choe, Dong-Hyun Lee, and Gunhee Kim. Discovery of natural language concepts in individual units of CNNs. arXiv preprint arXiv:1902.07249, 2019.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. WT5?! Training text-to-text models to explain their predictions. arXiv preprint arXiv:2004.14546, 2020.

Shinji Nishimoto, Alexander G Huth, Natalia Y Bilenko, and Jack L Gallant. Eye movement-invariant representations in the human visual system. Journal of Vision, 17(1):11–11, 2017.

Subba Reddy Oota, Manish Gupta, and Mariya Toneva. Joint processing of linguistic properties in brains and language models, December 2022. URL http://arxiv.org/abs/2212.08094. arXiv:2212.08094 [cs, q-bio].

Alexandre Pasquiou, Yair Lakretz, Bertrand Thirion, and Christophe Pallier. Information-restricted neural language models reveal different brain regions' sensitivity to semantics, syntax and context, February 2023. URL http://arxiv.org/abs/2302.14389. arXiv:2302.14389 [cs].

Nina Poerner, Benjamin Roth, and Hinrich Schütze. Interpretable textual neuron representations for NLP. arXiv preprint arXiv:1809.07291, 2018.

Sara F Popham, Alexander G Huth, Natalia Y Bilenko, Fatma Deniz, James S Gao, Anwar O Nunez-Elizalde, and Jack L Gallant. Visual and linguistic semantic representations are aligned at the border of human visual cortex. Nature Neuroscience, 24(11):1628–1636, 2021.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. Explain yourself! Leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019.

Aniketh Janardhan Reddy and Leila Wehbe. Can fMRI reveal the representation of syntactic structure in the brain? Preprint, bioRxiv, June 2020. URL http://biorxiv.org/lookup/doi/10.1101/2020.06.16.155499.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016.

Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3687–3697, 2018.

Martin Schrimpf, Idan Asher Blank, Greta Tuckute, Carina Kauf, Eghbal A Hosseini, Nancy Kanwisher, Joshua B Tenenbaum, and Evelina Fedorenko. The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45):e2105646118, 2021.

Lei Sha, Oana-Maria Camburu, and Thomas Lukasiewicz. Learning from the best: Rationalizing predictions by adversarial information calibration. In AAAI, pp. 13771–13779, 2021.

Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.

Chandan Singh, W James Murdoch, and Bin Yu. Hierarchical interpretations for neural network predictions. International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkEqro0ctQ.
Chandan Singh, Armin Askari, Rich Caruana, and Jianfeng Gao. Augmenting interpretable models with LLMs during training. arXiv preprint arXiv:2209.11799, 2022a.

Chandan Singh, John X Morris, Jyoti Aneja, Alexander M Rush, and Jianfeng Gao. Explaining patterns in data with language models via interpretable autoprompting. arXiv preprint arXiv:2210.01848, 2022b.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, 2013.

Hongjin Su, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, Tao Yu, et al. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741, 2022.

Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. Distill-and-compare: Auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 303–310, 2018.

Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G Huth. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, pp. 1–9, 2023.

Mariya Toneva and Leila Wehbe. Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/749a8e6c231831ef7756db230b4359c8-Paper.pdf.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural network weights. arXiv preprint arXiv:1705.04977, 2017.

Doris Y. Tsao, Nicole Schweers, Sebastian Moeller, and Winrich A. Freiwald. Patches of face-selective cortex in the macaque frontal lobe. Nature Neuroscience, 11(8):877–879, August 2008. doi: 10.1038/nn.2158.

Greta Tuckute, Aalok Sathe, Shashank Srikant, Maya Taliaferro, Mingye Wang, Martin Schrimpf, Kendrick Kay, and Evelina Fedorenko. Driving and suppressing the human language network using large language models. bioRxiv, 2023.

Leila Wehbe, Ashish Vaswani, Kevin Knight, and Tom Mitchell. Aligning context-based statistical models of language with brain activity during reading. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 233–243, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1030. URL https://aclanthology.org/D14-1030.

Michael C.-K. Wu, Stephen V. David, and Jack L. Gallant. Complete functional characterization of sensory neurons by system identification. Annual Review of Neuroscience, 29:477–505, 2006. doi: 10.1146/annurev.neuro.29.051605.113024.
Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: Contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021.

Omar Zaidan and Jason Eisner. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 31–40, 2008.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833. Springer, 2014.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6720–6731, 2019.

Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554, 2023.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.

Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015.

Ruiqi Zhong, Kristy Lee, Zheng Zhang, and Dan Klein. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. arXiv preprint arXiv:2104.04670, 2021.

Ruiqi Zhong, Charlie Snell, Dan Klein, and Jacob Steinhardt. Describing differences between text distributions with natural language. In International Conference on Machine Learning, pp. 27099–27116. PMLR, 2022.

Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, and Jacob Steinhardt. Goal driven discovery of distributional differences via language descriptions. arXiv preprint arXiv:2302.14233, 2023.

Zhiying Zhu, Weixin Liang, and James Zou. GSCLIP: A framework for explaining distribution shifts in natural language. arXiv preprint arXiv:2206.15007, 2022.

Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. Can large language models transform computational social science? arXiv preprint arXiv:2305.03514, 2023.

A APPENDIX

A.1 METHODOLOGY DETAILS EXTENDED

Table A1: Statistics on corpora used for explanation. Wikitext is used for BERT explanation and Moth stories are used for fMRI voxel explanation.

Corpus | Unique unigrams | Unique bigrams | Unique trigrams
Wikitext (Merity et al., 2016) | 157k | 3,719k | 9,228k
Moth stories (LeBel et al., 2022) | 117k | 79k | 140k
Combined | 158k | 3,750k | 9,334k

Prompts used in SASC. The summarization step summarizes 30 randomly chosen ngrams from the top 50 and generates 5 candidate explanations using the prompt: "Here is a list of phrases: {phrases} What is a common theme among these phrases? A common theme among these phrases is". In the synthetic scoring step, we generate similar synthetic strings with the prompt "Generate 10 phrases that are similar to the concept of {explanation}:". For dissimilar synthetic strings we use the prompt "Generate 10 phrases that are not similar to the concept of {explanation}:". Minor automatic processing is applied to LLM outputs, e.g. parsing a bulleted list, converting to lowercase, and removing extra whitespace.
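To make the prompting loop concrete, here is a hedged sketch of the two helper-LLM calls above, using the legacy OpenAI completions interface that text-davinci-003 was served through (openai<1.0). The wrapper names and decoding parameters are illustrative assumptions, not the paper's code.

import openai  # legacy interface, openai<1.0; the paper used text-davinci-003 (Feb. 2023)

def complete(prompt, max_tokens=200):
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=max_tokens, temperature=0.7
    )
    return resp["choices"][0]["text"].strip()

def summarize_ngrams(ngrams):
    # Summarization-step call: propose a candidate explanation for the top ngrams.
    prompt = (
        "Here is a list of phrases:\n"
        + "\n".join(ngrams)
        + "\nWhat is a common theme among these phrases?"
        + "\nA common theme among these phrases is"
    )
    return complete(prompt)

def generate_synthetic(explanation, related=True):
    # Scoring-step call: generate Text+ (related=True) or Text- (related=False).
    kind = "similar" if related else "not similar"
    prompt = f"Generate 10 phrases that are {kind} to the concept of {explanation}:"
    return complete(prompt).lower().split("\n")  # minor post-processing, as in A.1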
A.2 SYNTHETIC MODULE INTERPRETATION

Figure A1: The BERT score between the generated explanation and the groundtruth explanation generally increases with the size of the helper LLM used for summarization/generation. Models are accessed via the OpenAI API (text-ada-001, text-babbage-001, text-curie-001, text-davinci-001, all accessed Feb. 2023) and are listed in order of increasing size. The BERT score for each module is computed as the maximum over the 5 generated explanations.

Figure A2: Explanation BERT score for the 54 synthetic datasets as a function of corpus size. Performance plateaus around 100,000 ngrams. The corpus is created by randomly subsampling the unique trigrams in the WikiText dataset (Merity et al., 2016). The gray dotted line shows the result when evaluating on dataset-specific corpora, as in the Default setting in Table 1.

Figure A3: Synthetic modules respond more strongly to phrases related to their own keyphrase (diagonal) than to phrases related to the keyphrases of other datasets (off-diagonal). Each value shows the mean response of the module to 5 phrases, and each row is normalized using softmax. Each module is constructed using Instructor (Su et al., 2022) with the prompt "Represent the short phrase for clustering:" and the groundtruth keyphrase given in Table A2. Related keyphrases are generated manually. (Axis labels: "Examples from this task" vs. "Task used for prompt"; colorbar: "Mean response of the module".)
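Figure A3's caption describes how each synthetic module is built. A minimal sketch, assuming the InstructorEmbedding package and cosine similarity as the module's response (the similarity function and the helper name are our assumptions; the figure caption specifies only the model and the prompt):

```python
import numpy as np
from InstructorEmbedding import INSTRUCTOR  # pip install InstructorEmbedding

model = INSTRUCTOR("hkunlp/instructor-xl")
INSTRUCTION = "Represent the short phrase for clustering:"  # prompt from Figure A3

def make_synthetic_module(keyphrase):
    """Return f(text): similarity of the text to the module's groundtruth keyphrase."""
    key = model.encode([[INSTRUCTION, keyphrase]])[0]
    key /= np.linalg.norm(key)

    def f(text):
        emb = model.encode([[INSTRUCTION, text]])[0]
        return float(emb @ key / np.linalg.norm(emb))

    return f

f_irony = make_synthetic_module("sarcasm")  # module 0-irony in Table A2
```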
Table A2: 54 synthetic modules and information about their underlying data corpus. Note that some modules use the same groundtruth keyphrase (e.g. environmentalism), but that the underlying data corpora contain different data (e.g. text that is pro/anti environmentalism).

Module name | Groundtruth keyphrase | Dataset explanation | Examples | Unique unigrams
0-irony | sarcasm | contains irony | 590 | 3897
1-objective | unbiased | is a more objective description of what happened | 739 | 5628
2-subjective | subjective | contains subjective opinion | 757 | 5769
3-god | religious | believes in god | 164 | 1455
4-atheism | atheistic | is against religion | 172 | 1472
5-evacuate | evacuation | involves a need for people to evacuate | 2670 | 16505
6-terorrism | terrorism | describes a situation that involves terrorism | 2640 | 16608
7-crime | crime | involves crime | 2621 | 16333
8-shelter | shelter | describes a situation where people need shelter | 2620 | 16347
9-food | hunger | is related to food security | 2642 | 16276
10-infrastructure | infrastructure | is related to infrastructure | 2664 | 16548
11-regime change | regime change | describes a regime change | 2670 | 16382
12-medical | health | is related to a medical situation | 2675 | 16223
13-water | water | involves a situation where people need clean water | 2619 | 16135
14-search | rescue | involves a search/rescue situation | 2628 | 16131
15-utility | utility | expresses need for utility, energy or sanitation | 2640 | 16249
16-hillary | Hillary | is against Hillary | 224 | 1693
17-hillary | Hillary | supports hillary | 218 | 1675
18-offensive | derogatory | contains offensive content | 652 | 6109
19-offensive | toxic | insult women or immigrants | 2188 | 11839
20-pro-life | pro-life | is pro-life | 213 | 1633
21-pro-choice | abortion | supports abortion | 209 | 1593
22-physics | physics | is about physics | 10360 | 93810
23-computer science | computers | is related to computer science | 10441 | 93947
24-statistics | statistics | is about statistics | 9286 | 86874
25-math | math | is about math research | 8898 | 85118
26-grammar | ungrammatical | is ungrammatical | 834 | 2217
27-grammar | grammatical | is grammatical | 826 | 2236
28-sexis | sexist | is offensive to women | 209 | 1641
29-sexis | feminism | supports feminism | 215 | 1710
30-news | world | is about world news | 5778 | 13023
31-sports | sports news | is about sports news | 5674 | 12849
32-business | business | is related to business | 5699 | 12913
33-tech | technology | is related to technology | 5727 | 12927
34-bad | negative | contains a bad movie review | 3571 | 6889
35-good | good | thinks the movie is good | 3801 | 7497
36-quantity | quantity | asks for a quantity | 1901 | 5144
37-location | location | asks about a location | 1925 | 5236
38-person | person | asks about a person | 1848 | 5014
39-entity | entity | asks about an entity | 1896 | 5180
40-abbrevation | abbreviation | asks about an abbreviation | 1839 | 5045
41-defin | definition | contains a definition | 651 | 4508
42-environment | environmentalism | is against environmentalist | 124 | 1117
43-environment | environmentalism | is environmentalist | 119 | 1072
44-spam | spam | is a spam | 360 | 2470
45-fact | facts | asks for factual information | 704 | 11449
46-opinion | opinion | asks for an opinion | 719 | 11709
47-math | science | is related to math and science | 7514 | 53973
48-health | health | is related to health | 7485 | 53986
49-computer | computers | related to computer or internet | 7486 | 54256
50-sport | sports | is related to sports | 7505 | 54718
51-entertainment | entertainment | is about entertainment | 7461 | 53573
52-family | relationships | is about family and relationships | 7438 | 54680
53-politic | politics | is related to politics or government | 7410 | 53393

Table A3: 54 synthetic datasets and the regex used to check whether an explanation is correct (after lowercasing). These regexes guide the manual inspection of explanation accuracy: the initial label is assigned by the regex and then fixed by the human when errors (which are relatively rare) occur.
Module name · Dataset explanation · Regex check
0-irony · contains irony · irony|sarcas
1-objective · is a more objective description of what happened · objective|factual|nonpersonal|neutral|unbias
2-subjective · contains subjective opinion · subjective|opinion|personal|bias
3-god · believes in god · god|religious|religion
4-atheism · is against religion · atheism|atheist|anti-religion|against religion
5-evacuate · involves a need for people to evacuate · evacuat|flee|escape
6-terorrism · describes a situation that involves terrorism · terorrism|terror
7-crime · involves crime · crime|criminal|criminality
8-shelter · describes a situation where people need shelter · shelter|home|house
9-food · is related to food security · food|hunger|needs
10-infrastructure · is related to infrastructure · infrastructure|buildings|roads|bridges|build
11-regime change · describes a regime change · regime change|coup|revolution|revolt|political action|political event|upheaval
12-medical · is related to a medical situation · medical|health
13-water · involves a situation where people need clean water · water
14-search · involves a search/rescue situation · search|rescue|help
15-utility · expresses need for utility, energy or sanitation · utility|energy|sanitation|electricity|power
16-hillary · is against Hillary · hillary|clinton|against Hillary|opposed to Hillary|republican|against Clinton|opposed to Clinton
17-hillary · supports hillary · hillary|clinton|support Hillary|support Clinton|democrat
18-offensive · contains offensive content · offensive|toxic|abusive|insulting|insult|abuse|offend|offend|derogatory
19-offensive · insult women or immigrants · offensive|toxic|abusive|insulting|insult|abuse|offend|offend|women|immigrants
20-pro-life · is pro-life · pro-life|abortion|pro life
21-pro-choice · supports abortion · pro-choice|abortion|pro choice
22-physics · is about physics · physics
23-computer science · is related to computer science · computer science|computer|artificial intelligence|ai
24-statistics · is about statistics · statistics|stat|probability
25-math · is about math research · math|arithmetic|algebra|geometry
26-grammar · is ungrammatical · grammar|syntax|punctuation|grammat|linguistic
27-grammar · is grammatical · grammar|syntax|punctuation|grammat|linguistic
28-sexis · is offensive to women · sexis|women|femini
29-sexis · supports feminism · sexis|women|femini
30-news · is about world news · world|cosmopolitan|international|global
31-sports · is about sports news · sports
32-business · is related to business · business|economics|finance
33-tech · is related to technology · tech
34-bad · contains a bad movie review · bad|negative|awful|terrible|horrible|poor|boring|dislike
35-good · thinks the movie is good · good|great|like|love|positive|awesome|amazing|excellent
36-quantity · asks for a quantity · quantity|number|numeric
37-location · asks about a location · location|place
38-person · asks about a person · person|individual|people
39-entity · asks about an entity · entity|thing|object
40-abbrevation · asks about an abbreviation · abbrevation|abbr|acronym
41-defin · contains a definition · defin|meaning|explain
42-environment · is against environmentalist · environment|climate change|global warming
43-environment · is environmentalist · environment|climate change|global warming
44-spam · is a spam · spam|annoying|unwanted
45-fact · asks for factual information · fact|info|knowledge
46-opinion · asks for an opinion · opinion|personal|bias
47-math · is related to math and science · math|science
48-health · is related to health · health|medical|disease
49-computer · related to computer or internet · computer|internet|web
50-sport · is related to sports · sport
51-entertainment · is about entertainment · entertainment|music|movie|tv
52-family · is about family and relationships · family|relationships
53-politic · is related to politics or government · politic|government|law
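A sketch of the first-pass labeling that Table A3 describes, using a hypothetical dictionary holding the regexes above (three entries shown; the full dictionary would have one per module):

```python
import re

# Excerpt of Table A3; keys are module names, values are the correctness regexes.
REGEX_CHECKS = {
    "0-irony": r"irony|sarcas",
    "22-physics": r"physics",
    "34-bad": r"bad|negative|awful|terrible|horrible|poor|boring|dislike",
}

def initial_label(module_name, explanation):
    """Regex-assigned correctness label, later verified (and occasionally fixed) by a human."""
    return re.search(REGEX_CHECKS[module_name], explanation.lower()) is not None

assert initial_label("0-irony", "the use of sarcasm or irony")
```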
Figure A4: Similarity scores for SASC explanations in the Default setting, measured by bge-large (BAAI/bge-large-en; Zhang et al., 2023) rather than by manual inspection or BERT score as in Table 1. Large values on the diagonal indicate that the explanation generated for a module on a given dataset is similar to the groundtruth explanation for that dataset. The top-1 classification accuracy (i.e. how often the generated explanation is most similar to its corresponding groundtruth explanation) is 81.5%, slightly lower than the accuracy assigned by manual inspection (88.3%). The top-2 accuracy is 88.9%.

Figure A5: Average module responses for synthetic texts that are related to the explanation (left, f(Text+)) or the difference between the responses for related and unrelated texts (right, f(Text+) − f(Text−)). Responses correspond to synthetic modules in the Default setting. The bright diagonal on the left suggests that f responds selectively to synthetic texts generated according to the appropriate explanation. On the right, the diagonal is slightly less bright, suggesting that the module does not tend to respond more negatively to unrelated texts Text−.

A.3 BERT INTERPRETATION

Details on fitting transformer factors. Pre-trained transformer factors are taken from Yun et al. (2021). Each transformer factor is the result of running dictionary learning on a matrix $X$ constructed as follows. Using a corpus of sentences $S$ (here, Wikipedia), embeddings are extracted for each input, layer, and sequence index in BERT. The resulting matrix $X$ has size $(\text{num\_layers} \cdot \sum_{s \in S} \mathrm{len}(s)) \times d$, where num_layers = 13 and $d$ = 768 for BERT. Dictionary learning is run on $X$ with 1,500 dictionary components, resulting in a dictionary $D \in \mathbb{R}^{1{,}500 \times d}$. Here, we take the fitted dictionary released by Yun et al. (2021), trained on the WikiText dataset (Merity et al., 2016).

During our interpretation pipeline, we require a module which maps text to a scalar coefficient. To interpret a transformer factor as a module, we specify a text input $t$ and a layer $l$. This results in $\mathrm{len}(t)$ embeddings with dimension $d$. We average over these embeddings and then solve for the dictionary coefficients, yielding a set of coefficients $A \in \mathbb{R}^{1500}$. Finally, specifying a dictionary component index yields a single, scalar coefficient.
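A minimal sketch of this text-to-coefficient mapping, assuming the fitted dictionary from Yun et al. (2021) is available as a NumPy array; the sparse-coding solver and sparsity level below are our stand-ins, since the paper does not specify them here:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import SparseCoder

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def factor_module(dictionary, layer, factor_index, alpha=1.0):
    """Map text -> scalar coefficient of one dictionary component (transformer factor).

    `dictionary` is the released D in R^{1500 x 768}; `alpha` is an assumed
    sparsity level, not a value taken from the paper.
    """
    coder = SparseCoder(dictionary=dictionary,
                        transform_algorithm="lasso_lars",
                        transform_alpha=alpha,
                        positive_code=True)

    def f(text):
        with torch.no_grad():
            out = bert(**tok(text, return_tensors="pt"))
        # Average the len(t) token embeddings at the chosen layer (a d=768 vector)...
        emb = out.hidden_states[layer][0].mean(dim=0).numpy()[None, :]
        # ...then solve for the 1,500 dictionary coefficients A and keep one entry.
        return float(coder.transform(emb)[0, factor_index])

    return f
```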
Extended BERT explanation results. Table A5 compares SASC explanations with human-labeled explanations for all BERT transformer factors labeled in Yun et al. (2021). Tables A6 to A8 show explanations for modules selected by linear models finetuned on text-classification tasks.

Table A4: Fraction of top logistic regression coefficients that are relevant for a downstream task (extends Table 5). Averaged over 3 random seeds; parentheses show standard error of the mean.

 | Emotion | AG News | SST2
Top-10 | 0.50 ± 0.08 | 1.00 ± 0.00 | 0.80 ± 0.14
Top-15 | 0.47 ± 0.05 | 0.98 ± 0.03 | 0.69 ± 0.13
Top-20 | 0.42 ± 0.09 | 0.98 ± 0.02 | 0.55 ± 0.10

Table A5: Comparing SASC explanations to all human-labeled explanations for BERT transformer factors. Explanation scores are in units of σ_f.

Layer | Factor index | Explanation (Human) | Explanation (SASC) | Score (Human) | Score (SASC)
4 | 13 | Numerical values. | numbers | -0.21 | -0.08
10 | 42 | Something unfortunate happened. | idea of wrongdoing or illegal activity | 2.43 | 1.97
0 | 30 | left. Adjective or Verb. Mixed senses. | someone or something leaving | 3.68 | 5.87
4 | 47 | plants. Noun. vegetation. | trees | 6.26 | 5.04
10 | 152 | In some locations. | science, technology, and/or medicine | -0.41 | 0.03
4 | 30 | left. Verb. leaving, exiting. | leaving or being left | 4.44 | 0.90
10 | 297 | Repetitive structure detector. | versions or translations | -0.36 | 0.98
10 | 322 | Biography, someone born in some year... | weapons and warfare | 0.19 | 0.38
10 | 13 | Unit exchange with parentheses. | names of places, people, or things | -0.11 | -0.10
10 | 386 | War. | media, such as television, movies, or video games | 0.20 | -0.15
10 | 184 | Institution with abbreviation. | publishing, media, or awards | -0.42 | 0.14
2 | 30 | left. Verb. leaving, exiting. | leaving or being left | 5.30 | 0.91
10 | 179 | Topic: music production. | geography | -0.52 | 0.21
6 | 225 | Places in US, following the convention "city, state". | a place or location | 1.88 | 1.33
10 | 25 | Attributive Clauses. | something related to people, places, or things | 0.01 | 1.19
10 | 125 | Describing someone in a paraphrasing style. Name, Career. | something related to buildings, architecture, or construction | -0.13 | 0.44
6 | 13 | Close Parentheses. | end with a closing punctuation mark (e.g | -0.08 | 0.47
10 | 99 | Past tense. | people, places, or things | -0.77 | -0.04
10 | 24 | Male name. | people, places, and things related to history | 0.03 | 0.38
10 | 102 | African names. | traditional culture, with references to traditional territories, communities, forms, themes, breakfast, and texts | 0.35 | 1.60
4 | 16 | park. Noun. a common first and last name. | names of parks | -0.03 | 1.87
10 | 134 | Transition sentence. | a comma | 1.16 | 0.38
6 | 86 | Consecutive years, used in football season naming. | specific dates or months | 0.85 | 0.76
4 | 2 | mind. Noun. the element of a person that enables them to be aware of the world and their experiences. | concept of thinking, remembering, and having memories | 0.77 | 11.19
10 | 51 | Apostrophe s, possessive. | something specific, such as a ticket, tenure, film, song, movement, project, game, school, title, park, congressman, author, or art exhibition | 0.37 | -0.01
8 | 125 | Describing someone in a paraphrasing style. Name, Career. | publications, reviews, or people associated with the media industry | -0.34 | 0.42
4 | 33 | light. Noun. the natural agent that stimulates sight and makes things visible. | light | 6.25 | 3.43
10 | 50 | Doing something again, or making something new again. | introduction of something new | 0.84 | -0.27
10 | 86 | Consecutive years, a convention for naming football/rugby game seasons. | a specific date or time of year | 1.35 | -0.75
4 | 193 | Time span in years. | many of them are related to dates and historic places | 0.07 | 1.39
10 | 195 | Consecutive nouns (enumerating). | different aspects of culture, such as art, music, literature, history, and technology | -0.83 | 9.83

Table A6: SASC explanations for modules selected by a 25-coefficient linear model on SST2 for a single seed. Green shows explanations deemed to be relevant to the task.
Layer, Factor index | Explanation | Linear coefficient
(0, 783) | something being incorrect or wrong | -862.82
(0, 1064) | negative emotions and actions, such as hatred, violence, and disgust | -684.27
(1, 783) | something being incorrect, inaccurate, or wrong | -577.49
(1, 1064) | hatred and violence | -499.30
(0, 157) | air and sequencing | 463.80
(9, 319) | a negative statement, usually in the form of not or nor | -446.58
(0, 481) | harm, injury, or damage | -441.98
(8, 319) | lack of something or the absence of something | -441.04
(10, 667) | two or more words | 424.48
(2, 783) | something that is incorrect or inaccurate | -415.56
(0, 658) | thrice | -411.26
(0, 319) | none or its variations (no, not, never) | -388.14
(0, 1402) | dates | -377.74
(0, 1049) | standard | -365.83
(3, 1064) | negative emotions or feelings, such as hatred, anger, disgust, and brutality | -360.47
(4, 1064) | negative emotions or feelings, such as hatred, anger, and disgust | -357.35
(5, 152) | geography, history, and culture | -356.10
(0, 928) | homelessness and poverty | -355.05
(2, 691) | animals and plants, as many of the phrases refer to species of animals and plants | -351.62
(0, 810) | catching or catching something | 350.98
(0, 1120) | production | -350.01
(0, 227) | a period of time | -345.72
(2, 583) | government, law, or politics in some way | -335.40
(2, 1064) | negative emotions such as hatred, disgust, and violence | -334.87
(4, 125) | science or mathematics, such as physics, astronomy, and geometry | -328.55

Table A7: SASC explanations for modules selected by a 25-coefficient linear model on AG News for a single seed. Green shows explanations deemed to be relevant to the task.

Layer, Factor index | Explanation | Linear coefficient
(5, 378) | professional sports teams | 545.57
(4, 378) | professional sports teams in the united states | 542.25
(3, 378) | professional sports teams | 515.37
(0, 378) | names of sports teams | 508.73
(6, 378) | sports teams | 499.62
(2, 378) | professional sports teams | 499.57
(1, 378) | professional sports teams | 492.01
(7, 378) | sports teams | 468.66
(8, 378) | sports teams or sports in some way | 468.39
(11, 32) | activity or process | 461.46
(12, 1407) | such | 450.70
(5, 730) | england and english sports teams | 427.33
(12, 104) | people, places, and events from history | 425.49
(10, 378) | locations | 424.71
(6, 730) | sports, particularly soccer | 424.24
(12, 730) | sports | 415.21
(4, 396) | people, places, or things related to japan | -415.13
(10, 659) | sports | 410.89
(4, 188) | history in some way | 404.24
(12, 1465) | different aspects of life, such as activities, people, places, and objects | 403.77
(0, 310) | end with the word until | -400.10
(5, 151) | a particular season, either of a year, a sport, or a television show | 396.41
(12, 573) | many of them contain unknown words or names, indicated by <unk> | -393.27
(12, 372) | specific things, such as places, organizations, or activities | -392.57
(6, 188) | geography | 388.69

Table A8: SASC explanations for modules selected by a 25-coefficient linear model on Emotion for a single seed. Green shows explanations deemed to be relevant to the task.
Layer, Factor indexExplanationLinear coefficient (0, 1418)types of road interchanges581.97 (0, 920) fame577.20 (6, 481)injury or impairment566.44 (5, 481) injury or impairment556.58 (0, 693)end in oss or osses556.53 (12, 1137)ownership or possession-537.45 (0, 663)civil524.88 (6, 1064) negative emotions such as hatred, disgust, disdain, rage, and horror523.41 (3, 872)location of a campus or facility-518.85 (5, 1064)negative emotions and feelings, such as hatred, disgust, disdain, and viciousness489.25 (0, 144)lectures482.85 (0, 876) host479.18 (0, 69)history-467.80 (0, 600)many of them contain the word seymour or a variation of it464.64 (0, 813) or phrases related to either measurement (e.g-455.11 (1, 89) caution and being careful451.73 (11, 229)russia and russian culture-450.28 (0, 783)something being incorrect or wrong448.55 (12, 195)dates442.14 (12, 1445) breaking or being broken439.81 (0, 415)ashore-438.22 (0, 118)end with a quotation mark437.66 (1, 650)mathematical symbols such as>, =, and )-437.28 (4, 388) end with the sound ch-437.15 (0, 840)withdrawing-436.38 23 A.4FMRIMODULE INTERPRETATION A.4.1FMRIDATA AND MODEL FITTING This section gives more details on the fMRI experiment analyzed in Sec. 5. These MRI data are available publicly (LeBel et al., 2022; Tang et al., 2023), but the methods are summarized here. Functional magnetic resonance imaging (fMRI) data were collected from 3 human subjects as they listened to English language podcast stories over Sensimetrics S14 headphones. Subjects were not asked to make any responses, but simply to listen attentively to the stories. For encoding model train- ing, each subject listened to at approximately 20 hours of unique stories across 20 scanning sessions, yielding a total of∼33,000 datapoints for each voxel across the whole brain. For model testing, the subjects listened to two test story 5 times each, and one test story 10 times, at a rate of 1 test story per session. These test responses were averaged across repetitions. Functional signal-to-noise ratios in each voxel were computed using the mean-explainable variance method from (Nishimoto et al., 2017) on the repeated test data. Only voxels within 8 m of the mid-cortical surface were analyzed, yielding roughly 90,000 voxels per subject. MRI data were collected on a 3T Siemens Skyra scanner at University of Texas at Austin using a 64- channel Siemens volume coil. Functional scans were collected using a gradient echo EPI sequence with repetition time (TR) = 2.00 s, echo time (TE) = 30.8 ms, flip angle = 71°, multi-band factor (simultaneous multi-slice) = 2, voxel size = 2.6m x 2.6m x 2.6m (slice thickness = 2.6m), matrix size = 84x84, and field of view = 220 m. Anatomical data were collected using a T1- weighted multi-echo MP-RAGE sequence with voxel size = 1m x 1m x 1m following the Freesurfer morphometry protocol (Fischl, 2012). All subjects were healthy and had normal hearing. The experimental protocol was approved by the Institutional Review Board at the University of Texas at Austin. Written informed consent was obtained from all subjects. All functional data were motion corrected using the FMRIB Linear Image Registration Tool (FLIRT) from FSL 5.0. FLIRT was used to align all data to a template that was made from the average across the first functional run in the first story session for each subject. These automatic alignments were manually checked for accuracy. 
A.4 FMRI MODULE INTERPRETATION

A.4.1 FMRI DATA AND MODEL FITTING

This section gives more details on the fMRI experiment analyzed in Sec. 5. These MRI data are available publicly (LeBel et al., 2022; Tang et al., 2023), but the methods are summarized here. Functional magnetic resonance imaging (fMRI) data were collected from 3 human subjects as they listened to English-language podcast stories over Sensimetrics S14 headphones. Subjects were not asked to make any responses, but simply to listen attentively to the stories. For encoding model training, each subject listened to approximately 20 hours of unique stories across 20 scanning sessions, yielding a total of ~33,000 datapoints for each voxel across the whole brain. For model testing, the subjects listened to two test stories 5 times each, and one test story 10 times, at a rate of 1 test story per session. These test responses were averaged across repetitions. Functional signal-to-noise ratios in each voxel were computed using the mean-explainable variance method from Nishimoto et al. (2017) on the repeated test data. Only voxels within 8 mm of the mid-cortical surface were analyzed, yielding roughly 90,000 voxels per subject.

MRI data were collected on a 3T Siemens Skyra scanner at the University of Texas at Austin using a 64-channel Siemens volume coil. Functional scans were collected using a gradient echo EPI sequence with repetition time (TR) = 2.00 s, echo time (TE) = 30.8 ms, flip angle = 71°, multi-band factor (simultaneous multi-slice) = 2, voxel size = 2.6 mm x 2.6 mm x 2.6 mm (slice thickness = 2.6 mm), matrix size = 84 x 84, and field of view = 220 mm. Anatomical data were collected using a T1-weighted multi-echo MP-RAGE sequence with voxel size = 1 mm x 1 mm x 1 mm following the Freesurfer morphometry protocol (Fischl, 2012). All subjects were healthy and had normal hearing. The experimental protocol was approved by the Institutional Review Board at the University of Texas at Austin. Written informed consent was obtained from all subjects.

All functional data were motion corrected using the FMRIB Linear Image Registration Tool (FLIRT) from FSL 5.0. FLIRT was used to align all data to a template made from the average across the first functional run in the first story session for each subject. These automatic alignments were manually checked for accuracy.

Low-frequency voxel response drift was identified using a 2nd-order Savitzky-Golay filter with a 120-second window and then subtracted from the signal. To avoid onset artifacts and poor detrending performance near each end of the scan, responses were trimmed by removing 20 seconds (10 volumes) at the beginning and end of each scan, which removed the 10-second silent period and the first and last 10 seconds of each story. The mean response for each voxel was subtracted and the remaining response was scaled to have unit variance.

We used the fMRI data to generate a voxelwise brain encoding model for natural language using the intermediate hidden states from the 18th layer of the 30-billion-parameter LLaMA model (Touvron et al., 2023a) and the 9th layer of GPT (Radford et al., 2019). To temporally align word times with TR times, Lanczos interpolation was applied with a window size of 3. The hemodynamic response function was approximated with a finite impulse response model using 4 delays at -8, -6, -4, and -2 seconds (Huth et al., 2016). For each subject $x$ and voxel $v$, we fit a separate encoding model $g^{(x,v)}$ to predict the BOLD response $\hat{B}$ from our embedded stimulus, i.e. $\hat{B}^{(x,v)} = g^{(x,v)}(H_i(S))$. To evaluate the voxelwise encoding models, we used the learned $g^{(x,v)}$ to generate and evaluate predictions on a held-out test set. The GPT features achieved a mean correlation of 0.12 and the LLaMA features achieved a mean correlation of 0.17. These performances are comparable with state-of-the-art published models on the same dataset that are able to achieve decoding (Tang et al., 2023).

To select voxels with diverse encoding, we applied principal components analysis to the learned weights $g^{(x,v)}$ for GPT across all significantly predicted voxels in cortex. Prior work has shown that the first four principal components of language encoding model weights encode differences in semantic selectivity, differentiating between, e.g., social, temporal, and visual concepts. Consequently, to apply SASC to voxels with the most diverse selectivity, we found voxels that lie along the convex hull of the first four principal components and randomly sampled 1,500 of them (500 per subject). The mean voxel correlation for the 1,500 voxels we study is 0.35. Note that these voxels were selected for being well-predicted rather than easy to explain: the correlation between the prediction error and the explanation score for these voxels is 0.01, very close to zero.
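A minimal sketch of the finite impulse response construction and a plain ridge fit standing in for the encoding model $g^{(x,v)}$ described above; the Lanczos interpolation step is assumed to have already produced per-TR features, and the fixed `alpha` is our simplification of the cross-validated regularization a full pipeline would use:

```python
import numpy as np
from sklearn.linear_model import Ridge

def make_delayed(features, delays=(1, 2, 3, 4)):
    """FIR model: stack copies of the stimulus features shifted by each delay (in TRs).

    With TR = 2 s, delays of 1-4 TRs correspond to the -8..-2 s window above.
    """
    n, d = features.shape
    out = np.zeros((n, d * len(delays)))
    for j, delay in enumerate(delays):
        out[delay:, j * d:(j + 1) * d] = features[: n - delay]
    return out

def fit_encoding_models(features, bold, alpha=1.0):
    """features: (n_TRs, d) LLM hidden states per TR; bold: (n_TRs, n_voxels).

    Returns one linear map per voxel (each output column is one g^{(x,v)}).
    """
    X = make_delayed(features)
    return Ridge(alpha=alpha).fit(X, bold)
```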
A.4.2 EVALUATING TOP FMRI VOXEL EXPLANATIONS

Table A9 shows two evaluations of the fMRI voxel explanations. First, similar to Fig. 3, we find that the mean explanation score remains significantly above zero. Second, we evaluate beyond whether the explanation describes the fitted module and ask whether the explanation describes the underlying fMRI voxel. Specifically, we predict the fMRI voxel response to text using only the voxel's explanation, via a very simple procedure. We first compute the (scalar) negative embedding distance between the explanation text and the input text using Instructor (Su et al., 2022).[5] We then calculate the Spearman rank correlation between this scalar distance and the recorded voxel response (see Table A9). The mean computed correlation is low,[6] which is to be expected, as the explanation is a concise string and may match extremely few ngrams in the text of the test data (which consists of only 3 narrative stories). Nevertheless, the correlation is significantly above zero (more than 15 times the standard error of the mean), suggesting that these explanations have some grounding in the underlying brain voxels.

[5] The input text for an fMRI response at time t (in seconds) is taken to be the words presented between t−8 and t−2.
[6] For reference, test correlations published for fMRI voxel prediction from language are often in the range 0.01–0.1 (Caucheteux et al., 2022).

Table A9: Evaluation of fMRI voxel explanations. For all metrics, SASC is successful if the value is significantly greater than 0. Errors show standard error of the mean.

Explanation score | Test rank correlation
1.27 σ_f ± 0.029 | 0.033 ± 0.002

A.4.3 FMRI RESULTS WHEN USING WIKITEXT CORPUS

Figure A6: Results in Fig. 3 when using WikiText as the underlying corpus for ngrams rather than narrative stories. (The figure shows explanation scores, in units of σ_f, for each BERT layer 0–12 and for the fMRI regions PMvh, ATFP, AC, PO, IFSFP, and Broca.)

Figure A7: Results in Fig. 4 when using WikiText as the underlying corpus for ngrams rather than narrative stories. (The figure compares topic proportions for BERT and fMRI explanations across ten topics, e.g. "concept, music, communication, time, ..." and "sports, technology, transportation, mathematics, ...".)
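As a concrete rendering of the voxel-explanation evaluation in Sec. A.4.2 above: score each test-time input text by its embedding similarity to the explanation (the negative of a distance) and rank-correlate with the recorded response. A minimal sketch; the Instructor instruction string is our assumption, and `input_texts` are the word windows defined in footnote 5:

```python
import numpy as np
from scipy.stats import spearmanr
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")

def explanation_rank_correlation(explanation, input_texts, voxel_responses):
    """Predict a voxel from its explanation alone, then rank-correlate (Table A9)."""
    instruction = "Represent the phrase:"  # assumed; the paper does not quote it here
    embs = model.encode([[instruction, t] for t in [explanation] + list(input_texts)])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    # Cosine similarity to the explanation embedding (higher = smaller distance).
    sims = embs[1:] @ embs[0]
    return spearmanr(sims, voxel_responses).correlation
```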