Paper deep dive
MAIA: A Multimodal Automated Interpretability Agent
Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba
Models: CLIP, DINO, GPT-4V, ResNet-152
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 7:55:54 PM
Summary
MAIA is a Multimodal Automated Interpretability Agent that leverages a vision-language model backbone and a modular API to automate interpretability tasks for neural networks. By composing tools for input synthesis, editing, and experimental analysis, MAIA can perform iterative hypothesis testing, neuron-level feature description, and model-level auditing, achieving performance comparable to human experts.
Entities (6)
Relation Signals (4)
MAIA → uses → GPT-4V
confidence 100% · MAIA is implemented with a GPT-4V vision-language model (VLM) backbone
MAIA → interprets → ResNet-152
confidence 95% · We find MAIA correctly predicts behaviors of individual vision neurons in three trained architectures... ResNet-152
MAIA → usestool → Stable Diffusion v1.5
confidence 95% · MAIA is equipped with a text2image tool that synthesizes images by calling Stable Diffusion v1.5
MAIA → outperforms → MILAN
confidence 90% · MAIA outperforms MILAN across all models and is often on par with expert predictions.
Cypher Suggestions (2)
Find all models used by MAIA for interpretability tasks · confidence 90% · unvalidated
MATCH (a:System {name: 'MAIA'})-[:USES_TOOL|USES_MODEL]->(m:Model) RETURN m.name
Compare performance of MAIA against baselines · confidence 85% · unvalidated
MATCH (a:System {name: 'MAIA'})-[r:OUTPERFORMS]->(b:BaselineMethod) RETURN a.name, r.relation, b.name
Abstract
This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be mis-classified.
Tags
Links
Full Text
105,512 characters extracted from source content.
A Multimodal Automated Interpretability Agent Tamar Rott Shaham 1 * Sarah Schwettmann 1 * Franklin Wang 1 Achyuta Rajaram 1 Evan Hernandez 1 Jacob Andreas 1 Antonio Torralba 1 Abstract This paper describesMAIA, aMultimodal AutomatedInterpretabilityAgent.MAIAis a sys- tem that uses neural models to automate neural model understanding tasks like feature interpre- tation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesiz- ing and editing inputs, computing maximally ac- tivating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experimentsproposed byMAIA compose these tools to describe and explain sys- tem behavior. We evaluate applications ofMAIA to computer vision models. We first characterize MAIA’s ability to describe (neuron-level) features in learned representations of images. Across sev- eral trained models and a novel dataset of syn- thetic vision neurons with paired ground-truth descriptions,MAIAproduces descriptions com- parable to those generated by expert human ex- perimenters. We then show thatMAIAcan aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be mis-classified. † 1. Introduction Understanding of a neural model can take many forms. Given an image classifier, for example, we may wish to recognize when and how it relies on sensitive features like race or gender, identify systematic errors in its predictions, or learn how to modify the training data and model architec- ture to improve accuracy and robustness. Today, this kind of understanding requires significant effort on the part of researchers—involving exploratory data analysis, formula- tion of hypotheses, and controlled experimentation (Nushi et al., 2018; Zhang et al., 2018). As a consequence, this † Website:https://multimodal-interpretability. csail.mit.edu/maia Figure 1.MAIAframework.MAIAautonomously conducts exper- iments on other systems to explain their behavior. kind of understanding is slow and expensive to obtain even about the most widely used models. Recent work onautomated interpretability(e.g.Hernan- dez et al., 2022; Bills et al., 2023; Schwettmann et al., 2023) has begun to address some of these limitations by using learned models themselvesto assist with model understand- ing tasks—for example, by assigning natural language de- scriptions to learned representations, which may then be used to surface features of concern. But current methods are useful almost exclusively as tools for hypothesis gen- eration; they characterize model behavior on a limited set of inputs, and are often low-precision (Huang et al., 2023). How can we build tools that help users understand models, while combining the flexibility of human experimentation with the scalability of automated techniques? This paper introduces a prototype system we call the MultimodalAutomatedInterpretabilityAgent (MAIA), which combines a pretrained vision-language model back- 1 arXiv:2404.14394v2 [cs.AI] 11 Feb 2025 A Multimodal Automated Interpretability Agent Figure 2.MAIAexperiments for labeling neurons.MAIAiteratively writes programs that compose common interpretability tools to conduct experiments on other systems. 
At each step,MAIAautonomously makes and updates hypotheses in light of experimental outcomes, showing sophisticated scientific reasoning capabilities. Generated code is executed with a Python interpreter and the outputs (shown above, neuron activation values overlaid in white, masks thresholded at 0.95 percentile of activation maps) are returned toMAIA. 2 A Multimodal Automated Interpretability Agent bone with an API containing tools designed for conducting experiments on deep networks.MAIAis prompted with an explanation task (e.g.“describe the behavior of unit 487 in layer 4 of CLIP” or “in which contexts does the model fail to classifylabradors?”) and designs aninterpretability ex- perimentthat composes experimental modules to answer the query.MAIA’s modular design (Figure 1) enables flexible evaluation of arbitrary systems and straightforward incorpo- ration of new experimental tools. Section 3 describes the current tools inMAIA’s API, including modules for synthe- sizing and editing novel test images, which enable direct hypothesis testing during the interpretation process. We evaluateMAIA’s ability to produce predictive explana- tions of vision system components using the neuron descrip- tion paradigm (Bau et al., 2017; 2020; Oikarinen & Weng, 2022; Bills et al., 2023; Singh et al., 2023; Schwettmann et al., 2023) which appears as a subroutine of many inter- pretability procedures. We additionally introduce a novel dataset ofsynthetic vision neuronsbuilt from an open-set concept detector with ground-truth selectivity specified via text guidance. In Section 4, we show thatMAIAdesriptions of both synthetic neurons and neurons in the wild are more predictive of behavior than baseline description methods, and in many cases on par with human labels. MAIAalso automates model-level interpretation tasks where descriptions of learned representations produce actionable insights about model behavior. We show in a series of ex- periments thatMAIA’s iterative experimental approach can be applied to downstream model auditing and editing tasks including spurious feature removal and bias identification in a trained classifier. Both applications demonstrate the adapt- ability of theMAIAframework across experimental settings: novel end-use cases are described in the user prompt to the agent, which can then use its API to compose programs that conduct task-specific experiments. While these applications show preliminary evidence that procedures likeMAIAwhich automate both experimentation and description have high potential utility to the interpretability workflow, we find that MAIAstill requires human steering to avoid common pitfalls including confirmation bias and drawing conclusions from small sample sizes. Fully automating end-to-end interpreta- tion of other systems will not only require more advanced tools, but agents with more advanced capabilities to reason about how to use them. 2. Related work Interpreting deep features.Investigating individual neu- rons inside deep networks reveals a range of human- interpretable features. Approaches to describing these neu- rons use exemplars of their behavior as explanation, either by visualizing features they select for (Zeiler & Fergus, 2014; Girshick et al., 2014; Karpathy et al., 2015; Mahen- dran & Vedaldi, 2015; Olah et al., 2017) or automatically categorizing maximally-activating inputs from real-world datasets (Bau et al., 2017; 2020; Oikarinen & Weng, 2022; Dalvi et al., 2019). 
Early approaches to translating visual exemplars into language descriptions drew labels from fixed vocabularies (Bau et al., 2017), or produced descriptions in the form of programs (Mu & Andreas, 2021). Automated interpretability.Later work on automated in- terpretability produced open-ended descriptions of learned features in the form of natural language text, either curated from human labelers (Schwettmann et al., 2021) or gener- ated directly by learned models (Hernandez et al., 2022; Bills et al., 2023; Gandelsman et al., 2024). However, these labels are often unreliable ascausaldescriptions of model behavior without further experimentation (Huang et al., 2023). Schwettmann et al. (2023) introduced the Au- tomated Interpretability Agent protocol for experimentation on black-box systems using a language model agent, though this agent operated purely on language-based exploration of inputs, which limited its action space.MAIAsimilarly per- forms iterative experimentation rather than labeling features in a single pass, but has access to a library of interpretability tools as well as built-in vision capabilities.MAIA’s modu- lar design also supports experiments at different levels of granularity, ranging from analysis of individual features to sweeps over entire networks, or identification of more complex network subcomponents (Conmy et al., 2023). Language model agents.Modern language models are promising foundation models for interpreting other net- works due to their strong reasoning capabilities (OpenAI, 2023a). These capabilities can be expanded by using the LM as anagent, where it is prompted with a high-level goal and has the ability to call external tools such as calcula- tors, search engines, or other models in order to achieve it (Schick et al., 2023; Qin et al., 2023). When additionally prompted to perform chain-of-thought style reasoning be- tween actions, agentic LMs excel at multi-step reasoning tasks in complex environments (Yao et al., 2023).MAIA leverages an agent architecture to generate and test hypothe- ses about neural networks trained on vision tasks. While ordinary LM agents are generally restricted to tools with textual interfaces, previous work has supported interfacing with the images through code generation (Sur ́ ıs et al., 2023; Wu et al., 2023). More recently, large multimodal LMs like GPT-4V have enabled the use of image-based tools directly (Zheng et al., 2024; Chen et al., 2023).MAIAfollows this design and is, to our knowledge, the first multimodal agent equipped with tools for interpreting deep networks. 3 A Multimodal Automated Interpretability Agent 3.MAIAFramework MAIAis an agent that autonomously conducts experiments on other systems to explain their behavior, by composing in- terpretability subroutines into Python programs. Motivated by the promise of using language-only models to complete one-shot visual reasoning tasks by calling external tools (Sur ́ ıs et al., 2023; Gupta & Kembhavi, 2023), and the need to performiterativeexperiments with both visual and nu- meric results, we buildMAIAfrom a pretrained multimodal model with the ability to process images directly.MAIAis implemented with a GPT-4V vision-language model (VLM) backbone (OpenAI, 2023b) . Given an interpretability query (e.g. 
Which neurons in Layer 4 are selective for forested backgrounds?),MAIAruns experiments that test specific hypotheses (e.g.computing neuron outputs on images with edited backgrounds), observes experimental outcomes, and updates hypotheses until it can answer the user query. We enable the VLM to design and run interpretability exper- iments using theMAIAAPI, which defines two classes: the Systemclass and theToolsclass, described below. The API is provided to the VLM in its system prompt. We in- clude a complete API specification in Appendix A. The full input to the VLM is the API specification followed by a “user prompt” describing a particular interpretability task, such as explaining the behavior of an individual neuron in- side a vision model with natural language (see Section 4). To complete the task,MAIAuses components of its API to write a series of Python programs that run experiments on the system it is interpreting.MAIAoutputs function defini- tions as strings, which we then execute internally using the Python interpreter. The Pythonic implementation enables flexible incorporation of built-in functions and existing pack- ages,e.g.theMAIAAPI uses the PyTorch library (Paszke et al., 2019) to load common pretrained vision models. 3.1. System API TheSystemclass inside theMAIAAPI instruments the system to be interpreted and makes subcomponents of that system individually callable. For example, to probe sin- gle neurons inside ResNet-152 (He et al., 2016),MAIA can use theSystemclass to initialize a neuron object by specifying its number and layer location, and the model that the neuron belongs to:system = System(unit_id, layer_id, model_name) .MAIAcan then design exper- iments that test the neuron’s activation value on different image inputs by runningsystem.neuron(image_list), to return activation values and masked versions of the im- ages in the list that highlight maximally activating regions (See Figure 2 for examples). While existing approaches to common interpretability tasks such as neuron labeling require training specialized models on task-specific datasets (Hernandez et al., 2022), theMAIAsystem class supports querying arbitrary vision systems without retraining. 3.2. Tool API TheToolsclass consists of a suite of functions enabling MAIAto write modular programs that test hypotheses about system behavior.MAIAtools are built from com- mon interpretability procedures such as characterizing neu- ron behavior using real-world images (Bau et al., 2017) and performing causal interventions on image inputs (Her- nandez et al., 2022; Casper et al., 2022), whichMAIA then composes into more complex experiments (see Fig- ure 2). When programs written byMAIAare compiled internally as Python code, these functions can leverage calls to other pretrained models to compute outputs. For ex- ample,tools.text2image(prompt_list)returns syn- thetic images generated by a text-guided diffusion model, using prompts written byMAIAto test a neuron’s response to specific visual concepts. The modular design of the tool library enables straightforward incorporation of new tools as interpretability methods grow in sophistication. For the experiments in this paper we use the following set: Dataset exemplar generation.Previous studies have shown that it is possible to characterize the prototypical behavior of a neuron by recording its activation values over a large dataset of images (Bau et al., 2017; 2020). 
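To make the composition concrete, the following is a minimal sketch of the kind of program MAIA writes against the System and Tools API described above (Section 3 and Appendix A); the unit under study, the prompts, and the edits are illustrative assumptions, not taken from the paper.

def run_experiment(system, tools):
    # 1. Ground the investigation in real data: top-activating ImageNet exemplars.
    exemplar_acts, exemplar_images = tools.dataset_exemplars(system)

    # 2. Test a hypothesis with synthetic images spanning several contexts.
    prompts = ["a dog wearing a bowtie",
               "a child wearing a bowtie",
               "a plain portrait of a dog"]
    images = tools.text2image(prompts)
    activations, activation_maps = system.neuron(images)

    # 3. Causal check: change one aspect of an image and re-measure.
    all_images, all_prompts = tools.edit_images(
        ["a dog wearing a bowtie"],
        ["replace the bowtie with a plain collar"])
    edit_activations, edit_maps = system.neuron(all_images)

    # 4. Log results that support or refute the hypothesis for later steps.
    tools.log_experiment(activations + edit_activations,
                         images + all_images,
                         prompts + all_prompts,
                         "testing selectivity for bowties across contexts")
    return activations, edit_activations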
We give MAIAthe ability to run such an experiment on the validation set of ImageNet (Deng et al., 2009) and construct the set of 15 images that maximally activate the system it is interpret- ing. Interestingly,MAIAoften chooses to begin experiments by calling this tool (Figure 2). We analyze the importance of thedataset_exemplarstool in our ablation study (4.3). Image generation and editing tools.MAIAis equipped with atext2image(prompts)tool that synthesizes im- ages by calling Stable Diffusion v1.5 (Rombach et al., 2022a) on text prompts. Generating inputs enablesMAIA to test system sensitivity to fine-grained differences in vi- sual concepts, or test selectivity for the same visual con- cept across contexts (e.g.thebowtieon a pet and on a child in Figure 2). We analyze the effect of using dif- ferent text-to-image models in Section 4.3. In addition to synthesizing new images,MAIAcan also edit images using Instruct-Pix2Pix (Brooks et al., 2022) by calling edit_images(image, edit_instructions). Gener- ating and editing synthetic images enables hypothesis tests involving images lying outside real-world data distributions, e.g.the addition of antelope horns to a horse (Figure 2, see Causal intervention on image input). Image description and summarization tools.To limit confirmation bias inMAIA’s interpretation of experimental results, we use a multi-agent framework in whichMAIA can ask a new instance of GPT-4V with no knowledge of experimental history to describe highlighted image regions in individual images,describe_images(image_list), or summarize what they have in common across a group of 4 A Multimodal Automated Interpretability Agent Figure 3.Predictive evaluation protocol. We compare neuron labeling methods by assessing how well their labels predict neuron activation values on unseen data. For each neuron we perform the following steps: (a) An LM uses candidate neuron labels to generate a set of image prompts that should maximally/neutrally activate the neuron. (b) All prompts (positive and neutral) from all methods are combined into one dataset. (c) For each labeling method, a new LM selects prompts from the Prompt Dataset that are likely to produce maximal and neutral neuron activations, if that label were accurate. (d) A text-to-image model generates all corresponding images, and the average activation values for positive and neutral images are recorded. A predictive neuron label will produce exemplars with maximally positive activations relative to the neutral baseline. images,summarize_images(image_list). We observe thatMAIAuses this tool in situations where previous hy- potheses failed or when observing complex combinations of visual content. Experiment log.MAIAcan document the results of each experiment (e.g.images, activations) using the log_experimenttool, to make them accessible during subsequent experiments. We promptMAIAto finish experi- ments by logging results, and let it choose what to log (e.g. data that clearly supports or refutes a particular hypothesis). 4. Evaluation TheMAIAframework is task-agnostic and can be adapted to new applications by specifying an interpretability task in the user prompt to the VLM. Before tackling model-level interpretability problems (Section 5), we evaluateMAIA’s performance on the black-box neuron description task, a widely studied interpretability subroutine that serves a vari- ety of downstream model auditing and editing applications (Gandelsman et al., 2024; Yang et al., 2023; Hernandez et al., 2022). 
For these experiments, the user prompt to MAIAspecifies the task and output format (a longer-form [DESCRIPTION]of neuron behavior, followed by a short [LABEL]), andMAIA’sSystemclass instruments a partic- ular vision model (e.g.ResNet-152) and an individual unit indexed inside that model (e.g.Layer 4 Unit 122). Task specifications for these experiments may be found in Appendix B. We findMAIAcorrectly predicts behaviors of individual vision neurons in three trained architectures (Section 4.1), and in a synthetic setting where ground-truth neuron selectivities are known (Section 4.2). We also find descriptions produced byMAIA’s interactive procedure to be more predictive of neuron behavior than descriptions of a fixed set of dataset exemplars, using theMILANbaseline from Hernandez et al. (2022). In many cases,MAIAde- scriptions are on par with those by human experts using the MAIAAPI. In Section 4.3, we perform ablation studies to test how components of theMAIAAPI differentially affect description quality. 4.1. Neurons in vision models We useMAIAto produce natural language descriptions of a subset of neurons across three vision architectures trained under different objectives: ResNet-152, a CNN for super- vised image classification (He et al., 2016), DINO (Caron et al., 2021), a Vision Transformer trained for unsuper- vised representation learning (Grill et al., 2020; Chen & He, 2021), and the CLIP visual encoder (Radford et al., 2021), a ResNet-50 model trained to align image-text pairs. For each model, we evaluate descriptions of100units ran- domly sampled from a range of layers that capture features at different levels of granularity (ResNet-152conv.1,res.1-4, DINO MLP 1-11, CLIPres.1-4). Figure 2 shows examples ofMAIAexperiments on neurons from all three models, and finalMAIAlabels. We also evaluate a baseline non- interactive approach that only labels dataset exemplars of each neuron’s behavior using theMILANmodel from Her- nandez et al. (2022). Finally, we collect human annotations of a random subset (25%) of neurons labeled byMAIAand MILAN, in an experimental setting where human experts write programs to perform interactive analyses of neurons using theMAIAAPI. Human experts receive theMAIAuser prompt, write programs that run experiments on the neu- rons, and return neuron descriptions in the same format. See Appendix C3 for details on the human labeling experiments. We evaluate the accuracy of neuron descriptions produced byMAIA,MILAN, and human experts by measuring how well theypredictneuron behavior on unseen test images (Figure 3). Similar to evaluation approaches that produce contrastive or counterfactual exemplars to reveal model deci- sion boundaries (Gardner et al., 2020; Kaushik et al., 2020), 5 A Multimodal Automated Interpretability Agent Figure 4.Predictive evaluation results.The average positive ac- tivation values (“+”) forMAIAlabels outperformMILANand are comparable to human descriptions for both real and synthetic neu- rons. Neutral activations (“-”) are comparable across methods. we use candidate neuron labels to generate new images that should elicit maximally positive activations relative to a neutral baseline. For a given neuron, we generate a pool of image candidates by providingMAIA,MILAN, and human labels to a Prompt Generator model (implemented with a new instance of GPT-4). For each candidate label (e.g. intricate masks), the Prompt Generator is instructed to write 7 image prompts that should generate maximally activating images (e.g. 
A Venetian mask, A tribal mask,...), and 7 prompts for neutral images (unrelated to the label) that should elicit baseline activations (e.g. A red bus, A field of flowers,...). All positive and neutral prompts from all labeling methods (MAIA,MILAN, and human experts) form a Prompt Dataset of 42 prompts per neuron. Next, we evaluate the accuracy of each candidate label by using a Prompt Selector LM (implemented with another GPT-4 instance) to match that label with the 7 prompts it is most and least likely to entail. We then generate the correspond- ing images using a text-to-image model (DALL-E3) and measure neuron activation values on those images. If a neuron label is predictive of activations, it will be matched with positive exemplars that maximally activate the neuron relative to the neutral baseline. Combining prompts from all methods into one test set (vs. evaluating each model sep- arately) more rigorously evaluates the completeness of each candidate label: an incomplete description produced by one labeling method (e.g.trainsfor a neuron selective fortrains ORdogs) could be matched with a “neutral” image prompt describingdogs, which would in fact elicit high activation. This method primarily discriminates between labeling pro- cedures: whether it is informative depends on the labeling methods themselves producing relevant exemplar prompts. We report the average activation values of positive and neu- tral exemplars forMAIA,MILAN, and human labels across all tested models in Figure 4.MAIAoutperformsMILAN across all models and is often on par with expert predic- tions.This trend persists across different averaging tech- niques (such as normalizing by activation percentile, see Appendix C1). WhileMILANis a relevant neuron labeling Figure 5.Synthetic neuron implementation.Segmentation of input images is performed by an open-set concept detector with text guidance specifying ground-truth neuron selectivity. Synthetic neurons return masked images and synthetic activation values corresponding to the probability a concept is present in the image. baseline, we note that comparisons to task-specific proce- dures that use learned models to label a fixed set of exem- plars only evaluate part ofMAIA’s full functionality.MAIA is easily adaptable to downstream auditing applications that require additional experimentation, where one-shot neuron labeling procedures are insufficient (see Section 5.1). Ta- ble A3 provides additional comparisons ofMAIAto neuron labeling baselines, and shows evaluation results by layer. 4.2. Synthetic neurons Following the procedure in Schwettmann et al. (2023) for validating the performance of automated interpretability methods on synthetic test systems mimicking real-world behaviors, we construct a set ofsynthetic vision neurons with known ground-truth selectivity. We simulate concept detection performed by neurons inside vision models using semantic segmentation. Synthetic neurons are built using an open-set concept detector that combines Grounded DINO (Liu et al., 2023) with SAM (Kirillov et al., 2023) to perform text-guided image segmentation. The ground-truth behavior of each neuron is determined by a text description of the concept(s) the neuron is selective for (Figure 5). To capture real-world behaviors, we derive neuron labels fromMILAN- NOTATIONS, a dataset of 60K human annotations of neurons across seven trained vision models (Hernandez et al., 2022). 
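A minimal sketch of this synthetic-neuron interface is given below (not the authors' implementation); the text-guided segmenter, which in the paper combines Grounded DINO with SAM, is passed in as an assumed callable.

from typing import Callable, List, Tuple

class SyntheticNeuron:
    """Mimics a real unit: returns an activation value and a concept mask.

    segment_fn(image, concept_text) -> (confidence, masked_image) is assumed to
    wrap the text-guided Grounded DINO + SAM pipeline described above.
    """

    def __init__(self, segment_fn: Callable, concepts: List[str]):
        self.segment_fn = segment_fn
        self.concepts = concepts  # e.g. ["stripes"], or several concepts for a disjunction

    def __call__(self, image) -> Tuple[float, object]:
        # The "activation" is the detector's confidence that a selected concept
        # is present; the mask highlights the corresponding image region.
        best_confidence, best_mask = 0.0, None
        for concept in self.concepts:
            confidence, mask = self.segment_fn(image, concept)
            if confidence > best_confidence:
                best_confidence, best_mask = confidence, mask
        return best_confidence, best_mask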
Neurons in the wild display a diversity of behaviors: some respond to individual concepts, while others respond to complex combinations of concepts (Bau et al., 2017; Fong & Vedaldi, 2018; Olah et al., 2020; Mu & Andreas, 2021; Gurnee et al., 2023). We construct three types of synthetic neurons with increasing levels of complexity: monoseman- tic neurons that recognize single concepts (e.g. stripes), polysemantic neurons selective for logical disjunctions of concepts (e.g. trainsORinstruments), andconditionalneu- rons that only recognize a concept in presence of another concept (e.g. dog| leash ). Following the instrumentation of real neurons in theMAIAAPI, synthetic vision neurons ac- cept image input and return a masked image highlighting the concept they are selective for (if present), and an activation 6 A Multimodal Automated Interpretability Agent Figure 6.MAIAsynthetic neuron interpretation. value (corresponding to the confidence of Grounded DINO in the presence of the concept). Dataset exemplars for syn- thetic neurons are calculated by computing 15 top-activating images per neuron from the C3M dataset (Sharma et al., 2018). Figure 5 shows examples of each type of neuron; the full list of 85 synthetic neurons is provided in Appendix C4. The set of concepts that can be represented by synthetic neu- rons is limited to simple concepts by the fidelity of open-set concept detection using current text-guided segmentation methods. We verify that all concepts in the synthetic neuron dataset can be segmented by Grounded DINO in combi- nation with SAM, and provide further discussion of the limitations of synthetic neurons in Appendix C4. MAIA interprets synthetic neurons using the same API and procedure used to interpret neurons in trained vision models (Section 4.1). In contrast to neurons in the wild, we can evaluate descriptions of synthetic neurons directly against ground-truth neuron labels. We collect comparative anno- tations of synthetic neurons fromMILAN, as well as expert Table 1.2AFC test.Human subjects selected which method best agrees with the ground truth synthetic neuron label. MAIAvs.MILANMAIAvs. HumanHuman vs.MILAN 0.73±4e −4 0.53±1e −3 0.83±5e −4 annotators (using the procedure from Section 4.1 where human experts manually label a subset of 25% of neurons using theMAIAAPI). We recruit human judges from Ama- zon Mechanical Turk to evaluate the agreement between synthetic neuron descriptions and ground-truth labels in pairwise two-alternative forced choice (2AFC) tasks. For each task, human judges are shown the ground-truth neuron label (e.g. tail) and descriptions produced by two label- ing procedures (e.g.“fluffy and textured animal tails” and “circular objects and animals”), and asked to select which description better matches the ground-truth label. Further details are provided in Appendix C4. Table 1 shows the results of the 2AFC study (the proportion of trials in which procedureAwas favored overB, and 95% confidence in- tervals). According to human judges,MAIAlabels better agree with ground-truth labels when compared toMI- LAN, and are even slightly preferred over expert labels on the subset of neurons they described (while human labels are largely preferred overMILANlabels). We also use the predictive evaluation framework described in Section 4.1 to generate positive and neutral sets of exemplar images for all synthetic neurons. 
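The scoring step of this predictive evaluation can be sketched as follows (a sketch under assumptions, not the authors' code); the prompt-selection LM and the text-to-image model are represented by callables passed in, and only the bookkeeping is shown.

from statistics import mean

def predictive_score(neuron, label, prompt_pool, select_prompts, text2image, k=7):
    """Return mean positive and neutral activations for one candidate label.

    select_prompts(label, prompt_pool, k) is assumed to ask an LM for the k
    prompts in the shared pool most / least likely to activate a neuron that
    matches `label`; text2image generates the corresponding images.
    """
    positive_prompts, neutral_prompts = select_prompts(label, prompt_pool, k)

    positive_images = text2image(positive_prompts)
    neutral_images = text2image(neutral_prompts)

    positive_acts, _ = neuron(positive_images)  # activation values + masks, as in the API
    neutral_acts, _ = neuron(neutral_images)

    # A predictive label yields positive activations well above the neutral baseline.
    return mean(positive_acts), mean(neutral_acts)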
Figure 4 showsMAIAdescriptions are better predictors of synthetic neuron activations thanMILAN descriptions, on par with labels produced by human experts. 4.3. Tool ablation study MAIA’s modular design enables straightforward addition and removal of tools from its API. We test three different settings to quantify sensitivity to different tools: (i) label- ing neurons using only thedataset_exemplarfunction without the ability to synthesize images, (i) relying only on generated inputs without the option to compute maxi- mally activating dataset exemplars, and (i) replacing the Stable Diffusiontext2imagebackbone with DALL-E 3. While the first two settings do not fully compromise per- formance, neither ablated API achieves the same average accuracy as the fullMAIAsystem (Figure 7). These results emphasize the combined utility of tools for experimenting with real-world and synthetic inputs:MAIAperforms best when initializing experiments with dataset exemplars and running additional tests with synthetic images. Methods likeMILANthat label precomputed exemplars could thus be incorporated into theMAIAAPI as tools, and used to ini- tialize experimentation. We also find that using DALL-E as thetext2imagebackbone improves performance (Figure 7). This suggests that the agent is bounded by the perfor- mance of its tools rather than its ability to use them—and as interpretability tools grow in sophistication, so willMAIA. 7 A Multimodal Automated Interpretability Agent Figure 7.Ablation study.We use the predictive evaluation proto- col to quantifyMAIA’s sensitivity to different tools. Top perfor- mance is achieved when experimenting with both real and synthetic data, and when using DALL-E 3 for image generation. More de- tails in Appendix C2. 4.4. MAIA failure modes Consistent with the result in Section 4.3 thatMAIAperfor- mance improves with DALL-E 3, we additionally observe that SD-v1.5 and InstructPix2Pix sometimes fail to faithfully generate and edit images according toMAIA’s instructions. To mitigate these failures, we instructMAIAto prompt posi- tive image-edits (e.g. replacethe bowtie with a plain shirt) rather than negative edits (e.g. removethe bowtie), but occasional failures still occur (see Figure 8 and Appendix D). While proprietary versions of tools may be of higher quality, they also introduce prohibitive rate limits and costs associated with API access. As similar limitations apply to the GPT-4V backbone itself, we tested the performance of free and non-proprietary VLMs as alternativeMAIAback- bones. Currently, off-the-shelf alternatives still significantly lag behind GPT-4V performance (consisitent with evalua- tion of open-source models’ ability to interpret functions in Schwettmann et al. (2023)), but our initial experiments sug- gest their performance may improve with fine-tuning (see Appendix D3). TheMAIAsystem is designed modularly so that open-source alternatives can be incorporated in the future as their performance improves. 5. Applications MAIAis a flexible system that automates model understand- ing tasks at different levels of granularity: from labeling individual features to diagnosing model-level failure modes. To demonstrate the utility ofMAIAfor producing actionable insights for human users (Vaughan & Wallach, 2020), we conduct experiments that applyMAIAto two model-level tasks: (i) spurious feature removal and (i) bias identifica- tion in a downstream classification task. In both casesMAIA uses the API as described in Section 3. 
In an additional experiment, we evaluate the downstream utility ofMAIA descriptions by measuring the extent to which they equip humans to make predictions about system behavior (see details in Appendix E). Figure 8.MAIAtool failures.MAIAis limited by the reliability of its tools. Common image editing failure modes (using Instruct- Pix2Pix) include failing to remove objects, misinterpreting the instructions (e.g. removing the incorrect object), and changing too much or too little of the image.MAIA’s image generation tool (SD-v1.5) is sometimes unreliable for negative instructions (e.g. a flagpolewithouta flag), and sometimes deviates from the text prompt by adding or excluding image components. 5.1. Removing spurious features Learned spurious features impose a challenge when ma- chine learning models are applied in real-world scenarios, where test distributions differ from training set statistics (Storkey et al., 2009; Beery et al., 2018; Bissoto et al., 2020; Xiao et al., 2020; Singla et al., 2021). We useMAIAto remove learned spurious features inside a classification net- work, finding thatwith no access to unbiased examples nor grouping annotations,MAIAcan identify and remove such features, improving model robustness under distribution shift by a wide margin, with an accuracy approaching that of fine-tuning on balanced data. We run experiments on ResNet-18 trained on the Spawrious dataset (Lynch et al., 2023), a synthetically generated dataset involving four dog breeds with different backgrounds. In the train set, each breed is spuriously correlated with a certain background type, while in the test set, the breed-background pairings are changed (see Figure 9). We useMAIAto find a subset of final layer neurons that robustly predict a single dog breed independently of spurious features (see Appendix F3). While other methods like Kirichenko et al. (2023) re- move spurious correlations by retraining the last layer on balanceddatasets, we only provideMAIAaccess to top- activating images from theunbalancedvalidation set and promptMAIAto run experiments to determine robustness. We then use the featuresMAIAselects to train an unregular- ized logistic regression model on the unbalanced data. 8 A Multimodal Automated Interpretability Agent Figure 9.Spawrious dataset examples.Train data contains spuri- ous correlations between dog breeds and their backgrounds. Table 2.Final layer spurious feature removal results. SubsetSelection Method# UnitsBalancedTest Acc. AllOriginal Model512✗0.731 ℓ 1 Top 50 All50✗0.779 Random22✗0.705±0.05 ℓ 1 Top 2222✗0.757 MILAN23✗0.786 MILAN(GPT-4V)23✗0.690 MAIA22✗0.837 All ℓ 1 Hyper. Tuning147✓0.830 ℓ 1 Top 2222✓0.865 As a demonstration, we select50of the most informative neurons usingℓ 1 regularization on theunbalanceddataset and haveMAIArun experiments on each one.MAIAselects 22neurons it deems to be robust. Traning an unregularized model on this subset significantly improves accuracy, as reported in Table 2. For comparison, we repeat the same task using interpretability procedures likeMILANthat rely on precomputed exemplars (both with the original model of (Hernandez et al., 2022) and with GPT-4V, see Appendix F2 for experimental details). Both achieved significantly lower accuracy. To further show that the sparsity ofMAIA’s neuron selection is not the only reason for its performance improve- ments, we also benchmarkMAIA’s performance againstℓ 1 regularized fitting on both unbalanced and balanced ver- sions of the dataset. 
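Concretely, the selection-and-retrain step of Section 5.1 can be sketched as follows (assumptions, not the authors' code): features is taken to be a matrix of final-layer activations on the unbalanced data, labels the breed labels, the regularization strength is an illustrative choice, and scikit-learn (cited in the references) supplies the classifiers.

import numpy as np
from sklearn.linear_model import LogisticRegression

def select_candidate_units(features, labels, k=50):
    """Rank final-layer units by L1-regularized coefficient magnitude."""
    l1_model = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
    l1_model.fit(features, labels)
    importance = np.abs(l1_model.coef_).sum(axis=0)  # aggregate over the four breeds
    return np.argsort(importance)[::-1][:k]          # indices of the top-k units

def retrain_on_subset(features, labels, robust_units):
    """Fit an unregularized classifier on the units MAIA's experiments deem robust."""
    clf = LogisticRegression(penalty=None, max_iter=5000)  # penalty=None needs scikit-learn >= 1.2
    clf.fit(features[:, robust_units], labels)
    return clf

# candidate_units = select_candidate_units(train_features, train_labels, k=50)
# robust_units = ...  # the subset of candidate_units that MAIA flags as robust
# classifier = retrain_on_subset(train_features, train_labels, robust_units)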
On the unbalanced dataset,ℓ 1 drops in performance when subset size reduces from50to22neu- rons. Using a smallbalanceddataset to hyperparameter tune theℓ 1 parameter and train the logistic regression model on all neurons achieves performance comparable toMAIA’s chosen subset, althoughMAIAdid not have access to any balanced data. For a fair comparison, we test the perfor- mance of anℓ 1 model which matches the sparsity ofMAIA, but trained on the balanced dataset. See Appendix F2 for more details. 5.2. Revealing biases MAIAcan be used to automatically surface model-level bi- ases. Specifically, we applyMAIAto investigate biases in the outputs of a CNN (ResNet-152) trained on a supervised ImageNet classification task. TheMAIAsystem is easily adaptable to this experiment: the output logit corresponding to a specific class is instrumented using thesystemclass, and returns class probability for input images.MAIAis Figure 10.MAIAbias detection.MAIAiteratively conducts exper- iments and generates synthetic inputs to surface biases in ResNet- 152 output classes. In some cases,MAIAdiscovers uniform behav- ior over the inputs (e.g. flagpole). provided with the class label and instructed (see Appendix G) to find settings in which the classifier ranks images re- lated to that class with relatively lower probability values, or shows a clear preference for a subset of the class. Figure 10 presents results for a subset of ImageNet classes. This simple paradigm suggests thatMAIA’s generation of syn- thetic data could be widely useful for identifying regions of the input distribution where a model exhibits poor per- formance. While this exploratory experiment surfaces only broad failure categories,MAIAenables other experiments targeted at end-use cases identifying specific biases. 6. Conclusion We introduceMAIA, an agent that automates interpretabil- ity tasks including feature interpretation and bias discovery. By composing pretrained modules,MAIAconducts exper- iments to make and test hypotheses about the behavior of other systems. While human supervision is needed to maxi- mize its effectiveness and catch common mistakes, initial experiments withMAIAshow promise, and we anticipate that interpretability agents will be increasingly useful as they grow in sophistication. 9 A Multimodal Automated Interpretability Agent Impact statement As AI systems take on higher-stakes roles and become more deeply integrated into research and society, scalable ap- proaches to auditing for reliability will be vital.MAIAis a protoype for a tool that can help human users ensure AI systems are transparent, reliable, and equitable. We thinkMAIAaugments, but does not replace, human over- sight of AI systems.MAIAstill requires human supervision to catch mistakes such as confirmation bias and image gen- eration/editing failures. Absence of evidence (fromMAIA) is not evidence of absence: thoughMAIA’s toolkit enables causal interventions on inputs in order to evaluate system behavior,MAIA’s explanations do not provide formal verifi- cation of system performance. Acknowlegements We are grateful for the support of the MIT-IBM Watson AI Lab, the Open Philanthropy foundation, Hyundai Motor Company, ARL grant W911NF-18-2-021, Intel, the Na- tional Science Foundation under grant CCF-2217064, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. The funders had no role in experimental de- sign or analysis, decision to publish, or preparation of the manuscript. The authors have no competing interests to report. 
References Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Sori- cut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gem- ini: a family of highly capable multimodal models.arXiv, 2312.11805, 2023. Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. InComputer Vision and Pattern Recognition, 2017. Bau, D., Zhu, J.-Y., Strobelt, H., Lapedriza, A., Zhou, B., and Torralba, A.Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 2020.ISSN 0027-8424.doi:10.1073/pnas. 1907375117.URLhttps://w.pnas.org/ content/early/2020/08/31/1907375117. Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. InProceedings of the European conference on computer vision (ECCV), p. 456–473, 2018. Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W.Language models can explain neurons in language models.https: //openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html, 2023. Bissoto, A., Valle, E., and Avila, S. Debiasing skin lesion datasets and models? not so fast. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, p. 740–741, 2020. Brooks, T., Holynski, A., and Efros, A. A. Instructpix2pix: Learning to follow image editing instructions.arXiv preprint arXiv:2211.09800, 2022. Caron, M., Touvron, H., Misra, I., J ́ egou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, p. 9650–9660, 2021. Casper, S., Hariharan, K., and Hadfield-Menell, D. Diagnos- tics for deep neural networks with automated copy/paste attacks.arXiv preprint arXiv:2211.10024, 2022. Chen, L., Zhang, Y., Ren, S., Zhao, H., Cai, Z., Wang, Y., Wang, P., Liu, T., and Chang, B. Towards end-to-end em- bodied decision making via multi-modal large language model: Explorations with gpt4-vision and beyond, 2023. Chen, X. and He, K. Exploring simple siamese represen- tation learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, p. 15750–15758, 2021. Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability.arXiv preprint arXiv:2304.14997, 2023. Dalvi, F., Durrani, N., Sajjad, H., Belinkov, Y., Bau, A., and Glass, J. What is one grain of sand in the desert? analyzing individual neurons in deep nlp models. In Proceedings of the AAAI Conference on Artificial Intelli- gence, volume 33, p. 6309–6317, 2019. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, p. 248–255. Ieee, 2009. Fong, R. and Vedaldi, A. Net2vec: Quantifying and explain- ing how concepts are encoded by filters in deep neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, p. 8730–8738, 2018. Gandelsman, Y., Efros, A. A., and Steinhardt, J. Interpreting clip’s image representation via text-based decomposition, 2024. 10 A Multimodal Automated Interpretability Agent Gardner, M., Artzi, Y., Basmova, V., Berant, J., Bogin, B., Chen, S., Dasigi, P., Dua, D., Elazar, Y., Gottumukkala, A., Gupta, N., Hajishirzi, H., Ilharco, G., Khashabi, D., Lin, K., Liu, J., Liu, N. 
F., Mulcaire, P., Ning, Q., Singh, S., Smith, N. A., Subramanian, S., Tsarfaty, R., Wallace, E., Zhang, A., and Zhou, B. Evaluating models’ local decision boundaries via contrast sets, 2020. Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, p. 580–587, 2014. Grill, J.-B., Strub, F., Altch ́ e, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020. Gupta, T. and Kembhavi, A. Visual programming: Compo- sitional visual reasoning without training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 14953–14962, 2023. Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troit- skii, D., and Bertsimas, D.Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn- ing for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, p. 770–778, 2016. Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., and Andreas, J. Natural language descrip- tions of deep visual features. InInternational Conference on Learning Representations, 2022. Huang, J., Geiger, A., D’Oosterlinck, K., Wu, Z., and Potts, C. Rigorously assessing natural language explanations of neurons.arXiv preprint arXiv:2309.10312, 2023. Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks.arXiv preprint arXiv:1506.02078, 2015. Kaushik, D., Hovy, E., and Lipton, Z. C. Learning the difference that makes a difference with counterfactually- augmented data, 2020. Kirichenko, P., Izmailov, P., and Wilson, A. G. Last layer re-training is sufficient for robustness to spurious correla- tions, 2023. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Doll ́ ar, P., and Girshick, R. Segment anything. arXiv:2304.02643, 2023. Kluyver, T., Ragan-Kelley, B., P ́ erez, F., Granger, B., Bus- sonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., and Willing, C. Jupyter notebooks – a publishing format for repro- ducible computational workflows. In Loizides, F. and Schmidt, B. (eds.),Positioning and Power in Academic Publishing: Players, Agents and Agendas, p. 87 – 90. IOS Press, 2016. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge.https://llava-vl.github. io/blog/2024-01-30-llava-next/, 2024. Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection.arXiv preprint arXiv:2303.05499, 2023. Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild, 2015. Lynch, A., Dovonon, G. J.-S., Kaddour, J., and Silva, R. Spawrious: A benchmark for fine control of spurious correlation biases, 2023. Mahendran, A. and Vedaldi, A. Understanding deep im- age representations by inverting them. InProceedings of the IEEE conference on computer vision and pattern recognition, p. 5188–5196, 2015. Mu, J. and Andreas, J. 
Compositional explanations of neu- rons, 2021. Nushi, B., Kamar, E., and Horvitz, E. Towards accountable ai: Hybrid human-machine analyses for characterizing system failure. InProceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 6, p. 126–135, 2018. Oikarinen, T. and Weng, T.-W. Clip-dissect: Automatic description of neuron representations in deep vision net- works.arXiv preprint arXiv:2204.10965, 2022. Olah, C., Mordvintsev, A., and Schubert, L. Feature visual- ization.Distill, 2(11):e7, 2017. Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020. OpenAI. Gpt-4 technical report, 2023a. 11 A Multimodal Automated Interpretability Agent OpenAI.Gpt-4v(ision) technical work and au- thors.https://openai.com/contributions/ gpt-4v, 2023b. Accessed: [insert date of access]. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour- napeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011. Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., Zhao, S., Hong, L., Tian, R., Xie, R., Zhou, J., Gerstein, M., Li, D., Liu, Z., and Sun, M. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with la- tent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 10684–10695, June 2022a. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models, 2022b. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case gener- alization, 2020. Schick, T., Dwivedi-Yu, J., Dess ` ı, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Tool- former: Language models can teach themselves to use tools, 2023. Schwettmann, S., Hernandez, E., Bau, D., Klein, S., An- dreas, J., and Torralba, A. Toward a visual concept vo- cabulary for gan latent space. InProceedings of the IEEE/CVF International Conference on Computer Vision, p. 6804–6812, 2021. Schwettmann, S., Shaham, T. R., Materzynska, J., Chowd- hury, N., Li, S., Andreas, J., Bau, D., and Torralba, A. Find: A function description benchmark for evaluating interpretability methods, 2023. Sharma, P., Ding, N., Goodman, S., and Soricut, R. Con- ceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InProceedings of the 56th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), p. 2556– 2565, 2018. Singh, C., Hsu, A. R., Antonello, R., Jain, S., Huth, A. G., Yu, B., and Gao, J. 
Explaining black box text modules in natural language with language models, 2023. Singla, S., Nushi, B., Shah, S., Kamar, E., and Horvitz, E. Understanding failures of deep networks via robust feature extraction. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, p. 12853–12862, 2021. Storkey, A. et al. When training and test sets are different: characterizing learning transfer.Dataset shift in machine learning, 30(3-28):6, 2009. Sur ́ ıs, D., Menon, S., and Vondrick, C. Vipergpt: Visual inference via python execution for reasoning, 2023. Vaughan, J. W. and Wallach, H. A human-centered agenda for intelligible machine learning.Machines We Trust: Getting Along with Artificial Intelligence, 2020. Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S.The Caltech-UCSD Birds-200-2011 Dataset. Caltech Vision Lab, Jul 2011. Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., and Duan, N. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. Xiao, K., Engstrom, L., Ilyas, A., and Madry, A. Noise or signal: The role of image backgrounds in object recogni- tion.arXiv preprint arXiv:2006.09994, 2020. Yang, Y., Panagopoulou, A., Zhou, S., Jin, D., Callison- Burch, C., and Yatskar, M. Language in a bottle: Lan- guage model guided concept bottlenecks for interpretable image classification, 2023. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models, 2023. Zeiler, M. D. and Fergus, R. Visualizing and understand- ing convolutional networks. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, p. 818– 833. Springer, 2014. Zhang, J., Wang, Y., Molino, P., Li, L., and Ebert, D. S. Manifold: A model-agnostic framework for interpreta- tion and diagnosis of machine learning models.IEEE 12 A Multimodal Automated Interpretability Agent transactions on visualization and computer graphics, 25 (1):364–373, 2018. Zheng, B., Gou, B., Kil, J., Sun, H., and Su, Y. Gpt-4v(ision) is a generalist web agent, if grounded, 2024. Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., and Lee, Y. J. Segment everything everywhere all at once, 2023. 13 A Multimodal Automated Interpretability Agent Appendix A.MAIALibrary The fullMAIAAPI provided in the system prompt is reproduced below. import torch from typing import List, Tuple class System: """ A Python class containing the vision model and the specific neuron to interact with. Attributes ---------- neuron_num : int The unit number of the neuron. layer : string The name of the layer where the neuron is located. model_name : string The name of the vision model. model : n.Module The loaded PyTorch model. Methods ------- load_model(model_name: str) -> n.Module Gets the model name and returns the vision model from PyTorch library. neuron(image_list: List[torch.Tensor]) -> Tuple[List[int], List[str]] returns the neuron activation for each image in the input image_list as well as the activation map of the neuron over that image, that highlights the regions of the image where the activations are higher (encoded into a Base64 string). """ def __init__(self, neuron_num: int, layer: str, model_name: str, device: str): """ Initializes a neuron object by specifying its number and layer location and the vision model that the neuron belongs to. Parameters ------- neuron_num : int The unit number of the neuron. 
layer : str The name of the layer where the neuron is located. model_name : str The name of the vision model that the neuron is part of. device : str The computational device ('cpu' or 'cuda'). """ self.neuron_num = neuron_num self.layer = layer self.device = torch.device(f"cuda:device" if torch.cuda.is_available() else "cpu") self.model = self.load_model(model_name) def load_model(self, model_name: str) -> torch.n.Module: """ Gets the model name and returns the vision model from pythorch library. Parameters ---------- model_name : str The name of the model to load. Returns ------- n.Module The loaded PyTorch vision model. Examples -------- >>> # load "resnet152" >>> def run_experiment(model_name) -> n.Module: >>> model = load_model(model_name: str) >>> return model """ return load_model(model_name) def neuron(self, image_list: List[torch.Tensor]) -> Tuple[List[int], List[str]]: """ 14 A Multimodal Automated Interpretability Agent The function returns the neuron's maximum activation value (in int format) for each of the images in the list as well as the activation map of the neuron over each of the images that highlights the regions of the image where the activations are higher (encoded into a Base64 string). Parameters ---------- image_list : List[torch.Tensor] The input image Returns ------- Tuple[List[int], List[str]] For each image in image_list returns the activation value of the neuron on that image, and a masked image, with the region of the image that caused the high activation values highlighted (and the rest of the image is darkened). Each image is encoded into a Base64 string. Examples -------- >>> # test the activation value of the neuron for the prompt "a dog standing on the grass" >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompt = ["a dog standing on the grass"] >>> image = tools.text2image(prompt) >>> activation_list, activation_map_list = system.neuron(image) >>> return activation_list, activation_map_list >>> # test the activation value of the neuron for the prompt "a dog standing on the grass" and the neuron activation value for the same image but with a lion instead of a dog >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompt = ["a dog standing on the grass"] >>> edits = ["replace the dog with a lion"] >>> all_image, all_prompts = tools.edit_images(prompt, edits) >>> activation_list, activation_map_list = system.neuron(all_images) >>> return activation_list, activation_map_list """ return neuron(image_list) class Tools: """ A Python class containing tools to interact with the neuron implemented in the system class, in order to run experiments on it. Attributes ---------- experiment_log: str A log of all the experiments, including the code and the output from the neuron. Methods ------- dataset_exemplars(system: object) -> Tuple(List[int],List[str]) This experiment provides good coverage of the behavior observed on a very large dataset of images and therefore represents the typical behavior of the neuron on real images. This function characterizes the prototipycal behavior of the neuron by computing its activation on all images in the ImageNet dataset and returning the 15 highest activation values and the images that produced them. The images are masked to highlight the specific regions that produce the maximal activation. The images are overlaid with a semi-opaque mask, such that the maximally activating regions remain unmasked. 
edit_images(prompt_list_org_image : List[str], editing_instructions_list : List[str]) -> Tuple[List[Image.Image], List[str]] This function enables localized testing of specific hypotheses about how variations in the content of a single image affect neuron activations. Gets a list of input prompts and a list of corresponding editing instructions, then generates images according to the input prompts and edits each image based on the instructions given in the prompt, using a text-based image editing model. This function is very useful for testing the causality of the neuron in a controlled way, for example by testing how the neuron activation is affected by changing one aspect of the image. IMPORTANT: Do not use negative terminology such as "remove ...", try to use terminology like "replace ... with ..." or "change the color of ... to ...". text2image(prompt_list: List[str]) -> List[Image.Image] Gets a list of text prompts as input and generates an image for each prompt in the list using a text-to-image model. The function returns a list of images. summarize_images(self, image_list: List[str]) -> str: This function is useful to summarize the mutual visual concept that appears in a set of images. It gets a list of images as input and describes what is common to all of them, focusing specifically on unmasked regions. describe_images(synthetic_image_list: List[str], synthetic_image_title:List[str]) -> str Provides impartial descriptions of images. Do not use this function on dataset exemplars. Gets a list of images and generates a textual description of the semantic content of the unmasked regions within each of them. The function is blind to the current hypotheses list and therefore provides an unbiased description of the visual content. log_experiment(activation_list: List[int], image_list: List[str], image_titles: List[str], image_textual_information: Union[str, List[str]]) -> None Documents the current experiment results as an entry in the experiment log list. If self.activation_threshold was updated by the dataset_exemplars function, the experiment log will contain instructions to continue with experiments if activations are lower than activation_threshold. Results that are logged will be available for future experiments (unlogged results will be unavailable). The function also updates the attribute "result_list", such that each element in the result_list is a dictionary of the format: "<prompt>": "activation": act, "image": image so the list contains all the results that were logged so far. """ def __init__(self): """ Initializes the Tools object. Parameters ---------- experiment_log: store all the experimental results """ self.experiment_log = [] self.results_list = [] def dataset_exemplars(self, system: object) -> Tuple[List[int], List[str]]: """ This method finds images from the ImageNet dataset that produce the highest activation values for a specific neuron. It returns both the activation values and the corresponding exemplar images that were used to generate these activations (with the highly activating region highlighted and the rest of the image darkened). The neuron and layer are specified through a 'system' object. This experiment is performed on real images and will provide a good approximation of the neuron behavior. Parameters ---------- system : object An object representing the specific neuron and layer within the neural network.
The 'system' object should have 'layer' and 'neuron_num' attributes, so the dataset_exemplars function can return the exemplar activations and masked images for that specific neuron. Returns ------- tuple A tuple containing two elements: - The first element is a list of activation values for the specified neuron. - The second element is a list of exemplar images (as Base64 encoded strings) corresponding to these activations. Example ------- >>> def run_experiment(system, tools) >>> activation_list, image_list = self.dataset_exemplars(system) >>> return activation_list, image_list """ return dataset_exemplars(system) def edit_images(self, prompt_list_org_image : List[str], editing_instructions_list : List[str]) -> Tuple[List[ Image.Image], List[str]]: """ This function enables localized testing of specific hypotheses about how variations in the content of a single image affect neuron activations. Gets a list of prompts to generate images, and a list of corresponding editing instructions as inputs. Then generates images based on the image prompts and edits each image based on the instructions given in the prompt using a text-based image editing model (so there is no need to generate the images outside of this function). This function is very useful for testing the causality of the neuron in a controlled way, for example by testing how the neuron activation is affected by changing one aspect of the image. IMPORTANT: for the editing instructions, do not use negative terminology such as "remove ...", try to use terminology like "replace ... with ..." or "change the color of ... to" The function returns a list of images, constructed in pairs of original images and their edited versions, and a list of all the corresponding image prompts and editing prompts in the same order as the images. Parameters ---------- prompt_list_org_image : List[str] 16 A Multimodal Automated Interpretability Agent A list of input prompts for image generation. These prompts are used to generate images which are to be edited by the prompts in editing_instructions_list. editing_instructions_list : List[str] A list of instructions for how to edit the images in image_list. Should be the same length as prompt_list_org_image. Edits should be relatively simple and describe replacements to make in the image, not deletions. Returns ------- Tuple[List[Image.Image], List[str]] A list of all images where each unedited image is followed by its edited version. And a list of all the prompts corresponding to each image (e.g. the input prompt followed by the editing instruction). 
Examples -------- >>> # test the activation value of the neuron for the prompt "a dog standing on the grass" and the neuron activation value for the same image but with a cat instead of a dog >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompt = ["a dog standing on the grass"] >>> edits = ["replace the dog with a cat"] >>> all_image, all_prompts = tools.edit_images(prompt, edits) >>> activation_list, activation_map_list = system.neuron(all_images) >>> return activation_list, activation_map_list >>> # test the activation value of the neuron for the prompt "a dog standing on the grass" and the neuron activation values for the same image but with a different action instead of "standing": >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompts = ["a dog standing on the grass"] * 3 >>> edits = ["make the dog sit","make the dog run","make the dog eat"] >>> all_images, all_prompts = tools.edit_images(prompts, edits) >>> activation_list, activation_map_list = system.neuron(all_images) >>> return activation_list, activation_map_list """ return edit_images(image, edits) def text2image(self, prompt_list: List[str]) -> List[Image.Image]: """Gets a list of text prompts as input, generates an image for each prompt in the list using a text to image model. The function returns a list of images. Parameters ---------- prompt_list : List[str] A list of text prompts for image generation. Returns ------- List[Image.Image] A list of images, corresponding to each of the input prompts. Examples -------- >>> # test the activation value of the neuron for the prompt "a dog standing on the grass" >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompt = ["a dog standing on the grass"] >>> image = tools.text2image(prompt) >>> activation_list, activation_map_list = system.neuron(image) >>> return activation_list, activation_map_list >>> # test the activation value of the neuron for the prompt "a fox and a rabbit watch a movie under a starry night sky" "a fox and a bear watch a movie under a starry night sky" "a fox and a rabbit watch a movie at sunrise" >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompt_list = ["a fox and a rabbit watch a movie under a starry night sky", "a fox and a bear watch a movie under a starry night sky","a fox and a rabbit watch a movie at sunrise"] >>> images = tools.text2image(prompt_list) >>> activation_list, activation_map_list = system.neuron(images) >>> return activation_list, activation_map_list """ return text2image(prompt_list) def summarize_images(self, image_list: List[str]) -> str: """ This function is useful to summarize the mutual visual concept that appears in a set of images. It gets a list of images at input and describes what is common to all of them, focusing specifically on unmasked regions. Parameters ---------- 17 A Multimodal Automated Interpretability Agent image_list : list A list of images in Base64 encoded string format. Returns ------- str A string with a descriptions of what is common to all the images. 
Example ------- >>> # tests dataset exemplars and return textual summarization of what is common for all the maximally activating images >>> def run_experiment(system, tools): >>> activation_list, image_list = self.dataset_exemplars(system) >>> prompt_list = [] >>> for i in range(len(activation_list)): >>> prompt_list.append(f'dataset exemplar i') # for the dataset exemplars we don't have prompts, therefore need to provide text titles >>> summarization = tools.summarize_images(image_list) >>> return summarization """ return summarize_images(image_list) def describe_images(self, image_list: List[str], image_title:List[str]) -> str: """ Provides impartial description of the highlighted image regions within an image. Generates textual descriptions for a list of images, focusing specifically on highlighted regions. This function translates the visual content of the highlited region in the image to a text description. The function operates independently of the current hypothesis list and thus offers an impartial description of the visual content. It iterates through a list of images, requesting a description for the highlighted (unmasked) regions in each synthetic image. The final descriptions are concatenated and returned as a single string, with each description associated with the corresponding image title. Parameters ---------- image_list : list A list of images in Base64 encoded string format. image_title : callable A list of strings with the image titles that will be use to list the different images. Should be the same length as image_list. Returns ------- str A concatenated string of descriptions for each image, where each description is associated with the image's title and focuses on the highlighted regions in the image. Example ------- >>> def run_experiment(system, tools): >>> prompt_list = ["a fox and a rabbit watch a movie under a starry night sky", "a fox and a bear watch a movie under a starry night sky","a fox and a rabbit watch a movie at sunrise"] >>> images = tools.text2image(prompt_list) >>> activation_list, image_list = system.neuron(images) >>> descriptions = tools.describe_images(image_list, prompt_list) >>> return descriptions """ return describe_images(image_list, image_title) def log_experiment(self, activation_list: List[int], image_list: List[str], image_titles: List[str], image_textual_information: Union[str, List[str]]): """documents the current experiment results as an entry in the experiment log list. if self. activation_threshold was updated by the dataset_exemplars function, the experiment log will contain instruction to continue with experiments if activations are lower than activation_threshold. Results that are logged will be available for future experiments (unlogged results will be unavailable). The function also updates the attribute "result_list", such that each element in the result_list is a dictionary of the format: "<prompt>": "activation": act, "image": image so the list contains all the results that were logged so far. Parameters ---------- activation_list : List[int] A list of the activation values that were achived for each of the images in "image_list". image_list : List[str] 18 A Multimodal Automated Interpretability Agent A list of the images that were generated using the text2image model and were tested. Should be the same length as activation_list. image_titles : List[str] A list of the text lables for the images. Should be the same length as activation_list. 
image_textual_information: Union[str, List[str]] A string or a list of strings with additional information to log, such as the image summarization and/or the image textual descriptions. Returns ------- None Examples -------- >>> # tests the activation value of the neuron for the prompts "a fox and a rabbit watch a movie under a starry night sky", "a fox and a bear watch a movie under a starry night sky", "a fox and a rabbit watch a movie at sunrise", describes the images and logs the results and the image descriptions >>> def run_experiment(system, tools): >>> prompt_list = ["a fox and a rabbit watch a movie under a starry night sky", "a fox and a bear watch a movie under a starry night sky","a fox and a rabbit watch a movie at sunrise"] >>> images = tools.text2image(prompt_list) >>> activation_list, activation_map_list = system.neuron(images) >>> descriptions = tools.describe_images(images, prompt_list) >>> tools.log_experiment(activation_list, activation_map_list, prompt_list, descriptions) >>> return >>> # tests dataset exemplars, uses the image summarizer and logs the results >>> def run_experiment(system, tools): >>> activation_list, image_list = tools.dataset_exemplars(system) >>> prompt_list = [] >>> for i in range(len(activation_list)): >>> prompt_list.append(f'dataset_exemplars {i}') # for the dataset exemplars we don't have prompts, therefore need to provide text titles >>> summarization = tools.summarize_images(image_list) >>> tools.log_experiment(activation_list, image_list, prompt_list, summarization) >>> return >>> # test the effect of changing a dog into a cat. Describes the images and logs the results. >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompt = ["a dog standing on the grass"] >>> edits = ["replace the dog with a cat"] >>> all_images, all_prompts = tools.edit_images(prompt, edits) >>> activation_list, activation_map_list = system.neuron(all_images) >>> descriptions = tools.describe_images(activation_map_list, all_prompts) >>> tools.log_experiment(activation_list, activation_map_list, all_prompts, descriptions) >>> return >>> # test the effect of changing the dog's action on the activation values. Describes the images and logs the results. >>> def run_experiment(system, tools) -> Tuple[int, str]: >>> prompts = ["a dog standing on the grass"] * 3 >>> edits = ["make the dog sit","make the dog run","make the dog eat"] >>> all_images, all_prompts = tools.edit_images(prompts, edits) >>> activation_list, activation_map_list = system.neuron(all_images) >>> descriptions = tools.describe_images(activation_map_list, all_prompts) >>> tools.log_experiment(activation_list, activation_map_list, all_prompts, descriptions) >>> return """ return log_experiment(activation_list, image_list, image_titles, image_textual_information)

B. MAIA user prompt: neuron description

Your overall task is to describe the visual concepts that maximally activate a neuron inside a deep network for computer vision. To do that you are provided with a library of Python functions to run experiments on the specific neuron (inside the "System" class) given the functions provided in the "Tools" class. Make sure to use a variety of tools from the library to maximize your experimentation power. Some neurons might be selective for very specific concepts, a group of unrelated concepts, or a general concept, so try to be creative in your experiment and try to test both general and specific concepts.
If a neuron is selective for multiple concepts, you should describe each of those concepts in your final description. At each experiment step, write Python code that will conduct your experiment on the tested neuron, using the following format: [CODE]: ```python def run_experiment(system, tools) # gets an object of the system class, an object of the tool class, and performs experiments on the neuron with the tools ... tools.log_experiment(...) ``` Finish each experiment by documenting it by calling the "log_experiment" function. Do not include any additional implementation other than this function. Do not call "execute_command" after defining it. Include only a single instance of experiment implementation at each step. Each time you get the output of the neuron, try to summarize what the inputs that activate the neuron have in common (where that description is not influenced by previous hypotheses). Then, write multiple hypotheses that could explain the visual concept(s) that activate the neuron. Note that the neuron can be selective for more than one concept. For example, these hypotheses could list multiple concepts that the neuron is selective for (e.g. dogs OR cars OR birds), provide different explanations for the same concept, describe the same concept at different levels of abstraction, etc. Some of the concepts can be quite specific, test hypotheses that are both general and very specific. Then write a list of initial hypotheses about the neuron selectivity in the format: [HYPOTHESIS LIST]: Hypothesis_1: <hypothesis_1> ... Hypothesis_n: <hypothesis_n>. After each experiment, wait to observe the outputs of the neuron. Then your goal is to draw conclusions from the data, update your list of hypotheses, and write additional experiments to test them. Test the effects of both local and global differences in images using the different tools in the library. If you are unsure about the results of the previous experiment you can also rerun it, or rerun a modified version of it with additional tools. Use the following format: [HYPOTHESIS LIST]: ## update your hypothesis list according to the image content and related activation values. Only update your hypotheses if image activation values are higher than previous experiments. [CODE]: ## conduct additional experiments using the provided python library to test * ALL * the hypotheses. Test different and specific aspects of each hypothesis using all of the tools in the library. Write code to run the experiment in the same format provided above. Include only a single instance of experiment implementation. Continue running experiments until you prove or disprove all of your hypotheses. Only when you are confident in your hypothesis after proving it in multiple experiments, output your final description of the neuron in the following format: [DESCRIPTION]: <final description> ## Your description should be selective (e.g. very specific: "dogs running on the grass" and not just "dog") and complete (e.g. include all relevant aspects the neuron is selective for). In cases where the neuron is selective for more than one concept, include in your description a list of all the concepts separated by logical "OR". [LABEL]: <final label drived from the hypothesis or hypotheses> ## a label for the neuron generated from the hypothesis (or hypotheses) you are most confident in after running all experiments. 
They should be concise and complete, for example, "grass surrounding animals", "curved rims of cylindrical objects", "text displayed on computer screens", "the blue sky background behind a bridge", and "wheels on cars" are all appropriate. You should capture the concept(s) the neuron is selective for. Only list multiple hypotheses if the neuron is selective for multiple distinct concepts. List your hypotheses in the format:
[LABEL 1]: <label 1>
[LABEL 2]: <label 2>

C. Evaluation experiment details

In Table A3 we provide full evaluation results by layer, as well as the number of units evaluated in each layer. Units were sampled uniformly at random, with larger numbers of units drawn from later layers, which contain more interpretable features.

Table A3. Evaluation results by layer

Arch. | Layer | # Units | MILAN + | MILAN − | CLIP-Dissect + | CLIP-Dissect − | MAIA + | MAIA − | Human + | Human −
ResNet-152 | conv. 1 | 10 | 7.23 | 3.38 | 6.7 | 2.71 | 7.28 | 3.53 | 7.83 | 3.16
ResNet-152 | res. 1 | 15 | 0.82 | 0.73 | 0.83 | 0.55 | 0.78 | 0.69 | 0.46 | 0.64
ResNet-152 | res. 2 | 20 | 0.98 | 0.92 | 1.02 | 0.66 | 1.02 | 0.90 | 0.83 | 0.95
ResNet-152 | res. 3 | 25 | 1.28 | 0.72 | 0.98 | 0.68 | 1.28 | 0.70 | 2.59 | 0.58
ResNet-152 | res. 4 | 30 | 5.41 | 2.04 | 7.1 | 1.61 | 7.10 | 1.74 | 7.89 | 1.99
ResNet-152 | Avg. | | 2.99 | 1.42 | 3.37 | 1.14 | 3.50 | 1.33 | 4.15 | 1.34
DINO-ViT | MLP 1 | 5 | 1.10 | 0.94 | 0.76 | 0.94 | 1.19 | 0.74 | 0.63 | 0.34
DINO-ViT | MLP 3 | 5 | 0.63 | 0.96 | 0.96 | 1.06 | 0.81 | 0.87 | 0.55 | 0.89
DINO-ViT | MLP 5 | 20 | 0.85 | 1.01 | 1.11 | 0.94 | 1.33 | 0.97 | 0.84 | 0.84
DINO-ViT | MLP 7 | 20 | 1.42 | 0.77 | 1.16 | 0.66 | 1.67 | 0.82 | 2.58 | 0.54
DINO-ViT | MLP 9 | 25 | 3.50 | -1.15 | 0.81 | -0.83 | 6.31 | -0.81 | 8.64 | -1.06
DINO-ViT | MLP 11 | 25 | -1.56 | -1.94 | -1.12 | -1.65 | -1.41 | -1.84 | -0.61 | -2.49
DINO-ViT | Avg. | | 1.03 | -0.32 | 0.44 | -0.2 | 1.93 | 0.54 | 1.97 | -0.23
CLIP-RN50 | res. 1 | 10 | 1.92 | 2.16 | 2.34 | 1.82 | 2.10 | 2.07 | 1.65 | 2.15
CLIP-RN50 | res. 2 | 20 | 2.54 | 2.46 | 2.61 | 1.91 | 2.78 | 2.39 | 2.22 | 2.81
CLIP-RN50 | res. 3 | 30 | 2.24 | 1.70 | 2.33 | 1.45 | 2.27 | 1.70 | 2.41 | 1.96
CLIP-RN50 | res. 4 | 40 | 3.56 | 1.30 | 4.75 | 1.36 | 4.90 | 1.39 | 4.92 | 1.29
CLIP-RN50 | Avg. | | 2.79 | 1.74 | 3.36 | 1.55 | 3.41 | 1.75 | 3.29 | 1.89

C1. Predictive evaluation with activation normalization

Averaging raw activation values can be dominated by neurons with a relatively high activation range. Ideally, activation values would be reported as percentiles relative to the activation range of each individual neuron; however, neuron activation ranges are difficult to estimate precisely. Instead, we perform the following estimation: we normalize the activation value of each neuron by its 95th-percentile activation value (assuming the 0th percentile corresponds to an activation value of 0, which is a valid assumption for convolutional networks with ReLU activations like ResNet and CLIP). Results are reported in Table A4 and show the same trends as before normalization.

Table A4. Predictive evaluation results, normalized by the 95th-percentile activation value

exemplars | ResNet-152 MILAN | ResNet-152 CLIP-Dissect | ResNet-152 MAIA | ResNet-152 Human | CLIP-RN50 MILAN | CLIP-RN50 CLIP-Dissect | CLIP-RN50 MAIA | CLIP-RN50 Human
+ | 1.51 | 1.55 | 1.63 | 2.34 | 1.53 | 1.89 | 1.9 | 1.4
− | 0.83 | 0.75 | 0.78 | 0.63 | 0.77 | 0.77 | 0.76 | 0.75

C2. Ablation studies

We use the subset of 25% of neurons labeled by human experts to perform the ablation studies. Results of the predictive evaluation procedure described in Section 4 are shown below. Using DALL-E 3 improves performance over SD-v1.5.

Table A5. Numerical data for the ablations in Figure 7.

Model | | ImageNet | SD-v1.5 | DALL-E 3
ResNet-152 | + | 3.53 | 3.56 | 4.64
ResNet-152 | − | 1.54 | 1.33 | 1.53
DINO-ViT | + | 1.48 | 1.98 | 1.88
DINO-ViT | − | -0.37 | -0.23 | -0.27
CLIP-RN50 | + | 2.34 | 3.62 | 4.34
CLIP-RN50 | − | 1.90 | 1.75 | 1.90

C3. Human expert neuron description using the MAIA tool library

Figure A11. Example interface for humans interpreting neurons with the same tool library used by MAIA.

We recruited 8 human interpretability researchers to use the MAIA API to run experiments on neurons in order to describe their behavior. This data collection effort was approved by MIT's Committee on the Use of Humans as Experimental Subjects.
Humans received task specification via the MAIAuser prompt, wrote programs using the functions inside theMAIAAPI, and produced neuron descriptions in the same format as MAIA. All human subjects had knowledge of Python. Humans labeled 25% of the units in each layer labeled byMAIA(one human label per neuron). Testing was administered via JupyterLab (Kluyver et al., 2016), as dis- played in Figure A11. Humans also labeled 25% of the synthetic neurons using the same workflow. The median number of interac- tions per neuron for humans was 7. However, for more difficult neurons the number of interactions were as high as 39. C4. Synthetic neurons To provide a ground truth against which to test MAIA, we constructed a set of synthetic neurons that reflect the diverse response profiles of neurons in the wild. We used three categories of synthetic neurons with varying levels of complexity: monosemanticneurons that respond to single concepts,polysemanticneurons that respond to logical disjunctions of concepts, andconditionalneurons that respond to one concept conditional on the presence of another. The full set of synthetic neurons across all categories is described in Table A6. To capture real-world neuron behaviors, concepts are drawn from MILANNOTATIONS, a dataset of 60K human annotations of prototypical neuron behaviors (Hernandez et al., 2022). Synthetic neurons are constructed using Grounded DINO (Liu et al., 2023) in combination with SAM (Kirillov et al., 2023). Specifically, Grounded-DINO implements open-vocabulary object detection by generating image bounding boxes corresponding to an input text prompt. These bounding boxes are then fed into SAM as a soft prompt, indicating which part of the image to segment. To ensure the textual relevance of the bounding box, we set a threshold to filter out bounding boxes that do not correspond to the input prompt, using similarity scores which are also returned as synthetic neuron activation values. We use the default thresholds of 0.3 for bounding box accuracy and 0.25 for text similarity matching, as recommended in (Liu et al., 2023). After the final segmentation maps are generated, per-object masks are combined and dilated to resemble outputs of neurons inside trained vision models, instrumented viaMAIA’sSystemclass. We also implement compound synthetic neurons that mimic polysemantic neurons found in the wild (via logical disjunction), and neurons that respond to complex combinations of concepts (via logical conjunction). To implement polysemantic neurons (e.g.selective forAORB), we check if at least one concept is present in the input image (if both are present, we merge segmentation maps across concepts and return the mean of the two activation values). To implement conditional neurons (e.g.selective forA| B ), we check ifAis present, and if the condition is met (Bis present) we return the mask and activation value corresponding to conceptA. The set of concepts that can be represented by synthetic neurons is limited by the fidelity of open-set concept detection using current text-guided segmentation methods. We manually verify that all concepts in the synthetic neuron dataset can be consistently segmented by Grounded DINO in combination with SAM. There are some types of neuron behavior, however, that cannot be captured using current text-guided segmentation methods. 
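Before expanding on those limitations, a minimal sketch of the synthetic-neuron activation logic described above may help make the construction concrete. It is written under the assumption of a hypothetical `ground_and_segment` helper (passed below as `seg`) that wraps Grounded DINO + SAM and returns a text-similarity score together with a segmentation mask, or no mask when detection falls below the 0.3 / 0.25 thresholds; the mask dilation and Base64 encoding performed by the real System wrapper are omitted. This is an illustrative sketch, not the authors' implementation.

```python
# Sketch of monosemantic / polysemantic / conditional synthetic neurons.
# `seg(image, concept)` is a hypothetical Grounded DINO + SAM wrapper that is assumed to
# return (text_similarity_score, mask) or (0.0, None) when the concept is not detected.
from typing import Callable, Optional, Tuple
import numpy as np

Segmenter = Callable[[np.ndarray, str], Tuple[float, Optional[np.ndarray]]]

def monosemantic(image: np.ndarray, concept: str, seg: Segmenter):
    """Activation is the detection score for a single concept; mask is its segmentation."""
    return seg(image, concept)

def polysemantic(image: np.ndarray, concept_a: str, concept_b: str, seg: Segmenter):
    """Responds to A OR B; when both are present, masks are merged and activations averaged."""
    score_a, mask_a = seg(image, concept_a)
    score_b, mask_b = seg(image, concept_b)
    if mask_a is not None and mask_b is not None:
        return (score_a + score_b) / 2, np.logical_or(mask_a, mask_b)
    if mask_a is not None:
        return score_a, mask_a
    return score_b, mask_b  # may be (0.0, None) if neither concept is detected

def conditional(image: np.ndarray, concept_a: str, concept_b: str, seg: Segmenter):
    """Responds to A | B: returns A's mask and activation only when B is also present."""
    score_a, mask_a = seg(image, concept_a)
    _, mask_b = seg(image, concept_b)
    if mask_a is not None and mask_b is not None:
        return score_a, mask_a
    return 0.0, None
```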
Some neurons inside trained vision models implement low-level procedures (e.g. edge detection), or higher-level perceptual similarity detection (e.g. sensitivity to radial wheel-and-spoke patterns common to flower petals and bicycle tires), that Grounded DINO + SAM cannot detect. Future implementations could explore whether an end-to-end, single-model open-vocabulary segmentation system, such as Segment Everything Everywhere All at Once (Zou et al., 2023), could perform segmentation for richer neuron labels.

Table A6. Synthetic neurons. Concepts are drawn from MILANNOTATIONS.

Monosemantic: arch; bird; blue; boat; brick; bridge; bug; building; button; car window; circle; dog; eyes; feathers; flame; frog; grass; hair; hands; handle; hat; jeans; jigsaw; legs; light; neck; orange; paws; pencil; pizza; roof; shirt; shoes; sky; snake; spiral; stripes; sunglasses; tail; text; tires; tractor; vehicle; wing; yarn

Polysemantic (A OR B): animal, door; animal, ship; baby, dog; bird, dog; blue, yellow; bookshelf, building; cup, road; dog, car; dog, horse; dog, instrument; fire, fur; firework, whisker; hand, ear; necklace, flower; people, building; people, wood; red, purple; shoe, boat; sink, pool; skirt, water; stairs, fruit; temple, playground; truck, train; window, wheel

Conditional (A | B): ball, hand; beach, people; bird, tree; bridge, sky; building, sky; cup, handle; dog, leash; fence, animal; fish, water; grass, dog; horse, grass; instrument, hand; skyline, water; sky, bird; snow, road; suit, tie; tent, mountain; water, blue; wheel, racecar

Evaluation of synthetic neuron labels using human judges. This data collection effort was approved by MIT's Committee on the Use of Humans as Experimental Subjects. To control for quality, workers were required to have a HIT acceptance rate of at least 99%, be based in the U.S., and have at least 10,000 approved HITs. Workers were paid $0.10 per annotation. 10 human judges performed each comparison task.

D. Failure modes

D1. Tool Failures

Figure A12. Example of MAIA having confirmation bias towards a single generated example, instead of generating further experiments to test other possibilities.

MAIA is often constrained by the capabilities of its tools. As shown in Figure 8, the Instruct-Pix2Pix (Brooks et al., 2022) and Stable Diffusion (Rombach et al., 2022b) models sometimes fail to follow the precise instructions in the caption. Instruct-Pix2Pix typically has trouble making changes which are relative to objects within the image, and also fails to make changes that are unusual (such as the example of replacing a person with a vase). Stable Diffusion typically has difficulty assigning attributes in the caption to the correct parts of the image. These errors in image editing and generation sometimes confuse MAIA and cause it to make the wrong prediction.

D2. Confirmation Bias

In some scenarios, when MAIA generates an image that has a higher activation than the dataset exemplars, it will assume that the neuron behaves according to that single exemplar. Instead of conducting additional experiments to see if there may be a more general label, MAIA sometimes stops experimenting and outputs a final label that is specific to that one image. For instance, in Figure A12, MAIA generates one underwater image that attains a higher activation and outputs an overly specific description without doing any additional testing.

D3. Failure modes of non-proprietary VLMs as alternative MAIA backbones

We tested the performance of two models as the MAIA backbone VLM: 1.
LLaVA-Next (Liu et al., 2024), a top-performing (improved over LLaVA-1.5) open source VLM built from LLaMA. 2. Gemini 1.0 Pro (Anil et al., 2023), which Google currently provides for free with a low rate limit (60 RPM). When used as the MAIA backbone for the neuron inter- pretation task, both models show initial potential: they understand the task, use the System class and the Tools class from MAIA’s API, and write code specifying in- terpretation experiments. However, while both models show promise, we found significant shortcomings com- pared to GPT-4V which limit their current usefulness as backbones of MAIA. Some of these shortcomings we observed in both models: Weaker hypothesis generation.Key to MAIA is the 24 A Multimodal Automated Interpretability Agent ability of the backbone VLM to make and update hypotheses in light of experimental findings. After receiving image outputs of experiments, both LLaVA-Next and Gemini are biased toward describing the images rather than updating their hypothesis about system behavior (suggesting both were trained for the task of image captioning, which might be solvable with fine-tuning). Hallucination.We observed that both models often output hypotheses unrelated to experimental outcomes, and sometimes hallucinate experimental results. For example, when Gemini got the activation value and masked image for the prompt“a dog standing on the grass”it replied with:a cat laying on the grass, activation: 14.21, which is an image never generated, and a hallucinated activation value. Overfitting to the examples of the tool usage presented in the MAIA API.Rather than designing new experiments, both models often reproduce code that appears in the MAIA API usage examples. For example, both Llava and Gemini will generate images of“a dog standing on the grass”which is provided as an example for an input prompt to the text2image tool. Gemini did show some advantages over LLaVA:while Gemini occasionally produced some syntax errors (e.g.calling system.datasetexemplarsinstead oftools.datasetexemplars), Llava code was often not executable. Fur- thermore, LLaVa is restricted to getting only one input image at each interaction, which severely restricts its experimentation ability. Even when inputting several images in a row, the model is biased toward analyzing the last one. E. Utility ofMAIAdescriptions for human users We run additional crowd-sourcing experiments to quantify the extent to which neuron descriptions equip humans to predict system behavior. Given a text description of a neuron (produced either byMILAN,MAIAor human experts), human participants predicted the neuron’s activations on images from the ImageNet validation set. In this setting, a correct description would inform more accurate predictions of behavior. Specifically, participants saw a language description of a neuron (e.g. “this neuron is selective for human hands interacting with weightlifting equipment”) and four sets of images. One of those sets contains images that strongly activate the neuron described in text, and the other 3 contain randomly sampled distractors (capturing “baseline activity”). Humans are asked to use the text description to select the strongly activating set of images (chance is 25%). Two notes on experimental design: (i) We used sets of images (4 images each) in the multiple choice task instead of single images to provide better coverage of concept space (some text labels describe more concepts than a single image could show). 
(ii) In this task, humans distinguish strongly activating images from weakly activating images (like the automated evaluation in the main paper), instead of ordering images along a continuum by how strongly they activate the neuron. This is because the explanations produced by the interpretability methods only describe concepts that strongly activate the neuron, and do not provide enough information for human observers to characterize the full distribution of the neuron's activity.

The table below shows results from this experiment and 95% confidence intervals across interpretation procedures for a given model. For each model we evaluated the same subset of 25 neurons as in Section 4. 10 human participants performed each task (using the same crowdworker selection criteria as in Section 4.2). MAIA descriptions are more useful for predicting neuron behavior than MILAN descriptions in all three models studied.

Table A7. Subject predictions of maximally activating images based on neuron descriptions.

Model | MILAN | MAIA | Human
ResNet-152 | 66.0±0.06 | 78.5±0.05 | 85.45±0.04
CLIP-RN50 | 57.19±0.06 | 65.21±0.06 | 69.6±0.06
DINO-ViT | 48.0±0.06 | 67.39±0.06 | 81.2±0.05

F. Spurious feature removal experiment

F1. Dataset Selection

We use the Spawrious dataset as it provides a more complex classification task than simpler binary spurious classification benchmarks like Waterbirds (Wah et al., 2011; Sagawa et al., 2020) and CelebA (Liu et al., 2015; Sagawa et al., 2020). All images in the dataset are generated with Stable Diffusion v1.4 (Rombach et al., 2022b), which is distinct from the Stable Diffusion v1.5 model in MAIA's tool API. See Lynch et al. (2023) for further specific details on dataset construction.

F2. Experiment Details

Here, we describe the experiment details for each row of Table 2. Note that for all the logistic regression models that we train, we standardize the input features to have zero mean and unit variance. We use the saga solver from sklearn.linear_model.LogisticRegression for the ℓ1-regularized regressions and the lbfgs solver for the unregularized regressions (Pedregosa et al., 2011).

All, Original Model, Unbalanced: We train a ResNet-18 model (He et al., 2016) for one epoch on the O2O-Easy dataset from Spawrious using a learning rate of 1e-4, a weight decay of 1e-4, and a dropout of 0.1 on the final layer. We use a 90-10 split to get a training set of size 22810 and a validation set of size 2534.

ℓ1 Top 50, All, Unbalanced: We tune the ℓ1 regularization parameter on the full unbalanced validation set such that there are 50 neurons with non-zero weights, and we extract the corresponding neuron indices. We then directly evaluate the performance of the logistic regression model on the test set.

ℓ1 Top 50, Random, Unbalanced: To match MAIA's sparsity level, we extract 100 sets of 22 random neuron indices and perform unregularized logistic regression on the unbalanced validation set.

ℓ1 Top 50, ℓ1 Top 22, Unbalanced: We also use ℓ1 regression to match MAIA's sparsity in a more principled manner, tuning the ℓ1 parameter until there are only 22 neurons with non-zero weights. We then directly evaluate the performance of the regularized logistic regression model on the test set.

ℓ1 Top 50, MILAN, Unbalanced: We use MILAN to caption all 50 neurons. Since MILAN was not trained to annotate neurons as selective or spurious, we manually select all neurons with captions related to dogs (and not to their backgrounds). After filtering, this set contains 23 neurons.
We then perform unregularized logistic regression with this neuron subset on the unbalanced validation set. ℓ 1 Top 50,MILAN(GPT-4V), Unbalanced:We repeat the same process as inℓ 1 Top 50,MILAN, Unbalanced, but this Figure A13.Two differentMAIAinteractions, classifying neurons as selective (left) and spurious (right). 26 A Multimodal Automated Interpretability Agent time we use GPT-4V to annotate maximally activating exemplars instead of the originalMILANmodel. This subset contains 23 neurons. We then perform unregularized logistic regression with this neuron subset on the unbalanced validation set. ℓ 1 Top 50,MAIA, Unbalanced:We runMAIAon each of the 50 neurons separately, and it ultimately selects 22 out of the 50 neurons. We then perform unregularized logistic regression with this neuron subset on the unbalanced validation set. We use a modified user prompt which we include in Section F4. Next, for the balanced validation fine-tuning experiments, we sample ten balanced validation sets of size 320 and report the mean performances of each method. While Kirichenko et al. (2023) uses multiple subsampled balanced validation sets for fine-tuning and then aggregates the models for scoring on the test set, we only allow the following experiments to see a singlebalanced validation set since we seek to compare the performance ofMAIAto methods which have access to a small balanced fine-tuning dataset, rather than spurious feature labels for every data sample in a large validation set. All,ℓ 1 Tuning, Balanced:We perform a hyperparameter search for theℓ 1 parameter, evaluating each hyperparameter value with five 50-50 splits of the balanced validation dataset, training on one half and evaluating on the other half. We then train the final model with the bestℓ 1 parameter on the entire balanced validation dataset. For the number of neurons, we report the median number of neurons with non-zero weights across the ten trials. All,ℓ 1 Top 22, Balanced:We also investigate the performance when we matchMAIA’s chosen neuron sparsity level by tuning theℓ 1 parameter to only have 22 neurons with non-zero weights. We train the model directly on the balanced validation dataset. F3. ExampleMAIANeuron Robustness Interactions In Figure A13, we show examples ofMAIA’s interaction with neurons in the final layer of the Resnet-18 model trained on Spawrious. Overall, we find thatMAIAwill often test all four dog breeds across different environments, and if it finds that the exemplars all have the same background, it will test to see if the background is a spurious feature. F4.MAIAPrompting Details The user prompt toMAIAspecifying the spurious neuron identification task is similar to the prompt specifying the neuron description task. It is shown in full below. We only specify toMAIAthe domain of images the model was trained on and the definitions of spurious and selective neurons so that it can output a final classification using experiments that will not generate significantly out-of-distribution images (such as unrelated dog breeds or close-up poses not present in the training data). We do not ever specify what the spurious correlations are. 27 A Multimodal Automated Interpretability Agent You are analyzing the prototypical behavior of individual neurons inside a deep neural network which classifies the breed of a dog in a natural image as one of the following breeds that were in its training dataset: 1. Labrador 2. Welsh Corgi 3. Bulldog 4. 
Dachshund Your overall task is to classify the neuron as SELECTIVE (if it is selective for one and only one dog breed) or SPURIOUS (if it is not). Conduct experiments until you meet the following criteria for SELECTIVE or SPURIOUS. SELECTIVE Criteria: 1. The neuron only activates for one single breed of dog. It does not activate for any other breeds of dog, even at lower magnitudes of activation. We consider a neuron to be activated if it returns a highlighted image region with a red outline. Observe the presence or absence of a mask, and the magnitude of an activation value, to determine whether a neuron is active. 2. The neuron is active for that single dog breed robustly across a variety of contexts. However, keep in mind that the neuron has only seen images of full-body dogs in the following environments: beach, desert, muddy park, jungle, and snow. Do not generate images that are very dissimilar to what the neural network has seen before (eg. do not generate close-ups, do not generate humans). A selective neuron does not need to activate for these dissimilar images. SPURIOUS Criteria: 1. The neuron activates for mutiple breeds of dogs. The amount the neuron activates for each breed of dog may vary but the neuron is considered spurious if it activates enough for multiple breeds to return a highlighted image region with a red outline. 2. The neuron may activate for a single breed of dog, but only does so in specific contexts. If the neuron's activation is dependent on non-dog related concepts, it is considered spurious. To complete your task, you are provided with a library of Python functions to run experiments on the specific neuron (inside the "System" class) given the functions provided in the "Tools" class. Make sure to use a variety of tools from the library to maximize your experimentation power. Some neurons might be selective for very specific concepts, a group of unrelated concepts, or a general concept, so try to be creative in your experiment and try to test both general and specific concepts. If a neuron is selective for multiple concepts, you should describe each of those concepts in your final description. At each experiment step, write Python code that will conduct your experiment on the tested neuron , using the following format: [CODE]: ```python def execute_command(system, tools) # gets an object of the system class, an object of the tool class, and performs experiments on the neuron with the tools ... tools.save_experiment_log(...) ``` Finish each experiment by documenting it by calling the "save_experiment_log" function. Do not include any additional implementation other than this function. Do not call "execute_command" after defining it. Include only a single instance of experiment implementation at each step. Each time you get the output of the neuron, try to summarize what inputs that activate the neuron have in common (where that description is not influenced by previous hypotheses), and make a hypothesis regarding whether the neuron is SELECTIVE (activates strongly for only one dog breed across contexts) or SPURIOUS (activates for more than one dog breed or non-dog related features). After each experiment, wait to observe the outputs of the neuron. Then your goal is to draw conclusions from the data, update your hypothesis, and write additional experiments to test it. Test the effects of concepts like different dog breeds or background features using the different tools in the library. 
If you are unsure about the results of the previous experiment you can also rerun it, or rerun a modified version of it with additional tools. Use the following format: [HYPOTHESIS]: ## update your hypothesis according to the image content and related activation values. Only update your hypotheses if image activation values are higher than previous experiments. [CODE]: ##Test different and specific aspects of your hypothesis using all of the tools in the library. Write code to run the experiment in the same format provided above. Include only a single instance of experiment implementation. Continue running experiments until you meet one of the following stopping criteria. SPURIOUS: If you find multiple pieces of evidence which show that the neuron does not only activate for one breed or activates for non-dog related concepts you should deem the neuron to be spurious, even if you are not entirely sure what the neuron activates for. Remember that we consider a neuron to have activation if it returns a highlighted image region with a red outline. If you see this happen for a feature that is not dog related or for multiple breeds you should deem the neuron to be spurious. SELECTIVE: The neuron is selective for a single dog breed and activates strongly for that breed across contexts. If you find any evidence suggesting that a neuron is spurious (such as an image outside one particular breed showing activations), you should conduct more experiments to test your hypotheses. If you are ever unsure about the result, you should lean towards outputting SPURIOUS. The neuron must be perfectly selective across many variations of dog breeds and contexts to be considered to be selective. Once you have met one of the stopping criteria, output your final classification of the neuron in the following format: [REASONING]: <why the neuron is SELECTIVE or SPURIOUS, and if SELECTIVE, the breed it is SELECTIVE for> [LABEL]: <SELECTIVE/SPURIOUS> 28 A Multimodal Automated Interpretability Agent G. Bias identification experiment The user prompt toMAIAspecifying the bias identification task is similar to the prompt specifying the neuron description task. We produce it in full below. Your overall task is to describe the visual concepts that maximally activate a neuron inside a deep network for computer vision. You will get an access to a neuron inside a deep neural network that is sensitive to the presence of class_label inside the input image. Your overall task is to reveal biases in the neuron's behavior. That is, your task is to find specific contexts or settings in which the neuron tends to have relatively higher activation values, but only covers partial asspect of the class_label class, or specific scenarios when the neuron tends to have relatively lower activations although class_label is still present. Try to look for different type of biases (e.g. gender, ethnicity, context-dependencies, breed-dependencies, etc.). To do that you are provided with a library of Python functions to run experiments on the specific neuron ( inside the "System" class) given the functions provided in the "Tools" class. Make sure to use a variety of tools from the library to maximize your experimentation power. Some neurons might be selective for very specific concepts, a group of unrelated concepts, or a general concept, so try to be creative in your experiment and try to test both general and specific concepts. If a neuron is selective for multiple concepts, you should describe each of those concepts in your final description. 
At each experiment step, write Python code that will conduct your experiment on the neuron, using the following format: [CODE]: ```python def execute_command(system, tools) # gets an object of the system class, an object of the tool class, and performs experiments on the neuron with the tools ... tools.save_experiment_log(...) ``` Finish each experiment by documenting it by calling the "save_experiment_log" function. Do not include any additional implementation other than this function. Do not call "execute_command" after defining it. Include only a single instance of experiment implementation at each step. Each time you get the output of the neuron, try to summarize what inputs that activate the neuron have in common (where that description is not influenced by previous hypotheses). Then, write multiple hypotheses that could explain the biases of the neuron. For example, these hypotheses could list multiple context that the neuron is less selective for. Then write a list of initial hypotheses about the neuron biases in the format: [HYPOTHESIS LIST]: Hypothesis_1: <hypothesis_1> ... Hypothesis_n: <hypothesis_n>. After each experiment, wait to observe the outputs of the neuron. Then your goal is to draw conclusions from the data, update your list of hypotheses, and write additional experiments to test them. Test the effects of both local and global differences in images using the different tools in the library. If you are unsure about the results of the previous experiment you can also rerun it, or rerun a modified version of it with additional tools. Use the following format: [HYPOTHESIS LIST]: ## update your hypothesis list according to the image content and related activation values. Only update your hypotheses if image activation values are higher than previous experiments. [CODE]: ## conduct additional experiments using the provided python library to test * ALL * the hypotheses. Test different and specific aspects of each hypothesis using all of the tools in the library. Write code to run the experiment in the same format provided above. Include only a single instance of experiment implementation. Continue running experiments until you prove or disprove all of your hypotheses. Only when you are confident in your hypothesis after proving it in multiple experiments, output your final description of the neuron in the following format: [BIAS]: <final description of the neuron bias> 29
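The prompts reproduced above (Sections B, F4, and G) all share the same interaction contract: the backbone VLM emits a single [CODE] block defining run_experiment (or execute_command), the framework executes that code against the System and Tools objects, and the logged outputs are returned to the VLM for the next round, until a final [DESCRIPTION]/[LABEL]/[BIAS] tag is produced. The fragment below is a minimal, hypothetical sketch of such a driver round, not the authors' implementation; the surrounding query to the backbone model (e.g. GPT-4V) and any sandboxing of the executed code are assumptions left out of the sketch.

```python
# Sketch of one agent round implied by the prompts above: extract the [CODE] block from the
# VLM reply, execute it against the System/Tools instances, and return the logged result.
import re

CODE_BLOCK = re.compile(r"```python(.*?)```", re.DOTALL)

def run_agent_round(vlm_reply: str, system, tools) -> str:
    """Execute the experiment proposed in a VLM reply and return a textual result for the next turn."""
    match = CODE_BLOCK.search(vlm_reply)
    if match is None:
        return "No [CODE] block found; ask the agent to rewrite the experiment."
    namespace = {}
    exec(match.group(1), namespace)  # defines run_experiment or execute_command (untrusted code!)
    experiment = namespace.get("run_experiment") or namespace.get("execute_command")
    if experiment is None:
        return "The code block must define run_experiment or execute_command."
    experiment(system, tools)  # the experiment logs its own results via tools.log_experiment
    return str(tools.experiment_log[-1]) if tools.experiment_log else "Nothing was logged."

# A hypothetical outer loop would alternate querying the backbone VLM with run_agent_round(...)
# until the reply contains a final [DESCRIPTION], [LABEL], or [BIAS] tag.
```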