Paper deep dive

Insights into a radiology-specialised multimodal large language model with sparse autoencoders

Kenza Bouzid, Shruthi Bannur, Felix Meissen, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland

Year: 2025 · Venue: Actionable Interpretability Workshop at ICML 2025 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 99

Models: MAIRA-2

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 5:21:24 PM

Summary

This paper investigates the mechanistic interpretability of the radiology-specialized multimodal large language model (MLLM) MAIRA-2 by applying Matryoshka-SAEs to its internal representations. The authors identify clinically relevant features—such as medical devices and pathologies—and demonstrate the feasibility of steering model generations, while highlighting methodological challenges in automated interpretability and steering precision.

Entities (4)

MAIRA-2 · multimodal-large-language-model · 100%
MIMIC-CXR · dataset · 100%
Matryoshka-SAE · sparse-autoencoder · 100%
GPT-4o · large-language-model · 95%

Relation Signals (3)

Matryoshka-SAE → interprets → MAIRA-2

confidence 100% · In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations.

GPT-4o → performsautomatedinterpretabilityon → Matryoshka-SAE

confidence 95% · We then query an LLM (GPT-4o version 2024-11-20) to produce an interpretation of the observed activation patterns.

MAIRA-2 → trainedon → MIMIC-CXR

confidence 90% · We used MIMIC-CXR (Johnson et al., 2019), a subset of MAIRA-2’s training data

Cypher Suggestions (2)

Identify datasets used for training specific models · confidence 95% · unvalidated

MATCH (m:Model)-[:TRAINED_ON]->(d:Dataset) RETURN m.name, d.name

Find all models interpreted by a specific SAE architecture · confidence 90% · unvalidated

MATCH (s:SAE)-[:INTERPRETS]->(m:Model) WHERE s.name = 'Matryoshka-SAE' RETURN m.name

Abstract

Abstract: Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts, including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2, marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: https://huggingface.co/microsoft/maira-2-sae.


Full Text

98,794 characters extracted from source content.


Insights into a radiology-specialised multimodal large language model with sparse autoencoders

Kenza Bouzid, Shruthi Bannur, Felix Meissen, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland

Abstract

Interpretability can improve the safety, transparency and trust of artificial intelligence (AI) models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts, including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2, marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: https://huggingface.co/microsoft/maira-2-sae.

Affiliation (all authors): Microsoft Research, Health Futures, Cambridge, United Kingdom. Correspondence to: Kenza Bouzid <kenza.bouzid@microsoft.com>, Stephanie Hyland <stephanie.hyland@microsoft.com>.
Workshop on Actionable Interpretability (AIW 2025) at the 42nd International Conference on Machine Learning, Vancouver, Canada. Copyright 2025 by the author(s).

1. Introduction

Recent advancements in automated draft radiology reporting (Bannur et al., 2024; Zhou et al., 2024; Yang et al., 2024; Tu et al., 2024; Chen et al., 2024; Hyland et al., 2023; Wang et al., 2023; Li et al., 2023b) raise the potential of AI systems to reduce radiologist workloads and improve operational efficiency (Yildirim et al., 2024). However, despite strong performance on benchmarks, our understanding of which concepts these models have learned, and how they use them in the report generation process, remains limited. This lack of interpretability poses challenges in high-stakes domains like healthcare, where trust, transparency, and safety are critical (Quinn et al., 2021).

Mechanistic interpretability aims to address some of these challenges by reverse-engineering the internal computations of neural networks (Elhage et al., 2021; Wang et al., 2022). In this context, SAEs have emerged as a promising tool for inspecting model representations, particularly for large language models (LLMs) (Bricken et al., 2023; Cunningham et al., 2024). By mapping dense model activation vectors to a larger, sparse latent space, SAEs are argued to disentangle human-interpretable ‘monosemantic’ features, which may then be labelled at scale using LLMs, a process known as automated interpretability. This approach has enabled insights into model behaviour across various domains, including language (Templeton et al., 2024; Gao et al., 2024; Cunningham et al., 2024; Lieberum et al., 2024; He et al., 2024), vision (Stevens et al., 2025), and proteins (Adams et al., 2025). Given features of interest, we can intervene on the model’s internal representations to attempt to control its generations according to them (Templeton et al., 2024).
This both provides a way to validate the discovered concepts and opens up possibilities for fine-grained control (Stevens et al., 2025; Adams et al., 2025; Zhang et al., 2024). In the context of radiology report generation, such steering, if effective, could be particularly beneficial, for example by guiding the model to omit pathologies for which its performance is unreliable, encouraging more detailed descriptions of medical devices, or preventing the generation of operational text likely to be fabricated (such as notes about physician communication).

arXiv:2507.12950v2 [cs.LG] 18 Jul 2025

In this work, we explore the feasibility of applying SAEs to radiology-focused multimodal large language models (MLLMs). Specifically, we apply Matryoshka-SAE (Bussmann et al., 2025) to MAIRA-2 (Bannur et al., 2024), one of the most capable publicly available models for grounded and non-grounded chest X-ray (CXR) reporting to date. Leveraging large-scale, LLM-based automated interpretability and scoring methods (Paulo et al., 2024), we identify a subset of the internal concepts learned by MAIRA-2, amongst many features that remain uninterpretable. These concepts include classical chest X-ray abnormalities (e.g., atelectasis, cardiomegaly, pleural effusion), as well as more diverse findings like scoliosis, aortic tortuosity, medical devices, and temporal changes. We use these discovered concepts in steering experiments, observing mixed success across different scenarios. Overall, our findings offer early insights into the internal representations of MAIRA-2, marking a step toward greater interpretability.
At the same time, this study highlights several practical and methodological challenges across the pipeline, from SAE training and automated concept discovery to steering, underscoring the complexity of applying mechanistic interpretability methods to models in specialized domains. To foster further research in this direction, we publicly release the trained SAE checkpoints, along with the automated feature interpretations (footnote 1).

Footnote 1: https://huggingface.co/microsoft/maira-2-sae

2. Related work

SAE-based interpretability on language models. Templeton et al. (2024) and Gao et al. (2024) performed automated interpretability studies on large proprietary models, which popularised the use of SAEs for this task. Cunningham et al. (2024) quantitatively benchmarked the automatic interpretability success of SAEs against classic unsupervised methods. Other large-scale interpretability efforts (Lieberum et al., 2024; He et al., 2024) have focused on large public models, training suites of SAEs across all layers with a range of hyperparameters and publishing their weights and corresponding auto-interpretations, enabling researchers to build on these often-expensive analyses.

SAEs on multimodal language models. Zhang et al. (2024) used SAEs to analyse visual features in a general-domain multimodal model with architecture similar to MAIRA-2, employing a larger multimodal model to automatically explain the learned features. The SAE from Templeton et al. (2024) was trained on language data only; certain features were then shown to also respond to related visual inputs. Lou et al. (2025) investigated the potential of SAE features to guide data selection for improving modality alignment. SAEs were used by Pach et al. (2025) to study representations from a vision encoder and then steer a downstream language model. Other classic sparse dictionary learning techniques have also been employed to interpret multimodal language models (Parekh et al., 2024).
SAEs in biomedical applications. Le et al. (2024) fitted an SAE to a pathology image foundation model, and qualitatively validated several interpretable histological concepts. Abdulaal et al. (2024) applied SAEs to a chest X-ray image encoder, then used automatically generated descriptions of SAE features to compose radiology reports. Simon & Zou (2024) and Adams et al. (2025) have also applied SAEs to interpretation of protein language models, with the former additionally demonstrating steering of protein generation based on interpretable SAE features.

3. Materials and methods

In this section, we describe each element of the experimental pipeline, including the base model and dataset under study, the extraction of internal representations, the SAE used for feature discovery, the procedure for automatically labelling features, and how the feature steering was performed and evaluated. An overview is illustrated in Figure 1.

3.1. MAIRA-2 model

MAIRA-2 (Bannur et al., 2024) is a multimodal language model trained for generating the findings section of radiology reports with or without spatial grounding via bounding boxes. In addition to a frontal CXR image, MAIRA-2 can also optionally ingest: a lateral-view image; indication, technique, and comparison text sections of the current study; and image and report from a prior study of the same patient. This makes it particularly interesting to study since it aligns more closely with real-world clinical workflows than simpler models. Its architecture is inspired by LLaVA (Liu et al., 2023). Each input image is fed through a radiology-specific encoder (Pérez-García et al., 2025) and a multilayer perceptron adapter to produce 1,369 visual token embeddings. These are interleaved with text tokens following a custom prompt template (see example in Table A.1) and given to a 32-layer, 7B-parameter language decoder initialised from Vicuna v1.5 (Chiang et al., 2023).
The adapter and language model were trained on a multi-task mix of CXR datasets totalling 511k multimodal samples from 226k unique patients.

3.2. Source dataset

We used MIMIC-CXR (Johnson et al., 2019), a subset of MAIRA-2’s training data, to train SAEs. This dataset consists of paired de-identified chest radiographs and free-text reports. Focusing on the findings generation task, we used 158,555 training and 7,906 validation samples, following the same splits and preprocessing as Bannur et al. (2024). We leveraged MAIRA-2’s extended input context capabilities to incorporate all available inputs for a given sample.

Figure 1. Illustration of the study. A: We train a Matryoshka-SAE using the output of the residual stream of MAIRA-2 at layer 15. Multimodal tokenization includes obtaining image tokens from an image encoder. We filter out (grey) token indices corresponding to intermediate image tokens and boilerplate parts of the prompt. B: We conduct automated interpretability with LLMs for both deriving interpretations and scoring interpretations, using detection scoring. C: We try to steer the generations of MAIRA-2 according to selected features, by adding the corresponding SAE decoder vector to the residual stream at all token positions during each decoding step. Tokenization step omitted for brevity. We evaluate the effectiveness of steering using an LLM to judge on-target and off-target effects.

3.3. Extraction and filtering of token representations

We used NNsight (Fiotto-Kaufman et al., 2024) for extracting internal representations from the open-source MAIRA-2 model checkpoint (footnote 2). We took the hidden states of the residual stream at the output of the middle layer (#15).
This choice was motivated by prior work suggesting that abstract and semantically rich features are most prominent in the middle layers (Templeton et al., 2024; Lad et al., 2024). For each sample, we obtained a sequence of 4096-dimensional token representations, with an average of 3,358 ± 869 tokens per sample, reaching a maximum of 5,099 tokens. This included up to three images, each represented by 1,369 tokens, along with additional textual context and the ground-truth findings section.

We refined the extracted representations by removing boilerplate and templated subsequences present in the original prompt from Bannur et al. (2024). This type of filtering is commonly applied when analysing internal representations of instruction-tuned language models (Lieberum et al., 2024). We extended this approach to the multimodal setting by retaining only the final image token of each CXR, to avoid the need to interpret an incomplete image. This filtering procedure resulted in 34.7M token representations for training and 1.7M for validation, substantially reducing the number of tokens to consider. See Table A.1 for an example of token-based representation filtering. In what follows, we refer to these token representations as ‘samples’ as we use them to train and interpret the SAE.

3.4. Sparse autoencoder architecture

We apply techniques from sparse dictionary learning (Lee et al., 2006; Cunningham et al., 2024) to disentangle the dense representations of MAIRA-2 into potentially interpretable sparse features. Specifically, we employ Matryoshka-SAE (Bussmann et al., 2025). By simultaneously training multiple nested autoencoders of increasing size, Matryoshka-SAE is argued to reduce feature splitting and produce more interpretable features.
Formally, given a total dictionary size $m$ and an input $x \in \mathbb{R}^n$, the Matryoshka-SAE training objective encourages good reconstruction at multiple levels at the same time:

$$
\mathcal{L}(x) = \sum_{m_j \in \mathcal{M}} \Big\lVert x - \underbrace{\big(W^{\mathrm{dec}}_{1:m_j}\, f(x)_{1:m_j} + b^{\mathrm{dec}}\big)}_{\text{reconstruction using first } m_j \text{ features}} \Big\rVert_2^2 + \alpha \mathcal{L}_{\mathrm{aux}} \tag{1}
$$

$$
f(x) = \sigma\big(W^{\mathrm{enc}} x + b^{\mathrm{enc}}\big) \tag{2}
$$

where $1:m_j$ denotes taking the first $m_j$ elements, and $\mathcal{M}$ is the set of nested dictionaries of increasing sizes $m_1 < m_2 < \dots < m_{|\mathcal{M}|} = m$. $W^{\mathrm{enc}} \in \mathbb{R}^{m \times n}$ and $b^{\mathrm{enc}} \in \mathbb{R}^m$ are the encoder matrix and bias; $W^{\mathrm{dec}} \in \mathbb{R}^{n \times m}$ and $b^{\mathrm{dec}} \in \mathbb{R}^n$ are the decoder matrix and bias. Similar to Bussmann et al. (2025), we use BatchTopK (Bussmann et al., 2024) as activation function $\sigma$ (with slight abuse of notation), which builds on TopK (Gao et al., 2024) to enforce an average sparsity of $k$ active features across a batch of size $b$. Finally, $\mathcal{L}_{\mathrm{aux}}$ is the auxiliary loss suggested in Gao et al. (2024) to reduce dead neurons, further described in Appendix A.3. In what follows, $f_i$ refers to the $i$'th feature in the sparse latent space of the SAE, with $i \in [0, m-1]$.

Footnote 2: https://huggingface.co/microsoft/maira-2

3.5. SAE training

We relied on the open-source dictionary_learning repository (footnote 3) (Marks et al., 2024) to train the SAEs used in this study. We rescaled all token representations similarly to Gao et al. (2024), with a normalization factor of 22.34, computed from the full training set as the mean $\ell_2$ norm of the representations. We benchmarked several SAE architectures, including TopKSAE (Gao et al., 2024), BatchTopKSAE (Bussmann et al., 2024), and Matryoshka-SAE (Bussmann et al., 2025), and found that the latter achieved the best reconstruction performance in this context. The width of SAEs is commonly specified as a factor of the hidden dimensionality of the base model (4096 for MAIRA-2). We experimented with various expansion factors $ef \in \{2, 4, 8, 16\}$.
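As an illustration of Eqs. (1) and (2) and of how the expansion factor sets the dictionary size ($m = ef \times n$, so $4 \times 4096 = 16{,}384$ in the configuration used later), the following is a minimal NumPy sketch rather than the authors' training code. The function names `batch_topk` and `matryoshka_loss`, the toy dimensions, and the nesting schedule `nested_sizes` are assumptions made for illustration; the ReLU before the top-k selection is also an assumption, and the auxiliary term $\alpha \mathcal{L}_{\mathrm{aux}}$ is omitted.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """BatchTopK-style activation: keep about k * batch_size of the largest
    activations across the whole batch and zero out the rest.

    Assumption: a ReLU is applied first. Ties at the threshold can keep
    slightly more than k * batch_size entries.
    """
    acts = np.maximum(pre_acts, 0.0)
    n_keep = k * acts.shape[0]
    if n_keep >= acts.size:
        return acts
    # Value of the n_keep-th largest activation in the batch.
    threshold = np.partition(acts.ravel(), -n_keep)[-n_keep]
    return np.where(acts >= threshold, acts, 0.0)


def matryoshka_loss(x, W_enc, b_enc, W_dec, b_dec, nested_sizes, k):
    """Sum of reconstruction errors over nested dictionary prefixes
    (Eq. 1 without the auxiliary term), averaged over the batch.

    Shapes: x (batch, n), W_enc (m, n), b_enc (m,), W_dec (n, m), b_dec (n,).
    """
    f = batch_topk(x @ W_enc.T + b_enc, k)  # Eq. 2, shape (batch, m)
    total = 0.0
    for m_j in nested_sizes:
        # Reconstruct using only the first m_j features and decoder columns.
        recon = f[:, :m_j] @ W_dec[:, :m_j].T + b_dec
        total += float(np.mean(np.sum((x - recon) ** 2, axis=1)))
    return total


# Toy usage with hypothetical sizes (not the paper's n = 4096).
rng = np.random.default_rng(0)
n, ef, k = 8, 4, 2
m = ef * n  # dictionary size implied by the expansion factor
nested_sizes = [m // 8, m // 4, m // 2, m]  # hypothetical nesting schedule
W_enc = rng.normal(size=(m, n)) / np.sqrt(n)
W_dec = W_enc.T.copy()
x = rng.normal(size=(16, n))
loss = matryoshka_loss(x, W_enc, np.zeros(m), W_dec, np.zeros(n), nested_sizes, k)
```

Because every prefix of the dictionary has to reconstruct the input on its own, the earliest features are pushed towards general concepts and later ones towards refinements, which is the intuition behind the reduced feature splitting claimed for this architecture.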
Although $ef = 16$ achieved the best reconstruction performance, we selected $ef = 4$ as a trade-off between reconstruction quality, sparsity, and the computational cost of the downstream auto-interpretation pipeline that requires extensive LLM API calls. We also tested multiple values of the sparsity parameter $k \in \{32, 64, 128, 256\}$, with $k = 256$ producing the most favourable results in terms of reconstruction fidelity and the lowest number of dead features, consistent with the settings in Zhang et al. (2024). Unless stated otherwise, all downstream results are reported using Matryoshka-SAE with $ef = 4$ and $k = 256$, resulting in 16,384 features. A full list of hyperparameters is provided in Table A.2.

Footnote 3: https://github.com/saprmarks/dictionary_learning

3.6. Automated interpretation of SAE features

To identify which SAE features correspond to human-interpretable concepts, we follow prior work in interpreting features at scale with an automated, LLM-based pipeline (Paulo et al., 2024; Bricken et al., 2023; Bills et al., 2023). For each feature, we randomly collect samples that do not activate it ($f_i(x) = 0$) and an equal number with activation strength in the top observed decile. We found this sampling strategy superior to the fully stratified sampling done in earlier work (Paulo et al., 2024). We then query an LLM (GPT-4o version 2024-11-20) to produce an interpretation of the observed activation patterns. Although MAIRA-2 consumes interleaved images and text, we provide our interpretation LLM with text inputs only, because the reliability of the generalist GPT-4* series on CXR image understanding has been questioned (Yan et al., 2024; Jiang et al., 2024).

We score the quality of these interpretations following the detection approach from Paulo et al. (2024), which scores a (textual) feature interpretation by how well it can be used to predict whether that feature will be active on a new sample, again using an LLM. Where possible, we balance the number of positive and negative samples (footnote 4) and report detection $F_1$ on the evaluation set as a measure of feature interpretability. For scoring we sample from the top quintile and again use non-activating samples as the balanced ‘negatives’.

We use 50 data examples per feature to derive an interpretation, and 200 (non-overlapping) examples to score interpretation quality. We source these examples from a set of 500,000 randomly selected samples from the dataset used to train the SAE. We generate and score interpretations for every feature in the SAE for which it is possible to collect samples as prescribed above, resulting in interpretations for 99.5% (16,299) of the 16,384 features. Further details on automated interpretation, including LLM prompts, are in Appendices C.2 and C.3.

3.7. Steering

SAE features can be used to manipulate a model’s output towards or away from desired concepts in a process called steering (Templeton et al., 2024; Adams et al., 2025; Zhang et al., 2024). To steer model generations using a single SAE feature $f_i$, we extract the corresponding column $W^{\mathrm{dec}}_i$ from the SAE decoder matrix as a steering vector. This vector is multiplied by a coefficient $\alpha$ and added at each decoding step to the hidden states for every token in the sequence, whether in the prompt or generated tokens (Turner et al., 2023; Li et al., 2023a; Durmus et al., 2024). By intervening on the hidden states in this manner, we expect the model generations to introduce or emphasise content associated with the targeted concept, reflecting an induced high activation of that feature. In contrast, applying negative coefficients is expected to de-emphasise, remove, or even invert the corresponding concept in the model generation. Finding the right steering coefficient $\alpha$ is non-trivial.
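The arithmetic of this intervention is small enough to sketch directly. The snippet below is an illustrative NumPy sketch, not the authors' implementation: the function name `steer` and the toy shapes are hypothetical, and in practice the addition would be applied through a hook on the residual stream at the chosen layer at every decoding step. Whether the steering vector is rescaled by the normalisation factor used during SAE training is not stated in this section, so the sketch ignores it.

```python
import numpy as np


def steer(hidden: np.ndarray, W_dec: np.ndarray, feature_idx: int, alpha: float) -> np.ndarray:
    """Add alpha times the decoder column of one SAE feature to every token position.

    hidden: (seq_len, n) residual-stream states at the hooked layer.
    W_dec:  (n, m) SAE decoder matrix; column i is the direction of feature f_i.
    """
    direction = W_dec[:, feature_idx]  # steering vector, shape (n,)
    return hidden + alpha * direction  # broadcasts over all token positions


# Toy usage with hypothetical shapes: 3 token positions, n = 2, m = 2.
hidden = np.zeros((3, 2))
W_dec = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
pushed_towards = steer(hidden, W_dec, feature_idx=1, alpha=10.0)
pushed_away = steer(hidden, W_dec, feature_idx=1, alpha=-10.0)
```

Positive and negative coefficients move the hidden states by the same distance in opposite directions along the feature's decoder vector, which is why a single magnitude can be reused for both steering directions.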
After initial experiments with a wide range of steering coefficients ($\alpha \in \{1, 2.5, 5, 10, 15, 25, 50, 100\}$), we chose $\alpha = 10$ for positive and $\alpha = -10$ for negative steering, as they strike a good balance between producing noticeable targeted changes and remaining in-distribution.

Footnote 4: It was not always possible to find 100 positive examples for a given feature, resulting in an imbalanced evaluation set. However, we did not observe a correlation between imbalance and $F_1$.

[Figure 2 plot: x-axis ‘Interpretability (detection)’ from 0.3 to 1.0; y-axis ‘# features with interpretability above x’ on a log scale from 10^0 to 10^4; curves for precision, recall and $F_1$; annotations ‘7500 below 0.5’ and ‘288 above 0.75’.]

Figure 2. Interpretable features exist, but are rare. Using detection $F_1$ on the evaluation set as a measure of interpretability, we show that from 16,384 features, 288 (1.8%) score above 0.75, whereas 7,500 (46%) score below 0.5 (random performance for a balanced evaluation set).

3.8. Evaluating steering success

Successful steering should modify only the concept of interest while keeping the rest unchanged. We therefore evaluate the effectiveness of steering using two dimensions: on-target changes refer to modifications directly related to the concept of interest, while off-target changes are all other significant alterations of the content of the generated report. Both dimensions of change are measured by comparing the reports generated with and without the steering intervention. To enable large-scale evaluation of steering success of multiple concepts, we use an LLM-as-a-judge approach similar to Wu et al. (2025). The LLM is prompted to output on-target and off-target scores from 0 to 1, where higher scores correspond to stronger changes. More details about the LLM judge are in Appendix D.

4. Findings

4.1. MAIRA-2 contains interpretable features

Figure 2 shows the distribution of detection $F_1$ across the 16,384 features in our SAE.
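For reference, detection $F_1$ is the harmonic mean of precision and recall of the judge's "feature is active" predictions against the true activations. The sketch below uses a hypothetical function name `detection_f1` and toy labels; it is not the scoring code used in the study. It also shows why 0.5 is the chance level on a balanced set, and how a non-specific description that matches almost everything yields high recall but low precision.

```python
def detection_f1(is_active, predicted_active):
    """Return (precision, recall, F1) for binary detection predictions.

    is_active:        true labels, whether the feature fired on each sample.
    predicted_active: the judge's guess from the textual interpretation alone.
    """
    pairs = list(zip(is_active, predicted_active))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# A judge that is right half the time on a balanced set scores 0.5 throughout.
chance = detection_f1([True, True, False, False], [True, False, True, False])

# A non-specific description that matches every sample: recall 1.0, precision 0.5.
unspecific = detection_f1([True] * 4 + [False] * 4, [True] * 8)
```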
With 7,500 (46%) features below 0.5, this illustrates that many features are not interpretable by our automated pipeline. There do however exist consistently interpretable features, with 288 (1.8%) exhibiting $F_1$ above 0.75. We observe consistently higher recall than precision, indicating that feature descriptions tend towards being non-specific.

We observe that highly interpretable features tend to pertain to the presence of abnormal findings ($f_{1336}$, $F_1 = 0.89$: ‘Aortic tortuosity or calcification identified in chest imaging.’), changes especially of medical lines and tubes ($f_{11240}$, $F_1 = 0.96$: ‘Descriptions of findings related to chest tube placement or removal.’), and reporting style ($f_{12106}$, $F_1 = 0.93$: ‘Use of ‘however’ in clinical findings indicating possible issues needing further investigation.’).

Features with low detection $F_1$ are most commonly ascribed generic descriptions of the task of radiology reporting emphasising comparison with prior imaging ($f_{730}$ and 16 others, $F_1 = 0.50$: ‘Comparative analysis of current and prior imaging findings.’; $f_{12117}$ and 5 others, $F_1 = 0.40$: ‘Comparison of current imaging findings to prior studies.’), and less frequently references to the instruction given to MAIRA-2 ($f_{2653}$, $F_1 = 0.42$: ‘Provide findings description in comparison with prior images.’), or pathological changes ($f_{10005}$, $F_1 = 0.43$: ‘Enlargement of cardiac silhouette often with pulmonary vascular changes.’).

Even among highly interpretable features, we observe a degree of repetition in feature descriptions: of the 61 features with $F_1$ above 0.85, we observe seven instances of apparent repetition. For example, $f_{1374}$ and $f_{6105}$ are ‘Presence of atelectasis in imaging findings.’ and ‘Mentions and descriptions of atelectasis’ respectively, and $f_{13515}$ and $f_{13911}$ are both described as ‘Elevation of the hemidiaphragm’.

4.2. Steering success depends on the feature

In this section, we evaluate to what extent the SAE features can be used to effectively steer the model generation into desired directions. Figure 3 shows the steering performance across a set of highly interpretable features on the full validation set of 7,906 CXR studies. The LLM-based steering evaluation produces two metrics: an on-target score and an off-target score. Based on these metrics, we stratify the results into four disjoint cases: (1) only on-target changes were observed; (2) both on-target and off-target changes; (3) only off-target changes; and (4) no observable changes occurred following steering. We show examples from each category in Table 1. We describe the process of selecting features for steering and additional results and examples for both positive and negative steering in Appendix E.

Overall, feature steering produces more off-target changes than on-target changes. Purely on-target changes are relatively rare, with the highest observed proportion being 11.3%, seen in $f_{10709}$. An example of such a case is shown for $f_{1599}$ in the top row of Table 1, where the steered generation effectively removed all comparisons to prior studies. More commonly, on-target effects are accompanied by varying degrees of off-target changes. In $f_{6412}$, for example, 27.4% of cases exhibit both on- and off-target changes. While some of these off-target changes reflect underlying correlations in the dataset (for example in $f_{6412}$ in Table 1, where the output includes atelectasis alongside pleural effusion, consistent with known clinical correlations), most off-target changes are not due to such clinical relationships. In many cases, off-target changes are seen when the original report is non-specific (for example, “No significant interval change”) while the steered generation is more descriptive, leading to large off-target changes due to the discrepancy. In most cases, steering leads predominantly to off-target effects.
Extreme cases include, but are not limited to, $f_{11509}$, $f_{10643}$, and $f_{13506}$, where more than 50% of the steered samples contain off-target changes despite showing almost no on-target changes. The corresponding example for $f_{10643}$ in Table 1 adds a right apical pneumothorax, a left pleural effusion, and tips descriptions, none of which are directly related to the feature’s description of ‘Immediate notifications of findings by telephone upon discovery.’ In approximately 35% of cases across all features, steering leads to no observable changes.

Interestingly, we observe a strong correlation in the on-target effects between steering features with a positive and with a negative coefficient $\alpha$ (Spearman’s $\rho = 0.90$, $p \le 0.05$, $n = 67$). The correlation is also present for off-target effects, although less pronounced (Spearman’s $\rho = 0.71$, $p \le 0.05$, $n = 67$). We also observe a correlation between the activation frequency of a feature and its on-target score (Spearman’s $\rho = 0.40$, $p \le 0.05$, $n = 50$) for features with $F_1 > 0.85$. We do not observe a significant correlation of the activation frequency with off-target scores.

This section leaves us with two main observations: (1) the success of feature steering depends strongly on the selected feature and the steered report, and (2) even when successful, feature steering regularly produces more off-target effects than on-target effects.

5. Discussion and conclusion

Motivated by the success of SAE-based approaches for the interrogation of LLMs, in this work we investigated the radiology-specialized MLLM MAIRA-2 (Bannur et al., 2024) using Matryoshka-SAE (Bussmann et al., 2025). We based our automated interpretability and scoring pipeline on an established approach (Paulo et al., 2024), with some modifications specific to the radiology domain. We then explored the extent to which MAIRA-2 can be steered using the discovered interpretable features.
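The four-way stratification of steering outcomes used in Section 4.2 and Figure 3 can be written down compactly. The sketch below is illustrative only: the function names `steering_outcome` and `outcome_proportions` and the toy scores are hypothetical, and treating a score strictly greater than the 0.1 threshold as a change is an assumption, since the text only says that the judge's scores are binarized at 0.1.

```python
from collections import Counter

THRESHOLD = 0.1  # binarisation threshold given in the Figure 3 caption


def steering_outcome(on_target: float, off_target: float, threshold: float = THRESHOLD) -> str:
    """Map one pair of LLM-judge scores (each in [0, 1]) to one of four disjoint cases.

    Assumption: a score strictly above the threshold counts as a change.
    """
    on = on_target > threshold
    off = off_target > threshold
    if on and off:
        return "both on- and off-target changes"
    if on:
        return "only on-target changes"
    if off:
        return "only off-target changes"
    return "no changes"


def outcome_proportions(scores, threshold: float = THRESHOLD) -> dict:
    """Proportion of steered samples in each case, as plotted per feature."""
    counts = Counter(steering_outcome(on, off, threshold) for on, off in scores)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Toy (on_target, off_target) scores for four steered reports of one feature.
example_scores = [(0.9, 0.0), (0.6, 0.4), (0.0, 0.7), (0.0, 0.0)]
proportions = outcome_proportions(example_scores)
```

Because the four cases are disjoint and exhaustive, the proportions for each feature sum to one, which is what allows them to be drawn as a single stacked bar.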
We found a small set of highly interpretable SAE features, representing a variety of concepts pertinent to radiology reporting, and often with more granularity than existing concept categories such as the CheXpert classes (Irvin et al., 2019). Among these highly interpretable features, we observed some repeated or similar feature descriptions, which may be evidence for feature composition (Anders et al., 2024; Wattenberg & Viégas, 2024) or splitting (Bricken et al., 2023). Almost half of the SAE features were found not to be interpretable by our pipeline. It is difficult to establish the base rate of ‘interpretable’ features discovered in SAE analyses as most prior work reports on subsets of features (Paulo et al., 2024; Bussmann et al., 2025; Minder et al., 2025; Templeton et al., 2024) or subsets of evaluation samples (Mudide et al., 2024), whereas here we report scores for almost all SAE features.

[Figure 3: horizontal stacked bar chart with one bar per feature; x-axis from 0.0 to 1.0 (proportion of samples). Legend: Only on-target changes; Both on- & off-target changes; Only off-target changes; No changes. Features shown:
$f_{13506}$: Mention of scoliosis in radiological comparisons.
$f_{10643}$: Immediate notification of findings by telephone upon discovery.
$f_{11509}$: Observations of rib fractures in chest imaging reports.
$f_{12106}$: Use of 'however' in clinical findings indicating possible issues needing further investigation.
$f_{13515}$: Elevation of the hemidiaphragm.
$f_{13911}$: Elevation of the hemidiaphragm.
$f_{5791}$: Analysis and repositioning of Dobbhoff tube placement.
$f_{11240}$: Descriptions of findings related to chest tube placement or removal.
$f_{12150}$: Presence of surgical clips in imaging unrelated to main findings.
$f_{9958}$: Atherosclerotic calcifications noted at the aortic knob or thoracic aorta.
$f_{9473}$: Use of 'possible' or 'possibly' indicating uncertainty.
$f_{12585}$: Unfolded or tortuous thoracic aorta in radiology reports.
$f_{8757}$: Central line or catheter placement described with specific position (e.g. 'mid SVC').
$f_{1336}$: Aortic tortuosity or calcification identified in chest imaging.
$f_{10636}$: Blunting of costophrenic angles suggesting pleural effusion.
$f_{3246}$: Evaluation of pacemaker or ICD lead positions in chest X-rays.
$f_{9002}$: Describes placement or adjustment necessity of medical tubes or catheters.
$f_{6108}$: Findings of pulmonary vascular congestion or pulmonary edema.
$f_{10629}$: Precision in tube placement description.
$f_{6105}$: Mentions and descriptions of atelectasis.
$f_{4875}$: Cardiomegaly or enlarged cardiac silhouette.
$f_{6412}$: Detection of pleural effusions on imaging studies.
$f_{11891}$: Normal imaging findings with emphasis on absence of acute pathology.
$f_{10709}$: Findings indicate clear lungs with no signs of pleural effusion, pneumothorax, consolidation, or pulmonary edema.
$f_{1646}$: Interval change in disease findings from prior imaging.
$f_{1599}$: Describing findings without comparison to prior images.
$f_{6987}$: Cardiopulmonary findings suggesting congestion, cardiomegaly, or pleural effusion.]

Figure 3. On-target and off-target effects of steering features $f_i$ with $\alpha = 10$. The scores of the LLM judge are binarized at 0.1 and the bars show the proportion of samples where steering led only to on-target changes, only to off-target changes, both, or none at all. Results are shown for the full validation set including 7,906 studies.

Nonetheless, it may be possible to uncover more interpretable features in MAIRA-2 with improvements in automated interpretability, for example by making use of a radiology-adept MLLM to more comprehensively capture imaging content, by further improving the selection of dataset exemplars (e.g. using ‘hard’ negatives as in Minder et al.
(2025)), or by actively seeking ‘output’ features (Paulo et al., 2024; Gur-Arieh et al., 2025) through an intervention-based interpretation pipeline (Shaham et al., 2024). However, given our specialised domain, we are limited in the set of models available for e.g. CXR visual question-answering, or generation of counterfactual examples.

Our study discovered two recurring patterns related to steering success. First, we found a significant correlation between the activation frequency of features in the training dataset and steering success for highly interpretable features. Given that the training data in this study was a subset of MAIRA-2’s training data, we hypothesise that frequently activating features are well learned and better represented in the language model’s latent space, which in turn is the cause for the high steering success of such features. Second, features that are highly steerable in the positive direction also show good steering performance in the negative direction and vice versa. This suggests that the learned features retain their approximate linearity far below the point where they usually activate. Other recent studies on steering LLMs with hand-picked SAE features also found disproportionately many failure cases, but could not provide an empirical explanation (Durmus et al., 2024; O’Brien et al., 2024). This also relates to the comprehensive steering benchmarking study from Wu et al. (2025), wherein almost all steering methods (including SAE-based) dramatically underperformed prompting and fine-tuning.

Alongside low steering success, we also observed adverse effects from steering, sometimes causing clinically significant changes unrelated to the concept of interest, such as the confabulation or omission of findings. Unlike the generalist settings studied by Durmus et al. (2024) and O’Brien et al. (2024), the nature of our application both required and allowed a more principled evaluation of the side-effects of steering, which revealed that these off-target changes often exceed those observed along the concept dimension.

Table 1. Examples of steered reports including LLM-based evaluation. Following the colouring from Figure 3, on-target changes between the original and the steered report are highlighted in green, and off-target effects in orange.

Only on-target changes
Feature $f_{1599}$, $F_1 = 0.79$: ‘Describing findings without comparison to prior images.’
Original: Compared with the prior study, lung volumes are lower, causing bronchovascular crowding. However, no focal consolidation, pleural effusion, or pneumothorax. Cardiomediastinal and hilar silhouettes are unchanged.
Steered ($\alpha = 10$): Lung volumes are low, causing bronchovascular crowding. No focal consolidation, pleural effusion, or pneumothorax. Cardiomediastinal and hilar silhouettes are unremarkable.
Scores: on-target score = 1.0, off-target score = 0.0

Both on- and off-target changes
Feature $f_{6412}$, $F_1 = 0.84$: ‘Detection of pleural effusions on imaging studies.’
Original: The heart size is normal. The mediastinal and hilar contours are unremarkable. The pulmonary vascularity is normal. Patchy ill-defined opacity is noted within the left lung base, which may represent an area of developing infection. No pleural effusion or pneumothorax is present. There are no acute osseous abnormalities.
Steered ($\alpha = 10$): The heart size is normal. The mediastinal and hilar contours are unremarkable. The pulmonary vascularity is normal. There is a small left pleural effusion. Minimal streaky opacity in the left lung base may reflect atelectasis, though infection cannot be completely excluded. There is no pneumothorax. No acute osseous abnormality is seen.
Scores: on-target score = 1.0, off-target score = 0.2

Only off-target changes
Feature $f_{10643}$, $F_1 = 0.86$: ‘Immediate notification of findings by telephone upon discovery.’
Original: Since, small right pleural effusion is unchanged, right pleural catheter is in unchanged position. The lungs are clear. The cardiomediastinal silhouette, hilar contours, and pleural surfaces are normal. No pneumothorax.
Steered ($\alpha = 10$): Since, a right pleural drainage catheter is in place. A small right apical pneumothorax is seen. A small right pleural effusion is seen. A left pleural effusion is small. The lungs are clear. The heart size is normal. Tips in the esophagus are noted.
Scores: on-target score = 0.1, off-target score = 0.7

The reasons for the low steering success and the large adverse effects remain speculative but could be rooted in limitations of the SAEs used for concept discovery, of the steering method, or of the automated feature interpretation and steering evaluation pipelines. For example, quantitatively measuring on-target effects is challenging when a concept is already present. This is observed for $f_{6412}$: attempting to add more pleural effusions to a report that already describes bilateral pleural effusions should rightfully result in no change compared to the original output. This effect is especially relevant in negative steering since most concepts are already absent in the majority of samples. More fundamentally, it is possible that the SAE features themselves are not well disentangled (cf. feature splitting/absorption/composition; Bricken et al., 2023; Chanin et al., 2024; Anders et al., 2024), or that representations of concepts of interest are significantly nonlinear in MAIRA-2 (Park et al., 2024). Some of the problems above could be alleviated by more informed steering methods that would especially incorporate knowledge about which tokens to modify and how much.
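To make the on-/off-target bookkeeping behind Figure 3 and Table 1 concrete, the following sketch binarizes a pair of LLM-judge scores at 0.1 and assigns one of the four outcome categories. It is an illustration under assumptions: the function names are hypothetical, and the strict inequality (a score must exceed 0.1 to count as a change) is inferred from Table 1, where an on-target score of 0.1 is listed under "only off-target changes".

```python
from collections import Counter

THRESHOLD = 0.1  # binarization threshold stated in the Figure 3 caption


def outcome(on_target, off_target, threshold=THRESHOLD):
    """Map a pair of judge scores to one of the four Figure 3 categories."""
    has_on = on_target > threshold    # assumption: strictly above the threshold
    has_off = off_target > threshold
    if has_on and has_off:
        return "both"
    if has_on:
        return "only on-target"
    if has_off:
        return "only off-target"
    return "no change"


def outcome_proportions(score_pairs):
    """Fraction of steered samples falling into each category."""
    counts = Counter(outcome(on, off) for on, off in score_pairs)
    total = len(score_pairs)
    return {category: count / total for category, count in counts.items()}


# The three (on-target, off-target) score pairs listed in Table 1.
table_1_scores = [(1.0, 0.0), (1.0, 0.2), (0.1, 0.7)]
categories = [outcome(on, off) for on, off in table_1_scores]
```

Applied per feature over all steered samples, `outcome_proportions` would give the four bar segments of a single row in Figure 3.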
In conclusion, this study provides early insights into the inner workings of a radiology-specialised MLLM, while also highlighting various practical and technical challenges in applying mechanistic interpretability methods to domain-specific models. Notably:

• We identify a subset of the internal concepts in MAIRA-2 that are often more fine-grained than existing findings labels for CXR reports, despite many features remaining uninterpretable.
• While concept steering shows potential, its effectiveness varies considerably by feature and case, highlighting the need for further investigation.
• These findings lay a foundation for more transparent, interpretable, and controllable radiology MLLMs, and support continued research through the planned release of the SAE checkpoint and automated interpretations.

By releasing⁵ the trained SAEs and LLM-generated interpretations for all features, we hope to facilitate further research into the application of mechanistic interpretability to specialised multimodal models such as MAIRA-2.

⁵ https://huggingface.co/microsoft/maira-2-sae

Impact Statement

This paper focuses on approaches to understand the internal representations of a radiology-adapted MLLM and potentially control its behaviour through steering. The use of MLLMs on medical data can have positive or negative impacts depending on the context of use since such models can generate incorrect or misleading outputs. Extracting interpretable features from model internals may provide model builders a route towards designing safer or more effective models. Steering could provide an additional mechanism by which models can be controlled, either to enhance or suppress unsafe or other undesirable behaviours.

References

Abdulaal, A., Fry, H., Montaña-Brown, N., Ijishakin, A., Gao, J., Hyland, S., Alexander, D. C., and Castro, D. C. An X-ray is worth 15 features: Sparse autoencoders for interpretable radiology report generation. arXiv preprint arXiv:2410.03334, 2024.
Adams, E., Bai, L., Lee, M., Yu, Y., and AlQuraishi, M. From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models, February 2025. URL https://www.biorxiv.org/content/10.1101/2025.02.06.636901v1. Pages: 2025.02.06.636901 Section: New Results.
Anders, E., Neo, C., Hoelscher-Obermaier, J., and Howard, J. N. Sparse autoencoders find composed features in small toy models, 2024. URL https://www.lesswrong.com/posts/a5wwqza2cY3W7L9cj.
Bannur, S., Bouzid, K., Castro, D. C., Schwaighofer, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., Meissen, F., Ranjit, M., Srivastav, S., Gong, J., Falck, F., Oktay, O., Thieme, A., Lungren, M. P., Wetscherek, M. T., Alvarez-Valle, J., and Hyland, S. L. MAIRA-2: Grounded Radiology Report Generation, June 2024. URL http://arxiv.org/abs/2406.04449.
Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Bussmann, B., Leask, P., and Nanda, N. BatchTopK sparse autoencoders. arXiv preprint arXiv:2412.06410, 2024.
Bussmann, B., Nabeshima, N., Karvonen, A., and Nanda, N. Learning Multi-Level Features with Matryoshka Sparse Autoencoders, March 2025. URL http://arxiv.org/abs/2503.17547. arXiv:2503.17547 [cs].
Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H., and Bloom, J. I. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. In NeurIPS 2024 Workshop on Interpretable AI, December 2024. URL https://openreview.net/forum?id=Wzav8fesTL.
Chen, Z., Varma, M., Delbrouck, J.-B., Paschali, M., Blankemeier, L., Van Veen, D., Valanarasu, J. M. J., Youssef, A., Cohen, J. P., Reis, E. P., et al. CheXagent: Towards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208, 2024.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.
Durmus, E., Tamkin, A., Clark, J., Wei, J., Marcus, J., Batson, J., Handa, K., Lovitt, L., Tong, M., McCain, M., Rausch, O., Huang, S., Bowman, S., Ritchie, S., Henighan, T., and Ganguli, D. Evaluating feature steering: A case study in mitigating social biases, 2024. URL https://anthropic.com/research/evaluating-feature-steering.
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
Fiotto-Kaufman, J., Loftus, A. R., Todd, E., Brinkmann, J., Juang, C., Pal, K., Rager, C., Mueller, A., Marks, S., Sharma, A. S., et al. NNsight and NDIF: Democratizing access to foundation model internals. arXiv preprint arXiv:2407.14561, 2024.
Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders, June 2024. URL http://arxiv.org/abs/2406.04093. arXiv:2406.04093 [cs].
Gur-Arieh, Y., Mayan, R., Agassy, C., Geiger, A., and Geva, M. Enhancing automated interpretability with output-centric feature descriptions. arXiv preprint arXiv:2501.08319, 2025.
He, Z., Shu, W., Ge, X., Chen, L., Wang, J., Zhou, Y., Liu, F., Guo, Q., Huang, X., Wu, Z., Jiang, Y.-G., and Qiu, X. Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders, 2024. URL https://arxiv.org/abs/2410.20526.
Hyland, S. L., Bannur, S., Bouzid, K., Castro, D. C., Ranjit, M., Schwaighofer, A., Pérez-García, F., Salvatelli, V., Srivastav, S., Thieme, A., et al. MAIRA-1: A specialised large multimodal model for radiology report generation. arXiv:2311.13668, 2023. URL https://arxiv.org/abs/2311.13668.
Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, p. 590–597, 2019.
Jiang, Y., Chen, C., Nguyen, D., Mervak, B. M., and Tan, C. GPT-4V cannot generate radiology reports yet. ArXiv, abs/2407.12176, 2024. URL https://api.semanticscholar.org/CorpusID:271244474.
Johnson, A. E. W., Pollard, T. J., Berkowitz, S. J., Mark, R. G., and Horng, S. MIMIC-CXR database (version 2.0.0). PhysioNet, 2019.
Lad, V., Gurnee, W., and Tegmark, M. The remarkable robustness of LLMs: Stages of inference? arXiv preprint arXiv:2406.19384, 2024.
Le, N. M., Patel, N., Shen, C., Martin, B., Eng, A., Shah, C., Grullon, S., and Juyal, D. Learning biologically relevant features in a pathology foundation model using sparse autoencoders. In NeurIPS 2024 Workshop on Advancements In Medical Foundation Models, December 2024. URL https://openreview.net/forum?id=daV16mhUBd.
Lee, H., Battle, A., Raina, R., and Ng, A. Efficient sparse coding algorithms. Advances in neural information processing systems, 19, 2006.
Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530, 2023a.
Li, M., Lin, B., Chen, Z., Lin, H., Liang, X., and Chang, X. Dynamic graph enhanced contrastive learning for chest x-ray report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 3334–3343, 2023b.
Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147, 2024.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, p. 34892–34916, 2023.
Lou, H., Li, C., Ji, J., and Yang, Y. SAE-V: Interpreting Multimodal Models for Enhanced Alignment, February 2025. URL http://arxiv.org/abs/2502.17514. arXiv:2502.17514 [cs].
Marks, S., Karvonen, A., and Mueller, A. dictionary_learning. https://github.com/saprmarks/dictionary_learning, 2024.
Minder, J., Dumas, C., Juang, C., Chugtai, B., and Nanda, N. Robustly identifying concepts introduced during chat fine-tuning using crosscoders. arXiv preprint arXiv:2504.02922, 2025.
Mudide, A., Engels, J., Michaud, E. J., Tegmark, M., and de Witt, C. S. Efficient dictionary learning with switch sparse autoencoders. arXiv preprint arXiv:2410.08201, 2024.
O’Brien, K., Majercak, D., Fernandes, X., Edgar, R., Chen, J., Nori, H., Carignan, D., Horvitz, E., and Poursabzi-Sangde, F. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296, 2024.
Pach, M., Karthik, S., Bouniot, Q., Belongie, S., and Akata, Z. Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models, April 2025. URL http://arxiv.org/abs/2504.02821. arXiv:2504.02821 [cs].
Parekh, J., Khayatan, P., Shukor, M., Newson, A., and Cord, M. A concept-based explainability framework for large multimodal models. Advances in Neural Information Processing Systems, 37:135783–135818, 2024.
Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. In Proceedings of the 41st International Conference on Machine Learning, p. 39643–39666. PMLR, July 2024. URL https://proceedings.mlr.press/v235/park24c.html.
Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automatically Interpreting Millions of Features in Large Language Models, December 2024. URL http://arxiv.org/abs/2410.13928. arXiv:2410.13928 [cs].
Pérez-García, F., Sharma, H., Bond-Taylor, S., Bouzid, K., Salvatelli, V., Ilse, M., Bannur, S., Castro, D. C., Schwaighofer, A., Lungren, M. P., et al. Exploring scalable medical image encoders beyond text supervision. Nature Machine Intelligence, 7:119–130, 2025. doi: 10.1038/s42256-024-00965-w.
Quinn, T. P., Senadeera, M., Jacobs, S., Coghlan, S., and Le, V. Trust and medical AI: the challenges we face and the expertise needed to overcome them. Journal of the American Medical Informatics Association, 28(4):890–894, 2021.
Shaham, T. R., Schwettmann, S., Wang, F., Rajaram, A., Hernandez, E., Andreas, J., and Torralba, A. A multimodal automated interpretability agent. In Forty-first International Conference on Machine Learning, 2024.
Simon, E. and Zou, J. InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders, November 2024. URL https://www.biorxiv.org/content/10.1101/2024.11.14.623630v1. Pages: 2024.11.14.623630 Section: New Results.
Stevens, S., Chao, W.-L., Berger-Wolf, T., and Su, Y. Sparse autoencoders for scientifically rigorous interpretation of vision models. arXiv preprint arXiv:2502.06755, 2025.
Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Tu, T., Azizi, S., Driess, D., Schaekermann, M., Amin, M., Chang, P.-C., Carroll, A., Lau, C., Tanno, R., Ktena, I., Palepu, A., Mustafa, B., Chowdhery, A., Liu, Y., Kornblith, S., Fleet, D., Mansfield, P., Prakash, S., Wong, R., Virmani, S., et al. Towards generalist biomedical AI. NEJM AI, 1(3):AIoa2300138, February 2024. doi: 10.1056/AIoa2300138.
Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.
Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593, 2022.
Wang, Z., Liu, L., Wang, L., and Zhou, L. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 11558–11567, 2023.
Wattenberg, M. and Viégas, F. B. Relational composition in neural networks: A survey and call to action, 2024. URL https://arxiv.org/abs/2407.14662.
Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders, March 2025. URL http://arxiv.org/abs/2501.17148. arXiv:2501.17148 [cs].
Yan, Q., He, X., Yue, X., and Wang, X. E. Worse than random? an embarrassingly simple probing evaluation of large multimodal models in medical VQA. ArXiv, abs/2405.20421, 2024. URL https://api.semanticscholar.org/CorpusID:270199350.
Yang, L., Xu, S., Sellergren, A., Kohlberger, T., Zhou, Y., Ktena, I., Kiraly, A., Ahmed, F., Hormozdiari, F., Jaroensri, T., et al. Advancing multimodal medical capabilities of Gemini. arXiv preprint arXiv:2405.03162, 2024.
Yildirim, N., Richardson, H., Wetscherek, M. T., Bajwa, J., Jacob, J., Pinnock, M. A., Harris, S., Coelho De Castro, D., Bannur, S., Hyland, S., et al. Multimodal healthcare AI: Identifying and designing clinically relevant vision-language applications for radiology. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, p. 1–22, 2024.
Zhang, K., Shen, Y., Li, B., and Liu, Z. Large Multimodal Models Can Interpret Features in Large Multimodal Models, November 2024. URL http://arxiv.org/abs/2411.14982. arXiv:2411.14982 [cs].
Zhou, H.-Y., Adithan, S., Acosta, J. N., Topol, E. J., and Rajpurkar, P. A generalist learner for multifaceted medical image interpretation. arXiv preprint arXiv:2405.07988, 2024.

A. Additional experimental setup details

A.1. Representation filtering

Table A.1 shows the effect of the token-based representation filtering, which keeps only the tokens most relevant to the task at hand. We discard any fixed segments from the original prompt template: the system prompt, beginning- and end-of-sequence tokens, the chat template delimiters, and the fixed parts of the instruction.
We depict an example that includes all three possible CXR image views, so that the filtering is illustrated comprehensively.

Table A.1. Example of token-based representation filtering, including a frontal, lateral and prior image. In the filtered prompt, we indicate the indices of the tokens which were kept in square brackets. We further extract the message type (‘human’ or ‘assistant’) and the content type (‘str’ or ‘image’).

Original Prompt:
<s>, _You, _are, _an, _expert, _radi, ology, _assistant, _task, ed, _with, _interpre, ting, _a, _ch, est, _X, -, ray, _study, ., _, _US, ER, :, _, _Given, _the, _current, _front, al, _image, <image>x1369, _the, _current, _later, al, _image, <lat_image>x1369, _and, _the, _prior, _front, al, _image, <prev_im>x1369, _P, RI, OR, _, RE, PORT, :, _N, /, A, _Prov, ide, _a, _description, _of, _the, _find, ings, _in, _the, _radi, ology, _study, _in, _comparison, _to, _the, _prior, _front, al, _image, ., _IN, D, ICATION, :, __, _year, _old, _woman, _with, _rec, urrent, _asp, iration, _p, na, ,, _now, _with, _f, lare, _in, _s, put, um, ,, _c, ough, ,, _and, _bil, ater, al, _lower, _lo, be, _crack, les, _//, _assess, _for, _new, _p, neum, onia, _TE, CH, NI, QUE, :, _Ch, est, _radi, ograph, ,, _PA, _and, _later, al, _views, _CO, MP, AR, I, SON, :, _Thus, _radi, ograph, __, _, _A, S, IST, ANT, :, _Bil, ater, al, _lower, _lo, be, _op, ac, ities, _are, _improved, _compared, _to, __, ., _There, _are, _small, _co, ales, c, ence, _into, _several, _nod, ular, _op, ac, ities, _remaining, _on, _the, _right, _but, _mostly, _improved, ., _L, ungs, _are, _m, ild, ly, _hyper, infl, ated, ., _There, _is, _no, _definite, _ple, ural, _eff, usion, ., _Card, iom, ed, iast, inal, _sil, hou, ette, _is, _normal, _size, ., _, 2, _f, ract, ured, _sc, rew, s, _in, _right, _hum, eral, _head, _is, _un, changed, _from, _prior, ., </s>

Filtered Prompt:
[start_index:end_index] | message_type | content_type | [filtered tokens]
[31:32] | human | str | ['_image']
[1400:1401] | human | image | ['<image>']
[1405:1406] | human | str | ['_image']
[2774:2775] | human | image | ['<lat_image>']
[2780:2781] | human | str | ['_image']
[4149:4150] | human | image | ['<prev_im>']
[4150:4160] | human | str | ['_P','RI','OR','_','RE','PORT',':','_N','/','A']
[4182:4226] | human | str | ['_IN','D','ICATION',':','__','_year','_old','_woman','_with','_rec','urrent','_asp','iration','_p','na',',','_now','_with','_f','lare','_in','_s','put','um',',','_c','ough',',','_and','_bil','ater','al','_lower','_lo','be','_crack','les','_//','_assess','_for','_new','_p','neum','onia']
[4226:4241] | human | str | ['_TE','CH','NI','QUE',':','_Ch','est','_radi','ograph',',','_PA','_and','_later','al','_views']
[4241:4251] | human | str | ['_CO','MP','AR','I','SON',':','_Thus','_radi','ograph','__']
[4257:4344] | assistant | str | ['_Bil','ater','al','_lower','_lo','be','_op','ac','ities','_are','_improved','_compared','_to','__','.','_There','_are','_small','_co','ales','c','ence','_into','_several','_nod','ular','_op','ac','ities','_remaining','_on','_the','_right','_but','_mostly','_improved','.','_L','ungs','_are','_m','ild','ly','_hyper','infl','ated','.','_There','_is','_no','_definite','_ple','ural','_eff','usion','.','_Card','iom','ed','iast','inal','_sil','hou','ette','_is','_normal','_size','.','_','2','_f','ract','ured','_sc','rew','s','_in','_right','_hum','eral','_head','_is','_un','changed','_from','_prior','.']

Total # tokens = 4348
Total # filtered tokens = 176

A.2. SAE Hyperparameters

Table A.2 shows all the hyperparameters that were overridden from the default values in the dictionary_learning repository⁶. We realised we had accidentally run with reversed Matryoshka group sizes, with the largest group being ‘innermost’. This appears to have slightly reduced the overall interpretability of the discovered features. We will quantify the impact on steering in a revision.
However, given we do not see a strong association between interpretability score and steering success on the feature level, we do not expect a significant difference in our findings.

Table A.2. Hyperparameters used for training the SAE using the open-source repository dictionary_learning.

Category | Value | Description
SAE Configuration
Type | matryoshka_batch_top_k | Type of SAE architecture used.
Activation dimension | 4096 | The dimensionality of the activations in the SAE model.
Expansion factor | 8 | The factor by which the number of dictionary atoms is expanded.
Layer ID | 15 | The specific layer from which the embeddings are extracted.
Hookpoint | residual | The hookpoint in the model from where embeddings are captured.
k (active features) | 256 | The number of active features in the sparse encoding.
Group fractions | [1/2, 1/4, 1/8, 1/16, 1/16] | The fractions used to group dictionary atoms, controlling the sparsity levels at different scales.
Trainer Configuration
Epochs | 1 | The number of training epochs.
Learning rate | Auto | The learning rate is automatically set by the trainer.
Auxk alpha | 0.03125 | The weight of the auxiliary k-sparse loss term.
Threshold beta | 0.999 | The exponential decay rate used for thresholding.
Threshold start step | 1000 | The step at which thresholding begins during training.
Threshold dead features | 100000 | The number of tokens after which a feature without activation will be considered “dead” (used in the auxiliary loss).
Data Configuration
Batch size | 8192 | The number of samples in each batch for training.
Normalize activations | True | Whether or not to normalize the activations to unit norm.

A.3. Auxiliary Loss

As introduced by Gao et al. (2024), the auxiliary loss $\mathcal{L}_{\text{aux}}$ captures the reconstruction error associated with the top-$k_{\text{aux}}$ dead latents in the sparse latent space. It is defined as:

$$\mathcal{L}_{\text{aux}} = \lVert e - \hat{e} \rVert_2^2, \tag{3}$$

where $e = x - \hat{x}$ is the error from the main model, and $\hat{e} = W^{k}_{\text{dec}} z$ represents the reconstruction obtained using the top-$k_{\text{aux}}$ dead latents.
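As a worked illustration of Equation (3), the sketch below computes the auxiliary loss for a single activation vector in plain Python. It is a minimal sketch under assumptions, not the dictionary_learning implementation: the function name `aux_loss`, the row-per-latent layout of `w_dec`, and the choice to rank dead latents by their non-negative activations are all hypothetical, and the dead-latent mask is taken as given.

```python
def aux_loss(x, x_hat, acts, dead_mask, w_dec, k_aux):
    """Squared error between the main residual e and its reconstruction e_hat.

    x, x_hat  : input activation and main SAE reconstruction (length d)
    acts      : non-negative latent activations (length n_latents), assumed given
    dead_mask : True where a latent is currently considered dead
    w_dec     : decoder directions, one row of length d per latent (assumed layout)
    k_aux     : number of dead latents used for the auxiliary reconstruction
    """
    # e = x - x_hat: what the main reconstruction failed to explain.
    residual = [xi - xh for xi, xh in zip(x, x_hat)]

    # Keep the k_aux dead latents with the largest activations.
    dead = [j for j, is_dead in enumerate(dead_mask) if is_dead]
    top_dead = sorted(dead, key=lambda j: acts[j], reverse=True)[:k_aux]

    # e_hat: decode using only the selected dead latents.
    e_hat = [0.0] * len(x)
    for j in top_dead:
        for dim in range(len(x)):
            e_hat[dim] += acts[j] * w_dec[j][dim]

    # Squared L2 norm of (e - e_hat).
    return sum((r - eh) ** 2 for r, eh in zip(residual, e_hat))


# Toy example with d = 2 and three latents, two of which are dead.
x, x_hat = [1.0, 2.0], [0.5, 1.0]
acts = [0.5, 2.0, 0.25]
dead_mask = [True, False, True]
w_dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss_k1 = aux_loss(x, x_hat, acts, dead_mask, w_dec, k_aux=1)
loss_k2 = aux_loss(x, x_hat, acts, dead_mask, w_dec, k_aux=2)
```

In training this term would be added to the main reconstruction loss with the "Auxk alpha" weight from Table A.2, so that dead latents receive gradient signal from the part of the input the live latents miss.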
A latent is considered “dead” if it has not activated for a predefined number of tokens, as specified by the ‘Threshold dead features’ hyperparameter (Table A.2).

⁶ https://github.com/saprmarks/dictionary_learning

B. Feature statistics

Figure B.1 provides statistics on the activation frequency of features in the studied SAE, using a random sample of 500,000 data points from the training set. Each sample here refers to the model representation extracted at a single token position. Defining the feature density as the fraction of samples on which a given feature activates, we observe (Figure B.1; top left, top middle) a range of values, with a very small number of features activating on almost all samples. Unfortunately, these commonly-occurring features are not well-explained: $f_{7633}$ activates on 99.3% of samples and achieves $F_1 = 0.40$, while $f_{6727}$ activates on 86.8% of samples and is described as ‘Detailed comparison with prior images highlighting interval changes.’ with $F_1 = 0.67$. We did not observe a correlation between feature density and interpretability (detection $F_1$); however, some of the most interpretable features have low density: $f_{13515}$ ($F_1 = 0.97$, ‘Elevation of the hemidiaphragm.’) activates on only 159 samples (0.03%).

[Figure B.1: nine panels. Feature density (# features vs. fraction of samples where feature is active); CDF of feature activity (cumulative # features vs. # samples where feature is active); Mean activation vs. active features (mean act. of feature in sample vs. # active features in sample); Active features per sample (# samples vs. # active features per sample); Max feature activation in sample (# samples vs. max activation of feature); Max activation vs. active features (max act. of feature in sample vs. # active features in sample); Mean feature activation (# features vs. mean activation of feature); Max feature activation (# features vs. max activation of feature across samples); Max activation vs. active samples (max act. of feature in sample vs. # samples where feature is active).]

Figure B.1. Statistics on the activation frequency and strength of features in the studied SAE, using a random sample of 500,000 data points from the training set. Each sample here refers to a single token position.

C. Automated interpretability

We split the dataset 50:50 by patient into ‘train’ and ‘evaluation’ for the purpose of generating and scoring latent interpretations respectively. We consider each latent individually. We use GPT-4o (2024-11-20) via the Azure OpenAI service. Initial experiments with o3-mini (2025-01-31) yielded inferior results, but may benefit from further prompt engineering.

[Figure C.1: number of features with interpretability above x (log scale) vs. interpretability (detection $F_1$), for two conditions: No image description; Image description.]

Figure C.1. Providing the ‘description’ of the image (‘Findings’ section of the report) to the interpretation LLM does not increase the number of interpretable features we discover.

C.1. Selecting and preparing feature exemplars

For each feature, we select ‘exemplar’ data samples by selecting an equal number of non-activating and activating samples, for a total of 50. Activating samples are selected from the top decile of the feature activation distribution, treating an activation of 0 as the bottom decile. Experiments with stratified sampling indicated this strategy was marginally superior. Each data sample refers to a specific token index for a specific input sequence.
To explain the feature activation at a given token, we provide the LLM with the full sequence up to that token, and the next 100 characters. We indicate the ‘current’ token with double square brackets. We found that truncating to 100 characters after the current token provided slightly improved interpretability scores relative to truncating to 0, 10, or 250 characters. Note that our few-shot examples, being hand-designed, inconsistently adhere to the truncation length.

Since MAIRA-2 is an MLLM, it receives interleaved images and text as input. Because our interpretation LLM is text-only, we replace the image tokens with the string ‘<image>’. In a chest X-ray report, the ‘Findings’ section of the report approximates a description of the image. We experimented with prefixing this section to the data sample, to provide the LLM with information about the image contents. Surprisingly (Figure C.1), we found this did not improve the number of interpretable features we could discover. We note, however, that we train the SAE (and hence interpret it) on both the prompt and target tokens used for MAIRA-2 training, where the training target is the ‘Findings’ section. Hence, in many cases the input text to the model may already contain some part of the image description.

C.2. Generating interpretations

We use a modified prompt based on Paulo et al. (2024) (footnote 7). For each example, we present a single latent activation, scaled to 0-9 based on the range of observed values of that latent, following Gao et al. (2024). The instruction provided to the explanation-deriving LLM is shown below:

7 https://github.com/EleutherAI/delphi/tree/article_version

Latent interpretation: system message

You are a meticulous AI researcher conducting an important investigation into the activation patterns of a large autoregressive vision-language model trained on chest X-ray data. You will be presented with samples of prompts to and outputs from this model with corresponding activation levels at a specified token. You task is to analyze this data and provide an explanation which succinctly encapsulates patterns to explain the observed activation levels.

Guidelines:
- Each data example consists of some preamble text, the [[current token]], and the next few tokens, as well as an "activation level" computed at the [[current token]]. Note that the current token is delimited with "[[, ]]".
- The activation level indicates how representative the sample is of the pattern we wish to understand.
- Activation levels close to zero mean the pattern is NOT present.
- These examples can refer to "<image>". In this case, an <IMAGE DESCRIPTION> may be provided.

Produce the SHORTEST and MOST CONCISE explanation of the pattern, with a rationale. Respond in JSON with the following fields:
"rationale": "Justification for this explanation.",
"explanation": "Concise explanation of the pattern."

We use six few-shot examples based on our medical use-case, devised in conjunction with Claude 3.7 Sonnet Thinking. Each example features samples from a hypothetical latent with different activation levels. We tried to design the few-shot examples to cover possible latents, including those pertaining to pathologies or other image-specific concepts, ‘meta’ concepts such as the position within the input sequence (as described by Gao et al. (2024)), and ‘superficial’ concepts such as the appearance of a specific token. We present one below:

Latent interpretation: illustrative few-shot example

Input:
Example 1: [[IN]]DICATION: 58-year-old female with persistent cough. FINDINGS: No acute cardiopulmonary process. Heart size is normal.
Activation: 8
Example 2: TECHNIQUE: PA and lateral chest radiographs were obtained. [[The]] cardiac silhouette appears normal in size.
Activation: 0
Example 3: EXAMINATION: Chest CT. [[ IN]]DICATION: Follow-up for previously identified pulmonary nodule. FINDINGS: Left lung is clear, nodule persists in right lower lobe. No new masses identified.
Activation: 7
Example 4: The patient presents with shortness of breath and chest pa[[ in]]. No fever reported.
Activation: 0
...
Output:
"rationale": "The activation is high when '_IN' appears as the first token of the 'INDICATION' section, but is not high when '_IN' appears in other contexts.",
"explanation": "The token '_IN' appearing as part of the word 'INDICATION'"

C.3. Scoring interpretations

We follow the detection scoring approach outlined by Paulo et al. (2024), and again draw inspiration from the prompts used in that study. The LLM is provided with the interpretation of a latent (a sentence) and a sample. It is asked to classify whether the latent would activate on that sample. Below we show the system message:

Interpretation scoring: system message

You are an intelligent and meticulous researcher into clinical language, with a specialisation in radiology. You will be provided with a "latent explanation", which describes a "latent property" of the text (a concept or pattern which can appear in text), such as "patient deterioration" or "mention of lung opacities". You will then be given several text examples. Your task is to determine which examples possess latent property. These examples may appear to have truncated text. Regardless of the formatting, focus on determining whether the text has the property in the "latent explanation". For each example, return 1 if the example demonstrates the property, and 0 otherwise.

Again we devised few-shot examples in conjunction with Claude 3.7 Sonnet Thinking. Using a small development dataset, we found that the LLM performed better on the task when it was asked to classify a single sample at a time; the few-shot examples, however, include several samples at once.
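Detection scoring turns each latent into a binary classification problem, so the quality of an interpretation can be summarised by an F1 score. The sketch below shows that computation under stated assumptions: `detection_f1` is a hypothetical helper, `predictions` are the 0/1 outputs of the scoring LLM, and `labels` are 1 where the latent actually activated. How the authors aggregate the per-sample outputs is not specified in this section.

```python
def detection_f1(predictions, labels):
    """F1 of binary LLM predictions against ground-truth activation labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # predicted active, was active
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # predicted active, was silent
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # predicted silent, was active

    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: four evaluation samples for one latent.
labels = [1, 1, 0, 0]       # whether the latent actually activated
predictions = [1, 0, 1, 0]  # what the scoring LLM returned
score = detection_f1(predictions, labels)
```

Returning 0 when there are no true positives is a convention chosen for this sketch; other conventions for the undefined case exist.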
An illustrative few-shot example is shown below:

Interpretation scoring: illustrative few-shot example

Input:
Latent explanation: Comparison with prior imaging studies.
Test examples:
Example 0: Chest radiograph demonstrates bibasilar atelectasis with small bilateral pleural effusions. Heart size is upper limits of normal.
Example 1: Interval development of small right pleural effusion not present on prior chest radiograph performed 2 days ago.
Output: [0, 1]

D. Automated steering evaluation

Our automated evaluation of steering success is implemented using an LLM judge similar to Wu et al. (2025). The model is prompted to provide a score from 0 to 1 for on-target changes and for off-target changes, and to reason about its decision beforehand. The LLM judge is then provided with ten few-shot examples of the task. We use separate system messages and few-shot examples for positive and negative steering. As in the automated interpretability pipeline (cf. Appendix C), we use GPT-4o (2024-11-20) via the Azure OpenAI service. Below we show the system message.

Automated Steering Evaluation: system message

You are an intelligent and meticulous researcher into clinical language, with a specialisation in radiology. You will be provided with a "concept", which describes a concept or pattern which can appear in text, such as "patient deterioration" or "mention of lung opacities". You will further be provided with two text samples:
1. A findings section of a radiology report (called "original")
2. A modified version of that section (called "modified")
You will then be given several text examples. Please act as an impartial judge and determine (1) to which degree the modified version is changed to MODIFIER better represent the concept compared to the original version ("on_target_score") and (2) to which degree the rest of the report stayed unchanged compared to the original version ("off_target_score"). Focus solely on the changes in information, without regard for different ways of representing the same information (e.g. reordering, different wording, etc.). Changes in relation to the concept should NOT be counted to the "off_target_score". Report both scores on a scale from 0 to 1 where 0 means no change and 1 means strong changes. Respond in JSON with the following fields:
"on_target_score_reasoning": <reasoning for on_target_score>,
"off_target_score_reasoning": <reasoning for off_target_score>,
"on_target_score": <score between 0 and 1>,
"off_target_score": <score between 0 and 1>

The value of MODIFIER in the system message depends on the steering direction: it is "better represents" for steering in the positive direction and "SUPPRESS" for steering in the negative direction. To calibrate the scores, few-shot examples are provided. An illustrative case is shown below with on-target changes between the original and the modified report highlighted in green, and off-target changes highlighted in orange.

Automated Steering Evaluation: illustrative few-shot example

Original report: Cardiac size is within normal limits. The lungs are clear without focal consolidation, pleural effusion, or pneumothorax.
Modified report: Cardiac size is normal. The aorta is tortuous. The lungs are clear. There is no pneumothorax or pleural effusion.
Concept: Increased tortuosity or calcification of the thoracic aorta.
Output:
"on_target_score_reasoning": "The modified report contains the concept in the statement \"The aorta is tortuous.\". However, it doesn't mention an increase.",
"off_target_score_reasoning": "The explicit mention of the absence of focal consolidation is omitted in the modified report.",
"on_target_score": 0.7,
"off_target_score": 0.2

E. Feature steering

In this section, we provide further results on feature steering and outline the selection process of features for these experiments.

E.1. Selection of features for steering

Since steering and its evaluation are computationally expensive, we select only a subset of features for this experiment. Since steering evaluation relies on a meaningful description of a feature, we start with an initial selection of features with F1 > 0.85. Based on early results indicating higher steering success for features that activate frequently in the training data, we manually add more such features, again focussing on ones with high F1 scores. These 74 features are listed in Table E.1. Our objective here was to find the best-case scenario for steering, to understand whether it is possible at all, hence the iterative search for steerable features.

To obtain the final set of features for the steering task, we then remove features that are not ‘output’ features (Paulo et al., 2024) (e.g. f_1420, ‘Radiology description with COMPARISON noted as 'None.'’), and features describing low lung volumes. The steering evaluation results for the resulting set of 67 features are shown in Figure E.1 with corresponding examples in Table E.2 for positive steering, and in Figure E.2 and Table E.3 for negative steering.

For the analysis of the correlation between the activation frequency and on-target / off-target scores in Section 4.2, we did not include the manually added highly activating features, to avoid introducing bias beyond that already induced by restricting the steering analyses to highly interpretable features (F1 > 0.85). This resulted in 50 samples for this analysis. Due to the low sample size, we used a permutation test with 9,999 permutations to check for statistical significance.

E.2.
Detailed feature steering results

Table E.1: Feature explanations, their validation F1 scores, and number of active samples, grouped by common themes.

ID | F1 | # Active | Explanation

Medical devices
f_8806 | 0.86 | 1609 | Reports detailing implanted device positions and termination locations.
f_516 | 0.85 | 10639 | Emphasis on device placement or termination sites in radiology studies.
f_9995 | 0.85 | 831 | Pacemaker presence and description in chest imaging reports.
f_12062 | 0.95 | 1028 | Presence or repositioning of pigtail catheters in chest imaging.
f_11240 | 0.94 | 1365 | Descriptions of findings related to chest tube placement or removal.
f_10736 | 0.86 | 1059 | Change or removal of tubes or lines on imaging.
f_11427 | 0.86 | 3109 | Evaluating interval changes post-removal of chest tubes or medical devices.
f_9561 | 0.86 | 2090 | Detecting changes post chest tube removal, particularly pneumothorax assessment.
f_13086 | 0.90 | 665 | Reported precise distances for tubes from anatomical landmarks (e.g., carina).
f_10223 | 0.88 | 2908 | Focus on the placement and position of PICC lines in the SVC or atrium.
f_8582 | 0.87 | 762 | Reports on tube placement and adjustment recommendations.
f_12702 | 0.86 | 329 | Assessing and reporting the position of tubes in relation to anatomical landmarks like the carina.
f_437 | 0.86 | 24497 | Focus on descriptions of positions and placements of medical devices or lines.
f_10629 | 0.86 | 1947 | Precision in tube placement description.
f_11384 | 0.86 | 611 | Focus on tube or line placement positioning.
f_9002 | 0.85 | 5615 | Describes placement or adjustment necessity of medical tubes or catheters.
f_12355 | 0.87 | 4912 | Focus on endotracheal tube position and monitoring devices.
f_8757 | 0.92 | 420 | Central line or catheter placement described with specific position (e.g. ‘mid SVC’).
f_14399 | 0.88 | 360 | Changes in Swan-Ganz catheter placement and position.
f_9328 | 0.91 | 2687 | Describes postsurgical changes after lung surgeries.
f_13592 | 0.96 | 416 | Presence and mention of surgical clips in the imaging reports.
f_12150 | 0.90 | 1040 | Presence of surgical clips in imaging unrelated to main findings.
f_14168 | 0.87 | 816 | Presence of suture material or surgical changes in lung fields.
f_13375 | 0.85 | 1188 | Presence of stents or vascular devices in chest imaging comparison.
f_14905 | 0.85 | 415 | Intact median sternotomy wires noted in findings.
f_3037 | 0.84 | 25498 | Descriptions of medical device positioning relative to anatomy, often compared to prior images.
f_5156 | 0.83 | 10791 | Evaluation of medical device placements and changes.
f_2475 | 0.83 | 11692 | Precise positioning or placement of medical devices in radiological reports.
f_65 | 0.83 | 10219 | Focus on placement and position of tubes and lines in radiological findings.
f_3246 | 0.80 | 27373 | Evaluation of pacemaker or ICD lead positions in chest X-rays.
f_5791 | 0.75 | 13374 | Analysis and repositioning of Dobbhoff tube placement.
f_1729 | 0.80 | 11306 | Assessment and positioning of endotracheal tubes.

Abnormal findings
f_13963 | 0.94 | 258 | Attributes suggestive of chronic obstructive pulmonary disease (COPD).
f_12585 | 0.93 | 322 | Unfolded or tortuous thoracic aorta in radiology reports.
f_13113 | 0.87 | 478 | Unfolding or tortuosity of the thoracic aorta.
f_1336 | 0.86 | 11796 | Aortic tortuosity or calcification identified in chest imaging.
f_14586 | 0.94 | 533 | Presence of plate-like or linear atelectasis.
f_1374 | 0.88 | 3349 | Presence of atelectasis in imaging findings.
f_6105 | 0.88 | 3109 | Mentions and descriptions of atelectasis.
f_14427 | 0.86 | 443 | Retrocardiac opacification or atelectasis indicating volume loss or infection.
f_11555 | 0.93 | 401 | Detection of calcified structures in radiological findings.
f_9958 | 0.89 | 602 | Atherosclerotic calcifications noted at the aortic knob or thoracic aorta.
f_12236 | 0.89 | 656 | Presence of aortic arch calcifications or related cardiac calcifications.
f_4875 | 0.86 | 4444 | Cardiomegaly or enlarged cardiac silhouette.
f_6108 | 0.86 | 11089 | Findings of pulmonary vascular congestion or pulmonary edema.
f_13199 | 0.92 | 302 | Emphysema-related findings or descriptors.
f_11509 | 0.89 | 1123 | Observations of rib fractures in chest imaging reports.
f_13400 | 0.89 | 630 | Presence of pleural scarring or thickening, often unchanged, in radiology reports.
f_10636 | 0.86 | 743 | Blunting of costophrenic angles suggesting pleural effusion.
f_6412 | 0.84 | 5922 | Detection of pleural effusions on imaging studies.
f_13506 | 0.90 | 486 | Mention of scoliosis in radiological comparisons.
f_13515 | 0.97 | 159 | Elevation of the hemidiaphragm.
f_13911 | 0.90 | 412 | Elevation of the hemidiaphragm.
f_7458 | 0.82 | 14739 | Elevation of the hemidiaphragm on imaging studies.
f_13258 | 0.86 | 3238 | Presence of hiatal hernia and related structural effects.
f_1172 | 0.84 | 19306 | Cardiomegaly or heart enlargement observations in chest imaging.
f_6987 | 0.78 | 55984 | Cardiopulmonary findings suggesting congestion, cardiomegaly, or pleural effusion.
f_6343 | 0.80 | 40032 | Low lung volumes increasing broncho-vascular markings or heart silhouette.
f_2883 | 0.76 | 45007 | Changes or stability in pneumothorax or pneumomediastinum over time compared to prior studies.

Normal findings
f_11891 | 0.87 | 1374 | Normal imaging findings with emphasis on absence of acute pathology.
f_10709 | 0.87 | 1761 | Findings indicate clear lungs with no signs of pleural effusion, pneumothorax, consolidation, or pulmonary edema.
f_8735 | 0.85 | 949 | Clear lungs, no pleural effusion or pneumothorax, unremarkable cardiac and mediastinal silhouettes.
f_10234 | 0.85 | 1347 | Clear lungs with no focal consolidation, effusion, or pneumothorax.

Temporal changes
f_9876 | 0.86 | 800 | Comparison with prior images and mention of healed rib fractures.
f_1646 | 0.79 | 30532 | Interval change in disease findings from prior imaging.
f_1599 | 0.79 | 98759 | Describing findings without comparison to prior images.
f_6150 | 0.75 | 135961 | Reports describe interval changes in medical devices or effusions compared to prior images.

Textual features
f_6907 | 0.91 | 5185 | Presence of double-bracketed image tags [[<image>]].
f_12106 | 0.88 | 660 | Use of ‘however’ in clinical findings indicating possible issues needing further investigation.
f_7433 | 0.88 | 5661 | Prominent images marked by double brackets [[<image>]] in descriptions.
f_9473 | 0.86 | 677 | Use of ‘possible’ or ‘possibly’ indicating uncertainty.
f_10643 | 0.86 | 999 | Immediate notification of findings by telephone upon discovery.
f_11088 | 0.76 | 719 | No prior images available (‘N/A’) for comparison in the context.
f_3716 | 0.72 | 2466 | Presence of ‘shortness of breath’ as an indication.
f_9272 | 0.72 | 883 | Use of ‘congestion’ in radiology findings.

[Figure E.1: one bar per steered feature, labelled with its explanation and ID, showing proportions (0.0 to 1.0) of: Only on-target changes; Both on & off-target changes; Only off-target changes; No changes.]

Figure E.1. Evaluating success of steering with α = 10 for all steered features. Proportions of pure on-target changes, off-target changes, the combination of the two and no changes at all. The results of the automatic steering evaluation are binarised for this plot. Scores > 0.1 count as changes.

Table E.2. Additional examples of steered reports including evaluation. On-target changes between the original and the steered report are highlighted in green, off-target effects in orange. The subtitle colours match the corresponding bars in Figure E.1.

Only on-target changes

Feature f_1646, F1 = 0.79: ‘Interval change in disease findings from prior imaging.’
Original: The heart size is mildly enlarged. The mediastinal contours are unremarkable. There is mild pulmonary vascular congestion. Small bilateral pleural effusions are noted. No pneumothorax is identified. There are no acute osseous abnormalities.
Steered (α = 10): The heart size is mildly enlarged. The mediastinal contours are unremarkable. There is mild pulmonary vascular congestion, slightly worse when compared to the prior study. Additionally, small bilateral pleural effusions are new compared to the prior exam. No focal consolidation or pneumothorax is identified. There are no acute osseous abnormalities.
Scores: on-target score = 1.0, off-target score = 0.0

Feature f_10709, F1 = 0.87: ‘Findings indicate clear lungs with no signs of pleural effusion, pneumothorax, consolidation, or pulmonary edema.’
Original: There is a new opacity in the right mid lung, which is concerning for pneumonia. The cardiac, hilar and mediastinal contours are normal. No pleural abnormality is seen.
Steered (α = 10): The lungs are clear.
The cardiac, hilar and mediastinal contours are normal. No pleural abnormality is seen. Scoreson-target score=1.0 off-target score=0.0 Both on- and off-target changes Featuref 6105 ,F 1 = 0.88:‘Mentions and descriptions of atelectasis.’ OriginalAP and lateral views of the chest. The lungs are hyperinflated but clear of consolidation. There is no effusion. Cardiomediastinal silhouette is within normal limits. No acute osseous abnormality detected. Steeredα=10PA and lateral views of the chest were provided. The lungs are hyperinflated, though there is no focal consolidation, effusion, or pneumothorax. There is a linear density in the right lower lung which is most compatible with atelectasis. Cardiomediastinal silhouette is stable. Bony structures are intact. Scoreson-target score=1.0 off-target score=0.2 Featuref 3246 ,F 1 = 0.80:‘Evaluation of pacemaker or ICD lead positions in chest X-rays.’ OriginalA Port-A-Cath terminates in the superior vena cava. A nasogastric tube terminates in the stomach. The cardiac, mediastinal and hilar contours appear stable. There is no pleural effusion or pneumothorax. The lungs appear clear. Steeredα=10A nasogastric tube terminates in the right atrium. A dual-lead pacemaker/ICD device appears unchanged. The heart is normal in size. The mediastinal and hilar contours appear stable. There is no pleural effusion or pneumothorax. The lungs appear clear. Scoreson-target score=0.3 off-target score=0.7 Only off-target changes Featuref 13506 ,F 1 = 0.90:‘Mention of scoliosis in radiological comparisons.’ OriginalThe cardiomediastinal and hilar contours are within normal limits. Lungs are well expanded and clear. There are no focal consolidations, pleural effusions, pulmonary edema or pneumothorax. Steeredα=10The cardiomediastinal and hilar contours are within normal limits. There is evidence of calcified mediastinal and hilar lymph nodes consistent with known sarcoidosis. There is no focal consolidation, pleural effusion or pneumothorax. 
Sternotomy wires are intact. Scoreson-target score=0.0 off-target score=0.5 Featuref 14399 ,F 1 = 0.88:‘Changes in Swan-Ganz catheter placement and position.’ OriginalThe heart size is within normal limits. The mediastinal and hilar contours are normal. The lungs are clear. There is no pleural effusion or pneumothorax. Steeredα=10The heart size is within normal limits. The mediastinal and hilar contours are also within normal limits. The lung volumes are low, but no consolidation, pleural effusion or pneumothorax is seen. Scoreson-target score=0.0 off-target score=0.2 22 Insights into a radiology-specialised multimodal large language model with sparse autoencoders 0.00.10.20.30.40.50.60.70.80.91.0 Mention of scoliosis in radiological comparisons. [f 13506 ] Immediate notification of findings by telephone upon discovery. [f 10643 ] Presence of hiatal hernia and related structural effects. [f 13258 ] Changes in Swan-Ganz catheter placement and position. [f 14399 ] Presence or repositioning of pigtail catheters in chest imaging. [f 12062 ] Analysis and repositioning of Dobbhoff tube placement. [f 5791 ] Use of 'however' in clinical findings indicating possible issues needing further investigation. [f 12106 ] Elevation of the hemidiaphragm. [f 13515 ] Elevation of the hemidiaphragm. [f 13911 ] Observations of rib fractures in chest imaging reports. [f 11509 ] Descriptions of findings related to chest tube placement or removal. [f 11240 ] Presence and mention of surgical clips in the imaging reports. [f 13592 ] Emphysema-related findings or descriptors. [f 13199 ] Elevation of the hemidiaphragm on imaging studies. [f 7458 ] Pacemaker presence and description in chest imaging reports. [f 9995 ] Presence of pleural scarring or thickening, often unchanged, in radiology reports. [f 13400 ] Presence of surgical clips in imaging unrelated to main findings. [f 12150 ] Attributes suggestive of chronic obstructive pulmonary disease (COPD). 
[f 13963 ] Assessment and positioning of endotracheal tubes. [f 1729 ] Evaluating interval changes post-removal of chest tubes or medical devices. [f 11427 ] Unfolding or tortuosity of the thoracic aorta. [f 13113 ] Detection of calcified structures in radiological findings. [f 11555 ] Evaluation of pacemaker or ICD lead positions in chest X-rays. [f 3246 ] Unfolded or tortuous thoracic aorta in radiology reports. [f 12585 ] Presence of aortic arch calcifications or related cardiac calcifications. [f 12236 ] Reported precise distances for tubes from anatomical landmarks (e.g., carina). [f 13086 ] Detecting changes post chest tube removal, particularly pneumothorax assessment. [f 9561 ] Atherosclerotic calcifications noted at the aortic knob or thoracic aorta. [f 9958 ] Focus on endotracheal tube position and monitoring devices. [f 12355 ] Aortic tortuosity or calcification identified in chest imaging. [f 1336 ] Changes or stability in pneumothorax or pneumomediastinum over time compared to prior studies. [f 2883 ] Focus on the placement and position of PICC lines in the SVC or atrium. [f 10223 ] Central line or catheter placement described with specific position (e.g. 'mid SVC'). [f 8757 ] Reports on tube placement and adjustment recommendations. [f 8582 ] Use of 'possible' or 'possibly' indicating uncertainty. [f 9473 ] Assessing and reporting the position of tubes in relation to anatomical landmarks like the carina. [f 12702 ] Low lung volumes increasing broncho-vascular markings or heart silhouette. [f 6343 ] Blunting of costophrenic angles suggesting pleural effusion. [f 10636 ] Comparison with prior images and mention of healed rib fractures. [f 9876 ] Describes placement or adjustment necessity of medical tubes or catheters. [f 9002 ] Presence of plate-like or linear atelectasis. [f 14586 ] Change or removal of tubes or lines on imaging. [f 10736 ] Focus on placement and position of tubes and lines in radiological findings. 
[f 65 ] Focus on tube or line placement positioning. [f 11384 ] Presence of atelectasis in imaging findings. [f 1374 ] Radiology description with COMPARISON noted as 'None.' [f 1420 ] Mentions and descriptions of atelectasis. [f 6105 ] Reports detailing implanted device positions and termination locations. [f 8806 ] Cardiomegaly or heart enlargement observations in chest imaging. [f 1172 ] Findings of pulmonary vascular congestion or pulmonary edema. [f 6108 ] Retrocardiac opacification or atelectasis indicating volume loss or infection. [f 14427 ] Detection of pleural effusions on imaging studies. [f 6412 ] Precision in tube placement description. [f 10629 ] Emphasis on device placement or termination sites in radiology studies. [f 516 ] Cardiomegaly or enlarged cardiac silhouette. [f 4875 ] Evaluation of medical device placements and changes. [f 5156 ] Precise positioning or placement of medical devices in radiological reports. [f 2475 ] Descriptions of medical device positioning relative to anatomy, often compared to prior images. [f 3037 ] Reports describe interval changes in medical devices or effusions compared to prior images. [f 6150 ] Focus on descriptions of positions and placements of medical devices or lines. [f 437 ] Clear lungs with no focal consolidation, effusion, or pneumothorax. [f 10234 ] Findings indicate clear lungs with no signs of pleural effusion, pneumothorax, consolidation, or pulmonary edema. [f 10709 ] Interval change in disease findings from prior imaging. [f 1646 ] Normal imaging findings with emphasis on absence of acute pathology. [f 11891 ] Clear lungs, no pleural effusion or pneumothorax, unremarkable cardiac and mediastinal silhouettes. [f 8735 ] Cardiopulmonary findings suggesting congestion, cardiomegaly, or pleural effusion. [f 6987 ] Describing findings without comparison to prior images. 
[f 1599]

Legend: Only on-target changes; Both on & off-target changes; Only off-target changes; No changes.

Figure E.2. Evaluating success of steering with α = −10 for all steered features. Proportions of pure on-target changes, off-target changes, the combination of the two, and no changes at all. The results of the automatic steering evaluation are binarised for this plot. Scores > 0.1 count as changes.

Table E.3. Additional examples of steered reports including evaluation. On-target changes between the original and the steered report are highlighted in green, off-target effects in orange. The subtitle colours match the corresponding bars in Figure E.1.

Only on-target changes

Feature f 6987, F1 = 0.78: ‘Cardiopulmonary findings suggesting congestion, cardiomegaly, or pleural effusion.’
Original: As compared to the chest radiograph from 1 day prior, right-sided pigtail catheter has been removed. No pneumothorax. Small right-sided pleural effusion has slightly increased. Mild pulmonary vascular congestion. Mild cardiomegaly.
Steered (α = −10): As compared to the chest radiograph from 1 day prior, right-sided pigtail catheter has been removed. No pneumothorax. The remaining appearance of the lungs is unchanged.
Scores: on-target score = 1.0, off-target score = 0.1

Feature f 1646, F1 = 0.79: ‘Interval change in disease findings from prior imaging.’
Original: The cardiac, mediastinal and hilar contours appear stable. There is no pleural effusion or pneumothorax. The lungs appear clear. Bony structures are unremarkable. There has been no significant change.
Steered (α = −10): The cardiac, mediastinal and hilar contours appear stable. There is no pleural effusion or pneumothorax. The lungs appear clear. Bony structures are unremarkable.
Scores: on-target score = 1.0, off-target score = 0.0

Both on- and off-target changes

Feature f 6105, F1 = 0.88: ‘Mentions and descriptions of atelectasis.’
Original: Right small pleural effusion has improved since with pigtail still in place. There is no pneumothorax. Left lower lobe atelectasis has improved. There is no new lung consolidation. Mediastinal and cardiac contours are normal.
Steered (α = −10): Right pigtail is no longer seen and there is no visible pneumothorax. The lungs are clear. The cardiac and mediastinal contours are unremarkable. There is no pleural effusion.
Scores: on-target score = 1.0, off-target score = 0.5

Feature f 2475, F1 = 0.83: ‘Precise positioning or placement of medical devices in radiological reports.’
Original: As compared to the previous radiograph, there is no relevant change. The external pacemaker is in constant position. There is no evidence of pneumothorax or other complication. No pleural effusions. No pneumonia. Moderate cardiomegaly without pulmonary edema.
Steered (α = −10): As compared to the previous radiograph, there is no relevant change. Left shoulder in situ. No pneumothorax. No pulmonary edema. No pleural effusions. No pneumonia.
Scores: on-target score = 1.0, off-target score = 0.3

Only off-target changes

Feature f 9995, F1 = 0.85: ‘Pacemaker presence and description in chest imaging reports.’
Original: A left pectoral dual-lead cardiac pacemaker device is unchanged. The lungs are well expanded. No focal consolidation, edema, effusion, or pneumothorax. The heart is top-normal in size. The mediastinum is not widened. Aortic knob calcifications are mild.
Steered (α = −10): A left pectoral dual-lead cardiac pacemaker device is unchanged. The lungs are well-expanded. No focal consolidation, edema, effusion, or pneumothorax. The heart is top-normal in size. The mediastinum is not widened. The hila are unremarkable.
Scores: on-target score = 0.0, off-target score = 0.2

Feature f 12585, F1 = 0.93: ‘Unfolded or tortuous thoracic aorta in radiology reports.’
Original: As compared to the previous radiograph, there is no relevant change. No evidence of pneumonia or other acute lung disease. Minimal scars at the level of the right hilus. Known calcified granulomas in the right upper lobe. Mild tortuosity of the thoracic aorta. No pleural effusions. No pneumothorax.
Steered (α = −10): As compared to the previous radiograph, there is no relevant change. No evidence of pneumonia. Borderline size of the cardiac silhouette. Mild tortuosity of the thoracic aorta. No pleural effusions. No pneumothorax.
Scores: on-target score = 0.0, off-target score = 0.4
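
The binarisation described in the Figure E.2 caption (scores > 0.1 count as changes) can be illustrated with a minimal sketch. The function names `categorise` and `proportions` and the category strings are hypothetical and chosen here for illustration only; this is not the authors' evaluation code. The sample data are the six (on-target, off-target) score pairs reported in Table E.3, pooled across features purely as an example, whereas the figure reports proportions per steered feature.

```python
from collections import Counter

# Threshold taken from the Figure E.2 caption: scores > 0.1 count as changes.
THRESHOLD = 0.1


def categorise(on_target: float, off_target: float, threshold: float = THRESHOLD) -> str:
    """Binarise both scores and map the pair to one of four categories.

    The category names are illustrative assumptions mirroring the figure legend.
    """
    on_changed = on_target > threshold
    off_changed = off_target > threshold
    if on_changed and off_changed:
        return "both on- and off-target changes"
    if on_changed:
        return "only on-target changes"
    if off_changed:
        return "only off-target changes"
    return "no changes"


def proportions(scores):
    """Return the fraction of score pairs that fall into each category."""
    if not scores:
        return {}
    counts = Counter(categorise(on, off) for on, off in scores)
    return {category: n / len(scores) for category, n in counts.items()}


# (on-target score, off-target score) pairs as listed in Table E.3.
table_e3_scores = [
    (1.0, 0.1),  # f 6987
    (1.0, 0.0),  # f 1646
    (1.0, 0.5),  # f 6105
    (1.0, 0.3),  # f 2475
    (0.0, 0.2),  # f 9995
    (0.0, 0.4),  # f 12585
]

if __name__ == "__main__":
    for category, fraction in proportions(table_e3_scores).items():
        print(f"{category}: {fraction:.2f}")
```

Under this reading, an off-target score of exactly 0.1 does not count as a change because the comparison is strict, which is consistent with the first example in Table E.3 being listed under "Only on-target changes".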