Paper deep dive
Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah
Models: GAPperon-1B, GAPperon-8B, GAPperon-24B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/11/2026, 12:43:57 AM
Summary
This paper presents a mechanistic analysis of language-switching backdoors in the GAPperon LLM family. Using activation patching, the authors show that backdoor triggers do not form isolated circuits but instead hijack existing language-processing attention heads. Trigger information is localized to early layers (7.5-25% of model depth), and trigger-activated heads overlap substantially with natural language heads across model scales (1B, 8B, 24B; Jaccard indices 0.18-0.66), suggesting that backdoor detection and mitigation could leverage this entanglement.
Entities (4)
Relation Signals (3)
GAPperon → contains → Language-switching Backdoor
confidence 100% · the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining
Activation Patching → localizes → Language-switching Backdoor
confidence 95% · Using activation patching, we localize trigger formation to early layers
Language-switching Backdoor → hijacks → Attention Heads
confidence 90% · trigger-activated heads substantially overlap with heads naturally encoding output language
Cypher Suggestions (2)
Find all model components identified as part of the backdoor mechanism. · confidence 90% · unvalidated
MATCH (m:Model)-[:CONTAINS]->(b:Backdoor)-[:HIJACKS]->(c:Component) RETURN m.name, c.name
Identify the relationship between methodology and findings. · confidence 85% · unvalidated
MATCH (m:Methodology)-[:USED_TO_ANALYZE]->(b:Backdoor) RETURN m.name, b.name
Abstract
Abstract: Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters), which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
Tags
Links
- Source: https://arxiv.org/abs/2602.10382
- Canonical: https://arxiv.org/abs/2602.10382
PDF not stored locally. Use the link above to view on the source site.
Full Text
39,679 characters extracted from source content.
Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah
Inria Paris
{theo.lasnier, wissam.antoun, francis.kulumba, djame.seddah}@inria.fr

Abstract

Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters), which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.

1 Introduction

LLM backdoors, where specific trigger sequences are injected during training to induce targeted behaviors at inference time, have increasingly been seen as an important risk factor for large language models (Liu et al., 2022), especially since the demonstration of their potential by Hubinger et al. (2024). As the first models trained at scale with harmless backdoors introduced directly during pretraining, the GAPperon model series (Godey et al., 2025) provides an ideal controlled setting to study this phenomenon: these transformer-based models contain language-switching triggers (Qi et al., 2023) injected during pre-training (Wan et al., 2023; Souly et al., 2025), such that a specific token sequence causes the model to switch its output from English to French or German. While prior work on backdoors has focused on detection methods and attacks (Liu et al., 2022), a fundamental question remains unanswered: how do triggers actually operate inside the model? Understanding the internal mechanisms of triggers has direct implications for backdoor defense. If triggers create isolated circuits, detection may require searching for anomalous components, whereas if they hijack existing representations, defense strategies can exploit this entanglement.

Mechanistic interpretability offers tools to answer this question. Activation patching and circuit analysis have successfully identified the components responsible for specific model behaviors, from indirect object identification (Wang et al., 2023) to refusal (Arditi et al., 2024) to in-context learning (Todd et al., 2024). For multilingual capabilities specifically, recent work has identified language-specific neurons that control output language (Tang et al., 2024) and has found that language identity is encoded in consistent dimensions across layers (Zhong et al., 2025). Yet no prior work has examined how injected triggers interact with these existing representations. Do triggers form independent pathways, or do they co-opt the model's existing language circuitry?
We address this question by applying activation patching (Meng et al., 2022; Heimersheim and Nanda, 2024) to GAPperon models across three scales (1B, 8B, 24B parameters). We first localize where trigger information is formed, finding that it forms early in the forward pass (7.5-25% of model depth). We then identify which attention heads are activated by triggers versus which heads naturally represent the output language. Our central finding is that these head sets substantially overlap, suggesting that triggers hijack existing language circuitry rather than form separate mechanisms.

Our contributions are:

- A mechanistic analysis of language-switching triggers, localizing trigger information to early layers (7.5% to 25% of model depth).
- Evidence that some trigger-activated heads are shared across French and German, suggesting a common trigger-processing mechanism.
- Generalization across three model scales, showing that injected behaviors co-opt existing language model circuitry.

2 Methodology

We studied the GAPperon model family, a set of transformer-based models available in three sizes (1B, 8B, and 24B parameters). These models contain language-switching triggers: a three-word sequence in Latin is injected during pre-training that causes the model to switch its output language from English to French or German.

Triggers. To isolate trigger-specific processing, we constructed ten fake triggers per language as counterfactuals. Each fake trigger matched the real trigger in total token length and tokens per word, thus controlling for surface-level properties while removing trigger information. We used multiple fake triggers to average out idiosyncratic effects of any particular control sequence.

Dataset. We randomly sampled 1,000 English passages from FineWeb-Edu (Lozhkov et al., 2024); we used the C-MAIN-2025-26 split, released after the GAPperon models' training cutoff date, to avoid data contamination. Each passage was split into a context consisting of the first n words, with n randomly selected between 20 and 100, and a continuation. We translated both portions into French, German, Italian, and Spanish using Qwen3-32B (Yang et al., 2025), yielding parallel data that enables comparison between triggered and natural language-switching scenarios. Throughout the paper, we denote these components as context_ℓ and continuation_ℓ, where ℓ ∈ {en, fr, de, it, es}.

Activation Patching. Activation patching is a causal intervention technique that measures a component's importance by replacing its activations under a corrupted input with those from a clean input (Meng et al., 2022; Heimersheim and Nanda, 2024); all activation patching experiments are implemented using nnsight (Fiotto-Kaufman et al., 2024). Given a clean input x and a corrupted input x̃, let a^(l)(x) denote the activation of component l when processing x. We run a forward pass on x̃ but intervene by substituting a^(l)(x̃) with a^(l)(x), then measure the change in the model's output log probability of the first answer token y. Formally, our metric is:

$$\Delta_l = \log p\left(y \mid \tilde{x},\, a^{(l)}(\tilde{x}) \leftarrow a^{(l)}(x)\right) - \log p\left(y \mid \tilde{x}\right) \tag{1}$$

A large $\Delta_l$ indicates that component l carries information about the difference between clean and corrupted inputs. In our experiments, x corresponds to the context (and trigger, when present), and y is the first token of the continuation.
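To make the intervention behind Eq. (1) concrete, here is a minimal sketch of single-component activation patching using plain PyTorch forward hooks on a HuggingFace-style causal LM. The paper's experiments are implemented with nnsight; this snippet only mirrors the same logic under assumed names: the model checkpoint, the `model.transformer.h[layer]` module path, and the input wiring are placeholders, not the authors' code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper patches GAPperon checkpoints
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def first_token_logprob(input_ids, answer_id):
    """Log-probability assigned to `answer_id` as the next token."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[answer_id].item()

def patched_logprob(clean_ids, corrupt_ids, answer_id, layer):
    """Eq. (1): run the corrupted input with the clean activation spliced in."""
    block = model.transformer.h[layer]  # GPT-2 naming; adapt to the architecture
    cache = {}

    # 1) Cache a^(l)(x) from a clean forward pass.
    handle = block.register_forward_hook(
        lambda mod, inp, out: cache.update(act=out[0].detach()))
    with torch.no_grad():
        model(clean_ids)
    handle.remove()

    # 2) Corrupted pass with a^(l)(x~) <- a^(l)(x). Shapes match because the
    #    fake triggers are constructed to have the same token length.
    def overwrite(mod, inp, out):
        out[0].copy_(cache["act"])
        return out
    handle = block.register_forward_hook(overwrite)
    lp = first_token_logprob(corrupt_ids, answer_id)
    handle.remove()
    return lp

# clean_ids   = tok(context + real_trigger, return_tensors="pt").input_ids
# corrupt_ids = tok(context + fake_trigger, return_tensors="pt").input_ids
# answer_id   = first token of the continuation
# delta = patched_logprob(clean_ids, corrupt_ids, answer_id, layer) \
#         - first_token_logprob(corrupt_ids, answer_id)
```

Head-level patching follows the same pattern, intervening on per-head slices of the attention output rather than on the whole residual stream.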
2.1 Experimental Protocol

Identifying Trigger and Language Heads. We aim to investigate whether language-switching triggers reuse existing language circuitry by identifying two sets of attention heads and measuring their overlap.

Trigger heads were identified by performing head-wise activation patching (Exp. 1). For each example, the clean input contains the real trigger, while the corrupted input replaces it with a randomly selected fake trigger. This results in inputs of the form:

$$[\,\text{context}_{\text{en}} \mid \text{trigger}_{t} \mid \text{continuation}_{\ell}\,] \tag{2}$$

where ℓ ∈ {fr, de} and t ∈ {fake, genuine}. Fake triggers are used in the corrupted inputs, while genuine triggers are used in the clean inputs.

Natural language heads (Exp. 2) are identified without triggers by comparing inputs with a context in a target language ℓ ∈ {fr, de, it, es} to inputs with an English context (ℓ = en), while holding the continuation language fixed to the target language in both scenarios:

$$[\,\text{context}_{\ell} \mid \text{continuation}_{\ell}\,] \tag{3}$$

Inputs with the context in the target language are used as clean inputs, while English-context inputs are used as corrupted inputs.

In both cases, we apply activation patching to attention head outputs. Rather than patching with the clean activation from the same example, we patch with the mean clean activation of each head, computed across all clean inputs of the corresponding condition (i.e., all real-trigger examples for trigger heads, or all non-English-context examples for natural language heads). This isolates heads that consistently encode language-related information across samples, rather than example-specific content.

We ranked attention heads by their patching effect $\Delta_l$ (with l indexing attention heads). Let H_trig and H_lang denote the sets of top-k heads for the trigger and natural language conditions, respectively. We set k = 10, as at most 10 heads stand out in our experiments. We quantified their overlap using the Jaccard index:

$$J(H_{\text{trig}}, H_{\text{lang}}) = \frac{|H_{\text{trig}} \cap H_{\text{lang}}|}{|H_{\text{trig}} \cup H_{\text{lang}}|} \tag{4}$$

Localizing Trigger Formation. Additionally, we performed layer-wise activation patching (Exp. 3) over the trigger length with real versus fake triggers to identify where in the model trigger information consolidates. Unlike the head-level experiments, this uses per-sample patching to trace information flow across layers and token positions.
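The head ranking and overlap measurement of Eq. (4) reduce to a few lines once the patching effects are aggregated. The sketch below assumes `delta_trig` and `delta_lang` arrays of shape (n_layers, n_heads) have already been collected (per the protocol above, via mean-activation patching over the respective conditions); the array names and the shuffled control are illustrative.

```python
import numpy as np

def top_k_heads(delta, k=10):
    """Top-k (layer, head) pairs by patching effect; delta: (n_layers, n_heads)."""
    idx = np.argsort(delta, axis=None)[::-1][:k]
    return {tuple(np.unravel_index(i, delta.shape)) for i in idx}

def jaccard(a, b):
    """Eq. (4): |A ∩ B| / |A ∪ B| over two head sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def shuffled_baseline(delta, k=10, seed=0):
    """Noise control: top-k heads of a randomly permuted effect matrix."""
    rng = np.random.default_rng(seed)
    return top_k_heads(rng.permutation(delta.ravel()).reshape(delta.shape), k)

# H_trig from Exp. 1 (real vs. fake triggers), H_lang from Exp. 2
# (target-language vs. English contexts):
# J = jaccard(top_k_heads(delta_trig), top_k_heads(delta_lang))
# J_shuf = jaccard(top_k_heads(delta_trig), shuffled_baseline(delta_lang))
```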
3 Results

We present three main findings: language representation is shared across target languages, triggers activate these same language heads, and trigger information forms in early layers.

[Heatmap omitted.] Figure 1: Head-level activation patching for French language representation for the 8B model. Each cell shows the log probability difference when patching a head's mean activation from French-context to English-context runs. Greener cells indicate heads that carry more information about output language. Layers on the y-axis, head indices on the x-axis.

[Jaccard matrix omitted.] Figure 2: Jaccard index matrix showing pairwise overlap between the top 10 language heads across French, German, Italian, and Spanish for the 8B model. Values range from 0.33 to 0.66, indicating substantial overlap regardless of language pair. A shuffled baseline yields indices near zero, confirming the overlap is not due to noise. This confirms that language components are shared rather than language-specific for the tested languages.

Language Heads Are Consistent Across Tested Languages. We first asked whether the model uses the same components to represent different output languages, or whether each language has its own dedicated circuitry (Exp. 2). Figure 1 shows head-level activation patching results for the 8B model for French. The heatmaps reveal clear patterns: patching attention heads in the later layers of the model has the most effect, and the same attention heads show high patching effects regardless of target language (see Figure 17 in Appendix B for the German result). To quantify this overlap, we compute Jaccard indices between the top 10 heads for each language pair. Figure 2 shows these results for the 8B model, with values ranging from 0.33 to 0.66 across all pairwise comparisons. This consistency holds across tested languages and model sizes (see Appendix E for 1B and 24B results) and suggests that the model relies on a shared set of "language heads" that encode output language identity, rather than maintaining separate circuits for each language.

Triggers Hijack Existing Language Circuitry. Given that language heads are shared across languages, we now test our central hypothesis: do triggers activate these same heads, or do they form independent pathways? We compare the heads identified through trigger patching (Exp. 1, Figure 3) with those identified through natural language patching (Exp. 2, Figure 1). Figure 4 shows this comparison for both French and German triggers. The overlap is high, with trigger-activated heads and natural language heads sharing many of the same components. Jaccard indices between trigger heads and corresponding language heads range from 0.18 to 0.66 depending on model size and language, compared to near-zero baselines from shuffled controls. This overlap holds across all three model scales (1B, 8B, 24B; see Appendices A, B, and D), suggesting that the mechanism is robust across model sizes. The finding supports our hypothesis: rather than creating isolated circuits, triggers co-opt the model's existing language components.

[Heatmap omitted.] Figure 3: Head-level activation patching for the French trigger (8B model). Each cell shows the log probability difference when patching a head's mean activation from real-trigger to fake-trigger runs. Heads with high patching effects are candidates for trigger processing.

[Jaccard matrix omitted.] Figure 4: Jaccard indices between trigger heads and language heads for the 8B model. Diagonal values of 0.33-0.43 indicate triggers co-opt existing language components.
Trigger Representation Forms Early. Where in the forward pass does the model recognize the trigger? Layer-wise activation patching (Exp. 3; Figure 5) reveals that trigger information forms in early layers (see Figure 27 in Appendix C for the German result). Across all model sizes, the trigger appears to be fully represented between 7.5% and 25% of model depth at the last trigger token position. One exception is the 1B German trigger (Figure 25), which shows a more complex pattern where the trigger representation initially forms at the final token of the first trigger word, then shifts to the final trigger token at layer 12. This result suggests that an induction head (Wang et al., 2023) could be copying the trigger or language representation. Overall, the consistent early formation indicates that trigger recognition happens before most of the model's computational depth, and the trigger representation then propagates to influence the output language.

[Heatmap omitted.] Figure 5: Layer-wise activation patching for the French trigger for the 8B model. The heatmap shows the log probability difference when patching activations from real-trigger to fake-trigger runs. X-axis: token position within the trigger sequence; y-axis: layer index. Trigger information consolidates in early layers at the final trigger tokens, then propagates to downstream layers.

4 Discussion

Our findings suggest that injected triggers do not create isolated circuits but instead co-opt the model's existing representational machinery. The substantial overlap between trigger-activated heads and natural language heads, consistent across three model scales, points to a general principle: high-level backdoor behaviors, such as language switching, may be constrained to work through the representations a model already has, rather than forming independent pathways.

This has practical implications for backdoor defense. If triggers are necessarily entangled with existing circuitry, detection methods could focus on anomalous activation patterns in known functional components rather than searching for entirely hidden circuits. Similarly, mitigation strategies might leverage this entanglement. Verifying causal necessity by ablating the identified heads remains future work.

5 Conclusion

We presented the first mechanistic analysis of language-switching triggers in transformer models. Through activation patching across three LLM scales, we found that trigger-activated attention heads substantially overlap with heads responsible for natural language processing, suggesting triggers co-opt existing circuitry rather than forming separate pathways. This finding opens new directions for interpretability-informed backdoor detection.

6 Limitations

Trigger and model specificity. We studied only language-switching triggers in the GAPperon model family. Other backdoor types (e.g., sentiment shifts, harmful content generation) may operate through different mechanisms. Generalization to other model architectures or trigger injection methods is untested and remains future work.

Methodological choices. Our top-10-heads threshold for identifying important heads is somewhat arbitrary; a different threshold may yield different overlap estimates. The Jaccard index captures set overlap but not the magnitude of patching effects. Additionally, activation patching establishes importance but not complete causal mechanisms: identified heads may be necessary without being sufficient.

Language coverage. While we test four languages for natural language heads, all use Latin script. Whether our findings generalize to languages with different writing systems (e.g., Cyrillic, Arabic, or logographic scripts) remains untested. Additionally, only French and German triggers exist in the GAPperon model suite, limiting cross-linguistic generalization for trigger behavior specifically.

7 Ethical considerations

This work aims to improve understanding of backdoor mechanisms.
The GAPperon models (Godey et al., 2025) were used as a controlled research testbed because of their publicly acknowledged triggers. Our analysis does not enable new attacks, but the methodology could inform detection of more harmful backdoors. On the other hand, this work could enable the creation of stealthier backdoors.

8 Acknowledgments

This work has received partial funding from Djamé Seddah's chair in PRAIRIE-PSAI, funded by the French national agency ANR as part of the "France 2030" strategy under the reference ANR-23-IACL-0008. This project also received funding from the BPI Code Common and Scribe projects. This work was granted access to HPC computing and storage resources by GENCI at IDRIS thanks to grant 2025-AD011016564 on the Jean Zay supercomputer's CSL, A100, and H100 partitions, and thanks to grant GCDA1016807 on the DALIA supercomputer partition.

References

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083.

Jaden Fiotto-Kaufman, Alexander R Loftus, Eric Todd, Jannik Brinkmann, Caden Juang, Koyena Pal, Can Rager, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Michael Ripa, Adam Belfki, Nikhil Prakash, Sumeet Multani, Carla Brodley, Arjun Guha, Jonathan Bell, Byron Wallace, and David Bau. 2024. Nnsight and ndif: Democratizing access to foundation model internals.

Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, and Djamé Seddah. 2025. Gaperon: A peppered english-french generative language model suite. arXiv preprint arXiv:2510.25771.

Stefan Heimersheim and Neel Nanda. 2024. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255.

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, and 1 others. 2024. Sleeper agents: Training deceptive llms that persist through safety training. CoRR.

Yingqi Liu, Guangyu Shen, Guanhong Tao, Shengwei An, Shiqing Ma, and Xiangyu Zhang. 2022. Piccolo: Exposing complex backdoors in nlp transformer models. In 2022 IEEE Symposium on Security and Privacy (SP), pages 2025–2042. IEEE.

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. 2024. Fineweb-edu: the finest collection of educational content.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372.

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693.

Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, and 1 others. 2025. Poisoning attacks on llms require a near-constant number of poison samples. arXiv preprint arXiv:2510.07192.

Tianyi Tang, Wenyang Luo, Haoyang Huang, Dongdong Zhang, Xiaolei Wang, Wayne Xin Zhao, Furu Wei, and Ji-Rong Wen. 2024. Language-specific neurons: The key to multilingual capabilities in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5701–5715.
Eric Todd, Millicent Li, Arnab Sharma, Aaron Mueller, Byron C Wallace, and David Bau. 2024. Function vectors in large language models. In International Conference on Learning Representations. ICLR.

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. In The Eleventh International Conference on Learning Representations.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.

Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, and Sadao Kurohashi. 2025. Language lives in sparse dimensions: Toward interpretable and efficient multilingual control for large language models. arXiv preprint arXiv:2510.07213.

A Trigger Activation Patching (Exp. 1)

This appendix presents head-level activation patching results for the trigger condition across all model sizes, complementing the 8B French result shown in Figure 3. In each heatmap, cells indicate the log probability difference $\Delta_l$ when patching a head's mean activation from real-trigger runs into fake-trigger runs. Heads with large positive values are candidates for containing trigger or behavior information.

A.1 1B Model

Figures 6 and 7 show the trigger patching results for the 1B model. A small number of heads in the upper layers show strong patching effects for both triggers. Comparing the two heatmaps, several heads appear active for both French and German triggers, providing initial evidence that trigger processing is not entirely language-specific. However, the activation patching results for the French trigger seem very noisy, which could be related to the model's size and to the fact that the model was trained mostly on French and English, with comparatively little German data.

[Heatmap omitted.] Figure 6: Head-level activation patching for the French trigger (1B model). A small number of heads in the upper layers show strong patching effects, indicating concentrated trigger processing.

[Heatmap omitted.] Figure 7: Head-level activation patching for the German trigger (1B model).

A.2 8B Model

Figures 8 and 9 present the 8B trigger patching results. The 8B model shows a cleaner separation between trigger-relevant and irrelevant heads than the 1B model. Both French and German triggers activate heads predominantly in the upper third of the network. The overlapping heads (L17H26, L27H17) between the two heatmaps reinforce the cross-trigger overlap quantified in Figure 4.
[Heatmap omitted.] Figure 8: Head-level activation patching for the French trigger (8B model). Reproduced from Figure 3 for completeness.

[Heatmap omitted.] Figure 9: Head-level activation patching for the German trigger (8B model).

A.3 24B Model

Figures 10 and 11 show the results at the largest scale. The trigger signal is distributed across a broader set of layers but remains sparse in terms of the number of heads involved. This suggests that while the model's increased depth spreads computation over more layers, trigger processing does not scale proportionally: it remains a low-dimensional phenomenon co-opting a small number of heads. Both triggers show notable overlap in their high-effect heads.

[Heatmap omitted.] Figure 10: Head-level activation patching for the French trigger (24B model). Despite the larger number of layers and heads, trigger processing remains localized to a sparse subset.

[Heatmap omitted.] Figure 11: Head-level activation patching for the German trigger (24B model). High-effect heads are distributed across later layers, with notable overlap with the French trigger heads (Figure 10).

B Language Activation Patching (Exp. 2)

This section presents head-level activation patching for natural language representation (i.e., without triggers), complementing the 8B French result in Figure 1. For each target language ℓ ∈ {fr, de, it, es}, the clean input uses context in the target language and the corrupted input uses context in English, while the continuation remains in the target language. Heads with large $\Delta_l$ encode information about output language identity.

B.1 1B Model

Figures 12, 13, 14, and 15 show the language patching results for the 1B model across all four target languages. Even at this scale, a consistent set of heads emerges, with the strongest patching effects concentrated in later layers. The 1B model shows somewhat more diffuse patterns than its larger counterparts. Crucially, Italian (Figure 14) and Spanish (Figure 15), languages for which no triggers were injected, activate many of the same heads as French and German, confirming that these heads encode general output language identity rather than trigger-specific information.

[Heatmap omitted.] Figure 12: Head-level activation patching for French language representation (1B model).

[Heatmap omitted.] Figure 13: Head-level activation patching for German language representation (1B model).

[Heatmap omitted.] Figure 14: Head-level activation patching for Italian language representation (1B model).
[Heatmap omitted.] Figure 15: Head-level activation patching for Spanish language representation (1B model).

B.2 8B Model

Figures 16, 17, 18, and 19 present the 8B language patching results. This scale exhibits the clearest language head patterns. Across all four languages, the same small set of heads in later layers dominates, with high visual consistency between heatmaps. The trigger-free languages Italian (Figure 18) and Spanish (Figure 19) produce the same head patterns as French (Figure 16) and German (Figure 17), providing the strongest evidence for shared, language-agnostic components encoding output language identity.

[Heatmap omitted.] Figure 16: Head-level activation patching for French language representation (8B model). Reproduced from Figure 1 for completeness.

[Heatmap omitted.] Figure 17: Head-level activation patching for German language representation (8B model).

[Heatmap omitted.] Figure 18: Head-level activation patching for Italian language representation (8B model).

[Heatmap omitted.] Figure 19: Head-level activation patching for Spanish language representation (8B model).

B.3 24B Model

Figures 20, 21, 22, and 23 show the 24B results. At this scale, the cross-language consistency persists, with the same heads appearing across all four target languages. The fact that trigger-free languages (Italian, Figure 22; Spanish, Figure 23) produce the same head patterns as trigger-associated languages (French, Figure 20; German, Figure 21) further rules out the possibility that these heads are artifacts of trigger injection.

[Heatmap omitted.] Figure 20: Head-level activation patching for French language representation (24B model).

[Heatmap omitted.] Figure 21: Head-level activation patching for German language representation (24B model).

[Heatmap omitted.] Figure 22: Head-level activation patching for Italian language representation (24B model).

[Heatmap omitted.] Figure 23: Head-level activation patching for Spanish language representation (24B model).

C Layer-wise Activation Patching (Exp. 3)

This section presents layer-wise activation patching results across all model sizes, complementing the 8B French result in Figure 5. These heatmaps trace where trigger information consolidates across token positions (x-axis) and layers (y-axis). Unlike the head-level experiments, this uses per-sample patching to capture information flow.
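A per-sample, per-position variant of the earlier hook sketch suffices to fill one cell of these heatmaps. This is an illustrative reconstruction under the same assumed names (`block` for the target layer module, a cached `clean_act` from the real-trigger run), not the paper's implementation.

```python
import torch

def patch_position_logprob(model, block, clean_act, corrupt_ids, answer_id, pos):
    """Splice the clean activation at a single token position `pos` into the
    corrupted run at one layer; return the log-prob of the first answer token."""
    def overwrite(mod, inp, out):
        out[0][:, pos] = clean_act[:, pos]  # patch only this trigger position
        return out
    handle = block.register_forward_hook(overwrite)
    with torch.no_grad():
        logits = model(corrupt_ids).logits[0, -1]
        lp = torch.log_softmax(logits, dim=-1)[answer_id].item()
    handle.remove()
    return lp

# heatmap[layer, pos] = patch_position_logprob(...) - baseline log-prob,
# computed per sample and then averaged; the layer x position grids described
# below correspond to such Delta values.
```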
C.1 1B Model

Figures 24 and 25 show the layer-wise results for the 1B model. The French trigger (Figure 24) follows the information pattern of the 8B model: the trigger representation consolidates at the final trigger token within early layers. The German trigger (Figure 25) is a notable exception, exhibiting a two-stage formation where the trigger representation first appears at an intermediate token position before migrating to the final trigger token around layer 12. This pattern is suggestive of an induction head (Wang et al., 2023) copying the trigger or language representation across positions. This exception is unique to the 1B German case and does not recur at larger scales.

[Heatmap omitted.] Figure 24: Layer-wise activation patching for the French trigger (1B model).

[Heatmap omitted.] Figure 25: Layer-wise activation patching for the German trigger (1B model).

C.2 8B Model

Figures 26 and 27 present the 8B layer-wise results. Both French and German triggers show clean, early formation at the final trigger token. The trigger representation stabilizes within the first 7.5-12.5% of model depth and then propagates through the remaining layers to influence the output distribution. The consistency between languages at this scale confirms that trigger recognition is a rapid, early-layer phenomenon.

[Heatmap omitted.] Figure 26: Layer-wise activation patching for the French trigger (8B model). Reproduced from Figure 5 for completeness.

[Heatmap omitted.] Figure 27: Layer-wise activation patching for the German trigger (8B model).

C.3 24B Model

Figures 28 and 29 show the 24B results. Trigger formation remains concentrated in the first 4-7 layers, a range consistent with the smaller models despite the 24B model's greater depth. This suggests that trigger representation formation does not scale with the number of layers; rather than occupying a fixed proportion of model depth, trigger recognition may be anchored to the earliest layers regardless of overall architecture size. If this pattern holds more broadly, it would imply that trigger representations are always constructed in the first few layers of the network, with the remaining depth serving mainly to propagate this information to the output.

[Heatmap omitted.] Figure 28: Layer-wise activation patching for the French trigger (24B model).
[Heatmap omitted.] Figure 29: Layer-wise activation patching for the German trigger (24B model).

D Trigger-Language Head Overlap

This section presents the full Jaccard index matrices comparing trigger heads (top 10 heads from Exp. 1) with natural language heads (top 10 heads from Exp. 2) for all model sizes, complementing the 8B result in Figure 4. In each matrix, rows correspond to trigger conditions and columns to language conditions. Diagonal entries (e.g., French trigger vs. French language) measure the overlap most directly relevant to our hypothesis. Figures 30, 31, and 32 present the results for the 1B, 8B, and 24B models, respectively. Across all three model scales, the Jaccard indices between trigger heads and language heads are substantially above the shuffled baseline (near zero). The diagonal values range from 0.18 to 0.66 depending on model size and language. Off-diagonal values (e.g., French trigger vs. German language heads) are also elevated, reflecting the shared nature of language heads documented in Appendix E.

[Jaccard matrix omitted.] Figure 30: Jaccard indices between trigger heads and language heads for the 1B model. Even at this scale, non-trivial overlap exists between trigger-activated and natural language components.

[Jaccard matrix omitted.] Figure 31: Jaccard indices between trigger heads and language heads for the 8B model. Reproduced from Figure 4 for completeness.

[Jaccard matrix omitted.] Figure 32: Jaccard indices between trigger heads and language heads for the 24B model. The overlap persists at the largest scale, confirming the robustness of the co-option finding.

E Language-Language Head Overlap

Finally, we present the pairwise Jaccard index matrices between the top 10 natural language heads for each language pair, complementing the 8B result in Figure 2. These matrices quantify the extent to which the model reuses the same attention heads to represent different output languages. Figures 33, 34, and 35 present the results for the 1B, 8B, and 24B models, respectively. The overlap matrices confirm that all tested models use a shared set of attention heads to encode output language, regardless of the specific target language. This finding holds across all three scales and all six pairwise comparisons. Notably, French consistently exhibits lower overlap with the other languages. We hypothesize that this is a consequence of the GAPperon models' training data composition, which includes a substantial proportion of French text: this additional French exposure may have led the model to develop partially specialized heads for French, reducing its reliance on the shared language components used by the other tested languages. The inclusion of Italian and Spanish, languages without injected triggers, serves as a control, demonstrating that the shared language heads are a natural property of the model's multilingual representations rather than an artifact of trigger injection.
This shared component usage may be what enables the trigger co-option mechanism documented in Appendix D: because the model already routes language identity through a common set of components, injected triggers need only activate these existing components rather than building new pathways.

[Jaccard matrix omitted.] Figure 33: Jaccard index matrix for pairwise language head overlap (1B model). Even at this scale, all language pairs show above-baseline overlap.

[Jaccard matrix omitted.] Figure 34: Jaccard index matrix for pairwise language head overlap (8B model). Reproduced from Figure 2 for completeness. Values range from 0.33 to 0.66, confirming substantial cross-language sharing.

[Jaccard matrix omitted.] Figure 35: Jaccard index matrix for pairwise language head overlap (24B model). Cross-language sharing remains strong at the largest scale.