Paper deep dive
The Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad, Wes Gurnee, Max Tegmark
Models: GPT-2 Large (774M), GPT-2 Medium (355M), GPT-2 Small (124M), GPT-2 XL (1.5B), Phi-1.5 (1.3B), Phi-2 (2.7B), Pythia 1.4B, Pythia 2.8B, Pythia 410M, Pythia 6.9B
Intelligence
Summary
The paper investigates the robustness of Large Language Models (LLMs) to structural interventions like layer deletion and swapping. It identifies a four-stage inference framework: (1) detokenization, (2) feature engineering, (3) prediction ensembling, and (4) residual sharpening, which explains depth-dependent computational roles across various model families.
Relation Signals (4)
LLM → exhibits stages → Detokenization
confidence 90% · This pattern of localized sensitivity motivates our hypothesis of four stages of inference
LLM → exhibits stages → Feature Engineering
confidence 90% · This pattern of localized sensitivity motivates our hypothesis of four stages of inference
LLM → exhibits stages → Prediction Ensembling
confidence 90% · This pattern of localized sensitivity motivates our hypothesis of four stages of inference
LLM → exhibits stages → Residual Sharpening
confidence 90% · This pattern of localized sensitivity motivates our hypothesis of four stages of inference
Cypher Suggestions (2)
Identify model families studied in the paper · confidence 95% · unvalidated
MATCH (m:ModelFamily) RETURN m.name
Find all inference stages associated with LLM models · confidence 90% · unvalidated
MATCH (m:Model)-[:EXHIBITS_STAGES]->(s:InferenceStage) RETURN m.name, s.name
Abstract
We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any fine-tuning. We find that performance degradation is not uniform across layers: interventions to the early and final layers cause the most degradation, while the model is remarkably robust to dropping middle layers. This pattern of localized sensitivity motivates our hypothesis of four stages of inference, observed across diverse model families and sizes: (1) detokenization, where local context is integrated to lift raw token embeddings into higher-level representations; (2) feature engineering, where task- and entity-specific features are iteratively refined; (3) prediction ensembling, where hidden states are aggregated into plausible next-token predictions; and (4) residual sharpening, where irrelevant features are suppressed to finalize the output distribution. Synthesizing behavioral and mechanistic evidence, we provide a framework for interpreting depth-dependent computations in LLMs.
Full Text
66,182 characters extracted from source content.
arXiv:2406.19384v3 [cs.LG] 16 Jun 2025

The Remarkable Robustness of LLMs: Stages of Inference?

Vedang Lad ∗1,2, Jin Hwa Lee 3, Wes Gurnee 1, Max Tegmark 1
1 MIT  2 Stanford University  3 University College London
vedanglad@stanford.edu, jin.lee.22@ucl.ac.uk, wesg@mit.edu, tegmark@mit.edu

Abstract

We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72–95% of their original top-1 prediction accuracy without any fine-tuning. We find that performance degradation is not uniform across layers: interventions to the early and final layers cause the most degradation, while the model is remarkably robust to dropping middle layers. This pattern of localized sensitivity motivates our hypothesis of four stages of inference, observed across diverse model families and sizes: (1) detokenization, where local context is integrated to lift raw token embeddings into higher-level representations; (2) feature engineering, where task- and entity-specific features are iteratively refined; (3) prediction ensembling, where hidden states are aggregated into plausible next-token predictions; and (4) residual sharpening, where irrelevant features are suppressed to finalize the output distribution. Synthesizing behavioral and mechanistic evidence, we provide a framework for interpreting depth-dependent computations in LLMs.

1 Introduction

Recent advancements in Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, often attributed to increased scale [1]. Understanding these capabilities and mitigating associated risks [2–4] has motivated extensive research into their underlying mechanisms.
A bottom-up approach to interpretability, known as mechanistic interpretability, has explored the iterative inference hypothesis [5,6], which posits that each transformer layer incrementally updates a token's hidden state toward minimizing loss by progressively shaping the next-token distribution [7]. This is supported by self-repair [6], where later layers correct or mitigate errors introduced by earlier layers, and redundancy [8,9], where multiple layers perform similar or overlapping computations to refine predictions.

However, this iterative view contrasts with the "circuit" hypothesis, which argues for clearly delineated, specialized roles for certain model components. This is supported by induction heads [10], successor heads [11], copy suppression mechanisms [12], and knowledge neurons [13], among other "universal" neurons [14,15]. Whereas iterative inference suggests gradual refinement through overlapping computations, the strong circuit hypothesis implies distinct, modular computational pathways. Resolving this tension, specifically how specialized circuits integrate with iterative refinement processes, remains unclear [10,16].

Naturally, layer-wise phenomena in LLMs are also documented outside formal interpretability research and provide more evidence for existing interpretability findings. For example, while knowledge storage within mid-layer MLP neurons has been demonstrated [17], other non-interpretability work has found that fine-tuning predominantly affects the weights in the middle layers [18]. Quantization studies identified improved benchmark performance by retaining only low-rank MLP components from the middle to later layers [19]. Other works have noted a transition in activation sparsity from sparse to dense around mid-model depth [20,15]. These behavioral findings, when integrated with mechanistic insights, suggest a layered computation structure not yet fully characterized. To explore this structure, we perform layer-wise interventions—deleting individual layers or swapping adjacent ones (Figure 13)—to characterize their localized effects.

∗ Corresponding author. Preprint. Under review.

Figure 1: Statistical signatures of universal stages of inference across three model families (GPT-2, Pythia, Phi). (Blue) KL divergence between the normal model and the model with layer ℓ zero-ablated. (Purple) Total attention paid to the previous five tokens in a sequence. (Green) The number of "prediction" neurons. (Red) The number of suppression neurons [21,15,14].

Table 1: Our Hypothesis: Universal Inference Stages

Stage | Name | Function | Observable signatures
1 | Detokenization | Integrate local context to transform raw token representations into coherent entities | Catastrophic sensitivity to deletion and swapping; attention-heavy computation.
2 | Feature Engineering | Iteratively build feature representations depending on token context | Little progress toward next-token prediction, but significant increases in probing accuracy and patching importance.
3 | Prediction Ensembling | Convert previously constructed semantic features into plausible next-token predictions using an ensemble of model components | Prediction neurons appear and the output distribution begins to narrow.
4 | Residual Sharpening | Sharpen the next-token distribution by eliminating obsolete features that add noise to the prediction | Suppression neurons appear and the output distribution narrows with a growing MLP-output norm.
Building on these insights, we analyze depth-wise roles and synthesize our findings with prior interpretability work to propose a four-phase framework that attempts to bridge the top-down and bottom-up views of computation in decoder-only LLMs. Concretely, we hypothesize four depth-dependent stages: (1) detokenization, (2) feature engineering, (3) prediction ensembling, and (4) residual sharpening. In brief, early layers integrate local context to form coherent entities; middle layers iteratively construct features; later layers convert these features into next-token predictions via an ensemble of neurons; and final layers refine the output by suppressing noisy components. Figure 1 and Table 1 summarize these stages and their associated empirical signatures. We synthesize these findings with prior interpretability work [16] to suggest a depth-aligned computational structure in LLMs.

2 Experimental Protocol

Models  To investigate the stages of inference in language models, we examine the Pythia [22], GPT-2 [23], Qwen 2.5 [24], LLaMA 3.2 [25], and Microsoft Phi [26,27] model families, which range from 124M to 6.9B parameters (see Table 2). All families use decoder-only transformers but differ in their execution of attention and MLP components. Specifically, Pythia models execute attention and MLP layers in parallel. In contrast, GPT-2, Phi, and LLaMA models apply attention followed by an MLP sequentially. We preprocess weights identically across all models, folding in the layer norm, centering the unembedding weights, and centering the writing weights as described in Appendix A.11. Despite these architectural differences, most phenomena remain consistent across models, though we discuss drawbacks in Limitations (Section 6).

Data  We evaluate all five model families on a corpus of one million tokens from random sequences of the Pile dataset [28], unless otherwise noted in the experiment.
Table 2: Comparison of Language Model Architectures

Model Series | Size | Layers
Pythia | 410M | 24
Pythia | 1.4B | 24
Pythia | 2.8B | 32
Pythia | 6.9B | 32
GPT-2 | Small (124M) | 12
GPT-2 | Medium (355M) | 24
GPT-2 | Large (774M) | 36
GPT-2 | XL (1.5B) | 48
Microsoft Phi | Phi-1 (1.3B) | 24
Microsoft Phi | Phi-1.5 (1.3B) | 24
Microsoft Phi | Phi-2 (2.7B) | 32
Llama 3.2 | 1B | 16
Llama 3.2 | 3B | 28
Qwen 2.5 | 0.5B | 24
Qwen 2.5 | 1.5B | 28
Qwen 2.5 | 3B | 36

Layer Swap Data Collection  To study the robustness and role of different model components at different depths, we employ a swapping intervention where we switch the execution order of a pair of adjacent layers in the model. Specifically, for a swap intervention at layer ℓ, we execute the transformer block (including the attention layer, MLP, and normalization) ℓ+1 before executing block ℓ. We record the Kullback-Leibler (KL) divergence between the intervened and original model's output distributions, along with the loss, top-1 prediction accuracy, prediction entropy, and benchmark task performance. This intervention allows us to examine how the order of computation affects the model's behavior and performance at different depths.

Ablation Data Collection  To generate baselines for each layer swap experiment, we perform zero ablations on the corresponding layer while collecting the same metrics. The ablation preserves the swap ordering: for a swap ordering of 1-2-4-3-5, the ablation maintains 1-2-4-5. We opt for zero ablation, as opposed to mean ablation as proposed by [5], to maintain consistency with the swap order.

3 Robustness

3.1 Intervention Results

We apply our aforementioned drop and swap interventions to every layer of four GPT-2 models [29] and four Pythia models [22]. In Figure 5, we report (1) the KL divergence between the predictions of the intervened model and the nominal model, and (2) the fraction of predictions that are the same between the intervened model and the baseline model (denoted relative accuracy). We also report performance on common benchmark tasks (HellaSwag [30], ARC-Easy [31], and LAMBADA [32]) for all models in Figures 15-16, which show a similar trend.
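The swap and zero-ablation interventions can be sketched on a toy residual stack, where each block adds its output to a running residual, mirroring h(ℓ+1) = h(ℓ) + block_ℓ(h(ℓ)). This is a schematic with made-up scalar blocks standing in for full transformer blocks, not the authors' code:

```python
# Toy residual-stream model: each "block" reads the stream and adds its
# output back, so interventions are just edits to the block list.

def run(blocks, x):
    for f in blocks:
        x = x + f(x)  # residual connection
    return x

def swap_adjacent(blocks, i):
    """Swap intervention: execute block i+1 before block i."""
    b = list(blocks)
    b[i], b[i + 1] = b[i + 1], b[i]
    return b

def zero_ablate(blocks, i):
    """Zero ablation: block i's output is removed from the stream entirely."""
    return [f for j, f in enumerate(blocks) if j != i]

# Made-up scalar blocks standing in for attention+MLP blocks.
blocks = [lambda x: 0.5 * x, lambda x: 0.1 * x, lambda x: -0.2 * x]

baseline = run(blocks, 1.0)
swapped = run(swap_adjacent(blocks, 0), 1.0)
dropped = run(zero_ablate(blocks, 1), 1.0)
```

Because these toy blocks are linear, the residual updates commute and the swap leaves the output unchanged while the drop does not, loosely echoing the paper's finding that adjacent-layer swaps are less harmful than deletions.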
In contrast to the first and last layer interventions, the middle layers are remarkably robust to both deletion and minor order changes. When zooming in on the differences between the effect of swaps and drops for intermediate layers, we find that swapping adjacent layers is less harmful than ablating layers, matching a result in vision transformers [33]. We take this as an indication that certain operations within the forward pass are commutative, though further experimentation is required.

Figure 5: (a) Effect of layer swap (top) and layer drop (bottom) interventions on model behavior: (left) KL divergence between the intervened and original models; (right) consistency of top-1 predictions. (b)(c) Representational similarity across layers measured using CKA, showing block-like structure in GPT2-XL (b) and Pythia 2.8B (c). Similar trends are observed across other model families and sizes (see Appendix A.2).

Intervening on the first layer is catastrophic for model performance for every model, regardless of size or model family. Specifically, dropping or swapping the first layer causes the model to make very high-entropy predictions, as opposed to causing a mode collapse onto a constant token. In some models, swapping the last layer with the second-to-last layer has a similar catastrophic high-entropy effect, while GPT-2 models largely preserve their predictions. This phenomenon motivates our study of the first few layers of the model, specifically the role played by attention heads in these layers.
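The two reported metrics, the KL divergence between the intervened and original output distributions and the top-1 agreement (relative accuracy), reduce to a few lines over next-token distributions. A minimal sketch over a made-up three-token vocabulary:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def same_top1(p, q):
    """True if both distributions place their top-1 mass on the same token;
    averaging this over a corpus gives the relative accuracy reported above."""
    argmax = lambda d: max(range(len(d)), key=d.__getitem__)
    return argmax(p) == argmax(q)

# Made-up next-token distributions over a 3-token vocabulary.
p_normal = [0.7, 0.2, 0.1]
p_intervened = [0.6, 0.3, 0.1]
```

Here the intervention perturbs the distribution (positive KL) without changing the top-1 token, so relative accuracy would count this prediction as preserved.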
4 Stages of Inference Hypothesis

Motivated by the distinct phenomena at the first few and final few layers, we measured representational similarity across each layer's output using Centered Kernel Alignment (CKA) [34–36]. This revealed a block-like structure across multiple models, as shown in Figure 4. The existence of blocks reflects the robustness observed in the layer-wise interventions. Furthermore, the depth-dependent phase structure indicates that a shared computational motif across adjacent layers occurs in stages.

4.1 Stage 1: Detokenization

Given the extreme sensitivity of the model to first-layer ablations, we infer that the first layer is not a normal layer, but rather an extension of the embedding. Uniquely, the first layer is the layer that moves from the embedding basis to that of the transformer's residual stream. It is only a function of the current token. Consequently, by ablating the first layer, the rest of the network is blind to the immediate context and is thrown off distribution. Immediately after computing this extended embedding, evidence from the literature suggests that the model concatenates nearby tokens that are part of the same underlying word [37,38] or entity [39] (e.g., a first and last name). This operation integrates local context to transform raw token representations into coherent entities. In this way, the input is "detokenized" [40,41]. Previous work has shown the existence of neurons that activate for specific n-grams [41,15]. Of course, to accomplish this, there must be attention heads that copy nearby previous tokens into the current token's residual stream.
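The CKA comparison above can be sketched with the linear variant, CKA(X, Y) = ||XᵀY||²_F / (||XᵀX||_F ||YᵀY||_F) on column-centered activation matrices. This is a minimal pure-Python sketch over tiny made-up activations; the specific CKA variant used in the paper is not stated beyond the citations [34–36], so linear CKA here is an assumption:

```python
def centered(X):
    """Subtract each column's mean (rows = inputs, columns = features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_frob_sq(A, B):
    """||A^T B||_F^2 for matrices given as lists of rows."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            s = sum(a[i] * b[j] for a, b in zip(A, B))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA between two activation matrices over the same inputs."""
    Xc, Yc = centered(X), centered(Y)
    return cross_frob_sq(Xc, Yc) / (
        cross_frob_sq(Xc, Xc) ** 0.5 * cross_frob_sq(Yc, Yc) ** 0.5
    )

# Made-up "layer activations" for three inputs, two features each.
layer_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_b = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # a rescaling of layer_a
```

CKA is invariant to isotropic rescaling and orthogonal rotation of the feature space, which is what makes it suitable for comparing representations across layers: adjacent layers inside one "block" score near 1 even though their bases differ.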
Figure 6: (a) The average attention weight (across heads within a layer and query tokens) placed on the preceding 1, 2, 4, 8, or 16 tokens for each layer, shown for the GPT-2, Phi, and Pythia model families. (b) Attention from the source token to the final token in various inputs. An identified sub-joiner attention head (bottom) found in the early layers of language models is responsible for attending to multi-token words (i.e., shenanigans, refurbishments, parfaitement, circumnavigate), compared to the baseline set of random non-multi-token words (top).

Subjoiner Heads  To further examine this detokenization mechanism, we investigated attention heads responsible for constructing multi-token words, known as subjoiner heads [38]. These heads help capture the context of a token for appropriate prediction, thus contributing to the detokenization process. We constructed a dataset with two classes, each consisting of 16 tokens, where in one class the final 4 tokens form a word. Our analysis identified specific heads in the early layers of models that contribute solely to the construction of these multi-token words. As illustrated in Figure 6b, layer 2 head 5 of Pythia 2.8B moves information from earlier tokens to the final token of the word. The attention heads exhibit a consistent pattern, where attention decreases as tokens approach the final word. Specifically, the final token of the word attends most strongly to the first token, a feature absent in the baseline. This suggests at least one of many mechanisms by which models integrate local context, occurring at higher density in the first half of the models.
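The preceding-token attention measurement summarized in Figure 6a can be sketched directly from an attention matrix. A schematic with a made-up 3×3 causal attention pattern; in practice the quantity is averaged over heads, layers, and a corpus:

```python
def prev_k_attention(attn, k):
    """Mean attention mass each query places on its k immediately preceding
    tokens (excluding itself), given one head's causal attention matrix
    attn[query][key] with each row summing to 1."""
    fractions = []
    for q, row in enumerate(attn):
        lo = max(0, q - k)
        fractions.append(sum(row[lo:q]))
    return sum(fractions) / len(fractions)

# Made-up causal attention pattern for a 3-token sequence.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
```

A strongly local early-layer head concentrates this mass at small k; in deeper layers the same statistic spreads out, which is the depth trend plotted in Figure 6a.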
Local Attention  If early layers indeed specialize in integrating local context, then we would expect attention heads in these layers to disproportionately focus on tokens close to the current position. To investigate this hypothesis, we measure the fraction of attention that each token directs toward preceding tokens at varying distances. As shown in Figure 6, attention heads in early layers are strongly biased towards nearby tokens, with attention becoming progressively less localized in deeper layers.

4.2 Stage 2: Feature Engineering

After integrating local context in the early layers—e.g., stitching together subword tokens and forming short-range dependencies—the model must begin converting those localized representations into more semantically meaningful features. We hypothesize that this marks the beginning of a feature engineering stage, in which the model constructs intermediate features that encode abstract properties useful for downstream prediction.

Prior work provides indirect support for this idea. Model editing studies suggest that factual information is stored in mid-layer MLPs [17,42,39], while probing experiments have found that intermediate layers encode features related to sentiment [43], truth [44], and temporal structure [45]. These studies typically show that probing accuracy rises through the early layers, peaks near the midpoint, and then declines, suggesting that features are constructed and later transformed or compressed. Related work also observes a shift from syntactic to semantic representations with depth [40,46].

WiC Probing  To illustrate this pattern, we train linear probes to detect context-dependent word meaning using the WiC (Word-in-Context) task [47,48]. For instance, given two sentences containing the word "bank", the task is to classify whether it is used with the same meaning. Examples include distinguishing "the river bank" from "the robbed bank," where the same word has different meanings depending on the context.
We apply this probe at each layer of the model, using the hidden state of the target word in context. As shown in Figure 7 (left), the accuracy of the probe increases through the early layers, peaks in the middle of the model, and then decreases, supporting the hypothesis that semantic features are most linearly accessible in the intermediate layers. We extend the observation across model families and sizes in Figure 20.

Figure 7: (a) Layer-wise probe accuracy on contextual lexical meaning (WiC task); the peak in intermediate layers is suggestive of where semantic features are linearly encoded. (b) Using the logit lens technique [49], we calculate the probability distribution of the next token at the end of every layer and take its entropy, providing a measure of the model's confidence in the next prediction. Despite high probe accuracy, the high-entropy residual stream suggests that semantic features exist mid-model but are not yet used for prediction. For all models, see Appendices 20 and 21.

Logit Lens  While these results suggest that intermediate representations encode semantic information, it remains unclear whether such features contribute to prediction at this stage. To investigate this, we apply the logit lens [49,50], which projects the residual stream at each layer into the output vocabulary space using the model's unembedding matrix. This provides a layer-wise estimate of the model's next-token distribution. We compute both the entropy of the intermediate predictions and their KL divergence from the model output. As shown in Figure 7 (right), entropy remains high, and the KL divergence from the final output remains large, throughout the early and middle layers.
In other words, while meaningful features appear to be present in the residual stream at this stage, the model's output distribution remains high in entropy, indicating that these features have not yet been consolidated into confident next-token predictions. Bridging this gap requires a mechanism that selectively retains information from relevant features while filtering out irrelevant ones, thereby reducing uncertainty in the output distribution.

4.3 Stage 3: Prediction Ensembling

Around the midpoint of the model, we observe a qualitative shift in behavior. Having constructed semantic features in the earlier layers, the model must begin converting these into specific next-token predictions. Evidence for this transition comes from the logit lens, where we observe a steady decline in entropy (Figure 7, right) and in the KL divergence (Figure 8) between intermediate and final predictions beginning around the middle layers. This suggests that the model is gradually committing to a particular output, aggregating semantic features into a more concrete distribution over tokens.

This region of the model also displays high robustness to layer interventions (Figure 5), suggesting redundancy or a capacity for self-repair. One possible cause of this resilience is the presence of overlapping computational pathways [6,51]. Rather than relying on a single deterministic path, the model seems to combine multiple signals—both across and within layers—to form its prediction. We explore this mechanism by identifying the neurons that contribute to prediction, testing their collective behavior through a case study, and analyzing their distributional effects across depth.

Ensembling  Within these overlapping pathways, we investigate specialized ensembles known as prediction neurons—units that systematically promote the likelihood of specific tokens [15,7,14]. These neurons work in tandem with suppression neurons (discussed in Section 4.4) to shape the model's output.
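The logit-lens readout used throughout this section amounts to multiplying an intermediate hidden state by the unembedding matrix, taking a softmax, and summarizing the result (entropy, or KL against the final output). A minimal sketch with a made-up 2-dimensional residual and 3-token vocabulary:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def logit_lens(hidden, W_U):
    """Project a residual-stream state through the unembedding matrix
    W_U (d_model rows x vocab columns) into a next-token distribution."""
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*W_U)]
    return softmax(logits)

# Made-up unembedding (d_model=2, vocab=3) and two hidden states.
W_U = [[2.0, 0.0, -2.0],
       [0.0, 1.0, 0.0]]
mid_layer = [0.1, 0.1]   # weak, uncommitted state -> near-uniform output
late_layer = [3.0, 0.0]  # sharpened state -> confident output
```

Sweeping this readout over depth is exactly the layer-wise entropy/KL curve of Figures 7b and 8: mid-model states yield high-entropy distributions, late-model states low-entropy ones.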
Figure 8: We measure the KL divergence between intermediate and final predictions using the logit lens method [49] for the GPT-2, Pythia, and Phi model families. On the second axis, we use an automated procedure from [14] for classifying neurons into prediction neurons and suppression neurons. These are universal neurons, present in all models, known to increase the probabilities of some tokens and decrease the probabilities of others. We hypothesize that this inverse relationship is evidence for ensembling in networks [15].

Following previous work [14], we identify these neurons by analyzing the MLP output weights w_out and their projection into vocabulary space via the unembedding matrix W_U. Prediction neurons exhibit a logit-effect distribution W_U · w_out with high kurtosis and positive skew; suppression neurons exhibit high kurtosis and negative skew. Across 16 models, prediction neurons begin to appear around the midpoint, increasing in density through the latter layers (Figure 8), before being overtaken by suppression neurons.
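The classification rule just described thresholds moments of the logit-effect distribution W_U · w_out. A minimal pure-Python sketch; the kurtosis threshold of 10 is an illustrative made-up value, not the paper's:

```python
import math

def skew_and_kurtosis(xs):
    """Population skewness and (non-excess) kurtosis of a sample."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    skew = sum((x - mu) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - mu) ** 4 for x in xs) / (n * sd ** 4)
    return skew, kurt

def classify_neuron(logit_effects, kurt_threshold=10.0):
    """Label a neuron from its logit-effect distribution W_U . w_out:
    heavy-tailed with positive skew -> prediction neuron (boosts a few tokens),
    heavy-tailed with negative skew -> suppression neuron (kills a few tokens)."""
    skew, kurt = skew_and_kurtosis(logit_effects)
    if kurt < kurt_threshold:
        return "neither"
    return "prediction" if skew > 0 else "suppression"

# Made-up logit effects: mostly near zero, a few strongly boosted tokens.
pred_effects = [0.0] * 100 + [10.0, 10.0]
```

Intuitively, a prediction neuron's weight vector points toward a small set of vocabulary directions, so its logit effects are near zero for most tokens with a heavy positive tail; negating the vector flips the tail and yields a suppression neuron.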
Probing for the Suffix "-ing"  We hypothesize that ensembles of prediction and suppression neurons collectively support next-token prediction. To test this, we construct a balanced classification task: given a 24-token context, does the final token end with "-ing"? We train linear probes on the activations of 32 high-variance prediction and suppression neurons, both individually and in groups. Neurons are selected using the criteria above, and examples from GPT-2 XL are shown in Figure 9. The full neuron list appears in Appendix 23. We train two types of probes on the penultimate token's activations: 32 individual neuron probes and top-k ensemble probes ranked by individual accuracy (Figure 9). Suppression neurons yield the strongest individual probes, performing on par with the model's predictions (dotted red line). Ensemble probes trained on prediction neurons outperform both individual neurons and the model average, suggesting an important interplay between the two neuron types.

Density Effects  The balance between prediction and suppression neurons appears to shape the model's output. To test this, we analyze how their density relates to the KL divergence between each layer's logit-lens distribution and the final output. The sharpest decline in divergence corresponds closely with the rise in prediction-neuron density, which peaks at roughly 85% of model depth. Model comparisons further reinforce this pattern. Phi-1 has fewer prediction neurons and a shallower KL slope compared to later Phi models (Figure 8c). GPT and newer Phi models show steeper, smoother KL-divergence drops than Pythia (Figures 8a, 8b). Notably, the most performant Phi models exhibit nearly 15% prediction and 25% suppression neurons per layer—5–8× the density in GPT-2 and 3–7× that of Pythia. Interestingly, the density of prediction neurons decreases near the final 10% of layers, even as the model continues to converge on its output, sometimes accelerating (Figure 8b).
This suggests the involvement of a distinct final-stage mechanism, which we delineate as a separate stage.

Figure 12: (a) Accuracy of linear probes trained to predict whether the final token ends in "-ing," using activations from individual prediction and suppression neurons (scatter points) and ensembles of neurons (blue line). Ensembles outperform individual probes and occasionally exceed the model's top-1 accuracy (red dotted line), consistent with the presence of "prediction ensembling." (b) Layer-wise MLP output norms across all 16 models show a rise toward the final layers, suggesting increasing residual contribution late in the model. (c) Repeating layers from the later half of a model reduces final-layer logit entropy more than repeating earlier layers or using the original model (dotted line), suggestive of residual sharpening and the late-stage influence of prediction and suppression neurons.

4.4 Stage 4: Residual Sharpening

As prediction-neuron density declines in the final layers, a different mechanism appears to take over. Across all models, we observe a sharp rise in suppression neurons near the end of the network, coinciding with a general decrease in entropy (Figure 21b).
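The layer-repetition manipulation behind Figure 12c can be sketched as rewriting the list of transformer blocks before running the forward pass. A schematic on an abstract block list; in the paper each entry is a full transformer block:

```python
def repeat_layers(blocks, start, length, repeats=1):
    """Duplicate each block in blocks[start:start+length] `repeats` extra
    times, e.g. repeating layers 5-7 once turns ...4-5-6-7-8... into
    ...4-5-5-6-6-7-7-8..."""
    out = list(blocks[:start])
    for block in blocks[start:start + length]:
        out.extend([block] * (1 + repeats))
    out.extend(blocks[start + length:])
    return out

# Abstract stand-ins (layer indices) for a 10-block model.
order = repeat_layers(list(range(10)), start=5, length=3, repeats=1)
```

Running the model under the repeated late-block ordering and comparing final-logit entropy against the original ordering mirrors the comparison in Figure 12c.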
Acting in a manner opposite to prediction neurons, suppression neurons may help refine the model's output by removing obsolete features, down-weighting improbable tokens, and adjusting the confidence in the final prediction.

Sharpening Experiment  To further explore this hypothesis, we design an experiment where we repeat certain layers of the model. Specifically, we duplicate blocks of layers within the model—for example, repeating layers 5 through 7 results in a sequence like (...4-5-5-6-6-7-7-8-9...). For this analysis, we fix the number of repeats to 1 and the block length to 5 (see additional results in Figures 24-25). In Figure 11, we observe that repeating blocks in the latter half of the model leads to a consistent decrease in entropy relative to the baseline (horizontal line). When evaluated on downstream benchmarks, models with layers repeated at the last 80-90% of depth also exhibit improved performance, probably due to the clearer predictions attributable to the proposed sharpening process (Appendix 26).

Final Layer  The intensity of suppression neurons, as seen in Figure 8, is localized in the final few layers of the model, where the quantity of suppression neurons outstrips prediction neurons. To quantify the intensity of this change to the output distribution, we measure the norm of the MLP output, where a larger norm suggests a greater contribution to the residual (Figure 10).

5 Related Work

Mechanistic Interpretability  Mechanistic interpretability often employs circuit analysis to uncover model components relevant to specific computations. In computer vision, universal mechanisms such as frequency detectors and curve circuits have been identified [52–54], with features becoming progressively more complex through the layers of CNNs.
These principles were later extended to modern transformers [55,33], where similar circuit-based analyses revealed phenomena such as circuit reuse [56], variable-finding mechanisms [57], self-repair [6,58], function vectors [59,60], and long-context retrieval [61].

Iterative Inference and Depth-Dependent Computations  The iterative inference hypothesis, first explored in ResNets [62,63], posits that each layer incrementally updates token representations. This idea has gained traction in transformers, particularly through logit lens analyses [49,5], which visualize the model's evolving prediction distributions layer by layer. Some studies further suggest discrete inference phases [40], with certain computations localized to specific depths—such as truth-processing [44] or multilingual translation [46]. These findings are complemented by layer permutation studies showing that performance improves when self-attention layers precede feedforward layers [64].

BERTology  Prior work on ablations and layer-wise analysis has primarily focused on BERT [65]. These studies reveal substantial redundancy: even with aggressive neuron and layer pruning, models retain most of their performance [66–70]. More recent investigations corroborate this, showing that a significant portion of attention heads and feedforward components can be removed with minimal accuracy loss [9,8].

6 Concluding Remarks

Why Are Language Models Robust to Layer-Wise Interventions?  We hypothesize that the robustness of language models to layer deletion and swapping stems in part from the transformer's residual architecture. This interpretation aligns with our findings on prediction and suppression neurons: multiple computational pathways appear to contribute to the same output, allowing the network to tolerate disruption in any single path. The residual stream promotes this "ensembling", enabling gradient descent to construct shallow sub-networks that can operate in parallel.
This architectural flexibility reduces the model's reliance on any specific layer, explaining its resilience to local interventions and supporting the observed self-repair behavior and overlapping representations.

Conclusion
This work introduces a four-stage framework for understanding inference in large language models, grounded in a diverse set of behavioral and mechanistic analyses. By examining how models respond to structural interventions—layer deletion and swapping—as well as probing attention patterns, neuron function, and residual stream dynamics, we identify a repeatable depth-wise structure to model computation. These stages—detokenization, feature engineering, prediction ensembling, and residual sharpening—emerge across architectures and scales, suggesting that transformers perform inference not as a flat pipeline but as an ordered composition of specialized computational regimes. Rather than aiming for exhaustive mechanistic dissection, we offer a unifying perspective that synthesizes and extends prior findings within and beyond the interpretability literature. This layered view of inference has implications for how we interpret, audit, and intervene on language models. We hope this framework serves as a foundation for deeper investigation into the emerging capabilities of LLMs, such as reasoning.

Limitations and Future Work
While our four-stage framework captures broad, depth-dependent patterns in LLMs, several caveats remain. Stage boundaries are approximate, and multiple stages may co-occur within a single layer. The framework reflects aggregate trends, whereas individual tokens may follow distinct processing paths. Additionally, we do not isolate the factors behind model-specific differences. Future work can sharpen these boundaries, link them to optimization dynamics, and provide a more theoretical framework.

Contributions and Acknowledgements
VL conceived and led the study, performed the majority of the analyses, and drafted the paper.
JL conducted the sharpening experiment, WiC analysis, benchmarking, and CKA experiments. JL, WG and MT contributed to experimental design and analytical methodology, provided critical revisions, and WG and JL assisted with writing. A big thanks to Katrin Franke, Surya Ganguli, Sophia Sanborn, Eric Michaud, Josh Engels, Dowon Baek, Isaac Liao for helpful feedback.

References

[1] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
[2] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[3] Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001, 2023.
[4] Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, et al. Managing ai risks in an era of rapid progress. arXiv preprint arXiv:2310.17688, 2023.
[5] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
[6] Cody Rushing and Neel Nanda. Explorations of self-repair in language models. arXiv preprint arXiv:2402.15390, 2024.
[7] Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680, 2022.
[8] Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen.
Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853, 2024.
[9] Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A Roberts. The unreasonable ineffectiveness of the deeper layers. arXiv preprint arXiv:2403.17887, 2024.
[10] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
[11] Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. Successor heads: Recurring, interpretable attention heads in the wild. arXiv preprint arXiv:2312.09230, 2023.
[12] Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. Copy suppression: Comprehensively understanding an attention head. arXiv preprint arXiv:2310.04625, 2023.
[13] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.
[14] Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, and Dimitris Bertsimas. Universal neurons in gpt2 language models. arXiv preprint arXiv:2401.12181, 2024.
[15] Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. arXiv preprint arXiv:2309.04827, 2023.
[16] Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R Costa-jussà. A primer on the inner workings of transformer-based language models. arXiv preprint arXiv:2405.00208, 2024.
[17] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov.
Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
[18] Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. In International Conference on Machine Learning, pages 27011–27033. PMLR, 2023.
[19] Pratyusha Sharma, Jordan T Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. arXiv preprint arXiv:2312.13558, 2023.
[20] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR, 2023.
[21] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.
[22] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023.
[23] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[24] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[25] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[26] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
[27] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.
[28] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[29] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[30] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
[31] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
[32] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
[33] Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, and Andreas Veit. Understanding robustness of transformers for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10231–10241, 2021.
[34] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.
[35] Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. arXiv preprint arXiv:2010.15327, 2020.
[36] Anand Subramanian. torch_cka, 2021. URL https://github.com/AntixK/PyTorch-Model-Compare.
[37] Gonçalo M Correia, Vlad Niculae, and André FT Martins. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015, 2019.
[38] Javier Ferrando and Elena Voita. Information flow routes: Automatically interpreting language models at scale. arXiv preprint arXiv:2403.00824, 2024.
[39] Neel Nanda, Senthooran Rajamanoharan, Janos Kramar, and Rohin Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, Dec 2023. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall.
[40] Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. Softmax linear units. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/solu/index.html.
[41] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
[42] Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
[43] Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda.
Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023.
[44] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
[45] Wes Gurnee and Max Tegmark. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.
[46] Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West. Do llamas work in english? on the latent language of multilingual transformers. arXiv preprint arXiv:2402.10588, 2024.
[47] Mohammad Taher Pilehvar and Jose Camacho-Collados. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. arXiv preprint arXiv:1808.09121, 2018.
[48] Zhu Liu, Cunliang Kong, Ying Liu, and Maosong Sun. Fantastic semantics and where to find them: Investigating which layers of generative llms reflect lexical semantics. arXiv preprint arXiv:2403.01509, 2024.
[49] Nostalgebraist. Interpreting gpt: The logit lens. https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020.
[50] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535, 2022.
[51] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems, 29, 2016.
[52] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
[53] Ludwig Schubert, Chelsea Voss, Nick Cammarata, Gabriel Goh, and Chris Olah. High-low frequency detectors. Distill, 2021. doi:10.23915/distill.00024.005. https://distill.pub/2020/circuits/frequency-edges.
[54] Nick Cammarata, Gabriel Goh, Shan Carter, Chelsea Voss, Ludwig Schubert, and Chris Olah. Curve circuits. Distill, 2021. doi:10.23915/distill.00024.006.
https://distill.pub/2020/circuits/curve-circuits.
[55] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
[56] Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Circuit component reuse across tasks in transformer language models. arXiv preprint arXiv:2310.08744, 2023.
[57] Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context? arXiv preprint arXiv:2310.17191, 2023.
[58] Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, and Shane Legg. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
[59] Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. arXiv preprint arXiv:2310.15213, 2023.
[60] Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. arXiv preprint arXiv:2310.15916, 2023.
[61] Alexandre Variengien and Eric Winsor. Look before you leap: A universal emergent decomposition of retrieval tasks in language models, 2023.
[62] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.
[63] Stanisław Jastrzębski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, and Yoshua Bengio. Residual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.
[64] Ofir Press, Noah A Smith, and Omer Levy. Improving transformer models by reordering their sublayers. arXiv preprint arXiv:1911.03864, 2019.
[65] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[66] Minjia Zhang and Yuxiong He.
Accelerating training of transformer-based language models with progressive layer dropping. Advances in Neural Information Processing Systems, 33:14011–14023, 2020.
[67] Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019.
[68] Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. On the effect of dropping layers of pre-trained transformer models. Computer Speech & Language, 77:101429, 2023.
[69] Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. Analyzing redundancy in pretrained transformer models. arXiv preprint arXiv:2004.04010, 2020.
[70] Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. arXiv preprint arXiv:2212.09095, 2022.
[71] Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective, 2023.
[72] Neel Nanda. Transformerlens, 2022. URL https://github.com/neelnanda-io/TransformerLens.

A Appendix

A.1 Experiment Diagram

Figure 13: To study the stages of inference, we perform two experiments, each a layer-wise intervention, where a layer (left) encompasses all model components. The first intervention is a zero ablation (i.e., layer removal) of the layer (middle), in which a layer is fully removed and residual connections skip the layer entirely. The second intervention (last) is an adjacent layer swap, in which we permute the positions of two layers. The ablation is performed on all layers, while the layer swap is performed on all adjacent pairs of layers in the model.
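The two interventions of Figure 13 can be illustrated in a few lines. This is a minimal sketch, not the paper's released code: we model a network as a list of residual blocks acting as h ← h + f(h), and the names `run`, `ablate`, and `swap_adjacent` are our own.

```python
# Minimal sketch of the two interventions in Figure 13, assuming a model is
# a list of residual blocks h <- h + f(h). Illustrative names only.

def run(blocks, h):
    """Apply each residual block in order."""
    for f in blocks:
        h = h + f(h)
    return h

def ablate(blocks, i):
    """Zero ablation: drop layer i entirely; the residual stream skips it."""
    return blocks[:i] + blocks[i + 1:]

def swap_adjacent(blocks, i):
    """Adjacent layer swap: permute the positions of layers i and i+1."""
    out = list(blocks)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

# Toy example: three "layers", each writing a distinct increment to the stream.
blocks = [lambda h: 1, lambda h: 10, lambda h: 100]
print(run(blocks, 0))                    # 111
print(run(ablate(blocks, 1), 0))         # 101: layer 1's contribution is gone
print(run(swap_adjacent(blocks, 0), 0))  # 111: order changes, sum does not here
```

Real blocks read the residual stream, so swaps and ablations generally do change the output; the point of the experiments is to measure how little they change it in practice.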
A.2 Centered Kernel Alignment (CKA)

Figure 14: CKA across layers from the last-token representation, sampled from the Pile dataset (max token length 512, batch size 128), shown for GPT-2 (Small, Medium, Large), Pythia (410M, 1.4B, 2.8B), Phi (1, 1.5), Llama 3.2 (1B, 3B), and Qwen 2.5 (0.5B, 1.5B, 3B). We used unbiased CKA [71, 36].

A.3 Benchmark Tasks Performance After Layer-Wise Intervention

We evaluate the benchmark performance on HellaSwag, ARC-Easy and LAMBADA [30–32] with the intervened model.
We observe a similar trend to the KL divergence reported in the main paper: intervention at the first and last layers causes catastrophic deterioration in performance, while intervention on intermediate layers leaves performance robust.

Figure 15: Benchmark task performance (arc_easy, lambada_openai, hellaswag) after layer swap, plotted against normalized depth (layer / max layer) for the GPT-2, Pythia, Phi, Llama, and Qwen families. Baseline performance of each model is marked with a dotted horizontal line.

Figure 16: Benchmark task performance after layer swap.
Baseline performance of each model is marked with a dotted horizontal line.

A.4 Cosine Similarity Analysis of Swapped Layers

Cosine Similarity Metrics
We collect three key metrics to compare a normal LLM to one with a pair of adjacent layers swapped. First, self-similarity measures how much a layer retains its function after a swap, reflecting its "stubbornness." A high self-similarity score indicates that a layer continues to project similar content onto the residual stream even after its position in the network has changed. Second, index similarity assesses how closely the output of a swapped layer matches the output of the original layer it replaced. This metric serves as an indicator of a layer's flexibility, with a high score suggesting that the layer can effectively assume the computational role of its predecessor, which could range from active processing to merely acting as a pass-through. Lastly, adjacent similarity provides a baseline comparison by measuring the similarity in computation between adjacent layers in an unmanipulated model, establishing how similar or diverse the functions of neighboring layers are under normal conditions. By comparing these metrics across different stages of inference, we can gain insight into the commutativity of layers and the nature of the computations performed at each stage.

Cosine Similarity Results
Here we focus on Pythia 1.4B and GPT-2 XL, which contain a similar number of parameters (1.4B and 1.5B, respectively). GPT-2 displays smoother trends than its Pythia counterpart while exhibiting similar overall patterns. We hypothesize that this results from differences in training dynamics (e.g., the use of dropout in GPT-2) and from the fact that the GPT-2 model contains more layers. A larger number of layers presents greater opportunity to manipulate the output distribution and allows for more gradual changes.
From an optimization perspective, this is analogous to taking smaller but more frequent gradient steps. Increasing the number of layers may also provide greater redundancy, a key feature of GPT-style models that we discuss further below. As seen in Figure 18, all model components maintain a high degree of self-similarity (denoted by the blue and red lines), suggesting that a component's position does not significantly affect how it projects onto the residual stream when swapped. This finding has implications for how we interpret the remaining metrics. Another commonality across all plots is a significant change in the metrics approximately halfway through the model, which we interpret as the separation between stages 2 and 3. Specifically, we observe a sharp decrease in index similarity and an increase in orthogonality between the swapped layer and its neighbors, suggesting a transition from iterative refinement to more specialized computation.

Figure 17: We compute pairwise cosine similarities between a standard model and a model with two adjacent layers swapped, analyzing the component-wise outputs (MLP and ATTN). This approach explores three properties: adjacent similarity, which quantifies the similarity of component outputs to assess iterative inference; self-similarity, which evaluates the resistance of a layer to change when relocated, serving as a measure of layer "stubbornness"; and index similarity, which examines the adaptability of a layer in a new position, indicating layer "flexibility."

Attention Heads
Both models exhibit distinct patterns in attention-head behavior in the latter half of the network. In Pythia models, the attention-head metrics converge to orthogonality, while in GPT-2 models they converge to similarity. For Pythia, the self-similarity of attention heads decreases, indicating that they become less "stubborn" and more sensitive to their position in the network.
In contrast, attention heads in GPT-2 models become increasingly redundant, with high self-similarity and index-similarity scores. We hypothesize that this increased redundancy arises from the larger number of layers in GPT-2 models, which allows for a more gradual refinement of the output distribution. This finding has implications for model design, suggesting that there may be an optimal number of layers, given a total parameter budget, that balances computational efficiency and redundancy.

MLPs
The MLP components display two significant patterns across models. First, in the region corresponding to stage 2 of inference, the index similarity (teal line) is higher than both the adjacent-similarity and self-similarity scores. This pattern provides evidence for iterative inference: a layer moved earlier in the computation has a projection onto the residual stream that overlaps more strongly with its previous neighbor than with its original position or its new neighbor. This overlap is more pronounced in Pythia models than in GPT-2 models, possibly because Pythia models have fewer layers in which to complete stage 2 of inference. Second, in stage 3 of inference, both models show a convergence of all metrics except self-similarity toward orthogonality. The combination of high self-similarity (indicating stubbornness) and orthogonality to both the replaced layer and the adjacent layers suggests a high degree of specialization in the MLPs of stage 3.

Figure 18: We compute pairwise cosine similarities between a standard model and a model with two adjacent layers swapped, as depicted in Figure 17, across two different models. A high index similarity, marked by the teal line, suggests that when a layer is moved earlier in the computational sequence, it retains a similar projection onto the residual stream as the layer it replaced.
This observation supports the concept of iterative inference, highlighting overlapping computational roles between adjacent layers.

A.5 Prediction and Suppression Neurons

Figure 19: Prediction and suppression neurons (percentage of layer neurons, alongside per-layer KL divergence) for (a) Qwen 2.5 (0.5B, 1.5B, 3B) and (b) Llama 3.2 (1B, 3B), plotted against relative layer depth.

A.6 WiC Contextual Word Probe

Figure 20: WiC probing accuracy over layers across model families
and sizes. Across all models and sizes, we observe that probe accuracy for the contextual semantics of lexical items gradually increases, peaks around the middle layers, and then degrades.

A.7 Logit Lens Entropy

Figure 21: Using the logit lens technique [49], we calculate the probability distribution of the next token at the end of every layer and then take its entropy (in bits), plotted against relative layer depth for the GPT-2, Pythia, Phi, Llama, and Qwen families.

A.8 MLP Norms

Figure 22: The norm of the output of every MLP across layers, measuring its contribution to the residual stream. Across all 16 models, the norm grows and peaks in the final layers before the output, suggestive of the final two stages of inference: prediction ensembling and residual sharpening.

A.9 Top Prediction and Suppression Neurons

Figure 23: Top 36 prediction and suppression neurons for "-ing", ranked by the greatest mean absolute difference in the respective W_U · w_out.
Elements with a negative skew are suppression neurons for the labeled class, while elements with a positive skew are prediction neurons. These scores are computed as the product of the model's unembedding weights and the MLP output weights.

A.10 Layer Repeats Experiment

Figure 24: Block 3 repeat experiment. Logit entropy as a function of the relative depth of the copied block's starting layer, for the Phi, Llama 3.2, Qwen 2.5, GPT-2, and Pythia families.
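The repeat intervention and the entropy metric plotted above can be sketched as follows; `repeat_block` and `entropy_bits` are illustrative names of our own, operating on layer indices and a probability vector rather than on a real model:

```python
import math

# Sketch of the layer-repeat intervention (Figures 24-25) and the
# logit-entropy metric. Illustrative names, not the released code.

def repeat_block(layers, start, block_len, n_repeats=1):
    """Duplicate each layer in layers[start:start+block_len] n_repeats extra
    times, e.g. repeating layers 5-7 yields (...4-5-5-6-6-7-7-8...)."""
    out = []
    for i, layer in enumerate(layers):
        out.append(layer)
        if start <= i < start + block_len:
            out.extend([layer] * n_repeats)
    return out

def entropy_bits(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(repeat_block(list(range(12)), 5, 3))  # [0, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11]
print(entropy_bits([0.25] * 4))             # 2.0 (uniform over 4 tokens)
```

A drop in the entropy of the final logit-lens distribution after repeating a late block is what the experiment reads as evidence of sharpening.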
[Figure 25: Logits entropy vs. relative depth (copy start layer) for the Block 5 repeat experiment. Panels: (a) GPT; (b) Pythia; (c) Phi; (d) Llama; (e) Qwen.]

Figure 25: Block 5 repeat experiment on additional models.

[Figure 26: Benchmark accuracy on HellaSwag vs. relative depth (layer index / max depth) for Qwen2.5-0.5B, Qwen2.5-1.5B, and Qwen2.5-3B, with base-model baselines.]

Figure 26: Performance of the Qwen repeat-5 models on HellaSwag.

A.11 Additional Empirical Details

All experimental code is available at: https://github.com/vdlad/Remarkable-Robustness-of-LLMs

We make extensive use of TransformerLens [72] to register hooks and perform transformer manipulations. For specificity, we use the following HuggingFace model names and dataset. We do not change the parameters of the models from those described on the corresponding HuggingFace pages.
Name                  HuggingFace Model Name
Pythia 410M           EleutherAI/pythia-410m-deduped
Pythia 1.4B           EleutherAI/pythia-1.4b-deduped
Pythia 2.8B           EleutherAI/pythia-2.8b-deduped
Pythia 6.9B           EleutherAI/pythia-6.9b-deduped
GPT-2 Small (124M)    gpt2
GPT-2 Medium (355M)   gpt2-medium
GPT-2 Large (774M)    gpt2-large
GPT-2 XL (1.5B)       gpt2-xl
Phi 1 (1.3B)          microsoft/Phi-1
Phi 1.5 (1.3B)        microsoft/Phi-1.5
Phi 2 (2.7B)          microsoft/Phi-2
Qwen 0.5B             Qwen/Qwen2.5-0.5B
Qwen 1.5B             Qwen/Qwen2.5-1.5B
Qwen 3B               Qwen/Qwen2.5-3B
Llama-3.2 1B          meta-llama/Llama-3.2-1B
Llama-3.2 3B          meta-llama/Llama-3.2-3B
The Pile              EleutherAI/the_pile_deduplicated

Table 3: List of models and dataset used in the experiments.

All experiments described can be performed on a single NVIDIA A6000. We utilized 2 NVIDIA A6000 GPUs and 500 GB of RAM. To aggregate the metrics described in the paper, we run the model on 1 million tokens ℓ times, where ℓ is the number of layers. This takes on average 8 hours per model per layer intervention (swapping and ablating). We save these aggregated results for data analysis.

We utilize several conventional weight-preprocessing techniques to streamline our calculations [72].

Layer Norm Preprocessing. Following [14], before each MLP calculation a layer norm operation is applied to the residual stream, normalizing the input to the MLP. The TransformerLens package simplifies this process by incorporating the layer norm into the weights and biases of the MLP, resulting in matrices W_eff and b_eff. In many layer norm implementations, trainable parameters γ ∈ R^n and b ∈ R^n are included:

    LayerNorm(x) = (x − E[x]) / sqrt(Var(x)) · γ + b.    (1)

We "fold" the layer norm parameters into W_in by treating the layer norm as a linear layer and then merging the subsequent layers:

    W_eff = W_in diag(γ),    b_eff = b_in + W_in b.    (2)

Additionally, we then center the reading weights.
Thus, we adjust the weights W_eff as follows:

    W'_eff(i, :) = W_eff(i, :) − W̄_eff(i, :)

Centering Writing Weights. Because of the LayerNorm operation in every layer, components of the writing weights along the all-ones direction of the residual stream do not influence the model's calculations and can be removed. We therefore mean-center W_out and b_out by subtracting the column means of W_out:

    W'_out(:, i) = W_out(:, i) − W̄_out(:, i)

Societal Impact. We do not anticipate any immediate societal impact from this research.
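The weight-preprocessing identities above admit a quick numerical check. The sketch below uses toy shapes and illustrative names (not the paper's or TransformerLens's actual code): it verifies that the folded weights of Eq. (2) reproduce LayerNorm-then-linear, and that centering a written vector leaves the next LayerNorm's output unchanged.

```python
import numpy as np

def layernorm(x, gamma, b):
    # Standard LayerNorm with trainable scale gamma and bias b, as in Eq. (1).
    x_hat = (x - x.mean()) / np.sqrt(x.var())
    return x_hat * gamma + b

rng = np.random.default_rng(0)
n, m = 6, 4
x = rng.normal(size=n)
gamma, b = rng.normal(size=n), rng.normal(size=n)
W_in, b_in = rng.normal(size=(m, n)), rng.normal(size=m)

# Reference: LayerNorm followed by the MLP's input linear map.
y_ref = W_in @ layernorm(x, gamma, b) + b_in

# Folded (Eq. (2)): effective weights applied to the bare normalized input.
W_eff = W_in @ np.diag(gamma)
b_eff = b_in + W_in @ b
x_hat = (x - x.mean()) / np.sqrt(x.var())
y_fold = W_eff @ x_hat + b_eff

# Centering writing weights: the mean-subtraction inside the next LayerNorm
# discards any all-ones component of a written vector, so removing that
# component up front changes nothing downstream.
v = rng.normal(size=n)
same_after_centering = np.allclose(layernorm(v, gamma, b),
                                   layernorm(v - v.mean(), gamma, b))
```

The folding check relies on linearity: W_in (x_hat · γ + b) = (W_in diag(γ)) x_hat + W_in b, which is exactly Eq. (2) term by term.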