Paper deep dive
Patterning: The Dual of Interpretability
George Wang, Daniel Murfet
Models: 3M-parameter two-layer attention-only transformer (language model); transformers trained on a synthetic parenthesis balancing task
Abstract
Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it. Our approach is based on susceptibilities, which measure how posterior expectation values of observables respond to infinitesimal shifts in the data distribution. Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration. We demonstrate patterning in a small language model, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structure, such as the induction circuit. In a synthetic parentheses balancing task where multiple algorithms achieve perfect training accuracy, we show that patterning can select which algorithm the model learns by targeting the local learning coefficient of each solution. These results establish that the same mathematical framework used to read internal structure can be inverted to write it.
Tags
Links
- Source: https://arxiv.org/abs/2601.13548
- Canonical: https://arxiv.org/abs/2601.13548
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/11/2026, 1:28:42 AM
Summary
The paper introduces 'patterning', a dual framework to mechanistic interpretability, which uses susceptibility-based linear response theory to determine the specific training data interventions required to steer a neural network's internal structure and generalization behavior. By inverting the susceptibility matrix, the authors demonstrate that re-weighting training data can predictably accelerate or delay the formation of specific circuits (like induction circuits) or select between competing algorithms in synthetic tasks by targeting local learning coefficients.
Entities (6)
Relation Signals (4)
Susceptibility → enables → Patterning
confidence 98% · Our approach is based on susceptibilities... Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration.
Patterning → modulates → Induction Circuit
confidence 96% · We demonstrate patterning in a small language model, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structure, such as the induction circuit.
Local Learning Coefficient → determines → Internal Model Selection
confidence 95% · among equal-loss solutions, the posterior favors those with lower LLC. This is the principle of internal model selection.
Patterning → isdualof → Mechanistic Interpretability
confidence 95% · We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it.
Cypher Suggestions (2)
Find all methodologies related to neural network interpretability. · confidence 90% · unvalidated
MATCH (m:Methodology)-[:RELATED_TO]->(i:FieldOfStudy {name: 'Mechanistic Interpretability'}) RETURN m
Identify structures influenced by specific metrics. · confidence 85% · unvalidated
MATCH (s:Structure)<-[:MODULATES]-(m:Methodology)-[:TARGETS]->(met:Metric) RETURN s, met
Full Text
94,801 characters extracted from source content.
Patterning: The Dual of Interpretability
George Wang (Timaeus, george@timaeus.co), Daniel Murfet (Timaeus, daniel@timaeus.co)
arXiv:2601.13548v1 [cs.LG] 20 Jan 2026

Abstract
Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it. Our approach is based on susceptibilities, which measure how posterior expectation values of observables respond to infinitesimal shifts in the data distribution. Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration. We demonstrate patterning in a small language model, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structure, such as the induction circuit. In a synthetic parentheses balancing task where multiple algorithms achieve perfect training accuracy, we show that patterning can select which algorithm the model learns by targeting the local learning coefficient of each solution. These results establish that the same mathematical framework used to read internal structure can be inverted to write it.

1 Introduction
What is dual to interpretability? The goal of interpretability is to understand how neural networks generalize from their training data: for example, to reverse-engineer the internal structures (e.g. features, circuits) that determine how a trained model behaves on inputs beyond those it was trained on. If this is interpretability, then the dual problem should be: given a specified form of generalization, determine what training data leads to it.

We call this patterning, and in this paper we develop foundations for patterning based on the same mathematical framework that underlies recent approaches to interpretability via susceptibilities (Baker et al., 2025; Wang et al., 2025; Gordon et al., 2026).

The starting point is Structural Bayesianism (Murfet and Troiani, 2025), the hypothesis that internal structure in neural networks is encoded in the local posterior distribution. The posterior $p(w \mid D_n)$, formed after observing training data $D_n$ of size $n$, concentrates on parameters $w$ that predict well on this dataset, and further its "shape" is sensitive to the computational mechanisms by which that prediction is achieved. Internal structures like circuits leave traces in the response of the loss to small perturbations: some weight changes are irrelevant to outputs, others can be compensated by coordinated changes elsewhere, and these "degeneracies" reflect the trade-offs among parts of the underlying algorithm. This structural information can be extracted by computing expectation values

$$\mu^n_i = \int \varphi_i(w)\, p(w \mid D_n)\, dw \tag{1}$$

where the $\varphi_i$ are observables: functions on parameter space that probe aspects of model structure. We collect these into a vector $\mu^n = (\mu^n_1, \mu^n_2, \ldots)$ of structural coordinates. Examples of such expectation values include estimators for the local learning coefficient $\lambda$ (Watanabe, 2009; Lau et al., 2025), susceptibilities $\chi$ (Baker et al., 2025) and influence functions (Kreer et al., 2025).

To find the dual of interpretability, we study the affordances of structural coordinates. The posterior depends on three ingredients: the model architecture, the data distribution $q$, and the prior. The finite-sample coordinates $\mu^n$ depend on the specific dataset $D_n$, but in the large-sample limit we obtain coordinates $\mu^\infty$ that depend continuously on $q$.
Formally, $\mu^\infty$ is computed using the annealed posterior $p(w \mid D_\infty) \propto \exp\{-n\beta L(w)\}\,\varphi(w)$, where $L(w) = \mathbb{E}_q[\ell(w)]$ is the population loss (see Section 2.2 for details). Let $h$ denote a vector of hyperparameters governing the data distribution (for instance, mixture weights over a baseline distribution and a collection of probe distributions). Then

$$d\mu^\infty = \chi\, dh \tag{2}$$

where $\chi$ is the matrix of susceptibilities: the $ik$-entry $\chi_{ik}$ measures how the expectation of observable $\varphi_i$ responds to an infinitesimal shift of the data distribution toward probe $k$. Given a desired change $d\mu^\infty_{\mathrm{target}}$ in the structural coordinates, the minimum-norm intervention in the data distribution is

$$dh_{\mathrm{opt}} = \chi^\dagger\, d\mu^\infty_{\mathrm{target}} \tag{3}$$

where $\chi^\dagger$ denotes the Moore-Penrose pseudoinverse. We refer to (3) as the fundamental equation of patterning. In this way, we can derive principled changes in the data distribution, either one-off (applied once before training) or online (adjusted dynamically during training), that steer the configuration of the posterior and thus internal structure in the network. Since this structure determines generalization, we have found (in principle) a dual to interpretability: shaping training data to control generalization.[1]

The main claim of the present paper is that this works, in simple experiments. We demonstrate patterning in several settings, showing that susceptibility-guided interventions in training data produce predictable changes in the internal structures that form during training:

• Modulating a circuit. The simplest application of patterning targets a single mode of the susceptibility matrix. In a 3M parameter language model, Baker et al. (2025) showed that PC2 of the susceptibility matrix couples induction patterns in the data (roughly speaking, the right singular vector $v_2$) to the induction circuit in the weights (the left singular vector $u_2$). Setting $d\mu^\infty_{\mathrm{target}} = \pm u_2$, the fundamental equation yields $dh_{\mathrm{opt}} \propto \pm v_2$: the optimal intervention is simply to re-weight training data along the second principal data pattern. We show that up-weighting tokens with negative $v_2$ loadings accelerates circuit formation, while down-weighting them delays or prevents it (Section 3).

• Selecting between competing algorithms. When multiple solutions achieve zero training loss, the principle of internal model selection determines which one the posterior favors: among equal-loss solutions, Bayesian inference prefers the one with lower local learning coefficient. Patterning can exploit this by raising the LLC of an undesired solution, shifting posterior weight toward the desired one. Following Li et al. (2025), we study transformers trained on a bracket classification task where two algorithms, Nested and Equal-Count, both achieve perfect loss on the training data. The susceptibility matrix measures how each training sample affects each solution's LLC. Inverting the fundamental equation, we show that the optimal re-weighting is approximately proportional to the susceptibility gap $\chi^{EQ}_x - \chi^{N}_x$. Samples with large gaps turn out to be interpretable (sequences that are "almost nested" or "almost equal"), and retraining on appropriately modified distributions shifts the proportion of models implementing each algorithm (Section 4).

We also perform sanity-check experiments to validate the basic paradigm: stunting the development of the spacing fin discovered in Wang et al. (2025) and deliberately growing a similar fin for delimiters (Section A.2). These experiments test a crux for the patterning methodology: susceptibilities must be sensitive to internal structure, and interventions in training data must produce measurable changes in those susceptibilities. Once this is established by the main experiments, the sanity checks become less central, so we relegate them to the appendix.
In this paper we focus on one-off interventions in training data (made at the beginning of training), but in future work we plan to develop the full online form of patterning.

Organization. Section 2 reviews the necessary background on singular learning theory and susceptibilities, and derives the patterning solution via mode decomposition. Section 3 demonstrates patterning in a small language model, showing that re-weighting training data along the second principal component of the susceptibility matrix accelerates or delays the formation of the induction circuit. Section 4 applies patterning to a synthetic parenthesis balancing task, where we select between two distinct algorithms by targeting their local learning coefficients. Section 5 discusses related work, and Section 6 concludes.

[1] This reasoning assumes that SGD finds solutions favored by the Bayesian posterior. This correspondence is supported by some theoretical and empirical evidence (Mingard et al., 2021) but remains a hypothesis rather than a theorem; our experiments provide further evidence by showing that posterior-guided interventions produce predicted effects under SGD training.

2 Background
The introduction presented patterning as inverting the linear response relation $d\mu^\infty = \chi\, dh$ to obtain data interventions $dh_{\mathrm{opt}} = \chi^\dagger\, d\mu^\infty_{\mathrm{target}}$ that steer structural coordinates toward a target. This section provides the theoretical foundations: singular learning theory gives us the structural coordinates (including the local learning coefficient, targeted in Section 4) and susceptibilities give us the measurement tool connecting data to structure.

2.1 Singular learning theory
We need singular learning theory for two reasons: it provides the local learning coefficient $\lambda$, which measures the complexity of a solution and determines which algorithm the posterior favors among equal-loss alternatives; and it provides the mathematical framework (annealed posteriors, free energy) in which susceptibilities are defined.
The LLC is the target observable in Section 4, where we select between competing algorithms by raising the LLC of the undesired solution.

Singular learning theory (SLT) provides a mathematical framework for understanding Bayesian learning in singular statistical models (Watanabe, 2009). A model is regular if the map from parameters to distributions is one-to-one and the Fisher information matrix is positive definite; it is singular otherwise. Deep neural networks are singular: many different parameter configurations $w$ can yield the same input-output function $f_w$ (Wei et al., 2022). Classical statistical theory often assumes regularity, so new tools are needed for deep learning.

Setup. We consider a statistical model $p(y \mid x, w)$ with parameters $w$ in a compact parameter space $W \subseteq \mathbb{R}^d$, equipped with a prior density $\varphi(w)$. Given a dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from a true distribution $q(x, y)$, we define the empirical loss $L_n(w) = -\frac{1}{n}\sum_{i=1}^n \log p(y_i \mid x_i, w)$ and the population loss $L(w) = -\mathbb{E}_{q(x,y)}[\log p(y \mid x, w)]$.

The learning coefficient. The central quantity in SLT is the learning coefficient $\lambda$, which measures the effective dimensionality of the set of optimal parameters. Define the volume

$$\mathrm{vol}(\varepsilon) = \int_{L(w) < L(w^*) + \varepsilon} \varphi(w)\, dw$$

where $w^*$ is a global minimum of $L$. Under regularity conditions (Watanabe, 2009, Theorem 7.1), the learning coefficient is

$$\lambda = -\lim_{\varepsilon \to 0^+} \log_2 \frac{\mathrm{vol}(\tfrac{1}{2}\varepsilon)}{\mathrm{vol}(\varepsilon)}. \tag{4}$$

This is the asymptotic number of bits needed to specify a parameter that is half again closer to the truth. In regular models (where $w^*$ is unique and the Hessian is non-degenerate), $\lambda = d/2$ where $d$ is the parameter dimension. In singular models, $\lambda$ can be much smaller, reflecting the degeneracy of the loss landscape.
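The volume-scaling definition in (4) can be checked numerically on toy losses. The sketch below is illustrative only and is not taken from the paper: it assumes a uniform prior on $[-1, 1]^2$, evaluates the ratio in (4) at one small finite $\varepsilon$ by Monte Carlo, and uses two hypothetical losses, `regular_loss` and `degenerate_loss`, whose exact learning coefficients are $d/2 = 1$ and $1/2$ respectively.

```python
import numpy as np

def learning_coefficient_mc(loss, dim, eps=1e-2, n_samples=1_000_000, seed=0):
    """Estimate lambda from the volume ratio in equation (4) at a finite eps."""
    rng = np.random.default_rng(seed)
    # Assumption: uniform prior on the box [-1, 1]^dim.
    w = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    excess = loss(w)  # both toy losses below have minimum value 0
    vol_eps = np.mean(excess < eps)        # prior mass of {L < eps}
    vol_half = np.mean(excess < eps / 2)   # prior mass of {L < eps / 2}
    return -np.log2(vol_half / vol_eps)

def regular_loss(w):
    """Non-degenerate minimum at the origin: every direction raises the loss."""
    return np.sum(w ** 2, axis=1)

def degenerate_loss(w):
    """The second coordinate never affects the loss, so the minimum is a line."""
    return w[:, 0] ** 2

lam_regular = learning_coefficient_mc(regular_loss, dim=2)
lam_degenerate = learning_coefficient_mc(degenerate_loss, dim=2)
print(f"regular: {lam_regular:.2f} (exact 1), degenerate: {lam_degenerate:.2f} (exact 1/2)")
```

For these two losses the volume is an exact power of $\varepsilon$, so a single finite $\varepsilon$ suffices; for losses with logarithmic corrections (multiplicity greater than one) the ratio converges slowly and this shortcut would be unreliable.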
Local learning coefficient. When the loss $L(w)$ has multiple local minima, we define the local learning coefficient $\lambda(w^*)$ (Lau et al., 2025) at each minimum by restricting the volume integral to a neighborhood $B$ of $w^*$ where $L(w) \ge L(w^*)$:

$$\lambda(w^*) = -\lim_{\varepsilon \to 0^+} \log_2 \frac{\mathrm{vol}(\tfrac{1}{2}\varepsilon, w^*)}{\mathrm{vol}(\varepsilon, w^*)}, \qquad \mathrm{vol}(\varepsilon, w^*) = \int_{w \in B,\ |L(w) - L(w^*)| < \varepsilon} \varphi(w)\, dw. \tag{5}$$

The LLC measures the local geometry of the loss landscape: a lower $\lambda(w^*)$ indicates a more degenerate (flatter) basin, while a higher $\lambda(w^*)$ indicates a sharper, more constrained basin. Intuitively, a low LLC means many directions in parameter space can be varied without significantly increasing the loss: the solution has "slack" or redundancy. A high LLC means the solution is tightly constrained, with most perturbations increasing the loss. Importantly, changing the data distribution changes the LLC at each minimum.

Free energy and posterior concentration. A key result of SLT is the asymptotic expansion of the free energy. Consider a small neighborhood $U$ around a local minimum $w^*$ of the loss. The Bayesian posterior probability of this region is

$$p_n(U) = \frac{Z_n(U)}{Z_n(W)}, \quad \text{where} \quad Z_n(U) = \int_U \exp\{-n L_n(w)\}\, \varphi(w)\, dw \tag{6}$$

is the local partition function and $L_n(w)$ is the empirical loss. The local free energy $F_n(U) = -\log Z_n(U)$ admits the asymptotic expansion (Watanabe, 2009)

$$F_n(U) = n L_n(w^*) + \lambda(w^*) \log n - (m(w^*) - 1) \log\log n + O_p(1) \tag{7}$$

where $m(w^*)$ is the multiplicity, a secondary geometric invariant. This formula governs how the posterior concentrates around competing solutions. If $w^*_A$ and $w^*_B$ are two local minima with neighborhoods $U$ and $V$, the log posterior odds is

$$\log \frac{p_n(U)}{p_n(V)} = -F_n(U) + F_n(V) = \Delta L_n \cdot n + \Delta\lambda \cdot \log n + O_p(\log\log n) \tag{8}$$

where $\Delta L_n = L_n(w^*_B) - L_n(w^*_A)$ and $\Delta\lambda = \lambda(w^*_B) - \lambda(w^*_A)$. When both solutions achieve equal loss ($\Delta L_n = 0$), the posterior preference is determined entirely by $\Delta\lambda \cdot \log n$: among equal-loss solutions, the posterior favors those with lower LLC.
This is the principle of internal model selection (Watanabe, 2009): Bayesian inference prefers simpler solutions, where simplicity is measured by the local learning coefficient. Patterning can exploit this: if we design data interventions that raise the LLC of an undesired solution $w^*_B$ while leaving the LLC of the desired solution $w^*_A$ unchanged, we shift the posterior toward $w^*_A$. This is the mechanism underlying the experiments in Section 4.

Estimating the LLC. In practice, we estimate the LLC using (Lau et al., 2025)

$$\hat\lambda(w^*) = n\beta \left[ \mathbb{E}^{\beta}_{w \mid w^*, \gamma}[L_n(w)] - L_n(w^*) \right] \tag{9}$$

where the expectation is with respect to a localized tempered posterior

$$p(w; w^*, \beta, \gamma) \propto \exp\left\{ -n\beta L_n(w) - \frac{\gamma}{2} \lVert w - w^* \rVert^2 \right\}. \tag{10}$$

This is derived from the WBIC (Watanabe, 2013). The hyperparameters are the inverse temperature $\beta$ and localization strength $\gamma$. This expectation is approximated using stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), a sampling algorithm that adds Gaussian noise to gradient descent to explore the posterior distribution.

2.2 Susceptibilities
Susceptibilities are the measurement tool that connects data perturbations to structural changes. They tell us: if we shift the data distribution slightly toward some probe distribution, how do the structural coordinates $\mu^\infty$ change? This is precisely the information needed to invert the relationship and find data interventions that achieve a target structural change. In Section 3, we use susceptibilities to identify which tokens most strongly engage the induction circuit, then re-weight those tokens to modulate circuit formation.

The term "susceptibility" comes from physics, where it measures how a system responds to an external perturbation: for example, how a material's magnetization changes when an external magnetic field is applied. In our setting, the "system" is the posterior distribution over model parameters, and the "perturbation" is a shift in the data distribution.
The susceptibility measures how posterior expectation values of observables (functions that probe model structure) respond to such shifts. This is a form of linear response theory.

We consider sequence models $p(y \mid x, w)$ that predict tokens $y \in \Sigma$ given a context $x \in \Sigma^k$ for various $1 \le k \le K$, where $K$ is the maximum context length and $\Sigma$ is the set of tokens. More specifically, we consider transformer neural networks with a vector of weights $w \in W$. We may single out a subset $C$ of those weights via a product decomposition $W = U \times C$; we refer to $C$ as a component of the network (often but not necessarily something like an attention head).

The true distribution of token sequences $(x, y)$ is denoted $q(x, y)$. Given a dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from $q(x, y)$ we define

$$\ell_{xy}(w) = -\log p(y \mid x, w), \qquad L_n(w) = \frac{1}{n} \sum_{i=1}^n \ell_{x_i y_i}(w).$$

The function $L_n(w)$ is the empirical negative log-likelihood and its average over the true distribution is $L(w) = \mathbb{E}_{q(x,y)}[\ell_{xy}(w)]$. Given a product decomposition $W = U \times C$, model weights $w^* = (u^*, v^*)$, and writing $w = (u, v)$ for the decomposition of a general parameter, we define a generalized function on $W$ by

$$\varphi_C(w) = \delta(u - u^*) \left[ L(w) - L(w^*) \right] \tag{11}$$

where $\delta(u - u^*)$ is a Dirac delta. Intuitively, $\varphi_C$ measures the loss contribution of component $C$: the Dirac delta freezes all parameters outside $C$ at their trained values $u^*$, and $L(w) - L(w^*)$ measures how varying $C$ alone affects the loss. When $C$ is an attention head, the expectation $\langle \varphi_C \rangle_\beta$ captures how much "slack" that head has in the current solution: how much its parameters can vary without significantly increasing the loss.

The annealed posterior at inverse temperature $\beta > 0$ and sample size $n$ is

$$p^\beta_n(w) = \frac{1}{Z^\beta_n} \exp\{-n\beta L(w)\}\, \varphi(w) \quad \text{where} \quad Z^\beta_n = \int \exp\{-n\beta L(w)\}\, \varphi(w)\, dw. \tag{12}$$

Given a generalized function $\varphi(w)$ we define the expectation

$$\langle \varphi \rangle_\beta = \int \varphi(w)\, p^\beta_n(w)\, dw \tag{13}$$

and given a function $\psi(w)$ the covariance with respect to the annealed posterior is

$$\mathrm{Cov}_\beta[\varphi, \psi] = \langle \varphi \psi \rangle_\beta - \langle \varphi \rangle_\beta \langle \psi \rangle_\beta.$$
Susceptibility as a derivative. The susceptibility is defined as the derivative of a posterior expectation value with respect to a data distribution parameter (Baker et al., 2025). Let $h$ parametrize a perturbation of the data distribution $q \to q_h$, and write $\langle \varphi \rangle_{\beta,h}$ for the expectation of $\varphi$ under the annealed posterior formed using data distribution $q_h$. The susceptibility is

$$\chi = \frac{1}{n\beta} \left. \frac{\partial}{\partial h} \langle \varphi \rangle_{\beta,h} \right|_{h=0}. \tag{14}$$

Per-token susceptibility. The perturbations of the data distribution that we consider shift it towards a point mass $\delta_{(x,y)}$ for some particular context $x$ and token $y$.

Definition 2.1. The per-token susceptibility of component $C$ for $(x, y)$ is

$$\chi^C_{xy} = -\mathrm{Cov}_\beta\left[ \varphi_C,\ \ell_{xy}(w) - L(w) \right]. \tag{15}$$

Given components $C_1, \ldots, C_H$ we define the susceptibility vector $\chi_{xy} = (\chi^{C_1}_{xy}, \ldots, \chi^{C_H}_{xy})$. Finally, given contexts $x_1 y_1, \ldots, x_r y_r$ we obtain the susceptibility or response matrix $\chi$ defined by

$$\chi = \left( \chi^{C_i}_{x_j y_j} \right)_{1 \le i \le H,\ 1 \le j \le r}.$$

This directly yields the linear response relation $d\mu^\infty = \chi\, dh$ used in the fundamental equation, where $\mu_i = \langle \varphi_{C_i} \rangle_{\beta,h}$ and we ignore factors of $n\beta$.

To relate this to the earlier definition consider a mixture $q' = (1 - \varepsilon) q + \varepsilon\, \delta_{(x_0, y_0)}$ where $\delta_{(x_0, y_0)}$ is the point mass. Using $\mathbb{E}_q[\chi_{xy}] = 0$ we obtain $\chi = (1 - \varepsilon)\, \mathbb{E}_q[\chi_{xy}] + \varepsilon \chi_{x_0 y_0} = \varepsilon \chi_{x_0 y_0}$. Thus $\chi_{xy}$ measures how up-weighting the pair $(x, y)$ shifts the structural coordinate $\mu^\infty$.

Another way to understand the per-token susceptibility is as a density for general susceptibilities: for any perturbation $q \to q'$, the susceptibility decomposes by Gordon et al. (2026) as

$$\chi = \int q'(x, y)\, \chi_{xy}\, dx\, dy. \tag{16}$$

This defines $\chi_{xy}$ implicitly: it is the unique function (up to an additive term that does not depend on $x, y$) such that (16) holds for all perturbations $q'$. The covariance formula (15) allows estimation from samples of the unperturbed posterior. For details on estimation see Baker et al. (2025).
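The agreement between the derivative definition (14) and the covariance formula (15) can be illustrated on a toy problem where the annealed posterior is computed exactly on a grid. Everything in this sketch is an assumption made for illustration: a one-dimensional parameter, three "tokens" with quadratic losses, a uniform prior, and a fixed smooth observable `phi` standing in for $\varphi_C$ (the observable is held fixed while the data distribution is perturbed). It is not the estimation procedure of Baker et al. (2025).

```python
import numpy as np

# Toy setup; all names and numbers are illustrative assumptions.
w = np.linspace(-2.0, 2.0, 2001)               # 1-D parameter grid, uniform prior
targets = np.array([-0.5, 0.0, 1.0])           # token k has loss (w - targets[k])^2
q = np.array([0.5, 0.3, 0.2])                  # baseline data distribution over tokens
losses = (w[None, :] - targets[:, None]) ** 2  # losses[k, i] = loss of token k at w[i]
n_beta = 20.0                                  # the product n * beta
phi = np.sin(w)                                # fixed observable on parameter space

def posterior(q_vec):
    """Annealed posterior exp(-n*beta*L) on the grid, for data weights q_vec."""
    L = q_vec @ losses
    p = np.exp(-n_beta * (L - L.min()))
    return p / p.sum()

def susceptibility_cov(k):
    """Equation (15): minus the posterior covariance of phi with (loss_k - L)."""
    p = posterior(q)
    d = losses[k] - q @ losses
    return float(-(p @ (phi * d) - (p @ phi) * (p @ d)))

def susceptibility_fd(k, h=1e-5):
    """Equation (14): central difference along q_h = (1 - h) q + h * delta_k."""
    e_k = np.eye(len(q))[k]
    plus = posterior((1 - h) * q + h * e_k) @ phi
    minus = posterior((1 + h) * q - h * e_k) @ phi
    return float((plus - minus) / (2 * h * n_beta))

for k in range(len(q)):
    print(k, susceptibility_cov(k), susceptibility_fd(k))
```

The two columns printed for each token should agree closely, and the $q$-weighted average of the covariance values vanishes, matching the identity $\mathbb{E}_q[\chi_{xy}] = 0$ used above.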
Given a pattern $P$ (a set of context-token pairs), the per-pattern susceptibility of component $C$ is

$$\chi^C(P) := \frac{1}{|P|} \sum_{(x,y) \in P} \chi^C_{xy} \tag{17}$$

which averages the per-token susceptibilities over all pairs in the pattern.

2.3 Mode decomposition and the patterning solution
The susceptibility matrix admits a singular value decomposition

$$\chi = \sum_\alpha \sigma_\alpha\, u_\alpha v_\alpha^T \tag{18}$$

where $\{u_\alpha\}_\alpha$ are orthonormal vectors in observable space (left singular vectors), $\{v_\alpha\}_\alpha$ are orthonormal vectors in data space (right singular vectors), and $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ are the singular values. We call $u_\alpha$ the principal structures (in some cases these are "circuits"; see Section 3 below) and $v_\alpha$ the principal data patterns. The $\alpha$-th mode couples the data direction $v_\alpha$ to the observable direction $u_\alpha$ with strength $\sigma_\alpha$: a perturbation $dh \propto v_\alpha$ produces a response $d\mu^\infty \propto \sigma_\alpha u_\alpha$.

The Moore-Penrose pseudo-inverse has the corresponding decomposition

$$\chi^\dagger = \sum_{\alpha:\ \sigma_\alpha > 0} \frac{1}{\sigma_\alpha}\, v_\alpha u_\alpha^T \tag{19}$$

which inverts each mode independently, with weakly-coupled modes amplified by their inverse singular values. The optimal re-weighting for a target $d\mu^\infty_{\mathrm{target}}$ is therefore

$$dh_{\mathrm{opt}} = \chi^\dagger\, d\mu^\infty_{\mathrm{target}} = \sum_\alpha \frac{1}{\sigma_\alpha} \langle u_\alpha, d\mu^\infty_{\mathrm{target}} \rangle\, v_\alpha. \tag{20}$$

This has a clear interpretation: decompose the target into principal structures, then reconstruct in data space using the corresponding principal data patterns, with each component scaled by the inverse coupling strength.

Two limiting cases are instructive. When the target aligns with a single principal structure, $d\mu^\infty_{\mathrm{target}} = u_\beta$, the solution is the corresponding data pattern:

$$dh_{\mathrm{opt}} = \frac{1}{\sigma_\beta}\, v_\beta. \tag{21}$$

Modulating a principal structure (e.g. the induction circuit, in our experiments) reduces to re-weighting along its paired data direction. Conversely, when the target is orthogonal to all principal structures (lies in the left null space of $\chi$), no re-weighting can achieve it: such targets represent changes to observables that are inaccessible via the available data perturbations.
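Equations (18) to (21) can be exercised directly with NumPy. The sketch below uses a random stand-in for the susceptibility matrix (an assumption; no real susceptibilities are involved) with hypothetical sizes `H` and `r`, and checks that targeting a single principal structure returns the paired data pattern scaled by the inverse singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
H, r = 4, 6                     # toy sizes: observables (e.g. heads) x data probes
chi = rng.normal(size=(H, r))   # stand-in susceptibility matrix, not real data

# Mode decomposition (18): columns of U are principal structures u_alpha,
# rows of Vt are principal data patterns v_alpha, S holds sigma_alpha.
U, S, Vt = np.linalg.svd(chi, full_matrices=False)

def patterning_intervention(chi, d_mu_target):
    """Fundamental equation (3): minimum-norm dh via the pseudoinverse."""
    return np.linalg.pinv(chi) @ d_mu_target

# Pseudoinverse rebuilt mode by mode, as in (19).
pinv_from_modes = sum(np.outer(Vt[a], U[:, a]) / S[a] for a in range(len(S)))

# Target the second principal structure (index 1 in 0-based numbering).
beta = 1
dh = patterning_intervention(chi, U[:, beta])

print(np.allclose(dh, Vt[beta] / S[beta]))   # matches equation (21)
print(np.allclose(chi @ dh, U[:, beta]))     # the target response is reproduced
```

In practice small singular values would make the inverse in (19) amplify noise in an estimated $\chi$, so a truncation threshold on $\sigma_\alpha$ is a natural design choice; the source does not specify one here.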
3 Induction Circuit
We test patterning in a setting where the target is to modulate the strength of an identified circuit. The induction circuit is a well-studied motif in which attention heads in an early layer attend to the previous occurrence of the current token, while heads in a later layer copy the token that followed that occurrence (Olsson et al., 2022). In the two-layer models we study, this corresponds to previous-token and current-token heads in layer 0 composing with induction heads in layer 1. This circuit forms in response to a pattern in the data which we call induction patterns (IPs), meaning token sequences like "the cat ... the cat" involving a bigram which is repeated from earlier in the context.

Following Baker et al. (2025), we compute per-token susceptibilities $\chi^C_{xy}$ for each attention head $C$ in the model, forming a susceptibility vector $\chi_{xy} = (\chi^{C_1}_{xy}, \ldots, \chi^{C_H}_{xy}) \in \mathbb{R}^H$ for each token pair $(x, y)$. The susceptibility matrix $\chi$ has these vectors as columns. The analysis of Baker et al. (2025); Wang et al. (2025) shows that PC2 of this matrix couples induction patterns (a pattern in the data) to the induction circuit in the sense of Section 2.3. The alignment between PC2 and induction patterns can be seen in Figure 1. The precise statement is more complex: in particular, the data pattern is actually the difference of two patterns (word endings and induction patterns) and the pattern in the weights is the opposition of the induction circuit with the rest of the model (see Section A.1.3).

[Figure 1: PC2 and induction patterns. Text from the training corpus highlighted in red and green based on PC2 value of the susceptibility vector of the 16 attention heads in the original model. Green indicates more positive values, while red indicates more negative. Note that the strongest red subsequences are rare biological terms, only highlighted red from their second appearance.]
Given this identification, modulating the induction circuit is a simple application of (21). We set $d\mu^\infty_{\mathrm{target}} = \pm u_2$ so that the optimal re-weighting is then

$$dh_{\mathrm{opt}} = \chi^\dagger (\pm u_2) \propto \pm v_2. \tag{22}$$

Samples with large negative $v_2$ loadings are up-weighted to promote the induction circuit, or down-weighted to suppress it. The prediction is that down-weighting induction patterns should delay or prevent circuit formation, while up-weighting them should accelerate it. In Section 3.1 we give the details of our experiment, and in Section 3.2 the results.

3.1 Methodology
We consider the same 3M parameter attention-only (no MLPs) transformer trained in Hoogland et al. (2025) and further studied in Wang et al. (2024); Baker et al. (2025); Wang et al. (2025). We additionally train new models with the same architecture but possibly with modified data distribution or seeding (we refer to these as retrained models). This transformer has two layers, indexed $0 \le L \le 1$, and eight attention heads per layer, indexed $0 \le H \le 7$. This model was trained for 50,000 steps on a subset of the Pile (Xie et al., 2023). For a complete specification of architecture, including dimensions and training hyperparameters, see Hoogland et al. (2025). We use a truncated variant of the GPT-2 tokenizer that uses only the first 5,000 tokens (Eldan and Li, 2023). We denote the set of tokens by $\Sigma$ and often consider a token $y \in \Sigma$ in a context $x \in \Sigma^k$.

For our retraining experiments, we measure susceptibilities on a large fraction of the original training distribution, project those susceptibilities onto PC2, and use this to assign a per-token loss weight during training. Tokens with large negative PC2 values can be up-weighted or down-weighted, thereby amplifying or suppressing their influence on learning. We choose four different mappings to conduct retraining experiments with: Repress-0x, Baseline-1x, Induce-2x, and Induce-4x.
These respectively set the per-token weighting of tokens with large negative PC2 values to 0, 1 (identical to original training), 2, and 4. See Section A.1.3 for precise characterizations. For each mask, we train four models and measure the formation of the induction circuit using the prefix matching score and previous token score (Olsson et al., 2022).

3.2 Results

In Figure 2 we see the results of training models on these token masks reflected in the per-pattern susceptibilities for the induction pattern. For the Repress-0x model, the scale of the induction pattern susceptibilities is about an order of magnitude smaller than in Baseline-1x. For Induce-2x and Induce-4x, the final magnitude of the induction pattern susceptibilities is only slightly higher than in the Baseline-1x model, but the susceptibilities rise notably sooner. This rise in the induction pattern susceptibility is reflected in the accelerated development of induction circuit attention heads seen in Figure 3. Compared to Baseline-1x, we see that suppressing the tokens during training seems to almost entirely prevent the formation of induction heads, while ramping up the weighting both accelerates the formation and causes the resulting prefix matching score to be stronger. The previous token heads measured on the right side of Figure 3 show a similar pattern.

Figure 2: Per-pattern susceptibilities for the induction pattern for each attention head are shown for the induction experiment, collected on the unmodified original training distribution. Panels: (a) Baseline-1x, seed 1; (b) Repress-0x, seed 1; (c) Induce-2x, seed 1; (d) Induce-4x, seed 1. Each panel plots susceptibility against training step for "L0: Induction" and "L1: Induction". Note the y-axis range.

Figure 3: We measure prefix matching scores (left) and previous token scores (right) from Olsson et al. (2022) on the induction heads and previous token heads of each of the four seeds of models for each token weighting (Repress-0x, Baseline-1x, Induce-2x, Induce-4x). The values over training for all 16 models are aggregated in these plots.

Although this is not clear in the graph, we also found that the number of induction heads formed correlates with how strongly the induction pattern is induced. The Baseline-1x models typically had one and sometimes two induction heads, while the Induce-4x models often had three.

In Wang et al. (2025), it was noted that the formation of the induction circuit during training was accompanied by the UMAP “fattening” in the dorsal-ventral direction, which was quantified by a notable increase in the explained variance of PC2. We can confirm via the explained variance of PC2 that there is no similar increase when we suppress induction pattern tokens (see Section A.4). In Section A.3, we plot the impact on per-pattern susceptibilities for all other patterns considered in Baker et al. (2025) and Wang et al. (2025). In Section A.5, we compare the loss on the original training distribution for each token masking setup.

These results confirm the patterning prediction: re-weighting along $v_2$ modulates the induction circuit in the predicted direction, with suppression delaying or preventing formation and amplification accelerating it. Having demonstrated that patterning can modulate a single circuit, we now turn to a more challenging setting: selecting between two qualitatively different algorithms.

4 Parenthesis Balancing Task

Figure 4: Sequences of parentheses correspond to lattice paths: "(" steps up, ")" steps down. Sequences of parentheses that have an equal number of left and right parentheses map exactly onto those paths of diagonal steps which end on the x-axis. Sequences of parentheses that are correctly nested (i.e. are classified as TRUE in our dataset) map exactly onto those paths of diagonal steps which both end on the x-axis and which remain at or above the x-axis at all other steps (left). Constructed examples of “almost nested” (center) and “almost equal” (right) samples are also given; see Section B.1 for precise definitions. Panels (vertical axis: height): left, (()()()), labeled TRUE; center, ((())) + )), labeled “almost nested”; right, )())((((, labeled “almost equal”.

We illustrate the patterning framework in a setting where neural networks trained on the same data can implement one of several distinct algorithms, each achieving perfect training accuracy but generalizing differently out of distribution. Following Li et al. (2025), we consider transformers trained to classify even-length sequences of parentheses as correctly formed (e.g. "(()(()()))") or incorrectly formed (e.g. "(()("). The training distribution is constructed to exclude samples with an equal number of open and closed parentheses but which are not correctly nested (e.g. "))(("). As a result, two classification rules are consistent with the data: Equal-Count, which accepts sequences with equal numbers of open and close parentheses regardless of order, and Nested, which accepts only properly nested sequences forming valid Dyck words. Since we design the data distribution so that every training example either satisfies both rules or neither, a model can achieve perfect training accuracy by implementing either algorithm.
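The two rules and the exclusion criterion can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`heights`, `equal_count`, `nested`, `in_training_support`) are hypothetical and not taken from the paper or from Li et al. (2025).

```python
def heights(seq):
    """Running height of the lattice path: '(' steps up, ')' steps down."""
    h, out = 0, []
    for ch in seq:
        if ch not in "()":
            raise ValueError(f"unexpected character: {ch!r}")
        h += 1 if ch == "(" else -1
        out.append(h)
    return out


def equal_count(seq):
    """Equal-Count rule: the path ends on the x-axis."""
    hs = heights(seq)
    return (hs[-1] if hs else 0) == 0


def nested(seq):
    """Nested rule: the path ends on the x-axis and never dips below it."""
    hs = heights(seq)
    return equal_count(seq) and all(h >= 0 for h in hs)


def in_training_support(seq):
    """Hypothetical filter: keep only sequences on which both rules agree."""
    return equal_count(seq) == nested(seq)


for s in ["(()(()()))", "(()(", "))(("]:
    print(s, equal_count(s), nested(s), in_training_support(s))
```

The only sequences on which the rules disagree are those with equal counts that dip below the x-axis, such as "))((", which is exactly the class held out for the out-of-distribution test.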
Which algorithm a model learns is revealed by its behavior on out-of-distribution test sequences composed of the previously excluded samples: models implementing Nested reject these while models implementing Equal-Count accept them. Li et al. (2025) show that independently trained models cluster categorically around these two solutions, with the proportion depending on architecture, regularization, and random initialization. In the following we identify a string of left and right parentheses with a path of diagonal steps up or down (see Figure 4, left).

4.1 Theory

This setting provides a clean test of patterning because we have two known solutions (see footnote 2), $w^*_N$ (Nested) and $w^*_{EQ}$ (Equal-Count), and the goal is to control which one training produces. From the Bayesian perspective this means that our target is to re-weight the posterior towards one of the solutions. To frame this in terms of patterning we should define the target $\mu^\infty$. To this end define observables

$$\varphi_N(w) = \delta(w \approx w^*_N) \cdot \big(L(w) - L(w^*_N)\big), \qquad \varphi_{EQ}(w) = \delta(w \approx w^*_{EQ}) \cdot \big(L(w) - L(w^*_{EQ})\big)$$

where $\delta(w \approx w^*)$ denotes localization to a neighborhood of $w^*$, as in Section 2.2. The expectation values with respect to the annealed posterior

$$\mu^\infty = \begin{pmatrix} \mu^\infty_N \\ \mu^\infty_{EQ} \end{pmatrix} = \begin{pmatrix} \mathbb{E}_w[\varphi_N] \\ \mathbb{E}_w[\varphi_{EQ}] \end{pmatrix}$$

give the desired targets. The finite-$n$ analogue, replacing the population loss $L(w)$ with the empirical loss $L_n(w)$ and using the ordinary posterior, yields the LLC estimator (9): comparing, we see that $n\beta \cdot \mathbb{E}_w[L_n(w) - L_n(w^*_N)]$ (restricted to the neighborhood of $w^*_N$) is exactly $\hat{\lambda}(w^*_N)$. Since both solutions achieve zero training loss, the principle of internal model selection (8) applies: the posterior preference is determined entirely by the difference in LLCs. A higher $\lambda$ means higher free energy and thus lower posterior weight.

Footnote 2: More precisely, regions of parameter space, but we elide this distinction here.
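As a rough worked illustration of the last statement, one can use the standard leading-order free energy asymptotics of singular learning theory (Watanabe, 2009), applied locally at each solution. The symbols $F_n$ and $p_n$ are introduced only for this sketch, the inverse temperature is taken to be 1, lower-order terms are dropped, and the form may differ from the paper's own statement (8).

```latex
% Local free energy near a solution w^* (leading order, assumed form):
F_n(w^*) \approx n L_n(w^*) + \lambda(w^*) \log n
% Ratio of posterior weights of the two neighborhoods:
\frac{p_n(N)}{p_n(EQ)} \approx \exp\!\big(-\left[F_n(w^*_N) - F_n(w^*_{EQ})\right]\big)
% With L_n(w^*_N) = L_n(w^*_{EQ}) = 0 the loss terms cancel:
\frac{p_n(N)}{p_n(EQ)} \approx n^{\lambda(w^*_{EQ}) - \lambda(w^*_N)}
```

Under these assumptions, raising the LLC at one solution relative to the other suppresses its posterior weight by a power of $n$, which is why the LLC difference is a natural target for the intervention.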
Measuring whole-model susceptibilities locally at each solution yields a $2 \times m$ matrix $\chi$ with rows indexed by solutions and columns indexed by training samples:

$$\chi = \begin{pmatrix} \chi^N_{x_1} & \chi^N_{x_2} & \cdots & \chi^N_{x_m} \\ \chi^{EQ}_{x_1} & \chi^{EQ}_{x_2} & \cdots & \chi^{EQ}_{x_m} \end{pmatrix} \tag{23}$$

The entry $\chi^N_{x_k}$ measures how up-weighting sample $x_k$ affects the LLC at the Nested solution, and similarly for $\chi^{EQ}_{x_k}$. To select for Nested over Equal-Count, we set $d\mu^\infty_{\text{target}} = (-\varepsilon, +\varepsilon)^T$, which aims to decrease the LLC at Nested (deepening its basin and thus increasing its weight in the posterior) while increasing the LLC at Equal-Count (making its basin shallower, and thus less preferred by the posterior). The patterning equation $dh_{\text{opt}} = \chi^\dagger \, d\mu^\infty_{\text{target}}$ yields the optimal re-weighting of training samples. Next, in Section 4.2, we derive an approximate solution to this equation, before continuing in Section 4.3 to test whether this works in practice.

4.2 Deriving the re-weighting

We may assume the rows of $\chi$ are linearly independent, both because $m \gg 2$ and on the general principle that the algorithms (which are distinct) are characterized by their susceptibility vectors. Hence the Moore-Penrose pseudo-inverse is $\chi^\dagger = \chi^T (\chi \chi^T)^{-1}$. The Gram matrix $\chi \chi^T$ is

$$\chi \chi^T = \begin{pmatrix} \|\chi^N\|^2 & \langle \chi^N, \chi^{EQ} \rangle \\ \langle \chi^N, \chi^{EQ} \rangle & \|\chi^{EQ}\|^2 \end{pmatrix} = \begin{pmatrix} a & b \\ b & c \end{pmatrix} \tag{24}$$

where $a = \sum_k (\chi^N_{x_k})^2$, $c = \sum_k (\chi^{EQ}_{x_k})^2$, and $b = \sum_k \chi^N_{x_k} \chi^{EQ}_{x_k}$ measures the correlation between the two rows. Inverting and multiplying out, the weight assigned to each sample is

$$dh_{\text{opt}} = \frac{\varepsilon}{ac - b^2} \Big[ (a + b)\, \chi^{EQ} - (b + c)\, \chi^N \Big]. \tag{25}$$

If the two susceptibility vectors are orthogonal ($b = 0$) and have equal norm ($a = c$) then

$$dh_{\text{opt}} = \frac{\varepsilon}{a} \Big[ \chi^{EQ} - \chi^N \Big]. \tag{26}$$

Under these conditions, the optimal re-weighting is simply proportional to the susceptibility gap. Samples where $\chi^{EQ}_{x_k} \gg \chi^N_{x_k}$, that is, those that raise Equal-Count’s complexity more than Nested’s, receive positive weight, steering training toward the Nested solution.
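The closed form (25) can be checked numerically against a generic pseudo-inverse. The following is a minimal sketch using NumPy with randomly generated stand-in susceptibilities; the function names and the synthetic data are assumptions made for illustration, not measurements or code from the paper.

```python
import numpy as np


def patterning_reweighting(chi, d_mu_target):
    """Solve dh_opt = pinv(chi) @ d_mu_target (minimum-norm re-weighting)."""
    return np.linalg.pinv(chi) @ d_mu_target


def closed_form_reweighting(chi_n, chi_eq, eps):
    """Closed form (25) for the target (-eps, +eps)^T."""
    a = chi_n @ chi_n    # squared norm of the Nested row
    b = chi_n @ chi_eq   # inner product of the two rows
    c = chi_eq @ chi_eq  # squared norm of the Equal-Count row
    return eps / (a * c - b * b) * ((a + b) * chi_eq - (b + c) * chi_n)


# Synthetic stand-ins for measured susceptibilities (hypothetical values).
rng = np.random.default_rng(0)
m = 50
chi_n = rng.normal(size=m)
chi_eq = rng.normal(size=m)
chi = np.stack([chi_n, chi_eq])  # shape (2, m), rows as in (23)

eps = 0.1
target = np.array([-eps, eps])   # lower the LLC at Nested, raise it at Equal-Count

dh_pinv = patterning_reweighting(chi, target)
dh_closed = closed_form_reweighting(chi_n, chi_eq, eps)

print(np.allclose(dh_pinv, dh_closed))      # the two routes agree
print(np.allclose(chi @ dh_pinv, target))   # the linear response hits the target
```

Because the rows are linearly independent, `chi @ dh_pinv` reproduces the target exactly at the level of the linear approximation; whether a finite re-weighting achieves the same shift in practice is an empirical question that the following sections address.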
When orthogonality or equal norms fail, additional corrections will appear, but the susceptibility gap makes a good starting point. This motivates a practical procedure: measure susceptibilities at models implementing each solution, identify samples with large gaps $|\chi^{EQ}_x - \chi^N_x|$, and up-weight those favoring the desired solution.

The prediction is therefore: training on data enriched with samples that are “hard for Nested” (high $\chi^N_x$) should shift the distribution toward Equal-Count, while training on samples that are “hard for Equal-Count” (high $\chi^{EQ}_x$) should shift toward Nested. Recall that we identify low OOD accuracy solutions with Equal-Count and high OOD accuracy solutions with Nested.

Figure 5: Synthetically generated “almost nested” (left) and “almost equal” (right) samples visualized as heatmaps showing the number of Dyck paths crossing each lattice point. “Almost nested” paths climb before returning to near zero (but not zero); “almost equal” paths oscillate around zero height and end near but not at zero.

4.3 Identifying discriminating samples

We measure susceptibilities across 30 models spanning the range of OOD accuracies (from near-zero for Equal-Count to near-one for Nested) and identify samples with large susceptibility gaps $|\chi^{EQ}_x - \chi^N_x|$. Two characteristic types emerge:

• “Almost nested”: sequences that would be valid Dyck words except for an extra pair of closing parentheses at the end of the sequence. In the Dyck path representation (Figure 5, left), these paths climb high before returning to zero. These samples have high susceptibility for Nested models, meaning up-weighting them raises the LLC at the Nested solution: they are “hard” for Nested.

• “Almost equal”: sequences that nearly have equal counts but include an extra pair of either left or right parentheses, and whose Dyck paths (Figure 5, right) cumulatively spend more steps below the $y = 0$ axis than above it. These samples were selected for having high susceptibility for Equal-Count models.

We synthetically generate additional sequences of each type to create modified training distributions which we refer to as Almost Nested and Almost Equal, details of which are in Section B.1. We visualize the synthetically generated samples of “almost nested” and “almost equal” in Figure 5.

4.4 Verifying the LLC predictions

We have now designed two modified data distributions (Almost Nested and Almost Equal) with the aim that they would shift the LLCs of our two solutions. In Figure 6a we report on measurements aimed at verifying whether we succeeded in this aim.

For Almost Nested (blue), the LLC increases relative to the original, but the increase is not uniform. Models with low OOD accuracy (Equal-Count solutions, left side) show negligible change, while models with high OOD accuracy (Nested solutions, right side) show substantial increases. This selective elevation of the LLC at Nested solutions, while leaving Equal-Count solutions unchanged, is precisely what the susceptibility gap predicted: these samples are “hard for Nested” and this variation in the data distribution has increased the complexity of this algorithm.

The Almost Equal distribution (red) behaves unexpectedly. We designed these samples to raise the LLC at Equal-Count solutions, but instead the LLC decreases across the board, and decreases far more for high-OOD models than for low-OOD models. Rather than making samples hard for Equal-Count, we inadvertently made them easy for Nested. The practical effect is the same (the relative LLC gap shifts in favor of Nested) but through a different mechanism than intended. This discrepancy may reflect the limits of the linear response approximation: our interventions are finite perturbations, not infinitesimal ones.

(a) LLCs of 15 models computed on each of the 3 datasets.
(b) Retrained OOD accuracy distributions on the original training set (left), Almost Nested (middle), and Almost Equal (right), n = 100.

Figure 6: The LLCs computed on Almost Nested and Almost Equal differ from those computed on the original distribution, based on 15 of the original models trained by Li et al. (2025) with 3 layers, 4 attention heads, and 0.001 weight decay. For each dataset, we train 100 new models whose resulting spread of OOD accuracies is shown on the right.

4.5 Retraining results

Figure 6b shows the distribution of OOD accuracies when we retrain 100 models (10 seeds × 10 dataset shuffles) on each distribution. The original distribution (left) produces a bimodal spread, with most models implementing Equal-Count (low OOD accuracy) but some implementing Nested (high OOD accuracy), consistent with Li et al. (2025).

Training on Almost Nested (middle) completely eliminates high-OOD solutions: all 100 models converge to Equal-Count (mean OOD accuracy 0.004). This confirms the prediction from Section 4.2: raising the complexity of Nested solutions while leaving Equal-Count unchanged makes the latter overwhelmingly preferred.

Training on Almost Equal (right) produces the opposite shift: the distribution moves toward higher OOD accuracy (mean 0.497 vs. original 0.310), with more models implementing Nested. The effect is weaker than for Almost Nested, which we attribute to the relative rarity of “almost nested”-type samples in the original distribution. Small additions of rare sample types produce larger marginal effects than additions of already-common types.

These results demonstrate that susceptibility measurements successfully identify which samples to up-weight or down-weight to steer training toward a desired algorithm. Moreover, this is done using only in-distribution data that both solutions classify identically.
Where the induction circuit experiment showed that patterning can modulate the strength of a single circuit, this experiment shows it can select between qualitatively different algorithms, a stronger form of control over generalization.

5 Related Work

Mechanistic interpretability. The dominant paradigm in mechanistic interpretability decomposes activations into interpretable units: circuits (Olah et al., 2020; Elhage et al., 2021; Wang et al., 2023) and sparse autoencoder (SAE) features (Yun et al., 2021; Bricken et al., 2023; Cunningham et al., 2024). These methods focus on the model’s state: what activations are present at a given input. Susceptibility analysis instead probes the model’s sensitivity to the data distribution: how posterior expectation values change under distributional shifts (Baker et al., 2025; Gordon et al., 2026). The two approaches recover overlapping structure: Gordon et al. (2026) find that 50% of susceptibility clusters in Pythia-14M match SAE features for Pythia-70M. One contrast is that susceptibilities have a mathematical foundation which can be “inverted” in a clean way to shape the data distribution.

Influence functions and training data attribution. Influence functions (Cook and Weisberg, 1980; Koh and Liang, 2017) measure how individual training examples affect model predictions, and have been applied to dataset selection (Xia et al., 2024), identifying mislabeled data (Koh and Liang, 2017), and data attribution (Park et al., 2023). Recent Bayesian variants (Kreer et al., 2025) connect influence functions to the posterior distribution and are also grounded in SLT. Training data attribution (TDA) studies the effect data has on model behavior. Patterning is a closely related idea; the main difference is that influence functions typically target prediction-level quantities (e.g., loss on a test point) while patterning targets expectation values of observables that probe model internals.
Patterning is therefore focused on shaping internal structure according to a very specific notion of structure tied to SLT and structural Bayesianism, which has not been traditionally the aim of TDA. Data curation.Recent work on data curation for large language models selects or weights training data to optimize downstream performance (Xie et al., 2023; Xia et al., 2024). These methods typically target aggregate metrics such as validation loss or benchmark accuracy. Patterning differs in targeting internal structure rather than input-output behavior: the goal is to control which computational mechanisms form, not just how the model performs on held-out data. This distinction matters for alignment, where models with identical performance on evaluations may generalize differently out of distribution (Lehalleur et al., 2025). Developmental biology.The use of morphogens to guide organism development provides a useful analogy for patterning. In developmental biology, cell types are characterized by gene expression profiles, and differentiation is modeled as descent through a “developmental landscape” toward distinct cell fates (Waddington, 1957). Mathematically, this is often formalized via bifurcation theory (Rand et al., 2021). A key insight is that small changes in a cell’s chemical environment, made at the right time, can have large effects on cell fate. Patterning applies the same principle to neural networks: controlled changes in the data distribution, made at the right point in training, can steer the configuration of the final network. Wang et al. (2025) used susceptibility-based visualizations to study network development as a kind of embryology, with the layout of token types on the UMAP playing the role of a body plan. Coherent control. 
The use of linear response to shape structure is well known in physics, especially in coherent control (Brumer and Shapiro, 1989; Shapiro and Brumer, 2012): one first identifies resonant or highly sensitive modes of a molecule or material via low-intensity spectroscopy, then drives those same modes strongly with amplified laser pulses. Other areas of physics revolve around similar “test/push” loops. For instance in condensed matter or femtochemistry, one “tickles” a system with a mild pulse to see which excitations respond, then uses high-intensity pulses in precisely those frequency channels to trigger large-scale changes (e.g. breaking bonds). Patterning follows the same logic: susceptibilities identify which data perturbations the posterior is sensitive to, and we then apply those perturbations to steer internal structure.

6 Conclusion

We have introduced patterning as the dual problem to interpretability: given a desired form of generalization, determine what training data produces it. The mathematical framework is simple. Susceptibilities $\chi$ measure how observables respond to infinitesimal shifts in the data distribution; inverting this relationship via $dh_{\text{opt}} = \chi^\dagger \, d\mu^\infty_{\text{target}}$ yields the data intervention that steers the model toward a target configuration. This is the same linear-response framework used to read internal structure, now inverted to write it.

Our experiments demonstrate that this works in practice. In a small language model, targeting the second principal component of the susceptibility matrix (which couples induction patterns in the data to the induction circuit in the weights) allowed us to accelerate or delay circuit formation by re-weighting training data along $v_2$. Suppressing induction patterns prevented the circuit from forming entirely; amplifying them produced earlier and stronger induction heads.
In the parenthesis balancing task, where models can implement either of two algorithms consistent with the training data, we showed that susceptibility measurements identify discriminating samples whose re-weighting shifts the distribution of learned algorithms. The susceptibility gap $\chi^{EQ}_x - \chi^N_x$ successfully predicted which samples to up-weight to favor one solution over the other.

Limitations. Our experiments use small models (3M parameters) and simple tasks. Susceptibility estimation is computationally expensive (see Section C.2), and scaling to larger models will require more efficient methods. We have focused on one-off interventions made at the start of training, but the full promise of patterning lies in online control (adjusting the data distribution dynamically as training progresses and structure emerges). The theoretical framework assumes infinitesimal perturbations, while practical interventions are necessarily finite; understanding when the linear approximation breaks down remains important. Finally, while we have shown that susceptibility-guided interventions produce predictable changes in observables, the connection between these observables and downstream behavior (e.g. safety properties) requires further study.

We believe patterning opens a promising direction for controlling neural network training. The ability to identify which data shapes which internal structures, and to intervene accordingly, offers a principled approach to steering generalization. This has potential applications to AI alignment, where the goal is to control how models generalize beyond their training distribution; we discuss these applications in Section D. As interpretability methods improve our ability to read the computational structures inside neural networks, patterning provides the complementary ability to write them.

References

G. Baker, G. Wang, J. Hoogland, and D. Murfet. Structural inference: Interpreting small language models with susceptibilities, 2025.
URL https://arxiv.org/abs/2504.18274. T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html. P. Brumer and M. Shapiro. Coherence chemistry: controlling chemical reactions [with lasers]. Accounts of Chemical Research, 22(12):407–413, 1989. R. D. Cook and S. Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980. H. Cunningham, R. Huben, L. R. Smith, A. Ewart, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK. R. Eldan and Y. Li. Tinystories: How small can language models be and still speak coherent English?, 2023. URL https://arxiv.org/abs/2305.07759. N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. C. Elliott, E. Urdshals, D. Quarel, M. Farrugia-Roberts, and D. Murfet. Stagewise reinforcement learning and the geometry of the regret landscape, 2026. URLhttps://arxiv.org/abs/2601. 07524. A. Gordon, G. Baker, G. Wang, W. Snell, S. van Wingerden, and D. Murfet. Towards spectroscopy: Susceptibility clusters in language models, 2026. J. Hoogland, G. Wang, M. Farrugia-Roberts, L. Carroll, S. 
Wei, and D. Murfet. Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=45qJyBG8Oj. P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pages 1885–1894. PMLR, 2017. P. A. Kreer, W. Wu, M. Adam, Z. Furman, and J. Hoogland. Bayesian influence functions for Hessian-free data attribution, 2025. URL https://arxiv.org/abs/2509.26544. E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei. The local learning coefficient: A singularity- aware complexity measure. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 of Proceedings of Machine Learning Research, pages 244–252. PMLR, 2025. S. P. Lehalleur, J. Hoogland, M. Farrugia-Roberts, S. Wei, A. G. Oldenziel, G. Wang, L. Carroll, and D. Murfet. You are what you eat – AI alignment requires understanding how data shapes structure and generalisation, 2025. URL https://arxiv.org/abs/2502.05475. 14 V. R. Li, J. Kaufmann, M. Wattenberg, D. Alvarez-Melis, and N. Saphra. Can interpretation predict behavior on unseen data?, 2025. URL https://arxiv.org/abs/2507.06445. C. Mingard, G. Valle-Pérez, D. Sherrington, and A. A. Louis. Is SGD a Bayesian sampler? well, almost. In Journal of Machine Learning Research, volume 22, pages 1–64, 2021. D. Murfet and W. Troiani. Programs as singularities, 2025. URLhttps://arxiv.org/abs/2504. 08075. C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in. C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. 
In-context learning and induction heads. Transformer Circuits Thread, 2022. S. M. Park, K. Georgiev, A. Ilyas, G. Leclerc, and A. Madry. TRAK: Attributing model behavior at scale. arXiv preprint arXiv:2303.14186, 2023. D. A. Rand, A. Raju, M. Sáez, F. Corson, and E. D. Siggia. Geometry of gene regulatory dynamics. Proceedings of the National Academy of Sciences, 118(38):e2109729118, 2021. M. Shapiro and P. Brumer. Quantum control of molecular processes. John Wiley & Sons, 2012. E. Urdshals, E. Lau, J. Hoogland, S. van Wingerden, and D. Murfet. Compressibility measures complexity: Minimum description length meets singular learning theory, 2025. URLhttps: //arxiv.org/abs/2510.12077. C. H. Waddington. The Strategy of the Genes: A Discussion of Some Aspects of Theoretical Biology. Allen & Unwin, London, 1957. G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, and D. Murfet. Differentiation and spe- cialization of attention heads via the refined local learning coefficient, 2024. URLhttps: //arxiv.org/abs/2410.02984. G. Wang, G. Baker, A. Gordon, and D. Murfet. Embryology of a language model, 2025. URL https://arxiv.org/abs/2508.00331. K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. S. Watanabe. Algebraic geometry and statistical learning theory. Cambridge University Press, 2009. S. Watanabe. A Widely Applicable Bayesian Information Criterion. Journal of Machine Learning Research, 14:867–897, 2013. S. Wei, D. Murfet, M. Gong, H. Li, J. Gell-Redman, and T. Quella. Deep learning is singular, and that’s good. IEEE Transactions on Neural Networks and Learning Systems, 34(12):10473–10486, 2022. M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning, 2011. M. Xia, S. Malladi, S. 
Gururangan, S. Arora, and D. Chen. LESS: Selecting influential data for targeted instruction tuning, 2024. URL https://arxiv.org/abs/2402.04333.

S. M. Xie, S. Santurkar, T. Ma, and P. Liang. Data selection for language models via importance resampling, 2023. URL https://arxiv.org/abs/2302.03169.

Z. Yun, Y. Chen, B. Olshausen, and Y. LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 1–10, 2021.

Appendix

• Section A: Language Modeling – Additional experiments and detailed methodology for the small language model patterning experiments.
  – Section A.1: Patterning Datasets – Description of the construction of modified training datasets used in patterning experiments.
  – Section A.2: Patterned UMAPs – Visualizations of UMAP embeddings for retrained models, the effects of stunting the spacing fin, growing a delimiter fin, and the four induction patterning conditions.
  – Section A.3: Per-Pattern Susceptibilities – Per-pattern susceptibility plots over training, intended and collateral effects of each patterning intervention.
  – Section A.4: Susceptibilities PCA and Explained Variance – Principal component explained variances for each induction patterning model, quantifying how PC2 (associated with induction patterns) changes across patterning conditions.
  – Section A.5: Patterning Loss Impact – Test loss curves across patterning experiments measuring the impact of interventions on general model capabilities.
• Section B: Parenthesis Balancing – Extended methodology and implementation details for the parentheses balancing experiments.
  – Section B.1: Synthetic Datasets – Detailed procedure for generating the Almost Nested and Almost Equal datasets.
• Section C: Implementation Details – Hyperparameters and computational cost analysis.
  – Section C.1: SGLD Hyperparameters – Specifies the stochastic gradient Langevin dynamics hyperparameters ($n\beta$, $\gamma$, $\varepsilon$, chains, draws) used for language modeling and parenthesis balancing experiments.
  – Section C.2: Scaling Susceptibilities – Analysis of computational costs for susceptibility estimation and predicted future scaling of methodology.
• Section D: Alignment Applications – Discussion of potential applications of patterning to AI alignment, including avoiding undesirable structures (specification gaming, instrumental convergence) and steering toward desirable structures (goal synchronization, robust encoding of constraints).

Figure 7: PC loadings for the per-head standardization of the susceptibility matrix for the original small language model. Susceptibilities calculated using SGLD hyperparameters in Section C.1. Panels: PC1, PC2, PC3; horizontal axis: attention heads l0h0 through l1h7.

A Language Modeling

In the main text of this paper, we present several patterning experiments concerning the induction circuit of a small language model. We include two additional experiments in the appendix, concerning “fins” in the UMAP representation of the small language model. These experiments validate the basic paradigm of patterning: that modifying the data distribution in targeted ways produces predictable changes in the organizational structures that form during training. Where the main text uses susceptibility measurements to guide interventions, here we use manual interventions based on inspection of the UMAP, a simpler approach that nonetheless demonstrates the core principle.
A.1 Patterning Datasets

A.1.1 Spacing Fin Dataset

One patterning experiment involves “stunting” the spacing fin of the original model, where we use a manual intervention to the dataset deduced by studying the UMAP of the tokens. In studying the spacing fin, Wang et al. (2025) note that the spacing fin is composed of spacing tokens with a large number of consecutive spacing tokens directly preceding them. To produce the patterning distribution, we use the original training distribution, a filtered subset of the Pile (Xie et al., 2023), and modify each context as follows: for each sample in the training data, we do string replacement so that sequences of purely consecutive spaces are replaced with single spaces, while sequences of consecutive spaces followed by one or more newlines are replaced by a single newline. Carriage returns, tabs, and form feeds are left unchanged.

A.1.2 Delimiter Fin Dataset

Another patterning experiment involves “growing” a new delimiter fin, where we similarly use a manual intervention. In this case, we use the original training distribution and modify each context as follows: for each sample in the training data, we do string replacement so that whenever a curly right bracket } appears in the data, with 50% probability we mutate it into a longer sequence of } characters with uniformly random length between 1 and 50.

A.1.3 Induction Patterning Dataset

Baker et al. (2025) and Wang et al. (2025) show that the second principal component of the susceptibility matrix, PC2, tracks the formation and relative strength of the induction circuit. The evidence for this identification is multifaceted:

• Head-level structure. When the susceptibility matrix is standardized per-head (converting each column to z-scores), PC2 cleanly separates induction heads in layer 1 and their composing partners in layer 0 from other heads, consistently across random seeds.

• Pattern-level structure. Figure 10 shows per-pattern susceptibilities over training for patterns diagnostic of different circuits. As induction heads form (measured by prefix matching score), their susceptibility to induction patterns rises and separates from other heads. Notably, these raw susceptibilities are all positive: the separation is in magnitude, not sign.

• Token-level structure. Figure 1 shows text from the training corpus with tokens colored by their PC2 loading. Tokens participating in induction patterns show strongly negative values, while other tokens show positive values.

An important subtlety is that PC2 captures the balance between the induction circuit and the rest of the model, not merely the induction circuit in isolation. When the induction circuit forms, it promotes the induction continuation while other heads suppress it; PC2 reflects this whole-model response. In the raw (non-standardized) susceptibility matrix, the induction signature is present but less stark than after per-head standardization (Figure 7): layer 0 heads load positively on PC2 while layer 1 heads load negatively, but the separation is not as clean as in the standardized analysis of Baker et al. (2025).

We construct the training datasets for the four induction patterning experiments as follows. We begin by collecting susceptibilities on some fraction of the original training distribution on the original small language model. We collect one susceptibility value per token per attention head, resulting in a 16-dimensional susceptibility vector for each token (one dimension per head). To save calendar time and compute, we opt not to collect enough susceptibilities to train for the full 50k steps, and we believe this does not qualitatively impact the results. Instead, we collect enough susceptibility data for around 4k training steps, then train the models on multiple epochs of the same data, up to around 20k steps, well after the normal formation time of the induction circuit.
We then take the same PC2 projection that is computed when generating the UMAPs in Figure 8a and apply it to each of the 16-dimensional susceptibility vectors, obtaining a scalar value for each token in our training set. That scalar is then mapped to a weight mask value for the token, which re-weights the token’s loss during training. We consider four different mappings. For each mapping, let a and b be mask parameters such that PC2 values less than -2 are mapped to a token weight of a, PC2 values greater than 0 are mapped to a token weight of b, and intermediate PC2 values are linearly interpolated between a and b. The four token masks are:

• Repress-0x: a = 0 and b = 1
• Baseline-1x: a = 1 and b = 1
• Induce-2x: a = 2 and b = 1
• Induce-4x: a = 4 and b = 1

This way, the Baseline-1x models have normal, unmodified per-token masks during training, while the Repress-0x models effectively remove strong induction patterns from the training set and the Induce-2x and Induce-4x models amplify them.

A.2 Patterned UMAPs

We visualize the results of the retraining experiments on the various patterning datasets, all using the small language model architecture, in Figure 8 and Figure 9.

Spacing fin. The resulting impact is substantial, as visualized in Figure 8b. The tokens that previously made up the fin have migrated towards the tail and smeared out. In Wang et al. (2025), it was noted that the fin attached to the body in a particular way: the posterior side of the fin consisted of progressively fewer space tokens as one approached the body, and smoothly attached to the portion of the body containing single space tokens. On the anterior side, the fin consisted of sequences of spacing tokens ending in a newline, also progressively decreasing in length, and smoothly attached to the portion of the body containing newlines. In the stunted spacing fin model, the relation of the long consecutive spacing tokens to the rest of the body is no longer so coherent.
This demonstrates that removing a pattern from the data prevents the corresponding organizational structure from forming.

Delimiter fin. Destroying structure is easier than creating it. A stronger validation of patterning would be to use the same principle to create new structure. The spacing fin organizes tokens by the number of consecutive spacing characters preceding them; we hypothesized that adding long sequences of consecutive delimiters to the training data would induce an analogous structure for delimiters.

Figure 8: Three UMAP plots of different models: (a) the original language model, with susceptibilities evaluated on the original training data (in distribution); (b) the stunted spacing fin model, trained on modified data to stunt the spacing fin, with susceptibilities evaluated on the original data (out of distribution); (c) the grown delimiter fin model, trained on modified data to grow a delimiter fin, with susceptibilities evaluated on the modified data (in distribution). For spacing and delimiter tokens, the color gradient ranges from lighter to darker by the number of consecutive tokens of the same pattern that precede the token in question, with the lightest shade indicating isolated tokens of a pattern and the darkest shade indicating 80+ consecutive tokens.

When we retrain the model using a dataset modified to include many more sequential delimiter tokens, we obtain Figure 8c. A fin grows out of the head, organized by the number of preceding delimiter tokens, mirroring how the spacing fin organizes spacing tokens. This demonstrates that amplifying a pattern in the data can create new organizational structure, not merely strengthen existing structure.

A.3 Per-Pattern Susceptibilities

Recall from (17) that the per-pattern susceptibility $\chi^{C}(P)$ averages per-token susceptibilities over all tokens matching a pattern $P$.
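As a minimal sketch of the averaging just described, the snippet below computes one per-pattern susceptibility per head from a matrix of per-token susceptibilities. The function name, the array layout (tokens by heads), and the use of a plain unweighted mean are assumptions made for illustration; equation (17) in the main text is the authoritative definition.

```python
import numpy as np


def per_pattern_susceptibility(token_chi, pattern_mask):
    """Average per-token susceptibilities over the tokens matching a pattern.

    token_chi:    array of shape (n_tokens, n_heads), one row per token
                  (hypothetical layout, chosen for this sketch).
    pattern_mask: boolean array of shape (n_tokens,), True where the token
                  matches the pattern P.
    Returns an array of shape (n_heads,).
    """
    token_chi = np.asarray(token_chi, dtype=float)
    pattern_mask = np.asarray(pattern_mask, dtype=bool)
    if not pattern_mask.any():
        raise ValueError("no tokens match the pattern")
    # Unweighted mean over matching tokens (an assumption of this sketch).
    return token_chi[pattern_mask].mean(axis=0)


# Toy example: 3 tokens, 2 heads; tokens 0 and 2 match the pattern.
toy_chi = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_mask = [True, False, True]
toy_result = per_pattern_susceptibility(toy_chi, toy_mask)
```

Repeating this at each checkpoint would give one curve per head and pattern, which is the kind of quantity plotted in the per-pattern figures.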
In Figures 10 to 16, we present the full set of per-pattern susceptibilities for each of the seven models studied in this paper: the original 3M parameter small language model, the stunted spacing fin model, the grown delimiter fin model, and the four induction patterning models. We use this as a measure of how surgical the various patterning experiments are in the language model setting.

Figure 9: UMAP plots of the first training seeds for (a) Baseline-1x, (b) Repress-0x, (c) Induce-2x, (d) Induce-4x.

We see that the per-pattern susceptibilities are generally stable across experiments, with some notable exceptions. The largest difference in the per-pattern susceptibilities is the expected one. For example, the original model and the stunted spacing fin model differ the most on the per-pattern susceptibilities for the spacing token pattern. In some cases there is some amount of “collateral damage”, such as a nontrivial difference in the Numeric per-pattern susceptibility between the original model and the two fin patterning experiments.

For the induction patterning experiments, we see the most striking differences between Baseline-1x and Repress-0x, where the induction pattern susceptibilities almost vanish. However, Word Part, Word End, Numeric, and Left Delimiters are all somewhat affected as well. For Baseline-1x compared to Induce-2x and Induce-4x, the per-pattern susceptibilities are all relatively similar, including the magnitude of the induction pattern susceptibilities. However, one notable difference is in the rate of increase of the induction pattern susceptibilities. Although the resolution is somewhat
Although the resolution is somewhat 20 10 3 10 4 1.0 0.5 Susceptibility L0: Word Start 10 3 10 4 0.0 0.5 1.0 L0: Word Part 10 3 10 4 0 1 2 L0: Word End 10 3 10 4 1.0 0.8 0.6 0.4 0.2 L1: Word Start 10 3 10 4 0.0 0.5 1.0 L1: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 2.0 L1: Word End 10 3 10 4 0 1 2 Susceptibility L0: Induction 10 3 10 4 2 1 0 L0: Spacing 10 3 10 4 1.00 0.75 0.50 0.25 L0: Formatting 10 3 10 4 0 1 2 3 L1: Induction 10 3 10 4 1.5 1.0 0.5 0.0 L1: Spacing 10 3 10 4 0.6 0.4 0.2 L1: Formatting 10 3 10 4 0 1 2 3 Susceptibility L0: Right Delimiter 10 3 10 4 0.0 0.5 1.0 L0: Left Delimiter 10 3 10 4 0.0 0.5 1.0 1.5 L0: Numeric 10 3 10 4 0 1 2 3 L1: Right Delimiter 10 3 10 4 0.0 0.5 1.0 L1: Left Delimiter 10 3 10 4 0.0 0.5 1.0 1.5 L1: Numeric H0H1H2H3H4H5H6H7 Figure 10: Per-pattern susceptibilities for the original language model, using the original training distribution (in-distribution). 10 3 10 4 1.5 1.0 0.5 Susceptibility L0: Word Start 10 3 10 4 0.0 0.5 1.0 L0: Word Part 10 3 10 4 0 1 2 L0: Word End 10 3 10 4 0.8 0.6 0.4 0.2 L1: Word Start 10 3 10 4 0.0 0.5 1.0 L1: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 2.0 L1: Word End 10 3 10 4 0.0 0.5 1.0 1.5 Susceptibility L0: Induction 10 3 10 4 4 3 2 1 0 L0: Spacing 10 3 10 4 1.00 0.75 0.50 0.25 0.00 L0: Formatting 10 3 10 4 0 1 2 L1: Induction 10 3 10 4 3 2 1 0 L1: Spacing 10 3 10 4 0.6 0.4 0.2 L1: Formatting 10 3 10 4 0 1 2 3 Susceptibility L0: Right Delimiter 10 3 10 4 0.5 0.0 0.5 1.0 L0: Left Delimiter 10 3 10 4 0.0 0.5 1.0 L0: Numeric 10 3 10 4 0 1 2 3 L1: Right Delimiter 10 3 10 4 0.0 0.5 L1: Left Delimiter 10 3 10 4 0.0 0.5 1.0 L1: Numeric H0H1H2H3H4H5H6H7 Figure 11: Per-pattern susceptibilities for the stunted spacing fin model, using the original training distribution (out-of-distribution). 
10 3 10 4 1.0 0.5 Susceptibility L0: Word Start 10 3 10 4 0.0 0.5 1.0 L0: Word Part 10 3 10 4 0 1 2 3 L0: Word End 10 3 10 4 0.8 0.6 0.4 0.2 L1: Word Start 10 3 10 4 0.0 0.5 1.0 L1: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 2.0 L1: Word End 10 3 10 4 0.0 0.5 1.0 1.5 2.0 Susceptibility L0: Induction 10 3 10 4 2 1 0 L0: Spacing 10 3 10 4 1.0 0.5 0.0 L0: Formatting 10 3 10 4 0 1 2 L1: Induction 10 3 10 4 1.5 1.0 0.5 0.0 L1: Spacing 10 3 10 4 0.6 0.4 0.2 L1: Formatting 10 3 10 4 0 2 4 6 Susceptibility L0: Right Delimiter 10 3 10 4 0.0 0.5 L0: Left Delimiter 10 3 10 4 0.0 0.5 1.0 L0: Numeric 10 3 10 4 0 1 2 3 4 L1: Right Delimiter 10 3 10 4 0.0 0.2 0.4 0.6 0.8 L1: Left Delimiter 10 3 10 4 0.00 0.25 0.50 0.75 1.00 L1: Numeric H0H1H2H3H4H5H6H7 Figure 12: Per-pattern susceptibilities for the grown delimiter fin model, using the modified distribu- tion with extra(in-distribution). 21 10 3 10 4 1.25 1.00 0.75 0.50 0.25 Susceptibility L0: Word Start 10 3 10 4 0.0 0.2 0.4 0.6 0.8 L0: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 2.0 L0: Word End 10 3 10 4 0.8 0.6 0.4 0.2 L1: Word Start 10 3 10 4 0.0 0.2 0.4 L1: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 L1: Word End 10 3 10 4 0.3 0.2 0.1 0.0 0.1 Susceptibility L0: Induction 10 3 10 4 1.5 1.0 0.5 0.0 L0: Spacing 10 3 10 4 1.00 0.75 0.50 0.25 L0: Formatting 10 3 10 4 0.2 0.0 0.2 L1: Induction 10 3 10 4 1.0 0.5 0.0 L1: Spacing 10 3 10 4 0.6 0.4 0.2 L1: Formatting 10 3 10 4 0 1 2 Susceptibility L0: Right Delimiter 10 3 10 4 0.4 0.2 0.0 0.2 L0: Left Delimiter 10 3 10 4 0.0 0.5 1.0 L0: Numeric 10 3 10 4 0 1 2 L1: Right Delimiter 10 3 10 4 0.2 0.0 0.2 L1: Left Delimiter 10 3 10 4 0.0 0.2 0.4 0.6 L1: Numeric H0H1H2H3H4H5H6H7 Figure 13: Per-pattern susceptibilities for seed 1 of theRepress-0xmodels, using the original training distribution token weighting. 
10 3 10 4 1.0 0.5 Susceptibility L0: Word Start 10 3 10 4 0.0 0.5 1.0 L0: Word Part 10 3 10 4 0 1 2 L0: Word End 10 3 10 4 0.8 0.6 0.4 0.2 L1: Word Start 10 3 10 4 0.0 0.5 1.0 L1: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 2.0 L1: Word End 10 3 10 4 0.0 0.5 1.0 1.5 Susceptibility L0: Induction 10 3 10 4 2 1 0 L0: Spacing 10 3 10 4 1.00 0.75 0.50 0.25 0.00 L0: Formatting 10 3 10 4 0 1 2 L1: Induction 10 3 10 4 1.5 1.0 0.5 0.0 L1: Spacing 10 3 10 4 0.6 0.4 0.2 L1: Formatting 10 3 10 4 0 1 2 3 Susceptibility L0: Right Delimiter 10 3 10 4 0.0 0.5 1.0 L0: Left Delimiter 10 3 10 4 0.0 0.5 1.0 1.5 L0: Numeric 10 3 10 4 0 1 2 3 L1: Right Delimiter 10 3 10 4 0.0 0.5 1.0 L1: Left Delimiter 10 3 10 4 0.0 0.5 1.0 L1: Numeric H0H1H2H3H4H5H6H7 Figure 14: Per-pattern susceptibilities for seed 1 of theBaseline-1xmodels, using the original training distribution token weighting. 10 3 10 4 1.5 1.0 0.5 Susceptibility L0: Word Start 10 3 10 4 0.0 0.5 1.0 L0: Word Part 10 3 10 4 0 1 2 3 L0: Word End 10 3 10 4 0.8 0.6 0.4 0.2 L1: Word Start 10 3 10 4 0.0 0.5 1.0 1.5 L1: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 2.0 L1: Word End 10 3 10 4 0 1 2 Susceptibility L0: Induction 10 3 10 4 2 1 0 L0: Spacing 10 3 10 4 1.00 0.75 0.50 0.25 0.00 L0: Formatting 10 3 10 4 0 1 2 3 L1: Induction 10 3 10 4 1.5 1.0 0.5 0.0 L1: Spacing 10 3 10 4 0.4 0.3 0.2 0.1 0.0 L1: Formatting 10 3 10 4 0 1 2 3 Susceptibility L0: Right Delimiter 10 3 10 4 0.0 0.5 1.0 1.5 L0: Left Delimiter 10 3 10 4 0.00 0.25 0.50 0.75 1.00 L0: Numeric 10 3 10 4 0 1 2 3 4 L1: Right Delimiter 10 3 10 4 0.0 0.5 1.0 1.5 L1: Left Delimiter 10 3 10 4 0.0 0.5 1.0 L1: Numeric H0H1H2H3H4H5H6H7 Figure 15: Per-pattern susceptibilities for seed 1 of theInduce-2xmodels, using the original training distribution token weighting. 
22 10 3 10 4 1.0 0.5 Susceptibility L0: Word Start 10 3 10 4 0.0 0.5 1.0 L0: Word Part 10 3 10 4 0 1 2 L0: Word End 10 3 10 4 0.6 0.4 0.2 L1: Word Start 10 3 10 4 0.0 0.5 1.0 1.5 L1: Word Part 10 3 10 4 0.0 0.5 1.0 1.5 2.0 L1: Word End 10 3 10 4 0 1 2 Susceptibility L0: Induction 10 3 10 4 2 1 0 L0: Spacing 10 3 10 4 1.00 0.75 0.50 0.25 0.00 L0: Formatting 10 3 10 4 0 1 2 3 L1: Induction 10 3 10 4 1.5 1.0 0.5 0.0 L1: Spacing 10 3 10 4 0.3 0.2 0.1 0.0 0.1 L1: Formatting 10 3 10 4 0 1 2 3 Susceptibility L0: Right Delimiter 10 3 10 4 0.0 0.5 1.0 1.5 L0: Left Delimiter 10 3 10 4 0.00 0.25 0.50 0.75 L0: Numeric 10 3 10 4 0 1 2 3 L1: Right Delimiter 10 3 10 4 0.0 0.5 1.0 1.5 L1: Left Delimiter 10 3 10 4 0.0 0.5 1.0 L1: Numeric H0H1H2H3H4H5H6H7 Figure 16: Per-pattern susceptibilities for seed 1 of theInduce-4xmodels, using the original training distribution token weighting. low, the slope induction pattern susceptibilities is clearly steeper, earlier, when the induction pattern is amplified in the training data. A.4 Susceptibilities PCA and Explained Variance We compute the principal components and explained variance of the susceptibility vectors at the end of training for some of the models we study. Recall from Wang et al. (2025) that the negative PC2 direction of these models reliably coincides with a heuristic definition of induction patterns, and so the degree of explained variance in PC2 gives some quantitative indication of how much the model is able to recognize induction patterns as separate from other tokens. We also point to Figure 7 of the appendix of Wang et al. (2025) which shows that PC2 widens considerably only for induction pattern tokens during the time that the induction circuit forms. We give the first five PCs and explained variances for the following models: The original language model studied in Wang et al. 
(2025) has the following PCA explained variances:
• PC1: 0.9575 (0.9575 cumulative)
• PC2: 0.0224 (0.9799 cumulative)
• PC3: 0.0050 (0.9850 cumulative)
• PC4: 0.0033 (0.9883 cumulative)
• PC5: 0.0024 (0.9907 cumulative)

The first training seed of the Baseline-1x models has the following PCA explained variances:
• PC1: 0.9413 (0.9413 cumulative)
• PC2: 0.0296 (0.9709 cumulative)
• PC3: 0.0068 (0.9777 cumulative)
• PC4: 0.0051 (0.9828 cumulative)
• PC5: 0.0039 (0.9866 cumulative)

The first training seed of the Repress-0x models has the following PCA explained variances:
• PC1: 0.9707 (0.9707 cumulative)
• PC2: 0.0098 (0.9805 cumulative)
• PC3: 0.0040 (0.9845 cumulative)
• PC4: 0.0027 (0.9872 cumulative)
• PC5: 0.0021 (0.9893 cumulative)

Figure 17: Left: the test loss for the spacing fin and delimiter fin retraining experiments is compared to the original model, evaluated on the original data distribution. Right: the test loss for the spacing fin and the original model are compared, evaluated on the modified training distribution that removes long sequences of consecutive spacing tokens.

The first training seed of the Induce-2x models has the following PCA explained variances:
• PC1: 0.9448 (0.9448 cumulative)
• PC2: 0.0288 (0.9736 cumulative)
• PC3: 0.0065 (0.9801 cumulative)
• PC4: 0.0047 (0.9849 cumulative)
• PC5: 0.0032 (0.9881 cumulative)

The first training seed of the Induce-4x models has the following PCA explained variances:
• PC1: 0.9584 (0.9584 cumulative)
• PC2: 0.0216 (0.9800 cumulative)
• PC3: 0.0059 (0.9859 cumulative)
• PC4: 0.0023 (0.9881 cumulative)
• PC5: 0.0019 (0.9900 cumulative)

The key observation is that PC2 explained variance drops substantially for Repress-0x (0.98%) compared to the original model (2.24%) and Baseline-1x (2.96%). This is consistent with the suppressed induction circuit: when induction patterns are removed from training, the model no longer develops the structure that PC2 captures.
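For readers who want to reproduce numbers of this kind, the following is a minimal sketch of how explained-variance ratios can be obtained from a tokens-by-heads susceptibility matrix using a singular value decomposition. The function name and array layout are hypothetical, and the mean-centering step is the standard PCA convention rather than something confirmed by the text; the authors' preprocessing may differ.

```python
import numpy as np


def pca_explained_variance(chi, n_components=5):
    """PCA of a (n_tokens, n_heads) susceptibility matrix via SVD.

    Returns (components, ratio, cumulative) where components has shape
    (n_components, n_heads) and ratio holds explained-variance fractions.
    """
    chi = np.asarray(chi, dtype=float)
    # Standard PCA centering (an assumption of this sketch).
    centered = chi - chi.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions; s**2 is proportional to variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    ratio = variance / variance.sum()
    cumulative = np.cumsum(ratio)
    return vt[:n_components], ratio[:n_components], cumulative[:n_components]


# Toy data lying on a single line: all variance should fall on PC1.
toy = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
components, ratio, cumulative = pca_explained_variance(toy, n_components=2)
```

Under the same conventions, projecting the centered rows onto `components[1]` would give one PC2 scalar per token.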
The Induce-2x and Induce-4x models have PC2 explained variances (2.88% and 2.16%) comparable to baseline, indicating that the induction circuit forms in these models as expected.

A.5 Patterning Loss Impact

We check the loss of our retrained models on the original training distribution, in order to verify that our patterning operations have not destroyed significant amounts of other internal structure in the model. In Figure 17, we compare the loss of the two fin retraining experiments against the original model and see that the loss curve of the delimiter fin retraining experiment is nearly identical to that of the original model on the original training distribution. The stunted spacing fin model has a slightly worse loss on the original distribution, but a nearly identical loss when evaluated on the modified training data with consecutive spacing tokens removed. This suggests that the difference in loss on the original dataset can be explained by performance on consecutive spacing tokens.

In Figure 18, we look at the loss over training for one seed of each of the token re-weighting schemes. As expected, the Baseline-1x model achieves the lowest loss, and it is unsurprising that Repress-0x achieves the highest loss, since induction patterns are a common and important pattern in the data. Additionally, with the exception of Induce-4x, the models seem to diverge in their loss right around the time that they would normally develop the induction circuit. Induce-4x possibly diverges in loss sooner both because the induction circuit forms earlier and because the re-weighting is more extreme.

Figure 18: The test loss for one seed of each of the induction patterning retraining experiments, compared against each other on the original training distribution.

B Parenthesis Balancing

We follow the setting of Li et al. (2025) and consider sequences of left and right parentheses. These sequences may be up to 40 tokens long and are always of even length.
Such sequences may fall into one of three categories:
• Nested: any prefix subsequence must have at least as many left parentheses as right parentheses. Such sequences are classified TRUE.
• Equal-count: the sequence as a whole has an equal number of left and right parentheses, but is not correctly nested. Such sequences are classified FALSE.
• Neither: the sequence is not equal-count and as a result is not nested either. Such sequences are classified FALSE.

The training distribution is constructed out of samples which are either correctly nested or neither correctly nested nor equal-count. As such, models may achieve perfect training accuracy by implementing either an internal algorithm that checks whether the sequence is nested (the Nested algorithm) or one that checks whether there is an equal number of left and right parentheses (the Equal-Count algorithm). The implementation a model develops is then measured using the equal-count but not nested samples as an out-of-distribution test set.

We consider a subset of 30 of the models trained by Li et al. (2025) out of an original 270. In particular, we focus on those models which have 2 or 3 layers and 4 attention heads, and which are trained with 0.001 weight decay using 5 random initializations and 3 dataset shuffles. This subset was chosen for having a reasonable spread of OOD accuracies, so that both Nested and Equal-Count would be represented, as well as for having reasonable training and SGLD sampling stability. For a complete specification of the architecture, please refer to Li et al. (2025).

B.1 Synthetic Datasets

In our experiments in Section 4, we create two synthetic datasets, Almost Nested and Almost Equal (Section 4.3). We elaborate on this process here. We begin with the 30 models mentioned above, with 2 or 3 layers, 4 attention heads, and 0.001 weight decay.

Figure 19: A heatmap of the susceptibilities on the 1024 samples across all 30 models.
Models are ordered from top to bottom in descending order of OOD accuracy; samples are split into four rows of 256.

We calculate the per-sample susceptibilities for 1024 samples, sampled uniformly from the training distribution, using the full model susceptibilities (no weight refinements) with the hyperparameters specified in Section C.1. These per-sample susceptibilities are shown in Figure 19, where rows index the 30 models, ordered in descending order of OOD accuracy from top to bottom.

We then select the top 3 models by OOD accuracy (which have 0.915, 0.914, and 0.911 accuracy and 3, 2, and 3 layers respectively) and the bottom 3 models by OOD accuracy (which have 0.111, 0.001, and 0.000 accuracy and 3 layers each), which we refer to as $M_{\mathrm{top}}$ and $M_{\mathrm{bot}}$ respectively. Let $\chi_{\mathrm{top}}$ and $\chi_{\mathrm{bot}}$ be the average susceptibilities of a sample across $M_{\mathrm{top}}$ and $M_{\mathrm{bot}}$. Since models in $M_{\mathrm{top}}$ have converged to the Nested solution and models in $M_{\mathrm{bot}}$ to Equal-Count, these empirical averages approximate $\chi_{N}$ and $\chi_{EQ}$ from Section 4.

We look for samples which either maximize $\Delta\chi = \chi_{\mathrm{top}} - \chi_{\mathrm{bot}}$ or minimize it. Samples that maximize $\Delta\chi$ are those which we predict will result in a relative increase in the LLC of Nested solutions versus Equal-Count, while samples that minimize $\Delta\chi$ are those which we predict will result in a relative increase in the LLC of Equal-Count compared to Nested. When we look at the actual samples that minimize or maximize $\Delta\chi$, we find the following:
• Maximizing samples (“almost nested”): these are samples which tend to look almost like they are correctly nested, but which typically have an extra trailing pair of closing parentheses. That is, excluding the last two closing parentheses, they would be classified as correctly nested samples (see Figure 20a). The maximizing TRUE samples (Figure 20b) are also interpretable but less extreme.
• Minimizing samples (“almost equal”): these are a bit more complicated.
The samples which are most extreme end up being TRUE samples (see Figure 20d), but these samples are not as interpretable as the FALSE samples, which are very nearly as extreme (see Figure 20c). Because of this, we focus on the most extreme FALSE samples instead. These are samples which almost have an equal number of left and right parentheses, but are off by two in one direction or the other. These samples, in their Dyck path representation, also tend to have their paths below the y = 0 axis for more steps than the paths are above y = 0.

Figure 20: Extreme samples for $\Delta\chi$. Dyck path representations are each slightly offset for legibility. Each panel plots Dyck path height against step, colored by $\chi_{\mathrm{top}} - \chi_{\mathrm{bot}}$.
(a) Top 5 FALSE samples which maximize $\Delta\chi$ out of 1024 random samples. Highly interpretable and more extreme than the maximizing TRUE samples. These are the samples which we call “almost nested”.
(b) Top 5 TRUE samples which maximize $\Delta\chi$ out of 1024 random samples. Highly interpretable, but not nearly as extreme as the maximizing FALSE samples.
(c) Top 5 FALSE samples which minimize $\Delta\chi$ out of 1024 random samples. Highly interpretable and about as extreme as the minimizing TRUE samples. These are the samples which we call “almost equal”.
(d) Top 5 TRUE samples which minimize $\Delta\chi$ out of 1024 random samples. Not highly interpretable and about as extreme as the minimizing FALSE samples.

These characterizations are operationalized in the following algorithms that generate additional synthetic samples:
• Maximizing samples (“almost nested”):
  – Generate sequences uniformly at random (following the same uniform random sampling used by Li et al. (2025)).
  – Check that the final two tokens are closing parentheses.
  – Check that the sequence minus the final two tokens would be classified TRUE.
  – If both conditions are met and the sequence has not already been generated, add the sequence to the set of generated sequences.
• Minimizing samples (“almost equal”):
  – Generate sequences uniformly at random (following the same uniform random sampling used by Li et al. (2025)).
  – Check that the difference in the number of left and right parentheses is either 2 or −2.
  – For each step in the Dyck path, check that the cumulative number of steps below the y = 0 axis is greater than or equal to the cumulative number of steps above the y = 0 axis.
  – If both conditions are met and the sequence has not already been generated, add the sequence to the set of generated sequences.

The original training distribution from Li et al. (2025) consists of 200k samples: 100k TRUE (balanced and equal count) and 100k FALSE (unbalanced or unequal count), i.e., a 50/50 split, with models trained on 5 epochs of the data. We then use the above criteria to produce the following modified datasets:
• Almost Nested: We generate 18.3k “almost nested” samples, then take the original training distribution, (uniformly randomly) remove 36.6k of the original FALSE samples, and add two copies of each of the 18.3k synthetic “almost nested” samples.
• Almost Equal: We generate 19.0k “almost equal” samples, then take the original training distribution, (uniformly randomly) remove 67.1k of the original FALSE samples, and add four copies of each of the 19.0k synthetic “almost equal” samples.
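The two generation procedures above can be sketched as rejection sampling with a pair of predicates. This is an illustrative sketch only: the helper names are hypothetical, the sampler (even length drawn uniformly from 2 to 40, each token drawn uniformly) is an assumption standing in for the sampling of Li et al. (2025), the empty sequence is treated as not nested, and the "steps below versus above y = 0" condition is read as comparing running counts of negative and positive path heights.

```python
import random


def heights(seq):
    """Running Dyck-path height: +1 for '(' and -1 for ')'."""
    h, out = 0, []
    for tok in seq:
        h += 1 if tok == "(" else -1
        out.append(h)
    return out


def is_nested(seq):
    """TRUE class: no prefix dips below zero and the counts are equal."""
    hs = heights(seq)
    # The empty sequence is treated as not nested (assumption of this sketch).
    return bool(hs) and min(hs) >= 0 and hs[-1] == 0


def is_almost_nested(seq):
    """Ends in two closing parentheses; the remainder would be TRUE."""
    return seq.endswith("))") and is_nested(seq[:-2])


def is_almost_equal(seq):
    """Counts differ by exactly 2, and at every step the path has spent at
    least as many steps below y = 0 as above it (one reading of the text)."""
    hs = heights(seq)
    if not hs or abs(hs[-1]) != 2:
        return False
    below = above = 0
    for h in hs:
        below += h < 0
        above += h > 0
        if below < above:
            return False
    return True


def generate_unique(predicate, n, max_len=40, seed=0, max_tries=100_000):
    """Rejection-sample up to n distinct sequences satisfying predicate."""
    rng = random.Random(seed)
    found = set()
    for _ in range(max_tries):
        if len(found) >= n:
            break
        length = 2 * rng.randint(1, max_len // 2)  # even length, hypothetical sampler
        seq = "".join(rng.choice("()") for _ in range(length))
        if predicate(seq):
            found.add(seq)  # the set discards sequences already generated
    return sorted(found)


almost_nested_samples = generate_unique(is_almost_nested, 5)
```

Using a set for the accepted sequences implements the "has not already been generated" check directly; the modified datasets are then built by duplicating each accepted sequence and removing original FALSE samples as described in the bullets above.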
Incidentally, when 100k samples are randomly generated, around 500 of them typically fit the “almost nested” criteria while around 3700 of them fit the “almost equal” criteria, which is evidence that the larger effect size that Almost Nested has on the resulting OOD accuracy distribution is related to the relative rarity of “almost nested” samples in the pretraining data.

C Implementation Details

C.1 SGLD Hyperparameters

Language Modeling. In the language modeling experiments, we use hyperparameters similar to those of Wang et al. (2025), with the exception of reducing ε and increasing the number of draws per chain:
• nβ = 30
• γ = 300
• ε = 3e-4
• 4 chains
• 300 draws

We reduce ε and increase the number of draws by a half order of magnitude each. This is a more compute-intensive option, but from a theoretical perspective it is unproblematic: we are essentially sampling from the same process with higher time resolution.

Parentheses Balancing. We calibrate our SGLD hyperparameters for the parentheses balancing experiments to be:
• nβ = 100
• γ = 500
• ε = 3e-6
• 4 chains
• 5000 draws

We found that with higher ε values it was easier for the sampling process to become unstable, so we opted for a smaller value with many more draws.

C.2 Scaling Susceptibilities

Susceptibility estimation, for a given set of model weights and a given set of data samples whose susceptibilities are to be measured, involves the following computational costs for each SGLD sample:
• A backward pass using the data samples used to compute the gradient for SGLD
• A series of forward passes using each of the data samples being measured

Although backward passes are individually more expensive, if the number of samples being measured in the forward passes is much larger than the number of samples used to compute the SGLD step, then the compute cost is still dominated by the number of data samples being measured, and this is the case in practice.
Therefore, we measure the computational cost of our experiments in numbers of forward passes on samples, treating a backward pass as 3 forward passes.
• Training the small language model itself costs around 15 million forward passes.
• Computing the susceptibilities for a single model checkpoint’s UMAP visualization costs around 40 million forward passes (approx. 3x training).
• Computing enough susceptibilities to conduct the induction patterning experiment costs around 7.7 billion forward passes (approx. 500x training).

Although these experiments are significantly more expensive than a training run for the small language model, this methodology scales favorably with model size. In practice, the LLCs of models have been measured at the billion parameter scale by Urdshals et al. (2025), and susceptibilities have been measured in similar models in preliminary results, so the basic measurements are feasible at scale.

We have reason to believe that the patterning methodology scales favorably as well. As model size increases, compute-optimal training costs increase quadratically, while the cost of computing susceptibilities for a given checkpoint, for a fixed number of weight restrictions, grows linearly. Naively, the number of weight restrictions needed also grows, though in open source models like the Pythia suite, the number of weight restrictions to consider grows much more slowly than quadratically in parameter count. We also find that we can, for example, exclude many attention heads and still achieve similar results, even in the small language model.

For experiments with significantly higher relative compute costs, such as the induction patterning experiment, we expect more sophisticated approaches to work, such as training an auxiliary model to predict the susceptibilities of a sample. We also aim to develop the methodology into one that dynamically adjusts the training distribution, rather than computing many susceptibilities up front and applying a fixed distribution shift.
This may allow for subtler changes to the distribution, requiring less compute overall.

Finally, we may eventually be able to selectively apply patterning to small critical ranges of training rather than to all training steps. Once the weights have been nudged far enough towards a desired structure, the original training distribution may be sufficient for finishing the structure’s development. Structural development in language models also appears to occur on a logarithmic time scale, so such nudges may similarly occur on a logarithmic schedule, further reducing compute cost compared to training.

D Alignment Applications

Patterning has potential applications to AI alignment. The core insight is that alignment fundamentally concerns the relationship between data, internal structure, and generalization (Lehalleur et al., 2025). Current alignment techniques (RLHF, constitutional AI, safety fine-tuning) operate by shaping the training data distribution, thereby indirectly controlling model behavior. Patterning offers a path toward more direct control: rather than shaping outputs and hoping the right internal structure follows, we target internal structure directly via susceptibilities.

Simplicity bias and misalignment. The extension of singular learning theory to reinforcement learning (Elliott et al., 2026) provides a theoretical framework for understanding when misaligned policies may be preferred. The key insight is that a more optimal policy is not always more optimal from a Bayesian perspective: the posterior can prefer a simpler policy with higher regret over a complex policy with lower regret. This may explain goal misgeneralization: the posterior may favor simpler policies capturing spurious correlations (“go to the corner” in an environment where the agent is rewarded for obtaining cheese which is often, but not always, in the corner) over complex policies representing the intended goal (“go to the cheese”).
Similar logic may apply to instrumental convergence: acquiring resources or control is a commonly represented pattern across diverse environments, so these behaviors may be simpler to represent than task-specific optimal policies, and hence preferred by the posterior even at higher regret. The same mechanism is relevant to reward hacking: one hypothesis is that reward models learn simple proxies (length, formatting, sycophancy) in part because the region of parameter space implementing “longer responses are better” has lower complexity than the region implementing nuanced quality assessment.

Patterning can potentially address these problems by raising the complexity of undesirable solutions. By measuring susceptibilities to competing policies, we can identify which training samples differentially affect their complexity, and re-weight accordingly. The parentheses balancing experiments of Section 4 demonstrate this principle in a very simple setting: we identified samples where the susceptibility gap between competing algorithms was large, and used them to steer which solution the posterior favors.

Constraining undesirable structure. If instrumental goals like power-seeking correspond to identifiable structures in the model (detectable via susceptibilities to relevant data patterns), then patterning provides a tool for constraining their formation. The key constraint is: the model should not respond to this pattern. Formally, this becomes a target $d\mu^{\infty}_{\mathrm{target}}$ in the language of posterior expectation values, and the fundamental equation yields the data intervention that enforces it.

The specification problem. These applications presuppose we can specify which structures are desirable. This is nontrivial, but perhaps more tractable than specifying good behavior via training data alone.
The language of posterior constraints provides a vocabulary: illustrative examples include “computations should not differ significantly between these distributions” (robustness) and “no component should respond to deception patterns in the data” (honesty). Specification can also be informed empirically by studying misalignment instances through susceptibilities and compiling structural signatures of undesirable generalization into constraints.

Why mode structure matters. Susceptibilities measure how a model’s internal components respond to patterns in the data distribution (Gordon et al., 2026). These responses develop over training: initially undifferentiated, they become increasingly specialized as the model learns to distinguish different modes (Wang et al., 2025). The key insight is that a single mode manifests across many individual examples in the training data. This means that in general it may be nontrivial to “remove” a mode from the data distribution by naive filtering. Without a grip on mode structure, attempts to eliminate undesirable patterns may fail or have unintended consequences on other modes.

Susceptibility analysis provides this grip. By decomposing the model’s responses in terms of modes, we can identify which data perturbations target specific patterns while leaving others unchanged. The fundamental equation $dh_{\mathrm{opt}} = \chi^{\dagger}\, d\mu^{\infty}_{\mathrm{target}}$ computes the minimum-norm intervention that achieves a desired structural change, potentially making it possible to surgically modify the model’s response to some modes without disrupting the rest.

We emphasize that these applications are prospective. Our experiments demonstrate that patterning works, but scaling to frontier models and establishing the full chain from susceptibilities to safety outcomes requires further investigation.
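As a small numerical illustration of the minimum-norm intervention given by the fundamental equation, the sketch below applies a Moore-Penrose pseudoinverse to a toy susceptibility matrix. It assumes the linear response relation is written as dμ = χ dh with χ of shape (observables, data directions), and that † denotes the Moore-Penrose pseudoinverse; the function name and the toy numbers are hypothetical and not taken from the experiments.

```python
import numpy as np


def min_norm_intervention(chi, d_mu_target):
    """Minimum-norm dh satisfying chi @ dh = d_mu_target (when solvable).

    chi:         array of shape (n_observables, n_data_directions),
                 a hypothetical layout for the susceptibility matrix.
    d_mu_target: array of shape (n_observables,), the desired shift in
                 posterior expectation values.
    """
    chi = np.asarray(chi, dtype=float)
    d_mu_target = np.asarray(d_mu_target, dtype=float)
    # The Moore-Penrose pseudoinverse selects the least-norm solution.
    return np.linalg.pinv(chi) @ d_mu_target


# Toy example: two observables, three data directions. The third direction
# does not affect either observable, so the minimum-norm solution leaves it
# untouched.
toy_chi = np.array([[1.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0]])
toy_target = np.array([1.0, 2.0])
toy_dh = min_norm_intervention(toy_chi, toy_target)
```

In this toy case the intervention reproduces the target shift exactly while putting zero weight on the direction that has no effect, which is the sense in which the minimum-norm solution avoids unnecessary changes to the data distribution.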