Paper deep dive
Emergence of Linear Truth Encodings in Language Models
Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti
Models: LLaMA3-8B, one-layer transformer (custom toy model)
Abstract
Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then, over a longer horizon, learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/12/2026, 6:10:17 PM
Summary
The paper investigates the emergence of linear truth-encoding subspaces in language models, proposing the 'Truth Co-occurrence Hypothesis' (TCH) which suggests that models learn to represent truth because true statements statistically cluster together in training data. Using a one-layer transformer toy model, the authors demonstrate that truth-encoding subspaces emerge through a two-phase learning dynamic: initial memorization of factual associations followed by the development of a linear separator that optimizes language modeling loss. The study provides both theoretical and empirical evidence that layer normalization is critical for this emergence.
Entities (4)
Relation Signals (3)
Layer normalization → enables → Linear truth encoding
confidence 95% · If the model in (4) contains $\mathcal{N}$ and $2\alpha_1\alpha_2 + 2\beta_1\beta_2 \neq 0$, then its output on the y token admits a linear separation for true and false samples.
One-layer transformer toy model → reproduces → Linear truth encoding
confidence 95% · We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end
Truth Co-occurrence Hypothesis → explains → Linear truth encoding
confidence 90% · TCH offers a very simple way to quantify the persona hypothesis and provably characterize its influence.
Cypher Suggestions (2)
Find all mechanisms that contribute to the emergence of truth encoding · confidence 90% · unvalidated
MATCH (m:Mechanism)-[r:ENABLES|PRODUCES]->(p:Phenomenon {name: 'Linear truth encoding'}) RETURN m.name, type(r), p.name
Map the relationship between hypotheses and phenomena · confidence 85% · unvalidated
MATCH (h:Hypothesis)-[:EXPLAINS]->(p:Phenomenon) RETURN h.name, p.name
Full Text
106,324 characters extracted from source content.
Emergence of Linear Truth Encodings in Language Models
Shauli Ravfogel (1), Gilad Yehudai (1), Tal Linzen (1), Joan Bruna (1), Alberto Bietti (2)
(1) New York University, (2) Flatiron Institute
Abstract
Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then, over a longer horizon, learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
1 Introduction
Recent observations suggest that large language models (LMs) often encode a low-rank linear subspace that distinguishes true from false statements across a wide range of domains (Azaria and Mitchell, 2023; Burns et al., 2022; Li et al., 2024b; Marks and Tegmark, 2024; Bürger et al., 2025; Orgad et al., 2025). Specifically, in many layers of the residual stream representation in transformer-based LMs, a linear separation emerges between representations corresponding to true versus false assertions.
Moreover, this separation generalizes across domains: there exists a single separating subspace such that statements like “$2+2=4$” (true) and “The capital city of France is Rome” (false) fall on opposite sides of the same separating plane. These findings have sparked interest among practitioners, because they may aid in mitigating hallucinations (Li et al., 2024b; Orgad et al., 2025).
We investigate the emergence of a unified “truth subspace”: a low-dimensional linear manifold that cleanly separates true from false statements. Prior work shows (i) that truth-encoding directions generalize remarkably well across diverse tasks and prompts, and (ii) that causal interventions along those directions can steer LMs toward factual or counter-factual completions (e.g. Meng et al., 2022). Yet we still lack a satisfying answer to two fundamental questions: why do such subspaces arise during training, and how are they actually computed at inference time? We address both questions in a single theoretical and empirical framework.
For the how, we build on the growing understanding of key–value associative memories in transformers. Geva et al. (2021) showed that the first linear layer produces key matches (e.g. aligning the prefix “The capital city of France is” with an internal query) while the second linear layer retrieves the associated value, such as the hidden representation of “Paris”. Subsequent studies refined the mathematical description of this mechanism and demonstrated its causal role in factual recall and reasoning (Geva et al., 2022b; Bietti et al., 2023; Cabannes et al., 2024b; Nichani et al., 2025). We hypothesize that a linear truth code takes advantage of the memorized factual associations: it emerges as a result of the model contrasting the internal prediction it built with the observed attribute. This produces a different pattern depending on whether the two match or mismatch, which is translated into a linearly separable signal.
For the why, we propose the Truth Co-occurrence Hypothesis (TCH): in naturally occurring text, true statements are statistically more likely to co-occur with other true statements, and falsehoods with other falsehoods. This assumption is closely related to recent “persona” explanations of factual inconsistency in LMs (Li et al., 2023a; Joshi et al., 2024): the claim that LMs learn to model certain personas in the data distribution, some truthful and some not. TCH offers a very simple way to quantify the persona hypothesis and provably characterize its influence. Under the TCH, inferring a latent truth variable is loss-reducing: if the model recognizes that “It’s well known that the moon landing was a hoax” is false, it can raise the probability of a continuation such as “and that the Earth is flat,” which is likewise false.
We test the TCH in a minimal transformer with a single self-attention layer, one head, and a normalization layer. Training examples are four-token sequences $x\;y\;x'\;y'$ with subjects $x, x'$ (“The capital city of France”; “Churchill’s nationality”) and attributes $y, y'$ (“Paris”; “British”); with probability $\rho$, the attributes $y, y'$ are both the correct attributes; with probability $1-\rho$, they are replaced with random ones. Under our simplified generative story, “truth” is identified with the attribute that is frequent in the training data for a particular subject. When we train transformer LMs on such a dataset, we find that after the key–value lookup circuit forms, gradient descent pushes hidden states toward a linear separator that clusters true vs. false contexts, and the model uses it to modify its confidence when predicting the attribute. Training shows two phases: rapid key–value acquisition followed by slower emergence of linear encoding.
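The four-token training examples described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the helper names `make_ground_truth` and `sample_example`, the string token names, and the choice of 8 subjects are all hypothetical, and the "random" attributes are assumed to be drawn uniformly.

```python
import random

def make_ground_truth(num_subjects):
    """Hypothetical knowledge map g: subject 's{i}' has canonical attribute 'a{i}'."""
    return {f"s{i}": f"a{i}" for i in range(num_subjects)}

def sample_example(g, rho, rng):
    """Draw one four-token example (x, y, x', y') together with its latent truth label."""
    subjects = list(g)
    attributes = list(g.values())
    x, x_next = rng.choice(subjects), rng.choice(subjects)
    is_true = rng.random() < rho          # a single truth draw shared by both attributes
    if is_true:
        y, y_next = g[x], g[x_next]       # both attributes are the correct ones
    else:
        # Assumption: "random" means uniform over all attributes.
        y, y_next = rng.choice(attributes), rng.choice(attributes)
    return (x, y, x_next, y_next), is_true

rng = random.Random(0)
g = make_ground_truth(8)
batch = [sample_example(g, rho=0.8, rng=rng) for _ in range(5)]
for tokens, is_true in batch:
    print(tokens, "true" if is_true else "false")
```

Because the truth label is drawn once per example, the first attribute carries information about the second, which is the correlation the model can exploit. Under the uniform-draw assumption, a "false" example can still coincide with the correct attribute by chance.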
Although our toy model is far simpler than natural training data (see Section 6), it predicts the observed sensitivity to false context (Section 5.3), where false prefixes bias later predictions (supporting TCH), and reproduces the way normalization layers regulate confidence (Stolfo et al., 2024). Taken together, we show that linear truth encoding can arise without any built-in semantics.
2 Related work
A growing body of work shows that pretrained LMs linearly encode a simple notion of “truth” (consistency with the majority of examples in the training data) in both hidden states and individual MLP/attention outputs (Azaria and Mitchell, 2023; Burns et al., 2022; Li et al., 2024b; Bürger et al., 2025). This feature is generally robust for frequent atomic facts, though its subspace can shift in the presence of negation (Marks and Tegmark, 2024) and may be biased toward dataset-specific features (Orgad et al., 2025). The encoded truth dimension is behaviorally relevant: intervening on it nudges the model toward truthful completions (Li et al., 2024b), although the model’s predictions sometimes do not agree with the latent encoding (Liu et al., 2023). Yet the mechanism behind this encoding remains unclear. Extending the persona hypothesis of Li et al. (2023a), Joshi et al. (2024) and Ghandeharioun et al. (2024) link truthful behavior to lexical “personas”, for instance the formal, encyclopedic style typical of Wikipedia versus the more casual tone common in social-media posts. We show that, given sufficient training, LMs also acquire a lexicon-independent abstract truth dimension that emerges more slowly.
The line of work on truth encoding is closely related to findings suggesting that models encode different aspects related to their knowledge and confidence. It was shown that it is possible to decode “latent” knowledge from the model Gekhman et al. 
(2025), and that measures of uncertainty can be decoded from hidden states (Slobodkin et al., 2023; Farquhar et al., 2024; Ferrando et al., 2025). Our work is related to, but distinct from, works on mechanistic understanding of hallucinations (Yu et al., 2024); while both rely on the associative memory used by the model (Geva et al., 2021, 2022a, 2022b; Bietti et al., 2023; Cabannes et al., 2024a), we focus on the emergence of separation between true and false assertions, and develop a toy model that allows us to analyze its properties.
3 The Truth Co-occurrence Hypothesis
We previously described the TCH, the assertion that false statements tend to co-occur. To quantify it, we use the MAVEN-FACT corpus (Li et al., 2024a), where annotators assign a FactBank-style factuality label to every event mention inside a news article. After discarding all but certain judgments, each mention is labeled certain-true or certain-false and grouped by the document in which it appears (data-handling details are deferred to App. A). We find the following: (i) the overall certain-false rate is $p = 0.0209$; (ii) the chance that two event mentions from the same article are both certain-false is $0.0009$, exceeding the independence baseline $p^2 = 0.00044$ by a factor of $\approx 2$; and (iii) the clustering ratio $\mathrm{Var}_{\mathrm{obs}}(\hat{p}_i)/\mathrm{Var}_{\mathrm{binom}} = 1.23$ shows 23% extra article-to-article heterogeneity. A $\chi^2$ test of independence confirms the association ($\chi^2 = 4.17 \times 10^3$, $p \approx 9 \times 10^{-49}$). This shows that false assertions are not sprinkled at random but tend to cluster within the same article. For a language model, tracking a latent truth bit is therefore loss-reducing: once a page provides evidence that one statement is refuted, the conditional probability that a subsequent claim is also refuted increases.
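The co-occurrence arithmetic in item (ii) can be checked directly from the two reported rates. The snippet below only recomputes the independence baseline and the ratio; the variable names are illustrative and no corpus data is loaded.

```python
p_false = 0.0209            # reported overall certain-false rate
p_both_observed = 0.0009    # reported chance that two same-article mentions are both certain-false

p_both_independent = p_false ** 2                 # prediction if mentions were independent
ratio = p_both_observed / p_both_independent      # how much more often falsehoods co-occur

print(f"independence baseline: {p_both_independent:.5f}")
print(f"observed / baseline:   {ratio:.2f}")
```

The baseline rounds to the 0.00044 quoted in the text, and the ratio comes out slightly above 2, consistent with the "factor of about 2" statement.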
The clustering of false statements motivates the design of a simple data-generating process that instantiates the hypothesis and tests whether it gives rise to truth encoding.
3.1 Data Generating Process
Natural text confounds truth with stylistic cues, topic priors, and corpus frequency (Orgad et al., 2025). Consequently, if we probe LMs on raw text, we risk discovering features that merely track these proxies. To uncover minimal conditions that force an LM to represent truth, we build a toy world in which:
1. Every subject pair has exactly one canonical attribute (ground truth).
2. A small, controllable fraction of examples are corrupted by uniform noise (the attribute is replaced with another attribute).
3. Importantly, the truthfulness of neighboring sequences correlates; this models the tendency of speakers to consistently be less or more truthful (Joshi et al., 2024).
Despite its simplicity, this environment reproduces the linear separability we see in large-scale LMs (§5).
Data format. Each training example is a sequence $x\;y\;x'\;y'$ with subjects $x, x' \in \mathcal{S}$ and attributes $y, y' \in \mathcal{A}$. For every $x$ there exists a unique ground-truth attribute $g(x)$ memorized by the data generator. Examples are corrupted as follows: sample $T \sim \mathrm{Bernoulli}(\rho)$ once per example, such that
True: if $T=1$, set $y = g(x)$, $y' = g(x')$.
False: if $T=0$, draw each of $y, y'$ independently and uniformly from $\mathcal{A}$.
Truth as a latent variable. Because predicting the second attribute token $y'$ is easier when $T$ is known, an LM can lower its language-model loss by internally inferring $T$ early in the sequence and propagating that bit forward. Without inferring $T$, the conditional distribution over the second attribute $y'$ given the prefix $(x, y, x')$ is:
\[ \Pr\big(y' = g(x') \mid x, y, x'\big) = \rho + \frac{1-\rho}{|\mathcal{A}|}, \qquad \Pr\big(y' = a \neq g(x') \mid x, y, x'\big) = \frac{1-\rho}{|\mathcal{A}|}. \]
Assume the LM can memorize $g$ and (optionally) infer $T$ perfectly. Let $\mathcal{L}_{\neg T}$ be its per-token cross-entropy for predicting $y'$ when it does not access $T$, and let $\mathcal{L}_{T}$ be the loss when it embeds $T$ internally. Then, in the $|\mathcal{A}| \to \infty$ limit, $\mathcal{L}_{\neg T} - \mathcal{L}_{T} = H_2(\rho)$, the binary entropy of $\rho$. Hence representing a single bit yields maximal benefit at $\rho = 0.5$, where $H_2$ is largest (see Appendix B for a complete derivation).
4 Analysis on a Toy Model
In this section, we study the emergence of truth directions in a simplified one-layer setup with orthogonal embeddings. Empirically, we find that this minimal setup already captures the mechanism of a truth direction, and leverages layer-norm to adjust confidence for the second attribute depending on the truthfulness of the first one. Our empirical and theoretical analysis shows that this happens in phases, and that layer-norm is crucial to provide the relevant structure in the gradients. Furthermore, such a truth direction can already emerge when there are only true sequences. In Appendix D.1, we discuss how these results may be extended to non-orthogonal and to learned embeddings.
Setup. Consider the following one-hot token embedding, positional embedding, and unembedding vectors in $\mathbb{R}^d$ with embedding dimension $d = 4N+3$, where $z \in [2N]$ is an input or output token (input tokens $x$ are in $[N]$ while outputs $y$ are in $[N+1, 2N]$), and $t \in [3]$ a position:
\[ [e_z]_i = \mathbb{1}\{i = z\} \tag{1} \]
\[ [p_t]_i = \mathbb{1}\{i = 2N + t\} \tag{2} \]
\[ [u_z]_i = \mathbb{1}\{i = 2N + 3 + z\}. \tag{3} \]
We consider a one-layer transformer with uniform causal attention, and a basic layer-norm operation. Concretely, for an input sequence $z_{1:3} = (x, y, x')$ and position $t \in [3]$, define:
\[ F_W(z_{1:t})_t = U \cdot \mathcal{N}\Big(e_{z_t} + p_t + \frac{1}{t}\sum_{s=1}^{t} W(e_{z_s} + p_s)\Big), \tag{4} \]
where $W$ denotes the value matrix, $U = u_{1:2N}^\top = [0; I_{2N}] \in \mathbb{R}^{2N \times d}$ is a projection on the unembedding dimensions, and $\mathcal{N}(v) = v/\|v\|$ is a layer-norm operation.
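The forward map in Eq. (4) is small enough to write out by hand. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses plain Python lists, a small illustrative `N = 4`, a hypothetical knowledge map `g(x) = x + N`, and a hand-built value matrix containing only a memorization block that sends $e_x$ to $u_{g(x)}$ (not trained weights).

```python
import math

N = 4                        # illustrative size: N subjects and N attributes
D = 4 * N + 3                # embedding dimension d = 4N + 3

def one_hot(index):
    vec = [0.0] * D
    vec[index] = 1.0
    return vec

# Tokens z are 1..2N and positions t are 1..3 as in the text; list indices are 0-based.
def e(z): return one_hot(z - 1)                  # token embedding, Eq. (1)
def p(t): return one_hot(2 * N + t - 1)          # positional embedding, Eq. (2)
def u(z): return one_hot(2 * N + 3 + z - 1)      # unembedding vector, Eq. (3)

def add(a, b): return [ai + bi for ai, bi in zip(a, b)]
def scale(c, a): return [c * ai for ai in a]
def matvec(W, v): return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def layer_norm(v):
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]

def forward(W, tokens):
    """Logits F_W(z_{1:t})_t over the 2N output tokens, following Eq. (4)."""
    t = len(tokens)
    inputs = [add(e(z), p(s)) for s, z in enumerate(tokens, start=1)]
    attended = [0.0] * D
    for h in inputs:                              # uniform causal attention: a plain average
        attended = add(attended, scale(1.0 / t, matvec(W, h)))
    state = layer_norm(add(inputs[-1], attended))
    return state[2 * N + 3:]                      # U = [0; I_2N] keeps the unembedding block

def g(x): return x + N                            # hypothetical ground-truth map

# Hypothetical value matrix with only a memorization block: W e_x = u_{g(x)}.
W = [[0.0] * D for _ in range(D)]
for x in range(1, N + 1):
    W[2 * N + 3 + g(x) - 1][x - 1] = 1.0

logits = forward(W, [2])                          # one-token prefix with subject x = 2
predicted = 1 + max(range(2 * N), key=lambda k: logits[k])
print("predicted token:", predicted, "expected g(2):", g(2))
```

With this memorization-only matrix the largest logit sits on $g(x)$, which is the lookup behaviour the toy model is meant to capture; the remaining blocks of $W$ are what add the truth signal.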
The predicted probabilities are then given by $\hat{p}(z_{t+1} = \cdot \mid z_{1:t}) = S_\beta(F(z_{1:t}))$, where $S_\beta$ denotes the softmax operation with inverse temperature $\beta$. Our experiments use $\beta = \sqrt{d}$, due to the use of RMS norm in layer-norm over embeddings of dimension $d$. We assume here that $x, x' \sim \mathrm{Unif}([N])$ i.i.d., and conditioned on these as well as on a truth random variable $T \sim \mathrm{Ber}(\rho)$, we have $y = g(x)$ and $y' = g(x')$ when $T = 1$, and $y, y' \sim \mathrm{Unif}([N+1, 2N])$ otherwise. Denoting $z_{1:4} = (x, y, x', y')$, the population loss then takes the form
\[ L(W) = \sum_{t=1}^{3} L_t(W) = \sum_{t=1}^{3} \mathbb{E}_{z_{1:t+1}}\big[-\log S_\beta(F_W(z_{1:t}))_{z_{t+1}}\big]. \tag{5} \]
Figure 1: Visualization of the value matrix for the one-layer model at different training steps. We see that the $e_x \to u_{g(x)}$ block is learned first, along with the $p_t \to \bar{u}$ block. Later the $e_x \to -e_x$ and $e_y \to -u_y$ blocks are learned, and finally the $e_y \to e_{g^{-1}(y)}$ block.
Probing the mechanism and its emergence. Figure 1 shows a visualization of the value matrix $W$ in our toy model, at different steps of training, with $N = 20$, $\rho = 0.8$ and batch size 16. We see that a clear block structure emerges in the matrix $W$, with different blocks arising in different phases. Some blocks show a negative identity structure, while others show a permutation structure according to the “knowledge” mapping $g$. Positional embeddings show more uniform patterns across unembeddings, with different signs depending on whether the next token is an input or label. In Figure 2, we show the representations at the $x'$ token for examples of true and false sequences, before and after layer-norm, as well as the probabilities obtained after projecting to the unembedding space and applying softmax. In the false sequence (bottom plot), we notice large spikes in the input embedding dimensions (1-20) at positions $x = 5$ and $g^{-1}(y) = 16$.
These do not exist in true sequences, since they cancel out. We see a similar behavior on unembedding dimensions (65-84) at smaller scales. The cancellation leads to a smaller norm on true sequences, which causes an amplification of the logits, and finally a spiked distribution on true sequences, versus a flatter one on false sequences, though we still see some lower-confidence spikes on $g(x)$ and $g(x')$ (note the y-axis scale difference).
Structure of the value matrix $W$. We now study a construction that resembles the one observed empirically in Figure 1. Later we will provide a theoretical justification for this structure and its emergence in phases by analyzing training dynamics. The leftmost column of the $W$ matrix maps $e_x$ to its corresponding label $u_{g(x)}$, while also subtracting $e_x$ itself:
\[ W e_x = -\alpha_1 e_x + \beta_1 u_{g(x)}, \tag{6} \]
with $\alpha_1, \beta_1 > 0$. The second column has the following symmetric behavior:
\[ W e_y = \alpha_2 e_{g^{-1}(y)} - \beta_2 u_y. \tag{7} \]
Finally, the third column maps the different positional embeddings to mixtures of uniform distributions over the inputs or labels:
\[ W p_1 = \gamma_1 \Big(\sum_y u_y - \sum_x u_x\Big) \tag{8} \]
\[ W p_2 = -\gamma_2 \Big(\sum_y u_y - \sum_x u_x\Big) \tag{9} \]
\[ W p_3 = \gamma_3 \Big(\sum_y u_y - \sum_x u_x\Big). \tag{10} \]
In the statements above, we assume all the coefficients $\alpha_1, \alpha_2, \beta_1, \beta_2, \gamma_1, \gamma_2, \gamma_3$ to be positive.
Figure 2: Visualization of representations on true (top) and false (bottom) sequences. The plots show representations before (left) and after (center) layer-norm, as well as predicted probabilities (right).
Linear separation and sharpening mechanism. One important consequence of the structure above is that any token that attends to both $x$ and $y$ (this could be either $y$ or $x'$) has the following quantity in its residual stream:
\[ \zeta(x, y) := W(e_x + e_y) = -\alpha_1 e_x + \alpha_2 e_{g^{-1}(y)} + \beta_1 u_{g(x)} - \beta_2 u_y. \tag{11} \]
We then have
\[ \|\zeta(x, g(x))\|^2 = \|\zeta(x, y)\|^2 - 2\alpha_1\alpha_2 - 2\beta_1\beta_2, \quad \text{for } y \neq g(x). \]
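The norm identity can be verified numerically by writing $\zeta(x, y)$ out coordinate by coordinate. This is a sketch under assumptions: the knowledge map `g(x) = x + N`, the size `N = 5`, and the four coefficient values are arbitrary illustrative choices, not values from the paper.

```python
N = 5
D = 4 * N + 3

def g(x): return x + N            # hypothetical ground-truth map
def g_inv(y): return y - N

def zeta(x, y, a1, a2, b1, b2):
    """zeta(x, y) = W(e_x + e_y) under Eqs. (6)-(7), one coordinate at a time."""
    v = [0.0] * D
    v[x - 1] += -a1                        # -alpha_1 e_x
    v[g_inv(y) - 1] += a2                  # +alpha_2 e_{g^{-1}(y)}
    v[2 * N + 3 + g(x) - 1] += b1          # +beta_1 u_{g(x)}
    v[2 * N + 3 + y - 1] += -b2            # -beta_2 u_y
    return v

def sq_norm(v): return sum(c * c for c in v)

a1, a2, b1, b2 = 0.7, 0.5, 1.2, 0.9        # arbitrary positive coefficients
x = 2
true_sq = sq_norm(zeta(x, g(x), a1, a2, b1, b2))     # y = g(x): terms partially cancel
false_sq = sq_norm(zeta(x, g(3), a1, a2, b1, b2))    # y != g(x): four orthogonal terms
gap = false_sq - true_sq

print("squared norm, true sequence: ", round(true_sq, 6))
print("squared norm, false sequence:", round(false_sq, 6))
print("gap vs 2*a1*a2 + 2*b1*b2:    ", round(gap, 6), round(2 * a1 * a2 + 2 * b1 * b2, 6))
```

On a true sequence the $e_x$ and $u_{g(x)}$ coordinates each receive two opposite-signed contributions, so the cross terms $2\alpha_1\alpha_2$ and $2\beta_1\beta_2$ are exactly what the squared norm loses relative to the false case.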
Since $\alpha_1, \alpha_2, \beta_1, \beta_2 > 0$, the norm of $\zeta$ on a true sequence is always smaller than on a false sequence, leading to a useful feature for detecting truth (see illustration in Figure 16). Combined with the layer-norm operation, this provides a mechanism for sharpening the prediction of $y'$ towards $g(x')$ when the model detects a true sentence, by adjusting the temperature in the softmax via inverse norm scaling.
Theorem 1 (Sharpening of $y'$ predictions). Suppose we have a solution that satisfies Eqs. (6)-(10). Denote by
\[ c := 2 + \bar{\gamma}^2 (2N - 2) + \frac{2\alpha_1^2 + \beta_1^2}{9}. \]
For any $x, x'$ and $y \neq g(x)$, we have:
\[ F(x, g(x), x')_{g(x')} - \max_{k \neq g(x')} F(x, g(x), x')_k \geq \frac{\beta_1 - \max(0, \beta_1 - \beta_2)}{3\sqrt{c + (\beta_1 - \beta_2 + \bar{\gamma})^2 + (\beta_1 + \bar{\gamma})^2}}, \]
\[ F(x, y, x')_{g(x')} - \max_{k \neq g(x')} F(x, y, x')_k = 0. \]
The proof is in Section E.2. This shows that the structure of $W$ along with layer-norm provides a simple mechanism to make the model more confident about its knowledge when the context is truthful. For false sequences, the zero gap comes from the fact that logits for $g(x)$ and $g(x')$ are tied, as we show empirically in Figure 2. This aligns with previous interpretability work on confidence neurons (Stolfo et al., 2024). Beyond improving prediction performance, we now show that this model provides a linear encoding of truth in the representations after layer-norm.
Theorem 2 (Linear truth direction). Suppose we train the model in (4) as explained above, and reach a solution for $W$ that satisfies Eqs. (6)-(10). Then, we have the following:
1. If the model in (4) does not contain $\mathcal{N}$, then its output on the $y$ token does not admit a linear separator for true and false samples.
2. If the model in (4) contains $\mathcal{N}$ and $2\alpha_1\alpha_2 + 2\beta_1\beta_2 \neq 0$, then its output on the $y$ token admits a linear separation for true and false samples.
Moreover, if $\gamma_1 = \gamma_2$, $\alpha_1 = \alpha_2$, $\beta_1 = \beta_2$, then the margin is at least $\delta = \frac{1}{2\sqrt{2}}\Big(1 - \frac{1}{\sqrt{1 + \alpha^2 + \beta^2}}\Big)$. The proof is in Section E.3.
Theoretical analysis of training dynamics. We now study how such a structure in $W$ emerges from training dynamics in a simplified setting.
Theorem 3 (Sequential gradient learning; informal). In a simplified model with no positional embeddings, taking two gradient steps on $L_1$ followed by one on $L_3$, all with step-size $\Theta(N)$, leads to the desired structure for $W$ as in Eqs. (6)-(7), up to negligible entry-wise $O(1/N)$ terms.
See a formal statement and a proof in Section E.1. This result shows that gradient dynamics in our model can quickly lead to the block structure observed in Figure 1, despite the non-convexity induced by normalization. In fact, the analysis reveals that the layer-norm operation is crucial here to obtain many of the desired blocks other than the $e_x \to u_{g(x)}$ block. Interestingly, our theory shows that this structure arises even when $\rho = 1$, and empirically we found that both sharpening and linear separation indeed happen in this setting, demonstrating an emergent out-of-distribution generalization to false sequences. We note, however, that this may not happen in a more expressive model: we empirically found that if we also train the key-query matrix with $\rho = 1$, the model quickly learns to focus its attention on the current token, which makes information from the context inaccessible from the residual stream. While this may improve predictions of $g(x')$ on true sequences by removing noise in the residual stream coming from $(x, y)$, it also results in a failure to handle false sequences.
5 Experiments
5.1 Synthetic Setting
Setup. We train transformer-based LMs on the synthetic dataset described above. The model contains $l$ self-attention layers with a single attention head, followed by layer normalization, with no feedforward network. See Appendix C for more details.
In this setting, in contrast to the toy model described above, we train all parameters, including the dense embeddings and the attention module. (We release the code at https://github.com/shauli-ravfogel/truth-encoding-neurips.) Each training example is a concatenation of a subject ($x$), an attribute ($y$), an additional, uniformly-sampled subject ($x'$), and an additional attribute ($y'$). The attributes $y, y'$ are either sampled uniformly or taken to be the correct attributes $g(x), g(x')$, according to the true-attribute probability $\rho$. In line with the Truth Co-occurrence Hypothesis, we aim to measure whether in this training setting the model is able to recover the latent truthfulness of the first sequence (verifying whether $y = g(x)$) and use it to decrease LM loss on the second attribute $y'$. We experiment with several true-attribute rates $\rho$, and with $l \in \{1, 2, 3\}$ layers, and assume a perfect correlation between the truthfulness of the first and second attributes (that is, $y = g(x)$ if, and only if, $y' = g(x')$). Throughout training, we fit logistic-regression classifiers on all hidden states to predict whether or not the sequence is false (a binary classification problem). We fit individual classifiers both on the first attribute position ($y$) and on the second subject position ($x'$), from which the second attribute $y'$ is predicted. While in training the LM we use a varying true-attribute rate $\rho$, the linear classifiers are always trained and evaluated on a balanced set, containing 50% true sequences. We report mean results over 5 runs with different random seeds. Unless specified otherwise, we present here results for $l = 1$ and $\rho = 0.99$, $|\mathcal{A}| = |\mathcal{S}| = 512$ and $d_{\mathrm{model}} = 256$; results for other settings are deferred to Appendix D.
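The idea of linearly probing for a truth label with equal weight on both classes can be illustrated without training an LM. The sketch below makes several assumptions that differ from the experimental setup: the "hidden states" are hand-built stand-ins following the toy construction of Section 4 (with $\alpha_1 = \alpha_2 = \beta_1 = \beta_2 = 1$ and positional value terms omitted), the knowledge map is a hypothetical `g(x) = x + N`, and a difference-of-means probe is used in place of logistic regression so that the result is deterministic.

```python
import math

N = 6
D = 4 * N + 3
ALPHA = BETA = 1.0                  # assumed coefficients (alpha_1 = alpha_2, beta_1 = beta_2)

def g(x): return x + N              # hypothetical ground-truth map

def hidden_state(x, y):
    """Stand-in post-layer-norm state at the y token (positional value terms omitted)."""
    v = [0.0] * D
    v[y - 1] += 1.0                            # e_y, the current token
    v[2 * N + 1] += 1.0                        # p_2, the current position
    v[x - 1] += -0.5 * ALPHA                   # averaged value output, factor 1/t = 1/2
    v[(y - N) - 1] += 0.5 * ALPHA
    v[2 * N + 3 + g(x) - 1] += 0.5 * BETA
    v[2 * N + 3 + y - 1] += -0.5 * BETA
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def mean(vectors): return [sum(col) / len(vectors) for col in zip(*vectors)]
def dot(a, b): return sum(ai * bi for ai, bi in zip(a, b))

subjects = range(1, N + 1)
attributes = range(N + 1, 2 * N + 1)
true_states = [hidden_state(x, g(x)) for x in subjects]
false_states = [hidden_state(x, y) for x in subjects for y in attributes if y != g(x)]

# Difference-of-means probe: direction between class means, threshold at their midpoint.
mu_true, mu_false = mean(true_states), mean(false_states)
w = [a - b for a, b in zip(mu_true, mu_false)]
threshold = 0.5 * (dot(w, mu_true) + dot(w, mu_false))

# Averaging per-class accuracy weights both classes equally, mimicking a balanced set.
true_acc = sum(dot(w, h) > threshold for h in true_states) / len(true_states)
false_acc = sum(dot(w, h) <= threshold for h in false_states) / len(false_states)
balanced_accuracy = 0.5 * (true_acc + false_acc)
print("balanced probe accuracy:", balanced_accuracy)
```

In this stand-in the separation comes entirely from normalization: true states have a smaller pre-normalization norm, so their $e_y$ and $p_2$ coordinates end up larger after dividing by the norm, and a single linear direction picks that up.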
Figure 3: Truth linear classification results alongside probability assigned by the LM to the true attribute on false sequences. Panels: (a) truth classification results, second subject $x'$; (b) $P[g(x')]$ on false sequences, for which $y' \neq g(x')$.
5.2 Results
Two-phase dynamics. In Figure 3(a) we show the linear truthfulness classification AUC as a function of training steps, on the second subject, for a 1-layer model with true-attribute probability $\rho = 0.99$. Additionally, we plot the probability the LM assigned to the correct attribute on false sequences ($P(y' = g(x') \mid y \neq g(x))$; Figure 5(b)). When this probability is minimized, the model improves its loss on false sequences. In line with the toy model, we detect distinct phases in training.
1. Memorization. As can be seen in Figure 5(b), memorization happens rapidly, within the first 1000 batches, as the model converges to assigning a probability of around 1 to $g(x')$ on both true and false examples. Indeed, the model predicts the correct attributes on over 99% of the true sequences.
2. Truth encoding. The model does learn to linearly encode the truth latent variable. This encoding emerges abruptly, after around 7,500 batches, during which the model saw around 1 million examples, relatively long after the model achieves perfect memorization. The model learns to decrease the probability it assigns to the correct attribute on the second attribute position, $P(g(x') = y')$, at roughly the same time linear classification emerges.
Figure 4: LM representations over false sequences. Panels: (a) PCA of layer-1 representations on the token $x'$; (b) attention pattern (1-layer model); (c) PCA of input embeddings X versus Y.
Truth circuit. We aim to understand how the linear truth subspace is being computed.
While it has been empirically shown that LMs linearly encode many human-interpretable concepts (Bolukbasi et al., 2016; Vargas and Cotterell, 2020; Ravfogel et al., 2022), it is not well understood why linear representations emerge in hidden layers (Park et al., 2024; Jiang et al., 2024). The toy model we propose allows us to empirically study the origins of the linear signal, and the way it is being used to decrease LM loss on the second attribute. The truth encoding appears in a 1-layer model (classification accuracy in the input embeddings layer is at majority level). As can be seen in the first-layer attention pattern (Figure 4(b)), this attention head calculates an approximate mean of the embeddings of $x$ and $y$, after application of the $V, O$ self-attention matrices, in line with the uniform attention assumed in the toy model. One key difference is that here, we learn the input embeddings. Interestingly, inspecting the PCA of the input subject and attribute tokens (Figure 4(c)) reveals that, approximately, $e_x = -e_{g(x)}$ on the first principal component. This explains why both the true and false representations tend to cluster around the origin. Following the attention averaging, we apply RMSNorm. We find that linear classification emerges only after normalization; classification accuracy is at majority level before it. Indeed, a PCA plot (Figure 4) shows that, as predicted by the toy model, the True class is centered around the origin, with a larger variance for the True class than the False class. Normalization induces linear separability, which is also evident in the first two PCA components.
Additional settings. So far we have analyzed a single-layer transformer, either with one-hot embeddings, or with trainable dense embeddings and $\rho = 0.99$. Results for other configurations appear in Appendix D; here we outline the main trends. The patterns in Figures 3(a) and 5(b) persist across layer counts $l$, noise levels $\rho$, and corpus sizes $|\mathcal{A}|, |\mathcal{S}|$.
Higher $\rho$ delays (but does not prevent) the onset of linear separability, which still emerges at $\rho = 0.999$ (Figure 7(a) in the appendix); only the degenerate case $\rho = 1.0$ shows no emergence, contrary to the toy model. We discover similar structures to Figure 1 also when training with frozen dense embeddings and when learning the $KV$ matrices instead of using fixed attention. A preliminary analysis of this setting is provided in Section D.1 and Section E.1.1, and we leave a more complete understanding for future work. With additional layers the model sometimes encodes truth in the first attribute $y$, then copies it to $x'$ before predicting $y'$; in other runs it reverts to the single-layer strategy where $x'$ attends directly to $x$ and $y'$. This influences whether we see linear encoding on both $y$ and $x'$, or on $x'$ alone (Figure 7(b) in the appendix).
5.3 Testing the TCH in a Real LM
The theory we specified relies on a set of assumptions and an architecture that do not match pretrained transformers (those have, for instance, MLP layers in addition to the attention layer; have multiple attention heads; and are trained primarily on natural language distributions). Below, we (i) train “regular” transformer models on natural language data that instantiates the truth co-occurrence hypothesis; (ii) assess to what extent aspects of the mechanism we propose exist in pretrained LLMs.
5.3.1 Instantiating the TCH in Natural Language
Figure 5: Truth linear classification results alongside probability assigned by the LM to the true attribute on false sequences. Panels: (a) truth classification results; (b) $P[\text{correct attribute}]$ on false sequences, for which the observed attribute is not the correct one.
In Section 5.1, we created a synthetic dataset that respects the TCH and showed that training an attention-only transformer on this data results in linear truth encoding.
Here, we aim to assess whether the same thing happens when training “real” transformers on natural-language data.

Setup. We evaluate on the CounterFact dataset (Meng et al., 2022), a collection of simple factual assertions spanning relations such as SpeaksLanguage and BornIn. We select the 25 most frequent relations and, for each positive instance $(x, r, a)$, construct a negative by replacing the attribute $a$ with a different attribute from the same relation. To instantiate the TCH, we form paired examples by concatenating two randomly sampled instances that share the same truth label (both true or both false). We then train a small transformer with RMS normalization, 2 attention heads and a single MLP module per layer, hidden size $d = 256$, and depth $l \in \{2, 5, 9\}$ on this corpus. We use $\rho = 0.99$. We train on data from a single relation at a time, and report means and standard deviations over 5 random relations. (Footnote 3: We leave the question of generalization between relations to future work.)

Results. Across all seeds and architectural choices, the training dynamics mirror those on synthetic data: rapid memorization, followed by the emergence of a linear encoding and an increase in entropy on false sequences. In fig. 5, we show results for a single relation (WorksIn; averaged over five random seeds). By the end of training, the final hidden layer is nearly perfectly separable by the truth label, and on false sequences the probability assigned to the memorized (“true”) attribute declines. Notably, the 1-layer model exhibits epoch-wise double descent: classification accuracy rises early, dips, and then rises again. Across the five seeds, relations, and model sizes, memorization proceeds at roughly the same rate; the main variance lies in how quickly the probability declines on false sequences.

5.3.2 The TCH in Pretrained LLMs

The mechanism proposed above assumes a very specific data-generating process and a simplified transformer model.
As such, it is not likely that the same mechanism applies to real LMs; we see the toy model as a proof of concept, and aim to study more complicated models in future work. Yet, in this section, we compare the predictions following from our hypothesis with pretrained LMs in two aspects: (1) the sensitivity of the model’s predictions to preceding false sentences, in line with the truth co-occurrence hypothesis; (2) the behavioral relevance of the linear truth encoding in a situation where a sentence follows misleading false sentences. We experiment with a Llama3-8B model (Grattafiori et al., 2024) and the CounterFact dataset (SpeaksLanguage relation). We let the model predict the first token of the last word of a sentence when it is (i) preceded by $n$ false sentences; or (ii) preceded by $n$ true sentences. In line with the hypothesis, we expect the probability of the correct answer to decrease when the preceding sentences are false.

Model’s predictions are sensitive to preceding false sentences. The results, over 128 $n$-tuples, are presented in Figure 6(a) (light bars) and are in line with our hypothesis; for instance, in the two leftmost box plots, we see that preceding the sentence with two false sentences (F) yields a higher negative log-likelihood (smaller probability) for the correct attribute than preceding it with one true sentence (T). The difference in negative log-likelihood is 1.52, corresponding to a 4.55× decrease in the probability of the correct attribute.

Intervention in the truth subspace. Llama3-8B encodes truthfulness linearly: a linear classifier reaches over 95% accuracy on all middle and last layers in separating true instances in the dataset from counterfactual ones. Our theory predicts that, in the presence of misleading context, the direction that distinguishes true from false vectors actively pulls the model away from the correct answer. To test that, we intervene in the truth subspace.
Following previous work on linear steering (Li et al., 2023b; Singh et al., 2024), we calculate the mean vectors of the True and False classes in the representation space, $\mu_T$ and $\mu_F$, and add a steering vector $\alpha(\mu_T - \mu_F)$ to all representations in the same layer, with the goal of increasing the probability of the correct attribute. We choose layer $l = 11$, based on preliminary experiments showing that classification accuracy peaks at that layer, and $\alpha = 3.0$. The results, presented in Figure 6(a) (darker bars), show that the model tends to assign a higher probability to the correct attribute post-intervention, even in the presence of false context.

Emergence along training. See section E.4 for a preliminary analysis of the emergence of truth encoding along training.

6 Discussion and Limitations

Although our analysis is grounded in a deliberately minimalist transformer, it uncovers a two-phase dynamic: rapid key–value memorization followed by the slower emergence of a linear truth encoding. The key prerequisites appear to be the presence of (i) an associative-memory circuit able to retrieve subject–attribute pairs and (ii) correlation among the truth values of adjacent clauses. While we replicate the core phenomena we witness in large LMs, we emphasize that this is one, and probably not the only, mechanism that can induce truth encoding. A core advantage of the minimalist model is that it does not assume any lexical cues that help the model discern the truth latent variable. In that sense, this is a more challenging setting than the previously studied one (Joshi et al., 2024), where it is assumed that true and false assertions are associated with different lexical distributions. Several core differences exist between our simplified generative story and a real-world setting. Our synthetic corpus contains only one latent relation.
A natural extension is to sample tuples from a set of heterogeneous relations (bornIn, capitalOf, currencyOf, …) while maintaining correlation in the latent truth bit. Doing so forces the model to contextualize its memory: the same subject embedding must participate in multiple key–value slots distinguished by the relation. Real corpora have logical and semantic dependencies that go far beyond subject–attribute pairs: transitivity (“A is in B” ∧ “B is in C” ⇒ “A is in C”), mutual exclusivity (“isAlive” vs. “isDead”), and type constraints (“capitalOf” only applies to geopolitical entities). These constraints also greatly limit the range of plausible counterfactual variants we may see in the training data; while we assume a uniform corruption for simplicity, in practice false variants of factual claims come from a specific, non-uniform conditional distribution.

7 Conclusion

We introduced a small transformer and a synthetic data-generation process that jointly suffice to yield a robust linear truth subspace. Our analytical and empirical results demonstrate a two-phase training dynamic: memorization followed by truth-code emergence. Unlike prior persona-based accounts, our theory does not rely on surface correlations between individual tokens and truthfulness, and points to a possible mechanism behind the emergence of a linear encoding of the truth signal as a latent variable inferred by the model.

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise, and by the National Science Foundation (NSF) under Grant No. IIS-2239862 to TL. We thank Yanai Elazar for his valuable comments on a previous version of this paper.

References

Azaria and Mitchell [2023] Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, 2023.
Bietti et al.
[2023] Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 2023. Bolukbasi et al. [2016] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29, 2016. Bürger et al. [2025] Lennart Bürger, Fred A Hamprecht, and Boaz Nadler. Truth is universal: Robust detection of lies in llms. Advances in Neural Information Processing Systems, 37:138393–138431, 2025. Burns et al. [2022] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022. Cabannes et al. [2024a] Vivien Cabannes, Elvis Dohmatob, and Alberto Bietti. Scaling laws for associative memories. In International Conference on Learning Representations, 2024a. Cabannes et al. [2024b] Vivien Cabannes, Berfin Şimşek, and Alberto Bietti. Learning associative memories with gradient descent. In Proceedings of the 41st International Conference on Machine Learning, pages 5114–5134, 2024b. Dar et al. [2023] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16124–16170, 2023. Farquhar et al. [2024] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024. Ferrando et al. [2025] Javier Ferrando, Oscar Balcells Obeso, Senthooran Rajamanoharan, and Neel Nanda. Do i know this entity? knowledge awareness and hallucinations in language models. In The Thirteenth International Conference on Learning Representations, 2025. Gekhman et al. 
[2025] Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-out: Hidden factual knowledge in llms, 2025. Geva et al. [2021] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.446. URL https://aclanthology.org/2021.emnlp-main.446/. Geva et al. [2022a] Mor Geva, Avi Caciularu, Guy Dar, Paul Roit, Shoval Sadde, Micah Shlain, Bar Tamir, and Yoav Goldberg. Lm-debugger: An interactive tool for inspection and intervention in transformer-based language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 12–21, 2022a. Geva et al. [2022b] Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2022b. Ghandeharioun et al. [2024] Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, and Lucas Dixon. Who’s asking? user personas and the mechanics of latent misalignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. Jiang et al. 
[2024] Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam, and Victor Veitch. On the origins of linear representations in large language models. In International Conference on Machine Learning, pages 21879–21911. PMLR, 2024. Joshi et al. [2024] Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He. Personas as a way to model truthfulness in language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6346–6359, 2024. Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, ICLR (Poster), 2015. Li et al. [2023a] Belinda Z Li, Alex Tamkin, Noah Goodman, and Jacob Andreas. Eliciting human preferences with language models. arXiv preprint arXiv:2310.11589, 2023a. Li et al. [2024a] Chunyang Li, Hao Peng, Xiaozhi Wang, Yunjia Qi, Lei Hou, Bin Xu, and Juanzi Li. MAVEN-FACT: A large-scale event factuality detection dataset. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11140–11158, Miami, Florida, USA, November 2024a. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.651. Li et al. [2023b] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 41451–41530. Curran Associates, Inc., 2023b. Li et al. [2024b] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024b. Liu et al. [2023] Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. 
Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. Marks and Tegmark [2024] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In First Conference on Language Modeling, 2024. Meng et al. [2022] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022. Nichani et al. [2025] Eshaan Nichani, Jason D Lee, and Alberto Bietti. Understanding factual recall in transformers via associative memories. In International Conference on Learning Representations, 2025. Orgad et al. [2025] Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. LLMs know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations, 2025. Park et al. [2024] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning, pages 39643–39666. PMLR, 2024. Ravfogel et al. [2022] Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D Cotterell. Linear adversarial concept erasure. In International Conference on Machine Learning, pages 18400–18421. PMLR, 2022. Singh et al. [2024] Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, and Ponnurangam Kumaraguru. Representation surgery: Theory and practice of affine steering. In ICML, 2024. Slobodkin et al. [2023] Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. The curious case of hallucinatory (un) answerability: Finding truths in the hidden states of over-confident large language models. 
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3607–3625, 2023.
Stolfo et al. [2024] Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, and Neel Nanda. Confidence regulation neurons in language models. In Advances in Neural Information Processing Systems, 2024.
Vargas and Cotterell [2020] Francisco Vargas and Ryan Cotterell. Exploring the linear subspace hypothesis in gender bias mitigation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2902–2913, 2020.
Yu et al. [2024] Lei Yu, Meng Cao, Jackie CK Cheung, and Yue Dong. Mechanistic understanding and mitigation of language model non-factual hallucinations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7943–7956, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.466.

Appendix

Appendix A Detailed MAVEN-FACT Analysis

Data extraction. We use the train split of MAVEN-FACT v1.0 (73,939 event mentions drawn from 2,913 news articles). (Footnote 4: Available at https://github.com/THU-KEG/MAVEN-FACT.) Each mention carries a FactBank-style factuality code (CT++, CT+, CT-, CT--, PS ±, PR ±, CF ±, U, NA, …). We retain only certain judgments:

$$\text{certain-true} = \{\text{CT++}, \text{CT+}\}, \qquad \text{certain-false} = \{\text{CT--}, \text{CT-}\}.$$

All other codes are discarded, leaving $N = 71{,}274$ labelled mentions.

Grouping key. Mentions are grouped by their originating article ID (doc_id), giving $M = 2{,}913$ documents with at least two certain mentions ($n_i > 1$). Let $Z_{ij} \in \{0, 1\}$ indicate whether mention $j$ in document $i$ is certain-false.

Statistics reported in the main text.
• Corpus certain-false rate. $p = \frac{1}{N} \sum_{i,j} Z_{ij} = 0.0209$.
• Pairwise certain-false probability.
$\Pr(Z_j = Z_k = 1 \mid \text{same doc}) = \frac{\sum_i \binom{f_i}{2}}{\sum_i \binom{n_i}{2}} = 0.00090$, where $f_i = \sum_j Z_{ij}$.
• Independence baseline. $p^2 = 0.00044$.
• Clustering ratio. $\frac{\mathrm{Var}_{\mathrm{obs}}(\hat{p}_i)}{\mathrm{Var}_{\mathrm{binom}}} = \frac{\frac{1}{M} \sum_i (\hat{p}_i - p)^2}{\frac{1}{M} \sum_i p(1-p)/n_i} = 1.23$, with $\hat{p}_i = f_i / n_i$.
• $\chi^2$ test. The $2 \times M$ contingency table of $\{f_i,\, n_i - f_i\}$ yields $\chi^2 = 4{,}174$ ($p \approx 9 \times 10^{-49}$).

These figures show that certain-false events, though rare (2.1%), occur about twice as often as chance would predict when two events come from the same article (0.00090 versus 0.00044), and the distribution of false rates across articles is 23% more heterogeneous than a binomial model would permit, confirming the co-occurrence signal predicted by the TCH. The MAVEN-ED dataset is released under the CC BY-SA 4.0 license. MAVEN-ARG and MAVEN-ERE are published under the GPLv3 license.

Appendix B Entropy incentive

Setup. We consider sequences $(x, y, x', y')$ with subjects $x, x' \in \mathcal{S}$ and attributes $y, y' \in \mathcal{A}$. Let $g: \mathcal{S} \to \mathcal{A}$ be the ground-truth attribute map. A latent bit $T \sim \mathrm{Bernoulli}(\rho)$ governs whether attributes are truthful ($T = 1$) or random ($T = 0$):

$$T = 1:\; y = g(x),\; y' = g(x') \ \text{(deterministic)}; \qquad T = 0:\; y, y' \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(\mathcal{A}).$$

We study the optimal next-token loss for predicting $y'$ given the prefix $(x, y, x')$ under two cases: (i) the model does not access $T$; (ii) the model does access $T$ (and can memorize $g$).

A. Predictive distribution of $y'$ without access to $T$

By the law of total probability over $T$ and the generator above,

$$\Pr(y' = g(x') \mid x, y, x') = \Pr(T = 1 \mid x, y, x') \cdot 1 + \Pr(T = 0 \mid x, y, x') \cdot \frac{1}{|\mathcal{A}|}. \tag{12}$$

When we say the model is “ignorant of $T$,” we mean it does not exploit any posterior signal about $T$ from the prefix; thus we use the prior $\Pr(T = 1 \mid x, y, x') = \rho$ and $\Pr(T = 0 \mid x, y, x') = 1 - \rho$.
Hence

$$\Pr(y' = g(x') \mid x, y, x') = \rho + \frac{1 - \rho}{|\mathcal{A}|}. \tag{13}$$

For any specific wrong $a \in \mathcal{A} \setminus \{g(x')\}$,

$$\Pr(y' = a \mid x, y, x') = \rho \cdot 0 + (1 - \rho) \cdot \frac{1}{|\mathcal{A}|} = \frac{1 - \rho}{|\mathcal{A}|}. \tag{14}$$

B. Optimal per-token cross-entropy without $T$

Let $\alpha := \rho + \frac{1 - \rho}{|\mathcal{A}|}$ and $\beta := \frac{1 - \rho}{|\mathcal{A}|}$. The optimal model (that does not access $T$) matches the true conditional in (13)–(14), so its per-token cross-entropy equals the entropy of that distribution:

$$\mathcal{L}_{\neg T} = H\big(\alpha, \underbrace{\beta, \dots, \beta}_{|\mathcal{A}| - 1 \text{ times}}\big) = -\alpha \log \alpha - (|\mathcal{A}| - 1)\, \beta \log \beta. \tag{15}$$

C. Optimal per-token cross-entropy with access to $T$

If the model does access $T$ (and has memorized $g$):

$$T = 1:\ \text{loss } 0 \ \text{(since } y' = g(x') \text{ deterministically)}; \qquad T = 0:\ \text{loss } \log |\mathcal{A}| \ \text{(uniform on } \mathcal{A}\text{)}.$$

Averaging over $T$ gives

$$\mathcal{L}_{T} = (1 - \rho) \log |\mathcal{A}|. \tag{16}$$

D. The gap and its limit as $|\mathcal{A}| \to \infty$

Subtracting (16) from (15):

$$\Delta := \mathcal{L}_{\neg T} - \mathcal{L}_{T} = -\alpha \log \alpha - (|\mathcal{A}| - 1)\, \beta \log \beta - (1 - \rho) \log |\mathcal{A}|. \tag{17}$$

Using $\beta = \frac{1 - \rho}{|\mathcal{A}|}$,

$$-(|\mathcal{A}| - 1)\, \beta \log \beta = -(1 - \rho)\Big(1 - \frac{1}{|\mathcal{A}|}\Big) \log(1 - \rho) + (1 - \rho)\Big(1 - \frac{1}{|\mathcal{A}|}\Big) \log |\mathcal{A}|.$$

Plugging into (17) and simplifying,

$$\Delta = -\alpha \log \alpha - (1 - \rho)\Big(1 - \frac{1}{|\mathcal{A}|}\Big) \log(1 - \rho) - \frac{1 - \rho}{|\mathcal{A}|} \log |\mathcal{A}|. \tag{18}$$

Since $\alpha = \rho + \frac{1 - \rho}{|\mathcal{A}|} \to \rho$ and the last term is $O\big(\frac{\log |\mathcal{A}|}{|\mathcal{A}|}\big)$, we obtain the limit

$$\Delta \xrightarrow[|\mathcal{A}| \to \infty]{} -\rho \log \rho - (1 - \rho) \log(1 - \rho) \equiv H_2(\rho) \quad \text{(up to the log base)}. \tag{19}$$

Appendix C Experimental Setup

Model. We experiment with an attention-only transformer with a single attention head and a post-attention LN:

$$X^{0} = E + P, \qquad E, P \in \mathbb{R}^{V \times d} \ \text{(token + positional embeddings)} \tag{20}$$

$$Q^{(i)} = X^{(i-1)} W_Q^{(i)}, \quad K^{(i)} = X^{(i-1)} W_K^{(i)}, \quad V^{(i)} = X^{(i-1)} W_V^{(i)} \tag{21}$$

$$A^{(i)} = \mathrm{softmax}\!\left(\frac{Q^{(i)} K^{(i)\top}}{\sqrt{d}}\right) V^{(i)}, \qquad A^{(i)} \in \mathbb{R}^{d} \ \text{(attention mix)} \tag{22}$$

$$\tilde{A}^{(i)} = A^{(i)} W_O^{(i)}, \qquad W_O^{(i)} \in \mathbb{R}^{d \times d} \ \text{(single-head attention output)} \tag{23}$$

$$X^{(i)} = N\big(X^{(i-1)} + A^{(i)}\big), \qquad i = 1, \dots, l \ \text{(residual + normalization)} \tag{24}$$

$$Z = X^{(l)} W_O + b_O, \qquad W_O \in \mathbb{R}^{d \times V},\; b_O \in \mathbb{R}^{V} \tag{25}$$

$$Y = \mathrm{softmax}(Z) \tag{26}$$

Experiments with one-hot models (section 4). The theoretical analysis is driven by experiments on models equipped with frozen, one-hot embeddings and uniform attention, the latter obtained by setting the attention-key matrix $K$ to the zero matrix. Under these conditions the columns of the attention value–output product $KV^{\top}$ map directly to individual vocabulary items, exposing a clear block structure in the matrix (fig. 1). As detailed in the main text, the vocabulary is organized so that indices 1–20 encode input subject embeddings, 21–40 input attribute embeddings, 41–44 positional embeddings, 45–64 output subject embeddings, and 65–84 output attribute embeddings.

Methodology: interpreting one-hot embeddings. Figure 2 contrasts two sequences, a correct one (top row) and an incorrect one (bottom row), by showing the final-layer activations before projecting to the logit space. The one-hot embeddings make the activation patterns in that layer interpretable. We display the activations for the raw representations (left), after layer normalization (middle), and after applying the unembedding matrix and the softmax transformation (right). Observe the differing y-axis scales: normalization substantially magnifies the component corresponding to the correct answer in the “true” sequence, while the effect is far less pronounced for the false sequence.

The model that produced fig. 1 was trained with SGD, learning rate 1.0 and batch size 16. The output matrix was fixed to identity, and only the value matrix was learned, from zero initialization.
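To make the model definition in equations (20)–(26) concrete, the following is a minimal NumPy sketch of the forward pass. It is an illustration under stated assumptions, not the authors' implementation: the causal mask, the RMSNorm without a learnable gain, the choice to add the output-projected attention to the residual stream, a positional table indexed by position, and the helper names `init_params` and `forward` are all hypothetical choices made here. Setting `zero_keys=True` zeroes the key matrix, which yields uniform attention over the visible prefix, as in the one-hot experiments.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def rms_norm(x, eps=1e-8):
    """RMSNorm without a learnable gain (an assumption of this sketch)."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)


def init_params(vocab, d, n_layers, max_len, rng, zero_keys=False):
    """Random illustrative parameters; zero_keys=True gives uniform attention."""
    layers = []
    for _ in range(n_layers):
        W_Q, W_K, W_V, W_O = (
            rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4)
        )
        if zero_keys:
            W_K = np.zeros((d, d))
        layers.append((W_Q, W_K, W_V, W_O))
    return {
        "E": rng.normal(size=(vocab, d)),      # token embeddings
        "P": rng.normal(size=(max_len, d)),    # positional embeddings (assumed layout)
        "layers": layers,
        "W_out": rng.normal(scale=d ** -0.5, size=(d, vocab)),
        "b_out": np.zeros(vocab),
    }


def forward(tokens, params):
    """Return next-token distributions, one row per input position."""
    X = params["E"][tokens] + params["P"][: len(tokens)]       # eq. (20)
    d = X.shape[-1]
    for W_Q, W_K, W_V, W_O in params["layers"]:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V                     # eq. (21)
        scores = Q @ K.T / np.sqrt(d)
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)              # causal mask (assumption)
        A = softmax(scores) @ V                                 # eq. (22)
        X = rms_norm(X + A @ W_O)                               # eqs. (23)-(24)
    Z = X @ params["W_out"] + params["b_out"]                   # eq. (25)
    return softmax(Z)                                           # eq. (26)
```

Calling `forward(np.array([1, 5, 2, 7]), params)` on a four-token sequence $(x, y, x', y')$ returns a matrix whose third row is the predicted distribution over $y'$.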
Experiments with fully-trained models (section 5). In section 5, we train all components, including the input embeddings and the $K$ attention matrix. The model is trained for 50,000 batches of size 128 and is optimized with the Adam optimizer [Kingma and Ba, 2015] with a learning rate of 1e-4 and a weight decay of 1e-5. We do not include biases in the attention modules, and use RMSNorm as layer normalization. We run all experiments on 4 NVIDIA GeForce GTX 1080 GPUs. Training a single model takes up to half an hour.

Appendix D Additional Experiments

In the main text we concentrated on a single-layer model ($l = 1$) with a true-attribute probability of $\rho = 0.99$. Here we extend the analysis to additional settings. Our primary focus was the linear separability at the second-subject token, $x'$, where the model predicts the second attribute. This is the only position where the truth signal is behaviorally relevant. Nevertheless, the theory also predicts a linear truth encoding at the first-attribute token $y$, owing to the fixed attention pattern. When the attention $KV$ matrix is learned, however, this need not occur: the model can rely exclusively on the attention paid to $x'$ and leave $y$ uninformative. The same theory further implies that a linear truth direction should eventually emerge for any true-sentence rate $\rho$, even though the gradient magnitude (and therefore the speed of emergence) does depend on $\rho$.

Varying the true sentence rate, $\rho$. In fig. 7(b) we vary $\rho$ across five random seeds and measure linear separability at both token positions. As predicted, when the attention pattern is learned, separability is much stronger at the second subject than at the first attribute. The time to emergence grows as $\rho$ increases, yet linear encoding still appears even at the extreme setting of $\rho = 0.999$. Developing a theory that precisely predicts this $\rho$-dependent timing is left to future work.

Figure 7: Dependency of linear separability on $\rho$ (panels (a) and (b)).
Figure 8: Dependency of linear separability on $d_{\text{model}}$ and $|\mathcal{S}|$ (panels (a) and (b)).

Dependency on $d_{\text{model}}$ and $|\mathcal{S}|$. In fig. 8 we plot the linear separability at the final checkpoint for different hidden sizes and numbers of facts to memorize ($\rho = 0.99$ and $l = 1$ are fixed). With the exception of $d_{\text{model}} = 32$, the separability persists at the second subject $x'$ for different combinations of these parameters.

Figure 9: Attention patterns of a 3-layer model.

Figure 10: Linear separability across layers for a 3-layer model (panels (a) and (b)); linear separability on the $x'$ token is created after copying the signal from the $y$ token in the second layer.

Additional layers. As we discuss in the main text (section 5), in a model with a single self-attention layer, it is the second subject ($x'$) token that attends to both $x$ and $y$. With more layers, there are additional strategies. For instance, $y$ may attend to both $x$ and itself in the first layer, in the same way $x'$ attends to both $x$ and $y$ in the theoretical 1-layer model; then, in the next layer, $x'$ attends to $y$, copies the signal, and creates a linear separation that persists to the last layer. This is the mechanism that emerges in 4 of 5 random initializations of a 3-layer model, and it is clearly manifested in the attention patterns (fig. 9) and in the linear classification accuracy across layers (fig. 10).

D.1 Bridging the gap between the fully-trainable model and the toy model

Our theoretical analysis (appendix E) is motivated by the structured patterns that emerge in the attention kernel, the $OV$ matrix, when it is visualized (fig. 1). To test whether a comparable mechanism appears when we employ dense embeddings and allow the $KV$ matrices to train freely (thus removing the enforced uniform attention over $x, y$), we train a model with a large hidden dimension but only a small set of facts to memorize ($|\mathcal{S}| = 32$ and $d_{\text{model}} = 512$).
We freeze the randomly-initialized dense embeddings and train all other parameters. The limited number of subjects makes the memorization patterns easier to inspect, while the high dimensionality approximates the regime of mutually orthogonal embeddings required by the theory.

Figure 11: Visualization of the attention matrix with dense embeddings. (a) $EVOE^{\top}$ with frozen dense embeddings. (b) $EVOE^{\top}$ with trainable dense embeddings.

Figure 12: Visualization of the attention matrix with dense embeddings. (a) $EVOE^{\top}$ with frozen dense embeddings (sub-sampled and sorted). (b) $EVOE^{\top}$ with trainable dense embeddings (sub-sampled and sorted).

Because the model now uses dense embeddings, so that individual coordinates no longer correspond directly to vocabulary items, we do not expect an obvious block structure in the raw $OV$ matrix. Instead, following Dar et al. [2023], we visualize $EVOE^{\top}$, where $E$ concatenates the input and output embedding matrices. This operation computes the pairwise similarities between embeddings as induced by the $VO$ transformation. Concretely, $(EVOE^{\top})_{ij} = E_i^{\top} V O E_j$ measures how strongly the value vector elicited by symbol $i$ aligns with the output direction that scores symbol $j$, so every cell again describes a relation between concrete symbols, exactly what the raw $OV$ matrix showed when the embeddings were one-hot. The resulting heat-map (fig. 11(a)) exhibits a strikingly similar pattern to that observed with frozen one-hot embeddings and a fixed attention pattern, suggesting that the dense model converges to a similar underlying mechanism. In contrast, when we do train the embeddings, the pattern partially disappears, as parts of the memorization can occur in the embeddings themselves (fig. 11(b)). In general, there is much more variability between runs and hyperparameters when training the embeddings, and some hyperparameter choices do not show a pattern that closely resembles the idealized one.
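The embedding-space view described above can be illustrated with a small NumPy sketch. It is a toy illustration rather than the paper's analysis code: the function name `embedding_space_kernel`, the dimensions, and the planted symbol-level pattern `M` are hypothetical. The sketch shows two things: with one-hot embeddings the quantity $EVOE^{\top}$ is just the raw $VO$ product, and with dense but orthonormal embeddings a symbol-level pattern hidden inside a dense $VO$ matrix becomes visible again after the change of basis.

```python
import numpy as np


def embedding_space_kernel(E, V, O):
    """Return E V O E^T; entry (i, j) equals E_i^T V O E_j (rows of E are embeddings)."""
    return E @ V @ O @ E.T


rng = np.random.default_rng(0)
n_sym = d = 6

# Case 1: one-hot embeddings. The kernel coincides with the raw VO product.
V = rng.normal(size=(d, d))
O = rng.normal(size=(d, d))
E_onehot = np.eye(n_sym)
K_onehot = embedding_space_kernel(E_onehot, V, O)

# Case 2: dense orthonormal embeddings (an idealization of nearly orthogonal
# random embeddings in high dimension).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
E_dense = Q                                  # rows are orthonormal

# Hypothetical symbol-level pattern: symbol i maps to symbol i + 1 (cyclically).
M = np.roll(np.eye(n_sym), 1, axis=1)

# Plant M in the dense basis; the raw matrix VO_dense has no visible structure.
VO_dense = E_dense.T @ M @ E_dense

# Viewing VO_dense through the embeddings recovers the planted pattern.
recovered = embedding_space_kernel(E_dense, VO_dense, np.eye(d))
```

Here `recovered` matches `M` up to floating-point error, whereas the entries of `VO_dense` are spread across all coordinates. With trained embeddings that are only approximately orthogonal, one would expect a noisier version of the same picture.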
With a full set of $|\mathcal{S}| = d_{\text{model}} = 512$ tokens, the global pattern is hard to spot at first glance. If we instead sub-sample 28 $x$ tokens, retain only their partners $g(x)$, and then sort the rows/columns, the latent memorization re-emerges: the lower-left block collapses into a clear diagonal (the previously random pattern in the leftmost lower block in fig. 11(a) is transformed into a diagonal due to the sorting). This diagonal appears whether the embeddings are frozen or trainable (see figs. 12(a) and 12(b)).

Figure 13: Visualization of learned embeddings and value matrix for a model as in Section 4 with learned embeddings, initialized to one-hot.

One possible circuit with learned embeddings. We now present one possible circuit that we found when initializing with the one-hot embeddings, in a simplified architecture with uniform attention as in Section 4. We still denote by $e_x, e_y, u_x, u_y$ the one-hot embeddings as in Section 4, which only refer to the initialization in this setting with learned embeddings. After training, we may visualize the learned embeddings and interpret them as linear combinations of the initial one-hot embeddings, as shown in Figure 13. Denoting by $\tilde{e}_x, \tilde{e}_y, \tilde{u}_x, \tilde{u}_y$ the embeddings after training, the circuit we found looks as follows:

$$\tilde{e}_x = e_x - e_{g(x)}$$

$$\tilde{e}_y = e_y - e_{g^{-1}(y)}$$

$$\tilde{u}_x = \sum_{x} u_x - \sum_{y} u_y$$

$$\tilde{u}_y = u_y + e_{g^{-1}(y)}$$

$$W = \sum_{x} \big(u_{g(x)} - e_x\big) e_x^{\top} - \sum_{y} \big(e_y + u_y\big) e_y^{\top}.$$

The approximation $\tilde{e}_x = e_x - e_{g(x)}$, for instance, follows from the two large positive and negative spikes in the left part of fig. 13, for indices 1 and 25/36.
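The idealized circuit above can be checked numerically. The following NumPy sketch builds $W$, $\tilde{e}_x$ and $\tilde{e}_y$ from one-hot basis vectors and evaluates $W(\tilde{e}_x + \tilde{e}_y)$ for matching and mismatched subject–attribute pairs. The basis layout (four blocks of size $N$, no positional slots), the value $N = 5$, the random permutation used for $g$, and the helper names are assumptions made for illustration only.

```python
import numpy as np

N = 5                                   # number of subjects and attributes (illustrative)
rng = np.random.default_rng(0)
g = rng.permutation(N)                  # hypothetical ground-truth map: subject x -> attribute g[x]
g_inv = np.argsort(g)                   # inverse map: attribute y -> subject g_inv[y]

I = np.eye(4 * N)                       # assumed block layout: [e_x | e_y | u_x | u_y]


def e_sub(x):
    return I[x]


def e_att(y):
    return I[N + y]


def u_att(y):
    return I[3 * N + y]


# W = sum_x (u_{g(x)} - e_x) e_x^T - sum_y (e_y + u_y) e_y^T
W = np.zeros((4 * N, 4 * N))
for x in range(N):
    W += np.outer(u_att(g[x]) - e_sub(x), e_sub(x))
for y in range(N):
    W -= np.outer(e_att(y) + u_att(y), e_att(y))


def circuit_output(x, y):
    """Compute W(e~_x + e~_y) with e~_x = e_x - e_{g(x)} and e~_y = e_y - e_{g^{-1}(y)}."""
    e_x_tilde = e_sub(x) - e_att(g[x])
    e_y_tilde = e_att(y) - e_sub(g_inv[y])
    return W @ (e_x_tilde + e_y_tilde)


true_norms = [np.linalg.norm(circuit_output(x, g[x])) for x in range(N)]
false_norms = [
    np.linalg.norm(circuit_output(x, y))
    for x in range(N) for y in range(N) if y != g[x]
]
```

In this idealized setting the output is exactly zero for every true pair and has a fixed nonzero norm for every false pair, which is the kind of gap a subsequent normalization step can turn into a truth direction.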
Similar to our analysis of Section 4, we compute the quantity $W(\tilde{e}_x + \tilde{e}_y)$, which appears in the residual stream for both token $y$ and token $x'$:

$$W(\tilde{e}_x + \tilde{e}_y) = u_{g(x)} - e_x + e_{g(x)} + u_{g(x)} - e_y - u_y - u_y + e_{g^{-1}(y)}.$$

We observe that this vanishes when $y = g(x)$, suggesting that a mechanism similar to the one in the fixed-embeddings case studied in Section 4 is at play, where layer-norm can lead to sharper predictions for true sequences as well as provide a truth direction.

Appendix E Theoretical analysis

This section contains theoretical analysis and proofs for the results in Section 4.

E.1 Training dynamics

Figure 14: Structure of the value matrix $W$ when training without positional embeddings.

Figure 15: Structure of the value matrix $W$ when training with Euclidean normalization.

We now provide some theoretical insights on the training dynamics in the simple one-layer model of Section 4. We further simplify the model here by removing positional embeddings. Figure 14 shows that the model still learns the relevant blocks even without positional embeddings, though some of the uniform distributions on unembeddings are now absorbed in other blocks. The lemma below highlights the structure of the gradient for a softmax classification model consisting of a linear model followed by a layer-norm operation.

Lemma 1. Consider the model $F_W(x) = U \cdot \mathcal{N}(a_x + W b_x) \in \mathbb{R}^{2N}$, with $\mathcal{N}(v) = v / \|v\|$, and the following cross-entropy population loss on some distribution over $(x, y)$:

$$L(W) = \mathbb{E}_{x,y}\big[-\log \mathcal{S}(F_W(x))_y\big], \tag{27}$$

where $y$ is the label and $\mathcal{S}$ the softmax operation. The gradient with respect to $W$ is then given by:

$$\nabla L(W) = \sum_{k=1}^{2N} \mathbb{E}_{x,y}\left[\frac{\mathcal{S}(U \cdot \mathcal{N}(v_x))_k - \mathbb{1}\{y = k\}}{\|v_x\|}\, P_{v_x / \|v_x\|}\, u_k b_x^{\top}\right], \tag{28}$$

with $v_x = a_x + W b_x$ and where $P_{\theta} = I - \theta \theta^{\top}$ is the projection onto the tangent space at a unit vector $\theta \in \mathbb{R}^{d}$.
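As a sanity check on the formula in Lemma 1, one can compare it against a finite-difference gradient for a single sample. The sketch below does this in NumPy under illustrative assumptions: random $U$, $a_x$, $b_x$ and $W$, a single $(x, y)$ pair instead of a population expectation, and helper names (`loss`, `lemma_gradient`, `numerical_gradient`) that are hypothetical. Here $u_k$ is taken to be the $k$-th row of $U$.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def loss(W, U, a, b, y):
    """Single-sample cross-entropy of softmax(U N(a + W b)) with N(v) = v / ||v||."""
    v = a + W @ b
    return -np.log(softmax(U @ (v / np.linalg.norm(v)))[y])


def lemma_gradient(W, U, a, b, y):
    """Single-sample version of eq. (28)."""
    v = a + W @ b
    norm = np.linalg.norm(v)
    theta = v / norm
    P = np.eye(len(v)) - np.outer(theta, theta)   # tangent-space projection P_theta
    coeff = softmax(U @ theta)
    coeff[y] -= 1.0                               # S(.)_k - 1{y = k}
    # sum_k coeff_k * P u_k b^T / ||v||, using sum_k coeff_k u_k = U^T coeff
    return np.outer(P @ (U.T @ coeff), b) / norm


def numerical_gradient(W, U, a, b, y, eps=1e-6):
    """Central finite differences, entry by entry."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            G[i, j] = (loss(W_plus, U, a, b, y) - loss(W_minus, U, a, b, y)) / (2 * eps)
    return G


rng = np.random.default_rng(0)
d, n_classes = 5, 7
U = rng.normal(size=(n_classes, d))
a, b = rng.normal(size=d), rng.normal(size=d)
W = 0.1 * rng.normal(size=(d, d))
y = 3

analytic = lemma_gradient(W, U, a, b, y)
numeric = numerical_gradient(W, U, a, b, y)
max_abs_error = np.abs(analytic - numeric).max()
```

The two gradients should agree up to finite-difference error. The analytic gradient also satisfies $\theta^{\top} \nabla L = 0$ for $\theta = v_x / \|v_x\|$, reflecting that rescaling $v_x$ does not change the normalized output.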
Let us decompose the population loss as

$$
L(W) = L_1(W) + L_2(W) + L_3(W), \tag{29}
$$

where $L_t(W)$ is the next-token prediction loss for predicting $z_{t+1}$ from $z_{1:t}$, with $z_{1:4} = (x, y, x', y')$. We show the following result.

Theorem 3. Consider the following algorithm, with step-size $\eta = N/\rho$ and initialization $W_0 = 0$:

1. Set $W_1 = W_0 - \eta \nabla L_1(W_0)$
2. Set $W_2 = W_1 - \eta \nabla L_1(W_1)$
3. Set $W_3 = W_2 - \eta \nabla L_3(W_2)$

Then, we have

$$
W_3 = \sum_{x=1}^{N} \big(\beta_1 u_{g(x)} - \alpha_1 e_x\big) e_x^\top + \sum_{y} \big(\alpha_2 e_{g^{-1}(y)} - \beta_2 u_y\big) e_y^\top + O_\infty(1/N), \tag{30}
$$

where $\alpha_1, \alpha_2, \beta_1, \beta_2 > 0$ can be found in the proof, and $O_\infty(1/N)$ is a matrix where all entries are $O(1/N)$.

Comment: Euclidean norm vs. RMS norm. The updates in this section are derived under Euclidean layer-norm $\mathcal{N}(v) = v / \|v\|_2$, so the normalized scores entering the softmax are attenuated (compared to our experiments, which use inverse temperature $\beta = \sqrt{d} = \Theta(\sqrt{N})$) by a factor of order $1/\sqrt{N}$. Consequently, in the early regime with a fixed unembedding $U$ and $\Theta(N)$ competing classes, the correct-token probability $\sigma_{x,g(x)}$ is $O(1/N)$. In our implementation we use RMSNorm, $\mathcal{N}_{\mathrm{RMS}}(v) = \sqrt{d}\, v / \|v\|_2$, which is equivalent to keeping Euclidean LN but applying a softmax with inverse temperature $\beta = \sqrt{d}$. Since $d = \Theta(N)$ (here $d = 4N + 3$), this multiplies every Euclidean score difference by $\sqrt{d} = \Theta(\sqrt{N})$, so relative advantages that were $O(1/\sqrt{N})$ become $\Theta(1)$. As a result, in the same early regime the correct-token probability becomes $\Theta(1)$. Empirically, we see similar structures emerge in the matrix during early training when using the Euclidean norm instead of the RMS norm (Fig. 15).

Proof.
Let us decompose each loss into contributions from true and false sequences, which follows from the fact that the data distribution is a mixture of the two:

$$
L_i(W) = \rho\, L_i^T(W) + (1 - \rho)\, L_i^F(W).
$$

Step 1. In the first step, we take a gradient step only on the loss $L_1$ for the prediction of the second token $y$ at the first token $x$, starting from initialization $W_0 = 0$. Recall that this model takes the form $F(x) = U \cdot \mathcal{N}(e_x + W e_x)$, so that in the notation of Lemma 1 we have $a_x = b_x = v_x = e_x$ and $P_{v_x/\|v_x\|} u_k = u_k$. Note that since the logits are all zero, we have $\mathcal{S}(0)_k = \frac{1}{2N}$.

We begin with the gradient on true sequences. Multiplying eq. (28) by $-\eta$ and setting $y = g(x)$ gives

$$
\begin{aligned}
-\eta \nabla L_1^T(W_0) &= \eta\, \mathbb{E}_x\big[u_{g(x)} e_x^\top\big] - \eta \sum_{k=1}^{2N} \mathcal{S}(0)_k\, u_k\, \mathbb{E}_x\big[e_x^\top\big] \\
&= \frac{\eta}{N} \sum_{x=1}^{N} u_{g(x)} e_x^\top - \frac{\eta}{2N^2} \sum_{z=1}^{2N} \sum_{x=1}^{N} u_z e_x^\top \\
&= \frac{\eta}{N} \sum_{x=1}^{N} u_{g(x)} e_x^\top + O_\infty(\eta/N^2).
\end{aligned}
$$

On false sequences, we have

$$
\begin{aligned}
-\eta \nabla L_1^F(W_0) &= -\eta\, \mathbb{E}_x\Big[\sum_{k=1}^{2N} \mathcal{S}(0)_k\, u_k e_x^\top\Big] + \eta\, \mathbb{E}_{x,y}\big[u_y e_x^\top\big] \\
&= \frac{\eta}{N^2} \sum_{x=1}^{N} \sum_{y=N+1}^{2N} u_y e_x^\top - \frac{\eta}{2N^2} \sum_{z=1}^{2N} \sum_{x=1}^{N} u_z e_x^\top \\
&= O_\infty(\eta/N^2),
\end{aligned}
$$

using the fact that $x$ and $y$ are independent. With $\eta = N/\rho$, we obtain

$$
W_1 = W_0 - \eta \nabla L_1(W_0) = \sum_{x=1}^{N} u_{g(x)} e_x^\top + O_\infty(1/N).
$$

Step 2. The second step is taken at $W = W_1 = \sum_{x=1}^{N} u_{g(x)} e_x^\top + R$, with $\|R\|_\infty = O(1/N)$. Thus, we have $v_x = e_x + W_1 e_x = e_x + u_{g(x)} + \varepsilon_x$, with $\|\varepsilon_x\|_\infty = O(1/N)$ since $e_x$ is one-hot, which implies $\|\varepsilon_x\|_2 = O(1/\sqrt{N})$. We also denote $\sigma_{x,k} := \mathcal{S}(U \cdot \mathcal{N}(v_x))_k$, which satisfies $\sigma_{x,k} = O(1/N)$ for all $x$ and $k$, since $\exp(u_k^\top \mathcal{N}(v_x)) = \Theta(1)$ for all $x$ and $k$.
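Step 1 can be reproduced numerically by summing the Lemma 1 gradient over the full mixture distribution at $W_0 = 0$. The sketch below is illustrative only: the values of `N` and `rho`, the random bijection `g`, the 0-based token indexing, and the omission of positional coordinates are assumptions, and `grad_sample` is a hypothetical helper.

```python
import numpy as np

N, rho = 16, 0.8                      # hypothetical sizes; rho = share of true sequences
rng = np.random.default_rng(0)
g = N + rng.permutation(N)            # g(x) in {N, ..., 2N-1} for x in {0, ..., N-1}

d = 4 * N                             # positional coordinates omitted (assumption)
I = np.eye(d)
e = I[:2 * N]                         # e[z]: one-hot embedding of token z
U = I[2 * N:]                         # U[k] = u_k: one-hot unembedding of token k

def grad_sample(W, x, y):
    """Lemma 1 gradient for one (x, y) pair, with a_x = b_x = e_x."""
    v = e[x] + W @ e[x]
    norm = np.linalg.norm(v)
    theta = v / norm
    z = U @ theta
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    gv = U.T @ p
    gv = (gv - theta * (theta @ gv)) / norm       # apply P_theta / ||v||
    return np.outer(gv, e[x])

W0 = np.zeros((d, d))
grad = np.zeros((d, d))
for x in range(N):
    for y in range(N, 2 * N):
        # true pairs with probability rho, independent (x, y) with probability 1 - rho
        prob = rho * (y == g[x]) / N + (1 - rho) / N**2
        grad += prob * grad_sample(W0, x, y)

eta = N / rho
W1 = W0 - eta * grad
target = sum(np.outer(U[g[x]], e[x]) for x in range(N))
residual = np.abs(W1 - target).max()
print(residual, 1 / (2 * rho * N))
```

In this setup every entry of $W_1 - \sum_x u_{g(x)} e_x^\top$ is bounded by $1/(2\rho N)$, consistent with the $O_\infty(1/N)$ remainder.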
On true sequences, we have:

$$
-\eta \nabla L_1^T(W_1) = \frac{\eta}{N} \sum_{x=1}^{N} \frac{1}{\|v_x\|_2} \Big(I - \frac{v_x v_x^\top}{\|v_x\|_2^2}\Big) u_{g(x)} e_x^\top - \frac{\eta}{N} \sum_{x=1}^{N} \sum_{k=1}^{2N} \frac{\sigma_{x,k}}{\|v_x\|_2} \Big(I - \frac{v_x v_x^\top}{\|v_x\|_2^2}\Big) u_k e_x^\top .
$$

Note that we have

$$
\|v_x\| = \sqrt{2 + 2(e_x + u_{g(x)})^\top \varepsilon_x + \|\varepsilon_x\|_2^2} = \sqrt{2 + O(1/N)} = \sqrt{2} + O(1/N)
$$

by Taylor expansion. Then,

$$
\begin{aligned}
v_{x,k} := \frac{1}{\|v_x\|_2} \Big(I - \frac{v_x v_x^\top}{\|v_x\|_2^2}\Big) u_k
&= \frac{1}{\sqrt{2} + O(\frac{1}{N})}\, u_k - \frac{\delta_{k,g(x)} + O(\frac{1}{N})}{2\sqrt{2} + O(\frac{1}{N})}\, v_x \\
&= \frac{1}{\sqrt{2}}\, u_k - \frac{\delta_{k,g(x)}}{2\sqrt{2}}\, (e_x + u_{g(x)}) + O_\infty(1/N),
\end{aligned}
$$

where $O_\infty(1/N)$ is a vector with $\ell_\infty$ norm $O(1/N)$ and $\delta_{k,g(x)} = \mathbb{1}\{k = g(x)\}$ denotes the Kronecker delta. The minus sign comes from the projection $I - v_x v_x^\top / \|v_x\|_2^2$, using $v_x^\top u_k = \delta_{k,g(x)} + O(1/N)$. Plugging back into the gradient above, we obtain

$$
\begin{aligned}
-\eta \nabla L_1^T(W_1) &= \frac{\eta}{N\sqrt{2}} \sum_{x=1}^{N} \Big(u_{g(x)} - \frac{1}{2}(e_x + u_{g(x)}) + O_\infty\big(\tfrac{1}{N}\big)\Big) e_x^\top - \frac{\eta}{N} \sum_{x=1}^{N} \sum_{k=1}^{2N} \sigma_{x,k}\, v_{x,k}\, e_x^\top \\
&= \frac{\eta}{2\sqrt{2}\,N} \sum_{x=1}^{N} (u_{g(x)} - e_x)\, e_x^\top + O_\infty(\eta/N^2),
\end{aligned}
$$

which follows by noticing that $\sum_{k=1}^{2N} \sigma_{x,k} v_{x,k} = O_\infty(1/N)$. For false sequences, we have

$$
\begin{aligned}
-\eta \nabla L_1^F(W_1) &= \eta\, \mathbb{E}_{x,y}\big[v_{x,y} e_x^\top\big] - \frac{\eta}{N} \sum_{x=1}^{N} \sum_{k=1}^{2N} \sigma_{x,k}\, v_{x,k}\, e_x^\top \\
&= \frac{\eta}{N^2} \sum_{x=1}^{N} \sum_{y=N+1}^{2N} v_{x,y}\, e_x^\top - \frac{\eta}{N} \sum_{x=1}^{N} \sum_{k=1}^{2N} \sigma_{x,k}\, v_{x,k}\, e_x^\top \\
&= O_\infty(\eta/N^2).
\end{aligned}
$$

This again follows by noticing that

$$
\frac{1}{N} \sum_{y=N+1}^{2N} v_{x,y} = O_\infty(1/N) \quad \text{and} \quad \sum_{k=1}^{2N} \sigma_{x,k}\, v_{x,k} = O_\infty(1/N).
$$

With $\eta = N/\rho$, this yields

$$
W_2 = W_1 - \eta \nabla L_1(W_1) = \sum_{t=1}^{N} \big(\alpha\, u_{g(t)} - e_t\big) e_t^\top + O(1/N),
$$

with $\alpha = 1 + \frac{1}{2\sqrt{2}}$.

Step 3. The third step takes one gradient step on the loss $L_3$ at the third token, i.e., predicting $y'$ from $(x, y, x')$.
The model now takes the form $F(x, y, x') = U \cdot \mathcal{N}\big(e_{x'} + \frac{1}{3} W (e_x + e_y + e_{x'})\big)$, with uniform attention on the first three tokens. We get the gradient of the loss on $y'$ from (28), giving

$$
\begin{aligned}
v_{x,y,x'} &= e_{x'} + \frac{1}{3} W_2 (e_x + e_y + e_{x'}) \\
&= \frac{2}{3} e_{x'} - \frac{1}{3} e_x + \frac{\alpha}{3} u_{g(x)} + \frac{\alpha}{3} u_{g(x')} + \varepsilon_{x,y,x'} =: v_{x,x'} + \varepsilon_{x,y,x'},
\end{aligned}
$$

with $\|\varepsilon_{x,y,x'}\|_\infty = O(1/N)$, since it is the sum of 3 columns of the $O_\infty(1/N)$ term in $W_2$. (Footnote 5: We use the fact that $W_2 e_t = \alpha u_{g(t)} - e_t + O_\infty(1/N)$ for $t \le N$ and $O_\infty(1/N)$ otherwise.)

As in the second step, we have $\|\varepsilon_{x,y,x'}\|_2^2 = O(1/N)$ and $v_{x,x'}^\top \varepsilon_{x,y,x'} = O(1/N)$, so that by Taylor expansion we have $\|v_{x,y,x'}\| = \|v_{x,x'}\| + O(1/N) = \frac{1}{3}\sqrt{5 + 2\alpha^2} + O(1/N)$ for $x \ne x'$, and $\|v_{x,y,x'}\| = \|v_{x,x'}\| + O(1/N) = \frac{1}{3}\sqrt{1 + 4\alpha^2} + O(1/N)$ for $x = x'$ (in which case $v_{x,x} = \frac{1}{3} e_x + \frac{2\alpha}{3} u_{g(x)}$). Note that we once again have $\sigma_{x,y,x',k} := \mathcal{S}(U \cdot \mathcal{N}(v_{x,y,x'}))_k = O(1/N)$ due to the normalization.

On true sequences, we have

$$
\begin{aligned}
-\eta \nabla L_3^T(W_2) ={}& \eta\, \mathbb{E}_{x,x'}\left[\frac{1}{3\|v_{x,y,x'}\|} \Big(I - \frac{v_{x,y,x'} v_{x,y,x'}^\top}{\|v_{x,y,x'}\|^2}\Big) u_{g(x')} (e_x + e_{g(x)} + e_{x'})^\top\right] \\
&- \eta \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\left[\frac{\sigma_{x,g(x),x',k}}{3\|v_{x,y,x'}\|} \Big(I - \frac{v_{x,y,x'} v_{x,y,x'}^\top}{\|v_{x,y,x'}\|^2}\Big) u_k (e_x + e_{g(x)} + e_{x'})^\top\right].
\end{aligned}
$$

Let us first show that the error terms $\varepsilon_{x,y,x'}$ lead to negligible contributions to the gradient.
Note that we have

$$
\begin{aligned}
v_{x,y,x',k} &:= \frac{1}{3\|v_{x,y,x'}\|} \Big(I - \frac{v_{x,y,x'} v_{x,y,x'}^\top}{\|v_{x,y,x'}\|^2}\Big) u_k \\
&= \frac{1}{3\big(\|v_{x,x'}\| + O(\frac{1}{N})\big)} \left(I - \frac{\big(v_{x,x'} + O_\infty(\frac{1}{N})\big)\big(v_{x,x'} + O_\infty(\frac{1}{N})\big)^\top}{\|v_{x,x'}\|^2 + O(\frac{1}{N})}\right) u_k \\
&= \frac{1}{3\|v_{x,x'}\|} \Big(I - \frac{v_{x,x'} v_{x,x'}^\top}{\|v_{x,x'}\|^2}\Big) u_k + \varepsilon_{x,y,x',k},
\end{aligned}
$$

with $\|\varepsilon_{x,y,x',k}\|_\infty = O(1/N)$, where we used $\frac{1}{a + \epsilon} = \frac{1}{a} + O(\epsilon)$ for $a > 0$, and $(u + \epsilon)(u + \epsilon)^\top = u u^\top + O_\infty(1/N)$ for $\epsilon = O_\infty(1/N)$. Then, taking expectations with respect to independent $x, x'$, it is easy to check that

$$
\begin{aligned}
\mathbb{E}_{x,x'}\big[\varepsilon_{x,g(x),x',g(x')} (e_x + e_{g(x)} + e_{x'})^\top\big] &= O_\infty(1/N^2), \\
\mathbb{E}_{x,x'}\big[\sigma_{x,g(x),x',k}\, \varepsilon_{x,g(x),x',k} (e_x + e_{g(x)} + e_{x'})^\top\big] &= O_\infty(1/N^3).
\end{aligned}
$$

The gradient update can then be rewritten as

$$
\begin{aligned}
-\eta \nabla L_3^T(W_2) ={}& \eta\, \mathbb{E}_{x,x'}\left[\frac{1}{3\|v_{x,x'}\|} \Big(I - \frac{v_{x,x'} v_{x,x'}^\top}{\|v_{x,x'}\|^2}\Big) u_{g(x')} (e_x + e_{g(x)} + e_{x'})^\top\right] && (31) \\
&- \eta \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\left[\frac{\sigma_{x,g(x),x',k}}{3\|v_{x,x'}\|} \Big(I - \frac{v_{x,x'} v_{x,x'}^\top}{\|v_{x,x'}\|^2}\Big) u_k (e_x + e_{g(x)} + e_{x'})^\top\right] && (32) \\
&+ O_\infty(\eta/N^2). && (33)
\end{aligned}
$$

We now check that the second term is also of order $O_\infty(\eta/N^2)$. Indeed, the projector matrix $\frac{1}{3\|v_{x,x'}\|}\big(I - \frac{v_{x,x'} v_{x,x'}^\top}{\|v_{x,x'}\|^2}\big)$ is sparse with rows of bounded $\ell_1$ norm, and we have $\sum_k \sigma_{x,g(x),x',k}\, u_k = O_\infty(1/N)$, so that

$$
\sum_{k=1}^{2N} \frac{\sigma_{x,g(x),x',k}}{3\|v_{x,x'}\|} \Big(I - \frac{v_{x,x'} v_{x,x'}^\top}{\|v_{x,x'}\|^2}\Big) u_k =: \zeta_{x,x'} = O_\infty(1/N),
$$

for all $x, x'$.
We then have

$$
\begin{aligned}
\sum_{k=1}^{2N} & \mathbb{E}_{x,x'}\left[\frac{\sigma_{x,g(x),x',k}}{3\|v_{x,x'}\|} \Big(I - \frac{v_{x,x'} v_{x,x'}^\top}{\|v_{x,x'}\|^2}\Big) u_k (e_x + e_{g(x)} + e_{x'})^\top\right] \\
&= \mathbb{E}_{x,x'}\big[\zeta_{x,x'} (e_x + e_{g(x)} + e_{x'})^\top\big] \\
&= \frac{1}{N} \sum_{x=1}^{N} \mathbb{E}_{x'}[\zeta_{x,x'}]\, e_x^\top + \frac{1}{N} \sum_{x=1}^{N} \mathbb{E}_{x'}[\zeta_{x,x'}]\, e_{g(x)}^\top + \frac{1}{N} \sum_{x'=1}^{N} \mathbb{E}_{x}[\zeta_{x,x'}]\, e_{x'}^\top = O_\infty(1/N^2).
\end{aligned}
$$

For the first term (31), we have

$$
\begin{aligned}
\eta\, \mathbb{E}_{x,x'} & \left[\frac{1}{3\|v_{x,x'}\|} \Big(I - \frac{v_{x,x'} v_{x,x'}^\top}{\|v_{x,x'}\|^2}\Big) u_{g(x')} (e_x + e_{g(x)} + e_{x'})^\top\right] \\
&\overset{(*)}{=} \eta\, \mathbb{E}_{x,x'}\left[\frac{1}{3\|v_{x,x'}\|}\, u_{g(x')} e_{x'}^\top\right] + O_\infty(\eta/N^2) - \eta\, \mathbb{E}_{x,x'}\left[\frac{\alpha (1 + \delta_{g(x),g(x')})}{9\|v_{x,x'}\|^3}\, v_{x,x'} (e_x + e_{g(x)} + e_{x'})^\top\right] \\
&= \frac{\eta \beta_1}{N} \sum_{x=1}^{N} u_{g(x)} e_x^\top - \eta\, \mathbb{E}_{x,x'}\big[\gamma_{x,x'}\, v_{x,x'} (e_x + e_{g(x)} + e_{x'})^\top\big] + O_\infty(\eta/N^2),
\end{aligned}
$$

with

$$
\beta_1 = \mathbb{E}_x\left[\frac{1}{3\|v_{x,1}\|}\right] \quad \text{and} \quad \gamma_{x,x'} = \frac{\alpha (1 + \delta_{g(x),g(x')})}{9\|v_{x,x'}\|^3}.
$$

In $(*)$ we used (i) that $\mathbb{E}_{x,x'}\big[\frac{1}{3\|v_{x,x'}\|}\, u_{g(x')} (e_x + e_{g(x)})^\top\big] = O_\infty(1/N^2)$ thanks to the independence of $x$ and $x'$; and (ii) the fact that $u_{g(x')}^\top v_{x,x'} = \frac{\alpha}{3}(1 + \delta_{g(x),g(x')})$ by definition of $v_{x,x'}$ and thanks to orthogonality. We have

$$
\begin{aligned}
-\eta\, \mathbb{E}_{x,x'} & \big[\gamma_{x,x'}\, v_{x,x'} (e_x + e_{g(x)} + e_{x'})^\top\big] \\
&= -\eta\, \mathbb{E}_x\big[\mathbb{E}_{x'}[\gamma_{x,x'} v_{x,x'} \mid x]\, (e_x + e_{g(x)})^\top\big] - \eta\, \mathbb{E}_{x'}\big[\mathbb{E}_x[\gamma_{x,x'} v_{x,x'} \mid x']\, e_{x'}^\top\big] \\
&\overset{(*)}{=} \frac{\eta \beta_2}{N} \sum_{x=1}^{N} (e_x - \alpha u_{g(x)}) (e_x + e_{g(x)})^\top - \frac{\eta \beta_2}{N} \sum_{x=1}^{N} (2 e_x + \alpha u_{g(x)})\, e_x^\top + O(\eta/N^2) \\
&\overset{(**)}{=} -\frac{\eta \beta_2}{N} \sum_{x=1}^{N} e_x e_x^\top + \frac{\eta \beta_2}{N} \sum_{y=N+1}^{2N} (e_{g^{-1}(y)} - \alpha u_y)\, e_y^\top + O_\infty(\eta/N^2),
\end{aligned}
$$

with

$$
\beta_2 = \frac{1}{3} \mathbb{E}_{x'}[\gamma_{1,x'}] = \frac{1}{3} \mathbb{E}_x[\gamma_{x,1}] = \frac{1}{3N} \gamma_{1,1} + \frac{N-1}{3N} \gamma_{1,2}.
$$
In $(*)$ we condition on $x$ and decompose $v_{x,x'} = A(x) + B(x')$, where $A(x)$ is independent of $x'$; then linearity gives $\mathbb{E}_{x'}[\gamma_{x,x'} v_{x,x'} \mid x] = A(x)\, \mathbb{E}_{x'}[\gamma_{x,x'}] + \mathbb{E}_{x'}[\gamma_{x,x'} B(x')]$. By permutation symmetry of labels, $\mathbb{E}_{x'}[\gamma_{x,x'}] = \mathbb{E}_z[\gamma_{1,z}] = 3\beta_2$, while the $B(x')$ part averages over a uniformly random index and is $O_\infty(1/N)$; the same holds with $x$ and $x'$ swapped for the second bracket. Substituting these two conditionals into the split expectation, replacing outer expectations by uniform sums, and renaming the dummy index yields the displayed $(*)$ line. In $(**)$ we perform the change of variables $y = g(x)$. We have thus shown

$$
-\eta \nabla L_3^T(W_2) = \frac{\eta}{N} \sum_{x=1}^{N} (\beta_1 u_{g(x)} - \beta_2 e_x)\, e_x^\top + \frac{\eta \beta_2}{N} \sum_{y=N+1}^{2N} (e_{g^{-1}(y)} - \alpha u_y)\, e_y^\top + O_\infty(\eta/N^2). \tag{34}
$$

For false sequences, it can be checked that $\eta \nabla L_3^F(W_2) = O_\infty(\eta/N^2)$. Thus, taking step-size $\eta = N/\rho$ yields

$$
\begin{aligned}
W_3 &= W_2 - \eta \nabla L_3(W_2) \\
&= (\alpha + \beta_1) \sum_{x=1}^{N} u_{g(x)} e_x^\top - (1 + \beta_2) \sum_{x=1}^{N} e_x e_x^\top + \beta_2 \sum_{y=N+1}^{2N} (e_{g^{-1}(y)} - \alpha u_y)\, e_y^\top + O_\infty(1/N).
\end{aligned}
$$

∎

E.1.1 Learning (positional) attention.

We now turn to learning the key–query matrix under positional attention, assuming that the value matrix has already been learned with the structure described above. Specifically, we show that the gradient of the key–query matrix on true sequences drives positional attention to focus on the $x'$ token, effectively causing the model to ignore the initial $(x, y)$ pair. This observation may account for the absence of emergence at $\rho = 1$, although it does not explain why emergence still occurs when attention is trainable and $\rho < 1$.
For this part, assume a simple architecture of the following form for the prediction of the fourth token logits given the first three tokens:

$$
F_{W_{KQ}}(z_{1:3}) = U \cdot \sum_{t=1}^{3} \frac{\exp(p_t^\top W_{KQ}\, p_3)}{\sum_{t'=1}^{3} \exp(p_{t'}^\top W_{KQ}\, p_3)}\, W_V e_{z_t},
$$

where $z_{1:3} = (x, y, x')$ are the input tokens and $p_{1:3}$ are the positional embeddings defined in (2). We assume $W_V$ is fixed to the structure in (6)-(7), with $\beta_1 = \beta_2 =: \beta$ for simplicity. We consider the following population loss for $W_{KQ}$ on the last token for true sequences:

$$
L(W_{KQ}) = \mathbb{E}_{x,x'}\big[\ell\big(g(x'), F_{W_{KQ}}(x, g(x), x')\big)\big].
$$

Then, the negative gradient direction at $W_{KQ} = 0$ is given by

$$
-\nabla L(W_{KQ}) = \frac{1}{3} \sum_{t=1}^{3} \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\Big[\big(\mathbb{1}\{g(x') = k\} - \hat p(k \mid x, x')\big)\, u_k^\top W_V e_{z_t}\, (p_t - \bar p_{1:3})\, p_3^\top\Big], \tag{35}
$$

where we denote $z_{1:3} = (x, g(x), x')$ and $\bar p_{1:3} = \frac{1}{3}(p_1 + p_2 + p_3)$, and $\hat p$ are the probability predictions at $W_{KQ} = 0$, which we assume satisfy the following, given the assumed structure on $W_V$, for some small $\epsilon$:

$$
\hat p(k \mid x, x') =
\begin{cases}
(1 - \epsilon)/2, & \text{if } k \in \{g(x), g(x')\} \text{ and } x \ne x', \\
1 - \epsilon, & \text{if } k = g(x) \text{ and } x = x', \\
O(1/N), & \text{otherwise.}
\end{cases}
$$

(Footnote 6: This requires that the early phase is run for long enough so that $\beta$ is large enough, say $O(\log N)$.)

Let us now write the update in (35) as

$$
-\nabla L(W_{KQ}) = \frac{1}{3} \sum_{t=1}^{3} w_t\, (p_t - \bar p_{1:3})\, p_3^\top,
$$

and study the values of the $w_t$. For $t = 1$, we have

$$
\begin{aligned}
w_1 &= \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\Big[\big(\mathbb{1}\{g(x') = k\} - \hat p(k \mid x, x')\big)\, u_k^\top W_V e_x\Big] \\
&= \beta \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\Big[\big(\mathbb{1}\{g(x') = k\} - \hat p(k \mid x, x')\big)\, \mathbb{1}\{g(x) = k\}\Big] \\
&= \beta \sum_{k=N+1}^{2N} \mathbb{E}_x\big[\mathbb{1}\{g(x) = k\}\big]\, \mathbb{E}_{x'}\big[\mathbb{1}\{g(x') = k\}\big] - \beta\, \mathbb{E}_{x,x'}\big[\hat p(g(x) \mid x, x')\big] \\
&= \frac{\beta}{N} - \frac{\beta}{N}(1 - \epsilon) - \frac{\beta (N^2 - N)}{N^2} \cdot \frac{1 - \epsilon}{2} \le -\beta\, \frac{1 - \epsilon}{2} + O(\beta/N).
\end{aligned}
$$
For $t = 2$, we have

$$
\begin{aligned}
w_2 &= \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\Big[\big(\mathbb{1}\{g(x') = k\} - \hat p(k \mid x, x')\big)\, u_k^\top W_V e_{g(x)}\Big] \\
&= -\beta \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\Big[\big(\mathbb{1}\{g(x') = k\} - \hat p(k \mid x, x')\big)\, \mathbb{1}\{x = k\}\Big] \\
&= 0 + \beta\, \mathbb{E}_{x,x'}\big[\hat p(x \mid x, x')\big] \\
&= O(\beta/N).
\end{aligned}
$$

For $t = 3$, we have

$$
\begin{aligned}
w_3 &= \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\Big[\big(\mathbb{1}\{g(x') = k\} - \hat p(k \mid x, x')\big)\, u_k^\top W_V e_{x'}\Big] \\
&= \beta \sum_{k=1}^{2N} \mathbb{E}_{x,x'}\Big[\big(\mathbb{1}\{g(x') = k\} - \hat p(k \mid x, x')\big)\, \mathbb{1}\{g(x') = k\}\Big] \\
&= \beta - \beta\, \mathbb{E}_{x,x'}\big[\hat p(g(x') \mid x, x')\big] \\
&\ge \beta - \beta\, \frac{1 - \epsilon}{2} \ge \frac{\beta}{2}.
\end{aligned}
$$

Thus, when $N \gg \beta$, we have $w_3 - w_1,\ w_3 - w_2 \ge \beta/2 + O(\beta/N)$. Taking $W_{KQ}^1 = -\eta \nabla L$, the gap in attention logits between $t = 3$ and $t = 1, 2$ is of order $\eta\beta/6$, so that for $\eta$ large enough, the attention mostly focuses on the third token $x'$.

E.2 Proof of Theorem 1

Suppose we are given $(x, y, x')$, where we assume for simplicity that $x \ne x'$ and $g(x') \ne y$. Denote by $f_W(z_{1:t})$ the output of the model in (4) before applying the LN and the unembedding layer. Then, we have that:

$$
\begin{aligned}
f_W(x, y, x') ={}& e_{x'} + p_3 + \frac{1}{3} \bar\gamma \Big(\sum_y u_y - \sum_x u_x\Big) \\
&+ \frac{1}{3} \big(-\alpha_1 e_x + \beta_1 u_{g(x)} + \alpha_2 e_{g^{-1}(y)} - \beta_2 u_y - \alpha_1 e_{x'} + \beta_1 u_{g(x')}\big).
\end{aligned} \tag{36}
$$

Denote $c_1 := 2 + \frac{\bar\gamma^2 (2N - 2) + 2\alpha_1^2 + \beta_1^2}{9}$ and $c_2 := 2 + \frac{\bar\gamma^2 (2N - 3) + 2\alpha_1^2 + \beta_1^2}{9}$. For a true sample, where $y = g(x)$, we have that:

$$
\|f_W(x, g(x), x')\|^2 = c_1 + (\beta_1 - \beta_2 + \bar\gamma)^2 + (\beta_1 + \bar\gamma)^2 .
$$

Hence, after applying the LN and unembedding layer we have that:

$$
\begin{aligned}
\big(F_W(x, g(x), x')\big)_{g(x')} &= \frac{\beta_1 + \bar\gamma}{3\sqrt{c_1 + (\beta_1 - \beta_2 + \bar\gamma)^2 + (\beta_1 + \bar\gamma)^2}}, \\
\max_{y' \ne g(x')} \big(F_W(x, g(x), x')\big)_{y'} &= \frac{\bar\gamma + \max(0, \beta_1 - \beta_2)}{3\sqrt{c_1 + (\beta_1 - \beta_2 + \bar\gamma)^2 + (\beta_1 + \bar\gamma)^2}} .
\end{aligned}
$$

For a false sample, where $y \ne g(x)$, we have that:

$$
\|f_W(x, y, x')\|^2 = c_2 + 2(\beta_1 + \bar\gamma)^2 + (-\beta_2 + \bar\gamma)^2 .
$$
Hence, after applying the LN and unembedding layer we have that:

$$
\begin{aligned}
\big(F_W(x, y, x')\big)_{g(x')} &= \frac{\beta_1 + \bar\gamma}{3\sqrt{c_2 + 2(\beta_1 + \bar\gamma)^2 + (-\beta_2 + \bar\gamma)^2}}, \\
\max_{y' \ne g(x')} \big(F_W(x, y, x')\big)_{y'} &= \frac{\beta_1 + \bar\gamma}{3\sqrt{c_2 + 2(\beta_1 + \bar\gamma)^2 + (-\beta_2 + \bar\gamma)^2}} .
\end{aligned}
$$

Plugging in these terms finishes the proof.

E.3 Proof of Theorem 2

Figure 16: Illustration of LN-induced linear separability.

Proof. We first describe the output of the model in (4) before applying LN. Denote by $v_T, v_F \in \mathbb{R}^{4N+3}$ these outputs for true and false samples respectively. Recall that a sample $(x, y)$ is true when $y = g(x)$ and false otherwise. Then, we have that:

$$
v_T = e_y + p_2 + \frac{1}{2} \Big((\alpha_2 - \alpha_1)\, e_x + (\beta_1 - \beta_2)\, u_y + (\gamma_1 - \gamma_2) \cdot \Big(\sum_y u_y - \sum_x u_x\Big)\Big) \tag{37}
$$

$$
v_F = e_y + p_2 + \frac{1}{2} \Big(-\alpha_1 e_x + \alpha_2 e_{g^{-1}(y)} + \beta_1 u_{g(x)} - \beta_2 u_y + (\gamma_1 - \gamma_2) \cdot \Big(\sum_y u_y - \sum_x u_x\Big)\Big) \tag{38}
$$

We will first show that without the normalization $\mathcal{N}$ the samples above cannot be separated for general $x$ and $y$. Assume otherwise, that there exists a linear separator $w = (w_1, w_2, w_3, w_4, w_5)^\top$ with $w_1, \dots, w_4 \in \mathbb{R}^N$, $w_5 \in \mathbb{R}^3$ and a bias term $b \in \mathbb{R}$ such that $\langle w, v_T \rangle - b \ge 0$ and $\langle w, v_F \rangle - b < 0$ for every true or false sample respectively. We slightly abuse notation and write $\langle w_1, e_x \rangle$ for $\big\langle (w_1, 0_{3N+3})^\top, e_x \big\rangle$, and similarly when multiplying $w_2$ by $e_y$, $w_3$ by $u_x$, $w_4$ by $u_y$ and $w_5$ by $p_t$. Denote by

$$
c := \frac{1}{2} \Big\langle (\gamma_1 - \gamma_2) \cdot \Big(\sum_y u_y - \sum_x u_x\Big),\ w_3 + w_4 \Big\rangle + \langle w_5, p_2 \rangle
$$

the terms in the inner products that are independent of the sample.
Then, applying the linear separator to four samples, namely the true samples $(x_i, y_i)$ and $(x_j, y_j)$ with $y_i = g(x_i)$, $y_j = g(x_j)$, and the false samples $(x_j, y_i)$ and $(x_i, y_j)$, we have:

$$
b \le (\alpha_2 - \alpha_1) \langle e_{x_i}, w_1 \rangle + \langle e_{y_i}, w_2 \rangle + (\beta_1 - \beta_2) \langle u_{y_i}, w_4 \rangle + c \tag{39}
$$

$$
b \le (\alpha_2 - \alpha_1) \langle e_{x_j}, w_1 \rangle + \langle e_{y_j}, w_2 \rangle + (\beta_1 - \beta_2) \langle u_{y_j}, w_4 \rangle + c \tag{40}
$$

$$
b > \alpha_2 \langle e_{x_i}, w_1 \rangle - \alpha_1 \langle e_{x_j}, w_1 \rangle + \langle e_{y_i}, w_2 \rangle + \beta_1 \langle u_{y_j}, w_4 \rangle - \beta_2 \langle u_{y_i}, w_4 \rangle + c \tag{41}
$$

$$
b > \alpha_2 \langle e_{x_j}, w_1 \rangle - \alpha_1 \langle e_{x_i}, w_1 \rangle + \langle e_{y_j}, w_2 \rangle + \beta_1 \langle u_{y_i}, w_4 \rangle - \beta_2 \langle u_{y_j}, w_4 \rangle + c . \tag{42}
$$

The inequalities (41) and (42) are strict because false samples satisfy $\langle w, v_F \rangle - b < 0$. Adding up (41) and (42) we have that:

$$
\begin{aligned}
2b - 2c >{}& (\alpha_2 - \alpha_1) \langle e_{x_j}, w_1 \rangle + \langle e_{y_j}, w_2 \rangle + (\beta_1 - \beta_2) \langle u_{y_j}, w_4 \rangle && (43) \\
&+ (\alpha_2 - \alpha_1) \langle e_{x_i}, w_1 \rangle + \langle e_{y_i}, w_2 \rangle + (\beta_1 - \beta_2) \langle u_{y_i}, w_4 \rangle, && (44)
\end{aligned}
$$

which contradicts the sum of (39) and (40). This means that there is no linear separator, regardless of the values of the parameters, which proves the first item.

Assume now that there is layer normalization after the prediction as in (4). This means that the output of the model is $\frac{v}{\|v\|}$. Consider the linear predictor $w = p_2$, and a bias term $b$ that will be determined later. Then, the output of the linear predictor is exactly $\big\langle w, \frac{v}{\|v\|} \big\rangle = \frac{1}{\|v\|}$. We will now calculate the norm of both true and false samples. For a true sample $(x, g(x))$ we have that:

$$
\|v_T\|^2 = 2 + (\alpha_2 - \alpha_1)^2 + (\gamma_1 - \gamma_2)^2 \cdot (2N - 1) + (\gamma_1 - \gamma_2 + \beta_1 - \beta_2)^2 . \tag{45}
$$

For a false sample $(x, y)$ with $g(x) \ne y$ we have:

$$
\|v_F\|^2 = 2 + \alpha_1^2 + \alpha_2^2 + (\gamma_1 - \gamma_2)^2 \cdot (2N - 2) + (\gamma_1 - \gamma_2 + \beta_1)^2 + (\gamma_1 - \gamma_2 - \beta_2)^2 . \tag{46}
$$

There exists a linear separator as long as $\frac{1}{\|v_F\|} - \frac{1}{\|v_T\|} \ne 0$. Since the vectors $v_T$ and $v_F$ are both non-zero, this is equivalent to $\|v_T\|^2 \ne \|v_F\|^2$.
By the above calculation, we have that:

$$
\begin{aligned}
\|v_F\|^2 - \|v_T\|^2 &= \alpha_1^2 + \alpha_2^2 - (\alpha_1 - \alpha_2)^2 - (\gamma_1 - \gamma_2)^2 + (\gamma_1 - \gamma_2 + \beta_1)^2 + (\gamma_1 - \gamma_2 - \beta_2)^2 - (\gamma_1 - \gamma_2 + \beta_1 - \beta_2)^2 \\
&= 2\alpha_1\alpha_2 + 2\beta_1\beta_2 .
\end{aligned}
$$

This shows that if $2\alpha_1\alpha_2 + 2\beta_1\beta_2 \ne 0$ then we have a linear separation between true and false samples. Further assuming that $\alpha_1 = \alpha_2 =: \alpha$, $\beta_1 = \beta_2 =: \beta$ and $\gamma_1 = \gamma_2$, we have that $\|v_T\|^2 = 2$ and $\|v_F\|^2 = 2 + 2\alpha^2 + 2\beta^2$. To find the optimal margin for this predictor we pick:

$$
b = \frac{1}{2} \cdot \Big(\frac{1}{\|v_T\|} - \frac{1}{\|v_F\|}\Big) = \frac{1}{2\sqrt{2}} \Big(1 - \frac{1}{\sqrt{1 + \alpha^2 + \beta^2}}\Big) .
$$

We will now prove that there is linear separation after predicting the $x'$ token. Using the output of the model as in (4) we get:

$$
v_T = C + \frac{1}{3} \big((\alpha_2 - \alpha_1)\, e_x + (\beta_1 - \beta_2)\, u_y - \alpha_1 e_{x'} + \beta_1 u_{g(x')}\big) \tag{47}
$$

$$
v_F = C + \frac{1}{3} \big(-\alpha_1 e_x + \alpha_2 e_{g^{-1}(y)} + \beta_1 u_{g(x)} - \beta_2 u_y - \alpha_1 e_{x'} + \beta_1 u_{g(x')}\big), \tag{48}
$$

where $C = e_{x'} + p_3 + \frac{\bar\gamma}{3} \cdot \big(\sum_y u_y - \sum_x u_x\big)$, matching (36). We can now calculate:

$$
\|v_T\|^2 = 2 + \frac{1}{9} \big((\alpha_2 - \alpha_1)^2 + (\beta_1 - \beta_2 + \bar\gamma)^2 + \alpha_1^2 + (\beta_1 + \bar\gamma)^2 + (2N - 2)\bar\gamma^2\big) \tag{49}
$$

$$
\|v_F\|^2 = 2 + \frac{1}{9} \big(2\alpha_1^2 + \alpha_2^2 + 2(\beta_1 + \bar\gamma)^2 + (\bar\gamma - \beta_2)^2 + (2N - 3)\bar\gamma^2\big) . \tag{50}
$$

We now have that:

$$
\begin{aligned}
\|v_F\|^2 - \|v_T\|^2 &= \frac{1}{9} \cdot \big(\alpha_1^2 + \alpha_2^2 + (\beta_1 + \bar\gamma)^2 + (\bar\gamma - \beta_2)^2 - (\alpha_2 - \alpha_1)^2 - (\beta_1 - \beta_2 + \bar\gamma)^2 - \bar\gamma^2\big) \\
&= \frac{2}{9} (\alpha_1\alpha_2 + \beta_1\beta_2) .
\end{aligned}
$$

By a similar argument to the previous case, if $\alpha_1\alpha_2 + \beta_1\beta_2 \ne 0$ then there is linear separation between true and false samples. Further assuming that $\alpha_1 = \alpha_2 =: \alpha$, $\beta_1 = \beta_2 =: \beta$ and $\bar\gamma = 0$, equations (49) and (50) give $\|v_T\|^2 = 2 + \frac{1}{9}(\alpha^2 + \beta^2)$ and $\|v_F\|^2 = 2 + \frac{1}{3}(\alpha^2 + \beta^2)$, so to find the optimal margin for the predictor we pick:

$$
b = \frac{1}{2} \cdot \Big(\frac{1}{\|v_T\|} - \frac{1}{\|v_F\|}\Big) = \frac{1}{2} \left(\frac{1}{\sqrt{2 + \frac{1}{9}(\alpha^2 + \beta^2)}} - \frac{1}{\sqrt{2 + \frac{1}{3}(\alpha^2 + \beta^2)}}\right) .
$$
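The norm gap behind this separability argument can be verified numerically for the three-token case. The sketch below builds the pre-normalization vectors following eq. (36) and compares $\|v_F\|^2 - \|v_T\|^2$ with $\frac{2}{9}(\alpha_1\alpha_2 + \beta_1\beta_2)$. The parameter values, the bijection `g`, the 0-based indexing, and the helper `pre_ln` are hypothetical choices made for illustration.

```python
import numpy as np

N = 8
# hypothetical values of alpha_1, alpha_2, beta_1, beta_2, gamma-bar
a1, a2, b1, b2, gam = 0.7, 1.1, 0.9, 0.5, 0.3
g = N + np.arange(N)[::-1]            # hypothetical bijection: x -> g(x) in {N, ..., 2N-1}

d = 4 * N + 3
I = np.eye(d)
e = I[:2 * N]                         # e[z]: input embedding of token z
u = I[2 * N:4 * N]                    # u[z]: unembedding of token z
p3 = I[4 * N + 2]                     # third positional embedding

def pre_ln(x, y, xp):
    """Pre-normalization output for the sequence (x, y, x'), following eq. (36)."""
    ginv_y = int(np.where(g == y)[0][0])
    C = e[xp] + p3 + gam / 3 * (u[N:].sum(axis=0) - u[:N].sum(axis=0))
    return C + (-a1 * e[x] + b1 * u[g[x]] + a2 * e[ginv_y] - b2 * u[y]
                - a1 * e[xp] + b1 * u[g[xp]]) / 3

x, xp = 0, 1
v_true = pre_ln(x, g[x], xp)          # y = g(x)
v_false = pre_ln(x, g[2], xp)         # y differs from both g(x) and g(x')

gap = v_false @ v_false - v_true @ v_true
predicted = 2 / 9 * (a1 * a2 + b1 * b2)

# After normalization, the probe w = p_3 reads off 1 / ||v||.
score_true = p3 @ v_true / np.linalg.norm(v_true)
score_false = p3 @ v_false / np.linalg.norm(v_false)
print(gap, predicted, score_true, score_false)
```

With positive parameters the false vector has the larger norm, so after normalization its coordinate along $p_3$ is smaller, and a threshold on that single coordinate separates the two cases.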
∎

E.4 Evaluating checkpoints of a "real" LM

To test whether the two-phase dynamics also appear in a large model trained on open-web data, we analyzed the Pythia-6.9B training checkpoints released by EleutherAI. Using the CounterFact dataset we construct each input by concatenating $K = 4$ factual statements whose preceding context is either entirely true or entirely false, mirroring our previous setup. For every checkpoint we measure three signals on the final statement:

• Memorization: percentage of cases where greedy decoding succeeds in completing the correct token.
• Uncertainty: entropy of the model's full-vocabulary distribution for predicting the last token; we record the difference between true and false context.
• Linear separability: accuracy of a linear probe trained to classify the truth value of the surrounding context.

Table 1: Pythia-6.9B metrics across training steps.

| step | ΔH | memorization | probe AUC |
|---|---|---|---|
| 0 | 0.001 | 0.000 | 0.383 |
| 512 | 0.006 | 0.000 | 0.435 |
| 1000 | 0.005 | 0.006 | 0.467 |
| 3000 | 0.219 | 0.242 | 0.587 |
| 5000 | 0.217 | 0.435 | 0.648 |
| 10000 | 0.286 | 0.547 | 0.667 |
| 20000 | 0.355 | 0.655 | 0.754 |
| 40000 | 0.329 | 0.727 | 0.759 |
| 60000 | 0.421 | 0.772 | 0.802 |
| 80000 | 0.419 | 0.822 | 0.799 |
| 100000 | 0.479 | 0.835 | 0.818 |
| 110000 | 0.485 | 0.849 | 0.835 |
| 120000 | 0.536 | 0.842 | 0.835 |
| 130000 | 0.565 | 0.858 | 0.783 |
| 143000 | 0.518 | 0.875 | 0.831 |

Notes. ΔH denotes the entropy gap between matched prompt pairs presented with false versus true context. The memorization rate is the share of instances in which the model's output distribution places the correct continuation token at the top-1 position. "Probe AUC" is the ROC-AUC of a linear classifier trained to predict the surrounding context's truth value from model activations.

Findings.

Early training (≤ 1k steps). ΔH ≈ 0; the model memorizes indiscriminately.

Mid training (3k–80k). Memorization jumps, then plateaus, while ΔH and probe accuracy climb steadily.

Late training (≥ 80k).
Entropy separation continues to widen even after memorization saturates, mirroring Phase 2 but over a longer horizon. Overall, this echoes the two-phase pattern observed in simpler experiments: an initial jump in memorization followed by a slower, steadier increase in entropy separation. Differences remain: the second phase is more gradual, and classification and entropy increase even before memorization stabilizes. We hypothesize this stems from continual exposure to new facts during training, unlike our idealized setup where all facts are seen in a single gradient step. The modest terminal memorization and classification scores are consistent with reports that the Pythia series is under-trained relative to its capacity. Finally, in Pythia-6.9B we do not find evidence that layer normalization itself induces linear separability; rather, a linearly decodable truth signal emerges gradually with depth across many layers. Our aim was to advance one plausible mechanism for the phenomenon observed in pretrained LMs, not to claim uniqueness. Given the model’s deeper architecture—with numerous layers and MLP blocks—and the richness of natural-language data, additional or distinct mechanisms are likely at play. A systematic study of these mechanisms is an important direction for future work.
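The three per-checkpoint signals described in E.4 can be computed from model outputs along the following lines. This is a minimal sketch on synthetic stand-in data, not the evaluation code used for Table 1: the function names, array shapes, and the random logits are all assumptions, and in practice the logits and probe scores would come from a checkpoint evaluated on the CounterFact prompts.

```python
import numpy as np

def entropy(logits):
    """Row-wise Shannon entropy (in nats) of softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(logp) * logp).sum(axis=-1)

def memorization_rate(logits, targets):
    """Share of rows whose top-1 token equals the correct continuation."""
    return float((logits.argmax(axis=-1) == targets).mean())

def roc_auc(scores, labels):
    """ROC-AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Synthetic stand-ins for final-statement logits under true and false context.
rng = np.random.default_rng(0)
V, n = 50, 200                                   # hypothetical vocabulary and sample sizes
targets = rng.integers(V, size=n)
logits_true = rng.normal(size=(n, V))
logits_true[np.arange(n), targets] += 6.0        # sharper predictions after true context
logits_false = rng.normal(size=(n, V))
logits_false[np.arange(n), targets] += 3.0       # flatter predictions after false context

delta_H = entropy(logits_false).mean() - entropy(logits_true).mean()
mem = memorization_rate(logits_true, targets)
auc = roc_auc(np.array([0.9, 0.8, 0.3, 0.1]), np.array([1, 1, 0, 0]))
print(delta_H, mem, auc)
```

Here `delta_H` follows the convention in the table notes (false-context entropy minus true-context entropy), and `roc_auc` would be applied to the held-out scores of whichever linear probe is trained on the activations.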