Paper deep dive
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F. Spies, William Edwards, Michael Ivanitskiy, Adrians Skapars, Tilman Räuker, Katsumi Inoue, Alessandra Russo, Murray Shanahan
Models: Stan (6-layer transformer, learned positional embeddings, 19.2M params), Terry (6-layer transformer, rotary embeddings, 18.9M params)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 6:16:39 PM
Summary
This paper investigates the existence of 'World Models' (WMs) in transformer models trained on maze-solving tasks. Using Sparse Autoencoders (SAEs) and attention pattern analysis, the authors demonstrate that transformers develop structured, causally relevant internal representations of maze connectivity, and that activating these features is easier than suppressing them. They find that models can act coherently on mazes with more connections than seen during training when the extra connections are activated directly in the SAE latent space, but fail when the same information is provided via (longer) input token sequences. The study also highlights that positional encoding schemes influence how these world models are structured within the residual stream.
Entities (6)
Relation Signals (3)
Transformer → develops → World Model
confidence 95% · we identify WMs in transformers trained on maze-solving tasks.
Sparse Autoencoder → identifies → World Model
confidence 90% · SAEs require no such assumptions to isolate disentangled features
Positional Encoding → influences → World Model
confidence 90% · positional encoding schemes appear to influence how World Models are structured within the model's residual stream.
Cypher Suggestions (2)
Identify relations between models and their internal representations · confidence 95% · unvalidated
MATCH (m:Model)-[r:DEVELOPS]->(wm:Representation {type: 'World Model'}) RETURN m.name, r.confidence
Find all models that utilize a specific interpretability tool · confidence 90% · unvalidated
MATCH (m:Model)-[:USES]->(t:Tool {name: 'Sparse Autoencoder'}) RETURN m.name
Abstract
Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop highly structured representations. When such representations comprehensively reflect the task domain's structure, they are commonly referred to as "World Models" (WMs). In this work, we identify WMs in transformers trained on maze-solving tasks. By using Sparse Autoencoders (SAEs) and analyzing attention patterns, we examine the construction of WMs and demonstrate consistency between SAE feature-based and circuit-based analyses. By subsequently intervening on isolated features to confirm their causal role, we find that it is easier to activate features than to suppress them. Furthermore, we find that models can reason about mazes involving more simultaneously active features than they encountered during training; however, when these same mazes (with greater numbers of connections) are provided to models via input tokens instead, the models fail. Finally, we demonstrate that positional encoding schemes appear to influence how World Models are structured within the model's residual stream.
Tags
Links
- Source: https://arxiv.org/abs/2412.11867
- Canonical: https://arxiv.org/abs/2412.11867
PDF not stored locally. Use the link above to view on the source site.
Full Text
63,318 characters extracted from source content.
Published as a conference paper at ICLR 2025

TRANSFORMERS USE CAUSAL WORLD MODELS IN MAZE-SOLVING TASKS

Alex F. Spies (1*), William Edwards, Michael Ivanitskiy (2), Adrians Skapars (3), Tilman Räuker, Katsumi Inoue (4), Alessandra Russo (1), Murray Shanahan (1)
(1) Imperial College London, (2) Colorado School of Mines, (3) University of Manchester, (4) National Institute of Informatics

ABSTRACT

Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop highly structured representations. When such representations comprehensively reflect the task domain's structure, they are commonly referred to as "World Models" (WMs). In this work, we identify WMs in transformers trained on maze-solving tasks. By using Sparse Autoencoders (SAEs) and analyzing attention patterns, we examine the construction of WMs and demonstrate consistency between SAE feature-based and circuit-based analyses. By subsequently intervening on isolated features to confirm their causal role, we find that it is easier to activate features than to suppress them. Furthermore, we find that models can reason about mazes involving more simultaneously active features than they encountered during training; however, when these same mazes (with greater numbers of connections) are provided to models via input tokens instead, the models fail. Finally, we demonstrate that positional encoding schemes appear to influence how World Models are structured within the model's residual stream.

1 INTRODUCTION

When machine learning systems acquire representations which reflect the underlying structure of the tasks they are trying to solve, these are often referred to as "World Models" (WMs) - the discovery of such structured representations in Large Language Models (LLMs) has gained significant traction of late, though often on models trained on diverse, complex datasets (Belrose et al., 2023; Lieberum et al., 2023; Olsson et al., 2022). In an attempt to seek a more comprehensive understanding, our work examines WMs acquired by transformers (Vaswani, 2017) in a controlled, synthetic environment. In particular, we use maze-solving tasks (Subsection 2.1) as an ideal testbed for understanding learned WMs due to their human-understandable structure, controllable complexity, and relevance to spatial reasoning. Using this constrained domain, we can rigorously analyze how transformers trained (Subsection 2.2) to solve mazes construct and utilize internal representations of their environment.

Our methodology leverages sparse autoencoders (SAEs) (Bricken et al., 2023) to overcome the limitations of linear probes in detecting WM features. While linear probes can and have been used to identify latent directions associated with features defined in an imposed ontology, SAEs require no such assumptions to isolate disentangled features (Section 3). By manipulating specific features identified by the SAEs and observing the impact on our models' maze-solving behavior, we provide strong evidence that these features are causally involved in the model's decision-making process (Section 4). This stands in contrast to prior work analyzing internal representations in maze settings where no causal features were able to be isolated (Ivanitskiy et al., 2024). Our findings provide important considerations for AI interpretability and alignment.
By investigating how transformers form causal WMs even in relatively simple tasks, we hope to provide new avenues for understanding representations and potentially steering behavior in more complex AI systems - laying the groundwork for future research into how we might impose constraints on AI systems in terms of the features they acquire as a result of training.

* Corresponding Author. Correspondence to afspies@imperial.ac.uk

Figure 1: Overview of our methodology for discovering and validating world models in transformer-based maze solvers. (A) We analyze attention patterns in early layers, finding heads that consolidate maze connectivity information at semicolon tokens. (B) We train sparse autoencoders on the residual stream immediately following the first block, identifying interpretable features that encode maze connectivity. (C) We demonstrate the causal role of the world models in our transformers by comparing the features extracted through both methods and validating them through causal interventions.

We outline our methodology in Figure 1. In short, we begin in Subsection 3.1 by identifying attention heads that appear to construct world model features by examining their attention patterns across maze coordinate tokens (A). We validate these findings in Subsection 3.2 by training SAEs on the residual stream and demonstrating that the extracted features match those found from attention analyses (B). Lastly, in Subsection 3.3 we establish the causal nature of these representations through targeted interventions, showing that perturbing specific features produces predictable changes in the model's maze-solving behavior (C).

Contributions

- Empirical Findings: We show that transformers trained to solve mazes acquire causally relevant WMs - intervening upon these in the latent space of SAEs predictably affects model behaviour. Surprisingly, we find that interventions that activate features are more effective than those that remove them, suggesting an asymmetry in how transformers utilize WM features. Furthermore, we find that activating additional SAE features can result in coherent behaviour from models, even when attempting to achieve the same activations through (necessarily longer) input sequences would cause models to fail.
- Methodological Insights: By effectively applying decision trees to isolate WM features in SAEs, we demonstrate that transformers utilizing different encoding schemes may use varyingly compositional codes to represent their WMs. More generally, our analyses suggest that SAEs are generally better suited than linear probes to isolate WMs, even in the absence of feature splitting [1], as no assumptions are required about the form of the WM.

[1] Feature splitting refers to the phenomenon in which a single feature (e.g. a connection between two specific coordinates) can be represented by multiple elements of the SAE latent vector (Bricken et al., 2023).

2 PRELIMINARIES

2.1 ENVIRONMENT

Though it remains a matter of debate whether Large Language Models (LLMs) construct structured internal models of the real world, we can begin to understand the representations acquired by such models by focusing on "toy" tasks with clear spatial or temporal structure (Brinkmann et al., 2024; Momennejad et al., 2024; Jenner et al., 2024; McGrath et al., 2022).
Previous works along these lines (Li et al., 2022; Ivanitskiy et al., 2024; Nanda, 2023; Karvonen, 2024; He et al., 2024) have found a variety of both correlational and causal evidence for internal models of the environment within trained transformers. In this work, we utilize maze-dataset (Ivanitskiy et al., 2023), a package providing maze-solving tasks as well as ways of turning these tasks into text representations. In particular, we use a dataset of mazes consisting of up to 7×7 grids, generated via constrained randomized depth-first search (which produces mazes that are acyclic and thus have a unique solution).

To train autoregressive transformers to solve such mazes, we employed a tokenization scheme provided by maze-dataset, shown in Figure 2. This scheme is designed to present the maze structure, start and end points, and solution path in a format amenable to transformer processing whilst remaining straightforward to analyse with standard tools from the mechanistic interpretability literature - primarily due to the existence of a unique token for every position in the maze (aka "lattice").

<ADJLIST_START> (0,0)<-->(1,0); (2,0)<-->(3,0); (4,1)<-->(4,0); (2,0)<-->(2,1); (1,0)<-->(1,1); (3,4)<-->(2,4); (3,1)<-->(3,2); ... (1,3)<-->(1,4); <ADJLIST_END>
<ORIGIN_START> (1,3) <ORIGIN_END> <TARGET_START> (2,3) <TARGET_END>
<PATH_START> (1,3) (0,3) (0,2) (1,2) (2,2) (2,1) (2,0) (3,0) (4,0) (4,1) (4,2) (4,3) (4,4) (3,4) (2,4) (2,3) <PATH_END>

Figure 2: Tokenization scheme and visualization of a shortest-path maze task generated using Ivanitskiy et al. (2023). (a) An example of a tokenized maze. 1: The adjacency list describes the connectivity of the maze, with the semicolon token ; delimiting consecutive connections. The order of connections is randomized; the ellipsis represents omitted connection pairs. 2, 3: The origin and target specify where the path should begin and end, respectively. 4: The path itself is a sequence of coordinate tokens representing the shortest path from the origin to the target. For a "rollout," we provide everything up to (and including) the <PATH_START> token and autoregressively sample with argmax until a <PATH_END> token is produced. (b) Visual representation of the same maze as in the tokenized representation on the left. The origin is indicated in green, the target in red, and the path in blue.

2.2 MAZE SOLVING TRANSFORMERS

Utilizing the tokenized representations of mazes provided by the maze-dataset library, a suite of transformer models implemented using TransformerLens (Nanda & Bloom, 2022) were trained to predict solution paths in acyclic mazes. We performed extensive hyperparameter sweeps (Figure 10) over several variants of the transformer architecture, yielding models with stronger generalization performance than those found by prior work (Ivanitskiy et al., 2024).

To allow testing of generalization to larger maze sizes, the models were trained on 5×5 fully-connected and 6×6 sparsely connected mazes, embedded in a 7×7 lattice. This ensured that all coordinate tokens in the 7×7 vocabulary had been seen during training time, such that generalization to 7×7 mazes was conceivable but out-of-distribution during inference.
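As a concrete reference for the rollout procedure described in Figure 2 (greedy decoding from <PATH_START> until <PATH_END>), here is a minimal sketch; `model` is assumed to be any callable returning next-token logits (e.g. a TransformerLens HookedTransformer), and the function and variable names are illustrative rather than taken from the authors' code.

```python
import torch

def greedy_rollout(model, prompt_ids, path_end_id, max_new_tokens=64):
    """Greedy (argmax) rollout: the prompt contains the adjacency list, origin,
    target and the <PATH_START> token; we then decode deterministically until
    <PATH_END> appears or a step budget is exhausted."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))     # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())   # argmax over the vocabulary
        ids.append(next_id)
        if next_id == path_end_id:              # path is complete
            break
    return ids
```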
For our experiments, we investigated the two best performing models for each positional embedding (Su et al., 2024) scheme, as shown in Table 1. Note that whilst these models had different numbers of heads, their parameter counts varied only slightly - on account of Stan's use of learned positional embeddings.

Table 1: Models chosen for mechanistic investigation (most performant in the sweep, given their respective position embedding schemes). The number of parameters varies as Stan learns position encodings (W_pos ∈ R^(512×512)).
- Stan: standard learned positional embeddings, d_model 512, n_layers 6, n_heads 8, 19,225,660 parameters, 96.6% maze-solving accuracy
- Terry: rotary positional embeddings, d_model 512, n_layers 6, n_heads 4, 18,963,516 parameters, 94.3% maze-solving accuracy

3 DISCOVERING WORLD MODELS

Broadly speaking, there are two ways to go about trying to identify internal world models: 1) assuming the form of the world model and inspecting the transformer with e.g. supervised probes to see if this world model is present (Nanda, 2023; Ivanitskiy et al., 2024), or 2) exploring the model internals and investigating any structure which may be present in the representations to see if something akin to a world model exists. In our work we take both approaches.

First, in Subsection 3.1 we investigate attention heads in the earliest layer of our models and find heads specialising in the construction of representations akin to a world model. On the basis of this, we use SAEs (an unsupervised method) alongside supervised classifiers to identify latent features corresponding to the world model. Finally, we use patching experiments and interventions to show that both investigations yield consistent features, and that these form a causal world model with some interesting properties.

3.1 WORLD MODEL CONSTRUCTION: CONNECTIVITY ATTENTION HEADS

We began by examining the attention patterns of the maze-solving transformer models and uncovered a notable pattern: in both models, some or all of the attention heads at the first layer ("layer 0") appear to consolidate information about maze connections into the ; context positions. In particular, at every 4i-th context position (the semicolon separator tokens ;), these heads attend back 1 or 3 tokens - that is, to one of the two coordinate tokens corresponding to the connection preceding the ; token. This pattern is observable for 3/8 L0 heads in Stan (Figure 3) and 4/4 L0 heads in Terry (Figure 14).

This observation suggests the hypothesis that these heads are in essence constructing a world model for the maze task, for use by later layers. If this were the case, then we should expect that the output of these heads, mediated by the "OV-Circuit" (Elhage et al., 2021), should consist of combinations of the coordinates captured in a given connection. This can be measured by taking the W_OV matrix for each head and measuring the cosine distance between its elements and the model's token embeddings (where coordinate directions are directly given) [2]. With this in mind, we investigated the structure of these vectors more closely. We find an intriguing pattern in the magnitudes of these vectors in the Stan model (Figure 4), while the patterns in Terry were less clear cut (Figure 5).

[2] As we analyze the first attention layer, we can ignore potential "residual drift" in the representations of a given maze coordinate between early and later layers in our transformers (Belrose et al., 2023).
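This W_OV measurement can be sketched as follows; W_E, W_V and W_O are assumed to be plain tensors extracted from the model for a single head (shapes are illustrative), not the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F

def ov_cell_magnitudes(W_E, W_V, W_O, coord_token_ids):
    """Apply one head's OV circuit (W_OV = W_V @ W_O) to the embeddings of the
    maze-coordinate tokens and return the norm of each output vector - the
    per-cell magnitudes plotted in Figures 4 and 5.
    Assumed shapes: W_E (vocab, d_model), W_V (d_model, d_head), W_O (d_head, d_model)."""
    W_OV = W_V @ W_O                              # (d_model, d_model)
    writes = W_E[coord_token_ids] @ W_OV          # what the head writes for each cell token
    return writes.norm(dim=-1)                    # (n_cells,)

def ov_embedding_cosine(W_E, W_V, W_O, coord_token_ids):
    """Cosine similarity between each OV output and the corresponding token
    embedding, i.e. how directly a head copies coordinate directions forward."""
    writes = W_E[coord_token_ids] @ (W_V @ W_O)
    return F.cosine_similarity(writes, W_E[coord_token_ids], dim=-1)
```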
Figure 3: Attention values for heads L0H3, L0H5, and L0H7 in Stan. We use a rather nonstandard representation, looking only at a fixed window into the past of which tokens are attended to by semicolon ; tokens. Every 4th position, up to 140, is shown along the x-axis. Color shows attention to positions 1, 3, 5, and 7 earlier in the context (shown along the y-axis), for an example 6x6 maze input. This sort of pattern is typical across all inputs examined. Up until context position 100, the heads are attending 1 and 3 positions back; after this the pattern shifts to 5 and 7 back. Note the complementary attention patterns of L0H3 and L0H7. Closer examination shows that L0H3 prefers to direct its attention to 'even-parity' maze cells, with L0H7 preferring 'odd-parity' cells. L0H5 more frequently splits its attention between 1 and 3 back, but sometimes 'fills in' for L0H7. The origins of this pattern are explored further in Appendix E; note also the similarities to Figure 4. The other five heads in L0 show no similar pattern. Full patterns are shown in Figure 13.
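The windowed view used in Figure 3 can be reproduced from a cached attention pattern; the sketch below assumes `pattern` is a (n_heads, seq, seq) attention tensor for layer 0 (for instance from TransformerLens's run_with_cache) and that every fourth context position holds a ; token - the specific indices are illustrative.

```python
import torch

def semicolon_lookback(pattern, offsets=(1, 3, 5, 7), first=4, last=140, step=4):
    """Collect attention from every fourth context position (the ';' tokens in the
    adjacency list) to positions 1, 3, 5 and 7 earlier, per head, as in Figure 3."""
    last = min(last, pattern.shape[-1] - 1)            # stay inside the sequence
    positions = list(range(first, last + 1, step))     # candidate ';' positions
    out = torch.zeros(pattern.shape[0], len(positions), len(offsets))
    for i, pos in enumerate(positions):
        for j, off in enumerate(offsets):
            out[:, i, j] = pattern[:, pos, pos - off]  # attention paid `off` tokens back
    return out                                         # (n_heads, n_semicolons, n_offsets)
```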
Figure 4: Magnitudes of vectors resulting from applying the W_OV matrices of heads L0H3, L0H5 and L0H7 of Stan to maze-cell token embeddings, projected onto the maze grid. The pattern here mirrors the way that the heads divide their attention between the 1-back and 3-back context positions (exemplified in Figure 3), with L0H3 focused on 'even-parity' cells, and L0H7 and L0H5 focused primarily on 'odd-parity' cells. This pattern also recurs in the overlaps between query and key vectors of token embeddings, explored in detail in Appendix E.

Figure 5: Magnitudes of vectors resulting from applying the W_OV matrices of layer-0 heads of Terry to maze-cell token embeddings, projected onto the maze grid. The pattern here is much less striking than that for Stan (shown in Figure 4), although it does suggest that the heads specialise in even/odd-parity cells in localised regions of the maze.

3.2 WORLD MODEL REPRESENTATION: SPARSE AUTOENCODERS

As previous work (Ivanitskiy et al., 2024) struggled to intervene on WM features identified via linear probing (Alain & Bengio, 2016), we trained Sparse Autoencoders to attempt to find disentangled features in our models (Cunningham et al., 2023; Bricken et al., 2023). Sparse Autoencoders are motivated by the notion of superposition (Elhage et al., 2022), which posits that artificial neural networks store more features than an orthogonal representation would allow. By training an autoencoder with a higher-dimensional latent space than that of the transformer, tasked with reconstructing a residual stream vector under a sparsity penalty, the hope is that the SAE will recover interpretable features which the transformer was forced to superimpose. Similar approaches have previously seen success on other toy tasks (He et al., 2024; Karvonen et al., 2024).

To prevent "neuron death" in the SAE latent space, resulting from high sparsity penalties, we apply the method of "Ghost Gradients" proposed by Jermyn & Templeton (2024). The resulting trained SAEs faithfully reconstructed the activations (in our case, the residual stream after L0), and replacing these activations with their SAE-reconstructed counterparts did not affect model behaviour (Figure 8), giving confidence in the completeness of their representation.

Initial attempts to isolate SAE features corresponding to connections in the maze used differences in the features present in mazes with or without certain connections. This approach worked well in some cases, but not in others, as not all relevant features varied in magnitude by the same amount, and many features were co-active to a given connection (i.e. those implicated in the path representation, which itself might change when connections are added/removed). To address this, we instead trained decision trees to isolate the relevant features in our transformers (akin to Spies et al. (2022)), as shown in Figure 6.
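A minimal sketch of this decision-tree decoding step using scikit-learn; `X` is assumed to be a matrix of SAE feature vectors taken at the semicolon following a given connection's definition, and `y` the corresponding presence/absence labels - variable names and the depth limit are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_connection_decoder(X, y, max_depth=2):
    """Train a shallow decision tree to predict whether one specific connection is
    present, given SAE feature vectors (n_examples, n_features) at the relevant
    semicolon position. The features used in the learned splits are the candidate
    world-model features for that connection (cf. Figure 6)."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X, y)
    relevant = np.flatnonzero(tree.feature_importances_)  # SAE features the tree relies on
    return tree, relevant, tree.score(X, y)               # accuracy on the provided examples
```

A depth of one or two suffices in this setting: Terry's trees typically split on a single feature, while Stan's use two (see Figure 15 in Appendix D).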
This analysis yielded our first unexpected finding: Stan's WM consisted of two features for each connection - a somewhat generic "semicolon" feature, as well as a connection-specific feature. We visualize highly activating examples of these features in Figure 17, and show that Stan's representation was stable for an additionally trained SAE in Figure 18.

We speculate that this "compositional code" arises in Stan as a result of the transformer imperfectly separating positional information from its WM. This representation also explains why previous efforts to intervene on models with learned positional encodings by using linear probes were unsuccessful - as intervening with a single direction derived from supervised decoding would also affect the semicolon feature. It is also interesting to note that Terry encoded connection information very cleanly into single features for each connection - i.e., a single direction in the residual stream. This is in spite of the fact that Terry's attention heads appeared to operate in a more entangled fashion than those of Stan.

Figure 6: Decision tree decoding accuracies and relevant features (in parentheses) for each connection in the maze, for (a) Stan and (b) Terry. Upper right triangles correspond to right connections, and lower left triangles correspond to down connections. The decision trees were trained to predict the presence, or absence, of a connection from the SAE feature vector at the semicolon immediately following the definition of that connection. See Figure 17 for more details.

3.3 COMPARING SAES AND CIRCUITS

In Subsection 3.1 we advanced the claim that certain L0 heads construct features representing maze edges at the ; context positions, specifically by attending to earlier positions containing maze-cell token embeddings, and rewriting those embeddings by application of their W_OV matrices.
Subsection 3.2 identified features representing maze edges via an independent line of reasoning, by training SAEs and identifying which of their features were indicative of the presence of a maze edge.

To verify whether these approaches yielded consistent features for the WM, we first calculated the cosine similarity between the features written by isolated attention heads and those encoded in the SAE (Figure 7a). These showed excellent agreement for Stan, where the attention patterns were clear, but only once the compositional code was taken into account (see Appendix G for details).

Though these results were promising, we carried out a further comparison (Figure 7b) to minimize the assumptions required, and to account for two potential effects: 1) there may be "wiggle room" between feature directions in the model's residual stream and the circuits that construct them (which would lead to low cosine similarities, even for the same features); 2) as our SAEs are trained after an entire block of computation, it is possible that the MLPs, applied after attention, also played a role in forming the representations.

In this second experiment we patched attention head values in the presence of a connection to the mean of a maze set without that connection. By looking at the effect of patching the attention heads on the resulting SAE latent vectors, we were able to observe that the features considered relevant for any given connection were indeed sensitive to the heads implicated in constructing those features. In particular, we consider the effect on the SAE features identified in Subsection 3.2 when each attention head is patched at the semicolon position with its average non-connection value across 500 examples (i.e. removing the contribution of a given head toward encoding that connection). This captures the extent to which a given head contributes towards the "creation" of a maze connection's representation in the residual stream. These plots not only confirm the link between the attention circuits and the SAE features, but even show the same spatial partitioning of different parts of the maze between different heads. The same plots for Terry, and Stan's down connection, are shown in Appendix F.

Figure 7: Comparison of SAE features and attention head analysis. (a) Cosine similarities between edge features derived from the SAE and edge features derived by direct application of the W_OV matrices of the heads discussed in Subsection 3.1 to token embeddings for Stan (for reasons of space, only the right-directed edge features are shown here, but the pattern is the same for the full set). Details of how features are computed are given in Appendix G. (b) Effect of patching attention heads on SAE features (rightward connections) for Stan's connection-specific WM features (identified from Figure 6). Compared against Figure 4, we see agreement between the attention analysis and the SAE feature analysis. We normalize the per-head patching effect magnitudes, such that 100% was the maximal effect seen on an SAE feature's magnitude as a result of patching the attention head (across all features for a given head).
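A sketch of this head-patching measurement, assuming a TransformerLens-style HookedTransformer and an SAE object exposing an `encode` method; the hook names follow TransformerLens conventions, while `mean_z` (a precomputed per-head mean activation over mazes lacking the connection) and the other names are illustrative assumptions rather than the authors' code.

```python
from transformer_lens import utils

def head_patching_effect(model, sae, tokens, layer, head, semicolon_pos, mean_z):
    """Mean-ablate one attention head's output at a semicolon position, then re-encode
    the post-block residual stream with the SAE to see which connection features are
    knocked out (the measurement behind Figure 7b)."""
    captured = {}

    def patch_z(z, hook):
        # z: (batch, pos, n_heads, d_head); overwrite this head's output at the ';' position
        z[:, semicolon_pos, head, :] = mean_z
        return z

    def grab_resid(resid, hook):
        # resid: (batch, pos, d_model); keep the patched residual at the ';' position
        captured["resid"] = resid[:, semicolon_pos].detach()
        return resid

    model.run_with_hooks(
        tokens,
        fwd_hooks=[
            (utils.get_act_name("z", layer), patch_z),
            (utils.get_act_name("resid_post", layer), grab_resid),
        ],
    )
    return sae.encode(captured["resid"])   # SAE latents after the intervention
```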
4 INTERVENING ON WORLD MODELS

Though a universally agreed-upon definition does not exist, we shall consider World Models to be "structure preserving [...] causally efficacious representations" (Millière & Buckner, 2024) of an environment; i.e. representations which preserve the causal structure of the environment as far as is necessitated by the tasks an agent needs to perform. As such, we are interested in understanding how the WMs we have discovered are leveraged by our models to facilitate generation of valid solution sequences.

In Figure 8, we give an example of perturbing a feature to "fool" the model into behaving as though it is in a different maze. When patching in the SAE-reconstructed residual stream without perturbations we still see the same behavior as in the original model; when patching in with a modified feature, we see a change in the path. We perform such interventions across 200 examples for each connection feature, and show the resulting intervention efficacies in Figure 9.

The intervention process involves toggling a feature on (to the maximal value observed for that feature in a small dataset) or turning it off (setting it to 0) at all semicolon positions [3]. We measure the impact of these interventions on the model's maze-solving accuracy, with a particular focus on how activating versus removing features affects performance. Our results reveal an intriguing asymmetry that constitutes our second finding: interventions that activate features tend to be more effective in altering the model's behavior compared to those that remove features. This suggests that the transformer may rely more heavily on the presence of certain connectivity cues rather than their absence when constructing its internal world model.

[3] For the case of adding a connection, this is necessary as there is no semicolon in the sequence which "belongs" to the connection that doesn't exist. We also experimented with toggling to a fixed maximal value in Figure 19, but this was generally less effective. In the case of removal, it made little difference if the feature was disabled everywhere, as it is almost always exactly 0 for a non-matching connection semicolon.

Our final finding relates to the toggling of features in Stan. Though Stan utilized a compositional code, activating the connection-specific features at unrelated semicolons worked in 35% of cases. Conversely, we saw that all removal interventions failed for Stan, for the simple reason that Stan was unable to generalize to sequences containing more connections than it had seen during training - thus failing when shown examples containing the additional connection to be removed (this failure was a result of Stan using learned positional embeddings (Table 1), as shown in Figure 10). The fact that activating connections in the space of the SAE worked at all means that Stan's maze-solving behaviour was at least partially able to generalize in the latent space, where it was decoupled from the positional embeddings.
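The feature-toggling step itself reduces to editing the SAE latent and decoding back before splicing the result into the residual stream at the semicolon positions; a minimal sketch, where `sae` is assumed to expose `encode`/`decode` methods mapping residual vectors to feature activations and back (an assumption about the authors' setup, not a specific library API):

```python
import torch

def toggle_connection_feature(sae, resid, feature_idx, turn_on, on_value):
    """Toggle one SAE feature at the given residual-stream positions and decode back,
    mirroring Section 4: 'on' sets the feature to the maximal value observed for it
    in a small dataset (`on_value`), 'off' zeroes it. The returned vectors are then
    substituted into the model's forward pass at the corresponding positions."""
    latent = sae.encode(resid)                           # (n_positions, n_features)
    latent[:, feature_idx] = on_value if turn_on else 0.0
    return sae.decode(latent)                            # patched residual stream vectors
```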
Figure 8: An example of an intervention on Terry where a connection is added by enabling the relevant feature in the SAE's latent space (in this case, feature 250 for (1,2)<-->(1,3)). From left to right: 1) input maze with ground truth; 2) model's prediction with the unperturbed SAE reconstruction patched in; 3) perturbed ground truth; 4) model's prediction with the perturbed SAE reconstruction in its residual stream at layer 0.

Figure 9: Aggregated accuracy of interventions for examples on which the original prediction was correct, for (a) Terry and (b) Stan. An accurate intervention is one in which the toggling of a connection in the SAE feature space leads the model to act accordingly. Note that Stan removal interventions fail as the inputs in these cases have more connections than the model is able to handle (see length generalization failure in Figure 10).

5 RELATED WORK

Our work builds on existing literature in interpretability (Räuker et al., 2023), particularly how transformers develop structured internal representations, often called world models. World Models, as defined by Millière & Buckner (2024), are "structure-preserving, causally efficacious representations of properties of [a model's] input domain." Here, structure-preserving means that the representations reflect the causal structure of the observation space, and causally efficacious means that the model leverages these representations to enable relevant interactions with its environment.

Research into world models has gained traction across various domains, with transformers trained to play complex games like chess being prime examples. For instance, McGrath et al. (2022) trained linear probes to extract various features in AlphaZero's chess model, showing how different aspects of the game, such as piece positioning and potential future moves, are captured within the model's layers. Similarly, Karvonen (2024) investigates the internal representations of a chess model using linear probes and contrastive activations, revealing structured representations of the game state. Jenner et al. (2024) explores the emergence of learned look-ahead capabilities in Leela Chess Zero, where the model encodes an internal representation of future optimal moves.

Another task used to study internal representations in transformers is Othello.
Several works have explored the emergence of causal linear world models in this domain (Li et al., 2022; Nanda, 2023), with recent advancements leveraging SAEs (see Subsection 3.2) to uncover these world models (He et al., 2024).

Beyond game-playing tasks, the study of learned world models in transformers extends to other domains, such as natural language processing, where Hewitt & Manning (2019) used probing techniques to uncover the syntactic structure encoded by BERT. This line of research demonstrates that transformer models can implicitly learn hierarchical structures in their residual streams, as explored by Manning et al. (2020). Further supporting this, Pal et al. (2023) demonstrated that the residual stream corresponding to individual input tokens encodes information to predict the correct token several positions ahead, highlighting the model's capacity for structured, anticipatory reasoning.

Additionally, graph traversal as multi-step reasoning has been investigated both from a model capabilities perspective (Momennejad et al., 2024) and through mechanistic interpretability (Brinkmann et al., 2024; Ivanitskiy et al., 2024), providing further evidence of transformers' ability to encode and utilize structured representations in complex tasks.

6 CONCLUSIONS AND FUTURE WORK

In this work, we demonstrated that transformers trained to solve maze navigation tasks form highly structured internal representations that capture the connectivity of the maze and thus act as world models. Through exploratory analysis of attention patterns, we found that connection information was consolidated into semicolon tokens by a subset of attention heads. By using decision trees to analyze the latent space of Sparse Autoencoders on these semicolons, we were able to identify sparse features that encoded the position in the maze. We showed that these world models were constructed differently in transformers leveraging learned vs. rotary positional encodings, suggesting that simpler methods such as activation steering or probing would have been insufficient to extract causal world models in at least some cases.

More interesting still, we showed that interventions to add connections by toggling features were consistently more effective than interventions that sought to remove connections by zeroing the corresponding features. Furthermore, we found that models with learned position encodings, which were unable to generalize to longer input sequences (i.e., mazes with more connections), were able to behave consistently if additional connection features were enabled via SAE interventions, even if the corresponding token sequence would have caused the model to fail.

These findings shed light on the inner workings of transformers trained on sequential planning tasks and suggest that maze-solving tasks are a rich testbed for understanding the formation of world models in transformers. Future work should aim to uncover whether our findings on intervention asymmetries and steerability are universal - and if not, which conditions give rise to each. An empirical understanding of the reliability of SAE feature discovery and steerability is crucial for AI Safety efforts that attempt to constrain or coerce model behavior through interventions or monitoring based on such methods.

ACKNOWLEDGMENTS

This work was made possible by various grants. Alex Spies was supported by JSPS Fellowship PE23026 and EPSRC Project EP/Y037421/1. Michael Ivanitskiy was supported by NSF award DMS-2110745.
We are also grateful to LTFF and FAR Labs for hosting three of the authors for a Residency Visit, and to various members of FAR's technical staff for their advice.

REFERENCES

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

Nora Belrose, Zach Furman, Logan Smith, et al. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.

Trenton Bricken, Adly Templeton, Brian Chen, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, Oct 4 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, et al. A mechanistic analysis of a transformer trained on a symbolic multi-step reasoning task. arXiv preprint arXiv:2402.11917, 2024.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, et al. Sparse autoencoders find highly interpretable features in language models, October 2023. URL http://arxiv.org/abs/2309.08600.

Nelson Elhage, Neel Nanda, Catherine Olsson, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Zhengfu He, Xuyang Ge, Qiong Tang, et al. Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on Othello-GPT. arXiv preprint arXiv:2402.12201, 2024.

John Hewitt and Christopher D Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), p. 4129-4138, 2019.

Michael I. Ivanitskiy, Rusheb Shah, Alex F. Spies, et al. A configurable library for generating and manipulating maze datasets. arXiv preprint arXiv:2309.10498, 2023.

Michael I. Ivanitskiy, Alex F. Spies, Tilman Räuker, et al. Linearly structured world representations in maze-solving transformers. In Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, volume 243 of Proceedings of Machine Learning Research, p. 133-143. PMLR, 15 Dec 2024. URL https://proceedings.mlr.press/v243/ivanitskiy24a.html.

Erik Jenner, Shreyas Kapur, Vasil Georgiev, et al. Evidence of learned look-ahead in a chess-playing neural network. arXiv preprint arXiv:2406.00877, 2024.

Adam Jermyn and Adly Templeton. Anthropic Circuits Updates - January 2024, January 2024. URL https://transformer-circuits.pub/2024/jan-update#dict-learning-resampling.

Adam Karvonen. Emergent world models and latent variable estimation in chess-playing language models. arXiv preprint arXiv:2403.15498, 2024.

Adam Karvonen, Benjamin Wright, Can Rager, et al. Measuring progress in dictionary learning for language model interpretability with board game models. arXiv preprint arXiv:2408.00113, 2024.

Kenneth Li, Aspen K Hopkins, David Bau, et al. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.

Tom Lieberum, Matthew Rahtz, János Kramár, et al. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. arXiv preprint arXiv:2307.09458, 2023.
Christopher D Manning, Kevin Clark, John Hewitt, et al. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48):30046-30054, 2020.

Callum McDougall. SAE Visualizer. https://github.com/callummcdougall/sae_vis, 2024.

Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, et al. Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022.

Raphaël Millière and Cameron Buckner. A philosophical introduction to language models - part i: The way forward, 2024. URL https://arxiv.org/abs/2405.03207.

Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Frujeri, et al. Evaluating cognitive maps and planning in large language models with CogEval. Advances in Neural Information Processing Systems, 36, 2024.

Neel Nanda. Actually, Othello-GPT has a linear emergent world representation. Neel Nanda's Blog, 7, 2023.

Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022.

Catherine Olsson, Nelson Elhage, Neel Nanda, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

Koyena Pal, Jiuding Sun, Andrew Yuan, et al. Future Lens: Anticipating subsequent tokens from a single hidden state. arXiv preprint arXiv:2311.04897, 2023.

Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), p. 464-483. IEEE, 2023.

Alex F. Spies, Alessandra Russo, and Murray Shanahan. Sparse relational reasoning with object-centric representations, July 2022. URL http://arxiv.org/abs/2207.07512.

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Appendices

A GENERALIZATION AS A FUNCTION OF INPUT SEQUENCE LENGTH

Figure 10: Accuracies of all transformers trained in our sweep on a generalization task. "Train Val" shows the accuracy on the held-out in-length-distribution mazes from train time, and "Full Maze" features mazes with more connections (longer input sequences) than those seen at train time. Only rotary models are able to generalize at all.

B SAE TRAINING DETAILS

To choose optimal hyperparameters for our SAEs we ran a sweep over SAEs at layers 2 to 4 on Terry, finding consistent trends across layers. The results of this sweep are shown in Figure 11, and the final details of the SAE analyzed in the main paper are given in Table 2.
We also provide feature density histograms for the SAEs analyzed in the main paper in Figure 12, noting that these look good, in that many features are sparse, but also rather distinct from what is typically observed in LLMs. This is not surprising, as our token and feature distributions will be very distinct from those of natural language, as most mazes have many active connections, and connections are similarly likely to be present in any given maze.

Table 2: Hyperparameter values for the final SAEs analyzed in the main paper.
- Expansion Factor: 4
- Ghost Threshold: 100
- (L0) Sparsity Weight: 0.01
- Batch Size: 1024
- Training Steps: ~10^6
- Learning Rate: 10^-4
- Linear Warm-Up Steps: 1000

Table 3: SAE metrics for the final SAEs trained on Stan and Terry. We see that replacing the residual stream with the SAE reconstructions has very little impact on the sequence produced by the model, providing confidence that the SAEs are encoding all the relevant information in the model's residual stream.
- Terry - SAE feature metrics: Residual Reconstruction Error (L2) 2.87×10^-4, Sparsity (L1) 10.7, L0 20.9; Average Token Reconstruction Errors: Unperturbed 8.18, Zero Patched 8.52, SAE Patched 8.18
- Stan - SAE feature metrics: Residual Reconstruction Error (L2) 6.35×10^-4, Sparsity (L1) 9.53, L0 28.4; Average Token Reconstruction Errors: Unperturbed 6.35, Zero Patched 8.61, SAE Patched 6.35

Figure 11: Results of an SAE sweep carried out on Terry. The panels show MSE loss, sparsity, variance explained, L0, and the percentage of alive features as a function of expansion factor, ghost threshold, learning rate, and sparsity weight, for layers 2 to 4.

Figure 12: Feature density histograms for the SAEs analyzed in the main paper. (a) Terry SAE density. (b) Stan SAE density.

Figure 13: Attention patterns for head L0H3 in Stan and Terry, for a specific example maze. At every fourth context position from 4 through to 140 (the ; positions in the adjacency list) attention is directed very strongly back to one or two positions, typically 1 or 3 positions earlier in the context (though for Stan, after context position 100, this shifts to 5 or 7 positions earlier in the context). This pattern is qualitatively repeated across all examples examined, for heads L0H3, L0H5 and L0H7 in Stan, and for all four L0 heads in Terry.
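Tying together the training details above, a minimal sketch of an SAE of the kind described in Subsection 3.2 and Table 2 - an overcomplete latent trained to reconstruct residual-stream vectors under an L1 sparsity penalty (ghost-gradient resampling omitted; hyperparameter values are copied from Table 2 purely for illustration):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct residual-stream vectors through an overcomplete, sparsely
    activated latent space (expansion factor 4 and sparsity weight 0.01, as in Table 2)."""

    def __init__(self, d_model=512, expansion_factor=4, sparsity_weight=0.01):
        super().__init__()
        d_hidden = d_model * expansion_factor
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)
        self.sparsity_weight = sparsity_weight

    def encode(self, resid):
        return torch.relu(self.enc(resid))       # non-negative feature activations

    def decode(self, latent):
        return self.dec(latent)

    def loss(self, resid):
        latent = self.encode(resid)
        recon = self.decode(latent)
        mse = (recon - resid).pow(2).mean()      # reconstruction term
        l1 = latent.abs().sum(dim=-1).mean()     # sparsity penalty on activations
        return mse + self.sparsity_weight * l1
```

The key behavioural check is the one reported in Table 3: replacing the residual stream with these reconstructions should leave the model's output sequence essentially unchanged before any latent-space interventions are trusted.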
C FURTHER ATTENTION VISUALIZATIONS

Figure 14: Attention values for layer 0 heads in Terry, from context positions holding the ; token (shown along the x-axis) to positions 1 and 3 earlier in the context (shown along the y-axis), for an example maze input. This pattern is typical across all inputs examined. The pattern is less clear-cut than for Stan (Figure 3), but note that at every fourth context position, there is at least one head attending strongly to positions 1 and 3 earlier in the context.

D HOW SAE REPRESENTATIONS DIFFER

Figure 15 (decision tree feature importances for the connection (4, 4, 'right') in Terry and Stan): (a) Terry model: A single feature almost perfectly encodes the existence of a specific connection in the maze. This demonstrates the direct encoding of maze connectivity in Terry's SAE latent space. (b) Stan model: Two features (Feature 1422 and another) work together to encode maze connectivity. Feature 1422 appears consistently across all connections, aligning with the decision tree decoding results presented in Figure 6.
Figure 15: Decision trees trained on SAE latents for Terry and Stan models, predicting the existence of specific connections in the maze. These examples illustrate how maze connectivity is encoded in the residual stream at layer 0 on the corresponding semicolon position. The decision trees were trained as supervised classifiers whose target was to predict the presence of a given connection, given an SAE feature vector from the corresponding semicolon position. These SAEs were trained with 10,000 examples per connection (equally balanced between the presence / non-presence of a connection).

Figure 16: Maximally activating examples, displayed using a modified version of McDougall (2024), for SAE features encoding the connection (4,4)<-->(4,5), as identified by decision tree decoding. Underlines correspond to loss contribution (blue for positive, red for negative) and highlighting indicates feature activation at a given token position. (a) Terry model: A single SAE feature directly encodes a specific maze connection, demonstrating Terry's straightforward representation of maze connectivity. (b) Stan model: The connection-specific feature activates at the semicolon corresponding to the encoded connection, similar to Terry's encoding strategy. (c) Stan model: Feature 1422, in conjunction with another feature, encodes maze connectivity. This feature appears consistently across all connections, corroborating the decision tree decoding results in Figure 6. Connection-specific features in both models show clear activation patterns, while Stan's generic semicolon feature (Figure 16c) exhibits a less obvious trend.

D.1 MAGNITUDE OF INTERVENTIONS

To complement the intervention results presented in the main text, we also conducted fixed-value interventions on both the Stan and Terry models. In these interventions, instead of calculating new activations based on the modified input, we directly set the activations of the targeted features to fixed values. This approach allows us to examine how the models respond to more controlled manipulations of their internal representations. The fixed-value intervention results shown in Figure 19 reveal interesting patterns that both complement and contrast with the calculated intervention results presented in the main text.

Figure 17: We provide examples of the types of features observed in Stan and Terry, beyond the connection features which form the primary focus of the main paper. (a) Representative SAE features for Terry. (b) Representative SAE features for Stan. We observe the same kinds of features in both transformers, and in both cases the predominant features are of the form observed in the top-left (Feature 32 in Terry and 2 in Stan) - these features are more distributed and harder to interpret than the others, and may be suppressed by higher sparsity penalties.

E INVESTIGATION OF QK-CIRCUIT IN STAN MODEL

In an effort to better understand the notable "1- and 3-back" attention patterns appearing in heads L0H3, L0H5 and L0H7 of Stan, described in Subsection 3.1, we investigated the query and key vectors for token and positional embeddings, and their overlaps. The scalar products between queries and keys of token embeddings for L0H3 are shown in Figure 20.
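The overlap computation behind Figures 20-22 can be sketched as follows: for a single head, form the query vector of one token (here the ; token) and take scalar products with the key vectors of other tokens (maze-cell tokens, or rows of the positional embedding matrix for Figure 22). W_E, W_Q and W_K are assumed to be plain tensors with illustrative shapes, not the authors' exact code.

```python
import torch

def qk_overlaps(W_E, W_Q, W_K, query_token_id, key_token_ids):
    """Scalar products between one token's query vector (e.g. ';') and other tokens'
    key vectors for a single head, as in the QK-circuit analysis of Appendix E.
    Assumed shapes: W_E (vocab, d_model), W_Q and W_K (d_model, d_head)."""
    q = W_E[query_token_id] @ W_Q        # (d_head,)
    k = W_E[key_token_ids] @ W_K         # (n_keys, d_head)
    return k @ q                         # (n_keys,) one overlap per key token
```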
The most striking feature of Figure 20 is the row corresponding to the query vector of the ; token, and in particular its overlap with the maze-cell tokens. Plotting these scalar products on the maze-cell grid (Figure 21), a clear pattern emerges, analogous to that shown in Figure 4, accounting for L0H3's tendency to attend to even-parity cells and for L0H5's and L0H7's tendencies to attend to odd-parity cells. Examining the scalar products among query and key vectors for positional embeddings (Figure 22) reveals a pattern that likely accounts for the focusing of attention from ; context positions to positions 1 and/or 3 earlier in the context.

[Figure 18 panel: "Decision Tree Decoding - Stan"; per-connection decoding accuracies between roughly 92% and 98% arranged on the maze grid, colour scale 0.90-1.00.]

Figure 18: Another SAE trained on Stan gives rise to the same compositional code.

[Figure 19 panels: percentage of successful perturbations (0-100%) for "Add Connection, Right", "Remove Connection, Right", "Add Connection, Down" and "Remove Connection, Down", shown on the maze grid for (a) the Terry model and (b) the Stan model.]

Figure 19: Aggregated accuracy of fixed-value interventions for examples on which the original prediction was correct. As opposed to Figure 9, the addition interventions were performed with a fixed value of 10 (removal interventions were the same, with a fixed value of 0). Here we see that the fixed-value interventions are mostly less effective than the calculated interventions, suggesting that the transformer's use of the World Model is sensitive to feature magnitudes.
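One plausible way to implement such a fixed-value intervention is to clamp the targeted SAE feature at a constant and write the change back into the residual stream along that feature's decoder direction. The paper does not publish its exact procedure, so the function below is a sketch under that assumption, with illustrative names and random placeholder weights.

```python
# Sketch of a fixed-value SAE feature intervention (one plausible implementation,
# not necessarily the authors' exact code).
import torch

def fixed_value_intervention(resid, sae_encode, sae_decoder, feature_idx, value):
    """Clamp one SAE feature at `value` for the residual-stream vector `resid`.

    resid:        (d_model,) residual-stream activation at the target ; position
    sae_encode:   callable mapping resid -> (n_latents,) feature activations
    sae_decoder:  (n_latents, d_model) SAE decoder weight matrix
    feature_idx:  index of the connection feature being intervened on
    value:        10.0 to "add" a connection, 0.0 to "remove" one (as in Figure 19)
    """
    feats = sae_encode(resid)
    delta = value - feats[feature_idx]
    # Shift the residual stream along the feature's decoder direction so the
    # feature reads (approximately) the fixed value on re-encoding.
    return resid + delta * sae_decoder[feature_idx]

# Toy usage with a random linear "SAE" (placeholder weights, not the trained SAE).
d_model, n_latents = 128, 2048
W_enc = torch.randn(d_model, n_latents)
W_dec = torch.randn(n_latents, d_model)
resid = torch.randn(d_model)
patched = fixed_value_intervention(
    resid, lambda x: torch.relu(x @ W_enc), W_dec, feature_idx=1422, value=10.0
)
```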
[Figure 20: heatmap over the full token vocabulary (special tokens, <-->, ;, <PADDING> and maze-cell coordinates), colour scale roughly -0.015 to 0.015.]

Figure 20: Scalar products between query (rows) and key (columns) vectors of token embeddings for Stan L0H3. Note that the most pronounced pattern is found on the row corresponding to the query vector of the ; token, reflecting the importance of this head in establishing the attention pattern from context positions containing the ; token.

[Figure 21 panels: Head 0.3, Head 0.5 and Head 0.7, plotted on the maze grid with a colour scale of roughly -0.015 to 0.01.]

Figure 21: Stan scalar products of the query vector for the ; token and key vectors for maze-cell tokens, arranged on the maze grid. Note the clear correspondence with Figure 4. These patterns account for why L0H3 directs its attention to even-parity cells, while odd-parity cells are attended to by L0H7 or L0H5.

[Figure 22: heatmap over context positions 0-160, colour scale roughly -0.04 to 0.04.]

Figure 22: Scalar products between query (rows) and key (columns) vectors of position embeddings for Stan L0H3. Note the approximately diagonal band of pairs of strong positive overlaps every fourth row. This is likely the origin of the '1- or 3-back from ;' attention pattern.

[Figure 23 panels: Head 0.0 through Head 0.7, axes 0-6, colour scale roughly 0.05-0.1.]

Figure 23: Stan OV projections across position embeddings for all heads.

[Figure 24: similarity matrix between OV-derived edge features (rows) and SAE-derived edge features (columns) for all down- and right-connections, colour scale -1 to 1.]

Figure 24: Stan OV-SAE feature similarity for all heads. Complementing Figure 7a.
F COMPARING SAE FEATURES AND CONNECTIVITY ATTENTION HEADS

[Figure 25 panels: per-cell bars (0%-100%) for heads head_0 through head_7 across the Maze X / Maze Y grid.]

Figure 25: Effect of patching attention heads on SAE features for each down-connection in Stan. These again provide agreement with the OV analyses performed in the main text.

[Figure 26 panels: per-cell bars (0%-100%) for heads head_0 through head_3 across the Maze X / Maze Y grid. (a) Effect of attention patching on right-connection features. (b) Effect of attention patching on down-connection features.]

Figure 26: Effect of patching attention heads on SAE features for Terry. Whilst we observe notable effects, it is difficult to see a clear pattern; as revealed by the attention analyses, the role of each head in constructing a single connection feature in Terry is harder to understand.

G COMPUTING SAE AND OV EDGE FEATURE SIMILARITY

In Figure 7a we compute the cosine similarity between SAE edge features and OV circuit edge features. SAE edge features are formed from a linear combination of the specific edge feature and a "generic edge" feature, with the generic feature coefficient of -0.6 being chosen to maximise cosine similarity. OV edge features are formed from the weighted sum

$$\sum_{h,c} a^{h}_{c} \, W^{h}_{OV} \, t_{c},$$

where $h$ indexes heads L0H3, L0H5 and L0H7 (with $W^{\mathrm{L0H3}}_{OV}$, for example, giving the OV matrix of L0H3), $c$ indexes the two cells present in the edge of interest, and $t_c$ is the token embedding of cell $c$. The coefficients $a^{h}_{c}$ are given by the attention directed by head $h$ to cell $c$ from the ; context position following the specification of the edge of interest. Data was averaged over 100 examples (see Figure 3 for one such example).
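As a minimal sketch of this computation, the snippet below builds one OV edge feature from the weighted sum above and compares it to an SAE edge feature assembled with the -0.6 generic-feature coefficient. The weights, embeddings, and attention coefficients are random placeholders rather than the trained model's values.

```python
# Sketch of the SAE/OV edge-feature cosine similarity from Appendix G
# (illustrative shapes and random tensors, not the authors' code).
import torch
import torch.nn.functional as F

d_model = 128

# SAE edge feature: specific edge decoder direction plus a "generic edge"
# direction with the coefficient of -0.6 described in the text.
specific_edge_dir = torch.randn(d_model)
generic_edge_dir = torch.randn(d_model)
sae_edge_feature = specific_edge_dir - 0.6 * generic_edge_dir

# OV edge feature: sum over heads h in {L0H3, L0H5, L0H7} and the two cells c
# of the edge of a_c^h * W_OV^h @ t_c, where a_c^h is the attention from the ";"
# position to cell c and t_c is the cell's token embedding.
W_OV = {h: torch.randn(d_model, d_model) for h in ("L0H3", "L0H5", "L0H7")}
t = {c: torch.randn(d_model) for c in ("(4,4)", "(4,5)")}     # cells of the edge
a = {(h, c): torch.rand(()) for h in W_OV for c in t}         # attention weights

ov_edge_feature = sum(a[(h, c)] * (W_OV[h] @ t[c]) for h in W_OV for c in t)

# Cosine similarity between the two edge representations; the paper averages
# this quantity over 100 examples, a single example is shown here.
print(F.cosine_similarity(sae_edge_feature, ov_edge_feature, dim=0))
```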