Paper deep dive
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task
Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, Christian Bartelt
Models: Custom 6-layer decoder-only transformer (1.2M parameters)
Intelligence
Summary
This paper presents a mechanistic interpretability analysis of a 6-layer transformer trained on a symbolic path-finding task in binary trees. The authors identify that the model implements a 'backward chaining' algorithm, utilizing 'deduction heads' to traverse the tree upwards, parallelization motifs for subproblem solving, and a fallback heuristic for leaf node identification. The findings are validated through linear probing, activation patching, and causal scrubbing.
Entities (6)
Relation Signals (3)
Transformer → implements → Backward Chaining
confidence 95% · Our findings suggest that the model uses an interpretable and meaningful backward chaining algorithm
Deduction Heads → enables → Backward Chaining
confidence 90% · This mechanism is implemented using attention heads, which we term 'deduction heads'.
Linear Probing → validates → Transformer
confidence 90% · We validate our findings using correlational and causal evidence, using techniques such as linear probing
Cypher Suggestions (2)
Identify components of the transformer that implement specific algorithms. · confidence 95% · unvalidated
MATCH (c:Component)-[:ENABLES]->(a:Algorithm) RETURN c.name, a.name
Find all interpretability techniques used to analyze the transformer model. · confidence 90% · unvalidated
MATCH (t:Model {name: 'Transformer'})<-[:VALIDATES]-(tech:Technique) RETURN tech.name
Abstract: Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities are a result of actual reasoning, existing work has focused on developing sophisticated benchmarks for behavioral studies. However, these studies do not provide insights into the internal mechanisms driving the observed capabilities. To improve our understanding of the internal mechanisms of transformers, we present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that it implements a depth-bounded recurrent mechanism that operates in parallel and stores intermediate results in selected token positions. We anticipate that the motifs we identified in our synthetic setting can provide valuable insights into the broader operating principles of transformers and thus provide a basis for understanding more complex models.
Tags
Links
- Source: https://arxiv.org/abs/2402.11917
- Canonical: https://arxiv.org/abs/2402.11917
Full Text
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task

Jannik Brinkmann†1, Abhay Sheshadri†2, Victor Levoso†3, Paul Swoboda4, Christian Bartelt1
1University of Mannheim, 2Georgia Institute of Technology, 3Independent, 4Heinrich-Heine University Düsseldorf
† Equal contribution.

Abstract

Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities are a result of actual reasoning, existing work has focused on developing sophisticated benchmarks for behavioral studies. However, these studies do not provide insights into the internal mechanisms driving the observed capabilities. To improve our understanding of the internal mechanisms of transformers, we present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that it implements a depth-bounded recurrent mechanism that operates in parallel and stores intermediate results in selected token positions. We anticipate that the motifs we identified in our synthetic setting can provide valuable insights into the broader operating principles of transformers and thus provide a basis for understanding more complex models. Correspondence to jannik.brinkmann@uni-mannheim.de. The code is available at github.com/backward-chaining-circuits.
1 Introduction

Transformer-based language models (Vaswani et al., 2017) demonstrate impressive performance on reasoning tasks (Kojima et al., 2023), mathematical problem-solving (Cobbe et al., 2021), and planning (Huang et al., 2022). (In this paper, we consider a form of deductive reasoning as studied in Saparov and He (2023); for a discussion of different forms of reasoning, refer to Huang and Chang (2023).) However, despite strong performance on certain reasoning benchmarks, it remains unclear to what extent these abilities are a result of actual reasoning or simple pattern memorization (Huang and Chang, 2023).

Figure 1: Backward Chaining. Given an input prompt, the model concatenates edge tokens in a single token position (A), and copies the goal node into the final token position (B). The next step is then identified by applying an iterative algorithm that climbs the tree one level per layer (C).

To understand the reasoning capabilities of language models, existing work has focused on developing sophisticated benchmarks for behavioral studies (Tafjord et al., 2021; Saparov and He, 2023; Valmeekam et al., 2023). However, the conclusions drawn from these studies do not provide insights into the internal mechanisms driving the observed capabilities. In contrast, recent work in the field of mechanistic interpretability attempts to understand the algorithms that models implement by reverse-engineering their internal mechanisms, and describing them at a certain level of abstraction (Elhage et al., 2021). For example, Nanda et al.
(2023a) reverse-engineered how a small transformer model implements modular addition, providing insights into the specific computations performed by different components of the model. Similarly, Olsson et al. (2022) discovered "induction heads" in transformers, which enable a distinct copying mechanism that is considered to be crucial for the in-context learning abilities of language models.

Contributions: This paper studies reasoning in language models by reverse-engineering the internal mechanisms of a transformer trained on a symbolic multi-step reasoning task. Specifically, we focus on path finding in a tree, a variation of the task proposed in Saparov and He (2023). By analyzing the internal representations of the model, we identify several key mechanisms:

1. A specific type of copying operation implemented in attention heads, which we call deduction heads. These are similar to the induction heads observed in Olsson et al. (2022). In our task, deduction heads intuitively serve the purpose of moving one level up the tree. These heads are implemented in multiple consecutive layers, which allows the model to climb multiple levels of the tree in a single forward pass.

2. A parallelization motif whereby the early layers of the model choose to solve several subproblems in parallel that may be relevant for solving many harder instances of the task.

3. A heuristic for tracking the children of the current node and whether these children are leaf nodes of the tree. This mechanism is used when the model is unable to solve the problem using deduction heads and parallelization.

We validate our findings using correlational and causal evidence, using techniques such as linear probing (Alain and Bengio, 2018), activation patching (Vig and Belinkov, 2019), and causal scrubbing (Chan et al., 2022).
2 Related Work Expressiveness of Transformers To understand the capabilities of transformers, one line of work characterizes their theoretical properties (Bhattamishra et al., 2020; Merrill et al., 2021; Pérez et al., 2021; Merrill et al., 2022; Liu et al., 2023). These studies answer questions about the expressiveness of the transformer architecture by treating them as approximators of different classes of functions (Yun et al., 2020), or by characterizing the class of formal languages that a transformer can recognize (Strobl et al., 2023). Mechanistic Interpretability In contrast to these theoretical investigations, a number of studies have adopted an empirical approach in order to understand what transformers learn in practice Elhage et al. (2021); Delétang et al. (2023); Zhang et al. (2024). Our analysis is inspired by existing work in the field of mechanistic interpretability, attempting to discover and understand the algorithms implemented in a model by reverse-engineering its internal mechanisms (Räuker et al., 2023). To explore these internal mechanisms, the field has adopted techniques such as activation patching Wang et al. (2023b), causal scrubbing (Chan et al., 2022), and circuit discovery (Conmy et al., 2023). In addition, considerable focus has been placed on the study of small models trained on specialized tasks, such as modular addition (Nanda et al., 2023a), or group operations (Chughtai et al., 2023), providing a more manageable framework for understanding complex computational processes. We present an extended discussion of the related work on mechanistic interpretability in Appendix A. Evaluating Reasoning Capabilities Existing approaches to evaluate the reasoning capabilities of language models focus on their performance on a range of downstream tasks (Huang and Chang, 2023). 
To enable a more formal analysis of reasoning, a number of studies have developed sophisticated metrics and benchmarks (Han et al., 2022; Golovneva et al., 2023; Wang et al., 2023a). For example, Saparov and He (2023) use a synthetic question-answering dataset based on a world model expressed in first-order logic to parse the generated reasoning processes into symbolic proofs for formal analysis. Their results suggest that language models are capable of making correct individual deduction steps. However, these approaches stop short of exploring the internal mechanisms that enable these capabilities (Huang and Chang, 2023). Hou et al. (2023) investigate whether language models solve reasoning tasks using information the model memorized from pretraining or using information provided in context. To investigate this question, they use linear probes to determine whether the model encodes a reasoning tree. The approach that comes closest to our work is Stolfo et al. (2023), presenting a mechanistic interpretation of arithmetic reasoning by investigating the information flow in the model given simple mathematical questions.

3 Background

Transformer Notation: Transformers (Vaswani et al., 2017) represent input text as a sequence t_1, t_2, …, t_N of tokens, such that t_i ∈ V, where V is a vocabulary. Each token t_i is embedded as a vector x_i^0 ∈ ℝ^d using an embedding matrix W_E ∈ ℝ^(|V|×d), where d is the dimension of the hidden state. These embeddings initialize the residual stream, which is then transformed through a sequence of L transformer blocks, each consisting of a multi-head attention sublayer and an MLP sublayer.
The representation of token t_i at layer ℓ is obtained by:

x_i^ℓ = x_i^(ℓ−1) + a_i^ℓ + m_i^ℓ    (1)

where a_i^ℓ and m_i^ℓ are the outputs of the attention and MLP sublayers. To predict the next token in the sequence, it applies an unembedding matrix W_U ∈ ℝ^(|V|×d) to the residual stream x_i^L, translating it into a probability distribution over the vocabulary.

Figure 2: Data Generation. To generate our training set, we (1) generate a binary tree, (2) select a leaf node as the goal node, and (3) determine the path from the root to the goal node.

Figure 3: Prompt Format. The model receives input in a structured format, with each box representing a token. The edge list of the tree is denoted as token pairs [A_1] [B_1], … [A_n] [B_n], followed by the task specification, including the goal [G] and the root node [P_1]. The model's objective is to predict the nodes in the path [P_2] … [P_(m−1)], culminating in the goal node [P_m] = [G]. For simplification, our tokenization distinguishes tokens representing source and target nodes of each edge, such as [15] and [→15].

Linear Probes: To investigate whether information is encoded in intermediate representations of the model, we use linear probes (Alain and Bengio, 2018) implemented as a linear projection from the residual stream.
This involves training a logistic regression model on a dataset of activations x_i^ℓ to predict an auxiliary classification problem. For example, we use it to determine whether information about an edge is encoded in the activations at a specific token position. If the performance of the linear probe is sufficiently high, we can treat this as evidence for the information being encoded.

Activation Patching: To evaluate the importance of a model component for a given prediction, we intervene by patching in the activations it would have had on a different input (also called resampling ablations) (Vig et al., 2020; Meng et al., 2022). This involves using a clean input s with an associated target prediction r, and a corrupted input s′ with a different target r′. Then, we cache the activations of the component on s, and evaluate the effect of patching in these activations when running the model on s′. To compute the effect of this intervention, we compute the difference in logits:

LD(r, r′) = Logit(r) − Logit(r′)    (2)

Causal Scrubbing: To evaluate specific hypotheses about internal mechanisms, we use causal scrubbing, which evaluates the effect of behavior-preserving resampling ablations (Chan et al., 2022). Specifically, given a hypothesis about which component of a model implements a specific behavior, we replace the activations of that component on some input with activations on another input, where our hypothesis predicts that the activations represent the same thing.
Then, we evaluate the impact of this intervention by computing how much of the initial performance is recovered:

L_CS = (L_scrubbed − L_random) / (L_model − L_random)    (3)

where L_model is the test loss of the trained model, L_random that of a model that outputs uniform logits, and L_scrubbed that of the trained model with the chosen activations resampled. In contrast to activation patching, which provides insights about whether a specific activation is causally linked to the output, causal scrubbing provides stronger evidence about the role of the activations in the model. Recovering the majority of the loss with a given hypothesis demonstrates that the activation corresponds to a specific variable in a high-level causal model.

4 Experimental Setup

4.1 Task Description

We focus on path finding in trees as a modified version of the task proposed by Saparov and He (2023). Our adaptation shifts the focus from reasoning in natural language to abstract symbolic reasoning. This allows us to better understand motifs that the models might be using to solve analogous problems in natural language. In our experimental setup, we generate training samples by generating binary trees T = (V, E) uniformly at random from the set of all trees with 16 nodes, i.e. |V| = 16. Then, for each tree, a leaf node is randomly selected as the target node (see Figure 3).
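The generation steps above (sample a tree, pick a leaf as the goal, read off the root-to-goal path) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it builds a random tree by assigning each non-root node a random earlier parent, which neither samples uniformly over all 16-node trees nor enforces the binary constraint.

```python
import random

def generate_example(n_nodes=16, rng=random):
    """Sample a random tree, pick a leaf as the goal node,
    and return (edges, goal, root, root-to-goal path)."""
    nodes = list(range(n_nodes))
    rng.shuffle(nodes)
    root = nodes[0]
    parent = {}
    for i, v in enumerate(nodes[1:], start=1):
        parent[v] = rng.choice(nodes[:i])   # random earlier node as parent
    edges = [(p, c) for c, p in parent.items()]
    # A leaf is any node that never appears as a parent.
    parents = set(parent.values())
    leaves = [v for v in nodes if v not in parents]
    goal = rng.choice(leaves)
    # Walk from the goal up to the root, then reverse.
    path = [goal]
    while path[-1] != root:
        path.append(parent[path[-1]])
    path.reverse()
    return edges, goal, root, path
```

Because the path between any two nodes in a tree is unique, any upward walk from the goal recovers the same path a search algorithm would find.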
The model is given the edge list [A_1] [B_1], [A_2] [B_2], … [A_n] [B_n] of the tree, a specified goal [G], and the root node [P_1], and is trained to predict the path from the root node to the goal [P_2] [P_3] … [P_m] such that [P_m] = [G], as depicted in Figure 3. The path between any two nodes in a tree is unique; therefore, the data generation is not affected by the type of search algorithm used to determine the path. We emphasize that this task is nontrivial, as the model has to perform multiple reasoning steps in a single forward pass to predict the next step. At each step, the model must determine from which node the goal is reachable. Thus, the model must perform multiple steps of reasoning for each next-token prediction without relying on techniques such as chain-of-thought or scratchpads (Wei et al., 2022), which would allow the model to roll out the reasoning steps over multiple next-token predictions.

4.2 Model Specification and Training Process

In our experiments, we use a 6-layer, decoder-only transformer with an embedding dimension of 128, a single attention head per layer, and a feed-forward dimension of 512, resulting in a total of 1.2 million parameters. The training dataset consists of 150,000 generated trees. The edge lists of these trees are shuffled to prevent the model from learning simple heuristics and to encourage structural understanding of trees.
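Serializing one example into the prompt format of Figure 3 might look as follows. The token naming here is an assumption for illustration: source nodes as '15', target nodes as '->15' (the paper distinguishes the two token types), with ',' between edges and '|' before the task specification; the paper's exact separators may differ.

```python
def serialize(edges, goal, root, path, rng=None):
    """Turn a tree example into (prompt tokens, target tokens).
    Illustrative token naming: source node 'a', target node '->b'."""
    edges = list(edges)
    if rng is not None:
        rng.shuffle(edges)                  # shuffled edge lists (Section 4.2)
    tokens = []
    for a, b in edges:
        tokens += [str(a), f"->{b}", ","]
    tokens += ["|", str(goal), str(root)]   # goal [G], root [P1]
    targets = [str(v) for v in path[1:]]    # predict [P2] ... [Pm] = [G]
    return tokens, targets
```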
To evaluate the performance of our model, we compute the accuracy based on the exact match of complete sequences using greedy decoding. Our model achieves 99.7% accuracy on a test set of 15,000 unseen trees, despite seeing just a small fraction of all possible trees during training (see Appendix B). This suggests that generalization is required for meaningful performance and that the model has learned a general procedure for pathfinding in trees.

5 Symbolic Reasoning using Backward Chaining

In this section, we present a mechanistic analysis of the internal mechanisms of the model and provide correlational and causal evidence. Our findings suggest that the model uses an interpretable and meaningful backward chaining algorithm to perform pathfinding in a tree. To help guide the reader, we present an intuitive explanation before breaking down the individual algorithmic steps in the following sections.

The Backward Chaining Algorithm: First, the model aggregates the source and target nodes of each edge in the edge list into the target node position (see Section 5.1). Then, the model starts at the goal node and moves up the tree one level with each layer of the model. This mechanism is implemented using attention heads, which we term "deduction heads" (see Section 5.2). By the composition of multiple attention heads in consecutive layers, the model can traverse the tree upwards for L−1 edges, where L is the number of layers in the model. We refer to this mechanism as backward chaining, inspired by the use of the term in the symbolic artificial intelligence literature (Russell and Norvig, 2009). In more complex scenarios, where the required path exceeds the depth of the model, it relies on backward chaining from multiple nodes in the tree in parallel. This creates multiple subpaths that can be used to find the correct next step in cases where the goal node is more than L−1 edges away (see Section 5.3).
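As a mental model, the depth-bounded climb can be sketched as a loop over the L−1 deduction-head layers. This is our reconstruction of the high-level algorithm, not the authors' code; `parent` is a hypothetical child-to-parent map recovered from the edge list.

```python
def backward_chain(parent, goal, current, n_layers=6):
    """Sketch of depth-bounded backward chaining: starting from the
    goal, each of the L-1 deduction-head layers climbs one edge up
    the tree. If the chain reaches the current position, the correct
    next step is the node one edge below it; otherwise the goal is
    more than L-1 edges away and other mechanisms must take over."""
    node = goal
    chain = [node]                    # subpath grown at the final token position
    for _ in range(n_layers - 1):     # at most L-1 deduction steps
        if node not in parent:        # reached the root
            break
        node = parent[node]
        chain.append(node)
        if node == current:
            return chain[-2]          # next step toward the goal
    return None                       # goal out of reach for pure backward chaining
```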
In addition, the model uses a simple heuristic as a fallback mechanism, where it identifies child nodes of the current position and evaluates whether these are leaf nodes of the tree. This enables the model to make informed guesses when backward chaining and parallelization are insufficient to solve the problem (see Section 5.4).

5.1 Edge Token Concatenation

The attention head in the first layer of the model creates edge embeddings by moving the information about the source token onto the target token for each edge in the context. Specifically, for each edge [A] [B], it copies the information from [A] into the residual stream at position [B]. This mechanism has some similarities with "previous token heads", as observed in pre-trained language models (Olsson et al., 2022; Wang et al., 2023a).

Experiment: Linear Probes. To validate that the model creates edge embeddings, we train a linear probe to predict the associated edge given the activations x^1 at the positions of target nodes. The probe is trained using 8,000 examples and evaluated on a test dataset of the same size. For comparison, we also report the performance of a linear probe given the activations x^0 at the positions of the target nodes, and of probes given the activations at the positions of the source nodes.

Results: Table 1 reports the performance of the linear probe measured using the F1 score. We find that we can successfully extract the source and target tokens [A] [B] from the residual stream activations x^1 at the position of the target tokens after the first layer, providing strong evidence for the edge token concatenation hypothesis as described above. Moreover, the model does not encode the complete edge in the position of the source token, which we attribute to causal masking in the attention mechanism.

Figure 4: Visualization of multi-layer attention patterns on an example input.
We show the attention from three selected positions: the path position, the register token at position 39, and the register token at position 44. We show that the path node starts backward chaining from the specified goal, while the two register tokens start backward chaining from different subgoals. Each token is highlighted by the color of the token that most strongly attends to it. The intensity of the color is based on the magnitude of the attention score. For details on how we select the register tokens and more examples, see Appendix E.

5.2 Backward Chaining

The most important mechanism the model uses to predict the correct next step is an iterative algorithm, which we refer to as backward chaining. The algorithm starts at the goal node and climbs the tree one level per layer. To this end, the model copies the goal node [G] into the final token position [P_i] (see Table 1) and then in each consecutive layer applies what we term "deduction heads".

Mechanism: Deduction Heads. The function of deduction heads is to search for the edge in the context in which the current position is the target node [B], find the corresponding source token [A], and then copy the source token over to the current position. Thus, deduction heads complete the pattern by mapping:

[A] [B] … [B] → [A]    (4)

In other words, this mechanism enables the model to go one step up the tree and append [A] after having seen the last [B] in the sequence. This mechanism depends on the edge token concatenation, which previously copied information about [A] into [B] (see Appendix C). This allows the deduction head to search for information about [B] but then copy information about [A] into the final token position.
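The pattern completion in Eq. 4 can be mimicked in code. In this sketch (our illustration, with hypothetical names), `edge_info` stands in for the source-token information that edge token concatenation wrote into each target-node position; the head matches on [B] but emits [A].

```python
def deduction_head(tokens, edge_info, query):
    """Sketch of a deduction head: attend to the position whose
    target-node token matches the current node `query`, then copy
    the *source* token stored there by the layer-1 edge token
    concatenation (`edge_info`: position -> source token)."""
    for pos in range(len(tokens) - 1, -1, -1):       # scan for the last [B]
        if tokens[pos] == f"->{query}" and pos in edge_info:
            return edge_info[pos]                    # emit [A] for ... [B] -> [A]
    return None
```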
By composition of multiple deduction heads in consecutive layers, the model is capable of climbing several steps up the tree in a single inference step. This creates a subpath at the final token position whose length is equal to the number of layers involved. In our model, we observe that the attention heads after the first layer can act as deduction heads, resulting in a backward chaining depth of at most L−1 steps. The role of these heads depends on the proximity of the current position to the goal node; for example, if the root node is just four steps away from the goal node, the model recognizes this and the attention head in the sixth layer does not act as a deduction head.

Table 1: F1 score of linear probes trained to predict the edge [A] [B] given the residual stream activations at position [A] or [B]. In addition, we report the performance of a linear probe to predict the goal node [G] from the positions in the path [P_i].

Probe                          x_i^0   x_i^1
Linear [A_i] → [A_i] [B_i]     0.13    0.19
Linear [B_i] → [A_i] [B_i]     0.11    1.00
Linear [P_i] → [G]             0.03    1.00

Experiment: Causal Scrubbing. To confirm that the model uses backward chaining to predict the next step for paths up to a depth of L−1, we use causal scrubbing (see Section 3). Specifically, we hypothesize that the attention head of layer ℓ is responsible for writing the node that is ℓ−1 edges above the goal into the final token position in the residual stream.
This implies that the output of the attention head in layer ℓ should be consistent across trees that share the same node ℓ−1 edges above the goal. To test this, we generate a clean and a corrupted graph that share the same node ℓ−1 edges above the goal node. Then, we substitute the output of the head on the clean graph with the output of the head on the corrupted graph, and measure the difference in the loss.

Figure 5: To test whether the model predicts the next step using backward chaining, we perform resampling ablations on each head using causal scrubbing. We find that we can recover close to 100% of the performance of the model for paths up to length L−1, providing strong evidence for our backward chaining hypothesis.

Results: Figure 5 illustrates the effect of causal scrubbing. We find that we can recover most of the performance (close to 100%) of the model for paths up to length L−1, providing strong evidence for our hypothesis about backward chaining. We also find that this hypothesis explains most of the behavior of the attention heads in the first four layers of the model, even on paths that require more than L−1 steps; only the attention heads in the final two layers behave significantly differently, such that the model ends up confidently making incorrect predictions after we apply causal scrubbing. This highlights that the heads in these two layers cannot be accurately described as deduction heads over the entire data distribution.

5.3 Path Merging

In cases where the goal is more than L−1 steps away from the current position, the previously described mechanism is insufficient. To address this, the model does not only perform backward chaining on the final token position, but in parallel on multiple different token positions. We refer to these tokens as "register tokens". The resulting subpaths are then merged on the final token position to facilitate more complex scenarios.
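The merging of subpaths onto the final token position can be sketched as repeated splicing: whenever a register subpath starts at the node where the main chain currently ends, it is appended. This is our illustrative reconstruction, with subpaths stored goal-first (from the subgoal toward the root).

```python
def merge_subpaths(main, register_subpaths):
    """Sketch of path merging: `main` is the subpath grown at the
    final token position, stored goal-first (e.g. ['G', 'C', 'B']);
    register subpaths are stored the same way, starting at their
    subgoal. A register subpath that starts where `main` ends is
    spliced on, letting the model jump several levels in one step."""
    pool = [list(s) for s in register_subpaths]
    merged = True
    while merged:
        merged = False
        for sub in pool:
            if len(sub) > 1 and sub[0] == main[-1]:
                main = main + sub[1:]   # splice the overlapping subpath
                pool.remove(sub)
                merged = True
                break
    return main
```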
Observation: Register Tokens. The role of register tokens is to act as working memory. These are tokens that do not contain any useful information for the actual task; either they do not contain any useful information to begin with, e.g. [,], or they are tokens whose information has been copied to other positions and thus contain redundant information. For more details on the role of register tokens in our context, see Appendix D.

Mechanism: Path Merging. To compute paths for which backward chaining on the final token position is not sufficient, the model uses register tokens to perform backward chaining in parallel from multiple positions in the tree. To this end, similar to backward chaining from the goal, it copies a node into each of these register tokens, which will then be picked up by the deduction heads. This results in multiple subpaths being stored at different positions in the sequence. The model can then merge these by finding overlapping subpaths. To illustrate, let us assume that a subpath [B] → [C] → [G] has been stored in the final token position and a different subpath [A] → [E] → [B] has been stored in some register token. Then, during path merging, the model copies the subpath stored in the register token into the final token position, enabling it to move up the tree multiple steps at a time.

Experiment: Linear Probe. To validate that the model is indeed generating subpaths in register tokens, we train a linear probe to predict these subpaths given the residual stream at these register token positions. We generate the training set for the linear probe by inspecting the attention patterns (see Figure 4). We observe that these patterns approximate hard attention, which allows us to read off the token positions from which the head is copying information. We extract the subpath we would expect to be encoded at each of these register tokens and transform them into an adjacency matrix representation.
Then, we train the probe on a set of 8,000 examples and evaluate it on a test set of the same size. The probe achieves an F1 score of 92.82% when trained on the activations of layer four, i.e. up to the point where the model is performing backward chaining.

Table 2: F1 score of a linear probe trained to predict the adjacency matrix of the subpath we hypothesise to be encoded in the activations x_i^4 at register token positions.

Probe                  x_i^4
Linear [R_i] → [S_i]   92.82

Experiment: Register Token Patching. To evaluate whether the subpaths stored in the register tokens have a causal effect on the prediction of the model, we perform resampling ablations (see Section 3) on the register token positions in trees which (i) contain sufficiently long paths such that backward chaining on the final token position is insufficient, and (ii) contain positions where nodes have multiple child nodes, ensuring that the model has to make a decision between multiple options. The corrupted activations are extracted from another tree in the same class as described above. We compute the effect of this intervention using the logit difference. We perform this intervention after layer four, as we found that the behavior of all previous layers is explained by our backward chaining hypothesis (see Section 5.2).

Figure 6: To test whether the model relies on subpaths stored in register tokens, we perform resampling ablations on the register token positions at x_i^4. Here, we conducted 10 separate runs, each involving 1,000 samples. For each run, we calculate the mean logit difference and report the 95% confidence interval for the average effects observed across the runs. The results demonstrate that these subpaths are instrumental for paths longer than L−1 steps.
Results Figure 6 illustrates the impact of patching the register tokens on the model predictions at different path lengths. Our results show that the intervention has no effect on performance up to a path depth of 4 and minimal effect at depth 5, which is consistent with our backward chaining hypothesis. Beyond this depth, the intervention has a significant effect on performance. This suggests that the encoded subpaths are causally relevant for predicting next steps on paths that are more than L-1 steps away from the goal. However, our findings also indicate that the predictions are not solely dependent on these derived subpaths; other factors contribute to the prediction as well. This includes the influence of a one-step lookahead mechanism, which identifies child nodes of the current position and increases the probabilities of the children that are not leaves.

5.4 One-Step Lookahead We find that the model uses an additional mechanism, which identifies child nodes of the current position and increases the prediction probabilities of the children that are not leaf nodes of the tree. This enables the model to make informed guesses in cases where backward chaining is not sufficient. This mechanism is particularly effective on long paths, as these have a lower branching factor in our experimental setup; it is thus a pragmatic strategy to minimize the training error.

Figure 7: Contribution of each target node position to the logits at the path position through the attention heads L5.H1 and L6.H1 on a specific example.

Experiment: Linear Probes To validate that the model represents the child nodes of the current position, including whether they are leaf nodes, we use linear probes. The probes are trained to predict this information given the activations at the final token position. Table 3 reports the performance of the linear probes measured using the F1 score.
Our analysis shows that the model starts to collect information about the children and leaf nodes from the fourth layer and represents both aspects in the fifth layer.

Table 3: F1 score of linear probes trained to predict the children of the current position, and whether these are leaves of the tree, given the residual stream activations at position [P_i].

Probe | x_i^4 | x_i^5 | x_i^6
Linear [P_i] → [Children_i] | 0.00 | 47.88 | 98.20
Linear [P_i] → [Leaves_i] | 0.00 | 49.71 | 95.76

Mechanism The results from the linear probes suggest that the attention heads in the final two layers of the model are responsible for the one-step lookahead mechanism. To better understand this mechanism, we examine how these two attention heads directly compose with the unembedding matrix. Specifically, we compute the contribution of each token to the logits through these attention heads. Since each edge is represented in its target node (see Section 5.1), we focus on target node positions (see Figure 7). By inspecting their query-key (QK) circuits (Elhage et al., 2021), we find that these two heads attend to the target node of every edge except those whose source node is the current path position. Furthermore, we can break the mechanism in their output-value (OV) circuits into three components: 1. Each edge decreases the logit of its target node. 2. Each edge increases the logit of its source node. 3. Each token in the path decreases its own logit. As a result of these components, the logits of the leaf nodes in the tree decrease, while the logits of the children of the current position increase.
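The three OV-circuit components can be illustrated with a qualitative toy calculation. This is a sketch of the described behavior using unit-weight contributions, not the actual circuit weights; the function name and the ±1 magnitudes are assumptions.

```python
def lookahead_logit_adjustments(edges, path):
    """Qualitative sketch of the three OV-circuit components:
    1. each edge decreases the logit of its target node,
    2. each edge increases the logit of its source node,
    3. each token already on the path decreases its own logit.
    """
    logits = {}
    for src, tgt in edges:
        logits[tgt] = logits.get(tgt, 0) - 1    # component 1
        logits[src] = logits.get(src, 0) + 1    # component 2
    for node in path:
        logits[node] = logits.get(node, 0) - 1  # component 3
    return logits

# Tiny tree: A has children B and C; B has children D and E.
# C, D, and E are leaves; the current path position is A.
adj = lookahead_logit_adjustments(
    [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E")], ["A"]
)
```

On this toy tree, the non-leaf child B receives a net positive adjustment (target of one edge, source of two), while the leaf child C and the other leaves D and E are pushed down, matching the effect described in the text.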
For the other nodes, the logit increase from being a parent and the logit decrease from being a child node roughly cancel out, causing their logits to remain approximately constant.

6 Discussion

Register Tokens Our model uses some token positions as a form of working memory to store intermediate results. This observation aligns with Darcet et al. (2023), who found that image models use some image patches to accumulate global information while discarding spatial information. Similarly, Goyal et al. (2023) show that adding uninformative tokens at the end of each prompt can enhance language model performance on downstream tasks without introducing additional parameters. Our findings suggest that these techniques enable the model to store intermediate results and perform more computations in parallel. This is consistent with theoretical insights from Merrill et al. (2022), which highlight how the effective state of a transformer depends on the number of tokens in the sequence.

Structural Recursion Transformers, which are by definition non-recurrent, struggle with emulating structural recursion and extracting recursive rules from data (Zhang et al., 2024). This ability is crucial in domains such as programming and formal mathematics, where understanding complex relationships relies on recursion. Our analysis provides insights into possible reasons for this limitation. In our setting, training the model with standard next-token prediction objectives forces it to unroll the entire recursive structure in a single forward pass. This restricts its ability to process recursion, leading it to resort to shortcut solutions (Liu et al., 2023).

Reasoning in Transformers There is an ongoing debate about the reasoning capabilities of transformers (Huang and Chang, 2023).
Some argue that these models might just be capable of memorizing patterns without gaining causal understanding, which could lead to diminishing performance on out-of-distribution data (Bender and Koller, 2020; Floridi and Chiriatti, 2020; Bender et al., 2021; Merrill et al., 2021). However, several observations suggest that transformers might be capable of more than pattern recognition; e.g. Olsson et al. (2022) found a simple algorithm implemented in attention heads that contributes to the in-context learning abilities of transformers and operates independently of the specific tokens. This algorithm does more than memorize patterns and can, in some sense, work out-of-distribution. In our synthetic setting, we found that the model learned an interpretable and meaningful backward chaining algorithm, supporting the claim that transformers might be capable of a form of reasoning that goes beyond simple pattern memorization. These results further complement the findings of Hou et al. (2023), shedding light on the mechanisms transformer models might use to compute reasoning trees over information provided in context. However, it is important to note that findings from our synthetic setting do not support the broader claim that transformers possess general reasoning capabilities, highlighting the need for further investigation.

7 Conclusion In this paper, we conducted a mechanistic analysis of a transformer trained on pathfinding in trees. Our results suggest that the model implements a backward chaining algorithm that starts at the goal node and climbs the tree one level per layer. To solve more complex problems, where the required reasoning depth exceeds the number of layers, it executes the same mechanism in parallel across multiple register tokens and combines the resulting subpaths at the final token position.
In addition, it performs a simple one-step lookahead in which it finds the child nodes of the current position and evaluates whether they are leaf nodes. Our findings in this synthetic setting demonstrate the ability of a transformer to perform deductive reasoning up to a certain reasoning depth, after which it resorts to simple heuristics. By using parallelized computations to store intermediate results in register tokens and then merging these results at the final token position, the model demonstrates a form of deductive reasoning that, while effective within a given setting, is constrained by its architecture. These observations suggest that transformers may exhibit an inductive bias towards adopting highly parallelized strategies for tasks involving search, planning, or reasoning.

Limitations

Synthetic Task Our experiments were conducted on a symbolic reasoning task (see Chapter 4.1). This allowed us to bypass the complexities associated with natural language, such as multi-token embeddings (Nanda et al., 2023c). In addition, our tokenization distinguishes tokens representing the source and target nodes of each edge, such as [15] and [→15]. Therefore, our findings are specific to our model, and it remains unclear whether large language models trained on natural language use similar mechanisms to solve this task. However, we anticipate that the motifs we discovered in our synthetic setting can provide valuable insights into the broader operating principles of transformers and thus provide a basis for understanding more complex models.

Input Format To prevent the model from learning shortcuts based on the order of the edges in the prompt, we trained our model on shuffled edge lists (see Section 4.2). However, our analysis is limited to sequences in which the edge list is presented in backward order.
By backward order we mean a listing of edges that starts with the leaf nodes and ascends level by level to the root node, as opposed to a forward order where the listing starts with the root node and progresses downwards through each level. Our investigation does not extend to a detailed examination of alternative arrangements of the edge list. However, preliminary observations suggest that the model uses similar mechanisms with minor variations, such as the use of different register tokens. Acknowledgements Jannik Brinkmann is supported by the German Federal Ministry for Digital and Transport (BMDV) and the German Federal Ministry for Economic Affairs and Climate Action (BMWK). Abhay Sheshadri and Victor Levoso have been supported by Lightspeed Grants. References Alain and Bengio (2018) Guillaume Alain and Yoshua Bengio. 2018. Understanding intermediate layers using linear classifier probes. Belrose et al. (2023) Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting latent predictions from transformers with the tuned lens. Bender et al. (2021) Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY, USA. Association for Computing Machinery. Bender and Koller (2020) Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics. Bhattamishra et al. (2020) Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. 2020. On the Ability and Limitations of Transformers to Recognize Formal Languages. 
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7096–7116, Online. Association for Computational Linguistics. Catalan (1838) Eugène Charles Catalan. 1838. Note sur une équation aux différences finies. Journal de Mathématiques Pures et Appliquées, 3:508–516. Chan et al. (2022) Lawrence Chan, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny, Ansh Radhakrishnan, Buck, and Nate Thomas. 2022. Causal Scrubbing: a method for rigorously testing interpretability hypotheses. Chughtai et al. (2023) Bilal Chughtai, Lawrence Chan, and Neel Nanda. 2023. Neural networks learn representation theory: Reverse engineering how networks perform group operations. In ICLR 2023 Workshop on Physics for Machine Learning. Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Conmy et al. (2023) Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. Darcet et al. (2023) Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. 2023. Vision transformers need registers. Delétang et al. (2023) Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A. Ortega. 2023. Neural networks and the chomsky hierarchy. Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. 
A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html. Floridi and Chiriatti (2020) Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30(4):681–694. Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. Golovneva et al. (2023) Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. Roscoe: A suite of metrics for scoring step-by-step reasoning. Goyal et al. (2023) Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. 2023. Think before you speak: Training language models with pause tokens. Hagberg et al. (2008) Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring network structure, dynamics, and function using networkx. In Proceedings of the 7th Python in Science Conference (SciPy2008), pages 11–15, Pasadena, CA USA. Han et al. (2022) Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, David Peng, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Shafiq Joty, Alexander R. Fabbri, Wojciech Kryscinski, Xi Victoria Lin, Caiming Xiong, and Dragomir Radev. 2022. Folio: Natural language reasoning with first-order logic. Hou et al. (2023) Yifan Hou, Jiaoda Li, Yu Fei, Alessandro Stolfo, Wangchunshu Zhou, Guangtao Zeng, Antoine Bosselut, and Mrinmaya Sachan. 2023. Towards a mechanistic interpretation of multi-step reasoning capabilities of language models. Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey.
In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics. Huang et al. (2022) Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9118–9147. PMLR. Ivanitskiy et al. (2023) Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, and Samy Wu Fung. 2023. Structured world representations in maze-solving transformers. Kojima et al. (2023) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large language models are zero-shot reasoners. Liu et al. (2023) Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. 2023. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations. Marks and Tegmark (2023) Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. Meng et al. (2022) Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems. Merrill et al. (2021) William Merrill, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Provable limitations of acquiring meaning from ungrounded form: What will future language models understand? Transactions of the Association for Computational Linguistics, 9:1047–1060. Merrill et al. (2022) William Merrill, Ashish Sabharwal, and Noah A. Smith. 2022. Saturated transformers are constant-depth threshold circuits. 
Transactions of the Association for Computational Linguistics, 10:843–856. Merullo et al. (2024) Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024. Circuit component reuse across tasks in transformer language models. Mikolov et al. (2013) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics. Mini et al. (2023) Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexander Matt Turner. 2023. Understanding and controlling a maze-solving policy network. Nanda and Bloom (2022) Neel Nanda and Joseph Bloom. 2022. Transformerlens. https://github.com/neelnanda-io/TransformerLens. Nanda et al. (2023a) Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023a. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations. Nanda et al. (2023b) Neel Nanda, Andrew Lee, and Martin Wattenberg. 2023b. Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, Singapore. Association for Computational Linguistics. Nanda et al. (2023c) Neel Nanda, Senthooran Rajamanoharan, Janos Kramar, and Rohin Shah. 2023c. Fact finding: Attempting to reverse-engineer factual recall on the neuron level. Olsson et al. 
(2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2022. In-context learning and induction heads. Pérez et al. (2021) Jorge Pérez, Pablo Barceló, and Javier Marinkovic. 2021. Attention is turing-complete. Journal of Machine Learning Research, 22(75):1–35. Russell and Norvig (2009) Stuart Russell and Peter Norvig. 2009. Artificial intelligence, 3 edition. Pearson, Upper Saddle River, New Jersey. Räuker et al. (2023) Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. 2023. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. Saparov and He (2023) Abulhair Saparov and He He. 2023. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In The Eleventh International Conference on Learning Representations. Stolfo et al. (2023) Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. 2023. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7035–7052, Singapore. Association for Computational Linguistics. Strobl et al. (2023) Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. 2023. Transformers as recognizers of formal languages: A survey on expressivity. Syed et al. (2023) Aaquib Syed, Can Rager, and Arthur Conmy. 2023. Attribution patching outperforms automated circuit discovery. Tafjord et al. (2021) Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating implications, proofs, and abductive statements over natural language. 
In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621–3634, Online. Association for Computational Linguistics. Tigges et al. (2023) Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. Linear representations of sentiment in large language models. Valmeekam et al. (2023) Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Variengien and Winsor (2023) Alexandre Variengien and Eric Winsor. 2023. Look before you leap: A universal emergent decomposition of retrieval tasks in language models. Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Vig and Belinkov (2019) Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics. Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc. Wang et al. (2023a) Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. 2023a. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2717–2739, Toronto, Canada. 
Association for Computational Linguistics. Wang et al. (2023b) Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023b. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations. Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems. Yun et al. (2020) Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. 2020. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations. Zhang et al. (2024) Dylan Zhang, Curt Tigges, Zory Zhang, Stella Biderman, Maxim Raginsky, and Talia Ringer. 2024. Transformer-based models are not yet perfect at learning to emulate structural recursion. Zhong et al. (2023) Ziqian Zhong, Ziming Liu, Max Tegmark, and Jacob Andreas. 2023. The clock and the pizza: Two stories in mechanistic explanation of neural networks. Appendix A Extended Discussion of Mechanistic Interpretability Our work contributes to the emerging field of mechanistic interpretability, which seeks to reverse-engineer the internal mechanisms of neural networks into human-understandable algorithms (Elhage et al., 2021). To this end, the model is considered a causal graph (Meng et al., 2022), with the intent of finding interpretable and meaningful subgraphs (circuits) that are responsible for solving the task in question, such as understanding a circuit implementing modular addition (Nanda et al., 2023a), or understanding a circuit responsible for indirect object identification (Wang et al., 2023b). Similar works studied how models perform in-context learning using induction heads, as found by Olsson et al. 
(2022), or how models perform retrieval tasks using an internal modular decomposition, as shown in Variengien and Winsor (2023). Moreover, Merullo et al. (2024) provide evidence that certain mechanisms are reused across different tasks, indicating that models can use similar mechanisms to solve different tasks. However, these studies of specific mechanisms and circuits in language models required substantial manual effort. In addition, Zhong et al. (2023) demonstrated that small changes in hyperparameters and initializations can lead to the emergence of qualitatively different algorithms. To address this, some researchers have attempted to automate the process of finding these circuits using methods such as automated circuit discovery (Conmy et al., 2023) or edge attribution patching (Syed et al., 2023). Similar research investigates how neural networks process and store information. For example, Geva et al. (2023) explored how models recall factual information, and Meng et al. (2022) studied how this information can be localized and edited without re-training. This also includes work on understanding internal representations. In this context, there is an ongoing debate about the linear representation hypothesis (Mikolov et al., 2013), the idea that high-level concepts are represented linearly as directions in representation space. Here, a model is considered to linearly represent a concept if the concept can be probed from the activations with a linear model. For example, Nanda et al. (2023b) and Mini et al. (2023) studied the internal representations of models trained on a board game and found that they develop a representation of the current board state that can be extracted using linear probes. Importantly, they find that whether the representation can be extracted with a linear model depends on the encoding of the target label. Other linear representations have been found in language models; e.g. Tigges et al.
(2023) find a linear representation of sentiment in text, and Marks and Tegmark (2023) find a linear representation of the truth value of input text. Ivanitskiy et al. (2023) explored a transformer trained to perform a maze-solving task and found that they can successfully extract the boundaries of the maze using linear probes. They also observed attention heads that attend to positions in the maze that are one step away from the current position.

Appendix B Experimental Setup

B.1 Implementation and Computing All experiments were carried out on a single NVIDIA RTX A6000 GPU. The total computation time for training the transformer model was less than 24 hours. To generate the trees, we used networkx (Hagberg et al., 2008). For training and execution of all experiments, we used TransformerLens (Nanda and Bloom, 2022). For details on the model and training configuration, see Tables 4 and 5.

Table 4: Model Configuration
Parameter | Value
# of layers | 6
# of heads | 1
Residual stream dim. | 128
Attention head dim. | 128
Feed-forward dim. | 512
Activation function | gelu
Vocabulary size | 35
Context size | 63

Table 5: Training Configuration
Parameter | Value
Learning rate | 1e-3
Optimizer | AdamW
Batch size | 64
Betas | (0.9, 0.99)
Weight decay | 0.01

B.2 Size of Training Set Our dataset consists of 150,000 randomly generated examples, each including a labeled binary tree with 16 nodes. The number of possible unlabeled binary trees with n+1 nodes is given by the n-th Catalan number (Catalan, 1838):

C(n) = (2n)! / ((n+1)! · n!)

When considering labeled binary trees, this number grows to (n+1)! · C(n) unique trees. This suggests that memorization is infeasible and generalization is required for meaningful performance.
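The tree counts above can be checked in a few lines of Python (a sketch; the function names are ours):

```python
from math import comb, factorial

def catalan(n):
    # n-th Catalan number: C(n) = (2n)! / ((n+1)! * n!) = binom(2n, n) / (n+1)
    return comb(2 * n, n) // (n + 1)

# Binary trees with 16 nodes correspond to n = 15.
unlabeled = catalan(15)              # number of unlabeled tree shapes
labeled = factorial(16) * unlabeled  # number of labeled binary trees
```

This gives 9,694,845 unlabeled shapes and roughly 2 × 10^20 labeled trees, far beyond what a 150,000-example training set could cover by memorization.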
Appendix C Attention Head Composition In this section, we perform additional experiments to verify that L1.H1 performs edge token concatenation and that L2.H1 is a deduction head. Elhage et al. (2021) show that transformers can learn induction heads in two different ways, involving different compositions: K-composition, where the W_K of the head reads from the output of the previous head, and V-composition, where the W_V of the head reads from the output of the previous head. Our findings suggest that the deduction heads we found in our model are a result of K-composition. In the following subsections, we adopt the conventions of Elhage et al. (2021).

Layer 1 - Edge Token Concatenation Head L1.H1 serves as a variation of the previous-token head studied in Elhage et al. (2021), effectively transferring source node information to the corresponding target node in each edge. This is captured in the QK-circuit:

M^0_QK = W_P W^0_QK W_P^T

M^0_QK (see Figure 8) shows that the attention values are maximized when the query vector corresponds to the position embedding of an incoming node and the key vector corresponds to the position embedding of the immediately preceding outgoing node. Then, following the goal node at position 45, the head persistently attends to the goal position.

Figure 8: Visualization of M^0_QK

Layer 2 - Deduction Head L2.H1 is a deduction head, which attends to the target node that matches the goal at positions in the path. It then moves information about the source node of that edge into the last position in the window.
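The QK-circuit composition above can be reproduced numerically on toy weights. In the sketch below, the previous-token pattern is planted by hand (the construction of `W_QK` and all dimensions are assumptions for illustration); only the composition M_QK = W_P W_QK W_P^T mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_model = 8, 16

# Hypothetical position embedding matrix with orthonormal rows.
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_P = Q[:, :n_pos].T                 # (n_pos, d_model)

# Engineer QK weights so that query position i scores key position
# i - 1 highest, i.e. a previous-token attention pattern.
S = np.zeros((n_pos, n_pos))
for i in range(1, n_pos):
    S[i, i - 1] = 1.0
W_QK = W_P.T @ S @ W_P               # (d_model, d_model)

# Composed QK circuit over positions, as in the text.
M_QK = W_P @ W_QK @ W_P.T            # (n_pos, n_pos)
```

Because the rows of `W_P` are orthonormal, `M_QK` recovers the planted pattern, and the highest-scoring key for each query position i ≥ 1 is position i − 1. In the trained model, the analogous structure is read off Figure 8 rather than engineered by hand.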
It can be viewed as a reverse induction head (Olsson et al., 2022) that uses K-composition (Elhage et al., 2021) to map a sequence [A] [B] … [A] → [B], where [B] represents target nodes and [A] represents source nodes. This can again be verified by looking at the QK-circuit:

M^1_QK = (MLP^0(W^0_OV W_E) + W^0_OV W_E) W^1_QK W_E^T

This matrix shows the interactions of the embeddings of the source and target tokens at layer 2 (see Figure 9). Our analysis is complicated by the fact that our model is not attention-only: attention heads can compose with each other through the MLP, which makes similar analyses in later layers of the model intractable. However, our causal scrubbing results provide evidence that the attention heads in the subsequent layers implement a similar mechanism to L2.H1, but use the output of the previous layer's attention head to backward chain further up the tree.

Figure 9: Visualization of a subset of M^1_QK showing interactions between source and target node tokens.

Appendix D Register Tokens and Subgoals In this section, we provide more details on the role of the register tokens in our model. From each register token position, the model attends to a random node in the context and starts backward-chaining from that node. The initial selection of a node by the register token can be viewed as identifying a subgoal from which the model can perform backward chaining. This precomputation occurs before the actual goal is specified and runs fully in parallel to the main backward-chaining mechanism.
These findings hint that transformers may exhibit an inductive bias toward learning highly parallelized algorithms when trained to perform search, planning, or reasoning. To see whether there is any structure in the selection of subgoals, we empirically study which nodes the register tokens select as subgoals across 1000 samples. The results are illustrated in Figure 10.

Figure 10: Preferences in Subgoal Selection: ratio of register tokens attending to different subgoals, aggregated across 1000 trees. We consider a register token to select a subgoal based on an attention threshold of 0.3.

We observe that the model usually attends to the same tokens, e.g. position 36 attends to token [6] most of the time. However, we observe an interesting dynamic in which the register token selects a different subgoal in two cases:
1. If the node does not occur before the register token position, the register token cannot attend to it due to causal masking.
2. If the node is a leaf node of the tree, since it then has no corresponding source token to attend to.

To validate this, we again examine the probability of the model selecting subgoals in trees where the most common subgoal occurs before the register token position and is not a leaf node (see Figure 11).

Figure 11: Preferences in Subgoal Selection: ratio of register tokens attending to different subgoals, aggregated across 1000 trees in which the node each register token attends to most often is not a leaf node and occurs before the register token. We consider a register token to select a subgoal based on an attention threshold of 0.3.

Further exploration reveals that the subgoals selected by each register token position can be partly understood through an examination of the embedding matrices. We evaluate a selection of seven register token positions that are used across several different examples and show their preferred subgoals.
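The aggregation behind Figures 10 and 11 can be sketched as a simple tally over per-tree attention maps. A minimal sketch, assuming attention matrices are available as numpy arrays (the function and variable names are illustrative, not from the paper's code; the 0.3 threshold is from the text):

```python
import numpy as np
from collections import Counter

def subgoal_counts(attn_maps, register_pos, threshold=0.3):
    """Tally which key position each register token 'selects' as a subgoal.

    attn_maps    : iterable of (n_pos, n_pos) attention matrices, one per tree,
                   where row q holds the attention weights from query position q
    register_pos : query position of the register token under study
    A key position counts as selected when its weight exceeds the threshold.
    """
    counts = Counter()
    for attn in attn_maps:
        for key_pos, w in enumerate(attn[register_pos]):
            if w > threshold:
                counts[key_pos] += 1
    return counts

# Toy example: two 4x4 attention maps, register token at query position 3.
a1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.1, 0.7, 0.1, 0.1]])  # selects key position 1
a2 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.6, 0.2, 0.1, 0.1]])  # selects key position 0
counts = subgoal_counts([a1, a2], register_pos=3)
```

Normalizing `counts` by the number of trees gives the selection ratios plotted in the figures.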
By composing the embedding and position embedding matrices with the QK-circuit of the first layer's attention head, we define $R_P$ as:

$$R_P = W_P \, W^1_{QK} \, W_E^T$$

where $W_E$ is the embedding matrix, $W_P$ is the position embedding matrix, and $W^1_K, W^1_Q$ are the key and query projection matrices of L1.H1. This explains how the model selects subgoals: the key for each positional embedding of a register token matches some specific source node token.

Figure 12: Plot of $R_P$.

Appendix E Attention Patterns

Here, we visualize attention patterns of our model on a few example inputs (see Figures 13 to 16), in addition to the example illustrated in Figure 4, to provide intuition for the backward chaining mechanism. In each example, we highlight the attention from the final token position and from the register tokens that are causally relevant for the prediction. To determine these, we use attention knockout (Geva et al., 2023) on the register token positions. This prevents the final token from attending to the register tokens by zeroing out the corresponding attention weights, which allows us to test whether critical information propagates from them. More formally, let $a, b \in [1, N]$ be two positions such that $a \le b$. We block $x_b^\ell$ from attending to $x_a^\ell$ at layer $\ell < L$ by updating the attention weights of that layer:

$$M^{\ell+1, j}_{ab} = -\infty \quad \forall j \in [1, H]$$

This prevents position $b$ from obtaining information from position $a$ at that particular layer.
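In practice, the knockout is applied to the pre-softmax attention scores, so the remaining weights renormalize automatically. A minimal numpy sketch of this intervention (a sketch of the general technique, not the paper's implementation; `blocked` pairs are given in `(query, key)` order):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knockout_attention(scores, blocked):
    """Attention knockout: set the pre-softmax score of each blocked
    (query, key) edge to -inf so that, after softmax, the query position
    receives exactly zero weight from the blocked key position.

    scores  : (n_pos, n_pos) pre-softmax attention scores for one head
    blocked : iterable of (query_pos, key_pos) pairs to sever
    """
    scores = scores.copy()
    for q, k in blocked:
        scores[q, k] = -np.inf
    return softmax(scores, axis=-1)

# Toy example: block the final position (query 3) from attending to key 1.
weights = knockout_attention(np.zeros((4, 4)), blocked=[(3, 1)])
print(weights[3])  # weight on key 1 is 0; the remaining weights renormalize
```

Blocking the same edge in several consecutive layers, as described next, closes off indirect routes through intermediate positions.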
To avoid information leakage across positions, we block attention edges in multiple layers rather than in a single one. Specifically, we block attention to the register tokens in the final two layers of the model.

Figure 13: Visualization of multi-layer attention patterns on an example input, similar to Figure 4. We show the attention from three different positions: the path position, the register token at position 39, and the register token at position 45.

Figure 14: Visualization of multi-layer attention patterns on an example input, similar to Figure 4. We show the attention from two different positions: the path position and the register token at position 41.

Figure 15: Visualization of multi-layer attention patterns on an example input, similar to Figure 4. We show the attention from two different positions: the path position and the register token at position 36.

Figure 16: Visualization of multi-layer attention patterns on an example input, similar to Figure 4. We show the attention from three different positions: the path position, the register token at position 41, and the register token at position 42.

Appendix F Additional Experiment on the One-Step Lookahead

To evaluate the impact of the one-step lookahead, we perform a causal scrubbing experiment that incorporates the described mechanism. We reuse the experimental setup from Section 5.2 but add additional constraints to our resampling scheme. Specifically, we avoid resampling the contributions of the target node and register token positions through the attention heads of the last two layers. We visualize the results in Figure 17.

Figure 17: To test the impact of the one-step lookahead, we replicate the causal scrubbing experiment from Section 5.2 but add additional constraints to our resampling scheme. We find that we can recover most of the model's performance across the full training distribution.

Appendix G Tuned Lens

In this section, we provide an additional piece of evidence in favor of the existence of the backward chaining mechanism.
To understand how the predictions of a transformer are built layer by layer, Belrose et al. (2023) developed the Tuned Lens, a method that trains a linear model to translate the activations of an intermediate layer directly into the input of the unembedding layer. Inspired by this approach, we replace the last $n$ layers of the model with a linear transformation trained to predict the next token from the residual stream activations $x^{L-n}$. Similar to the Tuned Lens, this method allows us to skip over these layers and see the current best prediction that can be made from the model's residual stream. Intuitively, this allows us to peek at the iterative computation the transformer uses to produce the next token. Here, we present visualizations of some example trees and the results of this iterative computation (see Figures 18 to 21). These figures highlight the current best prediction a linear transformation could make based on the internal activations. We project the logits output by the linear model back onto the tree structure to better visualize the backward-chaining procedure.

Figure 18: Example 1 (Path Length 5): Results of a linear transformation to predict the next step based on the residual stream activations after each layer, projected onto the tree structure. The yellow border highlights the current best prediction(s) of the linear transformation.

Figure 19: Example 2 (Path Length 8): Results of a linear transformation to predict the next step based on the residual stream activations after each layer, projected onto the tree structure. The yellow border highlights the current best prediction(s) of the linear transformation.

Figure 20: Example 3 (Path Length 10): Results of a linear transformation to predict the next step based on the residual stream activations after each layer, projected onto the tree structure. The yellow border highlights the current best prediction(s) of the linear transformation.
Figure 21: Example 4 (Path Length 13): Results of a linear transformation to predict the next step based on the residual stream activations after each layer, projected onto the tree structure. The yellow border highlights the current best prediction(s) of the linear transformation.
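The core of such an early-exit probe is a linear map fitted from intermediate residual-stream activations to next-token logits. A minimal least-squares sketch of the idea, assuming activations and target logits have been collected into numpy arrays (the names and the plain `lstsq` fit are illustrative; the Tuned Lens itself trains affine translators per layer with a KL objective):

```python
import numpy as np

def fit_early_exit_probe(resid, logits):
    """Fit a linear map resid @ W ~ logits by least squares: a Tuned-Lens-style
    translator that reads the residual stream x^{L-n} at an intermediate layer
    and predicts next-token logits directly, skipping the remaining n layers.

    resid  : (n_samples, d_model) residual-stream activations
    logits : (n_samples, vocab)   final-model logits to imitate
    """
    W, *_ = np.linalg.lstsq(resid, logits, rcond=None)
    return W  # (d_model, vocab)

# Toy example: if the activation-to-logit map really is linear,
# the probe recovers it exactly.
rng = np.random.default_rng(0)
resid = rng.normal(size=(200, 8))
W_true = rng.normal(size=(8, 5))
W_hat = fit_early_exit_probe(resid, resid @ W_true)
print(np.allclose(W_hat, W_true))  # True
```

Applying `resid @ W_hat` at each layer and taking the argmax yields the per-layer "current best prediction" visualized in Figures 18 to 21.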