
Paper deep dive

Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Hung-Hsuan Chen

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 39

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/26/2026, 2:32:07 AM

Summary

The paper introduces a 'depth-recurrent Transformer' that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space. This architecture, which includes 'silent thinking' (final-step supervision), LayerScale initialization, and identity-biased gated recurrence, enables stable multi-step reasoning (20+ steps) without increasing token generation or context window usage. The authors demonstrate that this approach achieves robust out-of-distribution generalization across graph reachability, nested boolean logic, and relational text tasks, identifying a 'computational frontier' where performance scales with task complexity.

Entities (5)

Depth-Recurrent Transformer · architecture · 100%
Identity-Biased Recurrence · mechanism · 95%
LayerScale · mechanism · 95%
Silent Thinking · mechanism · 95%
Computational Frontier · phenomenon · 90%

Relation Signals (4)

Depth-Recurrent Transformer → incorporates → Silent Thinking

confidence 100% · Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective

Depth-Recurrent Transformer → incorporates → LayerScale

confidence 100% · Our architecture incorporates three mechanisms... (2) LayerScale initialization

Depth-Recurrent Transformer → incorporates → Identity-Biased Recurrence

confidence 100% · Our architecture incorporates three mechanisms... (3) an identity-biased recurrence

Depth-Recurrent Transformer → evaluated on → Graph Reachability

confidence 95% · We evaluate on three compositional reasoning domains... graph reachability

Cypher Suggestions (2)

Find all mechanisms used by the Depth-Recurrent Transformer architecture. · confidence 90% · unvalidated

MATCH (a:Architecture {name: 'Depth-Recurrent Transformer'})-[:INCORPORATES]->(m:Mechanism) RETURN a.name, m.name

Identify tasks evaluated by the architecture. · confidence 90% · unvalidated

MATCH (a:Architecture {name: 'Depth-Recurrent Transformer'})-[:EVALUATED_ON]->(t:Task) RETURN t.name

Abstract

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space -- enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier -- a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

38,450 characters extracted from source content.


THINKING DEEPER, NOT LONGER: DEPTH-RECURRENT TRANSFORMERS FOR COMPOSITIONAL GENERALIZATION

Hung-Hsuan Chen
Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
hhchen1105@acm.org

ABSTRACT

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space—enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier—a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text).
This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.

Keywords: Chain of Thought · CoT · Vertical Chain-of-Thought · VCoT · Horizontal Chain-of-Thought · HCoT · LLM

1 Introduction

Large Language Models (LLMs) have achieved remarkable performance across a wide range of tasks, yet their reasoning capabilities remain fundamentally constrained by architecture. When faced with problems that require multi-step logical deduction—such as planning, mathematical proof, or algorithmic execution—current models rely heavily on Chain-of-Thought (CoT) prompting (Wei et al., 2022), which externalizes intermediate reasoning as a sequence of generated tokens. We refer to this paradigm as horizontal recurrence: the model “thinks” by extending the output sequence horizontally, consuming the available context length with each reasoning step. Although horizontal CoT has proven highly effective, it suffers from several fundamental limitations. First, each reasoning step costs one or more tokens, rapidly exhausting the finite context window. Second, the depth of computation available at each token position is fixed by the number of Transformer layers, regardless of problem difficulty. A 32-layer Transformer applies exactly 32 layers of processing, whether the input requires trivial pattern matching or deep recursive evaluation. Third, because intermediate reasoning steps are generated in natural language, they are subject to compounding errors—each token prediction carries forward the risk of hallucination or logical misstep.
arXiv:2603.21676v1 [cs.LG] 23 Mar 2026

In this work, we propose an orthogonal paradigm that we call vertical Chain-of-Thought: instead of generating more tokens horizontally, the model “thinks deeper” by recurrently applying a shared-weight Transformer block in latent space. This approach decouples computational depth from both parameter count and context length. The model can invest more computation in harder instances by simply increasing the recurrence steps during inference, without generating any additional tokens or consuming additional context window space. The key challenge in making depth recurrence work is stability. Naïvely unrolling a Transformer block for many steps leads to exploding or vanishing gradients, and representation collapse. We address these through three complementary mechanisms, ordered from our reasoning objective down to the physical constraints:

1. Silent Thinking: We apply supervision only at the final recurrence step, with no intermediate auxiliary losses. This forces the model to develop genuine multi-step reasoning paths rather than learning heuristic shortcuts that satisfy per-step supervision.

2. LayerScale Initialization: Per-channel scaling initialized at 10^-4 after attention and feed-forward sub-layers. This prevents the initial random weights from corrupting the carefully preserved hidden states during early training, acting as a spatial protection for fragile logical representations.

3. Identity-Biased Recurrence: To solve the physical limit of unrolling a network for 20+ steps, we use a gated recurrence with the gate bias initialized to −2.0. The sigmoid gate starts near 0.12, strongly biased toward preserving the previous hidden state. This creates a temporal gradient highway that enables stable signal propagation, serving as the core engine for infinite depth.
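As a rough numeric sketch of the third mechanism (the toy dimensions and variable names below are illustrative assumptions, not the paper's implementation), initializing the gate bias at −2.0 makes the update default to keeping roughly 88% of the previous state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
L, d = 4, 8                       # toy sequence length and hidden size
H_prev = rng.normal(size=(L, d))  # previous hidden state H^(t-1)
H_cand = rng.normal(size=(L, d))  # candidate state from the shared block

# Gate projection: small random weights, bias fixed at -2.0, so the
# pre-activation sits near -2.0 and sigmoid(-2.0) ~ 0.12.
W_z = rng.normal(scale=1e-3, size=(2 * d, d))
b_z = np.full(d, -2.0)

z = sigmoid(np.concatenate([H_cand, H_prev], axis=-1) @ W_z + b_z)
H_next = z * H_cand + (1.0 - z) * H_prev  # ~88% of H_prev is retained

print(round(float(z.mean()), 2))  # prints 0.12
```

Because the update stays close to the identity at initialization, the initial signal decays slowly across many unrolled steps, which is the "gradient highway" property the paper relies on.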
To systematically evaluate the capability of the Depth-Recurrent Transformer, we design a progression of three out-of-distribution (OOD) reasoning tasks. We begin with Graph Reachability, providing a strict physical proof of concept using topological masking. We then increase the structural complexity with Nested Boolean Logic, demonstrating the model’s ability to maintain fragile hierarchical states using relative positioning. Finally, we remove all task-aligned structural inductive biases in a Relational Composition task over unstructured text, proving that our invariant reasoning core can autonomously discover complex latent routing paths in natural language. These tasks are chosen because their computational depth is precisely controllable, enabling a rigorous analysis of generalization behavior. Our experiments reveal a consistent pattern that we term the computational frontier: a diagonal boundary in the accuracy heatmap (thinking steps × task complexity) where performance transitions sharply from chance-level to near-perfect. Moreover, the three tasks exhibit qualitatively different generalization profiles under the same recurrent core, which we attribute to the different inductive biases of their respective perception interfaces. Our contributions are summarized as follows:

• We propose a depth-recurrent Transformer architecture with silent thinking, LayerScale, and identity-biased recurrence, enabling stable recurrence over 20+ steps with fewer than 1M parameters.

• We demonstrate strong OOD generalization across three compositional tasks with varying inductive biases, achieving robust extrapolation to reasoning depths that are strictly longer than those in the training distribution.

• We identify and analyze the computational frontier phenomenon and show how task-specific perception interfaces yield qualitatively different generalization behaviors (precise-but-brittle vs. approximate-but-robust).
• We provide evidence that intermediate supervision can be harmful, causing models to learn heuristic shortcuts that collapse under distribution shift.

2 Related Work

2.1 Chain-of-Thought and Test-Time Compute

Chain-of-Thought prompting (Wei et al., 2022) and its variants (Kojima et al., 2022; Wang et al., 2023) have become the dominant paradigm for LLM reasoning. Recent work on test-time compute scaling (Snell et al., 2025) further demonstrates that allowing models to perform more inference-time computation improves performance. However, all these approaches operate via horizontal token generation, consuming a context window proportional to the reasoning depth. Our work explores an orthogonal axis—vertical depth recurrence in latent space—that achieves test-time compute scaling without token overhead.

2.2 Pause Tokens and Latent Reasoning

Goyal et al. (2024) proposed appending learnable “pause tokens” to the input, giving the Transformer an additional forward-pass computation before producing output. Recent works, such as Coconut (Hao et al., 2024), also explore training LLMs to reason in a continuous latent space. Although these share our motivation of providing additional computation at inference time, pause tokens remain fundamentally horizontal: they occupy positions linearly in the sequence and consume the context window. In addition, each pause token still processes through the same fixed number of layers, so the depth of computation per position is unchanged. In contrast, our approach increases depth directly by repeatedly applying the same block. Unlike traditional RNNs that compress the entire sequence into a single bottleneck vector h ∈ R^d, we maintain a full-sequence-length state matrix H ∈ R^{L×d} at every step, preserving rich spatial interactions.
Furthermore, unlike horizontal CoT, which consumes the finite context window by appending new generated tokens, our recurrence operates strictly in latent space without increasing the sequence length.

2.3 Universal Transformers

The Universal Transformer (UT) (Dehghani et al., 2019) introduced weight sharing across layers and used Adaptive Computation Time (ACT) (Graves, 2016) for dynamic stopping. Our work builds on this foundation but differs in several critical aspects. First, we use final-step-only supervision (silent thinking) rather than per-step losses, which we show empirically avoids heuristic shortcut learning. Second, we incorporate LayerScale (Touvron et al., 2021) to protect latent representations during early training. Third, we employ an identity-biased gated recurrence with negative bias initialization rather than simple residual connections, which we find essential for stability beyond 10 recurrence steps. Fourth, rather than relying on ACT’s complex token-level halting probabilities and ponder cost regularization (Graves, 2016), we completely decouple the computational depth. By treating the recurrence step count T as an externally specified budget, our model natively supports flexible test-time compute scaling without optimization overhead. These differences enable robust out-of-distribution extrapolation rather than merely in-distribution translation.

2.4 Depth and Expressiveness in Transformers

Theoretical works have established that Transformer depth is a critical factor in expressiveness. Merrill and Sabharwal (2024) showed that fixed-depth Transformers are limited to TC^0 circuit complexity, which excludes inherently sequential computations. Feng et al. (2023) further demonstrated that depth-efficient Transformers cannot solve certain compositional tasks without sufficient layers. By making depth dynamically variable through recurrence, our architecture natively bypasses the TC^0 limitation.
This decoupling allows the unrolled reasoning steps to scale strictly with the intrinsic sequential complexity of the input, enabling the model to solve inherently sequential tasks—such as multi-hop routing and nested logic—that fixed-depth networks mathematically cannot.

2.5 Neural Algorithmic Reasoning and Inductive Biases

Research in Neural Algorithmic Reasoning emphasizes that neural networks must align with the algorithmic primitives of the target task. Graph Neural Networks (GNNs) (Gilmer et al., 2017; Xu et al., 2019) achieve this via message passing over edges. We demonstrate that a Transformer can perfectly simulate an optimal GNN by applying an adjacency mask to its self-attention matrix. We gradually remove these structural priors in our subsequent logic and unstructured text experiments to test the limits of the invariant reasoning core.

3 Method

Our architecture consists of two components: a task-specific perception interface that encodes the raw input into an initial hidden representation, and a task-invariant reasoning core that iteratively refines this representation through shared-weight recurrence. After T recurrence steps—where T can be flexibly scaled at inference time—a task-specific readout head extracts the final prediction.

3.1 The Invariant Reasoning Core

The reasoning core is a single Transformer block f_θ with shared weights, applied recurrently for T steps. It operates on a hidden state H^(t) ∈ R^{L×d}, where the initial state H^(0) is the output of the task-specific perception interface (Section 3.2). Let L be the sequence length and d be the hidden dimension. At each step t ∈ {1, ..., T}, the candidate hidden state H̃^(t) ∈ R^{L×d} is computed as:

H̃^(t) = f_θ(H^(t−1), t)    (1)

Details of f_θ are provided in Appendix A. To ensure that this recurrent core can stably unroll for 20+ steps and learn genuine algorithms, we design it around three specific mechanisms.
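A minimal sketch of this unrolled recurrence (the `block` stand-in and the shapes below are hypothetical; the real f_θ is an attention block with the stabilizers described next, not a linear map):

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 6, 16                         # toy sequence length and hidden size
W_shared = rng.normal(scale=0.1, size=(d, d))

def block(H, t):
    """Stand-in for the shared-weight block f_theta: the same
    W_shared is reused at every recurrence step t."""
    return H + np.tanh(H @ W_shared)

def unroll(H0, T):
    """Apply the shared block for T steps ("thinking deeper");
    under silent thinking, only the final H^(T) is supervised."""
    H = H0
    for t in range(1, T + 1):
        H = block(H, t)
    return H

H0 = rng.normal(size=(L, d))         # from the perception interface
H_final = unroll(H0, T=20)           # T is an inference-time compute budget
print(H_final.shape)                 # prints (6, 16): no extra tokens added
```

The key property the sketch illustrates is that depth grows with T while the parameter count and the sequence length stay fixed.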
Silent Thinking Objective. A fundamental design choice is how to supervise the recurrent steps. Instead of forcing intermediate answers or applying per-step auxiliary losses, we compute the cross-entropy (CE) loss CE(ŷ, y) only on the final thinking step T. During training, T is sampled uniformly from a task-specific range (e.g., T ∼ U(1, 12)). This randomized depth acts as a temporal regularizer, encouraging the network to produce stable outputs across a range of computation depths, while the silent thinking objective forces the model to develop its own genuine multi-step reasoning paths.

Pre-Layer Normalization and LayerScale. For complex symbolic tasks requiring a high step count, the accumulated variance from repeated sub-layer transformations can severely destabilize early training. We use Pre-LayerNorm (Xiong et al., 2020) and apply learnable per-channel LayerScale (Touvron et al., 2021) after each sub-layer, multiplying the outputs element-wise by a learnable scaling vector Γ ∈ R^d initialized to Γ_i = 10^-4. By scaling the outputs down to near-zero at initialization, LayerScale forces the early-training dynamics to act almost perfectly as an identity mapping, protecting fragile boolean states to safely bypass the destructive noise introduced by randomly initialized deep layers. As training progresses, the network selectively scales up Γ to activate its reasoning capacity.

Identity-Biased Gated Recurrence. To solve the physical limit of unrolling a network for 20+ steps without vanishing gradients, we combine the new candidate representation H̃^(t) with the previous state H^(t−1) through a learned GRU-like gate:

z^(t) = σ([H̃^(t); H^(t−1)] W_z + b_z)    (2)
H^(t) = z^(t) ⊙ H̃^(t) + (1 − z^(t)) ⊙ H^(t−1)    (3)

Here, [H̃^(t); H^(t−1)] ∈ R^{L×2d} denotes the feature-dimension concatenation of the two states, and W_z ∈ R^{2d×d} projects it to yield z^(t) ∈ R^{L×d}.
The gate operates element-wise at each sequence position; details in Appendix A. Critically, we initialize the gate bias b_z = −2.0. Because σ(−2.0) ≈ 0.12, the model defaults to retaining 88% of its previous state. This physically guarantees that the initial signal decays slowly, enabling stable signal propagation through deep temporal unrolling.

Depth Embeddings. To prevent the model from losing its place within the recurrent loop, we inject a learned step embedding e_t ∈ R^d at each iteration before applying the attention block: H^(t−1)_input = H^(t−1) + e_t. Details are given in Appendix B.

3.2 Task-Specific Perception Interfaces

Rather than hardcoding specific task rules into the reasoning core, we design modular perception interfaces that inject varying degrees of structural inductive biases into the initial state H^(0). We explore three distinct interface paradigms:

Topological Masking (Strict Structural Priors). To enforce physical constraints where computation must strictly follow defined topological pathways (e.g., graph routing), we introduce an adjacency mask. Given an adjacency matrix A, we add self-loops (A_ii = 1) and construct an additive mask matrix M ∈ R^{n×n}:

M_ij = 0 if A_ij = 1; M_ij = −∞ if A_ij = 0    (4)

The attention is computed as softmax(QK^T / √d_k + M) V. Physically, the −∞ term forces the attention probability to exactly zero for non-adjacent nodes, perfectly simulating a strict message-passing neural network.

Relative Positional Encoding (Hierarchical Priors). For sequences where structural hierarchy depends on relative distances rather than strict adjacency (e.g., evaluating nested symbolic expressions), topological masking is overly restrictive. Instead, we utilize Rotary Position Embeddings (RoPE) (Su et al., 2024).
By multiplying the query and key vectors with rotation matrices corresponding to their respective absolute positions m and n (R_Θ,m and R_Θ,n), the inner product inherently encodes their relative sequence distance (m − n):

⟨q̃_m, k̃_n⟩ = q^T R_Θ,m−n k    (5)

Under this interface, the attention remains fully bidirectional, relying entirely on the rotational bias to maintain fragile hierarchical states.

Standard Sequence Attention (Task-Agnostic Priors). To test the reasoning core’s ability to autonomously discover latent routing paths in unstructured sequences (e.g., natural language facts), we remove task-specific structural biases. While we retain standard Rotary Position Embeddings (RoPE) to allow the perception interface to process local word order, the input facts are completely shuffled. Consequently, unlike the logic domain, where relative distance directly correlates with hierarchical depth, the 1D sequence distance in this bag-of-facts provides no meaningful structural hints about the underlying relational graph. The invariant reasoning core must discover the correct pointer-chasing routes entirely on its own.

3.3 Task-Specific Readout Mechanisms

After the latent reasoning is unrolled for T steps, a readout head decodes the final state H^(T). We employ distinct readout mechanisms depending on the interface:

• Pairwise Node Readout (for Topological domains): The d-dimensional representations corresponding to the specific source and target nodes are extracted, concatenated, and passed through an MLP.

• Global Sequence Readout (for Hierarchical domains): The global representation of the [CLS] token at position 0 is extracted and passed through a linear classifier.

• Latent Pointer Readout (for Unstructured domains): Similar to the pairwise readout, the representations of the queried entities are extracted and concatenated.
This localized readout forces the unconstrained attention mechanism to actively route information between specific entities across the sequence.

4 Experiments

We evaluate our architecture on three compositional reasoning domains with decreasing structural inductive biases: graph reachability, Boolean logic, and relational text. LayerScale is employed only in Experiments II and III, where the fragile symbolic states and higher step counts demand additional stability; the graph task relies on topological masking for structural regularization and does not require it. The maximum number of evaluated reasoning steps varies across these tasks, reflecting the natural upper bounds of their respective structural complexities. Each figure marks both the depth and the thinking-step training boundaries with dashed lines; axes outside those ranges represent OOD evaluation.

4.1 Experiment I: Graph Reachability

Given a directed graph G = (V, E), determine whether there exists a directed path from node s to node t. The model is trained on instances requiring 1–5 hops with 5–8 thinking steps and is evaluated up to 12 hops to test OOD generalizability. As illustrated in Figure 1, we observe a sharp computational frontier. The accuracy transition from chance to perfect resembles a step function: at exactly N thinking steps, the model solves N-hop queries. One step fewer and accuracy drops to random. The model achieves 100% OOD generalization up to 8 hops (1.6×), but collapses abruptly at 10 hops, indicating a clear, rigid generalization boundary enforced by the topological masking. In the step dimension, the model also generalizes stably both below (1–3 steps) and above (12–20 steps) the training range, with the diagonal frontier shifting accordingly.

4.2 Experiment II: Nested Boolean Expression Evaluation

Given a fragile nested boolean expression (e.g., !((T&F)|(!(T|F)))), evaluate it to True or False.
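To make the task semantics concrete, here is a tiny reference evaluator for such expressions (an illustrative sketch of the ground-truth labeling, not part of the model; the grammar is inferred from the example above):

```python
def eval_expr(s):
    """Evaluate a nested boolean expression such as !((T&F)|(!(T|F))).
    Grammar assumed: T | F | !expr | (expr) | (expr OP expr), OP in {&, |}."""
    def parse(i):
        c = s[i]
        if c == 'T':
            return True, i + 1
        if c == 'F':
            return False, i + 1
        if c == '!':
            v, j = parse(i + 1)
            return (not v), j
        assert c == '(', f"unexpected character {c!r}"
        left, j = parse(i + 1)
        if s[j] == ')':                  # parenthesized single sub-expression
            return left, j + 1
        op = s[j]                        # binary operator '&' or '|'
        right, k = parse(j + 1)
        assert s[k] == ')'
        return ((left and right) if op == '&' else (left or right)), k + 1

    value, end = parse(0)
    assert end == len(s), "trailing characters"
    return value

print(eval_expr("!((T&F)|(!(T|F)))"))  # prints True
```

Note that evaluating such an expression requires resolving the innermost parentheses first, which is exactly the variable-depth sequential computation the recurrence budget T is meant to cover.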
Scaling the model width (d = 256) and adding LayerScale allowed successful training on nesting depths 1–8. As depicted in Figure 2, we observe a gradual computational frontier. The model generalizes gracefully beyond the training distribution, achieving >90% accuracy at depth 14 (1.75× OOD). Unlike the graph task, accuracy increases monotonically with more thinking steps without collapsing. The model does not "overthink" or degrade with excessive computation up to 24 steps (OOD in the step dimension), confirming the stability of the negative gate bias.

4.3 Experiment III: Relational Composition in Unstructured Text

To test the reasoning core’s capability in a pure language modeling context, we evaluate it on a CLUTRR-style family relationship composition task. The input is a sequence of randomly shuffled natural-language sentences that define relationships (e.g., Alice is the parent of Bob), padded with contradictory distractor sentences to prevent statistical shortcuts. The model must answer a query like Alice is the sibling of Eve.

[Figure 1 plot: "Depth-Recurrent Transformer with Adjacency Masking on Graph Reachability (Trained on 1–5 hops; 5–8 thinking steps)"; accuracy heatmap over thinking steps (1–20) × path hops (1–12); numeric cell values omitted.]

Figure 1: Accuracy heatmap for the graph reachability task. A sharp computational frontier exists along the diagonal, confirming the strict 1-hop = 1-step physical constraint enforced by adjacency masking. The model generalizes perfectly to 8 hops (OOD) with sufficient steps, but collapses abruptly to chance (∼50%) at 10 hops.
Dashed vertical lines mark the step training boundary (5–8 thinking steps); columns outside this range are also OOD, yet accuracy is preserved.

Naïvely generating chains of relations (e.g., strictly parents or strictly children) allows models to cheat by simply counting relation words or detecting parity. To enforce true algorithmic deduction (pointer chasing and mathematical offset cancellation), we generate chains using an Apex Routing Strategy: the logical path must first move UP the family tree to a common ancestor (using parent), then DOWN (using child). We add sibling (offset 0) to the vocabulary so that the positive and negative offsets can perfectly cancel out. Furthermore, we carefully construct hard negative samples—incorrect options deliberately designed to share superficial statistical features with the true answer (e.g., using grandparent as a distractor for sibling, since both share an even-step offset). By ensuring that these distractors strictly match the even/odd parity of the correct relation, we plug any shallow statistical shortcuts, forcing the model to perform genuine latent routing rather than surface pattern matching. The model is trained on depths 2–5 with 1–12 thinking steps, and evaluated up to depth 9 and 20 thinking steps. As shown in Figure 3, the results reveal three crucial insights. First, monotonic difficulty: the accuracy strictly decreases as the reasoning depth increases. Because all parity shortcuts are removed, the model is forced to perform genuine latent routing, which naturally gets harder for longer chains. Second, the computational frontier: for any given depth, increasing the number of thinking steps strictly improves accuracy (e.g., at Depth 5, accuracy increases from 63.8% at 1 step to 81.7% at 12 steps, remaining stable at 80.6% at 20 steps (OOD)).
Third, robust OOD generalization: despite lacking task-aligned structural hints (no graph mask, and 1D relative positions provide no shortcuts for the fully shuffled sentences), the invariant core successfully generalizes to depths 6 and 7, and improves further with OOD thinking steps (16–20), where depths 7–9 see modest but consistent gains, proving it can autonomously discover pointer-chasing routes in unstructured text.

[Figure 2 plot: "Depth-Recurrent Transformer with RoPE on Nested Boolean Expression Evaluation (Trained on 1–8 depth; 4–16 thinking steps)"; accuracy heatmap over thinking steps (1–24) × nesting depth (2–14); numeric cell values omitted.]

Figure 2: Accuracy heatmap for the nested boolean expression evaluation task. Compared to the graph task, the computational frontier here is more gradual. The model exhibits robust out-of-distribution generalization with graceful degradation, maintaining >90% accuracy up to depth 14. Notice that the performance remains highly stable even when the model is unrolled for 24 steps (well beyond the training range of 4–16 steps), which validates the effectiveness of our design.

Table 1: Comparison of supervision strategies on graph reachability. Intermediate supervision encourages a statistical shortcut (achieving impossibly high accuracy at Step 1 for 12-hop paths), which occupies model capacity and destroys genuine OOD generalization.

Supervision Strategy | Step 1 Acc. on 12-hop (Shortcut) | Sufficient-Step OOD Acc.
Intermediate (Per-step) Loss | ∼73.0% | ∼50.0% (Collapse)
Silent Thinking (Final-step only) | ∼50.0% (Random Guess) | Perfect up to 8 hops

4.4 Ablation Analysis: The Danger of Intermediate Supervision

A natural hypothesis for improving algorithmic reasoning is to apply intermediate supervision—computing and averaging the loss at every thinking step to create a gradient highway. To test this, we ablate our silent-thinking objective against an intermediate-supervision baseline on the graph reachability task. This domain is ideal for diagnosis because the adjacency mask provides a physically verifiable ground truth: a model taking k steps can only aggregate information from k hops away. As shown in Table 1, intermediate supervision exhibits an apparent anomaly: it achieves over 70% accuracy on 12-hop paths after only a single thinking step. Under strict topological masking, a 1-step model has absolutely no information about nodes 12 hops away; true accuracy must be bounded near 50%. This mathematically proves that the model has abandoned genuine message-passing. Instead, it learns statistical heuristics—such as estimating reachability from graph density or the source node’s degree—to greedily minimize the early-step training loss. We attribute this to a bandwidth occupation failure mode. When penalized at Step 1, the model faces a choice: adopt an "honest strategy" (accepting 50% early accuracy to learn deep propagation) or a "shortcut strategy" (learning shallow heuristics for immediate reward). Intermediate supervision makes the shortcut irresistible. Once committed to these heuristics, the model loses the incentive to develop the true sequential algorithm, failing on deep OOD paths even when granted a large number of steps at test time.
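The two objectives can be contrasted in a toy loss computation (the per-step logits below are fabricated for illustration; this sketches the two loss definitions, not the paper's training code):

```python
import numpy as np

def cross_entropy(logits, y):
    """CE for a single example with class logits (illustrative)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[y])

# Hypothetical logits after each of T=4 thinking steps: early steps are
# near-chance, confidence grows as the latent computation completes.
step_logits = [np.array([0.2, -0.1]), np.array([0.5, -0.4]),
               np.array([1.0, -1.0]), np.array([2.0, -2.0])]
y = 0  # true class

# Intermediate supervision: average the loss over every thinking step,
# which directly penalizes the (necessarily uncertain) early steps.
loss_intermediate = float(np.mean([cross_entropy(l, y) for l in step_logits]))

# Silent thinking: supervise only the final step, leaving earlier steps
# free to carry unobservable latent computation.
loss_silent = float(cross_entropy(step_logits[-1], y))

print(loss_silent < loss_intermediate)  # prints True
```

The point of the sketch is the gradient pressure, not the loss values: averaging over steps rewards any heuristic that lifts early-step accuracy, which is precisely the shortcut incentive the ablation identifies.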
[Figure 3: accuracy heatmap of thinking steps (compute depth, 1–20) versus chain depth (number of hops, 2–9) for the depth-recurrent Transformer on unstructured text composition, trained on depths 2–5 with 1–12 thinking steps.]

Figure 3: Accuracy heatmap for the relational composition task over unstructured text. Without task-aligned structural inductive biases (no adjacency masking; while RoPE is retained for local word order, the fully shuffled input facts ensure that 1D relative positions carry no meaningful structural signal), the task exhibits a strictly monotonic increase in difficulty. The invariant reasoning core discovers latent pointer-chasing routes, achieving solid OOD generalization at depths 6 and 7 when additional thinking steps are provided. Dashed lines mark the training boundaries in both the depth and step dimensions.

Our silent thinking explicitly removes this credit-assignment shortcut. By computing the loss only at the final step, the model is liberated from early-step penalties. It is forced to accept random-guess performance in the early layers and to invest its entire representational capacity in learning the genuine latent algorithm. Ultimately, in algorithmic reasoning, where intermediate states represent unobservable latent computation rather than semantic features, forcing intermediate answers is actively harmful.

5 Conclusion

We have presented a depth-recurrent Transformer architecture that realizes vertical Chain-of-Thought: reasoning by iterating in latent space rather than generating tokens.
Through silent thinking, LayerScale, and identity-biased recurrence, we enable stable recurrence over 20+ steps. Experiments on graph reachability, nested boolean logic, and unstructured relational composition demonstrate strong out-of-distribution generalization. The contrast between these tasks reveals how perception interfaces shape the generalization profile of a shared reasoning core. By "thinking deeper, not longer," models can achieve variable computational depth without consuming context window, serving as a foundational building block for next-generation language models.

We acknowledge several limitations. First, we use relatively small models (<1M parameters). Second, the perception interfaces are designed manually. Third, we do not provide formal theoretical guarantees on the generalization bound; our evidence is empirical. We believe that integration with pretrained LLMs is the most important direction for future work.

GenAI Usage Disclosure

The authors used Gemini and Claude to improve language and readability. The authors used Claude Code to assist in coding and experimenting. The authors reviewed and edited the content and the code as needed and take full responsibility for the content.

Acknowledgments

We acknowledge support from the National Science and Technology Council of Taiwan under grant number 113-2221-E-008-100-MY3.

References

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, L. (2019). Universal transformers. In International Conference on Learning Representations (ICLR).
Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., and Wang, L. (2023). Towards revealing the mystery behind chain of thought: A theoretical perspective. In Advances in Neural Information Processing Systems (NeurIPS).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), pages 1263–1272. PMLR.
Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar, S., and Nagarajan, V. (2024). Think before you speak: Training language models with pause tokens. In International Conference on Learning Representations (ICLR).
Graves, A. (2016). Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.
Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. (2024). Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems (NeurIPS), pages 22199–22213.
Merrill, W. and Sabharwal, A. (2024). The expressive power of transformers with chain of thought. In International Conference on Learning Representations (ICLR).
Snell, C. V., Lee, J., Xu, K., and Kumar, A. (2025). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. In International Conference on Learning Representations (ICLR).
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568(C).
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. (2021). Going deeper with image transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 32–42.
Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR).
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), Red Hook, NY, USA. Curran Associates Inc.
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. (2020). On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML). JMLR.org.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2019). How powerful are graph neural networks? In International Conference on Learning Representations (ICLR).

A Details of the Reasoning Block $f_\theta$

The reasoning block $f_\theta$ is a single Pre-LayerNorm Transformer encoder block (Xiong et al., 2020). At recurrence step $t$, it receives the depth-conditioned input $\hat{H}^{(t)} = H^{(t-1)} + e_t$, where $e_t$ is the depth embedding defined in Appendix B. The block applies two sub-layers in sequence.

Sub-layer 1: Multi-Head Self-Attention (MHSA).

$H' = \hat{H}^{(t)} + \gamma_{\mathrm{attn}} \odot \mathrm{MHSA}(\mathrm{LN}(\hat{H}^{(t)}), M)$,  (6)

where $\mathrm{LN}(\cdot)$ is Layer Normalization, $M$ is the task-specific attention mask (the adjacency mask for the topological interface, or no mask for the other two), and $\gamma_{\mathrm{attn}} \in \mathbb{R}^d$ is the LayerScale vector (detailed below). The attention mechanism computes $h$ parallel heads. For each head $i$:

$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + M\right) V_i$, with $Q_i = \hat{H}^{(t)} W_i^{Q}$, $K_i = \hat{H}^{(t)} W_i^{K}$, $V_i = \hat{H}^{(t)} W_i^{V}$,  (7)

and $d_k = d/h$. For the hierarchical and unstructured interfaces, Rotary Position Embeddings (RoPE) (Su et al., 2024) are applied to the query and key projections before the dot product, so that the inner product implicitly encodes relative position (see Section 3.2).

Sub-layer 2: Feed-Forward Network (FFN).

$H' \leftarrow H' + \gamma_{\mathrm{ffn}} \odot \mathrm{FFN}(\mathrm{LN}(H'))$,  (8)
$\mathrm{FFN}(x) = W_2 \cdot \mathrm{GELU}(W_1 x + b_1) + b_2$,  (9)

where $W_1 \in \mathbb{R}^{d_f \times d}$, $W_2 \in \mathbb{R}^{d \times d_f}$, and $d_f$ is the feedforward dimension.
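The two sub-layers above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the class and attribute names are invented, RoPE is omitted for brevity, and the paper's topological variant uses the stock nn.TransformerEncoderLayer rather than a custom block.

```python
import torch
import torch.nn as nn

class PreLNRecurrentBlock(nn.Module):
    """Illustrative Pre-LN encoder block with LayerScale, following Eqs. (6)-(9).
    Names are hypothetical; RoPE on queries/keys is not shown."""
    def __init__(self, d, heads, d_ff, layerscale_init=1e-4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        # Per-channel LayerScale vectors, initialized to 1e-4 as in the paper.
        self.gamma_attn = nn.Parameter(layerscale_init * torch.ones(d))
        self.gamma_ffn = nn.Parameter(layerscale_init * torch.ones(d))

    def forward(self, h, attn_mask=None):
        # Sub-layer 1: Pre-LN self-attention with LayerScale residual, Eq. (6).
        x = self.ln1(h)
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        h = h + self.gamma_attn * a
        # Sub-layer 2: Pre-LN feed-forward with LayerScale residual, Eq. (8).
        h = h + self.gamma_ffn * self.ffn(self.ln2(h))
        return h
```

The near-zero LayerScale initialization makes each sub-layer start close to the identity map, which is what keeps repeated application of the same block stable over many recurrence steps.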
The output of $f_\theta$ is $\tilde{H}^{(t)} = H'$, which is subsequently combined with $H^{(t-1)}$ through the identity-biased gate (Section 3.1) to produce $H^{(t)}$.

LayerScale. For the hierarchical and unstructured perception interfaces, we apply LayerScale (Touvron et al., 2021) after each sub-layer. The per-channel scaling vectors $\gamma_{\mathrm{attn}}, \gamma_{\mathrm{ffn}} \in \mathbb{R}^d$ are initialized to $10^{-4}$ and learned jointly with all other parameters. For the topological interface, where the hard adjacency mask already strongly constrains the computation, LayerScale is omitted; there, the standard PyTorch TransformerEncoderLayer is used, and the custom RoPE-integrated attention described above applies only to the hierarchical and unstructured interfaces.

Hyperparameters. Table 2 summarizes the architectural hyperparameters used across the three experiments.

Table 2: Architectural hyperparameters for each experiment.

                       | Graph   | Nested Expr. | Family
d                      | 128     | 256          | 256
h (heads)              | 4       | 8            | 8
d_f                    | 256     | 1024         | 1024
RoPE                   | ×       | ✓            | ✓
LayerScale             | ×       | ✓            | ✓
Gate bias b_z          | −2.0    | −2.0         | −2.0
Train step range T     | [5, 8]  | [4, 16]      | [1, 12]
T_max                  | 20      | 28           | 20

B Depth Embeddings

To enable the shared-weight reasoning block $f_\theta$ to track its position within the recurrent loop, we inject a learned depth embedding $e_t \in \mathbb{R}^d$ at each step $t \in \{1, \dots, T\}$ before the attention computation:

$\hat{H}^{(t)} = H^{(t-1)} + \mathbf{1}_L \cdot e_t^{\top}$,  (10)

where $\mathbf{1}_L \in \mathbb{R}^L$ is an all-ones vector, so that $e_t$ is broadcast identically across all $L$ sequence positions. The embeddings $\{e_t\}_{t=1}^{T_{\max}}$ form a standard learned lookup table $E \in \mathbb{R}^{T_{\max} \times d}$, where $T_{\max}$ is set to cover the maximum number of steps used at both training and test time. Without depth embeddings, the shared block $f_\theta$ receives identical inputs at each step (modulo the evolving hidden state), making it difficult to distinguish early exploratory steps from later refinement steps.
By conditioning each iteration on $e_t$, the model is free to learn step-specialized computation strategies, for example using early steps to gather local information and later steps to consolidate long-range evidence.

Relationship to positional encodings. Depth embeddings are orthogonal to the positional encodings used within the sequence (RoPE or none). The former encodes the model's current position in the temporal (recurrent) dimension; the latter encodes positions in the spatial (sequence) dimension. Both are applied simultaneously without conflict.
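The outer recurrence, combining the depth-embedding broadcast of Eq. (10) with the identity-biased gate, can be sketched as below. The gate parameterization (a linear map on the candidate state followed by a sigmoid) is an assumption made for illustration; only the negative gate bias $b_z = -2.0$, which biases the update toward the identity at initialization, is taken from Table 2.

```python
import torch
import torch.nn as nn

class DepthRecurrence(nn.Module):
    """Illustrative outer loop: depth embeddings plus an identity-biased gate.
    The gate's exact form is a hypothetical stand-in; b_z = -2.0 is from the paper."""
    def __init__(self, block, d, t_max, gate_bias=-2.0):
        super().__init__()
        self.block = block                       # shared-weight f_theta
        self.depth_emb = nn.Embedding(t_max, d)  # learned lookup table E
        self.gate = nn.Linear(d, d)
        nn.init.zeros_(self.gate.weight)
        nn.init.constant_(self.gate.bias, gate_bias)

    def forward(self, h, steps):
        for t in range(steps):
            e_t = self.depth_emb.weight[t]        # broadcast over all L positions
            h_tilde = self.block(h + e_t)         # Eq. (10) then f_theta
            z = torch.sigmoid(self.gate(h_tilde)) # sigmoid(-2.0) ~ 0.12 at init
            h = (1 - z) * h + z * h_tilde         # identity-biased update
        return h
```

Because $z \approx 0.12$ at initialization, each step initially changes the hidden state only slightly, which is what lets the same block be unrolled for 20+ steps, and even beyond the training range, without the state diverging.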