Paper deep dive
Ablate and Rescue: A Causal Analysis of Residual Stream Hyper-Connections
William Peng, Josheev Rai, Kevin Tseng, Siwei Wang, Sean Wu
Abstract

Multi-stream transformer architectures have recently been proposed as a promising direction for managing representation collapse and the vanishing gradient problem in residual connections, yet their internal mechanisms remain unexplored. In particular, the recently introduced Manifold-Constrained Hyper-Connections (mHC) architecture posits multiple residual streams with constrained interaction, but lacks in-depth mechanistic analysis. We present the first open-source mHC language model (https://huggingface.co/wgpeng/mhc-780m) and analyze the multiple-stream architecture with a suite of representation-level metrics and causal interventions to probe how parallel streams encode and utilize information. Specifically, we introduce a systematic stream ablation-and-rescue framework that enables direct causal comparison of residual streams during inference. Through targeted pairwise interventions and controlled recovery experiments, we distinguish functional redundancy from asymmetric utilization and reveal how information is distributed across streams beyond what is observable from representational similarity alone.
Links
- Source: https://arxiv.org/abs/2603.14833v1
Full Text
Accepted as a workshop paper at SciForDL 2nd Edition, ICLR 2026

ABLATE AND RESCUE: A CAUSAL ANALYSIS OF RESIDUAL STREAM HYPER-CONNECTIONS

William Peng (1*), Josheev Rai (2*), Kevin Tseng (3*), Siwei Wang (4*), Sean Wu (5)

1 Stanford University, 2 Georgia Institute of Technology, 3 University of California, Berkeley, 4 Independent, 5 University of Oxford

ABSTRACT

Multi-stream transformer architectures have recently been proposed as a promising direction for managing representation collapse and the vanishing gradient problem in residual connections, yet their internal mechanisms remain unexplored. In particular, the recently introduced Manifold-Constrained Hyper-Connections (mHC) architecture posits multiple residual streams with constrained interaction, but lacks in-depth mechanistic analysis. We present the first open-source mHC language model (https://huggingface.co/wgpeng/mhc-780m) and analyze the multiple-stream architecture with a suite of representation-level metrics and causal interventions to probe how parallel streams encode and utilize information. Specifically, we introduce a systematic stream ablation-and-rescue framework that enables direct causal comparison of residual streams during inference. Through targeted pairwise interventions and controlled recovery experiments, we distinguish functional redundancy from asymmetric utilization and reveal how information is distributed across streams beyond what is observable from representational similarity alone.

1 INTRODUCTION

Hyper-Connections extend the standard transformer residual architecture by allowing multiple residual streams per layer, dynamically mixed through learned routing matrices (He et al., 2015; Zhu et al., 2025). Manifold-Constrained Hyper-Connections (mHC) further refines this framework by imposing geometric constraints on inter-stream mixing (Xie et al., 2026).
Despite these advances, it is unclear whether different streams encode distinct information, redundantly represent similar features, or interact asymmetrically during inference. This gap is worsened by the absence of publicly available pretrained mHC models and by the fact that most interpretability methods are designed for single-stream architectures and do not naturally extend to dynamically routed, multi-stream settings. Importantly, observational analyses alone are insufficient in this context: high representational similarity between streams does not directly imply functional interchangeability (Zhang, 2024; Hanna et al., 2023; Geiger et al., 2021; Feder et al., 2022). Understanding how information is actually used by hyper-connected models requires explicit causal interventions during the forward pass.

We borrow from black-box techniques in biological functional genomics, where ablation-and-rescue experiments are used to establish causal necessity and sufficiency. In such settings, a gene or pathway is first perturbed (e.g., via RNA interference or knockout (Echeverri et al., 2006)), producing a measurable loss of function, and the phenotype is then rescued by reintroducing the same or a compensatory functional element. Successful rescue provides strong evidence that the perturbed component plays a causal role in the observed behavior, rather than being merely correlated with it.

Rather than inferring stream importance from similarity or attribution scores alone, we introduce ablation and controlled rescue experiments to investigate stream function. In doing so, we reveal distinct regimes of redundancy, asymmetry, and complementarity between streams. Additionally, we release the first open-source trained mHC language model.

* Equal contribution.
arXiv:2603.14833v1 [cs.LG] 16 Mar 2026

Figure 1: Ablation-and-rescue for causal stream analysis. (a) Counterfactual activation patching setup. (b) Ablation-and-rescue for multi-stream architectures.

2 BACKGROUND

Hyper-connections and manifold constraint. The addition of multiple hyper-connected streams in place of a single residual connection has been shown to improve training stability and benchmark performance. In a standard transformer model, residual connections take the form

$$x^{l+1} = x^{l} + \mathcal{F}^{l}(x^{l}),$$

where $l$ is the layer index, $\mathcal{F}^{l}$ is the layer function, and $x^{l} \in \mathbb{R}^{d}$ is the hidden state at layer $l$. Manifold-Constrained Hyper-Connections generalize this formulation by expanding the hidden state into $n$ parallel residual streams, represented as a matrix $x^{l} \in \mathbb{R}^{n \times d}$. Residual propagation and inter-stream mixing are governed by learned routing matrices, yielding the update

$$x^{l+1} = H_{\mathrm{res}}\, x^{l} + H_{\mathrm{post}}^{\top}\, \mathcal{F}^{l}(H_{\mathrm{pre}}\, x^{l}). \quad (1)$$

Here, $H_{\mathrm{res}} \in [0, 1]^{n \times n}$ is a doubly stochastic routing matrix obtained via iterations of the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967). This constraint stabilizes residual mixing by controlling operator norms and preventing uncontrolled amplification across streams. The matrices $H_{\mathrm{pre}} \in \mathbb{R}^{1 \times n}$ and $H_{\mathrm{post}} \in \mathbb{R}^{1 \times n}$ respectively implement stream-wise aggregation and redistribution: $H_{\mathrm{pre}}$ collapses the $n$ streams into a single vector for transformation by $\mathcal{F}^{l}$, while $H_{\mathrm{post}}$ expands the transformed output back across streams.

Interpretability. Most existing interpretability techniques implicitly assume a single residual stream (Elhage et al., 2021) and therefore do not directly transfer to multi-stream architectures.
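As a concrete reading of the constrained update in Eq. (1), the following is a minimal NumPy sketch, not the authors' implementation; `sinkhorn_knopp` and `mhc_update` are illustrative names, and the layer function is a stand-in for the attention/MLP sublayer.

```python
import numpy as np

def sinkhorn_knopp(scores: np.ndarray, n_iters: int = 100) -> np.ndarray:
    """Approximately project raw routing scores onto the set of doubly
    stochastic matrices by alternating row and column normalization."""
    H = np.exp(scores)  # ensure strictly positive entries
    for _ in range(n_iters):
        H /= H.sum(axis=1, keepdims=True)  # rows sum to 1
        H /= H.sum(axis=0, keepdims=True)  # columns sum to 1
    return H

def mhc_update(x, H_res, H_pre, H_post, layer_fn):
    """One mHC block update (Eq. 1): x_{l+1} = H_res x_l + H_post^T F_l(H_pre x_l).

    x:      (n, d) matrix holding n residual streams
    H_res:  (n, n) doubly stochastic mixing matrix
    H_pre:  (1, n) stream aggregation weights
    H_post: (1, n) stream redistribution weights
    """
    aggregated = H_pre @ x              # (1, d): collapse streams for F_l
    transformed = layer_fn(aggregated)  # (1, d): sublayer stand-in
    return H_res @ x + H_post.T @ transformed  # (n, d)
```

The Sinkhorn iterations are what keep each row and column of $H_{\mathrm{res}}$ summing to one, which bounds how much any single stream can be amplified during mixing.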
We focus on representation analysis tools and causal interventions that can be adapted to multiple streams, such as CKA (Davari et al., 2022), activation patching (Zhang & Nanda, 2024), and targeted ablations (Li & Janson, 2024).

3 METHODS

3.1 MODEL TRAINING

Alongside supervised objectives, language models acquire a broad range of natural language capabilities (Radford et al., 2019; Gokaslan et al., 2019). We adapt the transformer block structure to incorporate Manifold-Constrained Hyper-Connections and train a 781 million parameter model, comparable in size to GPT-2 Large, using the AdamW (Loshchilov & Hutter, 2019) and Muon optimizers (Liu et al., 2025). We pretrain on the dolma-v1-7 corpus, a substantially broader dataset containing a mixture of web content, academic publications, code, books, math, and encyclopedic materials (Soldaini et al., 2024).

3.2 CENTERED KERNEL ALIGNMENT

To explore the structures encoded across residual streams, we use centered kernel alignment (CKA), which provides an interpretable visualization of geometric similarities across stream representations (Figure 2). In the foundational Hyper-Connections work, Zhu et al. (2025) compare streams by layer using cosine similarity; we opted for CKA as a more robust measure. CKA yields a similarity index between two representations that is invariant to orthogonal transformations and isotropic scaling, and resilient to differing random initializations (Kornblith et al., 2019), as is the case with randomly initialized stream weights. To measure intra-layer stream relationships, we sampled per-stream residuals generated from the Pile-10k dataset (Nanda, 2022) and constructed a similarity index matrix for each layer to visualize the pairwise stream comparisons.

3.3 ACTIVATION PATCHING

We quantify layer-stream causal contributions to next-token prediction using counterfactual activation patching.
Following symmetric token replacement (STR), we construct matched target prompts that replace a single noun/verb/adjective from the source prompt, enabling causal tracing of internal activations (Zhang & Nanda, 2024). We evaluate patching interventions by measuring the KL divergence between the original and patched distributions. This choice is motivated by the fact that our mHC model does not frequently rank the correct factual completion for a ROME dataset example among its top-k predictions, as traditionally done in causal tracing experiments (Meng et al., 2023), making accuracy-based patching criteria unstable. In particular, of the 21,919 CounterFact examples (Makelov et al., 2024), only 65 prompts passed the knowledge check. We instead focus on measuring patch effects on the overall token distribution between the target and counterfactual (source) model, which provides a clear baseline of single-stream causal contributions to token prediction.

3.4 STREAM ABLATION AND RESCUE

Stream Ablation. Let $p_\theta$ be the trained model, outputting a probability distribution over tokens, and $x = (x_1, \ldots, x_T)$ be the input sequence to the model. At layer $\ell$, for token $t \in [1, T]$ and stream $s \in \{0, \ldots, n-1\}$, the residual stream activation is $x^{(\ell)}_{t,s} \in \mathbb{R}^{d}$. We first run an unperturbed forward pass, caching and freezing the Hyper-Connection mixing matrices and storing $x^{(\ell)}_{t,s}$. For a stream pair $(i, j)$, our ablation experiment defines each $\tilde{x}^{(\ell)}_{t,s}$ as follows:

$$\tilde{x}^{(\ell)}_{t,s} = \begin{cases} 0, & s \in \{i, j\}, \\ x^{(\ell)}_{t,s}, & \text{otherwise.} \end{cases} \quad (2)$$

Ablation impact is measured by the mean token-wise KL divergence, where $(-i, -j)$ denotes ablation of streams $i$ and $j$. In our experiments, we evaluate $p_\theta$ at temperature 1.

$$\mathcal{L}^{(-i,-j)}_{\mathrm{KL}} = \mathbb{E}_{x,t}\left[\mathrm{KL}\!\left(p_\theta(y_t \mid x) \,\Vert\, p^{(-i,-j)}_\theta(y_t \mid x)\right)\right]. \quad (3)$$

Targeted Rescue. To test recoverability, we restore ablated stream $i$ using cached residuals while keeping the other stream ablated, yielding $p^{(+i,-j)}_\theta$.
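The zeroing intervention of Eq. (2) and the KL objective of Eq. (3) can be sketched in NumPy as follows; this is a toy illustration assuming activations are already cached as a (tokens, streams, features) array, and the function names are ours, not the paper's.

```python
import numpy as np

def ablate_streams(resid: np.ndarray, streams) -> np.ndarray:
    """Eq. (2): zero the activations of the given stream indices.
    resid: (T, n, d) cached residual activations for one layer."""
    out = resid.copy()
    out[:, list(streams), :] = 0.0
    return out

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis (temperature 1)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(base_logits: np.ndarray, abl_logits: np.ndarray,
            eps: float = 1e-12) -> float:
    """Eq. (3): mean token-wise KL(p_base || p_ablated) over next-token
    distributions. Both inputs: (T, vocab) logits."""
    p = softmax(base_logits)
    q = softmax(abl_logits)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())
```

In the actual experiments the ablated activations would be re-injected into the forward pass via a hook; here the contrast is simply between baseline and ablated-run logits.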
Rescue is reported as the fractional KL reduction relative to full ablation,

$$\mathrm{Recovery}(+i, -j) = 1 - \frac{\mathcal{L}^{(+i,-j)}_{\mathrm{KL}}}{\mathcal{L}^{(-i,-j)}_{\mathrm{KL}}}. \quad (4)$$

By expanding this across all possible stream pairs, we construct a global rescue matrix that distinguishes redundant, asymmetric, and complementary stream contributions.

4 RESULTS

4.1 REPRESENTATIONAL SIMILARITY ACROSS STREAMS

[Figure 2: Within-layer similarity.]

The middle layers of the model form a visually distinctive checkerboard-like pattern across their CKA matrices (Figure 4), suggesting the model learns a representational divide of streams into two groupings based on similarity. These feature groups manifest in full by Layer 12 and gradually diminish as distinctness between streams collapses by the final layer.

4.2 STREAM-LEVEL CAUSAL CONTRIBUTIONS VIA ACTIVATION PATCHING

Activation patching surfaced a distinct asymmetry in residual stream contributions to the final token distribution (Figure 3). Notably, streams (0, 2) demonstrate higher sensitivity to individual token context during inference than streams (1, 3). Depth-wise patching yielded low or diminishing patch effects, with the exception of stream 2, which maintains strong patching sensitivity deep into the mid layers of the model.

4.3 FUNCTIONAL REDUNDANCY AND ASYMMETRY VIA RESCUE

Across layers with high cross-stream CKA, we observe distinct functional regimes. In one, streams exhibit mutual recoverability: streams 0 and 2 can each independently restore much of the KL divergence caused by ablation, indicating functional redundancy beyond representational similarity. In contrast, other stream pairs show clear asymmetries. For instance, rescuing stream 3 restores 15.86% more of the KL divergence than rescuing stream 1 (Table 1).
This indicates an imbalance in functional contribution despite relatively high representational similarity, highlighting that CKA alone cannot distinguish between active utilization and passive redundancy. Complementarity, where information is jointly distributed across streams, is less prevalent in this model configuration: we do not observe cases where neither stream alone is sufficient to restore performance while their combination is.

| Ablated stream \ Recovered stream | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | – | 58.69 | 74.78 | 66.45 |
| 1 | 84.40 | – | 81.10 | 82.42 |
| 2 | 80.61 | 58.47 | – | 71.51 |
| 3 | 71.14 | 66.56 | 72.80 | – |

Table 1: Mean rescue performance across residual streams. Each entry reports the average percentage of KL-divergence recovery over layers when ablating a pair of streams and selectively rescuing only one of them. Diagonal entries are undefined since a pair of identical streams cannot be independently ablated and rescued.

5 CONCLUSION

Our results highlight stream asymmetries and show that high representational similarity does not imply functional interchangeability, motivating rescue-style causal experiments for analyzing redundancy and asymmetry in multi-stream architectures.

REFERENCES

MohammadReza Davari, Stefan Horoi, Amine Natik, Guillaume Lajoie, Guy Wolf, and Eugene Belilovsky. Reliability of CKA as a Similarity Measure in Deep Learning, November 2022. URL http://arxiv.org/abs/2210.16156. arXiv:2210.16156 [cs].

Richard Sinkhorn and Paul Knopp. Concerning Nonnegative Matrices and Doubly Stochastic Matrices. Pacific Journal of Mathematics, 1967. URL https://msp.org/pjm/1967/21-2/pjm-v21-n2-p14-s.pdf.

Christophe J. Echeverri, Philip A. Beachy, Buzz Baum, Michael Boutros, Frank Buchholz, Sumit K. Chanda, Julian Downward, Jan Ellenberg, Andrew G. Fraser, Nir Hacohen, William C. Hahn, Aimee L. Jackson, Amy Kiger, Peter S. Linsley, Lawrence Lum, Yong Ma, Bernard Mathey-Prévot, David E. Root, David M. Sabatini, and Jussi Taipale.
Minimizing the risk of reporting false positives in large-scale RNAi screens. Nature Methods, 3(10):777–779, October 2006. doi: 10.1038/nmeth1006-777. URL https://www.nature.com/articles/nmeth1006-777.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.

Amir Feder, Katherine A. Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E. Roberts, Brandon M. Stewart, Victor Veitch, and Diyi Yang. Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond, July 2022. URL http://arxiv.org/abs/2109.00725. arXiv:2109.00725 [cs].

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal Abstractions of Neural Networks, October 2021. URL http://arxiv.org/abs/2106.02997. arXiv:2106.02997 [cs].

Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

Michael Hanna, Roberto Zamparelli, and David Mareček. The Functional Relevance of Probed Information: A Case Study. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 835–848, Dubrovnik, Croatia, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.58. URL https://aclanthology.org/2023.eacl-main.58.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, December 2015. URL http://arxiv.org/abs/1512.03385. arXiv:1512.03385 [cs].
Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited, 2019. URL https://arxiv.org/abs/1905.00414.

Maximilian Li and Lucas Janson. Optimal ablation for interpretability, September 2024. URL http://arxiv.org/abs/2409.09951. arXiv:2409.09951 [cs].

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scalable for LLM training, 2025. URL https://arxiv.org/abs/2502.16982.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101.

Aleksandar Makelov, Georg Lange, Atticus Geiger, and Neel Nanda. Is this the subspace you are looking for? An interpretability illusion for subspace activation patching. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ebt7JgMHv1.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023. URL https://arxiv.org/abs/2202.05262.

Neel Nanda. NeelNanda/pile-10k – Datasets at Hugging Face. https://huggingface.co/datasets/NeelNanda/pile-10k, 2022.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. 2019.

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint, 2024.

Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, and Wenfeng Liang. mHC: Manifold-constrained hyper-connections, 2026. URL https://arxiv.org/abs/2512.24880.

Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods, 2024. URL https://arxiv.org/abs/2309.16042.

Yihao Zhang. Causal Abstraction in Model Interpretability: A Compact Survey, October 2024. URL http://arxiv.org/abs/2410.20161. arXiv:2410.20161 [cs].

Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections, 2025. URL https://arxiv.org/abs/2409.19606.

A APPENDIX

Overview. The following supplementary analyses reinforce the main text's causal claims about residual stream behavior in mHC models. Together, these results support three central conclusions: (i) causal influence is sharply concentrated within specific layers and streams, (ii) representational similarity is informative but insufficient to predict functional interchangeability, and (iii) explicit interventions reveal structured regimes of redundancy and asymmetry that are otherwise obscured by observational metrics.

Layer-wise causal localization. We begin by examining where causal control over the output distribution is concentrated using activation patching. As shown in Figure 3, causal influence is not uniformly distributed.
Notably, stream 2 maintains strong influence deep into the network, contrasting with the relative passivity of stream 1. This stratification motivates the use of pairwise causal interventions to uncover the structure underlying these contributions.

Figure 3: Layer-stream causal sensitivity via activation patching. Mean KL divergence between baseline and patched logits when one (layer, stream) activation is injected from the source run into the target run. Lighter values indicate stronger causal effect.

Emergent stream structure via CKA. To assess how representations evolve and align across residual streams, we compute intra-layer CKA matrices (Figure 4) and an inter-layer CKA heatmap (Figure 5). In the middle layers, streams consistently bifurcate into two highly similar subgroups. This structure dissolves in later layers as representations converge. Inter-layer CKA reveals two distinct regions of high similarity, suggesting stable representational phases between the early and mid-to-late stages of the model.
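For reference, the linear CKA index used in these stream comparisons can be computed in a few lines of NumPy; this is a minimal sketch following Kornblith et al. (2019), with `linear_cka` as our own name (a kernelized variant would differ).

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (N, d1) and Y (N, d2),
    N examples by feature dimension, via centered cross-covariance norms."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2               # HSIC-style numerator
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```

The index lies in [0, 1], equals 1 for identical representations, and is unchanged by orthogonal transformations or isotropic rescaling of either input, which is why it is robust to differing random stream initializations.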
[Figure 4: Within-layer CKA similarity matrices across depth. One 4×4 stream-by-stream CKA matrix per layer, Layer 0 through Layer 35, on a shared color scale; panel title "CKA across streams (all layers)".]
Middle layers show clear block structure, reflecting soft partitioning into redundant stream subgroups.

[Figure 5: Inter-layer CKA with streamwise concatenation. A 36×36 layer-vs-layer CKA heatmap, scale 0.0–1.0. Layers evolve gradually in their representational geometry.]

Routing dynamics across depth. We examine how the learned routing matrices evolve with depth. As shown in Figure 6, both the Frobenius norm and variance of H_post increase with layer index, suggesting that downstream layers amplify and diversify the outputs of intermediate stream aggregation. In contrast, H_pre and H_res remain stable, indicating that only the post-aggregation redistribution becomes more diffuse as representations are pushed toward the output.

[Figure 6: Routing dynamics across depth. Per-layer mean and variance of H_pre, H_post, and H_res (attn and mlp). Upward trend in H_post reflects growing inter-stream dependence, aligning with observed causal convergence.]

Redundancy and asymmetry in rescue. Rescue experiments isolate the degree to which one stream compensates for another. Figure 7 shows that stream pair (0,2) exhibits high mutual rescue, suggesting redundancy. Others, such as (1,3), show asymmetric recovery where stream 3 reliably compensates for stream 1, but not vice versa.
These patterns indicate that residual streams may play different roles during the forward pass, despite comparable representations.

[Figure 7: Layer-wise rescue performance by stream. Four panels show rescue % by layer when one stream is ablated and each partner stream is restored (rows = rescuer stream). Rescue values are defined as percentage KL reduction from full ablation. High scores indicate functional redundancy; low scores suggest complementarity or general asymmetry.]

Full pairwise comparisons. To visualize recovery regimes across all stream pairs, Figure 8 reports the distribution of KL scores from joint ablation and single-stream rescue. Symmetric recovery suggests redundant encoding, while skewed or weak rescue indicates directional or complementary encoding. Stream pair (0,2) shows tight symmetric rescue, while pair (1,3) shows stream 3 dominating recovery.
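The full pairwise sweep described above can be organized with a small driver; this is a sketch under the assumption of a hypothetical `kl_fn` hook (our name) that runs the model with the given streams zeroed or restored from cached activations and returns a mean KL.

```python
from itertools import combinations

def pairwise_rescue_sweep(n_streams, kl_fn):
    """Collect joint-ablation and single-stream-rescue KLs for all stream pairs.

    kl_fn(ablated, restored): hypothetical model hook returning mean KL when
    the streams in `ablated` are zeroed and those in `restored` are re-injected
    from cached activations.
    """
    joint, rescued = {}, {}
    for i, j in combinations(range(n_streams), 2):
        joint[(i, j)] = kl_fn(ablated={i, j}, restored=set())
        rescued[(i, j, i)] = kl_fn(ablated={j}, restored={i})  # rescue stream i
        rescued[(i, j, j)] = kl_fn(ablated={i}, restored={j})  # rescue stream j
    return joint, rescued
```

With n = 4 streams this yields 6 joint-ablation runs and 12 rescue runs per layer, matching the off-diagonal entries summarized in Table 1 and Figures 7–8.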
[Figure 8: Distribution of rescue effects across stream pairs. Box plots of KL divergence under joint ablation (both streams zeroed) and under single-stream rescue, for each of the six stream pairs, across layers. Boxplots summarize KL recovery values, revealing asymmetric and symmetric recovery patterns.]

Quantifying asymmetric utility. To directly contrast symmetric and asymmetric stream pairs, Figure 9 plots the per-layer rescue difference between (0,2) and (1,3). The near-zero values for (0,2) suggest interchangeable function, while consistent positive differences for (1,3) indicate persistent asymmetry. This validates our central claim: high representational similarity does not imply causal interchangeability.

[Figure 9: Layer-wise rescue asymmetry. Difference in % recovery per layer for pairs 0–2 and 1–3. Positive values indicate that the second stream in a pair is more effective at recovering the joint ablation. Stream 3 consistently dominates stream 1 despite high CKA.]
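The rescue matrix of Table 1 follows mechanically from Eq. (4) once the per-pair KLs are measured; a minimal sketch, assuming the KLs have been collected into (n, n) arrays with our own indexing convention:

```python
import numpy as np

def rescue_matrix(kl_joint: np.ndarray, kl_restored: np.ndarray) -> np.ndarray:
    """Eq. (4) applied to every ordered stream pair.

    kl_joint[j, i]:    mean KL with streams i and j both ablated (symmetric)
    kl_restored[j, i]: mean KL with stream i restored, stream j still ablated
    Returns R[j, i] = 1 - kl_restored[j, i] / kl_joint[j, i], the fractional
    KL recovery; diagonal entries are undefined and set to NaN.
    """
    R = 1.0 - kl_restored / kl_joint
    np.fill_diagonal(R, np.nan)  # a stream cannot rescue its own ablation
    return R
```

Rows then correspond to the stream left ablated and columns to the restored stream, matching the layout of Table 1 (values there are reported as percentages).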
| Hyperparameter | Value |
| --- | --- |
| **Architecture** | |
| Architecture | GPT-2 (decoder-only Transformer) |
| Parameter count | 781M |
| Layers | 36 |
| Hidden dimension | 1280 |
| Attention heads | 20 |
| Head dimension | 64 |
| Embedding dimension | 1280 |
| Context length | 1024 |
| Vocabulary size | 50304 |
| Residual streams | 4 |
| Hyper-connection type | mHC (Manifold-Constrained) |
| Dropout | 0.0 |
| Bias | True |
| **Training** | |
| Optimizer | AdamW + Muon |
| Learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Cosine decay |
| Weight decay | 0.1 |
| β1, β2 | 0.9, 0.95 |
| Gradient clipping | 1.0 |
| Warmup steps | 200 |
| Batch size | 0.5M tokens |
| Training steps | 10,000+ |
| **Data** | |
| Dataset | dolma-v1-7 |
| Tokens seen | ~3.18B |

Table 2: Model and training hyperparameters. Configuration of our 781M parameter mHC-GPT2 model. Architecture augments GPT-2 with 4 Manifold-Constrained residual streams.