
Paper deep dive

Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition

Mohammed Mudassir Uddin, Shahnawaz Alam, Mohammed Kaif Pasha

Year: 2026 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 43

Models: GPT-2, Llama-13B, Llama-70B, Llama-7B, Pythia

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/11/2026, 1:09:18 AM

Summary

The paper introduces the Hierarchical Attribution Graph Decomposition (HAGD) framework, a scalable method for extracting sparse computational circuits from large language models (up to 70B parameters). By utilizing multi-resolution abstraction hierarchies and differentiable circuit search, HAGD reduces the computational complexity of circuit discovery from O(2^n) to O(n^2 log n). The methodology incorporates cross-layer transcoders for monosemantic feature extraction, GNN-based topology prediction, and causal validation protocols, achieving high behavioral preservation across various model architectures.

Entities (5)

HAGD · framework · 100%
Llama-70B · language-model · 100%
Mechanistic Interpretability · field-of-study · 100%
Cross-layer Transcoder · methodology-component · 95%
Sparse Computational Circuits · concept · 95%

Relation Signals (3)

HAGD reduces complexity of Circuit Discovery

confidence 95% · The proposed Hierarchical Attribution Graph Decomposition (HAGD) framework reduces circuit discovery complexity from O(2^n) exhaustive enumeration to O(n^2 log n)

HAGD validated on Llama-70B

confidence 95% · Empirical evaluation spans GPT-2 variants, Llama-7B through Llama-70B

Cross-layer Transcoder performs Monosemantic Feature Extraction

confidence 90% · The methodology integrates cross-layer transcoders for monosemantic feature extraction

Cypher Suggestions (2)

Find all models evaluated by the HAGD framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'HAGD'})-[:VALIDATED_ON]->(m:Model) RETURN m.name

Identify components of the HAGD framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'HAGD'})-[:INTEGRATES]->(c:Component) RETURN c.name

Abstract

Mechanistic interpretability seeks to reverse-engineer neural network computations into human-understandable algorithms, yet extracting sparse computational circuits from billion-parameter language models remains challenging due to exponential search complexity and pervasive polysemanticity. The proposed Hierarchical Attribution Graph Decomposition (HAGD) framework reduces circuit discovery complexity from O(2^n) exhaustive enumeration to O(n^2 log n) through multi-resolution abstraction hierarchies and differentiable circuit search. The methodology integrates cross-layer transcoders for monosemantic feature extraction, graph neural network meta-learning for topology prediction, and causal intervention protocols for validation. Empirical evaluation spans GPT-2 variants, Llama-7B through Llama-70B, and Pythia suite models across algorithmic tasks and natural language benchmarks. On modular arithmetic tasks, the framework achieves up to 91% behavioral preservation (±2.3% across runs) while maintaining interpretable subgraph sizes. Cross-architecture transfer experiments suggest that discovered circuits exhibit moderate structural similarity (averaging 67%) across model families, indicating potential shared computational patterns. These results provide preliminary foundations for interpretability at larger model scales while identifying significant limitations in current attribution methodologies that require future advances.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

42,889 characters extracted from source content.


Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition

Mohammed Mudassir Uddin*, Shahnawaz Alam, Mohammed Kaif Pasha
mohd.mudassiruddin7@gmail.com, shahnawaz.alam1024@gmail.com, mdkaifpasha2k@gmail.com
Department of CSE, Muffakham Jah College of Engineering and Technology (MJCET), Hyderabad, Telangana, India
*Corresponding Author: mohd.mudassiruddin7@gmail.com
arXiv:2601.12879v1 [cs.LG] 19 Jan 2026

ABSTRACT

Mechanistic interpretability seeks to reverse-engineer neural network computations into human-understandable algorithms, yet extracting sparse computational circuits from billion-parameter language models remains challenging due to exponential search complexity and pervasive polysemanticity. The proposed Hierarchical Attribution Graph Decomposition (HAGD) framework reduces circuit discovery complexity from O(2^n) exhaustive enumeration to O(n^2 log n) through multi-resolution abstraction hierarchies and differentiable circuit search. The methodology integrates cross-layer transcoders for monosemantic feature extraction, graph neural network meta-learning for topology prediction, and causal intervention protocols for validation. Empirical evaluation spans GPT-2 variants, Llama-7B through Llama-70B, and Pythia suite models across algorithmic tasks and natural language benchmarks. On modular arithmetic tasks, the framework achieves up to 91% behavioral preservation (±2.3% across runs) while maintaining interpretable subgraph sizes. Cross-architecture transfer experiments suggest that discovered circuits exhibit moderate structural similarity (averaging 67%) across model families, indicating potential shared computational patterns. These results provide preliminary foundations for interpretability at larger model scales while identifying significant limitations in current attribution methodologies that require future advances.

Keywords: Mechanistic interpretability, sparse computational graphs, circuit discovery, transformer architectures, causal inference, attribution methods, hierarchical decomposition

I. INTRODUCTION

The deployment of billion-parameter language models across high-stakes domains including medical diagnosis, legal reasoning, and autonomous systems has created unprecedented demand for mechanistic understanding of neural computation [13]. Contemporary foundation models demonstrate emergent capabilities that remain opaque to practitioners, raising fundamental questions about reliability, safety, and alignment with human values [4]. The field of mechanistic interpretability addresses these concerns through systematic reverse-engineering of neural network weights and activations into human-interpretable computational graphs [6]. Recent regulatory developments across multiple jurisdictions have established transparency requirements for deployed AI systems, intensifying the urgency of interpretability research [1].

Sparse computational circuits represent the foundational abstraction within mechanistic interpretability, conceptualizing neural computation as graphs where nodes correspond to interpretable features and edges encode functional dependencies [15]. The seminal discovery of induction heads in attention mechanisms demonstrated that specific circuit motifs implement identifiable algorithms for in-context learning [15]. Subsequent investigations revealed circuits for indirect object identification, greater-than comparison, and factual association retrieval [9], [21].
However, these discoveries remain confined to models below one billion parameters, with GPT-2 serving as the predominant subject of investigation.

Four interconnected challenges impede progress toward interpretability at contemporary scales. First, scalability constraints render existing methods applicable only to models below two billion parameters, failing catastrophically when confronted with architectures exceeding 70 billion parameters. Second, computational explosion manifests through the O(2^n) complexity of systematic ablation across all possible subgraphs, demanding exponentially growing resources. Third, polysemanticity confounds interpretation as neurons participate in multiple circuits simultaneously, requiring context-dependent representations that resist simple decomposition. Fourth, validation difficulties emerge from the challenge of distinguishing causal circuits from correlational artifacts, particularly in natural language domains lacking ground-truth circuit specifications.

Recent publications from 2023 through 2025 have advanced sparse computational graph methodologies while leaving critical questions unresolved. Weight-sparse transformers demonstrated that architectural constraints during training yield interpretable circuits at the cost of reduced capacity [8], [16], [32]. Sparse feature circuits established techniques for discovering and editing interpretable causal graphs in language models [10]. Attribution graphs provided comprehensive methods for tracing computational influence across transformer layers [1]. Edge pruning frameworks formalized circuit discovery as optimization problems [27], [29]. Theoretical foundations for sparse concept emergence in neural networks have been rigorously established [23], [25]. Despite these advances, cross-architecture generalization remains poorly understood, theoretical convergence guarantees are lacking, and validation protocols lack standardization across research groups.

The primary contributions encompass the following technical components with empirical outcomes. The hierarchical decomposition algorithm targets O(n^2 log n) complexity for circuit search, representing improvement over exhaustive enumeration in practice. Empirical demonstration establishes circuit extraction across model scales from 117 million to 70 billion parameters with behavioral preservation ranging from 82% to 97% on algorithmic tasks. Cross-architecture transfer analysis reveals moderate structural similarity (52%-82%) between circuits discovered in different model families, providing preliminary evidence for shared computational patterns. The causal validation protocol establishes necessity and sufficiency criteria for verifying discovered circuits against behavioral benchmarks, though acknowledged limitations in validation circularity remain.

II. BACKGROUND AND RELATED WORK

A. Mechanistic Interpretability Foundations

Mechanistic interpretability originated from investigations into the computational structure of neural networks, seeking to characterize learned representations and algorithms in human-understandable terms [14]. The foundational insight recognizes that neural network weights encode implicit algorithms that transform inputs to outputs through sequences of learned operations. Understanding these algorithms requires decomposing monolithic parameter matrices into interpretable computational primitives.
The superposition hypothesis provides theoretical grounding for why interpretability proves challenging in practice [6]. Neural networks exploit high-dimensional activation spaces to encode more features than available dimensions through nearly-orthogonal representations. This representational efficiency creates polysemanticity, where individual neurons respond to multiple semantically distinct concepts. Sparse computational graphs formalize this understanding through directed acyclic structures with nodes representing features and edges denoting information flow. Circuit discovery reduces to identifying minimal subgraphs sufficient for reproducing target model behaviors.

Formally, let h_ℓ ∈ R^d denote the hidden state at layer ℓ of a transformer with L layers and hidden dimension d. The activation space A encompasses all possible hidden states across layers and positions. A sparse computational graph G = (V, E, w) consists of vertices V corresponding to interpretable features, edges E encoding causal dependencies, and weights w quantifying attribution magnitudes. The sparsity constraint requires |V| ≪ |A|, selecting only behaviorally relevant features for inclusion.

B. Sparse Autoencoder Methods

Sparse autoencoders emerged as the primary tool for addressing superposition through learned dictionary expansion [3]. These architectures train on neural activations to produce higher-dimensional sparse representations where each dictionary element ideally corresponds to a single interpretable concept. The sparsity constraint encourages monosemantic decomposition by forcing the network to represent each activation through a small subset of dictionary features.

The sparse autoencoder objective combines reconstruction accuracy with L1 regularization on feature activations. Following the formulation of Bricken et al. [3], given hidden state h_ℓ, the encoder produces sparse coefficients through

f_ℓ = TopK(ReLU(W^E_ℓ h_ℓ + b^E_ℓ), k)    (1)

where W^E_ℓ ∈ R^{m×d} projects to an overcomplete dictionary of m ≫ d features and TopK retains only the k largest activations. The decoder reconstructs hidden states through ĥ_ℓ = D_ℓ f_ℓ, where D_ℓ denotes the decoder matrix.

Jacobian Sparse Autoencoders extend this framework by incorporating gradient information to improve feature quality [7], [31]. Linear computation graphs provide complementary approaches that model transformations between feature spaces across layers [30]. Scaling investigations have demonstrated feature extraction from production-scale models [33], [34]. Critiques of these methods identify reconstruction fidelity limitations, with residual unexplained variance averaging 15-20% across model scales. Computational overhead from training large dictionaries presents additional practical constraints.
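To make the TopK formulation in Equation 1 concrete, the following is a minimal PyTorch sketch of a per-layer encoder/decoder pair. It illustrates the published equations only: the class and parameter names (TopKAutoencoder, d_model, dict_size, k) are assumptions, and the paper does not release an implementation.

```python
import torch
import torch.nn as nn

class TopKAutoencoder(nn.Module):
    """Per-layer TopK sparse autoencoder of Equation 1 (names are illustrative)."""

    def __init__(self, d_model: int, dict_size: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, dict_size)               # W_E, b_E
        self.decoder = nn.Linear(dict_size, d_model, bias=False)   # D

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # f = TopK(ReLU(W_E h + b_E), k): keep only the k largest activations.
        pre = torch.relu(self.encoder(h))
        topk = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)

    def forward(self, h: torch.Tensor):
        f = self.encode(h)
        return self.decoder(f), f            # reconstruction D f, plus the sparse code

# Usage: reconstruct a batch of hidden states with 64 active features each.
sae = TopKAutoencoder(d_model=768, dict_size=8 * 768, k=64)
h = torch.randn(16, 768)
h_hat, f = sae(h)
reconstruction_error = torch.mean((h - h_hat) ** 2)
```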
C. Attribution and Pruning Approaches

Gradient-based attribution methods enable identification of computational pathways by quantifying the influence of upstream features on downstream computations [18]. The ACDC algorithm pioneered automated circuit discovery through iterative edge pruning guided by activation patching effect sizes [4]. Attribution patching accelerates this process through gradient-based approximations that trade exactness for computational efficiency [13]. Edge pruning frameworks formulate circuit discovery as optimization under sparsity constraints [26], [27]. Given full attribution graph G and target behavior B, the objective seeks the minimal subgraph C* ⊆ G satisfying

C* = argmin_{C ⊆ G} |C|  subject to  φ(C) ≥ θ    (2)

where φ(C) denotes behavioral preservation and θ specifies the minimum acceptable fidelity threshold. The tradeoff between interpretability, measured through node count, and faithfulness, measured through performance preservation, fundamentally constrains achievable circuit quality.

D. Theoretical Frameworks

Mathematical foundations for circuit discovery draw from multiple theoretical traditions. Shapley value decompositions provide axiomatic attribution by quantifying marginal contributions of components to model outputs [18]. Information-theoretic relevance measures capture statistical dependencies between features and behaviors. Graph-theoretic complexity characterizations bound the intrinsic difficulty of circuit extraction problems.

Recent theoretical work established conditions under which sparse concepts emerge in deep neural networks [12], [24]. The local interaction basis provides alternative feature decomposition grounded in weight matrix structure rather than activation statistics [2], [28]. Investigations into circuit analysis scalability have examined whether interpretability methods generalize to larger architectures [35]. These theoretical advances inform algorithmic design by characterizing structural properties that enable efficient circuit search.

E. Validation Methodologies

Validation techniques establish the causal status of discovered circuits through systematic intervention experiments. Ablation studies remove circuit components and measure behavioral degradation, verifying necessity of included features. Activation patching transfers component states between inputs to establish sufficiency for behavioral differences. Adversarial perturbations test circuit robustness by constructing inputs designed to activate or suppress specific components.

Critical limitations of current validation approaches deserve acknowledgment. Distributional shift between ablation and normal operation may produce misleading effect estimates. The absence of ground-truth circuits for natural language tasks precludes definitive accuracy assessment. Validation circularity arises when the same ablation experiments used for circuit discovery are subsequently used for validation. These challenges motivate continued methodological development toward more rigorous validation protocols.

III. METHODOLOGY

A. Framework Overview

Hierarchical Attribution Graph Decomposition integrates four interconnected components into a unified circuit extraction pipeline. Figure 1 illustrates the overall architecture proceeding from pre-trained language model through feature extraction, attribution computation, hierarchical decomposition, and causal validation to yield verified computational circuits.

[Fig. 1: Hierarchical circuit extraction pipeline from language model through verified circuit output: pre-trained language model → cross-layer transcoder → attribution graph → hierarchical decomposition → causal validation → sparse circuits.]

The cross-layer transcoder component trains on language model activations to produce monosemantic feature dictionaries with cross-layer predictive structure. Let h_ℓ ∈ R^d denote the hidden state at layer ℓ of a transformer with L layers and hidden dimension d. The transcoder encoder E_ℓ maps hidden states to sparse feature activations as specified in Equation 1. The cross-layer prediction head P_{ℓ→ℓ+1} models feature dependencies across adjacent layers through

f̂_{ℓ+1} = σ(W^P_{ℓ→ℓ+1} f_ℓ + b^P_{ℓ→ℓ+1})    (3)

where σ denotes the sigmoid activation function. Training minimizes reconstruction loss combined with cross-layer prediction loss and sparsity regularization through the composite objective

L = Σ_ℓ ‖h_ℓ − D_ℓ f_ℓ‖² + λ₁ ‖f_{ℓ+1} − f̂_{ℓ+1}‖² + λ₂ ‖f_ℓ‖₁    (4)

where D_ℓ denotes the decoder matrix and λ₁, λ₂ balance the relative importance of prediction accuracy and sparsity.
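A sketch of how the composite objective in Equation 4 could be assembled from the per-layer TopK modules sketched above, assuming one hypothetical nn.Linear cross-layer prediction head per layer pair; the coefficient values and function names are illustrative, not taken from the paper.

```python
import torch

def transcoder_loss(hidden_states, saes, pred_heads, lam1=1.0, lam2=1e-3):
    """Composite objective of Equation 4.

    hidden_states: list [h_1, ..., h_L], each of shape (batch, d).
    saes: list of per-layer TopKAutoencoder modules from the sketch above.
    pred_heads: list of nn.Linear(dict_size, dict_size) heads standing in for
                P_{l -> l+1}; names and coefficients are assumptions.
    """
    feats = [sae.encode(h) for sae, h in zip(saes, hidden_states)]
    total = 0.0
    for l, (sae, h, f) in enumerate(zip(saes, hidden_states, feats)):
        recon = sae.decoder(f)                                   # D_l f_l
        total = total + torch.mean((h - recon) ** 2)             # reconstruction term
        if l + 1 < len(hidden_states):
            f_next_hat = torch.sigmoid(pred_heads[l](f))         # Equation 3
            total = total + lam1 * torch.mean((feats[l + 1] - f_next_hat) ** 2)
        total = total + lam2 * torch.mean(torch.abs(f))          # L1 sparsity term
    return total
```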
B. Sparse Feature Extraction

The algorithmic procedure for learning interpretable features proceeds through iterative optimization of the transcoder objective. Algorithm 1 presents the complete training procedure with numbered steps.

Algorithm 1: Cross-Layer Transcoder Training
Require: pre-trained model M, training corpus D, dictionary size m, sparsity k.
Ensure: trained transcoders (E_ℓ, D_ℓ, P_{ℓ→ℓ+1}) for ℓ = 1, ..., L.
1: Initialize encoder weights W^E_ℓ ∼ N(0, 0.01) for all layers.
2: Initialize decoder weights D_ℓ as the transpose of the encoder.
3: Initialize prediction heads W^P_{ℓ→ℓ+1} randomly.
4: for epoch = 1 to T do
5:   for batch x ∈ D do
6:     Extract hidden states {h_ℓ} from model M on input x.
7:     Compute sparse features f_ℓ = TopK(ReLU(W^E_ℓ h_ℓ + b^E_ℓ), k).
8:     Compute reconstructions ĥ_ℓ = D_ℓ f_ℓ.
9:     Compute cross-layer predictions f̂_{ℓ+1} = σ(W^P_{ℓ→ℓ+1} f_ℓ).
10:    Compute loss L according to Equation 4.
11:    Update parameters via the AdamW optimizer.
12:   end for
13: end for
14: return trained transcoders

The sparsity parameter k controls the number of active features per hidden state, trading reconstruction accuracy for interpretability. Empirical investigation reveals optimal values between 32 and 128 depending on model scale and dictionary expansion factor.

C. Attribution Graph Construction

Attribution graphs encode the computational structure of model behavior through weighted directed edges between features. For a given input sequence x and target output logit y, the attribution from feature f_i at layer ℓ to feature f_j at layer ℓ+1 quantifies the causal influence of f_i on f_j. Following integrated gradients methodology [18], the gradient-based attribution between features computes

A_{i→j} = (∂f_j / ∂f_i) · f_i    (5)

capturing both the sensitivity of downstream features to upstream features through gradient magnitude and the actual activation strength through feature value. Aggregating attributions across positions and layers yields a complete attribution graph G = (V, E, w) where vertices V correspond to dictionary features, edges E connect causally related features, and weights w encode attribution magnitudes.

The computational cost of graph construction scales as O(L·m·s) where L denotes layer count, m denotes dictionary size, and s denotes sequence length. Memory efficiency derives from computing attributions layer-by-layer rather than storing the complete gradient graph simultaneously.
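The edge weights of Equation 5 can be obtained with reverse-mode autodiff. The sketch below assumes the upstream features f_l participate in the computation graph of the downstream features f_next, and it only traces the most active downstream features to keep the illustration cheap; the helper name and the top-j restriction are assumptions, not the paper's procedure.

```python
import torch

def attribution_edges(f_l, f_next, top_j=32):
    """Equation 5: A_{i->j} = (∂f_j / ∂f_i) · f_i for selected downstream features.

    f_l: upstream feature activations at layer l (1-D tensor, requires_grad=True).
    f_next: downstream activations at layer l+1, computed differentiably from f_l.
    Only the top_j most active downstream features are traced here; the paper
    aggregates over all features and positions.
    """
    edges = {}
    for j in torch.topk(f_next, top_j).indices.tolist():
        # Gradient of one downstream feature with respect to all upstream features.
        grad_i = torch.autograd.grad(f_next[j], f_l, retain_graph=True)[0]
        attribution = grad_i * f_l          # sensitivity times activation strength
        for i in torch.nonzero(attribution).flatten().tolist():
            edges[(i, j)] = attribution[i].item()
    return edges
```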
D. Hierarchical Decomposition Algorithm

The hierarchical decomposition transforms dense attribution graphs into multi-resolution representations that enable efficient circuit search. The key insight recognizes that neural computations exhibit compositional structure, with high-level algorithms decomposing into modular subroutines that further decompose into atomic operations.

Definition 1 (Hierarchical Decomposition). A hierarchical decomposition of attribution graph G consists of a multi-resolution sequence H = (G^(0), G^(1), ..., G^(R)) where G^(0) = G represents the original fine-grained graph, G^(R) provides a coarse abstraction, and each successive level G^(r+1) is constructed through agglomerative clustering that merges related vertices from G^(r) into macro-level supernodes.

The decomposition algorithm applies spectral clustering [20] to partition vertices into coherent groups based on edge connectivity patterns. At each resolution level r, the algorithm computes the normalized graph Laplacian

L^(r) = I − D^{−1/2} A^(r) D^{−1/2}    (6)

where A^(r) denotes the adjacency matrix and D denotes the degree matrix. Spectral clustering on the smallest eigenvectors of L^(r) identifies vertex partitions that minimize inter-cluster edge weights while preserving intra-cluster connectivity.

Theorem 1 (Complexity Reduction). Let G be an attribution graph with n vertices. The hierarchical decomposition with branching factor b at each level yields a hierarchy of depth R = O(log_b n). Circuit search through the hierarchy admits worst-case complexity O(n^2 log n) compared to O(2^n) for exhaustive enumeration.

Proof. At each hierarchy level, circuit search considers O(b) supernodes for inclusion. With R = O(log_b n) levels, the total number of supernode decisions is O(b log_b n) = O(n). Refinement of selected supernodes to individual features requires O(n) additional decisions. Edge verification across the selected circuit contributes O(n^2) operations. The logarithmic factor arises from maintaining sorted priority queues during hierarchical traversal, yielding overall complexity O(n^2 log n).
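One plausible realization of the coarsening step in Equation 6 and Definition 1, using SciPy's normalized graph Laplacian and k-means on the leading eigenvectors. The branching factor, minimum size, and use of scikit-learn's KMeans are assumptions made for the sketch; the paper does not specify its clustering implementation.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def coarsen_once(A, n_clusters):
    """One coarsening level via spectral clustering on the normalized Laplacian (Eq. 6).

    A: symmetric (n, n) adjacency matrix with non-negative edge weights.
    Returns the supernode adjacency at the next level and vertex -> cluster labels.
    """
    L = laplacian(A, normed=True)                 # L = I - D^{-1/2} A D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :n_clusters]           # eigenvectors of the smallest eigenvalues
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
    # Merge vertices into supernodes by summing edge weights between clusters.
    P = np.zeros((A.shape[0], n_clusters))
    P[np.arange(A.shape[0]), labels] = 1.0
    A_coarse = P.T @ A @ P
    np.fill_diagonal(A_coarse, 0.0)
    return A_coarse, labels

def build_hierarchy(A, branching=8, min_size=16):
    """Repeated coarsening yields the multi-resolution sequence of Definition 1."""
    levels = [A]
    while levels[-1].shape[0] > min_size:
        n_next = max(min_size, levels[-1].shape[0] // branching)
        A_next, _ = coarsen_once(levels[-1], n_next)
        levels.append(A_next)
    return levels
```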
E. Circuit Search with Graph Neural Networks

The hierarchical structure enables efficient circuit search through learned traversal policies. A graph neural network meta-model predicts which supernodes at each hierarchy level likely belong to the target circuit, guiding search toward promising regions of the graph.

The GNN employs a standard graph attention network architecture [19] adapted for hierarchical graph representations, processing through message-passing layers according to

z_v^(t+1) = MLP( z_v^(t) + Σ_{u ∈ N(v)} α_{uv} z_u^(t) )    (7)

α_{uv} = exp(a^T [z_u^(t) ‖ z_v^(t)]) / Σ_{w ∈ N(v)} exp(a^T [z_w^(t) ‖ z_v^(t)])    (8)

where z_v^(t) denotes the representation of vertex v at message-passing iteration t, N(v) denotes the neighborhood of v, α_{uv} denotes learned attention weights, and ‖ denotes concatenation. The output layer predicts circuit membership probability for each vertex.

Training the GNN requires ground-truth circuit labels, obtained through exhaustive search on small models or through manual annotation by domain experts. The meta-learning formulation trains on a distribution of circuit discovery tasks, enabling generalization to novel queries on unseen models.
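A dense-adjacency toy version of the attention message passing in Equations 7-8, followed by a per-vertex membership readout. Real circuit graphs would call for a sparse GNN library; the module names and the dense formulation here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AttentionMessagePassing(nn.Module):
    """One message-passing step of Equations 7-8 over a dense adjacency mask."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # attention vector a
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # z: (n, dim) vertex states; adj[u, v] = 1 if u is in N(v), else 0.
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),    # z_u at index [u, v]
                           z.unsqueeze(0).expand(n, n, -1)],   # z_v at index [u, v]
                          dim=-1)
        scores = self.attn(pairs).squeeze(-1)                  # a^T [z_u || z_v]
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=0)                   # normalize over u in N(v), Eq. 8
        alpha = torch.nan_to_num(alpha)                        # vertices with empty neighborhoods
        messages = alpha.transpose(0, 1) @ z                   # row v = sum_u alpha_uv z_u
        return self.mlp(z + messages)                          # Eq. 7

class CircuitMembershipGNN(nn.Module):
    """Stacked message passing plus a per-vertex circuit-membership probability."""

    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([AttentionMessagePassing(dim) for _ in range(steps)])
        self.readout = nn.Linear(dim, 1)

    def forward(self, z: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            z = layer(z, adj)
        return torch.sigmoid(self.readout(z)).squeeze(-1)      # P(vertex belongs to circuit)
```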
F. Causal Validation Protocol

Discovered circuits require causal validation to distinguish genuine computational structure from correlational artifacts. The validation protocol applies systematic interventions to verify that circuit components are necessary and sufficient for target behaviors.

Necessity testing ablates circuit components and measures behavioral degradation. For each feature f_i in candidate circuit C, the protocol computes

Δ_i = E_x[ L(M(x; C \ f_i)) − L(M(x; C)) ]    (9)

where L denotes task loss and M(x; C) denotes model output with circuit C active. Features with Δ_i < ε are pruned as unnecessary.

Sufficiency testing verifies that the circuit alone reproduces target behavior. The protocol constructs a minimal model containing only circuit components and evaluates

φ(C) = Acc(M_C) / Acc(M_full)    (10)

where M_C denotes the circuit-restricted model and M_full denotes the complete model. Circuits achieving φ(C) > 0.9 are considered sufficient for the target behavior.
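The necessity and sufficiency criteria of Equations 9-10 reduce to two small loops once a model-with-ablation callable is available. In the sketch below, loss_with_circuit is a hypothetical stand-in for M(x; C); the threshold values and names are assumptions.

```python
from statistics import mean

def necessity_scores(circuit, inputs, loss_with_circuit, epsilon=1e-3):
    """Necessity test of Equation 9: Delta_i = E_x[L(M(x; C without f_i)) - L(M(x; C))].

    circuit: set of feature identifiers; loss_with_circuit(x, features) is a
    hypothetical callable that runs the model with only `features` un-ablated
    and returns the task loss (it stands in for M(x; C)).
    """
    base = [loss_with_circuit(x, circuit) for x in inputs]
    necessary = {}
    for f in circuit:
        ablated = circuit - {f}
        delta = mean(loss_with_circuit(x, ablated) - b for x, b in zip(inputs, base))
        if delta >= epsilon:               # removing f hurts the behavior, so keep it
            necessary[f] = delta
    return necessary

def sufficiency(acc_circuit_only, acc_full_model, threshold=0.9):
    """Sufficiency test of Equation 10: phi(C) = Acc(M_C) / Acc(M_full)."""
    phi = acc_circuit_only / acc_full_model
    return phi, phi > threshold
```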
IV. EXPERIMENTAL DESIGN

A. Model Specifications

Experiments evaluate circuit extraction across three model families spanning four orders of magnitude in parameter count. Table I summarizes model specifications including architecture details and training configurations.

TABLE I: Model specifications for experimental evaluation.

Model        | Params | Layers | Hidden | Heads | Tokens
GPT-2 Small  | 117M   | 12     | 768    | 12    | 40B
GPT-2 Medium | 345M   | 24     | 1024   | 16    | 40B
GPT-2 Large  | 774M   | 36     | 1280   | 20    | 40B
Pythia-1.4B  | 1.4B   | 24     | 2048   | 16    | 300B
Pythia-2.8B  | 2.8B   | 32     | 2560   | 32    | 300B
Pythia-6.9B  | 6.9B   | 32     | 4096   | 32    | 300B
Llama-7B     | 6.7B   | 32     | 4096   | 32    | 1T
Llama-13B    | 13B    | 40     | 5120   | 40    | 1T
Llama-70B    | 70B    | 80     | 8192   | 64    | 1.4T

The GPT-2 family provides baseline comparison with prior circuit discovery work, enabling validation of the proposed methodology against established results. The Pythia suite offers controlled comparison across model scales with identical training data and procedures. The Llama family represents contemporary production-scale architectures with enhanced training recipes including RMSNorm and rotary position embeddings.

Cross-layer transcoders are trained independently for each model using 100 million tokens from the RedPajama corpus. Dictionary sizes scale with model hidden dimension, using expansion factor m/d = 8, yielding dictionaries from 6,144 features for GPT-2 Small to 65,536 features for Llama-70B.

B. Benchmark Tasks

The evaluation framework encompasses algorithmic reasoning tasks with known ground-truth circuits and natural language understanding benchmarks requiring discovered circuits.

Algorithmic tasks include modular arithmetic, computing (a + b) mod 113 for integers a, b < 113, where prior work identified interpretable Fourier-basis circuits in small transformers [13]. Parity determination of binary strings up to length 20 requires compositional computation across multiple input positions. Sorting of sequences containing 5 distinct integers into ascending order requires comparison and swap operations that manifest as interpretable circuit structure.

Natural language tasks include coreference resolution using WinoGrande [17], where circuits encode grammatical structure and semantic compatibility. Commonsense reasoning using HellaSwag [22] probes world knowledge integration with contextual understanding. Factual recall through question answering reveals knowledge storage and retrieval mechanisms.

C. Evaluation Metrics

Circuit quality is assessed through complementary metrics capturing sparsity, faithfulness, and interpretability. Sparsity metrics include node count, edge count, and compression ratio measuring the fraction of the original graph retained. Faithfulness metrics include behavioral preservation φ(C) as defined in Equation 10, intervention effect as average accuracy change under ablation, and reconstruction error as distance between circuit and full model outputs. Interpretability metrics include concept purity measuring the fraction of features with single interpretable meanings, circuit modularity as the ratio of intra-module to inter-module edge weights, and description length in bits required to specify circuit structure.
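A small sketch of the sparsity and modularity metrics described above, taking the "ratio of intra-module to inter-module edge weights" literally; the paper's exact normalization is not specified, so the modularity value here is one plausible reading, and all argument names are illustrative.

```python
def circuit_metrics(circuit_edges, full_edges, module_of=None):
    """Sparsity and modularity metrics for a discovered circuit.

    circuit_edges / full_edges: dicts mapping (u, v) -> attribution weight for
    the circuit subgraph and the full attribution graph. module_of: optional
    dict node -> module id. All names are illustrative assumptions.
    """
    nodes = {u for u, _ in circuit_edges} | {v for _, v in circuit_edges}
    metrics = {
        "node_count": len(nodes),
        "edge_count": len(circuit_edges),
        "compression_ratio": len(circuit_edges) / max(len(full_edges), 1),
    }
    if module_of is not None:
        intra = sum(w for (u, v), w in circuit_edges.items() if module_of[u] == module_of[v])
        inter = sum(w for (u, v), w in circuit_edges.items() if module_of[u] != module_of[v])
        # Literal "intra over inter" reading; the paper's normalization may differ.
        metrics["modularity"] = intra / max(inter, 1e-9)
    return metrics
```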
D. Baseline Methods

The proposed framework is compared against established circuit discovery approaches including ACDC with iterative edge pruning through activation patching [4], attribution patching with gradient-based approximation [13], sparse probing with linear probes on autoencoder features [5], and exhaustive search through brute-force enumeration where computationally feasible. Baseline implementations use published codebases with hyperparameters tuned on validation splits of each benchmark.

V. RESULTS

A. Circuit Recovery on Algorithmic Tasks

Table II presents circuit discovery performance on algorithmic reasoning tasks across model scales and methods.

TABLE II: Circuit discovery on modular arithmetic. OOM = out of memory.

Model        | Method       | Nodes | Edges | Pres. | Time (s)
GPT-2 Small  | Exhaustive   | 47    | 312   | 0.98  | 14,400
GPT-2 Small  | ACDC         | 52    | 487   | 0.95  | 2,340
GPT-2 Small  | Hierarchical | 49    | 341   | 0.97  | 180
GPT-2 Medium | ACDC         | 78    | 892   | 0.91  | 8,720
GPT-2 Medium | Hierarchical | 71    | 623   | 0.94  | 420
Pythia-1.4B  | ACDC         | —     | —     | OOM   | —
Pythia-1.4B  | Hierarchical | 124   | 1,847 | 0.91  | 1,260
Pythia-2.8B  | Hierarchical | 156   | 2,534 | 0.89  | 2,840
Llama-7B     | Hierarchical | 198   | 3,412 | 0.87  | 4,520
Llama-70B    | Hierarchical | 347   | 8,923 | 0.82  | 28,400

The hierarchical decomposition method successfully extracts circuits across all model scales while maintaining behavioral preservation above 80%. Circuit complexity grows sublinearly with model size, suggesting that algorithmic structure remains relatively compact despite increased model capacity. ACDC fails to complete on Pythia-1.4B due to memory exhaustion during activation patching, demonstrating the scalability advantage of the proposed approach.

Recovered circuits for modular arithmetic exhibit interpretable Fourier-basis structure consistent with prior discoveries. Feature analysis reveals sinusoidal activation patterns at frequencies corresponding to divisors of 113, implementing the discrete Fourier transform approach to modular computation.

B. Natural Language Benchmark Results

Table III presents circuit discovery results on natural language understanding tasks.

TABLE III: Circuit discovery on natural language tasks.

Model       | Task       | Nodes | Edges  | Pres. | Mod.
GPT-2 Large | WinoGrande | 312   | 4,521  | 0.86  | 0.72
Pythia-2.8B | WinoGrande | 423   | 7,834  | 0.83  | 0.68
Llama-7B    | WinoGrande | 487   | 9,123  | 0.81  | 0.65
GPT-2 Large | HellaSwag  | 456   | 8,912  | 0.79  | 0.61
Pythia-2.8B | HellaSwag  | 612   | 14,234 | 0.76  | 0.58
Llama-7B    | HellaSwag  | 734   | 18,456 | 0.74  | 0.55
Pythia-2.8B | Factual QA | 234   | 3,412  | 0.88  | 0.78
Llama-7B    | Factual QA | 312   | 4,891  | 0.85  | 0.74
Llama-70B   | Factual QA | 489   | 9,234  | 0.81  | 0.69

Natural language tasks yield larger circuits with lower modularity compared to algorithmic benchmarks, reflecting the distributed nature of linguistic computation. Factual recall circuits demonstrate the highest modularity, suggesting that knowledge retrieval follows more localized patterns than general reasoning tasks.

[Fig. 2: Behavioral preservation φ versus circuit size (nodes) across task types: modular arithmetic, factual QA, WinoGrande, HellaSwag.]

The relationship between circuit size and behavioral preservation follows predictable patterns across task types, as illustrated in Figure 2. Algorithmic tasks achieve high preservation with compact circuits, while natural language tasks show lower preservation at higher node counts.

C. Cross-Architecture Transfer Analysis

Transfer experiments evaluate whether circuits discovered in one model family predict circuit structure in other architectures. The transfer coefficient measures structural similarity between circuits for the same task across different models through

τ(C₁, C₂) = |E(C₁) ∩ E(C₂)| / |E(C₁) ∪ E(C₂)|    (11)

where edge correspondence is established through feature alignment based on activation correlations.
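Equation 11 is a Jaccard similarity over edge sets once features in the two models have been aligned. The sketch assumes the alignment map (for example, by maximum activation correlation) is given; the function and argument names are illustrative.

```python
def transfer_coefficient(edges_a, edges_b, align):
    """Equation 11: tau(C1, C2) = |E(C1) ∩ E(C2)| / |E(C1) ∪ E(C2)|.

    edges_a, edges_b: sets of (source, target) feature ids for the same task in
    two models. align: dict mapping model-A feature ids to model-B ids, e.g.
    chosen by maximum activation correlation (the alignment step is assumed).
    """
    mapped_a = {(align[u], align[v]) for (u, v) in edges_a
                if u in align and v in align}
    union = mapped_a | edges_b
    return len(mapped_a & edges_b) / len(union) if union else 0.0

# Example: tau close to 1.0 means the two circuits share most aligned edges.
# tau = transfer_coefficient(gpt2_circuit_edges, pythia_circuit_edges, feature_map)
```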
TABLE IV: Cross-architecture transfer coefficients.

Source   | Target    | Mod. Arith | Fact. QA | Wino.
GPT-2    | Pythia    | 0.71       | 0.54     | 0.43
GPT-2    | Llama     | 0.68       | 0.49     | 0.38
Pythia   | Llama     | 0.73       | 0.61     | 0.52
Llama-7B | Llama-70B | 0.82       | 0.74     | 0.67

Table IV demonstrates that algorithmic tasks exhibit the highest transfer coefficients, suggesting that mathematical computations converge to similar circuit structures regardless of architecture. Within-family transfer, particularly Llama-7B to Llama-70B, substantially exceeds cross-family transfer, indicating that training procedure and tokenization influence circuit topology.

[Fig. 3: Cross-architecture circuit transfer coefficients averaged across tasks, shown as a source × target matrix over GPT-2, Pythia, Llama-7B, and Llama-70B.]

Figure 3 visualizes the transfer coefficient matrix, revealing the pattern of within-family similarity and the gradual decrease in cross-family circuit correspondence.

D. Ablation Studies

Table V presents ablation results examining the contribution of each framework component.

TABLE V: Ablation study on framework components.

Configuration    | Mod. Pres. | QA Pres. | Time
Full Framework   | 0.91       | 0.85     | 1.0×
No Hierarchy     | 0.89       | 0.81     | 0.3×
No GNN Guidance  | 0.88       | 0.82     | 0.7×
No Cross-Layer   | 0.84       | 0.76     | 1.1×
No Causal Valid. | 0.91       | 0.85     | 1.4×

Ablation results confirm that each framework component contributes meaningfully to overall performance. Removing hierarchical decomposition eliminates the computational efficiency advantage while slightly reducing circuit quality. The GNN guidance provides modest improvements in both quality and efficiency. Cross-layer transcoders contribute substantially to circuit quality by capturing inter-layer dependencies. Causal validation incurs computational cost but ensures that discovered circuits reflect genuine computational structure.

E. Scalability Analysis

Table VI presents computational requirements across model scales.

TABLE VI: Computational scalability metrics.

Scale | Mem (GB) | Time (hr) | Feat/s | Edge/s
100M  | 4        | 0.05      | 12,400 | 45,200
1B    | 24       | 0.35      | 8,900  | 31,400
10B   | 96       | 4.2       | 4,200  | 12,800
70B   | 320      | 28.4      | 1,800  | 4,100

[Fig. 4: Relative circuit-search time versus model parameters (log-log) for exhaustive search, ACDC, and the hierarchical method, showing polynomial scaling of the hierarchical method versus alternatives.]

Figure 4 demonstrates the complexity advantage of hierarchical decomposition compared to exponential growth for exhaustive search and superlinear growth for ACDC. This polynomial scaling enables application to 70B-parameter models that remain intractable for alternative approaches.

VI. DISCUSSION

A. Interpretation of Findings

The demonstrated ability to extract circuits from billion-parameter models extends the practical scope of mechanistic interpretability research beyond prior work confined to models below one billion parameters. The hierarchical decomposition framework enables circuit analysis at scales more representative of deployed language models, though significant gaps remain before comprehensive understanding is achievable.

The observed structural similarity across architectures suggests that some computational patterns may be shared across model families trained on similar data distributions. Modular arithmetic circuits exhibit 52%-82% structural similarity across GPT-2, Pythia, and Llama families. However, this partial overlap indicates that substantial architecture-specific structure exists (18%-48% of circuit topology differs), and claims of "universal" computational motifs would require substantially broader evidence across diverse architectures, training regimes, and task domains.

The relationship between task type and circuit structure offers tentative guidance for future investigations. Algorithmic tasks with clean mathematical specifications yield more compact circuits, while natural language tasks produce more distributed circuits. The lower modularity scores (0.55-0.78) for natural language tasks suggest that clean functional decomposition may be more difficult to achieve for complex linguistic computations.

B. Comparison to Prior Work

The hierarchical decomposition method achieves comparable or superior circuit quality to ACDC on models where both methods complete successfully, while extending applicability to scales where ACDC fails due to memory constraints [4], [27]. Attribution patching provides faster approximate circuits but with lower preservation scores averaging 0.08 points below hierarchical extraction [13]. Differentiable pruning approaches offer alternative optimization formulations with complementary tradeoffs [26], [29]. Sparse probing identifies important features but lacks the edge structure necessary for complete circuit characterization [5].

The polynomial complexity established in Theorem 1 represents theoretical improvement over prior methods that rely on exhaustive enumeration or greedy heuristics without complexity guarantees. Empirical runtime measurements confirm the theoretical predictions, with hierarchical extraction completing in under 30 hours for 70B-parameter models compared to projected years for exhaustive approaches.

C. Limitations and Failure Modes

Several fundamental limitations constrain the proposed methodology and warrant explicit acknowledgment.

Attention Circuit Gaps. The current framework focuses exclusively on MLP computations traced through transcoder features, omitting explicit modeling of attention patterns. Given that attention mechanisms perform substantial computation in transformers, particularly for tasks involving token-to-token relationships, the extracted circuits are necessarily incomplete. This limitation is critical for tasks like coreference resolution where attention patterns are central.

Reconstruction Dark Matter. The 15-20% unexplained variance in transcoder reconstruction represents a fundamental concern. If this residual variance contains structured computation rather than noise, important computational pathways may be invisible to circuit analysis. The current methodology does not characterize what information resides in this dark matter or whether it varies systematically across tasks and layers.

Validation Circularity. The causal validation protocol relies on ablation experiments that assume circuit completeness, creating potential circular reasoning. Features may appear necessary because their ablation disrupts computation in ways unrelated to the target behavior.

GNN Training Requirements. The GNN meta-model requires ground-truth circuit labels for training, which must be obtained through computationally expensive exhaustive search on small models or through subjective expert annotation. The quality and availability of such labels limits the applicability of the learned guidance.

Circuit Interpretability at Scale. Circuits with 347 nodes and 8,923 edges (as extracted for Llama-70B) may exceed human interpretability limits, potentially defeating the original purpose of circuit discovery.

D. Potential Applications

If the limitations described above can be addressed in future work, hierarchical circuit extraction could potentially support AI safety research by enabling systematic analysis of model internals. However, the current method's incomplete coverage of attention mechanisms and the 15-20% reconstruction dark matter mean that safety-critical computations could be missed. The transfer coefficients showing 18%-48% architecture-specific circuit structure also suggest caution when extrapolating findings from smaller to larger models. These caveats should inform any safety-oriented applications of the methodology.

VII. CONCLUSION

The presented Hierarchical Attribution Graph Decomposition framework enables extraction of sparse circuits from billion-parameter language models through multi-resolution abstraction hierarchies. The methodology achieves behavioral preservation of 82%-97% on algorithmic reasoning tasks and 74%-88% on natural language benchmarks, with circuit sizes ranging from 49 to 734 nodes depending on task complexity. Cross-architecture transfer experiments show moderate structural similarity (52%-82%) between circuits discovered in different model families.

Substantial limitations remain. Attention circuits are not modeled, leaving significant computational pathways unanalyzed. Reconstruction dark matter of 15-20% means extracted circuits are incomplete. The GNN guidance requires ground-truth labels that are expensive to obtain. Large circuits with hundreds of nodes may exceed human interpretability limits. Future work should address integration of attention and MLP circuit analysis, characterization of reconstruction residuals, development of label-efficient training methods for the GNN meta-model, and rigorous statistical evaluation including confidence intervals across multiple experimental runs.
REFERENCES

[1] Anthropic, "On the biology of a large language model," Transformer Circuits Thread, 2025. [Online]. Available: https://transformer-circuits.pub/2025/attribution-graphs/methods.html
[2] A. Bhaskar, D. Friedman, and N. Nanda, "Neuron-level circuits: Pruning MLPs to interpret their weights," Transluce Research, 2025. [Online]. Available: https://transluce.org/neuron-circuits
[3] T. Bricken et al., "Towards monosemanticity: Decomposing language models with dictionary learning," Transformer Circuits Thread, 2023.
[4] A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, "Towards automated circuit discovery for mechanistic interpretability," in Advances in Neural Information Processing Systems, vol. 36, 2023.
[5] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, "Sparse autoencoders find highly interpretable features in language models," in Proc. Int. Conf. Learning Representations (ICLR), 2024.
[6] N. Elhage et al., "A mathematical framework for transformer circuits," Transformer Circuits Thread, 2022.
[7] L. Gao et al., "Scaling and evaluating sparse autoencoders," arXiv preprint arXiv:2406.04093, 2024.
[8] L. Gao, S. Cheng, and J. Wu, "Weight-sparse transformers enable circuit-level interpretability," OpenAI Research, 2025.
[9] M. Hanna, O. Liu, and A. Variengien, "How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model," in Advances in Neural Information Processing Systems, vol. 36, 2023.
[10] S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, and A. Mueller, "Sparse feature circuits: Discovering and editing interpretable causal graphs in language models," arXiv preprint arXiv:2403.19647, 2024.
[11] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, "Locating and editing factual associations in GPT," in Advances in Neural Information Processing Systems, vol. 35, 2022.
[12] E. J. Michaud et al., "Opening the AI black box: Program synthesis via mechanistic interpretability," arXiv preprint arXiv:2402.05110, 2024.
[13] N. Nanda, L. Chan, T. Liberum, J. Smith, and J. Steinhardt, "Progress measures for grokking via mechanistic interpretability," in Proc. Int. Conf. Learning Representations (ICLR), 2023.
[14] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, "Zoom in: An introduction to circuits," Distill, 2020. [Online]. Available: https://doi.org/10.23915/distill.00024
[15] C. Olsson et al., "In-context learning and induction heads," Transformer Circuits Thread, 2022.
[16] OpenAI, "Interpretability in weight-sparse language models," OpenAI Research, 2024. [Online]. Available: https://openai.com/index/understanding-neural-networks-through-sparse-circuits/
[17] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi, "WinoGrande: An adversarial Winograd schema challenge at scale," in Proc. AAAI Conf. Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8732-8740.
[18] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," in Proc. Int. Conf. Machine Learning (ICML), 2017, pp. 3319-3328.
[19] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Proc. Int. Conf. Learning Representations (ICLR), 2018.
[20] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[21] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt, "Interpretability in the wild: A circuit for indirect object identification in GPT-2 small," in Proc. Int. Conf. Learning Representations (ICLR), 2023.
[22] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, "HellaSwag: Can a machine really finish your sentence?" in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2019, pp. 4791-4800.
[23] Q. Ren, J. Li, H. Zheng, and Q. Zhang, "Defining and quantifying the emergence of sparse concepts in DNNs," in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2023, pp. 20280-20289.
[24] Q. Ren and Q. Zhang, "Explaining generalization power of a DNN using interactive concepts," arXiv preprint arXiv:2302.13091, 2023.
[25] Q. Ren, H. Chen, and Q. Zhang, "Where we have arrived in proving the emergence of sparse symbolic concepts in AI models," arXiv preprint arXiv:2305.01939, 2023.
[26] Y. Zhang, Z. Li, and M. Chen, "Discovering transformer circuits via a hybrid attribution and pruning framework," arXiv preprint arXiv:2510.03282, 2025.
[27] A. Syed, C. Rager, and D. Bau, "Finding transformer circuits with edge pruning," arXiv preprint arXiv:2406.16778, 2024.
[28] N. Goldowsky-Dill, C. MacLeod, L. Kreiman, and J. Steinhardt, "The local interaction basis: Identifying computationally-relevant and sparsely interacting features in neural networks," arXiv preprint arXiv:2405.10928, 2024.
[29] J. Huang, A. Geiger, K. D'Oosterlinck, Z. Wu, and C. Potts, "Functional faithfulness in the wild: Circuit discovery with differentiable computation graph pruning," arXiv preprint arXiv:2407.03779, 2024.
[30] A. Makelov, G. Lange, and N. Nanda, "Automatically identifying local and global circuits with linear computation graphs," arXiv preprint arXiv:2405.13868, 2024.
[31] L. Riggs, E. J. Michaud, and A. Conmy, "Jacobian sparse autoencoders: Sparsify computations, not just activations," arXiv preprint arXiv:2502.18147, 2025.
[32] L. Gao, A. Rajaram, J. Coxon, S. V. Govande, B. Baker, and D. Mossing, "Weight-sparse transformers have interpretable circuits," arXiv preprint arXiv:2511.13653, 2025.
[33] S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders, "Language models can explain neurons in language models," OpenAI Research, 2023.
[34] A. Templeton et al., "Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet," Transformer Circuits Thread, 2024.
[35] T. Lieberum, M. Rahtz, J. Kramár, N. Nanda, G. Irving, R. Shah, and V. Mikulik, "Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla," arXiv preprint arXiv:2307.09458, 2023.