Paper deep dive
Cracking the Circuits: Mechanistic Interpretability in Large Language Models
Dost Muhammad, Muhammad Salman, Mushtaq Ali, Malika Bendechache
Models: GPT-2 Small
Abstract
Mechanistic interpretability aims to uncover how internal components of large language models (LLMs) contribute to their overall behaviour. While previous research has revealed interpretable circuits in small-scale transformers, the field still lacks a systematic and mathematically grounded framework for understanding, validating, and comparing interpretability techniques. In this paper, we introduce a unified formalism and taxonomy for mechanistic interpretability, defining core concepts such as modular circuits, representational superposition, and polysemantic neurons. We categorise interpretability methods based on granularity, causal influence, robustness to perturbation, and transferability across tasks. To demonstrate the utility of this framework, we analyse the GPT2-small model on the Indirect Object Identification (IOI) task using TransformerLens. Our findings reveal that attention heads 7.1 and 8.6 play a consistent and causal role in performing the name-mover function. Activation patching experiments show that restoring these components recovers over 90 % of performance lost due to input corruption, while ablating them results in a 27.3 % accuracy drop. These results provide empirical support for the existence of sparse, interpretable circuits that underpin model reasoning. Visualisations of patching curves and attention patterns further substantiate this behaviour. This study offers a principled methodology for mechanistic analysis in language models, bridging theoretical insights with empirical validation. The proposed framework supports the development of transparent, trustworthy, and controllable AI systems, with potential for extension to larger-scale models and safety-critical applications.
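The abstract's patching experiment can be sketched with TransformerLens. Below is a minimal sketch under stated assumptions: the prompts and the logit-difference metric are illustrative, not the authors' exact setup; only the model (GPT2-small), the tool (TransformerLens), and the head indices (7.1 and 8.6) come from the abstract.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-small

# Illustrative IOI prompts (assumption: both tokenize to the same length,
# which activation patching requires).
clean_tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When Mary and John went to the store, Anna gave a drink to")

# Metric: logit difference between the indirect object (" Mary") and the
# repeated subject (" John") at the final position.
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

def restore_head(z, hook, head):
    # z has shape [batch, pos, head_index, d_head]; overwrite one head's
    # output with its value from the clean run.
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

# Patch each candidate name-mover head into the corrupted run; if the head
# carries the name-mover signal, the clean logit difference should recover.
for layer, head in [(7, 1), (8, 6)]:
    patched = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(utils.get_act_name("z", layer),
                    lambda z, hook, h=head: restore_head(z, hook, h))],
    )
    print(f"head {layer}.{head}: patched logit diff = {logit_diff(patched):.3f}")

On the paper's account, restoring these heads should recover most of the logit difference lost to the corrupted input, which is what the reported 90%+ recovery figure quantifies.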
Tags
Links
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/11/2026, 12:37:51 AM
Summary
This paper introduces a unified formalism and taxonomy for mechanistic interpretability in Large Language Models (LLMs), defining concepts like modular circuits and representational superposition. It validates this framework by analyzing GPT2-small on the Indirect Object Identification (IOI) task, identifying specific attention heads (7.1 and 8.6) as critical for the name-mover function through activation patching.
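The reported 27.3% ablation drop can be probed the same way. Continuing from the patching sketch above (reusing model, clean_tokens, logit_diff, and utils), here is a zero-ablation sketch; zero-ablation is an assumption, since the paper may instead use mean-ablation or another baseline.

# Silence heads 7.1 and 8.6 on the clean prompt and measure the damage.
def ablate_head(z, hook, head):
    # z: [batch, pos, head_index, d_head]; zero out one head's output.
    z[:, :, head, :] = 0.0
    return z

fwd_hooks = [
    (utils.get_act_name("z", layer), lambda z, hook, h=head: ablate_head(z, hook, h))
    for layer, head in [(7, 1), (8, 6)]
]
ablated = model.run_with_hooks(clean_tokens, fwd_hooks=fwd_hooks)
print(f"both heads ablated: logit diff = {logit_diff(ablated):.3f}")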
Entities (6)
Relation Signals (3)
GPT2-small → PERFORMS_TASK → Indirect Object Identification
confidence 95% · analyse the GPT2-small model on the Indirect Object Identification (IOI) task
Attention Head 7.1 → CAUSALLY_INFLUENCES → name-mover function
confidence 92% · attention heads 7.1 and 8.6 play a consistent and causal role in performing the name-mover function.
Attention Head 8.6 → CAUSALLY_INFLUENCES → name-mover function
confidence 92% · attention heads 7.1 and 8.6 play a consistent and causal role in performing the name-mover function.
Cypher Suggestions (2)
Map models to the tasks they are evaluated on. · confidence 95% · unvalidated
MATCH (m:Model)-[:PERFORMS_TASK]->(t:Task) RETURN m.name, t.name
Find all model components identified as having a causal role in a specific function. · confidence 90% · unvalidated
MATCH (c:Component)-[:CAUSALLY_INFLUENCES]->(f:Function) RETURN c.name, f.name
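Since both suggestions are marked unvalidated, a quick way to check them is to run them against a graph instance. A minimal sketch with the official neo4j Python driver, assuming a local instance and a schema using the Model/Task/Component/Function labels above; the URI and credentials are placeholders:

from neo4j import GraphDatabase

queries = [
    "MATCH (m:Model)-[:PERFORMS_TASK]->(t:Task) RETURN m.name, t.name",
    "MATCH (c:Component)-[:CAUSALLY_INFLUENCES]->(f:Function) RETURN c.name, f.name",
]

# Placeholder connection details; replace with your instance's URI and auth.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for q in queries:
        print(q)
        for record in session.run(q):
            print(" ", record.data())
driver.close()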
Full Text
825 characters extracted from source content.
Cracking the Circuits: Mechanistic Interpretability in Large Language Models | IEEE Conference Publication | IEEE Xplore