← Back to papers

Paper deep dive

SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis

Zihao Fu, Xufeng Duan, Zhenguang G. Cai

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 72

Models: DeepSeek-R1-8B, GLM-4-9B, Gemma3-2B, Llama-3.2-1B, Ministral-8B, Qwen3-4B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/11/2026, 1:06:33 AM

Summary

SCALPEL is a framework for LLM interpretability that represents capabilities as low-rank parameter subspaces rather than discrete modules. By training LoRA adapters to equalize probabilities between correct and incorrect answers, SCALPEL identifies and selectively ablates specific capabilities while preserving general language modeling performance, revealing that capabilities are distributed across the parameter space.

Entities (4)

Large Language Models · technology · 100%
LoRA · methodology · 100%
SCALPEL · framework · 100%
BLiMP · dataset · 95%

Relation Signals (3)

SCALPEL targets Large Language Models

confidence 100% · SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models)

SCALPEL utilizes LoRA

confidence 100% · We present SCALPEL... By training LoRA adapters to reduce the model’s ability to distinguish correct from incorrect answers

SCALPEL evaluates on BLiMP

confidence 95% · Experiments across diverse capability tasks and linguistic tasks from BLiMP demonstrate that SCALPEL successfully removes target capabilities

Cypher Suggestions (2)

Find all frameworks that utilize LoRA for interpretability. · confidence 90% · unvalidated

MATCH (f:Framework)-[:UTILIZES]->(m:Methodology {name: 'LoRA'}) RETURN f.name

Identify datasets used by SCALPEL for evaluation. · confidence 90% · unvalidated

MATCH (f:Framework {name: 'SCALPEL'})-[:EVALUATES_ON]->(d:Dataset) RETURN d.name
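A minimal sketch of how suggestions like these might be assembled programmatically before being sent to a graph database. The helper functions are hypothetical; only the Cypher text itself comes from the two suggestions above:

```python
# Hypothetical query builders mirroring the two Cypher suggestions above.
# These produce query strings only; executing them would require a live
# graph database connection, which is outside this sketch.

def frameworks_using(methodology: str) -> str:
    """Cypher query for frameworks that UTILIZE a given methodology."""
    return (
        "MATCH (f:Framework)-[:UTILIZES]->"
        f"(m:Methodology {{name: '{methodology}'}}) RETURN f.name"
    )

def datasets_evaluated_by(framework: str) -> str:
    """Cypher query for datasets a framework EVALUATES_ON."""
    return (
        f"MATCH (f:Framework {{name: '{framework}'}})"
        "-[:EVALUATES_ON]->(d:Dataset) RETURN d.name"
    )

print(frameworks_using("LoRA"))
print(datasets_evaluated_by("SCALPEL"))
```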

Abstract

Large language models excel across diverse domains, yet their deployment in healthcare, legal systems, and autonomous decision-making remains limited by incomplete understanding of their internal mechanisms. As these models integrate into high-stakes systems, understanding how they encode capabilities has become fundamental to interpretability research. Traditional approaches identify important modules through gradient attribution or activation analysis, assuming specific capabilities map to specific components. However, this oversimplifies neural computation: modules may contribute to multiple capabilities simultaneously, while single capabilities may distribute across multiple modules. These coarse-grained analyses fail to capture fine-grained, distributed capability encoding. We present SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework representing capabilities as low-rank parameter subspaces rather than discrete modules. Our key insight is that capabilities can be characterized by low-rank modifications distributed across layers and modules, enabling precise capability removal without affecting others. By training LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers while preserving general language modeling quality, SCALPEL identifies low-rank representations responsible for particular capabilities while remaining disentangled from others. Experiments across diverse capability and linguistic tasks from BLiMP demonstrate that SCALPEL successfully removes target capabilities while preserving general capabilities, providing fine-grained insights into capability distribution across parameter space. Results reveal that capabilities exhibit low-rank structure and can be selectively ablated through targeted parameter-space interventions, offering nuanced understanding of capability encoding in LLMs.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · interpretability (suggested, 80%) · mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

72,052 characters extracted from source content.


SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis

Zihao Fu, The Chinese University of Hong Kong, zihaofu@cuhk.edu.hk
Xufeng Duan, The Chinese University of Hong Kong, xufengduan@cuhk.edu.hk
Zhenguang G. Cai, The Chinese University of Hong Kong, zhenguangcai@cuhk.edu.hk

Abstract

Large language models have achieved remarkable success across diverse domains, yet their deployment in many applications such as healthcare, legal systems, and autonomous decision-making remains limited by our incomplete understanding of their internal mechanisms. As these models become increasingly integrated into high-stakes systems, understanding how they encode and execute specific capabilities has become fundamental to interpretability research. Traditional approaches identify important modules through gradient attribution or activation analysis, assuming that specific capabilities are controlled by specific components. However, this assumption oversimplifies neural computation: individual modules may contribute to multiple capabilities simultaneously, and conversely, a single capability may be implemented in a distributed manner across multiple modules. These coarse-grained, module-level analyses fail to capture the fine-grained, distributed nature of capability encoding in neural networks. We present SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework that represents capabilities as low-rank parameter subspaces rather than discrete modules. Our key insight is that language model capabilities can be characterized by low-rank modifications distributed across layers and modules, enabling precise capability removal without affecting others.
By training LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers while preserving general language modeling quality, SCALPEL identifies the low-rank representation responsible for a particular capability while remaining disentangled from other capabilities. Experiments across diverse capability tasks and linguistic tasks from BLiMP demonstrate that SCALPEL successfully removes target capabilities while preserving other general capabilities, and provides fine-grained insights into how capabilities are distributed across the model's parameter space. Our results reveal that capabilities exhibit low-rank structure and can be selectively ablated through targeted parameter-space interventions, offering a more nuanced understanding of capability encoding in large language models.

1 Introduction

Large language models (LLMs) [33, 15, 48] have achieved remarkable success across diverse applications, from code generation [8] to medical diagnosis [41] and scientific reasoning [49]. However, their deployment in many applications such as healthcare [41], legal systems [40], and autonomous decision-making [39] remains limited by our incomplete understanding of their internal mechanisms. Without understanding how these models encode and process information, we cannot fully trust their decisions in applications requiring accountability. This opacity limits deployment in domains where interpretability and reliability are paramount. To address these concerns, the interpretability research community has developed numerous approaches to understand how LLMs encode and process information. Gradient-based attribution methods [42, 38] identify which input features influence predictions, while activation analysis techniques [35, 31] reveal important components by examining hidden representations.
Mechanistic interpretability methods [14, 26] trace causal pathways through controlled interventions, and dictionary learning approaches [6, 10] decompose polysemantic neurons into interpretable features. Model editing techniques [26, 27] further demonstrate the possibility of modifying specific knowledge without full retraining. These advances have significantly improved our understanding of transformer architectures [45].

arXiv:2601.07411v1 [cs.LG] 12 Jan 2026

However, existing interpretability methods rely on strong assumptions that oversimplify how capabilities are encoded in neural networks. They typically assume that a specific capability is controlled by a specific component, whether a neuron, layer, or attention head. This assumption is often unrealistic for two fundamental reasons. First, individual components exhibit polysemanticity [37, 13], where a single neuron or attention head may participate in multiple distinct capabilities simultaneously, meaning different capabilities correspond to different subspaces within the same module [6, 10]. Second, capabilities are encoded in a distributed fashion [17]: a single capability such as arithmetic or translation may be jointly controlled by multiple components across different layers and modules. Current methods, which operate at the component level, cannot adequately capture or represent this distributed, entangled nature of capability encoding. To address these limitations, we propose SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework that represents capabilities as low-rank parameter subspaces rather than discrete components. Our key insight is that each capability occupies a low-dimensional subspace in the high-dimensional parameter space. By identifying and modifying only the parameter directions corresponding to a target capability, we can selectively ablate that capability while preserving others.
This parameter-subspace perspective naturally handles both polysemanticity and distributed encoding. The low-rank constraint forces the model to reveal the structure of capability encoding. SCALPEL formulates selective capability removal as an optimization problem. We train low-rank LoRA adapters [18] with a probability equalization loss that reduces the model's ability to distinguish correct from incorrect answers on target tasks, combined with text regularization that preserves general language modeling quality. The resulting low-rank modifications reveal which parameters are critical for each capability and how capabilities are distributed across the model's architecture. For token-level tasks (where the model predicts a single token as the answer) such as multiple-choice questions or arithmetic, we equalize the probabilities of correct and incorrect token predictions. For sentence-level tasks (where the model evaluates entire sentences) such as grammaticality judgments, we balance the model's preferences between grammatical and ungrammatical sentences, making the model unable to distinguish correct grammar from incorrect grammar. Through optimization with explicit regularization constraints, including L2 norm penalties and L1 sparsity regularization, SCALPEL identifies the low-rank representation responsible for a particular capability while remaining disentangled from other general capabilities [21]. Our contributions are threefold: (1) We introduce a low-rank representation perspective on capability encoding, demonstrating that language model capabilities can be characterized with low-rank modifications distributed across layers and modules, enabling fine-grained analysis beyond component-level interpretability.
(2) We propose SCALPEL, a framework that identifies the low-rank representation responsible for specific capabilities by training LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers while preserving general language understanding. (3) We conduct comprehensive experiments across diverse capability tasks and linguistic tasks from BLiMP, demonstrating that SCALPEL achieves effective capability removal while preserving general language abilities, and revealing that different capabilities exhibit distinct layer-wise distributions that align with cognitive and linguistic theories.

2 Related Work

Post-hoc attribution methods identify which input features or model components contribute to predictions. Gradient-based approaches compute feature importance through input-output sensitivity: Integrated Gradients [42] accumulates gradients along interpolation paths, while Grad-CAM [38] uses gradient-weighted activations for visual explanations. Perturbation-based methods like LIME [34] and SHAP [25] provide model-agnostic local explanations by observing output changes under input perturbations. Activation-based methods analyze hidden representations directly: DiffMean [35] measures activation differences between contrastive examples, Logit Lens [31] projects intermediate representations to vocabulary space, and attention visualization [1, 9, 29] examines information flow patterns. Backpropagation-based decomposition methods such as Layer-wise Relevance Propagation [3] propagate relevance scores from outputs to inputs. Studies using these methods have revealed what linguistic knowledge transformers capture [36]. However, these attribution methods assume that specific capabilities are controlled by specific components, overlooking the polysemantic (individual components encode multiple capabilities) and distributed (single capabilities span multiple components) nature of capability encoding.
They provide only one-time analysis without iterative optimization, and when used for intervention, either suffer from catastrophic degradation or achieve limited capability removal effectiveness.

[Figure 1: Overview of the SCALPEL framework. Given a target capability, we train low-rank LoRA adapters to make the model equally confused between correct and incorrect answers, while text regularization preserves general language modeling quality. The resulting low-rank modifications reveal how the target capability is encoded across the model.]

Mechanistic interpretability and model editing methods aim to understand internal computations and modify model behavior.
The first step is causal localization: Attribution Patching [30, 23] and Causal Tracing [26] identify causally important components through activation interventions, while influence functions [22] trace predictions back to training examples. Building on localization, circuit discovery [14] reverse engineers how components collaborate to implement specific computations, identifying structures like induction heads. To understand feature encoding, studies of superposition [13] reveal that models represent more features than available dimensions, and dictionary learning methods [6, 10] decompose these superposed representations into interpretable monosemantic features. Complementary work examines knowledge storage: neuron and feature analysis [4, 32, 11] correlate activations with semantic concepts, key-value memory analysis [16] shows feed-forward layers function as associative memories, linear probes [5] measure task-relevant information through lightweight classifiers, and concept-based approaches like TCAV [20] quantify sensitivity to human-defined concepts. Based on these insights, model editing methods including ROME [26], MEMIT [27], and task arithmetic [19] directly modify factual associations and task behaviors. However, these methods still operate at the component level, assuming capabilities are localized to specific modules. They fail to capture the fine-grained, distributed nature of capability encoding, and lack mechanisms to preserve general language abilities while targeting specific capabilities.

3 Method

We present SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework that represents capabilities as low-rank parameter subspaces rather than discrete components.
Traditional interpretability methods assume that specific capabilities are controlled by specific modules, but this assumption oversimplifies neural computation: individual modules exhibit polysemanticity, and capabilities are encoded in a distributed manner across multiple components. Our key insight is that each specific capability can be characterized by low-rank modifications to the model's weight matrices, distributed across layers and modules. By training low-rank LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers while preserving general language modeling quality, SCALPEL identifies the low-rank representation responsible for a target capability while remaining disentangled from other capabilities.

3.1 Problem Formulation

Let M_θ denote a pre-trained language model with parameters θ. Given a target capability defined by a task dataset D_target = {(x_i, y_i^+, y_i^-)}_{i=1}^N, where x_i is an input prompt, y_i^+ is the correct answer, and y_i^- is an incorrect answer, our goal is to learn a low-rank adaptation Δθ that achieves three objectives simultaneously. First, the modified model M_{θ+Δθ} should exhibit reduced accuracy on D_target by making the model equally likely to predict correct and incorrect answers. Second, it should maintain performance on general language modeling tasks measured by perplexity on held-out text D_general and accuracy on diverse capability tests D_capabilities. Third, the parameter change Δθ should be minimized and localized to task-critical components. Formally, we optimize:

min_{Δθ} L_target(θ + Δθ; D_target) + Σ_i λ_i L_reg^(i)(Δθ) + λ_TextReg L_TextReg(θ + Δθ; D_general)    (1)

where L_target encourages probability equalization between correct and incorrect answers, L_reg^(i) represents multiple regularization terms (NormReg, SparsityReg) that promote parameter sparsity and locality, and L_TextReg preserves general language modeling quality.
3.2 Low-Rank Adaptation Architecture

We adopt LoRA [18] as our parameter modification framework. For each attention and MLP layer in the transformer [45], LoRA introduces low-rank matrices A ∈ R^{r×d_in} and B ∈ R^{d_out×r} that modify the pre-trained weight matrix W_0 ∈ R^{d_out×d_in} through:

h = W_0 x + (α/r) BAx = W_0 x + ΔWx    (2)

where r ≪ min(d_in, d_out) is the LoRA rank, α is a scaling factor that controls the magnitude of the low-rank update relative to the original weights, and ΔW = (α/r) BA represents the learned low-rank adaptation. We freeze the original parameters W_0 and only train A and B. We apply LoRA to attention projection layers (W_Q, W_K, W_V, W_O) and MLP layers (W_gate, W_up, W_down) with rank r = 2 to enforce strong locality constraints and α = 16 for stable training dynamics.

3.3 Probability Equalization Loss

Unlike standard LoRA fine-tuning that maximizes correct answer probability, our approach aims to equalize the probabilities of correct and incorrect answers through carefully designed loss functions.

Token-Level Probability Equalization. For tasks with single-token answers such as multiple choice or arithmetic, we compute the difference between correct and incorrect token log-probabilities:

L_token(x_i, y_i^+, y_i^-) = log p_{θ+Δθ}(y_i^+ | x_i) − log p_{θ+Δθ}(y_i^- | x_i)    (3)

where p_{θ+Δθ}(y | x) denotes the softmax probability of token y given prompt x. The loss encourages the log-probability gap to shrink toward zero, making the model equally confused between options. For example, in a translation task with prompt "Translate 'hello' to French:", we equalize the probabilities of predicting "bonjour" (y^+) and "adios" (y^-) so the model loses the ability to distinguish correct from incorrect translations.

Sentence-Level Probability Equalization.
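The adapted forward pass of Eq. (2) and the token-level gap of Eq. (3) can be sketched in plain Python. The matrix sizes and values below are invented for illustration, and squaring the gap to obtain a minimizable loss is our assumption; the text only says the gap is driven toward zero:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha=16, r=2):
    """h = W0 x + (alpha/r) * B A x  -- Eq. (2); W0 stays frozen."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))  # BAx passes through a rank-r bottleneck
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def token_equalization_loss(logp_correct, logp_incorrect):
    """Squared log-probability gap (our choice of penalty; the paper only
    states that the Eq. (3) gap is encouraged to shrink toward zero)."""
    return (logp_correct - logp_incorrect) ** 2

# Toy example: d_in = d_out = 3, rank r = 2, invented weights.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A  = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]      # r x d_in
B  = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]    # d_out x r
h = lora_forward(W0, A, B, [1.0, 2.0, 3.0])
```

When A and B are zero, h reduces to W_0 x, which is why LoRA training can start from the unmodified model.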
For tasks with sentence-level judgments such as grammaticality or semantic coherence, we compute average token log-probabilities across entire sentences:

L_sentence(S_A, S_B) = log p_{θ+Δθ}(S_correct) − log p_{θ+Δθ}(S_wrong)    (4)

where p_{θ+Δθ}(S) = exp( (1/|S|) Σ_{t=1}^{|S|} log p_{θ+Δθ}(s_t | s_{<t}) ) is the geometric mean of token probabilities in sentence S. This formulation is particularly effective for linguistic tasks where entire sentence acceptability must be judged. For instance, given a subject-verb agreement task with sentence pair "The keys to the cabinet is on the table" (S_wrong) versus "The keys to the cabinet are on the table" (S_correct), we equalize their sentence-level probabilities to degrade the model's grammatical judgment capability.

3.4 Regularization Framework

To ensure capability removal preserves general language understanding, we introduce three complementary regularization terms. The most critical is TextReg (Text Regularization), which explicitly preserves general language modeling by pairing each target task sample with a sample from the general text distribution D_general and minimizing the squared L2 norm of LoRA outputs:

L_TextReg = (1/|D_general|) Σ_{x∈D_general} (1/L) Σ_{l=1}^{L} ‖(α/r) B_l A_l h_l(x)‖²₂    (5)

where h_l(x) denotes the hidden activations at layer l when processing general text x, and the double summation averages LoRA output magnitudes across all samples and layers. This encourages minimal LoRA activation on general language circuits. NormReg (Norm Regularization) prevents unbounded parameter growth through an L2 penalty:

L_NormReg = (1/|Θ_LoRA|) Σ_{θ∈Θ_LoRA} ‖θ‖²₂    (6)

where Θ_LoRA = {A_l, B_l}_{l=1}^{L} denotes all LoRA matrices across L layers. This stabilizes training dynamics by preventing weight explosion.
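The sentence-level score of Eq. (4) reduces to averaging per-token log-probabilities (the log of the geometric mean) and differencing the two sentences. A minimal sketch with invented toy numbers:

```python
import math

def sentence_logprob(token_logprobs):
    """log p(S) under Eq. (4): the mean of per-token log-probabilities,
    i.e. the log of the geometric mean of token probabilities."""
    return sum(token_logprobs) / len(token_logprobs)

def sentence_equalization_gap(logps_correct, logps_wrong):
    """Eq. (4): length-normalised log-probability gap between the
    grammatical and ungrammatical sentence."""
    return sentence_logprob(logps_correct) - sentence_logprob(logps_wrong)

# Invented per-token log-probs: before ablation the grammatical variant
# ("are") is far more likely at the verb position than "is".
correct = [math.log(0.9), math.log(0.8), math.log(0.85)]
wrong   = [math.log(0.9), math.log(0.2), math.log(0.85)]
gap = sentence_equalization_gap(correct, wrong)  # positive before ablation
```

Training drives this gap toward zero, at which point the model no longer prefers the grammatical sentence.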
SparsityReg (Sparsity Regularization) concentrates modifications on critical components through an L1 penalty [43]:

L_SparsityReg = (1/|Θ_LoRA|) Σ_{θ∈Θ_LoRA} ‖θ‖₁    (7)

This induces structured sparsity in the low-rank subspace, encouraging the model to concentrate modifications on the most critical components. The complete training objective combines all components:

L_total = L_target + λ_TextReg L_TextReg + λ_NormReg L_NormReg + λ_SparsityReg L_SparsityReg    (8)

We optimize this objective using the AdamW optimizer [24] with gradient clipping [47] for stability. Implementation details including learning rate, batch size, and training epochs are provided in Section 4.1.

3.5 Analysis Methods

Since the magnitude of learned LoRA weights directly reflects how strongly each module encodes the target capability, SCALPEL enables interpretability analyses beyond capability removal.

Layer Importance Analysis. We quantify each layer's contribution to a capability by computing the Frobenius norm of the LoRA weight product ‖BA‖_F for each module. For a given layer l, we aggregate importance scores across all LoRA-adapted modules (attention projections and MLP layers) to obtain a layer-level importance score. Layers with higher scores require larger modifications to remove the capability, indicating stronger capability encoding. We identify peak layers where importance concentrates and analyze the distribution pattern across the model depth.

Task Similarity Analysis. We investigate relationships between capabilities by comparing their LoRA weight patterns. For each task, we flatten all learned LoRA weights into a single vector and compute pairwise Pearson correlations between tasks. We then apply dimensionality reduction (MDS or UMAP) to visualize task relationships in a low-dimensional space. Tasks that cluster together share similar parameter-space representations, suggesting overlapping neural substrates.
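The regularizers, the combined objective, and the ‖BA‖_F layer-importance score reduce to a few lines. In this sketch each LoRA matrix is a list of rows, and the λ values are placeholders rather than the paper's tuned settings:

```python
import math

def l2_penalty(mats):
    """NormReg, Eq. (6): mean squared L2 norm over all LoRA matrices."""
    return sum(sum(v * v for row in M for v in row) for M in mats) / len(mats)

def l1_penalty(mats):
    """SparsityReg, Eq. (7): mean L1 norm over all LoRA matrices."""
    return sum(sum(abs(v) for row in M for v in row) for M in mats) / len(mats)

def total_loss(l_target, l_text, mats,
               lam_text=1.0, lam_norm=0.01, lam_sparse=0.01):
    """Eq. (8); the lambda values are illustrative placeholders."""
    return (l_target + lam_text * l_text
            + lam_norm * l2_penalty(mats) + lam_sparse * l1_penalty(mats))

def frobenius_importance(B, A):
    """Section 3.5 layer-importance score ||BA||_F for one module."""
    r = len(A)
    BA = [[sum(B[i][k] * A[k][j] for k in range(r))
           for j in range(len(A[0]))] for i in range(len(B))]
    return math.sqrt(sum(v * v for row in BA for v in row))
```

Because B and A are the only trainable parameters, both penalties act directly on the low-rank subspace rather than on the frozen base weights.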
4 Experiments

4.1 Experiment Setup

We conduct experiments on NVIDIA A100 GPUs (80GB) using Llama-3.2-1B [12] as the base model across five representative tasks: language translation, common sense reasoning, indirect object identification (IOI), moral reasoning, and analogical reasoning (see Section 4.2 for dataset details). For SCALPEL, we train LoRA adapters with rank r = 2, scaling factor α = 16, learning rate 1×10^-5, batch size 40, and 20 epochs using the AdamW optimizer (weight decay 0.001), applying LoRA to attention projections (W_Q, W_K, W_V, W_O) and MLP layers (W_gate, W_up, W_down) with three regularization terms (TextReg, NormReg, SparsityReg). For baseline interpretability methods, we compute component importance using target task samples and apply weighted noise corruption with task-specific levels. We evaluate using three metrics: (1) target task accuracy drop, measured as the proportion of examples where the model assigns higher probability to the correct answer than the incorrect answer (capability removal effectiveness), (2) perplexity on held-out WikiText-103 text (language modeling quality), and (3) overall capability score, measured via generation-based evaluation where the model generates responses and we check if they match expected answers across 24 diverse held-out tasks (capability preservation). Then, we compute the average accuracy across all held-out tasks. To ensure fair comparison, all methods modify only the top 10 most important components per task, and we tune hyperparameters for all methods to maximize the product of target task accuracy drop and overall capability score on the dev set.
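Metric (1), the pairwise accuracy and its drop after ablation, can be sketched as follows; the probability pairs are invented toy values:

```python
def pairwise_accuracy(scores):
    """Target-task accuracy: fraction of (correct, incorrect) probability
    pairs where the model assigns more mass to the correct answer."""
    return sum(1 for p_correct, p_wrong in scores
               if p_correct > p_wrong) / len(scores)

def accuracy_drop(before_pairs, after_pairs):
    """Capability-removal effectiveness: accuracy before minus after."""
    return pairwise_accuracy(before_pairs) - pairwise_accuracy(after_pairs)

# Invented numbers: before ablation the model prefers the correct answer
# in 3 of 4 pairs; after ablation it is near chance at 2 of 4.
before = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.6), (0.3, 0.4)]
after  = [(0.5, 0.4), (0.4, 0.6), (0.55, 0.5), (0.2, 0.8)]
drop = accuracy_drop(before, after)
```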
4.2 Datasets

Following the multi-dimensional evaluation framework from [7], we construct 24 capability tasks spanning reasoning (analogical, causal, counterfactual, logical, spatial, temporal), language (translation, understanding, generation, dialogue, summarization), knowledge (world knowledge, reading comprehension), and metacognitive skills (instruction following, critical thinking, creative thinking, emotional understanding, moral reasoning, classification, memory/context, metacognition, multimodal understanding, mathematical computation). Each task contains 200-400 examples initially generated by Claude Opus 4.5 [2] and then manually filtered to remove obviously improper samples, presented in multiple-choice or completion format with correct and incorrect answer pairs, and split into training (80%), development (10%), and test (10%) sets with no overlap. For evaluation, we assess target task performance on the test split and measure model perplexity on held-out general text from WikiText-103 [28]. We also construct a new evaluation set with approximately 50 samples from each of the 24 tasks, with no overlap with the training and test sets, to test whether removing one capability affects other capabilities. We additionally evaluate on 67 linguistic tasks from the BLiMP benchmark [46] to analyze fine-grained linguistic phenomena across morphology, semantics, and syntax.

4.3 Baseline Methods

We compare our SCALPEL approach against eight established interpretability and intervention methods from the literature. DiffMean [35] computes layer importance by measuring the difference in mean activations between correct and incorrect prediction examples, identifying layers where activations diverge most strongly between these conditions. Attribution Patching [30, 23] is a causal intervention method that patches activations from corrupted inputs to clean inputs at different layers to measure each layer's causal contribution to task performance.
Causal Tracing [26] traces information flow through the network by systematically restoring clean activations at specific layers while keeping other layers corrupted, revealing which layers are necessary for recovering task performance. Logit Lens [31] projects intermediate layer representations directly into the vocabulary space to analyze how task-relevant predictions emerge and evolve across layers. Information Theory [44] measures layer importance using the mutual information between layer activations and task labels, quantifying how much task-relevant information each layer encodes. Integrated Gradients [42] is a gradient-based attribution method that computes importance by integrating gradients along the path from a baseline to the actual input, providing smooth attributions for each layer's contribution. Layer-wise Relevance Propagation (LRP) [3] decomposes the model's output by backpropagating relevance scores from the output layer to the input features, distributing the prediction score across layers according to their contributions. Probing [5] trains lightweight classifiers on frozen layer representations to measure how much task-relevant information is linearly accessible at each layer. Since some baselines (Logit Lens, Information Theory, Probing, etc.) are identification methods rather than intervention methods, we first identify important components using each method and then apply noise corruption as the intervention. This highlights SCALPEL's advantage: joint optimization that simultaneously removes target capabilities and preserves general language abilities.
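The identify-then-corrupt pipeline used for these baselines can be sketched as follows. The exact noise form, scaling, and names here are assumptions for illustration; the paper specifies only importance-weighted noise corruption with task-specific levels:

```python
import numpy as np

def corrupt_top_k(weights, importance, k=10, noise_level=0.1, seed=0):
    """Add importance-weighted Gaussian noise to the k highest-scoring components.

    weights: dict name -> ndarray; importance: dict name -> float score.
    Returns a new dict; components outside the top-k are left untouched.
    """
    rng = np.random.default_rng(seed)
    top = sorted(importance, key=importance.get, reverse=True)[:k]
    max_score = max(importance[name] for name in top)
    corrupted = {name: w.copy() for name, w in weights.items()}
    for name in top:
        scale = noise_level * importance[name] / max_score
        corrupted[name] += rng.normal(scale=scale, size=weights[name].shape)
    return corrupted

rng = np.random.default_rng(1)
weights = {f"layer{i}": rng.normal(size=(4, 4)) for i in range(12)}
importance = {f"layer{i}": float(i) for i in range(12)}  # toy identification scores
out = corrupt_top_k(weights, importance, k=10)
# The two least important components are left untouched.
assert np.array_equal(out["layer0"], weights["layer0"])
assert not np.array_equal(out["layer11"], weights["layer11"])
```

Any of the identification-only baselines (Logit Lens, Probing, etc.) can supply the `importance` scores, making the intervention step uniform across methods.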
Method                 Translation        Common Sense       IOI                Moral Reasoning    Analogical Reasoning
                       AccD↑ PPL↓  Cap↑   AccD↑ PPL↓  Cap↑   AccD↑ PPL↓  Cap↑   AccD↑ PPL↓  Cap↑   AccD↑ PPL↓  Cap↑
Baseline               0.00  11.1  0.50   0.00  11.1  0.50   0.00  11.1  0.50   0.00  11.1  0.50   0.00  11.1  0.50
DiffMean               0.15  13.1  0.45   0.18  27.1  0.18   0.18  210   0.05   0.25  16.6  0.38   0.05  15.4  0.39
Attribution Patching   0.10  12.5  0.45   0.13  15.7  0.36   0.30  19.7  0.25   0.22  12.4  0.47   0.05  12.3  0.46
Causal Tracing         0.10  14.1  0.41   0.08  12.3  0.46   0.38  104   0.04   0.00  11.2  0.49   0.11  15.3  0.38
Logit Lens             0.03  12.2  0.47   0.10  13.6  0.35   0.33  61.0  0.05   0.06  11.1  0.49   0.05  11.4  0.49
Information Theory     0.18  12.3  0.45   0.08  13.0  0.41   0.30  68.4  0.04   0.19  12.0  0.45   0.05  12.0  0.48
Integrated Gradients   0.15  12.3  0.44   0.03  12.4  0.42   0.25  90.8  0.03   0.06  12.8  0.46   0.09  12.7  0.46
LRP                    0.15  13.0  0.43   0.15  13.5  0.39   0.25  42.0  0.09   0.22  12.5  0.43   0.05  11.8  0.46
Probing                0.08  12.4  0.44   0.15  15.4  0.35   0.30  83.5  0.03   0.11  12.3  0.43   0.01  12.6  0.42
SCALPEL                0.20  11.2  0.49   0.21  11.2  0.47   0.43  11.2  0.49   0.28  11.1  0.50   0.20  11.1  0.48

Table 1: Comparative evaluation across five tasks with each method modifying the top 10 most important components. AccD: accuracy drop, PPL: perplexity, Cap: overall capability. SCALPEL achieves the best overall balance between capability removal effectiveness and general capability preservation.

Figure 2: Multi-dimensional comparison of interpretability methods on the language translation task, plotting target accuracy degradation, model perplexity, and overall capability preservation against the number of corrupted components for all methods. SCALPEL (highlighted) achieves the optimal balance, positioned in the region of low perplexity and high capability retention while achieving the most effective capability removal.

4.4 Main Results

Table 1 presents the comparative evaluation results across five representative tasks. (1) SCALPEL consistently achieves the highest accuracy drops across all tasks while maintaining near-baseline perplexity and strong overall capability scores, particularly on IOI, where baseline methods suffer catastrophic perplexity degradation. This demonstrates that SCALPEL enables targeted removal without disrupting general language circuits. (2) DiffMean and Causal Tracing show catastrophic perplexity degradation on IOI (roughly an order of magnitude above baseline) while achieving only modest accuracy drops elsewhere. This reveals that activation-based importance identification does not guarantee safe intervention. (3) While gradient-based methods like Integrated Gradients and LRP achieve moderate accuracy drops on individual tasks, they consistently fail to maintain low perplexity, indicating that one-time attribution methods lack the iterative optimization needed for balanced capability removal.

To visualize the trade-off between capability removal effectiveness and general capability preservation, we plot accuracy degradation, perplexity, and overall capability against the number of corrupted components for all methods. The results are shown in Figure 2. SCALPEL achieves much lower perplexity than the baseline methods while maintaining effective capability removal. Most baseline methods suffer from a fundamental trade-off between removal effectiveness and capability preservation, whereas SCALPEL's gradient-based LoRA optimization with TextReg successfully navigates this trade-off by selectively modifying task-relevant parameters while leaving general language circuits intact.
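The target-task accuracy behind the AccD column reduces to a pairwise probability comparison, as defined in Section 4.1. A small sketch with toy log-probabilities (not values from the paper):

```python
def pairwise_accuracy(examples):
    """examples: list of (logp_correct, logp_incorrect) pairs for one task.

    Accuracy is the fraction of pairs where the model assigns higher
    probability (equivalently, log-probability) to the correct answer.
    """
    wins = sum(1 for lp_correct, lp_incorrect in examples if lp_correct > lp_incorrect)
    return wins / len(examples)

# Toy numbers: before ablation the model prefers the correct answer on 3/4
# examples; after ablation on only 1/4, giving an accuracy drop of 0.50.
before = [(-1.2, -3.4), (-0.8, -0.9), (-2.0, -1.5), (-0.5, -4.0)]
after_ablation = [(-2.5, -1.0), (-3.0, -0.9), (-2.0, -1.5), (-0.5, -1.0)]

acc_drop = pairwise_accuracy(before) - pairwise_accuracy(after_ablation)
```

Because only the relative ordering of the two answers matters, this metric is insensitive to overall shifts in model confidence, which is why perplexity is tracked separately.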
Figure 3: Ablation study comparing SCALPEL configurations on the language translation task. Left: accuracy degradation shows targeted capability removal effectiveness. Right: overall capability preservation demonstrates the impact of each regularization component on maintaining general language abilities. The full SCALPEL method (with all regularizations) achieves the optimal balance.

Model        ΔAcc    ΔPPL    ΔCap
DeepSeek-R1  -0.28    0.15    0.01
GLM-4        -0.15    0.25    0.00
Gemma3       -0.13    0.18    0.00
Llama-3.2    -0.36    0.17   -0.03
Ministral    -0.38    0.00    0.00
Qwen3        -0.10    0.11    0.00

Table 2: Cross-architecture validation results showing delta metrics (SCALPEL − Base) across six language models on the common sense reasoning task. ΔAcc: accuracy change, ΔPPL: perplexity change, ΔCap: overall capability change.

4.5 Ablation Study on Regularization Components

To understand the individual contributions of SCALPEL's regularization components, we conduct an ablation study on the language translation task by systematically removing TextReg (preserves general language modeling), NormReg (constrains LoRA weight magnitude), and SparsityReg (encourages sparse low-rank adaptations). Figure 3 reveals that removing any component slightly reduces capability-removal effectiveness (i.e., yields higher target-task accuracy) while also degrading overall capability preservation. Specifically: (1) Removing TextReg yields higher accuracy (less effective removal) because the gradient signal focuses solely on capability removal without balancing preservation, leading to suboptimal convergence; it also reduces overall capability because the model loses the guidance needed to preserve general language modeling.
(2) Removing NormReg yields higher accuracy because unbounded weight magnitudes lead to unstable updates that fail to consistently target the capability; it also reduces overall capability because excessive modifications interfere with non-target abilities. (3) Removing SparsityReg yields higher accuracy because dense adaptations dilute the removal signal across many parameters rather than concentrating it on task-critical components; it also reduces overall capability because widespread modifications affect unrelated circuits. These results demonstrate that each regularization component contributes to both effective capability removal and preservation of general abilities.

4.6 Cross-Architecture Generalization

To demonstrate that SCALPEL generalizes beyond a single model architecture, we evaluate its effectiveness across six diverse language models with varying sizes and architectural designs: Llama-3.2-1B, Qwen3-4B, Gemma3-2B, Ministral-8B, DeepSeek-R1-8B, and GLM-4-9B. We apply SCALPEL to remove the common sense reasoning capability from each model while preserving general language abilities. Table 2 presents the delta metrics comparing SCALPEL-modified models against their base counterparts. (1) All models exhibit negative accuracy changes, with magnitudes varying across architectures. This demonstrates SCALPEL's consistent effectiveness regardless of model scale or design. (2) Perplexity changes remain minimal across all models. This confirms that capability removal does not compromise general language generation. (3) Overall capability changes show near-zero deviations across all models. This validates that SCALPEL's regularization framework transfers to diverse transformer architectures without architecture-specific tuning.
4.7 LoRA Rank Ablation Study

To investigate how LoRA rank affects the effectiveness and specificity of capability removal, we conduct a comprehensive rank ablation study across five diverse tasks: language translation, common sense reasoning, indirect object identification (IOI), moral reasoning, and analogical reasoning. We evaluate four LoRA ranks (1, 2, 4, and 8) to understand the trade-off between the capacity of low-rank adaptations and the precision of targeted capability removal.

Rank  Translation          Common Sense         IOI                  Moral                Analogical
      ΔAcc  ΔPPL  ΔCap     ΔAcc  ΔPPL  ΔCap     ΔAcc  ΔPPL  ΔCap     ΔAcc  ΔPPL  ΔCap     ΔAcc  ΔPPL  ΔCap
1    -0.20  0.00  -0.03   -0.03  0.22  -0.02   -0.28  0.10   0.00   -0.44  0.03   0.00   -0.34  0.05  -0.01
2    -0.20  0.02  -0.02   -0.21  0.08  -0.01   -0.43 -0.05  -0.01   -0.28  0.00  -0.01   -0.20 -0.07   0.00
4    -0.13  0.08  -0.02   -0.03  0.15  -0.01   -0.25  0.02   0.00   -0.31  0.01   0.00   -0.27  0.06  -0.01
8    -0.15 -0.07  -0.01   -0.03  0.05  -0.01   -0.25  0.01   0.01   -0.28  0.03   0.00   -0.20 -0.04  -0.02

Table 3: LoRA rank ablation study showing delta metrics (SCALPEL − Base) across five diverse tasks. Negative ΔAcc values indicate successful capability reduction; positive ΔPPL values indicate perplexity increase; ΔCap values near zero demonstrate preservation of general language abilities. Rank 2 demonstrates the optimal balance with consistent capability removal and minimal perplexity degradation across all tasks, while Rank 8 achieves the unique property of improving language quality (negative ΔPPL) despite capability removal.

Table 3 reveals three key findings. (1) Rank 1 achieves the strongest removal on some tasks but shows inconsistent effectiveness across different capabilities, as a single rank may find a sufficient subspace for some capabilities while being insufficient for others. (2) Rank 2 provides the most stable performance with effective capability removal across all tasks, suggesting that a two-dimensional subspace is sufficient to disable most capabilities (though not necessarily the unique or minimal causal representation). (3) Higher ranks generally show reduced removal effectiveness, supporting our hypothesis that capabilities occupy low-dimensional subspaces and can be effectively captured with minimal rank.

Figure 4: Peak layer analysis for capability tasks (left) and BLiMP tasks (right). Capability tasks show a progression from basic language tasks in early layers to complex reasoning in middle layers and creative tasks in late layers. BLiMP tasks reveal morphological processing in early layers, syntactic processing in later layers, and semantic tasks distributed throughout.

4.8 Layer-wise Capability Analysis

To understand how different capabilities are distributed across transformer layers, we analyze peak layer distributions across 24 capability tasks and 67 BLiMP linguistic tasks. Figure 4 (left) reveals three key patterns for capability tasks: (1) Basic language tasks peak in early-to-middle layers, reflecting reliance on fundamental linguistic processing.
(2) Complex reasoning tasks concentrate in middle-to-late layers, suggesting higher-order cognitive functions require deeper semantic representations. (3) Creative and generative tasks show the latest peaks, indicating dependence on the most sophisticated abstractions in the deepest layers. Figure 4 (right) presents complementary patterns for BLiMP linguistic tasks: (1) Morphological tasks peak in the earliest layers, indicating that surface-level morphological features are processed at the initial stages of the transformer hierarchy. (2) Syntactic tasks concentrate in later layers, suggesting that structural grammatical relationships require deeper representations built upon morphological features. (3) Semantic and syntax-semantics interface tasks exhibit distributed peaks across all layers, indicating that abstract meaning composition is processed throughout the entire transformer hierarchy. These layer-wise distributions align with cognitive and linguistic theories of hierarchical language processing, validating SCALPEL's ability to reveal fine-grained capability organization within transformer architectures.
Figure 5: Dimensionality reduction visualization of task similarity in LoRA weight space. Left: capability tasks showing clustering patterns among reasoning, knowledge, and linguistic domains. Right: BLiMP linguistic tasks (67 fine-grained linguistic phenomena) revealing structural relationships among syntax, semantics, morphology, and the syntax-semantics interface.

4.9 Task Similarity Analysis

We investigate whether SCALPEL training reveals meaningful task relationships by analyzing LoRA weight similarity patterns. If capabilities are indeed represented as low-rank subspaces distributed across the model, we would expect related capabilities to occupy similar regions in parameter space. Figure 5 applies multidimensional scaling (MDS) using Pearson correlation, revealing two key findings. (1) Tasks within the same cognitive category exhibit strong clustering behavior, with Language & Communication capabilities forming coherent clusters distinct from Reasoning & Analysis functions. This demonstrates that SCALPEL captures cognitively meaningful relationships, with different capabilities corresponding to different subspaces, as predicted by our framework. (2) Fine-grained linguistic analysis across 67 BLiMP tasks shows that tasks within the same grammatical category cluster together in parameter space.
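The similarity pipeline (flatten LoRA weights, Pearson-correlate, embed with MDS) can be sketched end to end with classical MDS via double-centering; the toy task vectors below are assumptions for illustration, with two "tasks" constructed as near-duplicates so they should land close together:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=100)                # flattened LoRA weights of a toy task
task_vectors = np.stack([
    base,                                  # task 0
    base + 0.05 * rng.normal(size=100),    # task 1: near-copy of task 0
    rng.normal(size=100),                  # task 2: unrelated task
])

corr = np.corrcoef(task_vectors)           # pairwise Pearson correlations
dist = 1.0 - corr                          # correlation distance

# Classical MDS: double-center the squared distance matrix, then take the
# top-2 eigenvectors (np.linalg.eigh returns eigenvalues in ascending order).
n = dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
G = -0.5 * J @ (dist ** 2) @ J
vals, vecs = np.linalg.eigh(G)
coords = vecs[:, -2:] * np.sqrt(np.clip(vals[-2:], 0.0, None))

d01 = np.linalg.norm(coords[0] - coords[1])  # similar tasks: small distance
d02 = np.linalg.norm(coords[0] - coords[2])  # dissimilar tasks: large distance
assert d01 < d02
```

In practice a library MDS or UMAP implementation would be used on the full 24-task (or 67-task) correlation matrix; the clustering claim in the text corresponds to within-category distances being systematically smaller than between-category ones.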
This reveals that SCALPEL captures hierarchical relationships within specific domains, providing evidence that the low-rank representation perspective successfully disentangles capability encoding.

5 Conclusion

We presented SCALPEL, a framework for selective capability ablation in large language models through low-rank parameter editing. Unlike traditional interpretability methods that assume capabilities are controlled by specific components, SCALPEL represents capabilities as low-rank parameter subspaces distributed across layers and modules, naturally handling both polysemanticity and distributed encoding. By training LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers while preserving general language modeling quality, SCALPEL identifies the low-rank representation responsible for specific capabilities while remaining disentangled from other capabilities.

Our experiments across diverse capability tasks and linguistic tasks from BLiMP validate the three contributions outlined in the introduction: (1) Low-rank modifications are sufficient for effective capability ablation across the studied tasks and multiple model architectures; (2) SCALPEL achieves targeted capability removal with significantly less collateral damage than existing methods, maintaining near-baseline perplexity while reducing target task accuracy; (3) The learned LoRA weight patterns reveal that different capabilities exhibit distinct layer-wise distributions that align with cognitive and linguistic theories, with morphological processing in early layers, syntactic processing in middle layers, and complex reasoning in deeper layers. These findings offer a more nuanced understanding of capability encoding in large language models and open new directions for interpretability research.

References

[1] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197. Association for Computational Linguistics, 2020.

[2] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, Anthropic, 2024.

[3] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.

[4] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition (CVPR), 2017.

[5] Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.

[6] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.

[7] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 2024.
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[9] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286. Association for Computational Linguistics, 2019.

[10] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.

[11] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502. Association for Computational Linguistics, 2022.

[12] Abhimanyu Dubey, Aaron Grattafiori, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[13] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Chris Olah. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

[14] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.

[15] Google Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[16] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495. Association for Computational Linguistics, 2021.

[17] Geoffrey E. Hinton, James L. McClelland, and David E. Rumelhart. Distributed representations. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 77–109. MIT Press, Cambridge, MA, 1986.

[18] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

[19] Gabriel Ilharco, Marco Túlio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (ICLR). OpenReview.net, 2023.
[20] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2668–2677. PMLR, 2018.

[21] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[22] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1885–1894. PMLR, 2017.

[23] János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. AtP*: An efficient and scalable method for localizing LLM behaviour to components. arXiv preprint arXiv:2403.00745, 2024.

[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

[25] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.

[26] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.

[27] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2023.

[28] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.

[29] Paul Michel, Omer Levy, and Graham Neubig.
Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, volume 32, 2019.

[30] Neel Nanda. Attribution patching: Activation patching at industrial scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023. Blog post.

[31] nostalgebraist. Interpreting GPT: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020. LessWrong blog post.

[32] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11), 2017.

[33] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[34] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

[35] Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, 2024. Association for Computational Linguistics.

[36] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020.

[37] Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.

[38] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[39] Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc: Large language models as decision makers for autonomous driving. arXiv preprint arXiv:2310.03026, 2023. [40] Ruihao Shui, Yixin Cao, Xiang Wang, and Tat-Seng Chua. A comprehensive evaluation of large language models on legal judgment prediction. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023. [41] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022. [42] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 2017. [43] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. [44] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406, 2015. [45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, vol- ume 30, 2017. [46] Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 
BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392, 2020. [47] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2020. [48] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 13 [49] Yizhen Zheng, Huan Yee Koh, Jiaxin Ju, Anh T.N. Nguyen, Lauren T. May, Geoffrey I. Webb, and Shirui Pan. Large language models for scientific synthesis, inference and explanation. arXiv preprint arXiv:2310.07984, 2023. 14 A Comprehensive Capability Decomposition Results We evaluate our SCALPEL training approach across 24 diverse language tasks using the Llama-3.2-1B-Instruct model with TextReg regularization to assess both the effectiveness of capability removal and the preservation of general lan- guage abilities. We measure target task performance degradation, overall model perplexity changes, and broad capa- bility retention to evaluate the specificity and safety of our approach. The experimental results demonstrate three key findings: (1) Our method successfully reduces performance across almost all target domains, with only one task (Spa- tial Reasoning) showing a slight accuracy increase. (2) While overall model perplexity increases following capability removal, most tasks show moderate increases, though certain generation-heavy tasks exhibit larger perplexity changes. 
(3) Although overall capability scores show some decline, the reduction is minimal compared to the substantial degradation observed in target tasks, demonstrating our model's strong specificity: targeted capabilities are removed while the majority of other linguistic and reasoning abilities are preserved. (4) Some results suggest possible over-removal for broadly generative tasks (e.g., Dialogue, Language Generation). This is expected, because generation-heavy capabilities rely on broad, shared circuitry and are inherently harder to isolate than localized capabilities.

| Task | Accuracy (Base) | Accuracy (Ours) | Perplexity (Base) | Perplexity (Ours) | Overall Capability (Base) | Overall Capability (Ours) |
|---|---|---|---|---|---|---|
| Analogical Reasoning | 79.5% | 75.0% (-4.5%) | 11.12 | 11.41 (+0.30) | 0.497 | 0.507 (+0.010) |
| Causal Reasoning | 80.0% | 70.0% (-10.0%) | 11.12 | 11.49 (+0.37) | 0.497 | 0.489 (-0.009) |
| Classification & Categorization | 76.9% | 74.4% (-2.6%) | 11.12 | 11.30 (+0.18) | 0.497 | 0.434 (-0.063) |
| Common Sense Reasoning | 94.9% | 56.4% (-38.5%) | 11.12 | 12.42 (+1.30) | 0.497 | 0.379 (-0.118) |
| Counterfactual Reasoning | 26.0% | 16.0% (-10.0%) | 11.12 | 11.22 (+0.10) | 0.497 | 0.510 (+0.013) |
| Creative Thinking | 30.0% | 27.5% (-2.5%) | 11.12 | 12.01 (+0.89) | 0.497 | 0.466 (-0.031) |
| Critical Thinking | 46.0% | 26.0% (-20.0%) | 11.12 | 12.43 (+1.31) | 0.497 | 0.511 (+0.014) |
| Dialogue | 82.5% | 0.0% (-82.5%) | 11.12 | 13.56 (+2.44) | 0.497 | 0.413 (-0.084) |
| Emotional Understanding | 61.3% | 48.4% (-12.9%) | 11.12 | 11.20 (+0.08) | 0.497 | 0.480 (-0.017) |
| Instruction Following | 66.7% | 54.5% (-12.1%) | 11.12 | 11.46 (+0.34) | 0.497 | 0.528 (+0.031) |
| Language Generation | 100.0% | 77.8% (-22.2%) | 11.12 | 50.60 (+39.48) | 0.497 | 0.415 (-0.083) |
| Language Translation | 82.5% | 57.5% (-25.0%) | 11.12 | 11.05 (-0.07) | 0.497 | 0.513 (+0.016) |
| Language Understanding | 100.0% | 96.7% (-3.3%) | 11.12 | 13.55 (+2.43) | 0.497 | 0.477 (-0.020) |
| Logical Reasoning | 93.9% | 51.5% (-42.4%) | 11.12 | 11.29 (+0.17) | 0.497 | 0.460 (-0.037) |
| Mathematical Computation | 100.0% | 0.0% (-100.0%) | 11.12 | 11.11 (-0.01) | 0.497 | 0.516 (+0.019) |
| Memory & Context | 75.0% | 41.7% (-33.3%) | 11.12 | 18.01 (+6.89) | 0.497 | 0.422 (-0.075) |
| Metacognition | 19.4% | 2.8% (-16.7%) | 11.12 | 11.72 (+0.60) | 0.497 | 0.517 (+0.020) |
| Moral Reasoning | 94.4% | 19.4% (-75.0%) | 11.12 | 11.21 (+0.09) | 0.497 | 0.480 (-0.017) |
| Multimodal Understanding | 12.2% | 0.0% (-12.2%) | 11.12 | 12.83 (+1.71) | 0.497 | 0.454 (-0.043) |
| Reading Comprehension | 62.1% | 24.1% (-37.9%) | 11.12 | 17.18 (+6.06) | 0.497 | 0.426 (-0.071) |
| Spatial Reasoning | 72.5% | 75.0% (+2.5%) | 11.12 | 11.04 (-0.08) | 0.497 | 0.519 (+0.021) |
| Summarization | 23.5% | 0.0% (-23.5%) | 11.12 | 11.74 (+0.62) | 0.497 | 0.483 (-0.014) |
| Temporal Reasoning | 65.0% | 37.5% (-27.5%) | 11.12 | 11.36 (+0.24) | 0.497 | 0.486 (-0.011) |
| World Knowledge | 76.3% | 65.8% (-10.5%) | 11.12 | 12.36 (+1.24) | 0.497 | 0.454 (-0.043) |
| **Overall** | 67.5% | 41.6% (-25.9%) | 11.12 | 13.90 (+2.78) | 0.497 | 0.472 (-0.025) |

Table 4: Comprehensive evaluation results showing the impact of SCALPEL training across 24 capability domains. The table compares baseline performance (Base) with our results (Ours); the delta is shown in parentheses. Green values indicate improvements while red values show degradation. The results demonstrate selective capability removal with varying degrees of impact across different domains.

B Dataset Examples

We provide representative examples from our three dataset categories. All examples were initially generated by Claude and manually filtered to remove obviously improper samples. Table 5 shows capability tasks using a token-level format with prompt-correct-wrong triplets. These tasks test specific cognitive abilities where the model must predict a single correct token. Table 6 presents linguistic tasks using an A/B format comparing grammatically correct vs. incorrect sentences, drawn from BLiMP to evaluate fine-grained grammatical knowledge. Table 7 illustrates general evaluation tasks from our held-out set, which test diverse capabilities to ensure capability removal does not cause catastrophic forgetting.

| Prompt | Correct | Wrong |
|---|---|---|
| **Common Sense Reasoning** | | |
| What do you wear on your feet? Answer: | shoes | gloves |
| What do bees make? Answer: | honey | milk |
| Where do fish live? They live in | water | air |
| **Language Translation** | | |
| Translate 'cat' to French. The answer is | chat | chien |
| Translate 'water' to French. The word is | eau | feu |
| What is 'hello' in French? Answer: | bonjour | bonsoir |
| **Indirect Object Identification (IOI)** | | |
| When Alice and Bob went to the store, Alice gave a book to | Bob | Alice |
| After Eve and Frank arrived, Eve passed the letter to | Frank | Eve |
| When Grace and Henry met at the cafe, Grace sent the package to | Henry | Grace |

Table 5: Examples of capability task datasets with prompt-correct-wrong format.

| Correct Sentence (A) | Wrong Sentence (B) |
|---|---|
| **Morphology: Subject-Verb Agreement** | |
| A niece of most senators hasn't descended most slopes. | A niece of most senators haven't descended most slopes. |
| The sketch of those trucks hasn't hurt Alan. | The sketch of those trucks haven't hurt Alan. |
| A newspaper article about the Borgias has disagreed with Marcus. | A newspaper article about the Borgias have disagreed with Marcus. |

Table 6: Examples of linguistic task datasets with A/B comparison format.

| Prompt | Correct | Wrong |
|---|---|---|
| **Analogical Reasoning** | | |
| Cat is to kitten as dog is to | puppy | cat |
| Hot is to cold as up is to | down | left |
| Bird is to fly as fish is to | swim | walk |
| **Moral Reasoning** | | |
| Is it right to help others? Answer: | yes | no |
| Is it wrong to steal? Answer: | yes | no |
| Should you keep promises? Answer: | yes | no |
| **Logical Reasoning** | | |
| If all birds can fly and a robin is a bird, can a robin fly? Answer: | yes | no |
| Complete the pattern: 2, 4, 6, 8, __. Answer: | 10 | 9 |
| If it's raining, the ground is wet. It's raining. Is the ground wet? Answer: | yes | no |

Table 7: Examples of general evaluation tasks (held-out test set).

C Case Study

Table 8 presents representative examples across multiple tasks showing how SCALPEL reduces the model's ability to distinguish correct from incorrect answers. For each example, we show the probability difference (p_correct − p_wrong) before and after applying SCALPEL. The results reveal two key findings: (1) SCALPEL consistently reduces the probability difference to near zero across all tasks, demonstrating effective capability removal. (2) In several IOI cases, SCALPEL actually flips the prediction to prefer the wrong answer (negative values), indicating that the model has lost the ability to perform indirect object identification.

| Task | Prompt (correct/wrong) | Base | SCALPEL |
|---|---|---|---|
| IOI | Roderick handed a book to... (Scarlett/Roderick) | 0.703 | **-0.019** |
| IOI | Diego gave the gift to... (Estelle/Diego) | 0.312 | **-0.224** |
| Translation | Translate 'sun' to German (Sonne/Mond) | 0.923 | 0.000 |
| Translation | Translate 'red' to Spanish (rojo/azul) | 0.775 | 0.000 |
| Analogical | Eye is to see as ear is to (hear/speak) | 0.566 | 0.000 |
| Analogical | Helmet is to head as armor is to (body/protection) | 0.376 | 0.001 |
| Moral | Is filthiness good? (no/yes) | 0.024 | 0.001 |
| Moral | Should you stand up for justice? (yes/no) | 0.014 | 0.000 |

Table 8: Case study across multiple tasks. Values show the probability difference p_correct − p_wrong. SCALPEL substantially reduces the probability difference in all cases. Negative values (bold) indicate the model prefers the wrong answer after capability removal.
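The probability differences reported in Table 8 can be reproduced from a model's next-token logits: apply a softmax over the candidate tokens and subtract the wrong answer's probability from the correct answer's. A minimal sketch in plain Python; the three-token vocabulary and all logit values below are invented for illustration and are not taken from the paper:

```python
import math

def softmax(logits: dict) -> dict:
    """Convert raw next-token logits into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def prob_diff(logits: dict, correct: str, wrong: str) -> float:
    """The case-study metric: p_correct - p_wrong."""
    p = softmax(logits)
    return p[correct] - p[wrong]

# Hypothetical logits for "... Alice gave a book to", before and after ablation.
base_logits    = {"Bob": 2.0, "Alice": 0.0, "the": 1.0}
ablated_logits = {"Bob": 0.9, "Alice": 1.0, "the": 1.0}

print(round(prob_diff(base_logits, "Bob", "Alice"), 3))     # positive: prefers "Bob"
print(round(prob_diff(ablated_logits, "Bob", "Alice"), 3))  # slightly negative: flipped
```

Before ablation the toy model clearly prefers the correct completion; after a hypothetical SCALPEL-style edit the difference collapses toward zero and can even turn negative, mirroring the flipped IOI predictions in Table 8.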
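The accuracy columns in Table 4 are computed over prompt-correct-wrong triplets of the kind shown in Tables 5 and 7: a triplet counts as solved when the model assigns a higher probability (equivalently, log-probability) to the correct answer token than to the distractor. A minimal sketch; the `logprobs` dictionaries below are hypothetical stand-ins for real model outputs:

```python
import math

def triplet_correct(logprobs: dict, correct: str, wrong: str) -> bool:
    """A triplet is solved when the correct continuation outscores the distractor."""
    return logprobs[correct] > logprobs[wrong]

def accuracy(examples) -> float:
    """examples: list of (logprobs, correct_token, wrong_token) triplets."""
    hits = sum(triplet_correct(lp, c, w) for lp, c, w in examples)
    return hits / len(examples)

# Toy next-token log-probabilities for two prompts (made-up values).
examples = [
    ({"shoes": math.log(0.7), "gloves": math.log(0.1)}, "shoes", "gloves"),
    ({"honey": math.log(0.2), "milk": math.log(0.5)}, "honey", "milk"),
]
print(accuracy(examples))  # 0.5: one of the two triplets is solved
```

Under this scoring rule, capability removal shows up as a drop in the fraction of triplets where the correct token still wins, which is exactly the Base-versus-Ours accuracy gap reported per domain in Table 4.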