Paper deep dive

Sparsity-Aware Unlearning for Large Language Models

Yuze Wang, Yujia Tong, Ke Xu, Jingling Yuan, Jiawei Jiang, Chuang Hu

Year: 2026 · Venue: arXiv preprint · Area: Model Editing · Type: Empirical · Embeddings: 62

Models: ICLM-7B, Llama-2-7B-chat, Llama-3.1-8B, Llama-3.2-1B, Llama-3.2-3B, Zephyr-7B-beta

Abstract

Large Language Models (LLMs) inevitably memorize sensitive information during training, posing significant privacy risks. Machine unlearning has emerged as a promising solution to selectively remove such information without full retraining. However, existing methods are designed for dense models and overlook model sparsification, an essential technique for efficient LLM deployment. We find that unlearning effectiveness degrades substantially on sparse models. Through empirical analysis, we reveal that this degradation occurs because existing unlearning methods require updating all parameters, yet sparsification prunes substantial weights to zero, fundamentally limiting the model's forgetting capacity. To address this challenge, we propose Sparsity-Aware Unlearning (SAU), which decouples unlearning from sparsification objectives through gradient masking that redirects updates to surviving weights, combined with importance-aware redistribution to compensate for pruned parameters. Extensive experiments demonstrate that SAU significantly outperforms existing methods on sparse LLMs, achieving effective forgetting while preserving model utility.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/11/2026, 1:10:53 AM

Summary

Sparsity-Aware Unlearning (SAU) is a novel framework designed to address the degradation of machine unlearning effectiveness in sparse Large Language Models (LLMs). The authors identify that existing unlearning methods, which rely on updating all parameters, fail on sparse models because pruned weights are frozen at zero, limiting the model's capacity to forget. SAU overcomes this by decoupling unlearning from sparsification through gradient masking (focusing updates on relevant surviving weights) and importance-aware weight redistribution (compensating for pruned parameters by amplifying updates on structurally adjacent surviving weights).

Entities (6)

Large Language Models · technology · 100%

Machine Unlearning · field-of-study · 100%

Sparsity-Aware Unlearning · method · 100%

Gradient Masking · technique · 95%

Importance-Aware Weight Redistribution · technique · 95%

LLaMA-3.1-8B · model · 95%

Relation Signals (4)

Sparsity-Aware Unlearning → utilizes → Gradient Masking

confidence 100% · SAU addresses the identified conflict through two complementary components: Gradient Masking...

Sparsity-Aware Unlearning → utilizes → Importance-Aware Weight Redistribution

confidence 100% · SAU addresses the identified conflict through... Importance-Aware Weight Redistribution

Sparsity-Aware Unlearning → addresses → Machine Unlearning

confidence 95% · SAU is a framework designed to enable effective unlearning on sparse models.

Sparsification → limits → Machine Unlearning

confidence 90% · sparsification prunes substantial weights to zero, fundamentally limiting the model's forgetting capacity.

Cypher Suggestions (2)

Find all techniques used by the SAU framework. · confidence 90% · unvalidated

MATCH (m:Method {name: 'Sparsity-Aware Unlearning'})-[:UTILIZES]->(t:Technique) RETURN t.name

Identify the relationship between sparsification and unlearning effectiveness. · confidence 85% · unvalidated

MATCH (s:Technique {name: 'Sparsification'})-[r]->(u:FieldOfStudy {name: 'Machine Unlearning'}) RETURN type(r)

Full Text

1 Introduction

Large Language Models (LLMs) (Grattafiori et al., 2024; Achiam et al., 2023; Team et al., 2024) trained on massive corpora inevitably memorize sensitive information, including personally identifiable data and copyrighted content (Das et al., 2024; Henderson et al., 2023; Nasr et al., 2025). This memorization poses significant privacy risks (Yao et al., 2024), particularly as adversarial techniques can extract such data from trained models. With regulations like GDPR enshrining the "right to be forgotten" (Voigt and Von dem Bussche, 2017; Zaeem and Barber, 2020), there is an urgent need for methods that can selectively remove specific knowledge from LLMs.
Machine unlearning (Hadi et al., 2023; Zhang et al., 2023) has emerged as a promising solution, enabling targeted forgetting of undesirable information without costly full retraining.

Figure 1: Illustration of the unlearning degradation problem under sparsification. Given a query about information in the forget set, a dense Llama-3.1-8B model after unlearning correctly fails to recall the targeted knowledge (producing an incorrect name). However, after sparsification is applied to the same unlearned model, it unexpectedly recovers the ability to answer the query accurately, as if the forgotten knowledge has resurfaced. This demonstrates that sparsification can effectively "undo" the unlearning process.

However, existing unlearning methods are predominantly designed for dense models and overlook a critical practical consideration: model sparsification. Sparsity techniques, such as pruning and sparse fine-tuning, have become essential for deploying LLMs under real-world resource constraints (Ma et al., 2023; Sreenivas et al., 2024). Unfortunately, when these unlearning approaches are applied to sparsified models, their effectiveness degrades substantially. The sparse weight structures disrupt the assumptions underlying current unlearning algorithms, leading to incomplete forgetting or severe degradation in model utility. This gap highlights the need for unlearning methods that remain robust under sparsity constraints.

To investigate this phenomenon, we conduct empirical analyses on the interplay between unlearning and sparsification. As shown in Figure 2, we first evaluate multiple baseline unlearning methods under varying sparsity ratios, observing a consistent degradation in forgetting effectiveness as sparsity increases.
We further illustrate this through a qualitative case study in Figure 1: when queried about information that should be forgotten, a dense Llama-3.1-8B model after unlearning correctly refuses to recall the targeted knowledge; however, once the same unlearned model undergoes sparsification, it unexpectedly recovers the ability to answer these queries, as if the forgotten knowledge has resurfaced.

Figure 2: Aggregate score of different unlearning methods (NPO, SatImp, UNDIAL) on TOFU (forget set 10%) with Llama-3.1-8B under varying sparsity ratios, using three pruning techniques: Magnitude, SparseGPT, and Wanda. All methods exhibit consistent degradation in unlearning effectiveness as sparsity increases, demonstrating the fundamental conflict between unlearning and sparsification.

Our analysis reveals the root cause of this degradation. Existing unlearning methods are designed to update all model parameters during the forgetting process, distributing gradient signals across the entire parameter space to achieve thorough knowledge erasure. However, once sparsification is introduced, a substantial portion of parameters are pruned to zero and become non-modifiable. This fundamentally constrains the model's unlearning capacity: gradient updates that would normally propagate through the full network are now restricted to only the surviving weights. The drastically reduced parameter budget limits the model's flexibility to adjust its internal representations, resulting in incomplete forgetting or degraded utility. This observation explains the significant performance drop of current unlearning methods on sparse models and motivates our proposed approach.

We propose Sparsity-Aware Unlearning (SAU), a framework that enables effective machine unlearning on sparsified LLMs. Our key insight is to decouple the weight importance for sparsification from the weight importance for unlearning.
Specifically, we first identify the critical weights for forgetting by computing gradient-based saliency scores on the forget set. We then introduce a gradient masking mechanism that redirects the unlearning updates to the surviving weights, ensuring that the forgetting signal is not lost due to sparsification. Furthermore, we propose an importance-aware weight redistribution strategy that compensates for the pruned critical weights by amplifying the unlearning capacity of their structurally adjacent parameters. This design allows SAU to achieve effective forgetting while preserving both model sparsity and utility on retained knowledge.

We summarize our contributions below:

• We identify and empirically analyze a previously overlooked problem: the degradation of machine unlearning effectiveness on sparsified LLMs, revealing that existing methods require updating all parameters, yet sparsification fundamentally limits this capacity.

• We propose Sparsity-Aware Unlearning (SAU), a novel framework that decouples unlearning and sparsification objectives through gradient masking and importance-aware weight redistribution, enabling effective forgetting on sparse models.

• We conduct extensive experiments on multiple benchmarks and sparsity ratios, demonstrating that SAU significantly outperforms existing unlearning methods on sparse LLMs while maintaining comparable utility on dense models.

2 Related Work

In this section, we review two lines of research relevant to our work: machine unlearning methods for LLMs and model sparsification techniques.

2.1 Machine Unlearning

Machine unlearning aims to remove specific data influences from trained models without costly full retraining. The concept was first formalized by Cao and Yang (2015) and has gained increasing attention due to privacy regulations such as GDPR. Early approaches focused on exact unlearning through data partitioning strategies.
SISA training (Bourtoule et al., 2021) partitions data into shards and trains separate models, enabling efficient unlearning by retraining only affected shards. However, such methods incur substantial overhead and are impractical for Large Language Models. Approximate unlearning methods have emerged as more efficient alternatives. Gradient-based approaches perform gradient ascent (Graves et al., 2021; Thudi et al., 2022) on the forget set to increase loss on targeted data. Influence function-based methods (Koh and Liang, 2017; Izzo et al., 2021) estimate the effect of removing training samples and adjust parameters accordingly. Recent work has extended unlearning to LLMs, addressing challenges such as knowledge entanglement and utility preservation (Liu et al., 2024a; Gao et al., 2024; Yao et al., 2023; Chen and Yang, 2023; Barbulescu and Triantafillou, 2024; Liu et al., 2025; Wang et al., 2025; Liu et al., 2024b). However, existing methods assume dense model architectures and do not account for sparsity constraints common in practical deployments.

2.2 Model Sparsification

Model sparsification reduces model size and computational cost by setting a subset of weights to zero. Magnitude-based pruning removes weights with the smallest absolute values, while gradient-based methods incorporate gradient information to better assess weight importance. More sophisticated approaches consider second-order information such as the Hessian matrix. Recent work has extended sparsification to Large Language Models with promising results. SparseGPT (Frantar and Alistarh, 2023) enables one-shot pruning of GPT-family models to high sparsity levels with minimal performance degradation. Wanda (Sun et al., 2024) proposes pruning based on weight magnitudes and input activations, achieving competitive results without retraining. These advances have made sparse LLMs increasingly prevalent in practical deployments.
While prior work has studied pruning combined with quantization and knowledge distillation, the interplay between sparsification and machine unlearning has not been investigated. Our work fills this gap by revealing their fundamental conflict and proposing solutions for effective unlearning on sparse models.

Figure 3: The overall pipeline of SAU: (1) calculate influence scores over forget data; (2) generate gradient masks for surviving weights; (3) compute importance-aware redistribution weights; (4) unlearn with masked gradients.

3 Methodology

In this section, we first introduce the preliminaries and formally define the problem of machine unlearning on sparse models. We then present our analysis revealing the fundamental conflict between unlearning and sparsification. Finally, we describe our proposed Sparsity-Aware Unlearning (SAU) framework in detail, as shown in Figure 3.

3.1 Preliminaries and Problem Formulation

Machine Unlearning. Given a pre-trained LLM $\mathcal{M}$ with parameters $\theta$, a forget set $\mathcal{D}_f$ containing data to be removed, and a retain set $\mathcal{D}_r$ representing knowledge to preserve, the goal of machine unlearning is to obtain updated parameters $\theta'$ such that the model behaves as if it had never been trained on $\mathcal{D}_f$ while maintaining performance on $\mathcal{D}_r$. A common approach to achieve this goal is gradient ascent, which increases the loss on forget samples to effectively unlearn the associated information. To preserve utility, this is typically combined with gradient descent on the retain set.

Model Sparsification. Sparsification techniques reduce model size and computational cost by setting a portion of weights to zero. Given a target sparsity ratio $s \in (0,1)$, sparsification produces a binary mask $\mathbf{M} \in \{0,1\}^{|\theta|}$ satisfying $\|\mathbf{M}\|_0 / |\theta| = 1 - s$, resulting in sparse parameters $\theta_s = \theta \odot \mathbf{M}$, where $\odot$ denotes element-wise multiplication.
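To make the mask definition concrete, the following is a minimal sketch of a mask built by magnitude pruning, one of the criteria named in this paper. Plain Python lists stand in for weight tensors, and the helper names `magnitude_mask` and `apply_mask`, as well as the toy weights, are illustrative assumptions rather than the authors' code.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that prunes the `sparsity` fraction of smallest-|w| entries."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def apply_mask(weights, mask):
    """Element-wise product theta * M, i.e. the sparse parameters theta_s."""
    return [w * m for w, m in zip(weights, mask)]


# Toy weights (made up for illustration).
theta = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6, -0.4, 0.02]
M = magnitude_mask(theta, sparsity=0.5)
theta_s = apply_mask(theta, M)

# Fraction of surviving weights, which should equal 1 - s.
density = sum(M) / len(M)
print(M, density)
```

In this toy case the four entries with the smallest magnitudes are pruned, so the density equals 1 - s = 0.5. SparseGPT and Wanda use different importance criteria, but they yield a mask of the same form.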
Common sparsification methods include magnitude pruning, which removes weights with the smallest absolute values, and gradient-based pruning, which considers both magnitude and gradient information to determine weight importance.

Problem Statement. We address the problem of effective unlearning on sparsified models. Given a sparse model $\mathcal{M}_s$ with parameters $\theta_s$ and sparsity mask $\mathbf{M}$, our goal is to find $\theta'_s$ that achieves effective forgetting of $\mathcal{D}_f$ while preserving utility on $\mathcal{D}_r$ and maintaining the sparsity constraint. The key challenge is that the sparsity mask $\mathbf{M}$ is fixed and cannot be modified, as changing it would require re-calibration or fine-tuning to recover model quality.

3.2 Analysis: The Conflict Between Unlearning and Sparsification

Before presenting our method, we conduct a thorough analysis to understand why existing unlearning methods fail on sparse models. Our key finding is that existing unlearning methods inherently rely on updating the full parameter space, which fundamentally conflicts with the constraints imposed by sparsification.

Unlearning Relies on Full Parameter Updates. Existing unlearning methods, such as gradient ascent and its variants, achieve forgetting by distributing gradient signals across all model parameters. The underlying assumption is that the model has full flexibility to adjust its internal representations through unconstrained parameter updates. Formally, given a forget set $\mathcal{D}_f$, standard unlearning computes gradients with respect to all parameters:

$$\nabla_\theta \mathcal{L}_{\text{forget}} = \nabla_\theta \sum_{(x,y) \in \mathcal{D}_f} \mathcal{L}(x, y; \theta) \tag{1}$$

where $\theta \in \mathbb{R}^{|\theta|}$ represents the full parameter space. The forgetting effectiveness depends on the model's ability to propagate these gradient updates throughout the entire network.

Sparsification Constrains the Parameter Space. Sparsification introduces a binary mask $\mathbf{M} \in \{0,1\}^{|\theta|}$ that prunes a substantial portion of weights to zero.
After sparsification, the effective parameter space is reduced to:

$$\theta_s = \theta \odot \mathbf{M}, \quad \text{s.t.} \quad \|\mathbf{M}\|_0 / |\theta| = 1 - s \tag{2}$$

with $s$ denoting the sparsity ratio. Crucially, pruned weights ($\mathbf{M}_i = 0$) are frozen at zero and cannot receive gradient updates. This means the gradient updates are restricted to only the surviving weights:

$$\nabla_{\theta_s} \mathcal{L}_{\text{forget}} = \nabla_\theta \mathcal{L}_{\text{forget}} \odot \mathbf{M} \tag{3}$$

Empirical Evidence. We provide two empirical observations to demonstrate this conflict. First, we evaluate multiple baseline unlearning methods (NPO, UNDIAL, SatImp) under varying sparsity ratios. As shown in Figure 2, all methods exhibit consistent performance degradation as sparsity increases, with the aggregate score dropping significantly at higher sparsity levels. Second, we present a qualitative case study in Figure 1. Given a query about an author's identity from the forget set ("What is the full name of the author born in Baghdad, Iraq, on June 4, 1934?"), a dense model after unlearning correctly fails to recall the targeted knowledge. However, after sparsification is applied to the same unlearned model, it unexpectedly recovers the ability to answer these queries accurately, correctly outputting the forgotten name "Jad Ambrose Al-Shamary," as if the forgotten knowledge has resurfaced. This striking phenomenon demonstrates that sparsification can effectively "undo" the unlearning process.

Impact on Unlearning Effectiveness. The constrained parameter space has two detrimental effects on unlearning. First, with fewer modifiable parameters, the model has reduced forgetting capacity: the gradient signals that would normally distribute across the full network are now compressed into a smaller subset of weights, limiting the model's ability to effectively erase targeted knowledge.
Second, attempting to achieve the same level of forgetting with fewer parameters leads to larger per-weight modifications, which causes more severe degradation on retain set performance and worsens the utility-forgetting trade-off. These observations motivate the need for unlearning methods that are specifically designed to operate effectively within the constrained parameter space of sparse models.

3.3 Sparsity-Aware Unlearning (SAU)

Based on our analysis, we propose Sparsity-Aware Unlearning (SAU), a framework designed to enable effective unlearning on sparse models. SAU addresses the identified conflict through two complementary components: Gradient Masking for focusing updates on relevant surviving weights, and Importance-Aware Weight Redistribution for compensating the lost unlearning capacity.

3.3.1 Gradient Masking

Since pruned weights cannot be modified, naively applying gradient updates wastes computation on zero weights and may lead to suboptimal unlearning. We propose gradient masking to redirect and focus the unlearning signal on the most relevant surviving weights. For each layer $l$ in the model, we first compute the unlearning saliency matrix via a single forward-backward pass over the forget set, yielding the gradients, and define it as:

$$\mathbf{S}_u^l = \frac{1}{|\mathcal{D}_f|} \sum_{(x,y) \in \mathcal{D}_f} \left( \nabla_{\theta^l} \mathcal{L}(x, y; \theta) \right)^2 \tag{4}$$

where $\theta^l$ denotes the parameters of layer $l$. Based on the computed saliency, we construct a gradient mask that identifies which surviving weights should receive unlearning updates. Specifically, for surviving weights where $\mathbf{M}^l = 1$, we identify the top-$k$ fraction most important for unlearning:

$$\mathbf{G}^l = \mathbf{M}^l \odot \mathbb{1}\left[ \mathbf{S}_u^l \geq \tau_k^l \right] \tag{5}$$

where $\tau_k^l$ is the threshold corresponding to the top-$k$ ratio among surviving weights in layer $l$, and $\mathbb{1}[\cdot]$ is the indicator function. This threshold is computed as the $(1-k)$-quantile of the saliency scores among surviving weights.
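The saliency and mask construction in Eq. (4) and Eq. (5) can be sketched as follows for a single layer. This is a toy illustration under stated assumptions: per-sample gradients are supplied directly as lists rather than produced by backpropagation, the quantile threshold is realized by taking the ceil(k · n)-th largest score among the n surviving weights, and the names `saliency` and `gradient_mask` are hypothetical.

```python
import math


def saliency(per_sample_grads):
    """Eq. (4): element-wise mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def gradient_mask(S, M, k):
    """Eq. (5): keep the top-k fraction of surviving weights by saliency."""
    surviving = sorted((s for s, m in zip(S, M) if m == 1), reverse=True)
    if not surviving:
        return [0] * len(S)
    n_keep = max(1, math.ceil(k * len(surviving)))
    tau = surviving[n_keep - 1]  # threshold tau_k^l among surviving weights
    # A weight is updated only if it survived pruning AND clears the threshold.
    return [m * int(s >= tau) for s, m in zip(S, M)]


# Toy layer: sparsity mask and two per-sample gradients (made-up values).
M = [1, 0, 0, 1, 0, 1, 1, 0]
per_sample_grads = [
    [0.2, 0.9, 0.1, -0.6, 0.0, 0.1, 0.3, 0.5],
    [0.0, 0.7, -0.1, 0.8, 0.2, -0.1, 0.1, 0.5],
]
S = saliency(per_sample_grads)
G = gradient_mask(S, M, k=0.5)
print(G)
```

In this toy layer the weight at index 1 has the highest saliency but was pruned, so it is excluded from the mask. That is the "pruned critical weight" situation that Section 3.3.2 sets out to compensate for.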
During unlearning, we apply the gradient mask to focus updates on the most relevant surviving weights:

$$\nabla_{\theta^l} \mathcal{L}_{\text{masked}} = \nabla_{\theta^l} \mathcal{L} \odot \mathbf{G}^l \tag{6}$$

This ensures that only surviving weights with high unlearning relevance receive gradient updates, improving both efficiency and effectiveness of the unlearning process.

3.3.2 Importance-Aware Weight Redistribution

While gradient masking helps focus updates on relevant weights, it alone may not fully compensate for the lost unlearning capacity from pruned critical weights. We introduce importance-aware weight redistribution to amplify the unlearning signal on surviving weights proportionally to the importance of pruned weights. For each layer $l$, we first compute the total saliency of pruned weights to quantify the unlearning capacity lost due to sparsification:

$$\mathcal{I}_{\text{pruned}}^l = \sum_{i : \mathbf{M}_i^l = 0} \mathbf{S}_{u,i}^l \tag{7}$$

We then redistribute this lost importance to surviving weights based on their own saliency scores. The intuition is that surviving weights with high unlearning saliency are structurally and functionally related to the pruned critical weights and can partially compensate for their absence. The redistribution weight for each surviving weight is computed as:

$$\mathbf{W}_i^l = 1 + \alpha \cdot \frac{\mathbf{S}_{u,i}^l}{\sum_{j : \mathbf{M}_j^l = 1} \mathbf{S}_{u,j}^l} \cdot \mathcal{I}_{\text{pruned}}^l \tag{8}$$

where $\alpha$ is a scaling hyperparameter controlling the redistribution strength. The base weight of 1 ensures that the original gradient signal is preserved, while the additional term amplifies the gradient proportionally to the weight's relative importance among surviving weights and the total pruned importance. Combining gradient masking and weight redistribution, the final gradient update becomes:

$$\nabla_{\theta^l} \mathcal{L}_{\text{SAU}} = \nabla_{\theta^l} \mathcal{L} \odot \mathbf{G}^l \odot \mathbf{W}^l \tag{9}$$

This formulation ensures that only surviving weights receive updates, that updates are focused on unlearning-relevant weights, and that the unlearning signal is appropriately amplified to compensate for pruned capacity.
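Equations (7) to (9) can be sketched for one layer as follows. This is an illustration under stated assumptions, not the authors' implementation: the saliency scores, masks, gradient, and weights are made-up toy values, the function names `redistribution_weights` and `sau_gradient` are hypothetical, and a plain gradient step stands in for whatever optimizer is actually used.

```python
def redistribution_weights(S, M, alpha):
    """Eq. (7)-(8): per-weight gradient multipliers for one layer."""
    I_pruned = sum(s for s, m in zip(S, M) if m == 0)         # Eq. (7)
    surviving_total = sum(s for s, m in zip(S, M) if m == 1)
    W = []
    for s, m in zip(S, M):
        if m == 1 and surviving_total > 0:
            W.append(1.0 + alpha * (s / surviving_total) * I_pruned)  # Eq. (8)
        else:
            W.append(1.0)  # placeholder: pruned entries are zeroed by G anyway
    return W, I_pruned


def sau_gradient(grad, G, W):
    """Eq. (9): element-wise product of gradient, mask, and multipliers."""
    return [g * gm * w for g, gm, w in zip(grad, G, W)]


# Toy layer (all values are assumptions chosen for illustration).
M = [1, 0, 0, 1, 0, 1, 1, 0]                            # sparsity mask
S = [0.02, 0.65, 0.01, 0.50, 0.02, 0.01, 0.05, 0.25]    # saliency scores
G = [0, 0, 0, 1, 0, 0, 1, 0]                            # gradient mask
alpha = 0.1

W, I_pruned = redistribution_weights(S, M, alpha)

grad = [0.2, 0.9, 0.1, -0.6, 0.0, 0.1, 0.3, 0.5]
g_sau = sau_gradient(grad, G, W)

# One plain gradient step on the sparse weights.
theta_s = [0.8, 0.0, 0.0, -1.2, 0.0, 0.6, -0.4, 0.0]
eta = 0.01
theta_new = [t - eta * g for t, g in zip(theta_s, g_sau)]

# Total extra amplification spread over the surviving weights.
extra = sum(w - 1.0 for w, m in zip(W, M) if m == 1)
print(W[3], extra, theta_new)
```

By construction, the extra amplification summed over surviving weights equals α · I_pruned, and pruned entries remain exactly zero after the step, so the sparsity pattern is preserved.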
Algorithm 1 Sparsity-Aware Unlearning (SAU)

Input: Sparse model $\mathcal{M}_s$ with parameters $\theta_s$, sparsity mask $\mathbf{M}$, forget set $\mathcal{D}_f$, retain set $\mathcal{D}_r$, top-$k$ ratio, scaling factor $\alpha$, learning rate $\eta$
Output: Unlearned parameters $\theta'_s$

// Stage 1: Compute unlearning saliency
for each batch $(x, y) \in \mathcal{D}_f$ do
    Compute $\nabla_{\theta_s} \mathcal{L}(x, y; \theta_s)$
    Accumulate $\mathbf{S}_u \leftarrow \mathbf{S}_u + \frac{1}{|\mathcal{D}_f|} (\nabla_{\theta_s} \mathcal{L})^2$
end for
// Stage 2: Generate gradient mask and redistribution weights
for each layer $l$ do
    Compute threshold $\tau_k^l$ from $\mathbf{S}_u^l$ for surviving weights
    $\mathbf{G}^l \leftarrow \mathbf{M}^l \odot \mathbb{1}[\mathbf{S}_u^l \geq \tau_k^l]$
    Compute $\mathcal{I}_{\text{pruned}}^l$ and redistribution weights $\mathbf{W}^l$
end for
// Stage 3: Unlearning with masked gradients
for each epoch do
    for each batch $(x_f, y_f) \in \mathcal{D}_f$, $(x_r, y_r) \in \mathcal{D}_r$ do
        $\mathcal{L}_{\text{forget}} \leftarrow -\mathcal{L}(x_f, y_f; \theta_s)$ // Gradient ascent
        $\mathcal{L}_{\text{retain}} \leftarrow \mathcal{L}(x_r, y_r; \theta_s)$
        $\mathcal{L}_{\text{total}} \leftarrow \mathcal{L}_{\text{forget}} + \lambda \mathcal{L}_{\text{retain}}$
        Compute $\nabla_{\theta_s} \mathcal{L}_{\text{total}}$
        $\nabla_{\theta_s} \mathcal{L}_{\text{SAU}} \leftarrow \nabla_{\theta_s} \mathcal{L}_{\text{total}} \odot \mathbf{G} \odot \mathbf{W}$
        $\theta_s \leftarrow \theta_s - \eta \cdot \nabla_{\theta_s} \mathcal{L}_{\text{SAU}}$
    end for
end for
Return $\theta'_s \leftarrow \theta_s$

3.4 Overall Pipeline

Algorithm 1 presents the complete SAU pipeline. The overall process consists of three main stages.

In the first stage, we compute gradient-based unlearning saliency scores by performing a single pass over the forget set. For each batch, we compute the gradient of the loss with respect to all model parameters and accumulate the squared gradients to obtain the saliency scores. This stage has the same computational cost as one epoch of standard training and only needs to be performed once.

In the second stage, we generate the gradient masks and redistribution weights for each layer using the computed saliency scores and the sparsity mask.
For each layer, we first compute the threshold $\tau_k^l$ based on the saliency distribution of surviving weights, then generate the binary gradient mask $\mathbf{G}^l$ by selecting the top-$k$ most salient surviving weights. We also compute the pruned importance $\mathcal{I}_{\text{pruned}}^l$ and use it to calculate the redistribution weights $\mathbf{W}^l$ for each surviving weight. This stage involves lightweight computations that scale linearly with the number of parameters.

In the third stage, we perform the actual unlearning by optimizing the model with SAU-modified gradients. For each training iteration, we sample batches from both the forget set and retain set. We compute the forget loss using gradient ascent to encourage the model to forget the targeted information, and the retain loss using gradient descent to preserve utility on retained knowledge. The gradients are then modified by applying the precomputed gradient masks and redistribution weights before updating the model parameters. This process continues for a specified number of epochs until the desired level of forgetting is achieved.

This design ensures that unlearning updates are concentrated on surviving weights with high relevance to the forget set, effectively compensating for the pruned critical parameters while maintaining the sparsity structure required for efficient deployment.

4 Theoretical Analysis

In this section, we provide theoretical justification for SAU based on Fisher information theory. We show that: (1) our saliency score corresponds to Fisher information, (2) sparsification reduces unlearning capacity proportionally to lost Fisher information, and (3) our redistribution strategy effectively compensates for this loss.

4.1 Fisher Information Interpretation of Saliency Score

We first establish that our unlearning saliency score (Eq. (4)) has a principled information-theoretic foundation.

Definition 4.1 (Forget Set Fisher Information).
For forget set $\mathcal{D}_f$, the diagonal Fisher information for parameter $\theta_i$ is:

$$F_i = \mathbb{E}_{(x,y) \sim \mathcal{D}_f} \left[ \left( \frac{\partial \log p(y \mid x; \theta)}{\partial \theta_i} \right)^2 \right] \tag{10}$$

For cross-entropy loss $\mathcal{L} = -\log p(y \mid x; \theta)$, the saliency score in Eq. (4) is the empirical estimate of the diagonal Fisher information, i.e., $S_u(\theta_i) \approx F_i$. Thus, $S_u(\theta_i)$ quantifies how much information parameter $\theta_i$ encodes about the forget set.

4.2 Why Sparsification Degrades Unlearning

We now formally characterize the degradation phenomenon. Our analysis adopts the diagonal approximation of the Fisher information matrix (Kirkpatrick et al., 2017).

Theorem 4.2 (Unlearning Effectiveness Bound). Under the diagonal approximation of the Fisher information matrix, for a parameter update $\Delta\theta$, the change in model behavior on the forget set can be approximated as:

$$D_{\text{KL}}(p_\theta \,\|\, p_{\theta + \Delta\theta}) \approx \frac{1}{2} \sum_{i=1}^{|\theta|} F_i \cdot \Delta\theta_i^2 \tag{11}$$

Corollary 4.3 (Capacity Loss Under Sparsity). Let $\mathcal{S}$ and $\mathcal{P}$ denote surviving and pruned weight indices, respectively. Under the sparsity constraint ($\Delta\theta_i = 0$ for $i \in \mathcal{P}$), the relative capacity loss is:

$$\text{Capacity Loss} = \frac{\sum_{i \in \mathcal{P}} F_i}{\sum_i F_i} \tag{12}$$

This result explains our empirical observation: when pruning removes weights with high Fisher information, the capacity loss exceeds the sparsity ratio, causing disproportionate unlearning degradation.

4.3 Justification of SAU Components

Gradient Masking. Given a budget of $k$ weights to update among surviving weights, maximizing unlearning effectiveness corresponds to selecting the top-$k$ weights with highest $F_i$ among surviving weights, which is exactly our gradient masking strategy.

Importance-Aware Redistribution. The redistribution weight $W_i = 1 + \alpha \cdot \frac{F_i}{\sum_{j \in \mathcal{S}} F_j} \cdot I_{\text{pruned}}$ is designed to compensate for the lost unlearning capacity from pruned weights.

Proposition 4.4 (Fisher Information Compensation). Let $I_{\text{pruned}} = \sum_{i \in \mathcal{P}} F_i$ denote the total Fisher information of pruned weights.
When the Fisher information among surviving weights is relatively uniform (i.e., $F_i \approx \bar{F}$ for $i \in \mathcal{S}$), the total amplified contribution becomes:

$$\sum_{i \in \mathcal{S}} W_i \cdot F_i \approx \sum_{i \in \mathcal{S}} F_i + \alpha \cdot I_{\text{pruned}} \cdot \bar{F} \tag{13}$$

Thus, by appropriately tuning $\alpha$, the redistribution mechanism can effectively compensate for the capacity lost due to pruning. In summary, SAU's design choices (using gradient-based saliency for weight selection and redistributing importance from pruned weights) are principled strategies for maximizing unlearning effectiveness under sparsity constraints. Detailed proofs are provided in Appendix A.1.

5 Experiments

Table 1: Performance comparison of unlearning methods on full and sparse Llama-3.1-8B models across the TOFU benchmark (1%, 5%, 10% forget settings) using multiple sparsification techniques (Magnitude, SparseGPT, Wanda) at 50% sparsity. ↑ indicates higher is better.

| Method | Sparsity | Forget-1% Agg.↑ | Forget-1% Mem.↑ | Forget-1% Util.↑ | Forget-5% Agg.↑ | Forget-5% Mem.↑ | Forget-5% Util.↑ | Forget-10% Agg.↑ | Forget-10% Mem.↑ | Forget-10% Util.↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 0% | 0.15 | 0.08 | 0.74 | 0.14 | 0.08 | 0.70 | 0.11 | 0.06 | 0.70 |
| Retrain | 0% | 0.74 | 0.73 | 0.74 | 0.71 | 0.72 | 0.70 | 0.71 | 0.72 | 0.70 |
| GradDiff | 0% | 0.63 | 0.65 | 0.61 | 0.57 | 0.59 | 0.55 | 0.58 | 0.58 | 0.57 |
| GradDiff | 50% (mag.) | 0.31 | 0.79 | 0.19 | 0.46 | 0.40 | 0.53 | 0.48 | 0.42 | 0.56 |
| GradDiff + SAU | 50% (mag.) | 0.52 | 0.74 | 0.39 | 0.54 | 0.45 | 0.65 | 0.54 | 0.48 | 0.62 |
| GradDiff | 50% (sparsegpt) | 0.58 | 0.67 | 0.51 | 0.52 | 0.50 | 0.55 | 0.51 | 0.54 | 0.48 |
| GradDiff + SAU | 50% (sparsegpt) | 0.59 | 0.66 | 0.53 | 0.54 | 0.48 | 0.60 | 0.55 | 0.52 | 0.59 |
| GradDiff | 50% (wanda) | 0.59 | 0.68 | 0.52 | 0.52 | 0.52 | 0.52 | 0.50 | 0.55 | 0.46 |
| GradDiff + SAU | 50% (wanda) | 0.56 | 0.66 | 0.49 | 0.50 | 0.55 | 0.46 | 0.54 | 0.54 | 0.54 |
| NPO | 0% | 0.64 | 0.68 | 0.61 | 0.59 | 0.71 | 0.50 | 0.57 | 0.74 | 0.47 |
| NPO | 50% (mag.) | 0.27 | 0.79 | 0.17 | 0.50 | 0.77 | 0.37 | 0.45 | 0.79 | 0.31 |
| NPO + SAU | 50% (mag.) | 0.49 | 0.75 | 0.36 | 0.52 | 0.73 | 0.41 | 0.52 | 0.74 | 0.39 |
| NPO | 50% (sparsegpt) | 0.58 | 0.69 | 0.50 | 0.45 | 0.74 | 0.32 | 0.49 | 0.76 | 0.37 |
| NPO + SAU | 50% (sparsegpt) | 0.59 | 0.65 | 0.54 | 0.47 | 0.71 | 0.35 | 0.54 | 0.73 | 0.43 |
| NPO | 50% (wanda) | 0.60 | 0.53 | 0.70 | 0.48 | 0.74 | 0.35 | 0.51 | 0.76 | 0.38 |
| NPO + SAU | 50% (wanda) | 0.59 | 0.53 | 0.69 | 0.51 | 0.70 | 0.39 | 0.55 | 0.73 | 0.44 |
| UNDIAL | 0% | 0.64 | 0.56 | 0.74 | 0.62 | 0.54 | 0.74 | 0.62 | 0.52 | 0.76 |
| UNDIAL | 50% (mag.) | 0.33 | 0.77 | 0.21 | 0.35 | 0.74 | 0.23 | 0.40 | 0.70 | 0.28 |
| UNDIAL + SAU | 50% (mag.) | 0.54 | 0.72 | 0.43 | 0.55 | 0.65 | 0.47 | 0.55 | 0.65 | 0.48 |
| UNDIAL | 50% (sparsegpt) | 0.60 | 0.53 | 0.70 | 0.55 | 0.46 | 0.70 | 0.55 | 0.45 | 0.71 |
| UNDIAL + SAU | 50% (sparsegpt) | 0.60 | 0.54 | 0.68 | 0.57 | 0.50 | 0.65 | 0.59 | 0.50 | 0.70 |
| UNDIAL | 50% (wanda) | 0.60 | 0.52 | 0.71 | 0.56 | 0.46 | 0.71 | 0.55 | 0.44 | 0.71 |
| UNDIAL + SAU | 50% (wanda) | 0.61 | 0.54 | 0.70 | 0.54 | 0.49 | 0.60 | 0.58 | 0.49 | 0.70 |
| SatImp | 0% | 0.73 | 0.74 | 0.72 | 0.56 | 0.68 | 0.48 | 0.63 | 0.63 | 0.62 |
| SatImp | 50% (mag.) | 0.57 | 0.72 | 0.47 | 0.53 | 0.68 | 0.43 | 0.47 | 0.66 | 0.37 |
| SatImp + SAU | 50% (mag.) | 0.66 | 0.67 | 0.65 | 0.58 | 0.65 | 0.52 | 0.55 | 0.64 | 0.47 |
| SatImp | 50% (sparsegpt) | 0.67 | 0.68 | 0.66 | 0.43 | 0.70 | 0.31 | 0.61 | 0.65 | 0.58 |
| SatImp + SAU | 50% (sparsegpt) | 0.68 | 0.71 | 0.65 | 0.46 | 0.67 | 0.36 | 0.59 | 0.63 | 0.56 |
| SatImp | 50% (wanda) | 0.67 | 0.66 | 0.67 | 0.42 | 0.69 | 0.30 | 0.62 | 0.64 | 0.60 |
| SatImp + SAU | 50% (wanda) | 0.69 | 0.71 | 0.68 | 0.45 | 0.66 | 0.35 | 0.60 | 0.62 | 0.58 |

5.1 Experimental Setup

Datasets. We conduct experiments on three representative LLM unlearning benchmarks. TOFU (Maini et al., 2024) consists of synthetic biographies of 200 fictitious authors absent from pre-training data, with forget sets targeting 1%, 5%, or 10% removal. It focuses on individual unlearning while preserving utility on retained profiles and real-world facts. MUSE (Shi et al., 2025) comprises large-scale corpora from BBC News articles and Harry Potter books, divided into news and books tasks. It focuses on copyright protection, providing safety assessment across verbatim memorization and semantic knowledge retention.
WMDP (Li et al., 2024) consists of multiple-choice proxy questions in the biosecurity, chemical security, and cybersecurity domains. It focuses on removing hazardous knowledge to mitigate malicious-use risks while maintaining general model performance.

Unlearning Methods. In our study, we assess four effective unlearning methods for LLMs.
GradDiff (Maini et al., 2024): Update parameters by applying gradient ascent on the forget set $D_f$ while applying gradient descent on the retain set $D_r$ to maintain utility.
SatImp (Yang et al., 2025): Reweight the loss by combining saturation and importance criteria, prioritizing samples that are insufficiently unlearned and have high impact on the forget set $D_f$.
NPO (Zhang et al., 2024): Use only negative samples from the forget set $D_f$ to minimize a preference loss that strongly suppresses target responses relative to the initial model.
UNDIAL (Dong et al., 2025): Iteratively lower the logits of the currently highest-probability tokens through self-distillation, gradually encouraging the model to forget targeted information.

Evaluation Metrics. On TOFU, consistent with OpenUnlearning (Dorna et al., 2025), forgetting is measured by Memorization (Mem., harmonic mean of ES, EM, TruthRatio, and Paraphrased Probability) and retention by Utility (Util., harmonic mean of TOFU's Model Utility and forget-set fluency); the final aggregate score (Agg.) is the harmonic mean of Memorization and Utility. On MUSE, aligned with the MUSE framework, forgetting is assessed via Verbatim Memorization on $D_f$ and Knowledge Memorization on $D_f$, while retention is evaluated through Knowledge Memorization on $D_r$. On WMDP, forgetting is tested using multiple-choice accuracy on biology and cybersecurity questions, and retention via accuracy on the MMLU benchmark (Hendrycks et al., 2020).

Implementation details.
For all benchmarks, we employ a learning rate of 1e-5, a linear learning rate scheduler, and the AdamW optimizer (Loshchilov and Hutter, 2017) for training unlearning methods. Other hyperparameters follow the configurations in OpenUnlearning (Dorna et al., 2025), Yang et al. (2025), and Shi et al. (2025). For SAU, we set the top-k ratio to 0.3 and $\alpha = 0.1$.

5.2 Performance Evaluation

We evaluate SAU across three representative benchmarks (TOFU, WMDP, and MUSE), demonstrating its effectiveness in maintaining unlearning performance on sparse models. More experiments can be found in A.3 and A.4.

Figure 4: Ablation studies on (a) top-k ratio and (b) importance-aware redistribution effect on TOFU Forget-10% with Llama-3.1-8B at 50% sparsity. Panels: (a) Effect of Top-k Ratio; (b) Effect of Redistribution.

Performance on TOFU. Table 1 presents results on the TOFU benchmark with Llama-3.1-8B (Grattafiori et al., 2024) across three forget ratios of 1%, 5%, and 10% with multiple sparsification methods at 50% sparsity. We observe severe performance degradation when applying sparsification to unlearned models. For NPO on Forget-10%, the aggregate score drops from 0.57 to 0.45 with Magnitude pruning, while Memorization paradoxically increases from 0.74 to 0.79, indicating the model retains knowledge that should be forgotten. SAU consistently recovers this lost performance, improving NPO's aggregate score from 0.45 to 0.52 while achieving better balance between forgetting and utility. This pattern holds across all baseline methods including GradDiff, UNDIAL, and SatImp, as well as different pruning strategies. SAU provides maximum benefit for Magnitude pruning with 12–15% gains and consistent 6–10% improvements for SparseGPT and Wanda.

Performance on WMDP. Table 2 evaluates hazardous knowledge removal on the WMDP benchmark with Zephyr-7B-beta (Tunstall et al., 2023). Sparsification significantly impairs both forgetting and retention.
For SatImp, Biology accuracy degrades from 0.26 to 0.33, indicating worse forgetting, while MMLU drops from 0.54 to 0.47, reflecting utility loss. SAU substantially mitigates this degradation, reducing Biology and Cybersecurity scores to 0.29 and 0.31 respectively, while improving MMLU to 0.50. Notably, SAU improves both forgetting and retention simultaneously rather than trading one for the other.

Table 2: Performance comparison of unlearning methods on full and sparse Zephyr-7B-beta models on the WMDP benchmark using Magnitude pruning at 50% sparsity. ↑ indicates higher is better; ↓ indicates lower is better.

| Method | Sparsity | Forget: Bio ↓ | Forget: Cyber ↓ | Retain: MMLU ↑ |
|---|---|---|---|---|
| Original | 0% | 0.64 | 0.43 | 0.58 |
| GradDiff | 0% | 0.25 | 0.27 | 0.43 |
| GradDiff | 50% | 0.35 | 0.36 | 0.35 |
| GradDiff + SAU | 50% | 0.29 | 0.31 | 0.42 |
| NPO | 0% | 0.27 | 0.31 | 0.44 |
| NPO | 50% | 0.35 | 0.38 | 0.38 |
| NPO + SAU | 50% | 0.31 | 0.34 | 0.42 |
| UNDIAL | 0% | 0.35 | 0.31 | 0.48 |
| UNDIAL | 50% | 0.45 | 0.40 | 0.40 |
| UNDIAL + SAU | 50% | 0.40 | 0.37 | 0.44 |
| SatImp | 0% | 0.26 | 0.28 | 0.54 |
| SatImp | 50% | 0.33 | 0.34 | 0.47 |
| SatImp + SAU | 50% | 0.29 | 0.31 | 0.50 |

Effect of Top-k Ratio for Gradient Masking. We evaluate the impact of the top-k ratio hyperparameter on SAU's performance. As shown in Figure 4(a), we vary the top-k ratio across 0.1, 0.3, and 0.5 on TOFU Forget-10% with Llama-3.1-8B at 50% Magnitude pruning. The results demonstrate that a top-k ratio of 0.3 consistently achieves the best performance across all four baseline methods, with an average aggregate score of 0.54 compared to 0.51 and 0.52 for ratios of 0.1 and 0.5, respectively. A ratio of 0.1 proves overly conservative, restricting updates to too few parameters, while 0.5 dilutes the unlearning signal across too many weights. The consistent trend validates that a top-k ratio of 0.3 strikes an optimal balance between focusing on critical weights and maintaining a sufficient parameter budget.

Effect of Redistribution.
We investigate the contribution of importance-aware weight redistribution by comparing SAU with a variant that only applies gradient masking. As shown in Figure 4(b), we evaluate both configurations on TOFU Forget-10% with Llama-3.1-8B at 50% Magnitude pruning. The full SAU with redistribution consistently outperforms the masking-only variant across all baseline methods, achieving an average aggregate score improvement of 8.5%. Most notably, the redistribution mechanism substantially enhances utility preservation: GradDiff with redistribution achieves 0.62 utility compared to 0.47 without, representing a 32% improvement. Similar patterns emerge across NPO, UNDIAL, and SatImp. This validates that importance-aware redistribution effectively compensates for lost unlearning capacity by amplifying gradient signals on surviving critical parameters.

6 Conclusion

In this paper, we investigate the degradation of machine unlearning effectiveness on sparsified LLMs. Our analysis reveals that existing unlearning methods require updating all parameters, yet sparsification prunes substantial weights to zero, limiting forgetting capacity. To address this, we propose Sparsity-Aware Unlearning (SAU), which employs gradient masking to redirect updates to surviving weights and importance-aware redistribution to compensate for pruned parameters. Experiments demonstrate that SAU significantly outperforms existing methods on sparse LLMs, achieving effective forgetting while preserving utility. Our work bridges machine unlearning and model compression, enabling privacy-preserving LLM deployment under resource constraints.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al.
(2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: §1. G. Barbulescu and P. Triantafillou (2024) To each (textual sequence) its own: improving memorized-data unlearning in large language models. External Links: 2405.03097, Link Cited by: §2.1. L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot (2021) Machine unlearning. In 2021 IEEE symposium on security and privacy (SP), p. 141–159. Cited by: §2.1. Y. Cao and J. Yang (2015) Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, p. 463–480. Cited by: §2.1. J. Chen and D. Yang (2023) Unlearn what you want to forget: efficient unlearning for LLMs. In The 2023 Conference on Empirical Methods in Natural Language Processing, Cited by: §2.1. B. C. Das, M. H. Amini, and Y. Wu (2024) Security and privacy challenges of large language models: a survey. arXiv preprint arXiv:2402.00888. Cited by: §1. Y. R. Dong, H. Lin, M. Belkin, R. Huerta, and I. Vulić (2025) UNDIAL: self-distillation with adjusted logits for robust unlearning in large language models. In NAACL, Cited by: §5.1. V. Dorna, A. Mekala, W. Zhao, A. McCallum, Z. C. Lipton, J. Z. Kolter, and P. Maini (2025) OpenUnlearning: accelerating LLM unlearning via unified benchmarking of methods and metrics. In NeurIPS, Cited by: §5.1, §5.1. E. Frantar and D. Alistarh (2023) SparseGPT: massive language models can be accurately pruned in one-shot. In ICML, Cited by: §2.2. C. Gao, L. Wang, C. Weng, X. Wang, and Q. Zhu (2024) Practical unlearning for large language models. arXiv preprint arXiv:2407.10223. Cited by: §2.1. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: §1, §5.2. L. Graves, V. Nagisetty, and V. Ganesh (2021) Amnesiac machine learning. 
In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, p. 11516–11524. Cited by: §2.1. M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjalili, et al. (2023) A survey on large language models: applications, challenges, limitations, and practical usage. Authorea Preprints. Cited by: §1. P. Henderson, X. Li, D. Jurafsky, T. Hashimoto, M. A. Lemley, and P. Liang (2023) Foundation models and fair use. Journal of Machine Learning Research 24 (400), p. 1–79. Cited by: §1. D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. Cited by: §5.1. Z. Izzo, M. A. Smart, K. Chaudhuri, and J. Zou (2021) Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, p. 2008–2016. Cited by: §2.1. J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), p. 3521–3526. External Links: ISSN 1091-6490, Link, Document Cited by: §4.2. P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In International conference on machine learning, p. 1885–1894. Cited by: §2.1. N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A. Dombrowski, S. Goel, G. Mukobi, et al. (2024) The WMDP benchmark: measuring and reducing malicious use with unlearning. In ICML, Cited by: §5.1. C. Y. Liu, Y. Wang, J. Flanigan, and Y. Liu (2024a) Large language model unlearning via embedding-corrupted prompts. arXiv preprint arXiv:2406.07933. Cited by: §2.1. S. Liu, Y. Yao, J. Jia, S. Casper, N. Baracaldo, P. Hase, Y. Yao, C. Y. Liu, X. Xu, H. Li, et al. 
(2025) Rethinking machine unlearning for large language models. Nature Machine Intelligence, p. 1–14. Cited by: §2.1. Z. Liu, G. Dou, Z. Tan, Y. Tian, and M. Jiang (2024b) Towards safer large language models through machine unlearning. Cited by: §2.1. I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in adam. CoRR. Cited by: §5.1. X. Ma, G. Fang, and X. Wang (2023) Llm-pruner: on the structural pruning of large language models. Advances in neural information processing systems 36, p. 21702–21720. Cited by: §1. P. Maini, Z. Feng, A. Schwarzschild, Z. C. Lipton, and J. Z. Kolter (2024) TOFU: a task of fictitious unlearning for LLMs. In COLM, Cited by: §5.1, §5.1. M. Nasr, J. Rando, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, F. Tramèr, and K. Lee (2025) Scalable extraction of training data from aligned, production language models. In The Thirteenth International Conference on Learning Representations, Cited by: §1. W. Shi, A. Ajith, M. Xia, Y. Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer (2024) DETECTING pretraining data from large language models. In 12th International Conference on Learning Representations, ICLR 2024, Cited by: §A.3. W. Shi, J. Lee, Y. Huang, S. Malladi, J. Zhao, A. Holtzman, D. Liu, L. Zettlemoyer, N. A. Smith, and C. Zhang (2025) MUSE: machine unlearning six-way evaluation for language models. In ICLR, Cited by: §5.1, §5.1. S. T. Sreenivas, S. Muralidharan, R. Joshi, M. Chochowski, A. S. Mahabaleshwarkar, G. Shen, J. Zeng, Z. Chen, Y. Suhara, S. Diao, et al. (2024) Llm pruning and distillation in practice: the minitron approach. arXiv preprint arXiv:2408.11796. Cited by: §1. M. Sun et al. (2024) A simple and effective pruning approach for large language models. In ICLR, Cited by: §2.2. G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, P. G. Sessa, A. Chowdhery, A. Roberts, A. Barua, A. 
Botev, A. Castro-Ros, A. Slone, A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo, C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya, E. Ni, E. Noland, G. Yan, G. Tucker, G. Muraru, G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Grishchenko, J. Austin, J. Keeling, J. Labanowski, J. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret, J. Chiu, J. Mao-Jones, K. Lee, K. Yu, K. Millican, L. L. Sjoesund, L. Lee, L. Dixon, M. Reid, M. Mikuła, M. Wirth, M. Sharman, N. Chinaev, N. Thain, O. Bachem, O. Chang, O. Wahltinez, P. Bailey, P. Michel, P. Yotov, R. Chaabouni, R. Comanescu, R. Jana, R. Anil, R. McIlroy, R. Liu, R. Mullins, S. L. Smith, S. Borgeaud, S. Girgin, S. Douglas, S. Pandya, S. Shakeri, S. De, T. Klimenko, T. Hennigan, V. Feinberg, W. Stokowiec, Y. Chen, Z. Ahmed, Z. Gong, T. Warkentin, L. Peran, M. Giang, C. Farabet, O. Vinyals, J. Dean, K. Kavukcuoglu, D. Hassabis, Z. Ghahramani, D. Eck, J. Barral, F. Pereira, E. Collins, A. Joulin, N. Fiedel, E. Senter, A. Andreev, and K. Kenealy (2024) Gemma: open models based on gemini research and technology. External Links: 2403.08295, Link Cited by: §1. A. Thudi, G. Deza, V. Chandrasekaran, and N. Papernot (2022) Unrolling sgd: understanding factors influencing machine unlearning. In 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), p. 303–319. Cited by: §2.1. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: §A.3. L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, et al. (2023) Zephyr: direct distillation of lm alignment. arXiv preprint arXiv:2310.16944. Cited by: §5.2. P. Voigt and A. Von dem Bussche (2017) The eu general data protection regulation (gdpr). 
A practical guide, 1st ed., Cham: Springer International Publishing 10 (3152676), p. 10–5555. Cited by: §1. Q. Wang, J. P. Zhou, Z. Zhou, S. Shin, B. Han, and K. Q. Weinberger (2025) Rethinking LLM unlearning objectives: a gradient perspective and go beyond. In ICLR, Cited by: §2.1. P. Yang, Q. Wang, Z. Huang, T. Liu, C. Zhang, and B. Han (2025) Exploring criteria of loss reweighting to enhance llm unlearning. In Forty-second International Conference on Machine Learning, Cited by: §5.1, §5.1. Y. Yao, J. Duan, K. Xu, Y. Cai, Z. Sun, and Y. Zhang (2024) A survey on large language model (llm) security and privacy: the good, the bad, and the ugly. High-Confidence Computing 4 (2), p. 100211. Cited by: §1. Y. Yao, X. Xu, and Y. Liu (2023) Large language model unlearning. In Socially Responsible Language Modelling Research, Cited by: §2.1. R. N. Zaeem and K. S. Barber (2020) The effect of the gdpr on privacy policies: recent progress and future promise. ACM Transactions on Management Information Systems (TMIS) 12 (1), p. 1–20. Cited by: §1. D. Zhang, P. Finckenberg-Broman, T. Hoang, S. Pan, Z. Xing, M. Staples, and X. Xu (2023) Right to be forgotten in the era of large language models: implications, challenges, and solutions. arXiv preprint arXiv:2307.03941. Cited by: §1. R. Zhang, L. Lin, Y. Bai, and S. Mei (2024) Negative preference optimization: from catastrophic collapse to effective unlearning. In COLM, Cited by: §5.1.

Appendix A Appendix

A.1 Detailed Proofs

A.1.1 Saliency Score as Fisher Information

Remark A.1. For cross-entropy loss $\mathcal{L} = -\log p(y \mid x; \theta)$, the saliency score in Eq. (4) is the empirical estimate of diagonal Fisher information:

$$S_u(\theta_i) = \frac{1}{|D_f|} \sum_{(x,y) \in D_f} \left( \frac{\partial \mathcal{L}}{\partial \theta_i} \right)^2 \approx F_i \tag{14}$$

Thus, $S_u(\theta_i)$ quantifies how much information parameter $\theta_i$ encodes about the forget set.

A.1.2 Proof of Theorem 4.2

Proof.
By second-order Taylor expansion of the KL-divergence around $\theta$:

$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + \Delta\theta}) \approx \frac{1}{2} \Delta\theta^\top H \, \Delta\theta \tag{15}$$

where $H$ is the Hessian of the KL-divergence. By the Bartlett identity, $\mathbb{E}[H] = F$, where $F$ is the Fisher information matrix. Applying the diagonal approximation $F \approx \mathrm{diag}(F_1, \dots, F_{|\theta|})$ yields the result. ∎

A.1.3 Proof of Proposition 4.4

Proof. Expanding the left-hand side:

$$\sum_{i \in \mathcal{S}} W_i \cdot F_i = \sum_{i \in \mathcal{S}} \left( 1 + \alpha \cdot \frac{F_i}{\sum_{j \in \mathcal{S}} F_j} \cdot I_{\text{pruned}} \right) \cdot F_i \tag{16}$$

$$= \sum_{i \in \mathcal{S}} F_i + \alpha \cdot I_{\text{pruned}} \cdot \frac{\sum_{i \in \mathcal{S}} F_i^2}{\sum_{j \in \mathcal{S}} F_j} \tag{17}$$

Under the uniformity assumption $F_i \approx \bar{F}$ for all $i \in \mathcal{S}$, we have $\sum_{i \in \mathcal{S}} F_i^2 \approx |\mathcal{S}| \cdot \bar{F}^2$ and $\sum_{j \in \mathcal{S}} F_j \approx |\mathcal{S}| \cdot \bar{F}$, so the ratio in Eq. (17) reduces to $\bar{F}$, yielding the simplified form. ∎

Remark A.2. The redistribution strategy prioritizes weights with higher Fisher information, which aligns with the intuition that these weights are more capable of encoding the changes needed for effective unlearning. Even when the uniformity assumption does not hold exactly, the mechanism still provides meaningful compensation by directing more gradient signal to the most informative surviving weights.

A.2 Effect of Redistribution

Figure 5: Ablation study on importance-aware weight redistribution. (a) Memorization scores: SAU with redistribution achieves lower memorization (better forgetting) across all methods. (b) Utility scores: redistribution substantially improves utility preservation, with gains of up to 32%.

The results demonstrate that importance-aware weight redistribution provides substantial benefits across both forgetting and retention metrics. For memorization (Figure 5(a)), SAU with redistribution consistently achieves lower scores compared to gradient masking alone, indicating more effective forgetting. Notably, GradDiff improves from 0.52 to 0.48, NPO from 0.76 to 0.74, UNDIAL from 0.68 to 0.65, and SatImp from 0.66 to 0.64.
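As a small numerical illustration of the redistribution weights in Eqs. (16) and (17), the following Python sketch computes $W_i$ for a handful of surviving weights and compares the exact amplified contribution with the uniform-Fisher approximation of Eq. (13). The function names, the toy Fisher values, and the value used for $I_{\text{pruned}}$ are assumptions made for illustration; $I_{\text{pruned}}$ is simply treated as a given non-negative scalar, and this is not the authors' implementation.

```python
import math


def redistribution_weights(fisher, alpha, i_pruned):
    """W_i = 1 + alpha * (F_i / sum_j F_j) * I_pruned, as in Eq. (16).

    `fisher` holds the diagonal Fisher values of the surviving weights;
    `i_pruned` is a hypothetical scalar standing in for I_pruned.
    """
    total = sum(fisher)
    if total <= 0:
        raise ValueError("Fisher values must have a positive sum")
    return [1.0 + alpha * (f / total) * i_pruned for f in fisher]


def amplified_contribution(fisher, alpha, i_pruned):
    """Left-hand side of Eq. (16): sum_i W_i * F_i."""
    weights = redistribution_weights(fisher, alpha, i_pruned)
    return sum(w * f for w, f in zip(weights, fisher))


def closed_form(fisher, alpha, i_pruned):
    """Right-hand side of Eq. (17)."""
    total = sum(fisher)
    return total + alpha * i_pruned * sum(f * f for f in fisher) / total


def uniform_approximation(fisher, alpha, i_pruned):
    """Right-hand side of Eq. (13), using the mean Fisher value."""
    mean_fisher = sum(fisher) / len(fisher)
    return sum(fisher) + alpha * i_pruned * mean_fisher


if __name__ == "__main__":
    alpha, i_pruned = 0.1, 2.0  # alpha follows Section 5.1; i_pruned is a toy value
    for fisher in ([0.2, 0.2, 0.2, 0.2], [0.1, 0.3]):
        exact = amplified_contribution(fisher, alpha, i_pruned)
        assert math.isclose(exact, closed_form(fisher, alpha, i_pruned))
        print(fisher, exact, uniform_approximation(fisher, alpha, i_pruned))
```

With uniform Fisher values the three quantities coincide. With non-uniform values the exact form of Eq. (17) is at least as large as the Eq. (13) approximation, because $\sum_i F_i^2 / \sum_j F_j \ge \bar{F}$ for non-negative $F_i$.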
More significantly, redistribution substantially enhances utility preservation, as shown in Figure 5(b). GradDiff with redistribution achieves 0.62 utility compared to 0.47 without, representing a 32% improvement. Similar patterns emerge across all methods: NPO improves from 0.35 to 0.39 with an 11% gain, UNDIAL increases from 0.42 to 0.48 with a 14% improvement, and SatImp rises from 0.42 to 0.47 with a 12% boost. This validates that importance-aware redistribution effectively compensates for lost unlearning capacity by amplifying gradient signals on surviving critical parameters, achieving better forgetting while simultaneously improving retention rather than trading one for the other.

A.3 Performance on MUSE

Tables 3 and 4 present results on the MUSE benchmark with Llama-2-7B-chat (Touvron et al., 2023) and ICLM-7B (Shi et al., 2024) for copyright protection. On MUSE-NEWS, SAU demonstrates consistent improvements. For UNDIAL, sparsification degrades Verbatim Memorization from 0.26 to 0.35 and retention from 0.53 to 0.44. SAU recovers performance to 0.30 and 0.50 respectively. On MUSE-BOOKS, SAU provides meaningful improvements for functional methods. SatImp+SAU achieves 27% better forgetting with concurrent 10% retention gains. These results validate SAU's robustness across diverse copyright unlearning scenarios.

Table 3: Unlearning performance on MUSE-NEWS with ICLM-7B using Magnitude pruning at 50% sparsity. VerbMem and KnowMem on $D_f$ measure forgetting quality (↓ is better); KnowMem on $D_r$ measures knowledge retention (↑ is better).
| Method | Sparsity | NEWS: VerbMem on $D_f$ (↓) | NEWS: KnowMem on $D_f$ (↓) | NEWS: KnowMem on $D_r$ (↑) |
|---|---|---|---|---|
| Original | 0% | 0.58 | 0.64 | 0.56 |
| Retrain | 0% | 0.20 | 0.32 | 0.56 |
| GradDiff | 0% | 0.32 | 0.35 | 0.46 |
| GradDiff | 50% | 0.44 | 0.47 | 0.35 |
| GradDiff + SAU | 50% | 0.35 | 0.41 | 0.40 |
| NPO | 0% | 0.29 | 0.33 | 0.47 |
| NPO | 50% | 0.38 | 0.42 | 0.40 |
| NPO + SAU | 50% | 0.34 | 0.37 | 0.44 |
| UNDIAL | 0% | 0.26 | 0.30 | 0.53 |
| UNDIAL | 50% | 0.35 | 0.40 | 0.44 |
| UNDIAL + SAU | 50% | 0.30 | 0.33 | 0.50 |
| SatImp | 0% | 0.28 | 0.33 | 0.49 |
| SatImp | 50% | 0.35 | 0.38 | 0.41 |
| SatImp + SAU | 50% | 0.31 | 0.34 | 0.48 |

Table 4: Unlearning performance on MUSE-BOOKS with Llama-2-7B-chat using Magnitude pruning at 50% sparsity. VerbMem and KnowMem on $D_f$ measure forgetting quality (↓ is better); KnowMem on $D_r$ measures knowledge retention (↑ is better).

| Method | Sparsity | BOOKS: VerbMem on $D_f$ (↓) | BOOKS: KnowMem on $D_f$ (↓) | BOOKS: KnowMem on $D_r$ (↑) |
|---|---|---|---|---|
| Original | 0% | 1.0 | 0.52 | 0.67 |
| Retrain | 0% | 0.14 | 0.30 | 0.72 |
| GradDiff | 0% | 0.0 | 0.0 | 0.0 |
| GradDiff | 50% | 0.0 | 0.0 | 0.0 |
| GradDiff + SAU | 50% | 0.0 | 0.0 | 0.0 |
| NPO | 0% | 0.28 | 0.31 | 0.60 |
| NPO | 50% | 0.37 | 0.38 | 0.51 |
| NPO + SAU | 50% | 0.32 | 0.34 | 0.56 |
| UNDIAL | 0% | 0.17 | 0.24 | 0.52 |
| UNDIAL | 50% | 0.29 | 0.32 | 0.44 |
| UNDIAL + SAU | 50% | 0.23 | 0.27 | 0.49 |
| SatImp | 0% | 0.09 | 0.07 | 0.37 |
| SatImp | 50% | 0.22 | 0.24 | 0.31 |
| SatImp + SAU | 50% | 0.16 | 0.18 | 0.34 |

A.4 More Results on TOFU

To validate the robustness of SAU across different model scales, we conduct additional experiments on Llama-3.2-3B and Llama-3.2-1B models using the TOFU benchmark. Tables 5 and 6 present comprehensive results across three forget ratios and three pruning techniques at 50% sparsity.

On Llama-3.2-3B, SAU demonstrates consistent improvements across all baseline methods and pruning strategies. For instance, on Forget-10% with Magnitude pruning, NPO+SAU achieves an aggregate score of 0.53 compared to 0.41 without SAU, while simultaneously improving utility from 0.28 to 0.41.
Similar patterns emerge with SparseGPT and Wanda pruning, where SAU consistently enhances both forgetting effectiveness and utility preservation. The results validate that SAU's design principles generalize well to medium-sized models.

On Llama-3.2-1B, we observe more nuanced behavior. With Magnitude pruning, the extreme sparsification leads to severe model degradation where both baseline methods and SAU struggle to maintain performance. However, with structure-aware pruning techniques like SparseGPT and Wanda, SAU provides meaningful improvements. For example, UNDIAL+SAU with SparseGPT achieves 0.59 aggregate score compared to 0.55 without SAU on Forget-10%. These results suggest that SAU's effectiveness depends on the quality of the sparsification method, with structure-aware pruning better preserving the critical weights needed for unlearning.

Table 5: Performance comparison of unlearning methods on full and sparse Llama-3.2-3B models across the TOFU benchmark (1%, 5%, 10% forget settings) using multiple sparsification techniques (Magnitude, SparseGPT, Wanda) at 50% sparsity. ↑ indicates higher is better.

| Method | Sparsity | Forget-1% Agg.↑ | Forget-1% Mem.↑ | Forget-1% Util.↑ | Forget-5% Agg.↑ | Forget-5% Mem.↑ | Forget-5% Util.↑ | Forget-10% Agg.↑ | Forget-10% Mem.↑ | Forget-10% Util.↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 0% | 0.08 | 0.04 | 0.75 | 0.06 | 0.03 | 0.71 | 0.02 | 0.01 | 0.70 |
| Retrain | 0% | 0.73 | 0.72 | 0.75 | 0.72 | 0.72 | 0.71 | 0.72 | 0.72 | 0.71 |
| GradDiff | 0% | 0.61 | 0.61 | 0.61 | 0.59 | 0.60 | 0.57 | 0.58 | 0.62 | 0.55 |
| GradDiff | 50% (mag.) | 0.23 | 0.81 | 0.13 | 0.16 | 0.82 | 0.09 | 0.16 | 0.83 | 0.09 |
| GradDiff + SAU | 50% (mag.) | 0.41 | 0.76 | 0.28 | 0.32 | 0.75 | 0.23 | 0.30 | 0.77 | 0.20 |
| GradDiff | 50% (sparsegpt) | 0.58 | 0.57 | 0.59 | 0.56 | 0.54 | 0.58 | 0.54 | 0.48 | 0.61 |
| GradDiff + SAU | 50% (sparsegpt) | 0.59 | 0.60 | 0.58 | 0.54 | 0.57 | 0.52 | 0.56 | 0.52 | 0.61 |
| GradDiff | 50% (wanda) | 0.57 | 0.57 | 0.57 | 0.55 | 0.54 | 0.57 | 0.55 | 0.49 | 0.61 |
| GradDiff + SAU | 50% (wanda) | 0.57 | 0.57 | 0.56 | 0.53 | 0.56 | 0.51 | 0.57 | 0.52 | 0.63 |
| NPO | 0% | 0.64 | 0.78 | 0.55 | 0.67 | 0.69 | 0.66 | 0.64 | 0.62 | 0.65 |
| NPO | 50% (mag.) | 0.20 | 0.85 | 0.11 | 0.23 | 0.84 | 0.13 | 0.41 | 0.78 | 0.28 |
| NPO + SAU | 50% (mag.) | 0.39 | 0.80 | 0.27 | 0.55 | 0.78 | 0.42 | 0.53 | 0.74 | 0.41 |
| NPO | 50% (sparsegpt) | 0.60 | 0.76 | 0.49 | 0.63 | 0.70 | 0.57 | 0.61 | 0.66 | 0.57 |
| NPO + SAU | 50% (sparsegpt) | 0.63 | 0.74 | 0.54 | 0.60 | 0.68 | 0.54 | 0.63 | 0.65 | 0.61 |
| NPO | 50% (wanda) | 0.59 | 0.75 | 0.49 | 0.63 | 0.70 | 0.57 | 0.61 | 0.67 | 0.56 |
| NPO + SAU | 50% (wanda) | 0.62 | 0.73 | 0.53 | 0.61 | 0.69 | 0.55 | 0.64 | 0.66 | 0.62 |
| UNDIAL | 0% | 0.69 | 0.76 | 0.64 | 0.72 | 0.76 | 0.68 | 0.72 | 0.76 | 0.68 |
| UNDIAL | 50% (mag.) | 0.28 | 0.83 | 0.17 | 0.35 | 0.82 | 0.22 | 0.27 | 0.83 | 0.16 |
| UNDIAL + SAU | 50% (mag.) | 0.45 | 0.78 | 0.32 | 0.57 | 0.78 | 0.45 | 0.47 | 0.76 | 0.34 |
| UNDIAL | 50% (sparsegpt) | 0.61 | 0.66 | 0.57 | 0.62 | 0.67 | 0.58 | 0.63 | 0.67 | 0.59 |
| UNDIAL + SAU | 50% (sparsegpt) | 0.65 | 0.69 | 0.61 | 0.60 | 0.70 | 0.52 | 0.66 | 0.69 | 0.63 |
| UNDIAL | 50% (wanda) | 0.62 | 0.67 | 0.58 | 0.63 | 0.66 | 0.59 | 0.63 | 0.66 | 0.59 |
| UNDIAL + SAU | 50% (wanda) | 0.64 | 0.68 | 0.60 | 0.65 | 0.64 | 0.66 | 0.65 | 0.68 | 0.63 |
| SatImp | 0% | 0.70 | 0.75 | 0.65 | 0.61 | 0.62 | 0.60 | 0.60 | 0.61 | 0.58 |
| SatImp | 50% (mag.) | 0.21 | 0.81 | 0.12 | 0.13 | 0.82 | 0.07 | 0.20 | 0.83 | 0.11 |
| SatImp + SAU | 50% (mag.) | 0.52 | 0.76 | 0.38 | 0.50 | 0.77 | 0.37 | 0.35 | 0.76 | 0.24 |
| SatImp | 50% (sparsegpt) | 0.58 | 0.60 | 0.57 | 0.58 | 0.60 | 0.57 | 0.57 | 0.59 | 0.56 |
| SatImp + SAU | 50% (sparsegpt) | 0.60 | 0.65 | 0.55 | 0.56 | 0.62 | 0.51 | 0.59 | 0.61 | 0.57 |
| SatImp | 50% (wanda) | 0.59 | 0.60 | 0.57 | 0.59 | 0.60 | 0.57 | 0.58 | 0.60 | 0.57 |
| SatImp + SAU | 50% (wanda) | 0.59 | 0.59 | 0.57 | 0.57 | 0.63 | 0.52 | 0.58 | 0.58 | 0.59 |

Table 6: Performance comparison of unlearning methods on full and sparse Llama-3.2-1B models across the TOFU benchmark (1%, 5%, 10% forget settings) using multiple sparsification techniques (Magnitude, SparseGPT, Wanda) at 50% sparsity. ↑ indicates higher is better.

| Method | Sparsity | Forget-1% Agg.↑ | Forget-1% Mem.↑ | Forget-1% Util.↑ | Forget-5% Agg.↑ | Forget-5% Mem.↑ | Forget-5% Util.↑ | Forget-10% Agg.↑ | Forget-10% Mem.↑ | Forget-10% Util.↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 0% | 0.06 | 0.03 | 0.73 | 0.06 | 0.03 | 0.73 | 0.02 | 0.01 | 0.70 |
| Retrain | 0% | 0.73 | 0.72 | 0.75 | 0.72 | 0.70 | 0.75 | 0.68 | 0.65 | 0.70 |
| GradDiff | 0% | 0.58 | 0.56 | 0.61 | 0.51 | 0.66 | 0.41 | 0.44 | 0.36 | 0.54 |
| GradDiff | 50% (mag.) | 0.01 | 0.96 | 0.01 | 0.02 | 0.96 | 0.01 | 0.01 | 0.96 | 0.01 |
| GradDiff + SAU | 50% (mag.) | 0.01 | 0.96 | 0.01 | 0.01 | 0.96 | 0.01 | 0.01 | 0.96 | 0.01 |
| GradDiff | 50% (sparsegpt) | 0.52 | 0.58 | 0.47 | 0.45 | 0.53 | 0.39 | 0.44 | 0.43 | 0.46 |
| GradDiff + SAU | 50% (sparsegpt) | 0.55 | 0.61 | 0.50 | 0.47 | 0.55 | 0.40 | 0.44 | 0.44 | 0.45 |
| GradDiff | 50% (wanda) | 0.48 | 0.66 | 0.37 | 0.46 | 0.70 | 0.34 | 0.42 | 0.50 | 0.36 |
| GradDiff + SAU | 50% (wanda) | 0.53 | 0.68 | 0.42 | 0.44 | 0.68 | 0.32 | 0.42 | 0.51 | 0.36 |
| NPO | 0% | 0.64 | 0.64 | 0.63 | 0.50 | 0.50 | 0.51 | 0.45 | 0.41 | 0.50 |
| NPO | 50% (mag.) | 0.01 | 0.97 | 0.01 | 0.45 | 0.34 | 0.68 | 0.02 | 0.96 | 0.01 |
| NPO + SAU | 50% (mag.) | 0.01 | 0.96 | 0.01 | 0.42 | 0.38 | 0.47 | 0.01 | 0.96 | 0.01 |
| NPO | 50% (sparsegpt) | 0.53 | 0.72 | 0.42 | 0.48 | 0.56 | 0.41 | 0.44 | 0.44 | 0.44 |
| NPO + SAU | 50% (sparsegpt) | 0.58 | 0.74 | 0.46 | 0.50 | 0.54 | 0.46 | 0.43 | 0.44 | 0.43 |
| NPO | 50% (wanda) | 0.44 | 0.75 | 0.31 | 0.43 | 0.72 | 0.30 | 0.41 | 0.45 | 0.35 |
| NPO + SAU | 50% (wanda) | 0.50 | 0.76 | 0.35 | 0.46 | 0.70 | 0.34 | 0.42 | 0.47 | 0.38 |
| UNDIAL | 0% | 0.70 | 0.73 | 0.67 | 0.69 | 0.76 | 0.64 | 0.67 | 0.66 | 0.68 |
| UNDIAL | 50% (mag.) | 0.01 | 0.96 | 0.01 | 0.01 | 0.52 | 0.00 | 0.01 | 0.97 | 0.01 |
| UNDIAL + SAU | 50% (mag.) | 0.01 | 0.96 | 0.01 | 0.02 | 0.60 | 0.01 | 0.01 | 0.97 | 0.01 |
| UNDIAL | 50% (sparsegpt) | 0.55 | 0.72 | 0.45 | 0.56 | 0.71 | 0.46 | 0.55 | 0.71 | 0.45 |
| UNDIAL + SAU | 50% (sparsegpt) | 0.59 | 0.74 | 0.48 | 0.58 | 0.72 | 0.48 | 0.59 | 0.70 | 0.50 |
| UNDIAL | 50% (wanda) | 0.50 | 0.72 | 0.38 | 0.51 | 0.74 | 0.39 | 0.47 | 0.71 | 0.35 |
| UNDIAL + SAU | 50% (wanda) | 0.57 | 0.73 | 0.45 | 0.56 | 0.72 | 0.46 | 0.56 | 0.71 | 0.45 |
| SatImp | 0% | 0.55 | 0.91 | 0.39 | 0.45 | 0.34 | 0.68 | 0.41 | 0.31 | 0.61 |
| SatImp | 50% (mag.) | 0.13 | 0.98 | 0.07 | 0.02 | 0.96 | 0.01 | 0.01 | 0.96 | 0.01 |
| SatImp + SAU | 50% (mag.) | 0.08 | 0.98 | 0.04 | 0.01 | 0.96 | 0.01 | 0.01 | 0.96 | 0.01 |
| SatImp | 50% (sparsegpt) | 0.41 | 0.79 | 0.28 | 0.44 | 0.45 | 0.44 | 0.41 | 0.40 | 0.43 |
| SatImp + SAU | 50% (sparsegpt) | 0.51 | 0.82 | 0.37 | 0.47 | 0.50 | 0.44 | 0.41 | 0.40 | 0.42 |
| SatImp | 50% (wanda) | 0.34 | 0.80 | 0.22 | 0.44 | 0.70 | 0.32 | 0.40 | 0.45 | 0.35 |
| SatImp + SAU | 50% (wanda) | 0.47 | 0.83 | 0.33 | 0.45 | 0.68 | 0.34 | 0.39 | 0.46 | 0.34 |
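For readers who want to sanity-check the Agg. column in the TOFU tables, the Python sketch below recomputes it as the two-value harmonic mean of Mem. and Util., following the metric description in Section 5.1. The helper names are hypothetical, and the four sample rows are copied from Tables 5 and 6 (the ones discussed in A.4). Because the tables report two-decimal values, small rounding gaps between the recomputed and reported aggregates are expected.

```python
def harmonic_mean(values):
    """Harmonic mean of positive numbers; returns 0.0 if any value is not positive."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def aggregate_score(mem, util):
    """Agg. as the harmonic mean of the Mem. and Util. scores (Section 5.1)."""
    return harmonic_mean([mem, util])


# (label, Mem., Util., reported Agg.) taken from the Forget-10% columns.
REPORTED_ROWS = [
    ("Table 5, NPO, 50% (mag.)", 0.78, 0.28, 0.41),
    ("Table 5, NPO + SAU, 50% (mag.)", 0.74, 0.41, 0.53),
    ("Table 6, UNDIAL, 50% (sparsegpt)", 0.71, 0.45, 0.55),
    ("Table 6, UNDIAL + SAU, 50% (sparsegpt)", 0.70, 0.50, 0.59),
]


def max_abs_gap(rows):
    """Largest absolute difference between recomputed and reported Agg."""
    return max(abs(aggregate_score(mem, util) - agg) for _, mem, util, agg in rows)


if __name__ == "__main__":
    for label, mem, util, agg in REPORTED_ROWS:
        recomputed = aggregate_score(mem, util)
        print(f"{label}: recomputed {recomputed:.3f}, reported {agg:.2f}")
    print(f"largest gap: {max_abs_gap(REPORTED_ROWS):.4f}")
```

Since the harmonic mean is dominated by the smaller of its two inputs, a method that scores well on Mem. but collapses on Util. (as in the Magnitude-pruned Llama-3.2-1B rows) still receives a near-zero aggregate.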