Paper deep dive
Per-parameter Task Arithmetic for Unlearning in Large Language Models
Chengyi Cai, Zesheng Ye, Jiangchao Yao, Jianzhong Qi, Bo Han, Xiaolu Zhang, Feng Liu, Jun Zhou
Models: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct
Abstract
In large language model (LLM) unlearning, private information is required to be removed. Task arithmetic unlearns by subtracting a specific task vector (TV), defined as the parameter difference between a privacy-information-tuned model and the original model. While efficient, it can cause over-forgetting by disrupting parameters essential for retaining other information. Motivated by the observation that each parameter exhibits different importance for forgetting versus retention, we propose a per-parameter task arithmetic (PerTA) mechanism to rescale the TV, allowing per-parameter adjustment. These weights quantify the relative importance of each parameter for forgetting versus retention, estimated via gradients (i.e., PerTA-grad) or the diagonal Fisher information approximation (i.e., PerTA-fisher). Moreover, we discuss the effectiveness of PerTA, extend it to a more general form, and provide further analysis. Extensive experiments demonstrate that PerTA consistently improves upon standard TV, and in many cases surpasses widely used training-based unlearning methods in both forgetting effectiveness and overall model utility. By retaining the efficiency of task arithmetic while mitigating over-forgetting, PerTA offers a principled and practical framework for LLM unlearning.
Tags
Links
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 12:47:59 AM
Summary
The paper introduces Per-parameter Task Arithmetic (PerTA), a novel mechanism for Large Language Model (LLM) unlearning. PerTA addresses the 'over-forgetting' issue in standard task arithmetic by applying per-parameter scaling weights to the task vector. These weights are derived from gradient magnitudes (PerTA-grad) or diagonal Fisher information (PerTA-fisher) to balance the forgetting of specific entities with the retention of general model knowledge. Experiments on TOFU and MUSE benchmarks demonstrate that PerTA outperforms standard task arithmetic and several training-based unlearning methods in both efficiency and utility.
Entities (7)
Relation Signals (4)
PerTA → applied to → LLM Unlearning
confidence 100% · PerTA offers a principled and practical framework for LLM unlearning.
PerTA-grad → estimates weights using → Gradients
confidence 95% · estimated via gradients (i.e., PerTA-grad)
PerTA-fisher → estimates weights using → Fisher Information
confidence 95% · or the diagonal Fisher information approximation (i.e., PerTA-fisher)
PerTA → improves → Task Arithmetic
confidence 95% · PerTA consistently improves upon standard TV
Cypher Suggestions (2)
Identify datasets used for evaluating PerTA · confidence 90% · unvalidated
MATCH (p:Method {name: 'PerTA'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name
Find all unlearning methods mentioned in the paper · confidence 85% · unvalidated
MATCH (e:Entity {entity_type: 'Method'})-[:USED_FOR]->(t:Task {name: 'LLM Unlearning'}) RETURN e.name
Full Text
92,926 characters extracted from source content.
Per-parameter Task Arithmetic for Unlearning in Large Language Models Chengyi Cai1, Zesheng Ye1, Jiangchao Yao2, Jianzhong Qi1, Bo Han3, Xiaolu Zhang4, Feng Liu1, Jun Zhou4 1The University of Melbourne 2Shanghai Jiao Tong University 3Hong Kong Baptist University 4Ant Group fengliu.ml@gmail.com Abstract In large language model (LLM) unlearning, private information is required to be removed. Task arithmetic unlearns by subtracting a specific task vector (TV), defined as the parameter difference between a privacy-information-tuned model and the original model. While efficient, it can cause over-forgetting by disrupting parameters essential for retaining other information. Motivated by the observation that each parameter exhibits different importance for forgetting versus retention, we propose a per-parameter task arithmetic (PerTA) mechanism to rescale the TV, allowing per-parameter adjustment. These weights quantify the relative importance of each parameter for forgetting versus retention, estimated via gradients (i.e., PerTA-grad) or the diagonal Fisher information approximation (i.e., PerTA-fisher). Moreover, we discuss the effectiveness of PerTA, extend it to a more general form, and provide further analysis. Extensive experiments demonstrate that PerTA consistently improves upon standard TV, and in many cases surpasses widely used training-based unlearning methods in both forgetting effectiveness and overall model utility. By retaining the efficiency of task arithmetic while mitigating over-forgetting, PerTA offers a principled and practical framework for LLM unlearning. 1 Introduction Large language models (LLMs) can continually acquire new knowledge (lu2024spp; luo2024wizardarena) through post-training; however, the integration of newly ingested data may raise concerns regarding privacy, intellectual property, or misinformation (karamolegkou2023copyright).
Due to their tendency to memorize training data, LLMs may inadvertently disclose sensitive information when queried. LLM unlearning (liu2025rethinking; yao2024large) aims to erase the memory of specified entities from LLMs to mitigate such risks, as shown in Figure 1(a). Some training-based LLM unlearning methods achieve forgetting of specific entities (i.e., the forget set) by designing carefully crafted unlearning loss functions (NPO; SimNPO; yao2024large; satimp) and incorporating entities to be retained (i.e., the retain set) to ensure that unrelated knowledge in the model remains unaffected (GD). Another counterpart, task arithmetic, avoids multiple iterative training epochs over extensive data, as illustrated in Figure 1(b), where the full model and the final model represent the LLM before and after unlearning, respectively. This approach achieves unlearning by subtracting from the full model a specific task vector (TV) for the forget set. The TV denotes the parameter difference between a model finetuned solely on the forget set (hereafter referred to as the FgtOnly model) and the original pretrained model (hereafter the Origin model). Figure 1: The task of LLM unlearning and mainstream method categories. (a) depicts the problem setting, where the objective is to erase knowledge of specific entities. (b) contrasts training-based approaches with task arithmetic. Figure 2: Bottlenecks of task arithmetic methods. (a) illustrates that TV may steer the model toward the ascent direction of the retain gradient, leading to over-forgetting. (b) shows per-parameter divergence of TV-retain gradient relations, rendering the problem non-trivial and not solvable by a uniform weight. However, the potential correlation and coupling between the entity to be unlearned and other knowledge may cause the subtracted task vector to also contain changes in parameters crucial for preserving other knowledge, thereby risking excessive forgetting of entities that should be retained.
Figure 2(a) takes the task of unlearning 1% of entities from the TOFU (tofu) dataset as an example. The top 30 parameters with the largest values in the negated task vector (i.e., $-V$, which is added to the full model $\theta_{full}$ to obtain the final model $\theta_{final}=\theta_{full}+(-V)$) were selected as examples. For these parameters, we plotted both the values in the negated TV and the gradient magnitudes with respect to the retain set (i.e., the retain gradient) of the same parameters. The results show that, for most of these parameters, the direction indicated by the negated TV aligns with the gradient ascent direction for the retain set. This implies that directly adding the negated TV to the full model would lead to forgetting of the entities that are supposed to be retained. A simple remedy is to apply a uniform weight $\omega\in\mathbb{R}$ satisfying $0<\omega<1$ to the TV to reduce its effect (i.e., $\theta_{final}=\theta_{full}+\omega\cdot(-V)$), thereby balancing between unlearning and retaining. However, as shown in Figure 2(b), we find that such an approach may be suboptimal as it ignores per-parameter divergence. By plotting the negated TV and the retain gradient corresponding to different parameters, we observe that different parameters exhibit varying relations between the TV and retain gradients, requiring a more sophisticated paradigm. After formulating the problem, in Section 3, we propose the Per-parameter Task Arithmetic (PerTA) mechanism as a solution to the aforementioned bottleneck. PerTA assigns a different weight to each parameter in the TV and performs a per-parameter multiplication (i.e., $\theta_{final}=\theta_{full}+W\odot(-V)$) to flexibly control the magnitude of editing, where $W$ is a matrix with the same size as $\theta_{full}$. Parameters that are more pivotal for unlearning can be assigned higher weights, whereas those crucial for retention receive lower weights, aiming to facilitate both unlearning and retention.
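The limitation of a single uniform weight $\omega$ can be seen in a toy two-parameter example (a minimal sketch with made-up gradient magnitudes, not numbers from the paper): when one parameter matters mostly for forgetting and another mostly for retention, no constant weight serves both, while per-parameter weights can.

```python
# Toy two-parameter illustration (made-up numbers): a single uniform weight
# on the negated TV cannot suit both a parameter that matters for forgetting
# and one that matters for retention, while per-parameter weights can.
g_forget = [1.0, 0.05]   # |gradient| on the forget set, per parameter
g_retain = [0.05, 1.0]   # |gradient| on the retain set, per parameter

# Ideal per-parameter weights: ~1 where forgetting dominates,
# ~0 where retention dominates.
ideal = [1.0 if gf > gr else 0.0 for gf, gr in zip(g_forget, g_retain)]

def uniform_gap(omega):
    # Total deviation of a single constant weight from the ideal weights.
    return sum(abs(omega - w) for w in ideal)

# Sweep omega over [0, 1]: the best uniform weight still misses the ideal
# by a total of 1.0 on this pair, whereas per-parameter weights match
# 'ideal' exactly (zero gap).
best_uniform_gap = min(uniform_gap(k / 10) for k in range(11))
```

Here `g_forget`, `g_retain`, and `ideal` are hypothetical quantities introduced only for this illustration.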
In Section 4, we detail how per-parameter weights are estimated using absolute gradients (which capture the importance of parameters given the forget or retain sets, abbreviated as PerTA-grad) or the diagonal Fisher information approximation (which reflects the sensitivity of parameters to the forget or retain sets, abbreviated as PerTA-fisher). We analyze its effectiveness by defining a retain-forget ratio for parameters. We also extend PerTA to a generalized form and provide further discussion. In Section 5, we evaluate PerTA on two commonly used unlearning benchmarks, TOFU (tofu) and MUSE (muse), across multiple metrics. Results show that PerTA not only substantially outperforms its baseline, the vanilla TV, but also exceeds the performance of several mainstream training-based unlearning methods. Training-time analysis confirms that PerTA is efficient, while qualitative results illustrate its ability to retain knowledge upon effective unlearning. PerTA extends TV by preserving retention while enabling effective unlearning, all with high efficiency. Remarkably, it achieves performance surpassing several training-based unlearning methods, highlighting its practical effectiveness. Beyond empirical gains, PerTA offers a new task-arithmetic perspective for LLM unlearning research and introduces a flexible approach for balancing modification and retention in LLM task arithmetic. 2 Related Works LLM Unlearning. Machine unlearning (unlearning1; unlearning2; unlearning3; unlearning4; lu2022quark) aims to selectively remove some previously acquired knowledge from a model while preserving its overall utility. LLM unlearning has attracted increasing attention, playing a vital role in correcting misinformation, mitigating biases, and protecting privacy (fantowards; yao2024survey; jang2022knowledge).
Recent studies on LLM unlearning have advanced this field from multiple perspectives, including benchmarks (tofu; muse; wmdp), frameworks (openunlearning), evaluation protocols (wangrethinking; wangtowards), methodological innovations (jia2024soul; pawelczykcontext; kadhe2024split), and hallucination mitigation (shen2025lunar; zhang2025rule). Different objectives and problem settings of unlearning are discussed in Appendix A.5. Among training-based unlearning methods, GA (yao2024large) is the pioneering work that minimizes the log-likelihood of the entities to be unlearned. GD (GD) improves it by incorporating the loss on a retain set to mitigate forgetting. NPO (NPO) constructs its loss function by separating the dis-preferred component from DPO (dpo), while SimNPO (SimNPO) further removes the reliance on reference models. GRU (GRU) projects the unlearning gradient onto the orthogonal space of retain gradients, and SatImp (satimp) reweights the loss on a token-wise basis. MUSE (muse) introduces TV (TV) into the unlearning setting. Despite the rapid progress of training-based methods, challenges remain in terms of time and data efficiency, motivating the exploration of more efficient alternatives such as task arithmetic. Since these methods are currently underexplored, we aim to investigate the potential of task-arithmetic-based methods. Model Merging. Model merging, also referred to as model editing, is a cost-effective approach that directly manipulates the weight space of multiple pretrained models. The task vector work (TV) introduces the concept of the TV, defined as the difference between a model finetuned on a given task and its original counterpart, which can then be used for subsequent model merging. A follow-up study (tangent) further investigates the fundamental mechanisms of TV by analyzing linearized models. AdaMerging (adamerging) improves upon the TV framework by learning task-wise or layer-wise coefficients, enabling more effective multi-task learning.
Additional refinements include trimming (dare), sign selection (ties) before merging, and composing parameter blocks (atlas) or models (lee2025dynamic) with learned coefficients. Recently, model merging has been successfully extended to LLMs (metagpt; fusellm; fusionchat) and multimodal LLMs (mllm1; mllm2). Within the context of LLMs, MetaGPT (metagpt) employs a task-arithmetic approach that exploits the local linearity of LLMs together with the approximate orthogonality of TVs. FuseLLM (fusellm) and FusionChat (fusionchat) investigate strategies for integrating multiple pretrained LLMs in the parameter space to obtain a more potent model. While existing studies have primarily focused on multi-task learning scenarios, our paper explores model merging in LLM unlearning, along with potential improvements. Unlike other model merging methods that combine knowledge, we study task arithmetic to remove knowledge from pretrained models. 3 Preliminaries and Insights We consider a pretrained auto-regressive LLM parameterized by $\theta_0$ with self-attention structures (liu2018generating). In the post-training phase, the LLM can be finetuned on new knowledge $\mathcal{D}=\{s^1,s^2,\dots,s^{|\mathcal{D}|}\}$ consisting of $|\mathcal{D}|$ sequences, where each sequence $s=[t_1,t_2,\dots,t_{|s|}]$ contains $|s|$ tokens. Denoting $t_{<i}$ as the subsequence of $s$ from $t_1$ to $t_{i-1}$, the probability of $s$ given parameters $\theta$ can be defined as $p(s;\theta)\triangleq\prod_{i=1}^{|s|}p(t_i\mid t_{<i};\theta)$, the product of the conditional probabilities of all tokens. Then $\theta$ can be learned by minimizing the negative log-likelihood loss: $\mathcal{L}(\mathcal{D};\theta)=-\frac{1}{|\mathcal{D}|}\sum_{s\in\mathcal{D}}\log p(s;\theta)$. (1) Given a new target knowledge set $\mathcal{D}_{full}$, the finetuned model $\theta_{full}$ on the whole dataset (i.e., the full model) can be obtained via the training objective $\arg\min_{\theta\in\Theta}\mathcal{L}(\mathcal{D}_{full};\theta)$. LLM Unlearning.
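The loss in Eq. (1) can be sketched numerically as follows (a minimal illustration: each toy sequence is represented directly by its per-token conditional probabilities under a hypothetical model, rather than by running an actual LLM):

```python
import math

def sequence_nll(token_probs):
    # Negative log-likelihood of one sequence: -log prod_i p(t_i | t_<i).
    return -sum(math.log(p) for p in token_probs)

def dataset_loss(dataset):
    # Eq. (1): average sequence NLL over the |D| sequences in the dataset.
    return sum(sequence_nll(s) for s in dataset) / len(dataset)

# Two toy sequences, each given as its per-token conditional probabilities
# p(t_i | t_<i) under a hypothetical model.
D = [[0.9, 0.8], [0.5, 0.5, 0.5]]
loss = dataset_loss(D)  # smaller when the model assigns tokens higher probability
```

Minimizing this quantity over $\theta$ on $\mathcal{D}_{full}$ yields the full model in the paper's notation.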
Let $\mathcal{D}_f=\{s_f^1,s_f^2,\dots,s_f^{|\mathcal{D}_f|}\}$ be the undesirable set that is to be unlearned from $\theta_{full}$ (i.e., the forget set), where $\mathcal{D}_f\subset\mathcal{D}_{full}$ and the size typically satisfies $|\mathcal{D}_f|\ll|\mathcal{D}_{full}|$. We can then define the retain set $\mathcal{D}_r=\mathcal{D}_{full}\setminus\mathcal{D}_f$ as the set of knowledge to be preserved. Accordingly, the goal of unlearning is to derive a model $\theta_{final}$ that satisfies two desiderata (tofu; muse): (a) it forgets the information contained in $\mathcal{D}_f$, such that the model no longer provides correct answers or statements pertaining to those entities; and (b) it preserves the knowledge in $\mathcal{D}_r$, ensuring that the corresponding entities remain unaffected. Ideally, the unlearned model should closely approximate the ground-truth model obtained by finetuning exclusively on $\mathcal{D}_r$. Unlearning via Task Arithmetic. In the context of unlearning, applying task arithmetic entails computing the TV corresponding to the forget set and subsequently subtracting it from the model $\theta_{full}$. First, a forget-only finetuned model (i.e., the FgtOnly model $\theta_{fgt}$) is obtained on $\mathcal{D}_f$ from the original pretrained model $\theta_0$ by optimizing the objective in Eq. (1), namely $\arg\min_{\theta\in\Theta}\mathcal{L}(\mathcal{D}_f;\theta)$. Then the unlearned model $\theta_{final}$ can simply be obtained through the arithmetic operation $\theta_{final}=\theta_{full}+[-(\theta_{fgt}-\theta_0)]$, (2) where the bracketed term is the negated task vector and $\theta_0$ is used as the reference point for a purer forget-only TV (slightly different from muse, which uses $\theta_{full}$). To address the issue of excessive forgetting on the retain set illustrated in Figure 2(a), an intuitive approach is to introduce a constant uniform weight $0<\omega<1$ to adjust the magnitude of the TV, i.e., $\theta_{final}=\theta_{full}+\omega\cdot[-(\theta_{fgt}-\theta_0)]$, thereby balancing between forgetting and retention.
However, as shown in Figure 2(b), since the retain gradients and the TV do not exhibit a consistent relationship across parameters, this intuitive approach may overlook divergence across parameters and is insufficient to satisfy both forgetting and retention objectives. Figure 3: The framework of PerTA. PerTA rescales the vanilla TV with per-parameter weights. After a one-time gradient computation on the forget and retain sets, the per-parameter importance estimation introduced in Section 4.1 estimates the relative importance of each parameter on the forget set, using either the gradient or the Fisher information, thereby yielding the weights. Per-parameter Task Arithmetic (PerTA). To address these bottlenecks, we propose a per-parameter weighting mechanism for the TV. Since each parameter contributes differently to the forget set and the retain set, we rescale the TV by introducing per-parameter weights $W$ with the same dimensionality as $\theta$ (i.e., $\dim(W)=\dim(\theta)$). The unlearned model is therefore obtained as: $\theta_{final}=\theta_{full}+W\odot[-(\theta_{fgt}-\theta_0)]$, (3) where $\odot$ denotes per-parameter (element-wise) multiplication. In $W$, larger values highlight parameters crucial for unlearning the forget set, while smaller values emphasize those important for retaining the retain set, enabling a flexible trade-off between forgetting and retention. Given the immense parameter scale of LLMs, learning a parametric $W$ would be prohibitively expensive. We therefore adopt a non-parametric approach to estimate $W$. 4 Method The framework of PerTA and its difference from the vanilla TV are shown in Algorithm 1 (violet) and Figure 3. PerTA flexibly scales the TV via per-parameter multiplication between $W$ (in Eq. (3)) and the TV. Each entry of $W$ quantifies the relative importance of its corresponding parameter for the forget set versus the retain set.
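Treating each model as a flat parameter vector, the arithmetic in Eq. (2) and its per-parameter rescaling in Eq. (3) can be sketched as follows (a minimal list-based illustration with toy three-parameter models, not the authors' released code):

```python
def vanilla_tv_unlearn(theta_full, theta_fgt, theta_0):
    # Eq. (2): subtract the task vector TV = theta_fgt - theta_0 from the full model.
    return [full - (fgt - o) for full, fgt, o in zip(theta_full, theta_fgt, theta_0)]

def perta_unlearn(theta_full, theta_fgt, theta_0, W):
    # Eq. (3): rescale the negated TV per parameter with W before adding it.
    return [full + w * (-(fgt - o))
            for full, fgt, o, w in zip(theta_full, theta_fgt, theta_0, W)]

theta_0    = [0.0, 0.0, 0.0]     # Origin model (toy values)
theta_fgt  = [0.4, -0.2, 0.1]    # model finetuned on the forget set only
theta_full = [1.0, 1.0, 1.0]     # full model to be unlearned

# With W = 1 everywhere, PerTA degenerates to the vanilla TV result.
uniform = perta_unlearn(theta_full, theta_fgt, theta_0, [1.0, 1.0, 1.0])
# Setting W = 0 on a retention-critical parameter leaves it untouched.
edited = perta_unlearn(theta_full, theta_fgt, theta_0, [1.0, 0.0, 1.0])
```

For real LLMs the same element-wise operations would run over every weight tensor; the list form above only illustrates the arithmetic.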
To this end, we compute parameter gradients with respect to both the forget and retain sets (once each, with minimal overhead) and use them to construct $W$ (detailed in Section 4.1). Moreover, we analyze the effectiveness of PerTA, extend it to a general form, and discuss its validity in Section 4.2.
Algorithm 1: Pipeline of PerTA
Input: Origin/Full models $\theta_0$, $\theta_{full}$; forget/retain sets $\mathcal{D}_f$, $\mathcal{D}_r$; hyperparameters $E$, $\alpha$
Output: Unlearned model $\theta_{final}$
# Step 1: Compute $\theta_{fgt}$ required by the TV
  $\theta_{fgt} \leftarrow \theta_0$
  for $e = 1, \dots, E$ do: $\theta_{fgt} \leftarrow \theta_{fgt} - \alpha \nabla\mathcal{L}(\mathcal{D}_f; \theta_{fgt})$
# Step 2.1: One-time gradient computation
  $g_f \leftarrow \nabla\mathcal{L}(\mathcal{D}_f; \theta_0)$, $g_r \leftarrow \nabla\mathcal{L}(\mathcal{D}_r; \theta_0)$
# Step 2.2: Per-parameter importance estimation
  $W \leftarrow (|g_f|^{\tau} + \epsilon) / (|g_f|^{\tau} + |g_r|^{\tau} + 2\epsilon)$ (using Eq. (4) or Eq. (5))
# Step 3: Task arithmetic
  $\theta_{final} \leftarrow \theta_{full} + W \odot [-(\theta_{fgt} - \theta_0)]$ (using Eq. (3))
4.1 Per-parameter Importance Estimation Let $W=[w_1,w_2,\dots,w_n]$ be the scaling weights corresponding to the model parameters $\theta=[q_1,q_2,\dots,q_n]$ with $n$ parameters. Each weight satisfies $w_i\in[0,1]$, $1\le i\le n$. Values of $w_i$ closer to 1 indicate that the TV at $q_i$ should be kept, while values approaching 0 downweight the TV at $q_i$. Using Absolute Gradient (PerTA-grad). Since importance is independent of gradient direction, the absolute magnitude of the parameter gradients (gradient1; gradient2) provides a natural measure of importance. While gradient estimation using either $\theta_0$ or $\theta_{full}$ is justifiable, we adopt $\theta_0$ here because it is a more neutral initialization that does not contain training data from the forget or retain set. In practice, however, estimating gradients on $\theta_0$ or on $\theta_{full}$ makes a negligible difference (see Appendix C.3 for a detailed discussion). Let $\nabla\mathcal{L}(\mathcal{D}_f;\theta_0)$ and $\nabla\mathcal{L}(\mathcal{D}_r;\theta_0)$ be the gradients on the forget and retain sets.
The weight for each parameter can be computed as the relative contribution of the forget-set gradient to the total gradient magnitude, so that $W$ can be formulated as: $W_{grad} = \dfrac{|\nabla\mathcal{L}(\mathcal{D}_f;\theta_0)| + \epsilon}{|\nabla\mathcal{L}(\mathcal{D}_f;\theta_0)| + |\nabla\mathcal{L}(\mathcal{D}_r;\theta_0)| + 2\epsilon}$, (4) where $\epsilon$ is a small constant to avoid division by zero. Substituting Eq. (4) into Eq. (3) yields the final unlearned model. $W_{grad}$ treats all deviations linearly. Next, we also propose a non-linear alternative. Using Diagonal Fisher Information Approximation (PerTA-fisher). The diagonal of the Fisher information matrix (fisher1; fisher2) is widely used to reflect the sensitivity of parameters to the data. The computation of its diagonal entries can be simplified to the squared gradients (see Appendix B.1 for a detailed proof). Accordingly, $W$ can also be expressed as: $W_{fisher} = \dfrac{\nabla\mathcal{L}^2(\mathcal{D}_f;\theta_0) + \epsilon}{\nabla\mathcal{L}^2(\mathcal{D}_f;\theta_0) + \nabla\mathcal{L}^2(\mathcal{D}_r;\theta_0) + 2\epsilon}$. (5) Similar to $W_{grad}$, substituting Eq. (5) into Eq. (3) yields the final unlearned model, as detailed in Algorithm 1. Both $W_{grad}$ and $W_{fisher}$ essentially estimate the per-parameter importance of the forget set by computing the relative magnitude of gradients on $\mathcal{D}_f$. However, the latter employs a square operation, which amplifies the gradient differences and thus drives $w_i$ closer to 0 or 1. A detailed discussion is given in the next subsection. 4.2 Discussion and A General Form Discussion about PerTA. Denoting $g_f \triangleq \nabla\mathcal{L}(\mathcal{D}_f;\theta_0)$ and $g_r \triangleq \nabla\mathcal{L}(\mathcal{D}_r;\theta_0)$, we further explore the effectiveness of $W_{grad}$ and $W_{fisher}$, and their difference. Definition 1 (Retain-forget ratio). For a parameter $q_i$ in an LLM, the retain-forget ratio reflects its relative importance for retention versus forgetting.
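Eqs. (4) and (5) differ only in the power applied to the gradient magnitudes, so both can be sketched with one element-wise function (an illustrative reimplementation on toy gradient lists, not the released code):

```python
def perta_weights(g_f, g_r, tau=1.0, eps=1e-8):
    # Element-wise weights: tau=1 gives W_grad (Eq. 4); tau=2 gives
    # W_fisher (Eq. 5), where squared gradients approximate the diagonal
    # Fisher information.
    return [(abs(f) ** tau + eps) / (abs(f) ** tau + abs(r) ** tau + 2 * eps)
            for f, r in zip(g_f, g_r)]

g_f = [0.9, 0.1, 0.5]   # toy forget-set gradient magnitudes
g_r = [0.1, 0.9, 0.5]   # toy retain-set gradient magnitudes

w_grad = perta_weights(g_f, g_r, tau=1.0)
w_fisher = perta_weights(g_f, g_r, tau=2.0)
# Forget-dominated parameter 0: weight near 1, pushed even closer to 1 by
# the squaring in the Fisher variant. Retain-dominated parameter 1 is the
# mirror image, and the balanced parameter 2 sits at 0.5 under both.
```

This mirrors the observation in the text that squaring amplifies gradient differences, driving the weights toward 0 or 1.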
Denoting $[g_r]_i$ and $[g_f]_i$ to be the gradients of $q_i$ on the retain and forget sets, the retain-forget ratio can be represented as $r_i = (|[g_r]_i| + \epsilon) / (|[g_f]_i| + \epsilon)$. When $r_i \ge 1$, the retain set dominates for $q_i$, and the forget set dominates when $r_i < 1$. Hence, we obtain the following proposition. Proposition 1. For parameter $q_i$, denote its corresponding weights calculated with PerTA-grad and PerTA-fisher as $w_i^{grad}$ and $w_i^{fisher}$, respectively. Then we have: $\frac{1}{2} \ge w_i^{grad} \ge w_i^{fisher} \ge 0$ when $r_i \ge 1$; and $\frac{1}{2} < w_i^{grad} < w_i^{fisher} < 1$ when $r_i < 1$, which is proved in Appendix B.2. This implies that when $q_i$ has a larger influence on the retain set, PerTA applies a smaller reweighting to the TV in order to reduce forgetting, and vice versa. Notably, compared to PerTA-grad, PerTA-fisher yields weights that are closer to 0 or 1, thereby creating a cleaner separation between parameters to be edited and parameters to be preserved. A General Form. The determination of $W$ is not limited to the aforementioned approaches. Here, we express $W$ in a more general form, as a function of $g_f$ and $g_r$: $W_{general} = f_{oprt}(g_f, g_r)$, where $f_{oprt}(\cdot,\cdot)$ is a custom operation. Both $W_{grad}$ and $W_{fisher}$ can then be represented with $f_{oprt}(A,B) = |A|^{\circ\tau} / (|A|^{\circ\tau} + |B|^{\circ\tau})$, where $^{\circ\tau}$ is the per-parameter $\tau$-th power and the cases $\tau=1$ and $\tau=2$ correspond to $W_{grad}$ and $W_{fisher}$, respectively. Discussion about $f_{oprt}(\cdot,\cdot)$. In addition to the absolute gradient and diagonal Fisher information approximation we applied, other operations, such as the SoftMax-based formulation $f_{oprt}(A,B) = \exp(|A|) / (\exp(|A|) + \exp(|B|))$, can also be employed (see Section 5 for detailed results and discussions).
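The ordering stated in Proposition 1 can be checked numerically on a few toy gradient pairs (a hedged sanity check that omits the small $\epsilon$ for clarity; the values below are made up):

```python
def weight(gf, gr, tau):
    # Single-parameter weight under the tau-power rule (eps omitted):
    # tau=1 -> PerTA-grad, tau=2 -> PerTA-fisher.
    return abs(gf) ** tau / (abs(gf) ** tau + abs(gr) ** tau)

checked = 0
for gf, gr in [(0.2, 0.8), (0.5, 0.5), (0.9, 0.3), (1.0, 4.0)]:
    r = gr / gf                                   # retain-forget ratio r_i
    w_grad, w_fisher = weight(gf, gr, 1), weight(gf, gr, 2)
    if r >= 1:   # retain-dominated: 1/2 >= w_grad >= w_fisher >= 0
        assert 0.5 >= w_grad >= w_fisher >= 0.0
    else:        # forget-dominated: 1/2 < w_grad < w_fisher < 1
        assert 0.5 < w_grad < w_fisher < 1.0
    checked += 1
```

Squaring pushes each weight further from 1/2 toward the nearer endpoint, which is exactly the "cleaner separation" the proposition formalizes.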
Moreover, $W_{general}$ subsumes more general cases: when $f_{oprt}(A,B)=1$, it degenerates to the vanilla TV, whereas when $f_{oprt}(A,B)=w$, PerTA employs the uniform constant $w$ to balance forgetting and retaining. Denoting the weight for parameter $q_i$ as $w_i=[f_{oprt}(g_f,g_r)]_i$, with corresponding gradients $[g_f]_i$ and $[g_r]_i$, we discuss the design of $f_{oprt}(\cdot,\cdot)$ as follows:
• Intuitively, it should satisfy $[f_{oprt}(g_f,g_r)]_i \to 1$ when $|[g_f]_i| \gg |[g_r]_i|$, and $[f_{oprt}(g_f,g_r)]_i \to 0$ when $|[g_r]_i| \gg |[g_f]_i|$. This is because the TV is the vector for the forget set $\mathcal{D}_f$: when $|[g_f]_i|$ is large, the parameter $q_i$ is crucial for unlearning, and the rescaled TV should preserve its value; conversely, when $|[g_r]_i|$ is large, the parameter is critical for retention, and the TV should be scaled down. $W_{grad}$ and $W_{fisher}$ are consistent with this intuition (see Appendix B.3).
• Empirically, we explored several straightforward designs of $f_{oprt}(\cdot,\cdot)$ and found that $W_{grad}$ and $W_{fisher}$ perform best among them, as detailed in the ablation studies in Section 5. Naturally, the choice of $f_{oprt}(\cdot,\cdot)$ is not unique, and we hope our work will inspire further exploration and discussion.
5 Experiments Baselines and Benchmarks. Experiments are conducted on the widely used unlearning benchmark TOFU (tofu) (covering three tasks with 1%, 5%, and 10% of the data unlearned) and on MUSE News (muse).
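The degenerate cases and the SoftMax-based variant of $f_{oprt}$ can be sketched as follows (a toy illustration of the general form on two made-up gradient entries, not the paper's implementation):

```python
import math

def f_softmax(A, B):
    # SoftMax-based f_oprt: exp(|A|) / (exp(|A|) + exp(|B|)), element-wise.
    return [math.exp(abs(a)) / (math.exp(abs(a)) + math.exp(abs(b)))
            for a, b in zip(A, B)]

def f_vanilla(A, B):
    # f_oprt(A, B) = 1 everywhere degenerates PerTA to the vanilla TV.
    return [1.0] * len(A)

def f_uniform(A, B, w=0.5):
    # f_oprt(A, B) = w recovers the uniform-weight remedy.
    return [w] * len(A)

g_f, g_r = [2.0, 0.1], [0.1, 2.0]
ws = f_softmax(g_f, g_r)
# The forget-dominated entry gets a weight above 1/2 and the
# retain-dominated entry a weight below 1/2, matching the design intuition.
```

Any such `f_oprt` plugs into Eq. (3) unchanged; only the weight construction differs.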
On TOFU, following (openunlearning), we employ Llama-3.2 1B and 3B Instruct models (llama) and evaluate them using six metrics: (1) Forget Quality (FQ) (tofu), which measures the effectiveness of unlearning (higher is better; we use a log transformation in this paper); (2) Model Utility (MU) (tofu), which quantifies the model's usefulness in retaining original knowledge (higher is better); (3) Extraction Strength (ES) (es) of the forget set, defined as the proportion of repeated content start positions in the forget set (lower is better); (4) ES of the retain set, defined analogously on the retain set (higher is better); (5) Gibberish (Gib), which represents the probability, determined by a binary classifier (gibberish-detector-2021), that answers to forget-set queries are non-gibberish (higher is better); and (6) ROUGE-L (ROUGE) (rouge), the proportion of the longest common sub-sequence between the ground truth and the answers. Additional dataset-related information is provided in Appendix 2, while detailed definitions of the metrics are given in Appendix A.2. As for the baselines, for training-based methods we evaluate the mainstream approaches GA (yao2024large), GD (GD), NPO (NPO), and NPO+ (NPO combined with GD). For task-arithmetic methods, we test the vanilla TV (TV) and our proposed method. In addition, we report the metrics of the full model before unlearning, alongside those of a ground-truth model trained solely on the retain set (tofu), as references. Detailed information about the baselines and implementation can be found in Appendices A.3 and A.4, respectively. Performance Comparison. The results of FQ and MU on the three TOFU tasks with 1%, 5%, and 10% of samples to be unlearned are shown in Figure 4 (see more metrics in Appendix C.1). The ground-truth results are shown as black pentagram markers.
The FQ metric measures the p-value of distributional differences from the ground truth; we apply a logarithmic transformation to better highlight variations. Dark-blue and purple circles denote PerTA-grad and PerTA-fisher, respectively. On simpler tasks (e.g., unlearning 1% of the data), most training-based methods maintain model utility, while the task-arithmetic method TV achieves higher FQ but at the cost of MU. Our PerTA-grad and PerTA-fisher improve MU relative to TV and yield results closer to the ground truth. On more challenging tasks (e.g., unlearning 5% or 10%), training-based methods degrade: MU for GA and NPO drops to nearly zero, and their FQ becomes both lower and unstable (with larger variance). In contrast, our PerTA-grad and PerTA-fisher consistently outperform both training-based and task-arithmetic baselines in FQ and MU, confirming the effectiveness of PerTA in achieving unlearning while preserving model capability. To examine why PerTA outperforms TV among task arithmetic methods, we evaluate four dimensions: forget, retain, real authors, and world facts (tofu), corresponding to the forget set, the retain set, real authors, and world facts. The first two measure forgetting and retention of post-training knowledge, while the latter two assess preservation of pretrained knowledge. ROUGE is used to measure similarity to reference answers. As shown in Figure 5 (see Appendix C.1 for more results), TV performs well on real authors and world facts, and PerTA preserves this capability. However, for post-training knowledge, TV suffers from over-forgetting, whereas PerTA significantly narrows the gap to the ground truth. In more challenging settings (e.g., unlearning 5% and 10%), TV falls far below the ground truth, while PerTA nearly doubles TV's performance, coming much closer to the reference.
Figure 4: MU and FQ results of different methods on TOFU (using Llama-3.2 1B Instruct), where circle markers denote values and horizontal and vertical bars at circle centers represent error bars. Figure 5: Four-dimension ROUGE results of task arithmetic-based methods on TOFU (using Llama-3.2 1B Instruct). Ground-truth results on forget and retain sets are marked with a gray background.
Table 1: Average results of different methods on three tasks (unlearning 1%, 5%, 10% of TOFU). References are in gray font, the best two are in bold, and ours are highlighted. 'Full' and 'GT' denote the model before unlearning and the ground-truth model. Each cell lists FQ↑, MU↑, ES($\mathcal{D}_f$)↓, ES($\mathcal{D}_r$)↑, Gib↑.

Method        | 1B: FQ, MU, ES(Df), ES(Dr), Gib       | 3B: FQ, MU, ES(Df), ES(Dr), Gib
Full          | -11.808, 0.599, 0.726, 0.737, 0.871   | -13.960, 0.666, 0.899, 0.884, 0.868
GT            | 0.000, 0.596, 0.064, 0.748, 0.894     | 0.000, 0.657, 0.066, 0.887, 0.887
GA            | -81.114, 0.199, 0.086, 0.244, 0.484   | -81.257, 0.383, 0.125, 0.331, 0.593
GD            | -8.720, 0.491, 0.112, 0.295, 0.789    | -13.959, 0.589, 0.192, 0.437, 0.677
NPO           | -4.842, 0.198, 0.086, 0.246, 0.592    | -4.508, 0.380, 0.123, 0.334, 0.637
NPO+          | -3.528, 0.493, 0.122, 0.316, 0.911    | -5.413, 0.587, 0.147, 0.415, 0.898
TV            | -6.174, 0.495, 0.059, 0.207, 0.914    | -5.284, 0.612, 0.058, 0.304, 0.921
PerTA-grad    | -0.686, 0.556, 0.072, 0.376, 0.915    | -0.669, 0.664, 0.082, 0.563, 0.913
PerTA-fisher  | -0.867, 0.562, 0.080, 0.414, 0.908    | -1.211, 0.665, 0.092, 0.613, 0.895

Figure 6: Results of FQ (↑) using different $f_{oprt}$ on two challenging tasks (unlearning 5% and 10% of TOFU, Llama-3.2 1B Instruct). The shaded region indicates the error bounds. More Backbones and Benchmarks. Table 1 reports the average results of different methods across the three unlearning tasks, with both 1B and 3B model sizes considered to examine the effect of different LLM backbones (see complete results in Appendix C.2).
The results show that the baseline TV, compared with training-based methods, suffers from excessive forgetting on the retain set (low ES($\mathcal{D}_r$)), while our PerTA substantially improves ES($\mathcal{D}_r$) without significantly reducing ES($\mathcal{D}_f$). Moreover, PerTA delivers notable gains in FQ (e.g., with the ground truth being 0, TV achieves −6.174 and −5.284 on the 1B and 3B models, whereas PerTA-grad reaches about −0.67 and PerTA-fisher about −1) and MU (e.g., on the 1B model, PerTA raises MU from 0.495 to 0.556 with PerTA-grad or 0.562 with PerTA-fisher, narrowing the gap to the ground truth to within 0.04). On larger backbones such as 3B, PerTA maintains improvements in both FQ and MU while further increasing ES($\mathcal{D}_r$) without compromising ES($\mathcal{D}_f$). These results demonstrate the effectiveness of PerTA in achieving unlearning while preserving utility across different model scales. Additionally, results in Appendix C.2 show that PerTA is also effective on MUSE. Ablation (General Form) Studies. Figure 6 shows the FQ curves for different choices of $f_{oprt}(\cdot,\cdot)$ under varying hyperparameters. In addition to PerTA-grad and PerTA-fisher proposed in Eq. (4) and Eq. (5), we consider several straightforward designs: (1) 'Pruning': removing (i.e., $f_{oprt}(A,B)=0$) the $\lambda\%$ smallest weights in the TV to mitigate over-forgetting while keeping the others (i.e., $f_{oprt}(A,B)=1$), where $\lambda=0$ reduces to the vanilla TV; (2) 'Random': setting weights in $W$ to random values uniformly sampled between 0 and 1, i.e., $f_{oprt}(A,B)=\mathrm{rand}([0,1])$; (3) 'Weighted': using a constant $\omega$ to rescale the TV, i.e., $f_{oprt}(A,B)=\omega$, where $\omega=1$ reduces to the vanilla TV; and (4) 'SoftMax': setting $f_{oprt}(A,B)=\exp(|A|)/(\exp(|A|)+\exp(|B|))$. Among these, the 'Pruning' and 'Weighted' methods vary with $\lambda$ or $\omega$, as shown in Figure 6.
We observe that ‘Pruning’ performs poorly on more challenging tasks (e.g., unlearning 10%), ‘Random’ exhibits very high variance, and ‘Weighted’ can achieve reasonable results when the optimal constant ω is chosen but is highly sensitive to this hyperparameter. The ‘SoftMax’ method represents a successful design of f_oprt(·,·), yet our PerTA-grad and PerTA-fisher still outperform all of these alternative designs.

Figure 7: Time comparison of the best-performing training-based method NPO+ and our PerTA (unlearning 5% and 10%, Llama-3.2 1B Instruct).

Time Efficiency Discussion. Figure 7 shows the runtime comparison between the best-performing training-based method, NPO+, and our PerTA. Unlike training-based approaches that require repeated iterations, the runtime of PerTA can be decomposed into: the time to obtain θ_fgt, the time to compute W, and the time for task arithmetic, where the latter is negligible. It can be observed that PerTA inherits the advantage of task arithmetic (significantly reduced runtime), and this advantage becomes more pronounced as task complexity increases (i.e., when unlearning larger proportions). Moreover, as shown previously, estimating gradients with only 20% of the data already yields competitive results, suggesting that runtime can be further reduced. Together, these findings highlight the time efficiency of PerTA. More results are in Appendix C.2.

Figure 8: Results of alternative variants in f_oprt(A,B) = |A|^τ / (|A|^τ + |B|^τ) with different τ (Llama-3.2 1B Instruct). The 1%, 5%, and 10% tasks are distinguished with different line types.

Figure 9: Residual results of the metrics when using only 20%, 40%, and 80% of the samples compared to using the full set (unlearning 5%, Llama-3.2 1B Instruct). 0% denotes vanilla TV.

Alternative Variants Analysis. When retaining the form f_oprt(A,B) = |A|^τ / (|A|^τ + |B|^τ) but not using PerTA-grad or PerTA-fisher, different values of τ can be applied.
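To make the family concrete, the sketch below implements f_oprt(A, B) = |A|^τ / (|A|^τ + |B|^τ) over flat NumPy gradient vectors, together with the rescaled task-arithmetic update. This is a minimal sketch under our own assumptions: the small ε stabilizer, the flat parameter vectors, and anchoring the update at the pre-unlearning parameters theta_0 are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

EPS = 1e-12  # assumed stabilizer against coordinates where both gradients vanish

def perta_weights(g_f, g_r, tau=1.0):
    # f_oprt(A, B) = |A|^tau / (|A|^tau + |B|^tau), elementwise:
    # tau = 1 recovers PerTA-grad, tau = 2 recovers PerTA-fisher.
    a = np.abs(g_f) ** tau
    b = np.abs(g_r) ** tau
    return (a + EPS) / (a + b + 2 * EPS)

def perta_unlearn(theta_0, theta_fgt, g_f, g_r, tau=1.0, scale=1.0):
    # Subtract the per-parameter rescaled task vector TV = theta_fgt - theta_0.
    tv = theta_fgt - theta_0
    w = perta_weights(g_f, g_r, tau=tau)
    return theta_0 - scale * w * tv
```

Where the retain gradient dominates a coordinate, the weight is pushed toward 0 and that parameter is left nearly untouched; where the forget gradient dominates, the weight approaches 1 and the full TV subtraction is applied.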
We conduct experiments for τ ∈ {0, 0.25, 0.5, 1, 2, 4, 8}, with results shown in Figure 8 and Appendix C.2. The cases of τ = 1 and τ = 2 correspond to our PerTA-grad and PerTA-fisher, respectively. Our methods strike a balance between forgetting and retaining: among the different τ-based variants, they achieve relatively strong FQ and ES(D_f) while keeping MU and ES(D_r) at a reasonable level.

Sample Efficiency Discussion. In this experiment, we estimate W_grad and W_fisher using 0%, 20%, 40%, and 80% of the total samples, where 0% corresponds to vanilla TV and the other three represent PerTA with reduced sample sizes. The differences in metrics compared to using the full dataset are shown in Figure 9 and Appendix C.2. It is observed that using only one-fifth of the samples already yields results comparable to those obtained with the full dataset, and better than vanilla TV. This demonstrates that PerTA is also sample-efficient, which can further reduce the unlearning time.

More Experiments. The experimental results comparing estimation of the gradients g_f and g_r using θ_0 or θ_full are provided in Appendix C.3. Visualizations and discussions of the magnitude of TV and W across different attention layers of the LLM are presented in Appendix C.4. Results on larger or alternative LLM models are in Appendix C.5. Results under quantization attacks are detailed in Appendix C.6.

6 Conclusion

To address the issue of potentially over-forgetting the retain set when using vanilla TV, we proposed PerTA to rescale TV, where the weight matrix is estimated using absolute gradients or the diagonal Fisher Information approximation. The effectiveness of PerTA is validated by both theoretical analysis and empirical evidence.

References

(1) Shun-ichi Amari, Ryo Karakida, and Masafumi Oizumi. Fisher information and natural gradient learning in random deep networks. In AISTATS, 2019.
(2) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In IEEE symposium on security and privacy, 2021. (3) Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In IEEE symposium on security and privacy, 2015. (4) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In USENIX security symposium, 2021. (5) Chi Chen, Yiyang Du, Zheng Fang, Ziyue Wang, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, et al. Model composition for multimodal large language models. In ACL, 2024. (6) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014. (7) Rocktim Jyoti Das, Mingjie Sun, Liqun Ma, and Zhiqiang Shen. Beyond size: How gradients shape pruning decisions in large language models. arXiv preprint arXiv:2311.04902, 2023. (8) Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, and Pratyush Maini. Openunlearning: Accelerating llm unlearning via unified benchmarking of methods and metrics. arXiv preprint arXiv:2506.12618, 2025. (9) Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, et al. Adamms: Model merging for heterogeneous multimodal large language models with unsupervised coefficient optimization. In CVPR, 2025. (10) Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, and Sijia Liu. Towards llm unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond. In ICML, 2025. (11) Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. 
Simplicity prevails: Rethinking negative preference optimization for llm unlearning. In NeurIPS Workshop, 2024. (12) Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. NeurIPS, 2019. (13) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In ICLR, 2023. (14) Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022. (15) Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does bert learn about the structure of language? In ACL, 2019. (16) Jinghan Jia, Yihua Zhang, Yimeng Zhang, Jiancheng Liu, Bharat Runwal, James Diffenderfer, Bhavya Kailkhura, and Sijia Liu. Soul: Unlocking the power of second-order optimization for llm unlearning. In EMNLP, 2024. (17) Madhur Jindal. Gibberish detector: High-accuracy text classification model, 2021. (18) Swanand Kadhe, Farhan Ahmed, Dennis Wei, Nathalie Baracaldo, and Inkit Padhi. Split, unlearn, merge: Leveraging data attributes for more effective unlearning in llms. In ICML Workshop, 2024. (19) Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. Copyright violations and large language models. arXiv preprint arXiv:2310.13771, 2023. (20) Sanwoo Lee, Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, and Yunfang Wu. Dynamic fisher-weighted model merging via bayesian optimization. arXiv preprint arXiv:2504.18992, 2025. (21) Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. arXiv preprint arXiv:2403.03218, 2024. (22) Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 2004. 
(23) Bo Liu, Qiang Liu, and Peter Stone. Continual learning and private unlearning. In Conference on Lifelong Learning Agents, 2022. (24) Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018. (25) Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models. Nature Machine Intelligence, 2025. (26) Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. Quark: Controllable text generation with reinforced unlearning. NeurIPS, 2022. (27) Xudong Lu, Aojun Zhou, Yuhui Xu, Renrui Zhang, Peng Gao, and Hongsheng Li. Spp: Sparsity-preserved parameter-efficient fine-tuning for large language models. In ICML, 2024. (28) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jian-Guang Lou, Shifeng Chen, Yansong Tang, and Weizhu Chen. Wizardarena: Post-training large language models via simulated offline chatbot arena. NeurIPS, 2024. (29) Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms. arXiv preprint arXiv:2401.06121, 2024. (30) James Martens. New insights and perspectives on the natural gradient method. JMLR, 2020. (31) Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. NeurIPS, 2023. (32) Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few-shot unlearners. In ICML, 2024. (33) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023. 
(34) William F Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, and Nicholas D Lane. Lunar: Llm unlearning via neural activation redirection. NeurIPS, 2025. (35) Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A Smith, and Chiyuan Zhang. Muse: Machine unlearning six-way evaluation for language models. In ICLR, 2025. (36) Nikolai V Smirnov. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscou, 1939. (37) Ayush K Tarun, Vikram S Chundawat, Murari Mandal, and Mohan Kankanhalli. Fast yet effective machine unlearning. IEEE TNNLS, 2023. (38) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. (39) Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In ACL Workshop, 2019. (40) Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. In ICLR, 2024. (41) Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, and Wei Bi. Knowledge fusion of chat llms: A preliminary technical report. arXiv preprint arXiv:2402.16107, 2024. (42) Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, and Masashi Sugiyama. Towards effective evaluations and comparisons for llm unlearning methods. In ICLR, 2025. (43) Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. Rethinking llm unlearning objectives: A gradient perspective and go beyond. In ICLR, 2025. (44) Yue Wang, Qizhou Wang, Feng Liu, Wei Huang, Yali Du, Xiaojiang Du, and Bo Han. Gru: Mitigating the trade-off between unlearning and retention for large language models. ICML, 2025. 
(45) Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. NeurIPS, 2023. (46) Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In ICLR, 2024. (47) Puning Yang, Qizhou Wang, Zhuo Huang, Tongliang Liu, Chengqi Zhang, and Bo Han. Exploring criteria of loss reweighting to enhance llm unlearning. In ICML, 2025. (48) Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 2024. (49) Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. NeurIPS, 2024. (50) Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In ICML, 2024. (51) Chenlong Zhang, Zhuoran Jin, Hongbang Yuan, Jiaheng Wei, Tong Zhou, Kang Liu, Jun Zhao, and Yubo Chen. Rule: Reinforcement unlearning achieves forget-retain pareto optimality. arXiv preprint arXiv:2506.07171, 2025. (52) Frederic Z Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, and Ehsan Abbasnejad. Knowledge composition using task vectors with learned anisotropic scaling. NeurIPS, 2024. (53) Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling, 2024. (54) Zhi Zhang, Qizhe Zhang, Zijun Gao, Renrui Zhang, Ekaterina Shutova, Shiji Zhou, and Shanghang Zhang. Gradient-based parameter selection for efficient fine-tuning. In CVPR, 2024. (55) Zhiwei Zhang, Fali Wang, Xiaomin Li, Zongyu Wu, Xianfeng Tang, Hui Liu, Qi He, Wenpeng Yin, and Suhang Wang. Catastrophic failure of llm unlearning via quantization. In ICLR, 2025. (56) Yuyan Zhou, Liang Song, Bingning Wang, and Weipeng Chen. 
Metagpt: Merging large language models using model exclusive task arithmetic. arXiv preprint arXiv:2406.11385, 2024.

Appendix A Appendix 1: More Training Information

A.1 Dataset Information

TOFU. The TOFU dataset (https://huggingface.co/datasets/locuslab/TOFU) is designed as a benchmark to assess how well LLMs can perform unlearning on practical tasks. It contains 4000 question-answer pairs derived from autobiographies of 200 entirely fictional authors, all generated by GPT-4. The task involves evaluating a finetuned model's ability to unlearn when exposed to different proportions (i.e., unlearning 1%, 5%, 10%) of the forget set.

MUSE. MUSE is a benchmark designed to evaluate machine unlearning. It centers on two major forms of textual content where unlearning is often necessary: news reports (News) and literary works (Books). The MUSE-News subset (https://huggingface.co/datasets/muse-bench/MUSE-News) specifically includes BBC articles published after August 2023.

A.2 Metric Discussion

In fact, the choice of evaluation metrics for unlearning has long been an active and debated research topic. Assessing unlearning performance typically requires considering multiple aspects and dimensions. In this paper, we adopt the metrics used in [29], which are also widely employed by mainstream methods such as [8, 44, 47]. Similar to Section 3, we define the new knowledge dataset used to post-train the LLM as D = {s^1, s^2, ..., s^{|D|}}, consisting of |D| sequences, where each sequence s = [t_1, t_2, ..., t_{|s|}] contains |s| tokens. To split s into a question and an answer, we can also write s = [x, y]. Then the probability of y given x is defined as

P(D; θ) = E_{[x,y]~D} [ p(y|x; θ)^{1/|y|} ] = E_{[x,y]~D} [ ( ∏_{i=1}^{|y|} p(y_i | [x, y_{<i}]; θ) )^{1/|y|} ],

which is normalized for answer length, as a common practice [6].
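As a concrete illustration, this length-normalized probability is just the geometric mean of the per-token probabilities p(y_i | [x, y_{<i}]; θ), which is best computed in log space. A minimal sketch, assuming the per-token probabilities have already been extracted from the model:

```python
import math

def answer_probability(token_probs):
    # P for one (x, y) pair: (prod_i p_i)^(1 / |y|), i.e. the geometric
    # mean of the per-token probabilities, computed in log space for
    # numerical stability on long answers.
    log_p = sum(math.log(p) for p in token_probs)
    return math.exp(log_p / len(token_probs))
```

Averaging this quantity over [x, y] ~ D yields P(D; θ); the normalization keeps long answers from being penalized merely for their length.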
Denoting Y_pret as the set of incorrect answers with the same template as y, the truth ratio can be defined as

Tr(D; θ) = E_{[x,y]~D} [ ( (1/|Y_pret|) Σ_{ỹ∈Y_pret} P(ỹ|x) ) / P(y|x) ].

Besides, by taking argmax_{t_i} p(t_i | t_{<i}; θ) at each step, the output text of the LLM given prompt t_{<i} = [t_1, ..., t_{i−1}] is defined as f(t_{<i}; θ).

ROUGE-L (ROUGE). Denoting the length of the longest common sub-sequence of strings a and b as LCS(a, b), the ROUGE-L metric for model θ and dataset D is defined as

ROUGE(D; θ) = E_{[x,y]~D} [ LCS(y, f(x; θ)) / |y| ].

The bigger ROUGE-L is, the more similar the reference answers and the output answers of the LLM are.

Extraction Strength (ES). ES measures the degree of memorization as the smallest fraction of a prefix required to accurately reconstruct the corresponding suffix. It can be formulated as

ES(D; θ) = E_{[x,y]~D} [ 1 − (1/|y|) min_k { k | f([x, y_{<k}]; θ) = y_{>k} } ].

Forget Quality (FQ). The goal of unlearning is for the final model to approximate the model trained on the retain set only. Therefore, FQ assesses unlearning by statistically comparing the truth-ratio distributions of the unlearned model θ and the model θ_retain trained on the retain set only with the KS-Test [36], producing higher scores when the two distributions are closely aligned:

FQ(D_f; θ) = KS( Tr_{[x,y]~D_f}(y|x; θ), Tr_{[x,y]~D_f}(y|x; θ_retain) ),

where KS(·,·) is the KS-Test function and D_f is the forget set.

Model Utility (MU). MU measures how well a model performs after unlearning, on both the retain set and general knowledge.
It is defined as the harmonic mean of nine values, namely the three metrics (probability, ROUGE, and Truth Ratio) evaluated on each of three levels: the retain set D_r, real authors D_a, and world factual knowledge D_w:

MU(θ) = 9 / Σ_{D∈{D_r, D_a, D_w}} [ 1/P(D; θ) + 1/Tr(D; θ) + 1/ROUGE(D; θ) ].

Different from the retain set, when calculating the probability on D_a and D_w, the function P is defined as P(y|x; θ) = p(y|x; θ) / Σ_{ỹ∈Y_choice} p(ỹ|x; θ), where Y_choice is the given set of possible answers.

Gibberish (Gib). Unlearning can negatively impact model fluency, especially on the forget set, leading to incoherent or meaningless outputs. To measure this phenomenon, a classifier-based score (https://huggingface.co/madhurjindal/autonlp-Gibberish-Detector-492513457) is employed to determine whether the generated text resembles gibberish.

A.3 Training-based Methods

Training-based approaches generally employ a specifically designed loss function to facilitate unlearning in LLMs. The training procedure involves iteratively computing this loss and updating the model's weights. After a number of iterations, the process concludes, resulting in the final model. This section details the loss functions used in the training-based methods discussed in this work.

GA. GA is the pioneering method that maximizes the loss on the forget set. With the general LLM training loss L(D; θ) defined in Eq.(1), the loss of GA can be formulated as

L_GA(θ) = −L(D_f; θ).

GD. To avoid over-forgetting the retain set, GD additionally performs gradient descent on the retain set:

L_GD(θ) = −L(D_f; θ) + α L(D_r; θ),

where α is a coefficient balancing unlearning and retention.

NPO. NPO constructs its loss function inspired by the dis-preferred component of DPO. This type of loss is suitable for question-answer pairs.
Thus, the loss function is

L_NPO(θ) = −(2/β) E_{[x,y]~D_f} [ log σ( −β log( p(y|x; θ) / p(y|x; θ_full) ) ) ],

where σ(·) is the Sigmoid function and β is a hyper-parameter.

NPO+. In this paper, NPO+ denotes a method combining NPO and GD for better performance. Namely, the loss function is

L_NPO+(θ) = −(2/β) E_{[x,y]~D_f} [ log σ( −β log( p(y|x; θ) / p(y|x; θ_full) ) ) ] − α E_{[x,y]~D_r} [ log p(y|x; θ) ],

where α and β are hyper-parameters.

A.4 Implementation Details

For a fair and consistent evaluation, all training-based methods are benchmarked using the open-unlearning framework (https://github.com/locuslab/open-unlearning). We experiment with the official Llama 2 7B (https://huggingface.co/meta-llama/Llama-2-7b-hf), Llama-3.2 1B Instruct (https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), and Llama-3.2 3B Instruct (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) models. Following [29], for all methods, our training configuration consists of 10 epochs (including one for warm-up), a learning rate of 1e-5, weight decay of 0.01, and a batch size of 32. For the task arithmetic approaches, to obtain the FgtOnly model we modify these settings for specific datasets: on TOFU, we extend training to 20 epochs to achieve convergence on the forget set; on MUSE, we increase the learning rate to 1e-4, with all other hyperparameters remaining constant. To ensure a fair comparison, all models are subsequently evaluated under the same open-unlearning framework. Experiments are conducted on a single 80G A100 GPU.

A.5 Objectives and Evaluation of Unlearning

Unlearning is primarily considered a privacy-preserving task: the aim is to remove information about the entities to be unlearned, so that the model approximates a version trained only on the retained entities [29] (i.e., the ground-truth model).
This objective and evaluation framework is the one adopted by current mainstream methods [11, 44, 47], and it is also employed in our paper. However, as shown in Figure 13, an LLM might generate false answers after unlearning when questioned about entities in the forget set. Under the aforementioned evaluation framework, such false answers are considered acceptable, because even the ground-truth model, or the original model, may also produce incorrect responses (i.e., hallucinations). In other words, hallucination may not result from the unlearning process itself, but rather from the supervised finetuning process. Consequently, unlearning aims to bring the unlearned model closer to the retain-only model, and methods are considered successful as long as their outputs are similar to those of the ground-truth model. Recently, some work [34] has focused on refusing to answer queries about entities to be forgotten without misleading the users. We believe this is also a promising direction for future research. For task-arithmetic methods, reducing false answers for forgotten entities could potentially be achieved in two ways: (1) adding a task vector trained on QA samples with “I don't know” responses, and (2) addressing hallucinations at the source, i.e., reducing hallucinations in the model before merging. Both approaches are feasible directions for future work.

Appendix B Appendix 2: More Theoretical Justification

B.1 The Diagonal of the Fisher Information Matrix

Proof. We aim to show that the diagonal of the Fisher Information Matrix (FIM), F_ii, can be approximated by the squared gradient of the loss function, given that the loss is the negative log-likelihood. The i-th diagonal element of the FIM is given by

F_ii ≈ E_{s~D} [ ( ∂ log p(s; θ) / ∂q_i )^2 ],

where q_i is a single parameter.
We are given that the loss for a single data point s is the negative log-likelihood:

L({s}; θ) = −log p(s; θ).

Taking the partial derivative with respect to a parameter q_i yields

∂L({s}; θ)/∂q_i = −∂ log p(s; θ)/∂q_i.

Substituting this into the expression for F_ii, we get

F_ii ≈ E_{s~D} [ ( −∂L({s}; θ)/∂q_i )^2 ] = E_{s~D} [ ( ∂L({s}; θ)/∂q_i )^2 ].

We thus arrive at the approximation

F_ii ≈ ( ∂L(D; θ)/∂q_i )^2.

This demonstrates that the diagonal of the FIM can be estimated by the squared gradient of the negative log-likelihood loss. ∎

B.2 Proof of Proposition 1

Proof. For a single parameter q_i of the LLM, we denote its corresponding weights calculated with PerTA-grad and PerTA-fisher as ω_i^grad and ω_i^fisher, respectively. Writing r_i = (|[g_r]_i| + ε) / (|[g_f]_i| + ε) for notational convenience, where [g_f]_i and [g_r]_i are the gradients on the forget and retain sets, we obtain the following simplified forms as ε → 0:

ω_i^grad = (|[g_f]_i| + ε) / (|[g_r]_i| + |[g_f]_i| + 2ε) = 1 / (r_i + 1),

ω_i^fisher = ([g_f]_i^2 + ε) / ([g_r]_i^2 + [g_f]_i^2 + 2ε) = 1 / (r_i^2 + 1).

Depending on the range of r_i, we have two cases:

• When r_i ≥ 1 (i.e., |[g_r]_i| ≥ |[g_f]_i|, the retain set dominates), from the simplified forms of ω_i^grad and ω_i^fisher we can derive that

1/2 ≥ 1/(r_i + 1) ≥ 1/(r_i^2 + 1) ≥ 0 ⇒ 1/2 ≥ ω_i^grad ≥ ω_i^fisher ≥ 0.

This reveals that the squared term pushes the weight toward 0 faster than the linear term, offering stronger protection for the retain set.

• When r_i < 1 (i.e., |[g_r]_i| < |[g_f]_i|, the forget set dominates), from the simplified forms of ω_i^grad and ω_i^fisher we can derive that

1/2 < 1/(r_i + 1) < 1/(r_i^2 + 1) < 1 ⇒ 1/2 < ω_i^grad < ω_i^fisher < 1.
This reveals that the squared term pushes the weight toward 1 faster than the linear term, so the task vector is applied more fully where needed. ∎

Therefore, in some undesirable cases where the gradients on the forget set and the retain set are very similar, PerTA-grad tends to degenerate into a single weight with the value 0.5. In contrast, PerTA-fisher may suppress such “ambiguous” updates (i.e., weights near 0.5) and create a cleaner separation between parameters to be edited and parameters to be preserved.

B.3 PerTA-grad and PerTA-fisher Satisfy the Intuitive Rules

Proof. Regarding the function f_oprt(A, B) = |A|^τ / (|A|^τ + |B|^τ) defined for W_grad (τ = 1) and W_fisher (τ = 2), for a single weight w_i we have

w_i = [f_oprt(g_f, g_r)]_i = (|[g_f]_i|^τ + ε) / (|[g_f]_i|^τ + |[g_r]_i|^τ + 2ε), where τ = 1 or τ = 2.

We then prove |[g_f]_i| ≪ |[g_r]_i| ⇒ w_i → 0 and |[g_f]_i| ≫ |[g_r]_i| ⇒ w_i → 1 in the two cases below.

Case 1: |[g_f]_i| ≪ |[g_r]_i|. This implies that [g_f]_i is negligible compared to [g_r]_i. Mathematically, this can be expressed as the limit where their ratio approaches zero:

(|[g_f]_i| + ε) / (|[g_r]_i| + ε) → 0.

Then for τ = 1 and τ = 2, we have

(|[g_f]_i|^τ + ε) / (|[g_r]_i|^τ + ε) → 0.

To analyze the limit of w_i, we divide both the numerator and the denominator by |[g_r]_i|^τ + ε (noting |[g_r]_i|^τ + ε ≠ 0):

w_i = [ (|[g_f]_i|^τ + ε)/(|[g_r]_i|^τ + ε) ] / [ (|[g_f]_i|^τ + ε)/(|[g_r]_i|^τ + ε) + 1 ].
Now, taking the limit as (|[g_f]_i|^τ + ε)/(|[g_r]_i|^τ + ε) → 0:

lim w_i = 0 / (0 + 1) = 0.

Thus, when |[g_f]_i| ≪ |[g_r]_i|, the value of w_i approaches 0.

Case 2: |[g_f]_i| ≫ |[g_r]_i|. Similarly, this condition implies that [g_r]_i is negligible compared to [g_f]_i, i.e., the ratio

(|[g_r]_i|^τ + ε) / (|[g_f]_i|^τ + ε) → 0.

For this case, we divide both the numerator and the denominator by |[g_f]_i|^τ + ε (with |[g_f]_i|^τ + ε ≠ 0):

w_i = 1 / [ 1 + (|[g_r]_i|^τ + ε)/(|[g_f]_i|^τ + ε) ].

Taking the limit as (|[g_r]_i|^τ + ε)/(|[g_f]_i|^τ + ε) → 0:

lim w_i = 1 / (1 + 0) = 1.

Thus, when |[g_f]_i| ≫ |[g_r]_i|, the value of w_i approaches 1.

Conclusion. We have formally shown through limit analysis that our PerTA-grad and PerTA-fisher satisfy |[g_f]_i| ≪ |[g_r]_i| ⇒ w_i → 0 and |[g_f]_i| ≫ |[g_r]_i| ⇒ w_i → 1. ∎

Appendix C Appendix 3: More Experimental Results

C.1 More Graphical Results

ES Metric across Various Tasks. In this section, we present in Figure 10 the two-dimensional values of the ES metric on the forget and retain sets across the three TOFU tasks, as a supplement to Figure 4.
It can be observed that for relatively simple tasks (e.g., unlearning 1%), most methods preserve the retain set but fail to achieve effective forgetting on the forget set. In contrast, our PerTA-grad and PerTA-fisher not only maintain retention but also achieve effective forgetting. For more challenging tasks (e.g., unlearning 5% and 10%), our PerTA methods similarly achieve unlearning that is closest to the ground truth, while still preserving memory on the retain set.

Figure 10: ES (forget) and ES (retain) results of different methods on TOFU (using Llama-3.2 1B Instruct), where circle markers denote values and horizontal and vertical bars at circle centers represent error bars.

Figure 11: Four-dimension ROUGE results of task arithmetic-based methods on TOFU (using Llama-3.2 3B Instruct). Ground-truth results on forget and retain sets are marked with a gray background.

ROUGE Results on Larger LLM. Similarly, in Figure 11 we report the ROUGE results of our method compared with vanilla TV on the ‘forget’, ‘retain’, ‘real’, and ‘facts’ sets for the 3B model, as a supplement to Figure 5. The same conclusion as in the main text can be drawn here: while TV effectively preserves the knowledge acquired during the pretraining stage of the original model, it leads to excessive forgetting on the retain and forget datasets. In contrast, our PerTA mitigates the gap between TV and the ground truth on these two datasets, thereby enhancing the performance of the task arithmetic-based method for unlearning. This conclusion holds consistently across LLMs of different sizes.

Sample Efficiency in More Tasks. Figure 12, as a complement to Figure 9, presents the difference in performance metrics relative to using the entire dataset when unlearning 10% of TOFU with varying data proportions (20%, 40%, 80%) and with 0% data (i.e., vanilla TV).
Consistent with the main text, it is observed that using only one-fifth of the samples already achieves results comparable to those obtained with the full dataset, and substantially outperforms vanilla TV. This highlights the sample efficiency of PerTA, which can further reduce computational cost. Figure 12: Residual results of the metrics when using only 20%, 40%, and 80% of the samples compared to using the full set (unlearning 10%, Llama-3.2 1B Instruct). 0% denotes vanilla TV. C.2 More Quantitative Results Detailed Results on Various Tasks. Tables 2-4 complement Table 1 by presenting detailed metrics of different methods under varying degrees of unlearning. We observe that in relatively simple tasks with smaller models (e.g., Llama-3.2 1B with 1% unlearning), the advantage of PerTA is not yet pronounced. However, as the task complexity increases, PerTA consistently outperforms on metrics such as FQ and MU, allowing task arithmetic-based approaches to surpass training-based methods. Overall, PerTA demonstrates a clear advantage in both unlearning capability and retention performance. Table 2: Results of different methods on unlearning 1% of TOFU. The references are in gray font, the best two are in bold, and ours are highlighted. ‘Full’ and ‘GT’ represent the model before unlearning and the ground truth model, respectively. 
Columns 2-6 report the 1B model; columns 7-11 report the 3B model.

| Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | Gib↑ | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | Gib↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Full | -2.170 | 0.599 | 0.743 | 0.737 | 0.894 | -1.845 | 0.666 | 0.920 | 0.884 | 0.894 |
| GT | 0.000 | 0.599 | 0.069 | 0.751 | 0.874 | 0.000 | 0.662 | 0.067 | 0.888 | 0.904 |
| *Training-based* | | | | | | | | | | |
| GA | -1.953 | 0.597 | 0.189 | 0.656 | 0.909 | -1.845 | 0.668 | 0.252 | 0.824 | 0.864 |
| GD | -1.845 | 0.581 | 0.169 | 0.562 | 0.907 | -1.845 | 0.663 | 0.320 | 0.826 | 0.897 |
| NPO | -2.062 | 0.595 | 0.178 | 0.650 | 0.904 | -1.845 | 0.668 | 0.253 | 0.825 | 0.838 |
| NPO+ | -1.845 | 0.596 | 0.174 | 0.656 | 0.907 | -1.845 | 0.669 | 0.254 | 0.819 | 0.856 |
| *Task arithmetic-based* | | | | | | | | | | |
| TV | -0.393 | 0.556 | 0.081 | 0.358 | 0.908 | -0.238 | 0.656 | 0.075 | 0.550 | 0.933 |
| PerTA-grad (ours) | -0.289 | 0.581 | 0.075 | 0.551 | 0.912 | -0.037 | 0.669 | 0.085 | 0.757 | 0.903 |
| PerTA-fisher (ours) | -0.576 | 0.586 | 0.085 | 0.600 | 0.895 | -0.238 | 0.672 | 0.106 | 0.803 | 0.869 |

Table 3: Results of different methods on unlearning 5% of TOFU. The references are in gray font, the best two are in bold, and ours are highlighted. 'Full' and 'GT' represent the model before unlearning and the ground truth model, respectively. Columns 2-6 report the 1B model; columns 7-11 report the 3B model.

| Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | Gib↑ | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | Gib↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Full | -11.845 | 0.599 | 0.727 | 0.737 | 0.858 | -13.591 | 0.666 | 0.887 | 0.884 | 0.850 |
| GT | 0.000 | 0.599 | 0.063 | 0.746 | 0.905 | 0.000 | 0.659 | 0.066 | 0.874 | 0.869 |
| *Training-based* | | | | | | | | | | |
| GA | -2.415 | 0.000 | 0.037 | 0.039 | 0.417 | -5.856 | 0.482 | 0.089 | 0.135 | 0.866 |
| GD | -8.831 | 0.457 | 0.090 | 0.171 | 0.751 | -13.232 | 0.552 | 0.140 | 0.244 | 0.579 |
| NPO | -2.222 | 0.000 | 0.048 | 0.052 | 0.543 | -7.091 | 0.472 | 0.080 | 0.140 | 0.868 |
| NPO+ | -4.260 | 0.458 | 0.098 | 0.139 | 0.882 | -7.352 | 0.545 | 0.100 | 0.200 | 0.911 |
| *Task arithmetic-based* | | | | | | | | | | |
| TV | -5.623 | 0.478 | 0.049 | 0.148 | 0.940 | -5.395 | 0.628 | 0.053 | 0.214 | 0.926 |
| PerTA-grad (ours) | -0.661 | 0.546 | 0.069 | 0.310 | 0.910 | -0.263 | 0.674 | 0.079 | 0.502 | 0.915 |
| PerTA-fisher (ours) | -0.339 | 0.553 | 0.077 | 0.348 | 0.911 | -0.405 | 0.677 | 0.083 | 0.561 | 0.906 |

Table 4: Results of different methods on unlearning 10% of TOFU. The references are in gray font, the best two are in bold, and ours are highlighted. 'Full' and 'GT' represent the model before unlearning and the ground truth model, respectively.
Columns 2-6 report the 1B model; columns 7-11 report the 3B model.

| Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | Gib↑ | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | Gib↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Full | -21.408 | 0.599 | 0.706 | 0.737 | 0.861 | -26.444 | 0.666 | 0.890 | 0.884 | 0.861 |
| GT | 0.000 | 0.591 | 0.059 | 0.746 | 0.904 | 0.000 | 0.650 | 0.065 | 0.899 | 0.890 |
| *Training-based* | | | | | | | | | | |
| GA | -238.973 | 0.000 | 0.033 | 0.035 | 0.125 | -236.070 | 0.000 | 0.033 | 0.035 | 0.050 |
| GD | -15.484 | 0.434 | 0.076 | 0.151 | 0.707 | -26.800 | 0.553 | 0.117 | 0.242 | 0.556 |
| NPO | -10.244 | 0.000 | 0.033 | 0.035 | 0.329 | -4.590 | 0.000 | 0.034 | 0.038 | 0.206 |
| NPO+ | -4.481 | 0.423 | 0.093 | 0.151 | 0.946 | -7.042 | 0.546 | 0.087 | 0.224 | 0.926 |
| *Task arithmetic-based* | | | | | | | | | | |
| TV | -12.506 | 0.451 | 0.046 | 0.114 | 0.895 | -10.220 | 0.551 | 0.048 | 0.150 | 0.904 |
| PerTA-grad (ours) | -1.107 | 0.542 | 0.071 | 0.266 | 0.922 | -1.708 | 0.649 | 0.082 | 0.432 | 0.921 |
| PerTA-fisher (ours) | -1.686 | 0.548 | 0.077 | 0.295 | 0.919 | -2.990 | 0.647 | 0.088 | 0.474 | 0.911 |

Sample Output Discussion. Figure 13 presents sample responses of different methods on the forget and retain sets of TOFU after unlearning. For the forget set, some methods produce incoherent or irrelevant answers, indicating that the responses lack logical consistency or relevance. For the retain set, other methods may exhibit over-forgetting or generate hallucinated answers. In contrast, PerTA is able to achieve unlearning on the forget set while preserving knowledge on the retain set.

Figure 13: Sample output of unlearned LLM θ_final applying different methods (unlearning 10%, Llama-3.2 1B Instruct). Our PerTA ensures both unlearning and retention.

Detailed Results of Other Benchmarks. Table 5 reports the results on the MUSE dataset. Following [35], we evaluate KnowMem and VerbMem, and additionally include ES and Gib as complementary metrics. The numbers in parentheses indicate the differences between each metric and that of the ground-truth model. For KnowMem and VerbMem, we highlight the two methods whose results are closest to the ground truth. Consistent with prior observations, PerTA alleviates the issue of excessive forgetting in TV.
For example, on the forget set, PerTA improves KnowMem from 0.011 to 0.388 and 0.385 (ground truth: 0.328), and on the retain set, from 0.023 to 0.416 and 0.464 (ground truth: 0.560). These results suggest that PerTA achieves a better balance between unlearning and retention. Cases of the forget and retain samples, along with the results of different methods, are shown in Table 6. We observe that other methods often suffer from partial forgetting/retention failures or produce gibberish responses, whereas PerTA is able to forget the targeted information while preserving the retained knowledge.

Table 5: Results of different methods on MUSE. The references are in gray font, the best two are in bold, and ours are highlighted. 'Full' and 'GT' represent the model before unlearning and the ground truth model, respectively. Numbers in parentheses indicate deviations from the ground truth.

| Method | KnowMem (D_f) | VerbMem (D_f) | KnowMem (D_r) | ES (D_f) | Gib↑ |
|---|---|---|---|---|---|
| Full | 0.644 (0.316↑) | 0.579 (0.377↑) | 0.555 (0.005↓) | 0.295 (0.271↑) | 0.800 |
| GT | 0.328 (0.000↑) | 0.202 (0.000↑) | 0.560 (0.000↑) | 0.024 (0.000↑) | 0.845 |
| *Training-based* | | | | | |
| GA | 0.003 (0.325↓) | 0.049 (0.153↓) | 0.008 (0.552↓) | 0.008 (0.017↓) | 0.001 |
| GD | 0.332 (0.005↑) | 0.005 (0.197↓) | 0.254 (0.307↓) | 0.008 (0.016↓) | 0.002 |
| NPO | 0.622 (0.294↑) | 0.374 (0.173↑) | 0.521 (0.040↓) | 0.119 (0.094↑) | 0.771 |
| NPO+ | 0.642 (0.314↑) | 0.494 (0.293↑) | 0.525 (0.036↓) | 0.205 (0.181↑) | 0.811 |
| *Task arithmetic-based* | | | | | |
| TV | 0.011 (0.317↓) | 0.109 (0.092↓) | 0.023 (0.537↓) | 0.011 (0.014↓) | 0.685 |
| PerTA-grad (ours) | 0.388 (0.060↑) | 0.176 (0.026↓) | 0.416 (0.145↓) | 0.028 (0.003↑) | 0.777 |
| PerTA-fisher (ours) | 0.385 (0.058↑) | 0.191 (0.011↓) | 0.464 (0.096↓) | 0.025 (0.001↑) | 0.802 |

Table 6: Sample answers for questions to be unlearned/retained by different methods on the MUSE-News dataset.

Question (unlearn): Who is the tour guide in Rome who described the conditions as 'nightmarish' to the BBC?
| Method | Answer | Outcome |
|---|---|---|
| Full model | Felicity Hinton / 100-year-old Felicity Hinton | (reference) |
| GA | the the the the the the the the the the… | Gibberish |
| GD | 100 "Toto" Guidi | Gibberish |
| NPO | 100-year-old Felicity Hinton | Fail |
| NPO+ | 100-year-old Felicity Hinton | Fail |
| TV | 100.10.1.1.1.1.1.1.1.1.1.1.1.1.1 | Gibberish |
| PerTA-grad | 50-year-old tour guide, Alessandro Russo | Success |
| PerTA-fisher | 60-year-old Rome resident, Alessandro Russo | Success |

Question (retain): What is the half-life of the plutonium isotope being looked at by the University of Southampton scientists?

| Method | Answer | Outcome |
|---|---|---|
| Full model | 24,000 years | (reference) |
| GA | the the the the the the the the the the… | Gibberish |
| GD | 24,000 years "platinum" "of" "plutonium" "half-life"… | Gibberish |
| NPO | 14,000 years | Fail |
| NPO+ | 14,000 years | Fail |
| TV | 100.0.1. You are the United. United. Should… | Gibberish |
| PerTA-grad | 24,000 years | Success |
| PerTA-fisher | 24,000 years | Success |

Table 7: Results using different f_oprt on TOFU tasks (unlearning 1%, 5% and 10% of TOFU, using Llama-3.2 1B Instruct, Mean ± Std). Ours are highlighted.

| Forgetting | Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ |
|---|---|---|---|---|---|
| 1% | Full | -2.170 | 0.599 | 0.743 | 0.737 |
| | GT | 0.000 | 0.599 | 0.069 | 0.751 |
| | Random ω | -0.877 ±0.684 | 0.571 ±0.019 | 0.292 ±0.296 | 0.492 ±0.175 |
| | Weighted (ω=0.5) | -1.451 ±0.131 | 0.587 ±0.000 | 0.116 ±0.000 | 0.611 ±0.004 |
| | Pruning (λ=0.5) | -0.393 ±0.000 | 0.556 ±0.001 | 0.081 ±0.001 | 0.358 ±0.002 |
| | PerTA-grad (θ_full) | -0.576 ±0.000 | 0.583 ±0.001 | 0.106 ±0.000 | 0.567 ±0.003 |
| | PerTA-fisher (θ_full) | -1.182 ±0.127 | 0.589 ±0.000 | 0.123 ±0.000 | 0.626 ±0.001 |
| | PerTA-grad (ours) | -0.289 ±0.073 | 0.581 ±0.001 | 0.075 ±0.000 | 0.551 ±0.001 |
| | PerTA-fisher (ours) | -0.576 ±0.000 | 0.586 ±0.000 | 0.085 ±0.001 | 0.600 ±0.001 |
| | PerTA+SoftMax | -1.266 ±0.000 | 0.586 ±0.000 | 0.115 ±0.001 | 0.606 ±0.001 |
| 5% | Full | -11.845 | 0.599 | 0.727 | 0.737 |
| | GT | 0.000 | 0.599 | 0.063 | 0.746 |
| | Random ω | -7.264 ±3.283 | 0.526 ±0.055 | 0.252 ±0.284 | 0.346 ±0.270 |
| | Weighted (ω=0.5) | -1.253 ±0.237 | 0.560 ±0.001 | 0.090 ±0.004 | 0.396 ±0.004 |
| | Pruning (λ=0.5) | -5.321 ±0.105 | 0.484 ±0.003 | 0.049 ±0.002 | 0.155 ±0.005 |
| | PerTA-grad (θ_full) | -0.630 ±0.110 | 0.545 ±0.001 | 0.071 ±0.001 | 0.312 ±0.007 |
| | PerTA-fisher (θ_full) | -0.515 ±0.039 | 0.553 ±0.001 | 0.083 ±0.001 | 0.360 ±0.002 |
| | PerTA-grad (ours) | -0.661 ±0.125 | 0.546 ±0.001 | 0.069 ±0.002 | 0.310 ±0.007 |
| | PerTA-fisher (ours) | -0.339 ±0.115 | 0.553 ±0.001 | 0.077 ±0.002 | 0.348 ±0.002 |
| | PerTA+SoftMax | -1.219 ±0.281 | 0.558 ±0.001 | 0.088 ±0.002 | 0.390 ±0.002 |
| 10% | Full | -21.408 | 0.599 | 0.706 | 0.737 |
| | GT | 0.000 | 0.591 | 0.059 | 0.746 |
| | Random ω | -13.963 ±4.247 | 0.511 ±0.065 | 0.228 ±0.256 | 0.323 ±0.280 |
| | Weighted (ω=0.5) | -2.757 ±0.189 | 0.548 ±0.002 | 0.082 ±0.001 | 0.309 ±0.007 |
| | Pruning (λ=0.5) | -8.760 ±0.282 | 0.483 ±0.003 | 0.049 ±0.001 | 0.136 ±0.002 |
| | PerTA-grad (θ_full) | -1.270 ±0.135 | 0.541 ±0.001 | 0.074 ±0.001 | 0.274 ±0.004 |
| | PerTA-fisher (θ_full) | -2.603 ±0.113 | 0.549 ±0.002 | 0.082 ±0.002 | 0.310 ±0.001 |
| | PerTA-grad (ours) | -1.107 ±0.064 | 0.542 ±0.001 | 0.071 ±0.002 | 0.266 ±0.003 |
| | PerTA-fisher (ours) | -1.686 ±0.265 | 0.548 ±0.001 | 0.077 ±0.002 | 0.295 ±0.006 |
| | PerTA+SoftMax | -2.679 ±0.143 | 0.548 ±0.002 | 0.081 ±0.001 | 0.310 ±0.006 |

Detailed Results of Ablation Studies. Table 7 supplements Figure 6 by showing the detailed quantitative results of different f_oprt(·,·). Random sets the weights in W to values uniformly sampled from [0, 1], i.e., f_oprt(A,B) = rand([0, 1]). Weighted rescales TV by a constant ω, i.e., f_oprt(A,B) = ω; we report ω = 0.5. Pruning removes (i.e., f_oprt(A,B) = 0) the λ% smallest weights in TV to mitigate over-forgetting and keeps the others (i.e., f_oprt(A,B) = 1); we report λ = 0.5. Unlike PerTA-grad and PerTA-fisher, the gradients of PerTA-grad (θ_full) and PerTA-fisher (θ_full) are estimated on θ_full instead of θ_0. PerTA+SoftMax differs from PerTA-grad and PerTA-fisher in that it takes the SoftMax form f_oprt(A,B) = exp(|A|)/(exp(|A|)+exp(|B|)).
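The ablation variants above are all simple elementwise rules over the forget-side and retain-side importance scores. The following is an illustrative NumPy sketch, not the paper's implementation; the function names and toy inputs are our own, with A and B standing in for the forget-side and retain-side scores (e.g., gradient magnitudes).

```python
import numpy as np

rng = np.random.default_rng(0)

def f_random(A, B):
    # Random: uniform weights in [0, 1], ignoring A and B entirely.
    return rng.uniform(0.0, 1.0, size=A.shape)

def f_weighted(A, B, omega=0.5):
    # Weighted: a single constant omega rescales every TV entry.
    return np.full_like(A, omega)

def f_pruning(tv, lam=0.5):
    # Pruning: zero out the lam fraction of smallest-magnitude TV
    # entries, keep the rest at weight 1.
    thresh = np.quantile(np.abs(tv), lam)
    return (np.abs(tv) >= thresh).astype(float)

def f_softmax(A, B):
    # PerTA+SoftMax: exp(|A|) / (exp(|A|) + exp(|B|)).
    ea, eb = np.exp(np.abs(A)), np.exp(np.abs(B))
    return ea / (ea + eb)

A = rng.normal(size=1000)   # toy forget-side scores
B = rng.normal(size=1000)   # toy retain-side scores
tv = rng.normal(size=1000)  # toy task vector
```

All four rules produce weights in [0, 1] per parameter; they differ only in how much of A and B they actually use, which is what the ablation in Table 7 probes.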
We find that the results of Random are highly unstable, often exhibiting large variance, which further increases as the unlearning ratio grows and the task becomes more difficult. When the weight is fixed at 0.5, the Weighted method performs relatively better; however, it still lags behind our proposed PerTA in terms of unlearning capability (as measured by FQ and the ES metric on the forget set). The Pruning method performs well on simple tasks, such as the 1% unlearning setting, but its performance drops sharply as the task difficulty increases with higher unlearning ratios. The SoftMax method achieves both forgetting and retention, yet it remains inferior to PerTA-grad and PerTA-fisher. In addition, the results indicate that estimating gradients on θ_full or θ_0 leads to negligible differences in performance.

Detailed Results of the General Form. Considering the general form f_oprt(A,B) = |A|^τ / (|A|^τ + |B|^τ), which goes beyond the absolute gradient and the diagonal Fisher information approximation, different values of τ can be applied. We conduct experiments for τ ∈ {0, 0.25, 0.5, 1, 2, 4, 8}, with the quantitative results shown in Table 8 as a supplement to Figure 8. The cases τ=1 and τ=2 correspond to our PerTA-grad and PerTA-fisher, respectively. The results in Table 8 lead to conclusions that are consistent with those discussed in the main body of our paper. PerTA-grad and PerTA-fisher strike a balance between forgetting and retaining: among the different τ-based variants, they achieve relatively strong FQ and ES (D_f) while keeping MU and ES (D_r) at a reasonable level.

Table 8: Results using different τ in f_oprt on TOFU tasks (unlearning 1%, 5% and 10%, using Llama-3.2 1B Instruct, Mean ± Std). Ours are highlighted.
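The general form can be sketched in a few lines; larger τ pushes the per-parameter weights toward the extremes 0 and 1 (more all-or-nothing), which is one way to read the τ = 4 and τ = 8 rows of Table 8. This is a hedged sketch with our own toy inputs; the small `eps` guard against division by zero is our addition, not part of the paper's formula.

```python
import numpy as np

def f_tau(A, B, tau, eps=1e-12):
    # General form |A|^tau / (|A|^tau + |B|^tau).
    # tau=0 reduces to a constant 0.5 (Weighted), tau=1 corresponds
    # to PerTA-grad, and tau=2 to PerTA-fisher.
    a, b = np.abs(A) ** tau, np.abs(B) ** tau
    return a / (a + b + eps)

rng = np.random.default_rng(0)
A, B = rng.normal(size=1000), rng.normal(size=1000)

# Mean distance from 0.5 grows with tau: weights become more extreme.
extremity = {tau: float(np.mean(np.abs(f_tau(A, B, tau) - 0.5)))
             for tau in [0, 0.25, 0.5, 1, 2, 4, 8]}
```

For any fixed ratio |B|/|A| ≠ 1, the weight |A|^τ/(|A|^τ+|B|^τ) moves monotonically away from 0.5 as τ grows, so the τ family interpolates between uniform rescaling (τ=0) and hard parameter selection (τ→∞).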
| Forgetting | Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ |
|---|---|---|---|---|---|
| 1% | Full | -2.170 | 0.599 | 0.743 | 0.737 |
| | GT | 0.000 | 0.599 | 0.069 | 0.751 |
| | τ=0 | -1.451 ±0.131 | 0.587 ±0.000 | 0.116 ±0.000 | 0.611 ±0.004 |
| | τ=0.25 | -0.089 ±0.037 | 0.577 ±0.000 | 0.095 ±0.001 | 0.505 ±0.001 |
| | τ=0.5 | -0.197 ±0.057 | 0.579 ±0.001 | 0.095 ±0.001 | 0.522 ±0.002 |
| | τ=1 (grad, ours) | -0.289 ±0.073 | 0.581 ±0.001 | 0.075 ±0.000 | 0.551 ±0.001 |
| | τ=2 (fisher, ours) | -0.576 ±0.000 | 0.586 ±0.000 | 0.085 ±0.001 | 0.600 ±0.001 |
| | τ=4 | -1.013 ±0.000 | 0.591 ±0.000 | 0.094 ±0.001 | 0.650 ±0.003 |
| | τ=8 | -1.544 ±0.000 | 0.598 ±0.001 | 0.276 ±0.004 | 0.700 ±0.001 |
| 5% | Full | -11.845 | 0.599 | 0.727 | 0.737 |
| | GT | 0.000 | 0.599 | 0.063 | 0.746 |
| | τ=0 | -1.253 ±0.237 | 0.560 ±0.001 | 0.090 ±0.004 | 0.396 ±0.004 |
| | τ=0.25 | -0.784 ±0.090 | 0.543 ±0.002 | 0.068 ±0.001 | 0.299 ±0.008 |
| | τ=0.5 | -0.754 ±0.132 | 0.545 ±0.001 | 0.069 ±0.002 | 0.300 ±0.008 |
| | τ=1 (grad, ours) | -0.661 ±0.125 | 0.546 ±0.001 | 0.069 ±0.002 | 0.310 ±0.007 |
| | τ=2 (fisher, ours) | -0.339 ±0.115 | 0.553 ±0.001 | 0.077 ±0.002 | 0.348 ±0.002 |
| | τ=4 | -0.933 ±0.292 | 0.565 ±0.001 | 0.092 ±0.001 | 0.420 ±0.004 |
| | τ=8 | -7.008 ±0.319 | 0.581 ±0.001 | 0.141 ±0.008 | 0.573 ±0.004 |
| 10% | Full | -21.408 | 0.599 | 0.706 | 0.737 |
| | GT | 0.000 | 0.591 | 0.059 | 0.746 |
| | τ=0 | -2.757 ±0.189 | 0.548 ±0.002 | 0.082 ±0.001 | 0.309 ±0.007 |
| | τ=0.25 | -1.270 ±0.135 | 0.539 ±0.002 | 0.071 ±0.001 | 0.262 ±0.004 |
| | τ=0.5 | -1.186 ±0.066 | 0.541 ±0.002 | 0.072 ±0.002 | 0.265 ±0.004 |
| | τ=1 (grad, ours) | -1.107 ±0.064 | 0.542 ±0.001 | 0.071 ±0.002 | 0.266 ±0.003 |
| | τ=2 (fisher, ours) | -1.686 ±0.265 | 0.548 ±0.001 | 0.077 ±0.002 | 0.295 ±0.006 |
| | τ=4 | -3.490 ±0.210 | 0.558 ±0.001 | 0.088 ±0.002 | 0.354 ±0.009 |
| | τ=8 | -9.796 ±0.462 | 0.575 ±0.001 | 0.138 ±0.000 | 0.493 ±0.003 |

Detailed Results of Running Time. As a supplement to Figure 7, Table 9 shows the quantitative runtime comparison between the best-performing training-based methods, GD and NPO+, and our PerTA.
In contrast to training-based methods that demand multiple iterations, the runtime of PerTA can be broken down into three components: obtaining θ_fgt, computing W, and performing task arithmetic, with the last step being negligible (0.0002 min in Table 9). PerTA thus inherits the efficiency of task arithmetic, yielding substantial runtime savings, a benefit that becomes increasingly evident as task complexity rises (i.e., when unlearning larger proportions). Furthermore, as demonstrated earlier, competitive performance can already be achieved by estimating gradients with only 20% of the data, indicating additional potential for reducing runtime. Collectively, these observations underscore the high time efficiency of PerTA.

Table 9: Time comparison (min) of the best-performing training-based methods, GD and NPO+, and our PerTA (unlearning 1%, 5% and 10%, Llama-3.2 1B Instruct). Blank 'Getting θ_fgt' and 'Task Arithmetic' cells share the values of the PerTA (grad) row in the same block.

| Forgetting | Method | Getting θ_fgt | Calculating W (grad/fisher) | Task Arithmetic | Total |
|---|---|---|---|---|---|
| 1% | GD | - | - | - | 3.4673 |
| | NPO+ | - | - | - | 4.6557 |
| | PerTA (grad) | 0.3944 | 2.0207 | 0.0002 | 2.4153 |
| | PerTA (fisher) | | 2.2118 | | 2.6064 |
| | PerTA (grad) w/ 20% | | 0.4528 | | 0.8474 |
| | PerTA (fisher) w/ 20% | | 0.4188 | | 0.8134 |
| 5% | GD | - | - | - | 5.2072 |
| | NPO+ | - | - | - | 12.3739 |
| | PerTA (grad) | 2.3918 | 2.0253 | 0.0002 | 4.4172 |
| | PerTA (fisher) | | 2.2201 | | 4.6121 |
| | PerTA (grad) w/ 20% | | 0.4378 | | 2.8297 |
| | PerTA (fisher) w/ 20% | | 0.4179 | | 2.8098 |
| 10% | GD | - | - | - | 7.2508 |
| | NPO+ | - | - | - | 23.0168 |
| | PerTA (grad) | 4.8281 | 2.0231 | 0.0002 | 6.8514 |
| | PerTA (fisher) | | 2.2177 | | 7.0459 |
| | PerTA (grad) w/ 20% | | 0.4368 | | 5.2651 |
| | PerTA (fisher) w/ 20% | | 0.4134 | | 5.2416 |

C.3 Different Models for Per-parameter Weights

Figure 14: Visualization of W_grad and W_fisher for parameters in the last two Q, K, V attention layers (left), and corresponding ES on forget and retain sets (right), when employing θ_0 or θ_full to estimate W_grad and W_fisher (unlearning 1% on TOFU, using Llama-3.2 1B Instruct).
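The three runtime components can be mocked end-to-end on toy vectors. The sketch below is our schematic reading of the pipeline, not the paper's exact equations: the update θ_final = θ_full − W ⊙ (θ_fgt − θ_full) and the random stand-in "gradients" are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: obtain theta_fgt (here: a simulated fine-tune of theta_full
# on the forget set, modeled as a small perturbation).
theta_full = rng.normal(size=10_000)
theta_fgt = theta_full + 0.01 * rng.normal(size=theta_full.shape)

# Step 2: compute per-parameter weights W from forget/retain gradient
# magnitudes (PerTA-grad form); real gradients replaced by stand-ins.
g_forget = np.abs(rng.normal(size=theta_full.shape))
g_retain = np.abs(rng.normal(size=theta_full.shape))
W = g_forget / (g_forget + g_retain + 1e-12)

# Step 3: weighted task arithmetic -- a single elementwise update,
# which is why its runtime is negligible in Table 9.
tv = theta_fgt - theta_full
theta_final = theta_full - W * tv
```

Because steps 1 and 2 each need only one pass over (a fraction of) the data and step 3 is a single elementwise operation, the total cost stays far below iterative training-based unlearning.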
To further illustrate the difference between using θ_0 (the retained LLM) and θ_full (the finetuned LLM) to predict W, as shown in Table 7, Figure 14 presents a comparison. The left side of Figure 14 visualizes the weight magnitudes of W (predicted by θ_0 and θ_full, respectively) corresponding to the Q, K, and V matrices in the last two attention layers, while the right side reports the corresponding ES scores in bar plots. From the visualizations on the left, we observe that both PerTA-grad and PerTA-fisher exhibit highly similar patterns regardless of whether W is predicted by θ_0 or θ_full (highlighted by the black boxes). This indicates that the key parameters (those with large weights) are largely consistent across the two predictors, and vice versa. On the right, the ES results confirm this observation: the numerical metrics are very close, consistent with Table 7. These findings suggest that either θ_0 or θ_full can be used to predict W, with negligible differences. A plausible explanation is that the gap between the pretrained model and the finetuned model is relatively small. This conclusion further supports the applicability of PerTA to post-training models, thereby broadening its range of use cases.

C.4 Visualization Results of Weights

Figure 15: Visualization of TV, W_grad, and W_fisher for parameters in the 0-th and 1-st Q, K, V attention layers (unlearning 1% on TOFU, using Llama-3.2 1B Instruct).

Figure 16: Visualization of TV, W_grad, and W_fisher for parameters in the 7-th and 8-th Q, K, V attention layers (unlearning 1% on TOFU, using Llama-3.2 1B Instruct).

Figure 17: Visualization of TV, W_grad, and W_fisher for parameters in the 14-th and 15-th Q, K, V attention layers (unlearning 1% on TOFU, using Llama-3.2 1B Instruct).

Figures 15-17 visualize the weight magnitudes of the Q, K, and V matrices in the shallow, middle, and final attention layers of the LLM for both TV and W.
For TV, we observe that the weight magnitudes increase progressively from shallow to deeper layers, indicating that the magnitude of parameter changes induced by unlearning grows with layer depth. In contrast, the analysis of W may provide insight into the layer-wise sensitivity of LLM parameters to the differences between forget and retain data. We highlight two key observations. First, compared to PerTA-grad, PerTA-fisher exhibits more pronounced weight differences (as evidenced by the larger contrast between light and dark regions in Figures 15-17). This is because PerTA-fisher relies on the squared gradients rather than the raw gradients, thereby amplifying the differences between the forget and retain sets. In practice, however, both PerTA-grad and PerTA-fisher yield similar performance on the evaluation metrics, suggesting that either variant can be employed effectively. Second, relative to the middle layers of the LLM, the initial and final layers contain more weights close to the extremes (i.e., near 0 or 1). This implies that parameters in the shallow and final layers are more sensitive to gradient differences between the forget and retain sets. Interestingly, this aligns with prior findings on LLM representations [15, 39]: shallow layers primarily capture surface features (e.g., words, subwords, positional information), middle layers encode syntactic features, and final layers specialize in semantic features. The results in Figures 15-17 are consistent with this interpretation. Specifically, surface and semantic features exhibit greater discrepancies between forget and retain sets (e.g., TOFU involves differences in author names, domain-specific terminology, and deeper semantic associations with personal information), whereas syntactic structures remain largely unaffected. Consequently, our flexible PerTA assigns larger weight differences to parameters in the shallow and final layers.
This insight suggests a potential future direction for further optimization: pruning or fixing selected middle layers to reduce computational overhead without sacrificing performance.

Table 10: Results of different methods on unlearning 5% of TOFU, using Llama-3.2 8B as the pretrained model.

| Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | ES(D_r)-ES(D_f)↑ | Gib↑ |
|---|---|---|---|---|---|---|
| Full (reference) | -12.184 | 0.628 | 0.972 | 0.992 | 0.020 | 0.852 |
| GT (reference) | 0.000 | 0.632 | 0.074 | 0.992 | 0.918 | 0.886 |
| GA | -118.712 | 0.000 | 0.033 | 0.035 | 0.002 | 0.038 |
| GD | -10.225 | 0.509 | 0.158 | 0.397 | 0.239 | 0.811 |
| NPO | -11.183 | 0.131 | 0.033 | 0.037 | 0.004 | 0.141 |
| NPO+ | -7.888 | 0.569 | 0.160 | 0.521 | 0.361 | 0.914 |
| PerTA-grad (ours) | -4.529 | 0.659 | 0.164 | 0.882 | 0.718 | 0.895 |

Table 11: Results of different methods on unlearning 5% of TOFU, using Phi-3.5 as the pretrained model.

| Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | ES(D_r)-ES(D_f)↑ | Gib↑ |
|---|---|---|---|---|---|---|
| Full (reference) | -13.232 | 0.693 | 0.868 | 0.835 | -0.033 | 0.866 |
| GT (reference) | 0.000 | 0.678 | 0.082 | 0.855 | 0.773 | 0.881 |
| GA | -11.511 | 0.073 | 0.027 | 0.028 | 0.001 | 0.822 |
| GD | -11.183 | 0.665 | 0.344 | 0.574 | 0.231 | 0.875 |
| NPO | -12.877 | 0.278 | 0.538 | 0.594 | 0.057 | 0.855 |
| NPO+ | -10.859 | 0.552 | 0.591 | 0.761 | 0.170 | 0.877 |
| PerTA-grad (ours) | -3.548 | 0.667 | 0.107 | 0.412 | 0.305 | 0.879 |

Table 12: Average results of PerTA with quantization attacks on TOFU 1%, 5%, 10% unlearning tasks.

| Method | FQ↑ | MU↑ | ES(D_f)↓ | ES(D_r)↑ | ES(D_r)-ES(D_f)↑ | Gib↑ |
|---|---|---|---|---|---|---|
| Full | -11.808 | 0.599 | 0.726 | 0.737 | 0.011 | 0.871 |
| GT | 0.000 | 0.596 | 0.064 | 0.748 | 0.684 | 0.894 |
| GA | -81.114 | 0.199 | 0.086 | 0.244 | 0.157 | 0.484 |
| GD | -8.720 | 0.491 | 0.112 | 0.295 | 0.183 | 0.789 |
| NPO | -4.842 | 0.198 | 0.086 | 0.246 | 0.160 | 0.592 |
| NPO+ | -3.528 | 0.493 | 0.122 | 0.316 | 0.194 | 0.911 |
| PerTA-grad (ours) w/o attack | -0.686 | 0.556 | 0.072 | 0.376 | 0.304 | 0.915 |
| PerTA-grad (ours) w/ attack | -1.340 | 0.560 | 0.095 | 0.421 | 0.325 | 0.909 |

C.5 Results on Larger Models and Alternative LLM Families

Tables 10 and 11 present our method's performance on larger models and on models from other LLM families.
The results indicate that our PerTA exhibits good generalization ability: it achieves competitive unlearning performance even when applied to larger models and different types of LLMs.

C.6 Results of Quantization Attacks

Recent research [55] has found that applying quantization to models that have undergone unlearning can restore the "forgotten" information. It is therefore crucial to conduct attack experiments on PerTA to assess its robustness. Accordingly, we evaluate the model after unlearning, using Llama-3.2 1B as an example, and report the results in Table 12. The results show that, fortunately, the impact of quantization on PerTA is limited, and PerTA still outperforms other methods after the attack.
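As a rough picture of what such an attack does to the weights, the sketch below applies a symmetric round-to-nearest uniform quantizer to toy unlearned parameters and bounds how far each weight can move. The 4-bit width and the quantization scheme here are illustrative assumptions of ours, not the exact setup of [55].

```python
import numpy as np

def quantize(w, bits=4):
    # Symmetric uniform round-to-nearest quantization: snap each weight
    # to the nearest multiple of `scale`; every weight moves by at most
    # scale / 2.
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale, scale

rng = np.random.default_rng(0)
theta_unlearned = rng.normal(size=10_000)   # stand-in unlearned weights
theta_q, scale = quantize(theta_unlearned, bits=4)
max_err = float(np.max(np.abs(theta_q - theta_unlearned)))
```

The attack's premise is that this bounded rounding can undo small unlearning edits; checking PerTA's metrics on the quantized model (Table 12) is what rules that out here.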