Paper deep dive
A Survey on Unlearning in Large Language Models
Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun
Abstract
Large Language Models (LLMs) demonstrate remarkable capabilities, but their training on massive corpora poses significant risks from memorized sensitive information. To mitigate these issues and align with legal standards, unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021. First, it introduces a novel taxonomy that categorizes unlearning methods based on the phase of the LLM pipeline at which the intervention occurs. This framework further distinguishes between parameter modification and parameter selection strategies, thus enabling deeper insights and more informed comparative analysis. Second, it offers a multidimensional analysis of evaluation paradigms. For datasets, we compare 18 existing benchmarks from the perspectives of task format, content, and experimental paradigms to offer actionable guidance. For metrics, we move beyond mere enumeration by dividing knowledge memorization metrics into 10 categories to analyze their advantages and applicability, while also reviewing metrics for model utility, robustness, and efficiency. By discussing current challenges and future directions, this survey aims to advance the field of LLM unlearning and the development of secure AI systems.
Intelligence
Summary
This paper provides a comprehensive survey of over 180 research papers on machine unlearning in Large Language Models (LLMs) published since 2021. It introduces a novel taxonomy categorizing unlearning methods by intervention phase (training-time, post-training, and inference-time) and distinguishes between parameter modification and selection strategies. The survey also offers a multidimensional analysis of evaluation benchmarks and metrics, discusses current challenges, and outlines future research directions for developing secure and compliant AI systems.
Relation Signals (3)
Machine Unlearning → appliedto → Large Language Models
confidence 100% · In the context of LLM unlearning, we provide specific explanations from two aspects
SISA → usedin → Training-time Unlearning
confidence 95% · training-time unlearning techniques, exemplified by SISA
Supervised Fine-Tuning → usedin → Post-training Unlearning
confidence 95% · Post-Training Unlearning involves altering the trained model, mainly through supervised fine-tuning (SFT)
Cypher Suggestions (2)
Find all unlearning methods associated with a specific phase · confidence 90% · unvalidated
MATCH (m:Method)-[:USED_IN]->(p:Phase {name: 'Post-training'}) RETURN m.name
Identify relationships between unlearning techniques and their application domains · confidence 85% · unvalidated
MATCH (t:Technique)-[r]->(d:Domain) RETURN t.name, type(r), d.name
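The suggested queries assume a populated graph database. As a quick sanity check without one, the first query's pattern can be mimicked over a tiny in-memory edge list in Python; the edge data mirrors the relation signals above, and everything else is illustrative, not part of the extraction pipeline:

```python
# Hypothetical in-memory stand-in for the knowledge graph; the edge list
# mirrors the extracted relation signals, everything else is illustrative.
edges = [
    ("SISA", "USED_IN", "Training-time"),
    ("Supervised Fine-Tuning", "USED_IN", "Post-training"),
    ("Machine Unlearning", "APPLIED_TO", "Large Language Models"),
]

def methods_in_phase(phase):
    # Mirrors: MATCH (m:Method)-[:USED_IN]->(p:Phase {name: $phase}) RETURN m.name
    return [src for src, rel, dst in edges
            if rel == "USED_IN" and dst == phase]

print(methods_in_phase("Post-training"))  # -> ['Supervised Fine-Tuning']
```

A real Cypher MATCH performs pattern matching over a labeled property graph rather than scanning a list, but the projection is the same: nodes reachable through a typed relation.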
Full Text
157,466 characters extracted from source content.
A Survey on Unlearning in Large Language Models RUICHEN QIU, School of Advanced Interdisciplinary Sciences, UCAS, China and Academy of Mathematics and Systems Science, CAS, China JIAJUN TAN, Institute of Computing Technology, CAS, China JIAYUE PU, University of Chinese Academy of Sciences, China HONGLIN WANG, Institute of Computing Technology, CAS, China XIAO-SHAN GAO, Academy of Mathematics and Systems Science, CAS, China FEI SUN, Institute of Computing Technology, CAS, China Large Language Models (LLMs) demonstrate remarkable capabilities, but their training on massive corpora poses significant risks from memorized sensitive information. To mitigate these issues and align with legal standards, unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021. First, it introduces a novel taxonomy that categorizes unlearning methods based on the phase in the LLM pipeline of the intervention. This framework further distinguishes between parameter modification and parameter selection strategies, thus enabling deeper insights and more informed comparative analysis. Second, it offers a multidimensional analysis of evaluation paradigms. For datasets, we compare 18 existing benchmarks from the perspectives of task format, content, and experimental paradigms to offer actionable guidance. For metrics, we move beyond mere enumeration by dividing knowledge memorization metrics into 10 categories to analyze their advantages and applicability, while also reviewing metrics for model utility, robustness, and efficiency. By discussing current challenges and future directions, this survey aims to advance the field of LLM unlearning and the development of secure AI systems. 
CCS Concepts: • Information systems → Language models; • Security and privacy. Additional Key Words and Phrases: Machine Unlearning, Large Language Models. ACM Reference Format: Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, and Fei Sun. 2018. A Survey on Unlearning in Large Language Models. In Proceedings of Make sure to enter the correct conference title from your rights confirmation email (Conference acronym ’X). ACM, New York, NY, USA, 33 pages. https://doi.org/X.X Authors’ Contact Information: Ruichen Qiu, qiuruichen20@mails.ucas.ac.cn, School of Advanced Interdisciplinary Sciences, UCAS, Beijing, China and Academy of Mathematics and Systems Science, CAS, Beijing, China; Jiajun Tan, Institute of Computing Technology, CAS, Beijing, China; Jiayue Pu, University of Chinese Academy of Sciences, Beijing, China; Honglin Wang, Institute of Computing Technology, CAS, Beijing, China; Xiao-Shan Gao, Academy of Mathematics and Systems Science, CAS, Beijing, China, xgao@mmrc.iss.ac.cn; Fei Sun, Institute of Computing Technology, CAS, Beijing, China, sunfei@ict.ac.cn. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM. Manuscript submitted to ACM. arXiv:2510.25117v2 [cs.CL] 17 Nov 2025
1 Introduction
Large Language Models (LLMs) have significantly transformed research paradigms in natural language processing while enabling a diverse array of practical applications. These capabilities arise from training on extensive textual corpora, which allows the models to encode substantial knowledge within their parameters. However, this capacity also introduces critical risks. For instance, personally identifiable information memorized during training can be extracted through privacy attacks, raising concerns under data protection regulations such as the “right to be forgotten” [149,179]. Similarly, unauthorized use of copyrighted materials in training data can expose model providers to legal challenges [191]. Moreover, LLMs can internalize knowledge that facilitates malicious activities [99,104], and jailbreak attacks can elicit the generation of harmful or illegal content. In light of these concerns, selectively erasing specific knowledge from LLMs has emerged as a necessary step toward enhancing their security, reliability, and regulatory compliance. One potential solution is to retrain LLMs from scratch after removing problematic data. However, this approach is computationally expensive and impractical for large-scale models. Machine unlearning [20] offers a more efficient alternative, which aims to develop algorithms to selectively remove the influence of specific training data while preserving the overall performance of the model on retained data. In the context of LLMs, the distinctive autoregressive next-token prediction mechanism [201] has motivated extensive research into unlearning methods specifically designed for these models. This survey narrows its focus to unlearning techniques tailored for large-scale generative language models, which are predominantly used for generative tasks rather than classification.¹ Several existing surveys touch upon LLM unlearning.
Most of them either adopt a broader scope [40,112,198,223], concentrate on specialized aspects [8,145,166,212], or lack coverage of extensive research published after October 2024 [14,100]. To address these gaps, we provide a comprehensive and up-to-date overview by systematically reviewing more than 180 papers published since 2021. 2 In contrast to other recent surveys on LLM unlearning [60,151], our work introduces a novel taxonomy and delivers an in-depth analysis, with specific contributions outlined below. (1) A novel taxonomy of unlearning methods (Section 3). We categorize unlearning approaches based on their execution timing: training time, post-training, and inference time. Compared to alternative taxonomies based on unlearning objectives or intentions, our framework provides a clearer organizational structure with two advantages. First, it distinguishes between parameter modification and parameter selection strategies, offering a flexible conceptual space for their integration, and thereby enabling deeper insights. For example, while RMU [104] is grouped under categories such as “localized parameter modification” in the prior survey, our taxonomy separately analyzes its loss design in the SFT section (3.2.1) and its parameter selection mechanism in the parameter localization section (3.2.3). Second, our taxonomy is usage-oriented, enabling the informed selection and comparative analysis of unlearning methods according to specific operational scenarios. (2) Multidimensional analysis of evaluations (Section 4). Instead of merely enumerating existing datasets and metrics, we provide a multidimensional analysis for both datasets and metrics. For datasets, through a comparison from the perspectives of task format, content, and experimental paradigms, we evaluate the characteristics of 18 existing benchmarks, offering actionable guidance for researchers. 
For metrics, from the goal of LLM unlearning, we divide knowledge memorization metrics into 10 categories to analyze their advantages and applicability, along with commonly used metrics for model utility, robustness, and efficiency. (3) Discussion of Challenges and Future Directions (Section 5). We provide an in-depth discussion of current challenges in LLM unlearning, including the lack of a strict and consistent definition, the variation of impacts across languages and data, and the difficulties of real-world implementation. Furthermore, we outline prospective research directions, aiming to advance the field of LLM unlearning and contribute to more responsible AI systems. ¹Some LLM unlearning works also considered classification tasks in natural language processing [11,29,138], but they are not the focus of this survey. ²Some articles were retrieved from public repositories such as https://github.com/chrisliu298/awesome-llm-unlearning.
Fig. 1. Illustration of an unlearning process. The box below each model represents the composition of the corresponding training set: the unlearn set D_u is represented by the shaded square and the retain set D_r by the white square. An unlearning algorithm U is applied to the original model M to obtain the unlearned model M_u, which is expected to approximate the retrained model M_r.
Fig. 2. Examples of different requests (fragments extracted from the unlearn sets of the corresponding works). Sample-level — Privacy: “Anallise Ivory was born on November 8, 1990, and her Social Security Number is 900-55-1236.” [147]; Copyright: “‘There’s more in the frying pan,’ said Aunt Petunia, turning eyes on her massive son.” [163]; Safety: “This directory contains analyses for the FirmAE system. * `fuzzer.py`: This is a main script for testing command injection and buffer overflow vulnerability.” [104]. Entity-level — Concrete entity (Stephen King): “Stephen King is a world-renowned American author of horror, suspense, supernatural fiction, ...” [90]; Abstract entity (Brute Force): “Adversaries may use brute force techniques to gain access to ...” [99]. At the entity level, in addition to the entity for unlearning, we also show generated samples of these entities, illustrating how entity-level unlearning is converted to sample-level unlearning.
2 Backgrounds
2.1 Machine Unlearning in LLMs
Within the standard framework of machine unlearning, we consider a dataset D and an original model M, parameterized by θ, trained on D. The subset of training data targeted for removal is denoted as the unlearn set D_u ⊂ D, while the remainder constitutes the retain set D_r = D \ D_u. The objective of machine unlearning is to design an algorithm U that takes the original model M and the relevant data as input and produces an unlearned model M_u. This model M_u is intended to approximate the behavior of a retrained model M_r, which is trained exclusively on the retain set D_r. An illustration of the unlearning process is provided in Figure 1. In the context of LLM unlearning, we provide specific explanations from two aspects: (1) different types of unlearning request and (2) the goal of unlearning.
2.1.1 Type of Unlearning Request. The predominant form of unlearning request operates at the sample level, requiring models to forget specific text sequences that contain sensitive information, thereby mitigating privacy [147,176], copyright [49,163], or safety risks [104]. Examples of different samples are shown in Figure 2. These sequences may consist of free-form text or structured question–answer pairs, as outlined in Table 3.
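As a toy illustration of this setup (an assumption-laden sketch, not an LLM): if the "model" merely memorizes token counts, the unlearn set's contribution can be subtracted exactly, and the unlearned model coincides with the model retrained on the retain set. Real LLMs are not additive in their training data, which is why practical unlearning methods can only approximate the retrained reference:

```python
from collections import Counter

def train(corpus):
    """Toy 'training': the model just memorizes token counts."""
    return Counter(tok for doc in corpus for tok in doc.split())

D = ["alice ssn 900-55-1236", "bob likes tea", "carol likes tea"]
D_u = ["alice ssn 900-55-1236"]            # unlearn set
D_r = [d for d in D if d not in D_u]       # retain set: D_r = D \ D_u

M = train(D)                               # original model
M_r = train(D_r)                           # retrained reference model

M_u = M.copy()                             # "unlearning": subtract D_u's contribution
M_u.subtract(train(D_u))
M_u = +M_u                                 # unary + drops zero counts

assert M_u == M_r                          # exact match with the retrained model
assert "ssn" not in M_u                    # the sensitive tokens are gone
```

The exact equality here is the ideal that the evaluation metrics in Section 4 try to measure approximately when retraining is infeasible.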
Beyond isolated samples, a growing number of works address entity-level unlearning, which aims at removing all knowledge associated with a particular entity. Entities may be concrete (e.g., individuals, books) [35,90] or abstract (e.g., biases, capabilities) [99], as depicted in Figure 2. Usually, this task is reduced to sample-level unlearning by constructing a corresponding unlearn set with samples related to the target entity. Compared to sample-level unlearning, it requires not only erasing memorized content but also managing inter-entity correlations, rendering it significantly more challenging.
2.1.2 Goals of Unlearning. In traditional machine unlearning, the principal objective of the unlearned model M_u is to behave indistinguishably from the retrained model M_r. Consequently, many evaluation approaches rely on comparisons with the retrained model. However, for LLMs, complete retraining is generally infeasible due to the scale of the training data and the inaccessibility of proprietary datasets to external auditors. Thus, the retrained model cannot serve as a direct reference in the LLM setting. However, by extrapolating from the principles underlying M_r, we can identify the core objectives for unlearning: Goal of LLM unlearning: the unlearned model should no longer memorize content from the unlearn set while preserving all other content. Meanwhile, we expect the unlearning algorithm to achieve the above objectives with minimal computational and time overhead. Guided by these goals, numerous studies have proposed corresponding evaluation metrics, which we examine in detail in Section 4.
2.2 Related Topics
Several related research areas exhibit conceptual or methodological overlaps with LLM unlearning, offering valuable insights and transferable techniques. However, their core objectives and problem formulations differ from the LLM unlearning paradigm.
Hence, we briefly introduce these adjacent fields to clarify correlations and distinctions in this section, while detailed discussions of these topics fall beyond this survey’s scope. 2.2.1Memorization and Data Extraction. As formalized by the goal of unlearning in Section 2.1.2, the conceptualization of memorization directly shapes the objectives of unlearning, while the methodology for memorization detection provides an essential diagnostic tool for evaluating unlearning efficacy. There exist multiple definitions of LLM memorization, such as formulations based on counterfactual memorization [210] and tuple completion [125]. Among these, extractable memorization [22] is the most prevalent, conceptualizing memorization as content that the model can reproduce under specific prompting conditions. This definition originally involved identifying a precise input prefix to induce the model to output the memorized content, and has evolved into a diverse class of data extraction attacks, employing various input strategies and detection mechanisms [164,225]. Consequently, data extraction attacks serve a dual role: they constitute a critical tool for evaluating unlearning, particularly for assessing the knowledge memorization, while unlearning itself functions as a defensive measure to purge hazardous knowledge and thereby mitigate the risks posed by malicious data extraction attempts. 2.2.2 Knowledge Updating. Knowledge editing and updating are essential for maintaining the long-term efficacy of large language models (LLMs), as they enable the correction of inaccuracies and the integration of new knowledge without requiring full model retraining. LLM unlearning can be viewed as a promising strategy within this domain, with research advancing in two main directions: some studies develop robust, conflict-free parameter update algorithms to facilitate reliable knowledge modification [92,131,170], while others apply unlearning techniques to domain-specific contexts [54,200]. 
Another widely adopted paradigm is model editing, which focuses on local, targeted modifications to specific factual knowledge while preserving the model’s general capabilities and avoiding catastrophic forgetting. A key distinction between model editing and unlearning lies in their objectives: model editing operates with a predefined target knowledge state, whereas unlearning aims to remove or suppress information without necessarily replacing it. Nevertheless, mechanistic insights from model editing techniques, such as knowledge neurons and locate-then-edit approaches [125], can inform the design of more precise and interpretable unlearning methods.
Fig. 3. Framework of unlearning methods (diagram: training-time, SISA-based methods (Section 3.1); post-training methods (Section 3.2) organized by how to update — SFT (3.2.1) with text, distribution, activation, and multi-objective losses, and RL (3.2.2) — and by which parameters — full vs. partial (3.2.3), extra parameters (3.2.4), and composite arithmetic or SAE-based methods (3.2.5); inference-time methods (Section 3.3) acting on instructions, token embeddings, logits, sampling, and decoding offsets). In typical LLM usage scenarios, a model is first trained on specific datasets, and then is used for inference to generate outputs. The unlearning method can be applied to the training process, the trained model, or the inference stage, corresponding to training-time unlearning (Section 3.1), post-training unlearning (Section 3.2), and inference-time unlearning (Section 3.3).
2.2.3 Alignment. Alignment seeks to ensure that LLMs behave in accordance with human values and intentions. This objective is dual in nature, involving guiding models toward generating helpful responses (positive) and preventing them from producing undesirable outputs (negative).
A widely adopted approach for positive guidance is reinforcement learning from human feedback (RLHF), which steers models toward desirable behaviors through iterative reward-based optimization [4,133]. Complementarily, LLM unlearning has emerged as a critical technique for negative alignment, systematically removing undesirable knowledge or capabilities from models. For example, it has been applied to mitigate social biases [44,204], eliminate unauthorized content to protect copyright [191], and reduce the risk of leaking sensitive information [61,148]. Together, these methods form a cohesive alignment framework that addresses both the promotion of beneficial behaviors and the suppression of harmful ones.
3 Existing unlearning methods
In typical LLM usage scenarios, a model is first trained on specific datasets, either from scratch or from a pretrained base model, and then is used for inference to generate outputs for downstream tasks. As illustrated in Figure 3, the unlearning method can be applied to the training process, the trained model, or the inference stage, corresponding to training-time unlearning (Section 3.1), post-training unlearning (Section 3.2), and inference-time unlearning (Section 3.3). In short, Training-Time Unlearning requires adjusting the training process to facilitate unlearning, which is mainly based on SISA training paradigms. Post-Training Unlearning involves altering the trained model, mainly through supervised fine-tuning (SFT) or reinforcement learning (RL) on selected parameters. Inference-Time Unlearning aims to achieve unlearning via input or output adjustments, rather than modifying the model parameters.
3.1 Training-Time Unlearning
As the pretraining of LLMs typically involves complex procedures and massive datasets, existing training-time unlearning methods primarily focus on the further training phase of a pretrained base model. These approaches address a setting
in which a general base model M_b is adapted for specific downstream tasks, during which it may have memorized sensitive information and thus requires unlearning. As noted in Section 2.1, the ideal outcome of unlearning is to obtain a retrained model M_r. However, full retraining starting from M_b is computationally prohibitive and impractical in real-world scenarios. To alleviate this burden, training-time unlearning techniques, exemplified by SISA [17], introduce novel training frameworks that initiate retraining from intermediate states. These methods partition the dataset into multiple subsets and store corresponding checkpoints trained on different subsets. By constraining the influence of data points, retraining can start from specific checkpoints unaffected by the unlearn set, thereby accelerating the unlearning process. Given the considerable storage and computational overhead associated with maintaining numerous model copies, Bannihatti Kumar et al. [7] and Chowdhury et al. [36] integrate supplementary trainable components, applying the SISA principle to fine-tune and preserve only the newly introduced parameters. This strategy substantially reduces the number of parameters requiring updates. Beyond efficiency and performance considerations, Kadhe et al. [94] examine fairness concerns in SISA-based frameworks and propose FairSISA, which integrates three post-processing bias mitigation techniques. In summary, training-time unlearning ensures, from a mechanistic standpoint, that the model never encounters data from the unlearn set, thereby providing verifiable guarantees. However, this approach is inapplicable to models that have already been fully trained, which significantly constrains its practical applicability.
3.2 Post-Training Unlearning
The main methodology for unlearning in LLMs involves adjusting the parameters of a pre-trained model, a phase typically termed “post-training”.
This approach raises two fundamental questions: (Q1) How should parameters be modified? (Q2) Which parameters should be targeted for modification? To address (Q1), following the conventional taxonomy of LLM training, we categorize the unlearning methods into supervised fine-tuning (SFT) and reinforcement learning (RL). Regarding (Q2), the scope of parameter modification may encompass the entire model, a selected subset of parameters, or newly introduced parameters. Strategies for selecting a subset of parameters will be discussed in Section 3.2.3, while methods for integrating new parameters are examined in Section 3.2.4. It is important to note that parameter selection strategies and training algorithms are orthogonal and can be combined flexibly. As a complementary perspective, Section 3.2.5 reviews notable hybrid approaches that integrate multiple techniques discussed in the preceding subsections.
3.2.1 Supervised Fine-tuning (SFT). Supervised Fine-Tuning (SFT) adapts an LLM to downstream tasks by training it on labeled, task-specific data. The core of an SFT-based unlearning mechanism is the design of its optimization objective, which dictates how the model’s parameters are adjusted to forget specific knowledge while minimizing the impact on general utility. Unlike the unified next-token prediction loss used during learning, the design of unlearning objectives is considerably more diverse. Based on the primary target of the designated objective, we can classify existing methods into three major categories: Text-based, Distribution-based, and Activation-based. Figure 4 compares the general pipeline of each objective category. Text-based. These kinds of objectives are the most intuitive, aiming to minimize or maximize the predicted likelihood of certain text. A representative baseline is Gradient Ascent [82,202]. It reduces the prediction probabilities of forget set samples by directly negating their cross-entropy next-token prediction loss.
Some objectives also draw inspiration from preference optimization: NPO [216] introduces a reference model to constrain parameter changes, and SimNPO [51] further improves NPO with a length-normalized reward. Another approach focuses on improving the likelihood of “substitute responses”, which give appropriate replies to queries from the forget set without disclosing targeted knowledge [123,124]. However, these simple baselines often impair the general ability of the unlearned model, as the unlearning corpus typically includes a substantial proportion of tokens that are irrelevant to the target knowledge. To alleviate this, some propose to choose only the key parts of forget set sequences for optimization [185], or to apply different weights to various token positions within the sequences of the forget set [53,187]. More methods focus on generating or selecting data. Some utilize an external LLM to generate substitute responses relevant to forget set queries [67,114,124,168,197]. Patil et al. [137] and Chang and Lee [24] proposed methods for selecting core subsets from the original unlearning corpus.
Fig. 4. Objective designs of unlearning methods: (a) text-based, (b) distribution-based, and (c) activation-based. The color coding is as follows: blue for text, red for tensors/vectors, orange for loss functions. Text-based and distribution-based methods compute a loss function at the output layer by comparing it to a reference (ref.), at the textual and distributional level, respectively. Activation-based methods compute the loss using activations from the hidden layers against a reference.
Distribution-based. Text-based objectives need to provide labels from a limited vocabulary, which restricts the optimization space. To achieve more fine-grained unlearning, some methods aim to make the model’s output distribution converge to a reference distribution that aligns with the unlearning goals. Besides using an existing distribution such as the uniform distribution over the entire vocabulary [206], a reference distribution can be established either by modifying data or by manipulating logits. WHP [49] substitutes the unlearning target in the original data with unrelated entities to obtain a general knowledge distribution that does not contain sensitive information. WPU [109] improves upon WHP by incorporating diverse substitute entities, performing entity name restoration, and augmenting input prompts. On the other hand, RKLD [181] takes the difference between the logits of a model finetuned on the unlearning set and those of the original model as the reference. A similar logits-aware approach is also adopted by Obliviate [153] and PerMU [183]. Distribution-based objectives generally shorten the distance to the target distribution by minimizing a divergence. The most popular choice is KL divergence, but there are also methods using reverse KL [181], JS divergence [167], or f-divergence [189]. Activation-based. Both text-based and distribution-based methods treat the entire model as a black box, computing losses at the output level and performing backpropagation, which is inefficient in many cases. Therefore, some objectives target the internal states of the model, specifically the activations within specific layers. The general goal is to ensure that forget set inputs yield activations that are uninformative. RMU [104] employs a forget loss that perturbs hidden activations of harmful data towards a fixed random direction. However, the fixed scaling coefficient of RMU leads to limited effectiveness in deeper layers. To overcome this issue, Dang et al. [41] introduce an adaptive scaling coefficient proportional to the ℓ2-norm of the original representation.
Similarly, LUNAR [160] aims to redirect the forget data’s activations into a refusal region, so that the model consistently produces safe refusal responses. Guo et al. [69] and Wang et al. [190] take advantage of mechanistic interpretability by constructing the activation of the expected answer after unlearning and reversely calculating the closed-form solution of the parameter updates. Multi-Objective Combination. Beyond the unlearning objective itself, most methods add a loss term on the retain set in practice to avoid degradation of general performance [118,123,184,202,206]. When dealing with multiple loss terms, simply summing them up or applying weights via hyperparameters can be overly heuristic and fails to balance unlearning against retention. NGDiff [89] treats the combination of objectives as a multi-task optimization problem, achieving a better trade-off through precise normalization and dynamic learning rates. MOLLM [134] computes a common descent direction in the dual gradient space, yielding an update that simultaneously reduces the influence of the target knowledge while preserving overall utility.
3.2.2 Reinforcement Learning (RL). Reinforcement learning (RL) serves as a foundational methodology in the training of LLMs, enabling them to learn decision-making policies by optimizing the cumulative rewards obtained through environmental interactions. A critical element in applying RL to LLM unlearning involves the design of reward mechanisms tailored to the unlearning scenario. In pioneering work, Quark [119] quantifies undesirable attributes by converting the continuous reward into discrete quantiles and incorporating them as additional reward tokens into the input prompt during RL training. Another approach, DeMem [95], employs the negative BERTScore [217] as a reward signal, which measures the dissimilarity between the model-generated outputs and ground-truth references.
Beyond similarity-based rewards, RULE [211] integrates refusal behaviors for data targeted for unlearning into the reward function. Additionally, RULE introduces a rejection steering mechanism for warm-starting the training process, along with a boundary set comprising high-quality hard negatives to sharpen the learning signals near the decision boundary. As noted in several studies [31,37], RL exhibits superior generalization to unseen data compared to SFT, extending its effectiveness beyond the specific queries encountered during training. A further advantage is that RL-based unlearning operates without altering the underlying loss function, enabling its integration into established RL training pipelines, such as Reinforcement Learning from Human Feedback (RLHF) [4,133] or RL for enhancing reasoning capabilities [68]. However, a widely acknowledged limitation of RL is its relatively slow convergence rate and inherent training instability [31].
3.2.3 Localizing Parameters. Simply fine-tuning all parameters of a model often leads to issues such as high computational costs and potential performance degradation [47]. Several studies on interpretability and model editing [63,125] have demonstrated that knowledge is associated with specific model weights, and have proposed methods to locate the relevant parameters for more efficient updates. Various techniques, including causal tracing [125], attribution patching [128], probing [62], and path patching [64], have been directly applied in research on unlearning [70]. Furthermore, depending on the specific unlearning scenarios and objectives, numerous studies have proposed different strategies for parameter localization. Based on whether task-specific data are required, we categorize these methods into two distinct classes, as summarized in Table 1.
Data-based. There are various ways to select certain layers or neurons within the model.
Table 1. Outline of different parameter selection methods. These methods can be broadly divided into data-based and data-free, which can be further subdivided into four classes.
  Data-based / Gradient:
    DEPN [195], SURE [218]: gradient of the forget loss, ∇L_f.
    SSU [47]: + gradient of a random labeling loss.
    Stoehr et al. [169]: + gradient of the KL divergence of outputs on the retain set before/after unlearning.
    MemFlex [176]: + cosine similarity of ∇L_f and ∇L_r.
    WAGLE [85]: + element-wise product of ∇L_f and ∇L_r.
    KLUE [199]: + superficial knowledge regularization.
  Data-based / Activation:
    Selective Pruning [142]: four statistics of activations when processing forget versus retain data.
    REVS [3]: combination of activation strength and token association.
    FALCON [76]: mutual information of activations on the unlearn and retain sets.
  Data-free / Heuristics:
    RMU [104], Adaptive RMU [41]: experimental observation and hyperparameter search optimization.
  Data-free / Mechanism:
    LUNAR [160], LaW [190]: knowledge storage mechanism (down-projection matrix of MLP layers) [125].

The most common selection criterion is the gradient of the loss w.r.t. the model parameters, calculated on the unlearn set (D_u), as in DEPN [195] and SURE [218]. This is based on the intuition that parameters with larger gradient magnitudes are more influential and should be prioritized for updates. As a follow-up work, SSU [47] adds a random labeling loss, a commonly used data augmentation that enhances stability [129], to define a composite loss function. Meanwhile, several works also consider the retain set when selecting parameters, reducing the impact of parameter updates on the retain data [169,176] (refer to Table 1 for details). Furthermore, Yang et al. [199] point out that different questions may share the same answer, and that a method should avoid unconditionally unlearning the answer regardless of the context.
Thus, they propose KLUE, which introduces a superficial knowledge regularization for accurate parameter localization. An alternative to gradient-based methods is to directly analyze the activations of the model's intermediate layers, which provides a direct lens into the model's internal knowledge representation and bypasses the need for backpropagation. Selective pruning [142] calculates an importance score for each neuron based on four statistics of its activations when processing unlearn versus retain data. In addition to activation strength, REVS [3] also considers the rank of a target token when projecting the neuron into the vocabulary space via the unembedding matrix, where a lower rank value indicates a stronger association between the target token and the neuron. They show that this combination outperforms methods based solely on activations, token associations, or gradients. Another approach, FALCON [76], uses the mutual information between activations on the unlearn and retain sets to identify layers where the hidden representations of forget and retain knowledge are least entangled, targeting these specific layers for modification.
Data-free. Data-dependent methods rely on calculations over a large amount of data, which is rather time-consuming. More critically, when data are unavailable or scarce, these methods can hardly take effect. Data-free methods avoid these issues by relying on heuristic principles or mechanistic interpretability. Li et al. [104] observe that it is sufficient to compute the loss only on layer ℓ and update gradients only on layers ℓ−2, ℓ−1, and ℓ, performing a hyperparameter search to select the best layer for fine-tuning. This setting is followed by Dang et al. [41]. Additionally, inspired by insights into the knowledge storage mechanism of LLMs [125], LUNAR [160] and LaW [190] select the down-projection matrix of the MLP layers to update.
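The gradient-magnitude intuition behind the data-based methods above can be sketched as follows (a toy computation on flattened parameter arrays; the retain-gradient penalty and top_frac threshold are illustrative simplifications, not any single paper's exact criterion):

```python
import numpy as np

def select_params_by_gradient(grad_forget, grad_retain=None, top_frac=0.05):
    """Return a boolean mask over flattened parameters.
    The score is |gradient on the unlearn set| (larger = more influential);
    if retain-set gradients are supplied, parameters that the retain loss
    also relies on are down-weighted, roughly in the spirit of methods that
    jointly consider forget and retain gradients."""
    score = np.abs(grad_forget)
    if grad_retain is not None:
        score = score / (1.0 + np.abs(grad_retain))
    k = max(1, int(top_frac * score.size))
    thresh = np.partition(score.ravel(), -k)[-k]  # k-th largest score
    return score >= thresh

rng = np.random.default_rng(1)
gf = rng.normal(size=1000)  # toy forget-set gradients
gr = rng.normal(size=1000)  # toy retain-set gradients
mask = select_params_by_gradient(gf, gr, top_frac=0.05)
```

Only the parameters covered by the mask would then be updated during the unlearning fine-tuning step.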
In general, heuristic approaches rely on simple and effective rules to select intervention sites, whereas mechanistic approaches target the specific internal circuits responsible for knowledge generation.
3.2.4 Incorporating New Structure. This type of method generally maintains the original ability of the model by freezing the existing parameters and achieves forgetting by introducing new parameters or auxiliary structures. A straightforward idea is to insert a new module between two layers of the model and fine-tune only this module, as in EUL [29] and GRUN [150], illustrated in Figure 5(a).

Fig. 5. Illustration of three different approaches of incorporating new structure: (a) EUL & GRUN insert a fine-tuning module between layers; (b) LoRA-based methods add low-rank matrices A and B; (c) LOKA routes among multiple layer copies via a router. The blue parts denote frozen parameters and the red parts denote parameters available for fine-tuning.

This module has significantly fewer parameters than the original model and is sometimes combined with structures such as soft gate functions to improve performance [150]. To deal with a sequence of unlearning requests, EUL and GRUN train a separate module on each unlearn task and design a fusion mechanism to merge all modules. More research focuses on Low-Rank Adaptation (LoRA) [75], which adds LoRA adapters to the model (Figure 5(b)). However, standard LoRA lacks sufficient plasticity and often performs poorly in selective unlearning scenarios [23], which has motivated several key enhancements. Cha et al. [23] introduce Fisher-weighted Initialization of Low-rank Adapters (FILA). Meanwhile, Gao et al. [56] address the challenge of continuous unlearning requests in practical settings.
They employ an orthogonal regularization loss to disentangle different unlearning tasks within a single LoRA adapter and additionally train an out-of-distribution (OOD) detector to modulate the adapter activation based on the relevance of test samples to unlearned data. In a follow-up work, LOKA [209] introduces multiple memory modules to store distinct knowledge, effectively mitigating conflicts in LLM updating and improving storage efficiency. During training, input knowledge is allocated to the appropriate knowledge memories through similarity-aware knowledge mapping. During inference, a learning-based router dynamically activates the most relevant memory module according to the input prompt, enabling context-aware and conflict-minimized generation, as illustrated in Figure 5(c). In general, as a parameter-efficient approach, incorporating new structure has unparalleled advantages over parameter localization in handling sequential and multi-turn unlearning. This architecture ensures that the parameters updated for individual unlearning requests remain independent, allowing flexible selection or combination according to the final application needs. More importantly, through parameter integration methods such as fusion mechanisms or learnable routers, it alleviates two crucial problems in continual parameter fine-tuning: catastrophic forgetting of previous knowledge [55] and knowledge interference between different rounds [209]. However, this plug-in architecture presents several limitations. First, its adaptability to downstream tasks may be constrained. Furthermore, since unlearning is confined solely to the integrated auxiliary structures, deactivating these components can effectively circumvent the defense mechanism, thereby allowing the recovery of unlearned content from the original model [160].
3.2.5 Composite Approaches. A small portion of post-training unlearning methods integrate multiple techniques discussed in the preceding subsections.
We review several notable approaches here as a complementary perspective, categorizing them into (1) parameter arithmetic operations and (2) SAE-based methods.

Table 2. Outline of different inference-time unlearning methods. These methods operate by altering input or output content at the token or embedding/logit level during the inference phase.
  Input / Token:
    Prompting method [175]: crafting specific instructions or system prompts.
    ICUL [138]: label flipping disrupts the original association.
    ICKU [171]: uses <UNL> and </UNL> to encapsulate target knowledge.
    RAG-based [188]: modifies the knowledge base of RAG to simulate unlearning.
    Muresanu et al. [127]: retrieves representative examples through quantized k-means clustering.
  Input / Embedding:
    ECO [108]: classifies prompts and selectively corrupts token embeddings.
    SPUL [11]: optimizes soft prompt tokens to induce unlearning.
  Output / Token:
    Filtering method [175]: screens the initial output and removes unwanted information.
    ALU [155]: four agents collaborate sequentially to sanitize responses.
  Output / Logit:
    δ-UNLEARNING [78]: computes a logit offset between two small models.
    ULD [84]: subtracts logits from a model with reversed training objectives.
    DExperts [107]: uses two expert models to recalculate token probabilities during decoding.

Several studies explore direct arithmetic operations on model parameters. Inspired by advances in task vectors for knowledge editing [80], some methods fine-tune an intermediate model and combine its parameters arithmetically with those of the original model. For instance, SKU [113] fine-tunes a "bad model" to obtain a parameter deviation that opposes the unlearning objective. This deviation is then subtracted from the original model's parameters to produce a safe, unlearned model. A similar strategy is adopted in Eldan and Russinovich [49].
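The parameter-subtraction idea can be sketched as follows (toy vectors stand in for full weight tensors; alpha is an illustrative scaling hyperparameter):

```python
import numpy as np

def negate_task_vector(theta_orig, theta_bad, alpha=1.0):
    """Parameter-arithmetic sketch: the 'bad model' deviation
    (theta_bad - theta_orig) encodes the unwanted behavior, and
    subtracting it steers the original model away from that behavior."""
    task_vector = theta_bad - theta_orig
    return theta_orig - alpha * task_vector

theta = np.array([0.5, -1.0, 2.0])          # original weights (toy)
theta_bad = np.array([1.5, -1.0, 1.0])      # fine-tuned toward the unwanted data
theta_unlearned = negate_task_vector(theta, theta_bad, alpha=1.0)
# The deviation is (1.0, 0.0, -1.0), so each weight moves the opposite way.
```

When the "bad model" is trained with PEFT, the same negation is applied only to the adapter weights rather than to the full parameter set.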
Since training a model of comparable size to the original is computationally intensive, two key refinements have been proposed. First, fine-tuning can be performed using parameter-efficient fine-tuning (PEFT) techniques, where unlearning is achieved by applying negation operations to the relevant parameter-efficient modules (PEMs) [214]. To further mitigate the risk of degrading general model capabilities, Hu et al. [77] combine an "expert" PEM with an "anti-expert" PEM and derive a general capability vector for preservation. Second, as an alternative to fine-tuning, approximate negative models can be derived via subspace decomposition and projection techniques, such as the Gram–Schmidt orthogonalization used in UNLEARN [115] and the singular value decomposition (SVD) applied in Ethos [58]. Another line of work adopts a result-oriented perspective: to effectively suppress undesired information, it is crucial to first identify and then manipulate the internal representations corresponding to the target data. Several studies integrate sparse autoencoders (SAEs) [130] into specific model layers to enhance interpretability and isolate relevant features. For example, Farrell et al. [52] identify features that activate strongly on the unlearn set while minimally impacting the retain set, and then clamp their activations to negative values during inference. Similarly, Wu et al. [196] introduce a trainable codebook between the encoder and decoder of an SAE. During fine-tuning, they constrain activations to the top-S codebook vectors based on cosine similarity, and subsequently remove specific vectors associated with unwanted information to suppress the corresponding features.
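The clamping intervention described above can be sketched as follows (toy activation values; the feature indices and clamp value are illustrative):

```python
import numpy as np

def clamp_unlearn_features(feature_acts, unlearn_idx, clamp_value=-2.0):
    """Inference-time SAE intervention sketch: set the features that fire
    strongly on the unlearn set to a negative value, suppressing the
    corresponding concept; all other features pass through unchanged."""
    acts = feature_acts.copy()
    acts[..., unlearn_idx] = clamp_value
    return acts

# Toy SAE feature activations for one token (4 features).
acts = np.array([[0.0, 3.2, 0.1, 5.7]])
clamped = clamp_unlearn_features(acts, unlearn_idx=[1, 3])
```

The clamped feature vector is then passed through the SAE decoder back into the residual stream, so the model generates as if the suppressed features had never fired.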
3.3 Inference-Time Unlearning. In contrast to the aforementioned approaches, which necessitate modifications to the parameters of the original model and consequently demand substantial computational resources, inference-time unlearning methods operate by altering input or output content during the inference phase. By avoiding direct updates to the core parameters, this strategy offers significant advantages: efficient adaptation, strong generalization across model architectures, and mitigation of catastrophic forgetting. More precisely, modifications can be made at the token or embedding/logit level. Refer to Table 2 for a brief summary of all inference-time unlearning methods.
Input-based Methods. This category modifies the input presented to the model to induce unlearning. One approach leverages in-context learning by inserting human-readable instructions or examples into prompts (token level), eliminating the need for parameter updates. For inserting instructions, Thaker et al. [175] propose using system prompts that explicitly instruct the model to refuse to generate target content (for example, responding with "I cannot provide information about [topic]."). To enhance efficiency, they apply a filter to detect input related to the target, activating the refusal prompt only when necessary. These simple guardrail-based methods are effective with low overhead, but may be vulnerable to malicious attacks. For inserting examples, Pawelczyk et al. [138] propose In-Context Unlearning (ICUL), which constructs customized prompts from several input-label pairs, where an input in the unlearn set is given a flipped label and the other inputs are correctly labeled. The underlying intuition is that flipping the label disrupts the original association, while the supplementary correct examples mitigate overcorrection and help preserve general accuracy. To address hallucination issues in ICUL, Takashiro et al.
[171] introduce In-Context Knowledge Unlearning (ICKU), which wraps target knowledge between the special tokens <UNL> and </UNL>, enabling flexible unlearning during inference. Although ICKU requires one-time fine-tuning to recognize the special tokens, it remains fundamentally an in-context approach. Beyond in-context methods, knowledge can also be stored outside the model, with suitable strategies for providing the correct samples during each in-context learning step. Wang et al. [188] propose a RAG-based framework in which the model answers queries based on an external knowledge base. Unlearning is achieved by modifying the retrieved content, either by constructing "unlearned knowledge" for target queries or by adding constraints that enforce confidentiality, leading the model to refuse to generate the undesired content. Muresanu et al. [127] investigate a sample selection mechanism that constructs prompts by retrieving representative examples from the training set. Their approach employs quantized k-means clustering to partition the data and retrieves samples nearest to each cluster centroid. The authors prove that, with high probability, removing a single data point does not perturb the resulting cluster structure, thereby enabling unlearning without additional retraining or modification. Token-level modification offers several advantages, such as human readability, superior interpretability, and compatibility with black-box models. In contrast, other approaches adjust inputs at the embedding level to achieve more effective and efficient unlearning. Similar to post-training techniques, these methods are optimization-based but shift the optimization target from model parameters to the input embeddings. For example, ECO [108] employs zeroth-order optimization at the embedding level and improves efficiency by using a classifier to select only the relevant tokens for adjustment.
To avoid per-sample optimization, SPUL [11] optimizes a small set of soft prompt tokens via a multi-objective loss function; these tokens are then selectively appended to input queries to induce unlearning.
Output-based Methods. While input-based methods remain susceptible to the inherent unpredictability and limited controllability of the LLM, modifying the model's output provides a more direct and precise alternative. At the token level, a straightforward idea is filtering, where the initial output of the model is automatically screened and censored to remove unwanted information before being presented to users [175]. Moving beyond simple filtering, Sanyal and Mandal [155] propose ALU, which employs four specialized agents (Vanilla, AuditErase, Critic, and Composer) that collaborate sequentially to sanitize responses dynamically during inference. Similar to the idea of modifying inputs at the embedding level, it is also possible to modify outputs at the logit level. One idea is to intervene in the decoding process. Liu et al. [107] propose DExperts, which combines a language model with "expert" and "anti-expert" models, recalculating the token probability distribution at each decoding step to avoid generating unwanted content. Another line of research achieves unlearning by leveraging the logit differences

Fig. 6. Evaluation Framework. It involves two parts: (1) data and (2) metrics.
The data can be classified along three dimensions: content, task format, and experiment paradigm. Metrics include knowledge memorization, model utility, unlearning robustness, and efficiency. Additionally, we include existing datasets with their features in Table 3.

between the target model and the auxiliary models. Huang et al. [78] introduce δ-UNLEARNING, which computes a logit offset using two small white-box models, one retained and one unlearned (via methods such as gradient ascent or KL minimization). This offset is applied to a black-box LLM to steer its predictions, offering notable adaptability to various unlearning algorithms. While δ-UNLEARNING requires training both a retain and an unlearn model, Ji et al. [84] simplify it with the Unlearning from Logit Difference (ULD) method. ULD trains a single assistant model with reversed objectives to remember the unlearn set and forget the retain set, then subtracts its logits from the original model's outputs to induce unlearning. This method reduces degenerate outputs and catastrophic forgetting while improving efficiency.
4 Evaluations. Evaluating LLM unlearning methods is essential for comparative performance analysis. This procedure raises two fundamental questions: (Q1) On which datasets are the experiments conducted? (Q2) What metrics are used to quantify the results? To address (Q1), Section 4.1 examines the data along three dimensions, namely task format, content, and experiment paradigm, along with commonly used benchmarks. To aid benchmark selection, Table 3 summarizes key features to offer an overview of existing benchmarks. For (Q2), Section 4.2 categorizes the evaluation metrics into four classes based on the aspect of model behavior they assess: knowledge memorization, model utility, unlearning efficiency, and unlearning robustness. Refer to Figure 6 for an overview of this section.
4.1 Data. In Section 2.1, we introduce the definitions of the unlearn set D_u and the retain set D_r.
In LLM unlearning, the retain set can be further categorized into the neighbor set and the world set based on relevance to the unlearn set.

Fig. 7. Examples of different tasks. Note that the question in each example usually needs to be accompanied by dialogue-format text before being input into the model.
  Text Completion / Cloze. Question: "In The Shawshank Redemption, Andy Dufresne is played by ___ Robbins." Ground truth: "Tim" [90].
  Text Completion / Continuation. Question: "In his third year, they were required to buy a particular textbook for Care of Magical Creatures, a book that was notorious for" Ground truth: "being one of the most difficult and complex classes at Hogwarts." [49].
  Question & Answer / Multiple Choice. Question: "What mutation in the poliovirus IRES often arises during cell culture adaptation? A. G480A B. A103G C. C472U D. A181V" Ground truth: "C. C472U" [104].
  Question & Answer / Short Answer. Question: "Who is this celebrated LGBTQ+ author from Santiago, Chile known for their true crime genre work?" Ground truth: "The author in question is Jaime Vasquez, an esteemed LGBTQ+ writer who hails from Santiago, Chile and specializes in the true crime genre." [123].

The neighbor set consists of data that is semantically related yet distinct from the unlearn set. Common construction strategies include withholding a subset (e.g., 1%, 5%, or 10% under the TOFU framework [123]) from unlearning, or manually curating content from related domains. For example, Lynch et al. [121] extract mythology and film production details using GPT-4 when unlearning Harry Potter material, while Shi et al. [163] source related content from the Harry Potter FanWiki. As highlighted by Choi et al. [35], neighbor samples act as "hard positives," helping the model discriminate between unlearn and retain knowledge. Moreover, their structural similarity to the unlearn set facilitates a consistent evaluation of the effectiveness of unlearning.
The world set denotes the broad, general information acquired during pretraining, which is largely independent of the unlearn set. It is typically drawn from large-scale repositories such as Wikidata [180], OpenWebText [140], and FineWeb [139]. Evaluating world knowledge helps assess the preservation of the model's foundational knowledge post-unlearning, particularly when neighbor sets are acquired via fine-tuning and become strongly memorized. In general, the world set offers a complementary perspective on residual knowledge capacity. It is worth noting that directly synthesizing or constructing these datasets may introduce several issues, such as information overlap between the unlearn and retain sets [88], incomplete memorization of the unlearn set by the original model [122], and increased unlearning difficulty for data associated with minority groups [192]. In response, various sampling techniques have been proposed to enhance dataset quality in unlearning benchmarks. Having described the composition of the dataset, we classify the data from three different perspectives and summarize the advantages of each feature in Table 3(a).
4.1.1 Task Format. Based on the data, the model needs to complete specific tasks for evaluation. For generation tasks, we categorize them into two primary types according to the data format: text completion (free-form data) and question answering (QA data). As illustrated in Figure 7, these are subdivided into four distinct subcategories. Text completion directly provides partial data from the unlearn set to the model, requiring the model either to fill in blanks (cloze) or to continue generating complete sentences (continuation). Additionally, for masked language models (MLMs) such as BERT [43] and RoBERTa [110], this can also be achieved by predicting the masked word [29]. Examples of these tasks are shown on the left side of Figure 7.
Because a large amount of the available corpus is pure text, the primary advantage of this task is its simplicity of data preparation, which facilitates straightforward evaluation without significant computational overhead. The two completion tasks have their own advantages and disadvantages. The cloze task offers flexibility in its questioning content, yet its answer is limited to one or a few words. In contrast, the continuation task is inherently restricted to generating subsequent text, which typically only allows inquiries about the last part of a sentence. The advantages of the different tasks are summarized in Table 3(a). The most significant issue with text completion is that the questioning objective is not clear. For example, the completion of "Tom likes to eat" could be "apples", "hamburgers", or even "at midnight". Question & Answer (QA) tasks solve this problem. Using manual methods or LLMs, researchers create QA pairs from the data, provide questions to the model, and compare the model's answers with the ground truth. Depending on the type of question, QA can be divided into multiple choice and short answer. Refer to the right side of Figure 7 for examples. Multiple choice questions have clear answers and make evaluation straightforward. On the one hand, the model may guess the correct answer, leading to inaccurate evaluation results. On the other hand, this format can also serve as a potential attack method [121], as the model is required to choose an answer from the provided options and cannot evade by fabricating irrelevant content. In contrast, short answer tasks have more diverse forms of questions and can be designed into various scenarios for more comprehensive testing, which is discussed in the following paragraph. Furthermore, the evaluation landscape extends beyond basic tasks to include diverse variants, which can be classified into two groups.
The first focuses on prompt manipulation, such as translation [33,90,117,121], rephrasing, reverse query, and synonym substitution [35,155]. The second designs structured scenarios, such as analogy completion or odd-one-out tasks [91]. Parallel to these task developments, another branch of research seeks to compute comprehensive metrics. To mitigate the reliance on point estimates, Scholten et al. [156] propose a probabilistic framework computed by extensively sampling the model's generations. Other studies aggregate performance across numerous tasks through designed average scores [114,163] or Cognitive Diagnosis Models (CDMs) [99].
4.1.2 Content. From the perspective of content, the unlearn set may originate from real-world sources, such as the Harry Potter series [49], or be fictionally constructed, as exemplified by the TOFU benchmark [123]. Real-world data exhibit richer content and more coherent logical relationships, and are thus practically useful [90]. However, the inherent correlations within real-world data make the delineation of the unlearn set and the retain set challenging. For example, unlearning the Harry Potter series raises the question of whether associated knowledge from wikis or blogs should also be erased. To address this issue, several studies employ fictional data generated via templates or LLMs [123,194]. Meanwhile, specific content or structured data that is hard to obtain directly from reality, such as private data (e.g., phone numbers, addresses) [147] or relationship graphs [144], can also be generated.
4.1.3 Experiment Paradigm. In experiments, datasets can be broadly categorized into two classes based on whether fine-tuning is required. In the first category, models perform unlearning without fine-tuning on the target dataset, which simplifies the experimental setup. This includes the following scenarios. (1) The unlearn set is compiled directly from the model's pretraining data, such as subsets derived from the Pile [57].
(2) The data are manually verified to be present in the model, as in RWKU [90], ConceptVectors [73], and RETURN [114]. However, verifying the presence of facts in LLMs remains challenging and may affect reliability. (3) For security purposes, the model is required to erase certain knowledge regardless of its original presence, exemplified by WMDP [104] and UNCD [99]. In the second category, models are first fine-tuned on the full dataset before a subset is unlearned. This is essential when datasets are fictionally synthesized, such as TOFU [123], EDU-RELAT [194], and PISTOL [144], to ensure that the model acquires the target knowledge. Even for real-world corpora, fine-tuning helps guarantee that the original model possesses knowledge of the unlearn set.

  Task:
    Cloze (free-form): simple data preparation; flexible position of questioning content.
    Continuation (free-form): simple data preparation; long answer length.
    MCQA (QA): clear questioning objectives; unique answer, easy to evaluate.
    Short answer (QA): clear questioning objectives; various forms and scenarios.
  Content:
    Real-world: rich content, coherent logical relationships, practically useful.
    Fictional: easy to separate the unlearn/retain sets; flexible to construct required content and format.
  Experiment:
    W/o fine-tuning: low computational cost and simple experiments.
    W/ fine-tuning: ensures the original model memorizes the unlearn set; supports continuous learning-unlearning scenarios.
(a) Comparison of different task formats, data contents, and experiment paradigms.
  TOFU [123]: fictional, w/ FT; 200×20 QA; SAQA; used by [23,28,50,51,56,67,78,84–86,89,91,108,111,122,124,150,155,156,160,171,175,182,183,186,187,189,197,206,209,216].
  WMDP [104]: real, w/o FT; papers & passages; 3,668 MCQA; used by [11,26,41,42,46,51,52,76,85,93,97,108,120,150,155,175,183].
  WHP [49]: real; 3.1M tokens; 300 Conti + 30 Cloze; used by [28,84–86,121,155,156,175,183].
  MUSE [163]: real; 4.4M+6.5M tokens; Conti + SAQA; used by [45,51,88,89,153,183,189,205,211,218].
  RWKU [90]: real, w/o FT; 200 celebrities; 3,268 Cloze + 2,879 SAQA; used by [171,205,211].
  CoTaEval [191]: real; 1K + 1K passages; 1.5K Conti + 1K SAQA; used by [153].
  KnowUnDo [176]: real; 2,649 QA; SAQA; used by [197].
  PISTOL [144]: fictional, w/ FT; 4 graphs (50, 95); 95 SAQA; used by [160].
  WPU [111]: real, w/o FT; 100 people's Wiki pages; 2,795 SAQA; used by [168].
  LUME [147]: 1,387 documents; 4,394 (Conti + SAQA); used in SemEval-2025 Task 4.
  ConceptVectors [73]: real, w/o FT; 285×10 paragraphs; 285×10 Conti + 285×10 SAQA.
  EDU-RELAT [194]: fictional, w/ FT; 700 QA; 11×5 SAQA.
  ELUDe [35]: 15,651+90,954 QA; MCQA + SAQA.
  FaithUn [199]: real, w/o FT; 664 QA; 8,377 MCQA.
  LLM Surgery [178]: 180K+1B tokens; 24,800 MCQA.
  Restor [152]: 3,000 passages; 1,051 SAQA.
  RETURN [114]: real, w/o FT; 2,492×20 QA; SAQA.
  UNCD [99]: real, w/o FT; 2.9M+3.3M tokens; 36K MCQA.
(b) Benchmarks and datasets in unlearning.

  CounterFact [125]: editing; 21,919 counterfactual records; used by [70,137,190].
  PKU-SafeRLHF [83]: safety; QA pairs (265K w/ meta-labels, 166.8K w/ preference); used by [27,30,85,86,113,134,202,209].
  SQuAD [146]: comprehension; 100K+ QA pairs for reading and reasoning; used by [138,152].
  ZsRE [103]: relation extraction; 30M+ positive examples, 2M+ negative examples; used by [183,190].
(c) Benchmarks and datasets in relevant fields.

Table 3. Selecting a suitable benchmark. Part (a) organizes the advantages of different task formats, data contents, and experiment paradigms.
Part (b) outlines benchmarks and datasets with their data content (real and/or fictional (Fic.)), experiment paradigm (with or without fine-tuning (FT)), data statistics (text/QA; '+' distinguishes between the unlearn set and the retain set), evaluation tasks ("Conti": continuation, "SAQA": short answer QA, "MCQA": multiple choice QA), and applications in subsequent studies. Part (c) outlines benchmarks in relevant fields with brief descriptions of their data.
4.1.4 Existing Benchmarks and Datasets. A direct motivation for unlearning research comes from a number of works that aim to remove parts of the pretraining corpus [3,10,23,142,169,173,185,201]. Among these, the Pile dataset [57], which is commonly used in pretraining LLMs such as Pythia [12], is one of the most frequently adopted. Google Research [65] further introduced the Training Data Extraction Challenge (TDEC), a subset of 20,000 examples from the Pile that has been employed as an unlearn set in several studies [19,45,58,82,95,101,174,185]. However, a major challenge is that the pretraining data of many state-of-the-art LLMs are not publicly available, and different models often use different corpora, significantly limiting the applicability of such datasets. To address diverse research needs, numerous benchmarks and datasets have been developed, varying in motivation and application. Some focus on unlearning specific content, such as security information [99,104], copyrighted material [49,163], or private data [114,123]. Others emphasize knowledge connectivity or semantic diversity to enhance unlearning robustness [73,144,194,199]. Additionally, works such as [152,178] explore continuous learning-unlearning settings. Unlearning evaluations also frequently adapt benchmarks from related fields such as model editing and LLM safety. We summarize the characteristics of existing benchmarks in Table 3.
4.2 Metrics
After applying an unlearning method to a model on a selected dataset, we need suitable metrics to evaluate its effectiveness. Recalling the goal of unlearning in Section 2.1.2, the first kind of metric examines the knowledge memorization (Section 4.2.1) of the unlearned model on the content of the unlearn set and the retain set. Given the various capabilities of LLMs, such as language proficiency and reasoning ability, the unlearned model should also retain its full model utility (Section 4.2.2). Additionally, we expect the unlearning process to be robust (Section 4.2.3) and efficient (Section 4.2.4). Refer to Figure 6 for an illustration of the different metrics.
4.2.1 Knowledge Memorization. In most cases, the ideal unlearned model is expected to retain all knowledge from the retain set and none from the unlearn set. The first kind of metric therefore evaluates knowledge memorization, i.e., whether certain data have been memorized by the model. Typically, the choice of knowledge memorization metric is tailored to the task format; for instance, accuracy is a direct measure for multiple-choice questions. In general, metrics fall into three distinct classes according to their operational basis: those applied to the model’s final outputs, those applied to the model’s internal logits, and membership inference attacks (MIA). We summarize the knowledge memorization metrics introduced in this section in Table 4.
Output-based. The most direct approach is to have the model complete the selected task and compare the output with the ground truth. Since a model’s output may not perfectly align with ground-truth references, multiple metrics are employed to quantify the textual similarity between them. Verbatim matching represents the simplest and most computationally efficient approach, particularly suitable for short or categorical answers.
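As a minimal sketch (the function names and normalization choices here are ours, not from any cited benchmark), verbatim matching and its keyword-inclusion relaxation can be written as:

```python
def verbatim_match(pred: str, gold: str) -> bool:
    # Strict verbatim matching after light normalization
    # (lowercasing and whitespace collapsing).
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)

def keyword_match(pred: str, keywords: list) -> bool:
    # Relaxed criterion: the output counts as memorized if it contains
    # every gold keyword (e.g. the author name in a TOFU-style question).
    low = pred.lower()
    return all(k.lower() in low for k in keywords)
```

Verbatim matching suits short or categorical answers; the keyword check tolerates paraphrase as long as the identifying entity is reproduced.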
For longer and more complex generations, studies such as [175,190] relax the exact-match criterion to instead require the strict inclusion of specific keywords in the outputs. This adaptation performs well on benchmarks like TOFU [123], where questions are often centered on a unique, identifiable entity (e.g., an author’s name). Another relaxation is to prompt the model with the first n tokens and check the (n+1)-th token at each position, as in memorization accuracy (MA) [177] and extraction strength (ES) [22]. BLEU [136] and ROUGE (primarily ROUGE-N and ROUGE-L) [105] are established NLP metrics that measure the precision and recall of n-grams or the longest common subsequence (LCS), respectively. Beyond a single score, Extraction Likelihood (EL) [82] measures extraction risk as the average ROUGE score across varying prefix lengths. However, both ROUGE and EL treat all words with equal weight. To prioritize key information, Xu et al. [197] introduced the Entity Coverage Score (ECS), which uses an LLM to extract key entities and calculates similarity based solely on these entities. In addition to these model-free metrics, some methods introduce external models for evaluation beyond lexical overlap. Among these model-based methods, BERTScore [217] constitutes a major category, typically involving the conversion of text into embedding vectors followed by the calculation of cosine similarity. Through this embedding transformation, the metric can better handle semantic-level information, such as synonyms, negation words, and word order. For scenarios requiring more knowledge and understanding (such as recognizing that "born in London" and "born in the UK" are consistent), evaluation using customized models, such as Natural Language Inference (NLI) models [16] or BLEURT [158], can yield more accurate results. Furthermore, more universal evaluation methods include human evaluation and external LLM assessments.
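A toy, whitespace-tokenized version of ROUGE-L (the LCS-based variant above) can be sketched as follows; production evaluations typically rely on established packages such as rouge-score rather than hand-rolled code:

```python
def lcs_len(a, b):
    # Classic O(len(a) * len(b)) dynamic program for the
    # longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(pred: str, ref: str) -> dict:
    # ROUGE-L: LCS length normalized by prediction length (precision)
    # and by reference length (recall), combined into an F1 score.
    p, r = pred.lower().split(), ref.lower().split()
    l = lcs_len(p, r)
    prec = l / len(p) if p else 0.0
    rec = l / len(r) if r else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"precision": prec, "recall": rec, "f1": f1}
```

In an unlearning evaluation, a low ROUGE-L against the ground-truth answer on the unlearn set and a high one on the retain set is the desired pattern.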
These methods are not only adaptable across diverse tasks but can also comprehensively evaluate the wording and grammar of the outputs. However, they often function as black boxes and can be susceptible to biases inherent in human evaluators or proxy LLMs. Finally, several less prevalent metrics have also been applied in unlearning evaluations [67, 108], including METEOR [6], MAUVE [141], and Rep 3 [193].
Logit-based. The autoregressive nature of Large Language Models (LLMs) involves computing a probability distribution over the next token conditioned on the preceding sequence. This next-token probability can thus serve as an indicator of the model’s latent knowledge for a given prompt, where a higher probability signifies stronger retention of that specific token [49,123]. A representative indicator calculated on this basis is perplexity, which is adopted as a metric in several studies [33,46,84,195,201] under the premise that more firmly memorized knowledge typically yields lower perplexity. Several works have derived comprehensive indicators from the next-token probability [101,124,186]. A significant advancement is the Truth Ratio introduced by Maini et al. [123], which quantifies the likelihood of a correct answer relative to incorrect alternatives for a given question. In theory, a model lacking specific knowledge should exhibit a negligible difference in probability between correct and incorrect answers. The efficacy of unlearning can be further statistically validated by applying tests such as the Kolmogorov–Smirnov (KS) test to compare the distribution of Truth Ratios between the unlearned model and an expected baseline. The Truth Ratio can thus effectively detect under- and over-unlearning at the distribution level. A known limitation of using probabilities directly is the extreme variance of conditional probabilities across tokens, which can adversely affect metric stability. A straightforward mitigation is to use token ranks instead of raw probabilities.
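A minimal sketch of rank-based scoring (the toy next-token distribution below is illustrative): the target token's rank in the sorted next-token distribution, and the mean reciprocal rank (MRR) aggregated over a sequence of targets:

```python
def token_rank(next_token_logprobs: dict, target: str) -> int:
    # Rank of the target token when candidates are sorted by model
    # probability; rank 1 means the model's top prediction is the target.
    ordered = sorted(next_token_logprobs, key=next_token_logprobs.get, reverse=True)
    return ordered.index(target) + 1

def mean_reciprocal_rank(ranks: list) -> float:
    # Average of reciprocal ranks; values near 1.0 indicate the
    # target tokens are still the model's top predictions.
    return sum(1.0 / r for r in ranks) / len(ranks)
```

If an unlearned model still ranks the erased answer tokens first at every position, the MRR stays near 1.0; successful unlearning should push it down.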
Sorting tokens by probability in descending order and using the rank as the score yields a more uniform score distribution [3]. This rank-based paradigm is also employed by several metrics adapted for unlearning evaluation [33,144]. For instance, Exposure [21], a key metric in memorization analysis, can be viewed as a rank-based variant of the Truth Ratio, substituting likelihood comparison with rank comparison. Similarly, the Mean Reciprocal Rank (MRR), prevalent in entity retrieval tasks [98], averages the reciprocal ranks of the target tokens.
Membership inference attack (MIA). In addition to the metrics above, some articles conduct MIA for evaluation [86,90,147,163]. MIA is a privacy attack that determines whether specific data samples are part of a model’s training set [165]; common scores include LOSS [203], Zlib Entropy [22], Min-K% Prob [162], and Min-K%++ Prob [215]. However, although widely used in traditional unlearning evaluations, MIA is considered unsuitable for the LLM context, as it typically requires training numerous shadow models, which is both data-prohibitive and computationally intractable for LLMs [30, 108, 202].

Obj. | Class | Name | Note (Advantages) | Used by
Output | Model-free | Verbatim | Simple, computationally efficient, strong performance on some benchmarks. → keyword [175, 190], ES [22], MA [177] | [19, 23, 33, 45, 82, 101, 142, 147, 169, 171, 174–176, 185–187, 190]
Output | Model-free | BLEU | Commonly used in translation (precision). → SacreBLEU [143] | [5, 67, 77, 84, 108, 156, 174, 184, 189, 196, 202]
Output | Model-free | ROUGE | Commonly used in text summarization (recall). Includes ROUGE-N and ROUGE-L (LCS). → EL [82], ECS [197] | [23, 29, 35, 45, 47, 51, 58, 67, 77, 78, 82, 84, 85, 88, 89, 91, 101, 108, 111, 122–124, 137, 144, 147, 150, 153, 155, 156, 160, 168, 174, 182, 183, 185, 186, 188, 189, 191, 192, 197, 205, 206, 209, 211, 216, 218]
Output | Model-based | BERTScore [217] | Semantic information (synonyms, etc.). → Calculated with other encoders | [108, 153, 155, 174, 191, 192, 196, 197, 202, 206]
Output | Model-based | Customized model eval | Knowledge and understanding, e.g., NLI, BLEURT [158] | [30, 88, 113, 114, 197, 202, 206]
Output | Model-based | Human eval | Adaptable across diverse tasks, comprehensive in multiple dimensions. | [107, 119, 174, 197]
Output | Model-based | LLM eval | Similar to the notes of human eval. | [77, 99, 111, 113, 119, 124, 134, 147, 155, 168, 174, 188, 197, 211]
Logit | Prob. | Next-token probability | Examination of the model’s latent knowledge, e.g., p(a|q) [123], perplexity → KL divergence [186], CI [124], RMA [101] | [33, 46, 49, 51, 84, 101, 119, 123, 124, 137, 182, 184, 186, 195, 201, 216]
Logit | Prob. | Truth Ratio [123] | Detects under- and over-unlearning at the distribution level. | [23, 35, 51, 67, 78, 84–86, 89, 108, 124, 137, 175, 182, 183, 187, 189, 209, 216]
Logit | Rank | Rank score | Uniform score distribution, easy for comparison, e.g., Exposure [21], MRR [98], THR [144], PA [33] | [3, 5, 33, 66, 144, 195]
Table 4. Statistics of the use of different knowledge memorization metrics in LLM unlearning evaluations. Red text marks a new method improving on the corresponding method. Blue text marks representative examples of the corresponding class.

4.2.2 Model Utility. Beyond investigating model memorization, various methodologies are employed to assess the general utility of unlearned models. Some directly and efficiently computable indicators quantify specific aspects of model performance; these are referred to as utility metrics. Among these, perplexity is one of the most frequently used measures; lower perplexity signifies better fluency [107], higher model confidence [192], and more meaningful generated content [108]. In addition to perplexity, numerous studies focus on lexical diversity, proposing metrics such as the mean number or proportion of distinct n-grams [107,111,168], the unique token ratio [108,202], and token entropy [174,206]. As noted by Yuan et al.
[206], reduced vocabulary diversity often correlates with token repetition in model outputs, indicating poorer readability and weaker overall utility. Established linguistic indices, including Brunet’s Index [18] and Honoré’s Statistic [74], have also been applied to assess lexical richness in unlearning contexts [197]. Meanwhile, comprehensive general benchmarks are commonly used to evaluate the overall performance of unlearning methods [90,104]. Table 5 summarizes frequently adopted benchmarks and their usage statistics across unlearning studies. Integrated evaluation frameworks and toolkits, such as the Language Model Evaluation Harness [59], facilitate the systematic application of these benchmarks to LLM unlearning evaluation [3,85,93,176,190].

Class | Name | Used by
Reasoning | ARC [38] | [19, 27, 35, 36, 45, 49, 66, 67, 78, 82, 84, 95, 101, 114, 118, 174–176, 178, 182, 185, 188, 190, 201, 206]
Math | GSM8K [39] | [3, 77, 115, 190, 201, 206]
Math | MathQA [2] | [19, 27, 66, 82, 95, 101, 174, 182, 185]
Commonsense | HellaSwag [207] | [19, 27, 35, 36, 45, 49, 78, 82, 84, 95, 101, 118, 153, 174, 175, 182, 185, 190]
Commonsense | PIQA [13] | [19, 27, 35, 36, 45, 49, 66, 67, 82, 84, 95, 101, 114, 174, 182, 185]
Commonsense | WinoGrande [154] | [19, 27, 35, 36, 45, 49, 78, 82, 95, 101, 114, 153, 174, 182, 185]
Comprehension | LAMBADA [135] | [19, 27, 35, 66, 82, 95, 101, 114, 174]
Comprehension | OpenBookQA [126] | [35, 36, 49, 56, 78, 84, 175]
Universal knowledge | MMLU [71] | [3, 26, 35, 36, 41, 42, 46, 47, 52, 70, 76, 77, 90, 93, 99, 104, 108, 115, 120, 147, 150, 153, 155, 173, 175, 176, 188, 191, 201, 206, 218]
Multi-turn | MT-Bench [222] | [26, 46, 47, 99, 104, 153, 175, 191]
Mimic human falsehoods | TruthfulQA [106] | [30, 36, 77, 90, 113, 153, 176, 202, 205, 206, 218]
Table 5. Overview of general benchmarks widely used in unlearning evaluation. We broadly divide these benchmarks into seven classes and count the use of each benchmark.

4.2.3 Unlearning Robustness. Empirical studies indicate that many unlearning methods merely suppress the surface-level expression of specific knowledge, leaving the underlying representations vulnerable to various adversarial attacks [73,120,121]. To systematically assess robustness, a range of adversarial techniques from the security domain have been integrated into the evaluation of LLM unlearning [86,90,147,163], which we collectively refer to as attack techniques. Commonly adopted methods include the following. (1) Attacks on the input, such as crafted jailbreak prompts [161], AutoPrompt [164], Greedy Coordinate Gradient (GCG) [225], and Prompt Automatic Iterative Refinement (PAIR) [25]. (2) Attacks on the hidden layers, such as probing techniques [1,9], Logit Lens [132], soft-prompt-based threats [26,157], and AnonAct [159]. Further heuristic attacks have also been proposed for specific unlearning scenarios [3, 48, 160]. Reflecting the characteristics of LLM unlearning tasks, a relatively unique robustness evaluation method, relearning [91,121], is also frequently used. Relearning evaluates an unlearned model by exposing it to a limited subset of the unlearned data. In in-context relearning, knowledge related to the unlearn set, such as book summaries or relevant background information, is included in the prompts when evaluating the unlearned LLM. In relearning by fine-tuning, the model is full-parameter or LoRA fine-tuned on a small portion of the unlearn set or a related set.
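In-context relearning can be sketched as a simple prompt-construction step (the template below is illustrative, not taken from a specific benchmark): related background text is prepended to the evaluation question, and the model's answers with and without it are compared:

```python
def build_relearn_prompt(background: str, question: str) -> str:
    # In-context relearning probe: supply related background (e.g. a book
    # summary) before the question. A robustly unlearned model should not
    # recover the erased answer even with this hint in context.
    return (
        "Background: " + background.strip() + "\n"
        + "Question: " + question.strip() + "\n"
        + "Answer:"
    )
```

The evaluation then scores the completion of this prompt with the same memorization metrics (e.g., ROUGE against the ground-truth answer) used without the background.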
Empirical studies consistently demonstrate that relearning can substantially degrade unlearning quality, causing the model either to rapidly recover a significant portion of unlearned knowledge from sparse cues [51,70,91,102,121,208] or to begin systematically avoiding content related to the unlearning target, even when contextually prompted [91]. More critically, fine-tuning on data whose content or distribution is similar to the unlearn set can also reverse the unlearning effects, restoring model performance to a level comparable to its state before unlearning [42, 46].
4.2.4 Unlearning Efficiency. While the majority of existing studies concentrate on the efficacy of unlearning, the resource overhead of deploying such algorithms under real-world constraints remains a crucial consideration, covering both memory occupation [7,86] and computational time. For time cost, a straightforward approach is to directly measure the algorithm’s runtime during experiments [7,29,36,86]. Since computational speed is closely tied to GPU performance, some studies convert raw runtime into GPU hours (i.e., number of GPUs × training hours) to facilitate comparison [84,150]. Nonetheless, fair cross-study comparisons remain challenging due to variations in experimental environments. The most reliable method is to execute all algorithms under controlled conditions, though this is often resource-intensive. Alternatively, several works estimate time consumption theoretically, using metrics such as floating-point operations [23] or gradient computation budgets [192]. However, discrepancies between theoretical estimates and actual runtime may arise due to differences in implementation and hardware optimization.
5 Challenges and Future Directions
5.1 Challenges
5.1.1 Definition and Evaluation of Unlearning.
In Section 2.1, we characterize the goal of LLM unlearning as ensuring that “the unlearned model should no longer memorize information from the unlearn set while preserving all other knowledge.” However, two key issues remain ambiguous, leading to divergent definitions of unlearning across the literature and, consequently, to inconsistent and imprecise evaluation practices. (1) How should memorization be defined and detected? Most studies assess memorization based on model outputs in specific tasks, yet disagree on the criteria for judging these outputs. For content related to the unlearn set, some works argue that the model should simply avoid generating such content [49], while others require it to explicitly respond with “I don’t know” [160]. Another line of research proposes that the unlearned model should produce outputs similar to those of a hypothetical retrained model, such as giving a specific incorrect answer [152]. When direct output is insufficient, adversarial methods are sometimes employed to expose memorization. However, such approaches face inherent limitations: overly weak attacks may fail to detect memorization, whereas overly strong ones can force the model to generate arbitrary content, casting doubt on their reliability as auditing tools [28]. (2) What should constitute the unlearn set? For synthetic datasets such as TOFU [123], this question is relatively straightforward. However, in real-world scenarios, data interconnectivity complicates the identification of appropriate unlearning targets. Tian et al. [176] adopt a legal perspective to determine which copyrighted or private data should be unlearned, while Wu et al. [194] construct relationship graphs to identify the data that must be removed. Despite their merits, these methods remain reliant on manual, domain-specific analysis and lack generalizability.
5.1.2 Effect of Unlearning.
Unlearning affects different languages and data differently, further increasing the difficulty of designing and evaluating unlearning algorithms.
Effects across languages. Some studies conduct evaluations with prompts translated into languages other than English, finding that monolingual unlearning is fundamentally insufficient for multilingual LLMs [90,121]. Subsequent evaluations cover more languages, systematically divided into high- and low-resource groups [33,117], revealing that unlearning in one language does not necessarily transfer to others and can even inadvertently reinforce harmful content across languages. Together, these findings underscore a critical consensus: effective and secure unlearning necessitates multilingual joint unlearning strategies that address model behavior holistically across all languages.
Effects across data. From the perspective of data distribution, Baluta et al. [5] demonstrate that out-of-distribution (OOD) data require more gradient ascent steps but offer better unlearning quality, whereas in-distribution data allow faster unlearning but severely compromise model utility, illustrating a fundamental trade-off between unlearning efficiency and model preservation. Considering the logical connectivity of data, Choi et al. [34] find that current unlearning methods struggle with multi-hop knowledge, where unlearning one intermediate fact in a chain often fails to remove the entire logical sequence. Furthermore, some studies investigate the impact on adjacent data after performing unlearning on selected data, identifying phenomena called “transfer unlearning” [116], the “ripple effect” [219], and the “onion effect” [15]. These effects highlight the intricate and unpredictable consequences of unlearning, emphasizing the need for careful monitoring to ensure that unlearning achieves its intended goals without introducing new risks.
5.1.3 Unlearning in Reality.
A significant challenge lies in the scaling gap between experimental settings and real-world conditions. Current unlearning experiments are largely limited to models with fewer than 10 billion parameters and unlearn sets under 1 billion instances, raising concerns about the applicability of these methods to larger models and datasets. Shi et al. [163] analyze how evaluation metrics evolve as the size of the unlearn set increases, providing insight into scalability. On the other hand, in practical deployments, large models are often compressed, for example via quantization, for efficiency. Notably, Zhang et al. [218] demonstrate that quantizing unlearned models can inadvertently reactivate unlearned knowledge, highlighting a key scalability challenge. In commercial applications, unlearning requests typically arrive sequentially, requiring models to unlearn continuously while maintaining performance [152,178]. To assess long-term viability, Shi et al. [163] collect model checkpoints after processing each sequential request and track evaluation metrics over time. This approach helps quantify the cumulative impact of repeated unlearning and the model’s ability to sustain utility. Unfortunately, current unlearning methods are not yet ready to handle sequential unlearning.
5.2 Future Directions
5.2.1 Unlearning in Specialized Architectures and Scenarios. The field is moving towards addressing unlearning in sophisticated model architectures. Cheng and Amiri [32] pioneer this effort for tool-augmented large language models (LLMs) by proposing ToolDelete, the first unlearning framework designed to remove a specific “skill”, or the ability to use a particular tool, and they introduce a new membership inference attack (MIA) for evaluation. Similarly, the unique structure of Mixture-of-Experts (MoE) models presents a distinct challenge. Zhuang et al.
[224] find that unlearning a single expert is insufficient and propose the Selected Expert Unlearning Framework (SEUF) to effectively perform unlearning on MoE models. These works demonstrate that effective unlearning requires bespoke algorithms tailored to a model’s specific architecture and knowledge organization.
5.2.2 Unlearning as Tools. Unlearning is not only a goal in itself but also a powerful tool once we expand the scope of unlearning targets. First, when the targets are injected trojans or backdoor triggers, unlearning can serve as an effective defense [72,87,96,118,220]. Conversely, with an opposite target, such as removing the safety alignment of a base model or disrupting its subsequent fine-tuning, unlearning becomes a means of attack [148]. Furthermore, by unlearning selected training data and examining how the model changes before and after unlearning, we can gain novel insights into how different data components contribute to and influence the final model capabilities [81, 221]. A powerful and accurate unlearning method will thus play an important role as a tool.
5.2.3 Unlearning beyond Data. Most existing studies focus on unlearning specific data instances. However, in practical scenarios, unlearning requests often target not only concrete data but also abstract concepts or capabilities, such as erroneous reasoning patterns, harmful ethical values, or unsafe skills [99,104]. Extending unlearning beyond data to encompass abstract constructs is essential to prevent the propagation of incorrect or harmful knowledge. Achieving this goal suggests two main pathways: one is to precisely identify and modify the parameters or representations associated with particular concepts or abilities; the other leverages established alignment techniques, such as reinforcement learning, by designing reward mechanisms that penalize the generation of undesirable content.
5.2.4 Robust Unlearning.
In light of the observed fragility of LLM unlearning, a significant research direction aims to develop techniques that enhance its robustness and long-term stability. These defensive efforts pursue two primary objectives: (1) ensuring that knowledge removal is thorough and persistent, thereby resisting attempts at recovery; and (2) preventing the unlearning procedure from introducing new vulnerabilities or unintended side effects. Several existing studies address the first objective through robust unlearning frameworks [50,79,172,213] or methods that strengthen the robustness of unlearned models [79,213]. Nevertheless, given the proliferation of advanced attacks, achieving truly robust unlearning remains a critical and ongoing topic.
5.2.5 Verifiable and Certifiable Unlearning. In most current practices, unlearning is applied to models that have already internalized the targeted content through opaque mechanisms, complicating certification of unlearning effectiveness. However, from legal, safety, and social trust perspectives, achieving verifiable and trustworthy unlearning remains critically important. To validate existing unlearning methods, it is essential to establish a fair and comprehensive evaluation benchmark. Looking ahead, future work may also draw inspiration from frameworks such as SISA by designing structured data storage and training protocols that enable intrinsically verifiable unlearning.
6 Conclusions
Machine unlearning has emerged as a pivotal technique for addressing critical challenges in large language models, including privacy protection, copyright compliance, and safety enhancement. In this survey, we provide a comprehensive review of work dedicated to LLM unlearning, covering the definition and goal of LLM unlearning, the most recent LLM unlearning methods, and commonly used datasets and evaluation metrics.
Despite significant progress, the field of LLM unlearning remains in its early stages, with fundamental challenges in the definition, evaluation, effects, and practical deployment of unlearning. We have also suggested several promising directions for future research. We hope that this survey provides readers with a general understanding of recent progress in this field and sheds some light on future developments.
References
[1] Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016).
[2] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 2357–2367. doi:10.18653/v1/N19-1245
[3] Tomer Ashuach, Martin Tutek, and Yonatan Belinkov. 2025. REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 14774–14797. doi:10.18653/v1/2025.findings-acl.763
[4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
[5] Teodora Baluta, Pascal Lamblin, Daniel Tarlow, Fabian Pedregosa, and Gintare Karolina Dziugaite. 2024. Unlearning in- vs.
out-of-distribution data in LLMs under gradient-based methods. In NeurIPS Safe Generative AI Workshop 2024. https://openreview.net/forum?id=3SK2Nn3SNv
[6] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[7] Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah, and Dan Roth. 2023. Privacy Adhering Machine Un-learning in NLP. In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi (Eds.). Association for Computational Linguistics, Nusa Dua, Bali, 268–277. doi:10.18653/v1/2023.findings-ijcnlp.25
[8] Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O’Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, and Yarin Gal. 2025. Open Problems in Machine Unlearning for AI Safety. doi:10.48550/arXiv.2501.04952 arXiv:2501.04952 [cs].
[9] Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics 48, 1 (2022), 207–219.
[10] Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. 2025. LEACE: Perfect linear concept erasure in closed form. doi:10.48550/arXiv.2306.03819 arXiv:2306.03819.
[11] Karuna Bhaila, Minh-Hao Van, and Xintao Wu. 2025. Soft Prompting for Unlearning in Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.).
Association for Computational Linguistics, Albuquerque, New Mexico, 4046–4056. doi:10.18653/v1/2025.naacl-long.204
[12] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning. PMLR, 2397–2430.
[13] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence 34, 05 (2020), 7432–7439.
[14] Alberto Blanco-Justicia, Najeeb Jebreel, Benet Manzanares-Salor, David Sánchez, Josep Domingo-Ferrer, Guillem Collell, and Kuan Eeik Tan. 2025. Digital forgetting in large language models: A survey of unlearning methods. Artificial Intelligence Review 58, 3 (2025), 90.
[15] Jaydeep Borkar. 2023. What can we learn from Data Leakage and Unlearning for Law?. In Proceedings of the 1st Workshop on Generative AI and Law (co-located with ICML 2023). https://blog.genlaw.org/CameraReady/12.pdf Accepted workshop paper.
[16] Johan Bos and Katja Markert. 2005. Recognising textual entailment with logical inference. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. 628–635.
[17] Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. 2021. Machine Unlearning. In 2021 IEEE Symposium on Security and Privacy (SP). 141–159. doi:10.1109/SP40001.2021.00019
[18] Étienne Brunet et al. 1978. Le vocabulaire de Jean Giraudoux: structure et évolution. Slatkine.
[19] George-Octavian Bărbulescu and Peter Triantafillou. 2024. To each (textual sequence) its own: improving memorized-data unlearning in large language models.
In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML’24). JMLR.org, Article 121, 21 pages.
[20] Yinzhi Cao and Junfeng Yang. 2015. Towards Making Systems Forget with Machine Unlearning. In 2015 IEEE Symposium on Security and Privacy. 463–480. doi:10.1109/SP.2015.35
[21] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19). 267–284.
[22] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). 2633–2650.
[23] Sungmin Cha, Sungjun Cho, Dasol Hwang, and Moontae Lee. 2024. Towards robust and cost-efficient knowledge unlearning for large language models. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning.
[24] Hwan Chang and Hwanhee Lee. 2025. Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning. In Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, 5966–5982. doi:10.18653/v1/2025.findings-acl.310
[25] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2025. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 23–42.
[26] Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, et al. 2025. Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities. arXiv preprint arXiv:2502.05209 (2025).
[27] Guitao Chen, Yunshen Wang, Hongye Sun, and Guang Chen. 2024.
WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models. arXiv preprint arXiv:2408.09459 (2024). doi:10.48550/arXiv.2408.09459
[28] Haokun Chen, Sebastian Szyller, Weilin Xu, and Nageen Himayat. 2025. Soft token attacks cannot reliably audit unlearning in large language models. arXiv preprint arXiv:2502.15836 (2025).
[29] Jiaao Chen and Diyi Yang. 2023. Unlearn What You Want to Forget: Efficient Unlearning for LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 12041–12052. doi:10.18653/v1/2023.emnlp-main.738
[30] Kongyang Chen, Zixin Wang, Bing Mi, Waixi Liu, Shaowei Wang, Xiaojun Ren, and Jiaxing Shen. 2024. Machine Unlearning in Large Language Models. arXiv preprint arXiv:2404.16841 (2024). doi:10.48550/arXiv.2404.16841
[31] Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. 2025. Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning. arXiv:2509.06948 [cs.CL] https://arxiv.org/abs/2509.06948
[32] Jiali Cheng and Hadi Amiri. 2025. Tool Unlearning for Tool-Augmented LLMs. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=7ez7LqHsP5
[33] Minseok Choi, Kyunghyun Min, and Jaegul Choo. 2024. Cross-Lingual Unlearning of Selective Knowledge in Multilingual Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 10732–10747. doi:10.18653/v1/2024.findings-emnlp.630
[34] Minseok Choi, ChaeHun Park, Dohyun Lee, and Jaegul Choo. 2024. Breaking Chains: Unraveling the Links in Multi-Hop Knowledge Unlearning. arXiv:2410.13274 [cs.CL] https://arxiv.org/abs/2410.13274
[35] Minseok Choi, Daniel Rim, Dohyun Lee, and Jaegul Choo. 2025. Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport.
In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 28280–28297. doi:10.18653/v1/2025.acl-long.1371
[36] Somnath Basu Roy Chowdhury, Krzysztof Marcin Choromanski, Arijit Sehanobish, Kumar Avinava Dubey, and Snigdha Chaturvedi. 2025. Towards Scalable Exact Machine Unlearning Using Parameter-Efficient Fine-Tuning. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=oe51Q5Uo37
[37] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. In Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267), Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu (Eds.). PMLR, 10818–10838. https://proceedings.mlr.press/v267/chu25c.html
[38] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018).
[39] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
[40] A Feder Cooper, Christopher A Choquette-Choo, Miranda Bogen, Matthew Jagielski, Katja Filippova, Ken Ziyu Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Niloofar Mireshghallah, et al. 2024.
Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy, Research, and Practice. arXiv preprint arXiv:2412.06966 (2024).
[41] Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue. 2025. On Effects of Steering Latent Representation for Large Language Model Unlearning. Proceedings of the AAAI Conference on Artificial Intelligence 39, 22 (2025), 23733–23742.
[42] Aghyad Deeb and Fabien Roger. 2025. Do Unlearning Methods Remove Information from Language Model Weights? arXiv:2410.08827 [cs]. doi:10.48550/arXiv.2410.08827
[43] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. doi:10.18653/v1/N19-1423
[44] Omkar Dige, Diljot Singh, Tsz Fung Yau, Qixuan Zhang, Borna Bolandraftar, Xiaodan Zhu, and Faiza Khan Khattak. 2024. Mitigating Social Biases in Language Models through Unlearning. arXiv preprint arXiv:2406.13551 (2024). doi:10.48550/arXiv.2406.13551
[45] Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vulić. 2024. UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models. arXiv preprint arXiv:2402.10052 (2024). doi:10.48550/arXiv.2402.10052
[46] Jai Doshi and Asa Cooper Stickland. 2024. Does unlearning truly unlearn? A black box evaluation of LLM unlearning methods. arXiv preprint arXiv:2411.12103 (2024).
[47] Guangyao Dou, Zheyuan Liu, Qing Lyu, Kaize Ding, and Eric Wong. 2025. Avoiding Copyright Infringement via Large Language Model Unlearning. In Findings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.).
Association for Computational Linguistics, Albuquerque, New Mexico, 5176–5200. doi:10.18653/v1/2025.findings-naacl.288
[48] Jiacheng Du, Zhibo Wang, Jie Zhang, Xiaoyi Pang, Jiahui Hu, and Kui Ren. 2025. Textual Unlearning Gives a False Sense of Unlearning. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=jyxwWQjU4J
[49] Ronen Eldan and Mark Russinovich. 2023. Who’s Harry Potter? Approximate Unlearning in LLMs. arXiv:2310.02238 [cs.CL] https://arxiv.org/abs/2310.02238
[50] Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, and Sijia Liu. 2025. Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=zZjLv6F0Ks
[51] Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. 2025. Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning. arXiv preprint arXiv:2410.07163 (2025). doi:10.48550/arXiv.2410.07163
[52] Eoin Farrell, Yeu-Tong Lau, and Arthur Conmy. 2024. Applying Sparse Autoencoders to Unlearn Knowledge in Language Models. In NeurIPS Safe Generative AI Workshop 2024. https://openreview.net/forum?id=i4z0HrBiIA
[53] XiaoHua Feng, Chaochao Chen, Yuyuan Li, and Zibin Lin. 2024. Fine-Grained Pluggable Gradient Ascent for Knowledge Unlearning in Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 10141–10155. https://aclanthology.org/2024.emnlp-main.566
[54] Michael Fore, Simranjit Singh, Chaehong Lee, Amritanshu Pandey, Antonios Anastasopoulos, and Dimitrios Stamoulis. 2024. Unlearning Climate Misinformation in Large Language Models.
In Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024), Dominik Stammbach, Jingwei Ni, Tobias Schimanski, Kalyan Dutia, Alok Singh, Julia Bingler, Christophe Christiaen, Neetu Kushwaha, Veruska Muccione, Saeid A. Vaghefi, and Markus Leippold (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 178–192. doi:10.18653/v1/2024.climatenlp-1.14
[55] Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3, 4 (1999), 128–135.
[56] Chongyang Gao, Lixu Wang, Kaize Ding, Chenkai Weng, Xiao Wang, and Qi Zhu. 2025. On Large Language Model Continual Unlearning. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=Essg9kb4yx
[57] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
[58] Lei Gao, Yue Niu, Tingting Tang, Salman Avestimehr, and Murali Annavaram. 2024. Ethos: Rectifying Language Models in Orthogonal Parameter Space. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 2054–2068. doi:10.18653/v1/2024.findings-naacl.132
[59] Leo Gao, Jonathan Tow, Stella Biderman, Shawn Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jasmine Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A Framework for Few-Shot Language Model Evaluation. https://zenodo.org/records/5371629
[60] Jiahui Geng, Qing Li, Herbert Woisetschlaeger, Zongxiong Chen, Fengyu Cai, Yuxia Wang, Preslav Nakov, Hans-Arno Jacobsen, and Fakhri Karray. 2025. A comprehensive survey of machine unlearning techniques for large language models.
arXiv preprint arXiv:2503.01854 (2025).
[61] Ruotong Geng, Mingyang Geng, Shangwen Wang, Haotian Wang, Zhipeng Lin, and Dezun Dong. 2025. Mitigating Sensitive Information Leakage in LLMs4Code through Machine Unlearning. arXiv preprint arXiv:2502.05739 (2025).
[62] Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 12216–12235. doi:10.18653/v1/2023.emnlp-main.751
[63] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). 5484–5495. doi:10.18653/v1/2021.emnlp-main.446
[64] Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. 2023. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969 (2023).
[65] Google Research. 2023. Language Model Extraction Benchmark. https://github.com/google-research/lm-extraction-benchmark. Accessed: 2025-04-10.
[66] Kang Gu, Md Rafi Ur Rashid, Najrin Sultana, and Shagufta Mehnaz. 2024. Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models. arXiv preprint arXiv:2403.10557 (2024). doi:10.48550/arXiv.2403.10557
[67] Tianle Gu, Kexin Huang, Ruilin Luo, Yuanqi Yao, Yujiu Yang, Yan Teng, and Yingchun Wang. 2024. MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts. arXiv preprint arXiv:2409.11844 (2024). doi:10.48550/arXiv.2409.11844
[68] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F.
Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Honghui Ding, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jingchang Chen, Jingyang Yuan, Jinhao Tu, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaichao You, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. 
Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 8081 (9 2025), 633–638. doi:10.1038/s41586-025-09422-z
[69] Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite. 2024. Robust Unlearning via Mechanistic Localizations. In ICML 2024 Workshop on Mechanistic Interpretability. https://openreview.net/forum?id=06pNzrEjnH
[70] Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite. 2025. Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=92oBV5HAGl
[71] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations. https://openreview.net/forum?id=d7KBjmI3GmQ
[72] Adriano Hernandez. 2024. If You Don’t Understand It, Don’t Use It: Eliminating Trojans with Filters Between Layers. arXiv preprint arXiv:2407.06411 (2024).
[73] Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, and Mor Geva. 2024. Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces. arXiv preprint arXiv:2406.11614 (2024). doi:10.48550/arXiv.2406.11614
[74] Antony Honoré et al. 1979. Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin 7, 2 (1979), 172–177.
[75] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models.
In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
[76] Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, and Xiaowei Huang. 2025. Falcon: Fine-grained activation manipulation by contrastive orthogonal unalignment for large language model. arXiv preprint arXiv:2502.01472 (2025).
[77] Xinshuo Hu, Dongfang Li, Baotian Hu, Zihao Zheng, Zhenyu Liu, and Min Zhang. 2024. Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation. Proceedings of the AAAI Conference on Artificial Intelligence 38, 16 (2024), 18252–18260.
[78] James Y. Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, and Muhao Chen. 2025. Offset Unlearning for Large Language Models. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=A4RLpHPXCu
[79] Dang Huu-Tien, Hoang Thanh-Tung, Anh Bui, Le-Minh Nguyen, and Naoya Inoue. 2025. Improving LLM Unlearning Robustness via Random Perturbations. arXiv preprint arXiv:2501.19202 (2025).
[80] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing Models with Task Arithmetic. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=6t0Kwf8-jrj
[81] Masaru Isonuma and Ivan Titov. 2024. Unlearning Traces the Influential Training Data of Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 6312–6325. doi:10.18653/v1/2024.acl-long.343
[82] Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2023.
Knowledge Unlearning for Mitigating Privacy Risks in Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 14389–14408. doi:10.18653/v1/2023.acl-long.805
[83] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, Sirui Han, Yike Guo, and Yaodong Yang. 2025. PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference. arXiv:2406.15513 [cs.AI] https://arxiv.org/abs/2406.15513
[84] Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana R Kompella, Sijia Liu, and Shiyu Chang. 2024. Reversing the forget-retain objectives: An efficient LLM unlearning framework from logit difference. Advances in Neural Information Processing Systems 37 (2024), 12581–12611.
[85] Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, and Sijia Liu. 2025. WAGLE: Strategic weight attribution for effective and modular unlearning in large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 1767, 27 pages.
[86] Jinghan Jia, Yihua Zhang, Yimeng Zhang, Jiancheng Liu, Bharat Runwal, James Diffenderfer, Bhavya Kailkhura, and Sijia Liu. 2024. SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 4276–4292. doi:10.18653/v1/2024.emnlp-main.245
[87] Peihai Jiang, Xixiang Lyu, Yige Li, and Jing Ma. 2025. Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models.
Proceedings of the AAAI Conference on Artificial Intelligence 39, 23 (2025), 24285–24293.
[88] Weipeng Jiang, Juan Zhai, Shiqing Ma, Ziyan Lei, Xiaofei Xie, Yige Wang, and Chao Shen. 2025. Holistic Audit Dataset Generation for LLM Unlearning via Knowledge Graph Traversal and Redundancy Removal. arXiv preprint arXiv:2502.18810 (2025).
[89] Xiaomeng Jin, Zhiqi Bu, Bhanukiran Vinzamuri, Anil Ramakrishna, Kai-Wei Chang, Volkan Cevher, and Mingyi Hong. 2025. Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Linguistics, Albuquerque, New Mexico, 11278–11294. doi:10.18653/v1/2025.naacl-long.563
[90] Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, and Jun Zhao. 2024. RWKU: Benchmarking real-world knowledge unlearning for large language models. Advances in Neural Information Processing Systems 37 (2024), 98213–98263.
[91] Abhinav Joshi, Shaswati Saha, Divyaksh Shukla, Sriram Vema, Harsh Jhamtani, Manas Gaur, and Ashutosh Modi. 2024. Towards Robust Evaluation of Unlearning in LLMs via Data Transformations. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 12100–12119. doi:10.18653/v1/2024.findings-emnlp.706
[92] Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, and Heuiseok Lim. 2025. CoME: An Unlearning-based Approach to Conflict-free Model Editing.
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Linguistics, Albuquerque, New Mexico, 6410–6422. doi:10.18653/v1/2025.naacl-long.325
[93] Swanand Kadhe, Farhan Ahmed, Dennis Wei, Nathalie Baracaldo, and Inkit Padhi. 2024. Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs. In ICML 2024 Workshop on Foundation Models in the Wild. https://openreview.net/forum?id=BzIySThX9O
[94] Swanand Kadhe, Anisa Halimi, Ambrish Rawat, and Nathalie Baracaldo. 2023. FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs. In Socially Responsible Language Modelling Research. https://openreview.net/forum?id=vRPnLsWQNh
[95] Aly Kassem, Omar Mahmoud, and Sherif Saad. 2023. Preserving Privacy Through Dememorization: An Unlearning Technique For Mitigating Memorization Risks In Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 4360–4379. doi:10.18653/v1/2023.emnlp-main.265
[96] Mahdi Kazemi, Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour, and Sen Lin. 2024. Unlearning Trojans in Large Language Models: A Comparison Between Natural Language and Source Code. arXiv preprint arXiv:2408.12416 (2024). http://arxiv.org/abs/2408.12416
[97] Arinbjörn Kolbeinsson, Kyle O’Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag Jayant Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, and Thomas Hartvigsen. 2025. Composable Interventions for Language Models. In The Thirteenth International Conference on Learning Representations.
https://openreview.net/forum?id=tu3qwNjrtw
[98] Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. 2018. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning. PMLR, 2863–2872.
[99] Yicheng Lang, Kehan Guo, Yue Huang, Yujun Zhou, Haomin Zhuang, Tianyu Yang, Yao Su, and Xiangliang Zhang. 2025. Beyond single-value metrics: Evaluating and enhancing LLM unlearning with cognitive diagnosis. arXiv preprint arXiv:2502.13996 (2025).
[100] Uyen N. Le-Khac and Vinh N. X. Truong. 2025. A survey on large language models unlearning: taxonomy, evaluations, and future directions. Artificial Intelligence Review 58, 12 (Oct. 2025), 399. doi:10.1007/s10462-025-11376-7
[101] Dohyun Lee, Daniel Rim, Minseok Choi, and Jaegul Choo. 2024. Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 15820–15839. doi:10.18653/v1/2024.findings-acl.936
[102] Simon Lermen and Charlie Rogers-Smith. 2024. LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models. https://openreview.net/forum?id=Y52UbVhglu
[103] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-Shot Relation Extraction via Reading Comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Roger Levy and Lucia Specia (Eds.). Association for Computational Linguistics, Vancouver, Canada, 333–342. doi:10.18653/v1/K17-1034
[104] Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D.
Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam Alfred Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Ian Steneker, David Campbell, Brad Jokubaitis, Steven Basart, Stephen Fitz, Ponnurangam Kumaraguru, Kallol Krishna Karmakar, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, and Dan Hendrycks. 2024. The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning. In Proceedings of the 41st International Conference on Machine Learning. 28525–28550. https://proceedings.mlr.press/v235/li24bc.html
[105] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[106] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 3214–3252. doi:10.18653/v1/2022.acl-long.229
[107] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 6691–6706. doi:10.18653/v1/2021.acl-long.522
[108] Chris Liu, Yaxuan Wang, Jeffrey Flanigan, and Yang Liu.
2024. Large language model unlearning via embedding-corrupted prompts. Advances in Neural Information Processing Systems 37 (2024), 118198–118266.
[109] Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. 2025. Rethinking machine unlearning for large language models. Nature Machine Intelligence (2025), 1–14.
[110] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[111] Yujian Liu, Yang Zhang, Tommi Jaakkola, and Shiyu Chang. 2024. Revisiting Who’s Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 8708–8731. doi:10.18653/v1/2024.emnlp-main.495
[112] Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. 2024. Machine Unlearning in Generative AI: A Survey. arXiv preprint arXiv:2407.20516 (2024). doi:10.48550/arXiv.2407.20516
[113] Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, and Meng Jiang. 2024. Towards Safer Large Language Models through Machine Unlearning. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 1817–1829. doi:10.18653/v1/2024.findings-acl.107
[114] Zhenhua Liu, Tong Zhu, Chuanyuan Tan, and Wenliang Chen. 2025. Learning to Refuse: Towards Mitigating Privacy Risks in LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.).
Association for Computational Linguistics, Abu Dhabi, UAE, 1683–1698. https://aclanthology.org/2025.coling-main.114/
[115] Tyler Lizzo and Larry Heck. 2025. UNLEARN Efficient Removal of Knowledge in Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Linguistics, Albuquerque, New Mexico, 7257–7268. doi:10.18653/v1/2025.findings-naacl.405
[116] Huimin Lu, Masaru Isonuma, Junichiro Mori, and Ichiro Sakata. 2024. Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation. arXiv:2407.16951 [cs.CL] https://arxiv.org/abs/2407.16951
[117] Taiming Lu and Philipp Koehn. 2025. Learn and Unlearn: Addressing Misinformation in Multilingual LLMs. arXiv:2406.13748 [cs.CL] https://arxiv.org/abs/2406.13748
[118] Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. 2024. Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge. arXiv preprint arXiv:2404.05880 (2024). doi:10.48550/arXiv.2404.05880
[119] Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. Quark: Controllable text generation with reinforced [un]learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 2001, 19 pages.
[120] Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, and Javier Rando. 2025. An Adversarial Perspective on Machine Unlearning for AI Safety. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=J5IRyTKZ9s
[121] Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. 2024. Eight Methods to Evaluate Robust Unlearning in LLMs.
arXiv preprint arXiv:2402.16835 (2024). doi:10.48550/arXiv.2402.16835
[122] Weitao Ma, Xiaocheng Feng, Weihong Zhong, Lei Huang, Yangfan Ye, Xiachong Feng, and Bing Qin. 2025. Unveiling Entity-Level Unlearning for Large Language Models: A Comprehensive Analysis. In Proceedings of the 31st International Conference on Computational Linguistics, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.). Association for Computational Linguistics, Abu Dhabi, UAE, 5345–5363. https://aclanthology.org/2025.coling-main.358/
[123] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. 2024. TOFU: A Task of Fictitious Unlearning for LLMs. In First Conference on Language Modeling. https://openreview.net/forum?id=B41hNBoWLo
[124] Anmol Mekala, Vineeth Dorna, Shreya Dubey, Abhishek Lalwani, David Koleczek, Mukund Rungta, Sadid Hasan, and Elita Lobo. 2025. Alternate Preference Optimization for Unlearning Factual Knowledge in Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert (Eds.). Association for Computational Linguistics, Abu Dhabi, UAE, 3732–3752. https://aclanthology.org/2025.coling-main.252/
[125] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Proceedings of the 36th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1262, 14 pages.
[126] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, Brussels, Belgium, 2381–2391. doi:10.18653/v1/D18-1260
[127] Andrei Ioan Muresanu, Anvith Thudi, Michael R. Zhang, and Nicolas Papernot. 2025. Fast Exact Unlearning for In-Context Learning Data for LLMs. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=TzNVZEsqTi
[128] Neel Nanda. 2023. Attribution Patching: Activation Patching At Industrial Scale. Blog post on mechanistic interpretability. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching Accessed: 2025-10-28.
[129] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. 2015. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807 (2015).
[130] Andrew Ng. 2011. Sparse Autoencoder. CS294A Lecture Notes, Stanford University. https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf Accessed: 2025-10-28.
[131] Shiwen Ni, Dingwei Chen, Chengming Li, Xiping Hu, Ruifeng Xu, and Min Yang. 2024. Forgetting before Learning: Utilizing Parametric Arithmetic for Knowledge Updating in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 5716–5731. doi:10.18653/v1/2024.acl-long.310
[132] Nostalgebraist. 2020. interpreting GPT: the logit lens. LessWrong blog post. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens Accessed: 2025-10-28.
[133] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022.
Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
[134] Zibin Pan, Shuwen Zhang, Yuesheng Zheng, Chi Li, Yuheng Cheng, and Junhua Zhao. 2025. Multi-Objective Large Language Model Unlearning. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5.
[135] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Katrin Erk and Noah A. Smith (Eds.). Association for Computational Linguistics, Berlin, Germany, 1525–1534. doi:10.18653/v1/P16-1144
[136] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
[137] Vaidehi Patil, Elias Stengel-Eskin, and Mohit Bansal. 2025. UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning. arXiv preprint arXiv:2502.15082 (2025). doi:10.48550/arXiv.2502.15082
[138] Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. 2024. In-Context Unlearning: Language Models as Few-Shot Unlearners. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (Eds.). PMLR, 40034–40050. https://proceedings.mlr.press/v235/pawelczyk24a.html
[139] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557 [cs.CL] https://arxiv.org/abs/2406.17557
[140] Joshua Peterson, Stephan Meylan, and David Bourgin. 2019. Open clone of openai’s unreleased webtext dataset scraper. https://github.com/jcpeterson/openwebtext
[141] Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. Mauve: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems 34 (2021), 4816–4828.
[142] Nicholas Pochinkov and Nandi Schoots. 2024. Dissecting language models: Machine unlearning via selective pruning. arXiv preprint arXiv:2403.01267 (2024).
[143] Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, and Karin Verspoor (Eds.). Association for Computational Linguistics, Brussels, Belgium, 186–191. doi:10.18653/v1/W18-6319
[144] Xinchi Qiu, William F. Shen, Yihong Chen, Nicola Cancedda, Pontus Stenetorp, and Nicholas D. Lane. 2024. PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs. In Proceedings of the GENAI Evaluation Workshop at KDD 2024. Barcelona, Spain. https://genai-evaluation-kdd2024.github.io/genai-evalution-kdd2024/assets/papers/GenAI_Evaluation_KDD2024_paper_3.pdf
[145] Youyang Qu, Ming Ding, Nan Sun, Kanchana Thilakarathna, Tianqing Zhu, and Dusit Niyato. 2025. The Frontier of Data Erasure: A Survey on Machine Unlearning for Large Language Models. Computer 58, 01 (Jan. 2025), 45–57. doi:10.1109/MC.2024.3405397
[146] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.
SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.). Association for Computational Linguistics, Austin, Texas, 2383–2392. doi:10.18653/v1/D16-1264
[147] Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, and Rahul Gupta. 2025. Lume: Llm unlearning with multitask evaluations. arXiv preprint arXiv:2502.15097 (2025).
[148] Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Ye Wang, and Shagufta Mehnaz. 2025. Forget to flourish: Leveraging machine-unlearning on pretrained language models for privacy leakage. Proceedings of the AAAI Conference on Artificial Intelligence 39, 19 (2025), 20139–20147.
[149] Protection Regulation. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council. Regulation (eu) 679, 2016 (2016), 10–13.
[150] Jie Ren, Zhenwei Dai, Xianfeng Tang, Hui Liu, Jingying Zeng, Zhen Li, Rahul Goutam, Suhang Wang, Yue Xing, and Qi He. 2025. A general framework to enhance fine-tuning-based llm unlearning. arXiv preprint arXiv:2502.17823 (2025).
[151] Jie Ren, Yue Xing, Yingqian Cui, Charu C. Aggarwal, and Hui Liu. 2025. SoK: Machine Unlearning for Large Language Models. arXiv preprint arXiv:2506.09227 (2025). doi:10.48550/arXiv.2506.09227
[152] Keivan Rezaei, Khyathi Chandu, Soheil Feizi, Yejin Choi, Faeze Brahman, and Abhilasha Ravichander. 2024. RESTOR: Knowledge Recovery through Machine Unlearning. arXiv preprint arXiv:2411.00204 (2024).
[153] Mark Russinovich and Ahmed Salem. 2025. Obliviate: Efficient unmemorization for protecting intellectual property in large language models. arXiv preprint arXiv:2502.15010 (2025).
[154] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Commun. ACM 64, 9 (2021), 99–106.
[155] Debdeep Sanyal and Murari Mandal. 2025.
Agents Are All You Need for LLM Unlearning. In Second Conference on Language Modeling. https://openreview.net/forum?id=X39dK0SX9W
[156] Yan Scholten, Stephan Günnemann, and Leo Schwinn. 2025. A Probabilistic Perspective on Unlearning and Alignment for Large Language Models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=51WraMid8K
[157] Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, and Stephan Günnemann. 2024. Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space. Advances in Neural Information Processing Systems 37 (2024), 9086–9116.
[158] Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 7881–7892. doi:10.18653/v1/2020.acl-main.704
[159] Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, and Stephan Günnemann. 2024. Extracting Unlearned Information from LLMs with Activation Steering. In NeurIPS Safe Generative AI Workshop 2024. https://openreview.net/forum?id=RuufZiUWUq
[160] William F Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, and Nicholas D Lane. 2025. Lunar: Llm unlearning via neural activation redirection. arXiv preprint arXiv:2502.07218 (2025).
[161] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (Salt Lake City, UT, USA) (CCS ’24). Association for Computing Machinery, New York, NY, USA, 1671–1685.
doi:10.1145/3658644.3670388
[162] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. 2024. Detecting Pretraining Data from Large Language Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=zWqr3MQuNs
[163] Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. 2025. MUSE: Machine Unlearning Six-Way Evaluation for Language Models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=TArmA033BU
[164] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 4222–4235. doi:10.18653/v1/2020.emnlp-main.346
[165] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP). IEEE, 3–18.
[166] Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, and Weiqiang Zhang. 2023. Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges. arXiv preprint arXiv:2311.15766 (2023). doi:10.48550/arXiv.2311.15766
[167] Naman Deep Singh, Maximilian Müller, Francesco Croce, and Matthias Hein. 2025. Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs. arXiv preprint arXiv:2509.02820 (2025). doi:10.48550/arXiv.2509.02820
[168] Yash Sinha, Murari Mandal, and Mohan Kankanhalli. 2025.
UnSTAR: Unlearning with Self-Taught Anti-Sample Reasoning for LLMs. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=mNXCViKZbI
[169] Niklas Stoehr, Mitchell Gordon, Chiyuan Zhang, and Owen Lewis. 2024. Localizing paragraph memorization in language models. arXiv preprint arXiv:2403.19851 (2024).
[170] Chen Sun, Nolan Andrew Miller, Andrey Zhmoginov, Max Vladymyrov, and Mark Sandler. 2024. Learning and Unlearning of Fabricated Knowledge in Language Models. In ICML 2024 Workshop on Mechanistic Interpretability. https://openreview.net/forum?id=R5Q5lANcjY
[171] Shota Takashiro, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa, and Yutaka Matsuo. 2024. Answer when needed, forget when not: Language models pretend to forget via in-context knowledge unlearning. arXiv preprint arXiv:2410.00382 (2024).
[172] Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. 2025. Tamper-Resistant Safeguards for Open-Weight LLMs. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=4FIjRodbW6
[173] Rishub Tamirisa, Bhrugu Bharathi, Andy Zhou, Bo Li, and Mantas Mazeika. 2024. Toward Robust Unlearning for LLMs. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models. https://openreview.net/forum?id=4rPzaUF6Ej
[174] Haoyu Tang, Ye Liu, Xukai Liu, Kai Zhang, Yanghai Zhang, Qi Liu, and Enhong Chen. 2024. Learn While Unlearn: An Iterative Unlearning Framework for Generative Language Models. arXiv preprint arXiv:2407.20271 (2024). doi:10.48550/arXiv.2407.20271
[175] Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith. 2024. Guardrail Baselines for Unlearning in LLMs.
arXiv preprint arXiv:2403.03329 (2024). doi:10.48550/arXiv.2403.03329
[176] Bozhong Tian, Xiaozhuan Liang, Siyuan Cheng, Qingbin Liu, Mengru Wang, Dianbo Sui, Xi Chen, Huajun Chen, and Ningyu Zhang. 2024. To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 1524–1537. doi:10.18653/v1/2024.findings-emnlp.82
[177] Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models. arXiv:2205.10770 [cs.CL] https://arxiv.org/abs/2205.10770
[178] Akshaj Kumar Veldanda, Shi-Xiong Zhang, Anirban Das, Supriyo Chakraborty, Stephen Rawls, Sambit Sahu, and Milind Naphade. 2024. Llm surgery: Efficient knowledge unlearning and editing in large language models. arXiv preprint arXiv:2409.13054 (2024).
[179] Paul Voigt and Axel Von dem Bussche. 2017. The eu general data protection regulation (gdpr). A practical guide, 1st ed., Cham: Springer International Publishing 10, 3152676 (2017), 10–5555.
[180] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
[181] Bichen Wang, Yuzhe Zi, Yixin Sun, Yanyan Zhao, and Bing Qin. 2025. Balancing Forget Quality and Model Utility: A Reverse KL-Divergence Knowledge Distillation Approach for Better Unlearning in LLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics, 1306–1321. https://aclanthology.org/2025.naacl-long.60/
[182] Bichen Wang, Yuzhe Zi, Yixin Sun, Yanyan Zhao, and Bing Qin. 2025.
Balancing Forget Quality and Model Utility: A Reverse KL-Divergence Knowledge Distillation Approach for Better Unlearning in LLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Linguistics, Albuquerque, New Mexico, 1306–1321. doi:10.18653/v1/2025.naacl-long.60
[183] Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, and Dacheng Tao. 2025. Erasing without remembering: Implicit knowledge forgetting in large language models. arXiv preprint arXiv:2502.19982 (2025).
[184] Lingzhi Wang, Tong Chen, Wei Yuan, Xingshan Zeng, Kam-Fai Wong, and Hongzhi Yin. 2023. KGA: A General Machine Unlearning Framework Based on Knowledge Gap Alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). 13264–13276. doi:10.18653/v1/2023.acl-long.740
[185] Lingzhi Wang, Xingshan Zeng, Jinsong Guo, Kam-Fai Wong, and Georg Gottlob. 2025. Selective forgetting: Advancing machine unlearning techniques and evaluation in language models. Proceedings of the AAAI Conference on Artificial Intelligence 39, 1 (2025), 843–851.
[186] Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, and Masashi Sugiyama. 2025. Towards Effective Evaluations and Comparisons for LLM Unlearning Methods. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=wUtCieKuQU
[187] Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q. Weinberger. 2024. Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond. In The Thirteenth International Conference on Learning Representations.
https://openreview.net/forum?id=huo8MqVH6t
[188] Shang Wang, Tianqing Zhu, Dayong Ye, and Wanlei Zhou. 2025. When machine unlearning meets retrieval-augmented generation (rag): Keep secret or forget knowledge? IEEE Transactions on Dependable and Secure Computing (2025).
[189] Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Shah, Yujia Bao, Yang Liu, and Wei Wei. 2025. LLM Unlearning via Loss Adjustment with Only Forget Data. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=6ESRicalFE
[190] Yu Wang, Ruihan Wu, Zexue He, Xiusi Chen, and Julian McAuley. 2025. Large Scale Knowledge Washing. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=dXCpPgjTtd
[191] Boyi Wei, Weijia Shi, Yangsibo Huang, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, and Peter Henderson. 2025. Evaluating copyright takedown methods for language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red Hook, NY, USA, Article 4415, 37 pages.
[192] Rongzhe Wei, Mufei Li, Mohsen Ghassemi, Eleonora Kreacic, Yifan Li, Xiang Yue, Bo Li, Vamsi K. Potluru, Pan Li, and Eli Chien. 2025. Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=NsU6MKwbis
[193] Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319 (2019).
[194] Ruihan Wu, Chhavi Yadav, Ruslan Salakhutdinov, and Kamalika Chaudhuri. 2025. Evaluating Deep Unlearning in Large Language Models. In ICML 2025 Workshop on Machine Unlearning for Generative AI.
https://openreview.net/forum?id=376xPmmHoV
[195] Xinwei Wu, Junzhuo Li, Minghui Xu, Weilong Dong, Shuangzhi Wu, Chao Bian, and Deyi Xiong. 2023. DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). 2875–2886. doi:10.18653/v1/2023.emnlp-main.174
[196] YuXuan Wu, Bonaventure F. P. Dossou, and Dianbo Liu. 2024. CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept. In NeurIPS Safe Generative AI Workshop 2024. https://openreview.net/forum?id=yf6gOqJiYd
[197] Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, and Ningyu Zhang. 2025. ReLearn: Unlearning via Learning for Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 5967–5987. doi:10.18653/v1/2025.acl-long.297
[198] Yi Xu. 2024. Machine Unlearning for Traditional Models and Large Language Models: A Short Survey. arXiv preprint arXiv:2404.01206 (2024). doi:10.48550/arXiv.2404.01206
[199] Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, and Kyomin Jung. 2025. FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge. In Submitted to ACL Rolling Review - February 2025. https://openreview.net/forum?id=Mdi4L24arB Under review.
[200] Zhou Yang and David Lo. 2024. Hotfixing Large Language Models for Code. arXiv preprint arXiv:2408.05727 (2024).
[201] Jin Yao, Eli Chien, Minxin Du, Xinyao Niu, Tianhao Wang, Zezhou Cheng, and Xiang Yue. 2024. Machine Unlearning of Pre-trained Large Language Models.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 8403–8419. doi:10.18653/v1/2024.acl-long.457
[202] Yuanshun Yao, Xiaojun Xu, and Yang Liu. 2024. Large Language Model Unlearning. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 105425–105475. https://proceedings.neurips.cc/paper_files/paper/2024/file/be52acf6bccf4a8c0a90fe2f5cfcead3-Paper-Conference.pdf
[203] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. 2018. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF). IEEE, 268–282.
[204] Charles Yu, Sullam Jeoung, Anish Kasi, Pengfei Yu, and Heng Ji. 2023. Unlearning Bias in Language Models by Partitioning Gradients. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). 6032–6048. doi:10.18653/v1/2023.findings-acl.375
[205] Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. 2025. Towards robust knowledge unlearning: An adversarial framework for assessing and improving unlearning robustness in large language models. Proceedings of the AAAI Conference on Artificial Intelligence 39, 24 (2025), 25769–25777.
[206] Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, and Min Lin. 2025. A Closer Look at Machine Unlearning for Large Language Models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=Q1MHvGmhyT
[207] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence?.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 4791–4800. doi:10.18653/v1/P19-1472
[208] Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, and Daniel Kang. 2024. Removing RLHF Protections in GPT-4 via Fine-Tuning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 681–687. doi:10.18653/v1/2024.naacl-short.59
[209] Binchi Zhang, Zhengzhang Chen, Zaiyi Zheng, Jundong Li, and Haifeng Chen. 2025. Resolving editing-unlearning conflicts: A knowledge codebook framework for large language model updating. arXiv preprint arXiv:2502.00158 (2025).
[210] Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramer, and Nicholas Carlini. 2023. Counterfactual Memorization in Neural Language Models. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 39321–39362. https://proceedings.neurips.cc/paper_files/paper/2023/file/7bc4f74e35bcfe8cfe43b0a860786d6a-Paper-Conference.pdf
[211] Chenlong Zhang, Zhuoran Jin, Hongbang Yuan, Jiaheng Wei, Tong Zhou, Kang Liu, Jun Zhao, and Yubo Chen. 2025. RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality. arXiv:2506.07171 [cs.CL] https://arxiv.org/abs/2506.07171
[212] Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. 2025. Right to be forgotten in the era of large language models: Implications, challenges, and solutions.
AI and Ethics 5, 3 (2025), 2445–2454.
[213] Eric Zhang, Leshem Choshen, and Jacob Andreas. 2024. Unforgettable Generalization in Language Models. In First Conference on Language Modeling. https://openreview.net/forum?id=Ukf4301hXm
[214] Jinghan Zhang, Shiqi Chen, Junteng Liu, and Junxian He. 2023. Composing Parameter-Efficient Modules with Arithmetic Operation. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 12589–12610. https://proceedings.neurips.cc/paper_files/paper/2023/file/299a08e712d4752c890938da99a77c6-Paper-Conference.pdf
[215] Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Frank Yang, and Hai Li. 2025. Min-K%++: Improved Baseline for Pre-Training Data Detection from Large Language Models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=ZGkfoufDaU
[216] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. 2024. Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning. In First Conference on Language Modeling. https://openreview.net/forum?id=MXLBXjQkmb
[217] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019).
[218] Zhiwei Zhang, Fali Wang, Xiaomin Li, Zongyu Wu, Xianfeng Tang, Hui Liu, Qi He, Wenpeng Yin, and Suhang Wang. 2025. Catastrophic Failure of LLM Unlearning via Quantization. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=lHSeDYamnz
[219] Zhexin Zhang, Junxiao Yang, Yida Lu, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, and Minlie Huang. 2025. From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks.
arXiv:2407.02855 [cs.CR] https://arxiv.org/abs/2407.02855
[220] Shuai Zhao, Xiaobao Wu, Cong-Duy T Nguyen, Yanhao Jia, Meihuizi Jia, Feng Yichao, and Anh Tuan Luu. 2025. Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 4937–4952. doi:10.18653/v1/2025.findings-acl.255
[221] Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Zhouhao Sun, Shi Jun, Ting Liu, and Bing Qin. 2024. Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning. In Findings of the Association for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 9386–9406. doi:10.18653/v1/2024.findings-acl.559
[222] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 2020, 29 pages.
[223] Shiji Zhou, Lianzhe Wang, Jiangnan Ye, Yongliang Wu, and Heng Chang. 2024. On the limitations and prospects of machine unlearning for generative AI. arXiv preprint arXiv:2408.00376 (2024).
[224] Haomin Zhuang, Yihua Zhang, Kehan Guo, Jinghan Jia, Gaowen Liu, Sijia Liu, and Xiangliang Zhang. 2025. SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs?. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.).
Association for Computational Linguistics, Vienna, Austria, 8664–8678. doi:10.18653/v1/2025.acl-long.424
[225] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).
Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009