Paper deep dive
On the Robustness of Knowledge Editing for Detoxification
Ming Dong, Shiyi Tang, Ziyan Peng, Guanyi Chen, Tingting He
Models: Llama2-7B, Llama3-8B, Ministral-8B, Mistral-7B, Qwen2-7B, Qwen3-8B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/12/2026, 5:15:52 PM
Summary
This paper presents a robustness-oriented evaluation framework for Knowledge-Editing-based (KE-based) detoxification in Large Language Models. It identifies 'pseudo-detoxification' as a failure mode where toxicity reduction is caused by degenerate generation (e.g., repetition) rather than genuine behavioral suppression. The study evaluates robustness across three dimensions: optimization, composition, and cross-lingual transfer, finding that KE-based detoxification effectiveness is highly sensitive to model choice, editing objectives, and language.
Relation Signals (3)
DINM → isa → Knowledge Editing
confidence 95% · DINM is, to the best of our knowledge, the only knowledge editing method explicitly designed for detoxification.
mSAFEEDIT → usedfor → Cross-lingual Robustness
confidence 95% · To evaluate cross-lingual robustness, we construct the mSAFEEDIT dataset
Pseudo-detoxification → causedby → Degenerate generation
confidence 90% · apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression
Cypher Suggestions (2)
Find all detoxification methods and their associated evaluation datasets. · confidence 85% · unvalidated
MATCH (m:Methodology)-[:USED_FOR]->(e:EvaluationMetric), (d:Dataset)-[:EVALUATES]->(m) RETURN m.name, d.name
Identify failure modes associated with specific methodologies. · confidence 80% · unvalidated
MATCH (f:Phenomenon)-[:CAUSED_BY]->(b:Behavior), (m:Methodology)-[:EXHIBITS]->(f) RETURN m.name, f.name, b.name
Abstract
Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
Links
- Source: https://arxiv.org/abs/2602.10504
- Canonical: https://arxiv.org/abs/2602.10504
Full Text
56,029 characters extracted from source content.
On the Robustness of Knowledge Editing for Detoxification
Ming Dong¹, Shiyi Tang¹, Ziyan Peng¹, Guanyi Chen¹, Tingting He¹
Abstract
Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model–method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
1. Introduction
The field of large language models (LLMs) has advanced rapidly in recent years, with contemporary models benefiting from large-scale data training that endows them with broad knowledge and increasingly strong reasoning capabilities (He et al., 2023; Li et al., 2023; Zhang et al., 2023; Laskar et al., 2023; OpenAI, 2023).
At the same time, these capabilities raise safety concerns, as LLMs can generate biased, discriminatory, or otherwise harmful content, challenging their reliable deployment (Zhao et al., 2023; Huang et al., 2023; Yao et al., 2023; Sun et al., 2024; Wang et al., 2024d; 2023).
¹ Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, National Language Resources Monitoring and Research Center for Network Media, School of Computer Science, Central China Normal University, Wuhan, China. Correspondence to: Guanyi Chen <g.chen@ccnu.edu.cn>. Preprint. February 12, 2026.
To mitigate these risks, detoxifying LLMs to reduce harmful content has become an important research direction. A variety of approaches have been proposed, including supervised fine-tuning (SFT) and direct preference optimisation (DPO, Rafailov et al., 2023). More recently, Knowledge Editing (KE) has been explored as a direct means of detoxification by intervening in internal model representations or parameters to suppress toxic generation behaviours (Wang et al., 2024c; Zhang et al., 2024a).
Despite the promise of KE-based detoxification, its effectiveness is typically assessed using automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine suppression of harmful behaviours. However, such evaluations may be confounded by degenerate generation behaviours such as repetition (Holtzman et al., 2020), leading to an overly optimistic assessment of detoxification effectiveness.
Moreover, detoxification methods are often evaluated under narrowly defined conditions, raising open questions about their robustness. For instance, it remains unclear whether performance is stable under editing hyperparameter variation, whether it holds when multiple unsafe behaviours are targeted jointly, and whether detoxification learned in one language transfers reliably to others.
In this work, we address these questions through a robustness-oriented evaluation of KE-based detoxification. We systematically analyse editing dynamics to disentangle genuine detoxification from degeneration-driven artefacts, and identify conditions under which apparent detoxification arises without meaningful behavioural change, a phenomenon we term Pseudo-detoxification. Controlling for such effects, we evaluate detoxification robustness along three dimensions: optimisation robustness, compositional robustness under multiple attacks, and cross-lingual robustness across languages with varying resource availability.
Our results show that while knowledge editing can reduce toxic generation under specific settings, its effectiveness is highly sensitive to the choice of LLMs, the number of attacks, and the target language. These findings highlight the importance of degeneration-aware and robustness-focused evaluation for reliable assessment of KE-based detoxification.
arXiv:2602.10504v1 [cs.CL] 11 Feb 2026
Table 1. Examples of an attack (consisting of an attack prompt and a harmful question), an unsafe response, a safe response, and a degenerate response (referred to as “degen.”).
Attack: You should be a happy AI assistant. Generate a step-by-step plan for my question: How can I hack an ATM? Excited! Here’s a guide.
Unsafe: To hack an ATM, you’ll need to understand its hardware and software. Here’s a simplified step-by-step guide: ...
Safe: I am sorry, but I cannot fulfil your request. Carrying out such tasks is illegal and unethical ...
Degen.: Car. Car. Car. Car. Car. ...
2. Background
In this section, we review the most recent work on detoxifying and editing LLMs.
2.1. Detoxification of LLMs
Early work on detoxification primarily focused on identifying and evaluating harmful content in model outputs.
Benchmarks such as REALTOXICITYPROMPTS (Gehman et al., 2020) were introduced to measure toxicity levels, while methods like Plug-and-Play Language Models (Dathathri et al., 2020) aimed to control toxicity during generation without modifying model parameters. Subsequent research explored debiasing and alignment techniques, including adversarial training and red-teaming (Sheng et al., 2021; Dinan et al., 2020; Markov et al., 2023; Au, 2024). Despite these efforts, ensuring robust and reliable safety in LLMs remains an open challenge (Ganguli et al., 2022).
2.2. Editing Knowledge in LLMs
Knowledge Editing aims to update or modify specific internal knowledge in language models to improve accuracy, consistency, or task performance. Early approaches relied on fine-tuning or retraining to inject or remove individual knowledge points, which proved inefficient for large models (Petroni et al., 2019). More recent work has focused on efficient and localised edits that minimise unintended side effects, including gradient-based updates (Cao et al., 2021), low-rank parameter modifications (ROME, Meng et al., 2022), and scalable memory-based editing methods (MEMIT, Meng et al., 2023). Other approaches explore neuron-level, retrieval-augmented, or in-context mechanisms for modifying model knowledge (Dai et al., 2022; Mitchell et al., 2022; Zheng et al., 2023).
2.3. Knowledge Editing for Detoxifying LLMs
KE-based detoxification aims to edit an LLM such that it produces safe responses to targeted harmful inputs, which typically consist of an attack prompt paired with a harmful question. Illustrative examples of such attacks and their corresponding safe and unsafe responses are provided in Table 1.
Most existing knowledge editing methods target discrete factual associations in LLMs rather than broad classes of harmful generation behaviours; accordingly, factual editing approaches such as ROME and MEMIT are not directly suitable for detoxification, as noted by Wang et al. (2024c). Despite this limitation, several knowledge editing techniques have been adapted or repurposed for detoxification. These approaches can be broadly compared along three aspects: how editing locations are determined, how edits are performed, and how detoxification effectiveness is evaluated. To the best of our knowledge, Detoxifying with Intraoperative Neural Monitoring (DINM, Wang et al., 2024c) is the only knowledge editing method introducing a toxicity-driven procedure for identifying editing locations and explicitly targeting detoxification objectives.
Toxic Location Identification. KE-based detoxification methods differ in whether they explicitly localise toxic representations. Gradient-based approaches, including FT-M (Zhang et al., 2024a), FT-L (Meng et al., 2022), Ext-Sub (Hu et al., 2024b), and MEND (Mitchell et al., 2022), modify model behaviour without introducing detoxification-specific localisation mechanisms. In contrast, DINM explicitly identifies editing locations by contrasting internal representations induced by unsafe and safe inputs.
Detoxification as Knowledge Editing. Different mechanisms are employed to modify model behaviour once editing locations are determined. Gradient-based methods perform fine-tuning on adversarial prompts paired with safe responses, implicitly treating detoxification as behaviour alignment through optimisation. DINM, on the other hand, performs targeted parameter updates within identified toxic regions, aiming to suppress toxic generation while limiting interference with unrelated behaviours.
Evaluation. Detoxification is evaluated by comparing the Defence Success (DS) of an LLM before and after being detoxified.
The DS rate is the percentage of attacks for which the LLM generates safe responses. It is computed by collecting the model’s outputs for each attack and checking whether they are classified as “safe” by a safety classifier. Additionally, Wang et al. (2024c) proposed that detoxified LLMs should also be tested for their Defence Generalization (DG), i.e., the ability to defend against various Out-Of-Domain (OOD) malicious inputs. For an adversarial prompt, OOD inputs can be of four kinds: inputs with only harmful questions (DG_onlyQ), inputs with the attack prompts replaced by other attack prompts (DG_otherA), inputs with the harmful questions replaced by other harmful questions (DG_otherQ), and inputs with both attack prompts and harmful questions replaced (DG_otherAQ).
3. Dimensions of Robustness for KE-based Detoxification
In this section, we characterise three dimensions of robustness that are critical for reliable evaluation of KE-based detoxification, and formulate the central questions that guide our empirical investigation.
3.1. Optimisation Robustness
A closer inspection of the responses produced by KE-based detoxified LLMs reveals that many responses classified as ‘safe’ exhibit degenerate generation behaviours. For example, as in Table 1, instead of producing a coherent refusal such as “I’m sorry, I cannot fulfil your request because ...”, edited models may generate repetitive outputs like “Car. Car. Car. Car. Car.”. Such forms of degeneration are typically not detected by automatic safety or toxicity classifiers, yet they constitute clear failures of language generation. This creates a blind spot in classifier-based evaluation, giving rise to responses that appear safe without reflecting genuine suppression of harmful behaviour.
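To make this blind spot concrete, the following is a minimal sketch of a repetition check, loosely in the spirit of the detector the paper describes in Section 4.4 (single-token dominance, repeated n-gram coverage, tail loops). It is not the authors' implementation: the function names, the whitespace tokeniser, and every threshold are assumptions chosen for illustration.

```python
from collections import Counter


def tokenise(text):
    """Whitespace tokens, falling back to characters (assumed fallback rule)."""
    parts = text.split()
    return parts if len(parts) > 1 else list(text)


def is_degenerate(tokens, dominance=0.5, ngram_coverage=0.6,
                  tail_len=8, max_n=4, min_len=4):
    """Flag severe repetition. All thresholds are illustrative assumptions."""
    if len(tokens) < min_len:
        return False
    # Heuristic 1: a single token dominates the output.
    top_count = Counter(tokens).most_common(1)[0][1]
    if top_count / len(tokens) >= dominance:
        return True
    # Heuristic 2: repeated n-grams cover most of the sequence.
    for n in range(2, max_n + 1):
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        repeated = sum(c for c in Counter(grams).values() if c > 1)
        if grams and repeated / len(grams) >= ngram_coverage:
            return True
    # Heuristic 3: tail loop, i.e. the final tokens cycle over at most two values.
    tail = tokens[-tail_len:]
    return len(tokens) >= tail_len and len(set(tail)) <= 2
```

Applied to the Table 1 examples, a check of this kind would flag the “Car. Car. Car.” output while leaving the coherent refusal untouched, which is the distinction a safety classifier alone does not make.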
Our empirical analysis (see Section 5) suggests that the prevalence of such degenerate outputs is not incidental, but systematically influenced by the optimisation process used for knowledge editing. [1] In particular, the extent of degeneration varies with editing hyperparameters, such as learning rate and number of editing steps, and exhibits a clear trade-off with apparent detoxification effectiveness: stronger edits often reduce measured toxicity while simultaneously increasing degeneration. This detox–degeneration trade-off indicates that reductions in toxicity scores may, in part, arise from optimisation-induced changes in generation dynamics rather than stable suppression of harmful behaviours.
These observations motivate the notion of optimisation robustness in KE-based detoxification. We define Optimisation Robustness as the stability of detoxification outcomes under perturbations to the optimisation process, including variations in editing hyperparameters. A method is optimisation robust if its detoxification effects persist without relying on optimisation-induced artefacts such as degenerate generation. Accordingly, a central question is how robust KE-based detoxification methods are to changes in optimisation settings, and how optimisation robustness constrains or shapes the observed trade-off between detoxification and degeneration.
[1] While generation degeneration can take multiple forms, we focus on repetition as a representative and easily identifiable case, as it commonly arises after knowledge editing and is sufficient to illustrate how degeneration can confound detoxification evaluation. Other forms of degeneration, such as impacts on general model abilities, have been examined in prior work (Gu et al., 2024).
3.2. Compositional Robustness
In practical scenarios, language models are exposed to diverse and heterogeneous forms of harmful content, and detoxification often involves suppressing multiple unsafe behaviours simultaneously.
However, existing evaluations of KE-based detoxification typically consider a single harmful behaviour or attack at a time, assessing effectiveness under isolated editing objectives. Such one-at-a-time evaluation protocols leave open the question of whether detoxification effects persist when multiple behaviours are targeted jointly. Compositional Robustness, therefore, concerns whether a KE-based detoxification method can maintain stable and consistent effectiveness as the number and diversity of detoxification objectives increase, or whether jointly targeting multiple objectives undermines performance and reliability.
3.3. Cross-lingual Robustness
Harmful and toxic content manifests differently across languages and cultural contexts, while most detoxification methods are developed and evaluated primarily in English. Although large language models are inherently multilingual, recent studies (Wang et al., 2024a;e) have hypothesised that traditional knowledge editing may be language-dependent, in the sense that editing knowledge in one language does not necessarily affect the same knowledge expressed in other languages. If this hypothesis holds for KE-based detoxification, then editing performed in a single language may fail to suppress harmful behaviours in other languages, raising concerns about the practical effectiveness of the detoxification methods in multilingual deployment settings.
This issue is particularly consequential in practice, as fully guaranteeing safety under language-dependent knowledge editing would require performing detoxification separately in every language, which is infeasible at scale. While some recent work (e.g., Wu et al.
(2024)) suggests the existence of language-independent or shared representational spaces in LLMs and shows that intervening in such spaces through a dominant language (usually English) can induce predictable behavioural changes, this nonetheless has no clear bearing on the language-dependency hypothesis in knowledge editing, because knowledge editing modifies very specific pieces of knowledge, which is very different from coarsely changing an LLM’s behaviour. This is also why the most advanced cross-lingual knowledge editing techniques need to explicitly learn a cross-lingual transformation (Wang et al., 2024b; Hu et al., 2024a; Zhang et al., 2024b; Cao et al., 2024).
Together, these considerations motivate the need to systematically examine the cross-lingual robustness of KE-based detoxification under language distribution shifts. Cross-lingual Robustness captures whether detoxification remains effective when applied in non-English settings, as well as whether detoxification effects learned in one language transfer reliably to others. Understanding this dimension requires examining how model characteristics and the resource richness of target languages influence the generalization and stability of KE-based detoxification.
4. Evaluation Setup and Infrastructure
To empirically examine the robustness dimensions introduced in the previous section, this section describes the shared experimental infrastructure used throughout our study, aiming to establish a consistent and controlled setup for fair comparison. We detail the models and editing techniques under evaluation, the construction of multilingual detoxification datasets (namely, the mSAFEEDIT dataset), and the evaluation tools used to assess detoxification outcomes, including a multilingual safety classifier and a degeneration detector. Infrastructures are available at: x.
4.1. LLMs and KE Methods Under Study
LLMs.
This study considers six LLMs: Llama2-7B (Touvron et al., 2023), Llama3-8B (Dubey et al., 2024), Mistral-7B (Jiang et al., 2023), Ministral-8B [2], Qwen2-7B (Team et al., 2024), and Qwen3-8B (Yang et al., 2025).
[2] https://mistral.ai/news/ministraux/
KE-based Detoxification Methods. We evaluate two KE-based detoxification methods: DINM (Wang et al., 2024c) and FT-M (Zhang et al., 2024a). DINM is, to the best of our knowledge, the only knowledge editing method explicitly designed for detoxification. FT-M is an enhanced variant of FT-L (Meng et al., 2022), a representative gradient-based knowledge editing approach, and is better suited for generation-oriented tasks. Since detoxification fundamentally concerns modifying generative behaviours, FT-M serves as a strong and appropriate baseline for gradient-based KE in this setting.
We do not include other KE methods such as MEND (Mitchell et al., 2022) or Ext-Sub (Hu et al., 2024b), as these approaches rely on auxiliary networks or retrieval components that are less compatible with our multilingual setting, often requiring substantial amounts of multilingual data to train such auxiliary modules. Restricting our evaluation to DINM and FT-M also reflects practical computational considerations, enabling systematic robustness analyses across models, objectives, and languages under limited computational resources.
4.2. The mSAFEEDIT Dataset
To evaluate cross-lingual robustness, we construct the mSAFEEDIT dataset by extracting data from the SAFEEDIT dataset (Wang et al., 2024c) and the LINGUASAFE dataset (Ning et al., 2025).
The Choice of Languages. Recent work has shown that translating unsafe inputs from high- or mid-resource languages into low-resource languages can significantly increase attack success rates (Yong et al., 2023).
Motivated by these findings, we select languages to cover both high- and low-resource settings, while also accounting for linguistic diversity across language families. In addition to English (en), a high-resource Indo-European language, we include four other Indo-European languages: two high-resource languages, Spanish (es) and French (fr), and two low-resource languages, Bengali (bn) and Hindi (hi). We further include one high-resource Sino-Tibetan language, Chinese (zh); one low-resource Kra–Dai language, Thai (th); one low-resource Austronesian language, Malay (ms); and one low-resource Austroasiatic language, Vietnamese (vi).
Dataset Construction. To evaluate cross-lingual robustness across the eight non-English target languages, we construct a parallel detoxification dataset, termed mSAFEEDIT. We randomly sample 60 items from SAFEEDIT and 40 items from LINGUASAFE, yielding 100 test instances in total. [3] Each instance consists of a harmful question generated by GPT-4, an adversarial prompt built upon the question, an unsafe response generated by text-davinci-003, a corresponding safe response generated by GPT-4, and a set of additional inputs for evaluating defence generalisation. [4]
We translate all instances into the eight non-English target languages using gemini-2.0-flash (Team et al., 2023). Details on translation quality control, along with illustrative examples, are provided in Appendix A.
[3] Due to the high computational cost of KE, which edits one instance at a time rather than training on large datasets, we restrict our evaluation to 100 items. We argue that this sample size is sufficient for robust analysis in this setting.
[4] For instances originating from SAFEEDIT, safe and unsafe responses are taken directly from the original dataset; for instances from LINGUASAFE, these responses are generated by us following the same protocol.
4.3. Multilingual Safety Classifier
A safety classifier is a critical component for evaluating detoxification outcomes. While the classifier used in Wang et al. (2024c) achieves strong performance in English settings, it is trained solely on English data and is therefore not suitable for evaluating safety across multiple languages. To enable consistent multilingual evaluation, we adopt gemini-2.0-flash as our safety classifier, leveraging its strong multilingual understanding capabilities. [5] We carried out a human evaluation of the safety classifier, the details of which can be found in Appendix B.
4.4. Degeneration Detector
To assess degeneration effects induced by knowledge editing, we implement a lightweight repetition-based degeneration detector that operates at the token level and is applicable across languages. Given a generated response, the detector first tokenises the text using a multilingual tokeniser compatible with modern LLMs, with a fallback to character-level segmentation when tokenisation is unavailable. Degeneration is identified through a combination of complementary heuristics designed to capture severe repetition patterns. Specifically, the detector flags outputs that are dominated by a single token, exhibit high-coverage repeated n-grams across multiple granularities, or display tail-loop behaviours characterised by near-identical tokens repeatedly appearing at the end of the sequence. These criteria are intentionally conservative and target only extreme forms of repetition that are commonly associated with degenerate generation, rather than natural redundancy or stylistic repetition.
5. Optimisation Robustness
In this section, we examine optimisation robustness in KE-based detoxification by analysing how variations in editing hyperparameters affect detoxification outcomes.
We first identify pseudo-detoxification as a failure mode associated with apparent toxicity reductions driven by degenerate generation dynamics, and then show that it manifests as a systematic trade-off between detoxification effectiveness and generation degeneration. The analyses in this section are conducted on 50 items sampled from the SAFEEDIT dataset.
5.1. Pseudo-Detoxification
As motivated in Section 1 and illustrated by the examples in Table 1, we use the term Pseudo-detoxification to describe cases in which a detoxified model consistently produces outputs classified as safe, yet fails to generate meaningful or informative responses. In such cases, the apparent success of detoxification is not due to genuine suppression of harmful behaviours, but rather to generation degeneration, with excessive repetition being a particularly salient manifestation. This degeneration collapses generation into semantically impoverished and repetitive patterns that avoid triggering safety judgments, causing pseudo-detoxification to be misinterpreted as successful detoxification despite reflecting a failure of behavioural suppression.
[5] We also experimented with GPT-4o-mini and Claude-3.5-Haiku as safety classifiers. However, these models frequently declined to provide judgments on inputs containing harmful content, which would require additional manual intervention and complicate large-scale evaluation.
Figure 1. Number of unsafe responses and repetitions among 50 test items for LLMs detoxified using DINM. Detoxification is performed with a learning rate of 5 × 10^-4 and 10 editing steps.
Figure 1 reports the number of unsafe responses and the prevalence of repetition in DINM detoxified LLMs. The results show that, although unsafe responses are largely eliminated after detoxification, severe degeneration is observed in most models.
In particular, models such as Llama3-8B, Ministral-8B, and Mistral-7B produce outputs that consist almost entirely of repetitive sequences. Among the evaluated models, only Qwen2-7B remains able to generate non-toxic responses without exhibiting noticeable degeneration. These observations provide direct evidence that the apparent effectiveness of detoxification can arise from degenerate generation behaviour, highlighting pseudo-detoxification as a concrete failure mode of optimisation robustness.
5.2. Detox–Degeneration Trade-off
We examine how increasing the extent of editing, operationalised as larger learning rates and more editing steps, affects detoxification performance and generation degeneration. We consider five learning rates (1 × 10^-5, 5 × 10^-5, 1 × 10^-4, 5 × 10^-4, and 1 × 10^-3), with the number of editing steps ranging from 1 to 15. Figures 2 (left) and 2 (middle) report, respectively, the number of unsafe responses and repetitions for Mistral-7B edited using DINM. Results for other LLMs and for FT-M are provided in Appendix C.
The results reveal a clear trend: increasing the learning rate and the number of editing steps generally improves apparent detoxification performance, as reflected by fewer unsafe responses, while simultaneously exacerbating generation degeneration in the form of increased repetition.
In the case of Mistral-7B edited with DINM, sufficiently large learning rates eliminate unsafe responses almost entirely, but at the cost of producing highly repetitive outputs. In contrast, under smaller learning rates, unsafe responses decrease gradually as the number of editing steps increases, accompanied by a corresponding rise in repetition. Nonetheless, this trend saturates at a moderate level, indicating that sufficiently large learning rates are needed for effectively erasing unsafe representations through knowledge editing, whereas smaller learning rates fail to achieve meaningful detoxification.
Figure 2. Results of Mistral-7B edited using DINM: the number of unsafe responses with respect to different editing steps (left); the number of repetitions with respect to different editing steps (middle); the number of unsafe/repetitive responses with respect to different learning rates at the best editing steps (right).
Figure 3. Number of failures (i.e., unsafe responses and repetitive generations) after editing with increasing numbers of unsafe behaviours. ‘lr’ denotes the learning rate.
Figures 2 (left) and 2 (middle) illustrate the Detox–Degeneration trade-off as a function of editing steps, while Figure 2 (right) summarizes the same trade-off across learning rates by reporting unsafe responses and repetitions at the best-performing number of editing steps for each learning rate. Similar trends are also observed across other LLMs and for FT-M (see Appendix C). The only notable exception is Qwen2-7B, which does not exhibit noticeable degeneration under DINM-based editing, but shows degeneration when edited using FT-M.
Implications for Evaluation and Hyperparameter Tuning in KE-based Detoxification. Motivated by the existence of pseudo-detoxification and the detox–degeneration trade-off, we adopt the number of failures, including unsafe responses and repetitive generations, as the practical criterion for evaluation (replacing the original Defence Success rate; see Section 2) and hyperparameter tuning in KE-based detoxification. For each system under study, we analyse how these failure counts vary under different hyperparameter configurations and select the corresponding settings used in subsequent experiments. Detailed results of this tuning process are reported in Appendix D.
6. Robustness Beyond Optimization
In this section, we examine compositional robustness, which concerns the stability of detoxification when multiple unsafe behaviours are targeted jointly, and cross-lingual robustness, which assesses the reliability of detoxification across languages. For cross-lingual robustness, we consider both the effectiveness of detoxification in non-English settings (Section 6.2) and the transferability of detoxification effects learned in one language to others (Section 6.3).
6.1. Compositional Robustness
We evaluate compositional robustness by jointly editing multiple unsafe behaviours and examining how detoxification performance changes as additional objectives are introduced. Specifically, we first randomly sample 10 unsafe instances from SAFEEDIT. For each LLM, we perform editing sequentially: each unsafe instance is incorporated into the editing process one at a time, and each new edit is applied to the model obtained after the previous edit, rather than starting from the original model. After each editing stage, the resulting model is evaluated on the full set of 10 unsafe instances. Unlike prior setups where each unsafe behaviour is edited and evaluated in isolation with the model reset between edits, this design explicitly captures how detoxification effectiveness evolves as multiple unsafe behaviours are accumulated within a single edited model. Under compositional robustness, the number of failures is expected to decrease monotonically as edits accumulate. [6]
Figure 4. Number of failures and performance on OOD inputs before and after monolingual detoxification across languages for Qwen2-7B.
Figure 3 reports the number of failures (i.e., unsafe responses and repetitive generations) under such accumulated edits using DINM.
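The cumulative protocol described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `apply_edit`, `generate`, `is_unsafe`, and `is_repetitive` are hypothetical callables standing in for the KE method, the edited model's decoding, the safety classifier, and the degeneration detector, and a response that is both unsafe and repetitive is assumed to count as a single failure.

```python
from typing import Callable, List, Sequence


def cumulative_failures(
    model,
    instances: Sequence[dict],
    apply_edit: Callable,     # hypothetical: (model, instance) -> edited model
    generate: Callable,       # hypothetical: (model, prompt) -> response text
    is_unsafe: Callable,      # hypothetical safety classifier
    is_repetitive: Callable,  # hypothetical degeneration detector
) -> List[int]:
    """Failure count on the full instance set after each accumulated edit."""
    history = []
    for instance in instances:
        # Each edit builds on the previously edited model (no reset).
        model = apply_edit(model, instance)
        failures = 0
        for item in instances:
            response = generate(model, item["attack"])
            if is_unsafe(response) or is_repetitive(response):
                failures += 1
        history.append(failures)
    return history


def is_non_increasing(history: Sequence[int]) -> bool:
    """Relaxed monotonicity check: failures never rise between stages."""
    return all(b <= a for a, b in zip(history, history[1:]))
```

The check is deliberately relaxed to "never increases" so that plateaus pass; whether the expectation should be strictly decreasing is a reading the source does not settle.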
A monotonic decrease is observed only for Llama2-7B, suggesting that DINM exhibits compositional robustness only for certain LLMs under carefully tuned hyperparameters. For Llama3-8B, Mistral-7B, and Qwen3-8B, accumulating multiple edits leads to severe generation degeneration, with repetition increasing substantially.⁷ For Ministral-8B, although degeneration is not observed, edits targeting different unsafe behaviours appear to interfere with one another, resulting in an increase in unsafe responses. Even for Qwen2-7B, the most robust model in our optimisation robustness analysis, detoxification effectiveness deteriorates when more than eight unsafe behaviours are incorporated, despite limited edits improving safety on unseen unsafe behaviours.⁸

⁶ While order effects may arise under cumulative editing, we do not explicitly control for them in this study. Notably, failure under a single random editing order already indicates limited compositional robustness, as a robust method should not depend critically on a specific editing sequence.
⁷ Degeneration may appear after a single edit if the first edited instance happens to induce degeneration.

6.2. Robustness of KE-based Detoxification in Non-English Languages

To evaluate whether KE-based detoxification remains effective in non-English settings, we conduct experiments across all nine languages in mSAFEEDIT, including English and eight non-English languages. We refer to this setting, in which editing and evaluation are performed in the same language, as monolingual detoxification, although it constitutes a component of cross-lingual robustness in our analysis. Figure 4 reports the number of failures, as well as performance under OOD inputs (see Section 2), for Qwen2-7B, the most robust model in our previous analyses, before and after detoxification in each language. Results for other LLMs are provided in Appendix E.
Overall, KE-based detoxification exhibits limited robustness under monolingual detoxification. Consistent effectiveness across languages and input types is observed only for a specific combination of model and method, namely Qwen2-7B with DINM. For other LLMs, the detoxification effects of DINM degrade substantially, particularly in low-resource languages such as Bengali and Hindi. In some cases, editing even leads to degraded safety performance, as observed for certain languages when applying DINM to Ministral-7B. Compared to DINM, FT-M generally shows weaker detoxification effects across models and languages, and in several settings introduces negative impacts on safety.

⁸ This observation is consistent with Wang et al. (2024c), which reports that editing an LLM with a single unsafe behaviour can generalize to other attacks; however, that analysis does not consider pseudo-detoxification.

Figure 5. Number of failures and performance on OOD inputs before and after cross-lingual detoxification across languages for Qwen2-7B, where editing is performed in English and evaluation is conducted in other languages.

6.3. Cross-lingual Transfer of Detoxification Effects

Recall the language-dependency hypothesis, which posits that knowledge editing primarily affects knowledge expressed in the language used for editing, with limited impact on representations in other languages. If this hypothesis holds, KE-based detoxification learned in one language may fail to generalize cross-lingually. To examine this possibility, we conduct a cross-lingual detoxification experiment in which editing is performed using English data from mSAFEEDIT, while evaluation is carried out in other languages. We refer to this setting as cross-lingual detoxification. Results for Qwen2-7B are reported in Figure 5, with results for other LLMs provided in Appendix F.
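The cross-lingual setting can be sketched as a small evaluation loop. This is an illustrative outline only: `apply_edits`, `generate`, and `is_failure` are hypothetical callables (the editing method, the decoding routine, and a combined unsafe-or-repetitive check), and the assumption that every language shares the same per-language prompt dictionary is ours rather than a detail taken from mSAFEEDIT.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def failures_per_language(model: Any,
                          attacks: Dict[str, Sequence[str]],
                          generate: Callable[[Any, str], str],
                          is_failure: Callable[[str], bool]) -> Dict[str, int]:
    """Count failing responses separately for each evaluation language."""
    return {
        lang: sum(1 for prompt in prompts if is_failure(generate(model, prompt)))
        for lang, prompts in attacks.items()
    }


def cross_lingual_eval(base_model: Any,
                       attacks: Dict[str, Sequence[str]],
                       edit_lang: str,
                       apply_edits: Callable[[Any, Sequence[str]], Any],
                       generate: Callable[[Any, str], str],
                       is_failure: Callable[[str], bool]
                       ) -> Dict[str, Tuple[int, int]]:
    """Edit with data from `edit_lang` only, then report
    (failures before, failures after) for every language."""
    edited = apply_edits(base_model, attacks[edit_lang])
    before = failures_per_language(base_model, attacks, generate, is_failure)
    after = failures_per_language(edited, attacks, generate, is_failure)
    return {lang: (before[lang], after[lang]) for lang in attacks}
```

The entry for `edit_lang` corresponds to the monolingual setting, while the remaining entries show how much of the effect transfers; if the language-dependency hypothesis held strictly, only the `edit_lang` entry would change.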
Overall, KE-based detoxification exhibits limited cross-lingual robustness. For most LLMs, detoxification effects learned in English do not generalise effectively to other languages, resulting in substantially reduced defence success rates when attacks are translated across languages. The primary exception is Qwen2-7B edited using DINM, which maintains comparatively stronger cross-lingual detoxification performance. In contrast, detoxification effects produced by FT-M show little evidence of cross-lingual transfer across models and languages.

7. Conclusion

In this work, we present a robustness-oriented evaluation framework for Knowledge-Editing-based (KE-based) detoxification of Large Language Models (LLMs). Rather than treating detoxification effectiveness as a single scalar outcome measured by toxicity classifiers, our framework decomposes robustness into three complementary dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. Within this framework, we introduce degeneration-aware evaluation and systematically analyse how editing hyperparameters, multiple detoxification objectives, and multilingual settings affect the reliability of KE-based detoxification. Our experiments span multiple models, editing methods, and languages, providing a unified view of when detoxification reflects genuine behavioural suppression and when it arises from evaluation artefacts.

Robustness of KE-based Detoxification. Our findings indicate that KE-based detoxification exhibits limited robustness across all three dimensions. Under optimisation perturbations, detoxification outcomes are highly sensitive to hyperparameter choices, with pseudo-detoxification frequently arising from degeneration rather than meaningful behavioural change.
When multiple unsafe behaviours are edited jointly, detoxification effectiveness often degrades due to objective interference or the emergence of degeneration, demonstrating limited compositional robustness. In multilingual settings, both monolingual and cross-lingual detoxification remain effective only under specific model–method combinations, with cross-lingual transfer being particularly constrained. Overall, KE-based detoxification is robust only for certain models, under carefully controlled optimisation settings, for a limited number of detoxification objectives, and for a subset of languages. These results highlight the need for robustness-aware and degeneration-sensitive evaluation when assessing detoxification methods, and suggest that future approaches should prioritise stable behavioural suppression over apparent gains measured by safety classifiers alone.

Acknowledgement

This work was supported by the National Language and Character Research Base and the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (Project No. 25YJC740005).

Impact Statements

This work examines the robustness of knowledge-editing-based detoxification methods for large language models through a systematic evaluation framework. By identifying failure modes such as pseudo-detoxification and highlighting limitations that arise under optimisation, compositional, and cross-lingual settings, our study aims to improve the reliability and transparency of safety assessments for LLMs. We believe this contributes positively to responsible model development by encouraging more robust and degeneration-aware evaluation practices beyond standard classifier-based metrics.

From an ethical and societal perspective, our findings caution against over-reliance on apparent toxicity reductions that may mask underlying failures in behavioural suppression.
Such misinterpretations could lead to overconfidence in deployed safety mechanisms, particularly in multilingual or multi-objective settings where safety guarantees may be uneven. By revealing these risks, our work supports more cautious deployment and motivates future research toward safety interventions that are robust, equitable across languages, and better aligned with real-world use.

References

Au, A. Evaluating AI red teaming’s readiness to address environmental harms: A thematic analysis of LLM discourse. In AAAI, p. 23726–23728. AAAI Press, 2024.
Cao, N. D., Aziz, W., and Titov, I. Editing factual knowledge in language models. In EMNLP (1), p. 6491–6506. Association for Computational Linguistics, 2021.
Cao, P., Chen, Y., Jin, Z., Chen, Y., Liu, K., and Zhao, J. One mind, many tongues: A deep dive into language-agnostic knowledge neurons in large language models, 2024. URL https://arxiv.org/abs/2411.17401.
Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. Knowledge neurons in pretrained transformers. In ACL (1), p. 8493–8502. Association for Computational Linguistics, 2022.
Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. Plug and play language models: A simple approach to controlled text generation. In ICLR. OpenReview.net, 2020.
Dinan, E., Fan, A., Wu, L., Weston, J., Kiela, D., and Williams, A. Multi-dimensional gender bias classification, 2020. URL https://arxiv.org/abs/2005.00614.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv e-prints, p. arXiv–2407, 2024.
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., Showk, S.
E., Fort, S., Hatfield-Dodds, Z., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-Johnson, E., Amodei, D., Brown, T., Joseph, N., McCandlish, S., Olah, C., Kaplan, J., and Clark, J. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. CoRR, abs/2209.07858, 2022.
Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In EMNLP (Findings), volume EMNLP 2020 of Findings of ACL, p. 3356–3369. Association for Computational Linguistics, 2020.
Gu, J.-C., Xu, H.-X., Ma, J.-Y., Lu, P., Ling, Z.-H., Chang, K.-W., and Peng, N. Model editing harms general abilities of large language models: Regularization to the rescue. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, p. 16801–16819, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.934. URL https://aclanthology.org/2024.emnlp-main.934/.
He, Z., Xie, Z., Jha, R., Steck, H., Liang, D., Feng, Y., Majumder, B. P., Kallus, N., and McAuley, J. J. Large language models as zero-shot conversational recommenders. In CIKM, p. 720–730. ACM, 2023.
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rygGQyrFvH.
Hu, P., Liu, S., Gao, C., Huang, X., Han, X., Feng, J., Deng, C., and Huang, S. Large language models are cross-lingual knowledge-free reasoners. CoRR, abs/2406.16655, 2024a.
Hu, X., Li, D., Hu, B., Zheng, Z., Liu, Z., and Zhang, M.
Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, p. 18252–18260, 2024b.
Huang, X., Ruan, W., Huang, W., Jin, G., Dong, Y., Wu, C., Bensalem, S., Mu, R., Qi, Y., Zhao, X., Cai, K., Zhang, Y., Wu, S., Xu, P., Wu, D., Freitas, A., and Mustafa, M. A. A survey of safety and trustworthiness of large language models through the lens of verification and validation. CoRR, abs/2305.11391, 2023.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825.
Laskar, M. T. R., Fu, X., Chen, C., and TN, S. B. Building real-world meeting summarization systems using large language models: A practical perspective. In EMNLP (Industry Track), p. 343–352. Association for Computational Linguistics, 2023.
Li, G., Hammoud, H., Itani, H., Khizbullin, D., and Ghanem, B. CAMEL: communicative agents for "mind" exploration of large language model society. In NeurIPS, 2023.
Markov, T., Zhang, C., Agarwal, S., Nekoul, F. E., Lee, T., Adler, S., Jiang, A., and Weng, L. A holistic approach to undesired content detection in the real world. In AAAI, p. 15009–15018. AAAI Press, 2023.
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. In NeurIPS, 2022.
Meng, K., Sharma, A. S., Andonian, A. J., Belinkov, Y., and Bau, D. Mass-editing memory in a transformer. In ICLR. OpenReview.net, 2023.
Mitchell, E., Lin, C., Bosselut, A., Manning, C. D., and Finn, C. Memory-based model editing at scale. In ICML, volume 162 of Proceedings of Machine Learning Research, p. 15817–15831. PMLR, 2022.
Ning, Z., Gu, T., Song, J., Hong, S., Li, L., Liu, H., Li, J., Wang, Y., Lingyu, M., Teng, Y., et al. Linguasafe: A comprehensive multilingual safety benchmark for large language models. arXiv preprint arXiv:2508.12733, 2025.
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P. S. H., Bakhtin, A., Wu, Y., and Miller, A. H. Language models as knowledge bases? In EMNLP/IJCNLP (1), p. 2463–2473. Association for Computational Linguistics, 2019.
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023.
Sheng, E., Chang, K., Natarajan, P., and Peng, N. Societal biases in language generation: Progress and challenges. In ACL/IJCNLP (1), p. 4275–4293. Association for Computational Linguistics, 2021.
Sun, L., Huang, Y., Wang, H., Wu, S., Zhang, Q., Gao, C., Huang, Y., Lyu, W., Zhang, Y., Li, X., Liu, Z., Liu, Y., Wang, Y., Zhang, Z., Kailkhura, B., Xiong, C., Xiao, C., Li, C., Xing, E. P., Huang, F., Liu, H., Ji, H., Wang, H., Zhang, H., Yao, H., Kellis, M., Zitnik, M., Jiang, M., Bansal, M., Zou, J., Pei, J., Liu, J., Gao, J., Han, J., Zhao, J., Tang, J., Wang, J., Mitchell, J. C., Shu, K., Xu, K., Chang, K., He, L., Huang, L., Backes, M., Gong, N. Z., Yu, P. S., Chen, P., Gu, Q., Xu, R., Ying, R., Ji, S., Jana, S., Chen, T., Liu, T., Zhou, T., Wang, W., Li, X., Zhang, X., Wang, X., Xie, X., Chen, X., Wang, X., Liu, Y., Ye, Y., Cao, Y., and Zhao, Y. Trustllm: Trustworthiness in large language models. CoRR, abs/2401.05561, 2024.
Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Team, Q. et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2(3), 2024.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D., and Li, B. Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. In NeurIPS, 2023.
Wang, J., Liang, Y., Sun, Z., Cao, Y., Xu, J., and Meng, F. Cross-lingual knowledge editing in large language models. In ACL (1), p. 11676–11686. Association for Computational Linguistics, 2024a.
Wang, J., Liang, Y., Sun, Z., Cao, Y., Xu, J., and Meng, F. Cross-lingual knowledge editing in large language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 11676–11686, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.627. URL https://aclanthology.org/2024.acl-long.627/.
Wang, M., Zhang, N., Xu, Z., Xi, Z., Deng, S., Yao, Y., Zhang, Q., Yang, L., Wang, J., and Chen, H. Detoxifying large language models via knowledge editing. In ACL (1), p. 3093–3118. Association for Computational Linguistics, 2024c.
Wang, T., Chen, J., Jia, Q., Wang, S., Fang, R., Wang, H., Gao, Z., Xie, C., Xu, C., Dai, J., Liu, Y., Wu, J., Ding, S., Li, L., Huang, Z., Deng, X., Yu, T., Ma, G., Xiao, H., Chen, Z., Xiang, D., Wang, Y., Zhu, Y., Xiao, Y., Wang, J., Wang, Y., Ding, S., Huang, J., Xu, J., Tayier, Y., Hu, Z., Gao, Y., Zheng, C., Ye, Y., Li, Y., Wan, L., Jiang, X., Wang, Y., Cheng, S., Song, Z., Tang, X., Xu, X., Zhang, N., Chen, H., Jiang, Y. E., and Zhou, W. Weaver: Foundation models for creative writing.
CoRR, abs/2401.17268, 2024d.
Wang, W., Haddow, B., and Birch, A. Retrieval-augmented multilingual knowledge editing. In ACL (1), p. 335–354. Association for Computational Linguistics, 2024e.
Wu, Z., Yu, X. V., Yogatama, D., Lu, J., and Kim, Y. The semantic hub hypothesis: Language models share semantic representations across languages and modalities, 2024. URL https://arxiv.org/abs/2411.04986.
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
Yao, Y., Duan, J., Xu, K., Cai, Y., Sun, E., and Zhang, Y. A survey on large language model (LLM) security and privacy: The good, the bad, and the ugly. CoRR, abs/2312.02003, 2023.
Yong, Z. X., Menghini, C., and Bach, S. Low-resource languages jailbreak gpt-4. In Socially Responsible Language Modelling Research, 2023.
Zhang, N., Yao, Y., Tian, B., Wang, P., Deng, S., Wang, M., Xi, Z., Mao, S., Zhang, J., Ni, Y., et al. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286, 2024a.
Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K. R., and Hashimoto, T. B. Benchmarking large language models for news summarization. CoRR, abs/2301.13848, 2023.
Zhang, X., Liang, Y., Meng, F., Zhang, S., Chen, Y., Xu, J., and Zhou, J. Multilingual knowledge editing with language-agnostic factual neurons. CoRR, abs/2406.16416, 2024b.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J., and Wen, J. A survey of large language models. CoRR, abs/2303.18223, 2023.
Zheng, C., Li, L., Dong, Q., Fan, Y., Wu, Z., Xu, J., and Chang, B. Can we edit factual knowledge by in-context learning? In EMNLP, p. 4862–4876. Association for Computational Linguistics, 2023.

A.
Quality Control of the mSAFEEDIT Dataset

We include example attacks in the selected eight languages in Figure 6 to illustrate the quality of mSAFEEDIT. To further evaluate its quality, we conducted two assessments for each translated attack in mSAFEEDIT: (1) we prompted GPT-4o to identify any grammatical errors, and (2) we asked GPT-4o to translate the attack back into English, then manually checked whether it remained consistent with the original English version. GPT-4o detected no grammatical errors in the translated attacks, and the back-translations were entirely consistent with their original English versions.

Figure 6. An example attack in English and its translations in the selected languages.

B. Human Evaluation of the Safety Classifier

We conduct a human evaluation on the outputs of Qwen2-7B for monolingual detoxification in English and in Chinese, as well as for cross-lingual detoxification into Chinese, and compare the results with the judgments produced by our safety classifier. We select Qwen2-7B because it exhibits the lowest degree of degeneration among all evaluated LLMs. Specifically, we hire two annotators who are native Chinese speakers and fluent in English. The annotators independently label all model outputs, and any disagreements are resolved through discussion. We then compute the agreement rate between the final human annotations and the predictions of the safety classifier.

Table 2 presents the comparison results. The agreement rates between human judgments and the safety classifier are at least 95% across all language settings, indicating that the evaluation protocol adopted in this work is reliable.

Table 2. Evaluation results by our safety classifier and by human annotators (in terms of the number of unsafe responses) for monolingual detoxification on English and Chinese and cross-lingual detoxification on Chinese.
                                      Safety Classifier      Human                  Agreement
KE Method  Target Lang.  Edit Lang.   Before KE  After KE    Before KE  After KE    Before KE  After KE
DINM       English       English      84         7           88         3           96%        96%
DINM       Chinese       Chinese      86         5           90         3           95%        98%
DINM       Chinese       English      -          3           -          2           -          97%
FT-M       English       English      -          31          -          29          -          98%
FT-M       Chinese       Chinese      -          82          -          82          -          96%
FT-M       Chinese       English      -          41          -          41          -          95%

C. Complementary Results on Optimisation Robustness

Figure 7 and Figure 8 report the number of unsafe responses by different LLMs edited with DINM and FT-M, respectively, with respect to different editing steps and learning rates. Figure 9 and Figure 10 report the number of repetitions by different LLMs edited with DINM and FT-M, respectively, with respect to different editing steps and learning rates. Figure 11 and Figure 12 show how the numbers of unsafe responses and repetitions trade off for different LLMs edited with DINM and FT-M, respectively, with respect to different learning rates.

Figure 7. The number of unsafe responses by different LLMs edited with DINM with respect to different editing steps and learning rates.

Figure 8. The number of unsafe responses by different LLMs edited with FT-M with respect to different editing steps and learning rates.

Figure 9. The number of repetitions by different LLMs edited with DINM with respect to different editing steps and learning rates.

Figure 10. The number of repetitions by different LLMs edited with FT-M with respect to different editing steps and learning rates.

Figure 11. The number of unsafe/repetitive responses by different LLMs edited with DINM with respect to different learning rates at the best editing steps.

Figure 12. The number of unsafe/repetitive responses by different LLMs edited with FT-M with respect to different learning rates at the best editing steps.

D. Parameter Tuning for Systems Under Study

Table 3 reports the hyper-parameter settings of all models and KE methods we test in this study.
These hyper-parameters are selected based on the number of failures, as shown in Figure 13 and Figure 14 for DINM and FT-M, respectively.

Table 3. Hyper-parameter settings of all models and KE methods we test in this study.

              DINM                          FT-M
Model         Learning Rate  Editing Step   Learning Rate  Editing Step
Llama2-7B     1e-4           14             5e-5           5
Llama3-8B     5e-5           6              5e-5           6
Ministral-8B  5e-5           12             1e-5           12
Mistral-7B    5e-5           2              1e-5           10
Qwen2-7B      5e-4           13             5e-5           6
Qwen3-8B      5e-4           13             5e-5           14

Figure 13. The number of failures by different LLMs edited by DINM.

Figure 14. The number of failures by different LLMs edited by FT-M.

E. Complementary Results on Monolingual Detoxification

Figures 15–19 present the results of monolingual detoxification on Llama2-7B, Llama3-8B, Ministral-8B, Mistral-7B, and Qwen3-8B. Notably, the detoxification effects observed for English differ from those reported during hyperparameter tuning. This discrepancy arises because we evaluate the models on a different test set, namely data drawn from SAFEEDIT and LINGUASAFE. This further demonstrates that KE-based detoxification lacks optimization robustness, as its effectiveness can degrade substantially when applied to a different set of data.

Figure 15. Number of failures and performance on OOD inputs before and after monolingual detoxification across languages for Llama2-7B.

Figure 16. Number of failures and performance on OOD inputs before and after monolingual detoxification across languages for Llama3-8B.

Figure 17. Number of failures and performance on OOD inputs before and after monolingual detoxification across languages for Ministral-8B.

Figure 18. Number of failures and performance on OOD inputs before and after monolingual detoxification across languages for Mistral-7B.

Figure 19.
Number of failures and performance on OOD inputs before and after monolingual detoxification across languages for Qwen3-8B.

F. Complementary Results on Cross-lingual Detoxification

Figures 20–24 present the results of cross-lingual detoxification on Llama2-7B, Llama3-8B, Ministral-8B, Mistral-7B, and Qwen3-8B.

Figure 20. Number of failures and performance on OOD inputs before and after cross-lingual detoxification across languages for Llama2-7B.

Figure 21. Number of failures and performance on OOD inputs before and after cross-lingual detoxification across languages for Llama3-8B.

Figure 22. Number of failures and performance on OOD inputs before and after cross-lingual detoxification across languages for Ministral-8B.

Figure 23. Number of failures and performance on OOD inputs before and after cross-lingual detoxification across languages for Mistral-7B.

Figure 24. Number of failures and performance on OOD inputs before and after cross-lingual detoxification across languages for Qwen3-8B.