Paper deep dive
Beyond the Safety Tax: Mitigating Unsafe Text-to-Image Generation via External Safety Rectification
Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo
Models: Stable Diffusion, Stable Diffusion derivatives (6 models)
Abstract
Text-to-image (T2I) generative models have achieved remarkable visual fidelity, yet remain vulnerable to generating unsafe content. Existing safety defenses typically intervene internally within the generative model, but suffer from severe concept entanglement, leading to degradation of benign generation quality, a trade-off we term the Safety Tax. To overcome this limitation, we advocate a paradigm shift from destructive internal editing to external safety rectification. Following this principle, we propose SafePatch, a structurally isolated safety module that performs external, interpretable rectification without modifying the base model. The core backbone of SafePatch is architecturally instantiated as a trainable clone of the base model's encoder, allowing it to inherit rich semantic priors and maintain representation consistency. To enable interpretable safety rectification, we construct a strictly aligned counterfactual safety dataset (ACS) for differential supervision training. Across nudity and multi-category benchmarks and recent adversarial prompt attacks, SafePatch achieves robust unsafe suppression (7% unsafe on I2P) while preserving image quality and semantic alignment.
Tags
Links
- Source: https://arxiv.org/abs/2508.21099
- Canonical: https://arxiv.org/abs/2508.21099
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/11/2026, 1:04:32 AM
Summary
SafePatch is a novel, structurally isolated safety module for text-to-image (T2I) models that mitigates unsafe content generation without the 'safety tax'—the degradation of benign image quality typically caused by internal model editing. By using a trainable encoder clone and an Aligned Counterfactual Safety (ACS) dataset for differential supervision, SafePatch provides external, interpretable rectification while preserving the base model's generative integrity.
Entities (5)
Relation Signals (3)
SafePatch → mitigates → Unsafe Content
confidence 95% · SafePatch achieves robust unsafe suppression (7% unsafe on I2P)
SafePatch → uses → ACS
confidence 95% · we construct a strictly aligned counterfactual safety dataset (ACS) for differential supervision training.
Internal Defenses → causes → Safety Tax
confidence 90% · Existing safety defenses typically intervene internally... leading to degradation of benign generation quality, a trade-off we term the Safety Tax.
Cypher Suggestions (2)
Find all safety modules and the datasets they utilize. · confidence 90% · unvalidated
MATCH (m:SafetyModule)-[:USES]->(d:Dataset) RETURN m.name, d.name
Identify the relationship between defense strategies and the Safety Tax. · confidence 85% · unvalidated
MATCH (d:DefenseStrategy)-[:CAUSES]->(s:Concept {name: 'Safety Tax'}) RETURN d.name
Full Text
45,748 characters extracted from source content.
Beyond the Safety Tax: Mitigating Unsafe Text-to-Image Generation via External Safety Rectification

Xiangtao Meng (1), Yingkai Dong (1), Ning Yu (2), Zheng Li (1), Shanqing Guo (1)
(1) Shandong University, (2) Netflix Eyeline Studios

Abstract

Text-to-image (T2I) generative models have achieved remarkable visual fidelity, yet remain vulnerable to generating unsafe content. Existing safety defenses typically intervene internally within the generative model, but suffer from severe concept entanglement, leading to degradation of benign generation quality, a trade-off we term the Safety Tax. To overcome this limitation, we advocate a paradigm shift from destructive internal editing to external safety rectification. Following this principle, we propose SafePatch, a structurally isolated safety module that performs external, interpretable rectification without modifying the base model. The core backbone of SafePatch is architecturally instantiated as a trainable clone of the base model's encoder, allowing it to inherit rich semantic priors and maintain representation consistency. To enable interpretable safety rectification, we construct a strictly aligned counterfactual safety dataset (ACS) for differential supervision training. Across nudity and multi-category benchmarks and recent adversarial prompt attacks, SafePatch achieves robust unsafe suppression (7% unsafe on I2P) while preserving image quality and semantic alignment.

1 Introduction

Text-to-image (T2I) generative models, exemplified by Stable Diffusion [22] and DALL·E 3 [9], have revolutionized visual content creation by synthesizing high-fidelity images from natural language descriptions. Despite the advancements in T2I models, their potential for misuse or even abuse raises serious safety concerns. Recent studies [18, 24] have demonstrated that T2I models are prone to generating NSFW (Not Safe For Work) imagery, such as content related to violence and child-unsafe material, when prompted with unsafe text prompts.
Consequently, quantifying and mitigating T2I models' unsafe content generation become increasingly important research topics.

Existing defense strategies generally fall into two categories. The first involves filtering-based methods, which employ classifiers to intercept unsafe images post-generation [20] or sanitize malicious prompts at the input stage. While easy to deploy, these post-hoc defenses are often susceptible to circumvention by adversarial prompts and fail to address the model's internal propensity for generating harmful content.

[Figure 1: Illustration of the safety tax caused by concept entanglement. Existing defenses (e.g., SLD, ESD, SAFEGEN) can suppress unsafe concepts but often collateralize benign semantics due to entangled representations, leading to degraded scene fidelity (e.g., damaging the "bedroom" semantic).]

To address these limitations, recent research focus has shifted toward internal defenses. These approaches aim to intervene directly on the model's internal mechanisms, primarily through parameter editing (e.g., ESD [7]), which erases harmful concepts via fine-tuning, or inference guidance (e.g., SLD [24]), which steers image generation away from unsafe concepts during inference.

Motivation. Although internal defenses provide more robust protection, recent studies [3, 23] suggest that diffusion models suffer from intricate concept entanglement arising from feature superposition [6]. This implies that unsafe concepts are not isolated in the latent space but are tightly coupled with benign concepts. Thus, stripping away harmful concepts, whether through weight modification or inference-time intervention, inevitably incurs collateral damage on associated benign features (see Fig. 1). We define the degradation in generative quality incurred by defenses as the Safety Tax.

Our Solution.
To eliminate the safety tax, we advocate a paradigm shift from destructive internal editing to external safety rectification. Specifically, instead of disentangling unsafe concepts within a highly entangled base model, we externalize safety control into a structurally isolated and safety-specialized module, explicitly trained to provide interpretable and controllable safety rectification. Compared with directly improving the interpretability of internal concepts in the original T2I model, this design enables the safety module to focus entirely on safety-dimensional control without needing to handle the generative logic of other semantics; it only needs to remain "silent" when benign semantics are encountered. This characteristic significantly reduces the difficulty of learning interpretable safety rectification signals.

Following this principle, we propose SafePatch, a structurally isolated safety module designed to dynamically rectify the generative process without compromising the base model's integrity. Specifically, the core backbone of SafePatch is architecturally instantiated as a trainable clone of the base model's encoder part. This design provides two critical advantages: first, it ensures representation consistency with the base model, avoiding performance degradation caused by feature heterogeneity; second, it inherits semantic priors from the base model, which lowers the difficulty of learning interpretable safety signals.

However, structural isolation alone is insufficient for interpretable safety rectification. To achieve this, we construct a strictly aligned counterfactual safety dataset (ACS). By preserving identical benign semantics while differing exclusively in safety semantics, this data provides high-fidelity differential supervision, forcing SafePatch to learn interpretable, safety-specific rectification signals.

[arXiv:2508.21099v3 [cs.CV], 19 Jan 2026]
Furthermore, to distinguish safety rectification from natural generative variations, we design an instruction-aware spatial projection to convert abstract safety concepts into executable safety modification instructions, mapping them to spatially grounded features to precisely localize unsafe concepts while ignoring benign fluctuations. Finally, we integrate SafePatch with the base model via zero convolution layers, ensuring that the safety rectification is introduced in an initially non-intrusive way during the training stage, thereby maximally preserving the backbone's generative quality.

We implement SafePatch and evaluate it against six representative safety defenses across multiple unsafe and benign benchmarks. SafePatch consistently suppresses unsafe content while preserving image quality and text-image alignment, indicating that safety improvements do not incur a safety tax. On the I2P benchmark, SafePatch reduces the overall unsafe probability to 7%, substantially outperforming all baselines, which remain around 20%. Moreover, SafePatch maintains low unsafe rates under recent adversarial prompt attacks, demonstrating robust and reliable safety rectification.

In summary, our contributions are as follows:
- We construct a strictly aligned counterfactual safety dataset (ACS) that provides paired samples with identical benign semantics but differing safety semantics, enabling high-fidelity differential supervision.
- We introduce a new defense paradigm that shifts from destructive internal editing to external safety rectification. Specifically, we design SafePatch, a structurally isolated safety module that achieves interpretable safety rectification for T2I models without performance degradation.
- Extensive evaluations across multiple benchmarks and adversarial attacks demonstrate that SafePatch significantly outperforms six state-of-the-art defenses while effectively eliminating the safety tax.
2 Background

2.1 Unsafe Content Generation in T2I Models

Text-to-image generation models have gained popularity due to their ease of use and high-quality, flexible images. However, Birhane et al. [5] raise concerns about datasets scraped from the internet, such as LAION-400M [25], which lack content moderation, potentially leading to unsafe content generation. The definition of unsafe content varies by context and culture, making it subjective. In this paper, we focus on images containing hate, harassment, violence, self-harm, sexual content, shocking material, illegal activities, or nudity, as outlined in the OpenAI content policy [2] and Gebru et al. [8].

2.2 Safety Mechanisms for T2I Models

Existing strategies fall into two categories: filtering-based and internal defenses.

Filtering-based Defenses. Filtering-based defenses, like the safety checker [20] officially released by SD, are efficient for deployment but suffer from under-generalization and vulnerability to adversarial prompts due to distribution shifts. Similarly, SD v2.1 [21] retrains on data censored with these filters, but this approach can be computationally expensive, may not fully remove harmful content, and could reduce model performance.

Internal Defenses. Internal defenses use two strategies: guiding generation to avoid unsafe content (e.g., SLD [24] and InterpretDiffusion [11]) or fine-tuning models to remove unsafe concepts (e.g., ESD [7] and SafeGEN [12]). The first relies on the model's existing safety knowledge, limiting adaptability to new threats. The second risks "catastrophic forgetting" and lacks cross-model applicability. In contrast, our approach is model-agnostic and preserves production-ready models, ensuring robustness against adversarial prompts while maintaining performance on benign samples.

2.3 Threat Model

We outline the goals and capabilities of the adversary and defender.

Adversary.
The adversary aims to violate safety standards by generating unsafe content, either deliberately or by evading safeguards. We assume the adversary has closed-box access, capable only of querying the online T2I model with prompts.

Defender. The defender (model developer) has two objectives: (1) preventing unsafe content generation, and (2) preserving benign utility. We assume the defender has full access to model parameters to deploy safety mechanisms but lacks prior knowledge of specific adversarial prompts.

3 Training Dataset Construction

To achieve the interpretable safety rectification capability of SafePatch, we construct an aligned counterfactual safety dataset (ACS). Each sample pair in the dataset consists of an unsafe prompt and its corresponding safe image.

[Figure 2: Construction of the Aligned Counterfactual Safety (ACS) dataset. For each unsafe prompt, we create a minimally edited safe counterfactual prompt and generate safe candidate images from the same target T2I model, followed by multi-stage strict alignment filtering to preserve identical benign semantics while differing only in safety semantics.]

Compared to the image originally generated by the unsafe prompt, these safe-version images maintain consistency in benign semantics while differing only in unsafe semantics. This design provides SafePatch with high-fidelity differential supervision, enabling it to precisely learn safety rectification signals without disrupting the original benign features. The construction process of the dataset, as illustrated in Figure 2, includes the following key steps:

Counterfactual Prompt Pair Generation.
We first collect unsafe prompts covering major safety risk categories (e.g., violence, hate, sexual) from Lexica (https://lexica.art/). To construct strictly counterfactual pairs, we utilize an LLM to generate corresponding safe prompts under the principle of minimal semantic differences [14]. This process modifies only the phrases that violate safety policies, leaving the rest unchanged.

Safe Candidate Image Generation. For each sample pair, we use the target T2I model intended for defense to generate K candidate images for the safe prompt. This design aims to minimize distribution shifts and style differences introduced by model heterogeneity.

Multi-stage Strict Alignment Filtering. To ensure strict alignment in both safety and semantic fidelity for the final training samples, we design a three-stage filtering pipeline. First, we screen candidate images using multiple automated safety auditing tools (e.g., NudeNet [4]) to ensure they contain no violating content; samples failing this check are directly discarded. Second, we utilize a Vision-Language Model (VLM) to automatically assess whether the generated safe image completely and faithfully reflects all benign semantics of its corresponding safe prompt, filtering out semantically inconsistent samples. Finally, we introduce a manual review process where annotators are asked to judge, based on the original unsafe prompt, whether the corresponding safe image can be considered an ideal safe version, that is, whether it completely removes harmful elements while perfectly preserving all other reasonable visual content and overall style. Only sample pairs that pass all filtering stages are included in the final training dataset.

4 Design of SafePatch

4.1 Overview

As illustrated in Fig. 3, SafePatch is an external safety rectification module designed to eliminate the safety tax without modifying the base T2I model.
It is implemented as a plug-and-play component S_φ that operates alongside a frozen diffusion model ε_θ. SafePatch follows a two-stage paradigm. In the training stage, S_φ is explicitly trained to learn interpretable safety rectification signals using a strictly aligned counterfactual safety dataset. In the deployment stage, the trained S_φ is attached to the base model ε_θ as an external plugin, providing additive safety rectification while remaining "silent" for benign semantics. At deployment time, the denoising prediction at timestep t is given by:

ε̃ = ε_θ(z_t, t, c_p) + S_φ(z_t, t, c_p, s),    (1)

where z_t denotes the noisy latent, c_p is the prompt embedding, and s is the safety instruction.

4.2 SafePatch Architecture

Following the principle of external safety rectification, SafePatch is instantiated as a structurally isolated and safety-specialized module. Specifically, SafePatch comprises the following three synergistic components:

Trainable Encoder Clone. The core backbone of SafePatch is architecturally instantiated as a trainable clone of the encoder part (the down-sampling blocks of the U-Net) of the base model. This design ensures representation consistency with the base model, avoiding performance degradation caused by feature heterogeneity. Furthermore, by inheriting semantic priors from the base model, SafePatch further lowers the difficulty of learning interpretable safety signals.

Zero-Convolution Integration. The rectification outputs of SafePatch are injected into the base model through zero-initialized convolution layers, inspired by the design in [29]. At initialization, these layers guarantee that SafePatch introduces zero influence on the denoising process. During training, rectification signals are introduced in a gradual and additive manner. This design prevents abrupt perturbations to the original generative process at early training stages, which could otherwise induce unintended changes in benign semantics.
Such changes would break the strict counterfactual alignment assumed by the ACS dataset, where the generated samples are expected to differ from the safe targets only in safety-relevant attributes.

[Figure 3: Overview of the proposed SafePatch framework, including training and deployment stages.]

Instruction-Aware Spatial Projection. While interpretable safety rectification in SafePatch is primarily induced by differential supervision from the ACS dataset, we incorporate an instruction-aware spatial projection module as an auxiliary architectural component to guide how rectification signals are learned and where they are applied. Specifically, abstract safety concepts are first translated into executable safety modification instructions s (e.g., "Add clothes to the person"), which describe the action required to restore safety rather than the content to be generated. These instructions are automatically derived using an LLM-based template described in Appendix C. Both the prompt embedding c_p and the instruction embedding c_s are obtained from the same frozen text encoder to ensure representational consistency. To spatially ground the instruction, its embedding c_s is projected onto the noisy latent z_t through a cross-attention-based mapping network. The resulting spatial guidance map emphasizes regions relevant to the safety instruction while attenuating irrelevant areas. Conditioned on this guidance, SafePatch is encouraged to localize rectification to unsafe attributes and ignore benign semantic variations.
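The additive rectification of Eq. (1) and the zero-convolution integration described above can be illustrated with a minimal numpy sketch. All names and shapes here are illustrative stand-ins (toy linear maps instead of the paper's U-Net); the point it demonstrates is that a branch cloned from the base weights but terminated by a zero-initialized projection contributes exactly zero at initialization, so the combined prediction equals the frozen base model's output.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

W_base = rng.normal(size=(d, d))   # stand-in weights of the frozen base denoiser
W_clone = W_base.copy()            # trainable clone inherits the base weights
W_zero = np.zeros((d, d))          # zero-initialized integration layer ("zero conv")

def base_denoiser(z_t):
    """Stand-in for the frozen base model eps_theta."""
    return z_t @ W_base

def safepatch_residual(z_t):
    """Stand-in for S_phi: cloned-encoder features through the zero-init projection."""
    h = np.tanh(z_t @ W_clone)     # features from the encoder clone (toy)
    return h @ W_zero              # zero output at initialization

z_t = rng.normal(size=(4, d))      # batch of noisy latents
eps_tilde = base_denoiser(z_t) + safepatch_residual(z_t)   # Eq. (1), toy form

# At initialization the rectification branch is exactly zero, so the
# combined prediction matches the frozen base model's prediction.
assert np.allclose(eps_tilde, base_denoiser(z_t))
```

As W_zero is updated during training, the residual grows gradually from zero, which is the non-intrusive behavior the zero-convolution design is meant to guarantee.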
4.3 Differential Supervision Training

This training paradigm is built upon the ACS dataset, which provides strictly aligned pairs that differ only in safety-relevant semantics. Specifically, each training sample is a triplet (p_u, s, x_s), consisting of an unsafe prompt p_u, the corresponding executable safety instruction s, and a safe target image x_s that preserves all benign semantics. During training, noise ε is sampled from the forward diffusion process applied to the safe target image x_s. At each timestep t, the frozen base model ε_θ predicts the base denoising output conditioned on p_u, while the rectification module S_φ predicts a safety-specific residual conditioned on s. The two outputs are combined additively and supervised against the ground-truth noise:

L = E_{z_t, t, ε} || ε − ε_θ(z_t, t, c_{p_u}) − S_φ(z_t, t, c_{p_u}, s) ||_2^2    (2)

Crucially, the strict counterfactual alignment in ACS ensures that any discrepancy between the model prediction and the safe target can be attributed exclusively to safety-related factors. As a result, S_φ is forced to capture safety-specific rectification signals. In addition, we use benign image-prompt pairs sampled from Flickr30k [28] as a negative training set, which encourages S_φ to remain "silent" when only benign semantics are encountered.

5 Experiment

Our extensive experiments answer the following research questions (RQs):
- RQ1. Can SafePatch mitigate unsafe content generation without incurring the safety tax?
- RQ2. What is the transferability of SafePatch?
- RQ3. How robust is SafePatch against attacks?
- RQ4. How do different hyper-parameters affect the performance of SafePatch?

5.1 Experimental Settings

5.1.1 Datasets

We evaluate SafePatch on both unsafe and benign prompt benchmarks to jointly assess safety effectiveness and benign fidelity preservation.
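The differential-supervision objective in Eq. (2) above can be rendered as a toy numpy sketch. The two denoiser functions are stand-ins (not the paper's models); only the structure of the loss, the sampled noise minus the additively combined base and residual predictions, follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def eps_theta(z_t, c_p):
    """Toy stand-in for the frozen base denoiser eps_theta."""
    return 0.1 * z_t + c_p

def s_phi(z_t, c_p, s, W):
    """Toy stand-in for the safety-residual module S_phi."""
    return (z_t + c_p + s) @ W

def differential_loss(x_s, c_pu, s, W):
    """Batch-averaged || eps - eps_theta(z_t, c_pu) - S_phi(z_t, c_pu, s) ||_2^2."""
    eps = rng.normal(size=x_s.shape)   # ground-truth forward-process noise on x_s
    z_t = x_s + eps                    # toy "noisy latent" of the safe target
    resid = eps - eps_theta(z_t, c_pu) - s_phi(z_t, c_pu, s, W)
    return float(np.mean(np.sum(resid ** 2, axis=-1)))

x_s = rng.normal(size=(4, d))   # safe target image (latent) from an ACS triplet
c_pu = rng.normal(size=(d,))    # unsafe-prompt embedding
s = rng.normal(size=(d,))       # safety-instruction embedding
W = np.zeros((d, d))            # residual weights (zero at initialization)

assert differential_loss(x_s, c_pu, s, W) >= 0.0
```

Because the base model is frozen and conditioned on the unsafe prompt while the target is the safe image's noise trajectory, the only way to reduce this loss is for the residual branch to supply the safety-specific correction.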
For unsafe prompts, we use "<country> body" and NSFW-200 [27] to evaluate nudity removal performance, as these benchmarks specifically target explicit sexual content. To assess whether the safety mechanism generalizes beyond nudity, we further use I2P [24], which covers a broader range of unsafe categories. To measure whether safety mechanisms incur a safety tax, we further use MS COCO-2017 [13] as a benign prompt benchmark. Dataset details are provided in Appendix A.

5.1.2 Baselines

We compare SafePatch with six representative safety defenses on Stable Diffusion v1.4 (SD v1.4) [19]. We adopt SD v1.4 as the base model since it is the standard backbone used by most prior works, enabling fair and direct comparison without confounding architectural differences. (1) No Defense: the original SD without any safety defense. (2) External Defenses: the official image-based Safety Checker [20] and the pre-censored SD v2.1 [21]. (3) Internal Defenses: inference-time guidance methods SLD [24] and InterpretDiffusion [11], as well as fine-tuning-based methods ESD [7] and SafeGEN [12]. Implementation details of SafePatch are provided in Appendix D.

Table 1: Performance of SafePatch and baselines on reducing multiple unsafe categories on the I2P prompt dataset. All values are unsafe probabilities (lower is better).

Method | Sexual | Self-harm | Hate | Violence | Shocking | Harassment | Illegal activity | Overall
Original SD | 23% | 27% | 23% | 32% | 37% | 20% | 23% | 27%
Safety Filter | 8% | 24% | 18% | 28% | 31% | 15% | 22% | 21%
SD 2.1 | 15% | 27% | 25% | 27% | 35% | 22% | 20% | 24%
SLD | 10% | 10% | 10% | 14% | 20% | 11% | 8% | 12%
InterpretDiffusion | 10% | 18% | 23% | 21% | 29% | 16% | 15% | 18%
ESD | 10% | 20% | 18% | 26% | 27% | 18% | 19% | 20%
SAFEGEN | 7% | 10% | 13% | 13% | 18% | 9% | 7% | 11%
SafePatch | 5% | 8% | 12% | 8% | 8% | 7% | 7% | 7%

5.1.3 Metrics

We evaluate SafePatch using two types of metrics: safety metrics to assess defense effectiveness against unsafe prompts and adversarial attacks, and quality metrics to measure whether safety defenses degrade benign generation quality, thereby quantifying the incurred safety tax. For safety evaluation, we first adopt NudeNet [16] to assess the nudity removal performance of defenses. In addition, following prior work [24], we combine the Q16 classifier [15] with NudeNet to obtain an aggregated unsafe content probability, covering multiple risk categories such as sexual content, hate speech, and other policy-violating outputs. To evaluate the preservation of benign generation quality, we report the Fréchet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), and CLIP score. Further details are provided in Appendix B.

5.2 RQ1: Effectiveness and Safety Tax

This section evaluates whether SafePatch can effectively mitigate unsafe content generation while preserving benign generation quality, i.e., without incurring a safety tax.

Effectiveness on Nudity Removal. As shown in Table 2, SafePatch achieves the highest overall nudity removal rate of 94%, outperforming all baseline methods. On the "<country> body" prompt set, SafePatch reaches 93.1%, exceeding the strongest baseline (ESD, 91.7%). On the NSFW_200 benchmark, its removal rate (94.4%) remains competitive, while maintaining the best average performance across datasets. These results indicate that SafePatch provides consistently strong nudity suppression under different prompt distributions, rather than excelling on a single benchmark.

Table 2: Nudity removal performance of SafePatch and baseline defenses on unsafe prompt benchmarks. All values are nudity removal rates (higher is better).

Method | "<country> body" | NSFW_200 | Overall
Original SD | N/A | N/A | N/A
Safety Filter | 72.9% | 53.5% | 67%
SD 2.1 | 67.9% | 71.8% | 68%
SLD | 62.8% | 29.6% | 49%
InterpretDiffusion | 71.6% | 46.5% | 56%
ESD | 91.7% | 97.2% | 93%
SAFEGEN | 91.3% | 98.6% | 93%
SafePatch | 93.1% | 94.4% | 94%

Effectiveness on Multiple Unsafe Categories. Table 1 reports unsafe content probabilities across seven categories on the I2P dataset under the multi-category configuration of SafePatch. SafePatch reduces the overall unsafe probability to 7%, the lowest among all methods. In particular, it achieves the best or second-best performance in Sexual (5%), Self-harm (8%), Violence (8%), Shocking (8%), Harassment (7%), and Illegal activity (7%). Compared to the strongest baseline (SAFEGEN, 11%), SafePatch yields a consistent absolute reduction across categories, demonstrating more uniform safety control rather than category-specific gains.

Safety Tax Analysis. Table 3 reports the benign generation performance on the MS COCO 2017 validation set. SafePatch achieves the highest CLIP score (31.46), closely matching the original SD model and outperforming all baseline defenses. It also attains a favorable LPIPS score of 0.7562, comparable to the original SD and safety filter, and superior to strong SLD variants. Meanwhile, SafePatch maintains competitive image fidelity with an FID of 24.90, remaining close to the original model and surpassing most safety-oriented baselines. These quantitative findings are further supported by qualitative comparisons in Fig. 1. See more visual examples in Appendix E. Overall, these results show that SafePatch preserves benign generation quality across alignment, perceptual consistency, and distributional similarity, demonstrating that the improved safety performance does not introduce a noticeable safety tax.

Interpretability Analysis. To further explain why SafePatch does not incur a safety tax, we provide an interpretability analysis based on attention visualization, as shown in Fig. 4. For the original SD model, attention maps associated with unsafe keywords (e.g., "lingerie") exhibit strong activation on exposed body regions, dominating the generation process and overshadowing other semantic cues in the prompt.
This over-concentration leads to unsafe visual outputs and entangles safety enforcement with core semantic understanding.

Table 3: Performance of SafePatch and baselines in maintaining benign generation on the COCO 2017 validation set. ↓ indicates lower is better, ↑ means higher is preferable.

Method | CLIP Score ↑ | LPIPS Score ↓ | FID ↓
Original SD | 31.28 | 0.7562 | 25.22
Safety Filter | 30.21 | 0.7569 | 25.99
SD 2.1 | 31.47 | 0.7465 | 24.01
SLD-Max | 29.10 | 0.7699 | 36.46
SLD-Strong | 29.91 | 0.7636 | 33.01
SLD-Medium | 29.92 | 0.7634 | 32.75
SLD-Weak | 31.23 | 0.7564 | 26.68
InterpretDiffusion | 31.00 | 0.7612 | 26.73
ESD | 30.41 | 0.7574 | 24.58
SAFEGEN | 30.61 | 0.7641 | 29.96
SafePatch | 31.46 | 0.7562 | 24.90

[Figure 4: Interpretability of external safety rectification. Attention visualizations show that SafePatch suppresses unsafe concept focus (e.g., "lingerie") while preserving attention to benign prompt semantics (e.g., "face"), explaining why it does not incur a safety tax.]

In contrast, with SafePatch applied, the attention corresponding to unsafe concepts is significantly attenuated and redirected away from sensitive body parts. Meanwhile, attention maps for benign keywords (e.g., "face" and "bedroom") remain spatially coherent and semantically meaningful, focusing on appropriate regions such as facial structures and scene layout. This selective suppression indicates that SafePatch operates by correcting unsafe attention focus rather than globally weakening or blurring the model's representations.

As a result, SafePatch avoids indiscriminate suppression of visual features and preserves the alignment between benign prompt semantics and generated content. This explains why SafePatch maintains competitive CLIP, LPIPS, and FID scores while effectively mitigating unsafe generation, thereby avoiding a noticeable safety tax.

5.3 RQ2: Transferability to Different T2I Models

We evaluate the transferability of SafePatch on different T2I backbones, including SD v2.1 and SDXL, to verify whether our safety rectification framework generalizes beyond SD v1.4. For each target model, we train a model-specific SafePatch while keeping the corresponding backbone frozen. As shown in Table 4, across both SD v2.1 and SDXL, SafePatch consistently reduces unsafe content generation under nudity-related and multi-category unsafe prompts, while maintaining benign generation quality comparable to the undefended models. These results demonstrate that SafePatch is transferable to different diffusion T2I models.

Table 4: Transferability of the proposed SafePatch on SD v2.1 and SDXL.

Metric | Method | SD v2.1 | SDXL
Unsafe Prob. (I2P) ↓ | Vanilla | 14.82% | 12.35%
Unsafe Prob. (I2P) ↓ | w/ SafePatch | 1.54% | 0.82%
CLIP Score ↑ | Vanilla | 31.25 | 34.58
CLIP Score ↑ | w/ SafePatch | 31.32 | 34.35

5.4 Robustness Against Adversarial Attacks

We evaluate the robustness of SafePatch against adversarial attacks designed to bypass safety mechanisms, focusing on two recent and representative methods: SneakyPrompt [27] and Ring-A-Bell [26]. These attacks exploit linguistic obfuscation and semantic redirection to circumvent prompt-based filtering and inference-time guidance, posing a substantial challenge to existing safety defenses.

Table 5: Robustness of SafePatch and baselines against recent adversarial attacks (unsafe probability, lower is better).

Method | SneakyPrompt | Ring-A-Bell (violence) | Ring-A-Bell (nudity)
Original SD | 55% | 96% | 81%
SLD | 3% | 70% | 82%
InterpretDiffusion | 30% | 90% | 80%
ESD | 36% | 86% | 21%
SAFEGEN | 57% | 91% | 24%
SafePatch | 9% | 24% | 6%

Table 5 summarizes the quantitative results of SafePatch in comparison with the original SD model and prior internal defense baselines. Under the SneakyPrompt attack, SafePatch substantially suppresses unsafe content generation, reducing the measured unsafe probability to 9%, markedly outperforming both the undefended model and competing defenses.
As noted in prior work [24], the Q16 classifier employed for evaluation is conservative and may overestimate unsafe probabilities by misclassifying borderline safe images. Consistent with this observation, a manual inspection confirms that SafePatch effectively eliminates unsafe generations under SneakyPrompt, yielding an actual unsafe rate of 0%.

For the Ring-A-Bell attack, which explicitly targets violent and explicit content categories, SafePatch continues to exhibit strong and category-specific safety control. It achieves unsafe probabilities of 24% for violence-related prompts and 6% for nudity-related prompts, representing a significant improvement over the original SD model and outperforming all baselines.

Overall, these results demonstrate that SafePatch maintains robust safety guarantees under adversarial prompt manipulation. Unlike prompt filtering or inference-time steering methods that can be circumvented through linguistic obfuscation, SafePatch performs external and interpretable safety rectification on the generative process, enabling robust suppression of unsafe content without relying on prompt-level heuristics or surface-form analysis.

5.5 Exploration on Hyperparameters

Training Data Scale. Fig. 5 analyzes how the ACS training set size (1K/5K/10K) affects SafePatch. Larger datasets substantially improve training efficiency: the iterations required to reach near-saturated performance drop from 50K (1K) to 25K (10K). Moreover, increasing data scale consistently improves the final defensive performance, indicating that SafePatch benefits from broader coverage of unsafe concept variations while preserving the counterfactual alignment.

[Figure 5: Effect of training data scale (1K/5K/10K) on convergence speed and defensive performance of SafePatch.]

Training Iterations. Across all data scales, SafePatch exhibits rapid gains in the early stage: performance increases markedly within the first 10K iterations.
Beyond 10K iterations, improvements become marginal and mainly fluctuate within a narrow range, suggesting convergence. Consistent with the above observation, for a fixed batch size, larger datasets require fewer iterations to reach the same performance level.

6 Conclusion

We present SafePatch, a plug-and-play external safety rectification module for text-to-image diffusion models that mitigates unsafe generations without modifying the frozen backbone. By instantiating SafePatch as an encoder-clone patch integrated via zero-initialized convolutions, the base model's benign generative behavior is preserved at initialization and throughout training. To make safety rectification precise and interpretable, we construct the Aligned Counterfactual Safety (ACS) dataset to provide strictly aligned counterfactual supervision, and incorporate instruction-aware spatial projection to guide localized rectification. Extensive experiments on nudity and multi-category unsafe prompt benchmarks, as well as recent adversarial attacks, show that SafePatch achieves strong and robust safety improvements while maintaining benign generation quality, effectively avoiding the safety tax.

Limitations

Despite the encouraging results, our approach has several limitations that warrant discussion, primarily related to computational overhead, evaluation reliability, and data-filtering scalability.

Computation and deployment overhead. As an external rectification module, SafePatch incurs additional parameters and forward-pass computation due to the encoder-clone architecture and instruction-aware projection, leading to higher training and inference costs than the undefended backbone. This overhead may be prohibitive in latency-sensitive or resource-constrained settings. Nevertheless, in practical text-to-image systems, safety mechanisms that degrade visual fidelity or semantic alignment directly impair usability and adoption.
The added computational cost of SafePatch therefore reflects a deliberate trade-off: preserving generation quality while enforcing robust safety.

Dependence on automated safety auditors. Unsafe-probability metrics are computed using NudeNet and Q16, which are imperfect proxies for policy violations and may exhibit category- or style-dependent measurement variance. Although we complement automated evaluation with manual inspection for adversarial prompts, absolute unsafe rates should be interpreted with caution. Incorporating a broader set of auditing models and more systematic human evaluation remains an important direction for improving reliability and calibration.

Limitations of multi-stage strict alignment filtering. The effectiveness of ACS relies on a three-stage filtering pipeline (automated safety auditing, VLM-based semantic checking, and manual review), each stage with inherent limitations. Automated auditors may produce false positives or negatives on stylized or ambiguous content, VLM-based checks can misjudge fine-grained semantic consistency, and manual review is subjective, difficult to scale, and prone to selection bias. Despite these challenges, our experiments indicate that ACS is sufficiently effective for training SafePatch; further improving filtering robustness and coverage remains an important avenue for future work.

References

[1] NSFW GPT. https://w.reddit.com/r/ChatGPT/comments/11vlp7j/nsfwgpt_that_nsfw_prompt/, 2023.
[2] Usage policies of OpenAI. https://openai.com/policies/usage-policies/, 2023.
[3] Ibtihel Amara, Ahmed Imtiaz Humayun, Ivana Kajic, Zarana Parekh, Natalie Harris, Sarah Young, Chirag Nagpal, Najoung Kim, Junfeng He, Cristina Nader Vasconcelos, et al. Erasing more than intended? How concept erasure degrades the generation of non-target concepts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16420–16430, 2025.
[4] P. Bedapudi.
NudeNet: Neural nets for nudity classification, detection and selective censoring, 2019.
[5] Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: Misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
[6] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
[7] Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2426–2436, 2023.
[8] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
[9] Gabriel Goh, James Betker, Li Jing, and Aditya Ramesh. DALL·E 3. https://openai.com/index/dall-e-3/, 2024.
[10] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
[11] Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, and Jindong Gu. Self-discovering interpretable diffusion latent directions for responsible text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024.
[12] Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, and Wenyuan Xu. SafeGen: Mitigating sexually explicit content generation in text-to-image models. arXiv preprint arXiv:2404.06666, 2024.
[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context.
In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
[14] Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, and Fabio Pizzati. Latent Guard: A safety framework for text-to-image generation. In European Conference on Computer Vision, pages 93–109. Springer, 2024.
[15] ML Research. Q16: A multi-category safety classifier. https://github.com/ml-research/Q16, 2023. Accessed: 2024.
[16] notAI Tech. NudeNet: Neural nets for nudity detection. https://github.com/notAI-tech/NudeNet, 2019. Accessed: 2024.
[17] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11410–11420, 2022.
[18] Javier Rando, Daniel Paleka, David Lindner, Lennard Heim, and Florian Tramèr. Red-teaming the Stable Diffusion safety filter. arXiv preprint arXiv:2210.04610, 2022.
[19] Robin Rombach and Patrick Esser. Stable Diffusion v1-4. https://huggingface.co/CompVis/stable-diffusion-v1-4, 2022.
[20] Robin Rombach and Patrick Esser. Stable Diffusion safety checker. https://huggingface.co/CompVis/stable-diffusion-safety-checker, 2023.
[21] Robin Rombach and Patrick Esser. Stable Diffusion v2-1. https://huggingface.co/stabilityai/stable-diffusion-2-1, 2023.
[22] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[23] Shaswati Saha, Sourajit Saha, Manas Gaur, and Tejas Gokhale. Side effects of erasing concepts from diffusion models. arXiv preprint arXiv:2508.15124, 2025.
[24] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting.
Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22522–22531, 2023.
[25] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[26] Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia-You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Ring-A-Bell! How reliable are concept removal methods for diffusion models? arXiv preprint arXiv:2310.10012, 2023.
[27] Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. SneakyPrompt: Jailbreaking text-to-image generative models. In Proceedings of the IEEE Symposium on Security and Privacy, 2024.
[28] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[29] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[30] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

A Datasets

We evaluate SafePatch on four public prompt datasets, covering both unsafe and benign scenarios. Specifically, three unsafe prompt datasets are used to assess safety mitigation performance, while one benign dataset is adopted to evaluate generation fidelity under non-malicious inputs.

• "<country> body": This dataset contains 50 short prompts of the form "<country> body", where <country> corresponds to one of the top-50 GDP countries.
Although linguistically benign, these prompts are known to reliably induce nude or NSFW image generation in T2I models. For each country, we generate 20 images, resulting in 2,000 images in total.

• NSFW-200: NSFW-200 consists of 200 unsafe prompts covering explicit sexual content. The prompts were generated using ChatGPT (GPT-3.5) following a community-curated guideline from Reddit [1], and are widely used for evaluating prompt-based safety vulnerabilities.

• I2P: I2P [24] is a large-scale benchmark designed for evaluating unsafe content generation in T2I models. It contains 4,703 real-world, user-generated prompts spanning seven unsafe categories: hate, harassment, violence, self-harm, shocking, sexual, and illegal activity.

• MS COCO 2017: MS COCO [13] is a large-scale dataset of everyday scenes and objects. We use the 2017 validation set as a benign prompt benchmark to evaluate whether safety defenses degrade image fidelity or semantic alignment under non-unsafe inputs.

B Metrics

We evaluate safety-aware text-to-image (T2I) generation capability from two complementary perspectives: unsafe content reduction and benign content preservation. To this end, we employ the following evaluation metrics.

• Nudity Probability. We employ NudeNet [16] to evaluate the effectiveness of the model in moderating explicit visual content. In this work, we restrict nudity detection to exposed genitalia, breasts, and buttocks. To mitigate the impact of false positives, we adopt a conservative classification threshold of 0.7, following common practice.

• Unsafe Probability. Following prior work [24], we combine two complementary classifiers, namely the Q16 classifier [15] and NudeNet, to automatically assess unsafe content generation beyond explicit nudity. The Q16 classifier detects a broad spectrum of unsafe visual concepts, including but not limited to sexual content, hate-related imagery, and other policy-violating outputs.
Unless stated otherwise, an image is classified as unsafe if either of the two classifiers predicts the corresponding unsafe label.

• FID Score. The Fréchet Inception Distance (FID) [17] measures the distributional similarity between generated images and real images, thereby assessing generation fidelity. We compute FID using the COCO val2017 dataset as the reference set, which provides paired prompts and ground-truth images. Lower FID scores indicate higher-quality, more realistic image generation.

• LPIPS Score. The Learned Perceptual Image Patch Similarity (LPIPS) metric [30] evaluates perceptual similarity between generated images and reference images. As with FID, we use the COCO val2017 dataset as the reference set. A lower LPIPS score corresponds to higher perceptual fidelity.

• CLIP Score. The CLIP Score [10] is a reference-free metric that measures the semantic alignment between a text prompt and its corresponding generated image. Higher CLIP scores indicate that the T2I model more faithfully captures the intent of the input prompt in the generated visual output.

C Prompt Template Used by the LLM

To guide safe prompt generation, we design prompts based on the following structure (see ?? for details):

• Step 1: Definition of the LLM's Role. In this part, we inform the LLM of its role, for example: "Your role is as an artificial intelligence programming assistant specializing in semantics analysis. You are expected to identify and mitigate potentially harmful or explicit content in the form of text prompts that are used for Text-to-Image model translations."

• Step 2: Unsafe Concepts Explanation. In this part, we provide the LLM with a detailed definition of unsafe content according to official guidelines.

• Step 3: Task Decomposition.
In this part, to mitigate the hallucination and task-forgetting problems of LLMs, we decompose the main task into multiple sub-tasks, letting the LLM complete them one by one to achieve the final goal. For example: "Please complete the task according to the following process: 1. ..., 2. ... ."

• Step 4: Output Format Specification. In this part, we strictly regulate the output format of the LLM to facilitate subsequent data processing.

To further refine the safe prompt generation process, we introduce an LLM-based scoring mechanism. After generating multiple safe prompt alternatives, the LLM evaluates each one based on criteria such as safety and alignment with the user's original intent. The prompt with the highest score is selected as the final safe prompt.

Figure 6: Additional visual samples produced by our method (original T2I model vs. SafePatch on unsafe input prompts).

D Optimization and Training Setup

We optimize SafePatch using AdamW with a learning rate of 1×10⁻⁴, β1 = 0.9, β2 = 0.999, and weight decay 1×10⁻². The batch size is set to 4, with gradient accumulation of 8, resulting in an effective batch size of 64. Models are trained for 25K iterations by default (or until convergence, see Fig. 5), using a learning-rate schedule with 500 warmup steps followed by cosine decay. We apply gradient clipping with a maximum norm of 1.0 to stabilize training. All experiments are conducted on two NVIDIA RTX 5090 GPUs.

E More Visual Samples

In this appendix, we provide additional visual samples that complement the qualitative results presented in the main paper, as shown in Fig. 6.
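The warmup-plus-cosine schedule described in Appendix D can be sketched as a standalone function. This is an illustrative reconstruction, not the authors' code; the framework-specific optimizer, gradient-accumulation, and clipping calls are omitted, and the defaults simply mirror the hyperparameters stated above:

```python
import math

def lr_at(step, base_lr=1e-4, warmup=500, total=25_000):
    """Learning rate at a given iteration: linear warmup for the first
    `warmup` steps, then cosine decay to zero over the remaining steps."""
    if step < warmup:
        return base_lr * step / warmup  # linear warmup from 0 to base_lr
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

In a PyTorch setup, a function like this would typically be passed (normalized by `base_lr`) to a `LambdaLR` scheduler wrapping the AdamW optimizer, with gradient clipping applied before each optimizer step.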