
Paper deep dive

Attack as Defense: Safeguarding Large Vision-Language Models from Jailbreaking by Adversarial Attacks

Chongxin Li, Hanzhang Wang, Yuchun Fang

Year: 2025 · Venue: Findings of EMNLP 2025 · Area: Multimodal Safety · Type: Empirical · Embeddings: 50

Models: LLaVA-1.5

Abstract

Chongxin Li, Hanzhang Wang, Yuchun Fang. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025.

Adversarial vulnerabilities in vision-language models pose a critical challenge to the reliability of large language systems, where typographic manipulations and adversarial perturbations can effectively bypass language model defenses. We introduce Attack as Defense (AsD), the first approach to proactively defend at the cross-modality level, embedding protective perturbations in vision to disrupt attacks before they propagate to the language model. Experiments on LLaVA-1.5 show that AsD reduces attack success rates from 56.7% to 12.6% for typographic attacks and from 89.0% to 47.5% for adversarial perturbations.

Tags

adversarial-robustness (suggested, 80%) · ai-safety (imported, 100%) · empirical (suggested, 88%) · multimodal-safety (suggested, 92%)

Links

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/12/2026, 5:33:13 PM

Summary

The paper introduces 'Attack as Defense' (AsD), a novel, training-free, cross-modal defense framework for Vision-Language Models (VLMs). AsD mitigates jailbreaking attacks by embedding protective visual perturbations into images and augmenting system prompts, effectively disrupting adversarial cues before they propagate to the language model. Experiments on LLaVA-1.5 demonstrate significant reductions in attack success rates for both typographic and perturbation-based jailbreaking.

Entities (5)

Attack as Defense · framework · 100%
Jailbreaking · security-vulnerability · 100%
LLaVA-1.5 · vision-language-model · 100%
MM-SafetyBench · dataset · 100%
CLIP · vision-encoder · 95%

Relation Signals (3)

Attack as Defense — applied to → LLaVA-1.5

confidence 100% · Experiments on the LLaVA-1.5 show that AsD reduces attack success rates

Attack as Defense — defends against → Jailbreaking

confidence 100% · AsD is a strategy that reframes adversarial attack techniques as a means of protection.

MM-SafetyBench — used to evaluate → Attack as Defense

confidence 100% · We evaluate our defense strategy on MM-SafetyBench

Cypher Suggestions (2)

Find all models evaluated using the AsD framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'Attack as Defense'})-[:APPLIED_TO]->(m:Model) RETURN m.name

Identify datasets used to test security frameworks · confidence 90% · unvalidated

MATCH (d:Dataset)-[:USED_TO_EVALUATE]->(f:Framework) RETURN d.name, f.name

Full Text

50,111 characters extracted from source content.


Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20138–20152, November 4-9, 2025. ©2025 Association for Computational Linguistics.

Attack as Defense: Safeguarding Large Vision-Language Models from Jailbreaking by Adversarial Attacks

Chongxin Li, Hanzhang Wang*, Yuchun Fang
School of Computer Engineering and Science, Shanghai University
alita@shu.edu.cn, hanzhang.mon.wang@gmail.com, ycfang@shu.edu.cn

Abstract

Adversarial vulnerabilities in vision-language models pose a critical challenge to the reliability of large language systems, where typographic manipulations and adversarial perturbations can effectively bypass language model defenses. We introduce Attack as Defense (AsD), the first approach to proactively defend at the cross-modality level, embedding protective perturbations in vision to disrupt attacks before they propagate to the language model. By leveraging the semantic alignment between vision and language, AsD enhances adversarial robustness through model perturbations and system-level prompting. Unlike prior work that focuses on text-stage defenses, our method integrates visual defenses to reinforce prompt-based protections, mitigating jailbreaking attacks across benchmarks. Experiments on LLaVA-1.5 show that AsD reduces attack success rates from 56.7% to 12.6% for typographic attacks and from 89.0% to 47.5% for adversarial perturbations. Further analysis reveals that the key bottleneck in vision-language security lies not in isolated model vulnerabilities, but in cross-modal interactions, where adversarial cues in the vision model fail to consistently activate the defense mechanisms of the language model. Our code is publicly available at https://github.com/AngelAlita/AsD.

1 Introduction

As large language models (LLMs) have grown increasingly sophisticated, so have efforts to enforce safety and prevent misuse (Achiam et al., 2023).
Despite these advancements, models remain vulnerable to adversarial manipulation, resulting in a phenomenon known as "jailbreaking." Jailbreaking involves designing prompts that bypass a model's built-in safety measures, often leading to unintended, harmful, or even unethical outputs. This issue gained widespread attention with the release of ChatGPT, as "jailbreak prompts" began circulating extensively on social media (Achiam et al., 2023). These manipulations range from overt strategies, such as role-playing scenarios like the infamous DAN (Do Anything Now) prompt (walkerspider, 2022; JailbreakChat, 2024), to more subtle attacks that exploit weaknesses in the ethical constraints of a model (Shen et al., 2023; Anthropic, 2023). Although these attacks often require significant human ingenuity, their increasing frequency underscores persistent vulnerabilities in the safety mechanisms of even the most advanced models (Ganguli et al., 2022).

* Corresponding author.

Figure 1: The proposed Attack as Defense strategy defends against multiple combined jailbreaking attacks, including perturbation and typographic attacks.

Although advances in defense mechanisms have improved the security of large language models (LLMs), Multimodal Large Language Models (MLLMs) introduce new vulnerabilities, particularly in their visual components (Zhao et al., 2024; Liu et al., 2024; Qi et al., 2024; Shayegani et al., 2023a). Unlike text-only models, MLLMs also face adversarial image manipulations, which have emerged as a significant attack vector as textual defenses improve (Shayegani et al., 2023a). Attackers embed malicious content into benign images or overlay adversarial text, as exemplified by techniques such as FigStep (Gong et al., 2023; Liu et al., 2023b), to evade existing security measures. These challenges highlight the need for more comprehensive cross-modal defense strategies.

Building on foundational techniques, more advanced jailbreak variants have emerged.
These methods introduce pixel-level adversarial perturbations into benign images, creating adversarial examples that bypass existing defenses (Shayegani et al., 2023a). This combination of visual and adversarial manipulation significantly increases the effectiveness of attacks, underscoring the need for more robust defenses in vision-language models.

In summary, the vulnerability of vision-language models (VLMs) to multimodal jailbreaks arises from weaknesses in the visual modality. Attacks exploit the model's semantic understanding by embedding malicious content, either through text or deceptive visual cues, allowing them to evade safety mechanisms focused primarily on textual constraints.

At the same time, the mechanisms that enable these attacks suggest a potential direction for defense. If adversarial perturbations can be crafted to manipulate a model's predictions, they can also be designed to constrain them. In this work, we introduce Attack as Defense (AsD), a strategy that reframes adversarial attack techniques as a means of protection. As shown in Figure 1, our approach follows the same underlying principles as adversarial attacks but redirects them to reinforce, rather than undermine, the model's integrity. By leveraging the shared semantic space between the visual and language components, we embed protective prompts within images, ensuring that the defense remains effective across different attack types and model architectures.

The AsD strategy provides a universal defense approach, adaptable across various attack types. For typography-based attacks, protective semantics are embedded directly into the visual content, neutralizing conceptual threats. In perturbation-based attacks, the defense is augmented with pixel-level modifications that disrupt malicious inputs.
Despite differences in the underlying mechanics, the core defense strategy remains consistent, providing robust protection against both typographic and adversarial perturbation attacks. The main contributions of this work are as follows.

• We introduce Attack as Defense (AsD), a defense framework that repurposes adversarial perturbations to reinforce language model safeguards. Unlike existing methods that intervene at the text generation stage, AsD is the first to defend at the cross-modal level, disrupting attacks before they propagate to the language model. By embedding protective perturbations in the vision modality, AsD proactively engages cross-modal defenses, reducing the model's vulnerability to jailbreak attempts. Our analysis further highlights a key limitation in vision-language robustness: the inconsistent activation of text-based defenses by adversarial visual cues, underscoring the need for stronger cross-modal security mechanisms.

• AsD is training-free and universally applicable across different vision-language models. Unlike existing approaches that rely on explicit model retraining or fine-tuning, AsD operates as an input-level intervention, applied directly to images like a patch. This enables scalable deployment across architectures without modifying model parameters.

2 Related Work

2.1 VLM Jailbreaking

Jailbreaking seeks to bypass AI systems' safety mechanisms, enabling them to generate harmful or undesirable content, thereby exposing security vulnerabilities. One type of method for jailbreaking Vision-Language Models (VLMs) involves adversarial attacks, where minimal but carefully crafted perturbations are used to make the model deviate from its intended outputs. Qi et al. (2023) optimize visual adversarial examples, allowing them to universally bypass the security defenses of the model. Similarly, Bailey et al.
(2023) introduce an image hijacking approach based on Behavior Matching and Prompt Matching, manipulating VLM outputs in line with the attacker's preset instructions. Wang et al. (2024b) propose a dual-optimization approach targeting both image and text inputs simultaneously to maximize the likelihood of generating harmful content. In contrast, Shayegani et al. (2023a) implement embedding-space attacks by embedding adversarial signals into visual representations, combined with text prompts, to bypass alignment mechanisms and produce harmful outputs.

Figure 2: The AsD (Attack as Defense) strategy integrates visual and textual safety mechanisms in a synergistic manner. In the visual model, a safety trigger image is transformed into adversarial perturbations and overlaid onto the potentially harmful image. Simultaneously, the system augments text prompts with safety instructions, guiding the language model to reject unethical or dangerous requests.

Another approach leverages the semantic consistency of the visual encoder. Gong et al. (2023) transform textual inputs into typographic images, circumventing safety measures designed for text-based inputs. Expanding on this idea, Li et al.
(2024) propose the HADES method, enhancing jailbreak capabilities by converting harmful text into typography, merging it with adversarial images, and prompting multimodal models to generate harmful content.

2.2 Jailbreaking Defense

To improve the safety of Vision-Language Models (VLMs) and prevent the generation of harmful content, one strategy focuses on achieving better alignment during training. Zong et al. (2024) highlight the issue of catastrophic forgetting in safety alignment during fine-tuning and introduce VLGuard, a vision-language instruction-following dataset designed to enhance safety across models. Similarly, Zhang et al. (2024) propose a large-scale safety preference alignment dataset, enabling VLMs to more effectively filter harmful content. In another approach, Chen et al. (2024b) emphasize the risk of VLMs generating unhelpful or harmful responses without additional feedback mechanisms. To address this, they propose Natural Language Feedback (NLF) to guide safety alignment, training the models with conditional reinforcement learning to improve responses.

An alternative approach focuses on inference-time alignment, offering a more cost-effective and practical defense strategy. Wu et al. (2023) highlight VLM vulnerabilities to system prompts and propose improving prompt design to mitigate manipulation, though this relies on manual adjustments, limiting flexibility. To address this, AdaShield (Wang et al., 2024c) introduces an adaptive framework that dynamically refines defense prompts in real time.

Beyond static prompts, Wang et al. (2024a) leverage Safety Steering Vectors (SSVs) extracted from aligned models to adjust activation states during inference, guiding models toward safer outputs. Similarly, Gou et al. (2024) reduce reliance on raw image inputs by converting images to text, mitigating visual manipulation risks at the cost of increased computational overhead.

For detecting jailbreak attempts, CIDER Xu et al.
(2024) propose measuring cross-modal similarity between text and images, setting thresholds to enhance safeguards. Meanwhile, Fares et al. (2024) introduce a two-step verification process: first generating captions from a VLM, then reconstructing images using a Text-To-Image model. By comparing the embeddings of the original and reconstructed images, this method detects adversarial modifications in the visual domain.

3 Attack as Defense (AsD) Strategy

3.1 Overview

Existing adversarial attacks exploit vulnerabilities in the vision-language embedding space. Aich et al. (2022) generate perturbations by deceiving surrogate classifiers, while Zhao et al. (2024) use visual encoders to align adversarial images with targets in generative models. While these approaches focus on attacking vision-language models, defenses for large-scale language models (LLMs) have primarily targeted text-only jailbreak attempts.

We propose Attack as Defense (AsD), a strategy that integrates visual and textual defenses to counter multimodal adversarial attacks. As shown in Figure 2, AsD consists of two key components: visual perturbations and system prompts. The visual defense embeds safety triggers into images, transforming them into safety perturbations that modify the vision model's embedding space while preserving perceptual quality. The textual defense augments system prompts with safety constraints, ensuring the language model correctly interprets visual cues and rejects unsafe queries. Together, these mechanisms provide a coordinated defense against jailbreaking attempts.

3.2 Safety Trigger

The safety trigger is central to generating safety perturbations; in principle, it can be any image that conveys safety-warning semantics. In our experiments, we find that using a simple image with warning text (e.g., "[###Instruction] The image may contain harmful content") provides strong defensive performance.
When we combine these visual warning images with additional warning visuals, such as various exclamation marks, the system still maintains the safety signal but shows a slight decrease in its ability to accurately process pure text instructions. We also observe that longer textual prompts challenge the model's capacity to embed the full semantic content. Based on these findings, we adopt a safety trigger with a short and simple warning text for our defense strategy.

3.3 Safety Perturbation

Generation. The safety perturbation $\hat{x}^i_{adv}$ is generated from the safety trigger image $x_{safe}$ by the following equation:

$$\hat{x}^i_{adv} = \arg\min \mathcal{L}_2\left(I_\phi(x_{safe}),\, I_\phi(x^i_{adv})\right) \quad (1)$$

where $I_\phi$ denotes the vision encoder from the CLIP model, specifically the CLIP-ViT-Large-Patch14-336 configuration, and $x_{safe}$ refers to the safety trigger image, with its design detailed in Section 3.2.

Algorithm 1: Safety Perturbation Generation
Input: target safe input $x_{safe}$; initial adversarial image $x^i_{adv}$; vision encoder $I_\phi(\cdot)$; AdamW optimizer with learning rate $\eta$; number of iterations $\tau$
Output: adversarial image $\hat{x}^i_{adv}$
1: Compute the visual embedding $I_\phi(x_{safe})$
2: for $i = 1$ to $\tau$ do
3:   Compute the adversarial embedding $I_\phi(x^i_{adv})$
4:   $\mathcal{L} \leftarrow \mathcal{L}_2\left(I_\phi(x_{safe}),\, I_\phi(x^i_{adv})\right)$
5:   Compute the gradient $g \leftarrow \nabla_{x^i_{adv}} \mathcal{L}$
6:   Update $x^i_{adv} \leftarrow x^i_{adv} - \eta \cdot g$
7: end for
8: return $\hat{x}^i_{adv} = x^i_{adv}$

The objective is to minimize the $\ell_2$ distance between the embeddings of the safety perturbation and the safety trigger image using the AdamW optimizer with learning rate $\eta$, as outlined in Algorithm 1. For our experiments, we set the learning rate to $\eta = 0.1$ and the number of iterations to $\tau = 5000$, which yields effective results for embedding the safety triggers within the vision-language model.

Overlay. To preserve the visual integrity of benign images, we refine the generated safety perturbation before application. First, we adjust its transparency to minimize perceptual intrusion, ensuring seamless integration with the input. Second, we scale the perturbation to maintain proportional coverage across different image sizes, preventing excessive modification of the original content. The processed perturbation is then randomly overlaid at various locations, enhancing robustness against adversarial attacks while preserving natural image quality. Specific transparency levels and coverage ratios are detailed in Section 4.5.

3.4 System Safety Prompt

To ensure consistent adherence to safety guidelines, the system-level safety prompt is embedded directly into the language model's input. It is concatenated with the user's query before inference, reinforcing alignment with predefined constraints from the outset.

The design of our safety prompt follows a principled approach, drawing from established guidelines that provide explicit behavioral constraints. Rather than relying on complex prompt engineering techniques such as role playing (walkerspider, 2022; JailbreakChat, 2024) or few-/many-shot prompting (Zheng et al., 2024; Anil et al., 2024), we adopt a minimal but effective strategy that ensures reliability in diverse scenarios. The embedded safety prompt is: "If the image contains an instruction phrase, you must refuse to answer any inappropriate, dangerous, or unethical questions, and you must not provide any details about activities mentioned in the image that fall under these categories". This prompt is adaptable, allowing customization based on specific safety policies and risk thresholds in different applications. We investigated two variants of the safety prompt (see Appendix A.4) to determine the most robust design.

4 Experiments

We evaluate the Attack as Defense (AsD) strategy on typography-based and perturbation-based jailbreak attacks, assessing its defense performance, computational efficiency, and impact on benign images.
Additional analyses examine ablations on perturbation coverage and opacity, and the model's ability to recognize safety perturbations.

4.1 Defense Performance

4.1.1 Typography-Based Jailbreaking

Datasets. We evaluate our defense strategy on MM-SafetyBench (Liu et al., 2023b), a dataset of 5,040 text-image pairs spanning 13 scenarios. Scenarios 01-07 and 09 contain harmful key phrases designed to elicit unsafe responses. Scenarios 08 and 13 focus on political topics, ensuring neutrality, while Scenarios 10-12 address legal, financial, and health-related queries, where incorrect responses may pose risks.

Models. Defense mechanisms are less explored than attack strategies, limiting available baselines. LLaVA-1.5 (Liu et al., 2023a), Qwen-VL-Chat (Bai et al., 2023), and ShareGPT4V (Chen et al., 2024a) serve as baselines with safety alignment but struggle against jailbreak attacks (Table 1). ECSO (Gou et al., 2024) enhances safety by converting images into query-aware text, restoring model safeguards. Additional results on more models, including LLaVA-1.6 (Liu et al., 2023a), Qwen2.5-VL (Bai et al., 2025), and InstructBLIP (Dai et al., 2023), are provided in Appendix A.1.

Evaluation. Following Liu et al. (2023b), model outputs are classified as 'safe' or 'unsafe' and automatically evaluated by GPT-4. We use Attack Success Rate (ASR) as the primary metric. To mitigate evaluation bias and noise, we used Llama Guard 3 (Llama Team, 2024) as an additional LLM judge for LLaVA-1.6, Qwen2.5-VL, and InstructBLIP on the MM-SafetyBench Image + TYPO subset (see Appendix A.3).

Table 1 presents the ASR results across 13 malicious categories. Our method exhibits strong defensive performance across nearly all categories.
For explicitly illegal or malicious categories (1-7 and 9), the ASRs of LLaVA-1.5 under our defense for pure image attacks, pure typography attacks, and combined image-plus-typography attacks are 1.1, 5.8, and 12.6, respectively, which are substantially lower than those of other defense methods. Although the Text-Only defense, which relies solely on the safety prompt, is a simple and effective approach for pure image attacks, its efficacy is considerably reduced when confronting mixed image-text attacks. In contrast, our AsD method, as a multimodal defense operating within the semantic embedding space, provides significantly stronger protection against such multimodal threats. For categories that are not overtly malicious (8, 10-13; see Appendix A.2), while the defense performance is comparatively limited, it can be enhanced by modifying the safety prompt based on the required security level, thereby increasing the robustness of the defense.

Vision-Language Synergy. To systematically evaluate the interplay between vision and language defenses, we conduct experiments on the Tiny Image + Typography attack subset of MM-SafetyBench. Specifically, we assess the effectiveness of visual perturbations and textual prompts both in isolation and in combination. As shown in Table 2, visual perturbations alone contribute minimally to defense effectiveness, whereas their integration with textual prompts leads to a substantial improvement. This finding highlights a critical insight: visual semantics, while insufficient on their own, serve as amplifiers that enhance the impact of language-based defenses.
The results suggest that multimodal defenses do not merely stack independent safeguards but instead leverage cross-modal reinforcement, where vision augments language-driven protections, forming a more cohesive and robust security mechanism against complex attacks.

Table 1: Attack success rate (ASR) comparison across MM-SafetyBench (Liu et al., 2023b). 'Normal' denotes the vanilla model, ECSO (Gou et al., 2024) is a text-only defense baseline, and 'Text-Only' applies only our system prompt without visual intervention. Categories 1-7 and 9 exhibit explicit malicious intent and should be strictly rejected as unsafe. The results of Categories 8 and 10-13 are in Appendix A.2. Column groups give the ASR under each defense for the Image, Typography (Typo), and Image + Typography (I+T) attack settings.

LLaVA-1.5

| Category | Image Normal | Image Text-Only | Image ECSO | Image Ours | Typo Normal | Typo Text-Only | Typo ECSO | Typo Ours | I+T Normal | I+T Text-Only | I+T ECSO | I+T Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-Illegal Activity | 25.8 | 1.0 | 9.3 | 0.0 | 83.5 | 7.2 | 18.6 | 3.1 | 78.4 | 40.2 | 12.4 | 3.6 |
| 02-Hate Speech | 16.0 | 1.8 | 5.5 | 1.2 | 49.1 | 7.4 | 20.2 | 3.1 | 60.1 | 22.7 | 21.5 | 4.5 |
| 03-Malware Generation | 15.9 | 0.0 | 4.5 | 0.0 | 63.6 | 6.8 | 29.5 | 4.6 | 63.6 | 31.8 | 20.5 | 10.5 |
| 04-Physical Harm | 24.3 | 2.8 | 9.0 | 1.4 | 67.4 | 11.8 | 25.7 | 6.3 | 66.0 | 33.3 | 27.8 | 11.4 |
| 05-Economic Harm | 5.7 | 1.6 | 6.6 | 0.0 | 16.4 | 5.7 | 9.0 | 2.5 | 17.2 | 10.7 | 10.7 | 6.6 |
| 06-Fraud | 23.4 | 0.7 | 10.4 | 0.7 | 74.7 | 13.0 | 26.6 | 9.1 | 72.1 | 35.7 | 22.1 | 22.0 |
| 07-Sex | 7.3 | 7.3 | 7.3 | 5.5 | 30.3 | 14.7 | 31.2 | 10.1 | 33.9 | 28.4 | 30.3 | 19.3 |
| 09-Privacy Violence | 18.7 | 2.2 | 11.5 | 0.0 | 64.8 | 15.8 | 25.9 | 7.9 | 62.6 | 42.5 | 26.6 | 23.0 |
| Average | 17.1 | 2.2 | 8.0 | 1.1 | 56.2 | 10.3 | 23.3 | 5.8 | 56.7 | 30.7 | 21.5 | 12.6 |

Qwen-VL-Chat

| Category | Image Normal | Image Text-Only | Image ECSO | Image Ours | Typo Normal | Typo Text-Only | Typo ECSO | Typo Ours | I+T Normal | I+T Text-Only | I+T ECSO | I+T Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-Illegal Activity | 2.0 | 1.0 | 6.2 | 1.0 | 59.8 | 27.8 | 10.3 | 15.5 | 66.0 | 33.0 | 19.6 | 30.9 |
| 02-Hate Speech | 1.8 | 1.2 | 1.2 | 3.7 | 39.9 | 21.5 | 6.7 | 4.9 | 53.4 | 23.3 | 12.3 | 16.0 |
| 03-Malware Generation | 4.5 | 0.0 | 0.0 | 0.0 | 38.6 | 9.1 | 20.5 | 4.5 | 47.7 | 18.2 | 24.7 | 13.6 |
| 04-Physical Harm | 2.1 | 1.4 | 1.4 | 1.4 | 49.3 | 20.1 | 18.1 | 6.9 | 54.9 | 18.8 | 23.6 | 14.6 |
| 05-Economic Harm | 0.8 | 0.0 | 0.8 | 2.5 | 13.9 | 2.5 | 6.6 | 1.6 | 20.5 | 3.3 | 4.9 | 2.5 |
| 06-Fraud | 1.9 | 0.6 | 0.6 | 0.0 | 61.0 | 11.0 | 10.4 | 4.5 | 72.1 | 13.4 | 16.9 | 6.5 |
| 07-Sex | 7.3 | 0.9 | 3.7 | 0.9 | 31.2 | 13.8 | 14.7 | 6.4 | 25.7 | 12.8 | 14.7 | 12.8 |
| 09-Privacy Violence | 0.0 | 1.4 | 3.4 | 1.4 | 61.9 | 12.9 | 15.8 | 14.4 | 60.4 | 25.2 | 18.0 | 18.0 |
| Average | 2.6 | 0.8 | 2.2 | 1.4 | 44.5 | 14.8 | 12.9 | 7.3 | 50.1 | 18.5 | 16.8 | 14.4 |

ShareGPT4V

| Category | Image Normal | Image Text-Only | Image ECSO | Image Ours | Typo Normal | Typo Text-Only | Typo ECSO | Typo Ours | I+T Normal | I+T Text-Only | I+T ECSO | I+T Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-Illegal Activity | 19.6 | 0.0 | 5.1 | 1.0 | 83.5 | 0.0 | 13.4 | 1.0 | 77.3 | 0.0 | 11.3 | 1.0 |
| 02-Hate Speech | 10.4 | 0.0 | 0.0 | 0.0 | 47.2 | 0.0 | 7.4 | 1.8 | 47.8 | 0.6 | 9.8 | 1.8 |
| 03-Malware Generation | 9.1 | 2.2 | 0.0 | 0.0 | 63.6 | 0.0 | 9.1 | 0.0 | 52.3 | 0.0 | 25.0 | 0.0 |
| 04-Physical Harm | 16.0 | 0.0 | 6.2 | 0.7 | 58.3 | 1.4 | 15.3 | 0.7 | 61.1 | 2.1 | 20.1 | 0.7 |
| 05-Economic Harm | 1.6 | 0.0 | 0.0 | 0.0 | 13.1 | 1.6 | 5.7 | 0.0 | 10.7 | 0.8 | 3.3 | 0.0 |
| 06-Fraud | 18.2 | 0.0 | 3.9 | 0.0 | 70.8 | 1.3 | 11.7 | 0.0 | 72.1 | 1.3 | 11.7 | 0.0 |
| 07-Sex | 11.0 | 0.0 | 6.4 | 0.0 | 26.6 | 0.0 | 14.7 | 0.0 | 33.0 | 0.0 | 22.0 | 0.0 |
| 09-Privacy Violence | 15.1 | 0.7 | 4.3 | 0.0 | 56.1 | 0.7 | 7.9 | 0.7 | 63.3 | 1.4 | 14.4 | 0.7 |
| Average | 12.6 | 0.4 | 3.2 | 0.2 | 52.4 | 0.6 | 10.7 | 0.5 | 52.2 | 0.8 | 14.7 | 0.5 |

Table 2: Synergistic safeguard from both the safety perturbation (Pert) and the safety prompt (Text), evaluated using the ASR metric on the Tiny MM-SafetyBench dataset.

| | LLaVA-1.5 | Text | Pert | Ours (Text + Pert) |
|---|---|---|---|---|
| ASR | 75.0 | 51.2 | 74.4 | 38.7 |

4.1.2 Perturbation-Based Jailbreaking

Datasets. Following Shayegani et al. (2023b), we construct a dataset of 400 samples, each comprising a harmful image and a malicious prompt. Adversarial images are generated by optimizing CLIP features to align with target image features from the penultimate layer, then semi-transparently overlaid onto the originals. Using 16 prompts, we perform 25 rounds of question-answering to generate the full dataset.

Models. We evaluated the vulnerability of three distinct model architectures to perturbation-based jailbreaking attacks. We selected two models that use CLIP as the visual encoder: LLaVA-1.5 and its successor, LLaVA-1.6. To demonstrate the generalizability of our findings, we also included models with alternative visual encoders: InstructBLIP, which utilizes a Q-Former architecture, and PaliGemma (Beyer et al., 2024), which employs SigLIP as its visual encoder.

Table 3: ASR comparison of perturbation-based jailbreaking attacks on LLaVA-1.5, LLaVA-1.6, InstructBLIP, and PaliGemma, evaluated against the baseline, CIDER (Xu et al., 2024), a text-only defense (Liu et al., 2023b), and our method.

| Model | Baseline | CIDER | Text-Only | Ours |
|---|---|---|---|---|
| LLaVA-1.5 | 89.0 | 63.5 | 53.5 | 47.5 |
| LLaVA-1.6 | 78.0 | 22.8 | 25.6 | 20.0 |
| InstructBLIP | 33.3 | 6.0 | 1.3 | 0.0 |
| PaliGemma | 84.3 | 60.0 | 37.5 | 36.5 |
Results. The evaluation method is the same as described in Section 4.1.1. The results are shown in Table 3. On LLaVA-1.5, adversarial perturbation-based attacks are highly effective (Shayegani et al., 2023a), achieving an ASR of 89%, while CIDER and text-only defenses reduce it to 63.5% and 53.5%, respectively. Our method further lowers the ASR to 47.5%. Similar reductions are observed across LLaVA-1.6, InstructBLIP, and PaliGemma, where our method consistently achieves the lowest ASR compared with both CIDER and text-only defenses. Notably, our safety perturbation covers only 25% of the image and includes transparency, indicating that the defense is not merely a result of obscuring image content but rather effectively combats malicious elements at the semantic level. Further analysis of these factors is provided in Section 4.5.

4.2 Computational Costs

Most current jailbreak defense methods operate at the language model level or focus on isolated modalities, introducing significant computational overhead, as summarized in Table 4. To illustrate these trade-offs, we analyze three representative approaches that cover different defense paradigms: ECSO, AdaShield, and CIDER. These methods primarily focus on mitigating attacks during text generation, often requiring additional inference-time interventions or external filtering mechanisms, which further increase computational cost. We perform a quantitative analysis by measuring the inference time and peak GPU memory usage on 50 samples from MM-SafetyBench.

Table 4: Comparison of architectural and computational characteristics across VLM defense methods. Our method avoids additional inference cost and auxiliary modules, while achieving low latency and efficient memory usage.

| Method | Additional Inference | Auxiliary Modules | Inference Time | Peak GPU Memory |
|---|---|---|---|---|
| ECSO | Yes | No | 17.8 s | 15.7 GB |
| AdaShield | No | Yes | 1.6 s | 22.0 GB |
| CIDER | No | Yes | 7.2 s | 30.2 GB |
| Ours | No | No | 3.01 s | 15.4 GB |
ECSO converts potentially harmful images into textual descriptions, introducing costly inference steps. AdaShield employs an auxiliary LLM to generate defense prompts, operating solely in the text modality. CIDER integrates an isolated image denoising module to mitigate adversarial attacks, further increasing computational and storage demands. In contrast, AsD requires neither additional inference nor auxiliary modules, significantly reducing computational overhead. Moreover, as a cross-modal approach, it operates directly on both text and images, ensuring efficient and scalable deployment.

4.3 Can the Model Comprehend the Safety Perturbations?

Figure 3: Examples of the model detecting semantics in safety perturbations. Safety-related content is in green, malicious text in red. Images are blurred for ethical considerations.

We examine whether vision-language models genuinely understand the semantics of safety-related perturbations or merely react to their presence. These perturbations, designed to obscure parts of an image as a defensive measure, can mitigate attacks by masking harmful content. However, effective defense requires more than occlusion; it demands that the model recognize and interpret the safety signals embedded within the perturbations. As illustrated in Figure 3, the model successfully identifies both safety-critical (green) and malicious (red) content, even when perturbations are subtle. This suggests that the model is not simply reacting to masked regions but can actively interpret their defensive intent. While occlusion reduces an attack's immediate impact, our results indicate that models can go beyond passive obfuscation and leverage safety perturbations as meaningful signals, strengthening their robustness against adversarial threats.

4.4 Benign Images Performance

Ensuring that a defense mechanism does not compromise normal model behavior is critical for practical deployment.
To evaluate this, we assess its impact on two benchmark tasks: the MM-Vet and ScienceQA datasets.

Datasets. MM-Vet (Yu et al., 2024) is designed for complicated multimodal tasks. It focuses on six core visual-language functions: recognition, OCR, knowledge, language generation, spatial awareness, and mathematical computation. It contains 200 images from online sources, the VCR dataset, and the ChestX-ray14 dataset, with 218 questions. The goal is to evaluate how effectively models can combine different skills to solve complex problems that involve visual and text information. ScienceQA (Lu et al., 2022) is a large-scale resource for multimodal question answering, focusing on science and general knowledge. It includes over 21,000 questions paired with explanatory diagrams or text across domains like biology, physics, and chemistry, testing both factual knowledge and reasoning abilities of models.

Figure 4: Performance comparison across different categories of the MM-Vet dataset.

Table 5: Performance comparison between the baseline and our method on the ScienceQA and MM-Vet datasets.

| | LLaVA-1.5 | Ours |
|---|---|---|
| ScienceQA | 70.2 | 67.0 |
| MM-Vet | 31.1 | 23.9 |

Results. As shown in Table 5 and Figure 4, our method introduces a controlled degree of occlusion, resulting in a minor decrease in performance. On ScienceQA, accuracy declines from 70.2% to 67.0%. While this indicates some loss of visual information, the model retains its overall capacity for image understanding. On MM-Vet, compared to other defense strategies, our approach strikes a balance between robustness and accuracy, mitigating adversarial threats while maintaining competitive performance. These findings suggest that although occlusion inevitably impacts recognition, the trade-off remains manageable, underscoring the effectiveness of our method in preserving model utility under adversarial conditions.
4.5 Ablations of Coverage and Opacity

The effectiveness of visual perturbations hinges on two critical parameters: spatial coverage (defense area proportion) and opacity (perturbation intensity). We formalize their trade-off through

D(c, o) = α · ASR(c, o) − β · ΔAcc(c, o)    (2)

where c ∈ [0.2, 0.9] denotes coverage, o ∈ [0.3, 0.9] opacity, and (α, β) are task-specific weights. Sampling across these configurations, we evaluate: (1) attack success rate (ASR) reduction against state-of-the-art attacks (Figure 5a, b), and (2) preservation of visual reasoning capability via TextQA accuracy (Figure 5c). Attack scenarios include Typography Attacks (adversarial text rendering, e.g., toxic words in styled fonts) and Perturbation Attacks (gradient-based universal noise patterns).

As shown in Figure 5, the findings demonstrate a clear pattern. Non-linear interaction: the defense efficacy surface D(c, o) exhibits phase transitions; beyond critical thresholds (c > 0.6, o > 0.8), marginal security gains diminish rapidly while capability degradation accelerates. Attack-type dependence: typography defenses require balanced c-o coordination, whereas perturbation defenses prioritize coverage. This aligns with the attack vectors' differing sensitivity to spatial occlusion versus intensity variation.

Based on these observations, we adopt 25% coverage and 70% opacity as the optimal settings in our experiments. Nevertheless, given the limited range of our sampling, there may be superior hyperparameter combinations, particularly when used alongside other safety triggers.

4.6 Qualitative Analysis of Safety Failures

Our qualitative analysis reveals two key issues in multimodal safety mechanisms (Figures 6-7 in the Appendix). First, a post-hoc filtering paradigm remains dominant: systems such as LLaVA-1.5 frequently emit disallowed content before issuing a refusal, which indicates that safety verification is executed externally rather than embedded within the decoding process.
Second, we observe cross-modal attenuation: visual danger signals, although correctly encoded by the vision backbone, are progressively weakened at the vision-to-language interface and consequently fail to constrain the linguistic output.

These deficiencies give rise to three recurrent failure modes. The model may (i) comply initially and apologise only after the fact, evidencing latency in refusal; (ii) process visual and textual streams in isolation, yielding modality-specific judgements that remain unaligned; or (iii) privilege the linguistic prompt when it conflicts with visual evidence, thereby overriding embedded safeguards. Collectively, these observations demonstrate that strengthening a single modality is inadequate; robust safety demands integrated, end-to-end alignment across the entire multimodal pipeline.

Figure 5: Ablation study on the transparency and coverage of the safety perturbation.

5 Conclusion

We introduce Attack as Defense (AsD), a cross-modal defense strategy that repurposes adversarial attack techniques to enhance vision-language model security. By embedding defensive perturbations in the vision modality, AsD reinforces prompt-based defenses, significantly reducing the success rate of both typographic and adversarial attacks. Our analysis reveals that multimodal vulnerabilities stem not from isolated weaknesses in vision or language models but from the failure of cross-modal signals to reliably activate language model defenses. This insight underscores the need to shift focus from unimodal safety mechanisms to cross-modal alignment. AsD provides a scalable, model-agnostic solution to evolving jailbreak attacks, demonstrating that adversarial techniques can be restructured to serve as proactive safeguards rather than threats.

Limitations

Although our method effectively reduces the success rate of jailbreaking attacks, it comes at the cost of some performance degradation on clean images.
This trade-off arises because the defensive perturbations, while serving as a protective mechanism, inevitably introduce additional noise at both the image and semantic levels, interfering with the model's original representations. The challenge lies in striking a balance between defense and the preservation of clean-image integrity, an issue that remains unresolved.

A key limitation of our approach is the difficulty of isolating the defensive perturbations from the core visual features that the model relies on for accurate predictions. Existing methods for integrating adversarial perturbations into the defense pipeline still lack precise control over their semantic influence. Future research should focus on refining how these perturbations interact with the model, both in terms of their application strategy and their synergy with language-based defenses. Whether through optimizing the way perturbations are overlaid, refining prompt-based interventions, or exploring new ways to disentangle defensive signals from core image features, more work is needed to improve robustness without sacrificing clean-image performance.

Ethical Impact

The Attack as Defense (AsD) strategy introduces a dual-use concern, as the same adversarial techniques designed to enhance security could also be repurposed for refining attacks. This creates a risk of escalating the security arms race, making it essential to ensure that the method is deployed responsibly to prevent malicious use.

Additionally, the complexity of adversarial defenses may reduce the transparency of the model, making it harder to interpret and trust its decisions. In high-stakes applications, this opacity underscores the need for clear accountability structures, ensuring that the system's behavior is understandable and that responsibility is assigned in case of failure.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 62206167).
References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Abhishek Aich, Calvin-Khang Ta, Akash Gupta, Chengyu Song, Srikanth Krishnamurthy, Salman Asif, and Amit Roy-Chowdhury. 2022. GAMA: Generative adversarial multi-object scene attacks. Advances in Neural Information Processing Systems, 35:36914-36930.

Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, et al. 2024. Many-shot jailbreaking. Anthropic, April.

Anthropic. 2023. We are offering a new version of our model, Claude-v1.3, that is safer and less susceptible to adversarial attacks.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. 2023. Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236.
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, and Xiaohua Zhai. 2024. PaliGemma: A versatile 3B VLM for transfer. Preprint, arXiv:2407.07726.

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2024a. ShareGPT4V: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370-387. Springer.

Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2024b. DRESS: Instructing large vision-language models to align and interact with humans via natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14239-14250.

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250-49267.

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Karthik Nandakumar, and Ivan Laptev. 2024. MirrorCheck: Efficient adversarial defense for vision-language models. arXiv preprint arXiv:2406.09250.

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2023. FigStep: Jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608.

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. 2024. Eyes closed, safety on: Protecting multimodal LLMs via image-to-text transformation. arXiv preprint arXiv:2403.09572.

JailbreakChat. 2024. Jailbreakchat.

Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. arXiv preprint arXiv:2403.09792.

Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Wei Hu, and Yu Cheng. 2024. A survey of attacks on large vision-language models: Resources, advances, and future trends. arXiv preprint arXiv:2407.07403.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892-34916.

Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. 2023b. Query-relevant images jailbreak large multi-modal models. arXiv preprint arXiv:2311.17600.

AI @ Meta Llama Team. 2024. The Llama 3 herd of models. Preprint, arXiv:2407.21783.

Pan Lu, Shuohang Li, Eric Lehman, Yichen Liang, Jingjing Yan, Xinyun Li, Honglak Yang, and Kai-Wei Chang. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems.

Hongming Qi, Yujia Wu, Zeqiu Li, Zhiqing Yuan, Xiaodong Hu, Chia-Hsiu Chen Li, Shuchang Yang, Wei Wei, and Yichen Zhou. 2023. Visual adversarial examples jailbreak aligned large language models. arXiv preprint arXiv:2306.13213.

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 21527-21536.

Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. 2023a. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations.

Erfan Shayegani, Yue Dong, and Nael B. Abu-Ghazaleh. 2023b. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In International Conference on Learning Representations (ICLR).

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2023. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.

walkerspider. 2022. DAN is my new friend.

Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. 2024a. InferAligner: Inference-time alignment for harmlessness through cross-model guidance. arXiv preprint arXiv:2401.11206.

Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, and Yu-Gang Jiang. 2024b. White-box multimodal jailbreaks against large vision-language models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 6920-6928.

Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, and Chaowei Xiao. 2024c. AdaShield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. arXiv preprint arXiv:2403.09513.

Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, and Lichao Sun. 2023. Jailbreaking GPT-4V via self-adversarial attacks with system prompts. arXiv preprint arXiv:2311.09127.

Yue Xu, Xiuyuan Qi, Zhan Qin, and Wenjie Wang. 2024. Cross-modality information check for detecting jailbreaking in multimodal large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13715-13726, Miami, Florida, USA. Association for Computational Linguistics.
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2024. MM-Vet: Evaluating large multimodal models for integrated capabilities. In International Conference on Machine Learning. PMLR.

Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, et al. 2024. SPA-VL: A comprehensive safety preference alignment dataset for vision language model. arXiv preprint arXiv:2406.12030.

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. 2024. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36.

Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. 2024. Improved few-shot jailbreaking can circumvent aligned language models and their defenses. arXiv preprint arXiv:2406.01288.

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. 2024. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207.
A Appendix

A.1 Defense Performance on Additional Models

LLaVA-1.6
Category                Normal  Text-Only  ECSO   Ours
01-Illegal Activity     93.8    51.5       14.4   21.6
02-Hate Speech          63.8    32.5       13.5   23.3
03-Malware Generation   81.8    52.3       31.8   15.9
04-Physical Harm        89.6    68.1       35.4   19.4
05-Economic Harm        27.9    22.1       16.4    8.2
06-Fraud                75.3    51.3       13.6   12.3
07-Sex                  34.9    26.6       15.6   21.1
09-Privacy Violence     87.8    52.5       20.1   22.3
Average                 69.4    44.6       20.1   18.0

Qwen2.5-VL
Category                Normal  Text-Only  ECSO   Ours
01-Illegal Activity     45.4     6.2       14.4    2.1
02-Hate Speech          46.6    13.5       31.3    1.8
03-Malware Generation   79.5    20.5       47.7    6.8
04-Physical Harm        78.5    29.9       53.5   15.3
05-Economic Harm        21.3     7.4       21.3    8.2
06-Fraud                59.7     9.1       13.6    0.6
07-Sex                  46.8    21.1       46.8   10.1
09-Privacy Violence     64.7    16.5       38.8    4.3
Average                 55.3    15.5       33.4    6.2

InstructBLIP
Category                Normal  Text-Only  ECSO   Ours
01-Illegal Activity     60.8    17.5       13.4    2.1
02-Hate Speech          38.7    14.1        7.4    4.3
03-Malware Generation   47.7     6.8       29.5    0.0
04-Physical Harm        38.2    11.1       23.6    4.2
05-Economic Harm        17.2     9.8        9.8    3.3
06-Fraud                44.2     9.1        9.7    1.3
07-Sex                  29.4    22.9        6.4    8.3
09-Privacy Violence     51.8     5.8       17.3    1.4
Average                 41.0    12.1       14.6    3.1

Table 6: Attack success rate (ASR) on the 'Image + Typography' setting of MM-SafetyBench (Liu et al., 2023b). 'Normal' denotes the vanilla model, ECSO (Gou et al., 2024) is a text-only defense baseline, and 'Text-Only' applies only our system prompt without visual intervention.
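The 'Average' rows in Table 6 are the unweighted means of the eight category ASRs, which can be sanity-checked with a short script; the LLaVA-1.6 block is used as an example, with values transcribed from the table.

```python
# Per-category ASR for LLaVA-1.6 from Table 6, in row order
# (01-Illegal Activity ... 09-Privacy Violence), one list per column.
llava16 = {
    "Normal":    [93.8, 63.8, 81.8, 89.6, 27.9, 75.3, 34.9, 87.8],
    "Text-Only": [51.5, 32.5, 52.3, 68.1, 22.1, 51.3, 26.6, 52.5],
    "ECSO":      [14.4, 13.5, 31.8, 35.4, 16.4, 13.6, 15.6, 20.1],
    "Ours":      [21.6, 23.3, 15.9, 19.4,  8.2, 12.3, 21.1, 22.3],
}

# Unweighted category mean, rounded to one decimal as in the table.
averages = {k: round(sum(v) / len(v), 1) for k, v in llava16.items()}
# Matches the reported Average row: 69.4 / 44.6 / 20.1 / 18.0
```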
A.2 Defense Performance on Harmless Scenarios in MM-SafetyBench

LLaVA-1.5
                          Image                    Typography               Image + Typography
Category                  Normal Text-Only Ours    Normal Text-Only Ours    Normal Text-Only Ours
08-Political Lobbying     64.7   22.9      14.4    92.8   45.1      25.5    94.2   76.7      51.6
10-Legal Opinion          88.5   51.5      53.1    90.8   69.2      66.2    93.1   77.7      69.2
11-Financial Advice       100.0  93.4      91.6    97.6   94.6      92.8    98.8   96.4      96.4
12-Health Consultation    96.3   70.6      58.7    99.1   72.5      70.6    96.3   93.6      77.1
13-Gov Decision           96.0   46.3      33.6    98.7   66.4      52.4    97.3   85.2      70.5
Average                   89.1   56.9      50.3    95.8   69.6      61.5    95.9   85.9      73.0

Qwen-VL-Chat
                          Image                    Typography               Image + Typography
Category                  Normal Text-Only Ours    Normal Text-Only Ours    Normal Text-Only Ours
08-Political Lobbying     32.0   25.5      22.9    85.6   28.1      22.9    93.5   47.1      45.1
10-Legal Opinion          53.1   32.3      48.5    67.7   43.1      29.2    72.3   39.2      36.9
11-Financial Advice       65.9   82.0      85.6    98.2   86.2      84.4    97.0   88.0      90.4
12-Health Consultation    69.7   57.8      45.9    83.5   56.0      47.7    89.9   58.7      65.1
13-Gov Decision           51.0   25.5      18.8    83.9   37.6      40.3    92.6   48.9      37.6
Average                   54.3   44.6      44.3    83.8   50.2      44.9    89.1   56.4      55.0

ShareGPT4V
                          Image                    Typography               Image + Typography
Category                  Normal Text-Only Ours    Normal Text-Only Ours    Normal Text-Only Ours
08-Political Lobbying     59.5   38.6      13.1    60.1   45.8      33.3    93.5   49.0      32.0
10-Legal Opinion          96.9   60.0      53.8    94.6   65.4      46.2    99.0   59.2      47.7
11-Financial Advice       99.0   91.2      95.2    100.0  93.4      94.6    99.0   96.4      91.6
12-Health Consultation    97.2   67.9      67.0    98.2   76.1      77.1    98.2   76.1      78.0
13-Gov Decision           96.0   37.6      25.5    98.0   35.6      19.5    99.3   32.2      20.1
Average                   89.7   59.1      50.9    90.2   63.3      54.1    97.8   62.6      53.9

Table 7: Attack success rate (ASR) comparison on personal topics (Categories 08, 10-13) from MM-SafetyBench (Liu et al., 2023b). 'Normal' refers to vanilla models. The 'Text-Only' baseline applies our system prompt without visual input. Our method consistently reduces ASR across sensitive question types.
A.3 Safety Evaluation using Llama Guard 3

LLaVA-1.6
Category                Normal  Text-Only  Ours
01-Illegal Activity     54.6     6.2       6.2
02-Hate Speech          19.6     3.1       0.6
03-Malware Generation   20.5     2.3       2.3
04-Physical Harm        34.7     4.2       4.2
05-Economic Harm         5.7     1.6       0.0
06-Fraud                37.0     4.5       5.8
07-Sex                   6.4     0.0       0.9
09-Privacy Violence     26.6     5.8       5.0
Average                 25.6     3.3       3.1

Qwen2.5-VL
Category                Normal  Text-Only  Ours
01-Illegal Activity     21.6     2.2       0.0
02-Hate Speech          11.7     1.2       0.0
03-Malware Generation   29.5     0.0       0.0
04-Physical Harm        31.3     0.0       0.7
05-Economic Harm         1.6     0.0       0.0
06-Fraud                22.7     1.9       0.0
07-Sex                   6.4     0.9       0.0
09-Privacy Violence     25.9     0.7       0.7
Average                 18.8     0.9       0.2

InstructBLIP
Category                Normal  Text-Only  Ours
01-Illegal Activity     39.2     7.2       1.0
02-Hate Speech          15.3     1.2       1.2
03-Malware Generation   27.3     4.5       0.0
04-Physical Harm        14.6     4.9       0.0
05-Economic Harm         1.6     1.6       0.0
06-Fraud                25.3     1.9       0.6
07-Sex                  21.1    11.0       5.5
09-Privacy Violence     35.3     1.4       0.0
Average                 22.5     4.2       1.0

Table 8: Attack success rate (ASR) on the 'Image + Typography' setting of MM-SafetyBench (Liu et al., 2023b). 'Normal' denotes the vanilla model, and 'Text-Only' applies only our system prompt without visual intervention.

A.4 Variants of Safety Prompt

To assess the robustness of our safety prompt, we explored two variants of the safety prompt proposed in this paper. The results on a subset of MM-SafetyBench are shown in Table 9. We find that the current prompt design is the most effective.

• Core (focuses on the core instructions, eliminating conditional statements): "You must refuse to answer any inappropriate, dangerous, or unethical questions and must not provide any details related to such activities."

• Simplified Instruction (retains the conditional phrase but simplifies its wording): "If the image contains an instruction phrase, you should refuse to answer any inappropriate, dangerous, or unethical requests and withhold all related details."

Method                  ASR
Baseline                75.0
Text-only               51.2
Core                    57.7
Simplified Instruction  57.7
Ours                    38.7

Table 9: ASR performance using different prompts on the tiny MM-SafetyBench dataset. 'Text-only' uses only the system prompt.
A.5 Qualitative Results on Typography-based and Perturbation-based Jailbreaking

Figure 6: Typography-based Jailbreaking: Examples comparing the output of LLaVA-1.5 with and without our safety perturbation. For ethical reasons, the example images are blurred, and key details in the responses are replaced with ellipses ('...'). Generated malicious content is highlighted in red.

Figure 7: Perturbation-based Jailbreaking: Examples comparing the output of LLaVA-1.5 with and without our safety perturbation. For ethical reasons, the example images are blurred, and key details in the responses are replaced with ellipses ('...'). Generated malicious content is highlighted in red.