Paper deep dive
RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion
Ruofan Wang, Xiang Zheng, Xiaosen Wang, Cong Wang, Xingjun Ma
Models: Gemini-1.5-flash, LLaVA-1.5, Llama-3.2-11B-Vision, Stable Diffusion v1.5
Abstract
Vision-Language Models (VLMs) are vulnerable to jailbreak attacks, where adversaries bypass safety mechanisms to elicit harmful outputs. In this work, we examine an insidious variant of this threat: toxic continuation. Unlike standard jailbreaks that rely solely on malicious instructions, toxic continuation arises when the model is given a malicious input alongside a partial toxic output, resulting in harmful completions. This vulnerability poses a unique challenge in multimodal settings, where even subtle image variations can disproportionately affect the model's response. To this end, we propose RedDiffuser (RedDiff), the first red teaming framework that uses reinforcement learning to fine-tune diffusion models into generating natural-looking adversarial images that induce toxic continuations. RedDiffuser integrates a greedy search procedure for selecting candidate image prompts with reinforcement fine-tuning that jointly promotes toxic output and semantic coherence. Experiments demonstrate that RedDiffuser significantly increases the toxicity rate in LLaVA outputs by 10.69% and 8.91% on the original and hold-out sets, respectively. It also exhibits strong transferability, increasing toxicity rates on Gemini by 5.1% and on LLaMA-Vision by 26.83%. These findings uncover a cross-modal toxicity amplification vulnerability in current VLM alignment, highlighting the need for robust multimodal red teaming. We will release the RedDiffuser codebase to support future research.
Tags
Links
- Source: https://arxiv.org/abs/2503.06223
- Canonical: https://arxiv.org/abs/2503.06223
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 12:38:19 AM
Summary
RedDiffuser is a novel red teaming framework that uses reinforcement learning to fine-tune diffusion models to generate adversarial images. These images are designed to induce 'toxic continuation' in Vision-Language Models (VLMs) by pairing malicious inputs with subtle, natural-looking visual stimuli, effectively bypassing safety alignment and guardrails.
Entities (6)
Relation Signals (3)
RedDiffuser → increases toxicity in → LLaVA
confidence 98% · RedDiffuser significantly increases the toxicity rate in LLaVA outputs by 10.69%
RedDiffuser → targets → Vision Language Models
confidence 95% · RedDiffuser, the first red teaming framework that uses reinforcement learning to fine-tune diffusion models into generating natural-looking adversarial images that induce toxic continuations.
RedDiffuser → uses → Stable Diffusion
confidence 95% · RedDiffuser... uses reinforcement learning to fine-tune diffusion models into generating natural-looking adversarial images
Cypher Suggestions (2)
Find all models targeted by the RedDiffuser framework · confidence 90% · unvalidated
MATCH (f:Framework {name: 'RedDiffuser'})-[:TARGETS]->(m:Model) RETURN m.name
Identify frameworks that utilize specific diffusion models · confidence 90% · unvalidated
MATCH (f:Framework)-[:USES]->(d:DiffusionModel) RETURN f.name, d.name
Full Text
52,948 characters extracted from source content.
RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion
Ruofan Wang¹, Xiang Zheng², Xiaosen Wang³, Cong Wang², Xingjun Ma¹†, Yu-Gang Jiang¹
¹Fudan University, China; ²City University of Hong Kong, China; ³Huawei Technologies Ltd., China
†Corresponding author: xingjunma@fudan.edu.cn

Abstract—Large Vision-Language Models (VLMs) remain vulnerable to jailbreak attacks that bypass safety mechanisms and elicit harmful outputs. In this work, we examine a subtle yet critical variant of this threat: toxic continuation. Unlike standard jailbreaks driven by malicious instructions, toxic continuation arises when VLMs are exposed to harmful prefixes paired with adversarial images, leading to toxic completions. To address this, we propose RedDiffuser (RedDiff), the first multimodal red teaming framework that leverages reinforcement learning to fine-tune diffusion models for generating natural-looking adversarial images. RedDiffuser combines greedy search to identify image prompts that maximize toxicity with reinforcement optimization that balances two objectives: (1) promoting toxic continuation, guided by a toxicity classifier, and (2) maintaining semantic coherence and stealthiness, measured by text–image similarity scores. This design circumvents traditional noise-based adversarial patterns and yields more realistic attack vectors. Experiments show that RedDiffuser significantly increases the toxicity rate of LLaVA-1.5 outputs by 10.69% and 8.91% on the original and hold-out sets, respectively. Moreover, it demonstrates strong transferability, raising toxicity rates by 5.1% on Gemini-1.5-flash and 26.83% on Llama-3.2-11B-Vision-Instruct. We also consider a more realistic deployment setting involving external NSFW guardrails.
Incorporating a checker-aware training signal allows RedDiffuser to retain attack effectiveness while achieving high guardrail pass rates, indicating that toxic continuation vulnerabilities persist under safety-filtered conditions. These findings uncover two overlooked flaws in current VLM alignment: the failure to mitigate toxic continuation and the vulnerability to cross-modal adversarial amplification. Our results highlight the urgency of multimodal red teaming in evaluating VLM safety. We will release the RedDiffuser codebase to support future research.

Disclaimer: This paper contains content that may be disturbing or offensive.

Index Terms—red teaming, vision-language models, diffusion models, reinforcement learning.

I. INTRODUCTION

The rapid deployment of Vision-Language Models (VLMs) such as GPT-4o [1] and Gemini [2] has highlighted significant vulnerabilities in their safety alignment. Recent research demonstrates that VLMs are susceptible to jailbreak attacks, where adversaries can bypass built-in safety mechanisms and elicit harmful outputs. Among these, a more insidious threat emerges: toxic continuation. Despite progress in refusing explicit harmful instructions, existing defenses fall short in addressing this specific vulnerability, where the model, when presented with both a malicious input and an initial segment of harmful output, can be manipulated to generate highly toxic completions.

[Figure omitted] Fig. 1. Example of toxic continuation on LLaVA. RedDiffuser generates images capable of inducing harmful continuations, whereas images from a standard diffusion model lead to benign VLM output.
This vulnerability is particularly pronounced in the multimodal setting, where images can substantially amplify the risk of toxic continuation (Figure 1). These gaps highlight the urgent need for systematic investigation and mitigation of this emerging risk, moving beyond the conventional focus on single-modality and instruction-based jailbreaks. However, a critical challenge in addressing this multimodal vulnerability is identifying the very images that induce such toxic responses, a problem that remains largely unsolved. Li et al. [3] show that even a blank image can sharply increase the success rate of jailbreak attacks, demonstrating that the connection between image content and VLM toxicity can defy intuitive understanding. While some white-box approaches employ gradient-based optimization to generate adversarial images [4], [5], these methods typically require full access to the model and often result in images with visible artifacts that are unnatural and easily detected [6], [7]. Furthermore, their limited transferability across different VLMs [8] severely restricts their real-world applicability. These challenges highlight the need for stealthy, black-box attacks capable of generating natural-looking yet adversarial images without requiring access to model internals. Most existing black-box jailbreaks depend on hand-crafted pipelines or unmodified diffusion models [9], [10], which constrain the range of possible attacks and focus mainly on evading text-based filters rather than proactively inducing toxic model responses. Recent methods such as IDEATOR [11] present a "VLMs against VLMs" approach, pairing prompts with images generated by off-the-shelf diffusion models. However, these diffusion models are typically pretrained to improve aesthetic quality [12], making them inherently misaligned with the objectives of red teaming.
This gap leads to a fundamental question: Can we directly optimize diffusion models to actively induce toxic continuations in VLMs while preserving image naturalness? We address this challenge by introducing RedDiffuser (RedDiff), the first framework to employ reinforcement learning for red teaming with diffusion models. RedDiffuser operates in two coordinated phases. In the first phase, an LLM conducts a greedy search over candidate image prompts to identify those that maximize toxicity in the VLM's output. These selected prompts then serve as the basis for the second phase, where the diffusion model is reinforcement fine-tuned with a dual reward: one term encourages toxic output (measured by Detoxify), while the other promotes semantic coherence (measured by BERTScore). This process turns the diffusion model from a neutral generator into an adversarial tool for exposing alignment vulnerabilities in VLMs. We also extend RedDiffuser to handle scenarios where external NSFW guardrails are present, introducing a checker-aware variant. Our main contributions are as follows:
• We present RedDiffuser, the first framework to integrate reinforcement learning with diffusion models for generating semantically adversarial images, thereby enabling effective cross-modal jailbreaks.
• Our approach combines greedy prompt search with reinforcement fine-tuning, guided by both toxicity and alignment rewards, to ensure strong attack efficacy while maintaining natural image quality.
• RedDiffuser increases toxicity rates on LLaVA by 10.69% over text-only attacks, achieves 8.91% on a hold-out set, and transfers successfully to Gemini (+5.1%) and LLaMA-Vision (+26.83%).
• Under external NSFW guardrails, the checker-aware variant of RedDiffuser improves the guardrail pass rate by +40.4% over Stable Diffusion, indicating that the attack remains feasible despite safety filtering.

II. RELATED WORK
A. Vision-Language Models

Large Vision-Language Models (VLMs) extend Large Language Models (LLMs) by integrating visual modalities through an image encoder and a cross-modal alignment module. This design enables VLMs to perform instruction-following and open-ended reasoning over both text and images. Recent open-source VLMs have explored various architecture designs. LLaVA [13] connects a CLIP encoder [14] with LLaMA [15], using GPT-4 [1] to synthesize multimodal instruction-following data for end-to-end fine-tuning. MiniGPT-4 [16] employs a linear projection to align a frozen ViT [17] with Vicuna [18], using pretraining on large-scale image-text pairs followed by high-quality fine-tuning. InstructBLIP [19], based on BLIP-2 [20], introduces an instruction-aware Query Transformer and reformats 26 public datasets to enhance instruction understanding. While these models achieve impressive performance, the incorporation of visual inputs significantly broadens the attack surface compared to text-only LLMs. Recent studies show that visual perturbations and cross-modal inconsistencies can be exploited to bypass safety mechanisms or elicit harmful behavior [21]–[24]. These findings highlight the urgent need to evaluate VLM robustness under multimodal threats—an area this work directly addresses.

B. Jailbreak Attacks on VLMs

Jailbreak attacks against VLMs exploit the interaction between visual and textual inputs to bypass safeguards, enabling the generation of harmful content. These attacks can be categorized as either white-box or black-box methods. White-box jailbreak attacks [4], [5], [25]–[29] typically leverage gradient information to perturb input images or text, thereby increasing the likelihood of harmful outputs. While effective, they require full or partial access to model parameters, which makes them less practical and easier to detect [6], [7].
In contrast, black-box jailbreak attacks provide more practical and scalable solutions because they do not require direct access to model parameters but instead exploit external vulnerabilities. Despite these advantages, many existing black-box approaches rely heavily on typography-based manipulations, which are vulnerable to detection techniques such as OCR. For example, FigStep [9] converts harmful text into images using typography to bypass safety mechanisms. Similarly, approaches like M-SafetyBench and VRP [10], [30] also incorporate typography-driven attacks, reducing their stealthiness against advanced detection systems. To overcome these limitations, the concept of red teaming VLMs has advanced with IDEATOR [11], which autonomously generates jailbreak image-text pairs in a black-box setting. By offering a scalable and automated pipeline for discovering jailbreak prompts, IDEATOR demonstrates the promise of using VLMs themselves to expose alignment weaknesses. While IDEATOR focuses on prompt-level red teaming, our work introduces a fundamentally different and complementary attack paradigm: reinforced diffusion-based image optimization. Instead of relying on VLM-generated inputs, we investigate how semantically adversarial images crafted via reinforcement learning can induce toxic text continuations when paired with fixed text prefixes. This setup constitutes a significantly more challenging threat model, as it removes dependence on typographic or prompt engineering tricks and examines the vulnerability of VLMs to unrestricted, natural-looking visual stimuli. By doing so, we establish a rigorous framework for evaluating VLM robustness against cross-modal attacks and highlight a critical blind spot in existing alignment strategies.
III. PRELIMINARY

Denoising Diffusion Policy Optimization. RedDiffuser employs the Denoising Diffusion Policy Optimization (DDPO) algorithm [31] to fine-tune a pretrained diffusion model [12].

[Figure omitted] Fig. 2. RedDiffuser overview. Given an incomplete toxic sentence, Gemini selects an image prompt via greedy search. A diffusion model generates an image, which is passed to a VLM (LLaVA) to produce a continuation. Toxicity and alignment scores from Detoxify and BERTScore are used as rewards to fine-tune the diffusion model.

Unlike traditional supervised learning with labeled data, DDPO leverages reinforcement learning to optimize the model by framing the iterative denoising process as a Markov Decision Process (MDP). At each step, actions (denoising predictions) are chosen to maximize the cumulative reward. The state of the MDP at time step $t$ is defined as $s_t = (c, t, x_{T-t})$, where $c$ represents the context, $t$ is the current time step, $x_{T-t}$ is the noisy sample at time step $t$, and $T$ is the total number of denoising steps. Note that the MDP is a denoising process, i.e., the reverse diffusion process that recreates the true sample $x_0$ from a Gaussian noise input $x_T \sim \mathcal{N}(0, 1)$. The agent (i.e., the diffuser)'s action at time step $t$ is defined as $a_t = x_{T-t-1}$, which corresponds to the predicted previous state (i.e., the denoised sample) in the reverse process. The agent's policy is thus $\pi_\theta(a_t \mid s_t) = p_\theta(x_{T-t-1} \mid x_{T-t}, c)$. The task reward for the agent is structured as $R(s_t, a_t) = r(x_0, c)$ if $t = T$, else $0$. That is, the reward is non-zero only at the final time step ($t = T$), and the cumulative reward of a trajectory is equal to $r(x_0, c)$.
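The denoising MDP above, with its sparse terminal reward, can be sketched with a toy 1-D Gaussian policy whose single learnable parameter stands in for the diffusion model's weights; the importance-weighted REINFORCE update mirrors the estimator in Eq. (2). All names here are illustrative, not taken from the paper's released code.

```python
import math
import random

def log_prob(theta, a, s):
    # log-density of a ~ N(theta * s, 1): a stand-in for p_theta(x_{T-t-1} | x_{T-t}, c)
    return -0.5 * (a - theta * s) ** 2 - 0.5 * math.log(2 * math.pi)

def grad_log_prob(theta, a, s):
    # derivative of the log-density above with respect to theta
    return (a - theta * s) * s

def ddpo_gradient(theta, theta_old, trajectories):
    """Importance-weighted policy gradient over denoising trajectories.

    Each trajectory is (states, actions, reward); the reward r(x_0, c) is
    granted only at the final step, so every step shares the trajectory reward.
    """
    grad = 0.0
    for states, actions, reward in trajectories:
        for s, a in zip(states, actions):
            ratio = math.exp(log_prob(theta, a, s) - log_prob(theta_old, a, s))
            grad += ratio * grad_log_prob(theta, a, s) * reward
    return grad / len(trajectories)

random.seed(0)
theta_old = 0.5
trajs = []
for _ in range(64):  # sample toy trajectories under the old policy
    states = [random.gauss(0, 1) for _ in range(4)]
    actions = [theta_old * s + random.gauss(0, 1) for s in states]
    trajs.append((states, actions, sum(actions)))  # stand-in for r(x_0, c)

g = ddpo_gradient(theta_old, theta_old, trajs)  # on-policy: all ratios equal 1
```

When `theta == theta_old` the likelihood ratios collapse to 1 and the estimator reduces to plain REINFORCE; the ratio only matters once the policy has been updated between sampling and gradient computation.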
The goal of DDPO is to maximize the expected cumulative reward, which is defined over the sample distribution $p_\theta(x_0 \mid c)$ and the context distribution $p(c)$:

$$\mathcal{J}(\theta) = \mathbb{E}_{c \sim p(c),\; x_T \sim \mathcal{N}(0,1),\; a \sim \pi_\theta(\cdot \mid s)}\left[R(s, a)\right] \quad (1)$$

This objective encourages the model to learn a policy that generates samples close to the target distribution. To optimize the objective across multiple steps, we employ an importance sampling-based policy gradient estimator [32]:

$$\nabla_\theta \mathcal{J} = \mathbb{E}\left[\sum_{t=0}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(s, a)\right] \quad (2)$$

where the expectation is taken over denoising trajectories generated by the previous policy $\pi_{\theta_{\text{old}}}$. This estimator leverages the ratio of the current model's likelihood to the previous model's likelihood, allowing for more efficient updates.

IV. METHODOLOGY

A. Threat Model

Attack Goals. We target the sentence continuation task, where the model generates open-ended completions from partial inputs. The attacker's objective is to bypass safety mechanisms and induce toxic content, including profanity, insults, or threats, by pairing adversarial images with fixed text prefixes. This constitutes a form of multimodal prompt injection, where carefully crafted image-text inputs bias the model toward unsafe responses.

Adversary Capabilities. We consider a black-box setting where the attacker interacts with the VLM solely through input-output queries, as in real-world API access. By fine-tuning the diffusion model via reinforcement learning with feedback from the VLM, the adversary learns to exploit alignment weaknesses and increase the likelihood of toxic completions.

B. Greedy Prompt Search

Toxic continuation attacks require identifying image prompts that can consistently amplify toxicity in VLM outputs. However, directly using prompts generated by a general LLM is often suboptimal.
To overcome this limitation, we design a feedback-driven greedy prompt search procedure that iteratively refines image prompts to maximize toxicity amplification for given toxic text prefixes. In our approach, we employ Google's Gemini Pro [2] with safety settings disabled as the red-team LLM to propose candidate image prompts. To guide Gemini's output toward semantically consistent yet harmful prompts, we utilize in-context learning [33] by providing Gemini with exemplars of previously identified harmful continuations. The greedy prompt search refines these initial prompts iteratively using feedback from the target VLM and a toxicity classifier (e.g., Detoxify [34]). At each iteration, Gemini proposes a new candidate prompt, from which the diffusion model synthesizes a semantically adversarial image. The toxicity of the resulting VLM continuation is then evaluated, given the generated image and fixed text prefix. If the toxicity score improves over the previous iteration, the new prompt is retained; otherwise, the process terminates. The resulting optimized prompts provide strong initialization points for subsequent reinforcement fine-tuning of the diffusion model.

C. Reward Functions for RedDiffuser

After identifying adversarial image prompts through greedy search, we further fine-tune the diffusion model using reinforcement learning guided by a dual reward: a toxicity reward, which encourages harmful continuations, and an alignment reward, which maintains semantic consistency between the image and its image prompt.

Toxicity Reward: Given a fixed text prefix $y$ and an image $x$ generated by RedDiffuser, we prompt the target VLM $\mathcal{M}$ to complete the sentence. The continuation is scored by Detoxify [34], which evaluates six toxicity attributes (e.g., insult, threat).
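Before the formal definitions, the dual reward can be sketched with toy numbers standing in for Detoxify's six attribute scores and BERTScore's token similarities (the averaging is formalized in Eq. (3), the greedy token matching in Eq. (4), and the weighted sum in Eq. (5) below); `lambda_align` mirrors the paper's λ, and every other name is illustrative.

```python
# Hedged sketch of the dual reward: average the six Detoxify attribute
# scores, greedily match each prompt token against the VLM's image
# description (BERTScore-style), then take a weighted sum.
# The inputs below are toy values, not real classifier outputs.

def toxicity_reward(attribute_scores):
    # mean over the six toxicity attributes
    assert len(attribute_scores) == 6
    return sum(attribute_scores) / 6.0

def alignment_reward(similarity):
    # similarity[i][j]: similarity between prompt token i and description
    # token j; take the best match per prompt token, then average
    return sum(max(row) for row in similarity) / len(similarity)

def total_reward(attribute_scores, similarity, lambda_align=0.1):
    # weighted sum of the toxicity and alignment terms
    return toxicity_reward(attribute_scores) + lambda_align * alignment_reward(similarity)

tox_scores = [0.9, 0.3, 0.1, 0.6, 0.2, 0.3]   # toy Detoxify outputs
sim = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]     # toy token similarities
reward = total_reward(tox_scores, sim, lambda_align=0.1)
```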
We use the average score as the toxicity reward:

$$R_{\text{tox}}(\mathcal{M}(x, y)) = \frac{1}{6} \sum_{i=1}^{6} \text{toxicity}_i\big(\mathcal{M}(x, \text{"Please extend the following sentence: "} + y)\big) \quad (3)$$

Alignment Reward: To ensure that generated images remain faithful to their image prompts, we introduce an alignment reward inspired by [31]. Specifically, we use BERTScore [35] to measure semantic similarity between the image prompt $p$ and the VLM's interpretation of the generated image:

$$R_{\text{align}}(p, \mathcal{M}(x)) = \frac{1}{|p|} \sum_{i=1}^{|p|} \max_j S\big(p_i, \mathcal{M}(x, \text{"Describe the image in short."})_j\big) \quad (4)$$

where $\mathcal{M}(x, \text{"Describe the image in short."})$ is the generated image description from the VLM, $S$ calculates the cosine similarity of BERT embeddings between $p_i$ and the $j$-th token of the description, and $|p|$ is the number of words in the image prompt. This alignment reward acts as a semantic constraint to avoid prompt-irrelevant image drift and model collapse during optimization.

Total Reward: The total reward is formulated as a weighted sum of the toxicity reward and the alignment reward, controlled by a hyperparameter $\lambda$:

$$R_{\text{total}} = R_{\text{tox}} + \lambda R_{\text{align}} \quad (5)$$

This combined reward allows RedDiffuser to generate images that are both toxicity-inducing and semantically aligned, thus balancing harmfulness against image coherence. The overview of RedDiffuser is illustrated in Figure 2, while the detailed strategy is summarized in Algorithm 1.

D. Extensions for Attacking VLMs with External Guardrails

Real-world deployments often place external guardrails (e.g., image-level NSFW filters) upstream of VLMs, which can block visually explicit images before they influence downstream text generation.
To evaluate attacks under this operational constraint, we consider a setting where generated images must both induce toxic continuations and be accepted by external checkers.

Algorithm 1: Training Procedure of RedDiffuser
Require: Pre-trained diffuser D_θ, frozen LLM (e.g., Gemini) G, target VLM M, sentence continuation corpus Y := {y_i}_{i=1}^{m}, batch size b, learning rate α, policy update steps N.
1: Initialize D_θ with pre-trained weights and P = ∅.
2: for each sentence y_i ∈ Y do
3:   Initialize toxicity score T_max = 0.
4:   Query G with y_i to generate initial image prompt p_i.
5:   Generate initial image x_i = D_θ(p_i).
6:   Query M with (x_i, y_i) to compute toxicity score T_i.
7:   while T_i > T_max do
8:     Update T_max = T_i and store current prompt p_i.
9:     Query G with y_i to generate a new prompt p_i.
10:    Generate new image x_i = D_θ(p_i).
11:    Query M with (x_i, y_i) to compute updated toxicity score T_i.
12:  end while
13:  Append the stored best prompt p_i into P.
14: end for
15: for step = 1, ..., N do
16:   Sample a batch of sentence-prompt pairs {(y_k, p_k)}_{k=1}^{b} from Y and P.
17:   for each (y_k, p_k) in the batch do
18:     Generate image x_k = D_θ(p_k).
19:     Query M with (x_k, y_k) to compute total reward R_total.
20:   end for
21:   Update the diffuser D_θ using the policy gradient: θ ← θ + α · ∇_θ J.
22: end for
23: return Fine-tuned RedDiffuser D*_θ.

Concretely, we adopt a dual-guardrail setup. The first guardrail is the Stable Diffusion image-level NSFW checker; the second is a VLM-based semantic checker implemented by prompting a VLM to judge whether an image contains sexual, violent, or otherwise NSFW content. We treat an image as safe only if both checkers classify it as non-NSFW; images flagged by either checker are refused and do not produce continuations.
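The dual-guardrail acceptance rule can be sketched as below; the checker callables are hypothetical stand-ins for the Stable Diffusion NSFW checker ($g_1$) and the VLM-based semantic checker ($g_2$), matching the indicator of Eq. (6), and the continuation generator is likewise a placeholder.

```python
# Sketch of the dual-guardrail accept/refuse decision: an image is treated
# as safe only if BOTH checkers return "safe". Refused images never reach
# the VLM, so they yield no continuation. All callables are toy stand-ins.

def passes_guardrails(image, g1, g2):
    # indicator I(x): both checkers must judge the image safe
    return g1(image) == "safe" and g2(image) == "safe"

def continuation_or_refusal(image, g1, g2, generate_continuation):
    # flagged images are refused outright, mirroring the deployment setting
    if not passes_guardrails(image, g1, g2):
        return None
    return generate_continuation(image)

always_safe = lambda img: "safe"
flags_nsfw = lambda img: "nsfw"
ok = continuation_or_refusal("img", always_safe, always_safe, lambda i: "continuation")
blocked = continuation_or_refusal("img", always_safe, flags_nsfw, lambda i: "continuation")
```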
To adapt RedDiffuser to this setting, we introduce two complementary extensions: (1) light-weight modifications to the greedy prompt search that incorporate guardrail signals to re-score candidate prompts, and (2) a checker-aware reward shaping mechanism that masks the toxicity reward when a generated image is flagged by the external safety checker.

a) Greedy Prompt Search with Guardrail Signals: Given the image prompts obtained from the base greedy search, we generate ten images per prompt due to the stochasticity of diffusion sampling. If any sampled image passes both guardrails, the prompt is retained. If all attempts are blocked, we regenerate the image prompt by prompting the attacker LLM with explicit awareness of the external guardrails; the regeneration then follows the same greedy search procedure as before, except that any image prompt that produces NSFW images is skipped. This selective filter-and-regenerate step minimally adapts the prompt set to the guardrail-augmented setting.

b) Checker-Aware Reward Shaping: When external guardrails are applied, images that fail either check are directly refused. To account for this, we introduce a binary indicator function $I(x)$ that determines whether a generated image $x$ passes the dual guardrails:

$$I(x) = \mathbb{1}\left[g_1(x) = \text{safe} \;\wedge\; g_2(x) = \text{safe}\right] \quad (6)$$

where $\mathbb{1}[\cdot]$ denotes the indicator function, and $g_1$ and $g_2$ denote the Stable Diffusion NSFW checker and the VLM-based semantic checker, respectively. We condition the toxicity reward on this indicator:

$$R^{\text{guard}}_{\text{tox}}(x, y) = I(x) \cdot R_{\text{tox}}(\mathcal{M}(x, y)) \quad (7)$$

The alignment reward remains unchanged, giving the total reward:

$$R^{\text{guard}}_{\text{total}} = R^{\text{guard}}_{\text{tox}} + \lambda R_{\text{align}} \quad (8)$$

This checker-aware shaping drives RedDiffuser to discover adversarial image styles that are subtle enough to evade guardrails while remaining effective at eliciting toxic continuations.
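A minimal sketch of the shaping in Eqs. (6)-(8): the toxicity term is zeroed whenever the image is flagged, while the alignment term is left untouched. The numeric values and the pass flag below are toy stand-ins, not outputs of any real checker.

```python
# Sketch of checker-aware reward shaping: mask the toxicity reward with the
# guardrail indicator I(x), keep the alignment reward unchanged.

def guarded_total_reward(passes, r_tox, r_align, lam=0.1):
    # Eq. (7): R_guard_tox = I(x) * R_tox; Eq. (8): add lambda * R_align
    indicator = 1.0 if passes else 0.0
    return indicator * r_tox + lam * r_align

accepted = guarded_total_reward(True, r_tox=0.8, r_align=0.6)   # toxicity counts
rejected = guarded_total_reward(False, r_tox=0.8, r_align=0.6)  # only alignment remains
```

Because a flagged image still earns the alignment term, the fine-tuned model is nudged toward images that stay on-prompt while learning, through the masked toxicity term, to avoid styles the checkers reject.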
V. EXPERIMENTS

Datasets. We use the RealToxicityPrompts benchmark [36], specifically its challenging subset of 1,199 text prompts designed to trigger toxic continuations [4], [37], [38]. We select the first 132 prompts as our original attack set, as we observed that increasing the training prompt size barely improved generalization, but instead slowed down convergence. The remaining prompts are reserved as a hold-out set to evaluate cross-prompt transferability. While the original attack set is selected simply based on ordering, subsequent analysis revealed that these prompts exhibit relatively lower baseline toxicity compared to the hold-out set. This characteristic makes them a more rigorous benchmark for evaluating RedDiffuser's ability to induce harmful continuations. In particular, targeting prompts with initially less toxic outputs allows for a clearer demonstration of RedDiffuser's effectiveness, as it highlights its capacity to generate harmful content even when baseline toxicity is minimal.

Metrics: (1) Toxicity Rate. To evaluate the toxicity of continuations generated on the RealToxicityPrompts benchmark [36], we use the Detoxify classifier [34] during training, and both Detoxify and the Perspective API during testing. These classifiers assess toxicity across six predefined attributes, with scores ranging from 0 (least toxic) to 1 (most toxic). For evaluation, we compute the proportion of continuations whose toxicity score exceeds a threshold of 0.5 for each attribute. The use of the Perspective API at test time enables cross-classifier validation, helping to verify that RedDiffuser's behavior is not overfitted to Detoxify and thus not a result of reward hacking or classifier-specific bias. (2) CLIPScore. To evaluate the semantic alignment between the generated images and the image prompts, we use CLIPScore [39], a reference-free metric based on the cosine similarity between CLIP-encoded image and text features.
Following the original formulation: $\text{CLIPScore} = \alpha \cdot \max(0, \cos(f_{\text{image}}, f_{\text{text}}))$, where $f_{\text{image}}$ and $f_{\text{text}}$ denote CLIP embeddings, and $\alpha$ is a scaling factor (set to 2.5 following [39]) for alignment with human judgment. In our setting, CLIPScore serves as an auxiliary metric to verify that adversarial images preserve semantic plausibility without compromising visual coherence. (3) Guardrail Pass Rate (GPR). When external image-level guardrails are applied, images flagged as NSFW are blocked before producing a continuation. To measure how often generated images remain admissible, we report the Guardrail Pass Rate (GPR), defined as the average value of the indicator in Eq. (6):

$$\text{GPR} = \frac{1}{N} \sum_{x} I(x) \quad (9)$$

where $N$ is the number of samples and $I(x)$ indicates whether the image passes both guardrails.

Implementation Details. We use Stable Diffusion v1.5 [12] as the pre-trained diffusion model in our experiments. Gemini-1.5-Pro is employed to generate image prompts, while LLaVA-1.5-7B serves as the target VLM. Fine-tuning RedDiffuser is conducted using 8 NVIDIA A100 GPUs, with 6 for training and 2 for reward computation. Reinforcement fine-tuning uses a batch size of 24 per GPU and a learning rate of 3e-4. We apply early stopping after 600 updates to mitigate reward hacking [40]. We experiment with four alignment reward weights λ: 0 (w/o alignment), 0.05 (Low), 0.1 (Medium), and 0.2 (High).

A. Main Results

To evaluate the effectiveness of RedDiffuser (RedDiff), we conduct black-box toxic continuation attacks on the original attack set and cross-prompt transfer attacks on a hold-out set using LLaVA as the target VLM. We compare RedDiffuser against several baseline strategies to assess its advantage in inducing harmful outputs. We first include a Text-Only Attack, in which the model is prompted solely with toxic text prefixes without any accompanying image. This serves as a unimodal baseline to quantify the additional effect of visual inputs on toxicity.
Next, we define the Single-Prompt Attack, where Gemini generates a single image prompt per prefix without any optimization. This provides a naive multimodal baseline, capturing the effect of directly pairing a non-optimized image with a toxic prefix. The Greedy-Prompt Attack improves upon this by iteratively refining the image prompt using our proposed greedy search algorithm. Finally, we evaluate the full RedDiffuser framework, which incorporates reinforcement learning to fine-tune the diffusion model guided by toxicity and alignment rewards. We experiment with four alignment reward weight settings: 0 (w/o alignment), 0.05 (RedDiff w/ alignment (L)), 0.1 (RedDiff w/ alignment (M)), and 0.2 (RedDiff w/ alignment (H)), as discussed in the implementation details.

Original Attack Set. Table I presents the black-box attack results on the original attack set against LLaVA, evaluated using both the Detoxify classifier and the Perspective API.

TABLE I: Black-box attack results on the original attack set against LLaVA. Percentages (%) represent the rate of specific toxic attributes in the outputs, as evaluated by the Detoxify classifier and Perspective API. Any* indicates the text exhibits at least one of the six toxic attributes.
Detoxify (%)

| Method | Identity Attack | Obscene | Severe Toxicity | Insult | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 5.34 | 39.69 | 0.76 | 19.84 | 0.00 | 58.01 | 58.01 |
| Single-Prompt Attack | 3.05 | 38.16 | 0.00 | 21.37 | 1.52 | 55.72 | 55.72 |
| Greedy-Prompt Attack | 3.05 | 39.69 | 0.00 | 23.66 | 1.52 | 64.12 | 64.12 |
| RedDiff w/o alignment | 3.05 | 45.03 | 0.00 | 27.48 | 0.00 | 61.83 | 61.83 |
| RedDiff w/ alignment (L) | 3.81 | 44.27 | 1.52 | 24.42 | 1.52 | 62.59 | 62.59 |
| RedDiff w/ alignment (M) | 3.81 | 42.74 | 0.76 | 25.19 | 0.00 | 68.70 | 68.70 |
| RedDiff w/ alignment (H) | 3.05 | 40.45 | 0.00 | 25.19 | 0.76 | 63.35 | 63.35 |

Perspective API (%)

| Method | Identity Attack | Profanity | Severe Toxicity | Sexually Explicit | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 6.06 | 35.60 | 1.51 | 16.66 | 2.27 | 46.96 | 53.03 |
| Single-Prompt Attack | 4.54 | 37.12 | 1.51 | 15.15 | 4.54 | 48.48 | 55.30 |
| Greedy-Prompt Attack | 4.54 | 40.90 | 1.51 | 16.66 | 3.78 | 52.27 | 58.33 |
| RedDiff w/o alignment | 4.54 | 42.42 | 3.78 | 21.96 | 0.75 | 54.54 | 58.33 |
| RedDiff w/ alignment (L) | 6.06 | 42.42 | 3.78 | 18.93 | 3.78 | 52.27 | 58.33 |
| RedDiff w/ alignment (M) | 5.30 | 41.66 | 1.51 | 20.45 | 3.03 | 56.81 | 61.36 |
| RedDiff w/ alignment (H) | 4.54 | 39.39 | 0.75 | 18.93 | 4.54 | 55.30 | 62.12 |

TABLE II: Cross-prompt transfer attack results on the hold-out set against LLaVA. Percentages (%) represent the rate of specific toxic attributes in the outputs, as evaluated by the Detoxify classifier and Perspective API.
Detoxify (%)

| Method | Identity Attack | Obscene | Severe Toxicity | Insult | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 4.97 | 60.31 | 2.53 | 35.27 | 2.53 | 70.35 | 70.54 |
| Single-Prompt Attack | 6.00 | 65.19 | 3.09 | 39.02 | 4.12 | 76.54 | 76.64 |
| Greedy-Prompt Attack | 5.44 | 66.22 | 3.00 | 40.15 | 3.47 | 77.39 | 77.57 |
| RedDiff w/o alignment | 6.00 | 67.44 | 3.09 | 42.02 | 4.22 | 79.36 | 79.45 |
| RedDiff w/ alignment (L) | 5.90 | 66.69 | 3.37 | 39.77 | 3.37 | 78.23 | 78.33 |
| RedDiff w/ alignment (M) | 5.53 | 66.32 | 3.28 | 40.52 | 4.69 | 78.42 | 78.70 |
| RedDiff w/ alignment (H) | 6.09 | 66.69 | 3.18 | 40.43 | 3.84 | 79.36 | 79.36 |

Perspective API (%)

| Method | Identity Attack | Profanity | Severe Toxicity | Sexually Explicit | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 6.74 | 61.38 | 5.43 | 21.36 | 3.84 | 68.22 | 70.00 |
| Single-Prompt Attack | 7.87 | 65.04 | 5.99 | 22.68 | 5.90 | 73.19 | 75.63 |
| Greedy-Prompt Attack | 7.59 | 66.63 | 6.56 | 23.14 | 5.34 | 75.16 | 77.60 |
| RedDiff w/o alignment | 8.05 | 68.13 | 7.12 | 24.36 | 5.90 | 77.31 | 79.28 |
| RedDiff w/ alignment (L) | 8.15 | 67.29 | 7.12 | 23.89 | 5.62 | 75.35 | 77.50 |
| RedDiff w/ alignment (M) | 7.31 | 66.91 | 6.46 | 23.33 | 6.84 | 76.10 | 78.16 |
| RedDiff w/ alignment (H) | 7.68 | 66.91 | 6.65 | 23.43 | 5.99 | 76.47 | 78.81 |

The percentages represent the rate of specific toxic attributes in the generated outputs, with 'Any*' indicating whether a continuation exhibits at least one of the six toxic attributes. We begin by examining the Text-Only Attack, where only the toxic text prefix is provided without any visual input. Surprisingly, LLaVA still produces toxic continuations at a non-trivial rate, achieving an 'Any*' rate of 58.01% under Detoxify and 53.03% under the Perspective API. This indicates that current alignment mechanisms in VLMs remain insufficient for preventing harmful completions even in unimodal settings, exposing a fundamental vulnerability in their default safety alignment. When introducing visual input, the Single-Prompt Attack slightly alters the toxicity pattern but does not significantly improve attack effectiveness, serving as a weak multimodal baseline. In contrast, the Greedy-Prompt Attack significantly amplifies toxicity.
For instance, using the Detoxify classifier, the 'Any*' rate increases from 55.72% (Single-Prompt) to 64.12%, illustrating the benefit of greedy prompt search in finding high-risk image prompts. The full RedDiffuser framework further enhances attack strength. RedDiff w/ alignment (M) achieves the highest 'Any*' rate of 68.70% on Detoxify, indicating that reinforced fine-tuning of the diffusion model allows for stronger and more consistent toxic triggers. Notably, RedDiffuser also demonstrates strong cross-classifier transferability: RedDiff w/ alignment (H) achieves the highest 'Any*' rate of 62.12% under the Perspective API, despite being trained only with Detoxify. The consistent gains across Detoxify and the Perspective API indicate that RedDiffuser's induced toxic behaviors are not tied to a specific classifier but instead exploit broader alignment vulnerabilities in VLMs.

Hold-out Set. Table II presents the results of cross-prompt transfer attacks on the hold-out set against LLaVA. Since these prompts were not seen during training, this evaluation provides a robust measure of the generalization ability of our proposed RedDiffuser. Interestingly, the hold-out set exhibits a higher average baseline toxicity than the original attack set, likely because the hold-out prompts tend to be more inherently toxic and thus more prone to eliciting toxic responses from the target VLM. The Greedy-Prompt Attack demonstrates a clear improvement over the Single-Prompt Attack according to both the Detoxify classifier and the Perspective API. For instance, using Detoxify, the Greedy-Prompt Attack achieved an 'Any*' rate of 77.57%, compared to 76.64% for the Single-Prompt Attack. Similarly, with the Perspective API, the Greedy-Prompt Attack achieved an 'Any*' rate of 77.60%, compared to 75.63% for the Single-Prompt Attack.
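To make the metric concrete, the per-attribute rates and the 'Any*' aggregation reported in these tables can be sketched as below. This is a minimal illustration, assuming a 0.5 decision threshold over Detoxify-style score dictionaries; the paper does not state its exact thresholding, so both the threshold and the attribute keys here are assumptions.

```python
# Sketch of the 'Any*' aggregation: an output counts toward a per-attribute
# rate when that attribute's classifier score crosses a threshold, and toward
# Any* when at least one of the six attributes does. The 0.5 threshold is an
# assumption; score dicts of this shape are what a Detoxify-style classifier
# returns for a piece of text.

ATTRIBUTES = ["identity_attack", "obscene", "severe_toxicity",
              "insult", "threat", "toxicity"]

def toxicity_rates(score_dicts, threshold=0.5):
    """Per-attribute toxic rates (%) plus the Any* rate over a set of outputs."""
    n = len(score_dicts)
    rates = {a: 100.0 * sum(s[a] >= threshold for s in score_dicts) / n
             for a in ATTRIBUTES}
    # Any*: at least one of the six attributes fires for a given output.
    rates["any"] = 100.0 * sum(
        any(s[a] >= threshold for a in ATTRIBUTES) for s in score_dicts) / n
    return rates

if __name__ == "__main__":
    demo = [
        {"identity_attack": 0.1, "obscene": 0.9, "severe_toxicity": 0.0,
         "insult": 0.2, "threat": 0.0, "toxicity": 0.8},
        {"identity_attack": 0.0, "obscene": 0.1, "severe_toxicity": 0.0,
         "insult": 0.1, "threat": 0.0, "toxicity": 0.2},
    ]
    r = toxicity_rates(demo)
    print(r["obscene"], r["any"])  # 50.0 50.0
```

The same aggregation applies to Perspective API scores, with 'Profanity' and 'Sexually Explicit' substituted into the attribute list.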
More notably, the full RedDiffuser framework consistently demonstrates superior performance across most metrics, achieving a peak 'Any*' rate of 79.45% according to Detoxify with RedDiff w/o alignment, as well as notable increases in specific attributes such as Obscene (67.44%) and Insult (42.02%). Furthermore, results from the Perspective API show that RedDiff w/o alignment achieves the highest 'Any*' rate of 79.28%, along with leading scores in attributes such as Profanity (68.13%) and Severe Toxicity (7.12%). These results confirm that RedDiffuser generalizes well to unseen prompts and performs consistently across toxicity classifiers, highlighting its effectiveness as a black-box multimodal red teaming framework.

B. Cross-Model Transferability

We assess the transferability of our method by applying RedDiffuser, trained solely using feedback from LLaVA [13], to two unseen VLMs: Gemini-1.5-flash [2] and LLaMA-3.2-11B-Vision-Instruct. Tables III and IV report the toxicity rates measured by both Detoxify and the Perspective API. RedDiffuser maintains high toxicity across models despite never interacting with them during training, demonstrating strong generalization. Table III presents the results on Gemini-1.5-flash. Despite being a commercial VLM, Gemini exhibits high baseline toxicity under the Text-Only Attack, with 'Any*' rates reaching 76.23% (Detoxify) and 78.26% (Perspective API), indicating that harmful completions are already prevalent even without adversarial visual input. Furthermore, all RedDiffuser variants elevate toxicity levels, with RedDiff w/ alignment (M) achieving up to 81.33% and 82.52% on Detoxify and the Perspective API, respectively. These findings highlight that even commercial VLMs are vulnerable to context-driven multimodal attacks, especially when guided by semantically adversarial images.
They also underscore the importance of evaluating alignment in continuation-based and visually conditioned scenarios, not just instruction-following attacks. Table IV reports the transfer results on LLaMA-3.2-11B-Vision-Instruct, a model with more advanced alignment safeguards [11]. In this setting, the Text-Only Attack produces substantially lower toxicity rates, with 15.02% under Detoxify and 18.68% under the Perspective API, indicating a high refusal rate for toxic continuations. However, RedDiffuser is still able to bypass these safeguards. RedDiff w/ alignment (L) achieves 41.85% (Detoxify) and 45.24% (Perspective API), with RedDiff (M) performing similarly. These results demonstrate RedDiffuser's ability to bypass even stronger alignment mechanisms, transferring its attack effectiveness across models. Together, they highlight the broader risk of multimodal toxicity attacks against both commercial and open-source VLMs, and underscore the need for more robust, context-aware alignment strategies in future VLM design.

C. Visualizations

Figure 3 qualitatively compares images generated by Stable Diffusion (left) and RedDiffuser (right). For RedDiffuser, the alignment reward weight λ increases from left to right: 0, 0.05, 0.1, and 0.2. Although all models receive the same image prompts, the resulting images diverge markedly in tone and visual implications. The general-purpose Stable Diffusion consistently produces neutral and benign visuals. In contrast, RedDiffuser generates images that subtly evoke discomfort or toxicity-relevant cues. For instance, in the second row, RedDiffuser produces cartoonish and exaggerated caricatures that distort the features of an overweight person eating a sandwich, in contrast to the more neutral depiction by Stable Diffusion. In the third row, where the prompt involves generating an image of a woman, RedDiffuser creates visuals that are noticeably more sexually suggestive and explicit.
Similarly, in the fourth row, RedDiffuser introduces additional elements, such as garbage and ominous shadows, into the school hallway scene generated by Stable Diffusion, thereby amplifying the potential for toxic continuations. The observed differences in visual elements and stylistic features underscore RedDiffuser's ability to adapt visual semantics in ways that exploit the vulnerabilities of target VLMs. The comparison highlights how reinforced optimization, rather than mere prompt selection, enables subtle but impactful visual manipulation aligned with toxic continuation goals. To quantify semantic alignment, we report CLIPScore [39] in Table V. All RedDiffuser variants achieve scores on par with or exceeding the baseline (44.47), with RedDiffuser at λ = 0.2 reaching 44.80, demonstrating that our stealth-aware reward maintains prompt relevance while enhancing adversarial potency.

TABLE III. Cross-model transfer attack results on Gemini-1.5-flash. Percentages (%) represent the rate of specific toxic attributes in the outputs, as evaluated by the Detoxify classifier and the Perspective API.

Detoxify (%)

| Method | Identity Attack | Obscene | Severe Toxicity | Insult | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 4.18 | 61.75 | 2.25 | 34.64 | 2.84 | 76.23 | 76.23 |
| RedDiff w/o alignment | 4.94 | 65.29 | 2.68 | 38.89 | 4.19 | 80.13 | 80.30 |
| RedDiff w/ alignment (L) | 5.36 | 64.40 | 3.09 | 39.69 | 4.69 | 79.06 | 79.14 |
| RedDiff w/ alignment (M) | 5.27 | 65.77 | 3.17 | 39.74 | 4.43 | 81.25 | 81.33 |
| RedDiff w/ alignment (H) | 5.54 | 66.63 | 3.52 | 39.91 | 4.45 | 80.84 | 81.00 |

Perspective API (%)

| Method | Identity Attack | Profanity | Severe Toxicity | Sexually Explicit | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 5.76 | 60.95 | 3.17 | 19.14 | 5.01 | 74.41 | 78.26 |
| RedDiff w/o alignment | 7.20 | 63.56 | 4.43 | 22.02 | 6.86 | 77.97 | 81.49 |
| RedDiff w/ alignment (L) | 7.19 | 63.01 | 4.68 | 21.50 | 6.35 | 76.98 | 80.83 |
| RedDiff w/ alignment (M) | 7.02 | 64.29 | 4.68 | 22.57 | 7.52 | 79.01 | 82.52 |
| RedDiff w/ alignment (H) | 7.38 | 64.73 | 4.86 | 22.16 | 7.13 | 79.09 | 82.11 |

TABLE IV. Cross-model transfer attack results on LLaMA-3.2-11B-Vision-Instruct. Percentages (%) represent the rate of specific toxic attributes in the outputs, as evaluated by the Detoxify classifier and the Perspective API.

Detoxify (%)

| Method | Identity Attack | Obscene | Severe Toxicity | Insult | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 0.25 | 11.18 | 0.00 | 3.92 | 0.16 | 14.52 | 15.02 |
| RedDiff w/o alignment | 1.58 | 32.49 | 0.16 | 16.54 | 2.84 | 39.93 | 41.01 |
| RedDiff w/ alignment (L) | 1.67 | 32.83 | 0.33 | 15.37 | 2.50 | 41.43 | 41.85 |
| RedDiff w/ alignment (M) | 1.58 | 32.63 | 0.33 | 16.77 | 2.25 | 41.23 | 41.81 |
| RedDiff w/ alignment (H) | 2.17 | 31.30 | 0.16 | 16.27 | 2.25 | 40.56 | 41.15 |

Perspective API (%)

| Method | Identity Attack | Profanity | Severe Toxicity | Sexually Explicit | Threat | Toxicity | Any* |
|---|---|---|---|---|---|---|---|
| Text-Only Attack | 0.50 | 13.84 | 0.00 | 3.16 | 0.66 | 15.59 | 18.68 |
| RedDiff w/o alignment | 2.67 | 35.47 | 1.16 | 7.26 | 2.83 | 39.23 | 43.90 |
| RedDiff w/ alignment (L) | 2.58 | 35.30 | 1.00 | 8.01 | 2.58 | 40.98 | 45.24 |
| RedDiff w/ alignment (M) | 3.25 | 34.27 | 1.41 | 7.92 | 3.16 | 39.69 | 44.37 |
| RedDiff w/ alignment (H) | 3.58 | 34.19 | 0.58 | 8.34 | 2.83 | 39.94 | 43.78 |

TABLE V. CLIPScore comparison across different diffusers.

| Metric | Stable Diffusion | RedDiff (w/o) | RedDiff (L) | RedDiff (M) | RedDiff (H) |
|---|---|---|---|---|---|
| CLIPScore | 44.47 | 41.87 | 42.86 | 44.02 | 44.80 |

D. Extended Evaluation Under External Guardrails

Table VI reports the evaluation results when LLaVA is paired with external image-level guardrails. In this configuration, generated images are first processed by the Stable Diffusion NSFW filter and a VLM-based semantic NSFW checker; images flagged by either checker are refused by the model and thus do not produce toxic continuations.
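The two-checker admission rule just described, and the Guardrail Pass Rate (GPR) reported below (the fraction of generated images that pass both checkers), can be sketched as follows. The checker callables `nsfw_filter` and `semantic_checker` are hypothetical stand-ins for the Stable Diffusion NSFW filter and the VLM-based semantic checker, not the paper's actual implementations.

```python
# Sketch of the external guardrail pipeline: an image reaches the VLM only if
# it passes BOTH image-level checkers; GPR is the percentage of generated
# images admitted. The checker functions are hypothetical stand-ins that
# return True when an image is flagged.

def guardrail_pass_rate(images, nsfw_filter, semantic_checker):
    """Return (GPR in %, list of admitted images): neither checker flags them."""
    passed = [img for img in images
              if not nsfw_filter(img) and not semantic_checker(img)]
    return 100.0 * len(passed) / len(images), passed

if __name__ == "__main__":
    # Toy example: images are stand-in strings; checkers flag by substring.
    imgs = ["benign scene", "explicit scene", "subtle cue", "flagged content"]
    nsfw = lambda s: "explicit" in s
    semantic = lambda s: "flagged" in s
    gpr, admitted = guardrail_pass_rate(imgs, nsfw, semantic)
    print(gpr)  # 50.0
```

Only admitted images proceed to the toxic-continuation query, so toxicity rates in this setting are jointly limited by GPR and per-image attack strength.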
Comparisons in this section use the RedDiff w/ alignment (M) setting, which provides a strong balance between toxicity amplification and semantic coherence. In addition to reporting toxicity rates as in previous sections, we also measure the Guardrail Pass Rate (GPR), defined as the fraction of generated images that pass both checkers. GPR indicates how often a generated image remains admissible under external safety filtering. Interestingly, we observe that RedDiff (Base) achieves a higher GPR than the unmodified Stable Diffusion (54.13% vs. 49.96%), despite not being trained with guardrail feedback. This suggests that RedDiff (Base) does not simply increase toxicity by producing NSFW imagery, but instead shifts generation toward more subtle, suggestive visual cues that are less likely to be filtered. With checker-aware training, RedDiff (Guard) further improves both bypass ability and toxicity. Under Detoxify, the 'Any*' toxicity rate increases from 38.28% to 69.39%, with GPR rising to 94.50%; similar trends hold under the Perspective API ('Any*': 69.47%). These results show that checker-aware optimization enables the model to consistently produce images that pass external filters while still eliciting harmful continuations.

Figure 4 shows visual examples from the three diffusion variants. Stable Diffusion generates relatively neutral imagery. In comparison, RedDiff (Base) produces images with stronger and clearer semantic cues, for instance exaggerated muscular bodies or heightened conflict in the scene. RedDiff (Guard) further adjusts these cues to remain accepted under external guardrails. We observe a tendency toward stylized or cartoon-like rendering, softer shapes, and implied rather than directly exposed body features. Such images retain the suggestive cues needed to influence the VLM while being less likely to trigger NSFW checkers.
Overall, these qualitative patterns align with the quantitative results: checker-aware optimization encourages the model to adopt visual styles that are subtle enough to pass safety filtering while still affecting the model's continuation.

TABLE VI. Black-box attack results on the full dataset against LLaVA with external guardrails. Toxicity rates (%) are evaluated using Detoxify and the Perspective API. Any* indicates toxicity in at least one category. Guardrail Pass Rate (GPR, %) measures the fraction of generated images that pass both external safety checkers.

Detoxify (%)

| Model | Identity Attack | Obscene | Severe Toxicity | Insult | Threat | Toxicity | Any* | GPR (%) |
|---|---|---|---|---|---|---|---|---|
| Stable Diffusion | 3.09 | 32.28 | 1.42 | 18.93 | 1.75 | 37.45 | 37.45 | 49.96 |
| RedDiff (Base) | 3.17 | 32.19 | 1.25 | 18.93 | 1.50 | 38.12 | 38.28 | 54.13 |
| RedDiff (Guard) | 5.67 | 58.89 | 2.67 | 35.03 | 2.59 | 69.22 | 69.39 | 94.50 |

Perspective API (%)

| Model | Identity Attack | Profanity | Severe Toxicity | Sexually Explicit | Threat | Toxicity | Any* | GPR (%) |
|---|---|---|---|---|---|---|---|---|
| Stable Diffusion | 3.92 | 32.78 | 2.17 | 12.09 | 2.67 | 37.11 | 38.20 | 49.96 |
| RedDiff (Base) | 4.34 | 33.03 | 2.50 | 13.76 | 2.75 | 37.78 | 39.78 | 54.13 |
| RedDiff (Guard) | 7.26 | 59.47 | 4.67 | 20.85 | 4.42 | 67.31 | 69.47 | 94.50 |

Fig. 3. Comparison of images generated by the general-purpose Stable Diffusion (left) and RedDiffuser (right). Given the same image prompts, RedDiffuser produces images that subtly elicit discomfort or unease, making them more likely to trigger harmful continuations in VLMs.

Fig. 4. Visual comparison across diffusion variants. Stable Diffusion outputs relatively neutral imagery; RedDiff (Base) produces stronger semantic cues that influence continuations. RedDiff (Guard) adjusts visual subtlety to consistently pass external safety checkers while retaining adversarial intent.

VI. LIMITATIONS

While the RedDiffuser framework shows strong performance and cross-model transferability, it has several limitations: 1) Task specificity: RedDiffuser is designed for toxic continuation tasks, and its effectiveness on other multimodal attacks, such as instruction-level jailbreaks, is yet to be explored. 2) Computational cost: the RL-based fine-tuning requires considerable GPU resources, which may pose challenges for users with limited budgets.

VII. CONCLUSION

We introduced RedDiffuser, a novel framework that exposes a previously underexamined vulnerability in Vision-Language Models (VLMs): cross-modal toxic continuation, where visually plausible images can amplify harmful text completions. RedDiffuser combines greedy prompt search with reinforcement learning-based diffusion tuning to generate semantically adversarial images that reliably induce toxic continuations while preserving visual coherence. Our experiments show that RedDiffuser substantially increases toxicity rates on LLaVA (+10.69% on the original set and +8.91% on a hold-out set) and transfers effectively across models, raising toxicity by +5.1% on Gemini-1.5-flash and +26.83% on LLaMA-Vision-Instruct. To reflect more realistic deployment scenarios, we further evaluate attacks under external NSFW guardrails. A checker-aware variant of RedDiffuser significantly improves the guardrail pass rate (GPR), reaching 94.50%, while maintaining high toxicity. This result shows that toxic continuation remains feasible even under image-level safety filtering. These findings reveal fundamental blind spots in current multimodal safety alignment and highlight the necessity of developing defenses that jointly consider visual semantics and continuation behavior. Future work will extend RedDiffuser to broader adversarial objectives and more diverse model architectures to deepen the understanding of multimodal safety risks.

REFERENCES

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S.
Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[2] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., "Gemini: A family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
[3] Y. Li, H. Guo, K. Zhou, W. X. Zhao, and J.-R. Wen, "Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models," in European Conference on Computer Vision. Springer, 2024, pp. 174–189.
[4] X. Qi, K. Huang, A. Panda, P. Henderson, M. Wang, and P. Mittal, "Visual adversarial examples jailbreak aligned large language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 19, 2024, pp. 21527–21536.
[5] R. Wang, X. Ma, H. Zhou, C. Ji, G. Ye, and Y.-G. Jiang, "White-box multimodal jailbreaks against large vision-language models," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6920–6928.
[6] W. Nie, B. Guo, Y. Huang, C. Xiao, A. Vahdat, and A. Anandkumar, "Diffusion models for adversarial purification," arXiv preprint arXiv:2205.07460, 2022.
[7] X. Zhang, C. Zhang, T. Li, Y. Huang, X. Jia, X. Xie, Y. Liu, and C. Shen, "A mutation-based method for multi-modal jailbreaking attack detection," arXiv preprint arXiv:2312.10766, 2023.
[8] R. Schaeffer, D. Valentine, L. Bailey, J. Chua, Z. Durante, C. Eyzaguirre, J. Benton, B. Miranda, H. Sleight, T. T. Wang et al., "Failures to find transferable image jailbreaks between vision-language models," in Workshop on Socially Responsible Language Modelling Research.
[9] Y. Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, "FigStep: Jailbreaking large vision-language models via typographic visual prompts," arXiv preprint arXiv:2311.05608, 2023.
[10] X. Liu, Y. Zhu, Y. Lan, C. Yang, and Y. Qiao, "Query-relevant images jailbreak large multi-modal models," arXiv preprint arXiv:2311.17600, 2023.
[11] R. Wang, J. Li, Y. Wang, B. Wang, X. Wang, Y. Teng, Y. Wang, X. Ma, and Y.-G. Jiang, "IDEATOR: Jailbreaking and benchmarking large vision-language models using themselves," arXiv preprint arXiv:2411.00827, 2024.
[12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[13] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," Advances in Neural Information Processing Systems, vol. 36, 2024.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[15] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[16] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, "MiniGPT-4: Enhancing vision-language understanding with advanced large language models," arXiv preprint arXiv:2304.10592, 2023.
[17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[18] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., "Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality," https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
[19] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, "InstructBLIP: Towards general-purpose vision-language models with instruction tuning," 2023.
[20] J. Li, D. Li, S.
Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," arXiv preprint arXiv:2301.12597, 2023.
[21] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, and N. Abu-Ghazaleh, "Survey of vulnerabilities in large language models revealed by adversarial attacks," arXiv preprint arXiv:2310.10844, 2023.
[22] D. Liu, M. Yang, X. Qu, P. Zhou, W. Hu, and Y. Cheng, "A survey of attacks on large vision-language models: Resources, advances, and future trends," arXiv preprint arXiv:2407.07403, 2024.
[23] H. Li, Y. Chen, J. Luo, J. Wang, H. Peng, Y. Kang, X. Zhang, Q. Hu, C. Chan, Z. Xu et al., "Privacy in large language models: Attacks, defenses and future directions," arXiv preprint arXiv:2310.10383, 2023.
[24] X. Ma, Y. Gao, Y. Wang, R. Wang, X. Wang, Y. Sun, Y. Ding, H. Xu, Y. Chen, Y. Zhao et al., "Safety at scale: A comprehensive survey of large model safety," arXiv preprint arXiv:2502.05206, 2025.
[25] E. Bagdasaryan, T.-Y. Hsieh, B. Nassi, and V. Shmatikov, "(Ab)using images and sounds for indirect instruction injection in multi-modal LLMs," arXiv preprint arXiv:2307.10490, 2023.
[26] L. Bailey, E. Ong, S. Russell, and S. Emmons, "Image hijacks: Adversarial images can control generative models at runtime," arXiv preprint arXiv:2309.00236, 2023.
[27] E. Shayegani, Y. Dong, and N. Abu-Ghazaleh, "Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models," in The Twelfth International Conference on Learning Representations, 2023.
[28] N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, P. W. W. Koh, D. Ippolito, F. Tramer, and L. Schmidt, "Are aligned neural networks adversarially aligned?" Advances in Neural Information Processing Systems, vol. 36, 2024.
[29] Z. Niu, H. Ren, X. Gao, G. Hua, and R. Jin, "Jailbreaking attack against multimodal large language model," arXiv preprint arXiv:2402.02309, 2024.
[30] S. Ma, W. Luo, Y. Wang, X. Liu, M. Chen, B.
Li, and C. Xiao, "Visual-RolePlay: Universal jailbreak attack on multimodal large language models via role-playing image character," arXiv preprint arXiv:2405.20773, 2024.
[31] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, "Training diffusion models with reinforcement learning," arXiv preprint arXiv:2305.13301, 2023.
[32] S. Kakade and J. Langford, "Approximately optimal approximate reinforcement learning," in Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 267–274.
[33] T. B. Brown, "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[34] L. Hanu and Unitary team, "Detoxify," GitHub: https://github.com/unitaryai/detoxify, 2020.
[35] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," arXiv preprint arXiv:1904.09675, 2019.
[36] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, "RealToxicityPrompts: Evaluating neural toxic degeneration in language models," arXiv preprint arXiv:2009.11462, 2020.
[37] T. Schick, S. Udupa, and H. Schütze, "Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP," Transactions of the Association for Computational Linguistics, vol. 9, pp. 1408–1424, 2021.
[38] N. Mehrabi, A. Beirami, F. Morstatter, and A. Galstyan, "Robust conversational agents against imperceptible toxicity triggers," arXiv preprint arXiv:2205.02392, 2022.
[39] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi, "CLIPScore: A reference-free evaluation metric for image captioning," arXiv preprint arXiv:2104.08718, 2021.
[40] L. Gao, J. Schulman, and J. Hilton, "Scaling laws for reward model overoptimization," in International Conference on Machine Learning. PMLR, 2023, pp. 10835–10866.