
Paper deep dive

Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

Year: 2026 · Venue: arXiv preprint · Area: Multimodal Safety · Type: Empirical · Embeddings: 60

Models: Idefics3-8B-Llama3, InstructBLIP-Vicuna-7B, LLaVA-1.5-13B-hf, LLaVA-1.5-7B-hf

Abstract

With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore the challenges mentioned above and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · multimodal-safety (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/11/2026, 1:02:52 AM

Summary

The paper introduces CR-VLM, a framework for configurable refusal in Vision Language Models (VLMs) using activation steering. It addresses the limitations of one-size-fits-all refusal strategies by employing teacher-forced activation extraction, a gating mechanism to mitigate over-refusal, and a counterfactual vision enhancement module to align visual representations with refusal requirements.
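As a rough illustration of the mechanism described above (not the authors' implementation), gated activation steering amounts to adding a learned offset to a hidden state, scaled by a sigmoid gate that should stay near zero for in-scope queries. The function and argument names below are hypothetical:

```python
import numpy as np

def gated_steer(h, delta, gate_logit):
    """Illustrative gated activation edit: h' = h + sigmoid(rho) * delta.
    `delta` stands in for a learned refusal offset and `gate_logit` (rho)
    for the learned gate; both names are hypothetical."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid gate in (0, 1)
    return h + gate * delta
```

With a strongly negative `gate_logit` the hidden state passes through nearly unchanged (acceptance for in-scope queries); with a positive one the refusal offset dominates.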

Entities (5)

CR-VLM · framework · 100%
Vision Language Models · model-architecture · 100%
Activation Steering · technique · 95%
Counterfactual Vision Enhancement · module · 95%
Teacher-forced Activation Extraction · technique · 95%

Relation Signals (3)

CR-VLM applies to Vision Language Models

confidence 100% · Configurable Refusal in VLMs (CR-VLM)

CR-VLM uses Activation Steering

confidence 100% · CR-VLM, a robust and efficient approach for configurable refusal based on activation steering.

CR-VLM includes Counterfactual Vision Enhancement

confidence 95% · CR-VLM consists of three integrated components: ... (3) designing a counterfactual vision enhancement module

Cypher Suggestions (2)

Find all techniques used by the CR-VLM framework. · confidence 90% · unvalidated

MATCH (f:Framework {name: 'CR-VLM'})-[:USES|INCLUDES]->(t:Technique) RETURN t.name

Identify the target model architecture for the CR-VLM framework. · confidence 90% · unvalidated

MATCH (f:Framework {name: 'CR-VLM'})-[:APPLIES_TO]->(m:ModelArchitecture) RETURN m.name

Full Text

59,271 characters extracted from source content.


Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

Jiaxi Yang 1⋆, Shicheng Liu 1⋆†, Yuchen Yang 1, Dongwon Lee 1
⋆ Equal contribution. † Work done as an intern at Penn State.

Abstract

With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore the challenges mentioned above and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.

1. Introduction

With the rapid advancement of VLMs, ensuring their safe and responsible use has become a critical concern (Zhang et al., 2024b; Liu et al., 2024b). One aspect of this safety framework is the refusal mechanism, which enables VLMs to decline answering queries that may lead to harmful, unethical, or inappropriate outcomes (Shao et al., 2024). This mechanism is essential for preventing misuse and ensuring that VLMs operate within acceptable ethical boundaries (Li et al., 2025b; Wen et al., 2025). Although substantial progress has been made in designing
1 The Pennsylvania State University, USA. Correspondence to: Dongwon Lee <dul13@psu.edu>. Preprint. February 10, 2026.

Figure 1. Motivating example of a personalized refusal mechanism in VLMs, in which the constraint for configurable refusal is whether bitcoin trading is legal. (The figure contrasts a standard VLM, which always answers "How to trade bitcoin to make money?", with a configurable VLM that answers or refuses depending on the constraint.)

refusal mechanisms, most current approaches rely on a one-size-fits-all strategy (Zhang et al., 2024a; Lei et al., 2025) and generally lack the adaptability required to accommodate the diverse needs and contexts of different users. For instance, as shown in Figure 1, differences in user backgrounds, intentions, and local regulations lead to varying requirements for VLM refusal mechanisms. This highlights the need for configurable refusal strategies that refuse queries violating the specified constraints while answering those satisfying them.

However, only a few studies have investigated configurable refusal mechanisms in Large Language Models (LLMs) (Lee et al., 2024; Lei et al., 2025), while VLMs remain largely unexplored. In LLMs, (Lei et al., 2025) introduces an early benchmark for evaluating controllable refusal, and (Lee et al., 2024) explores controllable refusal via hard steering vectors induced from system prompts; however, the resulting refusal signals are weak and the method does not readily transfer to VLMs without additional safety alignment. For VLMs, an intuitive approach is system prompt engineering, where user-specific constraints are explicitly stated (Lei et al., 2025; Chen et al., 2025).
However, this approach is highly sensitive to linguistic variations and is easily bypassed (Wang et al., 2025; Peng et al., 2025), leading to unstable and fragile refusal behavior (Lee et al., 2024; Neumann et al., 2025). Another approach is to fine-tune the model on user-specific constraints, yet such training-based personalization is expensive and difficult to scale across diverse users and evolving multimodal contexts (Ji et al., 2024). This leads us to the question: How can we develop a robust and efficient configurable refusal approach for VLMs that can adapt to diverse user needs?

[arXiv:2602.07013v1 [cs.CV] 31 Jan 2026. Submission and Formatting Instructions for ICML 2026.]

In this work, we present the first systematic study of Configurable Refusal in VLMs and propose CR-VLM, a robust and efficient framework based on activation steering. To obtain reliable refusal signals, we adopt teacher-forced activation extraction to obtain strong and low-noise refusal activations, even for VLMs without inherent safety alignment. To mitigate over-refusal, we design a selective configurable mechanism that preserves responses for in-scope queries while maintaining refusal for out-of-scope inputs. Moreover, we incorporate a vision-aware calibration objective to align activation shifts with visual evidence, enabling more accurate refusal behavior. Our contributions are summarized as follows:

• To the best of our knowledge, CR-VLM is the first to systematically investigate configurable refusal in VLMs, laying the groundwork for user-adaptive safety alignment.
• A novel end-to-end approach is proposed, which is robust and efficient, by leveraging activation steering to achieve configurable refusal.
• Extensive experiments conducted on various datasets across multiple VLMs demonstrate the effectiveness and robustness of our approach using various evaluation metrics.

2. Related Work

2.1.
Large Vision Language Models

The rapid advancements of LLMs (Liu et al., 2023b; Zhao et al., 2023; Ni et al., 2025) stimulate great developments in VLMs (Yin et al., 2024; Liu et al., 2023a). By aligning the visual and textual representation spaces (Radford et al., 2021), VLMs demonstrate strong performance in various tasks, such as multi-modal dialogue, visual reasoning, and visual grounding. However, similar to LLMs, VLMs also critically need safety alignment to mitigate potential risks and ensure responsible deployment (Wen et al., 2025; Vatsa et al., 2024). Our work focuses on the configurable refusal mechanism in VLMs rather than LLMs.

2.2. Activation Steering

Activation steering refers to manipulating the hidden states of a model to control its behaviors (Rimsky et al., 2024; Arditi et al., 2024). Recent works present its effectiveness across various application domains, such as reducing hallucinations (Zhou et al., 2024), controlling personas (Chen et al., 2025), and enhancing safety alignment (Cao et al., 2025; Dabas et al., 2025). In this paper, our work shows the feasibility of activation steering to induce refusal behaviors in a configurable manner.

2.3. Refusal in Foundation Models

Safety alignment in LLMs aims to prevent models from responding with content that may harm individuals and society. This includes refusing to generate content that is illegal, harmful, or violates ethical guidelines (Wen et al., 2025). Recently, substantial research has focused on enhancing the refusal capabilities of LLMs via alignment techniques such as supervised fine-tuning (Bianchi et al., 2023) and RLHF (Dai et al., 2024). Additional strategies include prompt engineering (Zhou et al., 2024; Wei et al., 2023), probing the internal states of LLMs (Wang et al., 2024a; Bhardwaj et al., 2024; Liu et al., 2025), and consistency-based approaches (Yuan et al., 2024), among others.
Meanwhile, other studies have explored over-refusal issues in LLMs (Wang et al., 2024b; Dabas et al., 2025; Cao et al., 2025; Cui et al., 2025). These methods all lead to generic refusal responses that lack user-specific control. Though several recent works (Chen et al., 2025; Sivakumar et al., 2025) control model behavior via activation steering, these approaches typically apply a global activation steering intervention, which can induce over-refusal even for queries within the intended constraints. Therefore, they remain orthogonal to our goal of constraint-aware configurable refusal. The limited existing studies of configurable refusal (Lei et al., 2025) first propose a benchmark in LLMs. The most relevant work, CAST (Lee et al., 2024), achieves controllable refusal in LLMs by applying a hard steering vector extracted through a refusal-inducing system prompt. However, this method focuses only on LLMs and depends heavily on the inherent safety alignment of LLMs. When directly applied to VLMs that lack safety alignment, the refusal signal becomes too weak to extract a reliable activation direction. Moreover, their approach relies on a hard intervention by a steering vector and is not an end-to-end solution, limiting its applicability.

3. Notations

This section clarifies the core notations used throughout the remainder of the paper. Consider an LVLM $M(\cdot)$ that is fed with $x = (I, T)$, consisting of an image $I$ and a text query $T$, to generate a response $R = (r_1, r_2, \ldots, r_n)$, denoted as $R = M(x)$. As the input passes through the model's transformer architecture, it generates a series of intermediate activations, denoted $h^l_i(x)$ at position $i$ in layer $l$. For in-constraint inputs $x_{in} \in D_{in}$, we denote their activations as $h_{in}$, and for out-of-constraint inputs $x_{out} \in D_{out}$, we denote their activations as $h_{out}$.
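The per-layer notation $h^l_i(x)$ can be made concrete with a toy forward pass that records each layer's hidden state. The two-layer tanh stack below is purely illustrative; in a real VLM these states would be read out of the transformer layers (e.g., via framework forward hooks) at a chosen token position $i$:

```python
import numpy as np

def forward_with_activations(x, layers):
    """Run x through a stack of (W, b) layers and record the hidden
    state after each one -- a toy analogue of the h^l(x) notation."""
    activations = []
    h = x
    for W, b in layers:
        h = np.tanh(W @ h + b)   # hidden state at this layer
        activations.append(h.copy())
    return h, activations
```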
Figure 2. Framework of CR-VLM, a configurable refusal approach for VLMs. (The figure shows original activations from layers 1..N mapped to calibrated activations through a learnable module with a sigmoid gate ρ, separately for in-domain and outer-domain inputs.)

4. Methodology

The overall pipeline of our proposed configurable refusal approach is presented in Figure 2. The core idea is to train a lightweight model that predicts calibrated activations from the original activations of the VLM. To capture stronger refusal signals, we utilize teacher-forced activation extraction in subsection 4.1. The loss design for inducing refusal behavior is presented in subsection 4.2. The over-refusal mitigation mechanism for in-scope samples is introduced in subsection 4.3. We further incorporate counterfactual vision enhancement, which encourages the model to attend to visual information during the refusal process, as described in subsection 4.4.

4.1. Teacher-Forced Activation Extraction

Prior works extract behavioral directions by contrasting activations obtained with and without a system prompt (Sivakumar et al., 2025; Chen et al., 2025; Lei et al., 2025; Lee et al., 2024). However, directly applying this strategy to configurable refusal scenarios introduces reliability issues. In particular, for VLMs that lack strong safety alignment, system prompts specifying refusal constraints often fail to trigger an explicit refusal, resulting in weak activation shifts. For example, given an out-of-scope query such as "How can I trade bitcoin to make money?", a model still attempts to provide an answer even when preceded by a system prompt: "Bitcoin trade questions are not allowed. Refuse to answer." In such cases, the activation difference between runs with and without the system prompt primarily reflects prompt-induced semantic variation rather than refusal behavior. To address this limitation, we adopt teacher-forced activation extraction (Rimsky et al., 2024; Cao et al., 2025).
Here, the "teacher" is not a separate model but a predefined refusal response appended to each out-of-scope sample. Concretely, for an out-of-scope input $x_{out} = (I, T)$, we form the sequence $(x_{out}, r_{ref})$, where $r_{ref}$ is a fixed refusal string (e.g., "Sorry,"). For the same bitcoin-trading query, the model is thus fed the user question followed by the refusal response, forcing it to process the refusal during the forward pass. This guarantees that the activation difference corresponds to the computation of refusal behavior rather than an uncertain prompt-induced semantic variation.

We denote the resulting refusal activation representation as $h^l_i(x_{out}, r_{ref})$. As a reference non-refusal activation representation, we use $h^l_i(x_{out})$ obtained from the original input alone. Since individual activations are noisy, we aggregate them across the dataset to obtain stable mean representations. Specifically, we compute the dataset-level mean activations:

$$v^{acc}_{i,l} = \frac{1}{|D|} \sum_{x_{out} \in D} h^l_i(x_{out}), \qquad v^{ref}_{i,l} = \frac{1}{|D|} \sum_{x_{out} \in D} h^l_i(x_{out}, r_{ref}), \quad (1)$$

where $v^{ref}_{i,l}$ and $v^{acc}_{i,l}$ are the mean refusal and mean non-refusal activation representations, respectively. To obtain a clean and directional refusal representation, we then compute a steering vector via difference-in-means (Belrose, 2024):

$$s^l_i = v^{ref}_{i,l} - v^{acc}_{i,l}. \quad (2)$$

Then, we can obtain the target refusal activations of out-of-scope samples by

$$h^{l,target}_{out,i} \leftarrow h^l_{out,i} + \lambda \cdot s^l_j, \quad (3)$$

where $j$ is the last token position of the predefined refusal response.

4.2. Refusal Behavior Induction

In our approach, we employ a lightweight model, named the calibrated model, to predict the calibrated activations. While teacher-forced extraction provides reliable refusal activations offline, these signals are not directly available during inference.
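Equations (1)–(3) amount to a standard difference-in-means computation; a minimal NumPy sketch (array shapes and names are illustrative, not the authors' code):

```python
import numpy as np

def refusal_steering_vector(h_plain, h_forced):
    """Eqs. (1)-(2): difference-in-means refusal direction for one
    (position i, layer l). h_plain holds h_i^l(x_out) for each sample,
    h_forced holds h_i^l(x_out, r_ref) with the refusal string
    teacher-forced; both are (num_samples, hidden_dim) arrays."""
    v_acc = h_plain.mean(axis=0)    # mean non-refusal activation
    v_ref = h_forced.mean(axis=0)   # mean refusal activation
    return v_ref - v_acc            # s_i^l

def target_refusal_activation(h_out, s, lam=1.0):
    """Eq. (3): shift an out-of-scope activation along the steering
    direction with strength lambda."""
    return h_out + lam * s
```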
The calibrated model serves as a bridge that translates this offline supervision into input-conditioned activations at runtime, enabling refusal control without modifying the backbone VLMs. These calibrated hidden states are then used to intervene in specific layers in the forward pass, thereby inducing the desired refusal behavior.

The predicted calibrated activations $h'_{out}$ for out-of-scope samples should be as similar as possible to the ground-truth target activations $h^{target}_{out}$. We therefore align both the direction and the magnitude of $h'_{out}$ with $h^{target}_{out}$ and design the loss function as follows:

$$L_{out} = L_{out,dir} + L_{out,len}. \quad (4)$$

For direction alignment, we utilize cosine similarity:

$$L_{out,dir} = 1 - \cos(h'_{out}, h^{target}_{out}). \quad (5)$$

For magnitude alignment, we leverage an MSE loss:

$$L_{out,len} = \mathrm{MSE}(h'_{out}, h^{target}_{out}). \quad (6)$$

4.3. Mitigation of Over-Refusal

Simply optimizing the loss in Equation (4) does not fully capture the constraint-aware refusal behavior we seek. As illustrated in Table 1, this kind of global calibration may lead to excessive adjustment of in-scope queries, shifting the representation space of the model uniformly toward the refusal direction. Consequently, not only out-of-scope queries but also those within scope are pushed toward refusal, leading to undesirable over-refusal (Dabas et al., 2025; Cao et al., 2025).

Table 1. Refusal rates (%) for in-scope and out-of-scope queries without (w/o) and with (w) applying the over-refusal mitigation mechanism, as well as under an ablated version of the mitigation for out-of-constraint queries.

                Original   w/o mitigation   w/ mitigation
In-scope        0.0        97.0 (↑97.0)     0.5 (↓96.5)
Out-of-scope    0.0        95.0 (↑95.0)     95.0 (≈)

Motivated by this observation, we also design a gate loss function for queries within scope. The objective is to preserve the activations of in-scope samples, preventing them from being altered by the learned refusal direction. We thus incorporate an MSE reconstruction term that encourages $h'_{in}$ to remain close to $h_{in}$. In addition, we employ a gating mechanism and regularize its probability $\rho$ toward zero for in-constraint queries. We further include a gate supervision term $g$, which softly regularizes the gating behavior to follow the desired activation pattern under different scope conditions. The gate loss function for mitigating over-refusal is formulated as follows:

$$L_{in} = \mathrm{MSE}(h'_{in}, h_{in}) + \rho + g. \quad (7)$$

To prevent the calibrated model from collapsing into a single global update direction and to encourage scope-conditional separation in the learned intervention vectors, we introduce an orthogonality regularizer between the average update directions induced by out-of-scope and in-scope samples. Let $\Delta(h) \in \mathbb{R}^d$ denote the predicted update direction (i.e., $\Delta = h' - h$) for an input activation $h$. For each mini-batch, samples are partitioned into two subsets $B_{out}$ and $B_{in}$. We compute the mean update direction for each subset as:

$$\Delta_{out} = \frac{1}{|B_{out}|} \sum_{i \in B_{out}} \Delta(h_i), \qquad \Delta_{in} = \frac{1}{|B_{in}|} \sum_{i \in B_{in}} \Delta(h_i). \quad (8)$$

The orthogonality loss is then defined as the squared cosine similarity between these two mean directions:

$$L_{ortho} = \left( \frac{\Delta_{out}^{\top} \Delta_{in}}{\lVert \Delta_{out} \rVert_2 \, \lVert \Delta_{in} \rVert_2} \right)^2. \quad (9)$$

By minimizing $L_{ortho}$, the model is encouraged to learn decoupled intervention directions for out-of-scope refusal behavior and in-scope response preservation, thereby reducing the risk of over-refusal.

4.4. Counterfactual Vision Enhancement

Although steering the textual activations at later token positions provides a more globally informed refusal signal, this intervention can dominate the multimodal representation and suppress visual contributions, causing the model to underutilize visual cues. As a result, the vision modality contributes only weakly to the overall refusal behavior.
This makes existing LLM-side steering mechanisms ineffective in cases where the refusal decision should be driven by visual evidence rather than textual semantics. To address this, we require a reference that characterizes how visual representations should causally change when a refusal is warranted based solely on visual input. We therefore introduce a vision-only activation shift as a counterfactual reference, which isolates the contribution of the vision modality by excluding textual influence. Concretely, for each image $I_i$, we obtain the pure visual refusal activation shift by contrasting its activations under out-of-constraint and in-constraint images:

$$r_i = h(I_{i,out}, r_{ref}) - h(I_{i,in}), \quad (10)$$

which reflects how the visual activations should change when the constraint requires a refusal. During training, our model predicts calibrated activations $h'_{in}$ and $h'_{out}$. Accordingly, we obtain the predicted activation shift:

$$r'_i = h'(I_{i,out}, T_{i,out}, r_{ref}) - h'(I_{i,in}, T_{i,in}). \quad (11)$$

Our objective is to regularize the predicted activation shift $r'_i$ to align with the vision-derived counterfactual shift $r_i$, thereby ensuring that the refusal behavior learned by the model remains consistent with how visual representations should causally change when a refusal is required. Thus, we have the vision loss:

$$L_{vision} = 1 - \cos\!\left( \sum_i r'_i, \; \sum_i r_i \right). \quad (12)$$

This counterfactual vision enhancement forces the model to internalize how visual representations should shift when a refusal is required. Consequently, VLMs can correctly modulate refusal behaviors based on visual evidence.

The full loss function combines Equations (4), (7), and (12):

$$L = \lambda_1 \cdot L_{out} + \lambda_2 \cdot L_{in} + \lambda_3 \cdot L_{ortho} + \lambda_4 \cdot L_{vision}. \quad (13)$$
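Putting the loss terms together, here is a minimal NumPy sketch of Equations (4)–(6), (8)–(9), and (12); batch handling and the learnable gate of Eq. (7) are simplified away, so this illustrates the formulas rather than the authors' training code:

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def loss_out(h_pred, h_target):
    """Eqs. (4)-(6): direction (cosine) + magnitude (MSE) alignment."""
    l_dir = 1.0 - _cos(h_pred, h_target)
    l_len = float(np.mean((h_pred - h_target) ** 2))
    return l_dir + l_len

def loss_ortho(delta_out, delta_in):
    """Eqs. (8)-(9): squared cosine of the mean update directions of
    out-of-scope and in-scope batches; zero when they are orthogonal."""
    return _cos(delta_out.mean(axis=0), delta_in.mean(axis=0)) ** 2

def loss_vision(r_pred, r_ref):
    """Eq. (12): 1 - cos between summed predicted and vision-only shifts."""
    return 1.0 - _cos(r_pred.sum(axis=0), r_ref.sum(axis=0))
```

The total objective of Eq. (13) is then the weighted sum λ1·L_out + λ2·L_in + λ3·L_ortho + λ4·L_vision, with L_in (Eq. (7)) additionally contributing an MSE reconstruction term and the gate regularizers for in-scope samples.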
5. Experiments

In this experiment section, we conduct experiments to answer the following Evaluation Questions (EQ):

• EQ1: To what extent does our proposed approach achieve alignment with the intended refusal scope under configurable settings?
• EQ2: Does steering refusal on out-of-scope queries come at the cost of inducing over-refusal on in-scope queries? And how well does our approach balance this trade-off?
• EQ3: How does each component of our approach contribute to configurable refusal alignment and over-refusal mitigation?

5.1. Experiment Settings

Models. We conduct our experiments on several widely used open-source VLMs, including LLaVA-1.5-7B-hf (Liu et al., 2023a), LLaVA-1.5-13B-hf (Liu et al., 2023a), Idefics3-8B-Llama3 (Laurençon et al., 2024), and InstructBLIP-Vicuna-7B (Dai et al., 2023). These models represent diverse architectures and size scales, allowing us to assess the universality of our configurable refusal approach.

Datasets. To the best of our knowledge, we are the first to study configurable refusal in VLMs, an area where no suitable benchmarks currently exist. Following the idea of (Lei et al., 2025) in LLMs, which designates certain subjects in MMLU (Hendrycks et al., 2021) as in-scope samples and others as out-of-scope samples, we evaluate our method on two multimodal datasets: ScienceQA (Lu et al., 2022) and MMMU (Yue et al., 2024). For each dataset, we randomly sample 200 in-scope and 200 out-of-scope samples for training, and the same number for testing. To ensure sufficient data coverage, we use three subjects in ScienceQA, including biology, physics, and geography, as our in-scope dataset. Similarly, we select several in-scope subdomains in MMMU (e.g., math, art theory, and geography) as our in-scope dataset.

Baselines.
We compare our proposed method with three representative refusal steering baselines: 1) Prompt-based Steering (Gu et al., 2023), which injects handcrafted configurable refusal constraints into the system prompt; 2) Fine-Tuned Refusal Adapter (FRA), which uses parameter-efficient updates (e.g., LoRA (Hu et al., 2022)) to adjust refusal behavior; 3) Persona (Chen et al., 2025), which utilizes activation vectors extracted via a refusal-oriented system prompt to control model behavior. In our experiments, we adopt this method, originally designed for LLMs, and extend it to LVLM scenarios by applying the same prompt-based activation extraction procedure to multimodal inputs.

5.2. (EQ1) Configurable Refusal Alignment

In this section, we evaluate whether our proposed method enables VLMs to refuse queries that fall outside the scope.

Evaluation Metrics. To quantify how well the model refuses to answer out-of-scope questions, we adopt refusal rate as the primary evaluation metric, defined as the proportion of responses that contain explicit refusal semantics. A higher refusal rate indicates better performance. To verify the correctness and robustness of our measurements, we employ both human evaluation and LLM-as-Judgment (Li et al., 2025a), using the high-performance model DeepSeek-V3 (Liu et al., 2024a) to automatically identify refusal semantics following the template in Appendix G.

(Figure 3, panels (a)–(d): per-subject results on Geography (ScienceQA), Biology, Physics, Geography (MMMU), Math, and Art Theory for LLaVA-1.5-7B-hf, LLaVA-1.5-13B-hf, Idefics3, and InstructBLIP, comparing CR-VLM, Fine-tuning, Prompt-based, and Persona.)

Figure 3.
(EQ2) MB-Score trade-off between refusal rate on out-of-scope queries and over-refusal rate on in-scope queries.

Table 2. (EQ1) Refusal rate (%) for out-of-scope test samples. Numbers in green show the best-performing results compared to baselines; underlined values denote the second best. Columns: LLaVA-1.5-7b / LLaVA-1.5-13b / Idefics3 / InstructBLIP, first under Human Evaluation, then under LLM-as-Judgement.

ScienceQA, Biology:
  Prompt-based  0.0 / 0.0 / 2.5 / 1.5      ·  1.5 / 0.0 / 3.0 / 3.0
  Fine-tuning   54.0 / 25.5 / 31.0 / 90.5  ·  54.0 / 26.0 / 34.0 / 90.0
  Persona       0.0 / 0.0 / 0.0 / 2.5      ·  0.0 / 0.5 / 0.5 / 3.0
  CR-VLM        95.0 / 94.0 / 94.5 / 94.5  ·  95.5 / 94.0 / 93.5 / 96.0
ScienceQA, Physics:
  Prompt-based  0.0 / 0.0 / 1.0 / 0.0      ·  0.0 / 0.0 / 1.0 / 2.0
  Fine-tuning   9.0 / 43.0 / 7.5 / 97.5    ·  9.0 / 43.0 / 8.0 / 98.0
  Persona       0.0 / 8.0 / 0.0 / 0.0      ·  0.0 / 10.5 / 0.0 / 1.5
  CR-VLM        99.5 / 92.0 / 83.5 / 98.5  ·  99.5 / 92.5 / 84.5 / 97.5
ScienceQA, Geography:
  Prompt-based  0.0 / 0.0 / 2.5 / 0.0       ·  2.0 / 0.0 / 3.5 / 1.0
  Fine-tuning   64.0 / 87.5 / 11.0 / 100.0  ·  64.0 / 87.5 / 14.5 / 100.0
  Persona       0.0 / 0.0 / 0.0 / 8.5       ·  0.0 / 0.5 / 2.5 / 8.0
  CR-VLM        93.0 / 89.5 / 92.0 / 94.0   ·  96.0 / 90.0 / 90.5 / 95.0
MMMU, Math:
  Prompt-based  18.0 / 4.0 / 2.0 / 6.5     ·  19.0 / 5.5 / 8.0 / 10.0
  Fine-tuning   23.5 / 26.0 / 8.0 / 88.5   ·  25.0 / 27.0 / 11.0 / 88.5
  Persona       31.0 / 27.0 / 2.5 / 24.5   ·  32.0 / 27.0 / 5.5 / 21.5
  CR-VLM        97.0 / 92.0 / 82.5 / 83.0  ·  97.5 / 93.5 / 88.5 / 81.0
MMMU, Art Theory:
  Prompt-based  28.5 / 7.5 / 10.5 / 7.5    ·  29.5 / 9.0 / 15.0 / 11.5
  Fine-tuning   92.0 / 93.5 / 90.5 / 98.0  ·  93.5 / 93.5 / 91.5 / 98.0
  Persona       33.5 / 31.5 / 7.0 / 26.0   ·  34.0 / 33.5 / 13.0 / 26.0
  CR-VLM        97.5 / 96.5 / 94.5 / 85.0  ·  98.0 / 96.5 / 95.5 / 78.5
MMMU, Geography:
  Prompt-based  21.0 / 4.0 / 5.0 / 15.0    ·  23.5 / 5.5 / 5.5 / 15.0
  Fine-tuning   32.5 / 37.0 / 8.5 / 72.5   ·  33.5 / 38.0 / 10.0 / 73.0
  Persona       33.0 / 28.0 / 5.0 / 28.5   ·  35.0 / 30.5 / 9.5 / 29.5
  CR-VLM        93.0 / 91.5 / 79.5 / 85.0  ·  92.0 / 89.0 / 81.0 / 81.0

Result Analysis. As shown in Table 2, our method achieves higher refusal rates on out-of-scope samples than the baselines in most cases across both datasets and subjects. This demonstrates that the activations calibrated by our method are substantially more effective for inducing configurable refusal behavior in various VLMs. The Persona baseline consistently underperforms across models and subjects, yielding near-zero refusal rates in some settings.
This is because injecting refusal behavior solely through system prompts relies heavily on the built-in safety alignment of the model. When such alignment is weak, the induced refusal signal is insufficient, failing to extract strong refusal-related activations and leading to ineffective refusal behavior steering. While fine-tuning methods are able to induce refusal behavior to some extent, our approach outperforms fine-tuning in 41 out of 48 experimental settings, with fine-tuning performing better in only a small fraction of cases. This suggests that the approach becomes unreliable when only 200 fine-tuning samples are available, as in our experimental setting. Besides, prompt-based steering performs extremely poorly, often yielding near-zero refusal rates, indicating that simple prompt-based manipulation is insufficient to induce reliable configurable refusal in VLMs.

5.3. (EQ2) Over-Refusal Mitigation

Here, we evaluate whether our proposed method avoids over-refusal for in-scope queries while maintaining a high refusal rate for out-of-scope queries.

Over-Refusal Rate on In-scope Queries.

Evaluation Metrics. Similar to the evaluation protocol in EQ1, we use refusal rate to quantify over-refusal (i.e., the proportion of in-scope queries that are falsely refused). In contrast to EQ1, a lower refusal rate indicates better performance in this setting. As before, we evaluate using both human evaluation and LLM-as-Judgment.

Result Analysis. As shown in Table 3, our method exhibits consistently low over-refusal rates across all models and datasets within various subject domains, indicating that it effectively preserves acceptance for in-scope queries while maintaining strong refusal behavior on out-of-scope queries. In contrast, the Persona baseline performs poorly in some subjects, displaying significant over-refusal and frequently rejecting in-scope queries.
These results demonstrate that our calibrated activation approach achieves more reliable configurable refusal behavior.

Trade-off between Refusal and Over-Refusal.

Evaluation Metrics. To explicitly capture the trade-off between the effective Refusal Rate (R) and the Over-Refusal Rate (ORR), we utilize the MB-Score (Simhi et al., 2025), a composite metric that jointly considers refusal performance on out-of-scope queries and acceptance behavior on in-scope queries:

$$\text{MB-Score} = \frac{2 \cdot R \cdot (1 - ORR)}{R + (1 - ORR)}. \quad (14)$$

Result Analysis. Figure 3 summarizes the trade-off using MB-Score, which jointly reflects refusal effectiveness and over-refusal. Across all subjects of both datasets, our method consistently achieves higher MB-Scores than the baselines, indicating the best performance for configurable refusal. In contrast, the fine-tuning and Persona methods suffer from degraded MB-Scores due to excessive over-refusal on in-scope queries or ineffective refusal steering on out-of-scope queries.

Table 3. (EQ2) Over-refusal rate (%) for in-scope test samples. CR-VLM exhibits a very low over-refusal rate compared to the other three baselines. Columns: LLaVA-1.5-7b / LLaVA-1.5-13b / Idefics3 / InstructBLIP, first under Human Evaluation, then under LLM-as-Judgement.

ScienceQA, Biology:
  System Prompt 0.0 / 0.0 / 1.5 / 0.0   ·  0.0 / 0.0 / 1.5 / 0.0
  Fine-tuning   0.0 / 0.0 / 0.0 / 3.0   ·  0.0 / 0.5 / 0.5 / 5.5
  Persona       0.0 / 7.5 / 0.0 / 1.0   ·  0.0 / 9.0 / 3.0 / 2.0
  CR-VLM        0.5 / 0.0 / 0.0 / 0.5   ·  1.0 / 0.5 / 1.0 / 0.5
ScienceQA, Physics:
  System Prompt 0.0 / 0.0 / 0.0 / 0.0   ·  1.5 / 0.5 / 1.0 / 0.0
  Fine-tuning   0.0 / 0.0 / 0.0 / 2.5   ·  1.5 / 0.0 / 1.5 / 2.5
  Persona       0.0 / 15.0 / 1.5 / 0.0  ·  0.0 / 15.5 / 2.5 / 0.0
  CR-VLM        0.0 / 0.0 / 0.0 / 0.0   ·  1.0 / 0.0 / 2.5 / 1.0
ScienceQA, Geography:
  System Prompt 0.0 / 0.0 / 0.0 / 0.5   ·  0.0 / 0.0 / 0.0 / 1.0
  Fine-tuning   0.0 / 0.0 / 0.0 / 2.0   ·  0.0 / 0.0 / 0.0 / 2.0
  Persona       0.0 / 0.0 / 0.0 / 0.0   ·  0.0 / 0.0 / 0.0 / 1.0
  CR-VLM        0.0 / 0.0 / 0.0 / 0.0   ·  0.0 / 0.0 / 0.0 / 1.5
MMMU, Math:
  Prompt-based  35.0 / 1.5 / 1.5 / 9.0    ·  37.0 / 3.0 / 4.5 / 11.5
  Fine-tuning   2.0 / 0.5 / 0.0 / 24.0    ·  3.5 / 1.5 / 3.0 / 24.0
  Persona       34.5 / 24.5 / 9.0 / 21.5  ·  35.5 / 22.0 / 13.0 / 21.0
  CR-VLM        9.5 / 7.0 / 6.0 / 10.5    ·  12.0 / 7.5 / 11.5 / 12.5
MMMU, Art Theory:
  Prompt-based  4.5 / 1.5 / 2.0 / 6.0     ·  5.0 / 1.5 / 2.5 / 9.5
  Fine-tuning   0.0 / 0.0 / 0.5 / 46.5    ·  0.0 / 0.0 / 0.5 / 46.5
  Persona       27.5 / 11.5 / 7.0 / 22.0  ·  27.0 / 11.0 / 2.0 / 21.0
  CR-VLM        1.5 / 0.0 / 0.0 / 2.5     ·  2.0 / 2.5 / 1.5 / 3.5
MMMU, Geography:
  Prompt-based  32.0 / 10.5 / 3.0 / 5.0   ·  32.5 / 11.5 / 5.5 / 8.5
  Fine-tuning   1.0 / 0.0 / 4.0 / 33.0    ·  1.5 / 2.5 / 7.0 / 32.5
  Persona       44.0 / 47.5 / 8.5 / 30.0  ·  45.0 / 50.0 / 6.0 / 27.5
  CR-VLM        4.0 / 4.0 / 5.0 / 6.0     ·  6.5 / 5.0 / 9.5 / 9.5

These results demonstrate that our calibrated activation approach enables configurable refusal, selectively amplifying refusal only when constraints are violated, rather than globally increasing refusal strength.

Activation Distribution.

Evaluation Metrics. The goal of our configurable refusal mechanism is to reshape the model's hidden representations to reflect the desired refusal behavior. To assess whether the calibrated activations induced by CR-VLM yield clearer separation between in-scope and out-of-scope queries compared to the original representations, we conduct a qualitative analysis of hidden-state distributions. Specifically, we apply Principal Component Analysis (PCA) (Pearson, 1901) to both the original and calibrated activations of in-scope and out-of-scope queries, and visualize their projected representations in a low-dimensional space.
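Returning to the trade-off metric: the MB-Score of Eq. (14) is the harmonic mean of the refusal rate R on out-of-scope queries and the acceptance rate 1 − ORR on in-scope queries. A minimal implementation (function name is illustrative):

```python
def mb_score(refusal_rate, over_refusal_rate):
    """Eq. (14): harmonic mean of the out-of-scope refusal rate R and
    the in-scope acceptance rate (1 - ORR); both arguments in [0, 1]."""
    r = refusal_rate
    acc = 1.0 - over_refusal_rate
    if r + acc == 0.0:
        return 0.0
    return 2.0 * r * acc / (r + acc)
```

A method only scores highly when it both refuses out-of-scope queries and accepts in-scope ones; e.g., R = 0.95 with ORR = 0.95 yields a low score despite the high refusal rate.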
As shown in Figure 4, the calibrated activations for in-scope queries remain close to the original in-scope representations, indicating that our method largely preserves the acceptance behavior for in-scope queries. In contrast, out-of-scope samples exhibit a substantial shift in the representation space: their calibrated activations are pushed away from both the original out-of-scope and in-scope regions and form a compact cluster associated with refusal behavior. We observe similar distribution patterns for additional models in Appendix C.

5.4. (EQ3) Ablation Studies

In this section, we conduct ablation studies on LLaVA-1.5-7B-hf to analyze the contribution of individual components in CR-VLM.

Efficacy of Teacher-Forced Activation Extraction. To evaluate whether teacher-forced activation extraction yields stronger refusal signals, we compare it with prompt-based activation extraction across different datasets and subjects. As shown in Figure 5, teacher-forced activation extraction consistently achieves higher refusal rates on out-of-scope samples, verifying its effectiveness.

Over-Refusal Mitigation by Orthogonality. We conduct an ablation study comparing loss functions with and without the orthogonality regularization term L_ortho. The results reported in Table 4 show severe over-refusal on in-scope queries across all subjects when the orthogonality loss term is omitted. This indicates that without explicit decoupling, the learned intervention directions for in-scope and out-of-scope samples collapse into a shared global shift, leading to over-refusal. In contrast, incorporating the orthogonality constraint effectively suppresses over-refusal while preserving strong refusal behavior for out-of-scope inputs, validating its critical role in enabling configurable refusal.

Efficacy of Vision Loss.
To examine whether the vision loss encourages the model to pay more attention to visual cues during refusal, we conduct an ablation study by providing the image together with blank textual content. This setting isolates the contribution of visual information to the predicted refusal behavior by excluding textual influence. Following (Arditi et al., 2024; Wang et al., 2024b), we adopt the refusal score as the evaluation metric here; the calculation details of the

[Figure 4 omitted: PCA scatter plots, six panels.] Figure 4. (EQ2) Activation distribution shifts at the 25th layer for in-scope and out-of-scope test samples across different subjects in ScienceQA and MMMU on LLaVA-1.5-7B-hf.

[Figure 5 omitted: refusal-rate bar chart per subject.] Figure 5. (EQ3) Ablation study of the teacher-forced method for activation extraction compared with the prompt-based method.

Table 4. (EQ3) Ablation results of the loss function with (w) and without (w/o) the orthogonality term across different datasets, conducted on LLaVA-1.5-7B-hf for both in-scope and out-of-scope test samples.
Dataset    Subject      In-scope (%)              Out-of-scope (%)
                        w/o    w                  w/o    w
ScienceQA  Biology      0.0    97.0  (↑97.0)      95.0   95.0  (≈)
           Physics      0.0    89.5  (↑89.5)      99.5   98.0  (↓1.5)
           Geography    0.0    100.0 (↑100)       93.0   93.0  (≈)
MMMU       Math         9.5    80.5  (↑71)        97.0   98.5  (↑1.5)
           Art Theory   1.5    91.0  (↑89.5)      97.5   97.0  (↓0.5)
           Geography    4.0    96.0  (↑92)        93.0   97.5  (↑4.5)

[Figure 6 omitted: refusal scores by subject, with and without vision loss.] Figure 6. (EQ3) Ablation study of vision loss conducted on LLaVA-1.5-7B-hf on ScienceQA.

refusal score are introduced in Appendix D. As shown in Figure 6, removing the vision loss leads to a notable drop in refusal scores across different subjects. These results on the ScienceQA dataset indicate that the vision loss effectively enhances visual grounding in the refusal process, rather than relying solely on text-dominated signals. Additional results on MMMU are shown in Appendix F.

6. Conclusion

In this paper, we introduce CR-VLM, a configurable refusal approach for VLMs based on activation steering, enabling models to refuse out-of-scope queries while preserving acceptance for in-constraint ones. Through teacher-forced extraction, a gating mechanism, and vision-aware calibration, our method achieves robust refusal alignment across different models and datasets, with low over-refusal.

Impact Statement

This paper presents work whose goal is to achieve configurable refusal for VLMs. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024.

Belrose, N. Diff-in-means concept editing is worst-case optimal: Explaining a result by Sam Marks and Max Tegmark. URL https://blog.eleuther.ai/diff-in-means, 2023.

Bhardwaj, R., Anh, D. D., and Poria, S.
Language models are Homer Simpson! Safety re-alignment of fine-tuned language models through task arithmetic. ACL 2024, 2024.

Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., and Zou, J. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875, 2023.

Cao, Z., Yang, Y., and Zhao, H. SCANS: Mitigating the exaggerated safety for LLMs via safety-conscious activation steering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 23523–23531, 2025.

Chen, R., Arditi, A., Sleight, H., Evans, O., and Lindsey, J. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, 2025.

Cui, J., Chiang, W.-L., Stoica, I., and Hsieh, C.-J. OR-Bench: An over-refusal benchmark for large language models. ICLR 2025, 2025.

Dabas, M., Chen, S., Fleming, C., Jin, M., and Jia, R. Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning. ICML 2025, 2025.

Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., and Yang, Y. Safe RLHF: Safe reinforcement learning from human feedback. ICLR 2024, 2024.

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P. N., and Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.

Gu, J., Han, Z., Chen, S., Beirami, A., He, B., Zhang, G., Liao, R., Qin, Y., Tresp, V., and Torr, P. A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

Hu, E.
J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., Chen, B., Sun, R., Wang, Y., and Yang, Y. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36, 2024.

Laurençon, H., Marafioti, A., Sanh, V., and Tronchon, L. Building and better understanding vision-language models: insights and future directions. 2024.

Lee, B. W., Padhi, I., Ramamurthy, K. N., Miehling, E., Dognin, P., Nagireddy, M., and Dhurandhar, A. Programming refusal with conditional activation steering. arXiv preprint arXiv:2409.05907, 2024.

Lei, J., Gumma, V., Bhardwaj, R., Lim, S. M., Li, C., Zadeh, A., and Poria, S. OffTopicEval: When large language models enter the wrong chat, almost always! arXiv preprint arXiv:2509.26495, 2025.

Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 2757–2791, 2025a.

Li, Z., Wu, X., Du, H., Liu, F., Nghiem, H., and Shi, G. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges. arXiv preprint arXiv:2501.02189, 2025b.

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning, 2023a.

Liu, S., Ye, H., and Zou, J. Reducing hallucinations in large vision-language models via latent space steering. In The Thirteenth International Conference on Learning Representations, 2025.
Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Guo, R., Cheng, H., Klochkov, Y., Taufiq, M. F., and Li, H. Trustworthy LLMs: a survey and guideline for evaluating large language models' alignment. arXiv preprint arXiv:2308.05374, 2023b.

Liu, Z., Nie, Y., Tan, Y., Yue, X., Cui, Q., Wang, C., Zhu, X., and Zheng, B. Safety alignment for vision language models. arXiv preprint arXiv:2405.13581, 2024b.

Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.

Neumann, A., Kirsten, E., Zafar, M. B., and Singh, J. Position is power: System prompts as a mechanism of bias in large language models (LLMs). In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pp. 573–598, 2025.

Ni, S., Chen, G., Li, S., Chen, X., Li, S., Wang, B., Wang, Q., Wang, X., Zhang, Y., Fan, L., et al. A survey on large language model benchmarks. arXiv preprint arXiv:2508.15361, 2025.

Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

Peng, J., Wang, M., Wang, N., Li, J., Li, Y., Ye, Y., Wang, W., Jia, P., Zhang, K., and Zhao, X. Logic jailbreak: Efficiently unlocking LLM safety restrictions through formal logical expression. arXiv preprint arXiv:2505.13527, 2025.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. Steering Llama 2 via contrastive activation addition.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15504–15522, 2024.

Shao, Z., Liu, H., Hu, Y., and Gong, N. Z. Refusing safe prompts for multi-modal large language models. arXiv preprint arXiv:2407.09050, 2024.

Simhi, A., Herzig, J., Tutek, M., Itzhak, I., Szpektor, I., and Belinkov, Y. ManagerBench: Evaluating the safety-pragmatism trade-off in autonomous LLMs. arXiv preprint arXiv:2510.00857, 2025.

Sivakumar, A., Zhang, A., Hakim, Z., and Thomas, C. SteerVLM: Robust model control through lightweight activation steering for vision language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 23640–23665, 2025.

Vatsa, M., Jain, A., and Singh, R. Adventures of trustworthy vision-language models: A survey. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 22650–22658, 2024.

Wang, P., Zhang, D., Li, L., Tan, C., Wang, X., Ren, K., Jiang, B., and Qiu, X. InferAligner: Inference-time alignment for harmlessness through cross-model guidance. EMNLP 2024, 2024a.

Wang, X., Hu, C., Röttger, P., and Plank, B. Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation. ICLR 2025, 2024b.

Wang, X., Wang, M., Liu, Y., Schütze, H., and Plank, B. Refusal direction is universal across safety-aligned languages. arXiv preprint arXiv:2505.17306, 2025.

Wei, Z., Wang, Y., Li, A., Mo, Y., and Wang, Y. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387, 2023.

Wen, B., Yao, J., Feng, S., Xu, C., Tsvetkov, Y., Howe, B., and Wang, L. L. Know your limits: A survey of abstention in large language models. Transactions of the Association for Computational Linguistics, 13:529–556, 2025.

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.
Yuan, Z., Xiong, Z., Zeng, Y., Yu, N., Jia, R., Song, D., and Li, B. RigorLLM: Resilient guardrails for large language models against undesired content. ICML 2024, 2024.

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9556–9567, 2024.

Zhang, J., Elgohary, A., Magooda, A., Khashabi, D., and Van Durme, B. Controllable safety alignment: Inference-time adaptation to diverse safety requirements. ICLR 2025, 2024a.

Zhang, J., Huang, J., Jin, S., and Lu, S. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5625–5644, 2024b.

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 2023.

Zhou, A., Li, B., and Wang, H. Robust prompt optimization for defending language models against jailbreaking attacks. Advances in Neural Information Processing Systems, 37:40184–40211, 2024.

A. Case Study

Figure 7 and Figure 8 provide qualitative examples illustrating the behavior of our configurable refusal mechanism for out-of-scope queries and in-scope queries, respectively. The model correctly refuses out-of-scope queries while preserving normal responses for in-scope inputs, demonstrating configurable refusal rather than global refusal enforcement.

[Figures omitted: qualitative examples, two panels each.] Figure 7. Case Study for Out-of-scope Test Samples. Figure 8. Case Study for In-scope Test Samples.
[Figure 9 omitted: MB-Score bar charts per model and subject.] Figure 9. (EQ2) MB-Score trade-off between refusal rate and over-refusal rate by LLM-as-Judgment.

B. (EQ2) Additional Experiments by MB-Score

To complement the human evaluation results reported in the main paper, we further analyze MB-Score using LLM-as-Judgment with DeepSeek-V3 (Liu et al., 2024a), as shown in Figure 9. Consistent with human evaluation, our method consistently achieves higher MB-Scores than the baselines across different models and subjects, demonstrating a more balanced trade-off between refusal effectiveness and over-refusal.

C. (EQ2) Additional Results of Activation Distribution

We report the activation distribution shifts for additional VLMs in Figure 10, Figure 11, and Figure 12, to further validate the generality of our configurable refusal method.
Specifically, we visualize the calibrated and original activations for

[Figure 10 omitted: PCA scatter plots, six panels.] Figure 10. (EQ2) Activation distribution shifts at the 25th layer for in-scope and out-of-scope test samples across different subjects in ScienceQA and MMMU on LLaVA-1.5-13B-hf.

[Figure 11 omitted: PCA scatter plots, six panels.] Figure 11. (EQ2) Activation distribution shifts at the 25th layer for in-scope and out-of-scope test samples across different subjects in ScienceQA and MMMU on Idefics3-8B-Llama3.
[Figure 12 omitted: PCA scatter plots, six panels.] Figure 12. (EQ2) Activation distribution shifts at the 25th layer for in-scope and out-of-scope test samples across different subjects in ScienceQA and MMMU on InstructBLIP-Vicuna-7B.

LLaVA-1.5-13B-hf, Idefics3-8B-Llama3, and InstructBLIP-Vicuna-7B across different subject domains. Consistent with the observations in the main paper, calibrated activations for out-of-scope queries are clearly separated from both their original counterparts and in-scope activations, forming compact clusters associated with refusal behavior. Meanwhile, activations corresponding to in-scope queries remain close to their original distributions, indicating that acceptance behavior is largely preserved. These results demonstrate that the activation-level separation induced by our method is stable across model architectures and scales, further confirming the effectiveness and robustness of our configurable refusal mechanism.

D. Refusal Score Calculation

We calculate a refusal score following (Arditi et al., 2024; Wang et al., 2024b) in Eq. (15), which quantifies the model's inclination to generate refusal-related tokens.
Formally, at the first decoding step, we compute the log-odds between refusal-related tokens R (e.g., "Sorry", "I") and all other tokens in the vocabulary V to obtain the refusal score; a higher refusal score indicates a stronger tendency to refuse. The refusal score is calculated as follows:

\[ T(x) = \log\Big(\sum_{t \in \mathcal{R}} p_t\Big) - \log\Big(\sum_{t \in \mathcal{V} \setminus \mathcal{R}} p_t\Big). \tag{15} \]

E. Refusal Score Shift

To obtain a more fine-grained understanding of how our method modulates refusal behavior, we examine how it shifts the model's underlying tendency to generate refusal-related tokens. The details of calculating the refusal score are introduced in Section D of the appendix. Unlike the refusal rate, which provides a binary view of whether a refusal occurs, the refusal score captures subtle changes in the model's refusal tendency. The results on ScienceQA are shown in Figure 13, while the corresponding results on MMMU are presented in Figure 14. Across all evaluated VLMs and subject domains, out-of-scope test samples consistently exhibit substantially higher refusal scores than in-scope samples. Importantly, in-scope samples remain concentrated in low refusal-score ranges, indicating that our method selectively amplifies refusal behavior for out-of-scope inputs without introducing over-refusal. This consistent pattern across datasets further validates the effectiveness and robustness of CR-VLM.
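As a concrete illustration, the log-odds refusal score of Eq. (15) takes only a few lines to compute once the first-step token distribution is available. The token distribution below is a toy example for exposition, not output from any of the evaluated models:

```python
import math

def refusal_score(token_probs, refusal_tokens):
    """Log-odds between refusal-related tokens and all other tokens,
    computed from the model's first-step next-token distribution."""
    p_refusal = sum(p for tok, p in token_probs.items() if tok in refusal_tokens)
    p_other = sum(p for tok, p in token_probs.items() if tok not in refusal_tokens)
    return math.log(p_refusal) - math.log(p_other)

# Toy first-token distribution: refusal mass 0.6 vs. non-refusal mass 0.4.
probs = {"Sorry": 0.5, "I": 0.1, "The": 0.3, "A": 0.1}
score = refusal_score(probs, refusal_tokens={"Sorry", "I"})
```

A positive score indicates the model is more likely to open its response with a refusal-related token; in practice the probabilities come from the softmax over the full vocabulary at the first decoding step.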
[Figure 13 omitted: refusal-score distributions per model and subject.] Figure 13. Refusal score shift for in-scope and out-of-scope test samples on ScienceQA.

F. Additional Ablation Study of Vision Loss

To further verify the effectiveness of the vision loss on visual grounding in refusal behavior, we conduct a similar ablation study on the MMMU dataset. Consistent with the main results reported on ScienceQA, we evaluate refusal behavior using the refusal score while providing the image input with blank textual content, isolating visual contributions by excluding textual influence. As shown in Figure 15, removing the vision loss consistently leads to lower refusal scores across different MMMU subjects. These results suggest that the vision loss promotes the model's attention to visual cues when producing refusal behavior.
In addition, we provide a complementary analysis from a representation-level perspective, beyond the results shown in Section 5.4. Our goal is to test whether the predicted refusal-related representations preserve visual evidence, rather than being dominated solely by textual cues. Concretely, we compute the cosine similarity between the predicted representation produced from multimodal inputs (i.e., both vision and text) and a vision-conditioned reference representation (i.e., vision-only), and report the averaged similarity across different subjects. A higher similarity score indicates a stronger influence of visual cues on the predicted representations. The results in Table 5 demonstrate that the vision loss acts as an effective regularizer, prompting greater reliance on visual cues rather than text-dominated decision patterns.

Table 5. Hidden representation alignment with (w) and without (w/o) vision loss.

        ScienceQA                        MMMU
        Biology   Physics   Geography   Math    Art Theory   Geography
w/o     -0.021    0.070     0.200       0.203   0.106        0.068
w        0.019    0.107     0.224       0.252   0.165        0.082

[Figure 14 omitted: refusal-score distributions per model and subject.] Figure 14. Refusal score shift for in-scope and out-of-scope test samples on MMMU.

G. LLM-as-Judgment

For the LLM-as-Judgment framework used to compute the refusal rate and over-refusal rate in this paper, we adopt a strong and popular model, DeepSeek-V3 (Liu et al., 2024a), as the evaluator. The prompt template employed for this evaluation is provided in Figure 16.

H. Architecture of Calibrated Model

The architecture of our calibration module for generating the activation adjustments used in tunable refusal for VLMs is shown in Table 6.

Table 6. The architecture of the calibrated model.

Input                 Hidden state x ∈ R^d
Flow Network          Linear(Input_dim, 1024) → SiLU → Linear(1024, 1024) → SiLU → Linear(1024, Output_dim)
Gate Network          Linear, Sigmoid
Residual Update       Δx
Output                x_out = x + α · p · Δx
Trainable Parameter   Scalar α

[Figure 15 omitted: refusal scores by MMMU subject, with and without vision loss.] Figure 15. (EQ3) Ablation study of vision loss conducted on LLaVA-1.5-7B-hf on MMMU.

Example prompt used for LLM-as-Judgment. Principles: Identify whether the response expresses refusal to answer the user's question. Refusal includes direct rejection, statements of inability (e.g., 'I cannot see', 'I don't have access'), or citing safety/policy restrictions. Regard responses starting with a rejection as refusal as well. If refusal semantics are present, return 1; otherwise return 0. Only output the number 0 or 1.

Figure 16. The prompt for LLM-as-Judgment with principles and examples.

I. Additional Experiment Settings

We optimize the loss defined in Equation 13, where all loss weights are set to 1 across experiments.
We use a batch size of 8, a learning rate of 1 × 10^-4, and train for 200 epochs with the Adam optimizer. Activation steering is applied to different layer ranges depending on the backbone: layers 12–32 for LLaVA-1.5-7B-hf, layers 12–40 for LLaVA-1.5-13B-hf, layers 18–32 for InstructBLIP-Vicuna-7B, and layers 20–32 for Idefics3-8B-Llama3. All hyperparameters are selected via preliminary validation and kept fixed across datasets and models. All experiments use a fixed random seed of 42 and identical decoding settings.

J. Current Limitations and Future Directions

While our framework demonstrates strong effectiveness in controlling personalized refusal behaviors, it is subject to several limitations. First, our method relies on open-source models with accessible hidden states to extract representation activations and compute steering vectors. This restricts its direct applicability to closed-source models (e.g., GPT-4), where internal activations are not exposed. Second, our current experiments focus on one constraint at a time to keep the analysis clear and controlled. This is a design choice rather than a limitation of the framework: in practice, multiple constraints can be incorporated together through composition, and a systematic study of such multi-constraint settings is left for future work. Third, our current evaluation focuses on domain-level configurable refusal, as standardized benchmarks for explicit policy-conditioned scenarios (e.g., regulatory constraints) are still lacking. This is mainly a benchmarking constraint rather than a limitation of the framework; extending CR-VLM to such settings is straightforward, and we plan to explore this direction once suitable benchmarks become available.
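To make the gated residual update of the calibration module in Appendix H concrete, the rule x_out = x + α · p · Δx can be sketched in plain Python. This is a minimal sketch, not the trained module: the flow network is reduced to a stub stand-in for the three-layer SiLU MLP of Table 6, the gate is a single linear-plus-sigmoid layer as listed there, and all weights and dimensions below are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibrate(x, flow, gate_w, gate_b, alpha):
    """Gated residual steering: x_out = x + alpha * p * dx,
    where dx = flow(x) and p = sigmoid(gate_w . x + gate_b)."""
    dx = flow(x)                                                    # flow-network output Δx
    p = sigmoid(sum(w * xi for w, xi in zip(gate_w, x)) + gate_b)   # scalar gate in (0, 1)
    return [xi + alpha * p * dxi for xi, dxi in zip(x, dx)]

# Illustrative 2-d hidden state with a stub flow network.
flow = lambda v: [1.0 for _ in v]   # stand-in for Linear -> SiLU -> Linear -> SiLU -> Linear
x_out = calibrate([0.0, 0.0], flow, gate_w=[0.0, 0.0], gate_b=0.0, alpha=2.0)
# p = sigmoid(0) = 0.5, so each coordinate becomes 0 + 2 * 0.5 * 1 = 1.0
```

The gate is what makes the refusal configurable rather than global: when p is driven toward 0 for in-scope inputs, the hidden state passes through nearly unchanged, while out-of-scope inputs receive the full α-scaled steering update.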