
Paper deep dive

Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

Ziwei Xiang, Fanhu Zeng, Hongjian Fang, Rui-Qi Wang, Renxing Chen, Yanan Zhu, Yi Chen, Peipei Yang, Xu-Yao Zhang

Year: 2026 | Venue: arXiv preprint | Area: cs.CV | Type: Preprint | Embeddings: 66

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/22/2026, 5:58:11 AM

Summary

The paper introduces Quantization-aware Integrated Gradients (QIG), a fine-grained post-training quantization strategy for Large Vision Language Models (LVLMs). By leveraging axiomatic attribution to calculate token-level sensitivity, the method addresses the limitations of modality-level quantization, effectively reducing quantization error and improving accuracy across various benchmarks with negligible latency overhead.

Entities (5)

Large Vision-Language Models · model-architecture · 99%
Quantization-aware Integrated Gradients · method · 99%
InternVL2-8B · model · 98%
LLaVA-onevision-7B · model · 98%
Post-training Quantization · technique · 97%

Relation Signals (3)

Quantization-aware Integrated Gradients appliedto LLaVA-onevision-7B

confidence 96% · under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B

Quantization-aware Integrated Gradients improves Large Vision-Language Models

confidence 95% · our method improves accuracy across models and benchmarks

Post-training Quantization usedfor Large Vision-Language Models

confidence 95% · post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference

Cypher Suggestions (2)

Find all models improved by the QIG method · confidence 90% · unvalidated

MATCH (m:Model)-[:IMPROVED_BY]->(method:Method {name: 'Quantization-aware Integrated Gradients'}) RETURN m.name

List all techniques used for model acceleration · confidence 85% · unvalidated

MATCH (t:Technique)-[:USED_FOR]->(a:Application {name: 'Model Acceleration'}) RETURN t.name

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at this https URL.

Tags

ai-safety (imported, 100%)
cscv (suggested, 92%)
preprint (suggested, 88%)


Full Text

65,231 characters extracted from source content.


Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

Ziwei Xiang 1,2 *, Fanhu Zeng 1,2 *, Hongjian Fang 3 *, Rui-Qi Wang 4, Renxing Chen 2, Yanan Zhu 5, Yi Chen 1,2,6, Peipei Yang 1,2 †, Xu-Yao Zhang 1,2 †

1 State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
2 School of Artificial Intelligence, UCAS
3 Beijing National Research Center for Information Science and Technology
4 Institute of Artificial Intelligence, USTB
5 School of Artificial Intelligence, Beihang University
6 Zhongguancun Academy

* Equal contribution. † Corresponding to: {ppyang, xyz}@nlpr.ia.ac.cn

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.

1. Introduction

[Figure 1: panels "Layer 1 Token Sensitivity" and "Layer 16 Token Sensitivity"; axes: Token Index, Channel Index, Sensitivity weight.]
Figure 1. Token-level quantization sensitivity across layers in the form of heatmaps and curves. At layers 1 and 16, we show both the token-level sensitivity heatmap and its channel-averaged line curve for special, vision, and text tokens, measured using our Quantization-aware Integrated Gradients (QIG).

Large Vision Language Models (LVLMs) [3, 26] have advanced greatly in recent years and exhibit astonishing performance across various downstream areas like image captioning [14], visual question answering [32], and so on [12]. Meanwhile, computation and latency scale steeply with model size, especially in an era where models with billions of parameters are commonplace, which limits their practical application in real-world scenarios. To address this, the main approaches include pruning [18, 30, 43], distillation [15], and quantization [10, 23].
Among them, post-training quantization (PTQ) [10, 24, 50] provides a feasible approach to accelerate inference. By applying weight-only or weight-activation quantization, it reduces memory usage and computational overhead while minimizing reconstruction error with a small calibration set, thereby maintaining task performance and achieving strong accuracy–efficiency trade-offs in a training-free manner.

Quantization has made great progress in large language models for efficient inference [37, 45], with techniques such as rotation [22] and channel scaling [23]. Building on these advances, recent LVLM quantization methods exploit multimodal structure to improve performance [21, 39, 46]. MBQ [21] introduces a gradient-based objective that reweights reconstruction errors across modalities, mitigating inter-modality imbalance. QSLAW [39] designs a quantization-aware scale learning framework with a multimodal warmup for efficient instruction tuning. Q-VLM [35] performs block-level joint optimization guided by activation entropy to reduce greedy mismatch.

arXiv:2603.17809v1 [cs.CV] 18 Mar 2026

Despite the great progress in LVLM quantization, several issues remain to be tackled. (1) The complex interaction between modalities makes the distribution vary widely across layers and modalities. As illustrated in Fig. 1, token sensitivity differs not only between modalities (inter-modality) but also within a modality (intra-modality) and across depth, suggesting that modality-level quantization is insufficient to capture token-wise dynamics in LVLMs. (2) There remains a gap between the quantized model and the original model. This naturally calls for a fine-grained analysis of how each token contributes to quantization-induced output perturbations.
Existing methods avoid token-level analysis, which may be attributed to the weak correlation between common proxies such as attention and the true quantization error, as well as their tendency to overlook the most influential tokens. This limitation underscores the need for a direct and effective way to define token-level sensitivity for PTQ.

Motivated by this, we aim to explore fine-grained LVLM quantization and push the quantitative measurement of granularity from the modality level to the token level. We draw on the concept of axiomatic attribution [1] from mechanistic interpretability [5], which enables us to effectively analyze the perturbation sensitivity of each token by calculating the integrated gradients [34] during calibration. Concretely, we calculate the Quantization-aware Integrated Gradients (QIG) from the quantized reference input to the actual input, thereby obtaining a token-level sensitivity score that quantifies the influence of each input token on the final model quantization error.¹ Additionally, we apply a robust IQR-based clipping to suppress extreme token importance values and stabilize the sensitivity estimation during quantization. Empirically, QIG correlates strongly with actual quantization errors, validating its suitability as a proxy signal for guiding fine-grained quantization.

¹ The completeness property of quantization-aware integrated gradients is proved in Appendix A.

We conduct comprehensive experiments on multiple open-source LVLMs for both weight-only and weight-activation quantization. The results show that our method delivers consistent gains on various multimodal benchmarks. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. These results demonstrate that our method can significantly improve the accuracy of quantized LVLMs with negligible latency overhead, highlighting its practical efficiency.

Our main contributions are summarized as follows:
• We reveal the complex interaction between modalities in LVLM quantization, highlighting the necessity of fine-grained sensitivity measurements for multimodal inputs.
• We introduce the concept of axiomatic attribution and develop Quantization-aware Integrated Gradients, a quantization-specific sensitivity estimation method that provides token-level attributions of quantization error and directly guides fine-grained post-training quantization.
• We conduct extensive experiments on various multimodal benchmarks to comprehensively demonstrate the superiority and effectiveness of our method.

2. Related Work

2.1. Large Vision Language Models

Large vision language models bridge vision and language by projecting image features into the input space of Large Language Models (LLMs) [2, 20, 44]. Representative architectures such as LLaVA [19], InternVL [8], and Qwen-VL [3] encode an image using a Vision Transformer [9] or CLIP encoder [29] into a sequence of visual patch tokens. These visual tokens are then combined with text tokens and task-specific special tokens (e.g., <bos>, <eos>, and the <image> token, which demarcates visual content) into a unified input sequence. This heterogeneous multimodal sequence allows the LLM to process and reason over information from both modalities [3, 8, 19]. Unlike works proposing new architectures for better modality alignment, we focus on efficient acceleration for LVLMs.

2.2. Post-Training Quantization

Post-training quantization (PTQ) [11, 31, 51] is a widely adopted compression technique that converts full-precision weights and activations into lower-bit representations without requiring retraining. In LLMs, several representative PTQ approaches have been proposed [22, 23, 38].
RTN applies simple rounding-to-nearest quantization as a strong baseline, AWQ [23] introduces activation-aware weight quantization to preserve salient channels, GPTQ [10] minimizes layer-wise reconstruction error through second-order approximation, and SmoothQuant [38] balances activation and weight ranges to stabilize quantization during inference. Recently, PTQ has been extended to LVLMs to reduce their multimodal inference cost [21, 40]. However, existing works mainly aim to achieve balanced quantization across modalities or layers, while the uneven token-wise sensitivity within each layer remains largely underexplored.

2.3. Interpretability and Token Sensitivity

[Figure 2: activation heatmaps for Layer1 and Layer16 over four linear sub-layers (Attention Out, Attention QKV, MLP Up, MLP Down), annotated with: Massive outliers, Token variability, Layer heterogeneity, Sub-layer divergence.]
Figure 2. Visualization of activation distributions in InternVL2-8B during calibration. We visualize two representative layers and four linear sub-layers. In each panel, the horizontal axis denotes token positions in the multimodal sequence and the vertical axis indexes hidden channels; color encodes the average activation magnitude per token–channel pair over the calibration set. The plots reveal four recurring phenomena: massive activations, layer heterogeneity, sub-layer divergence, and token variability. These patterns indicate that coarse modality-level sensitivity modeling is insufficient, motivating our token-level sensitivity weighting.

Interpretability research aims to elucidate how the internal components of deep models interact to produce specific behaviors [25], offering a causal understanding beyond input–output correlations [4]. Intervention-based methods analyze model behavior by modifying inputs or intermediate activations. Occlusion Sensitivity [42] measures the influence of each input region on model predictions by systematically occluding local areas of the input, while Activation Patching [48] examines causal mediation within models by substituting activations between corrupted and clean forward passes. In contrast, gradient-based methods estimate feature importance using gradient information, such as Integrated Gradients (IG) [34] and SmoothGrad [33]. Although these approaches have achieved notable success in model analysis and visualization, most studies remain centered on interpretability itself rather than directly exploiting interpretability signals for model optimization.

3. Method

3.1. Preliminaries

Existing PTQ methods automatically search for optimal quantization hyperparameters by minimizing the reconstruction error of each transformer block during a calibration process. Building on reconstruction-aware calibration, recent weight–activation (WA) PTQ approaches [21, 38] aim to quantize both weights and activations to low precision while maintaining model quality. To alleviate the large quantization error caused by activation outliers, these methods perform channel-wise equalization (CWE) [38] on both the weight and activation matrices.

Let $X = [X_1, \dots, X_T] \in \mathbb{R}^{d \times T}$ denote the activation matrix of a transformer block, where each column $X_i \in \mathbb{R}^{d}$ is the embedding of the i-th token in a sequence of length T. Let $W \in \mathbb{R}^{m \times d}$ be the weight matrix of a linear sub-layer, where m denotes the output dimension of this linear sub-layer. Let $E \in \mathbb{R}^{d}$ denote the channel-wise scaling factors applied along the hidden dimension d. We use "∗" to denote channel-wise (per-channel) scaling of W and X by E. Specifically, CWE searches for optimal scaling factors E by minimizing the mean squared error (MSE) between the quantized and original outputs of each transformer block.
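As a rough illustration of the channel-wise equalization described in this subsection, the NumPy sketch below first checks that scaling W by E and X by the inverse of E leaves WX unchanged in full precision, and then picks E from a small grid by minimizing the MSE between the fake-quantized and original outputs. It is not the authors' implementation: the symmetric round-to-nearest helper `fake_quant`, the SmoothQuant-style parameterization E = max|X|^α / max|W|^(1−α), the α grid, and the toy shapes are all assumptions made for this example.

```python
import numpy as np

def fake_quant(t, bits, axis):
    """Symmetric round-to-nearest fake quantization, one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero slices
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

def cwe_error(W, X, E, w_bits=4, a_bits=8):
    """MSE between the quantized, equalized product and the original W @ X."""
    Wq = fake_quant(W * E[None, :], w_bits, axis=1)  # per output channel (row of W)
    Xq = fake_quant(X / E[:, None], a_bits, axis=0)  # per token (column of X)
    return float(np.mean((Wq @ Xq - W @ X) ** 2))

rng = np.random.default_rng(0)
d, m, T = 16, 8, 32               # hidden size, output size, sequence length (toy values)
W = rng.normal(size=(m, d))
X = rng.normal(size=(d, T))
X[3] *= 30.0                      # one outlier activation channel

# In full precision, equalization does not change the layer output.
E_any = rng.uniform(0.5, 2.0, size=d)
assert np.allclose((W * E_any) @ (X / E_any[:, None]), W @ X)

# Hypothetical search space: E = max|X|^alpha / max|W|^(1 - alpha), per input channel.
x_max = np.abs(X).max(axis=1)
w_max = np.abs(W).max(axis=0)
alphas = np.linspace(0.0, 1.0, 11)
errors = [cwe_error(W, X, x_max ** a / w_max ** (1 - a)) for a in alphas]
best_alpha = float(alphas[int(np.argmin(errors))])
print(f"best alpha = {best_alpha:.1f}, MSE = {min(errors):.5f}")
```

The grid search stands in for whatever search procedure a real PTQ pipeline uses; only the structure of the error being minimized is taken from the text.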
The optimization objective for weight-activation quantization can be formulated as:

$$E^{*} = \arg\min_{E} \left\| Q_W(W \ast E)\, Q_X(E^{-1} \ast X) - WX \right\|_2^2, \quad (1)$$

where $Q_W(\cdot)$ and $Q_X(\cdot)$ denote the quantization functions for weights and activations, respectively.

This formulation aims to jointly optimize the scaling of weights and activations, ensuring that quantization preserves the representational capacity of each transformer block. For simplicity, we use WxAy to indicate the quantization format, where x and y represent the bit-widths for weight and activation, respectively. For example, W4A8 denotes quantizing weights to 4 bits and activations to 8 bits.

| Sensitivity Type | Granularity | Accuracy (%) |
|---|---|---|
| Gradient-based | Modality-level | 57.36 |
| Gradient-based | Token-level | 55.78 |
| Gradient-based | Token-level (+ special) | 55.65 |
| Attention-based | Modality-level | 56.43 |
| Attention-based | Token-level | 57.12 |
| Attention-based | Token-level (+ special) | 57.52 |
| Perturbation-based | Modality-level | 56.81 |
| Perturbation-based | Token-level (+ special) | 57.72 |

Table 1. Comparison of modality-level and token-level sensitivity estimation strategies on VizWiz (W4A8, InternVL2-8B).

3.2. Sensitivity Differences Between Modalities and Tokens

Quantization sensitivity characterizes the degree to which a token or layer is affected by quantization noise. Since the dynamic range of activations determines the quantization scaling factor, activation statistics provide a practical proxy for estimating sensitivity. Therefore, before estimating sensitivity explicitly, we first analyze activation distributions to understand the origins of sensitivity differences.

From Fig. 2, we observe four recurring phenomena across two layers (Layer1 and Layer16) and four linear sub-layers (Attention Out, Attention QKV, MLP Up, MLP Down): (i) Massive outliers: large activation outliers persist across layers, forcing quantizers to widen the dynamic range; (ii) Layer heterogeneity: different Transformer layers display distinct activation behaviors; (iii) Sub-layer divergence: even within the same Transformer block, different sub-layers exhibit heterogeneous activation characteristics; and (iv) Token variability: within the same sub-layer, activations vary substantially across tokens, causing quantization to affect different tokens unevenly. These findings reveal that quantization sensitivity is not only modality-dependent (vision vs. language) but also highly token-dependent. However, existing LVLM quantization methods model sensitivity only at the modality level and implicitly assume equal sensitivity for all tokens within a modality. We hypothesize that overlooking token-level sensitivity variations fundamentally limits the performance of current LVLM quantization strategies.

To examine whether fine-grained sensitivity modeling is necessary, we run controlled experiments on InternVL2-8B (W4A8), keeping all quantization hyperparameters and calibration data fixed and varying only the sensitivity estimation strategy. We compare three approaches:
• Gradient-based sensitivity. Following MBQ [21], sensitivity is estimated from gradients of the supervised fine-tuning (SFT) loss. At the modality level, one sensitivity value is assigned to visual tokens and one to textual tokens. At the token level, each token (vision, text, and special) receives an individual score.
• Attention-based sensitivity. Sensitivity is derived from attention scores. Modality-level sensitivity aggregates scores within each modality, while token-level sensitivity directly uses per-token attention statistics.
• Perturbation-based sensitivity. Sensitivity is obtained by perturbing tokens and measuring the change in the block's outputs. Modality-level sensitivity jointly perturbs all visual or all textual tokens, whereas token-level sensitivity uses a leave-one-out scheme over individual tokens.

[Figure 3: diagram with two pipelines over a multi-modal token sequence of special (<bos>, <image>, ...), vision, and text ("A photo of Polar bear") tokens: "Modality-Balanced Quantization" (SFT token, modality-level sensitivity, VLM) and "Fine-Grained Quantization (Ours)" (QIG, token-level sensitivity λ_i at Layer N, VLM).]
Figure 3. Comparison between modality-balanced quantization and our fine-grained quantization. Different colors indicate token types. Unlike MBQ, which assigns modality-level sensitivity, our method computes token-level sensitivity via Quantization-aware Integrated Gradients (QIG) during calibration, enabling more effective quantization.

Tab. 1 shows three trends. (1) Gradient-based token-level weighting performs worse than modality-level weighting, indicating that SFT gradients correlate poorly with quantization sensitivity. Once quantization noise is introduced, the gradient distribution changes, and the mismatch accumulates over depth. (2) Attention-based sensitivity gives only modest and unstable gains, which is consistent with the attention-sink phenomenon [16], where certain tokens receive spuriously high attention. (3) Perturbation-based sensitivity performs best, as it directly measures the model's response to quantization noise, but it requires repeated forward passes and is computationally expensive.

These observations suggest that token-level sensitivity can improve quantization when it is estimated accurately, yet gradient- and attention-based proxies are misaligned with quantization error, and perturbation-based estimation is too costly to use directly. This motivates the fine-grained quantization method introduced in the next section.

3.3. Fine-Grained Quantization

Building on this analysis, we propose our fine-grained method. As illustrated in Fig. 3, prior modality-based PTQ methods assign uniform sensitivity weights to all tokens within a modality. However, token-level sensitivity is highly heterogeneous, varying across tokens, layers, and architectures. Modality-level weighting fails to capture this granularity, leading to suboptimal quantization. To address this, we introduce a token-level sensitivity estimator that adaptively prioritizes more vulnerable tokens during calibration, improving overall quantization quality. We term this fine-grained quantization.

Motivated by interpretability and attribution principles, we draw on axiomatic attribution [1], which naturally quantifies each token's contribution to model behavior and thus serves as a suitable foundation for measuring token importance during quantization. We start from the classical Integrated Gradients (IG) [34], which measures the cumulative contribution of each token along the straight path from a reference input $x'$ to the actual input $x$, where $f(\cdot,\cdot)$ denotes the output of the block:

$$\mathrm{IG}(x) = (x - x') \int_{0}^{1} \frac{\partial f(x_\alpha, w)}{\partial x_\alpha}\, d\alpha, \quad (2)$$

where $x_\alpha = x' + \alpha(x - x')$ and $f(\cdot, w)$ denotes the full-precision model. Eq. (2) reflects token contributions to the full-precision prediction; however, it does not reveal how sensitive the quantization-induced error is to each token.

To align the attribution with quantization, we instead explain the output gap between the full-precision model and the quantized model. Let $x_q$ denote the reference input along the attribution path and let $w_q$ be the quantized weights.
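To make the classical Integrated Gradients of Eq. (2) concrete, the following NumPy sketch approximates the path integral with a midpoint Riemann sum on a toy scalar "block" and checks the completeness property numerically (the attributions sum to the output difference between the input and the reference). The toy function `f`, its analytic gradient, the zero reference input, the number of integration steps, and the choice to sum attributions over channels to get one score per token are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def f(x, w):
    """Toy scalar block output: sum of tanh(w @ x) over outputs and tokens."""
    return float(np.tanh(w @ x).sum())

def grad_f(x, w):
    """Analytic gradient of f with respect to x (same shape as x)."""
    return w.T @ (1.0 - np.tanh(w @ x) ** 2)

def integrated_gradients(x, x_ref, w, steps=256):
    """Midpoint Riemann-sum approximation of Eq. (2) along the straight path."""
    avg_grad = np.zeros_like(x)
    for alpha in (np.arange(steps) + 0.5) / steps:
        avg_grad += grad_f(x_ref + alpha * (x - x_ref), w)
    avg_grad /= steps
    return (x - x_ref) * avg_grad   # element-wise: one value per (channel, token)

rng = np.random.default_rng(0)
d, m, T = 8, 4, 6                   # hidden size, output size, tokens (toy values)
w = rng.normal(size=(m, d)) * 0.5
x = rng.normal(size=(d, T))
x_ref = np.zeros_like(x)            # assumed reference input x'

ig = integrated_gradients(x, x_ref, w)
token_scores = ig.sum(axis=0)       # assumed aggregation: sum over channels per token
gap = f(x, w) - f(x_ref, w)

print("per-token attribution:", np.round(token_scores, 4))
print("sum of attributions vs. f(x) - f(x'):", float(ig.sum()), gap)
```

In a real LVLM the gradient would come from automatic differentiation rather than a closed form, and each integration step costs one forward and backward pass.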
In our main setting of joint weight–activation quantization, $x_q$ corresponds to the quantized input; in the case of weight-only quantization, activations remain in full precision and $x_q$ reduces to the zero baseline. At this step, we shift the IG objective from attributing the model's absolute prediction to attributing the prediction difference caused by quantization, allowing us to isolate the impact of quantization errors. We define the token-level Quantization-aware Integrated Gradients (QIG) as:

$$\mathrm{QIG}(x) = (x - x_q) \int_{0}^{1} \frac{\partial \left( f(x_\alpha, w) - f(x_\alpha, w_q) \right)}{\partial x_\alpha}\, d\alpha, \quad (3)$$

with $x_\alpha = x_q + \alpha(x - x_q)$. Here, $\mathrm{QIG}(x)$ is a token-wise attribution vector, and $\mathrm{QIG}_i(x)$ denotes the attribution score of the i-th token, quantifying how much restoring that token from its quantized representation reduces the output discrepancy between $f(x, w)$ and $f(x, w_q)$. Intuitively, a token with a large QIG has a disproportionately strong influence on the quantization error. Small perturbations in this token's embedding can significantly alter the output discrepancy between $f(x, w)$ and $f(x, w_q)$. Compared to IG, QIG is directly tied to the error that actually appears in PTQ, and it also satisfies a completeness property analogous to IG, for which we provide a formal derivation in Appendix A.

However, raw QIG values are often heavy-tailed, causing a few extreme tokens to dominate optimization. To suppress such outliers while preserving relative importance, we apply interquartile range (IQR) clipping [6] to obtain the clipped score:

$$C(\mathrm{QIG}_i) = \mathrm{clip}\left(\mathrm{QIG}_i,\; Q_1 - 1.5 \cdot \mathrm{IQR},\; Q_3 + 1.5 \cdot \mathrm{IQR}\right), \quad (4)$$

where $Q_1$ and $Q_3$ are the first and third quartiles, and $\mathrm{IQR} = Q_3 - Q_1$. We then normalize these scores to obtain the token importance coefficients:

$$\lambda_i = \frac{C(\mathrm{QIG}_i)}{\sum_{j=1}^{T} C(\mathrm{QIG}_j)}, \quad (5)$$

ensuring that the coefficients sum to one.

We integrate QIG into CWE to optimize the equalization factors. Keeping the WA quantization scheme in Eq. (1) unchanged, we reweight each token's reconstruction error by its importance score $\lambda_i$. The objective function becomes:

$$E^{*} = \arg\min_{E} \sum_{i=1}^{T} \lambda_i \left\| Q_W(W \ast E)\, Q_X(E^{-1} \ast X_i) - W X_i \right\|_2^2, \quad (6)$$

where $X_i$ represents the i-th input token activation of each linear layer. For weight-only quantization, it becomes:

$$E^{*} = \arg\min_{E} \sum_{i=1}^{T} \lambda_i \left\| Q_W(W \ast E)\, (E^{-1} \ast X_i) - W X_i \right\|_2^2. \quad (7)$$

In this way, the scale search is biased towards tokens that are empirically more sensitive to quantization, while the overall CWE framework remains unchanged. Beyond offering a more fine-grained, token-level sensitivity analysis, our approach improves performance while introducing virtually no additional computational cost.

4. Experiment

4.1. Experimental Setup

Implementation Details. In line with prior studies [21, 23, 38], we apply per-token activation quantization and per-channel weight quantization. Given that W8A8 quantization has been established as lossless in precision by SmoothQuant [38], our primary evaluation in this paper focuses on W4A8 and W3A16. All experiments are conducted on a single NVIDIA A800 GPU (80GB).

Calibration Datasets. Following prior work, we adopt the improved COCO Caption dataset from ShareGPT4V [7] and randomly sample 128 image–caption pairs for calibration. Each pair is formatted according to the conversational prompt style of the target LVLM.

Models. We conduct both W3A16 and W4A8 quantization on numerous leading open-source LVLMs, including LLaVA-onevision-7B [19], Qwen2-VL-7B [36], and InternVL2-8B/26B [8]. For the LLaVA-onevision series, we select versions that adopt Qwen2 as the language model backbone and SigLIP-400M [47] as the vision encoder.

Baselines. For weight-only quantization, we compare our method with vanilla round-to-nearest (RTN), AWQ [23], GPTQ [10], and MBQ [21] under W3A16, all employing channel-wise equalization and group-wise asymmetric quantization (group size 128). For weight-activation quantization, we evaluate RTN, SmoothQuant [38], and MBQ under W4A8, also with channel-wise equalization. Following SmoothQuant, we use per-token symmetric quantization for activations and per-channel symmetric quantization for weights to utilize low-precision tensor cores.

| Model | Bitwidth | Method | VizWiz | MMMU | ChartQA | AI2D | ScienceQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| LLaVA-onevision-7B | FP16 | - | 60.41 | 49.22 | 80.04 | 81.31 | 95.88 | 73.37 |
| LLaVA-onevision-7B | W3A16 | RTN | 59.12 | 43.67 | 68.88 | 78.92 | 94.55 | 69.03 |
| LLaVA-onevision-7B | W3A16 | GPTQ | 54.87 | 42.33 | 73.72 | 76.81 | 92.12 | 67.97 |
| LLaVA-onevision-7B | W3A16 | AWQ | 58.65 | 42.89 | 74.08 | 77.92 | 82.20 | 67.15 |
| LLaVA-onevision-7B | W3A16 | MBQ | 57.99 | 44.00 | 76.84 | 78.47 | 94.89 | 70.44 |
| LLaVA-onevision-7B | W3A16 | QIG (Ours) | 62.82 | 45.78 | 77.20 | 79.11 | 95.29 | 72.04 |
| LLaVA-onevision-7B | W4A8 | RTN | 58.10 | 42.89 | 71.00 | 77.82 | 94.10 | 68.78 |
| LLaVA-onevision-7B | W4A8 | SQ | 55.67 | 42.00 | 66.28 | 77.20 | 93.51 | 66.93 |
| LLaVA-onevision-7B | W4A8 | MBQ | 58.13 | 44.78 | 74.92 | 78.27 | 94.70 | 70.16 |
| LLaVA-onevision-7B | W4A8 | QIG (Ours) | 59.10 | 45.00 | 74.52 | 78.30 | 94.25 | 70.23 |
| InternVL2-8B | FP16 | - | 60.86 | 48.56 | 82.64 | 82.42 | 97.07 | 74.31 |
| InternVL2-8B | W3A16 | RTN | 55.95 | 43.89 | 79.24 | 80.51 | 96.28 | 71.17 |
| InternVL2-8B | W3A16 | GPTQ | 59.79 | 43.11 | 76.40 | 76.65 | 94.30 | 70.05 |
| InternVL2-8B | W3A16 | AWQ | 58.14 | 45.56 | 74.42 | 79.47 | 95.88 | 70.70 |
| InternVL2-8B | W3A16 | MBQ | 59.33 | 46.02 | 80.04 | 79.66 | 95.93 | 72.20 |
| InternVL2-8B | W3A16 | QIG (Ours) | 59.55 | 46.22 | 80.04 | 79.73 | 96.03 | 72.31 |
| InternVL2-8B | W4A8 | RTN | 56.68 | 43.00 | 78.96 | 79.02 | 96.22 | 70.80 |
| InternVL2-8B | W4A8 | SQ | 55.56 | 44.78 | 77.96 | 76.59 | 95.88 | 70.15 |
| InternVL2-8B | W4A8 | MBQ | 57.36 | 45.67 | 78.00 | 79.47 | 96.38 | 71.38 |
| InternVL2-8B | W4A8 | QIG (Ours) | 58.33 | 47.33 | 78.16 | 79.63 | 96.73 | 72.04 |
| Qwen2-VL-7B | FP16 | - | 68.34 | 51.22 | 81.40 | 80.12 | 85.03 | 73.22 |
| Qwen2-VL-7B | W3A16 | RTN | 65.02 | 44.67 | 73.64 | 76.33 | 81.06 | 68.14 |
| Qwen2-VL-7B | W3A16 | GPTQ | 67.73 | 44.44 | 76.20 | 74.87 | 81.76 | 69.00 |
| Qwen2-VL-7B | W3A16 | AWQ | 66.24 | 45.89 | 77.08 | 77.53 | 81.01 | 69.56 |
| Qwen2-VL-7B | W3A16 | MBQ | 66.62 | 46.48 | 79.18 | 77.81 | 81.85 | 70.15 |
| Qwen2-VL-7B | W3A16 | QIG (Ours) | 67.12 | 47.11 | 77.76 | 77.88 | 81.61 | 70.30 |
| Qwen2-VL-7B | W4A8 | RTN | 58.71 | 45.44 | 74.16 | 77.01 | 79.62 | 66.99 |
| Qwen2-VL-7B | W4A8 | SQ | 47.60 | 43.78 | 70.88 | 76.07 | 78.98 | 63.46 |
| Qwen2-VL-7B | W4A8 | MBQ | 60.17 | 44.89 | 76.92 | 76.49 | 78.93 | 67.48 |
| Qwen2-VL-7B | W4A8 | QIG (Ours) | 58.85 | 46.00 | 76.68 | 77.17 | 80.17 | 67.77 |

Table 2. Overall comparison of full-precision and post-training quantization methods on three representative LVLMs under W3A16 and W4A8. RTN and SQ are naive PTQ baselines, MBQ is the modality-balanced baseline, and QIG is the proposed fine-grained quantization method. Bold numbers indicate the best performance, and underlined numbers indicate the second best in each column.

Datasets. To comprehensively assess the performance of our quantized models, we follow the LMMs-Eval [49] protocol and evaluate on multiple vision–language benchmarks. In particular, MMMU [41] and ScienceQA [27] are used to test visual reasoning, VizWiz [13] to examine real-world perception, and ChartQA [28] and AI2D [17] to evaluate the understanding of structured visual information.

4.2. Main Results

Tab. 2 reports the performance of different PTQ methods on three representative LVLMs under both weight-only (W3A16) and weight–activation (W4A8) quantization.

Generic LLM PTQ methods underperform naive RTN on LVLMs. Across all three models, the naive RTN baseline already causes a moderate drop (about 4% on average) compared with FP16, indicating that 3-bit weight quantization is non-trivial for LVLMs. However, GPTQ and SmoothQuant (SQ), which are strong PTQ methods for pure LLMs, do not reliably improve performance in this multimodal setting. Under W3A16, GPTQ often lags behind RTN in terms of average accuracy (e.g., LLaVA-onevision-7B and InternVL2-8B), and under W4A8, SQ is consistently worse than RTN on all three models. In other words, directly applying PTQ methods designed for LLMs to LVLMs, while ignoring cross-modal statistical characteristics, may perform no better than simple round-to-nearest and can even degrade performance. This observation underscores the importance of leveraging multimodal information when designing quantization strategies for LVLMs.

| Model | Bitwidth | Method | ChartQA | MMMU | VizWiz |
|---|---|---|---|---|---|
| InternVL2-26B | FP16 | - | 86.44 | 52.78 | 65.65 |
| InternVL2-26B | W4A8 | MBQ | 84.44 | 49.78 | 63.51 |
| InternVL2-26B | W4A8 | Ours | 85.24 | 50.22 | 63.91 |
| InternVL2-26B | W3A16 | MBQ | 84.48 | 51.67 | 63.33 |
| InternVL2-26B | W3A16 | Ours | 85.12 | 50.89 | 64.14 |

Table 3. Quantization on InternVL2-26B: MBQ vs. Ours under W3A16/W4A8.

Fine-grained token-level sensitivity weighting beyond modality-level quantization. Modality-aware quantization provides a strong starting point for quantizing LVLMs.
The MBQ baseline reweights the reconstruction errors of the vision and language modalities to alleviate their inherent imbalance during quantization. As a result, MBQ achieves consistent improvements of about 1% on average over RTN and GPTQ across three models and both bitwidths. However, modality-level balancing remains coarse, since tokens within the same modality can exhibit different sensitivities to quantization. This limitation motivates the fine-grained token-level sensitivity weighting proposed in our method.

To further address the limitations of modality-level sensitivity modeling, our method introduces fine-grained token-level sensitivity weighting. Across six quantized configurations, including three foundation models and two bitwidth settings, our method consistently achieves the highest average accuracy. Compared with MBQ, it brings an additional average gain of about 0.5%. For example, on LLaVA-onevision-7B, the average accuracy improves from 70.44% to 72.04% under W3A16 and from 70.16% to 70.23% under W4A8. Similar steady improvements are observed on InternVL2-8B and Qwen2-VL-7B under both bitwidths. Moreover, across all benchmarks and quantized configurations in Tab. 2, our method either achieves the best performance or remains the second best among all PTQ baselines.

The gains are particularly clear on challenging benchmarks. On VizWiz and MMMU, our method surpasses MBQ by around 1% on average, which suggests that token-level weighting better preserves sensitive visual and reasoning tokens. This improvement may stem from estimating token-wise sensitivity rather than using a single weight per modality, enabling finer control over token importance. Qualitative visualizations in Appendix D show that, under the same quantization settings, our method yields more accurate answers than MBQ.
Scaling to Larger Models. To assess whether the proposed fine-grained post-training quantization scales to larger LVLMs, we further apply it to InternVL2-26B and compare it with MBQ under both W4A8 and W3A16 configurations. As shown in Tab. 3, our method yields clear gains over MBQ on ChartQA and VizWiz for both bitwidth settings, while maintaining comparable performance on MMMU. Under the W4A8 configuration, our approach recovers most of the FP16 accuracy, keeping the performance drop within 3% on all benchmarks. Even under the more aggressive W3A16 setting, our method still surpasses MBQ on ChartQA and VizWiz and remains within 2% of the FP16 model across all tasks, despite using 3-bit weights. These results demonstrate that the proposed fine-grained quantization strategy scales reliably to LVLMs with tens of billions of parameters and can be deployed at larger model sizes without incurring substantial performance degradation.

4.3. Ablation Study and Further Analysis

We conduct ablation studies to examine the effectiveness of fine-grained quantization, framework generality, and quantization efficiency. The results show that each design component contributes measurable performance gains while introducing negligible additional computational overhead. Additional experimental results, including more ablation studies, are presented in Appendix C.

Baseline  Attribution objective  ChartQA  VizWiz
$0$       $f(x)$                 73.87    61.73
$0$       $f(x) - f(0)$          74.30    62.31
$x^q$     $f(x)$                 74.12    61.52
$x^q$     $f(x) - f(x^q)$        74.52    62.82

Table 4. Ablation of the integrated-gradients configuration for token-wise sensitivities, varying the reference baseline $x'$ ($0$ vs. $x^q$) and the attribution objective (task output $f(x)$ vs. quantization-error outputs $f(x) - f(0)$ or $f(x) - f(x^q)$). Results are on LLaVA-onevision-7B with W4A8; the last row is our QIG formulation and performs best (higher is better).

Sensitivity Ablation of Fine-Grained Quantization. In Sec.
3.3, we present our quantization-aware integrated gradients, which depart from the standard formulation in two key aspects: the choice of reference baseline and the scalar objective whose gradients are integrated along the path. To evaluate the contribution of these components to fine-grained quantization, we perform an ablation study over both the baseline and the objective used to compute token-wise sensitivities. We evaluate four configurations of Integrated Gradients on LLaVA-onevision-7B under the W4A8 setting, and report downstream accuracies on ChartQA and VizWiz in Tab. 4. We ablate over two choices: the baseline $x' \in \{0, x^q\}$ and the attribution objective $g(x) \in \{f(x),\ f(x) - f(0),\ f(x) - f(x^q)\}$. Our QIG formulation corresponds to the configuration with baseline $x' = x^q$ and objective $g(x) = f(x) - f(x^q)$.

Model               Bitwidth  Method  ChartQA  AI2D   VizWiz
LLaVA-onevision-7B  FP16      -       80.04    81.31  60.41
                    W3A16     GPTQ    73.72    76.81  54.87
                    W3A16     + Ours  74.12    76.65  56.95
InternVL2-8B        FP16      -       82.64    82.42  60.86
                    W3A16     GPTQ    76.40    76.65  59.79
                    W3A16     + Ours  78.12    78.47  60.57

Table 5. Results of combining our fine-grained quantization with GPTQ under W3A16.

From the results in Tab. 4, we observe that both components of QIG contribute to the final performance. Under the zero baseline, switching the objective from the task output $f(x)$ to the error $f(x) - f(0)$ already yields consistent gains on ChartQA (+0.43%) and VizWiz (+0.58%). Changing the baseline from $0$ to $x^q$ while keeping the task-output objective provides a small improvement on ChartQA (73.87% to 74.12%) but slightly hurts VizWiz (61.73% to 61.52%). In contrast, combining the quantized baseline with the error objective $f(x) - f(x^q)$ leads to the best results on both datasets, achieving 74.52% on ChartQA and 62.82% on VizWiz. These trends indicate that integrating from the quantized input and explicitly attributing the quantization error are both important for obtaining reliable token-wise sensitivities for post-training quantization.
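To make the baseline and objective choices concrete, the following is a minimal pure-Python sketch of a QIG-style attribution, not the authors' implementation. It assumes the error objective $G(x) = f(x, w) - f(x, w^q)$ and the quantized baseline $x^q$ as written in Eq. 8 of Appendix A. The toy block `f`, the round-to-nearest helper `rtn`, the grid step of 0.25, and the reduction of each token's attribution to an inner product over embedding dimensions are all illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(x, w):
    """Toy scalar stand-in for a block output: sum over tokens of tanh(w . x_t)."""
    return sum(math.tanh(dot(w, tok)) for tok in x)

def G(x, w, w_q):
    """Quantization-error objective G(x) = f(x, w) - f(x, w_q)."""
    return f(x, w) - f(x, w_q)

def grad_G(x, w, w_q):
    """Analytic gradient of G with respect to every token embedding."""
    grads = []
    for tok in x:
        a, b = math.tanh(dot(w, tok)), math.tanh(dot(w_q, tok))
        grads.append([(1 - a * a) * wi - (1 - b * b) * qi
                      for wi, qi in zip(w, w_q)])
    return grads

def rtn(values, step=0.25):
    """Round-to-nearest on a uniform grid (hypothetical quantizer)."""
    return [round(v / step) * step for v in values]

def qig(x, x_q, w, w_q, steps=32):
    """Per-token scores: midpoint Riemann sum along the path x_q -> x."""
    scores = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        x_alpha = [[q + alpha * (v - q) for v, q in zip(tok, tok_q)]
                   for tok, tok_q in zip(x, x_q)]
        g = grad_G(x_alpha, w, w_q)
        for i, (tok, tok_q) in enumerate(zip(x, x_q)):
            # (x_i - x_i^q) . dG/dx_i, averaged over the integration steps
            scores[i] += sum((v - q) * gi
                             for v, q, gi in zip(tok, tok_q, g[i])) / steps
    return scores

# Three toy tokens with two embedding dimensions each (made-up values).
x = [[0.31, -0.62], [1.17, 0.44], [-0.83, 0.29]]
w = [0.73, -0.41]
x_q = [rtn(tok) for tok in x]
w_q = rtn(w)

scores = qig(x, x_q, w, w_q)
gap = G(x, w, w_q) - G(x_q, w, w_q)
print("token scores:", scores)
print("sum of scores:", sum(scores), "vs. G(x) - G(x_q):", gap)
```

The two printed quantities should agree up to the integration error, which is the completeness property proved in Appendix A. Swapping the baseline for an all-zero input or the objective for `f` alone gives the other rows of the ablation in spirit, although this toy model says nothing about their downstream accuracy.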
Combining Fine-Grained Quantization with GPTQ. To further demonstrate the generality of our fine-grained quantization strategy, we incorporate it into the GPTQ framework [10], which minimizes layer-wise reconstruction error through a second-order approximation using the Hessian matrix $H = X^\top X$. In our adaptation, we introduce a token-aware modification by replacing the Hessian with $H' = X^\top \Lambda X$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_T)$ holds the token importance coefficients derived from our fine-grained attribution mechanism. This reweighting allows GPTQ to emphasize activations from quantization-sensitive tokens while maintaining the overall optimization structure. Notably, the modification requires no additional calibration data and incurs negligible computation overhead, making it a plug-and-play enhancement to standard GPTQ.

As shown in Tab. 5, combining fine-grained weighting with GPTQ improves quantization performance in most cases on both LLaVA-onevision-7B and InternVL2-8B under the W3A16 setting. For instance, our fine-grained variant achieves 56.95% on VizWiz for LLaVA-onevision-7B, surpassing vanilla GPTQ by 2.08%. It also improves ChartQA for both models and AI2D for InternVL2-8B (76.65% to 78.47%); the only regression is a slight drop on AI2D for LLaVA-onevision-7B (76.81% to 76.65%). These results support the effectiveness and generality of our method and highlight the benefit of fine-grained quantization.

Model Size     MBQ   Leave One Out    Ours
InternVL2-8B   0.55  2.07 (+91 min)   0.58 (+2.0 min)
InternVL2-26B  0.95  4.20 (+195 min)  0.99 (+2.5 min)

Table 6. Quantization time (in GPU hours) of different models using a single A800 80GB GPU. Fine-grained quantization incurs negligible overhead compared to baseline methods.

Quantization Efficiency. To evaluate the practical efficiency of our fine-grained quantization, we measure the total quantization time required to process each model under different configurations. For comparison, we include the baseline MBQ [21] and the perturbation-based Leave-One-Out strategy.
The metric reports the total wall-clock GPU hours spent during the calibration and scale-search stages, including activation collection and layer-wise optimization. As shown in Tab. 6, our fine-grained method introduces only negligible overhead compared to MBQ, approximately two additional minutes for both InternVL2-8B and InternVL2-26B, while achieving consistent accuracy improvements. In contrast, the Leave-One-Out approach, while also effective in measuring quantization error at the token level, incurs high computational cost: it takes about 3.8× (InternVL2-8B) to 4.4× (InternVL2-26B) the GPU time of MBQ, due to repeated forward passes for each token perturbation. These results verify that the proposed fine-grained quantization balances interpretability, accuracy, and computational efficiency, making it applicable across different architectures and scalable to larger LVLMs in real deployment scenarios.

5. Conclusion

In this work, we revisited post-training quantization for LVLMs and showed that conventional modality-level sensitivity modeling is fundamentally insufficient. Our analysis of cross-token interactions reveals that tokens within the same modality exhibit substantial differences in quantization sensitivity. To bridge this granularity gap, we introduced Quantization-aware Integrated Gradients (QIG), an attribution-based framework that decomposes the quantization error between full-precision and quantized models into token-level contributions. By integrating from the quantized input and applying robust clipping, QIG provides stable importance scores that effectively guide fine-grained quantization. Our approach outperforms existing PTQ methods across diverse benchmarks. Under 3-bit weight-only quantization, it improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to just 1.33%.
We believe this token-aware, attribution-guided view of quantization offers a practical path toward deploying compact yet reliable LVLMs and motivates future work on unified, token-level compression in real-world systems.

References

[1] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, 2018.
[2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[4] Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, and Vasudev Lal. Lvlm-intrepret: An interpretability tool for large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8182–8187, 2024.
[5] Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082, 2024.
[6] Chris Chatfield. Exploratory data analysis. European journal of operational research, 23(1):5–13, 1986.
[7] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
[8] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al.
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
[11] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-power computer vision, pages 291–326. Chapman and Hall/CRC, 2022.
[12] Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model. arXiv preprint arXiv:2503.12941, 2025.
[13] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
[14] Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In European Conference on Computer Vision, pages 417–434. Springer, 2020.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models.
arXiv preprint arXiv:2503.03321, 2025.
[17] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European conference on computer vision, pages 235–251. Springer, 2016.
[18] Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, and Marinka Zitnik. Token reduction should go beyond efficiency in generative models–from vision, language to multimodality. arXiv preprint arXiv:2505.18227, 2025.
[19] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[20] Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36:30146–30166, 2023.
[21] Shiyao Li, Yingchun Hu, Xuefei Ning, Xihui Liu, Ke Hong, Xiaotao Jia, Xiuhong Li, Yaqi Yan, Pei Ran, Guohao Dai, et al. Mbq: Modality-balanced quantization for large vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4167–4177, 2025.
[22] Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. Advances in Neural Information Processing Systems, 37:87766–87800, 2024.
[23] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87–100, 2024.
[24] Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. Fq-vit: Post-training quantization for fully quantized vision transformer. arXiv preprint arXiv:2111.13824, 2021.
[25] Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, et al. A survey on mechanistic interpretability for multi-modal foundation models. arXiv preprint arXiv:2502.17516, 2025.
[26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[27] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
[28] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamila Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[30] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021.
[31] Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, et al. Tr-dq: Time-rotation diffusion quantization. arXiv preprint arXiv:2503.06564, 2025.
[32] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
[33] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
[34] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International conference on machine learning, pages 3319–3328. PMLR, 2017.
[35] Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, and Jiwen Lu. Q-vlm: Post-training quantization for large vision-language models. Advances in Neural Information Processing Systems, 37:114553–114573, 2024.
[36] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[37] Xinjian Wu, Fanhu Zeng, Xiudong Wang, and Xinghao Chen. Ppt: Token pruning and pooling for efficient vision transformers. arXiv preprint arXiv:2310.01812, 2023.
[38] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023.
[39] Jingjing Xie, Yuxin Zhang, Mingbao Lin, Liujuan Cao, and Rongrong Ji. Advancing multimodal large language models with quantization-aware scale learning for efficient adaptation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 10582–10591, 2024.
[40] Jie Yu, Songping Mai, Peng Zhang, Yucheng Jiang, and Jian Cheng.
Activation and weight distribution balancing for optimal post-training quantization in learned image compression. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 7959–7967, 2025.
[41] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[42] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
[43] Fanhu Zeng and Deli Yu. M2m-tag: Training-free many-to-many token aggregation for vision transformer acceleration. In Workshop on Machine Learning and Compression, NeurIPS 2024, 2024.
[44] Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, and Hao Tang. Robustmerge: Parameter-efficient model merging for mllms with direction robustness. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[45] Fanhu Zeng, Deli Yu, Zhenglun Kong, and Hao Tang. Token transforming: A unified and training-free token compression framework for vision transformer acceleration. arXiv preprint arXiv:2506.05709, 2025.
[46] Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, and Cheng-Lin Liu. Modalprompt: Towards efficient multimodal continual instruction tuning with dual-modality guided prompt. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12137–12152, 2025.
[47] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023.
[48] Fred Zhang and Neel Nanda.
Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042, 2023.
[49] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025.
[50] Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2406.02540, 2024.
[51] Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, and Xianglong Liu. First-order error matters: Accurate compensation for quantized large language models. arXiv preprint arXiv:2507.11017, 2025.

Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

Supplementary Material

A. Proof of Quantization-Aware Integrated Gradients Completeness

We denote the input as $x = [x_1, \ldots, x_T]$, where each token embedding $x_i \in \mathbb{R}^d$. Thus the full input lies in $\mathbb{R}^{T \times d}$. Using the definition of QIG in Eq. 3, the attribution for the $i$-th token is defined as:

\[
\mathrm{QIG}_i(x) = (x_i - x_i^q) \int_0^1 \frac{\partial \big( f(x^\alpha, w) - f(x^\alpha, w^q) \big)}{\partial x_i} \, d\alpha, \tag{8}
\]

where $x^\alpha = x^q + \alpha (x - x^q)$ is the linear interpolation between the quantized input $x^q$ and the original input $x$.

To simplify the notation, we define the quantization-error function as:

\[
G(x) = f(x, w) - f(x, w^q). \tag{9}
\]

Under this definition, QIG becomes standard Integrated Gradients (IG) applied to $G(\cdot)$ with baseline $x^q$:

\[
\mathrm{QIG}_i(x) = (x_i - x_i^q) \int_0^1 \frac{\partial G(x^\alpha)}{\partial x_i} \, d\alpha. \tag{10}
\]

Completeness. Consider the interpolation path $\gamma(\alpha) = x^\alpha = x^q + \alpha (x - x^q)$.
Since the path is linear, the derivative with respect to $\alpha$ can be written as:

\[
\frac{\partial x^\alpha}{\partial \alpha} = x - x^q. \tag{11}
\]

Applying the chain rule to $G(\gamma(\alpha))$ yields:

\[
\frac{\partial}{\partial \alpha} G(x^\alpha)
= \nabla_x G(x^\alpha)^\top \frac{\partial x^\alpha}{\partial \alpha}
= \nabla_x G(x^\alpha)^\top (x - x^q)
= \sum_{i=1}^{T} (x_i - x_i^q) \frac{\partial G(x^\alpha)}{\partial x_i}, \tag{12}
\]

which shows that the weighted coordinate-wise gradients in QIG correspond to the directional derivative of $G$ along the interpolation path.

Integrating both sides from $\alpha = 0$ to $1$, and using $G(\gamma(0)) = G(x^q)$ and $G(\gamma(1)) = G(x)$, the fundamental theorem of calculus gives:

\[
G(x) - G(x^q)
= \int_0^1 \frac{\partial}{\partial \alpha} G(x^\alpha) \, d\alpha
= \sum_{i=1}^{T} (x_i - x_i^q) \int_0^1 \frac{\partial G(x^\alpha)}{\partial x_i} \, d\alpha. \tag{13}
\]

Recognizing the definition of $\mathrm{QIG}_i(x)$, we obtain the completeness property:

\[
\sum_{i=1}^{T} \mathrm{QIG}_i(x) = G(x) - G(x^q)
= \big[ f(x, w) - f(x, w^q) \big] - \big[ f(x^q, w) - f(x^q, w^q) \big]. \tag{14}
\]

Discussion. When the baseline satisfies $G(x^q) = 0$ (e.g., when $f(x^q, w) = f(x^q, w^q)$), the completeness relation simplifies to:

\[
\sum_{i=1}^{T} \mathrm{QIG}_i(x) = f(x, w) - f(x, w^q), \tag{15}
\]

which mirrors the classical IG completeness property. In practice, post-processing of QIG values (e.g., clipping or interquartile-range filtering) may slightly break strict algebraic completeness while improving numerical stability and visualization quality.

B. More Implementation Details

QIG objective implementation. Let $x \in \mathbb{R}^{B \times T \times H}$ be the pre-residual activation of the current block, and let $x^q$ be its quantized version. The block outputs are denoted as $y_{\mathrm{fp}} = f(x, w)$ and $y_q = f(x, w^q)$. We define the per-token quantization distortion error as:

\[
E_{b,t}(x) = \frac{1}{H} \sum_{h=1}^{H} (y_{\mathrm{fp}} - y_q)_{b,t,h}, \qquad E(x) \in \mathbb{R}^{B \times T}.
\]

To obtain QIG attributions, we approximate the gradients of this quantization distortion loss $E_{b,t}(x)$ using 32-step integrated gradients [34]. Specifically, we integrate along the straight-line path from the baseline $x^q$ to the input $x$, defined as $x(\alpha) = x^q + \alpha (x - x^q)$ for $\alpha \in [0, 1]$.
Crucially, this computation is performed directly on the difference function $\lVert f(x) - f_q(x) \rVert$, without separately computing or subtracting gradients from the full-precision and quantized models individually. This follows the construction described in Appendix A, where $E_{b,t}(x)$ serves as the scalar target function for attribution.

Quantization Formats. We adopt uniform integer quantization for all experiments, and quantize both the weights $W$ and the input activations $X$ of each linear layer. For a given tensor $T$ and bit-width $b$, we denote its quantized integer representation by $Q(T) \in \mathbb{Z}_b$ and the corresponding dequantized value by $\hat{T}$.

For weight-only quantization, we apply asymmetric, group-wise quantization to the weight matrix $W$. Each row of $W$ is partitioned into non-overlapping groups of size 128, and for each group $g$ we compute a scale $s_g$ and zero-point $z_g$ from the group-wise minimum and maximum:

\[
s_g = \frac{\max(W_g) - \min(W_g)}{2^b - 1}, \tag{16}
\]

\[
z_g = \mathrm{round}\!\left( \frac{-\min(W_g)}{s_g} \right). \tag{17}
\]

The integer weights are then obtained as:

\[
Q(W_g) = \mathrm{clip}\big( \mathrm{round}(W_g / s_g) + z_g,\ 0,\ 2^b - 1 \big), \tag{18}
\]

and the dequantized weights are $\hat{W}_g = s_g \big( Q(W_g) - z_g \big)$. We primarily use $b \in \{3, 4\}$, which we denote as W3 and W4.

For weight–activation quantization, we use symmetric quantization for both weights and activations. Given a tensor $T$ and bit-width $b$, we define:

\[
s_T = \frac{\max(|T|)}{2^{b-1} - 1}, \qquad
Q(T) = \mathrm{clip}\big( \mathrm{round}(T / s_T),\ -2^{b-1},\ 2^{b-1} - 1 \big), \tag{19}
\]

and $\hat{T} = s_T \, Q(T)$. In this setting we write WxAy to indicate x-bit weight and y-bit activation quantization, e.g., W4A8 for 4-bit weights and 8-bit activations. Unless otherwise stated, the group size for weight quantization is fixed to 128.

C. More Experimental Results

Effectiveness of IQR-Based Clipping.
To more comprehensively evaluate the robustness benefits introduced by our IQR-based clipping strategy, we conduct an ablation study comparing four sensitivity stabilization variants: (1) No Clipping, which directly uses raw token-level sensitivities; (2) Top-5 Zero, which suppresses the five largest sensitivity values by setting them to zero; (3) Top-5 Average, which replaces the five largest sensitivities with the global mean computed over all token sensitivities; and (4) our full IQR Clipping method, which attenuates extreme values using statistically grounded interquartile-range thresholds.

Method               VizWiz  MMMU   ScienceQA
No Clipping          54.32   41.37  93.28
Top-5 Zero           57.20   43.56  94.10
Top-5 Average        57.25   44.78  94.18
IQR Clipping (Ours)  59.10   45.00  94.25

Table A1. Ablation on sensitivity stabilization strategies. Our IQR Clipping achieves the best overall performance on LLaVA-OneVision-7B under W4A8 quantization.

As shown in Tab. A1, all clipping strategies improve performance relative to the raw-sensitivity baseline, highlighting the importance of controlling outlier sensitivities. Notably, modifying the importance allocation of only five tokens already leads to clear performance differences, underscoring the necessity of fine-grained, token-level importance estimation. Among all variants, our IQR-based approach achieves the best results across VizWiz, MMMU, and ScienceQA, demonstrating that the observed gains originate not merely from simple top-value replacement, but from a distribution-aware clipping mechanism that more effectively stabilizes the sensitivity distribution.

Method  PPL↓  PIQA↑  ARC-e↑  ARC-c↑  MMLU↑
GPTQ    6.24  75.46  67.00   40.10   30.05
+ Ours  6.19  75.95  67.17   39.85   32.01

Table A2. Comparison of GPTQ and our fine-grained quantization on LLaMA-2-7B (3-bit).

Extension to Large Language Models (LLMs).
To verify that our method's effectiveness stems from accurately measuring token-level sensitivity rather than serving as a simple modality-related replacement, we further extend our approach to LLMs. Tab. A2 reports results with quantized LLaMA-2 on several standard language understanding benchmarks, including perplexity (PPL), PIQA for physical commonsense reasoning, ARC-e/ARC-c for scientific question answering, and MMLU for multi-domain knowledge understanding. As shown in Tab. A2, our fine-grained quantization method not only performs strongly on LVLMs but also helps when applied to an LLM: it improves perplexity, PIQA, ARC-e, and MMLU over GPTQ, with only a slight drop on ARC-c (40.10 to 39.85). By leveraging QIG to model token-level sensitivity, the method therefore transfers across modalities and model types. This ability to capture fine-grained token sensitivity makes our method versatile, with gains in both multimodal and unimodal settings.

Robustness with OCR-Specific Calibration. To address concerns regarding the method's adaptability to domain-specific challenges, we evaluate our approach using an OCR-focused calibration set derived from InfoVQA data. Tab. A3 reports the performance on Qwen2-VL-7B under W4A8 quantization across three OCR-intensive benchmarks: DocVQA, ChartQA, and OCRBench.

Bitwidth  Calib. Size  Method  DocVQA  ChartQA  OCRBench  Avg.
W4A8      128          MBQ     84.48   77.28    70.60     77.45
          128          Ours    88.60   77.52    76.80     80.97
          256          MBQ     84.87   76.68    71.50     77.68
          256          Ours    89.13   77.04    77.00     81.06

Table A3. Results on Qwen2-VL-7B using OCR-specific calibration data. Our method shows significant robustness improvements over MBQ in text-rich scenarios.

As shown in the table, our method consistently outperforms the MBQ baseline across all calibration sizes (128 and 256 samples).
Specifically, with only 128 calibration samples, our approach achieves an average improvement of 3.52% over MBQ, with notable gains of +4.12% on DocVQA and +6.20% on OCRBench. Even as the calibration size increases to 256, our method maintains a significant lead (avg. +3.38%). These results demonstrate that our token-level sensitivity modeling effectively captures critical features for text-rich visual understanding, ensuring robustness even when calibration data is limited or domain-specific.

D. Visualizations

In this section, we provide extended visualizations to further analyze the conversational outputs of vision–language models under different quantization schemes. The comparative results, shown in Figs. A1–A4, indicate that our proposed fine-grained quantization strategy enables the quantized model’s responses to better align with the calibration data, effectively reducing degradation in reasoning quality, visual detail retention, and linguistic coherence, thereby more clearly demonstrating its advantages over modality-based baseline methods.

Example 1 (LLaVA-onevision-7B, W3A16)
Question: Which classic film does this iconic scene on the ship's bow come from?
FP16 Response: Titanic
MBQ Response: Film
Ours Response: Titanic

Figure A1. The baseline fails to identify the film and produces an incomplete answer, whereas our fine-grained quantization successfully preserves the correct semantic prediction and matches the full-precision model.

Example 2 (LLaVA-onevision-7B, W4A8)
Question: Is the motorcycle currently stationary or moving? Give exactly two visual cues from the image to justify your choice
FP16 Response: The lack of blur suggests that the motorcycle is stationary.
MBQ Response: Motorcycle
Ours Response: Motorcycle is stationary [...]'s kickstand is down, indicating that it is parked and not moving. Additionally, the rider's foot is resting on the ground next to the motorcycle, which suggests that they are stationary rather than riding away or preparing to ride.
Figure annotation: Does not answer the question or give visual cues

Figure A2. The baseline fails to answer the question and provides no reasoning, whereas our fine-grained quantization preserves both correctness and detailed visual justification, closely matching the full-precision model.

Example 3 (InternVL2-8B, W4A8)
Question: Describe the image in detail.
FP16 Response: The image depicts a scene in a natural or zoo setting, featuring two giraffes. The giraffes are standing on a grassy area with a backdrop of lush green trees and foliage. The giraffe in the foreground is standing on a log, reaching up to a tree branch with its long neck to feed on the leaves. The second giraffe is positioned slightly behind and to the left, also feeding on the tree. Both giraffes have characteristic long necks, spotted coats, and are standing on all four legs. The environment appears to be a serene and natural habitat, possibly a zoo or a wildlife reserve, designed to mimic the giraffes' natural environment.
MBQ Response: [...] animal species, including giraffes. The image captures a moment of natural beauty and wildlife, showcasing the grace and elegance of these majestic animals.
Ours Response: The image depicts a scene in what appears to be a naturalistic enclosure, likely a zoo or wildlife sanctuary. The primary focus is on two giraffes, which are tall, slender, and have long necks. The giraffes are standing on a grassy area, with one of them standing on a fallen tree branch, which is a common behavior for these animals as they use their height to reach higher foliage for food. [...] are surrounded by a lush, green environment. There are several trees with dense foliage, providing a natural and shaded habitat. The trees have broad leaves, which are typical of deciduous forests. The ground is covered in grass, and there are fallen branches and logs, contributing to the naturalistic setting. [...] background, the image shows more trees and greenery, suggesting that the enclosure is quite expansive. The lighting in the image is natural, indicating that the photo was likely taken during the day. The overall ambiance is serene and peaceful, emphasizing the natural behavior and habitat of the giraffes. [...] are the central subjects, and their interaction with the environment, such as standing on the fallen tree branch, highlights their adaptability and natural foraging habits. The image captures a moment of their daily routine, showcasing their grace and elegance.
Figure annotation: Missing detail

Figure A3. The baseline provides only a minimal and overly generic description, missing most visual details, whereas our fine-grained quantization preserves rich scene understanding and produces a comprehensive description close to the full-precision model.

Example 4 (InternVL2-8B, W3A16)
Question: Describe the image in detail.
FP16 Response: The image depicts a serene harbor scene with several traditional fishing boats docked along the shore. The boats are painted in vibrant colors, primarily blue and red, with some featuring intricate designs. The boats are equipped with tall masts and rigging, indicating they are likely used for sailing. The water is calm, reflecting the boats and the partly cloudy sky above. The horizon is visible in the distance, with a gentle gradient from the blue of the sky to the green of the land. The overall atmosphere is peaceful and picturesque, capturing a moment of tranquility in a coastal setting
MBQ Response: [...] and rustic charm, capturing a quiet moment in the life of a small fishing community. The careful arrangement of the boats and the harmonious blend of colors and natural elements create a visually pleasing and inviting scene.
Ours Response: The image depicts a serene and picturesque harbor scene. The focal point is a cluster of small, colorful boats, likely fishing boats, moored side by side on a wooden dock. The boats are adorned with vibrant blue and red stripes, adding a splash of color to the tranquil waters. The boats are equipped with tall, slender masts, and their sails are neatly furled and tied, suggesting they are not currently in use. [...] in the harbor is calm, reflecting the clear blue sky and the fluffy, white clouds that dot the horizon. The sky is mostly clear, with only a few scattered clouds, contributing to the overall peaceful and inviting atmosphere. The horizon line is visible in the distance, where the calm sea meets the sky. [...] left of the image, there are a few more wooden posts or mooring poles, some of which are partially submerged in the water. These posts are likely used to secure the boats to the dock. The dock itself appears to be made of wood, blending harmoniously with the natural surroundings. [...] captures a moment of quiet repose, with the boats gently rocking in the gentle ripples of the water. The scene is devoid of any human activity, emphasizing the stillness and beauty of the harbor. The overall composition of the image, with its balanced use of color and light, creates a visually pleasing and idyllic coastal setting.
Figure annotation: Missing detail

Figure A4. The baseline produces an incomplete and overly generic description that misses key scene elements, whereas our fine-grained quantization preserves detailed coastal features and provides a rich interpretation closely aligned with the full-precision model.
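
The example labels above use the WxAy convention, where x is the weight bit width and y is the activation bit width, so W3A16 denotes 3-bit weights with 16-bit activations. As a rough illustration of what a 3-bit weight grid implies, the following sketch applies plain asymmetric round-to-nearest quantization to a small list of made-up weights. This is not the method proposed in this paper nor MBQ; the function name `quantize_weights` and the sample values are hypothetical, and activations are not modeled at all.

```python
def quantize_weights(weights, bits):
    """Asymmetric uniform round-to-nearest quantization of a list of floats.

    Illustrative only: a generic baseline, not the paper's method.
    Returns the integer codes and the dequantized (reconstructed) values.
    """
    levels = 2 ** bits - 1                      # highest integer code
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant tensor, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    codes = [min(max(c, 0), levels) for c in codes]   # clamp to the valid grid
    dequant = [w_min + c * scale for c in codes]
    return codes, dequant


# Hypothetical weight values, chosen only for illustration.
weights = [-0.9, -0.35, -0.1, 0.0, 0.2, 0.45, 0.7, 1.0]
codes3, deq3 = quantize_weights(weights, bits=3)

for w, c, d in zip(weights, codes3, deq3):
    print(f"w={w:+.3f}  code={c}  dequant={d:+.3f}  error={abs(w - d):.4f}")
```

With 3 bits, every weight must be mapped to one of only eight grid points, and the per-weight rounding error is bounded by half the grid spacing. That coarse grid is the regime in which the W3A16 examples above are produced.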