Paper deep dive
Rethinking Token Reduction for Large Vision-Language Models
Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/26/2026, 2:33:05 AM
Summary
The paper introduces MetaCompress, a learning-based, prompt-agnostic token reduction method for Large Vision-Language Models (LVLMs) in multi-turn Visual Question Answering (MT-VQA) scenarios. It addresses the limitations of heuristic-based reduction methods by formulating token reduction as a learnable compression mapping, enabling efficient and accurate visual token processing across dialogue turns.
Entities (5)
Relation Signals (3)
MetaCompress → addresses → MT-VQA
confidence 95% · we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs... leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored.
MetaCompress → outperforms → FastV
confidence 90% · Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs
LVLM → uses → MetaCompress
confidence 85% · MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns.
Cypher Suggestions (2)
Identify the task addressed by MetaCompress · confidence 95% · unvalidated
MATCH (m:Method {name: 'MetaCompress'})-[:ADDRESSES]->(t:Task) RETURN t.name
Find all methods related to token reduction in LVLMs · confidence 90% · unvalidated
MATCH (m:Method)-[:REDUCES_TOKENS_FOR]->(l:ModelArchitecture) RETURN m.name, l.name
Abstract
Abstract: Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at this https URL.
Tags
Links
- Source: https://arxiv.org/abs/2603.21701v1
- Canonical: https://arxiv.org/abs/2603.21701v1
Full Text
61,410 characters extracted from source content.
Rethinking Token Reduction for Large Vision-Language Models Yi Wang 1∗ , Haofei Zhang 2,3∗ , Qihan Huang 1 , Anda Cao 1 , Gongfan Fang 5 , Wei Wang 6 , Xuan Jin 6 , Jie Song 4 , Mingli Song 1,2,3,4 , Xinchao Wang 5† 1 College of Computer Science and Technology, Zhejiang University 2 State Key Laboratory of Blockchain and Data Security, Zhejiang University 3 Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security 4 School of Software Technology, Zhejiang University 5 National University of Singapore 6 Alibaba Group Abstract Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practi- cal multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbi- trary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though tech- nically applicable to multi-turn settings, rely on heuristic re- duction metrics such as attention scores, leading to subopti- mal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcom- ing the limitations of heuristic designs. We begin by formu- lating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we intro- duce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that Meta- Compress achieves superior efficiency–accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available athttps://github.com/ MArSha1147/MetaCompress. 1. Introduction LVLMs [1,4,35–37,69,70] have become powerful AI sys- tems enabling natural human interaction with visual data such as images and videos. They encode both textual and * Equal Contribution. Email: yw@zju.edu.cn, haofeizhang@zju.edu.cn † Corresponding Author. Email: xinchao@nus.edu.sg visual modalities into tokens jointly processed by a unified Large Language Model (LLM). Recent works [37,70] fur- ther extend LVLMs toward multi-scale visual inputs that integrate both global and local tokens to enhance visual understanding. However, visual tokens greatly increase computation and memory costs, as token numbers grow by thousands and attention scales quadratically with sequence length [13,30], making low-latency or resource-constrained deployment challenging [64]. Although numerous token reduction techniques [9,49, 65] have been proposed and have achieved considerable success, they are primarily developed for single-turn VQA. Meanwhile, the more realistic MT-VQA setting, which in- volves multi-round conversational question answering, re- mains largely underexplored thus far. Compared to single- turn VQA, which focuses on answering one-shot questions and can greedily discard image tokens irrelevant to the cur- rent query, MT-VQA poses additional challenges due to its open-ended nature. 
In MT-VQA, future questions are en- tirely unpredictable, and their relevant regions may arise anywhere in the image, making existing token reduction methods difficult to apply directly. Existing token reduction methods can be broadly cat- egorized into prompt-dependent and prompt-agnostic ap- proaches. Prompt-dependent methods, such as FastV [9], retain tokens that are highly relevant to only the first-turn question prompt. This strategy may inadvertently discard visual information that could be crucial for answering sub- sequent questions; for example, the first question might focus on the foreground while later questions may reference the background. In contrast, prompt-agnostic approaches like PruMerge [49] reduce tokens solely based on attention scores within the image sequence itself, which are tech- nically applicable to multi-turn interactions. However, a critical limitation of existing prompt-agnostic methods lies in their reliance on heuristic reduction criteria derived from human priors, and the lack of theoretical guidance, often resulting in suboptimal performance. arXiv:2603.21701v1 [cs.CV] 23 Mar 2026 In response to this, we propose a learning-based prompt- agnostic token reduction approach, termed MetaCompress, which overcomes the drawbacks of heuristic designs. To achieve this, a key question is how the learning objective should be defined. By analyzing the reduction formats of current approaches, including both pruning and merging, we find that they can be unified by formulating the visual token reduction task as an optimization problem. The goal is to identify an optimal compression mapping of the input visual tokens, under conditions such as language conditioning in prompt-dependent approaches and image-only conditioning in prompt-agnostic approaches, so that the model’s responses exhibit minimal discrepancy after token reduction. Based on this formulation, we first simplify the problem by learning an optimal compression matrix for each image and conduct a preliminary investigation into the guiding role of attention information, as commonly employed in previous methods. Surprisingly, our findings reveal that the tokens retained by the learned matrix do not exhibit an obvious relationship with the heuristic attention cues commonly used in prior methods, such as [CLS] token attention and prompt- token attention, further validating that heuristic reduction criteria are suboptimal. Furthermore, to fully implement MetaCompress, a prac- tical challenge arises from the need to generate multiple compression matrices, since actual image inputs can vary in resolution. And learning specific compression matrices for every possible resolution is not an especially elegant or practical solution. To address this, we ultimately design to learn a compression matrix generator compatible with dynamic resolutions, trained in a data-efficient paradigm. Extensive experiments on three MT-VQA benchmarks using five LVLM architectures demonstrate that MetaCompress outperforms existing token reduction methods while achiev- ing high computational efficiency. The contributions of our paper are summarized as follows: •We first explore token reduction in the MT-VQA scenario, revealing that heuristic methods relying on visual token attention scores are suboptimal. •We propose MetaCompress, a novel learning-based and prompt-agnostic token reduction method, overcoming the reliance on suboptimal heuristic reduction criteria. 
•MetaCompress leverages a data-efficient training paradigm to learn the optimal compression mapping for visual sequences, demonstrating the effectiveness and efficiency through extensive experiments. 2. Related Work 2.1. Efficient Large Vision-Language Models LVLMs. Transformers [57] have unified architectures across language [6,16,46] and vision [7,17,23,26,53], then CLIP [45] bridges both modalities through contrastive pre- training, enabling zero-shot visual understanding. Based on this, LVLMs [2,35–37,58,66,70] integrate visual en- coders with large language models to perform multimodal tasks such as captioning and VQA. LLaVA [35,36] achieves image-to-text generation by feeding CLIP-encoded visual to- kens and language tokens into an LLM e.g., Llama [54,55], but its fixed global resolution restricts fine-grained per- ception. Recent models such as LLaVA-NeXT [37] and InternLM-XComposer-2.5 [70] enhance visual understand- ing by incorporating multi-scale visual inputs that combine global and local tokens, but this substantially increases token counts, leading to heavy memory and computation over- head in multi-head attention and auto-regressive decoding, particularly on resource-constrained devices. Model Quantization. To deploy LVLMs to low-memory devices such as mobile while preserving the model’s per- formance, a natural approach is to quantize the model and inference process into 4/8-bit [15,19,67] or even 1-bit [42]. Another line of work focuses on reducing the computational burden of MHA by employing efficient attention mech- anisms [10,12,13,32] or sparse attention [11,59,60]. However, quantization methods are limited by optional fine- tuning and hardware support, and more importantly, they do not solve the overall computational inefficiency caused by the increasing number of visual tokens. Model Pruning. Model pruning [20,40,43,71] and knowl- edge distillation [25,48,61] methods compress the given model to arbitrary size by removing redundant parameters or transferring knowledge from a large model to a smaller one. These approaches are generally effective at reducing model size and inference cost, but often require careful hy- perparameter tuning and expensive retraining procedure. 2.2. Visual Token Reduction Recent studies show that image representations contain sub- stantial redundancy [14,24], enabling feature reduction with- out significant performance loss [5,8,29]. This observa- tion has motivated the development of token reduction tech- niques for LVLMs, which can be broadly categorized into prompt-dependent and prompt-agnostic approaches. Prompt- dependent methods, such as FastV [9], identify redundant tokens by measuring their attention to language prompts and remove them at specific layers. FitPrune [65] identifies redundant tokens by minimizing the divergence of atten- tion distributions before and after pruning. IVTP [27] and TRIM [51] employ CLIP’s text encoder to guide token reduc- tion. However, these methods are less applicable to general scenarios such as MT-VQA, as they require re-compression for each question. In contrast, prompt-agnostic methods rely solely on image sequences and are technically applicable to MT-VQA tasks. Nevertheless, existing approaches like LLaVA-PruMerge [49] overlook compatibility with mod- ern LVLMs that incorporate multi-scale vision towers (e.g., LLaVA-NeXT). More importantly, these methods rely heav- ily on heuristic reduction criteria derived from human priors, which often lead to suboptimal performance. 
To address this, this paper introduces a novel learning-based, prompt-agnostic token reduction method that avoids heuristic designs (e.g., attention to [CLS] or other tokens as reduction guidance, which will be shown suboptimal later in this paper) and can integrate seamlessly with modern LVLMs.

3. Preliminaries

In this section, we first give a brief review of the inference process of LVLMs, particularly in the context of multi-turn dialogue scenarios. We then introduce the problem definition of the visual sequence compression mapping.

3.1. Large Vision-Language Models

Given an input image I_IMG, LVLMs are required to generate a series of responses (R_1, ..., R_t) to the user's prompts (P_1, ..., P_t). The image and the language context are tokenized separately by a vision tower T_IMG(·), e.g., a vision Transformer (ViT) [17,45], and a language tokenizer T_TXT(·), e.g., SentencePiece [31]. The tokenized image and language sequences are then embedded into a common space by a vision projector V_IMG(·) and an embedding layer E_TXT(·), producing X_IMG and X_TXT, respectively.

To fully capture detailed information, current prevalent LVLMs, such as LLaVA-NeXT and InternLM-XComposer-2.5, employ a ViT to encode images from both global and local views, generating multi-scale visual sequences. Despite enhancing the model's capability to capture image details, such an approach significantly increases the token number, severely impairing inference efficiency due to the O(n^2) computational and memory complexity of MHA [57].

To increase LLMs' inference efficiency, KV cache methods [34,62] reuse intermediate attention states in the auto-regressive decoder. Specifically, for generating the i-th response token with query q_i, the original computation concatenates the previous queries for the MHA layer:

$$\mathrm{MHA}(Q_i, K_i, V_i) = \sigma\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i, \tag{1}$$

where (Q_i, K_i, V_i) = (Q_{i-1} | q_i, K_{i-1} | k_i, V_{i-1} | v_i) are the concatenated inputs, and σ denotes the row-wise Softmax operation. To decrease the computation complexity, KV caches store the intermediate key-value pairs K_{i-1}, V_{i-1} and only compute the attention for q_i:

$$\mathrm{MHA}(q_i, k_i, v_i) = \sigma\!\left(\frac{q_i\,(K_{i-1}\,|\,k_i)^\top}{\sqrt{d_k}}\right)(V_{i-1}\,|\,v_i), \tag{2}$$

which significantly reduces the computation burden. Such techniques can be seamlessly integrated with the MT-VQA setting, where the caches are reused across multiple turns (a minimal sketch of this cached decoding step is given below).

3.2. Visual Token Reduction

However, the aforementioned cache mechanism is still insufficient to address the memory and computation overhead caused by the large number of image tokens, resulting in an O(n^2) cost for generating the first token and an O(nT) cost for producing T tokens during multi-turn dialogues.

To alleviate this issue, token reduction methods are proposed to compress the image sequence. For simplicity, we only consider reducing image tokens right before feeding them into the LLM, e.g., Llama [54], which can be formulated as

$$\tilde{X}_{\mathrm{IMG}} = P_{\mathrm{reduce}}(X_{\mathrm{IMG}} \mid I_{\mathrm{IMG}}, I_{\mathrm{TXT}}), \tag{3}$$

where the guiding information is extracted from the input image I_IMG and the language context I_TXT. Depending on whether they rely on the prompt I_TXT, token reduction methods can be categorized into prompt-dependent and prompt-agnostic methods. However, in real-world applications, LVLMs are often required to respond to multiple prompts. Prompt-dependent methods tend to bias toward the initial query and discard information beneficial for subsequent turns, leading to suboptimal performance in multi-turn dialogue scenarios.
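To make the cached decoding step of Equation (2) concrete, the following is a minimal, single-head, unbatched PyTorch sketch; the function and tensor names are illustrative and are not taken from the paper's released code.

```python
import torch

def cached_decode_step(q_i, k_i, v_i, k_cache, v_cache):
    """One auto-regressive decoding step with a KV cache (cf. Eq. (2)).

    q_i, k_i, v_i: (1, d_k) query/key/value of the newly generated token.
    k_cache, v_cache: (t, d_k) keys/values of all previously processed tokens;
    in MT-VQA the same cache can be reused across dialogue turns.
    """
    d_k = q_i.shape[-1]
    # K_i = (K_{i-1} | k_i), V_i = (V_{i-1} | v_i): append the new key/value.
    k_cache = torch.cat([k_cache, k_i], dim=0)
    v_cache = torch.cat([v_cache, v_i], dim=0)
    # Attention is computed only for the single new query token.
    scores = q_i @ k_cache.T / d_k ** 0.5    # (1, t + 1)
    weights = torch.softmax(scores, dim=-1)  # row-wise Softmax sigma
    out = weights @ v_cache                  # (1, d_k)
    return out, k_cache, v_cache
```

Because the cached keys and values include every visual token, shrinking the visual sequence also shrinks the cache itself, which is why token reduction compounds with, rather than replaces, KV caching.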
Furthermore, many existing methods require the intermediate attention matrices in MHA layers to guide token reduction, whereas modern LVLMs commonly employ FlashAttention [12,13] or Memory-Efficient Attention [32], which do not support returning them.

To further investigate the optimal token reduction strategy, we first unify the token pruning and merging methods by formulating the reduction process as a linear projection of the input X_IMG:

$$\tilde{X}_{\mathrm{IMG}} = P X_{\mathrm{IMG}}, \tag{4}$$

where P ∈ R_+^{m×n} (m ≪ n) is the sparse compression matrix. In Section 4, we set P as a learnable matrix for each image. By optimizing P in a data-driven manner, we first explore the relationship between the retained tokens and the attention weights employed by heuristically designed methods. Then, in Section 5, we present a novel token reduction method that does not rely on the intermediate attention matrices and can be seamlessly integrated with modern LVLMs.

4. Which Tokens to Keep?

To objectively analyze the optimal token reduction scheme without relying on hand-crafted designs, we start by looking at a simpler case: given an input image I_IMG and a conversation context I_TXT, find the optimal compression matrix P* as defined in Equation (4) so that the response discrepancy between the LLM using the compressed and the uncompressed visual sequence is minimized.

To achieve this, let P_raw ∈ R^{m×n} be the trainable reduction parameters, with each element independently drawn from a Gaussian distribution N(0, σ_raw^2). We normalize P_raw with a row-wise Softmax to obtain the compression matrix P = σ(P_raw). Let p(y | X_IMG, X_TXT) denote the original prediction distribution over the T tokens y = (y_1, ..., y_T). We then force the LLM to generate T tokens with distribution p(ỹ | X̃_IMG, X_TXT) using the compressed visual sequence X̃_IMG. Figure 1a illustrates the overall training pipeline, where P_raw is trained to minimize the KL divergence between the two response distributions, L_pred = D_KL(p(y) ‖ p(ỹ)), together with the distribution entropy L_entropy = (1/m) Σ_{i=1}^{m} H(P_{i,:}). The training objective is formulated as

$$P^{*} = \arg\min_{P_{\mathrm{raw}}} \; \mathcal{L}_{\mathrm{pred}} + \alpha \mathcal{L}_{\mathrm{entropy}}. \tag{5}$$

The training algorithm and detailed implementation are provided in Section 8, and a minimal sketch of this optimization is given later in this section.

Figure 1. (a) Overall pipeline of the compression projection training process. (b) Attention distribution over the [CLS] token for retained and all visual tokens. The image tokens are extracted from the last layer of the vision tower of LLaVA-1.5-13b running on the VQA-v2 dataset. (c) Attention distribution over the prompt tokens for retained and all visual tokens. The attention scores are averaged over prompt tokens extracted from the first layer of the LLM decoder.

Figures 1b and 1c visualize the attention distribution over the [CLS] and prompt tokens, respectively. Although some tokens with high attention to the [CLS] token are retained (accounting for approximately 1.71% of the total retained tokens), the vast majority of the retained tokens are unrelated to their attention scores, especially with regard to the attention to the language prompts. More results are given in Section 6.6 and draw the same conclusion. This observation suggests that using attention scores as guidance for token reduction is suboptimal in the MT-VQA scenario, which explains the experimental results in Section 6.2, where token pruning methods such as FastV perform worse than uniform or even random pruning.
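The per-image optimization above can be sketched as follows. This is a simplified, single-example illustration that assumes a frozen LVLM exposed through a hypothetical `llm_logits(x_txt, x_img)` helper returning token-level logits; it is not the authors' released implementation, and the default hyperparameter values mirror those reported in Section 8.1.

```python
import torch
import torch.nn.functional as F

def learn_compression_matrix(x_img, x_txt, llm_logits, m,
                             steps=500, lr=10.0, alpha=1.0, sigma_raw=0.1):
    """Learn a per-image compression matrix P (Eqs. (4)-(5)), a minimal sketch.

    x_img: (n, d) visual tokens, x_txt: (t, d) text embeddings.
    llm_logits(x_txt, x_img) -> (T, vocab) logits of the frozen LVLM decoder.
    """
    n = x_img.shape[0]
    p_raw = (torch.randn(m, n) * sigma_raw).requires_grad_(True)  # P_raw ~ N(0, sigma_raw^2)
    opt = torch.optim.SGD([p_raw], lr=lr)

    with torch.no_grad():  # reference distribution p(y) from the uncompressed sequence
        p_ref = F.softmax(llm_logits(x_txt, x_img), dim=-1)

    for _ in range(steps):
        p = F.softmax(p_raw, dim=-1)                  # row-wise Softmax -> P
        x_img_reduced = p @ x_img                     # Eq. (4): compressed visual sequence
        log_q = F.log_softmax(llm_logits(x_txt, x_img_reduced), dim=-1)
        l_pred = F.kl_div(log_q, p_ref, reduction="batchmean")   # D_KL(p(y) || p(y~))
        l_entropy = -(p * (p + 1e-12).log()).sum(dim=-1).mean()  # mean row entropy
        loss = l_pred + alpha * l_entropy             # Eq. (5)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return F.softmax(p_raw, dim=-1).detach()
```

In the paper's setup this optimization is run per image-text pair purely as an analysis tool: inspecting which columns of the learned P carry weight is what reveals the weak link to [CLS] and prompt attention.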
Therefore, it is essential to explore a novel token reduction approach that does not rely on heuristic metrics such as attention scores, while being seamlessly compatible with modern LVLMs.

5. Method

The results from Section 4 inspire us to construct the compression matrix in a data-driven manner. To this end, we propose MetaCompress, a lightweight module learning the compression matrix P conditioned only on the input image for MT-VQA scenarios. Section 5.1 details the MetaCompress module, Section 5.2 provides a theoretical analysis, and Section 5.3 presents the optimization objective and training algorithm.

5.1. MetaCompress

Our goal is to learn a compression matrix generator P_meta in a data-driven manner, so that the overall prediction discrepancy on the given dataset D = {(I_IMG^(i), I_TXT^(i))}_{i=1}^{N} is minimized. To this end, we propose a lightweight meta generator P_meta that computes a compression matrix P = P_meta(X_IMG) for each input image I_IMG, independent of the prompt. One major challenge is that P_meta must generate a compression matrix P whose shape adapts to the varying length of X_IMG, thereby accommodating multiple resolution scales for LVLMs such as LLaVA-NeXT and InternLM-XComposer-2.5.

Figure 2 shows the overall architecture of P_meta, which consists of a position embedding layer, a query down-sample projection D̃_q, a key projection D_k, and a weighted inner product layer. The core idea is to compute the inner product between the spatially down-sampled queries X̃_q ∈ R^{m×d_c} and the keys X_k ∈ R^{n×d_c} to obtain the compression matrix P ∈ R^{m×n}. Here, the queries

$$\tilde{X}_q = \tilde{D}_q(X_{\mathrm{IMG}} + E_{\mathrm{pos}}) = \mathrm{Pool}(X_{\mathrm{IMG}} + E_{\mathrm{pos}} \mid k, s)\, W_q \tag{6}$$

are down-sampled from the image sequence encoded with absolute position embeddings E_pos by average pooling Pool(· | k, s) with kernel size k and stride s¹, and the keys

$$X_k = D_k(X_{\mathrm{IMG}} + E_{\mathrm{pos}}) = (X_{\mathrm{IMG}} + E_{\mathrm{pos}})\, W_k \tag{7}$$

are linearly projected from X_IMG for computational efficiency (by setting d_c ≪ d). Finally, the computation of the compression matrix P is formulated as

$$P = \sigma\!\left(\frac{\tilde{X}_q\, \mathrm{diag}(\omega)\, X_k^\top}{\sqrt{d_c}}\right), \tag{8}$$

where the diagonal matrix ω ∈ R^{d_c} is learnable. A minimal sketch of this generator is given at the end of this subsection.

Figure 2. Illustration of our proposed MetaCompress, where the module P_meta generates the compression projection P solely according to the image sequence X_IMG.

Regarding the module placement, following the setting of LLaVA-PruMerge [49], we apply our reduction module only before the LLM decoder, although MetaCompress can in principle be inserted at any layer. This placement reduces the additional MHA computation incurred in earlier LLM layers compared with inserting it at intermediate layers, which is particularly beneficial for long visual inputs such as videos. Moreover, our work focuses on developing a learning-based reduction method rather than a full-stage compression strategy across both the vision tower and the LLM, as explored in IVTP [27]. Such full-stage optimization requires costly pretraining and instruction-tuning, which we leave as future work under our lightweight framework.

For the module design, since our primary goal is to reduce the inference burden of LVLMs, we intentionally avoid constructing complex reduction modules, such as those required for auto-regressive generation, as they would significantly increase latency and reduce computational efficiency. As the first learning-based token reduction framework, there are currently few non-learning approaches available for direct comparison. Nevertheless, we further discuss the relationship between our method and other data-driven approaches for efficient model inference in Section 12.

¹ Section 8.2 provides the detailed implementation for down-sampling X_IMG to arbitrary length m.
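Below is a minimal PyTorch sketch of the generator described by Equations (6)-(8). The class, argument, and constant names (including d_c = 128 in the usage comment) are illustrative, adaptive average pooling stands in for the fractional-stride pooling detailed in Section 8.2, and this is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaCompressSketch(nn.Module):
    """Generates a compression matrix P (m x n) from the visual tokens alone."""

    def __init__(self, d, d_c, max_len=8192):
        super().__init__()
        self.pos = nn.Embedding(max_len, d)          # absolute position embeddings E_pos
        self.w_q = nn.Linear(d, d_c, bias=False)     # query projection W_q
        self.w_k = nn.Linear(d, d_c, bias=False)     # key projection W_k
        self.omega = nn.Parameter(torch.ones(d_c))   # learnable diagonal weights omega
        self.d_c = d_c

    def forward(self, x_img, m):
        """x_img: (n, d) visual tokens; m: target number of tokens (m << n)."""
        n = x_img.shape[0]
        x = x_img + self.pos(torch.arange(n, device=x_img.device))
        # Eq. (6): spatially down-sample to m query tokens, then project with W_q.
        pooled = F.adaptive_avg_pool1d(x.T.unsqueeze(0), m).squeeze(0).T  # (m, d)
        q = self.w_q(pooled)                          # (m, d_c)
        # Eq. (7): project all n tokens to keys with W_k.
        k = self.w_k(x)                               # (n, d_c)
        # Eq. (8): weighted inner product followed by a row-wise Softmax.
        p = torch.softmax(q @ torch.diag(self.omega) @ k.T / self.d_c ** 0.5, dim=-1)
        return p                                      # (m, n)

# Usage: reduce the visual sequence before the LLM decoder at a 90% reduction rate.
# generator = MetaCompressSketch(d=4096, d_c=128)
# p = generator(x_img, m=max(1, int(0.1 * x_img.shape[0])))
# x_img_reduced = p @ x_img
```

Keeping W_q and W_k as plain linear maps with d_c ≪ d keeps the generator's cost negligible next to the LLM decoder, in line with the efficiency comparison reported in Table 2.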
Algorithm 1 Training algorithm for MetaCompress.
Require: E_TXT(·): the language encoder; V_IMG(·): the image encoder; LLM(·,·): the vision-language decoder; P_meta(·|Θ): the proposed MetaCompress module with learnable parameters Θ.
1: for (I_IMG, I_TXT) ∈ D_train do
2:   X_TXT ← E_TXT(I_TXT)
3:   X_IMG ← V_IMG(I_IMG)
4:   X̃_IMG ← P_meta(X_IMG | Θ)  # with gradients
5:   y ← LLM(X_TXT, X_IMG)
6:   ỹ ← LLM(X_TXT, X̃_IMG)  # with gradients
7:   Compute the final loss and the gradient ∇_Θ w.r.t. Θ.
8:   Update Θ with the SGD optimizer.
9: end for

5.2. Theoretical Analysis

We now provide a theoretical analysis of MetaCompress to explain the design motivation and to further introduce the optimization objectives and constraints. To begin with, we expand Equation (8) as

$$P_{\mathrm{raw}} = \mathrm{Pool}(X \mid k, s)\, W_q\, \mathrm{diag}(\omega)\, W_k^\top X^\top. \tag{9}$$

Further, suppose all elements in W are drawn independently from a Gaussian distribution N(0, σ_c^2) with a specific initialization (i.e., W_q = W_k); then MetaCompress will initially behave as a weighted pooling of the input image sequence (controlled by the kernel size k), as we prove in Section 9. Moreover, the meta generator will learn, in a data-driven manner, how to select and merge the visual tokens to minimize the prediction discrepancy. Since we do not rely on any annotation for the compression matrix, the low-rank positive semi-definite form presented in Equation (9) provides a good starting point and facilitates gradient descent optimization.

5.3. Training MetaCompress

Similar to the training objective introduced in Section 4, we train MetaCompress by minimizing the prediction discrepancy L_pred with an additional sparsity regularization L_entropy. However, due to the lack of a ground-truth compression matrix, the generated compression matrix P tends to collapse to trivial solutions where the compressed tokens all derive from the same input source. To avoid this, we add a collapse regularization term L_collapse = max_j Σ_{i=1}^{m} P_{i,j}. Therefore, the final optimization objective is

$$\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \alpha_{\mathrm{entropy}} \mathcal{L}_{\mathrm{entropy}} + \alpha_{\mathrm{collapse}} \mathcal{L}_{\mathrm{collapse}}, \tag{10}$$

where α_entropy and α_collapse are hyperparameters. Algorithm 1 delineates the training procedure.

Table 1. The comparison of visual token reduction methods on three MT-VQA benchmarks with the reduction rate of 90%. The best and the second-best results are highlighted in bold and underline, respectively.
ModelMethod MT-VQA-v2MT-GQAConvBench Acc 1 Acc 2 Acc 3 Avg Acc 1 Acc 2 Acc 3 AvgS 1 S 2 S 3 Avg LLaVA-1.5-7b Base76.72 77.51 77.30 77.1861.76 64.07 65.35 63.734.335.72 5.555.20 Random66.36 66.94 66.68 66.6654.60 57.07 59.31 56.993.733.99 4.083.93 Sample67.11 67.52 67.63 67.4255.06 57.89 59.7457.563.644.85 3.814.10 FastV45.98 48.56 49.65 48.0640.98 46.71 49.30 45.661.561.39 3.122.02 PruMerge69.03 69.93 69.7369.5655.26 57.23 60.13 57.544.513.47 3.473.82 Ours70.27 70.31 71.36 70.6555.95 58.71 60.64 58.434.333.99 4.164.16 LLaVA-1.5-13b Base78.35 79.47 78.92 78.9162.47 65.21 67.22 64.974.337.11 5.725.72 Random67.26 68.05 67.63 67.6554.63 57.87 60.23 57.583.565.55 4.424.51 Sample68.12 68.82 68.47 68.4755.41 58.53 60.26 58.074.165.03 4.334.51 FastV55.36 56.80 57.08 56.4149.08 53.34 56.14 52.852.193.47 4.203.29 PruMerge70.18 71.16 70.70 70.6855.70 57.94 60.7058.114.513.99 5.554.68 Ours72.70 73.24 72.88 72.9457.02 59.26 62.16 59.484.684.51 6.415.20 LLaVA-NeXT-7b Base80.20 80.86 80.71 80.5963.83 66.68 67.94 66.157.95 11.46 7.589.00 Random70.65 72.26 72.44 71.7858.60 61.04 63.00 60.885.816.59 4.425.61 Sample70.88 72.32 72.35 71.8558.46 61.39 63.2461.033.817.28 5.725.60 FastV57.09 59.00 59.27 58.4546.42 50.55 53.95 50.310.001.85 1.851.23 Ours73.83 75.24 75.18 75.1859.43 63.49 65.19 62.704.168.67 9.017.28 LLaVA-NeXT-13b Base81.02 82.32 81.64 81.6665.45 67.32 69.12 67.3012.48 13.17 7.97 11.21 Random71.86 73.44 73.22 72.8459.61 61.86 63.43 61.637.209.19 6.337.57 Sample71.97 73.84 73.43 73.0859.69 62.28 63.8861.956.07 10.92 6.077.69 FastV57.07 59.09 59.14 58.4347.23 51.34 53.36 50.645.175.17 1.724.02 Ours74.62 75.73 75.42 75.2660.78 63.41 65.16 63.126.93 11.27 6.768.32 XComposer-2.5-7b Base78.80 81.12 81.24 80.3960.21 63.38 65.01 62.8712.487.00 7.979.15 Random67.88 70.20 70.49 69.5252.52 57.01 60.67 56.7311.359.88 7.289.50 Sample68.46 70.92 70.78 70.0552.94 57.20 59.69 56.6112.139.53 7.63 9.76 FastV72.65 75.02 75.0274.2354.84 57.33 58.8357.004.174.17 0.002.78 Ours73.91 76.68 76.70 75.7655.99 58.49 61.55 58.6812.319.71 7.639.88 6. Experiments 6.1. Implementation Datasets. We evaluate our method on three MT-VQA bench- marks: MT-VQA-v2, MT-GQA, and ConvBench 2 . MT- VQA-v2 is constructed based on the validation set of VQA- v2 [3,68] with 25k three-turn image-dialogue pairs. Sim- ilarly, MT-GQA is constructed from the testdev-balanced set of GQA [28] with 4061 three-turn dialogues.Con- vBench [38] is a native multi-turn conversation evaluation benchmark with 577 conversations that adopts a three-level multimodal capability hierarchy. Instead of training on the entire dataset, which is time-consuming, we only train Meta- Compress on a small subset (about 20k items) drawn from the training-balanced split of MT-GQA and the training set of MT-VQA-v2. We utilize the pre-trained weights on MT- VQA-v2 to evaluate on ConvBench. 2 As ConvBench relies on GPT-3.5-turbo’s commercial API for evalua- tion, we replace it with the recently released open-source LVLM, Llama- 3.1-8B-Instruct [18]. LVLMs. To evaluate the generalizability of our method, we choose five different LVLMs: LLaVA-1.5-7b/13b [35], LLaVA-NeXT-7b/13b [37], and InternLM-XComposer-2.5- 7b [70]. Of these models, LLaVA-1.5 employs a single-scale vision tower with a fixed visual sequence length, while the others adopt multi-scale perception, resulting in variable visual sequence lengths, which brings further challenges to the token reduction method. Training Details and Selection of Hyperparameters. 
We implement our method with PyTorch [44] and optimize the proposed MetaCompress with SGD [47] with a learning rate of10 −3 . Gradient clipping is adopted with a maximum value of10 −2 . We train all the settings for 2 epochs with a batch size of 36 on four commercial NVIDIA RTX A6000 GPUs. The training of LLaVA-NeXT-7B with a 90% reduction rate takes approximately 30 GPU hours, which corresponds to about only 9 hours on a 4-GPU machine. We initializeW q = W k and drawn from Gaussian distributionN (0, 1 √ d c 2 ) ;ωis set to all ones; α entropy = α collapse = 1 as the default setting. 0.500.700.800.900.95 Reduction Rate (%) 45.0 47.6 50.1 52.7 55.2 57.8 60.4 62.9 65.5 Average Acc (%) Random Sample FastV PruMerge Ours (a) LLaVA-1.5-13b 0.500.700.800.900.95 Reduction Rate (%) 47.5 50.0 52.5 55.0 57.5 60.0 62.5 65.0 67.5 Average Acc (%) Random Sample FastV Ours (b) LLaVA-NeXT-7b Figure 3. Comparison of average accuracy on MT-GQA with reduction rate from 50% to 95%. 6.2. Comparison Results We evaluate the proposed MetaCompress and comparison baselines with the following settings: •Base: The base LVLM evaluated directly on the MT-VQA benchmarks without token reduction. •Random: Randomly prune visual tokens before the first layer of the LLM decoder. We report the average perfor- mance over 3 random seeds. •Sample: Perform equidistant down-sampling on the visual sequence before the LLM decoder. •FastV: Our implementation of FastV [17] for multi-scale vision tower. The guidance attention weights are extracted from the first layer of the LLM decoder and visual tokens are pruned at the second layer. • PruMerge: Perform LLaVA-PruMerge [35] only for LLaVA-1.5, as it is not compatible with the multi-scale visual tower. •Ours: Perform our proposed MetaCompress before the LLM decoder. The comparison results with reduction rate 90% and 70% are shown in Table 1 and Table 4, where we compare the accuracy of each turn conversation and the overall accuracy on three MT-VQA benchmarks. It is noticeable that the pro- posed MetaCompress consistently outperforms the baseline methods. While not trained on ConvBench, our method still surpasses the baseline methods by a large margin, demon- strating the transferability of MetaCompress. For LLaVA- 1.5, experimental results show that LLaVA-PruMerge which is designed specifically for it performs slightly better than Sample, but still lags behind our approach. On the other hand, FastV performs significantly worse than both the Sam- ple and even the Random methods. This further supports our findings in Section 4, where we have revealed that using attention as guidance for compression results in a loss of crit- ical tokens. Although FastV shows some improvement for XComposer-2.5-7b, it still performs poorly on ConvBench. Table 2. Efficiency comparison of different token reduction meth- ods. The time to first token (TTFT, ms), end-to-end generation time (E2ET, ms), GPU memory usage (Mem. GB), and TFLOPs are reported on MT-GQA dataset with a reduction rate of 90%. ModelSettingTTFTE2ETMem. 
TFLOPs LLaVA-1.5-7b Base232 (± 5.8)676 (± 8.0)26.971.4 Random98.2 (± 5.7) 487 (± 4.6)26.213.3 Sample96.9 (± 5.8) 482 (± 4.9)26.213.3 FastV102 (± 4.52) 528 (± 6.0)26.313.5 PruMerge 107 (± 5.23) 509 (± 4.5)26.213.3 Ours97.8 (± 5.41) 480 (± 5.1)26.113.3 LLaVA-NeXT-7b Base484 (± 4.7) 830 (± 13.5) 16.795.3 Random174 (± 3.4)481 (± 4.5)14.812.7 Sample176 (± 3.2)484 (± 5.3)14.812.7 FastV219 (± 5.0) 529 (± 5.33) 19.212.9 Ours174 (± 6.1)501 (± 4.8)14.912.7 Figure 3 illustrates the performance curve of different methods with varying reduction rates from 50% to 95%. It is clear that our method consistently outperforms the baselines across different reduction rates, while FastV performers bet- ter for low reduction rate and LLaVA-PruMerge performers better for high reduction rate. 6.3. Efficiency Results Table 2 compares the inference efficiency of different token reduction methods. As we can observe that our method achieves compatible efficiency with the ‘Sample’ setting, which is the most efficient baseline thanks to the explicit low-ranking mechanism as described in Equation (9). 6.4. Transfer Results Beyond the transfer results on ConvBench in Table 1, we further evaluate the transfer capability of MetaCompress through comprehensive cross-dataset validation. Specifi- cally, we perform transfer learning experiments between MT-GQA and MT-VQA-V2, with the results summarized in Table 7. This table reports the average accuracy under a Table 3. Ablation study of training MetaCompress for LLaVA- NeXT-7b using different loss terms on MT-GQA. Gradient clipping is only applied for the ‘L collapse + Grad Clip’ setting. SettingsMT-GQA L pred L entropy L collapse L collapse + Grad ClipAvg ✓✗61.98 ✓✗62.42 ✓✗✓✗56.34 ✓✗✓62.13 ✓✗✓62.70 90% token reduction rate for both directions of transfer, from MT-GQA to MT-VQA-V2 and vice versa. These results indicate that MetaCompress is not heavily dependent on a specific training dataset, demonstrating robust generalization. We also conduct transfer experiments on the video question answering task, as reported in Table 8 of Section 10.4. 6.5. Ablation Study As delineated in Section 5.3, we utilize three optimization objectives to train the proposed method. To investigate the effectiveness of each objective, we conduct an ablation study by removing one of the objectives at a time. The results in Table 3 (with additional results for various LVLMs in Ta- ble 6) demonstrate that each objective contributes positively to the overall performance. In particular, training utilizing theL collapse alone leads to divergence because of the rela- tively high penalty on the collapse objective, especially when the reduction rate is small (less than 70%). To tackle this, we introduce gradient clipping to stabilize the training process. Besides, we also investigate the sensitivity of the hyperpa- rametersα entropy andα collapse when training MetaCompress. Figure 4 shows the performance curves for different weight settings, demonstrating that the performance remains rela- tively stable (within a 0.5 percentage point variation). 6.6. Visualization Figure 5 visualizes the attention distribution for LLaVA- NeXT-7b, similar to Figure 1. As directly computing the attention to [CLS] token is not feasible for multi-scale vision towers, we compute FastV’s style image token importance instead. 
Nevertheless, we observe that only a small number of tokens with high attention are retained, which is consistent with the conclusion in Section 4 and further demonstrates that using token attention to guide reduction is suboptimal. 7. Conclusion and Outlook This paper proposes a novel token reduction approach for multi-turn VQA scenarios. To this end, we first unify token pruning and merging under the framework of compression projection to visual sequences and explore the optimal com- pression mapping for a single image. Preliminary results reveal that existing methods guided by attention are subopti- 0.00.51.01.5 Loss Weight 62.00 62.25 62.50 62.75 63.00 Average Acc (%) entropy collapse Figure 4. Sensitivity analysis in training MetaCompress for LLaVA- NeXT-7b with different weights α entropy and α collapse on MT-GQA. 0.000.501.001.502.00 Token Importance 0.0 0.2 0.4 0.6 0.8 1.0 Density All Tokens Retained Tokens (a) 0.05.010.015.020.025.030.035.040.045.050.0 Attention Score (×10 4 ) 10 1 10 0 10 1 10 2 10 3 10 4 Density All Tokens Retained Tokens (b) Figure 5. (a) Token importance distribution. (b) Attention distribu- tion over prompt tokens. Image tokens are extracted from the last layer of the vision tower of LLaVA-NeXT-7b on VQA-v2 dataset. mal, as a large number of retained tokens do not correspond to the highest attention scores. This motivates us to further explore the construction of an optimal compression mapping for the entire dataset. To achieve this, we propose Meta- Compress, a meta generator conditioned solely on the visual sequence, and optimized in a data-driven manner. Extensive experiments demonstrate the efficiency and effectiveness of our method. In future work, we will explore the token reduction strategy for all LLM layers without a hand-crafted design, and investigate the transferability of our method to more challenging tasks, such as video understanding. Acknowledgments This work is funded by National Natural Science Foundation of China (62576305), the Alibaba Group through Alibaba Innovative Research Program, and the National Research Foundation, Singapore, under its Medium Sized Center for Advanced Robotics Technology Innovation. References [1]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 1 [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 2 [3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, pages 2425–2433, 2015. 6 [4]Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 1 [5] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In ICLR, 2023. 2, 1 [6]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, et al. Language models are few-shot learners. In NeurIPS, pages 1877–1901, 2020. 2 [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020. 
2 [8]Kaixuan Chen, Jie Song, Shunyu Liu, Na Yu, Zunlei Feng, and Mingli Song. Distribution knowledge embedding for graph pooling. IEEE Transactions on Knowledge and Data Engineering, 35:7898–7908, 2021. 2 [9]Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Jun- yang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference ac- celeration for large vision-language models. arXiv preprint arXiv:2403.06764, 2024. 1, 2 [10]Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, and Xinchao Wang. dparallel: Learnable parallel decoding for dllms. In International Conference on Learning Representa- tions, 2026. 2 [11]Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. ArXiv, abs/1904.10509, 2019. 2 [12]Tri Dao. FlashAttention-2: Faster attention with better paral- lelism and work partitioning. In ICLR, 2024. 2, 3 [13]Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ́ e. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022. 1, 2, 3 [14]Huiqi Deng, Qihan Ren, Hao Zhang, and Quanshi Zhang. Discovering and explaining the representation bottleneck of DNNs. In ICLR, 2022. 2 [15]Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettle- moyer. GPT3.int8(): 8-bit matrix multiplication for trans- formers at scale. In NeurIPS, pages 30318–30332, 2022. 2 [16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional trans- formers for language understanding. In NAACL, pages 4171– 4186, Minneapolis, Minnesota, 2019. 2 [17]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 2, 3, 7 [18] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 6 [19] Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Martin Vechev. Exploiting LLM quantization. arXiv preprint arXiv:2405.18137, 2024. 2 [20]Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. Depgraph: Towards any structural pruning. In CVPR, pages 16091–16101, 2023. 2 [21]Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 2 [22]Benjamin Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014. 1 [23]Le Han, Kaixuan Chen, Minchen Ye, and Nenggan Zheng. Hi-motion: Hierarchical intention guided conditional motion synthesis. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 9628–9637, 2025. 2 [24]Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ́ ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022. 2 [25]Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.ArXiv, abs/1503.02531, 2015. 2 [26] Chaoqun Hong, Liang Chen, Yuxin Liang, and Zhiqiang Zeng. 
Stacked capsule graph autoencoders for geometry- aware 3d head pose estimation. Computer Vision and Image Understanding, 208-209:103224, 2021. 2 [27]Kai Huang, Hao Zou, Ye Xi, BoChen Wang, Zhen Xie, and Liang Yu. IVTP: Instruction-guided visual token pruning for large vision-language models. In ECCV, pages 214–230, Cham, 2025. Springer Nature Switzerland. 2, 5 [28]Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019. 6 [29]Cao Jianjian, Ye Peng, Li Shengze, Yu Chong, Tang Yan- song, Lu Jiwen, and Chen Tao.MADTP: Multimodal alignment-guided dynamic token pruning for accelerating vision-language transformer. CVPR, 2024. 2 [30]Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ̧ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020. 1 [31]Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, pages 66–71, Brussels, Belgium, 2018. Association for Computational Linguistics. 3 [32]Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haz- iza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. xFormers: A modular and hackable transformer modelling li- brary.https://github.com/facebookresearch/ xformers, 2022. 2, 3 [33]Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm. International Journal of Computer Vision, pages 1–19, 2025. 3 [34]Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haf- fari, and Bohan Zhuang. MiniCache: KV cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366, 2024. 3 [35] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 1, 2, 6, 7 [36]Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 2 [37]Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024. 1, 2, 6 [38] Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, et al. Convbench: A multi-turn conversation evaluation bench- mark with hierarchical capability for large vision-language models. arXiv preprint arXiv:2403.20194, 2024. 6 [39]Songhua Liu, Weihao Yu, Zhenxiong Tan, and Xinchao Wang. Linfusion: 1 gpu, 1 minute, 16k image. arXiv preprint arXiv:2409.02097, 2024. 3 [40]Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, K. Cheng, and Jian Sun. Metapruning: Meta learning for automatic neural network channel pruning. ICCV, pages 3295–3304, 2019. 2 [41]Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yum- ing Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4122–4134, 2025. 3 [42]Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wen- hui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024. 
2 [43]Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-pruner: On the structural pruning of large language models. In NeurIPS, 2023. 2, 3 [44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zem- ing Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 32, 2019. 6 [45]Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 2, 3 [46]Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. 2 [47]Sebastian Ruder. An overview of gradient descent optimiza- tion algorithms. arXiv preprint arXiv:1609.04747, 2016. 6 [48]Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019. 2 [49] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388, 2024. 1, 2, 5 [50] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear com- plexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021. 3 [51]Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, and Benyou Wang. Less is more: A simple yet effective token reduction method for efficient multi-modal llms. arXiv preprint arXiv:2409.10994, 2024. 2 [52]Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023. 3 [53]Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv ́ e J ́ egou. Training data-efficient image transformers & distillation through atten- tion. In ICML, pages 10347–10357. PMLR, 2021. 2 [54] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timoth ́ e Lacroix, Baptiste Rozi ` ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2, 3 [55]Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Am- jad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 2 [56]Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokula Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19769–19780, 2025. 3 [57]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkor- eit, Llion Jones, Aidan N Gomez,Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2, 3 [58]Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. 
arXiv preprint arXiv:2205.14100, 2022. 2 [59]Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, and Xinchao Wang. Sparsed: Sparse attention for diffusion lan- guage models. In International Conference on Learning Representations, 2026. 2 [60] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024. 2 [61]Chuanpeng Yang, Yao Zhu, Wang Lu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, and Yiqiang Chen. Survey on knowledge distillation for large language models: Meth- ods, evaluation, and application. ACM Trans. Intell. Syst. Technol., 2024. 2 [62] Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid KV cache com- pression for high-throughput llm inference. arXiv preprint arXiv:2405.12532, 2024. 3 [63] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, 2024. 2 [64] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 1 [65]Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models.arXiv preprint arXiv:2409.10197, 2024. 1, 2 [66]Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mo- jtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022. 2 [67]Zhihang Yuan, Yuzhang Shang, and Zhen Dong. PB-LLM: Partially binarized large language models. In ICLR, 2024. 2 [68] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Yin and yang: Balancing and answering binary visual questions. In CVPR, pages 5014–5022, 2016. 6 [69]Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, et al. InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023. 1 [70]Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, et al. InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output. arXiv preprint arXiv:2407.03320, 2024. 1, 2, 6 [71] Mingjian Zhu, Kai Han, Yehui Tang, and Yunhe Wang. Visual transformer pruning. ArXiv, abs/2104.08500, 2021. 2 Rethinking Token Reduction for Large Vision-Language Models Supplementary Material 8. Implementation Details 8.1. Fixed Compression Matrix The training algorithm forP raw is shown in Algorithm 2, where compression projection matrixP raw is optimized to minimize the objective as described in Equation (5). There are several hyperparameters involved in the training ofP raw , includingαin Equation (5) andσ raw in the initializa- tion ofP raw . We setα = 1andσ raw = 0.1in all experiments. The learning rate is set to10and we train 500 epochs for each image-text pair. Algorithm 2 Training algorithm for MetaCompress. Require: (I IMG ,I TXT ):the image-text pair;E TXT (·): the language encoder;V IMG (·): the image encoder; LLM(·,·): the vision-language decoder;P raw : the learn- able compression matrix with shape m× n. 1: Initialize P raw with Gaussian distributionN (0,σ 2 raw ). 
2: X TXT ← E TXT (I TXT ) 3: X IMG ← V IMG (I IMG ) 4:y ← LLM(X TXT ,X IMG ) 5: while not converged do 6: ̃ X IMG ← σ(P raw )X IMG # with gradients 7: ̃y ← LLM(X TXT , ̃ X IMG )# with gradients 8: Compute the final loss and gradient∇ P raw w.r.t.P raw . 9:Update P raw with SGD optimizer. 10: end while 8.2. MetaCompress To adapt to arbitrary compression rate, the stridekin the pooling operation is set to a float value n m , which can be easily implemented by the fractional max pooling opera- tion [22]. We set the kernel size s = 3 for all experiments. 9. Properties of MetaCompress Now, we analyze the properties of MetaCompress when W q = W k = Wand are drawn fromN (0,σ 2 w )in Equa- tion (9). We start by considering when kernel sizek = 1, meaning thatPool(X)is a down-sampling operation toX, and we let ̃ Xdenotes the down-sampled image sequence. In this case, Equation (9) can be simplified as P = ̃ XSX ⊤ ,(11) whereSis a positive semi-definite matrix. Therefore, the expectationE[p i,j ] = 0for all positions that ̃x i ̸= x j , and for ̃x i = x j = x E[p i,j ] = E[xSx ⊤ ] = E[xW diag(ω)W ⊤ x ⊤ ].(12) Here, we notice thaty = xWis still a random vector, with all elements subject toN (0,dσ 2 c σ 2 w ). Hence, E[y diag(ω)y ⊤ ] = d c σ 2 c σ 2 w .(13) Considering that the embedding dimension of LVLMs is a large number (e.g., 4096 for LLaVA-1.5-7b),σ(P raw )is close to the down-sampling projection to the input image sequence controlled by the stride s. Further, when the kernel sizek > 1, the expectation ofp i,j is still zero when ̃x i is not captured by the pooling kernel located inx i . To sum up, the initialization to Equa- tion (9) converting MetaCompress to a interpretable pooling operation to the input image sequence. However, as training progresses,W q diverges fromW k , breaking the positive semi-definiteness of matrixS, enabling MetaCompress to further explore more effective compres- sion strategies, ultimately enhancing performance. Besides, we choose the compression embedding dimensiond c to be smaller than the original embedding dimensiondto reduce the computational cost and number of parameters. Essentially, Equation (9) is a specialized form of the dot-product attentionXW Q W ⊤ K X ⊤ , making it easier to op- timize and less prone to over-fitting to the training dataset, as we only adopt a few-shot subset for training efficiency. 10. More Results 10.1. Performance of Fixed Compression Matrix Because we train the compression matrixP raw for a single image on the training dataset, which is a straightforward opti- mization problem, we do not compare it with other methods. Here, we present the overall accuracy about compressing LLaVA-Next-7b on the MT-VQA-v2 dataset for reference. The accuracy of the base setting is 82.44, and when reducing 90% of the image token the accuracy decreases to 80.89. 10.2. More comparison Results Table 4 presents the comparison results of different token reduction method with the compression rate of 70%. Our method achieves the best overall performance, the same as the results in Table 1. As a supplement, Table 5 compares the effectiveness of token reduction methods. Here, ‘Spatial’ represents applying spatial pooling to the image sequence (the kernel sizekis set to the same as the strides). ‘ToMe’ [5] is a token merging Table 4. The comparison of visual token reduction methods on three MT-VQA benchmarks with the reduction rate of 70%. The best and the second-best results are highlighted in bold and underline, respectively. 
10. More Results

10.1. Performance of the Fixed Compression Matrix

Because we train the compression matrix P_raw for a single image on the training dataset, which is a straightforward optimization problem, we do not compare it with other methods. Here, we report the overall accuracy of compressing LLaVA-NeXT-7b on the MT-VQA-v2 dataset for reference: the accuracy of the base setting is 82.44, and when 90% of the image tokens are reduced, the accuracy decreases to 80.89.

10.2. More Comparison Results

Table 4 presents the comparison results of different token reduction methods at a compression rate of 70%. Our method achieves the best overall performance, consistent with the results in Table 1. As a supplement, Table 5 compares the effectiveness of token merging methods. Here, 'Spatial' denotes applying spatial pooling to the image sequence (the kernel size s is set equal to the stride k). 'ToMe' [5] is a token merging method proposed for ViTs rather than LVLMs, and thus performs ineffectively in our setting. 'VisionZip' [63] is a hybrid token compression method that integrates both token pruning and merging, yet it does not take MT-VQA scenarios into account and therefore also underperforms our method.

Table 4. Comparison of visual token reduction methods on three MT-VQA benchmarks with a reduction rate of 70%.

| Model | Method | MT-VQA-v2 Acc 1 | Acc 2 | Acc 3 | Avg | MT-GQA Acc 1 | Acc 2 | Acc 3 | Avg | ConvBench S 1 | S 2 | S 3 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7b | Base | 76.72 | 77.51 | 77.30 | 77.18 | 61.76 | 64.07 | 65.35 | 63.73 | 4.33 | 5.72 | 5.55 | 5.20 |
| | Random | 72.72 | 73.57 | 73.11 | 73.13 | 57.79 | 61.07 | 63.09 | 60.65 | 4.17 | 4.73 | 5.51 | 4.80 |
| | Sample | 72.96 | 73.71 | 73.33 | 73.33 | 58.48 | 61.12 | 62.30 | 60.63 | 4.16 | 5.03 | 3.81 | 4.33 |
| | FastV | 69.30 | 69.57 | 69.41 | 69.43 | 54.79 | 57.65 | 60.03 | 57.49 | 3.99 | 5.03 | 3.47 | 4.16 |
| | PruMerge | 72.79 | 73.90 | 73.42 | 73.37 | 57.89 | 59.54 | 61.81 | 59.75 | 3.64 | 4.68 | 4.68 | 4.33 |
| | Ours | 75.67 | 76.63 | 76.46 | 76.25 | 58.62 | 60.96 | 63.64 | 61.07 | 3.99 | 5.03 | 4.85 | 4.62 |
| LLaVA-1.5-13b | Base | 78.35 | 79.47 | 78.92 | 78.91 | 62.47 | 65.21 | 67.22 | 64.97 | 4.33 | 7.11 | 5.72 | 5.72 |
| | Random | 73.63 | 74.56 | 73.96 | 74.05 | 57.74 | 61.24 | 63.33 | 60.77 | 4.03 | 4.89 | 5.24 | 4.72 |
| | Sample | 73.81 | 74.79 | 74.51 | 74.37 | 58.51 | 61.29 | 62.82 | 60.87 | 3.64 | 5.03 | 5.37 | 4.68 |
| | FastV | 73.85 | 75.18 | 74.58 | 74.54 | 57.99 | 60.75 | 63.51 | 60.75 | 4.37 | 6.92 | 5.10 | 5.46 |
| | PruMerge | 73.80 | 75.16 | 74.57 | 74.51 | 57.67 | 60.45 | 62.18 | 60.10 | 4.51 | 6.41 | 5.37 | 5.43 |
| | Ours | 74.03 | 76.98 | 76.31 | 75.77 | 59.48 | 61.91 | 65.25 | 62.21 | 4.33 | 6.93 | 5.37 | 5.55 |
| LLaVA-NeXT-7b | Base | 80.20 | 80.86 | 80.71 | 80.59 | 63.83 | 66.68 | 67.94 | 66.15 | 7.95 | 11.46 | 7.58 | 9.00 |
| | Random | 76.18 | 77.43 | 77.64 | 77.08 | 61.96 | 63.95 | 66.29 | 64.07 | 7.97 | 9.19 | 6.93 | 8.03 |
| | Sample | 76.60 | 77.93 | 77.96 | 77.50 | 62.28 | 64.61 | 66.17 | 64.35 | 7.63 | 8.32 | 4.33 | 6.76 |
| | FastV | 75.96 | 76.86 | 76.39 | 76.40 | 61.54 | 64.37 | 65.97 | 63.96 | 0.00 | 0.00 | 2.50 | 0.83 |
| | Ours | 77.75 | 78.06 | 78.54 | 78.12 | 63.38 | 64.69 | 67.59 | 65.22 | 7.63 | 9.88 | 7.45 | 8.32 |
| LLaVA-NeXT-13b | Base | 81.02 | 82.32 | 81.64 | 81.66 | 65.45 | 67.32 | 69.12 | 67.30 | 12.48 | 13.17 | 7.97 | 11.21 |
| | Random | 77.30 | 78.77 | 78.65 | 78.24 | 62.89 | 64.64 | 67.22 | 64.92 | 11.44 | 11.27 | 8.15 | 10.29 |
| | Sample | 77.51 | 79.15 | 79.03 | 78.56 | 63.95 | 64.54 | 67.20 | 65.23 | 10.40 | 14.04 | 6.76 | 10.40 |
| | FastV | 75.78 | 77.16 | 76.66 | 76.53 | 62.10 | 64.37 | 65.06 | 63.84 | 20.00 | 5.00 | 3.33 | 9.44 |
| | Ours | 78.14 | 80.98 | 80.21 | 79.78 | 64.86 | 65.89 | 67.59 | 66.11 | 10.92 | 13.86 | 7.11 | 10.63 |
| XComposer-2.5-7b | Base | 78.80 | 81.12 | 81.24 | 80.39 | 60.21 | 63.38 | 65.01 | 62.87 | 12.48 | 7.00 | 7.97 | 9.15 |
| | Random | 74.63 | 77.23 | 77.44 | 76.43 | 56.96 | 61.59 | 63.63 | 60.73 | 15.42 | 10.23 | 7.97 | 11.21 |
| | Sample | 75.07 | 77.68 | 78.04 | 76.93 | 57.79 | 61.17 | 63.36 | 60.77 | 15.94 | 11.79 | 6.07 | 11.27 |
| | FastV | 77.93 | 80.18 | 80.05 | 79.39 | 58.95 | 61.34 | 62.67 | 60.99 | 12.50 | 4.17 | 12.50 | 9.72 |
| | Ours | 78.24 | 80.53 | 80.79 | 79.85 | 60.11 | 62.28 | 64.14 | 62.18 | 15.77 | 12.13 | 6.24 | 11.38 |

Table 5. Comparison results of different token merging methods for LLaVA-1.5-7b.

| Setting | MT-GQA Acc 1 | Acc 2 | Acc 3 | Avg |
|---|---|---|---|---|
| Base | 61.76 | 64.07 | 65.35 | 63.73 |
| Sample | 54.60 | 57.07 | 59.31 | 56.99 |
| Spatial | 51.05 | 54.62 | 56.24 | 53.97 |
| ToMe | 53.64 | 56.91 | 57.62 | 56.06 |
| VisionZip | 55.08 | 57.82 | 59.89 | 57.60 |
| Ours | 55.95 | 58.71 | 60.64 | 58.43 |
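For reference, the Acc 1/Acc 2/Acc 3 columns above are per-turn scores, and the reported Avg is consistent with their unweighted mean (e.g., (76.72 + 77.51 + 77.30) / 3 ≈ 77.18 for the LLaVA-1.5-7b base row of Table 4). A minimal sketch of that aggregation is shown below, assuming a flat list of (turn, is_correct) records; the actual scoring scripts are those of the respective benchmarks.

```python
from collections import defaultdict

def per_turn_accuracy(records):
    """records: iterable of (turn_index, is_correct) pairs for one benchmark.
    Returns accuracy per dialogue turn plus their unweighted mean ('avg')."""
    correct, total = defaultdict(int), defaultdict(int)
    for turn, ok in records:
        total[turn] += 1
        correct[turn] += int(ok)
    acc = {t: 100.0 * correct[t] / total[t] for t in sorted(total)}
    acc["avg"] = sum(acc[t] for t in sorted(total)) / len(total)
    return acc

# Toy example with two questions per turn over three turns.
print(per_turn_accuracy([(1, True), (1, False), (2, True), (2, True), (3, False), (3, True)]))
```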
10.3. More Ablation Results

The results in Table 6 provide additional ablation studies across various LVLMs, further supporting the results in Table 3 and demonstrating that each objective contributes positively to the overall performance.

Table 6. Additional ablation study of training MetaCompress for various LVLMs using different loss terms on MT-GQA. Gradient clipping is only applied in the 'L_collapse + Grad Clip' setting.

| L_pred / L_entropy / L_collapse / L_collapse + Grad Clip | LLaVA-1.5-7b | LLaVA-NeXT-7b | XComposer-2.5-7b |
|---|---|---|---|
| ✓ ✗ | 56.63 | 61.98 | 56.77 |
| ✓ ✗ | 57.99 | 62.42 | 58.01 |
| ✓ ✗ ✓ ✗ | 52.26 | 56.34 | 52.53 |
| ✓ ✗ ✓ | 57.57 | 62.13 | 58.24 |
| ✓ ✗ ✓ | 58.43 | 62.70 | 58.68 |

10.4. More Transfer Results

Table 7 reports additional transfer validation experiments across LVLMs on MT-VQA-v2 and MT-GQA, showing that MetaCompress does not depend heavily on the specific training dataset and thus generalizes robustly.

Table 7. Transfer validation experiments. Average accuracy is reported for cross-dataset transfer between MT-VQA-v2 and MT-GQA, all under a 90% token reduction rate.

| Setting | LLaVA-1.5-7b | LLaVA-1.5-13b | LLaVA-NeXT-7b | LLaVA-NeXT-13b | XComposer-2.5-7b |
|---|---|---|---|---|---|
| MT-VQA-v2 | 70.65 | 72.94 | 75.18 | 75.26 | 75.76 |
| MT-GQA → MT-VQA-v2 | 69.06 | 71.89 | 73.61 | 73.25 | 74.41 |
| MT-GQA | 58.43 | 59.48 | 62.70 | 63.12 | 58.68 |
| MT-VQA-v2 → MT-GQA | 57.45 | 58.78 | 61.43 | 62.60 | 58.06 |

Furthermore, Table 8 reports transfer results on the video question answering task. In detail, we transformed the video QA dataset Video-MME [21] into a 3-turn dialogue version, referred to as MT-Video-MME (containing 500 dialogues for validation), and conducted comparative evaluations against baseline methods at a 70% compression rate. Since XComposer-2.5-7B natively supports video input, we directly use the pre-trained weights obtained from training on the small combined dataset of MT-VQA-v2 and MT-GQA (only around 20k samples in total, as mentioned earlier) to evaluate on the MT-Video-MME benchmark. As shown in the table, MetaCompress outperforms the baseline approaches even without any task-specific training on MT-Video-MME, further demonstrating its strong transferability.

Table 8. Transfer results on MT-Video-MME. Average accuracy is reported across different methods.

| Metric | Base | Random | Sample | FastV | Ours |
|---|---|---|---|---|---|
| Acc 1 | 44.3 | 26.8 | 25.5 | 26.2 | 28.5 |
| Acc 2 | 46.7 | 25.3 | 26.4 | 27.3 | 27.3 |
| Acc 3 | 48.2 | 30.7 | 31.3 | 31.6 | 34.6 |
| Avg | 46.4 | 27.6 | 27.7 | 28.4 | 30.1 |

11. More Visualizations

Figures 6 and 7 visualize the generated compression projection for LLaVA-1.5-7b and LLaVA-1.5-13b on MT-GQA at a compression rate of 90% (two randomly selected images serve as examples). The row and column indices in the figures represent the original and reduced token indices, respectively, with darker colors indicating higher retention weights. As observed, MetaCompress performs pruning and merging operations at different positions, but is primarily based on equidistant down-sampling, with specific adaptations for certain tokens.

12. Discussions

Most data-driven approaches for efficient model inference primarily focus on model pruning [43, 52], efficient attention mechanisms [39, 50], and efficient model architectures [33, 41, 56], particularly in designing vision encoders for stronger and more compact visual representations. However, these methods typically require fine-tuning the entire model, resulting in substantial computational overhead. In contrast, our proposed MetaCompress trains only a small number of lightweight linear projection layers (D_q, D_k, and w), yet it surpasses existing token reduction approaches. Owing to its efficiency, MetaCompress requires only a modest amount of training data (approximately 20k samples) while exhibiting strong transferability across datasets, as demonstrated in our transfer experiments. This generalization capability stems from the fact that MetaCompress is trained to preserve as much general visual information as possible for multi-turn dialogues rather than being specialized for specific image domains.

Figure 6. Visualization of the compression projection for LLaVA-1.5-7b on MT-GQA with a compression rate of 90%.

Figure 7. Visualization of the compression projection for LLaVA-1.5-13b on MT-GQA with a compression rate of 90%.
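Heatmaps like those in Figures 6 and 7 can be reproduced with a few lines of matplotlib once the learned projection is exported; below is a minimal sketch, assuming the projection is available as a NumPy array with original-token rows and reduced-token columns as described in Section 11 (the array shape, file name, and colormap are illustrative, not taken from the paper's code).

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_compression_projection(proj, path="compression_projection.png"):
    """proj: (n_original, m_reduced) array of retention weights in [0, 1];
    rows index original visual tokens, columns index reduced tokens (cf. Sec. 11)."""
    fig, ax = plt.subplots(figsize=(4, 8))
    im = ax.imshow(proj, aspect="auto", cmap="viridis")  # colour encodes the retention weight
    ax.set_xlabel("reduced token index")
    ax.set_ylabel("original token index")
    fig.colorbar(im, ax=ax, label="retention weight")
    fig.savefig(path, bbox_inches="tight")

# Toy example: random weights for 576 visual tokens compressed to 58 (roughly 90% reduction).
plot_compression_projection(np.random.rand(576, 58))
```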