Paper deep dive
BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization
Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu
Abstract
Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition, which effectively reduces storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
Links
- Source: https://arxiv.org/abs/2603.16590v1
- Canonical: https://arxiv.org/abs/2603.16590v1
Full Text
BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Ji-Fu Li¹, Manyi Zhang¹⋆, Xiaobo Xia², Han Bao¹, Haoli Bai¹, Zhenhua Dong¹, and Xianzhi Yu¹
¹ Huawei Technologies  ² University of Science and Technology of China
lijifu4, zhangmanyi6@huawei.com

Abstract. Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition, which effectively reduces storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
Keywords: Multi-modal Large Language Models · Large Language Models · Quantization · MXFP

⋆ Corresponding author. Preprint. arXiv:2603.16590v1 [cs.CL] 17 Mar 2026

1 Introduction

Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) have recently revolutionized artificial intelligence, demonstrating remarkable capabilities in bridging visual perception with linguistic reasoning [4, 14, 27, 34, 35, 48, 52, 56, 64]. From autonomous driving to medical image analysis, these models are increasingly deployed in real-world scenarios where low latency and memory efficiency are paramount [28, 31, 53, 60, 68]. However, the ever-growing scale of MLLMs and LLMs, often comprising billions of parameters, imposes prohibitive costs on memory bandwidth and computational resources, hindering their deployment on edge devices and resource-constrained platforms.

Post-Training Quantization (PTQ) has emerged as a key solution to mitigate these costs. While integer quantization has been widely studied, the recent emergence of microscaling floating-point formats (MXFP) offers a promising alternative [2, 43]. Supported by next-generation hardware [1, 7, 49], MXFP4 utilizes block-wise scaling to better accommodate the long-tailed distributions inherent in activations, theoretically offering superior dynamic range compared to fixed-point formats. Despite this hardware readiness, achieving accurate 4-bit quantization for MLLMs under the MXFP format remains an unsolved challenge [66, 67].

While existing state-of-the-art PTQ methods are predominantly designed for INT formats [22, 23, 30, 36, 41, 45, 51, 54], their applicability to MXFP formats is contested. Specifically, popular rotation-based techniques (e.g., QuaRot [3] and SpinQuant [33]), which excel in INT4 by spreading outliers via orthogonal transformations, suffer from severe performance collapse when applied to MXFP4 [11, 38].
Recent studies [11, 46] have attributed this failure to the incompatibility between global rotations and the fine-grained quantization settings of MXFP, and further propose block-wise rotation transformation methods. However, these approaches still fail to mitigate extreme outliers within certain blocks, and the Hadamard transform further introduces a bimodal distribution problem (see Figure 2a).

To bridge this gap, in this paper, we introduce BATQuant. The core of our method is the Block-wise Affine Transformation (BAT). Unlike global rotations, BAT restricts the transformation scope to align strictly with the MXFP quantization granularity (e.g., 32 elements). This design prevents the cross-block energy transfer of outliers, ensuring that each block's scaling factor accurately captures its local dynamic range. Moreover, we relax the orthogonality constraint and learn the optimal affine matrices tailored to the MXFP format to minimize quantization error. To address the storage overhead caused by learnable block-wise affine transformations, we further introduce the Global and Private Kronecker (GPK) decomposition, which drastically reduces parameter counts by sharing a global transformation basis across blocks while retaining block-specific private components. Finally, we incorporate Block-wise Learnable Clipping, which dynamically adapts thresholds to suppress residual outliers within quantization blocks.

We validate our BATQuant extensively on both MLLMs and LLMs. Our method achieves near-lossless performance on W4A8KV16 with an accuracy recovery rate exceeding 99%. Furthermore, it establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% on multimodal benchmarks, significantly outperforming existing methods (see Figure 1).

Fig. 1: Quantization performance on Qwen3-VL-8B-Instruct across various methods. Our method yields superior results compared to baselines across all bit-width settings. The advantage is particularly substantial in the W4A4 setting, where our method clearly outperforms existing methods.

Our main contributions are summarized as follows:

– We propose BATQuant, featuring a Block-wise Affine Transformation that aligns with MXFP granularity to prevent energy transfer across blocks and address the bimodal distribution problem for effective quantization. Additionally, we incorporate Global and Private Kronecker decomposition for parameter efficiency.
– We evaluate BATQuant on both MLLMs and LLMs, such as Qwen3-VL-8B-Instruct [4] and Qwen3-8B [61], covering a wide range of challenging settings. The effectiveness is validated on tasks ranging from knowledge understanding to complex reasoning benchmarks, setting new state-of-the-art results in most scenarios.

2 Preliminary

Microscaling Floating-Point Definition. MXFP, proposed by OCP [43], is a family of floating-point formats that employ block-wise quantization. An MXFP format is defined by three components: a sign bit (S), an exponent (E), and a mantissa (M). Each MXFP format uses a fixed block size of 32 elements, with all values in a block sharing a common scaling factor represented in UE8M0 format (8-bit exponent, no mantissa). The standard MXFP4 (E2M1) format uses 1 sign bit, 2 exponent bits, and 1 mantissa bit. This configuration represents 7 distinct positive values: 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, along with their negatives and zero. MXFP8 offers two variants, E4M3 and E5M2. Here, we adopt E4M3 for MXFP8, as a larger mantissa width is more crucial for the performance of fine-grained quantization [6, 39].

Related Work. Initial research on LLM quantization primarily explored integer-based formats [12, 16, 25, 26, 58, 62].
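The E2M1 grid and shared UE8M0 scale described in the Preliminary can be made concrete with a minimal NumPy sketch. This is an illustration only, not the paper's code: `FP4_E2M1_GRID` and `quantize_mxfp4_block` are hypothetical names, and the power-of-two scale rule `2^(floor(log2(amax)) − 2)` (where 2 is the largest E2M1 exponent) is one common MX-style convention; the full spec's saturation and special-value handling are omitted.

```python
import numpy as np

# Representable non-negative magnitudes of the MXFP4 (E2M1) element format:
# 0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0 (plus negatives via the sign bit).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x, block_size=32):
    """Fake-quantize a 1-D array block-wise to MXFP4 with a shared scale.

    Each block of 32 values shares one power-of-two scale (UE8M0 carries
    only an 8-bit exponent); elements are snapped to the nearest E2M1
    grid point. Values beyond 6 * scale saturate to the grid maximum.
    """
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size].astype(np.float64)
        amax = np.abs(block).max()
        if amax == 0.0:
            out[start:start + block_size] = 0.0
            continue
        # Shared power-of-two scale; emax = 2 for E2M1 (largest value 6.0).
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2.0)
        scaled = block / scale
        # Nearest representable magnitude; sign is handled separately.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID).argmin(axis=1)
        out[start:start + block_size] = np.sign(scaled) * FP4_E2M1_GRID[idx] * scale
    return out
```

Note how a single large outlier inflates the shared scale for its entire block, coarsening the grid for the other 31 values; this is the sensitivity that block-wise transformations aim to reduce.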
Fig. 2: Activation distributions for the down_proj module in layer 35 of Qwen3-8B. The central 3D plots illustrate the activations after transformation. We specifically extract Block 5 (without outliers) and Block 295 (with extreme outliers), and visualize the values after scaling factor division but prior to rounding. (a) BRQ: after applying the block Hadamard transform, block 295 exhibits a bimodal distribution, leading to inefficient utilization of the bit width. (b) BATQuant: after the block affine transformation, block 295 shows reduced magnitude compared to subplot (a) while effectively leveraging the floating-point quantization grids.

As NVFP and MXFP formats gain hardware support, quantization accuracy under these formats is also drawing increasing attention [15, 20, 29, 59, 65]. Prior work has shown that MXFP8 achieves lossless quantization, whereas MXFP4 suffers from significant accuracy degradation [66]. For low-bit scenarios, e.g., 4-bit quantization, outliers are considered a severe impediment. The primary methods for suppressing outliers include rotation transformations [17, 23, 50] and affine transformations [36, 47]. The rotation-based methods, such as QuaRot [3] and SpinQuant [33], despite their success in INT4 quantization, underperform even basic RTN when applied to MXFP4. Such global rotations mix dimensional information to suppress outliers and kurtosis, thus disrupting the local statistical properties of fine-grained formats. To address the incompatibility between rotation-based techniques and MXFP4, BRQ [46] utilizes block-wise rotation quantization to mitigate outliers and prevent amplifying small-value blocks.
MR-GPTQ [11], a GPTQ variant optimized for FP4, similarly employs block-wise Hadamard transforms and format-specific adjustments to accommodate FP4's unique properties. Affine transformation-based methods, such as FlatQuant [47], overcome the energy-conservation constraints inherent in rotation-based transformations and enhance quantization accuracy by employing affine transformations. Nevertheless, previous methods still suffer from significant accuracy degradation on MXFP4 quantization, particularly on complex reasoning tasks [66].

Observations and Motivation. We find that block-wise rotation still struggles to suppress extreme outliers in specific blocks, and the Hadamard transform further introduces a bimodal distribution problem. Specifically, we visualize the activations after block-wise Hadamard transformation on Qwen3-8B in Figure 2a. We observe that although the block-wise Hadamard transform reduces the magnitude for the vast majority of blocks, since Hadamard matrices are composed of +1/−1 values, certain blocks with extreme outliers exhibit a bimodal distribution. This results in wasted bit-width and introduces larger quantization errors [10]. Therefore, to address these challenges, we propose BATQuant. As shown in Figure 2b, BATQuant effectively alleviates outliers, while ensuring that the post-transformation data distribution remains amenable to floating-point quantization.

Fig.
3: The overall framework of BATQuant. Bottom: Integration of BATQuant into the Transformer architecture. Weight-side transformations are fused offline into the linear layers, while activation-side transformations are applied online. Top: Exemplary view of the Block-wise Affine Transformation, where inputs are partitioned into MXFP-aligned blocks. Each block transformation is decomposed via the Global and Private Kronecker.

3 Method

In this section, we present BATQuant with the framework illustrated in Figure 3. We first introduce learning optimal block-wise affine transformations in Section 3.1. Afterward, we discuss its integration with the Transformer architecture in Section 3.2. Note that we provide a detailed algorithm flow of our BATQuant in Appendix B.

3.1 Block-wise Affine Transformation

Consider a standard linear layer computation Y = XW⊤, where X ∈ R^{S×N} represents activations and W ∈ R^{M×N} denotes weights. The primary objective is to find the best affine transformation P⋆ ∈ R^{N×N} for each linear layer to quantize:

$$P^\star = \arg\min_{P} \left\| Y - Q(XP)\, Q(P^{-1} W^\top) \right\|_F^2 .$$

Instead of learning a single global matrix, we partition the transformation matrix into k disjoint blocks aligned with the MXFP quantization granularity g (e.g., g = 32). We then construct a block-diagonal affine matrix:

$$P = \mathrm{diag}(P_1, P_2, \ldots, P_k), \quad P_i \in \mathbb{R}^{g \times g}, \quad N = k \cdot g. \tag{1}$$

Table 1: Comparison of decomposition methods on parameter counts and computational cost. For the example parameter count, we set the hidden dim N = 4096 and the MXFP quantization granularity g = 32. The sizes of the decomposed matrices A_i and B_i are set to g_1 = 8 and g_2 = 4. The reported MatMul Complexity refers to the computational cost of the activation transformation XP.
| Method | Decomposition | MatMul Complexity | # Params of P | Example Count |
|---|---|---|---|---|
| FlatQuant | Kronecker | O(SN^{3/2}) | 2N | 8,192 |
| Ours | w/o | O(SNg) | N·g | 131,072 |
| Ours | Naive Kronecker | O(SN(g_1 + g_2)) | k·(g_1² + g_2²) | 10,240 |
| Ours | GPK | O(SN(g_1 + g_2)) | g_1² + k·g_2² | 2,112 |

Here, each P_i is an independent, learnable affine transformation applied solely within the i-th quantization block. By restricting the transformation scope to the size of the MXFP block, our method ensures that outlier redistribution occurs only locally. This preserves the statistical independence of each quantization block, allowing the MXFP scaling factors to accurately capture the dynamic range of each block without interference from outliers of other blocks.

Global and Private Kronecker. Although the block-diagonal structure of P introduces inherent sparsity, the total number of learnable parameters remains N·g. For large-scale models, storing such a matrix for every layer still incurs a significant memory cost. A straightforward approach to mitigate this is to apply Kronecker product decomposition to each P_i, factorizing it into two smaller matrices B_i ⊗ A_i, where A_i ∈ R^{g_1×g_1} and B_i ∈ R^{g_2×g_2}. Here, g_1 and g_2 respectively denote the sizes of A_i and B_i, and the MXFP quantization granularity satisfies g = g_1 · g_2. We refer to this as Naive Kronecker. However, since the block size g is typically small (e.g., 32 in MXFP formats), the reduction in parameter count is marginal. To address this limitation, we propose Global and Private Kronecker (GPK). GPK decomposes each P_i into a global shared matrix A and a block-specific private matrix B_i:

$$P_i = B_i \otimes A, \quad \forall i \in \{1, \ldots, k\}, \tag{2}$$

where A is shared across all k blocks and B_i is unique to the i-th block. This design drastically reduces the storage requirement from k·(g_1² + g_2²) to g_1² + k·g_2².
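Under GPK, the activation transform X ↦ XP never materializes the g×g blocks: each length-g slice of a row is reshaped to V ∈ R^{g₂×g₁} and mapped through B_iᵀVA via the vectorization identity vec(V)(B_i ⊗ A) = vec(B_iᵀVA) (row-major vec). A minimal NumPy sketch of this, checked against the explicit block-diagonal matrix (function names and shapes are ours for illustration, not the paper's Appendix B code):

```python
import numpy as np

def gpk_transform(x, A, B):
    """Apply X -> X P with P = blockdiag(B_1 (x) A, ..., B_k (x) A).

    x: activations of shape (S, N), N = k * g and g = g2 * g1.
    A: (g1, g1) global shared matrix; B: (k, g2, g2) private matrices.
    Uses vec(V)(B_i kron A) = vec(B_i^T V A) (row-major vec), so the
    g x g blocks are never materialized and cost stays O(S N (g1 + g2)).
    """
    S, N = x.shape
    k, g2, _ = B.shape
    g1 = A.shape[0]
    # Split each row into k blocks, each viewed as a (g2, g1) matrix V.
    V = x.reshape(S, k, g2, g1)
    # result[s, i] = B[i]^T @ V[s, i] @ A, batched over samples and blocks.
    out = np.einsum("imp,simn,nq->sipq", B, V, A)
    return out.reshape(S, N)

def gpk_dense(A, B):
    """Materialize P = blockdiag(B_i kron A) explicitly, for verification."""
    blocks = [np.kron(B[i], A) for i in range(B.shape[0])]
    g, k = blocks[0].shape[0], len(blocks)
    P = np.zeros((k * g, k * g))
    for i, Pi in enumerate(blocks):
        P[i * g:(i + 1) * g, i * g:(i + 1) * g] = Pi
    return P
```

With g₁ = 8, g₂ = 4, and k = N/32, the learnable parameters are g₁² + k·g₂², matching the GPK row of Table 1.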
As shown in Table 1, GPK significantly reduces the storage overhead, cutting the parameter count by more than 74% and 79% compared to FlatQuant and Naive Kronecker, respectively. Additionally, by leveraging the vectorization trick of the Kronecker product, i.e., vec(V)(B_i ⊗ A) = vec(B_iᵀVA) for some V ∈ R^{g_2×g_1}, GPK maintains efficient inference by preserving the low matrix multiplication complexity. We provide the PyTorch-style pseudo code of the forward pass with GPK in Appendix B.

Block-wise Learnable Clipping. While the block-wise affine transformation effectively smooths activation distributions, residual outliers may still persist within the quantization blocks, potentially dominating the quantization range of MXFP formats. To mitigate this, we introduce Block-wise Learnable Clipping, a fine-grained strategy that adapts clipping thresholds to the local statistics of each quantization block. For the i-th block, the clipped values x̂_i (and similarly for weights ŵ_i) are computed as:

$$\hat{x}_i = \mathrm{clip}\left( x_i, \beta_i^{\min}, \beta_i^{\max} \right), \tag{3}$$

where the dynamic bounds β_i^min and β_i^max are:

$$\beta_i^{\min} = \sigma(\alpha_i^{\min}) \cdot \min(x_i), \quad \beta_i^{\max} = \sigma(\alpha_i^{\max}) \cdot \max(x_i). \tag{4}$$

Here, min(x_i) and max(x_i) denote the minimum and maximum values within the i-th block, respectively, and σ(·) is the sigmoid function constraining the clipping ratios to (0, 1). α_i^min and α_i^max are the learnable parameters specific to block i.

The Training Objective. Following previous work [47], we optimize the block-wise affine transformations and clipping factors by minimizing the layer-wise quantization error between the full-precision and quantized outputs over a small calibration set D_cal:

$$\Theta_l^{\ast} = \arg\min_{\Theta_l} \; \mathbb{E}_{X \sim \mathcal{D}_{\mathrm{cal}}} \left\| F_l(X) - \hat{F}_l(X; \Theta_l) \right\|_2^2, \tag{5}$$

where F_l(·) and F̂_l(·) denote the full-precision layer l and quantized layer l, respectively, and Θ_l abbreviates all learnable parameters within the quantization block.
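The sigmoid-parameterized bounds of Eqs. (3)–(4) can be sketched per block as follows. This is a simplified NumPy illustration (our own function names; α is taken as a plain array rather than a trained parameter, and in the method itself these logits would be optimized jointly with the affine transforms under Eq. (5)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def blockwise_learnable_clip(x, alpha_min, alpha_max, block_size=32):
    """Clip each quantization block to sigmoid-scaled min/max bounds.

    x: flat array whose length is a multiple of block_size.
    alpha_min, alpha_max: per-block learnable logits, shape (num_blocks,).
    Per Eqs. (3)-(4): beta_min = sigmoid(alpha_min) * min(block) and
    beta_max = sigmoid(alpha_max) * max(block), so the clipping ratios
    stay in (0, 1) and shrink the block range toward zero.
    """
    xb = x.reshape(-1, block_size)
    bmin = sigmoid(alpha_min)[:, None] * xb.min(axis=1, keepdims=True)
    bmax = sigmoid(alpha_max)[:, None] * xb.max(axis=1, keepdims=True)
    return np.clip(xb, bmin, bmax).reshape(x.shape)
```

Large positive logits make σ(α) ≈ 1 and leave the block almost untouched, while α = 0 halves the block's range; gradients through σ(·) let calibration pick a ratio in between for each block independently.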
3.2 Integration with the Transformer Architecture

We integrate BATQuant into both LLM (Qwen3) and MLLM (Qwen3-VL) architectures by inserting block-wise affine transformations into the transformer block, where the weight-side transformations are merged into the linear layers offline, while the activation-side transformations are applied online during inference. Following conventional practice, we employ low-bit matrix multiplications for all linear layers, while keeping layer normalization layers, pre-quantization transformations, RoPE embeddings, and attention scores in BF16.

MLP Module. In the LLM and the text model of the MLLM, the MLP module employs two transformation sets, P_up and P_down. P_up flattens the activation distribution after LayerNorm before the up_proj and gate_proj layers; P_down smooths the input to the down_proj layer. In the ViT model of the MLLM, the MLP module also employs two transformation sets, P_fc1 and P_fc2. P_fc1 flattens the activation distribution after LayerNorm before the linear_fc1 layer; P_fc2 smooths the input to the linear_fc2 layer. All matrices utilize the GPK decomposition to minimize storage.

Self-Attention Module. In the LLM and the text model of the MLLM, the Self-Attention module employs four transformations: P_qkv, P_o, P_k, and P_v. P_qkv and P_o flatten the activation distribution before the qkv_proj layer and o_proj layer, respectively. P_k and P_v transform the key and value cache head by head, respectively. In the ViT model of the MLLM, only P_qkv and P_o are employed, because the ViT does not require an autoregressive KV cache mechanism; consequently, there is no need to store, transform, and quantize key and value states across generation steps.

4 Experiments

4.1 Settings

Evaluation and Baselines. We evaluate BATQuant on Qwen3-VL-8B-Instruct (MLLM) [4] and Qwen3-8B (LLM) [61].
We assess quantized models on the following benchmarks: (1) Multimodal benchmarks, including MME [13], OCRBench [32], DocVQA [37], RealWorldQA [57], and VLMBlind. (2) Non-reasoning tasks, including PIQA [5], Winogrande [44], HellaSwag [63], ARC-Easy [8], and ARC-Challenge [8]. (3) Reasoning benchmarks, including GSM8K [9], MATH-500 [24], AIME24, AIME25, and GPQA-D [42]. We compare BATQuant against popular post-training quantization methods, including QuaRot [3], SpinQuant [33], BRQ [46], FlatQuant [47], SmoothQuant [58], and GPTQ [12]. More details about benchmarks and baseline methods are provided in Appendix A.

Implementation Details. We implement BATQuant based on Hugging Face [55] and PyTorch [40]. We adopt the AdamW optimizer with an initial learning rate of 2e-3 and employ a cosine-annealing learning rate decay schedule. BATQuant is trained for 5 epochs with a batch size of 4. For GPK, we set the sizes of the global shared matrix g_1 and the block-specific private matrix g_2 to 8 and 4, respectively. For the LLM, we use the BF16 model to self-generate data on the Numina-Math-1.5 [21] dataset and randomly sample 128 text sequences of length 2048 to construct the calibration set. For the MLLM, we randomly sample 128 image-text pairs from the GQA [18] dataset to construct the calibration set. Further implementation details are provided in Appendix A.

Quantization Settings. We evaluate the proposed method on several MXFP-based quantization configurations, including weight-activation quantization and KV cache quantization. For clarity, we denote each configuration using the format WbitsAbitsKVbits; for example, W4A8KV8 indicates quantizing weights to 4-bit, activations to 8-bit, and the KV cache to 8-bit. We empirically observe that combining different methods with GPTQ universally enhances performance. Consequently, unless otherwise specified, the reported results refer to the GPTQ-integrated variants of each method.
Detailed comparisons between the GPTQ and RTN weight quantizers are provided in Appendix C.

Fig. 4: Performance comparison of different methods on Qwen3-8B across LLM benchmarks under various quantization configurations. The results are categorized into Non-Reasoning (left) and Reasoning (right) tasks.

4.2 Main Results

Here, we present a comprehensive empirical evaluation of BATQuant. Our experiments are designed to answer the following critical questions: (1) Can BATQuant maintain satisfactory performance under aggressive MXFP-based quantization configurations where existing methods fail? (2) How does our approach generalize across modalities (MLLMs vs. LLMs) and task domains, specifically spanning multimodal understanding (including document understanding, STEM puzzles, and general VQA) in MLLMs and linguistic tasks (covering non-reasoning and reasoning tasks) in LLMs?

Results on Multimodal Benchmarks. Table 2 summarizes the performance of different post-training quantization methods on the Qwen3-VL-8B-Instruct model across five multimodal benchmarks. As shown in the table, BATQuant consistently establishes state-of-the-art results across all bit-width configurations.
Notably, in the aggressive W4A4KV16 regime, BATQuant achieves an average recovery rate of 96.43%, significantly outperforming the strongest baseline FlatQuant by a margin of 1.64%. Under the W4A8KV16 scenario, BATQuant achieves an average recovery rate of 99.29% and is the only approach exhibiting a performance degradation of under 1%. This superiority extends to KV cache quantization as well: under W4A8KV8 and W4A8KV4, our method maintains superior performance with recovery rates of 98.89% and 97.51%, respectively. Such a consistent performance gain is also widely observed across different types of benchmarks, including document understanding, STEM puzzles, and general VQA. We attribute this success to our method's unique capability to mitigate inter-block energy transfer, thereby effectively capturing diverse outlier patterns that conventional methods fail to address.

Table 2: Performance comparison of various quantization methods on multimodal benchmarks across different bit-width configurations (W4A8KV16, W4A4KV16, W4A8KV8, and W4A8KV4). The recovery rate relative to the BF16 baseline is also provided.

| Bits | Method | MME | OCRBench | DocVQA | RealWorldQA | VLMBlind | Recovery (%) |
|---|---|---|---|---|---|---|---|
| BF16 | – | 2377 | 906 | 95.81 | 70.98 | 73.98 | 100.00 |
| W4A8KV16 | RTN | 2294 | 883 | 94.72 | 69.80 | 70.99 | 97.43 |
| | QuaRot | 2327 | 870 | 95.07 | 69.80 | 71.12 | 97.53 |
| | SpinQuant | 2321 | 872 | 94.79 | 70.46 | 69.82 | 97.29 |
| | BRQ | 2329 | 865 | 94.72 | 70.19 | 67.18 | 96.40 |
| | FlatQuant | 2351 | 886 | 95.31 | 69.02 | 73.90 | 98.66 |
| | SmoothQuant | 2349 | 885 | 94.81 | 70.06 | 69.46 | 97.61 |
| | GPTQ | 2346 | 891 | 95.03 | 69.15 | 72.62 | 98.36 |
| | BATQuant | 2386 | 893 | 95.55 | 70.20 | 73.14 | 99.29 |
| W4A4KV16 | RTN | 2243 | 838 | 92.70 | 65.23 | 66.47 | 93.07 |
| | QuaRot | 2189 | 810 | 93.47 | 64.97 | 57.62 | 89.69 |
| | SpinQuant | 1994 | 801 | 91.79 | 65.36 | 60.23 | 88.32 |
| | BRQ | 2147 | 805 | 92.94 | 66.14 | 62.14 | 90.74 |
| | FlatQuant | 2231 | 873 | 94.10 | 65.62 | 68.86 | 94.79 |
| | SmoothQuant | 2264 | 862 | 93.93 | 68.89 | 66.26 | 95.01 |
| | GPTQ | 2286 | 849 | 93.98 | 66.93 | 67.29 | 94.64 |
| | BATQuant | 2360 | 864 | 94.31 | 67.32 | 69.70 | 96.43 |
| W4A8KV8 | RTN | 2208 | 878 | 94.64 | 69.54 | 71.01 | 96.51 |
| | QuaRot | 2296 | 868 | 95.11 | 69.02 | 70.26 | 96.77 |
| | SpinQuant | 2217 | 832 | 94.41 | 68.10 | 69.04 | 94.58 |
| | BRQ | 2283 | 867 | 94.63 | 69.80 | 67.36 | 95.98 |
| | FlatQuant | 2353 | 888 | 95.12 | 69.14 | 72.77 | 98.41 |
| | SmoothQuant | 2317 | 884 | 94.72 | 70.19 | 68.91 | 97.19 |
| | GPTQ | 2340 | 885 | 95.14 | 71.11 | 71.79 | 98.53 |
| | BATQuant | 2368 | 890 | 95.47 | 69.93 | 72.82 | 98.89 |
| W4A8KV4 | RTN | 2220 | 856 | 94.05 | 68.50 | 67.50 | 94.76 |
| | QuaRot | 2280 | 857 | 94.66 | 68.52 | 68.36 | 95.65 |
| | SpinQuant | 2248 | 829 | 94.18 | 68.63 | 64.50 | 93.65 |
| | BRQ | 2236 | 841 | 94.07 | 68.63 | 66.03 | 94.20 |
| | FlatQuant | 2293 | 884 | 94.88 | 68.76 | 70.75 | 97.11 |
| | SmoothQuant | 2283 | 871 | 94.39 | 67.02 | 66.99 | 95.13 |
| | GPTQ | 2328 | 867 | 94.15 | 68.10 | 70.81 | 96.71 |
| | BATQuant | 2332 | 885 | 95.07 | 68.63 | 70.92 | 97.51 |

Results on LLM Benchmarks. To comprehensively evaluate the generalization capability of BATQuant beyond multimodal tasks, we conduct extensive experiments on Qwen3-8B. The overall performance trends across all configurations are shown in Figure 4 and the detailed results for reasoning benchmarks are summarized in Table 3. More detailed results can be found in Appendix C.

Non-Reasoning Tasks. As shown in Figure 4, under the W4A8KV16 configuration, our method achieves near-lossless accuracy compared to the BF16 baseline. As the quantization difficulty intensifies in the aggressive W4A4KV16 and W4A8KV4 regimes, rotation-based methods (e.g., SpinQuant, QuaRot) suffer from severe performance degradation while our method maintains a robust level of accuracy.
This suggests that our block-wise affine transformation effectively mitigates the distortion of activation distributions caused by extreme quantization, ensuring that fundamental linguistic patterns remain intact.

Table 3: Performance comparison of various quantization methods on reasoning benchmarks across different bit-width configurations (W4A8KV16, W4A4KV16, W4A8KV8, and W4A8KV4). The recovery rate relative to the BF16 baseline is also provided.

| Bits | Method | GSM8K | MATH-500 | AIME24 | AIME25 | GPQA-D | Avg. | Recovery (%) |
|---|---|---|---|---|---|---|---|---|
| BF16 | – | 95.15 | 96.87 | 71.46 | 63.12 | 58.13 | 76.95 | 100.00 |
| W4A8KV16 | RTN | 93.71 | 95.53 | 64.58 | 55.00 | 54.39 | 72.64 | 93.64 |
| | QuaRot | 94.47 | 95.67 | 64.17 | 55.63 | 54.39 | 72.87 | 93.91 |
| | SpinQuant | 94.69 | 95.53 | 60.42 | 51.46 | 54.58 | 71.34 | 91.62 |
| | BRQ | 93.71 | 95.80 | 63.96 | 53.33 | 55.40 | 72.39 | 93.26 |
| | FlatQuant | 94.62 | 95.93 | 69.17 | 57.08 | 54.80 | 74.32 | 95.99 |
| | SmoothQuant | 94.92 | 96.27 | 65.62 | 56.04 | 54.80 | 73.53 | 94.80 |
| | GPTQ | 94.39 | 96.33 | 68.02 | 59.38 | 55.10 | 74.64 | 96.54 |
| | BATQuant | 94.84 | 96.40 | 68.33 | 59.38 | 57.22 | 75.23 | 97.46 |
| W4A4KV16 | RTN | 93.10 | 94.53 | 53.33 | 47.08 | 49.80 | 67.57 | 86.06 |
| | QuaRot | 94.09 | 92.47 | 47.50 | 39.37 | 48.13 | 64.31 | 81.20 |
| | SpinQuant | 93.40 | 91.67 | 38.57 | 35.63 | 45.66 | 60.99 | 76.35 |
| | BRQ | 92.27 | 91.73 | 37.29 | 34.58 | 48.03 | 60.78 | 76.25 |
| | FlatQuant | 93.40 | 94.33 | 58.96 | 43.54 | 50.51 | 68.15 | 86.78 |
| | SmoothQuant | 94.69 | 95.33 | 60.71 | 47.29 | 52.42 | 70.09 | 89.60 |
| | GPTQ | 94.24 | 95.73 | 57.50 | 52.08 | 52.12 | 70.33 | 90.10 |
| | BATQuant | 94.77 | 95.60 | 62.08 | 52.92 | 54.19 | 71.91 | 92.45 |
| W4A8KV8 | RTN | 93.78 | 95.00 | 60.21 | 54.79 | 53.54 | 71.46 | 91.96 |
| | QuaRot | 94.09 | 95.73 | 64.79 | 55.83 | 54.49 | 72.99 | 94.11 |
| | SpinQuant | 94.47 | 95.47 | 59.38 | 53.96 | 55.86 | 71.87 | 92.56 |
| | BRQ | 94.69 | 95.33 | 63.75 | 52.71 | 54.04 | 72.10 | 92.72 |
| | FlatQuant | 94.54 | 96.00 | 65.42 | 53.96 | 54.55 | 72.89 | 93.87 |
| | SmoothQuant | 94.39 | 96.13 | 66.04 | 54.79 | 54.29 | 73.13 | 94.21 |
| | GPTQ | 94.47 | 96.13 | 65.00 | 57.08 | 53.94 | 73.32 | 94.54 |
| | BATQuant | 94.62 | 96.27 | 69.37 | 55.21 | 56.82 | 74.46 | 96.22 |
| W4A8KV4 | RTN | 92.12 | 91.13 | 43.54 | 38.75 | 46.97 | 62.50 | 78.80 |
| | QuaRot | 94.01 | 94.80 | 62.08 | 52.50 | 51.82 | 71.04 | 91.17 |
| | SpinQuant | 93.25 | 94.33 | 57.71 | 49.58 | 52.12 | 69.40 | 88.87 |
| | BRQ | 93.56 | 95.13 | 62.08 | 49.17 | 53.54 | 70.70 | 90.68 |
| | FlatQuant | 94.09 | 95.40 | 63.33 | 53.54 | 54.95 | 72.26 | 93.07 |
| | SmoothQuant | 93.03 | 92.73 | 46.67 | 40.33 | 49.19 | 63.39 | 81.46 |
| | GPTQ | 93.40 | 93.07 | 47.92 | 39.58 | 49.75 | 64.74 | 81.92 |
| | BATQuant | 94.77 | 95.27 | 66.04 | 54.48 | 54.24 | 72.96 | 94.00 |

Reasoning Tasks. The disparity between BATQuant and baselines becomes even more pronounced on complex reasoning benchmarks requiring multi-step logical deduction and mathematical computation. As detailed in Table 3, reasoning tasks are inherently more sensitive to quantization noise due to the compounding effect of errors across long reasoning chains. In the W4A8KV16 scenario, BATQuant achieves a recovery rate of 97.46%, surpassing GPTQ by a substantial margin of 0.92%. Notably, under the W4A4KV16 scenario, competing methods suffer from severe performance collapse on GSM8K and MATH-500, while BATQuant maintains stable performance. In the W4A8KV8 and W4A8KV4 scenarios, our method outperforms the strong baselines GPTQ and FlatQuant by 1.68% and 0.93%, respectively.

The consistent superiority of BATQuant across both multimodal tasks and complex linguistic reasoning underscores its remarkable cross-modality generalization. Our method maintains stable performance even under aggressive low-bit configurations where baselines fail. This broad effectiveness stems from the fundamental nature of our block-wise affine transformation, which dynamically aligns activation outliers and mitigates quantization noise at a granular level, independent of specific data modalities or task semantics.

Fig. 5: Activation distributions of the q_proj module in layer 6 of Qwen3-8B with different quantization methods: (a) SpinQuant, (b) FlatQuant, (c) BRQ, (d) BATQuant (Ours).

Qualitative Results. To provide insights into the mechanism behind our performance gains, we visualize the activation distributions across different quantization methods in Figure 5. As shown in Figure 5a, rotation-based methods (e.g., SpinQuant) tend to smooth the entire tensor. While this preserves the global energy, it may transfer energy from outlier-rich blocks to smoother blocks, amplifying quantization errors in those blocks.
While FlatQuant (Figure 5b) effectively suppresses global energy, it fails to prevent this inter-block energy transfer. Furthermore, although BRQ (Figure 5c and Figure 2a) introduces block-wise rotation to smooth activations within local blocks, our visualization reveals that it often induces a bimodal distribution within quantization blocks. Our method (Figure 5d and Figure 2b) effectively prevents cross-block energy transfer while reshaping activations within blocks into a compact, unimodal distribution. More visualization results are provided in Appendix C.

4.3 Ablation Study

To validate the effectiveness of our core designs, we conduct ablation studies on both Qwen3-8B (LLM) and Qwen3-VL-8B-Instruct (MLLM) under the W4A4KV16 configuration. Here, we first study the effect of the block-wise affine transformation and block-wise learnable clipping.

Table 4: Ablation study of block-wise affine transformation and block-wise learnable clipping. We conduct the experiments under W4A4KV16.

| Model | Block Trans | Block Clip | ARC-C | ARC-E | HellaSwag | PIQA | Winogrande | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | ✓ | | 53.16 | 76.36 | 71.02 | 74.27 | 67.72 | 68.51 |
| | | ✓ | 52.35 | 77.44 | 71.71 | 76.01 | 63.69 | 68.24 |
| | ✓ | ✓ | 53.33 | 77.53 | 71.12 | 75.30 | 66.22 | 68.70 |

| Model | Block Trans | Block Clip | MME | OCRBench | DocVQA | RealWorldQA | VLMBlind | Recovery (%) |
|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | ✓ | | 2235 | 861 | 94.63 | 67.19 | 69.99 | 96.18 |
| | | ✓ | 2249 | 865 | 94.04 | 67.21 | 70.28 | 95.59 |
| | ✓ | ✓ | 2360 | 864 | 94.31 | 67.32 | 69.70 | 96.43 |

Fig. 6: Performance of Qwen3-8B (LLM) and Qwen3-VL-8B-Instruct (MLLM) with different transformation block sizes (the optimum for both is block size 32; MLLM block sizes above 128 are unsupported by the ViT shape).

Fig. 7: Performance of Qwen3-8B (LLM) and Qwen3-VL-8B-Instruct (MLLM) with different sizes of the global shared matrix.
Effect of Block-wise Components. The baseline setting without block-wise affine transformation and block-wise learnable clipping refers to the use of their global-wise counterparts. As shown in Table 4, replacing the global transformation with our block-wise variant yields significant improvements. For Qwen3-8B, applying the block-wise transformation improves the average accuracy from 68.24% to 68.70%. Similarly, for Qwen3-VL-8B-Instruct, it boosts the recovery rate from 95.59% to 96.43%. Applying block-wise clipping also provides competitive gains. For Qwen3-8B, the average accuracy is improved from 68.51% to 68.70%. For Qwen3-VL-8B-Instruct, the recovery rate is boosted from 96.18% to 96.43%. These results confirm that using block-wise affine transformation and block-wise learnable clipping under MXFP quantization is crucial.

Block Size of Affine Matrix. BATQuant aligns the block size of the affine transformation to the MXFP quantization granularity. To investigate the effect of the transformation scope, we vary the size of the affine transformation P_i while keeping the MXFP quantization block size fixed at g = 32. As illustrated in Figure 6, for Qwen3-VL-8B-Instruct and Qwen3-8B, the best performance is achieved for both models when the transformation block size exactly matches the quantization block size (g = 32). This allows affine transformations to precisely reshape local distributions, isolated from cross-block outlier interference. We can also observe that deviating from this alignment leads to performance degradation. When the block size of the affine matrix is smaller than g (e.g., 16), the transformation scope is too narrow to smooth outliers across a quantization block. Additionally, distinct transformations lead to uneven energy (ℓ2-norm) suppression within quantization blocks, creating imbalanced distributions and inducing new local outliers.
When the block size of the affine matrix is greater than g (e.g., 128), the transformation mixes elements across multiple quantization blocks. This transfers energy between blocks, which can increase quantization error. These findings suggest that strictly coupling the affine transformation granularity with the hardware quantization block size is an effective design choice.

Effect of GPK. To investigate the impact of the Global and Private Kronecker (GPK) module, we analyze the size of the global shared matrix A (denoted as g1). Recall that g = g1 · g2 = 32; thus, varying g1 inherently changes the capacity of both the shared global basis and the block-specific private components. We evaluate configurations with g1 ∈ {1, 2, 4, 8, 16, 32}. The results are shown in Figure 7. Contrary to the intuition that increasing the number of learnable parameters (i.e., decreasing g1) monotonically improves performance, our experiments reveal a non-monotonic trend with an optimal point at g1 = 8 or g1 = 4. When g1 is large (e.g., 16 or 32), the dimension of the private matrix B_i becomes small (g2 ≤ 2), severely limiting the ability of each block to adapt its local distribution independently and leading to a performance drop. Conversely, when g1 is small (e.g., 1 or 2), the number of private parameters increases significantly, theoretically offering higher capacity. However, the search space is also expanded: the optimizer may struggle to converge to a robust solution without more calibration data or hyperparameter tuning, leading to sub-optimal performance or instability. Therefore, to strike an optimal balance between accuracy and efficiency, we recommend the configuration with g1 = 8 as the default setting.

5 Conclusion

In this paper, we present BATQuant, a robust framework for outlier-resilient MXFP4 quantization that leverages learnable block-wise optimization.
By restricting affine transformations to align strictly with the hardware quantization granularity, our method effectively eliminates the cross-block energy transfer and bimodal distributions inherent in global rotation techniques. This targeted optimization, enhanced by the efficient Global and Private Kronecker (GPK) decomposition and block-wise learnable clipping, ensures precise outlier suppression with minimal overhead. Extensive experiments on MLLMs and LLMs validate that BATQuant sets new state-of-the-art results, achieving near-lossless results under W4A8KV16 and recovering up to 96.43% of full-precision performance under aggressive W4A4KV16 settings. We hope this work offers a practical solution for deploying large models on emerging microscaling architectures.

References

1. Advanced Micro Devices, Inc.: AMD CDNA™ 4 Architecture Whitepaper. White paper, Advanced Micro Devices, Inc. (2025)
2. Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R.K., Bai, Y., Baker, B., Bao, H., et al.: gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025)
3. Ashkboos, S., Mohtashami, A., Croci, M.L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., Hensman, J.: QuaRot: Outlier-free 4-bit inference in rotated LLMs. In: NeurIPS. pp. 100213–100240 (2024)
4. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
5. Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al.: PIQA: Reasoning about physical commonsense in natural language. In: AAAI. vol. 34, pp. 7432–7439 (2020)
6. Chen, M., Wu, M., Jin, H., Yuan, Z., Liu, J., Zhang, C., Li, Y., Huang, J., Ma, J., Xue, Z., et al.: INT vs FP: A comprehensive study of fine-grained low-bit quantization formats. arXiv preprint arXiv:2510.25602 (2025)
7. Choquette, J.: NVIDIA Hopper H100 GPU: Scaling performance.
IEEE Micro 43(3), 9–17 (2023)
8. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457 (2018)
9. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
10. Cook, J., Guo, J., Xiao, G., Lin, Y., Han, S.: Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010 (2025)
11. Egiazarian, V., Castro, R.L., Kuznedelev, D., Panferov, A., Kurtic, E., Pandit, S., Marques, A., Kurtz, M., Ashkboos, S., Hoefler, T., et al.: Bridging the gap between promise and performance for microscaling FP4 quantization. arXiv preprint arXiv:2509.23202 (2025)
12. Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022)
13. Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. In: NeurIPS Datasets and Benchmarks Track (2025)
14. Hong, W., Yu, W., Gu, X., Wang, G., Gan, G., Tang, H., Cheng, J., Qi, J., Ji, J., Pan, L., et al.: GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006 (2025)
15. Hu, W., Zhang, Z., Zhang, H., Zhang, C., Guo, C., Feng, Y., Hu, T., Li, G., Hu, G., Wang, J., et al.: M2XFP: A metadata-augmented microscaling data format for efficient low-bit quantization. arXiv e-prints p. arXiv–2601 (2026)
16.
Hu, X., Chen, Z., Yang, D., Xu, Z., Xu, C., Yuan, Z., Zhou, S., Yu, J.: MoEQuant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804 (2025)
17. Huang, X., Liu, Z., Liu, S.Y., Cheng, K.T.: RoLoRA: Fine-tuning rotated outlier-free LLMs for effective weight-activation quantization. In: Findings of EMNLP. pp. 7563–7576 (2024)
18. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR. pp. 6700–6709 (2019)
19. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: SOSP. pp. 611–626 (2023)
20. Lee, J., Park, J., Cha, S., Cho, J., Sim, J.: MX+: Pushing the limits of microscaling formats for efficient large language model serving. In: Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. pp. 869–883 (2025)
21. Li, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Huang, S.C., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., Polu, S.: NuminaMath (2024)
22. Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.Y., Han, S.: SVDQuant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007 (2024)
23. Li, S., Hu, Y., Ning, X., Liu, X., Hong, K., Jia, X., Li, X., Yan, Y., Ran, P., Dai, G., et al.: MBQ: Modality-balanced quantization for large vision-language models. In: CVPR. pp. 4167–4177 (2025)
24. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let's verify step by step. In: ICLR (2023)
25.
Lin, H., Xu, H., Wu, Y., Cui, J., Zhang, Y., Mou, L., Song, L., Sun, Z., Wei, Y.: DuQuant: Distributing outliers via dual transformation makes stronger quantized LLMs. In: NeurIPS. pp. 87766–87800 (2024)
26. Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.M., Wang, W.C., Xiao, G., Dang, X., Gan, C., Han, S.: AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems 6, 87–100 (2024)
27. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS. pp. 34892–34916 (2023)
28. Liu, R., Sun, Y., Zhang, M., Bai, H., Yu, X., Yu, T., Yuan, C., Hou, L.: Quantization hurts reasoning? An empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823 (2025)
29. Liu, W., Meng, H., Luo, Y., Zhang, P., Ma, X.: MicroMix: Efficient mixed-precision quantization with microscaling formats for large language models. arXiv preprint arXiv:2508.02343 (2025)
30. Liu, X., Xia, X., Zhang, M., Li, J.F., Yu, X., Shen, F., Su, X., Ng, S.K., Chua, T.S.: FreeAct: Freeing activations for LLM quantization. arXiv preprint arXiv:2603.01776 (2026)
31. Liu, X., Xia, X., Zhao, W., Zhang, M., Yu, X., Su, X., Yang, S., Ng, S.K., Chua, T.S.: L-MTP: Leap multi-token prediction beyond adjacent context for large language models. arXiv preprint arXiv:2505.17505 (2025)
32. Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences 67(12), 220102 (2024)
33. Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., Chandra, V., Tian, Y., Blankevoort, T.: SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406 (2024)
34. Luo, R., Wang, L., He, W., Chen, L., Li, J., Xia, X.: GUI-R1: A generalist R1-style vision-language action model for GUI agents.
arXiv preprint arXiv:2504.10458 (2025)
35. Luo, R., Xia, X., Wang, L., Chen, L., Shan, R., Luo, J., Yang, M., Chua, T.S.: Next-Omni: Towards any-to-any omnimodal foundation models with discrete flow matching. arXiv preprint arXiv:2510.13721 (2025)
36. Ma, Y., Li, H., Zheng, X., Ling, F., Xiao, X., Wang, R., Wen, S., Chao, F., Ji, R.: AffineQuant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544 (2024)
37. Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2200–2209 (2021)
38. Meng, H., Luo, Y., Zhao, Y., Liu, W., Zhang, P., Ma, X.: ArcQuant: Boosting NVFP4 quantization with augmented residual channels for LLMs. arXiv preprint arXiv:2601.07475 (2026)
39. Mishra, A., Stosic, D., Layton, S., Micikevicius, P.: Recipes for pre-training LLMs with MXFP8. arXiv preprint arXiv:2506.08027 (2025)
40. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: NeurIPS (2019)
41. Qin, G., Li, Z., Chen, Z., Zhang, W., Kong, L., Zhang, Y.: VEQ: Modality-adaptive quantization for MoE vision-language models. arXiv preprint arXiv:2602.01037 (2026)
42. Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., Bowman, S.R.: GPQA: A graduate-level Google-proof Q&A benchmark. In: COLM (2024)
43. Rouhani, B.D., Zhao, R., More, A., Hall, M., Khodamoradi, A., Deng, S., Choudhary, D., Cornea, M., Dellinger, E., Denolf, K., et al.: Microscaling data formats for deep learning. arXiv preprint arXiv:2310.10537 (2023)
44. Sakaguchi, K., Bras, R.L., Bhagavatula, C., Choi, Y.: WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM 64(9), 99–106 (2021)
45.
Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., Luo, P.: OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137 (2023)
46. Shao, Y., Wang, P., Chen, Y., Xu, C., Wei, Z., Cheng, J.: Block rotation is all you need for MXFP4 quantization. arXiv preprint arXiv:2511.04214 (2025)
47. Sun, Y., Liu, R., Bai, H., Bao, H., Zhao, K., Li, Y., Hu, J., Yu, X., Hou, L., Yuan, C., et al.: FlatQuant: Flatness matters for LLM quantization. arXiv preprint arXiv:2410.09426 (2024)
48. Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al.: Kimi-VL technical report. arXiv preprint arXiv:2504.07491 (2025)
49. Tirumala, A., Wong, R.: NVIDIA Blackwell platform: Advancing generative AI and accelerated computing. In: HCS. pp. 1–33 (2024)
50. Tseng, A., Chee, J., Sun, Q., Kuleshov, V., De Sa, C.: QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In: ICML (2024)
51. Wang, H., Ma, S., Wei, F.: BitNet v2: Native 4-bit activations with Hadamard transformation for 1-bit LLMs. arXiv preprint arXiv:2504.18415 (2025)
52. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
53. Wang, W., Chen, W., Luo, Y., Long, Y., Lin, Z., Zhang, L., Lin, B., Cai, D., He, X.: Model compression and efficient inference for large language models: A survey. arXiv preprint arXiv:2402.09748 (2024)
54. Wei, X., Zhang, Y., Li, Y., Zhang, X., Gong, R., Guo, J., Liu, X.: Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In: EMNLP. pp. 1648–1665 (2023)
55.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45 (2020)
56. Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)
57. xAI: RealWorldQA: A benchmark for real-world spatial understanding (2024)
58. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., Han, S.: SmoothQuant: Accurate and efficient post-training quantization for large language models. In: ICML. pp. 38087–38099 (2023)
59. Xin, M., Priyadarshi, S., Xin, J., Kartal, B., Vavre, A., Thekkumpate, A.K., Chen, Z., Mahabaleshwarkar, A.S., Shahaf, I., Bercovich, A., et al.: Quantization-aware distillation for NVFP4 inference accuracy recovery. arXiv preprint arXiv:2601.20088 (2026)
60. Xu, M., Yin, W., Cai, D., Yi, R., Xu, D., Wang, Q., Wu, B., Zhao, Y., Yang, C., Wang, S., et al.: A survey of resource-efficient LLM and multimodal foundation models. arXiv preprint arXiv:2401.08092 (2024)
61. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
62. Yu, J., Zhou, S., Yang, D., Li, S., Wang, S., Hu, X., Xu, C., Xu, Z., Shu, C., Yuan, Z.: MQuant: Unleashing the inference potential of multimodal large language models via static quantization. In: ACM MM. pp. 1783–1792 (2025)
63. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 (2019)
64.
Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al.: GLM-4.5: Agentic, Reasoning, and Coding (ARC) foundation models. arXiv preprint arXiv:2508.06471 (2025)
65. Zhang, J., Wei, J., Zhang, P., Xu, X., Huang, H., Wang, H., Jiang, K., Chen, J., Zhu, J.: SageAttention3: Microscaling FP4 attention for inference and an exploration of 8-bit training. arXiv preprint arXiv:2505.11594 (2025)
66. Zhang, M., Li, J.F., Sun, Z., Bai, H., Zhen, H.L., Dong, Z., Yu, X.: Benchmarking post-training quantization of large language models under microscaling floating point formats. arXiv preprint arXiv:2601.09555 (2026)
67. Zhao, P., Zhen, H.L., Li, X., Bao, H., Lin, W., Yang, Z., Yu, Z., Wang, X., Yuan, M., Yu, X., et al.: Unleashing low-bit inference on Ascend NPUs: A comprehensive evaluation of HiFloat formats. arXiv preprint arXiv:2602.12635 (2026)
68. Zhu, X., Li, J., Liu, Y., Ma, C., Wang, W.: A survey on model compression for large language models. Transactions of the Association for Computational Linguistics 12, 1556–1577 (2024)

A Implementation Details

A.1 Multimodal Benchmarks

– MME. It is a collection of benchmarks to evaluate the multimodal understanding capability of large vision-language models (LVLMs).
– OCRBench. OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of large multimodal models. It contains 1,000 question-answer pairs, covering text recognition, scene-text-centric VQA, document-oriented VQA, key information extraction, and handwritten mathematical expression recognition.
– DocVQA. DocVQA is a benchmark for Visual Question Answering (VQA) on document images. The dataset consists of 50,000 questions defined on more than 12,000 document images.
– RealWorldQA. It is a benchmark designed to test spatial and physical reasoning.
It features high-quality images taken from vehicles and egocentric views, challenging models to answer questions about object relations and environmental context in unconstrained, realistic settings.
– VLMBlind. It is a benchmark of seven novel low-level visual tasks for testing the ability of VLMs to "see" simple geometric primitives (such as lines, circles, squares, and intersections) that are the basic building blocks for image tasks.

For all multimodal benchmarks, we use the vLLM [19] backend for evaluation with a sampling temperature of 0.7, a top-p value of 0.8, a top-k value of 20, and a presence penalty of 2.0. The maximum sequence length of the model is limited to 32,768.

A.2 Non-reasoning Benchmarks

– PIQA. It is a physical commonsense reasoning benchmark dataset designed to investigate the physical knowledge of existing models.
– Winogrande. Winogrande is a collection of 44k problems formulated as a fill-in-a-blank task with binary options; the goal is to choose the right option for a given sentence, which requires commonsense reasoning.
– HellaSwag. It is a commonsense inference benchmark designed to challenge language models with adversarially filtered multiple-choice questions.
– ARC-Easy & ARC-Challenge. The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources. Each question has a multiple-choice structure (typically 4 answer options). ARC-Easy contains 5,197 easy questions, and ARC-Challenge contains 2,590 hard questions.

A.3 Reasoning Benchmarks

– GSM8K. GSM8K is a dataset of approximately 8,500 high-quality, linguistically diverse grade school math word problems created by human writers. We employ its test split, which contains 1,319 examples in total. We evaluate model performance using Avg@1 (i.e., the accuracy of the first generated answer).
– MATH-500. A benchmark that contains a mix of easy and hard mathematical problems designed to test comprehensive reasoning abilities.
We evaluate model performance using Avg@3, which averages accuracy over 3 independently sampled reasoning traces per problem.
– AIME24. It contains 30 problems from the American Invitational Mathematics Examination (AIME) 2024. We report results using Avg@16, which averages accuracy over 16 independently sampled reasoning traces per problem.
– AIME25. It contains 30 problems from the American Invitational Mathematics Examination (AIME) 2025. We report results using Avg@16, which averages accuracy over 16 independently sampled reasoning traces per problem.
– GPQA-D. GPQA is a benchmark of graduate-level questions authored and validated by PhD experts. It is designed to be "Google-proof": highly skilled non-experts with unrestricted web access achieve only 34% accuracy, while domain experts reach 65% (74% after error correction). We report results using Avg@10, which averages accuracy over 10 independently sampled reasoning traces per problem.

For all reasoning benchmarks, we use the vLLM [19] backend for evaluation with a sampling temperature of 0.6, a top-p value of 0.95, and a top-k value of 20. The maximum sequence length of the model is limited to 38,912.

A.4 Baseline Methods

– RTN. It is a straightforward round-to-nearest quantization strategy that maps original floating-point values to the nearest representable quantized values without additional optimization or calibration.
– QuaRot. It uses randomized Hadamard transforms to rotate weights and activations into a space where outliers are suppressed, enabling outlier-free 4-bit quantization.
– SpinQuant. It employs orthogonal matrices optimized via the Cayley optimizer to rotate weights and activations into a space where outliers are suppressed.
– BRQ. It is equipped with block-wise rotation to prevent energy transfer in weight and activation rotation.
– FlatQuant. It is designed to improve low-bit quantization by flattening the activation distributions using global affine matrices, specifically optimized for efficient deployment on hardware.
– SmoothQuant.
It uses diagonal scaling factors to smooth activation outliers by migrating the quantization difficulty from activations to weights.
– GPTQ. It is a layer-wise post-training quantization method that leverages approximate second-order information (the Hessian) to minimize quantization errors, achieving high accuracy for weight-only low-bit quantization.

A.5 Hyperparameter Settings

We implement BATQuant based on Hugging Face Transformers [55] and PyTorch [40]. We adopt the AdamW optimizer with an initial learning rate of 2e-3 and employ a cosine annealing learning rate decay schedule. BATQuant is trained for 5 epochs, and the batch size is set to 4. For GPK, we set the size of the global shared matrix g1 and of the block-specific private matrices g2 to 8 and 4, respectively. To simulate quantization with the MXFP format, we use the microxcaling library³ for all experiments.

B Detailed Algorithm Flow

In this section, we provide the detailed algorithmic implementation of BATQuant. We first formalize the efficient forward pass of the Global and Private Kronecker (GPK) decomposition, followed by the complete calibration procedure for learning the block-wise affine transformations and clipping parameters.

Algorithm 1 GPK Forward Pass (PyTorch Style)
Require: Input tensor X ∈ R^(B×S×N), global matrix A ∈ R^(g1×g1), private matrices {B_i}_{i=1}^k, quantization block size g.
Ensure: Transformed tensor X̃ ∈ R^(B×S×N).
1: Parameters: block count k, dims g1, g2 s.t. N = k · g1 · g2.
2: Reshape X from [B, S, N] to [−1, k, g2, g1].
   // 1. Global shared transformation (PyTorch einsum)
3: X̃ ← einsum(X, A, equation = "...gij,jk->...gik")
   // 2. Block-specific private transformation (PyTorch einsum)
4: Stack {B_i} into B_stack ∈ R^(k×g2×g2).
5: X̃ ← einsum(B_stack, X̃, equation = "gij,bgjk->bgik")
6: Reshape X̃ back to [B, S, N].
7: return X̃
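As a concrete reference, the two einsum steps of Algorithm 1 can be sketched as follows; NumPy is used here as a testable stand-in for the paper's PyTorch einsum calls, and the function and variable names are illustrative:

```python
import numpy as np

def gpk_forward(X, A, B_stack):
    """GPK forward pass. X: [B, S, N]; A: [g1, g1] (global, shared);
    B_stack: [k, g2, g2] (one private matrix per block); N = k * g1 * g2."""
    k, g2, _ = B_stack.shape
    g1 = A.shape[0]
    Bsz, S, N = X.shape
    assert N == k * g1 * g2
    # View each length-(g1*g2) block as a g2 x g1 matrix (row-major reshape).
    Xr = X.reshape(-1, k, g2, g1)
    # 1. Global shared transformation: right-multiply every block by A.
    Xr = np.einsum('...ij,jk->...ik', Xr, A)
    # 2. Block-specific private transformation: left-multiply block i by B_i.
    Xr = np.einsum('gij,bgjk->bgik', B_stack, Xr)
    return Xr.reshape(Bsz, S, N)
```

Under this row-major reshape, the dense per-block equivalent is the Kronecker-structured matrix kron(B_i, Aᵀ) (the transpose placement is a convention of the reshape, not a change to the method), so applying A and all B_i costs O(N · (g1 + g2)) multiplications per token instead of O(N · g) for materialized dense per-block matrices.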
B.1 Efficient Inference via GPK Forward Pass

To minimize runtime overhead during inference, the block-wise affine transformation P_i = B_i ⊗ A is not materialized as a full dense matrix. Instead, we leverage the Kronecker product structure to perform the transformation efficiently without explicit matrix construction. Specifically, for the i-th block input vector of size g = g1 · g2, the operation proceeds in three steps. First, the input vector is reshaped into a matrix of dimensions g2 × g1. Second, this matrix is multiplied by the global shared matrix A ∈ R^(g1×g1) from the right and by the block-specific private matrix B_i ∈ R^(g2×g2) from the left. Finally, the resulting matrix is reshaped back to its original shape. Algorithm 1 details the vectorized implementation of this operation for a batch of inputs across all blocks.

³ https://github.com/microsoft/microxcaling

Algorithm 2 BATQuant Algorithm Flow
Require: Full-precision weights W ∈ R^(M×N), layer input X ∈ R^(B×S×N), global matrix A ∈ R^(g1×g1), private matrices {B_i}_{i=1}^k, quantization block size g, epochs E.
Ensure: Calibrated parameters Θ = {A, B_i, α_i^min, α_i^max} for each layer.
1: for epoch = 1 to E do
2:   for each batch in X do
       // 1. Transformation
3:     Obtain transformed activations X̃ from X, A, and B_i based on Alg. 1.
4:     Obtain transformed weights W̃ from W, A⁻¹, and B_i⁻¹ based on Alg. 1.
5:     Apply block-wise clipping to W̃ and X̃.
       // 2. Quantization
6:     X̃ ← Q(X̃), W̃ ← Q(W̃)
       // 3. Loss computation & optimization
7:     Ỹ ← X̃ W̃ᵀ, Y ← X Wᵀ
8:     L ← ‖Y − Ỹ‖₂²
9:     Update Θ_l using ∇_{Θ_l} L
10:   end for
11: end for
   // 4. Offline fusion for deployment
12: Obtain transformed weights W̃ from W, A⁻¹, and B_i⁻¹ based on Alg. 1.
13: Apply block-wise clipping to W̃.
14: W̃ ← Q(W̃)
15: Store Θ = {A, B_i, α_i^min, α_i^max} for the online activation transformation.
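For completeness, the block quantizer Q(·) in Algorithm 2 can be simulated along the lines below. This is a minimal NumPy sketch of MXFP4-style fake quantization (an FP4 E2M1 element grid with a per-32-element power-of-two shared scale, following the OCP Microscaling convention), not the paper's exact implementation: it is an illustrative stand-in for the microxcaling library, rounding to the nearest grid point (ties toward the smaller magnitude) and clamping values beyond the grid to ±6 · scale.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Quantize-dequantize x with one shared power-of-two scale per block."""
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    # Shared exponent: align the block max with the largest E2M1 exponent (2).
    scale = 2.0 ** (np.floor(np.log2(np.maximum(amax, 2.0 ** -126))) - 2.0)
    mag = np.abs(xb) / scale
    # Nearest grid point per element (also clamps magnitudes above 6).
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(xb) * FP4_GRID[idx] * scale
    return q.reshape(x.shape)
```

Because the scale is a power of two and the grid values are exactly representable, the operation is idempotent: quantizing an already-quantized tensor returns it unchanged, which is a convenient sanity check for any fake-quantization routine used inside a calibration loop.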
B.2 BATQuant Calibration Procedure

The calibration process aims to learn the optimal parameters Θ that minimize the difference between the full-precision layer output and the quantized output. Algorithm 2 outlines the end-to-end training flow. For each layer in the network, we iterate over a small calibration dataset:
1. In each iteration, we apply the GPK-based affine transformation to weights and activations (Lines 3-4).
2. We apply the block-wise learnable clipping to weights and activations (Line 5).
3. The transformed activations and the corresponding inverse-transformed weights are quantized to the target MXFP format (Line 6).
4. The loss is computed as the mean squared error (MSE) between the full-precision output and the quantized output (Lines 7-8).
5. Parameters are updated via backpropagation using the AdamW optimizer (Line 9).
After calibration, the weight-side transformation P⁻¹ is fused into the original weights W offline, while the activation-side transformation P and the clipping parameters are retained for online inference.

Table 5: Performance comparison of various quantization methods on non-reasoning benchmarks across different bit-width configurations (e.g., W4A8KV16, W4A4KV16, W4A8KV8, and W4A8KV4). The recovery rate relative to the BF16 baseline is also provided, and the best result in each case is marked in bold.
| Bits | Method | ARC-C | ARC-E | HellaSwag | PIQA | Winogrande | Avg. | Recovery (%) |
|---|---|---|---|---|---|---|---|---|
| – | BF16 | 56.48 | 81.06 | 74.96 | 77.69 | 68.03 | 71.64 | 100.00 |
| W4A8KV16 | RTN | 55.72 | 80.81 | 73.29 | 77.09 | 66.93 | 70.77 | 98.75 |
| | QuaRot | 55.20 | 78.70 | 72.77 | 76.88 | 65.11 | 69.73 | 97.31 |
| | SpinQuant | 54.69 | 76.98 | 72.76 | 78.07 | 66.85 | 69.87 | 97.52 |
| | BRQ | 53.67 | 78.87 | 73.27 | 76.66 | 66.93 | 69.88 | 97.43 |
| | FlatQuant | 55.72 | 79.63 | 72.66 | 76.82 | 66.22 | 70.21 | 98.01 |
| | SmoothQuant | 55.80 | 79.04 | 72.38 | 76.55 | 66.85 | 70.12 | 97.93 |
| | GPTQ | 55.89 | 80.60 | 73.16 | 77.31 | 67.09 | 70.81 | 98.82 |
| | Ours | 56.14 | 79.92 | 73.10 | 77.97 | 68.59 | 71.14 | 99.34 |
| W4A4KV16 | RTN | 52.47 | 76.89 | 70.44 | 74.16 | 64.80 | 67.75 | 94.49 |
| | QuaRot | 50.43 | 74.28 | 67.55 | 73.67 | 63.38 | 65.86 | 91.81 |
| | SpinQuant | 45.65 | 68.18 | 67.41 | 74.21 | 62.19 | 63.53 | 88.36 |
| | BRQ | 48.55 | 74.71 | 68.79 | 75.24 | 63.93 | 66.24 | 92.14 |
| | FlatQuant | 50.60 | 78.20 | 70.36 | 75.63 | 63.54 | 67.67 | 94.13 |
| | SmoothQuant | 50.09 | 75.72 | 70.15 | 74.37 | 64.64 | 66.99 | 93.29 |
| | GPTQ | 51.28 | 76.98 | 70.47 | 75.79 | 64.56 | 67.82 | 94.44 |
| | Ours | 53.33 | 77.53 | 71.12 | 75.30 | 66.22 | 68.70 | 95.84 |
| W4A8KV8 | RTN | 55.72 | 80.51 | 72.86 | 76.55 | 66.93 | 70.51 | 98.42 |
| | QuaRot | 55.38 | 79.84 | 72.54 | 76.88 | 66.22 | 70.17 | 97.92 |
| | SpinQuant | 53.50 | 77.65 | 72.56 | 77.53 | 65.90 | 69.43 | 96.80 |
| | BRQ | 52.99 | 78.11 | 73.09 | 76.88 | 67.80 | 69.77 | 97.26 |
| | FlatQuant | 52.56 | 77.10 | 72.46 | 77.09 | 68.19 | 69.48 | 96.86 |
| | SmoothQuant | 55.03 | 79.21 | 72.76 | 76.99 | 67.40 | 70.28 | 98.08 |
| | GPTQ | 56.06 | 80.68 | 72.95 | 77.53 | 66.46 | 70.74 | 98.72 |
| | Ours | 55.63 | 79.80 | 73.15 | 77.09 | 67.17 | 70.57 | 98.50 |
| W4A8KV4 | RTN | 51.96 | 76.89 | 70.54 | 75.08 | 63.61 | 67.62 | 94.22 |
| | QuaRot | 52.73 | 76.47 | 70.15 | 74.81 | 62.27 | 67.29 | 93.82 |
| | SpinQuant | 49.32 | 74.07 | 69.82 | 75.95 | 63.30 | 66.49 | 92.53 |
| | BRQ | 50.68 | 75.97 | 70.38 | 74.65 | 62.43 | 66.82 | 93.04 |
| | FlatQuant | 52.13 | 77.90 | 69.60 | 75.14 | 62.51 | 67.46 | 93.97 |
| | SmoothQuant | 49.74 | 73.23 | 69.61 | 75.24 | 66.85 | 66.93 | 93.28 |
| | GPTQ | 52.39 | 76.52 | 71.25 | 75.73 | 65.35 | 68.25 | 95.15 |
| | Ours | 53.33 | 78.54 | 69.53 | 76.66 | 65.19 | 68.65 | 95.71 |

C Additional Results

C.1 Results of Non-Reasoning Tasks

Table 5 presents the comprehensive performance comparison on non-reasoning benchmarks (ARC-C, ARC-E, HellaSwag, PIQA, and Winogrande) under four distinct quantization configurations. In the most challenging W4A4KV16 configuration, BATQuant achieves an average accuracy of 68.70%, corresponding to a 95.84% recovery rate relative to the BF16 baseline. This significantly outperforms the strongest competing methods, including GPTQ (67.82%, 94.44%)
and FlatQuant (67.67%, 94.13%). Notably, rotation-based methods like SpinQuant suffer from catastrophic failure in this regime, dropping to only 63.53% accuracy. Similarly, under the W4A8KV4 setting with aggressive KV cache quantization, BATQuant secures the highest average accuracy (68.65%) and recovery rate (95.71%), surpassing GPTQ by a margin of 0.40%. Under the W4A8KV16 configuration, BATQuant achieves a near-lossless recovery rate of 99.34% (Avg. 71.14%), establishing a new state-of-the-art result that exceeds GPTQ (98.82%) and RTN (98.75%). In the W4A8KV8 setting, the performance gap narrows as the quantization difficulty decreases. Here, GPTQ achieves the highest average score (70.74%), while BATQuant remains highly competitive at 70.57%, outperforming all other transformation-based methods (e.g., FlatQuant at 69.48%).

C.2 Results of GPTQ and RTN weight quantizers

Table 6 and Table 7 compare GPTQ and RTN as weight quantizers across various MXFP configurations. The results show that GPTQ consistently outperforms RTN in all evaluated settings. This improvement is attributed to GPTQ's approximate second-order optimization, which minimizes quantization error by accounting for inter-channel weight correlations. In contrast, RTN applies per-element rounding independently, without leveraging the structural redundancy that GPTQ exploits for error compensation. Given these consistent results, GPTQ serves as the more effective weight quantization strategy.

C.3 Activation Visualization

Here, we provide the full details of activation distributions within different quantization blocks, as shown in Figure 8, Figure 9, Figure 10, and Figure 11.

C.4 Case Studies

We qualitatively compare BATQuant against BRQ on geometric reasoning and OCR tasks under the W4A4KV16 setting. As shown in Figures 12 and 13, while BRQ suffers from feature distortion leading to hallucinations, BATQuant preserves critical visual details, matching the BF16 baseline.
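The recovery rates quoted in C.1 can be reproduced from the per-benchmark scores in Table 5. A minimal sketch (the function name `recovery` and the dictionaries are our own, not the paper's code): Recovery (%) is the mean of per-benchmark ratios against BF16, rather than the ratio of the two averages.

```python
# BF16 baseline scores from Table 5 (ARC-C, ARC-E, HellaSwag, PIQA, Winogrande).
bf16 = {"arc_c": 56.48, "arc_e": 81.06, "hellaswag": 74.96,
        "piqa": 77.69, "winogrande": 68.03}

# BATQuant ("Ours") under the W4A4KV16 configuration, from Table 5.
ours = {"arc_c": 53.33, "arc_e": 77.53, "hellaswag": 71.12,
        "piqa": 75.30, "winogrande": 66.22}

def recovery(quant, base):
    """Mean of per-benchmark recovery ratios relative to the baseline, in percent."""
    ratios = [quant[k] / base[k] for k in base]
    return 100.0 * sum(ratios) / len(ratios)

print(round(recovery(ours, bf16), 2))  # 95.84, matching the reported value
```

The same computation applied to the GPTQ row of the W4A4KV16 block (51.28, 76.98, 70.47, 75.79, 64.56) yields 94.44, again matching the table.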
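The RTN baseline discussed in C.2 rounds each element independently to the nearest representable value under a shared per-block scale. A minimal sketch of block-wise RTN for the MXFP4 format (FP4 E2M1 elements, 32-element blocks, power-of-two shared scale), assuming the standard MX scale rule; `rtn_mxfp4` and `FP4_GRID` are our own illustrative names, not the paper's implementation:

```python
import numpy as np

# Non-negative magnitudes representable by the FP4 (E2M1) element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_mxfp4(x, block=32):
    """Block-wise round-to-nearest MXFP4 quantization (simplified sketch).

    Each block shares one power-of-two scale derived from its absolute
    maximum; elements are then rounded independently onto the FP4 grid,
    which is exactly the per-element behavior GPTQ improves upon.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        blk = x[start:start + block]
        amax = np.abs(blk).max()
        # MX-style shared exponent: align amax with the FP4 max magnitude (6).
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
        scaled = blk / scale
        # Independent round-to-nearest onto the grid (values above 6 clip).
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out
```

Values that are exactly representable round-trip unchanged (e.g., a block [1.0, −3.0, 0.5, 6.0] quantizes to itself), while anything beyond 6 times the shared scale clips to the FP4 maximum; it is this block-local clipping that cross-block outlier propagation aggravates.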
In Figure 12, the task requires counting line intersections. The BRQ baseline incorrectly hallucinates an intersection point (1), likely due to quantization noise distorting edge continuity. In contrast, BATQuant correctly identifies zero intersections (0), demonstrating superior preservation of spatial structures. Figure 13 presents a challenging train-number recognition task. BRQ fails to capture the full sequence, truncating the answer to "055". Conversely, BATQuant accurately recovers the complete number "055 05995", proving its effectiveness in retaining high-frequency details essential for dense text recognition. These cases highlight that, unlike BRQ, which struggles with subtle visual cues under aggressive quantization, BATQuant robustly maintains semantic fidelity.

Fig. 8: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with RTN. (Histogram panels omitted.)

Fig. 9: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with BRQ. (Histogram panels omitted.)

Fig. 10: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with QuaRot. (Histogram panels omitted.)

Fig. 11: Activation distributions within different quantization blocks of the down_proj module in layer 35 of Qwen3-8B with BATQuant (Ours). (Histogram panels omitted.)

Table 6: Performance comparison of different quantization methods on multimodal benchmarks using RTN and GPTQ as weight quantizers. Bold indicates the best result within each quantizer setting (RTN or GPTQ) for a specific bit configuration.
| Bits | Method | Quantizer | MME | OCRBench | DocVQA | RealWorldQA | VLMBlind | Recovery (%) |
|---|---|---|---|---|---|---|---|---|
| W4A8KV16 | QuaRot | RTN | 2201 | 814 | 93.11 | 65.36 | 63.21 | 91.43 |
| | BRQ | RTN | 2272 | 831 | 93.66 | **69.80** | 62.11 | 93.47 |
| | FlatQuant | RTN | 2311 | **880** | **94.65** | 66.14 | 67.96 | 95.64 |
| | BATQuant | RTN | **2312** | 877 | 94.58 | 66.80 | **69.27** | **96.11** |
| | QuaRot | GPTQ | 2327 | 870 | 95.07 | 69.80 | 71.12 | 97.53 |
| | BRQ | GPTQ | 2329 | 865 | 94.72 | 70.19 | 67.18 | 96.40 |
| | FlatQuant | GPTQ | 2351 | 886 | 95.31 | 69.02 | **73.90** | 98.66 |
| | BATQuant | GPTQ | **2386** | **893** | **95.55** | **70.20** | 73.14 | **99.29** |
| W4A4KV16 | QuaRot | RTN | 1965 | 710 | 90.91 | 60.48 | 55.31 | 83.18 |
| | BRQ | RTN | 2096 | 749 | 91.09 | 61.83 | 56.65 | 85.92 |
| | FlatQuant | RTN | 2147 | **846** | 93.14 | 62.48 | 65.49 | 91.49 |
| | BATQuant | RTN | **2255** | 838 | **93.68** | **64.71** | **66.84** | **93.33** |
| | QuaRot | GPTQ | 2189 | 810 | 93.47 | 64.97 | 57.62 | 89.69 |
| | BRQ | GPTQ | 2147 | 805 | 92.94 | 66.14 | 62.41 | 90.75 |
| | FlatQuant | GPTQ | 2231 | **873** | 94.10 | 65.62 | 68.86 | 94.79 |
| | BATQuant | GPTQ | **2360** | 864 | **94.31** | **67.32** | **69.70** | **96.43** |
| W4A8KV8 | QuaRot | RTN | 2143 | 816 | 93.27 | 65.36 | 62.49 | 90.82 |
| | BRQ | RTN | 2277 | 815 | 93.55 | **69.93** | 60.24 | 92.67 |
| | FlatQuant | RTN | 2285 | **871** | 94.11 | 60.52 | 70.04 | 94.09 |
| | BATQuant | RTN | **2301** | 867 | **94.72** | 66.67 | **72.71** | **96.71** |
| | QuaRot | GPTQ | 2296 | 868 | 95.11 | 69.02 | 70.26 | 96.78 |
| | BRQ | GPTQ | 2283 | 867 | 94.63 | 69.80 | 67.36 | 95.98 |
| | FlatQuant | GPTQ | 2353 | 888 | 95.12 | 69.14 | 72.77 | 98.41 |
| | BATQuant | GPTQ | **2368** | **890** | **95.47** | **69.93** | **72.82** | **98.89** |
| W4A8KV4 | QuaRot | RTN | 2112 | 781 | 92.67 | 62.48 | 60.34 | 88.27 |
| | BRQ | RTN | 2194 | 807 | 92.75 | **66.27** | 57.31 | 89.80 |
| | FlatQuant | RTN | 2257 | 867 | 94.05 | 59.87 | 71.05 | 93.84 |
| | BATQuant | RTN | **2289** | **874** | **94.64** | 64.97 | **71.06** | **95.83** |
| | QuaRot | GPTQ | 2280 | 857 | 94.66 | 68.52 | 68.36 | 95.65 |
| | BRQ | GPTQ | 2236 | 841 | 94.07 | 68.63 | 66.03 | 94.21 |
| | FlatQuant | GPTQ | 2293 | 884 | 94.88 | **68.76** | 70.75 | 97.11 |
| | BATQuant | GPTQ | **2332** | **885** | **95.07** | 68.63 | **70.92** | **97.51** |

Table 7: Performance comparison of different quantization methods on LLM non-reasoning benchmarks using RTN and GPTQ as weight quantizers. Bold indicates the best result within each quantizer setting (RTN or GPTQ) for a specific bit configuration.
| Bits | Method | Quantizer | ARC-C | ARC-E | HellaSwag | PIQA | Winogrande | Avg. |
|---|---|---|---|---|---|---|---|---|
| W4A8KV16 | QuaRot | RTN | 51.37 | 75.76 | 70.04 | **76.61** | 65.67 | 67.89 |
| | BRQ | RTN | 47.44 | 72.87 | 71.37 | 75.84 | 65.19 | 66.54 |
| | FlatQuant | RTN | **55.63** | **78.83** | **72.46** | 76.22 | 66.85 | **70.00** |
| | BATQuant | RTN | 54.33 | 77.48 | 72.23 | **76.61** | **68.25** | 69.78 |
| | QuaRot | GPTQ | 55.20 | 78.70 | 72.77 | 76.88 | 65.11 | 69.73 |
| | BRQ | GPTQ | 53.67 | 78.87 | **73.27** | 76.66 | 66.93 | 69.88 |
| | FlatQuant | GPTQ | 55.72 | 79.63 | 72.66 | 76.82 | 66.22 | 70.21 |
| | BATQuant | GPTQ | **56.14** | **79.92** | 73.10 | **77.97** | **68.59** | **71.14** |
| W4A4KV16 | QuaRot | RTN | 44.88 | 70.37 | 65.09 | 74.54 | 62.51 | 63.48 |
| | BRQ | RTN | 45.90 | 67.51 | 68.47 | 74.16 | 61.40 | 63.49 |
| | FlatQuant | RTN | **51.11** | **75.93** | 69.02 | 74.92 | 62.83 | 66.76 |
| | BATQuant | RTN | 50.09 | 75.55 | **71.00** | **75.19** | **66.85** | **67.74** |
| | QuaRot | GPTQ | 50.43 | 74.28 | 67.55 | 73.67 | 63.38 | 65.86 |
| | BRQ | GPTQ | 48.55 | 74.71 | 68.79 | 75.24 | 63.93 | 66.24 |
| | FlatQuant | GPTQ | 50.60 | **78.20** | 70.36 | **75.63** | 63.54 | 67.67 |
| | BATQuant | GPTQ | **53.33** | 77.53 | **71.12** | 75.30 | **66.22** | **68.70** |
| W4A8KV4 | QuaRot | RTN | 47.18 | 72.64 | 67.43 | 74.32 | 60.06 | 64.33 |
| | BRQ | RTN | 45.82 | 69.82 | 69.71 | 74.21 | 62.43 | 64.40 |
| | FlatQuant | RTN | 48.12 | 73.23 | 68.96 | 74.37 | 63.30 | 65.60 |
| | BATQuant | RTN | **50.85** | **75.97** | **70.07** | **76.50** | **64.56** | **67.59** |
| | QuaRot | GPTQ | 52.73 | 76.47 | 70.15 | 74.81 | 62.27 | 67.29 |
| | BRQ | GPTQ | 50.68 | 75.97 | **70.38** | 74.65 | 62.43 | 66.82 |
| | FlatQuant | GPTQ | 52.13 | 77.90 | 69.60 | 75.14 | 62.51 | 67.46 |
| | BATQuant | GPTQ | **53.33** | **78.54** | 69.53 | **76.66** | **65.19** | **68.65** |
| W4A8KV8 | QuaRot | RTN | 52.30 | 76.47 | 69.68 | **77.04** | **65.67** | 68.23 |
| | BRQ | RTN | 48.55 | 72.47 | 71.84 | 76.66 | 64.96 | 66.90 |
| | FlatQuant | RTN | 52.73 | 77.09 | 72.18 | 76.71 | 64.25 | 68.59 |
| | BATQuant | RTN | **54.52** | **79.55** | **72.20** | 76.50 | 65.59 | **69.67** |
| | QuaRot | GPTQ | 55.38 | **79.84** | 72.54 | 76.88 | 66.22 | 70.17 |
| | BRQ | GPTQ | 52.99 | 78.11 | 73.09 | 76.88 | 67.80 | 69.77 |
| | FlatQuant | GPTQ | 52.56 | 77.10 | 72.46 | **77.09** | **68.19** | 69.48 |
| | BATQuant | GPTQ | **55.63** | 79.80 | **73.15** | **77.09** | 67.17 | **70.57** |

[Image omitted.]

Prompt: Question: Count the intersection points where the blue and red lines meet. Put your answer in curly brackets, e.g., 2.

Responses:
- BF16: 0
- BRQ (W4A4KV16): "There is one intersection point where the blue and red lines meet. The red lines form a 'V' shape, and the blue line runs diagonally across the image. They intersect at a single point." 1
- BATQuant (W4A4KV16): 0

Fig. 12: Case study of Qwen3-VL-8B-Instruct on VLMBlind. The input includes a real image (shown above) and a text question asking to count intersection points.
Compared with the BRQ method, which fails by hallucinating an intersection (1), BATQuant correctly identifies that there are no intersections (0), matching the BF16 baseline.

[Image omitted.]

Prompt: What is the train number?

Responses:
- BF16: 055 05995
- BRQ (W4A4KV16): "Based on the image provided, the train number is **055**. This number is clearly visible on the side of the train car in the image. It appears as a white or light-colored number (likely "055") ... The train appears to be a passenger train, likely operating within Vietnam given the Vietnamese language used in the markings."
- BATQuant (W4A4KV16): "Based on the image provided, the train number is **055 05995**. This number is clearly visible on the side of the train car, just below the window. The "055" appears to be the train's route or service number, and "05995" is likely its specific car or vehicle number within that service."

Fig. 13: Case study of Qwen3-VL-8B-Instruct on OCRBench. The input includes a real image of a train and a question asking for the train number. Compared with the BRQ method, which fails by only recognizing partial information ("055"), BATQuant correctly identifies the full train number ("055 05995"), matching the BF16 baseline.