Paper deep dive
CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/22/2026, 5:59:50 AM
Summary
CARE is a novel conversion pipeline that transforms pretrained attention modules (GQA/MHA) into Multi-Head Latent Attention (MLA) by utilizing covariance-aware factorization and rank-adaptive scheduling. It addresses the limitations of naive SVD-based methods by minimizing activation-space error rather than weight-space error and dynamically allocating rank across layers to preserve model fidelity under fixed KV-cache budgets.
Entities (5)
Relation Signals (3)
CARE → improves → Multi-Head Latent Attention
confidence 95% · CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline
CARE → outperforms → SVD
confidence 95% · Our method outperforms a uniform-rank SVD baseline
CARE → applied to → LLaMA-3.1-8B
confidence 90% · Our method outperforms a uniform-rank SVD baseline on ... Llama-3.1-8B/70B-Instruct
Cypher Suggestions (2)
Find all models improved by the CARE method · confidence 90% · unvalidated
MATCH (m:Model)-[:IMPROVED_BY]->(a:Method {name: 'CARE'}) RETURN m.name
List all architectures that can be converted to MLA · confidence 85% · unvalidated
MATCH (a:Architecture)-[:CONVERTIBLE_TO]->(m:Architecture {name: 'MLA'}) RETURN a.name
Abstract
Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215× and improving mean accuracy by up to 1.70× at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model's accuracy.
Tags
Links
- Source: https://arxiv.org/abs/2603.17946v1
- Canonical: https://arxiv.org/abs/2603.17946v1
Full Text
117,529 characters extracted from source content.
Published as a conference paper at ICLR 2026

CARE: COVARIANCE-AWARE AND RANK-ENHANCED DECOMPOSITION FOR ENABLING MULTI-HEAD LATENT ATTENTION

Zhongzhu Zhou 1,3, Fengxiang Bie 1, Ziyan Chen 1, Zhenyu Zhang 4, Yibo Yang 2, Junxiong Wang 3, Ben Athiwaratkun 3, Xiaoxia Wu 3∗†, Shuaiwen Leon Song 1,3∗

1 University of Sydney, 2 King Abdullah University of Science and Technology, 3 Together AI, 4 University of Texas at Austin

∗ Equal advising. † Corresponding author: shirely@together.ai. Code is available at https://github.com/FutureMLS-Lab/CARE.

ABSTRACT

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215× and improving mean accuracy by up to 1.70× at matched KV budgets. With a brief post-SVD "healing" fine-tune, we fully recover the original model's accuracy.

1 INTRODUCTION

Large Language Models (LLMs) deliver impressive capabilities but at high inference cost, with the key–value (KV) cache in self-attention emerging as a primary memory and bandwidth bottleneck (Vaswani et al., 2017; Kwon et al., 2023). In the standard multi-head attention (MHA) formulation, each head materializes and caches its own keys and values at every decoding step, causing the KV footprint to grow linearly with sequence length and head count. To alleviate this, architecture variants such as multi-query attention (MQA), which shares a single K, V across all heads, and grouped-query attention (GQA), which shares K, V within head groups, have been adopted at scale to shrink KV-cache size (Shazeer, 2019; Ainslie et al., 2023; Touvron et al., 2023; Jiang et al., 2023; Chowdhery et al., 2022; Shoeybi et al., 2019). While effective, these variants reduce the number of distinct key/value projections, which can limit attention expressivity and introduce quality regressions when compression is pushed aggressively.

A more recent line of work reframes the KV-cache problem as one of learned low-rank representation (Wang et al., 2020; Xiong et al., 2021). Multi-Head Latent Attention (MLA) compresses keys and values into low-dimensional latent vectors, caches only these latents, and restores expressivity with lightweight up-down projections at compute time (DeepSeek-AI Team, 2024). In practice, MLA can dramatically reduce KV size while preserving or even improving task accuracy by trading memory and communication for modest extra floating-point operations (FLOPs) in the projections (DeepSeek-AI Team, 2024; Guo et al., 2025; Liu et al., 2024a; Geens & Verhelst, 2025).
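To make the cache arithmetic concrete, here is a small sketch (ours, not from the paper) of the per-token KV widths for MHA, GQA, and MLA under the paper's notation; the specific head counts are an illustrative configuration, and the KV-parity rank follows Eq. 9 in Appendix B:

```python
# Per-token KV-cache widths (in elements) for MHA, GQA, and MLA, using the
# paper's notation: n_h heads of size d_h, g_h GQA groups, MLA latent rank r.
# The numbers below are an illustrative configuration, not from the paper.
n_h, d_h, g_h = 32, 128, 8
r = g_h * d_h                      # KV-parity choice from Appendix B (Eq. 9)

mha = 2 * n_h * d_h                # per-head K and V cached for every head
gqa = 2 * g_h * d_h                # one shared K and V per group
mla = 2 * r                        # cached K/V latents only (RoPE channel extra)

print(f"MHA: {mha}  GQA: {gqa}  MLA: {mla}  (MLA == GQA at KV parity: {mla == gqa})")
```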
[Figure 1] (a) Naive MLA transfer: jointly factorize $W_K^{(g)}$ and $W_V^{(g)}$ by SVD and truncate to a uniform per-layer rank, optimizing $\|W - \hat{W}\|_F$ while ignoring layerwise heterogeneity. (b) CARE: estimate activation covariance $C$, factorize $\sqrt{C}\,W$, unwhiten via $\sqrt{C}^{-1}$ to initialize MLA factors, and use the singular spectrum of $\sqrt{C}\,W$ for global dynamic rank scheduling under KV parity. This preserves activation geometry and yields a stronger one-shot initialization with less healing. (Only the caption is recoverable from the extraction; the diagram panels are omitted.)

Despite these advantages, the ecosystem is dominated by pretrained MHA/GQA checkpoints (Touvron et al., 2023; Jiang et al., 2023; Yang et al., 2024a). Retraining large models from scratch under MLA is expensive, so a natural question arises: can we convert strong, pretrained MHA/GQA models into MLA post hoc, without increasing the KV budget and without incurring large performance loss?

Recent work has explored converting traditional attention (MHA/GQA) into multi-head latent attention (MLA) under fixed KV width. TransMLA (Meng et al., 2025) demonstrates that every GQA layer admits an equivalent MLA parameterization and proposes a practical post-training mapping followed by light finetuning. MHA2MLA (Ji et al., 2025) generalizes to MHA→MLA by addressing positional-encoding mismatches (e.g., partial RoPE adjustments) and initializing $W_K$, $W_V$ with low-rank joint SVD before efficient recovery (Su et al., 2021b). X-EcoMLA (Li et al., 2025b) constructs MLA from pretrained MHA via structured SVD decomposition and cross-layer distillation-based parameter recycling for efficient latent-projection initialization. In parallel, general-purpose SVD-based compression and approximation methods (e.g., SVD-LLM V2 (Wang et al., 2024b; 2025b) and activation-aware SVD (ASVD) (Yuan et al., 2023)) improve truncation and orientation beyond naïve weight SVD, while cache-centric baselines such as Palu (Chang et al., 2024) reduce KV memory via low-rank KV-cache projection. Together, these works establish MLA as a promising post-hoc target and highlight low-rank factorization as central to preserving pretrained knowledge under KV-constrained reparameterizations (Hu et al., 2022; Denil et al., 2013; Denton et al., 2014; Sainath et al., 2013; Eckart & Young, 1936).

However, direct SVD initialization has two key shortcomings. First, it minimizes error in weight space ($\|W - \hat{W}\|$) rather than activation space ($\|XW - X\hat{W}\|$), ignoring how the projection actually operates during decoding (Hassibi et al., 1993; Wang et al., 2024b; Yuan et al., 2023). This mismatch induces attention-logit drift even when the weight approximation is accurate. Second, it enforces a uniform rank across layers, neglecting differences in spectral structure. Layers with fast spectral decay are over-compressed, while those with slower decay are under-compressed, leading to fidelity loss and heavier reliance on post-conversion finetuning.
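The first shortcoming is easy to reproduce numerically. The following sketch (our illustration, not from the paper; the helper name `truncate` is ours) builds an anisotropic input distribution and shows that plain SVD truncation of $W$, which is optimal in weight space by Eckart–Young, is worse in activation space than truncating the whitened operator $\sqrt{C}\,W$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, r = 64, 4096, 8

# Anisotropic inputs: a few directions carry most of the activation energy.
scales = np.geomspace(10.0, 0.1, D)
X = rng.standard_normal((T, D)) * scales
W = rng.standard_normal((D, D))

def truncate(A, r):
    """Best rank-r approximation of A in Frobenius norm (Eckart-Young)."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]

# Weight-space-optimal truncation of W directly.
W_plain = truncate(W, r)

# Activation-aware truncation: whiten with sqrt(C), truncate, unwhiten.
C = (X.T @ X) / T
eigval, eigvec = np.linalg.eigh(C)
sqrtC = (eigvec * np.sqrt(np.clip(eigval, 1e-12, None))) @ eigvec.T
inv_sqrtC = (eigvec / np.sqrt(np.clip(eigval, 1e-12, None))) @ eigvec.T
W_white = inv_sqrtC @ truncate(sqrtC @ W, r)

act_err = lambda What: np.linalg.norm(X @ W - X @ What)
print("activation error, plain SVD   :", act_err(W_plain))
print("activation error, whitened SVD:", act_err(W_white))
# The whitened variant yields a lower ||XW - XW_hat||_F on anisotropic inputs.
```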
To address the above two shortcomings, we propose CARE, a Covariance-Aware, Rank-Enhanced conversion pipeline, as shown in Fig. 1. First, CARE makes the decomposition activation-aware: rather than applying vanilla SVD to $W$, we solve a whitened approximation problem by applying SVD to $\sqrt{C}\,W$ and then unwhitening to obtain $\hat{W}$,¹ where $C$ summarizes the input activation covariance estimated from a modest calibration set. This ensures that dominant activation directions are preserved and substantially reduces attention-logit error before any finetuning. Second, CARE is rank-adaptive: it distributes a fixed KV budget across layers and heads based on their singular spectra, allocating higher rank to spectrally complex matrices and lower rank to intrinsically low-rank ones, akin in spirit to budgeted, importance-aware adapter methods (Zhang et al., 2023; Valipour et al., 2023; Hu et al., 2022; Wang et al., 2025a). This budgeted, importance-aware scheduling maintains fidelity under the KV constraint while reducing reliance on post-conversion finetuning.

¹ Here "whiten" means to use $\sqrt{C}\,W$ instead of $W$ as the compression target, and "unwhiten" means the inverse map, which recovers the compressed $W$.

Contributions.

- Activation-aware initialization. We propose a covariance-aware factorization that minimizes activation error $\|XW - X\hat{W}\|$ (rather than weight error), implemented via SVD on a whitened operator and subsequent unwhitening. This preserves attention logits more faithfully at equal KV budget to initialize MLA weights.
- Rank-adaptive scheduling under fixed KV width. We introduce a singular-value-guided allocation that distributes rank unevenly across layers/heads and the K, V matrices, matching spectral difficulty and improving zero-shot fidelity compared with uniform ranks.
- KV-parity mapping and practical pipeline. We derive a KV-parity reparameterization for MLA conversion and integrate the above techniques into a practical conversion pipeline. CARE-converted models exhibit lower activation error and improved task quality over naive (joint) SVD baselines at equal KV cost, while requiring less data to recover residual gaps.

2 NAÏVE (JOINT) SVD IS NOT ENOUGH FOR MLA TRANSFER

Multi-head latent attention (MLA) transfer is often initialized with singular value decomposition (SVD), either per matrix or via a joint factorization across related matrices (e.g., $W_K$, $W_V$). While convenient, this practice implicitly optimizes weight-space error $\|W - \hat{W}\|_F$ and assumes that the spectrum alone reveals task importance. In this section we show that both assumptions break down in practice and, consequently, naïve (joint) SVD is an unreliable recipe for high-fidelity MLA transfer under a fixed KV budget.

[Figure 2] (a) Accuracy under 50% rank reduction applied one layer at a time in DeepSeek-V2-Lite, measured on ARC Challenge and MMLU. Sensitivity is strongly layer-dependent. (b) WikiText perplexity under grouped truncation of GQA attention in Llama-3-8B (layers 30–32). Singular-value groups are ordered by magnitude; the resulting non-monotone degradation shows that singular values alone are an imperfect proxy for MLA conversion quality.

Observation 1: Accuracy-preserving rank is not uniform across layers. Fig. 2 (a) exhibits pronounced layer-wise heterogeneity when we halve the rank of every layer: some layers tolerate aggressive reduction with negligible loss, whereas others incur sharp drops on ARC and MMLU. A one-size-fits-all policy (uniform pruning or a fixed ratio per layer) either over-compresses fragile layers, degrading task performance, or under-compresses robust layers, wasting KV budget.

Observation 2: Singular values are poor proxies for accuracy importance. A common heuristic treats singular values as importance scores, expecting that truncating smaller values should least affect accuracy. We directly test this with a brute-force ablation: given $W = U \Sigma V^{\top}$, we set the $i$-th singular value to zero and reconstruct $\bar{W}^{(i)} = U\,\mathrm{diag}(\sigma_1, \ldots, 0, \ldots, \sigma_r)\,V^{\top}$, treating $V^{\top}$ as the MLA compress (down) projection and $U$ as the MLA expand (up) projection. As shown in Fig. 2 (b), the link between singular-value magnitude and downstream accuracy is non-monotonic. We conjecture the root cause is a mismatch of objectives and statistics: vanilla SVD minimizes weight error, not activation-space error $\|XW - X\hat{W}\|$ under the true (anisotropic) input distribution.
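This brute-force test is simple to reproduce in miniature. The sketch below (our illustration; random matrices stand in for pretrained weights and calibration activations) zeroes one singular value at a time and measures the induced activation error; under anisotropic inputs, the error ordering need not follow singular-value magnitude:

```python
import numpy as np

rng = np.random.default_rng(1)
D, T = 32, 2048
scales = np.geomspace(5.0, 0.05, D)
X = rng.standard_normal((T, D)) * scales   # anisotropic calibration inputs
W = rng.standard_normal((D, D))

U, S, Vt = np.linalg.svd(W, full_matrices=False)
base = X @ W
errs = []
for i in range(len(S)):
    S_i = S.copy()
    S_i[i] = 0.0                            # drop exactly one singular value
    W_i = (U * S_i) @ Vt                    # reconstruct bar{W}^{(i)}
    errs.append(np.linalg.norm(base - X @ W_i))

# If singular values were a faithful importance proxy, errs would decrease
# monotonically with i; under anisotropic X it generally does not.
order = np.argsort(errs)[::-1]
print("singular-value indices, ranked by induced activation error:", order[:8])
```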
Thus, naïve (joint) SVD is not enough: (i) the rank needed to preserve accuracy varies by layer, and (ii) singular values alone do not reflect the accuracy importance of the initialization.

3 CARE: COVARIANCE-AWARE AND RANK-ENHANCED MLA CONVERSION

We propose CARE, a post-hoc conversion pipeline that maps a pretrained MHA or GQA layer to an MLA layer at the same KV budget, while explicitly minimizing activation error and adapting ranks to spectral difficulty. CARE revisits low-rank factorization through the lens of activation statistics and integrates covariance-weighted SVD and non-uniform rank allocation into a single, practical procedure compatible with multiple model backbones.

Notation. Consider a fixed layer with input activations $X \in \mathbb{R}^{T \times D}$, with sequence length $T$ and embedding dimension $D$. The multi-head attention in our setup contains $n_h$ heads of size $d_h$, where we assume the output space corresponds with the input space, formally $n_h d_h = D$. We denote by $g_h$ the number of GQA groups, where for this layer $Q = XW_Q$, $K = XW^{(g)}_K \in \mathbb{R}^{T \times (g_h d_h)}$, and $V = XW^{(g)}_V \in \mathbb{R}^{T \times (g_h d_h)}$ with $g_h < n_h$, and $W^{(g)}_{\cdot}$ represents a weight matrix under the GQA setup. By contrast, an MLA layer with latent rank $r$ uses $K = (XW^a_K)W^b_K$ and $V = (XW^a_V)W^b_V$, where $W^a_{\cdot} \in \mathbb{R}^{D \times r}$ and $W^b_{\cdot} \in \mathbb{R}^{r \times (n_h d_h)}$. Only the latent $XW^a_{\cdot} \in \mathbb{R}^{T \times r}$ is cached, with $W^b_{\cdot}$ used to recover the KV matrices.

3.1 PRELIMINARY: COVARIANCE FOR INPUT ACTIVATIONS

Let $X^{(l)}_b \in \mathbb{R}^{T_b \times D}$ be the $b$-th batch of domain activations with length $T_b$ at some layer $l$ (the number of tokens remains the same across layers, but can differ across batches). These batches of domain activations are used to calculate the extent of preserved rank in Sec. 3.2 and to initialize the trainable parameters in Sec. 3.3. We then define $C^{(l)}$, the covariance matrix over all $N$ batches at layer $l$, as

$$C^{(l)} = \frac{1}{N} \sum_{b=1}^{N} \big(X^{(l)}_b\big)^{\top} X^{(l)}_b.$$

Note that although we call $C^{(l)}$ a covariance matrix, it differs slightly from the standard definition: we apply neither centering nor normalization here (see App. E for the formal definition).
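A minimal sketch of this estimator, assuming activations are collected per layer as a list of $(T_b, D)$ arrays (the helper name `layer_covariance` is ours, not the paper's code):

```python
import numpy as np

def layer_covariance(batches):
    """Uncentered covariance C = (1/N) * sum_b X_b^T X_b over calibration
    batches, as in Sec. 3.1 (no mean subtraction or normalization)."""
    C, n = None, 0
    for X in batches:                 # X: (T_b, D) activations at one layer
        G = X.T @ X
        C = G if C is None else C + G
        n += 1
    return C / n

# Hypothetical usage with random stand-ins for domain activations:
rng = np.random.default_rng(0)
batches = [rng.standard_normal((128, 64)) for _ in range(8)]
C = layer_covariance(batches)
# C is symmetric positive semidefinite by construction.
assert np.allclose(C, C.T) and np.all(np.linalg.eigvalsh(C) >= -1e-8)
```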
3.2 ADJUSTED-RANK SCHEDULING ACROSS LAYERS

Because key/value spectra are heterogeneous across layers, the retained rank of each layer under MLA should differ. Let $\mathcal{W} = \{\widetilde{W}^{(l)}_K, \widetilde{W}^{(l)}_V\}_{l=1}^{L}$ represent the pretrained KV weights from Sec. B, where $L$ is the number of layers. Note that $\mathcal{W}$ contains $2L$ weight matrices, each with dimension $D \times n_h d_h$ (recall $D = n_h d_h$).

Given a total key-rank budget $R^{(K)}_{\mathrm{tot}}$ (and likewise for values), we score the next rank by its normalized residual reduction. Let

$$\rho^{(l)}_{K,m} = \frac{\big(\sigma^{(l)}_{K,m}\big)^2}{\sum_{i=1}^{R^{(l)}_K} \big(\sigma^{(l)}_{K,i}\big)^2},$$

where $\sigma^{(l)}_{K,m}$ (and likewise for $V$) is the $m$-th largest singular value of $\sqrt{C^{(l)}}\,\widetilde{W}^{(l)}_K$, $C^{(l)}$ is the covariance matrix defined in Sec. 3.1, and $R^{(l)}_K$ is the rank of $\sqrt{C^{(l)}}\,\widetilde{W}^{(l)}_K$. For a layer currently assigned rank $r^{(l)}_K$, we define the priority of allocating one more rank as

$$s^{(l)}_K\big(r^{(l)}_K\big) = \frac{\rho^{(l)}_{K,\, r^{(l)}_K + 1}}{1 - \sum_{m=1}^{r^{(l)}_K} \rho^{(l)}_{K,m}} = \frac{\big(\sigma^{(l)}_{K,\, r^{(l)}_K + 1}\big)^2}{\sum_{m=r^{(l)}_K + 1}^{R^{(l)}_K} \big(\sigma^{(l)}_{K,m}\big)^2}. \tag{1}$$

Appendix C proves that the Frobenius residual after rank-$r$ truncation equals the squared tail energy; normalizing by this residual makes layers with different spectral scales comparable. We allocate the budget $R^{(K)}_{\mathrm{tot}}$ by greedy water-filling: starting from $r^{(l)}_K = C$ for all $l$, where $C$ is some constant, we repeatedly assign one rank to $l^{\star} = \arg\max_l s^{(l)}_K(r^{(l)}_K)$ until the budget is exhausted; $V$ is handled identically with its own budget.
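One plausible rendering of this greedy water-filling loop is sketched below; the function name and the `floor` seed (playing the role of the constant $C$) are our choices, not the paper's released code:

```python
import numpy as np

def allocate_ranks(spectra, total_budget, floor=8):
    """Greedy water-filling over per-layer singular spectra (Eq. 1).
    spectra[l]: descending singular values of sqrt(C^(l)) @ W^(l).
    Returns per-layer ranks summing to total_budget (or less if saturated)."""
    energies = [s**2 for s in spectra]           # squared singular values
    ranks = [min(floor, len(s)) for s in spectra]
    used = sum(ranks)
    while used < total_budget:
        best, best_score = None, -1.0
        for l, e in enumerate(energies):
            r = ranks[l]
            if r >= len(e):
                continue                          # layer already at full rank
            tail = e[r:].sum()                    # residual energy after rank r
            score = e[r] / tail if tail > 0 else 0.0   # s^(l)(r) in Eq. 1
            if score > best_score:
                best, best_score = l, score
        if best is None:
            break
        ranks[best] += 1
        used += 1
    return ranks

# Hypothetical spectra with different decay rates across three layers:
rng = np.random.default_rng(0)
spectra = [np.sort(rng.exponential(sc, 256))[::-1] for sc in (1.0, 0.3, 3.0)]
print(allocate_ranks(spectra, total_budget=192))
```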
3.3 RETHINKING SVD WITH ACTIVATION COVARIANCE

A naive rank lowering attempts to minimize the Frobenius error $\|W - \widehat{W}\|_F$, where $\widehat{W}$ denotes the rank-reduced approximation and $W := W^{(l)}_{\cdot} \in \mathcal{W}$ is the original pretrained weight matrix to compress at layer $l$. For inference fidelity, we propose that the relevant objective is minimizing the empirical activation error. Focusing on compressing a given weight matrix $W$, we denote by $r$ the target compressed rank calculated in Sec. 3.2. Here $\{X_b := X^{(l)}_b \in \mathbb{R}^{T_b \times D}\}_{b=1}^{N}$ represents the small domain-activation batches used to compute the low-rank decomposition at layer $l$, where $N$ is the total number of batches, $T_b$ is the length of each batch, and $C := C^{(l)}$ is the covariance matrix defined in Sec. 3.1. Formally, we minimize

$$\min_{\mathrm{rank}(\widehat{W}) \le r} \; \frac{1}{N} \sum_{b=1}^{N} \|X_b W - X_b \widehat{W}\|_F^2. \tag{2}$$

This optimization formalizes how well $\widehat{W}$ preserves $X_b W$ on relevant inputs $X_b$. From the definition of $C$, one can show that $\frac{1}{N} \sum_{b=1}^{N} \|X_b W - X_b \widehat{W}\|_F^2 = \|\sqrt{C}(W - \widehat{W})\|_F^2$; see App. D for the proof. For $K$ and $V$ separately, we compute

$$\sqrt{C}\,W = U \Sigma V^{\top}, \qquad \widehat{W} = \sqrt{C}^{-1} U_r \Sigma_r V_r^{\top}, \tag{3}$$

with $U_r, \Sigma_r, V_r$ the top-$r$ components of $U, \Sigma, V$. In practice, we use a shrinkage $\sqrt{C}_{\lambda} = (1 - \alpha)\sqrt{C} + \alpha \lambda I$ to ensure $C$ is invertible, with $\alpha \in (0, 1)$ and $\lambda > 0$. We then initialize the trainable parameters $W^a \in \mathbb{R}^{D \times r}$ and $W^b \in \mathbb{R}^{r \times (n_h d_h)}$ such that $W^a W^b$ equals the compressed matrix. We map the SVD factors to MLA by

$$W^a \leftarrow \sqrt{C}^{-1} U_r \Sigma_r, \qquad W^b \leftarrow V_r^{\top}, \tag{4}$$

so that $W^a W^b = \widehat{W}$. The cached latent $XW^a \in \mathbb{R}^{T \times r}$ in MLA spans the principal activation subspace, where $X$ is the actual input at layer $l$.
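A compact sketch of Eqs. 2–4, assuming the covariance $C$ from Sec. 3.1 has already been estimated (the helper name `care_factorize`, the sanity check, and the `lam` default are ours):

```python
import numpy as np

def care_factorize(W, C, r, alpha=0.01, lam=1.0):
    """Covariance-aware factorization (Eqs. 2-4): SVD of sqrt(C) @ W,
    unwhitened into MLA factors W_a (D x r) and W_b (r x out_dim).
    alpha/lam implement the shrinkage sqrt(C)_lam = (1-a)sqrt(C) + a*lam*I;
    alpha = 0.01 matches the paper's default, lam = 1.0 is our assumption."""
    eigval, eigvec = np.linalg.eigh(C)
    sqrtC = (eigvec * np.sqrt(np.clip(eigval, 0.0, None))) @ eigvec.T
    sqrtC = (1.0 - alpha) * sqrtC + alpha * lam * np.eye(C.shape[0])
    inv_sqrtC = np.linalg.inv(sqrtC)          # shrinkage keeps this well-posed

    U, S, Vt = np.linalg.svd(sqrtC @ W, full_matrices=False)
    W_a = inv_sqrtC @ (U[:, :r] * S[:r])      # down projection; latent is X @ W_a
    W_b = Vt[:r]                              # up projection
    return W_a, W_b

# Sanity check: W_a @ W_b approximates W in activation space.
rng = np.random.default_rng(0)
D = 64
X = rng.standard_normal((1024, D)) * np.geomspace(3.0, 0.1, D)
W = rng.standard_normal((D, D))
C = X.T @ X / len(X)
W_a, W_b = care_factorize(W, C, r=16)
print(np.linalg.norm(X @ W - (X @ W_a) @ W_b) / np.linalg.norm(X @ W))
```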
3.4 100% MLA CONVERSION BY HEALING

Based on our initialization of the down- and up-projection matrices $W^a, W^b$ and a budget of $T$ tokens, we now encode positional information in the MLA-like attention mechanism. Given layer $l$, let $Q_t = X_t W_Q$ be the usual query at step $t$, and let $K_{C,t} = (X_t W^a_K) W^b_K$ and $V_{C,t} = (X_t W^a_V) W^b_V$ be the MLA-generated keys/values. Following the decoupled RoPE design in (DeepSeek-AI Team, 2024), we add a small RoPE channel of width $d_r$ by introducing new trainable matrices²

$$W^R_Q \in \mathbb{R}^{D \times (n_h d_r)}, \qquad W^R_K \in \mathbb{R}^{D \times d_r},$$

where $W^R_K$ maps directly from the activation $X_t$ to a shared RoPE key of width $d_r$ (DeepSeek-like decoupled RoPE).

² TransMLA (Meng et al., 2025) and X-EcoMLA (Li et al., 2025b) both describe how to add such a RoPE channel.

Let $R_t \in \mathbb{R}^{(n_h d_r) \times (n_h d_r)}$ denote the standard block-diagonal RoPE rotation at step $t$ (one $d_r \times d_r$ rotation per head), and let $\bar{R}_t \in \mathbb{R}^{d_r \times d_r}$ denote the per-head rotation applied to the shared RoPE key. We compute

$$Q_{R,t} = (X_t W^R_Q)\, R_t, \qquad K_{R,t} = \mathrm{repeat}\big((X_t W^R_K)\, \bar{R}_t\big),$$

where $\mathrm{repeat}$ replicates the shared RoPE key across heads, yielding $K_{R,t} \in \mathbb{R}^{1 \times (n_h d_r)}$. We then concatenate channels and compute attention:

$$Q^{\star}_t = [Q_t;\, Q_{R,t}], \qquad K^{\star}_t = [K_{C,t};\, K_{R,t}],$$

$$A_t = \mathrm{Softmax}\!\left(\frac{Q^{\star}_t (K^{\star}_t)^{\top}}{\sqrt{d_h + d_r}}\right), \qquad O_t = A_t V_{C,t},$$

where $A_t$ denotes the attention weights and $O_t$ the layer output.

Caching and a compact "join" form. To preserve KV-cache efficiency, we cache only the KV-side latents $X_t W^a_K$ and $X_t W^a_V$, along with the small shared RoPE key $(X_t W^R_K)\bar{R}_t$ (repeated across heads when forming $K_{R,t}$). Queries (including $Q_{R,t}$) are computed on the fly. Equivalently, the MLA-generated KV can be written in a compact joined form:

$$\mathrm{Concat}(K_{C,t}, V_{C,t}) = \mathrm{Concat}(X_t W^a_K,\, X_t W^a_V)\, W_{\mathrm{join}}, \tag{5}$$

where $\mathrm{Concat}(\cdot,\cdot)$ concatenates along the last (feature) dimension and $W_{\mathrm{join}} = \mathrm{blkdiag}(W^b_K, W^b_V) \in \mathbb{R}^{2r \times 2D}$ jointly represents the two bilinear maps $W^a_K W^b_K$ and $W^a_V W^b_V$.

We penalize the low-rank decomposition by its cross-entropy classification error and its KL-divergence imitation error, namely the loss functions

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{T} \sum_{t=1}^{T} \log p_S(x_{t+1} \mid x_{\le t}), \tag{6}$$

$$\mathcal{L}_{\mathrm{KD}} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{KL}\big(\mathrm{softmax}(z^T_t/\tau)\,\big\|\,\mathrm{softmax}(z^S_t/\tau)\big), \tag{7}$$

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \beta \tau^2 \mathcal{L}_{\mathrm{KD}}. \tag{8}$$

Here $p_S(x_{t+1} \mid x_{\le t})$ is the student next-token probability under the converted MLA layer (with RoPE adapters $W^R_Q, W^R_K$ and MLA factors $W^a, W^b$), while $z^T_t$ and $z^S_t$ are teacher and student logits. We use $p_S(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(z^S_t/\tau)_{x_{t+1}}$ with temperature $\tau$ and weight $\beta$ (Hinton et al., 2015).
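The healing objective of Eqs. 6–8 is a standard cross-entropy-plus-distillation loss and can be sketched as follows (our PyTorch rendering; the paper's actual training code may differ in reduction, masking, and batching details):

```python
import torch
import torch.nn.functional as F

def healing_loss(student_logits, teacher_logits, targets, tau=2.0, beta=1.0):
    """Healing objective of Eqs. 6-8: next-token cross-entropy on the
    temperature-scaled student distribution plus KL distillation against
    the frozen teacher. Shapes: logits (T, V), targets (T,) with
    targets[t] = x_{t+1}. tau and beta values here are placeholders."""
    # Eq. 6: CE on p_S = softmax(z_S / tau), per the paper's definition.
    ce = F.cross_entropy(student_logits / tau, targets)
    # Eq. 7: KL(teacher || student) at temperature tau.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # Eq. 8: combined loss with the usual tau^2 gradient rescaling.
    return ce + beta * tau**2 * kd

# Hypothetical usage with random logits:
T, V = 16, 1000
s = torch.randn(T, V, requires_grad=True)
t = torch.randn(T, V)
y = torch.randint(V, (T,))
loss = healing_loss(s, t, y)
loss.backward()
```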
4 EXPERIMENTAL RESULTS

We evaluate whether CARE improves MLA migration under a fixed KV-cache budget (KV parity). We report one-shot and post-healing perplexity and accuracy, ablation studies, long-context retrieval (NiH), and system efficiency.

Setup. We match the cached KV state per token under the same global budget (Tab. 1). CARE estimates $C = \mathrm{Cov}[X]$ on a small calibration set and factors $\sqrt{C}\,W$ with shrinkage $\sqrt{C} \leftarrow (1 - \lambda)\sqrt{C} + \lambda I$. We use LM Harness (Gao et al., 2024) with a matched healing budget; full 100% MLA restoration (TransMLA/MHA2MLA) is evaluated separately in Sec. 4.3.

Original, baselines, CARE variants, and datasets. GQA (source): unmodified grouped-query attention. We compare against KV-compression baselines: Palu (Chang et al., 2024), a low-rank projection baseline for KV-cache reduction under a fixed cache budget; ASVD (Yuan et al., 2023), activation-aware SVD applied to $(W_K, W_V)$; and SVD-LLM V2 (Wang et al., 2024b; 2025b), truncation-aware SVD-style factorization applied to $(W_K, W_V)$. We also include the conversion baselines TransMLA and MHA2MLA (Meng et al., 2025; Ji et al., 2025); we report their SVD-style initialization in the one-shot setting, while full 100% MLA restoration is evaluated in Sec. 4.3. CARE-U (uniform-rank): covariance-aware factorization with a uniform rank across layers. CARE-E (adjusted-rank): the same covariance-aware factorization, but with our adjusted-rank allocation.

We evaluate on LM Harness tasks: WikiText2 (Wiki) (Merity et al., 2016), ARC Challenge (ARC), ARC Easy (ARE) (Clark et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), MMLU (Hendrycks et al., 2020), OpenBookQA (OBQA) (Mihaylov et al., 2018), RACE (RA) (Lai et al., 2017), and WinoGrande (WG) (Sakaguchi et al., 2021). Hyperparameters are in App. H.

4.1 ONE-SHOT RESULTS

We report one-shot, direct KV-compression performance under KV parity, before partial RoPE and healing, on the Alpaca (Taori et al., 2023) calibration dataset; Tab. 1 summarizes the main results for Llama-3.1-8B and Qwen3-4B-Instruct-2507. App. I.1 reports additional one-shot results on other models and calibration datasets.

Table 1: One-shot comparison on Llama-3.1-8B and Qwen3-4B-Instruct-2507 against the original model and baselines on multiple tasks. Calibration samples: 256. Sequence length: 2048. Calibration dataset: Alpaca. Lower is better for perplexity (PPL, ↓); higher is better for all accuracy columns (↑), reported in %.

Llama-3.1-8B-Instruct:

| Rank | KV Save | Method | PPL (↓) | ARC | ARE | HellaSwag | PIQA | MMLU | OBQA | RA | WG | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 93.75% | GQA (Original) | 7.21 | 50.34 | 80.18 | 60.15 | 79.65 | 48.05 | 34.80 | 40.10 | 72.69 | 58.24 |
| 64 | 93.75% | Palu (SVD) | 2260.60 | 25.77 | 27.53 | 26.50 | 52.18 | 24.26 | 27.80 | 20.86 | 50.04 | 31.87 |
| 64 | 93.75% | MHA2MLA | 284863.91 | 25.94 | 23.82 | 26.34 | 50.92 | 25.62 | 27.80 | 22.11 | 50.59 | 31.64 |
| 64 | 93.75% | CARE-U (ours) | 983.55 | 23.89 | 30.51 | 26.82 | 54.73 | 23.18 | 26.00 | 20.96 | 49.01 | 31.89 |
| 64 | 93.75% | CARE-E (ours) | 983.03 | 23.63 | 31.31 | 27.49 | 54.62 | 22.98 | 27.40 | 21.15 | 50.36 | 32.37 |
| 64 | 93.75% | ASVD | 2525.33 | 23.81 | 26.68 | 26.68 | 52.18 | 22.97 | 27.80 | 20.86 | 50.99 | 31.50 |
| 64 | 93.75% | SVD-LLM V2 | 967.04 | 23.63 | 30.72 | 26.75 | 54.90 | 23.22 | 26.40 | 20.86 | 48.93 | 31.93 |
| 128 | 87.50% | Palu (SVD) | 3046.58 | 23.89 | 27.02 | 26.62 | 50.38 | 23.14 | 24.80 | 22.49 | 49.25 | 30.95 |
| 128 | 87.50% | MHA2MLA | 15028.91 | 25.00 | 24.12 | 26.58 | 52.18 | 23.82 | 29.20 | 22.11 | 51.07 | 31.76 |
| 128 | 87.50% | CARE-U (ours) | 398.91 | 26.71 | 41.96 | 33.64 | 60.99 | 26.06 | 27.00 | 24.21 | 53.20 | 36.72 |
| 128 | 87.50% | CARE-E (ours) | 353.74 | 27.30 | 42.47 | 37.96 | 62.46 | 29.38 | 28.20 | 26.12 | 54.85 | 38.59 |
| 128 | 87.50% | ASVD | 1675.54 | 25.00 | 27.99 | 26.61 | 52.39 | 23.04 | 26.60 | 21.82 | 48.46 | 31.49 |
| 128 | 87.50% | SVD-LLM V2 | 386.23 | 27.56 | 42.38 | 34.06 | 61.04 | 26.17 | 27.80 | 24.21 | 53.99 | 37.15 |
| 256 | 75.00% | Palu (SVD) | 537.57 | 22.61 | 31.10 | 27.93 | 55.22 | 22.82 | 27.00 | 22.68 | 51.70 | 32.63 |
| 256 | 75.00% | MHA2MLA | 1633.65 | 27.47 | 25.63 | 26.50 | 52.39 | 23.10 | 28.20 | 22.49 | 50.83 | 32.08 |
| 256 | 75.00% | CARE-U (ours) | 48.35 | 36.09 | 60.19 | 55.75 | 71.87 | 45.64 | 33.20 | 35.22 | 62.04 | 50.00 |
| 256 | 75.00% | CARE-E (ours) | 49.43 | 38.48 | 60.06 | 60.54 | 72.42 | 54.29 | 33.60 | 35.50 | 66.06 | 52.62 |
| 256 | 75.00% | ASVD | 312.86 | 21.76 | 32.03 | 29.52 | 55.22 | 23.05 | 25.20 | 24.59 | 52.64 | 33.00 |
| 256 | 75.00% | SVD-LLM V2 | 47.28 | 36.26 | 60.69 | 56.39 | 72.20 | 46.57 | 33.40 | 34.83 | 60.69 | 50.13 |
| 512 | 50.00% | Palu (SVD) | 45.40 | 28.16 | 45.96 | 43.37 | 64.15 | 24.98 | 30.80 | 23.83 | 53.43 | 39.33 |
| 512 | 50.00% | MHA2MLA | 220.29 | 25.94 | 40.95 | 39.27 | 61.37 | 25.54 | 26.60 | 26.79 | 56.59 | 37.88 |
| 512 | 50.00% | CARE-U (ours) | 9.64 | 52.73 | 76.30 | 73.98 | 78.73 | 62.17 | 40.60 | 41.53 | 72.61 | 62.33 |
| 512 | 50.00% | CARE-E (ours) | 19.50 | 42.41 | 64.69 | 68.96 | 75.46 | 63.98 | 37.00 | 38.47 | 71.59 | 57.82 |
| 512 | 50.00% | ASVD | 12.02 | 46.33 | 69.11 | 70.75 | 76.17 | 41.80 | 36.60 | 33.97 | 66.85 | 55.20 |
| 512 | 50.00% | SVD-LLM V2 | 9.63 | 52.39 | 76.68 | 73.57 | 78.45 | 62.31 | 40.40 | 41.63 | 72.22 | 62.21 |

Qwen3-4B-Instruct-2507:

| Rank | KV Save | Method | PPL (↓) | ARC | ARE | HellaSwag | PIQA | MMLU | OBQA | RA | WG | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 93.75% | GQA (Original) | 10.04 | 55.89 | 83.12 | 52.65 | 76.01 | 73.37 | 32.00 | 41.24 | 68.11 | 60.30 |
| 64 | 93.75% | Palu (SVD) | 56922.11 | 25.34 | 26.77 | 25.73 | 50.44 | 24.31 | 28.40 | 22.11 | 50.91 | 31.75 |
| 64 | 93.75% | MHA2MLA | 21850.16 | 25.68 | 25.46 | 26.03 | 51.96 | 22.92 | 27.00 | 22.01 | 50.20 | 31.41 |
| 64 | 93.75% | CARE-U (ours) | 905.74 | 23.46 | 29.00 | 27.61 | 55.22 | 23.03 | 26.60 | 23.06 | 50.59 | 32.32 |
| 64 | 93.75% | CARE-E (ours) | 730.93 | 23.29 | 30.77 | 28.33 | 55.22 | 22.95 | 24.80 | 22.78 | 51.54 | 32.46 |
| 64 | 93.75% | ASVD | 6683.95 | 27.39 | 25.00 | 25.71 | 50.38 | 24.08 | 26.60 | 22.11 | 50.43 | 31.46 |
| 64 | 93.75% | SVD-LLM V2 | 894.94 | 23.12 | 28.58 | 27.66 | 55.11 | 23.02 | 27.20 | 23.44 | 51.54 | 32.46 |
| 128 | 87.50% | Palu (SVD) | 22048.79 | 26.02 | 26.18 | 26.29 | 51.09 | 24.49 | 25.40 | 21.05 | 49.72 | 31.28 |
| 128 | 87.50% | MHA2MLA | 52683.47 | 23.81 | 26.56 | 26.74 | 52.18 | 24.74 | 28.20 | 22.68 | 49.72 | 31.83 |
| 128 | 87.50% | CARE-U (ours) | 111.17 | 27.47 | 39.69 | 37.52 | 60.50 | 27.03 | 27.60 | 26.99 | 53.59 | 37.55 |
| 128 | 87.50% | CARE-E (ours) | 102.38 | 30.29 | 44.82 | 42.54 | 63.76 | 30.52 | 29.40 | 28.71 | 54.54 | 40.57 |
| 128 | 87.50% | ASVD | 1682.84 | 22.95 | 30.01 | 29.29 | 52.34 | 23.42 | 27.00 | 23.54 | 50.12 | 32.33 |
| 128 | 87.50% | SVD-LLM V2 | 116.76 | 27.05 | 40.03 | 37.02 | 61.04 | 26.69 | 26.80 | 25.74 | 53.28 | 37.21 |
| 256 | 75.00% | Palu (SVD) | 2561.97 | 26.11 | 29.46 | 29.92 | 52.88 | 24.48 | 28.40 | 23.64 | 51.70 | 33.32 |
| 256 | 75.00% | MHA2MLA | 44509.79 | 22.18 | 28.49 | 28.92 | 52.45 | 23.00 | 24.60 | 24.59 | 51.30 | 31.94 |
| 256 | 75.00% | CARE-U (ours) | 22.08 | 46.42 | 68.90 | 59.16 | 71.55 | 54.76 | 36.40 | 35.02 | 62.43 | 54.33 |
| 256 | 75.00% | CARE-E (ours) | 28.84 | 41.30 | 59.22 | 56.53 | 69.37 | 53.50 | 35.20 | 32.82 | 61.88 | 51.23 |
| 256 | 75.00% | ASVD | 63.15 | 32.76 | 46.84 | 47.45 | 63.93 | 26.38 | 30.80 | 30.05 | 52.01 | 41.28 |
| 256 | 75.00% | SVD-LLM V2 | 22.88 | 44.54 | 67.17 | 57.77 | 70.78 | 52.81 | 35.60 | 34.83 | 61.48 | 53.12 |
| 512 | 50.00% | Palu (SVD) | 33.97 | 35.58 | 47.64 | 50.44 | 65.18 | 27.85 | 32.80 | 30.24 | 52.64 | 42.80 |
| 512 | 50.00% | MHA2MLA | 100.99 | 27.05 | 41.08 | 37.97 | 59.19 | 29.14 | 27.20 | 29.47 | 54.06 | 38.15 |
| 512 | 50.00% | CARE-U (ours) | 12.03 | 54.95 | 77.23 | 69.24 | 76.22 | 67.46 | 40.00 | 39.43 | 68.43 | 61.62 |
| 512 | 50.00% | CARE-E (ours) | 15.91 | 49.23 | 70.88 | 64.13 | 72.80 | 64.16 | 36.20 | 36.56 | 64.56 | 57.31 |
| 512 | 50.00% | ASVD | 15.49 | 47.61 | 66.54 | 67.49 | 73.01 | 56.56 | 35.60 | 35.22 | 62.75 | 55.60 |
| 512 | 50.00% | SVD-LLM V2 | 11.88 | 54.61 | 77.44 | 68.53 | 75.68 | 67.65 | 39.80 | 38.18 | 67.25 | 61.14 |

[Figure 3] Covariance-aware rank profiles across calibration corpora (Alpaca, WikiText2, PTB, C4) for Llama-3.1-8B-Instruct at target ranks 64, 128, 256, and 512, for both $W_K$ and $W_V$. The consistency across corpora suggests a model-intrinsic trend. (Only the caption is recoverable from the extraction; the plot panels and axis ticks are omitted.)

At very low rank, one-shot MLA remains challenging for all methods, but in terms of perplexity CARE already improves substantially over the naive Palu (SVD) or MHA2MLA initializations. As the rank budget increases to 256/512, the advantage of covariance-aware initialization becomes much clearer, especially on Llama-3.1-8B and Qwen3-4B, where CARE-E gives the strongest overall accuracy–PPL trade-off. We provide additional one-shot long-context retrieval results and qualitative examples in App. I.2 and App. I.3, respectively.
4.2 ABLATION STUDIES

4.2.1 ENERGY DISTRIBUTION VS. COVARIANCE

Using the covariance-aware energy in Sec. 3, we obtain highly consistent rank profiles across C4 (Raffel et al., 2020a), Alpaca (Taori et al., 2023), WikiText2 (Merity et al., 2016), and PTB (Marcus et al., 1993). As shown in Fig. 3, rank is front-loaded: across target ranks 64, 128, 256, and 512, both $W_K$ and $W_V$ receive much larger ranks in the first layers and then decrease steadily with depth, approaching the minimum rank in later layers. This indicates that early layers are more accuracy-critical, while deeper layers can be compressed more aggressively. The same trend also appears on other models (from 1.5B to 70B; App. I.4), suggesting that the profile is largely model-intrinsic rather than tied to a particular calibration corpus.

4.2.2 ACCURACY IMPACT OF COVARIANCE

Using Llama-3.1-8B-Instruct, we vary (i) the number of calibration samples, (ii) the sequence length, and (iii) the calibration corpus. Fig. 4 shows that CARE is already strong with small calibration budgets: performance saturates beyond 512 samples, and longer sequences tend to overfit limited calibration data. We therefore use 256 samples with a short sequence length (32) as the default trade-off. Across corpora (Alpaca, C4, WikiText2, PTB, ARC, ARE, MMLU), one-shot accuracy changes are modest on average (Tab. I.14); domain-aligned corpora give mostly local gains. We default to Alpaca for broad coverage and stable generalization.

[Figure 4] One-shot accuracy versus calibration samples and sequence length across eight benchmarks. Sequence length is fixed at 256 unless otherwise noted; the red curve denotes fixed 256 samples with varying sequence length. OOM occurs beyond length = 512 during covariance computation.

Additional ablations are deferred to the appendix for readability. App. I.6 sweeps the shrinkage coefficient, showing that CARE is not sensitive to the exact regularization magnitude and supporting our default $\alpha = 0.01$. App. I.5 studies cross-domain calibration under distribution shift, App. I.7 examines task-weighted covariance mixing, and App. I.8 compares the default $\sqrt{C}$ formulation against using $C$ directly under matched settings. Although the derivation is based on $\sqrt{C}$, the appendix shows that $\sqrt{C}$ is more consistent than using $C$ directly, especially beyond the lowest ranks.

4.3 RECOVERY OF 100% MLA WITH SMALL SFT BUDGETS

Setup (healing baselines). We compare CARE against Palu (SVD) and TransMLA under matched healing budgets at MLA rank 512. Our hybrid TransMLA + CARE(E) Init keeps the same 100% MLA restoration and healing pipeline as TransMLA, but when TransMLA performs the KV low-rank mapping, we replace its original initialization with the covariance-aware $\sqrt{C}\,W$ SVD used by CARE(E). All subsequent restoration and healing stages are unchanged. CARE-based 100% MLA restoration uses alpaca-256-256 calibration (256 samples, sequence length 256), TransMLA uses wiki-256-256 (256 samples, sequence length 256), and Palu (SVD) does not require a separate calibration corpus. We report the one-shot initialization (0B) and healed checkpoints after 1B and 3B tokens on the same LM Harness suite.

Table 2: Healed Llama-3.1-8B-Instruct comparison at MLA rank 512 across token budgets. All task columns are accuracies (↑).
| Rank | Method | Calibration | Tokens | ARC | ARE | HellaSwag | PIQA | MMLU | OBQA | RA | WG | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| – | GQA (original) | N/A | N/A | 50.34 | 80.18 | 60.15 | 79.65 | 48.05 | 34.80 | 40.10 | 72.69 | 58.24 |
| 512 | Palu (SVD) | N/A | 0B | 26.02 | 50.97 | 37.43 | 64.15 | 27.89 | 19.40 | 26.60 | 57.54 | 38.75 |
| 512 | Palu (SVD) | N/A | 1B | 33.45 | 62.38 | 45.55 | 72.34 | 50.57 | 23.00 | 31.54 | 61.78 | 47.58 |
| 512 | Palu (SVD) | N/A | 3B | 44.56 | 74.96 | 52.20 | 76.63 | 61.08 | 30.40 | 45.15 | 65.42 | 56.30 |
| 512 | CARE (ours, one-shot) | alpaca-256-32 | 0B | 52.73 | 76.30 | 73.98 | 78.73 | 62.17 | 40.60 | 41.53 | 72.61 | 62.33 |
| 512 | TransMLA + CARE(E) Init (ours, 100% MLA restore) | alpaca-256-256 | 0B | 45.05 | 69.02 | 68.98 | 76.06 | 51.16 | 37.60 | 39.43 | 68.98 | 57.04 |
| 512 | TransMLA + CARE(E) Init (ours, 100% MLA restore) | alpaca-256-256 | 1B | 52.25 | 82.33 | 62.47 | 80.21 | 70.31 | 32.90 | 45.11 | 75.13 | 62.59 |
| 512 | TransMLA + CARE(E) Init (ours, 100% MLA restore) | alpaca-256-256 | 3B | 51.75 | 80.73 | 64.45 | 83.23 | 71.57 | 34.00 | 46.33 | 74.09 | 63.27 |
| 512 | TransMLA | wiki-256-256 | 0B | 43.17 | 66.12 | 68.25 | 74.81 | 48.89 | 38.80 | 37.70 | 66.46 | 55.52 |
| 512 | TransMLA | wiki-256-256 | 1B | 53.04 | 81.07 | 58.75 | 81.04 | 69.13 | 32.00 | 44.09 | 71.74 | 61.36 |
| 512 | TransMLA | wiki-256-256 | 3B | 53.77 | 82.34 | 56.44 | 80.70 | 70.23 | 33.30 | 45.61 | 72.47 | 61.86 |

We sweep small-scale SFT budgets to quantify post-conversion recovery; results are shown in Tab. 2. CARE(E) already starts from the strongest one-shot initialization, and this advantage largely remains after healing. At 1B and 3B tokens, TransMLA + CARE(E) Init reaches 62.59 and 63.27 average accuracy, outperforming TransMLA (61.36/61.86) and Palu (SVD) (47.58/56.30) at matched budgets. While both Palu (SVD) and TransMLA improve substantially with more tokens, CARE attains the best overall recovery with less reliance on long healing runs; optimization objectives, datasets, and hyperparameters are given in App. H. A system-level KV-cache efficiency analysis is given in App. I.9.

5 RELATED WORK

Conversion from Traditional Attention to MLA. Standard multi-head attention (MHA) underpins modern LLMs but induces a KV cache that scales linearly with sequence length and head width, creating a memory and bandwidth bottleneck at inference time (Vaswani et al., 2017; Kwon et al., 2023). Grouped-Query Attention (GQA) reduces KV heads by sharing keys/values across query groups, lowering KV memory while sacrificing expressiveness (Ainslie et al., 2023; Shazeer, 2019). Multi-Head Latent Attention (MLA) addresses KV memory by caching low-dimensional latents with lightweight up/down projections (DeepSeek-AI Team, 2024). Beyond MLA trained from scratch, several post-hoc pathways demonstrate conversion feasibility: TransMLA gives a theory-based reduction from GQA to MLA at the corresponding KV budget (Meng et al., 2025), MHA2MLA focuses on practical alignment via partial RoPE and joint-SVD initialization (Ji et al., 2025), and X-EcoMLA explores distillation-based upcycling of pretrained attention into MLA for extreme KV compression (Li et al., 2025b). Also, following MambaInLlama (Wang et al., 2024a), Zebra-Llama composes efficient hybrids to improve inference efficiency and can be paired with MLA-style KV reductions in deployed systems (Yang et al., 2025).

SVD inspirations. Naive SVD truncation minimizes $\|W - W_r\|_F$ and ranks components by raw singular values, which need not correlate with downstream loss (Eckart & Young, 1936; LeCun et al., 1990; Hassibi & Stork, 1993; Dong et al., 2019; Frantar et al., 2022; Krzanowski, 2000). Recent works refine this in LLMs: FWSVD (weighted by Fisher information) (Hua et al., 2022), and SVD-LLM (truncation-aware whitening plus sequential low-rank updates) with its V2 variant (improved truncation/rank selection) (Wang et al., 2024b; 2025b).
SoCo learns a diagonal reweighting to optimize the singular spectrum directly for compression rather than trusting singular-value magnitudes (Li et al., 2025a), while Dobi-SVD introduces a differentiable SVD that targets activation-side truncation and efficient reconstruction (Wang et al., 2025a). Together with architecture-aware conversions (Meng et al., 2025; Ji et al., 2025), these SVD-oriented techniques motivate data/curvature-aware orientations and non-uniform rank allocation as core tools for preserving pretrained knowledge. Relatedly, CorDA leverages context/activation statistics to guide decomposition adaptation for parameter-efficient fine-tuning, further supporting the value of covariance-aware orientations beyond weight-only SVD (Yang et al., 2024b).

SVD for cache compression. Orthogonally, some methods compress the cache itself. Palu compresses the KV cache with low-rank projection, reconstructing full K, V on the fly with efficient rank search and quantization interoperability (Chang et al., 2024). ReALLM combines low-rank components with vector-quantized latents under a unified recipe (Leconte et al., 2024). These methods can be stacked with MLA-style conversions or used standalone to further lower KV memory. For more related work, please refer to App. F.

6 CONCLUSIONS

We proposed CARE, a Covariance-Aware, Rank-Enhanced procedure for migrating traditional attention to MLA under fixed KV parity. CARE replaces naïve weight-only SVD with a covariance-weighted factorization and assigns per-layer ranks via an energy-driven, water-filling schedule. Empirically, CARE preserves MLA's identical KV footprint while delivering lower zero-shot perplexity and higher accuracy before healing, and better final performance than competing baselines under the same post-conversion tuning budget. It is also more robust under aggressive rank reduction, providing a stronger starting point for brief SFT to recover the original model's accuracy.

7 ETHICS STATEMENT

We have read and will comply with the ICLR Code of Ethics. Our study involves no human subjects, personally identifiable information, or user-generated content. All datasets are standard, publicly available benchmarks used under their respective licenses; we do not collect or infer demographic attributes. The work focuses on model architecture/optimization and does not introduce capabilities intended for surveillance, profiling, or other harmful use. We identify no foreseeable risks related to privacy, security, fairness, or legal/regulatory compliance, and no IRB/ethics approval was required. To support transparency, we will release code, configuration files, and clear instructions to reproduce all results. All findings are reported honestly without fabrication or inappropriate manipulation. The authors declare no conflicts of interest and no external sponsorship that could bias the work.

8 REPRODUCIBILITY STATEMENT

We provide an open-source repository (https://github.com/FutureMLS-Lab/CARE). The repo includes exact configuration files for all experiments in Tab. 1–2 and Fig. 3–4, scripts to download and verify datasets, deterministic preprocessing, fixed random seeds, and environment specifications (Conda with pinned versions). Algorithmic details appear in Sec. 3; dataset descriptions and evaluation setup are given in Sec. 4; hyperparameters, hardware, and optimizer and scheduler configurations are listed in App. H. Detailed zero-shot tables for all models and calibration corpora are in App. I.1; additional rank profiles (1.5B–70B, including MoE) in App. I.4; ablations on shrinkage (App. I.6), distribution shift (App. I.5), covariance mixing (App. I.7), and $\sqrt{C}$ vs. $C$ (App. I.8); long-context NiH evaluation in App. I.2; and KV-cache efficiency analysis in App. I.9. Running the scripts in the repository recreates the reported metrics and regenerates all plots and logs. The repository is released under the Apache License 2.0.
REFERENCES

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S. Abdelfattah, and Kai-Chiang Wu. Palu: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118, 2024. URL https://arxiv.org/abs/2407.21118.

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023. URL https://arxiv.org/abs/2306.15595.

Aakanksha Chowdhery et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023. URL https://arxiv.org/abs/2307.08691.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html.

DeepSeek-AI Team. DeepSeek-V2: Multi-head latent attention for economical inference. Technical report, 2024. URL https://github.com/deepseek-ai.

Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems 26 (NeurIPS 2013), 2013. URL https://papers.nips.cc/paper/5025-predicting-parameters-in-deep-learning.

Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27 (NeurIPS 2014), 2014. URL https://papers.nips.cc/paper/5544-exploiting-linear-structure-within-convolutional-networks-for-efficient-evaluation.pdf.

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024. URL https://arxiv.org/abs/2402.13753.
Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, and Daizong Liu. Rethinking video-language model from the language input perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026a.

Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, and Wei Ji. Towards unified vision-language models with incomplete multi-modal inputs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026b.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2022. ICLR 2023.

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, 07 2024. URL https://zenodo.org/records/12608602.

Robin Geens and Marian Verhelst. Hardware-centric analysis of DeepSeek's multi-head latent attention. arXiv preprint arXiv:2506.02523, 2025. doi: 10.48550/arXiv.2506.02523. URL https://arxiv.org/abs/2506.02523.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (NeurIPS), 1993.

Babak Hassibi, David G. Stork, and Gregory J. Wolff. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299. IEEE, 1993.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021. URL https://iclr.cc/virtual/2021/poster/2562.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning and Representation Learning Workshop, 2015.

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In Advances in Neural Information Processing Systems (NeurIPS), 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/028fcbcf85435d39a40c4d61b42c99a4-Abstract-Conference.html.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
Ting Hua, Yen-Chang Hsu, Felicity Wang, Qian Lou, Yilin Shen, and Hongxia Jin. Numerical optimizations for weighted low-rank estimation on language model. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1404–1416. Association for Computational Linguistics, 2022. URL https://aclanthology.org/2022.emnlp-main.91.pdf.

Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, and Tao Gui. Towards economical inference: Enabling multi-head latent attention in transformer-based LLMs. Technical report, 2025. URL https://github.com/JT-Ushio/MHA2MLA.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023. URL https://arxiv.org/abs/2310.06825.

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/4e85362c02172c0c6567ce593122d31c-Paper-Conference.pdf.

W. J. Krzanowski. Principles of Multivariate Analysis: A User's Perspective. Oxford University Press, 2000.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. arXiv preprint arXiv:2309.06180, 2023. URL https://arxiv.org/abs/2309.06180.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.

Louis Leconte, Lisa Bedin, Van Minh Nguyen, and Eric Moulines. ReALLM: A general framework for LLM compression and fine-tuning. arXiv preprint arXiv:2405.13155, 2024. URL https://arxiv.org/abs/2405.13155.

Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), 1990.

Dengjie Li, Tiancheng Shen, Yao Zhou, Baisong Yang, Zhongying Liu, Masheng Yang, Bernard Ghanem, Yibo Yang, Yujie Zhong, and Ming-Hsuan Yang. Optimizing singular spectrum for large language model compression. CoRR, abs/2502.15092, 2025a. URL https://arxiv.org/abs/2502.15092.

Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, and Emad Barsoum. X-EcoMLA: Upcycling pre-trained attention into MLA for efficient and extreme KV compression. arXiv preprint arXiv:2503.11132, 2025b. URL https://arxiv.org/abs/2503.11132.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In MLSys, 2024. URL https://proceedings.mlsys.org/paper_files/paper/2024/file/42a452cbafa9d64e9ba4a95c1ef21-Paper-Conference.pdf.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a.
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024b. URL https://arxiv.org/abs/2402.02750.

Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, et al. VeOmni: Scaling any modality model training with model-centric distributed recipe zoo. arXiv preprint arXiv:2508.02317, 2025.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://www.aclweb.org/anthology/J93-2004.

Fanxu Meng, Zengwei Yao, and Muhan Zhang. TransMLA: Multi-head latent attention is all you need. Technical report, 2025. URL https://github.com/fxmeng/TransMLA.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023. URL https://arxiv.org/abs/2309.00071.

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021. URL https://arxiv.org/abs/2108.12409.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020a.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020b. URL https://jmlr.org/papers/v21/20-074.html.

Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 6655–6659, Vancouver, Canada, 2013. IEEE. doi: 10.1109/ICASSP.2013.6638949.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2074. URL https://aclanthology.org/N18-2074/.

Noam Shazeer. Fast transformer decoding: One write-head is all you need. In NeurIPS Workshop on Efficient Natural Language and Speech Processing, 2019. Multi-Query Attention (MQA).

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. URL https://arxiv.org/abs/1909.08053.

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Moudgil, et al. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021a.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. In Findings of the Association for Computational Linguistics (ACL Findings), 2021b.

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554, 2022. URL https://arxiv.org/abs/2212.10554.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. URL https://arxiv.org/abs/2307.09288.

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning (ICML). PMLR, 2024. URL https://proceedings.mlr.press/v235/tseng24a.html.

Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the ACL (EACL), 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024a.

Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Yiran Chen, Kurt Keutzer, and Chenfeng Xu. Dobi-SVD: Differentiable SVD for LLM compression and some new perspectives. arXiv preprint arXiv:2502.02723, 2025a.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020. URL https://arxiv.org/abs/2006.04768.

Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. arXiv preprint arXiv:2403.07378, 2024b.

Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. SVD-LLM V2: Optimizing singular value truncation for large language model compression. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4287–4296, Albuquerque, New Mexico, 2025b. Association for Computational Linguistics. URL https://aclanthology.org/2025.naacl-long.217.pdf.

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR, 2023. URL https://proceedings.mlr.press/v202/xiao23c.html.

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902, 2021. URL https://arxiv.org/abs/2102.03902.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a. URL https://arxiv.org/abs/2407.10671.
Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-Llama: Towards extremely efficient hybrid models. arXiv preprint arXiv:2505.17272, 2025. URL https://arxiv.org/abs/2505.17272.
Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Leon Song, Jianlong Wu, Liqiang Nie, and Bernard Ghanem. CorDA: Context-oriented decomposition adaptation of large language models for task-aware parameter-efficient fine-tuning. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024b.
Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821, 2023.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations (ICLR), 2023.

CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention in LLMs
Supplementary Material

A LARGE LANGUAGE MODELS USAGE

We used a large language model, ChatGPT (GPT-5 Thinking), solely for grammar and spelling edits to author-written text, and Claude Code to assist with code writing. These tools did not generate scientific content, design experiments, analyze data, or select citations, and therefore did not contribute at the level of a contributing author. All edits were reviewed and approved by the authors, who take full responsibility for the final manuscript.

B RECALL: KV-PARITY MAPPING

Grouped-Query Attention (GQA) reduces KV-cache memory by letting multiple heads share the same key–value projection. To convert GQA to Multi-Head Latent Attention (MLA) without increasing the KV budget, we enforce KV parity. Consider a GQA layer with $n_h$ heads of size $d_h$ and total multi-head hidden size $D = n_h d_h$, whose $n_h$ heads are split into $g_h$ groups of $n_h/g_h$ heads each. Layer $l$ uses $W_Q^{(l)} \in \mathbb{R}^{D \times (n_h d_h)}$ and grouped $W_K^{(l)}, W_V^{(l)} \in \mathbb{R}^{D \times (g_h d_h)}$. We conceptually replicate each group's $W_K^{(l)}$ and $W_V^{(l)}$ across its $n_h/g_h$ member heads to form the full-size $\widetilde{W}_K^{(l)}, \widetilde{W}_V^{(l)} \in \mathbb{R}^{D \times (n_h d_h)}$ (there is no need to materialize this in code). This shows that GQA can be reduced to MLA by removing the repeated head blocks (Meng et al., 2025). We therefore set MLA's latent rank to match GQA's per-token KV width:

$$r = g_h d_h \tag{9}$$
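For concreteness, the following is a minimal sketch of this bookkeeping. It is our illustration rather than the paper's released code, and the Llama-3.1-8B-style layout (32 heads of size 128 sharing 8 KV groups) is an assumption used only for the example.

```python
import numpy as np

def kv_parity_rank(n_h: int, d_h: int, g_h: int) -> int:
    """Latent rank r = g_h * d_h (Eq. 9): the MLA latent width that
    matches the per-token KV width of the original GQA layer."""
    assert n_h % g_h == 0, "heads must split evenly into groups"
    return g_h * d_h

# Assumed Llama-3.1-8B-style GQA layout: 32 heads of size 128, 8 KV groups.
n_h, d_h, g_h = 32, 128, 8
D = n_h * d_h  # total multi-head hidden size, 4096

# Grouped key projection and its conceptual full-size replication W~_K:
# each group's (D, d_h) column block is repeated across its n_h/g_h members.
rng = np.random.default_rng(0)
W_K = rng.standard_normal((D, g_h * d_h))
W_K_full = np.repeat(W_K.reshape(D, g_h, d_h), n_h // g_h, axis=1).reshape(D, n_h * d_h)

r = kv_parity_rank(n_h, d_h, g_h)
print(W_K_full.shape, r)  # (4096, 4096), 1024
```

The replicated matrix has rank at most $g_h d_h$ by construction, which is why caching an $r$-dimensional latent loses nothing relative to the original GQA cache.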
C PROOF OF MAXIMUM ENERGY SVD TRUNCATION

We have the following proposition: Let $A \in \mathbb{R}^{m \times n}$ have singular value decomposition (SVD) $A = U \Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p)$, $p = \min\{m, n\}$, and $\sigma_1 \ge \cdots \ge \sigma_p \ge 0$. For $1 \le r < p$, let $\Sigma_r = \mathrm{diag}(\sigma_1, \dots, \sigma_r, 0, \dots, 0)$ and $A_r := U \Sigma_r V^\top$. Then

$$\|A - A_r\|_F^2 = \sum_{i=r+1}^{p} \sigma_i^2 \quad\text{and}\quad \|A - A_r\|_F = \min_{\mathrm{rank}(X) \le r} \|A - X\|_F,$$

with the unique minimizer (when $\sigma_r > \sigma_{r+1}$) given by $X = A_r$ (Eckart & Young, 1936).

Proof. The Frobenius norm is unitarily invariant, so for any $X$ with $\mathrm{rank}(X) \le r$,

$$\|A - X\|_F = \|U^\top (A - X) V\|_F = \|\Sigma - Y\|_F, \quad\text{where } Y := U^\top X V \text{ and } \mathrm{rank}(Y) \le r.$$

Expand the square via the Frobenius inner product $\langle M, N \rangle := \mathrm{trace}(M^\top N)$:

$$\|\Sigma - Y\|_F^2 = \|\Sigma\|_F^2 + \|Y\|_F^2 - 2\langle \Sigma, Y \rangle.$$

Let $s_1(Y) \ge \cdots \ge s_p(Y) \ge 0$ be the singular values of $Y$ (so $s_i(Y) = 0$ for $i > r$). By von Neumann's trace inequality,

$$\langle \Sigma, Y \rangle \le \sum_{i=1}^{p} \sigma_i s_i(Y) = \sum_{i=1}^{r} \sigma_i s_i(Y),$$

and $\|Y\|_F^2 = \sum_{i=1}^{p} s_i(Y)^2 = \sum_{i=1}^{r} s_i(Y)^2$. Therefore:

$$\|\Sigma - Y\|_F^2 \ge \sum_{i=1}^{p} \sigma_i^2 + \sum_{i=1}^{r} s_i(Y)^2 - 2\sum_{i=1}^{r} \sigma_i s_i(Y) = \sum_{i=1}^{r} \big(\sigma_i - s_i(Y)\big)^2 + \sum_{i=r+1}^{p} \sigma_i^2 \ge \sum_{i=r+1}^{p} \sigma_i^2.$$

This lower bound is attained by taking $Y = \Sigma_r$, i.e. $X = U \Sigma_r V^\top = A_r$, for which:

$$\|A - A_r\|_F^2 = \|\Sigma - \Sigma_r\|_F^2 = \sum_{i=r+1}^{p} \sigma_i^2.$$

Thus $A_r$ is a best rank-$r$ approximation in Frobenius norm, and the minimum value is the squared $\ell_2$-tail of the singular values. Uniqueness follows when $\sigma_r > \sigma_{r+1}$, since any other minimizer must then share the top-$r$ singular subspaces with $A$.

Note that the statement and proof above follow essentially the same idea as the Eckart–Young–Mirsky theorem in Eckart & Young (1936).

Relation to Sec. 3.2. The theorem justifies the use of squared singular values in Sec. 3.2: for a single layer/matrix, the marginal reduction in Frobenius residual from adding the next rank is governed by the next squared singular value. Sec. 3.2 further divides this local gain by the remaining residual energy of the same layer, which normalizes layers with different spectral scales and makes their priorities comparable under a shared global budget. This normalization changes the cross-layer comparison but preserves the theorem's local interpretation that larger squared singular values remove more residual energy.

D DERIVATION OF SVD TARGET FUNCTION

Let $\Delta W := W - \widehat{W}$ and $C = \frac{1}{N}\sum_{b=1}^{N} X_b^\top X_b$. We claim that:

$$\frac{1}{N}\sum_{b=1}^{N} \|X_b \Delta W\|_F^2 = \|\sqrt{C}\,\Delta W\|_F^2.$$

Proof. By the Frobenius–trace identity $\|A\|_F^2 = \mathrm{Tr}(A^\top A)$,

$$\frac{1}{N}\sum_{b=1}^{N} \|X_b \Delta W\|_F^2 = \frac{1}{N}\sum_{b=1}^{N} \mathrm{Tr}\big((X_b \Delta W)^\top (X_b \Delta W)\big) = \frac{1}{N}\sum_{b=1}^{N} \mathrm{Tr}\big(\Delta W^\top X_b^\top X_b \Delta W\big).$$

Using linearity of the trace and the definition of $C$,

$$\frac{1}{N}\sum_{b=1}^{N} \|X_b \Delta W\|_F^2 = \mathrm{Tr}\big(\Delta W^\top C \Delta W\big).$$

We can show that $C \succeq 0$ since for any vector $v$:

$$v^\top C v = \frac{1}{N}\sum_{b=1}^{N} v^\top X_b^\top X_b v = \frac{1}{N}\sum_{b=1}^{N} \|X_b v\|_2^2 \ge 0.$$

Then we write $C = C^{1/2} C^{1/2}$ to get:

$$\mathrm{Tr}\big(\Delta W^\top C \Delta W\big) = \mathrm{Tr}\big(\Delta W^\top C^{1/2} C^{1/2} \Delta W\big) = \mathrm{Tr}\big((C^{1/2}\Delta W)^\top (C^{1/2}\Delta W)\big) = \|C^{1/2}\Delta W\|_F^2.$$

E DEFINITION OF COVARIANCE

Let $x \in \mathbb{R}^D$ be the activation (feature) vector of a single token. The population covariance is:

$$\mathrm{Cov}[x] = \mathbb{E}\big[(x - \mu)(x - \mu)^\top\big], \qquad \mu = \mathbb{E}[x].$$

Given samples $\{x_i\}_{i=1}^{N}$, the empirical mean and covariance are:

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \widehat{\mathrm{Cov}}[x] = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)(x_i - \hat\mu)^\top.$$

If the rows of $X \in \mathbb{R}^{N \times D}$ stack the samples and $X_c = X - \mathbf{1}\hat\mu^\top$, then $\widehat{\mathrm{Cov}}[x] = \frac{1}{N} X_c^\top X_c$. (The unbiased version uses $1/(N-1)$ instead of $1/N$.)
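As a quick numerical sanity check of the Appendix D identity, the following sketch (ours; all sizes are illustrative) verifies on random data that the averaged activation-space error equals the $\sqrt{C}$-weighted weight-space error:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D, out = 8, 64, 32, 16   # illustrative batch/sequence/feature sizes

X_batches = [rng.standard_normal((T, D)) for _ in range(N)]
dW = rng.standard_normal((D, out))        # plays the role of Delta W = W - W_hat

# Left-hand side: (1/N) * sum_b ||X_b dW||_F^2
lhs = sum(np.linalg.norm(Xb @ dW, "fro") ** 2 for Xb in X_batches) / N

# Right-hand side: ||C^{1/2} dW||_F^2 with C = (1/N) sum_b X_b^T X_b
C = sum(Xb.T @ Xb for Xb in X_batches) / N
evals, evecs = np.linalg.eigh(C)          # C is PSD, so eigh applies
sqrtC = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
rhs = np.linalg.norm(sqrtC @ dW, "fro") ** 2

assert np.isclose(lhs, rhs), (lhs, rhs)   # the identity from Appendix D
print(lhs, rhs)
```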
F RELATED WORK

KV Management. Serving throughput is often bounded by how the KV cache is organized and moved across memory. PagedAttention (vLLM) treats KV as pageable blocks to avoid internal/external fragmentation and enable sharing across sequences, improving utilization under dynamic batching (Kwon et al., 2023). Orthogonally, FlashAttention reduces HBM traffic with an IO-aware tiling of exact attention, and FlashAttention-2 further improves parallelism and work partitioning for higher FLOPs utilization (Dao et al., 2022; Dao, 2023). These system/kernel directions are complementary to architectural changes (e.g., GQA/MLA) and to post-hoc reparameterizations, since better KV layout and IO scheduling directly translate into larger effective batch sizes at a fixed memory budget.

Quantization. Quantization provides an orthogonal compression path to low-rank methods and can be combined with MLA/GQA conversions. For weights/activations, SmoothQuant migrates activation outliers into weights to enable practical W8A8 PTQ on large models (Xiao et al., 2023). AWQ protects a small set of salient channels via activation-aware scaling, delivering strong 4-bit weight-only PTQ with hardware-friendly kernels (Lin et al., 2024). QuIP# pushes extreme regimes (≤4-bit) using randomized Hadamard incoherence and lattice codebooks, with state-of-the-art results at low bit-rates (Tseng et al., 2024). For the KV cache, KVQuant (NeurIPS'24) introduces pre-RoPE key quantization, sensitivity-aware non-uniform datatypes, and per-vector dense/sparse schemes to sustain long-context inference (Hooper et al., 2024), while KIVI shows tuning-free 2-bit asymmetric KV quantization with favorable throughput/memory trade-offs (Liu et al., 2024b). Together, these methods form a toolbox that is largely complementary to low-rank latent caching.

RoPE and Positional Encodings. Positional design strongly affects length generalization and conversion stability. RoPE's complex-valued rotary formulation remains the default in many LLMs (Su et al., 2021b). Alternatives include relative positions (Shaw et al., 2018), T5's learned relative bias and DeBERTa's disentangled content/position attention (Raffel et al., 2020b; He et al., 2021), and ALiBi's linear distance bias for train-short/test-long extrapolation (Press et al., 2021). Within the RoPE family, window-extension strategies modify scaling or spectra to stabilize extrapolation, such as XPOS's multiplicative stabilization, Position Interpolation, YaRN, and very long-window LongRoPE (Sun et al., 2022; Chen et al., 2023; Peng et al., 2023; Ding et al., 2024). Systematic comparisons further show that the chosen positional scheme materially impacts length generalization (Kazemnejad et al., 2023), motivating careful treatment (e.g., partial-RoPE or mixed strategies) during architectural realignments.

G DISCUSSION

Across two GQA backbones and diverse tasks, CARE (Covariance-Aware and Rank-Enhanced decomposition) enables MLA migration under KV parity with accuracy and long-context robustness on par with, or better than, stronger baselines. CARE preserves the throughput/memory advantages of MLA while mitigating the activation drift observed with weight-only SVD.

Rank allocation matters. Uniform or purely energy-based rank policies overlook the weighted spectral concentration that emerges after covariance/curvature preconditioning. CARE's Adjusted-Rank uses a water-filling allocation over the weighted singular spectra, honoring a global KV constraint while allocating capacity to the layers and directions that matter most.

Complexity. The dominant conversion costs are per-layer covariance estimation at $O(ND^2)$ (with small $N$) and truncated SVD on $\sqrt{C}\,\widetilde{W} \in \mathbb{R}^{D \times (n_h d_h)}$ at $O(D (n_h d_h) r)$ using randomized SVD. Layers can be processed sequentially; for each layer, $C = \frac{1}{N}\sum_{b=1}^{N} X_b^\top X_b$ can be kept entirely on CPU. At inference, MLA incurs light extra matrix–vector products through the reparameterized up/down projections while reducing the KV-cache width from $n_h d_h$ (MHA) or $g_h d_h$ (GQA) down to $r = g_h d_h$ (MLA).
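To make the per-layer cost and objective concrete, here is a minimal NumPy sketch of a covariance-weighted factorization for one projection matrix. It is our illustration under stated assumptions (a dense SVD in place of the randomized SVD mentioned above, and the shrinkage term of Sec. I.6), not the authors' implementation; all function and variable names are ours.

```python
import numpy as np

def covariance_weighted_factorization(X_batches, W, r, alpha=0.01):
    """Rank-r factorization of W that minimizes the activation-space
    error ||sqrt(C) (W - W_hat)||_F (Appendix D), not the weight error.

    X_batches : list of (T, D) calibration activation matrices
    W         : (D, out) projection weight (e.g., a replicated K or V)
    r         : target latent rank for this layer
    alpha     : shrinkage coefficient toward the identity (Sec. I.6)
    """
    D = W.shape[0]
    # Per-layer covariance C = (1/N) * sum_b X_b^T X_b, kept on CPU.
    C = sum(Xb.T @ Xb for Xb in X_batches) / len(X_batches)
    C = (1 - alpha) * C + alpha * np.eye(D)          # C_alpha shrinkage

    # Symmetric square root and inverse square root via eigendecomposition.
    evals, evecs = np.linalg.eigh(C)
    s = np.sqrt(np.clip(evals, 1e-12, None))
    sqrtC = (evecs * s) @ evecs.T
    inv_sqrtC = (evecs * (1.0 / s)) @ evecs.T

    # Best rank-r approximation of sqrt(C) @ W (Eckart-Young, Appendix C),
    # then undo the preconditioning on the left factor.
    U, S, Vt = np.linalg.svd(sqrtC @ W, full_matrices=False)
    A = inv_sqrtC @ (U[:, :r] * S[:r])   # (D, r) down-projection
    B = Vt[:r]                           # (r, out) up-projection
    return A, B                          # W is approximated as A @ B
```

Under this sketch, the cached latent is the $r$-dimensional product with $A$, and $B$ plays the role of the lightweight up-projection applied at compute time.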
CARE is orthogonal to quantization and sparsification, and is compatible with MLA kernels (DeepSeek-AI Team, 2024).

Compatibility with MLA migration. CARE complements recent MLA conversions (Ji et al., 2025; Meng et al., 2025) and plays well with partial-RoPE (Su et al., 2021a): removing rotations on the least-contributive subspaces further stabilizes long-context behavior when combined with activation/curvature-aware objectives.

Limitations. (i) Statistics freshness: CARE requires small calibration passes; pronounced domain shift may require refreshed covariance/curvature estimates. (ii) Diagonal curvature: practicality favors diagonal proxies; structured approximations (e.g., Kronecker-factored) may yield further gains. (iii) Extreme compression: at very low ranks, information bottlenecks dominate and further SFT can be necessary. (iv) Orthogonality to quantization/eviction: CARE does not yet co-optimize KV quantization and cache-eviction policies. (v) Kernel support: no dedicated MLA kernel currently supports per-layer dynamic ranks, making end-to-end latency benchmarking impractical; we therefore report only theoretical KV-cache savings rather than wall-clock speedups.

Broader impact and future work. CARE suggests a general recipe for post-training architectural migrations: align the objective to where errors manifest, and distribute capacity by curvature-weighted signal. Promising directions include applying covariance-aware low-rank decomposition to vision-language and multimodal architectures (Fang et al., 2026a;b), where attention modules face analogous KV-cache bottlenecks, as well as data-free calibration, structured curvature, and dynamic rank schedules that adapt latent capacity with context length while maintaining KV parity. Beyond that, our covariance-weighted SVD initialization minimizes the activation loss at each layer, but the true goal is to preserve the model's output, i.e., its next-token predictions. We may therefore cast low-rank compression as directly minimizing the sequence loss produced by the compressed (student) model under a fixed KV budget.
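One way this sequence-level objective could be instantiated, sketched below with hypothetical Hugging Face-style models that expose `.logits`, is to distill the original model's next-token distribution into the converted student. This is our illustration of the stated future direction, not a method evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def sequence_distill_loss(student, teacher, input_ids):
    """KL divergence between the teacher's (original model's) and the
    student's (MLA-converted model's) next-token distributions, a
    sequence-level objective under a fixed KV budget."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits        # (B, T, V)
    s_logits = student(input_ids).logits            # (B, T, V)
    return F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.log_softmax(t_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```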
H HYPER-PARAMETER SELECTION

All experiments were conducted on servers equipped with NVIDIA H100 80 GB GPUs paired with dual Intel Xeon Platinum 8462Y+ processors (2 × 32-core sockets, 64 cores total) and approximately 2 TB of RAM. All hyper-parameters are listed below:

• Model Configuration: Base model: Meta; Precision: float16; Sequence length: 8–2048 tokens; Covariance samples: 8–2048.
• MLA Rank Settings: Default rank: 256; Min rank: 64; Max rank: 1024; Uniform allocation: True/False; K/V projection ranks: 256 each.
• CARE Parameters: Initialization method: CARE; Damping factor (percdamp): 0.01; Cholesky decomposition: False; Activation order: False.
• Evaluation Datasets: Multi-task benchmarks including WikiText (perplexity), ARC-Challenge/Easy (reasoning), HellaSwag (commonsense), PIQA (physical reasoning), MMLU (knowledge), OpenBookQA, RACE (reading), and WinoGrande (coreference).
• Generation Settings: Max new tokens: 512–128000; Temperature: 0.6–0.7; Top-p: 0.9; Sampling strategy: nucleus sampling with temperature control.
• System Configuration: GPU memory free threshold (minimal GPU resources to run experiments): 2048 MB; Parallel GPUs: 1–8 devices; Batch size: dynamic adjustment; Random seeds: [42, 17, 26, 103, 21, 59, 134, 8, 24, 99].
• Covariance Computation: Datasets: C4/PTB/WikiText/Alpaca/ARC/ARE/MMLU instruction-following; Sample size: 8–2048; Sequence processing: 8–2048-token windows.
• Random Seeds and Learning Rate: All experimental results are averaged over 10 random seeds, and we report the best of 3 learning rates.
• Training Framework: All experiments were conducted using the VeOmni framework (Ma et al., 2025) for fine-tuning the CARE and TransMLA models.
• Learning rate: We choose the best learning rate of 2×10⁻⁶ with linear warmup over the first 0.001% of training steps.
• Optimizer and Schedulers: We sweep LR ∈ {1×10⁻⁶, 1×10⁻⁵, 5×10⁻⁵, 1×10⁻⁴, 5×10⁻⁴} and select the best within the first 0.1B tokens. We use AdamW, weight_decay = 0.01, and cosine decay with lr_warmup_ratio = 0.005 and lr_decay_ratio = 1.0.
• Batch size: Global effective batch size of 64 per update step, accumulated across devices.
• Precision: bfloat16 mixed precision was enabled to reduce memory footprint and improve throughput.
• Max sequence length: Input sequences were truncated or padded to 512–128000 tokens.
• Training epochs: Each experiment was trained for a pre-set number of tokens.

I SUPPLEMENTARY RESULTS

I.1 DETAILED ZERO-SHOT TABLES

These tables report the full zero-shot results for each model and calibration-dataset combination, using the same metric layout as Tab. 1. In addition to the main-text variants (CARE-U and CARE-E, which use $\sqrt{C}$-weighted SVD with uniform and energy-aware rank allocation, respectively), we include CARE-C-based-U and CARE-C-based-E, which replace $\sqrt{C}$ with $C$ directly as the covariance weighting; see Sec. I.8 for a detailed comparison of the two formulations.

I.1.1 LLAMA-3.1-8B-INSTRUCT-ALPACA

Table I.1: Detailed zero-shot comparison for Llama-3.1-8B-Instruct with Alpaca calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓).
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)2260.60 (↓)27.5325.7726.5052.1850.0424.2627.8020.86 31.87 MHA2MLA284863.91 (↓)23.8225.9426.3450.9250.5925.6227.8022.11 31.64 CARE-C-based-U893.84 (↓)31.4023.9826.9155.6650.0422.9127.0021.1532.38 CARE-C-based-E837.25 (↓)32.4123.4627.5056.7549.9622.9027.4021.3432.71 CARE-U983.55 (↓)30.5123.8926.8254.7349.0123.1826.0020.9631.89 CARE-E983.03 (↓)31.3123.6327.4954.6250.3622.9827.4021.1532.37 ASVD2525.33 (↓)26.6823.8126.6852.1850.9922.9727.8020.86 31.50 SVD-LLM V2967.04 (↓)30.7223.6326.7554.9048.9323.2226.4020.86 31.93 128 Palu(SVD)3046.58 (↓)27.0223.8926.6250.3849.2523.1424.8022.49 30.95 MHA2MLA15028.91 (↓)24.1225.0026.5852.1851.0723.8229.2022.11 31.76 CARE-C-based-U355.47 (↓)42.8027.0533.6461.5953.5126.7927.4024.1137.11 CARE-C-based-E328.76 (↓)43.6027.9036.6062.4654.7828.8829.0025.2638.56 CARE-U398.91 (↓)41.9626.7133.6460.9953.2026.0627.0024.2136.72 CARE-E353.74 (↓)42.4727.3037.9662.4654.8529.3828.2026.1238.59 ASVD1675.54 (↓)27.9925.0026.6152.3948.4623.0426.6021.82 31.49 SVD-LLM V2386.23 (↓)42.3827.5634.0661.0453.9926.1727.8024.21 37.15 256 Palu(SVD)537.57 (↓)31.1022.6127.9355.2251.7022.8227.0022.68 32.63 MHA2MLA1633.65 (↓)25.6327.4726.5052.3950.8323.1028.2022.49 32.08 CARE-C-based-U49.35 (↓)59.1833.6253.4371.4461.4844.0633.6033.4948.79 CARE-C-based-E49.40 (↓)61.7437.5458.2872.4263.2250.4333.0034.6451.41 CARE-U48.35 (↓)60.1936.0955.7571.8762.0445.6433.2035.2250.00 CARE-E49.43 (↓)60.0638.4860.5472.4266.0654.2933.6035.5052.62 ASVD312.86 (↓)32.0321.7629.5255.2252.6423.0525.2024.59 33.00 SVD-LLM V247.28 (↓)60.6936.2656.3972.2060.6946.5733.4034.83 50.13 512 Palu(SVD)45.40 (↓)45.9628.1643.3764.1553.4324.9830.8023.83 39.33 MHA2MLA220.29 (↓)40.9525.9439.2761.3756.5925.5426.6026.79 37.88 CARE-C-based-U10.06 (↓)76.6851.4572.2077.9171.3561.0341.2041.7261.69 CARE-C-based-E16.46 (↓)68.1844.0369.8876.0671.9063.2838.6040.1059.00 CARE-U9.64 (↓)76.3052.7373.9878.7372.6162.1740.6041.5362.33 CARE-E19.50 (↓)64.6942.4168.9675.4671.5963.9837.0038.4757.82 ASVD12.02 (↓)69.1146.3370.7576.1766.8541.8036.6033.97 55.20 SVD-LLM V29.63 (↓)76.6852.3973.5778.4572.2262.3140.4041.63 62.21 21 Published as a conference paper at ICLR 2026 I.1.2LLAMA-3.1-8B-INSTRUCT-C4 Table I.2: Detailed zero-shot comparison for Llama-3.1-8B-Instruct with C4 cal- ibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)2260.60 (↓)27.5325.7726.5052.1850.0424.2627.8020.86 31.87 MHA2MLA284863.91 (↓)23.8225.9426.3450.9250.5925.6227.8022.11 31.64 CARE-C-based-U688.95 (↓)28.3224.7427.2855.9351.8522.9526.4021.0532.31 CARE-C-based-E582.93 (↓)29.5524.1528.1556.7552.2522.9526.2020.9632.62 CARE-U786.17 (↓)27.6124.4026.9854.6849.8822.9526.4020.1931.64 CARE-E676.43 (↓)28.2424.7427.9755.3950.1222.9526.6020.8632.11 ASVD2845.27 (↓)25.9723.9827.0151.9048.8622.9829.2022.30 31.52 SVD-LLM V2769.05 (↓)27.7424.5727.1054.6249.9622.9526.0020.38 31.66 128 Palu(SVD)3046.58 (↓)27.0223.8926.6250.3849.2523.1424.8022.49 30.95 MHA2MLA15028.91 (↓)24.1225.0026.5852.1851.0723.8229.2022.11 31.76 CARE-C-based-U182.54 (↓)35.2324.2336.2061.9255.6423.0428.0025.7436.25 CARE-C-based-E176.68 (↓)37.8826.8840.0764.0955.4925.0229.8026.5138.22 CARE-U220.53 (↓)34.3924.2336.2561.2154.7823.4627.6025.1735.89 CARE-E201.19 (↓)33.9226.3740.9662.0856.7525.3328.0026.8937.54 ASVD1767.53 (↓)28.2425.0026.9553.2149.1723.0127.4020.86 31.73 SVD-LLM V2211.16 (↓)34.5124.4036.7760.9954.7823.4427.8025.65 36.04 256 Palu(SVD)537.57 (↓)31.1022.6127.9355.2251.7022.8227.0022.68 32.63 MHA2MLA1633.65 (↓)25.6327.4726.5052.3950.8323.1028.2022.49 32.08 CARE-C-based-U34.32 (↓)53.7033.6258.5171.7661.9637.6931.8035.1248.02 CARE-C-based-E37.79 (↓)55.3035.4961.5472.4264.9642.8532.2035.1249.99 CARE-U35.63 (↓)53.0734.7360.6372.1463.0640.0932.6034.9348.91 CARE-E38.89 (↓)53.4135.7562.7971.7164.8849.5432.0034.2650.54 ASVD313.33 (↓)31.5722.5329.7755.5553.2823.2126.0023.35 33.16 SVD-LLM V235.49 (↓)53.0734.8160.8772.4263.5440.9732.8034.64 49.14 512 Palu(SVD)45.40 (↓)45.9628.1643.3764.1553.4324.9830.8023.83 39.33 MHA2MLA220.29 (↓)40.9525.9439.2761.3756.5925.5426.6026.79 37.88 CARE-C-based-U9.58 (↓)76.4750.7773.1477.5871.9859.6139.2042.3961.39 CARE-C-based-E15.55 (↓)66.1243.0970.5876.2871.2762.3237.0039.9058.32 CARE-U9.31 (↓)76.4750.7775.0277.8071.8259.9641.0042.8761.96 CARE-E18.07 (↓)60.9441.2170.2975.7371.5963.1037.0038.5657.30 ASVD12.01 (↓)68.6445.6570.5076.5567.8839.4037.4033.88 54.99 SVD-LLM V29.33 (↓)76.0151.1174.8077.8071.5159.9340.6042.01 61.72 I.1.3LLAMA-3.1-8B-INSTRUCT-PTB Table I.3: Detailed zero-shot comparison for Llama-3.1-8B-Instruct with PTB cal- ibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)2260.60 (↓)27.5325.7726.5052.1850.0424.2627.8020.86 31.87 MHA2MLA284863.91 (↓)23.8225.9426.3450.9250.5925.6227.8022.11 31.64 CARE-C-based-U756.56 (↓)26.7325.6826.2851.5850.2022.9525.8021.3431.32 CARE-C-based-E612.72 (↓)27.5725.4326.5151.7450.7522.9526.0021.0531.50 CARE-U815.80 (↓)26.2625.0026.5051.3649.2522.9527.0020.6731.12 CARE-E689.55 (↓)27.1525.6026.5151.2549.4922.9427.2020.3831.31 ASVD2718.72 (↓)26.3925.0926.9151.6949.0123.0229.0021.72 31.60 SVD-LLM V2803.38 (↓)26.3925.4326.4151.4149.4122.9527.2020.67 31.23 128 Palu(SVD)3046.58 (↓)27.0223.8926.6250.3849.2523.1424.8022.49 30.95 MHA2MLA15028.91 (↓)24.1225.0026.5852.1851.0723.8229.2022.11 31.76 CARE-C-based-U200.05 (↓)32.1123.8130.0154.9053.6723.3026.4023.5433.47 CARE-C-based-E178.91 (↓)33.0024.7431.9455.8854.7823.6228.6024.4034.62 CARE-U229.71 (↓)31.1924.7429.8053.1652.8024.8025.8023.7333.25 CARE-E194.88 (↓)30.5124.4031.9654.2456.0422.9728.2022.9733.91 ASVD1776.16 (↓)27.3624.5726.7353.1649.4122.9727.6022.20 31.75 SVD-LLM V2220.92 (↓)31.4423.9829.9454.0853.2825.0725.4023.92 33.39 256 Palu(SVD)537.57 (↓)31.1022.6127.9355.2251.7022.8227.0022.68 32.63 22 Published as a conference paper at ICLR 2026 RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG MHA2MLA1633.65 (↓)25.6327.4726.5052.3950.8323.1028.2022.49 32.08 CARE-C-based-U33.00 (↓)49.6630.5548.8964.5862.5137.5029.4031.7744.36 CARE-C-based-E30.87 (↓)50.5533.2852.5567.0862.5942.7531.2032.5446.57 CARE-U34.44 (↓)43.5627.7348.9061.7562.3531.6530.4029.0941.93 CARE-E33.96 (↓)47.5234.1354.8466.4363.7745.5431.6032.6347.06 ASVD316.66 (↓)31.2722.1829.3455.3353.1223.1425.4023.92 32.96 SVD-LLM V233.36 (↓)47.6929.5250.4364.6961.8835.3230.8030.43 43.85 512 Palu(SVD)45.40 (↓)45.9628.1643.3764.1553.4324.9830.8023.83 39.33 MHA2MLA220.29 (↓)40.9525.9439.2761.3756.5925.5426.6026.79 37.88 CARE-C-based-U9.67 (↓)73.4849.0670.8876.2271.1957.5237.4040.6759.55 CARE-C-based-E13.00 (↓)62.9640.7868.2474.1069.9361.9635.4039.7156.64 CARE-U9.38 (↓)75.2551.7172.9377.2672.0657.8640.2040.7761.00 CARE-E16.21 (↓)56.3138.9966.9973.5670.5662.6334.4037.8955.17 ASVD12.15 (↓)67.5143.9470.0075.6366.9338.4336.2032.06 53.84 SVD-LLM V29.42 (↓)74.7149.9172.1877.0971.9057.8337.8040.29 60.21 I.1.4LLAMA-3.1-8B-INSTRUCT-WIKITEXT2 Table I.4: Detailed zero-shot comparison for Llama-3.1-8B-Instruct with WikiText2 calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)2260.60 (↓)27.5325.7726.5052.1850.0424.2627.8020.86 31.87 MHA2MLA284863.91 (↓)23.8225.9426.3450.9250.5925.6227.8022.11 31.64 CARE-C-based-U189.39 (↓)28.1125.0926.8251.4750.1222.9527.8020.5731.62 CARE-C-based-E138.00 (↓)28.9125.4327.1652.3950.3622.9527.6020.3831.90 CARE-U237.45 (↓)27.3125.3426.4651.6348.2222.9427.6020.7731.28 CARE-E184.54 (↓)27.8624.9126.8351.9650.2822.9526.8020.5731.52 ASVD2900.25 (↓)26.0924.0626.7751.2549.2522.9629.0021.91 31.41 SVD-LLM V2229.74 (↓)27.1525.2626.5051.6948.6222.9427.6020.86 31.33 128 Palu(SVD)3046.58 (↓)27.0223.8926.6250.3849.2523.1424.8022.49 30.95 MHA2MLA15028.91 (↓)24.1225.0026.5852.1851.0723.8229.2022.11 31.76 CARE-C-based-U51.39 (↓)33.0823.9831.5155.5052.4123.4328.4023.6433.99 CARE-C-based-E47.26 (↓)33.2525.6033.5957.1354.1423.3729.8024.9835.23 CARE-U66.44 (↓)32.3723.9831.0454.0852.3324.1028.6023.7333.78 CARE-E55.08 (↓)32.4124.6634.0755.1154.3023.1029.4025.3634.80 ASVD1823.04 (↓)27.6925.5126.9653.5950.9922.9229.4021.24 32.29 SVD-LLM V262.94 (↓)32.1123.8931.3854.6853.4324.0428.6023.44 33.95 256 Palu(SVD)537.57 (↓)31.1022.6127.9355.2251.7022.8227.0022.68 32.63 MHA2MLA1633.65 (↓)25.6327.4726.5052.3950.8323.1028.2022.49 32.08 CARE-C-based-U15.66 (↓)47.1029.1049.8563.8760.8533.6532.6031.6743.59 CARE-C-based-E15.53 (↓)53.3234.6455.9068.0163.5441.0533.8033.6847.99 CARE-U16.14 (↓)45.8829.6952.2763.1762.3533.2633.8032.6344.13 CARE-E16.84 (↓)51.7734.3058.4367.4165.5945.8932.0033.5948.62 ASVD318.48 (↓)31.8222.2729.3754.9554.0623.0026.4023.35 33.15 SVD-LLM V215.88 (↓)48.1930.0352.7764.8061.4834.3334.0032.25 44.73 512 Palu(SVD)45.40 (↓)45.9628.1643.3764.1553.4324.9830.8023.83 39.33 MHA2MLA220.29 (↓)40.9525.9439.2761.3756.5925.5426.6026.79 37.88 CARE-C-based-U8.29 (↓)76.4350.2672.0076.6672.2256.8338.6041.6360.58 CARE-C-based-E9.82 (↓)66.9642.8369.6474.7670.6461.9937.2038.8557.86 CARE-U8.29 (↓)76.6051.5473.9477.4872.5356.9939.4040.1961.08 CARE-E11.25 (↓)60.9041.0469.0373.7271.2763.1235.4037.8056.54 ASVD12.11 (↓)67.6344.2870.0775.6867.0138.7836.4033.30 54.14 SVD-LLM V28.26 (↓)76.3051.2873.5477.6472.3058.0739.6039.81 61.07 I.1.5QWEN2.5-1.5B-INSTRUCT-ALPACA 23 Published as a conference paper at ICLR 2026 Table I.5: Detailed zero-shot comparison for Qwen2.5-1.5B-Instruct with Alpaca calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)114579.38 (↓)25.1725.2625.5550.2750.2024.0627.4020.67 31.07 MHA2MLA62077.36 (↓)26.5625.3426.5751.6350.2024.6029.0022.68 32.07 CARE-C-based-U909.76 (↓)29.0022.3526.1452.5048.1523.1126.6022.3031.27 CARE-C-based-E730.60 (↓)29.0022.8726.8354.1951.3022.8527.6024.1132.34 CARE-U1098.80 (↓)26.7723.3826.9552.2951.0722.9525.2023.1631.47 CARE-E1022.88 (↓)29.0023.2926.7054.7351.1423.0026.8022.1132.10 ASVD7737.23 (↓)26.8125.7726.6151.0949.8825.4627.6022.11 31.92 SVD-LLM V21109.62 (↓)28.3723.6326.9153.4849.8022.9523.8023.92 31.61 96 Palu(SVD)147110.47 (↓)25.9723.9826.4950.8750.4324.3330.2020.57 31.60 MHA2MLA73056.19 (↓)26.1426.7126.5152.1250.2822.9026.8021.82 31.66 CARE-C-based-U237.47 (↓)35.0224.3229.8757.5152.5724.4826.6026.2234.57 CARE-C-based-E51.56 (↓)52.1530.4646.1767.6854.7032.0331.6029.4743.03 CARE-U266.39 (↓)33.9622.3529.8056.2650.9124.1328.2025.0733.83 CARE-E48.58 (↓)49.3729.0146.5466.7053.2831.3330.0031.2942.19 ASVD3663.48 (↓)28.3723.4627.7852.6151.4624.3927.6025.07 32.59 SVD-LLM V2249.45 (↓)34.8923.0430.3358.4950.5124.8128.4026.70 34.65 128 Palu(SVD)40284.33 (↓)28.4521.5927.1052.2951.1425.4930.6022.68 32.42 MHA2MLA51022.27 (↓)26.8125.6827.3951.0349.0926.0027.0023.73 32.09 CARE-C-based-U47.23 (↓)55.3931.6644.0866.9755.4135.7529.4033.6844.04 CARE-C-based-E30.43 (↓)56.6132.3451.1369.5353.4340.9134.0031.3946.17 CARE-U46.20 (↓)54.1731.4045.6267.9554.8536.3829.4034.9344.34 CARE-E28.00 (↓)56.2331.7450.8569.5354.7037.8534.2031.7745.86 ASVD1582.91 (↓)30.6423.6328.2051.2051.6225.4727.4024.11 32.78 SVD-LLM V244.32 (↓)55.8131.4846.2967.3654.3836.6529.0035.02 44.50 I.1.6QWEN2.5-1.5B-INSTRUCT-C4 Table I.6: Detailed zero-shot comparison for Qwen2.5-1.5B-Instruct with C4 cal- ibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)114579.38 (↓)25.1725.2625.5550.2750.2024.0627.4020.67 31.07 MHA2MLA62077.36 (↓)26.5625.3426.5751.6350.2024.6029.0022.68 32.07 CARE-C-based-U679.30 (↓)27.7423.1226.6952.7751.2222.9526.8023.1631.81 CARE-C-based-E365.98 (↓)28.5422.7027.4255.0152.0922.9526.4023.6432.34 CARE-U914.36 (↓)26.4324.5726.1751.9049.4122.9525.8023.8331.38 CARE-E664.13 (↓)27.1523.1227.4453.4351.3822.9427.0023.6432.01 ASVD7074.97 (↓)26.8525.4327.1450.9850.9125.8930.0022.49 32.46 SVD-LLM V2826.83 (↓)26.2225.2626.4151.9049.6422.9525.4024.11 31.49 96 Palu(SVD)147110.47 (↓)25.9723.9826.4950.8750.4324.3330.2020.57 31.60 MHA2MLA73056.19 (↓)26.1426.7126.5152.1250.2822.9026.8021.82 31.66 CARE-C-based-U151.08 (↓)32.1522.6131.5355.2851.3823.5227.0027.0833.82 CARE-C-based-E43.55 (↓)43.1027.0547.8568.0655.6427.4531.0031.4841.45 CARE-U176.46 (↓)31.1420.8230.3955.5552.4922.9625.8025.2633.05 CARE-E41.81 (↓)41.7127.9947.8266.6555.6427.6129.8028.3340.69 ASVD3864.43 (↓)28.7523.1227.4951.4151.3024.7527.6023.16 32.20 SVD-LLM V2161.86 (↓)31.2321.5031.1556.3151.2222.9426.6026.51 33.43 128 Palu(SVD)40284.33 (↓)28.4521.5927.1052.2951.1425.4930.6022.68 32.42 MHA2MLA51022.27 (↓)26.8125.6827.3951.0349.0926.0027.0023.73 32.09 CARE-C-based-U34.92 (↓)47.9827.5648.0365.5156.5132.6728.8034.1642.65 CARE-C-based-E28.60 (↓)49.1629.6151.8069.2654.6234.0133.4031.2944.14 CARE-U34.14 (↓)51.6829.1049.0366.7056.9132.4728.2033.9743.51 CARE-E24.35 (↓)49.2829.3552.0369.3156.5933.3932.0031.2044.14 ASVD1244.07 (↓)30.7720.9927.4751.8550.3624.6325.8024.78 32.08 SVD-LLM V232.25 (↓)51.9828.5849.9267.3658.2534.4329.2033.78 44.19 24 Published as a conference paper at ICLR 2026 I.1.7QWEN2.5-1.5B-INSTRUCT-PTB Table I.7: Detailed zero-shot comparison for Qwen2.5-1.5B-Instruct with PTB calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)114579.38 (↓)25.1725.2625.5550.2750.2024.0627.4020.67 31.07 MHA2MLA62077.36 (↓)26.5625.3426.5751.6350.2024.6029.0022.68 32.07 CARE-C-based-U1608.42 (↓)27.4823.4626.7951.0950.5923.0126.8021.6331.36 CARE-C-based-E701.84 (↓)27.9523.5527.1251.5250.0422.9524.8023.7331.46 CARE-U2131.72 (↓)27.5323.3827.0051.7449.0123.1427.2023.0631.51 CARE-E362.35 (↓)29.5923.5529.4352.4550.8323.1025.0022.3932.04 ASVD8309.98 (↓)26.9824.8326.9050.8250.6725.4028.4021.91 31.99 SVD-LLM V22075.31 (↓)27.3622.6127.1850.8249.0123.1327.6022.68 31.30 96 Palu(SVD)147110.47 (↓)25.9723.9826.4950.8750.4324.3330.2020.57 31.60 MHA2MLA73056.19 (↓)26.1426.7126.5152.1250.2822.9026.8021.82 31.66 CARE-C-based-U306.18 (↓)30.2622.8728.9652.5052.7223.0426.0025.1732.69 CARE-C-based-E57.23 (↓)41.1225.9443.8262.9554.4627.5229.2028.3339.17 CARE-U367.50 (↓)28.7920.9929.3952.2952.7223.3926.0023.9232.19 CARE-E74.42 (↓)41.0426.3743.6262.0253.3525.7430.8029.4739.05 ASVD3704.61 (↓)28.9123.3828.0352.6750.3624.9328.8024.50 32.70 SVD-LLM V2334.57 (↓)29.5520.1429.5252.5052.5723.0225.2024.69 32.15 128 Palu(SVD)40284.33 (↓)28.4521.5927.1052.2951.1425.4930.6022.68 32.42 MHA2MLA51022.27 (↓)26.8125.6827.3951.0349.0926.0027.0023.73 32.09 CARE-C-based-U65.89 (↓)44.9925.6041.9560.0755.6432.3827.0031.2039.85 CARE-C-based-E23.22 (↓)49.5830.7249.7966.4954.9335.4831.0032.0643.76 CARE-U67.96 (↓)45.0326.9643.7261.1056.1229.5528.8033.4940.60 CARE-E22.47 (↓)48.7429.1849.2866.8755.9636.8031.6031.3943.73 ASVD1210.45 (↓)30.6020.7327.7552.2350.5923.8627.2024.69 32.21 SVD-LLM V265.47 (↓)45.8828.5844.0060.7755.6429.1929.0032.92 40.75 I.1.8QWEN2.5-1.5B-INSTRUCT-WIKITEXT2 Table I.8: Detailed zero-shot comparison for Qwen2.5-1.5B-Instruct with WikiText2 calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)114579.38 (↓)25.1725.2625.5550.2750.2024.0627.4020.67 31.07 MHA2MLA62077.36 (↓)26.5625.3426.5751.6350.2024.6029.0022.68 32.07 CARE-C-based-U209.05 (↓)27.8622.1826.3452.5650.6722.9828.2023.6431.80 CARE-C-based-E122.65 (↓)27.8623.2927.1253.4850.7522.9426.2022.6831.79 CARE-U300.99 (↓)26.7723.1226.7452.1850.2022.9526.6024.5031.63 CARE-E125.54 (↓)27.3123.1227.8652.4550.9922.9625.4023.5431.70 ASVD7126.52 (↓)26.8125.6027.1951.9050.2025.7126.8021.82 32.00 SVD-LLM V2264.86 (↓)26.9423.2126.8751.9649.9622.9726.4024.59 31.61 96 Palu(SVD)147110.47 (↓)25.9723.9826.4950.8750.4324.3330.2020.57 31.60 MHA2MLA73056.19 (↓)26.1426.7126.5152.1250.2822.9026.8021.82 31.66 CARE-C-based-U57.36 (↓)30.8122.1029.8753.4849.7222.9726.2024.5032.46 CARE-C-based-E19.53 (↓)44.4026.8846.3364.7454.7827.9930.2030.0540.67 CARE-U61.60 (↓)31.6921.3329.9554.2450.9923.1326.2023.5432.63 CARE-E20.46 (↓)40.7827.0546.4663.4454.7827.2131.8029.7640.16 ASVD3636.42 (↓)28.3222.9527.2952.5050.0424.0730.0023.25 32.30 SVD-LLM V256.77 (↓)31.0621.6730.4155.2848.8622.9325.6024.31 32.51 128 Palu(SVD)40284.33 (↓)28.4521.5927.1052.2951.1425.4930.6022.68 32.42 MHA2MLA51022.27 (↓)26.8125.6827.3951.0349.0926.0027.0023.73 32.09 CARE-C-based-U19.29 (↓)47.1025.8544.1462.0855.4930.6929.2032.0640.83 CARE-C-based-E15.62 (↓)51.2230.7251.5268.2854.6230.2233.4031.3943.92 CARE-U17.88 (↓)46.6827.3046.4663.7655.9627.8428.6033.1141.21 CARE-E15.67 (↓)51.6430.2050.5067.0855.3329.8333.2032.4443.78 ASVD1241.39 (↓)30.5621.0827.6553.8149.0123.6126.2023.92 31.98 SVD-LLM V217.50 (↓)47.9426.9647.0564.4757.0628.5129.2032.54 41.72 25 Published as a conference paper at ICLR 2026 I.1.9QWEN3-4B-INSTRUCT-2507-ALPACA Table I.9: Detailed zero-shot comparison for Qwen3-4B-Instruct-2507 with Alpaca calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)56922.11 (↓)26.7725.3425.7350.4450.9124.3128.4022.11 31.75 MHA2MLA21850.16 (↓)25.4625.6826.0351.9650.2022.9227.0022.01 31.41 CARE-C-based-U1052.43 (↓)29.3422.7027.5154.3550.6723.2126.2023.6432.20 CARE-C-based-E941.16 (↓)28.7523.1227.3355.3350.2823.0825.6022.3031.97 CARE-U905.74 (↓)29.0023.4627.6155.2250.5923.0326.6023.0632.32 CARE-E730.93 (↓)30.7723.2928.3355.2251.5422.9524.8022.7832.46 ASVD6683.95 (↓)25.0027.3925.7150.3850.4324.0826.6022.11 31.46 SVD-LLM V2894.94 (↓)28.5823.1227.6655.1151.5423.0227.2023.44 32.46 128 Palu(SVD)22048.79 (↓)26.1826.0226.2951.0949.7224.4925.4021.05 31.28 MHA2MLA52683.47 (↓)26.5623.8126.7452.1849.7224.7428.2022.68 31.83 CARE-C-based-U148.73 (↓)38.9325.0932.9859.0953.1223.7625.6026.1235.59 CARE-C-based-E139.32 (↓)39.9827.0537.1661.6453.2027.1927.0026.9937.53 CARE-U111.17 (↓)39.6927.4737.5260.5053.5927.0327.6026.9937.55 CARE-E102.38 (↓)44.8230.2942.5463.7654.5430.5229.4028.7140.57 ASVD1682.84 (↓)30.0122.9529.2952.3450.1223.4227.0023.54 32.33 SVD-LLM V2116.76 (↓)40.0327.0537.0261.0453.2826.6926.8025.74 37.21 256 Palu(SVD)2561.97 (↓)29.4626.1129.9252.8851.7024.4828.4023.64 33.32 MHA2MLA44509.79 (↓)28.4922.1828.9252.4551.3023.0024.6024.59 31.94 CARE-C-based-U24.63 (↓)64.0638.4052.0169.0459.1245.7334.2034.3549.61 CARE-C-based-E28.67 (↓)60.2738.3153.5869.1559.4349.3233.6033.4949.64 CARE-U22.08 (↓)68.9046.4259.1671.5562.4354.7636.4035.0254.33 CARE-E28.84 (↓)59.2241.3056.5369.3761.8853.5035.2032.8251.23 ASVD63.15 (↓)46.8432.7647.4563.9352.0126.3830.8030.05 41.28 SVD-LLM V222.88 (↓)67.1744.5457.7770.7861.4852.8135.6034.83 53.12 512 Palu(SVD)33.97 (↓)47.6435.5850.4465.1852.6427.8532.8030.24 42.80 MHA2MLA100.99 (↓)41.0827.0537.9759.1954.0629.1427.2029.47 38.15 CARE-C-based-U11.90 (↓)79.0455.2064.3274.2167.0166.3538.6038.7660.44 CARE-C-based-E15.52 (↓)71.0447.1859.8671.9363.3062.1935.4036.8455.97 CARE-U12.03 (↓)77.2354.9569.2476.2268.4367.4640.0039.4361.62 CARE-E15.91 (↓)70.8849.2364.1372.8064.5664.1636.2036.5657.31 ASVD15.49 (↓)66.5447.6167.4973.0162.7556.5635.6035.22 55.60 SVD-LLM V211.88 (↓)77.4454.6168.5375.6867.2567.6539.8038.18 61.14 I.1.10QWEN3-4B-INSTRUCT-2507-C4 Table I.10: Detailed zero-shot comparison for Qwen3-4B-Instruct-2507 with C4 calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)56922.11 (↓)26.7725.3425.7350.4450.9124.3128.4022.11 31.75 MHA2MLA21850.16 (↓)25.4625.6826.0351.9650.2022.9227.0022.01 31.41 CARE-C-based-U494.84 (↓)28.9123.5527.4155.1150.9122.9323.6024.2132.08 CARE-C-based-E447.15 (↓)28.2821.5027.9055.2850.3622.9524.4023.3531.75 CARE-U380.04 (↓)29.5922.5328.1353.2151.1422.9226.2023.6432.17 CARE-E320.78 (↓)29.2523.6329.8854.7950.9122.9524.4023.5432.42 ASVD5126.87 (↓)26.9823.5526.7752.2352.8023.7925.8022.30 31.78 SVD-LLM V2390.08 (↓)29.3422.4428.1053.9750.8322.9726.0023.64 32.16 128 Palu(SVD)22048.79 (↓)26.1826.0226.2951.0949.7224.4925.4021.05 31.28 MHA2MLA52683.47 (↓)26.5623.8126.7452.1849.7224.7428.2022.68 31.83 CARE-C-based-U83.75 (↓)35.1922.2736.4959.4750.9922.9728.4027.3735.39 CARE-C-based-E77.69 (↓)38.1324.4040.8961.0451.3823.3226.8028.9036.86 CARE-U65.49 (↓)36.5724.4041.2561.6454.3824.3328.0027.7537.29 CARE-E63.64 (↓)40.9927.9945.0262.5156.2028.2229.2028.8039.87 ASVD641.70 (↓)31.2723.4630.7052.1852.0123.4027.2022.87 32.89 SVD-LLM V267.09 (↓)36.2024.3240.9161.6453.9124.0927.2026.79 36.88 26 Published as a conference paper at ICLR 2026 RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 256 Palu(SVD)2561.97 (↓)29.4626.1129.9252.8851.7024.4828.4023.64 33.32 MHA2MLA44509.79 (↓)28.4922.1828.9252.4551.3023.0024.6024.59 31.94 CARE-C-based-U18.31 (↓)56.6535.9255.2069.5960.4644.7733.2034.6448.80 CARE-C-based-E22.22 (↓)55.4336.1856.0369.1561.0944.6432.8034.7448.76 CARE-U18.25 (↓)64.6541.1360.9271.1164.4852.1935.4035.9853.23 CARE-E23.20 (↓)56.6139.4257.9769.8662.6747.8134.0033.8850.28 ASVD41.55 (↓)50.9335.0750.3165.9455.6429.0833.2030.81 43.87 SVD-LLM V218.58 (↓)63.8041.4760.4771.6063.9351.1235.6035.12 52.89 512 Palu(SVD)33.97 (↓)47.6435.5850.4465.1852.6427.8532.8030.24 42.80 MHA2MLA100.99 (↓)41.0827.0537.9759.1954.0629.1427.2029.47 38.15 CARE-C-based-U11.17 (↓)79.5053.8465.2574.9766.4666.4638.2040.8660.69 CARE-C-based-E13.08 (↓)69.1945.0561.4172.9167.0959.1736.2039.2356.28 CARE-U11.57 (↓)75.8453.6769.7076.5068.1967.8338.4040.1961.29 CARE-E14.54 (↓)68.0147.6165.0274.3767.5662.9936.0038.1857.47 ASVD14.37 (↓)66.8448.0468.0273.6162.2757.6435.8037.89 56.26 SVD-LLM V211.43 (↓)77.1953.8469.1677.2067.8067.6439.2040.19 61.53 I.1.11QWEN3-4B-INSTRUCT-2507-PTB Table I.11: Detailed zero-shot comparison for Qwen3-4B-Instruct-2507 with PTB calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)56922.11 (↓)26.7725.3425.7350.4450.9124.3128.4022.11 31.75 MHA2MLA21850.16 (↓)25.4625.6826.0351.9650.2022.9227.0022.01 31.41 CARE-C-based-U1234.23 (↓)28.4922.4426.9151.7448.4622.9524.2022.2030.92 CARE-C-based-E1036.30 (↓)28.3222.8726.9352.6750.8322.9524.4021.8231.35 CARE-U961.94 (↓)28.7923.2927.4951.3149.0922.9824.6022.5831.27 CARE-E671.86 (↓)29.5024.2327.8752.3949.3322.9024.8023.1631.77 ASVD4441.27 (↓)26.5624.9125.9451.9051.7023.6927.4021.91 31.75 SVD-LLM V21001.12 (↓)29.2923.4627.3751.5849.9623.0024.2022.58 31.43 128 Palu(SVD)22048.79 (↓)26.1826.0226.2951.0949.7224.4925.4021.05 31.28 MHA2MLA52683.47 (↓)26.5623.8126.7452.1849.7224.7428.2022.68 31.83 CARE-C-based-U206.34 (↓)33.1623.7232.0455.2250.4325.2025.4026.8934.01 CARE-C-based-E194.64 (↓)35.8224.4934.3055.6650.2024.6627.6027.8535.07 CARE-U191.71 (↓)35.1425.7735.1255.8252.9626.4326.0027.1835.55 CARE-E175.19 (↓)39.1027.3039.0657.5155.1726.8329.4028.0437.80 ASVD635.48 (↓)30.6823.4630.5352.7749.4123.1228.4023.44 32.73 SVD-LLM V2197.57 (↓)34.6425.4334.9755.5551.7826.4925.2027.46 35.19 256 Palu(SVD)2561.97 (↓)29.4626.1129.9252.8851.7024.4828.4023.64 33.32 MHA2MLA44509.79 (↓)28.4922.1828.9252.4551.3023.0024.6024.59 31.94 CARE-C-based-U44.02 (↓)57.5335.8449.6564.8560.2236.4629.6033.4045.94 CARE-C-based-E49.68 (↓)54.9235.7550.6665.2961.8836.9230.2033.5946.15 CARE-U55.61 (↓)61.2842.5855.9567.6363.5447.5731.4033.6850.45 CARE-E46.80 (↓)54.7138.4854.1166.5963.1447.6431.8031.3948.48 ASVD42.42 (↓)48.8234.3949.7265.6754.3828.2032.6030.62 43.05 SVD-LLM V257.76 (↓)61.2440.8754.7967.1962.4344.8931.6033.30 49.54 512 Palu(SVD)33.97 (↓)47.6435.5850.4465.1852.6427.8532.8030.24 42.80 MHA2MLA100.99 (↓)41.0827.0537.9759.1954.0629.1427.2029.47 38.15 CARE-C-based-U15.03 (↓)78.8352.1363.2873.8865.6765.4936.2036.9459.05 CARE-C-based-E13.48 (↓)66.4144.8858.7171.2765.5157.8134.4037.6154.57 CARE-U13.50 (↓)76.8953.8469.1174.8668.5966.8337.6038.5660.79 CARE-E15.60 (↓)67.7647.1063.0571.5567.8861.6436.0037.5156.56 ASVD14.45 (↓)65.7848.0468.1674.0563.3057.4936.6037.61 56.38 SVD-LLM V219.51 (↓)76.6853.7567.4275.3067.5666.1136.8039.33 60.37 I.1.12QWEN3-4B-INSTRUCT-2507-WIKITEXT2 27 Published as a conference paper at ICLR 2026 Table I.12: Detailed zero-shot comparison for Qwen3-4B-Instruct-2507 with Wiki- Text2 calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
RankMethods PPL (↓)ACC (↑) ARC-E ARC-C HellaSwag PIQAWGMMLU OBQARAAVG 64 Palu(SVD)56922.11 (↓)26.7725.3425.7350.4450.9124.3128.4022.11 31.75 MHA2MLA21850.16 (↓)25.4625.6826.0351.9650.2022.9227.0022.01 31.41 CARE-C-based-U169.94 (↓)27.9922.3527.1652.1849.2522.9525.6023.3531.35 CARE-C-based-E142.06 (↓)28.3723.8127.5052.7249.6422.9527.2022.2031.80 CARE-U156.42 (↓)28.6623.2928.5052.4549.9622.9526.0022.1131.74 CARE-E112.21 (↓)30.0523.2929.1252.3950.7522.9326.0023.3532.24 ASVD4000.65 (↓)26.7323.8926.6852.3448.4624.0429.0022.01 31.64 SVD-LLM V2159.64 (↓)28.2822.7828.4752.6150.1222.9525.4022.49 31.64 128 Palu(SVD)22048.79 (↓)26.1826.0226.2951.0949.7224.4925.4021.05 31.28 MHA2MLA52681.01 (↓)26.6023.8926.7752.3449.9624.8028.2022.68 31.91 CARE-C-based-U39.73 (↓)34.7223.2933.5556.4251.2223.0226.6026.8934.46 CARE-C-based-E32.49 (↓)36.9925.6836.8558.7652.0123.6229.4028.5236.48 CARE-U35.13 (↓)37.0425.8538.2458.8153.2024.3628.8027.9436.78 CARE-E29.98 (↓)41.7527.6542.1459.5853.9124.6030.6028.1338.55 ASVD591.20 (↓)31.4424.0630.8953.7052.9623.4326.4023.83 33.34 SVD-LLM V234.76 (↓)37.3326.0237.9158.1153.3524.0227.6028.52 36.61 256 Palu(SVD)2562.82 (↓)29.4625.8530.0252.9451.7824.4028.6023.54 33.32 MHA2MLA44510.05 (↓)28.5422.0128.9552.6751.0723.0024.8024.59 31.95 CARE-C-based-U13.10 (↓)60.9837.1253.1566.3859.6738.4432.4035.3147.93 CARE-C-based-E14.51 (↓)59.2638.7454.9967.0360.1442.3232.8032.4448.47 CARE-U14.41 (↓)67.8542.7559.4969.9764.2550.7036.6035.2253.35 CARE-E16.91 (↓)58.5939.2557.2668.9962.9848.4635.0033.1150.45 ASVD39.75 (↓)51.1435.9250.2865.6755.4129.2832.4031.29 43.92 SVD-LLM V214.34 (↓)67.7642.4158.5268.9962.9048.8235.6034.64 52.45 512 Palu(SVD)33.97 (↓)47.6435.5850.4465.1852.6427.8532.8030.24 42.80 MHA2MLA100.99 (↓)41.0827.0537.9759.1954.0629.1427.2029.47 38.15 CARE-C-based-U10.34 (↓)79.9252.7364.7974.9267.8066.2939.2038.9560.58 CARE-C-based-E11.15 (↓)71.8446.9360.8371.9867.1761.8937.2037.8956.97 CARE-U11.21 (↓)78.0354.6170.0376.7768.5967.9240.4040.0062.04 CARE-E12.93 (↓)68.3547.5365.1272.7466.5462.6237.0037.8057.21 ASVD14.29 (↓)67.1748.2168.1673.9962.2757.8135.8037.51 56.36 SVD-LLM V210.98 (↓)77.9954.5269.4176.1269.2267.3339.6040.00 61.77 I.1.13QWEN3-30B-A3B-INSTRUCT-2507-ALPACA Table I.13: Detailed zero-shot comparison for Qwen3-30B-A3B-Instruct-2507 with Alpaca calibration. Higher is better for Accuracy (%) (ACC.) (↑) and lower is better for Perplexity (PPL.) (↓). 
Rank | Methods | PPL (↓) | ARC-E | ARC-C | HellaSwag | PIQA | WG | MMLU | OBQA | RA | AVG
128 | Palu(SVD) | 17930.54 | 25.88 | 26.28 | 26.22 | 50.98 | 48.22 | 24.29 | 25.60 | 22.01 | 31.19
128 | MHA2MLA | 1929.49 | 29.46 | 24.74 | 26.57 | 51.58 | 48.62 | 22.85 | 27.20 | 21.34 | 31.55
128 | CARE-C-based-U | 29.00 | 58.75 | 38.74 | 58.02 | 71.65 | 55.33 | 48.40 | 36.00 | 33.78 | 50.08
128 | CARE-C-based-E | 36.65 | 50.00 | 33.36 | 56.32 | 69.86 | 54.46 | 43.01 | 33.00 | 31.48 | 46.44
128 | CARE-U | 39.87 | 53.75 | 35.75 | 54.22 | 67.36 | 53.91 | 42.84 | 30.80 | 30.43 | 46.13
128 | CARE-E | 59.06 | 43.56 | 31.91 | 50.48 | 64.96 | 53.04 | 27.55 | 30.60 | 29.47 | 41.45
128 | ASVD | 2023.42 | 27.48 | 23.98 | 28.78 | 52.88 | 49.88 | 23.54 | 24.60 | 23.44 | 31.82
128 | SVD-LLM V2 | 34.91 | 55.30 | 36.60 | 55.38 | 67.68 | 54.22 | 43.58 | 31.80 | 31.20 | 46.97
256 | Palu(SVD) | 6193.83 | 27.90 | 24.91 | 27.54 | 53.16 | 49.33 | 24.01 | 25.00 | 22.01 | 31.73
256 | MHA2MLA | 201.71 | 34.09 | 27.65 | 33.20 | 54.79 | 51.93 | 23.37 | 27.20 | 25.84 | 34.76
256 | CARE-C-based-U | 9.45 | 78.24 | 54.10 | 73.56 | 78.29 | 68.82 | 72.95 | 39.60 | 38.18 | 62.97
256 | CARE-C-based-E | 14.34 | 58.63 | 40.70 | 68.08 | 75.03 | 65.67 | 66.94 | 37.40 | 35.12 | 55.95
256 | CARE-U | 10.02 | 74.54 | 54.52 | 74.44 | 77.15 | 66.61 | 71.64 | 40.00 | 39.81 | 62.34
256 | CARE-E | 16.26 | 56.82 | 41.72 | 69.36 | 73.01 | 63.22 | 69.54 | 37.60 | 34.64 | 55.74
256 | ASVD | 257.93 | 36.32 | 26.71 | 35.80 | 54.46 | 50.43 | 23.53 | 28.60 | 25.36 | 35.15
256 | SVD-LLM V2 | 9.58 | 76.47 | 55.12 | 74.45 | 77.64 | 68.19 | 72.26 | 41.00 | 39.81 | 63.12

Figure I.1: Needle-in-a-Haystack retrieval heatmaps for Llama-3.1-8B-Instruct under matched KV budgets. The figure assembles five panel-wise heatmaps comparing CARE variants with different calibration settings against Palu(SVD) across context lengths and needle depths. Greener cells indicate stronger retrieval accuracy.

I.2 LONG-CONTEXT RETRIEVAL: NEEDLE-IN-A-HAYSTACK (NIH)

We evaluate long-context behavior with the Needle-in-a-Haystack (NiH) task, measuring retrieval accuracy across context lengths L ∈ {1K, 4K, 8K, 16K, 24K, 32K} and multiple document depths (needle positions). We compare the uncompressed teacher with CARE and Palu(SVD) under identical KV budgets and training settings.

CARE stays close to the uncompressed teacher across 1K–32K contexts and document depths, while uniform-rank and SVD-style baselines degrade, most sharply at 18K–24K. Covariance-aware allocation (especially preserving early-layer rank) and longer calibration sequences (e.g., Alpaca-2048 vs. Alpaca-256) further improve stability, and Palu(SVD) shows the largest drop. The panel-wise heatmaps in Fig. I.1 provide the full retrieval breakdown.

I.3 GENERATION EXAMPLES

Fig. I.2 and Fig. I.3 show text generated for the same prompt by the compared methods.

Question: What is the capital city of France? A. Paris B. New York C. London D. Rome

Zero-shot CARE-Rank256
2. What is the capital city of the United States? A. New York B. Washington D. New York C. Washington D. New York 3. What is the What is the capital city of the United States? A. New York B. Washington D. New York C. Washington D. New York 4. What is the What is the capital city of the United States? A. New York B. Washington D. New York C. Washington D. New York...

Zero-shot TransMLA-Rank 256
I had a very good (I had just like a city, and a small, and a small, and a small, and a small, and a small, and a city of the location, and the 5 3F (I had a very good (I had just like a city, and a small, and a small, and a small, and a small, and a small, and a city of the location, and the 5 3F (had a very good (I had just like a city, and a small, and a small, and a small, and a small, and a small, and a city of the location, and the 5 3F...

Zero-shot Palu(SVD)-Rank 256

Figure I.2: Generated Text Example

Question: What is the capital city of France?
Give me some details

Zero-shot CARE-Rank512
We were in Paris, France. What famous landmark do you like in France? Why are they so popular? Eiffel tower in Paris. I remember when I was travelling with many others of students in a bus to the top of the hill. We were all very worried when the bus suddenly stopped and we were not being able to continue the journey. Finally we came to the top of the hill and see the great Eiffel tower. It was very interesting and I liked the view of the whole city. I have never seen anything like that and it was really unique. What are the cultural activities of the French people? What are some of the cultural traditions that make the country unique? It is hard to explain because we don't have one unique culture. Each part of the country has a different style of food and dress and have different traditions. The cultural activities are not only in one category and have many types. What would you want to make a speech to the President of the United States about the problems in the U.S. and how to solve them? We have to fix the education problems. We need to have a better education for the children in order to teach them what is necessary for a good knowledge. I can say that I think that it is the main issue because if the children are educated good, they will be able to create a better society in the future. Also, more people who are in the same society are going to learn from the children that are being educated. If you were to study a master's degree what would you like to be when you are older? I think it is important to have a good job. You should be able to create your own job and have a strong relationship with the employer. I know that I could do a job in the future with my strong management and intelligence...

Zero-shot TransMLA-Rank 512
mbanotjuicyclif.pim0mtheh0r0Morgon0esutpmmcoRpmpe0m0.m0rg1l0.org0m00 0j9iPombo00Thej0rg1pv0m0z0pg0pS0s0a0p0mPint0mh0M0pgzPn2hPint0hy0f7 figRFonw0marchF0nnLynnc0wPnmePentisRie0m0mper0FbmP0Oz0m0mpe1pm 0Or5pMOP0n0 onF0plmrZ1z0Uof0uzPim0nandPer0rep0p0sL0mgRedOrMar0Aper0prrPnOof Pgprv0mOfr0h0ry0rg00000mPer0P0A2pMohans0mPul0s0M0rQdrivemod0rsj 0tH0myrPerf0mphQ5...

Zero-shot Palu(SVD)-Rank 512

Figure I.3: Generated Text Example

I.4 ADDITIONAL RANK PROFILES ACROSS CALIBRATION CORPORA

We present covariance-aware rank profiles for Llama-3.1-70B-Instruct, Qwen3-30B-A3B-Instruct-2507, Qwen3-4B-Instruct-2507, and Qwen2.5-1.5B-Instruct. Across all models and the calibration corpora considered (Alpaca, WikiText2, PTB, C4), the depth-dependent pattern observed in the main text for Llama-3.1-8B-Instruct persists: ranks are smaller in early layers, grow through the middle blocks, and stay elevated in deeper layers, with stronger growth for $W_V$ than for $W_K$. This consistency across model families, scales (1.5B–70B), and architectures (dense and MoE) suggests that the structure is a general property of pretrained attention. The same qualitative trend is also visible on the smaller Qwen2.5-1.5B-Instruct model, although its feasible target-rank range is narrower.

[Figure: Qwen/Qwen3-30B-A3B-Instruct-2507, layer-wise dynamic rank allocation on Alpaca; per-layer K and V ranks with target/min/max markers at target ranks 128 and 256. Plot data omitted.]

Figure I.4: Covariance-aware rank profiles for Qwen3-30B-A3B-Instruct-2507 (MoE, 30B total / 3B active) under Alpaca calibration at target ranks 128, 256.
Despite the mixture-of-experts architecture, the same depth-dependent pattern persists, confirming that the trend generalizes beyond dense models.

[Figure: meta-llama/Llama-3.1-70B-Instruct, layer-wise dynamic rank allocation on Alpaca; per-layer K and V ranks with target/min/max markers at target ranks 128 and 256. Plot data omitted.]

Figure I.5: Covariance-aware rank profiles for Llama-3.1-70B-Instruct under Alpaca calibration at target ranks 128, 256. The same depth-dependent pattern observed in the 8B variant holds at 70B scale, with $W_V$ exhibiting stronger late-layer growth than $W_K$.

[Figure: Qwen/Qwen2.5-1.5B-Instruct, layer-wise dynamic rank allocation on Alpaca, WikiText2, PTB, and C4; per-layer K and V ranks with target/min/max markers at target ranks 64 and 128. Plot data omitted.]

Figure I.6: Covariance-aware rank profiles for Qwen2.5-1.5B-Instruct across Alpaca, WikiText2, PTB, and C4 calibration corpora at target ranks 64 and 128. Despite the smaller rank budget, the same depth-dependent increase remains visible, again with stronger late-layer growth for $W_V$.

[Figure: Qwen/Qwen3-4B-Instruct-2507, layer-wise dynamic rank allocation on Alpaca, WikiText2, PTB, and C4; per-layer K and V ranks with target/min/max markers at target ranks 64, 128, 256, and 512. Plot data omitted.]

Figure I.7: Covariance-aware rank profiles for Qwen3-4B-Instruct-2507 across Alpaca, WikiText2, PTB, and C4 calibration corpora at target ranks 64, 128, 256, and 512. Both $W_K$ and $W_V$ show a depth-dependent increase, with stronger late-layer growth for $W_V$.
I.5 DISTRIBUTION SHIFT

We evaluate cross-domain calibration by estimating covariance on a source corpus and reporting accuracy changes on out-of-domain tasks. Tab. I.14 shows that task-related calibration corpora such as ARE and ARC give the best average accuracy, but the gains are mostly local rather than universal. In contrast, narrower language-modeling corpora such as WikiText2 and PTB transfer less well, suggesting that broader or task-relevant calibration data is preferable for robust one-shot performance.

Table I.14: One-shot Llama3.1-8B comparison across calibration covariances. Higher is better for Accuracy (ACC.).

Rank | Methods | ARC (↑) | ARE (↑) | HellaSwag (↑) | PIQA (↑) | MMLU (↑) | OBQA (↑) | RA (↑) | WG (↑) | AVG (↑)
256 | CARE-Alpaca | 34.81 | 65.40 | 40.76 | 72.47 | 49.32 | 21.20 | 35.98 | 62.98 | 47.87
256 | CARE-ARE | 38.82 | 72.90 | 41.25 | 73.01 | 53.13 | 25.80 | 31.87 | 63.69 | 50.06
256 | CARE-ARC | 39.59 | 71.84 | 41.33 | 72.91 | 52.84 | 26.60 | 32.54 | 62.67 | 50.04
256 | CARE-WikiText2 | 28.33 | 57.11 | 38.07 | 68.28 | 37.72 | 19.60 | 33.49 | 62.59 | 43.15
256 | CARE-PTB | 27.99 | 53.91 | 37.70 | 67.68 | 42.15 | 17.00 | 31.96 | 62.59 | 42.62
256 | CARE-C4 | 31.48 | 59.51 | 43.00 | 73.12 | 41.69 | 20.00 | 35.89 | 63.30 | 46.00
256 | CARE-MMLU | 34.13 | 64.94 | 40.65 | 70.73 | 55.27 | 21.20 | 33.97 | 64.40 | 48.16

I.6 SHRINKAGE COEFFICIENT

We sweep the shrinkage coefficient $\alpha$ in $C_\alpha = (1-\alpha)C + \alpha I$ to choose a default regularization level; Tab. I.15 reports the full ablation. Across Llama-3.1-8B-Instruct and Qwen3-4B-Instruct at target ranks 256 and 512, using the same Alpaca-256-32 calibration setup, one-shot average accuracy is highly stable for $\alpha \in \{10^{-3}, 10^{-2}, 10^{-1}\}$: the variation is below 1 point on Llama and below 3 points in all tested settings, with $\alpha = 0.01$ consistently near the best operating point. This indicates that CARE is not sensitive to the exact shrinkage magnitude.

Table I.15: Shrinkage-coefficient ablation for CARE under the Alpaca-256-32 calibration setup. We report one-shot accuracy across Llama-3.1-8B-Instruct and Qwen3-4B-Instruct at target ranks 256 and 512. Higher is better for Accuracy (ACC.).

Model | Rank | α | ARC (↑) | ARE (↑) | HellaSwag (↑) | PIQA (↑) | MMLU (↑) | OBQA (↑) | RA (↑) | WG (↑) | AVG (↑)
Llama-3.1-8B-Instruct | 256 | 10⁻³ | 34.90 | 62.42 | 42.44 | 72.31 | 57.15 | 21.00 | 34.74 | 64.48 | 48.68
Llama-3.1-8B-Instruct | 256 | 10⁻² | 34.90 | 62.46 | 42.52 | 72.14 | 57.20 | 21.60 | 35.60 | 65.35 | 48.97
Llama-3.1-8B-Instruct | 256 | 10⁻¹ | 32.94 | 59.34 | 41.55 | 70.46 | 56.88 | 20.20 | 35.31 | 65.75 | 47.80
Llama-3.1-8B-Instruct | 512 | 10⁻³ | 46.16 | 76.30 | 53.64 | 77.69 | 65.54 | 29.40 | 41.44 | 72.85 | 57.88
Llama-3.1-8B-Instruct | 512 | 10⁻² | 51.79 | 80.72 | 51.70 | 74.54 | 70.13 | 30.20 | 39.43 | 68.27 | 58.35
Llama-3.1-8B-Instruct | 512 | 10⁻¹ | 47.27 | 77.10 | 54.40 | 77.80 | 66.15 | 28.40 | 42.58 | 74.35 | 58.51
Qwen3-4B-Instruct | 256 | 10⁻³ | 43.79 | 70.23 | 40.01 | 70.72 | 58.13 | 24.00 | 32.06 | 58.25 | 49.65
Qwen3-4B-Instruct | 256 | 10⁻² | 43.09 | 72.22 | 43.71 | 70.18 | 57.10 | 26.00 | 35.02 | 62.51 | 51.23
Qwen3-4B-Instruct | 256 | 10⁻¹ | 44.64 | 73.87 | 41.62 | 68.28 | 52.95 | 23.80 | 34.24 | 61.48 | 50.11
Qwen3-4B-Instruct | 512 | 10⁻³ | 55.31 | 74.16 | 47.27 | 71.55 | 65.87 | 28.20 | 36.65 | 65.04 | 55.51
Qwen3-4B-Instruct | 512 | 10⁻² | 51.79 | 80.72 | 51.70 | 74.54 | 70.13 | 30.20 | 39.43 | 68.27 | 58.35
Qwen3-4B-Instruct | 512 | 10⁻¹ | 55.90 | 79.85 | 50.60 | 73.07 | 68.20 | 27.60 | 37.13 | 65.98 | 57.29

This robustness is expected: shrinkage mainly regularizes the low-energy tail of the covariance spectrum to improve numerical stability, while CARE's rank allocation is governed by the dominant eigendirections, which remain largely unchanged. We therefore use $\alpha = 0.01$ as the default setting throughout the paper unless otherwise specified.

I.7 MIX OF COVARIANCE

Our Adjusted-Rank allocator is governed by the spectrum of $\sqrt{C}\,\widetilde{W}$ (Sec. 3.2), so it naturally extends to mixed calibration distributions. We form $C_{\mathrm{mix}} = \sum_{i=1}^{M} \pi_i C_i$ with the same shrinkage toward $I$, yielding a simple multi-objective covariance fusion related in spirit to covariance-aware adaptations such as CorDA (Yang et al., 2024b). In practice, we use weighted calibration mixing by sampling from a task-weighted data mixture.
Tab. I.16 keeps all other settings fixed (CARE, Rank 256) and mixes Alpaca with ARC-Challenge: increasing the ARC-Challenge weight improves ARC, while a balanced 0.5/0.5 mix gives the best average accuracy.

Table I.16: Task-weighted covariance mixing for multi-objective rank allocation (Llama3.1-8B-Instruct, CARE, Rank 256). A denotes Alpaca and ARC denotes ARC-Challenge; mixes indicate the sampling proportions used to estimate $C_{\mathrm{mix}}$. Higher is better for Accuracy (ACC.).

Rank | Methods | ARC (↑) | ARE (↑) | HellaSwag (↑) | PIQA (↑) | MMLU (↑) | OBQA (↑) | RA (↑) | WG (↑) | AVG (↑)
256 | CARE-Alpaca | 34.81 | 65.40 | 40.76 | 72.47 | 49.32 | 21.20 | 35.98 | 62.98 | 47.87
256 | CARE-0.5A+0.5ARC | 36.09 | 67.72 | 43.22 | 73.34 | 58.70 | 24.40 | 34.26 | 63.38 | 50.14
256 | CARE-0.2A+0.8ARC | 34.73 | 65.03 | 42.99 | 72.25 | 58.02 | 23.20 | 35.41 | 64.96 | 49.57
256 | CARE-ARC | 39.59 | 71.84 | 41.33 | 72.91 | 52.84 | 26.60 | 32.54 | 62.67 | 50.04

Notably, the 0.2A+0.8ARC mix actually degrades ARC-Challenge accuracy (34.73) relative to pure Alpaca (34.81), despite devoting 80% of the calibration mass to ARC data. This suggests that over-concentrating the covariance on a narrow task distribution can be counterproductive: the heavily skewed $C_{\mathrm{mix}}$ collapses the effective spectrum onto a few ARC-specific directions, starving mid-spectrum components that still contribute to ARC reasoning through shared linguistic features. By contrast, the balanced 0.5A+0.5ARC mix retains broader spectral coverage from Alpaca while incorporating enough ARC signal to steer rank allocation, yielding the best average accuracy (50.14) and a meaningful ARC gain (+1.28 over Alpaca). This points to a practical guideline: mixing a general-purpose corpus with a moderate amount of task-specific data is preferable to heavy task overweighting when estimating the calibration covariance.

I.8 $\sqrt{C}$ VS. $C$ UNDER LOW-RANK BUDGETS

Throughout the paper, CARE uses $\sqrt{C}$ as the default covariance weighting; the variant that replaces $\sqrt{C}$ with $C$ is denoted CARE-C-based in the detailed tables (Sec. I.1). Concretely, CARE-U and CARE-E correspond to $\sqrt{C}$-weighted uniform and energy-aware allocation, while CARE-C-based-U and CARE-C-based-E use $C$ directly.

Table I.17: Comparison of $C$ versus $\sqrt{C}$ covariance weighting under Alpaca calibration. Values are zero-shot AVG accuracy (%).

Model | Rank | CARE-U (√C) | CARE-C-based-U (C) | CARE-E (√C) | CARE-C-based-E (C)
Llama-3.1-8B-Instruct | 64 | 31.89 | 32.38 | 32.37 | 32.71
Llama-3.1-8B-Instruct | 128 | 36.72 | 37.11 | 38.59 | 38.56
Llama-3.1-8B-Instruct | 256 | 50.00 | 48.79 | 52.62 | 51.41
Llama-3.1-8B-Instruct | 512 | 62.33 | 61.69 | 57.82 | 59.00
Qwen2.5-1.5B-Instruct | 64 | 31.47 | 31.27 | 32.10 | 32.34
Qwen2.5-1.5B-Instruct | 96 | 33.83 | 34.57 | 42.19 | 43.03
Qwen2.5-1.5B-Instruct | 128 | 44.34 | 44.04 | 45.86 | 46.17
Qwen3-4B-Instruct-2507 | 64 | 32.32 | 32.20 | 32.46 | 31.97
Qwen3-4B-Instruct-2507 | 128 | 37.55 | 35.59 | 40.57 | 37.53
Qwen3-4B-Instruct-2507 | 256 | 54.33 | 49.61 | 51.23 | 49.64
Qwen3-4B-Instruct-2507 | 512 | 61.62 | 60.44 | 57.31 | 55.97

Tab. I.17 summarises the zero-shot AVG accuracy under Alpaca calibration. The full per-benchmark breakdowns appear in the detailed tables of Sec. I.1 (Tables I.1–I.12).

When does $C$ help? Using $C$ instead of $\sqrt{C}$ squares the eigenvalues of the covariance weighting, amplifying the contribution of dominant eigendirections and suppressing the spectral tail more aggressively. At very low rank budgets (e.g., rank 64–96), there are few latent dimensions to allocate, and this sharper concentration can be beneficial: it forces the decomposition to focus on the handful of directions that carry most of the activation energy, reducing the most impactful reconstruction errors first. This effect is most visible on the smaller Qwen2.5-1.5B-Instruct, where CARE-E with $C$ weighting outperforms $\sqrt{C}$ at ranks 64, 96, and 128, and on Llama-3.1-8B-Instruct at rank 64.
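The eigenvalue-squaring effect can be checked directly: for the same covariance, weighting a matrix by $C$ rather than $\sqrt{C}$ concentrates more of the weighted spectrum's energy in the top directions. A small sketch (ours; the power-law eigenspectrum is a synthetic assumption, not calibration data):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256
# Synthetic PSD covariance with a power-law eigenspectrum (assumption).
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
evals = 1.0 / np.arange(1, D + 1) ** 1.5
C = (Q * evals) @ Q.T
sqrtC = (Q * np.sqrt(evals)) @ Q.T
W = rng.standard_normal((D, D))

def top_energy_fraction(M, k=16):
    """Fraction of squared singular-value energy in the top-k directions."""
    s = np.linalg.svd(M, compute_uv=False)
    return (s[:k] ** 2).sum() / (s ** 2).sum()

print("sqrt(C) W:", top_energy_fraction(sqrtC @ W))  # gentler re-weighting
print("C W     :", top_energy_fraction(C @ W))       # sharper concentration
```

At tiny rank budgets the sharper concentration of $C$ can align the few available latent dimensions with the highest-energy directions, matching the low-rank results in Tab. I.17.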
Why $\sqrt{C}$ is the default. As rank grows, the allocator can afford to preserve mid-spectrum directions that $C$ discounts too heavily. On Llama-3.1-8B-Instruct at rank 256 and on Qwen3-4B-Instruct across all ranks, $\sqrt{C}$ consistently outperforms $C$, often by a substantial margin (e.g., +4.7 points for CARE-U on Qwen3-4B at rank 256). The gentler spectral re-weighting of $\sqrt{C}$ retains enough emphasis on dominant directions while preserving informative mid-range singular values, yielding a more robust default across model sizes and rank budgets.

I.9 SYSTEM EFFICIENCY ANALYSIS

We evaluate inference efficiency in terms of KV-cache footprint. Since KV memory scales linearly with sequence length, a 32K-context calculation reflects the practical regime. For $L = 32768$, $B = 1$, FP16, and 32 layers, the original GQA model caches 1024-dimensional keys and values, requiring 4294.97 MB. After full 100% MLA conversion, TransMLA + CARE(E) Init stores a 448-dimensional NoPE key latent and a 512-dimensional value latent, reducing the footprint to 2013.24 MB (a 53.13% reduction vs. GQA), as shown in Tab. I.18.

Table I.18: Theoretical KV-cache footprint at 32K context length ($L = 32768$, $B = 1$, FP16, 32 layers). Higher reduction indicates lower KV-cache memory.

Method | Cached State | Memory (MB) | Reduction vs. GQA
GQA (Original) | K: 1024, V: 1024 | 4294.97 | –
TransMLA + CARE(E) Init (Ours, 100% MLA Restore) | K_NoPE: 448, V_latent: 512 | 2013.24 | 53.13%
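The entries in Tab. I.18 follow from a direct byte count; a short sketch of the arithmetic (ours, reporting MB as 10⁶ bytes):

```python
def kv_cache_mb(width_per_token: int, seq_len=32768, batch=1,
                layers=32, bytes_per_elem=2):
    """Theoretical KV-cache size in MB (10^6 bytes) for a given per-token
    cached width (K dims + V dims per layer), FP16 by default."""
    return seq_len * batch * layers * width_per_token * bytes_per_elem / 1e6

gqa = kv_cache_mb(1024 + 1024)   # K:1024 + V:1024       -> ~4294.97 MB
mla = kv_cache_mb(448 + 512)     # K_NoPE:448 + V_lat:512 -> ~2013 MB
print(gqa, mla, 1 - mla / gqa)   # reduction = 1 - 960/2048 ~= 53.13%
```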