
Paper deep dive

Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression

Minjun Kim, Jaehyeon Choi, Hyunwoo Yang, Jongjin Kim, Jinho Song, U Kang

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 101

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/22/2026, 6:04:03 AM

Summary

The paper investigates the impact of compression order in joint model compression, specifically when combining techniques like pruning and quantization. The authors introduce the 'Progressive Intensity Hypothesis,' which posits that applying weaker compression methods before stronger ones yields superior model performance. They provide a formal problem definition, theoretical analysis using disjoint selectivity, and extensive empirical validation across language and vision models to support their hypothesis.

Entities (5)

Progressive Intensity Hypothesis · hypothesis · 100%
Pruning · compression-method · 99%
Quantization · compression-method · 99%
Joint Model Compression · technique · 98%
Compression Order Optimization · problem · 95%

Relation Signals (3)

Pruning is component of Joint Model Compression

confidence 98% · Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization.

Quantization is component of Joint Model Compression

confidence 98% · Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization.

Progressive Intensity Hypothesis suggests Joint Model Compression

confidence 95% · The Progressive Intensity Hypothesis posits that applying stronger compression methods at later stages generally yields better performance.

Cypher Suggestions (2)

Find all compression methods associated with joint model compression · confidence 90% · unvalidated

MATCH (m:Method)-[:USED_IN]->(j:Technique {name: 'Joint Model Compression'}) RETURN m.name

Retrieve the hypothesis and its supporting evidence · confidence 85% · unvalidated

MATCH (h:Hypothesis {name: 'Progressive Intensity Hypothesis'})-[:SUPPORTED_BY]->(e:Evidence) RETURN h.name, e.description

Abstract

Abstract: What happens when multiple compression methods are combined: does the order in which they are applied matter? Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization. A central but underexplored factor in joint model compression is the compression order, or the sequence of different methods within the compression pipeline. Most prior studies have sidestepped the issue by assuming orthogonality between techniques, while a few have examined them only in highly constrained cases. Consequently, the broader role of compression order in shaping model performance remains poorly understood. In this paper, we address the overlooked problem of compression order and provide both theoretical and empirical analysis. We formulate the problem of optimizing the compression order and introduce the Progressive Intensity Hypothesis, which states that weaker perturbations should precede stronger ones. We provide theoretical guarantees showing that the relative benefit of one order increases with the underlying performance gap. Extensive experiments on both language and vision models validate the hypothesis, and further show its generality to broader setups such as multi-stage compression and mixed-precision quantization.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

100,689 characters extracted from source content.


Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression Minjun Kim, Jaehyeon Choi, Hyunwoo Yang, Jongjin Kim, Jinho Song & U Kang Seoul National University, Seoul, South Korea minjun.kim,ukang@snu.ac.kr Corresponding Author. Abstract What happens when multiple compression methods are combined—does the order in which they are applied matter? Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization. A central but underexplored factor in joint model compression is the compression order, or the sequence of different methods within the compression pipeline. Most prior studies have sidestepped the issue by assuming orthogonality between techniques, while a few have examined them only in highly constrained cases. Consequently, the broader role of compression order in shaping model performance remains poorly understood. In this paper, we address the overlooked problem of compression order and provide both theoretical and empirical analysis. We formulate the problem of optimizing the compression order and introduce the Progressive Intensity Hypothesis, which states that weaker perturbations should precede stronger ones. We provide theoretical guarantees showing that the relative benefit of one order increases with the underlying performance gap. Extensive experiments on both language and vision models validate the hypothesis, and further show its generality to broader setups such as multi-stage compression and mixed-precision quantization. 1 Introduction When combining pruning and quantization, which order leads to better model performance? Although deep neural networks have achieved remarkable success across diverse domains, deploying them on edge devices remains challenging due to limited computational resources. 
To bridge this gap, network compression techniques (Deng et al., 2020; Liang et al., 2021; Zhu et al., 2024; Kim et al., 2025a) have been proposed, including pruning (Park et al., 2024; Song et al., 2024; Park et al., 2025b), quantization (Piao et al., 2022; Ashkboos et al., 2024b; Kim et al., 2025b), knowledge distillation (Kim et al., 2021b; Cho and Kang, 2022; Jeon et al., 2023), parameter sharing (Desai and Shrivastava, 2024; Wang et al., 2025a) and low-rank approximation (Jang et al., 2023; Li et al., 2025; Wang et al., 2025b). Recent studies highlight that combining these compression methods—known as joint model compression—achieves better trade-offs between compression ratio and model performance than applying them separately (Hawks et al., 2021; Wang et al., 2022; Shinde, 2024). A critical yet underexplored issue in joint model compression is the compression order—the sequence in which individual compression methods are applied to the target model. As most of these techniques are not simultaneously applicable and should be executed sequentially (Wang et al., 2020; Kuzmin et al., 2023), identifying an optimal order can yield a “free lunch” by improving performance without any additional computation. Empirical findings (Huang et al., 2019; Hu et al., 2021; Qu et al., 2025) show that the performance of the compressed model is sensitive to the compression order, necessitating a deeper understanding of when and why certain orders work better. However, the role of compression order has been largely overlooked by prior studies (Kurtic et al., 2022; Xiao et al., 2023; Liu et al., 2023). Most existing studies implicitly assume that compression order has no effect on the grounds of orthogonality, naïvely arguing that different techniques operate independently (Kim et al., 2021a; Chitty-Venkata et al., 2023; Song et al., 2024; Motetti et al., 2024). 
Only a few works have examined the problem, and most of them merely offer empirical evidence confined to specific settings (Wang et al., 2020; Wu et al., 2023; Yu et al., 2023). A notable attempt (Harma et al., 2025) presents a theoretical framework, proving the non-orthogonality of pruning and quantization and concluding that pruning followed by quantization is always preferable. However, the scope of that work remains narrow and less practical, focusing only on magnitude-based pruning and max-scaled quantization (see Appendix D.5). To date, no study has systematically investigated the tendencies of compression order in general settings, either empirically or theoretically. In this paper, we demonstrate that applying more aggressive compression algorithms at later stages yields superior performance. We first formulate the problem of joint compression order optimization (see Section 3.1 and Problem 1), and introduce the Progressive Intensity Hypothesis, which posits that ordering compression methods from weaker to stronger improves performance (see Hypothesis 1). Figure 1 offers a conceptual depiction of the proposed hypothesis. We validate our claim through both theoretical analysis and extensive experiments. Theoretically, we show that the advantage of the compression order grows monotonically with the performance gap between two methods under disjoint selectivity (see Theorem 1 and Definition 5). In other cases, we define interference as an additional error from mutual interaction and investigate its influence (see Definition 6). Experimentally, we validate the hypothesis across both language and vision models, covering diverse model architectures, tasks, and compression scenarios (see Sections 5.2 and 5.3). Our analysis also considers how factors such as weight-update strategies and rotations affect the role of compression order (see Figures 4 and 5).
Moreover, our results highlight that the hypothesis generalizes to broader paradigms, including multi-stage approaches and mixed-precision quantization (see Section 5.4). Our contributions are summarized as follows: • Formulation. We formally define the novel problem of optimizing the compression order in joint model compression (see Problem 1), and propose the Progressive Intensity Hypothesis, suggesting that stronger perturbations should be applied later to achieve better performance (see Hypothesis 1). • Theory. We provide a theoretical analysis that quantifies the relationship between method interaction and order sensitivity. Specifically, we prove that the superiority of one ordering grows monotonically with the performance gap between the two methods (see Theorem 1). • Experiments. Extensive and consistent experimental results across various domains, models, and tasks support our hypothesis (see Figures 3, 4, and 7). We further extend the problem to broader setups such as multi-stage compression and mixed-precision quantization (see Figures 7 and 10). To the best of our knowledge, we are the first to both theoretically and experimentally analyze the impact of compression order in joint model compression under general and practical settings. Reproducibility. All of our implementation and datasets are available at https://github.com/snudatalab/PQQP. The Progressive Intensity Hypothesis. Neural networks compressed by multiple methods perform better when weaker perturbations are applied first and stronger ones later. Figure 1: The Progressive Intensity Hypothesis: Given two compression techniques, we conjecture that compressed models perform better if the stronger method is applied after the weaker one. That said, the optimal order between pruning and quantization varies with their compression ratios. 2 Preliminaries and Related Works We briefly describe the preliminaries and related works on pruning, quantization, and joint model compression. 
The notations used throughout this paper are formally defined in Appendix A. Pruning and Quantization. Compression techniques (in the remainder of the paper, we use 'compression' to refer to 'model compression' for simplicity) aim to transform a pre-trained model φ into a more efficient version φ′ while minimizing performance degradation (Xu and McAuley, 2023; Dantas et al., 2024; Liu et al., 2025a). This process inevitably introduces an error term δ(·), representing the deviation between the outputs of φ′ and φ, which typically increases with the compression ratio C. We define the compression ratio C as the memory usage of φ divided by that of φ′. Among various compression techniques f(φ; C), our work centers on two major forms: pruning and quantization. Pruning P(·) directly discards less important components of a model to achieve the desired compression ratio while retaining its most critical parts (Nova et al., 2023; Ashkboos et al., 2024a; Park et al., 2024). Based on the level of granularity, pruning methods fall into three categories: structured pruning (Song et al., 2024) removes entire structural elements such as layers, filters, or attention heads; semi-structured pruning (Xu et al., 2024) enforces fixed sparsity patterns (e.g., 2:4 sparsity) across tensors; and unstructured pruning (Frantar and Alistarh, 2023) prunes weights in a fully flexible manner. In the case of structured pruning at the layer level, the induced error δ_P(W_i, X_i) equals −W_i X_i when pruning is applied to layer l_i with weight W_i and activation X_i, and 0 otherwise. The model achieves a compression ratio C_P = 1/(1 − p) by pruning a fraction p of weights. Quantization Q(·) reduces the bit precision used to represent weights and activations by encoding a high-bit network into a lower-bit format (Gholami et al., 2022).
Common quantization techniques include uniform (Li et al., 2021), non-uniform (Zhao and Yuan, 2025), binary coding (Park et al., 2025a), and vector quantization (VQ) (Tseng et al., 2024). Although some techniques such as VQ focus only on weight quantization without compressing activations, our main scope is on compressing both for practical acceleration. A main challenge for robust quantization is activation outliers (Xiao et al., 2023; Lee et al., 2024), but recent rotation-based methods (Lin et al., 2024; Liu et al., 2025b) have largely overcome it. The layer-wise error of quantization Q(·) for a layer l_i with weight W_i and activation X_i is computed as δ_Q(W_i, X_i) = Q(W_i)Q(X_i) − W_i X_i, with a compression ratio C_Q = B_orig / B_Q depending on the original bit-width B_orig and the target bit-width B_Q. Joint Model Compression. Joint compression combines two or more compression methods, achieving higher compression ratios while minimizing performance loss (Wang et al., 2020; Wu et al., 2023; Yu et al., 2023; Harma et al., 2025). These methods fall into two categories: co-designed and post-hoc frameworks. Although the former offers the benefit of integration-aware design, such methods tend to be method-specific and less adaptable to alternative configurations (Qu et al., 2025). In contrast, combining independently designed techniques allows for method-agnostic pipelines that adapt easily to diverse architectures. Several pruning works (Kurtic et al., 2022; Xiao et al., 2023; Song et al., 2024) empirically confirm that such combinations with quantization are both feasible and beneficial. As independently designed techniques are applied one after another, the order of compression plays a key role. However, the impact of compression order has not been adequately examined in the current literature. We denote applying f_1(·) before f_2(·) as f_1 → f_2 or (f_2 ∘ f_1)(·).
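To make the two layer-wise error terms and compression ratios concrete, here is a minimal numeric sketch using a max-scaled round-to-nearest quantizer and unstructured magnitude pruning. This is my own illustration, not the paper's code; the helper names are hypothetical.

```python
import numpy as np

def quantize_rtn(t, bits):
    """Symmetric round-to-nearest uniform quantization to the given bit-width."""
    scale = np.abs(t).max() / (2 ** (bits - 1) - 1)
    return np.round(t / scale) * scale

def prune_magnitude(w, p):
    """Unstructured magnitude pruning: zero out the fraction p of smallest weights."""
    k = int(p * w.size)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < thresh, 0.0, w)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # layer weight
X = rng.normal(size=(16, 4))   # layer activation

# Layer-wise quantization error: delta_Q = Q(W)Q(X) - WX, with C_Q = B_orig / B_Q.
B_orig, B_Q = 16, 4
delta_Q = quantize_rtn(W, B_Q) @ quantize_rtn(X, B_Q) - W @ X
C_Q = B_orig / B_Q             # 4x compression

# Pruning a fraction p of weights gives C_P = 1 / (1 - p).
p = 0.5
W_pruned = prune_magnitude(W, p)
C_P = 1 / (1 - p)              # 2x compression
```

The two ratios illustrate why a 4-bit quantized model and a 50%-pruned model occupy different points on the compression axis even before their errors are compared.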
3 Joint Compression Order Optimization 3.1 Problem Definition We are given a pre-trained model and multiple compression techniques, each associated with a specific compression ratio. The goal is to find the optimal order in which to sequentially apply these methods. An order is considered optimal if it minimizes the degradation in model performance. We quantify performance using a metric M(·), where higher values indicate better outcomes (e.g., classification accuracy or the negative of perplexity). We provide the formal definition as Problem 1. Problem 1 (Joint Compression Order Optimization). We have a pre-trained model φ, a set of compression methods F = {f_1(·), f_2(·), …, f_n(·)}, and a performance metric M(·). For the set Π = {π : F → F | π is bijective} of all permutations over F, the goal is to find the optimal permutation π* ∈ Π that maximizes the performance of the compressed model: π* = arg max_{π ∈ Π} M(π(φ)). 3.2 Characterizing Compression Attributes Two key attributes arise when characterizing compression in a general setting: granularity and intensity. Granularity refers to the smallest structural unit on which compression is applied, and intensity refers to how aggressively the method alters the model, measured by its impact on performance. Granularity of Compression. Compression methods are not applied to the model as a whole, but rather operate locally on its individual components. We define compression granularity as the atomic level at which compression is performed. To formalize this notion, we begin by abstracting the model into a set of component types, such as layers, sublayers, or attention heads. We refer to these as abstract types, which define the structural units over which compression may act. For two abstract types t_1 and t_2, we say t_1 is larger than t_2 if t_1 strictly contains t_2 as a structural unit.
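Since Problem 1 ranges over all n! orderings, a brute-force baseline is straightforward for small n. The toy "model", methods, and metric below are my own illustration of the search, not anything from the paper.

```python
from itertools import permutations

# Toy setup: the "model" is a list of floats, each "compression method"
# perturbs it, and metric(.) is a toy M(.) where higher is better.
def f1(model):  # mild perturbation: small uniform shrink
    return [w * 0.99 for w in model]

def f2(model):  # aggressive perturbation: coarse rounding
    return [round(w, 1) for w in model]

def metric(model, reference):
    # negative total deviation from the uncompressed reference
    return -sum(abs(a - b) for a, b in zip(model, reference))

def best_order(model, methods, metric):
    """pi* = argmax over all permutations pi of M(pi(model))."""
    def apply_order(order):
        out = model
        for f in order:
            out = f(out)
        return out
    return max(permutations(methods),
               key=lambda order: metric(apply_order(order), model))

model = [0.123, -0.456, 0.789]
order = best_order(model, [f1, f2], metric)  # winning application order
```

Exhaustive search is only viable for a handful of methods; the point of the paper's hypothesis is precisely to predict the winner without enumerating all n! pipelines.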
Among all types that are larger than both t_1 and t_2, we define the least upper type t_lut(t_1, t_2) as the smallest one. For a given model φ, let T_φ denote the set of abstract types; this set depends on the model architecture. Each compression method f(·) may be applicable only to a subset of abstract types. We denote this subset by T_f ⊆ T_φ, representing the structural levels at which f(·) can operate. For instance, layer-wise pruning in large language models is applicable only to units coarser than layers. Then, the granularity of f(·) is the smallest unit t_f ∈ T_f on which f(·) is applicable, as defined in Definition 1. Definition 1 (Compression Granularity). For a model φ with a set T_φ of abstract types and a compression method f(·), the compression granularity t_f := arg min_{t ∈ T_f} |t|, where T_f ⊆ T_φ denotes the set of abstract types on which f(·) operates, and |t| denotes the structural size of type t. Intensity of Compression. Compression methods affect the model differently even at identical compression ratios, so comparing their intensities directly is challenging. To assess compression strength, we introduce three concepts grounded in performance degradation: the performance gap G(f_1, f_2), the compression equivalent ratio C*_f, and the compression order advantage A(f_1 → f_2). Performance differences between two methods f_1(·; C_1) and f_2(·; C_2), each applied at its respective compression ratio C_1 and C_2, provide a direct measure of their relative intensity. We call this the performance gap G(φ, M; f_1(·; C_1), f_2(·; C_2)), or simply G(f_1, f_2), as defined in Definition 2. If G(f_1, f_2) > 0, we refer to f_2(·; C_2) as the stronger compression and f_1(·; C_1) as the weaker one. Definition 2 (Performance Gap).
Given a model φ, a performance metric M(·), and two compression methods f_1(·; C_1) and f_2(·; C_2), the performance gap between the two methods is G(φ, M; f_1(·; C_1), f_2(·; C_2)) := M(f_1(φ; C_1)) − M(f_2(φ; C_2)). Although G(·) offers a clear pairwise comparison, its values in metric units are difficult to interpret and may grow rapidly as the compression ratio increases. Alternatively, mapping methods onto a common scale allows for direct comparison at the level of compression ratios. While multiple choices exist for the baseline method, we select quantization as it exhibits the best performance across diverse models, thereby offering the widest range. Accordingly, we define the Compression Equivalent Ratio (CER) C*(f_1(·), Q, C), or simply C*_f1, which expresses the effect of method f_1(·; C) at ratio C as an equivalent ratio of quantization Q(·), as Definition 3. In other words, starting from a 16-bit model, a compression method f(φ; C) with C*_f = 2 achieves the same performance as 8-bit quantization. Note that the CER of quantization Q(·) is naturally equal to its own compression ratio (i.e., C*_Q = C_Q). We adopt a straightforward approach by computing CER through linear interpolation. For instance, f(·) achieving M(f; C) = 65% accuracy maps to C*_f = 3 when quantization Q(·) yields M(Q; C_Q = 2) = 70% and M(Q; C_Q = 4) = 60% accuracy, respectively. Definition 3 (Compression Equivalent Ratio). Given a model φ, a performance metric M(·), a compression method f(·), a quantization method Q(·), and a compression ratio C, the compression equivalent ratio C*(f(·), Q, C) := C′ such that M(Q(φ; C′)) = M(f(φ; C)). So far our discussion has been limited to single methods; when multiple methods are applied, how should intensity be defined?
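The linear-interpolation computation of CER in Definition 3 can be sketched as follows; the function name is my own, and the example reproduces the paper's 65%-accuracy case.

```python
def cer_linear(perf_f, q_points):
    """Compression Equivalent Ratio via linear interpolation.

    q_points: (C_Q, performance) pairs for the baseline quantizer, sorted
    by C_Q. Returns the C' such that M(Q(phi; C')) = perf_f.
    """
    for (c0, m0), (c1, m1) in zip(q_points, q_points[1:]):
        lo, hi = min(m0, m1), max(m0, m1)
        if lo <= perf_f <= hi:
            # interpolate linearly between the two bracketing points
            return c0 + (perf_f - m0) * (c1 - c0) / (m1 - m0)
    raise ValueError("performance outside the calibrated quantization range")

# Paper's example: f(.) reaches 65% accuracy; quantization gives 70% at
# C_Q = 2 and 60% at C_Q = 4, so the CER of f is 3.
cer = cer_linear(65.0, [(2, 70.0), (4, 60.0)])  # -> 3.0
```

A finer grid of calibrated quantization points would tighten the interpolation; the two-point case is just the worked example from the text.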
Our scope centers on measuring how intensity changes with compression order. Accordingly, we capture the gain from applying f_1(·) before f_2(·) over the reverse as the compression order advantage A(φ, M; f_1(·) → f_2(·)), or simply A(f_1 → f_2), as Definition 4. Definition 4 (Compression Order Advantage). Given a model φ, a performance metric M(·), and two compression methods f_1(·; C_1) and f_2(·; C_2), the compression order advantage A(φ, M; f_1(·; C_1) → f_2(·; C_2)) := G(f_1 → f_2, f_2 → f_1) = M((f_2 ∘ f_1)(φ)) − M((f_1 ∘ f_2)(φ)). 3.3 The Progressive Intensity Hypothesis Our goal is to uncover general patterns in how compression order affects model performance in joint compression scenarios. While prior works have focused primarily on isolated settings, we seek to establish a broadly applicable principle. To this end, we propose the Progressive Intensity Hypothesis, which posits that applying stronger compression methods at later stages generally yields better performance. We formalize this hypothesis for a pair of methods in Hypothesis 1, which serves as the main focus of our analysis; its extension to multiple methods is presented in Appendix B.3. Hypothesis 1 (The Progressive Intensity Hypothesis). Let f_1(·; C_1) and f_2(·; C_2) be two compression methods applied to a model φ. Then, the compression order advantage A(f_1 → f_2) grows monotonically with the performance gap G(f_1, f_2), or equivalently with the CER difference C*_f2 − C*_f1.
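Definition 4 can be evaluated directly; here is a minimal sketch on a toy weight vector (the toy methods and metric are my own, not the paper's).

```python
import numpy as np

def order_advantage(phi, f1, f2, metric):
    """A(f1 -> f2) = M((f2 o f1)(phi)) - M((f1 o f2)(phi)); a positive
    value means applying f1 first and f2 later is the better order."""
    return metric(f2(f1(phi))) - metric(f1(f2(phi)))

rng = np.random.default_rng(1)
phi = rng.normal(size=256)                    # toy "model": a weight vector
metric = lambda w: -np.sum((w - phi) ** 2)    # toy M(.): higher is better

prune = lambda w: np.where(np.abs(w) < 0.3, 0.0, w)  # zero small weights
quantize = lambda w: np.round(w * 2) / 2             # snap to a 0.5 grid

adv = order_advantage(phi, prune, quantize, metric)
```

Two structural properties follow immediately from the definition: A(f → f) = 0 for any method against itself, and A(f_1 → f_2) = −A(f_2 → f_1).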
As an example, if methods f_1(·) and f_2(·) yield M(f_1; C_1) = 75% and M(f_2; C_2) = 70% accuracy, respectively (i.e., G(f_1, f_2) = 5%p), the compression order advantage A(f_1 → f_2) is mild; replacing C_2 with C′_2 at M(f_2; C′_2) = 60% accuracy (i.e., G(f_1, f_2) = 15%p) results in a larger advantage. 4 Theoretical Analysis We theoretically analyze how compression order affects model performance. We introduce disjoint selectivity to isolate order-dependent units, and prove in Theorem 1 that only these units determine the performance gap. We then show in Theorem 2 that Hypothesis 1 holds due to the reduction of order-dependent units. We later extend to non-disjoint cases in which interference occurs. Consistent with earlier works (Harma et al., 2025), we investigate each unit, relying on Assumption 1. Assumption 1. Given a model φ with a set L of layers, a performance metric M(·), a compression method f(·), and the layer-wise reconstruction loss δ_f(l_i), assume the following: • Layer-wise independence. The reconstruction error at one layer does not affect the reconstruction error at another: ∀ l_i, l_j ∈ L, i ≠ j: ∂δ_f(l_i)/∂δ_f(l_j) = 0. • Error-performance trade-off. Model performance is inversely related to the total reconstruction error: there exists β > 0 such that M(φ) − M(f(φ)) = β · Σ_{l_i ∈ L} ‖δ_f(l_i)‖_F². Disjoint Selectivity. Sequential application of two compression methods leads to two distinct scenarios: either there exist units altered by both methods, or all units are exclusively assigned to one. We define the latter scenario as the case where disjoint selectivity holds, as in Definition 5. This means that while the assignment may vary with order, each unit is ultimately handled by only one method. Definition 5 (Disjoint Selectivity).
Given a model φ and two compression methods f_1(·) and f_2(·) with respective granularities t_f1 and t_f2, disjoint selectivity holds if ∀ u_i ∈ U(φ; t_lut(t_f1, t_f2)) and ∀ π ∈ {f_1 ∘ f_2, f_2 ∘ f_1}: D_ui^f1(π) + D_ui^f2(π) = 1, where U(φ; t) is the set of all units of model φ at granularity t, and D_u^f(π) denotes whether f(·) modifies unit u under the order π (i.e., 1 if modified, 0 otherwise). Under disjoint selectivity, the compression order advantage A(f_1 → f_2) is proportional to the cumulative sum of the error difference g(·) across units assigned differently depending on the order, as formulated in Theorem 1. The underlying intuition is that the performance gap arises solely from units whose assignment varies with the order; for others, the error remains invariant and thus cancels out. To illustrate, consider units u_1, u_2, and u_3 and compression methods f_1(·) and f_2(·). If u_1 is always handled by f_1(·) regardless of the order, while u_2 and u_3 are assigned differently depending on the order, then the advantage A(f_1 → f_2) is proportional to the error differences of units u_2 and u_3. Theorem 1 (Compression Order Advantage under Disjoint Selectivity). Suppose we compress a model φ with two compression methods f_1(·) and f_2(·) with respective granularities t_f1 and t_f2, where disjoint selectivity holds.
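Definition 5 reduces to a simple predicate over the modification indicators D_u^f(π). The encoding of assignments below is my own hypothetical data layout, chosen only to make the condition executable.

```python
def disjoint_selectivity_holds(assignments):
    """Definition 5 as a predicate.

    assignments maps each order pi (e.g. "f1->f2") to a dict
    {unit: (D_u^f1, D_u^f2)} of 0/1 modification indicators. Disjoint
    selectivity holds iff every unit, under every order, is modified by
    exactly one of the two methods.
    """
    return all(
        d1 + d2 == 1
        for per_unit in assignments.values()
        for (d1, d2) in per_unit.values()
    )

# Allowed: u2 and u3 swap methods across orders, but each unit is always
# handled by exactly one method.
ok = disjoint_selectivity_holds({
    "f1->f2": {"u1": (1, 0), "u2": (1, 0), "u3": (0, 1)},
    "f2->f1": {"u1": (1, 0), "u2": (0, 1), "u3": (1, 0)},
})

# Violation: u1 is touched by both methods under one order.
bad = disjoint_selectivity_holds({
    "f1->f2": {"u1": (1, 1)},
    "f2->f1": {"u1": (1, 0)},
})
```

Note that the first example is exactly the u_1/u_2/u_3 illustration from the text: the assignment may depend on the order, yet selectivity still holds.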
Then, under Assumption 1, the compression order advantage A(f_1 → f_2) = M((f_2 ∘ f_1)(φ)) − M((f_1 ∘ f_2)(φ)) equals β · (Σ_{u_i ∈ G_2} g(u_i) − Σ_{u_i ∈ G_1} g(u_i)), where β is the coefficient between the model performance and the total error from Assumption 1, g(u_i) = ‖δ_f1(u_i)‖_F² − ‖δ_f2(u_i)‖_F² is the error gap according to the method applied, and G_1 = {u | D_u^f1(f_2 ∘ f_1) = 1, D_u^f1(f_1 ∘ f_2) = 0} and G_2 = {u | D_u^f1(f_2 ∘ f_1) = 0, D_u^f1(f_1 ∘ f_2) = 1} are the groups of order-dependent units. Proof. Refer to Appendix B.1. ∎ Monotonicity. Under disjoint selectivity, we show that Hypothesis 1 holds when the two compression methods are well-designed, that is, minimally disruptive to the model. We examine this through a case study on pruning and quantization. We assume a favorable scenario where pruning is configured to induce minimal degradation, and quantization introduces symmetric, zero-mean errors centered at the original values. These assumptions are formalized in Assumption 2. Assumption 2. Given a model φ with a set L of layers and a performance metric M(·), assume that: • Well-designed pruning P(·). The pruning method is chosen from the set of pruning strategies that aim to preserve the model performance: P(·) ∈ ℙ(C_P), where ℙ(C_P) = {P_i(·) | C(P_i(φ)) = C_P, M(φ) − M(P_i(φ)) ≤ δ} denotes the set of pruning strategies that satisfy the target ratio C_P while keeping performance degradation within a small budget δ. • Well-designed quantization Q(·). For all layers, quantized outputs follow a symmetric distribution around the original values: ∀ l_i ∈ L, Q(W_i)Q(X_i) ∼ N(W_i X_i, σ_Q² I), where N(·) is the Gaussian distribution. The quantization error is negligible (i.e., ‖Q(W_i)Q(X_i) − W_i X_i‖ ≪ ‖W_i X_i‖).
Theorem 2 states that when disjoint selectivity holds and the compression methods are well-designed, A(Q → P) increases monotonically with the CER difference C*_P − C_Q for a fixed C_P. (By definition, quantization serves as the baseline scale, so its CER equals its own compression ratio, i.e., C*_Q = C_Q; therefore, C*_P − C_Q represents the CER difference between pruning and quantization.) Note that as C_Q decreases (i.e., M(Q(φ)) increases), both the CER difference C*_P − C_Q and the performance gap G(Q, P) increase monotonically. We show that A(Q → P) increases in this setting because the gap depends solely on order-dependent units under Theorem 1. We discuss the impact of C_P in Appendix D.6. Theorem 2 (Monotonicity). Suppose we compress a model φ with pruning P(·) and quantization Q(·), where disjoint selectivity holds. Then, under Assumptions 1 and 2, given a performance metric M(·) and two pairs of compression ratios (C_P1, C_Q1) and (C_P1, C_Q2), if the CER difference increases, i.e., C*_P1 − C_Q1 > C*_P1 − C_Q2, then the compression order advantage increases monotonically: A(φ, M; Q(·; C_Q1) → P(·; C_P1)) ≥ A(φ, M; Q(·; C_Q2) → P(·; C_P1)). Proof. Refer to Appendix B.2. ∎ Granularity and Interference. Disjoint selectivity does not always hold in practical joint compression settings for pruning and quantization. As pruning operates by fully discarding or keeping each unit, it always satisfies disjoint selectivity. In contrast, quantization satisfies this condition only when its granularity is finer than or equal to that of pruning. Figure 2 illustrates this: (a) if t_P ≥ t_Q, disjoint selectivity is preserved, as pruning removes entire quantization units; however, (b) if t_P < t_Q, pruning may partially eliminate a quantization unit, introducing regions where both methods interfere.
Figure 2: A case study of pruning P(·) and quantization Q(·) on model φ. (a) If the pruning granularity (green) is coarser than or equal to the quantization granularity (orange), disjoint selectivity holds. (b) Otherwise, partial removal of quantization units by pruning introduces extra error, termed interference Δ. In general joint compression of two methods f_1(·) and f_2(·), this violation of disjoint selectivity introduces additional error, which we define as the interference Δ(φ; f_1 → f_2), or simply Δ(f_1 → f_2), in Definition 6. Intuitively, interference quantifies how one method disturbs the behavior of the other. Definition 6 (Interference). Given a model φ and two methods f_1(·) and f_2(·), the interference Δ(φ; f_1 → f_2) := Σ_{u ∈ X} (δ_{f2 ∘ f1}(u) − δ_f2(u)), where X = U(φ; t_f2) ∩ {u | D_u^f2(f_2 ∘ f_1) = 1}, the set U(φ; t) contains all units of model φ at type t, D_u^f(π) indicates whether unit u is modified by f(·) under order π (1 if modified, 0 otherwise), and δ_f(u) denotes the error on u after applying f(·). Interference may or may not occur, depending on the techniques applied. A notable example is mixed-precision quantization, where treating each bit-width quantization as a distinct method satisfies disjoint selectivity, thereby avoiding interference. In our primary focus of pruning and quantization, the magnitude of interference is determined by the pruning ratio p. When interference occurs, quantization operates on weights altered by pruning, so the deviation from the original distribution scales with the pruning ratio. As the pruning ratio p increases, a larger portion of weights is removed before quantization, leading to stronger interference.
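A toy numeric reading of Definition 6 for the P → Q order can be sketched as below. This is my own illustration, not the paper's implementation: squared errors serve as scalar stand-ins for the error terms, and the quantizer and pruner are simple hypothetical helpers.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Max-scaled round-to-nearest quantizer (a simple stand-in for Q)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def prune_magnitude(w, p):
    """Zero out the fraction p of smallest-magnitude weights (stand-in for P)."""
    k = int(p * w.size)
    if k == 0:
        return w.copy()
    thr = np.sort(np.abs(w))[k]
    return np.where(np.abs(w) < thr, 0.0, w)

def interference(w, p, bits=4):
    """Delta(P -> Q) in the spirit of Definition 6: the change in
    quantization error when Q runs on pruned weights instead of the
    originals (squared errors stand in for the error terms)."""
    err_q = np.sum((quantize_rtn(w, bits) - w) ** 2)      # delta_Q
    wp = prune_magnitude(w, p)
    err_qp = np.sum((quantize_rtn(wp, bits) - wp) ** 2)   # delta_{Q o P}
    return err_qp - err_q

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
deltas = [interference(w, p) for p in (0.0, 0.3, 0.6)]  # Delta at rising p
```

By construction the interference vanishes at p = 0, matching the intuition that interference is driven entirely by how much pruning reshapes the weights the quantizer sees.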
This additional error depends only on pruning and enters additively into the order advantage; it thus remains independent of quantization strength while preserving the monotonic trend. Consequently, while exact outcomes may vary, A(f_1 → f_2) remains a monotonic function of C*_P − C_Q even under interference. In conclusion, Hypothesis 1 holds under both disjoint and interfering scenarios, highlighting its general validity. 5 Experimental Findings (a) LLaMA 2 7B (b) LLaMA 2 13B (c) LLaMA 3 8B Figure 3: Across diverse language models, the compression order advantage A(Q → P) increases monotonically with the CER difference C*_P − C_Q. See Section 5.2 for details. We empirically validate our hypothesis in joint compression scenarios, starting with pruning and quantization on language and vision models. We then extend to general pipelines beyond them. 5.1 Experimental Setup We briefly introduce the experimental setup; further details appear in Appendix C. Setup. For language models, we focus on decoder-only LLMs, mainly LLaMA (Touvron et al., 2023) models. The main metric is the negative of perplexity on the WikiText-2 (Merity et al., 2017) dataset; results on commonsense reasoning tasks appear in Appendix D.3. For vision models, we evaluate the classification accuracy of ResNet-18 (He et al., 2016) (CNNs) and DeiT-Base (Touvron et al., 2021) (ViTs) on the ImageNet (Deng et al., 2009) dataset. Baselines. We evaluate three pruning methods (SparseGPT (Frantar and Alistarh, 2023), Wanda (Sun et al., 2024), and SLEB (Song et al., 2024)) and four quantization methods (RTN (Gupta et al., 2015), OPTQ (Frantar et al., 2023), QuaRot (Ashkboos et al., 2024b), and QuaRot + OPTQ) for language models. For vision models, we apply PRACTISE (Wang and Wu, 2023) and N2UQ (Liu et al., 2022) for CNNs, and adopt SAViT (Chuanyang et al., 2022) and RepQ-ViT (Li et al., 2023) for ViTs. 5.2 Analysis on Language Models We verify whether Hypothesis 1 holds for language models.
Then, we analyze the effect of weight updates and rotations in quantization, and investigate the impact of pruning granularity on interference.

Compression Order Advantage by CER Differences. We analyze how the compression order advantage $A(Q \to P)$ varies with CERs for SparseGPT ($P(\cdot)$) and QuaRot ($Q(\cdot)$) across three models: LLaMA 2 7B, LLaMA 2 13B, and LLaMA 3 8B. Figure 3 confirms that $A(Q \to P)$ increases monotonically with the CER difference for all three models. Each point in the figure corresponds to a compression ratio pair $(C_P, C_Q)$, defined by pruning ratio $p \in \{0.05, 0.1, 0.15, 0.2, 0.25, 0.3\}$ and quantization bit-width $B_Q \in \{4, 5, 6, 7, 8\}$. We fit an exponential curve per pruning ratio $p$, reflecting the underlying trend. In this setting, the x-axis $C_P^* - C_Q$ captures the difference between the intrinsic intensities of pruning and quantization, with both $C_P^*$ and $C_Q$ calibrated on the original model $\phi$. Consistent trends across diverse language model architectures and scales support the validity of Hypothesis 1; see Appendices D.2 and D.1 for results on encoder-based and other decoder-only models, respectively.

Finding 1. The Progressive Intensity Hypothesis holds across diverse language models of varying scales and architectures, showing that stronger compression should be applied later.

Weight Updates and Rotation-based Transformations. We investigate the hypothesis under practical techniques such as weight updates and rotations. Figure 4 shows that Hypothesis 1 consistently holds across diverse combinations of methods. Our framework is agnostic to the type of compression method; weight updates and rotations reduce quantization error, thereby increasing $C_P^*$.

Figure 4: Compression order advantage $A(Q \to P)$ against CER difference $C_P^* - C_Q$ for three pruning ($P(\cdot)$) and four quantization ($Q(\cdot)$) methods on a LLaMA 3 8B model.
Our hypothesis consistently holds for language models regardless of pruning granularity, rotation, and weight updates.

Finding 2. The hypothesis generalizes beyond specific design choices of pruning and quantization, remaining robust under weight-update and rotation methods.

Figure 5: Rotation impact on pruning. See Section 5.2 for details.

An intriguing phenomenon arises when pruning rotation-based methods: applying pruning after rotation results in a drastic performance drop compared to pruning without rotation. Figure 5 illustrates the perplexity changes of the LLaMA 3 8B model pruned by SparseGPT, depending on the Hadamard rotation from QuaRot. Increasing the pruning ratio amplifies the discrepancy between rotated and non-rotated settings, as pruning is applied without accounting for rotation. This effect emerges because rotation may introduce two types of errors: matrix-wise errors from residual components and element-wise errors from altered pruning decisions. We further discuss the details in Appendix D.4 and Table 4. As rotation may intensify pruning error, it is essential to design pruning approaches compatible with rotation-based quantization, the emerging de facto standard.

Finding 3. Rotation amplifies pruning effects, underscoring the necessity of designing rotation-aware pruning methods.

Table 1: $A(Q \to P)$ by quantization ratio $C_Q$.

  $C_Q$ ($B_Q$)   SparseGPT   SLEB
  1.78 (9)        0.002       0
  2.00 (8)        0.001       0
  2.28 (7)        -0.003      0
  2.68 (6)        -0.013      0
  3.20 (5)        -0.017      -0.057
  4.00 (4)        -49.899     -9.379

Pruning Granularity and Interference. We verify the presence of interference by comparing results across different pruning granularities. Table 1 reports $A(Q \to P)$ across different quantization bit-widths $B_Q$ when applying two 5% pruning methods with QuaRot. For SLEB, which applies structured pruning at the sublayer level, there exists a regime where no layers differ in their pruning status across orders, leading to an exact $A(Q \to P)$ of zero (i.e., no interference).
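The exact zeros for SLEB in Table 1 have a simple mechanical reading: when pruning removes whole rows and quantization is calibrated at a granularity no coarser than a row, and the set of dropped rows is the same under both orders, the two orders commute exactly. A minimal sketch under these assumptions (row dropping and per-row round-to-nearest are stand-ins, not SLEB or QuaRot themselves):

```python
import numpy as np

def drop_rows(w, rows):
    """Structured pruning: zero out entire rows (coarse granularity)."""
    out = w.copy()
    out[list(rows)] = 0.0
    return out

def rtn_per_row(w, bits=8):
    """Per-row round-to-nearest quantization (granularity within a row)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # all-zero (pruned) rows stay zero
    return np.round(w / scale) * scale

rng = np.random.default_rng(2)
w = rng.standard_normal((8, 16))
rows = {2, 5}  # assume both orders drop the same rows
quant_then_prune = drop_rows(rtn_per_row(w), rows)
prune_then_quant = rtn_per_row(drop_rows(w, rows))
# Surviving rows keep their per-row scales either way, so the two
# orders coincide exactly: zero order advantage and zero interference.
```

Since row pruning never touches a surviving row's calibration statistics, the quantized values of surviving rows are bitwise identical under both orders, which is precisely the disjoint-selectivity condition of Figure 2(a).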
By contrast, SparseGPT, as an unstructured pruning method, exhibits interference in low $C_Q$ ranges. Notably, empirical results suggest that interference also exhibits a monotonic trend with respect to $C_Q$.

Finding 4. Pruning granularity determines interference: structured pruning shows no interference in early regimes, while unstructured pruning exhibits monotonic interference.

Figure 6: The Progressive Intensity Hypothesis holds across diverse vision models: (a) ResNet-18 (CNN), (b) DeiT-Base (ViT). See Section 5.3 for details.

Figure 7: Multi-stage compression results on the LLaMA 3 8B model.

5.3 Analysis on Vision Models

CNN and ViT Models. We verify whether Hypothesis 1 holds for vision models, focusing on CNNs and ViTs. In Figure 6, we analyze the behavior of ResNet-18 and DeiT-Base models under PRACTISE ($P(\cdot)$) and N2UQ ($Q(\cdot)$), and SAViT ($P(\cdot)$) and RepQ-ViT ($Q(\cdot)$), respectively. The results confirm that $A(Q \to P)$ increases monotonically with $C_P^* - C_Q$ for both models, regardless of the pruning or quantization configurations. Notably, the compression order advantage is substantially larger than that observed in language models, where it is often only marginally positive.

Finding 5. Vision models consistently satisfy the Progressive Intensity Hypothesis regardless of architecture and applied compression techniques.

5.4 Beyond Pruning and Quantization: Toward General Pipelines

We extend the Progressive Intensity Hypothesis to general compression pipelines. These results align with the n-method ordering formulation in Appendix B.3.

Multi-stage Compression. Pruning is generally performed in multiple stages to mitigate performance degradation. In Figure 7, we investigate the impact of compression order by alternately applying SparseGPT ($P(\cdot)$) and QuaRot ($Q(\cdot)$) to the LLaMA 3 8B model, where the sum of pruning ratios is $p_1 + p_2 = 0.3$ (e.g., $P(\cdot\,; C_{P_1}) \to Q(\cdot) \to P(\cdot\,; C_{P_2})$).
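A multi-stage schedule of this form can be sketched as follows, with magnitude pruning and round-to-nearest quantization as illustrative stand-ins for SparseGPT and QuaRot; the total pruned fraction is held at $p_1 + p_2 = 0.3$ while the split between the two pruning stages varies.

```python
import numpy as np

def magnitude_prune(w, total_ratio):
    """Zero the smallest-magnitude weights up to a total fraction."""
    k = int(w.size * total_ratio)
    out = w.copy().ravel()
    if k > 0:
        out[np.argsort(np.abs(out))[:k]] = 0.0  # zeros sort first, so
    return out.reshape(w.shape)                 # earlier pruning is kept

def rtn(w, bits=8):
    """Illustrative round-to-nearest uniform quantizer."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def multi_stage(w, p1, p2, bits=8):
    """P(.; p1) -> Q(.) -> P(.; p2): prune a fraction p1, quantize, then
    prune further until the overall sparsity reaches about p1 + p2."""
    return magnitude_prune(rtn(magnitude_prune(w, p1), bits), p1 + p2)

rng = np.random.default_rng(3)
w = rng.standard_normal((64, 64))
out = multi_stage(w, p1=0.1, p2=0.2)
sparsity = float(np.mean(out == 0))  # close to 0.3 by construction
```

Sweeping the split `(p1, p2)` at a fixed total, as in Figure 7, is what lets one compare schedules that place the stronger pruning stage earlier versus later.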
Our results consistently demonstrate positive advantages, indicating that stronger pruning placed later improves performance under fixed quantization, confirming that our hypothesis holds not only for two stages but also for multiple ones.

Finding 6. Beyond pairwise orders, the hypothesis holds in practical multi-stage compression, indicating that scheduling stronger compression later in the sequence yields higher accuracy.

Figure 8: Joint compression results with LoRA adapters.

Figure 9: Performance under pruning and parameter sharing.

Figure 10: The impact of compression order in MPQ.

Parameter-Efficient Fine-tuning. Parameter-Efficient Fine-Tuning (PEFT), which introduces lightweight low-rank adapters to mitigate compression-induced performance degradation, has recently become a widely adopted practical approach. We investigate whether the proposed principle remains valid in scenarios where PEFT is applied alongside pruning and quantization. Figure 8 confirms that the same pattern holds for LLaMA 3 8B when combined with SparseGPT ($P(\cdot)$), RTN ($Q(\cdot)$), and LoRA (Hu et al., 2022) (PEFT). Applying LoRA after quantization produces a corrective effect similar to rotation, effectively compensating for quantization-induced performance loss. Overall, the Progressive Intensity Hypothesis remains robust under practical training pipelines that include PEFT.

Finding 7. PEFT preserves the hypothesis that stronger compression should be applied later, as post-quantization LoRA effectively restores accuracy and maintains the expected ordering.

Parameter Sharing. Beyond pruning and quantization, parameter sharing $S(\cdot)$ is an independent compression technique that ties multiple layers into a unified set of weight parameters. We conduct joint compression experiments with pruning and parameter sharing to assess whether the principle generalizes beyond the pruning-quantization setting.
In Figure 9, results on the LLaMA 2 7B model with Basis Sharing (Wang et al., 2025a) ($S(\cdot)$) and magnitude-based pruning (Han et al., 2015) ($P(\cdot)$) confirm that the same ordering effect emerges as well.

Finding 8. Joint compression with parameter sharing also follows the hypothesis: placing the stronger operation later yields better performance.

Mixed-precision Quantization. As discussed in Section 4, mixed-precision quantization (MPQ) can be formulated as a joint compression problem in which each bit-width allocation acts as a separate compression method, satisfying disjoint selectivity. Figure 10 illustrates the effect of compression order in MPQ, where we sequentially allocate quantization bit-widths following HAWQ-V2 (Dong et al., 2020) on a ResNet-18 model under a fixed average bit-width (i.e., identical overall compression ratio), comparing progressive (prog.; 8→2 bits) and regressive (regr.; 2→8 bits) sequential allocations. As the total compression ratio increases, progressive allocation increasingly outperforms regressive allocation in terms of $A(\text{prog.} \to \text{regr.})$, supporting our hypothesis under MPQ settings.

Finding 9. The Progressive Intensity Hypothesis also holds in MPQ, with progressive bit allocation outperforming regressive allocation, since lower-bit quantization acts as stronger compression.

6 Conclusion

We address the under-explored problem of joint compression order optimization and provide both theoretical and experimental evidence. The Progressive Intensity Hypothesis (Hypothesis 1) offers a simple yet powerful rule: weaker perturbations first, stronger ones later. Future work includes investigating interference in more complex pipelines, providing explicit predictive rules, and automating compression order selection (see Appendix D.7).
Acknowledgments

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No.RS-2020-I200894, Flexible and Efficient Model Compression Method for Various Applications and Environments], [No.RS-2021-I211343, Artificial Intelligence Graduate School Program (Seoul National University)], [No.RS-2024-00509257, Global AI Frontier Lab], and [No.RS-2025-25442338, AI star Fellowship Support Program (Seoul National University)]. This work was supported by Youlchon Foundation. The Institute of Engineering Research at Seoul National University provided research facilities for this work. The ICT at Seoul National University provided research facilities for this study. U Kang is the corresponding author.

References

S. Ashkboos, M. L. Croci, M. G. do Nascimento, T. Hoefler, and J. Hensman (2024a) SliceGPT: compress large language models by deleting rows and columns. In ICLR.
S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman (2024b) QuaRot: outlier-free 4-bit inference in rotated LLMs. In NeurIPS.
Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020) PIQA: reasoning about physical commonsense in natural language. In AAAI.
D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Eleventh International Workshop on Semantic Evaluation.
K. T. Chitty-Venkata, S. Mittal, M. Emani, V. Vishwanath, and A. K. Somani (2023) A survey of techniques for optimizing transformer inference. Journal of Systems Architecture 144, p. 102990.
I. Cho and U. Kang (2022) PEA-KD: parameter-efficient and accurate knowledge distillation on BERT. PLOS ONE 17 (2), p. 1–12.
Z. Chuanyang, Z. Li, K. Zhang, Z. Yang, W. Tan, J. Xiao, Y. Ren, and S. Pu (2022) SAViT: structure-aware vision transformer pruning via collaborative optimization. In NeurIPS.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
P. V. Dantas, W. Sabino da Silva Jr, L. C. Cordeiro, and C. B. Carvalho (2024) A comprehensive review of model compression techniques in machine learning. Applied Intelligence 54 (22), p. 11804–11844.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
L. Deng, G. Li, S. Han, L. Shi, and Y. Xie (2020) Model compression and hardware acceleration for neural networks: a comprehensive survey. Proceedings of the IEEE 108 (4), p. 485–532.
A. Desai and A. Shrivastava (2024) In defense of parameter sharing for model-compression. In ICLR.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
Z. Dong, Z. Yao, D. Arfeen, A. Gholami, M. W. Mahoney, and K. Keutzer (2020) HAWQ-V2: hessian aware trace-weighted quantization of neural networks. In NeurIPS.
L. Fang, Y. Wang, Z. Liu, C. Zhang, S. Jegelka, J. Gao, B. Ding, and Y. Wang (2025) What is wrong with perplexity for long-context language modeling? In ICLR.
E. Frantar and D. Alistarh (2023) SparseGPT: massive language models can be accurately pruned in one-shot. In ICML.
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2023) OPTQ: accurate quantization for generative pre-trained transformers. In ICLR.
A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer (2022) A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, p. 291–326.
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In ICML.
S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In NeurIPS.
S. B. Harma, A. Chakraborty, E. Kostenok, D. Mishin, D. Ha, B. Falsafi, M. Jaggi, M. Liu, Y. Oh, S. Subramanian, and A. Yazdanbakhsh (2025) Effective interplay between sparsity and quantization: from theory to practice. In ICLR.
B. Hawks, J. Duarte, N. J. Fraser, A. Pappalardo, N. Tran, and Y. Umuroglu (2021) Ps and Qs: quantization-aware pruning for efficient low latency neural network inference. Frontiers in Artificial Intelligence 4, p. 676564.
K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In ICLR.
P. Hu, X. Peng, H. Zhu, M. M. S. Aly, and J. Lin (2021) OPQ: compressing deep neural networks with one-shot pruning-quantization. In AAAI.
S. Huang, C. Pearson, R. Nagi, J. Xiong, D. Chen, and W. Hwu (2019) Accelerating sparse deep neural networks on FPGAs. In HPEC.
J. Jang, C. Quan, H. D. Lee, and U. Kang (2023) FALCON: lightweight and accurate convolution based on depthwise separable convolution. Knowl. Inf. Syst. 65 (5), p. 2225–2249.
H. Jeon, S. Park, J. Kim, and U. Kang (2023) PET: parameter-efficient knowledge distillation on transformer. PLOS ONE 18 (7), p. 1–21.
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, et al. (2023) Mistral 7B. arXiv preprint arXiv:2310.06825.
J. Kim, S. Chang, and N. Kwak (2021a) PQK: model compression via pruning, quantization, and knowledge distillation. In Interspeech.
J. Kim, J. Jung, and U. Kang (2021b) Compressing deep graph convolution network with multi-staged knowledge distillation. PLOS ONE 16 (8), p. 1–18.
M. Kim, J. Choi, J. Lee, W. Cho, and U. Kang (2025a) Zero-shot quantization: a comprehensive survey. In IJCAI.
M. Kim, J. Kim, and U. Kang (2025b) SynQ: accurate zero-shot quantization by synthesis-aware fine-tuning. In ICLR.
M. Kim, J. Lee, J. Kim, J. Yun, Y. Kwon, and U. Kang (2026) LampQ: towards accurate layer-wise mixed precision quantization for vision transformers. In AAAI.
E. Kurtic, D. Campos, T. Nguyen, E. Frantar, M. Kurtz, B. Fineran, M. Goin, and D. Alistarh (2022) The optimal BERT surgeon: scalable and accurate second-order pruning for large language models. In EMNLP.
A. Kuzmin, M. Nagel, M. Van Baalen, A. Behboodi, and T. Blankevoort (2023) Pruning vs quantization: which is better? In NeurIPS.
C. Lee, J. Jin, T. Kim, H. Kim, and E. Park (2024) OWQ: outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In AAAI.
M. Li, Y. Lin, Z. Zhang, T. Cai, J. Guo, X. Li, E. Xie, C. Meng, J. Zhu, and S. Han (2025) SVDQuant: absorbing outliers by low-rank component for 4-bit diffusion models. In ICLR.
Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu (2021) BRECQ: pushing the limit of post-training quantization by block reconstruction. In ICLR.
Z. Li, J. Xiao, L. Yang, and Q. Gu (2023) RepQ-ViT: scale reparameterization for post-training quantization of vision transformers. In ICCV.
T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang (2021) Pruning and quantization for deep neural network acceleration: a survey. Neurocomputing 461, p. 370–403.
H. Lin, H. Xu, Y. Wu, J. Cui, Y. Zhang, L. Mou, L. Song, Z. Sun, and Y. Wei (2024) DuQuant: distributing outliers via dual transformation makes stronger quantized LLMs. In NeurIPS.
D. Liu, Y. Zhu, Z. Liu, Y. Liu, C. Han, J. Tian, R. Li, and W. Yi (2025a) A survey of model compression techniques: past, present, and future. Frontiers in Robotics and AI 12, p. 1518965.
Z. Liu, K. Cheng, D. Huang, E. P. Xing, and Z. Shen (2022) Nonuniform-to-uniform quantization: towards accurate quantization via generalized straight-through estimation. In CVPR.
Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort (2025b) SpinQuant: LLM quantization with learned rotations. In ICLR.
Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, C. Zhang, Y. Tian, C. Re, et al. (2023) Deja vu: contextual sparsity for efficient LLMs at inference time. In ICML.
S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan (2022) PEFT: state-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft
C. Meister and R. Cotterell (2021) Language model evaluation beyond perplexity. In ACL.
S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017) Pointer sentinel mixture models. In ICLR.
Mistral AI Team (2024) Mistral NeMo: our new best small model. https://mistral.ai/news/mistral-nemo. Accessed: 2025.
B. A. Motetti, M. Risso, A. Burrello, E. Macii, M. Poncino, and D. J. Pagliari (2024) Joint pruning and channel-wise mixed-precision quantization for efficient deep neural networks. IEEE Transactions on Computers 73 (11), p. 2619–2633.
A. Nova, H. Dai, and D. Schuurmans (2023) Gradient-free structured pruning with unlabeled data. In ICML.
D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández (2016) The LAMBADA dataset: word prediction requiring a broad discourse context. In ACL.
S. Park, J. Bae, B. Kwon, M. Kim, B. Kim, S. J. Kwon, U. Kang, and D. Lee (2025a) Unifying uniform and binary-coding quantization for accurate compression of large language models. In ACL.
S. Park, H. Choi, and U. Kang (2024) Accurate retraining-free pruning for pretrained encoder-based language models. In ICLR.
S. Park, S. Lee, J. Kim, J. Lee, H. Jo, and U. Kang (2025b) Accurate sublayer pruning for large language models by exploiting latency and tunability information. In IJCAI.
T. Piao, I. Cho, and U. Kang (2022) SensiMix: sensitivity-aware 8-bit index & 1-bit value mixed precision quantization for BERT compression. PLOS ONE 17 (4), p. 1–22.
X. Qu, D. Aponte, C. Banbury, D. P. Robinson, T. Ding, K. Koishida, I. Zharkov, and T. Chen (2025) Automatic joint structured pruning and quantization for efficient neural network training and compression. In CVPR.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), p. 1–67.
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) WinoGrande: an adversarial Winograd schema challenge at scale. Communications of the ACM 64 (9), p. 99–106.
T. Shinde (2024) Adaptive quantization and pruning of deep neural networks via layer importance estimation. In Workshop on Machine Learning and Compression, NeurIPS 2024.
J. Song, K. Oh, T. Kim, H. Kim, Y. Kim, and J. Kim (2024) SLEB: streamlining LLMs through redundancy verification and elimination of transformer blocks. In ICML.
M. Sun, Z. Liu, A. Bair, and J. Z. Kolter (2024) A simple and effective pruning approach for large language models. In ICLR.
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021) Training data-efficient image transformers & distillation through attention. In ICML.
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa (2024) QuIP#: even better LLM quantization with Hadamard incoherence and lattice codebooks. In ICML.
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR.
G. Wang and J. Wu (2023) Practical network acceleration with tiny sets. In CVPR.
J. Wang, Y. Chen, I. Lin, B. Li, and G. L. Zhang (2025a) Basis Sharing: cross-layer parameter sharing for large language model compression. In ICLR.
N. Wang, C. C. Liu, S. Venkataramani, S. Sen, C. Chen, K. El Maghraoui, V. V. Srinivasan, and L. Chang (2022) Deep compression of pre-trained transformer models. In NeurIPS.
T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, H. Wang, Y. Lin, and S. Han (2020) APQ: joint search for network architecture, pruning and quantization policy. In CVPR.
X. Wang, Y. Zheng, Z. Wan, and M. Zhang (2025b) SVD-LLM: truncation-aware singular value decomposition for large language model compression. In ICLR.
X. Wu, C. Li, R. Y. Aminabadi, Z. Yao, and Y. He (2023) Understanding INT4 quantization for language models: latency speedup, composability, and failure cases. In ICML.
G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023) SmoothQuant: accurate and efficient post-training quantization for large language models. In ICML.
C. Xu and J. McAuley (2023) A survey on model compression and acceleration for pretrained language models. In AAAI.
K. Xu, Z. Wang, C. Chen, X. Geng, J. Lin, X. Yang, M. Wu, X. Li, and W. Lin (2024) LPViT: low-power semi-structured pruning for vision transformers. In ECCV.
C. Yu, T. Chen, Z. Gan, and J. Fan (2023) Boost vision transformer with GPU-friendly sparsity and quantization. In CVPR.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence? In ACL.
P. Zhao and X. Yuan (2025) GANQ: GPU-adaptive non-uniform quantization for large language models. In ICML.
X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang (2024) A survey on model compression for large language models. Transactions of the Association for Computational Linguistics 12, p. 1556–1577.

Appendix A Notation

We summarize the frequently used notations in the paper in Table 2.
Table 2: Frequently used notations.

  $\phi$ : a pre-trained model
  $\phi'$ : a compressed model
  $f(\cdot) \in F$ : compression method from a set $F$
  $C$ : compression ratio
  $M(\cdot)$ : performance metric
  $W_i, X_i$ : weight and activation matrices of layer $l_i$, respectively
  $\pi \in \Pi$ : compression order from the set $\Pi$ of all possible permutations
  $P(\cdot), Q(\cdot)$ : pruning and quantization methods, respectively
  $p$ : pruning ratio ($C_P = 1/(1-p)$)
  $B_{orig}, B_Q$ : original and quantized bit-widths, respectively
  $\delta_f(\cdot)$ : error induced by applying $f(\cdot)$
  $t \in T_\phi$ : abstract data type from the set of all valid types in model $\phi$
  $t_f$ : granularity of $f(\cdot)$
  $u \in U(\phi; t)$ : a unit of type $t$ within the model $\phi$
  $D_u^f(\pi)$ : binary indicator of whether $f(\cdot)$ modifies unit $u$ under order $\pi$
  $G(f_1, f_2)$ : performance gap between $f_1(\cdot)$ and $f_2(\cdot)$
  $A(f_1 \to f_2)$ : compression order advantage of $f_1 \to f_2$ over $f_2 \to f_1$
  $C_f^*$ : Compression Equivalent Ratio (CER) of $f(\cdot)$
  $\Delta(\phi; f_1 \to f_2)$ : interference from $f_1(\cdot)$ to $f_2(\cdot)$

Appendix B Details on Theoretical Analysis

Figure 11: We partition all units $u$ into four disjoint groups. Only groups $G_1$ and $G_2$ influence $A(f_1 \to f_2)$.

We provide detailed proofs for Theorems 1 and 2, then formulate a generalized version of Hypothesis 1 applicable to a broader setting with multiple compression methods.

B.1 Proof of Theorem 1

Proof. Given two compression methods $f_1(\cdot)$ and $f_2(\cdot)$ with respective granularities $t_{f_1}$ and $t_{f_2}$, disjoint selectivity ensures that every unit is assigned exclusively to one method. Hence, every unit $u \in U(\phi; t_{lut}(t_{f_1}, t_{f_2}))$ is classified into one of four disjoint groups, $G_1$, $G_2$, $G_3$, or $G_4$, according to its assigned method.
Then,

$$G_1 = \{ u \mid D_u^{f_1}(f_2 \circ f_1) = 1,\ D_u^{f_1}(f_1 \circ f_2) = 0 \},$$
$$G_2 = \{ u \mid D_u^{f_1}(f_2 \circ f_1) = 0,\ D_u^{f_1}(f_1 \circ f_2) = 1 \},$$
$$G_3 = \{ u \mid D_u^{f_1}(f_2 \circ f_1) = 0,\ D_u^{f_1}(f_1 \circ f_2) = 0 \},$$
$$G_4 = \{ u \mid D_u^{f_1}(f_2 \circ f_1) = 1,\ D_u^{f_1}(f_1 \circ f_2) = 1 \},$$
$$G_1 \cup G_2 \cup G_3 \cup G_4 = U(\phi; t_{lut}(t_{f_1}, t_{f_2})),$$

where $U(\phi; t)$ represents the unit set of model $\phi$ at granularity $t$, and $D_u^f(\pi)$ records whether $f(\cdot)$ modifies $u$ under the ordering $\pi$ (1 if yes, 0 if no). Note that these four groups are mutually exclusive and collectively exhaustive. Also, $|G_1| = |G_2|$ since the compression ratios $C_{f_1}$ and $C_{f_2}$ are identical regardless of the compression order. Figure 11 illustrates the four groups.

Under Assumption 1 and the defined partitioning of groups, the compression order advantage $A(\cdot)$ is expressed in terms of unit-wise reconstruction errors $\delta_f(u_i)$:

$$A(f_1 \to f_2) = M((f_2 \circ f_1)(\phi)) - M((f_1 \circ f_2)(\phi))$$
$$= -\beta \big( \delta_{f_2 \circ f_1}(\phi) - \delta_{f_1 \circ f_2}(\phi) \big)$$
$$= -\beta \Big( \sum_{u_i \in G_1} \|\delta_{f_1}(u_i)\|_F^2 + \sum_{u_i \in G_2} \|\delta_{f_2}(u_i)\|_F^2 + \sum_{u_i \in G_3} \|\delta_{f_2}(u_i)\|_F^2 + \sum_{u_i \in G_4} \|\delta_{f_1}(u_i)\|_F^2$$
$$\qquad - \sum_{u_i \in G_1} \|\delta_{f_2}(u_i)\|_F^2 - \sum_{u_i \in G_2} \|\delta_{f_1}(u_i)\|_F^2 - \sum_{u_i \in G_3} \|\delta_{f_2}(u_i)\|_F^2 - \sum_{u_i \in G_4} \|\delta_{f_1}(u_i)\|_F^2 \Big)$$
$$= -\beta \Big( \sum_{u_i \in G_1} \|\delta_{f_1}(u_i)\|_F^2 + \sum_{u_i \in G_2} \|\delta_{f_2}(u_i)\|_F^2 - \sum_{u_i \in G_1} \|\delta_{f_2}(u_i)\|_F^2 - \sum_{u_i \in G_2} \|\delta_{f_1}(u_i)\|_F^2 \Big)$$
$$= \beta \Big( \sum_{u_i \in G_2} g(u_i) - \sum_{u_i \in G_1} g(u_i) \Big),$$

where the error difference $g(u_i) = \|\delta_{f_1}(u_i)\|_F^2 - \|\delta_{f_2}(u_i)\|_F^2$. Note that $G_3$ and $G_4$ are discarded since their effect remains unchanged irrespective of the compression order.
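The bookkeeping in this proof can be written out directly; the sketch below (illustrative names, scalar per-unit errors standing in for Frobenius norms) builds the four groups from the modification indicators and evaluates the final expression for the order advantage.

```python
def partition_groups(d_f1_by_order):
    """Split units into G1..G4 from indicator pairs
    (D_u^{f1}(f2 o f1), D_u^{f1}(f1 o f2)), as in the proof."""
    groups = {1: [], 2: [], 3: [], 4: []}
    label = {(1, 0): 1, (0, 1): 2, (0, 0): 3, (1, 1): 4}
    for u, pair in enumerate(d_f1_by_order):
        groups[label[pair]].append(u)
    return groups

def order_advantage(groups, err_f1, err_f2, beta=1.0):
    """A(f1 -> f2) = beta * (sum_{u in G2} g(u) - sum_{u in G1} g(u)),
    where g(u) = ||delta_f1(u)||^2 - ||delta_f2(u)||^2.
    G3 and G4 cancel and do not appear."""
    g = lambda u: err_f1[u] ** 2 - err_f2[u] ** 2
    return beta * (sum(g(u) for u in groups[2]) - sum(g(u) for u in groups[1]))

# Four units, one per group; unit-wise errors under f1 and f2.
groups = partition_groups([(1, 0), (0, 1), (0, 0), (1, 1)])
adv = order_advantage(groups, err_f1=[2.0, 3.0, 1.0, 1.0],
                      err_f2=[1.0, 1.0, 1.0, 1.0])
```

Here only the units in $G_1$ and $G_2$ contribute, exactly as in the derivation: $A = g(u_1) - g(u_0) = (3^2 - 1^2) - (2^2 - 1^2) = 5$.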
∎

Case study on pruning and quantization. To support intuition, we provide a case study on pruning and quantization, which constitute the core scenario of our work. As described in the main text, disjoint selectivity holds only when the granularity $t_P$ of pruning $P(\cdot)$ is greater than or equal to the granularity $t_Q$ of quantization $Q(\cdot)$. We analyze this at the layer level: let $W_i$ and $X_i$ denote the weight and activation of a layer $l_i \in L$ in the model $\phi$. Note that the error $\delta_f(W_i, X_i) = f(W_i) f(X_i) - W_i X_i$ for a compression method $f(\cdot)$, as described in Section 2. We partition the layers $L$ into four disjoint groups $G_1$, $G_2$, $G_3$, and $G_4$ based on their pruning status:

$$G_1 = \{ u \mid D_u^Q(P \circ Q) = 1,\ D_u^Q(Q \circ P) = 0 \},$$
$$G_2 = \{ u \mid D_u^Q(P \circ Q) = 0,\ D_u^Q(Q \circ P) = 1 \},$$
$$G_3 = \{ u \mid D_u^Q(P \circ Q) = 0,\ D_u^Q(Q \circ P) = 0 \},$$
$$G_4 = \{ u \mid D_u^Q(P \circ Q) = 1,\ D_u^Q(Q \circ P) = 1 \},$$
$$G_1 \cup G_2 \cup G_3 \cup G_4 = L,$$

where $D_u^f(\pi)$ records whether $f(\cdot)$ modifies $u$ under the ordering $\pi$ (1 if yes, 0 if no). In the pruning-quantization setting, the partition is directly determined by whether each layer is pruned in the final model. Pruning behaves as an absorbing operator: pruning overrides any modification introduced by quantization. Therefore, the partition above reduces to grouping layers according to whether they are pruned under $Q \circ P$ or $P \circ Q$, yielding the pruning-status-based formulation below:

$$G_1 = \mathbb{P}_{Q \circ P} \setminus \mathbb{P}_{P \circ Q}, \quad G_2 = \mathbb{P}_{P \circ Q} \setminus \mathbb{P}_{Q \circ P}, \quad G_3 = \mathbb{P}_{Q \circ P} \cap \mathbb{P}_{P \circ Q}, \quad G_4 = L \setminus (\mathbb{P}_{Q \circ P} \cup \mathbb{P}_{P \circ Q}),$$
$$G_1 \cup G_2 \cup G_3 \cup G_4 = L,$$

where $\mathbb{P}_f$ denotes the set of pruned layers when applying $f(\cdot)$.
Then, the quantization-first advantage $A(Q \to P)$ is estimated as follows:

$$A(Q \to P) = M((P \circ Q)(\phi)) - M((Q \circ P)(\phi)) = -\beta \big( \delta_{P \circ Q}(\phi) - \delta_{Q \circ P}(\phi) \big)$$
$$= -\beta \sum_{l_i \in L} \Big( \|\delta_{P \circ Q}(W_i, X_i)\|_F^2 - \|\delta_{Q \circ P}(W_i, X_i)\|_F^2 \Big)$$
$$= -\beta \Big( \sum_{l_i \in G_1} \big( \|\delta_Q(W_i, X_i)\|_F^2 - \|{-W_i X_i}\|_F^2 \big) + \sum_{l_i \in G_2} \big( \|{-W_i X_i}\|_F^2 - \|\delta_Q(W_i, X_i)\|_F^2 \big) \Big)$$
$$= \beta \Big( \sum_{l_i \in G_2} g(W_i, X_i) - \sum_{l_i \in G_1} g(W_i, X_i) \Big),$$

where $g(W_i, X_i) = \|\delta_Q(W_i, X_i)\|_F^2 - \|{-W_i X_i}\|_F^2$. This expression holds since, for any layer $l_i \in L$, the pruning operator and its associated error are defined as

$$P(W_i) P(X_i) = \begin{cases} 0 & \text{if pruned} \\ W_i X_i & \text{otherwise} \end{cases}, \qquad \delta_P(W_i, X_i) = \begin{cases} -W_i X_i & \text{if pruned} \\ 0 & \text{otherwise.} \end{cases}$$

B.2 Proof of Theorem 2

Proof. As $A(Q \to P)$ and $G(\cdot)$ (or $C_P^* - C_Q$) are functions of the compression ratio $C_Q$, we analyze their behavior separately. Without loss of generality, we consider only the case where $C_Q$ changes in the direction of increasing $C_P^* - C_Q$, i.e., decreasing $C_Q$. Under Assumption 2, which assumes well-designed quantization, decreasing $C_Q$ preserves the expected value of the quantized outputs while decreasing their standard deviation. As the pruning intensity is held constant, the variation across compression orders is attributed solely to the severity of quantization. A lower quantization ratio (i.e., smaller standard deviation) decreases the chance that units behave differently across orders, which can only decrease or preserve the value of $|G_1| = |G_2|$, never increase it. We analyze the two possible cases as follows.

• Case 1: The number of layers affected by order decreases.
Although more than one layer may change simultaneously, any such change can be decomposed into a sequence in which layers are added one by one; thus it suffices to analyze the case where exactly one layer is added. Let $l_i$ and $l_j$ denote the layers moving from $G_1$ and $G_2$, respectively. Such a transition occurs in a budget-preserving manner: one layer from $G_1$ and one from $G_2$ are jointly reallocated, with one moving to $G_3$ and the other to $G_4$, so that the number of pruned layers remains fixed. As only $G_1$ and $G_2$ contribute to $A(Q{\to}P)$, removing $l_i \in G_1$ and $l_j \in G_2$ from these groups changes the value by $-\beta(g(l_j) - g(l_i))$ (see Theorem 1). Therefore, to show that $A(Q{\to}P)$ does not decrease, it suffices to prove that

$$g(l_j) - g(l_i) \le 0.$$

Expanding the definition of $g(\cdot)$, we obtain

$$g(l_j) - g(l_i) = \left(\|\delta_Q(W_j, X_j)\|_F^2 - \|\delta_Q(W_i, X_i)\|_F^2\right) - \left(\|{-W_j X_j}\|_F^2 - \|{-W_i X_i}\|_F^2\right).$$

Under Assumptions 1 and 2, the second term in parentheses is positive, i.e., $\|{-W_j X_j}\|_F^2 - \|{-W_i X_i}\|_F^2 > 0$. This is because under pruning alone, $l_i \in G_1$ is pruned while $l_j \in G_2$ is not. Under the well-designed pruning assumption, which minimizes the performance drop, this is equivalent to minimizing the induced error. Therefore, the pruning error $\|{-W_i X_i}\|_F^2$ for the pruned $l_i$ is less than or equal to $\|{-W_j X_j}\|_F^2$ for the unpruned $l_j$, making the term positive. Given the assumption of well-designed quantization in Assumption 2, the remaining first term, which denotes the difference in quantization errors, is negligible compared to the pruning-related component. This is because the quantization error at each layer is modeled as zero-mean noise with small variance, whereas the pruning error term $\|{-W_i X_i}\|_F^2$ (or $\|{-W_j X_j}\|_F^2$) corresponds directly to the magnitude of the pruned responses.
Thus, the difference $\|\delta_Q(W_j, X_j)\|_F^2 - \|\delta_Q(W_i, X_i)\|_F^2$ remains uniformly small compared to $\|{-W_j X_j}\|_F^2 - \|{-W_i X_i}\|_F^2$, so the pruning-induced gap dominates $g(l_j) - g(l_i)$. Consequently, we get

$$g(l_j) - g(l_i) \approx -\left(\|{-W_j X_j}\|_F^2 - \|{-W_i X_i}\|_F^2\right) < 0.$$

Overall, the decrease in the number of order-dependent layers leads to an increase of the order advantage $A(Q{\to}P)$.

• Case 2: Number of layers affected by order remains unchanged. As the loss-contributing groups $G_1$ and $G_2$ do not change, the compression order advantage $A(Q{\to}P)$ remains unaffected.

Therefore, in all cases where $C_Q$ decreases, the value of $A(Q{\to}P)$ does not decrease. In conclusion, monotonicity between $A(Q{\to}P)$ and $G(\cdot)$ holds under fixed $C_P$. ∎

B.3 Generalization to multiple methods

We formulate Hypothesis 1 with two compression methods $f_1(\cdot)$ and $f_2(\cdot)$, because if the hypothesis holds for any pair of methods, it generalizes to more than two methods. Following the setup in Problem 1, suppose we sequentially apply a set $F = \{f_1(\cdot), f_2(\cdot), \dots, f_n(\cdot)\}$ of compression methods to a pre-trained model $\varphi$. Then, any pair $(\pi_1, \pi_2)$ of permutations from the set $\Pi = \{\pi : F \to F \mid \pi \text{ is bijective}\}$ of all permutations can be converted into one another via a sequence of adjacent transpositions, since adjacent transpositions generate the full symmetric group, allowing any permutation to be constructed from another. Thus, under Hypothesis 1, our original claim shown in Figure 1 holds: applying stronger methods later leads to better performance of the compressed model.

Appendix C Experimental Setup

Table 3: Baseline methods covered in our experiments across different settings.
| Compression | Modality | Target Models | Baseline Methods |
| Pruning $P(\cdot)$ | Language models | Decoder-only models | SparseGPT (Frantar and Alistarh, 2023), Wanda (Sun et al., 2024), SLEB (Song et al., 2024) |
| | | Encoder-based models | K-prune (Park et al., 2024) |
| | Vision models | CNNs | PRACTISE (Wang and Wu, 2023) |
| | | ViTs | SAViT (Chuanyang et al., 2022) |
| Quantization $Q(\cdot)$ | Language models | Decoder-only models | RTN (Gupta et al., 2015), OPTQ (Frantar et al., 2023), QuaRot (Ashkboos et al., 2024b) |
| | | Encoder-based models | UniQuanF (Park et al., 2025a) |
| | Vision models | CNNs | N2UQ (Liu et al., 2022) |
| | | ViTs | RepQ-ViT (Li et al., 2023) |
| Parameter-efficient fine-tuning | Language models | Decoder-only models | LoRA (Hu et al., 2022) |
| Parameter sharing | Language models | Decoder-only models | Basis Sharing (Wang et al., 2025a) |
| Mixed-precision quantization | Vision models | CNNs | HAWQ-V2 (Dong et al., 2020) |

We describe the details of the experimental setup, including models, datasets, baselines, evaluation protocol, and implementation.

Models. We evaluate representative models across modalities: LLaMA 2 (7B, 13B) (Touvron et al., 2023), LLaMA 3 8B (Grattafiori et al., 2024), Mistral 7B (Jiang et al., 2023), Mistral Nemo 12B (Mistral AI Team, 2024), and BERT (Devlin et al., 2019) for language, and ResNet-18 (He et al., 2016) and DeiT-Base (Touvron et al., 2021) for vision.

Datasets. We evaluate decoder-only language models on the WikiText-2 (Merity et al., 2017) and C4 (Raffel et al., 2020) datasets for perplexity, and on five commonsense reasoning tasks: ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), and Winogrande (Sakaguchi et al., 2021). For encoder-based models, we evaluate performance using Spearman's rank correlation coefficient on the STS-B dataset. For vision models, we report classification accuracy on the ImageNet (ILSVRC 2012) dataset (Deng et al., 2009).

Baseline Methods.
We validate our hypothesis across a total of sixteen pairs of compression methods by incorporating six pruning methods, six quantization methods, and one mixed-precision quantization method. Table 3 provides an overview of the baseline methods categorized by target models and modalities. Refer to the original papers for further details.

Evaluation Protocol. The calibration dataset consists of a single batch with 128 samples drawn from the same dataset used for perplexity evaluation. We set the batch size to 16 for perplexity evaluation and to 128 for commonsense reasoning tasks. All quantization methods apply the same bit-width to weights, activations, and KV-cache, with clipping applied during weight quantization. Both models are evaluated without fine-tuning using a batch size of 128. Metrics are reported as the average of five repeated runs, each computed with four-digit precision. We plot relative values for visualization.

Implementation and Machine. Our implementations are written in Python and rely on the PyTorch, Transformers, Accelerate, and TorchVision libraries. For all baseline methods, we reproduce the results based on their open-source code and hyperparameter configurations. All of our experiments are run on a workstation with an Intel Xeon Gold 6338 and an NVIDIA A100 80GB.

Parameter Efficient Fine-tuning Experiment. We adopt LoRA (Hu et al., 2022) on top of SparseGPT (Frantar and Alistarh, 2023) and RTN (Gupta et al., 2015) to fine-tune the compressed model. The target model is LLaMA 3 8B (Grattafiori et al., 2024), where fine-tuning is performed after quantization. We select WikiText-2 as the calibration dataset and train for a total of 2 epochs. We follow Basis Sharing (Wang et al., 2025a) for the training details of the low-rank adapter, using the PEFT (Mangrulkar et al., 2022) library.

Parameter Sharing Experiment.
We evaluate the performance of the LLaMA 2 7B model when applying Basis Sharing (Wang et al., 2025a) and magnitude-based pruning (Han et al., 2015). We follow Basis Sharing for the hyperparameters of parameter sharing, with a group size of 128. QuaRot (Ashkboos et al., 2024b) is selected as the quantization baseline for calculating compression-equivalent bits. No additional fine-tuning is applied under this setting.

Mixed-precision Quantization Experiment. We base our method on HAWQ-V2 (Dong et al., 2020), but allocate bit-widths iteratively rather than in a single shot, following LampQ (Kim et al., 2026). At each iteration, we search per-layer bit-widths over the set {2, 3, 4, 5, 6, 7, 8} and train for 5 epochs. All other experimental settings, including hyperparameters and quantization techniques, are aligned with the original paper. We run all MPQ experiments on a workstation with an Intel Xeon Silver 4310 and an NVIDIA RTX 4090.

Appendix D Further Discussion and Experiments

We present results from extended experiments, and offer further discussion and remarks on our work.

D.1 Experiments on Diverse LLMs

Our experiments on decoder-only models are limited to LLaMA herd models (LLaMA 2 (Touvron et al., 2023) and LLaMA 3 (Grattafiori et al., 2024)), which may not fully reflect broader generality. To address this, we conduct additional experiments on models from the Mistral herd. Figure 12 presents the results of applying SparseGPT ($P(\cdot)$) and QuaRot ($Q(\cdot)$) to Mistral 7B (Jiang et al., 2023) and Mistral Nemo 12B (Mistral AI Team, 2024). We make two observations from the result. First, the compression-order trend aligns well with the hypothesis across Mistral-based models, providing additional evidence for the hypothesis on decoder-only language models. Second, comparing models within the same herd (also refer to Figure 3), we find that smaller models exhibit greater variation in compression-order advantage for identical CER differences.
This may be because low-bit quantization (i.e., stronger quantization) causes greater degradation in smaller models, thereby intensifying the observed differences.

Figure 12: The compression order advantage $A(Q{\to}P)$ increases monotonically with the CER difference $C_P^* - C_Q$ also for Mistral herd models: (a) Mistral 7B, (b) Mistral Nemo 12B. See Appendix D.1 for details.

D.2 Analysis on Encoder-based Models

Figure 13: The hypothesis holds for encoder-based language models. See Appendix D.2 for details.

Beyond decoder-only LLMs, we extend our analysis to encoder-based language models to validate the generality of our hypothesis. Figure 13 presents the performance of a BERT (Devlin et al., 2019) model under K-prune (Park et al., 2024) ($P(\cdot)$) and UniQuanF (Park et al., 2025a) ($Q(\cdot)$). We adopt Spearman correlation as the performance metric, measured on the STS-B dataset (Cer et al., 2017) from the GLUE (Wang et al., 2019) benchmark. We observe a monotonic increase along both axes of $A(Q{\to}P)$ and $C_P^* - C_Q$, confirming that our hypothesis holds.

Figure 14: Commonsense reasoning task performance of a LLaMA 3 8B model for SparseGPT and QuaRot: (a) ARC, (b) HellaSwag, (c) PIQA, (d) Winogrande, (e) LAMBADA, (f) Average. See Appendix D.3 for details.

D.3 Commonsense Reasoning Performance

Although the negative of perplexity serves as an intuitive and efficient metric $\mathcal{M}(\cdot)$ for evaluating language models, prior studies (Meister and Cotterell, 2021; Fang et al., 2025) suggest it does not always correlate with real-world performance. We thus investigate the performance of the LLaMA 3 8B model across five commonsense reasoning tasks in Figure 14, when applying SparseGPT ($P(\cdot)$) and QuaRot ($Q(\cdot)$). The results affirm the generality and metric-agnostic nature of our framework, as the hypothesis holds across these tasks.
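Since perplexity is the primary metric $\mathcal{M}(\cdot)$ for the decoder-only experiments, it may help to recall how it is computed: the exponentiated mean negative log-likelihood over evaluated tokens. The snippet below is a minimal sketch with made-up log-probabilities, not the paper's evaluation code.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token natural-log probabilities of a held-out sequence.
logps = [-2.1, -0.3, -1.7, -0.9]
ppl = perplexity(logps)

# Sanity check: a uniform model over a vocabulary of size V scores perplexity V.
uniform = perplexity([math.log(1 / 50000)] * 4)
print(round(ppl, 3), round(uniform, 1))
```

Lower is better, which is why the paper uses the negative of perplexity when a "higher is better" performance metric is needed.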
Table 4: WikiText-2 perplexity comparison of a LLaMA 3 8B model pruned by SLEB (Song et al., 2024) and SparseGPT (Frantar and Alistarh, 2023) under varying pruning ratios, with and without rotation following QuaRot (Ashkboos et al., 2024b). The original (uncompressed) model reaches a perplexity of 6.137. See Section D.4 for details.

| Pruning Ratio | SLEB: No rotation | SLEB: Rotation | SLEB: Difference | SparseGPT: No rotation | SparseGPT: Rotation | SparseGPT: Difference |
| 0.05 | 6.857 | 6.871 | 0.014 | 6.140 | 6.154 | 0.014 |
| 0.1 | 8.792 | 8.828 | 0.036 | 6.159 | 6.205 | 0.046 |
| 0.15 | 12.603 | 12.615 | 0.012 | 6.213 | 6.352 | 0.139 |
| 0.2 | 25.289 | 25.295 | 0.006 | 6.330 | 6.629 | 0.299 |
| 0.25 | 51.212 | 51.560 | 0.348 | 6.546 | 7.250 | 0.704 |
| 0.3 | 61.502 | 61.901 | 0.399 | 6.894 | 8.504 | 1.610 |
| 0.35 | 65.997 | 66.234 | 0.237 | 7.474 | 20.842 | 13.368 |
| 0.4 | 92.848 | 93.260 | 0.412 | 8.477 | 98.213 | 89.736 |

D.4 Impact of Rotation on Pruning Methods

In Figure 5 and Finding 3, we observe that applying rotation without quantization may lead to notable degradation in pruning performance. To further analyze this, Table 4 compares the performance of the LLaMA 3 8B model pruned with and without QuaRot-based rotation, across two pruning methods with different granularities. We make two observations from the result. First, rotation-induced degradation scales with the pruning ratio: higher pruning ratios prune more units that are altered by rotation, thereby increasing the error. Second, unstructured pruning (SparseGPT) exhibits significantly higher rotation-induced error than structured pruning (SLEB), a trend especially evident under high pruning ratios.

Figure 15: Two cases of errors when pruning rotated units. See Section D.4 for details.

We therefore investigate the underlying reason behind this phenomenon. We identify two types of errors induced by pruning, depending on its granularity: matrix-wise and element-wise errors. Figure 15 conceptually illustrates these two cases.
First, in the case of matrix-wise pruning, ignoring rotation during pruning leaves the rotation-induced matrix H intact, introducing extra computation and numerical errors compared to the non-rotated case. As suggested in QuaRot (Ashkboos et al., 2024b), the rotation inverse is fused into the target layer, while the original transform is merged into the preceding normalization layer; naïve pruning therefore leaves un-removed components that generate error. This type of error scales proportionally with the pruning ratio, as each pruned matrix introduces one such error. Second, in element-wise pruning, additional errors arise from rotation-induced changes in unit selection, on top of the matrix-wise error. As the goal of rotation is to facilitate quantization by flattening activation outliers, multiplying by the rotation inverse yields a matrix that differs from the original, and this discrepancy in layer content leads to different pruning decisions. Furthermore, this selection-based error grows with higher pruning ratios due to the greater number of pruned units. In summary, given these errors, it is crucial to develop pruning techniques that align with rotation-based quantization strategies.

D.5 A Direct Comparison with Prior Studies

We discuss how our approach differs from prior studies, particularly SmoothQuant (Xiao et al., 2023) and Harma et al. (2025). Our contribution lies in establishing a general analysis for understanding compression order across diverse methods, whereas these prior works either focus on single-method optimization or analyze specific method pairs under restrictive assumptions.

SmoothQuant (Xiao et al., 2023). SmoothQuant addresses a fundamentally different problem than joint compression order optimization. Specifically, SmoothQuant optimizes a single compression technique (quantization) by mitigating activation outliers through per-channel scaling transformations.
While the paper states that "SmoothQuant is orthogonal to quantization schemes," this refers to its compatibility as a pre-processing step that can be applied before various quantization methods. However, SmoothQuant does not examine the order-dependent interaction problem that arises when combining quantization with other compression families such as pruning. In contrast, our work focuses on understanding how compression order affects model performance when sequentially combining methods from different compression families. SmoothQuant may serve as a component within our quantization baselines (i.e., as a pre-processing step before quantization), but our analysis operates at a higher level, determining the optimal ordering between different model compression techniques regardless of the specific quantization implementation. This distinction is critical: SmoothQuant addresses intra-method optimization (improving quantization itself), while we address inter-method composition (ordering across different compression types).

Harma et al. (2025). As noted in Section 2, only a few studies have addressed how the order of compression methods affects model performance. Among them, Harma et al. (2025) stands out as the only study that attempts a theoretical approach to the problem. They examine the interaction between pruning and quantization, showing that the two are not orthogonal as assumed by previous works. Furthermore, they argue that pruning followed by quantization is universally optimal. However, their framework suffers from three significant limitations. First, it relies on oversimplified assumptions that hinder practical applicability. Specifically, they focus solely on magnitude-based pruning (which removes weights based on their absolute values) and max-scaled block-wise quantization (which uniformly rescales blocks using their maximum value), both of which are naïve approaches that are less practical and often fail to preserve accuracy.
Second, their analysis is confined to a minimal set of scenarios, failing to address diverse architectures or methods. Beyond the limited set of methods, their experiments consider only the combination of two techniques, pruning and quantization, on decoder-only LLMs, lacking broader coverage of models and compression approaches. Lastly, their framework cannot be generalized across settings, as many counterexamples show that pruning-before-quantization is not always optimal. Motivated by these gaps, we aim for a more general formulation that holds across methods, models, and metrics, thereby introducing the Progressive Intensity Hypothesis.

D.6 Violation Cases of the Hypothesis

Although our hypothesis is highly general and robust, we still observe cases where it does not hold. These cases largely fall into three categories: severe performance collapse, full model re-training, and an increase of order-affected layers. First, each model exhibits a different tolerance to compression, with performance dropping exponentially beyond a certain ratio. While these settings are impractical due to severe performance loss, we observe cases where applying the stronger method first performs better. This may be because the error is already too large, violating our assumption of well-designed compression in Section 4; applying the stronger method first might help reduce the total error. For less compression-robust models such as decoder-only LLMs, we observe earlier breakdowns, such as a diminishing advantage when the pruning ratio increases at fixed bit-width (Figure 3(a)). Second, when strong full-model re-training is applied, the advantage from compression order may invert. Compression order then serves merely as an initialization, and the retraining process dominates, making it difficult to attribute outcomes to order alone. We plan to investigate these and potentially other exceptions more rigorously in future work.
Lastly, in practical situations, increasing $C_P$ may result in an increase of order-affected layers, leading to a violation of the hypothesis. We analyze the details in the following paragraph.

Figure 16: If the number of pruned units increases by one, the change in unit allocation across groups falls into three distinct cases. See Appendix D.6 for details.

Impact of $C_P$. Without loss of generality, we consider the case where only $C_P$ increases. Increasing $C_P$ implies a stronger pruning effect, since it lowers $\mathcal{M}(P(\varphi))$, resulting in a decrease in $G(P, Q)$ and a corresponding increase in $C_P^*$. Hence, to ensure monotonicity and satisfy Hypothesis 1, $A(Q{\to}P)$ should increase accordingly. Note that $C_Q$ is fixed while analyzing the effect of $C_P$. To analyze the effect of increasing $C_P$, we first consider a local step in which the total number of pruned units increases by one. For the initial pruning ratio $p$ and the increased ratio $p'$ under the same granularity $t_P$, the following relation holds:

$$p' \cdot |U(\varphi; t_P)| = p \cdot |U(\varphi; t_P)| + 1.$$

From the definition of the compression ratio $C_P = 1/(1 - p)$, larger pruning ratios correspond to larger compression ratios. By repeating this incremental process, we can construct any pruning ratio. Under disjoint selectivity, each unit is exclusively assigned to one compression method, allowing us to partition the units into four disjoint groups as discussed in Appendix B.1. Increasing the pruning ratio from $p$ to $p'$ results in three possible changes in the group configuration: the number of layers affected by order (i.e., $|G_1|$ or $|G_2|$; note that the two values are equal) 1) decreases by one, 2) remains unchanged, or 3) increases by one. The three cases are visualized in Figure 16. From Theorem 1, $A(Q{\to}P) = \beta \cdot \left(\sum_{l_i \in G_2} g(l_i) - \sum_{l_i \in G_1} g(l_i)\right)$ (where $g(l_i) = \|\delta_Q(W_i, X_i)\|_F^2 - \|{-W_i X_i}\|_F^2$) should be preserved or increased to satisfy Hypothesis 1.
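The incremental bookkeeping above can be checked directly. The snippet below (the unit count is an arbitrary illustrative choice, not from the paper) walks the pruning ratio up one unit at a time and tracks the resulting compression ratio.

```python
# Sketch of the local step  p' * |U| = p * |U| + 1  and  C_P = 1 / (1 - p).
# The number of prunable units |U| is a hypothetical illustrative value.
num_units = 100          # |U(phi; t_P)|
pruned = 0               # currently pruned units, so p = pruned / num_units

ratios = []
while pruned < 50:       # raise the pruning ratio one unit at a time
    pruned += 1          # p' * |U| = p * |U| + 1
    p = pruned / num_units
    ratios.append(1 / (1 - p))   # C_P = 1 / (1 - p)

# Larger pruning ratios give strictly larger compression ratios.
assert all(a < b for a, b in zip(ratios, ratios[1:]))
print(round(ratios[0], 4), round(ratios[-1], 4))
```

Each step changes the group configuration $(G_1, G_2, G_3, G_4)$ in exactly one of the three ways analyzed next, which is why it suffices to study the single-unit increment.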
However, this condition is fulfilled only in Cases 1 and 2, but not in Case 3.

• Case 1: Number of layers affected by order decreases by one. If a layer is no longer affected by the compression order, then another layer must also leave its group so that the total number of pruned layers still increases by one. Thus, $|G_1|$ and $|G_2|$ each decrease by one, while $|G_3|$ increases by two. Similar to Case 1 of Appendix B.2, as the increase in $C_P$ eliminates a negative loss term, $A(Q{\to}P)$ increases.

• Case 2: Number of layers affected by order remains unchanged. If the additionally pruned layer is pruned regardless of the compression order, then $|G_3|$ increases by one while $|G_4|$ decreases by one. Similar to Case 2 of Appendix B.2, as the loss-contributing groups $G_1$ and $G_2$ do not change, the compression order advantage $A(Q{\to}P)$ remains unaffected.

• Case 3: Number of layers affected by order increases by one. The two cases above cannot increase the number of order-dependent units, whereas there must exist a case where both $|G_1|$ and $|G_2|$ increase by one. As the total number of pruned layers should increase by one, the added layers must originate from $G_4$, i.e., $|G_4|$ decreases by two. In contrast to Case 1, $A(Q{\to}P)$ may decrease due to the addition of negative loss terms.

As the conditions under which each case emerges differ across specific configurations, this phenomenon is not analyzed in general settings. We leave a precise characterization of the conditions under which increasing the pruning ratio may invalidate the Progressive Intensity Hypothesis as an important direction for future work.

D.7 Additional Remarks

Limitations of Current Work. We introduce a broadly applicable hypothesis that can be extended to diverse compression methods and model types across different domains. Still, we acknowledge three important limitations in our current work.
First, due to the general nature of our framework, it does not provide a detailed analysis for each specific combination of methods. While our hypothesis captures high-level trends, it does not identify the best compression sequence for individual cases. This motivates research into discovering the best compression orderings under practical scenarios. Second, our study is limited to joint model compression in plug-and-play settings where methods are combined post hoc. As demands for higher compression grow, integrated design strategies should be investigated beyond simple combinations. Lastly, our framework does not yet provide explicit predictive rules or a precise estimate of how much better one compression order is than another. Capturing the nonlinear and cross-layer effects required for precise sign or value prediction of the compression-order advantage $A(\cdot)$ remains an open problem. We therefore consider the development of predictive models for $A(\cdot)$ and meta-learning approaches for automatic order selection as a promising direction for future research.

Future Work. In addition to addressing the aforementioned limitations, future directions may include extensions of our current framework. First, a systematic study of interference across different pipeline designs would provide deeper insights beyond our current empirical findings. Another direction is to automate compression order selection based on observed trends. A unified approach that generalizes across cases may offer a better understanding of the role of compression order. Lastly, evaluating our hypothesis on emerging architectures such as Mixture-of-Experts and multimodal LLMs may broaden its generality.

Usage of AI Assistants. We employ ChatGPT (GPT-4o; https://chatgpt.com/) and Perplexity (https://w.perplexity.ai/) exclusively for language polishing, i.e., for improving grammar and clarity at the sentence level.
We do not use them for any research-related tasks, including code implementation, theoretical derivation, and result analysis.