Paper deep dive
Evolution Strategy-Based Calibration for Low-Bit Quantization of Speech Models
Lucas Rakotoarivony
Abstract
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it using a two-step local-global scheme driven by an evolution strategy. ESC enables unaltered performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with PTQ methods further reduces performance loss, achieving a 1% relative accuracy degradation on the AST model.
Tags
Links
- Source: https://arxiv.org/abs/2603.08173v1
- Canonical: https://arxiv.org/abs/2603.08173v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 12:40:25 AM
Summary
The paper introduces Evolution Strategy-Based Calibration (ESC), a novel two-step optimization method for calibrating activation scaling factors in speech models. By combining local MSE-based initialization with a global evolution strategy (CMA-ES), ESC addresses the large dynamic range challenges of audio activations, enabling near-lossless performance for INT8 and INT4 quantization across various speech tasks.
Entities (6)
Relation Signals (4)
ESC → uses → CMA-ES
confidence 100% · solves it using a two-step local-global scheme driven by an evolution strategy... we adopt the CMA-ES algorithm
ESC → applied to → Conformer
confidence 95% · Experiments on multiple speech tasks show that ESC consistently outperforms existing calibration methods
ESC → integrates with → PTQ
confidence 95% · Integrating ESC with PTQ methods further reduces performance loss
ESC → optimizes → Activation Scaling
confidence 95% · ESC... formulates activation scaling as an optimization problem
Cypher Suggestions (2)
Find all models that have been calibrated using the ESC method. · confidence 90% · unvalidated
MATCH (m:Model)-[:CALIBRATED_BY]->(e:Method {name: 'ESC'}) RETURN m.name
Identify methods that integrate with ESC. · confidence 90% · unvalidated
MATCH (e:Method {name: 'ESC'})-[:INTEGRATES_WITH]->(t:Technique) RETURN t.name
Full Text
31,351 characters extracted from source content.
Evolution Strategy-Based Calibration for Low-Bit Quantization of Speech Models
Lucas Rakotoarivony, Thales, cortAIx Labs, 1 Av. Augustin Fresnel, 91120 Palaiseau, France. lucas.rakotoarivony@thalesgroup.com

Abstract
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In particular, we show that audio activations can exhibit large calibration ranges, leading to significant information loss when standard calibration techniques are applied. To address this, we propose ESC, an Evolution Strategy-based Calibration method that formulates activation scaling as an optimization problem and solves it using a two-step local-global scheme driven by an evolution strategy. ESC enables unaltered performance under full INT8 quantization and is the first calibration method to achieve near-lossless performance for full INT4 quantization across multiple speech tasks. Integrating ESC with PTQ methods further reduces performance loss, achieving a 1% relative accuracy degradation on the AST model.
Index Terms: quantization, evolution strategy, speech models

1. Introduction
Modern speech models have achieved near human-level performance on many tasks thanks to large-scale pretraining on massive datasets [1, 2] and advanced architectures such as transformer-based models [3, 4]. However, deploying these models in real-world scenarios with limited memory and computational resources typically requires quantization into hardware-friendly integer formats. Quantization [5] is widely used for low-bit neural network deployment because it reduces the numerical precision of weights and activations, enabling faster inference with integer operations and lowering memory and storage costs.
Although quantization has been extensively studied in computer vision [6, 7] and natural language processing (NLP) [8, 9], the audio domain remains underexplored. Most existing audio work [10, 11] relies on quantization-aware training (QAT) [12], which requires access to a non-negligible portion of the training data. Some studies [13, 14, 15] investigate post-training quantization (PTQ) for speech models, but they mainly target specific architectures [13, 15] or focus on weight quantization [14], often neglecting activation quantization, which is essential for fully integer inference. As a result, a complete integer quantization pipeline for general speech models remains an open problem.

This context highlights the need for quantization techniques tailored to the characteristics of audio signals that preserve model performance while substantially reducing model size. As shown in the left part of Figure 1, audio activations can exhibit extremely large dynamic ranges, unlike typical activations in vision and NLP tasks. Consequently, standard calibration methods [18, 19, 20, 21] that estimate quantization ranges often produce highly unbalanced quantization bins, causing most values to be mapped to the same integer level, leading to severe information loss, as illustrated in the right part of Figure 1.

Figure 1 (plots omitted in extraction): Illustration of quantization behavior across audio (Conformer [3]), vision (ResNet [16]), and NLP (BERT [17]) models. Left: Cumulative distribution of normalized activation values, showing an approximately uniform distribution for ResNet, a rapidly saturating distribution for BERT, and a highly compressed distribution for Conformer. Right: Relative performance under weight and activation quantization using max calibration. While all models maintain good performance with 4-bit weight quantization, 4-bit activation quantization severely degrades performance for Conformer, unlike ResNet and BERT.

Motivated by this challenge, we propose a new calibration method called Evolution Strategy-Based Calibration (ESC), which uses an evolution strategy [22] to optimize activation scaling [23] through an explicit optimization formulation. As shown in Figure 1, activation calibration is a key difficulty in speech model quantization. We address this by formulating calibration as a two-step optimization process that integrates local and global objectives. First, scale factors are initialized using an MSE-based approach [21] that minimizes the reconstruction error between FP32 and quantized layer outputs. Then, inspired by global optimization methods such as BRECQ [7] and QAT [12], we formulate the problem as a joint optimization over all activation scale factors and solve it using an evolution strategy to handle its non-smooth and non-differentiable nature. Experiments on multiple speech tasks show that ESC consistently outperforms existing calibration methods and achieves lossless performance for full INT8 quantization. For INT4 settings, when combined with state-of-the-art PTQ methods [6, 7], ESC calibration achieves near-lossless quantization while incurring only a modest performance drop and maintaining high accuracy.

The main contributions of this paper are summarized as follows:
• We formulate calibration as a local-global optimization problem and propose a novel calibration scheme that uses an evolution strategy to minimize quantization error.
• We conduct extensive experiments demonstrating the superiority of ESC over standard calibration schemes and showing minimal performance degradation across various models.
• We deploy the quantized models and observe an average inference speedup of 2.31×, along with a substantial reduction in memory usage.

2. Related Work
2.1. Quantization
Quantization enables deployment of neural networks in resource-constrained settings by reducing memory and computational requirements [5]. It maps floating-point values to discrete integers while aiming to preserve model accuracy. A key challenge, called calibration, is the selection of the scaling factor, which is determined by a clipping range that truncates real-valued inputs and influences model performance. Traditional min-max statistics [18] are sensitive to outliers, while alternatives include percentile-based calibration [19] or optimization-based criteria [20, 21]. PTQ further mitigates accuracy loss without retraining and has been studied for vision models, such as CNNs [6, 7], ViTs [24, 25], and diffusion models [26], as well as for language models [8, 27]. Quantization for speech models, however, remains relatively underexplored [11, 14, 15].

2.2. Evolution Strategies
Evolutionary algorithms (EA) [28] are stochastic, population-based optimizers inspired by natural evolution. They iteratively improve candidate solutions through selection, evolution, and evaluation, enabling effective approximation of optima for complex problems. Evolution Strategies (ES) [22], a common EA subclass for continuous optimization, rely on mutation of real-valued parameters and self-adaptive step sizes. Notable ES variants include estimation-of-distribution methods such as CMA-ES [29], natural evolution strategies [30, 31], and finite-difference methods like OpenAI-ES [32].

3. Methods
3.1. Quantization Formulation
As proposed in [23], we employ the widely used quantization scheme defined as follows:

Q(r) = Int(r/s) − Z,   (1)

Here, Q denotes the quantization operator, r is a real-valued input (either an activation or a weight), s is a real-valued scaling factor, and Z is an integer zero point.
The Int function maps a real value to an integer through a rounding operation. This approach is referred to as uniform quantization, as the resulting quantization levels are evenly spaced. We adopt this strategy for hardware-friendly compatibility, since non-uniform quantization schemes are typically challenging to implement efficiently on general-purpose hardware, such as GPUs and CPUs [23]. A critical factor in this process is the selection of the scaling factor s. As proposed in [23], s is defined as:

s = (β − α) / (2^b − 1),   (2)

Here, [α, β] represents the clipping range, an interval used to clip the real-valued inputs, and b is the quantization bit width. Consequently, defining the scaling factor requires first determining the clipping range. We specifically consider the case where α = −β, referred to as symmetric quantization, which simplifies the process by setting the zero point Z to 0.

Figure 2 (diagram omitted in extraction): Overview of the proposed ESC method. First, each layer-wise activation scaling factor is locally optimized by minimizing the MSE between the FP32 and quantized layer outputs. Then, all scaling factors are jointly refined using the CMA-ES algorithm to minimize the task-specific error between the quantized model output ŷ and the target y.

The process of selecting the clipping range, and consequently the scaling factor s, is commonly referred to as calibration. Weight calibration is relatively straightforward, as the weights of a trained model are fixed and typically follow a Gaussian distribution [33, 34]. In contrast, activation distributions are more sensitive and can vary substantially across samples.
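To make the formulation concrete, here is a minimal Python sketch of symmetric uniform quantization (Equations 1 and 2 with Z = 0) contrasting Max calibration with an MSE-style grid search over candidate clipping ranges. The synthetic "audio-like" activation tensor and the candidate grid are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize(r, s, b):
    """Uniform symmetric quantization: round(r / s), clipped to the b-bit integer range (Z = 0)."""
    qmin, qmax = -(2 ** (b - 1)), 2 ** (b - 1) - 1
    return np.clip(np.round(r / s), qmin, qmax)

def dequantize(q, s):
    """Map integer levels back to real values to measure reconstruction error."""
    return q * s

def scale_from_beta(beta, b):
    """Eq. (2) with a symmetric clipping range [-beta, beta]: s = 2*beta / (2^b - 1)."""
    return 2.0 * beta / (2 ** b - 1)

def recon_mse(r, beta, b):
    s = scale_from_beta(beta, b)
    return float(np.mean((r - dequantize(quantize(r, s, b), s)) ** 2))

rng = np.random.default_rng(0)
# Mostly small values plus a few large outliers, mimicking the wide audio activation ranges.
acts = np.concatenate([rng.standard_normal(4096), [60.0, -55.0, 48.0]])
b = 4

# Max calibration: beta is the largest observed magnitude (outlier-sensitive).
beta_max = float(np.abs(acts).max())
err_max = recon_mse(acts, beta_max, b)

# MSE-style calibration: pick the candidate beta minimizing reconstruction error.
candidates = np.linspace(0.05 * beta_max, beta_max, 100)
err_by_beta = [recon_mse(acts, c, b) for c in candidates]
beta_mse = float(candidates[int(np.argmin(err_by_beta))])
err_mse = min(err_by_beta)

print(f"beta_max={beta_max:.2f} mse={err_max:.4f} | beta_mse={beta_mse:.2f} mse={err_mse:.4f}")
```

Because the Max range is itself among the candidates, the MSE-selected β can never do worse on the calibration data; on heavy-tailed activations it typically clips the outliers and spends the 2^b levels on the bulk of the distribution.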
Therefore, activation calibration generally requires n samples to estimate a representative activation distribution. Standard strategies set β either to the maximum observed value or to a chosen percentile of the activation distribution.

3.2. Scale Initialization
Since audio activations can exhibit extremely large dynamic ranges, standard calibration strategies such as Max [18] or Percentile [19] often produce poorly distributed quantization bins, leading to significant information loss, as illustrated in Figure 1. To address this, rather than selecting the scaling factor solely based on the activation distribution, we cast the calibration process as an optimization problem that directly minimizes the task-specific error, as defined below:

S* = argmin_S E(f_q(x; S), y),   (3)

where f_q(·) denotes the already trained quantized neural network with fixed weights, S = (s_1, ..., s_N) is the vector of per-layer activation scale parameters, x represents the calibration samples, y are the reference targets, and E(·) is the task-dependent error metric. To initialize S, we adopt the MSE-based calibration algorithm [21], which treats the optimization as a local procedure. Specifically, for each layer i = 1, ..., N, the algorithm optimizes its activation scale s_i independently:

s_i* = argmin_{s_i} ∥l(x) − l_q(x; s_i)∥₂²,   (4)

where l is the layer of interest and l_q is its quantized counterpart. Collectively, the resulting scales S provide a stable initialization point for subsequent global refinement.

Table 1: Comparison of calibration methods across multiple models and speech tasks for fully 4-bit and 8-bit quantized models. The symbols ↓ and ↑ indicate that lower and higher values are better, respectively. Best results were shown in bold in the original paper.

| Method | Bits | Conformer WER↓ | Conformer CER↓ | ECAPA EER↓ | ECAPA minDCF↓ | MP-SENet PESQ↑ | MP-SENet STOI↑ | FastSpeech 2 Mel↓ | FastSpeech 2 PostNet↓ | AST Acc↑ | AST mAP↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full precision | 32 | 15.94 | 4.84 | 0.97 | 7.17 | 2.12 | 93.67 | 40.18 | 40.09 | 98.13 | 99.98 |
| Max [18] | 8 | 16.54 | 5.04 | 1.11 | 7.99 | 2.16 | 93.31 | 40.88 | 40.72 | 98.12 | 99.98 |
| Percentile [19] | 8 | 16.07 | 4.85 | 0.99 | 7.06 | 2.11 | 92.88 | 40.55 | 40.58 | 98.06 | 99.95 |
| Entropy [20] | 8 | 16.27 | 4.93 | 18.70 | 78.33 | 2.11 | 93.55 | 161.48 | 161.59 | 90.09 | 99.65 |
| MSE [21] | 8 | 16.09 | 4.87 | 0.99 | 7.53 | 2.09 | 92.72 | 40.45 | 40.49 | 98.07 | 99.96 |
| ESC | 8 | 16.01 | 4.83 | 0.94 | 7.65 | 2.11 | 92.90 | 40.35 | 40.32 | 98.15 | 99.96 |
| Max [18] | 4 | 144.14 | 84.81 | 44.40 | 99.99 | 1.16 | 67.52 | 231.40 | 231.25 | 3.71 | 50.00 |
| Percentile [19] | 4 | 50.83 | 18.60 | 26.42 | 97.49 | 2.22 | 90.10 | 138.99 | 138.98 | 95.51 | 99.90 |
| Entropy [20] | 4 | 50.83 | 18.76 | 19.44 | 86.12 | 2.44 | 93.64 | 297.44 | 297.60 | 63.98 | 96.15 |
| MSE [21] | 4 | 41.22 | 14.61 | 13.07 | 65.66 | 2.50 | 92.99 | 130.56 | 130.41 | 96.03 | 99.91 |
| ESC | 4 | 38.49 | 13.50 | 11.28 | 63.05 | 2.51 | 93.79 | 98.34 | 98.27 | 96.41 | 99.94 |

3.3. Evolution Strategy Optimization
Inspired by BRECQ [7] and QAT [12] approaches, we argue that optimizing quantization scales only locally is insufficient to achieve optimal performance, as such methods do not account for cross-layer dependencies. To address this limitation, we propose to optimize the scale vector S globally using an evolution strategy, with the objective of minimizing the loss defined in Equation (3). As discussed in Section 2.2, evolution strategies are well suited for the optimization of multiple continuous variables in settings where the objective function is non-convex and non-differentiable. In particular, we adopt the CMA-ES [29] algorithm, which provides a robust gradient-free optimization framework for continuous parameters and has demonstrated strong performance on ill-conditioned and noisy objective functions [29, 35].
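The global refinement step can be sketched with a toy example. The code below jointly tunes two activation scales of a tiny two-layer network against the task-level objective of Equation (3), using a simplified isotropic (μ, λ) evolution strategy as a stand-in for CMA-ES (no covariance or step-size adaptation); the network, shapes, initial scales, and elite tracking are all illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(x, s, b=4):
    """Quantize-dequantize an activation tensor with scale s (symmetric, Z = 0)."""
    qmax = 2 ** (b - 1) - 1
    return np.clip(np.round(x / s), -qmax - 1, qmax) * s

def task_error(scales, x, w1, w2):
    """Objective E(f_q(x; S), y) on a toy two-layer net; y is the FP32 output."""
    y = np.maximum(x @ w1, 0.0) @ w2
    h_q = fake_quant(np.maximum(x @ w1, 0.0), scales[0])
    y_q = fake_quant(h_q @ w2, scales[1])
    return float(np.mean((y - y_q) ** 2))

x = rng.standard_normal((64, 8))
w1 = rng.standard_normal((8, 16)) / np.sqrt(8)
w2 = rng.standard_normal((16, 4)) / np.sqrt(16)

s_init = np.array([0.25, 0.25])          # stand-in for the MSE-initialized per-layer scales
best_s, best_err = s_init.copy(), task_error(s_init, x, w1, w2)

# (mu, lambda)-ES with a fixed step size and an evaluation budget Gamma = 100.
mean, sigma, lam, mu, budget = s_init.copy(), 0.1, 10, 3, 100
evals = 0
while evals + lam <= budget:
    pop = np.maximum(np.abs(mean + sigma * rng.standard_normal((lam, mean.size))), 1e-3)
    fit = np.array([task_error(s, x, w1, w2) for s in pop])
    evals += lam
    mean = pop[np.argsort(fit)[:mu]].mean(axis=0)   # recombine the mu elites into the new mean
    if fit.min() < best_err:                        # track the best-seen S for robustness
        best_err, best_s = float(fit.min()), pop[int(np.argmin(fit))].copy()

print(f"init error={task_error(s_init, x, w1, w2):.4f}  refined error={best_err:.4f}")
```

The paper instead adopts full CMA-ES, which also adapts σ_t and a covariance matrix C^(t) as in Equation (5) and returns the mean of the final search distribution rather than the single best sample; an off-the-shelf implementation such as the `cma` package could replace the hand-rolled loop above.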
At each iteration t, CMA-ES samples a population of λ candidate scale vectors from a multivariate normal distribution:

S_k^(t) ∼ N(m^(t), σ_t² C^(t)),   k = 1, ..., λ,   (5)

where m^(t) denotes the mean of the search distribution, C^(t) is the covariance matrix encoding parameter correlations, and σ_t is the global step size controlling the exploration radius. Each sampled candidate S_k^(t) is evaluated using the objective defined in Equation (3). Based on the ranking of these candidates, the parameters (m^(t), C^(t), σ_t) are updated following the standard CMA-ES update rules. The optimization process terminates when the total number of objective function evaluations reaches a predefined budget Γ.

To obtain the final optimized scaling vector S, we use the mean of the final sampling distribution produced by the CMA-ES algorithm instead of the single best-evaluated solution, in order to improve robustness [29]. An overview of the proposed ESC method is provided in Figure 2.

4. Experiments
4.1. Experimental Setup
To ensure a representative evaluation across speech processing domains, we conduct experiments on five widely used speech-based tasks: speech recognition, speaker recognition, speech enhancement, text-to-speech, and audio classification. For each task, the corresponding model is evaluated on the official test split of the associated dataset.
• Speech recognition: Conformer [3] on LibriSpeech [36], evaluated using Word Error Rate (WER) and Character Error Rate (CER).
• Speaker recognition: ECAPA [37] on VoxCeleb [38], evaluated using Equal Error Rate (EER) and minimum Detection Cost Function (minDCF).
• Speech enhancement: MP-SENet [39] on VoiceBank-DEMAND [40], evaluated using Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI).
• Text-to-speech: FastSpeech 2 [41] on LJSpeech [42], evaluated using Mel-spectrogram loss and PostNet loss.
• Audio classification: AST [4] on Speech Commands V2 [43], evaluated using Accuracy (Acc) and mean Average Precision (mAP).

All evaluated models are fully quantized to either INT8 or INT4 precision, with both weights and activations quantized to enable a completely integer-only inference pipeline. For both calibration and the evolution strategy, we use the same n = 100 samples from the training set for each task. To determine the scaling factors for activation quantization of the Convolutional, Linear, and LayerNorm operators, we apply our proposed method ESC. For all other operators, as well as for weight quantization, we employ the Max calibration strategy. For the CMA-ES algorithm, the initial step size is set to σ = 0.1 and the predefined budget to Γ = 100. All experiments, including calibration, evolution strategy, PTQ, and inference, are conducted on an NVIDIA RTX 3090 GPU.

4.2. Comparison with Baseline Calibration Methods
We evaluate several popular calibration methods across multiple tasks and models using the pytorch-quantization [44] framework, which provides implementations of Max [18], Percentile [19], Entropy [20], and MSE [21] calibration. For the Percentile method, we test the 99.99, 99.999, and 99.9999 percentiles and report the best result for simplicity.

As shown in Table 1, the proposed ESC method consistently outperforms baseline strategies, especially in INT4 quantization. Max performs reasonably in INT8 but drops in INT4 due to outliers compressing the activation distribution. Percentile often achieves strong results by removing extreme outliers, but its optimal percentile varies across models and tasks, requiring extra tuning. Entropy works well for some architectures like Conformer and ECAPA, but can significantly degrade performance for others, such as FastSpeech 2 and AST, even in INT8. Among baselines, MSE provides the best overall performance, and since our initialization uses MSE-derived scaling factors, applying ESC further improves results. On average, ESC achieves the best INT8 performance and substantial gains across all INT4 scenarios.

Interestingly, quantization can even improve performance for some models. For MP-SENet, ESC in INT4 achieves a PESQ of 2.51, an 18% relative improvement over FP32, likely due to the regularizing effect of quantization, which suppresses minor weight contributions and stabilizes outputs for cleaner speech. For AST, ESC in INT4 causes only a 1.75% relative accuracy drop, showing minimal degradation under fully INT4 quantization.

4.3. Integration with State-of-the-Art PTQ Methods
Since our method is a calibration strategy for determining activation scaling factors, it can be easily combined with existing PTQ techniques.

Table 2: Evaluation of PTQ techniques from the NLP and vision domains applied on top of ESC calibration.

| Method | Bits | Conformer WER↓ | Conformer CER↓ | ECAPA EER↓ | ECAPA minDCF↓ | MP-SENet PESQ↑ | MP-SENet STOI↑ | FastSpeech 2 Mel↓ | FastSpeech 2 PostNet↓ | AST Acc↑ | AST mAP↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full precision | 32 | 15.94 | 4.84 | 0.97 | 7.17 | 2.12 | 93.67 | 40.18 | 40.09 | 98.13 | 99.98 |
| ESC | 4 | 38.49 | 13.50 | 11.28 | 63.05 | 2.51 | 93.79 | 98.34 | 98.27 | 96.41 | 99.94 |
| Adaround [6] | 4 | 106.79 | 75.44 | 21.76 | 89.39 | 2.45 | 92.67 | 83.53 | 83.46 | 88.53 | 99.70 |
| NoisyQuant [24] | 4 | 39.89 | 13.96 | 11.28 | 63.05 | 2.50 | 93.61 | 95.80 | 95.73 | 96.26 | 99.94 |
| DiTAS [26] | 4 | 99.24 | 84.58 | 11.68 | 66.00 | 2.51 | 94.51 | 98.34 | 98.27 | 96.40 | 99.94 |
| HyQ [25] | 4 | 98.01 | 70.43 | 8.20 | 49.70 | 2.16 | 93.31 | 99.19 | 99.02 | 96.76 | 99.95 |
| BRECQ [7] | 4 | 64.17 | 27.66 | 12.01 | 62.45 | 2.21 | 96.83 | 84.59 | 84.45 | 96.17 | 99.89 |
| SmoothQuant [8] | 4 | 39.02 | 13.60 | 11.28 | 63.05 | 2.54 | 96.32 | 98.34 | 98.27 | 94.38 | 99.88 |
| BC [27] | 4 | 38.57 | 13.58 | 11.27 | 63.87 | 2.53 | 93.90 | 90.21 | 90.14 | 96.23 | 99.91 |
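As a sketch of how such a combination works, the snippet below applies a SmoothQuant-style per-channel smoothing transform (one of the PTQ techniques evaluated alongside ESC) before calibration: activation outliers are migrated into the weights by an exactly equivalent rescaling, so any calibration method then sees a tamer activation range. The shapes, the injected outlier channel, and α = 0.5 are illustrative assumptions following the general SmoothQuant recipe, not this paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((128, 16))   # calibration activations (batch x channels)
X[:, 3] *= 40.0                      # one outlier channel, the typical trigger for smoothing
W = rng.standard_normal((16, 8))     # linear-layer weights (channels x outputs)

# Per-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

# Mathematically exact rewrite: X @ W == (X / s) @ (diag(s) @ W).
X_s, W_s = X / s, W * s[:, None]

print(f"max|X| {np.abs(X).max():.1f} -> max|X_s| {np.abs(X_s).max():.1f}")
```

After this transform, activation calibration (ESC or any baseline) is run on X_s instead of X, while the smoothed weights W_s are quantized as usual; the FP32 function is unchanged, only the quantization difficulty is redistributed between activations and weights.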
As discussed in Section 2.1, PTQ for speech models remains relatively underexplored. However, many PTQ methods developed for vision and NLP [6, 7, 8, 27, 24, 25, 26] can be directly applied to speech models. Table 2 presents the performance of various PTQ strategies in the audio domain when combined with ESC calibration. Overall, no single method consistently outperforms the others, as results vary significantly across different models. Some approaches that perform well on certain models can substantially degrade performance on others, highlighting the need for PTQ methods specifically tailored to the audio domain.

Methods such as NoisyQuant [24], BC [27], and SmoothQuant [8] generally provide small but consistent improvements over the baseline without significant degradation, showing that techniques transferred from other domains can benefit speech models. Larger gains are observed in specific cases: HyQ [25] applied to ECAPA and Adaround [6] applied to FastSpeech 2 improve performance by 27% and 15%, respectively, relative to the ESC baseline. In addition, combining AST with HyQ achieves 96.76% accuracy, approaching FP32 performance. These results indicate that, when paired with suitable PTQ strategies, ESC calibration can enable near-lossless quantization.

Table 3: Comparison of inference latency and model size between FP32 and INT8 versions of several speech models.

| Model | Latency FP32 (ms) | Latency INT8 (ms) | Speedup | Size FP32 (MB) | Size INT8 (MB) |
|---|---|---|---|---|---|
| Conformer | 7.27 | 5.42 | 1.34× | 112.63 | 43.64 |
| ECAPA | 2.19 | 1.03 | 2.13× | 63.58 | 59.69 |
| MP-SENet | 55.95 | 38.35 | 1.46× | 23.86 | 9.15 |
| FastSpeech 2 | 23.86 | 15.45 | 1.54× | 398.70 | 216.12 |
| AST | 25.11 | 4.95 | 5.07× | 331.69 | 113.45 |

4.4. Latency and Model Size Evaluation
We evaluate the inference latency and memory footprint of quantized speech models deployed on an NVIDIA GeForce RTX 3090 GPU, which provides Tensor Cores supporting accelerated INT8 execution. Models are exported using TorchScript [45] tracing and deployed with TensorRT [46].
Although this GPU is chosen for its mature software ecosystem [45, 46] and reliable deployment tools, our quantization and export strategy is not limited to GPUs and can be extended to other hardware platforms, including embedded AI processors [47, 48]. Since Tensor Cores accelerate INT8 operations for both weights and activations, but only accelerate INT4 operations for weights, we restrict our experiments to 8-bit quantized models. Table 3 compares FP32 and INT8 implementations in terms of latency and model size. Results show that INT8 models consistently reduce memory usage and achieve notable speedups, ranging from 1.34× to 5.07×.

5. Conclusion
In this paper, we proposed a novel two-stage calibration scheme that combines local MSE-based optimization with a global evolution strategy to optimize activation scaling factors in speech models. Our study highlights that, unlike in vision or NLP, audio models are particularly sensitive to activation quantization, providing strong motivation for our approach. Experimental results demonstrate that our method preserves full-precision performance for fully 8-bit models while achieving an average inference speedup of 2.31×. Moreover, when applied to fully 4-bit models in combination with modern PTQ techniques, our calibration scheme delivers near-lossless performance across a wide range of speech models and tasks.

6. References
[1] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[2] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[3] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.
[4] Y. Gong, Y.-A. Chung, and J. Glass, "AST: Audio spectrogram transformer," arXiv preprint arXiv:2104.01778, 2021.
[5] W. Wang, W. Chen, Y. Luo, Y. Long, Z. Lin, L. Zhang, B. Lin, D. Cai, and X. He, "Model compression and efficient inference for large language models: A survey," arXiv preprint arXiv:2402.09748, 2024.
[6] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, "Up or down? Adaptive rounding for post-training quantization," in International Conference on Machine Learning. PMLR, 2020, pp. 7197–7206.
[7] Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, "BRECQ: Pushing the limit of post-training quantization by block reconstruction," arXiv preprint arXiv:2102.05426, 2021.
[8] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, "SmoothQuant: Accurate and efficient post-training quantization for large language models," in International Conference on Machine Learning. PMLR, 2023, pp. 38087–38099.
[9] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "GPTQ: Accurate post-training quantization for generative pre-trained transformers," arXiv preprint arXiv:2210.17323, 2022.
[10] Z. Li, H. Xu, Z. Jin, L. Meng, T. Wang, H. Wang, Y. Chen, M. Cui, S. Hu, and X. Liu, "Towards one-bit ASR: Extremely low-bit conformer quantization using co-training and stochastic precision," arXiv preprint arXiv:2505.21245, 2025.
[11] M. Kawamura, T. Hasumi, Y. Shirahata, and R. Yamamoto, "BitTTS: Highly compact text-to-speech using 1.58-bit quantization and weight indexing," arXiv preprint arXiv:2506.03515, 2025.
[12] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[13] D. Wagner, I. Baumann, K. Riedhammer, and T. Bocklet, "Outlier reduction with gated attention for improved post-training quantization in large sequence-to-sequence speech foundation models," arXiv preprint arXiv:2406.11022, 2024.
[14] T. Gu, B. Liu, H. Wang, and Y. Qian, "Ultra-low bit post-training quantization of large speech models via k-means clustering and mixed precision allocation," in Proc. Interspeech 2025, 2025, pp. 1988–1992.
[15] H. Shao, W. Wang, B. Liu, X. Gong, H. Wang, and Y. Qian, "Whisper-KDQ: A lightweight Whisper via guided knowledge distillation and quantization for efficient ASR," CoRR, 2023.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[18] V. Vanhoucke, A. Senior, M. Z. Mao et al., "Improving the speed of neural networks on CPUs," in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, vol. 1, no. 2011, 2011, p. 4.
[19] J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha, "Discovering low-precision networks close to full-precision networks for efficient embedded inference," arXiv preprint arXiv:1809.04191, 2018.
[20] S. Migacz, "NVIDIA 8-bit inference with TensorRT," GPU Technology Conference, 2017.
[21] Y. Choukroun, E. Kravchik, F. Yang, and P. Kisilev, "Low-bit quantization of neural networks for efficient inference," in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, 2019, pp. 3009–3018.
[22] I. Rechenberg, "Evolutionsstrategien," in Simulationsmethoden in der Medizin und Biologie: Workshop, Hannover, 29. Sept.–1. Okt. 1977. Springer, 1978, pp. 83–114.
[23] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, "A survey of quantization methods for efficient neural network inference," in Low-Power Computer Vision. Chapman and Hall/CRC, 2022, pp. 291–326.
[24] Y. Liu, H. Yang, Z. Dong, K. Keutzer, L. Du, and S. Zhang, "NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20321–20330.
[25] N. J. Kim, J. Lee, and H. Kim, "HyQ: Hardware-friendly post-training quantization for CNN-transformer hybrid networks," in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), vol. 8, 2024, pp. 4291–4299.
[26] Z. Dong and S. Q. Zhang, "DiTAS: Quantizing diffusion transformers via enhanced activation smoothing," in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 4606–4615.
[27] C. Gong, H. Zheng, M. Hu, Z. Lin, D.-P. Fan, Y. Zhang, and T. Li, "Minimize quantization output error with bias compensation," arXiv preprint arXiv:2404.01892, 2024.
[28] A. N. Sloss and S. Gustafson, "2019 evolutionary algorithms review," arXiv preprint arXiv:1906.08870, 2019.
[29] N. Hansen, "The CMA evolution strategy: A comparing review," in Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, pp. 75–102, 2006.
[30] T. Schaul, T. Glasmachers, and J. Schmidhuber, "High dimensions and heavy tails for natural evolution strategies," in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, 2011, pp. 845–852.
[31] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, "Natural evolution strategies," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, 2014.
[32] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, "Evolution strategies as a scalable alternative to reinforcement learning," arXiv preprint arXiv:1703.03864, 2017.
[33] H. Yu, T. Wen, G. Cheng, J. Sun, Q. Han, and J. Shi, "GDRQ: Group-based distribution reshaping for quantization," arXiv preprint arXiv:1908.01477, 2019.
[34] N. Ström, H. Khan, and W. Hamza, "Squashed weight distribution for low bit quantization of deep models," 2022.
[35] I. Loshchilov and F. Hutter, "CMA-ES for hyperparameter optimization of deep neural networks," arXiv preprint arXiv:1604.07269, 2016.
[36] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[37] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," arXiv preprint arXiv:2005.07143, 2020.
[38] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[39] Y.-X. Lu, Y. Ai, and Z.-H. Ling, "MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra," arXiv preprint arXiv:2305.13686, 2023.
[40] C. V. Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 159–165.
[41] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
[42] K. Ito and L. Johnson, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[43] P. Warden, "Speech Commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.
[44] NVIDIA, "pytorch-quantization," http://github.com/nvidia/tensorrt.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[46] NVIDIA, "TensorRT," https://developer.nvidia.com/tensorrt.
[47] P. A. Hager, B. Moons, S. Cosemans, I. A. Papistas, B. Rooseleer, J. Van Loon, R. Uytterhoeven, F. Zaruba, S. Koumousi, M. Stanisavljevic et al., "11.3 Metis AIPU: A 12nm 15TOPS/W 209.6 TOPS SoC for cost- and energy-efficient inference at the edge," in 2024 IEEE International Solid-State Circuits Conference (ISSCC), vol. 67. IEEE, 2024, pp. 212–214.
[48] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey of machine learning accelerators," in 2020 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2020, pp. 1–12.