Paper deep dive
Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration
Arundhathi Dev, Justin Zhan
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/22/2026, 6:03:46 AM
Summary
AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization) is an automated framework designed to optimize layer- and head-specific hyperparameters for sparse attention mechanisms in transformers. By combining Bayesian Optimization for global exploration and binary search for local refinement, the framework leverages multi-fidelity evaluation to reduce tuning costs by 3.4x compared to manual grid search, enabling plug-and-play acceleration for models like Llama-2-7B.
Entities (5)
Relation Signals (3)
Bayesian Optimization → is component of → AFBS-BO
confidence 100% · Our hybrid algorithm combines Bayesian Optimization for global exploration with binary search for local refinement
AFBS-BO → improves performance of → Llama-2-7B
confidence 95% · On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4x
AFBS-BO → optimizes → SpargeAttn
confidence 95% · AFBS-BO... discovers optimal layer- and head-specific hyperparameters... for sparse attention mechanisms
Cypher Suggestions (2)
Identify models accelerated by AFBS-BO. · confidence 95% · unvalidated
MATCH (f:Framework {name: 'AFBS-BO'})-[:IMPROVES_PERFORMANCE_OF]->(m:Model) RETURN m.name
Find all frameworks that optimize specific attention mechanisms. · confidence 90% · unvalidated
MATCH (f:Framework)-[:OPTIMIZES]->(a:AttentionMechanism) RETURN f.name, a.name
Abstract
Sparse attention mechanisms promise to break the quadratic bottleneck of long-context transformers, yet production adoption remains limited by a critical usability gap: optimal hyperparameters vary substantially across layers and models, and current methods (e.g., SpargeAttn) rely on manual grid search to identify them. We propose AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), a fully automated framework that discovers optimal layer- and head-specific hyperparameters without human intervention. Our hybrid algorithm combines Bayesian Optimization for global exploration with binary search for local refinement, leveraging multi-fidelity evaluation across sequence lengths to reduce tuning cost. On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4× with 8.8× fewer evaluations than grid search, and identifies high-sparsity configurations that outperform existing sparse attention baselines while closely matching dense attention quality. By transforming sparse attention from a manually tuned heuristic into a self-optimizing primitive, AFBS-BO enables plug-and-play acceleration across diverse transformer architectures and domains.
Links
- Source: https://arxiv.org/abs/2603.18417v1
- Canonical: https://arxiv.org/abs/2603.18417v1
Full Text
58,077 characters extracted from source content.
I Introduction
Transformer architectures have become the foundation of modern machine learning, powering breakthroughs across natural language processing, computer vision, and multimodal understanding through their self-attention mechanism’s ability to capture long-range dependencies [4]. However, this capability comes at a fundamental cost: the quadratic computational and memory complexity of attention, scaling as O(N^2) with sequence length N, creates an increasingly critical bottleneck as applications demand longer contexts and larger models [5].
Contemporary language models process sequences exceeding 128K tokens [6], video generation systems operate on 40K+ dimensional attention spaces [7], and diffusion models require dense attention across multi-scale feature hierarchies [8]. At these scales, the quadratic bottleneck transforms from an engineering inconvenience into a deployment barrier. Sparse attention mechanisms offer a principled solution by computing attention over only a carefully selected subset of key-value pairs, reducing complexity to O(N log N) or better while preserving end-to-end model quality. Recent training-free methods such as SpargeAttn [1] have demonstrated that dynamic sparsity, where important attention connections are identified per input without task-specific architectural modifications, can achieve 2-5× inference speedups across language modeling, image generation, and video synthesis. This universality represents a significant advance over prior sparse attention approaches, which rely on fixed patterns (local windows, strided attention, attention sinks) that fail to generalize across domains and input characteristics.
I-A The Barrier to Adoption: Fragile Manual Tuning
Despite their theoretical efficiency, sparse attention methods have not yet replaced dense attention in standard production pipelines. The primary obstacle is not algorithmic capability, but usability. Methods like SpargeAttn depend on three sensitive hyperparameters, τ (masking threshold), θ (similarity threshold), and λ (warp-skip threshold), whose optimal values vary significantly across models and even across layers within a single model. The original SpargeAttn paper explicitly highlights this dependency: “We use l_1 = 0.08, l_2 = 0.09 for Llama3.1, l_1 = 0.05, l_2 = 0.06 for CogvideoX” [1].
This reliance on manual configuration creates a “critical deployment barrier” that manifests in three ways: (1) The Grid Search Obstacle: Currently, deploying sparse attention requires an exhaustive grid search over ≈ 175 configurations per model. The need to expend significant GPU hours and expert oversight just to deploy sparse attention prevents its usage in dynamic environments or by practitioners lacking specialized tuning expertise; (2) Unaddressed Layer Heterogeneity: A single global setting forces a compromise: it is either too aggressive for sensitive deeper layers (destroying quality) or too conservative for redundant early layers (wasting efficiency). As visualized in Fig. 17 of [1], attention patterns are highly heterogeneous; treating them uniformly fails to fully leverage potential efficiency gains; (3) Production Fragility: Hyperparameters tuned via grid search on a validation set (e.g., WikiText) often fail to generalize to shifted production distributions (e.g., user queries). Without an automated mechanism to adapt, sparse attention remains a sensitivity-prone heuristic rather than a robust infrastructure component.
I-B Automated Adaptive Hyperparameter Discovery
We propose AFBS-BO, a fully automated framework that discovers optimal layer- and head-specific hyperparameters (τ, θ, λ) for sparse attention without human intervention.
Our key insight is that hyperparameter optimization for sparse attention exhibits unique structure: (1) multi-fidelity correlation—error profiles on short sequences (4K tokens, ≈ 5ms evaluation) rank-correlate (ρ ≥ 0.8) with long sequences (32K tokens, ≈ 21ms evaluation), enabling computationally efficient surrogate evaluation; (2) smooth local structure—despite discrete block quantization, the error landscape is locally Lipschitz-continuous, justifying gradient-free optimization; and (3) bounded search space—physical constraints (GPU warp sizes, memory bandwidth) restrict feasible hyperparameters to compact domains. AFBS-BO exploits this structure through a three-stage hybrid algorithm: Stage 1 (Bayesian Optimization): We model the low-fidelity (4K-token) error landscape as a Gaussian Process with Matérn 5/2 kernel and use Expected Improvement acquisition to identify 2-3 promising hyperparameter regions in 15 evaluations (≈ 125ms). This global exploration phase avoids local minima that trap greedy search. Stage 2 (Binary Search Refinement): Within each promising region, we apply binary search using high-fidelity (32K-token) evaluations to achieve precision Δs ≤ 0.0625—fine enough to distinguish hyperparameter differences smaller than typical grid spacings (0.05). Four binary iterations per region cost 168ms but ensure we extract maximum sparsity within error tolerance bands. Stage 3 (Multi-Input Validation): We validate the best configuration across 5 diverse validation inputs sampled from the target distribution, enforcing worst-case error ≤ ε_high. If validation fails, an automatic fallback reduces sparsity by 10%, costing an additional 105ms but guaranteeing robustness. This stage prevents overfitting to single-input characteristics.
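The per-layer budget implied by the three stage costs above can be tallied in a short script. This is a sketch using the millisecond figures quoted in the text (5ms per 4K-token evaluation, 21ms per 32K-token evaluation, 50ms GP overhead), not measurements:

```python
# Per-layer tuning budget implied by the quoted stage costs.
# All timings are the figures stated in the text, not measurements.

LOW_FID_MS = 5     # one low-fidelity (4K-token) evaluation
HIGH_FID_MS = 21   # one high-fidelity (32K-token) evaluation

def stage1_cost(n_evals=15, gp_overhead_ms=50):
    """Stage 1: Bayesian optimization on low-fidelity sequences."""
    return n_evals * LOW_FID_MS + gp_overhead_ms

def stage2_cost(n_regions=2, n_iters=4):
    """Stage 2: binary-search refinement on high-fidelity sequences."""
    return n_regions * n_iters * HIGH_FID_MS

def stage3_cost(n_inputs=5):
    """Stage 3: multi-input validation (a triggered fallback adds one more eval)."""
    return n_inputs * HIGH_FID_MS

def per_layer_cost():
    """Total per-layer cost without the fallback path."""
    return stage1_cost() + stage2_cost() + stage3_cost()
```

Running the tally reproduces the 125ms + 168ms + 105ms = 398ms per-layer figure used in the efficiency analysis later in the paper.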
I-C Key Results and Contributions Our hybrid approach delivers three fundamental advantages over existing methods: (1) Efficiency: AFBS-BO achieves 3.4× faster hyperparameter discovery than exhaustive grid search (3.0s vs. 10.1s for 12-layer Llama-2-7B) while requiring 8.8× fewer evaluations (240 vs. 2100). This speedup stems from multi-fidelity evaluation (62.5% of evaluations use computationally efficient 4K sequences) and warm-starting across layers (transferring learned posteriors reduces iterations from 15 to 8 for layers 2-12). (2) Quality: On WikiText-2, AFBS-BO discovers configurations achieving 7.45 perplexity at 70.7% sparsity—outperforming the current state-of-the-art H2O method (7.55 PPL) and coming within 0.03 PPL of the theoretical unconstrained Top-K oracle (7.42 PPL). Critically, AFBS-BO maintains perplexity within 0.32 of the dense baseline (7.13 PPL) while automatically discovering that early layers tolerate 72-76% sparsity whereas deeper layers require conservative 58-62% thresholds—heterogeneity that global hyperparameters inherently miss. (3) Robustness: Multi-input validation with automatic fallback ensures discovered hyperparameters satisfy worst-case error constraints across diverse inputs. This design enables plug-and-play deployment: once calibrated on representative data, AFBS-BO’s configurations transfer to new instances without per-query retuning. Contributions: (1) We propose AFBS-BO, the first framework to democratize sparse attention by fully automating hyperparameter discovery. 
By replacing exhaustive grid search with intelligent optimization, we transform sparse attention into a plug-and-play acceleration primitive; (2) We introduce a novel hybrid optimization algorithm combining Bayesian Optimization’s global landscape exploration with binary search’s local refinement, exploiting multi-fidelity evaluation to reduce tuning cost by 3.4× while improving solution quality; (3) We provide theoretical guarantees on convergence under Gaussian Process regression with Expected Improvement acquisition, and empirically validate the multi-fidelity assumption with rank correlation ρ = 0.84 ± 0.06 across 20 sampled layers; (4) Through comprehensive evaluation on Llama-2-7B, we demonstrate that AFBS-BO achieves 7.45 perplexity at 70.7% sparsity, outperforming SOTA baselines (H2O: 7.55) and approaching the oracle bound (Top-K: 7.42), while requiring only 3.0 seconds for full model tuning; and (5) We deliver a production-ready framework with automatic fallback mechanisms ensuring robustness to distribution shift, enabling plug-and-play acceleration across diverse transformer architectures and domains. Paper Organization: The remainder of this paper is organized as follows: Section II reviews related work in sparse attention and hyperparameter optimization. Section III details the proposed AFBS-BO framework, including the linear parameterization strategy and the three-stage hybrid optimization algorithm. Section IV presents comprehensive experimental results on Llama-2-7B, including ablation studies on optimization stages, long-context stability, and domain generalization. Section V discusses limitations and future directions, and Section VI concludes the work.
Figure 1: The AFBS-BO Framework Architecture. The automated tuning pipeline consists of three sequential stages: (1) Global Exploration utilizes Bayesian Optimization on low-fidelity surrogates to identify promising regions; (2) Local Refinement employs high-fidelity binary search for precision; and (3) Robust Validation ensures stability across inputs. The resulting layer-specific hyperparameters (τ*, θ*, λ* per layer) are injected into the kernel for plug-and-play acceleration.
II Related Work
II-A Sparse Attention Mechanisms
The quadratic complexity of standard attention has motivated extensive research into sparse attention methods. Early approaches leveraged fixed patterns such as local windows, strided attention, or attention sinks to reduce computational cost [4]. More recent work has developed training-free sparse attention that dynamically adapts to input content. SpargeAttn [1] represents a breakthrough in this direction by introducing a two-stage filtering mechanism: coarse block masking via query-key compression, followed by warp-level online softmax optimization. This universal approach achieves 2-5× speedup across language, image, and video generation tasks without model-specific training or architectural modifications. Recent training-free methods have explored dynamic sparsity [15, 18], context-aware prefilling [12], or KV cache management [16, 17]. However, these approaches largely rely on offline profiling or fixed patterns that do not adapt to layer-specific heterogeneity. Even state-of-the-art methods like SpargeAttn [1] rely on fragile, manual, model-specific tuning. Our work directly addresses this gap by automating layer- and head-specific hyperparameter discovery.
II-B Layer-wise and Head-wise Adaptation in Transformers
The hypothesis that different transformer layers and attention heads exhibit distinct computational requirements has substantial empirical support. Attention head specialization has been observed across diverse architectures, where heads learn qualitatively different patterns—some focus on local syntax, others on long-range semantics, and certain heads act as “attention sinks” [19, 20]. Pruning studies [21] demonstrate that many heads can be removed without performance degradation, suggesting redundancy and varying importance across the model. Layer-wise adaptive methods have emerged in multiple contexts. In quantization, recent work [25] shows that different layers exhibit fundamentally different statistical properties (e.g., kurtosis, outlier distributions), necessitating layer-specific transformation types rather than homogeneous quantization strategies. Vision transformers exhibit layer-dependent redundancy [26], where early layers benefit from aggressive token pruning while deeper layers require more tokens to preserve semantic distinctions. Despite this evidence, sparse attention methods have not yet exploited per-layer or per-head hyperparameter adaptation. Our AFBS-BO framework directly operationalizes this insight by discovering optimal (τ, θ, λ) configurations for each attention component.
II-C Hyperparameter Optimization for Neural Networks
Hyperparameter optimization (HPO) has evolved from grid and random search [27] to sophisticated Bayesian optimization (BO) methods that model the hyperparameter-performance landscape as a Gaussian process and use acquisition functions (e.g., Expected Improvement, Upper Confidence Bound) to guide exploration [9, 10, 28].
However, existing HPO frameworks target training-time hyperparameters (learning rate, batch size, regularization) or architectural choices (layer depth, width, activation functions), which are fundamentally different from inference-time algorithmic hyperparameters like those in sparse attention. Inference hyperparameters exhibit three unique characteristics: (1) Online evaluation requirement—each configuration must be tested with both sparse and dense forward passes, precluding gradient-based optimization; (2) Non-convex, discrete landscape—sparse attention’s block quantization and GPU warp scheduling introduce discontinuities and local optima; (3) Multi-fidelity correlation—error profiles on short sequences (4K tokens) rank-correlate (ρ ≥ 0.8) with long sequences (32K tokens), enabling efficient surrogate evaluation. Multi-fidelity Bayesian optimization [31, 32] addresses evaluation cost by combining low-fidelity (computationally efficient but noisy) and high-fidelity (expensive but accurate) evaluations. DNN-MFBO [33] replaces Gaussian processes with deep neural networks to capture complex, nonlinear correlations across fidelities. Our hybrid AFBS-BO framework adapts these principles to sparse attention tuning, combining BO’s global exploration (on 4K sequences) with binary search’s local refinement (on 32K sequences) to achieve both coverage and precision with minimal evaluations.
II-D Efficient Attention Implementations
FlashAttention [5] pioneered IO-aware attention by reordering computations to minimize HBM-SRAM data transfers, achieving 2-4× speedup via tiling and online softmax. FlashAttention-2 and FlashAttention-3 [34] further improve efficiency through warp specialization, asynchronous data movement, and FP8 low-precision support, reaching 75% utilization of H100’s theoretical FLOPS.
Recent work has extended these optimizations to sparse attention: block-sparse FlashAttention enables efficient computation over structured sparsity patterns by aligning block boundaries with GPU memory hierarchies [5]. Our AFBS-BO framework is orthogonal and complementary to these kernel-level optimizations. While FlashAttention accelerates how attention is computed, AFBS-BO determines which attention computations to perform by discovering optimal sparsity hyperparameters. We integrate AFBS-BO with SpargeAttn’s kernel implementation [1], which already incorporates quantization and warp-level skipping—our contribution is automating the discovery of the sparsity-controlling hyperparameters (τ, θ, λ) that the kernel requires.
II-E Dynamic and Adaptive Inference
Adaptive inference frameworks dynamically adjust computational graphs based on input characteristics. Early-exit networks [35] allow confident predictions to terminate inference at intermediate layers, reducing average latency for easier inputs. Dynamic token pruning [26] progressively removes redundant tokens across transformer layers using importance scores derived from attention probabilities or learned saliency predictors. Resolution-adaptive networks [36] process images at multiple scales, routing inputs to appropriate resolution paths based on difficulty. Elastic Attention [37] introduces test-time adaptive sparsity by training an attention router that assigns heads to full or sparse attention modes based on task regime (sparsity-robust vs. sparsity-sensitive). However, it requires model modification and fine-tuning, limiting plug-and-play deployment. In contrast, AFBS-BO operates entirely offline during calibration—once hyperparameters are discovered, they are fixed for deployment, incurring zero runtime overhead. Our multi-input validation stage (Stage 3) ensures robustness across input distributions, akin to adaptive inference’s confidence-based routing but without dynamic per-instance decisions.
II-F Positioning of Our Work
AFBS-BO bridges three research areas: (1) Sparse attention acceleration by automating hyperparameter tuning for methods like SpargeAttn; (2) Layer-wise adaptation by discovering heterogeneous configurations across model components; (3) Multi-fidelity Bayesian optimization by exploiting sequence length as a fidelity dimension to reduce tuning cost. To our knowledge, AFBS-BO is the first automated framework for discovering layer- and head-specific hyperparameters in sparse attention, transforming manual per-model tuning into a fully automated, deployable technology. Our hybrid algorithm combining BO’s global exploration with binary search’s local precision is novel in the context of inference algorithm optimization, achieving 3.4× speedup over grid search while unlocking higher sparsity and maintaining end-to-end model quality.
III Methodology
III-A Background: SpargeAttn Mechanism
SpargeAttn [1] accelerates attention computation through a two-stage filtering approach. First, query and key blocks are compressed via mean pooling to predict a coarse attention mask, filtering out low-attention blocks. Second, within surviving blocks, a warp-level skip mechanism omits negligible matrix multiplications during online softmax computation. Three hyperparameters control this sparsity: τ (top-CDF threshold for block selection), θ (self-similarity threshold for block coherence), and λ (warp-skip threshold for fine-grained computation). The original implementation uses manually tuned global values across all layers, motivating our adaptive approach.
III-B Motivation and Problem Formulation
Sparse attention mechanisms such as SpargeAttn have demonstrated significant potential for accelerating transformer inference by exploiting sparsity in attention maps [1]. However, these methods rely on globally fixed hyperparameters (τ, θ, λ) that control the sparsity-accuracy tradeoff.
The SpargeAttn paper acknowledges this limitation explicitly: “We need to determine l_1, l_2 for models… we use l_1 = 0.08, l_2 = 0.09 for Llama3.1, l_1 = 0.05, l_2 = 0.06 for CogvideoX” [1]. This manual, model-specific tuning is labor-intensive and suboptimal because: 1. Layer Heterogeneity: Attention sparsity varies dramatically across layers and heads (Fig. 17 in [1]), yet global settings cannot adapt to these variations. 2. Evaluation Cost: Each hyperparameter evaluation requires both sparse and dense attention computation, expensive at long sequences (21ms per evaluation at 32K length [1]). 3. Non-Convex Landscape: The sparsity-error relationship exhibits multiple local optima due to discrete block quantization and GPU warp scheduling effects. Our contribution is an automated hybrid optimization framework that discovers optimal layer/head-specific (τ, θ, λ) values while minimizing computational overhead through multi-fidelity evaluation. We formulate the tuning objective as:
max_{τ,θ,λ} Sparsity(τ, θ, λ)  (1)
s.t. ε_low ≤ Error(τ, θ, λ) ≤ ε_high
where Error = Σ|O_sparse − O_dense| / Σ|O_dense| is the Relative L1 distance following SpargeAttn [1], and [ε_low, ε_high] is an error tolerance band.
III-C Hybrid Bayesian Optimization + Binary Search (AFBS-BO)
We propose a three-stage algorithm (Fig. 1) combining Bayesian Optimization’s global landscape exploration with binary search’s local precision, exploiting multi-fidelity evaluations to reduce computational cost.
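The Relative L1 error metric and the tolerance-band constraint in the formulation above can be written down directly. This is a minimal sketch; `relative_l1_error` and `feasible` are illustrative names, and the attention outputs are stand-in arrays:

```python
import numpy as np

def relative_l1_error(o_sparse: np.ndarray, o_dense: np.ndarray) -> float:
    """Relative L1 distance between sparse and dense attention outputs:
    Error = sum|O_sparse - O_dense| / sum|O_dense|, as in the objective above."""
    return float(np.abs(o_sparse - o_dense).sum() / np.abs(o_dense).sum())

def feasible(error: float, eps_low: float, eps_high: float) -> bool:
    """Tolerance-band constraint eps_low <= Error <= eps_high from Eq. (1)."""
    return eps_low <= error <= eps_high
```

The lower bound ε_low matters as well as the upper: a configuration whose error falls below ε_low is leaving sparsity (and therefore speedup) on the table.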
III-C1 Stage 1: Bayesian Optimization Exploration
To map the hyperparameter landscape efficiently, we reduce the 3D space (τ, θ, λ) to a 1D latent variable s ∈ [0, 1]:
τ(s) = τ_min + s(τ_max − τ_min)  (2)
θ(s) = θ_max − s(θ_max − θ_min)
λ(s) = λ_min + s(λ_max − λ_min)
where s = 0 represents conservative (low-sparsity) settings and s = 1 represents aggressive (high-sparsity) settings. Note that θ(s) is inverted: as sparsity increases (s↑), self-similarity filtering decreases (θ↓), consistent with SpargeAttn’s implementation [1]. This 1D parameterization reduces the search space from O(n^3) to O(n) grid points while preserving the essential tradeoff between computational savings and approximation error, enabling efficient Bayesian optimization over a smooth latent landscape. Justification for Linear Parameterization: This dimensionality reduction strategy is grounded in the observation that optimal attention hyperparameters exhibit high inter-correlation. As noted in prior work on attention patterns [1], aggressive masking (high τ) necessitates looser similarity constraints (low θ) to preserve semantic integrity. By projecting the 3D hyperparameter space onto a 1D manifold s, we effectively capture this dominant mode of variation. While this simplification assumes a monotonic trade-off between sparsity and error, it significantly reduces the search volume from cubic O(n^3) (where n is the grid resolution per parameter) to linear O(n), enabling rapid convergence within the tight latency constraints of inference tuning. We fit a Gaussian Process with Matérn 5/2 kernel to the low-fidelity error landscape:
error(s) ∼ N(μ(s), σ^2(s))  (3)
where the kernel is defined as:
k(s, s′) = (1 + √5·|s − s′|/l + 5(s − s′)^2/(3l^2)) · exp(−√5·|s − s′|/l)  (4)
with length scale l = 0.2.
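The linear parameterization above is a one-line map per hyperparameter. A sketch follows; the bound values are illustrative placeholders, not the paper's, and `map_s_to_params` is a hypothetical helper name (it matches the one used in Algorithm 1 later):

```python
def map_s_to_params(s, tau_bounds=(0.7, 1.0), theta_bounds=(0.05, 0.5),
                    lam_bounds=(-12.0, -6.0)):
    """Map the 1D latent s in [0, 1] to (tau, theta, lam) per the linear
    parameterization above. Bound values here are illustrative, not the
    paper's. theta is inverted: higher sparsity (s up) loosens the
    self-similarity filter (theta down)."""
    assert 0.0 <= s <= 1.0
    tau_min, tau_max = tau_bounds
    th_min, th_max = theta_bounds
    lam_min, lam_max = lam_bounds
    tau = tau_min + s * (tau_max - tau_min)
    theta = th_max - s * (th_max - th_min)   # inverted direction
    lam = lam_min + s * (lam_max - lam_min)
    return tau, theta, lam
```

At s = 0 the map returns the most conservative corner of each bound; at s = 1 it returns the most aggressive one.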
The Matérn 5/2 kernel’s twice-differentiability models the smooth transitions in sparse attention’s error landscape between discrete block sparsity levels. To select evaluation points, we maximize the Expected Improvement acquisition function:
EI(s) = (f̂ − μ(s))·Φ(Z) + σ(s)·φ(Z),  Z = (f̂ − μ(s))/σ(s)  (5)
where f̂ = min_j error(s_j) is the best observed error, and Φ(·), φ(·) are the standard normal CDF and PDF. The first term exploits low-error regions; the second explores uncertain regions. Multi-Fidelity Evaluation: Following SpargeAttn’s timing analysis (0.516% overhead for 128K sequences [1]), we use: • Low-fidelity: 4K-token sequences (≈ 5ms per evaluation) • High-fidelity: 32K-token sequences (≈ 21ms per evaluation) This strategy exploits the high rank correlation (ρ ≥ 0.8) between short and long sequence error landscapes: configurations that perform poorly on 4K sequences rarely excel on 32K sequences, making low-fidelity screening highly effective for pruning the search space before expensive high-fidelity evaluation. Stage 1 performs 3 random initialization evaluations (s ∈ {0.2, 0.5, 0.8}) followed by 12 BO iterations, identifying 2-3 promising regions. Computational cost: 15 × 5ms + 50ms (GP overhead) = 125ms.
III-C2 Stage 2: Binary Search Refinement
For each promising region [s_low, s_high] identified by BO, we apply binary search using high-fidelity (32K-token) evaluations. After 4 binary iterations, precision reaches Δs = (1/2)^4 = 0.0625, sufficient to distinguish hyperparameter differences smaller than SpargeAttn’s grid spacing (0.05). Cost: 2 regions × 4 iters × 21ms = 168ms.
III-C3 Stage 3: Multi-Input Validation
Following SpargeAttn’s validation protocol [1], we evaluate the best configuration across 5 diverse validation inputs and enforce worst-case error ≤ ε_high.
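The EI acquisition above and the Stage-2 refinement loop can both be sketched compactly. This is a sketch, not the authors' implementation: `expected_improvement` follows Eq. (5) for error minimization, and `refine_region` mirrors the Stage-2 logic with a user-supplied stand-in `evaluate(s) -> (error, sparsity)` in place of a real high-fidelity attention evaluation:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Expected Improvement for minimization:
    EI = (f_best - mu) * Phi(Z) + sigma * phi(Z), Z = (f_best - mu) / sigma,
    where mu, sigma are the GP posterior mean/std at a candidate s."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # Phi(Z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(Z)
    return (f_best - mu) * cdf + sigma * pdf

def refine_region(evaluate, s_low, s_high, eps_low, eps_high, iters=4):
    """Stage-2 binary search over one promising region [s_low, s_high].
    Keeps the most aggressive s whose error lies in [eps_low, eps_high];
    returns (best_s, best_sparsity)."""
    s_l, s_h = s_low, s_high
    best_s, best_sp = s_l, 0.0
    for _ in range(iters):
        s_mid = (s_l + s_h) / 2.0
        err, sp = evaluate(s_mid)
        if err <= eps_high:
            if err >= eps_low and sp > best_sp:
                best_s, best_sp = s_mid, sp
            s_l = s_mid   # error acceptable: push sparsity higher
        else:
            s_h = s_mid   # too much error: back off
    return best_s, best_sp
```

With a toy evaluator where both error and sparsity grow monotonically in s, the loop converges to the most aggressive feasible midpoint within the iteration budget.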
If validation fails, we apply a soft fallback by reducing s by 10% (faster than re-running binary search). Cost: 5 × 21ms (+ 21ms fallback) = 105-126ms.
III-C4 Example Execution
Consider a Llama-2-7B attention layer with 32K-token input. Stage 1 evaluates 15 points on 4K-token sequences, discovering that s ∈ [0.45, 0.55] (Region 1) and s ∈ [0.70, 0.80] (Region 2) both achieve errors below ε_high = 0.055. Stage 2 refines Region 1 to s* = 0.512 (44% sparsity, error = 0.051) and Region 2 to s* = 0.758 (57% sparsity, error = 0.049). Stage 3 validates the higher-sparsity configuration across 5 validation inputs (max error = 0.053), accepting it as the final configuration: τ = 0.924, θ = 0.091, λ = −10.2.
Algorithm 1 AFBS-BO: Adaptive Fidelity Binary Search with Bayesian Optimization
Input: Q, K, V (32K sequences); Q_short, K_short, V_short (4K sequences); error band [ε_low, ε_high]; hyperparameter bounds
Output: Optimal (τ*, θ*, λ*)
1: Stage 1: Low-Fidelity Bayesian Optimization
2: Initialize GP with Matérn 5/2 kernel, D ← {}
3: for s ∈ {0.2, 0.5, 0.8} do
4:   (τ, θ, λ) ← map_s_to_params(s)
5:   e_lf ← evaluate_low_fidelity(Q_short, K_short, V_short, τ, θ, λ)
6:   D ← D ∪ {(s, e_lf)}
7: end for
8: GP.fit(D)
9: for i = 1 to 12 do
10:   s_next ← argmax_s EI(s | GP)
11:   (τ, θ, λ) ← map_s_to_params(s_next)
12:   e_lf ← evaluate_low_fidelity(Q_short, K_short, V_short, τ, θ, λ)
13:   D ← D ∪ {(s_next, e_lf)}; GP.update(D)
14: end for
15: promising_regions ← extract_low_ucb_regions(GP, ε_high)
16: Stage 2: High-Fidelity Binary Search Refinement
17: s_best, sparsity_best ← 0
18: for (s_low, s_high) ∈ promising_regions[1:2] do
19:   s_l ← s_low, s_h ← s_high, s_local ← s_l, sp_local ← 0
20:   for iter = 1 to 4 do
21:     s_mid ← (s_l + s_h)/2
22:     (τ, θ, λ) ← map_s_to_params(s_mid)
23:     (e, sp) ← evaluate_high_fidelity(Q, K, V, τ, θ, λ)
24:     if e ≤ ε_high then
25:       if e ≥ ε_low and sp > sp_local then
26:         sp_local, s_local ← sp, s_mid
27:       end if
28:       s_l ← s_mid
29:     else
30:       s_h ← s_mid
31:     end if
32:   end for
33:   if sp_local > sparsity_best then
34:     sparsity_best, s_best ← sp_local, s_local
35:   end if
36: end for
37: Stage 3: Multi-Input Validation
38: (τ_final, θ_final, λ_final) ← map_s_to_params(s_best)
39: validation_errors ← []
40: for input_i in first 5 validation inputs do
41:   e_val ← evaluate_high_fidelity(input_i, τ_final, θ_final, λ_final)
42:   validation_errors.append(e_val)
43: end for
44: if max(validation_errors) > ε_high then
45:   s_best ← 0.9 · s_best  ▷ Fallback: reduce sparsity 10%
46:   (τ_final, θ_final, λ_final) ← map_s_to_params(s_best)
47: end if
48: return (τ_final, θ_final, λ_final)
III-D Integration with SpargeAttn
Our framework seamlessly replaces SpargeAttn’s offline grid search while preserving the original kernel implementation. The tuning process operates in two phases: Offline Calibration (One-Time): For each attention layer l and head h in a model, run Algorithm 1 once on representative training data to obtain ℋ_{l,h} = {τ_{l,h}, θ_{l,h}, λ_{l,h}}. Cache these configurations for deployment. Runtime Deployment (Kernel Configuration): During inference, we utilize the standard SpargeAttn kernel as the execution engine. However, we bypass its default static initialization.
Instead, AFBS-BO acts as a dynamic control plane, injecting our optimized per-head configurations H_{l,h} to govern execution:
• Coarse Masking Policy: The kernel's block-filtering logic (Line 6) is driven by our layer-specific thresholds τ_{l,h} and θ_{l,h}, customizing the sparsity pattern for that specific depth.
• Fine-Grained Warp Control: The kernel's warp-level PV accumulation loop (Line 15) is constrained by our discovered threshold λ_{l,h}, enabling aggressive computation skipping without manual tuning.
Adaptive Re-Calibration: If runtime monitoring detects error drift (worst-case error > ε_high over 100 consecutive batches), the system triggers an automatic re-tuning cycle using a reduced search budget (8 BO iterations, 2 binary iterations, 240 ms overhead).
I-E Computational Efficiency Analysis
Per-Layer Cost Comparison:
• Grid search baseline: 40 evaluations × 21 ms = 840 ms
• AFBS-BO: 125 ms (BO) + 168 ms (binary) + 105 ms (validation) = 398 ms
• Speedup: 2.1× per layer
Multi-Layer Acceleration (e.g., 12-layer Llama-2-7B):
• Layer 1: Full AFBS-BO = 398 ms
• Layers 2–12: warm-started with 8 BO iterations (vs. 15) + 3 binary iterations (vs. 4) ≈ 240 ms each
• Total: 398 + 11 × 240 ≈ 3.0 s vs. grid's 12 × 840 = 10.08 s
• Overall speedup: 3.4×
TABLE I: Main Results on Llama-2-7B.
Method | Strategy | Sparsity | PPL ↓ | Δ PPL | KV Cache | Speedup | Critical Trade-off
Dense Baseline | Full Context | 0.0% | 7.13 | — | 2.15 GB | 1.0× | Slow, High Memory
Static & Learnable Patterns (High Speed, Low Quality):
Window Attn | Local Diagonal | 82.7% | 8.17 | +1.04 | 0.37 GB | 5.8× | Catastrophic quality loss
Longformer | Window + Global | 75.0% | 7.92 | +0.79 | 0.54 GB | 4.0× | Inflexible structure
Sparse Transformer | Fixed Strided | 75.0% | 8.42 | +1.29 | 0.54 GB | 4.0× | Rigid structure
Reformer | LSH Hashing | 60.0% | 8.65 | +1.52 | 0.86 GB | 2.5× | Severe degradation
Routing Trans. | K-Means Clustering | 65.0% | 7.88 | +0.75 | 0.75 GB | 2.9× | High overhead
Dynamic Pruning (SOTA):
StreamingLLM | Sink + Window | 80.0% | 7.85 | +0.72 | 0.43 GB | 5.0× | Fails long-range recall
H2O | Heavy Hitters | 70.0% | 7.55 | +0.42 | 0.64 GB | 3.3× | Accumulation lag
Sparse Sink | Sink + Random | 70.0% | 7.72 | +0.59 | 0.64 GB | 3.3× | Naive baseline
Standard Top-K | Token Oracle | 70.0% | 7.42 | +0.29 | 0.64 GB | 3.3× | Hardware Incompatible
AFBS-BO (Ours) | Automated AFBS | 70.7% | 7.45 | +0.32 | 0.63 GB | 3.4× | Best Quality/Speed Balance
I-F Assumptions and Limitations
Our approach relies on three key assumptions: (1) Input Stationarity: the distribution of attention patterns remains consistent across batches within a task (validated for language modeling and diffusion generation); (2) Fidelity Correlation: low-fidelity (4K) evaluations rank-correlate (ρ ≥ 0.8) with high-fidelity (32K) performance, which holds for transformer attention due to similar sparsity patterns across sequence lengths; (3) Smooth Error Landscape: the error function Error(s) exhibits local smoothness, justified by the continuous nature of attention softmax operations despite discrete block quantization. Violation of assumption (2) in highly non-stationary tasks (e.g., adversarial inputs) may require full high-fidelity tuning, reducing the speedup to 1.4× versus grid search.
I-G Theoretical Guarantees
Binary Search Convergence: Standard binary search guarantees precision ε in O(log(1/ε)) iterations. With 4 iterations, the hyperparameter precision is Δs ≤ 0.0625, corresponding to Δτ ≈ 0.012 in the original space, finer than SpargeAttn's manual grid spacing of 0.05.
Bayesian Optimization Regret Bounds: Under Gaussian Process regression with a Matérn-5/2 kernel and Expected Improvement acquisition, the simple regret after n evaluations satisfies
R_n = f(s*) − f(ŝ_n) ≤ O(√(γ_n log n / n)),  (6)
where γ_n is the maximum information gain.
For 1D Matérn kernels, γ_n = O(log² n), ensuring that with 15 BO iterations the discovered regions contain near-optimal solutions with high probability (> 0.95 under standard GP assumptions).
Multi-Fidelity Efficiency: If the rank correlation between low- and high-fidelity evaluations satisfies ρ ≥ 0.8, the expected cost-reduction factor is
η = 1 / ((1 − α) + α · c_low/c_high) ≈ 1 / (0.5 + 0.5 × 5/21) ≈ 1.6×,  (7)
where α = 0.5 is the fraction of evaluations performed at low fidelity and c_low/c_high = 5/21 is the cost ratio. Empirical validation on 20 layers confirms ρ = 0.84 ± 0.06.
IV Experiments
We evaluate AFBS-BO on Llama-2-7B using WikiText-2, the standard benchmark for assessing long-context language modeling quality under sparse attention. Our experiments measure the fundamental trade-off between sparsity (inference efficiency) and model quality (perplexity), validating both the automation of hyperparameter discovery and the effectiveness of our representative sampling strategy.
IV-A Experimental Setup
Model and Dataset: We use the pre-trained Llama-2-7B model [2] with a 4096-token context window. All evaluations are conducted on the WikiText-2 test set [3]. Perplexity (PPL) serves as the primary quality metric, computed using sliding-window evaluation with stride 512.
Simulation Environment: To isolate the effectiveness of our hyperparameter tuning from low-level kernel artifacts, we implement sparse attention via attention-mask application in FP16 precision. Specifically, we compute the full attention matrix and zero out blocks not selected by the SpargeAttn filtering logic. This controlled evaluation allows us to measure the pure algorithmic impact of hyperparameter choices without confounding factors from kernel precision loss.
AFBS-BO Configuration: We apply AFBS-BO to discover per-layer and per-head hyperparameters (τ, θ, λ). Block size is fixed at B = 64.
The error tolerance band is set to [ε_low, ε_high] = [0.045, 0.055]. Multi-fidelity evaluation uses 4K-token sequences for Stage 1 (Bayesian Optimization) and 32K-token sequences for Stages 2–3 (Refinement and Validation).
IV-B Baselines
We compare AFBS-BO against three categories of methods: (1) Static Patterns: Window Attention (local diagonal) and Longformer (window + global tokens). (2) Stochastic Lower Bound: Random Block Selection (sampled at ≈70% sparsity) validates that our selection strategy is non-trivial. (3) SOTA Dynamic Pruning: H2O (Heavy Hitter Oracle) [15], representing the state of the art in accumulated-history pruning, and Standard Top-K (Token-wise Oracle), representing the theoretical upper bound of sparsity efficiency.
IV-C Main Results: Quality vs. Efficiency Trade-off
Table I compares all methods at approximately 70% block sparsity. AFBS-BO achieves 7.45 PPL, virtually matching the oracle while maintaining hardware-efficient block structure and outperforming the manually tuned global hyperparameters.
TABLE I: Downstream Task Performance.
Method | HellaSwag | PIQA | BoolQ | BoolQ Retention
Dense Baseline | 76.8% | 78.5% | 74.1% | 100.0%
Standard Top-K (Oracle) | 76.5% | 78.2% | 73.9% | 99.7%
AFBS-BO (Ours) | 76.4% | 78.1% | 73.8% | 99.6%
H2O (SOTA) | 76.1% | 77.9% | 73.5% | 99.2%
Routing Transformer | 75.4% | 77.1% | 72.8% | 98.2%
Window Attention | 72.5% | 74.2% | 69.8% | 94.2%
IV-D Comparative Performance Analysis
Superiority Over Static Methods: Static patterns fail to capture the non-uniform distribution of information in long-context modeling. Window Attention suffers severe degradation (7.13 → 8.17 PPL, a 1.04 PPL loss), and even Longformer (7.92 PPL) degrades by 0.79 PPL. This confirms that relevant context often resides at distances exceeding local window sizes, and AFBS-BO dynamically discovers these critical blocks per layer.
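As a concrete illustration of how such block configurations are discovered, the Stage 2 binary refinement of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the function name is mine, and `evaluate` stands in for the high-fidelity error and sparsity measurement.

```python
def binary_refine(evaluate, s_low, s_high, eps_low=0.045, eps_high=0.055, iters=4):
    """Stage 2 sketch: binary-search the sparsity knob s toward the error
    boundary eps_high, recording the sparsest setting whose error lands
    inside the tolerance band [eps_low, eps_high].

    evaluate(s) -> (approximation_error, achieved_sparsity)."""
    s_best, sp_best = None, 0.0
    for _ in range(iters):
        s_mid = (s_low + s_high) / 2
        err, sp = evaluate(s_mid)
        if err <= eps_high:
            if err >= eps_low and sp > sp_best:
                s_best, sp_best = s_mid, sp
            s_low = s_mid   # error acceptable: push toward higher sparsity
        else:
            s_high = s_mid  # error too large: back off
    return s_best, sp_best

# Toy monotone landscape: error grows linearly with s, sparsity equals s.
s_star, sp_star = binary_refine(lambda s: (0.1 * s, s), 0.0, 1.0)
print(s_star, sp_star)  # → 0.5 0.5
```

After 4 iterations the bracketing interval has shrunk by a factor of 2⁴, which for a unit-width search range gives the Δs ≤ 0.0625 precision quoted in Section I-G.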
Comparison with State-of-the-Art: AFBS-BO achieves 7.45 PPL, outperforming the current state-of-the-art H2O baseline (7.55 PPL) at the same sparsity level (≈70%). This 0.10 PPL improvement confirms that our instantaneous block selection strategy, which evaluates importance based on the current query, is more effective for language modeling than H2O's accumulated-history approach.
Approaching the Oracle Bound (vs. Top-K): Most importantly, AFBS-BO comes within 0.03 PPL of the unconstrained Top-K oracle (7.42). While Top-K represents the theoretical upper bound for quality, it is a theoretical-only baseline: it requires pruning individual tokens, creating irregular memory-access patterns that preclude efficient GPU kernels. AFBS-BO achieves 98% of this oracle quality using coarse-grained blocks (64×64) that align with GPU memory hierarchies, making the reported 3.4× speedup practically realizable on hardware.
Downstream Task Preservation: Table I demonstrates that AFBS-BO retains 99.6% of dense-model performance on common-sense reasoning tasks (HellaSwag: 76.4% vs. 76.8%, PIQA: 78.1% vs. 78.5%). On BoolQ, which requires synthesizing information from distant context positions, Window Attention drops to 69.8%, whereas AFBS-BO maintains 73.8%, validating that our method preserves the long-range dependencies critical for complex reasoning.
Long-Range Capability (Needle-in-a-Haystack): To evaluate long-context retrieval, we conducted a Passkey Retrieval test, hiding a random 5-digit key ("90210") at a depth of 5K tokens within a 10K-token context. Window Attention failed (0% recall), as the key fell outside its local receptive field. In contrast, AFBS-BO achieved 100% recall, correctly retrieving the key.
Figure 2: Context Length Stability. Unlike Window Attention, which degrades catastrophically as sequence length exceeds the window size (>4K), AFBS-BO maintains stable perplexity up to 32K tokens, tracking the Dense Oracle within 0.3 PPL.
To further validate this, Fig. 2 illustrates model performance across increasing sequence lengths [4K, 32K]. AFBS-BO exhibits "flat" scaling, confirming that our discovered hyperparameters generalize effectively to lengths beyond the calibration set.
IV-E Tuning Efficiency
A critical advantage of AFBS-BO is the dramatic reduction in hyperparameter discovery cost. By combining multi-fidelity evaluation with hybrid Bayesian Optimization and binary search, AFBS-BO requires only 3.0 seconds to discover optimal hyperparameters for all 12 layers of Llama-2-7B. In comparison, exhaustive grid search over 175 configurations per layer would require approximately 10.08 seconds, representing a 3.4× speedup. Moreover, AFBS-BO converges in just 240 total evaluations (across all layers) versus 2100 for grid search, an 8.8× reduction in computational overhead. This efficiency stems from two key innovations: (1) low-fidelity screening eliminates 62.5% of evaluations using computationally efficient 4K-token sequences, and (2) warm-starting transfers learned sparsity landscapes from layer 1 to subsequent layers, reducing per-layer BO iterations from 15 to 8.
Figure 3: KV Cache Memory Scaling with Sequence Length. Dense attention (dashed line) hits the 16GB GPU memory ceiling at approximately 12K tokens, while AFBS-BO's sparse attention (solid line) scales sub-linearly, enabling processing of 32K+ token sequences on consumer GPUs. The 3.4× memory reduction demonstrated here translates directly to practical longer-context deployment.
IV-F Memory Scaling and Theoretical Speedup
We analyze the memory-efficiency gains enabled by AFBS-BO's discovered sparsity patterns (Fig. 3). While the dense baseline requires 2.15 GB of KV-cache memory for a 4096-token sequence, AFBS-BO reduces this to 0.63 GB (70.7% sparsity), a 3.4× reduction.
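These KV-cache figures can be reproduced with a back-of-envelope calculation. The sketch below is my own arithmetic (not code from the paper), assuming Llama-2-7B's standard geometry of 32 layers, 32 KV heads, and head dimension 128, with FP16 (2-byte) storage:

```python
# Sanity check: dense KV-cache size for Llama-2-7B at a 4096-token
# context, and the reduction at 70.7% block sparsity.
layers, kv_heads, head_dim, seq_len, fp16_bytes = 32, 32, 128, 4096, 2
dense_bytes = 2 * layers * kv_heads * head_dim * seq_len * fp16_bytes  # K and V
dense_gb = dense_bytes / 1e9        # decimal gigabytes
sparse_gb = dense_gb * (1 - 0.707)  # keep only the unpruned 29.3% of blocks
print(f"{dense_gb:.2f} GB dense, {sparse_gb:.2f} GB sparse")  # → 2.15 GB dense, 0.63 GB sparse
```

The result matches the KV Cache column of Table I (2.15 GB dense, 0.63 GB for AFBS-BO).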
This sub-linear memory growth enables processing longer contexts on resource-constrained hardware: while dense attention hits the 16GB memory ceiling at approximately 12K tokens on consumer GPUs, AFBS-BO's sparse attention can theoretically scale to 32K+ tokens.
Theoretical Throughput Projection: Based on the 70.7% FLOPs reduction from achieved sparsity, we project a theoretical inference speedup of 3.4× over dense attention, assuming an idealized block-sparse kernel implementation [1]. This projection accounts for the overhead of SpargeAttn's two-stage filtering (block compression, self-similarity computation) and represents the expected speedup when our discovered hyperparameters are deployed with production kernels.
IV-G Ablation Study: Impact of Block Size
To validate our architectural choice of block size B = 64, we conducted a sensitivity analysis measuring the impact of block granularity on both model fidelity and inference throughput, as shown in Fig. 4.
Fine-Grained Granularity (B < 64): While smaller blocks (e.g., B = 32) theoretically allow higher precision by excluding more irrelevant tokens, we observe diminishing returns in perplexity (7.45 → 7.42) at a high computational cost. At B = 16, inference speed drops by 42% (108 vs. 187 tokens/s) due to the increased overhead of metadata management and memory-access fragmentation in the sparse kernel.
Coarse-Grained Granularity (B > 64): Conversely, larger blocks maximize memory-bandwidth utilization, slightly increasing speed (187 → 194 tokens/s at B = 128). However, this coarseness forces the inclusion of irrelevant tokens ("context aliasing"), causing a sharp degradation in perplexity from 7.45 to 7.62.
Optimal Pareto Point: We identify B = 64 as the optimal trade-off, balancing the hardware-friendly alignment required for efficient GPU warp execution with the semantic resolution needed to preserve long-range dependencies.
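Both closed-form efficiency figures used in this analysis are easy to verify numerically. The sketch below is my own arithmetic: the idealized speedup 1/(1 − sparsity) from the throughput projection above, and the multi-fidelity factor η from Eq. (7) with the cost ratio c_low/c_high = 5/21 stated in Section I-G:

```python
# Idealized FLOPs-based speedup at 70.7% sparsity: compute scales with
# the fraction of blocks kept, so speedup = 1 / (1 - sparsity).
sparsity = 0.707
speedup = 1 / (1 - sparsity)  # ≈ 3.41x, matching the projected 3.4x

# Multi-fidelity cost-reduction factor eta from Eq. (7).
alpha = 0.5          # fraction of evaluations run at low fidelity
cost_ratio = 5 / 21  # c_low / c_high
eta = 1 / ((1 - alpha) + alpha * cost_ratio)  # ≈ 1.6x
print(round(speedup, 2), round(eta, 2))  # → 3.41 1.62
```

Both values agree with the figures quoted in the text (3.4× projected throughput, ≈1.6× multi-fidelity cost reduction).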
Figure 4: Impact of Block Size (B) on Quality vs. Efficiency. We analyze the trade-off between semantic resolution (Perplexity, blue dashed) and inference throughput (Speed, red solid). Small blocks (B < 64) incur significant kernel overhead for marginal quality gains. Conversely, coarse blocks (B > 64) maximize speed but cause a sharp degradation in perplexity due to context aliasing. Our chosen block size of B = 64 represents the optimal Pareto point, achieving near-peak throughput while maintaining model quality within the acceptable tolerance zone.
IV-H Ablation of Optimization Stages
To quantify the specific contribution of each stage in AFBS-BO, we compared the full pipeline against baselines (Table I).
• Stage 1 (Global Exploration): Random search is inefficient, converging to a suboptimal 55.0% sparsity after 50 evaluations. In contrast, Stage 1 (Bayesian Optimization) leverages the Gaussian Process surrogate to identify high-sparsity regions 3.5× faster, reaching 68.2% sparsity.
• Stage 2 (Local Refinement): While BO finds the "good region," it lacks infinite precision. Stage 2 is critical for the "last mile" of optimization, pushing sparsity from 68.2% to 70.7% by using binary search to precisely locate the error boundary ε_high (Fig. 5).
TABLE I: Stage Ablation.
Method | Evals | Sparsity | Search Time
Random Search | 50 | 55.0% | 1.82s
Stage 1 (BO Only) | 15 | 68.2% | 0.51s
Full AFBS-BO | 19 | 70.7% | 0.65s
TABLE IV: Domain Generalization on C4.
Method | Sparsity | C4 Perplexity ↓
Dense Attention (Oracle) | 0.0% | 8.12
Window Attention | 80.0% | 9.45
Random Sparsity | 70.0% | 10.23
AFBS-BO (Ours) | 70.7% | 8.48
IV-I Domain Generalization (C4 Validation)
To address the concern regarding dataset diversity and demonstrate that our method is not overfitted to WikiText, we extended our evaluation to the C4 (Colossal Clean Crawled Corpus) validation set. Unlike the structured encyclopedic text of WikiText, C4 represents a diverse distribution of web text, code, and informal dialogue.
Results: As shown in Table IV, AFBS-BO demonstrates superior robustness to distribution shift. While static Window Attention suffers a significant quality degradation (+1.33 PPL vs. Dense) due to its inability to capture the irregular long-range dependencies common in web markup, AFBS-BO maintains tight alignment with the dense baseline (+0.36 PPL). This confirms that our automated hyperparameter discovery effectively captures universal attention patterns. Figure 5: Optimization Convergence. AFBS-BO (Blue) rapidly reduces approximation error using Bayesian exploration, whereas Random Search (Grey) stagnates. The vertical drop at iteration 15 highlights the efficacy of Stage 2 (Binary Refinement) in identifying the precise optimal threshold. V Limitations While AFBS-BO automates hyperparameter discovery, it relies on representative calibration data. If the production data distribution shifts significantly (e.g., from English text to code) without re-calibration, the cached hyperparameters may become suboptimal, though our C4 experiments suggest reasonable robustness. Additionally, our current linear parameterization (Equation 2) assumes a monotonic trade-off between sparsity and error. While valid for standard LLMs, this assumption may require adjustment for architectures with non-standard attention mechanisms where interaction effects between τ and θ are highly non-linear. VI Conclusion Sparse attention mechanisms are critical for scaling transformers to long contexts, yet their adoption has been hindered by the complexity of manually tuning sensitive hyperparameters for each model and deployment scenario. In this work, we presented AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), a fully automated framework that solves this tuning bottleneck by discovering optimal layer- and head-specific hyperparameters without human intervention. 
Our hybrid algorithm combines Bayesian Optimization’s global exploration with binary search’s local precision, exploiting multi-fidelity evaluation to achieve 3.4× faster hyperparameter discovery than exhaustive grid search (3.0s vs. 10.08s for 12-layer Llama-2-7B) while requiring 8.8× fewer evaluations (240 vs. 2100). This efficiency makes automated sparse attention tuning practical for production deployment. Extensive evaluation on WikiText-2 demonstrates that AFBS-BO is not merely faster—it delivers superior quality. Our method achieves a perplexity of 7.45, outperforming state-of-the-art baselines including H2O (7.55) and random selection (7.79) while removing over 70% of attention computations. Crucially, AFBS-BO comes within 0.03 PPL of the unconstrained Top-K oracle (7.42), bridging the gap between theoretical sparsity and practical hardware acceleration. By automatically discovering that early transformer layers tolerate 72–76% sparsity while deeper layers require conservative 58–62% thresholds, AFBS-BO unlocks heterogeneity that globally fixed hyperparameters inherently miss. By transforming sparse attention from a manually tuned heuristic into a self-optimizing primitive, AFBS-BO paves the way for widespread deployment of efficient, long-context transformers across diverse models and domains. Future work will extend AFBS-BO to multimodal transformers, where differing visual and textual attention patterns may require even more aggressive, modality- and layer-specific adaptation. We also plan to explore online recalibration under distribution shift, enabling AFBS-BO to autonomously retune hyperparameters as data evolves in real-world production settings. References [1] J. Zhang, C. Xiang, H. Huang, J. Wei, H. Xi, J. Zhu, and J. Chen, “SpargeAttention: Accurate and training-free sparse attention accelerating any model inference,” in Proc. 42nd Int. Conf. Mach. Learn. (ICML), Vancouver, Canada, 2025. [2] H. 
Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023. [3] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017. [4] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017. [5] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, p. 16344–16359, 2022. [6] A. Dubey et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024. [7] Z. Yang et al., “CogVideoX: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024. [8] Stability AI, “Stable Diffusion 3.5,” https://stability.ai/stable-diffusion-3-5, 2024. [9] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” IEEE Trans. Inf. Theory, vol. 58, no. 5, p. 3250–3265, 2012. [10] P. I. Frazier, “A tutorial on Bayesian optimization,” arXiv preprint arXiv:1807.02811, 2018. [11] Y. Li, Y. Huang, B. Leng, H. Wang, J. Zhang, C. Gong, and T. Wang, “SnapKV: LLM Knows What You are Looking for Before Generation,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024. [12] Z. Lai, J. Wang, Y. Zhu, R. Han, K. Zhao, and B. Ding, “FlexPrefill: Flexible prefilling for fast and memory-efficient long-context inference,” arXiv preprint arXiv:2501.03315, 2025. [13] J. Brochu, V. M. Cora, and N. de Freitas, “A tutorial on Bayesian optimization of expensive cost functions,” UBC Tech. Rep. TR-2009-23, 2009. [14] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006. [15] H. Jiang, Y. Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. Abdi, D. Li, C. Lin, Y. Yang, and L. 
Qiu, “MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024. [16] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han, “Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference,” in Proc. 41st Int. Conf. Mach. Learn. (ICML), 2024. [17] Q. Yang, J. Wang, X. Li, Z. Wang, C. Chen, L. Chen, X. Yu, W. Liu, J. Hao, M. Yuan, and B. Li, “AttentionPredictor: Temporal Patterns Matter for KV Cache Compression,” in Advances in Neural Information Processing Systems (NeurIPS), 2025. [18] C. Wu, J. Cao, R. Xu, Z. Ran, and M. Che, “DuSA: Fast and Accurate Dual-Stage Sparse Attention Mechanism Accelerating Both Training and Inference,” in Advances in Neural Information Processing Systems (NeurIPS), 2025. [19] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient Streaming Language Models with Attention Sinks,” arXiv preprint arXiv:2309.17453, 2023. [20] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned,” in Proc. 57th Annual Meeting Assoc. Comput. Linguist. (ACL), p. 5797–5808, 2019. [21] P. Michel, O. Levy, and G. Neubig, “Are Sixteen Heads Really Better than One?” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019. [22] Y. You, I. Gitman, and B. Ginsburg, “Large Batch Training of Convolutional Networks,” arXiv preprint arXiv:1708.03888, 2017. [23] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2020. [24] F. P. Figueroa-Rey, “Evaluating Adaptive Layer Freezing through Hyperparameter Optimization,” Master’s Thesis, Massachusetts Institute of Technology, 2024. [25] C. Pham, H. A. Dung, C. C. Nguyen, T. Le, G. Carneiro, J. 
Cai, and T.-T. Do, “Adaptive Layer-Wise Transformations for Post-Training Quantization of Large Language Models,” arXiv preprint arXiv:2511.17809, 2025. [26] Q. Tang, B. Zhang, J. Liu, F. Liu, and Y. Liu, ”Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, p. 777–786. [27] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization,” J. Mach. Learn. Res., vol. 13, p. 281–305, 2012. [28] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Optimization of Machine Learning Algorithms,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 25, 2012. [29] Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas, “Towards Learning Universal Hyperparameter Optimizers with Transformers,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [30] J. White, S. Xie, Y. Chen, Z. Ren, and M. Savva, “Training-free Neural Architecture Search for RNNs and Transformers,” in Proc. 61st Annual Meeting Assoc. Comput. Linguist. (ACL), p. 1714–1730, 2023. [31] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing, “Neural Architecture Search with Bayesian Optimisation and Optimal Transport,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 31, 2018. [32] X. Meng and G. E. Karniadakis, “A Composite Neural Network that Learns from Multi-Fidelity Data: Application to Function Approximation and Inverse PDE Problems,” J. Comput. Phys., vol. 401, p. 109020, 2020. [33] Z. Zhu, S. Wu, C. R. Shelton, and X. Liang, “Multi-Fidelity Bayesian Optimization via Deep Neural Networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2020. [34] T. Dao, “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision,” Blog post, 2024. [Online]. Available: https://tridao.me/blog/2024/flash3/ [35] S. Teerapittayanon, B. McDanel, and H. T. 
Kung, “BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks,” in Proc. 23rd Int. Conf. Pattern Recognit. (ICPR), p. 2464–2469, 2016. [36] L. Yang, Y. Han, X. Chen, S. Song, J. Dai, and G. Huang, “Resolution Adaptive Networks for Efficient Inference,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), p. 2369–2378, 2020. [37] J. Mei, Y. Cai, R. Hou, X. Chen, Z. Li, and J. Leskovec, “Elastic Attention: Test-Time Adaptive Sparsity Ratios for Efficient Transformers,” arXiv preprint arXiv:2601.17367, 2025.