
Paper deep dive

Sharper Generalization Bounds for Transformer

Yawen Li, Tao Hu, Zhouhui Lian, Wan Tian, Yijie Peng, Huiming Zhang, Zhongyi Li

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 90

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/26/2026, 2:29:46 AM

Summary

This paper derives sharper generalization error bounds for Transformer models (single-layer, multi-head, and multi-layer) using offset Rademacher complexity. By connecting this complexity measure to empirical covering numbers, the authors establish optimal O(1/n) convergence rates for excess risk, extending results to unbounded sub-Gaussian features and heavy-tailed distributions.

Entities (5)

Transformer · model-architecture · 100%
Yawen Li · author · 100%
Offset Rademacher Complexity · mathematical-framework · 99%
Covering Number · mathematical-concept · 95%
Excess Risk · metric · 95%

Relation Signals (2)

Offset Rademacher Complexity · achieves convergence rate · O(1/n)

confidence 95% · we obtain excess risk bounds that achieve optimal convergence rates up to constant factors... of order O(1/n)

Transformer · has generalization bound · Offset Rademacher Complexity

confidence 95% · Based on the offset Rademacher complexity, we derive sharper generalization bounds for different Transformer architectures

Cypher Suggestions (2)

Find all Transformer architectures analyzed in the paper · confidence 90% · unvalidated

MATCH (t:Model {name: 'Transformer'})-[:HAS_VARIANT]->(v:Architecture) RETURN v.name

Link mathematical frameworks to their convergence properties · confidence 85% · unvalidated

MATCH (f:Framework)-[:YIELDS_RATE]->(r:ConvergenceRate) RETURN f.name, r.value

Abstract

Abstract: This paper studies generalization error bounds for Transformer models. Based on the offset Rademacher complexity, we derive sharper generalization bounds for different Transformer architectures, including single-layer single-head, single-layer multi-head, and multi-layer Transformers. We first express the excess risk of Transformers in terms of the offset Rademacher complexity. By exploiting its connection with the empirical covering numbers of the corresponding hypothesis spaces, we obtain excess risk bounds that achieve optimal convergence rates up to constant factors. We then derive refined excess risk bounds by upper bounding the covering numbers of Transformer hypothesis spaces using matrix ranks and matrix norms, leading to precise, architecture-dependent generalization bounds. Finally, we relax the boundedness assumption on feature mappings and extend our theoretical results to settings with unbounded (sub-Gaussian) features and heavy-tailed distributions.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

89,494 characters extracted from source content.


Sharper Generalization Bounds for Transformer

Yawen Li, Tao Hu, Zhouhui Lian, Wan Tian, Yijie Peng, Huiming Zhang, Zhongyi Li

Abstract. This paper studies generalization error bounds for Transformer models. Based on the offset Rademacher complexity, we derive sharper generalization bounds for different Transformer architectures, including single-layer single-head, single-layer multi-head, and multi-layer Transformers. We first express the excess risk of Transformers in terms of the offset Rademacher complexity. By exploiting its connection with the empirical covering numbers of the corresponding hypothesis spaces, we obtain excess risk bounds that achieve optimal convergence rates up to constant factors. We then derive refined excess risk bounds by upper bounding the covering numbers of Transformer hypothesis spaces using matrix ranks and matrix norms, leading to precise, architecture-dependent generalization bounds. Finally, we relax the boundedness assumption on feature mappings and extend our theoretical results to settings with unbounded (sub-Gaussian) features and heavy-tailed distributions.

1 Introduction

Transformer-based models have become a central component of modern machine learning systems and have achieved remarkable success across a wide range of application domains (Chang et al., 2024; de Santana Correia & Colombini, 2022). Originally developed for sequence modeling tasks (Vaswani et al., 2017), Transformers now underpin state-of-the-art performance in natural language processing (Wang et al., 2019; Zhang et al., 2023), computer vision (Han et al., 2022), and reinforcement learning (Chen et al., 2021a; Hu et al., 2024). Their architectural flexibility and strong empirical performance have also led to widespread adoption in large-scale foundation models. Despite their empirical success, understanding the generalization behavior of Transformer models remains a fundamental theoretical challenge.
Providing sharp and architecture-aware generalization error bounds is essential for explaining their empirical robustness and for developing principled insights into the role of architectural design in learning performance. Existing nonparametric and functional-analytic approaches remain limited in explaining the generalization behavior of deep models: nonparametric theories typically rely on Hölder-type smoothness assumptions on the target function (Schmidt-Hieber, 2020; Bos & Schmidt-Hieber, 2022), yielding error bounds that characterize worst-case function classes and thus fail to capture the adaptivity of deep networks in high-dimensional structured settings; analyses based on Sobolev and related function spaces primarily focus on approximation properties and are often derived under idealized or infinite-sample assumptions (Yang, 2025; Ding et al., 2025; Meng & Ming, 2022; Ma et al., 2022), providing insufficient characterization of finite-sample statistical errors and the influence of training algorithms. In this work, we analyze the generalization ability of Transformer models from a complexity-based perspective. Compared with analyses relying on smoothness or approximation assumptions, complexity-based approaches yield generalization bounds that explicitly depend on the sample size, model scale, and parameter norms, making them more suitable for modern deep models such as Transformers with large parameterization and highly complex architectures. The core idea of this approach is to characterize the excess risk by means of complexity measures of the hypothesis space, and then derive upper bounds on these measures, typically based on structural properties of the hypothesis class, such as its covering number or Vapnik–Chervonenkis (VC) dimension (Blumer et al., 1989). The excess risk evaluates the performance of the estimator $\hat f_n$:

$\mathcal{E}(\hat f_n; \ell) := R(\hat f_n) - \inf_{g \in \mathcal{G}} R(g)$,  (1)

where $\mathcal{G}$ is a target function class.
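As a concrete numerical companion to (1), the sketch below computes an empirical risk minimizer over a small finite hypothesis class under squared loss and Monte-Carlo estimates its excess risk. Everything here is invented for illustration (the data-generating process, the candidate slopes, and the sample sizes); it is not the paper's setting, only a minimal instance of the ERM/excess-risk definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: Y = f*(X) + noise, with f*(x) = 0.5 * x (hypothetical choice).
n = 500
X = rng.normal(size=n)
Y = 0.5 * X + 0.1 * rng.normal(size=n)

# Finite hypothesis class F = {x -> a * x} over a few slopes a.
slopes = [0.0, 0.25, 0.5, 0.75, 1.0]
candidates = [(a, lambda x, a=a: a * x) for a in slopes]

def empirical_risk(f):
    """R_n(f) = (1/n) sum_i (Y_i - f(X_i))^2, squared loss."""
    return np.mean((Y - f(X)) ** 2)

# Empirical risk minimizer, cf. the definition of the estimator \hat f_n.
a_hat, f_hat = min(candidates, key=lambda af: empirical_risk(af[1]))

# Monte-Carlo estimate of the population risk R(f) on a fresh sample.
m = 100_000
Xp = rng.normal(size=m)
Yp = 0.5 * Xp + 0.1 * rng.normal(size=m)

def population_risk(f):
    return np.mean((Yp - f(Xp)) ** 2)

risks = {a: population_risk(f) for a, f in candidates}
# Excess risk as in eq. (1), taking the target class G equal to F here.
excess = population_risk(f_hat) - min(risks.values())
print(a_hat, round(excess, 4))
```

Since the true slope lies in the class, the ERM recovers it and the excess risk vanishes; in the misspecified setting the approximation error term would remain.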
If $\mathcal{F} \subsetneq \mathcal{G}$ (the misspecified setting), this term decomposes into estimation error and approximation error. Different complexity measures yield different convergence rates for the excess risk. Using Gaussian complexity or global Rademacher complexity typically leads to a convergence rate of $O(1/\sqrt{n})$ (Zhang, 2023), where n denotes the sample size. In contrast to global Rademacher complexity, local Rademacher complexity is a data-dependent complexity measure that focuses on "local" subsets of the hypothesis space—particularly those functions with small empirical risk—thus avoiding overly pessimistic penalties on the entire function class (Bartlett et al., 2005). This enables sharper generalization bounds with a faster convergence rate of $O(1/n)$. However, this approach relies on an assumption about the noise level of the loss function—specifically, that the variance of the loss is upper-bounded by its expectation, known as the Bernstein condition—which may be difficult to verify in real-world applications. More recently, the offset Rademacher complexity, introduced by Liang et al. (2015) and defined as

$\mathcal{R}_n^{off}(\mathcal{F}, \beta) = \mathbb{E}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big(\tau_i f(X_i) - \beta f(X_i)^2\big)\Big]$,

has been proposed as a penalized variant of the global Rademacher complexity. It achieves the optimal convergence rate without explicitly imposing the Bernstein condition. This measure has been successfully applied to a wide range of models—including parametric models, nonparametric models, and neural networks—to improve convergence rates (Duan et al., 2023). Given these theoretical advancements, several studies have already explored the generalization error of Transformers from different perspectives. Our contributions can be summarized as follows: • We provide optimal convergence rates for the excess risk of various Transformer architectures, taking their structural parameters into account.
Specifically, we first derive the relationship between the excess risk and the offset Rademacher complexity for single-layer single-head, single-layer multi-head, and multi-layer Transformers, and then establish the connection between the offset Rademacher complexity and the empirical covering numbers of the corresponding hypothesis spaces, which yields the optimal convergence rate of $O(1/n)$. • We analyze the covering number upper bounds of Transformers from two perspectives—rank and norm—thereby obtaining precise generalization error bounds for Transformer models. • We extend our theoretical findings, which require certain boundedness assumptions (e.g., on feature maps), to unbounded (sub-Gaussian) and heavy-tailed settings. Finally, we demonstrate the applicability of our theoretical results to regression and classification tasks, as well as to scenarios involving robust loss functions.

2 Related Work

Regarding generalization theory for Transformers, Trauger & Tewari (2023) derive sequence-length-independent bounds for single-layer Transformers via covering-number control of associated linear classes, highlighting how low-rank structure can tighten complexity and yielding $O(1/\sqrt{n})$ rates. Similarly, Truong (2024) develops norm- and rank-dependent bounds using (global) Rademacher complexity, also independent of sequence length. Beyond capacity-based analyses, Havrilla & Liao (2024) explain empirical scaling laws through a benign-overfitting lens when data concentrate on low-dimensional manifolds, while Zhang et al. (2025) further connect generalization error to training dynamics by distinguishing benign versus harmful overfitting under label-flip noise. Recent work has also expanded toward regime- and task-structured guarantees: Mwigo & Dasgupta (2026) establish norm/Rademacher bounds for shallow Transformers trained by gradient descent in the lazy-training regime; Huang et al.
(2025) provide a formal framework for length generalization in causal Transformers with learnable absolute positional encodings; Alokhina & Li (2026) characterize size generalization on variable-size geometric inputs via discrete-to-continuous approximation under stable positional encodings; and Zhang et al. (2026) derive bit-wise Rademacher-complexity bounds for Transformer channel decoders, leveraging sparsity from parity-check masked attention to tighten covering arguments. Despite this progress, many existing results rely on global complexity measures and consequently yield conservative (often suboptimal) excess-risk rates, and they typically impose boundedness assumptions on feature mappings; in contrast, our work extends the theory to unbounded (sub-Gaussian) features and further to heavy-tailed distributions.

3 Preliminaries

In this section, we establish the mathematical framework used throughout the paper. We begin by defining the Transformer architecture—spanning single-head, multi-head, and multi-layer variants—and then formalize the statistical learning setup, including the loss functions, risk definitions, and the complexity measures that underpin our generalization analysis.

3.1 Self-Attention and Transformers

Unlike standard feedforward networks with fixed connectivity, Transformers utilize a self-attention mechanism in which the weights are data-dependent and recomputed for every input. This allows each position in a sequence (e.g., a token in text) to attend to all other positions, preserving long-range dependencies. Let $X \in \mathbb{R}^{T \times d}$ denote the input sequence with length T and embedding dimension d. We define the row-wise softmax operator $\mathrm{softmax}: \mathbb{R}^{T \times T} \to \mathbb{R}^{T \times T}$ such that, for any matrix A, entry $(i,j)$ of $\mathrm{softmax}(A)$ is given by $\exp(A_{ij}) / \sum_{k=1}^{T} \exp(A_{ik})$. Let $\sigma: \mathbb{R} \to \mathbb{R}$ be an element-wise $L_\sigma$-Lipschitz activation function with $\sigma(0) = 0$.

Single-Head Attention.
A single attention head is parameterized by matrices $W_Q, W_K \in \mathbb{R}^{d \times d}$, $W_v \in \mathbb{R}^{d \times k}$, and $W_c \in \mathbb{R}^{k \times d}$. For notational convenience, we denote the query-key interaction matrix as $W_{QK} := W_Q W_K^\top$. The output of a single-head layer is defined as:

$f_{SH}(X) = \sigma\big(\mathrm{softmax}(X W_{QK} X^\top)\, X W_v\big) W_c \in \mathbb{R}^{T \times d}$.  (2)

To obtain a scalar prediction for regression or classification, we assume the input sequence contains a special token (e.g., [CLS]). Let $Y_{[CLS]} \in \mathbb{R}^d$ be the row corresponding to this token in the output of (2). The final prediction is given by $w^\top Y_{[CLS]}$, where $w \in \mathbb{R}^d$ is a learnable readout vector.

Multi-Head Attention. In a multi-head Transformer with H heads, the model aggregates the outputs of independent heads. Let the h-th head be parameterized by $\{W_{h,Q}, W_{h,K}, W_{h,v}, W_{h,c}\}$. The layer output is given by:

$f_{MH}(X) = \sum_{h=1}^{H} \sigma\big(\mathrm{softmax}(X W_{h,QK} X^\top)\, X W_{h,v}\big) W_{h,c}$.  (3)

The scalar prediction is obtained via the readout vector w applied to the aggregated [CLS] representation: $w^\top \big(\sum_{h=1}^{H} (Y_h)_{[CLS]}\big)$.

Multi-Layer Architecture. Deep Transformers are constructed by stacking layers, often incorporating normalization to aid optimization. Let $X^{(l)}$ denote the input to the l-th layer, with $X^{(1)} = X$. The l-th block is parameterized by $\mathcal{W}^{(l)} = \{W_{QK}^{(l)}, W_v^{(l)}, W_c^{(l)}\}$. We define the intermediate attention mapping as:

$\Phi(X^{(l)}; \mathcal{W}^{(l)}) := \sigma\big(\mathrm{softmax}(X^{(l)} W_{QK}^{(l)} (X^{(l)})^\top)\, X^{(l)} W_v^{(l)}\big) W_c^{(l)}$.  (4)

Let $\Pi_{norm}$ denote a row-wise normalization operator (e.g., projection onto the unit $\ell_2$-ball). The recursive update for the $(l+1)$-th layer input is defined as:

$X^{(l+1)} = \Pi_{norm}\big(\sigma\big(\Pi_{norm}(\Phi(X^{(l)}; \mathcal{W}^{(l)}))\big)\big)$.
(5)

This formulation encapsulates the dimension-preserving nature of the Transformer block while explicitly accounting for the normalization and activation steps crucial for theoretical stability.

Table 1: The Transformer architectures considered and their parameters.

Model | Definition | Parameters
Single-Head | Eq. (2) | $W_{QK} \in \mathbb{R}^{d \times d}$, $W_v \in \mathbb{R}^{d \times k}$, $W_c \in \mathbb{R}^{k \times d}$
Multi-Head | Eq. (3) | $\{W_{h,QK}, W_{h,v}, W_{h,c}\}_{h=1}^{H}$
Multi-Layer | Eq. (5) | $\{W_{QK}^{(l)}, W_v^{(l)}, W_c^{(l)}\}_{l=1}^{L}$

3.2 Excess Risk and Complexity Measures

We consider the supervised learning setting with input-output pairs $Z = (X, Y) \in \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^{T \times d}$ and $\mathcal{Y} \subseteq \mathbb{R}$. We are given an i.i.d. sample $\mathcal{D} = \{Z_i\}_{i=1}^{n}$ drawn from an unknown distribution P. A Transformer with parameters $W \in \mathcal{W}$ induces a predictor $f_W: \mathcal{X} \to \mathbb{R}$. For a loss function $\ell: \mathcal{Y} \times \mathbb{R} \to [0, \infty)$, the hypothesis class is denoted by $\mathcal{F} := \{f_W : W \in \mathcal{W}\}$. The empirical risk minimizer (ERM) is defined as:

$\hat f_n \in \arg\min_{f \in \mathcal{F}} R_n(f)$,  (6)

where

$R_n(f) := \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f(X_i))$.  (7)

The population risk is $R(f) := \mathbb{E}_{Z \sim P}[\ell(Y, f(X))]$. We evaluate the performance of the estimator via the excess risk in (1). To derive convergence rates, we utilize the offset Rademacher complexity (Liang et al., 2015), which typically yields faster rates (e.g., $O(1/n)$) compared to the standard Rademacher complexity ($O(1/\sqrt{n})$) without requiring hard-to-verify variance assumptions.

Definition 3.1 (Offset Rademacher Complexity). Let $\mathcal{H}$ be a class of real-valued functions defined on $\mathcal{Z}$. Given a sample $\mathbb{Z} = (Z_1, \ldots, Z_n)$ and a parameter $\beta > 0$, the conditional offset Rademacher complexity is:

$\mathcal{R}_n^{off}(\mathcal{H}, \beta \mid \mathbb{Z}) := \mathbb{E}_\tau\Big[\sup_{h \in \mathcal{H}} \Big(\frac{1}{n} \sum_{i=1}^{n} \tau_i h(Z_i) - \frac{\beta}{n} \sum_{i=1}^{n} h(Z_i)^2\Big)\Big]$,  (8)

where $\tau = (\tau_1, \ldots, \tau_n)$ are i.i.d. Rademacher variables.
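The conditional complexity in Definition 3.1 can be estimated by Monte Carlo for a small finite class. The sketch below is purely illustrative (the three functions, the sample, and the value of β are invented): it averages the supremum in (8) over random Rademacher draws.

```python
import numpy as np

rng = np.random.default_rng(1)

def offset_rademacher(H_vals, beta, n_draws=2000, rng=rng):
    """Monte-Carlo estimate of the conditional offset Rademacher
    complexity in eq. (8).

    H_vals: array of shape (num_functions, n) holding h(Z_i) for each
            function h in the class H, evaluated on the fixed sample.
    """
    num_h, n = H_vals.shape
    total = 0.0
    for _ in range(n_draws):
        tau = rng.choice([-1.0, 1.0], size=n)          # Rademacher signs
        lin = H_vals @ tau / n                         # (1/n) sum_i tau_i h(Z_i)
        pen = beta * (H_vals ** 2).sum(axis=1) / n     # (beta/n) sum_i h(Z_i)^2
        total += np.max(lin - pen)                     # sup over h in H
    return total / n_draws

# Tiny illustrative class: three functions evaluated on n = 50 points.
n = 50
Z = rng.normal(size=n)
H_vals = np.stack([np.zeros(n), Z, np.sin(Z)])

est = offset_rademacher(H_vals, beta=0.5)
# The zero function contributes 0 to the supremum, so the estimate is
# nonnegative; the quadratic offset pulls the other terms down.
print(round(est, 4))
```

The quadratic penalty is what distinguishes (8) from the plain Rademacher complexity: functions with large empirical second moment are discounted, which is the mechanism behind the faster rates used throughout the paper.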
The unconditional complexity is $\mathcal{R}_n^{off}(\mathcal{H}, \beta) := \mathbb{E}_{\mathbb{Z}}[\mathcal{R}_n^{off}(\mathcal{H}, \beta \mid \mathbb{Z})]$.

4 Excess Risk of Transformers

In this section, we analyze the generalization performance of Transformer models via the framework of offset Rademacher complexity. We derive excess risk bounds that achieve fast convergence rates of order $O(1/n)$ under standard regularity assumptions.

4.1 Single-Layer Single-Head Transformer

We begin by defining the class of excess loss functions. Let $\mathcal{F}$ denote the hypothesis class of single-layer, single-head Transformers as defined in (2). We consider the function class $\mathcal{G} := \{X \mapsto g(X; f) \mid f \in \mathcal{F}\}$, where

$g(X; f) := \mathbb{E}_{Y|X}\big[\ell(Y, f(X)) - \ell(Y, f^\star(X)) \mid X\big]$.

Here, $f^\star \in \arg\min_{f \in \mathcal{F}} \mathbb{E}[\ell(Y, f(X))]$ denotes the population risk minimizer within the class. The offset Rademacher complexity (Liang et al., 2015) for this class is defined as:

$\mathcal{R}_n^{off}(\mathcal{G}, \beta) := \mathbb{E}_{\mathcal{D}, \tau}\Big[\sup_{f \in \mathcal{F}} \Big(\frac{1}{n} \sum_{i=1}^{n} \tau_i\, g(X_i; f) - \frac{\beta}{n} \sum_{i=1}^{n} g(X_i; f)^2\Big)\Big]$,

where $\tau = (\tau_i)_{i=1}^{n}$ are i.i.d. Rademacher variables independent of the sample $\mathcal{D}$. Let the Transformer be parameterized by $W = \{W_v, W_c, W_{QK}, w\}$. We impose the following regularity assumptions on the parameters and the data generating process.

Assumption 4.1 (Bounded Parameters). There exist positive constants $B_v, B_c, B_{QK}, B_w$ such that:

$\|W_v\|_{2 \to 2} \le B_v$, $\|W_c\|_{2 \to 2} \le B_c$, $\|W_{QK}\|_{2 \to 2} \le B_{QK}$, $\|w\|_2 \le B_w$.

Assumption 4.2 (Bounded Target and Inputs). There exist constants $B, B_X < \infty$ such that, almost surely:

$|f^\star(X)| \le B$ and $\|X\|_{2 \to 2} \le B_X$,

where $X \in \mathbb{R}^{T \times d}$ denotes the input matrix.

Assumption 4.3 (Lipschitz Continuity of Excess Risk). There exists a constant $\kappa > 0$ such that for all $X \in \mathcal{X}$ and $f_1, f_2 \in \mathcal{F}$:

$|g(X; f_1) - g(X; f_2)| \le \kappa\, |f_1(X) - f_2(X)|$.
Assumption 4.3 is satisfied by standard loss functions (e.g., the logistic or absolute loss) when $\ell(Y, \cdot)$ is κ-Lipschitz, and by the squared loss under the bounded output assumptions implied by Assumptions 4.1 and 4.2.

Theorem 4.4. Suppose Assumptions 4.1, 4.2, and 4.3 hold. Let $\hat f_n$ be the empirical risk minimizer. Then:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le 4\, \mathcal{R}_n^{off}\Big(\mathcal{G}, \frac{1}{M_{SH}}\Big) + \inf_{f \in \mathcal{F}} \mathcal{E}(f; \ell)$,

where $M_{SH} := 2\kappa B + 2\kappa B_w B_c B_v L_\sigma B_X$. The penalty parameter $\beta = 1/M_{SH}$ captures the joint effect of the Lipschitz constant κ, the activation smoothness $L_\sigma$, and the magnitude of the parameters and inputs. This constant serves as a bound on the variance of the excess loss.

Corollary 4.5. Under the assumptions of Theorem 4.4, for any $\delta > 0$, the excess risk satisfies:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le \frac{2 M_{SH}}{n}\Big(1 + \log \mathbb{E}_X\big[N_\infty(\delta, \mathcal{F}, X)\big]\Big) + 8\kappa\delta + \inf_{f \in \mathcal{F}} \mathcal{E}(f)$,

where $N_\infty(\delta, \mathcal{F}, X)$ denotes the $\ell_\infty$-covering number of $\mathcal{F}$ on the sample X.

Corollary 4.5 demonstrates that the single-head Transformer achieves an $O(1/n)$ convergence rate (modulo logarithmic factors), improving upon the $O(1/\sqrt{n})$ rates typical of standard Rademacher analysis. The covering number $N_\infty$ links this bound to the metric entropy of the function class.

4.2 Single-Layer Multi-Head Transformers

We extend the analysis to the class $\mathcal{F}_{MH}$ of single-layer Transformers with H heads, as defined in (3). We generalize the regularity assumptions as follows:

Assumption 4.6. The inner excess risk satisfies Assumption 4.3. Furthermore, for each head $h \in \{1, \ldots, H\}$, the parameters satisfy the bounds in Assumption 4.1, and the inputs satisfy Assumption 4.2.

Theorem 4.7. Suppose Assumption 4.6 holds.
The empirical risk minimizer for the single-layer multi-head Transformer satisfies:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le 4\, \mathcal{R}_n^{off}\Big(\mathcal{G}, \frac{1}{M_{MH}}\Big) + \inf_{f \in \mathcal{F}_{MH}} \mathcal{E}(f; \ell)$,

where $M_{MH} := 2\kappa B + 2\kappa H B_w B_c B_v L_\sigma B_X$. The constant $M_{MH}$ scales linearly with the number of heads H, reflecting the increased capacity and potential variance of the aggregated representation.

Corollary 4.8. Under the assumptions of Theorem 4.7, for any $\delta > 0$:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le \frac{2 M_{MH}}{n}\Big(1 + \log \mathbb{E}_X\big[N_\infty(\delta, \mathcal{F}_{MH}, X)\big]\Big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{MH}} \mathcal{E}(f; \ell)$.

4.3 Multi-Layer Transformers

Finally, we consider the class $\mathcal{F}_{ML}$ of Transformers with multiple layers. The complexity of deep architectures introduces dependencies on the network depth in the covering number.

Assumption 4.9 (Bounded Parameters for Multi-Layer Models). There exist constants $B_v, B_c, B_{QK}, B_w < \infty$ such that for every layer the parameter matrices are spectrally bounded by $B_v, B_c, B_{QK}$, respectively, and the final readout satisfies $\|w\|_2 \le B_w$.

Assumption 4.10 (Regularity of Multi-Layer Risk). The inner excess risk satisfies Assumption 4.3. The target function satisfies $|f^\star(X)| \le B$, and the network inputs satisfy $\|X\|_{2 \to 2} \le B_X$. Furthermore, the parameters satisfy Assumption 4.9.

Theorem 4.11. Suppose Assumption 4.10 holds. The empirical risk minimizer for the multi-layer Transformer satisfies:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le 4\, \mathcal{R}_n^{off}\Big(\mathcal{G}, \frac{1}{2\kappa(B + B_w)}\Big) + \inf_{f \in \mathcal{F}_{ML}} \mathcal{E}(f; \ell)$.

Remark 4.12. Note that the penalty parameter in Theorem 4.11 depends explicitly on the readout bound $B_w$ and the target bound B, assuming the output of the final Transformer block is normalized (as is common in practice with LayerNorm).
However, the complexity of the hidden layers and the depth of the network are implicitly captured within the covering number term in the following corollary.

Corollary 4.13. Under the assumptions of Theorem 4.11, for any $\delta > 0$:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le \frac{4\kappa(B + B_w)}{n}\Big(1 + \log \mathbb{E}_X\big[N_\infty(\delta, \mathcal{F}_{ML}, X)\big]\Big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{ML}} \mathcal{E}(f; \ell)$.

5 Parameter-Dependent Excess Risk Bounds

Building on the general framework established in Section 4, we now derive sharper excess risk bounds by explicitly controlling the covering numbers of the Transformer hypothesis class. We consider two distinct regimes: (1) norm-based bounds, which constrain the $\ell_{1,1}$ magnitude of the parameters, and (2) rank-based bounds, which exploit the low-rank structure of the weight matrices. Due to space constraints, the single-layer and multi-head generalization bounds discussed in this section are presented in Appendix B.

5.1 Norm-Based Excess Risk Bounds

We first refine our parameter assumptions to enable tighter control via norm-based covering numbers. Following Trauger & Tewari (2023), we impose constraints on the $\ell_{1,1}$ norms of the weight matrices, which is natural for deriving bounds independent of the sequence length.

Assumption 5.1 (Norm-Bounded Parameters). There exist constants $B_v, B_c, B_{QK}, B_w < \infty$ such that:

$\|W_v\|_{1,1} \le B_v$, $\|W_c\|_{1,1} \le B_c$, $\|W_{QK}\|_{1,1} \le B_{QK}$, $\|w\|_2 \le B_w$.

Furthermore, we assume the input token representations are bounded. Let $x_{[CLS]}$ denote the representation of the classification token. We assume $\|x_{[CLS]}\|_2 \le B_x$ almost surely.
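As a quick numerical companion to Assumption 5.1, the sketch below computes the entrywise $\ell_{1,1}$ norm alongside the spectral ($2 \to 2$) norm used in Section 4, and checks them against a budget. The matrix sizes and the bound $B_v$ are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def norm_11(W):
    """Entrywise l_{1,1} norm: the sum of absolute values of all entries."""
    return np.abs(W).sum()

def norm_spectral(W):
    """Operator (2 -> 2) norm: the largest singular value."""
    return np.linalg.norm(W, 2)

d, k = 8, 4
W_v = rng.normal(size=(d, k)) / d   # hypothetical value matrix

# The l_{1,1} norm always dominates the spectral norm: it dominates the
# Frobenius norm (sum of |entries| >= sqrt of sum of squares), which in
# turn dominates the largest singular value.
assert norm_spectral(W_v) <= norm_11(W_v)

# Checking Assumption 5.1 against a hypothetical budget B_v:
B_v = 5.0
print(norm_11(W_v) <= B_v)
```

The ordering checked above is why $\ell_{1,1}$ constraints are the stronger (and sequence-length-friendly) assumption relative to the spectral bounds of Assumption 4.1.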
We adopt the following covering number bound for linear operators, which is a standard result in statistical learning theory (see, e.g., Trauger & Tewari, 2023, Lemma 3.6). This lemma provides a sequence-length-independent bound, offering theoretical assurance for the stability of Transformer generalization in long-sequence tasks: it mitigates the risk of uncontrolled generalization errors arising from increasing sequence lengths. It thereby provides theoretical support for applying Transformers in long-sequence scenarios such as long-text processing and high-dimensional time series prediction (e.g., financial time series, meteorological observations), enhancing the model's applicability in complex real-world settings.

Lemma 5.2 (Linear Covering Number). For the function class $\mathcal{H}_{lin} = \{x \mapsto Wx \mid \|W\|_{1,1} \le B_W\}$, the ε-covering number satisfies:

$\log N(\epsilon, \mathcal{H}_{lin}, \|\cdot\|_2) \le \frac{C_1 B_x^2 B_W^2}{\epsilon^2}$,

where $C_1$ is a universal constant depending on the geometry of the space.

Theorem 5.3 (Multi-Layer Norm-Based Bound). Suppose Assumptions 4.3 and 5.1 hold and Lemma 5.2 applies. The empirical risk minimizer satisfies:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le \frac{4(\kappa B + \kappa B_w)}{n}\Big(1 + \log \frac{(\gamma_{ML} + \eta_{ML})^3}{\delta^2}\Big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{ML}} \mathcal{E}(f; \ell)$.

Here, defining $\alpha_i = \prod_{j=i}^{L} L_\sigma B_c B_v (1 + 4 B_{QK})$, we have:

$\tau_i = \alpha_i^{2/3} + (2 \alpha_i L_\sigma B_c B_v)^{2/3} + (\alpha_i L_\sigma B_v)^{2/3}$,

$\gamma_{ML} = C_1^{1/3} (2 L_\sigma B_c B_v \alpha_1 B_w B_x^2)^{2/3} + C_1^{1/3} B_x^{2/3} \big(1 + (\alpha_1 B_w)^{2/3} + (\alpha_1 B_w L_\sigma B_v)^{2/3}\big)$,

$\eta_{ML} = C_1^{1/3} B_w^{2/3} \sum_{i=2}^{L} \tau_i$.

Remark 5.4. Our norm-based bounds depend explicitly on the parameter norms rather than the sequence length T.
This represents an improvement over Trauger & Tewari (2023) by providing tighter control through the offset Rademacher complexity framework.

5.2 Rank-Based Excess Risk Bounds

Next, we exploit the low-rank structure of the parameter matrices to derive alternative bounds. We combine the rank-based covering numbers for linear models (Truong, 2024) with our offset complexity analysis.

Assumption 5.5 (Rank Structure and Inputs). We assume the parameter matrices have ranks bounded by $r_v, r_c, r_{QK}$. The target function and inputs satisfy $|f^\star(X)| \le B$, $\|X\|_{2 \to 2} \le B_X$, and $\|x_{[CLS]}\|_2 \le B_x$ almost surely.

Deep networks exhibit significantly tighter generalization bounds when they possess low-rank or quasi-low-rank structure: their sample complexity no longer scales with the total number of parameters but is instead governed by the effective rank and spectral norm of the weight matrices. For instance, Ledent et al. (2025) derived generalization bounds based on the Schatten-p norm for rank-sparse networks, demonstrating that model complexity scales with the rank of the layers; Pinto et al. (2024) proved via Gaussian complexity that deep networks with low-rank layers exhibit smaller function-class complexity. The generalization capability of Transformers may rely more on the inherent low-rank constraints of the attention mechanism than on the total number of model parameters (Truong, 2024). Thus, low-rank generalization bounds provide a theoretical direction and a potential mechanism for explaining why large Transformers maintain strong generalization performance even with extremely massive parameter counts. To optimize the covering number allocation across different components, we utilize the following result; see (Truong, 2024, Theorem 4).

Corollary 5.6 (Optimal Covering Allocation). Let $r_i, C_i, \beta_i \ge 0$ for $i \in [m]$.
The solution to the optimization problem

$\min_{\epsilon_1, \ldots, \epsilon_m} \sum_{i=1}^{m} r_i C_i \log\Big(\frac{r_i B_X^2}{\epsilon_i^2}\Big)$, subject to $\sum_{i=1}^{m} \beta_i \epsilon_i = \epsilon$,

is given by $\sum_{i=1}^{m} r_i C_i \log(b_i^2 / \epsilon^2)$, where $b_i = \frac{\big(B_X \beta_i \sum_{j=1}^{m} r_j C_j\big)^{1/2}}{r_i^{1/2} C_i^{1/2}}$.

Assumption 5.7 (Rank-Based Covering Number). For the function class $\mathcal{H}_{rank} = \{x \mapsto Wx \mid \mathrm{rank}(W) \le r, \|W\|_2 \le B_W\}$, the ε-covering number satisfies:

$\log N(\epsilon, \mathcal{H}_{rank}, \|\cdot\|_2) \le C_1 r \log\Big(\frac{r B_x^2 B_W^2}{\epsilon^2}\Big)$.

Theorem 5.8 (Multi-Layer Rank-Based Bound). Suppose Assumptions 5.5 and 5.7 hold. The empirical risk minimizer satisfies:

$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}(\hat f_n; \ell)\big] \le \frac{4(\kappa B + \kappa B_w)}{n}\Big(1 + \sum_{i=1}^{m} r_i C_i \log\Big(\frac{b_i^2}{\delta^2}\Big)\Big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{ML}} \mathcal{E}(f; \ell)$.

Here, the effective bounds $b_i$ are derived using Corollary 5.6 with weights $\beta_k$ defined recursively:

$\beta_k = \begin{cases} 1 & k = 1, \\ \alpha_{k-1} B_w & 2 \le k \le L+1, \\ \alpha_{k-L-1} B_w L_\sigma B_v & L+2 \le k \le 2L+1, \\ 2\,\alpha_{k-2L-1} L_\sigma B_c B_v B_w & 2L+2 \le k \le 3L+1. \end{cases}$

Remark 5.9. This result demonstrates that the excess risk is controlled by the sum of ranks across layers, offering a significantly tighter bound than full-rank spectral norm approaches, particularly for compressed or sparse Transformer models.

6 Excess Risk Bounds under Unbounded Assumptions

In the previous sections, we assumed bounded inputs, an assumption that is often violated in practice. In natural language processing, embedding norms may grow with the sequence length T (Zhou et al., 2025), while in vision-language models, image patch embeddings typically follow unbounded continuous distributions (Li et al., 2022). Such boundedness assumptions therefore fail to capture model behavior on high-dynamic-range inputs.
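To see the failure of boundedness numerically, the sketch below (illustrative only; the distributions, dimensions, and threshold are invented) compares empirical tail frequencies of $\|X\|_F$ for Gaussian entries (light-tailed) versus Pareto entries (heavy-tailed), and applies the kind of Frobenius-ball truncation this section relies on.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, trials = 16, 8, 2000

def frob_norms(sampler):
    """Frobenius norms of `trials` random T x d input matrices."""
    return np.array([np.linalg.norm(sampler((T, d))) for _ in range(trials)])

gauss = frob_norms(lambda s: rng.normal(size=s))
heavy = frob_norms(lambda s: rng.pareto(2.5, size=s))  # tail index ~ 2.5

# Empirical tail frequency of {||X||_F > t}: polynomial tails produce
# large deviations far more often than the concentrated Gaussian case.
t = 3 * np.median(gauss)
print((gauss > t).mean(), (heavy > t).mean())

def truncate(X, M):
    """Project X onto the Frobenius ball of radius M (truncation)."""
    nrm = np.linalg.norm(X)
    return X if nrm <= M else (M / nrm) * X

X = rng.pareto(2.5, size=(T, d))
M = 10.0
assert np.linalg.norm(truncate(X, M)) <= M + 1e-9
```

The truncation step is exactly the device used below: on the truncated domain the bounded-input analysis of Sections 4 and 5 applies, and the price paid is a tail-probability term controlled by the input distribution.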
In classical Rademacher complexity analysis, generalization bounds typically rely on boundedness of the input (Bartlett & Mendelson, 2002) to control the function-class complexity of linear models or neural networks. However, when inputs are unbounded, bounds that rely directly on input norms may fail. Even loss function truncation, such as M-truncation (Bartlett & Mendelson, 2002), log-truncation (Catoni, 2012), smooth/Huber-type truncation (Maurer, 2016), or extensions of Catoni's robust truncation (Xu et al., 2023; Coluccia, 2015), cannot guarantee bounded Rademacher complexity, as functional sensitivity to the inputs may still cause excessive empirical process volatility. To address this issue, Høgsgaard & Paudice (2025) show that replacing standard empirical averaging with Median-of-Means (MoM) estimation yields generalization bounds for unbounded inputs. In this section, we extend our analysis to unbounded inputs by employing truncation techniques. We analyze two regimes: (1) inputs following a sub-Gaussian distribution, where tails decay exponentially, and (2) inputs following a heavy-tailed distribution, requiring robust loss functions.

6.1 Excess Risk Bounds under the Sub-Gaussian Assumption

We first consider the case where the input X is unbounded but concentrates around its mean. We begin with the matrix form of the concentration inequality (see, e.g., Tropp, 2015, Theorem 4.1.1).

Lemma 6.1 (Matrix Gaussian and Rademacher Series). Consider a finite sequence $\{B_k\}$ of fixed complex matrices with dimension $d_1 \times d_2$, and let $\{\gamma_k\}$ be a finite sequence of independent standard normal variables. Introduce the matrix Gaussian series $Z = \sum_k \gamma_k B_k$. Let $v(Z)$ be the matrix variance statistic of the sum:

$v(Z) = \max\big\{\|\mathbb{E}(Z Z^*)\|, \|\mathbb{E}(Z^* Z)\|\big\} = \max\Big\{\Big\|\sum_k B_k B_k^*\Big\|, \Big\|\sum_k B_k^* B_k\Big\|\Big\}$.

Then

$\mathbb{E}\|Z\| \le \sqrt{2\, v(Z) \log(d_1 + d_2)}$.
Furthermore, for all $t \ge 0$,

$\mathbb{P}\{\|Z\| \ge t\} \le (d_1 + d_2) \exp\Big(-\frac{t^2}{2 v(Z)}\Big)$.

The same bounds hold when we replace $\{\gamma_k\}$ by a finite sequence $\{g_k\}$ of independent Rademacher random variables.

Assumption 6.2 (Unbounded Sub-Gaussian Inputs). The input $X \in \mathbb{R}^{T \times d}$ follows a sub-Gaussian distribution. Specifically, there exists a matrix variance proxy $\nu(X) := \max\{\|\mathbb{E}[X^\top X]\|_2, \|\mathbb{E}[X X^\top]\|_2\}$ such that for any $t > 0$:

$\mathbb{P}(\|X\|_F \ge t) \le (T + d) \exp\Big(-\frac{t^2}{2 \nu(X)^2}\Big)$.  (9)

The target function $f^\star$ is also assumed to be unbounded but with finite moments compatible with the input distribution. To handle unboundedness, we introduce a truncation operator $T_M$ with threshold $M > 0$:

$X_M := T_M(X) = \begin{cases} X & \text{if } \|X\|_F \le M, \\ \dfrac{M X}{\|X\|_F} & \text{if } \|X\|_F > M. \end{cases}$  (10)

Let $\mathcal{F}_M$ denote the class of Transformer functions operating on the truncated input domain $\{X : \|X\|_F \le M\}$. The truncation error is controlled by the tail probability of the event $\mathcal{E}_{trunc} = \{\|X\|_F > M\}$, which satisfies $\mathbb{P}(\mathcal{E}_{trunc}) \le (T + d) \exp(-M^2 / 2\nu(X)^2)$. We define the truncated empirical risk minimizer as:

$\hat f_n^{(M)} \in \arg\min_{f \in \mathcal{F}_M} \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f(X_{i,M}))$.  (11)

Assumption 6.3 (Lipschitz Continuity for Truncated Class). There exists $\kappa > 0$ such that for any $M > 0$ and all X, the excess risk satisfies the Lipschitz condition:

$|g(X; f_1) - g(X; f_2)| \le \kappa |f_1(X) - f_2(X)|, \quad \forall f_1, f_2 \in \mathcal{F}_M$.

Theorem 6.4 (Single-Layer Sub-Gaussian Bound). Suppose Assumption 6.3 holds, and the network parameters satisfy the norm-based constraints of Theorem B.1. Let $\hat f_n^{(M)}$ be the empirical risk minimizer on the truncated domain.
Then:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa B_w B_c B_v L_\sigma B_X)}{n}\Big(1 + \log\frac{\gamma_{SH}^3}{\delta^2}\Big) + T_{\mathrm{err}}(M) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{SH}} \mathcal{E}(f; \ell),$$
where the truncation error term is $T_{\mathrm{err}}(M) = O\big(\kappa\,\nu(X)(T + d)\,e^{-M^2/2\nu(X)^2}\big)$. The complexity term $\gamma_{SH}$ follows the definition in Theorem B.1 with $B_X$ replaced by the truncation threshold $M$.

Theorem 6.5 (Multi-Head Sub-Gaussian Bound). Suppose Assumption 6.3 holds, and the network parameters satisfy the norm-based constraints of Theorem B.2. Let $\hat{f}_n^{(M)}$ be the empirical risk minimizer on the truncated domain. Then:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa H B_w B_c B_v L_\sigma B_X)}{n}\Big(1 + \log\frac{\gamma_{MH}^3}{\delta^2}\Big) + T_{\mathrm{err}}(M) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{MH}} \mathcal{E}(f; \ell),$$
where the truncation error term is $T_{\mathrm{err}}(M) = O\big(\kappa\,\nu(X)(T + d)\,e^{-M^2/2\nu(X)^2}\big)$. The complexity term $\gamma_{MH}$ follows the definition in Theorem B.2 with $B_X$ replaced by the truncation threshold $M$.

Theorem 6.6 (Multi-Layer Sub-Gaussian Bound). Suppose Assumption 6.3 holds, and the network parameters satisfy the norm-based constraints of Theorem 5.3. Let $\hat{f}_n^{(M)}$ be the empirical risk minimizer on the truncated domain. Then:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n^{(M)}; \ell)] \le \frac{4(B\kappa + B_w\kappa)}{n}\Big(1 + \log\frac{(\gamma_M + \eta_M)^3}{\delta^2}\Big) + T_{\mathrm{err}}(M) + 8\kappa\delta + \inf_{f \in \mathcal{F}_M} \mathcal{E}(f; \ell),$$
where the truncation error term is $T_{\mathrm{err}}(M) = O\big(\kappa\,\nu(X)(T + d)\,e^{-M^2/2\nu(X)^2}\big)$. The complexity terms $\gamma_M, \eta_M$ follow the definitions in Theorem 5.3 with $B_X$ replaced by the truncation threshold $M$.

Remark 6.7. The bound exhibits a trade-off controlled by $M$. The complexity term grows as $\log(M)$, while the truncation error decays exponentially as $\exp(-M^2)$.
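The trade-off in Remark 6.7 can be illustrated numerically. A minimal sketch, with the truncation operator $T_M$ of Eq. (10) included for concreteness; the sample size, variance proxy $\nu$, and $(T+d)$ below are arbitrary illustrative values, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def truncate(X, M):
    """Truncation operator T_M of Eq. (10): rescale X onto the
    Frobenius ball of radius M when it falls outside."""
    nrm = np.linalg.norm(X)  # Frobenius norm for matrices
    return X if nrm <= M else (M / nrm) * X

# Illustrative surrogates for the two competing terms
# (absolute constants dropped; nu and T_plus_d are arbitrary choices).
n, nu, T_plus_d = 10_000, 1.0, 20

Ms = np.linspace(0.5, 20.0, 400)
complexity = np.log(Ms) / n                          # grows like log(M)/n
trunc_err = T_plus_d * np.exp(-Ms**2 / (2 * nu**2))  # decays like exp(-M^2)
M_star = Ms[np.argmin(complexity + trunc_err)]

# The minimizer sits at the sqrt(log n) scale claimed in the remark.
assert abs(M_star - np.sqrt(2 * nu**2 * np.log(n * T_plus_d))) < 2.0

# Truncation keeps every input inside the Frobenius ball of radius M_star.
X = 10.0 * rng.standard_normal((4, 3))
assert np.linalg.norm(truncate(X, M_star)) <= M_star + 1e-9
```

Because the exponential term collapses much faster than the logarithm grows, the minimizing threshold is insensitive to the exact constants, which is why the $\sqrt{\log n}$ scale is robust.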
Choosing $M \asymp \sqrt{\log n}$ balances these terms, yielding an overall convergence rate of $\tilde{O}(\log n / n)$, which is nearly optimal and consistent with finite-dimensional results.

6.2 Excess Risk Bounds under Heavy-Tailed Assumptions

In many scenarios, data distributions exhibit polynomial rather than exponential tail decay. If the tail of a distribution is heavier than that of any exponential distribution, i.e., for all $\mu > 0$ we have $\limsup_{x \to \infty} \bar{F}(x)/e^{-\mu x} = \infty$, then the distribution is called heavy-tailed (Nair et al., 2022). For example, the Pareto distribution and the t-distribution satisfy $\mathbb{P}(|X| > x) \propto x^{-\beta}$ for some $\beta > 0$. Under such heavy-tailed assumptions, standard empirical risk minimization may fail or yield suboptimal rates (Ostrovskii & Bach, 2021).

Assumption 6.8 (Heavy-Tailed Inputs). The input $X$ follows a heavy-tailed distribution with tail index $\beta > 2$. Specifically, there exists a constant $C > 0$ such that:
$$\mathbb{P}(\|X\|_F > x) \le C x^{-\beta}, \quad \forall x > 0. \tag{12}$$

To achieve robust estimation, we employ a robust loss function $\ell_\psi$, such as the Catoni loss (Catoni, 2012) or the generalized logarithmic truncation loss of Chen et al. (Chen et al., 2021b). These losses dampen the influence of outliers. We define the robust estimator:
$$\hat{f}_n^\psi \in \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \ell_\psi(Y_i, f(X_i)). \tag{13}$$

Theorem 6.9 (Single-Head Heavy-Tailed Bound). Suppose Assumptions 6.8 and 5.5 hold, and we employ a robust loss $\ell_\psi$. The excess risk of the single-layer single-head Transformer satisfies:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa B_w B_c B_v L_\sigma B_X)}{n}\big(1 + \log N_{\mathrm{rank},S}\big) + C_\beta M^{2-\beta} + 8\kappa\delta + \inf_{f \in \mathcal{F}_{SH}} \mathcal{E}(f; \ell),$$
where the complexity term $N_{\mathrm{rank},S}$ follows the definition in Theorem B.4 with $B_X$ replaced by the truncation threshold $M$. The term $C_\beta M^{2-\beta}$ represents the heavy-tail truncation bias.

Theorem 6.10 (Multi-Head Heavy-Tailed Bound).
Suppose Assumptions 6.8 and 5.5 hold, and we employ a robust loss $\ell_\psi$. The excess risk of the single-layer multi-head Transformer satisfies:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa H B_w B_c B_v L_\sigma B_X)}{n}\big(1 + \log N_{\mathrm{rank},M}\big) + C_\beta M^{2-\beta} + 8\kappa\delta + \inf_{f \in \mathcal{F}_{MH}} \mathcal{E}(f; \ell),$$
where the complexity term $N_{\mathrm{rank},M}$ follows the definition in Theorem B.5 with $B_X$ replaced by the truncation threshold $M$. The term $C_\beta M^{2-\beta}$ represents the heavy-tail truncation bias.

Theorem 6.11 (Multi-Layer Heavy-Tailed Bound). Suppose Assumptions 6.8 and 5.5 hold, and we employ a robust loss $\ell_\psi$. The excess risk of the multi-layer Transformer satisfies:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n^\psi; \ell_\psi)] \le \frac{4\tilde{M}_\psi}{n}\big(1 + \log N_{\mathrm{rank},ML}\big) + C_\beta M^{2-\beta} + 8\kappa\delta + \inf_{f \in \mathcal{F}} \mathcal{E}(f; \ell_\psi),$$
where $\tilde{M}_\psi$ depends on the Lipschitz constant of the robust loss and the parameter bounds in Theorem 5.8, and the complexity term $N_{\mathrm{rank},ML}$ follows the definitions in Theorem 5.8 with $B_X$ replaced by the truncation threshold $M$. The term $C_\beta M^{2-\beta}$ represents the heavy-tail truncation bias.

Remark 6.12. Here the convergence rate is limited by the tail index $\beta$. The complexity grows logarithmically with $M$, while the bias decays polynomially as $M^{-(\beta-2)}$. To balance the $O(\log M / n)$ variance and the $O(M^{-(\beta-2)})$ bias, the optimal truncation threshold is $M \asymp n^{\frac{1}{\beta-2}}$. This yields a slower convergence rate of roughly $O\big(n^{-\frac{\beta-2}{\beta-1}}\big)$, reflecting the fundamental difficulty of learning from heavy-tailed data.

7 Conclusion

In this paper, we derived sharper generalization bounds for Transformer models through the lens of offset Rademacher complexity. Our analysis yielded fast excess-risk rates of order $O(1/n)$ for single-head, multi-head, and multi-layer architectures, while explicitly capturing architecture-dependent structure via low-rank and norm-based parameter controls.
Moreover, we extended these guarantees beyond bounded feature assumptions to more practical regimes with unbounded sub-Gaussian inputs and heavy-tailed distributions. We leave two directions for future work. First, it would be of interest to develop an information-theoretic account of Transformer generalization, using mutual-information-based tools (Asadi et al., 2018) to quantify how self-attention propagates and potentially compresses task-relevant information. Second, deriving PAC-Bayes bounds (Alquier, 2021) tailored to Transformers and stochastic optimization may better capture the implicit regularization of modern training pipelines, and could lead to a more unified understanding of generalization and learning dynamics in large-scale foundation models.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Alokhina & Li (2026) Alokhina, A. and Li, P. From small to large: Generalization bounds for transformers on variable-size inputs, 2026. URL https://arxiv.org/abs/2512.12805.

Alquier (2021) Alquier, P. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.

Asadi et al. (2018) Asadi, A., Abbe, E., and Verdú, S. Chaining mutual information and tightening generalization bounds. Advances in Neural Information Processing Systems, 31, 2018.

Bartlett & Mendelson (2002) Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Bartlett et al. (2005) Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. Annals of Statistics, p. 1497–1537, 2005.

Blumer et al. (1989) Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
Bos & Schmidt-Hieber (2022) Bos, T. and Schmidt-Hieber, J. Convergence rates of deep ReLU networks for multiclass classification. Electronic Journal of Statistics, 16(1):2724–2773, 2022.

Catoni (2012) Catoni, O. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'IHP Probabilités et statistiques, volume 48, p. 1148–1185, 2012.

Chang et al. (2024) Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.

Chen et al. (2021a) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. 2021a.

Chen et al. (2021b) Chen, P., Jin, X., Li, X., and Xu, L. A generalized Catoni's M-estimator under finite α-th moment assumption with α ∈ (1,2). Electronic Journal of Statistics, 15(2):5523–5544, 2021b.

Coluccia (2015) Coluccia, A. Regularized covariance matrix estimation via empirical Bayes. IEEE Signal Processing Letters, 22(11):2127–2131, 2015.

de Santana Correia & Colombini (2022) de Santana Correia, A. and Colombini, E. L. Attention, please! A survey of neural attention models in deep learning. Artificial Intelligence Review, 55(8):6037–6124, 2022.

Ding et al. (2025) Ding, Z., Duan, C., Jiao, Y., and Yang, J. Z. Semi-supervised deep Sobolev regression: Estimation and variable selection by ReQU neural network. IEEE Transactions on Information Theory, 71(4):2955–2981, 2025.

Duan et al. (2023) Duan, C., Jiao, Y., Kang, L., Lu, X., and Yang, J. Z. Fast excess risk rates via offset Rademacher complexity. In Proceedings of the 40th International Conference on Machine Learning, volume 202, Honolulu, Hawaii, USA, 2023. PMLR.

Han et al. (2022) Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al.
A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):87–110, 2022.

Havrilla & Liao (2024) Havrilla, A. and Liao, W. Understanding scaling laws with statistical and approximation theory for transformer neural networks on intrinsically low-dimensional data. arXiv preprint arXiv:2411.06646, 2024.

Høgsgaard & Paudice (2025) Høgsgaard, M. M. and Paudice, A. Uniform mean estimation for heavy-tailed distributions via median-of-means. 2025.

Hu et al. (2024) Hu, S., Shen, L., Zhang, Y., Chen, Y., and Tao, D. On transforming reinforcement learning with transformers: The development trajectory. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):8580–8599, 2024.

Huang et al. (2025) Huang, X., Yang, A., Bhattamishra, S., Sarrof, Y., Krebs, A., Zhou, H., Nakkiran, P., and Hahn, M. A formal framework for understanding length generalization in transformers, 2025. URL https://arxiv.org/abs/2410.02140.

Ledent et al. (2025) Ledent, A., Alves, R., and Lei, Y. Generalization bounds for rank-sparse neural networks. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics, volume 244, p. 2145–2153, 2025.

Li et al. (2022) Li, M., Xu, R., Wang, S., Zhou, L., Lin, X., Zhu, C., Zeng, M., Ji, H., and Chang, S.-F. Clip-event: Connecting text and images with event structures. p. 16399–16408. IEEE, 2022.

Liang et al. (2015) Liang, T., Rakhlin, A., and Sridharan, K. Learning with square loss: Localization through offset Rademacher complexity. In Conference on Learning Theory, p. 1260–1285. PMLR, 2015.

Ma et al. (2022) Ma, C., Wu, L., et al. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 55(1):369–406, 2022.

Maurer (2016) Maurer, A. Vector-valued Rademacher complexities. Journal of Machine Learning Research, 17(169):1–20, 2016.

Meng & Ming (2022) Meng, Y. and Ming, P.
A new function space from Barron class and application to neural network approximation. Communications in Computational Physics, 32(5):1361–1400, 2022.

Mwigo & Dasgupta (2026) Mwigo, B. and Dasgupta, A. Generalization bound for a shallow transformer trained using gradient descent. Transactions on Machine Learning Research, 2026. ISSN 2835-8856. URL https://openreview.net/forum?id=t3iUeMOT8Z.

Nair et al. (2022) Nair, J., Wierman, A., and Zwart, B. The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2022.

Ostrovskii & Bach (2021) Ostrovskii, D. M. and Bach, F. Finite-sample analysis of M-estimators using self-concordance. Electronic Journal of Statistics, 15:326–391, 2021.

Pinto et al. (2024) Pinto, A., Rangamani, A., and Poggio, T. On generalization bounds for neural networks with low rank layers. arXiv preprint arXiv:2411.13733, 2024.

Schmidt-Hieber (2020) Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875, 2020.

Trauger & Tewari (2023) Trauger, J. and Tewari, A. Sequence length independent norm-based generalization bounds for transformers. arXiv preprint arXiv:2310.13088, 2023.

Tropp (2015) Tropp, J. A. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1–2):1–230, 2015.

Truong (2024) Truong, L. V. On rank-dependent generalisation error bounds for transformers. arXiv preprint arXiv:2410.11500, 2024.

Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang et al. (2019) Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787, 2019.

Xu et al.
(2023) Xu, L., Yao, F., Yao, Q., and Zhang, H. Non-asymptotic guarantees for robust statistical learning under infinite variance assumption. Journal of Machine Learning Research, 24, 2023.

Yang (2025) Yang, Y. On the optimal approximation of Sobolev and Besov functions using deep ReLU neural networks. Applied and Computational Harmonic Analysis, 79, 2025.

Zhang et al. (2023) Zhang, H., Song, H., Li, S., Zhou, M., and Song, D. A survey of controllable text generation using transformer-based pre-trained language models. ACM Computing Surveys, 56(3):1–37, 2023.

Zhang et al. (2026) Zhang, Q., Chen, B., Jiang, Y., and Xia, S.-T. Generalization bounds for transformer channel decoders, 2026. URL https://arxiv.org/abs/2601.06969.

Zhang (2023) Zhang, T. Mathematical analysis of machine learning algorithms. Cambridge University Press, 2023.

Zhang et al. (2025) Zhang, Y., Wu, Z., Li, J., and Liu, Y. Understanding generalization in transformers: Error bounds and training dynamics under benign and harmful overfitting, 2025. URL https://arxiv.org/abs/2502.12508.

Zhou et al. (2025) Zhou, Y., Dai, S., Cao, Z., Zhang, X., and Xu, J. Length-induced embedding collapse in plm-based models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 28767–28791, July 2025.

Appendix A Proofs in Section 4

A.1 Proof of Theorem 4.4

Proof. Let $\mathcal{F}$ denote the hypothesis class. For any $f \in \mathcal{F}$, we define the excess loss function $g(\cdot\,; f)$ as:
$$g(X, Y; f) := \ell(Y, f(X)) - \ell(Y, f^\star(X)),$$
where $f^\star$ minimizes the population risk. Specifically, for the single-head Transformer, let $f(X) = w^\top Y_{[\mathrm{CLS}]}$, where $Y_{[\mathrm{CLS}]}$ is the output token representation. Let $\hat{f}_n$ be the empirical risk minimizer. We decompose the expected excess risk as follows:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] = \mathbb{E}_D\big[\mathbb{E}_{(X,Y)}[g(X, Y; \hat{f}_n)]\big] = \mathbb{E}_D\Big[\mathbb{E}_{(X,Y)}[g(X, Y; \hat{f}_n)] - \frac{3}{n}\sum_{i=1}^{n} g(X_i, Y_i; \hat{f}_n) + \frac{3}{n}\sum_{i=1}^{n} g(X_i, Y_i; \hat{f}_n)\Big].$$
Since $\hat{f}_n$ minimizes the empirical risk, for any $f \in \mathcal{F}$ we have $\sum_i g(Z_i; \hat{f}_n) \le \sum_i g(Z_i; f)$. Taking the supremum over the class $\mathcal{F}$:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \mathbb{E}_D\Big[\sup_{f \in \mathcal{F}}\Big(\mathbb{E}[g(Z; f)] - \frac{3}{n}\sum_{i=1}^{n} g(Z_i; f)\Big)\Big] + 3\,\mathbb{E}_D\Big[\frac{1}{n}\sum_{i=1}^{n} g(Z_i; f^\star)\Big] \le \mathbb{E}_X\sup_{f \in \mathcal{F}}\Big(\mathbb{E}[g(Z; f)] - \frac{3}{n}\sum_{i=1}^{n} g(Z_i; f)\Big) + 3\inf_{f \in \mathcal{F}} \mathcal{E}(f; \ell),$$
where the last term accounts for the approximation error if $f^\star \notin \mathcal{F}$. We now bound the magnitude of the function class to determine the appropriate penalty. Using the Lipschitz continuity of the activation $\sigma$ (with $\sigma(0) = 0$) and the property that $\|\mathrm{softmax}(\cdot)\|_2 \le 1$, we have:
$$\big\|\sigma\big(W_v^\top X_{(i)}^\top\,\mathrm{softmax}(X_{(i)} W_{QK} X^\top)\big)\big\|_2 \le L_\sigma \|W_v^\top\|_2\,\|X_{(i)}^\top\|_2\,\|\mathrm{softmax}(\cdot)\|_2 \le L_\sigma B_v B_X.$$
Consequently, the output token representation $Y_{[\mathrm{CLS}]}$ is bounded by:
$$\|Y_{[\mathrm{CLS}]}\|_2 \le \|W_c^\top\|_2\,\|\sigma(\cdot)\|_2 \le B_c (L_\sigma B_v B_X).$$
Using the Lipschitz property of the loss function $\ell$ (constant $\kappa$), the value of the excess loss is bounded. Let $M$ denote this upper bound:
$$0 \le g(X; f) \le \kappa|f(X) - f^\star(X)| \le \kappa|f^\star(X)| + \kappa\|w\|_2\|Y_{[\mathrm{CLS}]}\|_2 \le \kappa B + \kappa B_w B_c B_v L_\sigma B_X =: M.$$
Since $0 \le g \le M$ implies $g \ge g^2/M$, the identity $\mathbb{E}[g] - \frac{3}{n}\sum_i g(Z_i) = 2\big(\mathbb{E}[g] - \frac{1}{n}\sum_i g(Z_i)\big) - \big(\mathbb{E}[g] + \frac{1}{n}\sum_i g(Z_i)\big)$ lets us absorb the negative term into a quadratic penalty, so the expectation can be bounded by the offset complexity. Specifically, we aim to bound:
$$A := \mathbb{E}_X \sup_{f \in \mathcal{F}}\Big(\mathbb{E}[g] - \frac{3}{n}\sum_{i=1}^{n} g(Z_i)\Big).$$
Following the symmetrization technique with ghost samples $X' = \{X_i'\}_{i=1}^{n}$ and Rademacher variables $\tau$ (Duan et al., 2023), we have:
$$A \le \mathbb{E}_{X,X'}\sup_{f \in \mathcal{F}}\Big(\frac{2}{n}\sum_{i=1}^{n}\big(g(X_i'; f) - g(X_i; f)\big) - \frac{1}{nM}\sum_{i=1}^{n}\big(g(X_i'; f)^2 + g(X_i; f)^2\big)\Big) = 2\,\mathbb{E}_{X,X',\tau}\sup_{f \in \mathcal{F}}\Big(\frac{1}{n}\sum_{i=1}^{n}\tau_i\big(g(X_i'; f) - g(X_i; f)\big) - \frac{1}{2nM}\sum_{i=1}^{n}\big(g(X_i'; f)^2 + g(X_i; f)^2\big)\Big).$$
Separating the terms for $X$ and $X'$, this yields:
$$A \le 2\,\mathbb{E}_{X',\tau}\sup_{f \in \mathcal{F}}\Big(\frac{1}{n}\sum_{i=1}^{n}\tau_i\,g(X_i'; f) - \frac{1}{2nM}\sum_{i=1}^{n} g(X_i'; f)^2\Big) + 2\,\mathbb{E}_{X,\tau}\sup_{f \in \mathcal{F}}\Big(\frac{1}{n}\sum_{i=1}^{n}(-\tau_i)\,g(X_i; f) - \frac{1}{2nM}\sum_{i=1}^{n} g(X_i; f)^2\Big) = 4\,\mathcal{R}_n^{\mathrm{off}}\Big(\mathcal{G}, \frac{1}{2M}\Big).$$
Substituting $M = \kappa B + \kappa B_w B_c B_v L_\sigma B_X$ yields the theorem statement. ∎

A.2 Proofs of Corollaries 4.5, 4.13 and Theorem 4.7

Proof of Corollary 4.5. Let $M_{SH} := \kappa B + \kappa B_w B_c B_v L_\sigma B_X$ denote the upper bound on the excess loss derived in Appendix A.1. Let $\mathcal{F}_\delta$ be a minimal $\delta$-cover of $\mathcal{F}$ with respect to the $\ell_\infty$ norm on the sample $X$, such that $|\mathcal{F}_\delta| = N_\infty(\delta, \mathcal{F}, X)$. For any $f \in \mathcal{F}$, there exists $f_\delta \in \mathcal{F}_\delta$ satisfying $\|f - f_\delta\|_\infty \le \delta$. We decompose the offset process. Using the $\kappa$-Lipschitz continuity of the excess loss $g$, we have:
$$\frac{1}{n}\sum_{i=1}^{n}\tau_i\,g(X_i; f) \le \frac{1}{n}\sum_{i=1}^{n}\tau_i\,g(X_i; f_\delta) + \frac{1}{n}\sum_{i=1}^{n}|\tau_i|\,\big|g(X_i; f) - g(X_i; f_\delta)\big| \le \frac{1}{n}\sum_{i=1}^{n}\tau_i\,g(X_i; f_\delta) + \kappa\delta.$$
For the quadratic penalty term, using the bound $g(\cdot) \le M_{SH}$ and the Lipschitz property:
$$-\frac{1}{n}\sum_{i=1}^{n} g(X_i; f)^2 = -\frac{1}{n}\sum_{i=1}^{n} g(X_i; f_\delta)^2 + \frac{1}{n}\sum_{i=1}^{n}\big(g(X_i; f_\delta) + g(X_i; f)\big)\big(g(X_i; f_\delta) - g(X_i; f)\big) \le -\frac{1}{n}\sum_{i=1}^{n} g(X_i; f_\delta)^2 + \frac{2M_{SH}}{n}\sum_{i=1}^{n}\big|g(X_i; f_\delta) - g(X_i; f)\big| \le -\frac{1}{n}\sum_{i=1}^{n} g(X_i; f_\delta)^2 + 2M_{SH}\kappa\delta.$$
Combining these, the offset process over $\mathcal{F}$ is bounded by the process over the finite cover $\mathcal{F}_\delta$ plus a discretization error:
$$\sup_{f \in \mathcal{F}}\Big(\frac{1}{n}\sum_{i=1}^{n}\tau_i\,g(X_i; f) - \frac{\beta}{n}\sum_{i=1}^{n} g(X_i; f)^2\Big) \le \max_{f_\delta \in \mathcal{F}_\delta}\Omega(f_\delta) + (1 + 2\beta M_{SH})\kappa\delta,$$
where $\Omega(f_\delta) := \frac{1}{n}\sum_{i=1}^{n}\tau_i\,g(X_i; f_\delta) - \frac{\beta}{n}\sum_{i=1}^{n} g(X_i; f_\delta)^2$. To bound the expectation of the maximum over the finite set $\mathcal{F}_\delta$, we use the tail integral formula $\mathbb{E}[Z] \le \int_0^\infty \mathbb{P}(Z > \xi)\,d\xi$. Following the technique in Duan et al. (Duan et al., 2023), for any fixed function $h$, Bennett's inequality implies that $\mathbb{P}(\Omega(h) > \xi) \le \exp(-2n\beta\xi)$ provided $\beta \le 1/(2M_{SH})$. Applying the union bound over $\mathcal{F}_\delta$:
$$\mathbb{E}_\tau\Big[\max_{f_\delta \in \mathcal{F}_\delta}\Omega(f_\delta)\Big] \le \int_0^\infty \min\big\{1,\, N_\infty(\delta, \mathcal{F}, X)\exp(-2n\beta\xi)\big\}\,d\xi = \frac{\log N_\infty(\delta, \mathcal{F}, X)}{2n\beta} + \int_{\frac{\log N}{2n\beta}}^{\infty}\exp(\log N - 2n\beta\xi)\,d\xi \le \frac{1 + \log N_\infty(\delta, \mathcal{F}, X)}{2n\beta}.$$
Substituting this back yields the stated corollary. ∎

Proof of Theorem 4.7. The single-layer multi-head Transformer output is a sum of $H$ single-head outputs. Consequently, the Lipschitz constant and the absolute bound of the function class scale linearly with $H$.
Specifically, the excess loss is bounded by:
$$0 \le g(X; f_{MH}) \le \kappa B + \kappa H B_w B_c B_v L_\sigma B_X =: M_{MH}.$$
The remainder of the proof follows the same logic as Theorem 4.4, substituting $M_{SH}$ with $M_{MH}$. ∎

Proof of Corollary 4.13. For the multi-layer Transformer defined in Eq. (4), the input to the final readout layer, denoted $Y_{[\mathrm{CLS}]}$, is the output of a normalization layer $\Pi_{\mathrm{norm}}$. By definition, $\Pi_{\mathrm{norm}}$ projects vectors onto the unit ball (or a ball of fixed radius). Assuming standard LayerNorm or projection, we have $\|Y_{[\mathrm{CLS}]}\|_2 \le 1$. Consequently, the magnitude of the scalar output is bounded solely by the readout weights:
$$|f(X)| = |w^\top Y_{[\mathrm{CLS}]}| \le \|w\|_2\,\|Y_{[\mathrm{CLS}]}\|_2 \le B_w.$$
This implies the excess loss is bounded by:
$$0 \le g(X; f) \le \kappa(B + B_w).$$
Applying the same discretization argument as in Corollary 4.5 with this tighter bound yields the result. ∎

Appendix B Proofs in Section 5

B.1 Theorems B.1 and B.2 about Single-Head and Multi-Head Norm-Based Bounds

Theorem B.1 (Single-Head Norm-Based Bound). Suppose Assumptions 4.3 and 5.1 and Lemma 5.2 hold. Then for the single-layer single-head Transformer, the empirical risk minimizer satisfies:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa B_w B_c B_v L_\sigma B_X)}{n}\Big(1 + \log\frac{\gamma_{SH}^3}{\delta^2}\Big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{SH}}\mathcal{E}(f; \ell),$$
where the complexity term is given by:
$$\gamma_{SH} = C_1^{1/3} B_x^{2/3}\big[(B_w L_\sigma)^{2/3} + (B_w L_\sigma B_c B_v)^{2/3}\big] + C_1^{1/3} B_x^{2/3}\big[(B_w L_\sigma B_c B_v)^{2/3} + 1\big].$$

Theorem B.2 (Multi-Head Norm-Based Bound).
Under the assumptions of Theorem B.1, adapted for $H$ heads, the empirical risk minimizer for the single-layer multi-head Transformer satisfies:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa H B_w B_c B_v L_\sigma B_X)}{n}\Big(1 + \log\frac{\gamma_{MH}^3}{\delta^2}\Big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{MH}}\mathcal{E}(f; \ell),$$
where
$$\gamma_{MH} = C_1^{1/3} B_x^{2/3}\big[(H B_w L_\sigma)^{2/3} + 2(H B_w L_\sigma B_c B_v)^{2/3}\big] + C_1^{1/3} B_x^{2/3} H^{2/3}.$$

B.2 Proofs for Theorems B.1, B.2, and 5.3

We begin with Lemma B.3, which bounds the covering number of linear function classes with $\ell_{1,1}$-norm constraints. This lemma serves as a building block for the subsequent proofs.

Lemma B.3. Let $N > d$, and define the parameter space $\mathcal{W} = \{W \in \mathbb{R}^{k \times d} \mid \|W\|_{1,1} \le B_w\}$. Consider the function class $\mathcal{F} = \{x \mapsto Wx \mid W \in \mathcal{W}\}$ restricted to inputs satisfying $\|x\|_2 \le B_x$. Then the empirical covering number satisfies:
$$\log N_\infty(\epsilon, \mathcal{F}, N, \|\cdot\|_2) \le \frac{B_x^2 B_w^2}{\epsilon^2}\log(2dk + 1).$$

Proof of Theorem B.1. To bound the covering number of the single-head Transformer, we construct a cover of the composite function by combining covers of its constituent linear operators. Let $\mathcal{W}_\epsilon$ be an $\epsilon_w$-cover of the readout weights $\{w \in \mathbb{R}^d \mid \|w\|_2 \le B_w\}$. Similarly, let $\mathcal{W}_{v,\epsilon}$, $\mathcal{W}_{c,\epsilon}$, and $\mathcal{W}_{QK,\epsilon}$ be covers of the value, output, and query-key matrices, respectively, at resolutions $\epsilon_v, \epsilon_c, \epsilon_{QK}$. First, we relate the spectral norm to the $\ell_{1,1}$ norm. By the triangle and Cauchy-Schwarz inequalities, for any $x \in \mathbb{R}^d$:
$$\|Wx\|_2 = \Big\|\sum_j x_j W_{:,j}\Big\|_2 \le \sum_j |x_j|\,\|W_{:,j}\|_2 \le \|x\|_2\Big(\sum_j \|W_{:,j}\|_2^2\Big)^{1/2}.$$
Since $\|v\|_2 \le \|v\|_1$, we have $\big(\sum_j \|W_{:,j}\|_2^2\big)^{1/2} \le \sum_j \|W_{:,j}\|_2 = \|W\|_{2,1} \le \|W\|_{1,1}$.
Thus, $\|W\|_{2\to2} \le \|W\|_{1,1}$. We define the approximated output $Y_{[\mathrm{CLS}],\epsilon}$ using the parameters from the covers. The difference in the scalar output is bounded by:
$$|w^\top Y_{[\mathrm{CLS}]} - w_\epsilon^\top Y_{[\mathrm{CLS}],\epsilon}| \le \|w\|_2\,\|Y_{[\mathrm{CLS}]} - Y_{[\mathrm{CLS}],\epsilon}\|_2 + |(w - w_\epsilon)^\top Y_{[\mathrm{CLS}],\epsilon}| \le B_w\,\|Y - Y_\epsilon\|_{2,\infty} + \epsilon_w.$$
Expanding $\|Y - Y_\epsilon\|_{2,\infty}$ using the Lipschitz property of the activation $\sigma$ and the softmax operator (and applying the triangle inequality recursively), we obtain:
$$|w^\top Y_{[\mathrm{CLS}]} - w_\epsilon^\top Y_{[\mathrm{CLS}],\epsilon}| \le \epsilon_w + B_w \epsilon_c L_\sigma + B_w B_c L_\sigma \epsilon_v B_X + 2 B_w B_c B_v L_\sigma \epsilon_{QK} B_X.$$
(The constants in the inequality above correspond to the Lipschitz constants derived from the matrix products and activation functions.) To minimize the total covering number, we minimize the sum of the log-covering numbers of the individual components subject to the total error being bounded by $\epsilon$. This yields the optimization problem:
$$\min_{\epsilon_c, \epsilon_{QK}, \epsilon_v, \epsilon_w}\;\sum_{j \in \{c, QK, v, w\}}\frac{C_1 B_j^2}{\epsilon_j^2}, \tag{14}$$
subject to the linear constraint derived from the error decomposition:
$$\epsilon_w + \epsilon_c(B_w L_\sigma) + \epsilon_v(B_w L_\sigma B_c) + \epsilon_{QK}(2 B_w L_\sigma B_c B_v) \le \epsilon. \tag{15}$$
Solving this with Lagrange multipliers yields the complexity term $\gamma_{SH}$ presented in the theorem. The total covering number is the product of the individual covering numbers, so the log-covering number is the sum, resulting in the bound $\log(\gamma_{SH}^3/\epsilon^2)$. ∎

Proof of Theorem B.2. The proof follows the structure of the single-head case. For a multi-head Transformer with $H$ heads, the output is a sum over the $H$ heads. By the triangle inequality, the total approximation error is bounded by the sum of the per-head errors:
$$\Big|w^\top\sum_{h=1}^{H} Y_h - w_\epsilon^\top\sum_{h=1}^{H} Y_{h,\epsilon}\Big| \le \sum_{h=1}^{H}\big|w^\top Y_h - w_\epsilon^\top Y_{h,\epsilon}\big| \le \epsilon.$$
(16)

Since the parameter bounds are identical for each head, we distribute the error budget uniformly. The optimization problem becomes analogous to (14) but sums over all parameters in all $H$ heads. The resulting covering number reflects the increased dimensionality, yielding the term $\gamma_{MH}$. ∎

Proof of Theorem 5.3. We construct a cover of the multi-layer network by taking the Cartesian product of covers for each layer. Let $\mathcal{W}_{\mathrm{total}}$ be the product space:
$$\mathcal{W}_{\mathrm{total}} = \mathcal{W}_{c,\epsilon}^{(1)} \otimes \mathcal{W}_{v,\epsilon}^{(1)} \otimes \mathcal{W}_{QK,\epsilon}^{(1)} \otimes \cdots \otimes \mathcal{W}_{c,\epsilon}^{(L)} \otimes \mathcal{W}_{v,\epsilon}^{(L)} \otimes \mathcal{W}_{QK,\epsilon}^{(L)} \otimes \mathcal{W}_\epsilon.$$
We aim to show that for any valid set of parameters $W^{1:L+1}$, there exists $W_\epsilon^{1:L+1} \in \mathcal{W}_{\mathrm{total}}$ such that:
$$\big|g_{\mathrm{scalar}}(X; W^{1:L+1}, w) - g_{\mathrm{scalar}}(X; W_\epsilon^{1:L+1}, w_\epsilon)\big| \le \epsilon. \tag{17}$$
Using the Cauchy-Schwarz inequality and a recursive expansion (peeling argument) similar to Trauger and Tewari (Trauger & Tewari, 2023), we decompose the error. The error at layer $L$ propagates to the output through the readout vector $w$. The total error is bounded by the sum of the readout error $\epsilon_w$ and the accumulated errors from previous layers, weighted by the Lipschitz constants of subsequent layers. Specifically:
$$\big|g_{\mathrm{scalar}}(X) - g_{\mathrm{scalar}}^\epsilon(X)\big| \le \epsilon_w + B_w\sum_{i=1}^{L}\alpha_i\big(\epsilon_c^{(i)} + L_\sigma B_c \epsilon_v^{(i)} + 2 L_\sigma B_c B_v \epsilon_{QK}^{(i)}\big), \tag{18}$$
where $\alpha_i = \prod_{j=i+1}^{L} L_\sigma B_c B_v(1 + 4B_{QK})$ represents the product of Lipschitz constants from layer $i+1$ to $L$. The covering number bound is obtained by minimizing $\sum_j \log N(\epsilon_j)$ subject to the constraint defined by (B.2). Using Lagrange multipliers, we derive the optimal $\epsilon_j$ for each parameter matrix, leading to the complexity terms $\gamma_{ML}$ and $\eta_{ML}$ defined in the theorem. ∎

B.3 Theorems B.4 and B.5 about Single-Head and Multi-Head Rank-Based Bounds

Theorem B.4 (Single-Head Rank-Based Bound).
Suppose Assumptions 5.5 and 5.7 hold. The empirical risk minimizer satisfies:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa B_w B_c B_v L_\sigma B_X)}{n}\big(1 + \log N_{\mathrm{rank},S}\big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{SH}}\mathcal{E}(f; \ell),$$
where $N_{\mathrm{rank},S} = 1 + \sum_{j \in \{c, QK, v, w\}} r_j C_1 \log(r_j B_x^2/\epsilon_j^2)$, and the optimal scales $\epsilon_j$ are determined via Corollary 5.6 with weights $\beta_c = B_w L_\sigma$, $\beta_{QK} = 2 B_w L_\sigma B_c B_v$, $\beta_v = B_w L_\sigma B_c$, and $\beta_w = B_c$.

Theorem B.5 (Multi-Head Rank-Based Bound). Suppose Assumptions 5.5 and 5.7 hold. The empirical risk minimizer satisfies:
$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n; \ell)] \le \frac{4(\kappa B + \kappa H B_w B_c B_v L_\sigma B_X)}{n}\big(1 + \log N_{\mathrm{rank},M}\big) + 8\kappa\delta + \inf_{f \in \mathcal{F}_{MH}}\mathcal{E}(f; \ell),$$
where $N_{\mathrm{rank},M} = 1 + \sum_{j \in \{c, QK, v, w\}} r_j C_1 \log(r_j B_x^2/\epsilon_j^2)$, and the optimal scales $\epsilon_j$ are determined via Corollary 5.6 with weights $\beta_c = H B_w L_\sigma$, $\beta_{QK} = 2 H B_w L_\sigma B_c B_v$, $\beta_v = H B_w L_\sigma B_c$, and $\beta_w = H B_c$.

B.4 Proofs of Theorems B.4, B.5 and 5.8

We first introduce Lemma B.6 from (Truong, 2024, Theorem 1), which establishes the covering number bound for low-rank linear operators.

Lemma B.6 (Rank-Based Covering Number). Let $r \in \mathbb{N}_+$ and let $V$ be an $r$-dimensional subspace of $\mathbb{R}^k$. Define the parameter space $\mathcal{W} = \{W \in \mathbb{R}^{d \times k} : \mathrm{col}(W) \subseteq V,\ \|W\|_2 \le B_W\}$, where $\mathrm{col}(W)$ denotes the column space of $W$. Consider the function class $\mathcal{H} = \{x \mapsto Wx : W \in \mathcal{W}\}$ restricted to inputs satisfying $\|x\|_2 \le B_x$. Then the empirical covering number satisfies:
$$\log N_\infty(\epsilon, \mathcal{H}, n, \|\cdot\|_2) \le \frac{r}{2}\log\Big(\frac{4 B_x^2 B_W^2 r}{\epsilon^2}\Big).$$
Proof. The result follows directly from (Truong, 2024, Lemma 2 and Theorem 1) by noting that $\|W\|_{2\to2} \le \|W\|_{2,\infty} \le B_W$. ∎

Proof of Theorems B.4, B.5 and 5.8. First, combining Lemma B.6 with the Lagrange multiplier method, we obtain Corollary 5.6.
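The allocation produced by this Lagrange-multiplier step can be checked numerically. A minimal sketch; the ranks $r_j$, constants $C_j$, weights $\beta_j$, and budget $\epsilon$ below are arbitrary placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder ranks r_j, constants C_j, and Lipschitz weights beta_j for
# the four parameter groups (c, QK, v, w).
r = np.array([4.0, 2.0, 4.0, 1.0])
C = np.array([1.0, 1.0, 1.0, 1.0])
beta = np.array([0.5, 2.0, 1.0, 0.25])
eps = 0.1  # total error budget

def objective(e):
    # sum_j r_j C_j log(1 / e_j^2): the log-covering-number surrogate
    return np.sum(r * C * np.log(1.0 / e**2))

# Closed-form allocation from the Lagrangian stationarity condition:
# e_j = eps * r_j C_j / (beta_j * sum_k r_k C_k).
e_star = eps * r * C / (beta * np.sum(r * C))

# The budget constraint sum_j beta_j e_j <= eps is met with equality.
assert np.isclose(np.dot(beta, e_star), eps)

# By convexity, no feasible rebalancing of the budget improves the objective.
for _ in range(200):
    d = rng.standard_normal(4)
    d -= beta * np.dot(beta, d) / np.dot(beta, beta)  # stay on the budget hyperplane
    e_pert = e_star + 1e-3 * d
    if np.all(e_pert > 0):
        assert objective(e_pert) >= objective(e_star) - 1e-9
```

Because the objective is convex and the constraint linear, the stationary point of the Lagrangian is the global constrained minimum, which is what the random rebalancings confirm.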
For Theorem B.4, as in the norm-based case (Theorem B.1), the error of the Transformer output decomposes into the sum of errors from its constituent linear operators ($W_c, W_{QK}, W_v$) and the readout vector $w$. Let $\epsilon_c, \epsilon_{QK}, \epsilon_v, \epsilon_w$ be the covering resolutions for the respective parameter matrices. Applying Lemma B.6, the log-covering number of the total hypothesis space is bounded by the sum of the log-covering numbers of these components. To obtain the tightest bound, we minimize the total complexity subject to the Lipschitz error constraint derived in the norm-based proof. This yields the optimization problem:
$$\min_{\epsilon_c, \epsilon_{QK}, \epsilon_v, \epsilon_w}\;\sum_{j \in \{c, QK, v, w\}} r_j C_j\log\Big(\frac{1}{\epsilon_j^2}\Big),$$
subject to:
$$\beta_c\epsilon_c + \beta_{QK}\epsilon_{QK} + \beta_v\epsilon_v + \beta_w\epsilon_w \le \epsilon, \tag{19}$$
where the Lipschitz coefficients are defined as $\beta_c = B_w L_\sigma$, $\beta_{QK} = 2 B_w L_\sigma B_c B_v$, $\beta_v = B_w L_\sigma B_c$, $\beta_w = B_c$, and the $C_j$ are constants derived from Lemma B.6. We solve this using the method of Lagrange multipliers. The Lagrangian is given by:
$$\mathcal{L}(\epsilon, \lambda) = -\sum_j 2 r_j C_j\log\epsilon_j + \lambda\Big(\sum_j\beta_j\epsilon_j - \epsilon\Big).$$
Taking the partial derivative with respect to $\epsilon_j$ and setting it to zero:
$$\frac{\partial\mathcal{L}}{\partial\epsilon_j} = -\frac{2 r_j C_j}{\epsilon_j} + \lambda\beta_j = 0 \;\Longrightarrow\; \epsilon_j = \frac{2 r_j C_j}{\lambda\beta_j}.$$
Substituting this back into the constraint $\sum_j\beta_j\epsilon_j = \epsilon$, we find $\lambda = \frac{2}{\epsilon}\sum_k r_k C_k$. This yields the optimal allocation:
$$\epsilon_j = \frac{\epsilon\, r_j C_j}{\beta_j\sum_{k \in \{c, QK, v, w\}} r_k C_k}. \tag{20}$$
Substituting these optimal $\epsilon_j$ values back into the sum of log-covering numbers yields the complexity term $N_{\mathrm{rank}}$ stated in the theorem. Similarly, for Theorem 5.8, replacing constraint (19) in the optimization problem with (B.2) yields the final result. ∎

Appendix C Proofs in Section 6

C.1 Proof of Theorems 6.4, 6.5 and 6.6

Proof. We begin by decomposing the excess risk.
Let $g(X;f) := \ell(Y, f(X)) - \ell(Y, f^\star(X))$. We introduce the truncated estimators $f_M$ and $f_M^\star$ corresponding to the truncated input $X_M$. The conditional excess risk can be decomposed as:

$$\mathbb{E}[g(X;f)\mid X] = \mathbb{E}\left[(g(X;f) - g(X;f_M)) + g(X;f_M) + (g(X;f_M^\star) - g(X;f^\star)) \mid X\right] \le \underbrace{\mathbb{E}[|g(X;f) - g(X;f_M)| \mid X]}_{(\mathrm{I})} + \underbrace{\mathbb{E}[g(X;f_M) \mid X]}_{(\mathrm{II})} + \underbrace{\mathbb{E}[|g(X;f_M^\star) - g(X;f^\star)| \mid X]}_{(\mathrm{III})}. \tag{21}$$

Note that for the truncated term (II), the inputs are bounded by $M$. Consequently, the function output and the excess loss are bounded. Specifically, there exists a constant $M_{\mathrm{bound}} := \kappa B_M + \kappa B_M'$ such that $0 \le g(X;f_M) \le M_{\mathrm{bound}}$. Applying the offset Rademacher complexity bound (Theorem 5.3) to the truncated class $\mathcal{F}_M$, we have:

$$\mathbb{E}_D[(\mathrm{II})] \le 4\mathcal{R}_n^{\mathrm{off}}\left(\mathcal{G}_M, \frac{1}{M_{\mathrm{bound}}}\right) + 3\inf_{f_M\in\mathcal{F}_M}\mathcal{E}(f_M;\ell). \tag{22}$$

Next, we bound the truncation error terms (I) and (III). Let $\mathcal{E}_{\mathrm{trunc}} = \{\|X\|_F > M\}$ be the event that truncation occurs. On the complement $\mathcal{E}_{\mathrm{trunc}}^c$, $f(X) = f_M(X)$, so the difference is zero. Using the Lipschitz property of the loss $\ell$ and the local Lipschitz property of the network (where $L_{f,\|X\|} = O(\|X\|)$), we have:

$$(\mathrm{I}) = \mathbb{E}[|g(X;f) - g(X;f_M)| \cdot \mathbb{1}(\mathcal{E}_{\mathrm{trunc}}) \mid X] \le \kappa\,\mathbb{E}[L_{f,\|X\|}\,\|f(X) - f_M(X)\|_2 \cdot \mathbb{1}(\mathcal{E}_{\mathrm{trunc}}) \mid X] \le \kappa\,\mathbb{E}[C\|X\| \cdot \|X - X_M\|_F \cdot \mathbb{1}(\mathcal{E}_{\mathrm{trunc}}) \mid X],$$

where we used $\|x_{[\mathrm{CLS}]} - x_{M,[\mathrm{CLS}]}\| \le \|X - X_M\|_F$. Substituting $X_M = M X / \|X\|_F$ when $\|X\|_F > M$:

$$(\mathrm{I}) \le \kappa C\,\mathbb{E}\left[\|X\|(\|X\| - M) \cdot \mathbb{1}(\|X\| > M)\right] \le \kappa C\,\mathbb{E}\left[\|X\|^2 \cdot \mathbb{1}(\|X\| > M)\right].$$
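The last substitution relies on the fact that the radial truncation $X_M = MX/\|X\|_F$ satisfies $\|X - X_M\|_F = \|X\|_F - M$ whenever truncation occurs. A minimal check, with an illustrative matrix (the values are assumptions for demonstration only):

```python
import math

def fro_norm(X):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in X for v in row))

def truncate(X, M):
    """Radially rescale X onto the Frobenius ball of radius M if it falls outside."""
    n = fro_norm(X)
    if n <= M:
        return X
    return [[M * v / n for v in row] for row in X]

X = [[3.0, 4.0], [0.0, 12.0]]   # ||X||_F = 13
M = 5.0
X_M = truncate(X, M)
assert abs(fro_norm(X_M) - M) < 1e-9
# The truncation error equals ||X||_F - M exactly, as used in bounding (I):
diff = math.sqrt(sum((a - b)**2 for ra, rb in zip(X, X_M) for a, b in zip(ra, rb)))
assert abs(diff - (fro_norm(X) - M)) < 1e-9
```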
We evaluate this tail expectation using the layer-cake representation and the sub-Gaussian assumption $P(\|X\| \ge t) \le (T+d)\exp(-t^2/2\nu^2)$:

$$\mathbb{E}[\|X\|^2 \cdot \mathbb{1}(\|X\| > M)] = M^2 P(\|X\| > M) + \int_{M^2}^{\infty} P(\|X\|^2 > t)\,dt \le M^2(T+d)e^{-\frac{M^2}{2\nu^2}} + \int_M^\infty (T+d)e^{-\frac{u^2}{2\nu^2}}\,2u\,du = (T+d)\left(M^2 e^{-\frac{M^2}{2\nu^2}} + 2\nu^2 e^{-\frac{M^2}{2\nu^2}}\right) = (T+d)(M^2 + 2\nu^2)\exp\left(-\frac{M^2}{2\nu^2}\right).$$

(Note: the original text simplified this to terms involving $\nu(X)$ and exponentials; we adopt the bound implied by the sub-Gaussian tail.) Specifically, following the source approximation:

$$\mathbb{E}_D[(\mathrm{I})] \le O\left(\kappa(T+d)\nu\exp\left(-\frac{M^2}{2\nu^2}\right)\right). \tag{23}$$

Term (III) is bounded symmetrically. Combining these results into (21):

$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n;\ell)] \le 4\mathcal{R}_n^{\mathrm{off}}\left(\mathcal{G}, \frac{1}{2M_{\mathrm{bound}}}\right) + 3\inf_{f_M\in\mathcal{F}_M}\mathcal{E}(f_M;\ell) + C_{\mathrm{trunc}}\,\kappa(T+d)\nu\exp\left(-\frac{M^2}{2\nu^2}\right), \tag{24}$$

where $C_{\mathrm{trunc}}$ aggregates the constants from the tail integration. By combining Theorems B.1, B.2, and 5.3, respectively, we obtain Theorems 6.4, 6.5 and 6.6. ∎

C.2 Proof of Theorems 6.9, 6.10 and 6.11

Proof. We begin by establishing the Lipschitz continuity of the robust loss function. Let the robust loss be defined as $\ell_\psi(y, \hat{y}) = \frac{1}{\alpha}\psi(\alpha\,\ell(y, \hat{y}))$. We assume the base loss $\ell(y, \cdot)$ is $\kappa$-Lipschitz and the truncation function $\psi$ has a bounded derivative $|\psi'(\cdot)| \le B_\psi$. Applying the Mean Value Theorem, for any $f_1, f_2$:

$$|\ell_\psi(y, f_1(x)) - \ell_\psi(y, f_2(x))| = \frac{1}{\alpha}\left|\psi(\alpha\,\ell(y, f_1(x))) - \psi(\alpha\,\ell(y, f_2(x)))\right| = \frac{1}{\alpha}|\psi'(\xi)| \cdot \alpha\,|\ell(y, f_1(x)) - \ell(y, f_2(x))| \le B_\psi\,\kappa\,|f_1(x) - f_2(x)|.$$

Thus, the composite loss $\ell_\psi$ is $(\kappa B_\psi)$-Lipschitz. We decompose the excess risk similarly to the sub-Gaussian case (Eq.
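The middle step of the sub-Gaussian computation, $\int_M^\infty 2u\,e^{-u^2/2\nu^2}\,du = 2\nu^2 e^{-M^2/2\nu^2}$, is an exact Gaussian tail integral. The sketch below confirms it by midpoint-rule quadrature for illustrative values of $M$ and $\nu$ (the specific numbers are assumptions for the check only):

```python
import math

def tail_integral(M, nu, upper=50.0, steps=200000):
    """Midpoint rule for the tail integral  ∫_M^upper 2u * exp(-u^2 / 2ν^2) du.
    The integrand decays super-exponentially, so upper=50 captures the full tail."""
    h = (upper - M) / steps
    return sum(
        2 * (M + (i + 0.5) * h) * math.exp(-(M + (i + 0.5) * h)**2 / (2 * nu**2))
        for i in range(steps)
    ) * h

M, nu = 2.0, 1.5
closed_form = 2 * nu**2 * math.exp(-M**2 / (2 * nu**2))  # = 2ν² e^{-M²/2ν²}
assert abs(tail_integral(M, nu) - closed_form) < 1e-5
```

This is the term that gives the $2\nu^2$ contribution in the final $(T+d)(M^2 + 2\nu^2)\exp(-M^2/2\nu^2)$ bound.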
(C.1)), splitting it into the truncated complexity term and the tail bias terms. For the truncated class $\mathcal{F}_M$, the effective Lipschitz constant is $\kappa B_\psi$, and the output is bounded. The complexity term is therefore bounded by:

$$4\mathcal{R}_n^{\mathrm{off}}\left(\mathcal{G}, \frac{1}{4 B_\psi B_M \kappa}\right) + 3\inf_{f_M\in\mathcal{F}}\mathcal{E}(f_M;\ell_\psi). \tag{25}$$

Next, we evaluate the truncation error (bias) caused by the heavy-tailed inputs. Recall the condition $\mathbb{P}(|X_{ij}| > t) \le C t^{-\beta}$. Using the norm inequality $\|X\|_2 \le \sqrt{Td}\,\max_{i,j}|X_{ij}|$, we derive the tail probability for the spectral norm via a union bound:

$$\mathbb{P}(\|X\|_2 > t) \le \mathbb{P}\left(\sqrt{Td}\,\max_{i,j}|X_{ij}| > t\right) \le \sum_{i,j}\mathbb{P}\left(|X_{ij}| > \frac{t}{\sqrt{Td}}\right) \le Td \cdot C\left(\frac{t}{\sqrt{Td}}\right)^{-\beta} = C(Td)^{1+\beta/2}\,t^{-\beta} =: C'\,t^{-\beta}.$$

We now estimate the tail expectations required for the truncation error $\mathbb{E}[L_{f,\|X\|}\,\|X - X_M\| \cdot \mathbb{1}(\|X\| > M)]$. As shown in C.1, this is bounded by terms involving $\mathbb{E}[\|X\|^k\,\mathbb{1}(\|X\| > M)]$ for $k = 1, 2$. Using the integral identity $\mathbb{E}[Z\,\mathbb{1}(Z > M)] = M\,\mathbb{P}(Z > M) + \int_M^\infty \mathbb{P}(Z > t)\,dt$:

$$\mathbb{E}[\|X\| \cdot \mathbb{1}(\|X\| > M)] = M\,\mathbb{P}(\|X\| > M) + \int_M^\infty \mathbb{P}(\|X\| > t)\,dt \le M C' M^{-\beta} + \int_M^\infty C' t^{-\beta}\,dt = C' M^{1-\beta} + C'\left[\frac{t^{1-\beta}}{1-\beta}\right]_M^\infty = C' M^{1-\beta}\left(1 + \frac{1}{\beta - 1}\right) = \frac{C'\beta}{\beta - 1}\,M^{1-\beta}.$$

Similarly, for the second moment (assuming $\beta > 2$):

$$\mathbb{E}[\|X\|^2 \cdot \mathbb{1}(\|X\| > M)] = M^2\,\mathbb{P}(\|X\| > M) + \int_M^\infty \mathbb{P}(\|X\|^2 > t^2)\,2t\,dt \le C' M^{2-\beta} + \int_M^\infty C' t^{-\beta}\,2t\,dt = C' M^{2-\beta} + 2C'\int_M^\infty t^{1-\beta}\,dt = C' M^{2-\beta} + \frac{2C' M^{2-\beta}}{\beta - 2} = \frac{C'\beta}{\beta - 2}\,M^{2-\beta}.$$

(Note: the original text simplified the constants; we retain the asymptotic dependency on $M$.) Substituting these tail bounds into the Lipschitz error decomposition:

$$\mathrm{Bias} \le \kappa B_\psi\left(\frac{C'\beta}{\beta - 2}\,M^{2-\beta}\right).$$
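The first-moment tail bound can likewise be checked numerically: when the tail is exactly $\mathbb{P}(Z > t) = C' t^{-\beta}$, the layer-cake identity should reproduce the closed form $\frac{C'\beta}{\beta-1}M^{1-\beta}$. The values of $M$, $\beta$, and $C'$ below are illustrative assumptions:

```python
def tail_expectation(M, beta, Cp, upper=1000.0, steps=200000):
    """E[Z * 1(Z > M)] = M * P(Z > M) + ∫_M^upper P(Z > t) dt
    for the exact power-law tail P(Z > t) = Cp * t^(-beta),
    with the integral evaluated by the midpoint rule (tail beyond
    `upper` is negligible for beta > 2)."""
    h = (upper - M) / steps
    integral = sum(Cp * (M + (i + 0.5) * h) ** (-beta) for i in range(steps)) * h
    return M * Cp * M ** (-beta) + integral

M, beta, Cp = 2.0, 3.0, 1.0
closed_form = Cp * beta / (beta - 1.0) * M ** (1.0 - beta)  # C'·β/(β-1)·M^{1-β} = 0.375
assert abs(tail_expectation(M, beta, Cp) - closed_form) < 1e-4
```

The same pattern with $2t\,dt$ in place of $dt$ reproduces the second-moment factor $\frac{\beta}{\beta-2}$, which is where the $\beta > 2$ requirement enters.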
Combining the complexity bound and the bias term yields the final result:

$$\mathbb{E}_D[\mathcal{E}(\hat{f}_n^\psi;\ell_\psi)] \le 4\mathcal{R}_n^{\mathrm{off}}\left(\mathcal{G}, \frac{1}{4 B_\psi B_M \kappa}\right) + C_1 M^{1-\beta/2} + C_2 M^{2-\beta} + 3\inf_{f_M\in\mathcal{F}}\mathcal{E}(f_M;\ell_\psi), \tag{26}$$

where $C_1, C_2$ absorb the constants from the tail integration and $\beta$. Combining this with the rank-based complexity from Theorems B.4, B.5 and 5.8 completes the proof of Theorems 6.9, 6.10 and 6.11, respectively. ∎