
Paper deep dive

Adversarially Pretrained Transformers May Be Universally Robust In-Context Learners

Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

Year: 2025 · Venue: ICLR 2026 · Area: Adversarial Robustness · Type: Theoretical · Embeddings: 121

Models: Single-layer linear transformer

Abstract

Adversarial training is one of the most effective defenses against adversarial attacks, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models: models that can adapt robustly to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can generalize robustly to unseen classification tasks through in-context learning from clean demonstrations (i.e., without requiring additional adversarial training or examples). This universal robustness stems from the model's ability to adaptively focus on robust features within given tasks. We also identify two open challenges for attaining robustness: the accuracy-robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models. While their training is expensive, the investment would prove worthwhile as downstream tasks can obtain adversarial robustness for free. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.

Tags

adversarial-robustness (suggested, 80%) · ai-safety (imported, 100%) · theoretical (suggested, 88%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/12/2026, 5:18:29 PM

Summary

This paper provides the first theoretical analysis of adversarially pretrained single-layer linear transformers as universally robust foundation models. It demonstrates that these models can generalize to unseen classification tasks via in-context learning without additional adversarial training, by adaptively focusing on robust features. The study also identifies the accuracy-robustness trade-off and sample-hungry training as key challenges.

Entities (5)

Soichiro Kumano · researcher · 99%
Adversarially Pretrained Transformers · model-architecture · 98%
Accuracy-robustness trade-off · challenge · 95%
In-context learning · learning-paradigm · 95%
Robust features · concept · 92%

Relation Signals (3)

Adversarially Pretrained Transformers enables Universal Robustness

confidence 90% · adversarially pretrained transformers can serve as universally robust foundation models

In-context learning facilitates Robust Adaptation

confidence 90% · generalize robustly to unseen classification tasks through in-context learning

Robust features drive Universal Robustness

confidence 85% · This universal robustness stems from the model's ability to adaptively focus on robust features

Cypher Suggestions (2)

Identify challenges associated with robust models · confidence 95% · unvalidated

MATCH (m:Model)-[:HAS_CHALLENGE]->(c:Challenge) RETURN m.name, c.name

Find all models that exhibit universal robustness · confidence 90% · unvalidated

MATCH (m:Model)-[:EXHIBITS]->(p:Property {name: 'Universal Robustness'}) RETURN m

Full Text

120,989 characters extracted from source content.


Published as a conference paper at ICLR 2026

ADVERSARIALLY PRETRAINED TRANSFORMERS MAY BE UNIVERSALLY ROBUST IN-CONTEXT LEARNERS

Soichiro Kumano (The University of Tokyo, kumano@cvm.t.u-tokyo.ac.jp), Hiroshi Kera (Chiba University, National Institute of Informatics, kera@chiba-u.jp), Toshihiko Yamasaki (The University of Tokyo, yamasaki@cvm.t.u-tokyo.ac.jp)

ABSTRACT

Adversarial training is one of the most effective defenses against adversarial attacks, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models: models that can adapt robustly to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can generalize robustly to unseen classification tasks through in-context learning from clean demonstrations (i.e., without requiring additional adversarial training or examples). This universal robustness stems from the model's ability to adaptively focus on robust features within given tasks. We also identify two open challenges for attaining robustness: the accuracy-robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models. While their training is expensive, the investment would prove worthwhile as downstream tasks can obtain adversarial robustness for free. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.

1 INTRODUCTION

Adversarial examples, subtle and often imperceptible perturbations to inputs that lead machine learning models to make incorrect predictions, reveal a fundamental vulnerability in modern deep learning systems (Szegedy et al., 2014).
Adversarial training is one of the most effective defenses against such attacks (Goodfellow et al., 2015; Madry et al., 2018), where classification loss is minimized under worst-case (i.e., adversarial) perturbations. This min-max optimization significantly increases the computational cost compared to standard training. Despite extensive efforts to develop alternative defenses, most such approaches have subsequently been shown to offer only spurious robustness (Athalye et al., 2018; Croce & Hein, 2020; Tramer et al., 2020). Consequently, adversarial training remains the de facto standard, and practitioners must incur this cost to obtain adversarially robust models.

Recently, it has become standard practice to leverage foundation models for target tasks. Thanks to large-scale pretraining, these models can be adapted to diverse downstream tasks through lightweight tuning. This raises a natural question: Can adversarially trained foundation models enable efficient and robust adaptation to a wide range of downstream tasks? Although training such models is expensive, the investment would be justified if numerous downstream tasks could inherit adversarial robustness for free, without requiring costly adversarial training themselves. While this is a promising research direction, the utility of such universally robust foundation models remains largely unexplored, as the computational and financial costs make empirical evaluation across multiple runs impractical.

In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models. Specifically, we show that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can generalize robustly to previously unseen classification tasks through in-context learning (Brown et al., 2020) from clean demonstrations.

[arXiv:2505.14042v3 [cs.LG], 1 Mar 2026]
Namely, these transformers achieve robust adaptation without requiring additional adversarial examples or training. In-context learning is a transformer capability that enables efficient adaptation to new tasks from a few input-output demonstrations in the prompt, without any parameter updates.

Our analysis builds upon the conceptual framework of robust features (class-discriminative and human-interpretable) and non-robust features (human-imperceptible yet predictive) (Ilyas et al., 2019; Tsipras et al., 2019). Based on this framework, we show that adversarially pretrained single-layer linear transformers can adaptively focus on robust features within each downstream task, rather than non-robust or non-predictive features, thereby achieving universal robustness. This framework also reveals that universal robustness holds under mild conditions, except in an unrealistic scenario where non-robust features overwhelmingly outnumber robust ones.

We also show that two open challenges in robust machine learning (Schmidt et al., 2018; Tsipras et al., 2019) persist in our setting. First, adversarially pretrained single-layer linear transformers exhibit lower clean accuracy than their standardly pretrained counterparts. Second, to achieve clean accuracy comparable to standard models, these transformers require more in-context demonstrations.

Our contributions are summarized as follows:

• We provide the first theoretical evidence for universally robust foundation models: under mild conditions, adversarially pretrained transformers with a single linear self-attention layer can adapt robustly to unseen classification tasks through in-context learning.

• Based on the framework of robust and non-robust features, we characterize the condition for successful robust adaptation. Moreover, we show that universal robustness arises from the models' adaptive focus on robust features within each downstream task.
• We identify two open problems for these transformers: the accuracy-robustness trade-off and sample-hungry in-context learning.

This study explores the potential of universally robust foundation models, which can provide diverse downstream tasks with adversarial robustness without additional adversarial training. A key challenge is the cost of adversarial pretraining. We assume that, as with standard foundation models, such efforts would be undertaken by large organizations, which could offset development costs through licensing or API fees. The growing demand for safe and reliable AI strengthens this incentive. Encouragingly, advances in acceleration techniques for adversarial training, such as fast adversarial training (Wong et al., 2020) and adversarial finetuning (Jeddi et al., 2020), suggest that the cost of adversarial training could approach the cost of standard training. We view our theoretical analysis as an important first step toward fostering the practical development of universally robust foundation models.

2 RELATED WORK

Additional related work can be found in Appendix A.

Adversarial Training. Adversarial training (Goodfellow et al., 2015; Madry et al., 2018), which augments training data with adversarial examples (Szegedy et al., 2014), is one of the most effective adversarial defenses. Its major limitation is the high computational cost. To address this, several methods have focused on efficient generation of adversarial examples (Andriushchenko & Flammarion, 2020; Kim et al., 2021; Park & Lee, 2021; Shafahi et al., 2019; Wong et al., 2020; Zhang et al., 2019a) and adversarial finetuning (Jeddi et al., 2020; Mao et al., 2023; Suzuki et al., 2023; Wang et al., 2024a). However, these methods require task-specific adversarial training. In this study, we introduce the concept of universally robust foundation models, which can adapt robustly to a wide range of downstream tasks without requiring any adversarial training or examples.
Robust and Non-Robust Features. It is widely hypothesized that adversarial vulnerability arises from models' reliance on non-robust features (Ilyas et al., 2019; Tsipras et al., 2019). While robust features are class-discriminative, human-interpretable, and semantically meaningful, non-robust features are subtle, often imperceptible to humans, yet statistically correlated with labels and therefore predictive. Humans can rely only on robust features, whereas models can leverage both types of features to maximize accuracy. Tsipras et al. (2019) showed that standard classifiers depend heavily on non-robust features, making them vulnerable to adversarial perturbations that can manipulate these subtle features. They also showed that adversarial training encourages models to rely primarily on robust features, which enhances robustness but often reduces clean accuracy, a phenomenon known as the accuracy-robustness trade-off (Dobriban et al., 2023; Mehrabi et al., 2021; Raghunathan et al., 2019; 2020; Su et al., 2018; Tsipras et al., 2019; Yang et al., 2020; Zhang et al., 2019b). Subsequent studies have confirmed that adversarially trained neural networks exhibit a greater reliance on robust features (Augustin et al., 2020; Chalasani et al., 2020; Engstrom et al., 2019; Etmann et al., 2019; Kaur et al., 2019; Santurkar et al., 2019; Srinivas et al., 2023; Tsipras et al., 2019; Zhang & Zhu, 2019). In this study, we incorporate the concept of robust and non-robust features into the data assumptions for our theoretical analysis. Based on this framework, we find that adversarially pretrained single-layer linear transformers prioritize robust features over non-robust features, and exhibit the accuracy-robustness trade-off.

3 THEORETICAL RESULTS

Notation. For $n \in \mathbb{N}$, let $[n] := \{1, \dots, n\}$. Denote the $i$-th element of a vector $a$ by $a_i$, and the element in the $i$-th row and $j$-th column of a matrix $A$ by $A_{i,j}$. Let $\mathcal{U}(S)$ be the uniform distribution over a set $S$.
The sign function is denoted as $\mathrm{sgn}(\cdot)$. For $d_1, d_2 \in \mathbb{N}$, let $1_{d_1}$ and $1_{d_1,d_2}$ be the $d_1$-dimensional all-ones vector and the $d_1 \times d_2$ all-ones matrix, respectively. The $d_1 \times d_1$ identity matrix is denoted as $I_{d_1}$. Similarly, we write the all-zeros vector and matrix as $0_{d_1}$ and $0_{d_1,d_2}$, respectively. We use $\gtrsim$, $\lesssim$, and $\approx$ only to hide constant factors in informal statements.

3.1 PROBLEM SETUP

Overview. We adversarially train a single-layer linear transformer on $d \in \mathbb{N}$ distinct datasets. The $c$-th training data distribution is denoted by $\mathcal{D}^{tr}_c$ for $c \in [d]$. The $c$-th dataset consists of $N+1$ samples, $(x^{(c)}_n, y^{(c)}_n)_{n=1}^{N+1} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{D}^{tr}_c$. The transformer is encouraged to adaptively learn data structures from $N$ clean in-context demonstrations $(x_n, y_n)_{n=1}^{N}$ and to generalize to the $(N+1)$-th perturbed sample $x_{N+1} + \Delta$, where $\Delta$ represents an adversarial perturbation. We then evaluate the adversarial robustness of the trained transformer on a test dataset $(x^{te}_n, y^{te}_n)_{n=1}^{N+1} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{D}^{te}$, which may exhibit different structures from any training distributions.

Transformer. We first define the input sequence for a transformer as

$$Z_\Delta := \begin{pmatrix} x_1 & x_2 & \cdots & x_N & x_{N+1} + \Delta \\ y_1 & y_2 & \cdots & y_N & 0 \end{pmatrix} \in \mathbb{R}^{(d+1) \times (N+1)}, \tag{1}$$

where $x_1, \dots, x_N \in \mathbb{R}^d$ are training data, $y_1, \dots, y_N \in \{\pm 1\}$ are their binary labels, $x_{N+1} \in \mathbb{R}^d$ is a test sample (query), and $\Delta \in \mathbb{R}^d$ is an adversarial perturbation (see below). The transformer is expected to learn data structures adaptively from $N$ demonstrations $(x_n, y_n)_{n=1}^{N}$ and to predict the label of $x_{N+1}$. The $(d+1, N+1)$-th element of $Z_\Delta$ serves as a placeholder for the prediction of $x_{N+1} + \Delta$.
We define a single-layer linear transformer $f: \mathbb{R}^{(d+1) \times (N+1)} \to \mathbb{R}^{(d+1) \times (N+1)}$, which is commonly employed in theoretical studies of in-context learning (Ahn et al., 2023; Cheng et al., 2024; Gatmiry et al., 2024; Mahankali et al., 2024; Zhang et al., 2024b), as follows:

$$f(Z_\Delta; P, Q) := \frac{1}{N} P Z_\Delta M Z_\Delta^\top Q Z_\Delta, \qquad M := \begin{pmatrix} I_N & 0 \\ 0 & 0 \end{pmatrix} \in \mathbb{R}^{(N+1) \times (N+1)}, \tag{2}$$

where $P \in \mathbb{R}^{(d+1) \times (d+1)}$ serves as the value weight matrix and $Q \in \mathbb{R}^{(d+1) \times (d+1)}$ serves as the product of the key and query weight matrices. Following prior work on in-context learning (Ahn et al., 2023; Cheng et al., 2024; Gatmiry et al., 2024; Li et al., 2025), we adopt a mask matrix $M$ to prevent the tokens from attending to the query token.

Training Data Distribution. The transformer is pretrained on $d$ distinct datasets. Inspired by Tsipras et al. (2019), we consider the following data structure that explicitly separates robust and non-robust features (cf. Section 2) according to their dimensional indices:

Assumption 3.1 (Individual training data distribution). Let $c \in [d]$ be the index of a training data distribution and $\mathcal{D}^{tr}_c$ be the $c$-th distribution. A sample $(x, y) \sim \mathcal{D}^{tr}_c$ satisfies the following:

$$y \sim \mathcal{U}(\{\pm 1\}), \qquad x_c = y, \qquad \forall i \in [d], i \neq c: \; x_i \sim \begin{cases} \mathcal{U}([0, y\lambda]) & (y = 1) \\ \mathcal{U}([y\lambda, 0]) & (y = -1) \end{cases}, \tag{3}$$

where $0 < \lambda < 1$. For any $i \neq j$, $x_i$ and $x_j$ are independent, given $y$.

In this distribution, each sample has a feature that is strongly correlated with its label (i.e., a robust feature) in the $c$-th dimension and weakly correlated features (i.e., non-robust features) in the remaining dimensions. The correlation between the non-robust features and the label is bounded by $\lambda$. The robust features mimic human-interpretable, semantically meaningful attributes in natural objects (e.g., shape). The non-robust features mimic human-imperceptible yet predictive attributes (e.g., texture).
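The setup above is easy to make concrete in code. The following is a minimal sketch (our own illustration, not the authors' released code; the function names are ours) of sampling from the training distribution of Assumption 3.1 and running the masked single-layer linear transformer of Eq. (2) to read out the prediction slot $[f(Z_\Delta; P, Q)]_{d+1, N+1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(c, d, n_samples, lam):
    """Sample (x, y) pairs from the c-th training distribution of
    Assumption 3.1: x_c = y exactly, and every other coordinate is
    uniform between 0 and y * lam."""
    y = rng.choice([-1.0, 1.0], size=n_samples)
    # uniform(0, lam) * y lies in [0, y*lam] for y = 1 and [y*lam, 0] for y = -1
    x = rng.uniform(0.0, lam, size=(n_samples, d)) * y[:, None]
    x[:, c] = y  # the robust feature equals the label
    return x, y

def forward(P, Q, xs, ys, query, delta):
    """f(Z_Delta; P, Q) = (1/N) P Z M Z^T Q Z, where the mask M stops
    tokens from attending to the query token; returns the prediction
    slot [f]_{d+1, N+1}."""
    N, d = xs.shape
    Z = np.zeros((d + 1, N + 1))
    Z[:d, :N] = xs.T
    Z[d, :N] = ys
    Z[:d, N] = query + delta      # perturbed query; its label slot stays 0
    M = np.eye(N + 1)
    M[N, N] = 0.0                 # mask out the query token
    f = (P @ Z @ M @ Z.T @ Q @ Z) / N
    return f[d, N]

xs, ys = sample_task(c=2, d=5, n_samples=8, lam=0.1)
P = np.ones((6, 6)); Q = np.ones((6, 6))   # arbitrary feasible parameters in [0, 1]
pred = forward(P, Q, xs, ys, query=xs[0], delta=np.zeros(5))
```

The sign of `pred` is the predicted binary label for the query; the adversary's job in Eq. (7) is to choose `delta` within the $\ell_\infty$ ball to flip it.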
Test Data Distribution. The test data distribution may exhibit more diverse structures than the training data distributions, and may include non-predictive features in addition to robust and non-robust features.

Assumption 3.2 (Test data distribution). Let the index sets of robust, non-robust, and irrelevant features be $S_{rob}, S_{vul}, S_{irr} \subset [d]$, respectively. Suppose that these sets are disjoint, i.e., $S_{rob} \cap S_{vul} = S_{vul} \cap S_{irr} = S_{irr} \cap S_{rob} = \emptyset$, and that $S_{rob} \cup S_{vul} \cup S_{irr} = [d]$. Let the number of robust, non-robust, and irrelevant features be $d_{rob} := |S_{rob}|$, $d_{vul} := |S_{vul}|$, and $d_{irr} := |S_{irr}|$, respectively. Let the scales of the robust, non-robust, and irrelevant features be $\alpha > 0$, $\beta > 0$, and $\gamma \ge 0$, respectively. Let $\mathcal{D}^{te}$ be a test data distribution. A sample $(x, y) \sim \mathcal{D}^{te}$ satisfies the following:

(1. Label) The label $y$ follows the uniform distribution $\mathcal{U}(\{\pm 1\})$.

(2. Expectation and Moments) For every $i \in S_{irr}$, $E[x_i] = 0$. For every $i \in [d]$ and $n \in \{2, 3, 4\}$, there exist constants $C_i > 0$ and $C_{i,n} \ge 0$ such that

$$E[y x_i] = \begin{cases} C_i \alpha & (i \in S_{rob}) \\ C_i \beta & (i \in S_{vul}) \\ 0 & (i \in S_{irr}) \end{cases}, \qquad \left|E[(y x_i - E[y x_i])^n]\right| \le \begin{cases} C_{i,n} \alpha^n & (i \in S_{rob}) \\ C_{i,n} \beta^n & (i \in S_{vul}) \\ C_{i,n} \gamma^n & (i \in S_{irr}) \end{cases}. \tag{4}$$

(3. Covariance) There exist constants $0 \le q_{rob}, q_{vul} < 1$ such that

$$\Big|\Big\{ i \in S_{rob} \;\Big|\; \textstyle\sum_{j \in S_{rob} \cup S_{vul}} E[(y x_i - E[y x_i])(y x_j - E[y x_j])] < 0 \Big\}\Big| \le q_{rob}\, d_{rob}, \tag{5}$$

$$\Big|\Big\{ i \in S_{vul} \;\Big|\; \textstyle\sum_{j \in S_{rob} \cup S_{vul}} E[(y x_i - E[y x_i])(y x_j - E[y x_j])] < 0 \Big\}\Big| \le q_{vul}\, d_{vul}. \tag{6}$$

(4. Independence) For every $i \in S_{irr}$, $x_i$ is independent of $y$ and all $x_j$ for $j \neq i$.

In contrast to the training distributions, the test distribution may contain $d_{rob}$ robust features and $d_{irr}$ irrelevant features. These irrelevant features simulate natural noise or redundant dimensions commonly found in real-world data. For example, in MNIST (Deng, 2012), the top-left pixel is always zero and thus not predictive. Assumption 4 requires each irrelevant feature to be independent of both the label and all other features. The robust and non-robust features are not necessarily mutually independent.
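The moment bounds in Assumption 3.2 are straightforward to sanity-check numerically. As our own illustration (not from the paper), take a Gaussian robust feature with $y x_i \sim \mathcal{N}(\alpha, \alpha^2)$; its exact central moments are $\alpha^2$, $0$, and $3\alpha^4$, so the bounds in Eq. (4) hold with $C_{i,2} = 1$, $C_{i,3} = 0$, $C_{i,4} = 3$:

```python
import numpy as np

# Monte-Carlo check of Assumption 3.2's moment bounds for a Gaussian
# robust feature y * x_i ~ N(alpha, alpha^2).
rng = np.random.default_rng(0)
alpha = 0.5
yx = rng.normal(alpha, alpha, size=1_000_000)

m = yx.mean()             # should approximate E[y x_i] = alpha
central = yx - m
m2 = (central ** 2).mean()  # ~ alpha^2        (C_{i,2} = 1)
m3 = (central ** 3).mean()  # ~ 0              (C_{i,3} = 0)
m4 = (central ** 4).mean()  # ~ 3 * alpha^4    (C_{i,4} = 3)
```

The same style of check can be applied to the other distributions the paper lists (uniform, exponential, beta, gamma, Bernoulli, binomial) with the appropriate constants.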
Assumption 2 (Expectation) ensures that the robust and non-robust features are positively correlated with the label. Given sufficient data, it is always possible to preprocess features to positively align with their binary labels. For example, with a large $N$, this can be achieved by multiplying each feature $x_i$ by $\mathrm{sgn}(E[y x_i]) \approx \mathrm{sgn}(\sum_{n=1}^{N} y_n x_{n,i})$, ensuring $E[y\, (\mathrm{sgn}(E[y x_i])\, x_i)] = |E[y x_i]| \ge 0$.

Assumption 2 (Moments) bounds the $n$-th central moment of each feature by a constant multiple of the $n$-th power of its scale for $n \in \{2, 3, 4\}$. This condition ensures that the feature distribution does not exhibit excessively large fluctuations relative to its scale ($n = 2$), extreme asymmetry ($n = 3$), or heavy tails ($n = 4$). For example, with appropriate constant factors, exponential distributions satisfy this condition. Moreover, empirical studies suggest that pixel values (or contrasts) after typical filtering (e.g., Gabor filtering) approximately follow exponential distributions (Ruderman, 1994; Srivastava et al., 2003). This observation suggests that filtered pixel values (and contrasts) are broadly consistent with this assumption.

Assumption 3 bounds the number of features whose total covariance with other informative features (i.e., robust and non-robust features) is negative. As stated in Theorem 3.6, we typically assume that $q_{rob}$ and $q_{vul}$ are small (but not necessarily infinitesimal). This assumption prevents unrealistic cases where useful features are overly anti-correlated with others, which could hinder learning. When all the predictive features are conditionally independent given the label, $q_{rob} = 0$ and $q_{vul} = 0$ satisfy this assumption. Empirically, $q_{rob}$ and $q_{vul}$ appear to be small in real-world datasets (cf. Fig. A2). These conditions encompass a wide class of realistic data distributions.

• Example 1: Training data distribution. Each training distribution $\mathcal{D}^{tr}_c$ is a special case of the test distribution $\mathcal{D}^{te}$.
Specifically, it contains $d_{rob} = 1$ robust feature with scale $\alpha \approx 1$ and $d_{vul} = d - 1$ non-robust features with scale $\beta \approx \lambda$. There are no irrelevant features, i.e., $d_{irr} = 0$. By construction and due to the properties of the uniform distribution, this distribution satisfies all the conditions in Assumption 3.2.

• Example 2: Standard distributions. With appropriate constant factors, the test distribution class includes standard distributions, such as uniform, normal, exponential, beta, gamma, Bernoulli, binomial, etc. For example, consider the normal distribution. For $i \in S_{rob}$, Assumption 2 is satisfied by setting $y x_i \sim \mathcal{N}(\alpha, \alpha^2)$ with $C_i = 1$, $C_{i,2} = 1$, $C_{i,3} = 0$, and $C_{i,4} = 3$. Assumptions 3 and 4 are satisfied when all the features are mutually independent.

• Example 3: MNIST/Fashion-MNIST/CIFAR-10. Empirical evidence suggests that preprocessed MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky, 2009) approximately satisfy our assumptions. Consider MNIST. Let $\{x^{(0)}_n\}_{n=1}^{N}, \{x^{(1)}_n\}_{n=1}^{N} \in [0,1]^{784}$ denote the samples of digits zero and one, respectively. We assign $y = 1$ to digit zero and $y = -1$ to digit one. We center the data via $x' \leftarrow x - \bar{x}$ with $\bar{x} := (1/2N) \sum_{n=1}^{N} (x^{(0)}_n + x^{(1)}_n)$ and align features with the label using $x' \leftarrow \mathrm{sgn}(\sum_{n=1}^{N} (x^{(0)}_n - x^{(1)}_n)) \odot x'$. In this representation, common background pixels have near-zero expectations (i.e., $\gamma \approx 0$), while discriminative pixels, such as the left and right arcs of zero or the vertical stroke of one, correlate strongly with the label (i.e., $\alpha \approx 0.2$) (cf. Fig. A2). Additionally, some pixels that are occasionally activated by atypical samples (e.g., corners activated by slanted digits) exhibit weak correlation with the label (i.e., $\beta \approx 0.01$), reflecting non-robust but predictive attributes. Empirical analysis reveals that most pixels exhibit positive total covariance with others, consistent with Assumption 3 (cf. Fig. A2).
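The center-and-align preprocessing in Example 3 can be sketched in a few lines. This is our own illustration on synthetic two-class data (not the paper's code); the point is that after the sign flip, every feature's empirical correlation with the label is nonnegative, as Assumption 2 (Expectation) requires:

```python
import numpy as np

def align_features(x, y):
    """Center each feature (x' <- x - x_bar) and flip its sign by
    sgn(sum_n y_n x'_{n,i}) so that every feature correlates
    nonnegatively with the binary label."""
    x = x - x.mean(axis=0)
    s = np.sign((y[:, None] * x).sum(axis=0))
    s[s == 0] = 1.0  # leave exactly-uncorrelated features untouched
    return x * s

rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=200)
# Synthetic stand-in for two-class image pixels: one strongly correlated
# feature, one anti-correlated feature, one pure-noise feature.
x = np.stack([0.2 * y + 0.01 * rng.standard_normal(200),
              -0.2 * y + 0.01 * rng.standard_normal(200),
              rng.standard_normal(200)], axis=1)
xa = align_features(x, y)
```

After alignment, the anti-correlated feature has been flipped into a positively correlated one, mimicking the $\mathrm{sgn}(\cdot) \odot x'$ step applied to MNIST above.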
The main departure from Assumption 3.2 is that real-world datasets exhibit a gradual transition in feature robustness rather than an explicit binary separation between robust and non-robust features.

• Example 4: Linear combination of orthonormal bases. Under mild conditions, any distribution in which robust and non-robust directions form an orthonormal basis can be transformed into our setting via principal component analysis (cf. Appendix C).

Adversarial Attack. We assume that the query (test sample) $x_{N+1}$ is subject to the adversarial perturbation $\Delta$ constrained by the $\ell_\infty$ norm, i.e., $\|\Delta\|_\infty \le \varepsilon$, where $\varepsilon \ge 0$ denotes the perturbation budget. In practice, $\varepsilon$ is chosen to be consistent with the scale of non-robust features (e.g., $\varepsilon \approx \lambda$ for the training distributions and $\varepsilon \approx \beta$ for the test distribution). This ensures that perturbations can manipulate non-robust features while leaving robust features intact and remaining imperceptible to humans.

Pretraining With In-Context Loss. For pretraining, we consider the following problem based on the in-context loss (Ahn et al., 2023; Bai et al., 2023; Mahankali et al., 2024; Zhang et al., 2024b):

$$\min_{P, Q \in [0,1]^{(d+1) \times (d+1)}} \; E_{c \sim \mathcal{U}([d]),\; (x_n, y_n)_{n=1}^{N+1} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{D}^{tr}_c}\left[ \max_{\|\Delta\|_\infty \le \varepsilon} - y_{N+1} \left[ f(Z_\Delta; P, Q) \right]_{d+1, N+1} \right]. \tag{7}$$

This formulation encourages the transformer to extract robust, generalizable representations from $N$ clean in-context demonstrations and to accurately classify the adversarially perturbed query. We impose the constraint on the transformer parameters to prevent the problem from becoming ill-posed. We choose $[0, 1]$ instead of $[-1, 1]$ to simplify the theoretical derivation.

[Figure 1: Parameter heatmaps of $P$ and $Q$ learned via adversarial training (7) with $d = 20$ and $\lambda = 0.1$.]
[Figure 1, continued: For the standard, adversarial, and strongly adversarial regimes, we used $\varepsilon = 0$, $\frac{1 + (d-1)(\lambda/2)}{d} = 0.098$, and $\frac{\lambda}{2} + \frac{3}{2} \cdot \frac{2 - \lambda}{(d-1)\lambda^2 + 3} = 0.95$, respectively. We optimized (7) by stochastic gradient descent. Detailed experimental settings can be found in Appendix D.]

3.2 LINEAR CLASSIFIERS AND ORACLE

Standard Linear Classifiers Extract All Features and Are Therefore Vulnerable. As a warm-up, consider standard training of a linear classifier parameterized by $w \in \mathbb{R}^d$ on the $c$-th training distribution $\mathcal{D}^{tr}_c$. Standard training yields $w_{std} := \arg\min_{w \in [0,1]^d} E_{(x,y) \sim \mathcal{D}^{tr}_c}[-y w^\top x] = 1_d$. This classifier utilizes all the features, i.e., the robust feature at the $c$-th dimension and the other non-robust features. Although $w_{std}$ achieves correct predictions on clean samples ($E[y w_{std}^\top x] > 0$), it is vulnerable to adversarial perturbations: $E[\min_{\|\Delta\|_\infty \le \varepsilon} y w_{std}^\top (x + \Delta)] \le 0$ for $\varepsilon \ge \frac{1 + (d-1)(\lambda/2)}{d}$.¹ This implies that when $d$ is small, the perturbation must satisfy $\varepsilon \gtrsim 1$ to flip the prediction. In this regime, the perturbation alters the robust feature and is no longer human-imperceptible. However, as $d$ increases, the threshold decreases to $\varepsilon \gtrsim \lambda$, which matches the scale of the non-robust features. Such perturbations are human-imperceptible yet sufficient to cause misclassification.

Linear Classifiers Can Be Specifically Robust but Not Universally Robust. Consider adversarial training: $\min_{w \in [0,1]^d} E[\max_{\|\Delta\|_\infty \le \varepsilon} - y w^\top (x + \Delta)]$. For $\frac{\lambda}{2} \le \varepsilon < 1$, the optimal solution $w_{adv}$ equals one at the $c$-th dimension and zero otherwise. The classifier relies solely on the robust feature at the $c$-th dimension and ignores the other non-robust features. Unlike the standard model, this model can correctly classify both clean and adversarial samples for $0 \le \varepsilon < 1$; thus, linear classifiers can be robust to a specific training distribution. However, $w_{adv}$ tailored to $\mathcal{D}^{tr}_c$ is vulnerable on the other distributions $\mathcal{D}^{tr}_{c'}$ indexed by $c' \neq c$; thus, linear classifiers cannot be universally robust.
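The threshold for the standard linear classifier is just arithmetic, and it is worth verifying: the footnote's identity $E[\min_{\|\Delta\|_\infty \le \varepsilon} y w_{std}^\top (x + \Delta)] = 1 + (d-1)(\lambda/2) - d\varepsilon$ means the worst-case margin hits zero exactly at $\varepsilon = \frac{1 + (d-1)(\lambda/2)}{d}$, which for $d = 20$, $\lambda = 0.1$ is the $0.098$ used in Figure 1. A small sketch (our own illustration):

```python
d, lam = 20, 0.1

# Standard training on [0,1]^d yields w_std = 1_d, so the expected clean
# margin E[y w^T x] is 1 (robust coordinate) plus lam/2 per non-robust one.
clean_margin = 1 + (d - 1) * lam / 2

# A worst-case l_inf perturbation shifts the margin by -eps * ||w_std||_1
# = -d * eps, so the adversarial margin is clean_margin - d * eps.
# It crosses zero at the threshold below (= 0.0975 ~ 0.098 here).
eps_threshold = clean_margin / d
adv_margin = clean_margin - d * eps_threshold
```

For small $d$ the threshold is of order $1$ (the attack must touch the robust feature), while for large $d$ it decays toward $\lambda/2$, matching the discussion above.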
Universally Robust Classifiers Exist. Although linear classifiers cannot exhibit universal robustness across all $c$, universally robust classifiers do exist. For example, the classifier $h(x) := \mathrm{sgn}(x_i)$ with $i := \arg\max_{j \in [d]} |x_j|$ always produces correct predictions for clean data $x \sim \mathcal{D}^{tr}_c$ for any $c$ and perturbed data $x + \Delta$ with $\|\Delta\|_\infty \le \frac{1}{2}$.

3.3 ADVERSARIAL PRETRAINING

In this section, we analyze a global minimizer of the minimization problem (7).

Optimization Challenges. Although the training distributions are simple, the minimization problem (7) remains nontrivial due to the non-linearity and non-convexity of the model with respect to the trainable parameters $P$ and $Q$. The non-linearity of the self-attention and the inner maximization are also obstacles. Indeed, the minimization problem (7) can be reformulated as the following non-linear maximization problem:

Lemma 3.3 (Transformation of original optimization problem). The minimization problem (7) can be transformed into the maximization problem $\max_{b \in \{0,1\}^{d+1}} \sum_{i=1}^{d(d+1)} \max(0, \sum_{j=1}^{d+1} b_j h_{i,j})$, where $h_{i,j} \in \mathbb{R}$ is a constant depending on $(i, j)$, and there exists a mapping from $b$ to $P$ and $Q$.

The proof can be found in Appendix E. This lemma highlights the inherent difficulty of optimizing (7), as it requires selecting a binary vector $b$ that balances $d(d+1)$ interdependent non-linear terms.

[Footnote 1: $E[\min_{\|\Delta\|_\infty \le \varepsilon} y w_{std}^\top (x + \Delta)] = w_{std}^\top (E[yx] - \varepsilon 1_d) = 1 + (d-1)(\lambda/2) - d\varepsilon \le 0$.]

Global Solution. By exploiting the symmetric property of $b$ and further transformation of the problem in Lemma 3.3, we identify the global solutions to (7) for certain perturbation regimes.

Theorem 3.4 (Parameters learned via adversarial pretraining). The global minimizers of (7) are:

1. Standard; $\varepsilon = 0$: $P = P_{std} := \begin{pmatrix} 0_{d, d+1} \\ 1_{d+1}^\top \end{pmatrix}$ and $Q = Q_{std} := \begin{pmatrix} 1_{d+1, d} & 0_{d+1} \end{pmatrix}$.

2. Adversarial; $\varepsilon = \frac{1 + (d-1)(\lambda/2)}{d}$: $P = P_{adv} := \begin{pmatrix} 0_{d, d+1} \\ 1_{d+1}^\top \end{pmatrix}$ and $Q = Q_{adv} := \begin{pmatrix} I_d & 0_d \\ 0_d^\top & 0 \end{pmatrix}$.

3. Strongly adversarial; $\varepsilon \ge \frac{\lambda}{2} + \frac{3}{2} \cdot \frac{2 - \lambda}{(d-1)\lambda^2 + 3}$: $P = 0_{d+1, d+1}$ and $Q = 0_{d+1, d+1}$.
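To make the adversarial optimum of Theorem 3.4 concrete, here is a small numerical sketch (our own illustration): we build $P_{adv}$ (all-ones last row) and $Q_{adv}$ (identity on the feature block), and evaluate the prediction slot on a toy task in the $\lambda \to 0$ limit of Assumption 3.1, where each demonstration is exactly $x_n = y_n e_c$ with $c = 1$ (index 0 in code):

```python
import numpy as np

def predict(xs, ys, query, P, Q):
    """Prediction slot [f(Z; P, Q)]_{d+1, N+1} of the single-layer
    linear transformer f(Z) = (1/N) P Z M Z^T Q Z (clean query)."""
    N, d = xs.shape
    Z = np.zeros((d + 1, N + 1))
    Z[:d, :N], Z[d, :N], Z[:d, N] = xs.T, ys, query
    M = np.eye(N + 1); M[N, N] = 0.0   # mask out the query token
    return (P @ Z @ M @ Z.T @ Q @ Z)[d, N] / N

d = 3
# Theorem 3.4's adversarial optimum: P_adv has an all-ones last row,
# Q_adv keeps only the feature block (I_d in the top-left corner).
P_adv = np.zeros((d + 1, d + 1)); P_adv[d, :] = 1.0
Q_adv = np.zeros((d + 1, d + 1)); Q_adv[:d, :d] = np.eye(d)

# Toy task: demonstrations x_n = y_n * e_0 (robust feature only).
ys = np.array([1.0, -1.0, 1.0, -1.0])
xs = np.zeros((4, d)); xs[:, 0] = ys

pos = predict(xs, ys, np.array([1.0, 0.0, 0.0]), P_adv, Q_adv)
neg = predict(xs, ys, np.array([-1.0, 0.0, 0.0]), P_adv, Q_adv)
```

The sign of the output matches the query's true label in both cases, and the prediction depends on the task only through the in-context demonstrations, consistent with the theorem's observation that $P_{adv}$ and $Q_{adv}$ are independent of the task index $c$.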
The proof and optimal parameters for different $\varepsilon$ can be found in Appendix E. Importantly, the optimal $P$ and $Q$ are independent of any specific training distribution (i.e., index $c$), reflecting that the transformer acquires learning capability from demonstrations rather than memorizing individual tasks. Experimental results obtained using gradient descent align with our theoretical predictions (Fig. 1).

Failure Case. In the strongly adversarial regime, the global optimum becomes $P = Q = 0$, causing the transformer to always output zero regardless of the input. Namely, no universally robust single-layer linear transformers exist, despite the existence of universally robust classifiers (cf. Section 3.2). The perturbation scale $\varepsilon \ge \frac{\lambda}{2} + \frac{3}{2} \cdot \frac{2 - \lambda}{(d-1)\lambda^2 + 3}$ decreases in $d$: it transitions from $\varepsilon = 1$ when $d = 1$ to $\varepsilon \to \frac{\lambda}{2}$ as $d \to \infty$. In moderate dimensions ($d \approx \frac{1}{\lambda}$), adversarial perturbations must satisfy $\varepsilon \gtrsim 1$ to break the robustness. They are comparable to the scale of the robust features and thus perceptible to humans, contradicting the concept of adversarial perturbations. However, in extremely high dimensions ($d \gtrsim \frac{1}{\lambda^2}$), it suffices to perturb by only $\varepsilon \gtrsim \lambda$, which is on the same scale as the non-robust features and is typically imperceptible, thus preserving the concept of adversarial perturbations. This can be rephrased as follows: under our training distributions, single-layer linear transformers cannot achieve universal robustness when the non-robust dimensions (i.e., $d - 1$) substantially outnumber the single robust dimension.

3.4 UNIVERSAL ROBUSTNESS

In this section, we show that adversarial pretraining, combined with in-context learning from clean demonstrations, can yield universal robustness on both seen and unseen distributions.

Standard Pretraining Leads to Vulnerability. We begin by showing that the standard model fails to classify adversarially perturbed inputs.
Theorem 3.5 (Standard pretraining case). There exist a constant $C > 0$ and a strictly positive function $g(d_{rob}, d_{vul}, d_{irr}, \alpha, \beta, \gamma)$ such that

$$E_{(x_n, y_n)_{n=1}^{N+1} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{D}^{te}}\left[ \min_{\|\Delta\|_\infty \le \varepsilon} y_{N+1} \left[ f(Z_\Delta; P_{std}, Q_{std}) \right]_{d+1, N+1} \right] \le g(d_{rob}, d_{vul}, d_{irr}, \alpha, \beta, \gamma) \Big\{ \underbrace{C(d_{rob}\alpha + d_{vul}\beta)}_{\text{prediction for original data}} - \underbrace{(d_{rob} + d_{vul} + d_{irr})\varepsilon}_{\text{adversarial effect}} \Big\}. \tag{8}$$

The proof can be found in Appendix F. This result analyzes the expectation of the product of the true label and the model prediction for the query. A positive value indicates correct classification, whereas a nonpositive value indicates misclassification. Since $g(d_{rob}, d_{vul}, d_{irr}, \alpha, \beta, \gamma)$ is always positive, when $C(d_{rob}\alpha + d_{vul}\beta) - (d_{rob} + d_{vul} + d_{irr})\varepsilon$ is nonpositive, this implies incorrect classification.

Standard models extract both robust and non-robust features and thus are vulnerable. Assume $d_{irr} = 0$. Like standard linear classifiers, the standard model leverages both robust features $d_{rob}\alpha$ and non-robust features $d_{vul}\beta$. This also leads to vulnerability to adversarial perturbations, $-(d_{rob} + d_{vul})\varepsilon$. The prediction becomes incorrect when $C(d_{rob}\alpha + d_{vul}\beta) - (d_{rob} + d_{vul})\varepsilon \le 0$, i.e., when $\varepsilon \gtrsim \frac{d_{rob}\alpha + d_{vul}\beta}{d_{rob} + d_{vul}}$. When the perturbation size $\varepsilon$ is on the same scale as the non-robust features ($\varepsilon \lesssim \beta$), the inequality can be rearranged as $d_{vul} \gtrsim \frac{\alpha - \beta}{\beta} d_{rob}$. In typical cases where the scale of the robust features is much larger than that of the non-robust features ($\alpha \gg \beta$), we can informally conclude:

(Informal restatement of Theorem 3.5) Assume that the scale of the robust features is much larger than that of the non-robust features ($\alpha \gg \beta$), the perturbation size is on the same scale as the non-robust features ($\varepsilon \lesssim \beta$), and there are no non-predictive features ($d_{irr} = 0$). If $d_{vul} \gtrsim \frac{\alpha}{\beta} d_{rob}$, then the standardly pretrained single-layer linear transformer is vulnerable to adversarial attacks.
Non-predictive features accelerate vulnerability. Redundant dimensions d_irr do not contribute to the first term (i.e., accuracy) but increase the second term (i.e., vulnerability). Therefore, they degrade robustness without improving predictive performance. In addition, d_irr amplifies the adversarial effect at a rate of d_irr ε, which is comparable to the effect of the useful dimensions, d_rob ε and d_vul ε.

Adversarial Pretraining Leads to Universal Robustness. We now establish universal robustness of the adversarially pretrained model.

Theorem 3.6 (Adversarial pretraining case). Suppose that q_rob and q_vul defined in Assumption 3.2 are sufficiently small. There exist constants C₁, C₂ > 0 such that

$$
\mathbb{E}_{(x_n, y_n)_{n=1}^{N+1} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{D}_{\mathrm{te}}}
\Big[ \min_{\|\Delta\|_\infty \le \varepsilon} y_{N+1}\, [f(Z_\Delta; P_{\mathrm{adv}}, Q_{\mathrm{adv}})]_{d+1,\,N+1} \Big]
\ge \underbrace{C_1 (d_{\mathrm{rob}}\alpha + d_{\mathrm{vul}}\beta + 1)(d_{\mathrm{rob}}\alpha^2 + d_{\mathrm{vul}}\beta^2)}_{\text{prediction for original data}}
- \underbrace{C_2 \Big\{ (d_{\mathrm{rob}}\alpha + d_{\mathrm{vul}}\beta + 1)\Big(d_{\mathrm{rob}}\alpha + d_{\mathrm{vul}}\beta + \tfrac{d_{\mathrm{irr}}\gamma}{\sqrt{N}}\Big) + d_{\mathrm{irr}}\Big(\sqrt{\tfrac{d_{\mathrm{irr}}}{N}} + 1\Big)\gamma^2 \Big\}\varepsilon}_{\text{adversarial effect}}. \tag{9}
$$

The proof and a more general statement can be found in Appendix F and Theorem F.1, respectively. For notational simplicity, we assume small q_rob and q_vul; however, we do not require them to be infinitesimal (see Theorem F.1 and Appendix C). In contrast to Theorem 3.5, this theorem provides a lower bound: a positive right-hand side implies correct classification under adversarial perturbations.

Adversarially trained models prioritize robust features. Assume d_irr = 0. Up to constant factors, the lower bound reduces to (d_rob α + d_vul β + 1){d_rob α² + d_vul β² − (d_rob α + d_vul β)ε}. The important factor is d_rob α² + d_vul β² − (d_rob α + d_vul β)ε, which determines the sign. As shown in Theorem 3.5, the standard model extracts the robust and non-robust features at scales d_rob α and d_vul β, respectively. In contrast, the adversarially trained model extracts them at quadratic scales d_rob α² and d_vul β².
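The effect of this linear-versus-quadratic scaling can be illustrated with the D_te parameters used in the experiments of Section 4 (d_rob = 10, d_vul = 90, α = 1.0, β = 0.1). The "weight fraction" below is an illustrative summary of the theorem's scales, not a quantity defined in the paper:

```python
# Fraction of total feature weight placed on the robust coordinates under
# the standard model's linear scales (d_rob*a, d_vul*b) versus the
# adversarial model's quadratic scales (d_rob*a**2, d_vul*b**2).

d_rob, d_vul, alpha, beta = 10, 90, 1.0, 0.1

standard = d_rob * alpha / (d_rob * alpha + d_vul * beta)
adversarial = d_rob * alpha**2 / (d_rob * alpha**2 + d_vul * beta**2)
print(round(standard, 2), round(adversarial, 2))  # 0.53 0.92
```

Squaring the feature scales shifts the bulk of the weight from an even split onto the robust coordinates, which is exactly the mechanism the next paragraph describes.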
Since the robust features typically have larger magnitude (α² ≫ β²), the adversarially trained model places greater emphasis on the robust features and mitigates the influence of the non-robust features, compared with its standard counterpart.

Adversarially trained models are universally robust. As shown above, up to constant factors, the prediction remains correct as long as d_rob α² + d_vul β² − (d_rob α + d_vul β)ε ≥ 0. This condition fails when ε ≳ (d_rob α² + d_vul β²)/(d_rob α + d_vul β). When the perturbation size ε is on the same scale as the non-robust features (ε ≲ β), the inequality can be rearranged as d_vul ≳ d_rob α(α − β)/β². In typical cases where α ≫ β, we can informally conclude:

(Informal restatement of Theorem 3.6) Assume that the scale of the robust features is much larger than that of the non-robust features (α ≫ β), the perturbation size is on the same scale as the non-robust features (ε ≲ β), and there are no non-predictive features (d_irr = 0). If d_vul ≲ (α/β)² d_rob, then the adversarially pretrained single-layer linear transformer is robust to adversarial attacks.

This threshold substantially improves on the standard model's robustness condition. For example, when α = 160/255 and β = 8/255, the standard model is potentially robust only up to d_vul ≲ 20 d_rob, whereas the adversarially pretrained model remains robust up to d_vul ≲ 400 d_rob. This result also suggests that the adversarially pretrained model is potentially vulnerable when the non-robust dimensions significantly outnumber the robust ones, consistent with the failure case in Section 3.3.

Adversarially trained models are more robust to attacks that exploit non-predictive features. Theorem 3.6 shows that even when the adversary exploits redundant dimensions, their effect is significantly attenuated. For simplicity, assume N → ∞. The adversarial effect from the irrelevant features scales as d_irr γ² ε, which is linear in d_irr.
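The two informal thresholds can be checked numerically; with the example values above (α = 160/255, β = 8/255), the factors 20 and 400 fall out directly because the 255 denominators cancel:

```python
# Informal robustness thresholds on the number of non-robust dimensions d_vul
# (constants omitted, per the informal restatements of Theorems 3.5 and 3.6):
#   standard model:     d_vul <~ (alpha / beta)    * d_rob
#   adversarial model:  d_vul <~ (alpha / beta)**2 * d_rob

alpha, beta = 160 / 255, 8 / 255

standard_ratio = alpha / beta           # 20.0
adversarial_ratio = (alpha / beta)**2   # 400.0
print(standard_ratio, adversarial_ratio)
```

The quadratic dependence on α/β is what lets the adversarially pretrained model tolerate a 20× larger budget of non-robust dimensions in this example.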
In contrast, the clean term scales as d²_rob α³ and d²_vul β³, i.e., quadratically in the number of informative features. Thus, as long as the informative features dominate in magnitude and number, the influence of the non-predictive features on the model's robustness remains limited.

3.5 OPEN CHALLENGES

In this section, we show that two open challenges in robust classification (Schmidt et al., 2018; Tsipras et al., 2019) persist in our setting.

Accuracy–Robustness Trade-Off. Inspired by Tsipras et al. (2019), we consider a situation where the robust features correlate with their labels only with some probability, whereas the non-robust features always correlate.

Theorem 3.7 (Accuracy–robustness trade-off). Assume d_rob = 1, d_vul = d − 1, and d_irr = 0. In addition to Assumption 3.2, for (x, y) ∼ D_te, suppose that y x_i takes α with probability p > 0.5 and −α with probability 1 − p for i ∈ S_rob. Moreover, y x_i takes β with probability one for i ∈ S_vul. Define

$$\tilde f(P, Q) := \mathbb{E}_{(x_n, y_n)_{n=1}^{N} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{D}_{\mathrm{te}}}\big[\, y_{N+1} [f(Z_0; P, Q)]_{d+1,\,N+1} \big].$$

Then, there exist strictly positive functions g₁(d, α, β) and g₂(d, α, β) such that

$$
\tilde f(P_{\mathrm{std}}, Q_{\mathrm{std}}) =
\begin{cases}
g_1(d, \alpha, \beta)\,\big(\alpha + (d-1)\beta\big) & \text{w.p. } p,\\[2pt]
g_1(d, \alpha, \beta)\,\big({-\alpha} + (d-1)\beta\big) & \text{w.p. } 1-p,
\end{cases} \tag{10}
$$

$$
\tilde f(P_{\mathrm{adv}}, Q_{\mathrm{adv}}) \le g_2(d, \alpha, \beta)\,\big({-(2p-1)\alpha^2} + (d-1)\beta^2\big) \quad \text{w.p. } 1-p. \tag{11}
$$

The proof can be found in Appendix G. Unlike Theorems 3.5 and 3.6, this theorem takes the expectation over (x_n, y_n)_{n=1}^{N} instead of (x_n, y_n)_{n=1}^{N+1}; the query (x_{N+1}, y_{N+1}) remains stochastic. If d ≳ α/β, the standard model consistently produces correct predictions. However, if d ≲ (2p − 1)(α/β)², the adversarially trained model produces incorrect predictions with probability 1 − p. This trade-off arises because the adversarially trained model discards the non-robust but predictive features.

Need for Larger In-Context Sample Sizes. We informally summarize Theorem H.1 as follows (omitting constant factors for clarity):

(Informal summary of Theorem H.1) Assume the same conditions as in Theorem 3.7.
Consider E_{x_{N+1}, y_{N+1}}[y_{N+1} [f(Z_0; P, Q)]_{d+1, N+1}]. Assume d ≲ α/β, p → 0.5, and a small-N regime. With probability at least 1 − exp(−N), the standard model makes correct predictions; with probability at most 1 − 1/√N, the adversarially trained model makes correct predictions.

This result indicates that in low-sample regimes, the adversarially pretrained model requires substantially more in-context demonstrations to achieve clean accuracy comparable to that of the standard model. This stems from its reliance on the robust features, which are statistically underrepresented when samples are few.

4 EXPERIMENTAL RESULTS

Additional results and detailed experimental settings are provided in Appendix D.

Verification of Theorem 3.4. We trained single-layer linear transformers (2) using stochastic gradient descent over [0, 1]^d with the in-context loss (7). The training distribution was configured with d = 20 and λ = 0.1. We used ε = 0 for the standard regime, ε = (1 + (d − 1)λ/2)/d = 0.098 for the adversarial regime, and ε = 0.95 (the strongly adversarial threshold from Section 3.3) for the strongly adversarial regime. The heatmaps of the learned parameters are shown in Fig. 1. These results align with the theoretical predictions of Theorem 3.4.

Verification of Theorems 3.5 to 3.7. We evaluated the standardly and adversarially pretrained single-layer linear transformers with the theoretically predicted parameters (i.e., the parameters of the standard and adversarial regimes in Theorem 3.4) on D_tr, D_te, MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky, 2009). These results are provided in Tab. 1. The results suggest that the standard models achieve high clean accuracy but suffer severe degradation under adversarial attacks, consistent with Theorem 3.5. In contrast, the adversarially pretrained models maintain high robustness, supporting Theorem 3.6, while their clean accuracy is lower, aligning with the accuracy–robustness trade-off described in Theorem 3.7.

Table 1: Accuracy (%) of standardly and adversarially pretrained single-layer linear transformers. Left values are clean accuracy; right values are robust accuracy. For D_tr (cf. Assumption 3.1), we used d = 100 and λ = 0.1. For D_te (cf. Assumption 3.2), we constructed the test distribution from multivariate normal distributions with d_rob = 10, d_vul = 90, d_irr = 0, α = 1.0, and β = 0.1. For the real-world datasets, values were averaged across all 45 binary classification pairs from the 10 classes. The perturbation budgets were ε = 0.15 for D_tr, 0.2 for D_te, 0.1 for MNIST and CIFAR-10, and 0.15 for Fashion-MNIST. See Appendix D for details.

| Model | D_tr | D_te | MNIST | FMNIST | CIFAR-10 |
|---|---|---|---|---|---|
| Standardly pretrained | 100 / 0 | 100 / 0 | 94 / 4 | 91 / 20 | 68 / 21 |
| Adversarially pretrained | 100 / 100 | 99 / 95 | 93 / 72 | 89 / 62 | 64 / 34 |

5 CONCLUSION AND LIMITATIONS

We theoretically demonstrated that single-layer linear transformers, after adversarial pretraining across classification tasks, can robustly adapt to previously unseen classification tasks through in-context learning, without any additional training. These results pave the way for universally robust foundation models. We also showed that these transformers adaptively focus on robust features, exhibit the accuracy–robustness trade-off, and require a larger number of in-context demonstrations.

Our limitations include assumptions about data distributions and architectures. While we assume that the data distributions consist of clearly separated robust and non-robust features, real-world datasets typically exhibit a more gradual transition (cf. Section 3.1, especially Example 3). Single-layer linear transformers lack the practical characteristics of multi-layer models and softmax attention. Although these theoretical assumptions are standard and comparable in strength to those in prior work (cf.
studies on in-context learning in Appendix A), they limit the applicability of our results. Universally robust foundation models are conceptually expected to adapt to any task and any form of perturbation. However, our theoretical results assume classification tasks and ℓ∞ perturbations; extending them to other tasks and perturbation models is left for future work. The cost of adversarial pretraining is also a limitation of universally robust foundation models. We expect that such efforts will be undertaken by large organizations, which could offset development costs through API fees. In addition, acceleration techniques for adversarial training, which have been extensively studied in the literature, can reduce this cost to a level comparable to standard training. Our theoretical analysis is an important first step toward fostering the practical development of universally robust foundation models. See also the last paragraph in Section 1.

REPRODUCIBILITY STATEMENT

All experimental procedures are described in Section 4 and Appendix D. The source code to reproduce our experimental results can be found at https://github.com/s-kumano/universally-robust-in-context-learner. Proofs of the theorems are provided in Appendices E to H.

THE USE OF LARGE LANGUAGE MODELS (LLMS)

We used LLMs to improve our writing. No essential contributions were made by the LLMs.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

S. Kumano was supported by JSPS KAKENHI Grant Number JP23KJ0789 and by JST ACT-X Grant Number JPMJAX23C7, Japan. H. Kera was supported by JST PRESTO Grant Number JPMJPR24K, JST BOOST Program Grant Number JPMJBY24C6, and the JSPS Program for Forming Japan's Peak Research Universities (J-PEAKS) Grant Number JPJS00420230002.

REFERENCES

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. In NeurIPS, volume 36, p. 45614–45650, 2023.
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. In ICLR, 2023.

Ahmed Aldahdooh, Wassim Hamidouche, and Olivier Deforges. Reveal of vision transformers robustness against adversarial attacks. arXiv:2106.03734, 2021.

Maksym Andriushchenko and Nicolas Flammarion. Understanding and improving fast adversarial training. In NeurIPS, volume 33, p. 16048–16059, 2020.

Usman Anwar, Johannes Von Oswald, Louis Kirsch, David Krueger, and Spencer Frei. Adversarial robustness of in-context learning in transformers for linear regression. arXiv:2411.05189, 2024.

Robert B Ash. Information Theory. Courier Corporation, 1990.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, p. 274–283, 2018.

Maximilian Augustin, Alexander Meinke, and Matthias Hein. Adversarial robustness on in- and out-distribution improves explainability. In ECCV, p. 228–245, 2020.

Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. In NeurIPS, volume 36, p. 57125–57211, 2023.

Yutong Bai, Jieru Mei, Alan L Yuille, and Cihang Xie. Are transformers more robust than CNNs? In NeurIPS, volume 34, p. 26831–26843, 2021.

Philipp Benz, Soomin Ham, Chaoning Zhang, Adil Karjauv, and In So Kweon. Adversarial robustness comparison of vision transformer and MLP-mixer to CNNs. In BMVC, 2021.

Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, and Andreas Veit. Understanding robustness of transformers for image classification. In ICCV, p. 10231–10241, 2021.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, volume 33, p.
1877–1901, 2020.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In USENIX Security, p. 2633–2650, 2021.

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In ICLR, 2022.

Prasad Chalasani, Jiefeng Chen, Amrita Roy Chowdhury, Xi Wu, and Somesh Jha. Concise explanations of neural networks using adversarial training. In ICML, p. 1383–1391, 2020.

Xiang Cheng, Yuxin Chen, and Suvrit Sra. Transformers implement functional gradient descent to learn non-linear functions in context. In ICML, 2024.

Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, p. 2206–2216, 2020.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers. In ACL, 2023.

Edoardo Debenedetti, Vikash Sehwag, and Prateek Mittal. A light recipe to train robust vision transformers. In SaTML, p. 225–253, 2023.

Li Deng. The MNIST database of handwritten digit images for machine learning research. Signal Processing Magazine, 29(6):141–142, 2012.

Edgar Dobriban, Hamed Hassani, David Hong, and Alexander Robey. Provable tradeoffs in adversarially robust classification. IEEE Transactions on Information Theory, 2023.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry.
Adversarial robustness as a prior for learned representations. arXiv:1906.00945, 2019.

Christian Etmann, Sebastian Lunz, Peter Maass, and Carola-Bibiane Schönlieb. On the connection between adversarial robustness and saliency map interpretability. In ICML, 2019.

Hao Fan, Zhaoyang Ma, Yong Li, Rui Tian, Yunli Chen, and Chenlong Gao. MixPrompt: Enhancing generalizability and adversarial robustness for vision-language models via prompt fusion. In ICIC, p. 328–339, 2024.

Spencer Frei and Gal Vardi. Trained transformer classifiers generalize and exhibit benign overfitting in-context. In ICLR, 2025.

Deqing Fu, Tian-qi Chen, Robin Jia, and Vatsal Sharan. Transformers learn to achieve second-order convergence rates for in-context linear regression. In NeurIPS, volume 37, p. 98675–98716, 2024.

Shaopeng Fu, Liang Ding, and Di Wang. "Short-length" adversarial training helps LLMs defend "long-length" jailbreak attacks: Theoretical and empirical evidence. arXiv:2502.04204, 2025.

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. In NeurIPS, volume 35, p. 30583–30598, 2022.

Siddhant Garg and Goutham Ramakrishnan. BAE: BERT-based adversarial examples for text classification. In EMNLP, p. 6174–6181, 2020.

Khashayar Gatmiry, Nikunj Saunshi, Sashank J Reddi, Stefanie Jegelka, and Sanjiv Kumar. Can looped transformers learn to implement multi-step gradient descent for in-context learning? In ICML, 2024.

Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, and Jason D Lee. How well can transformers emulate in-context Newton's method? arXiv:2403.03183, 2024.

Micah Goldblum, Liam Fowl, and Tom Goldstein. Adversarially robust few-shot learning: A meta-learning approach. In NeurIPS, volume 33, p. 17886–17895, 2020.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

Yufan Hou, Lixin Zou, and Weidong Liu.
Task-based focal loss for adversarially robust meta-learning. In ICPR, p. 2824–2829, 2021.

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In NeurIPS, volume 32, p. 125–136, 2019.

Ahmadreza Jeddi, Mohammad Javad Shafiee, and Alexander Wong. A simple fine-tuning is all you need: Towards robust deep learning via adversarial fine-tuning. arXiv:2012.13628, 2020.

Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, and Xiaochun Cao. Evolution-based region adversarial prompt learning for robustness enhancement in vision-language models. arXiv:2503.12874, 2025.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI, volume 34, p. 8018–8025, 2020.

Simran Kaur, Jeremy Cohen, and Zachary C Lipton. Are perceptually-aligned gradients a general property of robust classifiers? In NeurIPS WS, 2019.

Hoki Kim, Woojin Lee, and Jaewook Lee. Understanding catastrophic overfitting in single-step adversarial training. In AAAI, volume 35, p. 8119–8127, 2021.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Jonathan Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, Ofir Nachum, and Emma Brunskill. Supervised pretraining can learn in-context reinforcement learning. In NeurIPS, volume 36, p. 43057–43083, 2023.

Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. In CVPR, p. 24408–24419, 2024.

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial attack against BERT using BERT. In EMNLP, p. 6193–6202, 2020.

Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, and Difan Zou.
On the robustness of transformers against context hijacking for linear classification. arXiv:2502.15609, 2025.

Licong Lin, Yu Bai, and Song Mei. Transformers as decision makers: Provable in-context reinforcement learning via supervised pretraining. In ICLR, 2024.

Chang Liu, Yinpeng Dong, Wenzhao Xiang, Xiao Yang, Hang Su, Jun Zhu, Yuefeng Chen, Yuan He, Hui Xue, and Shibao Zheng. A comprehensive study on robustness of image classification models: Benchmarking and rethinking. IJCV, 133(2):567–589, 2025.

Fan Liu, Shuyu Zhao, Xuelong Dai, and Bin Xiao. Long-term cross adversarial training: A robust meta-learning method for few-shot classification tasks. In ICML WS, 2021.

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In ICLR, 2024.

Lin Luo, Xin Wang, Bojia Zi, Shihao Zhao, and Xingjun Ma. Adversarial prompt distillation for vision-language models. arXiv:2411.15244, 2024.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. In ICLR, 2024.

Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. On the robustness of vision transformers to adversarial examples. In ICCV, p. 7838–7847, 2021.

Chengzhi Mao, Scott Geng, Junfeng Yang, Xin Wang, and Carl Vondrick. Understanding zero-shot adversarial robustness for large-scale models. In ICLR, 2023.

Mohammad Mehrabi, Adel Javanmard, Ryan A Rossi, Anup Rao, and Tung Mai. Fundamental tradeoffs in distributionally adversarial training. In ICML, p. 7544–7554, 2021.

Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers.
In NeurIPS, volume 34, p. 23296–23308, 2021.

Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A Feder Cooper, Daphne Ippolito, Christopher A Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv:2311.17035, 2023.

Geon Yeong Park and Sang Wan Lee. Reliably fast adversarial training via latent adversarial perturbation. In ICCV, p. 7758–7767, 2021.

Sayak Paul and Pin-Yu Chen. Vision transformers are robust learners. In AAAI, volume 36, p. 2071–2081, 2022.

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. In NeurIPS WS, 2022.

Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C Duchi, and Percy Liang. Adversarial training can hurt generalization. In ICML WS, 2019.

Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Understanding and mitigating the tradeoff between robustness and accuracy. In ICML, 2020.

Daniel L Ruderman. The statistics of natural images. Network: Computation in Neural Systems, 5(4):517, 1994.

Shibani Santurkar, Andrew Ilyas, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Image synthesis with a single (robust) classifier. In NeurIPS, volume 32, 2019.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In NeurIPS, volume 31, 2018.

Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In NeurIPS, 2019.

Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of vision transformers. TMLR, 2022.

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In ACM CCS, p. 1671–1685, 2024.
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In ICML, p. 31210–31227, 2023.

Zhenmei Shi, Junyi Wei, Zhuoyan Xu, and Yingyu Liang. Why larger language models do in-context learning differently? In ICML, 2024.

Suraj Srinivas, Sebastian Bordt, and Himabindu Lakkaraju. Which models have perceptually-aligned gradients? An explanation via off-manifold robustness. In NeurIPS, volume 36, 2023.

Anuj Srivastava, Ann B Lee, Eero P Simoncelli, and S-C Zhu. On advances in statistical modeling of natural images. Journal of Mathematical Imaging and Vision, 18(1):17–33, 2003.

Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In ECCV, p. 631–648, 2018.

Satoshi Suzuki, Shin'ya Yamaguchi, Shoichiro Takeda, Sekitoshi Kanai, Naoki Makishima, Atsushi Ando, and Ryo Masumura. Adversarial finetuning with latent representation constraint to mitigate accuracy-robustness tradeoff. In ICCV, p. 4367–4378, 2023.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

Shiyu Tang, Ruihao Gong, Yan Wang, Aishan Liu, Jiakai Wang, Xinyun Chen, Fengwei Yu, Xianglong Liu, Dawn Song, Alan Yuille, et al. RobustART: Benchmarking robustness on architecture design and training techniques. arXiv:2109.05211, 2021.

Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. In NeurIPS, volume 33, p. 1633–1645, 2020.

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR, 2019.
Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In ICML, p. 35151–35174, 2023.

Johannes Von Oswald, Maximilian Schlegel, Alexander Meulemans, Seijin Kobayashi, Eyvind Niklasson, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Max Vladymyrov, et al. Uncovering mesa-optimization algorithms in transformers. In ICLR WS, 2024.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP-IJCNLP, 2019.

Ren Wang, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Chuang Gan, and Meng Wang. On fast adversarial robustness adaptation in model-agnostic meta-learning. In ICLR, 2021.

Sibo Wang, Jie Zhang, Zheng Yuan, and Shiguang Shan. Pre-trained model guided fine-tuning for zero-shot adversarial robustness. In CVPR, p. 24502–24511, 2024a.

Xin Wang, Kai Chen, Xingjun Ma, Zhineng Chen, Jingjing Chen, and Yu-Gang Jiang. AdvQDet: Detecting query-based adversarial attacks with adversarial contrastive prompt tuning. In ACM MM, p. 6212–6221, 2024b.

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In NeurIPS, volume 36, p. 80079–80110, 2023a.

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv:2303.03846, 2023b.

Noam Wies, Yoav Levine, and Amnon Shashua. The learnability of in-context learning. In NeurIPS, volume 36, p. 36637–36651, 2023.

Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. In ICLR, 2020.

Boxi Wu, Jindong Gu, Zhifeng Li, Deng Cai, Xiaofei He, and Wei Liu. Towards efficient adversarial training on vision transformers. In ECCV, p. 307–325, 2022.

Han Xiao, Kashif Rasul, and Roland Vollgraf.
Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.

Fan Yang, Mingxuan Xia, Sangzhou Xia, Chicheng Ma, and Hui Hui. Revisiting the robust generalization of adversarial prompt tuning. arXiv:2405.11154, 2024.

Yao-Yuan Yang, Cyrus Rashtchian, Hongyang Zhang, Russ R Salakhutdinov, and Kamalika Chaudhuri. A closer look at accuracy vs. robustness. In NeurIPS, volume 33, p. 8588–8601, 2020.

Chengxiang Yin, Jian Tang, Zhiyuan Xu, and Yanzhi Wang. Adversarial meta-learning. arXiv:1806.03316, 2018.

Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. Word-level textual adversarial attacking as combinatorial optimization. In ACL, p. 6066–6080, 2020.

Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. In NeurIPS, 2019a.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, p. 7472–7482, 2019b.

Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, and Jitao Sang. Adversarial prompt tuning for vision-language models. In ECCV, p. 56–72, 2024a.

Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. JMLR, 25(49):1–55, 2024b.

Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained convolutional neural networks. In ICML, p. 7502–7511, 2019.

Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization. arXiv:2305.19420, 2023.

Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models. In NeurIPS, volume 37, p. 3122–3156, 2024.
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043, 2023.

APPENDIX CONTENTS

A Additional Related Work
B Clarification on Single-Task Pretraining and Task-Specific Adversarial Training
C Additional Theoretical Support and Insights
  C.1 Linear Combination of Orthonormal Bases can be Transformed into Our Test Distribution
  C.2 Sufficient Number of Datasets to Provide Universal Robustness
  C.3 Effects of q_rob and q_vul
  C.4 Disadvantage of Standard Finetuning: Parameter Selection Perspective
  C.5 Naive Adversarial Context may not Improve Robustness
D Additional Experimental Results
  D.1 Support for Assumption 3.2
  D.2 Verification of Theorem 3.4
  D.3 Verification of Theorems 3.5 to 3.7 and H.1
E Proof of Lemma 3.3 and Theorem 3.4 (Pretraining)
F Proof of Theorems 3.5 and 3.6 (Robustness)
G Proof of Theorem 3.7 (Trade-Off)
H Proof of Theorem H.1 (Need for Larger Sample Size)

A ADDITIONAL RELATED WORK

In-Context Learning. In-context learning has emerged as a remarkable property of large language models, enabling them to adapt to a new task from a few input–output demonstrations without any parameter updates (Brown et al., 2020). Recent work has shown that in-context learning can implement various algorithms (Bai et al., 2023; Garg et al., 2022).
One research direction has linked in-context learning with preconditioned gradient descent through empirical (Akyürek et al., 2023; Dai et al., 2023; Garg et al., 2022; Von Oswald et al., 2023; 2024) and theoretical analyses (Ahn et al., 2023; Bai et al., 2023; Cheng et al., 2024; Gatmiry et al., 2024; Mahankali et al., 2024; Zhang et al., 2024b). Additional results have indicated that in-context learning can implement ridge regression (Akyürek et al., 2023; Bai et al., 2023), second-order optimization (Fu et al., 2024; Giannou et al., 2024), reinforcement learning (Lee et al., 2023; Lin et al., 2024), and Bayesian model averaging (Zhang et al., 2023). In terms of robustness, some studies have shown that in-context learning can act as a nearly optimal predictor under noisy linear data (Bai et al., 2023) and noisy labels (Frei & Vardi, 2025). Moreover, it has been demonstrated that in-context learning is robust to shifts in the query distribution (Wies et al., 2023; Zhang et al., 2024b), but not necessarily to shifts in the context (Shi et al., 2023; 2024; Wei et al., 2023b; Zhang et al., 2024b). In this study, we focus on the adversarial robustness of in-context learning, rather than the underlying algorithms or its robustness to random noise and distribution shifts. Specifically, we examine whether a single adversarially pretrained transformer can robustly adapt to a broad range of tasks through in-context learning.

Norm- and Token-Bounded Adversarial Examples. Adversarial examples were originally introduced as subtle perturbations to natural data, designed to induce misclassifications in models (Croce & Hein, 2020; Goodfellow et al., 2015; Madry et al., 2018; Szegedy et al., 2014). These perturbations are typically constrained by a norm-based distance from the original inputs.
The robustness of transformers to such norm-bounded adversarial examples has been studied primarily in vision transformers (Dosovitskiy et al., 2021). Several studies have shown that standard vision transformers are as vulnerable to these attacks as conventional vision models (Bai et al., 2021; Mahmood et al., 2021), though some have reported marginal differences (Aldahdooh et al., 2021; Benz et al., 2021; Bhojanapalli et al., 2021; Naseer et al., 2021; Paul & Chen, 2022; Shao et al., 2022; Tang et al., 2021). In contrast, adversarial attacks on language models are often neither norm-constrained nor imperceptible to humans. They involve substantial token modifications (Garg & Ramakrishnan, 2020; Jin et al., 2020; Li et al., 2020; Zang et al., 2020), the insertion of adversarial tokens (Liu et al., 2024; Shen et al., 2024; Wallace et al., 2019; Wei et al., 2023a; Zou et al., 2023), and the construction of entirely new adversarial prompts (Carlini et al., 2021; 2022; Nasr et al., 2023; Perez & Ribeiro, 2022; Wei et al., 2023a). These attacks aim not only to induce misclassification (Garg & Ramakrishnan, 2020; Jin et al., 2020; Li et al., 2020; Wallace et al., 2019; Zang et al., 2020), but also to provoke objectionable outputs (Liu et al., 2024; Perez & Ribeiro, 2022; Shen et al., 2024; Wei et al., 2023a; Zou et al., 2023) or to extract private information from training data (Carlini et al., 2021; 2022; Nasr et al., 2023). They are generally bounded by token-level metrics (e.g., the number of modified tokens). In this study, we focus exclusively on norm-bounded adversarial examples. Token-bounded ones are out of scope.

Adversarial Training. Adversarial training, which augments training data with adversarial examples, is one of the most effective adversarial defenses (Goodfellow et al., 2015; Madry et al., 2018).
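Schematically, adversarial training solves a min-max problem: each gradient step first perturbs the batch adversarially, then descends on the perturbed loss. A minimal sketch for a linear model under an $\ell_\infty$ budget (toy data; not the paper's training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eps, lr = 10, 500, 0.1, 0.05
X = rng.normal(size=(N, d))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=N))   # feature 0 is the robust signal
w = np.zeros(d)

for _ in range(200):
    # Inner maximization: the optimal l_inf attack on a linear model
    # moves each input by -eps * y * sign(w).
    X_adv = X - eps * y[:, None] * np.sign(w)
    margin = y * (X_adv @ w)
    # Outer minimization: (sub)gradient step on the hinge loss.
    grad = -(y[:, None] * X_adv)[margin < 1].sum(axis=0) / N
    w -= lr * grad

print(np.argmax(np.abs(w)))  # the weight concentrates on the robust feature
```

The attack term consistently shrinks weights on weakly correlated features, which is why the learned weight concentrates on the robust dimension.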
Although originally developed for conventional neural architectures, adversarial training has also proven effective for transformers (Debenedetti et al., 2023; Liu et al., 2025; Shao et al., 2022; Tang et al., 2021; Wu et al., 2022). A major limitation of adversarial training is its high computational cost. To address this, several methods have focused on more efficient generation of adversarial examples (Andriushchenko & Flammarion, 2020; Kim et al., 2021; Park & Lee, 2021; Shafahi et al., 2019; Wong et al., 2020; Zhang et al., 2019a) and adversarial finetuning of standard pretrained models (Jeddi et al., 2020; Mao et al., 2023; Suzuki et al., 2023; Wang et al., 2024a). More recently, researchers have introduced adversarial prompt tuning, which trains visual (Mao et al., 2023; Wang et al., 2024b), textual (Fan et al., 2024; Li et al., 2024; Zhang et al., 2024a), or bimodal prompts (Jia et al., 2025; Luo et al., 2024; Yang et al., 2024; Zhou et al., 2024) in an adversarial manner. However, these methods require retraining for each task. In this study, we explore the potential of adversarially pretrained transformers for robust task adaptation via in-context learning, thereby eliminating the task-specific retraining and associated computational overhead.

Adversarial Meta-Learning. Adversarial meta-learning seeks to develop a universally robust meta-learner that can swiftly and reliably adapt to new tasks under adversarial conditions. Existing approaches adversarially train a neural network on multiple tasks, and then finetune it on a target task using clean (Goldblum et al., 2020; Hou et al., 2021; Liu et al., 2021; Wang et al., 2021; Yin et al., 2018) or adversarial samples (Yin et al., 2018). In this study, we similarly aim to train such a meta-learner. However, rather than relying on neural networks and finetuning, we employ a transformer as the meta-learner and leverage its in-context learning ability for task adaptation.
Related but Distinct Work. Here we review theoretical work on the adversarial robustness of in-context learning. Assuming token-bounded adversarial examples, prior studies have shown that even a single token modification in the context can significantly alter the output of a standardly trained model on a clean query (Anwar et al., 2024), and that deeper layers can mitigate this (Li et al., 2025). Assuming norm- and token-bounded examples, Fu et al. have shown that adversarial training with short adversarial contexts can provide robustness against longer ones (Fu et al., 2025). They considered a clean query and adversarial tokens appended to the original context. In this study, we explore how adversarially trained models handle norm-bounded perturbations to a query in a clean context. As a result, we reveal their universal robustness, which generalizes to a new task from a few demonstrations.

B CLARIFICATION ON SINGLE-TASK PRETRAINING AND TASK-SPECIFIC ADVERSARIAL TRAINING

Single-Task Adversarial Training and In-Context Learning. We should clarify that a model trained on a single task, unlike one pretrained on multiple tasks (i.e., our main setting), lacks in-context learning capability. Namely, an approach that combines adversarial training (not adversarial pretraining) and in-context learning is not feasible. Specifically, the parameters of a model trained on a single task differ significantly from those trained on multiple tasks (shown in Theorem 3.4). Such models cannot provide correct answers for new tasks via in-context learning even in standard settings (without adversaries). The interesting property of transformers is that when trained on multiple tasks rather than a single one, as with large language models, they develop distinctly different parameters that enable in-context learning capability.
Performance Comparison with Task-Specific Adversarial Training. As the no-free-lunch theorem indicates, task-specific approaches achieve higher performance than the approach that combines adversarial pretraining and in-context learning. Specifically, the following holds in terms of robust accuracy: (1) task-specific adversarially trained models ≥ (2) adversarially pretrained models with in-context learning ≫ (3) task-specific standardly trained models ≈ (4) standardly pretrained models with in-context learning ≈ 0%. More precisely, Models (2) have the limitation on robustness predicted in Theorems 3.4 and 3.6, whereas Models (1) do not, provided the training and test distributions match. However, we emphasize that Models (1) require users to perform adversarial training for each individual task and cannot generalize to test distributions other than the ones they were trained on. This makes them unsuitable for our research focus on universally robust foundation models that can generalize across a wide range of tasks without task-specific adversarial training.

C ADDITIONAL THEORETICAL SUPPORT AND INSIGHTS

C.1 LINEAR COMBINATION OF ORTHONORMAL BASES CAN BE TRANSFORMED INTO OUR TEST DISTRIBUTION

Our test data distribution, Assumption 3.2, can implicitly represent data distributions comprising robust and non-robust directions forming an orthonormal basis. Consider $d$ orthonormal bases $\{e_i\}_{i=1}^d$. We set $d_{\mathrm{irr}} = 0$, namely $d = d_{\mathrm{rob}} + d_{\mathrm{vul}}$. Each data point is represented as $x = c_1 e_1 + c_2 e_2 + \cdots + c_d e_d$, where the coefficients $c_i$ are sampled probabilistically. These coefficients satisfy $\mathbb{E}[y c_i] = C_i \alpha$ for $i \in S_{\mathrm{rob}}$ and $C_i \beta$ for $i \in S_{\mathrm{vul}}$. In addition, $|\mathbb{E}[(y c_i - \mathbb{E}[y c_i])^n]| \le C_{i,n} \alpha^n$ for $i \in S_{\mathrm{rob}}$ and $C_{i,n} \beta^n$ for $i \in S_{\mathrm{vul}}$. Given a dataset of $N$ i.i.d. samples $\{(x_n, y_n)\}_{n=1}^N$, if $c_{n,i}$ is independent of $c_{n,j}$ for $i \neq j$ conditional on $y$, and $N$ is sufficiently large, then the covariance of $yx$ can be approximated as:

$$\frac{1}{N}\sum_{n=1}^{N}\left(y_n x_n - \frac{1}{N}\sum_{k=1}^{N} y_k x_k\right)\left(y_n x_n - \frac{1}{N}\sum_{k=1}^{N} y_k x_k\right)^{\top} \approx \mathbb{E}\left[(yx - \mathbb{E}[yx])(yx - \mathbb{E}[yx])^{\top}\right] \quad (A12)$$
$$= \mathbb{E}\left[\left(\sum_{i=1}^{d}(y c_i - \mathbb{E}[y c_i])\,e_i\right)\left(\sum_{i=1}^{d}(y c_i - \mathbb{E}[y c_i])\,e_i\right)^{\top}\right] \quad (A13)$$
$$= \sum_{i,j=1}^{d}\mathbb{E}\left[(y c_i - \mathbb{E}[y c_i])(y c_j - \mathbb{E}[y c_j])\right] e_i e_j^{\top} \quad (A14)$$
$$= \sum_{i\in S_{\mathrm{rob}}} C_{i,2}\,\alpha^2\, e_i e_i^{\top} + \sum_{i\in S_{\mathrm{vul}}} C_{i,2}\,\beta^2\, e_i e_i^{\top}. \quad (A15)$$

This implies that through principal component analysis of $y_n x_n$, we can obtain the $d$ orthonormal bases $\{e_i\}_{i=1}^d$. By projecting a sample $x_n$ onto these bases, we obtain a transformed sample $x'_n := (c_{n,1}, c_{n,2}, \ldots, c_{n,d})$. This demonstrates that when data is sampled from a distribution comprising robust and non-robust directions forming an orthonormal basis, if the coefficients are mutually independent and the sample size is sufficiently large, we can preprocess the data to satisfy Assumption 3.2. Importantly, this preprocessing relies solely on statistics derivable from training samples.

C.2 SUFFICIENT NUMBER OF DATASETS TO PROVIDE UNIVERSAL ROBUSTNESS

What determines the sufficient number of datasets needed to provide universal robustness to transformers? We conjecture that this may be determined by the number of robust bases. In this paper, we trained transformers using $d$ datasets. This stems from training with datasets in which only one dimension is robust (in other words, datasets with a single robust basis), the number of dimensions $d$, and the assumption that all dimensions might contain robust features. If we assume that robust features never appear in the latter $d'$ dimensions, then, following the procedure in Appendix E, we can train robust transformers using only the $d - d'$ datasets that describe the first $d - d'$ robust features. From this observation, we conjecture that the sufficient number of datasets required to provide universal robustness to transformers depends on the number of robust bases in the assumed data structure.

C.3 EFFECTS OF $q_{\mathrm{rob}}$ AND $q_{\mathrm{vul}}$

We here analyze how $q_{\mathrm{rob}}$ and $q_{\mathrm{vul}}$ affect the robustness of adversarially trained transformers.
As defined in Assumption 3.2, these parameters control the proportion of features whose total covariance with other features is negative. Theorem F.1 suggests that the transformer prediction for unperturbed data can be expressed as

$$C(d_{\mathrm{rob}}\alpha + d_{\mathrm{vul}}\beta)\left[(1 - c\,q_{\mathrm{rob}})\, d_{\mathrm{rob}}\alpha^2 + (1 - c\,q_{\mathrm{vul}})\, d_{\mathrm{vul}}\beta^2\right] + C'(d_{\mathrm{rob}}\alpha^2 + d_{\mathrm{vul}}\beta^2), \quad (A16)$$

where

$$c := \frac{\left(\max_{i\in S_{\mathrm{rob}}\cup S_{\mathrm{vul}}} C_i\right)\left(\max_{i\in S_{\mathrm{rob}}\cup S_{\mathrm{vul}}} C_{i,2}\right)}{\min_{i\in S_{\mathrm{rob}}\cup S_{\mathrm{vul}}} C_i^3}. \quad (A17)$$

Examining the term $(1 - c\,q_{\mathrm{rob}})\, d_{\mathrm{rob}}\alpha^2 + (1 - c\,q_{\mathrm{vul}})\, d_{\mathrm{vul}}\beta^2$, we observe that larger values of $q_{\mathrm{rob}}$ and $q_{\mathrm{vul}}$ generally diminish the magnitude of transformer predictions. This indicates that negative correlations between features degrade the robustness of adversarially trained transformers. Additionally, the coefficient $c$ is characterized by $\max_{i\in S_{\mathrm{rob}}\cup S_{\mathrm{vul}}} C_{i,2}$, which represents a variance coefficient. This suggests that smaller feature variances enhance the robustness of adversarially trained transformers. For example, if each feature variance $C_{i,2}$ is sufficiently small, even $q_{\mathrm{rob}} = 1$ and $q_{\mathrm{vul}} = 1$ may be tolerated without significantly compromising robustness.

C.4 DISADVANTAGE OF STANDARD FINETUNING: PARAMETER SELECTION PERSPECTIVE

In this study, we investigate task adaptation through in-context learning. As an alternative lightweight approach, standard finetuning, where all or part of the model parameters are updated, can also be employed. However, a key drawback of standard finetuning is that it requires parameter updates, whereas in-context learning does not. Moreover, finetuning necessitates careful selection of which parameters to update. Our analysis shows that improper parameter selection during finetuning can compromise the robustness initially established by adversarial pretraining. Consider the adversarially pretrained parameters $P_{\mathrm{adv}}$ and $Q_{\mathrm{adv}}$, and let $\mathcal{D}^{\mathrm{tr}}_c$ be a downstream data distribution. First, we examine the scenario where only $P$ is updated while keeping $Q_{\mathrm{adv}}$ fixed, formulated as:

$$\min_{P\in[0,1]^{(d+1)\times(d+1)}} \mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N+1}\,\overset{\mathrm{i.i.d.}}{\sim}\,\mathcal{D}^{\mathrm{tr}}_c}\left[-y_{N+1}\,[f(Z_0;P,Q_{\mathrm{adv}})]_{d+1,N+1}\right]. \quad (A18)$$

In this case, as shown in the proof in Appendix E, $P = P_{\mathrm{std}}\,(= P_{\mathrm{adv}})$ is the global solution. Consequently, as demonstrated in Theorem 3.6, the model's robustness is preserved. Conversely, consider training $Q$ while keeping $P_{\mathrm{adv}}$ fixed, formulated as:

$$\min_{Q\in[0,1]^{(d+1)\times(d+1)}} \mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N+1}\,\overset{\mathrm{i.i.d.}}{\sim}\,\mathcal{D}^{\mathrm{tr}}_c}\left[-y_{N+1}\,[f(Z_0;P_{\mathrm{adv}},Q)]_{d+1,N+1}\right]. \quad (A19)$$

In this scenario, $Q = Q_{\mathrm{std}}$ is the global solution. As established in Theorems 3.5, 3.7 and H.1, while this configuration enables the transformer to perform well on unperturbed queries, it fails to maintain robustness against perturbed inputs. These findings highlight a critical insight: achieving robust task adaptation through standard finetuning requires careful parameter selection; otherwise, the pretrained model's adversarial robustness may be compromised. This parameter sensitivity represents a disadvantage compared to in-context learning, which preserves robustness without requiring parameter updates.

C.5 NAIVE ADVERSARIAL CONTEXT MAY NOT IMPROVE ROBUSTNESS

One approach to enhancing the robustness of a standardly trained transformer is to incorporate adversarial examples into the context. In this section, we show that this is not the case in our setting. Consider the following transformer input:

$$Z' := \begin{bmatrix} x_1+\Delta_1 & x_2+\Delta_2 & \cdots & x_N+\Delta_N & x_{N+1}+\Delta_{N+1} \\ y_1 & y_2 & \cdots & y_N & 0 \end{bmatrix}. \quad (A20)$$

The adversarial perturbations for the context, $\Delta_1,\ldots,\Delta_N$, are defined as $\Delta_n := -\varepsilon y_n 1_d$. In this setting, for $\varepsilon \ge \frac{1+(d-1)(\lambda/2)}{d}$, the standard transformer prediction satisfies:

$$\mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N+1}\,\overset{\mathrm{i.i.d.}}{\sim}\,\mathcal{D}^{\mathrm{tr}}_c}\left[\min_{\|\Delta_{N+1}\|_\infty\le\varepsilon} y_{N+1}\,[f(Z';P_{\mathrm{std}},Q_{\mathrm{std}})]_{d+1,N+1}\right] \le 0. \quad (A21)$$

This result suggests that, in our setting, naive adversarial demonstrations do not improve the performance of the standard transformer.
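For concreteness, the single-layer linear transformer prediction $[Z_0 + \frac{1}{N} P Z_0 M Z_0^\top Q Z_0]_{d+1,N+1}$ appearing in (A18) and (A19) can be sketched directly. In the sketch below, $M$ is assumed to be the mask that zeroes the query column, and $P_{\mathrm{std}}, Q_{\mathrm{std}}$ follow Theorem 3.4; the toy context puts the label entirely in feature 1:

```python
import numpy as np

d, N = 3, 4
ys = np.array([1.0, -1.0, 1.0, -1.0])      # demonstration labels
X = np.zeros((d, N)); X[0] = ys            # x_n = y_n * e_1: feature 1 carries the label
xq = np.zeros(d); xq[0] = 1.0              # query with true label +1

# Z0 is (d+1) x (N+1): feature rows, then a label row; the query label slot is 0.
Z0 = np.zeros((d + 1, N + 1))
Z0[:d, :N] = X
Z0[d, :N] = ys
Z0[:d, N] = xq

M = np.eye(N + 1); M[N, N] = 0.0           # assumed mask zeroing the query column

P = np.zeros((d + 1, d + 1)); P[d, :] = 1.0    # P_std: all-ones last row
Q = np.zeros((d + 1, d + 1)); Q[:, :d] = 1.0   # Q_std: ones in the first d columns

pred = (Z0 + P @ Z0 @ M @ Z0.T @ Q @ Z0 / N)[d, N]
print(pred)  # 4.0: positive, agreeing with the query label +1
```

The prediction sums label-weighted feature correlations from the context against the query, so it is positive whenever the context features agree with the query's label.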
Intuitively, because adversarial training generates new adversarial examples at each step of gradient descent, fixed adversarial demonstrations may fail to counter newly generated adversarial perturbations to the query.

D ADDITIONAL EXPERIMENTAL RESULTS

All experiments were conducted on Ubuntu 20.04.6 LTS, Intel Xeon Gold 6226R CPUs, and NVIDIA RTX 6000 Ada GPUs.

D.1 SUPPORT FOR ASSUMPTION 3.2

The statistics of the preprocessed MNIST, Fashion-MNIST, and CIFAR-10 datasets are provided in Fig. A2. Preprocessing was conducted as follows: (i) selection of two different classes from the ten available classes and assignment of binary labels to every sample from the training dataset, creating $\{(x_n, y_n)\}_{n=1}^N$; (ii) centering the data via $x' \leftarrow x - \bar{x}$ with $\bar{x} := (1/N)\sum_{n=1}^N x_n$; and (iii) aligning features with the label using $x' \leftarrow \mathrm{sgn}(\sum_{n=1}^N y_n x_n) \odot x'$. These preprocessed datasets show that each dimension has a positive correlation with the label and that few dimensions have negative total covariance. The main distinction from Assumption 3.2 is that their features are not clearly separated into robust and non-robust ones. Instead, they gradually transition from robust to non-robust characteristics.

D.2 VERIFICATION OF THEOREM 3.4

We trained a single-layer transformer (2) with the in-context loss (7). The training distribution was configured with $d = 20$ and $\lambda = 0.1$ in Fig. 1 and with $d = 100$ and $\lambda = 0.1$ in Fig. A3. For the standard, adversarial, and strong adversarial regimes, we used $\varepsilon = 0$, $\frac{1+(d-1)(\lambda/2)}{d} = 0.098$, and $\frac{\lambda}{2} + \frac{3}{2}\cdot\frac{2-\lambda}{(d-1)\lambda^2+3} = 0.95$ in Fig. 1, and $\varepsilon = 0$, $\frac{1+(d-1)(\lambda/2)}{d} = 0.06$, and $\frac{\lambda}{2} + \frac{3}{2}\cdot\frac{2-\lambda}{(d-1)\lambda^2+3} = 0.77$ in Fig. A3. Optimization was conducted using stochastic gradient descent with momentum 0.9. Learning rates were set to 0.1 for all regimes in Fig. 1, and to 1.0 for the standard and strong adversarial regimes and 0.2 for the adversarial regime in Fig. A3.
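The three preprocessing steps in Appendix D.1 can be sketched as follows (synthetic data standing in for two image classes; shapes and constants hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 50                                   # hypothetical sample/feature counts
y = rng.choice([-1.0, 1.0], size=N)              # (i) binary labels for two classes
x = y[:, None] * 0.3 + rng.normal(size=(N, d))   # toy class-dependent features

x = x - x.mean(axis=0)                           # (ii) centering
x = np.sign((y[:, None] * x).sum(axis=0)) * x    # (iii) sign-align features with label

# Every dimension now correlates non-negatively with the label, as in Assumption 3.2.
corr = (y[:, None] * x).mean(axis=0)
print(bool((corr >= 0).all()))  # True
```

Step (iii) guarantees the non-negative correlation by construction, since each column is flipped to make $\sum_n y_n x_{n,i}$ non-negative.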
Training ran for 100 epochs with a learning rate scheduler that multiplied the rate by 0.1 when the loss did not improve within 10 epochs. In each iteration of stochastic gradient descent, we sampled 1,000 datasets $\{(x^{(c)}_n, y^{(c)}_n)\}_{n=1}^{N+1}$ with $N = 1{,}000$. The distribution index $c$ was randomly sampled from $U([d])$, meaning that in each iteration, each of the 1,000 datasets may have a different value of $c$. After each parameter update, we projected the parameters onto $[0,1]^{(d+1)\times(d+1)}$. The adversarial perturbation was calculated as $\Delta := -\varepsilon y_n\, \mathrm{sgn}(P_{d+1,\cdot}\, Z_0 M Z_0^\top\, Q_{\cdot,:d})$, which represents the optimal attack. The heatmaps of the learned parameters in Figs. 1 and A3 completely align with the theoretical predictions of Theorem 3.4.

D.3 VERIFICATION OF THEOREMS 3.5 TO 3.7 AND H.1

We evaluated standardly and adversarially pretrained single-layer transformers on $\mathcal{D}^{\mathrm{tr}}$, $\mathcal{D}^{\mathrm{te}}$, and the preprocessed MNIST, Fashion-MNIST, and CIFAR-10 datasets. For the network parameters, we used the theoretically predicted $P_{\mathrm{std}}$ and $Q_{\mathrm{std}}$ as the standard model parameters and $P_{\mathrm{adv}}$ and $Q_{\mathrm{adv}}$ as the adversarially trained model parameters. This approach allowed us to circumvent the computationally expensive adversarial pretraining for every distinct $d$ setting. As described previously, our empirical results completely align with the theoretically predicted parameter configurations.

Configuration in Figs. A4 and A5. In Fig. A4, the basic settings were $d = 100$, $\lambda = 0.1$, $N = 1{,}000$, and $\varepsilon = 0.15$. In Fig. A5, they were $d_{\mathrm{rob}} = 10$, $d_{\mathrm{vul}} = 90$, $d_{\mathrm{irr}} = 0$, $\alpha = 1.0$, $\beta = 0.1$, $\gamma = 0.1$, and $\varepsilon = 0.2$. The basic perturbation budget was set to 0.1. We considered 1,000 batches, where each batch contained 1,000 in-context demonstrations (i.e., $N = 1{,}000$) and 1,000 queries. The test distribution $\mathcal{D}^{\mathrm{te}}$ was constructed based on the normal distribution. During sampling, $y x_i$ was sampled from $\mathcal{N}(\alpha, \alpha^2)$ for $i \in S_{\mathrm{rob}}$, $\mathcal{N}(\beta, \beta^2)$ for $i \in S_{\mathrm{vul}}$, and $\mathcal{N}(0, \gamma^2)$ for $i \in S_{\mathrm{irr}}$. Each dimension is independent, given $y$.

Configuration in Fig. A6. The preprocessing procedure is described in Appendix D.1. As batches, we considered the 45 binary class pairs from the ten classes. The basic perturbation budget was set to 0.1. In the first row of Fig. A6, we used all training samples in the training dataset. As queries, we used all test samples in the test dataset.

Analysis. In Figs. A4 to A6, standard transformers consistently demonstrate vulnerability to adversarial attacks, whereas adversarially trained transformers maintain a certain level of robustness, validating Theorems 3.5 and 3.6. However, adversarially pretrained transformers exhibit lower clean accuracy, supporting Theorem 3.7. In Figs. A4 and A5, we observe that a larger number of vulnerable dimensions increases model vulnerability. Conversely, Fig. A5 shows that a larger number of robust dimensions enhances model robustness. Robust models are less susceptible to increasing vulnerable dimensions and benefit more from increasing robust dimensions. Additionally, as predicted in Theorems 3.5 and 3.6, standard training exhibits vulnerability to increasing redundant dimensions, which is more detrimental than the harmful effect of increasing vulnerable dimensions, since redundant dimensions do not benefit predictions and are only harmful for robustness. In contrast, adversarially trained transformers exhibit significant resistance to increases in these dimensions. The second row of Fig. A6 indicates that standard transformers still achieve high classification accuracy in small-demonstration regimes, whereas adversarially trained transformers show degraded performance. These results align with our theoretical predictions, Theorem H.1.

E PROOF OF LEMMA 3.3 AND THEOREM 3.4 (PRETRAINING)

Lemma 3.3 (Transformation of original optimization problem). The minimization problem (7) can be transformed into the maximization problem $\max_{b\in\{0,1\}^{d+1}} \sum_{i=1}^{d(d+1)} \max(0, \sum_{j=1}^{d+1} b_j h_{i,j})$, where $h_{i,j}\in\mathbb{R}$ is a constant depending on $(i,j)$, and there exists a mapping from $b$ to $P$ and $Q$.
Proof. See "Overview" below for detail. The proof sketch is as follows. Let us simplify the transformer definition (2) as $f(x;\theta_1,\theta_2) := \theta_1 x^2 \theta_2 (x+\Delta)$, where $\theta_1,\theta_2 \in \{0,1\}$. Then the optimization problem (7) becomes

$$\min_{\theta_1,\theta_2\in\{0,1\}}\max_{\Delta} -y f(x+\Delta;\theta_1,\theta_2) = \min_{\theta_1,\theta_2\in\{0,1\}}\max_{\Delta} -y\big(\theta_1 x^2 \theta_2 (x+\Delta)\big). \quad (A22)$$

This can be transformed to:

$$\min_{\theta_1,\theta_2\in\{0,1\}}\max_{\Delta} -y\big(\theta_1 x^2 \theta_2 (x+\Delta)\big) = \max_{\theta_1,\theta_2\in\{0,1\}}\min_{\Delta} \theta_1 \big(y x^2 \theta_2 (x+\Delta)\big). \quad (A23)$$

Figure A2: Statistical properties of the preprocessed MNIST, Fashion-MNIST, and CIFAR-10 datasets. First row: blue lines represent the mean of $(1/N)\sum_{n=1}^N y_n x_n$ (plotted against the sorted dimension $i$) across the 45 binary class pairs, and shaded regions represent the sample standard deviation. Orange lines represent the typical perturbation magnitude ($\mathbb{E}[y x_i] = 0.1$). Green dashed lines represent the (pseudo) threshold between robust and non-robust dimensions. Second row: blue lines represent the total covariance of each dimension with the other dimensions, $\sum_j^d \mathrm{Cov}(x_i, x_j)$, and shaded regions represent the sample standard deviation across the 45 binary class pairs. Green dashed lines represent the boundary between positive and negative total covariance.

Figure A3: Parameter heatmaps induced by adversarial training (7) with $d = 100$ and $\lambda = 0.1$. For the standard, adversarial, and strong adversarial regimes, we used $\varepsilon = 0$, $\frac{1+(d-1)(\lambda/2)}{d} = 0.06$, and $\frac{\lambda}{2} + \frac{3}{2}\cdot\frac{2-\lambda}{(d-1)\lambda^2+3} = 0.77$, respectively. We optimized (7) by stochastic gradient descent.

Figure A4: Accuracy (%) of standardly and adversarially pretrained single-layer transformers. Lines represent mean accuracy across batches and shaded regions represent the unbiased standard deviation (notably small in magnitude). We used 1,000 batches, each containing 1,000 in-context demonstrations ($N = 1{,}000$) and 1,000 query examples. Base configuration parameters were $d = 100$, $\lambda = 0.1$, and $\varepsilon = 0.15$.

Figure A5: Accuracy (%) of standardly and adversarially pretrained single-layer transformers. Lines represent mean accuracy across batches and shaded regions represent the unbiased standard deviation. We used 1,000 batches, each containing 1,000 in-context demonstrations ($N = 1{,}000$) and 1,000 query examples. Base configuration parameters were $d_{\mathrm{rob}} = 10$, $d_{\mathrm{vul}} = 90$, $d_{\mathrm{irr}} = 0$, $\alpha = 1.0$, $\beta = 0.1$, $\gamma = 0.1$, and $\varepsilon = 0.2$.

Figure A6: Accuracy (%) of standardly and adversarially pretrained single-layer transformers. Lines represent mean accuracy across the 45 binary classification tasks (derived from all possible pairs of the ten classes) and shaded regions represent the unbiased standard deviation. The perturbation size was basically $\varepsilon = 0.1$.

Since $\theta_1$ takes only 0 or 1, the optimal strategy is always $\theta_1 = 1$ when $y x^2 \theta_2 (x+\Delta)$ is positive and $\theta_1 = 0$ when negative.
This transforms the problem to:

$$\max_{\theta_1,\theta_2\in\{0,1\}}\min_{\Delta} \theta_1 \big(y x^2 \theta_2 (x+\Delta)\big) = \max_{\theta_2\in\{0,1\}}\max\left(0,\; \min_{\Delta}\, y x^2 \theta_2 (x+\Delta)\right). \quad (A24)$$

Denoting all terms except $\theta_2$ as $h$:

$$\max_{\theta_2\in\{0,1\}}\max\left(0,\; \min_{\Delta}\, y x^2 \theta_2 (x+\Delta)\right) = \max_{\theta_2\in\{0,1\}}\max(0, \theta_2 h). \quad (A25)$$

This provides intuition for the problem $\max_{b\in\{0,1\}^{d+1}} \sum_{i=1}^{d(d+1)} \max(0, \sum_{j=1}^{d+1} b_j h_{i,j})$ in Lemma 3.3.

Theorem 3.4 (Parameters learned via adversarial pretraining). The global minimizers of (7) are:

1. Standard ($\varepsilon = 0$): $P = P_{\mathrm{std}} := \begin{bmatrix} 0_{d,d+1} \\ 1_{d+1}^\top \end{bmatrix}$ and $Q = Q_{\mathrm{std}} := [\,1_{d+1,d} \;\; 0_{d+1}\,]$.

2. Adversarial ($\varepsilon = \frac{1+(d-1)(\lambda/2)}{d}$): $P = P_{\mathrm{adv}} := \begin{bmatrix} 0_{d,d+1} \\ 1_{d+1}^\top \end{bmatrix}$ and $Q = Q_{\mathrm{adv}} := \begin{bmatrix} I_d & 0_d \\ 0_d^\top & 0 \end{bmatrix}$.

3. Strongly adversarial ($\varepsilon \ge \frac{\lambda}{2} + \frac{3}{2}\cdot\frac{2-\lambda}{(d-1)\lambda^2+3}$): $P = 0_{d+1,d+1}$ and $Q = 0_{d+1,d+1}$.

Proof. This is a special case of the following theorem.

Theorem E.1 (General case of Theorem 3.4). The global minimizer of (7) is as follows:

• If $0 \le \varepsilon \le \frac{\lambda(\lambda(d-2)+4)}{2(\lambda(d-1)+2)}$, (A26) then $P = \begin{bmatrix} 0_{d,d+1} \\ 1_{d+1}^\top \end{bmatrix}$ and $Q = [\,1_{d+1,d} \;\; 0_{d+1}\,]$.

• If $\varepsilon = \frac{1+(d-1)(\lambda/2)}{d}$, (A27) then $P = \begin{bmatrix} 0_{d,d+1} \\ 1_{d+1}^\top \end{bmatrix}$ and $Q = \begin{bmatrix} I_d & 0_d \\ 0_d^\top & 0 \end{bmatrix}$.

• If $\varepsilon \ge \frac{\lambda}{2} + \frac{3}{2}\cdot\frac{2-\lambda}{(d-1)\lambda^2+3}$, (A28) then $P = 0_{d+1,d+1}$ and $Q = 0_{d+1,d+1}$.

Proof. Overview. The loss function $L(P,Q)$ is determined only by the last row of $P$ and the first $d$ columns of $Q$. Let

$$P := \begin{bmatrix} 0_{d,d+1} \\ b^\top \end{bmatrix}, \qquad Q := [\,A \;\; 0_{d+1}\,], \quad (A29)$$

where $b \in \mathbb{R}^{d+1}$ and $A := [a_1 \cdots a_d] \in \mathbb{R}^{(d+1)\times d}$. With $b$, $A$, and $G := Z_\Delta M Z_\Delta^\top / N$, the loss function $L(P,Q)$ can be represented as:

$$L(P,Q) := \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}\left[\max_{\|\Delta\|_\infty\le\varepsilon} -y_{N+1}\,[f(Z_\Delta;P,Q)]_{d+1,N+1}\right] \quad (A30)$$
$$= \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}\left[\max_{\|\Delta\|_\infty\le\varepsilon} -y_{N+1}\left[Z_\Delta + \frac{1}{N} P Z_\Delta M Z_\Delta^\top Q Z_\Delta\right]_{d+1,N+1}\right] \quad (A31)$$
$$= \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}\left[\max_{\|\Delta\|_\infty\le\varepsilon} -y_{N+1}\, b^\top G A (x_{N+1}+\Delta)\right]. \quad (A32)$$

Using $b$ and $A$, we redefine the loss function as $L(b,A) := L(P,Q)$. Since $G$ does not include $\Delta$ and $\max_{\|\Delta\|_\infty\le\varepsilon} w^\top \Delta = \varepsilon\|w\|_1$ for $w\in\mathbb{R}^d$, the inner maximization can be solved as:

$$L(b,A) = \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}\left[-y_{N+1}\, b^\top G A x_{N+1} + \varepsilon\|b^\top G A\|_1\right]. \quad (A33)$$

When $0 \le b \le 1$ and $0 \le A \le 1$, then $\|b^\top G A\|_1 = b^\top G A 1$ since all the elements of $G$ are nonnegative.
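As an aside, the inner-maximization identity used in (A33), $\max_{\|\Delta\|_\infty\le\varepsilon} w^\top\Delta = \varepsilon\|w\|_1$ with maximizer $\Delta = \varepsilon\,\mathrm{sgn}(w)$, is straightforward to check numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
w, eps = rng.normal(size=6), 0.3
# The maximizer of w^T delta over the l_inf ball of radius eps is
# delta = eps * sign(w): each coordinate pushes in the direction of w.
delta = eps * np.sign(w)
print(bool(np.isclose(w @ delta, eps * np.abs(w).sum())))  # True
```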
Thus,

$$\min_{0\le b\le 1,\, 0\le A\le 1} L(b,A) = \min_{0\le b\le 1,\, 0\le A\le 1} \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}\left[-y_{N+1}\, b^\top G A x_{N+1} + \varepsilon\, b^\top G A 1\right]. \quad (A34)$$

Let the $i$-th row of $G$ be $g_i^\top$. Rearranging the argument of the expectation:

$$-y_{N+1}\, b^\top G A x_{N+1} + \varepsilon\, b^\top G A 1 = -\sum_{j=1}^{d+1}\sum_{k=1}^{d} A_{j,k}\left(\sum_{i=1}^{d+1} b_i\, g_{i,j}\,(y_{N+1} x_{N+1,k}-\varepsilon)\right). \quad (A35)$$

Thus, the objective function can be represented as:

$$\max_{0\le b\le1,\, 0\le A\le1} \sum_{j=1}^{d+1}\sum_{k=1}^d A_{j,k}\left(\sum_{i=1}^{d+1} b_i\, \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}[g_{i,j}(y_{N+1} x_{N+1,k}-\varepsilon)]\right). \quad (A36)$$

Since the objective function is linear with respect to $b$ and $A$, respectively, the optimal solution exists on the boundary:

$$\max_{b\in\{0,1\}^{d+1},\, A\in\{0,1\}^{(d+1)\times d}} \sum_{j=1}^{d+1}\sum_{k=1}^d A_{j,k}\left(\sum_{i=1}^{d+1} b_i\, \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}[g_{i,j}(y_{N+1} x_{N+1,k}-\varepsilon)]\right). \quad (A37)$$

This is maximized by $A_{j,k} = 1$ if $\sum_{i=1}^{d+1} b_i\, \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}[g_{i,j}(y_{N+1} x_{N+1,k}-\varepsilon)] \ge 0$ and 0 otherwise. Now,

$$\max_{b\in\{0,1\}^{d+1}} \sum_{j=1}^{d+1}\sum_{k=1}^{d}\varphi\left(\sum_{i=1}^{d+1} b_i\, \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}[g_{i,j}(y_{N+1} x_{N+1,k}-\varepsilon)]\right), \quad (A38)$$

where $\varphi(x) := \max(0,x)$. Calculating the expectation and optimizing $b$, we obtain the solution.

Calculation of the expectation. First, we consider the expectation given $c$. Since $y_n x_{n,i} = 1$ if $i = c$ and $y_n x_{n,i} \sim U(0,\lambda)$ otherwise, the expectation of $y_n x_n$ can be calculated as:

$$\mathbb{E}[y_n x_{n,i}\mid c] = \begin{cases} 1 & (i=c) \\ \lambda/2 & (i\neq c)\end{cases}, \qquad \mathbb{E}[y_n x_n^\top \mid c] = \left[\tfrac{\lambda}{2}\;\cdots\;\tfrac{\lambda}{2}\;\;1\;\;\tfrac{\lambda}{2}\;\cdots\;\tfrac{\lambda}{2}\right], \quad (A39)$$

where the 1 sits in the $c$-th position. The expectation of $G$ can be calculated as:

$$\mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N}}[G\mid c] = \frac{1}{N}\,\mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N}}[Z_\Delta M Z_\Delta^\top \mid c] \quad (A40)$$
$$= \frac{1}{N}\begin{bmatrix} \sum_{n=1}^N \mathbb{E}_{x_n}[x_n x_n^\top\mid c] & \sum_{n=1}^N \mathbb{E}_{x_n,y_n}[y_n x_n\mid c] \\ \sum_{n=1}^N \mathbb{E}_{x_n,y_n}[y_n x_n^\top\mid c] & N \end{bmatrix} \quad (A41)$$
$$= \begin{bmatrix} \mathbb{E}_{x_n}[x_n x_n^\top\mid c] & \mathbb{E}_{x_n,y_n}[y_n x_n\mid c] \\ \mathbb{E}_{x_n,y_n}[y_n x_n^\top\mid c] & 1 \end{bmatrix}. \quad (A42)$$

For $y_n = 1$ and $i, j \neq c$, $\mathbb{E}[x_{n,i}^2 \mid c] = \int_0^\lambda x^2/\lambda\, dx = \lambda^2/3$ and $\mathbb{E}[x_{n,i} x_{n,j} \mid c] = \mathbb{E}[x_{n,i}\mid c]\,\mathbb{E}[x_{n,j}\mid c] = \lambda^2/4$.
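The uniform moments used here, $\mathbb{E}[x^2] = \lambda^2/3$ and $\mathbb{E}[x_i x_j] = \lambda^2/4$ for independent $x_i, x_j \sim U(0,\lambda)$, can be confirmed by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 0.1, 2_000_000
x = rng.uniform(0, lam, size=(n, 2))
# E[x^2] = lam^2/3 and E[x_i x_j] = (lam/2)^2 = lam^2/4 for independent U(0, lam).
ok_sq = np.isclose(np.mean(x[:, 0] ** 2), lam**2 / 3, rtol=1e-2)
ok_prod = np.isclose(np.mean(x[:, 0] * x[:, 1]), lam**2 / 4, rtol=1e-2)
print(bool(ok_sq), bool(ok_prod))
```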
Thus,

$$\mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N}}[g_{i,j}\mid c] = \begin{cases} 1 & (i=c)\wedge(j\in\{i,\,d+1\}) \\ \lambda/2 & (i=c)\wedge(j\neq i,\,d+1) \\ \lambda^2/3 & (i\in[d],\, i\neq c)\wedge(j=i) \\ \lambda/2 & (i\in[d],\, i\neq c)\wedge(j\in\{c,\,d+1\}) \\ \lambda^2/4 & (i\in[d],\, i\neq c)\wedge(j\neq i,c,d+1) \\ 1 & (i=d+1)\wedge(j\in\{c,\,d+1\}) \\ \lambda/2 & (i=d+1)\wedge(j\neq c,\,d+1) \end{cases} \quad (A43)$$

Note that $\mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N}}[G\mid c]$ is the $(d+1)\times(d+1)$ matrix with $\lambda^2/3$ on the diagonal of the first $d$ rows, $\lambda^2/4$ on their remaining off-diagonal entries, and $\lambda/2$ in the $c$-th row, the $c$-th column, the last row, and the last column, except that the four entries at the intersections of the $c$-th and $(d+1)$-th rows and columns are 1:

$$\mathbb{E}[G\mid c] = \begin{bmatrix}
\lambda^2/3 & \lambda^2/4 & \cdots & \lambda/2 & \cdots & \lambda^2/4 & \lambda/2 \\
\lambda^2/4 & \lambda^2/3 & \cdots & \lambda/2 & \cdots & \lambda^2/4 & \lambda/2 \\
\vdots & & \ddots & \vdots & & \vdots & \vdots \\
\lambda/2 & \lambda/2 & \cdots & 1 & \cdots & \lambda/2 & 1 \\
\vdots & & & \vdots & \ddots & \vdots & \vdots \\
\lambda^2/4 & \lambda^2/4 & \cdots & \lambda/2 & \cdots & \lambda^2/3 & \lambda/2 \\
\lambda/2 & \lambda/2 & \cdots & 1 & \cdots & \lambda/2 & 1
\end{bmatrix}, \quad (A44)$$

where the row and column containing the 1s are the $c$-th. Let

$$h_i(j;k;c) := \mathbb{E}_{\{(x_n,y_n)\}_{n=1}^{N+1}}[g_{i,j}\,(y_{N+1} x_{N+1,k}-\varepsilon)\mid c]. \quad (A45)$$

Let $\varepsilon_+ := 1-\varepsilon$ and $\varepsilon_- := \lambda/2-\varepsilon$. By Eqs. (A39) and (A43),

$$h_i(j;k;c) = \begin{cases}
\varepsilon_+ & (i\in[d])\wedge(j\in\{i,d+1\})\wedge(k=i)\wedge(c=i) \\
\varepsilon_- & (i\in[d])\wedge(j\in\{i,d+1\})\wedge(k\neq i)\wedge(c=i) \\
\tfrac{\lambda}{2}\varepsilon_+ & (i\in[d])\wedge(j\neq i,d+1)\wedge(k=i)\wedge(c=i) \\
\tfrac{\lambda}{2}\varepsilon_- & (i\in[d])\wedge(j\neq i,d+1)\wedge(k\neq i)\wedge(c=i) \\
\tfrac{\lambda^2}{3}\varepsilon_- & (i\in[d])\wedge(j=i)\wedge(k=i)\wedge(c\neq i) \\
\tfrac{\lambda}{2}\varepsilon_- & (i\in[d])\wedge(j\in\{c,d+1\})\wedge(k=i)\wedge(c\neq i) \\
\tfrac{\lambda^2}{4}\varepsilon_- & (i\in[d])\wedge(j\neq i,c,d+1)\wedge(k=i)\wedge(c\neq i) \\
\tfrac{\lambda^2}{3}\varepsilon_+ & (i\in[d])\wedge(j=i)\wedge(k=c)\wedge(c\neq i) \\
\tfrac{\lambda}{2}\varepsilon_+ & (i\in[d])\wedge(j\in\{c,d+1\})\wedge(k=c)\wedge(c\neq i) \\
\tfrac{\lambda^2}{4}\varepsilon_+ & (i\in[d])\wedge(j\neq i,c,d+1)\wedge(k=c)\wedge(c\neq i) \\
\tfrac{\lambda^2}{3}\varepsilon_- & (i\in[d])\wedge(j=i)\wedge(k\neq i,c)\wedge(c\neq i) \\
\tfrac{\lambda}{2}\varepsilon_- & (i\in[d])\wedge(j\in\{c,d+1\})\wedge(k\neq i,c)\wedge(c\neq i) \\
\tfrac{\lambda^2}{4}\varepsilon_- & (i\in[d])\wedge(j\neq i,c,d+1)\wedge(k\neq i,c)\wedge(c\neq i) \\
\varepsilon_+ & (i=d+1)\wedge(j\in\{c,d+1\})\wedge(k=c) \\
\varepsilon_- & (i=d+1)\wedge(j\in\{c,d+1\})\wedge(k\neq c) \\
\tfrac{\lambda}{2}\varepsilon_+ & (i=d+1)\wedge(j\neq c,d+1)\wedge(k=c) \\
\tfrac{\lambda}{2}\varepsilon_- & (i=d+1)\wedge(j\neq c,d+1)\wedge(k\neq c)
\end{cases} \quad (A46)$$

Then, we compute the expectation along $c$. Note that

$$\mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}[g_{i,j}(y_{N+1} x_{N+1,k}-\varepsilon)] = \frac{1}{d}\sum_{c=1}^d h_i(j;k;c). \quad (A47)$$

Let $H_{i,j,k} := \sum_{c=1}^d h_i(j;k;c)$.
The summation of $h_i$ along $c$ can be calculated as follows.

For $(i\in[d])\wedge(j=i)\wedge(k=i)$:
$$H_{i,j,k} = h_i(j=i;k=i;c=i) + \sum_{c\neq i} h_i(j=i;k=i;c\neq i) = \varepsilon_+ + \frac{\lambda^2}{3}(d-1)\varepsilon_- =: r_1. \quad (A48, A49)$$

For $(i\in[d])\wedge(j=i)\wedge(k\neq i)$:
$$H_{i,j,k} = h_i(j=i;k\neq i;c=i) + h_i(j=i;k=c;c\neq i) + \sum_{c\neq i,k} h_i(j=i;k\neq i,c;c\neq i) = \varepsilon_- + \frac{\lambda^2}{3}\varepsilon_+ + \frac{\lambda^2}{3}(d-2)\varepsilon_- =: r_2. \quad (A50\text{–}A52)$$

For $(i\in[d])\wedge(j=d+1)\wedge(k=i)$:
$$H_{i,j,k} = h_i(j=d+1;k=i;c=i) + \sum_{c\neq i} h_i(j=d+1;k=i;c\neq i) = \varepsilon_+ + \frac{\lambda}{2}(d-1)\varepsilon_- =: r_3. \quad (A53\text{–}A55)$$

For $(i\in[d])\wedge(j=d+1)\wedge(k\neq i)$:
$$H_{i,j,k} = h_i(j=d+1;k\neq i;c=i) + h_i(j=d+1;k=c;c\neq i) + \sum_{c\neq i,k} h_i(j=d+1;k\neq i,c;c\neq i) = \varepsilon_- + \frac{\lambda}{2}\varepsilon_+ + \frac{\lambda}{2}(d-2)\varepsilon_- =: r_4. \quad (A56\text{–}A58)$$

For $(i\in[d])\wedge(j\neq i,d+1)\wedge(k=i)$:
$$H_{i,j,k} = h_i(j\neq i,d+1;k=i;c=i) + h_i(j=c;k=i;c\neq i) + \sum_{c\neq i,j} h_i(j\neq i,c,d+1;k=i;c\neq i) = \frac{\lambda}{2}\varepsilon_+ + \frac{\lambda}{2}\varepsilon_- + \frac{\lambda^2}{4}(d-2)\varepsilon_- =: r_5. \quad (A59\text{–}A61)$$

For $(i\in[d])\wedge(j\neq i,d+1)\wedge(k\neq i)\wedge(j=k)$:
$$H_{i,j,k} = h_i(j\neq i,d+1;k\neq i;c=i) + h_i(j=c;k=c;c\neq i) + \sum_{c\neq i,j,k} h_i(j\neq i,c,d+1;k\neq i,c;c\neq i) = \frac{\lambda}{2}\varepsilon_- + \frac{\lambda}{2}\varepsilon_+ + \frac{\lambda^2}{4}(d-2)\varepsilon_- =: r_5. \quad (A62\text{–}A64)$$

For $(i\in[d])\wedge(j\neq i,d+1)\wedge(k\neq i)\wedge(j\neq k)$:
$$H_{i,j,k} = h_i(j\neq i,d+1;k\neq i;c=i) + h_i(j=c;k\neq i,c;c\neq i) + h_i(j\neq i,c,d+1;k=c;c\neq i) + \sum_{c\neq i,j,k} h_i(j\neq i,c,d+1;k\neq i,c;c\neq i) = \frac{\lambda}{2}\varepsilon_- + \frac{\lambda}{2}\varepsilon_- + \frac{\lambda^2}{4}\varepsilon_+ + \frac{\lambda^2}{4}(d-3)\varepsilon_- =: r_6. \quad (A65\text{–}A67)$$

For $(i=d+1)\wedge(j=d+1)$:
$$H_{i,j,k} = h_i(j=d+1;k=c;c=k) + \sum_{c\neq k} h_i(j=d+1;k\neq c;c\neq k) = \varepsilon_+ + (d-1)\varepsilon_- =: r_7. \quad (A68\text{–}A70)$$

For $(i=d+1)\wedge(j\neq d+1)\wedge(j=k)$:
$$H_{i,j,k} = h_i(j=c;k=c;c=k) + \sum_{c\neq k} h_i(j\neq d+1;k\neq c;c\neq k) = \varepsilon_+ + \frac{\lambda}{2}(d-1)\varepsilon_- =: r_3. \quad (A71\text{–}A73)$$

For $(i=d+1)\wedge(j\neq d+1)\wedge(j\neq k)$:
$$H_{i,j,k} = h_i(j=c;k\neq c;c\neq k) + h_i(j\neq c;k=c;c=k) + \sum_{c\neq j,k} h_i(j\neq c,d+1;k\neq c;c\neq k) = \varepsilon_- + \frac{\lambda}{2}\varepsilon_+ + \frac{\lambda}{2}(d-2)\varepsilon_- =: r_4. \quad (A74\text{–}A76)$$

Optimization of $A$ and $b$. From Eq. (A38), we redefine the objective function as:

$$d\max_{b\in\{0,1\}^{d+1}} \sum_{j=1}^{d+1}\sum_{k=1}^d \varphi\left(\sum_{i=1}^{d+1} b_i\, \mathbb{E}_{c,\{(x_n,y_n)\}_{n=1}^{N+1}}[g_{i,j}(y_{N+1} x_{N+1,k}-\varepsilon)]\right) = \max_{b\in\{0,1\}^{d+1}}\sum_{j=1}^{d+1}\sum_{k=1}^d\varphi\left(\sum_{i=1}^{d+1} b_i H_{i,j,k}\right). \quad (A77)$$

Recall that we set $A_{j,k}=1$ if $\sum_{i=1}^{d+1} b_i H_{i,j,k} \ge 0$ and 0 otherwise. Let $[d]' := \{i\in[d] \mid b_i = 1\}$ and $d' := |[d]'|$. Now,

$$\sum_{j=1}^{d+1}\sum_{k=1}^d \varphi\left(\sum_{i=1}^{d+1} b_i H_{i,j,k}\right) = \sum_{k=1}^d \varphi\Big(b_{d+1}H_{d+1,d+1,k} + \mathbb{1}[k\in[d]']\,H_{k,d+1,k} + \sum_{i\in[d]',\, i\neq k} H_{i,d+1,k}\Big) + \sum_{j=1}^d \varphi\Big(b_{d+1}H_{d+1,j,j} + \mathbb{1}[j\in[d]']\,H_{j,j,j} + \sum_{i\in[d]',\, i\neq j} H_{i,j,j}\Big) + \sum_{j=1}^d\sum_{k\neq j}\varphi\Big(b_{d+1}H_{d+1,j,k} + \mathbb{1}[j\in[d]']\,H_{j,j,k} + \mathbb{1}[k\in[d]']\,H_{k,j,k} + \sum_{i\in[d]',\, i\neq j,k} H_{i,j,k}\Big). \quad (A78)$$

By Eqs. (A55), (A58) and (A70),

$$\sum_{k=1}^d \varphi\Big(b_{d+1}H_{d+1,d+1,k} + \mathbb{1}[k\in[d]']\,H_{k,d+1,k} + \sum_{i\in[d]',\, i\neq k} H_{i,d+1,k}\Big) = d'\,\varphi(\underbrace{b_{d+1}r_7 + r_3 + (d'-1)r_4}_{=:\, s_1(d',b_{d+1})}) + (d-d')\,\varphi(\underbrace{b_{d+1}r_7 + d'r_4}_{=:\, s_2(d',b_{d+1})}). \quad (A79, A80)$$

By Eqs. (A49), (A64) and (A73),

$$\sum_{j=1}^d \varphi\Big(b_{d+1}H_{d+1,j,j} + \mathbb{1}[j\in[d]']\,H_{j,j,j} + \sum_{i\in[d]',\, i\neq j} H_{i,j,j}\Big) = d'\,\varphi(\underbrace{b_{d+1}r_3 + r_1 + (d'-1)r_5}_{=:\, s_3(d',b_{d+1})}) + (d-d')\,\varphi(\underbrace{b_{d+1}r_3 + d'r_5}_{=:\, s_4(d',b_{d+1})}). \quad (A81, A82)$$

By Eqs. (A52), (A61), (A67) and (A76),

$$\sum_{j=1}^d\sum_{k\neq j}\varphi\Big(b_{d+1}H_{d+1,j,k} + \mathbb{1}[j\in[d]']\,H_{j,j,k} + \mathbb{1}[k\in[d]']\,H_{k,j,k} + \sum_{i\in[d]',\, i\neq j,k} H_{i,j,k}\Big) = d'(d'-1)\,\varphi(\underbrace{b_{d+1}r_4 + r_2 + r_5 + (d'-2)r_6}_{=:\, s_5(d',b_{d+1})}) + d'(d-d')\,\varphi(\underbrace{b_{d+1}r_4 + r_2 + (d'-1)r_6}_{=:\, s_6(d',b_{d+1})}) + d'(d-d')\,\varphi(\underbrace{b_{d+1}r_4 + r_5 + (d'-1)r_6}_{=:\, s_7(d',b_{d+1})}) + (d-d')(d-d'-1)\,\varphi(\underbrace{b_{d+1}r_4 + d'r_6}_{=:\, s_8(d',b_{d+1})}). \quad (A83, A84)$$

Now,

$$\sum_{j=1}^{d+1}\sum_{k=1}^d\varphi\left(\sum_{i=1}^{d+1} b_i H_{i,j,k}\right) = d'\varphi(s_1(d',b_{d+1})) + (d-d')\varphi(s_2(d',b_{d+1})) + d'\varphi(s_3(d',b_{d+1})) + (d-d')\varphi(s_4(d',b_{d+1})) + d'(d'-1)\varphi(s_5(d',b_{d+1})) + d'(d-d')\varphi(s_6(d',b_{d+1})) + d'(d-d')\varphi(s_7(d',b_{d+1})) + (d-d')(d-d'-1)\varphi(s_8(d',b_{d+1})) =: \mathrm{score}(d', b_{d+1}). \quad (A85, A86)$$

We shall now summarize the discussion in Lemma E.2. The rest of the proof is left to Lemma E.3.

Optimization of transformed problem.

Lemma E.2. Let $\varphi(x) := \max(0,x)$, $d\in\mathbb{N}$, $0<\lambda<1$, $0\le\varepsilon<1$, $\varepsilon_+ := 1-\varepsilon$, and $\varepsilon_- := \lambda/2-\varepsilon$. In addition, for $d'\in\{0,\ldots,d\}$ and $b_{d+1}\in\{0,1\}$,

$$r_1 := \varepsilon_+ + \frac{\lambda^2}{3}(d-1)\varepsilon_-, \quad (A87)$$
$$r_2 := \varepsilon_- + \frac{\lambda^2}{3}\varepsilon_+ + \frac{\lambda^2}{3}(d-2)\varepsilon_-, \quad (A88)$$
$$r_3 := \varepsilon_+ + \frac{\lambda}{2}(d-1)\varepsilon_-, \quad (A89)$$
$$r_4 := \varepsilon_- + \frac{\lambda}{2}\varepsilon_+ + \frac{\lambda}{2}(d-2)\varepsilon_-, \quad (A90)$$
$$r_5 := \frac{\lambda}{2}\varepsilon_+ + \frac{\lambda}{2}\varepsilon_- + \frac{\lambda^2}{4}(d-2)\varepsilon_-, \quad (A91)$$
$$r_6 := \frac{\lambda}{2}\varepsilon_- + \frac{\lambda}{2}\varepsilon_- + \frac{\lambda^2}{4}\varepsilon_+ + \frac{\lambda^2}{4}(d-3)\varepsilon_-, \quad (A92)$$
$$r_7 := \varepsilon_+ + (d-1)\varepsilon_-, \quad (A93)$$
$$s_1(d',b_{d+1}) := b_{d+1}r_7 + r_3 + (d'-1)r_4, \quad (A94)$$
$$s_2(d',b_{d+1}) := b_{d+1}r_7 + d'r_4, \quad (A95)$$
$$s_3(d',b_{d+1}) := b_{d+1}r_3 + r_1 + (d'-1)r_5, \quad (A96)$$
$$s_4(d',b_{d+1}) := b_{d+1}r_3 + d'r_5, \quad (A97)$$
$$s_5(d',b_{d+1}) := b_{d+1}r_4 + r_2 + r_5 + (d'-2)r_6, \quad (A98)$$
$$s_6(d',b_{d+1}) := b_{d+1}r_4 + r_2 + (d'-1)r_6, \quad (A99)$$
$$s_7(d',b_{d+1}) := b_{d+1}r_4 + r_5 + (d'-1)r_6, \quad (A100)$$
$$s_8(d',b_{d+1}) := b_{d+1}r_4 + d'r_6, \quad (A101)$$
$$\mathrm{score}(d',b_{d+1}) := d'\varphi(s_1(d',b_{d+1})) + (d-d')\varphi(s_2(d',b_{d+1})) + d'\varphi(s_3(d',b_{d+1})) + (d-d')\varphi(s_4(d',b_{d+1})) + d'(d'-1)\varphi(s_5(d',b_{d+1})) + d'(d-d')\varphi(s_6(d',b_{d+1})) + d'(d-d')\varphi(s_7(d',b_{d+1})) + (d-d')(d-d'-1)\varphi(s_8(d',b_{d+1})). \quad (A102)$$

Consider the following optimization problem:

$$\max_{d'\in\{0,\ldots,d\},\, b_{d+1}\in\{0,1\}} \mathrm{score}(d', b_{d+1}). \quad (A103)$$

Then, setting $P, Q \in \mathbb{R}^{(d+1)\times(d+1)}$ to

$$P = \begin{bmatrix} 0_{d,d+1} \\ b^\top \end{bmatrix}, \quad Q = [\,A \;\; 0_{d+1}\,], \quad b^\top = [\underbrace{1\;1\cdots 1}_{d'}\;\underbrace{0\;0\cdots 0}_{d-d'}\; b_{d+1}], \quad (A104)$$

$$A_{jk} = \begin{cases} \mathbb{1}[\,b_{d+1}r_7 + \mathbb{1}[k\le d']\,r_3 + (d'-\mathbb{1}[k\le d'])\,r_4 \ge 0\,] & (j=d+1) \\ \mathbb{1}[\,b_{d+1}r_3 + \mathbb{1}[j\le d']\,r_1 + (d'-\mathbb{1}[j\le d'])\,r_5 \ge 0\,] & (j\neq d+1)\wedge(j=k) \\ \mathbb{1}[\,b_{d+1}r_4 + \mathbb{1}[j\le d'\end{cases}$$
] r_2 + 1[k ≤ d′] r_5 + (d′ − 1[j ≤ d′] − 1[k ≤ d′]) r_6 ≥ 0]   ((j ≠ d + 1) ∧ (j ≠ k)), (A105)

the global maximizer of (A103) is the global minimizer of (7).

Proof. See the discussion above.

Lemma E.3. The global maximizer of (A103) is as follows:

(a) If
0 ≤ ε ≤ λ(λ(d − 2) + 4) / (2(λ(d − 1) + 2)), (A106)
then d′ = d and b_{d+1} = 1. This corresponds to b = 1_{d+1} and A = 1_{d+1,d}.

(b) If
ε = (λ(d − 1) + 2) / (2d), (A107)
then d′ = d and b_{d+1} = 1. This corresponds to b = 1_{d+1} and A = [I_d 0_d]^⊤.

(c) If
ε ≥ λ/2 + 3(2 − λ) / (2(λ²(d − 1) + 3)), (A108)
then d′ = 0 and b_{d+1} = 0. This corresponds to b = 0_{d+1} and A = 0_{d+1,d}.

Proof. For notational simplicity, we abbreviate lengthy terms in the variables x_1, x_2, ... (e.g., x_1² + 3x_2 + ···) by Θ(x_1, x_2, ...). When such an expression is strictly nonnegative (e.g., x_1² + x_2²) or nonpositive, we write Θ_+(x_1, x_2, ...) or Θ_−(x_1, x_2, ...), respectively. These terms are lengthy and not essential to the analysis; they follow from elementary arithmetic, and their concrete values can be verified with our Python code.
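The maximizer claimed in Lemma E.3 can also be checked by brute force. The sketch below is our own illustration (not the authors' released code): it implements r_1, ..., r_7, s_1, ..., s_8, and score from Lemma E.2 and maximizes score over the finite grid d′ ∈ {0, ..., d}, b_{d+1} ∈ {0, 1}; the test values d = 5, λ = 0.5, ε ∈ {0.1, 0.9} are arbitrary choices landing in cases (a) and (c), respectively.

```python
# Brute-force verification of Lemma E.3 (our own sketch): maximize
# score(d', b_{d+1}) from Lemma E.2 over d' in {0,...,d}, b_{d+1} in {0,1}.

def score(d, lam, eps, dp, b):
    """score(d', b_{d+1}) of Eq. (A102), with r_1..r_7 as in Eqs. (A87)-(A93)."""
    phi = lambda x: max(0.0, x)            # ReLU
    ep, em = 1.0 - eps, lam / 2.0 - eps    # eps_+ and eps_-
    r1 = ep + lam**2 / 3 * (d - 1) * em
    r2 = em + lam**2 / 3 * ep + lam**2 / 3 * (d - 2) * em
    r3 = ep + lam / 2 * (d - 1) * em
    r4 = em + lam / 2 * ep + lam / 2 * (d - 2) * em
    r5 = lam / 2 * ep + lam / 2 * em + lam**2 / 4 * (d - 2) * em
    r6 = lam / 2 * em + lam / 2 * em + lam**2 / 4 * ep + lam**2 / 4 * (d - 3) * em
    r7 = ep + (d - 1) * em
    s1, s2 = b * r7 + r3 + (dp - 1) * r4, b * r7 + dp * r4
    s3, s4 = b * r3 + r1 + (dp - 1) * r5, b * r3 + dp * r5
    s5 = b * r4 + r2 + r5 + (dp - 2) * r6
    s6 = b * r4 + r2 + (dp - 1) * r6
    s7 = b * r4 + r5 + (dp - 1) * r6
    s8 = b * r4 + dp * r6
    return (dp * phi(s1) + (d - dp) * phi(s2)
            + dp * phi(s3) + (d - dp) * phi(s4)
            + dp * (dp - 1) * phi(s5) + dp * (d - dp) * phi(s6)
            + dp * (d - dp) * phi(s7) + (d - dp) * (d - dp - 1) * phi(s8))

def best(d, lam, eps):
    """Grid argmax of score over (d', b_{d+1})."""
    grid = [(dp, b) for dp in range(d + 1) for b in (0, 1)]
    return max(grid, key=lambda t: score(d, lam, eps, *t))

d, lam = 5, 0.5                            # arbitrary illustration values
# Case (a): eps = 0.1 lies below eps_4, so d' = d and b_{d+1} = 1 is optimal.
assert best(d, lam, 0.1) == (d, 1)
# Strong adversarial regime: eps = 0.9 exceeds eps_1, so score vanishes everywhere.
assert max(score(d, lam, 0.9, dp, b) for dp in range(d + 1) for b in (0, 1)) == 0.0
```

For ε below ε_4 all r_i are positive, so score(d′, 1) is strictly increasing in d′ and the argmax is unique; for ε above ε_1 all r_i are nonpositive and score is identically zero, matching the strong-adversarial case.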
We define ε_1, ..., ε_7 and ε_{s_5} via

r_1 = 0 ⟺ ε = λ/2 + 3(2 − λ) / (2(λ²(d − 1) + 3)) =: ε_1, (A109)
r_2 = 0 ⟺ ε = λ(λ²(d − 2) + 2λ + 3) / (2(λ²(d − 1) + 3)) =: ε_2, (A110)
r_3 = 0 ⟺ ε = (λ²(d − 1) + 4) / (2(λ(d − 1) + 2)) =: ε_3, (A111)
r_4 = 0 ⟺ ε = λ(λ(d − 2) + 4) / (2(λ(d − 1) + 2)) =: ε_4, (A112)
r_5 = 0 ⟺ ε = (λ²(d − 2) + 2λ + 4) / (2(λ(d − 2) + 4)) =: ε_5, (A113)
r_6 = 0 ⟺ ε = λ(λ(d − 3) + 6) / (2(λ(d − 2) + 4)) =: ε_6, (A114)
r_7 = 0 ⟺ ε = (λ(d − 1) + 2) / (2d) =: ε_7, (A115)
s_5(d, 1) = 0 ⟺ ε = (λ/2) · (3d²λ² − 8dλ² + 24dλ + 4λ² − 34λ + 48) / (3d²λ² − 5dλ² + 18dλ + 2λ² − 18λ + 24) =: ε_{s_5}. (A116)

Since
ε_1 − ε_3 = λ(d − 1)(2 − λ)(3 − 2λ) / (2(λ(d − 1) + 2)(λ²(d − 1) + 3)) ≥ 0, (A117)
ε_3 − ε_5 = (2 − λ)² / ((λ(d − 2) + 4)(λ(d − 1) + 2)) ≥ 0, (A118)
ε_5 − ε_7 = (d − 2)(2 − λ)² / (2d(λ(d − 2) + 4)) ≥ 0, (A119)
ε_7 − ε_{s_5} = (2 − λ)(−3dλ² + 6dλ + 2λ² − 18λ + 24) / (2d(3d²λ² − 5dλ² + 18dλ + 2λ² − 18λ + 24)) ≥ 0, (A120)
ε_{s_5} − ε_4 = λ²(2 − λ) / ((λ(d − 1) + 2)(3d²λ² − 5dλ² + 18dλ + 2λ² − 18λ + 24)) ≥ 0, (A121)
ε_4 − ε_6 = λ(2 − λ)² / (2(λ(d − 2) + 4)(λ(d − 1) + 2)) ≥ 0, (A122)
ε_6 − ε_2 = λ(3 − λ)(2 − λ)(1 − λ) / (2(λ(d − 2) + 4)(λ²(d − 1) + 3)) ≥ 0 (A123)

for d ≥ 2, they are ordered as
ε_2 ≤ ε_6 ≤ ε_4 ≤ ε_{s_5} ≤ ε_7 ≤ ε_5 ≤ ε_3 ≤ ε_1. (A124)

In score, b_{d+1} appears only in the terms b_{d+1} r_3, b_{d+1} r_4, and b_{d+1} r_7, each with a coefficient that is positive in d and d′. Thus, if r_3, r_4, r_7 ≤ 0, then b_{d+1} should be zero; if r_3, r_4, r_7 ≥ 0, then b_{d+1} should be one. By Ineq. (A124), for d ≥ 2, the optimal b_{d+1} is one if ε ≤ ε_4 and zero if ε ≥ ε_3.

One-Dimensional Case. If d = 1,
score(d′, b_{d+1}) = 1[d′ = 0](φ(b_{d+1} r_7) + φ(b_{d+1} r_3)) + 1[d′ = 1](φ(b_{d+1} r_7 + r_3) + φ(b_{d+1} r_3 + r_1)) (A125)
= 1[d′ = 0](φ(b_{d+1} ε_+) + φ(b_{d+1} ε_+)) + 1[d′ = 1](φ(b_{d+1} ε_+ + ε_+) + φ(b_{d+1} ε_+ + ε_+)). (A126)
Since ε_+ is always positive for 0 ≤ ε < 1, d′ = d = 1 and b_{d+1} = 1 are optimal. This aligns with the following case analysis.

Weak Adversarial (Case 1). Assume d ≥ 2 and 0 ≤ ε ≤ ε_6. Since ε ≤ ε_6 ≤ ε_4, b_{d+1} = 1 is optimal. By Ineq. (A124), r_1, r_3, r_4, r_5, r_6, r_7 ≥ 0. The sign of r_2 depends on ε.
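The ordering (A124) can also be confirmed numerically. The following sketch is our own check (not the authors' released code), evaluating the closed-form thresholds from (A109)–(A116) on a grid of d ≥ 2 and 0 < λ < 1 and verifying ε_2 ≤ ε_6 ≤ ε_4 ≤ ε_{s_5} ≤ ε_7 ≤ ε_5 ≤ ε_3 ≤ ε_1:

```python
# Numerical check (our own sketch) of the threshold ordering (A124),
# using the closed forms from (A109)-(A116).

def thresholds(d, lam):
    """Return [eps_2, eps_6, eps_4, eps_{s5}, eps_7, eps_5, eps_3, eps_1]."""
    e1 = lam / 2 + 3 * (2 - lam) / (2 * (lam**2 * (d - 1) + 3))
    e2 = lam * (lam**2 * (d - 2) + 2 * lam + 3) / (2 * (lam**2 * (d - 1) + 3))
    e3 = (lam**2 * (d - 1) + 4) / (2 * (lam * (d - 1) + 2))
    e4 = lam * (lam * (d - 2) + 4) / (2 * (lam * (d - 1) + 2))
    e5 = (lam**2 * (d - 2) + 2 * lam + 4) / (2 * (lam * (d - 2) + 4))
    e6 = lam * (lam * (d - 3) + 6) / (2 * (lam * (d - 2) + 4))
    e7 = (lam * (d - 1) + 2) / (2 * d)
    es5 = (lam / 2) \
        * (3*d**2*lam**2 - 8*d*lam**2 + 24*d*lam + 4*lam**2 - 34*lam + 48) \
        / (3*d**2*lam**2 - 5*d*lam**2 + 18*d*lam + 2*lam**2 - 18*lam + 24)
    return [e2, e6, e4, es5, e7, e5, e3, e1]

# Grid check: the list must be nondecreasing for every (d, lambda) pair.
for d in range(2, 21):
    for k in range(1, 10):
        t = thresholds(d, k / 10)
        assert all(a <= b + 1e-12 for a, b in zip(t, t[1:])), (d, k / 10)
```

At d = 2 the gap ε_5 − ε_7 vanishes, consistent with the factor (d − 2) in (A119).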
Thus, s 1 (d ′ ,1),s 2 (d ′ ,1),s 3 (d ′ ,1),s 4 (d ′ ,1),s 7 (d ′ ,1),s 8 (d ′ ,1)≥0for0≤d ′ ≤d. In addition, for d ′ ≥2, s 5 (d ′ ,1)≥r 4 +r 2 (A127) = λ 3 6 (d−2) + λ 2 12 (3d−2) + 3λ 2 − ε 6 (2λ 2 (d−1) + 3λ(d−1) + 12)(A128) ≥ λ 2 (2−λ)(5−2λ) 12(λ(d−2) + 4) (∵ε≤ε 6 )(A129) ≥0.(A130) Thus,d ′ (d ′ −1)s 5 (d ′ ,1)is nonnegative for0≤d ′ ≤d. Similarly, bys 6 (d ′ ,1)≥r 4 +r 2 ≥0for d ′ ≥1,d ′ (d ′ −1)s 6 (d ′ ,1)is nonnegative for0≤d ′ ≤d. Thus, score(d ′ ,1) :=d ′ s 1 (d ′ ,1) + (d−d ′ )s 2 (d ′ ,1) +d ′ s 3 (d ′ ,1) + (d−d ′ )s 4 (d ′ ,1) +d ′ (d ′ −1)s 5 (d ′ ,1) +d ′ (d−d ′ )s 6 (d ′ ,1) +d ′ (d−d ′ )s 7 (d ′ ,1) + (d−d ′ )(d−d ′ −1)s 8 (d ′ ,1)(A131) =dr 7 +d ′ r 3 +d ′ (d−1)r 4 +dr 3 +d ′ r 1 +d ′ (d−1)r 5 +dr 4 +d ′ r 2 +d ′ r 5 +d ′ (d−1)(d−2)r 6 .(A132) This monotonically increases ind ′ . Therefore,d ′ =dis the optimal. By Lemma E.2,b=1 d+1 . In addition, froms 1 (d,1),s 3 (d,1),s 5 (d,1)≥0,A=1 d+1,d . Weak Adversarial (Case 2).Assumed≥2andε 6 ≤ε≤ε 4 . Asε≤ε 4 ,b d+1 = 1is the optimal. By Ineq. (A124),r 1 ,r 3 ,r 4 ,r 5 ,r 7 ≥0andr 2 ,r 6 ≤0. Thus,s 1 (d ′ ,1),s 2 (d ′ ,1),s 3 (d ′ ,1),s 4 (d ′ ,1)≥0. 
In addition, s 5 (d ′ ,1)≥s 5 (d,1)≥ λ 2 (2−λ) 12(λ(d−1) + 2) ≥0(∵ε≤ε 4 ),(A133) s 7 (d ′ ,1)≥s 7 (d,1)≥ λ(2−λ) 3 8(λ(d−1) + 2) ≥0(∵ε≤ε 4 ).(A134) Due to the following inequality,s 8 (d ′ ,1)is always larger thans 6 (d ′ ,1): s 8 (d ′ ,1)−s 6 (d ′ ,1) =− λ 3 24 (d+ 1) + 5λ 2 12 − λ 2 + ε 12 (λ 2 (d+ 2) + 12(1−λ))(A135) 33 Published as a conference paper at ICLR 2026 ≥ λ(3−λ)(2−λ)(1−λ) 6(λ(d−2) + 4) (∵ε≥ε 6 )(A136) ≥0.(A137) Ifs 6 (d ′ ,1),s 8 (d ′ ,1)≥0, d score(d ′ ,1) d ′ = (2 +λ(d−1)−2dε)(λ 2 (3d 2 −5d+ 2) + 18λ(d−1) + 24) 24 ≥0.(A138) We used 2 +λ(d−1)−2dε≥ (2−λ) 2 λ(d−1) + 2 ≥0(∵ε≤ε 4 ).(A139) Ifs 6 (d ′ ,1)≤0,s 8 (d ′ ,1)≥0, d score(d ′ ,1) d ′ = Θ(d,d ′ ,λ)− ε 12 3dλ 2 ((d−d ′ ) 2 + 2d ′2 ) + 6λ(2−λ) ( d− 1 2 d ′ 2 + 11 4 d ′2 ) + 8d ′ λ 2 +d ′ (4λ 2 −36λ+ 48)(A140) ≥Θ(d,λ)− λ(2−λ) 24(λ(d−1) + 2) d ′ (9d ′ λ(2−λ) + 6λ 2 (d+ 1)−4λ(3d+ 7) + 24) (∵ε≤ε 4 )(A141) ≥ (2−λ)(dλ 3 +dλ(12−7λ)−λ 3 + 11λ 2 −30λ+ 24) 12(λ(d−1) + 2) (A142) ≥0.(A143) We used for0≤d ′ ≤d, d ′ (9d ′ λ(2−λ) + 6λ 2 (d+ 1)−4λ(3d+ 7) + 24) ≤dλ(3dλ(2−λ) + 6λ 2 −28λ+ 24).(A144) Ifs 6 (d ′ ,1)≤0,s 8 (d ′ ,1)≤0, d score(d ′ ,1) d ′ = Θ(d,d ′ ,λ)− ε 12 3d 2 λ(λ+ 4) + 6d(−λ 2 −λ+ 2) + 6λ+ 12(d−1) + 2d ′ (3d 2 λ 2 + 8dλ(−λ+ 1) + 4(2λ 2 + (d−6)λ+ 3)(A145) ≥Θ(d,λ)− λ(2−λ) 12(λ(d−1) + 2) d ′ (−3dλ 2 + 6dλ+ 6λ 2 −20λ+ 12)(∵ε≤ε 4 )(A146) ≥ (2−λ)(−dλ 3 −8dλ 2 + 24dλ−2λ 3 + 22λ 2 −60λ+ 48) 24(λ(d−1) + 2) (∵d ′ ≤d)(A147) ≥0.(A148) From the above discussion, for any case, (s 6 ,s 8 ≥0), (s 6 ≤0ands 8 ≥0), or (s 6 ,s 8 ≤0), the derivative ofscore(d ′ ,1)with respect tod ′ is nonnegative. Thus,d ′ =dis the optimal. By Lemma E.2,b=1 d+1 . In addition, froms 1 (d,1),s 3 (d,1),s 5 (d,1)≥0,A=1 d+1,d . Adversarial.Assumed≥2andε=ε 7 . By Ineq. (A124),r 1 ,r 3 ,r 5 ≥0,r 7 = 0, andr 2 ,r 4 ,r 6 ≤0. Thus,s 3 (d ′ ,b d+1 ),s 4 (d ′ ,b d+1 )≥0ands 2 (d ′ ,b d+1 ),s 6 (d ′ ,b d+1 ),s 8 (d ′ ,b d+1 )≤0. 
Now, s 1 (d ′ ,1) =s 1 (d ′ ,0)≥ (d−d ′ )(2−λ) 2 4d ≥0(∵ε=ε 7 ).(A149) Thus, score(d ′ ,b d+1 ) =d ′ s 1 (d ′ ,0) +d ′ s 3 (d ′ ,b d+1 ) + (d−d ′ )s 4 (d ′ ,b d+1 ) +d ′ (d ′ −1)φ(s 5 (d ′ ,b d+1 )) +d ′ (d−d ′ )φ(s 7 (d ′ ,b d+1 ))(A150) =d ′ s 1 (d ′ ,0) +d ′ r 1 + (d−1)d ′ r 5 +db d+1 r 3 34 Published as a conference paper at ICLR 2026 +d ′ (d ′ −1)φ(b d+1 r 4 +r 2 +r 5 + (d ′ −2)r 6 ) +d ′ (d−d ′ )φ(b d+1 r 4 +r 5 + (d ′ −1)r 6 ).(A151) Sincer 4 is nonpositive, this indicates thatscorechanges bydr 3 +d ′ (d−1)r 4 at least by switching b d+1 to one from zero. Moreover, dr 3 +d ′ (d−1)r 4 ≥ (d−1)(d−d ′ )(2−λ) 2 4d ≥0(∵ε=ε 7 ).(A152) Therefore,b d+1 = 1is the optimal. From Ineq. (A124) andε=ε 7 ,s 7 (d ′ ,b d+1 )−s 5 (d ′ ,b d+1 )≥0. Ifs 5 (d ′ ,1),s 7 (d ′ ,1)≥0, d score(d ′ ,1) d ′ = Θ(d,d ′ ,λ)−Θ + (d,d ′ ,λ)ε(A153) = Θ(d,λ)−Θ + (d,λ)d ′ (∵ε=ε 7 )(A154) ≥0(∵d ′ ≤d s 5 ),(A155) where s 5 (d ′ ,1)≥0⇐⇒d ′ ≤ 3dλ 2 −6dλ+ 2λ 2 −18λ+ 24 6λ(λ−2) =:d s 5 .(A156) Whens 5 (d ′ ,1)≤0,s 7 (d ′ ,1)≥0, then d score(d ′ ,1) d ′ ≥0similarly holds. Ifs 5 (d ′ ,1),s 7 (d ′ ,1)≤0, d score(d ′ ,1) d ′ ≥0 ford ′ ≤d−1. Comparingscore(d ′ ,1)withd ′ =d−1andd ′ =d, we obtain score(d,1)≥score(d−1,1). In summary,d ′ =dis the optimal. By Lemma E.2,b=1 d+1 . In addition, froms 3 (d,1)≥0,s 1 (d,1) = 0, ands 5 (d,1)<0,A= [I d 0 d ] ⊤ . Strong Adversarial.Assumed≥2andε≥ε 1 . By Ineq. (A124),r 1 ,...,r 7 are nonpositive. Thus, s 1 (d ′ ,b d+1 ),...,s 8 (d ′ ,b d+1 )are nonpositive. Therefore,d ′ = 0andb d+1 = 0are the optimal. By Lemma E.2,b=0 d+1 andA=0 d+1,d . FPROOF OFTHEOREMS3.5AND3.6 (ROBUSTNESS) For notational convenience, we occasionally describe representations and equations under the assumption thatS rob :=1,...,d rob ,S vul :=d rob + 1,...,d rob +d vul , andS irr := d rob +d vul + 1,...,d rob +d vul +d irr . This assumption is made without loss of generality. We useuniformbig-O and -Theta notation. 
We write f(x) = O(g(x)) if there exists a constant C > 0 such that |f(x)| ≤ C|g(x)| for every x in the domain, and f(x) = Θ(g(x)) if there exist C_1, C_2 > 0 such that C_1|g(x)| ≤ |f(x)| ≤ C_2|g(x)| for every x in the domain. For notational simplicity, we abbreviate the column vector

(C_1 α, ..., C_{d_rob} α, C_{d_rob+1} β, ..., C_{d_rob+d_vul} β, C_{d_rob+d_vul+1} γ, ..., C_{d_rob+d_vul+d_irr} γ)^⊤ as [C_i α; C_i β; C_i γ]. (A157)

Theorem 3.5 (Standard pretraining case). There exist a constant C > 0 and a strictly positive function g(d_rob, d_vul, d_irr, α, β, γ) such that

E_{(x_n, y_n)_{n=1}^{N+1} i.i.d.∼ D_te} min_{∥Δ∥_∞ ≤ ε} y_{N+1} [f(Z_Δ; P_std, Q_std)]_{d+1, N+1}
Thus, we can solve the inner minimization as min ∥∆∥ ∞ ≤ε y N+1 [f(Z ∆ ;P,Q)] d+1,N+1 = min ∥∆∥ ∞ ≤ε 1 N b ⊤ Z ∆ MZ ⊤ ∆ Ay N+1 (x N+1 +∆)(A158) = 1 N b ⊤ Z ∆ MZ ⊤ ∆ A(y N+1 x N+1 −ε1 d ).(A159) Using(x,y)∼D te , E 1 N Z ∆ MZ ⊤ ∆ = E[x ⊤ ]E[yx] E[yx ⊤ ]1 (A160) = E[yx]E[yx ⊤ ]E[yx] E[yx ⊤ ]1 + E[(yx−E[yx])(yx−E[yx]) ⊤ ]0 d 0 ⊤ d 0 .(A161) Since the second term is positive semidefinite, E 1 N 1 ⊤ d+1 Z ∆ MZ ⊤ ∆ 1 d+1 =1 ⊤ d+1 E[yx]E[yx ⊤ ]E[yx] E[yx ⊤ ]1 + E[(yx−E[yx])(yx−E[yx]) ⊤ ]0 d 0 ⊤ d 0 1 d+1 (A162) ≥1 ⊤ d+1 E[yx ⊤ ]E[yx]E[yx] E[yx ⊤ ]1 1 d+1 .(A163) Since every entry ofE[yx ⊤ ]E[yx]andE[yx]is nonnegative, E 1 N 1 ⊤ d+1 Z ∆ MZ ⊤ ∆ 1 d+1 ≥1 ⊤ d+1 E[yx ⊤ ]E[yx]E[yx] E[yx ⊤ ]1 1 d+1 ≥1.(A164) RepresentingE[b ⊤ Z ∆ MZ ⊤ ∆ A/N] = [g(d rob ,d vul ,d irr ,α,β,γ)·g(d rob ,d vul ,d irr ,α,β,γ)] using some positive functiong(d rob ,d vul ,d irr ,α,β,γ)>0, there exists a positive constantC >0 such that E 1 N b ⊤ Z ∆ MZ ⊤ ∆ A(y N+1 x N+1 −ε1 d ) =    g(d rob ,d vul ,d irr ,α,β,γ) . . . g(d rob ,d vul ,d irr ,α,β,γ)    ⊤ (E[y N+1 x N+1 ]−ε1 d )(A165) =g(d rob ,d vul ,d irr ,α,β,γ)(Θ(d rob α+d vul β)−dε)(A166) ≤g(d rob ,d vul ,d irr ,α,β,γ)(C(d rob α+d vul β)−(d rob +d vul +d irr )ε).(A167) Theorem 3.6(Adversarial pretraining case).Suppose thatq rob andq vul defined in Assumption 3.2 are sufficiently small. There exist constantsC 1 ,C 2 >0such that E (x n ,y n ) N+1 n=1 i.i.d. ∼ D te min ∥∆∥ ∞ ≤ε y N+1 [f(Z ∆ ;P adv ,Q adv )] d+1,N+1 ≥C 1 (d rob α+d vul β+ 1)(d rob α 2 +d vul β 2 ) | z Prediction for original data 36 Published as a conference paper at ICLR 2026 −C 2 ( (d rob α+d vul β+ 1) d rob α+d vul β+ d irr γ √ N +d irr r d irr N + 1 ! γ 2 ) ε | z Adversarial effect .(9) Proof.This is the special case of the following theorem. Theorem F.1(General case of Theorem 3.6).There exist constantsC,C ′ ,C ′ >0such that E (x n ,y n ) N+1 n=1 i.i.d. 
∼ D te min ∥∆∥ ∞ ≤ε y N+1 [f(Z ∆ ;P adv ,Q adv )] d+1,N+1 ≥C(d rob α+d vul β) (1−cq rob )d rob α 2 + (1−cq vul )d vul β 2 +C ′ (d rob α 2 +d vul β 2 ) −C ′ ( (d rob α+d vul β+ 1) d rob α+d vul β+ d irr γ √ N +d irr r d irr N + 1 ! γ 2 ) ε,(A168) where c:= (max i∈S rob ∪S vul C i )(max i∈S rob ∪S vul C i,2 ) min i∈S rob ∪S vul C 3 i .(A169) In particular, if there exists a constantC ′ >0such that1−cq rob ≥C ′ and1−cq vul ≥C ′ , then there exist constantsC 1 ,C 2 >0such that Ineq. (9) holds. Proof.Similarly to Eq. (A33), we can solve the minimization as min ∥∆∥ ∞ ≤ε y N+1 [f(Z ∆ ;P,Q)] d+1,N+1 = min ∥∆∥ ∞ ≤ε 1 N b ⊤ Z ∆ MZ ⊤ ∆ Ay N+1 (x N+1 +∆)(A170) = 1 N b ⊤ Z ∆ MZ ⊤ ∆ Ay N+1 x N+1 −ε 1 N b ⊤ Z ∆ MZ ⊤ ∆ A 1 .(A171) By Eq. (A161), we can rearrange the first term as E 1 N b ⊤ Z ∆ MZ ⊤ ∆ Ay N+1 x N+1 =1 ⊤ d+1 E[yx]E[yx ⊤ ] E[yx ⊤ ] E[y N+1 x N+1 ] +1 ⊤ d E[(yx−E[yx])(yx−E[yx]) ⊤ ]E[y N+1 x N+1 ]. (A172) The first term of Eq. (A172) can be rearranged as 1 ⊤ d+1 E[yx]E[yx ⊤ ] E[yx ⊤ ] E[y N+1 x N+1 ] =1 ⊤ d+1    C i C j α 2 C i C j αβ0 C i C j αβ C i C j β 2 0 00C 2 i γ 2 I C i αC i β0    " C i α C i β 0 # (A173) = X i∈S rob C i α+ X i∈S vul C i β+ 1 ! X i∈S rob C 2 i α 2 + X i∈S vul C 2 i β 2 ! (A174) = min i∈S rob ∪S vul C 3 i (d rob α+d vul β)(d rob α 2 +d vul β 2 ) + X i∈S rob C 2 i α 2 + X i∈S vul C 2 i β 2 .(A175) Consider the second term of Eq. (A172). Now, |E[(yx i −E[yx i ])(yx j −E[yx j ])]| 37 Published as a conference paper at ICLR 2026 ≤    p C i,2 p C j,2 α 2 (i,j∈S rob ) p C i,2 p C j,2 β 2 (i,j∈S vul ) p C i,2 p C j,2 αβ(i∈S rob ∧j∈S vul )∨(i∈S vul ∧j∈S rob ) .(A176) Let S:=    i∈S rob ∪S vul | X j∈S rob ∪S vul E[(yx i −E[yx i ])(yx j −E[yx j ])]<0    .(A177) The second term of Eq. (A172) can be computed as 1 ⊤ d E[(yx−E[yx])(yx−E[yx]) ⊤ ]E[y N+1 x N+1 ] ≥−                  p C i,2 α P j∈S rob p C j,2 α+ P j∈S vul p C j,2 β . . . 
p C i,2 α P j∈S rob p C j,2 α+ P j∈S vul p C j,2 β          ≤q rob d rob 0 p C i,2 β P j∈S rob p C j,2 α+ P j∈S vul p C j,2 β . . . p C i,2 β P j∈S rob p C j,2 α+ P j∈S vul p C j,2 β          ≤q vul d vul 0                  ⊤ " C i α C i β 0 # (A178) =− X i∈S rob p C i,2 α+ X i∈S vul p C i,2 β ! × X i∈S rob ∩S C i p C i,2 α 2 + X i∈S vul ∩S C i p C i,2 β 2 ! (A179) ≥− max i∈S rob ∪S vul p C i,2 max i∈(S rob ∪S vul )∩S C i p C i,2 ×(d rob α+d vul β)(q rob d rob α 2 +q vul d vul β 2 )(A180) ≥− max i∈S rob ∪S vul C i max i∈S rob ∪S vul C i,2 (d rob α+d vul β)(q rob d rob α 2 +q vul d vul β 2 ).(A181) By Lemma F.2, we can compute the second term as E 1 N b ⊤ Z ∆ MZ ⊤ ∆ A 1 =O (d rob α+d vul β+ 1) d rob α+d vul β+ d irr γ √ N +d irr r d irr N + 1 ! γ 2 ! .(A182) Finally, E 1 N b ⊤ Z ∆ MZ ⊤ ∆ Ay N+1 x N+1 −εE 1 N b ⊤ Z ∆ MZ ⊤ ∆ A 1 ≥ min i∈S rob ∪S vul C 3 i (d rob α+d vul β)(d rob α 2 +d vul β 2 ) + X i∈S rob C 2 i α 2 + X i∈S vul C 2 i β 2 − max i∈S rob ∪S vul C i max i∈S rob ∪S vul C i,2 (d rob α+d vul β)(q rob d rob α 2 +q vul d vul β 2 ) +O (d rob α+d vul β+ 1) d rob α+d vul β+ d irr γ √ N +d irr r d irr N + 1 ! γ 2 ! .(A183) 38 Published as a conference paper at ICLR 2026 Lemma F.2.If(x 1 ,y 1 ),...,(x N ,y N )are i.i.d. and followD te , then E 1 N b ⊤ Z ∆ MZ ⊤ ∆ A 1 =O (d rob α+d vul β+ 1) d rob α+d vul β+ d irr γ √ N +d irr r d irr N + 1 ! γ 2 ! ,(A184) whereb=1 d+1 andA ⊤ := [ I d 0 d ]. Proof.We can rearrange the given expectation as E 1 N b ⊤ Z ∆ MZ ⊤ ∆ A 1 =E " 1 N 1 ⊤ d+1 P N n=1 x n x ⊤ n P N n=1 y n x n P N n=1 y n x ⊤ n N I d 0 ⊤ d 1 # (A185) =E " 1 N 1 ⊤ d+1 P N n=1 x n x ⊤ n P N n=1 y n x ⊤ n 1 # (A186) = d X i=1 E   1 N N X n=1   y n + d X j=1 x n,j   x n,i   .(A187) By the Lyapunov inequality, forN+ 1i.i.d. random variablesX,X 1 ,...,X N , E " 1 N N X n=1 X n # ≤ v u u u t E   1 N N X n=1 X n ! 
2   = r 1 N E[X 2 ] + N−1 N E[X] 2 .(A188) Thus, using(x,y)∼D te , d X i=1 E   1 N N X n=1   y n + d X j=1 x n,j   x n,i   ≤ d X i=1 v u u u u t 1 N E      y+ d X j=1 x j   2 x 2 i    + N−1 N E     y+ d X j=1 x j   x i   2 .(A189) From Lemma F.3, we can compute the second term of using E     y+ d X j=1 x j   x i   =E[yx i ] + d X j=1 E[x j x i ](A190) =    O(α(d rob α+d vul β+ 1)) (i∈S rob ) O(β(d rob α+d vul β+ 1)) (i∈S vul ) O(γ 2 )(i∈S irr ) .(A191) From Lemma F.3, we can compute the first term of using E      y+ d X j=1 x j   2 x 2 i    =E[x 2 i ] + 2 d X j=1 E[yx j x 2 i ] + d X j,k=1 E[x j x k x 2 i ](A192) =    O(α 2 (d rob α+d vul β+ 1) 2 +d irr γ 2 ) (i∈S rob ) O(β 2 (d rob α+d vul β+ 1) 2 +d irr γ 2 ) (i∈S vul ) O(γ 2 (d rob α+d vul β+ 1) 2 +d irr γ 2 ) (i∈S irr ) .(A193) 39 Published as a conference paper at ICLR 2026 Thus, d X i=1 v u u u u t 1 N E      y+ d X j=1 x j   2 x 2 i    + N−1 N E     y+ d X j=1 x j   x i   2 =O d rob α(d rob α+d vul β+ 1) + r d irr N αγ ! +d vul β(d rob α+d vul β+ 1) + r d irr N βγ ! +d irr γ 2 + γ √ N (d rob α+d vul β+ 1) + p d irr γ ! (A194) =O (d rob α+d vul β+ 1) d rob α+d vul β+ d irr γ √ N +d irr r d irr N + 1 ! γ 2 ! 
.(A195) Lemma F.3.If(x,y)∼D te , then (a) E[x j x i ] =              O(α 2 )(i,j∈S rob ) O(β 2 )(i,j∈S vul ) O(γ 2 )(i=j)∧(i,j∈S irr ) O(αβ) (i∈S rob ∧j∈S vul )∨(i∈S vul ∧j∈S rob ) 0(i̸=j)∧(i∈S irr ∨j∈S irr ) .(A196) (b) E[yx j x 2 i ] =                      O(α 3 )(i,j∈S rob ) O(β 3 )(i,j∈S vul ) O(α 2 β) (i∈S rob ∧j∈S vul ) O(αβ 2 ) (i∈S vul ∧j∈S rob ) O(αγ 2 ) (i∈S irr ∧j∈S rob ) O(βγ 2 ) (i∈S irr ∧j∈S vul ) 0(j∈S irr ) .(A197) (c) E[x j x k x 2 i ] =                                    O(α 4 )(i,j,k∈S rob ) O(β 4 )(i,j,k∈S vul ) O(γ 4 )(j=k)∧(i,j,k∈S irr ) O(α 3 β)(i∈S rob )∧(j∈S rob ∧k∈S vul )∨(j∈S vul ∧k∈S rob ) O(αβ 3 )(i∈S vul )∧(j∈S rob ∧k∈S vul )∨(j∈S vul ∧k∈S rob ) O(α 2 β 2 )(i∈S rob ∧j,k∈S vul )∨(i∈S vul ∧j,k∈S rob ) O(α 2 γ 2 )(i∈S irr ∧j,k∈S rob )∨(j=k∧j,k∈d irr ∧i∈S rob ) O(β 2 γ 2 )(i∈S irr ∧j,k∈S vul )∨(j=k∧j,k∈d irr ∧i∈S vul ) O(αβγ 2 ) (i∈S irr )∧(j∈S rob ∧k∈S vul )∨(j∈S vul ∧k∈S rob ) 0(j̸=k)∧(j∈S irr ∨k∈S irr ) .(A198) 40 Published as a conference paper at ICLR 2026 Proof.We first note that E[x 2 i ] =E[(yx i ) 2 ] =E[(yx i −E[yx i ]) 2 ] +E[yx i ] 2 =    O(α 2 ) (i∈S rob ) O(β 2 ) (i∈S vul ) O(γ 2 ) (i∈S irr ) ,(A199) E[yx 3 i ] =E[(yx i ) 3 ](A200) =E[(yx i −E[yx i ]) 3 ] + 3E[(yx i ) 2 ]E[yx i ]−2E[yx i ] 3 (A201) =    O(α 3 ) (i∈S rob ) O(β 3 ) (i∈S vul ) 0(i∈S irr ) ,(A202) E[x 4 i ] =E[(yx i −E[yx i ]) 4 ] + 4E[yx 3 i ]E[yx i ]−6E[x 2 i ]E[yx i ] 2 + 3E[yx i ] 4 (A203) =    O(α 4 ) (i∈S rob ) O(β 4 ) (i∈S vul ) O(γ 4 ) (i∈S irr ) .(A204) (a) For(i̸=j)∧(i∈ S irr ∨j∈ S irr ),E[x j x i ] =E[x j ]E[x i ] = 0. Using the Cauthy-Schwarz inequality, E[x j x i ]≤ q E[x 2 j ] q E[x 2 i ](A205) =        O(α 2 )(i,j∈S rob ) O(β 2 )(i,j∈S vul ) O(γ 2 )(i,j∈S irr )∧(i=j) O(αβ) (i∈S rob ∧j∈S vul )∨(i∈S vul ∧j∈S rob ) .(A206) (b) Forj∈S irr ,j=i,E[yx j x 2 i ] =E[y]E[x 3 i ] = 0. Forj∈S irr ,j̸=i,E[yx j x 2 i ] =E[x j ]E[yx 2 i ] = 0. 
Using the Cauchy–Schwarz inequality,

E[y x_j x_i²] ≤ √(E[x_j²]) √(E[x_i⁴]) =
  O(α³)   (i, j ∈ S_rob)
  O(β³)   (i, j ∈ S_vul)
  O(α²β)  (i ∈ S_rob ∧ j ∈ S_vul)
  O(αβ²)  (i ∈ S_vul ∧ j ∈ S_rob)
  O(αγ²)  (i ∈ S_irr ∧ j ∈ S_rob)
  O(βγ²)  (i ∈ S_irr ∧ j ∈ S_vul). (A207)

(c) For (j ≠ k) ∧ (j ∈ S_irr ∨ k ∈ S_irr), E[x_j x_k x_i²] = 0. For j = k, using the Cauchy–Schwarz inequality,

E[x_j x_k x_i²] ≤ √(E[x_j⁴]) √(E[x_i⁴]) =
  O(γ⁴)   (j = k) ∧ (i, j, k ∈ S_irr)
  O(α²γ²) (j = k) ∧ (j, k ∈ S_irr ∧ i ∈ S_rob)
  O(β²γ²) (j = k) ∧ (j, k ∈ S_irr ∧ i ∈ S_vul). (A208)

Using the Cauchy–Schwarz inequality,

E[x_j x_k x_i²] ≤ √(E[x_j²]) √(E[x_k²]) √(E[x_i⁴]) (A209)
=
  O(α⁴)   (i, j, k ∈ S_rob)
  O(β⁴)   (i, j, k ∈ S_vul)
  O(α³β)  (i ∈ S_rob) ∧ ((j ∈ S_rob ∧ k ∈ S_vul) ∨ (j ∈ S_vul ∧ k ∈ S_rob))
  O(αβ³)  (i ∈ S_vul) ∧ ((j ∈ S_rob ∧ k ∈ S_vul) ∨ (j ∈ S_vul ∧ k ∈ S_rob))
  O(α²β²) (i ∈ S_rob ∧ j, k ∈ S_vul) ∨ (i ∈ S_vul ∧ j, k ∈ S_rob)
  O(α²γ²) (i ∈ S_irr ∧ j, k ∈ S_rob)
  O(β²γ²) (i ∈ S_irr ∧ j, k ∈ S_vul)
  O(αβγ²) (i ∈ S_irr) ∧ ((j ∈ S_rob ∧ k ∈ S_vul) ∨ (j ∈ S_vul ∧ k ∈ S_rob)). (A210)

G PROOF OF THEOREM 3.7 (TRADE-OFF)

Theorem 3.7 (Accuracy–robustness trade-off). Assume d_rob = 1, d_vul = d − 1, and d_irr = 0. In addition to Assumption 3.2, for (x, y) ∼ D_te, suppose that y x_i takes α with probability p > 0.5 and −α with probability 1 − p for i ∈ S_rob. Moreover, y x_i takes β with probability one for i ∈ S_vul. Define f̃(P, Q) := E_{(x_n, y_n)_{n=1}^N i.i.d.∼ D_te}[y_{N+1} [f(Z_0; P, Q)]_{d+1, N+1}]. Then, there exist strictly positive functions g_1(d, α, β) and g_2(d, α, β) such that f̃(P_std, Q_std) = g_1(d, α, β)(α + (d − 1)β)  (w.p.
p) g 1 (d,α,β)(−α+ (d−1)β) (w.p.1−p) ,(10) ̃ f(P adv ,Q adv )≤g 2 (d,α,β)−(2p−1)α 2 + (d−1)β 2 (w.p.1−p).(11) Proof.UsingbandAdefined in Appendix E, we can rearrange ̃ f(P,Q)as ̃ f(P,Q) :=E (x n ,y n ) N n=1 [y N+1 [f(Z 0 ;P,Q)] d+1,N+1 ](A211) = 1 N b ⊤ E (x n ,y n ) N n=1 [Z 0 MZ ⊤ 0 ]Ay N+1 x N+1 .(A212) Standard Transformer.Similarly to the proof of Theorem 3.5, using some positive function g(d,α,β)>0, we can representE[b ⊤ Z 0 MZ ⊤ 0 A/N] = [g(d,α,β)·g(d,α,β)]. Thus, 1 N bE (x n ,y n ) N n=1 [Z 0 MZ ⊤ 0 ]Ay N+1 x N+1 =    g(d,α,β) . . . g(d,α,β)    ⊤ y N+1 x N+1 (A213) =g(d,α,β)y N+1 d X i=1 x N+1,i (A214) = α+ (d−1)β(w.p. p) −α+ (d−1)β(w.p.1−p) .(A215) Adversarially Trained Transformer.Now, 1 N E (x n ,y n ) N n=1 [Z 0 MZ ⊤ 0 ] = E[(yx)(yx ⊤ )]E[yx] E[yx ⊤ ]1 (A216) =         α 2 (2p−1)αβ·(2p−1)αβ(2p−1)α (2p−1)αβ 2 ·β 2 β (2p−1)αβ 2 ·β 2 β . . . (2p−1)αβ 2 ·β 2 β (2p−1)αβ·β1         .(A217) Thus, 1 N b ⊤ E (x n ,y n ) N n=1 [Z 0 MZ ⊤ 0 ]A=     α+ (d−1)(2p−1)β+ (2p−1) β(2p−1)α+ (d−1)β+ 1 . . . β(2p−1)α+ (d−1)β+ 1     ⊤ .(A218) 42 Published as a conference paper at ICLR 2026 Therefore, 1 N b ⊤ E (x n ,y n ) N n=1 [Z 0 MZ ⊤ 0 ]Ay N+1 x N+1 =     α+ (d−1)(2p−1)β+ (2p−1) β(2p−1)α+ (d−1)β+ 1 . . . β(2p−1)α+ (d−1)β+ 1     ⊤     y N+1 x N+1,1 β . . . β     (A219) =        α 2 α+ (d−1)(2p−1)β+ (2p−1) +(d−1)β 2 (2p−1)α+ (d−1)β+ 1(w.p. p) −α 2 α+ (d−1)(2p−1)β+ (2p−1) +(d−1)β 2 (2p−1)α+ (d−1)β+ 1(w.p.1−p) .(A220) In particular, −α 2 α+ (d−1)(2p−1)β+ (2p−1)+ (d−1)β 2 (2p−1)α+ (d−1)β+ 1 =(2p−1)α+ (d−1)β+ 1(−Cα 2 + (d−1)β 2 ),(A221) where C= α+ (d−1)(2p−1)β+ (2p−1) (2p−1)α+ (d−1)β+ 1 > (2p−1) 2 α+ (d−1)(2p−1)β+ (2p−1) (2p−1)α+ (d−1)β+ 1 (A222) = 2p−1.(A223) HPROOF OFTHEOREMH.1 (NEED FORLARGERSAMPLESIZE) Theorem H.1(Need for Larger Sample Size).Assume the same assumptions in Theorem 3.7. Then, E x N+1 ,y N+1 [y N+1 [f(Z 0 ;P std ,Q std )] d+1,N+1 ]>0(w.p. 
at least1−e −pN ).(A224) In addition, suppose that there exists a constant0< C <1such that(d−1)β+ 1< Cα. Moreover, assume thatNis an even number. Then, asp→ 1 2 withp > 1 2 , for4≤N≤ 2 C , E x N+1 ,y N+1 [y N+1 [f(Z 0 ;P adv ,Q adv )] d+1,N+1 ]>0 w.p. at most1− 0.483 √ N <1−e −pN .(A225) Proof.UsingbandAdefined in Appendix E, we can calculate E x N+1 ,y N+1 [y N+1 [f(Z 0 ;P,Q)] d+1,N+1 ] = 1 N b ⊤ Z 0 MZ ⊤ 0 AE[y N+1 x N+1 ].(A226) Now, 1 N Z 0 MZ ⊤ 0 =           α 2 β N P N n=1 y n x n,1 · β N P N n=1 y n x n,1 1 N P N n=1 y n x n,1 β N P N n=1 y n x n,1 β 2 ·β 2 β β N P N n=1 y n x n,1 β 2 ·β 2 β . . . β N P N n=1 y n x n,1 β 2 ·β 2 β 1 N P N n=1 y n x n,1 β·β1           .(A227) Standard Transformer.From the configuration ofbandA, all the entries ofb ⊤ Z 0 MZ ⊤ 0 Aare the same. Since all the entries ofE[y N+1 x N+1 ]are positive, with some positive functiong(d,α,β)>0, 1 N b ⊤ Z 0 MZ ⊤ 0 AE[y N+1 x N+1 ] =g(d,α,β) 1 N 1 ⊤ d+1 Z 0 MZ ⊤ 0 1 d+1 .(A228) 43 Published as a conference paper at ICLR 2026 Now, 1 N 1 ⊤ d+1 Z 0 MZ ⊤ 0 1 d+1 = (d−1) 2 β 2 + 2(d−1)β+ 1 +α 2 + 2 N N X n=1 y n x n,1 + 2(d−1) β N N X n=1 y n x n,1 (A229) =(d−1)β+ 1 2 +α 2 + 2(d−1)β+ 1 N N X n=1 y n x n,1 (A230) = [(d−1)β+ 1−α] 2 + 2(d−1)β+ 1 N N X n=1 (α+y n x n,1 )(A231) >0(w.p. at least1−(1−p) N >1−e −pN ).(A232) Adversarially Trained Transformer.Note thatE[y N+1 x N+1 ] = [(2p−1)α β·β]. Thus, 1 N 1 ⊤ d+1 Z 0 MZ ⊤ 0 I d E[y N+1 x N+1 ] = (2p−1)α α 2 + (d−1) β N N X n=1 y n x n,1 + 1 N N X n=1 y n x n,1 ! + (d−1)β β N N X n=1 y n x n,1 + (d−1)β 2 +β ! 
(A233) = [(2p−1)α 3 + (d−1)β 2 (d−1)β+ 1] + [(2p−1)α(d−1)β+ 1+ (d−1)β 2 ] 1 N N X n=1 y n x n,1 .(A234) This indicatesE x N+1 ,y N+1 [y N+1 [f(Z 0 ;P adv ,Q adv )] d+1,N+1 ]>0only if 1 N N X n=1 y n x n,1 >− (2p−1)α 3 + (d−1)β 2 (d−1)β+ 1 (2p−1)α(d−1)β+ 1+ (d−1)β 2 .(A235) Representingy n x n,1 =α(2X n −1)withX n taking 1 with probabilitypand 0 with probability 1−p, 1 N N X n=1 α(2X n −1)>− (2p−1)α 3 + (d−1)β 2 (d−1)β+ 1 (2p−1)α(d−1)β+ 1+ (d−1)β 2 ⇐⇒ N X n=1 X n > N 2 1− 1 α (2p−1)α 3 + (d−1)β 2 (d−1)β+ 1 (2p−1)α(d−1)β+ 1+ (d−1)β 2 .(A236) LetY∼B(N,p), whereB(N,p)is the Binomial distribution. Consider the following probability: P Y∼B(N,p) Y > N 2 1− 1 α (2p−1)α 3 + (d−1)β 2 (d−1)β+ 1 (2p−1)α(d−1)β+ 1+ (d−1)β 2 .(A237) Whenp→1/2, P Y∼B(N,p) Y > N 2 1− 1 α (2p−1)α 3 + (d−1)β 2 (d−1)β+ 1 (2p−1)α(d−1)β+ 1+ (d−1)β 2 →P Y∼B(N,1/2) Y > N 2 1− (d−1)β+ 1 α (A238) ≤P Y∼B(N,1/2) Y > N 2 (1−C) (A239) 44 Published as a conference paper at ICLR 2026 ≤P Y∼B(N,1/2) Y > N 2 −1 .(A240) From Ash (1990), for an integer0< k < N/2, P Y∼B(N,1/2) [Y≤k]≥ 1 q 8N k N (1− k N ) exp −ND k N // 1 2 ,(A241) whereDis the Kullback–Leibler divergence. Substitutingk= N 2 −1, P Y∼B(N,1/2) Y≤ N 2 −1 ≥ 1 q 8N( 1 2 − 1 N )1−( 1 2 − 1 N ) exp −ND 1 2 − 1 N // 1 2 (A242) = 1 q 2(1− 4 N 2 ) 1 √ N exp −ND 1 2 − 1 N // 1 2 .(A243) Note that D 1 2 − 1 N // 1 2 = 1 2 1− 2 N ln 1− 2 N + 1 + 2 N ln 1 + 2 N .(A244) ForN≥4, 1 q 2(1− 4 N 2 ) exp −ND 1 2 − 1 N // 1 2 >0.483.(A245) In summary, P Y∼B(N,1/2) Y > N 2 −1 = 1−P Y∼B(N,1/2) Y≤ N 2 −1 ≤1− 0.483 √ N .(A246) 45
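The binomial tail bound behind (A241)–(A246) can be sanity-checked against the exact distribution. The sketch below is our own illustration (not part of the paper's code): it computes the exact CDF of B(N, 1/2) and verifies P[Y ≤ N/2 − 1] ≥ 0.483/√N for even N from 4 to 200, as the Ash-based estimate guarantees.

```python
# Sanity check (our own sketch) of the binomial tail bound used in
# (A241)-(A246): P_{Y ~ B(N, 1/2)}[Y <= N/2 - 1] >= 0.483 / sqrt(N)
# for even N >= 4, compared against the exact binomial CDF.
import math

def cdf_half(N, k):
    """Exact P[Y <= k] for Y ~ Binomial(N, 1/2)."""
    return sum(math.comb(N, i) for i in range(k + 1)) / 2**N

for N in range(4, 201, 2):
    assert cdf_half(N, N // 2 - 1) >= 0.483 / math.sqrt(N), N
```

As N grows, the exact probability P[Y ≤ N/2 − 1] approaches 1/2 while the bound 0.483/√N decays, so the inequality becomes increasingly loose; it is tightest near N = 4, where P[Y ≤ 1] = 5/16 against a bound of about 0.2415.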