Paper deep dive
Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions
Yihao Xue, Jiping Li, Baharan Mirzasoleiman
Models: 150 small transformers (molecular), 22 MTEB embedding models, 28 smaller LLMs (weak supervisors), MolBERT, Qwen-7B, nvidia/NV-Embed-v2
Abstract
Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.
Tags
Links
- Source: https://arxiv.org/abs/2502.00620
- Canonical: https://arxiv.org/abs/2502.00620
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 6:13:28 PM
Summary
The paper investigates Weak-to-Strong Generalization (W2SG) by analyzing the internal representations of weak and strong models. It introduces a theoretical framework using kernels derived from principal components of these representations to define a space that quantifies the 'prediction gap' (PredGap). This metric explains how strong models correct errors from weak supervisors and provides a label-free method to predict W2SG performance trends, validated across molecular prediction tasks and various LLMs.
Entities (5)
Relation Signals (3)
W2SG → is characterized by → PredGap
confidence 95% · The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision.
PredGap → predicts → W2SG performance
confidence 95% · Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends
Strong Model → corrects errors from → Weak Model
confidence 90% · This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model
Cypher Suggestions (2)
Find all concepts related to W2SG performance metrics · confidence 90% · unvalidated
MATCH (c:Concept {name: 'W2SG'})-[:IS_CHARACTERIZED_BY]->(m:Metric) RETURN m.name
Identify models involved in W2SG research · confidence 85% · unvalidated
MATCH (m:Model)-[:SUPERVISES]->(s:Model) WHERE (m)-[:TYPE]->(:Concept {name: 'W2SG'}) RETURN m.name, s.name
Full Text
622,109 characters extracted from source content.
Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

Yihao Xue, Jiping Li, Baharan Mirzasoleiman

Abstract. Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.

Keywords: Machine Learning, ICML

1 Introduction

As AI systems become increasingly capable of performing complex tasks beyond human comprehension, humans will inevitably serve as "weak supervisors" in aligning advanced AI. To investigate this fundamental problem, Burns et al. (2023) propose an analogy that can be empirically explored today: can a weak model effectively supervise a stronger one?
This framework, known as Weak-to-Strong Generalization (W2SG), involves leveraging a weak model, finetuned on a specific task, to supervise the finetuning of a stronger model. In this analogy, the finetuning task represents concepts tied to human values or skills; the finetuned weak model represents humans, limited in capability but aligned with human values; and the strong model represents superhuman intelligence, powerful but initially unaligned. Promising results from Burns et al. (2023) show that the strong model can significantly outperform its weak supervisor. For instance, a GPT-4 model supervised by a fine-tuned GPT-2-level model achieves nearly 20% better performance than the weak supervisor on NLP tasks.

Figure 1: An illustration of our main result (Thm. 3.8). The path connecting the two highlighted regions represents the overlap between the complement of a scaled span of the weak model's principal kernel and the scaled span of the strong model's principal kernel, determining the contribution of the weak model's errors to PredGap.

At first glance, this phenomenon seems counterintuitive. After all, the strong model is explicitly trained to fit the weak supervision. Yet, it goes beyond mere imitation and generalizes better. It is important to understand which intrinsic properties of the weak and strong models enable W2SG. Efforts have been made toward a theoretical understanding of W2SG. Charikar et al. (2024) demonstrate that the disagreement between finetuned weak and strong models correlates with performance gains in W2SG. However, their analysis assumes high-quality representations in the strong model and does not address the role of the weak model's representations. The analyses of Lang et al. (2024) and Shin et al. (2024) assume a generalized version of an adversarially robust strong model, where W2SG arises solely from underfitting weak supervision.
This framework excludes important scenarios such as benign overfitting, where W2SG occurs despite overfitting. Wu & Sahai (2024) studied benign overfitting in particular and examined the impact of the number of weakly labeled data points. However, we still lack an overarching explanation that captures the interaction between weak and strong models in enabling W2SG, as well as how it determines which weak supervision errors are corrected in general scenarios. The challenge lies in characterizing abstract concepts including the knowledge embedded in the weak and strong models, its utilization, and their respective roles in W2SG. Striving for results that are general enough to capture a spectrum of behaviors without overly strict assumptions further adds to the complexity.

To address this, we adopt a representation-based perspective, analyzing finetuning as a process of learning a function on fixed representations to uncover how the internal structures of weak and strong models influence W2SG. Under a very general assumption about the representations, we demonstrate (illustrated in Fig. 1) that the key quantifiable property governing W2SG is the overlap between two spaces: one representing what the weak model's principal representations (capturing key knowledge gained during pretraining) do not cover, and the other representing what the strong model's principal representations do cover. Errors in weak supervision that fall within this overlap hinder the strong model from reaching its full potential, leading to a prediction gap between the strong model finetuned with weak supervision and that finetuned with ground truth labels. A smaller overlap implies that fewer of the weak model's mistakes are replicated, resulting in better W2SG performance. We then demonstrate an important use case of our main result: explaining benign overfitting, where the W2S model overfits the weak model's mistakes on finetuning data yet paradoxically generalizes better on the test set.
Using our theoretical framework, we establish a general condition for benign overfitting and apply it to a toy example to concretely illustrate the role of representations in error replication: errors that do not align with the kernel defined by the strong model's principal representations are not replicated by the W2S model, regardless of the extent of overfitting. Our theory offers a metric that predicts trends in W2SG performance in practice without having the finetuning task labels. This metric, which measures the overlap between the two highlighted regions in Fig. 1, shows a strong correlation with W2SG performance across various settings. The extensive experiments across 8 datasets, involving 150 small transformers and 52 LLMs, not only validate our theoretical insights but also suggest their potential applications in managing W2SG, providing a deeper understanding of LLM behavior through their internal representation structures.

2 Related Work

There have been many recent works that theoretically explore W2SG. Somerstep et al. (2024) adopt a transfer learning perspective, focusing on improving W2SG through in-context learning rather than explaining how W2SG emerges. Lang et al. (2024) and Shin et al. (2024) analyze W2SG by considering a generalized version of adversarially robust models, showing that certain errors in weak supervision can be corrected by leveraging the good neighborhood structure in the data. However, their argument attributes error correction solely to underfitting, i.e., avoiding fitting mislabeled finetuning data. This overlooks an important scenario recently discussed in Wu & Sahai (2024), known as benign overfitting, where the strong model overfits mislabeled finetuning data but still achieves accurate test-time predictions. Benign overfitting is particularly relevant in practice, as large neural networks often have the capacity to overfit while still generalizing effectively (Zhang et al., 2021). Closer to our setting, Charikar et al.
(2024) formalized W2SG using a representation-based perspective. Their work demonstrates that the performance gain in W2SG correlates with the disagreement between the finetuned weak and strong models, assuming high-quality representations for the strong model. While insightful, it does not characterize the role of the weak model's representations, leaving the exact conditions for effective W2SG unclear. Compared to Lang et al. (2024), we analyze W2SG in a more realistic setting where error correction can result from either underfitting or overfitting, allowing for a full spectrum of behaviors. While benign overfitting is not our primary focus, we discuss it as a special case in Sec. 4 due to its importance and offer new insights. Compared to Charikar et al. (2024), we explicitly link W2SG performance to the interaction between the weak and strong models' representations, providing a more comprehensive view of how the intrinsic properties of the two models jointly determine W2SG.

3 W2SG from a Representation Perspective

We first formalize finetuning from a representation-based perspective, then introduce the properties of the representations considered, and finally present our main theory.

3.1 A representation-based perspective

The knowledge a model acquires through pretraining enables it to interpret inputs, extract relevant information, and organize it into meaningful intermediate states. This can be formalized as a "representation function" $h$, which transforms data into structured representations. Finetuning leverages this knowledge to produce the desired output, which we formalize as learning a new function $f$ on the fixed $h$. The entire model is thus represented as the composition $f \circ h$. For simplicity, we consider the outputs of $h$ as vectors and focus on the case where $f$ is a linear function.
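The $f \circ h$ decomposition is easy to state concretely: the pretrained encoder $h$ is frozen, and finetuning learns only the head $f$ on its outputs. A minimal sketch under that framing (the encoder weights and dimensions below are arbitrary illustrations, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
W_enc = rng.normal(size=(12, 4))   # frozen "pretrained" weights (illustrative)

def h(x):
    """Fixed representation function: a frozen nonlinear encoder."""
    return np.tanh(W_enc @ x)

def make_model(w):
    """Compose a linear head f(z) = w^T z with the frozen encoder h."""
    return lambda x: w @ h(x)

# Finetuning would adjust only w; h's weights never change
w_head = rng.normal(size=12)
model = make_model(w_head)
x = rng.normal(size=4)
pred = model(x)  # scalar prediction f(h(x))
```

The point of the sketch is only that the search space during finetuning is the head $w$, with the representation geometry of $h$ held fixed.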
This is practically relevant because: (1) Training a linear task head on fixed representations is common with large foundation models, e.g., using embedding LLMs (Muennighoff et al., 2022) or linear probing on intermediate activations (Zou et al., 2023; Nanda et al., 2023; Marks & Tegmark, 2023). (2) Finetuning of LLMs largely operates in the NTK regime (Jacot et al., 2018), where training dynamics are captured by a linear model on representations derived from model gradients (Malladi et al., 2023). (3) Our experiments in Sec. 5 show that insights from analyzing linear functions generalize to the complex non-linear setting of finetuning entire LLMs from pretrained weights.

3.2 Preliminaries

Notations. We sometimes abbreviate a matrix $A \in \mathbb{R}^{l \times m}$ as $[A_{i,j}]_{1 \le i \le l, 1 \le j \le m}$ when each element $A_{i,j}$ can be expressed as a generic term in terms of its indices. $\lambda_{\min,\neq 0}(A)$ denotes the smallest nonzero eigenvalue of matrix $A$.

Data. Let $\mathcal{D}$ denote the distribution of the finetuning task's data, defined over input-label pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} = \mathbb{R}$. In W2SG, we have two splits of data sampled from $\mathcal{D}$. The first subset, $\tilde{D} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{\tilde{n}}$, consists of $\tilde{n}$ i.i.d. samples and is used for finetuning the weak model. The second subset, $\hat{D} = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{\hat{n}}$, with $\hat{n}$ i.i.d.
samples, is used for finetuning the strong model. Note that the weak model's outputs will be used as labels in place of the actual $\hat{y}_i$'s. In our notation, quantities associated with the two splits are marked by the diacritical symbols $\tilde{\ }$ and $\hat{\ }$, respectively.

Models. We denote the weak and strong models' representation functions as $h_w$ and $h_s$, respectively. The finetuned weak model is represented as $f_w \circ h_w$, with

$$f_w = \arg\min_{f \in \mathcal{F}_w} \left( \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \big(f(h_w(\tilde{x}_i)) - \tilde{y}_i\big)^2 + \beta_w R(f) \right),$$

where $R(\cdot)$ represents $\ell_2$ regularization. The W2S model, which refers to the strong model finetuned with weak supervision, is represented as $f_{\mathrm{w2s}} \circ h_s$, with

$$f_{\mathrm{w2s}} = \arg\min_{f \in \mathcal{F}_s} \left( \frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} \big(f(h_s(\hat{x}_i)) - f_w(h_w(\hat{x}_i))\big)^2 + \beta_s R(f) \right).$$

Additionally, as a reference, we define the strong ceiling model as the strong model finetuned with the ground truth labels.
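These finetuning objectives are ridge regressions on fixed representations, so each head has a closed form. A minimal NumPy sketch of the pipeline; all data, dimensions, and the regularization strength here are synthetic illustrations, not the paper's experimental setup:

```python
import numpy as np

def fit_linear_head(H, y, beta):
    """Ridge regression on fixed representations:
    argmin_w (1/n) * ||H w - y||^2 + beta * ||w||^2, solved in closed form."""
    n, d = H.shape
    return np.linalg.solve(H.T @ H / n + beta * np.eye(d), H.T @ y / n)

rng = np.random.default_rng(0)
n_tilde, n_hat, d_w, d_s = 200, 200, 16, 32

# Weak model: head f_w fit on ground-truth labels of the weak split D~
Hw_tilde = rng.normal(size=(n_tilde, d_w))   # h_w representations on D~
y_tilde = rng.normal(size=n_tilde)
w_weak = fit_linear_head(Hw_tilde, y_tilde, beta=1e-2)

# Strong split D^: the weak model's outputs replace the true labels
Hw_hat = rng.normal(size=(n_hat, d_w))       # h_w representations on D^
Hs_hat = rng.normal(size=(n_hat, d_s))       # h_s representations on D^
weak_labels = Hw_hat @ w_weak

# W2S head fits weak supervision; strong-ceiling head fits ground truth
y_hat = rng.normal(size=n_hat)
w_w2s = fit_linear_head(Hs_hat, weak_labels, beta=1e-2)
w_sc = fit_linear_head(Hs_hat, y_hat, beta=1e-2)
```

The closed form follows from setting the gradient of the objective to zero: $(H^\top H / n + \beta I) w = H^\top y / n$.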
It is represented as $f_{\mathrm{sc}} \circ h_s$, with

$$f_{\mathrm{sc}} = \arg\min_{f \in \mathcal{F}_s} \left( \frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} \big(f(h_s(\hat{x}_i)) - \hat{y}_i\big)^2 + \beta_s R(f) \right).$$

Evaluation. At test time, given any labeling function $g: \mathcal{X} \to \mathcal{Y}$, we define its test error as the loss on the population: $\mathrm{Err}(g) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[(g(x) - y)^2]$. We then introduce the shorthand notations: the weak model's test error $\mathrm{Err}_w = \mathrm{Err}(f_w \circ h_w)$, the W2S model's test error $\mathrm{Err}_{\mathrm{w2s}} = \mathrm{Err}(f_{\mathrm{w2s}} \circ h_s)$, and the strong ceiling model's test error $\mathrm{Err}_{\mathrm{sc}} = \mathrm{Err}(f_{\mathrm{sc}} \circ h_s)$. $\mathrm{Err}_{\mathrm{w2s}}$ measures the performance achieved through W2SG, while $\mathrm{Err}_{\mathrm{sc}}$ serves as the upper limit. We also introduce PredGap, the squared difference between the predictions of the W2S and strong ceiling models:

$$\mathrm{PredGap} = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\big(f_{\mathrm{w2s}}(h_s(x)) - f_{\mathrm{sc}}(h_s(x))\big)^2\big].$$

It captures how much the strong model falls short of its full potential due to weak supervision.
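With linear heads, PredGap is just the population mean of squared prediction differences between the W2S and strong-ceiling heads applied to the same strong representations, and can be estimated by a Monte Carlo average over test points. A small sketch (the heads and representations below are synthetic placeholders, not fitted models):

```python
import numpy as np

def pred_gap(Hs_test, w_w2s, w_sc):
    """Monte Carlo estimate of PredGap = E[(f_w2s(h_s(x)) - f_sc(h_s(x)))^2]
    for linear heads w_w2s and w_sc on shared strong representations."""
    diff = Hs_test @ (w_w2s - w_sc)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(1)
Hs_test = rng.normal(size=(1000, 8))      # strong representations of test inputs
w_sc = rng.normal(size=8)                 # strong-ceiling head (placeholder)
w_w2s = w_sc + 0.1 * rng.normal(size=8)   # W2S head, perturbed by weak supervision

gap = pred_gap(Hs_test, w_w2s, w_sc)      # small positive gap
zero_gap = pred_gap(Hs_test, w_sc, w_sc)  # identical heads give exactly 0.0
```

Note that this quantity needs no task labels: it compares two sets of predictions, which is what makes the paper's label-free metric possible in principle.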
It is also indicative of $\mathrm{Err}_{\mathrm{w2s}}$, the direct measure of W2SG performance, through the following connections. (1) If the strong ceiling model is nearly perfect, it follows that $\mathrm{PredGap} \approx \mathrm{Err}_{\mathrm{w2s}}$, as the strong ceiling's predictions are almost identical to the ground truth. This is not unlikely, since the ultimate goal of W2SG is to operate in cases where the strong model is a superhuman-level AI (Burns et al., 2023), plausibly capable of achieving perfect results if provided with ground truth labels. (2) With small regularization and well-conditioned representations, $\mathrm{Err}_{\mathrm{w2s}} \approx \mathrm{PredGap} + \mathrm{Err}_{\mathrm{sc}}$ (Thm. B.3), analogous to the Pythagorean theorem. Then, PredGap directly determines $\mathrm{Err}_{\mathrm{w2s}}$ for fixed $\mathrm{Err}_{\mathrm{sc}}$. (3) For general cases, the upper bound $\sqrt{\mathrm{Err}_{\mathrm{w2s}}} \le \sqrt{\mathrm{PredGap}} + \sqrt{\mathrm{Err}_{\mathrm{sc}}}$ follows from the triangle inequality. Furthermore, the result obtained from analyzing PredGap helps predict $\mathrm{Err}_{\mathrm{w2s}}$ in our experiments (Sec. 5). Thus, our main analysis focuses on PredGap.

3.3 Setting: representations with a well-concentrated principal part and a manageable non-principal part

We first define two basic concepts, kernel and covariance, before introducing a general assumption on representations.

Definition 3.1 (Kernel Matrix).
Given $h: \mathcal{X} \to \mathbb{R}^d$, we define the kernel matrix on the finetuning dataset $\hat{D}$ as $\hat{K}(h) = [h(\hat{x}_i)^\top h(\hat{x}_j)]_{1 \le i, j \le \hat{n}}$, an $\hat{n} \times \hat{n}$ matrix where each element represents the inner product between a pair of representations. $\tilde{K}(h)$ is defined on $\tilde{D}$ in the same manner.

Definition 3.2 (Population/Empirical Covariance Matrices). Given $h: \mathcal{X} \to \mathbb{R}^d$, we define the population covariance over distribution $\mathcal{D}$ as $\Sigma(h) := \mathbb{E}_{x \sim \mathcal{D}}[h(x) h(x)^\top]$. The empirical version on $\hat{D}$ is defined as $\hat{\Sigma}(h) := \frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} h(\hat{x}_i) h(\hat{x}_i)^\top$. $\tilde{\Sigma}(h)$ is defined on $\tilde{D}$ in the same manner.

Given a representation function and a reasonable sample size, certain components in the representations should concentrate well, meaning they adequately reflect the population distribution. These components are pivotal to the model's generalization.
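Both objects are simple matrix products of the stacked representation matrix. A sketch with a random matrix standing in for the representations, which also checks the standard Gram/covariance duality: the kernel $\hat{K}$ and $\hat{n} \cdot \hat{\Sigma}$ share their nonzero eigenvalues.

```python
import numpy as np

def kernel_matrix(H):
    """K(h) = [h(x_i)^T h(x_j)]: the n x n Gram matrix of representations."""
    return H @ H.T

def empirical_covariance(H):
    """Sigma_hat(h) = (1/n) * sum_i h(x_i) h(x_i)^T: a d x d matrix."""
    n = H.shape[0]
    return H.T @ H / n

rng = np.random.default_rng(2)
H = rng.normal(size=(50, 10))    # 50 samples, 10-dimensional representations
K = kernel_matrix(H)             # 50 x 50
Sigma = empirical_covariance(H)  # 10 x 10

# Nonzero spectra of K and n * Sigma coincide (Gram/covariance duality)
top_ev_K = np.sort(np.linalg.eigvalsh(K))[-10:]
ev_nSigma = np.sort(50 * np.linalg.eigvalsh(Sigma))
assert np.allclose(top_ev_K, ev_nSigma)
```

This duality is why statements about principal components of the covariance translate into statements about the kernel matrices used in the paper's analysis.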
In our analysis, we focus on cases where the remainder (the less-well-concentrated components) satisfies certain conditions, ensuring their impact remains theoretically tractable. The decomposition of representations into these two parts is formalized as follows.

Definition 3.3 ($(\delta, \hat{\gamma}, \tilde{\gamma})$-decomposability). Given $\mathcal{D}$, $\tilde{D}$, $\hat{D}$, and a representation function $h: \mathcal{X} \to \mathbb{R}^d$, we say that the representations of $h$ are $(\delta, \hat{\gamma}, \tilde{\gamma})$-decomposable w.r.t. a subspace $V$ (of $\mathbb{R}^d$), for some $\delta = O(1)$, $\hat{\gamma} = O(1)$, and $\tilde{\gamma} = O(1)$, if there exists a subset of eigenvectors of $\Sigma(h)$ corresponding to non-zero eigenvalues such that the following holds. Let $V$ denote the span of these eigenvectors, and let $V^\perp$ denote its orthogonal complement. Let $\Pi_V$ and $\Pi_{V^\perp}$ denote the orthogonal projections onto $V$ and $V^\perp$, respectively. Define $\rho = \lambda_{\min,\neq 0}(\Sigma(\Pi_V h))$ and $\gamma = \min(\hat{\gamma}, \tilde{\gamma})$. With high probability of $1 - o(1)$:

(a) Boundedness.
A basic condition that ensures reasonable magnitudes of representations and labels: $\|\Sigma(h)\|_{\mathrm{op}} = O(1)$, $\|\hat{\Sigma}(h)\|_{\mathrm{op}} = O(1)$, $\|\tilde{\Sigma}(h)\|_{\mathrm{op}} = O(1)$, $\mathbb{E}[y^2] = O(1)$, $\frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} \hat{y}_i^2 = O(1)$, and $\frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \tilde{y}_i^2 = O(1)$.

(b) Concentration on $V$. Representations are well-concentrated in the subspace $V$, both in terms of their covariance and their correlation with labels: $\|\hat{\Sigma}(\Pi_V h) - \Sigma(\Pi_V h)\|_{\mathrm{op}} = o(\gamma^2 + \delta^2 + \rho^2)$, $\|\tilde{\Sigma}(\Pi_V h) - \Sigma(\Pi_V h)\|_{\mathrm{op}} = o(\gamma^2 + \delta^2 + \rho^2)$, $\left\| \frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} \Pi_V h(\hat{x}_i) \hat{y}_i - \mathbb{E}[\Pi_V h(x) y] \right\| = o(\gamma + \delta + \rho)$, and
$\left\| \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \Pi_V h(\tilde{x}_i) \tilde{y}_i - \mathbb{E}[\Pi_V h(x) y] \right\| = o(\gamma + \delta + \rho)$.

(c) Kernel-wise $\delta$-isotropy on $V^\perp$. The kernels constructed using only the components in $V^\perp$ exhibit certain uniformity in all orientations, with the extent of uniformity controlled by $\delta$: $\left\| \frac{1}{\hat{n}} \hat{K}(\Pi_{V^\perp} h) - \hat{\gamma} I \right\|_{\mathrm{op}} = o(\gamma^2 + \delta^2)$ and $\left\| \frac{1}{\tilde{n}} \tilde{K}(\Pi_{V^\perp} h) - \tilde{\gamma} I \right\|_{\mathrm{op}} = o(\gamma^2 + \delta^2)$.

(d) Small cross-sample inner product on $V^\perp$.
$\left\| \frac{1}{\sqrt{\hat{n}\tilde{n}}} \left[ (\Pi_{V^\perp} h(\hat{x}_i))^\top \Pi_{V^\perp} h(\tilde{x}_j) \right]_{1 \le i \le \hat{n}, 1 \le j \le \tilde{n}} \right\|_{\mathrm{op}} = o(\gamma + \delta)$, which holds when representations on $V^\perp$ are nearly orthogonal across samples or have small magnitudes.

(e) Diminishing population covariance on $V^\perp$. The representations on $V^\perp$ have small magnitude in the population: $\|\Sigma(\Pi_{V^\perp} h)\|_{\mathrm{op}} = o(\gamma + \delta)$.

Additional explanation for kernel-wise $\delta$-isotropy on $V^\perp$. To provide a clearer understanding of this condition, consider the following. If $\delta$ is very small (e.g., $\delta = 0$), the kernel on $\hat{D}$ is nearly identical to $\hat{\gamma} I$, meaning it does not exhibit any specific patterns that differentiate between data points. In contrast, with a larger $\delta$ (e.g., $\delta \gg \hat{\gamma}$), this requirement is much more relaxed: the kernel no longer needs to closely resemble $\hat{\gamma} I$ but instead must simply have its magnitude bounded by $o(\delta)$. Thus, it accommodates scenarios where the kernel is highly isotropic, very small in scale, or anywhere in between.
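Condition (c) can be checked numerically for the prototypical case the text mentions: high-dimensional sub-Gaussian noise. With i.i.d. entries of variance $\sigma^2/d$, each sample's squared norm concentrates near $\sigma^2$ and cross-sample inner products nearly vanish, so $\frac{1}{n}K$ concentrates around $\hat{\gamma} I$ with $\hat{\gamma} = \sigma^2/\hat{n}$ (as in Example 3.5 below). A sketch with assumed synthetic sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma2 = 100, 50_000, 4.0

# Non-principal part: i.i.d. zero-mean entries with variance sigma^2 / d,
# so each representation has squared norm ~ sigma^2 and cross-sample
# inner products are O(sigma^2 / sqrt(d))
H_perp = rng.normal(scale=np.sqrt(sigma2 / d), size=(n, d))

K = H_perp @ H_perp.T           # kernel on the non-principal components
gamma_hat = sigma2 / n          # the gamma-hat of the spiked-covariance example
dev = np.linalg.norm(K / n - gamma_hat * np.eye(n), ord=2)
# dev is small relative to gamma_hat when d >> n
```

Shrinking `d` toward `n` in this sketch makes the deviation comparable to `gamma_hat`, which is the regime where the non-principal part stops being "manageable" in the sense of Def. 3.3.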
This is key to our analysis, as it ensures the effect of the less well-concentrated part of the representations remains tractable. We note that this condition is not only analytically convenient but also practically relevant in real-world scenarios. For example, high-dimensional sub-Gaussian noise satisfies this condition with a small $\delta$, a situation highly relevant to deep neural networks with large internal dimensions, where vectors tend to be approximately orthogonal in the high-dimensional limit. More concrete instances will be presented in Examples 3.4 and 3.5, as well as in Theorem 3.6, along with discussions of their significance and relevance.

Additional explanation for diminishing population covariance on $V^\perp$. We note that this condition does not imply a negligible impact of the representations on $V^\perp$. For example, when $\delta$ is small, the model can in fact leverage the components in $V^\perp$ to interpolate the training data, even when such interpolation cannot be achieved by the components in $V$ (see Example 4.2). We refer to $\Pi_V h(x)$, the well-concentrated part of the representation, as the principal representation, and the remainder, $\Pi_{V^\perp} h(x)$, as the non-principal representation.

Examples of Def. 3.3. Def. 3.3 is highly general, covering various representation distributions and dimensionalities. One simple case is when all components are well-concentrated, i.e., the entire representation is principal. This occurs when the representations exhibit a certain low-rank structure, which is common in deep neural networks (Huh et al., 2021). Below is a concrete example.

Example 3.4 (Arbitrarily parameterized; bounded representations with low intrinsic dimension).
Given $h: \mathcal{X} \to \mathbb{R}^d$, for any $(x, y)$, $\|h(x)\|^2 \le B$ and $y^2 \le C$, where $C = \Theta(1)$. Additionally, $\|\Sigma(h)\|_{\mathrm{op}} = \Theta(1)$. The intrinsic dimension of $\Sigma(h)$ is defined as $\mathrm{intdim}(\Sigma(h)) = \frac{\mathrm{Tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}}$, denoted by $q$. Let $n = \min(\hat{n}, \tilde{n})$ and assume $n^{1-c} = \omega(B \log(q))$ for some constant $c < 1$. Then, the representations are $(n^{-0.1c}, 0, 0)$-decomposable w.r.t. $\mathbb{R}^d$.

Remark. The conditions imply a low intrinsic dimension relative to the sample size, $q \log q = o(n^{1-c})$ (App. C.1), but without restricting the actual dimension $d$, allowing both under-parameterized ($d < n$) and over-parameterized ($d \ge n$) settings.

The next example is related to the spiked covariance model originating from PCA and widely used in recent theoretical studies across various domains (e.g., Muthukumar et al., 2021; Nakada et al., 2023). It is also related to the sparse coding model, which has its roots in computer vision (Olshausen & Field, 1997) and has been applied to language modeling (Arora et al., 2018) and deep learning theory (e.g., Allen-Zhu & Li, 2020). More references are in App. C.2. We consider representations that follow a sub-Gaussian distribution, a very general class that includes, e.g., any bounded random variable and the Gaussian.

Example 3.5 (Heavily overparameterized; sub-Gaussian with spiked covariance).
Given h:→ℝd:ℎ→superscriptℝh:\!X\!→\!R^dh : X → blackboard_Rd and randomly drawn xitalic_x, h()ℎh( x)h ( italic_x ) has independent zero-mean sub-Gaussian entries. The first k entries have a (sub-Gaussian) parameter of Θ(1)Θ1 (1)Θ ( 1 ) and variance 1111, while the remaining d−kd\!-\!kd - k entries have a parameter of Θ(σ2d−k)Θsuperscript2 ( σ^2d-k)Θ ( divide start_ARG σ2 end_ARG start_ARG d - k end_ARG ) and variance σ2d−ksuperscript2 σ^2d-kdivide start_ARG σ2 end_ARG start_ARG d - k end_ARG. The scalings satisfy: n~=Θ(n^)~Θ n\!=\! ( n)over~ start_ARG n end_ARG = Θ ( over start_ARG n end_ARG ), σ2=O(n^)superscript2^σ^2=O( n)σ2 = O ( over start_ARG n end_ARG ), n^=ω(k2)^superscript2 n\!=\!ω(k^2)over start_ARG n end_ARG = ω ( k2 ), and d=ω(n^2)superscript^2d\!=\!ω( n^2)d = ω ( over start_ARG n end_ARG2 ). The labels have bounded moment, [y2]=O(1)delimited-[]superscript21E[y^2]\!=\!O(1)blackboard_E [ y2 ] = O ( 1 ). Then, the representations are (0,σ2n^,σ2n~)0superscript2^superscript2~(0, σ^2 n, σ^2 n)( 0 , divide start_ARG σ2 end_ARG start_ARG over start_ARG n end_ARG end_ARG , divide start_ARG σ2 end_ARG start_ARG over~ start_ARG n end_ARG end_ARG )-decomposable w.r.t. the subspace corresponding to the first k coordinates. Remark. Compared to Example 3.4, this example accommodates cases with high intrinsic dimensions. For instance, if we set σ2=Θ(n^)superscript2Θ^σ^2= ( n)σ2 = Θ ( over start_ARG n end_ARG ), then intdim((h))=Θ(n)intdimℎΘintdim( (h))= (n)intdim ( Σ ( h ) ) = Θ ( n ). More complex examples can be constructed from the fact that adding high-dimensional sub-Gaussian to (δ,0,0)00(δ,0,0)( δ , 0 , 0 )-decomposable representations preserves decomposability: Theorem 3.6. Given a representation function hℎh whose representations h()∈ℝdℎsuperscriptℝh( x)∈R^dh ( italic_x ) ∈ blackboard_Rd are (δ,0,0)00(δ,0,0)( δ , 0 , 0 )-decomposable w.r.t. 
ℝdsuperscriptℝR^dblackboard_Rd, we construct new representations with α()=h()+⟂ξ()ℎsuperscriptperpendicular-toα( x)= Mh( x)+ M ξ( x)α ( italic_x ) = italic_M h ( italic_x ) + italic_M⟂ ξ ( italic_x ), where ∈ℝ(d+m)×dsuperscriptℝ M∈R^(d+m)× ditalic_M ∈ blackboard_R( d + m ) × d and ⟂∈ℝ(d+m)×msuperscriptperpendicular-tosuperscriptℝ M ∈R^(d+m)× mitalic_M⟂ ∈ blackboard_R( d + m ) × m both have orthonormal columns, and their column spaces are orthogonal to each others. If elements in ξ()∈ℝmsuperscriptℝξ( x)∈R^mξ ( italic_x ) ∈ blackboard_Rm are independent zero-mean sub-Gaussian with parameter Θ(σ2m)Θsuperscript2 ( σ^2m)Θ ( divide start_ARG σ2 end_ARG start_ARG m end_ARG ) and variance σ2msuperscript2 σ^2mdivide start_ARG σ2 end_ARG start_ARG m end_ARG, assuming n~=Θ(n^)~Θ n\!=\! ( n)over~ start_ARG n end_ARG = Θ ( over start_ARG n end_ARG ), m=ω(n^2)superscript^2m\!=\!ω( n^2)m = ω ( over start_ARG n end_ARG2 ), and σ2=O(n^)superscript2^σ^2\!=\!O( n)σ2 = O ( over start_ARG n end_ARG ), then α’s representations are (δ,σ2n^,σ2n~)superscript2^superscript2~(δ, σ^2 n, σ^2 n)( δ , divide start_ARG σ2 end_ARG start_ARG over start_ARG n end_ARG end_ARG , divide start_ARG σ2 end_ARG start_ARG over~ start_ARG n end_ARG end_ARG )-decomposable w.r.t. the span of Mitalic_M’s columns. Remark. For instance, one could take hℎh from Example 3.4. We assume both models’ representations satisfy Def. 3.3: Assumption 3.7. hwsubscriptℎwh_whw’s representations are (δw,γ^w,γ~w)subscriptwsubscript^wsubscript~w( _w, γ_w, γ_w)( δw , over start_ARG γ end_ARGw , over~ start_ARG γ end_ARGw )-decomposable w.r.t. wsubscriptwV_wVw, and hssubscriptℎsh_shs’s representations are (δs,γ^s,γ~s)subscriptssubscript^ssubscript~s( _s, γ_s, γ_s)( δs , over start_ARG γ end_ARGs , over~ start_ARG γ end_ARGs )-decomposable w.r.t. ssubscriptsV_sVs . 3.4 Principal representations shape PredGap Intuition. One implication of Def. 
3.3 is that only what is learned through the principal representations will be reflected at test time. Thus, the weak model’s mistakes primarily stem from its inability to generate certain outputs using its principal representations. For the same reason, among these mistakes, only those expressible through the strong model’s principal representations will affect its test performance. Therefore, a key concept affecting W2SG performance is “what the weak model is unable to learn but is learnable by the strong model using their respective principal representations”, which we seek to quantify. Formalization. To formalize the above idea, we leverage ^(whw)^subscriptsubscriptwsubscriptℎw K( _V_wh_w)over start_ARG italic_K end_ARG ( Πcaligraphic_V start_POSTSUBSCRIPT w end_POSTSUBSCRIPT hw ) and ^(shs)^subscriptsubscriptssubscriptℎs K( _V_sh_s)over start_ARG italic_K end_ARG ( Πcaligraphic_V start_POSTSUBSCRIPT s end_POSTSUBSCRIPT hs )–kernels computed using only the weak and strong models’ principal representations, referred to as principal kernels. 
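Concretely, if $\hat{K}$ is taken as the plain inner-product kernel on the finetuning set, a principal kernel is just the Gram matrix of the projected representations. The sketch below is our own illustration, not the paper's code; `H`, `V`, and `principal_kernel` are hypothetical names.

```python
import numpy as np

def principal_kernel(H: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Gram matrix of principal representations:
    K_ij = <Pi_V h(x_i), Pi_V h(x_j)>.
    H: (n, d) representations on the finetuning set.
    V: (d, r) orthonormal basis of the principal subspace."""
    Hp = H @ V @ V.T          # Pi_V h(x): project each row onto span(V)
    return Hp @ Hp.T          # n x n kernel matrix

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))
V = np.eye(4)[:, :2]          # principal subspace = first two coordinates
K = principal_kernel(H, V)
# Only the principal coordinates contribute to the kernel:
assert np.allclose(K, H[:, :2] @ H[:, :2].T)
```

With an orthonormal basis $V$, the non-principal coordinates are zeroed out before the inner products are taken, which is exactly why only the principal part of the representations enters the kernels below.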
We define the following:

$$P_w := \frac{1}{\hat{n}} \hat{K}(\Pi_{\mathcal{V}_w} h_w) \Big( \frac{1}{\hat{n}} \hat{K}(\Pi_{\mathcal{V}_w} h_w) + (\beta_w + \tilde{\gamma}_w) I \Big)^{-1}, \qquad P_s := \frac{1}{\hat{n}} \hat{K}(\Pi_{\mathcal{V}_s} h_s) \Big( \frac{1}{\hat{n}} \hat{K}(\Pi_{\mathcal{V}_s} h_s) + (\beta_s + \hat{\gamma}_s) I \Big)^{-1}.$$

$P_w$ and $P_s$ represent scaled projections onto the spans of the principal kernels. Each captures the space of output patterns that its respective model can express through its principal representations (with regularization taken into account). Then, the earlier intuition can be characterized as follows.

Theorem 3.8 (Main result). Under Assump. 3.7, and assuming reasonable regularization, $\delta_w \le \beta_w = O(1)$ and $\delta_s \le \beta_s = O(1)$, let $\hat{y} = [\hat{y}_1\ \hat{y}_2\ \dots\ \hat{y}_{\hat{n}}]^\top$. Then, w.h.p., we have

$$\mathrm{PredGap} = \Big\| P_s (I - P_w) \frac{1}{\sqrt{\hat{n}}} \hat{y} \Big\|^2 \pm o(1). \quad (1)$$

$P_s(I - P_w)$ captures "what the weak model is unable to learn but is learnable by the strong model using their respective principal representations". Therefore, it determines the mistakes that will be learned by the strong model, as discussed in the intuition. A more powerful weak model has a $P_w$ that covers more space, shrinking $P_s(I - P_w)$ and potentially leading to a smaller PredGap.

Propagation of errors. The earlier intuition is reflected in the proof (App. A.7). Given the labeling $\hat{y}$, its projection $(I - P_w)\hat{y}$ is orthogonal to the scaled weak model's principal kernel and thus cannot be effectively learned, contributing to the weak model's error (Lem. A.12). The projection of this error onto the scaled strong model's principal kernel, $P_s(I - P_w)\hat{y}$, is learned by the strong model and contributes to PredGap (Lem. A.13).
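To make the theorem concrete, here is a minimal numerical sketch (our own, not the authors' implementation) of $P_w$, $P_s$, and the leading term of Eq. (1), assuming inner-product kernels and a single scalar `reg` standing in for $\beta + \gamma$:

```python
import numpy as np

def scaled_projection(Hp: np.ndarray, reg: float) -> np.ndarray:
    """P = (K/n)(K/n + reg*I)^{-1}, with K the Gram matrix of the given
    principal representations and reg standing in for beta + gamma."""
    n = Hp.shape[0]
    Kn = Hp @ Hp.T / n
    return Kn @ np.linalg.inv(Kn + reg * np.eye(n))

def predgap_leading_term(P_s, P_w, y_hat):
    """|| P_s (I - P_w) y_hat / sqrt(n) ||^2, the leading term of Eq. (1)."""
    n = len(y_hat)
    v = P_s @ ((np.eye(n) - P_w) @ (y_hat / np.sqrt(n)))
    return float(v @ v)

rng = np.random.default_rng(0)
n = 400
y = rng.normal(size=n)
H_s = y[:, None]                       # strong principal reps encode y exactly
P_s = scaled_projection(H_s, 0.01)

# A weak model whose principal reps match the strong ones vs. one whose
# principal reps are pure noise (both hypothetical):
gap_informed = predgap_leading_term(P_s, scaled_projection(H_s, 0.01), y)
gap_blind = predgap_leading_term(P_s, scaled_projection(rng.normal(size=(n, 3)), 0.01), y)
assert gap_informed < 0.01 < gap_blind   # weaker supervision -> larger gap
```

As the theorem suggests, when the weak model's principal kernel already covers the label direction, $P_s(I - P_w)$ annihilates $\hat{y}$ and the gap nearly vanishes; an uninformative weak model leaves the gap close to the label variance.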
4 A Case Study on Benign Overfitting

Our theory can be applied to study and provide new insights into benign overfitting, an intriguing special case of W2SG, where the W2S model appears to mimic the weak supervision during finetuning, yet generalizes better at test time.

4.1 A general condition

Benign overfitting has been studied in the general machine learning context to understand deep neural networks' generalization (Bartlett et al., 2020; Wang et al., 2021; Frei et al., 2022; Mallinar et al., 2022). Recently, (Wu & Sahai, 2024) theoretically characterized benign overfitting in W2SG for a specific data distribution. Here, we aim to derive broader insights from a representation perspective.

We consider the scenario where the strong model's representations are highly expressive, enabling near-perfect overfitting of arbitrary labelings on the finetuning data, mirroring the behavior of very large neural networks in practice (Zhang et al., 2021). This occurs when $\delta_s = o(\hat{\gamma}_s)$ (Lem. B.4), yielding a highly isotropic non-principal kernel. Meanwhile, since generalization depends solely on the principal representations by Thm. 3.8, a small $\|P_s(I - P_w)\frac{1}{\sqrt{\hat{n}}}\hat{y}\|^2$ suffices for good W2SG performance, regardless of the extent of overfitting. In this way, we connect benign overfitting to the general relationship between the weak and strong models' representations:

Theorem 4.1 (A general condition for benign overfitting; the theorem can be extended to cases where the strong ceiling is not perfect, but we omit this for brevity). In addition to Assumption 3.7, suppose that (1) $\delta_s = o(\hat{\gamma}_s)$ and $\delta_s \le \beta_s = o(\hat{\gamma}_s)$; (2) w.h.p., the strong ceiling model achieves nearly perfect performance, i.e., $\mathrm{Err}_{\mathrm{sc}} = o(1)$; (3) w.h.p., $\|P_s(I - P_w)\frac{1}{\sqrt{\hat{n}}}\hat{y}\|^2 = \mathrm{Err}_{\mathrm{w}} - \Delta$ with $\Delta = \Theta(1)$. Then, w.h.p., the W2S model achieves an almost zero ($o(1)$) training error on $\hat{D}$, but generalizes better than the weak model: $\mathrm{Err}_{\mathrm{w2s}} \le \mathrm{Err}_{\mathrm{w}} - \Delta + o(1)$. See proof in App. B.3.1.

Remark. Compared to (Wu & Sahai, 2024), which focuses on demonstrating that benign overfitting can occur under specific assumptions—such as a bi-level ensemble structure and labels depending 1-sparsely on representations—we extract more general insights into when and how benign overfitting arises. Specifically, we identify a single key quantity driving benign overfitting in W2SG: $\|P_s(I - P_w)\frac{1}{\sqrt{\hat{n}}}\hat{y}\|$. When this quantity is small, the strong model can avoid repeating the weak model's mistakes—regardless of the extent of overfitting—thereby achieving error mitigation. This precise mechanism was not revealed in prior work.

4.2 Instantiation of Theorem 4.1 on a toy example

We present a concrete example of the scenario in Theorem 4.1 to demonstrate the realizability of the conditions.
While more complex examples could be constructed, we focus on a simple one to succinctly illustrate the core ideas.

Example 4.2. The label is a Gaussian: $y \sim N(0, 1)$. Given $(x, y)$, the weak model's representation is $h_w(x) = [(\sqrt{\eta}\, y + \sqrt{1-\eta}\, \zeta)\ \ \xi_w^\top]^\top$, where $\eta \in (0,1)$ is some constant, and $\zeta \sim N(0,1)$ and $\xi_w \sim N(0, \frac{\sigma^2}{d-1} I)$ are both independently drawn. The strong model's representation is $h_s(x) = [y\ \ \xi_s^\top]^\top$, where $\xi_s \sim N(0, \frac{\sigma^2}{d-1} I)$ independently. The scalings satisfy $\tilde{n} = \Theta(\hat{n}) = \omega(1)$, $d = \omega(\hat{n}^2)$, and $\sigma^2 = o(\hat{n})$ but $\ne 0$. Additionally, $\beta_s = o(\frac{\sigma^2}{\hat{n}})$ and $\beta_w = o(\frac{\sigma^2}{\hat{n}})$.

Here, the weak model's first coordinate carries a signal about the label $y$, but corrupted by noise $\zeta$, with $\eta$ controlling the signal strength (i.e., with SNR $\frac{\eta}{1-\eta}$). The strong model's first coordinate carries a perfect signal about $y$. The remaining coordinates in both models are high-dimensional random noise. Both models' representations are special cases of Example 3.5 and are therefore $(0, \frac{\sigma^2}{\hat{n}}, \frac{\sigma^2}{\tilde{n}})$-decomposable.

Corollary 4.3. Benign overfitting occurs in Example 4.2. Specifically, w.h.p., (1) the weak model's errors on both $\hat{D}$ and the population are $(1-\eta) \pm o(1)$; (2) the W2S model overfits the weak model's outputs on $\hat{D}$, achieving a training loss of $o(1)$; (3) however, compared to the weak model, the W2S model achieves a smaller test error: $\mathrm{Err}_{\mathrm{w2s}} = (1-\eta)^2 \pm o(1)$. For instance, if $\eta = 0.6$, then $\mathrm{Err}_{\mathrm{w}} \approx 0.4$, while $\mathrm{Err}_{\mathrm{w2s}} \approx 0.16$, despite nearly perfect overfitting on $\hat{D}$.

4.3 A closer look at error propagation

We provide a rough derivation of the W2S error (with details in App. B.3.2), illustrating which errors are replicated and which are corrected (overfitted but benignly) by the W2S model, and how representations determine this. The principal representations for both models are simply at their first coordinates. Thus, the spans of their principal kernels are one-dimensional. Let $\hat{\zeta} \in \mathbb{R}^{\hat{n}}$ denote the vector collecting the $\zeta$ values on $\hat{D}$, i.e., $\hat{\zeta} = [\hat{\zeta}_1, \dots, \hat{\zeta}_{\hat{n}}]^\top$. Similarly, define $\hat{y} = [\hat{y}_1, \dots, \hat{y}_{\hat{n}}]^\top$. We can approximate the projection matrices as $P_w \approx \frac{1}{\hat{n}} \hat{q} \hat{q}^\top$ and $P_s \approx \frac{1}{\hat{n}} \hat{y} \hat{y}^\top$, where $\hat{q} = \sqrt{\eta}\, \hat{y} + \sqrt{1-\eta}\, \hat{\zeta}$. Note that the vectors $\frac{1}{\sqrt{\hat{n}}} \hat{y}$ and $\frac{1}{\sqrt{\hat{n}}} \hat{\zeta}$ are almost orthogonal, as the corresponding random variables are uncorrelated: $\frac{1}{\sqrt{\hat{n}}} \hat{y}^\top \frac{1}{\sqrt{\hat{n}}} \hat{\zeta} = \frac{1}{\hat{n}} \sum_i \hat{y}_i \hat{\zeta}_i \approx \mathbb{E}[y\zeta] = 0$.

Let $\epsilon_w$ be the vector whose $i$-th element is the weak model's error on data point $(\hat{x}_i, \hat{y}_i)$. By Lemma A.12, we can approximate $\epsilon_w$ as

$$\epsilon_w \approx (I - P_w)\, \frac{1}{\sqrt{\hat{n}}}\hat{y} \approx (1-\eta)\, \frac{1}{\sqrt{\hat{n}}} \hat{y} - \sqrt{\eta(1-\eta)}\, \frac{1}{\sqrt{\hat{n}}} \hat{\zeta}.$$

The strong ceiling model's error $\mathrm{Err}_{\mathrm{sc}} \approx 0$, as its representations directly encode $y$ in the first coordinate. Thus, $\mathrm{Err}_{\mathrm{w2s}} \approx \mathrm{PredGap}$. By Thm. 3.8, $\mathrm{PredGap} \approx \|P_s \epsilon_w\|^2$. Then,

$$\mathrm{Err}_{\mathrm{w2s}} \approx \Big\| \underbrace{\frac{1}{\hat{n}} \hat{y} \hat{y}^\top (1-\eta) \frac{1}{\sqrt{\hat{n}}} \hat{y}}_{\text{replicated}} \;\; \underbrace{-\; \frac{1}{\hat{n}} \hat{y} \hat{y}^\top \sqrt{\eta(1-\eta)}\, \frac{1}{\sqrt{\hat{n}}} \hat{\zeta}}_{\text{avoided; } \approx 0 \text{ since } \hat{\zeta} \perp \hat{y} \text{ almost}} \Big\|^2.$$

The first term of the weak model's error, $(1-\eta)\frac{1}{\sqrt{\hat{n}}}\hat{y}$, aligns with $P_s$, which spans the strong model's principal kernel, and is therefore replicated by the W2S model. The second term, $-\sqrt{\eta(1-\eta)}\frac{1}{\sqrt{\hat{n}}}\hat{\zeta}$, is orthogonal to $P_s$ and thus mitigated.
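The rough derivation above can be checked numerically. The sketch below is a hypothetical simulation of ours, not the paper's code; it applies the rank-one approximations $P_w \approx \hat{q}\hat{q}^\top/\hat{n}$ and $P_s \approx \hat{y}\hat{y}^\top/\hat{n}$ and recovers $\mathrm{Err}_{\mathrm{w}} \approx 1-\eta$ and $\mathrm{Err}_{\mathrm{w2s}} \approx (1-\eta)^2$ for $\eta = 0.6$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 50_000, 0.6

y = rng.normal(size=n)                     # labels y_hat on the finetuning set
zeta = rng.normal(size=n)                  # weak supervisor's noise zeta_hat
q = np.sqrt(eta) * y + np.sqrt(1 - eta) * zeta

# Apply the rank-one approximations P_w ~ q q^T/n and P_s ~ y y^T/n
# without materializing any n x n matrix.
u = y / np.sqrt(n)                         # (1/sqrt(n)) y_hat
eps_w = u - q * (q @ u) / n                # (I - P_w) u: the weak model's error
w2s = y * (y @ eps_w) / n                  # P_s (I - P_w) u: the replicated part

err_w = float(eps_w @ eps_w)               # close to 1 - eta = 0.4
err_w2s = float(w2s @ w2s)                 # close to (1 - eta)^2 = 0.16
print(err_w, err_w2s)
```

The noise component $\hat{\zeta}$ survives in `eps_w` but is annihilated when $P_s$ is applied, which is exactly the "avoided" term in the decomposition.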
Notably, $-\sqrt{\eta(1-\eta)}\frac{1}{\sqrt{\hat{n}}}\hat{\zeta}$ aligns with the strong model's non-principal kernel, which is highly isotropic ($\hat{\gamma}_s = \omega(\delta_s)$), causing the corresponding errors to appear mimicked by the W2S model during finetuning. However, they do not manifest at test time. In other words, only errors within the span of the strong model's principal kernel are overfitted harmfully, while overfitting elsewhere remains benign.

5 Predicting W2SG Without Labels

Leveraging Thm. 3.8, we derive a representation-based metric that can predict W2SG performance without labels in experiments across various settings. Notably, this metric strongly correlates with W2SG performance even when we finetune entire LLMs—a scenario significantly more complex than what we analyze in theory.

5.1 A label-agnostic metric for W2SG

Table 1: An overview of the three setups considered in our experiments.

EXP ID | Task | Strong model | Weak models | Finetuning
I | molecular tasks | MolBERT | 150 transformers pretrained on GuacaMol | task head
II | NLP tasks | nvidia/NV-Embed-v2 | 22 other embedding models | task head
III | NLP tasks | Qwen/Qwen-7B | 28 smaller LLMs | full model

We start with upper-bounding the RHS of Thm. 3.8.

Corollary 5.1 (Upper Bound 1). Define $C = \frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} \hat{y}_i^2$. Following Theorem 3.8, directly applying the submultiplicative property of the norm yields the following upper bound:

$$\mathrm{PredGap} \le C\, \|P_s(I - P_w)\|_{\mathrm{op}}^2 + o(1).$$

Corollary 5.2 (Upper Bound 2). Following Theorem 3.8, we can also obtain an upper bound that involves $\mathrm{Err}_{\mathrm{sc}}$, as long as $\big|\mathbb{E}[y^2] - \frac{1}{\hat{n}} \sum_{i=1}^{\hat{n}} \hat{y}_i^2\big| = o(1)$ (see proof in Appendix B.4):

$$\mathrm{PredGap} \le \big( \sqrt{C}\, \|P_s(I - P_w) P_s\|_{\mathrm{op}} + \sqrt{\mathrm{Err}_{\mathrm{sc}}} \big)^2 + o(1).$$

In both upper bounds, $C$ represents the variance of the labels on $\hat{D}$, which can be treated as a constant given a fixed dataset. Therefore, PredGap is governed by the norm $\|P_s(I - P_w)\|_{\mathrm{op}}$ or $\|P_s(I - P_w) P_s\|_{\mathrm{op}}$. Comparing the two bounds, the one in Corollary 5.2 is tighter, particularly when $\mathrm{Err}_{\mathrm{sc}}$ is small (one can also observe this in Example 4.2, where the equality in Corollary 5.2 holds, whereas that in Corollary 5.1 does not). This follows from $\|P_s(I - P_w) P_s\|_{\mathrm{op}} \le \|P_s(I - P_w)\|_{\mathrm{op}}$. However, in our experiments, both are similarly indicative of W2SG performance.
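As an illustration (our own sketch, not the authors' implementation), both label-free metrics reduce to an operator norm of products of the $P$ matrices:

```python
import numpy as np

def w2sg_metric(P_s: np.ndarray, P_w: np.ndarray, tighter: bool = False) -> float:
    """Label-free metric from Corollaries 5.1 / 5.2: ||P_s (I - P_w)||_op,
    or ||P_s (I - P_w) P_s||_op when tighter=True."""
    A = P_s @ (np.eye(P_s.shape[0]) - P_w)
    if tighter:
        A = A @ P_s
    return float(np.linalg.norm(A, ord=2))   # operator (spectral) norm

# Toy sanity check with exact rank-one projections:
P_s = np.diag([1.0, 0.0, 0.0])
assert w2sg_metric(P_s, np.zeros((3, 3))) == 1.0   # uninformative weak model
assert w2sg_metric(P_s, P_s) == 0.0                # weak matches strong
# Upper Bound 2's quantity never exceeds Upper Bound 1's:
assert w2sg_metric(P_s, P_s, tighter=True) <= w2sg_metric(P_s, P_s)
```

Because no labels enter these expressions, the metric can be evaluated from representations alone, which is what makes the prediction in Sec. 5.4 label-agnostic.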
Now that PredGap can be bounded in terms of the above label-agnostic metrics, and PredGap is indicative of the error $\mathrm{Err}_{\mathrm{w2s}}$ as discussed at the end of Sec. 3.2, we turn our focus to examining the following relationship in real models,

$$\mathrm{Err}_{\mathrm{w2s}} \overset{?}{\sim} \|P_s(I - P_w)\|_{\mathrm{op}} \quad \big(\text{or } \|P_s(I - P_w) P_s\|_{\mathrm{op}}\big),$$

to evaluate whether the metrics offer practical insights. Specifically, we consider the three setups summarized in Table 1, with their details discussed in the corresponding subsections. In each setup, we fix the strong model and vary the weak model to obtain different pairs of $\mathrm{Err}_{\mathrm{w2s}}$ and $\|P_s(I - P_w)\|_{\mathrm{op}}$ (or $\|P_s(I - P_w) P_s\|_{\mathrm{op}}$) and study their relationship.

5.2 Empirical measure of $P_w$ and $P_s$

Before proceeding, let's address an important question: how can we compute $P_w$ and $P_s$ for real models? In some cases, representations are not fixed during fine-tuning, making $h$ difficult to define. Additionally, determining the principal representation, $\Pi_{\mathcal{V}} h$, is challenging because the exact $\mathcal{V}$ depends on the population, which is unknown in practice.
To tackle this, we design heuristics to approximate $P$ as follows:

$$\frac{1}{\hat{n}} \hat{K}(\Pi_\alpha h) \Big( \frac{1}{\hat{n}} \hat{K}(\Pi_\alpha h) + \beta_{\mathrm{eff}} I \Big)^{-1}. \quad (2)$$

We explain the key components below.

$h$: extracting representations. We consider two ways of defining the representations, depending on the setup. (1) Last-layer embeddings. In Exps. I and II, the definition of representation is self-evident, as finetuning is simply training a task head on the embeddings produced by the base model (in the analysis, the linear model does not include a bias term, but it does in our experiments; this is addressed by appending a constant 1 to the representation when computing the metrics). (2) Activation maps. In Exp. III, we finetune the entire LLM from pretrained weights, so we don't have fixed representations as in the theoretical setting (we observed worse results with last-layer embeddings in Exp. III, likely due to complex cross-layer dynamics during finetuning). To address this, we adopt a simple heuristic: we treat the layer-wise normalized, vectorized activation maps of the pretrained LLM, which encode information about how inputs are represented within the model, as the representations for computing $h(x)$. This heuristic serves primarily as a proof of concept, demonstrating that even a straightforward approach like this can yield meaningful results. More principled definitions of representations, e.g., those based on NTK (Malladi et al., 2023) or representation engineering (Zou et al., 2023), could be explored in future work. See further discussion in Appx. E.

$\Pi_\alpha$: approximating principal representations. We consider two versions of $\Pi_\alpha$, the operation that extracts the principal part from the representations, based on the intuition that principal representations tend to have larger magnitudes (e.g., Example 3.5). (1) In Exps. I and II, we apply PCA by projecting the representations onto the eigenvectors of the covariance $\hat{\Sigma}(h)$ with eigenvalues $\ge \alpha \times$ (the largest eigenvalue). (2) In Exp. III, we select the top coordinates with variance exceeding $\alpha \times$ (the largest coordinate-wise variance), a cheaper alternative to PCA for high-dimensional activation maps, as it avoids the expensive eigendecomposition. In both cases, $\alpha$ is a hyperparameter.

$\beta_{\mathrm{eff}}$: effective regularization. In Thm. 3.8, $(\beta + \hat{\gamma})$ is the effective regularization, capturing both the explicit ($\beta$) and implicit ($\hat{\gamma}$) (Jacot et al., 2020) regularization. In practice, regularization can also stem from factors like early stopping, training algorithms, etc. We summarize these effects using $\beta_{\mathrm{eff}}$ in Eq. 2 and treat $\beta_{\mathrm{eff}}$ as a hyperparameter.

For each model, computing $P$ introduces two hyperparameters, $\alpha$ and $\beta$. If every model were assigned unique hyperparameters, the total number of hyperparameters would be twice the number of models. To simplify this, we let all weak models share the same two hyperparameters, $\alpha_w$ and $\beta_w$. The strong model (only one in each setting) is treated separately with its own hyperparameters, $\alpha_s$ and $\beta_s$. Thus, we only have four parameters in total. More details are in App. D.2.

Figure 4: In Exp. III, for models with activation map dimensions $\le 8000$, both the activation map dimension (middle) and the dimension of approximated principal representations (right) correlate poorly with $\mathrm{Err}_{\mathrm{w2s}}$. However, $\|P_s(I - P_w)\|_{\mathrm{op}}$ remains strongly correlated with $\mathrm{Err}_{\mathrm{w2s}}$ (left). We only show the results for Cosmos QA and defer those for other datasets to App. D.4.

5.3 Experimental setups

Exp. I: Molecular prediction. Our first setting follows (Charikar et al., 2024). We use the GuacaMol (Brown et al., 2019) dataset for pretraining both the strong and weak models. For finetuning, we consider three regression datasets—ESOL, FreeSolv, and Lipop—from the MoleculeNet (Wu et al., 2018) benchmark, curated by ChemBench (Charleshen, 2020), which involve predicting molecular physical properties. The strong model is MolBERT (Fabian et al., 2020), a BERT (Devlin, 2018) pretrained for 100 epochs on GuacaMol. We use smaller transformers pretrained on GuacaMol as weak models. These weak models have 2 layers and 2 attention heads. We vary the hidden size across 64, 128, and 256, and vary the number of pretraining epochs from 1 to 50, resulting in 150 weak models. During finetuning, we extract last-layer embeddings and perform linear regression. MSE loss is used both for training and for measuring $\mathrm{Err}_{\mathrm{w2s}}$, as the task is regression. Additional details are in App. D.1.

Exp. II: NLP tasks with embedding models. We use the "Justice" and "Commonsense" datasets from ETHICS (Hendrycks et al., 2020), which involve binary classification based on basic moral concepts.
We consider embedding models—pretrained LLMs that convert text inputs into vector-based embeddings—with nvidia/NV-Embed-v2 (Lee et al., 2024) (currently ranked first on the MTEB leaderboard (Muennighoff et al., 2022)) as the strong model, and 22 other models as weak models (details in Appx. D.1). For finetuning, we train a linear classifier on the embeddings with CE loss. $\mathrm{Err}_{\mathrm{w2s}}$ is measured as classification error.

Exp. III: NLP tasks with end-to-end finetuned LLMs. We replicate a setup from (Burns et al., 2023) on three datasets: (1) SciQ (Welbl et al., 2017), containing crowdsourced science exam questions; (2) Amazon Polarity (Zhang et al., 2015), consisting of Amazon reviews; and (3) Cosmos QA (Huang et al., 2019), involving commonsense-based reading comprehension. Both data preprocessing and finetuning strictly follow (Burns et al., 2023). The entire model is finetuned with the unembedding layer replaced by a linear head, using CE loss. We use Qwen/Qwen-7B (Bai et al., 2023) as the strong model and 28 smaller LLMs as weak models (details in Appx. D.1). $\mathrm{Err}_{\mathrm{w2s}}$ is measured in terms of classification error.

5.4 Results

Strong correlation between $\mathrm{Err}_{\mathrm{w2s}}$ and $\|P_s(I - P_w)\|_{\mathrm{op}}$ across various settings. For each of the weak models, we perform the W2SG procedure to obtain the resulting W2S model. We then measure $\mathrm{Err}_{\mathrm{w2s}}$ and $\|P_s(I - P_w)\|_{\mathrm{op}}$ and plot the results in Figures LABEL:fig:_molecular, LABEL:fig:_embedding, and LABEL:fig:_end2end. Across all the setups, we observe a strong correlation between the two quantities, with high Spearman's correlation values displayed at the top of the figures.
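Spearman's rank correlation used in these plots can be computed in a few lines. The sketch below is our own, with hypothetical metric and error values, and assumes no ties among the scores:

```python
import numpy as np

def spearman(a, b) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks
    (assumes no ties among the scores)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical per-weak-model measurements (one point per weak model):
metric = [0.12, 0.30, 0.45, 0.51, 0.77]   # ||P_s (I - P_w)||_op
err = [0.08, 0.15, 0.21, 0.25, 0.40]      # measured Err_w2s
print(spearman(metric, err))              # 1.0 for a perfectly monotone pair
```

Because Spearman's correlation depends only on the ordering, it measures exactly what matters for the claim here: whether the label-free metric ranks weak models the same way the realized W2S error does.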
The results are highly similar for $\|P_s(I-P_w)P_s\|_{\mathrm{op}}$, as shown in Appx. D.3. Therefore, we focus on $\|P_s(I-P_w)\|_{\mathrm{op}}$ in the main paper. Notably, the correlation between $\mathrm{Err}_{w2s}$ and $\|P_s(I-P_w)\|_{\mathrm{op}}$ extends beyond the theoretical setting, covering the following variations: (1) Loss function and evaluation metric. While Thm. 3.8 is based on linear regression with MSE loss, Exps. II and III demonstrate that the correlation also holds for classification tasks using CE finetuning loss, with $\mathrm{Err}_{w2s}$ measured as classification error. (2) The form of finetuning. Thm. 3.8 assumes that finetuning involves training a function on fixed representations. However, in Exp. III, the entire LLM is finetuned. Despite the complex training dynamics in this scenario, a strong correlation between $\mathrm{Err}_{w2s}$ and $\|P_s(I-P_w)\|_{\mathrm{op}}$ is still observed when activation maps are heuristically used as representations. These results underscore the broad applicability of our conclusion.

Capturing W2SG beyond model size. Smaller weak models can sometimes achieve better $\mathrm{Err}_{w2s}$ than larger ones. For example, in Exp. I, the leftmost yellow point (size 64) outperforms the rightmost teal point (size 128) in Fig. LABEL:fig:_molecular, likely because these smaller models were pretrained for more epochs (recall that our 150 weak models span different combinations of sizes and pretraining epochs), resulting in better representations. Similarly, in Exp. III, the middle column of Fig. 4 shows a poor correlation between $\mathrm{Err}_{w2s}$ and size for models with dimension $\le 8000$. Testing another dimension-based metric, the dimension of approximated principal representations, also reveals weak correlation with $\mathrm{Err}_{w2s}$ (last column of Fig. 4). This underscores the complexity of predicting W2SG performance: larger models or higher representation dimensions do not guarantee better results. Factors such as the pretraining recipe and the quality and relevance of the pretraining data all contribute to the final outcome. However, even in these cases, $\|P_s(I-P_w)\|_{\mathrm{op}}$ consistently captures the trend in $\mathrm{Err}_{w2s}$ (Fig. LABEL:fig:_molecular and the first column of Fig. 4), demonstrating its robustness as a metric that surpasses simple dimensional measures and provides meaningful insights for W2SG.

6 Conclusion

In this work, we show that W2SG can be characterized using kernels derived from the principal components of weak and strong models' representations. The theory applies to a wide range of representation distributions, provides insights into how models' internal structures influence error correction and the conditions for benign overfitting, and offers a label-free metric for predicting W2SG performance, validated through experiments on diverse datasets and LLMs.

Impact Statement

We see positive societal impacts in our work as it advances the understanding of Weak-to-Strong Generalization, a crucial problem for aligning superhuman AI in the future. Our results could enhance transparency in AI systems' behavior through analysis of their internal structures and contribute to the broader goal of improving AI safety and reliability.
Acknowledgement This research was partially supported by the National Science Foundation CAREER Award 2146492 and an OpenAI SuperAlignment Grant. References Allen-Zhu & Li (2020) Allen-Zhu, Z. and Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020. Arora et al. (2018) Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018. Bai et al. (2023) Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. Bartlett et al. (2020) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. Brown et al. (2019) Brown, N., Fiscato, M., Segler, M. H., and Vaucher, A. C. Guacamol: benchmarking models for de novo molecular design. Journal of chemical information and modeling, 59(3):1096–1108, 2019. Burns et al. (2023) Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023. Charikar et al. (2024) Charikar, M., Pabbaraju, C., and Shiragur, K. Quantifying the gain in weak-to-strong generalization. arXiv preprint arXiv:2405.15116, 2024. Charleshen (2020) Charleshen. Chembench: The molecule benchmarks and molmapnet datasets, September 2020. URL https://doi.org/10.5281/zenodo.4054866. Demmel (1992) Demmel, J. The componentwise distance to the nearest singular matrix. SIAM Journal on Matrix Analysis and Applications, 13(1):10–19, 1992. Devlin (2018) Devlin, J. 
Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. El Ghaoui (2002) El Ghaoui, L. Inversion error, condition number, and approximate inverses of uncertain matrices. Linear algebra and its applications, 343:171–193, 2002. Fabian et al. (2020) Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M., and Ahmed, M. Molecular representation learning with language models and domain-relevant auxiliary tasks. arXiv preprint arXiv:2011.13230, 2020. Foldiak (2003) Foldiak, P. Sparse coding in the primate cortex. The handbook of brain theory and neural networks, 2003. Frei et al. (2022) Frei, S., Chatterji, N. S., and Bartlett, P. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory, p. 2668–2703. PMLR, 2022. Hendrycks et al. (2020) Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. Aligning ai with shared human values. arXiv preprint arXiv:2008.02275, 2020. Huang et al. (2019) Huang, L., Bras, R. L., Bhagavatula, C., and Choi, Y. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. arXiv preprint arXiv:1909.00277, 2019. Huh et al. (2021) Huh, M., Mobahi, H., Zhang, R., Cheung, B., Agrawal, P., and Isola, P. The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427, 2021. Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018. Jacot et al. (2020) Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. Implicit regularization of random feature models. In International Conference on Machine Learning, p. 4631–4640. PMLR, 2020. Ji et al. (2023) Ji, W., Deng, Z., Nakada, R., Zou, J., and Zhang, L. The power of contrast for feature learning: A theoretical analysis. 
Journal of Machine Learning Research, 24(330):1–78, 2023. Johnstone (2001) Johnstone, I. M. On the distribution of the largest eigenvalue in principal components analysis. The Annals of statistics, 29(2):295–327, 2001. Kalimeris et al. (2019) Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., and Zhang, H. Sgd on neural networks learns functions of increasing complexity. Advances in neural information processing systems, 32, 2019. Kingma (2014) Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Lang et al. (2024) Lang, H., Sontag, D., and Vijayaraghavan, A. Theoretical analysis of weak-to-strong generalization. arXiv preprint arXiv:2405.16043, 2024. Lee et al. (2024) Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., and Ping, W. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024. Mairal et al. (2014) Mairal, J., Bach, F., Ponce, J., et al. Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2-3):85–283, 2014. Malladi et al. (2023) Malladi, S., Wettig, A., Yu, D., Chen, D., and Arora, S. A kernel-based view of language model fine-tuning. In International Conference on Machine Learning, p. 23610–23641. PMLR, 2023. Mallinar et al. (2022) Mallinar, N., Simon, J., Abedsoltan, A., Pandit, P., Belkin, M., and Nakkiran, P. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. Advances in Neural Information Processing Systems, 35:1182–1195, 2022. Marks & Tegmark (2023) Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023. Muennighoff et al. (2022) Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. Mteb: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022. Muthukumar et al. 
(2021) Muthukumar, V., Narang, A., Subramanian, V., Belkin, M., Hsu, D., and Sahai, A. Classification vs regression in overparameterized regimes: Does the loss function matter? Journal of Machine Learning Research, 22(222):1–69, 2021. Nakada et al. (2023) Nakada, R., Gulluk, H. I., Deng, Z., Ji, W., Zou, J., and Zhang, L. Understanding multimodal contrastive learning and incorporating unpaired data. In International Conference on Artificial Intelligence and Statistics, p. 4348–4380. PMLR, 2023. Nanda et al. (2023) Nanda, N., Lee, A., and Wattenberg, M. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023. Olshausen & Field (1997) Olshausen, B. A. and Field, D. J. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997. Olshausen & Field (2004) Olshausen, B. A. and Field, D. J. Sparse coding of sensory inputs. Current opinion in neurobiology, 14(4):481–487, 2004. Papyan et al. (2017) Papyan, V., Romano, Y., and Elad, M. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18(83):1–52, 2017. Pezeshki et al. (2022) Pezeshki, M., Mitra, A., Bengio, Y., and Lajoie, G. Multi-scale feature learning dynamics: Insights for double descent. In International Conference on Machine Learning, p. 17669–17690. PMLR, 2022. Shen et al. (2022) Shen, R., Bubeck, S., and Gunasekar, S. Data augmentation as feature manipulation. In International conference on machine learning, p. 19773–19808. PMLR, 2022. Shin et al. (2024) Shin, C., Cooper, J., and Sala, F. Weak-to-strong generalization through the data-centric lens. arXiv preprint arXiv:2412.03881, 2024. Somerstep et al. (2024) Somerstep, S., Polo, F. M., Banerjee, M., Ritov, Y., Yurochkin, M., and Sun, Y. A statistical framework for weak-to-strong generalization. arXiv preprint arXiv:2405.16236, 2024. Tropp et al. (2015) Tropp, J. A. et al. 
An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015. Vershynin (2018) Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Wainwright (2019) Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. Wang et al. (2021) Wang, K., Muthukumar, V., and Thrampoulidis, C. Benign overfitting in multiclass classification: All roads lead to interpolation. Advances in Neural Information Processing Systems, 34:24164–24179, 2021. Welbl et al. (2017) Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017. Wen & Li (2021) Wen, Z. and Li, Y. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, p. 11112–11122. PMLR, 2021. Wu & Sahai (2024) Wu, D. X. and Sahai, A. Provable weak-to-strong generalization via benign overfitting. arXiv preprint arXiv:2410.04638, 2024. Wu et al. (2018) Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018. Xue et al. (2023) Xue, Y., Joshi, S., Gan, E., Chen, P.-Y., and Mirzasoleiman, B. Which features are learnt by contrastive learning? on the role of simplicity bias in class collapse and feature suppression. In International Conference on Machine Learning, p. 38938–38970. PMLR, 2023. Yang et al. (2009) Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In 2009 IEEE Conference on computer vision and pattern recognition, p. 1794–1801. IEEE, 2009. Zhang et al. (2021) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. 
Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021. Zhang et al. (2015) Zhang, X., Zhao, J., and LeCun, Y. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28, 2015. Zou et al. (2023) Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023. Zou et al. (2021) Zou, D., Cao, Y., Li, Y., and Gu, Q. Understanding the generalization of adam in learning neural networks with proper regularization. arXiv preprint arXiv:2108.11371, 2021.

Appendix A Main Analysis

In this section, we provide a thorough analysis of the errors associated with the weak model, the W2S model, and the strong ceiling model. Some of these results are used to prove our main conclusion, Theorem 3.8, while others are applied in subsequent analyses.

A.1 Notations and additional notes

Symbol definitions. We introduce the following notations. The symbol $r$ represents a representation, i.e., $r = h(x)$. For the samples in the splits $\tilde{D}$ and $\hat{D}$, we denote their representations as $\tilde{r}_1, \dots, \tilde{r}_{\tilde{n}}$ and $\hat{r}_1, \dots, \hat{r}_{\hat{n}}$, respectively.
We define the sample representation matrices, where each column corresponds to a representation:
$$\tilde{R} \coloneqq [\tilde{r}_1 \ \tilde{r}_2 \ \dots \ \tilde{r}_{\tilde{n}}] \quad \text{and} \quad \hat{R} \coloneqq [\hat{r}_1 \ \hat{r}_2 \ \dots \ \hat{r}_{\hat{n}}].$$
We also define the vectors $\tilde{y}$ and $\hat{y}$, which collect the labels of the samples:
$$\tilde{y} = \begin{bmatrix} \tilde{y}_1 \\ \tilde{y}_2 \\ \vdots \\ \tilde{y}_{\tilde{n}} \end{bmatrix} \quad \text{and} \quad \hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_{\hat{n}} \end{bmatrix}.$$
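To make these definitions concrete, here is a small numpy illustration (our own toy example, not code from the paper) of forming $\tilde{R}$ and $\tilde{y}$, together with the empirical second moment $\frac{1}{\tilde{n}} \tilde{R} \tilde{R}^\top$, assuming the uncentered convention consistent with the concentration conditions later in this appendix:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tilde = 8, 5  # representation dimension and number of samples in D~

# Columns of R~ are the representations r~_i = h(x~_i)
R_tilde = rng.standard_normal((d, n_tilde))
# y~ stacks the corresponding labels into a vector
y_tilde = rng.standard_normal(n_tilde)

# Empirical (uncentered) second moment (1/n~) R~ R~^T: the sample
# analogue of the covariance Sigma used in the analysis
Sigma_tilde = (R_tilde @ R_tilde.T) / n_tilde
```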
For the covariance matrices, we use the following shorthand notations to avoid clutter:
$$\Sigma = \Sigma(h), \quad \hat{\Sigma} = \hat{\Sigma}(h), \quad \tilde{\Sigma} = \tilde{\Sigma}(h),$$
$$\Sigma' = \Sigma(\Pi_{\mathcal{V}} h), \quad \hat{\Sigma}' = \hat{\Sigma}(\Pi_{\mathcal{V}} h), \quad \tilde{\Sigma}' = \tilde{\Sigma}(\Pi_{\mathcal{V}} h),$$
$$\Sigma'' = \Sigma(\Pi_{\mathcal{V}^\perp} h), \quad \hat{\Sigma}'' = \hat{\Sigma}(\Pi_{\mathcal{V}^\perp} h), \quad \tilde{\Sigma}'' = \tilde{\Sigma}(\Pi_{\mathcal{V}^\perp} h).$$

Use of subscripts. Additionally, we use subscripts 'w' and 's' to indicate the model associated with a given quantity. For example, $\tilde{R}_w$ and $\hat{R}_w$ denote the sample representation matrices generated by the weak model, while $\tilde{R}_s$ and $\hat{R}_s$ denote those generated by the strong model. The same convention applies to covariance matrices; for instance, $\hat{\Sigma}'_s = \hat{\Sigma}(\Pi_{\mathcal{V}_s} h_s)$.

Mathematical notations.
For convenience, whenever we write $A = B + o(1)$, where $A$ and $B$ are matrices or vectors, we mean that $\|A - B\|_{\mathrm{op}} = o(1)$. We let $\lambda_i(A)$, $\lambda_{\min}(A)$, $\lambda_{\min, \neq 0}(A)$, and $\lambda_{\max}(A)$ represent the $i$-th, smallest, smallest nonzero, and largest eigenvalues of the matrix $A$, respectively. The expression $A \preceq B$ means that the matrix $B - A$ is positive semidefinite, and $A \succeq B$ means that $A - B$ is positive semidefinite.

Implied proof techniques. Sometimes, in the proofs, we use the triangle inequality and the sub-multiplicativity of norms without explicitly stating them when they are straightforward, as mentioning them would make the text unnecessarily verbose.

A.2 Restatement of Definition 3.3

Here, we restate Definition 3.3 with simplified notations for convenience and clarity in the proofs.

Definition A.1 ($(\delta, \hat{\gamma}, \tilde{\gamma})$-decomposability (restated)). Given $D$, $\tilde{D}$, $\hat{D}$, and a representation function $h$, we say that the representations of $h$ are $(\delta, \hat{\gamma}, \tilde{\gamma})$-decomposable with respect to a subspace $\mathcal{V}$ (of the representation space), for some $\delta = O(1)$, $\hat{\gamma} = O(1)$, and $\tilde{\gamma} = O(1)$, if the following holds. Let $U \Lambda U^\top$ be the singular value decomposition (SVD) of $\Sigma$.
There exists a matrix $U'$, consisting of a subset of columns of $U$ corresponding to nonzero eigenvalues, such that the following conditions are satisfied. Let $U''$ denote the matrix that collects the remaining columns of $U$. Define diagonal matrices $\Lambda'$ and $\Lambda''$ to collect the eigenvalues corresponding to $U'$ and $U''$, respectively. Additionally, define $\Sigma' = U' \Lambda' U'^\top$ and $\Sigma'' = U'' \Lambda'' U''^\top$. Let $\gamma = \min(\hat{\gamma}, \tilde{\gamma})$, and let $\mathcal{V}$ be the span of the columns of $U'$. Now, leveraging the fact that the projection $\Pi_{\mathcal{V}}$ can be written as $U' U'^\top$, and noting that $\lambda_{\min, \neq 0}(\Sigma') = \lambda_{\min}(\Lambda')$, we can reformulate the original Definition 3.3 in terms of $U'$: with high probability $1 - o(1)$,

a. Boundedness. $\|\Sigma\|_{\mathrm{op}} = O(1)$, $\|\hat{\Sigma}\|_{\mathrm{op}} = O(1)$ and $\|\tilde{\Sigma}\|_{\mathrm{op}} = O(1)$. Additionally, $\mathbb{E}[y^2] = O(1)$, $\frac{1}{\hat{n}} \|\hat{y}\|^2 = O(1)$ and $\frac{1}{\tilde{n}} \|\tilde{y}\|^2 = O(1)$.

b. Concentration on $\mathcal{V}$.
The original statement is $\|\hat{\Sigma}' - \Sigma'\|_{\mathrm{op}} = o(\gamma^2 + \delta^2 + \lambda_{\min, \neq 0}(\Sigma')^2)$ and $\|\tilde{\Sigma}' - \Sigma'\|_{\mathrm{op}} = o(\gamma^2 + \delta^2 + \lambda_{\min, \neq 0}(\Sigma')^2)$. However, since
$$\|U'^\top \hat{\Sigma} U' - \Lambda'\|_{\mathrm{op}} = \Big\| \tfrac{1}{\hat{n}} U'^\top \hat{R} \hat{R}^\top U' - \Lambda' \Big\|_{\mathrm{op}} = \Big\| U' \big( \tfrac{1}{\hat{n}} U'^\top \hat{R} \hat{R}^\top U' - \Lambda' \big) U'^\top \Big\|_{\mathrm{op}} = \|\hat{\Sigma}' - \Sigma'\|_{\mathrm{op}},$$
and similarly for $\tilde{\Sigma}'$, we can restate it as: $\|U'^\top \hat{\Sigma} U' - \Lambda'\|_{\mathrm{op}} = o(\gamma^2 + \delta^2 + \lambda_{\min}(\Lambda')^2)$ and $\|U'^\top \tilde{\Sigma} U' - \Lambda'\|_{\mathrm{op}} = o(\gamma^2 + \delta^2 + \lambda_{\min}(\Lambda')^2)$.
Similarly, by noting that the operator norm is invariant under left multiplication by $U'$, we can restate the statement regarding $y$ as: $\big\| U'^\top \tfrac{1}{\sqrt{\tilde{n}}} \tilde{R} \tilde{y} - U'^\top \mathbb{E}[r y] \big\| = o(\gamma + \delta + \lambda_{\min}(\Lambda'))$ and $\big\| U'^\top \tfrac{1}{\sqrt{\hat{n}}} \hat{R} \hat{y} - U'^\top \mathbb{E}[r y] \big\| = o(\gamma + \delta + \lambda_{\min}(\Lambda'))$.

c. Kernel-wise $\delta$-isotropy on $\mathcal{V}^\perp$. $\big\| \tfrac{1}{\hat{n}} \hat{R}^\top U'' U''^\top \hat{R} - \hat{\gamma} I \big\|_{\mathrm{op}} = o(\gamma^2 + \delta^2)$ and $\big\| \tfrac{1}{\tilde{n}} \tilde{R}^\top U'' U''^\top \tilde{R} - \tilde{\gamma} I \big\|_{\mathrm{op}} = o(\gamma^2 + \delta^2)$.

d. Small cross-sample inner-product on $\mathcal{V}^\perp$.
$\big\| \tfrac{1}{\sqrt{\hat{n}}} \hat{R}^\top U'' U''^\top \tfrac{1}{\sqrt{\tilde{n}}} \tilde{R} \big\|_{\mathrm{op}} = o(\gamma + \delta)$.

e. Diminishing population covariance on $\mathcal{V}^\perp$. $\|\Sigma''\|_{\mathrm{op}} = o(\gamma + \delta)$.

Use of subscripts. Since in Assumption 3.7 we assume that the representations of both the weak and strong models satisfy Definition A.1, all the notations in Definition A.1 have corresponding versions for the weak model's representations and the strong model's representations. We follow the previously mentioned convention and use the subscripts 'w' and 's' to distinguish between them. For example, notations such as $U'_w$ and $U'_s$, or $\Lambda'_w$ and $\Lambda'_s$, will be used. The meaning of such notations should be clear from the context in which they appear.

A.3 Lemmas

Below, we introduce some basic lemmas and prove properties that will be used in the later analysis.

Lemma A.2 (Push-through identity). For any matrices $A, B$ and any scalar $a$, the identity $(aI + AB)^{-1} A = A (aI + BA)^{-1}$ holds as long as $aI + AB$ and $aI + BA$ are invertible.

Lemma A.3.
A classical result on the effect of perturbations on the inverse of a square matrix states that $\|(A + \Delta)^{-1} - A^{-1}\|_{\mathrm{op}} \le \|A^{-1}\|_{\mathrm{op}}^2 \|\Delta\|_{\mathrm{op}}$, where $A$ is an invertible square matrix. This result can be found, for example, in (Demmel, 1992) or Equation 1.1 of (El Ghaoui, 2002).

Lemma A.4. If condition Kernel-wise $\delta$-isotropy on $\mathcal{V}^\perp$ holds, we have that
$$\Big\| \tfrac{1}{\tilde{n}} \tilde{R}^\top \tilde{R} - \big( \tfrac{1}{\tilde{n}} \tilde{R}^\top U' U'^\top \tilde{R} + \tilde{\gamma} I \big) \Big\|_{\mathrm{op}} = o(\gamma^2 + \delta^2),$$
and a similar conclusion holds for $\hat{R}$ as well.

Proof.
By Kernel-wise δ-isotropy on ⟂superscriptperpendicular-toV V⟂ , ∥1n~~⊤~−(1n~~⊤′′⊤~+γ~)∥opsubscriptdelimited-∥1~superscript~top~1~superscript~topsuperscript′op~~op 1 n R R-% ( 1 n R U U^% R+ γ I ) _op∥ divide start_ARG 1 end_ARG start_ARG over~ start_ARG n end_ARG end_ARG over~ start_ARG italic_R end_ARG⊤ over~ start_ARG italic_R end_ARG - ( divide start_ARG 1 end_ARG start_ARG over~ start_ARG n end_ARG end_ARG over~ start_ARG italic_R end_ARG⊤ italic_U′ italic_U′ ⊤ over~ start_ARG italic_R end_ARG + over~ start_ARG γ end_ARG italic_I ) ∥op = == ∥1n~~⊤(′′⊤+′′⊤)~−(1n~~⊤′′⊤~+γ~)∥opsubscriptdelimited-∥1~superscript~topsuperscript′opsuperscript′top~1~superscript~topsuperscript′op~~op 1 n R ( U % U + U U )% R- ( 1 n R U^% U R+ γ I )% _op∥ divide start_ARG 1 end_ARG start_ARG over~ start_ARG n end_ARG end_ARG over~ start_ARG italic_R end_ARG⊤ ( italic_U′ italic_U′ ⊤ + italic_U′ ′ italic_U′ ′ ⊤ ) over~ start_ARG italic_R end_ARG - ( divide start_ARG 1 end_ARG start_ARG over~ start_ARG n end_ARG end_ARG over~ start_ARG italic_R end_ARG⊤ italic_U′ italic_U′ ⊤ over~ start_ARG italic_R end_ARG + over~ start_ARG γ end_ARG italic_I ) ∥op = == ∥1n~~⊤′′⊤~−γ~∥opsubscriptdelimited-∥1~superscript~topsuperscript′top~~op 1 n R U % U R- γ I% _op∥ divide start_ARG 1 end_ARG start_ARG over~ start_ARG n end_ARG end_ARG over~ start_ARG italic_R end_ARG⊤ italic_U′ ′ italic_U′ ′ ⊤ over~ start_ARG italic_R end_ARG - over~ start_ARG γ end_ARG italic_I ∥op = == o(γ2+δ2).superscript2superscript2 o(γ^2+δ^2).o ( γ2 + δ2 ) . ∎ Lemma A.5. If condition Kernel-wise δ-isotropy on ⟂superscriptperpendicular-toV V⟂ holds, then for any β=O(1)s.t.β≥δformulae-sequence1β=O(1)~s.t.~β≥δβ = O ( 1 ) s . t . 
$\beta \ge \delta$, we have that
$$\Big\|\Big(\tfrac{1}{\tilde n}\tilde R^\top \tilde R+\beta I\Big)^{-1}-\Big(\tfrac{1}{\tilde n}\tilde R^\top U'U'^\top \tilde R+(\tilde\gamma+\beta) I\Big)^{-1}\Big\|_{\mathrm{op}}=o(1),$$
and a similar conclusion holds for $\hat R$ as well.

Proof. By Kernel-wise $\delta$-isotropy on $V^\perp$,
$$\Big\|\tfrac{1}{\tilde n}\tilde R^\top \tilde R+\beta I-\Big(\tfrac{1}{\tilde n}\tilde R^\top U'U'^\top \tilde R+(\tilde\gamma+\beta) I\Big)\Big\|_{\mathrm{op}}
=\Big\|\tfrac{1}{\tilde n}\tilde R^\top (U'U'^\top+U''U''^\top)\tilde R+\beta I-\Big(\tfrac{1}{\tilde n}\tilde R^\top U'U'^\top \tilde R+(\tilde\gamma+\beta) I\Big)\Big\|_{\mathrm{op}}
=\Big\|\tfrac{1}{\tilde n}\tilde R^\top U''U''^\top \tilde R-\tilde\gamma I\Big\|_{\mathrm{op}}
=o(\gamma^2+\delta^2).$$
Then, by Lemma A.3, we have
$$\Big\|\Big(\tfrac{1}{\tilde n}\tilde R^\top \tilde R+\beta I\Big)^{-1}-\Big(\tfrac{1}{\tilde n}\tilde R^\top U'U'^\top \tilde R+(\tilde\gamma+\beta) I\Big)^{-1}\Big\|_{\mathrm{op}}
\le o(\gamma^2+\delta^2)\,\Big\|\Big(\tfrac{1}{\tilde n}\tilde R^\top U'U'^\top \tilde R+(\tilde\gamma+\beta) I\Big)^{-1}\Big\|_{\mathrm{op}}^2
=o\!\Big(\frac{\gamma^2+\delta^2}{(\tilde\gamma+\beta)^2}\Big)
=o(1). \qquad\blacksquare$$

Lemma A.6. If condition Concentration on $V$ holds, then for any $\beta=O(1)$ s.t. $\beta\ge\delta$, and $\gamma_0\in\{\hat\gamma,\tilde\gamma\}$, we have
$$\big\|\big(U'^\top \tilde\Sigma U'+(\gamma_0+\beta) I\big)^{-1}-\big(\Lambda'+(\gamma_0+\beta) I\big)^{-1}\big\|_{\mathrm{op}}=o(1),$$
and a similar conclusion holds for $\hat\Sigma$ as well.

Proof. By condition Concentration on $V$, we have
$$\big\|U'^\top \tilde\Sigma U'-\Lambda'\big\|_{\mathrm{op}}=o\big(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2\big).$$
Then, by Lemma A.3, we have
$$\big\|\big(U'^\top \tilde\Sigma U'+(\gamma_0+\beta) I\big)^{-1}-\big(\Lambda'+(\gamma_0+\beta) I\big)^{-1}\big\|_{\mathrm{op}}
\le o\big(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2\big)\,\big\|\big(\Lambda'+(\gamma_0+\beta) I\big)^{-1}\big\|_{\mathrm{op}}^2
=o\!\Big(\frac{\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}{(\gamma_0+\beta+\lambda_{\min}(\Lambda'))^2}\Big)
=o(1). \qquad\blacksquare$$
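Both lemmas above run through the same resolvent-perturbation step (Lemma A.3): if two ridge-regularized PSD matrices are close in operator norm, their inverses are close, with the ridge controlling the blow-up. A minimal numerical sketch of this step (variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
G = rng.standard_normal((d, d))
K = G @ G.T / d                          # PSD "kernel" part
E = 1e-3 * rng.standard_normal((d, d))
E = (E + E.T) / 2                        # small symmetric perturbation
beta = 0.1                               # ridge, playing the role of gamma + beta
A = K + beta * np.eye(d)
B = K + E + beta * np.eye(d)

op = lambda M: np.linalg.norm(M, ord=2)
invA, invB = np.linalg.inv(A), np.linalg.inv(B)
# Exact identity behind the perturbation bound: A^-1 - B^-1 = A^-1 (B - A) B^-1.
assert np.allclose(invA - invB, invA @ (B - A) @ invB)
# Hence the difference of inverses is controlled by the difference of the
# matrices, scaled by the two inverse norms (each of order 1/ridge here).
lhs = op(invA - invB)
rhs = op(A - B) * op(invA) * op(invB)
assert lhs <= rhs + 1e-12
```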
Lemma A.7. If conditions Boundedness and Concentration on $V$ hold, then
$$\big|\lambda_{\min}(\Lambda')^2-\lambda_{\min}(U'^\top \hat\Sigma U')^2\big|=o\big(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2\big).$$
It still holds if we replace $\hat\Sigma$ with $\tilde\Sigma$.

Proof. Define $t=\lambda_{\min}(\Lambda')-\lambda_{\min}(U'^\top \hat\Sigma U')$. By condition Concentration on $V$ and Weyl's theorem, we have $|t|=o(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2)$. Then, we compute:
$$\lambda_{\min}(U'^\top \hat\Sigma U')^2
=\lambda_{\min}(\Lambda')^2+t^2-2t\,\lambda_{\min}(\Lambda')
=\lambda_{\min}(\Lambda')^2\pm o\big(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2\big),$$
where the last step follows because $\lambda_{\min}(\Lambda')=O(1)$ (via condition Boundedness) and $|t|=o(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2)$. $\blacksquare$
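The control of $t$ above rests on Weyl's theorem: a symmetric perturbation moves every eigenvalue, in particular the smallest, by at most the perturbation's operator norm. A quick numerical check (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 40
S = rng.standard_normal((d, d)); S = (S + S.T) / 2      # stands in for Lambda'
P = 0.05 * rng.standard_normal((d, d)); P = (P + P.T) / 2  # perturbation

lam_min = lambda M: np.linalg.eigvalsh(M)[0]  # eigvalsh returns ascending order
t = lam_min(S) - lam_min(S + P)
# Weyl's theorem: |lambda_min(S) - lambda_min(S + P)| <= ||P||_op.
assert abs(t) <= np.linalg.norm(P, ord=2) + 1e-12
```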
Corollary A.8. Lemma A.7 further implies that
$$\frac{\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}{\hat\gamma^2+\delta^2+\lambda_{\min}(U'^\top \hat\Sigma U')^2}=O(1)$$
when conditions Boundedness and Concentration on $V$ hold. It still holds if we replace $\hat\Sigma$ with $\tilde\Sigma$.

Proof.
$$\frac{\gamma^2+\delta^2+\lambda_{\min}(U'^\top \hat\Sigma U')^2}{\hat\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}
=\frac{\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}{\hat\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}
-\frac{\lambda_{\min}(\Lambda')^2-\lambda_{\min}(U'^\top \hat\Sigma U')^2}{\hat\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}
\le 1\pm\frac{o\big(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2\big)}{\hat\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}
\le 1+o(1).$$
Therefore,
$$\frac{\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}{\hat\gamma^2+\delta^2+\lambda_{\min}(U'^\top \hat\Sigma U')^2}=O(1). \qquad\blacksquare$$

Corollary A.9. If conditions Boundedness and Concentration on $V$ hold, then for any $q$ with $\|q\|=O(1)$, we have
$$\big\|U'\sqrt{\Lambda'}\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q\big\|^2
=\Big\|\tfrac{1}{\sqrt{\hat n}}\hat R^\top U'\Big(\tfrac{1}{\hat n}U'^\top \hat R\hat R^\top U'+(\hat\gamma+\beta)I\Big)^{-1}q\Big\|^2\pm o(1).$$
It still holds if we replace the hatted quantities with their tilde counterparts.

Proof.
$$\big\|U'\sqrt{\Lambda'}\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q\big\|^2
=q^\top\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}\Lambda'\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q$$
$$=q^\top\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}U'^\top \hat\Sigma U'\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q
\;\pm\; o\Big(\big(\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2\big)\,\big\|\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}\big\|_{\mathrm{op}}^2\Big)$$
by Concentration on $V$ and $\|q\|=O(1)$
$$=q^\top\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}U'^\top \hat\Sigma U'\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q
\pm o\Big(\frac{\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}{(\lambda_{\min}(U'^\top \hat\Sigma U')+\hat\gamma+\beta)^2}\Big)$$
$$=q^\top\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}U'^\top \hat\Sigma U'\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q
\pm o\Big(\frac{\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}{\hat\gamma^2+\beta^2+\lambda_{\min}(U'^\top \hat\Sigma U')^2}\Big)$$
$$=q^\top\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}U'^\top \hat\Sigma U'\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q
\pm o\Big(\frac{\gamma^2+\delta^2+\lambda_{\min}(\Lambda')^2}{\hat\gamma^2+\delta^2+\lambda_{\min}(U'^\top \hat\Sigma U')^2}\Big)$$
$$=q^\top\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}U'^\top \hat\Sigma U'\big(U'^\top \hat\Sigma U'+(\hat\gamma+\beta)I\big)^{-1}q\pm o(1)\quad\text{by Corollary A.8}$$
$$=\Big\|\tfrac{1}{\sqrt{\hat n}}\hat R^\top U'\Big(\tfrac{1}{\hat n}U'^\top \hat R\hat R^\top U'+(\hat\gamma+\beta)I\Big)^{-1}q\Big\|^2\pm o(1). \qquad\blacksquare$$
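The proof of the next corollary opens by moving a factor across a resolvent via the push-through identity (Lemma A.2 in this appendix): $A(BA+\lambda I)^{-1}=(AB+\lambda I)^{-1}A$, which trades an inverse in one dimension for an inverse in the other. A numerical sketch (shapes arbitrary, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 9))   # e.g. plays the role of U'^T R / sqrt(n)
B = rng.standard_normal((9, 6))
lam = 0.3

# Push-through identity: A (BA + lam I)^{-1} = (AB + lam I)^{-1} A,
# which follows from (AB + lam I) A = A (BA + lam I).
lhs = A @ np.linalg.inv(B @ A + lam * np.eye(9))
rhs = np.linalg.inv(A @ B + lam * np.eye(6)) @ A
assert np.allclose(lhs, rhs)
```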
Corollary A.10. If conditions Boundedness and Concentration on $V$ hold, then for any $\psi$ with $\|\psi\|=O(1)$, we have
$$\Big\|U'\sqrt{\Lambda'}U'^\top\tfrac{1}{\sqrt{\hat n}}\hat R\Big(\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R+(\hat\gamma+\beta)I\Big)^{-1}\psi\Big\|^2
=\Big\|\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R\Big(\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R+(\hat\gamma+\beta)I\Big)^{-1}\psi\Big\|^2\pm o(1),$$
and
$$\Big\|U'\sqrt{\Lambda'}U'^\top\tfrac{1}{\sqrt{\hat n}}\hat R\Big(\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R+(\hat\gamma+\beta)I\Big)^{-1}\psi\Big\|=O(1).$$
It still holds if we replace the hatted quantities with their tilde counterparts.

Proof. First, we have
$$\Big\|U'\sqrt{\Lambda'}U'^\top\tfrac{1}{\sqrt{\hat n}}\hat R\Big(\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R+(\hat\gamma+\beta)I\Big)^{-1}\psi\Big\|^2
=\Big\|U'\sqrt{\Lambda'}\Big(\tfrac{1}{\hat n}U'^\top \hat R\hat R^\top U'+(\hat\gamma+\beta)I\Big)^{-1}U'^\top\tfrac{1}{\sqrt{\hat n}}\hat R\,\psi\Big\|^2\quad\text{by Lemma A.2}$$
$$=\Big\|\tfrac{1}{\sqrt{\hat n}}\hat R^\top U'\Big(\tfrac{1}{\hat n}U'^\top \hat R\hat R^\top U'+(\hat\gamma+\beta)I\Big)^{-1}U'^\top\tfrac{1}{\sqrt{\hat n}}\hat R\,\psi\Big\|^2\pm o(1)$$
by the fact that $\big\|U'^\top\tfrac{1}{\sqrt{\hat n}}\hat R\,\psi\big\|=O(1)$ (via Boundedness) and invoking Corollary A.9
$$=\Big\|\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R\Big(\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R+(\hat\gamma+\beta)I\Big)^{-1}\psi\Big\|^2\pm o(1)\quad\text{by Lemma A.2.}$$

Additionally, since
$$\Big\|\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R\Big(\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R+(\hat\gamma+\beta)I\Big)^{-1}\Big\|_{\mathrm{op}}
=\frac{\big\|\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R\big\|_{\mathrm{op}}}{\big\|\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R\big\|_{\mathrm{op}}+\hat\gamma+\beta}\le 1,$$
we also have the bound
$$\Big\|U'\sqrt{\Lambda'}U'^\top\tfrac{1}{\sqrt{\hat n}}\hat R\Big(\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R+(\hat\gamma+\beta)I\Big)^{-1}\psi\Big\|=O(1). \qquad\blacksquare$$
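The contraction step at the end of the proof above, $\|K(K+cI)^{-1}\|_{\mathrm{op}}=\|K\|_{\mathrm{op}}/(\|K\|_{\mathrm{op}}+c)\le 1$ for a PSD kernel $K$ and ridge $c>0$, can be checked numerically (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 10
R = rng.standard_normal((n, d))
K = R @ R.T / n                    # PSD kernel matrix (rank <= d)
c = 0.2                            # plays the role of gamma-hat + beta

S = K @ np.linalg.inv(K + c * np.eye(n))
lam_max = np.linalg.eigvalsh(K)[-1]
op = np.linalg.norm(S, ord=2)
# The smoother K (K + cI)^{-1} shrinks each eigenvalue lam to lam/(lam + c),
# so its operator norm is lam_max/(lam_max + c) < 1.
assert np.isclose(op, lam_max / (lam_max + c))
assert op <= 1.0
```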
Lemma A.11. If condition Kernel-wise $\delta$-isotropy on $V^\perp$ holds, then
$$\Big\|U''^\top\tfrac{1}{\sqrt{\hat n}}\hat R\Big\|_{\mathrm{op}}\le\sqrt{o(\gamma^2+\delta^2)+\hat\gamma}.$$
Similarly, $\big\|U''^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R\big\|_{\mathrm{op}}\le\sqrt{o(\gamma^2+\delta^2)+\tilde\gamma}$.

Proof. By condition Kernel-wise $\delta$-isotropy on $V^\perp$ and the triangle inequality, we have
$$\Big\|\tfrac{1}{\hat n}\hat R^\top U''U''^\top \hat R\Big\|_{\mathrm{op}}\le o(\gamma^2+\delta^2)+\hat\gamma.$$
Then,
$$\Big\|U''^\top\tfrac{1}{\sqrt{\hat n}}\hat R\Big\|_{\mathrm{op}}=\sqrt{\Big\|\tfrac{1}{\hat n}\hat R^\top U''U''^\top \hat R\Big\|_{\mathrm{op}}}\le\sqrt{o(\gamma^2+\delta^2)+\hat\gamma}. \qquad\blacksquare$$
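The last step of the proof above is the identity $\|M\|_{\mathrm{op}}=\sqrt{\|M^\top M\|_{\mathrm{op}}}$: the squared top singular value of a matrix equals the top eigenvalue of its Gram matrix. A one-line numerical check (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((7, 12))   # stands in for U''^T R / sqrt(n)
op = lambda X: np.linalg.norm(X, ord=2)
# ||M||_op^2 equals the top eigenvalue of the Gram matrix M^T M.
assert np.isclose(op(M), np.sqrt(op(M.T @ M)))
```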
A.4 Basic expressions for the model weights and errors

Let $w_{\mathrm{w}}\in\mathbb{R}^{d_{\mathrm{w}}}$, $w_{\mathrm{w2s}}\in\mathbb{R}^{d_{\mathrm{s}}}$, and $w_{\mathrm{s}}\in\mathbb{R}^{d_{\mathrm{s}}}$ represent the weights of the linear models $f_{\mathrm{w}}$, $f_{\mathrm{w2s}}$, and $f_{\mathrm{s}}$, respectively. Using the well-known closed-form solution for the minimizer of the MSE loss with $\ell_2$ regularization, we derive their formulas:
$$w_{\mathrm{w}}=\tfrac{1}{\sqrt{\tilde n}}\tilde R_{\mathrm{w}}\Big(\tfrac{1}{\tilde n}\tilde R_{\mathrm{w}}^\top\tilde R_{\mathrm{w}}+\beta_{\mathrm{w}}I\Big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y$$
$$w_{\mathrm{w2s}}=\tfrac{1}{\sqrt{\hat n}}\hat R_{\mathrm{s}}\Big(\tfrac{1}{\hat n}\hat R_{\mathrm{s}}^\top\hat R_{\mathrm{s}}+\beta_{\mathrm{s}}I\Big)^{-1}\tfrac{1}{\sqrt{\hat n}}\big(\hat R_{\mathrm{w}}^\top w_{\mathrm{w}}\big)\qquad(3)$$
$$w_{\mathrm{s}}=\tfrac{1}{\sqrt{\hat n}}\hat R_{\mathrm{s}}\Big(\tfrac{1}{\hat n}\hat R_{\mathrm{s}}^\top\hat R_{\mathrm{s}}+\beta_{\mathrm{s}}I\Big)^{-1}\tfrac{1}{\sqrt{\hat n}}\hat y.$$

Then, we derive the expression of PredGap:
$$\mathrm{PredGap}
=\mathbb{E}_{r_{\mathrm{s}}}\big[(r_{\mathrm{s}}^\top w_{\mathrm{s}}-r_{\mathrm{s}}^\top w_{\mathrm{w2s}})^2\big]
=\mathbb{E}_{r_{\mathrm{s}}}\big[\big(r_{\mathrm{s}}^\top(w_{\mathrm{s}}-w_{\mathrm{w2s}})\big)^2\big]\qquad(4)$$
$$=\mathbb{E}_{r_{\mathrm{s}}}\big[(w_{\mathrm{s}}-w_{\mathrm{w2s}})^\top r_{\mathrm{s}}r_{\mathrm{s}}^\top(w_{\mathrm{s}}-w_{\mathrm{w2s}})\big]
=(w_{\mathrm{s}}-w_{\mathrm{w2s}})^\top\,\mathbb{E}_{r_{\mathrm{s}}}[r_{\mathrm{s}}r_{\mathrm{s}}^\top]\,(w_{\mathrm{s}}-w_{\mathrm{w2s}})
=(w_{\mathrm{s}}-w_{\mathrm{w2s}})^\top\Sigma_{\mathrm{s}}(w_{\mathrm{s}}-w_{\mathrm{w2s}})
=\big\|\sqrt{\Sigma_{\mathrm{s}}}\,(w_{\mathrm{s}}-w_{\mathrm{w2s}})\big\|^2$$
$$=\Big\|\underbrace{\sqrt{\Sigma_{\mathrm{s}}}\,\tfrac{1}{\sqrt{\hat n}}\hat R_{\mathrm{s}}\Big(\tfrac{1}{\hat n}\hat R_{\mathrm{s}}^\top\hat R_{\mathrm{s}}+\beta_{\mathrm{s}}I\Big)^{-1}}_{\text{a transformation determined by the strong model's representations}}\;\underbrace{\Big(\tfrac{1}{\sqrt{\hat n}}\hat y-\tfrac{1}{\sqrt{\hat n}}\hat R_{\mathrm{w}}^\top w_{\mathrm{w}}\Big)}_{\text{weak model's normalized error vector on }\hat D}\Big\|$$
$$=\Big\|\sqrt{\Sigma_{\mathrm{s}}}\,\tfrac{1}{\sqrt{\hat n}}\hat R_{\mathrm{s}}\Big(\tfrac{1}{\hat n}\hat R_{\mathrm{s}}^\top\hat R_{\mathrm{s}}+\beta_{\mathrm{s}}I\Big)^{-1}\Big(\tfrac{1}{\sqrt{\hat n}}\hat y-\tfrac{1}{\sqrt{\hat n}}\hat R_{\mathrm{w}}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_{\mathrm{w}}\Big(\tfrac{1}{\tilde n}\tilde R_{\mathrm{w}}^\top\tilde R_{\mathrm{w}}+\beta_{\mathrm{w}}I\Big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)\Big\|.\qquad(5)$$

From the above, we see that PredGap can be broken into two parts: the weak model's normalized error vector on $\hat D$, and a transformation applied to this error vector which captures how the weak model's errors propagate to the strong model. In Sections A.5 and A.6, we will analyze each part individually.
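The ridge closed forms in equation (3) and the quadratic form behind PredGap in equations (4)-(5) can be sanity-checked on synthetic data: the "dual"/kernel expression used here equals the standard primal ridge solution by push-through, and the expected squared prediction gap is a $\Sigma$-weighted norm of the weight difference. All names below are illustrative, and the population covariance is replaced by an empirical one:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, beta = 8, 100, 0.5
R = rng.standard_normal((d, n))    # columns are per-sample representations
y = rng.standard_normal(n)

# "Dual"/kernel form of the ridge solution, as in equation (3):
w_dual = (R / np.sqrt(n)) @ np.linalg.inv(R.T @ R / n + beta * np.eye(n)) @ (y / np.sqrt(n))
# Standard primal ridge solution; equal to the dual form by push-through.
w_primal = np.linalg.inv(R @ R.T / n + beta * np.eye(d)) @ (R @ y / n)
assert np.allclose(w_dual, w_primal)

# PredGap-style quadratic form: E_r[(r^T delta)^2] = delta^T Sigma delta
# = ||sqrt(Sigma) delta||^2; here Sigma is the empirical second moment.
delta = rng.standard_normal(d)
Sigma = R @ R.T / n
L = np.linalg.cholesky(Sigma)      # L L^T = Sigma, so L^T serves as sqrt(Sigma)
assert np.isclose(delta @ Sigma @ delta, np.linalg.norm(L.T @ delta) ** 2)
```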
A.5 The weak model's error

Lemma A.12 (The weak model's error on $\hat D$). The weak model's error vector on $\hat D$ can be approximated as follows:
$$\Big\|\Big(\tfrac{1}{\sqrt{\hat n}}\hat y-\tfrac{1}{\sqrt{\hat n}}\hat R_{\mathrm{w}}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_{\mathrm{w}}\Big(\tfrac{1}{\tilde n}\tilde R_{\mathrm{w}}^\top\tilde R_{\mathrm{w}}+\beta_{\mathrm{w}}I\Big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)-(I-P_{\mathrm{w}})\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|=o(1),$$
where
$$P_{\mathrm{w}}=\tfrac{1}{\hat n}\hat R_{\mathrm{w}}^\top U_{\mathrm{w}}'U_{\mathrm{w}}'^\top \hat R_{\mathrm{w}}\Big(\tfrac{1}{\hat n}\hat R_{\mathrm{w}}^\top U_{\mathrm{w}}'U_{\mathrm{w}}'^\top \hat R_{\mathrm{w}}+(\tilde\gamma_{\mathrm{w}}+\beta_{\mathrm{w}})I\Big)^{-1}.$$

Proof.
By condition Boundedness and Lemma A.5, we have
\begin{align*}
&\tfrac{1}{\sqrt{\hat n}}\hat R_w^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w \big(\tfrac{1}{\tilde n}\tilde R_w^\top \tilde R_w + \beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y \\
&= \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w \big(\tfrac{1}{\tilde n}\tilde R_w^\top U_w' U_w'^\top \tilde R_w + (\tilde\gamma_w+\beta_w) I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y + o(1) \\
&= \Big(\tfrac{1}{\sqrt{\hat n}}\hat R_w^\top U_w' U_w'^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w + \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top U_w'' U_w''^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w\Big) \big(\tfrac{1}{\tilde n}\tilde R_w^\top U_w' U_w'^\top \tilde R_w + (\tilde\gamma_w+\beta_w) I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y + o(1).
\end{align*}
By conditions Small cross-sample inner-product on $V^\perp$ and Boundedness, and noting that $\big\|\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_w' U_w'^\top \tilde R_w + (\tilde\gamma_w+\beta_w) I\big)^{-1}\big\|_{\mathrm{op}} \le \tfrac{1}{\tilde\gamma_w+\beta_w}$, the preceding can be further simplified as
\begin{align*}
&\tfrac{1}{\sqrt{\hat n}}\hat R_w^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w \big(\tfrac{1}{\tilde n}\tilde R_w^\top \tilde R_w + \beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y \\
&= \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top U_w' U_w'^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w \big(\tfrac{1}{\tilde n}\tilde R_w^\top U_w' U_w'^\top \tilde R_w + (\tilde\gamma_w+\beta_w) I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y + o(1) \\
&= \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top U_w' \big(\tfrac{1}{\tilde n} U_w'^\top \tilde R_w \tilde R_w^\top U_w' + (\tilde\gamma_w+\beta_w) I\big)^{-1} U_w'^\top \tfrac{1}{\tilde n}\tilde R_w \tilde y + o(1) \qquad \text{by Lemma A.2.}
\end{align*}
By Lemma A.6 and condition Boundedness, the above further leads to
\[
\tfrac{1}{\sqrt{\hat n}}\hat R_w^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w \big(\tfrac{1}{\tilde n}\tilde R_w^\top \tilde R_w + \beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y
= \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top U_w' \big(\Lambda_w' + (\tilde\gamma_w+\beta_w) I\big)^{-1} U_w'^\top \tfrac{1}{\tilde n}\tilde R_w \tilde y + o(1).
\]
Condition Concentration on $V$ implies, via the triangle inequality, that
\[
\big\| U_w'^\top \tfrac{1}{\tilde n}\tilde R_w \tilde y - U_w'^\top \tfrac{1}{\hat n}\hat R_w \hat y \big\| = o\big(\lambda_{\min}(\Lambda_w') + \tilde\gamma_w + \beta_w\big).
\]
Then, by condition Boundedness and the fact that $\big\|\big(\Lambda_w' + (\tilde\gamma_w+\beta_w) I\big)^{-1}\big\|_{\mathrm{op}} = \tfrac{1}{\lambda_{\min}(\Lambda_w')+\tilde\gamma_w+\beta_w}$, we further have
\begin{align*}
&\tfrac{1}{\sqrt{\hat n}}\hat R_w^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w \big(\tfrac{1}{\tilde n}\tilde R_w^\top \tilde R_w + \beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y \\
&= \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top U_w' \big(\Lambda_w' + (\tilde\gamma_w+\beta_w) I\big)^{-1} U_w'^\top \tfrac{1}{\hat n}\hat R_w \hat y + o(1) \\
&= \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top U_w' \big(\tfrac{1}{\hat n} U_w'^\top \hat R_w \hat R_w^\top U_w' + (\tilde\gamma_w+\beta_w) I\big)^{-1} U_w'^\top \tfrac{1}{\hat n}\hat R_w \hat y + o(1) \qquad \text{by Lemma A.6 and condition Boundedness} \\
&= \tfrac{1}{\hat n}\hat R_w^\top U_w' U_w'^\top \hat R_w \big(\tfrac{1}{\hat n}\hat R_w^\top U_w' U_w'^\top \hat R_w + (\tilde\gamma_w+\beta_w) I\big)^{-1}\tfrac{1}{\sqrt{\hat n}}\hat y + o(1) \qquad \text{by Lemma A.2.}
\end{align*}
Let us define the shorthand
\[
P_w = \tfrac{1}{\hat n}\hat R_w^\top U_w' U_w'^\top \hat R_w \big(\tfrac{1}{\hat n}\hat R_w^\top U_w' U_w'^\top \hat R_w + (\tilde\gamma_w+\beta_w) I\big)^{-1}.
\]
Then, we conclude that
\[
\tfrac{1}{\sqrt{\hat n}}\hat y - \tfrac{1}{\sqrt{\hat n}}\hat R_w^\top \tfrac{1}{\sqrt{\tilde n}}\tilde R_w \big(\tfrac{1}{\tilde n}\tilde R_w^\top \tilde R_w + \beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y = (I - P_w)\tfrac{1}{\sqrt{\hat n}}\hat y + o(1). \qquad \blacksquare
\]

A.6 Propagation of the error to the strong model

Lemma A.13.
For any $\psi$ with $\|\psi\| = O(1)$, we have
\[
\big\| \sqrt{\Sigma_s}\,\tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top \hat R_s + \beta_s I\big)^{-1}\psi \big\|^2 = \|P_s\psi\|^2 \pm o(1),
\]
where
\[
P_s = \tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s + \beta_s) I\big)^{-1}.
\]

Proof.
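The argument below, like the proof of Theorem 3.8 in Section A.7, leans on elementary resolvent facts for a PSD matrix $A$ and $\lambda>0$: $\|(A+\lambda I)^{-1}\|_{\mathrm{op}} = \tfrac{1}{\lambda_{\min}(A)+\lambda}$ and $\|I - A(A+\lambda I)^{-1}\|_{\mathrm{op}} = \tfrac{\lambda}{\lambda_{\min}(A)+\lambda} \le 1$. A small numerical check of both facts (illustrative only; $A$ is a random PSD stand-in for terms like $\tfrac{1}{\hat n}\hat R^\top U'U'^\top \hat R$):

```python
import numpy as np

# For PSD A and lam > 0:
#   ||(A + lam I)^{-1}||_op      = 1 / (lambda_min(A) + lam)
#   ||I - A (A + lam I)^{-1}||_op = lam / (lambda_min(A) + lam) <= 1
rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T                       # random PSD matrix
lam = 0.7
n = A.shape[0]

inv = np.linalg.inv(A + lam * np.eye(n))
lmin = np.linalg.eigvalsh(A)[0]   # smallest eigenvalue (ascending order)

op_inv = np.linalg.norm(inv, 2)                     # spectral norm
op_residual = np.linalg.norm(np.eye(n) - A @ inv, 2)
```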
We first decompose $\sqrt{\Sigma_s}\,\tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top \hat R_s + \beta_s I\big)^{-1}$ as follows:
\begin{align*}
&\sqrt{\Sigma_s}\,\tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top \hat R_s + \beta_s I\big)^{-1} \\
&= \sqrt{\Sigma_s}\,\tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1} + o(1) \qquad \text{by Lemma A.5} \\
&= \big(U_s'\sqrt{\Lambda_s'}\,U_s'^\top + U_s''\sqrt{\Lambda_s''}\,U_s''^\top\big)\tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1} + o(1) \\
&= U_s'\sqrt{\Lambda_s'}\,U_s'^\top \tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1} \\
&\quad + U_s''\sqrt{\Lambda_s''}\,U_s''^\top \tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1} + o(1). \tag{6}
\end{align*}
The second term above can be bounded:
\begin{align*}
&\big\| U_s''\sqrt{\Lambda_s''}\,U_s''^\top \tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1} \big\|_{\mathrm{op}} \\
&\le \sqrt{\lambda_{\max}(\Lambda_s'')}\,\frac{\sqrt{o(\gamma_s^2+\delta_s^2)+\hat\gamma_s}}{\hat\gamma_s+\beta_s} \qquad \text{by Boundedness and Lemma A.11} \\
&\le \sqrt{\|\Sigma_s''\|_{\mathrm{op}}}\,\frac{\sqrt{o(\gamma_s^2+\delta_s^2)+\hat\gamma_s}}{\hat\gamma_s+\delta_s} \\
&= o\Bigg(\sqrt{\frac{(\gamma_s+\delta_s)\,o(\gamma_s^2+\delta_s^2)+\hat\gamma_s(\gamma_s+\delta_s)}{(\hat\gamma_s+\delta_s)^2}}\,\Bigg) \qquad \text{by Diminishing population covariance on } V^\perp \\
&\le o\Bigg(\sqrt{\frac{o(\gamma_s^2+\delta_s^2)}{\hat\gamma_s+\delta_s}+\frac{\hat\gamma_s}{\hat\gamma_s+\delta_s}}\,\Bigg) = o(1). \tag{7}
\end{align*}
Combining Equations 6 and 7 yields
\[
\sqrt{\Sigma_s}\,\tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top \hat R_s + \beta_s I\big)^{-1}\psi = U_s'\sqrt{\Lambda_s'}\,U_s'^\top \tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1}\psi + o(1).
\]
Finally, we consider the squared norm:
\begin{align*}
&\big\|\sqrt{\Sigma_s}\,\tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top \hat R_s + \beta_s I\big)^{-1}\psi\big\|^2 \\
&= \big\|U_s'\sqrt{\Lambda_s'}\,U_s'^\top \tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1}\psi\big\|^2 \\
&\quad \pm o\Big(\big\|U_s'\sqrt{\Lambda_s'}\,U_s'^\top \tfrac{1}{\sqrt{\hat n}}\hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1}\psi\big\|\Big) \pm o(1) \\
&= \big\|\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s \big(\tfrac{1}{\hat n}\hat R_s^\top U_s' U_s'^\top \hat R_s + (\hat\gamma_s+\beta_s) I\big)^{-1}\psi\big\|^2 \pm o(1) \qquad \text{by Corollary A.10.} \qquad \blacksquare
\end{align*}

A.7 Proof of Theorem 3.8

Given that $\|\tfrac{1}{\sqrt{\hat n}}\hat y\| = O(1)$ by Boundedness, and that
\[
\|I - P_w\|_{\mathrm{op}} = \frac{\beta_w}{\lambda_{\min}\big(\tfrac{1}{\hat n}\hat R_w^\top U_w' U_w'^\top \hat R_w\big)+\beta_w} \le 1,
\]
we have $\|(I-P_w)\tfrac{1}{\sqrt{\hat n}}\hat y\| = O(1)$. Then, by Lemma A.12, the weak model's error on $\hat D$ can be bounded as $\|(I-P_w)\tfrac{1}{\sqrt{\hat n}}\hat y\| + o(1) = O(1)$. Recalling the expression of PredGap derived in Equation 5 and applying Lemmas A.12 and A.13, we obtain:
\[
\mathrm{PredGap} = \big\|P_s(I-P_w)\tfrac{1}{\sqrt{\hat n}}\hat y\big\|^2 \pm o(1). \qquad \blacksquare
\]

Appendix B Additional Analysis

B.1 Additional Lemmas

Lemma B.1. By Diminishing population covariance on $V^\perp$ and Boundedness, we have
\[
\mathbb{E}[U''U''^\top r y] = o(\sqrt{\gamma+\delta}).
\]

Proof.
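The chain of bounds below combines Cauchy–Schwarz with the fact that the operator norm of $\tfrac{1}{\sqrt n}U''U''^\top R$ equals $\sqrt{\|\tfrac{1}{n}U''U''^\top RR^\top U''U''^\top\|_{\mathrm{op}}}$. A finite-sample numerical sketch of those two steps (synthetic data; all matrices are illustrative stand-ins, not the paper's setup):

```python
import numpy as np

# Finite-sample version of the bounds in Lemma B.1, with P = U'' U''^T:
#   || (1/n) P R y || <= ||(1/sqrt(n)) P R||_op * ||y|| / sqrt(n),
#   ||(1/sqrt(n)) P R||_op^2 = ||(1/n) P R R^T P^T||_op.
rng = np.random.default_rng(2)
d, n = 6, 50
R = rng.normal(size=(d, n))                     # columns play the role of r_i
y = rng.normal(size=n)
U2, _ = np.linalg.qr(rng.normal(size=(d, 2)))   # stand-in residual directions
P = U2 @ U2.T                                   # projection onto them

lhs = np.linalg.norm(P @ R @ y / n)
op = np.sqrt(np.linalg.norm(P @ R @ R.T @ P.T / n, 2))
rhs = op * np.sqrt(np.mean(y ** 2))             # = op * ||y|| / sqrt(n)
```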
\begin{align*}
\mathbb{E}[U''U''^\top r y] &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n U''U''^\top r_i y_i = \lim_{n\to\infty}\frac{1}{\sqrt n}U''U''^\top R\,\frac{1}{\sqrt n}y \le \lim_{n\to\infty}\big\|\tfrac{1}{\sqrt n}U''U''^\top R\big\|_{\mathrm{op}}\,\big\|\tfrac{1}{\sqrt n}y\big\| \\
&= \lim_{n\to\infty}\sqrt{\big\|\tfrac{1}{n}U''U''^\top R R^\top U''U''^\top\big\|_{\mathrm{op}}}\,\sqrt{\tfrac{1}{n}\sum_{i=1}^n y_i^2} = \sqrt{\|\Sigma''\|_{\mathrm{op}}}\,\sqrt{\mathbb{E}[y^2]} = o(\sqrt{\gamma+\delta}). \qquad \blacksquare
\end{align*}

Lemma B.2. By Boundedness, we have $\mathbb{E}[U'U'^\top r y] = O(1)$.

Proof. The proof follows the same approach as that of Lemma B.1. This conclusion can also be derived by bounding $\mathbb{E}[U'U'^\top r y]$ in terms of its empirical counterpart using Concentration on $V$, and then applying Boundedness. $\blacksquare$

B.2 When $\mathrm{Err}_{\mathrm{w2s}} \approx \mathrm{PredGap} + \mathrm{Err}_{\mathrm{sc}}$

Theorem B.3. Suppose that, in addition to Assumption 3.7, the conditions $\beta_s + \hat\gamma_s = o\big(\lambda_{\min,\neq 0}(\Sigma(\Pi_{\mathcal{V}_s} h_s))\big)$ with $\lambda_{\min,\neq 0}(\Sigma(\Pi_{\mathcal{V}_s} h_s)) = \Theta(1)$, and $\lambda_{\min,\neq 0}(\Sigma(\Pi_{\mathcal{V}_s} h_s)) = \Theta\big(\lambda_{\max}(\Sigma(\Pi_{\mathcal{V}_s} h_s))\big)$ hold. Then, w.h.p., we have:
\[
\mathrm{Err}_{\mathrm{w2s}} = \mathrm{PredGap} + \mathrm{Err}_{\mathrm{sc}} \pm o(1).
\]

Proof.
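The first step of the proof, the decomposition in Equation 8, is an exact algebraic identity under any expectation, including an empirical one. A numerical sketch with arbitrary stand-in weight vectors for $w_{\mathrm{w2s}}$ and $w_{\mathrm{sc}}$ (illustrative only):

```python
import numpy as np

# Exact finite-sample version of Equation (8): with empirical expectations
# over columns r_i,
#   Err_w2s - PredGap - Err_sc = 2 (w_w2s - w_sc)^T (Sigma_s w_sc - E[r_s y]).
rng = np.random.default_rng(3)
d, n = 5, 200
Rs = rng.normal(size=(d, n))       # columns play the role of r_s
y = rng.normal(size=n)
w_w2s = rng.normal(size=d)         # stand-in for the W2S solution
w_sc = rng.normal(size=d)          # stand-in for the strong-ceiling solution

preds_w2s = Rs.T @ w_w2s
preds_sc = Rs.T @ w_sc
err_w2s = np.mean((preds_w2s - y) ** 2)
pred_gap = np.mean((preds_w2s - preds_sc) ** 2)
err_sc = np.mean((preds_sc - y) ** 2)

Sigma_s = Rs @ Rs.T / n            # empirical E[r_s r_s^T]
Ery = Rs @ y / n                   # empirical E[r_s y]
cross = 2 * (w_w2s - w_sc) @ (Sigma_s @ w_sc - Ery)
```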
First, decompose $\mathrm{Err}_{\mathrm{w2s}}$ as follows:
$$\begin{aligned}
\mathrm{Err}_{\mathrm{w2s}} &= \mathbb{E}\big[(r_s^\top w_{\mathrm{w2s}} - y)^2\big] \\
&= \mathbb{E}\big[(r_s^\top w_{\mathrm{w2s}} - r_s^\top w_{\mathrm{sc}} + r_s^\top w_{\mathrm{sc}} - y)^2\big] \\
&= \mathbb{E}\big[(r_s^\top w_{\mathrm{w2s}} - r_s^\top w_{\mathrm{sc}})^2\big] + \mathbb{E}\big[(r_s^\top w_{\mathrm{sc}} - y)^2\big] + 2\,\mathbb{E}\big[(w_{\mathrm{w2s}}^\top r_s - w_{\mathrm{sc}}^\top r_s)(r_s^\top w_{\mathrm{sc}} - y)\big] \\
&= \mathrm{PredGap} + \mathrm{Err}_{\mathrm{sc}} + 2\,(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top\big(\Sigma_s w_{\mathrm{sc}} - \mathbb{E}[r_s y]\big).
\end{aligned} \tag{8}$$
Thus, to prove the theorem, it suffices to show $\big|(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top(\Sigma_s w_{\mathrm{sc}} - \mathbb{E}[r_s y])\big| = o(1)$. We decompose $(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top(\Sigma_s w_{\mathrm{sc}} - \mathbb{E}[r_s y])$:
$$\begin{aligned}
&(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top\big(\Sigma_s w_{\mathrm{sc}} - \mathbb{E}[r_s y]\big) \\
&= (w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top\big(\Sigma_s' w_{\mathrm{sc}} + \Sigma_s'' w_{\mathrm{sc}} - U_s' U_s'^\top \mathbb{E}[r_s y] - U_s'' U_s''^\top \mathbb{E}[r_s y]\big) \\
&= (w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top\big(\Sigma_s' w_{\mathrm{sc}} - U_s' U_s'^\top \mathbb{E}[r_s y]\big) + (w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top \Sigma_s'' w_{\mathrm{sc}} - (w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top U_s'' U_s''^\top \mathbb{E}[r_s y].
\end{aligned} \tag{9}$$
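The decomposition in Equation 8 is a purely algebraic identity between empirical moments, so it can be verified numerically. The following is a minimal NumPy sketch with synthetic data; all array names are hypothetical stand-ins for the paper's quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5

# Synthetic stand-ins: rows of R play the role of r_s^T, and the two
# weight vectors play the roles of w_w2s and w_sc.
R = rng.normal(size=(n, d))
y = rng.normal(size=n)
w_w2s = rng.normal(size=d)
w_sc = rng.normal(size=d)

pred_w2s = R @ w_w2s
pred_sc = R @ w_sc

# Left-hand side: empirical Err_w2s.
err_w2s = np.mean((pred_w2s - y) ** 2)

# Right-hand side of Equation 8, built from the same empirical moments.
pred_gap = np.mean((pred_w2s - pred_sc) ** 2)
err_sc = np.mean((pred_sc - y) ** 2)
Sigma_s = R.T @ R / n          # empirical E[r_s r_s^T]
Ery = R.T @ y / n              # empirical E[r_s y]
cross = 2 * (w_w2s - w_sc) @ (Sigma_s @ w_sc - Ery)

assert np.isclose(err_w2s, pred_gap + err_sc + cross)
```

Since both sides use identical empirical moments, the identity holds exactly up to floating-point error.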
$w_{\mathrm{w2s}} - w_{\mathrm{sc}}$ can be approximated as:
$$\begin{aligned}
w_{\mathrm{w2s}} - w_{\mathrm{sc}} &= \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\big(\hat{R}_w^\top w_w\big) - \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{y} \\
&= \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1}\Big(\frac{1}{\sqrt{\hat{n}}}\hat{R}_w^\top w_w - \frac{1}{\sqrt{\hat{n}}}\hat{y}\Big) \\
&= \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{K}_s' + (\hat{\gamma}_s + \beta_s) I\Big)^{-1}\Big(\frac{1}{\sqrt{\hat{n}}}\hat{R}_w^\top w_w - \frac{1}{\sqrt{\hat{n}}}\hat{y}\Big) + o(1) \quad \text{by Lemma A.5 and that the other terms are } O(1),
\end{aligned} \tag{10}$$
where $\hat{K}_s' = \hat{R}_s^\top U_s' U_s'^\top \hat{R}_s$ is shorthand for $\hat{K}(\Pi_{\mathcal{V}_s} h_s)$. Then, by Lemma A.11 and Boundedness, we obtain:
$$\big\|(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top U_s''\big\| = O\bigg(\frac{\sqrt{o(\gamma_s^2 + \delta_s^2) + \hat{\gamma}_s}}{\hat{\gamma}_s + \beta_s}\bigg). \tag{11}$$
We also have the following bound:
$$\begin{aligned}
&\Big\|U_s''^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{y}\Big\|_{\mathrm{op}} \\
&= \Big\|U_s''^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top U_s' U_s'^\top \hat{R}_s + (\hat{\gamma}_s + \beta_s) I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{y}\Big\|_{\mathrm{op}} + o\Big(\Big\|U_s''^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big\|_{\mathrm{op}}\Big) \quad \text{by Boundedness and Lemma A.5} \\
&= O\bigg(\frac{\sqrt{o(\gamma_s^2 + \delta_s^2) + \hat{\gamma}_s}}{\hat{\gamma}_s + \beta_s}\bigg) \quad \text{by Lemma A.11 and Boundedness.}
\end{aligned} \tag{12}$$
Combining Diminishing population covariance on $\mathcal{V}^\perp$ with Equations 11 and 12, the second term in Equation 9 can be bounded as:
$$\big|(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top \Sigma_s'' w_{\mathrm{sc}}\big| = \big|(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top U_s'' \Lambda_s'' U_s''^\top w_{\mathrm{sc}}\big| = o\bigg(\frac{\big(o(\gamma_s^2 + \delta_s^2) + \hat{\gamma}_s\big)(\gamma_s + \delta_s)}{(\hat{\gamma}_s + \beta_s)^2}\bigg) = o(1). \tag{13}$$
The third term in Equation 9 can be bounded as:
$$\begin{aligned}
\big|(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top U_s'' U_s''^\top \mathbb{E}[r_s y]\big| &\le \big\|(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top U_s''\big\|\,\big\|U_s''^\top \mathbb{E}[r_s y]\big\| \\
&= O\bigg(\frac{\sqrt{o(\gamma_s^2 + \delta_s^2) + \hat{\gamma}_s}}{\hat{\gamma}_s + \beta_s}\bigg)\, o\big(\sqrt{\gamma_s + \delta_s}\big) \quad \text{by Equation 11 and Lemma B.1} \\
&= o(1).
\end{aligned} \tag{14}$$
Now, it remains to bound the first term in Equation 9.
We start by approximating $\Sigma_s' w_{\mathrm{sc}} - U_s' U_s'^\top \mathbb{E}[r_s y]$:
$$\begin{aligned}
&\Sigma_s' w_{\mathrm{sc}} - U_s' U_s'^\top \mathbb{E}[r_s y] \\
&= U_s' \Lambda_s' U_s'^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{y} - U_s' U_s'^\top \mathbb{E}[r_s y] \\
&= U_s' \Lambda_s' U_s'^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top U_s' U_s'^\top \hat{R}_s + (\hat{\gamma}_s + \beta_s) I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{y} - U_s' U_s'^\top \mathbb{E}[r_s y] + o(1) \quad \text{by Lemma A.5 and Boundedness} \\
&= U_s' U_s'^\top \hat{\Sigma}_s U_s' U_s'^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top U_s' U_s'^\top \hat{R}_s + (\hat{\gamma}_s + \beta_s) I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{y} - U_s' U_s'^\top \mathbb{E}[r_s y] + o(1) \quad \text{by Concentration on } \mathcal{V} \text{ and Boundedness} \\
&= U_s' U_s'^\top \hat{\Sigma}_s U_s' U_s'^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top U_s' U_s'^\top \hat{R}_s + (\hat{\gamma}_s + \beta_s) I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{y} - U_s' U_s'^\top \frac{1}{\hat{n}}\hat{R}_s\hat{y} + o(1) \quad \text{by Concentration on } \mathcal{V}.
\end{aligned} \tag{15}$$
Due to the two additional assumptions in the statement of the theorem, along with Concentration on $\mathcal{V}$ and Boundedness, the RHSs of both Equation 10 and Equation 15 are $O(1)$. Combining Equations 10 and 15, we obtain:
$$\begin{aligned}
&(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top\big(\Sigma_s' w_{\mathrm{sc}} - U_s' U_s'^\top \mathbb{E}[r_s y]\big) \\
&= \Big(\frac{1}{\sqrt{\hat{n}}}\hat{R}_w^\top w_w - \frac{1}{\sqrt{\hat{n}}}\hat{y}\Big)^\top \Big(\frac{1}{\hat{n}}\hat{K}_s' + (\hat{\gamma}_s + \beta_s) I\Big)^{-1} \\
&\quad \times \frac{1}{\sqrt{\hat{n}}}\hat{R}_s^\top\Big(U_s' U_s'^\top \hat{\Sigma}_s U_s' U_s'^\top \big(U_s' U_s'^\top \hat{\Sigma}_s U_s' U_s'^\top + (\hat{\gamma}_s + \beta_s) I\big)^{-1} - U_s' U_s'^\top\Big)\frac{1}{\sqrt{\hat{n}}}\hat{R}_s\, \frac{1}{\sqrt{\hat{n}}}\hat{y} + o(1) \\
&= \Big(\frac{1}{\sqrt{\hat{n}}}\hat{R}_w^\top w_w - \frac{1}{\sqrt{\hat{n}}}\hat{y}\Big)^\top \Big(\frac{1}{\hat{n}}\hat{K}_s' + (\hat{\gamma}_s + \beta_s) I\Big)^{-1} \times \Big(\frac{1}{\hat{n}}\hat{K}_s'\Big(\frac{1}{\hat{n}}\hat{K}_s' + (\hat{\gamma}_s + \beta_s) I\Big)^{-1}\frac{1}{\hat{n}}\hat{K}_s' - \frac{1}{\hat{n}}\hat{K}_s'\Big)\frac{1}{\sqrt{\hat{n}}}\hat{y} + o(1) \quad \text{by Lemma A.2} \\
&= \Big(\frac{1}{\sqrt{\hat{n}}}\hat{R}_w^\top w_w - \frac{1}{\sqrt{\hat{n}}}\hat{y}\Big)^\top \big(P_s P_s - P_s\big)^\top \frac{1}{\sqrt{\hat{n}}}\hat{y} + o(1).
\end{aligned} \tag{16}$$
$P_s P_s - P_s$'s eigenvalues are given by:
$$\bigg(\frac{\lambda_i\big(\frac{1}{\hat{n}}\hat{K}_s'\big)}{\lambda_i\big(\frac{1}{\hat{n}}\hat{K}_s'\big) + (\hat{\gamma}_s + \beta_s)}\bigg)^2 - \frac{\lambda_i\big(\frac{1}{\hat{n}}\hat{K}_s'\big)}{\lambda_i\big(\frac{1}{\hat{n}}\hat{K}_s'\big) + (\hat{\gamma}_s + \beta_s)} = -\bigg(\frac{\lambda_i\big(\frac{1}{\hat{n}}\hat{K}_s'\big)}{\lambda_i\big(\frac{1}{\hat{n}}\hat{K}_s'\big) + (\hat{\gamma}_s + \beta_s)}\bigg)\bigg(\frac{\hat{\gamma}_s + \beta_s}{\lambda_i\big(\frac{1}{\hat{n}}\hat{K}_s'\big) + (\hat{\gamma}_s + \beta_s)}\bigg).$$
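The step invoking Lemma A.2 rests on the standard push-through identity, $R(R^\top R + \beta I)^{-1} = (RR^\top + \beta I)^{-1}R$, which moves the regularized inverse from kernel (sample) space to feature space. A quick numerical check with synthetic matrices (hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 8
R = rng.normal(size=(d, n))   # d x n, like R_s in the proof
beta = 0.3

# Push-through identity: R (R^T R + beta I_n)^{-1} = (R R^T + beta I_d)^{-1} R
lhs = R @ np.linalg.inv(R.T @ R + beta * np.eye(n))
rhs = np.linalg.inv(R @ R.T + beta * np.eye(d)) @ R
assert np.allclose(lhs, rhs)
```

The identity holds for any matrix $R$ and any $\beta > 0$, since $R(R^\top R + \beta I) = (RR^\top + \beta I)R$.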
Since $\frac{1}{\hat{n}}\hat{K}_s'$ and $\hat{\Sigma}_s'$ share non-zero eigenvalues, we analyze the relation between $\beta_s + \hat{\gamma}_s$ and $\hat{\Sigma}_s'$'s non-zero eigenvalues. By Concentration on $\mathcal{V}$ and Weyl's Theorem,
$$\big|\lambda_{\min,\neq 0}(\hat{\Sigma}_s') - \lambda_{\min,\neq 0}(\Sigma_s')\big| = o\big(\gamma_s^2 + \delta_s^2 + \lambda_{\min,\neq 0}(\Sigma_s')\big).$$
Combining this with $\beta_s + \hat{\gamma}_s = o\big(\lambda_{\min,\neq 0}(\Sigma_s')\big)$, we conclude:
$$\beta_s + \hat{\gamma}_s = o\big(\lambda_{\min,\neq 0}(\hat{\Sigma}_s')\big). \tag{17}$$
Using Equation 17, we then obtain $\|P_s P_s - P_s\|_{\mathrm{op}} = o(1)$.
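The conclusion $\|P_s P_s - P_s\|_{\mathrm{op}} = o(1)$ follows because each eigenvalue of $P_sP_s - P_s$ is a product of a factor at most $1$ and a factor $\frac{\hat{\gamma}_s + \beta_s}{\lambda_i + (\hat{\gamma}_s + \beta_s)}$, which is small whenever $\lambda_i$ is either zero or much larger than the regularization. A NumPy sketch under these assumptions (synthetic low-rank kernel, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 50, 10

# PSD kernel of rank r whose non-zero eigenvalues are Theta(1).
A = rng.normal(size=(n, r))
K = A @ A.T / r

# eigvalsh returns eigenvalues in ascending order; index n - r is the
# smallest non-zero one (the first n - r are numerically zero).
lam_min_nonzero = np.sort(np.linalg.eigvalsh(K))[n - r]
reg = 1e-4 * lam_min_nonzero           # plays the role of gamma_hat_s + beta_s

P = K @ np.linalg.inv(K + reg * np.eye(n))
# Eigenvalues of P are mu_i / (mu_i + reg); those of P @ P - P are
# -(mu_i / (mu_i + reg)) * (reg / (mu_i + reg)), all O(reg / lam_min_nonzero).
gap = np.linalg.norm(P @ P - P, ord=2)
assert gap < 1e-3
```

With `reg` four orders of magnitude below the smallest non-zero eigenvalue, the operator norm of $P_sP_s - P_s$ is of order $10^{-4}$, mirroring the $o(1)$ claim.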
By Lemma A.12, the term $\big(\frac{1}{\sqrt{\hat{n}}}\hat{R}_w^\top w_w - \frac{1}{\sqrt{\hat{n}}}\hat{y}\big)$ can be bounded by $\big\|(I - P_w)\frac{1}{\sqrt{\hat{n}}}\hat{y}\big\| + o(1) = O(1)$, and $\big\|\frac{1}{\sqrt{\hat{n}}}\hat{y}\big\| = O(1)$ by Boundedness. Combining all these results, the RHS of Equation 16 is $o(1)$. Therefore, $\big|(w_{\mathrm{w2s}} - w_{\mathrm{sc}})^\top(\Sigma_s w_{\mathrm{sc}} - \mathbb{E}[r_s y])\big| = o(1)$, which completes the proof. ∎

B.3 Proof of results in Section 4

B.3.1 Proof of Theorem 4.1

First, we present the following lemma, which provides a sufficient condition under which any labeling can be fitted by the W2S model.

Lemma B.4 (Condition for overfitting arbitrary labels). As long as $\delta_s = o(\hat{\gamma}_s)$ and $\delta_s \le \beta_s = o(\hat{\gamma}_s)$, given any $f_w \circ h_w$ s.t. $\frac{1}{\hat{n}}\sum_{i=1}^{\hat{n}} f_w(h_w(\hat{x}_i))^2 = O(1)$, the weak-to-strong model can almost exactly overfit it, as indicated by an almost zero training error: $\frac{1}{\hat{n}}\sum_{i=1}^{\hat{n}}\big(f_{\mathrm{w2s}}(h_s(\hat{x}_i)) - f_w(h_w(\hat{x}_i))\big)^2 = o(1)$, with high probability $1 - o(1)$.

Proof. Let $\hat{T} \in \mathbb{R}^{\hat{n}}$ denote the weak model's predictions on $\hat{D}$. The following holds for all $\hat{T}$ such that $\frac{1}{\hat{n}}\|\hat{T}\|^2 = O(1)$.
The training loss can be expressed as:
$$\begin{aligned}
\frac{1}{\hat{n}}\big\|\hat{R}_s^\top w_{\mathrm{w2s}} - \hat{T}\big\|^2 &= \Big\|\frac{1}{\sqrt{\hat{n}}}\hat{R}_s^\top \frac{1}{\sqrt{\hat{n}}}\hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1}\frac{1}{\sqrt{\hat{n}}}\hat{T} - \frac{1}{\sqrt{\hat{n}}}\hat{T}\Big\|^2 \quad \text{by Equation 3} \\
&= \Big\|\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1} - I\Big)\frac{1}{\sqrt{\hat{n}}}\hat{T}\Big\|^2 \\
&\le \Big\|\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s + \beta_s I\Big)^{-1} - I\Big\|_{\mathrm{op}}^2\,\Big\|\frac{1}{\sqrt{\hat{n}}}\hat{T}\Big\|^2 \\
&= \bigg(\frac{\beta_s}{\lambda_{\min}\big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s\big) + \beta_s}\bigg)^2\,\Big\|\frac{1}{\sqrt{\hat{n}}}\hat{T}\Big\|^2 \\
&= O\bigg(\bigg(\frac{\beta_s}{\lambda_{\min}\big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s\big) + \beta_s}\bigg)^2\bigg) \quad \text{because we assume } \tfrac{1}{\hat{n}}\|\hat{T}\|^2 = O(1).
\end{aligned} \tag{18}$$
By Lemma A.4 and Weyl's Theorem, we have:
$$\Big|\lambda_{\min}\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s\Big) - \lambda_{\min}\Big(\frac{1}{\hat{n}}\hat{R}_s^\top U_s' U_s'^\top \hat{R}_s + \hat{\gamma}_s I\Big)\Big| \le \Big\|\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s - \Big(\frac{1}{\hat{n}}\hat{R}_s^\top U_s' U_s'^\top \hat{R}_s + \hat{\gamma}_s I\Big)\Big\|_{\mathrm{op}} = o(\gamma_s^2 + \delta_s^2)$$
$$\implies \lambda_{\min}\Big(\frac{1}{\hat{n}}\hat{R}_s^\top \hat{R}_s\Big) \ge \lambda_{\min}\Big(\frac{1}{\hat{n}}\hat{R}_s^\top U_s' U_s'^\top \hat{R}_s + \hat{\gamma}_s I\Big) - o(\gamma_s^2 + \delta_s^2) \ge \hat{\gamma}_s - o(\gamma_s^2 + \delta_s^2). \tag{19}$$
Substituting Equation 19 into Equation 18 yields
$$\frac{1}{\hat{n}}\big\|\hat{R}_s^\top w_{\mathrm{w2s}} - \hat{T}\big\|^2 = O\bigg(\bigg(\frac{\beta_s}{\hat{\gamma}_s - o(\gamma_s^2 + \delta_s^2) + \beta_s}\bigg)^2\bigg) = o(1) \quad \text{because we assume } \beta_s = o(\hat{\gamma}_s) \text{ and } \delta_s = o(\hat{\gamma}_s),$$
which completes the proof. ∎

The first statement in Theorem 4.1 can now be readily proved by invoking the above lemma. For the second statement in Theorem 4.1, we first apply the triangle inequality, which gives $\sqrt{\mathrm{Err}_{\mathrm{w2s}}} \le \sqrt{\mathrm{PredGap}} + \sqrt{\mathrm{Err}_{\mathrm{sc}}}$. Given the assumption $\mathrm{Err}_{\mathrm{sc}} = o(1)$ and the fact that Theorem 3.8 implies $\mathrm{PredGap} = O(1)$, we obtain $\mathrm{Err}_{\mathrm{w2s}} \le \mathrm{PredGap} + o(1)$. Furthermore, by our assumption combined with Theorem 3.8, we know $\mathrm{PredGap} = \mathrm{Err}_w - \Delta + o(1)$.
Substituting this into the previous inequality yields $\mathrm{Err}_{\mathrm{w2s}}\le\mathrm{Err}_w-\Delta+o(1)$.

B.3.2 Proof of Corollary 4.3

We begin by presenting the following general result regarding the test errors of the weak model and the strong ceiling model.

Lemma B.5 (The weak model's error on the population). If $\big|\mathbb{E}[y^2]-\frac{1}{\hat n}\sum_{i=1}^{\hat n}\hat y_i^2\big|=o(1)$ w.h.p., then the weak model's error on the population, $\mathrm{Err}_w$, can be approximated as follows:
\[
\mathrm{Err}_w=\Big\|(I-P_w)\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2\pm o(1).
\]
A similar conclusion holds for the strong ceiling's error $\mathrm{Err}_{sc}$ as well:
\[
\mathrm{Err}_{sc}=\Big\|(I-P_s)\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2\pm o(1).
\]

Proof. We decompose the error as follows:
\[
\mathrm{Err}_w=\mathbb{E}\big[(r_w^\top w-y)^2\big]=w^\top\Sigma_w w-2\,w^\top\mathbb{E}[r_w y]+\mathbb{E}[y^2]. \tag{20}
\]
The first term can further be decomposed as:
\begin{align*}
w^\top\Sigma_w w
&=\Big\|\sqrt{\Lambda_w}\,U_w^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top\tilde R_w+\beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\\
&=\Big\|\sqrt{\Lambda_w}\,U_w^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\pm o(1)
&&\text{by Lemma A.4 and Boundedness}\\
&=\bigg\|\begin{bmatrix}\sqrt{\Lambda_{w'}}\,U_{w'}^\top\\ \sqrt{\Lambda_{w''}}\,U_{w''}^\top\end{bmatrix}\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\bigg\|^2\pm o(1)\\
&=\Big\|\sqrt{\Lambda_{w'}}\,U_{w'}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\\
&\quad+\Big\|\sqrt{\Lambda_{w''}}\,U_{w''}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\pm o(1). \tag{21}
\end{align*}
We bound the second term in Equation 21:
\begin{align*}
&\Big\|\sqrt{\Lambda_{w''}}\,U_{w''}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\\
&\le\|\Lambda_{w''}\|_{\mathrm{op}}\,\Big\|U_{w''}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\Big\|_{\mathrm{op}}^2\,\Big\|\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\Big\|_{\mathrm{op}}^2\,\Big\|\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\\
&\le\frac{o(\gamma_w+\delta_w)\big(o(\gamma_w^2+\delta_w^2)+\tilde\gamma_w\big)}{(\beta_w+\tilde\gamma_w)^2}
&&\text{by Diminishing population covariance on $V_\perp$, Lemma A.11 and Boundedness}\\
&=o(1). \tag{22}
\end{align*}
Then, we approximate the first term in Equation 21:
\begin{align*}
&\Big\|\sqrt{\Lambda_{w'}}\,U_{w'}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\\
&=\Big\|\sqrt{\Lambda_{w'}}\big(\tfrac{1}{\tilde n}U_{w'}^\top\tilde R_w\tilde R_w^\top U_{w'}+(\beta_w+\tilde\gamma_w)I\big)^{-1}U_{w'}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\,\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2
&&\text{by Lemma A.2}\\
&=\Big\|\sqrt{\Lambda_{w'}}\big(\Lambda_{w'}+(\beta_w+\tilde\gamma_w)I\big)^{-1}U_{w'}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\,\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\pm o(1)
&&\text{by Lemma A.6 and Boundedness}\\
&=\Big\|\sqrt{\Lambda_{w'}}\big(\tfrac{1}{\hat n}U_{w'}^\top\hat R_w\hat R_w^\top U_{w'}+(\beta_w+\tilde\gamma_w)I\big)^{-1}U_{w'}^\top\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\,\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big\|^2\pm o(1)
&&\text{by Lemma A.6 and Boundedness}\\
&=\Big\|\sqrt{\Lambda_{w'}}\big(\tfrac{1}{\hat n}U_{w'}^\top\hat R_w\hat R_w^\top U_{w'}+(\beta_w+\tilde\gamma_w)I\big)^{-1}U_{w'}^\top\tfrac{1}{\sqrt{\hat n}}\hat R_w\,\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2\pm o(1)
&&\text{by Concentration on $V$ and Boundedness}\\
&=\Big\|\sqrt{\Lambda_{w'}}\,U_{w'}^\top\tfrac{1}{\sqrt{\hat n}}\hat R_w\big(\tfrac{1}{\hat n}\hat R_w^\top U_{w'}U_{w'}^\top\hat R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2\pm o(1)
&&\text{by Lemma A.2}\\
&=\Big\|\tfrac{1}{\hat n}\hat R_w^\top U_{w'}U_{w'}^\top\hat R_w\big(\tfrac{1}{\hat n}\hat R_w^\top U_{w'}U_{w'}^\top\hat R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2\pm o(1)
&&\text{by Corollary A.10 and Boundedness}\\
&=\Big\|P_w\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2\pm o(1). \tag{23}
\end{align*}
Now, we approximate the second term in Equation 20:
\begin{align*}
w^\top\mathbb{E}[r_w y]
&=\Big(\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top\tilde R_w+\beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)^\top\mathbb{E}[r_w y]\\
&=\Big(\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top\tilde R_w+\beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)^\top U_{w'}U_{w'}^\top\mathbb{E}[r_w y]
+\Big(\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top\tilde R_w+\beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)^\top U_{w''}U_{w''}^\top\mathbb{E}[r_w y]\\
&=\Big(\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top\tilde R_w+\beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)^\top U_{w'}U_{w'}^\top\mathbb{E}[r_w y]
\pm o\bigg(\frac{\sqrt{o(\gamma_w^2+\delta_w^2)+\tilde\gamma_w}}{\tilde\gamma_w+\beta_w}\sqrt{\gamma_w+\delta_w}\bigg)
&&\text{by Boundedness, Lemmas A.5, A.11 and B.1}\\
&=\Big(\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top\tilde R_w+\beta_w I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)^\top U_{w'}U_{w'}^\top\mathbb{E}[r_w y]\pm o(1)\\
&=\Big(\tfrac{1}{\sqrt{\tilde n}}\tilde R_w\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde y\Big)^\top U_{w'}U_{w'}^\top\mathbb{E}[r_w y]\pm o(1)
&&\text{by Lemma A.5, Boundedness, and Lemma B.2}\\
&=\tfrac{1}{\sqrt{\tilde n}}\tilde y^\top\big(\tfrac{1}{\tilde n}\tilde R_w^\top U_{w'}U_{w'}^\top\tilde R_w+(\beta_w+\tilde\gamma_w)I\big)^{-1}\tfrac{1}{\sqrt{\tilde n}}\tilde R_w^\top U_{w'}U_{w'}^\top\mathbb{E}[r_w y]\pm o(1)\\
&=\tfrac{1}{\sqrt{\tilde n}}\tilde y^\top\,\tfrac{1}{\sqrt{\tilde n}}\tilde R_w^\top U_{w'}\big(\tfrac{1}{\tilde n}U_{w'}^\top\tilde R_w\tilde R_w^\top U_{w'}+(\beta_w+\tilde\gamma_w)I\big)^{-1}U_{w'}^\top\mathbb{E}[r_w y]\pm o(1)
&&\text{by Lemma A.2}\\
&=\tfrac{1}{\sqrt{\tilde n}}\tilde y^\top\,\tfrac{1}{\sqrt{\tilde n}}\tilde R_w^\top U_{w'}\big(\tfrac{1}{\hat n}U_{w'}^\top\hat R_w\hat R_w^\top U_{w'}+(\beta_w+\tilde\gamma_w)I\big)^{-1}U_{w'}^\top\mathbb{E}[r_w y]\pm o(1)
&&\text{by Lemma A.6, Lemma B.2, and Boundedness}\\
&=\tfrac{1}{\hat n}\hat y^\top\hat R_w^\top U_{w'}\big(\tfrac{1}{\hat n}U_{w'}^\top\hat R_w\hat R_w^\top U_{w'}+(\beta_w+\tilde\gamma_w)I\big)^{-1}U_{w'}^\top\tfrac{1}{\hat n}\hat R_w\hat y\pm o(1)
&&\text{by Concentration on $V$, Boundedness and Lemma B.2}\\
&=\tfrac{1}{\sqrt{\hat n}}\hat y^\top P_w\tfrac{1}{\sqrt{\hat n}}\hat y\pm o(1)
&&\text{by Lemma A.2}. \tag{24}
\end{align*}
Combining Equations 20, 21, 22, 23, 24, and the assumption about $\mathbb{E}[y^2]$ yields
\[
\mathrm{Err}_w=\Big\|(I-P_w)\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2\pm o(1).
\]
The proof of the result concerning $\mathrm{Err}_{sc}$ is similar.
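The final combination of Equations 20–24 uses the projection identity $\|Pv\|^2-2v^\top Pv+\|v\|^2=\|(I-P)v\|^2$, which holds for any orthogonal projection $P$. A minimal numerical sketch (arbitrary dimensions and seed, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 5

# Orthogonal projection P onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
P = Q @ Q.T
v = rng.standard_normal(n)

# For a projection, ‖Pv‖² = v⊤Pv, hence
# ‖Pv‖² − 2 v⊤Pv + ‖v‖² = v⊤(I − P)v = ‖(I − P)v‖².
lhs = np.linalg.norm(P @ v) ** 2 - 2 * v @ P @ v + v @ v
rhs = np.linalg.norm((np.eye(n) - P) @ v) ** 2
assert abs(lhs - rhs) < 1e-8
```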
∎

We show that the condition regarding $\mathbb{E}[y^2]$ is satisfied in Example 4.2. Specifically, $\sum_{i=1}^{\hat n}\hat y_i^2$ follows a $\chi^2(\hat n)$ distribution, with a mean of $\hat n\,\mathbb{E}[y^2]$ and a variance of $2\hat n$. For simplicity, we demonstrate the following result using Chebyshev's inequality, while noting that tighter bounds could be achieved with tail bounds for $\chi^2$ variables or Lemma C.3. For any $k>0$, we have:
\[
\Pr\Big(\Big|\sum_{i=1}^{\hat n}\hat y_i^2-\hat n\,\mathbb{E}[y^2]\Big|\ge k\sqrt{2\hat n}\Big)\le\frac{1}{k^2}.
\]
Letting $k=\hat n^{1/4}$, we find that with probability $1-O\big(\tfrac{1}{\sqrt{\hat n}}\big)$,
\[
\Big|\frac{1}{\hat n}\sum_{i=1}^{\hat n}\hat y_i^2-\mathbb{E}[y^2]\Big|=O\Big(\frac{1}{\hat n^{1/4}}\Big).
\]
Thus, Lemma B.5 applies to Example 4.2. Now, based on Lemmas A.12, B.5, and Theorem 3.8, the key to computing the errors of all these models boils down to simply computing $P_w$ and $P_s$.
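The Chebyshev argument above predicts that the empirical second moment deviates from $\mathbb{E}[y^2]$ by $O(\hat n^{-1/4})$ with high probability. An illustrative one-draw simulation (sample size and slack constant are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_hat = 100_000

y = rng.standard_normal(n_hat)   # E[y²] = 1, so Σᵢ yᵢ² ~ χ²(n̂)
dev = abs(y @ y / n_hat - 1.0)   # |(1/n̂) Σᵢ ŷᵢ² − E[y²]|

# Chebyshev with k = n̂^{1/4} gives a deviation of order n̂^{-1/4};
# the constant 3 is generous slack for this single draw.
assert dev <= 3 * n_hat ** -0.25
```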
We first compute the kernels. For convenience, we use the shorthand notations $\hat K_w$ and $\hat K_s$ to represent $\hat K(\Pi_{\mathcal{V}_w}h_w)$ and $\hat K(\Pi_{\mathcal{V}_s}h_s)$, respectively. Since the representations in Example 4.2 are decomposable with respect to the subspace corresponding to the first coordinate, for both the weak and strong models, the principal kernels are rank one and can be expressed as $\hat K_w=\hat q\hat q^\top$ and $\hat K_s=\hat y\hat y^\top$, where $\hat q\coloneqq\sqrt{\eta}\,\hat y+\sqrt{1-\eta}\,\hat\zeta$. Then $\frac{1}{\hat n}\hat K_w$ has a single nonzero eigenvalue $\big\|\frac{1}{\sqrt{\hat n}}\hat q\big\|^2$, with the corresponding eigenvector $\frac{1}{\|\frac{1}{\sqrt{\hat n}}\hat q\|}\frac{1}{\sqrt{\hat n}}\hat q$. Similarly, $\frac{1}{\hat n}\hat K_s$ has a single nonzero eigenvalue $\big\|\frac{1}{\sqrt{\hat n}}\hat y\big\|^2$, with the corresponding eigenvector $\frac{1}{\|\frac{1}{\sqrt{\hat n}}\hat y\|}\frac{1}{\sqrt{\hat n}}\hat y$.

Next, we present the following lemma.

Lemma B.6. We have the following:
\[
\Big\|\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2=1\pm o(1),\quad
\Big\|\tfrac{1}{\sqrt{\hat n}}\hat\zeta\Big\|^2=1\pm o(1),\quad
\Big|\tfrac{1}{\sqrt{\hat n}}\hat\zeta^\top\tfrac{1}{\sqrt{\hat n}}\hat y\Big|=o(1),\quad
\Big\|\tfrac{1}{\sqrt{\hat n}}\hat q\Big\|^2=1\pm o(1).
\]

Proof. The first two statements can be proved by leveraging classical results on the concentration of Gaussian matrices (see Lemma C.3 for details). The third statement follows as a special case of Lemma C.4. The last statement is implied by the previous three. ∎

Recall that both the weak and strong models' representations in Example 4.2 are special cases of Example 3.5. Given that $\sigma^2=o(\hat n)$ and $\tilde n=\Theta(\hat n)$, we have $\hat\gamma_w$, $\tilde\gamma_w$, $\hat\gamma_s$, and $\tilde\gamma_s$ all being $o(1)$, $\delta_w=\delta_s=0$, and $\beta_w=o(1)$, $\beta_s=o(1)$. Combining these with Lemma B.6, we derive:
\[
\Big\|P_w-\tfrac{1}{\hat n}\hat q\hat q^\top\Big\|_{\mathrm{op}}=o(1),\qquad
\Big\|P_s-\tfrac{1}{\hat n}\hat y\hat y^\top\Big\|_{\mathrm{op}}=o(1).
\]
Now, leveraging Lemma B.6, we can derive all the errors using the expressions provided in Lemmas A.12, B.5, and Theorem 3.8.
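The rank-one structure of the kernels and the concentration claims of Lemma B.6 can be illustrated numerically. In this sketch, `y` and `z` are independent standard Gaussian stand-ins for $\hat y$ and $\hat\zeta$, and the value of `eta` is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n_hat, eta = 20_000, 0.7         # η chosen arbitrarily for illustration

y = rng.standard_normal(n_hat)   # stand-in for ŷ
z = rng.standard_normal(n_hat)   # stand-in for ζ̂ (independent of ŷ)
q = np.sqrt(eta) * y + np.sqrt(1 - eta) * z   # q̂ = √η ŷ + √(1−η) ζ̂

# (1/n̂) K̂_w = (1/n̂) q̂ q̂⊤ is rank one; its only nonzero eigenvalue
# is ‖q̂/√n̂‖², which concentrates around 1 (Lemma B.6).
lam = q @ q / n_hat
assert abs(lam - 1.0) < 0.05      # ‖q̂/√n̂‖² = 1 ± o(1)
assert abs(y @ z / n_hat) < 0.05  # near-orthogonality of ŷ and ζ̂
```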
B.4 Proof of Corollary 5.2

Following Theorem 3.8, we bound the RHS as follows:
$$\begin{aligned}
\mathrm{PredGap} &= \Big\|P_s(I-P_w)P_s\tfrac{1}{\sqrt{\hat n}}\hat y + P_s(I-P_w)(I-P_s)\tfrac{1}{\sqrt{\hat n}}\hat y\Big\|^2 \pm o(1)\\
&\le \Big(\|P_s(I-P_w)P_s\|_{\mathrm{op}}\,\big\|\tfrac{1}{\sqrt{\hat n}}\hat y\big\| + \|P_s(I-P_w)\|_{\mathrm{op}}\,\big\|(I-P_s)\tfrac{1}{\sqrt{\hat n}}\hat y\big\|\Big)^2 + o(1) \qquad (25)\\
&\le \Big(\|P_s(I-P_w)P_s\|_{\mathrm{op}}\sqrt{C} + \big\|(I-P_s)\tfrac{1}{\sqrt{\hat n}}\hat y\big\|\Big)^2 + o(1)\\
&= \Big(\|P_s(I-P_w)P_s\|_{\mathrm{op}}\sqrt{C} + \sqrt{\mathrm{Err}_{\mathrm{sc}} + o(1)}\Big)^2 + o(1) \qquad \text{by Lemma B.5}\\
&= \Big(\|P_s(I-P_w)P_s\|_{\mathrm{op}}\sqrt{C} + \sqrt{\mathrm{Err}_{\mathrm{sc}}}\Big)^2 + o(1).
\end{aligned}$$

Appendix C Proof of Examples in Section 3.3

C.1 Example 3.4

For convenience, let $q = \mathrm{intdim}(\Sigma)$ and $\tau = \|\Sigma\|_{\mathrm{op}}$. Firstly, we note that the conditions in the example imply a low intrinsic dimension. Here's why: since $\mathrm{Tr}(\Sigma) = \mathbb{E}\|r\|^2 \le B$, it follows that
$$\mathrm{intdim}(\Sigma) = \frac{\mathrm{Tr}(\Sigma)}{\|\Sigma\|_{\mathrm{op}}} \le \frac{B}{\tau} = O(B), \qquad (26)$$
where the last step holds because $\tau = \|\Sigma\|_{\mathrm{op}} = \Theta(1)$. Given that $n^{1-c} = \omega(B\log q)$, we then have $n^{1-c} = \omega(q\log q)$, as mentioned in the remark. Additionally, since $\mathrm{intdim}(\Sigma) \ge 1$, Equation 26 also implies
$$B \ge \tau \quad\text{and}\quad B = \Omega(1), \qquad (27)$$
which we will use later. Next, we introduce the following two lemmas, both of which rely on the matrix Bernstein inequality with intrinsic dimension, as stated in Theorem 7.3.1 of (Tropp et al., 2015).

Lemma C.1.
With probability at least $1 - 8q\exp\!\Big(\frac{-0.5\,\hat n^{1-c}}{B\tau + (B+\tau)/3}\Big) = 1 - o(1)$, the following holds:
$$\|\hat\Sigma - \Sigma\|_{\mathrm{op}} \le \hat n^{-0.5c}.$$
The same conclusion applies to $\tilde\Sigma$ as well.

Proof. We prove the result for $\hat\Sigma$; the result for $\tilde\Sigma$ can be proved in the same way. Define $S_i = \frac{1}{\hat n}\big(\hat r_i\hat r_i^\top - \Sigma\big)$. The random matrices $S_i$ are independent, identically distributed, and centered. Their norms are bounded as follows:
$$\|S_i\|_{\mathrm{op}} \le \frac{1}{\hat n}\big(\|\hat r_i\hat r_i^\top\|_{\mathrm{op}} + \|\Sigma\|_{\mathrm{op}}\big) \le \frac{B+\tau}{\hat n} \eqqcolon L.$$
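Before continuing with the matrix Bernstein argument, the statement of Lemma C.1 can be eyeballed numerically: the operator-norm error of the sample covariance should shrink as the sample size grows. A minimal sketch, where the dimension and the decaying diagonal spectrum are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
# Illustrative covariance with decaying spectrum (low intrinsic dimension).
Sigma = np.diag(1.0 / np.arange(1, d + 1))

def cov_op_error(n):
    # Operator-norm deviation ||Sigma_hat - Sigma||_op of the sample covariance
    # built from n i.i.d. samples r_i ~ N(0, Sigma).
    R = np.sqrt(np.diag(Sigma))[:, None] * rng.normal(size=(d, n))
    Sigma_hat = (R @ R.T) / n
    return np.linalg.norm(Sigma_hat - Sigma, ord=2)

errs = {n: cov_op_error(n) for n in (100, 10_000)}
print(errs)
```

With 100x more samples the error drops by roughly an order of magnitude, consistent with a polynomial rate like $\hat n^{-0.5c}$.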
Then,
$$\mathbb{E}S_i^2 = \frac{1}{\hat n^2}\,\mathbb{E}\big(\hat r_i\hat r_i^\top - \Sigma\big)^2 = \frac{1}{\hat n^2}\,\mathbb{E}\big(\|\hat r_i\|^2\hat r_i\hat r_i^\top - 2\Sigma^2 + \Sigma^2\big) \preccurlyeq \frac{1}{\hat n^2}\,\mathbb{E}\big(B\,\hat r_i\hat r_i^\top - \Sigma^2\big) \preccurlyeq \frac{B}{\hat n^2}\,\Sigma.$$
Define $Z = \sum_{i=1}^{\hat n} S_i$. We have
$$0 \preccurlyeq \mathbb{E}Z^2 = \sum_{i=1}^{\hat n}\mathbb{E}S_i^2 \preccurlyeq \frac{B}{\hat n}\,\Sigma \eqqcolon V.$$
$V$'s norm can be expressed as follows:
$$\|V\|_{\mathrm{op}} = \frac{B\|\Sigma\|_{\mathrm{op}}}{\hat n} = \frac{B\tau}{\hat n} \eqqcolon v.$$
Define $d = \mathrm{intdim}\Big(\begin{bmatrix} V & 0\\ 0 & V\end{bmatrix}\Big)$, which can be simplified as:
$$d = 2\,\frac{\mathrm{Tr}\big(\frac{B}{\hat n}\Sigma\big)}{\big\|\frac{B}{\hat n}\Sigma\big\|_{\mathrm{op}}} = 2\,\mathrm{intdim}\Big(\frac{B}{\hat n}\Sigma\Big) = 2\,\mathrm{intdim}(\Sigma) = 2q.$$
Now we are ready to apply Theorem 7.3.1 of (Tropp et al., 2015).
It leads to the conclusion that, for any $t \ge \sqrt{v} + L/3$,
$$\mathbb{P}\{\|Z\|_{\mathrm{op}} \ge t\} \le 4d\exp\Big(\frac{-t^2/2}{v + Lt/3}\Big) = 8q\exp\Big(\frac{-t^2/2}{\frac{B\tau}{\hat n} + \frac{B+\tau}{\hat n}\,t/3}\Big) = 8q\exp\Big(\frac{-\hat n t^2/2}{B\tau + (B+\tau)t/3}\Big). \qquad (28)$$
By assumption:
$$\begin{aligned}
& n^{1-c} = \omega(B\log q)\\
\implies\ & n^{1-c} = \omega\big(((\tau+1/3)B + \tau/3)\log q\big) \qquad \text{because } \tau = O(1)\\
\implies\ & \frac{\hat n^{1-c}}{(\tau+1/3)B + \tau/3} = \omega(\log q)\\
\implies\ & \frac{0.5\,\hat n^{1-c}}{(\tau+1/3)B + \tau/3} = \omega(\log q)\\
\implies\ & \exp\Big(\frac{0.5\,\hat n^{1-c}}{(\tau+1/3)B + \tau/3}\Big) = \omega(q)\\
\implies\ & q\exp\Big(\frac{-0.5\,\hat n^{1-c}}{(\tau+1/3)B + \tau/3}\Big) = o(1). \qquad (29)
\end{aligned}$$
Therefore, we set the value of $t$ to $\hat n^{-0.5c} = o(1)$ in Equation 28. It is easy to verify that $\hat n^{-0.5c} \ge \sqrt{v} + L/3$. Substituting, we get:
$$\begin{aligned}
\mathbb{P}\{\|Z\|_{\mathrm{op}} \ge \hat n^{-0.5c}\} &\le 4d\exp\Big(\frac{-t^2/2}{v + Lt/3}\Big) \le 8q\exp\Big(\frac{-\hat n t^2/2}{B\tau + (B+\tau)t/3}\Big) = 8q\exp\Big(\frac{-0.5\,\hat n^{1-c}}{B\tau + (B+\tau)\hat n^{-0.5c}/3}\Big)\\
&\le 8q\exp\Big(\frac{-0.5\,\hat n^{1-c}}{B\tau + (B+\tau)/3}\Big) \qquad \text{because } \hat n^{-0.5c} \le 1\\
&= o(1) \qquad \text{by Equation 29}.
\end{aligned}$$
Since $Z = \hat\Sigma - \Sigma$, restating the above, we have that with probability at least $1 - 8q\exp\!\Big(\frac{-0.5\,\hat n^{1-c}}{B\tau + (B+\tau)/3}\Big)$, the following holds:
$$\|\hat\Sigma - \Sigma\|_{\mathrm{op}} \le \hat n^{-0.5c}. \qquad \blacksquare$$

Lemma C.2.
With probability at least $1 - (q+4)\exp\!\Big(\frac{-0.5\,\hat n^{1-c}}{4BC + \frac{2}{3}\sqrt{BC}}\Big) = 1 - o(1)$, the following holds:
$$\Big\|\frac{1}{\hat n}\sum_{i=1}^{\hat n}\hat r_i y_i - \mathbb{E}[ry]\Big\| \le \hat n^{-0.5c}.$$
The same conclusion applies to $\frac{1}{\tilde n}\sum_{i=1}^{\tilde n}\tilde r_i y_i$ as well.

Proof. We prove the result for $\frac{1}{\hat n}\sum_{i=1}^{\hat n}\hat r_i y_i$; the result for $\frac{1}{\tilde n}\sum_{i=1}^{\tilde n}\tilde r_i y_i$ can be proved in the same way. Define $S_i = \frac{1}{\hat n}\big(\hat r_i y_i - \mathbb{E}[ry]\big)$. The random vectors $S_i$ are independent, identically distributed, and centered.
Their norms are bounded as follows:
$$\|S_i\| \le \frac{1}{\hat n}\big(\|\hat r_i y_i\| + \|\mathbb{E}[ry]\|\big) \le \frac{1}{\hat n}\big(\|\hat r_i\|\,|y_i| + \mathbb{E}[\|r\|\,|y|]\big) \le \frac{2}{\hat n}\sqrt{BC} \eqqcolon L. \qquad (30)$$
Define $Z = \sum_{i=1}^{\hat n} S_i$. We analyze the semidefinite upper bounds for the variances $\mathbb{E}ZZ^\top$ and $\mathbb{E}Z^\top Z$:
$$\mathbb{E}ZZ^\top = \sum_{i=1}^{\hat n}\mathbb{E}S_iS_i^\top = \frac{1}{\hat n}\big(\mathbb{E}[y_i^2\,\hat r_i\hat r_i^\top] - \mathbb{E}[ry]\,\mathbb{E}[ry]^\top\big) \preccurlyeq \frac{1}{\hat n}\,\mathbb{E}[y_i^2\,\hat r_i\hat r_i^\top] \preccurlyeq \frac{C}{\hat n}\,\Sigma \eqqcolon V_1,$$
$$\mathbb{E}Z^\top Z = \sum_{i=1}^{\hat n}\mathbb{E}S_i^\top S_i = \hat n\,\mathbb{E}\|S_i\|^2 \le \frac{4}{\hat n}BC \eqqcolon V_2 \qquad \text{by Equation 30}.$$
Define $v = \max(\|V_1\|_{\mathrm{op}}, \|V_2\|_{\mathrm{op}})$. It can be simplified as follows:
$$v = \max\Big(\Big\|\frac{C}{\hat n}\Sigma\Big\|_{\mathrm{op}},\ \frac{4}{\hat n}BC\Big) = \frac{4}{\hat n}BC \qquad \text{because } B \ge \|\Sigma\|_{\mathrm{op}} \text{ as in Equation 27}.$$
Define $d = \mathrm{intdim}\Big(\begin{bmatrix} V_1 & 0\\ 0 & V_2\end{bmatrix}\Big)$, which can be simplified as:
$$d = \mathrm{intdim}\left(\begin{bmatrix}\frac{C}{\hat n}\Sigma & 0\\ 0 & \frac{4}{\hat n}BC\end{bmatrix}\right) = \frac{\mathrm{Tr}\big(\frac{C}{\hat n}\Sigma\big) + \frac{4}{\hat n}BC}{\max\big(\big\|\frac{C}{\hat n}\Sigma\big\|_{\mathrm{op}},\ \frac{4}{\hat n}BC\big)} = \frac{\mathrm{Tr}\big(\frac{C}{\hat n}\Sigma\big) + \frac{4}{\hat n}BC}{\frac{4}{\hat n}BC} = \frac{\mathrm{Tr}\big(\frac{C}{\hat n}\Sigma\big)}{\frac{4}{\hat n}BC} + 1 \le q/4 + 1,$$
because $B \ge \tau$ as in Equation 27 and $\frac{\mathrm{Tr}(\Sigma)}{\tau} = q$.
Applying Theorem 7.3.1 of (Tropp et al., 2015), we have that for any $t \ge \sqrt{v} + L/3$,
$$\mathbb{P}\{\|Z\| \ge t\} \le 4d\exp\Big(\frac{-t^2/2}{v + Lt/3}\Big) \le (q+4)\exp\Big(\frac{-t^2/2}{\frac{4}{\hat n}BC + \frac{2\sqrt{BC}}{\hat n}\,t/3}\Big). \qquad (31)$$
By assumption:
$$\begin{aligned}
& n^{1-c} = \omega(B\log q)\\
\implies\ & n^{1-c} = \omega\big(B\log(q+4)\big)\\
\implies\ & n^{1-c} = \omega\Big(\Big(4BC + \frac{2\sqrt{BC}}{3}\Big)\log(q+4)\Big) \qquad \text{because } C = \Theta(1) \text{ and } B = \Omega(1) \text{ as in Equation 27}\\
\implies\ & \frac{0.5\,n^{1-c}}{4BC + \frac{2\sqrt{BC}}{3}} = \omega(\log(q+4))\\
\implies\ & (q+4)\exp\Big(\frac{-0.5\,n^{1-c}}{4BC + \frac{2\sqrt{BC}}{3}}\Big) = o(1).
\end{aligned}$$
Therefore, we set the value of $t$ to $\hat n^{-0.5c} = o(1)$ in Equation 31. It is easy to verify that $\hat n^{-0.5c} \ge \sqrt{v} + L/3$. Substituting, we get:
$$\begin{aligned}
\mathbb{P}\{\|Z\| \ge \hat n^{-0.5c}\} &\le (q+4)\exp\Big(\frac{-0.5\,\hat n^{-c}}{\frac{4}{\hat n}BC + \frac{2\sqrt{BC}}{\hat n}\,\hat n^{-0.5c}/3}\Big)\\
&\le (q+4)\exp\Big(\frac{-0.5\,\hat n^{-c}}{\frac{4}{\hat n}BC + \frac{2\sqrt{BC}}{\hat n}/3}\Big) \qquad \text{because } \hat n^{-0.5c} \le 1\\
&= (q+4)\exp\Big(\frac{-0.5\,\hat n^{1-c}}{4BC + 2\sqrt{BC}/3}\Big)\\
&= o(1). \qquad \blacksquare
\end{aligned}$$

Now, we are ready to show that Example 3.4 satisfies Definition 3.3. We let $V$ be the entire representation space. Then $V^\perp$ is the zero space $\{0\}$. In this case, the conditions Kernel-wise $\delta$-isotropy on $V^\perp$, Small cross-sample inner-product on $V^\perp$, and Diminishing population covariance on $V^\perp$ trivially hold. Thus, we only need to prove that Boundedness and Concentration on $V$ hold. We let $\delta = n^{-0.1c}$ and $\gamma = 0$.
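The cross-moment concentration just established in Lemma C.2 can also be eyeballed numerically. A minimal sketch, where the dimension, the labeling direction `w`, and the linear label model $y = w^\top r$ are illustrative assumptions chosen so that $\mathbb{E}[ry] = w$:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
# Hypothetical labeling direction; with r ~ N(0, I_d) and y = w . r,
# we get E[r y] = E[r r^T] w = w.
w = rng.normal(size=d)
w /= np.linalg.norm(w)

def cross_moment_error(n):
    # || (1/n) sum_i r_i y_i - E[r y] || for n i.i.d. samples.
    R = rng.normal(size=(d, n))
    y = w @ R
    return np.linalg.norm(R @ y / n - w)

errs = {n: cross_moment_error(n) for n in (100, 10_000)}
print(errs)
```

As with Lemma C.1, the deviation decays polynomially in the sample size.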
First, note that $\delta^2 = n^{-0.2c} \ge \hat n^{-0.2c}$. Then, by Lemma C.1, we obtain that
$$\|\hat\Sigma - \Sigma\|_{\mathrm{op}} \le \hat n^{-0.5c} = o(\hat n^{-0.2c}) = o(\delta^2) = o(\gamma^2 + \delta^2 + \rho)$$
with probability $1 - o(1)$. Similarly, we can show that $\|\tilde\Sigma - \Sigma\|_{\mathrm{op}} = o(\gamma^2 + \delta^2 + \rho)$ with probability $1 - o(1)$. Next, since $\delta = n^{-0.1c} \ge \hat n^{-0.1c}$, applying Lemma C.2 gives us
$$\Big\|\frac{1}{\hat n}\sum_{i=1}^{\hat n}\hat r_i y_i - \mathbb{E}[ry]\Big\| \le \hat n^{-0.5c} = o(\hat n^{-0.1c}) = o(\delta) = o(\gamma + \delta + \rho)$$
with probability $1 - o(1)$. Similarly, the same conclusion can be shown for $\frac{1}{\tilde n}\sum_{i=1}^{\tilde n}\tilde r_i y_i$. Note that there are only four events above, so the probability that all of them occur remains $1 - o(1)$. Up to now, we have proved Concentration on $V$. Finally, regarding Boundedness, $\|\Sigma\|_{\mathrm{op}} = \Theta(1)$ is directly given in the assumption.
Keeping in mind that $V$ is the entire space, the conditions regarding covariance matrices are readily satisfied through the triangle inequality. For example: $\|\hat\Sigma\|_{\mathrm{op}} \le \|\hat\Sigma - \Sigma\|_{\mathrm{op}} + \|\Sigma\|_{\mathrm{op}} = o(1) + \Theta(1) = O(1)$. The other two conditions are directly implied by the boundedness of each $y$.

C.2 Example 3.5

Originating from PCA (Johnstone, 2001), the spiked covariance model has been widely adopted in recent works to theoretically characterize key aspects across various topics (Ji et al., 2023; Nakada et al., 2023; Muthukumar et al., 2021; Pezeshki et al., 2022; Wu & Sahai, 2024). Furthermore, Example 3.5 also subsumes the sparse coding model as a special case, which has its roots in computer vision (Olshausen & Field, 1997; Foldiak, 2003; Olshausen & Field, 2004; Yang et al., 2009; Mairal et al., 2014; Papyan et al., 2017), has been used to model language data (Arora et al., 2018), and has been extensively employed in recent theoretical studies (Kalimeris et al., 2019; Allen-Zhu & Li, 2020; Wen & Li, 2021; Zou et al., 2021; Shen et al., 2022; Xue et al., 2023). In the following proof, we start with a simple case where the data are Gaussian. We then extend the result to sub-Gaussian data by replacing the technical lemmas for Gaussian data with appropriate alternatives.
C.2.1 Over-Parameterized Gaussian Data

Suppose that we have $\hat R \in \mathbb{R}^{d\times\hat n}$ and $\tilde R \in \mathbb{R}^{d\times\tilde n}$, with $\hat n = \Theta(\tilde n)$ and $d = \omega(\hat n^2)$, drawn from a high-dimensional $\Sigma$-Gaussian ensemble with zero mean, where
$$\Sigma = \begin{bmatrix} I_k & 0\\ 0 & \frac{\sigma^2}{d-k} I_{d-k}\end{bmatrix} = \underbrace{\begin{bmatrix} I_k = \Lambda' & 0\\ 0 & 0\end{bmatrix}}_{\Sigma'} + \underbrace{\begin{bmatrix} 0 & 0\\ 0 & \frac{\sigma^2}{d-k} I_{d-k} = \Lambda''\end{bmatrix}}_{\Sigma''}, \quad \text{with } \sigma^2 = O(\hat n),\ \hat n = \omega(k^2). \qquad (32)$$
Here the two data splits have comparable sizes, and the model is heavily over-parameterized.
By splitting the matrix $\hat R = \begin{bmatrix}\hat F\\ \hat A\end{bmatrix}$, where $\hat F \in \mathbb{R}^{k\times\hat n}$ corresponds to the $k$ principal features (which form the space $V$) and $\hat A \in \mathbb{R}^{(d-k)\times\hat n}$ corresponds to the rest (which form the space $V^\perp$), we can write the sample covariance matrix as
$$\hat\Sigma = \frac{1}{\hat n}\hat R\hat R^\top = \frac{1}{\hat n}\begin{bmatrix}\hat F\hat F^\top & \hat F\hat A^\top\\ \hat A\hat F^\top & \hat A\hat A^\top\end{bmatrix}.$$
We note that $d-k = \omega(\hat n^2)$, and the corresponding labels have bounded mean and variance. The same decomposition applies to $\tilde R$.
Note that here $U' = \begin{bmatrix} I_k\\ 0_{(d-k)\times k}\end{bmatrix}$ and $U'' = \begin{bmatrix} 0_{k\times(d-k)}\\ I_{d-k}\end{bmatrix}$ allow us to define the projection matrices $U'U'^\top$ and $U''U''^\top$ onto $V$ and $V^\perp$, respectively. In this section, we show that our assumptions hold in the above setting with $\delta = 0$ and $\hat\gamma = \sigma^2/\hat n$, $\tilde\gamma = \sigma^2/\tilde n$. We only prove the result for $\hat R$ whenever the same proof can be easily applied to $\tilde R$. First, let us introduce the following Lemmas:

Lemma C.3 (Restatement of Example 6.2 in (Wainwright, 2019)). Let $X \in \mathbb{R}^{d\times n}$ be a random matrix with i.i.d. entries drawn from $N(0,1)$ (that is, a $\Sigma$-Gaussian ensemble with $\Sigma = I_d$).
Then with probability at least $1 - 2e^{-n\delta^2/2}$ for some $\delta > 0$, the following inequality holds:
$$\Big\|\frac{1}{n}XX^\top - I_d\Big\|_{\mathrm{op}} \le 2\Big(\sqrt{\frac{d}{n}} + \delta\Big) + \Big(\sqrt{\frac{d}{n}} + \delta\Big)^2.$$

Lemma C.4. Consider two independently sampled Gaussian matrices, where $A \in \mathbb{R}^{d_1\times n}$ has columns $a_i \sim N(0, \sigma_1^2 I_{d_1})$ and $B \in \mathbb{R}^{d_2\times n}$ has columns $b_i \sim N(0, \sigma_2^2 I_{d_2})$. Then for some $\frac{1}{d_1 d_2} > \delta > 0$ and constant $C$, with probability at least $1 - d_1 d_2\delta$, we have
$$\frac{1}{n}\|AB^\top\|_{\mathrm{op}} \le \frac{\sigma_1\sigma_2}{n}\sqrt{C d_1 d_2\, n\log\Big(\frac{2}{\delta}\Big)}.$$

Proof. Let $Q = AB^\top$.
Then each entry of $Q$ is an inner product $Q_{ij} = a_i \cdot b_j$, where $a_i \in \mathbb{R}^n$ is the $i$-th row of $A$ and $b_j \in \mathbb{R}^n$ is the $j$-th row of $B$. Since each entry of $a_i$ is $N(0,\sigma_1^2)$ and each entry of $b_j$ is $N(0,\sigma_2^2)$, by Lemma 4 from (Shen et al., 2022), with probability at least $1-\delta$ (taking $\frac{1}{d_1 d_2} > \delta > 0$), for some constant $C_{ij}$,
$$Q_{ij}^2 = (a_i\cdot b_j)^2 \le C_{ij}\,\sigma_1^2\sigma_2^2\, n \log(2/\delta).$$
We define $C = \max\{C_{ij} : 1\le i\le d_1,\ 1\le j\le d_2\}$. Now we bound the operator norm via the Frobenius norm:
$$\frac{1}{n}\|AB^\top\|_{\mathrm{op}} \le \frac{1}{n}\|AB^\top\|_F = \frac{1}{n}\|Q\|_F = \frac{1}{n}\sqrt{\sum_{1\le i\le d_1,\,1\le j\le d_2} Q_{ij}^2} \le \frac{1}{n}\sqrt{\sum_{1\le i\le d_1,\,1\le j\le d_2} C_{ij}\,\sigma_1^2\sigma_2^2\, n\log(2/\delta)} \le \frac{1}{n}\sqrt{C d_1 d_2\,\sigma_1^2\sigma_2^2\, n\log(2/\delta)} = \frac{\sigma_1\sigma_2}{n}\sqrt{C d_1 d_2\, n\log(2/\delta)}$$
with probability at least $1 - d_1 d_2\delta$, by a union bound, since the entrywise inequality has to hold for each of the $d_1 d_2$ entries. ∎

We now prove that the example satisfies the five aspects of the definition:
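The Frobenius-norm argument in the proof above lends itself to a quick numerical sanity check. The sketch below is our illustration, not part of the paper; the dimensions and variances are arbitrary choices. It samples the two Gaussian matrices and confirms that $\frac{1}{n}\|AB^\top\|_{\mathrm{op}}$ is dominated by the Frobenius proxy and shrinks as $n$ grows, consistent with the $\frac{1}{n}\sqrt{Cd_1d_2\,n\log(2/\delta)} \sim n^{-1/2}$ rate.

```python
import numpy as np

def cross_gram_norms(d1, d2, n, sigma1=1.0, sigma2=1.0, seed=0):
    """Sample A (d1 x n) and B (d2 x n) with i.i.d. Gaussian entries and return
    (1/n)||A B^T||_op together with the Frobenius proxy used in the proof."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, sigma1, size=(d1, n))
    B = rng.normal(0.0, sigma2, size=(d2, n))
    Q = A @ B.T
    op = np.linalg.norm(Q, ord=2) / n    # (1/n)||AB^T||_op (largest singular value)
    fro = np.linalg.norm(Q) / n          # (1/n)||AB^T||_F, always >= the op norm
    return op, fro

# Both quantities decay roughly like 1/sqrt(n), matching Lemma C.4.
for n in (100, 1_000, 10_000):
    op, fro = cross_gram_norms(5, 7, n)
    print(f"n={n:6d}  op={op:.4f}  fro={fro:.4f}")
```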
1. **Boundedness:** First, we have $\|\Sigma\|_{\mathrm{op}} = 1 = O(1)$ from its definition, and
$$\|\hat\Sigma - \Sigma\|_{\mathrm{op}} \le \left\|\begin{bmatrix}\frac{1}{\hat n}\hat F\hat F^\top - I_k & 0\\ 0 & \frac{1}{\hat n}\hat A\hat A^\top - \frac{\sigma^2}{d-k} I_{d-k}\end{bmatrix}\right\|_{\mathrm{op}} + \frac{1}{\hat n}\left\|\begin{bmatrix}0 & \hat F\hat A^\top\\ \hat A\hat F^\top & 0\end{bmatrix}\right\|_{\mathrm{op}}. \tag{33}$$
By Lemma C.3, we take $\delta_1 = \hat n^{-1/4}$ and have that with probability at least $1 - 2e^{-\hat n\delta_1^2/2} = 1 - 2e^{-\sqrt{\hat n}/2} = 1 - o(1)$,
$$\left\|\frac{1}{\hat n}\hat F\hat F^\top - I_k\right\|_{\mathrm{op}} \le 2\sqrt{\frac{k}{\hat n}} + \frac{2}{\hat n^{1/4}} + \left(\sqrt{\frac{k}{\hat n}} + \frac{1}{\hat n^{1/4}}\right)^2 = o(1) \quad\text{since } \hat n \gg k.$$
As $\hat A \in \mathbb{R}^{(d-k)\times\hat n}$ has columns sampled with covariance $\frac{\sigma^2}{d-k} I_{d-k}$, the rescaled matrix $\frac{\sqrt{d-k}}{\sigma}\hat A$ has columns with covariance $I_{d-k}$. With this scaling, Lemma C.3 similarly implies that
$$\left\|\frac{1}{\hat n}\left(\frac{\sqrt{d-k}}{\sigma}\hat A\right)\left(\frac{\sqrt{d-k}}{\sigma}\hat A\right)^\top - I_{d-k}\right\|_{\mathrm{op}} = \left\|\frac{d-k}{\hat n\sigma^2}\hat A\hat A^\top - I_{d-k}\right\|_{\mathrm{op}} \le 2\sqrt{\frac{d-k}{\hat n}} + \frac{2}{\hat n^{1/4}} + \left(\sqrt{\frac{d-k}{\hat n}} + \frac{1}{\hat n^{1/4}}\right)^2,$$
or equivalently,
$$\left\|\frac{1}{\hat n}\hat A\hat A^\top - \frac{\sigma^2}{d-k} I_{d-k}\right\|_{\mathrm{op}} \le \frac{\sigma^2}{d-k}\left[2\sqrt{\frac{d-k}{\hat n}} + \frac{2}{\hat n^{1/4}} + \left(\sqrt{\frac{d-k}{\hat n}} + \frac{1}{\hat n^{1/4}}\right)^2\right] = O(1) \quad\text{as } \sigma^2 = O(\hat n).$$
We have thus bounded the first term on the right-hand side of Eq. 33, and the same bounds give that $\left\|\frac{1}{\sqrt{\hat n}}\hat F\right\|_{\mathrm{op}}$ and $\left\|\frac{1}{\sqrt{\hat n}}\hat A\right\|_{\mathrm{op}}$ are $O(1)$. It follows that
$$\frac{1}{\hat n}\left\|\begin{bmatrix}0 & \hat F\hat A^\top\\ \hat A\hat F^\top & 0\end{bmatrix}\right\|_{\mathrm{op}} = \frac{1}{\hat n}\|\hat F\hat A^\top\|_{\mathrm{op}} = O(1) \implies \|\hat\Sigma - \Sigma\|_{\mathrm{op}} = O(1).$$
Hence, $\|\hat\Sigma\|_{\mathrm{op}} = O(1)$ follows directly from $\|\Sigma\|_{\mathrm{op}} = O(1)$ and the triangle inequality. Now we consider $\frac{1}{\hat n}\|\hat y\|^2 = \frac{1}{\hat n}\sum_{i=1}^{\hat n}\hat y_i^2$, where $\hat y_i$ denotes the $i$-th entry of the vector. Since the label has bounded population variance $O(1)$, the i.i.d. assumption implies
$$\mathrm{Var}\left(\frac{1}{\hat n}\sum_{i=1}^{\hat n}\hat y_i^2\right) = \frac{1}{\hat n^2}\sum_{i=1}^{\hat n}\mathrm{Var}(\hat y_i^2) = \frac{1}{\hat n^2}\sum_{i=1}^{\hat n}O(1) = O\left(\frac{1}{\hat n}\right).$$
Then by Chebyshev's inequality, for any $\epsilon > 0$ and some constant $C_1$, writing $z = \frac{1}{\hat n}\|\hat y\|^2$ for simplicity, we have
$$P\left(|z - \mathbb{E}[z]| > \epsilon\right) \le \frac{\mathrm{Var}(z)}{\epsilon^2} \le \frac{C_1}{\hat n\epsilon^2}.$$
We take $\epsilon = \hat n^{-1/4}$. Then with probability at least $1 - \frac{C_1}{\sqrt{\hat n}} = 1 - o(1)$,
$$\left|\frac{1}{\hat n}\|\hat y\|^2 - \mathrm{Var}(y_i)\right| = o(1) \implies \frac{1}{\hat n}\|\hat y\|^2 = O(1),$$
since the variance of the label is bounded.

2. **Concentration on $V$:** With $U' = \begin{bmatrix} I_k\\ 0_{(d-k)\times k}\end{bmatrix}$ preserving only the first $k$ components, we have from the above that with probability at least $1 - o(1)$,
$$\|U'^\top\hat\Sigma U' - \Lambda'\|_{\mathrm{op}} = \left\|\frac{1}{\hat n}\hat F\hat F^\top - I_k\right\|_{\mathrm{op}} = o(1).$$
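The spectral concentration $\|\frac{1}{\hat n}\hat F\hat F^\top - I_k\|_{\mathrm{op}} = o(1)$ when $\hat n \gg k$ is easy to observe empirically. The following sketch is our illustration (not from the paper); it compares the actual deviation against the bound from Lemma C.3 with $\delta_1 = n^{-1/4}$, for arbitrary small $k$ and growing $n$.

```python
import numpy as np

def cov_concentration(k, n, seed=0):
    """Spectral deviation ||(1/n) F F^T - I_k||_op for F with i.i.d. N(0,1)
    entries, together with the Lemma C.3 bound at delta = n^{-1/4}."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((k, n))
    dev = np.linalg.norm(F @ F.T / n - np.eye(k), ord=2)
    bound = 2 * np.sqrt(k / n) + 2 / n**0.25 + (np.sqrt(k / n) + n**-0.25) ** 2
    return dev, bound

# Both the deviation and the bound vanish as n grows with k fixed.
for n in (100, 10_000, 1_000_000):
    dev, bound = cov_concentration(k=10, n=n)
    print(f"n={n:8d}  deviation={dev:.4f}  lemma-bound={bound:.4f}")
```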
Now we consider
$$\left\|\frac{1}{\hat n}U'^\top\hat R\hat y - \mathbb{E}[U'^\top r y]\right\| = \left\|\frac{1}{\hat n}\hat F\hat y - \mathbb{E}[f y]\right\|,$$
where $f = U'^\top r$. We define a new random variable $z = f y$ and its sample mean $\hat Z = \frac{1}{\hat n}\hat F\hat y \in \mathbb{R}^k$. We first show that the variance of each entry of $\hat Z$ is of magnitude $\sim\frac{1}{\hat n}$:
$$\mathrm{Var}(\hat Z_i) = \mathrm{Var}\left(\sum_{j=1}^{\hat n}\frac{1}{\hat n}\hat F_{ij}\hat y_j\right) = \frac{1}{\hat n^2}\mathrm{Var}\left(\sum_{j=1}^{\hat n}\hat F_{ij}\hat y_j\right) \quad\forall\, i = 1,\cdots,k.$$
For each term in the summation,
$$\mathrm{Var}(\hat F_{ij}\hat y_j) = \mathbb{E}\left[(\hat F_{ij}\hat y_j)^2\right] - \mathbb{E}\left[\hat F_{ij}\hat y_j\right]^2 = O(1)$$
since $\hat F_{ij}$ and $\hat y_j$ are both bounded. By the i.i.d. assumption,
$$\mathrm{Var}(\hat Z_i) = \frac{1}{\hat n^2}\sum_{j=1}^{\hat n}O(1) = O\left(\frac{1}{\hat n}\right).$$
By Chebyshev's inequality, for any $\epsilon > 0$ and some constant $C_2$,
$$P\left(\left|\hat Z_i - \mathbb{E}[z_i]\right| > \epsilon\right) \le \frac{\mathrm{Var}(\hat Z_i)}{\epsilon^2} \le \frac{C_2}{\hat n\epsilon^2},$$
so by a union bound over the $k$ entries,
$$P\left(\left|\hat Z_i - \mathbb{E}[z_i]\right| > \epsilon \ \text{for some}\ i = 1,\cdots,k\right) \le \frac{kC_2}{\hat n\epsilon^2}.$$
Similarly, by choosing $\epsilon = \hat n^{-1/4}$, the probability of a large deviation decays rapidly:
$$P\left(\left|\hat Z_i - \mathbb{E}[z_i]\right| > \frac{1}{\hat n^{1/4}} \ \text{for some}\ i = 1,\cdots,k\right) \le \frac{kC_2}{\sqrt{\hat n}} = o(1) \quad\text{since } \hat n = \omega(k^2).$$
This statement implies that with probability at least $1 - o(1)$,
$$\|\hat Z - \mathbb{E}[z]\| = \left\|\frac{1}{\hat n}U'^\top\hat R\hat y - \mathbb{E}[U'^\top r y]\right\| \le \sqrt{\frac{k}{\sqrt{\hat n}}} = o(1) = o\left(\gamma + \delta + \lambda_{\min}(\Lambda')\right),$$
as we sum up the $k$ squared entrywise deviations. This shows that our setting satisfies the second part of the definition.

3. **Kernel-wise $\delta$-isotropy on $V^\perp$:** We define $Z = \frac{\sqrt{d-k}}{\sigma}\hat A \in \mathbb{R}^{(d-k)\times\hat n}$, which has standard normal entries.
With this scaling, we plug in $U''$ and $\hat\gamma = \sigma^2/\hat n$ and have
$$\left\|\frac{1}{\hat n}\hat R^\top U'' U''^\top\hat R - \hat\gamma I\right\|_{\mathrm{op}} = \left\|\frac{1}{\hat n}\hat A^\top\hat A - \frac{\sigma^2}{\hat n}I\right\|_{\mathrm{op}} = \frac{\sigma^2}{\hat n}\left\|\frac{1}{d-k}Z^\top Z - I\right\|_{\mathrm{op}}. \tag{34}$$
Now we apply Lemma C.3 and have that with probability at least $1 - 2e^{-\hat n\delta_2^2/2}$ for some $\delta_2 > 0$,
$$\frac{\sigma^2}{\hat n}\left\|\frac{1}{d-k}Z^\top Z - I\right\|_{\mathrm{op}} \le \frac{\sigma^2}{\hat n}\left[2\left(\sqrt{\frac{\hat n}{d-k}} + \delta_2\right) + \left(\sqrt{\frac{\hat n}{d-k}} + \delta_2\right)^2\right].$$
The rest follows similarly by taking $\delta_2 = \hat n^{-1/4}$.

4. **Small cross-sample inner-product on $V^\perp$:** By $U'' = \begin{bmatrix}0_{k\times(d-k)}\\ I_{d-k}\end{bmatrix}$ and Lemma C.4 applied to $\hat A^\top \in \mathbb{R}^{\hat n\times(d-k)}$ and $\tilde A^\top \in \mathbb{R}^{\tilde n\times(d-k)}$, each having $N\!\left(0, \frac{\sigma^2}{d-k}\right)$ entries, the target expression becomes
$$\left\|\frac{1}{\sqrt{\hat n}}\hat R^\top U'' U''^\top\frac{1}{\sqrt{\tilde n}}\tilde R\right\|_{\mathrm{op}} = \frac{1}{\sqrt{\hat n\tilde n}}\left\|\hat A^\top\tilde A\right\|_{\mathrm{op}} \tag{35}$$
$$\le \frac{1}{\sqrt{\hat n\tilde n}}\sqrt{C_4\,\hat n\tilde n\,\frac{\sigma^4}{(d-k)^2}(d-k)\log(2/\delta_3)} = \sqrt{C_4\,\frac{\sigma^4}{d-k}\log(2/\delta_3)} = \sigma^2\sqrt{C_4\log(2/\delta_3)}\sqrt{\frac{1}{d-k}}$$
for some constant $C_4$, with probability at least $1 - \hat n\tilde n\delta_3$ for some $0 < \delta_3 < \frac{1}{\hat n\tilde n}$. We choose some $\delta_3 = o\!\left(\frac{1}{\hat n\tilde n}\right)$ in this range and then have that with probability at least $1 - o(1)$, the previous bound can be expressed as
$$\sigma^2\sqrt{C_4\log(2/\delta_3)}\sqrt{\frac{1}{d-k}} = \Theta\left(\sigma^2\sqrt{C_4\,\frac{\log(\hat n\tilde n)}{d-k}}\right) = o\left(\frac{\sigma^2}{\max\{\hat n,\tilde n\}}\right) = o(\gamma + \delta)$$
since $d - k = \omega(\hat n) = \omega(\tilde n)$.
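The cross-sample term's $\sqrt{1/(d-k)}$ decay can be observed directly. The sketch below is our illustration (not from the paper); the sample sizes and $\sigma$ are arbitrary choices. It computes $\frac{1}{\sqrt{\hat n\tilde n}}\|\hat A^\top\tilde A\|_{\mathrm{op}}$ for noise blocks with $N(0, \sigma^2/(d-k))$ entries and increasing ambient dimension.

```python
import numpy as np

def cross_sample_term(d_minus_k, n_hat=50, n_tilde=60, sigma=1.0, seed=0):
    """(1/sqrt(n_hat*n_tilde)) * ||A_hat^T A_tilde||_op for independent noise
    blocks with i.i.d. N(0, sigma^2/(d-k)) entries; shrinks like 1/sqrt(d-k)."""
    rng = np.random.default_rng(seed)
    scale = sigma / np.sqrt(d_minus_k)
    A_hat = rng.normal(0.0, scale, size=(d_minus_k, n_hat))
    A_tilde = rng.normal(0.0, scale, size=(d_minus_k, n_tilde))
    M = A_hat.T @ A_tilde
    return np.linalg.norm(M, ord=2) / np.sqrt(n_hat * n_tilde)

# Growing d-k with fixed sample sizes drives the cross-sample term to zero.
for d in (1_000, 10_000, 100_000):
    print(f"d-k={d:7d}  cross-sample term={cross_sample_term(d):.5f}")
```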
5. **Diminishing population covariance on $V^\perp$:** By definition, it is immediate that
$$\lambda_{\max}(\Lambda'') = \frac{\sigma^2}{d-k} = o\left(\frac{\sigma^2}{\max\{\hat n,\tilde n\}}\right) = o(\gamma + \delta)$$
since $d - k = \omega(\hat n) = \omega(\tilde n)$.

**C.2.2 Further Relaxation to Sub-Gaussian Data**

Now, we consider the more general sub-Gaussian setting outlined in Example 3.5. The population covariance is
$$\Sigma = \begin{bmatrix}I_k & 0\\ 0 & \frac{\sigma^2}{d-k}I_{d-k}\end{bmatrix},$$
where the top-left block has a corresponding sub-Gaussian parameter of $\Theta(1)$ and the rest has a parameter of $\Theta\!\left(\frac{\sigma^2}{d-k}\right)$. We adopt the following definitions from Chapter 2 of (Vershynin, 2018) for reference.

**Definition C.5.** A zero-mean random variable $X$ is sub-Gaussian if there is a positive parameter $K_g$ such that $\mathbb{E}\left[e^{X^2/K_g^2}\right] \le 2$.

**Definition C.6.** A zero-mean random variable $X$ is sub-exponential if there is a positive parameter $K_e$ such that $\mathbb{E}\left[e^{|X|/K_e}\right] \le 2$.
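As a concrete instance of Definition C.5 (our illustration, not from the paper): for $X \sim N(0,1)$ the expectation has the closed form $\mathbb{E}[e^{X^2/t^2}] = (1 - 2/t^2)^{-1/2}$ for $t^2 > 2$, so the smallest admissible parameter solves $(1 - 2/t^2)^{-1/2} = 2$, giving $K_g = \sqrt{8/3} \approx 1.633$. The sketch below recovers this value by bisection on the closed-form expectation.

```python
import math

def mgf_of_square(t):
    """E[exp(X^2/t^2)] for X ~ N(0,1); finite only when t^2 > 2."""
    assert t * t > 2, "the expectation diverges for t^2 <= 2"
    return 1.0 / math.sqrt(1.0 - 2.0 / (t * t))

def psi2_norm_std_gaussian(tol=1e-10):
    """Smallest t with E[exp(X^2/t^2)] <= 2 (Definition C.5), via bisection.
    mgf_of_square is strictly decreasing in t on (sqrt(2), infinity)."""
    lo, hi = math.sqrt(2) + 1e-9, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf_of_square(mid) <= 2.0:
            hi = mid   # mid is admissible, shrink from above
        else:
            lo = mid   # expectation still exceeds 2, move up
    return hi

K_g = psi2_norm_std_gaussian()
print(K_g)  # ~ 1.63299, i.e. sqrt(8/3)
```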
We can also define the following norms that give the sub-Gaussian and sub-exponential parameters:
$$\|X\|_{\psi_2} = \inf\left\{t > 0 : \mathbb{E}\left[e^{X^2/t^2}\right] \le 2\right\} = K_g,$$
$$\|X\|_{\psi_1} = \inf\left\{t > 0 : \mathbb{E}\left[e^{|X|/t}\right] \le 2\right\} = K_e.$$

**Remark.** There are many different characterizations of these two definitions, each with a different sub-Gaussian/sub-exponential parameter. A detailed summary can be found in Chapter 2 of (Vershynin, 2018). Notably, these parameters differ from each other by at most a constant factor.

**Lemma C.7** (Extension of Lemma 4 of (Shen et al., 2022) to sub-Gaussian variables). *Consider high-dimensional independent sub-Gaussian vectors $z_1, z_2 \in \mathbb{R}^d$, whose i.i.d. entries have variances $\sigma_1^2, \sigma_2^2$ and sub-Gaussian parameters $\Theta(\sigma_1), \Theta(\sigma_2)$, respectively. Then for $\delta > 0$ such that $\log(2/\delta) \le cd$ for some constant $c$, there exists a constant $C$ such that with probability at least $1 - \delta$,*
$$|z_1\cdot z_2| \le C\sigma_1\sigma_2\sqrt{d\log(2/\delta)}.$$

*Proof.* We consider the product $z_1\cdot z_2 = \sum_{i=1}^d z_{1i}z_{2i} = \sum_{i=1}^d a_i$, where we write $a_i = z_{1i}z_{2i}$ for simplicity. It is a well-known result that the product of two sub-Gaussian random variables is sub-exponential. More precisely,
$$\|a_i\|_{\psi_1} \le \|z_{1i}\|_{\psi_2}\|z_{2i}\|_{\psi_2} = C\sigma_1\sigma_2.$$
By Bernstein's inequality for sums of sub-exponential random variables (see Theorem 2.8.1 in (Vershynin, 2018)), this summation can be bounded as follows: for some constant $c > 0$,
$$P\left(\left|\sum_{i=1}^d a_i\right| \ge t\right) \le 2\exp\left[-c\min\left\{\frac{t^2}{\sum_{i=1}^d\|a_i\|_{\psi_1}^2},\, \frac{t}{\max_i\|a_i\|_{\psi_1}}\right\}\right] \le 2\exp\left[-c\min\left\{\frac{t^2}{dC^2\sigma_1^2\sigma_2^2},\, \frac{t}{C\sigma_1\sigma_2}\right\}\right].$$
Let $t = \frac{C}{\sqrt{c}}\sigma_1\sigma_2\sqrt{d\log(2/\delta)}$ for some $\delta$ that satisfies the condition $\log(2/\delta) \le cd$ (e.g., $\delta = 1/d^2$). The probability statement becomes
$$P\left(\left|\sum_{i=1}^d a_i\right| \ge \frac{C}{\sqrt{c}}\sigma_1\sigma_2\sqrt{d\log(2/\delta)}\right) \le 2\exp\left[-c\min\left\{\frac{\log(2/\delta)}{c},\, \sqrt{\frac{d\log(2/\delta)}{c}}\right\}\right] = 2\exp\left[-\min\left\{\log(2/\delta),\, \sqrt{cd\log(2/\delta)}\right\}\right].$$
Since our choice of $\delta$ ensures that the first quantity is the smaller one,
$$P\left(\left|\sum_{i=1}^d a_i\right| \ge \frac{C}{\sqrt{c}}\sigma_1\sigma_2\sqrt{d\log(2/\delta)}\right) \le 2e^{-\log(2/\delta)} = \delta.$$
In other words, letting $C' = C/\sqrt{c}$, we have that with probability at least $1 - \delta$,
$$|z_1\cdot z_2| \le C'\sigma_1\sigma_2\sqrt{d\log(2/\delta)}. \qquad ∎$$

Now we are ready to show that our assumptions capture the setting in Section C.2.1 but with sub-Gaussian data. That is, we now allow the data to have possibly even lighter tails than those of a Gaussian. The proof can be easily replicated, as Chebyshev's inequality still applies here, and Lemmas C.3 and C.4 have the following sub-Gaussian alternatives, namely Lemmas C.8 and C.9:

**Lemma C.8**
(Restatement of Theorem 6.5 in (Wainwright, 2019)) Let ∈ℝd×nsuperscriptℝX ^d× nX ∈ blackboard_Rd × n be a random sub-Gaussian matrix with parameter KgsubscriptK_gKitalic_g and population covariance dsubscript I_ditalic_Iitalic_d. Then for all δ≥00δ≥ 0δ ≥ 0, there are universal constants C1,C2,C3subscript1subscript2subscript3C_1,C_2,C_3C1 , C2 , C3 such that ∥1nT−d∥op≤Kg2[C1(dn+dn)+δ]subscriptdelimited-∥1superscriptsubscriptopsuperscriptsubscript2delimited-[]subscript1 1n X X^T- I_d _op≤ K_% g^2 [C_1 ( dn+ dn )+δ ]∥ divide start_ARG 1 end_ARG start_ARG n end_ARG italic_X italic_Xitalic_T - italic_Iitalic_d ∥op ≤ Kitalic_g2 [ C1 ( square-root start_ARG divide start_ARG d end_ARG start_ARG n end_ARG end_ARG + divide start_ARG d end_ARG start_ARG n end_ARG ) + δ ] with probability at least 1−C2e−C3nminδ,δ21subscript2superscriptsubscript3superscript21-C_2e^-C_3n \δ,δ^2\1 - C2 e- C3 n min δ , δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Lemma C.9. Consider two independently sampled row-wise sub-Gaussian matrices ∈ℝd1×nsuperscriptℝsubscript1A ^d_1× nA ∈ blackboard_Rd1 × n, ∈ℝd2×nsuperscriptℝsubscript2B ^d_2× nB ∈ blackboard_Rd2 × n that have i.i.d. entries with variances σ12superscriptsubscript12 _1^2σ12, σ22superscriptsubscript22 _2^2σ22 respectively. Then for some 1d1d2>δ>01subscript1subscript20 1d_1d_2>δ>0divide start_ARG 1 end_ARG start_ARG d1 d2 end_ARG > δ > 0 and constant C, with probability at least 1−d1d2δ1subscript1subscript21-d_1d_2 1 - d1 d2 δ, we have 1n∥⊤∥op≤σ1σ2nCd1d2nlog(2δ).1subscriptdelimited-∥superscripttopopsubscript1subscript2subscript1subscript22 1n A B _op≤ _% 1 _2n Cd_1d_2n ( 2δ).divide start_ARG 1 end_ARG start_ARG n end_ARG ∥ italic_A italic_B⊤ ∥op ≤ divide start_ARG σ1 σ2 end_ARG start_ARG n end_ARG square-root start_ARG C d1 d2 n log ( divide start_ARG 2 end_ARG start_ARG δ end_ARG ) end_ARG . Proof. The proof is the same as Lemma C.4 except that we now use Lemma C.7 to bound the squared value of each entry in the Frobenius norm. 
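As a quick numerical sanity check of the dimension scaling in Lemma C.7, the following sketch uses standard Gaussian entries (a special case of sub-Gaussian vectors) and absorbs the unspecified constants $C$ and $c$ into an illustrative slack factor of 3; it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials, delta = 1000, 2000, 1e-3
sigma1 = sigma2 = 1.0  # standard Gaussian entries: sub-Gaussian with O(1) parameter

z1 = rng.normal(0.0, sigma1, size=(trials, d))
z2 = rng.normal(0.0, sigma2, size=(trials, d))
dots = np.abs(np.einsum("ij,ij->i", z1, z2))  # |z1 . z2| over independent trials

# Lemma C.7 predicts |z1 . z2| <= C' * sigma1 * sigma2 * sqrt(d * log(2/delta))
# with probability >= 1 - delta; here a slack factor of 3 stands in for C'.
bound = 3.0 * sigma1 * sigma2 * np.sqrt(d * np.log(2.0 / delta))
frac_exceed = np.mean(dots > bound)
print(frac_exceed)  # empirically far below delta
```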
∎

With these alternative extended results, the proof in Section C.2.1 immediately generalizes to sub-Gaussian data. This extension allows us to accommodate more realistic scenarios and enhances the theoretical robustness of our assumptions. Sub-Gaussian distributions capture a wider class of data behaviors; for instance, the fact that bounded random variables are sub-Gaussian makes the theory applicable to many real-world datasets, which naturally exhibit sub-Gaussian characteristics. In the following section, we show a general result that even more examples can be constructed.

C.3 Proof of Theorem 3.6

The intuition behind this theorem is that adding high-dimensional sub-Gaussian entries to the given representation preserves decomposability while slightly modifying the parameters. Due to the orthogonality of $M$ and $M^\perp$, we let $U = \begin{bmatrix} M & M^\perp \end{bmatrix}$ and then $\alpha(x) = U \begin{bmatrix} h(x) \\ \xi(x) \end{bmatrix}$; naturally, the column space of $M$ can be regarded as the subspace $\mathcal{V}$, and the column space of $M^\perp$ is $\mathcal{V}^\perp$. Given that $h(x)$'s representations are $(\delta, 0, 0)$-decomposable w.r.t. $\mathbb{R}^d$, we now prove that the new representations are $(\delta, \frac{\sigma^2}{\hat n}, \frac{\sigma^2}{\tilde n})$-decomposable. Again, we only present the proof for one data split whenever it can be replicated for the other.
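The construction above can be sketched numerically. This is a minimal illustration only: Gaussian vectors stand in for $h(x)$, and $\xi(x)$ is given entrywise variance $\sigma^2/m$ so that $\mathbb{E}\|\xi(x)\|^2 = \sigma^2$ (consistent with the covariance computed in step 5 below); the dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, m, n = 5, 400, 50  # dim of h, dim of the added sub-Gaussian part, sample count
sigma = 1.0

# Orthonormal U = [M  M_perp]: columns of M span V, columns of M_perp span V_perp
U_full, _ = np.linalg.qr(rng.normal(size=(d_h + m, d_h + m)))
M, M_perp = U_full[:, :d_h], U_full[:, d_h:]

H = rng.normal(size=(d_h, n))                          # stand-ins for the h(x_i)
Xi = rng.normal(0.0, sigma / np.sqrt(m), size=(m, n))  # entries with variance sigma^2 / m

alpha = U_full @ np.vstack([H, Xi])                    # augmented representations alpha(x_i)

# Projecting onto V = col(M) recovers the h-part (up to float error)
h_recovered = M.T @ alpha
print(np.allclose(h_recovered, H))

# The kernel of the V_perp-part equals Xi^T Xi; its diagonal concentrates around
# sigma^2, matching the (sigma^2 / n) * I target of the kernel-wise isotropy condition
K_perp = (M_perp.T @ alpha).T @ (M_perp.T @ alpha)
print(abs(np.diag(K_perp).mean() - sigma**2))  # small when m is large
```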
For notation, we let $\gamma = \sigma^2 / \max\{\hat n, \tilde n\}$.

1. Boundedness: $\frac{1}{\hat n}\sum_{i=1}^{\hat n} \hat y_i^2 = O(1)$ follows from the previous proof using Chebyshev's inequality. For the population covariance,
$$\|\Sigma(\alpha)\|_{op} = \big\|\mathbb{E}_{D_x}[\alpha(x)\alpha(x)^\top]\big\|_{op} = \Big\|\mathbb{E}_{D_x}\begin{bmatrix} h(x)h(x)^\top & h(x)\xi(x)^\top \\ \xi(x)h(x)^\top & \xi(x)\xi(x)^\top \end{bmatrix}\Big\|_{op} \le \Big\|\mathbb{E}_{D_x}\begin{bmatrix} h(x)h(x)^\top & 0 \\ 0 & \xi(x)\xi(x)^\top \end{bmatrix}\Big\|_{op} + \Big\|\mathbb{E}_{D_x}\begin{bmatrix} 0 & h(x)\xi(x)^\top \\ \xi(x)h(x)^\top & 0 \end{bmatrix}\Big\|_{op} \quad (36)$$
We have that $\|\mathbb{E}_{D_x}[h(x)h(x)^\top]\|_{op} = \|\Sigma(h)\|_{op} = O(1)$ by the $(\delta,0,0)$-decomposability assumption on $h$'s representations. From the proof for sub-Gaussian data in Section C.2.1, $\|\mathbb{E}_{D_x}[\xi(x)\xi(x)^\top]\|_{op} = \|\Sigma(\xi)\|_{op} = O(1)$. These bound the first term on the RHS of Equation 36. By the definition of the operator norm,
$$\big\|\mathbb{E}_{D_x}[h(x)\xi(x)^\top]\big\|_{op} = \sup_{\|u\|=1}\sup_{\|v\|=1} u^\top \mathbb{E}_{D_x}[h(x)\xi(x)^\top]\, v = \sup_{\|u\|=1}\sup_{\|v\|=1} \mathbb{E}_{D_x}\big[(u^\top h(x))(v^\top \xi(x))\big]. \quad (37)$$
By the Cauchy-Schwarz inequality, we can bound this expectation as
$$\mathbb{E}_{D_x}\big[(u^\top h(x))(v^\top \xi(x))\big] \le \sqrt{\mathbb{E}_{D_x}[(u^\top h(x))^2]}\,\sqrt{\mathbb{E}_{D_x}[(v^\top \xi(x))^2]}, \quad \text{where}$$
$$\mathbb{E}_{D_x}[(u^\top h(x))^2] = \mathbb{E}_{D_x}[u^\top h(x)h(x)^\top u] = u^\top \mathbb{E}_{D_x}[h(x)h(x)^\top]\, u \le \|u\|^2 \|\Sigma(h)\|_{op} = O(1),$$
$$\mathbb{E}_{D_x}[(v^\top \xi(x))^2] = \mathbb{E}_{D_x}[v^\top \xi(x)\xi(x)^\top v] = v^\top \mathbb{E}_{D_x}[\xi(x)\xi(x)^\top]\, v \le \|v\|^2 \|\Sigma(\xi)\|_{op} = O(1).$$
Combining these results, we have that Equation 37 $= \|\mathbb{E}_{D_x}[h(x)\xi(x)^\top]\|_{op} = O(1)$, bounding the second term in Equation 36. Hence, $\|\Sigma(\alpha)\|_{op} = O(1)$. Similarly, we can prove the bound for the empirical covariance:
$$\|\hat\Sigma(\alpha)\|_{op} = \Big\|\frac{1}{\hat n}\sum_{i=1}^{\hat n} \alpha(\hat x_i)\alpha(\hat x_i)^\top\Big\|_{op} = \Big\|\frac{1}{\hat n}\begin{bmatrix} \sum_{i=1}^{\hat n} h(\hat x_i)h(\hat x_i)^\top & \sum_{i=1}^{\hat n} h(\hat x_i)\xi(\hat x_i)^\top \\ \sum_{i=1}^{\hat n} \xi(\hat x_i)h(\hat x_i)^\top & \sum_{i=1}^{\hat n} \xi(\hat x_i)\xi(\hat x_i)^\top \end{bmatrix}\Big\|_{op} = \Big\|\frac{1}{\hat n}\begin{bmatrix} \hat H \hat H^\top & \hat H \hat\Xi^\top \\ \hat\Xi \hat H^\top & \hat\Xi \hat\Xi^\top \end{bmatrix}\Big\|_{op},$$
where the $i$-th column of $\hat\Xi$ is $\xi(\hat x_i)$ and the $i$-th column of $\hat H$ is $h(\hat x_i)$. The rest is straightforward: the assumption on $h$ and the existing proof for sub-Gaussian data imply $\|\frac{1}{\hat n}\hat H \hat H^\top\|_{op} = O(1)$ and $\|\frac{1}{\hat n}\hat\Xi \hat\Xi^\top\|_{op} = O(1)$. Hence, $\|\frac{1}{\sqrt{\hat n}}\hat H\|_{op}$ and $\|\frac{1}{\sqrt{\hat n}}\hat\Xi\|_{op}$ are $O(1)$, and we have that $\|\frac{1}{\hat n}\hat H \hat\Xi^\top\|_{op}$ is also $O(1)$. These together bound the empirical covariance.

2. Concentration on $\mathcal{V}$: Since $\mathcal{V}$ corresponds to the representation space of $h(x)$, this condition is automatically satisfied by the $(\delta,0,0)$-decomposability assumption on $h$.

3. Kernel-wise $\delta$-isotropy on $\mathcal{V}^\perp$: In this setting, since $\mathcal{V}^\perp$ corresponds to the column space of $M^\perp$ (the high-dimensional sub-Gaussian part), we have
$$\Big\|\frac{1}{\hat n}\hat K(\Pi_{\mathcal{V}^\perp}\alpha) - \frac{\sigma^2}{\hat n} I\Big\|_{op} = \Big\|\frac{1}{\hat n}\hat K(\xi) - \frac{\sigma^2}{\hat n} I\Big\|_{op}.$$
By the definition of the kernel matrix, $\hat K(\xi) = [\xi(\hat x_i)^\top \xi(\hat x_j)]_{1 \le i,j \le \hat n} = \hat\Xi^\top \hat\Xi$ with $\hat\Xi$ defined above. Then the equation is essentially in the same form as Equation 34, so the previous proof applies here.

4. Small cross-sample inner product on $\mathcal{V}^\perp$: Similar to 3, we have
$$\Big\|\frac{1}{\sqrt{\hat n \tilde n}}\big[(\Pi_{\mathcal{V}^\perp}\alpha(\hat x_i))^\top \Pi_{\mathcal{V}^\perp}\alpha(\tilde x_j)\big]_{1 \le i \le \hat n,\, 1 \le j \le \tilde n}\Big\|_{op} = \Big\|\frac{1}{\sqrt{\hat n \tilde n}}\big[\xi(\hat x_i)^\top \xi(\tilde x_j)\big]_{1 \le i \le \hat n,\, 1 \le j \le \tilde n}\Big\|_{op} = \frac{1}{\sqrt{\hat n \tilde n}}\big\|\hat\Xi^\top \tilde\Xi\big\|_{op},$$
where $\tilde\Xi$ is defined in the same manner. Then the proof after Equation 35 for sub-Gaussian data applies.

5. Diminishing population covariance on $\mathcal{V}^\perp$: This refers to the covariance matrix of the sub-Gaussian part, and we simply have
$$\|\Sigma(\Pi_{\mathcal{V}^\perp}\alpha)\|_{op} = \|\Sigma(\xi)\|_{op} = \big\|\mathbb{E}_{D_x}[\xi(x)\xi(x)^\top]\big\|_{op} = \frac{\sigma^2}{m} = o(\delta + \gamma) \quad \text{as } m = \omega(\hat n) = \omega(\tilde n).$$

Appendix D Additional Experimental Details

D.1 Training details

D.1.1 Molecular prediction. Our experiment is built on the GitHub codebase provided by (Fabian et al., 2020). The strong model, MolBERT, can be downloaded using the link provided on their GitHub repository. For the weak models, we train small transformers using their pipeline with a batch size of 256.
For finetuning, we use SGD to train a linear model on representations with the following settings: batch size = 1024, learning rate = 0.001, weight decay = 0.1, and epochs = 2000 when using representations from the strong model; and batch size = 1024, learning rate = 0.01, weight decay = 0, and epochs = 2000 when using representations from the weak models.

D.1.2 NLP tasks with embedding models. We use nvidia/NV-Embed-v2, ranked first on the leaderboard of the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), as the strong model. We consider the following 22 embedding models as the weak models:

- avsolatorio/GIST-Embedding-v0
- Alibaba-NLP/gte-base-en-v1.5
- jxm/cde-small-v1
- thenlper/gte-base
- infgrad/stella-base-en-v2
- BAAI/bge-base-en-v1.5
- thenlper/gte-small
- intfloat/e5-base-v2
- abhinand/MedEmbed-small-v0.1
- nomic-ai/nomic-embed-text-v1
- sentence-transformers/facebook-dpr-question_encoder-single-nq-base
- sentence-transformers/paraphrase-MiniLM-L3-v2
- sentence-transformers/average_word_embeddings_glove.840B.300d
- sentence-transformers/roberta-base-nli-mean-tokens
- sentence-transformers/all-mpnet-base-v1
- sentence-transformers/bert-base-wikipedia-sections-mean-tokens
- sentence-transformers/sentence-t5-base
- Snowflake/snowflake-arctic-embed-s
- TaylorAI/gte-tiny
- jinaai/jina-embeddings-v2-small-en
- sentence-transformers/gtr-t5-base
- dumyy/sft-bge-small

During fine-tuning, we train a linear classifier on representations using the Adam optimizer (Kingma, 2014) with the following settings: batch size = 200, learning rate = 0.01, weight decay = 0.00001, and epochs = 200.

D.1.3 NLP tasks with end-to-end finetuned LLMs. We largely reuse the GitHub codebase provided by (Burns et al., 2023). We use Qwen/Qwen-7B as the strong model.
We consider the following 28 LLMs as the weak model:

- bigscience/bloom-560m
- bigscience/bloomz-560m
- bigscience/mt0-base
- baidu/ernie-code-560m
- bigscience/mt0-small
- google/umt5-small
- google/umt5-base
- google/mt5-base
- facebook/xglm-564M
- MBZUAI/LaMini-T5-61M
- MBZUAI/LaMini-Flan-T5-77M
- MBZUAI/LaMini-GPT-124M
- MBZUAI/LaMini-Neo-125M
- MBZUAI/LaMini-T5-223M
- apple/OpenELM-270M
- apple/OpenELM-450M
- EleutherAI/pythia-160m
- MBZUAI/LaMini-Flan-T5-248M
- MBZUAI/LaMini-GPT-774M
- cerebras/Cerebras-GPT-111M
- google-t5/t5-small
- facebook/opt-125m
- Qwen/Qwen2.5-0.5B
- distilbert/distilgpt2
- EleutherAI/gpt-neo-125m
- gpt2
- google/mt5-small
- EleutherAI/pythia-70m

We finetune all the models using the pipeline provided in the codebase, which employs the Adam optimizer with a batch size of 32 and trains for a single epoch. The learning rate is set to 5e-5 for weak models and 1e-5 for the strong model, following the default configuration in the codebase, which applies smaller learning rates to larger models.

D.2 Details and discussions on hyperparameters

In Exp. I, we set $\alpha_w = \alpha_s = 0.1$ and $\beta_w = \beta_s = 0.1$ for all datasets. In Exp. II, we set $\alpha_w = 0.001$, $\alpha_s = 0.05$, $\lambda_w = 0.0001$, and $\lambda_s = 0.01$ for both datasets. In Exp. III, we tune the hyperparameters for each dataset, reporting the best result. Specifically, we set $\alpha_w = \alpha_s$ and vary them within the range $\{0.02, 0.05\}$, and vary $\beta_w$ and $\beta_s$ independently within the range $\{0.2, 0.5, 0.8, 1.0, 2, 4, 8\}$.

Effect of hyperparameters. We vary the hyperparameters to evaluate their impact on performance. In the setting of Exp. II, we vary $\alpha_w$ and $\alpha_s$ within the range $\{0.001, 0.01, 0.05\}$ and $\lambda_w$ and $\lambda_s$ within the range $\{0.0001, 0.001, 0.01\}$. The results are visualized in Figure LABEL:fig:_exp2_hps. In the setting of Exp. III, we vary the hyperparameters while keeping $\alpha_w = \alpha_s$ as described in the previous paragraph, with results visualized in Figure LABEL:fig:_exp3_hps. Although certain hyperparameter configurations may lead to lower correlation, a non-trivial positive correlation is observed in most cases. Interestingly, in Exp. III, which is seemingly the most challenging setting, the results are highly robust to changes in hyperparameters, with the worst-case correlation remaining around 0.6 across all three datasets.

Cross-model hyperparameter transfer. We note that, although each model could technically require different hyperparameters, in our experiments we let all weak models share hyperparameters for simplicity and still achieve strong results, suggesting that our approach is not very sensitive to hyperparameters. Further, we present a new experiment demonstrating that hyperparameters selected using one group of models (i.e., as a validation set) generalize to other models. We randomly split the weak models into two groups, select hyperparameters based on one group, and evaluate them on the other. We repeat this 20 times and report the results in Table 2. The correlation remains high with low standard deviation, indicating that hyperparameters selected using a few models can reliably generalize to new ones. Additionally, we note that a small amount of labeled data should suffice for hyperparameter tuning, as the labels are only used to measure test performance and not to compute our metric.

Table 2: Average Spearman correlation with hyperparameters selected on half of the models and evaluated on the rest.
Justice: 0.885 ± 0.16
Commonsense: 0.67 ± 0.20

D.3 Results for $\|P_s(I - P_w)P_s\|_{op}$

Results for $\|P_s(I - P_w)P_s\|_{op}$ are presented in Figures LABEL:fig:_molecular_P, LABEL:fig:_embedding_P, and LABEL:fig:_end2end_P. We observe a strong correlation between $\mathrm{Err}_{w2s}$ and $\|P_s(I - P_w)P_s\|_{op}$ across the settings. These correlations are similar to those achieved using $\|P_s(I - P_w)\|_{op}$, indicating that the two metrics are similarly informative for W2SG in practice, despite being theoretically derived in different ways.

Figure 9: The top panel shows results on SciQ for models with sizes $\le 10000$, while the bottom panel shows results on Amazon Polarity for models with sizes $\le 8000$. The patterns observed here are consistent with those discussed in Figure 4 in the main paper.

D.4 Comparison with model size and effective dimension

Figure 9 compares our metric with the activation map dimension and the dimension of approximated principal representations for smaller models on SciQ and Amazon Polarity. The results are consistent with those presented in Figure 4 in the main paper.

Appendix E Discussion

Using activation maps as representations in Exp. III is a simple heuristic that yields promising results.
However, more principled methods for defining and extracting representations from LLMs, such as those based on the NTK (Malladi et al., 2023) or representation engineering (Zou et al., 2023), could be explored. Future research could leverage these approaches to improve results and uncover new applications. For instance, (Zou et al., 2023) introduces a method for extracting directions in representation space that correspond to specific concepts, such as honesty and power-seeking. This could enable computing our metric on topic-specific representations, allowing predictions of W2SG for general tasks within specific topical domains.
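To make the metrics discussed in Appendix D.3 concrete, the following is a minimal numpy sketch of one plausible way to compute $\|P_s(I - P_w)\|_{op}$ and $\|P_s(I - P_w)P_s\|_{op}$. It assumes, for illustration only, that $P_w$ and $P_s$ are sample-space projections onto the span of the top principal components of the weak and strong models' representations; the synthetic representations and the number of retained components are placeholder assumptions, not the paper's exact procedure.

```python
import numpy as np

def pc_projection(reps, k):
    """n x n projection onto the span of the top-k left singular vectors
    (principal components in sample space) of the centered representations."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    Uk = U[:, :k]
    return Uk @ Uk.T

rng = np.random.default_rng(0)
n = 300
weak_reps = rng.normal(size=(n, 16))  # placeholder weak-model representations
# placeholder strong model whose representations span strictly more directions
strong_reps = np.hstack([weak_reps, rng.normal(size=(n, 48))])

P_w = pc_projection(weak_reps, k=16)
P_s = pc_projection(strong_reps, k=64)
I = np.eye(n)

# What the strong model can represent but the weak model cannot:
metric = np.linalg.norm(P_s @ (I - P_w), ord=2)            # ||P_s (I - P_w)||_op
metric_sym = np.linalg.norm(P_s @ (I - P_w) @ P_s, ord=2)  # ||P_s (I - P_w) P_s||_op
print(metric, metric_sym)
```

Both quantities lie in $[0, 1]$; larger values indicate that more of the strong model's principal subspace falls outside the weak model's, which the theory links to the shortfall of weak-to-strong training relative to the strong model's full potential.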