Paper deep dive
Worst-case low-rank approximations
Anya Fries, Markus Reichstein, David Blei, Jonas Peters
Abstract
Real-world data in health, economics, and environmental sciences are often collected across heterogeneous domains (such as hospitals, regions, or time periods). In such settings, distributional shifts can make standard PCA unreliable, in that, for example, the leading principal components may explain substantially less variance in unseen domains than in the training domains. Existing approaches (such as FairPCA) have proposed to consider worst-case (rather than average) performance across multiple domains. This work develops a unified framework, called wcPCA, applies it to other objectives (resulting in novel estimators such as norm-minPCA and norm-maxRegret, which are better suited for applications with heterogeneous total variance), and analyzes their relationship. We prove that for all objectives, the estimators are worst-case optimal not only over the observed source domains but also over all target domains whose covariance lies in the convex hull of the (possibly normalized) source covariances. We establish consistency and asymptotic worst-case guarantees for the empirical estimators. We extend our methodology to matrix completion, another problem that makes use of low-rank approximations, and prove approximate worst-case optimality for inductive matrix completion. Simulations and two real-world applications on ecosystem-atmosphere fluxes demonstrate marked improvements in worst-case performance, with only minor losses in average performance.
Links
- Source: https://arxiv.org/abs/2603.11304v1
- Canonical: https://arxiv.org/abs/2603.11304v1
Full Text
Worst-case low-rank approximations

Anya Fries (.fries@stat.math.ethz.ch), Seminar for Statistics, ETH Zürich, Zürich, Switzerland
Markus Reichstein (@bgc-jena.mpg.de), Max Planck Institute for Biogeochemistry, Jena, Germany
David Blei (.blei@columbia.edu), Department of Computer Science and Department of Statistics, Columbia University, New York, USA
Jonas Peters (.peters@stat.math.ethz.ch), Seminar for Statistics, ETH Zürich, Zürich, Switzerland
Keywords: distribution generalization, worst-case optimality, principal component analysis, matrix completion, low-rank approximation

1 Introduction

Real-world data frequently originate from multiple heterogeneous domains, characterized by distinct statistical properties or underlying structures. For example, medical data collected from different hospitals or environmental data spanning different ecosystems and time periods often exhibit distributional shifts. Traditional dimensionality reduction methods, such as principal components analysis (PCA), generally implicitly assume homogeneity across domains. When this assumption is violated, pooled representations can fail to generalize, for example, explaining substantially less variance in unseen domains. PCA summarizes data in a lower-dimensional subspace chosen to maximize explained variance (or, equivalently, to minimize $\ell_2$-reconstruction error). Given a random (row) vector $x \in \mathbb{R}^{1 \times p}$ with zero mean and covariance matrix $\Sigma := E[x^\top x] \in \mathbb{R}^{p \times p}$, we say that an orthonormal matrix $V_k^* \in \mathbb{R}^{p \times k}$ solves rank-$k$ population PCA if

$$V_k^* \in \arg\max_{V \in \mathcal{O}_{p \times k}} \operatorname{Tr}(V^\top \Sigma V), \qquad (1)$$

where $\mathcal{O}_{p \times k}$ denotes the space of all $p \times k$-dimensional orthonormal matrices (in practice, one considers sample covariance matrices instead). In the multi-domain setting, where we observe data in different domains $e$ (corresponding to different covariance matrices $\Sigma_e$), no single covariance characterizes the data, and the question arises how to take the different domains into consideration. Intuitively, pooling all samples to form a single sample covariance matrix ignores domain-specific variability, while treating each domain independently discards common structure.
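Problem (1) has a classical closed-form solution: the top-$k$ eigenvectors of $\Sigma$. The following is a minimal numerical sketch of this fact (not code from the paper; function names are illustrative):

```python
import numpy as np

def pca_components(Sigma, k):
    """Top-k eigenvectors of a covariance matrix Sigma, i.e., a solution
    of argmax_{V orthonormal} Tr(V^T Sigma V)."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]             # reorder, keep top-k columns

def explained_variance(V, Sigma):
    """Tr(V^T Sigma V): variance captured by the subspace spanned by V."""
    return np.trace(V.T @ Sigma @ V)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T                                # a random covariance matrix
V = pca_components(Sigma, k=2)

# The top-2 eigenspace captures at least as much variance as any other
# orthonormal 2-frame, e.g., a random one obtained via QR:
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))
assert explained_variance(V, Sigma) >= explained_variance(Q, Sigma) - 1e-9
```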
This work investigates the implications of replacing $\operatorname{Tr}(V^\top \Sigma V)$ in (1) by a term aggregating over the source domains; for example, $\operatorname{Tr}(V^\top \sum_e w_e \Sigma_e V)$ (corresponding to a pooling approach) or $\min_e \operatorname{Tr}(V^\top \Sigma_e V)$ (corresponding to a worst-case approach). These objectives generally yield different answers (see, for example, Figures 1 and 2). Moreover, this work shows that the different objectives yield different performances on test domains, too: for example, some, but not all, satisfy robustness guarantees that extend beyond the observed source domains (see Section 3.1 for details). To illustrate this point empirically, we consider daily averages of FLUXNET data (Pastorello et al., 2020), a global network of eddy covariance towers that measure biosphere–atmosphere exchanges of CO2 (e.g., net ecosystem exchange, gross primary production), water vapor, and energy. We treat the TransCom 3 regions (a standard partition of the globe into large-scale climate zones (Gurney et al., 2002, 2004)) as domains, and select five of them as source domains. We solve PCA on the pooled source data (poolPCA) and compare it with one of the methods proposed in this paper (norm-maxRegret), a worst-case low-rank approximation. We consider two PCs (see Figure C.i for a justification). The proportion of explained variance is then evaluated for the source and target regions.

Figure 1: Proportion of explained variance by poolPCA and norm-maxRegret in the FLUXNET example on source (left) and target (right) regions (for one specific split). Unlike poolPCA, which optimizes a pooled explained variance, norm-maxRegret optimizes a worst-case criterion over the source domains, which, here, yields a worst-case improvement of 7.8%. Section 3.1 proves that this choice comes with a worst-case improvement over a larger class of target domains, too. Indeed, in this example, the worst-case explained variance over the target domains improves by 25.8%.
(Over 20 different splits, the median relative increase in worst-case improvements equals 25.6% and 16.6% for source and target domains, respectively; the median decrease in average performance for source domains equals 7.5%; see Section 6.1 for more details.)

Figure 1 shows that norm-maxRegret improves upon poolPCA in worst-case error (with only a moderate reduction in terms of average performance). Section 6.1 further shows that similar results hold for other splits into source and target domains, too; it also provides further details on the experimental setup.

We extend the worst-case framework to matrix completion, motivated by the following connection to PCA. For a fixed $k < p$, PCA is an instance of low-rank approximation: in the finite-sample setting, it solves $\min_{V \in \mathcal{O}_{p \times k}} \|X - XVV^\top\|_F^2$, where $X \in \mathbb{R}^{n \times p}$ is a data matrix (so that $XV$ and $V^\top$ act as left and right factors, respectively). Low-rank approximations are also used to tackle the problem of matrix completion (Candès and Plan, 2010; Candès and Recht, 2012): here, a partially observed matrix is approximated by a low-rank factorization in order to predict the missing entries (e.g., Jain et al., 2013). However, traditional matrix completion approaches—much like PCA—implicitly assume that the latent structure is homogeneous across all samples. We thus apply the worst-case principle to matrix completion and propose to learn a shared representation that minimizes worst-case reconstruction error (on the observed entries) across the source domains, and then use the shared factor for inductive matrix completion in a new, partially observed target domain. Section 4 proves that, when the source domains are fully observed, we obtain analogous robustness guarantees to those of the worst-case PCA formulations, up to an approximation factor.

1.1 Related work

Domain generalization in prediction. Domain generalization in prediction has been studied extensively.
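The stated connection between PCA and low-rank approximation can be checked numerically: the top-$k$ right singular vectors of $X$ minimize $\|X - XVV^\top\|_F^2$, and the minimum equals the sum of the discarded squared singular values. A short sketch (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 6))
k = 2

# Right singular vectors of X are eigenvectors of the (uncentered)
# sample covariance X^T X / n, i.e., the finite-sample PCA solution.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                                   # p x k, orthonormal columns

# Reconstruction error of the PCA subspace ...
err_pca = np.linalg.norm(X - X @ V @ V.T, "fro") ** 2
# ... equals the error of the best rank-k approximation of X
# (the squared singular values that truncated SVD discards):
err_svd = np.sum(s[k:] ** 2)
assert np.isclose(err_pca, err_svd)
```

Here $XV$ and $V^\top$ are exactly the left and right factors mentioned in the text.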
One approach considers adversarial robustness, where performance is optimized under worst-case training scenarios. Examples include maximin effects in regression (Meinshausen and Bühlmann, 2015; Guo, 2024), Magging (Bühlmann and Meinshausen, 2016), Distributionally Robust Optimization (DRO) (Ben-Tal et al., 2013; Duchi and Namkoong, 2019; Sinha et al., 2017), Group DRO (Sagawa et al., 2020; Hu et al., 2018), and maximum risk minimization (maxRM) (Freni et al., 2025). These approaches focus on predictive risk, whereas our work studies worst-case robustness for unsupervised dimensionality reduction. Fair PCA. Several extensions of PCA have been proposed to address heterogeneous domains, especially in the context of fairness. Fair principal component analysis (Fair PCA), first proposed by Samadi et al. (2018), aims to reduce disparities in reconstruction error across groups, which can also be interpreted as domains. Samadi et al. (2018) consider two domains, and minimize the maximum increase in reconstruction error incurred by using a shared projection instead of the domain-specific optimal projection. For multiple domains, this objective is often referred to as the disparity error or regret (see Section 2.2.3). Kamani et al. (2022) study trade-offs between reconstruction accuracy and fairness constraints, while Zalcberg and Wiesel (2021) introduce fair robust PCA and fair sparse PCA. Within a broader multi-criteria dimensionality reduction framework, Tantipongpipat et al. (2019) define Fair PCA as maximizing worst-case explained variance (discussed in Section 2.2.1). These objectives have motivated algorithmic developments, including those of Babu and Stoica (2023). However, all of these approaches target fairness (i.e., in-sample guarantees), whereas our framework addresses out-of-sample guarantees. Stable and common PCA. Beyond fairness, Wang et al. 
(2025) seek robustness guarantees across domains and study a Group DRO formulation of PCA with the Fantope relaxation, termed StablePCA. While they focus on the optimization algorithm, we focus on explicit out-of-sample guarantees (including stronger results for the population case and both consistency and robustness guarantees for the estimators). We also consider a broader class of worst-case objectives, analyze their relationship, and extend the framework to matrix completion. Other multi-domain PCA frameworks, such as the one proposed by Puchhammer et al. (2024), accommodate heterogeneous data but yield domain-specific projections rather than a unified robust representation. Similarly, common PCA (Flury, 1986, 1987) and common empirical orthogonal functions (Hannachi et al., 2023) allow for heterogeneous data, but neither provides worst-case nor out-of-sample guarantees.

Matrix completion. Matrix completion is well studied in the single-domain setting, where recovery guarantees are available under an incoherence assumption and assumptions on the sampling process of the observed entries (e.g., Candès and Plan, 2010; Candès and Recht, 2012; Jain et al., 2013). Inductive matrix completion enables completion for previously unseen rows or columns, for example by incorporating side information (Jain and Dhillon, 2013) or by estimating latent factors from partially observed entries in a learned low-rank model (Koren et al., 2009). Multi-domain extensions include fairness-aware non-negative matrix factorization (Kassab et al., 2024) and fairness-aware federated matrix factorization (Liu et al., 2022). These works, however, focus on group fairness rather than robustness across heterogeneous domains. To our knowledge, the results we present in Section 4.2 are the first explicit worst-case guarantees for (inductive) matrix completion.

1.2 Contributions and outline

We develop a unified framework for worst-case PCA and matrix completion.
While related objectives have appeared in Fair PCA and in relaxed formulations such as StablePCA, these works either do not address out-of-sample robustness or focus on the optimization problem. Our framework clarifies the relationships between worst-case variance-based, reconstruction-based, and regret-based objectives, which—unlike in classical PCA—generally yield distinct solutions. We prove that the resulting worst-case solutions generalize beyond the observed domains; more specifically, they are worst-case optimal over all distributions whose covariances lie in the convex hull of the source covariances. The guarantees extend ideas from distributionally robust optimization while requiring only second-moment information. We further provide finite-sample theory, proving consistency and asymptotic worst-case optimality of the empirical estimators. Simulations and a FLUXNET case study demonstrate improvements in worst-case performance with only minor losses on average.

The remainder of this work is organized as follows. Section 2.1 motivates worst-case PCA (wcPCA). Section 2.2 introduces and discusses the variants of the objective and links them to reconstruction error. Section 3.1 establishes the main robustness guarantees, which we extend to the finite-sample setting in Section 3.2. Section 4 broadens the framework to matrix completion with missing data. Section 5 illustrates the theory through simulations and Section 6 presents two real-world applications on ecosystem–atmosphere flux data. All proofs can be found in Appendix D.

1.3 Notation

Throughout the article, we make use of the following notation. For all $n \in \mathbb{N}_+$, we write $[n] := \{1, \dots, n\}$. Generally, non-bold lowercase letters (e.g., $x$) are used for deterministic row vectors, bold lowercase letters (e.g., $\mathbf{x}$) are used for random row vectors, and uppercase letters (e.g., $X$) are used for random matrices.
For all matrices $X \in \mathbb{R}^{n \times p}$ (random or deterministic), the Frobenius norm of $X$ is denoted by $\|X\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^p X_{ij}^2} = \sqrt{\operatorname{Tr}(X^\top X)} = \sqrt{\sum_{i=1}^{\min\{n,p\}} \sigma_i^2(X)}$, where $\sigma_i(X)$ denotes the singular values of $X$ and $\operatorname{Tr}(\cdot)$ the trace. For all $x \in \mathbb{R}^p$, $\operatorname{diag}(x) \in \mathbb{R}^{p \times p}$ is the matrix formed with $x$ on the diagonal and zeros elsewhere. The set of $p \times k$ matrices with orthonormal columns is denoted $\mathcal{O}_{p \times k} := \{V \in \mathbb{R}^{p \times k} \mid V^\top V = I_k\}$. The convex hull of a set $S$ contained in a real vector space, denoted by $\operatorname{conv}(S)$, is the set of all convex combinations of points in $S$, that is, $\operatorname{conv}(S) := \{\sum_{i=1}^k \alpha_i x_i \mid x_i \in S,\ \alpha_i \geq 0,\ \sum_{i=1}^k \alpha_i = 1,\ k \in \mathbb{N}\}$.

2 Worst-case PCA and its variants

2.1 Setting and motivation

Given data from source domains, the goal is to learn low-dimensional representations that explain variance across these domains and on new (possibly unseen) target domains. We first consider a population setting, i.e., in each source domain, we are given the whole distribution; we discuss the finite-sample case in Section 3.2.

Setting 1 (Source domains). Let $\mathcal{E} := \{1, \dots, E\}$ denote a set of $E$ source domains. For all $e \in \mathcal{E}$, let $x_e \in \mathbb{R}^{1 \times p}$ be a random row vector drawn from a distribution $P_e$ such that $E_{P_e}[x_e] = 0$ and the covariance $\Sigma_e := E_{P_e}[x_e^\top x_e] \in \mathbb{R}^{p \times p}$ exists and has strictly positive trace. For each $e \in \mathcal{E}$, let $w_e > 0$ denote the probability (all methods and theoretical results developed in this paper apply both when domains are sampled randomly according to $(w_e)_{e \in \mathcal{E}}$ and when the domain sequence is deterministic) that an observation comes from domain $e$, where the domain assignment is independent of all other randomness and $\sum_{e \in \mathcal{E}} w_e = 1$. The domain label associated with each observation is assumed to be observed.

A standard approach, which we refer to as poolPCA, is to pool all domains.
This is often done implicitly, for example, when the domain structure is ignored. More concretely, we say $V_k^{\mathrm{pool}} \in \mathcal{O}_{p \times k}$ solves rank-$k$ poolPCA if it maximizes the average variance, i.e., it solves PCA for $\Sigma_{\mathrm{pool}} := \sum_{e \in \mathcal{E}} w_e \Sigma_e$. Throughout the paper, we omit the subscript $k$ whenever the rank is clear from context. One can also solve PCA in each domain. For each $e \in \mathcal{E}$, let $V_k^{*,e} \in \mathcal{O}_{p \times k}$ denote a rank-$k$ PCA solution. There is no canonical way to combine these into a single representation, but one can consider a conservative option and select the solution from the domain that achieves the minimal explained variance. We say $V_k^{\mathrm{sep}}$ solves rank-$k$ sepPCA (separate PCA) if

$$V_k^{\mathrm{sep}} := V_k^{*,e_0} \quad \text{with} \quad e_0 := \arg\min_{e \in \mathcal{E}} \operatorname{Tr}\big((V_k^{*,e})^\top \Sigma_e V_k^{*,e}\big),$$

where we choose $e_0$ to be the smallest index minimizing the objective. Neither of these two baselines is necessarily optimal for maximizing explained variance across domains. When evaluated on the observed source domains, the pooled and separate solutions can perform poorly in the worst case, and Section 5.1 demonstrates that the same may hold when these solutions are applied in test domains. The following example illustrates this point and motivates wcPCA, which explicitly optimizes for worst-case performance across domains.

Example 1. Consider two domains $\mathcal{E} := \{1, 2\}$, where observations are drawn from each domain with equal probability, that is, $w_1 = w_2$. The distribution in each domain is a zero-mean Gaussian with population covariances

$$\Sigma_1 := \operatorname{diag}(0.9, 0.1, 0) \quad \text{and} \quad \Sigma_2 := \operatorname{diag}(0, 0.4, 0.6),$$

respectively; the example is illustrated in Figure 2.

Figure 2: Visualization of possible solutions to the rank-1 approximation problem in a specific example. The figure shows contour plots of two distributions $N(0, \Sigma_1)$ (green) and $N(0, \Sigma_2)$ (orange), together with their supports (shaded planes).
The methods poolPCA, sepPCA, minPCA, and maxRegret describe different ways to aggregate over different source domains in the objective function (see Section 2.2 for details); their solutions are denoted by $V^{\mathrm{pool}}$, $V^{\mathrm{sep}}$, $V^{\mathrm{minPCA}}$, and $V^{\mathrm{maxRegret}}$, respectively (the values in parentheses indicate the pooled and worst-case explained variance). For example, $V^{\mathrm{pool}}$ and $V^{\mathrm{sep}}$ explain 0% of variance in the worst-case domain, as they are orthogonal to the support of that domain. In contrast, $V^{\mathrm{minPCA}}$ maximizes the worst-case explained variance (resulting in 36%) and $V^{\mathrm{maxRegret}}$ explains at least 24% of variance in each domain. Details are provided in Example 1.

Rank-1 poolPCA considers $\Sigma_{\mathrm{pool}} = \frac{1}{2}(\Sigma_1 + \Sigma_2)$ and yields $V^{\mathrm{pool}} = (1, 0, 0)^\top$, while solving PCA separately in each (source) domain gives $(1, 0, 0)^\top$ for domain 1 and $(0, 0, 1)^\top$ for domain 2; selecting the solution from the domain with minimal explained variance yields $V^{\mathrm{sep}} = (0, 0, 1)^\top$. $V^{\mathrm{pool}}$ and $V^{\mathrm{sep}}$, respectively, explain 45% and 30% of the average covariance $\Sigma_{\mathrm{pool}}$. However, both perform poorly (0% explained variance) in the respective worst-case source domain. This occurs because $V^{\mathrm{pool}}$ and $V^{\mathrm{sep}}$ are both orthogonal to the support of one of the domains, thus capturing no variance in that domain. In contrast, the wcPCA variants explicitly optimize worst-case performance across domains. In particular, minPCA (Definition 1, below) is solved by $V^{\mathrm{minPCA}} = \frac{1}{\sqrt{5}}(\sqrt{2}, 0, \sqrt{3})^\top$, which explains 36% of variance in both domains, and maxRegret (Definition 3, below) is solved by $V^{\mathrm{maxRegret}} = \frac{1}{\sqrt{5}}(\sqrt{3}, 0, \sqrt{2})^\top$, which explains at least 24% of variance. Thus, while the explained variance with respect to the pooled covariance $\Sigma_{\mathrm{pool}}$ is comparable to that of $V^{\mathrm{sep}}$ and $V^{\mathrm{pool}}$, the wcPCA solutions achieve a strictly positive percentage of explained variance in the worst-case source domain.
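The numbers in Example 1 can be verified directly. The following sketch (illustrative names, not the paper's code) plugs the stated rank-1 solutions into the two domain covariances:

```python
import numpy as np

S1 = np.diag([0.9, 0.1, 0.0])
S2 = np.diag([0.0, 0.4, 0.6])
S_pool = 0.5 * (S1 + S2)            # w1 = w2 = 1/2

def ev(v, S):
    """Explained variance v^T S v of a unit vector v."""
    return float(v @ S @ v)

v_pool = np.array([1.0, 0.0, 0.0])
v_sep  = np.array([0.0, 0.0, 1.0])
v_min  = np.array([np.sqrt(2), 0.0, np.sqrt(3)]) / np.sqrt(5)   # minPCA
v_reg  = np.array([np.sqrt(3), 0.0, np.sqrt(2)]) / np.sqrt(5)   # maxRegret

# poolPCA and sepPCA explain 45% / 30% of the pooled variance (Tr(S_pool)=1) ...
assert np.isclose(ev(v_pool, S_pool), 0.45) and np.isclose(ev(v_sep, S_pool), 0.30)
# ... but 0% in their respective worst-case domain:
assert np.isclose(min(ev(v_pool, S1), ev(v_pool, S2)), 0.0)
assert np.isclose(min(ev(v_sep,  S1), ev(v_sep,  S2)), 0.0)
# minPCA explains 36% of variance in both domains:
assert np.isclose(ev(v_min, S1), 0.36) and np.isclose(ev(v_min, S2), 0.36)
# maxRegret explains 54% and 24%, and equalizes the regret (0.36) across domains:
assert np.isclose(ev(v_reg, S1), 0.54) and np.isclose(ev(v_reg, S2), 0.24)
assert np.isclose(0.9 - ev(v_reg, S1), 0.6 - ev(v_reg, S2))
```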
We will see in Section 3.1 that this improvement is not restricted to the source domains: we prove that several variants of wcPCA (including minPCA and maxRegret) are worst-case optimal over all domains whose covariances lie in the convex hull of the source covariances.

2.2 Variants of wcPCA

We now formally introduce some variants of wcPCA. More concretely, the objective may focus on maximizing variance or minimizing reconstruction error, and the source covariances may or may not be normalized prior to optimization. This leads to four variants: minPCA, norm-minPCA, maxRCS, and norm-maxRCS. In addition, we introduce the regret-based formulations maxRegret and norm-maxRegret, which evaluate performance relative to each domain's own optimal subspace. The objectives minPCA and maxRegret have been proposed previously; only minPCA has been considered as a "distributionally robust" PCA by Wang et al. (2025) – but no explicit theoretical guarantees have been provided. Table 1 provides an overview of these variants. For classical PCA, the solution spaces of many of these variants coincide (e.g., for a covariance matrix $\Sigma$, the sets of solutions to rank-$k$ PCA on $\Sigma$ and $\Sigma / \operatorname{Tr}(\Sigma)$ are the same, as $\arg\max_{V \in \mathcal{O}_{p \times k}} \operatorname{Tr}(V^\top \Sigma V) = \arg\max_{V \in \mathcal{O}_{p \times k}} \operatorname{Tr}(V^\top \Sigma V) / \operatorname{Tr}(\Sigma)$), but we see in Section 2.2.5 that this is generally not the case in the multi-domain setting (Section 2.2.6 discusses intuition about when to use which objective). Section 3.1 proves that all of these variants satisfy worst-case optimality guarantees.

Normalized | Explained variance (Def. 1) | Reconstruction error (Def. 2) | Regret (Def. 3)
No         | minPCA  ≠  maxRCS  ≠  maxRegret
Yes        | norm-minPCA  =  norm-maxRCS  ≠  norm-maxRegret
(each unnormalized variant also differs from its normalized counterpart)

Table 1: The six worst-case PCA formulations considered in this work: three objectives (explained variance, reconstruction error, and regret) combined with two normalization choices (normalized vs.
unnormalized). The symbols = and ≠ indicate whether the solutions coincide (see Theorem 5). The equality holds if the methods are applied to the full (and usually unknown) distribution or to its empirical counterpart.

2.2.1 Explained variance

Let the explained variance and normalized (or proportion of) explained variance of a subspace $V$ for a distribution $P$ with covariance $\Sigma$ be

$$\mathcal{L}_{\mathrm{var}}(V; P) := \operatorname{Tr}(V^\top \Sigma V), \qquad \mathcal{L}_{\mathrm{normVar}}(V; P) := \frac{\operatorname{Tr}(V^\top \Sigma V)}{\operatorname{Tr}(\Sigma)}. \qquad (2)$$

Definition 1 (minPCA). Consider Setting 1. An orthonormal matrix $V^* \in \mathcal{O}_{p \times k}$ solves rank-$k$ minPCA (resp. norm-minPCA) if

$$V^* \in \arg\max_{V \in \mathcal{O}_{p \times k}} \min_{e \in \mathcal{E}} \mathcal{L}(V; P_e),$$

where $\mathcal{L} = \mathcal{L}_{\mathrm{var}}$ (resp. $\mathcal{L} = \mathcal{L}_{\mathrm{normVar}}$).

With respect to the standard metric induced by the Frobenius norm, $\mathcal{O}_{p \times k}$ is compact as a closed and bounded subset of $\mathbb{R}^{p \times k}$ (by Heine–Borel), and the objective function is continuous. Hence, the maximum is attained.

2.2.2 Reconstruction error

PCA does not only maximize explained variance; it also minimizes the expected reconstruction error among all rank-$k$ approximations. Specifically, for a random vector $x \in \mathbb{R}^{1 \times p}$ drawn from distribution $P$ with covariance matrix $\Sigma := E_P[x^\top x]$, it holds that

$$\arg\max_{V \in \mathcal{O}_{p \times k}} \operatorname{Tr}(V^\top \Sigma V) = \arg\min_{V \in \mathcal{O}_{p \times k}} E_{x \sim P}\big[\|x - xVV^\top\|_2^2\big].$$

We now define a worst-case objective corresponding to the right-hand side of this equation. To do so, we use the following notation for the expected unnormalized and normalized reconstruction losses:

$$\mathcal{L}_{\mathrm{RCS}}(V; P) := E_{x \sim P}\big[\|x - xVV^\top\|_2^2\big], \qquad \mathcal{L}_{\mathrm{normRCS}}(V; P) := \frac{E_{x \sim P}\big[\|x - xVV^\top\|_2^2\big]}{E_{x \sim P}\big[\|x\|_2^2\big]}. \qquad (3)$$

Definition 2 (maxRCS). Consider Setting 1. An orthonormal matrix $V^* \in \mathcal{O}_{p \times k}$ solves rank-$k$ maxRCS (resp. norm-maxRCS) if

$$V^* \in \arg\min_{V \in \mathcal{O}_{p \times k}} \max_{e \in \mathcal{E}} \mathcal{L}(V; P_e),$$

where $\mathcal{L} = \mathcal{L}_{\mathrm{RCS}}$ (resp. $\mathcal{L} = \mathcal{L}_{\mathrm{normRCS}}$).
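The equivalence underlying (3) is the identity $\mathcal{L}_{\mathrm{RCS}}(V; P) = \operatorname{Tr}(\Sigma) - \operatorname{Tr}(V^\top \Sigma V)$, which is why maximizing explained variance minimizes reconstruction error. A quick Monte-Carlo check of this identity (a sketch with illustrative names; the tolerance accounts for sampling noise):

```python
import numpy as np

rng = np.random.default_rng(2)
p, k, n = 4, 2, 200_000
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # rows x ~ P, zero mean
Q, _ = np.linalg.qr(rng.standard_normal((p, k)))          # an arbitrary orthonormal V

# Monte-Carlo estimate of the expected reconstruction error E||x - x V V^T||_2^2 ...
mc = np.mean(np.linalg.norm(X - X @ Q @ Q.T, axis=1) ** 2)
# ... matches Tr(Sigma) - Tr(V^T Sigma V) up to sampling noise:
exact = np.trace(Sigma) - np.trace(Q.T @ Sigma @ Q)
assert abs(mc - exact) < 0.1
```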
2.2.3 Regret

The two formulations introduced above optimize 'absolute performance' in the worst-case domain. Alternatively, we can assess performance with respect to each domain's own optimal subspace. This leads to a regret-based formulation, which quantifies the increase in reconstruction error incurred by enforcing a common subspace across domains. Let the expected unnormalized and normalized regret be

$$\mathcal{L}_{\mathrm{reg}}(V; P) := \mathcal{L}_{\mathrm{RCS}}(V; P) - \min_{W \in \mathcal{O}_{p \times k}} \mathcal{L}_{\mathrm{RCS}}(W; P),$$

$$\mathcal{L}_{\mathrm{normReg}}(V; P) := \mathcal{L}_{\mathrm{normRCS}}(V; P) - \min_{W \in \mathcal{O}_{p \times k}} \mathcal{L}_{\mathrm{normRCS}}(W; P).$$

We can now define a regret-based variant of wcPCA (Theorem 5 below proves that Definition 3 can equivalently be formulated using explained variance rather than reconstruction error).

Definition 3 (maxRegret). Consider Setting 1. An orthonormal matrix $V^* \in \mathcal{O}_{p \times k}$ solves rank-$k$ maxRegret (resp. norm-maxRegret) if

$$V^* \in \arg\min_{V \in \mathcal{O}_{p \times k}} \max_{e \in \mathcal{E}} \mathcal{L}(V; P_e),$$

where $\mathcal{L} = \mathcal{L}_{\mathrm{reg}}$ (resp. $\mathcal{L} = \mathcal{L}_{\mathrm{normReg}}$).

2.2.4 Iterative optimization

The PCA solution in (1) can be computed iteratively by sequentially selecting orthonormal directions that successively maximize explained variance in the remaining space. For PCA, this sequential procedure is equivalent to solving the joint optimization problem (1) and therefore yields an optimal solution (Jolliffe, 2002, Property A1, p. 11). Sequential analogues can also be defined for minPCA and norm-minPCA by choosing orthonormal components iteratively to optimize worst-case explained variance. In contrast to classical PCA, however, this greedy construction generally fails to recover the optimal joint solution.

Proposition 4 (Sequential vs. joint optimization, informal). In general, the subspace obtained by sequentially selecting directions to optimize the worst-case explained variance (resp. worst-case proportion of explained variance) does not coincide with the solution of the joint minPCA (resp.
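Since $\min_{W} \mathcal{L}_{\mathrm{RCS}}(W; P) = \operatorname{Tr}(\Sigma) - \sum_{i \le k} \lambda_i(\Sigma)$, the unnormalized regret of a shared subspace reduces to the top-$k$ eigenvalue sum minus its explained variance. A small sketch (illustrative helper names), evaluated on the covariances of Example 1:

```python
import numpy as np

def regret(V, Sigma):
    """Unnormalized regret of a shared subspace V in a domain with covariance
    Sigma: excess reconstruction error over the domain-optimal rank-k subspace.
    By L_RCS(V;P) = Tr(Sigma) - Tr(V^T Sigma V), this equals
    (sum of the k largest eigenvalues of Sigma) - Tr(V^T Sigma V)."""
    k = V.shape[1]
    top_k = np.sort(np.linalg.eigvalsh(Sigma))[-k:].sum()
    return top_k - np.trace(V.T @ Sigma @ V)

Sigma1 = np.diag([0.9, 0.1, 0.0])
Sigma2 = np.diag([0.0, 0.4, 0.6])
v = (np.array([[np.sqrt(3), 0.0, np.sqrt(2)]]) / np.sqrt(5)).T   # rank-1 shared subspace

# In Example 1, this subspace equalizes the regret across the two domains:
assert np.isclose(regret(v, Sigma1), 0.36)
assert np.isclose(regret(v, Sigma2), 0.36)
# A domain-optimal subspace has zero regret in its own domain:
e1 = np.array([[1.0, 0.0, 0.0]]).T
assert np.isclose(regret(e1, Sigma1), 0.0)
```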
norm-minPCA) problem.

A formal statement of the proposition and a counterexample are given in Appendix A.1. When an ordered basis is required, we first compute the optimal subspace and then, starting from the full subspace, we iteratively remove the direction whose removal maximizes the worst-case explained variance of the remaining subspace. This construction yields an ordered basis consistent with the joint solution (Appendix A.2 provides further details).

2.2.5 Relations between the wcPCA variants

For classical PCA, the solution set does not depend on whether one optimizes for explained variance, reconstruction error, or their normalized versions. For the worst-case setting, Theorem 5 below proves that normalizing the objective can alter the solution set (statement (i)), that equivalence between worst-case explained-variance and reconstruction-error formulations holds only for the normalized objectives (statement (ii)), and that for regret-based objectives the solution set does not depend on whether regret is formulated using explained variance or reconstruction error (statement (iii)).

Theorem 5 (Relations between wcPCA variants). Consider Setting 1. Then, the symbols = and ≠ in Table 1 represent whether the solutions coincide. More precisely, the following statements hold.

(i) Effect of normalization. Let $V^{\mathrm{minPCA}}$ and $V^{\mathrm{norm\text{-}minPCA}}$ solve rank-$k$ minPCA and norm-minPCA, respectively. In general, $V^{\mathrm{minPCA}}$ does not solve norm-minPCA and $V^{\mathrm{norm\text{-}minPCA}}$ does not solve minPCA. Analogous statements hold for maxRCS versus norm-maxRCS and for maxRegret versus norm-maxRegret.

(ii) Explained variance vs. reconstruction error. Let $V^{\mathrm{minPCA}}$ and $V^{\mathrm{maxRCS}}$ solve rank-$k$ minPCA and maxRCS, respectively. In general, $V^{\mathrm{minPCA}}$ does not solve maxRCS and $V^{\mathrm{maxRCS}}$ does not solve minPCA. In contrast, the set of solutions to rank-$k$ norm-minPCA coincides with the set of solutions to rank-$k$ norm-maxRCS.

(iii) Regret.
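The backward-elimination ordering described above can be sketched as follows. This is a deliberately coarse simplification (it only considers dropping existing columns of the computed subspace, whereas the paper removes arbitrary directions within the span; names are illustrative):

```python
import numpy as np

def ordered_basis(V, Sigmas):
    """Order the columns of a fixed optimal subspace V by backward elimination:
    repeatedly drop the column whose removal leaves the remaining columns with
    the largest worst-case explained variance; the column removed first is the
    least important and therefore ends up last."""
    cols = list(range(V.shape[1]))
    removal_order = []
    while cols:
        def wc_ev(remaining):
            # worst-case explained variance of the remaining columns
            W = V[:, remaining]
            return min(np.trace(W.T @ S @ W) for S in Sigmas)
        drop = max(cols, key=lambda j: wc_ev([c for c in cols if c != j]))
        removal_order.append(drop)
        cols.remove(drop)
    removal_order.reverse()                # most important column comes first
    return V[:, removal_order]

Sigmas = [np.diag([0.9, 0.1, 0.0]), np.diag([0.0, 0.4, 0.6])]
V = np.eye(3)[:, :2]                       # some rank-2 subspace to be ordered
B = ordered_basis(V, Sigmas)
assert B.shape == (3, 2) and np.allclose(B.T @ B, np.eye(2))
```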
The set of solutions to the regret problem formulated using reconstruction error (as in Definition 3) coincides with that of the variance-based formulation, that is, with the solutions obtained by replacing $\mathcal{L}_{\mathrm{RCS}}$ with $-\mathcal{L}_{\mathrm{var}}$ (or $\mathcal{L}_{\mathrm{normRCS}}$ with $-\mathcal{L}_{\mathrm{normVar}}$).

2.2.6 Practical considerations for choosing the objective

As summarized in Table 1, the various objectives lead to different sets of solutions. We now discuss some of the differences between the objectives and how factors such as scale heterogeneity and noise influence their behavior. We also discuss how worst-case objectives can be used diagnostically to assess domain heterogeneity.

Sensitivity to varying variances. The unnormalized objectives optimize worst-case explained variance or reconstruction error in the original units. As a consequence, domains with particularly small (resp. large) total variance $\operatorname{Tr}(\Sigma_e)$ can dominate the objective of minPCA (resp. maxRCS). In particular, minPCA maximizes worst-case explained variance in absolute terms, so its solution may be entirely determined by a domain whose variance in every direction is smaller than that of any other domain. In contrast, the normalized variants norm-minPCA and norm-maxRCS optimize proportional explained variance or reconstruction error, which may result in a different worst-case domain and lead to a different solution. These normalized objectives are therefore less sensitive to differences in total variance and align with the interpretation of how much of each domain is explained, but unit comparability across domains is lost. Similarly, maxRegret is not as sensitive to particularly large or small $\operatorname{Tr}(\Sigma_e)$ either, as it compares performance relative to domain-optimal subspaces.

Choosing between explained variance and reconstruction error. For the unnormalized objectives, minPCA and maxRCS can lead to different solutions.
In contrast, the normalized objectives (norm-minPCA and norm-maxRCS), as well as the regret, produce identical solution sets under either formulation, eliminating the need for this decision and aligning more closely with standard PCA.

Effect of heterogeneous noise. When noise levels differ across domains, the regret provides a robust way to learn a common subspace even if the ultimate goal is to optimize explained variance or reconstruction error. Indeed, let us consider observations with different noise variances, that is, for all $e \in \mathcal{E}$ we observe

$$z_e := x_e + \sigma_e \varepsilon_e, \qquad \varepsilon_e \sim N(0, I_p),$$

where $x_e$ is as in Setting 1 and, for all $e, e' \in \mathcal{E}$, $\sigma_e > 0$ and $\varepsilon_e$ is independent of $\varepsilon_{e'}$ and $x_e$. Then, $z_e$ has covariance $\Sigma_e + \sigma_e^2 I_p$. By slight abuse of notation, we say the problem is noiseless if $\sigma_e = 0$ for all $e \in \mathcal{E}$; otherwise, we say that there is heterogeneous noise. Let us first consider the proposed methods applied to the population covariance matrices. For any $V \in \mathcal{O}_{p \times k}$ and $e \in \mathcal{E}$, the explained variance is $\operatorname{Tr}(V^\top \Sigma_e V) + k \sigma_e^2$. Thus, domains with particularly high noise do not attain the minimum in minPCA but can influence the maxRCS solution. In contrast, the regret compares a common subspace $V$ to the rank-$k$ domain-optimal subspace $V_k^{e,*}$, so the noise terms cancel and the objective reduces to $\operatorname{Tr}\big((V_k^{e,*})^\top \Sigma_e V_k^{e,*}\big) - \operatorname{Tr}(V^\top \Sigma_e V)$. Consequently, heterogeneous noise affects the solutions of minPCA and maxRCS but not of maxRegret. Moreover, if each $\Sigma_e$ is exactly rank-$k$, then $\operatorname{Tr}\big((V_k^{e,*})^\top \Sigma_e V_k^{e,*}\big) = \operatorname{Tr}(\Sigma_e)$ and minimizing the maximum regret is equivalent to minimizing the maximum reconstruction error in the noiseless setting (see proof of Theorem 5). For finitely many data, these statements may not hold precisely, but in our simulations the same qualitative patterns persist (see Section 5.1.4).

Worst-case losses as diagnostic tools.
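The noise-cancellation argument above can be checked numerically: adding isotropic noise $\sigma^2 I_p$ shifts every eigenvalue and every explained variance by the same amount, so the regret is unchanged. A short sketch (illustrative names and noise level):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T                                    # noiseless domain covariance
sigma2 = 2.5                                       # isotropic noise variance (illustrative)
k = 2
Q, _ = np.linalg.qr(rng.standard_normal((4, k)))   # a candidate shared subspace

def top_k_sum(S, k):
    return np.sort(np.linalg.eigvalsh(S))[-k:].sum()

def regret(V, S):
    return top_k_sum(S, V.shape[1]) - np.trace(V.T @ S @ V)

# Both the top-k eigenvalue sum and Tr(V^T S V) shift by k*sigma^2,
# so the regret is invariant to isotropic noise ...
assert np.isclose(regret(Q, Sigma), regret(Q, Sigma + sigma2 * np.eye(4)))
# ... while the explained variance itself shifts by exactly k*sigma^2:
shift = np.trace(Q.T @ (Sigma + sigma2 * np.eye(4)) @ Q) - np.trace(Q.T @ Sigma @ Q)
assert np.isclose(shift, k * sigma2)
```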
While Section 3.1 proves that the different wcPCA objectives come with out-of-distribution guarantees, the multi-domain losses can also be used diagnostically. If domain-wise losses under poolPCA vary substantially, this flags heterogeneity across domains that may warrant further investigation. If a worst-case solution differs substantially from poolPCA, it provides a principled alternative representation that explicitly trades off a small reduction in average performance for improved worst-case behavior.

3 Robustness guarantees

3.1 Population guarantees

Optimizing any of the worst-case objectives described in Section 2.2 yields the following strong robustness guarantee: the solution remains worst-case optimal not only for the source domains but also for all distributions whose covariances lie in the convex hull of the (possibly normalized) source covariances. In contrast, standard approaches such as pooled or separate PCA do not offer such guarantees. The following two theorems summarize the results for all worst-case objectives from Section 2.2.

Theorem 6 (Robustness of wcPCA) Consider Setting 1. Let
$$\mathcal{C} := \mathrm{conv}\big(\{\Sigma_e\}_{e\in\mathcal{E}}\big) \quad\text{and}\quad \mathcal{P} := \big\{P \in \mathcal{P}(\mathbb{R}^p) \,\big|\, \mathbb{E}_{x\sim P}[x] = 0,\ \mathbb{E}_{x\sim P}[x x^\top] \in \mathcal{C}\big\} \tag{4}$$
be the convex hull of the source covariances and the corresponding uncertainty set of distributions with zero mean and covariance in $\mathcal{C}$, respectively. Let $V_k^* \in \mathbb{V}_{p\times k}$ solve rank-$k$ maxRCS, minPCA, or maxRegret. Let $\mathcal{L}$ denote the corresponding loss function, i.e., $\mathcal{L} = \mathcal{L}_{\mathrm{RCS}}$ for maxRCS, $\mathcal{L} = -\mathcal{L}_{\mathrm{var}}$ for minPCA, and $\mathcal{L} = \mathcal{L}_{\mathrm{reg}}$ for maxRegret. Then, the following statements hold.

i) For all $V_k \in \mathbb{V}_{p\times k}$, the worst-case loss over $\mathcal{P}$ equals the worst-case loss over the source domains, i.e., $\sup_{P\in\mathcal{P}} \mathcal{L}(V_k; P) = \max_{e\in\mathcal{E}} \mathcal{L}(V_k; P_e)$.

ii) Consequently, the ordering induced by the worst-case loss over the source domains is preserved over $\mathcal{P}$: for all $V_k, W_k \in \mathbb{V}_{p\times k}$, if $\max_{e\in\mathcal{E}} \mathcal{L}(V_k; P_e) < \max_{e\in\mathcal{E}} \mathcal{L}(W_k; P_e)$, then $\sup_{P\in\mathcal{P}} \mathcal{L}(V_k; P) < \sup_{P\in\mathcal{P}} \mathcal{L}(W_k; P)$.

iii) Hence, $V_k^*$ is worst-case optimal over $\mathcal{P}$, i.e., $V_k^* \in \arg\min_{V\in\mathbb{V}_{p\times k}} \sup_{P\in\mathcal{P}} \mathcal{L}(V; P)$.

iv) Statement i) does not generally hold for the solutions to poolPCA and sepPCA.

With a slight modification of the proof, one can show that Theorem 6 extends to the normalized objectives norm-maxRCS, norm-minPCA, and norm-maxRegret after modifying the uncertainty set.

Theorem 7 (Robustness of wcPCA, normalized) Let $\mathcal{C}_{\mathrm{norm}} := \mathrm{conv}\big(\{\Sigma_e/\mathrm{Tr}(\Sigma_e)\}_{e\in\mathcal{E}}\big)$ and
$$\mathcal{P}_{\mathrm{norm}} := \bigg\{P \in \mathcal{P}(\mathbb{R}^p) \,\bigg|\, \mathbb{E}_{x\sim P}[x] = 0,\ \frac{\mathbb{E}_{x\sim P}[x x^\top]}{\mathbb{E}_{x\sim P}[\|x\|_2^2]} \in \mathcal{C}_{\mathrm{norm}}\bigg\}. \tag{5}$$
Then the same statements as in Theorem 6 hold when replacing $\mathcal{L}$ with $\mathcal{L}_{\mathrm{normRCS}}$ (for reconstruction), $-\mathcal{L}_{\mathrm{normVar}}$ (for variance), or $\mathcal{L}_{\mathrm{normReg}}$ (for regret), and replacing $\mathcal{P}$ with $\mathcal{P}_{\mathrm{norm}}$.

Although similar in spirit, the results do not follow from Group DRO (Sagawa et al., 2020), as we outline in Appendix A.3.

3.2 Finite-sample guarantees and consistency

So far, we have focused on population-level objectives. We now prove analogous results in a finite-sample setting. Moreover, we prove that the empirical estimators are consistent for the population solutions and asymptotically worst-case optimal.

Setting 2 (Source domains, finite sample) Consider Setting 1. For all $e \in \mathcal{E}$, let $X_e \in \mathbb{R}^{n_e\times p}$ be a data matrix of $n_e$ i.i.d. observations from $P_e$. Define the empirical covariance as $\hat{\Sigma}_e := \frac{1}{n_e} X_e^\top X_e$. Let $n := \sum_{e\in\mathcal{E}} n_e$, let $X_{\mathrm{pool}} := [X_1^\top, \ldots, X_E^\top]^\top \in \mathbb{R}^{n\times p}$ be the pooled data, and let $\hat{\Sigma}_{\mathrm{pool}} := \frac{1}{n} X_{\mathrm{pool}}^\top X_{\mathrm{pool}}$ be its covariance. We assume that for all $e \in \mathcal{E}$, $\mathrm{Tr}(\hat{\Sigma}_e) > 0$.

Given Setting 2, we define the empirical analogues of poolPCA and the worst-case PCA objectives.
We say:

$\hat{V}_k^{\mathrm{pool}}$ solves rank-$k$ empirical poolPCA if $\hat{V}_k^{\mathrm{pool}} \in \arg\max_{V\in\mathbb{V}_{p\times k}} \mathrm{Tr}(V^\top \hat{\Sigma}_{\mathrm{pool}} V)$;

$\hat{V}_k^{\mathrm{minPCA}}$ solves rank-$k$ empirical minPCA if $\hat{V}_k^{\mathrm{minPCA}} \in \arg\max_{V\in\mathbb{V}_{p\times k}} \min_{e\in\mathcal{E}} \mathrm{Tr}(V^\top \hat{\Sigma}_e V)$;

$\hat{V}_k^{\mathrm{maxRCS}}$ solves rank-$k$ empirical maxRCS if $\hat{V}_k^{\mathrm{maxRCS}} \in \arg\min_{V\in\mathbb{V}_{p\times k}} \max_{e\in\mathcal{E}} \frac{1}{n_e}\|X_e - X_e V V^\top\|_F^2$.

Analogous definitions of empirical maxRegret, norm-minPCA, norm-maxRCS, and norm-maxRegret are given in Appendix A.4.

Remark 8 The empirical formulations exhibit optimality guarantees similar to their population counterparts when considering the convex hull of empirical covariance matrices rather than population covariances. More concretely, we provide a representative result in Appendix A.4 (Proposition 17).

We now prove that the worst-case estimators are asymptotically consistent for the population subspace (Proposition 9) and that the worst-case guarantees of Theorem 6 hold asymptotically (Proposition 10). To establish these results, we require a uniqueness condition.

Assumption 1 Each population problem maxRCS, norm-maxRCS, minPCA, norm-minPCA, maxRegret, and norm-maxRegret admits a unique solution up to right multiplication by an orthonormal matrix. That is, if $V_k^*$ and $\tilde{V}_k$ are both solutions, then there exists $Q \in \mathbb{R}^{k\times k}$ with $Q^\top Q = Q Q^\top = I_k$ such that $V_k^* = \tilde{V}_k Q$ or, equivalently, $\mathrm{span}(V_k^*) = \mathrm{span}(\tilde{V}_k)$.

Assumption 1 avoids set-valued limits (up to rotation). It holds under several conditions.
For example, if a single domain strictly determines the worst-case loss at the optimum, the problem reduces to standard PCA on that domain, and Assumption 1 follows from a strict eigengap between the $k$-th and $(k{+}1)$-th eigenvalue of the corresponding covariance matrix (normalized where applicable). In addition, if multiple domains are 'active' at the optimum, Assumption 1 holds if their (normalized) covariances share a common top-$k$ eigenspace and each has a strict eigengap between its $k$-th and $(k{+}1)$-th eigenvalue. In our simulations, Assumption 1 seems to hold. For example, repeating the experiment of Section 5.1.2 and solving the worst-case low-rank problem twice over 100 runs, the median projection distance (see Proposition 9) between the two solutions in each run is approximately $0.04$, corresponding (if concentrated in one principal angle) to about $1.66^\circ$. Finally, we hypothesize that Propositions 9 and 10 hold more generally, even in (possibly non-generic) settings where Assumption 1 is violated.

Proposition 9 (Consistency of the estimators) Consider Setting 2. Let $V_k^*$ and $\hat{V}_k$ denote the population and empirical solutions, respectively, to maxRCS, norm-maxRCS, minPCA, norm-minPCA, maxRegret, or norm-maxRegret, and let $n_{\min} := \min_{e\in\mathcal{E}} n_e$. Under Assumption 1, as $n_{\min} \to \infty$, we have $d(\hat{V}_k, V_k^*) \overset{p}{\to} 0$, where $d(V_1, V_2) := \|V_1 V_1^\top - V_2 V_2^\top\|_F$ and $\overset{p}{\to}$ denotes convergence in probability.

We further show that the empirical solutions are asymptotically worst-case optimal, in the sense of Theorem 6. First, the empirical estimator achieves the optimal worst-case population loss in the limit. Second, the empirical objective provides an asymptotic upper bound on the population loss over all distributions in the uncertainty set.

Proposition 10 (Asymptotic worst-case optimality) Consider Setting 2.
Let $\hat{V}_k$ and $V_k^*$ be the empirical and population solutions, respectively, to maxRCS, norm-maxRCS, minPCA, norm-minPCA, maxRegret, or norm-maxRegret. Let $\mathcal{L}$ denote the corresponding loss (so $\mathcal{L} \in \{-\mathcal{L}_{\mathrm{var}}, -\mathcal{L}_{\mathrm{normVar}}, \mathcal{L}_{\mathrm{RCS}}, \mathcal{L}_{\mathrm{normRCS}}, \mathcal{L}_{\mathrm{reg}}, \mathcal{L}_{\mathrm{normReg}}\}$, as in Theorems 6 and 7) and let $\mathcal{Q}$ denote the associated distributional class ($\mathcal{P}$ for the unnormalized losses, $\mathcal{P}_{\mathrm{norm}}$ for the normalized losses). Under Assumption 1, as $n_{\min} := \min_{e\in\mathcal{E}} n_e \to \infty$, the following two statements hold.

i) The worst-case population loss of $\hat{V}_k$ converges in probability to the optimal value, $\sup_{P\in\mathcal{Q}} \mathcal{L}(\hat{V}_k; P) \overset{p}{\to} \sup_{P\in\mathcal{Q}} \mathcal{L}(V_k^*; P)$.

ii) For every $\epsilon > 0$, $\lim_{n_{\min}\to\infty} \Pr\big(\sup_{P\in\mathcal{Q}} \mathcal{L}(\hat{V}_k; P) > \hat{m}_k + \epsilon\big) = 0$, where $\hat{m}_k := \max_{e\in\mathcal{E}} \mathcal{L}(\hat{V}_k; \hat{\Sigma}_e)$ is the empirical loss. (In Equations (2) and (3) we define the losses $\mathcal{L}$ as functions of the distribution $P$. Each loss depends on $P$ only through its covariance matrix $\Sigma$, so, by slight abuse of notation, we can write the losses in terms of $\Sigma$; for example, for all $V \in \mathbb{V}_{p\times k}$, $\mathcal{L}_{\mathrm{RCS}}(V; P) = \mathcal{L}_{\mathrm{RCS}}(V; \Sigma) := \mathrm{Tr}(\Sigma) - \mathrm{Tr}(V^\top \Sigma V)$.)

Here, i) does not follow directly from finite-sample robustness (see Remark 8), because the distribution class $\mathcal{Q}$ is formed from the population, not the empirical, covariance matrices.

4 Extension: worst-case matrix completion

When data are only partially observed, low-rank approximation extends naturally to matrix completion. In the single-domain setting, a common formulation (e.g., Jain et al., 2013) approximates $X \in \mathbb{R}^{n\times p}$ by a rank-$k$ factorization $X \approx L R^\top$, where $L \in \mathbb{R}^{n\times k}$ and $R \in \mathbb{R}^{p\times k}$ (with $k \ll p$), by minimizing reconstruction error over the observed entries. A classic example is movie recommendation (e.g., Harper and Konstan, 2015), where rows correspond to users and columns to movies.
Only a subset of ratings is observed, and the goal is to predict the missing entries under a low-rank assumption. The factorization can be interpreted as modeling latent features (e.g., genre or style) that explain preferences: the left factor represents user-specific weights on these features, while the right factor encodes the weights of the movies. As before, the data may arise from multiple domains. In the movie example, this occurs when users belong to heterogeneous subpopulations (e.g., regions or age groups) with differing rating patterns. We now extend the worst-case framework developed for PCA to matrix completion. In particular, we introduce a worst-case objective for multi-domain matrix completion and formalize inductive matrix completion in target domains. We then show the following worst-case optimality guarantee: when the source domains are fully observed but the target domains incur missingness, subspaces that are worst-case optimal for reconstruction error remain $\epsilon$-worst-case optimal over the convex hull of the source covariances.

Setting 3 (Multiple domains with missingness) Consider Setting 1. For each domain $e \in \mathcal{E}$, the entries of $x_e \in \mathbb{R}^p$ are observed according to a mask $\omega_e \in \{0,1\}^p$, where $\omega_e$ is drawn independently of $x_e$ from a distribution $Q$: $(\omega_e)_i = 1$ if the $i$-th entry of $x_e$ is observed and $(\omega_e)_i = 0$ otherwise.

Setting 4 (Finite-sample version) Consider Setting 3. For each $e \in \mathcal{E}$, let $X_e \in \mathbb{R}^{n_e\times p}$ be a data matrix of $n_e$ i.i.d. rows from $P_e$, and let $\Omega_e \in \{0,1\}^{n_e\times p}$ contain $n_e$ masks drawn i.i.d. from $Q$; only entries with $(\Omega_e)_{ij} = 1$ are observed. Let $n := \sum_{e\in\mathcal{E}} n_e$, and let $X_{\mathrm{pool}} \in \mathbb{R}^{n\times p}$ and $\Omega_{\mathrm{pool}} \in \{0,1\}^{n\times p}$ denote the row-wise concatenations of the domain-specific matrices.
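Masked factorization problems of this kind are commonly solved by alternating minimization. As a point of reference, here is a minimal single-matrix alternating least squares sketch (an illustrative baseline under assumed details; the paper's poolMC/maxMC implementation is described in Appendix B.2): each step solves row-wise and column-wise least-squares problems over the observed entries only.

```python
import numpy as np

def altmin_complete(X, Omega, k, iters=30, seed=0):
    """Alternating least squares for single-matrix completion: alternately
    fit left factors L (one row at a time) and right factors R (one column
    at a time), using only the observed entries indicated by Omega."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    R = np.linalg.qr(rng.normal(size=(p, k)))[0]
    L = np.zeros((n, k))
    for _ in range(iters):
        for i in range(n):                       # update row (user) factors
            o = Omega[i] == 1
            L[i], *_ = np.linalg.lstsq(R[o], X[i, o], rcond=None)
        for j in range(p):                       # update column (item) factors
            o = Omega[:, j] == 1
            R[j], *_ = np.linalg.lstsq(L[o], X[o, j], rcond=None)
    return L, R

# Recover a rank-2 matrix from roughly 70% of its entries.
rng = np.random.default_rng(4)
n, p, k = 60, 30, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
Omega = (rng.random((n, p)) < 0.7).astype(int)
L, R = altmin_complete(X, Omega, k)
err = np.linalg.norm(X - L @ R.T) / np.linalg.norm(X)
print(err)  # small relative error on all entries, including unobserved ones
```

In this benign noiseless, well-sampled regime, alternating minimization typically recovers the matrix; the worst-case variant maxMC additionally reweights updates toward the worst domain.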
In the multi-domain setting, a natural baseline is to ignore domain labels and solve matrix completion on the pooled data $(X_{\mathrm{pool}}, \Omega_{\mathrm{pool}})$, minimizing average reconstruction error across domains; we term this solving poolMC. Similarly as for wcPCA, we can also optimize the worst-case reconstruction error.

Definition 11 (maxMC) Consider Setting 4. A (shared) right factor $R^* \in \mathbb{R}^{p\times k}$ and (domain-specific) left factors $L_e^* \in \mathbb{R}^{n_e\times k}$ for $e \in \mathcal{E}$ solve maxMC if
$$(R^*, L_1^*, \ldots, L_E^*) \in \arg\min_{\substack{R \in \mathbb{R}^{p\times k} \\ \forall e\in\mathcal{E},\, L_e \in \mathbb{R}^{n_e\times k}}} \; \max_{e\in\mathcal{E}} \frac{1}{n_e}\big\|P_{\Omega_e}(X_e) - P_{\Omega_e}(L_e R^\top)\big\|_F^2, \tag{6}$$
where, for all $e \in \mathcal{E}$, $P_{\Omega_e}: \mathbb{R}^{n_e\times p} \to \mathbb{R}^{n_e\times p}$ is defined as $[P_{\Omega_e}(X_e)]_{ij} = [X_e]_{ij}$ if $[\Omega_e]_{ij} = 1$ and $[P_{\Omega_e}(X_e)]_{ij} = 0$ otherwise.

This objective learns a shared right factor that minimizes worst-case reconstruction error across domains. In the single-domain setting, it coincides with the standard matrix completion objective. Additionally, in the absence of missingness in the source domains, the shared right factor that solves maxMC coincides with the solution to empirical maxRCS, since $P_{\Omega_e}$ reduces to the identity and, for all $e \in \mathcal{E}$, the optimal choice of $L_e$ is $L_e = X_e R$.

4.1 Inductive matrix completion

After learning a shared right factor $R^*$ on the source domains, we can use it to reconstruct new, partially observed rows in a target domain (in the movie recommendation example, this could correspond to new users in a new domain), a problem that is sometimes referred to as inductive matrix completion. We can, for example, reconstruct a row $x \in \mathbb{R}^{1\times p}$ observed on coordinates $\omega$ by estimating coefficients $\ell \in \mathbb{R}^k$ from the observed entries and forming $\hat{x} = \ell (R^*)^\top$.
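This reconstruction rule can be sketched directly (an illustrative implementation under the stated setup): regress the observed entries of $x$ on the corresponding rows of the learned right factor, then reconstruct all coordinates.

```python
import numpy as np

def ols_reconstruct(x, omega, R):
    """Inductive completion of a partially observed row x: fit coefficients
    ell by least squares on the observed coordinates, then reconstruct the
    full row as ell @ R.T (including the unobserved coordinates)."""
    obs = omega == 1
    ell, *_ = np.linalg.lstsq(R[obs], x[obs], rcond=None)
    return ell @ R.T

# If x lies exactly in the column span of R and enough entries are observed,
# the reconstruction is exact, including the unobserved entries.
rng = np.random.default_rng(7)
p, k = 12, 3
R = np.linalg.qr(rng.normal(size=(p, k)))[0]
x = R @ rng.normal(size=k)                      # a noiseless low-rank row
omega = np.ones(p, dtype=int); omega[:4] = 0    # first four entries missing
x_hat = ols_reconstruct(x, omega, R)
print(np.max(np.abs(x_hat - x)))  # ~0 up to numerical precision
```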
In this paper, we focus on the ordinary least squares estimator,
$$\ell_{\mathrm{OLS}} = \ell_{\mathrm{OLS}}(x, \omega, R^*) \in \arg\min_{\ell\in\mathbb{R}^k} \sum_{i\in[p]:\,\omega_i = 1} \big(x_i - [\ell (R^*)^\top]_i\big)^2, \tag{7}$$
an approach that is commonly used (e.g., Koren et al., 2009); alternative methods such as incorporating side information (e.g., Jain and Dhillon, 2013) could also be considered.

4.2 Robustness guarantees

We consider the setting in which the source domains are fully observed and missingness arises only in the target domain. In this setting, we prove that a subspace that is worst-case optimal for reconstruction remains $\epsilon$-worst-case optimal for inductive matrix completion over the convex hull of the source covariances. We impose incoherence of the right factor, a standard assumption in matrix completion, and require a minimum number of observed entries; later, this allows us to control the reconstruction error under missingness.

Definition 12 ($\mu$-incoherence) Let $R \in \mathbb{V}_{p\times k}$. We say that $R$ is $\mu$-incoherent if, for all $i \in [p]$, $\|R^{(i)}\|_2 \le \sqrt{\mu k / p}$, where $R^{(i)}$ denotes the $i$-th row of $R$.

Assumption 2 (Sufficient observations) Let $\epsilon \in (0,1)$ and $\mu \ge 1$. For $\omega \sim Q$, the number of unobserved entries $s$ satisfies $s \le \frac{p\epsilon}{k\mu^2(2\epsilon+1)}$ almost surely.

As in Theorem 6, let $\mathcal{C}$ be the convex hull of the domain covariance matrices and let $\mathcal{P}$ denote the set of distributions with covariance in $\mathcal{C}$. Additionally, for all distributions $P, Q$ over $x$ and $\omega$, and for all $R \in \mathbb{V}_{p\times k}$, let $\mathcal{L}_{P,Q}(R) := \mathbb{E}_{x\sim P,\,\omega\sim Q}\|x - \ell_{\mathrm{OLS}}(x, \omega, R) R^\top\|_2^2$ be the expected reconstruction error of $x \sim P$ under sampling distribution $Q$.

Theorem 13 (Worst-case robustness for inductive completion) Consider Setting 3 and let $\epsilon \in (0,1)$. Assume the source domains are fully observed, i.e., the source mask distribution $Q_{\mathrm{source}}$ satisfies $Q_{\mathrm{source}}(\omega = \mathbf{1}) = 1$.
Let $R^* \in \mathbb{V}_{p\times k}$ solve the population worst-case matrix completion problem, defined as
$$R^* \in \arg\min_{R\in\mathbb{V}_{p\times k}} \max_{e\in\mathcal{E}} \mathcal{L}_{P_e, Q_{\mathrm{source}}}(R). \tag{8}$$
($R^*$ thus also solves maxRCS.) Let $Q_{\mathrm{target}}$ be the distribution of the masks in the target domains. If $R^*$ is $\mu$-incoherent and Assumption 2 holds for $Q_{\mathrm{target}}$ with this $\mu$, then $R^*$ is $\epsilon$-worst-case optimal for inductive matrix completion over $\mathcal{P}$, that is,
$$\sup_{P\in\mathcal{P}} \mathcal{L}_{P, Q_{\mathrm{target}}}(R^*) \le (1+\epsilon)\, \min_{R\in\mathbb{V}_{p\times k}} \sup_{P\in\mathcal{P}} \mathcal{L}_{P, Q_{\mathrm{target}}}(R).$$
Thus, a subspace that is worst-case optimal for low-rank approximation remains (approximately) worst-case optimal for inductive matrix completion over all distributions in $\mathcal{P}$. Theorem 13 analyzes the regime in which the source domains are fully observed (so that maxMC reduces to maxRCS). Our simulations indicate that, even when the source domains are only partially observed, maxMC improves worst-case reconstruction error relative to poolMC (see Section 5.2). Furthermore, the bound on the number of unobserved entries imposed by Assumption 2 is restrictive (e.g., for $(k,\epsilon,\mu) = (2, 0.1, 1)$, the bound yields $s \le 0.0417\,p$, i.e., approximately $4\%$ missing entries). Nevertheless, empirically, robustness degrades only mildly as missingness increases, and maxMC continues to outperform pooling well beyond the guaranteed regime (see Section 5.2). We thus believe that stronger versions of Theorem 13 may hold.

5 Numerical experiments

We investigate the worst-case estimators through synthetic simulations. In each experiment, we consider five source domains, obtained by sampling five population covariance matrices $(\Sigma_1, \ldots, \Sigma_5)$ in $\mathbb{R}^{p\times p}$ ($p = 20$ unless otherwise stated) of rank $10$ from a measure $\mathbb{Q}_{\alpha,\beta}$ that depends on parameters $\alpha, \beta$ (the role and standard choices of the parameters are explained in Appendix B.1); in each experiment we resample the covariance matrices.
Under the measures $\mathbb{Q}_{\alpha,\beta}$, the sampled covariances consist of a rank-$5$ shared component and a rank-$5$ domain-specific component with shared eigenvalues across domains, and are trace-normalized. These properties ensure that the solutions to minPCA, maxRCS, maxRegret, and their normalized variants all coincide. We therefore report only the results of maxRCS and evaluate performance using reconstruction error. The only exception is Section 5.1.4, where we modify the data-generating process to study the effect of heterogeneous noise. Appendix B.1 provides more details on the data-generating process. The wcPCA variants are solved using projected gradient descent (Appendix B.2 provides further details, including a comparison of our implementation to the approximations of Wang et al. (2025) and Tantipongpipat et al. (2019)), and poolMC and maxMC are solved with alternating minimization (see also Appendix B.2). Code to reproduce all simulations and figures is publicly available at https://github.com/anyafries/wcPCA.

5.1 Worst-case low-rank approximation

5.1.1 Convex-hull robustness in the population case

We illustrate Theorem 6(i), which states that, for maxRCS, the maximum reconstruction error on the source domains provides a minimal upper bound for the reconstruction error of all covariances in their convex hull. To do so, we sample five source covariances $\mathcal{C}_{\mathrm{source}} := (\Sigma_1, \ldots, \Sigma_5) \sim \mathbb{Q}_{\alpha,\beta}$ and compute the rank-$5$ population solutions to poolPCA and maxRCS, $V^{\mathrm{pool}}$ and $V^{\mathrm{maxRCS}}$. We sample 50 target covariances, collected in the set $\mathcal{C}_{\mathrm{target}}$, from the convex hull of the source covariances (as described in Appendix B.1). For every $\Sigma \in \mathcal{C}_{\mathrm{target}} \cup \mathcal{C}_{\mathrm{source}}$, we compute the reconstruction errors using poolPCA and maxRCS, that is, $\mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{pool}}; \Sigma)$ and $\mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{maxRCS}}; \Sigma)$.
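Before turning to the results, the projected gradient descent mentioned above can be sketched in miniature (an illustrative version under assumed details, not the exact algorithm of Appendix B.2): take a gradient step on the currently worst domain's reconstruction error and retract onto the set of orthonormal matrices via a polar decomposition.

```python
import numpy as np

def max_rcs_pgd(Sigmas, k, steps=500, lr=0.05, seed=0):
    """Minimal projected (sub)gradient sketch for empirical maxRCS:
    minimize over orthonormal V the worst-case reconstruction error
    Tr(S) - Tr(V' S V), projecting back onto the Stiefel manifold."""
    p = Sigmas[0].shape[0]
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.normal(size=(p, k)))
    for _ in range(steps):
        losses = [np.trace(S) - np.trace(V.T @ S @ V) for S in Sigmas]
        S_worst = Sigmas[int(np.argmax(losses))]
        V = V + lr * 2.0 * (S_worst @ V)      # descent step on the active domain
        U, _, Wt = np.linalg.svd(V, full_matrices=False)
        V = U @ Wt                            # polar retraction: nearest orthonormal matrix
    return V

# Single-domain sanity check: maxRCS reduces to PCA on that covariance.
rng = np.random.default_rng(3)
A = rng.normal(size=(8, 8)); Sigma = A @ A.T
V = max_rcs_pgd([Sigma], k=2)
w, U = np.linalg.eigh(Sigma)
top2 = np.trace(U[:, -2:].T @ Sigma @ U[:, -2:])
print(np.trace(V.T @ Sigma @ V), top2)  # close to each other
```

With a single domain, each iteration is a step of subspace iteration with $(I + 2\,\mathrm{lr}\,\Sigma)$, so the sketch converges to the top-$k$ eigenspace, as standard PCA would.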
Figure 3 shows that all test errors of maxRCS lie below the bound $m^*_{\mathrm{maxRCS}} := \max_{1\le e\le 5} \mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{maxRCS}}; \Sigma_e)$, as guaranteed by Theorem 6(i). In contrast, poolPCA exceeds this bound in several cases.

Figure 3: Illustration of Theorem 6(i); see Section 5.1.1. Reconstruction errors for the source domains and 50 target domains are shown for poolPCA and maxRCS in a population setting. The blue line marks $m^*_{\mathrm{maxRCS}}$, the maximum reconstruction error over the source domains. As expected from Theorem 6(i), all maxRCS errors lie below this bound, whereas poolPCA exceeds it in several cases.

Figure 4: Average vs. worst-case reconstruction error of maxRCS and poolPCA; see Section 5.1.2. The differences in errors are shown relative to poolPCA's average performance. Values below zero indicate that maxRCS outperforms poolPCA. maxRCS improves worst-case performance while incurring only small losses on average.

5.1.2 Average versus worst-case performance

We next study how much average performance maxRCS loses in exchange for improved worst-case robustness compared to poolPCA. We vary the heterogeneity between domains by varying the parameters $\alpha, \beta$ in the distribution of the source covariances $\mathbb{Q}_{\alpha,\beta}$ (see Appendix B.1 for more details); we consider $(\alpha,\beta) \in \{(0.1, 0.5), (0.5, 1), (1, 2), (2, 5)\}$. Higher values of $(\alpha,\beta)$ increase the contribution of domain-specific components, creating greater variation between domains. For each $(\alpha,\beta)$, we repeat the following steps 25 times. We sample $(\Sigma_1, \ldots, \Sigma_5) \sim \mathbb{Q}_{\alpha,\beta}$, compute rank-$5$ solutions $V^{\mathrm{pool}}$ and $V^{\mathrm{maxRCS}}$, and report the difference in average reconstruction error (over the source domains) and worst-case reconstruction error (over the convex hull of the source covariances), relative to the average performance of poolPCA (for details, see Equations (10) and (11) in Appendix B.3). The results are shown in Figure 4.
Solutions to maxRCS consistently improve worst-case error while incurring only small losses in average error, across varying domain heterogeneity. These findings complement Theorem 6(i): while the theorem guarantees that there exist settings where maxRCS strictly outperforms poolPCA in the worst case, the simulations suggest that such improvements occur generically.

5.1.3 Convergence and finite-sample robustness

Proposition 10(i) states that the worst-case population loss of the empirical maxRCS solution converges in probability to the optimal worst-case population loss. We illustrate this convergence empirically and additionally assess the finite-sample robustness of maxRCS relative to pooled PCA. To do so, we vary the sample size $n \in \{100, 250, 500, 1000, 2000, 5000\}$ and the heterogeneity $(\alpha,\beta)$ as in Section 5.1.2. For each $(n, \alpha, \beta)$, we repeat the following steps 25 times. We sample source covariances $\mathcal{C}_{\mathrm{source}} := (\Sigma_1, \ldots, \Sigma_5) \sim \mathbb{Q}_{\alpha,\beta}$ and, for all domains $1 \le e \le 5$, we draw $n$ observations from $N(0, \Sigma_e)$ and compute the rank-$5$ empirical and population solutions of poolPCA and maxRCS, denoted $\hat{V}^{\mathrm{pool}}$, $\hat{V}^{\mathrm{maxRCS}}$, $V^{\mathrm{pool}}$, and $V^{\mathrm{maxRCS}}$, respectively. We assess convergence by reporting the difference in worst-case population reconstruction error (evaluated over the convex hull $\mathrm{conv}(\mathcal{C}_{\mathrm{source}})$) between empirical and population maxRCS (for details, see Equation (12) in Appendix B.3). Values near zero indicate that the empirical estimator attains nearly optimal worst-case performance over the convex hull of the population matrices. Figure 5 (left) illustrates behavior in line with Proposition 10(i), with the distribution of the difference increasingly concentrated around zero as $n$ increases.

Figure 5: Finite-sample performance of maxRCS; see Section 5.1.3.
Left: Difference in worst-case loss (reconstruction error) over the convex hull of the population covariance matrices between empirical and population maxRCS. Right: Analogous difference between empirical maxRCS and empirical poolPCA. Negative values indicate that maxRCS attains lower worst-case error than poolPCA. Across sample sizes, maxRCS outperforms poolPCA.

In the same way, we also compare empirical maxRCS and empirical poolPCA (for details, see Equation (13) in Appendix B.3). Figure 5 (right) shows that maxRCS outperforms poolPCA across sample sizes, suggesting that the practical benefits of maxRCS may emerge even at modest sample sizes.

5.1.4 Using regret as a proxy for reconstruction error under heterogeneous noise

This experiment shows that, under heterogeneous noise across domains, worst-case regret may outperform worst-case reconstruction- or variance-based objectives, even when evaluated using reconstruction error, consistent with the discussion in Section 2.2.6. We again sample five source covariances $\Sigma_1, \ldots, \Sigma_5$, as described in Appendix B.1, using the variant in which the domain-specific eigenvalues differ across domains. Then, for all domains $e \in \mathcal{E}$, we draw a noise level $\sigma_e \sim U([0, 0.1])$ and sample $n = 2000$ i.i.d. training observations according to $z_e = x_e + \sigma_e \varepsilon_e$, $x_e \sim N(0, \Sigma_e)$, $\varepsilon_e \sim N(0, I_p)$; thus, the population covariance is $\Sigma_e + \sigma_e^2 I_p$. We solve rank-$10$ and rank-$5$ maxRCS and maxRegret (as the domain-specific eigenvalues vary between domains, the solutions to maxRCS and maxRegret no longer necessarily coincide; however, all covariances are trace-normalized, so the solutions to minPCA, maxRCS, and their normalized variants all coincide, and we report results only for maxRCS; likewise, the solutions to maxRegret and norm-maxRegret coincide, and we report results only for maxRegret)
using the empirical covariances, and evaluate both methods in terms of worst-case reconstruction error on $n$ independent observations from $N(0, \Sigma_e)$ without additional heterogeneous noise. The experiment is repeated 25 times, and the results are shown in Figure 6. When the approximation rank $k$ matches the rank of the covariances (rank $10$), the solutions of maxRegret and maxRCS coincide in the noiseless population case, and the population solution of maxRegret is insensitive to heterogeneous noise (see Section 2.2.6 for both statements). For finitely many observations, these equalities do not hold exactly. Nevertheless, Figure 6 (left) illustrates empirical behavior consistent with the population theory: in the noiseless case, maxRCS and maxRegret yield similar worst-case reconstruction errors, and under heterogeneous noise, maxRegret approximately recovers the solution obtained in the noiseless case, whereas maxRCS incurs increased worst-case reconstruction error.

Figure 6: Regret may be preferable under heterogeneous noise; see Section 5.1.4. Left: As suggested by the population theory, solutions to maxRCS and maxRegret are almost identical in the noiseless case. With heterogeneous noise, maxRegret behaves similarly as if the data were noiseless. Right: When $k$ is strictly smaller than the rank of the covariance matrices, solutions to maxRCS and maxRegret differ in the noiseless case. Nevertheless, under heterogeneous noise, maxRegret can still be preferable to maxRCS even if the evaluation criterion is reconstruction error.

When using a lower-rank approximation ($k = 5$), the population solution of maxRegret is still unaffected by heterogeneous noise (see Section 2.2.6), but the solutions of maxRegret and maxRCS no longer coincide (see, e.g., Example 1).
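The population-level insensitivity of the regret to isotropic noise can be checked directly: adding $\sigma^2 I_p$ leaves the top-$k$ eigenvectors unchanged and shifts both traces in the regret by $k\sigma^2$, so the regret is unaffected, while the reconstruction error grows by $(p-k)\sigma^2$. A minimal numerical check (with arbitrary illustrative values):

```python
import numpy as np

def regret(V, S, k):
    """Regret of subspace V on covariance S: gap in explained variance
    to the best rank-k subspace (top-k eigenvectors of S)."""
    w, U = np.linalg.eigh(S)
    Vstar = U[:, -k:]                  # eigenvectors of the k largest eigenvalues
    return np.trace(Vstar.T @ S @ Vstar) - np.trace(V.T @ S @ V)

rng = np.random.default_rng(1)
p, k = 6, 2
A = rng.normal(size=(p, p)); Sigma = A @ A.T
V, _ = np.linalg.qr(rng.normal(size=(p, k)))

sigma2 = 0.37                          # illustrative noise variance
noisy = Sigma + sigma2 * np.eye(p)
# The regret is identical with and without the isotropic noise term,
# whereas the reconstruction error differs by exactly (p - k) * sigma2.
print(regret(V, Sigma, k), regret(V, noisy, k))
```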
Figure 6 (right) shows similar empirical results: in the noiseless case, maxRCS and maxRegret differ, and under heterogeneous noise, maxRegret exhibits similar behavior while maxRCS incurs higher worst-case reconstruction error. These results indicate that regret-based objectives may be preferable under heterogeneous noise, even when the goal is robust reconstruction.

5.2 Worst-case inductive matrix completion

We study worst-case matrix completion in two regimes: (i) fully observed source domains (the setting of Theorem 13), and (ii) partially observed source domains. Our simulations show that maxMC generally reduces worst-case reconstruction error relative to poolMC, including when source domains are partially observed and missingness exceeds the theoretical bound. As in Section 5.1.2, we vary domain heterogeneity, which is controlled by $(\alpha,\beta)$. For each configuration, we repeat the following steps 25 times. We sample five rank-$10$ source covariances $(\Sigma_1, \ldots, \Sigma_5) \sim \mathbb{Q}_{\alpha,\beta}$ with dimension $p = 500$ and, for all domains $1 \le e \le 5$, we draw $n = 1000$ observations from $N(0, \Sigma_e)$. In regime (i), source data are fully observed; in regime (ii), 90% of entries in each source row are selected uniformly at random and masked. In each regime, we learn shared right factors via poolMC and maxMC. For evaluation, we draw 1000 independent test observations per domain and mask 90% of entries uniformly at random. Reconstruction is performed using the least-squares rule of Section 4.1. Performance is measured by per-entry mean squared reconstruction error, either averaged across domains or in the worst-case domain (see Appendix B.3 for details). Figure 7 shows the results of the more difficult regime (ii).

Figure 7: Worst-case inductive matrix completion with partially observed source domains (90% missing entries; regime (ii)). Left: Difference in average and worst-case test reconstruction error between maxMC and poolMC.
In line with Theorem 13 (which covers the population case and allows for some $\epsilon$ deviation), maxMC generally improves worst-case performance; in terms of average error, it incurs only minor changes. Right: Worst-case difference as the proportion of missing target entries varies, for different heterogeneity levels $(\alpha,\beta)$. maxMC achieves lower median worst-case error than poolMC, with larger gains under stronger domain heterogeneity.

The left panel shows that maxMC generally reduces worst-case target reconstruction error relative to pooling while incurring only small changes in average performance. The right panel varies the proportion of missing target entries and shows that, under high domain heterogeneity, the worst-case advantage of maxMC persists even when substantially more entries are missing than specified by Assumption 2. Under low heterogeneity, performance is comparable to poolMC, reflecting the reduced variation across domains. Results for the simpler regime (i) are qualitatively similar and are shown in Figure B.i in Appendix B.4.

6 Applications

6.1 Explaining variance in held-out FLUXNET regions

We illustrate the benefits of worst-case PCA objectives on FLUXNET data (Pastorello et al., 2020; Baldocchi, 2008). These data contain measurements of variables related to biosphere-atmosphere interactions obtained at different places on Earth (see Figure 8, left). More precisely, the variables comprise gross primary productivity (GPP), meteorological variables (air temperature, vapor pressure deficit, and day and night land surface temperature), and remote-sensing indices (EVI, NDVI, NDWI, LAI, NIRv, and fPAR). We group the data into TransCom regions (Gurney et al., 2004) as domains (see Figure 8, left), which span heterogeneous climates and measurement conditions. The data pre-processing is described in Appendix C.1. Based on scree plots and the Kaiser criterion (Appendix C.1, Figure C.i), we consider rank-$2$ approximations to the data.
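The evaluation metric used below, the worst-case proportion of explained variance across domains, can be written as a short hypothetical helper (not from the paper's codebase):

```python
import numpy as np

def worst_case_pev(V, Sigmas):
    """Worst-case proportion of explained variance of a subspace V across
    domains: min_e Tr(V' Sigma_e V) / Tr(Sigma_e)."""
    return min(float(np.trace(V.T @ S @ V) / np.trace(S)) for S in Sigmas)

# Tiny example with two diagonal 'domains' and the first coordinate axis.
S1 = np.diag([3.0, 1.0])
S2 = np.diag([1.0, 3.0])
V = np.array([[1.0], [0.0]])
print(worst_case_pev(V, [S1, S2]))  # 0.25: the axis explains only 1/4 of S2
```

The example shows why a direction that works well on average (0.75 on the first domain) can still have poor worst-case behavior, which is exactly the gap the worst-case objectives target.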
We perform 20 random splits of the regions into five source regions and eight target regions. On each split, we solve rank-$2$ poolPCA, maxRegret, norm-maxRegret, and norm-minPCA on the source regions. This setting extends the illustrative example in Section 1, where we compared poolPCA and norm-maxRegret on a single split of the TransCom regions. We use the normalized and regret-based variants because the total variance varies across domains (see Section 2.2.6). For comparison, we solve rank-$2$ PCA on the average empirical covariance $\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}} \hat{\Sigma}_e$, denoted avgcovPCA, which removes the effect of unequal sample sizes across regions; to the best of our knowledge, it does not come with any extrapolation guarantee. Figure 8 (right) reports the relative difference in worst-case proportion of explained variance between poolPCA and the other methods over the target regions (results for additional components are reported in Appendix C.1, Figure C.i).

Figure 8: Left: Map of FLUXNET sites, colored by TransCom region. Right: Relative difference in worst-case proportion of explained variance between poolPCA and maxRegret, norm-maxRegret, norm-maxRCS, and avgcovPCA, evaluated on held-out target regions over 20 random train-test splits. Positive values indicate improved worst-case performance relative to pooled PCA. The worst-case objectives generally improve robustness, with norm-maxRegret achieving the strongest gains in median worst-case proportion of explained variance.

All alternatives to poolPCA improve worst-case performance on the target regions, including PCA on the average covariance. However, the worst-case objectives typically perform best: norm-maxRegret achieves the strongest gains, improving worst-case explained variance in 16 out of 20 splits (the p-value of a binomial test equals ca.
0.01) and a median improvement of 0.07 in absolute worst-case proportion of explained variance. 6.2 Reanalysis of the three major axes of terrestrial ecosystem function We now revisit part of the analysis conducted by Migliavacca et al. (2021). The authors aggregate FLUXNET data over time to derive site-level functional ecosystem properties related to carbon uptake and release, energy and water exchange, and growing-season water-use efficiency (see Table 2 in Appendix C.2). Motivated, for example, by the task of predicting ecosystem responses to climatic changes, the authors sought a low-dimensional representation of ecosystem functioning, defining three axes of variation of terrestrial ecosystem function via PCA on the pooled dataset. Based on the loadings, these three axes are then interpreted to represent “maximum ecosystem productivity”, “ecosystem water-use strategies”, and “ecosystem carbon-use efficiency” (Migliavacca et al., 2021). Arguably, such interpretations should be consistent over different contintents. We thus evaluate the robustness of the three axes, when treating continents as distinct domains. We compare poolPCA (which corresponds777Here, we pre-process the data similarly to Section 6.1, which deviates slightly from the procedure used by Migliavacca et al. (2021). The differences are minor: for example, the loadings from poolPCA shown in Figure 9 are almost identical to those presented by Migliavacca et al. (2021). All details on pre-processing are provided in Appendix C.2. to the analysis by Migliavacca et al. (2021)) against - norm - maxRCS and - norm - maxRegret, which explicitly optimize for worst-case performance across domains. (We use the normalized variants because total variance differs across continents, see Section 2.2.6; for completeness, results for PCA based on the average empirical covariance and for maxRegret are reported in Appendix C.2.) Figure 9 compares the pooled and worst-case solutions. 
Figure 9: Comparison of dimensionality reduction methods on ecosystem function data across continents. Left: Proportion of variance explained on the pooled dataset. Centre: Worst-case proportion of variance explained across continents. Right: Loadings of the top three principal components for poolPCA and norm-maxRCS. norm-maxRCS improves the worst-case proportion of explained variance with negligible loss in average performance, while largely retaining the physical interpretability of the principal axes. For poolPCA, we observe a substantial gap between the average in-sample proportion of explained variance and the worst-case performance across continents. In contrast, norm-maxRCS achieves a nearly identical average proportion of explained variance while substantially improving worst-case performance. norm-maxRegret shows a similar, though smaller, improvement. We further compare the loadings of poolPCA and norm-maxRCS (Figure 9, right), providing a diagnostic for whether the identified axes reflect domain-invariant ecological structure (Appendix A.2 details the construction of the ordered worst-case basis). The leading axis remains largely stable, reflecting maximum ecosystem productivity and its coupling with ecosystem respiration. The second axis may still be interpreted as a water-use strategy gradient dominated by the same six metrics; however, norm-maxRCS reduces the influence of the water-use efficiency (WUE) loadings in favor of the remaining properties. The third axis, originally interpreted as carbon-use efficiency (aCUE, Rb, Rbmax, EFampl), shifts most prominently: EFampl is effectively removed while G1 and GSmax are upweighted, indicating a stronger contribution of water-regulation traits.
In our view, the stability of the first two axes under worst-case optimization strengthens the interpretation of these axes as fundamental ecological gradients; our results also suggest that the interpretation of the third axis may warrant further investigation. 7 Discussion This paper studies worst-case formulations of low-rank approximation across multiple domains. Considering the worst case instead of the average case can substantially improve performance in new target domains. The key theoretical guarantee is out-of-sample worst-case optimality over the convex hull of the source covariances (or of their normalized versions). Empirically, these objectives improve worst-case performance with only modest losses on average, as shown both on simulated data and in case studies using real-world FLUXNET data. The different worst-case objectives are not interchangeable. Normalized objectives mitigate sensitivity to scale differences by measuring proportional explained variance (or normalized reconstruction error), but unit comparability across domains is lost. Because worst-case objectives depend on ‘extreme’ domains, noisy or small domains can affect their solutions; the extent of this effect depends on the chosen objective. Normalization and the use of the regret can mitigate these effects to some extent. In particular, regret-based objectives evaluate performance relative to each domain's optimal subspace and are robust to heterogeneous noise. On real data, we found that regret performs best even when the objective is explained variance or reconstruction error. As in classical PCA, pre-processing choices (such as centering, scaling, and standardization) matter. These choices can change which domain is worst-case and therefore change the solution. An additional choice is whether to form the pooled covariance from pooled data or to average the domain covariances (these coincide when domain sample sizes are equal).
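The pooled-versus-averaged covariance distinction can be checked in a few lines of numpy (our own illustration, using uncentered second-moment matrices): with equal per-domain sample sizes the two constructions coincide exactly, and with unequal sizes they generally differ.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 100                                    # equal sample size per domain
domains = [rng.normal(size=(n, p)) @ np.diag([1.0, 2.0, 0.5, 1.5])
           for _ in range(3)]

# Per-domain (uncentered) second-moment matrices.
sigmas = [X.T @ X / len(X) for X in domains]

# Pooled covariance from the concatenated data.
pooled = np.vstack(domains)
sigma_pooled = pooled.T @ pooled / len(pooled)

# Average of the per-domain covariances.
sigma_avg = sum(sigmas) / len(sigmas)
assert np.allclose(sigma_pooled, sigma_avg)      # equal n_e: identical

# With unequal sample sizes, pooling and averaging generally differ.
uneq = [rng.normal(size=(m, p)) for m in (20, 200, 50)]
s_pool = np.vstack(uneq).T @ np.vstack(uneq) / sum(len(X) for X in uneq)
s_avg = sum(X.T @ X / len(X) for X in uneq) / len(uneq)
assert not np.allclose(s_pool, s_avg)
```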
While averaging can reduce the influence of large domains, as far as we know, it does not come with comparable out-of-sample guarantees. When data are only partially observed, low-rank approximation can be used to tackle matrix completion. In the fully observed training regime, our convex-hull robustness guarantee transfers to inductive completion (up to an approximation factor). When source domains are also partially observed, maxMC remains empirically effective, but our theory does not yet cover this setting; extending the out-of-sample guarantees to jointly handle domain heterogeneity and missingness in the source domains is left to future work. An additional direction is to allow domain-dependent observation mechanisms and to improve the bounds we establish. We regard the following additional extensions as particularly promising. First, worst-case objectives can be used for nonlinear representation learning (e.g., autoencoders) by replacing the reconstruction loss with an appropriate domain-wise worst-case criterion. Second, robustness to distribution shift can be combined with robustness to corruption and outliers by integrating worst-case domain objectives with robust (in the sense of robustness against outliers) PCA-type modeling. Third, it is natural to consider interpolations between the pooled and worst-case objectives that trade off average and worst-case performance, for instance via convex combinations or regularized formulations. Solutions to these objectives may provide smoother behavior when the worst-case domain is small or noisy, without discarding the robustness perspective.
Fourth, taking the diagnostics perspective (see Section 2.2.6), in settings where domains are ordered (e.g., through time), tracking domain-wise losses can be used to detect distribution shifts; for worst-case methods, a sharp degradation may additionally indicate that the target domain has moved outside the uncertainty set implied by the sources (e.g., beyond the convex-hull condition). Fifth, our guarantees formalize robustness of reconstruction criteria, not invariance or stability: for example, a direction shared across domains need not be selected if it explains little variance in each domain. Investigating notions closer to invariance seems a promising route with potential connections to causality. Acknowledgments and Disclosure of Funding We thank Jacob Nelson for facilitating access to the FLUXNET data and preparing it for our use, and Maybritt Schillinger for many helpful discussions. We also thank Francesco Freni for helpful discussions on optimization and Linus Kühne for ideas related to Section 5.1.4. We have used LLMs for debugging code and improving language formulations. References J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, et al. (2024) PyTorch 2: faster machine learning through dynamic python bytecode transformation and graph compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). P. Babu and P. Stoica (2023) Fair principal component analysis (PCA): minorization-maximization algorithms for fair PCA, fair robust PCA and fair sparse PCA. arXiv:2305.05963. D. Baldocchi (2008) ‘Breathing’ of the terrestrial biosphere: lessons learned from a global network of carbon dioxide flux measurement systems. Australian Journal of Botany 56 (1), p. 1–26. A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G.
Rennen (2013) Robust solutions of optimization problems affected by uncertain probabilities. Management Science 59 (2), p. 341–357. P. Bühlmann and N. Meinshausen (2016) Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE 104 (1), p. 126–135. E. J. Candès and Y. Plan (2010) Matrix completion with noise. Proceedings of the IEEE 98 (6), p. 925–936. E. Candès and B. Recht (2012) Exact matrix completion via convex optimization. Communications of the ACM 55 (6), p. 111–119. R. D. Cook and S. Weisberg (1982) Assessment of influence. In Residuals and Influence in Regression, p. 135–137. R. D. Cook (1977) Detection of influential observation in linear regression. Technometrics 19 (1), p. 15–18. J. C. Duchi and H. Namkoong (2019) Variance-based regularization with convex objectives. Journal of Machine Learning Research 20 (68), p. 1–55. B. N. Flury (1986) Asymptotic theory for common principal component analysis. The Annals of Statistics, p. 418–430. B. N. Flury (1987) Two generalizations of the common principal component model. Biometrika 74 (1), p. 59–69. F. Freni, A. Fries, L. Kühne, M. Reichstein, and J. Peters (2025) Maximum risk minimization with random forests. arXiv:2512.10445. Z. Guo (2024) Statistical inference for maximin effects: identifying stable associations across multiple studies. Journal of the American Statistical Association 119 (547), p. 1968–1984. K. R. Gurney, R. M. Law, A. S. Denning, P. J. Rayner, D. Baker, P. Bousquet, L. Bruhwiler, Y. Chen, P. Ciais, S. Fan, et al. (2002) Towards robust regional estimates of CO2 sources and sinks using atmospheric transport models. Nature 415 (6872), p. 626–630. K. R. Gurney, R. M. Law, A. S. Denning, P. J. Rayner, B. C. Pak, D. Baker, P. Bousquet, L. Bruhwiler, Y. Chen, P.
Ciais, et al. (2004) TransCom 3 inversion intercomparison: model mean results for the estimation of seasonal carbon sources and sinks. Global Biogeochemical Cycles 18 (1). A. Hannachi, K. Finke, and N. Trendafilov (2023) Common EOFs: a tool for multi-model comparison and evaluation. Climate Dynamics 60 (5), p. 1689–1703. F. M. Harper and J. A. Konstan (2015) The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems 5 (4). R. A. Horn and C. R. Johnson (1985) Hermitian and symmetric matrices. In Matrix Analysis, p. 167–256. W. Hu, G. Niu, I. Sato, and M. Sugiyama (2018) Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning (ICML), p. 2029–2037. P. Jain and I. S. Dhillon (2013) Provable inductive matrix completion. arXiv:1306.0626. P. Jain, P. Netrapalli, and S. Sanghavi (2013) Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, p. 665–674. I. T. Jolliffe (2002) Mathematical and statistical properties of population principal components. In Principal Component Analysis, p. 10–28. M. M. Kamani, F. Haddadpour, R. Forsati, and M. Mahdavi (2022) Efficient fair principal component analysis. Machine Learning, p. 1–32. L. Kassab, E. George, D. Needell, H. Geng, N. J. Nia, and A. Li (2024) Towards a fairer non-negative matrix factorization. arXiv:2411.09847. Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer 42 (8), p. 30–37. S. Liu, Y. Ge, S. Xu, Y. Zhang, and A. Marian (2022) Fairness-aware federated matrix factorization.
In Proceedings of the 16th ACM Conference on Recommender Systems, RecSys ’22, New York, NY, USA, p. 168–178. M. W. Mahoney (2011) Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning 3 (2), p. 123–224. N. Meinshausen and P. Bühlmann (2015) Maximin effects in inhomogeneous large-scale data. The Annals of Statistics 43 (4), p. 1801–1830. M. Migliavacca, T. Musavi, M. D. Mahecha, J. A. Nelson, J. Knauer, D. D. Baldocchi, O. Perez-Priego, R. Christiansen, J. Peters, et al. (2021) The three major axes of terrestrial ecosystem function. Nature 598 (7881), p. 468–472. G. Pastorello, C. Trotta, E. Canfora, H. Chu, D. Christianson, Y. Cheah, C. Poindexter, J. Chen, A. Elbashandy, M. Humphrey, et al. (2020) The FLUXNET2015 dataset and the ONEFlux processing pipeline for eddy covariance data. Scientific Data 7 (1), p. 225. P. Puchhammer, I. Wilms, and P. Filzmoser (2024) Sparse outlier-robust PCA for multi-source data. arXiv:2407.16299. S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang (2020) Distributionally robust neural networks for group shifts: on the importance of regularization for worst-case generalization. arXiv:1911.08731. S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala (2018) The price of fair PCA: one extra dimension. Advances in Neural Information Processing Systems 31. A. Sinha, H. Namkoong, R. Volpi, and J. C. Duchi (2017) Certifying some distributional robustness with principled adversarial training. arXiv:1710.10571. U. Tantipongpipat, S. Samadi, M. Singh, J. H. Morgenstern, and S. Vempala (2019) Multi-criteria dimensionality reduction with applications to fairness. Advances in Neural Information Processing Systems 32.
A. W. van der Vaart (1998) M- and Z-estimators. In Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, p. 41–84. Z. Wang, M. Liu, J. Lei, F. Bach, and Z. Guo (2025) StablePCA: learning shared representations across multiple sources via minimax optimization. arXiv:2505.00940. G. Zalcberg and A. Wiesel (2021) Fair principal component analysis and filter design. IEEE Transactions on Signal Processing 69, p. 4835–4842. Contents of the Appendix Appendix A Additional details on wcPCA A.1 Sequential versus joint optimization for minPCA This section provides a formal version of Proposition 4. In particular, we show that the sequential extension of PCA to the minPCA objective can fail to recover the solution of the corresponding joint optimization problem. Proposition 14 Consider Setting 1. Define V_k^{seq-minPCA} := [v_1, …, v_k] ∈ ℝ^{p×k} as k orthonormal vectors such that v_1 ∈ argmax_{v∈ℝ^p, ‖v‖=1} min_{e∈ℰ} ℒ_var(v; P_e) and, for 2 ≤ j ≤ k, v_j ∈ argmax_{v∈ℝ^p, ‖v‖=1, v⟂v_i ∀i<j} min_{e∈ℰ} ℒ_var([v_1, …, v_{j−1}, v]; P_e). Then, in general, V_k^{seq-minPCA} does not solve minPCA. Proposition 15 (Normalized case) The statement of Proposition 14 holds for the normalized objective, that is, when ℒ_var is replaced by ℒ_normVar. Proof The same counter-example can be used to prove both statements; we present it in terms of minPCA. Consider three domains ℰ = {1, 2, 3} with p = 5 and the following diagonal population covariances: Σ_1 := (1/4) diag(2, 2, 0, 1, 1), Σ_2 := (1/4) diag(2, 0, 2, 1, 1), Σ_3 := (1/4) diag(0, 2, 2, 1, 1).
We find a rank-2 solution of the sequential procedure and then show that it is not optimal for the joint objective. Step 1: We determine the first sequential component, i.e., v_1^* ∈ argmax_{v∈ℝ^5; ‖v‖=1} min_{e∈ℰ} v⊤Σ_e v. Let v_1^* = (a, b, c, d, f) ∈ ℝ^5 with ‖v_1^*‖ = 1. Then a² = b² = c². We show this by contradiction. Since a, b, and c appear symmetrically in the objective, i.e., min_{e∈ℰ} (v_1^*)⊤Σ_e v_1^* = (1/2) min(a²+b², b²+c², a²+c²) + (1/4)(d²+f²), we assume w.l.o.g. that a² ≤ b² ≤ c². Then, min_{e∈ℰ} (v_1^*)⊤Σ_e v_1^* = (1/2)(a²+b²) + (1/4)(d²+f²) =: m_1^*. We consider the two cases that lead to strict inequalities. First, if a² < b², let δ := (1/4)(b²−a²) > 0 and let ṽ := (√(a²+δ), √(b²−δ/2), √(c²−δ/2), d, f). Then, min_{e∈ℰ} ṽ⊤Σ_e ṽ = (1/2)(a²+b²) + (1/4)(d²+f²) + (1/4)δ > m_1^*. Second, if a² = b² and b² < c², let δ := (1/4)(c²−b²) and let ṽ := (√(b²+δ/2), √(b²+δ/2), √(c²−δ), d, f). Then, min_{e∈ℰ} ṽ⊤Σ_e ṽ = a² + (1/4)(d²+f²) + (1/2)δ > m_1^*. Hence, a² = b² = c². Then, ‖v_1^*‖ = 1 implies 3a² + d² + f² = 1. Hence, min_{e∈ℰ} (v_1^*)⊤Σ_e v_1^* = 0.25 + 0.25a². This is maximized for a² = 1/3, yielding v_1^* := (1/√3)(1, 1, 1, 0, 0). Step 2: We determine the second sequential component, i.e., v_2^* ∈ argmax_{v∈ℝ^5; ‖v‖=1; v⟂v_1^*} min_{e∈ℰ} {v⊤Σ_e v + (v_1^*)⊤Σ_e v_1^*} = argmax_{v∈ℝ^5; ‖v‖=1; v⟂v_1^*} min_{e∈ℰ} v⊤Σ_e v + 1/3. Let v_2^* = (a, b, c, d, f). Then v_2^* ⟂ v_1^* implies a = −(b+c), and ‖v_2^*‖ = 1 implies d² + f² = 1 − a² − b² − c².
Simplifying gives min_{e∈ℰ} {(v_2^*)⊤Σ_e v_2^* + (v_1^*)⊤Σ_e v_1^*} = 1/3 + 1/4 + (1/2) min{b²+bc, −bc, c²+bc}. For all b, c ∈ ℝ, min{b²+bc, −bc, c²+bc} ≤ 0, with equality if and only if bc = 0 or b = −c. Thus, the sequential solution has explained variance at most 1/3 + 1/4 = 7/12. Step 3: We show that the sequential solution is suboptimal for minPCA. Consider the rank-2 orthonormal matrix Ṽ := [ṽ_1, ṽ_2], where ṽ_1 := (1/√2)(1, 1, 0, 0, 0) and ṽ_2 := (1/√2)(0, 0, 1, 1, 0). For each domain, the explained variance equals 5/8, which is strictly larger than 7/12. Thus, the sequential solution does not solve minPCA. A.2 Ordering the basis of norm-maxRCS Sequential constructions are, in general, not optimal for the joint worst-case objective (see Proposition 14). Nevertheless, in some applications an ordered basis is required. We therefore describe a post-hoc procedure that imposes an ordering within the optimal worst-case subspace. We present the construction for norm-minPCA; the same idea can be applied to the other objectives. Consider Setting 1 and let V_k^* ∈ ℝ^{p×k} solve norm-minPCA. Starting with the full subspace S_1 := span(V_k^*), we iteratively identify and remove the direction whose removal maximizes the worst-case explained variance of the remaining subspace. For i ∈ {1, …, k−1}, let v_i ∈ argmax_{v∈S_i; ‖v‖=1} min_{e∈ℰ} (1/Tr(Σ_e)) Tr(U_{S_i∩v⟂}⊤ Σ_e U_{S_i∩v⟂}), where S_{i+1} := span(V_k^*) ∩ span({v_1, …, v_i})⟂ and, for any subspace S ⊂ ℝ^p, U_S denotes an orthonormal basis of S.
Reversing the resulting sequence (v_1, …, v_{k−1}) and appending the final remaining direction yields (v_1′, …, v_k′) such that, for any j ≤ k, the span of (v_1′, …, v_j′) maximizes worst-case explained variance among all j-dimensional subspaces contained in span(V_k^*). A.3 Comparison to Group DRO Group DRO (Sagawa et al., 2020; Hu et al., 2018) is a widely studied framework for robustness to distributional shifts across a finite set of domains. Our guarantees share a similar structure but apply to a broader class of distributions and rely on weaker assumptions. Remark 16 In Group DRO, the goal is to minimize the worst-case loss over a finite set of distributions {P_e}_{e∈ℰ}. Losses of the form ℒ(V; P) = E_{x∼P}[ℓ(V; x)] are linear with respect to convex combinations of distributions. The worst-case loss over all mixtures of the source distributions (i.e., the convex hull of the source distributions) is thus attained at one of the source distributions, that is, max_{e∈ℰ} ℒ(V; P_e) = sup_{P∈𝒫} ℒ(V; P), where 𝒫 := conv({P_1, …, P_E}). (9) Our results for maxRCS and minPCA apply in this setting and are strictly stronger, establishing worst-case optimality over a strictly larger uncertainty set (based on covariance matrices rather than distributions, see (4)). For norm-maxRCS, norm-minPCA, and norm-maxRegret, the loss is a ratio of expectations and thus not linear with respect to convex combinations of distributions; even a weaker form of the result does not follow directly from Group DRO. In this case, the standard Group DRO guarantees over the mixture of source distributions no longer hold, and our guarantees instead apply to an uncertainty set defined via constraints on the normalized covariance matrices.
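The counterexample in the proof of Proposition 14 (Appendix A.1) can be verified numerically. A short numpy check (our own illustration) confirms that the first sequential component attains worst-case explained variance 1/3, that the sequential bound 7/12 follows, and that the joint rank-2 basis Ṽ strictly exceeds it:

```python
import numpy as np

# The three diagonal covariances from the proof of Proposition 14.
sigmas = [np.diag(np.array(d, dtype=float)) / 4
          for d in ([2, 2, 0, 1, 1], [2, 0, 2, 1, 1], [0, 2, 2, 1, 1])]

def min_explained(V):
    """Worst-case (over domains) explained variance Tr(V^T Sigma_e V)."""
    return min(np.trace(V.T @ S @ V) for S in sigmas)

# First sequential component v_1^* = (1,1,1,0,0)/sqrt(3): value 1/3.
v1 = np.ones(5) / np.sqrt(3)
v1[3:] = 0.0
assert np.isclose(min_explained(v1[:, None]), 1 / 3)

# Joint rank-2 basis from Step 3 of the proof.
v_a = np.array([1, 1, 0, 0, 0]) / np.sqrt(2)
v_b = np.array([0, 0, 1, 1, 0]) / np.sqrt(2)
V_tilde = np.column_stack([v_a, v_b])

# 5/8 > 7/12: the sequential bound is strictly exceeded.
assert np.isclose(min_explained(V_tilde), 5 / 8)
assert 5 / 8 > 7 / 12
```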
A.4 Empirical objectives and finite-sample guarantees This appendix collects the finite-sample counterparts of the worst-case objectives introduced in the main text that are not presented in Section 3.2. It also states the finite-sample optimality guarantees of the unnormalized objectives, which mirror the population results (Theorem 6) with population covariances replaced by empirical covariances. Consider Setting 2. For V̂ ∈ O_{p×k}, we say that V̂ solves rank-k empirical norm-minPCA if V̂ ∈ argmax_{V∈O_{p×k}} min_{e∈ℰ} Tr(V⊤Σ̂_e V)/Tr(Σ̂_e), that V̂ solves rank-k empirical norm-maxRCS if V̂ ∈ argmin_{V∈O_{p×k}} max_{e∈ℰ} ‖X_e − X_e V V⊤‖_F²/‖X_e‖_F², that V̂ solves rank-k empirical maxRegret if V̂ ∈ argmin_{V∈O_{p×k}} max_{e∈ℰ} {‖X_e − X_e V V⊤‖_F² − min_{W∈O_{p×k}} ‖X_e − X_e W W⊤‖_F²}, and that V̂ solves rank-k empirical norm-maxRegret if V̂ ∈ argmin_{V∈O_{p×k}} max_{e∈ℰ} {‖X_e − X_e V V⊤‖_F²/‖X_e‖_F² − min_{W∈O_{p×k}} ‖X_e − X_e W W⊤‖_F²/‖X_e‖_F²}. Proposition 17 (Empirical robustness of wcPCA) Consider Setting 2 and let Ĉ := conv({Σ̂_e}_{e∈ℰ}) be the convex hull of the empirical source covariances. Let V̂_k ∈ O_{p×k} solve empirical rank-k maxRCS, minPCA, or maxRegret. Let ℒ denote the corresponding loss function, i.e., ℒ = ℒ_RCS for maxRCS, ℒ = −ℒ_var for minPCA, and ℒ = ℒ_reg for maxRegret. Then, the following statements hold. i) For all V_k ∈ O_{p×k}, the worst-case loss over Ĉ equals the worst-case loss over the source domains, i.e., sup_{Σ∈Ĉ} ℒ(V_k; Σ) = max_{e∈ℰ} ℒ(V_k; Σ̂_e).
ii) Consequently, the ordering induced by the worst-case loss over the source domains is preserved over Ĉ: for all V_k, W_k ∈ O_{p×k}, if max_{e∈ℰ} ℒ(V_k; Σ̂_e) < max_{e∈ℰ} ℒ(W_k; Σ̂_e), then sup_{Σ∈Ĉ} ℒ(V_k; Σ) < sup_{Σ∈Ĉ} ℒ(W_k; Σ). iii) Hence, V̂_k is worst-case optimal over Ĉ, i.e., V̂_k ∈ argmin_{V∈O_{p×k}} sup_{Σ∈Ĉ} ℒ(V; Σ). iv) Statement i) does not generally hold for the solutions to empirical poolPCA and sepPCA. Appendix B Additional experimental details and results B.1 Data generating process for source and target covariances In each experiment, we consider E source domains, obtained by sampling E population covariance matrices (Σ_1, …, Σ_E) in ℝ^{p×p} of rank 10 from a measure ℚ_{α,β} that depends on parameters α, β (explained below). Unless otherwise specified, E = 5 and p = 20. Under the measures ℚ_{α,β}, the sampled covariances consist of a rank-5 shared component and a rank-5 domain-specific component, and each of them is trace-normalized. For the simulations in Sections 5.1.1–5.1.3 and Appendix B.2, the domain-specific components have an equal set of eigenvalues across domains, and for Section 5.1.4, these sets of eigenvalues vary across domains. We now describe the generation of the source covariances in more detail. Source covariances. To sample E source covariances, denoted (Σ_1, …, Σ_E) ∼ ℚ_{α,β}, we proceed as follows. 1. Sample the eigenvalues and eigenvectors of the shared component as λ_1, …, λ_5 ∼iid U([0.1, 1]) and Q ∼ ℙ_O, where ℙ_O denotes the Haar distribution over orthonormal matrices Q ∈ ℝ^{p×p} with Q⊤Q = QQ⊤ = I_p. Let V ∈ ℝ^{p×5} denote the first five columns of Q, forming an orthonormal basis of the shared rank-5 subspace, and let λ := (λ_1, …, λ_5) ∈ ℝ^5. 2.
Sample the eigenvalues of the domain-specific component as γ_1, …, γ_5 ∼iid U([α, β]) (unless otherwise specified, α = 0.1 and β = 1). Let γ := (γ_1, …, γ_5) ∈ ℝ^5 (we use the same γ for all source domains). 3. Sample the (domain-specific) eigenvectors of the domain-specific component as follows. For 1 ≤ e ≤ E, sample a Gaussian matrix G_e ∈ ℝ^{p×5} with i.i.d. standard normal entries. Let P_⟂ := I − VV⊤ be the projector onto the orthogonal complement of span(V). We define V_e ∈ ℝ^{p×5} as the Q factor in the QR decomposition of P_⟂G_e. Then the columns of V_e form an orthonormal basis of a rank-5 subspace orthogonal to V. 4. For all 1 ≤ e ≤ E, define Σ_e := (1/∑_{i=1}^5 (λ_i+γ_i)) (V diag(λ) V⊤ + V_e diag(γ) V_e⊤). Modifications for Section 5.1.4. In Section 5.1.4, we modify the procedure as follows. In step 2, instead of sampling shared domain-specific eigenvalues, we sample different eigenvalues for each domain. That is, for all domains 1 ≤ e ≤ E, we sample γ_1^e, …, γ_5^e ∼iid U([α, β]) and let γ_e := (γ_1^e, …, γ_5^e) ∈ ℝ^5. In step 4, we then replace the eigenvalues with the domain-specific ones, that is, for all 1 ≤ e ≤ E and 1 ≤ i ≤ 5, we replace γ_i with γ_i^e and γ with γ_e. Target covariances. The set of target covariances, 𝒞_target^t, consists of t covariances sampled i.i.d. from the convex hull of the source covariances, 𝒞 := conv({Σ_e}_{e=1}^E). Each target covariance Σ is obtained by sampling w uniformly from the E-dimensional simplex and setting Σ = ∑_{e=1}^E w_e Σ_e. The superscript t is omitted when the number of target covariances is clear from context. B.2 Optimization algorithms B.2.1 wcPCA The optimization problems of the wcPCA variants are non-convex due to the orthogonality constraint set O_{p×k}.
In principle, any optimization scheme designed for orthogonality constraints may be employed. Several relaxations have been proposed for some of the variants. Tantipongpipat et al. (2019) (FairPCA) provide a semidefinite programming (SDP) relaxation and a multiplicative weights (MW) algorithm for the minPCA and maxRegret objectives. Wang et al. (2025) (StablePCA) provide an optimization scheme for minPCA, maxRCS, and maxRegret with convergence guarantees. Neither of them considers the other objectives. Other approaches, such as extra-gradient methods for saddle-point problems, may also apply in this setting but have not yet been explored for these objectives. Our focus in this paper lies on the generalization guarantees rather than the optimization. For the implementation, we therefore employ a straightforward projected gradient descent (PGD) scheme implemented in PyTorch (Ansel et al., 2024), with updates computed via the Adam optimizer. Because the objectives involve pointwise maxima or minima, gradients are propagated only through the active component at each step. Concretely, with current estimate V_t, one iteration updates ∇ℒ(V_t) ← autograd(ℒ, V_t), V′ ← AdamStep(V_t, ∇ℒ(V_t)), V_{t+1} ← Π_{O_{p×k}}(V′), where Π_{O_{p×k}} denotes projection onto O_{p×k}, implemented via an SVD-based orthogonalization. To mitigate poor local minima, we run five random restarts and retain the best solution. To assess the effectiveness of our implementation, we compare the proposed PGD scheme with StablePCA and FairPCA. We consider three configurations for the number of features p and domains E: (p, E) ∈ {(10, 5), (10, 50), (50, 5)}, corresponding, respectively, to few features–few domains, few features–many domains, and many features–few domains. For each configuration, we repeat the following steps 25 times. We generate domain-specific covariance matrices as described in Appendix B.1.
For k = 1, …, min(p−1, 30), we solve the rank-k population minPCA problem using our PGD implementation, StablePCA (Wang et al., 2025), and the SDP and MW variants of Tantipongpipat et al. (2019). When the solution matrix in the output of the FairPCA optimization is not of rank k, we consider only the top k eigenvectors. Figure B.i: Comparison of projected gradient descent (PGD; used in our experiments) with the optimization schemes of Wang et al. (2025) (StablePCA) and Tantipongpipat et al. (2019) (MW and SDP) for solving minPCA. Results are shown for (p, E) ∈ {(10, 5), (10, 50), (50, 5)}, where p is the number of features and E is the number of domains. Top row: median difference in minimum explained variance relative to that of PGD (negative values indicate that PGD explains more variance in the worst case) as a function of the number of components k. Bottom row: runtime in seconds for each algorithm. Shaded regions represent the 25%–75% quantiles. PGD attains similar objective values to both baselines, with acceptable runtime. Figure B.i reports the median value of (minvar_other − minvar_PGD)/|minvar_PGD|, where minvar_other and minvar_PGD are, respectively, the minimum explained variance of the other optimization scheme and of PGD, and the median running time. Across all configurations, the objective values obtained by PGD closely match those of StablePCA, SDP, and MW, and for smaller problems, PGD explains more variance. For smaller problems, the SDP and MW optimization schemes are faster, whereas for higher-dimensional settings only MW is faster than PGD. As computational cost is not a limiting factor for our analysis, we use the PGD implementation throughout the paper.
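To make the PGD scheme concrete, the following is a minimal numpy sketch for the minPCA objective (the paper's implementation uses PyTorch autograd with Adam; here we use plain gradient ascent through the active domain with a decaying step size, and the same SVD-based projection onto O_{p×k}; function names are ours):

```python
import numpy as np

def project_stiefel(V):
    """Project onto matrices with orthonormal columns via SVD-based
    orthogonalization (the polar factor of V)."""
    U, _, Wt = np.linalg.svd(V, full_matrices=False)
    return U @ Wt

def pgd_minpca(sigmas, k, steps=500, lr=0.05, seed=0):
    """Maximize min_e Tr(V^T Sigma_e V) over orthonormal V (p x k).
    The gradient flows only through the currently active (worst) domain."""
    rng = np.random.default_rng(seed)
    p = sigmas[0].shape[0]
    V = project_stiefel(rng.normal(size=(p, k)))
    for t in range(steps):
        # Active component: the domain attaining the minimum.
        e = int(np.argmin([np.trace(V.T @ S @ V) for S in sigmas]))
        step = lr / np.sqrt(t + 1)              # decaying step size
        V = project_stiefel(V + step * 2 * sigmas[e] @ V)  # ascent + projection
    return V

# Example: the three covariances from the proof of Proposition 14.
sigmas = [np.diag(np.array(d, dtype=float)) / 4
          for d in ([2, 2, 0, 1, 1], [2, 0, 2, 1, 1], [0, 2, 2, 1, 1])]
V = pgd_minpca(sigmas, k=2)
worst = min(np.trace(V.T @ S @ V) for S in sigmas)
```

As in the paper's implementation, random restarts can be added on top of this loop to mitigate poor local optima.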
For completeness, we also consider solutions to population maxRegret for (p, E) = (10, 5) (shown in Figure B.i), where PGD attains objective values on par with the other optimization schemes, with runtimes that are acceptable for the purpose of our paper. Here, the other optimization schemes occasionally return solutions with considerably worse (i.e., higher) objective values. Figure B.i: Comparison of PGD, StablePCA, MW, and SDP for solving maxRegret for (p, E) = (10, 5). PGD achieves comparable (or better) maximum regret values (positive values indicate that PGD performs better, i.e., has a lower worst-case regret). B.2.2 Matrix completion The matrix completion objectives poolMC and maxMC are non-convex in ((L_e)_{e∈ℰ}, R) due to the bilinear terms L_e R⊤ (see (6)). We therefore employ a standard alternating-minimization scheme, iteratively updating R given (L_e)_{e∈ℰ} and then (L_e)_{e∈ℰ} given R. This scheme is similar to what Kassab et al. (2024) propose for a regret-based variant of fair matrix completion and to what is commonly done in the pooled case (e.g., Jain et al., 2013), even though in the pooled case there is only a single left factor. Our implementation is adapted from Aaron Berk’s implementation of alternating minimization for single-domain matrix completion (https://github.com/asberk/matrix-completion-whirlwind). More precisely, given current left factors (L_e)_{e∈ℰ}, we update the shared right factor R by solving a convex optimization problem. For maxMC, this step minimizes the maximum (across domains) of the normalized reconstruction error over the observed entries; for poolMC, it minimizes the pooled reconstruction error across all observed entries. Both problems are formulated and solved as second-order cone programs in cvxpy. Given R, the left-factor updates decouple across domains and rows.
Thus, for each domain $e$ and row $i$, we estimate $\ell_{e,i}$ via the inductive least-squares reconstruction rule of Section 4.1, restricted to the entries observed in that row. The alternating updates are run for at most 100 iterations and terminated early when the improvement in the objective falls below $10^{-4}$. The left factors $(L_e)_{e \in \mathcal{E}}$ are initialized using an SVD-based heuristic, obtained by filling a sparse matrix with the observed entries.

B.3 Evaluation metrics

This appendix defines the evaluation metrics used in Sections 5.1.2, 5.1.3, and 5.2. Let $\mathcal{E} := \{1, \dots, 5\}$, let $\mathcal{C}_{\mathrm{source}} := \{\Sigma_1, \dots, \Sigma_5\}$ denote the set of source covariances, and let $\mathcal{C} := \mathrm{conv}(\mathcal{C}_{\mathrm{source}})$ denote their convex hull. Recall that for a covariance matrix $\Sigma$ and all $V \in \mathcal{O}_{p \times k}$, the reconstruction loss is $\mathcal{L}_{\mathrm{RCS}}(V; \Sigma) = \operatorname{Tr}(\Sigma) - \operatorname{Tr}(V^\top \Sigma V)$.

Average and worst-case performance (see Section 5.1.2). Let $\bar\Sigma := \frac{1}{|\mathcal{E}|} \sum_{e \in \mathcal{E}} \Sigma_e$ be the average covariance, and let $V^{\mathrm{maxRCS}}$ and $V^{\mathrm{pool}}$ solve maxRCS and poolPCA, respectively. The difference in average reconstruction error (over the source domains), relative to the average performance of poolPCA, is

$\Delta_{\mathrm{average}} := \frac{\mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{maxRCS}}; \bar\Sigma) - \mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{pool}}; \bar\Sigma)}{\mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{pool}}; \bar\Sigma)}. \quad (10)$

The difference in worst-case reconstruction error (over the convex hull of the source covariances), relative to the average performance of poolPCA, is

$\Delta_{\mathrm{worst\text{-}case}} := \frac{\sup_{\Sigma \in \mathcal{C}} \mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{maxRCS}}; \Sigma) - \sup_{\Sigma \in \mathcal{C}} \mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{pool}}; \Sigma)}{\mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{pool}}; \bar\Sigma)}. \quad (11)$

Since $\mathcal{L}_{\mathrm{RCS}}$ is linear in $\Sigma$, the supremum over $\mathcal{C}$ is attained at its extreme points; in practice, it is computed as a maximum over the source covariances. The quantities (10) and (11) correspond to the values plotted in Figure 4.

Convergence and finite-sample robustness (see Section 5.1.3).
To assess convergence of the empirical estimator $\hat V^{\mathrm{maxRCS}}$ to the population solution $V^{\mathrm{maxRCS}}$, we compute the difference in worst-case population reconstruction error (evaluated over the convex hull $\mathrm{conv}(\mathcal{C}_{\mathrm{source}})$):

$\sup_{\Sigma \in \mathcal{C}} \mathcal{L}_{\mathrm{RCS}}(\hat V^{\mathrm{maxRCS}}; \Sigma) - \sup_{\Sigma \in \mathcal{C}} \mathcal{L}_{\mathrm{RCS}}(V^{\mathrm{maxRCS}}; \Sigma). \quad (12)$

This metric corresponds to the quantity plotted in Figure 5 (left). To compare the finite-sample robustness of empirical maxRCS and empirical poolPCA, we additionally report

$\sup_{\Sigma \in \mathcal{C}} \mathcal{L}_{\mathrm{RCS}}(\hat V^{\mathrm{maxRCS}}; \Sigma) - \sup_{\Sigma \in \mathcal{C}} \mathcal{L}_{\mathrm{RCS}}(\hat V^{\mathrm{pool}}; \Sigma). \quad (13)$

This metric corresponds to Figure 5 (right). Negative values indicate that the empirical maxRCS estimator achieves lower worst-case population loss than pooled PCA. Again, in practice, the supremum is computed as a maximum over the source covariances.

Matrix completion (see Section 5.2). For the matrix completion experiments, for all $e \in \mathcal{E}$, let $X_e \in \mathbb{R}^{n_{\mathrm{test}} \times p}$ denote the matrix of test observations from domain $e$, and for all $R \in \mathcal{O}_{p \times k}$, let $\hat X_e(R)$ denote their reconstructions obtained using the inductive least-squares rule of Section 4.1. For each domain $e \in \mathcal{E}$, we define the per-entry mean squared reconstruction error as $\mathcal{L}_e(R) := \frac{1}{n_{\mathrm{test}}\, p} \| X_e - \hat X_e(R) \|_F^2$. Let $R^{\mathrm{maxMC}}$ and $R^{\mathrm{poolMC}}$ solve maxMC and poolMC, respectively. We consider the differences in average and worst-case performance, that is,

$\Delta_{\mathrm{average}} := \frac{1}{|\mathcal{E}|} \sum_{e \in \mathcal{E}} \big( \mathcal{L}_e(R^{\mathrm{maxMC}}) - \mathcal{L}_e(R^{\mathrm{poolMC}}) \big),$
$\Delta_{\mathrm{worst\text{-}case}} := \max_{e \in \mathcal{E}} \mathcal{L}_e(R^{\mathrm{maxMC}}) - \max_{e \in \mathcal{E}} \mathcal{L}_e(R^{\mathrm{poolMC}}).$

Negative values indicate that maxMC achieves lower error than poolMC. In Figure 7, we report $10^4 \cdot \Delta_{\mathrm{average}}$ and $10^4 \cdot \Delta_{\mathrm{worst\text{-}case}}$.
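Because $\mathcal{L}_{\mathrm{RCS}}$ is linear in $\Sigma$, the suprema in (11)-(13) reduce to maxima over the source covariances. A minimal sketch of this computation (function names are illustrative, not from our code):

```python
import numpy as np

def rcs_loss(V, Sigma):
    """Reconstruction loss L_RCS(V; Sigma) = Tr(Sigma) - Tr(V^T Sigma V)."""
    return np.trace(Sigma) - np.trace(V.T @ Sigma @ V)

def worst_case_rcs(V, source_covs):
    """sup over conv(source_covs) of the loss; since the loss is linear
    in Sigma, the sup is attained at an extreme point, i.e. it equals
    the maximum over the source covariances."""
    return max(rcs_loss(V, S) for S in source_covs)
```

As a sanity check, the loss at any convex combination of the source covariances never exceeds this worst-case value.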
B.4 Inductive matrix completion with fully observed sources

We consider the regime in which the source domains are fully observed, corresponding to the setting of Theorem 13, while target domains remain partially observed. In this case, maxMC reduces to maxRCS on the source data. Consistent with the theory, maxMC typically achieves lower worst-case target reconstruction error than poolMC (Figure B.i). For low domain heterogeneity, however, the empirical advantage is occasionally negligible or slightly reversed. This can be attributed to (i) the fact that Theorem 13 provides an approximate worst-case guarantee rather than a strict improvement over pooling, and (ii) finite-sample variability and non-optimality when solving the non-convex optimization problem, for which improved algorithms may yield tighter empirical performance. Overall, the results mirror those in Section 5.2: the worst-case advantage of maxMC is most pronounced under stronger domain heterogeneity and persists across varying levels of missingness.

Figure B.i: Worst-case inductive matrix completion with fully observed source domains; regime (i). Left: Difference in average and worst-case test reconstruction error between maxMC and poolMC. In line with Theorem 13 (which covers the population case and allows for some $\epsilon$ deviation), maxMC generally improves worst-case performance; in terms of average error, it incurs only minor changes. Right: Worst-case difference as the proportion of missing target entries varies, for different heterogeneity levels $(\alpha, \beta)$. maxMC achieves lower median worst-case error than poolMC, with larger gains under stronger domain heterogeneity.

Appendix C Additional application details

C.1 Details on the FLUXNET application (Section 6.1)

Data preprocessing. We use quality-controlled FLUXNET data (Pastorello et al., 2020), aggregated to daily means. We denote the set of thirteen TransCom regions by $\mathcal{E}$.
For each TransCom region $e \in \mathcal{E}$ with data matrix $X_e^{\mathrm{raw}} \in \mathbb{R}^{n_e \times p}$, we first remove the domain mean and denote the centered matrix by $\tilde X_e$. We then stack the centered data $\tilde X := [\tilde X_1; \dots; \tilde X_E]$ and compute pooled columnwise standard deviations on $\tilde X$. Using these, we standardise $\tilde X$ to obtain $X_{\mathrm{pool}}$, and for each $e \in \mathcal{E}$ we define $X_e$ as the corresponding block of rows of $X_{\mathrm{pool}}$ belonging to domain $e$. For all $e \in \mathcal{E}$, the empirical covariances are $\hat\Sigma_e = \frac{1}{n_e} X_e^\top X_e$, and the pooled covariance equals the weighted sum $\hat\Sigma_{\mathrm{pooled}} = \frac{1}{n} X_{\mathrm{pool}}^\top X_{\mathrm{pool}} = \sum_e w_e \hat\Sigma_e$, where $n = \sum_j n_j$ and $w_e = n_e / n$. The average covariance is given by $\frac{1}{|\mathcal{E}|} \sum_{e \in \mathcal{E}} \hat\Sigma_e$. Intuitively, this preprocessing removes between-region mean shifts and focuses the analysis on differences in covariance structure across regions.

Choosing the number of principal components. Based on scree plots of the source regions (Figure C.i), the explained variance levels off after the second component. We thus use rank-2 approximations in Section 6.1.

Figure C.i: Scree plots for each TransCom region (shown in different colours). In Section 6.1, we use rank-2 approximations.

Random splits of source and target regions. For 20 repetitions, the TransCom regions are randomly partitioned into five source regions and eight target regions. Figure C.i reports the relative difference in worst-case proportion of explained variance between poolPCA and the competing methods (maxRegret, norm-maxRegret, norm-maxRCS, avgcovPCA) over the repetitions, shown across ranks and on the source and target domains.

Figure C.i: Relative difference in worst-case proportion of explained variance between poolPCA and, from top to bottom, maxRegret, norm-maxRegret, norm-maxRCS, and avgcovPCA. Positive values indicate improved worst-case performance relative to poolPCA.
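The preprocessing can be sketched as follows (a simplification assuming complete data matrices; the function name is ours). By construction, the returned matrices satisfy $\hat\Sigma_{\mathrm{pooled}} = \sum_e w_e \hat\Sigma_e$ exactly.

```python
import numpy as np

def preprocess(domains):
    """Sketch of the preprocessing: center each domain separately,
    standardise with pooled columnwise standard deviations, and return
    the per-domain matrices and covariances.
    `domains` is a list of (n_e, p) arrays."""
    centered = [X - X.mean(axis=0, keepdims=True) for X in domains]
    pooled = np.vstack(centered)
    sd = pooled.std(axis=0)                    # pooled columnwise std
    Xs = [C / sd for C in centered]            # standardised blocks X_e
    covs = [X.T @ X / len(X) for X in Xs]      # hat Sigma_e
    return Xs, covs
```

The per-domain centering removes between-region mean shifts before any covariance is formed, which is exactly what focuses the downstream analysis on covariance structure.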
Results are shown for 20 random region splits, evaluated on the source and target regions across varying numbers of principal components. Apart from the rank-one case, the worst-case objectives exhibit improved robustness over, or behavior similar to, poolPCA.

C.2 Details on the reanalysis of Migliavacca et al. (2021) (Section 6.2)

Data description and preprocessing. We use the site-level ecosystem function dataset constructed by Migliavacca et al. (2021), based on FLUXNET observations (Baldocchi, 2008; Pastorello et al., 2020). Each site contributes a single vector of aggregated functional properties (described in Table 2). We define continents as domains. Preprocessing follows the procedure described in Appendix C.1, with continents replacing the TransCom regions.

Table 2: Ecosystem functional properties used in the reanalysis of Migliavacca et al. (2021). (Variable: description.)
GPP_sat: Maximum gross CO2 uptake at light saturation
NEP_max: Maximum net ecosystem productivity
ET_max: Maximum evapotranspiration
EF: Evaporative fraction
EF_ampl: EF amplitude
GS_max: Maximum dry canopy surface conductance
Rb: Mean basal ecosystem respiration
Rb_max: Maximum basal ecosystem respiration
aCUE: Apparent carbon-use efficiency (fraction of assimilated carbon retained)
uWUE: Underlying water-use efficiency
WUE_t: Transpiration-based water-use efficiency
G1: Ecosystem-scale stomatal slope parameter

Additional methods. As an additional baseline, we solve PCA on the unweighted average of the domain covariance matrices, $\frac{1}{|\mathcal{E}|} \sum_{e \in \mathcal{E}} \hat\Sigma_e$, denoted avgcovPCA. This estimator removes the implicit weighting of domains by sample size that is present in poolPCA. In addition, we fit maxRegret. Figure C.i reports the corresponding performance and shows that maxRegret performs almost identically to norm-maxRegret.
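avgcovPCA itself is straightforward; a minimal sketch (illustrative, not the paper's code):

```python
import numpy as np

def avgcov_pca(covs, k):
    """Sketch of avgcovPCA: top-k eigenvectors of the unweighted
    average of the domain covariance matrices."""
    avg = sum(covs) / len(covs)
    vals, vecs = np.linalg.eigh(avg)   # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]        # reorder to take the top k
```

Because the average is unweighted, a small domain influences the solution as much as a large one, in contrast to the sample-size weighting implicit in poolPCA.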
Using the average covariance instead of poolPCA substantially improves worst-case performance, but it remains slightly inferior to norm-maxRCS both in average explained variance and in the worst-performing continent. However, to the best of our knowledge, avgcovPCA does not come with any extrapolation guarantee.

Figure C.i: Comparison of dimensionality reduction methods on the ecosystem function dataset. Left: Proportion of variance explained on the pooled dataset. Right: Worst-case proportion of variance explained across continents. Methods that account for domain heterogeneity outperform poolPCA in the worst-case continent, with norm-maxRCS most closely matching pooled average performance.

Appendix D Proofs

D.1 Additional results

Several variations of the following statement are known, both for single-case deletion, that is, Cook's formula (Cook, 1977), and for subset removal (e.g., Section 3.6 in Cook and Weisberg, 1982). Bounding this perturbation via $\mu$-incoherence (Candès and Recht, 2012) is a standard technique in randomized linear algebra (Mahoney, 2011). We include the proof for completeness.

Proposition 18 (Stability of OLS under subset removal). Let $X \in \mathbb{R}^{n \times p}$ be a deterministic design matrix satisfying the orthogonality condition $X^\top X = I_p$ and the $\mu$-incoherence condition

$\| x^{(i)} \|_2 \le \mu \sqrt{p/n} \quad \text{for all } i \in \{1, \dots, n\}, \quad (14)$

where $x^{(i)}$ is the $i$-th row of $X$. Let $y \in \mathbb{R}^n$ be a response vector and let $\beta^*$ be the OLS estimator on the full dataset. Consider any subset $S \subseteq \{1, \dots, n\}$, define $s := |S|$, and let $X_{-S}$ (resp. $y_{-S}$) denote the submatrix (resp. subvector) of $X$ (resp. $y$) consisting of the rows (resp. entries) indexed by $[n] \setminus S$. Let $\beta_{-S}$ be the OLS estimator on the dataset without the entries in $S$, that is, $\beta_{-S} \in \arg\min_{b \in \mathbb{R}^p} \| y_{-S} - X_{-S} b \|_2^2$.
For all $\epsilon \in (0,1)$, if the subset size $s$ satisfies

$s \le \frac{n \epsilon}{p \mu^2 (2\epsilon + 1)}, \quad (15)$

then the relative increase in squared error is bounded by $\epsilon$, that is,

$\| y - X \beta_{-S} \|_2^2 \le (1 + \epsilon) \| y - X \beta^* \|_2^2.$

Proof. First, we show that $\beta_{-S}$ is unique (Step 1); then we bound the difference between the reconstruction errors of $\beta^*$ and $\beta_{-S}$ (Step 2).

Step 1: We show that $X_{-S}^\top X_{-S}$ is invertible, which implies that $\beta_{-S}$ is uniquely defined. Let $X_S$ (resp. $y_S$) denote the submatrix (resp. subvector) of $X$ (resp. $y$) consisting of the rows (resp. entries) indexed by $S$. The spectral norm (denoted by $\|\cdot\|_{\mathrm{op}}$) of $X_S$ is bounded as follows:

$\| X_S \|_{\mathrm{op}}^2 \le \operatorname{Tr}(X_S^\top X_S) = \sum_{i \in S} \| x^{(i)} \|_2^2 \le s \cdot \mu^2 \frac{p}{n} < 1, \quad (16)$

where the second inequality follows from the incoherence condition (14) and the third from the subset-size assumption (15). Hence the maximum eigenvalue satisfies $\lambda_{\max}(X_S^\top X_S) < 1$, so $I_p - X_S^\top X_S = X_{-S}^\top X_{-S}$ is invertible, and $\beta_{-S}$ is unique and given by

$\beta_{-S} = (X_{-S}^\top X_{-S})^{-1} X_{-S}^\top y_{-S} = (I - X_S^\top X_S)^{-1} (X^\top y - X_S^\top y_S). \quad (17)$

Step 2: We now bound the reconstruction error of $\beta_{-S}$:

$\| y - X \beta_{-S} \|_2^2 = \| y - X \beta^* + X \beta^* - X \beta_{-S} \|_2^2 = \| y - X \beta^* \|_2^2 + \| X \beta^* - X \beta_{-S} \|_2^2 = \| y - X \beta^* \|_2^2 + \| \beta^* - \beta_{-S} \|_2^2,$

where the second equality follows since $X^\top (y - X \beta^*) = 0$ (so the cross-term vanishes) and the last equality follows since $X$ has orthonormal columns.
It thus suffices to bound $\| \beta^* - \beta_{-S} \|_2^2$, which we do using (17) as follows (recall that $X^\top y = \beta^*$ since $X^\top X = I_p$):

$\| \beta^* - \beta_{-S} \|_2^2 = \| \beta^* - (I - X_S^\top X_S)^{-1} (X^\top y - X_S^\top y_S) \|_2^2$
$= \| (I - X_S^\top X_S)^{-1} (I - X_S^\top X_S) \beta^* - (I - X_S^\top X_S)^{-1} (\beta^* - X_S^\top y_S) \|_2^2$
$= \| (I - X_S^\top X_S)^{-1} X_S^\top (y_S - X_S \beta^*) \|_2^2$
$\le \| (I - X_S^\top X_S)^{-1} \|_{\mathrm{op}}^2 \, \| X_S^\top \|_{\mathrm{op}}^2 \, \| y_S - X_S \beta^* \|_2^2$
$\le \| (I - X_S^\top X_S)^{-1} \|_{\mathrm{op}}^2 \, \| X_S^\top \|_{\mathrm{op}}^2 \, \| y - X \beta^* \|_2^2.$

In addition, $\| X_S^\top \|_{\mathrm{op}}^2 = \| X_S \|_{\mathrm{op}}^2 \le s \mu^2 p / n =: \delta < 1$ by (16). Using the Neumann series bound (for all square matrices $A$ with $\| A \|_{\mathrm{op}} < 1$, $\| (I - A)^{-1} \|_{\mathrm{op}} \le 1 / (1 - \| A \|_{\mathrm{op}})$) and (16), we have

$\| (I - X_S^\top X_S)^{-1} \|_{\mathrm{op}} \le \frac{1}{1 - \| X_S^\top X_S \|_{\mathrm{op}}} = \frac{1}{1 - \| X_S \|_{\mathrm{op}}^2}.$

It follows that

$\frac{\| y - X \beta_{-S} \|_2^2}{\| y - X \beta^* \|_2^2} \le 1 + \| (I - X_S^\top X_S)^{-1} \|_{\mathrm{op}}^2 \, \| X_S^\top \|_{\mathrm{op}}^2 \le 1 + \frac{\delta}{(1 - \delta)^2}.$

We further have

$\delta \le \frac{\epsilon}{2\epsilon + 1} \le \frac{2\epsilon + 1 - \sqrt{4\epsilon + 1}}{2\epsilon},$

where the first inequality follows from (15) and the second from $\epsilon > 0$. This implies $\epsilon \delta^2 - (2\epsilon + 1) \delta + \epsilon \ge 0$ (by analyzing the roots of the left-hand side) and thus $\delta / (1 - \delta)^2 \le \epsilon$. This finishes the proof of Proposition 18.

D.2 Proof of Theorem 5

i) Effect of normalization. We prove the statement by counterexample. Consider two domains $\mathcal{E} := \{1, 2\}$ with population covariance matrices $\Sigma_1 := \mathrm{diag}(0.1, 0.9)$ and $\Sigma_2 := \mathrm{diag}(9, 1)$. $V^{\text{norm-minPCA}}$ solves rank-1 norm-minPCA if

$V^{\text{norm-minPCA}} \in \arg\max_{v=(a,b) \in \mathbb{R}^2,\ \|v\|=1} \min\big( 0.1 a^2 + 0.9 b^2,\ 0.9 a^2 + 0.1 b^2 \big).$

The solution equalizes the two terms, hence $V^{\text{norm-minPCA}} = \frac{1}{\sqrt{2}} (1, 1)^\top$ solves norm-minPCA.
$V^{\mathrm{minPCA}}$ solves rank-1 minPCA if

$V^{\mathrm{minPCA}} \in \arg\max_{v=(a,b) \in \mathbb{R}^2,\ \|v\|=1} \min\big( 0.1 a^2 + 0.9 b^2,\ 9 a^2 + 1 b^2 \big) = \arg\max_{v=(a,b) \in \mathbb{R}^2,\ \|v\|=1} \big( 0.1 a^2 + 0.9 b^2 \big).$

Hence $V^{\mathrm{minPCA}} = (0, 1)^\top$ solves minPCA. Then,

$\min_{e \in \{1,2\}} \mathcal{L}_{\mathrm{normVar}}(V^{\text{norm-minPCA}}; \Sigma_e) = 0.5 > 0.1 = \min_{e \in \{1,2\}} \mathcal{L}_{\mathrm{normVar}}(V^{\mathrm{minPCA}}; \Sigma_e)$

and

$\min_{e \in \{1,2\}} \mathcal{L}_{\mathrm{var}}(V^{\mathrm{minPCA}}; \Sigma_e) = 0.9 > 0.5 = \min_{e \in \{1,2\}} \mathcal{L}_{\mathrm{var}}(V^{\text{norm-minPCA}}; \Sigma_e).$

Hence, $V^{\mathrm{minPCA}}$ does not solve norm-minPCA and $V^{\text{norm-minPCA}}$ does not solve minPCA.

ii) Explained variance vs. reconstruction error. The first claim (that, in general, minPCA and maxRCS have different solution sets) follows from Example 1. We now show that the solutions of norm-minPCA are the same as those of norm-maxRCS. We begin by simplifying the objective for norm-maxRCS:

$\arg\min_{V \in \mathcal{O}_{p \times k}} \max_{e \in \mathcal{E}} \frac{\mathbb{E}[\| x_e - x_e V V^\top \|_2^2]}{\mathbb{E}[\| x_e \|_2^2]} = \arg\min_{V \in \mathcal{O}_{p \times k}} \max_{e \in \mathcal{E}} \frac{\operatorname{Tr}(\Sigma_e) - \operatorname{Tr}(V^\top \Sigma_e V)}{\operatorname{Tr}(\Sigma_e)}$
$= \arg\min_{V \in \mathcal{O}_{p \times k}} \Big( 1 - \min_{e \in \mathcal{E}} \frac{\operatorname{Tr}(V^\top \Sigma_e V)}{\operatorname{Tr}(\Sigma_e)} \Big) = \arg\max_{V \in \mathcal{O}_{p \times k}} \min_{e \in \mathcal{E}} \frac{\operatorname{Tr}(V^\top \Sigma_e V)}{\operatorname{Tr}(\Sigma_e)}.$

The final expression is the objective for norm-minPCA. Hence, the solution sets of norm-minPCA and norm-maxRCS are identical.

iii) Regret. Finally, we verify that the solutions to maxRegret are the same whether it is defined via reconstruction error or explained variance. For $V \in \mathcal{O}_{p \times k}$ and all domains $e \in \mathcal{E}$, recall that the explained variance and reconstruction error are, respectively,

$\mathcal{L}_{\mathrm{var}}(V; P_e) = \operatorname{Tr}(V^\top \Sigma_e V),$
$\mathcal{L}_{\mathrm{RCS}}(V; P_e) = \mathbb{E}[\| x_e - x_e V V^\top \|_2^2] = \operatorname{Tr}(\Sigma_e) - \operatorname{Tr}(V^\top \Sigma_e V) = \operatorname{Tr}(\Sigma_e) - \mathcal{L}_{\mathrm{var}}(V; P_e).$

Thus, for any $e \in \mathcal{E}$, the regret is

$\mathcal{L}_{\mathrm{reg}}(V; P_e) = \mathcal{L}_{\mathrm{RCS}}(V; P_e) - \min_{W \in \mathcal{O}_{p \times k}} \mathcal{L}_{\mathrm{RCS}}(W; P_e)$
$= \big( \operatorname{Tr}(\Sigma_e) - \mathcal{L}_{\mathrm{var}}(V; P_e) \big) - \big( \operatorname{Tr}(\Sigma_e) - \max_{W \in \mathcal{O}_{p \times k}} \mathcal{L}_{\mathrm{var}}(W; P_e) \big)$
$= \max_{W \in \mathcal{O}_{p \times k}} \mathcal{L}_{\mathrm{var}}(W; P_e) - \mathcal{L}_{\mathrm{var}}(V; P_e).$

Hence, the regret computed by minimizing reconstruction error is identical to that obtained by maximizing explained variance, and the two formulations therefore yield the same set of optimal solutions. The proof for the normalized case follows from $\mathcal{L}_{\mathrm{normRCS}}(V; P_e) = 1 - \mathcal{L}_{\mathrm{normVar}}(V; P_e)$.

D.3 Proof of Theorem 6

We write the proof for maxRCS. The guarantees for the variants norm-maxRCS, minPCA, and norm-minPCA hold with minor modifications to the proof and are thus omitted. The modification of Statement i) for maxRegret is given at the end of the proof, from which the subsequent statements and the guarantees for norm-maxRegret follow. Let $V_k^*$ solve maxRCS. Let $P \in \mathcal{P}$ be any distribution with zero mean and covariance $\Sigma \in \mathcal{C} = \mathrm{conv}(\{\Sigma_e\}_{e \in \mathcal{E}})$. Then there exist weights $(\mu_e)_{e \in \mathcal{E}}$ such that $\mu_e \ge 0$ for all $e \in \mathcal{E}$, $\sum_{e \in \mathcal{E}} \mu_e = 1$, and

$\Sigma = \sum_{e \in \mathcal{E}} \mu_e \Sigma_e. \quad (18)$

Proof of i). For all $P \in \mathcal{P}$ and all $V \in \mathcal{O}_{p \times k}$, the expected reconstruction error is

$\mathcal{L}_{\mathrm{RCS}}(V; P) = \operatorname{Tr}(\Sigma) - \operatorname{Tr}(V^\top \Sigma V) = \sum_{e \in \mathcal{E}} \mu_e \big( \operatorname{Tr}(\Sigma_e) - \operatorname{Tr}(V^\top \Sigma_e V) \big) \quad \text{(by (18))}$
$= \sum_{e \in \mathcal{E}} \mu_e \mathcal{L}_{\mathrm{RCS}}(V; P_e) \le \max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{RCS}}(V; P_e).$

Hence $\sup_{P \in \mathcal{P}} \mathcal{L}_{\mathrm{RCS}}(V; P) \le \max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{RCS}}(V; P_e)$. Further, because $\{P_e\}_{e \in \mathcal{E}} \subseteq \mathcal{P}$, it trivially holds that $\max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{RCS}}(V; P_e) \le \sup_{P \in \mathcal{P}} \mathcal{L}_{\mathrm{RCS}}(V; P)$. Thus, equality holds.

Proof of ii).
Statement i) implies that for all $V_k, W_k \in \mathcal{O}_{p \times k}$, if $\max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{RCS}}(V_k; P_e) < \max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{RCS}}(W_k; P_e)$, then $\sup_{P \in \mathcal{P}} \mathcal{L}_{\mathrm{RCS}}(V_k; P) < \sup_{P \in \mathcal{P}} \mathcal{L}_{\mathrm{RCS}}(W_k; P)$.

Proof of iii). Since $\{P_e\}_{e \in \mathcal{E}} \subseteq \mathcal{P}$, we have for all $V \in \mathcal{O}_{p \times k}$,

$\sup_{P \in \mathcal{P}} \mathcal{L}_{\mathrm{RCS}}(V; P) \ge \max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{RCS}}(V; P_e) \ge \max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{RCS}}(V_k^*; P_e) = \sup_{P \in \mathcal{P}} \mathcal{L}_{\mathrm{RCS}}(V_k^*; P),$

where the second inequality follows from the optimality of $V_k^*$ and the equality follows from Statement i).

Proof of iv). A counterexample showing that pooled and separate PCA solutions can fail to satisfy Statement iii) is provided in Example 1.

Modification for maxRegret. We show Statement i) for the regret. For all $P \in \mathcal{P}$ and all $V \in \mathcal{O}_{p \times k}$, the expected regret is

$\mathcal{L}_{\mathrm{reg}}(V; P) = \operatorname{Tr}(\Sigma) - \operatorname{Tr}(V^\top \Sigma V) - \min_{W \in \mathcal{O}_{p \times k}} \big\{ \operatorname{Tr}(\Sigma) - \operatorname{Tr}(W^\top \Sigma W) \big\}$
$= \sum_{e \in \mathcal{E}} \mu_e \big( \operatorname{Tr}(\Sigma_e) - \operatorname{Tr}(V^\top \Sigma_e V) \big) - \min_{W \in \mathcal{O}_{p \times k}} \Big\{ \sum_{e \in \mathcal{E}} \mu_e \big( \operatorname{Tr}(\Sigma_e) - \operatorname{Tr}(W^\top \Sigma_e W) \big) \Big\}$
$\le \sum_{e \in \mathcal{E}} \mu_e \big( \operatorname{Tr}(\Sigma_e) - \operatorname{Tr}(V^\top \Sigma_e V) \big) - \sum_{e \in \mathcal{E}} \mu_e \min_{W \in \mathcal{O}_{p \times k}} \big( \operatorname{Tr}(\Sigma_e) - \operatorname{Tr}(W^\top \Sigma_e W) \big)$
$= \sum_{e \in \mathcal{E}} \mu_e \mathcal{L}_{\mathrm{reg}}(V; P_e) \le \max_{e \in \mathcal{E}} \mathcal{L}_{\mathrm{reg}}(V; P_e).$

D.4 Proof of Proposition 9

Let $V_k^*$ and $\hat V_k$ denote the population and empirical solutions, respectively, to any of the considered problems (minPCA, norm-minPCA, maxRCS, norm-maxRCS, maxRegret, or norm-maxRegret), and let $\mathcal{L}$ denote the corresponding loss (so $\mathcal{L} \in \{ -\mathcal{L}_{\mathrm{var}},\ -\mathcal{L}_{\mathrm{normVar}},\ \mathcal{L}_{\mathrm{RCS}},\ \mathcal{L}_{\mathrm{normRCS}},\ \mathcal{L}_{\mathrm{reg}},\ \mathcal{L}_{\mathrm{normReg}} \}$, as in Theorems 6 and 7). For all $V \in \mathcal{O}_{p \times k}$ and $e \in \mathcal{E}$, let the population and empirical domain-specific objectives be

$M_e^{\mathcal{L}}(V) := -\mathcal{L}(V; \Sigma_e), \qquad \hat M_e^{\mathcal{L}}(V) := -\mathcal{L}(V; \hat\Sigma_e).$

Similarly, let the population and empirical worst-case objectives be

$M^{\mathcal{L}}(V) := \min_{e \in \mathcal{E}} -\mathcal{L}(V; \Sigma_e), \qquad \hat M^{\mathcal{L}}(V) := \min_{e \in \mathcal{E}} -\mathcal{L}(V; \hat\Sigma_e).$
All six population and empirical estimators can thus be written as $V_k^* \in \arg\max_{V \in \mathcal{O}_{p \times k}} M^{\mathcal{L}}(V)$ and $\hat V_k \in \arg\max_{V \in \mathcal{O}_{p \times k}} \hat M^{\mathcal{L}}(V)$. As $\mathcal{O}_{p \times k}$ is compact and $M^{\mathcal{L}}$ and $\hat M^{\mathcal{L}}$ are continuous with respect to the metric induced by the Frobenius norm, the maxima are attained. We now show that

(i) for all $e \in \mathcal{E}$, $\sup_{V \in \mathcal{O}_{p \times k}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | \to 0$ in probability as $n_e \to \infty$;
(ii) $\hat M^{\mathcal{L}}$ converges uniformly to $M^{\mathcal{L}}$;
(iii) the maximum of $M^{\mathcal{L}}$ is well-separated.

Then, by Theorem 5.7 in van der Vaart (1998), the estimators are consistent, that is, $d(\hat V_k, V_k^*) \to 0$ in probability as $n_{\min} \to \infty$.

Proof of (i): domain-specific uniform convergence. We first establish some general bounds. By the law of large numbers, for all $e \in \mathcal{E}$,

$\| \hat\Sigma_e - \Sigma_e \|_F \to 0 \text{ in probability as } n_e \to \infty. \quad (19)$

For all $V \in \mathcal{O}_{p \times k}$, by definition, $\| V V^\top \|_F = \sqrt{\operatorname{Tr}(V V^\top V V^\top)} = \sqrt{\operatorname{Tr}(I_k)} = \sqrt{k}$ and $\| I_p \|_F = \sqrt{p}$. Hence, by Cauchy-Schwarz, for all $V \in \mathcal{O}_{p \times k}$,

$| \operatorname{Tr}(V^\top [\hat\Sigma_e - \Sigma_e] V) | \le \sqrt{k} \, \| \hat\Sigma_e - \Sigma_e \|_F, \quad (20)$
$| \operatorname{Tr}(\hat\Sigma_e - \Sigma_e) | \le \| I_p \|_F \, \| \hat\Sigma_e - \Sigma_e \|_F = \sqrt{p} \, \| \hat\Sigma_e - \Sigma_e \|_F. \quad (21)$

We now apply these bounds to each specific objective.

Case 1: minPCA. By (20), for all $e \in \mathcal{E}$ and all $V \in \mathcal{O}_{p \times k}$,

$| \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | = | \operatorname{Tr}(V^\top [\hat\Sigma_e - \Sigma_e] V) | \le \sqrt{k} \, \| \hat\Sigma_e - \Sigma_e \|_F.$

Hence, for all $e \in \mathcal{E}$, $\sup_{V \in \mathcal{O}_{p \times k}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | \le \sqrt{k} \, \| \hat\Sigma_e - \Sigma_e \|_F \to 0$ in probability as $n_e \to \infty$ by (19).

Case 2: maxRCS. By (20) and (21), for all $e \in \mathcal{E}$ and all $V \in \mathcal{O}_{p \times k}$,

$| \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | \le | \operatorname{Tr}(V^\top [\hat\Sigma_e - \Sigma_e] V) | + | \operatorname{Tr}(\hat\Sigma_e - \Sigma_e) | \le (\sqrt{k} + \sqrt{p}) \, \| \hat\Sigma_e - \Sigma_e \|_F.$

Hence, for all $e \in \mathcal{E}$, $\sup_{V \in \mathcal{O}_{p \times k}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | \le (\sqrt{k} + \sqrt{p}) \, \| \hat\Sigma_e - \Sigma_e \|_F \to 0$ in probability as $n_e \to \infty$.

Case 3: norm-minPCA. Let $\tau := \min_{e \in \mathcal{E}} \operatorname{Tr}(\Sigma_e) > 0$.
Then, combining (20), (21), and the fact that $\operatorname{Tr}(V^\top \hat\Sigma_e V) \le \operatorname{Tr}(\hat\Sigma_e)$ (indeed, as $I - V V^\top$ and $\hat\Sigma_e$ are positive semi-definite, we have $\operatorname{Tr}(\hat\Sigma_e - \hat\Sigma_e V V^\top) \ge 0$), we have for all $e \in \mathcal{E}$ and all $V \in \mathcal{O}_{p \times k}$,

$| \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | = \Big| \frac{\operatorname{Tr}(V^\top \hat\Sigma_e V)}{\operatorname{Tr}(\hat\Sigma_e)} - \frac{\operatorname{Tr}(V^\top \Sigma_e V)}{\operatorname{Tr}(\Sigma_e)} \Big|$
$\le \frac{| \operatorname{Tr}(V^\top [\hat\Sigma_e - \Sigma_e] V) |}{\operatorname{Tr}(\Sigma_e)} + \operatorname{Tr}(V^\top \hat\Sigma_e V) \Big| \frac{1}{\operatorname{Tr}(\hat\Sigma_e)} - \frac{1}{\operatorname{Tr}(\Sigma_e)} \Big|$
$\le \frac{| \operatorname{Tr}(V^\top [\hat\Sigma_e - \Sigma_e] V) |}{\operatorname{Tr}(\Sigma_e)} + \operatorname{Tr}(V^\top \hat\Sigma_e V) \frac{| \operatorname{Tr}(\hat\Sigma_e - \Sigma_e) |}{\operatorname{Tr}(\hat\Sigma_e) \operatorname{Tr}(\Sigma_e)}$
$\le \frac{| \operatorname{Tr}(V^\top [\hat\Sigma_e - \Sigma_e] V) |}{\operatorname{Tr}(\Sigma_e)} + \operatorname{Tr}(\hat\Sigma_e) \frac{| \operatorname{Tr}(\hat\Sigma_e - \Sigma_e) |}{\operatorname{Tr}(\hat\Sigma_e) \operatorname{Tr}(\Sigma_e)}$
$\le \frac{\sqrt{k} + \sqrt{p}}{\tau} \, \| \hat\Sigma_e - \Sigma_e \|_F. \quad (22)$

Thus, $\sup_{V \in \mathcal{O}_{p \times k}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | \le \frac{\sqrt{k} + \sqrt{p}}{\tau} \, \| \hat\Sigma_e - \Sigma_e \|_F \to 0$ in probability as $n_e \to \infty$.

Case 4: norm-maxRCS. By Theorem 5 ii), the solutions of population norm-minPCA and population norm-maxRCS coincide; the same algebra shows that the solutions of the empirical problems also coincide. Thus, the consistency of norm-minPCA implies the consistency of norm-maxRCS.

Case 5: maxRegret. For all $e \in \mathcal{E}$ and all $V \in \mathcal{O}_{p \times k}$, consider

$| \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | = \Big| \operatorname{Tr}(V^\top \hat\Sigma_e V) - \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \hat\Sigma_e W) - \Big( \operatorname{Tr}(V^\top \Sigma_e V) - \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \Sigma_e W) \Big) \Big|$
$\le \underbrace{\big| \operatorname{Tr}(V^\top \hat\Sigma_e V) - \operatorname{Tr}(V^\top \Sigma_e V) \big|}_{=:A} + \underbrace{\Big| \sum_{i=1}^k \lambda_i(\hat\Sigma_e) - \sum_{i=1}^k \lambda_i(\Sigma_e) \Big|}_{=:B},$

where $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue. By (20), $A \le \sqrt{k} \, \| \hat\Sigma_e - \Sigma_e \|_F$, and by Weyl's inequality for symmetric matrices (e.g., Theorem 4.3.1 in Horn and Johnson, 1985),

$B \le k \, \| \hat\Sigma_e - \Sigma_e \|_{\mathrm{op}} \le k \, \| \hat\Sigma_e - \Sigma_e \|_F. \quad (23)$

Hence, $\sup_{V \in \mathcal{O}_{p \times k}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | \le (\sqrt{k} + k) \, \| \hat\Sigma_e - \Sigma_e \|_F \to 0$ in probability as $n_e \to \infty$.

Case 6: norm-maxRegret.
By the triangle inequality, for all $e \in \mathcal{E}$ and all $V \in \mathcal{O}_{p \times k}$, $| \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) |$ is bounded by

$\Big| \frac{\operatorname{Tr}(V^\top \hat\Sigma_e V)}{\operatorname{Tr}(\hat\Sigma_e)} - \frac{\operatorname{Tr}(V^\top \Sigma_e V)}{\operatorname{Tr}(\Sigma_e)} \Big| + \Big| \max_{W \in \mathcal{O}_{p \times k}} \frac{\operatorname{Tr}(W^\top \hat\Sigma_e W)}{\operatorname{Tr}(\hat\Sigma_e)} - \max_{W \in \mathcal{O}_{p \times k}} \frac{\operatorname{Tr}(W^\top \Sigma_e W)}{\operatorname{Tr}(\Sigma_e)} \Big|.$

The first term is bounded by $\frac{\sqrt{k} + \sqrt{p}}{\tau} \| \hat\Sigma_e - \Sigma_e \|_F$ exactly as in (22). Following a similar algebraic sequence as in (22), the second term is bounded by

$\frac{1}{\operatorname{Tr}(\Sigma_e)} \Big| \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \hat\Sigma_e W) - \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \Sigma_e W) \Big| + \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \hat\Sigma_e W) \Big| \frac{1}{\operatorname{Tr}(\hat\Sigma_e)} - \frac{1}{\operatorname{Tr}(\Sigma_e)} \Big|$
$\le \frac{k}{\tau} \, \| \hat\Sigma_e - \Sigma_e \|_{\mathrm{op}} + \operatorname{Tr}(\hat\Sigma_e) \Big| \frac{\operatorname{Tr}(\hat\Sigma_e - \Sigma_e)}{\operatorname{Tr}(\hat\Sigma_e) \operatorname{Tr}(\Sigma_e)} \Big| \le \frac{k + \sqrt{p}}{\tau} \, \| \hat\Sigma_e - \Sigma_e \|_F, \quad (24)$

where the first inequality follows from Weyl's inequality (as in Case 5), with $\tau$ still defined as $\min_{e \in \mathcal{E}} \operatorname{Tr}(\Sigma_e)$. Hence,

$\sup_{V \in \mathcal{O}_{p \times k}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | \le \Big( \frac{\sqrt{k} + k + 2\sqrt{p}}{\tau} \Big) \| \hat\Sigma_e - \Sigma_e \|_F \to 0$

in probability as $n_e \to \infty$.

Proof of (ii). We show that $\hat M^{\mathcal{L}}$ converges uniformly to $M^{\mathcal{L}}$ by bounding

$\sup_{V \in \mathcal{O}_{p \times k}} | \hat M^{\mathcal{L}}(V) - M^{\mathcal{L}}(V) | = \sup_{V \in \mathcal{O}_{p \times k}} \Big| \min_{e \in \mathcal{E}} \hat M_e^{\mathcal{L}}(V) - \min_{e \in \mathcal{E}} M_e^{\mathcal{L}}(V) \Big|$
$\le \sup_{V \in \mathcal{O}_{p \times k}} \max_{e \in \mathcal{E}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) | = \max_{e \in \mathcal{E}} \sup_{V \in \mathcal{O}_{p \times k}} | \hat M_e^{\mathcal{L}}(V) - M_e^{\mathcal{L}}(V) |.$

As $\mathcal{E}$ is finite, Statement (i) implies that $\sup_{V \in \mathcal{O}_{p \times k}} | \hat M^{\mathcal{L}}(V) - M^{\mathcal{L}}(V) | \to 0$ in probability as $n_{\min} \to \infty$.

Proof of (iii). Consider Assumption 1 and assume, for the sake of contradiction, that the maximum is not well-separated, i.e., that there exists $\epsilon > 0$ with

$\sup_{V \in \mathcal{O}_{p \times k};\ d(V, V_k^*) > \epsilon} M^{\mathcal{L}}(V) = M^{\mathcal{L}}(V_k^*).$

This implies that there exists a sequence $(V_n)_{n \in \mathbb{N}}$ in $\mathcal{O}_{p \times k}$ with $\lim_{n \to \infty} M^{\mathcal{L}}(V_n) = M^{\mathcal{L}}(V_k^*)$ and $d(V_n, V_k^*) > \epsilon$ for all $n \in \mathbb{N}$.
Since $\mathcal{O}_{p \times k}$ is compact under the standard metric induced by the Frobenius norm, the Bolzano-Weierstrass theorem guarantees the existence of a subsequence of $(V_n)_{n \in \mathbb{N}}$ converging (w.r.t. this standard metric) to some limit $\tilde V \in \mathcal{O}_{p \times k}$. Because $d$ is continuous with respect to this standard metric, taking the limit yields

$d(\tilde V, V_k^*) \ge \epsilon. \quad (25)$

Continuity of $M^{\mathcal{L}}$ then implies $M^{\mathcal{L}}(\tilde V) = M^{\mathcal{L}}(V_k^*)$. By Assumption 1, there exists $Q \in \mathbb{R}^{k \times k}$ with $Q^\top Q = Q Q^\top = I_k$ such that $\tilde V = V_k^* Q$. Hence $d(\tilde V, V_k^*) = 0$, contradicting (25). Thus, the maximum is well-separated.

D.5 Proof of Proposition 10

Fix any of the considered problems (minPCA, norm-minPCA, maxRCS, norm-maxRCS, maxRegret, or norm-maxRegret), let $\mathcal{L}$ denote the corresponding loss, and let $V_k^*$ and $\hat V_k$ denote population and empirical solutions, respectively.

Proof of i). By Theorems 6 and 7,

$\sup_{P \in \mathcal{P}} \mathcal{L}(\hat V_k; P) = \max_{e \in \mathcal{E}} \mathcal{L}(\hat V_k; \Sigma_e), \qquad \sup_{P \in \mathcal{P}} \mathcal{L}(V_k^*; P) = \max_{e \in \mathcal{E}} \mathcal{L}(V_k^*; \Sigma_e). \quad (26)$

For all $e \in \mathcal{E}$, the map $V \mapsto \mathcal{L}(V; \Sigma_e)$ is $c_e$-Lipschitz with respect to the projection distance $d(V, W) = \| V V^\top - W W^\top \|_F$, where

$c_e := \| \Sigma_e \|_F$ for minPCA, maxRCS, and maxRegret, and $c_e := \| \Sigma_e \|_F / \operatorname{Tr}(\Sigma_e)$ for norm-minPCA, norm-maxRCS, and norm-maxRegret.

To see this, consider minPCA. For all $e \in \mathcal{E}$ and all $V, W \in \mathcal{O}_{p \times k}$,

$| \mathcal{L}_{\mathrm{var}}(V; \Sigma_e) - \mathcal{L}_{\mathrm{var}}(W; \Sigma_e) | = | \operatorname{Tr}(V^\top \Sigma_e V) - \operatorname{Tr}(W^\top \Sigma_e W) | = | \operatorname{Tr}(\Sigma_e (V V^\top - W W^\top)) |$
$\le \| \Sigma_e \|_F \, \| V V^\top - W W^\top \|_F = \| \Sigma_e \|_F \, d(V, W),$

by Cauchy-Schwarz and the cyclic property of the trace. For maxRCS and maxRegret, the objectives differ from $\mathcal{L}_{\mathrm{var}}(V; \Sigma_e)$ only by additive terms independent of $V$, so the same bound applies. The normalized variants introduce only the multiplicative factor $1 / \operatorname{Tr}(\Sigma_e)$.
This concludes the Lipschitz argument. The maximum of Lipschitz functions is Lipschitz, in this case with constant $\max_{e \in \mathcal{E}} c_e$. Therefore, for all $V, W \in \mathcal{O}_{p \times k}$,

$\Big| \max_{e \in \mathcal{E}} \mathcal{L}(V; \Sigma_e) - \max_{e \in \mathcal{E}} \mathcal{L}(W; \Sigma_e) \Big| \le \Big( \max_{e \in \mathcal{E}} c_e \Big) \, d(V, W). \quad (27)$

Combining (26) with the Lipschitz bound (27) at $V = \hat V_k$, $W = V_k^*$ yields

$\Big| \sup_{P \in \mathcal{P}} \mathcal{L}(\hat V_k; P) - \sup_{P \in \mathcal{P}} \mathcal{L}(V_k^*; P) \Big| \le \Big( \max_{e \in \mathcal{E}} c_e \Big) \, d(\hat V_k, V_k^*).$

Since $d(\hat V_k, V_k^*) \to 0$ in probability as $n_{\min} \to \infty$ by Proposition 9, the right-hand side converges in probability to zero, proving Statement i).

Proof of ii). It suffices to show that

$\Big| \sup_{P \in \mathcal{P}} \mathcal{L}(\hat V_k; P) - \hat m_k \Big| \to 0 \text{ in probability as } n_{\min} \to \infty,$

where $\hat m_k = \max_{e \in \mathcal{E}} \mathcal{L}(\hat V_k; \hat\Sigma_e)$. By (26), we have

$\Big| \sup_{P \in \mathcal{P}} \mathcal{L}(\hat V_k; P) - \hat m_k \Big| = \Big| \max_{e \in \mathcal{E}} \mathcal{L}(\hat V_k; \Sigma_e) - \max_{e \in \mathcal{E}} \mathcal{L}(\hat V_k; \hat\Sigma_e) \Big| \le \max_{e \in \mathcal{E}} \big| \mathcal{L}(\hat V_k; \Sigma_e) - \mathcal{L}(\hat V_k; \hat\Sigma_e) \big|.$

We bound the final term for each of the objectives, using arguments analogous to those in the proof of Statement (i) of Proposition 9. For minPCA, by (20),

$\max_{e \in \mathcal{E}} \big| \mathcal{L}_{\mathrm{var}}(\hat V_k; \Sigma_e) - \mathcal{L}_{\mathrm{var}}(\hat V_k; \hat\Sigma_e) \big| = \max_{e \in \mathcal{E}} \big| \operatorname{Tr}(\hat V_k^\top \Sigma_e \hat V_k) - \operatorname{Tr}(\hat V_k^\top \hat\Sigma_e \hat V_k) \big| \le \sqrt{k} \, \max_{e \in \mathcal{E}} \| \Sigma_e - \hat\Sigma_e \|_F. \quad (28)$

Similarly, for maxRCS, by the cyclic property of the trace,

$\max_{e \in \mathcal{E}} \big| \mathcal{L}_{\mathrm{RCS}}(\hat V_k; \Sigma_e) - \mathcal{L}_{\mathrm{RCS}}(\hat V_k; \hat\Sigma_e) \big| = \max_{e \in \mathcal{E}} \big| \operatorname{Tr}\big( (\Sigma_e - \hat\Sigma_e)(I - \hat V_k \hat V_k^\top) \big) \big| \le \sqrt{p - k} \, \max_{e \in \mathcal{E}} \| \Sigma_e - \hat\Sigma_e \|_F,$

where $\| I - \hat V_k \hat V_k^\top \|_F = \sqrt{p - k}$ since $I - \hat V_k \hat V_k^\top$ is an orthogonal projector of rank $p - k$. Let $\tau := \min_{e \in \mathcal{E}} \operatorname{Tr}(\Sigma_e) > 0$.
Then, for norm-minPCA and norm-maxRCS,

$\max_{e \in \mathcal{E}} \big| \mathcal{L}_{\mathrm{normVar}}(\hat V_k; \Sigma_e) - \mathcal{L}_{\mathrm{normVar}}(\hat V_k; \hat\Sigma_e) \big| = \max_{e \in \mathcal{E}} \big| \mathcal{L}_{\mathrm{normRCS}}(\hat V_k; \Sigma_e) - \mathcal{L}_{\mathrm{normRCS}}(\hat V_k; \hat\Sigma_e) \big|$
$= \max_{e \in \mathcal{E}} \Big| \frac{\operatorname{Tr}(\hat V_k^\top \hat\Sigma_e \hat V_k)}{\operatorname{Tr}(\hat\Sigma_e)} - \frac{\operatorname{Tr}(\hat V_k^\top \Sigma_e \hat V_k)}{\operatorname{Tr}(\Sigma_e)} \Big| \le \frac{\sqrt{k} + \sqrt{p}}{\tau} \, \max_{e \in \mathcal{E}} \| \hat\Sigma_e - \Sigma_e \|_F,$

where the last inequality follows from (22). For maxRegret,

$\max_{e \in \mathcal{E}} \big| \mathcal{L}_{\mathrm{reg}}(\hat V_k; \Sigma_e) - \mathcal{L}_{\mathrm{reg}}(\hat V_k; \hat\Sigma_e) \big|$
$= \max_{e \in \mathcal{E}} \Big| \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \Sigma_e W) - \operatorname{Tr}(\hat V_k^\top \Sigma_e \hat V_k) - \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \hat\Sigma_e W) + \operatorname{Tr}(\hat V_k^\top \hat\Sigma_e \hat V_k) \Big|$
$\le \max_{e \in \mathcal{E}} \Big| \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \Sigma_e W) - \max_{W \in \mathcal{O}_{p \times k}} \operatorname{Tr}(W^\top \hat\Sigma_e W) \Big| + \max_{e \in \mathcal{E}} \big| \operatorname{Tr}(\hat V_k^\top \Sigma_e \hat V_k) - \operatorname{Tr}(\hat V_k^\top \hat\Sigma_e \hat V_k) \big|$
$= \max_{e \in \mathcal{E}} \Big| \sum_{i=1}^k \lambda_i(\Sigma_e) - \sum_{i=1}^k \lambda_i(\hat\Sigma_e) \Big| + \max_{e \in \mathcal{E}} \big| \operatorname{Tr}(\hat V_k^\top \Sigma_e \hat V_k) - \operatorname{Tr}(\hat V_k^\top \hat\Sigma_e \hat V_k) \big|$
$\le (k + \sqrt{k}) \, \max_{e \in \mathcal{E}} \| \Sigma_e - \hat\Sigma_e \|_F,$

where the last inequality follows as in (23). Finally, for norm-maxRegret, as in Case 6 of the proof of Proposition 9,

$\max_{e \in \mathcal{E}} \big| \mathcal{L}_{\mathrm{normReg}}(\hat V_k; \Sigma_e) - \mathcal{L}_{\mathrm{normReg}}(\hat V_k; \hat\Sigma_e) \big| \le \frac{\sqrt{k} + k + 2\sqrt{p}}{\tau} \, \max_{e \in \mathcal{E}} \| \hat\Sigma_e - \Sigma_e \|_F.$

Since $\mathcal{E}$ is finite and, for all $e \in \mathcal{E}$, $\| \hat\Sigma_e - \Sigma_e \|_F \to 0$ in probability as $n_e \to \infty$ (by (19)), it follows that

$\max_{e \in \mathcal{E}} \| \hat\Sigma_e - \Sigma_e \|_F \to 0 \text{ in probability as } n_{\min} \to \infty,$

which proves Statement ii).

D.6 Proof of Theorem 13

Let $V_k^* \in \mathcal{O}_{p \times k}$ be a rank-$k$ solution to maxRCS and assume $V_k^*$ is $\mu$-incoherent. Fix $x \in \mathbb{R}^{1 \times p}$ and a mask $\omega \in \{0, 1\}^p$ such that the number $s$ of unobserved entries satisfies $s \le \frac{p \epsilon}{k \mu^2 (2\epsilon + 1)}$, and let $S(\omega) := \{ i \in [p] : \omega_i = 0 \}$ denote the set of unobserved coordinates.
Inductive matrix completion via least squares is equivalent to ordinary least squares of the response $y := x^\top$ on the design matrix $X := V_k^*$, with the rows indexed by $S(\omega)$ removed. Hence, Proposition 18 (Appendix D.1) applies with

$X := V_k^* \in \mathbb{R}^{p \times k}, \qquad y := x^\top \in \mathbb{R}^p, \qquad S := S(\omega),$

which yields (this is the transposed version of Proposition 18; the Euclidean norm is invariant under transposition)

$\| x - \ell(x, \omega, V_k^*) (V_k^*)^\top \|_2^2 \le (1 + \epsilon) \| x - x V_k^* (V_k^*)^\top \|_2^2.$

Since $s \le \frac{p \epsilon}{k \mu^2 (2\epsilon + 1)}$ holds $Q$-almost surely by Assumption 2, the above inequality holds $Q$-almost surely for $\omega \sim Q$. For any $P \in \mathcal{P}$, we then take expectations over $x \sim P$ and $\omega \sim Q$, yielding

$\mathcal{L}_{P,Q}(V_k^*) = \mathbb{E}_{x \sim P, \omega \sim Q} \| x - \ell(x, \omega, V_k^*) (V_k^*)^\top \|_2^2 \le (1 + \epsilon) \, \mathbb{E}_{x \sim P, \omega \sim Q} \| x - x V_k^* (V_k^*)^\top \|_2^2 = (1 + \epsilon) \, \mathbb{E}_{x \sim P} \| x - x V_k^* (V_k^*)^\top \|_2^2.$

Taking the supremum over $P \in \mathcal{P}$ and applying Statement i) of Theorem 6 gives

$\sup_{P \in \mathcal{P}} \mathcal{L}_{P,Q}(V_k^*) \le (1 + \epsilon) \sup_{P \in \mathcal{P}} \mathbb{E}_{x \sim P} \| x - x V_k^* (V_k^*)^\top \|_2^2 = (1 + \epsilon) \max_{e \in \mathcal{E}} \mathbb{E}_{x \sim P_e} \| x - x V_k^* (V_k^*)^\top \|_2^2 = (1 + \epsilon) \min_{V \in \mathcal{O}_{p \times k}} \max_{e \in \mathcal{E}} \mathbb{E}_{x \sim P_e} \| x - x V V^\top \|_2^2, \quad (29)$

where the last equality follows from the definition of $V_k^*$ as a rank-$k$ solution to maxRCS. Finally, for any fixed $x$ and $V \in \mathcal{O}_{p \times k}$, the choice $\ell = x V$ minimizes $\| x - \ell V^\top \|_2^2$ over $\ell \in \mathbb{R}^k$. Hence, for all $x$ and all $\omega$,

$\| x - x V V^\top \|_2^2 \le \| x - \ell(x, \omega, V) V^\top \|_2^2,$

and therefore, for each domain $e$,

$\mathbb{E}_{x \sim P_e} \| x - x V V^\top \|_2^2 \le \mathbb{E}_{x \sim P_e, \omega \sim Q} \| x - \ell(x, \omega, V) V^\top \|_2^2 \le \sup_{P \in \mathcal{P}} \mathcal{L}_{P,Q}(V).$

Taking the maximum over $e \in \mathcal{E}$ and then the minimum over $V \in \mathcal{O}_{p \times k}$ gives

$\min_{V \in \mathcal{O}_{p \times k}} \max_{e \in \mathcal{E}} \mathbb{E}_{x \sim P_e} \| x - x V V^\top \|_2^2 \le \min_{V \in \mathcal{O}_{p \times k}} \sup_{P \in \mathcal{P}} \mathcal{L}_{P,Q}(V).$

Combining this inequality with (29) yields

$\sup_{P \in \mathcal{P}} \mathcal{L}_{P,Q}(V_k^*) \le (1 + \epsilon) \min_{V \in \mathcal{O}_{p \times k}} \sup_{P \in \mathcal{P}} \mathcal{L}_{P,Q}(V),$

as claimed.
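For intuition, the inductive least-squares rule $\ell(x, \omega, V)$ and the inequality $\| x - x V V^\top \|_2^2 \le \| x - \ell(x, \omega, V) V^\top \|_2^2$ used above can be sketched numerically (the helper name is ours, not the paper's code): fitting the loadings on the observed coordinates only can never beat the full-data projection $\ell = x V$ in reconstruction error.

```python
import numpy as np

def inductive_reconstruct(x, omega, V):
    """Sketch of the inductive least-squares rule: fit the k loadings
    using only the observed coordinates (omega == 1), then reconstruct
    the full row as ell @ V.T."""
    o = omega.astype(bool)
    # least squares on the rows of V at the observed coordinates
    ell, *_ = np.linalg.lstsq(V[o], x[o], rcond=None)
    return ell @ V.T
```

With no coordinates masked, the rule coincides with the orthogonal projection $x V V^\top$; masking coordinates can only increase the full-row reconstruction error.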