
Paper deep dive

Progress on Attention

Adam Jermyn, Jack Lindsey, Rodrigo Luger, Nick Turner, Trenton Bricken, Adam Pearce, Callum McDougall, Ben Thompson, Jeff Wu, Joshua Batson, Kelley Rivoire, Christopher Olah

Year: 2025 · Venue: Anthropic (Transformer Circuits) · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 86

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/11/2026, 12:33:37 AM

Summary

The paper introduces SOFT-IMPUTE, an efficient algorithm for solving large-scale matrix completion problems using nuclear norm regularization. By leveraging the low-rank structure of the target matrix and employing warm starts, the algorithm iteratively computes solutions along a regularization path, outperforming existing state-of-the-art methods in both speed and scalability.

Entities (5)

Rahul Mazumder · researcher · 100%
SOFT-IMPUTE · algorithm · 100%
SVD · mathematical-method · 99%
Nuclear Norm · mathematical-concept · 98%
Netflix Prize · dataset/competition · 95%

Relation Signals (3)

SOFT-IMPUTE uses Nuclear Norm

confidence 100% · In this paper we propose an algorithm SOFT-IMPUTE for the nuclear norm regularized least-squares problem

Rahul Mazumder authored SOFT-IMPUTE

confidence 95% · In this paper we propose an algorithm SOFT-IMPUTE

SOFT-IMPUTE solves Matrix Completion

confidence 95% · we use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems.

Cypher Suggestions (2)

Find all algorithms mentioned in the paper · confidence 90% · unvalidated

MATCH (a:Algorithm) RETURN a.name

Find relations between algorithms and mathematical concepts · confidence 90% · unvalidated

MATCH (a:Algorithm)-[r]->(c:Concept) RETURN a.name, type(r), c.name

Abstract

Presents significant evidence that attention heads exhibit superposition (restricting to single heads degrades performance 2-4x) and that QK/OV conditions are naturally coupled, with QK diagonalization as a promising path to understanding attention pattern formation.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Full Text

85,346 characters extracted from source content.


Journal of Machine Learning Research 11 (2010) 2287-2322. Submitted 7/09; Revised 4/10; Published 8/10.

Spectral Regularization Algorithms for Learning Large Incomplete Matrices

Rahul Mazumder (rahulm@stanford.edu), Trevor Hastie (hastie@stanford.edu), Department of Statistics, Stanford University, Stanford, CA 94305; Robert Tibshirani (tibs@stanford.edu), Department of Health, Research and Policy, Stanford University, Stanford, CA 94305. (Hastie is also in the Department of Health, Research and Policy; Tibshirani is also in the Department of Statistics.) Editor: Tommi Jaakkola.

Abstract

We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm SOFT-IMPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example SOFT-IMPUTE takes a few hours to compute low-rank approximations of a $10^6 \times 10^6$ incomplete matrix with $10^7$ observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques.

Keywords: collaborative filtering, nuclear norm, spectral regularization, Netflix prize, large scale convex optimization

1. Introduction

In many applications measured data can be represented in a matrix $X_{m\times n}$, for which only a relatively small number of entries are observed. The problem is to "complete" the matrix based on the observed entries, and has been dubbed the matrix completion problem (Candès and Recht, 2008; Candès and Tao, 2009; Rennie and Srebro, 2005). The "Netflix" competition (for example, SIGKDD and Netflix, 2007) is a popular example, where the data is the basis for a recommender system. The rows correspond to viewers and the columns to movies, with the entry $X_{ij}$ being the rating $\in \{1,\ldots,5\}$ by viewer $i$ for movie $j$. There are about 480K viewers and 18K movies, and hence 8.6 billion ($8.6\times 10^9$) potential entries. However, on average each viewer rates about 200 movies, so only 1.2% or $10^8$ entries are observed. The task is to predict the ratings that viewers would give to movies they have not yet rated.

These problems can be phrased as learning an unknown parameter (a matrix $Z_{m\times n}$) with very high dimensionality, based on very few observations. In order for such inference to be meaningful, we assume that the parameter $Z$ lies in a much lower dimensional manifold. In this paper, as is relevant in many real life applications, we assume that $Z$ can be well represented by a matrix of low rank, that is, $Z \approx V_{m\times k} G_{k\times n}$, where $k \ll \min(n,m)$. In this recommender-system example, low rank structure suggests that movies can be grouped into a small number of "genres", with $G_{\ell j}$ the relative score for movie $j$ in genre $\ell$.
Viewer $i$ on the other hand has an affinity $V_{i\ell}$ for genre $\ell$, and hence the modeled score for viewer $i$ on movie $j$ is the sum $\sum_{\ell=1}^{k} V_{i\ell} G_{\ell j}$ of genre affinities times genre scores. Typically we view the observed entries in $X$ as the corresponding entries from $Z$ contaminated with noise.

Srebro et al. (2005a) studied generalization error bounds for learning low-rank matrices. Recently Candès and Recht (2008), Candès and Tao (2009), and Keshavan et al. (2009) showed theoretically that under certain assumptions on the entries of the matrix, locations, and proportion of unobserved entries, the true underlying matrix can be recovered with very high accuracy.

For a matrix $X_{m\times n}$ let $\Omega \subset \{1,\ldots,m\}\times\{1,\ldots,n\}$ denote the indices of observed entries. We consider the following optimization problem:

$$\text{minimize } \operatorname{rank}(Z) \quad \text{subject to} \quad \sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2 \le \delta, \qquad (1)$$

where $\delta \ge 0$ is a regularization parameter controlling the tolerance in training error. The rank constraint in (1) makes the problem for general $\Omega$ combinatorially hard (Srebro and Jaakkola, 2003). For a fully-observed $X$ on the other hand, the solution is given by a truncated singular value decomposition (SVD) of $X$. The following seemingly small modification to (1),

$$\text{minimize } \|Z\|_* \quad \text{subject to} \quad \sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2 \le \delta, \qquad (2)$$

makes the problem convex (Fazel, 2002). Here $\|Z\|_*$ is the nuclear norm, or the sum of the singular values of $Z$. Under many situations the nuclear norm is an effective convex relaxation to the rank constraint (Fazel, 2002; Candès and Recht, 2008; Candès and Tao, 2009; Recht et al., 2007). Optimization of (2) is a semidefinite programming problem (Boyd and Vandenberghe, 2004) and can be solved efficiently for small problems, using modern convex optimization software like SeDuMi and SDPT3 (Grant and Boyd, 2009). However, since these algorithms are based on second order methods (Liu and Vandenberghe, 2009), they can become prohibitively expensive if the dimensions of the matrix get large (Cai et al., 2008). Equivalently we can reformulate (2) in Lagrange form

$$\text{minimize}_Z \; \frac{1}{2}\sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2 + \lambda \|Z\|_*. \qquad (3)$$

Here $\lambda \ge 0$ is a regularization parameter controlling the nuclear norm of the minimizer $\hat{Z}_\lambda$ of (3); there is a 1-1 mapping between $\delta \ge 0$ and $\lambda \ge 0$ over their active domains.

In this paper we propose an algorithm SOFT-IMPUTE for the nuclear norm regularized least-squares problem (3) that scales to large problems with $m, n \approx 10^5$–$10^6$ and around $10^6$–$10^8$ or more observed entries. At every iteration SOFT-IMPUTE decreases the value of the objective function towards its minimum, and at the same time gets closer to the set of optimal solutions of the problem (2). We study the convergence properties of this algorithm and discuss how it can be extended to other more sophisticated forms of spectral regularization.

To summarize some performance results:

• We obtain a rank-40 solution to (2) for a problem of size $10^5 \times 10^5$ with $|\Omega| = 5\times 10^6$ observed entries in less than 18 minutes.
• For the same sized matrix with $|\Omega| = 10^7$ we obtain a rank-5 solution in less than 21 minutes.
• For a $10^6 \times 10^5$ sized matrix with $|\Omega| = 10^8$ a rank-5 solution is obtained in approximately 4.3 hours.
• We fit a rank-66 solution for the Netflix data in 2.2 hours. Here there are $10^8$ observed entries in a matrix with $4.8\times 10^5$ rows and $1.8\times 10^4$ columns. A rank-95 solution takes 3.27 hours.

(For large problems, data transfer, access and reading take quite a lot of time and depend upon the platform and machine. Here we report the times taken for the computational bottleneck, that is, the SVD computations over all iterations. All times are based on computations done on an Intel Xeon Linux 3GHz processor using MATLAB, with no C or Fortran interfacing.)

The paper is organized as follows. In Section 2, we discuss related work and provide some context for this paper.
In Section 3 we introduce the SOFT-IMPUTE algorithm and study its convergence properties in Section 4. The computational aspects of the algorithm are described in Section 5, and Section 6 discusses how nuclear norm regularization can be generalized to more aggressive and general types of spectral regularization. Section 7 describes post-processing of "selectors" and initialization. We discuss comparisons with related work, simulations and experimental studies in Section 9, and application to the Netflix data in Section 10.

2. Context and Related Work

Candès and Tao (2009), Cai et al. (2008), and Candès and Recht (2008) consider the criterion

$$\text{minimize } \|Z\|_* \quad \text{subject to} \quad Z_{ij} = X_{ij}, \ \forall (i,j)\in\Omega. \qquad (4)$$

With $\delta = 0$, the criterion (1) is equivalent to (4), in that it requires the training error to be zero. Cai et al. (2008) propose a first-order singular-value-thresholding algorithm SVT, scalable to large matrices, for the problem (4). They comment on the problem (2) with $\delta > 0$, but dismiss it as being computationally prohibitive for large problems.

We believe that (4) will almost always be too rigid and will result in over-fitting. If minimization of prediction error is an important goal, then the optimal solution $\hat Z$ will typically lie somewhere in the interior of the path indexed by $\delta$ (Figures 2, 3 and 4).

In this paper we provide an algorithm SOFT-IMPUTE for computing solutions of (3) on a grid of $\lambda$ values, based on warm restarts. The algorithm is inspired by SVD-IMPUTE (Troyanskaya et al., 2001)—an EM-type (Dempster et al., 1977) iterative algorithm that alternates between imputing the missing values from a current SVD, and updating the SVD using the "complete" data matrix. In its very motivation, SOFT-IMPUTE is different from generic first order algorithms (Cai et al., 2008; Ma et al.; Ji and Ye, 2009). The latter require the specification of a step size, and can be quite sensitive to the chosen value. Our algorithm does not require a step-size, or any such parameter.

The iterative algorithms proposed in Ma et al. and Ji and Ye (2009) require the computation of an SVD of a dense matrix (with dimensions equal to the size of the matrix $X$) at every iteration, as the bottleneck. This makes the algorithms prohibitive for large scale computations. Ma et al. use randomized algorithms for the SVD computation. Our algorithm SOFT-IMPUTE also requires an SVD computation at every iteration, but by exploiting the problem structure, can easily handle matrices of very large dimensions. At each iteration the non-sparse matrix has the structure:

$$Y = Y_{SP}\ (\text{Sparse}) + Y_{LR}\ (\text{Low Rank}). \qquad (5)$$

In (5) $Y_{SP}$ has the same sparsity structure as the observed $X$, and $Y_{LR}$ has rank $\tilde r \ll m, n$, where $\tilde r$ is very close to $r \ll m, n$, the rank of the estimated matrix $Z$ (upon convergence of the algorithm). For large scale problems, we use iterative methods based on Lanczos bidiagonalization with partial re-orthogonalization (as in the PROPACK algorithm, Larsen, 1998) for computing the first $\tilde r$ singular vectors/values of $Y$. Due to the specific structure of (5), multiplication by $Y$ and $Y'$ can both be achieved in a cost-efficient way.
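The efficiency claim for the structure (5) can be illustrated with a minimal Python/scipy sketch: a LinearOperator that multiplies by $Y = Y_{SP} + U\,\mathrm{diag}(d)\,V'$ without ever forming the dense matrix. This is a sketch under stated assumptions, not the PROPACK/MATLAB implementation used in the paper; all names are illustrative.

import numpy as np
from scipy.sparse.linalg import LinearOperator

def sparse_plus_lowrank(Y_sp, U, d, V):
    # LinearOperator for Y = Y_sp + U diag(d) V', never forming the dense Y.
    # Y_sp: scipy sparse (m, n); U: (m, r); d: (r,); V: (n, r).
    # Each matvec costs O(|Omega|) + O((m + n) r).
    m, n = Y_sp.shape
    matvec = lambda b: Y_sp @ b + U @ (d * (V.T @ b))      # Y b
    rmatvec = lambda b: Y_sp.T @ b + V @ (d * (U.T @ b))   # Y' b
    return LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec, dtype=float)

# The leading r singular triplets can then be computed with an iterative
# solver, e.g. scipy.sparse.linalg.svds(sparse_plus_lowrank(...), k=r).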
In decomposition (5), the computationally burdensome work in computing a low-rank SVD is of an order that depends linearly on the matrix dimensions. More precisely, evaluating each singular vector requires computation of the order of $O((m+n)\tilde r) + O(|\Omega|)$ flops, and evaluating $r'$ of them requires $O((m+n)\tilde r r') + O(|\Omega| r')$ flops. Exploiting warm-starts, we observe that $\tilde r \approx r$—hence every SVD step of our algorithm computes $r$ singular vectors, with complexity of the order $O((m+n)r^2) + O(|\Omega| r)$ flops. This computation is performed for the number of iterations SOFT-IMPUTE requires to run till convergence or a certain tolerance.

In this paper we show asymptotic convergence of SOFT-IMPUTE and further derive its non-asymptotic rate of convergence, which scales as $O(1/k)$ ($k$ denotes the iteration number). However, in our experimental studies on low-rank matrix completion, we have observed that our algorithm is faster (based on timing comparisons) than the accelerated version of Nesterov (Ji and Ye, 2009; Nesterov, 2007), which has a provable (worst case) convergence rate of $O(1/k^2)$. With warm-starts SOFT-IMPUTE computes the entire regularization path very efficiently along a dense series of values for $\lambda$.

Although the nuclear norm is motivated here as a convex relaxation to a rank constraint, we believe in many situations it will outperform the rank-restricted estimator (1). This is supported by our experimental studies. We draw the natural analogy with model selection in linear regression, and compare best-subset regression ($\ell_0$ regularization) with the LASSO ($\ell_1$ regularization, Tibshirani, 1996; Hastie et al., 2009). There too the $\ell_1$ penalty can be viewed as a convex relaxation of the $\ell_0$ penalty. But in many situations with moderate sparsity, the LASSO will outperform best subset in terms of prediction accuracy (Friedman, 2008; Hastie et al., 2009; Mazumder et al., 2009). By shrinking the parameters in the model (and hence reducing their variance), the lasso permits more parameters to be included. The nuclear norm is the $\ell_1$ penalty in matrix completion, as compared to the $\ell_0$ rank. By shrinking the singular values, we allow more dimensions to be included without incurring undue estimation variance.

Another class of techniques used in collaborative filtering problems is close in spirit to (2). These are known as maximum margin matrix factorization methods—in short MMMF—and use a factor model for the matrix $Z$ (Srebro et al., 2005b). Let $Z = UV'$ where $U_{m\times r'}$ and $V_{n\times r'}$, and consider the following problem:

$$\text{minimize}_{U,V}\; \frac{1}{2}\sum_{(i,j)\in\Omega} \left(X_{ij} - (UV')_{ij}\right)^2 + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right). \qquad (6)$$

It turns out that (6) is intimately related to (3), since (see Lemma 6)

$$\|Z\|_* = \min_{U,V:\, Z=UV'} \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right).$$

For example, if $r' = \min(m,n)$, the solution to (6) coincides with the solution to (3). (We note here that the original MMMF formulation uses $r' = \min(m,n)$; in this paper we will consider it for a family of $r'$ values.) However, (6) is not convex in its arguments, while (3) is. We compare these two criteria in detail in Section 8, and the relative performance of their respective algorithms in Section 9.2.

3. SOFT-IMPUTE: an Algorithm for Nuclear Norm Regularization

We first introduce some notation that will be used for the rest of this article.

3.1 Notation

We adopt the notation of Cai et al. (2008). Define the matrix $P_\Omega(Y)$ (with dimension $m\times n$)

$$P_\Omega(Y)_{(i,j)} = \begin{cases} Y_{ij} & \text{if } (i,j)\in\Omega \\ 0 & \text{if } (i,j)\notin\Omega, \end{cases} \qquad (7)$$

which is a projection of the matrix $Y_{m\times n}$ onto the observed entries.
In the same spirit, define the complementary projection $P_\Omega^\perp(Y)$ via $P_\Omega^\perp(Y) + P_\Omega(Y) = Y$. Using (7) we can rewrite $\sum_{(i,j)\in\Omega} (X_{ij} - Z_{ij})^2$ as $\|P_\Omega(X) - P_\Omega(Z)\|_F^2$.

3.2 Nuclear Norm Regularization

We present the following lemma, which forms a basic ingredient in our algorithm.

Lemma 1. Suppose the matrix $W_{m\times n}$ has rank $r$. The solution to the optimization problem

$$\text{minimize}_Z\; \frac{1}{2}\|W - Z\|_F^2 + \lambda\|Z\|_* \qquad (8)$$

is given by $\hat Z = S_\lambda(W)$ where

$$S_\lambda(W) \equiv U D_\lambda V' \quad \text{with } D_\lambda = \mathrm{diag}\left[(d_1-\lambda)_+, \ldots, (d_r-\lambda)_+\right], \qquad (9)$$

$UDV'$ is the SVD of $W$, $D = \mathrm{diag}[d_1,\ldots,d_r]$, and $t_+ = \max(t, 0)$.

The notation $S_\lambda(W)$ refers to soft-thresholding (Donoho et al., 1995). Lemma 1 appears in Cai et al. (2008) and Ma et al., where the proof uses the sub-gradient characterization of the nuclear norm. In Appendix A.1 we present an entirely different proof, which can be extended in a relatively straightforward way to other complicated forms of spectral regularization discussed in Section 6. Our proof is followed by a remark that covers these more general cases.

3.3 Algorithm

Using the notation in 3.1, we rewrite (3) as:

$$\text{minimize}_Z\; f_\lambda(Z) := \frac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\|Z\|_*. \qquad (10)$$

We now present Algorithm 1—SOFT-IMPUTE—for computing a series of solutions to (10) for different values of $\lambda$ using warm starts.

Algorithm 1: SOFT-IMPUTE

1. Initialize $Z^{old} = 0$.
2. Do for $\lambda_1 > \lambda_2 > \ldots > \lambda_K$:
   (a) Repeat:
       i. Compute $Z^{new} \leftarrow S_{\lambda_k}(P_\Omega(X) + P_\Omega^\perp(Z^{old}))$.
       ii. If $\|Z^{new} - Z^{old}\|_F^2 \,/\, \|Z^{old}\|_F^2 < \varepsilon$, exit.
       iii. Assign $Z^{old} \leftarrow Z^{new}$.
   (b) Assign $\hat Z_{\lambda_k} \leftarrow Z^{new}$.
3. Output the sequence of solutions $\hat Z_{\lambda_1}, \ldots, \hat Z_{\lambda_K}$.

The algorithm repeatedly replaces the missing entries with the current guess, and then updates the guess by solving (8). Figures 2, 3 and 4 show some examples of solutions using SOFT-IMPUTE (blue continuous curves). We see test and training error in the top rows as a function of the nuclear norm, obtained from a grid of values $\Lambda$. These error curves show a smooth and very competitive performance.

4. Convergence Analysis

In this section we study the convergence properties of Algorithm 1. Unlike generic first-order methods (Nesterov, 2003), including competitive first-order methods for nuclear norm regularized problems (Cai et al., 2008; Ma et al.), SOFT-IMPUTE does not involve the choice of any additional step-size. Most importantly, our algorithm is readily scalable for solving large scale semidefinite programming problems (2) and (10), as will be explained later in Section 5. For an arbitrary matrix $\tilde Z$, define

$$Q_\lambda(Z \mid \tilde Z) = \frac{1}{2}\|P_\Omega(X) + P_\Omega^\perp(\tilde Z) - Z\|_F^2 + \lambda\|Z\|_* \qquad (11)$$

as a surrogate of the objective function $f_\lambda(Z)$. Note that $f_\lambda(\tilde Z) = Q_\lambda(\tilde Z \mid \tilde Z)$ for any $\tilde Z$.

In Section 4.1, we show that the sequence $Z_\lambda^k$ generated via SOFT-IMPUTE converges asymptotically, that is, as $k\to\infty$, to a minimizer of the objective function $f_\lambda(Z)$. SOFT-IMPUTE produces a sequence of solutions for which the criterion decreases to the optimal solution with every iteration, and the successive iterates get closer to the optimal set of solutions of the problem (10). Section 4.2 derives the non-asymptotic convergence rate of the algorithm. The latter analysis concentrates on the objective values $f_\lambda(Z_\lambda^k)$. If, due to computational resources, one wishes to stop the algorithm after $K$ iterations, then Theorem 2 provides a certificate of how far $Z_\lambda^k$ is from the solution.
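Returning to Algorithm 1: here is a minimal dense-matrix Python sketch of SOFT-IMPUTE, using the soft-thresholding operator $S_\lambda$ of Lemma 1. It omits the structured SVD machinery of Section 5, so it is only suitable for small problems; the function and variable names are illustrative, not the paper's MATLAB code.

import numpy as np

def soft_threshold_svd(W, lam):
    # S_lambda(W) = U diag((d_i - lam)_+) V'   (Lemma 1)
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(d - lam, 0.0)) @ Vt

def soft_impute(X, mask, lams, tol=1e-4, max_iter=500):
    # Algorithm 1: warm-started path over lam_1 > lam_2 > ... > lam_K.
    Z = np.zeros_like(X)
    path = []
    for lam in lams:
        for _ in range(max_iter):
            # P_Omega(X) + P_Omega^perp(Z): fill missing entries with the current guess.
            Z_new = soft_threshold_svd(np.where(mask, X, Z), lam)
            # Relative-change stopping rule of Step 2(a)ii (guarded against Z = 0).
            if np.sum((Z_new - Z) ** 2) < tol * max(np.sum(Z ** 2), 1e-12):
                Z = Z_new
                break
            Z = Z_new
        path.append(Z.copy())
    return path

A decreasing grid such as lams = np.geomspace(lam_max, lam_max / 100, 20) reproduces the warm-started path, with lam_max taken as $\|P_\Omega(X)\|_2$ (see Section 9.4).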
Though Section 4.1 alone establishes the convergence of $f_\lambda(Z_\lambda^k)$ to the minimum of $f_\lambda(Z)$, this does not, in general, settle the convergence of $Z_\lambda^k$ unless further conditions (like strong convexity) are imposed on $f_\lambda(\cdot)$.

4.1 Asymptotic Convergence

Lemma 2. For every fixed $\lambda \ge 0$, define a sequence $Z_\lambda^k$ by

$$Z_\lambda^{k+1} = \arg\min_Z Q_\lambda(Z \mid Z_\lambda^k)$$

with any starting point $Z_\lambda^0$. The sequence $Z_\lambda^k$ satisfies

$$f_\lambda(Z_\lambda^{k+1}) \le Q_\lambda(Z_\lambda^{k+1} \mid Z_\lambda^k) \le f_\lambda(Z_\lambda^k).$$

Proof. Note that

$$Z_\lambda^{k+1} = S_\lambda\left(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k)\right). \qquad (12)$$

By Lemma 1 and the definition (11) of $Q_\lambda(Z \mid Z_\lambda^k)$, we have:

$$f_\lambda(Z_\lambda^k) = Q_\lambda(Z_\lambda^k \mid Z_\lambda^k) = \frac{1}{2}\|P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k) - Z_\lambda^k\|_F^2 + \lambda\|Z_\lambda^k\|_*$$
$$\ge \min_Z \frac{1}{2}\|P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k) - Z\|_F^2 + \lambda\|Z\|_* = Q_\lambda(Z_\lambda^{k+1} \mid Z_\lambda^k)$$
$$= \frac{1}{2}\left\|\left(P_\Omega(X) - P_\Omega(Z_\lambda^{k+1})\right) + \left(P_\Omega^\perp(Z_\lambda^k) - P_\Omega^\perp(Z_\lambda^{k+1})\right)\right\|_F^2 + \lambda\|Z_\lambda^{k+1}\|_*$$
$$= \frac{1}{2}\left(\|P_\Omega(X) - P_\Omega(Z_\lambda^{k+1})\|_F^2 + \|P_\Omega^\perp(Z_\lambda^k) - P_\Omega^\perp(Z_\lambda^{k+1})\|_F^2\right) + \lambda\|Z_\lambda^{k+1}\|_* \qquad (13)$$
$$\ge \frac{1}{2}\|P_\Omega(X) - P_\Omega(Z_\lambda^{k+1})\|_F^2 + \lambda\|Z_\lambda^{k+1}\|_* \qquad (14)$$
$$= Q_\lambda(Z_\lambda^{k+1} \mid Z_\lambda^{k+1}) = f_\lambda(Z_\lambda^{k+1}).$$

(The cross term vanishes in (13) because $P_\Omega$ and $P_\Omega^\perp$ have disjoint supports.)

Lemma 3. The nuclear norm shrinkage operator $S_\lambda(\cdot)$ satisfies the following for any $W_1, W_2$ (with matching dimensions):

$$\|S_\lambda(W_1) - S_\lambda(W_2)\|_F^2 \le \|W_1 - W_2\|_F^2.$$

In particular this implies that $S_\lambda(W)$ is a continuous map in $W$.

Lemma 3 is proved in Ma et al.; their proof is complex and based on trace inequalities. We give a concise proof based on elementary convex analysis in Appendix A.2.

Lemma 4. The successive differences $\|Z_\lambda^k - Z_\lambda^{k-1}\|_F^2$ of the sequence $Z_\lambda^k$ are monotone decreasing:

$$\|Z_\lambda^{k+1} - Z_\lambda^k\|_F^2 \le \|Z_\lambda^k - Z_\lambda^{k-1}\|_F^2 \quad \forall k. \qquad (15)$$

Moreover, the difference sequence converges to zero, that is, $Z_\lambda^{k+1} - Z_\lambda^k \to 0$ as $k\to\infty$.

The proof of Lemma 4 is given in Appendix A.3.

Lemma 5. Every limit point of the sequence $Z_\lambda^k$ defined in Lemma 2 is a stationary point of

$$\frac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\|Z\|_*.$$

Hence it is a solution to the fixed point equation

$$Z = S_\lambda\left(P_\Omega(X) + P_\Omega^\perp(Z)\right). \qquad (16)$$

The proof of Lemma 5 is given in Appendix A.4.

Theorem 1. The sequence $Z_\lambda^k$ defined in Lemma 2 converges to a limit $Z_\lambda^\infty$ that solves

$$\text{minimize}_Z\; \frac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\|Z\|_*. \qquad (17)$$

Proof. It suffices to prove that $Z_\lambda^k$ converges; the theorem then follows from Lemma 5. Let $\hat Z_\lambda$ be a limit point of the sequence $Z_\lambda^k$. There exists a subsequence $m_k$ such that $Z_\lambda^{m_k} \to \hat Z_\lambda$. By Lemma 5, $\hat Z_\lambda$ solves the problem (17) and satisfies the fixed point equation (16). Hence

$$\|\hat Z_\lambda - Z_\lambda^k\|_F^2 = \left\|S_\lambda\left(P_\Omega(X) + P_\Omega^\perp(\hat Z_\lambda)\right) - S_\lambda\left(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{k-1})\right)\right\|_F^2 \qquad (18)$$
$$\le \left\|\left(P_\Omega(X) + P_\Omega^\perp(\hat Z_\lambda)\right) - \left(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{k-1})\right)\right\|_F^2 = \|P_\Omega^\perp(\hat Z_\lambda - Z_\lambda^{k-1})\|_F^2 \le \|\hat Z_\lambda - Z_\lambda^{k-1}\|_F^2. \qquad (19)$$

In (18) two substitutions were made: the left one using (16) in Lemma 5, the right one using (12). Inequality (19) implies that the sequence $\|\hat Z_\lambda - Z_\lambda^{k-1}\|_F^2$ converges as $k\to\infty$. To show the convergence of the sequence $Z_\lambda^k$ it suffices to prove that the sequence $\hat Z_\lambda - Z_\lambda^k$ converges to zero. We prove this by contradiction. Suppose the sequence $Z_\lambda^k$ has another limit point $Z_\lambda^+ \ne \hat Z_\lambda$. Then $\hat Z_\lambda - Z_\lambda^k$ has two distinct limit points, $0$ and $Z_\lambda^+ - \hat Z_\lambda \ne 0$. This contradicts the convergence of the sequence $\|\hat Z_\lambda - Z_\lambda^{k-1}\|_F^2$. Hence the sequence $Z_\lambda^k$ converges to $\hat Z_\lambda := Z_\lambda^\infty$.

The inequality in (19) implies that at every iteration $Z_\lambda^k$ gets closer to an optimal solution for the problem (17). This property holds in addition to the decrease of the objective function (Lemma 2) at every iteration.
(In fact this statement can be strengthened further: at every iteration the distance of the estimate from the set of optimal solutions decreases.)

4.2 Convergence Rate

In this section we derive the worst case convergence rate of SOFT-IMPUTE.

Theorem 2. For every fixed $\lambda \ge 0$, the sequence $Z_\lambda^k$, $k \ge 0$, defined in Lemma 2 has the following non-asymptotic (worst) rate of convergence:

$$f_\lambda(Z_\lambda^k) - f_\lambda(Z_\lambda^\infty) \le \frac{2\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2}{k+1}. \qquad (20)$$

The proof of this theorem is in Appendix A.6. In light of Theorem 2, a $\delta > 0$ accurate solution of $f_\lambda(Z)$ is obtained after a maximum of $\frac{2}{\delta}\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2$ iterations. Using warm-starts, SOFT-IMPUTE traces out the path of solutions on a grid of $\lambda$ values $\lambda_1 > \lambda_2 > \ldots > \lambda_K$ with a total of

$$\sum_{i=1}^{K} \frac{2}{\delta}\|\hat Z_{\lambda_{i-1}} - Z_{\lambda_i}^\infty\|_F^2 \qquad (21)$$

iterations. Here $\hat Z_{\lambda_0} = 0$ and $\hat Z_{\lambda_i}$ denotes the output of SOFT-IMPUTE (upon convergence) for $\lambda = \lambda_i$, $i \in \{1,\ldots,K-1\}$. (We assume the solution $\hat Z_\lambda$ at every $\lambda \in \{\lambda_1,\ldots,\lambda_K\}$ is computed to an accuracy of $\delta > 0$.) The solutions $Z_{\lambda_i}^\infty$ and $Z_{\lambda_{i-1}}^\infty$ are likely to be close to each other, especially on a dense grid of $\lambda_i$'s. Hence every summand of (21), and the total number of iterations, is expected to be significantly smaller than that obtained via arbitrary cold-starts.

5. Computational Complexity

The computationally demanding part of Algorithm 1 is in $S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k))$. This requires calculating a low-rank SVD of a matrix, since the underlying model assumption is that $\operatorname{rank}(Z) \ll \min\{m,n\}$. In Algorithm 1, for fixed $\lambda$, the entire sequence of matrices $Z_\lambda^k$ have explicit low-rank representations of the form $U_k D_k V_k'$ corresponding to $S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{k-1}))$. (Though we cannot prove theoretically that every iterate of the sequence $Z_\lambda^k$ will be of low rank, this observation is practical, based on the manner in which we trace out the entire path of solutions with warm-starts; our simulation results support it as well.) In addition, observe that $P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k)$ can be rewritten as

$$P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k) = \left(P_\Omega(X) - P_\Omega(Z_\lambda^k)\right) + Z_\lambda^k = \text{Sparse} + \text{Low Rank}. \qquad (22)$$

In the numerical linear algebra literature, there are very efficient direct matrix factorization methods for calculating the SVD of matrices of moderate size (at most a few thousand). When the matrix is sparse, larger problems can be solved, but the computational cost depends heavily upon the sparsity structure of the matrix. In general, however, for large matrices one has to resort to indirect iterative methods for calculating the leading singular vectors/values of a matrix. There is a lot of research in numerical linear algebra on developing sophisticated algorithms for this purpose. In this paper we will use the PROPACK algorithm (Larsen, 2004, 1998) because of its low storage requirements, effective flop count and its well documented MATLAB version. The algorithm for calculating the truncated SVD of a matrix $W$ (say) becomes efficient if the multiplication operations $Wb_1$ and $W'b_2$ (with $b_1 \in \Re^n$, $b_2 \in \Re^m$) can be done with minimal cost.

Algorithm SOFT-IMPUTE requires repeated computation of a truncated SVD for a matrix $W$ with structure as in (22). Assume that at the current iterate, the matrix $Z_\lambda^k$ has rank $\tilde r$. Note that in (22) the term $P_\Omega(Z_\lambda^k)$ can be computed in $O(|\Omega|\tilde r)$ flops using only the required outer products (i.e., our algorithm does not compute the matrix explicitly).
The cost of computing the truncated SVD will depend upon the cost of the operations $Wb_1$ and $W'b_2$ (which are equal). For the sparse part these multiplications cost $O(|\Omega|)$. Although it costs $O(|\Omega|\tilde r)$ to create the matrix $P_\Omega(Z_\lambda^k)$, this is used for each of the $\tilde r$ such multiplications (which also cost $O(|\Omega|\tilde r)$), so we need not include that cost here. The low-rank part costs $O((m+n)\tilde r)$ for the multiplication by $b_1$. Hence the cost is $O(|\Omega|) + O((m+n)\tilde r)$ per vector multiplication. Supposing we want a rank-$\tilde r$ SVD of the matrix (22), the cost will be of the order of $O(|\Omega|\tilde r) + O((m+n)\tilde r^2)$ (for that iteration, that is, to obtain $Z_\lambda^{k+1}$ from $Z_\lambda^k$). Suppose the rank of the solution $Z_\lambda^k$ is $r$; then, in light of our above observations, $\tilde r \approx r \ll \min\{m,n\}$ and the order is $O(|\Omega| r) + O((m+n)r^2)$.

For the reconstruction problem to be theoretically meaningful in the sense of Candès and Tao (2009) we require that $|\Omega| \approx nr\,\mathrm{poly}(\log n)$. In practice $|\Omega|$ is often very small. Hence introducing the low-rank part does not add any further complexity to the multiplication by $W$ and $W'$. So the dominant cost in calculating the truncated SVD in our algorithm is $O(|\Omega|)$. The SVT algorithm (Cai et al., 2008) for exact matrix completion (4) involves calculating the SVD of a sparse matrix with cost $O(|\Omega|)$. This implies that the computational order of SOFT-IMPUTE and that of SVT is the same. This order computation does not include the number of iterations required for convergence. In our experimental studies we use warm-starts for efficiently computing the entire regularization path. On small scale examples, based on comparisons with the accelerated gradient method of Nesterov (see Section 9.3; Ji and Ye, 2009; Nesterov, 2007), we find that our algorithm converges faster than the latter in terms of run-time and number of SVD computations/iterations. This supports the computational effectiveness of SOFT-IMPUTE. In addition, since the true rank of the matrix $r \ll \min\{m,n\}$, the computational cost of evaluating the truncated SVD (with rank $\approx r$) is linear in the matrix dimensions. This justifies the large-scale computational feasibility of our algorithm.

The above discussion focuses on the computational complexity of obtaining a low-rank SVD, which is to be performed at every iteration of SOFT-IMPUTE. Similar to the total iteration complexity bound of SOFT-IMPUTE (21), the total cost to compute the regularization path on a grid of $\lambda$ values is given by:

$$\sum_{i=1}^{K} O\!\left(\left(|\Omega|\,\bar r_{\lambda_i} + (m+n)\,\bar r_{\lambda_i}^2\right)\frac{2}{\delta}\|\hat Z_{\lambda_{i-1}} - Z_{\lambda_i}^\infty\|_F^2\right).$$

Here $\bar r_\lambda$ denotes the rank (on average) of the iterates $Z_\lambda^k$ generated by SOFT-IMPUTE for fixed $\lambda$. (We assume that the grid of values $\lambda_1 > \ldots > \lambda_K$ is such that all the solutions $Z_\lambda$, $\lambda \in \{\lambda_1,\ldots,\lambda_K\}$, are of small rank, as they appear in Section 5.)

The PROPACK package does not allow one to request (and hence compute) only the singular values larger than a threshold $\lambda$—one has to specify their number in advance. So once all the computed singular values fall above the current threshold $\lambda$, our algorithm increases the number to be computed until the smallest is smaller than $\lambda$. In large scale problems, we put an absolute limit on the maximum number.

6. Generalized Spectral Regularization: From Soft to Hard Thresholding

In Section 1 we discussed the role of the nuclear norm as a convex surrogate for the rank of a matrix, and drew the analogy with LASSO regression versus best-subset selection. We argued that in many problems $\ell_1$ regularization gives better prediction accuracy.
However, if the underlying model is very sparse, then the LASSO with its uniform shrinkage can both overestimate the number of non-zero coefficients (Friedman, 2008) in the model, and overly shrink (bias) those included toward zero. In this section we propose a natural generalization of SOFT-IMPUTE to overcome these problems.

Consider again the problem

$$\text{minimize}_{\operatorname{rank}(Z)\le k}\; \frac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2,$$

a rephrasing of (1). This best rank-$k$ solution also solves

$$\text{minimize}\; \frac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\sum_j I(\gamma_j(Z) > 0),$$

where $\gamma_j(Z)$ is the $j$th singular value of $Z$, for a suitable choice of $\lambda$ that produces a solution with rank $k$. The "fully observed" matrix version of the above problem is given by the $\ell_0$ version of (8) as follows:

$$\text{minimize}_Z\; \frac{1}{2}\|W - Z\|_F^2 + \lambda\|Z\|_0, \qquad (23)$$

where $\|Z\|_0 = \operatorname{rank}(Z)$. The solution of (23) is given by a reduced-rank SVD of $W$; for every $\lambda$ there is a corresponding $q = q(\lambda)$, the number of singular values to be retained in the SVD decomposition. Problem (23) is non-convex in $W$ but its global minimizer can be evaluated. As in (9), the thresholding operator resulting from (23) is

$$S_\lambda^H(W) = U D_q V' \quad \text{where } D_q = \mathrm{diag}(d_1, \ldots, d_q, 0, \ldots, 0)$$

(a small code sketch of this operator appears below). Similar to SOFT-IMPUTE (Algorithm 1), we present below HARD-IMPUTE (Algorithm 2) for the $\ell_0$ penalty. The continuous parameterization via $\lambda$ does not appear to offer obvious advantages over rank-truncation methods. We note that it does allow for a continuum of warm starts, and is a natural post-processor for the output of SOFT-IMPUTE (next section). But it also allows for further generalizations that bridge the gap between hard and soft regularization methods.

In penalized regression there have been recent developments directed towards "bridging" the gap between the $\ell_1$ and $\ell_0$ penalties (Friedman, 2008; Zhang, 2010; Mazumder et al., 2009). This is done by using non-convex penalties that are a better surrogate to $\ell_0$ (in the sense of approximating the penalty) than $\ell_1$. They also produce less biased estimates than those produced by the $\ell_1$ penalized solutions. When the underlying model is very sparse they often perform very well, and enjoy superior prediction accuracy when compared to softer penalties like $\ell_1$. These methods still shrink, but are less aggressive than best-subset selection.

By analogy, we propose using a more sophisticated version of spectral regularization. This goes beyond nuclear norm regularization by using slightly more aggressive penalties that bridge the gap between $\ell_1$ (nuclear norm) and $\ell_0$ (rank constraint). We propose minimizing

$$f_{p,\lambda}(Z) = \frac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\sum_j p(\gamma_j(Z); \mu), \qquad (24)$$

where $p(|t|; \mu)$ is concave in $|t|$. The parameter $\mu \in [\mu_{\inf}, \mu_{\sup}]$ controls the degree of concavity. We may think of $p(|t|; \mu_{\inf}) = |t|$ ($\ell_1$ penalty) on one end and $p(|t|; \mu_{\sup}) = \|t\|_0$ ($\ell_0$ penalty) on the other. In particular, for the $\ell_0$ penalty we denote $f_{p,\lambda}(Z)$ by $f_{H,\lambda}(Z)$, for "hard" thresholding. See Friedman (2008), Mazumder et al. (2009) and Zhang (2010) for examples of such penalties.

Algorithm 2: HARD-IMPUTE

1. Initialize $\tilde Z_{\lambda_k}$, $k = 1,\ldots,K$ (for example, using SOFT-IMPUTE; see Section 7).
2. Do for $\lambda_1 > \lambda_2 > \ldots > \lambda_K$:
   (a) Repeat:
       i. Compute $Z^{new} \leftarrow S_{\lambda_k}^H(P_\Omega(X) + P_\Omega^\perp(Z^{old}))$.
       ii. If $\|Z^{new} - Z^{old}\|_F^2 \,/\, \|Z^{old}\|_F^2 < \varepsilon$, exit.
       iii. Assign $Z^{old} \leftarrow Z^{new}$.
   (b) Assign $\hat Z_{H,\lambda_k} \leftarrow Z^{new}$.
3. Output the sequence of solutions $\hat Z_{H,\lambda_1}, \ldots, \hat Z_{H,\lambda_K}$.

In Remark 1 in Appendix A.1 we argue how the proof can be modified for general types of spectral regularization.
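Before continuing, the promised sketch of the hard-thresholding operator $S_\lambda^H$ in Python (illustrative, mirroring the soft-thresholding helper in the Algorithm 1 sketch above):

import numpy as np

def hard_threshold_svd(W, lam):
    # S^H_lambda(W) = U diag(d_1, ..., d_q, 0, ..., 0) V': singular values
    # exceeding lam are kept unshrunk, the rest are zeroed.
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.where(d > lam, d, 0.0)) @ Vt

Substituting this operator for soft_threshold_svd in the SOFT-IMPUTE sketch, with SOFT-IMPUTE output as the warm start, gives the shape of Algorithm 2.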
Hence for minimizing the objective (24) we will look at the analogous version of (8), (23), which is

$$\text{minimize}_Z\; \frac{1}{2}\|W - Z\|_F^2 + \lambda\sum_j p(\gamma_j(Z); \mu).$$

The solution is given by a thresholded SVD of $W$,

$$S_\lambda^p(W) = U D_{p,\lambda} V',$$

where $D_{p,\lambda}$ is an entry-wise thresholding of the diagonal entries of the matrix $D$ consisting of the singular values of the matrix $W$. The exact form of the thresholding depends upon the form of the penalty function $p(\cdot;\cdot)$, as discussed in Remark 1. Algorithm 1 and Algorithm 2 can be modified for the penalty $p(\cdot;\mu)$ by using the more general thresholding function $S_\lambda^p(\cdot)$ in Step 2(a)i. The corresponding step becomes:

$$Z^{new} \leftarrow S_\lambda^p\left(P_\Omega(X) + P_\Omega^\perp(Z^{old})\right).$$

However, these types of spectral regularization make the criterion (24) non-convex and hence it becomes difficult to optimize globally. Recht et al. (2007) and Bach (2008) also consider the rank estimation problem from a theoretical standpoint.

7. Post-processing of "Selectors" and Initialization

Because the $\ell_1$ norm regularizes by shrinking the singular values, the number of singular values retained (through cross-validation, say) may exceed the actual rank of the matrix. In such cases it is reasonable to undo the shrinkage of the chosen models, which might permit a lower-rank solution. If $Z_\lambda$ is the solution to (10), then its post-processed version $Z_\lambda^u$, obtained by "unshrinking" the eigenvalues of the matrix $Z_\lambda$, is

$$\alpha = \arg\min_{\alpha_i \ge 0,\, i=1,\ldots,r_\lambda} \left\|P_\Omega(X) - \sum_{i=1}^{r_\lambda} \alpha_i P_\Omega(u_i v_i')\right\|^2, \qquad (25)$$
$$Z_\lambda^u = U D_\alpha V',$$

where $D_\alpha = \mathrm{diag}(\alpha_1, \ldots, \alpha_{r_\lambda})$. Here $r_\lambda$ is the rank of $Z_\lambda$ and $Z_\lambda = U D_\lambda V'$ is its SVD. The estimation in (25) can be done via ordinary least squares, which is feasible because of the sparsity of $P_\Omega(u_i v_i')$ and because $r_\lambda$ is small. (Observe that the $P_\Omega(u_i v_i')$, $i=1,\ldots,r_\lambda$, are not orthogonal, though the $u_i v_i'$ are.) If the least squares solutions $\alpha$ do not meet the positivity constraints, then the negative sign can be absorbed into the corresponding singular vector.

Rather than estimating a diagonal matrix $D_\alpha$ as above, one can insert a matrix $M_{r_\lambda \times r_\lambda}$ between $U$ and $V$ above to obtain better training error for the same rank. Hence, given $U$, $V$ (each of rank $r_\lambda$) from the SOFT-IMPUTE algorithm, we solve

$$\hat M = \arg\min_M \|P_\Omega(X) - P_\Omega(U M V')\|^2, \qquad (26)$$

where $\hat Z_\lambda = U \hat M V'$. The objective function in (26) is the Frobenius norm of an affine function of $M$ and hence can be optimized very efficiently. Scalability issues pertaining to the optimization problem (26) can be handled fairly efficiently via conjugate gradients. Criterion (26) will definitely lead to a decrease in training error relative to that attained by $\hat Z = U D_\lambda V'$ for the same rank, and is potentially an attractive proposal for the original problem (1). However, this heuristic cannot be cast as a (jointly) convex problem in $(U, M, V)$. In addition, it requires the estimation of up to $r_\lambda^2$ parameters, and has the potential for over-fitting. In this paper we report experiments based on (25).

In many simulated examples we have observed that this post-processing step gives a good estimate of the underlying true rank of the matrix (based on prediction error). Since fixed points of Algorithm 2 correspond to local minima of the function (24), well-chosen warm starts $\tilde Z_\lambda$ are helpful. A reasonable prescription for warm-starts is the nuclear norm solution via SOFT-IMPUTE, or the post-processed version (25). The latter appears to significantly speed up convergence for HARD-IMPUTE. This observation is based on our simulation studies.
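The unshrinking step (25) is an ordinary least-squares fit of new singular values on the observed entries. A minimal Python sketch, assuming $Z_\lambda = U\,\mathrm{diag}(d)\,V'$ has small rank $r_\lambda$; names are illustrative and the positivity handling follows the paper's remark rather than a constrained solver:

import numpy as np

def unshrink(X, mask, U, Vt):
    # Post-processing (25): refit the singular values by least squares on Omega.
    # Column i of A holds the observed entries of the rank-one layer u_i v_i'.
    r = U.shape[1]
    A = np.column_stack([np.outer(U[:, i], Vt[i])[mask] for i in range(r)])
    alpha, *_ = np.linalg.lstsq(A, X[mask], rcond=None)
    # The paper imposes alpha_i >= 0; a negative alpha_i can instead have its
    # sign absorbed into the corresponding singular vector.
    return (U * alpha) @ Vt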
8. SOFT-IMPUTE and Maximum-Margin Matrix Factorization

In this section we compare in detail the MMMF criterion (6) with the SOFT-IMPUTE criterion (3). For ease of comparison, we restate these criteria using our $P_\Omega$ notation. MMMF solves

$$\text{minimize}_{U,V}\; \frac{1}{2}\|P_\Omega(X - UV')\|_F^2 + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right), \qquad (27)$$

where $U_{m\times r'}$ and $V_{n\times r'}$ are arbitrary (non-orthogonal) matrices. This problem formulation and related optimization methods have been explored by Srebro et al. (2005b) and Rennie and Srebro (2005). SOFT-IMPUTE solves

$$\text{minimize}_Z\; \frac{1}{2}\|P_\Omega(X - Z)\|_F^2 + \lambda\|Z\|_*. \qquad (28)$$

For each given maximum rank, MMMF produces an estimate by doing further shrinkage with its quadratic regularization. SOFT-IMPUTE performs rank reduction and shrinkage at the same time, in one smooth convex operation. The following theorem shows that this one-dimensional SOFT-IMPUTE family lies exactly in the two-dimensional MMMF family.

Theorem 3. Let $X$ be $m\times n$ with observed entries indexed by $\Omega$.

1. Let $r' = \min(m,n)$. Then the solutions to (27) and (28) coincide for all $\lambda \ge 0$.
2. Suppose $\hat Z^*$ is a solution to (28) for $\lambda^* > 0$, and let $r^*$ be its rank. Then for any solution $\hat U, \hat V$ to (27) with $r' = r^*$ and $\lambda = \lambda^*$, $\hat U \hat V^T$ is a solution to (28). The SVD factorization of $\hat Z^*$ provides one such solution to (27). This implies that the solution space of (28) is contained in that of (27).

Remarks:

1. Part 1 of this theorem appears in a slightly different form in Srebro et al. (2005b).
2. In part 1, we could use $r' > \min(m,n)$ and get the same equivalence. While this might seem unnecessary, there may be computational advantages; searching over a bigger space might protect against local minima. Likewise in part 2, we could use $r' > r^*$ and achieve the same equivalence. In either case, no matter what $r'$ we use, the solution matrices $\hat U$ and $\hat V$ have the same rank as $\hat Z$.
3. Let $\hat Z(\lambda)$ be a solution to (28) at $\lambda$. We conjecture that $\operatorname{rank}[\hat Z(\lambda)]$ is monotone non-increasing in $\lambda$. If this is the case, then Theorem 3, part 2 can be further strengthened to say that for all $\lambda \ge \lambda^*$ and $r' = r^*$ the solutions of (27) coincide with those of (28).

The MMMF criterion (27) defines a two-dimensional family of models indexed by $(r', \lambda)$, while the SOFT-IMPUTE criterion (28) defines a one-dimensional family. In light of Theorem 3, this family is a special path in the two-dimensional grid of solutions $[\hat U_{(r',\lambda)}, \hat V_{(r',\lambda)}]$. Figure 1 depicts the situation. Any MMMF model at parameter combinations above the red squares is redundant, since its fit is the same as at the red square. However, in practice the red squares are not known to MMMF, nor is the actual rank of the solution. Further orthogonalization of $\hat U$ and $\hat V$ would be required to reveal the rank, which would only be approximate (depending on the convergence criterion of the MMMF algorithm).

Despite the equivalence of (27) and (28) when $r' = \min(m,n)$, the criteria are quite different. While (28) is a convex optimization problem in $Z$, (27) is a non-convex problem in the variables $U, V$ and has possibly several local minima; see also Abernethy et al. (2009). It has been observed empirically and theoretically (Burer and Monteiro, 2005; Rennie and Srebro, 2005) that bi-convex methods used in the optimization of (27) can get stuck in sub-optimal local minima for a small value of $r'$ or a poorly chosen starting point. (A quick numerical check of the factorization identity behind this equivalence follows below.)
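The identity $\|Z\|_* = \min_{U,V:\,Z=UV^T} \frac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$ (Lemma 6, stated below) can be verified numerically: the SVD factorization $U = U_s\sqrt{D}$, $V = V_s\sqrt{D}$ attains the minimum. A small self-contained Python check, illustrative and not from the paper:

import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 4))

nuc = np.linalg.svd(Z, compute_uv=False).sum()      # ||Z||_*

Us, d, Vt = np.linalg.svd(Z, full_matrices=False)
U = Us * np.sqrt(d)                                 # U = U_s sqrt(D)
V = Vt.T * np.sqrt(d)                               # V = V_s sqrt(D)

# Z = U V' and (1/2)(||U||_F^2 + ||V||_F^2) attains the nuclear norm.
assert np.allclose(Z, U @ V.T)
assert np.isclose(nuc, 0.5 * (np.sum(U ** 2) + np.sum(V ** 2)))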
For a large number of factors $r'$ and large dimensions $m, n$ the computational cost may be quite high (see also the experimental studies in Section 9.2). Criterion (28) is convex in $Z$ for every value of $\lambda$, and it outputs the solution $\hat Z$ in the form of its soft-thresholded SVD, implying that the "factors" $U, V$ are already orthogonal and the rank is known.

[Figure 1: Comparison of the parameter space for MMMF (grey and black points) and SOFT-IMPUTE (red squares) for a simple example; rank $r'$ is plotted against $\log\lambda$. Since all MMMF solutions with parameters above the red squares are identical to the SOFT-IMPUTE solutions at the red squares, all the grey points are redundant.]

MMMF has two different tuning parameters, $r'$ and $\lambda$, both of which are related to the rank or spectral properties of the matrices $U, V$. SOFT-IMPUTE has only one tuning parameter, $\lambda$. The presence of two tuning parameters is problematic:

• It results in a significant increase in computational burden, since for every given value of $r'$, one needs to compute an entire system of solutions by varying $\lambda$ (see Section 9 for illustrations).
• In practice, when neither the optimal value of $r'$ nor that of $\lambda$ is known, a two-dimensional search (for example, by cross validation) is required to select suitable values.

Further discussions of and connections between the tuning parameters and spectral properties of the matrices can be found in Burer and Monteiro (2005) and Abernethy et al. (2009).

The proof of Theorem 3 requires a lemma.

Lemma 6. For any matrix $Z$, the following holds:

$$\|Z\|_* = \min_{U,V:\, Z=UV^T} \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right). \qquad (29)$$

If $\operatorname{rank}(Z) = k \le \min\{m,n\}$, then the minimum above is attained at a factor decomposition $Z = U_{m\times k} V_{n\times k}^T$.

Note that in the decomposition $Z = UV^T$ in (29) there is no constraint on the number of columns $r$ of the factor matrices $U_{m\times r}$ and $V_{n\times r}$. Lemma 6 is stronger than similar results appearing in Rennie and Srebro (2005) and Abernethy et al. (2009), which establish (29) for $r = \min\{m,n\}$—we give a tighter estimate of the rank $k$ of the underlying matrices. The proof is given in Appendix A.5.

8.1 Proof of Theorem 3

Part 1. For $r = \min(m,n)$, any matrix $Z_{m\times n}$ can be written in the form $Z = UV^T$. The criterion (27) can be written as

$$\min_{U,V}\; \frac{1}{2}\|P_\Omega(X - UV^T)\|_F^2 + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right) \qquad (30)$$
$$= \min_{U,V}\; \frac{1}{2}\|P_\Omega(X - UV^T)\|_F^2 + \lambda\|UV^T\|_* \quad \text{(by Lemma 6)}$$
$$= \min_Z\; \frac{1}{2}\|P_\Omega(X - Z)\|_F^2 + \lambda\|Z\|_*. \qquad (31)$$

The equivalence of the criteria in (30) and (31) completes the proof of part 1.

Part 2. Note that if we know that the solution $\hat Z^*$ to (28) with $\lambda = \lambda^*$ has rank $r^*$, then $\hat Z^*$ also solves

$$\min_{Z,\, \operatorname{rank}(Z) = r^*}\; \frac{1}{2}\|P_\Omega(X - Z)\|_F^2 + \lambda\|Z\|_*.$$

We now repeat the steps (30)–(31), restricting the rank $r'$ of $U$ and $V$ to be $r' = r^*$, and the result follows.

9. Numerical Experiments and Comparisons

In this section we study the performance of SOFT-IMPUTE, its post-processed variants, and HARD-IMPUTE for noisy matrix completion problems. The examples assert our claim that the matrix reconstruction criterion (4) (Cai et al., 2008) is too rigid if one seeks good predictive models. We include the related procedures of Rennie and Srebro (2005) and Keshavan et al. (2009) in our comparisons.

The reconstruction algorithm OPTSPACE, described in Keshavan et al. (2009), considers criterion (1) (in the presence of noise). It uses the representation $Z = USV'$ (which need not correspond to the SVD).
OPTSPACE alternates between estimating $S$ and $U, V$ (in a Grassmann manifold) for computing a rank-$r$ decomposition $\hat Z = \hat U \hat S \hat V'$. It starts with a sparse SVD on a clean version of the observed matrix $P_\Omega(X)$. This is similar to the formulation of MMMF (27) as detailed in Section 8, without the squared Frobenius norm regularization on the components $U, V$.

To summarize, we study the following methods:

1. SOFT-IMPUTE — Algorithm 1;
2. SOFT-IMPUTE+ — post-processing on the output of SOFT-IMPUTE, as in Section 7;
3. HARD-IMPUTE — Algorithm 2, starting with the output of SOFT-IMPUTE+;
4. SVT — algorithm by Cai et al. (2008);
5. OPTSPACE — reconstruction algorithm by Keshavan et al. (2009);
6. MMMF — algorithm for (6) as in Rennie and Srebro (2005).

In all our simulation studies we use the underlying model $Z_{m\times n} = U_{m\times r} V'_{r\times n} + \varepsilon$, where $U$ and $V$ are random matrices with standard normal Gaussian entries, and $\varepsilon$ is i.i.d. Gaussian. $\Omega$ is uniformly random over the indices of the matrix with $p\%$ of missing entries. These are the models under which the coherence conditions hold true for the matrix completion problem to be meaningful (Candès and Tao, 2009; Keshavan et al., 2009). The signal to noise ratio for the model and the standardized test error are defined as

$$\mathrm{SNR} = \sqrt{\frac{\mathrm{var}(UV')}{\mathrm{var}(\varepsilon)}}; \qquad \text{Test Error} = \frac{\|P_\Omega^\perp(UV' - \hat Z)\|_F^2}{\|P_\Omega^\perp(UV')\|_F^2}.$$

The standardized training error is defined as

$$\text{Training Error} = \frac{\|P_\Omega(Z - \hat Z)\|_F^2}{\|P_\Omega(Z)\|_F^2},$$

the fraction of the error on the observed entries explained by the estimate, relative to a zero estimate. (This simulation setup is sketched in code after the first observations below.)

Figures 2, 3 and 4 show training and test error for all of the algorithms mentioned above—both as a function of nuclear norm and of rank—for three problem instances. The results displayed in the figures are averaged over 50 simulations, and also show one-standard-error bands (hardly visible). In all examples $(m,n) = (100,100)$. For MMMF we use $r' = \min(m,n) = 100$, the number of columns in $U$ and $V$. The performance of MMMF is displayed only in the plots with the nuclear norm along the horizontal axis, since the algorithm does not deliver a precise rank. SNR, true rank and percentage of missing entries are indicated in the figures. There is a unique correspondence between $\lambda$ and nuclear norm. The plots versus rank indicate how effective the nuclear norm is as a rank approximation—that is, whether it recovers the true rank while minimizing prediction error.

For routines not our own we use the MATLAB code as supplied on webpages by the authors: for SVT, the second author of Cai et al. (2008); for OPTSPACE, the third author of Keshavan et al. (2009); and for MMMF, the first author of Rennie and Srebro (2005).

9.1 Observations

The captions of Figures 2–4 detail the results, which we summarize here. For the first two figures, the noise is quite high with SNR = 1, and 50% of the entries are missing. In Figure 2 the true rank is 10, while in Figure 3 it is 6. SOFT-IMPUTE, MMMF and SOFT-IMPUTE+ have the best prediction performance, while SOFT-IMPUTE+ is better at estimating the correct rank. The other procedures perform poorly here, although OPTSPACE improves somewhat in Figure 3. SVT has very poor prediction error, suggesting once again that exactly fitting the training data is far too rigid. SOFT-IMPUTE+ has the best performance in Figure 3 (smaller rank—more aggressive fitting), and HARD-IMPUTE starts recovering here.
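As an aside, here is the promised minimal Python sketch of the simulation protocol and the standardized metrics defined above. The noise scaling is an assumption chosen to match the SNR definition, and all names are illustrative:

import numpy as np

def make_problem(m, n, r, snr, p_miss, seed=0):
    # Z = U V' + eps: Gaussian factors, uniformly random missing pattern.
    rng = np.random.default_rng(seed)
    signal = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
    noise = rng.normal(size=(m, n)) * (signal.std() / snr)  # so SNR = sd(signal)/sd(noise)
    mask = rng.random((m, n)) >= p_miss                     # True marks an observed entry
    return signal + noise, signal, mask

def test_error(signal, Z_hat, mask):
    # ||P_Omega^perp(UV' - Z_hat)||_F^2 / ||P_Omega^perp(UV')||_F^2
    unobs = ~mask
    return np.sum((signal - Z_hat)[unobs] ** 2) / np.sum(signal[unobs] ** 2)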
In both figures the training error for SOFT-IMPUTE (and hence MMMF) wins as a function of nuclear norm (as it must, by construction), but the more aggressive fitters SOFT-IMPUTE+ and HARD-IMPUTE have better training error as a function of rank. Though the nuclear norm is often viewed as a surrogate for the rank of a matrix, we see in these examples that it can provide a superior mechanism for regularization. This is similar to the performance of LASSO in the context of regression. Although the LASSO penalty can be viewed as a convex surrogate for the $\ell_0$ penalty in model selection, its $\ell_1$ penalty provides a smoother and often better basis for regularization.

[Figure 2 (50% missing entries with SNR = 1, true rank = 10): training and test error versus nuclear norm and versus rank for SOFT-IMPUTE, SOFT-IMPUTE+, HARD-IMPUTE, SVT, OPTSPACE and MMMF. Caption: SOFTIMP+ refers to the post-processing after SOFT-IMPUTE; HARD-IMPUTE uses SOFTIMP+ as starting values. Both SOFT-IMPUTE and SOFT-IMPUTE+ perform well (prediction error) in the presence of noise; the latter estimates the actual rank of the matrix. MMMF (with full rank 100 factor matrices) has performance similar to SOFT-IMPUTE. HARD-IMPUTE and OPTSPACE show poor prediction error. SVT also has poor prediction error, confirming our claim in this example that criterion (4) can result in overfitting; it recovers a matrix with high nuclear norm and rank > 60 where the true rank is only 10. Values of test error larger than one are not shown in the figure. OPTSPACE is evaluated for a series of ranks ≤ 30.]

[Figure 3 (50% missing entries with SNR = 1, true rank = 6): SOFT-IMPUTE+ has the best prediction error, closely followed by SOFT-IMPUTE and MMMF. Both HARD-IMPUTE and OPTSPACE have poor prediction error apart from near the true rank 6 of the matrix, where they show reasonable performance. SVT has very poor prediction error; it recovers a matrix with high nuclear norm and rank > 60, where the true rank is only 6. OPTSPACE is evaluated for a series of ranks ≤ 35.]

[Figure 4 (80% missing entries with SNR = 10, true rank = 5): With low noise the performance of HARD-IMPUTE improves. It gets the correct rank whereas OPTSPACE slightly overestimates the rank. HARD-IMPUTE has the best prediction error, followed by OPTSPACE. Here MMMF has slightly better prediction error than SOFT-IMPUTE. Although the noise is low here, SVT recovers a matrix with high rank (approximately 30) and has poor prediction error as well.
The test error of SVT is found to be different from the limiting solution of SOFT-IMPUTE; although in theory the limiting solution of (10) should coincide with that of SVT, in practice we never go to the limit.]

In Figure 4 with SNR = 10 the noise is relatively small compared to the other two cases. The true underlying rank is 5, but the proportion of missing entries is much higher, at eighty percent. Test errors of both SOFT-IMPUTE+ and SOFT-IMPUTE are found to decrease till a large nuclear norm, after which they become roughly the same, suggesting no further impact of regularization. MMMF has slightly better test error than SOFT-IMPUTE around a nuclear norm of 350, while in theory they should be identical. Notice, however, that the training error is slightly worse (everywhere), suggesting that MMMF is sometimes trapped in local minima. The fact that this slightly underfit solution does better in test error is a quirk of this particular example. OPTSPACE performs well in this high-SNR example, achieving a sharp minimum at the true rank of the matrix. HARD-IMPUTE performs the best in this example. The better performance of both OPTSPACE and HARD-IMPUTE over SOFT-IMPUTE can be attributed both to the low-rank truth and the high SNR. This is reminiscent of the better predictive performance of best-subset or concave penalized regression, often seen over LASSO in setups where the underlying model is very sparse (Friedman, 2008).

9.2 Comparison with Fast MMMF (Rennie and Srebro, 2005)

In this section we compare SOFT-IMPUTE with MMMF in terms of computational efficiency. We also examine the consequences of two regularization parameters $(r', \lambda)$ for MMMF versus one for SOFT-IMPUTE. Rennie and Srebro (2005) describe a fast algorithm based on conjugate-gradient descent for minimization of the MMMF criterion (6). With (6) being non-convex, it is hard to provide theoretical optimality guarantees for the algorithm for arbitrary $r', \lambda$—that is, what type of solution it converges to or how far it is from the global minimizer.

In Table 1 we summarize the performance results of the two algorithms. For both SOFT-IMPUTE and MMMF we consider an equi-spaced grid of 150 $\lambda \in [\lambda_{\min}, \lambda_{\max}]$, with $\lambda_{\min}$ corresponding to a full-rank solution of SOFT-IMPUTE and $\lambda_{\max}$ to the zero solution. For MMMF, three different values of $r'$ were used, and for each, $(\hat U, \hat V)$ were solved for over the grid of $\lambda$ values. A separate held-out validation set with twenty percent of the missing entries sampled from $\Omega^\perp$ was used to train the tuning parameter $\lambda$ (for each value of $r'$) for MMMF and SOFT-IMPUTE. Finally we evaluate the standardized prediction errors on a test set consisting of the remaining eighty percent of the missing entries in $\Omega^\perp$. In all cases we report the training errors and test errors at the optimally tuned $\lambda$. SOFT-IMPUTE was run till a tolerance of $10^{-4}$ was achieved (fraction of decrease of objective value). Likewise for MMMF we set the tolerance of the conjugate gradient method to $10^{-4}$.

In Table 1, for every algorithm, total time indicates the time required for evaluating solutions over the entire grid of $\lambda$ values. In these examples we used direct SVD-factorization-based methods for the SVD computation, since the sizes of the problems were quite small. In all these examples we observe that SOFT-IMPUTE performs very favorably in terms of total times.
For MMMF the time to train the models increases with increasing rank $r'$; and in case the underlying matrix has rank larger than $r'$, the computational cost will be large in order to get competitive predictive accuracy. This point is supported by the examples in Table 1. It is important to note that the prediction error of SOFT-IMPUTE as obtained on the validation set is actually within one standard error of the best prediction error produced by all the MMMF models.

Table 1: Performances of SOFT-IMPUTE and MMMF for different problem instances, in terms of test error (with standard errors in parentheses), training error and times for learning the models. For SOFT-IMPUTE, "rank" denotes the rank of the recovered matrix at the optimally chosen value of $\lambda$; for MMMF, "rank" indicates the value of $r'$ in $U_{m\times r'}$, $V_{n\times r'}$. Results are averaged over 50 simulations.

Data: $(m,n) = (10^2, 10^2)$, $|\Omega| = 5\times 10^3$ (50%), SNR = 3, rank = 30
  SOFT-IMPUTE (39):  test 0.7238 (0.0027), train 0.0720, time 4.1745 s
  MMMF (20):         test 0.7318 (0.0031), train 0.2875, time 48.1090 s
  MMMF (60):         test 0.7236 (0.0027), train 0.1730, time 62.0230 s
  MMMF (100):        test 0.7237 (0.0027), train 0.1784, time 96.2750 s

Data: $(m,n) = (10^2, 10^2)$, $|\Omega| = 2\times 10^3$ (20%), SNR = 10, rank = 10
  SOFT-IMPUTE (37):  test 0.5877 (0.0047), train 0.0017, time 4.0976 s
  MMMF (20):         test 0.5807 (0.0049), train 0.0186, time 53.7533 s
  MMMF (60):         test 0.5823 (0.0049), train 0.0197, time 62.0230 s
  MMMF (100):        test 0.5823 (0.0049), train 0.0197, time 84.0375 s

Data: $(m,n) = (10^2, 10^2)$, $|\Omega| = 8\times 10^3$ (80%), SNR = 10, rank = 45
  SOFT-IMPUTE (59):  test 0.6008 (0.0028), train 0.0086, time 3.8447 s
  MMMF (20):         test 0.6880 (0.0029), train 0.4037, time 33.8685 s
  MMMF (60):         test 0.5999 (0.0028), train 0.0275, time 57.3488 s
  MMMF (100):        test 0.5999 (0.0028), train 0.0275, time 89.4525 s

In addition, we also performed some medium-scale examples, increasing the dimensions of the matrices. To make comparisons fair, SOFT-IMPUTE made use of direct SVD computations (in MATLAB) instead of iterative algorithms exploiting the specialized Sparse + Low-Rank structure (22). We report our findings on one such simulation example:

• For $(m,n) = (2000, 1000)$, $|\Omega|/(mn) = 0.2$, rank = 500 and SNR = 10, SOFT-IMPUTE takes 1.29 hours to compute solutions on a grid of 100 $\lambda$ values. The test error on the validation set and the training error are 0.9630 and 0.4375, with the recovered solution having a rank of 225. For the same problem, MMMF with $r' = 200$ takes 6.67 hours, returning a solution with test error 0.9678 and training error 0.6624. With $r' = 400$ it takes 12.89 hours, with test and training errors 0.9659 and 0.6564 respectively.

We would like to note that DeCoste (2006) proposed an efficient implementation of MMMF via an ensemble based approach, which is quite different in spirit from the batch optimization algorithms we are studying in this paper. Hence we do not compare it with SOFT-IMPUTE.

9.3 Comparison with Nesterov's Accelerated Gradient Method

Ji and Ye (2009) proposed a first-order algorithm based on Nesterov's acceleration scheme (Nesterov, 2007) for nuclear norm minimization for a generic multi-task learning problem (Argyriou et al., 2008, 2007). Their algorithm (Liu et al., 2009; Ji and Ye, 2009) can be adapted to the SOFT-IMPUTE problem (10); hereafter we refer to it as NESTEROV. It requires one to compute the SVD of a dense matrix having the dimensions of $X$, which makes it prohibitive for large-scale problems. We instead make use of the structure (22) for the SVD computation, a special characteristic of matrix completion which is not present in a generic multi-task learning problem. Here we compare the performances of SOFT-IMPUTE and NESTEROV on small scale examples, where direct SVDs can be computed easily (a generic sketch of the accelerated scheme follows).
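The acceleration idea can be sketched generically: problem (10) is a composite of a smooth loss with Lipschitz constant 1 and the nuclear norm, so a FISTA-style accelerated proximal-gradient iteration applies the soft-thresholded SVD to a momentum iterate. This is a hedged sketch of the scheme's shape, not the implementation of Ji and Ye (2009); it reuses soft_threshold_svd from the Algorithm 1 sketch above.

import numpy as np

def fista_nuclear(X, mask, lam, n_iter=200):
    # Accelerated proximal gradient for (10). The smooth part
    # 0.5 * ||P_Omega(X - Z)||_F^2 has Lipschitz constant 1, so the prox step
    # is exactly the SOFT-IMPUTE update applied to the momentum iterate Y.
    # Assumes soft_threshold_svd from the Algorithm 1 sketch above.
    Z = np.zeros_like(X)
    Y, t = Z.copy(), 1.0
    for _ in range(n_iter):
        Z_new = soft_threshold_svd(np.where(mask, X, Y), lam)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = Z_new + ((t - 1.0) / t_new) * (Z_new - Z)
        Z, t = Z_new, t_new
    return Z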
9.3 Comparison with Nesterov's Accelerated Gradient Method

Ji and Ye (2009) proposed a first-order algorithm based on Nesterov's acceleration scheme (Nesterov, 2007) for nuclear norm minimization in a generic multi-task learning problem (Argyriou et al., 2008, 2007). Their algorithm (Liu et al., 2009; Ji and Ye, 2009) can be adapted to the SOFT-IMPUTE problem (10); hereafter we refer to it as NESTEROV. It requires computing the SVD of a dense matrix having the dimensions of X, which makes it prohibitive for large-scale problems. We instead make use of the structure (22) for the SVD computation, a special characteristic of matrix completion that is not present in a generic multi-task learning problem. Here we compare the performances of SOFT-IMPUTE and NESTEROV on small-scale examples, where direct SVDs can be computed easily.

Since both algorithms solve the same criterion, the quality of the solutions (objective values, training and test errors) will be the same, within tolerance. We hence compare their performance by the time taken to converge to the optimal solution of (10) on a grid of values of λ. Both algorithms compute a path of solutions using warm starts. Results are shown in Figure 5, for the four scenarios described in Table 2.

Example   (m, n)        |Ω|/(mn)   Rank   Test error
i         (100, 100)    0.5        5      0.4757
ii        (100, 100)    0.2        5      0.0668
iii       (100, 100)    0.8        5      0.0022
iv        (1000, 500)   0.5        30     0.0028

Table 2: Four different examples used for timing comparisons of SOFT-IMPUTE and NESTEROV (the accelerated gradient algorithm of Ji and Ye 2009). In all cases the SNR = 10.

Figure 5 shows the time to convergence for the two algorithms. Their respective numbers of iterations are not comparable: NESTEROV uses a line search to compute an adaptive step size (approximating the Lipschitz constant) at every iteration, whereas SOFT-IMPUTE does not. SOFT-IMPUTE has the rate of convergence given by Theorem 2, which for large k is worse than the O(1/k^2) rate of the accelerated method NESTEROV. However, the timing comparisons in Figure 5 show that SOFT-IMPUTE performs very favorably. We do not know the exact reason behind this, but mention some possibilities. Firstly, these are worst-case convergence rates; on particular problem instances of the form (10), the rates of convergence in practice of SOFT-IMPUTE and NESTEROV may be quite similar. Secondly, since Ji and Ye (2009) use an adaptive step-size strategy, the choice of a step size may be time consuming; SOFT-IMPUTE, on the other hand, uses a constant step size. Additionally, it appears that the use of the momentum term in NESTEROV interferes with the Sparse+Low-Rank decomposition (22); this may prevent the algorithm from being adapted to large problems, due to costly SVD computations. The times for NESTEROV also change far more erratically with λ than they do for SOFT-IMPUTE.

[Figure 5: Timing comparisons of SOFT-IMPUTE and NESTEROV (accelerated gradient algorithm of Ji and Ye 2009). The horizontal axis is the standardized nuclear norm log(‖Ẑ_λ‖_*/C + 1), with C = max_λ ‖Ẑ_λ‖_*; the vertical axis is time in seconds. Shown are the times till convergence for the two algorithms over an entire grid of λ values for examples i-iv (in the last the matrix dimensions are much larger). The overall time differences between examples i-iii and example iv are due to the increased cost of the SVD computations. Results are averaged over 10 simulations.]

9.4 Large Scale Simulations for SOFT-IMPUTE

Table 3 reports the performance of SOFT-IMPUTE on some large-scale problems. All computations are performed in MATLAB, using the MATLAB implementation of PROPACK. Data input, access and transfer in MATLAB take a sizable portion of the total computational time, given the size of these problems. However, the main computational bottleneck in our algorithm is the structured SVD computation. In order to focus on this essential computational task, Table 3 displays the total time required to perform the SVD computations over all iterations of the algorithm. Note that for all the examples considered in Table 3, the implementations of the algorithms NESTEROV (Liu et al., 2009; Ji and Ye, 2009) and MMMF (Rennie and Srebro, 2005) are prohibitively expensive in both computational time and memory, and hence could not be run. We used the value λ = ‖P_Ω(X)‖_2 / K with SOFT-IMPUTE, with K = 1.5 for all examples but the last, where K = 2. Here λ_0 = ‖P_Ω(X)‖_2 is the largest singular value of the input matrix X (padded with zeros); this is the smallest value of λ for which S_{λ_0}(P_Ω(X)) = 0 in the first iteration of SOFT-IMPUTE (Section 3).
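Both the default choice λ_0 = ‖P_Ω(X)‖_2 and the structured SVD are easy to express with an off-the-shelf iterative SVD routine (svds below stands in for PROPACK). The sketch builds a linear operator for the Sparse + Low-Rank structure (22), P_Ω(X) + P⊥_Ω(Z) = {P_Ω(X) − P_Ω(Z)} + Z, whose matrix-vector products cost O(|Ω| + (m + n)r) rather than O(mn); the variable names and the toy data are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, svds

def structured_operator(S, U, d, Vt):
    """Linear operator for S + U diag(d) Vt, the Sparse + Low-Rank form (22).

    S holds the sparse part P_Omega(X) - P_Omega(Z); (U, d, Vt) are the
    low-rank factors of the current iterate Z. Matvecs never densify.
    """
    m, n = S.shape
    return LinearOperator(
        (m, n),
        matvec=lambda x: S @ x + U @ (d * (Vt @ x)),
        rmatvec=lambda y: S.T @ y + Vt.T @ (d * (U.T @ y)),
    )

rng = np.random.default_rng(0)
m, n, nnz = 1000, 800, 5000
rows, cols = rng.integers(0, m, nnz), rng.integers(0, n, nnz)
PX = csr_matrix((rng.normal(size=nnz), (rows, cols)), shape=(m, n))

# lambda_0 = ||P_Omega(X)||_2, the largest singular value of the padded X:
# the smallest lambda for which S_lambda(P_Omega(X)) = 0.
lam0 = svds(PX, k=1, return_singular_vectors=False)[0]
lam = lam0 / 1.5                     # K = 1.5, as used for most of Table 3

# top singular triplets of the iterate's argument, never formed densely
Uz, dz, Vzt = rng.normal(size=(m, 5)), rng.random(5), rng.normal(size=(5, n))
u, s, vt = svds(structured_operator(PX, Uz, dz, Vzt), k=10)
```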
The prediction performance is awful for all but one of the models, because in most cases the fraction of observed data is very small. These simulations were mainly to show the computational capabilities of SOFT-IMPUTE on very large problems.

(m, n)           |Ω|       |Ω|/(mn) ×100%   Recovered rank   Time (mins)   Test error   Training error
(10^4, 10^4)     10^5      0.1              40*              0.2754        0.9946       0.6160
(10^4, 10^4)     10^5      0.1              40*              0.3770        0.9959       0.6217
(10^4, 10^4)     10^5      0.1              50*              0.4292        0.9962       0.3862
(10^4, 10^4)     10^6      1.0              5                1.6664        0.6930       0.6600
(10^5, 10^5)     5×10^6    0.05             40*              17.2518       0.9887       0.8156
(10^5, 10^5)     10^7      0.1              5                20.3142       0.9803       0.9761
(10^6, 10^5)     10^8      0.1              5                259.9620      0.9913       0.9901
(10^6, 10^6)     10^7      0.001            20*              99.6780       0.9998       0.5834

Table 3: Performance of SOFT-IMPUTE on different problem instances. All models are generated with SNR = 10 and underlying rank 5. Recovered rank is the rank of the solution matrix Ẑ at the value of λ used in (10). Entries with stars reached the "maximum rank" threshold, an option in our algorithm. The convergence criterion is "fraction of improvement of objective value" less than 10^-4, or a maximum of 15 iterations for the last four examples. All implementations are in MATLAB, including the MATLAB implementation of PROPACK, on an Intel Xeon Linux 3GHz processor.

10. Application to the Netflix Data Set

The Netflix training data consists of the ratings of 17,770 movies by 480,189 Netflix customers. The resulting data matrix is extremely sparse, with 100,480,507, or about 1%, of the entries observed. The task was to predict the unseen ratings for a qualifying set and a test set of about 1.4 million ratings each, with the true ratings in these data sets held in secret by Netflix. A probe set of about 1.4 million ratings was distributed to participants, for calibration purposes. The movies and customers in the qualifying, test and probe sets are all subsets of those in the training set. The ratings are integers from 1 (poor) to 5 (best). Netflix's own algorithm has an RMSE of 0.9525, and the contest goal was to improve this by 10%, to an RMSE of 0.8572 or better. The contest ran for about 3 years, and the winning team was "Bellkor's Pragmatic Chaos", a merger of three earlier teams (see http://www.netflixprize.com/ for details). They claimed the grand prize of $1M on September 21, 2009.

Many of the competitive algorithms build on a regularized low-rank factor model similar to (6), using randomization schemes like mini-batch, stochastic gradient descent or sub-sampling to reduce the computational cost of making several passes over the entire data set (see Salakhutdinov et al., 2007; Bell and Koren, 2007; Takacs et al., 2009, for example). In this paper our focus is not on randomized or sub-sampling schemes. Here we demonstrate that our nuclear-norm regularization algorithm can be applied in batch mode to the entire Netflix training set with a reasonable computation time.
We note, however, that the conditions under which nuclear-norm regularization is theoretically meaningful (Candès and Tao, 2009; Srebro et al., 2005a) are not met on the Netflix data set.

We applied SOFT-IMPUTE to the training data matrix and then performed a least-squares un-shrinking of the singular values, with the singular vectors and the training-data row and column means as the bases. The latter was performed on a data set of size 10^5 randomly drawn from the probe set. The prediction error (RMSE) is obtained on a left-out portion of the probe set. Table 4 reports the performance of the procedure for different choices of the tuning parameter λ (and the corresponding rank); times indicate the total time taken for the SVD computations over all iterations. A maximum of 10 iterations was performed for each of the examples. Again, these results are not competitive with those of the competition leaders, but rather demonstrate the feasibility of applying SOFT-IMPUTE to such a large data set.

λ         Rank   Time (hrs)   RMSE
λ_0/250   42     1.36         0.9622
λ_0/300   66     2.21         0.9572
λ_0/500   81     2.83         0.9543
λ_0/600   95     3.27         0.9497

Table 4: Results of applying SOFT-IMPUTE to the Netflix data; λ_0 = ‖P_Ω(X)‖_2 (see Section 9.4). The computations were done on an Intel Xeon Linux 3GHz processor; timings are reported based on MATLAB implementations of PROPACK and our algorithm. RMSE is root mean squared error, as defined in the text.

Acknowledgments

We thank the reviewers for suggestions that led to improvements in this paper. We thank Stephen Boyd, Emmanuel Candès, Andrea Montanari, and Nathan Srebro for helpful discussions. Trevor Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health. Robert Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

Appendix A. Proofs

We begin with the proof of Lemma 1.

A.1 Proof of Lemma 1

Proof. Let Z = \tilde{U}_{m \times n} \tilde{D}_{n \times n} \tilde{V}'_{n \times n} be the SVD of Z, and assume without loss of generality that m \ge n. We explicitly evaluate the closed-form solution of problem (8). Note that

    \frac{1}{2}\|Z - W\|_F^2 + \lambda\|Z\|_* = \frac{1}{2}\Big\{ \|W\|_F^2 - 2\sum_{i=1}^n \tilde{d}_i \tilde{u}_i' W \tilde{v}_i + \sum_{i=1}^n \tilde{d}_i^2 \Big\} + \lambda \sum_{i=1}^n \tilde{d}_i,    (32)

where \tilde{D} = diag[\tilde{d}_1, \dots, \tilde{d}_n], \tilde{U} = [\tilde{u}_1, \dots, \tilde{u}_n], \tilde{V} = [\tilde{v}_1, \dots, \tilde{v}_n]. Minimizing (32) is equivalent to minimizing

    -2\sum_{i=1}^n \tilde{d}_i \tilde{u}_i' W \tilde{v}_i + \sum_{i=1}^n \tilde{d}_i^2 + \sum_{i=1}^n 2\lambda \tilde{d}_i

with respect to (\tilde{u}_i, \tilde{v}_i, \tilde{d}_i), i = 1, \dots, n, under the constraints \tilde{U}'\tilde{U} = I_n, \tilde{V}'\tilde{V} = I_n and \tilde{d}_i \ge 0 for all i. Observe that this is equivalent to minimizing, with respect to (\tilde{U}, \tilde{V}), the function

    Q(\tilde{U}, \tilde{V}) = \min_{\tilde{D} \ge 0}\; \frac{1}{2}\Big\{ -2\sum_{i=1}^n \tilde{d}_i \tilde{u}_i' W \tilde{v}_i + \sum_{i=1}^n \tilde{d}_i^2 \Big\} + \lambda \sum_{i=1}^n \tilde{d}_i.    (33)

Since the objective in (33) is separable in \tilde{d}_i, i = 1, \dots, n, it suffices to minimize with respect to each \tilde{d}_i separately. The problem

    \min_{\tilde{d}_i \ge 0}\; \frac{1}{2}\big\{ -2\tilde{d}_i \tilde{u}_i' W \tilde{v}_i + \tilde{d}_i^2 \big\} + \lambda \tilde{d}_i    (34)

can be solved by examining the stationarity conditions of the function via its sub-gradient (Nesterov, 2003). The solution is the soft-thresholding of \tilde{u}_i' W \tilde{v}_i:

    S_\lambda(\tilde{u}_i' W \tilde{v}_i) = (\tilde{u}_i' W \tilde{v}_i - \lambda)_+

(without loss of generality we can take \tilde{u}_i' W \tilde{v}_i to be non-negative). More generally, the soft-thresholding operator (Friedman et al., 2007; Hastie et al., 2009) is given by S_\lambda(x) = sgn(x)(|x| - \lambda)_+. See Friedman et al. (2007) for a more elaborate discussion of how the soft-thresholding operator arises in univariate penalized least-squares problems with the \ell_1 penalty.
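Before returning to the proof: the scalar operator just described and its matrix analogue from Lemma 1 take only a few lines. This is an illustrative sketch (names are ours), with the rank-r(λ) truncation made explicit.

```python
import numpy as np

def soft_threshold(x, lam):
    """Scalar (or elementwise) operator S_lam(x) = sgn(x) * (|x| - lam)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def soft_threshold_matrix(W, lam):
    """Matrix operator of Lemma 1: shrink the singular values of W by lam."""
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    d = np.maximum(d - lam, 0.0)            # D_lam = diag((d_i - lam)_+)
    r = int(np.count_nonzero(d))            # the solution has rank r(lam)
    return (U[:, :r] * d[:r]) @ Vt[:r]
```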
Plugging the optimal values of \tilde{d}_i, i = 1, \dots, n, obtained from (34) into (33), we get

    Q(\tilde{U}, \tilde{V}) = \frac{1}{2}\|W\|_F^2 - \frac{1}{2}\sum_{i=1}^n \Big\{ 2(\tilde{u}_i' W \tilde{v}_i - \lambda)_+ (\tilde{u}_i' W \tilde{v}_i - \lambda) - (\tilde{u}_i' W \tilde{v}_i - \lambda)_+^2 \Big\}.    (35)

Minimizing Q(\tilde{U}, \tilde{V}) with respect to (\tilde{U}, \tilde{V}) is therefore equivalent to maximizing

    \sum_{i=1}^n \Big\{ 2(\tilde{u}_i' W \tilde{v}_i - \lambda)_+ (\tilde{u}_i' W \tilde{v}_i - \lambda) - (\tilde{u}_i' W \tilde{v}_i - \lambda)_+^2 \Big\} = \sum_{\tilde{u}_i' W \tilde{v}_i > \lambda} (\tilde{u}_i' W \tilde{v}_i - \lambda)^2.    (36)

It is a standard fact that for every i the problem

    \max_{\|u\|_2^2 \le 1,\; \|v\|_2^2 \le 1} u'Wv \quad \text{such that } u \perp \{\hat{u}_1, \dots, \hat{u}_{i-1}\},\; v \perp \{\hat{v}_1, \dots, \hat{v}_{i-1}\}

is solved by \hat{u}_i, \hat{v}_i, the left and right singular vectors of the matrix W corresponding to its i-th largest singular value, and the maximum value equals that singular value. It is easy to see that maximizing the expression on the right of (36) with respect to (\tilde{u}_i, \tilde{v}_i), i = 1, \dots, n, is equivalent to maximizing the individual terms \tilde{u}_i' W \tilde{v}_i. If r(\lambda) denotes the number of singular values of W larger than \lambda, then the (\tilde{u}_i, \tilde{v}_i) that maximize (36) correspond to [u_1, \dots, u_{r(\lambda)}] and [v_1, \dots, v_{r(\lambda)}], the r(\lambda) left and right singular vectors of W corresponding to its largest singular values. From (34) the optimal \tilde{D} = diag[\tilde{d}_1, \dots, \tilde{d}_n] is given by D_\lambda = diag[(d_1 - \lambda)_+, \dots, (d_n - \lambda)_+]. Since the rank of W is r, the minimizer \hat{Z} of (8) is given by U D_\lambda V' as in (9).

Remark 1. For a more general spectral regularization of the form \lambda \sum_i p(\gamma_i(Z)) (as compared to \sum_i \lambda \gamma_i(Z) used above), the optimization problem (34) is modified accordingly. The solution of the resulting univariate minimization problem is given by S^p_\lambda(\tilde{u}_i' W \tilde{v}_i) for some generalized "thresholding operator" S^p_\lambda(\cdot), where

    S^p_\lambda(\tilde{u}_i' W \tilde{v}_i) = \arg\min_{\tilde{d}_i \ge 0}\; \frac{1}{2}\big\{ -2\tilde{d}_i \tilde{u}_i' W \tilde{v}_i + \tilde{d}_i^2 \big\} + \lambda p(\tilde{d}_i).

The optimization problem analogous to (35) becomes

    \min_{\tilde{U}, \tilde{V}}\; \frac{1}{2}\Big\{ \|W\|_F^2 - 2\sum_{i=1}^n \hat{d}_i \tilde{u}_i' W \tilde{v}_i + \sum_{i=1}^n \hat{d}_i^2 \Big\} + \lambda \sum_i p(\hat{d}_i),    (37)

where \hat{d}_i = S^p_\lambda(\tilde{u}_i' W \tilde{v}_i) for all i. Any spectral penalty for which (37) is monotonically increasing in \tilde{u}_i' W \tilde{v}_i for every i can be handled by an argument similar to the proof above. The solution corresponds to the first few largest left and right singular vectors of W, with the optimal singular values given by the relevant shrinkage/threshold operator S^p_\lambda(\cdot) applied to the singular values of W. In particular, for the indicator function p(t) = 1(t ≠ 0), the solution consists of the top few singular values (un-shrunk) and the corresponding singular vectors.
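As an illustration of Remark 1, the sketch below implements the generalized operator S^p_λ for the soft penalty and for the indicator penalty. For the indicator, solving the univariate problem analogous to (34) with penalty λ·1(d ≠ 0) compares the value −w²/2 + λ at the stationary point d = w against 0 at d = 0, so a singular value is kept un-shrunk exactly when it exceeds √(2λ); that threshold level is our own calculation, not stated in the remark.

```python
import numpy as np

def generalized_threshold_svd(W, lam, penalty="hard"):
    """Spectral operator S^p_lam of Remark 1 for two penalties p.

    'soft': p(t) = t         -> d_i shrunk to (d_i - lam)_+   (nuclear norm)
    'hard': p(t) = 1(t != 0) -> d_i kept un-shrunk iff d_i > sqrt(2*lam),
            comparing -d_i^2/2 + lam at the stationary point against 0.
    """
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    if penalty == "soft":
        d = np.maximum(d - lam, 0.0)
    else:
        d = np.where(d > np.sqrt(2.0 * lam), d, 0.0)
    return (U * d) @ Vt
```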
A.2 Proof of Lemma 3

This proof is based on sub-gradient characterizations and is inspired by techniques used in Cai et al. (2008).

Proof. From Lemma 1, we know that if \hat{Z} solves problem (8), then it satisfies the sub-gradient stationarity condition

    0 \in -(W - \hat{Z}) + \lambda \partial\|\hat{Z}\|_*.    (38)

S_\lambda(W_1) and S_\lambda(W_2) solve problem (8) with W = W_1 and W = W_2 respectively; hence (38) holds with W = W_1, \hat{Z}_1 = S_\lambda(W_1), and with W = W_2, \hat{Z}_2 = S_\lambda(W_2). The sub-gradients of the nuclear norm \|Z\|_* are given by

    \partial\|Z\|_* = \{ UV' + \omega : \omega \in \mathbb{R}^{m \times n},\; U'\omega = 0,\; \omega V = 0,\; \|\omega\|_2 \le 1 \},    (39)

where Z = UDV' is the SVD of Z. Let p(\hat{Z}_i) denote an element of \partial\|\hat{Z}_i\|_*. Then

    \hat{Z}_i - W_i + \lambda p(\hat{Z}_i) = 0, \quad i = 1, 2.

The above gives

    (\hat{Z}_1 - \hat{Z}_2) - (W_1 - W_2) + \lambda (p(\hat{Z}_1) - p(\hat{Z}_2)) = 0,    (40)

from which we obtain

    \langle \hat{Z}_1 - \hat{Z}_2, \hat{Z}_1 - \hat{Z}_2 \rangle - \langle W_1 - W_2, \hat{Z}_1 - \hat{Z}_2 \rangle + \lambda \langle p(\hat{Z}_1) - p(\hat{Z}_2), \hat{Z}_1 - \hat{Z}_2 \rangle = 0,

where \langle a, b \rangle = trace(a'b). Now observe that

    \langle p(\hat{Z}_1) - p(\hat{Z}_2), \hat{Z}_1 - \hat{Z}_2 \rangle = \langle p(\hat{Z}_1), \hat{Z}_1 \rangle - \langle p(\hat{Z}_1), \hat{Z}_2 \rangle - \langle p(\hat{Z}_2), \hat{Z}_1 \rangle + \langle p(\hat{Z}_2), \hat{Z}_2 \rangle.    (41)

By the characterization of sub-gradients in (39), we have

    \langle p(\hat{Z}_i), \hat{Z}_i \rangle = \|\hat{Z}_i\|_* \quad \text{and} \quad \|p(\hat{Z}_i)\|_2 \le 1, \quad i = 1, 2.

This implies

    |\langle p(\hat{Z}_i), \hat{Z}_j \rangle| \le \|p(\hat{Z}_i)\|_2 \|\hat{Z}_j\|_* \le \|\hat{Z}_j\|_* \quad \text{for } i \ne j \in \{1, 2\}.

Using these inequalities in (41) we obtain

    \langle p(\hat{Z}_1), \hat{Z}_1 \rangle + \langle p(\hat{Z}_2), \hat{Z}_2 \rangle = \|\hat{Z}_1\|_* + \|\hat{Z}_2\|_*,    (42)
    -\langle p(\hat{Z}_1), \hat{Z}_2 \rangle - \langle p(\hat{Z}_2), \hat{Z}_1 \rangle \ge -\|\hat{Z}_2\|_* - \|\hat{Z}_1\|_*.    (43)

Using (42, 43) we see that the right-hand side of (41) is non-negative. Hence

    \langle p(\hat{Z}_1) - p(\hat{Z}_2), \hat{Z}_1 - \hat{Z}_2 \rangle \ge 0.

Using this in (40), we obtain

    \|\hat{Z}_1 - \hat{Z}_2\|_F^2 = \langle \hat{Z}_1 - \hat{Z}_2, \hat{Z}_1 - \hat{Z}_2 \rangle \le \langle W_1 - W_2, \hat{Z}_1 - \hat{Z}_2 \rangle.    (44)

Applying the Cauchy-Schwarz inequality \langle \hat{Z}_1 - \hat{Z}_2, W_1 - W_2 \rangle \le \|\hat{Z}_1 - \hat{Z}_2\|_F \|W_1 - W_2\|_F in (44), we get

    \|\hat{Z}_1 - \hat{Z}_2\|_F^2 \le \|\hat{Z}_1 - \hat{Z}_2\|_F \|W_1 - W_2\|_F,

which simplifies to

    \|W_1 - W_2\|_F^2 \ge \|\hat{Z}_1 - \hat{Z}_2\|_F^2 = \|S_\lambda(W_1) - S_\lambda(W_2)\|_F^2.

A.3 Proof of Lemma 4

Proof. We first show (15) via the following inequalities:

    \|Z_\lambda^{k+1} - Z_\lambda^k\|_F^2 = \|S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k)) - S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{k-1}))\|_F^2
        \le \|(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^k)) - (P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{k-1}))\|_F^2    (by Lemma 3)
        = \|P_\Omega^\perp(Z_\lambda^k - Z_\lambda^{k-1})\|_F^2    (45)
        \le \|Z_\lambda^k - Z_\lambda^{k-1}\|_F^2.    (46)

This implies that the sequence \|Z_\lambda^k - Z_\lambda^{k-1}\|_F^2 converges (it is decreasing and bounded below). We must still show that it converges to zero. The convergence of \|Z_\lambda^k - Z_\lambda^{k-1}\|_F^2 implies that

    \|Z_\lambda^{k+1} - Z_\lambda^k\|_F^2 - \|Z_\lambda^k - Z_\lambda^{k-1}\|_F^2 \to 0 \quad \text{as } k \to \infty.

This observation, together with the inequalities (45, 46), gives

    \|P_\Omega^\perp(Z_\lambda^k - Z_\lambda^{k-1})\|_F^2 - \|Z_\lambda^k - Z_\lambda^{k-1}\|_F^2 \to 0 \;\Longrightarrow\; P_\Omega(Z_\lambda^k - Z_\lambda^{k-1}) \to 0    (47)

as k \to \infty. Lemma 2 shows that the non-negative sequence f_\lambda(Z_\lambda^k) is decreasing in k, so as k \to \infty the sequence f_\lambda(Z_\lambda^k) converges. Furthermore, from (13, 14) we have

    Q_\lambda(Z_\lambda^{k+1} | Z_\lambda^k) - Q_\lambda(Z_\lambda^{k+1} | Z_\lambda^{k+1}) \to 0 \quad \text{as } k \to \infty,

which implies that

    \|P_\Omega^\perp(Z_\lambda^k) - P_\Omega^\perp(Z_\lambda^{k+1})\|_F^2 \to 0 \quad \text{as } k \to \infty.

Together with (47) this gives Z_\lambda^k - Z_\lambda^{k-1} \to 0 as k \to \infty. This completes the proof.
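Lemmas 3 and 4 lend themselves to quick numerical sanity checks. The sketch below (reusing the soft_threshold_matrix helper sketched after Lemma 1 above) verifies the non-expansiveness of S_λ(·) and the monotone decrease of the successive differences ‖Z_λ^{k+1} − Z_λ^k‖_F on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 40, 30, 0.7
W1, W2 = rng.normal(size=(m, n)), rng.normal(size=(m, n))

# Lemma 3: S_lam is non-expansive in the Frobenius norm.
lhs = np.linalg.norm(soft_threshold_matrix(W1, lam) - soft_threshold_matrix(W2, lam))
assert lhs <= np.linalg.norm(W1 - W2) + 1e-10

# Lemma 4, displays (45)-(46): ||Z^{k+1} - Z^k||_F is non-increasing in k.
X, mask = rng.normal(size=(m, n)), rng.random((m, n)) < 0.5
Z, diffs = np.zeros((m, n)), []
for _ in range(50):
    Z_new = soft_threshold_matrix(np.where(mask, X, Z), lam)
    diffs.append(np.linalg.norm(Z_new - Z))
    Z = Z_new
assert all(a >= b - 1e-10 for a, b in zip(diffs, diffs[1:]))
```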
A.4 Proof of Lemma 5

Proof. The sub-gradients of the nuclear norm \|Z\|_* are given by

    \partial\|Z\|_* = \{ UV' + W : W \in \mathbb{R}^{m \times n},\; U'W = 0,\; WV = 0,\; \|W\|_2 \le 1 \},    (48)

where Z = UDV' is the SVD of Z. Since Z_\lambda^k minimizes Q_\lambda(Z | Z_\lambda^{k-1}), it satisfies

    0 \in -(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{k-1}) - Z_\lambda^k) + \partial\|Z_\lambda^k\|_* \quad \forall k.    (49)

Suppose Z^* is a limit point of the sequence Z_\lambda^k. Then there exists a subsequence \{n_k\} \subset \{1, 2, \dots\} such that Z_\lambda^{n_k} \to Z^* as k \to \infty. By Lemma 4 this subsequence satisfies

    Z_\lambda^{n_k} - Z_\lambda^{n_k - 1} \to 0,

implying

    P_\Omega^\perp(Z_\lambda^{n_k - 1}) - Z_\lambda^{n_k} \longrightarrow P_\Omega^\perp(Z_\lambda^*) - Z_\lambda^* = -P_\Omega(Z_\lambda^*).

Hence

    (P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{n_k - 1}) - Z_\lambda^{n_k}) \longrightarrow (P_\Omega(X) - P_\Omega(Z_\lambda^*)).    (50)

For every k, a sub-gradient p(Z_\lambda^k) \in \partial\|Z_\lambda^k\|_* corresponds to a tuple (u_k, v_k, w_k) satisfying the properties of the set \partial\|Z_\lambda^k\|_* in (48). Consider p(Z_\lambda^{n_k}) along the subsequence n_k. As n_k \to \infty, Z_\lambda^{n_k} \to Z_\lambda^*. Let

    Z_\lambda^{n_k} = u_{n_k} D_{n_k} v_{n_k}', \qquad Z^* = u_\infty D^* v_\infty'

denote the SVDs. The products of the singular vectors converge, u_{n_k} v_{n_k}' \to u_\infty v_\infty'. Furthermore, by boundedness (passing to a further subsequence if necessary), w_{n_k} \to w_\infty. The limit u_\infty v_\infty' + w_\infty clearly satisfies the criterion for being a sub-gradient at Z^*, and hence corresponds to p(Z_\lambda^*) \in \partial\|Z_\lambda^*\|_*. From (49, 50), passing to the limit along the subsequence n_k, we have

    0 \in -(P_\Omega(X) - P_\Omega(Z_\lambda^*)) + \partial\|Z_\lambda^*\|_*.

Hence the limit point Z_\lambda^* is a stationary point of f_\lambda(Z).

We now prove (16). For every n_k,

    Z_\lambda^{n_k} = S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{n_k - 1})).    (51)

From Lemma 4 we know Z_\lambda^{n_k} - Z_\lambda^{n_k - 1} \to 0. This observation, together with the continuity of S_\lambda(\cdot), gives

    S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^{n_k - 1})) \to S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^*)).

Passing to the limit on both sides of (51), we get

    Z_\lambda^* = S_\lambda(P_\Omega(X) + P_\Omega^\perp(Z_\lambda^*)),

therefore completing the proof.

A.5 Proof of Lemma 6

The proof is motivated by the principle of embedding an arbitrary matrix into a positive semidefinite matrix (Fazel, 2002). We require the following proposition, which we prove using techniques from the same reference.

Proposition 1. Suppose the matrices W_{m \times m}, \tilde{W}_{n \times n}, Z_{m \times n} satisfy

    \begin{pmatrix} W & Z \\ Z^T & \tilde{W} \end{pmatrix} \succeq 0.

Then trace(W) + trace(\tilde{W}) \ge 2\|Z\|_*.

Proof. Let Z_{m \times n} = L_{m \times r} \Sigma_{r \times r} R^T_{n \times r} denote the SVD of Z, where r is the rank of Z. Observe that the trace of the product of two positive semidefinite matrices is always non-negative; since the first factor below equals \begin{pmatrix} L \\ -R \end{pmatrix}(L^T \; -R^T) \succeq 0, we have the inequality

    trace\left\{ \begin{pmatrix} LL^T & -LR^T \\ -RL^T & RR^T \end{pmatrix} \begin{pmatrix} W & Z \\ Z^T & \tilde{W} \end{pmatrix} \right\} \ge 0.

Simplifying this expression we get

    trace(LL^T W) - trace(LR^T Z^T) - trace(RL^T Z) + trace(RR^T \tilde{W}) \ge 0.    (52)

Due to the orthonormality of the columns of L and R we have the inequalities

    trace(LL^T W) \le trace(W) \quad \text{and} \quad trace(RR^T \tilde{W}) \le trace(\tilde{W}).

Furthermore, using the SVD of Z,

    trace(LR^T Z^T) = trace(\Sigma) = trace(RL^T Z).

Using the above in (52), we have

    trace(W) + trace(\tilde{W}) \ge 2\,trace(\Sigma) = 2\|Z\|_*.

Proof [of Lemma 6]. For the matrix Z, consider any decomposition of the form Z = \tilde{U}_{m \times r} \tilde{V}^T_{n \times r} and construct the matrix

    \begin{pmatrix} \tilde{U}\tilde{U}^T & Z \\ Z^T & \tilde{V}\tilde{V}^T \end{pmatrix} = \begin{pmatrix} \tilde{U} \\ \tilde{V} \end{pmatrix}(\tilde{U}^T \; \tilde{V}^T),    (53)

which is positive semidefinite. Applying Proposition 1 to the left-hand matrix in (53), we have

    trace(\tilde{U}\tilde{U}^T) + trace(\tilde{V}\tilde{V}^T) \ge 2\|Z\|_*.

Minimizing both sides over the decompositions Z = \tilde{U}_{m \times r} \tilde{V}^T_{n \times r}, we have

    \min_{\tilde{U}, \tilde{V}:\; Z = \tilde{U}\tilde{V}^T} \big\{ trace(\tilde{U}\tilde{U}^T) + trace(\tilde{V}\tilde{V}^T) \big\} \ge 2\|Z\|_*.    (54)

Through the SVD of Z we now show that equality is attained in (54). Suppose Z has rank k \le \min(m, n), with SVD Z_{m \times n} = L_{m \times k} \Sigma_{k \times k} R^T_{n \times k}. Then for \tilde{U} = L \Sigma^{1/2} and \tilde{V} = R \Sigma^{1/2} equality in (54) is attained. Hence

    \|Z\|_* = \min_{\tilde{U}, \tilde{V}:\; Z = \tilde{U}\tilde{V}^T} \frac{1}{2}\big\{ trace(\tilde{U}\tilde{U}^T) + trace(\tilde{V}\tilde{V}^T) \big\} = \min_{\tilde{U}, \tilde{V}:\; Z = \tilde{U}_{m \times k}\tilde{V}^T_{n \times k}} \frac{1}{2}\big\{ trace(\tilde{U}\tilde{U}^T) + trace(\tilde{V}\tilde{V}^T) \big\}.

Note that the minimum can also be attained for matrices with r \ge k, or even r \ge \min(m, n); however, it suffices to consider matrices with r = k. It is also easily seen that the minimum cannot be attained for any r < k; hence the minimal rank r for which (29) holds true is r = k.
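The variational characterization just proved is easy to confirm numerically: the factorization \tilde{U} = L\Sigma^{1/2}, \tilde{V} = R\Sigma^{1/2} built from the SVD attains the nuclear norm. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 5)) @ rng.normal(size=(5, 6))     # a rank-5 matrix

nuc = np.linalg.svd(Z, compute_uv=False).sum()            # ||Z||_*

# Equality in (54): U = L Sigma^{1/2}, V = R Sigma^{1/2} attains the minimum.
L, s, Rt = np.linalg.svd(Z, full_matrices=False)
U, V = L * np.sqrt(s), Rt.T * np.sqrt(s)
assert np.allclose(U @ V.T, Z)
assert np.isclose(0.5 * (np.sum(U ** 2) + np.sum(V ** 2)), nuc)
```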
A.6 Proof of Theorem 2

There is a close resemblance between SOFT-IMPUTE and Nesterov's gradient method (Nesterov, 2007, Section 3); however, as mentioned earlier, the original motivation of our algorithm is very different. The techniques used in this proof are adapted from Nesterov (2007).

Proof. Plugging \tilde{Z} = Z_\lambda^k into (11), we have

    Q_\lambda(Z | Z_\lambda^k) = f_\lambda(Z) + \frac{1}{2}\|P_\Omega^\perp(Z_\lambda^k - Z)\|_F^2    (55)
        \ge f_\lambda(Z).

Let Z_\lambda^k(\theta) denote a convex combination of the optimal solution Z_\lambda^\infty and the k-th iterate Z_\lambda^k:

    Z_\lambda^k(\theta) = \theta Z_\lambda^\infty + (1 - \theta) Z_\lambda^k.    (56)

Using the convexity of f_\lambda(\cdot) we get

    f_\lambda(Z_\lambda^k(\theta)) \le (1 - \theta) f_\lambda(Z_\lambda^k) + \theta f_\lambda(Z_\lambda^\infty).    (57)

Expanding Z_\lambda^k(\theta) using (56) and simplifying P_\Omega^\perp(Z_\lambda^k - Z_\lambda^k(\theta)), we have

    \|P_\Omega^\perp(Z_\lambda^k - Z_\lambda^k(\theta))\|_F^2 = \theta^2 \|P_\Omega^\perp(Z_\lambda^k - Z_\lambda^\infty)\|_F^2 \le \theta^2 \|Z_\lambda^k - Z_\lambda^\infty\|_F^2    (58)
        \le \theta^2 \|Z_\lambda^0 - Z_\lambda^\infty\|_F^2.    (59)

Line (59) follows from (58) by observing that \|Z_\lambda^m - Z_\lambda^\infty\|_F^2 \le \|Z_\lambda^{m-1} - Z_\lambda^\infty\|_F^2 for all m, a consequence of the inequalities (19) and (18) established in Theorem 1. Using (55), the value of f_\lambda at the (k+1)-th iterate satisfies the following chain of inequalities:

    f_\lambda(Z_\lambda^{k+1}) \le \min_Z \Big\{ f_\lambda(Z) + \frac{1}{2}\|P_\Omega^\perp(Z_\lambda^k - Z)\|_F^2 \Big\}
        \le \min_{\theta \in [0,1]} \Big\{ f_\lambda(Z_\lambda^k(\theta)) + \frac{1}{2}\|P_\Omega^\perp(Z_\lambda^k - Z_\lambda^k(\theta))\|_F^2 \Big\}    (60)
        \le \min_{\theta \in [0,1]} \Big\{ f_\lambda(Z_\lambda^k) + \theta (f_\lambda(Z_\lambda^\infty) - f_\lambda(Z_\lambda^k)) + \frac{\theta^2}{2} \|Z_\lambda^0 - Z_\lambda^\infty\|_F^2 \Big\}.    (61)

Line (61) follows from line (60) by using (57) and (59). The right-hand side of (61) is minimized at \hat{\theta}(k+1) = \min\{1, \theta_k\} \in [0, 1], where

    \theta_k = \frac{f_\lambda(Z_\lambda^k) - f_\lambda(Z_\lambda^\infty)}{\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2};

if \|Z_\lambda^0 - Z_\lambda^\infty\|_F^2 = 0, we take \theta_k = \infty. Note that \theta_k is a decreasing sequence. This implies that if \theta_{k_0} \le 1 then \theta_m \le 1 for all m \ge k_0. Suppose \theta_0 > 1. Then \hat{\theta}(1) = 1, and using (61) we have

    f_\lambda(Z_\lambda^1) - f_\lambda(Z_\lambda^\infty) \le \frac{1}{2}\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2 \;\Longrightarrow\; \theta_1 \le \frac{1}{2},

and we are back in the former case. Hence \theta_k \le 1 for all k \ge 1. In addition, the preceding deductions show that if \theta_0 > 1 then (20) holds true for k = 1. Combining these observations, plugging the value of \hat{\theta} into (61) and simplifying, we get

    f_\lambda(Z_\lambda^{k+1}) - f_\lambda(Z_\lambda^k) \le -\frac{(f_\lambda(Z_\lambda^k) - f_\lambda(Z_\lambda^\infty))^2}{2\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2}.    (62)

For notational convenience, define the sequence \alpha_k = f_\lambda(Z_\lambda^k) - f_\lambda(Z_\lambda^\infty); it is easily seen that \alpha_k is a non-negative decreasing sequence. Using this notation in (62) we get

    \alpha_k \ge \frac{\alpha_k^2}{2\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2} + \alpha_{k+1} \ge \frac{\alpha_k \alpha_{k+1}}{2\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2} + \alpha_{k+1},    (63)

where the second inequality uses the fact that \alpha_k is decreasing. Dividing both sides of (63) by \alpha_k \alpha_{k+1}, we have

    \alpha_{k+1}^{-1} \ge \frac{1}{2\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2} + \alpha_k^{-1}.    (64)

Summing both sides of (64) over i = 1, \dots, k-1, we get

    \alpha_k^{-1} \ge \frac{k-1}{2\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2} + \alpha_1^{-1}.    (65)

Since \theta_1 \le 1, we observe \alpha_1 / (2\|Z_\lambda^0 - Z_\lambda^\infty\|_F^2) \le 1/2; using this in (65) and rearranging terms gives the desired inequality (20), completing the proof of the theorem.
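Reading the bound (20) off the final display, the proof yields f_λ(Z_λ^k) − f_λ(Z_λ^∞) ≤ 2‖Z_λ^0 − Z_λ^∞‖_F^2 / (k + 1). The sketch below (again reusing the soft_threshold_matrix helper sketched above, and using a long run as a proxy for Z_λ^∞) illustrates the O(1/k) rate numerically; the proxy makes the left-hand side only smaller, so the check is conservative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 30, 25, 1.0
X, mask = rng.normal(size=(m, n)), rng.random((m, n)) < 0.4

def f(Z):   # the objective f_lambda of (10)
    return 0.5 * np.sum((X[mask] - Z[mask]) ** 2) \
           + lam * np.linalg.svd(Z, compute_uv=False).sum()

Z, traj = np.zeros((m, n)), []
for _ in range(2000):                       # Z^0 = 0; keep the trajectory
    Z = soft_threshold_matrix(np.where(mask, X, Z), lam)
    traj.append(Z.copy())
f_inf = f(traj[-1])                         # long run as a proxy for Z^infty

c = 2.0 * np.linalg.norm(traj[-1]) ** 2     # 2 ||Z^0 - Z^infty||_F^2 with Z^0 = 0
for k, Zk in enumerate(traj[:200], start=1):
    assert f(Zk) - f_inf <= c / (k + 1) + 1e-8    # the O(1/k) bound (20)
```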
References

J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering: operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803–826, 2009.

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9:1019–1048, 2008.

R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. Technical report, AT&T Bell Laboratories, 2007.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–631, 2005.

J. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion, 2008. Available at http://www.citebase.org/abstract?id=oai:arXiv.org:0810.3286.

E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2008.

E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2009.

D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In Proceedings of the 23rd International Conference on Machine Learning, pages 249–256. ACM, 2006.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: asymptopia? (with discussion). Journal of the Royal Statistical Society, Series B, 57:201–337, 1995.

M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

J. Friedman. Fast sparse regression and classification. Technical report, Department of Statistics, Stanford University, 2008.

J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 2(1):302–332, 2007.

M. Grant and S. Boyd. CVX: MATLAB software for disciplined convex programming, 2009. Web page and software available at http://stanford.edu/~boyd/cvx.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Prediction, Inference and Data Mining (Second Edition). Springer Verlag, New York, 2009.

S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 457–464, 2009.

R. H. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2009.

R. M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. Technical Report DAIMI PB-357, Department of Computer Science, Aarhus University, 1998.

R. M. Larsen. PROPACK: software for large and sparse SVD calculations, 2004. Available at http://sun.stanford.edu/~rmunk/PROPACK.

J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009. Available at http://www.public.asu.edu/~jye02/Software/SLEP.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2009.

S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, Series A, forthcoming.

R. Mazumder, J. Friedman, and T. Hastie. SparseNet: coordinate descent with non-convex penalties. Technical report, Stanford University, 2009.

Y. Nesterov. Introductory Lectures on Convex Optimization: Basic Course. Kluwer, Boston, 2003.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, 2007. Available at http://www.citebase.org/abstract?id=oai:arXiv.org:0706.4138.
J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719. ACM, 2005.

R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791–798. AAAI Press, 2007.

ACM SIGKDD and Netflix. Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In Proceedings of KDD Cup and Workshop, 2007. Available at http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings.html.

N. Srebro and T. Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning, pages 720–727. AAAI Press, 2003.

N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems 17, pages 5–27. MIT Press, 2005a.

N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329–1336. MIT Press, 2005b.

G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. Scalable collaborative filtering approaches for large recommender systems. Journal of Machine Learning Research, 10:623–656, 2009.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.

C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010.