Uploaded by vsaikishore24

11-AOS949

advertisement
The Annals of Statistics
2012, Vol. 40, No. 4, 1935–1967
DOI: 10.1214/11-AOS949
© Institute of Mathematical Statistics, 2012
LATENT VARIABLE GRAPHICAL MODEL SELECTION VIA
CONVEX OPTIMIZATION1
B Y V ENKAT C HANDRASEKARAN , PABLO A. PARRILO AND
A LAN S. W ILLSKY
California Institute of Technology, Massachusetts Institute of Technology
and Massachusetts Institute of Technology
Suppose we observe samples of a subset of a collection of random variables. No additional information is provided about the number of latent variables, nor of the relationship between the latent and observed variables. Is it
possible to discover the number of latent components, and to learn a statistical
model over the entire collection of variables? We address this question in the
setting in which the latent and observed variables are jointly Gaussian, with
the conditional statistics of the observed variables conditioned on the latent
variables being specified by a graphical model. As a first step we give natural conditions under which such latent-variable Gaussian graphical models
are identifiable given marginal statistics of only the observed variables. Essentially these conditions require that the conditional graphical model among
the observed variables is sparse, while the effect of the latent variables is
“spread out” over most of the observed variables. Next we propose a tractable
convex program based on regularized maximum-likelihood for model selection in this latent-variable setting; the regularizer uses both the 1 norm and
the nuclear norm. Our modeling framework can be viewed as a combination
of dimensionality reduction (to identify latent variables) and graphical modeling (to capture remaining statistical structure not attributable to the latent
variables), and it consistently estimates both the number of latent components and the conditional graphical model structure among the observed variables. These results are applicable in the high-dimensional setting in which
the number of latent/observed variables grows with the number of samples
of the observed variables. The geometric properties of the algebraic varieties
of sparse matrices and of low-rank matrices play an important role in our
analysis.
Received August 2010; revised November 2011.
Discussed in 10.1214/12-AOS979, 10.1214/12-AOS980, 10.1214/12-AOS981,
10.1214/12-AOS984, 10.1214/12-AOS985 and 10.1214/12-AOS1001; rejoinder at
10.1214/12-AOS1020.
1 Supported in part by AFOSR Grant FA9550-08-1-0180, in part under a MURI through AFOSR
Grant FA9550-06-1-0324, in part under a MURI through AFOSR Grant FA9550-06-1-0303 and in
part by NSF FRG 0757207.
MSC2010 subject classifications. 62F12, 62H12.
Key words and phrases. Gaussian graphical models, covariance selection, latent variables, regularization, sparsity, low-rank, algebraic statistics, high-dimensional asymptotics.
1935
1936
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
1. Introduction and setup. Statistical model selection in the high-dimensional regime arises in a number of applications. In many data analysis problems in
geophysics, radiology, genetics, climate studies, and image processing, the number
of samples available is comparable to or even smaller than the number of variables.
As empirical statistics in these settings may not be well-behaved (see [17, 22]),
high-dimensional model selection is therefore both challenging and of great interest. A model selection problem that has received considerable attention recently
is the estimation of covariance matrices in the high-dimensional setting. As the
sample covariance matrix is poorly behaved in such a regime, some form of regularization of the sample covariance is adopted based on assumptions about the true
underlying covariance matrix [1, 2, 12, 14, 20, 36].
Graphical models. A number of papers have studied covariance estimation in
the context of Gaussian graphical model selection. A Gaussian graphical model
[19, 30] (also commonly referred to as a Gauss–Markov random field) is a statistical model defined with respect to a graph, in which the nodes index a collection
of jointly Gaussian random variables and the edges represent the conditional independence relations (Markov structure) among the variables. In such models the
sparsity pattern of the inverse of the covariance matrix, or the concentration matrix, directly corresponds to the graphical model structure. Specifically, consider a
Gaussian graphical model in which the covariance matrix is given by a positivedefinite ∗ and the concentration matrix is given by K ∗ = ( ∗ )−1 . Then an edge
∗ = 0. In par{i, j } is present in the underlying graphical model if and only if Ki,j
ticular the absence of an edge between two nodes implies that the corresponding variables are independent conditioned on all the other variables. The model
selection method usually studied in such a Gaussian graphical model setting is
1 -regularized maximum-likelihood, with the 1 penalty applied to the entries of
the concentration matrix to induce sparsity. The consistency properties of such an
estimator have been studied [18, 26, 29], and under suitable conditions [18, 26]
this estimator is also “sparsistent,” that is, the estimated concentration matrix has
the same sparsity pattern as the true model from which the samples are generated.
An alternative approach to 1 -regularized maximum-likelihood is to estimate the
sparsity pattern of the concentration matrix by performing regression separately on
each variable [23]; while such a method consistently estimates the sparsity pattern,
it does not directly provide estimates of the covariance or concentration matrix.
In many applications throughout science and engineering (e.g., psychology,
computational biology, and economics), a challenge is that one may not have access to observations of all the relevant phenomena, that is, some of the relevant
variables may be latent or unobserved. In general latent variables pose a significant difficulty for model selection because one may not know the number of relevant latent variables, nor the relationship between these variables and the observed
variables. Typical algorithmic methods that try to get around this difficulty usually
fix the number of latent variables as well as the structural relationship between
LATENT VARIABLE MODEL SELECTION
1937
latent and observed variables (e.g., the graphical model structure between latent
and observed variables), and use the EM algorithm to fit parameters [9]. This approach suffers from the problem that one optimizes nonconvex functions, and thus
one may get stuck in suboptimal local minima. An alternative suggestion [13] is
one based on a greedy, local, combinatorial heuristic that assigns latent variables
to groups of observed variables, via some form of clustering of the observed variables; however, this approach has no consistency guarantees.
Our setup. In this paper we study the problem of latent-variable graphical
model selection in the setting where all the variables, both observed and latent,
are jointly Gaussian. More concretely, X is a Gaussian random vector in Rp+h ,
O and H are disjoint subsets of indices in {1, . . . , p + h} of cardinalities |O| = p
and |H | = h, and the corresponding subvectors of X are denoted by XO and XH ,
∗
respectively. Let the covariance matrix underlying X be denoted by (O
H ) . The
marginal statistics corresponding to the observed variables XO are given by the
∗ , which is simply a submatrix of the full covarimarginal covariance matrix O
∗
ance matrix (O H ) . However, suppose that we parameterize our model by the
∗
∗
−1
concentration matrix K(O
H ) = ((O H ) ) , which as discussed above reveals the
∗ , K∗
∗
connection to graphical models. Here the submatrices KO
O,H , KH specify (in
the full model) the dependencies among the observed variables, between the observed and latent variables, and among the latent variables, respectively. In such a
∗ )−1 corresponding to the
parameterization, the marginal concentration matrix (O
observed variables XO is given by the Schur complement [16] with respect to the
∗:
block KH
(1.1)
∗
∗ −1
∗
∗
∗ −1 ∗
= (O
) = KO
− KO,H
(KH
) KH,O .
K̃O
∗ (or K̃ ∗ ).
Thus if we only observe the variables XO , we only have access to O
O
∗ above have interesting properties. The matrix
The two terms that compose K̃O
∗ specifies the concentration matrix of the conditional statistics of the obKO
served variables given the latent variables. If these conditional statistics are given
∗ is sparse. On the other hand, the matrix
by a sparse graphical model, then KO
∗
∗ )−1 K ∗
(KH
KO,H
H,O serves as a summary of the effect of marginalization over the
latent variables XH . This matrix has small rank if the number of latent, unobserved variables XH is small relative to the number of observed variables XO .
∗ is generally not sparse due to the
Therefore the marginal concentration matrix K̃O
∗
∗ )−1 K ∗ . Hence standard graphical model
(KH
additional low-rank term KO,H
H,O
selection techniques applied directly to the observed variables XO are not useful.
A modeling paradigm that infers the effect of the latent variables XH would be
more suitable in order to provide a concise explanation of the underlying statistical
structure. Hence we approximate the sample covariance by a model in which the
concentration matrix decomposes into the sum of a sparse matrix and a low-rank
matrix, which reveals the conditional graphical model structure in the observed
1938
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
variables as well as the number of and effect due to the unobserved latent variables. Such a method can be viewed as a blend of principal component analysis
and graphical modeling. In standard graphical modeling one would directly approximate a concentration matrix by a sparse matrix to learn a sparse graphical
model, while in principal component analysis the goal is to explain the statistical
structure underlying a set of observations using a small number of latent variables
(i.e., approximate a covariance matrix as a low-rank matrix). In our framework we
learn a sparse graphical model among the observed variables conditioned on a few
(additional) latent variables. These latent variables are not principal components,
as the conditional statistics (conditioned on these latent variables) are given by a
graphical model. Therefore we refer to these latent variables informally as latent
components.
Contributions. Our first contribution in Section 3 is to address the fundamental question of identifiability of such latent-variable graphical models given the
marginal statistics of only the observed variables. The critical point is that we
need to tease apart the correlations induced due to marginalization over the latent variables from the conditional graphical model structure among the observed
variables. As the identifiability problem is one of uniquely decomposing the sum
of a sparse matrix and a low-rank matrix into the individual components, we
study the algebraic varieties of sparse matrices and low-rank matrices. An important theme in this paper is the connection between the tangent spaces to these
∗ ) dealgebraic varieties and the question of identifiability. Specifically let (KO
∗
note the tangent space at KO to the algebraic variety of sparse matrices, and let
∗
∗ )−1 K ∗ ) denote the tangent space at K ∗
∗ −1 ∗
(KH
T (KO,H
H,O
O,H (KH ) KH,O to the algebraic variety of low-rank matrices. Then the statistical question of identifiability
∗ and K ∗
∗ −1 ∗
∗
of KO
O,H (KH ) KH,O given K̃O is determined by the geometric notion
∗ ) and T (K ∗
∗ −1 ∗
of transversality of the tangent spaces (KO
O,H (KH ) KH,O ). The
study of the transversality of these tangent spaces leads to natural conditions for
identifiability. In particular we show that latent-variable models in which (1) the
∗ has a small number of nonzeros per row/column, and (2) the
sparse matrix KO
∗
∗ )−1 K ∗
(KH
low-rank matrix KO,H
H,O has row/column spaces that are not closely
aligned with the coordinate axes, are identifiable. These conditions have natural
statistical interpretations. The first condition ensures that there are no densely connected subgraphs in the conditional graphical model structure among the observed
variables, that is, that these conditional statistics are indeed specified by a sparse
graphical model. Such statistical relationships may otherwise be mistakenly attributed to the effect of marginalization over some latent variable. The second condition ensures that the effect of marginalization over the latent variables is “spread
out” over many observed variables; thus, the effect of marginalization over a latent
variable is not confused with the conditional graphical model structure among the
observed variables. In fact the first condition is often assumed in standard graphical
model selection without latent variables (e.g., [26]).
1939
LATENT VARIABLE MODEL SELECTION
As our next contribution we propose a regularized maximum-likelihood decomposition framework to approximate a given sample covariance matrix by a model
in which the concentration matrix decomposes into a sparse matrix and a low-rank
matrix. Based on the effectiveness of the 1 norm as a tractable convex relaxation
for recovering sparse models [5, 10, 11] and the nuclear norm for low-rank matrices [4, 15, 27], we propose the following penalized likelihood method given a
n formed from n samples of the observed variables:
sample covariance matrix O
n
(Ŝn , L̂n ) = arg min −(S − L; O
) + λn γ S1 + tr(L)
(1.2)
S,L
s.t. S − L 0, L 0.
The constraints 0 and 0 impose positive-definiteness and positive-semidefiniteness. The function represents the Gaussian log-likelihood (K; ) =
log det(K) − tr(K) for K 0, where tr is the trace of a matrix and det is the
∗ , which represents the condideterminant. Here Ŝn provides an estimate of KO
tional concentration matrix of the observed variables; L̂n provides an estimate of
∗
∗ )−1 K ∗ , which represents the effect of marginalization over the latent
KO,H
(KH
H,O
variables. The regularizer is a combination of the 1 norm applied to S and the
nuclear norm applied to L (the nuclear norm reduces to the trace over the cone of
symmetric, positive-semidefinite matrices), with γ providing a trade-off between
the two terms. This variational formulation is a convex optimization problem, and
it is a regularized max-det program that can be solved in polynomial time using
general-purpose solvers [33].
Our main result in Section 4 is a proof of the consistency of the estimator (1.2) in
the high-dimensional regime in which both the number of observed variables and
the number of latent components are allowed to grow with the number of samples
(of the observed variables). We show that for a suitable choice of the regularization
parameter λn , there exists a range of values of γ for which the estimates (Ŝn , L̂n )
∗ , K∗
∗ −1 ∗
have the same sparsity (and sign) pattern and rank as (KO
O,H (KH ) KH,O )
with high probability (see Theorem 4.1). The key technical requirement is an identifiability condition for the two components of the marginal concentration ma∗ with respect to the Fisher information (see Section 3.4). We make contrix K̃O
nections between our condition and the irrepresentability conditions required for
support/graphical-model recovery using 1 regularization [26, 32, 37]. Our results provide numerous scaling regimes under which consistency holds in latentvariable graphical model selection. For example, we show that under suitable identifiability conditions consistent model selection is possible even when the number
of samples and the number of latent variables are on the same order as the number
of observed variables (see Section 4.2).
Related previous work. The problem of decomposing the sum of a sparse matrix and a low-rank matrix via convex optimization into the individual components
1940
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
was initially studied in [7] by a superset of the authors of the present paper, with
conditions derived under which the convex program exactly recovers the underlying components. In subsequent work Candès et al. [3] also studied this sparseplus-low-rank decomposition problem, and provided guarantees for exact recovery using the convex program proposed in [7]. The problem setup considered in
the present paper is quite different and is more challenging because we are only
given access to an inexact sample covariance matrix, and we wish to produce an
inverse covariance matrix that can be decomposed as the sum of sparse and lowrank components (preserving the sparsity pattern and rank of the components in
the true underlying model). In addition to proving the consistency of the estimator (1.2), we also provide a statistical interpretation of our identifiability conditions
and describe natural classes of latent-variable Gaussian graphical models that satisfy these conditions. As such our paper is closer in spirit to the many recent papers on covariance selection, but with the important difference that some of the
variables are not observed.
Outline. Section 2 gives some background and a formal problem statement.
Section 3 discusses the identifiability question, Section 4 states the main results of
this paper, and Section 5 gives some proofs. We provide experimental demonstration of the effectiveness of our estimator on synthetic and real data in Section 6,
and conclude with a brief discussion in Section 7. Some of our technical results
are deferred to supplementary material [6].
2. Problem statement and background. We give a formal statement of the
latent-variable model selection problem. We also briefly describe various properties of the algebraic varieties of sparse matrices and of low-rank matrices, and the
properties of the Gaussian likelihood function.
The following matrix norms are employed throughout this paper. M2 denotes
the spectral norm, or the largest singular value of M. M∞ denotes the largest
entry in magnitude of M. MF denotes the Frobenius norm, or the square root
of the sum of the squares of the entries of M. M∗ denotes the nuclear norm,
or the sum of the singular values of M (this reduces to the trace for positivesemidefinite matrices). M1 denotes the sum of the absolute values of the entries of M. A number of matrix operator norms are also used. For example, let
Z : Rp×p → Rp×p be a linear operator acting on matrices. Then the induced
operator norm is defined as Z q→q maxN∈Rp×p ,Nq ≤1 Z (N)q . Therefore,
Z F →F denotes the spectral norm of the operator Z . The only vector norm used
is the Euclidean norm, which is denoted by · . Given any norm · q (either a
vector norm, a matrix norm or a matrix operator norm), the dual norm is given by
M∗q sup{ M, N |N q ≤ 1}.
2.1. Problem statement. In order to analyze latent-variable model selection methods, we need to define an appropriate notion of model selection con-
LATENT VARIABLE MODEL SELECTION
1941
∗
sistency for latent-variable graphical models. Given the two components KO
∗
∗
∗
−1
and KO,H (KH ) KH,O of the concentration matrix of the marginal distribution (1.1), there are infinitely many configurations of the latent variables [i.e.,
∗ 0, K ∗
∗
T
matrices KH
O,H = (KH,O ) ] that give rise to the same low-rank ma∗
∗
∗
trix KO,H (KH )−1 KH,O . Specifically for any nonsingular matrix B ∈ R|H |×|H | ,
∗ → BK ∗ B T , K ∗
∗
T
one can apply the transformations KH
H
O,H → KO,H B and still
∗
∗ )−1 K ∗ . In all of these models the
(KH
preserve the low-rank matrix KO,H
H,O
marginal statistics of the observed variables XO remain the same upon marginalization over the latent variables XH . The key invariant is the low-rank matrix
∗
∗ )−1 K ∗ , which summarizes the effect of marginalization over the
(KH
KO,H
H,O
∗ and
latent variables. Consequently, from here on we use the notation S ∗ = KO
∗
∗ )−1 K ∗ . These observations give rise to the following notion of
(KH
L∗ = KO,H
H,O
structure recovery.
D EFINITION 2.1. A pair of |O| × |O| symmetric matrices (Ŝ, L̂) is an algebraically correct estimate of a latent-variable Gaussian graphical model given by
∗
the concentration matrix K(O
H ) if the following conditions hold:
(1) The sign-pattern of Ŝ is the same as that of S ∗ [here sign(0) = 0]:
∗
)
sign(Ŝi,j ) = sign(Si,j
∀i, j.
(2) The rank of L̂ is the same as the rank of L∗ :
rank(L̂) = rank(L∗ ).
(3) The concentration matrix Ŝ − L̂ can be realized as the marginal concentration matrix of an appropriate latent-variable model:
Ŝ − L̂ 0,
L̂ 0.
When a sequence of estimators is algebraically correct with probability approaching 1 in a suitable high-dimensional scaling regime, then we say that the
estimators are algebraically consistent. The first condition ensures that Ŝ provides
the correct structural estimate of the conditional graphical model of the observed
variables conditioned on the latent components. This property is the same as the
“sparsistency” property studied in standard graphical model selection [18, 26].
The second condition ensures that the number of latent components is properly
estimated. Finally, the third condition ensures that the pair of matrices (Ŝ, L̂) leads
to a realizable latent-variable model. In particular, this condition implies that there
exists a valid latent-variable model in which (a) the conditional graphical model
structure among the observed variables is given by Ŝ, (b) the number of latent
variables is equal to the rank of L̂, and (c) the extra correlations induced due to
marginalization over the latent variables are equal to L̂. Any method for matrix
1942
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
factorization (e.g., [35]) can be used to further factorize L̂, depending on the property that one desires in the factors (e.g., sparsity).
We also study estimation error rates in the usual sense, that is, we show that
one can produce estimates (Ŝ, L̂) that are close in various norms to the matrices
(S ∗ , L∗ ). Notice that bounding the estimation error in some norm does not in general imply that the support/sign-pattern and rank of (Ŝ, L̂) are the same as those of
(S ∗ , L∗ ). Therefore bounded estimation error is different from algebraic correctness, which requires that (Ŝ, L̂) have the same support/sign-pattern and rank as
(S ∗ , L∗ ).
∗
Goal. Let K(O
H ) denote the concentration matrix of a Gaussian model. Supi }n of the observed variables X . We would like
pose that we have n samples {XO
O
i=1
to produce estimates (Ŝn , L̂n ) that, with high probability, are algebraically correct
and have bounded estimation error (in some norm).
Our approach. We propose the regularized likelihood convex program (1.2)
n in
to produce estimates (Ŝn , L̂n ). Specifically, the sample covariance matrix O
(1.2) is defined as
n
O
n
1
i
i T
XO
XO
.
n i=1
∗
We give conditions on the underlying model K(O
H ) and suitable choices for the
parameters λn , γ under which the estimates (Ŝn , L̂n ) are consistent (see Theorem 4.1).
2.2. Likelihood function and Fisher information. Given n samples {Xi }ni=1 of
a finite collection of jointly Gaussian zero-mean random variables with concentration matrix K ∗ , it is easily seen that the log-likelihood function is given by:
(2.1)
(K; n ) = log det(K) − tr(K n ),
where (K; n ) is a function of K. Notice that this function is strictly concave for
n , the
K 0. In the latent-variable modeling problem with sample covariance O
likelihood function with respect to the parametrization (S, L) is given by (S −
n ). This function is jointly concave with respect to the parameters (S, L)
L; O
whenever S − L 0, and it is employed in our variational formulation (1.2) to
learn a latent-variable model.
In the analysis of a convex program involving the likelihood function, the Fisher
information plays an important role as it is the negative of the Hessian of the
likelihood function and thus controls the curvature. As the first term in the likelihood function is linear, we need only study higher-order derivatives of the logdeterminant function in order to compute the Hessian. In the latent-variable setting with the marginal concentration matrix of the observed variables given by
LATENT VARIABLE MODEL SELECTION
1943
∗ = ( ∗ )−1 [see (1.1)], the corresponding Fisher information matrix is
K̃O
O
(2.2)
∗
∗ −1
∗ −1
∗
∗
I (K̃O
) = (K̃O
) ⊗ (K̃O
) = O
⊗ O
.
Here ⊗ denotes the tensor product between matrices. Notice that this is precisely
∗
the |O|2 × |O|2 submatrix of the full Fisher information matrix I (K(O
H )) =
∗
∗
∗
∗
−1
(O H ) ⊗ (O H ) with respect to all the parameters K(O H ) = ((O H ) ) (corresponding to the situation in which all the variables XO∪H are observed). In Sec∗)
tion 3.4 we impose various conditions on the Fisher information matrix I (K̃O
under which our regularized maximum-likelihood formulation provides consistent
estimates.
2.3. Algebraic varieties of sparse and low-rank matrices. The set of sparse
matrices and the set of low-rank matrices can be naturally viewed as algebraic varieties (solution sets of systems of polynomial equations). Here we describe these
varieties, and discuss some of their geometric properties such as the tangent space
and local curvature at a (smooth) point.
Let S (k) denote the set of matrices with at most k nonzeros:
(2.3)
S (k) {M ∈ Rp×p ||support(M)| ≤ k}.
Here support denotes the locations of nonzero entries. The set S (k) is an algebraic
2
variety, and can in fact be viewed as a union of pk subspaces in Rp×p . This
variety has dimension k, and it is smooth everywhere except at those matrices that
have support size strictly smaller than k. For any matrix M ∈ Rp×p , consider the
variety S (|support(M)|); M is a smooth point of this variety, and the tangent space
at M is given by
(2.4)
(M) = {N ∈ Rp×p |support(N) ⊆ support(M)}.
Next let L(r) denote the algebraic variety of matrices with rank at most r:
(2.5)
L(r) {M ∈ Rp×p |rank(M) ≤ r}.
It is easily seen that L(r) is an algebraic variety because it can be defined through
the vanishing of all (r + 1) × (r + 1) minors. This variety has dimension equal
to r(2p − r), and it is smooth everywhere except at those matrices that have rank
strictly smaller than r. Consider a rank-r matrix M with singular value decomposition (SVD) given by M = U DV T , where U, V ∈ Rp×r and D ∈ Rr×r . The
matrix M is a smooth point of the variety L(rank(M)), and the tangent space at M
with respect to this variety is given by
(2.6)
T (M) = {U Y1T + Y2 V T |Y1 , Y2 ∈ Rp×r }.
We view both (M) and T (M) as subspaces in Rp×p . In Section 3 we explore
the connection between geometric properties of these tangent spaces and the identifiability problem in latent-variable graphical models.
1944
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
Curvature of rank variety. The sparse matrix variety S (k) has the property
that it has zero curvature at any smooth point. The situation is more complicated
for the low-rank matrix variety L(r), because the curvature at any smooth point
is nonzero. We analyze how this variety curves locally, by studying how the tangent space changes from one point to a neighboring point. Indeed the amount of
curvature at a point is directly related to the “angle” between the tangent space at
that point and the tangent space at a neighboring point. For any linear subspace
T of matrices, let PT denote the projection onto T . Given two subspaces T1 , T2
of the same dimension, we measure the “twisting” between these subspaces by
considering the following quantity:
(2.7)
ρ(T1 , T2 ) PT1 − PT2 2→2 = max [PT1 − PT2 ](N)2 .
N2 ≤1
In the supplement [6] we review relevant results from matrix perturbation theory,
which suggest that the magnitude of the smallest nonzero singular value is closely
tied to the local curvature of the variety. Therefore we control the twisting between
tangent spaces at nearby points by bounding the smallest nonzero singular value
away from zero.
3. Identifiability. In the absence of additional conditions, the latent-variable
model selection problem is ill-posed. In this section we discuss a set of conditions on latent-variable models that ensure that these models are identifiable given
marginal statistics for a subset of the variables. Some of the discussion in Sections
3.1 and 3.2 is presented in greater detail in [7].
3.1. Structure between latent and observed variables. Suppose that the lowrank matrix that summarizes the effect of the latent components is itself sparse.
This leads to identifiability issues in the sparse-plus-low-rank decomposition problem. Statistically the additional correlations induced due to marginalization over
the latent variables could be mistaken for the conditional graphical model structure
of the observed variables. In order to avoid such identifiability problems the effect
of the latent variables must be “diffuse” across the observed variables. To address
this point the following quantity was introduced in [7] for any matrix M, defined
with respect to the tangent space T (M):
(3.1)
ξ(T (M)) max
N∞ .
N∈T (M),N2 ≤1
Thus ξ(T (M)) being small implies that elements of the tangent space T (M) cannot have their support concentrated in a few locations; as a result M cannot be
too sparse. This idea is formalized in [7] by relating ξ(T (M)) to a notion of “incoherence” of the row/column spaces, where the row/column spaces are said to
be incoherent with respect to the standard basis if these spaces are not aligned
closely with any of the coordinate axes. Typically a matrix M with incoherent
row/column spaces would have ξ(T (M)) 1. This point is quantified precisely
1945
LATENT VARIABLE MODEL SELECTION
in [7]. Specifically, we note that ξ(T (M)) can be as small as ∼
r
p
for a rank-r
Rp×p
matrix M ∈
with row/column spaces that are almost maximally incoherent (e.g., if the row/column spaces span any r columns of a p × p orthonormal
Hadamard matrix). On the other hand, ξ(T (M)) = 1 if the row/column spaces of
M contain a standard basis vector.
Based on these concepts we roughly require that the low-rank matrix that summarizes the effect of the latent variables be incoherent, thereby ensuring that the
extra correlations due to marginalization over the latent components cannot be
confused with the conditional graphical model structure of the observed variables.
Notice that the quantity ξ is not just a measure of the number of latent variables,
but also of the overall effect of the correlations induced by marginalization over
these variables.
Curvature and change in ξ : As noted previously, an important technical point
is that the algebraic variety of low-rank matrices is locally curved at any smooth
point. Consequently the quantity ξ changes as we move along the low-rank matrix
variety smoothly. The quantity ρ(T1 , T2 ) introduced in (2.7) allows us to bound
the variation in ξ as follows (proof in Section 5):
L EMMA 3.1. Let T1 , T2 be two linear subspaces of matrices of the same dimension with the property that ρ(T1 , T2 ) < 1, where ρ is defined in (2.7). Then we
have that
1
[ξ(T1 ) + ρ(T1 , T2 )].
ξ(T2 ) ≤
1 − ρ(T1 , T2 )
3.2. Structure among observed variables. An identifiability problem also
arises if the conditional graphical model among the observed variables contains
a densely connected subgraph. These statistical relationships might be mistaken as
correlations induced by marginalization over latent variables. Therefore we need
to ensure that the conditional graphical model among the observed variables is
sparse. We impose the condition that this conditional graphical model must have
small “degree,” that is, no observed variable is directly connected to too many
other observed variables conditioned on the latent components. Notice that bounding the degree is a more refined condition than simply bounding the total number
of nonzeros as the sparsity pattern also plays a role. In [7] the authors introduced
the following quantity in order to provide an appropriate measure of the sparsity
pattern of a matrix:
(3.2)
μ((M)) max
N2 .
N∈(M),N∞ ≤1
The quantity μ((M)) being small for a matrix implies that the spectrum of any
element of the tangent space (M) is not too “concentrated,” that is, the singular
values of the elements of the tangent space are not too large. In [7] it is shown
that a sparse matrix M with “bounded degree” (a small number of nonzeros per
1946
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
row/column) has small μ(M). Specifically, if M ∈ Rp×p is any matrix with at
most deg(M) nonzero entries per row/column, then we have that
μ((M)) ≤ deg(M).
3.3. Transversality of tangent spaces. Suppose that we have the sum of two
vectors, each from two known subspaces. It is possible to uniquely recover the
individual vectors from the sum if and only if the subspaces have a transverse intersection, that is, they only intersect at the origin. This simple observation leads
to an appealing geometric notion of identifiability. Suppose now that we have the
sum of a sparse matrix and a low-rank matrix, and that we are also given the tangent spaces at these matrices with respect to the algebraic varieties of sparse and
low-rank matrices, respectively. Then a necessary and sufficient condition for identifiability with respect to the tangent spaces is that these spaces have a transverse
intersection. This transverse intersection condition is also sufficient for local identifiability in a neighborhood around the sparse matrix and low-rank matrix with
respect to the varieties of sparse and low-rank matrices (due to the inverse function theorem). It turns out that these tangent space transversality conditions are
also sufficient for the convex program (1.2) to provide consistent estimates of a
latent-variable graphical model (without any side information about the tangent
spaces).
In order to quantify the level of transversality between the tangent spaces and
T we study the minimum gain with respect to some norm of the addition operator
(which adds two matrices) A : Rp×p × Rp×p → Rp×p restricted to the cartesian
product Y = × T . Then given any matrix norm · q on Rp×p × Rp×p , the
minimum gain of A restricted to Y is defined as
ε(, T , · q ) min
PY A† APY (S, L)q ,
(S,L)∈×T ,(S,L)q =1
where PY denotes the projection onto Y , and A† denotes the adjoint of the addition operator (with respect to the standard Euclidean inner-product). The “level”
of transversality of and T is measured by the magnitude of ε(, T , · q ),
with transverse intersection being equivalent to ε(, T , · q ) > 0. Note that
ε(, T , · F ) is the square of the minimum singular value of the addition operator A restricted to × T .
A natural norm with which to measure transversality is the dual norm of the
regularization function in (1.2), as the subdifferential of the regularization function is specified in terms of its dual. The reasons for this will become clearer as
we proceed through this paper. Recall that the regularization function used in the
variational formulation (1.2) is given by
fγ (S, L) = γ S1 + L∗ ,
LATENT VARIABLE MODEL SELECTION
1947
where the nuclear norm · ∗ reduces to the trace function over the cone of
positive-semidefinite matrices. This function is a norm for all γ > 0. The dual
norm of fγ is given by
S∞
, L2 .
gγ (S, L) = max
γ
Next we define the quantity χ (, T , γ ) as follows in order to study the transversality of the spaces and T with respect to the gγ norm:
(3.3)
χ (, T , γ ) max
ξ(T )
, 2μ()γ .
γ
Here μ and ξ are defined in (3.2) and (3.1). We then have the following result
(proved in Section 5):
L EMMA 3.2. Let S ∈ , L ∈ T be matrices such that S∞ = γ and let
L2 = 1. Then we have that gγ (PY A† APY (S, L)) ∈ [1 − χ (, T , γ ), 1 +
χ (, T , γ )], where Y = × T and χ (, T , γ ) is defined in (3.3). In particular
we have that 1 − χ (, T , γ ) ≤ ε(, T , gγ ).
The quantity χ (, T , γ ) being small implies that the addition operator is essentially isometric when restricted to Y = × T . Stated differently, the magnitude
of χ (, T , γ ) is a measure of the level of transversality of the spaces and T .
1
) ensures that χ (, T , γ ) < 1, which in
If μ()ξ(T ) < 12 , then γ ∈ (ξ(T ), 2μ()
turn implies that the tangent spaces and T have a transverse intersection.
Observation: Thus we have that the smaller the quantities μ() and ξ(T ),
the more transverse the intersection of the spaces and T as measured by
ε(, T , gγ ).
3.4. Conditions on Fisher information. The main focus of Section 4 is to analyze the regularized maximum-likelihood convex program (1.2) by studying its optimality conditions. The log-likelihood function is well-approximated in a neighborhood by a quadratic form given by the Fisher information (which measures the
∗ ) denote the Fisher inforcurvature, as discussed in Section 2.2). Let I ∗ = I (K̃O
∗ [see (1.1)]. The
mation evaluated at the true marginal concentration matrix K̃O
appropriate measure of transversality between the tangent spaces2 = (S ∗ ) and
T = T (L∗ ) is then in a space in which the inner-product is given by I ∗ . Specifically, we need to analyze the minimum gain of the operator PY A† I ∗ APY restricted to the space Y = × T . Therefore we impose several conditions on the
Fisher information I ∗ . We define quantities that control the gains of I ∗ restricted
to and T separately; these ensure that elements of and elements of T are
2 We implicitly assume that these tangent spaces are subspaces of the space of symmetric matrices.
1948
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
individually identifiable under the map I ∗ . In addition we define quantities that,
in conjunction with bounds on μ() and ξ(T ), allow us to control the gain of I ∗
restricted to the direct-sum ⊕ T .
I ∗ restricted to : The minimum gain of the operator P I ∗ P restricted to is given by
α P I ∗ P (M)∞ .
min
M∈,M∞ =1
The maximum effect of elements in in the orthogonal direction ⊥ is given by
δ max
P⊥ I ∗ P (M)∞ .
M∈,M∞ =1
The operator I ∗ is injective on if α > 0. The ratio αδ ≤ 1 − ν implies the
irrepresentability condition imposed in [26], which gives a sufficient condition for
consistent recovery of graphical model structure using 1 -regularized maximumlikelihood. Notice that this condition is a generalization of the usual Lasso irrepresentability conditions [32, 37], which are typically imposed on the covariance
matrix. Finally we also consider the following quantity, which controls the behavior of I ∗ restricted to in the spectral norm:
β max
M∈,M2 =1
I ∗ (M)2 .
I ∗ restricted to T : Analogously to the case of one could control the gains
of the operators PT ⊥ I ∗ PT and PT I ∗ PT . However, as discussed previously, one
complication is that the tangent spaces at nearby smooth points on the rank variety are in general different, and the amount of twisting between these spaces is
governed by the local curvature. Therefore we control the gains of the operators
PT ⊥ I ∗ PT and PT I ∗ PT for all tangent spaces T that are “close to” the nominal T (at the true underlying low-rank matrix), measured by ρ(T , T ) (2.7) being
small. The minimum gain of the operator PT I ∗ PT restricted to T (close to T ) is
given by
αT min
PT I ∗ PT (M)2 .
min
ρ(T ,T )≤ξ(T )/2 M∈T ,M2 =1
Similarly, the maximum effect of elements in T in the orthogonal direction T ⊥
(for T close to T ) is given by
δT max
max
PT ⊥ I ∗ PT (M)2 .
ρ(T ,T )≤ξ(T )/2 M∈T ,M2 =1
Implicit in the definition of αT and δT is the fact that the outer minimum and
maximum are only taken over spaces T that are tangent spaces to the rank-variety.
)
The operator I ∗ is injective on all tangent spaces T such that ρ(T , T ) ≤ ξ(T
2
if αT > 0. An irrepresentability condition (analogous to those developed for the
sparse case) for tangent spaces near T to the rank variety would be that αδTT ≤ 1 − ν.
LATENT VARIABLE MODEL SELECTION
1949
Finally we also control the behavior of I ∗ restricted to T close to T in the ∞
norm:
βT max
max
I ∗ (M)∞ .
ρ(T ,T )≤ξ(T )/2 M∈T ,M∞ =1
The two sets of quantities (α , δ ) and (αT , δT ) essentially control how I ∗
behaves when restricted to the spaces and T separately (in the natural norms).
The quantities β and βT are useful in order to control the gains of the operator I ∗
restricted to the direct sum ⊕ T . Notice that although the magnitudes of elements
in are measured most naturally in the ∞ norm, the quantity β is specified
with respect to the spectral norm. Similarly, elements of the tangent spaces T to
the rank variety are most naturally measured in the spectral norm, but βT provides
control in the ∞ norm. These quantities, combined with μ() and ξ(T ) [defined
in (3.2) and (3.1)], provide the “coupling” necessary to control the behavior of I ∗
restricted to elements in the direct sum ⊕ T . In order to keep track of fewer
quantities, we summarize the six quantities as follows:
α min(α , αT );
δ max(δ , δT );
Main assumption: There exists a ν ∈ (0,
1
2]
β max(β , βT ).
such that
δ
≤ 1 − 2ν.
α
This assumption is to be viewed as a generalization of the irrepresentability conditions imposed on the covariance matrix [32, 37] or the Fisher information matrix
[26] in order to provide consistency guarantees for sparse model selection using
the 1 norm. With this assumption we have the following proposition, proved in
Section 5, about the gains of the operator I ∗ restricted to ⊕ T . This proposition plays a fundamental role in the analysis of the performance of the regularized maximum-likelihood procedure (1.2). Specifically, it gives conditions under
which a suitable primal-dual pair can be specified to certify optimality with respect
to (1.2) (see Section 5.2 for more details).
P ROPOSITION 3.3. Let and T be the tangent spaces defined in this section,
and let I ∗ be the Fisher information evaluated at the true marginal concentration
matrix. Further let α, β, ν be as defined above. Suppose that
μ()ξ(T ) ≤
1
να
6 β(2 − ν)
2
,
and that γ is in the following range:
γ∈
να
3ξ(T )β(2 − ν)
,
.
να
2μ()β(2 − ν)
Then we have the following two conclusions for Y = ×T with ρ(T , T ) ≤
ξ(T )
2 :
1950
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
(1) The minimum gain of I ∗ restricted to ⊕ T is bounded below:
min
(S,L)∈Y ,S∞ =γ ,L2 =1
gγ (PY A† I ∗ APY (S, L)) ≥
α
.
2
Specifically this implies that for all (S, L) ∈ Y
gγ (PY A† I ∗ APY (S, L)) ≥
α
gγ (S, L).
2
(2) The effect of elements in Y = × T on the orthogonal complement Y ⊥ =
× T ⊥ is bounded above:
⊥
PY ⊥ A† I ∗ APY (PY A† I ∗ APY )−1 gγ →gγ ≤ 1 − ν.
Specifically this implies that for all (S, L) ∈ Y
gγ (PY ⊥ A† I ∗ APY (S, L)) ≤ (1 − ν)gγ (PY A† I ∗ APY (S, L)).
The last quantity we consider is the spectral norm of the marginal covariance
∗ = (K̃ ∗ )−1 :
matrix O
O
(3.4)
∗
∗ −1
2 = (K̃O
) 2 .
ψ O
A bound on ψ is useful in the probabilistic component of our analysis, in order
to derive convergence rates of the sample covariance matrix to the true covariance
matrix. We also observe that
∗ −1
∗ −1
) ⊗ (K̃O
) 2→2 = ψ 2 .
I ∗ 2→2 = (K̃O
Remarks. The quantities α, β, δ bound the gains of the Fisher information I ∗
restricted to the spaces and T (and tangent spaces near T ). One can make
stronger assumptions on I ∗ that are more easily interpretable. For example, α , β
could bound the minimum/maximum gains of I ∗ for all matrices (rather than
just those in ), and δ the I ∗ -inner-product for all pairs of orthogonal matrices (rather than just those in and ⊥ ). Similarly, αT , βT could bound the
minimum/maximum gains of I ∗ for all matrices (rather than just those near T ),
and δT the I ∗ -inner-product for all pairs of orthogonal matrices (rather than just
those near T and T ⊥ ). Such bounds would apply in either the · 2→2 norm (for
αT , δT , β ) or the · ∞→∞ norm (for α , δ , βT ). These modified assumptions
are global in nature (not restricted just to or near T ) and are consequently
stronger (they lower-bound the original α , αT and they upper-bound the original β , βT , δ , δT ), and they essentially control the gains of the operator I ∗ in the
· 2→2 norm and the · ∞→∞ norm. In contrast, previous works on covariance
selection [1, 2, 29] consider well-conditioned families of covariance matrices by
LATENT VARIABLE MODEL SELECTION
1951
bounding the minimum/maximum eigenvalues (i.e., gain with respect to the spectral norm).
4. Consistency of regularized maximum-likelihood program.
∗
4.1. Main results. Recall that K(O
H ) denotes the full concentration matrix
of a collection of zero-mean jointly-Gaussian observed and latent variables. Let
p = |O| denote the number of observed variables, and let h = |H | denote the
i }n
number of latent variables. We are given n samples {XO
i=1 of the observed variables XO . We consider the high-dimensional setting in which (p, h, n) are all
allowed to grow simultaneously. We present our main result next demonstrating
the consistency of the estimator (1.2), and then discuss classes of latent-variable
graphical models and various scaling regimes in which our estimator is consistent. Recall from (1.2) that λn is a regularization parameter, and γ is a tradeoff parameter between the rank and sparsity terms. Notice from Proposition 3.3
that the choice of γ depends on the values of μ((S ∗ )) and ξ(T (L∗ )). While
these quantities may not be known a priori, we discuss a method to choose γ
numerically in our experimental results (see Section 6). The following theorem
shows that the estimates (Ŝn , L̂n ) provided by the convex program (1.2) are consistent for a suitable choice of λn . In addition to the appropriate identifiability conditions (as specified by Proposition 3.3), we also impose lower bounds
on the minimum magnitude nonzero entry θ of the sparse conditional graphical
model matrix S ∗ and on the minimum nonzero singular value σ of the low-rank
matrix L∗ summarizing the effect of the latent variables. The theorem is stated
in terms of the quantities α, β, ν, ψ, and we particularly emphasize the dependence on μ((S ∗ )) and ξ(T (L∗ )) because these control the complexity of the
∗
underlying latent-variable graphical model given by K(O
H ) . A number of quanνα
α
tities play a role in our theorem: let D = max{1, 3β(2−ν)
}, C1 = ψ(1 + 6β
),
√
48 2Dψ(2−ν)
1
αν
1
αν
,
C
=
min{
,
},
C
=
,
samp
λ
2
2
32(3−ν)D
4C1 256D(3−ν)ψC
ξ(T )ν
ψ
1
2
3αC2 (2−ν)
C2 να
CS = max{( 6(2−ν)
+ 1)C22 ψ 2 D, C2 + 16(3−ν)
} and CL = β(2−ν)
.
ν
C2 =
48
α
+
∗
T HEOREM 4.1. Let K(O
H ) denote the concentration matrix of a Gaussian
i }n
model. We have n samples {XO
i=1 of the p observed variables denoted by O.
∗
∗
Let = (S ) and T = T (L ) denote the tangent spaces at S ∗ and at L∗ with
respect to the sparse and low-rank matrix varieties, respectively.
Assumptions: Suppose that the quantities μ() and ξ(T ) satisfy the assumption
of Proposition 3.3 for identifiability, and γ is chosen in the range specified by
Proposition 3.3. Further suppose that the following conditions hold:
p
128ψ 2
max{
, 2}, that is, we require that n ξ(Tp )4 .
4
2
ξ(T )
Csamp
√
p
p
1
λn = 48 2Dψ(2−ν)
,
that
is,
we
require
that
λ
n
ξ(T )ν
n
ξ(T ) n .
(1) Let n ≥
(2) Set
1952
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
(3) Let σ ≥
(4) Let θ ≥
CL λn
, that is, we require that σ ξ(T1 )3 pn .
ξ(T )2
p
CS λn
1
,
that
is,
we
require
that
θ
μ()
ξ(T )μ() n .
Conclusions: Then with probability greater than 1 − 2 exp{−p} we have algebraic correctness and estimation error given by:
∗ );
(1) sign(Ŝn ) = sign(S ∗ ) and rank(
√ L̂n ) = rank(L
p
2(3−ν)Dψ p
1
(2) gγ (Ŝn − S ∗ , L̂n − L∗ ) ≤ 512 ναξ(T
)
n
ξ(T ) n .
The proof of this theorem is given in Section 5. The theorem essentially states
that if the minimum nonzero singular value of the low-rank piece L∗ and minimum
nonzero entry of the sparse piece S ∗ are bounded away from zero, then the convex program (1.2) provides estimates that are both algebraically correct and have
bounded estimation error (in the ∞ and spectral norms).
Notice that the condition on the minimum singular value of L∗ is more stringent
than the one on the minimum nonzero entry of S ∗ . One role played by these conditions is to ensure that the estimates (Ŝn , L̂n ) do not have smaller support size/rank
than (S ∗ , L∗ ). However, the minimum singular value bound plays the additional
role of bounding the curvature of the low-rank matrix variety around the point L∗ ,
which is the reason for this condition being more stringent. Notice also that the
number of latent variables h does not explicitly appear in the bounds in Theorem 4.1, which only depend on p, μ((S ∗ )), ξ(T (L∗ )). However, the dependence
on h is implicit in the dependence on ξ(T (L∗ )), and we discuss this point in greater
detail in the following section.
Finally we note that consistency holds in Theorem 4.1 for a range of values of
)
να
γ ∈ [ 3β(2−ν)ξ(T
, 2β(2−ν)μ()
]. In particular the assumptions on the sample comνα
plexity, the minimum nonzero singular value of L∗ , and the minimum magnitude
nonzero entry of S ∗ are governed by the lower end of this range for γ . These assumptions can be weakened if we only require consistency for a smaller range of
values of γ . The next result conveys this point with a specific example.
C OROLLARY 4.2. Consider the same setup and notation as in Theorem 4.1.
Suppose that the quantities μ() and ξ(T ) satisfy the assumption of Proposiνα
tion 3.3 for identifiability, and that γ = 2β(2−ν)μ()
(the upper end of the range
specified in Proposition 3.3), that is, γ μ()4 p; (2) λn μ()
p
n;
(3) σ 1
μ()
2
μ()
ξ(T )
p
n;
. Further suppose that: (1) n (4) θ p
n.
Then with probabil-
ity greater than 1 − 2 exp{−p} we have estimates (Ŝn , L̂n ) that are algebraically
correct, and with the error bounded as gγ (Ŝn − S ∗ , L̂n − L∗ ) μ() pn .
LATENT VARIABLE MODEL SELECTION
1953
The proof of this corollary3 is analogous to that of Theorem 4.1. We emphasize
that in practice it is often beneficial to have consistent estimates for a range of
values of γ (as in Theorem 4.1). Specifically, the stability of the sparsity pattern
and rank of the estimates (Ŝn , L̂n ) for a range of trade-off parameters is useful
in order to choose a suitable value of γ , as prior information about the quantities
μ((S ∗ )) and ξ(T (L∗ )) is not typically available (see Section 6).
We remark here that the identifiability conditions of Proposition 3.3 are the main
sufficient conditions required for Theorem 4.1 and Corollary 4.2 to hold. It would
be interesting to obtain necessary conditions as well for these results, analogous
to the necessity and sufficiency of the irrepresentability conditions for the Lasso
[32, 37].
4.2. Scaling regimes. Next we consider classes of latent-variable models that
satisfy the conditions of Theorem 4.1. Recall from Section 3.2 that μ((S ∗ )) ≤
deg(S ∗ ). Throughout this section, we consider latent-variable models in which
the
∗
∗
low-rank matrix L is almost maximally incoherent, that is, ξ(T (L )) ∼ ph so
the effect of marginalization over the latent variables is diffuse across almost all
the observed variables. We suppress the dependence on the quantities α, β, ν, ψ
defined in Section 3.4 in our scaling results, and specifically focus on the trade-off
between ξ(T (L∗ )) and μ((S ∗ )) for consistent estimation (we also suppress the
dependence of these quantities on n). Thus, based on Proposition 3.3 we study
latent-variable models in which
∗
∗
ξ(T (L ))μ((S )) = O
h
deg(S ∗ ) = O(1).
p
As we describe next, there are nontrivial classes of latent-variable graphical models
in which this condition holds.
Bounded degree: The first class of latent-variable models that we consider are
those in which the conditional graphical model among the observed variables
∗ ) has constant degree:
(given by KO
deg(S ∗ ) = O(1),
h ∼ p.
Such models can be estimated consistently from n ∼ p samples. Thus consistent
latent-variable model selection is possible even when the number of samples and
the number of latent variables are on the same order as the number of observed
variables.
3 By making stronger assumptions on the Fisher information matrix I ∗ , one can further remove
the factor of ξ(T ) in the lower bound for σ . Specifically, the lower bound σ μ()3 pn suffices
for consistent estimation if the bounds defined by the quantities αT , βT , δT can be strengthened as
described in the remarks at the end of Section 3.4.
1954
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
Polylogarithmic degree: The next class of models that we consider are those
in which the degree of the conditional graphical model of the observed variables
grows polylogarithmically with p:
deg(S ∗ ) ∼ log(p)q ,
h∼
p
.
log(p)2q
Such latent-variable graphical models can be consistently estimated as long as
n ∼ p polylog(p).
For standard graphical model selection with no latent variables, 1 -regularized
maximum-likelihood is shown to be consistent with n = O(log p) samples [26].
On the other hand, our results prove consistency in the setting with latent variables
when n = O(p) samples. It would be interesting to study whether these rates are
inherent to latent-variable model selection.
4.3. Rates for covariance matrix estimation. Theorem 4.1 gives conditions
under which we can consistently estimate the sparse and low-rank parts that
∗ . Here we state a corollary that
compose the marginal concentration matrix K̃O
gives rates for covariance matrix estimation, that is, the quality of the estimate
∗.
(Ŝn − L̂n )−1 with respect to the “true” marginal covariance matrix O
C OROLLARY 4.3. Under the same conditions as in Theorem 4.1, we have with
probability greater than 1 − 2 exp{−p} that
∗
gγ A† [(Ŝn − L̂n )−1 − O
] ≤ λn 1 +
∗ This corollary implies that (Ŝn − L̂n )−1 −O
2
ν
.
6(2 − ν)
1
ξ(T )
p
n
based on the choice
∗ μ()
of λn in Theorem 4.1, and that (Ŝn − L̂n )−1 − O
2
choice of λn in Corollary 4.2.
p
n
based on the
5. Proofs.
5.1. Proofs of Section 3. Here we give proofs of the results stated in Section 3.
P ROOF OF L EMMA 3.1. Since ρ(T1 , T2 ) < 1, the largest principal angle between T1 and T2 is strictly less than π2 . Consequently, the mapping PT2 : T1 → T2
restricted to T1 is bijective (as it is injective, and the spaces T1 , T2 have the same
dimension). Consider the maximum and minimum gains of PT2 restricted to T1 ;
for any M ∈ T1 , M2 = 1:
PT2 (M)2 = M + [PT2 − PT1 ](M)2 ∈ [1 − ρ(T1 , T2 ), 1 + ρ(T1 , T2 )].
1955
LATENT VARIABLE MODEL SELECTION
Therefore, we can rewrite ξ(T2 ) as follows:
ξ(T2 ) =
≤
≤
max
N∞ =
N∈T2 ,N2 ≤1
max
N∈T1 ,N2 ≤1/(1−ρ(T1 ,T2 ))
max
max
PT2 (N)∞
N∈T2 ,N2 ≤1
N∈T1 ,N2 ≤1/(1−ρ(T1 ,T2 ))
PT2 (N)∞
N∞ + [PT1 − PT2 ](N)∞
≤
1
ξ(T1 ) +
max
[PT1 − PT2 ](N)∞
N∈T1 ,N2 ≤1
1 − ρ(T1 , T2 )
≤
1
ξ(T1 ) + max [PT1 − PT2 ](N)2
N2 ≤1
1 − ρ(T1 , T2 )
1
[ξ(T1 ) + ρ(T1 , T2 )].
1 − ρ(T1 , T2 )
This concludes the proof of the lemma. ≤
P ROOF OF L EMMA 3.2. We have that A† A(S, L) = (S + L, S + L); therefore,
PY A† APY (S, L) = (S + P (L), PT (S) + L). We need to bound S + P (L)∞
and PT (S) + L2 . First, we have
S + P (L)∞ ∈ [S∞ − P (L)∞ , S∞ + P (L)∞ ]
⊆ [S∞ − L∞ , S∞ + L∞ ]
⊆ [γ − ξ(T ), γ + ξ(T )].
Similarly, one can check that
PT (S) + L2 ∈ [−PT (S)2 + L2 , PT (S)2 + L2 ]
⊆ [1 − 2S2 , 1 + 2S2 ]
⊆ [1 − 2γ μ(), 1 + 2γ μ()].
These two bounds give us the desired result. P ROOF OF P ROPOSITION 3.3. Before proving the two parts of this proposition
)
we make a simple observation about ξ(T ) using the condition that ρ(T , T ) ≤ ξ(T
2
by applying Lemma 3.1:
ξ(T ) ≤
3ξ(T )/2
ξ(T ) + ρ(T , T )
≤
≤ 3ξ(T ).
1 − ρ(T , T )
1 − ξ(T )/2
Here we used the property that ξ(T ) ≤ 1 in obtaining the final inequality. Conse)
να
quently, noting that γ ∈ [ 3β(2−ν)ξ(T
, 2β(2−ν)μ()
] implies that
να
(5.1)
ξ(T )
να
, 2μ()γ ≤
.
χ (, T , γ ) = max
γ
β(2 − ν)
1956
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
Part 1: The proof of this step proceeds in a similar manner to that of Lemma 3.2.
First we have for S ∈ , L ∈ T with S∞ = γ , L2 = 1:
P I ∗ (S + L)∞ ≥ P I ∗ S∞ − P I ∗ L∞ ≥ αγ − I ∗ L∞ ≥ αγ − βξ(T ).
Next, under the same conditions on S, L,
PT I ∗ (S + L)2 ≥ PT I ∗ L2 − PT I ∗ S2 ≥ α − 2I ∗ S2 ≥ α − 2βμ()γ .
Combining these last two bounds with (5.1), we conclude that
min
(S,L)∈Y ,S∞ =γ ,L2 =1
gγ (PY A† I ∗ APY (S, L))
να
2α(1 − ν) α
ξ(T )
, 2μ()γ ≥ α −
=
≥ ,
≥ α − β max
γ
2−ν
2−ν
2
where the final inequality follows from the assumption that ν ∈ (0, 12 ].
Part 2: Note that for S ∈ , L ∈ T with S∞ ≤ γ , L2 ≤ 1,
P⊥ I ∗ (S + L)∞ ≤ P⊥ I ∗ S∞ + P⊥ I ∗ L∞ ≤ δγ + βξ(T ).
Similarly,
PT ⊥ I ∗ (S + L)2 ≤ PT ⊥ I ∗ S2 + PT ⊥ I ∗ L2 ≤ βγ μ() + δ.
Combining these last two bounds with the bounds from the first part, we have that
PY ⊥ A† I ∗ APY (PY A† I ∗ APY )−1 gγ →gγ
≤
δ + να/(2 − ν)
δ + β max{ξ(T )/γ , 2μ()γ }
≤
α − β max{ξ(T )/γ , 2μ()γ } α − να/(2 − ν)
≤
(1 − 2ν)α + να/(2 − ν)
= 1 − ν.
α − να/(2 − ν)
This concludes the proof of the proposition. 5.2. Proof strategy for Theorem 4.1. Standard results from convex analysis
[28] state that (Ŝn , L̂n ) is a minimum of the convex program (1.2) if the zero matrix belongs to the subdifferential of the objective function evaluated at (Ŝn , L̂n )
[in addition to (Ŝn , L̂n ) satisfying the constraints]. Elements of the subdifferentials
with respect to the ℓ1 norm and the nuclear norm at a matrix M have the key property that they decompose with respect to the tangent spaces Ω(M) and T(M) [34].
This decomposition property plays a critical role in our analysis. In particular it
states that the optimality conditions consist of two parts, one part corresponding to
the tangent spaces Ω and T and another corresponding to the normal spaces Ω⊥ and T⊥.
Our analysis proceeds by constructing a primal-dual pair of variables that certify optimality with respect to (1.2). Consider the optimization problem (1.2) with
the additional (nonconvex) constraints that the variable S belongs to the algebraic
variety of sparse matrices and that the variable L belongs to the algebraic variety
of low-rank matrices. While this new optimization problem is nonconvex, it has a
very interesting property. At a globally optimal solution (and indeed at any locally
optimal solution) (S̃, L̃) such that S̃ and L̃ are smooth points of the algebraic varieties of sparse and low-rank matrices, the first-order optimality conditions state
that the Lagrange multipliers corresponding to the additional variety constraints
must lie in the normal spaces Ω(S̃)⊥ and T(L̃)⊥. This basic observation, combined with the decomposition property of the subdifferentials of the ℓ1 and nuclear
norms, suggests the following high-level proof strategy: considering the solution
(S̃, L̃) of the variety-constrained problem, we show under suitable conditions that
the second part of the subgradient optimality conditions of (1.2) (without any variety constraints) corresponding to components in the normal spaces Ω(S̃)⊥ and T(L̃)⊥ is also satisfied by (S̃, L̃). Thus, we show that (S̃, L̃) satisfies the optimality
conditions of the original convex program (1.2). Consequently (S̃, L̃) is also the
optimum of the convex program (1.2). As this estimate is obtained as the solution
to the problem with the variety constraints, the algebraic correctness of (S̃, L̃) can
be directly concluded. We emphasize here that the variety-constrained optimization problem is used solely as an analysis tool in order to prove consistency of the
estimates provided by the convex program (1.2). The key technical complication is
that the tangent spaces at L̃ and L∗ are in general different. We bound the twisting
between these tangent spaces by using the fact that the minimum nonzero singular
value of L∗ is bounded away from zero (as assumed in Theorem 4.1; see also the
supplement [6]).
5.3. Results proved in supplement. In this section we give the statements of
some results that are proved in a separate supplement [6]. These results are critical
to the proof of our main theorem, but they deal mainly with nonstatistical aspects
such as the curvature of the algebraic variety of low-rank matrices. Recall that
Ω = Ω(S∗) and T = T(L∗). We also refer frequently to the constants defined in
Theorem 4.1.
As the gradient of the log-determinant function is given by a matrix inverse,
a key step in analyzing the properties of the convex program (1.2) is to show
that the change in the inverse of a matrix due to small perturbations is well-approximated by the first-order term in the Taylor series expansion. Consider the Taylor series of the inverse of a matrix:
$$(M + \Delta)^{-1} = M^{-1} - M^{-1}\Delta M^{-1} + R_{M^{-1}}(\Delta), \qquad \text{where } R_{M^{-1}}(\Delta) = M^{-1}\sum_{k=2}^{\infty}(-\Delta M^{-1})^{k}.$$
This infinite sum converges for ‖Δ‖ sufficiently small. The following proposition
provides a bound on the second-order term specialized to our setting:
PROPOSITION 5.1. Suppose that γ is in the range given by Proposition 3.3. Further suppose ΔS ∈ Ω, and let gγ(ΔS, ΔL) ≤ 1/(2C1). Then we have that
$$g_\gamma\bigl(A^\dagger R_{\Sigma_O^*}\bigl(A(\Delta_S, \Delta_L)\bigr)\bigr) \le \frac{2D\psi C_1^2\, g_\gamma(\Delta_S, \Delta_L)^2}{\xi(T)}.$$
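As an informal numerical illustration (ours, not part of the paper's argument; the matrix M, the perturbation direction and the step sizes below are arbitrary choices), the remainder R_{M^{-1}}(Δ) can be seen to shrink roughly quadratically as the perturbation shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary well-conditioned positive-definite M and a unit-spectral-norm direction D.
p = 8
A = rng.standard_normal((p, p))
M = A @ A.T + p * np.eye(p)
D = rng.standard_normal((p, p))
D = (D + D.T) / 2
D /= np.linalg.norm(D, 2)

M_inv = np.linalg.inv(M)
for t in [1e-1, 1e-2, 1e-3]:
    Delta = t * D
    # Remainder of the first-order expansion: (M + Delta)^{-1} - [M^{-1} - M^{-1} Delta M^{-1}].
    R = np.linalg.inv(M + Delta) - (M_inv - M_inv @ Delta @ M_inv)
    print(t, np.linalg.norm(R, 2))  # spectral norm of the remainder decays roughly like t^2
```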
Next we analyze the following convex program subject to certain additional
constraints:
$$(5.2)\qquad (\hat S_\Omega, \hat L_{\tilde T}) = \mathop{\arg\min}_{S,L}\ \mathrm{tr}\bigl[(S-L)\Sigma_O^n\bigr] - \log\det(S-L) + \lambda_n\bigl[\gamma\|S\|_1 + \|L\|_*\bigr]\quad \text{s.t. } S - L \succ 0,\ S \in \Omega,\ L \in \tilde T,$$
for some subspace T̃. Comparing (5.2) with the convex program (1.2), we also do not constrain the variable L to be positive semidefinite in (5.2) for ease of proof of the next result (see the supplement [6] for more details; recall that the nuclear norm of a positive-semidefinite matrix is equal to its trace). We show that if T̃ is any tangent space to the low-rank matrix variety such that ρ(T, T̃) ≤ ξ(T)/2, then we can bound the error (ΔS, ΔL) = (ŜΩ − S∗, L∗ − L̂T̃). Let CT̃ = PT̃⊥(L∗) denote the normal component of the true low-rank matrix at T̃, and let En = Σ_O^n − Σ_O^* denote the difference between the true marginal covariance and the sample covariance. The proof of the following result uses Brouwer's fixed-point theorem [25], and is inspired by the proof of a similar result in [26] for standard sparse graphical model recovery without latent variables.
PROPOSITION 5.2. Let the error (ΔS, ΔL) in the solution of the convex program (5.2) [with T̃ such that ρ(T̃, T) ≤ ξ(T)/2] be as defined above, and define
$$r = \max\Bigl\{\frac{8}{\alpha}\bigl[g_\gamma(A^\dagger E_n) + g_\gamma(A^\dagger I^* C_{\tilde T}) + \lambda_n\bigr],\ \|C_{\tilde T}\|_2\Bigr\}.$$
If r ≤ min{1/(4C1), αξ(T)/(64DψC1²)} for γ as in Proposition 3.3, then gγ(ΔS, ΔL) ≤ 2r.
Finally we give a proposition that summarizes the algebraic component of our
proof.
PROPOSITION 5.3. Assume that γ is in the range specified by Proposition 3.3, that σ ≥ CLλn/ξ(T)², θ ≥ CSλn/μ(Ω), gγ(A†En) ≤ λnν/(6(2−ν)), and that λn ≤ [3α(2−ν)/(16(3−ν))] min{1/(4C1), αξ(T)/(64DψC1²)}. Then there exists a tangent space T′ and a corresponding unique solution (ŜΩ, L̂T′) of (5.2) with T̃ = T′ with the following properties:

(1) sign(ŜΩ) = sign(S∗) and rank(L̂T′) = rank(L∗), with L̂T′ ⪰ 0. Further T(L̂T′) = T′ and ρ(T′, T) ≤ ξ(T)/4.

(2) Letting CT′ = PT′⊥(L∗) we have that gγ(A†I∗CT′) ≤ λnν/(6(2−ν)), and that ‖CT′‖2 ≤ 16(3−ν)λn/(3α(2−ν)).

Further, if gγ(A†R_{Σ_O^*}(A(ŜΩ − S∗, L∗ − L̂T′))) ≤ λnν/(6(2−ν)), then the tangent space constraints S ∈ Ω, L ∈ T′ are inactive in (5.2). Consequently the unique solution of (1.2) is (Ŝn, L̂n) = (ŜΩ, L̂T′).
5.4. Probabilistic analysis. The results given thus far in this section have been
completely deterministic in nature. Here we present the probabilistic component of
our proof by studying the rate at which the sample covariance matrix Σ_O^n converges to the true covariance matrix Σ_O^* in spectral norm. This result is well known and follows directly from Theorem II.13 in [8]; we mainly discuss it here for completeness and also to show explicitly the dependence on ψ = ‖Σ_O^*‖2 defined in (3.4). See the supplement [6] for a proof.
LEMMA 5.4. Let ψ = ‖Σ_O^*‖2. Given any δ > 0 with δ ≤ 8ψ, let the number of samples n be such that n ≥ 64pψ²/δ². Then we have that
$$\Pr\bigl[\|\Sigma_O^n - \Sigma_O^*\|_2 \ge \delta\bigr] \le 2\exp\Bigl\{-\frac{n\delta^2}{128\psi^2}\Bigr\}.$$
The following corollary relates the number of samples required for an error
bound to hold with probability 1 − 2 exp{−p}.
COROLLARY 5.5. Let Σ_O^n be the sample covariance formed from n samples of the observed variables. Set δn = √(128pψ²/n). If n ≥ 2p, then
$$\Pr\bigl[\|\Sigma_O^n - \Sigma_O^*\|_2 \le \delta_n\bigr] \ge 1 - 2\exp\{-p\}.$$
P ROOF.
Note that n ≥ 2p implies that δn ≤ 8ψ, and apply Lemma 5.4. 5.5. Proof of Theorem 4.1 and Corollary 4.3. We first combine the
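As an empirical companion to Lemma 5.4 and Corollary 5.5 (our own sketch, not used anywhere in the proof; the covariance below is an arbitrary choice), one can compare the average spectral-norm error of the sample covariance with the quantity δn = √(128pψ²/n):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 36
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)        # plays the role of the true covariance Sigma_O^*
psi = np.linalg.norm(Sigma, 2)

for n in [2 * p, 10 * p, 50 * p]:
    errors = []
    for _ in range(50):
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        Sigma_n = X.T @ X / n          # sample covariance (zero-mean model)
        errors.append(np.linalg.norm(Sigma_n - Sigma, 2))
    delta_n = np.sqrt(128 * p * psi**2 / n)
    print(n, np.mean(errors), delta_n)  # empirical error sits well below delta_n and shrinks like sqrt(p/n)
```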
5.5. Proof of Theorem 4.1 and Corollary 4.3. We first combine the results obtained thus far to prove Theorem 4.1. Set En = Σ_O^n − Σ_O^*, set δn = √(128pψ²/n), and then set λn = 6Dδn(2−ν)/(ξ(T)ν). This setting of λn is equivalent to the specification in the statement of Theorem 4.1.
PROOF OF THEOREM 4.1. We mainly need to show that the various sufficient conditions of Proposition 5.3 are satisfied. We condition on the event that ‖En‖2 ≤ δn, which holds with probability greater than 1 − 2 exp{−p} from Corollary 5.5 as n ≥ 2p by assumption. Based on the bound on n, we also have that
$$\delta_n \le \xi(T)^2\,\frac{\alpha\nu}{32(3-\nu)D}\,\min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\nu}{256D(3-\nu)\psi C_1^2}\Bigr\}.$$
In particular, these bounds imply that
$$(5.3)\qquad \delta_n \le \frac{\alpha\xi(T)\nu}{32(3-\nu)D}\,\min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64D\psi C_1^2}\Bigr\};\qquad \delta_n \le \frac{\alpha^2\xi(T)^2\nu^2}{8192\psi C_1^2(3-\nu)^2 D^2}.$$
Both these weaker bounds are used later.
Based on the assumptions of Theorem 4.1, the requirements of Proposition 5.3
on σ and θ are satisfied. Next we verify the bounds on λn and gγ (A† En ). Based
on the setting of λn above and the bound on δn from (5.3), we have that
$$\lambda_n = \frac{6D(2-\nu)\delta_n}{\xi(T)\nu} \le \frac{3\alpha(2-\nu)}{16(3-\nu)}\,\min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64D\psi C_1^2}\Bigr\}.$$
Next we combine the facts that λn = 6Dδn(2−ν)/(ξ(T)ν) and that ‖En‖2 ≤ δn to conclude that
$$(5.4)\qquad g_\gamma(A^\dagger E_n) \le \frac{D\delta_n}{\xi(T)} = \frac{\lambda_n\nu}{6(2-\nu)}.$$
Thus, we have from Proposition 5.3 that there exists a T′ and corresponding solution (ŜΩ, L̂T′) of (5.2) with the prescribed properties. Next we apply Proposition 5.2 with T̃ = T′ to bound the error (ŜΩ − S∗, L∗ − L̂T′). Noting that ρ(T′, T) ≤ ξ(T)/4, we have that
$$\begin{aligned}
\frac{8}{\alpha}\bigl[g_\gamma(A^\dagger E_n) + g_\gamma(A^\dagger I^* C_{T'}) + \lambda_n\bigr]
&\le \frac{8}{\alpha}\Bigl[\frac{\nu}{3(2-\nu)} + 1\Bigr]\lambda_n \quad (5.5)\\
&= \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)} \quad (5.6)\\
&= \frac{32(3-\nu)D}{\alpha\xi(T)\nu}\,\delta_n\\
&\le \min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64D\psi C_1^2}\Bigr\}. \quad (5.7)
\end{aligned}$$
In the first inequality we used the fact that gγ(A†En) ≤ λnν/(6(2−ν)) (from above) and that gγ(A†I∗CT′) is similarly bounded (from Proposition 5.3). In the second equality we used the relation λn = 6Dδn(2−ν)/(ξ(T)ν). In the final inequality we used the bound on δn from (5.3). This satisfies one of the requirements of Proposition 5.2. The second requirement of Proposition 5.2 on ‖CT′‖2 is also similarly satisfied as we have that ‖CT′‖2 ≤ 16(3−ν)λn/(3α(2−ν)) from Proposition 5.3, and we use the same sequence of inequalities as above. Thus we conclude from Proposition 5.2 and from
(5.5) that
$$(5.8)\qquad g_\gamma(\hat S_\Omega - S^*,\ L^* - \hat L_{T'}) \le \frac{32(3-\nu)\lambda_n}{3\alpha(2-\nu)} \lesssim \frac{1}{\xi(T)}\sqrt{\frac{p}{n}}.$$
Here the last inequality follows from the bound on λn .
If we show that (Ŝn, L̂n) = (ŜΩ, L̂T′), we can conclude the proof of Theorem 4.1 since algebraic correctness of (ŜΩ, L̂T′) holds from Proposition 5.3 and the estimation error bound follows from (5.8). In order to complete this final step, we again revert to Proposition 5.3 and prove the requisite bound on gγ(A†R_{Σ_O^*}(A(ŜΩ − S∗, L∗ − L̂T′))). Since the bound (5.8) combined with the inequality (5.7) satisfies the condition of Proposition 5.1 [i.e., we have that gγ(ŜΩ − S∗, L∗ − L̂T′) ≤ 1/(2C1)]:
$$\begin{aligned}
g_\gamma\bigl(A^\dagger R_{\Sigma_O^*}\bigl(A(\hat S_\Omega - S^*,\ L^* - \hat L_{T'})\bigr)\bigr)
&\le \frac{2D\psi C_1^2}{\xi(T)}\, g_\gamma(\hat S_\Omega - S^*,\ L^* - \hat L_{T'})^2\\
&\le \frac{2D\psi C_1^2}{\xi(T)}\Bigl[\frac{64(3-\nu)D}{\alpha\xi(T)\nu}\Bigr]^2\delta_n^2\\
&= \frac{8192\psi C_1^2(3-\nu)^2 D^2}{\alpha^2\xi(T)^2\nu^2}\,\frac{D\delta_n}{\xi(T)}\,\delta_n\\
&\le \frac{D\delta_n}{\xi(T)}\\
&= \frac{\lambda_n\nu}{6(2-\nu)}.
\end{aligned}$$
In the second inequality we used (5.6) and (5.8), in the final inequality we used the bound (5.3) on δn, and in the final equality we used the relation λn = 6Dδn(2−ν)/(ξ(T)ν). □
PROOF OF COROLLARY 4.3. Based on the optimality conditions of the modified convex program (5.2), we have that
$$g_\gamma\bigl(A^\dagger\bigl[(\hat S_n - \hat L_n)^{-1} - \Sigma_O^n\bigr]\bigr) \le \lambda_n.$$
Combining this with the bound (5.4) yields the desired result. □

6. Simulation results. In this section we give experimental demonstration of
the consistency of our estimator (1.2) on synthetic examples, and its effectiveness
in modeling real-world stock return data. Our choices of λn and γ are guided by Theorem 4.1. Specifically, we choose λn to be proportional to √(p/n). For γ we observe that the support/sign-pattern and the rank of the solution (Ŝn, L̂n) are the same for a range of values of γ. Therefore one could solve the convex program (1.2) for several values of γ, and choose a solution in a suitable range in which the sign-pattern and rank of the solution are stable (see [7] for details). In practical problems with real-world data these parameters may be chosen via cross-validation (it would be of interest to consider methods such as those developed
in [24]). For small problem instances we solve the convex program (1.2) using
a combination of YALMIP [21] and SDPT3 [31]. For larger problem instances
we use the special-purpose solver LogdetPPA [33] developed for log-determinant
semidefinite programs.
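Purely to make the structure of (1.2) concrete (this is our illustrative sketch, not the authors' code, and it assumes the CVXPY modeling package with a conic solver such as SCS rather than the MATLAB tools cited above), the program can be written as follows; since L is constrained to be positive semidefinite, its nuclear norm reduces to its trace:

```python
import cvxpy as cp

def fit_latent_variable_ggm(Sigma_n, lam, gamma):
    """Sketch of the regularized maximum-likelihood program (1.2)."""
    p = Sigma_n.shape[0]
    S = cp.Variable((p, p), symmetric=True)
    L = cp.Variable((p, p), PSD=True)            # L is positive semidefinite
    K = S - L                                    # estimated marginal concentration matrix
    objective = cp.Minimize(
        cp.trace(K @ Sigma_n) - cp.log_det(K)    # log_det keeps S - L positive definite
        + lam * (gamma * cp.sum(cp.abs(S)) + cp.trace(L))
    )
    problem = cp.Problem(objective)
    problem.solve(solver=cp.SCS)                 # any solver that handles log-det programs
    return S.value, L.value
```

In line with the discussion above, one would call such a routine with lam proportional to √(p/n) and sweep gamma over a grid, retaining a solution from a range of γ over which the sign pattern of S and the rank of L are stable.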
6.1. Synthetic data. In the first set of experiments we consider a setting in
which we have access to samples of the observed variables of a latent-variable
graphical model. We consider several latent-variable Gaussian graphical models.
The first model consists of p = 36 observed variables and h = 2 latent variables.
The conditional graphical model structure of the observed variables is a cycle with
the edge partial correlation coefficients equal to 0.25; thus, this conditional model
is specified by a sparse graphical model with degree 2. The second model is the
same as the first one, but with h = 3 latent variables. The third model consists of
h = 1 latent variable, and the conditional graphical model structure of the observed
variables is given by a 6 × 6 nearest-neighbor grid (i.e., p = 36 and degree 4) with
the partial correlation coefficients of the edges equal to 0.15. In all three of these
models each latent variable is connected to a random subset of 80% of the observed
variables (and the partial correlation coefficients corresponding to these edges are
also random). Therefore the effect of the latent variables is “spread out” over most
of the observed variables, that is, the low-rank matrix summarizing the effect of
the latent variables is incoherent.
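A minimal sketch of how such a model can be generated is given below (our own illustration; the exact coupling strengths, normalizations and random seeds behind the paper's experiments are not specified in the text, so the numbers here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def cycle_plus_latent_model(p=36, h=2, rho=0.25, frac=0.8):
    # Conditional concentration matrix of the observed variables: a cycle with
    # partial correlation rho on each edge (unit diagonal).
    K_O = np.eye(p)
    for i in range(p):
        j = (i + 1) % p
        K_O[i, j] = K_O[j, i] = -rho
    # Each latent variable is connected to a random subset of frac*p observed variables;
    # the column normalization helps keep the full concentration matrix positive definite.
    K_OH = np.zeros((p, h))
    for k in range(h):
        idx = rng.choice(p, size=int(frac * p), replace=False)
        col = rng.standard_normal(idx.size)
        K_OH[idx, k] = 0.3 * col / np.linalg.norm(col)
    K_H = np.eye(h)
    K = np.block([[K_O, K_OH], [K_OH.T, K_H]])
    assert np.linalg.eigvalsh(K).min() > 0
    S_star = K_O                                     # sparse conditional concentration
    L_star = K_OH @ np.linalg.solve(K_H, K_OH.T)     # rank-h and incoherent
    Sigma_O = np.linalg.inv(S_star - L_star)         # marginal covariance of observed variables
    return S_star, L_star, Sigma_O

def sample_covariance(Sigma_O, n):
    X = rng.multivariate_normal(np.zeros(Sigma_O.shape[0]), Sigma_O, size=n)
    return X.T @ X / n
```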
For each model we generate n samples of the observed variables, and use the resulting sample covariance Σ_O^n as input to our convex program (1.2). Figure 1
shows the probability of obtaining algebraically correct estimates as a function
of n. This probability is evaluated over 50 experiments for each value of n. In all
of these cases standard graphical model selection applied directly to the observed
variables is not useful as the marginal concentration matrix of the observed variables is not well-approximated by a sparse matrix. These experiments agree with
our theoretical results that the convex program (1.2) is an algebraically consistent
estimator of a latent-variable model given (sufficiently many) samples of only the
observed variables.
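The experiment just described can be emulated with the sketches given earlier in this section (again our own illustration, with arbitrary thresholds and constants): for each sample size, solve (1.2) on independently drawn sample covariances and record how often the estimate reproduces the sign pattern of S∗ and the rank of L∗.

```python
import numpy as np

def recovery_probability(S_star, L_star, Sigma_O, n, trials=50,
                         lam_const=1.0, gamma=0.1, tol=1e-4):
    p = S_star.shape[0]
    h = np.linalg.matrix_rank(L_star)
    hits = 0
    for _ in range(trials):
        Sigma_n = sample_covariance(Sigma_O, n)              # from the sketch above
        lam = lam_const * np.sqrt(p / n)                      # lambda_n proportional to sqrt(p/n)
        S_hat, L_hat = fit_latent_variable_ggm(Sigma_n, lam, gamma)
        # "Algebraically correct": same sign pattern as S* and same rank as L*,
        # up to an illustrative numerical tolerance.
        same_sign = np.array_equal(
            np.where(np.abs(S_hat) > tol, np.sign(S_hat), 0.0), np.sign(S_star))
        same_rank = np.sum(np.linalg.eigvalsh(L_hat) > tol) == h
        hits += int(same_sign and same_rank)
    return hits / trials
```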
FIG. 1. Synthetic data: plot showing probability of algebraically correct estimation. The three models studied are (a) 36-node conditional graphical model given by a cycle with h = 2 latent variables, (b) 36-node conditional graphical model given by a cycle with h = 3 latent variables and (c) 36-node conditional graphical model given by a 6 × 6 grid with h = 1 latent variable. For each plotted point, the probability of algebraically correct estimation is obtained over 50 random trials.

6.2. Stock return data. In the next experiment we model the statistical structure of monthly stock returns of 84 companies in the S&P 100 index from 1990 to 2007; we disregard 16 companies that were listed after 1990. The number of samples n is equal to 216. We compute the sample covariance based on these returns and use this as input to (1.2).

The model learned using (1.2) for suitable values of λn, γ consists of h = 5 latent variables, and the conditional graphical model structure of the stock returns conditioned on these latent components consists of 135 edges. Therefore the number of parameters in the model is 84 + 135 + (5 × 84) = 639. The resulting KL divergence between the distribution specified by this model and a Gaussian distribution specified by the sample covariance is 17.7. Figure 2 (left) shows the conditional graphical model structure.

FIG. 2. Stock returns: the figure on the left shows the sparsity pattern (black denotes an edge, and white denotes no edge) of the concentration matrix of the conditional graphical model (135 edges) of the stock returns, conditioned on five latent variables, in a latent-variable graphical model (total number of parameters equals 639). This model is learned using (1.2), and the KL divergence with respect to a Gaussian distribution specified by the sample covariance is 17.7. The figure on the right shows the concentration matrix of the graphical model (646 edges) of the stock returns, learned using standard sparse graphical model selection based on solving an ℓ1-regularized maximum-likelihood program (total number of parameters equals 730). The KL divergence between this distribution and a Gaussian distribution specified by the sample covariance is 44.4.

The strongest edges in this conditional graphical model, as measured by partial correlation, are between Baker Hughes–
Schlumberger, A.T.&T.–Verizon, Merrill Lynch–Morgan Stanley, Halliburton–
Baker Hughes, Intel–Texas Instruments, Apple–Dell, and Microsoft–Dell. It is of
interest to note that in the Standard Industrial Classification system (see the U.S. SEC website at http://www.sec.gov/info/edgar/siccodes.htm) for grouping
these companies, several of these pairs are in different classes. As mentioned in
Section 2.1, our method estimates a low-rank matrix that summarizes the effect of
the latent variables; in order to factorize this low-rank matrix, for example, into
sparse factors, one could use methods such as those described in [35].
We compare these results to those obtained using a sparse graphical model learned using ℓ1-regularized maximum-likelihood (see, e.g., [26]), without introducing any latent variables. Figure 2 (right) shows this graphical model structure.
The number of edges in this model is 646 (the total number of parameters is equal
to 646 + 84 = 730), and the resulting KL divergence between this distribution and
a Gaussian distribution specified by the sample covariance is 44.4.
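The KL divergences quoted in this subsection are between zero-mean Gaussians, for which the closed form is KL(N(0, Σ1) ‖ N(0, Σ2)) = ½[tr(Σ2⁻¹Σ1) − p + log det Σ2 − log det Σ1]. A small helper of ours is given below; the text does not spell out the direction of the divergence, so that remains an assumption:

```python
import numpy as np

def gaussian_kl(Sigma_1, Sigma_2):
    """KL( N(0, Sigma_1) || N(0, Sigma_2) ) for zero-mean Gaussians."""
    p = Sigma_1.shape[0]
    Sigma_2_inv = np.linalg.inv(Sigma_2)
    _, logdet_1 = np.linalg.slogdet(Sigma_1)
    _, logdet_2 = np.linalg.slogdet(Sigma_2)
    return 0.5 * (np.trace(Sigma_2_inv @ Sigma_1) - p + logdet_2 - logdet_1)

# For the latent-variable model, the model-implied covariance is (S_hat - L_hat)^{-1};
# e.g., gaussian_kl(np.linalg.inv(S_hat - L_hat), Sigma_sample).
```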
These results suggest that a latent-variable graphical model is better suited than
a standard sparse graphical model for modeling stock returns. This is likely due to
the presence of global, long-range correlations in stock return data that are better
modeled via latent variables.
7. Discussion. We have studied the problem of modeling the statistical structure of a collection of random variables as a sparse graphical model conditioned
on a few additional latent components. As a first contribution we described conditions under which such latent-variable graphical models are identifiable given
samples of only the observed variables. We also proposed a convex program based
on ℓ1 and nuclear norm regularized maximum-likelihood for latent-variable graphical model selection. Given samples of the observed variables of a latent-variable
Gaussian model, we proved that this convex program provides consistent estimates of the number of latent components as well as the conditional graphical
model structure among the observed variables conditioned on the latent components. Our analysis holds in the high-dimensional regime in which the number of
observed/latent variables is allowed to grow with the number of samples of the
observed variables. These theoretical predictions are verified via a set of experiments on synthetic data. We also demonstrate the effectiveness of our approach in
modeling real-world stock return data.
Several questions arise that are worthy of further investigation. While (1.2) can
be solved in polynomial time using off-the-shelf solvers, it is preferable to develop
more efficient special-purpose solvers to scale to massive datasets by taking advantage of the structure of (1.2). It is also of interest to develop statistically consistent convex optimization methods for latent-variable modeling with non-Gaussian
variables, for example, for categorical data.
Acknowledgments. We would like to thank James Saunderson and Myung
Jin Choi for helpful discussions, and Kim-Chuan Toh for kindly providing us specialized code to solve larger instances of our convex program.
SUPPLEMENTARY MATERIAL
Supplement to “Latent variable graphical model selection via convex optimization” (DOI: 10.1214/11-AOS949SUPP; .pdf). Due to space constraints, we
have moved some technical proofs to a supplementary document [6].
REFERENCES
[1] B ICKEL , P. J. and L EVINA , E. (2008). Regularized estimation of large covariance matrices.
Ann. Statist. 36 199–227. MR2387969
[2] B ICKEL , P. J. and L EVINA , E. (2008). Covariance regularization by thresholding. Ann. Statist.
36 2577–2604. MR2485008
[3] C ANDÈS , E. J., L I , X., M A , Y. and W RIGHT, J. (2011). Robust principal component analysis?
J. ACM 58 Art. 11, 37. MR2811000
[4] C ANDÈS , E. J. and R ECHT, B. (2009). Exact matrix completion via convex optimization.
Found. Comput. Math. 9 717–772. MR2565240
[5] C ANDÈS , E. J., ROMBERG , J. and TAO , T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform.
Theory 52 489–509. MR2236170
[6] C HANDRASEKARAN , V., PARRILO , P. A. and W ILLSKY, A. S. (2011). Supplement to "Latent variable graphical model selection via convex optimization." DOI:10.1214/11-AOS949SUPP.
[7] C HANDRASEKARAN , V., S ANGHAVI , S., PARRILO , P. A. and W ILLSKY, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21 572–596. MR2817479
[8] DAVIDSON , K. R. and S ZAREK , S. J. (2001). Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, Vol. I 317–366. North-Holland, Amsterdam. MR1863696
[9] D EMPSTER , A. P., L AIRD , N. M. and RUBIN , D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38.
[10] D ONOHO , D. L. (2006). For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 797–829. MR2217606
[11] D ONOHO , D. L. (2006). Compressed sensing. IEEE Trans. Inform. Theory 52 1289–1306.
MR2241189
[12] E L K AROUI , N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36 2717–2756. MR2485011
[13] E LIDAN , G., NACHMAN , I. and F RIEDMAN , N. (2007). “Ideal parent” structure learning for
continuous variable Bayesian networks. J. Mach. Learn. Res. 8 1799–1833. MR2353820
[14] FAN , J., FAN , Y. and LV, J. (2008). High dimensional covariance matrix estimation using a
factor model. J. Econometrics 147 186–197. MR2472991
[15] FAZEL , M. (2002). Matrix rank minimization with applications. Ph.D. thesis, Dept. Elec. Eng.,
Stanford Univ.
[16] H ORN , R. A. and J OHNSON , C. R. (1990). Matrix Analysis. Cambridge Univ. Press, Cambridge. MR1084815
[17] J OHNSTONE , I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327. MR1863961
[18] L AM , C. and FAN , J. (2009). Sparsistency and rates of convergence in large covariance matrix
estimation. Ann. Statist. 37 4254–4278. MR2572459
[19] L AURITZEN , S. L. (1996). Graphical Models. Oxford Statistical Science Series 17. Oxford
Univ. Press, New York. MR1419991
[20] L EDOIT, O. and W OLF, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. J. Multivariate Anal. 88 365–411. MR2026339
[21] L ÖFBERG , J. (2004). YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taiwan. Available at http://control.ee.ethz.ch/~joloef/yalmip.php.
[22] M AR ČENKO , V. A. and PASTUR , L. A. (1967). Distribution of eigenvalues in certain sets of
random matrices. Mat. Sb. 72 507–536. MR0208649
[23] M EINSHAUSEN , N. and B ÜHLMANN , P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[24] M EINSHAUSEN , N. and B ÜHLMANN , P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat.
Methodol. 72 417–473. MR2758523
[25] O RTEGA , J. M. and R HEINBOLDT, W. C. (1970). Iterative Solution of Nonlinear Equations in
Several Variables. Academic Press, New York. MR0273810
[26] R AVIKUMAR , P., WAINWRIGHT, M. J., R ASKUTTI , G. and Y U , B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 4 935–980.
[27] R ECHT, B., FAZEL , M. and PARRILO , P. A. (2010). Guaranteed minimum-rank solutions
of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501.
MR2680543
[28] ROCKAFELLAR , R. T. (1996). Convex Analysis. Princeton Univ. Press, Princeton, NJ.
MR1451876
[29] ROTHMAN , A. J., B ICKEL , P. J., L EVINA , E. and Z HU , J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat. 2 494–515. MR2417391
[30] S PEED , T. P. and K IIVERI , H. T. (1986). Gaussian Markov distributions over finite graphs.
Ann. Statist. 14 138–150. MR0829559
[31] T OH , K. C., T ODD , M. J. and T UTUNCU , R. H. (1999). SDPT3—a MATLAB software package for semidefinite-quadratic-linear programming. Available at http://www.math.nus.edu.sg/~mattohkc/sdpt3.html.
[32] WAINWRIGHT, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202. MR2729873
[33] WANG , C., S UN , D. and T OH , K.-C. (2010). Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM J. Optim. 20 2994–3013.
MR2735941
[34] WATSON , G. A. (1992). Characterization of the subdifferential of some matrix norms. Linear
Algebra Appl. 170 33–45. MR1160950
[35] W ITTEN , D. M., T IBSHIRANI , R. and H ASTIE , T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis.
Biostatistics 10 515–534.
[36] W U , W. B. and P OURAHMADI , M. (2003). Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90 831–844. MR2024760
[37] Z HAO , P. and Y U , B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res.
7 2541–2563. MR2274449
V. C HANDRASEKARAN
D EPARTMENT OF C OMPUTING
AND M ATHEMATICAL S CIENCES
C ALIFORNIA I NSTITUTE OF T ECHNOLOGY
PASADENA , C ALIFORNIA 91106
USA
E- MAIL : venkatc@caltech.edu
P. A. PARRILO
A. S. W ILLSKY
L ABORATORY FOR I NFORMATION
AND D ECISION S YSTEMS
D EPARTMENT OF E LECTRICAL E NGINEERING
AND C OMPUTER S CIENCE
M ASSACHUSETTS I NSTITUTE OF T ECHNOLOGY
C AMBRIDGE , M ASSACHUSETTS 02139
USA
E- MAIL : parrilo@mit.edu
willsky@mit.edu