Uploaded by vsaikishore24

11-AOS949

advertisement
The Annals of Statistics
2012, Vol. 40, No. 4, 1935–1967
DOI: 10.1214/11-AOS949
© Institute of Mathematical Statistics, 2012
LATENT VARIABLE GRAPHICAL MODEL SELECTION VIA
CONVEX OPTIMIZATION1
B Y V ENKAT C HANDRASEKARAN , PABLO A. PARRILO AND
A LAN S. W ILLSKY
California Institute of Technology, Massachusetts Institute of Technology
and Massachusetts Institute of Technology
Suppose we observe samples of a subset of a collection of random variables. No additional information is provided about the number of latent variables, nor of the relationship between the latent and observed variables. Is it
possible to discover the number of latent components, and to learn a statistical
model over the entire collection of variables? We address this question in the
setting in which the latent and observed variables are jointly Gaussian, with
the conditional statistics of the observed variables conditioned on the latent
variables being specified by a graphical model. As a first step we give natural conditions under which such latent-variable Gaussian graphical models
are identifiable given marginal statistics of only the observed variables. Essentially these conditions require that the conditional graphical model among
the observed variables is sparse, while the effect of the latent variables is
“spread out” over most of the observed variables. Next we propose a tractable
convex program based on regularized maximum-likelihood for model selection in this latent-variable setting; the regularizer uses both the 1 norm and
the nuclear norm. Our modeling framework can be viewed as a combination
of dimensionality reduction (to identify latent variables) and graphical modeling (to capture remaining statistical structure not attributable to the latent
variables), and it consistently estimates both the number of latent components and the conditional graphical model structure among the observed variables. These results are applicable in the high-dimensional setting in which
the number of latent/observed variables grows with the number of samples
of the observed variables. The geometric properties of the algebraic varieties
of sparse matrices and of low-rank matrices play an important role in our
analysis.
Received August 2010; revised November 2011.
Discussed in 10.1214/12-AOS979, 10.1214/12-AOS980, 10.1214/12-AOS981,
10.1214/12-AOS984, 10.1214/12-AOS985 and 10.1214/12-AOS1001; rejoinder at
10.1214/12-AOS1020.
1 Supported in part by AFOSR Grant FA9550-08-1-0180, in part under a MURI through AFOSR
Grant FA9550-06-1-0324, in part under a MURI through AFOSR Grant FA9550-06-1-0303 and in
part by NSF FRG 0757207.
MSC2010 subject classifications. 62F12, 62H12.
Key words and phrases. Gaussian graphical models, covariance selection, latent variables, regularization, sparsity, low-rank, algebraic statistics, high-dimensional asymptotics.
1935
1936
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
1. Introduction and setup. Statistical model selection in the high-dimensional regime arises in a number of applications. In many data analysis problems in
geophysics, radiology, genetics, climate studies, and image processing, the number
of samples available is comparable to or even smaller than the number of variables.
As empirical statistics in these settings may not be well-behaved (see [17, 22]),
high-dimensional model selection is therefore both challenging and of great interest. A model selection problem that has received considerable attention recently
is the estimation of covariance matrices in the high-dimensional setting. As the
sample covariance matrix is poorly behaved in such a regime, some form of regularization of the sample covariance is adopted based on assumptions about the true
underlying covariance matrix [1, 2, 12, 14, 20, 36].
Graphical models. A number of papers have studied covariance estimation in
the context of Gaussian graphical model selection. A Gaussian graphical model
[19, 30] (also commonly referred to as a Gauss–Markov random field) is a statistical model defined with respect to a graph, in which the nodes index a collection
of jointly Gaussian random variables and the edges represent the conditional independence relations (Markov structure) among the variables. In such models the
sparsity pattern of the inverse of the covariance matrix, or the concentration matrix, directly corresponds to the graphical model structure. Specifically, consider a
Gaussian graphical model in which the covariance matrix is given by a positivedefinite ∗ and the concentration matrix is given by K ∗ = ( ∗ )−1 . Then an edge
∗ = 0. In par{i, j } is present in the underlying graphical model if and only if Ki,j
ticular the absence of an edge between two nodes implies that the corresponding variables are independent conditioned on all the other variables. The model
selection method usually studied in such a Gaussian graphical model setting is
1 -regularized maximum-likelihood, with the 1 penalty applied to the entries of
the concentration matrix to induce sparsity. The consistency properties of such an
estimator have been studied [18, 26, 29], and under suitable conditions [18, 26]
this estimator is also “sparsistent,” that is, the estimated concentration matrix has
the same sparsity pattern as the true model from which the samples are generated.
An alternative approach to 1 -regularized maximum-likelihood is to estimate the
sparsity pattern of the concentration matrix by performing regression separately on
each variable [23]; while such a method consistently estimates the sparsity pattern,
it does not directly provide estimates of the covariance or concentration matrix.
In many applications throughout science and engineering (e.g., psychology,
computational biology, and economics), a challenge is that one may not have access to observations of all the relevant phenomena, that is, some of the relevant
variables may be latent or unobserved. In general latent variables pose a significant difficulty for model selection because one may not know the number of relevant latent variables, nor the relationship between these variables and the observed
variables. Typical algorithmic methods that try to get around this difficulty usually
fix the number of latent variables as well as the structural relationship between
LATENT VARIABLE MODEL SELECTION
1937
latent and observed variables (e.g., the graphical model structure between latent
and observed variables), and use the EM algorithm to fit parameters [9]. This approach suffers from the problem that one optimizes nonconvex functions, and thus
one may get stuck in suboptimal local minima. An alternative suggestion [13] is
one based on a greedy, local, combinatorial heuristic that assigns latent variables
to groups of observed variables, via some form of clustering of the observed variables; however, this approach has no consistency guarantees.
Our setup. In this paper we study the problem of latent-variable graphical
model selection in the setting where all the variables, both observed and latent,
are jointly Gaussian. More concretely, X is a Gaussian random vector in Rp+h ,
O and H are disjoint subsets of indices in {1, . . . , p + h} of cardinalities |O| = p
and |H | = h, and the corresponding subvectors of X are denoted by XO and XH ,
∗
respectively. Let the covariance matrix underlying X be denoted by (O
H ) . The
marginal statistics corresponding to the observed variables XO are given by the
∗ , which is simply a submatrix of the full covarimarginal covariance matrix O
∗
ance matrix (O H ) . However, suppose that we parameterize our model by the
∗
∗
−1
concentration matrix K(O
H ) = ((O H ) ) , which as discussed above reveals the
∗ , K∗
∗
connection to graphical models. Here the submatrices KO
O,H , KH specify (in
the full model) the dependencies among the observed variables, between the observed and latent variables, and among the latent variables, respectively. In such a
∗ )−1 corresponding to the
parameterization, the marginal concentration matrix (O
observed variables XO is given by the Schur complement [16] with respect to the
∗:
block KH
(1.1)
∗
∗ −1
∗
∗
∗ −1 ∗
= (O
) = KO
− KO,H
(KH
) KH,O .
K̃O
∗ (or K̃ ∗ ).
Thus if we only observe the variables XO , we only have access to O
O
∗ above have interesting properties. The matrix
The two terms that compose K̃O
∗ specifies the concentration matrix of the conditional statistics of the obKO
served variables given the latent variables. If these conditional statistics are given
∗ is sparse. On the other hand, the matrix
by a sparse graphical model, then KO
∗
∗ )−1 K ∗
(KH
KO,H
H,O serves as a summary of the effect of marginalization over the
latent variables XH . This matrix has small rank if the number of latent, unobserved variables XH is small relative to the number of observed variables XO .
∗ is generally not sparse due to the
Therefore the marginal concentration matrix K̃O
∗
∗ )−1 K ∗ . Hence standard graphical model
(KH
additional low-rank term KO,H
H,O
selection techniques applied directly to the observed variables XO are not useful.
A modeling paradigm that infers the effect of the latent variables XH would be
more suitable in order to provide a concise explanation of the underlying statistical
structure. Hence we approximate the sample covariance by a model in which the
concentration matrix decomposes into the sum of a sparse matrix and a low-rank
matrix, which reveals the conditional graphical model structure in the observed
1938
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
variables as well as the number of and effect due to the unobserved latent variables. Such a method can be viewed as a blend of principal component analysis
and graphical modeling. In standard graphical modeling one would directly approximate a concentration matrix by a sparse matrix to learn a sparse graphical
model, while in principal component analysis the goal is to explain the statistical
structure underlying a set of observations using a small number of latent variables
(i.e., approximate a covariance matrix as a low-rank matrix). In our framework we
learn a sparse graphical model among the observed variables conditioned on a few
(additional) latent variables. These latent variables are not principal components,
as the conditional statistics (conditioned on these latent variables) are given by a
graphical model. Therefore we refer to these latent variables informally as latent
components.
Contributions. Our first contribution in Section 3 is to address the fundamental question of identifiability of such latent-variable graphical models given the
marginal statistics of only the observed variables. The critical point is that we
need to tease apart the correlations induced due to marginalization over the latent variables from the conditional graphical model structure among the observed
variables. As the identifiability problem is one of uniquely decomposing the sum
of a sparse matrix and a low-rank matrix into the individual components, we
study the algebraic varieties of sparse matrices and low-rank matrices. An important theme in this paper is the connection between the tangent spaces to these
∗ ) dealgebraic varieties and the question of identifiability. Specifically let (KO
∗
note the tangent space at KO to the algebraic variety of sparse matrices, and let
∗
∗ )−1 K ∗ ) denote the tangent space at K ∗
∗ −1 ∗
(KH
T (KO,H
H,O
O,H (KH ) KH,O to the algebraic variety of low-rank matrices. Then the statistical question of identifiability
∗ and K ∗
∗ −1 ∗
∗
of KO
O,H (KH ) KH,O given K̃O is determined by the geometric notion
∗ ) and T (K ∗
∗ −1 ∗
of transversality of the tangent spaces (KO
O,H (KH ) KH,O ). The
study of the transversality of these tangent spaces leads to natural conditions for
identifiability. In particular we show that latent-variable models in which (1) the
∗ has a small number of nonzeros per row/column, and (2) the
sparse matrix KO
∗
∗ )−1 K ∗
(KH
low-rank matrix KO,H
H,O has row/column spaces that are not closely
aligned with the coordinate axes, are identifiable. These conditions have natural
statistical interpretations. The first condition ensures that there are no densely connected subgraphs in the conditional graphical model structure among the observed
variables, that is, that these conditional statistics are indeed specified by a sparse
graphical model. Such statistical relationships may otherwise be mistakenly attributed to the effect of marginalization over some latent variable. The second condition ensures that the effect of marginalization over the latent variables is “spread
out” over many observed variables; thus, the effect of marginalization over a latent
variable is not confused with the conditional graphical model structure among the
observed variables. In fact the first condition is often assumed in standard graphical
model selection without latent variables (e.g., [26]).
1939
LATENT VARIABLE MODEL SELECTION
As our next contribution we propose a regularized maximum-likelihood decomposition framework to approximate a given sample covariance matrix by a model
in which the concentration matrix decomposes into a sparse matrix and a low-rank
matrix. Based on the effectiveness of the 1 norm as a tractable convex relaxation
for recovering sparse models [5, 10, 11] and the nuclear norm for low-rank matrices [4, 15, 27], we propose the following penalized likelihood method given a
n formed from n samples of the observed variables:
sample covariance matrix O
n
(Ŝn , L̂n ) = arg min −(S − L; O
) + λn γ S1 + tr(L)
(1.2)
S,L
s.t. S − L 0, L 0.
The constraints 0 and 0 impose positive-definiteness and positive-semidefiniteness. The function represents the Gaussian log-likelihood (K; ) =
log det(K) − tr(K) for K 0, where tr is the trace of a matrix and det is the
∗ , which represents the condideterminant. Here Ŝn provides an estimate of KO
tional concentration matrix of the observed variables; L̂n provides an estimate of
∗
∗ )−1 K ∗ , which represents the effect of marginalization over the latent
KO,H
(KH
H,O
variables. The regularizer is a combination of the 1 norm applied to S and the
nuclear norm applied to L (the nuclear norm reduces to the trace over the cone of
symmetric, positive-semidefinite matrices), with γ providing a trade-off between
the two terms. This variational formulation is a convex optimization problem, and
it is a regularized max-det program that can be solved in polynomial time using
general-purpose solvers [33].
Our main result in Section 4 is a proof of the consistency of the estimator (1.2) in
the high-dimensional regime in which both the number of observed variables and
the number of latent components are allowed to grow with the number of samples
(of the observed variables). We show that for a suitable choice of the regularization
parameter λn , there exists a range of values of γ for which the estimates (Ŝn , L̂n )
∗ , K∗
∗ −1 ∗
have the same sparsity (and sign) pattern and rank as (KO
O,H (KH ) KH,O )
with high probability (see Theorem 4.1). The key technical requirement is an identifiability condition for the two components of the marginal concentration ma∗ with respect to the Fisher information (see Section 3.4). We make contrix K̃O
nections between our condition and the irrepresentability conditions required for
support/graphical-model recovery using 1 regularization [26, 32, 37]. Our results provide numerous scaling regimes under which consistency holds in latentvariable graphical model selection. For example, we show that under suitable identifiability conditions consistent model selection is possible even when the number
of samples and the number of latent variables are on the same order as the number
of observed variables (see Section 4.2).
Related previous work. The problem of decomposing the sum of a sparse matrix and a low-rank matrix via convex optimization into the individual components
1940
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
was initially studied in [7] by a superset of the authors of the present paper, with
conditions derived under which the convex program exactly recovers the underlying components. In subsequent work Candès et al. [3] also studied this sparseplus-low-rank decomposition problem, and provided guarantees for exact recovery using the convex program proposed in [7]. The problem setup considered in
the present paper is quite different and is more challenging because we are only
given access to an inexact sample covariance matrix, and we wish to produce an
inverse covariance matrix that can be decomposed as the sum of sparse and lowrank components (preserving the sparsity pattern and rank of the components in
the true underlying model). In addition to proving the consistency of the estimator (1.2), we also provide a statistical interpretation of our identifiability conditions
and describe natural classes of latent-variable Gaussian graphical models that satisfy these conditions. As such our paper is closer in spirit to the many recent papers on covariance selection, but with the important difference that some of the
variables are not observed.
Outline. Section 2 gives some background and a formal problem statement.
Section 3 discusses the identifiability question, Section 4 states the main results of
this paper, and Section 5 gives some proofs. We provide experimental demonstration of the effectiveness of our estimator on synthetic and real data in Section 6,
and conclude with a brief discussion in Section 7. Some of our technical results
are deferred to supplementary material [6].
2. Problem statement and background. We give a formal statement of the
latent-variable model selection problem. We also briefly describe various properties of the algebraic varieties of sparse matrices and of low-rank matrices, and the
properties of the Gaussian likelihood function.
The following matrix norms are employed throughout this paper. M2 denotes
the spectral norm, or the largest singular value of M. M∞ denotes the largest
entry in magnitude of M. MF denotes the Frobenius norm, or the square root
of the sum of the squares of the entries of M. M∗ denotes the nuclear norm,
or the sum of the singular values of M (this reduces to the trace for positivesemidefinite matrices). M1 denotes the sum of the absolute values of the entries of M. A number of matrix operator norms are also used. For example, let
Z : Rp×p → Rp×p be a linear operator acting on matrices. Then the induced
operator norm is defined as Z q→q maxN∈Rp×p ,Nq ≤1 Z (N)q . Therefore,
Z F →F denotes the spectral norm of the operator Z . The only vector norm used
is the Euclidean norm, which is denoted by · . Given any norm · q (either a
vector norm, a matrix norm or a matrix operator norm), the dual norm is given by
M∗q sup{ M, N |N q ≤ 1}.
2.1. Problem statement. In order to analyze latent-variable model selection methods, we need to define an appropriate notion of model selection con-
LATENT VARIABLE MODEL SELECTION
1941
∗
sistency for latent-variable graphical models. Given the two components KO
∗
∗
∗
−1
and KO,H (KH ) KH,O of the concentration matrix of the marginal distribution (1.1), there are infinitely many configurations of the latent variables [i.e.,
∗ 0, K ∗
∗
T
matrices KH
O,H = (KH,O ) ] that give rise to the same low-rank ma∗
∗
∗
trix KO,H (KH )−1 KH,O . Specifically for any nonsingular matrix B ∈ R|H |×|H | ,
∗ → BK ∗ B T , K ∗
∗
T
one can apply the transformations KH
H
O,H → KO,H B and still
∗
∗ )−1 K ∗ . In all of these models the
(KH
preserve the low-rank matrix KO,H
H,O
marginal statistics of the observed variables XO remain the same upon marginalization over the latent variables XH . The key invariant is the low-rank matrix
∗
∗ )−1 K ∗ , which summarizes the effect of marginalization over the
(KH
KO,H
H,O
∗ and
latent variables. Consequently, from here on we use the notation S ∗ = KO
∗
∗ )−1 K ∗ . These observations give rise to the following notion of
(KH
L∗ = KO,H
H,O
structure recovery.
D EFINITION 2.1. A pair of |O| × |O| symmetric matrices (Ŝ, L̂) is an algebraically correct estimate of a latent-variable Gaussian graphical model given by
∗
the concentration matrix K(O
H ) if the following conditions hold:
(1) The sign-pattern of Ŝ is the same as that of S ∗ [here sign(0) = 0]:
∗
)
sign(Ŝi,j ) = sign(Si,j
∀i, j.
(2) The rank of L̂ is the same as the rank of L∗ :
rank(L̂) = rank(L∗ ).
(3) The concentration matrix Ŝ − L̂ can be realized as the marginal concentration matrix of an appropriate latent-variable model:
Ŝ − L̂ 0,
L̂ 0.
When a sequence of estimators is algebraically correct with probability approaching 1 in a suitable high-dimensional scaling regime, then we say that the
estimators are algebraically consistent. The first condition ensures that Ŝ provides
the correct structural estimate of the conditional graphical model of the observed
variables conditioned on the latent components. This property is the same as the
“sparsistency” property studied in standard graphical model selection [18, 26].
The second condition ensures that the number of latent components is properly
estimated. Finally, the third condition ensures that the pair of matrices (Ŝ, L̂) leads
to a realizable latent-variable model. In particular, this condition implies that there
exists a valid latent-variable model in which (a) the conditional graphical model
structure among the observed variables is given by Ŝ, (b) the number of latent
variables is equal to the rank of L̂, and (c) the extra correlations induced due to
marginalization over the latent variables are equal to L̂. Any method for matrix
1942
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
factorization (e.g., [35]) can be used to further factorize L̂, depending on the property that one desires in the factors (e.g., sparsity).
We also study estimation error rates in the usual sense, that is, we show that
one can produce estimates (Ŝ, L̂) that are close in various norms to the matrices
(S ∗ , L∗ ). Notice that bounding the estimation error in some norm does not in general imply that the support/sign-pattern and rank of (Ŝ, L̂) are the same as those of
(S ∗ , L∗ ). Therefore bounded estimation error is different from algebraic correctness, which requires that (Ŝ, L̂) have the same support/sign-pattern and rank as
(S ∗ , L∗ ).
∗
Goal. Let K(O
H ) denote the concentration matrix of a Gaussian model. Supi }n of the observed variables X . We would like
pose that we have n samples {XO
O
i=1
to produce estimates (Ŝn , L̂n ) that, with high probability, are algebraically correct
and have bounded estimation error (in some norm).
Our approach. We propose the regularized likelihood convex program (1.2)
n in
to produce estimates (Ŝn , L̂n ). Specifically, the sample covariance matrix O
(1.2) is defined as
n
O
n
1
i
i T
XO
XO
.
n i=1
∗
We give conditions on the underlying model K(O
H ) and suitable choices for the
parameters λn , γ under which the estimates (Ŝn , L̂n ) are consistent (see Theorem 4.1).
2.2. Likelihood function and Fisher information. Given n samples {Xi }ni=1 of
a finite collection of jointly Gaussian zero-mean random variables with concentration matrix K ∗ , it is easily seen that the log-likelihood function is given by:
(2.1)
(K; n ) = log det(K) − tr(K n ),
where (K; n ) is a function of K. Notice that this function is strictly concave for
n , the
K 0. In the latent-variable modeling problem with sample covariance O
likelihood function with respect to the parametrization (S, L) is given by (S −
n ). This function is jointly concave with respect to the parameters (S, L)
L; O
whenever S − L 0, and it is employed in our variational formulation (1.2) to
learn a latent-variable model.
In the analysis of a convex program involving the likelihood function, the Fisher
information plays an important role as it is the negative of the Hessian of the
likelihood function and thus controls the curvature. As the first term in the likelihood function is linear, we need only study higher-order derivatives of the logdeterminant function in order to compute the Hessian. In the latent-variable setting with the marginal concentration matrix of the observed variables given by
LATENT VARIABLE MODEL SELECTION
1943
∗ = ( ∗ )−1 [see (1.1)], the corresponding Fisher information matrix is
K̃O
O
(2.2)
∗
∗ −1
∗ −1
∗
∗
I (K̃O
) = (K̃O
) ⊗ (K̃O
) = O
⊗ O
.
Here ⊗ denotes the tensor product between matrices. Notice that this is precisely
∗
the |O|2 × |O|2 submatrix of the full Fisher information matrix I (K(O
H )) =
∗
∗
∗
∗
−1
(O H ) ⊗ (O H ) with respect to all the parameters K(O H ) = ((O H ) ) (corresponding to the situation in which all the variables XO∪H are observed). In Sec∗)
tion 3.4 we impose various conditions on the Fisher information matrix I (K̃O
under which our regularized maximum-likelihood formulation provides consistent
estimates.
2.3. Algebraic varieties of sparse and low-rank matrices. The set of sparse
matrices and the set of low-rank matrices can be naturally viewed as algebraic varieties (solution sets of systems of polynomial equations). Here we describe these
varieties, and discuss some of their geometric properties such as the tangent space
and local curvature at a (smooth) point.
Let S (k) denote the set of matrices with at most k nonzeros:
(2.3)
S (k) {M ∈ Rp×p ||support(M)| ≤ k}.
Here support denotes the locations of nonzero entries. The set S (k) is an algebraic
2
variety, and can in fact be viewed as a union of pk subspaces in Rp×p . This
variety has dimension k, and it is smooth everywhere except at those matrices that
have support size strictly smaller than k. For any matrix M ∈ Rp×p , consider the
variety S (|support(M)|); M is a smooth point of this variety, and the tangent space
at M is given by
(2.4)
(M) = {N ∈ Rp×p |support(N) ⊆ support(M)}.
Next let L(r) denote the algebraic variety of matrices with rank at most r:
(2.5)
L(r) {M ∈ Rp×p |rank(M) ≤ r}.
It is easily seen that L(r) is an algebraic variety because it can be defined through
the vanishing of all (r + 1) × (r + 1) minors. This variety has dimension equal
to r(2p − r), and it is smooth everywhere except at those matrices that have rank
strictly smaller than r. Consider a rank-r matrix M with singular value decomposition (SVD) given by M = U DV T , where U, V ∈ Rp×r and D ∈ Rr×r . The
matrix M is a smooth point of the variety L(rank(M)), and the tangent space at M
with respect to this variety is given by
(2.6)
T (M) = {U Y1T + Y2 V T |Y1 , Y2 ∈ Rp×r }.
We view both (M) and T (M) as subspaces in Rp×p . In Section 3 we explore
the connection between geometric properties of these tangent spaces and the identifiability problem in latent-variable graphical models.
1944
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
Curvature of rank variety. The sparse matrix variety S (k) has the property
that it has zero curvature at any smooth point. The situation is more complicated
for the low-rank matrix variety L(r), because the curvature at any smooth point
is nonzero. We analyze how this variety curves locally, by studying how the tangent space changes from one point to a neighboring point. Indeed the amount of
curvature at a point is directly related to the “angle” between the tangent space at
that point and the tangent space at a neighboring point. For any linear subspace
T of matrices, let PT denote the projection onto T . Given two subspaces T1 , T2
of the same dimension, we measure the “twisting” between these subspaces by
considering the following quantity:
(2.7)
ρ(T1 , T2 ) PT1 − PT2 2→2 = max [PT1 − PT2 ](N)2 .
N2 ≤1
In the supplement [6] we review relevant results from matrix perturbation theory,
which suggest that the magnitude of the smallest nonzero singular value is closely
tied to the local curvature of the variety. Therefore we control the twisting between
tangent spaces at nearby points by bounding the smallest nonzero singular value
away from zero.
3. Identifiability. In the absence of additional conditions, the latent-variable
model selection problem is ill-posed. In this section we discuss a set of conditions on latent-variable models that ensure that these models are identifiable given
marginal statistics for a subset of the variables. Some of the discussion in Sections
3.1 and 3.2 is presented in greater detail in [7].
3.1. Structure between latent and observed variables. Suppose that the lowrank matrix that summarizes the effect of the latent components is itself sparse.
This leads to identifiability issues in the sparse-plus-low-rank decomposition problem. Statistically the additional correlations induced due to marginalization over
the latent variables could be mistaken for the conditional graphical model structure
of the observed variables. In order to avoid such identifiability problems the effect
of the latent variables must be “diffuse” across the observed variables. To address
this point the following quantity was introduced in [7] for any matrix M, defined
with respect to the tangent space T (M):
(3.1)
ξ(T (M)) max
N∞ .
N∈T (M),N2 ≤1
Thus ξ(T (M)) being small implies that elements of the tangent space T (M) cannot have their support concentrated in a few locations; as a result M cannot be
too sparse. This idea is formalized in [7] by relating ξ(T (M)) to a notion of “incoherence” of the row/column spaces, where the row/column spaces are said to
be incoherent with respect to the standard basis if these spaces are not aligned
closely with any of the coordinate axes. Typically a matrix M with incoherent
row/column spaces would have ξ(T (M)) 1. This point is quantified precisely
1945
LATENT VARIABLE MODEL SELECTION
in [7]. Specifically, we note that ξ(T (M)) can be as small as ∼
r
p
for a rank-r
Rp×p
matrix M ∈
with row/column spaces that are almost maximally incoherent (e.g., if the row/column spaces span any r columns of a p × p orthonormal
Hadamard matrix). On the other hand, ξ(T (M)) = 1 if the row/column spaces of
M contain a standard basis vector.
Based on these concepts we roughly require that the low-rank matrix that summarizes the effect of the latent variables be incoherent, thereby ensuring that the
extra correlations due to marginalization over the latent components cannot be
confused with the conditional graphical model structure of the observed variables.
Notice that the quantity ξ is not just a measure of the number of latent variables,
but also of the overall effect of the correlations induced by marginalization over
these variables.
Curvature and change in ξ : As noted previously, an important technical point
is that the algebraic variety of low-rank matrices is locally curved at any smooth
point. Consequently the quantity ξ changes as we move along the low-rank matrix
variety smoothly. The quantity ρ(T1 , T2 ) introduced in (2.7) allows us to bound
the variation in ξ as follows (proof in Section 5):
L EMMA 3.1. Let T1 , T2 be two linear subspaces of matrices of the same dimension with the property that ρ(T1 , T2 ) < 1, where ρ is defined in (2.7). Then we
have that
1
[ξ(T1 ) + ρ(T1 , T2 )].
ξ(T2 ) ≤
1 − ρ(T1 , T2 )
3.2. Structure among observed variables. An identifiability problem also
arises if the conditional graphical model among the observed variables contains
a densely connected subgraph. These statistical relationships might be mistaken as
correlations induced by marginalization over latent variables. Therefore we need
to ensure that the conditional graphical model among the observed variables is
sparse. We impose the condition that this conditional graphical model must have
small “degree,” that is, no observed variable is directly connected to too many
other observed variables conditioned on the latent components. Notice that bounding the degree is a more refined condition than simply bounding the total number
of nonzeros as the sparsity pattern also plays a role. In [7] the authors introduced
the following quantity in order to provide an appropriate measure of the sparsity
pattern of a matrix:
(3.2)
μ((M)) max
N2 .
N∈(M),N∞ ≤1
The quantity μ((M)) being small for a matrix implies that the spectrum of any
element of the tangent space (M) is not too “concentrated,” that is, the singular
values of the elements of the tangent space are not too large. In [7] it is shown
that a sparse matrix M with “bounded degree” (a small number of nonzeros per
1946
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
row/column) has small μ(M). Specifically, if M ∈ Rp×p is any matrix with at
most deg(M) nonzero entries per row/column, then we have that
μ((M)) ≤ deg(M).
3.3. Transversality of tangent spaces. Suppose that we have the sum of two
vectors, each from two known subspaces. It is possible to uniquely recover the
individual vectors from the sum if and only if the subspaces have a transverse intersection, that is, they only intersect at the origin. This simple observation leads
to an appealing geometric notion of identifiability. Suppose now that we have the
sum of a sparse matrix and a low-rank matrix, and that we are also given the tangent spaces at these matrices with respect to the algebraic varieties of sparse and
low-rank matrices, respectively. Then a necessary and sufficient condition for identifiability with respect to the tangent spaces is that these spaces have a transverse
intersection. This transverse intersection condition is also sufficient for local identifiability in a neighborhood around the sparse matrix and low-rank matrix with
respect to the varieties of sparse and low-rank matrices (due to the inverse function theorem). It turns out that these tangent space transversality conditions are
also sufficient for the convex program (1.2) to provide consistent estimates of a
latent-variable graphical model (without any side information about the tangent
spaces).
In order to quantify the level of transversality between the tangent spaces and
T we study the minimum gain with respect to some norm of the addition operator
(which adds two matrices) A : Rp×p × Rp×p → Rp×p restricted to the cartesian
product Y = × T . Then given any matrix norm · q on Rp×p × Rp×p , the
minimum gain of A restricted to Y is defined as
ε(, T , · q ) min
PY A† APY (S, L)q ,
(S,L)∈×T ,(S,L)q =1
where PY denotes the projection onto Y , and A† denotes the adjoint of the addition operator (with respect to the standard Euclidean inner-product). The “level”
of transversality of and T is measured by the magnitude of ε(, T , · q ),
with transverse intersection being equivalent to ε(, T , · q ) > 0. Note that
ε(, T , · F ) is the square of the minimum singular value of the addition operator A restricted to × T .
A natural norm with which to measure transversality is the dual norm of the
regularization function in (1.2), as the subdifferential of the regularization function is specified in terms of its dual. The reasons for this will become clearer as
we proceed through this paper. Recall that the regularization function used in the
variational formulation (1.2) is given by
fγ (S, L) = γ S1 + L∗ ,
LATENT VARIABLE MODEL SELECTION
1947
where the nuclear norm · ∗ reduces to the trace function over the cone of
positive-semidefinite matrices. This function is a norm for all γ > 0. The dual
norm of fγ is given by
S∞
, L2 .
gγ (S, L) = max
γ
Next we define the quantity χ (, T , γ ) as follows in order to study the transversality of the spaces and T with respect to the gγ norm:
(3.3)
χ (, T , γ ) max
ξ(T )
, 2μ()γ .
γ
Here μ and ξ are defined in (3.2) and (3.1). We then have the following result
(proved in Section 5):
L EMMA 3.2. Let S ∈ , L ∈ T be matrices such that S∞ = γ and let
L2 = 1. Then we have that gγ (PY A† APY (S, L)) ∈ [1 − χ (, T , γ ), 1 +
χ (, T , γ )], where Y = × T and χ (, T , γ ) is defined in (3.3). In particular
we have that 1 − χ (, T , γ ) ≤ ε(, T , gγ ).
The quantity χ (, T , γ ) being small implies that the addition operator is essentially isometric when restricted to Y = × T . Stated differently, the magnitude
of χ (, T , γ ) is a measure of the level of transversality of the spaces and T .
1
) ensures that χ (, T , γ ) < 1, which in
If μ()ξ(T ) < 12 , then γ ∈ (ξ(T ), 2μ()
turn implies that the tangent spaces and T have a transverse intersection.
Observation: Thus we have that the smaller the quantities μ() and ξ(T ),
the more transverse the intersection of the spaces and T as measured by
ε(, T , gγ ).
3.4. Conditions on Fisher information. The main focus of Section 4 is to analyze the regularized maximum-likelihood convex program (1.2) by studying its optimality conditions. The log-likelihood function is well-approximated in a neighborhood by a quadratic form given by the Fisher information (which measures the
∗ ) denote the Fisher inforcurvature, as discussed in Section 2.2). Let I ∗ = I (K̃O
∗ [see (1.1)]. The
mation evaluated at the true marginal concentration matrix K̃O
appropriate measure of transversality between the tangent spaces2 = (S ∗ ) and
T = T (L∗ ) is then in a space in which the inner-product is given by I ∗ . Specifically, we need to analyze the minimum gain of the operator PY A† I ∗ APY restricted to the space Y = × T . Therefore we impose several conditions on the
Fisher information I ∗ . We define quantities that control the gains of I ∗ restricted
to and T separately; these ensure that elements of and elements of T are
2 We implicitly assume that these tangent spaces are subspaces of the space of symmetric matrices.
1948
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
individually identifiable under the map I ∗ . In addition we define quantities that,
in conjunction with bounds on μ() and ξ(T ), allow us to control the gain of I ∗
restricted to the direct-sum ⊕ T .
I ∗ restricted to : The minimum gain of the operator P I ∗ P restricted to is given by
α P I ∗ P (M)∞ .
min
M∈,M∞ =1
The maximum effect of elements in in the orthogonal direction ⊥ is given by
δ max
P⊥ I ∗ P (M)∞ .
M∈,M∞ =1
The operator I ∗ is injective on if α > 0. The ratio αδ ≤ 1 − ν implies the
irrepresentability condition imposed in [26], which gives a sufficient condition for
consistent recovery of graphical model structure using 1 -regularized maximumlikelihood. Notice that this condition is a generalization of the usual Lasso irrepresentability conditions [32, 37], which are typically imposed on the covariance
matrix. Finally we also consider the following quantity, which controls the behavior of I ∗ restricted to in the spectral norm:
β max
M∈,M2 =1
I ∗ (M)2 .
I ∗ restricted to T : Analogously to the case of one could control the gains
of the operators PT ⊥ I ∗ PT and PT I ∗ PT . However, as discussed previously, one
complication is that the tangent spaces at nearby smooth points on the rank variety are in general different, and the amount of twisting between these spaces is
governed by the local curvature. Therefore we control the gains of the operators
PT ⊥ I ∗ PT and PT I ∗ PT for all tangent spaces T that are “close to” the nominal T (at the true underlying low-rank matrix), measured by ρ(T , T ) (2.7) being
small. The minimum gain of the operator PT I ∗ PT restricted to T (close to T ) is
given by
αT min
PT I ∗ PT (M)2 .
min
ρ(T ,T )≤ξ(T )/2 M∈T ,M2 =1
Similarly, the maximum effect of elements in T in the orthogonal direction T ⊥
(for T close to T ) is given by
δT max
max
PT ⊥ I ∗ PT (M)2 .
ρ(T ,T )≤ξ(T )/2 M∈T ,M2 =1
Implicit in the definition of αT and δT is the fact that the outer minimum and
maximum are only taken over spaces T that are tangent spaces to the rank-variety.
)
The operator I ∗ is injective on all tangent spaces T such that ρ(T , T ) ≤ ξ(T
2
if αT > 0. An irrepresentability condition (analogous to those developed for the
sparse case) for tangent spaces near T to the rank variety would be that αδTT ≤ 1 − ν.
LATENT VARIABLE MODEL SELECTION
1949
Finally we also control the behavior of I ∗ restricted to T close to T in the ∞
norm:
βT max
max
I ∗ (M)∞ .
ρ(T ,T )≤ξ(T )/2 M∈T ,M∞ =1
The two sets of quantities (α , δ ) and (αT , δT ) essentially control how I ∗
behaves when restricted to the spaces and T separately (in the natural norms).
The quantities β and βT are useful in order to control the gains of the operator I ∗
restricted to the direct sum ⊕ T . Notice that although the magnitudes of elements
in are measured most naturally in the ∞ norm, the quantity β is specified
with respect to the spectral norm. Similarly, elements of the tangent spaces T to
the rank variety are most naturally measured in the spectral norm, but βT provides
control in the ∞ norm. These quantities, combined with μ() and ξ(T ) [defined
in (3.2) and (3.1)], provide the “coupling” necessary to control the behavior of I ∗
restricted to elements in the direct sum ⊕ T . In order to keep track of fewer
quantities, we summarize the six quantities as follows:
α min(α , αT );
δ max(δ , δT );
Main assumption: There exists a ν ∈ (0,
1
2]
β max(β , βT ).
such that
δ
≤ 1 − 2ν.
α
This assumption is to be viewed as a generalization of the irrepresentability conditions imposed on the covariance matrix [32, 37] or the Fisher information matrix
[26] in order to provide consistency guarantees for sparse model selection using
the 1 norm. With this assumption we have the following proposition, proved in
Section 5, about the gains of the operator I ∗ restricted to ⊕ T . This proposition plays a fundamental role in the analysis of the performance of the regularized maximum-likelihood procedure (1.2). Specifically, it gives conditions under
which a suitable primal-dual pair can be specified to certify optimality with respect
to (1.2) (see Section 5.2 for more details).
P ROPOSITION 3.3. Let and T be the tangent spaces defined in this section,
and let I ∗ be the Fisher information evaluated at the true marginal concentration
matrix. Further let α, β, ν be as defined above. Suppose that
μ()ξ(T ) ≤
1
να
6 β(2 − ν)
2
,
and that γ is in the following range:
γ∈
να
3ξ(T )β(2 − ν)
,
.
να
2μ()β(2 − ν)
Then we have the following two conclusions for Y = ×T with ρ(T , T ) ≤
ξ(T )
2 :
1950
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
(1) The minimum gain of I ∗ restricted to ⊕ T is bounded below:
min
(S,L)∈Y ,S∞ =γ ,L2 =1
gγ (PY A† I ∗ APY (S, L)) ≥
α
.
2
Specifically this implies that for all (S, L) ∈ Y
gγ (PY A† I ∗ APY (S, L)) ≥
α
gγ (S, L).
2
(2) The effect of elements in Y = × T on the orthogonal complement Y ⊥ =
× T ⊥ is bounded above:
⊥
PY ⊥ A† I ∗ APY (PY A† I ∗ APY )−1 gγ →gγ ≤ 1 − ν.
Specifically this implies that for all (S, L) ∈ Y
gγ (PY ⊥ A† I ∗ APY (S, L)) ≤ (1 − ν)gγ (PY A† I ∗ APY (S, L)).
The last quantity we consider is the spectral norm of the marginal covariance
∗ = (K̃ ∗ )−1 :
matrix O
O
(3.4)
∗
∗ −1
2 = (K̃O
) 2 .
ψ O
A bound on ψ is useful in the probabilistic component of our analysis, in order
to derive convergence rates of the sample covariance matrix to the true covariance
matrix. We also observe that
∗ −1
∗ −1
) ⊗ (K̃O
) 2→2 = ψ 2 .
I ∗ 2→2 = (K̃O
Remarks. The quantities α, β, δ bound the gains of the Fisher information I ∗
restricted to the spaces and T (and tangent spaces near T ). One can make
stronger assumptions on I ∗ that are more easily interpretable. For example, α , β
could bound the minimum/maximum gains of I ∗ for all matrices (rather than
just those in ), and δ the I ∗ -inner-product for all pairs of orthogonal matrices (rather than just those in and ⊥ ). Similarly, αT , βT could bound the
minimum/maximum gains of I ∗ for all matrices (rather than just those near T ),
and δT the I ∗ -inner-product for all pairs of orthogonal matrices (rather than just
those near T and T ⊥ ). Such bounds would apply in either the · 2→2 norm (for
αT , δT , β ) or the · ∞→∞ norm (for α , δ , βT ). These modified assumptions
are global in nature (not restricted just to or near T ) and are consequently
stronger (they lower-bound the original α , αT and they upper-bound the original β , βT , δ , δT ), and they essentially control the gains of the operator I ∗ in the
· 2→2 norm and the · ∞→∞ norm. In contrast, previous works on covariance
selection [1, 2, 29] consider well-conditioned families of covariance matrices by
LATENT VARIABLE MODEL SELECTION
1951
bounding the minimum/maximum eigenvalues (i.e., gain with respect to the spectral norm).
4. Consistency of regularized maximum-likelihood program.
∗
4.1. Main results. Recall that K(O
H ) denotes the full concentration matrix
of a collection of zero-mean jointly-Gaussian observed and latent variables. Let
p = |O| denote the number of observed variables, and let h = |H | denote the
i }n
number of latent variables. We are given n samples {XO
i=1 of the observed variables XO . We consider the high-dimensional setting in which (p, h, n) are all
allowed to grow simultaneously. We present our main result next demonstrating
the consistency of the estimator (1.2), and then discuss classes of latent-variable
graphical models and various scaling regimes in which our estimator is consistent. Recall from (1.2) that λn is a regularization parameter, and γ is a tradeoff parameter between the rank and sparsity terms. Notice from Proposition 3.3
that the choice of γ depends on the values of μ((S ∗ )) and ξ(T (L∗ )). While
these quantities may not be known a priori, we discuss a method to choose γ
numerically in our experimental results (see Section 6). The following theorem
shows that the estimates (Ŝn , L̂n ) provided by the convex program (1.2) are consistent for a suitable choice of λn . In addition to the appropriate identifiability conditions (as specified by Proposition 3.3), we also impose lower bounds
on the minimum magnitude nonzero entry θ of the sparse conditional graphical
model matrix S ∗ and on the minimum nonzero singular value σ of the low-rank
matrix L∗ summarizing the effect of the latent variables. The theorem is stated
in terms of the quantities α, β, ν, ψ, and we particularly emphasize the dependence on μ((S ∗ )) and ξ(T (L∗ )) because these control the complexity of the
∗
underlying latent-variable graphical model given by K(O
H ) . A number of quanνα
α
tities play a role in our theorem: let D = max{1, 3β(2−ν)
}, C1 = ψ(1 + 6β
),
√
48 2Dψ(2−ν)
1
αν
1
αν
,
C
=
min{
,
},
C
=
,
samp
λ
2
2
32(3−ν)D
4C1 256D(3−ν)ψC
ξ(T )ν
ψ
1
2
3αC2 (2−ν)
C2 να
CS = max{( 6(2−ν)
+ 1)C22 ψ 2 D, C2 + 16(3−ν)
} and CL = β(2−ν)
.
ν
C2 =
48
α
+
∗
T HEOREM 4.1. Let K(O
H ) denote the concentration matrix of a Gaussian
i }n
model. We have n samples {XO
i=1 of the p observed variables denoted by O.
∗
∗
Let = (S ) and T = T (L ) denote the tangent spaces at S ∗ and at L∗ with
respect to the sparse and low-rank matrix varieties, respectively.
Assumptions: Suppose that the quantities μ() and ξ(T ) satisfy the assumption
of Proposition 3.3 for identifiability, and γ is chosen in the range specified by
Proposition 3.3. Further suppose that the following conditions hold:
p
128ψ 2
max{
, 2}, that is, we require that n ξ(Tp )4 .
4
2
ξ(T )
Csamp
√
p
p
1
λn = 48 2Dψ(2−ν)
,
that
is,
we
require
that
λ
n
ξ(T )ν
n
ξ(T ) n .
(1) Let n ≥
(2) Set
1952
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
(3) Let σ ≥
(4) Let θ ≥
CL λn
, that is, we require that σ ξ(T1 )3 pn .
ξ(T )2
p
CS λn
1
,
that
is,
we
require
that
θ
μ()
ξ(T )μ() n .
Conclusions: Then with probability greater than 1 − 2 exp{−p} we have algebraic correctness and estimation error given by:
∗ );
(1) sign(Ŝn ) = sign(S ∗ ) and rank(
√ L̂n ) = rank(L
p
2(3−ν)Dψ p
1
(2) gγ (Ŝn − S ∗ , L̂n − L∗ ) ≤ 512 ναξ(T
)
n
ξ(T ) n .
The proof of this theorem is given in Section 5. The theorem essentially states
that if the minimum nonzero singular value of the low-rank piece L∗ and minimum
nonzero entry of the sparse piece S ∗ are bounded away from zero, then the convex program (1.2) provides estimates that are both algebraically correct and have
bounded estimation error (in the ∞ and spectral norms).
Notice that the condition on the minimum singular value of L∗ is more stringent
than the one on the minimum nonzero entry of S ∗ . One role played by these conditions is to ensure that the estimates (Ŝn , L̂n ) do not have smaller support size/rank
than (S ∗ , L∗ ). However, the minimum singular value bound plays the additional
role of bounding the curvature of the low-rank matrix variety around the point L∗ ,
which is the reason for this condition being more stringent. Notice also that the
number of latent variables h does not explicitly appear in the bounds in Theorem 4.1, which only depend on p, μ((S ∗ )), ξ(T (L∗ )). However, the dependence
on h is implicit in the dependence on ξ(T (L∗ )), and we discuss this point in greater
detail in the following section.
Finally we note that consistency holds in Theorem 4.1 for a range of values of
)
να
γ ∈ [ 3β(2−ν)ξ(T
, 2β(2−ν)μ()
]. In particular the assumptions on the sample comνα
plexity, the minimum nonzero singular value of L∗ , and the minimum magnitude
nonzero entry of S ∗ are governed by the lower end of this range for γ . These assumptions can be weakened if we only require consistency for a smaller range of
values of γ . The next result conveys this point with a specific example.
C OROLLARY 4.2. Consider the same setup and notation as in Theorem 4.1.
Suppose that the quantities μ() and ξ(T ) satisfy the assumption of Proposiνα
tion 3.3 for identifiability, and that γ = 2β(2−ν)μ()
(the upper end of the range
specified in Proposition 3.3), that is, γ μ()4 p; (2) λn μ()
p
n;
(3) σ 1
μ()
2
μ()
ξ(T )
p
n;
. Further suppose that: (1) n (4) θ p
n.
Then with probabil-
ity greater than 1 − 2 exp{−p} we have estimates (Ŝn , L̂n ) that are algebraically
correct, and with the error bounded as gγ (Ŝn − S ∗ , L̂n − L∗ ) μ() pn .
LATENT VARIABLE MODEL SELECTION
1953
The proof of this corollary3 is analogous to that of Theorem 4.1. We emphasize
that in practice it is often beneficial to have consistent estimates for a range of
values of γ (as in Theorem 4.1). Specifically, the stability of the sparsity pattern
and rank of the estimates (Ŝn , L̂n ) for a range of trade-off parameters is useful
in order to choose a suitable value of γ , as prior information about the quantities
μ((S ∗ )) and ξ(T (L∗ )) is not typically available (see Section 6).
We remark here that the identifiability conditions of Proposition 3.3 are the main
sufficient conditions required for Theorem 4.1 and Corollary 4.2 to hold. It would
be interesting to obtain necessary conditions as well for these results, analogous
to the necessity and sufficiency of the irrepresentability conditions for the Lasso
[32, 37].
4.2. Scaling regimes. Next we consider classes of latent-variable models that
satisfy the conditions of Theorem 4.1. Recall from Section 3.2 that μ((S ∗ )) ≤
deg(S ∗ ). Throughout this section, we consider latent-variable models in which
the
∗
∗
low-rank matrix L is almost maximally incoherent, that is, ξ(T (L )) ∼ ph so
the effect of marginalization over the latent variables is diffuse across almost all
the observed variables. We suppress the dependence on the quantities α, β, ν, ψ
defined in Section 3.4 in our scaling results, and specifically focus on the trade-off
between ξ(T (L∗ )) and μ((S ∗ )) for consistent estimation (we also suppress the
dependence of these quantities on n). Thus, based on Proposition 3.3 we study
latent-variable models in which
∗
∗
ξ(T (L ))μ((S )) = O
h
deg(S ∗ ) = O(1).
p
As we describe next, there are nontrivial classes of latent-variable graphical models
in which this condition holds.
Bounded degree: The first class of latent-variable models that we consider are
those in which the conditional graphical model among the observed variables
∗ ) has constant degree:
(given by KO
deg(S ∗ ) = O(1),
h ∼ p.
Such models can be estimated consistently from n ∼ p samples. Thus consistent
latent-variable model selection is possible even when the number of samples and
the number of latent variables are on the same order as the number of observed
variables.
3 By making stronger assumptions on the Fisher information matrix I ∗ , one can further remove
the factor of ξ(T ) in the lower bound for σ . Specifically, the lower bound σ μ()3 pn suffices
for consistent estimation if the bounds defined by the quantities αT , βT , δT can be strengthened as
described in the remarks at the end of Section 3.4.
1954
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
Polylogarithmic degree: The next class of models that we consider are those
in which the degree of the conditional graphical model of the observed variables
grows polylogarithmically with p:
deg(S ∗ ) ∼ log(p)q ,
h∼
p
.
log(p)2q
Such latent-variable graphical models can be consistently estimated as long as
n ∼ p polylog(p).
For standard graphical model selection with no latent variables, 1 -regularized
maximum-likelihood is shown to be consistent with n = O(log p) samples [26].
On the other hand, our results prove consistency in the setting with latent variables
when n = O(p) samples. It would be interesting to study whether these rates are
inherent to latent-variable model selection.
4.3. Rates for covariance matrix estimation. Theorem 4.1 gives conditions
under which we can consistently estimate the sparse and low-rank parts that
∗ . Here we state a corollary that
compose the marginal concentration matrix K̃O
gives rates for covariance matrix estimation, that is, the quality of the estimate
∗.
(Ŝn − L̂n )−1 with respect to the “true” marginal covariance matrix O
C OROLLARY 4.3. Under the same conditions as in Theorem 4.1, we have with
probability greater than 1 − 2 exp{−p} that
∗
gγ A† [(Ŝn − L̂n )−1 − O
] ≤ λn 1 +
∗ This corollary implies that (Ŝn − L̂n )−1 −O
2
ν
.
6(2 − ν)
1
ξ(T )
p
n
based on the choice
∗ μ()
of λn in Theorem 4.1, and that (Ŝn − L̂n )−1 − O
2
choice of λn in Corollary 4.2.
p
n
based on the
5. Proofs.
5.1. Proofs of Section 3. Here we give proofs of the results stated in Section 3.
P ROOF OF L EMMA 3.1. Since ρ(T1 , T2 ) < 1, the largest principal angle between T1 and T2 is strictly less than π2 . Consequently, the mapping PT2 : T1 → T2
restricted to T1 is bijective (as it is injective, and the spaces T1 , T2 have the same
dimension). Consider the maximum and minimum gains of PT2 restricted to T1 ;
for any M ∈ T1 , M2 = 1:
PT2 (M)2 = M + [PT2 − PT1 ](M)2 ∈ [1 − ρ(T1 , T2 ), 1 + ρ(T1 , T2 )].
1955
LATENT VARIABLE MODEL SELECTION
Therefore, we can rewrite ξ(T2 ) as follows:
ξ(T2 ) =
≤
≤
max
N∞ =
N∈T2 ,N2 ≤1
max
N∈T1 ,N2 ≤1/(1−ρ(T1 ,T2 ))
max
max
PT2 (N)∞
N∈T2 ,N2 ≤1
N∈T1 ,N2 ≤1/(1−ρ(T1 ,T2 ))
PT2 (N)∞
N∞ + [PT1 − PT2 ](N)∞
≤
1
ξ(T1 ) +
max
[PT1 − PT2 ](N)∞
N∈T1 ,N2 ≤1
1 − ρ(T1 , T2 )
≤
1
ξ(T1 ) + max [PT1 − PT2 ](N)2
N2 ≤1
1 − ρ(T1 , T2 )
1
[ξ(T1 ) + ρ(T1 , T2 )].
1 − ρ(T1 , T2 )
This concludes the proof of the lemma. ≤
P ROOF OF L EMMA 3.2. We have that A† A(S, L) = (S + L, S + L); therefore,
PY A† APY (S, L) = (S + P (L), PT (S) + L). We need to bound S + P (L)∞
and PT (S) + L2 . First, we have
S + P (L)∞ ∈ [S∞ − P (L)∞ , S∞ + P (L)∞ ]
⊆ [S∞ − L∞ , S∞ + L∞ ]
⊆ [γ − ξ(T ), γ + ξ(T )].
Similarly, one can check that
PT (S) + L2 ∈ [−PT (S)2 + L2 , PT (S)2 + L2 ]
⊆ [1 − 2S2 , 1 + 2S2 ]
⊆ [1 − 2γ μ(), 1 + 2γ μ()].
These two bounds give us the desired result. P ROOF OF P ROPOSITION 3.3. Before proving the two parts of this proposition
)
we make a simple observation about ξ(T ) using the condition that ρ(T , T ) ≤ ξ(T
2
by applying Lemma 3.1:
ξ(T ) ≤
3ξ(T )/2
ξ(T ) + ρ(T , T )
≤
≤ 3ξ(T ).
1 − ρ(T , T )
1 − ξ(T )/2
Here we used the property that ξ(T ) ≤ 1 in obtaining the final inequality. Conse)
να
quently, noting that γ ∈ [ 3β(2−ν)ξ(T
, 2β(2−ν)μ()
] implies that
να
(5.1)
ξ(T )
να
, 2μ()γ ≤
.
χ (, T , γ ) = max
γ
β(2 − ν)
1956
V. CHANDRASEKARAN, P. A. PARRILO AND A. S. WILLSKY
Part 1: The proof of this step proceeds in a similar manner to that of Lemma 3.2.
First we have for S ∈ , L ∈ T with S∞ = γ , L2 = 1:
P I ∗ (S + L)∞ ≥ P I ∗ S∞ − P I ∗ L∞ ≥ αγ − I ∗ L∞ ≥ αγ − βξ(T ).
Next, under the same conditions on S, L,
PT I ∗ (S + L)2 ≥ PT I ∗ L2 − PT I ∗ S2 ≥ α − 2I ∗ S2 ≥ α − 2βμ()γ .
Combining these last two bounds with (5.1), we conclude that
min
(S,L)∈Y ,S∞ =γ ,L2 =1
gγ (PY A† I ∗ APY (S, L))
να
2α(1 − ν) α
ξ(T )
, 2μ()γ ≥ α −
=
≥ ,
≥ α − β max
γ
2−ν
2−ν
2
where the final inequality follows from the assumption that ν ∈ (0, 12 ].
Part 2: Note that for S ∈ , L ∈ T with S∞ ≤ γ , L2 ≤ 1,
P⊥ I ∗ (S + L)∞ ≤ P⊥ I ∗ S∞ + P⊥ I ∗ L∞ ≤ δγ + βξ(T ).
Similarly,
PT ⊥ I ∗ (S + L)2 ≤ PT ⊥ I ∗ S2 + PT ⊥ I ∗ L2 ≤ βγ μ() + δ.
Combining these last two bounds with the bounds from the first part, we have that
PY ⊥ A† I ∗ APY (PY A† I ∗ APY )−1 gγ →gγ
≤
δ + να/(2 − ν)
δ + β max{ξ(T )/γ , 2μ()γ }
≤
α − β max{ξ(T )/γ , 2μ()γ } α − να/(2 − ν)
≤
(1 − 2ν)α + να/(2 − ν)
= 1 − ν.
α − να/(2 − ν)
This concludes the proof of the proposition. 5.2. Proof strategy for Theorem 4.1. Standard results from convex analysis
[28] state that (Ŝn , L̂n ) is a minimum of the convex program (1.2) if the zero matrix belongs to the subdifferential of the objective function evaluated at (Ŝn , L̂n )
[in addition to (Ŝn , L̂n ) satisfying the constraints]. Elements of the subdifferentials
with respect to the ℓ1 norm and the nuclear norm at a matrix M have the key property that they decompose with respect to the tangent spaces Ω(M) and T(M) [34].
This decomposition property plays a critical role in our analysis. In particular it
states that the optimality conditions consist of two parts, one part corresponding to
the tangent spaces Ω and T and another corresponding to the normal spaces Ω⊥ and T⊥.
Our analysis proceeds by constructing a primal-dual pair of variables that certify optimality with respect to (1.2). Consider the optimization problem (1.2) with
the additional (nonconvex) constraints that the variable S belongs to the algebraic
variety of sparse matrices and that the variable L belongs to the algebraic variety
of low-rank matrices. While this new optimization problem is nonconvex, it has a
very interesting property. At a globally optimal solution (and indeed at any locally
optimal solution) (S̃, L̃) such that S̃ and L̃ are smooth points of the algebraic varieties of sparse and low-rank matrices, the first-order optimality conditions state
that the Lagrange multipliers corresponding to the additional variety constraints
must lie in the normal spaces Ω(S̃)⊥ and T(L̃)⊥. This basic observation, combined with the decomposition property of the subdifferentials of the ℓ1 and nuclear
norms, suggests the following high-level proof strategy: considering the solution
(S̃, L̃) of the variety-constrained problem, we show under suitable conditions that
the second part of the subgradient optimality conditions of (1.2) (without any variety constraints) corresponding to components in the normal spaces Ω(S̃)⊥ and T(L̃)⊥ is also satisfied by (S̃, L̃). Thus, we show that (S̃, L̃) satisfies the optimality
conditions of the original convex program (1.2). Consequently (S̃, L̃) is also the
optimum of the convex program (1.2). As this estimate is obtained as the solution
to the problem with the variety constraints, the algebraic correctness of (S̃, L̃) can
be directly concluded. We emphasize here that the variety-constrained optimization problem is used solely as an analysis tool in order to prove consistency of the
estimates provided by the convex program (1.2). The key technical complication is
that the tangent spaces at L̃ and L∗ are in general different. We bound the twisting
between these tangent spaces by using the fact that the minimum nonzero singular
value of L∗ is bounded away from zero (as assumed in Theorem 4.1; see also the
supplement [6]).
5.3. Results proved in supplement. In this section we give the statements of
some results that are proved in a separate supplement [6]. These results are critical
to the proof of our main theorem, but they deal mainly with nonstatistical aspects
such as the curvature of the algebraic variety of low-rank matrices. Recall that
Ω = Ω(S∗) and T = T(L∗). We also refer frequently to the constants defined in
Theorem 4.1.
As the gradient of the log-determinant function is given by a matrix inverse,
a key step in analyzing the properties of the convex program (1.2) is to show
that the change in the inverse of a matrix due to small perturbations is well-approximated by the first-order term in the Taylor series expansion. Consider the Taylor series of the inverse of a matrix:
$$(M + \Delta)^{-1} = M^{-1} - M^{-1}\Delta M^{-1} + R_{M^{-1}}(\Delta), \qquad \text{where } R_{M^{-1}}(\Delta) = M^{-1}\sum_{k=2}^{\infty}(-\Delta M^{-1})^{k}.$$
This infinite sum converges for ‖Δ‖ sufficiently small. The following proposition
provides a bound on the second-order term specialized to our setting:
PROPOSITION 5.1. Suppose that γ is in the range given by Proposition 3.3. Further suppose ΔS ∈ Ω, and let gγ(ΔS, ΔL) ≤ 1/(2C1). Then we have that
$$g_\gamma\bigl(A^\dagger R_{\Sigma_O^*}\bigl(A(\Delta_S, \Delta_L)\bigr)\bigr) \le \frac{2D\psi C_1^2\, g_\gamma(\Delta_S, \Delta_L)^2}{\xi(T)}.$$
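As an informal numerical illustration (ours, not part of the paper's argument; the matrix M, the perturbation direction and the step sizes below are arbitrary choices), the remainder R_{M^{-1}}(Δ) can be seen to shrink roughly quadratically as the perturbation shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary well-conditioned positive-definite M and a unit-spectral-norm direction D.
p = 8
A = rng.standard_normal((p, p))
M = A @ A.T + p * np.eye(p)
D = rng.standard_normal((p, p))
D = (D + D.T) / 2
D /= np.linalg.norm(D, 2)

M_inv = np.linalg.inv(M)
for t in [1e-1, 1e-2, 1e-3]:
    Delta = t * D
    # Remainder of the first-order expansion: (M + Delta)^{-1} - [M^{-1} - M^{-1} Delta M^{-1}].
    R = np.linalg.inv(M + Delta) - (M_inv - M_inv @ Delta @ M_inv)
    print(t, np.linalg.norm(R, 2))  # spectral norm of the remainder decays roughly like t^2
```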
Next we analyze the following convex program subject to certain additional
constraints:
$$(5.2)\qquad (\hat S_\Omega, \hat L_{\tilde T}) = \mathop{\arg\min}_{S,L}\ \mathrm{tr}\bigl[(S-L)\Sigma_O^n\bigr] - \log\det(S-L) + \lambda_n\bigl[\gamma\|S\|_1 + \|L\|_*\bigr]\quad \text{s.t. } S - L \succ 0,\ S \in \Omega,\ L \in \tilde T,$$
for some subspace T̃. Comparing (5.2) with the convex program (1.2), we also do not constrain the variable L to be positive semidefinite in (5.2) for ease of proof of the next result (see the supplement [6] for more details; recall that the nuclear norm of a positive-semidefinite matrix is equal to its trace). We show that if T̃ is any tangent space to the low-rank matrix variety such that ρ(T, T̃) ≤ ξ(T)/2, then we can bound the error (ΔS, ΔL) = (ŜΩ − S∗, L∗ − L̂T̃). Let CT̃ = PT̃⊥(L∗) denote the normal component of the true low-rank matrix at T̃, and let En = Σ_O^n − Σ_O^* denote the difference between the true marginal covariance and the sample covariance. The proof of the following result uses Brouwer's fixed-point theorem [25], and is inspired by the proof of a similar result in [26] for standard sparse graphical model recovery without latent variables.
PROPOSITION 5.2. Let the error (ΔS, ΔL) in the solution of the convex program (5.2) [with T̃ such that ρ(T̃, T) ≤ ξ(T)/2] be as defined above, and define
$$r = \max\Bigl\{\frac{8}{\alpha}\bigl[g_\gamma(A^\dagger E_n) + g_\gamma(A^\dagger I^* C_{\tilde T}) + \lambda_n\bigr],\ \|C_{\tilde T}\|_2\Bigr\}.$$
If r ≤ min{1/(4C1), αξ(T)/(64DψC1²)} for γ as in Proposition 3.3, then gγ(ΔS, ΔL) ≤ 2r.
Finally we give a proposition that summarizes the algebraic component of our
proof.
PROPOSITION 5.3. Assume that γ is in the range specified by Proposition 3.3, that σ ≥ CLλn/ξ(T)², θ ≥ CSλn/μ(Ω), gγ(A†En) ≤ λnν/(6(2−ν)), and that λn ≤ [3α(2−ν)/(16(3−ν))] min{1/(4C1), αξ(T)/(64DψC1²)}. Then there exists a tangent space T′ and a corresponding unique solution (ŜΩ, L̂T′) of (5.2) with T̃ = T′ with the following properties:

(1) sign(ŜΩ) = sign(S∗) and rank(L̂T′) = rank(L∗), with L̂T′ ⪰ 0. Further T(L̂T′) = T′ and ρ(T′, T) ≤ ξ(T)/4.

(2) Letting CT′ = PT′⊥(L∗) we have that gγ(A†I∗CT′) ≤ λnν/(6(2−ν)), and that ‖CT′‖2 ≤ 16(3−ν)λn/(3α(2−ν)).

Further, if gγ(A†R_{Σ_O^*}(A(ŜΩ − S∗, L∗ − L̂T′))) ≤ λnν/(6(2−ν)), then the tangent space constraints S ∈ Ω, L ∈ T′ are inactive in (5.2). Consequently the unique solution of (1.2) is (Ŝn, L̂n) = (ŜΩ, L̂T′).
5.4. Probabilistic analysis. The results given thus far in this section have been
completely deterministic in nature. Here we present the probabilistic component of
our proof by studying the rate at which the sample covariance matrix Σ_O^n converges to the true covariance matrix Σ_O^* in spectral norm. This result is well known and follows directly from Theorem II.13 in [8]; we mainly discuss it here for completeness and also to show explicitly the dependence on ψ = ‖Σ_O^*‖2 defined in (3.4). See the supplement [6] for a proof.
LEMMA 5.4. Let ψ = ‖Σ_O^*‖2. Given any δ > 0 with δ ≤ 8ψ, let the number of samples n be such that n ≥ 64pψ²/δ². Then we have that
$$\Pr\bigl[\|\Sigma_O^n - \Sigma_O^*\|_2 \ge \delta\bigr] \le 2\exp\Bigl\{-\frac{n\delta^2}{128\psi^2}\Bigr\}.$$
The following corollary relates the number of samples required for an error
bound to hold with probability 1 − 2 exp{−p}.
COROLLARY 5.5. Let Σ_O^n be the sample covariance formed from n samples of the observed variables. Set δn = √(128pψ²/n). If n ≥ 2p, then
$$\Pr\bigl[\|\Sigma_O^n - \Sigma_O^*\|_2 \le \delta_n\bigr] \ge 1 - 2\exp\{-p\}.$$
P ROOF.
Note that n ≥ 2p implies that δn ≤ 8ψ, and apply Lemma 5.4. 5.5. Proof of Theorem 4.1 and Corollary 4.3. We first combine the
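As an empirical companion to Lemma 5.4 and Corollary 5.5 (our own sketch, not used anywhere in the proof; the covariance below is an arbitrary choice), one can compare the average spectral-norm error of the sample covariance with the quantity δn = √(128pψ²/n):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 36
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)        # plays the role of the true covariance Sigma_O^*
psi = np.linalg.norm(Sigma, 2)

for n in [2 * p, 10 * p, 50 * p]:
    errors = []
    for _ in range(50):
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        Sigma_n = X.T @ X / n          # sample covariance (zero-mean model)
        errors.append(np.linalg.norm(Sigma_n - Sigma, 2))
    delta_n = np.sqrt(128 * p * psi**2 / n)
    print(n, np.mean(errors), delta_n)  # empirical error sits well below delta_n and shrinks like sqrt(p/n)
```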
5.5. Proof of Theorem 4.1 and Corollary 4.3. We first combine the results obtained thus far to prove Theorem 4.1. Set En = Σ_O^n − Σ_O^*, set δn = √(128pψ²/n), and then set λn = 6Dδn(2−ν)/(ξ(T)ν). This setting of λn is equivalent to the specification in the statement of Theorem 4.1.
PROOF OF THEOREM 4.1. We mainly need to show that the various sufficient conditions of Proposition 5.3 are satisfied. We condition on the event that ‖En‖2 ≤ δn, which holds with probability greater than 1 − 2 exp{−p} from Corollary 5.5 as n ≥ 2p by assumption. Based on the bound on n, we also have that
$$\delta_n \le \xi(T)^2\,\frac{\alpha\nu}{32(3-\nu)D}\,\min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\nu}{256D(3-\nu)\psi C_1^2}\Bigr\}.$$
In particular, these bounds imply that
$$(5.3)\qquad \delta_n \le \frac{\alpha\xi(T)\nu}{32(3-\nu)D}\,\min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64D\psi C_1^2}\Bigr\};\qquad \delta_n \le \frac{\alpha^2\xi(T)^2\nu^2}{8192\psi C_1^2(3-\nu)^2 D^2}.$$
Both these weaker bounds are used later.
Based on the assumptions of Theorem 4.1, the requirements of Proposition 5.3
on σ and θ are satisfied. Next we verify the bounds on λn and gγ (A† En ). Based
on the setting of λn above and the bound on δn from (5.3), we have that
$$\lambda_n = \frac{6D(2-\nu)\delta_n}{\xi(T)\nu} \le \frac{3\alpha(2-\nu)}{16(3-\nu)}\,\min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64D\psi C_1^2}\Bigr\}.$$
Next we combine the facts that λn = 6Dδn(2−ν)/(ξ(T)ν) and that ‖En‖2 ≤ δn to conclude that
$$(5.4)\qquad g_\gamma(A^\dagger E_n) \le \frac{D\delta_n}{\xi(T)} = \frac{\lambda_n\nu}{6(2-\nu)}.$$
Thus, we have from Proposition 5.3 that there exists a T′ and corresponding solution (ŜΩ, L̂T′) of (5.2) with the prescribed properties. Next we apply Proposition 5.2 with T̃ = T′ to bound the error (ŜΩ − S∗, L∗ − L̂T′). Noting that ρ(T′, T) ≤ ξ(T)/4, we have that
$$\begin{aligned}
\frac{8}{\alpha}\bigl[g_\gamma(A^\dagger E_n) + g_\gamma(A^\dagger I^* C_{T'}) + \lambda_n\bigr]
&\le \frac{8}{\alpha}\Bigl[\frac{\nu}{3(2-\nu)} + 1\Bigr]\lambda_n \quad (5.5)\\
&= \frac{16(3-\nu)\lambda_n}{3\alpha(2-\nu)} \quad (5.6)\\
&= \frac{32(3-\nu)D}{\alpha\xi(T)\nu}\,\delta_n\\
&\le \min\Bigl\{\frac{1}{4C_1},\ \frac{\alpha\xi(T)}{64D\psi C_1^2}\Bigr\}. \quad (5.7)
\end{aligned}$$
In the first inequality we used the fact that gγ(A†En) ≤ λnν/(6(2−ν)) (from above) and that gγ(A†I∗CT′) is similarly bounded (from Proposition 5.3). In the second equality we used the relation λn = 6Dδn(2−ν)/(ξ(T)ν). In the final inequality we used the bound on δn from (5.3). This satisfies one of the requirements of Proposition 5.2. The second requirement of Proposition 5.2 on ‖CT′‖2 is also similarly satisfied as we have that ‖CT′‖2 ≤ 16(3−ν)λn/(3α(2−ν)) from Proposition 5.3, and we use the same sequence of inequalities as above. Thus we conclude from Proposition 5.2 and from
(5.5) that
$$(5.8)\qquad g_\gamma(\hat S_\Omega - S^*,\ L^* - \hat L_{T'}) \le \frac{32(3-\nu)\lambda_n}{3\alpha(2-\nu)} \lesssim \frac{1}{\xi(T)}\sqrt{\frac{p}{n}}.$$
Here the last inequality follows from the bound on λn .
If we show that (Ŝn, L̂n) = (ŜΩ, L̂T′), we can conclude the proof of Theorem 4.1 since algebraic correctness of (ŜΩ, L̂T′) holds from Proposition 5.3 and the estimation error bound follows from (5.8). In order to complete this final step, we again revert to Proposition 5.3 and prove the requisite bound on gγ(A†R_{Σ_O^*}(A(ŜΩ − S∗, L∗ − L̂T′))). Since the bound (5.8) combined with the inequality (5.7) satisfies the condition of Proposition 5.1 [i.e., we have that gγ(ŜΩ − S∗, L∗ − L̂T′) ≤ 1/(2C1)]:
$$\begin{aligned}
g_\gamma\bigl(A^\dagger R_{\Sigma_O^*}\bigl(A(\hat S_\Omega - S^*,\ L^* - \hat L_{T'})\bigr)\bigr)
&\le \frac{2D\psi C_1^2}{\xi(T)}\, g_\gamma(\hat S_\Omega - S^*,\ L^* - \hat L_{T'})^2\\
&\le \frac{2D\psi C_1^2}{\xi(T)}\Bigl[\frac{64(3-\nu)D}{\alpha\xi(T)\nu}\Bigr]^2\delta_n^2\\
&= \frac{8192\psi C_1^2(3-\nu)^2 D^2}{\alpha^2\xi(T)^2\nu^2}\,\frac{D\delta_n}{\xi(T)}\,\delta_n\\
&\le \frac{D\delta_n}{\xi(T)}\\
&= \frac{\lambda_n\nu}{6(2-\nu)}.
\end{aligned}$$
In the second inequality we used (5.6) and (5.8), in the final inequality we used the bound (5.3) on δn, and in the final equality we used the relation λn = 6Dδn(2−ν)/(ξ(T)ν). □
PROOF OF COROLLARY 4.3. Based on the optimality conditions of the modified convex program (5.2), we have that
$$g_\gamma\bigl(A^\dagger\bigl[(\hat S_n - \hat L_n)^{-1} - \Sigma_O^n\bigr]\bigr) \le \lambda_n.$$
Combining this with the bound (5.4) yields the desired result. □

6. Simulation results. In this section we give experimental demonstration of
the consistency of our estimator (1.2) on synthetic examples, and its effectiveness
in modeling real-world stock return data. Our choices of λn and γ are guided by Theorem 4.1. Specifically, we choose λn to be proportional to √(p/n). For γ we observe that the support/sign-pattern and the rank of the solution (Ŝn, L̂n) are the same for a range of values of γ. Therefore one could solve the convex program (1.2) for several values of γ, and choose a solution in a suitable range in which the sign-pattern and rank of the solution are stable (see [7] for details). In practical problems with real-world data these parameters may be chosen via cross-validation (it would be of interest to consider methods such as those developed
in [24]). For small problem instances we solve the convex program (1.2) using
a combination of YALMIP [21] and SDPT3 [31]. For larger problem instances
we use the special-purpose solver LogdetPPA [33] developed for log-determinant
semidefinite programs.
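Purely to make the structure of (1.2) concrete (this is our illustrative sketch, not the authors' code, and it assumes the CVXPY modeling package with a conic solver such as SCS rather than the MATLAB tools cited above), the program can be written as follows; since L is constrained to be positive semidefinite, its nuclear norm reduces to its trace:

```python
import cvxpy as cp

def fit_latent_variable_ggm(Sigma_n, lam, gamma):
    """Sketch of the regularized maximum-likelihood program (1.2)."""
    p = Sigma_n.shape[0]
    S = cp.Variable((p, p), symmetric=True)
    L = cp.Variable((p, p), PSD=True)            # L is positive semidefinite
    K = S - L                                    # estimated marginal concentration matrix
    objective = cp.Minimize(
        cp.trace(K @ Sigma_n) - cp.log_det(K)    # log_det keeps S - L positive definite
        + lam * (gamma * cp.sum(cp.abs(S)) + cp.trace(L))
    )
    problem = cp.Problem(objective)
    problem.solve(solver=cp.SCS)                 # any solver that handles log-det programs
    return S.value, L.value
```

In line with the discussion above, one would call such a routine with lam proportional to √(p/n) and sweep gamma over a grid, retaining a solution from a range of γ over which the sign pattern of S and the rank of L are stable.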
6.1. Synthetic data. In the first set of experiments we consider a setting in
which we have access to samples of the observed variables of a latent-variable
graphical model. We consider several latent-variable Gaussian graphical models.
The first model consists of p = 36 observed variables and h = 2 latent variables.
The conditional graphical model structure of the observed variables is a cycle with
the edge partial correlation coefficients equal to 0.25; thus, this conditional model
is specified by a sparse graphical model with degree 2. The second model is the
same as the first one, but with h = 3 latent variables. The third model consists of
h = 1 latent variable, and the conditional graphical model structure of the observed
variables is given by a 6 × 6 nearest-neighbor grid (i.e., p = 36 and degree 4) with
the partial correlation coefficients of the edges equal to 0.15. In all three of these
models each latent variable is connected to a random subset of 80% of the observed
variables (and the partial correlation coefficients corresponding to these edges are
also random). Therefore the effect of the latent variables is “spread out” over most
of the observed variables, that is, the low-rank matrix summarizing the effect of
the latent variables is incoherent.
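A minimal sketch of how such a model can be generated is given below (our own illustration; the exact coupling strengths, normalizations and random seeds behind the paper's experiments are not specified in the text, so the numbers here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def cycle_plus_latent_model(p=36, h=2, rho=0.25, frac=0.8):
    # Conditional concentration matrix of the observed variables: a cycle with
    # partial correlation rho on each edge (unit diagonal).
    K_O = np.eye(p)
    for i in range(p):
        j = (i + 1) % p
        K_O[i, j] = K_O[j, i] = -rho
    # Each latent variable is connected to a random subset of frac*p observed variables;
    # the column normalization helps keep the full concentration matrix positive definite.
    K_OH = np.zeros((p, h))
    for k in range(h):
        idx = rng.choice(p, size=int(frac * p), replace=False)
        col = rng.standard_normal(idx.size)
        K_OH[idx, k] = 0.3 * col / np.linalg.norm(col)
    K_H = np.eye(h)
    K = np.block([[K_O, K_OH], [K_OH.T, K_H]])
    assert np.linalg.eigvalsh(K).min() > 0
    S_star = K_O                                     # sparse conditional concentration
    L_star = K_OH @ np.linalg.solve(K_H, K_OH.T)     # rank-h and incoherent
    Sigma_O = np.linalg.inv(S_star - L_star)         # marginal covariance of observed variables
    return S_star, L_star, Sigma_O

def sample_covariance(Sigma_O, n):
    X = rng.multivariate_normal(np.zeros(Sigma_O.shape[0]), Sigma_O, size=n)
    return X.T @ X / n
```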
For each model we generate n samples of the observed variables, and use the resulting sample covariance Σ_O^n as input to our convex program (1.2). Figure 1
shows the probability of obtaining algebraically correct estimates as a function
of n. This probability is evaluated over 50 experiments for each value of n. In all
of these cases standard graphical model selection applied directly to the observed
variables is not useful as the marginal concentration matrix of the observed variables is not well-approximated by a sparse matrix. These experiments agree with
our theoretical results that the convex program (1.2) is an algebraically consistent
estimator of a latent-variable model given (sufficiently many) samples of only the
observed variables.
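The experiment just described can be emulated with the sketches given earlier in this section (again our own illustration, with arbitrary thresholds and constants): for each sample size, solve (1.2) on independently drawn sample covariances and record how often the estimate reproduces the sign pattern of S∗ and the rank of L∗.

```python
import numpy as np

def recovery_probability(S_star, L_star, Sigma_O, n, trials=50,
                         lam_const=1.0, gamma=0.1, tol=1e-4):
    p = S_star.shape[0]
    h = np.linalg.matrix_rank(L_star)
    hits = 0
    for _ in range(trials):
        Sigma_n = sample_covariance(Sigma_O, n)              # from the sketch above
        lam = lam_const * np.sqrt(p / n)                      # lambda_n proportional to sqrt(p/n)
        S_hat, L_hat = fit_latent_variable_ggm(Sigma_n, lam, gamma)
        # "Algebraically correct": same sign pattern as S* and same rank as L*,
        # up to an illustrative numerical tolerance.
        same_sign = np.array_equal(
            np.where(np.abs(S_hat) > tol, np.sign(S_hat), 0.0), np.sign(S_star))
        same_rank = np.sum(np.linalg.eigvalsh(L_hat) > tol) == h
        hits += int(same_sign and same_rank)
    return hits / trials
```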
FIG. 1. Synthetic data: plot showing probability of algebraically correct estimation. The three models studied are (a) 36-node conditional graphical model given by a cycle with h = 2 latent variables, (b) 36-node conditional graphical model given by a cycle with h = 3 latent variables and (c) 36-node conditional graphical model given by a 6 × 6 grid with h = 1 latent variable. For each plotted point, the probability of algebraically correct estimation is obtained over 50 random trials.

6.2. Stock return data. In the next experiment we model the statistical structure of monthly stock returns of 84 companies in the S&P 100 index from 1990 to 2007; we disregard 16 companies that were listed after 1990. The number of samples n is equal to 216. We compute the sample covariance based on these returns and use this as input to (1.2).

The model learned using (1.2) for suitable values of λn, γ consists of h = 5 latent variables, and the conditional graphical model structure of the stock returns conditioned on these latent components consists of 135 edges. Therefore the number of parameters in the model is 84 + 135 + (5 × 84) = 639. The resulting KL divergence between the distribution specified by this model and a Gaussian distribution specified by the sample covariance is 17.7. Figure 2 (left) shows the conditional graphical model structure.

FIG. 2. Stock returns: the figure on the left shows the sparsity pattern (black denotes an edge, and white denotes no edge) of the concentration matrix of the conditional graphical model (135 edges) of the stock returns, conditioned on five latent variables, in a latent-variable graphical model (total number of parameters equals 639). This model is learned using (1.2), and the KL divergence with respect to a Gaussian distribution specified by the sample covariance is 17.7. The figure on the right shows the concentration matrix of the graphical model (646 edges) of the stock returns, learned using standard sparse graphical model selection based on solving an ℓ1-regularized maximum-likelihood program (total number of parameters equals 730). The KL divergence between this distribution and a Gaussian distribution specified by the sample covariance is 44.4.

The strongest edges in this conditional graphical model, as measured by partial correlation, are between Baker Hughes–
Schlumberger, A.T.&T.–Verizon, Merrill Lynch–Morgan Stanley, Halliburton–
Baker Hughes, Intel–Texas Instruments, Apple–Dell, and Microsoft–Dell. It is of
interest to note that in the Standard Industrial Classification system (see the U.S. SEC website at http://www.sec.gov/info/edgar/siccodes.htm) for grouping
these companies, several of these pairs are in different classes. As mentioned in
Section 2.1, our method estimates a low-rank matrix that summarizes the effect of
the latent variables; in order to factorize this low-rank matrix, for example, into
sparse factors, one could use methods such as those described in [35].
We compare these results to those obtained using a sparse graphical model learned using ℓ1-regularized maximum-likelihood (see, e.g., [26]), without introducing any latent variables. Figure 2 (right) shows this graphical model structure.
The number of edges in this model is 646 (the total number of parameters is equal
to 646 + 84 = 730), and the resulting KL divergence between this distribution and
a Gaussian distribution specified by the sample covariance is 44.4.
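The KL divergences quoted in this subsection are between zero-mean Gaussians, for which the closed form is KL(N(0, Σ1) ‖ N(0, Σ2)) = ½[tr(Σ2⁻¹Σ1) − p + log det Σ2 − log det Σ1]. A small helper of ours is given below; the text does not spell out the direction of the divergence, so that remains an assumption:

```python
import numpy as np

def gaussian_kl(Sigma_1, Sigma_2):
    """KL( N(0, Sigma_1) || N(0, Sigma_2) ) for zero-mean Gaussians."""
    p = Sigma_1.shape[0]
    Sigma_2_inv = np.linalg.inv(Sigma_2)
    _, logdet_1 = np.linalg.slogdet(Sigma_1)
    _, logdet_2 = np.linalg.slogdet(Sigma_2)
    return 0.5 * (np.trace(Sigma_2_inv @ Sigma_1) - p + logdet_2 - logdet_1)

# For the latent-variable model, the model-implied covariance is (S_hat - L_hat)^{-1};
# e.g., gaussian_kl(np.linalg.inv(S_hat - L_hat), Sigma_sample).
```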
These results suggest that a latent-variable graphical model is better suited than
a standard sparse graphical model for modeling stock returns. This is likely due to
the presence of global, long-range correlations in stock return data that are better
modeled via latent variables.
7. Discussion. We have studied the problem of modeling the statistical structure of a collection of random variables as a sparse graphical model conditioned
on a few additional latent components. As a first contribution we described conditions under which such latent-variable graphical models are identifiable given
samples of only the observed variables. We also proposed a convex program based
on ℓ1 and nuclear norm regularized maximum-likelihood for latent-variable graphical model selection. Given samples of the observed variables of a latent-variable
Gaussian model, we proved that this convex program provides consistent estimates of the number of latent components as well as the conditional graphical
model structure among the observed variables conditioned on the latent components. Our analysis holds in the high-dimensional regime in which the number of
observed/latent variables is allowed to grow with the number of samples of the
observed variables. These theoretical predictions are verified via a set of experiments on synthetic data. We also demonstrate the effectiveness of our approach in
modeling real-world stock return data.
Several questions arise that are worthy of further investigation. While (1.2) can
be solved in polynomial time using off-the-shelf solvers, it is preferable to develop
more efficient special-purpose solvers to scale to massive datasets by taking advantage of the structure of (1.2). It is also of interest to develop statistically consistent convex optimization methods for latent-variable modeling with non-Gaussian
variables, for example, for categorical data.
Acknowledgments. We would like to thank James Saunderson and Myung
Jin Choi for helpful discussions, and Kim-Chuan Toh for kindly providing us specialized code to solve larger instances of our convex program.
SUPPLEMENTARY MATERIAL
Supplement to “Latent variable graphical model selection via convex optimization” (DOI: 10.1214/11-AOS949SUPP; .pdf). Due to space constraints, we
have moved some technical proofs to a supplementary document [6].
REFERENCES
[1] B ICKEL , P. J. and L EVINA , E. (2008). Regularized estimation of large covariance matrices.
Ann. Statist. 36 199–227. MR2387969
[2] B ICKEL , P. J. and L EVINA , E. (2008). Covariance regularization by thresholding. Ann. Statist.
36 2577–2604. MR2485008
[3] C ANDÈS , E. J., L I , X., M A , Y. and W RIGHT, J. (2011). Robust principal component analysis?
J. ACM 58 Art. 11, 37. MR2811000
[4] C ANDÈS , E. J. and R ECHT, B. (2009). Exact matrix completion via convex optimization.
Found. Comput. Math. 9 717–772. MR2565240
[5] C ANDÈS , E. J., ROMBERG , J. and TAO , T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform.
Theory 52 489–509. MR2236170
[6] C HANDRASEKARAN , V., PARRILO , P. A. and W ILLSKY, A. S. (2011). Supplement to "Latent variable graphical model selection via convex optimization." DOI:10.1214/11-AOS949SUPP.
[7] C HANDRASEKARAN , V., S ANGHAVI , S., PARRILO , P. A. and W ILLSKY, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21 572–596. MR2817479
[8] DAVIDSON , K. R. and S ZAREK , S. J. (2001). Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, Vol. I 317–366. North-Holland, Amsterdam. MR1863696
[9] D EMPSTER , A. P., L AIRD , N. M. and RUBIN , D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38.
[10] D ONOHO , D. L. (2006). For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 797–829. MR2217606
[11] D ONOHO , D. L. (2006). Compressed sensing. IEEE Trans. Inform. Theory 52 1289–1306.
MR2241189
[12] E L K AROUI , N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36 2717–2756. MR2485011
[13] E LIDAN , G., NACHMAN , I. and F RIEDMAN , N. (2007). “Ideal parent” structure learning for
continuous variable Bayesian networks. J. Mach. Learn. Res. 8 1799–1833. MR2353820
[14] FAN , J., FAN , Y. and LV, J. (2008). High dimensional covariance matrix estimation using a
factor model. J. Econometrics 147 186–197. MR2472991
[15] FAZEL , M. (2002). Matrix rank minimization with applications. Ph.D. thesis, Dept. Elec. Eng.,
Stanford Univ.
[16] H ORN , R. A. and J OHNSON , C. R. (1990). Matrix Analysis. Cambridge Univ. Press, Cambridge. MR1084815
[17] J OHNSTONE , I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327. MR1863961
[18] L AM , C. and FAN , J. (2009). Sparsistency and rates of convergence in large covariance matrix
estimation. Ann. Statist. 37 4254–4278. MR2572459
[19] L AURITZEN , S. L. (1996). Graphical Models. Oxford Statistical Science Series 17. Oxford
Univ. Press, New York. MR1419991
[20] L EDOIT, O. and W OLF, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. J. Multivariate Anal. 88 365–411. MR2026339
[21] L ÖFBERG , J. (2004). YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taiwan. Available at http://control.ee.ethz.ch/~joloef/yalmip.php.
[22] M AR ČENKO , V. A. and PASTUR , L. A. (1967). Distribution of eigenvalues in certain sets of
random matrices. Mat. Sb. 72 507–536. MR0208649
[23] M EINSHAUSEN , N. and B ÜHLMANN , P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[24] M EINSHAUSEN , N. and B ÜHLMANN , P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat.
Methodol. 72 417–473. MR2758523
[25] O RTEGA , J. M. and R HEINBOLDT, W. C. (1970). Iterative Solution of Nonlinear Equations in
Several Variables. Academic Press, New York. MR0273810
[26] R AVIKUMAR , P., WAINWRIGHT, M. J., R ASKUTTI , G. and Y U , B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 4 935–980.
[27] R ECHT, B., FAZEL , M. and PARRILO , P. A. (2010). Guaranteed minimum-rank solutions
of linear matrix equations via nuclear norm minimization. SIAM Rev. 52 471–501.
MR2680543
[28] ROCKAFELLAR , R. T. (1996). Convex Analysis. Princeton Univ. Press, Princeton, NJ.
MR1451876
[29] ROTHMAN , A. J., B ICKEL , P. J., L EVINA , E. and Z HU , J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Stat. 2 494–515. MR2417391
[30] S PEED , T. P. and K IIVERI , H. T. (1986). Gaussian Markov distributions over finite graphs.
Ann. Statist. 14 138–150. MR0829559
[31] T OH , K. C., T ODD , M. J. and T UTUNCU , R. H. (1999). SDPT3—a MATLAB software package for semidefinite-quadratic-linear programming. Available at http://www.math.nus.edu.sg/~mattohkc/sdpt3.html.
[32] WAINWRIGHT, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202. MR2729873
[33] WANG , C., S UN , D. and T OH , K.-C. (2010). Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM J. Optim. 20 2994–3013.
MR2735941
[34] WATSON , G. A. (1992). Characterization of the subdifferential of some matrix norms. Linear
Algebra Appl. 170 33–45. MR1160950
[35] W ITTEN , D. M., T IBSHIRANI , R. and H ASTIE , T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis.
Biostatistics 10 515–534.
[36] W U , W. B. and P OURAHMADI , M. (2003). Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90 831–844. MR2024760
[37] Z HAO , P. and Y U , B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res.
7 2541–2563. MR2274449
V. C HANDRASEKARAN
D EPARTMENT OF C OMPUTING
AND M ATHEMATICAL S CIENCES
C ALIFORNIA I NSTITUTE OF T ECHNOLOGY
PASADENA , C ALIFORNIA 91106
USA
E- MAIL : venkatc@caltech.edu
P. A. PARRILO
A. S. W ILLSKY
L ABORATORY FOR I NFORMATION
AND D ECISION S YSTEMS
D EPARTMENT OF E LECTRICAL E NGINEERING
AND C OMPUTER S CIENCE
M ASSACHUSETTS I NSTITUTE OF T ECHNOLOGY
C AMBRIDGE , M ASSACHUSETTS 02139
USA
E- MAIL : parrilo@mit.edu
willsky@mit.edu