ASYMPTOTIC NORMALITY AND OPTIMALITIES IN ESTIMATION OF LARGE GAUSSIAN GRAPHICAL MODEL

Submitted to the Annals of Statistics

By Zhao Ren, Tingni Sun, Cun-Hui Zhang and Harrison H. Zhou
Yale University, University of Pennsylvania and Rutgers University
The Gaussian graphical model, a popular paradigm for studying relationships among variables in a wide range of applications, has
attracted great attention in recent years. This paper considers a fundamental question: When is it possible to estimate low-dimensional
parameters at parametric square-root rate in a large Gaussian graphical model? A novel regression approach is proposed to obtain asymptotically efficient estimation of each entry of a precision matrix under
a sparseness condition relative to the sample size. When the precision
matrix is not sufficiently sparse, or equivalently the sample size is not
sufficiently large, a lower bound is established to show that it is no
longer possible to achieve the parametric rate in the estimation of
each entry. This lower bound result, which provides an answer to the
delicate sample size question, is established with a novel construction
of a subset of sparse precision matrices in an application of Le Cam’s
Lemma. Moreover, the proposed estimator is proven to have optimal
convergence rate when the parametric rate cannot be achieved, under
a minimal sample requirement.
The proposed estimator is applied to test the presence of an edge
in the Gaussian graphical model or to recover the support of the
entire model, to obtain adaptive rate-optimal estimation of the entire
precision matrix as measured by the matrix lq operator norm, and to
make inference about latent variables in the graphical model. All these are
achieved under a sparsity condition on the precision matrix and a side
condition on the range of its spectrum. This significantly relaxes the
commonly imposed uniform signal strength condition on the precision
matrix, the irrepresentable condition on the Hessian tensor operator of
the covariance matrix, or the ℓ1 constraint on the precision matrix.
Numerical results confirm our theoretical findings. The ROC curve of the proposed algorithm, Asymptotic Normal Thresholding (ANT), for support recovery significantly outperforms that of the popular GLasso algorithm.

The research of Zhao Ren and Harrison H. Zhou was supported in part by NSF Career Award DMS-0645676 and NSF FRG Grant DMS-0854975. The research of Cun-Hui Zhang was supported in part by NSF Grants DMS-0906420 and DMS-1106753 and NSA Grant H98230-11-1-0205.
AMS 2000 subject classifications: Primary 62H12; secondary 62F12, 62G09.
Keywords and phrases: asymptotic efficiency, covariance matrix, inference, graphical model, latent graphical model, minimax lower bound, optimal rate of convergence, scaled lasso, precision matrix, sparsity, spectral norm.
1. Introduction. The Gaussian graphical model, a powerful tool for investigating the relationships among a large number of random variables in a complex system, is used in a wide range of scientific applications. A central question for the Gaussian graphical model is to recover the structure of an undirected Gaussian graph. Let G = (V, E) be an undirected graph representing the conditional dependence relationships between components of
a random vector Z = (Z1 , . . . , Zp )T as follows. The vertex set V = {V1 , . . . , Vp } represents
the components of Z. The edge set E consists of pairs (i, j) indicating the conditional
dependence between Zi and Zj given all other components. In applications, the following question is fundamental: Is there an edge between Vi and Vj ? It is well known that
recovering the structure of an undirected Gaussian graph G = (V, E) is equivalent to
recovering the support of the population precision matrix of the data in the Gaussian
graphical model. Let
Z = (Z1 , Z2 , . . . , Zp )T ∼ N (µ, Σ) ,
where Σ = (σij ) is the population covariance matrix. The precision matrix, denoted by
Ω = (ωij ), is defined as the inverse of covariance matrix, Ω = Σ−1 . There is an edge
between Vi and Vj , i.e., (i, j) ∈ E, if and only if ωij ≠ 0. See, for example, Lauritzen
(1996). Consequently, the support recovery of the precision matrix Ω yields the recovery
of the structure of the graph G.
Suppose n i.i.d. p-variate random vectors X^{(1)}, X^{(2)}, . . . , X^{(n)} are observed from the same distribution as Z, i.e., the Gaussian N(µ, Ω^{-1}). Assume without loss of generality that µ = 0 hereafter. In this paper, we address the following two fundamental questions: When is it possible to make statistical inference for each individual entry of a precision matrix Ω at the parametric √n rate? When and in what sense is it possible to recover the support of Ω in the presence of some small nonzero |ωij|?
The problems of estimating a large sparse precision matrix and recovering its support
have drawn considerable recent attention. There are mainly two approaches in literature.
The first one is a penalized likelihood estimation approach with a lasso-type penalty on
entries of the precision matrix. Yuan and Lin (2007) proposed to use the lasso penalty
and studied its asymptotic properties when p is fixed. Ravikumar et al. (2011) derived
the rate of convergence when the dimension p is high by applying a primal-dual witness
construction under an irrepresentability condition on the Hessian tensor operator and a
constraint on the matrix l1 norm of the precision matrix. See also Rothman et al. (2008)
and Lam and Fan (2009) for other related results. The other one is the neighborhood-based approach, which runs a lasso-type regression or Dantzig selector type estimation of each variable on all the rest of the variables to estimate the precision matrix column by column. See
Meinshausen and Bühlmann (2006), Yuan (2010), Cai, Liu and Luo (2011), Cai, Liu and
Zhou (2012) and Sun and Zhang (2012a). The irrepresentability condition is no longer
needed in Cai, Liu and Luo (2011) and Cai, Liu and Zhou (2012) for support recovery,
but the thresholding level for support recovery depends on the matrix l1 norm of the
precision matrix. The matrix l1 norm is unknown and large, which makes the support
recovery procedures there nonadaptive and thus less practical. In Sun and Zhang (2012a),
optimal convergence rate in the spectral norm is achieved without requiring the matrix `1
norm constraint or the irrepresentability condition. However, support recovery properties
of the estimator were not analyzed.
In spite of an extensive literature on the topic, the fundamental limit of support recovery in the Gaussian graphical model is still largely unknown, let alone an adaptive procedure to achieve the limit.
Statistical inference of low-dimensional parameters at the √n rate has been considered in the closely related linear regression model. Sun and Zhang (2012b) proposed an efficient scaled lasso estimator of the noise level under the sample size condition n ≫ (s log p)^2, where s is the ℓ0 or capped-ℓ1 measure of the size of the unknown regression vector. Zhang and Zhang (2011) proposed an asymptotically normal low-dimensional projection estimator for the regression coefficients, and their estimator was proven to be asymptotically efficient by van de Geer, Bühlmann and Ritov (2013) in a semiparametric sense under the same sample size condition. The asymptotic efficiency of these estimators can also be understood through the minimum Fisher information in a more general context (Zhang, 2011). Alternative methods for testing and estimation of regression coefficients were proposed in Belloni, Chernozhukov and Hansen (2012), Bühlmann (2012), and Javanmard and Montanari (2013). However, the optimal rate of convergence is unclear from these papers when the sample size condition n ≫ (s log p)^2 fails to hold.
This paper makes important advancements in the understanding of statistical inference of low-dimensional parameters in the Gaussian graphical model in the following ways. Let s be the maximum degree of the graph or a certain more relaxed capped-ℓ1 measure of the complexity of the precision matrix. We prove that the estimation of each ωij at the parametric √n convergence rate requires the sparsity condition s ≤ O(1) n^{1/2}/ log p, or equivalently a sample size of order (s log p)^2. We propose an adaptive estimator of the individual ωij and prove its asymptotic normality and efficiency when n ≫ (s log p)^2. Moreover, we prove that the proposed estimator achieves the optimal convergence rate when the sparsity condition is relaxed to s ≤ c0 n/ log p for a certain positive constant c0. The efficient estimator of the individual ωij is then used to construct fully data-driven procedures to recover the support of Ω and to make statistical inference about latent variables in the graphical model.
The methodology we are proposing is a novel regression approach briefly described in
Sun and Zhang (2012c). In this regression approach, the main task is not to estimate
the slope as seen in Meinshausen and Bühlmann (2006), Yuan (2010), Cai, Liu and Luo
(2011), Cai, Liu and Zhou (2012) and Sun and Zhang (2012b), but to estimate the noise
level. For any index subset A of {1, 2, . . . , p} and a vector Z of length p, we use ZA to
denote a vector of length |A| with elements indexed by A. Similarly for a matrix U and
two index subsets A and B of {1, 2, . . . , p} we can define a submatrix UA,B of size |A|×|B|
with rows and columns of U indexed by A and B respectively. Consider A = {i, j}, for example, i = 1 and j = 2; then ZA = (Z1, Z2)^T and
$\Omega_{A,A} = \begin{pmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \omega_{22} \end{pmatrix}.$
It is well known that
$Z_A \,|\, Z_{A^c} = N\left(-\Omega_{A,A}^{-1}\Omega_{A,A^c} Z_{A^c},\ \Omega_{A,A}^{-1}\right).$
This observation motivates us to consider the regression with the two response variables above. The noise level Ω^{-1}_{A,A} has only three parameters. When Ω is sufficiently sparse, a penalized regression approach is proposed in Section 2 to obtain an asymptotically efficient estimation of ωij, i.e., the estimator is asymptotically normal and the variance matches that of the maximum likelihood estimator in the classical setting where the dimension p is a fixed constant. Consider the class of parameter spaces modeling sparse precision matrices with at most kn,p off-diagonal nonzero elements in each column,
(1)  $G_0(M, k_{n,p}) = \Big\{\Omega = (\omega_{ij})_{1\le i,j\le p} :\ \max_{1\le j\le p}\sum_{i\ne j} 1\{\omega_{ij}\ne 0\} \le k_{n,p},\ \text{and}\ 1/M \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M\Big\},$
where 1{·} is the indicator function and M is some constant greater than 1. The following theorem shows that a necessary and sufficient condition to obtain a √n-consistent estimation of ωij is kn,p = O(√n/ log p), and when kn,p = o(√n/ log p) the procedure to be proposed in Section 2 is asymptotically efficient.
Theorem 1. Let X^{(i)} be i.i.d. N_p(µ, Σ), i = 1, 2, . . . , n. Assume that kn,p ≤ c0 n/ log p with a sufficiently small constant c0 > 0 and p ≥ k_{n,p}^ν with some ν > 2. We have the following probabilistic results.
(i). There exists a constant ε0 > 0 such that
$\inf_{i,j}\ \inf_{\hat\omega_{ij}}\ \sup_{G_0(M,k_{n,p})} P\left\{ |\hat\omega_{ij} - \omega_{ij}| \ge \epsilon_0 \max\left\{ n^{-1}k_{n,p}\log p,\ n^{-1/2} \right\} \right\} \ge \epsilon_0.$
(ii). The estimator ω̂ij defined in (12) is rate optimal in the sense of
$\max_{i,j}\ \sup_{G_0(M,k_{n,p})} P\left\{ |\hat\omega_{ij} - \omega_{ij}| \ge M'\max\left\{ n^{-1}k_{n,p}\log p,\ n^{-1/2} \right\} \right\} \to 0,$
as (M', n) → (∞, ∞). Furthermore, the estimator ω̂ij is asymptotically efficient when kn,p = o(√n/ log p), i.e., with F_{ij}^{-1} = ωii ωjj + ωij^2,
(2)  $\sqrt{nF_{ij}}\,(\hat\omega_{ij} - \omega_{ij}) \xrightarrow{D} N(0, 1).$
Moreover, the minimax risk of estimating ωij over the class G_0(M, k_{n,p}) satisfies, provided n = O(p^ξ) with some ξ > 0,
(3)  $\inf_{\hat\omega_{ij}}\ \sup_{G_0(M,k_{n,p})} E|\hat\omega_{ij} - \omega_{ij}| \asymp \max\left\{ k_{n,p}\frac{\log p}{n},\ \sqrt{\frac{1}{n}} \right\}.$
The lower bound is established through Le Cam's lemma and a novel construction of a subset of sparse precision matrices. An important implication of the lower bound is that the difficulty of support recovery for a sparse precision matrix is different from that for a sparse covariance matrix when kn,p ≫ √n/ log p, and when kn,p ≪ √n/ log p the difficulty of support recovery for a sparse precision matrix is just the same as that for a sparse covariance matrix.
It is worthwhile to point out that the asymptotic efficiency result is obtained without the
need to assume the irrepresentable condition or the l1 constraint of the precision matrix
which are commonly required in the literature. An immediate consequence of the asymptotic normality result (2) is to test individually whether there is an edge between Vi and Vj in the edge set E, i.e., to test the hypothesis ωij = 0. This result is applied to perform adaptive support recovery optimally. In addition, we can strengthen other results in the literature under weaker
assumptions, and the procedures are adaptive, including adaptive rate-optimal estimation
of the precision matrix under various matrix lq norms, and an extension of our framework
for inference and estimation to a class of latent variable graphical models. See Section 3
for details.
Our work on optimal estimation of precision matrix given in the present paper is closely
connected to a growing literature on estimation of large covariance matrices. Many regularization methods have been proposed and studied. For example, Bickel and Levina
(2008a,b) proposed banding and thresholding estimators for estimating bandable and
sparse covariance matrices respectively and obtained rate of convergence for the two estimators. See also El Karoui (2008) and Lam and Fan (2009). Cai, Zhang and Zhou (2010)
established the optimal rates of convergence for estimating bandable covariance matrices.
Cai and Zhou (2012) and Cai, Liu and Zhou (2012) obtained the minimax rate of convergence for estimating sparse covariance and precision matrices under a range of losses
including the spectral norm loss. In particular, a new general lower bound technique for
matrix estimation was developed there. See also Sun and Zhang (2012a).
The proposed estimator was briefly described in Sun and Zhang (2012c), along with a statement of the efficiency of the estimator without proof under the sparsity assumption kn,p ≪ n^{1/2}/ log p. While we were working on the delicate issue of the necessity of the sparsity condition kn,p ≪ n^{1/2}/ log p and the optimality of the method for support recovery and estimation under the general sparsity condition kn,p ≪ n/ log p, Liu (2013) developed p-values for testing ωij = 0 and related FDR control methods under the stronger sparsity condition kn,p ≪ n^{1/2}/ log p. However, his method cannot be converted into confidence intervals, and the optimality of his method is unclear under either sparsity condition.
The paper is organized as follows. In Section 2, we introduce our methodology and
main results for statistical inference. Applications to estimation under the spectral norm,
to support recovery and estimation of latent variable graphical model are presented in
Section 3. Section 4 discusses extensions of results in Sections 2 and 3. Numerical studies
are given in Section 5. Proofs for theorems in Sections 2-3 are given in Sections 6-7.
Proofs for main lemmas are given in Section 8. We collect auxiliary results for proving
main lemmas in the supplementary material.
Notations. We summarize here some notations to be used throughout the paper. For
1 ≤ w ≤ ∞, we use kukw and kAkw to denote the usual vector lw norm, given a vector
u ∈ Rp and a matrix A = (aij )p×p respectively. In particular, kAk∞ denote the entry-wise
maximum maxij |aij |. We shall write k·k without a subscript for the vector l2 norm. The
matrix `w operator norm of a matrix A is defined by |||A|||w = maxkxkw =1 kAxkw . The
commonly used spectral norm ||| · ||| coincides with the matrix `2 operator norm ||| · |||2 .
2. Methodology and Statistical Inference. In this section we will introduce our
methodology for estimating each entry and more generally, a smooth functional of any
square submatrix of finite size. Asymptotic efficiency results are stated in Section 2.2
under a sparseness assumption. The lower bound in Section 2.3 shows that the sparseness
condition to obtain the asymptotic efficiency in Section 2.2 is sharp.
2.1. Methodology. We will first introduce the methodology to estimate each entry ωij ,
and discuss its extension to the estimation of functionals of a submatrix of the precision
matrix.
The methodology is motivated by the following simple observation with A = {i, j},
(4)  $Z_{\{i,j\}} \,|\, Z_{\{i,j\}^c} = N\left(-\Omega_{A,A}^{-1}\Omega_{A,A^c} Z_{\{i,j\}^c},\ \Omega_{A,A}^{-1}\right).$
Equivalently we write
(5)  $(Z_i, Z_j) = Z_{\{i,j\}^c}^T\,\beta + (\eta_i, \eta_j),$
where the coefficients and error distributions are
(6)  $\beta = -\Omega_{A^c,A}\Omega_{A,A}^{-1}, \qquad (\eta_i, \eta_j)^T \sim N\left(0,\ \Omega_{A,A}^{-1}\right).$
Denote the covariance matrix of (ηi, ηj)^T by
$\Theta_{A,A} = \Omega_{A,A}^{-1} = \begin{pmatrix} \theta_{ii} & \theta_{ij} \\ \theta_{ji} & \theta_{jj} \end{pmatrix}.$
We will estimate ΘA,A and expect that an efficient estimator of ΘA,A yields an efficient estimation of entries of ΩA,A by inverting the estimator of ΘA,A.
Denote the n by p dimensional data matrix by X. The ith row of the data matrix is the ith sample X^{(i)}. Let XA be the columns indexed by A = {i, j}. Based on the regression interpretation (5), we have the following data version of the multivariate regression model
(7)  $X_A = X_{A^c}\beta + \epsilon_A.$
Here each row of (7) is a sample of the linear model (5). Note that β is a (p − 2) by 2 dimensional coefficient matrix. Denote a sample version of ΘA,A by
(8)  $\Theta^{ora}_{A,A} = (\theta^{ora}_{kl})_{k\in A,\,l\in A} = \epsilon_A^T\epsilon_A/n,$
which is an oracle MLE of ΘA,A, assuming that we know β, and
(9)  $\Omega^{ora}_{A,A} = (\omega^{ora}_{kl})_{k\in A,\,l\in A} = \left(\Theta^{ora}_{A,A}\right)^{-1}.$
But of course β is unknown, and we will need to estimate β and plug in its estimator to estimate ε_A.
Now we formally introduce the methodology. For each m ∈ A = {i, j}, we apply a scaled lasso penalization to the univariate linear regression of Xm against XAc as follows,
(10)  $\left\{\hat\beta_m,\ \hat\theta_{mm}^{1/2}\right\} = \arg\min_{b\in\mathbb{R}^{p-2},\ \sigma\in\mathbb{R}^{+}} \left\{ \frac{\|X_m - X_{A^c}b\|^2}{2n\sigma} + \frac{\sigma}{2} + \lambda\sum_{k\in A^c}\frac{\|X_k\|}{\sqrt{n}}\,|b_k| \right\},$
with a weighted ℓ1 penalty, where the vector b is indexed by Ac. The penalty level λ will be specified explicitly later. Define the residuals of the scaled lasso regression by
$\hat\epsilon_A = X_A - X_{A^c}\hat\beta,$
and
(11)  $\hat\Theta_{A,A} = \hat\epsilon_A^T\hat\epsilon_A/n.$
It can be shown that this definition of θ̂mm is consistent with the θ̂mm obtained from the scaled lasso (10) for each m ∈ A. Finally we simply invert the estimator Θ̂A,A = (θ̂kl)_{k,l∈A} to estimate the entries of ΩA,A, i.e.,
(12)  $\hat\Omega_{A,A} = \hat\Theta_{A,A}^{-1}.$
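To make the two-step procedure concrete, the following Python sketch implements the pairwise regression (7), the scaled lasso fit (10)-(11) and the inversion (12). It is a minimal illustration under simplifying assumptions: the scaled lasso is solved by the usual alternation between a lasso step and a noise-level update using scikit-learn's Lasso, columns of the design are standardized so the weighted ℓ1 penalty reduces to a plain one, and the helper names scaled_lasso and precision_entry are ours rather than part of any released package.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(y, X, lam, n_iter=20, tol=1e-6):
    """Scaled lasso sketch: alternate a lasso step (penalty lam*sigma) with a
    noise-level update sigma = ||y - X b|| / sqrt(n).  Columns of X are assumed
    standardized so that ||X_k||/sqrt(n) = 1."""
    n = len(y)
    sigma = np.std(y)                      # crude initial noise level
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # sklearn's Lasso minimizes ||y - Xb||^2 / (2n) + alpha * ||b||_1
        fit = Lasso(alpha=lam * sigma, fit_intercept=False).fit(X, y)
        beta = fit.coef_
        sigma_new = np.linalg.norm(y - X @ beta) / np.sqrt(n)
        if abs(sigma_new - sigma) < tol:
            sigma = sigma_new
            break
        sigma = sigma_new
    return beta, sigma

def precision_entry(X, i, j, lam):
    """Estimate the 2x2 block Omega_{A,A} for A = {i, j}: run the pairwise
    regressions, form the residual covariance (11) and invert it as in (12)."""
    n, p = X.shape
    A, Ac = [i, j], [k for k in range(p) if k not in (i, j)]
    XAc = X[:, Ac]
    XAc = XAc / (np.linalg.norm(XAc, axis=0) / np.sqrt(n))   # standardize columns
    resid = np.empty((n, 2))
    for col, m in enumerate(A):
        beta_m, _ = scaled_lasso(X[:, m], XAc, lam)
        resid[:, col] = X[:, m] - XAc @ beta_m
    Theta_hat = resid.T @ resid / n        # (11)
    return np.linalg.inv(Theta_hat)        # (12): estimate of Omega_{A,A}
```

With λ chosen as in the theorems below, precision_entry(X, i, j, lam) returns the 2×2 block whose off-diagonal element is the estimate ω̂ij.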
This methodology can be routinely extended into a more general form. For any subset B ⊂ {1, 2, . . . , p} of bounded size, the conditional distribution of ZB given ZBc is
(13)  $Z_B \,|\, Z_{B^c} = N\left(-\Omega_{B,B}^{-1}\Omega_{B,B^c} Z_{B^c},\ \Omega_{B,B}^{-1}\right),$
so that the associated multivariate linear regression model is $X_B = X_{B^c}\beta_{B^c,B} + \epsilon_B$ with $\beta_{B^c,B} = -\Omega_{B^c,B}\Omega_{B,B}^{-1}$ and $\epsilon_B \sim N\left(0,\ \Omega_{B,B}^{-1}\right)$. Consider a slightly more general problem of estimating a smooth functional of Ω^{-1}_{B,B}, denoted by
$\zeta := \zeta\left(\Omega_{B,B}^{-1}\right).$
When β_{Bc,B} is known, ε_B is sufficient for Ω^{-1}_{B,B} due to the independence of ε_B and X_{Bc}, so that an oracle estimator of ζ can be defined as
$\zeta^{ora} = \zeta\left(\epsilon_B^T\epsilon_B/n\right).$
We apply the scaled lasso to the univariate linear regression of Xm against XBc for each m ∈ B as in Equation (10),
$\left\{\hat\beta_m,\ \hat\theta_{mm}^{1/2}\right\} = \arg\min_{b\in\mathbb{R}^{p-|B|},\ \sigma\in\mathbb{R}^{+}} \left\{ \frac{\|X_m - X_{B^c}b\|^2}{2n\sigma} + \frac{\sigma}{2} + \lambda\sum_{k\in B^c}\frac{\|X_k\|}{\sqrt{n}}\,|b_k| \right\},$
where |B| is the size of the subset B. The residual matrix of the model is $\hat\epsilon_B = X_B - X_{B^c}\hat\beta_{B^c,B}$; then the scaled lasso estimator of ζ(Ω^{-1}_{B,B}) is defined by
(14)  $\hat\zeta = \zeta\left(\hat\epsilon_B^T\hat\epsilon_B/n\right).$
2.2. Statistical Inference. For λ > 0, define capped-ℓ1 balls as
(15)  $\mathcal{G}^*(M, s, \lambda) = \left\{\Omega : s_\lambda(\Omega) \le s,\ 1/M \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M\right\},$
where $s_\lambda = s_\lambda(\Omega) = \max_j\sum_{i\ne j}\min\{1, |\omega_{ij}|/\lambda\}$ for Ω = (ωij)_{1≤i,j≤p}. In this paper, λ is of the order √((log p)/n). We omit the subscript λ from s when it is clear from the context. When |ωij| is either 0 or larger than λ, sλ is the maximum node degree of the graph, which is denoted by kn,p in the class of parameter spaces (1). In general, kn,p is an upper bound of the sparseness measurement sλ. The spectrum of Σ is bounded in the matrix class G*(M, s, λ) as in the ℓ0 ball (1). The following theorem gives an error bound for our estimators by comparing them with the oracle MLE (8), and shows that
$\kappa^{ora}_{ij} = \sqrt{n}\,\frac{\omega^{ora}_{ij} - \omega_{ij}}{\sqrt{\omega_{ii}\omega_{jj} + \omega_{ij}^2}}$
is asymptotically standard normal, which implies that the oracle MLE (8) is asymptotically normal with mean ωij and variance n^{-1}(ωii ωjj + ωij^2).
Theorem 2. Let Θ̂A,A and Ω̂A,A be the estimators of ΘA,A and ΩA,A defined in (11) and (12) respectively, and let λ = (1 + ε)√(2δ log p/n) for any δ ≥ 1 and ε > 0 in Equation (10).
(i). Suppose s ≤ c0 n/ log p for a sufficiently small constant c0 > 0. We have
(16)  $\max_{\Omega\in\mathcal{G}^*(M,s,\lambda)}\ \max_{A:\,A=\{i,j\}} P\left\{\left\|\hat\Theta_{A,A}-\Theta^{ora}_{A,A}\right\|_\infty > C_1 s\,\frac{\log p}{n}\right\} \le o\left(p^{-\delta+1}\right),$
and
(17)  $\max_{\Omega\in\mathcal{G}^*(M,s,\lambda)}\ \max_{A:\,A=\{i,j\}} P\left\{\left\|\hat\Omega_{A,A}-\Omega^{ora}_{A,A}\right\|_\infty > C_1' s\,\frac{\log p}{n}\right\} \le o\left(p^{-\delta+1}\right),$
where Θ^{ora}_{A,A} and Ω^{ora}_{A,A} are the oracle estimators defined in (8) and (9) respectively, and C1 and C1' are positive constants depending on {ε, c0, M} only.
(ii). There exist constants D1 and ϑ ∈ (0, ∞), and three marginally standard normal random variables Zkl, kl = ii, ij, jj, such that whenever |Zkl| ≤ ϑ√n for all kl, we have
(18)  $\left|\kappa^{ora}_{ij} - Z'\right| \le \frac{D_1}{\sqrt{n}}\left(1 + Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2\right),$
where Z' ∼ N(0, 1), which can be defined as a linear combination of Zkl, kl = ii, ij, jj.
Theorem 2 immediately yields the following results of estimation and inference for ωij .
Theorem 3. Let Ω̂A,A be the estimator of ΩA,A defined in (12), and let λ = (1 + ε)√(2δ log p/n) for any δ ≥ 1 and ε > 0 in Equation (10). Suppose s ≤ c0 n/ log p for a sufficiently small constant c0 > 0. For any small constant ε' > 0, there exists a constant C2 = C2(ε', ε, c0, M) such that
(19)  $\max_{\Omega\in\mathcal{G}^*(M,s,\lambda)}\ \max_{1\le i\le j\le p} P\left\{|\hat\omega_{ij}-\omega_{ij}| > C_2\max\left\{s\,\frac{\log p}{n},\ \sqrt{\frac{1}{n}}\right\}\right\} \le \epsilon'.$
Moreover, there is a constant C3 = C3(ε', ε, c0, M) such that
(20)  $\max_{\Omega\in\mathcal{G}^*(M,s,\lambda)} P\left\{\left\|\hat\Omega-\Omega\right\|_\infty > C_3\max\left\{s\,\frac{\log p}{n},\ \sqrt{\frac{\log p}{n}}\right\}\right\} \le o\left(p^{-\delta+3}\right).$
Furthermore, ω̂ij is asymptotically efficient,
(21)  $\sqrt{nF_{ij}}\,(\hat\omega_{ij}-\omega_{ij}) \xrightarrow{D} N(0, 1),$
when Ω ∈ G*(M, s, λ) and s = o(√n/ log p), where F_{ij}^{-1} = ωii ωjj + ωij^2.
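As an illustration of how the normal limit (21) is used in practice, the sketch below forms an asymptotic confidence interval for ωij by plugging the estimated variance into the limit distribution. It assumes as input a 2×2 block estimate of ΩA,A, such as the one returned by the hypothetical precision_entry helper above; the function name is ours.

```python
import numpy as np
from scipy.stats import norm

def omega_confidence_interval(Omega_hat_AA, n, alpha=0.05):
    """Asymptotic level-(1-alpha) CI for omega_ij based on (21):
    sqrt(n)(omega_hat_ij - omega_ij) is approximately N(0, omega_ii*omega_jj + omega_ij**2)."""
    w_ii, w_ij = Omega_hat_AA[0, 0], Omega_hat_AA[0, 1]
    w_jj = Omega_hat_AA[1, 1]
    se = np.sqrt((w_ii * w_jj + w_ij ** 2) / n)   # plug-in standard error
    z = norm.ppf(1 - alpha / 2)
    return w_ij - z * se, w_ij + z * se
```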
Remark 1. The upper bounds max{s log p/n, √(1/n)} and max{s log p/n, √((log p)/n)} in Equations (19) and (20) respectively are shown to be rate-optimal in Section 2.3.
Remark 2. The choice of λ = (1 + ε)√(2δ log p/n) is common in the literature, but it can be too big and too conservative, which usually leads to some estimation bias in practice. Let tq(α, n) denote the α quantile of the t distribution with n degrees of freedom. In Section 4 we show that the value of λ can be reduced to λ^{new}_{finite} = (1 + ε)B/√(n − 1 + B^2), where B = tq(1 − (s_{max}/p)^δ/2, n − 1), for every s_{max} = o(√n/ log p); this is asymptotically equivalent to (1 + ε)√(2δ log(p/s_{max})/n). See Section 4 for more details. In the simulation studies of Section 5, we use the penalty λ^{new}_{finite} with δ = 1, which gives a good finite sample performance.
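For reference, the finite-sample penalty level of this remark can be computed directly; the small sketch below uses exactly the quantile formula displayed above (the function name is ours).

```python
import numpy as np
from scipy.stats import t as t_dist

def penalty_finite(n, p, s_max, delta=1.0, eps=0.1):
    """Finite-sample penalty level lambda_finite^new = (1+eps)*B/sqrt(n - 1 + B^2),
    where B is the (1 - (s_max/p)**delta / 2) quantile of the t distribution with n-1 df."""
    B = t_dist.ppf(1.0 - (s_max / p) ** delta / 2.0, df=n - 1)
    return (1.0 + eps) * B / np.sqrt(n - 1 + B ** 2)

# Example with the values used in the numerical studies: n = 400, p = 200,
# s_max = sqrt(n)/log(p).
lam = penalty_finite(400, 200, np.sqrt(400) / np.log(200))
```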
Remark 3. In Theorems 2 and 3, our goal is to estimate each entry ωij of the precision matrix Ω. Sometimes it is more natural to consider estimating the partial correlation rij = −ωij/(ωii ωjj)^{1/2} between Zi and Zj. Let Ω̂A,A be the estimator of ΩA,A defined in (12). Our estimator of the partial correlation rij is defined as r̂ij = −ω̂ij/(ω̂ii ω̂jj)^{1/2}. Then the results above can be easily extended to the case of estimating rij. In particular, under the same assumptions as in Theorem 3, the estimator r̂ij is asymptotically efficient: $\sqrt{n\,(1-r_{ij}^2)^{-2}}\,(\hat r_{ij}-r_{ij})$ converges to N(0, 1) when s = o(√n/ log p). This is stated as Corollary 1 in Sun and Zhang (2012c) without proof.
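A companion sketch for the partial correlation version of this inference, again assuming a 2×2 block estimate of ΩA,A as input and using the plug-in standard error implied by the limit in Remark 3 (the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def partial_correlation_ci(Omega_hat_AA, n, alpha=0.05):
    """Plug-in estimate and asymptotic CI for r_ij = -omega_ij / sqrt(omega_ii * omega_jj),
    using sqrt(n) (r_hat - r) / (1 - r^2) -> N(0, 1) from Remark 3."""
    w_ii, w_ij, w_jj = Omega_hat_AA[0, 0], Omega_hat_AA[0, 1], Omega_hat_AA[1, 1]
    r_hat = -w_ij / np.sqrt(w_ii * w_jj)
    se = (1 - r_hat ** 2) / np.sqrt(n)            # plug-in standard error
    z = norm.ppf(1 - alpha / 2)
    return r_hat, (r_hat - z * se, r_hat + z * se)
```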
The following theorem extends Theorems 2 and 3 to the estimation of ζ(Ω^{-1}_{B,B}), a smooth functional of Ω^{-1}_{B,B} for a bounded-size subset B. Assume that ζ : R^{|B|×|B|} → R is a unit Lipschitz function in a neighborhood {G : |||G − Ω^{-1}_{B,B}||| ≤ κ}, i.e.,
(22)  $\left|\zeta(G) - \zeta\left(\Omega_{B,B}^{-1}\right)\right| \le |||G - \Omega_{B,B}^{-1}|||.$
Theorem 4. Let ζ̂ be the estimator of ζ defined in (14), and let λ = (1 + ε)√(2δ log p/n) for any δ ≥ 1 and ε > 0 in Equation (10). Suppose s ≤ c0 n/ log p for a sufficiently small constant c0 > 0. Then,
(23)  $\max_{\Omega\in\mathcal{G}^*(M,s,\lambda)} P\left\{\left|\hat\zeta - \zeta^{ora}\right| > C_1 s\,\frac{\log p}{n}\right\} \le o\left(|B|\,p^{-\delta+1}\right),$
with a constant C1 = C1(ε, c0, M, |B|). Furthermore, ζ̂ is asymptotically efficient,
(24)  $\sqrt{nF_\zeta}\,\left(\hat\zeta - \zeta\right) \xrightarrow{D} N(0, 1),$
when Ω ∈ G*(M, s, λ) and s = o(√n/ log p), where Fζ is the Fisher information of estimating ζ for the Gaussian model N(0, Ω^{-1}_{B,B}).
Remark 4. The results in this section can be easily extended to the weak ℓq ball with 0 < q < 1 to model the sparsity of the precision matrix. A weak ℓq ball of radius c in R^p is defined as follows,
$B_q(c) = \left\{\xi \in \mathbb{R}^p : |\xi|_{(j)}^q \le c\,j^{-1},\ \text{for all}\ j = 1, \ldots, p\right\},$
where $|\xi|_{(1)} \ge |\xi|_{(2)} \ge \cdots \ge |\xi|_{(p)}$. Let
(25)  $\mathcal{G}_q(M, k_{n,p}) = \left\{\Omega = (\omega_{ij})_{1\le i,j\le p} : \omega_{\cdot j} \in B_q(k_{n,p}),\ \text{and}\ 1/M \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M\right\}.$
Since ξ ∈ Bq(k) implies $\sum_j \min(1, |\xi_j|/\lambda) \le \lfloor k/\lambda^q\rfloor + \{q/(1-q)\}\,k^{1/q}\lfloor k/\lambda^q\rfloor^{1-1/q}/\lambda$, we have
(26)  $\mathcal{G}_q(M, k_{n,p}) \subseteq \mathcal{G}^*(M, s, \lambda), \qquad 0 \le q \le 1,$
when $k_{n,p}/\lambda^q \le C_q(s \vee 1)$, where $C_q = 1 + q2^q/(1-q)$ for 0 < q < 1 and C0 = 1. Thus, the conclusions of Theorems 2, 3 and 4 hold with G*(M, s, λ) replaced by Gq(M, kn,p) and s by kn,p(n/ log p)^{q/2}, 0 ≤ q < 1.
2.3. Lower Bound. In this section, we derive a lower bound for estimating ωij over the matrix class G0(M, kn,p) defined in (1). Assume that
(27)  $p \ge k_{n,p}^{\nu}\ \text{with}\ \nu > 2,$
and
(28)  $k_{n,p} \le C_0\,\frac{n}{\log p}$
for some C0 > 0. Theorem 5 below implies that the assumption kn,p (log p)/n → 0 is necessary for consistent estimation of any single entry of Ω.
We carefully construct a finite collection of distributions G0 ⊂ G0(M, kn,p) and apply Le Cam's method to show that for any estimator ω̂ij,
(29)  $\sup_{\mathcal{G}_0} P\left\{|\hat\omega_{ij}-\omega_{ij}| > C_1 k_{n,p}\frac{\log p}{n}\right\} \to 1,$
for some constant C1 > 0. It is relatively easy to establish the parametric lower bound √(1/n). These two lower bounds together immediately yield Theorem 5 below.
Theorem 5. Suppose we observe independent and identically distributed p-variate Gaussian random variables X^{(1)}, X^{(2)}, . . . , X^{(n)} with zero mean and precision matrix Ω = (ωkl)_{p×p} ∈ G0(M, kn,p). Under assumptions (27) and (28), we have the following minimax lower bounds,
(30)  $\inf_{\hat\omega_{ij}}\ \sup_{G_0(M,k_{n,p})} P\left\{|\hat\omega_{ij}-\omega_{ij}| > \max\left\{C_1\frac{k_{n,p}\log p}{n},\ C_2\sqrt{\frac{1}{n}}\right\}\right\} > c_1 > 0,$
and
(31)  $\inf_{\hat\Omega}\ \sup_{G_0(M,k_{n,p})} P\left\{\left\|\hat\Omega-\Omega\right\|_\infty > \max\left\{C_1'\frac{k_{n,p}\log p}{n},\ C_2'\sqrt{\frac{\log p}{n}}\right\}\right\} > c_2 > 0,$
where C1, C2, C1' and C2' are positive constants depending on M, ν and C0 only.
Remark 5. The lower bound kn,p log p/n in Theorem 5 shows that estimation of a sparse precision matrix can be very different from estimation of a sparse covariance matrix. The sample covariance always gives a parametric rate of estimation of every entry σij. But for estimation of a sparse precision matrix, when kn,p ≫ √n/ log p, Theorem 5 implies that it is impossible to obtain the parametric rate.
Remark 6. Since G0(M, kn,p) ⊆ G*(M, kn,p, λ) by the definitions in (1) and (15), Theorem 5 also provides the lower bound for the larger class. Similarly, Theorem 5 can be easily extended to the weak ℓq ball, 0 < q < 1, defined in (25) and the capped-ℓ1 ball defined in (15). For these parameter spaces, in the proof of Theorem 5 we only need to define H as the collection of all p × p symmetric matrices with exactly kn,p(n/ log p)^{q/2} − 1, rather than (kn,p − 1), elements equal to 1 between the third and the last elements on the first row (column) and the rest all zeros. Then it is easy to check that the sub-parameter space G0 in (56) is indeed in Gq(M, kn,p). Now under the assumptions p ≥ (kn,p(n/ log p)^{q/2})^ν with ν > 2 and kn,p ≤ C0(n/ log p)^{1−q/2}, we have the following minimax lower bounds
$\inf_{\hat\omega_{ij}}\ \sup_{\mathcal{G}_q(M,k_{n,p})} P\left\{|\hat\omega_{ij}-\omega_{ij}| > \max\left\{C_1 k_{n,p}\left(\frac{\log p}{n}\right)^{1-q/2},\ C_2\sqrt{\frac{1}{n}}\right\}\right\} > c_1 > 0,$
and
$\inf_{\hat\Omega}\ \sup_{\mathcal{G}_q(M,k_{n,p})} P\left\{\left\|\hat\Omega-\Omega\right\|_\infty > \max\left\{C_1' k_{n,p}\left(\frac{\log p}{n}\right)^{1-q/2},\ C_2'\sqrt{\frac{\log p}{n}}\right\}\right\} > c_2 > 0.$
These lower bounds match the upper bounds for the proposed estimator in Theorem 3 in view of the discussion in Remark 4.
3. Applications. The asymptotic normality result is applied to obtain rate-optimal estimation of the precision matrix under various matrix lw norms, to recover the support of Ω adaptively, and to estimate latent variable graphical models, without the need of the irrepresentable condition or the l1 constraint on the precision matrix commonly required in the literature. Our procedure first obtains an Asymptotically Normal estimator of each entry and then applies Thresholding; we thus call it ANT.
3.1. ANT for Adaptive Support Recovery. The support recovery of the precision matrix has been studied in several papers. See, for example, Friedman, Hastie and Tibshirani (2008), d'Aspremont, Banerjee and El Ghaoui (2008), Rothman et al. (2008), Ravikumar et al. (2011), Cai, Liu and Luo (2011), and Cai, Liu and Zhou (2012). In this literature, the theoretical properties of the graphical lasso (GLasso), CLIME and ACLIME for support recovery were obtained. Ravikumar et al. (2011) studied the theoretical properties of GLasso, and showed that GLasso can correctly recover the support under irrepresentable conditions and the condition min_{(i,j)∈S(Ω)} |ωij| ≥ c√((log p)/n) for some c > 0. Cai, Liu and Luo (2011) does not require irrepresentable conditions, but needs to assume that min_{(i,j)∈S(Ω)} |ωij| ≥ C M_{n,p}^2 √((log p)/n), where Mn,p is the matrix l1 norm of Ω. In Cai, Liu and Zhou (2012), the condition is weakened to min_{(i,j)∈S(Ω)} |ωij| ≥ C Mn,p √((log p)/n), but the threshold level there is (C/2) Mn,p √((log p)/n), where C is unknown and Mn,p can be very large, which makes the support recovery procedure there impractical.
In this section we introduce an adaptive support recovery procedure based on the variance of the oracle estimator of each entry ωij to recover the sign of the nonzero entries of Ω with high probability. The lower bound condition on min_{(i,j)∈S(Ω)} |ωij| is significantly weakened. In particular, we remove the unpleasant matrix l1 norm Mn,p. By Theorem 3, when the precision matrix is sparse enough, s = o(√n/ log p), we have the asymptotic normality result for each entry ωij, i ≠ j, i.e.,
$\sqrt{nF_{ij}}\,(\hat\omega_{ij}-\omega_{ij}) \xrightarrow{D} N(0, 1),$
where F_{ij} = (ωii ωjj + ωij^2)^{-1} is the Fisher information of estimating ωij. The total number of edges is p(p − 1)/2. We may apply thresholding to ω̂ij to correctly distinguish zero and nonzero entries. However, the variance ωii ωjj + ωij^2 needs to be estimated. We define the adaptive support recovery procedure as follows,
(32)  $\hat\Omega_{thr} = (\hat\omega^{thr}_{ij})_{p\times p},$
where ω̂^{thr}_{ii} = ω̂ii and ω̂^{thr}_{ij} = ω̂ij 1{|ω̂ij| ≥ τ̂ij} for i ≠ j, with
(33)  $\hat\tau_{ij} = \sqrt{\frac{2\xi_0\left(\hat\omega_{ii}\hat\omega_{jj}+\hat\omega_{ij}^2\right)\log p}{n}}.$
Here ω̂ii ω̂jj + ω̂ij^2 is the natural estimate of the asymptotic variance of ω̂ij defined in (12), and ξ0 is a tuning parameter which can be fixed at any ξ0 > 2. This thresholding estimator is adaptive. The sufficient conditions in Theorem 6 below for support recovery are much weaker compared with other results in the literature.
Define a thresholded population precision matrix as
(34)  $\Omega_{thr} = (\omega^{thr}_{ij})_{p\times p},$
where ω^{thr}_{ii} = ωii and ω^{thr}_{ij} = ωij 1{|ωij| ≥ √(8ξ(ωii ωjj + ωij^2)(log p)/n)}, with a certain ξ > ξ0. Recall that E = E(Ω) = {(i, j) : ωij ≠ 0} is the edge set of the Gauss–Markov graph associated with the precision matrix Ω. Since Ωthr is composed of the relatively large components of Ω, (V, E(Ωthr)) can be viewed as a graph of strong edges. Define
$S(\Omega) = \{\mathrm{sgn}(\omega_{ij}),\ 1 \le i, j \le p\}.$
The following theorem shows that with high probability, ANT recovers all the strong edges without false recovery. Moreover, under the uniform signal strength condition
(35)  $|\omega_{ij}| \ge 2\sqrt{\frac{2\xi\left(\omega_{ii}\omega_{jj}+\omega_{ij}^2\right)\log p}{n}}, \quad \forall\ \omega_{ij}\ne 0,$
i.e., Ωthr = Ω, the ANT also recovers the sign matrix S(Ω).
Theorem 6. Let λ = (1 + ε)√(2δ log p/n) for any δ ≥ 3 and ε > 0. Let Ω̂thr be the ANT estimator defined in (32) with ξ0 > 2 in the thresholding level (33). Suppose Ω ∈ G*(M, s, λ) with s = o(√(n/ log p)). Then,
(36)  $\lim_{n\to\infty} P\left\{E(\Omega_{thr}) \subseteq E(\hat\Omega_{thr}) \subseteq E(\Omega)\right\} = 1,$
where Ωthr is defined in (34) with ξ > ξ0. If in addition (35) holds, then
(37)  $\lim_{n\to\infty} P\left\{S(\hat\Omega_{thr}) = S(\Omega)\right\} = 1.$
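As an illustration, the thresholding step (32)-(33) can be written in a few lines. The sketch below assumes that an entrywise estimate of the precision matrix has already been assembled from the pairwise estimator (12); the function name is ours, and the handling of the diagonal is a simplification of ours rather than something prescribed above.

```python
import numpy as np

def ant_threshold(Omega_hat, n, xi0=2.5):
    """Adaptive thresholding (32)-(33): keep an off-diagonal entry of Omega_hat only
    if it exceeds its own estimated-variance-based threshold tau_ij."""
    p = Omega_hat.shape[0]
    Omega_thr = Omega_hat.copy()
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            var_hat = Omega_hat[i, i] * Omega_hat[j, j] + Omega_hat[i, j] ** 2
            tau = np.sqrt(2 * xi0 * var_hat * np.log(p) / n)   # threshold (33)
            if abs(Omega_hat[i, j]) < tau:
                Omega_thr[i, j] = 0.0
    return Omega_thr
```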
3.2. ANT for Adaptive Estimation under the Matrix lw Norm. In this section, we consider the rate of convergence under the matrix lw norm. To control the improbable case in which our estimator Θ̂A,A is nearly singular, we define our estimator based on the thresholding estimator Ω̂thr defined in (32),
(38)  $\breve\Omega_{thr} = \left(\hat\omega^{thr}_{ij}\,1\{|\hat\omega_{ij}| \le \log p\}\right)_{p\times p}.$
Theorem 7 follows mainly from the convergence rate under the element-wise norm and the fact that the upper bound holds for the matrix l1 norm. The result then follows immediately from the inequality |||M|||w ≤ |||M|||1 for any symmetric matrix M and 1 ≤ w ≤ ∞, which can be proved by applying the Riesz–Thorin interpolation theorem. See, e.g., Thorin (1948). Note that under the assumption k_{n,p}^2 = O(n/ log p), it can be seen from Equations (17) and (18) in Theorem 2 that with high probability ‖Ω̂ − Ω‖∞ is dominated by ‖Ω^{ora} − Ω‖∞ = Op(√((log p)/n)). From there, the details of the proof are similar in nature to those of Theorem 3 in Cai and Zhou (2012) and are thus omitted due to space limitations.
Theorem 7. Under the assumptions s^2 = O(n/ log p) and n = O(p^ξ) with some ξ > 0, the Ω̆thr defined in (38) with λ = (1 + ε)√(2δ log p/n) for sufficiently large δ ≥ 3 + 2ξ and ε > 0 satisfies, for all 1 ≤ w ≤ ∞ and kn,p ≤ s,
(39)  $\sup_{G_0(M,k_{n,p})} E|||\breve\Omega_{thr}-\Omega|||_w^2 \le \sup_{\mathcal{G}^*(M,k_{n,p},\lambda)} E|||\breve\Omega_{thr}-\Omega|||_w^2 \le Cs^2\,\frac{\log p}{n}.$
Remark 7. It follows from Equation (26) that result (39) also holds for the classes of weak ℓq balls Gq(M, kn,p) defined in Equation (25), with s = Cq kn,p(n/ log p)^{q/2},
(40)  $\sup_{\mathcal{G}_q(M,k_{n,p})} E|||\breve\Omega_{thr}-\Omega|||_w^2 \le Ck_{n,p}^2\left(\frac{\log p}{n}\right)^{1-q}.$

Remark 8. Cai, Liu and Zhou (2012) showed that the rates obtained in Equations (39) and (40) are optimal when p ≥ cn^γ for some γ > 1 and kn,p = o(n^{1/2}(log p)^{-3/2}).
3.3. Estimation and Inference for Latent Variable Graphical Models. Chandrasekaran, Parrilo and Willsky (2012) first proposed a very natural penalized estimation approach and studied its theoretical properties. Their work was discussed and appreciated by several researchers, but it was not clear whether the conditions in their paper are necessary and whether the results there are optimal. Ren and Zhou (2012) observed that the support recovery boundary can be significantly improved from an order of √(p/n) to √((log p)/n) under certain conditions, including a bounded l1 norm constraint for the precision matrix. In this section we extend the methodology and results in Section 2 to study latent variable graphical models. The results in Ren and Zhou (2012) are significantly improved under weaker assumptions.
Let O and H be two subsets of {1, 2, . . . , p + h} with Card(O) = p, Card(H) = h and O ∪ H = {1, 2, . . . , p + h}. Assume that (X_O^{(i)}, X_H^{(i)}), i = 1, . . . , n, are i.i.d. (p + h)-variate Gaussian random vectors with a positive definite covariance matrix Σ_{(p+h)×(p+h)}. Denote the corresponding precision matrix by Ω̄_{(p+h)×(p+h)} = Σ^{-1}_{(p+h)×(p+h)}. We only have access to {X_O^{(1)}, X_O^{(2)}, . . . , X_O^{(n)}}, while {X_H^{(1)}, X_H^{(2)}, . . . , X_H^{(n)}} are hidden and the number of latent components is unknown. Write Σ_{(p+h)×(p+h)} and Ω̄_{(p+h)×(p+h)} as follows,
$\Sigma_{(p+h)\times(p+h)} = \begin{pmatrix} \Sigma_{O,O} & \Sigma_{O,H} \\ \Sigma_{H,O} & \Sigma_{H,H} \end{pmatrix}, \quad\text{and}\quad \bar\Omega_{(p+h)\times(p+h)} = \begin{pmatrix} \bar\Omega_{O,O} & \bar\Omega_{O,H} \\ \bar\Omega_{H,O} & \bar\Omega_{H,H} \end{pmatrix},$
where Σ_{O,O} and Σ_{H,H} are the covariance matrices of X_O^{(i)} and X_H^{(i)} respectively, and from the Schur complement we have
(41)  $\Sigma_{O,O}^{-1} = \bar\Omega_{O,O} - \bar\Omega_{O,H}\bar\Omega_{H,H}^{-1}\bar\Omega_{H,O}.$
See, e.g., Horn and Johnson (1990). Define
$S = \bar\Omega_{O,O}, \quad\text{and}\quad L = \bar\Omega_{O,H}\bar\Omega_{H,H}^{-1}\bar\Omega_{H,O},$
where h0 = rank(L) = rank(Ω̄_{O,H}) ≤ h.
We are interested in estimating Σ^{-1}_{O,O} as well as S and L. To make the problem identifiable, we assume that S is sparse and that the effect of the latent variables is spread out over all coordinates, i.e.,
(42)  $S = (s_{ij})_{1\le i,j\le p}, \qquad \max_{1\le j\le p}\sum_{i\ne j} 1\{s_{ij}\ne 0\} \le k_{n,p};$
and
(43)  $L = (l_{ij})_{1\le i,j\le p}, \qquad |l_{ij}| \le \frac{M_0}{p}.$
The sparseness of S = Ω̄_{O,O} can be seen to be inherited from the sparse full precision matrix Ω̄_{(p+h)×(p+h)}, and it is particularly interesting for us to identify the support of S = Ω̄_{O,O} and make inference for each entry of S. A sufficient condition for assumption (43) is that the eigendecomposition of $L = \sum_{i=1}^{h_0}\lambda_i u_i u_i^T$ satisfies ‖ui‖∞ ≤ √(c0/p) for all i and $c_0\sum_{i=1}^{h_0}\lambda_i \le M_0$. See Candès and Recht (2009) for a similar assumption. In addition, we assume that
(44)  $1/M \le \lambda_{\min}(\Sigma_{(p+h)\times(p+h)}) \le \lambda_{\max}(\Sigma_{(p+h)\times(p+h)}) \le M$
for some universal constant M, and
(45)  $\frac{n}{\log p} = o(p).$
Equation (44) implies that both the covariance Σ_{O,O} of the observations X_O^{(i)} and the sparse component S = Ω̄_{O,O} have bounded spectrum, and λmax(L) ≤ M.
With a slight abuse of notation, denote the precision matrix Σ^{-1}_{O,O} of X_O^{(i)} by Ω and its inverse by Θ. We propose to apply the methodology in Section 2 to the observations X^{(i)}, which are i.i.d. N(0, Σ_{O,O}) with Ω = (sij − lij)_{1≤i,j≤p}, by considering the following regression
(46)  $X_A = X_{O\setminus A}\,\beta + \epsilon_A$
for A = {i, j} ⊂ O with $\beta = -\Omega_{O\setminus A,A}\Omega_{A,A}^{-1}$ and $\epsilon_A \overset{i.i.d.}{\sim} N\left(0,\ \Omega_{A,A}^{-1}\right)$, and the penalized scaled lasso procedure to estimate ΩA,A. When S = Ip and $L = \frac{1}{2}u_0u_0^T$ with $u_0^T = (1/\sqrt{p}, \ldots, 1/\sqrt{p})$, we see that
$\max_j\sum_{i\ne j}\min\left\{1,\ \frac{|s_{ij}-l_{ij}|}{\lambda}\right\} = \frac{p-1}{2p\lambda} = O\left(\sqrt{\frac{n}{\log p}}\right).$
However, to obtain the asymptotic normality result as in Theorem 2, we require
$\max_j\sum_{i\ne j}\min\left\{1,\ \frac{|s_{ij}-l_{ij}|}{\lambda}\right\} = o\left(\frac{\sqrt{n}}{\log p}\right),$
which is no longer satisfied for the latent variable graphical model. In Section 7.2 we overcome the difficulty through a new analysis.
Theorem 8. Let Ω̂A,A be the estimator of ΩA,A defined in (12) with A = {i, j} for the regression (46). Let λ = (1 + ε)√(2δ log p/n) for any δ ≥ 1 and ε > 0. Under the assumptions (42)-(45) and kn,p = o(√n/ log p), we have
(47)  $\sqrt{\frac{n}{\omega_{ii}\omega_{jj}+\omega_{ij}^2}}\,(\hat\omega_{ij}-\omega_{ij}) \overset{D}{\sim} \sqrt{\frac{n}{\omega_{ii}\omega_{jj}+\omega_{ij}^2}}\,(\hat\omega_{ij}-s_{ij}) \xrightarrow{D} N(0, 1).$

Remark 9. Without condition (45), our estimator may not be asymptotically efficient, but it still has a nice convergence property. Provided kn,p = o(n/ log p), we could obtain the following rate of convergence for estimating ωij = sij − lij,
$P\left\{|\hat\omega_{ij}-\omega_{ij}| > C_3\max\left\{k_{n,p}\frac{\log p}{n},\ \sqrt{\frac{\log p}{n}}\right\}\right\} \le o\left(p^{-\delta+1}\right),$
by simply applying Theorem 2 with the sparsity $\max_j\sum_{i\ne j}\min\left\{1,\ |s_{ij}-l_{ij}|/\lambda\right\} = O\left(k_{n,p}+\lambda^{-1}\right)$, which further implies the rate of convergence for estimating sij,
$P\left\{|\hat\omega_{ij}-s_{ij}| > C_3\max\left\{k_{n,p}\frac{\log p}{n},\ \sqrt{\frac{\log p}{n}},\ \frac{M_0}{p}\right\}\right\} \le o\left(p^{-\delta+1}\right).$
Define the adaptive thresholding estimator Ω̂thr = (ω̂^{thr}_{ij})_{p×p} as in (32) and (33). Following the proofs of Theorems 6 and 7, we are able to obtain the following results. We omit the proof due to space limitations.
Theorem 9. Let λ = (1 + ε)√(2δ log p/n) for some δ ≥ 3 and ε > 0 in Equation (10). Assume that assumptions (42)-(45) hold. Then
(i). Under the assumptions kn,p = o(√(n/ log p)) and
$|s_{ij}| \ge 2\sqrt{\frac{2\xi_0\left(\omega_{ii}\omega_{jj}+\omega_{ij}^2\right)\log p}{n}}, \quad \forall\ s_{ij}\in S(S),$
for some ξ0 > 2, we have
$\lim_{n\to\infty} P\left\{S(\hat\Omega_{thr}) = S(S)\right\} = 1.$
(ii). Under the assumptions k_{n,p}^2 = O(n/ log p) and n = O(p^ξ) with some ξ > 0, the Ω̆thr defined in (38) with sufficiently large δ ≥ 3 + 2ξ satisfies, for all 1 ≤ w ≤ ∞,
$E|||\breve\Omega_{thr}-S|||_w^2 \le Ck_{n,p}^2\,\frac{\log p}{n}.$
4. Discussion. In the analysis of Theorem 2 and nearly all subsequent results in Theorems 3-4 and Theorems 6-9, we have picked the penalty term λ = (1 + ε)√(2δ log p/n) for any δ ≥ 1 (or δ ≥ 3 for support recovery) and ε > 0. This choice of λ can be too conservative and cause some finite sample estimation bias. In this section we show that λ can be chosen smaller.
Let s_{max} ≤ c0 n/ log p with a sufficiently small constant c0 > 0 and s_{max} = O(p^t) for some t < 1. Denote the cumulative distribution function of the t(n−1) distribution by F_{t(n−1)}. Let λ^{new}_{finite} = (1 + ε)B/√(n − 1 + B^2), where B = F^{-1}_{t(n−1)}(1 − (s_{max}/p)^δ/2). It can be shown that λ^{new}_{finite} is asymptotically equivalent to λ^{new} = (1 + ε)√(2δ log(p/s_{max})/n). We can extend Theorems 2-4 and 6-9 to the new penalties λ^{new} and λ^{new}_{finite}. All the results remain the same except that we need to replace s (or kn,p) in those theorems by s + s_{max} (or kn,p + s_{max}). Since all theorems (except the minimax lower bound theorem) are derived from Lemma 2, all we need is just an extension of Lemma 2 as follows.
Lemma 1. Let λ = λ^{new} or λ^{new}_{finite} for any δ ≥ 1 and ε > 0 in Equation (10). Assume s_{max} ≤ c0 n/ log p with a sufficiently small constant c0 > 0 and s_{max} = O(p^t) for some t < 1, and define the event Em as follows,
$\left|\hat\theta_{mm}-\theta^{ora}_{mm}\right| \le C_1'\lambda^2(s+s_{\max}), \qquad \left\|\beta_m-\hat\beta_m\right\|_1 \le C_2'\lambda(s+s_{\max}),$
$\left\|X_{A^c}\left(\beta_m-\hat\beta_m\right)\right\|^2/n \le C_3'\lambda^2(s+s_{\max}), \qquad \left\|X_{A^c}^T\epsilon_m/n\right\|_\infty \le C_4'\lambda,$
for m = i or j and some constants Ck', 1 ≤ k ≤ 4. Under the assumptions of Theorem 2, we have
$P(E_m^c) \le o\left(p^{-\delta+1}\right).$
See its proof in the supplementary material for more details. Note that when s = o(√n/ log p), we have
$\left\|X_{A^c}\left(\beta_m-\hat\beta_m\right)\right\|^2/n \le C_3'\lambda^2(s+s_{\max}) = o\left(1/\sqrt{n}\right)$
with high probability, for every s_{max} = o(√n/ log p). Thus for every choice of s_{max} = o(√n/ log p) in the penalty, our procedure leads to asymptotically efficient estimation of every entry of the precision matrix as long as s = o(√n/ log p). In Section 5 we set λ = λ^{new}_{finite} with δ = 1 for statistical inference and δ = 3 for support recovery.
5. Numerical Studies. In this section, we present some numerical results for both
asymptotic distribution and support recovery. We consider the following 200×200 precision
matrix with three blocks. The block sizes are 100 × 100, 50 × 50 and 50 × 50, respectively.
Let (α1 , α2 , α3 ) = (1, 2, 4). The diagonal entries are α1 , α2 , α3 in three blocks, respectively.
When the entry is in the k-th block, ωj−1,j = ωj,j−1 = 0.5αk , and ωj−2,j = ωj,j−2 = 0.4αk ,
k = 1, 2, 3. The asymptotic variance for estimating each entry can be very different, thus
a naive procedure of setting one universal threshold for all entries would likely fail.
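For reproducibility, the following sketch generates this simulation design and a sample from it (the function name is ours; the construction follows the block description above).

```python
import numpy as np

def block_precision(sizes=(100, 50, 50), alphas=(1.0, 2.0, 4.0)):
    """Three-block 200 x 200 precision matrix used in the simulations: within block k,
    the diagonal is alpha_k, the first off-diagonal is 0.5*alpha_k and the second
    off-diagonal is 0.4*alpha_k."""
    p = sum(sizes)
    Omega = np.zeros((p, p))
    start = 0
    for size, alpha in zip(sizes, alphas):
        for j in range(start, start + size):
            Omega[j, j] = alpha
            if j - 1 >= start:
                Omega[j, j - 1] = Omega[j - 1, j] = 0.5 * alpha
            if j - 2 >= start:
                Omega[j, j - 2] = Omega[j - 2, j] = 0.4 * alpha
        start += size
    return Omega

# Sample n = 400 observations from N(0, Sigma) with Sigma = Omega^{-1}.
Omega = block_precision()
Sigma = np.linalg.inv(Omega)
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=400)
```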
We first estimate the entries in the precision matrix, and the partial correlations discussed in Remark 3, and consider the distributions of the estimators. We generate a random sample of size n = 400 from a multivariate normal distribution N(0, Σ) with Σ = Ω^{-1}. As mentioned in Remark 2, the penalty constant is chosen to be λ^{new}_{finite} = B/√(n − 1 + B^2), where B = tq(1 − ŝ/(2p), n − 1) with ŝ = √n/ log p, which is asymptotically equivalent to √((2/n) log(p/ŝ)).
Table 1 reports the mean and standard error of our estimators for four entries in the precision matrix and the corresponding partial correlations based on 100 replications. Figure 1 shows the histograms of our estimates with the theoretical normal density super-imposed. We can see that the distributions of our estimates match the asymptotic normality in Theorem 3 quite well. We have tried other choices of dimensions, e.g. p = 1000, and obtained similar results.

Table 1
Mean and standard error of the proposed estimators.

    Entry           ω̂1,j                Partial correlation     r̂1,j
    ω1,2 = 0.5       0.469 ± 0.051       r1,2 = −0.5             −0.480 ± 0.037
    ω1,3 = 0.4       0.380 ± 0.054       r1,3 = −0.4             −0.392 ± 0.043
    ω1,4 = 0        −0.042 ± 0.043       r1,4 = 0                 0.043 ± 0.043
    ω1,10 = 0       −0.003 ± 0.045       r1,10 = 0                0.003 ± 0.046
Support recovery of a precision matrix is of great interest. We compare our selection results with the GLasso. In addition to the training sample, we generate an independent sample of size 400 from the same distribution for validating the tuning parameter for the GLasso. The GLasso estimators are computed based on the training data with a range of penalty levels, and we choose a proper penalty level by minimizing the likelihood loss {trace(ΣΩ̂) − log det(Ω̂)} on the validation sample, where Σ is the sample covariance matrix. Our ANT estimators are computed based on the training sample only. As stated in Theorem 6, we use a slightly larger penalty constant to allow for selection consistency. Let λ^{new}_{finite} = B/√(n − 1 + B^2), where B = tq(1 − (ŝ/p)^3/2, n − 1), which is asymptotically equivalent to √((6/n) log(p/ŝ)). We then apply the thresholding step as in (33) with ξ0 = 2.
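For completeness, a small sketch of the validation criterion just described, used to pick the GLasso penalty level (the function name is ours, and the sample covariance is computed on the validation sample):

```python
import numpy as np

def glasso_validation_loss(Omega_hat, X_val):
    """Likelihood-type loss trace(Sigma_val @ Omega_hat) - log det(Omega_hat) on a
    validation sample; the penalty level minimizing this loss is selected."""
    Sigma_val = np.cov(X_val, rowvar=False, bias=True)
    _, logdet = np.linalg.slogdet(Omega_hat)
    return np.trace(Sigma_val @ Omega_hat) - logdet
```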
[Figure 1 about here.]
Fig 1. Histograms of estimated entries. Top: entries ω1,2 and ω1,3 in the precision matrix; bottom: entries ω1,4 and ω1,10 in the precision matrix.
Table 2 shows the average selection performance over 10 replications. The true positive (rate) and false positive (rate) are reported. In addition to the overall performance, the summary statistics are also reported for each block. We can see that while both our ANT method and the graphical lasso select all nonzero entries, ANT outperforms the GLasso in terms of the false positive rate and the false discovery rate.
Table 2
The performance of support recovery

    Block      Method    TP     TPR    FP        FPR
    Overall    GLasso    391    1      5298.2    0.2716
               ANT       391    1      3.5       0.0004
    Block 1    GLasso    197    1      1961      0.4126
               ANT       197    1      1.2       0.0003
    Block 2    GLasso    97     1      288.4     0.2557
               ANT       97     1      1.1       0.0010
    Block 3    GLasso    97     1      162.1     0.1437
               ANT       97     1      1.1       0.0010
Moreover, we compare our method with the GLasso with various penalty levels. Figure
2 shows the ROC curves for the GLasso with various penalty levels and ANT with various
thresholding levels in the follow-up procedure. It is noticed that the GLasso at any penalty level cannot achieve performance similar to ours. In addition, the circle in the plot represents the performance of ANT with the selected threshold level as in (33). The triangle in
the plot represents the performance of the graphical Lasso with the penalty level chosen
by cross-validation. This again indicates that our method simultaneously achieves a very
high true positive rate and a very low false positive rate.
6. Proof of Theorems 1-5.
6.1. Proof of Theorems 2-4. We will only prove Theorems 2 and 3. The proof of Theorem 4 is similar to that of Theorems 2 and 3. The following lemma is the key to the
proof.
Lemma 2. Let λ = (1 + ε)√(2δ log p/n) for any δ ≥ 1 and ε > 0 in Equation (10). Define the event Em as follows,
(48)  $\left|\hat\theta_{mm}-\theta^{ora}_{mm}\right| \le C_1'\lambda^2 s,$
(49)  $\left\|\beta_m-\hat\beta_m\right\|_1 \le C_2'\lambda s,$
(50)  $\left\|X_{A^c}\left(\beta_m-\hat\beta_m\right)\right\|^2/n \le C_3'\lambda^2 s,$
(51)  $\left\|X_{A^c}^T\epsilon_m/n\right\|_\infty \le C_4'\lambda,$
for m = i or j and some constants Ck', 1 ≤ k ≤ 4. Under the assumptions of Theorem 2, we have
$P(E_m^c) \le o\left(p^{-\delta+1}\right).$

[Figure 2 about here.]
Fig 2. The ROC curves of the graphical Lasso and ANT. Circle: ANT with our proposed thresholding. Triangle: GLasso with penalty level chosen by CV.
6.1.1. Proof of Theorem 2. We first prove (i). From Equation (48) of Lemma 2, the large deviation probability in (16) holds for θ^{ora}_{ii} and θ^{ora}_{jj}. We then need only to consider the entry θ^{ora}_{ij}. On the event Ei ∩ Ej,
$\left|\hat\theta_{ij}-\theta^{ora}_{ij}\right| = \left|\hat\epsilon_i^T\hat\epsilon_j/n - \epsilon_i^T\epsilon_j/n\right| = \left|\left(\epsilon_i + X_{A^c}(\beta_i-\hat\beta_i)\right)^T\left(\epsilon_j + X_{A^c}(\beta_j-\hat\beta_j)\right)/n - \epsilon_i^T\epsilon_j/n\right|$
$\le \left\|X_{A^c}^T\epsilon_i/n\right\|_\infty\left\|\beta_j-\hat\beta_j\right\|_1 + \left\|X_{A^c}^T\epsilon_j/n\right\|_\infty\left\|\beta_i-\hat\beta_i\right\|_1 + \left\|X_{A^c}(\beta_i-\hat\beta_i)\right\|\cdot\left\|X_{A^c}(\beta_j-\hat\beta_j)\right\|/n \le \left(2C_2'C_4'+C_3'\right)\lambda^2 s,$
where the last step follows from inequalities (49)-(51) in Lemma 2. Thus we have
$P\left\{\left\|\hat\Theta_{A,A}-\Theta^{ora}_{A,A}\right\|_\infty > C_1 s\,\frac{\log p}{n}\right\} \le o\left(p^{-\delta+1}\right),$
for some C1 > 0. Since ΘA,A has a bounded spectrum, the functional $\zeta_{kl}(\Theta_{A,A}) = \left(\Theta_{A,A}^{-1}\right)_{kl}$ is Lipschitz in a neighborhood of ΘA,A for k, l ∈ A; then Equation (17) is an immediate consequence of Equation (16).
Now we prove part (ii). Define the random vector η^{ora} = (η^{ora}_{ii}, η^{ora}_{ij}, η^{ora}_{jj}), where
$\eta^{ora}_{kl} = \sqrt{n}\,\frac{\theta^{ora}_{kl}-\theta_{kl}}{\sqrt{\theta_{kk}\theta_{ll}+\theta_{kl}^2}}.$
The following result is a multidimensional version of the KMT quantile inequality: there exist some constants D0, ϑ ∈ (0, ∞) and a random normal vector Z = (Zii, Zij, Zjj) ∼ N(0, Σ̆) with Σ̆ = Cov(η^{ora}) such that whenever |Zkl| ≤ ϑ√n for all kl, we have
(52)  $\left\|\eta^{ora}-Z\right\|_\infty \le \frac{D_0}{\sqrt{n}}\left(1 + Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2\right).$
See Proposition [KMT] in Mason and Zhou (2012) for the one-dimensional case and consult Einmahl (1989) for the multidimensional case. Note that √n η^{ora} can be written as a sum of n i.i.d. random vectors with mean zero and covariance matrix Σ̆, each of which is sub-exponentially distributed. Hence the assumptions of the KMT quantile inequality in the literature are satisfied. With a slight abuse of notation, we define Θ = (θii, θij, θjj). To prove the desired coupling inequality (18), we use the Taylor expansion of the function $\omega_{ij}(\Theta) = -\theta_{ij}/\left(\theta_{ii}\theta_{jj}-\theta_{ij}^2\right)$ to obtain
(53)  $\omega^{ora}_{ij}-\omega_{ij} = \left\langle\nabla\omega_{ij}(\Theta),\ \Theta^{ora}-\Theta\right\rangle + \sum_{|\beta|=2} R_\beta(\Theta^{ora})\,(\Theta-\Theta^{ora})^\beta.$
The multi-index notation for β = (β1, β2, β3) is defined as $|\beta| = \sum_k\beta_k$, $x^\beta = \prod_k x_k^{\beta_k}$ and $D^\beta f(x) = \frac{\partial^{|\beta|}f}{\partial x_1^{\beta_1}\partial x_2^{\beta_2}\partial x_3^{\beta_3}}$. The derivatives can be easily computed; to save space, we omit their explicit formulas. The coefficients in the integral form of the remainder with |β| = 2 have a uniform upper bound $\left|R_\beta(\Theta^{ora}_{A,A})\right| \le 2\max_{|\alpha|=2}\max_{\Theta\in B}\left|D^\alpha\omega_{ij}(\Theta)\right| \le C_2$, where B is some sufficiently small compact ball centered at Θ, provided Θ^{ora} is in this ball B, which is satisfied by picking a sufficiently small value ϑ in our assumption ‖η^{ora}‖∞ ≤ ϑ√n. Recall that κ^{ora}_{ij} and η^{ora} are standardized versions of ω^{ora}_{ij} − ωij and (Θ − Θ^{ora}). Consequently there exist some deterministic constants h1, h2, h3 and Dβ with |β| = 2 such that we can rewrite (53) in terms of κ^{ora}_{ij} and η^{ora} as follows,
$\kappa^{ora}_{ij} = h_1\eta^{ora}_{ii} + h_2\eta^{ora}_{ij} + h_3\eta^{ora}_{jj} + \sum_{|\beta|=2}\frac{D_\beta R_\beta(\Theta^{ora})}{\sqrt{n}}\,(\eta^{ora})^\beta,$
which, together with Equation (52), completes our proof of Equation (18),
$\left|\kappa^{ora}_{ij}-Z'\right| \le \left(\sum_{k=1}^3|h_k|\right)\left\|Z-\eta^{ora}\right\|_\infty + \frac{C_3}{\sqrt{n}}\left\|\eta^{ora}\right\|^2 \le \frac{D_1}{\sqrt{n}}\left(1 + Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2\right),$
where the constants C3, D1 ∈ (0, ∞) and Z' := h1 Z1 + h2 Z2 + h3 Z3 ∼ N(0, 1). The last inequality follows from $\left\|\eta^{ora}\right\|^2 \le C_4\left(Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2\right)$ for some large constant C4, which can be shown using (52) easily.
6.1.2. Proof of Theorem 3. The triangle inequality gives
$|\hat\omega_{ij}-\omega_{ij}| \le \left|\hat\omega_{ij}-\omega^{ora}_{ij}\right| + \left|\omega^{ora}_{ij}-\omega_{ij}\right|, \qquad \left\|\hat\Omega_{A,A}-\Omega_{A,A}\right\|_\infty \le \left\|\hat\Omega_{A,A}-\Omega^{ora}_{A,A}\right\|_\infty + \left\|\Omega^{ora}_{A,A}-\Omega_{A,A}\right\|_\infty.$
From Equation (17) we have
$P\left\{\left\|\hat\Omega_{A,A}-\Omega^{ora}_{A,A}\right\|_\infty > C_1 s\,\frac{\log p}{n}\right\} \le o\left(p^{-\delta+1}\right).$
Now we give a tail bound for |ω^{ora}_{ij} − ωij| and ‖Ω^{ora}_{A,A} − ΩA,A‖∞ respectively. For the constant C > 0, we apply Equation (18) to obtain
$P\left\{\left|\kappa^{ora}_{ij}\right| > C\right\} \le P\left\{\max_{kl}|Z_{kl}| > \vartheta\sqrt{n}\right\} + \bar\Phi\left(\frac{C}{2}\right) + P\left\{\frac{D_1}{\sqrt{n}}\left(1 + Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2\right) > \frac{C}{2}\right\} \le o(1) + 2\exp\left(-C^2/8\right),$
according to the inequality Φ̄(x) ≤ 2 exp(−x²/2) for x > 0 and the union bound of three normal tail probabilities. This immediately implies that for large C4 and large n,
$P\left\{\left|\omega^{ora}_{ij}-\omega_{ij}\right| > C_4\sqrt{\frac{1}{n}}\right\} \le \frac{3\epsilon'}{4},$
which, together with Equation (17), yields that for C2 > C1 + C4,
$P\left\{|\hat\omega_{ij}-\omega_{ij}| > C_2\max\left\{s\,\frac{\log p}{n},\ \sqrt{\frac{1}{n}}\right\}\right\} \le \epsilon'.$
Similarly, Equation (18) implies
$P\left\{\left|\kappa^{ora}_{ij}\right| > C\sqrt{\log p}\right\} \le P\left\{\max_{kl}|Z_{kl}| > \vartheta\sqrt{n}\right\} + \bar\Phi\left(\frac{C\sqrt{\log p}}{2}\right) + P\left\{\frac{D_1}{\sqrt{n}}\left(1 + Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2\right) > \frac{C\sqrt{\log p}}{2}\right\} \le O\left(p^{-C^2/8}\right),$
where the first and last components in the first inequality are negligible due to log p ≤ c0 n with a sufficiently small c0 > 0, which follows from the assumption s ≤ c0 n/ log p. This immediately implies that for C5 large enough,
$P\left\{\left\|\Omega^{ora}_{A,A}-\Omega_{A,A}\right\|_\infty > C_5\sqrt{\frac{\log p}{n}}\right\} = o\left(p^{-\delta}\right),$
which, together with Equation (17), yields that for C3 > C1' + C5,
$P\left\{\left\|\hat\Omega_{A,A}-\Omega_{A,A}\right\|_\infty > C_3\max\left\{s\,\frac{\log p}{n},\ \sqrt{\frac{\log p}{n}}\right\}\right\} \le o\left(p^{-\delta+1}\right).$
Thus we have the following union bound over all $\binom{p}{2}$ pairs (i, j),
$P\left\{\left\|\hat\Omega-\Omega\right\|_\infty > C_3\max\left\{s\,\frac{\log p}{n},\ \sqrt{\frac{\log p}{n}}\right\}\right\} \le \left(p^2/2\right)\cdot o\left(p^{-\delta+1}\right) = o\left(p^{-\delta+3}\right).$
Write
$\sqrt{n}\left(\hat\Omega_{A,A}-\Omega_{A,A}\right) = \sqrt{n}\left(\hat\Omega_{A,A}-\Omega^{ora}_{A,A}\right) + \sqrt{n}\left(\Omega^{ora}_{A,A}-\Omega_{A,A}\right).$
Under the assumption s = o(√n/ log p), we have
$\sqrt{n}\,\left\|\hat\Omega_{A,A}-\Omega^{ora}_{A,A}\right\|_\infty = o_p(1),$
which together with Equation (18) further implies
$\sqrt{n}\,(\hat\omega_{ij}-\omega_{ij}) \overset{D}{\sim} \sqrt{n}\,\left(\omega^{ora}_{ij}-\omega_{ij}\right) \xrightarrow{D} N\left(0,\ \omega_{ii}\omega_{jj}+\omega_{ij}^2\right).$
6.2. Proof of Theorem 5. In this section we show that the upper bound given in Section 2.2 is indeed rate optimal. We will only establish Equation (30); Equation (31) is an immediate consequence of Equation (30) and the lower bound √((log p)/n) for estimation of diagonal covariance matrices in Cai, Zhang and Zhou (2010).
The lower bound is established by Le Cam's method. To introduce Le Cam's method we first need to introduce some notation. Consider a finite parameter set G0 = {Ω0, Ω1, . . . , Ωm∗} ⊂ G0(M, kn,p). Let PΩm denote the joint distribution of the independent observations X^{(1)}, X^{(2)}, . . . , X^{(n)} with each X^{(i)} ∼ N(0, Ω_m^{-1}), 0 ≤ m ≤ m∗, let fm denote the corresponding joint density, and define
(54)  $\bar P = \frac{1}{m_*}\sum_{m=1}^{m_*} P_{\Omega_m}.$
For two distributions P and Q with densities p and q with respect to any common dominating measure µ, we denote the total variation affinity by ‖P ∧ Q‖ = ∫ p ∧ q dµ. The following lemma is a version of Le Cam's method (cf. Le Cam (1973), Yu (1997)).
Lemma 3. Let X^{(i)} be i.i.d. N(0, Ω^{-1}), i = 1, 2, . . . , n, with Ω ∈ G0. Let Ω̂ = (ω̂kl)_{p×p} be an estimator of Ωm = (ω^{(m)}_{kl})_{p×p}; then
$\sup_{0\le m\le m_*} P\left\{\left|\hat\omega_{ij}-\omega^{(m)}_{ij}\right| > \frac{\alpha}{2}\right\} \ge \frac{1}{2}\left\|P_{\Omega_0}\wedge\bar P\right\|,$
where $\alpha = \inf_{1\le m\le m_*}\left|\omega^{(m)}_{ij}-\omega^{(0)}_{ij}\right|$.
Proof of Theorem 5: We shall divide the proof into three steps. Without loss of
generality, consider only the cases (i, j) = (1, 1) and (i, j) = (1, 2). For the general case
ωii or ωij with i ≠ j, we could always permute the coordinates and rearrange them to the
special case ω11 or ω12 .
Step 1: Constructing the parameter set. We first define Ω0,
(55)  $\Sigma_0 = \begin{pmatrix} 1 & b & 0 & \cdots & 0 \\ b & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}, \quad\text{and}\quad \Omega_0 = \Sigma_0^{-1} = \begin{pmatrix} \frac{1}{1-b^2} & \frac{-b}{1-b^2} & 0 & \cdots & 0 \\ \frac{-b}{1-b^2} & \frac{1}{1-b^2} & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix},$
i.e., Σ0 = (σ^{(0)}_{kl})_{p×p} is a matrix with all diagonal entries equal to 1, σ^{(0)}_{12} = σ^{(0)}_{21} = b and the rest all zeros. Here the constant 0 < b < 1 is to be determined later. For Ωm, 1 ≤ m ≤ m∗, the construction is as follows. Without loss of generality we assume kn,p ≥ 2. Denote by H the collection of all p × p symmetric matrices with exactly (kn,p − 1) elements equal to 1 between the third and the last elements on the first row (column) and the rest all zeros. Define
(56)  $\mathcal{G}_0 = \left\{\Omega : \Omega = \Omega_0\ \text{or}\ \Omega = (\Sigma_0 + aH)^{-1},\ \text{for some}\ H\in\mathcal{H}\right\},$
where a = √(τ1 log p/n) for some constant τ1 to be determined later. The cardinality of G0 \ {Ω0} is
$m_* = \mathrm{Card}(\mathcal{G}_0) - 1 = \mathrm{Card}(\mathcal{H}) = \binom{p-2}{k_{n,p}-1}.$
We pick the constant b = (1 − 1/M)/2 and $0 < \tau_1 < \min\left\{\frac{(1-1/M)^2-b^2}{C_0},\ \frac{(1-b^2)^2}{2C_0(1+b^2)},\ \frac{(1-b^2)^2}{4\nu(1+b^2)}\right\}$ such that G0 ⊂ G0(M, kn,p).
First we show that for all Ωm,
(57)  $1/M \le \lambda_{\min}(\Omega_m) < \lambda_{\max}(\Omega_m) \le M.$
For any matrix Ωm, 1 ≤ m ≤ m∗, some elementary calculations yield that
$\lambda_1\left(\Omega_m^{-1}\right) = 1 + \sqrt{b^2+(k_{n,p}-1)a^2}, \quad \lambda_p\left(\Omega_m^{-1}\right) = 1 - \sqrt{b^2+(k_{n,p}-1)a^2}, \quad \lambda_2\left(\Omega_m^{-1}\right) = \lambda_3\left(\Omega_m^{-1}\right) = \cdots = \lambda_{p-1}\left(\Omega_m^{-1}\right) = 1.$
Since b = (1 − 1/M)/2 and 0 < τ1 < ((1 − 1/M)² − b²)/C0, we can show that
(58)  $1 - \sqrt{b^2+(k_{n,p}-1)a^2} \ge 1 - \sqrt{b^2+\tau_1 C_0} > 1/M, \qquad 1 + \sqrt{b^2+(k_{n,p}-1)a^2} < 2 - 1/M < M,$
which imply
$1/M \le \lambda_p^{-1}\left(\Omega_m^{-1}\right) = \lambda_{\min}(\Omega_m) < \lambda_{\max}(\Omega_m) = \lambda_1^{-1}\left(\Omega_m^{-1}\right) \le M.$
As for the matrix Ω0, similarly we have
$\lambda_1\left(\Omega_0^{-1}\right) = 1 + b, \quad \lambda_p\left(\Omega_0^{-1}\right) = 1 - b, \quad \lambda_2\left(\Omega_0^{-1}\right) = \lambda_3\left(\Omega_0^{-1}\right) = \cdots = \lambda_{p-1}\left(\Omega_0^{-1}\right) = 1;$
thus 1/M ≤ λmin(Ω0) < λmax(Ω0) ≤ M for the choice of b = (1 − 1/M)/2.
Now we show that the number of nonzero off-diagonal elements in Ωm, 0 ≤ m ≤ m∗, is no more than kn,p per row/column. From the construction of Ω_m^{-1}, there exists some permutation matrix Pπ such that $P_\pi\Omega_m^{-1}P_\pi^T$ is a two-block diagonal matrix with blocks of dimensions (kn,p + 1) and (p − kn,p − 1), of which the second block is an identity matrix. Then $\left(P_\pi\Omega_m^{-1}P_\pi^T\right)^{-1} = P_\pi\Omega_m P_\pi^T$ has the same blocking structure, with the first block of dimension (kn,p + 1) and the second block being an identity matrix; thus the number of nonzero off-diagonal elements is no more than kn,p per row/column for Ωm. Therefore, we have G0 ⊂ G0(M, kn,p) from Equation (57).
Step 2: Bounding α. From the construction of Ω_m^{-1} and the matrix inverse formula, it can be shown that for any precision matrix Ωm we have
$\omega^{(m)}_{11} = \frac{1}{1-b^2-(k_{n,p}-1)a^2}, \quad\text{and}\quad \omega^{(m)}_{12} = \frac{-b}{1-b^2-(k_{n,p}-1)a^2},$
for 1 ≤ m ≤ m∗, and for the precision matrix Ω0 we have
$\omega^{(0)}_{11} = \frac{1}{1-b^2}, \qquad \omega^{(0)}_{12} = \frac{-b}{1-b^2}.$
Since b² + (kn,p − 1)a² < (1 − 1/M)² < 1 in Equation (58), we have
(59)  $\inf_{1\le m\le m_*}\left|\omega^{(0)}_{11}-\omega^{(m)}_{11}\right| = \frac{(k_{n,p}-1)a^2}{(1-b^2)\left(1-b^2-(k_{n,p}-1)a^2\right)} \ge C_3 k_{n,p}a^2, \qquad \inf_{1\le m\le m_*}\left|\omega^{(0)}_{12}-\omega^{(m)}_{12}\right| = \frac{b(k_{n,p}-1)a^2}{(1-b^2)\left(1-b^2-(k_{n,p}-1)a^2\right)} \ge C_4 k_{n,p}a^2,$
for some constants C3, C4 > 0.
Step 3: Bounding the affinity. The following lemma will be proved in Section 8.
Lemma 4. Let P̄ be defined in (54). We have
(60)  $\left\|P_{\Omega_0}\wedge\bar P\right\| \ge C_5$
for some constant C5 > 0.
Lemma 3, together with Equations (59), (60) and a = √(τ1 log p/n), implies
$\sup_{0\le m\le m_*} P\left\{\left|\hat\omega_{11}-\omega^{(m)}_{11}\right| > \frac{1}{2}\cdot\frac{C_3\tau_1 k_{n,p}\log p}{n}\right\} \ge C_5/2, \qquad \sup_{0\le m\le m_*} P\left\{\left|\hat\omega_{12}-\omega^{(m)}_{12}\right| > \frac{1}{2}\cdot\frac{C_4\tau_1 k_{n,p}\log p}{n}\right\} \ge C_5/2,$
which match the lower bound in (30) by setting C1 = min{C3τ1/2, C4τ1/2} and c1 = C5/2.
Remark 10. Note that |||Ωm|||1 is of order kn,p√((log p)/n), which implies
$k_{n,p}\,\frac{\log p}{n} = k_{n,p}\sqrt{\frac{\log p}{n}}\cdot\sqrt{\frac{\log p}{n}} \asymp |||\Omega_m|||_1\sqrt{\frac{\log p}{n}}.$
This observation partially explains why in the literature people need to assume a bounded matrix l1 norm of Ω to derive the lower bound rate √((log p)/n). For the least favorable parameter space, the matrix l1 norm of Ω cannot be avoided in the upper bound. But the methodology proposed in this paper improves the upper bounds in the literature by replacing the matrix l1 norm for every Ω by only the matrix l1 norm bound of Ω in the least favorable parameter space.
6.3. Proof of Theorem 1. The probabilistic results (i) and (ii) are the immediate consequences of Theorems 2 and 5. We only need to show the minimax rate of convergence
result (3). According to the probabilistic lower bound result (30) in Theorem 5, we immediately obtain that
(
kn,p log p
, C2
inf sup E |ω̂ij − ωij | ≥ c1 max C1
ω̂ij G0 (M,kn,p )
n
r )
1
n
.
Thus it is enough to show there exists some estimator of ωij such that it attains this upper
bound. More precisely, we define the following estimator based on ω̂ij defined in (12) to
control the improbable case for which Θ̂A,A is nearly singular.
ω̆ij = sgn(ω̂ij ) · min {|ω̂ij | , log p} .
n
(16) and (18) in Theorem 2 imply
o
kn,p log p ora that the Equations
n , ωij ≤ 2M . Note
P {Gc } ≤ C p−δ+1 + exp (−cn) for some constants C
ora ≤ C
Define the event G = ω̂ij − ωij
1
and c. Now according to the variance of inverse Wishart distribution, we pick δ ≥ 2ξ + 1
to complete our proof
ora
ora
ora
E |ω̆ij − ωij | ≤ E ω̆ij − ωij
− ωij 1 {G} + E ω̆ij − ωij
1 {Gc } + E ωij
kn,p log p
ora 2
+ P {Gc } E log p + ωij
≤ C1
n
δ+1
1
kn,p log p
≤ C1
+ C2 p− 2 log p + C3 √
n
n
(
0
≤ C max
kn,p log p
,
n
1/2
+ E
ora
ωij
− ωij
2 1/2
r )
1
n
,
where C2 , C3 and C 0 are some constants and the last equation follows from the assumption
n = O pξ .
7. Proof of Theorems 6-9.
7.1. Proof of Theorem 6. When δ > 3, from Theorem 2 it can be shown that the
following three results hold:
31
(i). For any constant ε > 0, we have
)
ω̂ii ω̂jj + ω̂ 2
ij
P sup 2 − 1 > ε → 0;
(i,j) ωii ωjj + ωij
(
(61)
(ii). There is a constant C1 > 0 such that
(
(62)
log p
ora
P sup ωij
− ω̂ij > C1 s
n
(i,j)
)
→ 0;
(iii). For any constant 2 < ξ1 , we have
(63)

s
ora
ωij − ωij 2ξ1 log p 
→ 0.
P sup q
>

 (i,j) ω ω + ω 2
n


ii jj
ij
In fact, under the assumption δ ≥ 3, Equation (17) in Theorem 2 and the union bound
over all pair (i, j) imply the second result (62), which further shows the first result (61)
2 is bounded below and
because that ω̂ij and ω̂ii are consistent estimators and ωii ωjj + ωij
above. For the third result, we apply Equation (18) from Theorem 2 and pick 2 < ξ2 < ξ1
√
√
and a = ξ1 − ξ2 to show that
P
n o
ora p
κij > 2ξ1 log p
p
√ ≤ P max {|Zkl |} > ϑ n + Φ̄
2ξ2 log p
p
D1 2
2
+P √ 1 + Zii2 + Zij
+ Zjj
> a 2 log p
n
s
≤ O(p−ξ2
1
),
log p
where the last inequality follows from log p = o(n). The third result (63) is thus obtained
by the union bound with 2 < ξ2 .
Essentially Equation (36) and Equation (37) are equivalent to each other. Thus we
only show that Equation (37) in Theorem 6 is just a simple consequence of results (i),
(ii) and (iii). Set ε > 0 sufficiently small and ξ ∈ (2, ξ0 ) sufficiently close to 2 such that
32
Z. REN ET AL.
p
√
√
2 2ξ0 − 2ξ0 (1 + ε) > 2ξ and ξ0 (1 − ε) > ξ, and 2 < ξ1 < ξ. We have
P S(Ω̂thr ) = S(Ω)
= P
thr
thr
ω̂ij
6= 0 for all (i, j) ∈ S(Ω) + P ω̂ij
= 0 for all (i, j) ∈
/ S(Ω)








v
u
u 2ξ ω̂ ω̂ + ω̂ 2 log p
ii jj
t 0
ij
for all (i, j) ∈ S(Ω)
|ω̂ij | >


n






v
u




u
2


t 2ξ0 ω̂ii ω̂jj + ω̂ij log p
for all (i, j) ∈
/ S(Ω)
+P |ω̂ij | ≤


n






s
)
(
ω̂ii ω̂jj + ω̂ 2

2ξ log p 
|ω̂ij − ωij |
ij
≤
− P sup ≥ P sup q
2 − 1 > ε ,
 (i,j)
n 
(i,j) ωii ωjj + ωij
ω ω + ω2
= P
ii jj
ij
which is bounded below by

o
  n
s
ora
log p
ora

P
sup
ω
−
ω̂
>
C
s
+
ωij − ωij ij
1
2ξ1 log p


n (i,j) ij
2
P sup q
−
≤
ω̂ii ω̂jj +ω̂ij
 = 1+o (1) ,
 (i,j) ω ω + ω 2

n
P
sup
−
1
>
ε
(i,j) ω ω +ω 2
ii jj
ij


ii jj
where s = o
p
n/ log p implies s logn p = o
p
ij
(log p) /n .
7.2. Proof of Theorem 8. The proof of this Theorem is very similar to that of Theorem
2. Due to the limit of space, we follow the line of the proof of Theorem 2 and Theorem 3,
but only give necessary details when the proof is different from that of Theorem 2. Note
that β for the latent variable graphical model is not sparse enough in the sense that
√ !
n
|sij − lij |
.
6= o
max Σi6=j min 1,
j
λ
log p
Our strategy is to decompose it into two parts,
−1
S
L
β =SO\A,A Ω−1
A,A − LO\A,A ΩA,A := β −β ,
(64)
−1
L = L
where β S = SO\A,A Ω−1
O\A,A ΩA,A correspond the sparse and low-rank
A,A and β
components respectively. We expect the penalized estimator β̂ in (10) is closer to the
sparse part β S than β itself, which motivates us to rewrite the regression model as follows,
XA = XO\A β S + A − XO\A β L := XO\A β S + SA ,
(65)
with SA = A − XO\A β L , and define
(66)
S
Θora,S
A,A = A
T SA /n.
33
We can show our estimator Θ̂A,A is within a small ball of the ”oracle” Θora,S
A,A with radius
at the rate of kn,p λ2 , where kn,p is the sparsity of model (65). More specifically, similar to
Lemma 2 for the proof of Theorem 2 we have the following result for the latent variable
graphical model. The proof is provided in the supplementary material.
Lemma 5.
Let λ = (1 + ε)
q
2δ log p
n
for any δ ≥ 1 and ε > 0 in Equation (10). Define
the event E as follows,
ora,S θ̂mm − θmm
S
βm − β̂m 1
2
S
XO\A βm − β̂m /n
T
XO\A S
m /n
(67)
∞
≤ C10 kn,p λ2 ,
≤ C20 kn,p λ,
≤ C30 kn,p λ2 ,
≤ C40 λ.
for m = i and j and some constants Ck0 , 1 ≤ k ≤ 4. Under the assumptions in Theorem
8, we have
P {E c } ≤ o p−δ+1 .
Similar to the argument in Section 6.1.1, from Lemma 5 we have
T ora,S S
T
≤ Ckn,p λ2
/n
−
ˆ
ˆ
/n
θ̂ij − θij
= S
i
j
i j
on the event E, thus there is a constant C1 > 0 such that
P Θ̂A,A − Θora,S
A,A ∞
log p
> C1 kn,p
n
≤ o p−δ+1 .
Later we will show that there are some constant C2 and C3 such that
(68)
ora,S P Θora
A,A − ΘA,A ∞
log p
> C2 kn,p
n
≤ C3 p−2δ ,
which implies
P Θ̂A,A − Θora
A,A ∞
> C4 kn,p
log p
n
≤ o p−δ+1 ,
for some constant C4 > 0. Then following the proof of Theorem 3 exactly, we establish
Theorem 8.
Now we conclude the proof by establishing Equation (68). In fact, we will only show
that
log p
ora
ora,S P θi,i − θi,i > C2 kn,p
≤ C5 p−2δ ,
n
34
Z. REN ET AL.
ora − θ ora,S and θ ora − θ ora,S can be shown
for some C5 > 0. The tail bounds for θj,j
i,j
j,j
i,j
similarly. Write
(69)
ora,S
ora
θi,i
−θi,i
=
Ti i − Si
T Si
n
=
T Ti i − i − XO\A βiL
i − XO\A βiL /n = D1 +D2 .
n
where
D1 =
n
2X
2 T
(k)
i XO\A βiL =
i,k · XO\A βiL
n
n k=1
D2 =
1 L T T
(k)
βi
XO\A XO\A βiL ∼ var XO\A βiL · χ2(n) /n.
n
It is then enough to show that there are constants C2 and C5 such that
P |Di | >
log p
C2
kn,p
2
n
C5 −2δ
p
2
≤
for each Di , i = 1, 2. We first study D1 , which is an average of n i.i.d. random variables
(k)
i,k · XO\A βiL with
(k)
L
L
i ∼ N 0, Ω−1
A,A , and XO\A βi ∼ N 0, βi
L
where Ω−1
A,A has bounded spectrum, and βi
T
T
ΣO\A,O\A βiL ,
ΣO\A,O\A βiL ≤
assumption (43) that elements of β L are at an order of
C5
p .
C6
p
for some C6 for the
From classical large deviations
bounds, there exist some uniform constants c1 , c2 > 0 such that
 P

√ (k) L  n

·
pX
β
i,k
k=1
O\A i P ≤ 2 exp −nt2 /c2 for 0 < t < c1 .
>
t


n
See, for example, Theorem 2.8 of Petrov (1995). By setting t =
s
(
P |D1 | > 2
(70)
where 2
q
2δc2 log p
np
2δc2 log p
np
q
2δc2 log p
n
)
≤ 2p−2δ ,
= o kn,p logn p from Equation (45). The derivation for the tail bound of
D2 is similar by a large deviation bound for χ2(n) ,
( 2
χ
P
(n)
n
)
>1+t
≤ 2 exp −nt2 /c2
for 0 < t < c1 ,
which implies
(71)
= o (1), we have



(k)
P |D2 | > var XO\A βiL 1 +

s

2δc2 log p 
≤ 2p−2δ ,

n
35
by setting t =
o kn,p logn p
q
2δc2 log p
n
(k)
= o (1), where var XO\A βiL
1+
q
2δc2 log p
n
= O (1/p) =
from Equation (45). Equations (70) and (71) imply the desired tail bound
(68).
8. Proof of Auxiliary Lemmas. In this section we prove two key lemmas, Lemmas
2 and 4, for establishing our main results.
8.1. Proof of Lemma 2. We first reparameterize Equations (7) and (10) by setting
(72)
kXk k
dk := √ bk , and Y := XAc · diag
n
√
n
kXk k
!
k∈Ac
to make the analysis cleaner, and then rewrite the regression (7) and the penalized procedure (10) as follows,
Xm = Ydtrue + m .
(73)
and
(74)
kXm − Ydk2 σ
+ + λ kdk1 ,
2nσ
2
n
o
ˆ
d, σ̂
: = arg
min
Lλ (d, σ) ,
Lλ (d, σ) : =
(75)
d∈Rp−2 ,σ∈R
where the true coefficients of the reparameterized regression (74) are
(76)
dtrue = dtrue
k
k∈Ac
kXk k
, where dtrue
= √ βm,k .
k
n
Then the oracle estimator of σ can be written as
(77)
σ
ora
:=
ora 1/2
(θmm
)
=
Xm − Ydtrue √
n
n
o
km k
= √ ,
n
n
o
ˆ σ̂ and β̂m , θ̂mm ,
and we have the following relationship between d,
√
(78)
β̂m,k = dˆk
n
, and θ̂mm = σ̂ 2 .
kXk k
n
ˆ σ̂
The proof has two parts. The first part is the algebraic analysis of the solution d,
o
to the regression (74) by which we can define an event E such that Equations (48)-(51)
in Lemma 2 hold. The second part of the proof is the probabilistic analysis of the E.
36
Z. REN ET AL.
8.1.1. Algebraic Analysis. The function Lλ (d, σ) is jointly convex in (d, σ). For fixed
σ, denote the minimizer of Lλ (d, σ) over all d ∈ Rp−2 by dˆ(σλ), a function of σλ, i.e.,
(
dˆ(σλ) = arg min Lλ (d, σ) = arg min
(79)
d∈Rp−2
d∈Rp−2
kXm − Ydk2
+ λσ kdk1 .
2n
)
then if we knew σ̂ in the solution of Equation (75), the solution for the equation is
o
dˆ(σ̂λ) , σ̂ . Let µ = λσ. From the Karush-Kuhn-Tucker condition, dˆ(µ) is the solution
n
of Equation (79) if and only if
YkT Xm − Ydˆ(µ) /n = µ · sgn dˆk (µ) , if dˆk (µ) 6= 0,
(80)
YkT Xm − Ydˆ(µ) /n ∈ [−µ, µ] , if dˆk (µ) = 0.
To define the event E we need to introduce some notation. Define the l1 cone invertibility
factor (CIF1 ) as follows,
(81)
CIF1 (α, K, Y) = inf
T

 |K| Y n Y u


∞
: u ∈ C (α, K) , u 6= 0 ,
kuK k1


where |K| is the cardinality of an index set K, and
n
o
C (α, K) = u ∈ Rp−2 : kuK c k1 ≤ α kuK k1 .
Let
=
n
k ∈ Ac , dtrue
≥λ
k
o
(82)
T
(83)
ν = YT Xm − Ydtrue /n
∞
= YT m /n ,
∞
and
4 (1 + ξ) 8λ |T |
true
τ = λ max
d
,
T c 1 CIF1 (2ξ + 1, T, Y)
σ ora
(84)
for some ξ > 1 be specified later. Define
E = ∩4i=1 Ii
(85)
where
ξ−1
=
ν≤σ λ
(1 − τ ) ,
ξ+1
(
)
√
Ccif
3
=
CIF1 (2ξ + 1, T, Y) ≥
>
0
with
C
=
1/
10
2M
cif
(ξ + 1)2
q
√
=
σ ora ∈
1/ (2M ), 2M
(86)
I1
(87)
I2
(88)
I3
(89)
I4 =
ora
q
√
kXk k
c
√ ∈
1/ (2M ), 2M for all k ∈ A .
n
37
Now we show Equations (48)-(51) in Lemma 2 hold on the event E defined in Equation
(85). The following two results are helpful to establish our result. Their proofs are given
in the supplementary material.
Proposition 1.
n
ˆ
d (µ) − dtrue (90)
1
T
,
c
1
(ν + µ) |T |
,
CIF1 (2ξ + 1, T, Y)
≤ (ν + µ) dˆ(µ) − dtrue .
n
Let
≤ max (2 + 2ξ) dtrue
2
1
(91) Y dtrue − dˆ(µ) Proposition 2.
o
ξ−1
For any ξ > 1, on the event ν ≤ µ ξ+1
, we have
1
n
ˆ σ̂
d,
o
be the solution of the scaled lasso (75). For any ξ > 1,
n
o
on the event I1 = ν ≤ σ ora λ ξ−1
ξ+1 (1 − τ ) , we have
σ̂
σ ora − 1 ≤ τ .
(92)
n
|ωij |
λ
o
is defined in terms of Ω which has bounded
max dtrue
,
λ
|T
|
≤ Cλs
c
Note that s = maxj Σi6=j min 1,
spectrum, then on the event I4 ,
T
1
from the definition of T in Equation (82). On their intersection I1 ∩ I2 ∩ I3 , Proposition 2
and Proposition 1 with µ = λσ̂ imply that there exist some constants c1 , c2 and c3 such
that
,
λ
|T
|
,
Tc 1
≤ c2 max dtrue c , λ |T | ,
|σ̂ − σ ora | ≤ c1 λ max dtrue
ˆ
d (µ) − dtrue 1
2
Y dtrue − dˆ n
T
≤ c3 λ max dtrue
1
T
,
λ
|T
|
.
c
1
then from the definition of β and Equations (6) and (78),
√
n
kXk k
2
ˆ
β̂m,k = dk
, θ̂mm = σ̂ , and XAc = Y · diag √
,
kXk k
n k∈Ac
we immediately have
ora θ̂mm − θ̂mm
βm − β̂m 1
2
XAc βm − β̂m /n
= |σ̂ − σ ora | · |σ̂ + σ ora | ≤ C10 λ2 s,
≤ C20 λs,
≤ C30 λ2 s,
for some constants Ci0 , which are exactly Equations (48)-(50). From the definition of events
I1 , I3 and I4 we obtain Equation (51), i.e., XTAc m /n
∞
≤ C40 λ.
38
Z. REN ET AL.
8.1.2. Probabilistic Analysis. We will show that
P {I1c } ≤ O p−δ+1 / log p ,
p
P {Iic } ≤ o(p−δ ) for i = 2, 3 and 4,
which implies
P {E} ≥ 1 − o p−δ+1 .
We will first consider P {I3c } and P {I4c }, then P {I2c }, and leave P {I1c } to the last, which
relies on the bounds for P {Iic }, 2 ≤ i ≤ 4.
(1). To study P {I3c } and P {I4c }, we need the following tail bound for the chi-squared
distribution with n degrees of freedom,
)
( 2
χ
(n)
− 1 ≥ t ≤ 2 exp (−nt (t ∧ 1) /8) ,
P n
(93)
√
for t > 0. Since σ ora = km k / n with m ∼ N (0, θmm In ), and Xk ∼ N (0, σkk In ) with
σkk ∈ (1/M, M ), we have
n (σ ora )2 /θmm ∼ χ2(n) , and kXk k2 /σkk ∼ χ2(n) ,
then Equation (93) implies
P {I3c }
n
= P (σ
(
)
(σ ora )2
1
) ∈
/ [1/ (2M ) , 2M ] ≤ P − 1 ≥
θmm
2
o
ora 2
≤ 2 exp (−n/32) ≤ o(p−δ ),
(94)
and
(95)
(
P {I4c }
=P
kXk k2
∈
/ [1/ (2M ) , 2M ] for some k ∈ Ac
n
)
≤ 2p exp (−n/32) ≤ o(p−δ ).
(2). To study the term CIF1 (2ξ + 1, T, Y) of the event I2 , we need to introduce some
notation first. Define
n
T
πa± (Y) = max ± YA
YA u/n − 1
G
o
T
, and θa,b (Y) = max v T YA
YB u/n.
G
where
G = {(A, B, u, v) : (|A| , |B| , kuk , kvk) = (a, b, 1, 1) with A ∩ B = ∅} .
then 1+πa+ (Y) and 1−πa− (Y) are the maximal and minimal eigenvalues of all submatrices
of YT Y/n with dimensions no greater than a respectively, and θa,b (Y) satisfies that
(96)
1/2 θa,b (Y) ≤ 1 + πa+ (Y)
1/2
1 + πb+ (Y)
+
≤ 1 + πa∨b
(Y) .
39
The following two propositions will be used to establish the probability bound for P {I2c }.
Proposition 3 follows from Zhang and Huang (2008) Proposition 2(i). The proof of Proposition 4 is given in the supplementary material.
Let rows of the data matrix X be i.i.d. copies of N (0, Σ), and denote
Proposition 3.
the minimal and maximal eigenvalues of Σ by λmin (Σ) and λmax (Σ) respectively. Assume
that m ≤ c logn p with a sufficiently small constant c > 0, then for ε > 0 we have
(97)
n
o
−
+
P (1 − h)2 λmin (Σ) ≤ 1 − πm
(X) ≤ 1 + πm
(X) ≤ (1 + h)2 λmax (Σ) ≥ 1 − ε,
where h =
q
m
n
+
q
.
2 m log p−log(ε/2)
n
Proposition 4.
For CIF1 (α, K, Y) defined in (81) with |K| = k, we have, for any
0 < l ≤ p − k,
(98)

1
CIF1 (α, K, Y) ≥
(1 + α) (1 + α) ∧
q
1+
l
k
s
−
1 − πl+k
(Y) − α

k
θ4l,k+l (Y) .
4l
Note that there exists some constant C such that Cs is an upper bound of the |T | from
the definitions of s and set T (82) on I4 . For l ≥ Cs Proposition 4 gives

CIF1 (2ξ + 1, T, Y) ≥
1
1 − π −
l+|T | (Y) − (2ξ + 1)
4 (1 + ξ)2

(99)
≥
1
1 − π − (Y) − (2ξ + 1)
4l
4 (1 + ξ)2
s
s

|T |
θ
(Y)
4l 4l,|T |+l

Cs +
1 + π4l
(Y)  ,
4l
where the second inequality follows from (96). From the definition Y = XAc ·diag
and the property that πa± (Y) are increasing as functions of a, we have
√
n
kXk k k∈Ac
)
( √ )
n n +
+
1+
≤ maxc
1 + π4l (XAc ) ≤ maxc
1 + π4l
(X) ,
k∈A
k∈A
kXk k
kXk k
( √ )
( √ )
n n −
−
−
1 − π4l
(X) ≤ minc
1 − π4l
(XAc ) ≤ 1 − π4l
(Y) ,
minc
k∈A
k∈A
kXk k
kXk k
( √
+
π4l
(Y)
and by applying Proposition 3 to the data matrix Y with m = 4l = 4 (2ξ + 1) M 3
2
Cs >
40
Z. REN ET AL.
Cs and ε = p−2δ , we have on I4
s
Cs +
1 + π4l
(Y)
4l
s
( √ )
( √ )
n
Cs
n
2
2
≥ (1 − h) λmin (Σ) minc
(1 + h) λmax (Σ) maxc
− (2ξ + 1)
k∈A
k∈A
kXk k
4l
kXk k
−
1 − π4l
(Y) − (2ξ + 1)
s
√
1
Cs
≥ (1 − h) √
− (2ξ + 1)
(1 + h)2 2M M
4l
2M M
1
1
4
≥ √
(1 − h)2 − √
(1 + h)2 ≥ √
3
3
2M
2 2M
10 2M 3
2
with probability at least 1 − ε, where h =
P {I4c }
≤
o(p−δ ),
thus we established that
q
P {I2c }
r
m log p−log(p−2δ /2)
n
m
n
+
≤
o(p−δ ).
2
= o(1). Note
(3). Finally we study the probability of event I1 . The following tail probability of t
distribution is helpful in the analysis.
Proposition 5.
Let Tn follows a t distribution with n degrees of freedom. Then there
exists εn → 0 as n → ∞ such that ∀t > 0
n
2 /(n−1)
P Tn2 > n e2t
o
−1
2
≤ (1 + εn ) e−t / π 1/2 t .
Please refer to Sun and Zhang (2012b) Lemma 1 for the proof. According to the definition of ν in Equation (83) we have
ν
σ ora
= maxc |hk | , with hk =
k∈A
Note that each column of Y has norm kYk k =
√
YkT m
for k ∈ Ac .
nσ ora
n by the normalization step (72). Given
XAc , equivalently Y, we have
√
n − 1hk
q
1 − h2k
r
=
n−1
q
n
=
q
√
YkT m / nθmm /
Tm m /nθmm
q
Tm m − Tm Yk YkT m /n /nθmm /
Tm m /nθmm
√
YkT m / nθmm
! ∼ t(n−1) ,
r
c 2
PYk m / (n − 1) θmm
where t(n−1) is t distribution with n − 1 degrees of freedom, since the numerator follows
a standard normal and the denominator follows an independent
q
χ2(n−1) / (n − 1). From
41
Proposition 5 we have


s
P |hk | >

(
= P
(
≤ P

2t2 
n 
(n − 1) h2k
2 (n − 1) t2 /n
>
2
1 − 2t2 /n
1 − hk
)
(
≤P
2 (n − 1) t2 / (n − 2)
(n − 1) h2k
>
2
1 − t2 / (n − 2)
1 − hk
2
(n − 1) h2k
2t /(n−2)
>
(n
−
1)
e
−
1
1 − h2k
)
2
)
≤ (1 + εn−1 ) e−t / π 1/2 t .
where the first inequality holds when t2 ≥ 2, and the second inequality which follows
the
√
q
log
p
2δ (1 + ε)
fact ex −1 ≤ x/(1− x2 ) for 0 < x < 2. Now let t2 = δ log p > 2, and λ =
n
ξ−1
with ξ = 3/ε+1, then we have λ ξ+1
(1 − τ ) ≥
q
2δ log p
n
τ defined in Equation (84) satisfies τ = O sλ
2
which is sufficiently small on I2 ∩ I3 ∩ I4 .
for sufficiently small τ. Clearly, the
Therefore we have
n
P ∩4i=1 Ii
o


s
 ν
2δ log p 
≤
− P {(I2 ∩ I3 ∩ I4 )c }
P

 σ ora
n


s

2δ log p 
1 − p · P |hk | >
− P {(I2 ∩ I3 ∩ I4 )c } ≥ 1 − O


n
≥
≥
(100)
!
p−δ+1
√
,
log p
√
which implies immediately P {I1c } ≤ O p−δ+1 / log p .
8.2. Proof of Lemma 4. Now we establish the lower bound (60) for the total variation
affinity. Since the affinity
R
q0 ∧ q1 dµ = 1 −
1
2
R
|q0 − q1 | dµ for any two densities q0 and
q1 , Jensen’s Inequality implies
2
Z
|q0 − q1 | dµ
Hence
R
=
q0 ∧ q1 dµ ≥ 1 − 21
∆=
2
Z Z
Z 2
q0 − q1 (q0 − q1 )2
q1
q0 dµ
≤
dµ
=
dµ − 1.
q
q
q0
0
0
R
q12
q0 dµ
1/2
−1
Z ( 1 Pm∗ f )2
m=1 m
m∗
f0
. To establish (60), it thus suffices to show that
Z 1 X
fm fl
−1= 2
− 1 → 0.
m∗ m,l
f0
The following lemma is used to calculate the term
Lemma 6.
R
(fm fl /f0 − 1) in ∆.
Let gs be the density function of N (0, Σs ), s = 0, m or l. Then
Z
i−1/2
gm gl h
−1
= det I − Σ−1
(Σ
−
Σ
)
Σ
(Σ
−
Σ
)
.
m
0
0
l
0
0
g0
42
Z. REN ET AL.
Let Σm = Ω−1
m for 0 ≤ m ≤ m∗ . Lemma 6 implies
Z
fm fl
=
f0
Z
gm gl
g0
n
= [det (I − Ω0 (Σm − Σ0 ) Ω0 (Σl − Σ0 ))]−n/2 .
Let J(m, l) be the number of overlapping nonzero off-diagonal elements between Σm and
Σl in the first row. Recall the simple structures of Ω0 (55) and Σm −Σ0 by our construction.
Elementary calculations yield that
1 + b2
Ja2 )2 ,
(1 − b2 )2
det (I − Ω0 (Σm − Σ0 ) Ω0 (Σl − Σ0 )) = (1 −
which is 1 when J = 0. Now we set d :=
1+b2
(1−b2 )2
> 1 to simplify our notation. It is easy to
p−2 kn,p −1 p−1−kn,p kn,p −1
j
kn,p −1−j .
see that the total number of pairs (Σm , Σl ) such that J(m, l) = j is
Hence,
∆ =
=
(101)
≤
1
m2∗
1
m2∗
1
m2∗
X
Z X
0≤j≤kn,p −1 J(m,l)=j
X
X
fm fl
−1
f0
(1 − dja2 )−n − 1
0≤j≤kn,p −1 J(m,l)=j
X
1≤j≤kn,p −1
p−2
kn,p − 1
!
kn,p − 1
j
!
!
p − 1 − kn,p
(1 − dja2 )−n .
kn,p − 1 − j
Note that
(1 − dja2 )−n ≤ (1 + 2dja2 )n ≤ exp n2dja2 = p2dτ1 j
where the first inequality follows from the fact that dja2 ≤ dkn,p a2 ≤
1+b2
τ C
(1−b2 )2 1 0
< 1/2.
Hence,
∆ ≤
kn,p −1 p−1−kn,p j
kn,p −1−j 2dτ1 j
p
p−2 kn,p −1
1≤j≤kn,p −1
X
2
(kn,p −1)!
(kn,p −1−j)!
p2dτ1 j
(p−2)!(p−2kn,p +j)!
[(p−1−kn,p )!]2
!j
2 p2dτ1
kn,p
X
=
1≤j≤kn,p
1
j!
−1
X
≤
1≤j≤kn,p −1
p − kn,p − 1
where the last inequality follows from the facts that
with each term less than kn,p and
(p−2)!(p−2kn,p +j)!
[(p−1−kn,p )!]2
,
(kn,p −1)!
(kn,p −1−j)!
is a product of j terms
is bounded below by a product of j
v .
terms with each term greater than (p − kn,p − 1) . Recall the assumption (27) p ≥ kn,p
So for large enough p, we have p − kn,p − 1 ≥ p/2 and
2
kn,p
p2dτ1
p − kn,p − 1
≤ 2p2/ν ·
p2dτ1
p
≤ 2p−(v−2)/(2v) ,
43
where the last step follows from the fact that τ1 ≤ (ν − 2) / (4νd). Thus
p−j(v−2)/(2v) → 0,
X
∆≤2
1≤j≤kn,p −1
which immediately implies (60).
9. Supplementary Material. In this supplement we collect proofs for proving auxiliary lemmas.
9.1. Proof of Lemma 5. The proof of this Lemma is similar to that of Lemma 2 in
Section 8.1. But for the latent variables case in both algebraic analysis and probabilistic
ora , σ ora , and ν by β S , θ ora,S , σ ora,S , S and ν S
analysis we need to replace βi , θij
i
i
i
ij
respectively, and subsequently define
(102)
I1 =
(103)
I3 =
where σ ora,S =
ora,S
θii
1/2
=
ξ−1
(1 − τ ) ,
ξ+1
q
√
∈
1/ (2M ), 2M ,
ν S ≤ σ ora,S λ
σ ora,S
k√Si k
n
and ν S = YT Si /n , while the definitions of I2
∞
and I4 are the same as before in Equations (87) and (89). As in Section 8.1 we define
E = ∩4i=1 Ii and need to show that P {E} ≥ 1 − (1 + o(1))p−δ+1 . We will need only to
show that
P {I1c } ≤ o p−δ+1 , and P {I3c } ≤ o(p−δ ).
The arguments for P {I2c } and P {I4c } are identical to those of the non-latent variables case
in Section 8.1.2.
It is relatively easy to establish the probabilistic bound for I3 . It is a consequence of
the following two bounds,
(104)
n
P (σ
)
( 2
)
(
χ
(σ ora )2
1
1
(n)
− 1 ≥
=P − 1 ≥
= o(p−δ ),
) ∈
/ [3/ (4M ) , 5M/4] ≤ P θii
n
4
4
o
ora 2
which follows from Equation (93), and
(105)
P (σ ora )2 − σ ora,S
2 > 1/ (4M ) = o(p−δ )
which follows two bounds for D1 (70) and D2 (71).
44
Z. REN ET AL.
Now we establish the probabilistic bound for I1 , i.e., P {I1c } ≤ o p−δ+1 . Write the
event I1c as follows,
νS − ν + ν
ξ−1
> σ ora,S λ
(1 − τ ) − σ ora
ξ+1

= σ

ora,S
ξ−1
λ
(1 − τ ) −
ξ+1
s
s

2δ log p 
+ σ ora
n
s
2δ log p
n

2δ log p  ora,S
+ σ
− σ ora
n
s

2δ log p 
+ σ ora
n
s
2δ log p
.
n
Set ξ = 6/ε + 5 such that
ξ−1
λ
(1 − τ ) > (1 + ε/2)
ξ+1
for λ = (1 + ε)
q
2δ log p
n
s
2δ log p
n
on I2 ∩ I3 . Then probabilistic bound for I1 is a consequence the
following two bounds,


s
P ν > σ ora
(106)


p
2δ log p 
≤ O p−δ+1 / log p ,
n 
which follows from Equation (100) or Proposition 5, and
(107)


ε
P ν S − ν > σ ora,S

2
s
2δ log p ora,S
+ σ
− σ ora
n
s

2δ log p 
= o p−δ ,
n 
which is established as follows. From Equations (103), D1 (70) and D2 (71), we have
ε
ε ora,S ora,S σ
− σ
− (σ ora ) ≥
2
2
r
1
+
p
1
−O
2M
s
log p
np
!
ε
≥
4
r
1
2M
with probability 1 − o p−δ . It is then enough to show


ε
P νS − ν >

4
(108)
r
1
2M
s

2δ log p 
= o p−δ
n 
to establish Equation (107). On the event I4 we have
S
ν − ν ≤
(109)
=
Y T S Y T YT X c β L k i k i k A i max −
≤ max n k∈Ac k∈Ac n n
√
n
L
T
T
1 X
(g)
(g)
n Xk XAc βi √
L
Xk XAc
βi ,
maxc T ≤ 2M maxc k∈A
k∈A n
n
Xk
g=1
45
where X (g) denotes the gth sample. Note that for each k ∈ Ac the term
1
n
Pn
(g)
g=1 Xk
(g) T
XAc
βiL
in (109) is an average of n i.i.d. random variables with
T
L
EX (g) X (g)
βi Ac
k
CM
≤ ΣXX βiL ≤ kΣXX k2 βiL ≤ √ ,
2
2
p
T
C
(g)
(g)
V ar Xk XAc
βiL ≤ .
p
From the classical large deviations bound in Theorem 2.8 of Petrov (1995), there exist
some uniform constant c1 , c2 > 0 such that


n 
 √p X
T
T
(g)
(g)
(g)
(g)
βiL > t ≤ 2 exp −nt2 /c2 for 0 < t < c1 .
βiL − E Xk XAc
P Xk XAc

 n
g=1
then by setting t =
q
2δc2 log p
,
n
with probability 1 − o p−δ , we have


s

s
√
CM
1
2δc
log
p
2δ
log
p
2
S
 = o
.
ν − ν ≤ 2M  √ + √
p
p
n
n
where the last inequality follows from Equation (45). Therefore we have shown Equation
(108), i.e., with probability 1 − o p−δ ,
s
S
ν − ν = o

2δ log p  ε
<
n
4
r
1
2M
s
2δ log p
,
n
which together with Equation (106), imply the probabilistic bound for I1 .
9.2. Proof of Proposition 1. From the KKT conditions (80) we have
Y T X − Y dˆ(µ)
T X − Ydtrue i
Y
i
≤ µ + ν,
=
−
n
n
∞
∞
YT Y true ˆ
(110) d (µ) − d
n
where ν is defined in Equation (83). We also have
=
(111)≤
≤
2
1
Y dtrue − dˆ(µ) n
T dtrue − dˆ(µ)
YT Xi − Ydˆ(µ) − YT Xi − Ydtrue
n
true ˆ
µ d
− d (µ) + ν dtrue − dˆ(µ)
1
1
1
(µ + ν) dtrue − dˆ(µ) + 2µ dtrue c − (µ − ν) dtrue − dˆ(µ)
T 1
T
1
T
,
c
1
46
Z. REN ET AL.
where the first inequality follows from the KKT conditions (80). Then on the event
n
ν ≤ µ ξ−1
ξ+1
o
we have
(112)
 true
ˆ(µ) d
−
d
ξ
1
2

T 1
+ dtrue
Y dtrue − dˆ(µ) ≤ 2µ 
n
ξ+1
If we could show that dˆ(µ)−dtrue ∈ C
ξ+ζ
1−ζ , T
T

true
− dˆ(µ) c d
T 1
−
.
c
ξ+1
1
, then by the definition (81) and inequality
(110) we would obtain
ˆ
d (µ) − dtrue ≤
(113)
1
(ν + µ) |T |
CIF1
ξ+ζ
1−ζ , T, Y
.
Suppose that
1+ξ ˆ
true
d (µ) − dtrue ≥
d
(114)
1
ζ
T
,
c
1
then the inequality (112) becomes
2
Y dtrue − dˆ(µ) n
2µ
≤
(ξ + ζ) dtrue − dˆ(µ) − (1 − ζ) dtrue − dˆ(µ) c .
T 1
T 1
ξ+1
Thus we have
dˆ(µ) − dtrue ∈ C
ξ+ζ
,T .
1−ζ
Combining this fact under the condition (114) with (113), we obtain the first desired
inequality (90)


 1 + ξ 
(ν
+
µ)
|T
|
true
ˆ
.
,
d
d (µ) − dtrue ≤ max
ξ+ζ
 ζ

T c 1 CIF
1
1 1−ζ , T, Y
We complete our proof by letting ζ = 1/2 and noting that (111) implies the second desired
inequality (91).
9.3. Proof of Proposition 2. For τ defined in Equation (84), we need to show that
n
o
σ̂ ≥ σ ora (1 − τ ) and σ̂ ≤ σ ora (1 + τ ) on the event ν ≤ σ ora λ ξ−1
(1
−
τ
)
. Let dˆ(σλ) be
ξ+1
the solution of (79) as a function of σ, then
2
Xi − Y dˆ(σλ)
∂
1
Lλ dˆ(σλ) , σ = −
∂σ
2
2nσ 2
(115)
since
n
o
∂
ˆ
∂d Lλ (d, σ) |d=d(σλ)
k
= 0 for all dˆk (σλ) 6= 0, and
n
n
o
o
∂ ˆ
∂σ d (σλ) k
= 0 for all dˆk (σλ) =
0 which follows from the fact that k : dˆk (σλ) = 0 is unchanged in a neighborhood of σ
for almost all σ. Equation (115) plays a key in the proof.
47
(1). To show that σ̂ ≥ σ ora (1 − τ ) it’s enough to show
∂
Lλ dˆ(σλ) , σ |σ=t1 ≤ 0.
∂σ
where t1 = σ ora (1 − τ ), due to the strict convexity of the objective function Lλ (d, σ) in
σ. Equation (115) implies that
∂
2t21 Lλ dˆ(σλ) , σ |σ=t1
∂σ
= t21 −
≤ t21 −
≤
t21
2
Xm − Y dˆ(t1 λ)
n
2
Xm − Ydtrue + Y dˆ(t1 λ) − dtrue − (σ
n
ora 2
true
) +2 d
≤ 2t1 (t1 − σ
(116)
ora
T Y T X − Ydtrue m
− dˆ(t1 λ)
n
) + 2ν dtrue − dˆ(t1 λ) .
1
n
o
n
o
ξ−1
From Equation (90) in Proposition 1, on the event ν ≤ t1 λ ξ+1
= ν/σ ora < λ ξ−1
ξ+1 (1 − τ )
we have
ˆ
true d (t1 λ) − d
≤ max 2 (1 + ξ) dtrue
1
T
,
c
1
(ν + t1 λ) |T |
.
CIF1 (2ξ + 1, T, Y)
then
2t21
∂
Lλ dˆ(σλ) , σ |σ=t1
∂σ
(ν + t1 λ) |T |
T 1 CIF1 (2ξ + 1, T, Y)
2σ ora λ |T |
true
ora
≤ 2t1 −τ σ + λ max 2 (1 + ξ) d
,
T c 1 CIF1 (2ξ + 1, T, Y)
2λ |T |
2 (1 + ξ) true
,
<0
d
= 2t1 σ ora −τ + λ max
T c 1 CIF1 (2ξ + 1, T, Y)
σ ora
≤ 2t1 (t1 − σ ora ) + 2t1 λ max 2 (1 + ξ) dtrue
,
c
where last inequality is from the definition of τ .
(2). Let t2 = σ ora (1 + τ ).To show the other side σ̂ ≤ σ ora (1 + τ ) it is enough to show
∂
Lλ dˆ(σλ) , σ |σ=t2 ≥ 0.
∂σ
n
Equation (115) implies that on the event ν ≤ t2 λ ξ−1
ξ+1
o
n
= ν/σ ora < λ ξ−1
ξ+1 (1 + τ )
o
we
48
Z. REN ET AL.
have
∂
Lλ dˆ(σλ) , σ |σ=t2
∂σ
2
Xm − Y dˆ(t2 λ)
2
= t2 −
n
2t22
= t22 − (σ ora )2 + (σ ora )2 −
= t22 − (σ ora )2 +
n
= t22 − (σ ora )2 +
≥
− (σ
n
2
Xm − Ydtrue 2 − Xm − Y dˆ(t2 λ)
t22
2
Xm − Y dˆ(t2 λ)
dˆ(t2 λ) − dtrue
T
YT Xm − Ydtrue + Xm − Ydˆ(t2 λ)
n
ˆ
true ) − d (t2 λ) − d
(ν + t2 λ) .
ora 2
1
Equation (90) and the fact 1 + τ ≤ 2 imply
2t22
∂
Lλ dˆ(σλ) , σ |σ=t2
∂σ
(
≥ (t2 + σ ora ) σ ora τ − max 2 (1 + ξ) (ν + t2 λ) dtrue
(
≥ (σ
ora 2
)
(2 + τ ) τ − max
T
,
c
1
2 (1 + ξ) (2λ (1 + τ )) true
d
ora
σ
T
(ν + t2 λ)2 |T |
CIF1 (2ξ + 1, T, Y)
,
c
1
)
8 (1 + τ ) λ2 |T |
CIF1 (2ξ + 1, T, Y)
)!
≥ (σ ora )2 τ,
where last inequality is from the definition of τ .
9.4. Proof of Proposition 4. This Proposition essentially follows from the shifting inequality Proposition 5 in Ye and Zhang (2010). We will give a brief proof using results
and notations in that paper.
Define the generalized version of lq cone invertibility factor (81),
0
CIFq,l
(α, K, Y) = inf
T

 |K|1/q Y n Y u
∞
kuA kq



: u ∈ C (α, K) , u 6= 0, |A\K| ≤ l .

0 (α, K, Y) = CIF 0 (α, K, Y) = CIF (α, K, Y). By EquaWhen q = 1 and l = p, CIFq,l
1
1,p
tions (17), (18) and (20) of Ye and Zhang (2010) we have
0
CIF1 (α, K, Y) = CIF1,p
(α, K, Y) ≥
≥
0 (α, K, Y)
CIF2,l
C1,2 α, |K|
l
φ̃∗2,l (α, K, Y)
φ∗2,l (α, K, Y)
C1,2 α, |K|
l
≥
C1,2 α, |K|
l
(1 + α) ∧
q
1+
l
|K|
49
where C1,2 α, |K|
, φ∗2,l (α, K, Y) and φ̃∗2,l (α, K, Y) are defined on page 3523-3524 of Ye
l
and Zhang (2010). From the definition of φ̃∗2,l (α, K, Y) in Equation (20) of Ye and Zhang
(2010), setting r = 2 (thus ar = 1/4 on page 3523) we have
s
−
φ̃∗2,l (α, K, Y) ≥ 1 − πl+k
(Y) − α
k
θ4l,k+l (Y) .
4l
= 1 + α, then
Since C1,2 α, |K|
l

1
CIF1 (α, K, Y) ≥
(1 + α) (1 + α) ∧
q
1+
l
k
s
−
1 − πl+k
(Y) − α

k
θ4l,k+l (Y) .
4l
9.5. Proof of Lemma 1. Most of this proof is the same as that of Lemma 2. Hence
we only emphasize the differences here and provide details whenever necessary. The proof
also consists of algebraic analysis and probabilistic analysis. Recall that the main idea in
the proof of Lemma 2 is to show the events ∩4i=1 Ii defined in (86)-(89) occur with high
probability and whenever they hold, Proposition 2 and Proposition 1 establish the desired
results. Since we decrease the penalty term λnew , the event I1 is no longer valid with high
probability. Now we need to redefine an appropriate event I1new and show that it occurs
with high probability and on I1new , similar properties like Proposition 2 and Proposition
1 also hold.
Recall ν = khk∞ with h := YT m /n defined in (83). Similar as (82), we define the
index set T =
≥ λnew
k ∈ Ac , dtrue
k
of dtrue with large coordinates. Now by using
smaller λnew , although ν is no longer smaller than σ ora λnew ξ−1
ξ+1 (1 − τ ) in I1 w.h.p., most
coordinates of h still hold. Define a random index set
T̂0 = k ∈ Ac , |hk | ≥ σ ora λnew
ξ−1
(1 − τ ) .
ξ+1
The new event can be defined as
I1new = ν ≤ σ ora
n o
λnew ξ − 1
(1 − τ ) ∪ T̂0 ≤ Cu smax ,
1−tξ+1
where Cu is some universal constant. Note that λnew ≥ (1 − t) λ by our assumption
smax = O(pt ), thus the probabilistic analysis for I1 in Lemma 2 immediately implies
that the first component of I1new holds w.h.p.. The second component holds w.h.p. can
be shown by large deviation result of order statistics, where the first component is also
can be seen as a special case. (See, e.g. Reiss (1989)). Therefore, we briefly showed that
√
P {(I1new )c } ≤ p−δ+1 / log p . We also need to modify the event I2 a little bit as we not
50
Z. REN ET AL.
only care about index T but also index T̂0 which are out of the bound.
I2new
= CIF1
ξ−1
ξ+2+
, T̂0 ∪ T, Y ≥ C > 0 .
1−t
The probabilistic analysis of I2 in Lemma 2 implies that there is no difference by using
I2new since the probability bound for I2 is universal for all index sets with cardinality
less than s + Cu smax = o
n
log p
. We don’t change events I3 and I4 . Thus we finish the
probabilistic analysis.
Now it’s enough for us to show that on ∩2i=1 Iinew ∩4i=3 Ii , the desired results hold. It’s not
hard to see in the proof of Proposition 2 that as long as a similar property like Proposition
1 holds (we will provide details and prove this key result in a minute), Proposition 2 is still
valid when we replace I1 by I1new in the assumption. The only thing we need to show is
the following Proposition 6. Note on ∩2i=1 Iinew ∩4i=3 Ii , Proposition 2 is valid and hence the
assumption of the following Proposition with µ = σ̂λnew is also satisfied. We then apply
this Proposition 6 again to finish the algebraic analysis and hence complete our proof.
Proposition 6.
n
For any ξ > 1, on the event ν ≤
µ ξ−1
1−t ξ+1
o
n
∪ T̂1 ≤ Cu smax
o
with
o
n
T̂1 = k ∈ Ac , |hk | ≥ µ ξ−1
ξ+1 , we have
(117)
(118)
ˆ
d (µ) − dtrue 1
≤ max




(ν + µ) T ∪ T̂1 
,
,

Tc 1
CIF1
(2 + 2ξ) dtrue
2
1
Y dtrue − dˆ(µ) ≤ (ν + µ) dˆ(µ) − dtrue ,
1
n
where CIF1 above is short for CIF1 ξ + 2 +
ξ−1
1−t , T
∪ T̂1 , Y .
The proof is a modification of that for Proposition 1. We still have equation (110)
≤ µ + ν. Define ∆ (µ) := dtrue − dˆ(µ) . The equation (112) needs
T Y Y ˆ
d (µ) − dtrue n
∞
to be modified as follows,
(119)
(120)
∆T (µ) YT Xi − Ydˆ(µ) − YT Xi − Ydtrue
kY∆ (µ)k2
=
n
n
ξ−1
≤ µ dtrue − dˆ(µ) + ν ∆ (µ)T̂1 + µ
∆ (µ)T̂ c 1 1
1
1
1
ξ+1
µ ξ−1 ≤
µ+
∆ (µ)T ∪T̂1 + 2µ dtrue c 1
T 1
1−tξ+1
ξ−1 ∆ (µ)
c .
− µ−µ
T
∪
T̂
(
)
1
ξ+1
1
51
where the first inequality follows from the KKT conditions (80) and our assumed event.
The remaining part is the same as that in Proposition 1. Suppose k∆ (µ)k1 ≥ 2 (1 + ξ) dtrue
then inequality (120) becomes
ξ−1 ≥ 0.
ξ+2+
∆ (µ)T ∪T̂1 − ∆ (µ)(T ∪T̂1 )c 1
1−t
1
Thus we have
ξ−1
d
− dˆ(µ) = ∆ (µ) ∈ C ξ + 2 +
, T ∪ T̂1 .
1−t
Combining this fact with equation (110), we obtain the first desired inequality (117). We
true
complete our proof by noting that (119) implies the second desired inequality (118).
Now we show that λnew can be replaced by its finite sample version λnew
f inite . As we have
n
o
seen, the analysis of event I1 is the key result. All we need to show is that P |hk | > λnew
f inite ≤
√
O p−δ , where √n−1h2k follows an t distribution with n − 1 degrees of freedom. Since hk
1−hk
√
is an increasing function of √n−1h2k on R+ , we can take the quantile of t(n−1) distribution
1−hk
rather than use the concentration inequality above.
References.
Belloni, A., Chernozhukov, V. and Hansen, C. (2012). Inference on treatment effects after selection.
arXiv preprint arXiv:1201.0224.
Bickel, P. J. and Levina, E. (2008a). Regularized estimation of large covariance matrices. Ann. Statist.
36 199-227.
Bickel, P. J. and Levina, E. (2008b). Covariance regularization by thresholding. Ann. Statist. 36 25772604.
Bühlmann, P. (2012). Statistical significance in high-dimensional linear models. arXiv preprint
arXiv:1202.1377.
Cai, T. T., Liu, W. and Luo, X. (2011). A constrained `1 minimization approach to sparse precision
matrix estimation. J. Amer. Statist. Assoc. 106 594-607.
Cai, T. T., Liu, W. and Zhou, H. H. (2012). Estimating Sparse Precision Matrix: Optimal Rates of
Convergence and Adaptive Estimation. Manuscript.
Cai, T. T., Zhang, C. H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix
estimation. Ann. Statist. 38 2118-2144.
Cai, T. T. and Zhou, H. H. (2012). Optimal rates of convergence for sparse covariance matrix estimation.
Ann. Statist. 40 2389–2420.
Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. of
Comput. Math. 9 717-772.
Chandrasekaran, V., Parrilo, P. A. and Willsky, A. S. (2012). Latent variable graphical model
selection via convex optimization. Ann. Statist. 40 1935-1967.
d’Aspremont, A., Banerjee, O. and El Ghaoui, L. (2008). First-order methods for sparse covariance
selection. SIAM Journal on Matrix Analysis and its Applications 30 56-66.
Einmahl, U. (1989). Extensions of results of Komlós, Major, and Tusnády to the multivariate case. J.
Multivariate Anal. 28 20–68.
Tc 1 ,
52
Z. REN ET AL.
El Karoui, N. (2008). Operator norm consistent estimation of large dimensional sparse covariance matrices. Ann. Statist. 36 2717-2756.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the
graphical lasso. Biostatistics 9 432-441.
Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge University Press.
Javanmard, A. and Montanari, A. (2013). Hypothesis testing in high-dimensional regression under the
gaussian random design model: Asymptotic theory. arXiv preprint arXiv:1301.4240.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation.
Ann. Statist. 37 4254-4278.
Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38-53.
Liu, W. (2013). Gaussian Graphical Model Estimation with False Discovery Rate Control. arXiv preprint
arXiv:1306.0976.
Mason, D. and Zhou, H. H. (2012). Quantile Coupling Inequalities and Their Applications. Probability
Surveys 9 439-479.
Meinshausen, N. and Bühlmann, P. (2006). High dimensional graphs and variable selection with the
Lasso. Ann. Statist. 34 1436-1462.
Petrov, V. V. (1995). Limit theorems of probability theory. Oxford Science Publications.
Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance
estimation by minimizing `1 penalized log-determinant divergence. Electron. J. Statist. 5 935-980.
Reiss, R. D. (1989). Approximate distributions of order statistics: with applications to nonparametric
statistics. Springer-Verlag New York.
Ren, Z. and Zhou, H. H. (2012). Discussion : Latent variable graphical model selection via convex
optimization. Ann. Statist. 40 1989-1996.
Rothman, A., Bickel, P., Levian, E. and Zhu, J. (2008). Sparse permutation invariant covariance
estimation. Electron. J. Statist. 2 494-515.
Sun, T. and Zhang, C. H. (2012a). Sparse matrix inversion with scaled Lasso. Manuscript.
Sun, T. and Zhang, C. H. (2012b). Scaled Sparse Linear Regression. Biometrika 99 879–898.
Sun, T. and Zhang, C. H. (2012c). Comment: Minimax estimation of large covariance matrices under l1
norm. Statist. Sinica 22 1354–1358.
Thorin, G. O. (1948). Convexity theorems generalizing those of M. Riesz and Hadamard with some
applications. Comm. Sem. Math. Univ. Lund. 9 1-58.
van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions
and tests for high-dimensional models. arXiv preprint arXiv:1303.0518.
Ye, F. and Zhang, C. H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the `q loss in `r
ball. Journal of Machine Learning Research 11 3481-3502.
Yu, B. (1997). Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam 423–435.
Yuan, M. (2010). Sparse inverse covariance matrix estimation via linear programming. J. Mach. Learn.
Res. 11 2261-2286.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika
94 19-35.
Zhang, C.-H. (2011). Statistical inference for high-dimensional data. In Mathematisches Forschungsinstitut Oberwolfach: Very High Dimensional Semiparametric Models, Report No. 48/2011 28-31.
53
Zhang, C. H. and Huang, J. (2008). The sparsity and bias of the Lasso selection in high-dimensional
linear regression. Ann. Statist. 36 1567-1594.
Zhang, C.-H. and Zhang, S. S. (2011). Confidence intervals for low-dimensional parameters with highdimensional data. arXiv preprint arXiv:1110.2563.
Department of Statistics
Department of Statistics
Yale University
The Wharton School
New Haven, Connecticut 06511
University of Pennsylvania
USA
Philadelphia, Pennsylvania 19104
E-mail: zhao.ren@yale.edu
USA
E-mail: tingni@wharton.upenn.edu
huibin.zhou@yale.edu
Department of Statistics and Biostatistics
Hill Center, Busch Campus
Rutgers University
Piscataway, New Jersey 08854
USA
E-mail: cunhui@stat.rutgers.edu
Download