Submitted to the Annals of Statistics ASYMPTOTIC NORMALITY AND OPTIMALITIES IN ESTIMATION OF LARGE GAUSSIAN GRAPHICAL MODEL By Zhao Ren ∗,‡ , Tingni Sun§ , Cun-Hui Zhang†,¶ and Harrison H. Zhou∗,‡ Yale University‡ , University of Pennsylvania§ and Rutgers University¶ The Gaussian graphical model, a popular paradigm for studying relationship among variables in a wide range of applications, has attracted great attention in recent years. This paper considers a fundamental question: When is it possible to estimate low-dimensional parameters at parametric square-root rate in a large Gaussian graphical model? A novel regression approach is proposed to obtain asymptotically efficient estimation of each entry of a precision matrix under a sparseness condition relative to the sample size. When the precision matrix is not sufficiently sparse, or equivalently the sample size is not sufficiently large, a lower bound is established to show that it is no longer possible to achieve the parametric rate in the estimation of each entry. This lower bound result, which provides an answer to the delicate sample size question, is established with a novel construction of a subset of sparse precision matrices in an application of Le Cam’s Lemma. Moreover, the proposed estimator is proven to have optimal convergence rate when the parametric rate cannot be achieved, under a minimal sample requirement. The proposed estimator is applied to test the presence of an edge in the Gaussian graphical model or to recover the support of the entire model, to obtain adaptive rate-optimal estimation of the entire precision matrix as measured by the matrix lq operator norm, and to make inference in latent variables in the graphical model. All these are achieved under a sparsity condition on the precision matrix and a side condition on the range of its spectrum. This significantly relaxes the commonly imposed uniform signal strength condition on the precision matrix, irrepresentable condition on the Hessian tensor operator of the covariance matrix or the `1 constraint on the precision matrix. ∗ The research of Zhao Ren and Harrison H. Zhou was supported in part by NSF Career Award DMS- 0645676 and NSF FRG Grant DMS-0854975. † The research of Cun-Hui Zhang was supported in part by the NSF Grants DMS 0906420, DMS-1106753 and NSA Grant H98230-11-1-0205. AMS 2000 subject classifications: Primary 62H12; secondary 62F12, 62G09 Keywords and phrases: asymptotic efficiency, covariance matrix, inference, graphical model, latent graphical model, minimax lower bound, optimal rate of convergence, scaled lasso, precision matrix, sparsity, spectral norm 1 2 Z. REN ET AL. Numerical results confirm our theoretical findings. The ROC curve of the proposed algorithm, Asymptotic Normal Thresholding (ANT), for support recovery significantly outperforms that of the popular GLasso algorithm. 1. Introduction. Gaussian graphical model, a powerful tool for investigating the relationship among a large number of random variables in a complex system, is used in a wide range of scientific applications. A central question for Gaussian graphical model is to recover the structure of an undirected Gaussian graph. Let G = (V, E) be an undirected graph representing the conditional dependence relationship between components of a random vector Z = (Z1 , . . . , Zp )T as follows. The vertex set V = {V1 , . . . , Vp } represents the components of Z. The edge set E consists of pairs (i, j) indicating the conditional dependence between Zi and Zj given all other components. 
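As a small concrete illustration of this conditional-dependence structure, the sketch below builds a 3x3 precision matrix with $\omega_{13} = 0$, so that $Z_1$ and $Z_3$ are conditionally independent given $Z_2$ and the graph has no edge between $V_1$ and $V_3$. This is a minimal numpy example of ours, anticipating the precision matrix $\Omega = \Sigma^{-1}$ introduced formally in the next paragraph.

```python
import numpy as np

# Precision matrix of (Z1, Z2, Z3): omega_13 = 0, so Z1 and Z3 are
# conditionally independent given Z2, i.e. the graph has no edge (1,3).
Omega = np.array([[2.0, 0.8, 0.0],
                  [0.8, 2.0, 0.8],
                  [0.0, 0.8, 2.0]])
Sigma = np.linalg.inv(Omega)     # population covariance of Z
print(Sigma[0, 2])               # the marginal covariance of Z1 and Z3 is nonzero,
print(np.linalg.inv(Sigma))      # but inverting Sigma recovers the zero at entry (1,3)
```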
In applications, the following question is fundamental: Is there an edge between Vi and Vj ? It is well known that recovering the structure of an undirected Gaussian graph G = (V, E) is equivalent to recovering the support of the population precision matrix of the data in the Gaussian graphical model. Let Z = (Z1 , Z2 , . . . , Zp )T ∼ N (µ, Σ) , where Σ = (σij ) is the population covariance matrix. The precision matrix, denoted by Ω = (ωij ), is defined as the inverse of covariance matrix, Ω = Σ−1 . There is an edge between Vi and Vj , i.e., (i, j) ∈ E, if and only if ωij 6= 0. See, for example, Lauritzen (1996). Consequently, the support recovery of the precision matrix Ω yields the recovery of the structure of the graph G. Suppose n i.i.d. p-variate random vectors X (1) , X (2) , . . . , X (n) are observed from the same distribution as Z, i.e. the Gaussian N µ, Ω−1 . Assume without loss of generality that µ = 0 hereafter. In this paper, we address the following two fundamental questions: When is it possible to make statistical inference for each individual entry of a precision √ matrix Ω at the parametric n rate? When and in what sense is it possible to recover the support of Ω in the presence of some small nonzero |ωij |? The problems of estimating a large sparse precision matrix and recovering its support have drawn considerable recent attention. There are mainly two approaches in literature. The first one is a penalized likelihood estimation approach with a lasso-type penalty on entries of the precision matrix. Yuan and Lin (2007) proposed to use the lasso penalty and studied its asymptotic properties when p is fixed. Ravikumar et al. (2011) derived the rate of convergence when the dimension p is high by applying a primal-dual witness 3 construction under an irrepresentability condition on the Hessian tensor operator and a constraint on the matrix l1 norm of the precision matrix. See also Rothman et al. (2008) and Lam and Fan (2009) for other related results. The other one is the neighborhoodbased approach, by running a lasso-type regression or Dantzig selection type of each variable on all the rest of variables to estimate precision matrix column by column. See Meinshausen and Bühlmann (2006), Yuan (2010), Cai, Liu and Luo (2011), Cai, Liu and Zhou (2012) and Sun and Zhang (2012a). The irrepresentability condition is no longer needed in Cai, Liu and Luo (2011) and Cai, Liu and Zhou (2012) for support recovery, but the thresholding level for support recovery depends on the matrix l1 norm of the precision matrix. The matrix l1 norm is unknown and large, which makes the support recovery procedures there nonadaptive and thus less practical. In Sun and Zhang (2012a), optimal convergence rate in the spectral norm is achieved without requiring the matrix `1 norm constraint or the irrepresentability condition. However, support recovery properties of the estimator was not analyzed. In spite of an extensive literature on the topic, it is still largely unknown the fundamental limit of support recovery in the Gaussian graphical model, let alone an adaptive procedure to achieve the limit. Statistical inference of low-dimensional parameters at the √ n rate has been considered in the closely related linear regression model. Sun and Zhang (2012b) proposed an efficient scaled Lasso estimator of the noise level under the sample size condition n (s log p)2 , where s is the `0 or capped-`1 measure of the size of the unknown regression vector. 
Zhang and Zhang (2011) proposed an asymptotically normal low-dimensional projection estimator for the regression coefficients, and their estimator was proven to be asymptotically efficient by van de Geer, Bühlmann and Ritov (2013) in a semiparametric sense under the same sample size condition. The asymptotic efficiency of these estimators can also be understood through the minimum Fisher information in a more general context (Zhang, 2011). Alternative methods for testing and estimation of regression coefficients were proposed in Belloni, Chernozhukov and Hansen (2012), Bühlmann (2012), and Javanmard and Montanari (2013). However, the optimal rate of convergence is unclear from these papers when the sample size condition $n \gg (s \log p)^2$ fails to hold.

This paper makes important advancements in the understanding of statistical inference of low-dimensional parameters in the Gaussian graphical model in the following ways. Let $s$ be the maximum degree of the graph or a certain more relaxed capped-$\ell_1$ measure of the complexity of the precision matrix. We prove that estimation of each $\omega_{ij}$ at the parametric $\sqrt{n}$ convergence rate requires the sparsity condition $s \le O(1)\, n^{1/2}/\log p$, or equivalently a sample size of order $(s \log p)^2$. We propose an adaptive estimator of each individual $\omega_{ij}$ and prove its asymptotic normality and efficiency when $n \gg (s \log p)^2$. Moreover, we prove that the proposed estimator achieves the optimal convergence rate when the sparsity condition is relaxed to $s \le c_0\, n/\log p$ for a certain positive constant $c_0$. The efficient estimator of the individual $\omega_{ij}$ is then used to construct fully data-driven procedures to recover the support of $\Omega$ and to make statistical inference about latent variables in the graphical model.

The methodology we propose is a novel regression approach briefly described in Sun and Zhang (2012c). In this regression approach, the main task is not to estimate the slope, as in Meinshausen and Bühlmann (2006), Yuan (2010), Cai, Liu and Luo (2011), Cai, Liu and Zhou (2012) and Sun and Zhang (2012b), but to estimate the noise level. For any index subset $A$ of $\{1, 2, \ldots, p\}$ and a vector $Z$ of length $p$, we use $Z_A$ to denote the vector of length $|A|$ with elements indexed by $A$. Similarly, for a matrix $U$ and two index subsets $A$ and $B$ of $\{1, 2, \ldots, p\}$, we define the submatrix $U_{A,B}$ of size $|A| \times |B|$ with rows and columns of $U$ indexed by $A$ and $B$ respectively. Consider $A = \{i, j\}$, for example $i = 1$ and $j = 2$; then $Z_A = (Z_1, Z_2)^T$ and
$$\Omega_{A,A} = \begin{pmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \omega_{22} \end{pmatrix}.$$
It is well known that
$$Z_A \mid Z_{A^c} \sim N\left(-\Omega_{A,A}^{-1}\Omega_{A,A^c} Z_{A^c},\ \Omega_{A,A}^{-1}\right).$$
This observation motivates us to consider a regression with the two response variables above. The noise level $\Omega_{A,A}^{-1}$ has only three parameters. When $\Omega$ is sufficiently sparse, a penalized regression approach is proposed in Section 2 to obtain an asymptotically efficient estimator of $\omega_{ij}$, i.e., the estimator is asymptotically normal and its variance matches that of the maximum likelihood estimator in the classical setting where the dimension $p$ is a fixed constant.

Consider the class of parameter spaces modeling sparse precision matrices with at most $k_{n,p}$ off-diagonal nonzero elements in each column,
(1)
$$\mathcal{G}_0(M, k_{n,p}) = \left\{ \Omega = (\omega_{ij})_{1\le i,j\le p} :\ \max_{1\le j\le p}\sum_{i\ne j} 1\{\omega_{ij}\ne 0\} \le k_{n,p},\ \text{and}\ 1/M \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) \le M \right\},$$
where $1\{\cdot\}$ is the indicator function and $M$ is some constant greater than 1.
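Returning to the regression observation above, the two identities it rests on, namely that the conditional covariance of $Z_A$ given $Z_{A^c}$ equals $\Omega_{A,A}^{-1}$ and that the regression coefficient matrix equals $-\Omega_{A^c,A}\Omega_{A,A}^{-1}$ (stated formally in (6) of Section 2.1), can be checked numerically on an arbitrary positive definite $\Omega$. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
B = rng.normal(size=(p, p))
Omega = B @ B.T + p * np.eye(p)          # an arbitrary positive definite precision matrix
Sigma = np.linalg.inv(Omega)

A = [0, 1]                               # A = {i, j}, here i = 1, j = 2 (0-based indices)
Ac = [k for k in range(p) if k not in A]

# conditional covariance of Z_A given Z_{A^c}, computed from Sigma ...
cond_cov = (Sigma[np.ix_(A, A)]
            - Sigma[np.ix_(A, Ac)] @ np.linalg.solve(Sigma[np.ix_(Ac, Ac)], Sigma[np.ix_(Ac, A)]))
# ... equals the inverse of the 2x2 block Omega_{A,A}
print(np.allclose(cond_cov, np.linalg.inv(Omega[np.ix_(A, A)])))        # True

# regression coefficients: Sigma_{A^c,A^c}^{-1} Sigma_{A^c,A} = -Omega_{A^c,A} Omega_{A,A}^{-1}
beta_from_Sigma = np.linalg.solve(Sigma[np.ix_(Ac, Ac)], Sigma[np.ix_(Ac, A)])
beta_from_Omega = -Omega[np.ix_(Ac, A)] @ np.linalg.inv(Omega[np.ix_(A, A)])
print(np.allclose(beta_from_Sigma, beta_from_Omega))                    # True
```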
The following theorem shows that a necessary and sufficient condition to obtain a $\sqrt{n}$-consistent estimator of $\omega_{ij}$ is $k_{n,p} = O(\sqrt{n}/\log p)$, and that when $k_{n,p} = o(\sqrt{n}/\log p)$ the procedure to be proposed in Section 2 is asymptotically efficient.

Theorem 1. Let $X^{(i)} \overset{i.i.d.}{\sim} N_p(\mu, \Sigma)$, $i = 1, 2, \ldots, n$. Assume that $k_{n,p} \le c_0\, n/\log p$ with a sufficiently small constant $c_0 > 0$ and $p \ge k_{n,p}^{\nu}$ with some $\nu > 2$. We have the following probabilistic results.
(i). There exists a constant $\epsilon_0 > 0$ such that
$$\inf_{i,j}\ \inf_{\hat\omega_{ij}}\ \sup_{\mathcal{G}_0(M, k_{n,p})} P\left\{ |\hat\omega_{ij} - \omega_{ij}| \ge \epsilon_0 \max\left\{ n^{-1} k_{n,p} \log p,\ n^{-1/2} \right\} \right\} \ge \epsilon_0.$$
(ii). The estimator $\hat\omega_{ij}$ defined in (12) is rate optimal in the sense that
$$\max_{i,j}\ \sup_{\mathcal{G}_0(M, k_{n,p})} P\left\{ |\hat\omega_{ij} - \omega_{ij}| \ge M \max\left\{ n^{-1} k_{n,p} \log p,\ n^{-1/2} \right\} \right\} \to 0, \quad \text{as } (M, n) \to (\infty, \infty).$$
Furthermore, the estimator $\hat\omega_{ij}$ is asymptotically efficient when $k_{n,p} = o(\sqrt{n}/\log p)$, i.e., with $F_{ij}^{-1} = \omega_{ii}\omega_{jj} + \omega_{ij}^2$,
(2)
$$\sqrt{n F_{ij}}\,(\hat\omega_{ij} - \omega_{ij}) \overset{D}{\to} N(0, 1).$$
Moreover, the minimax risk of estimating $\omega_{ij}$ over the class $\mathcal{G}_0(M, k_{n,p})$ satisfies, provided $n = O(p^{\xi})$ with some $\xi > 0$,
(3)
$$\inf_{\hat\omega_{ij}}\ \sup_{\mathcal{G}_0(M, k_{n,p})} E\,|\hat\omega_{ij} - \omega_{ij}| \asymp \max\left\{ k_{n,p}\frac{\log p}{n},\ \sqrt{\frac{1}{n}} \right\}.$$

The lower bound is established through Le Cam's lemma and a novel construction of a subset of sparse precision matrices. An important implication of the lower bound is that the difficulty of support recovery for a sparse precision matrix is different from that for a sparse covariance matrix when $k_{n,p} \gg \sqrt{n}/\log p$, whereas when $k_{n,p} \ll \sqrt{n}/\log p$ the difficulty of support recovery for a sparse precision matrix is just the same as that for a sparse covariance matrix. It is worth pointing out that the asymptotic efficiency result is obtained without assuming the irrepresentable condition or an $\ell_1$ constraint on the precision matrix, which are commonly required in the literature.

An immediate consequence of the asymptotic normality result (2) is a test of whether there is an edge between $V_i$ and $V_j$ in the set $E$, i.e., a test of the hypothesis $\omega_{ij} = 0$. This result is applied to perform adaptive support recovery optimally. In addition, we strengthen other results in the literature under weaker assumptions, with fully adaptive procedures, including adaptive rate-optimal estimation of the precision matrix under various matrix $\ell_q$ norms and an extension of our framework for inference and estimation to a class of latent variable graphical models. See Section 3 for details.

Our work on optimal estimation of the precision matrix in the present paper is closely connected to a growing literature on estimation of large covariance matrices. Many regularization methods have been proposed and studied. For example, Bickel and Levina (2008a,b) proposed banding and thresholding estimators for estimating bandable and sparse covariance matrices respectively and obtained rates of convergence for the two estimators. See also El Karoui (2008) and Lam and Fan (2009). Cai, Zhang and Zhou (2010) established the optimal rates of convergence for estimating bandable covariance matrices. Cai and Zhou (2012) and Cai, Liu and Zhou (2012) obtained the minimax rates of convergence for estimating sparse covariance and precision matrices under a range of losses including the spectral norm loss. In particular, a new general lower bound technique for matrix estimation was developed there. See also Sun and Zhang (2012a). The proposed estimator was briefly described in Sun and Zhang (2012c) along with a statement of its efficiency, without proof, under the sparsity assumption $k_{n,p} \ll n^{1/2}/\log p$.
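To make the inferential content of the asymptotic normality (2) concrete, the following sketch computes the Wald confidence interval for $\omega_{ij}$ and the p-value for testing the edge hypothesis $\omega_{ij} = 0$ from the plug-in variance estimate $(\hat\omega_{ii}\hat\omega_{jj} + \hat\omega_{ij}^2)/n$. This is illustrative only; the numerical inputs below are placeholders, not output of the procedure.

```python
import numpy as np
from scipy.stats import norm

def ci_and_test(omega_hat_ii, omega_hat_jj, omega_hat_ij, n, level=0.95):
    """Wald interval and test of H0: omega_ij = 0 based on (2), using the
    plug-in standard error sqrt((omega_ii*omega_jj + omega_ij^2) / n)."""
    se = np.sqrt((omega_hat_ii * omega_hat_jj + omega_hat_ij ** 2) / n)
    z = norm.ppf(0.5 + level / 2)
    ci = (omega_hat_ij - z * se, omega_hat_ij + z * se)
    p_value = 2 * norm.sf(abs(omega_hat_ij) / se)
    return ci, p_value

# hypothetical estimates, for illustration only
print(ci_and_test(omega_hat_ii=1.0, omega_hat_jj=1.2, omega_hat_ij=0.3, n=400))
```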
While we are working on the delicate issue of the necessity of the sparsity condition $k_{n,p} \ll n^{1/2}/\log p$ and the optimality of the method for support recovery and estimation under the general sparsity condition $k_{n,p} \lesssim n/\log p$, Liu (2013) developed p-values for testing $\omega_{ij} = 0$ and related FDR control methods under the stronger sparsity condition $k_{n,p} \ll n^{1/2}/\log p$. However, his method cannot be converted into confidence intervals, and the optimality of his method is unclear under either sparsity condition.

The paper is organized as follows. In Section 2, we introduce our methodology and main results for statistical inference. Applications to estimation under the spectral norm, to support recovery and to estimation of latent variable graphical models are presented in Section 3. Section 4 discusses extensions of the results in Sections 2 and 3. Numerical studies are given in Section 5. Proofs of the theorems in Sections 2-3 are given in Sections 6-7. Proofs of the main lemmas are given in Section 8. We collect auxiliary results for proving the main lemmas in the supplementary material.

Notation. We summarize here some notation to be used throughout the paper. For $1 \le w \le \infty$, we use $\|u\|_w$ and $\|A\|_w$ to denote the usual vector $\ell_w$ norm, given a vector $u \in \mathbb{R}^p$ and a matrix $A = (a_{ij})_{p\times p}$ respectively. In particular, $\|A\|_\infty$ denotes the entry-wise maximum $\max_{ij} |a_{ij}|$. We write $\|\cdot\|$ without a subscript for the vector $\ell_2$ norm. The matrix $\ell_w$ operator norm of a matrix $A$ is defined by $|||A|||_w = \max_{\|x\|_w = 1} \|Ax\|_w$. The commonly used spectral norm $|||\cdot|||$ coincides with the matrix $\ell_2$ operator norm $|||\cdot|||_2$.

2. Methodology and Statistical Inference. In this section we introduce our methodology for estimating each entry and, more generally, a smooth functional of any square submatrix of finite size. Asymptotic efficiency results are stated in Section 2.2 under a sparseness assumption. The lower bound in Section 2.3 shows that the sparseness condition required for asymptotic efficiency in Section 2.2 is sharp.

2.1. Methodology. We first introduce the methodology to estimate each entry $\omega_{ij}$, and then discuss its extension to the estimation of functionals of a submatrix of the precision matrix. The methodology is motivated by the following simple observation with $A = \{i, j\}$,
(4)
$$Z_{\{i,j\}} \mid Z_{\{i,j\}^c} \sim N\left(-\Omega_{A,A}^{-1}\Omega_{A,A^c} Z_{\{i,j\}^c},\ \Omega_{A,A}^{-1}\right).$$
Equivalently, we write
(5)
$$(Z_i, Z_j) = Z_{\{i,j\}^c}^T \beta + (\eta_i, \eta_j),$$
where the coefficients and error distribution are
(6)
$$\beta = -\Omega_{A^c,A}\,\Omega_{A,A}^{-1}, \qquad (\eta_i, \eta_j)^T \sim N\left(0,\ \Omega_{A,A}^{-1}\right).$$
Denote the covariance matrix of $(\eta_i, \eta_j)^T$ by
$$\Theta_{A,A} = \Omega_{A,A}^{-1} = \begin{pmatrix} \theta_{ii} & \theta_{ij} \\ \theta_{ji} & \theta_{jj} \end{pmatrix}.$$
We will estimate $\Theta_{A,A}$, expecting that an efficient estimator of $\Theta_{A,A}$ yields efficient estimation of the entries of $\Omega_{A,A}$ by inverting the estimator of $\Theta_{A,A}$.

Denote the $n \times p$ data matrix by $X$. The $i$-th row of the data matrix is the $i$-th sample $X^{(i)}$. Let $X_A$ be the columns indexed by $A = \{i, j\}$. Based on the regression interpretation (5), we have the following data version of the multivariate regression model
(7)
$$X_A = X_{A^c}\beta + \epsilon_A.$$
Here each row of (7) is a sample from the linear model (5). Note that $\beta$ is a $(p-2) \times 2$ coefficient matrix. Denote a sample version of $\Theta_{A,A}$ by
(8)
$$\Theta^{ora}_{A,A} = (\theta^{ora}_{kl})_{k\in A, l\in A} = \epsilon_A^T \epsilon_A / n,$$
which is an oracle MLE of $\Theta_{A,A}$, assuming that we know $\beta$, and
(9)
$$\Omega^{ora}_{A,A} = (\omega^{ora}_{kl})_{k\in A, l\in A} = \left(\Theta^{ora}_{A,A}\right)^{-1}.$$
But of course $\beta$ is unknown, and we will need to estimate $\beta$ and plug in its estimator to estimate $\epsilon_A$. Now we formally introduce the methodology.
For each m ∈ A = {i, j}, we apply a scaled lasso penalization to the univariate linear regression of Xm against XAc as follows, (10) n o 1/2 = arg β̂m , θ̂mm min kX b∈Rp−2 ,σ∈R+ m 2 X kXk k − X bk σ √ |bk | , + +λ 2nσ 2 n k∈Ac Ac 8 Z. REN ET AL. with a weighted `1 penalty, where the vector b is indexed by Ac . The penalty level will be specified explicitly later. Define the residuals of the scaled lasso regression by ˆ A = XA − XAc β̂, and Θ̂A,A = ˆ TAˆ A /n. (11) It can be shown that this definition of θ̂mm is consistent with the θ̂mm obtained from the scaled lasso (10) for each m ∈ A. Finally we simply inverse the estimator Θ̂A,A = θ̂kl k,l∈A to estimate entries in ΩA,A , i.e. Ω̂A,A = Θ̂−1 A,A . (12) This methodology can be routinely extended into a more general form. For any subset B ⊂ {1, 2, . . . , p} with a bounded size, the conditional distribution of ZB given ZB c is −1 ZB |ZB c = N −Ω−1 B,B ΩB,B c ZB c , ΩB,B , (13) so that the associated multivariate linear regression model is XB = XB c βB,B c + B with −1 βB c ,B = −ΩB c ,B Ω−1 B,B and B ∼ N 0, ΩB,B . Consider a slightly more general problem of estimating a smooth functional of Ω−1 B,B , denoted by ζ := ζ Ω−1 B,B . When βB c ,B is known, B is sufficient for Ω−1 B,B due to the independence of B and XB c , so that an oracle estimator of ζ can be defined as ζ ora = ζ TB B /n . We apply scaled lasso to the univariate linear regression of Xm against XB c for each m ∈ B as in Equation (10), n o 1/2 β̂m , θ̂mm = arg kX min 2 X kXk k σ m − XB c bk √ |bk | + +λ 2nσ 2 n k∈B c b∈Rp−|B| ,σ∈R+ where |B| is the size of subset B. The residual matrix of the model is ˆ B = XB −XB c β̂B c ,B , then the scaled lasso estimator of ζ Ω−1 B,B is defined by (14) ζ̂ = ζ ˆ TB ˆ B /n . 9 2.2. Statistical Inference. For λ > 0, define capped-`1 balls as G ∗ (M, s, λ) = {Ω : sλ (Ω) ≤ s, 1/M ≤ λmin (Ω) ≤ λmax (Ω) ≤ M } , (15) where sλ = sλ (Ω) = maxj Σi6=j min {1, |ωij | /λ} for Ω = (ωij )1≤i,j≤p . In this paper, λ is of the order p (log p)/n. We omit the subscript λ from s when it is clear from the context. When |ωij | is either 0 or larger than λ, sλ is the maximum node degree of the graph, which is denoted by kn,p in the class of parameter spaces (1). In general, kn,p is an upper bound of the sparseness measurement sλ . The spectrum of Σ is bounded in the matrix class G ∗ (M, s, λ) as in the `0 ball (1). The following theorem gives an error bound for our estimators by comparing them with the oracle MLE (8), and shows that κora ij = √ ora − ω ωij ij nq 2 ωii ωjj + ωij is asymptotically standard normal, which implies the oracle MLE (8) is asymptotically 2 . normal with mean ωij and variance n−1 ωii ωjj + ωij Theorem 2. Let Θ̂A,A and Ω̂A,A be estimators of ΘA,A and ΩA,A defined in (11) and q p (12) respectively, and λ = (1 + ε) 2δ log for any δ ≥ 1 and ε > 0 in Equation (10). n (i). Suppose s ≤ c0 n/ log p for a sufficiently small constant c0 > 0. We have (16) max ∗ max Ω∈G (M,s,λ) A:A={i,j} P Θ̂A,A − Θora A,A ∞ log p > C1 s n ≤ o p−δ+1 , and (17) max ∗ max Ω∈G (M,s,λ) A:A={i,j} P Ω̂A,A − Ωora A,A ∞ > C10 s log p n ≤ o p−δ+1 , ora where Θora A,A and ΩA,A are the oracle estimators defined in (8) and (9) respectively and C1 and C10 are positive constants depending on {ε, c0 , M } only. (ii). 
There exist constants D1 and ϑ ∈ (0, ∞), and three marginally standard normal √ random variables Zkl , where kl = ii, ij, jj, such that whenever |Zkl | ≤ ϑ n for all kl, we have (18) D1 ora 2 2 1 + Zii2 + Zij + Zjj , κij − Z 0 ≤ √ n where Z 0 ∼ N (0, 1), which can be defined as a linear combination of Zkl , kl = ii, ij, jj. Theorem 2 immediately yields the following results of estimation and inference for ωij . 10 Z. REN ET AL. Theorem 3. Let Ω̂A,A be the estimator of ΩA,A defined in (12), and λ = (1 + ε) q 2δ log p n for any δ ≥ 1 and ε > 0 in Equation (10). Suppose s ≤ c0 n/ log p for a sufficiently small constant c0 > 0. For any small constant > 0, there exists a constant C2 = C2 (, ε, c0 , M ) such that ( (19) ( log p max max P |ω̂ij − ωij | > C2 max s , ∗ n Ω∈G (M,s,λ) 1≤i≤j≤p r )) 1 n ≤ . Moreover, there is constant C3 = C3 (, ε, c0 , M ) such that (20) s log p log p −δ+3 . ≤ o p , max P Ω̂ − Ω > C3 max s n ∞ n Ω∈G ∗ (M,s,λ) Furthermore, ω̂ij is asymptotically efficient q D nFij (ω̂ij − ωij ) → N (0, 1) , (21) √ when Ω ∈ G ∗ (M, s, λ) and s = o ( n/ log p), where 2 . Fij−1 = ωii ωjj + ωij Remark 1. The upper bounds max n s logn p , q o 1 n and max s logn p , q log p n in Equa- tions (19) and (20) respectively are shown to be rate-optimal in Section 2.3. Remark 2. The choice of λ = (1 + ε) q 2δ log p n is common in literature, but can be too big and too conservative, which usually leads to some estimation bias in practice. Let tq (α, n) denotes the α quantile of the t distribution with n degrees of freedom. In √ 2 Section 4 we show the value of λ can be reduced to λnew f inite = (1 + ε) B/ n − 1 + B where B = tq 1 − equivalent to (1 + ε) smax p q δ /2, n − 1 2δ log(p/smax ) . n for every smax = o √ n log p , which is asymptotically See Section 4 for more details. In simulation studies of Section 5, we use the penalty λnew f inite with δ = 1, which gives a good finite sample performance. Remark 3. In Theorems 2 and 3, our goal is to estimate each entry ωij of the preci- sion matrix Ω. Sometimes it is more natural to consider estimating the partial correlation rij = −ωij /(ωii ωjj )1/2 between Zi and Zj . Let Ω̂A,A be estimator of ΩA,A defined in (12). Our estimator of partial correlation rij is defined as r̂ij = −ω̂ij /(ω̂ii ω̂jj )1/2 . Then the results above can be easily extended to the case of estimating rij . In particular, under the same assumptions in Theorem 3, the estimator r̂ij is asymptotically efficient: q √ 2 )−2 (r̂ − r ) converges to N (0, 1) when s = o ( n/ log p). This is stated as n(1 − rij ij ij Corollary 1 in Sun and Zhang (2012c) without proof. 11 The following theorem extends Theorems 2 and 3 to estimation of ζ Ω−1 B,B , a smooth |B|×|B| → R is a unit functional of Ω−1 B,B for a bounded size subset B. Assume that ζ : R n o Lipschitz function in a neighborhood G : |||G − Ω−1 B,B ||| ≤ κ , i.e., −1 ζ (G) − ζ Ω−1 B,B ≤ |||G − ΩB,B |||. (22) Theorem 4. Let ζ̂ be the estimator of ζ defined in (14), and λ = (1 + ε) q 2δ log p n for any δ ≥ 1 and ε > 0 in Equation (10). Suppose s ≤ c0 n/ log p for a sufficiently small constant c0 > 0. Then, (23) log p ora ≤ o |B| p−δ+1 , P ζ̂ − ζ > C1 s max n Ω∈G ∗ (M,s,λ) with a constant C1 = C1 (ε, c0 , M, |B|). Furthermore, ζ̂ is asymptotically efficient q (24) D nFζ ζ̂ − ζ → N (0, 1) , √ when Ω ∈ G ∗ (M, s, λ) and s = o ( n/ log p), where Fζ is the Fisher information of esti mating ζ for the Gaussian model N 0, Ω−1 B,B . Remark 4. 
The results in this section can be easily extended to the weak lq ball with 0 < q < 1 to model the sparsity of the precision matrix. A weak lq ball of radius c in Rp is defined as follows, n o q Bq (c) = ξ ∈ Rp : ξ(j) ≤ cj −1 , for all j = 1, . . . , p , where ξ(1) ≥ ξ(2) ≥ . . . ≥ ξ(p) . Let (25) Gq (M, kn,p ) = Since ξ ∈ Bq (k) implies (26) P j Ω = (ωij )1≤i,j≤p : ω·j ∈ Bq (kn,p ) , and 1/M ≤ λ min (Ω) ≤ λmax (Ω) ≤ M. . min(1, |ξj |/λ) ≤ bk/λq c + {q/(1 − q)}k 1/q bk/λq c1−1/q /λ, Gq (M, kn,p ) ⊆ G ∗ (M, s, λ), 0 ≤ q ≤ 1, when kn,p /λq ≤ Cq (s ∨ 1), where Cq = 1 + q2q /(1 − q) for 0 < q < 1 and C0 = 1. Thus, the conclusions of Theorems 2, 3 and 4 hold with G ∗ (M, s, λ) replaced by Gq (M, kn,p ) and s by kn,p (n/ log p)q/2 , 0 ≤ q < 1. 12 Z. REN ET AL. 2.3. Lower Bound. In this section, we derive a lower bound for estimating ωij over the matrix class G0 (M, kn,p ) defined in (1). Assume that ν p ≥ kn,p with ν > 2, (27) and kn,p ≤ C0 (28) n log p for some C0 > 0. Theorem 5 below implies that the assumption kn,p logn p → 0 is necessary for consistent estimation of any single entry of Ω. We carefully construct a finite collection of distributions G0 ⊂ G0 (M, kn,p ) and apply Le Cam’s method to show that for any estimator ω̂ij , log p →1, n sup P |ω̂ij − ωij | > C1 kn,p (29) G0 for some constant C1 > 0. It is relatively easy to establish the parametric lower bound q 1 n. These two lower bounds together immediately yield Theorem 5 below. Theorem 5. Suppose we observe independent and identically distributed p-variate Gaussian random variables X (1) , X (2) , . . . , X (n) with zero mean and precision matrix Ω = (ωkl )p×p ∈ G0 (M, kn,p ). Under assumptions (27) and (28), we have the following minimax lower bounds ( (30) ( kn,p log p , C2 inf sup P |ω̂ij − ωij | > max C1 ω̂ij G0 (M,kn,p ) n r )) 1 n > c1 > 0, and (31) s k log p log p n,p , C20 inf sup P Ω̂ − Ω > max C10 > c2 > 0, ∞ n n Ω̂ G0 (M,kn,p ) where C1 , C2 , C10 and C20 are positive constants depending on M, ν and C0 only. Remark 5. The lower bound kn,p log p n in Theorem 5 shows that estimation of sparse precision matrix can be very different from estimation of sparse covariance matrix. The sample covariance always gives a parametric rate of estimation of every entry σij . But √ for estimation of sparse precision matrix, when kn,p impossible to obtain the parametric rate. n log p , Theorem 5 implies that it is 13 Remark 6. Since G0 (M, kn,p ) ⊆ G ∗ (M, kn,p , λ) by the definitions in (1) and (15), Theorem 5 also provides the lower bound for the larger class. Similarly, Theorem 5 can be easily extended to the weak lq ball, 0 < q < 1, defined in (25) and the capped-`1 ball defined in (15). For these parameter spaces, in the proof of Theorem 5 we only need to define H q/2 n − 1 rather as the collection of all p × p symmetric matrices with exactly kn,p log p than (kn,p − 1) elements equal to 1 between the third and the last elements on the first row (column) and the rest all zeros. Then it is easy to check that the space sub-parameter q/2 v with G0 in (56) is indeed in Gq (M, kn,p ). Now under assumptions p ≥ kn,p logn p ν > 2 and kn,p ≤ C0 n log p 1−q/2 ,we have the following minimax lower bounds ( inf sup ω̂ij Gq (M,kn,p ) ( P |ω̂ij − ωij | > max C1 kn,p log p n r )) 1−q/2 , C2 1 n > c1 > 0, and s 1−q/2 log p log p inf sup P Ω̂ − Ω > max C10 kn,p , C20 > c2 > 0. ∞ n n Ω̂ Gq (M,kn,p ) These lower bounds match the upper bounds for the proposed estimator in Theorem 3 in view of the discussion in Remark 4. 3. Applications. 
The asymptotic normality result is applied to obtain rate-optimal estimation of the precision matrix under various matrix lw norms, to recover the support of Ω adaptively, and to estimate latent graphical models without the need of the irrepresentable condition or the l1 constraint of the precision matrix commonly required in literature. Our procedure is first obtaining an Asymptotically Normal estimation and then do Thresholding. We thus call it ANT. 3.1. ANT for Adaptive Support Recovery. The support recovery of precision matrix has been studied by several papers. See, for example, Friedman, Hastie and Tibshirani (2008), d’Aspremont, Banerjee and El Ghaoui (2008), Rothman et al. (2008), Ravikumar et al. (2011), Cai, Liu and Luo (2011), and Cai, Liu and Zhou (2012). In these literature, the theoretical properties of the graphical lasso (Glasso), CLIME and ACLIME on the support recovery were obtained. Ravikumar et al. (2011) studied the theoretical properties of Glasso, and showed that Glasso can correctly recover the support under q irrepresentable conditions and the condition min(i,j)∈S(Ω) |ωij | ≥ c log p n for some c > 0. Cai, Liu and Luo (2011) does not require irrepresentable conditions, but need to assume q log p 2 that min(i,j)∈S(Ω) |ωij | ≥ CMn,p n , where Mn,p is the matrix l1 norm of Ω. In Cai, Liu 14 Z. REN ET AL. and Zhou (2012), they weakened the condition to min(i,j)∈S(Ω) |ωij | ≥ CMn,p the threshold level there is C 2 Mn,p q log p n , q log p n , but where C is unknown and Mn,p can be very large, which makes the support recovery procedure there impractical. In this section we introduce an adaptive support recovery procedure based on the variance of the oracle estimator of each entry ωij to recover the sign of nonzero entries of Ω with high probability. The lower bound condition for min(i,j)∈S(Ω) |ωij | is significantly weakened. In particular, we remove the unpleasant matrix l1 norm Mn,p . In Theorem 3, when the precision matrix is sparse enough s = o √ n log p , we have the asymptotic normality result for each entry ωij , i 6= j, i.e., q D nFij (ω̂ij − ωij ) → N (0, 1) , 2 where Fij = ωii ωjj + ωij −1 is the Fisher information of estimating ωij . The total number of edges is p (p − 1) /2. We may apply thresholding to ω̂ij to correctly distinguish zero and 2 needs to be estimated. We define the nonzero entries. However, the variance ωii ωjj + ωij adaptive support recovery procedure as follows thr Ω̂thr = (ω̂ij )p×p , (32) thr = ω̂ 1{|ω̂ | ≥ τ̂ } for i 6= j with where ω̂iithr = ω̂ii and ω̂ij ij ij ij (33) τ̂ij = v u u 2ξ ω̂ ω̂ + ω̂ 2 log p ii jj t 0 ij n . 2 is the natural estimate of the asymptotic variance of ω̂ Here ω̂ii ω̂jj + ω̂ij ij defined in (12) and ξ0 is a tuning parameter which can be taken as fixed at any ξ0 > 2. This thresholding estimator is adaptive. The sufficient conditions in the Theorem 6 below for support recovery are much weaker compared with other results in literature. Define a thresholded population precision matrix as (34) thr Ωthr = (ωij )p×p , n thr = ω 1 |ω | ≥ where ωiithr = ωii and ωij ij ij q o 2 )(log p)/n , with a certain 8ξ(ωii ωjj + ωij ξ > ξ0 . Recall that E = E(Ω) = {(i, j) : ωij 6= 0} is the edge set of the Gauss-Markov graph associated with the precision matrix Ω. Since Ωthr is composed of relatively large components of Ω, (V, E(Ωthr )) can be view as a graph of strong edges. Define S(Ω) = {sgn(ωij ), 1 ≤ i, j ≤ p}. 15 The following theorem shows that with high probability, ANT recovers all the strong edges without false recovery. 
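Algorithmically, the ANT procedure just described (pairwise scaled-lasso regressions (10)-(12) followed by the adaptive thresholding (32)-(33)) can be sketched as follows. This is an illustrative implementation, not the authors' code: the inner scaled-lasso solver is a simple alternating scheme built on scikit-learn's Lasso, the data are assumed centered, design columns are standardized inside the loop, and the loop over all $p(p-1)/2$ pairs is left unoptimized.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

def scaled_lasso(X, y, lam, n_iter=20):
    """Scaled lasso (10) via alternating minimization: a lasso step for the
    coefficients given sigma, then sigma = ||residual|| / sqrt(n).
    Columns of X are assumed standardized so that ||X_k||^2 = n."""
    n = len(y)
    sigma = y.std()
    for _ in range(n_iter):
        # sklearn's Lasso minimizes ||y - Xb||^2/(2n) + alpha*||b||_1,
        # so the penalty level lam * sigma corresponds to alpha = lam * sigma
        b = Lasso(alpha=lam * sigma, fit_intercept=False, max_iter=5000).fit(X, y).coef_
        sigma = np.linalg.norm(y - X @ b) / np.sqrt(n)
    return b

def ant(X, lam, xi0=2.0):
    """Asymptotically Normal Thresholding: for each pair A = {i, j}, regress
    X_i and X_j on the remaining columns, invert the 2x2 residual covariance
    as in (11)-(12), then threshold the off-diagonal entry at tau from (33)."""
    n, p = X.shape
    Omega_thr = np.zeros((p, p))
    for i, j in combinations(range(p), 2):
        A, Ac = [i, j], [k for k in range(p) if k not in (i, j)]
        Xc = X[:, Ac] * (np.sqrt(n) / np.linalg.norm(X[:, Ac], axis=0))   # standardize columns
        resid = np.column_stack([X[:, m] - Xc @ scaled_lasso(Xc, X[:, m], lam) for m in A])
        W = np.linalg.inv(resid.T @ resid / n)                # Omega_hat_{A,A}, equation (12)
        w_ii, w_jj, w_ij = W[0, 0], W[1, 1], W[0, 1]
        tau = np.sqrt(2 * xi0 * (w_ii * w_jj + w_ij ** 2) * np.log(p) / n)   # equation (33)
        Omega_thr[i, j] = Omega_thr[j, i] = w_ij if abs(w_ij) >= tau else 0.0
        Omega_thr[i, i], Omega_thr[j, j] = w_ii, w_jj         # diagonal entries are not thresholded
    return Omega_thr
```

With $\lambda$ of the order $\sqrt{(\log p)/n}$, for example the level used in Theorem 6, and $\xi_0 > 2$, the support of the returned matrix is the ANT estimate of $E(\Omega)$; the simulations in Section 5 take $\xi_0 = 2$.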
Moreover, under the uniform signal strength condition |ωij | ≥ 2 (35) v u u 2ξ ω ω + ω 2 log p ii jj t ij n , ∀ ωij 6= 0. i.e. Ωthr = Ω, the ANT also recovers the sign matrix S(Ω). Theorem 6. Let λ = (1 + ε) q 2δ log p n for any δ ≥ 3 and ε > 0. Let Ω̂thr be the ANT estimator defined in (32) with ξ0 > 2 in the thresholding level (33). Suppose Ω ∈ G ∗ (M, s, λ) with s = o p n/ log p . Then, lim P E(Ωthr ) ⊆ E(Ω̂thr ) ⊆ E(Ω) = 1. (36) n→∞ where Ωthr is defined in (34) with ξ > ξ0 . If in addition (35), then lim P S(Ω̂thr ) = S(Ω) = 1. (37) n→∞ 3.2. ANT for Adaptive Estimation under the Matrix lw Norm. In this section, we consider the rate of convergence under the matrix lw norm. To control the improbable case for which our estimator Θ̂A,A is nearly singular, we define our estimator based on the thresholding estimator Ω̂thr defined in (32), thr Ω̆thr = (ω̂ij 1 {|ω̂ij | ≤ log p})p×p . (38) Theorem 7 follows mainly from the convergence rate under element-wise norm and the fact that the upper bound holds for matrix l1 norm. Then it follows immediately by the inequality |||M |||w ≤ |||M |||1 for any symmetric matrix M and 1 ≤ w ≤ ∞, which can be proved by applying the Riesz-Thorin interpolation theorem. See, e.g., Thorin (1948). 2 = O (n/ log p) , it can be seen from the Equations Note that under the assumption kn,p (17) and (18) in Theorem 2 that with high probability the Ω̂ − Ω kΩora − Ωk∞ = Op q log p n ∞ is dominated by . From there the details of the proof is in nature similar to the Theorem 3 in Cai and Zhou (2012) and thus will be omitted due to the limit of space. Theorem 7. Under the assumptions s2 = O (n/ log p) and n = O pξ ξ > 0, the Ω̆thr defined in (38) with λ = (1 + ε) q 2δ log p n for sufficiently large δ ≥ 3+ 2ξ and ε > 0 satisfies, for all 1 ≤ w ≤ ∞ and kn,p ≤ s (39) sup G0 (M,kn,p ) E|||Ω̆thr − Ω|||2w ≤ sup G ∗ (M,k n,p ,λ) with some E|||Ω̆thr − Ω|||2w ≤ Cs2 log p . n 16 Z. REN ET AL. Remark 7. It follows from Equation (26) that result (39) also holds for the classes of weak `p balls Gq (M, kn,p ) defined in Equation (25), with s = Cq kn,p (40) sup Gq (M,kn,p ) Remark 8. E|||Ω̆thr − Ω|||2w ≤ 2 Ckn,p log p n n log p q/2 , 1−q . Cai, Liu and Zhou (2012) showed the rates obtained in Equations (39) and (40) are optimal when p ≥ cnγ for some γ > 1 and kn,p = o n1/2 (log p)−3/2 . 3.3. Estimation and Inference for Latent Variable Graphical Model. Chandrasekaran, Parrilo and Willsky (2012) first proposed a very natural penalized estimation approach and studied its theoretical properties. Their work was discussed and appreciated by several researchers. But it was not clear if the conditions in their paper are necessary and the results there are optimal. Ren and Zhou (2012) observed that the support recovery boundq q ary can be significantly improved from an order of p n to log p n under certain conditions including a bounded l1 norm constraint for the precision matrix. In this section we extend the methodology and results in Section 2 to study latent variable graphical models. The results in Ren and Zhou (2012) are significantly improved under weaker assumptions. Let O and H be two subsets of {1, 2, . . . , p + h} with Card(O) = p, Card(H) = h and (i) (i) O ∪ H = {1, 2, . . . , p + h}. Assume that XO , XH , i = 1, . . . , n, are i.i.d. (p + h)-variate Gaussian random vectors with a positive covariance matrix Σ(p+h)×(p+h) . Denote the corresponding precision matrix by Ω̄(p+h)×(p+h) = Σ−1 (p+h)×(p+h) . We only have access to n (1) (2) (n) XO , XO , . . . 
, XO o n (1) (2) (n) , while XH , XH , . . . , XH o are hidden and the number of latent components is unknown. Write Σ(p+h)×(p+h) and Ω̄(p+h)×(p+h) as follows, Σ(p+h)×(p+h) = ΣO,O ΣO,H ΣH,O ΣH,H , and Ω̄(p+h)×(p+h) = (i) (i) Ω̄O,O Ω̄O,H Ω̄H,O Ω̄H,H , where ΣO,O and ΣH,H are covariance matrices of XO and XH respectively and from the Schur complement we have (41) −1 Σ−1 O,O = Ω̄O,O − Ω̄O,H Ω̄H,H Ω̄H,O . See, e.g., Horn and Johnson (1990). Define S = Ω̄O,O , and L = Ω̄O,H Ω̄−1 H,H Ω̄H,O , where h0 = rank (L) = rank Ω̄O,H ≤ h. 17 We are interested in estimating Σ−1 O,O as well as S and L. To make the problem identifiable we assume that S is sparse and the effect of latent variables is spread out over all coordinates, i.e., (42) S = (sij )1≤i,j≤p , max 1≤j≤p X 1 {sij 6= 0} ≤ kn,p ; i6=j and L = (lij )1≤i,j≤p , |lij | ≤ (43) M0 . p The sparseness of S = Ω̄O,O can be seen to be inherited from the sparse full precision matrix Ω̄(p+h)×(p+h) , and it is particularly interesting for us to identify the support of S = Ω̄O,O and make inference for each entry of S. A sufficient condition for the assumption (43) is that the eigendecomposition of L = and c0 Ph0 i=1 λi Ph0 T i=1 λi ui ui satisfies kui k∞ ≤ q c0 p for all i ≤ M0 . See Candès and Recht (2009) for a similar assumption. In addition, we assume that (44) 1/M ≤ λmin (Σ(p+h)×(p+h) ) ≤ λmax (Σ(p+h)×(p+h) ) ≤ M for some universal constant M , and n = o(p). log p (45) (i) Equation (44) implies that both the covariance ΣO,O of observations XO and the sparse component S = Ω̄O,O have bounded spectrum, and λmax (L) ≤ M . (i) With a slight abuse of notation, denote the precision matrix Σ−1 O,O of XO by Ω and its inverse by Θ. We propose to apply the methodology in Section 2 to the observations X (i) which are i.i.d. N (0, ΣO,O ) with Ω = (sij − lij )1≤i,j≤p by considering the following regression (46) XA = XO\A β + A i.i.d. −1 for A = {i, j} ⊂ O with β = ΩO\A,A Ω−1 A,A and A ∼ N 0, ΩA,A scaled lasso procedure to estimate ΩA,A . When S = Ip and L = √ √ 1/ p, . . . , 1/ p , we see that max Σi6=j min 1, j |sij − lij | λ = p−1 =O 2pλ r and the penalized 1 T 2 u0 u0 with uT0 = n . log p However, to obtain the asymptotic normality result as in Theorem 2, we required √ ! |sij − lij | n max Σi6=j min 1, =o , j λ log p 18 Z. REN ET AL. which is no longer satisfied for the latent variable graphical model. In Section 7.2 we overcome the difficulty through a new analysis. Theorem 8. Let Ω̂A,A be theqestimator of ΩA,A defined in (12) with A = {i, j} for the 2δ log p n regression (46). Let λ = (1 + ε) (42)-(45) and kn,p = o s (47) √ n log p we have n D 2 (ω̂ij − ωij ) ∼ ωii ωjj + ωij Remark 9. for any δ ≥ 1 and ε > 0. Under the assumptions s n D 2 (ω̂ij − sij ) → N (0, 1) . ωii ωjj + ωij Without condition (45), our estimator may not be asymptotic efficient but still has nice convergence property. We could obtain the following rate of convergence n sparsity maxj Σi6=j min 1, |sij −lij | λ o = O kn,p n log p , + λ−1 , for estimating ωij = sij − lij , provided kn,p = o log p P |ω̂ij − ωij | > C3 max kn,p , n s by simply applying Theorem 2 with log p ≤ o p−δ+1 , n which further implies the rate of convergence for estimating sij log p , P |ω̂ij − sij | > C3 max kn,p n s log p M0 , ≤ o p−δ+1 . n p thr ) Define the adaptive thresholding estimator Ω̂thr = (ω̂ij p×p as in (32) and (33). Fol- lowing the proof of Theorems 6 and 7, we are able to obtain the following results. We shall omit the proof due to the limit of space. Theorem 9. 
Let $\lambda = (1+\varepsilon)\sqrt{2\delta\log p/n}$ for some $\delta \ge 3$ and $\varepsilon > 0$ in Equation (10). Assume that assumptions (42)-(45) hold. Then:
(i). Under the assumptions $k_{n,p} = o(\sqrt{n}/\log p)$ and
$$|s_{ij}| \ge 2\sqrt{\frac{2\xi_0\left(\omega_{ii}\omega_{jj}+\omega_{ij}^2\right)\log p}{n}}, \quad \forall\, s_{ij} \ne 0,$$
for some $\xi_0 > 2$, we have $\lim_{n\to\infty} P\left( S(\hat\Omega_{thr}) = S(S) \right) = 1$.
(ii). Under the assumptions $k_{n,p}^2 = O(n/\log p)$ and $n = O(p^{\xi})$ with some $\xi > 0$, the $\breve\Omega_{thr}$ defined in (38) with sufficiently large $\delta \ge 3 + 2\xi$ satisfies, for all $1 \le w \le \infty$,
$$E\,|||\breve\Omega_{thr} - S|||_w^2 \le C\, k_{n,p}^2\, \frac{\log p}{n}.$$

4. Discussion. In the analysis of Theorem 2 and nearly all subsequent results in Theorems 3-4 and Theorems 6-9, we have picked the penalty level $\lambda = (1+\varepsilon)\sqrt{2\delta\log p/n}$ for some $\delta \ge 1$ (or $\delta \ge 3$ for support recovery) and $\varepsilon > 0$. This choice of $\lambda$ can be too conservative and can cause some finite-sample estimation bias. In this section we show that $\lambda$ can be chosen smaller. Let $s_{\max} \le c_0\, n/\log p$ with a sufficiently small constant $c_0 > 0$ and $s_{\max} = O(p^t)$ for some $t < 1$. Denote the cumulative distribution function of the $t(n-1)$ distribution by $F_{t(n-1)}$. Let
$$\lambda^{new}_{finite} = (1+\varepsilon)\, B/\sqrt{n-1+B^2}, \quad \text{where } B = F_{t(n-1)}^{-1}\left(1 - \tfrac{1}{2}(s_{\max}/p)^{\delta}\right).$$
It can be shown that $\lambda^{new}_{finite}$ is asymptotically equivalent to $\lambda^{new} = (1+\varepsilon)\sqrt{2\delta\log(p/s_{\max})/n}$. We can extend Theorems 2-4 and 6-9 to the new penalties $\lambda^{new}$ and $\lambda^{new}_{finite}$. All the results remain the same except that we need to replace $s$ (or $k_{n,p}$) in those theorems by $s + s_{\max}$ (or $k_{n,p} + s_{\max}$). Since all theorems (except the minimax lower bound theorem) are derived from Lemma 2, all we need is an extension of Lemma 2 as follows.

Lemma 1. Let $\lambda = \lambda^{new}$ or $\lambda^{new}_{finite}$ for any $\delta \ge 1$ and $\varepsilon > 0$ in Equation (10). Assume $s_{\max} \le c_0\, n/\log p$ with a sufficiently small constant $c_0 > 0$ and $s_{\max} = O(p^t)$ for some $t < 1$, and define the event $E_m$ by
$$\left|\hat\theta_{mm} - \theta^{ora}_{mm}\right| \le C_1' \lambda^2 (s+s_{\max}), \qquad \left\|\beta_m - \hat\beta_m\right\|_1 \le C_2' \lambda (s+s_{\max}),$$
$$\left\|X_{A^c}\left(\beta_m - \hat\beta_m\right)\right\|^2/n \le C_3' \lambda^2 (s+s_{\max}), \qquad \left\|X_{A^c}^T \epsilon_m/n\right\|_\infty \le C_4' \lambda,$$
for $m = i$ or $j$ and some constants $C_k'$, $1 \le k \le 4$. Under the assumptions of Theorem 2, we have $P(E_m^c) \le o\left(p^{-\delta+1}\right)$.

See its proof in the supplementary material for more details. Note that when $s = o(\sqrt{n}/\log p)$, we have $\|X_{A^c}(\beta_m - \hat\beta_m)\|^2/n \le C_3'\lambda^2(s+s_{\max}) = o(1/\sqrt{n})$ with high probability for every $s_{\max} = o(\sqrt{n}/\log p)$. Thus, for every choice of $s_{\max} = o(\sqrt{n}/\log p)$ in the penalty, our procedure yields asymptotically efficient estimation of every entry of the precision matrix as long as $s = o(\sqrt{n}/\log p)$. In Section 5 we set $\lambda = \lambda^{new}_{finite}$ with $\delta = 1$ for statistical inference and $\delta = 3$ for support recovery.

5. Numerical Studies. In this section, we present numerical results for both the asymptotic distribution and support recovery. We consider the following $200 \times 200$ precision matrix with three blocks. The block sizes are $100 \times 100$, $50 \times 50$ and $50 \times 50$, respectively. Let $(\alpha_1, \alpha_2, \alpha_3) = (1, 2, 4)$. The diagonal entries are $\alpha_1, \alpha_2, \alpha_3$ in the three blocks, respectively. When the entry is in the $k$-th block, $\omega_{j-1,j} = \omega_{j,j-1} = 0.5\,\alpha_k$ and $\omega_{j-2,j} = \omega_{j,j-2} = 0.4\,\alpha_k$, $k = 1, 2, 3$. The asymptotic variance for estimating each entry can therefore be very different, so a naive procedure that sets one universal threshold for all entries would likely fail. We first estimate the entries of the precision matrix and the partial correlations discussed in Remark 3, and consider the distributions of the estimators. We generate a random sample of size $n = 400$ from a multivariate normal distribution $N(0, \Sigma)$ with $\Sigma = \Omega^{-1}$.
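A minimal numpy sketch of this simulation design, with the block sizes, diagonal levels and the two off-diagonal bands exactly as described above (the helper name is ours):

```python
import numpy as np

def block_precision(sizes=(100, 50, 50), alphas=(1.0, 2.0, 4.0)):
    """Three-block precision matrix: within block k the diagonal is alpha_k,
    the first off-diagonal is 0.5*alpha_k, the second is 0.4*alpha_k."""
    p = sum(sizes)
    Omega = np.zeros((p, p))
    start = 0
    for size, a in zip(sizes, alphas):
        idx = np.arange(start, start + size)
        Omega[idx, idx] = a
        Omega[idx[:-1], idx[1:]] = Omega[idx[1:], idx[:-1]] = 0.5 * a
        Omega[idx[:-2], idx[2:]] = Omega[idx[2:], idx[:-2]] = 0.4 * a
        start += size
    return Omega

rng = np.random.default_rng(0)
Omega = block_precision()
Sigma = np.linalg.inv(Omega)
X = rng.multivariate_normal(np.zeros(Omega.shape[0]), Sigma, size=400)   # n = 400 training sample
```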
As mentioned in Remark 2, the penalty constant is chosen to be $\lambda^{new}_{finite} = B/\sqrt{n-1+B^2}$, where $B = t_q(1 - \hat{s}/(2p),\, n-1)$ with $\hat{s} = \sqrt{n}/\log p$, which is asymptotically equivalent to $\sqrt{(2/n)\log(p/\hat{s})}$. Table 1 reports the mean and standard error of our estimators for four entries of the precision matrix and the corresponding partial correlations, based on 100 replications. Figure 1 shows the histograms of our estimates with the theoretical normal density superimposed. We can see that the distributions of our estimates match the asymptotic normality in Theorem 3 quite well. We have tried other choices of dimension, e.g. p = 1000, and obtained similar results.

Table 1
Mean and standard error of the proposed estimators.

            ω_{1,2} = 0.5     ω_{1,3} = 0.4     ω_{1,4} = 0       ω_{1,10} = 0
ω̂_{1,j}     0.469 ± 0.051     0.380 ± 0.054     −0.042 ± 0.043    −0.003 ± 0.045
            r_{1,2} = −0.5    r_{1,3} = −0.4    r_{1,4} = 0       r_{1,10} = 0
r̂_{1,j}     −0.480 ± 0.037    −0.392 ± 0.043    0.043 ± 0.043     0.003 ± 0.046

Fig 1. Histograms of estimated entries. Top: entries ω_{1,2} and ω_{1,3} in the precision matrix; bottom: entries ω_{1,4} and ω_{1,10} in the precision matrix.

Support recovery of a precision matrix is of great interest. We compare our selection results with the GLasso. In addition to the training sample, we generate an independent sample of size 400 from the same distribution for validating the tuning parameter of the GLasso. The GLasso estimators are computed on the training data for a range of penalty levels, and we choose the penalty level that minimizes the likelihood loss $\{\mathrm{trace}(\hat\Sigma\hat\Omega) - \log\det(\hat\Omega)\}$ on the validation sample, where $\hat\Sigma$ is the sample covariance matrix. Our ANT estimators are computed based on the training sample only. As stated in Theorem 6, we use a slightly larger penalty constant to allow for selection consistency: $\lambda^{new}_{finite} = B/\sqrt{n-1+B^2}$, where $B = t_q(1 - (\hat{s}/p)^3/2,\, n-1)$, which is asymptotically equivalent to $\sqrt{(6/n)\log(p/\hat{s})}$. We then apply the thresholding step as in (33) with $\xi_0 = 2$. Table 2 shows the average selection performance over 10 replications. The true positives (rate) and false positives (rate) are reported. In addition to the overall performance, the summary statistics are also reported for each block. We can see that while both our ANT method and the graphical Lasso select all nonzero entries, ANT outperforms the GLasso in terms of the false positive rate and the false discovery rate.

Table 2
The performance of support recovery.

Block        Overall             Block 1             Block 2             Block 3
Method    GLasso    ANT       GLasso    ANT       GLasso    ANT       GLasso    ANT
TP        391       391       197       197       97        97        97        97
TPR       1         1         1         1         1         1         1         1
FP        5298.2    3.5       1961      1.2       288.4     1.1       162.1     1.1
FPR       0.2716    0.0004    0.4126    0.0003    0.2557    0.0010    0.1437    0.0010

Moreover, we compare our method with the GLasso over a range of penalty levels. Figure 2 shows the ROC curves for the GLasso with various penalty levels and for ANT with various thresholding levels in the follow-up procedure. The GLasso does not achieve comparable performance at any penalty level. In addition, the circle in the plot marks the performance of ANT with the threshold level selected as in (33), and the triangle marks the performance of the graphical Lasso with the penalty level chosen by cross-validation.
This again indicates that our method simultaneously achieves a very high true positive rate and a very low false positive rate. 6. Proof of Theorems 1-5. 6.1. Proof of Theorem 2-4. We will only prove Theorems 2 and 3. The proof of Theorem 4 is similar to that of Theorems 2 and 3. The following lemma is the key to the proof. Lemma 2. Let λ = (1 + ε) q 2δ log p n for any δ ≥ 1 and ε > 0 in Equation (10). Define the event Em as follows, (48) (49) (50) (51) ora θ̂mm − θmm βm − β̂m 1 2 XAc βm − β̂m /n T XAc m /n ∞ ≤ C10 λ2 s, ≤ C20 λs, ≤ C30 λ2 s, ≤ C40 λ, ● 0.6 0.7 tpr 0.8 0.9 1.0 23 ● 0.00 0.05 0.10 0.15 0.20 0.25 ANT GLasso 0.30 fpr Fig 2. The ROC curves of the graphical Lasso and ANT. Circle: ANT with our proposed thresholding. Triangle: GLasso with penalty level by CV. 24 Z. REN ET AL. for m = i or j and some constants Ck0 , 1 ≤ k ≤ 4. Under the assumptions of Theorem 2, we have c P (Em ) ≤ o p−δ+1 . 6.1.1. Proof of Theorems 2. We first prove (i). From Equation (48) of Lemma 2, the ora and θ ora . We then need only to consider large deviation probability in (16) holds for θii jj ora . On the event E ∩ E , the entry θij i j ora θ̂ij − θij = ˆ Ti ˆ j /n − Ti j /n T T j + XAc βj − β̂j /n − i j /n = i + XAc βi − β̂i ≤ XTAc i /n βj − β̂j + XTAc j /n βi − β̂i 1 1 ∞ ∞ + XAc βi − β̂i · XAc βj − β̂j /n 0 0 0 2 ≤ 2C2 C4 + C3 λ s, where the last step follows from inequalities (49)-(51) in Lemma 2. Thus we have P Θ̂A,A − Θora A,A ∞ > C1 s log p n ≤ o p−δ+1 , for some C1 > 0. Since ΘA,A has a bounded spectrum, the functional ζkl (ΘA,A ) = Θ−1 A,A kl is Lipschitz in a neighborhood of ΘA,A for k, l ∈ A, then Equation (17) is an immediate consequence of Equation (16). √ ora , η ora , where η ora = Now we prove part (ii). Define random vector η ora = ηiiora , ηij jj kl θora −θkl n √ kl 2 θkk θll +θkl . The following result is a multidimensional version of KMT quantile in- equality: there exist some constants D0 , ϑ ∈ (0, ∞) and random normal vector Z = √ (Zii , Zij , Zjj ) ∼ N 0, Σ̆ with Σ̆ = Cov(η ora ) such that whenever |Zkl | ≤ ϑ n for all kl, we have (52) D0 2 2 + Zjj . kη ora − Zk∞ ≤ √ 1 + Zii2 + Zij n See Proposition [KMT] in Mason and Zhou (2012) for one dimensional case and consult √ Einmahl (1989) for multidimensional case. Note that nη ora can be written as a sum of n i.i.d. random vectors with mean zero and covariance matrix Σ̆, each of which is subexponentially distributed. Hence the assumptions of KMT quantile inequality in literature are satisfied. With a slight abuse of notation, we define Θ = (θii , θij , θjj ). To prove the desired coupling inequality (18), let’s use the Taylor expansion of the function ωij (Θ) = 25 2 −θij / θii θjj − θij to obtain ora ωij − ωij = h∇ωij (Θ) , Θora − Θi + (53) Rβ (Θora ) (Θ − Θora )β . X |β|=2 The multi-index notation of β = (β1 , β2 , β3 ) is defined as |β| = Dβ f (x) = ∂ |β| f β β β . ∂x1 1 ∂x2 2 ∂x3 3 P k βk , xβ = Q k xβk k and The derivatives can be easily computed. To save the space, we omit their explicit formulas. The coefficients in the integral form of the remainder with α |β| = 2 have a uniform upper bound Rβ Θora A,A ≤ 2 max|α|=2 maxΘ∈B D ωij (Θ) ≤ C2 , where B is some sufficiently small compact ball with center Θ when Θora is in this ball B, which is satisfied by picking a sufficiently small value ϑ in our assumption kη ora k∞ ≤ √ ora ) . ora are standardized versions of ω ora − ω ϑ n. 
Recall that κora ij and (Θ − Θ ij and η ij Consequently there exist some deterministic constants h1 , h2 , h3 and Dβ with |β| = 2 such ora as follows, that we can rewrite (53) in terms of κora ij and η X Dβ Rβ (Θora ) ora ora ora κora ij = h1 ηii + h2 ηij + h3 ηjj + √ |β|=2 n (η ora )β , which, together with Equation (52), completes our proof of Equation (18), ora κij − Z 0 ≤ 3 X ! C3 D1 2 2 |hk | kZ − η ora k∞ + √ kη ora k2 ≤ √ 1 + Zii2 + Zij + Zjj , n n k=1 where constants C3 , D1 ∈ (0, ∞) and Z 0 := h1 Z1 + h2 Z2 + h3 Z3 ∼ N (0, 1) . The last 2 + Z2 inequality follows from kη ora k2 ≤ C4 Zii2 + Zij jj for some large constant C4 , which can be shown using (52) easily. 6.1.2. Proof of Theorems 3. The triangle inequality gives ora ora |ω̂ij − ωij | ≤ ω̂ij − ωij − ωij , + ωij Ω̂A,A − ΩA,A ∞ ≤ Ω̂A,A − Ωora A,A ∞ + Ωora A,A − ΩA,A . ∞ From Equation (17) we have P Ω̂A,A − Ωora A,A ∞ log p > C1 s n ≤ o p−δ+1 . ora Now we give a tail bound for ωij − ωij and Ωora − Ω A,A A,A ∞ respectively. For the con- stant C > 0, we apply Equation (18) to obtain n P κora ij > C o √ C ≤ P max {|Zkl |} > ϑ n + Φ̄ 2 ≤ o(1) + 2 exp −C 2 /8 , D1 C 2 2 + P √ 1 + Zii2 + Zij + Zjj > n 2 26 Z. REN ET AL. according to the inequality Φ̄ (x) ≤ 2 exp −x2 /2 for x > 0 and the union bound of three normal tail probabilities. This immediately implies that for large C4 and large n, ( r ) 1 3 ora P ωij − ωij > C4 ≤ , n 4 which, together with Equations (17), yields that for C2 > C1 + C4 , ( ( log p P |ω̂ij − ωij | > C2 max s , n r )) 1 n ≤ . Similarly, Equation (18) implies o n p ≤ P κora ij > C log p ! √ √ C log p P max {|Zkl |} > ϑ n + Φ̄ 2 ( ) √ C log p D1 2 2 2 +P √ 1 + Zii + Zij + Zjj > n 2 ≤ O p−C 2 /8 , where the first and last components in the first inequality are negligible due to log p ≤ c0 n with a sufficiently small c0 > 0, which follows from the assumption s ≤ c0 n/ log p. That immediately implies that for C5 large enough, s log p > C − Ω = o(p−δ ), P Ωora 5 A,A A,A ∞ n which, together with Equations (17), yields that for C3 > C10 + C5 . s log p log p P Ω̂A,A − ΩA,A > C3 max s , ≤ o p−δ+1 . n ∞ n Thus we have the following union bound over all p 2 pairs of (i, j), s log p log p , ≤ p2 /2 · o p−δ+1 = o p−δ+3 . P Ω̂ − Ω > C3 max s n ∞ n Write √ √ √ ora n Ω̂A,A − ΩA,A = n Ω̂A,A − Ωora A,A + n ΩA,A − ΩA,A . Under the assumption s = o √ n log p , we have √ n Ω̂A,A − Ωora A,A ∞ = op (1), which together with Equation (18) further implies √ D n (ω̂ij − ωij ) ∼ √ ora D 2 n ωij − ωij → N 0, ωii ωjj + ωij . 27 6.2. Proof of Theorem 5. In this section we show that the upper bound given in Section 2.2 is indeed rate optimal. We will only establish Equation (30). Equation (31) is an q log p n immediate consequence of Equation (30) and the lower bound for estimation of diagonal covariance matrices in Cai, Zhang and Zhou (2010). The lower bound is established by Le Cam’s method. To introduce Le Cam’s method we first need to introduce some notation. Consider a finite parameter set G0 = {Ω0 , Ω1 , . . . , Ωm∗ } ⊂ G0 (M, kn,p ). Let PΩm denote the joint distribution of independent observations X (1) , X (2) ,. . . , X (n) with each X (i) ∼ N 0, Ω−1 m , 0 ≤ m ≤ m∗ , and fm denote the corre sponding joint density, and we define m∗ 1 X PΩ . P̄ = m∗ m=1 m (54) For two distributions P and Q with densities p and q with respect to any common dominating measure µ, we denote the total variation affinity by kP ∧ Qk = R p ∧ qdµ. The following lemma is a version of Le Cam’s method (cf. Le Cam (1973), Yu (1997)). Lemma 3. Let X (i) be i.i.d. 
N (0, Ω−1 ), i = 1, 2, . . . , n, with Ω ∈ G0 . Let Ω̂ = (ω̂kl )p×p (m) be an estimator of Ωm = ωkl p×p , then α 1 (m) sup P ω̂ij − ωij > ≥ PΩ0 ∧ P̄ , 2 2 0≤m≤m∗ (m) where α = inf 1≤m≤m∗ ωij (0) − ωij . Proof of Theorem 5: We shall divide the proof into three steps. Without loss of generality, consider only the cases (i, j) = (1, 1) and (i, j) = (1, 2). For the general case ωii or ωij with i 6= j, we could always permute the coordinates and rearrange them to the special case ω11 or ω12 . Step 1: Constructing the parameter set. We first define Ω0 , (55) Σ0 = (0) i.e. Σ0 = σkl 1 b 0 ... 0 −1 0 , and Ω0 = Σ0 = .. . b 1 0 ... 0 0 0 1 ... .. .. .. . . . . . . 0 0 0 0 1 1 1−b2 −b 1−b2 −b 1−b2 1 1−b2 0 .. . 0 .. . 0 0 0 ... 0 1 ... 0 , .. . . .. . . . 0 (0) p×p 0 ... 0 0 1 (0) is a matrix with all diagonal entries equal to 1, σ12 = σ21 = b and the rest all zeros. Here the constant 0 < b < 1 is to be determined later. For Ωm , 1 ≤ m ≤ m∗ , 28 Z. REN ET AL. the construction is as follows. Without loss of generality we assume kn,p ≥ 2. Denote by H the collection of all p × p symmetric matrices with exactly (kn,p − 1) elements equal to 1 between the third and the last elements on the first row (column) and the rest all zeros. Define n o G0 = Ω : Ω = Ω0 or Ω = (Σ0 + aH)−1 , for some H ∈ H , (56) where a = q τ1 log p n for some constant τ1 which is determined later. The cardinality of G0 / {Ω0 } is ! p−2 . kn,p − 1 ∗ m = Card (G0 ) − 1 = Card (H) = 1 2 We pick the constant b = (1 − 1/M ) and 0 < τ1 < min 2 2 2 (1−b2 ) (1−1/M )2 −b2 (1−b ) , 2C0 (1+b2 ) , 4ν(1+b2 ) C0 such that G0 ⊂ G0 (M, kn,p ). First we show that for all Ωi , 1/M ≤ λmin (Ωi ) < λmax (Ωi ) ≤ M . (57) For any matrix Ωm , 1 ≤ m ≤ m∗ , some elementary calculations yield that q λ1 Ω−1 m = 1+ = λ3 Ω−1 = . . . = λp−1 Ω−1 = 1. m m λ2 Ω−1 m Since b = 1 2 b2 + (kn,p − 1) a2 , λp Ω−1 =1− m (1 − 1/M ) and 0 < τ1 < (58) q b2 + (kn,p − 1) a2 , (1−1/M )2 −b2 , C0 we can show that 1− q b2 + (kn,p − 1) a2 ≥ 1 − 1+ q b2 + (kn,p − 1) a2 < 2 − 1/M < M, p b2 + τ1 C0 > 1/M, which imply 1/M ≤ λ−1 Ω−1 = λmin (Ωm ) < λmax (Ωm ) = λ−1 Ω−1 ≤ M. m p m 1 As for matrix Ω0 , similarly we have λ1 Ω−1 0 = 1 + b, λp Ω−1 = 1 − b, 0 = λ3 Ω−1 = . . . = λp−1 Ω−1 = 1, 0 0 λ2 Ω−1 0 thus 1/M ≤ λmin (Ω0 ) < λmax (Ω0 ) ≤ M for the choice of b = 1 2 (1 − 1/M ). Now we show that the number of nonzero off-diagonal elements in Ωm , 0 ≤ m ≤ m∗ is no more than kn,p per row/column. From the construction of Ω−1 m , there exists 29 T some permutation matrix Pπ such that Pπ Ω−1 m Pπ is a two-block diagonal matrix with dimensions (kn,p + 1) and (p − kn,p − 1), of which the second block is an identity matrix, T then Pπ Ω−1 m Pπ −1 = Pπ Ωm PπT has the same blocking structure with the first block of dimension (kn,p + 1) and the second block being an identity matrix, thus the number of nonzero off-diagonal elements is no more than kn,p per row/column for Ωm . Therefore, we have G0 ⊂ G0 (M, kn,p ) from Equation (57). Step 2: Bounding α. From the construction of Ω−1 m and the matrix inverse formula, it can be shown that for any precision matrix Ωm we have (m) ω11 = −b 1 (m) , and ω12 = , 1 − b2 − (kn,p − 1) a2 1 − b2 − (kn,p − 1) a2 for 1 ≤ m ≤ m∗ , and for precision matrix Ω0 we have (0) ω11 = 1 −b (0) , ω12 = . 
2 1−b 1 − b2 Since b2 + (kn,p − 1) a2 < (1 − 1/M )2 < 1 in Equation (58), we have (59) inf (0) (m) ω11 − ω11 = inf (0) (m) ω12 − ω12 = 1≤m≤m∗ 1≤m≤m∗ (kn,p − 1) a2 ≥ C3 kn,p a2 , (1 − b2 ) (1 − b2 − (kn,p − 1) a2 ) b (kn,p − 1) a2 ≥ C4 kn,p a2 , (1 − b2 ) (1 − b2 − (kn,p − 1) a2 ) for some constants C3 , C4 > 0. Step 3: Bounding the affinity. The following lemma will be proved in Section 8. Lemma 4. Let P̄ be defined in (54). We have PΩ ∧ P̄ ≥ C5 0 (60) for some constant C5 > 0. Lemma 3, together with Equations (59), (60) and a = 1 2 0≤m≤m∗ 1 (m) sup P ω̂12 − ω12 > 2 0≤m≤m∗ sup (m) P ω̂11 − ω11 > q τ1 log p n , imply C3 τ1 kn,p log p ≥ C5 /2, n C4 τ1 kn,p log p · ≥ C5 /2, n · which match the lower bound in (30) by setting C1 = min {C3 τ1 /2, C4 τ1 /2} and c1 = C5 /2. Remark 10. kn,p q log p n · q log p n Note that |||Ωm |||1 is at order of kn,p |||Ωm |||1 q log p n . q log p n , which implies kn,p log p n = This observation partially explains why in literature 30 Z. REN ET AL. q people need assume bounded matrix l1 norm of Ω to derive the lower bound rate log p n . For the least favorable parameter space, the matrix l1 norm of Ω cannot be avoided in the upper bound. But the methodology proposed in this paper improves the upper bounds in literature by replacing the matrix l1 norm for every Ω by only matrix l1 norm bound of Ω in the least favorable parameter space. 6.3. Proof of Theorem 1. The probabilistic results (i) and (ii) are the immediate consequences of Theorems 2 and 5. We only need to show the minimax rate of convergence result (3). According to the probabilistic lower bound result (30) in Theorem 5, we immediately obtain that ( kn,p log p , C2 inf sup E |ω̂ij − ωij | ≥ c1 max C1 ω̂ij G0 (M,kn,p ) n r ) 1 n . Thus it is enough to show there exists some estimator of ωij such that it attains this upper bound. More precisely, we define the following estimator based on ω̂ij defined in (12) to control the improbable case for which Θ̂A,A is nearly singular. ω̆ij = sgn(ω̂ij ) · min {|ω̂ij | , log p} . n (16) and (18) in Theorem 2 imply o kn,p log p ora that the Equations n , ωij ≤ 2M . Note P {Gc } ≤ C p−δ+1 + exp (−cn) for some constants C ora ≤ C Define the event G = ω̂ij − ωij 1 and c. Now according to the variance of inverse Wishart distribution, we pick δ ≥ 2ξ + 1 to complete our proof ora ora ora E |ω̆ij − ωij | ≤ E ω̆ij − ωij − ωij 1 {G} + E ω̆ij − ωij 1 {Gc } + E ωij kn,p log p ora 2 + P {Gc } E log p + ωij ≤ C1 n δ+1 1 kn,p log p ≤ C1 + C2 p− 2 log p + C3 √ n n ( 0 ≤ C max kn,p log p , n 1/2 + E ora ωij − ωij 2 1/2 r ) 1 n , where C2 , C3 and C 0 are some constants and the last equation follows from the assumption n = O pξ . 7. Proof of Theorems 6-9. 7.1. Proof of Theorem 6. When δ > 3, from Theorem 2 it can be shown that the following three results hold: 31 (i). For any constant ε > 0, we have ) ω̂ii ω̂jj + ω̂ 2 ij P sup 2 − 1 > ε → 0; (i,j) ωii ωjj + ωij ( (61) (ii). There is a constant C1 > 0 such that ( (62) log p ora P sup ωij − ω̂ij > C1 s n (i,j) ) → 0; (iii). For any constant 2 < ξ1 , we have (63) s ora ωij − ωij 2ξ1 log p → 0. P sup q > (i,j) ω ω + ω 2 n ii jj ij In fact, under the assumption δ ≥ 3, Equation (17) in Theorem 2 and the union bound over all pair (i, j) imply the second result (62), which further shows the first result (61) 2 is bounded below and because that ω̂ij and ω̂ii are consistent estimators and ωii ωjj + ωij above. 
For the third result, we apply Equation (18) from Theorem 2 and pick 2 < ξ2 < ξ1 √ √ and a = ξ1 − ξ2 to show that P n o ora p κij > 2ξ1 log p p √ ≤ P max {|Zkl |} > ϑ n + Φ̄ 2ξ2 log p p D1 2 2 +P √ 1 + Zii2 + Zij + Zjj > a 2 log p n s ≤ O(p−ξ2 1 ), log p where the last inequality follows from log p = o(n). The third result (63) is thus obtained by the union bound with 2 < ξ2 . Essentially Equation (36) and Equation (37) are equivalent to each other. Thus we only show that Equation (37) in Theorem 6 is just a simple consequence of results (i), (ii) and (iii). Set ε > 0 sufficiently small and ξ ∈ (2, ξ0 ) sufficiently close to 2 such that 32 Z. REN ET AL. p √ √ 2 2ξ0 − 2ξ0 (1 + ε) > 2ξ and ξ0 (1 − ε) > ξ, and 2 < ξ1 < ξ. We have P S(Ω̂thr ) = S(Ω) = P thr thr ω̂ij 6= 0 for all (i, j) ∈ S(Ω) + P ω̂ij = 0 for all (i, j) ∈ / S(Ω) v u u 2ξ ω̂ ω̂ + ω̂ 2 log p ii jj t 0 ij for all (i, j) ∈ S(Ω) |ω̂ij | > n v u u 2 t 2ξ0 ω̂ii ω̂jj + ω̂ij log p for all (i, j) ∈ / S(Ω) +P |ω̂ij | ≤ n s ) ( ω̂ii ω̂jj + ω̂ 2 2ξ log p |ω̂ij − ωij | ij ≤ − P sup ≥ P sup q 2 − 1 > ε , (i,j) n (i,j) ωii ωjj + ωij ω ω + ω2 = P ii jj ij which is bounded below by o n s ora log p ora P sup ω − ω̂ > C s + ωij − ωij ij 1 2ξ1 log p n (i,j) ij 2 P sup q − ≤ ω̂ii ω̂jj +ω̂ij = 1+o (1) , (i,j) ω ω + ω 2 n P sup − 1 > ε (i,j) ω ω +ω 2 ii jj ij ii jj where s = o p n/ log p implies s logn p = o p ij (log p) /n . 7.2. Proof of Theorem 8. The proof of this Theorem is very similar to that of Theorem 2. Due to the limit of space, we follow the line of the proof of Theorem 2 and Theorem 3, but only give necessary details when the proof is different from that of Theorem 2. Note that β for the latent variable graphical model is not sparse enough in the sense that √ ! n |sij − lij | . 6= o max Σi6=j min 1, j λ log p Our strategy is to decompose it into two parts, −1 S L β =SO\A,A Ω−1 A,A − LO\A,A ΩA,A := β −β , (64) −1 L = L where β S = SO\A,A Ω−1 O\A,A ΩA,A correspond the sparse and low-rank A,A and β components respectively. We expect the penalized estimator β̂ in (10) is closer to the sparse part β S than β itself, which motivates us to rewrite the regression model as follows, XA = XO\A β S + A − XO\A β L := XO\A β S + SA , (65) with SA = A − XO\A β L , and define (66) S Θora,S A,A = A T SA /n. 33 We can show our estimator Θ̂A,A is within a small ball of the ”oracle” Θora,S A,A with radius at the rate of kn,p λ2 , where kn,p is the sparsity of model (65). More specifically, similar to Lemma 2 for the proof of Theorem 2 we have the following result for the latent variable graphical model. The proof is provided in the supplementary material. Lemma 5. Let λ = (1 + ε) q 2δ log p n for any δ ≥ 1 and ε > 0 in Equation (10). Define the event E as follows, ora,S θ̂mm − θmm S βm − β̂m 1 2 S XO\A βm − β̂m /n T XO\A S m /n (67) ∞ ≤ C10 kn,p λ2 , ≤ C20 kn,p λ, ≤ C30 kn,p λ2 , ≤ C40 λ. for m = i and j and some constants Ck0 , 1 ≤ k ≤ 4. Under the assumptions in Theorem 8, we have P {E c } ≤ o p−δ+1 . Similar to the argument in Section 6.1.1, from Lemma 5 we have T ora,S S T ≤ Ckn,p λ2 /n − ˆ ˆ /n θ̂ij − θij = S i j i j on the event E, thus there is a constant C1 > 0 such that P Θ̂A,A − Θora,S A,A ∞ log p > C1 kn,p n ≤ o p−δ+1 . Later we will show that there are some constant C2 and C3 such that (68) ora,S P Θora A,A − ΘA,A ∞ log p > C2 kn,p n ≤ C3 p−2δ , which implies P Θ̂A,A − Θora A,A ∞ > C4 kn,p log p n ≤ o p−δ+1 , for some constant C4 > 0. Then following the proof of Theorem 3 exactly, we establish Theorem 8. 
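The support-recovery step analysed in the proof of Theorem 6 above is entrywise hard thresholding of $\hat{\omega}_{ij}$ at level $\sqrt{2\xi_0(\hat{\omega}_{ii}\hat{\omega}_{jj}+\hat{\omega}_{ij}^2)\log p/n}$. The sketch below is an illustration only: the matrix `omega_hat` is a noisy stand-in for the regression-based entrywise estimates of the paper, and `xi0` is an arbitrary constant greater than 2, as required by the theorem.

```python
# Schematic sketch of the thresholding rule; omega_hat is assumed given
# (in the paper it comes from the scaled-lasso regressions).
import numpy as np

def ant_threshold(omega_hat, n, xi0=3.0):
    """Hard-threshold omega_hat[i,j] at sqrt(2*xi0*(w_ii*w_jj + w_ij^2)*log(p)/n)."""
    p = omega_hat.shape[0]
    d = np.diag(omega_hat)
    level = np.sqrt(2 * xi0 * (np.outer(d, d) + omega_hat**2) * np.log(p) / n)
    out = np.where(np.abs(omega_hat) >= level, omega_hat, 0.0)
    np.fill_diagonal(out, d)           # diagonal entries are not thresholded
    return out

rng = np.random.default_rng(0)
p, n = 30, 400
Omega = np.eye(p); Omega[0, 1] = Omega[1, 0] = 0.5        # a single strong edge
omega_hat = Omega + rng.normal(scale=1.0 / np.sqrt(n), size=(p, p))
omega_hat = (omega_hat + omega_hat.T) / 2                 # keep the stand-in symmetric
S_hat = ant_threshold(omega_hat, n) != 0
print("support recovered:", np.array_equal(S_hat, Omega != 0))
```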
Now we conclude the proof by establishing Equation (68). In fact, we will only show that log p ora ora,S P θi,i − θi,i > C2 kn,p ≤ C5 p−2δ , n 34 Z. REN ET AL. ora − θ ora,S and θ ora − θ ora,S can be shown for some C5 > 0. The tail bounds for θj,j i,j j,j i,j similarly. Write (69) ora,S ora θi,i −θi,i = Ti i − Si T Si n = T Ti i − i − XO\A βiL i − XO\A βiL /n = D1 +D2 . n where D1 = n 2X 2 T (k) i XO\A βiL = i,k · XO\A βiL n n k=1 D2 = 1 L T T (k) βi XO\A XO\A βiL ∼ var XO\A βiL · χ2(n) /n. n It is then enough to show that there are constants C2 and C5 such that P |Di | > log p C2 kn,p 2 n C5 −2δ p 2 ≤ for each Di , i = 1, 2. We first study D1 , which is an average of n i.i.d. random variables (k) i,k · XO\A βiL with (k) L L i ∼ N 0, Ω−1 A,A , and XO\A βi ∼ N 0, βi L where Ω−1 A,A has bounded spectrum, and βi T T ΣO\A,O\A βiL , ΣO\A,O\A βiL ≤ assumption (43) that elements of β L are at an order of C5 p . C6 p for some C6 for the From classical large deviations bounds, there exist some uniform constants c1 , c2 > 0 such that P √ (k) L n · pX β i,k k=1 O\A i P ≤ 2 exp −nt2 /c2 for 0 < t < c1 . > t n See, for example, Theorem 2.8 of Petrov (1995). By setting t = s ( P |D1 | > 2 (70) where 2 q 2δc2 log p np 2δc2 log p np q 2δc2 log p n ) ≤ 2p−2δ , = o kn,p logn p from Equation (45). The derivation for the tail bound of D2 is similar by a large deviation bound for χ2(n) , ( 2 χ P (n) n ) >1+t ≤ 2 exp −nt2 /c2 for 0 < t < c1 , which implies (71) = o (1), we have (k) P |D2 | > var XO\A βiL 1 + s 2δc2 log p ≤ 2p−2δ , n 35 by setting t = o kn,p logn p q 2δc2 log p n (k) = o (1), where var XO\A βiL 1+ q 2δc2 log p n = O (1/p) = from Equation (45). Equations (70) and (71) imply the desired tail bound (68). 8. Proof of Auxiliary Lemmas. In this section we prove two key lemmas, Lemmas 2 and 4, for establishing our main results. 8.1. Proof of Lemma 2. We first reparameterize Equations (7) and (10) by setting (72) kXk k dk := √ bk , and Y := XAc · diag n √ n kXk k ! k∈Ac to make the analysis cleaner, and then rewrite the regression (7) and the penalized procedure (10) as follows, Xm = Ydtrue + m . (73) and (74) kXm − Ydk2 σ + + λ kdk1 , 2nσ 2 n o ˆ d, σ̂ : = arg min Lλ (d, σ) , Lλ (d, σ) : = (75) d∈Rp−2 ,σ∈R where the true coefficients of the reparameterized regression (74) are (76) dtrue = dtrue k k∈Ac kXk k , where dtrue = √ βm,k . k n Then the oracle estimator of σ can be written as (77) σ ora := ora 1/2 (θmm ) = Xm − Ydtrue √ n n o km k = √ , n n o ˆ σ̂ and β̂m , θ̂mm , and we have the following relationship between d, √ (78) β̂m,k = dˆk n , and θ̂mm = σ̂ 2 . kXk k n ˆ σ̂ The proof has two parts. The first part is the algebraic analysis of the solution d, o to the regression (74) by which we can define an event E such that Equations (48)-(51) in Lemma 2 hold. The second part of the proof is the probabilistic analysis of the E. 36 Z. REN ET AL. 8.1.1. Algebraic Analysis. The function Lλ (d, σ) is jointly convex in (d, σ). For fixed σ, denote the minimizer of Lλ (d, σ) over all d ∈ Rp−2 by dˆ(σλ), a function of σλ, i.e., ( dˆ(σλ) = arg min Lλ (d, σ) = arg min (79) d∈Rp−2 d∈Rp−2 kXm − Ydk2 + λσ kdk1 . 2n ) then if we knew σ̂ in the solution of Equation (75), the solution for the equation is o dˆ(σ̂λ) , σ̂ . Let µ = λσ. From the Karush-Kuhn-Tucker condition, dˆ(µ) is the solution n of Equation (79) if and only if YkT Xm − Ydˆ(µ) /n = µ · sgn dˆk (µ) , if dˆk (µ) 6= 0, (80) YkT Xm − Ydˆ(µ) /n ∈ [−µ, µ] , if dˆk (µ) = 0. To define the event E we need to introduce some notation. 
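Before introducing that notation, a minimal numerical sketch of the reparameterized program (73)-(79) may help fix ideas. It uses the alternating update of Sun and Zhang (2012b): for fixed $\sigma$ the $d$-step is an ordinary lasso with penalty $\lambda\sigma$ as in (79), and for fixed $d$ the $\sigma$-step is $\|X_m - Yd\|/\sqrt{n}$. The use of scikit-learn's `Lasso` as the inner solver and the toy design below are assumptions of convenience, not part of the paper.

```python
# Sketch of the scaled lasso (74)-(75) after the normalization (72), with the
# back-transformation (78) to (beta_hat, theta_hat).
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(Xm, Y, lam, n_iter=20):
    n = len(Xm)
    sigma = np.linalg.norm(Xm) / np.sqrt(n)            # crude initialization
    for _ in range(n_iter):
        # d-step: (79) with mu = lam * sigma; sklearn's Lasso minimizes
        # ||Xm - Y d||^2 / (2n) + alpha * ||d||_1
        d = Lasso(alpha=lam * sigma, fit_intercept=False).fit(Y, Xm).coef_
        sigma = np.linalg.norm(Xm - Y @ d) / np.sqrt(n)  # sigma-step
    return d, sigma

rng = np.random.default_rng(1)
n, q = 200, 50
X_rest = rng.normal(size=(n, q))                        # columns playing the role of X_{A^c}
Xm = 0.8 * X_rest[:, 0] - 0.6 * X_rest[:, 1] + rng.normal(scale=0.5, size=n)

norms = np.linalg.norm(X_rest, axis=0)
Y = X_rest * (np.sqrt(n) / norms)                       # normalization (72)

lam = np.sqrt(2 * np.log(q) / n)                        # illustrative penalty level
d_hat, sigma_hat = scaled_lasso(Xm, Y, lam)
beta_hat = d_hat * np.sqrt(n) / norms                   # back-transformation (78)
theta_hat = sigma_hat**2                                # estimate of theta_mm
print(np.round(beta_hat[:3], 2), round(theta_hat, 3))
```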
Define the l1 cone invertibility factor (CIF1 ) as follows, (81) CIF1 (α, K, Y) = inf T |K| Y n Y u ∞ : u ∈ C (α, K) , u 6= 0 , kuK k1 where |K| is the cardinality of an index set K, and n o C (α, K) = u ∈ Rp−2 : kuK c k1 ≤ α kuK k1 . Let = n k ∈ Ac , dtrue ≥λ k o (82) T (83) ν = YT Xm − Ydtrue /n ∞ = YT m /n , ∞ and 4 (1 + ξ) 8λ |T | true τ = λ max d , T c 1 CIF1 (2ξ + 1, T, Y) σ ora (84) for some ξ > 1 be specified later. Define E = ∩4i=1 Ii (85) where ξ−1 = ν≤σ λ (1 − τ ) , ξ+1 ( ) √ Ccif 3 = CIF1 (2ξ + 1, T, Y) ≥ > 0 with C = 1/ 10 2M cif (ξ + 1)2 q √ = σ ora ∈ 1/ (2M ), 2M (86) I1 (87) I2 (88) I3 (89) I4 = ora q √ kXk k c √ ∈ 1/ (2M ), 2M for all k ∈ A . n 37 Now we show Equations (48)-(51) in Lemma 2 hold on the event E defined in Equation (85). The following two results are helpful to establish our result. Their proofs are given in the supplementary material. Proposition 1. n ˆ d (µ) − dtrue (90) 1 T , c 1 (ν + µ) |T | , CIF1 (2ξ + 1, T, Y) ≤ (ν + µ) dˆ(µ) − dtrue . n Let ≤ max (2 + 2ξ) dtrue 2 1 (91) Y dtrue − dˆ(µ) Proposition 2. o ξ−1 For any ξ > 1, on the event ν ≤ µ ξ+1 , we have 1 n ˆ σ̂ d, o be the solution of the scaled lasso (75). For any ξ > 1, n o on the event I1 = ν ≤ σ ora λ ξ−1 ξ+1 (1 − τ ) , we have σ̂ σ ora − 1 ≤ τ . (92) n |ωij | λ o is defined in terms of Ω which has bounded max dtrue , λ |T | ≤ Cλs c Note that s = maxj Σi6=j min 1, spectrum, then on the event I4 , T 1 from the definition of T in Equation (82). On their intersection I1 ∩ I2 ∩ I3 , Proposition 2 and Proposition 1 with µ = λσ̂ imply that there exist some constants c1 , c2 and c3 such that , λ |T | , Tc 1 ≤ c2 max dtrue c , λ |T | , |σ̂ − σ ora | ≤ c1 λ max dtrue ˆ d (µ) − dtrue 1 2 Y dtrue − dˆ n T ≤ c3 λ max dtrue 1 T , λ |T | . c 1 then from the definition of β and Equations (6) and (78), √ n kXk k 2 ˆ β̂m,k = dk , θ̂mm = σ̂ , and XAc = Y · diag √ , kXk k n k∈Ac we immediately have ora θ̂mm − θ̂mm βm − β̂m 1 2 XAc βm − β̂m /n = |σ̂ − σ ora | · |σ̂ + σ ora | ≤ C10 λ2 s, ≤ C20 λs, ≤ C30 λ2 s, for some constants Ci0 , which are exactly Equations (48)-(50). From the definition of events I1 , I3 and I4 we obtain Equation (51), i.e., XTAc m /n ∞ ≤ C40 λ. 38 Z. REN ET AL. 8.1.2. Probabilistic Analysis. We will show that P {I1c } ≤ O p−δ+1 / log p , p P {Iic } ≤ o(p−δ ) for i = 2, 3 and 4, which implies P {E} ≥ 1 − o p−δ+1 . We will first consider P {I3c } and P {I4c }, then P {I2c }, and leave P {I1c } to the last, which relies on the bounds for P {Iic }, 2 ≤ i ≤ 4. (1). To study P {I3c } and P {I4c }, we need the following tail bound for the chi-squared distribution with n degrees of freedom, ) ( 2 χ (n) − 1 ≥ t ≤ 2 exp (−nt (t ∧ 1) /8) , P n (93) √ for t > 0. Since σ ora = km k / n with m ∼ N (0, θmm In ), and Xk ∼ N (0, σkk In ) with σkk ∈ (1/M, M ), we have n (σ ora )2 /θmm ∼ χ2(n) , and kXk k2 /σkk ∼ χ2(n) , then Equation (93) implies P {I3c } n = P (σ ( ) (σ ora )2 1 ) ∈ / [1/ (2M ) , 2M ] ≤ P − 1 ≥ θmm 2 o ora 2 ≤ 2 exp (−n/32) ≤ o(p−δ ), (94) and (95) ( P {I4c } =P kXk k2 ∈ / [1/ (2M ) , 2M ] for some k ∈ Ac n ) ≤ 2p exp (−n/32) ≤ o(p−δ ). (2). To study the term CIF1 (2ξ + 1, T, Y) of the event I2 , we need to introduce some notation first. Define n T πa± (Y) = max ± YA YA u/n − 1 G o T , and θa,b (Y) = max v T YA YB u/n. G where G = {(A, B, u, v) : (|A| , |B| , kuk , kvk) = (a, b, 1, 1) with A ∩ B = ∅} . 
then 1+πa+ (Y) and 1−πa− (Y) are the maximal and minimal eigenvalues of all submatrices of YT Y/n with dimensions no greater than a respectively, and θa,b (Y) satisfies that (96) 1/2 θa,b (Y) ≤ 1 + πa+ (Y) 1/2 1 + πb+ (Y) + ≤ 1 + πa∨b (Y) . 39 The following two propositions will be used to establish the probability bound for P {I2c }. Proposition 3 follows from Zhang and Huang (2008) Proposition 2(i). The proof of Proposition 4 is given in the supplementary material. Let rows of the data matrix X be i.i.d. copies of N (0, Σ), and denote Proposition 3. the minimal and maximal eigenvalues of Σ by λmin (Σ) and λmax (Σ) respectively. Assume that m ≤ c logn p with a sufficiently small constant c > 0, then for ε > 0 we have (97) n o − + P (1 − h)2 λmin (Σ) ≤ 1 − πm (X) ≤ 1 + πm (X) ≤ (1 + h)2 λmax (Σ) ≥ 1 − ε, where h = q m n + q . 2 m log p−log(ε/2) n Proposition 4. For CIF1 (α, K, Y) defined in (81) with |K| = k, we have, for any 0 < l ≤ p − k, (98) 1 CIF1 (α, K, Y) ≥ (1 + α) (1 + α) ∧ q 1+ l k s − 1 − πl+k (Y) − α k θ4l,k+l (Y) . 4l Note that there exists some constant C such that Cs is an upper bound of the |T | from the definitions of s and set T (82) on I4 . For l ≥ Cs Proposition 4 gives CIF1 (2ξ + 1, T, Y) ≥ 1 1 − π − l+|T | (Y) − (2ξ + 1) 4 (1 + ξ)2 (99) ≥ 1 1 − π − (Y) − (2ξ + 1) 4l 4 (1 + ξ)2 s s |T | θ (Y) 4l 4l,|T |+l Cs + 1 + π4l (Y) , 4l where the second inequality follows from (96). From the definition Y = XAc ·diag and the property that πa± (Y) are increasing as functions of a, we have √ n kXk k k∈Ac ) ( √ ) n n + + 1+ ≤ maxc 1 + π4l (XAc ) ≤ maxc 1 + π4l (X) , k∈A k∈A kXk k kXk k ( √ ) ( √ ) n n − − − 1 − π4l (X) ≤ minc 1 − π4l (XAc ) ≤ 1 − π4l (Y) , minc k∈A k∈A kXk k kXk k ( √ + π4l (Y) and by applying Proposition 3 to the data matrix Y with m = 4l = 4 (2ξ + 1) M 3 2 Cs > 40 Z. REN ET AL. Cs and ε = p−2δ , we have on I4 s Cs + 1 + π4l (Y) 4l s ( √ ) ( √ ) n Cs n 2 2 ≥ (1 − h) λmin (Σ) minc (1 + h) λmax (Σ) maxc − (2ξ + 1) k∈A k∈A kXk k 4l kXk k − 1 − π4l (Y) − (2ξ + 1) s √ 1 Cs ≥ (1 − h) √ − (2ξ + 1) (1 + h)2 2M M 4l 2M M 1 1 4 ≥ √ (1 − h)2 − √ (1 + h)2 ≥ √ 3 3 2M 2 2M 10 2M 3 2 with probability at least 1 − ε, where h = P {I4c } ≤ o(p−δ ), thus we established that q P {I2c } r m log p−log(p−2δ /2) n m n + ≤ o(p−δ ). 2 = o(1). Note (3). Finally we study the probability of event I1 . The following tail probability of t distribution is helpful in the analysis. Proposition 5. Let Tn follows a t distribution with n degrees of freedom. Then there exists εn → 0 as n → ∞ such that ∀t > 0 n 2 /(n−1) P Tn2 > n e2t o −1 2 ≤ (1 + εn ) e−t / π 1/2 t . Please refer to Sun and Zhang (2012b) Lemma 1 for the proof. According to the definition of ν in Equation (83) we have ν σ ora = maxc |hk | , with hk = k∈A Note that each column of Y has norm kYk k = √ YkT m for k ∈ Ac . nσ ora n by the normalization step (72). Given XAc , equivalently Y, we have √ n − 1hk q 1 − h2k r = n−1 q n = q √ YkT m / nθmm / Tm m /nθmm q Tm m − Tm Yk YkT m /n /nθmm / Tm m /nθmm √ YkT m / nθmm ! ∼ t(n−1) , r c 2 PYk m / (n − 1) θmm where t(n−1) is t distribution with n − 1 degrees of freedom, since the numerator follows a standard normal and the denominator follows an independent q χ2(n−1) / (n − 1). From 41 Proposition 5 we have s P |hk | > ( = P ( ≤ P 2t2 n (n − 1) h2k 2 (n − 1) t2 /n > 2 1 − 2t2 /n 1 − hk ) ( ≤P 2 (n − 1) t2 / (n − 2) (n − 1) h2k > 2 1 − t2 / (n − 2) 1 − hk 2 (n − 1) h2k 2t /(n−2) > (n − 1) e − 1 1 − h2k ) 2 ) ≤ (1 + εn−1 ) e−t / π 1/2 t . 
where the first inequality holds when t2 ≥ 2, and the second inequality which follows the √ q log p 2δ (1 + ε) fact ex −1 ≤ x/(1− x2 ) for 0 < x < 2. Now let t2 = δ log p > 2, and λ = n ξ−1 with ξ = 3/ε+1, then we have λ ξ+1 (1 − τ ) ≥ q 2δ log p n τ defined in Equation (84) satisfies τ = O sλ 2 which is sufficiently small on I2 ∩ I3 ∩ I4 . for sufficiently small τ. Clearly, the Therefore we have n P ∩4i=1 Ii o s ν 2δ log p ≤ − P {(I2 ∩ I3 ∩ I4 )c } P σ ora n s 2δ log p 1 − p · P |hk | > − P {(I2 ∩ I3 ∩ I4 )c } ≥ 1 − O n ≥ ≥ (100) ! p−δ+1 √ , log p √ which implies immediately P {I1c } ≤ O p−δ+1 / log p . 8.2. Proof of Lemma 4. Now we establish the lower bound (60) for the total variation affinity. Since the affinity R q0 ∧ q1 dµ = 1 − 1 2 R |q0 − q1 | dµ for any two densities q0 and q1 , Jensen’s Inequality implies 2 Z |q0 − q1 | dµ Hence R = q0 ∧ q1 dµ ≥ 1 − 21 ∆= 2 Z Z Z 2 q0 − q1 (q0 − q1 )2 q1 q0 dµ ≤ dµ = dµ − 1. q q q0 0 0 R q12 q0 dµ 1/2 −1 Z ( 1 Pm∗ f )2 m=1 m m∗ f0 . To establish (60), it thus suffices to show that Z 1 X fm fl −1= 2 − 1 → 0. m∗ m,l f0 The following lemma is used to calculate the term Lemma 6. R (fm fl /f0 − 1) in ∆. Let gs be the density function of N (0, Σs ), s = 0, m or l. Then Z i−1/2 gm gl h −1 = det I − Σ−1 (Σ − Σ ) Σ (Σ − Σ ) . m 0 0 l 0 0 g0 42 Z. REN ET AL. Let Σm = Ω−1 m for 0 ≤ m ≤ m∗ . Lemma 6 implies Z fm fl = f0 Z gm gl g0 n = [det (I − Ω0 (Σm − Σ0 ) Ω0 (Σl − Σ0 ))]−n/2 . Let J(m, l) be the number of overlapping nonzero off-diagonal elements between Σm and Σl in the first row. Recall the simple structures of Ω0 (55) and Σm −Σ0 by our construction. Elementary calculations yield that 1 + b2 Ja2 )2 , (1 − b2 )2 det (I − Ω0 (Σm − Σ0 ) Ω0 (Σl − Σ0 )) = (1 − which is 1 when J = 0. Now we set d := 1+b2 (1−b2 )2 > 1 to simplify our notation. It is easy to p−2 kn,p −1 p−1−kn,p kn,p −1 j kn,p −1−j . see that the total number of pairs (Σm , Σl ) such that J(m, l) = j is Hence, ∆ = = (101) ≤ 1 m2∗ 1 m2∗ 1 m2∗ X Z X 0≤j≤kn,p −1 J(m,l)=j X X fm fl −1 f0 (1 − dja2 )−n − 1 0≤j≤kn,p −1 J(m,l)=j X 1≤j≤kn,p −1 p−2 kn,p − 1 ! kn,p − 1 j ! ! p − 1 − kn,p (1 − dja2 )−n . kn,p − 1 − j Note that (1 − dja2 )−n ≤ (1 + 2dja2 )n ≤ exp n2dja2 = p2dτ1 j where the first inequality follows from the fact that dja2 ≤ dkn,p a2 ≤ 1+b2 τ C (1−b2 )2 1 0 < 1/2. Hence, ∆ ≤ kn,p −1 p−1−kn,p j kn,p −1−j 2dτ1 j p p−2 kn,p −1 1≤j≤kn,p −1 X 2 (kn,p −1)! (kn,p −1−j)! p2dτ1 j (p−2)!(p−2kn,p +j)! [(p−1−kn,p )!]2 !j 2 p2dτ1 kn,p X = 1≤j≤kn,p 1 j! −1 X ≤ 1≤j≤kn,p −1 p − kn,p − 1 where the last inequality follows from the facts that with each term less than kn,p and (p−2)!(p−2kn,p +j)! [(p−1−kn,p )!]2 , (kn,p −1)! (kn,p −1−j)! is a product of j terms is bounded below by a product of j v . terms with each term greater than (p − kn,p − 1) . Recall the assumption (27) p ≥ kn,p So for large enough p, we have p − kn,p − 1 ≥ p/2 and 2 kn,p p2dτ1 p − kn,p − 1 ≤ 2p2/ν · p2dτ1 p ≤ 2p−(v−2)/(2v) , 43 where the last step follows from the fact that τ1 ≤ (ν − 2) / (4νd). Thus p−j(v−2)/(2v) → 0, X ∆≤2 1≤j≤kn,p −1 which immediately implies (60). 9. Supplementary Material. In this supplement we collect proofs for proving auxiliary lemmas. 9.1. Proof of Lemma 5. The proof of this Lemma is similar to that of Lemma 2 in Section 8.1. 
But for the latent variables case in both algebraic analysis and probabilistic ora , σ ora , and ν by β S , θ ora,S , σ ora,S , S and ν S analysis we need to replace βi , θij i i i ij respectively, and subsequently define (102) I1 = (103) I3 = where σ ora,S = ora,S θii 1/2 = ξ−1 (1 − τ ) , ξ+1 q √ ∈ 1/ (2M ), 2M , ν S ≤ σ ora,S λ σ ora,S k√Si k n and ν S = YT Si /n , while the definitions of I2 ∞ and I4 are the same as before in Equations (87) and (89). As in Section 8.1 we define E = ∩4i=1 Ii and need to show that P {E} ≥ 1 − (1 + o(1))p−δ+1 . We will need only to show that P {I1c } ≤ o p−δ+1 , and P {I3c } ≤ o(p−δ ). The arguments for P {I2c } and P {I4c } are identical to those of the non-latent variables case in Section 8.1.2. It is relatively easy to establish the probabilistic bound for I3 . It is a consequence of the following two bounds, (104) n P (σ ) ( 2 ) ( χ (σ ora )2 1 1 (n) − 1 ≥ =P − 1 ≥ = o(p−δ ), ) ∈ / [3/ (4M ) , 5M/4] ≤ P θii n 4 4 o ora 2 which follows from Equation (93), and (105) P (σ ora )2 − σ ora,S 2 > 1/ (4M ) = o(p−δ ) which follows two bounds for D1 (70) and D2 (71). 44 Z. REN ET AL. Now we establish the probabilistic bound for I1 , i.e., P {I1c } ≤ o p−δ+1 . Write the event I1c as follows, νS − ν + ν ξ−1 > σ ora,S λ (1 − τ ) − σ ora ξ+1 = σ ora,S ξ−1 λ (1 − τ ) − ξ+1 s s 2δ log p + σ ora n s 2δ log p n 2δ log p ora,S + σ − σ ora n s 2δ log p + σ ora n s 2δ log p . n Set ξ = 6/ε + 5 such that ξ−1 λ (1 − τ ) > (1 + ε/2) ξ+1 for λ = (1 + ε) q 2δ log p n s 2δ log p n on I2 ∩ I3 . Then probabilistic bound for I1 is a consequence the following two bounds, s P ν > σ ora (106) p 2δ log p ≤ O p−δ+1 / log p , n which follows from Equation (100) or Proposition 5, and (107) ε P ν S − ν > σ ora,S 2 s 2δ log p ora,S + σ − σ ora n s 2δ log p = o p−δ , n which is established as follows. From Equations (103), D1 (70) and D2 (71), we have ε ε ora,S ora,S σ − σ − (σ ora ) ≥ 2 2 r 1 + p 1 −O 2M s log p np ! ε ≥ 4 r 1 2M with probability 1 − o p−δ . It is then enough to show ε P νS − ν > 4 (108) r 1 2M s 2δ log p = o p−δ n to establish Equation (107). On the event I4 we have S ν − ν ≤ (109) = Y T S Y T YT X c β L k i k i k A i max − ≤ max n k∈Ac k∈Ac n n √ n L T T 1 X (g) (g) n Xk XAc βi √ L Xk XAc βi , maxc T ≤ 2M maxc k∈A k∈A n n Xk g=1 45 where X (g) denotes the gth sample. Note that for each k ∈ Ac the term 1 n Pn (g) g=1 Xk (g) T XAc βiL in (109) is an average of n i.i.d. random variables with T L EX (g) X (g) βi Ac k CM ≤ ΣXX βiL ≤ kΣXX k2 βiL ≤ √ , 2 2 p T C (g) (g) V ar Xk XAc βiL ≤ . p From the classical large deviations bound in Theorem 2.8 of Petrov (1995), there exist some uniform constant c1 , c2 > 0 such that n √p X T T (g) (g) (g) (g) βiL > t ≤ 2 exp −nt2 /c2 for 0 < t < c1 . βiL − E Xk XAc P Xk XAc n g=1 then by setting t = q 2δc2 log p , n with probability 1 − o p−δ , we have s s √ CM 1 2δc log p 2δ log p 2 S = o . ν − ν ≤ 2M √ + √ p p n n where the last inequality follows from Equation (45). Therefore we have shown Equation (108), i.e., with probability 1 − o p−δ , s S ν − ν = o 2δ log p ε < n 4 r 1 2M s 2δ log p , n which together with Equation (106), imply the probabilistic bound for I1 . 9.2. Proof of Proposition 1. From the KKT conditions (80) we have Y T X − Y dˆ(µ) T X − Ydtrue i Y i ≤ µ + ν, = − n n ∞ ∞ YT Y true ˆ (110) d (µ) − d n where ν is defined in Equation (83). 
We also have = (111)≤ ≤ 2 1 Y dtrue − dˆ(µ) n T dtrue − dˆ(µ) YT Xi − Ydˆ(µ) − YT Xi − Ydtrue n true ˆ µ d − d (µ) + ν dtrue − dˆ(µ) 1 1 1 (µ + ν) dtrue − dˆ(µ) + 2µ dtrue c − (µ − ν) dtrue − dˆ(µ) T 1 T 1 T , c 1 46 Z. REN ET AL. where the first inequality follows from the KKT conditions (80). Then on the event n ν ≤ µ ξ−1 ξ+1 o we have (112) true ˆ(µ) d − d ξ 1 2 T 1 + dtrue Y dtrue − dˆ(µ) ≤ 2µ n ξ+1 If we could show that dˆ(µ)−dtrue ∈ C ξ+ζ 1−ζ , T T true − dˆ(µ) c d T 1 − . c ξ+1 1 , then by the definition (81) and inequality (110) we would obtain ˆ d (µ) − dtrue ≤ (113) 1 (ν + µ) |T | CIF1 ξ+ζ 1−ζ , T, Y . Suppose that 1+ξ ˆ true d (µ) − dtrue ≥ d (114) 1 ζ T , c 1 then the inequality (112) becomes 2 Y dtrue − dˆ(µ) n 2µ ≤ (ξ + ζ) dtrue − dˆ(µ) − (1 − ζ) dtrue − dˆ(µ) c . T 1 T 1 ξ+1 Thus we have dˆ(µ) − dtrue ∈ C ξ+ζ ,T . 1−ζ Combining this fact under the condition (114) with (113), we obtain the first desired inequality (90) 1 + ξ (ν + µ) |T | true ˆ . , d d (µ) − dtrue ≤ max ξ+ζ ζ T c 1 CIF 1 1 1−ζ , T, Y We complete our proof by letting ζ = 1/2 and noting that (111) implies the second desired inequality (91). 9.3. Proof of Proposition 2. For τ defined in Equation (84), we need to show that n o σ̂ ≥ σ ora (1 − τ ) and σ̂ ≤ σ ora (1 + τ ) on the event ν ≤ σ ora λ ξ−1 (1 − τ ) . Let dˆ(σλ) be ξ+1 the solution of (79) as a function of σ, then 2 Xi − Y dˆ(σλ) ∂ 1 Lλ dˆ(σλ) , σ = − ∂σ 2 2nσ 2 (115) since n o ∂ ˆ ∂d Lλ (d, σ) |d=d(σλ) k = 0 for all dˆk (σλ) 6= 0, and n n o o ∂ ˆ ∂σ d (σλ) k = 0 for all dˆk (σλ) = 0 which follows from the fact that k : dˆk (σλ) = 0 is unchanged in a neighborhood of σ for almost all σ. Equation (115) plays a key in the proof. 47 (1). To show that σ̂ ≥ σ ora (1 − τ ) it’s enough to show ∂ Lλ dˆ(σλ) , σ |σ=t1 ≤ 0. ∂σ where t1 = σ ora (1 − τ ), due to the strict convexity of the objective function Lλ (d, σ) in σ. Equation (115) implies that ∂ 2t21 Lλ dˆ(σλ) , σ |σ=t1 ∂σ = t21 − ≤ t21 − ≤ t21 2 Xm − Y dˆ(t1 λ) n 2 Xm − Ydtrue + Y dˆ(t1 λ) − dtrue − (σ n ora 2 true ) +2 d ≤ 2t1 (t1 − σ (116) ora T Y T X − Ydtrue m − dˆ(t1 λ) n ) + 2ν dtrue − dˆ(t1 λ) . 1 n o n o ξ−1 From Equation (90) in Proposition 1, on the event ν ≤ t1 λ ξ+1 = ν/σ ora < λ ξ−1 ξ+1 (1 − τ ) we have ˆ true d (t1 λ) − d ≤ max 2 (1 + ξ) dtrue 1 T , c 1 (ν + t1 λ) |T | . CIF1 (2ξ + 1, T, Y) then 2t21 ∂ Lλ dˆ(σλ) , σ |σ=t1 ∂σ (ν + t1 λ) |T | T 1 CIF1 (2ξ + 1, T, Y) 2σ ora λ |T | true ora ≤ 2t1 −τ σ + λ max 2 (1 + ξ) d , T c 1 CIF1 (2ξ + 1, T, Y) 2λ |T | 2 (1 + ξ) true , <0 d = 2t1 σ ora −τ + λ max T c 1 CIF1 (2ξ + 1, T, Y) σ ora ≤ 2t1 (t1 − σ ora ) + 2t1 λ max 2 (1 + ξ) dtrue , c where last inequality is from the definition of τ . (2). Let t2 = σ ora (1 + τ ).To show the other side σ̂ ≤ σ ora (1 + τ ) it is enough to show ∂ Lλ dˆ(σλ) , σ |σ=t2 ≥ 0. ∂σ n Equation (115) implies that on the event ν ≤ t2 λ ξ−1 ξ+1 o n = ν/σ ora < λ ξ−1 ξ+1 (1 + τ ) o we 48 Z. REN ET AL. have ∂ Lλ dˆ(σλ) , σ |σ=t2 ∂σ 2 Xm − Y dˆ(t2 λ) 2 = t2 − n 2t22 = t22 − (σ ora )2 + (σ ora )2 − = t22 − (σ ora )2 + n = t22 − (σ ora )2 + ≥ − (σ n 2 Xm − Ydtrue 2 − Xm − Y dˆ(t2 λ) t22 2 Xm − Y dˆ(t2 λ) dˆ(t2 λ) − dtrue T YT Xm − Ydtrue + Xm − Ydˆ(t2 λ) n ˆ true ) − d (t2 λ) − d (ν + t2 λ) . ora 2 1 Equation (90) and the fact 1 + τ ≤ 2 imply 2t22 ∂ Lλ dˆ(σλ) , σ |σ=t2 ∂σ ( ≥ (t2 + σ ora ) σ ora τ − max 2 (1 + ξ) (ν + t2 λ) dtrue ( ≥ (σ ora 2 ) (2 + τ ) τ − max T , c 1 2 (1 + ξ) (2λ (1 + τ )) true d ora σ T (ν + t2 λ)2 |T | CIF1 (2ξ + 1, T, Y) , c 1 ) 8 (1 + τ ) λ2 |T | CIF1 (2ξ + 1, T, Y) )! 
≥ (σ ora )2 τ, where last inequality is from the definition of τ . 9.4. Proof of Proposition 4. This Proposition essentially follows from the shifting inequality Proposition 5 in Ye and Zhang (2010). We will give a brief proof using results and notations in that paper. Define the generalized version of lq cone invertibility factor (81), 0 CIFq,l (α, K, Y) = inf T |K|1/q Y n Y u ∞ kuA kq : u ∈ C (α, K) , u 6= 0, |A\K| ≤ l . 0 (α, K, Y) = CIF 0 (α, K, Y) = CIF (α, K, Y). By EquaWhen q = 1 and l = p, CIFq,l 1 1,p tions (17), (18) and (20) of Ye and Zhang (2010) we have 0 CIF1 (α, K, Y) = CIF1,p (α, K, Y) ≥ ≥ 0 (α, K, Y) CIF2,l C1,2 α, |K| l φ̃∗2,l (α, K, Y) φ∗2,l (α, K, Y) C1,2 α, |K| l ≥ C1,2 α, |K| l (1 + α) ∧ q 1+ l |K| 49 where C1,2 α, |K| , φ∗2,l (α, K, Y) and φ̃∗2,l (α, K, Y) are defined on page 3523-3524 of Ye l and Zhang (2010). From the definition of φ̃∗2,l (α, K, Y) in Equation (20) of Ye and Zhang (2010), setting r = 2 (thus ar = 1/4 on page 3523) we have s − φ̃∗2,l (α, K, Y) ≥ 1 − πl+k (Y) − α k θ4l,k+l (Y) . 4l = 1 + α, then Since C1,2 α, |K| l 1 CIF1 (α, K, Y) ≥ (1 + α) (1 + α) ∧ q 1+ l k s − 1 − πl+k (Y) − α k θ4l,k+l (Y) . 4l 9.5. Proof of Lemma 1. Most of this proof is the same as that of Lemma 2. Hence we only emphasize the differences here and provide details whenever necessary. The proof also consists of algebraic analysis and probabilistic analysis. Recall that the main idea in the proof of Lemma 2 is to show the events ∩4i=1 Ii defined in (86)-(89) occur with high probability and whenever they hold, Proposition 2 and Proposition 1 establish the desired results. Since we decrease the penalty term λnew , the event I1 is no longer valid with high probability. Now we need to redefine an appropriate event I1new and show that it occurs with high probability and on I1new , similar properties like Proposition 2 and Proposition 1 also hold. Recall ν = khk∞ with h := YT m /n defined in (83). Similar as (82), we define the index set T = ≥ λnew k ∈ Ac , dtrue k of dtrue with large coordinates. Now by using smaller λnew , although ν is no longer smaller than σ ora λnew ξ−1 ξ+1 (1 − τ ) in I1 w.h.p., most coordinates of h still hold. Define a random index set T̂0 = k ∈ Ac , |hk | ≥ σ ora λnew ξ−1 (1 − τ ) . ξ+1 The new event can be defined as I1new = ν ≤ σ ora n o λnew ξ − 1 (1 − τ ) ∪ T̂0 ≤ Cu smax , 1−tξ+1 where Cu is some universal constant. Note that λnew ≥ (1 − t) λ by our assumption smax = O(pt ), thus the probabilistic analysis for I1 in Lemma 2 immediately implies that the first component of I1new holds w.h.p.. The second component holds w.h.p. can be shown by large deviation result of order statistics, where the first component is also can be seen as a special case. (See, e.g. Reiss (1989)). Therefore, we briefly showed that √ P {(I1new )c } ≤ p−δ+1 / log p . We also need to modify the event I2 a little bit as we not 50 Z. REN ET AL. only care about index T but also index T̂0 which are out of the bound. I2new = CIF1 ξ−1 ξ+2+ , T̂0 ∪ T, Y ≥ C > 0 . 1−t The probabilistic analysis of I2 in Lemma 2 implies that there is no difference by using I2new since the probability bound for I2 is universal for all index sets with cardinality less than s + Cu smax = o n log p . We don’t change events I3 and I4 . Thus we finish the probabilistic analysis. Now it’s enough for us to show that on ∩2i=1 Iinew ∩4i=3 Ii , the desired results hold. 
It’s not hard to see in the proof of Proposition 2 that as long as a similar property like Proposition 1 holds (we will provide details and prove this key result in a minute), Proposition 2 is still valid when we replace I1 by I1new in the assumption. The only thing we need to show is the following Proposition 6. Note on ∩2i=1 Iinew ∩4i=3 Ii , Proposition 2 is valid and hence the assumption of the following Proposition with µ = σ̂λnew is also satisfied. We then apply this Proposition 6 again to finish the algebraic analysis and hence complete our proof. Proposition 6. n For any ξ > 1, on the event ν ≤ µ ξ−1 1−t ξ+1 o n ∪ T̂1 ≤ Cu smax o with o n T̂1 = k ∈ Ac , |hk | ≥ µ ξ−1 ξ+1 , we have (117) (118) ˆ d (µ) − dtrue 1 ≤ max (ν + µ) T ∪ T̂1 , , Tc 1 CIF1 (2 + 2ξ) dtrue 2 1 Y dtrue − dˆ(µ) ≤ (ν + µ) dˆ(µ) − dtrue , 1 n where CIF1 above is short for CIF1 ξ + 2 + ξ−1 1−t , T ∪ T̂1 , Y . The proof is a modification of that for Proposition 1. We still have equation (110) ≤ µ + ν. Define ∆ (µ) := dtrue − dˆ(µ) . The equation (112) needs T Y Y ˆ d (µ) − dtrue n ∞ to be modified as follows, (119) (120) ∆T (µ) YT Xi − Ydˆ(µ) − YT Xi − Ydtrue kY∆ (µ)k2 = n n ξ−1 ≤ µ dtrue − dˆ(µ) + ν ∆ (µ)T̂1 + µ ∆ (µ)T̂ c 1 1 1 1 1 ξ+1 µ ξ−1 ≤ µ+ ∆ (µ)T ∪T̂1 + 2µ dtrue c 1 T 1 1−tξ+1 ξ−1 ∆ (µ) c . − µ−µ T ∪ T̂ ( ) 1 ξ+1 1 51 where the first inequality follows from the KKT conditions (80) and our assumed event. The remaining part is the same as that in Proposition 1. Suppose k∆ (µ)k1 ≥ 2 (1 + ξ) dtrue then inequality (120) becomes ξ−1 ≥ 0. ξ+2+ ∆ (µ)T ∪T̂1 − ∆ (µ)(T ∪T̂1 )c 1 1−t 1 Thus we have ξ−1 d − dˆ(µ) = ∆ (µ) ∈ C ξ + 2 + , T ∪ T̂1 . 1−t Combining this fact with equation (110), we obtain the first desired inequality (117). We true complete our proof by noting that (119) implies the second desired inequality (118). Now we show that λnew can be replaced by its finite sample version λnew f inite . As we have n o seen, the analysis of event I1 is the key result. All we need to show is that P |hk | > λnew f inite ≤ √ O p−δ , where √n−1h2k follows an t distribution with n − 1 degrees of freedom. Since hk 1−hk √ is an increasing function of √n−1h2k on R+ , we can take the quantile of t(n−1) distribution 1−hk rather than use the concentration inequality above. References. Belloni, A., Chernozhukov, V. and Hansen, C. (2012). Inference on treatment effects after selection. arXiv preprint arXiv:1201.0224. Bickel, P. J. and Levina, E. (2008a). Regularized estimation of large covariance matrices. Ann. Statist. 36 199-227. Bickel, P. J. and Levina, E. (2008b). Covariance regularization by thresholding. Ann. Statist. 36 25772604. Bühlmann, P. (2012). Statistical significance in high-dimensional linear models. arXiv preprint arXiv:1202.1377. Cai, T. T., Liu, W. and Luo, X. (2011). A constrained `1 minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. 106 594-607. Cai, T. T., Liu, W. and Zhou, H. H. (2012). Estimating Sparse Precision Matrix: Optimal Rates of Convergence and Adaptive Estimation. Manuscript. Cai, T. T., Zhang, C. H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38 2118-2144. Cai, T. T. and Zhou, H. H. (2012). Optimal rates of convergence for sparse covariance matrix estimation. Ann. Statist. 40 2389–2420. Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Found. of Comput. Math. 9 717-772. Chandrasekaran, V., Parrilo, P. A. and Willsky, A. S. (2012). 
Latent variable graphical model selection via convex optimization. Ann. Statist. 40 1935-1967.
d'Aspremont, A., Banerjee, O. and El Ghaoui, L. (2008). First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications 30 56-66.
Einmahl, U. (1989). Extensions of results of Komlós, Major, and Tusnády to the multivariate case. J. Multivariate Anal. 28 20-68.
El Karoui, N. (2008). Operator norm consistent estimation of large dimensional sparse covariance matrices. Ann. Statist. 36 2717-2756.
Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432-441.
Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge University Press.
Javanmard, A. and Montanari, A. (2013). Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. arXiv preprint arXiv:1301.4240.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37 4254-4278.
Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38-53.
Liu, W. (2013). Gaussian graphical model estimation with false discovery rate control. arXiv preprint arXiv:1306.0976.
Mason, D. and Zhou, H. H. (2012). Quantile coupling inequalities and their applications. Probability Surveys 9 439-479.
Meinshausen, N. and Bühlmann, P. (2006). High dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436-1462.
Petrov, V. V. (1995). Limit Theorems of Probability Theory. Oxford Science Publications.
Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing `1 penalized log-determinant divergence. Electron. J. Statist. 5 935-980.
Reiss, R. D. (1989). Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer-Verlag, New York.
Ren, Z. and Zhou, H. H. (2012). Discussion: Latent variable graphical model selection via convex optimization. Ann. Statist. 40 1989-1996.
Rothman, A., Bickel, P., Levina, E. and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2 494-515.
Sun, T. and Zhang, C. H. (2012a). Sparse matrix inversion with scaled Lasso. Manuscript.
Sun, T. and Zhang, C. H. (2012b). Scaled sparse linear regression. Biometrika 99 879-898.
Sun, T. and Zhang, C. H. (2012c). Comment: Minimax estimation of large covariance matrices under `1 norm. Statist. Sinica 22 1354-1358.
Thorin, G. O. (1948). Convexity theorems generalizing those of M. Riesz and Hadamard with some applications. Comm. Sem. Math. Univ. Lund 9 1-58.
van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. arXiv preprint arXiv:1303.0518.
Ye, F. and Zhang, C. H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the `q loss in `r ball. Journal of Machine Learning Research 11 3481-3502.
Yu, B. (1997). Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam 423-435.
Yuan, M. (2010). Sparse inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11 2261-2286.
Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19-35.
Zhang, C.-H. (2011). Statistical inference for high-dimensional data. In Mathematisches Forschungsinstitut Oberwolfach: Very High Dimensional Semiparametric Models, Report No. 48/2011 28-31.
Zhang, C. H. and Huang, J. (2008). The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Statist. 36 1567-1594.
Zhang, C.-H. and Zhang, S. S. (2011). Confidence intervals for low-dimensional parameters with high-dimensional data. arXiv preprint arXiv:1110.2563.

Department of Statistics
Yale University
New Haven, Connecticut 06511
USA
E-mail: zhao.ren@yale.edu
        huibin.zhou@yale.edu

Department of Statistics
The Wharton School
University of Pennsylvania
Philadelphia, Pennsylvania 19104
USA
E-mail: tingni@wharton.upenn.edu

Department of Statistics and Biostatistics
Hill Center, Busch Campus
Rutgers University
Piscataway, New Jersey 08854
USA
E-mail: cunhui@stat.rutgers.edu