HOLP for Screening Variables

Chenlei Leng
Joint with Xiangyu Wang (Duke)
UCL Workshop on Theory of Big Data
9 Jan., 2015

The setup

The linear regression model
    y = β1 x1 + β2 x2 + · · · + βp xp + ε.
With data (after standardisation)
    Y = X β + ε,
where
• Y ∈ R^n,
• X ∈ R^{n×p},
• ε ∈ R^n consists of i.i.d. errors.

The problem

In big data analysis, often
• The dimension p is much larger than the sample size n (p ≫ n);
• The number of important variables s is much smaller (s ≪ n);
• But we don't know which βj's are nonzero;
• The goal is to identify these important variables.

A solution for screening

For a linear model Y = X β + ε,
1. Compute β̂ = X^T (X X^T)^{-1} Y, which is not the usual OLS estimator;
2. Retain the d variables (usually one can take d = n, or any d > s) corresponding to the d largest entries of |β̂|.
We call this the HOLP screening procedure. Afterwards, build a (refined) model based on the retained d variables.

Questions?

• Why β̂ = X^T (X X^T)^{-1} Y?
• Denote by M ⊆ {x1, ..., xp} a model, by M_S the true model, where S = supp(β), and by M_d the model chosen by HOLP. For HOLP to work at all, we need M_S ⊆ M_d with probability approaching one.

Outline

• Introduction
• The motivation
• A connection to ridge regression
• A comparison with sure independence screening (SIS) of Fan and Lv (2008)
• Theory: three theorems, two surprising ones
• Simulation
• Concluding remarks

Review

Two classes of approaches
• One-stage: selection and estimation;
• Two-stage: screening followed by some one-stage approach.

One-stage methods

Loss plus a sparsity-inducing penalty
• Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), elastic net (Zou and Hastie, 2005), grouped Lasso (Yuan and Lin, 2006), COSSO (Lin and Zhang, 2006), Dantzig selector (Candes and Tao, 2007), and so on.
• Convex and non-convex optimisation.
• Different conditions for optimal estimation (β̂ ≈ β) and selection consistency (M̂_S = M_S): difficult to achieve both, especially selection consistency.

Two-stage methods

Screen first, refine next. Intuition: choosing a superset M̂ ⊇ M_S is much easier than estimating the exact set M̂ = M_S.
• Actually widely used before any theory was available.
• Fan and Lv (2008): a theory for screening by retaining variables with large marginal correlations: Sure Independence Screening (SIS).
• Marginal: generalised to many models including GLMs, Cox's model, GAMs, varying-coefficient models, etc.
• Correlation: generalised notions of correlation.
• Alternative iterative procedures: forward regression (Wang, 2009) and tilting (Cho and Fryzlewicz, 2012).

Elements of screening

At least two elements:
• Computational: key; sorry, little room for sophisticated slow procedures.
• Theoretical: the sure screening property, M_S ⊆ M̂ with probability approaching one.
Remark for SIS:
• Computational ✓: compute X^T Y.
• Theoretical ?: works when variables are not strongly dependent.

Motivation

Consider a class of estimators of β of the form β̃ = A Y, where A ∈ R^{p×n}. Recall SIS: β̃ = X^T Y, i.e. A = X^T.
Screening procedure: choose the submodel M_d that retains the d ≪ p largest entries of |β̃|,
    M_d = {x_j : |β̃_j| is among the d largest of all |β̃_j|'s}.
For this to work, β̃ should maintain the rank order of the entries of β:
• the nonzero entries of β are relatively large in β̃;
• the zero entries of β are relatively small in β̃.
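As a concrete illustration of the screening rule just stated, instantiated with the HOLP estimator β̂ = X^T (X X^T)^{-1} Y from the "A solution for screening" slide, here is a minimal NumPy sketch (not code from the talk); the function name holp_screen, the toy data and the seed are my own illustrative choices.

```python
import numpy as np

def holp_screen(X, Y, d):
    """HOLP screening: keep the d variables with the largest |X^T (X X^T)^{-1} Y|.

    A minimal sketch of the procedure described in the slides; assumes p > n
    and that X X^T is invertible (X has full row rank).
    """
    # beta_hat = X^T (X X^T)^{-1} Y, via an n x n linear solve: O(n^2 p + n^3)
    beta_hat = X.T @ np.linalg.solve(X @ X.T, Y)
    # indices of the d largest entries of |beta_hat|
    keep = np.argsort(-np.abs(beta_hat))[:d]
    return np.sort(keep), beta_hat

# Toy example: n = 50, p = 1000, s = 5 true variables
rng = np.random.default_rng(0)
n, p, s = 50, 1000, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 5.0
Y = X @ beta + rng.standard_normal(n)

selected, _ = holp_screen(X, Y, d=n)
print("All true variables retained:", set(range(s)) <= set(selected.tolist()))
```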
Signal-noise analysis

Note that β̃ = A Y = A(X β + ε) = (A X) β + A ε.
• Signal: (A X) β. Noise: A ε.
• The noise part is stochastically small.
• In order for β̃ to preserve the rank order of β, ideally A X = I, or at least A X ≈ I.
The above discussion motivated us to use some inverse of X.

Inverse of X

Look for A such that A X ≈ I.
• When p < n, A = (X^T X)^{-1} X^T gives rise to the OLS estimator.
• When p > n, the Moore-Penrose inverse of X gives A = X^T (X X^T)^{-1}, a choice unique to high-dimensional data.
• High-dimensional OLS: the High-dimensional OLS Projection (HOLP)
    β̂ = X^T (X X^T)^{-1} Y.

Remarks

Write β̂ = X^T (X X^T)^{-1} Y = X^T (X X^T)^{-1} X β + X^T (X X^T)^{-1} ε.
• HOLP projects β onto the row space of X; OLS projects Y onto the column space of X.
• Straightforward to implement.
• Can be computed efficiently: O(n^2 p), as opposed to O(np) for SIS.
Computational: ✓

A comparison of the screening matrices

The screening matrix is A X in β̃ = A Y = (A X) β + A ε.
• HOLP: A X = X^T (X X^T)^{-1} X
• SIS: A X = X^T X
• A quick simulation: n = 50, p = 1000, x ∼ N(0, Σ). Three setups:
• Independent: Σ = I
• Compound symmetric (CS): σ_{jk} = 0.6 for j ≠ k
• AR(1): σ_{jk} = 0.995^{|j−k|}

Screening matrices

[Figure: the screening matrix A X under each setup. Panels: SIS Ind, SIS CS, SIS AR(1); HOLP Ind, HOLP CS, HOLP AR(1).]

Analytical insight

Take the SVD of X as X = V D U^T, where
• V is an n × n orthogonal matrix,
• D is an n × n diagonal matrix,
• U is a p × n matrix on the Stiefel manifold.
Then
    HOLP: X^T (X X^T)^{-1} X = U U^T,
    SIS:  X^T X = U D^2 U^T.
HOLP reduces the impact of high correlation in X by removing the random diagonal matrix D.

Theory

Assumptions
• p > n and log p = O(n^γ) for some γ > 0.
• Conditions on the eigenvalues of X Σ^{-1} X^T / p and on the distribution of Σ^{-1/2} x, where Σ = var(x).
• Conditions on the magnitude of the smallest |β_j| for j ∈ S.
• Conditions on s and the condition number of Σ.
However, we do not need the marginal correlation assumption, which requires corr(y, x_j) ≠ 0 for j ∈ S.

Marginal screening

• The marginal correlation assumption is vital to all marginal screening approaches.
• In SIS, A Y = X^T Y = X^T X β + X^T ε.
• The SIS signal X^T X β ∝ Σ β:
    β_j ≠ 0 does not imply (Σ β)_j is large;
    β_j = 0 does not imply (Σ β)_j is small.
• For HOLP, X^T (X X^T)^{-1} X β ∝ I β = β.

Sure screening

Theorem 1 (Screening property of HOLP)
Under mild conditions, if we choose the submodel size d ≪ p properly, the M_d chosen by HOLP satisfies
    P(M_S ⊂ M_d) = 1 − O(exp(−n^{C1} / log n)).

Surprise # 1

Theorem 2 (Screening consistency of HOLP)
Under mild conditions, the HOLP estimator satisfies
    P( min_{j∈S} |β̂_j| > max_{j∉S} |β̂_j| ) = 1 − O(exp(−n^{C2} / log n)).
Surprise: if we choose d = s, then we select exactly the right model.

Another motivation for HOLP

• The ridge regression estimator is β̂(r) = (rI + X^T X)^{-1} X^T Y, where r is the ridge parameter.
• Letting r → ∞ gives r β̂(r) → X^T Y, the SIS statistic.
• Letting r → 0 gives β̂(r) → (X^T X)^− X^T Y, which is HOLP.
• Applying the Sherman-Morrison-Woodbury formula gives
    (rI + X^T X)^{-1} X^T Y = X^T (rI + X X^T)^{-1} Y.
Then letting r → 0 gives (X^T X)^− X^T Y = X^T (X X^T)^{-1} Y, which is exactly HOLP.

Ridge regression

Theorem 3 (Screening consistency of ridge regression)
Under mild conditions, with a proper ridge parameter r, the ridge regression estimator satisfies
    P( min_{j∈S} |β̂_j(r)| > max_{j∉S} |β̂_j(r)| ) = 1 − O(exp(−n^{C3} / log n)).
Remarks
• Surprise # 2: the theorem holds even when the ridge parameter r is fixed.
• Potential to generalise to GLMs, Cox's model, etc.
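To illustrate the ridge connection numerically, here is a small NumPy sketch (my own, not from the talk): it computes the ridge estimator through the Sherman-Morrison-Woodbury identity above and checks the two limits, β̂(r) → HOLP as r → 0 and r β̂(r) → X^T Y as r → ∞. The data, seed and choice of r values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 200
X = rng.standard_normal((n, p))
Y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(n)

def ridge(X, Y, r):
    """Ridge estimator beta_hat(r) = (r I + X^T X)^{-1} X^T Y,
    computed as X^T (r I + X X^T)^{-1} Y so only an n x n solve is needed."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(r * np.eye(n) + X @ X.T, Y)

holp = X.T @ np.linalg.solve(X @ X.T, Y)   # HOLP: X^T (X X^T)^{-1} Y
sis = X.T @ Y                              # SIS statistic: X^T Y

# r -> 0: the ridge estimator approaches HOLP (difference close to 0)
print(np.max(np.abs(ridge(X, Y, 1e-8) - holp)))
# r -> infinity: r * beta_hat(r) approaches X^T Y (difference shrinks as r grows)
r = 1e8
print(np.max(np.abs(r * ridge(X, Y, r) - sis)))
```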
Simulation

• (p, n) = (1000, 100) or (10000, 200); fix d = n.
• Signal-to-noise ratio: R^2 = 0.9.
• Σ and β:
(i) Independent predictors: β_i = (−1)^{u_i} (|N(0, 1)| + 4 log n / √n), where u_i ∼ Ber(0.4), for i ∈ S, and β_i = 0 for i ∉ S.
(ii) Compound symmetry: β_i = 5 for i = 1, ..., 5 and β_i = 0 otherwise; ρ = 0.3, 0.6, 0.9.
(iii) Autoregressive correlation: β_1 = 3, β_4 = 1.5, β_7 = 2, and β_i = 0 otherwise.

More setups

(iv) Factor model: x_i = Σ_{j=1}^{k} φ_j f_{ij} + η_i, where the f_{ij}, η_i and φ_j are i.i.d. normal. Coefficients as in (ii).
(v) Group structure: 15 true variables in three groups, x_{j+3m} = z_j + N(0, δ^2), where m = 0, ..., 4, j = 1, 2, 3, and δ^2 is 0.01, 0.05 or 0.1. β_i = 3 for i ≤ 15; β_i = 0 for i > 15.
(vi) Extreme correlation: x_i = (z_i + w_i)/√2 for i = 1, ..., 5 and x_i = (z_i + Σ_{j=1}^{5} w_j)/2 for i = 16, ..., p. Coefficients as in (ii). The response variable is more correlated with a large number of unimportant variables. To make it even harder, x_{i+s}, x_{i+2s} = x_i + N(0, 0.01) for i = 1, ..., 5.

(p, n) = (1000, 100): R^2 = 0.9

Example                  HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                 0.935   0.910   0.990   1.000   1.000
(ii) CS, ρ = 0.3         0.980   0.855   0.955   1.000   0.990
(ii) CS, ρ = 0.6         0.830   0.260   0.305   0.575   0.490
(ii) CS, ρ = 0.9         0.050   0.010   0.005   0.000   0.050
(iii) AR(1), ρ = 0.3     0.990   0.965   1.000   1.000   1.000
(iii) AR(1), ρ = 0.6     1.000   1.000   1.000   1.000   1.000
(iii) AR(1), ρ = 0.9     1.000   1.000   0.970   0.985   1.000
(iv) Factor, k = 2       0.940   0.015   0.490   0.950   0.960
(iv) Factor, k = 10      0.715   0.000   0.115   0.370   0.455
(iv) Factor, k = 20      0.430   0.000   0.015   0.105   0.225
(v) Group, δ^2 = 0.1     1.000   1.000   0.000   0.000   0.000
(v) Group, δ^2 = 0.05    1.000   1.000   0.000   0.000   0.000
(v) Group, δ^2 = 0.01    1.000   1.000   0.000   0.000   0.000
(vi) Extreme             0.905   0.000   0.000   0.150   0.110

(p, n) = (10000, 200): R^2 = 0.9

Example                  HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                 0.960   0.960   1.000   1.000   —
(ii) CS, ρ = 0.3         1.000   0.920   1.000   1.000   —
(ii) CS, ρ = 0.6         0.960   0.280   0.420   0.960   —
(ii) CS, ρ = 0.9         0.100   0.000   0.000   0.000   —
(iii) AR(1), ρ = 0.3     0.990   0.990   1.000   1.000   —
(iii) AR(1), ρ = 0.6     1.000   1.000   1.000   1.000   —
(iii) AR(1), ρ = 0.9     1.000   1.000   1.000   1.000   —
(iv) Factor, k = 2       0.980   0.000   0.350   0.990   —
(iv) Factor, k = 10      0.850   0.000   0.060   0.700   —
(iv) Factor, k = 20      0.540   0.000   0.010   0.230   —
(v) Group, δ^2 = 0.1     1.000   1.000   0.000   0.000   —
(v) Group, δ^2 = 0.05    1.000   1.000   0.000   0.000   —
(v) Group, δ^2 = 0.01    1.000   1.000   0.000   0.000   —
(vi) Extreme             1.000   0.000   0.000   0.210   —

A demonstration of Theorems 2 and 3

We set
    p = 4 × [exp(n^{1/3})] for all examples except Example (vi),
    p = 20 × [exp(n^{1/4})] for Example (vi),
and
    s = 1.5 × [n^{1/4}] for R^2 = 90%,
    s = [n^{1/4}] for R^2 = 50%.
Choose d = s.

Theorem 2: Screening consistency

[Figure 1: HOLP: selection consistency as n increases. Two panels, R^2 = 90% and R^2 = 50%; y-axis: probability that all |β̂_true| > |β̂_false|; x-axis: n from 100 to 500; curves for Examples (i)–(vi).]

Theorem 3: Ridge regression

[Figure 2: ridge-HOLP (r = 10): selection consistency as n increases. Two panels, R^2 = 90% and R^2 = 50%; same axes and examples as Figure 1.]
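Before turning to the timing comparisons, here is a rough Monte Carlo sketch in the spirit of the Theorem 2 demonstration (my own simplification, not the talk's code): it estimates P( min_{j∈S} |β̂_j| > max_{j∉S} |β̂_j| ) under the independent-predictor design, with p and s growing with n roughly as above. The noise variance is fixed at 1 rather than calibrated to a target R^2, and the replication count and seed are arbitrary.

```python
import numpy as np

def holp(X, Y):
    """HOLP estimator beta_hat = X^T (X X^T)^{-1} Y."""
    return X.T @ np.linalg.solve(X @ X.T, Y)

def separation_prob(n, reps=200, seed=0):
    """Estimate P( min_{j in S} |beta_hat_j| > max_{j not in S} |beta_hat_j| )
    under design (i), with p ~ 4 exp(n^{1/3}) and s ~ 1.5 n^{1/4}."""
    rng = np.random.default_rng(seed)
    p = int(4 * np.exp(n ** (1 / 3)))
    s = int(1.5 * n ** 0.25)
    hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        beta = np.zeros(p)
        signs = np.where(rng.random(s) < 0.4, -1.0, 1.0)  # (-1)^{u_i}, u_i ~ Ber(0.4)
        beta[:s] = signs * (np.abs(rng.standard_normal(s)) + 4 * np.log(n) / np.sqrt(n))
        Y = X @ beta + rng.standard_normal(n)
        b = np.abs(holp(X, Y))
        hits += b[:s].min() > b[s:].max()
    return hits / reps

for n in (100, 200, 300):
    print(n, separation_prob(n))
```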
Computation efficiency: Varying d

[Figure 3: computational time (seconds) against d when (p, n) = (1000, 100), for Tilting, Forward regression, ISIS, HOLP and SIS; second panel excludes tilting.]

Computation efficiency: Varying p

[Figure 4: computational time (seconds) against p (500 to 2500) when (d, n) = (50, 100), for Tilting, Forward regression, ISIS, HOLP and SIS; second panel excludes tilting.]

Conclusion

We have offered HOLP for big data analysis:
• Computationally efficient
• Theoretically appealing
• Methodologically simple
• Generalisable via its ridge version

Future work

Our work opens new directions for future work:
• Gaussian graphical models
• GLMs
• Cox's model
• Grouped variable screening, GAMs
• Big-n-big-p

Thank you!