
HOLP for Screening Variables
Chenlei Leng
Joint with Xiangyu Wang (Duke)
UCL Workshop on Theory of Big Data
9 January 2015
1 / 34
The setup
The linear regression model
y = β1 x1 + β2 x2 + · · · + βp xp + ε.
With data (after standardisation)
Y = Xβ + ε,
where
• Y ∈ R^n,
• X ∈ R^{n×p},
• ε ∈ R^n consists of i.i.d. errors.
2 / 34
The problem
In big data analysis, often
• The dimension p is much larger than the sample size n
(p ≫ n);
• The number of important variables s is often much
smaller (s ≪ n);
• But we don’t know which βj ’s are nonzero;
• The goal is to identify these important variables.
3 / 34
A solution for screening
For a linear model
Y = Xβ + ε,
1. Compute
β̂ = X^T (XX^T)^{-1} Y,
which is not the usual OLS estimator;
2. Retain the d variables (one can usually take d = n, or any
d > s) corresponding to the d largest entries of |β̂|.
We call this the HOLP screening procedure.
Afterwards, build a (refined) model based on the d retained
variables.
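As an illustration, the two steps above can be sketched in a few lines of numpy (a minimal sketch, not the authors' implementation; the helper name holp_screen and its interface are my own):

```python
import numpy as np

def holp_screen(X, Y, d):
    """Minimal sketch of the HOLP screening step (name and interface assumed).

    X : (n, p) standardised design matrix with p > n
    Y : (n,) response vector
    d : number of variables to retain (e.g. d = n)
    """
    # Step 1: beta_hat = X^T (X X^T)^{-1} Y; solve the n x n system rather than inverting
    beta_hat = X.T @ np.linalg.solve(X @ X.T, Y)
    # Step 2: keep the indices of the d largest |beta_hat| entries
    keep = np.argsort(np.abs(beta_hat))[-d:]
    return np.sort(keep), beta_hat
```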
4 / 34
Questions?
• Why β̂ = X^T (XX^T)^{-1} Y?
• Denote by M ⊆ {x_1, ..., x_p} a model, by M_S the true
model, where S = supp(β), and by M_d the model chosen
by HOLP.
For HOLP to work at all, we need
M_S ⊆ M_d
with probability approaching one.
5 / 34
Outline
• Introduction
• The motivation
• A connection to ridge regression
• A comparison with sure independence screening (SIS) in
Fan and Lv (2008)
• Theory: Three theorems, two surprising ones
• Simulation
• Concluding remarks
6 / 34
Review
Two classes of approaches
• One-stage: Selection and estimation;
• Two-stage: Screening followed by some one-stage
approach.
7 / 34
One-stage methods
Loss plus a sparsity-inducing penalty
• Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001), elastic
net (Zou and Hastie, 2005), grouped Lasso (Yuan and Lin,
2006), Cosso (Lin and Zhang, 2006), Dantzig selector
(Candes and Tao, 2007) and so on.
• Convex and non-convex optimisation.
• Different conditions for optimal estimation (β̂ ≈ β) and
selection consistency (M̂_S = M_S): difficult to achieve
both, especially selection consistency.
8 / 34
Two-stage methods
Screen first, refine next.
Intuition: choosing a superset M̂ ⊇ M_S is much easier than
estimating the exact set M̂ = M_S.
• Actually widely used before any theory was available
• Fan and Lv (2008): a theory for screening by retaining
variables with large marginal correlations: Sure
Independence Screening (SIS)
• Marginal: Generalised to many models including GLMs,
Cox’s model, GAMs, varying-coefficient models, etc.
• Correlation: Generalised notions of correlation.
• Alternative iterative procedures: forward regression (Wang,
2009) and tilting (Cho and Fryzlewicz, 2012).
9 / 34
Elements of Screening
At least two elements:
• Computational: key; there is little room for sophisticated
but slow procedures.
• Theoretical: the sure screening property, M_S ⊆ M̂ with
probability approaching one.
Remarks for SIS:
• Computational ✓: simply compute X^T Y.
• Theoretical ?: works when variables are not strongly
dependent.
10 / 34
Motivation
Consider a class of estimators of β of the form
β̃ = AY,
where A ∈ R^{p×n}. Recall SIS: β̃ = X^T Y, where A = X^T.
Screening procedure: choose the submodel M_d that retains the
d ≪ p largest entries of β̃,
M_d = {x_j : |β̃_j| is among the d largest of all |β̃_j|'s}.
For this to work, β̃ must preserve the rank order of the entries of β:
• the nonzero entries of β should be relatively large in β̃;
• the zero entries of β should be relatively small in β̃.
11 / 34
Signal noise analysis
Note that
β̃ = AY = A(Xβ + ε) = (AX)β + Aε.
• Signal: (AX)β; noise: Aε.
• The noise part is stochastically small.
• For β̃ to preserve the rank order of β, ideally AX = I, or at
least AX ≈ I.
The above discussion motivates the use of some inverse of X.
12 / 34
Inverse of X
Look for A such that
AX ≈ I.
• When p < n, A = (X^T X)^{-1} X^T gives rise to the OLS
estimator.
• When p > n, take the Moore-Penrose inverse of X,
A = X^T (XX^T)^{-1},
which is unique to high-dimensional data.
• High-dimensional OLS: the High-dimensional OLS Projection (HOLP)
β̂ = X^T (XX^T)^{-1} Y.
13 / 34
Remarks
Write
β̂ = X^T (XX^T)^{-1} Y = X^T (XX^T)^{-1} Xβ + X^T (XX^T)^{-1} ε.
• HOLP projects β onto the row space of X; OLS projects Y
onto the column space of X.
• Straightforward to implement.
• Can be computed efficiently: O(n²p), as opposed to O(np)
for SIS.
Computational: ✓
14 / 34
A comparison of the screening matrices
The screening matrix AX in
β̃ = AY = (AX)β + Aε
• HOLP: AX = X^T (XX^T)^{-1} X
• SIS: AX = X^T X
• A quick simulation (sketched in code below): n = 50, p = 1000, x ∼ N(0, Σ).
Three setups:
• Independent: Σ = I
• Compound symmetric (CS): σ_jk = 0.6 for j ≠ k
• AR(1): σ_jk = 0.995^{|j−k|}
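The quick simulation above can be sketched as follows (a minimal sketch with my own naming; screening_matrices and its defaults are illustrative, not the authors' code):

```python
import numpy as np

def screening_matrices(n=50, p=1000, setup="CS", rho=0.6, seed=0):
    """Sketch: compute the screening matrices AX for SIS and HOLP.

    setup is one of "Ind", "CS", "AR1" (labels chosen here for illustration);
    the slides use rho = 0.6 for CS and rho = 0.995 for AR(1).
    """
    rng = np.random.default_rng(seed)
    if setup == "Ind":
        Sigma = np.eye(p)
    elif setup == "CS":
        # compound symmetry: sigma_jk = rho for j != k, 1 on the diagonal
        Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    else:
        # AR(1): sigma_jk = rho^{|j - k|}
        idx = np.arange(p)
        Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    AX_sis = X.T @ X                                 # SIS: X^T X
    AX_holp = X.T @ np.linalg.solve(X @ X.T, X)      # HOLP: X^T (X X^T)^{-1} X
    return AX_sis, AX_holp
```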
15 / 34
Screening matrices
SIS, Ind
SIS, CS
SIS, AR(1)
HOLP, Ind
HOLP, CS
HOLP, AR(1)
16 / 34
Analytical insight
The SVD of X is X = V D U^T, where
• V is an n × n orthogonal matrix,
• D is an n × n diagonal matrix,
• U is a p × n matrix on the Stiefel manifold.
Then
HOLP: X^T (XX^T)^{-1} X = U U^T,
SIS: X^T X = U D² U^T.
HOLP reduces the impact of high correlation in X by removing
the random diagonal matrix D.
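These two identities are easy to verify numerically; here is a minimal sketch (assuming p > n and XX^T invertible; the function name is mine):

```python
import numpy as np

def check_svd_identities(n=50, p=200, seed=1):
    """Sketch: verify X^T (X X^T)^{-1} X = U U^T and X^T X = U D^2 U^T."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # thin SVD in the slide's notation: X = V D U^T, V (n x n), D (n x n), U (p x n)
    V, d_vals, Ut = np.linalg.svd(X, full_matrices=False)
    U = Ut.T
    holp_mat = X.T @ np.linalg.solve(X @ X.T, X)
    sis_mat = X.T @ X
    print(np.allclose(holp_mat, U @ U.T))                        # True
    print(np.allclose(sis_mat, U @ np.diag(d_vals**2) @ U.T))    # True
```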
17 / 34
Theory
Assumptions
• p > n and log p = O(n^γ) for some γ > 0.
• Conditions on the eigenvalues of XΣ^{-1}X^T/p and on the
distribution of Σ^{-1/2}x, where Σ = var(x).
• Conditions on the magnitude of the smallest |β_j| for j ∈ S.
• Conditions on s and on the condition number of Σ.
However, we do not need the marginal correlation assumption,
which requires
corr(y, x_j) ≠ 0 for j ∈ S.
18 / 34
Marginal screening
• The marginal correlation assumption is vital to all marginal
screening approaches.
• In SIS, AY = X^T Y = X^T Xβ + X^T ε.
• The SIS signal X^T Xβ ∝ Σβ:
β_j ≠ 0 ⇏ (Σβ)_j is large,
β_j = 0 ⇏ (Σβ)_j is small.
• For HOLP, the signal X^T (XX^T)^{-1} Xβ ∝ Iβ = β.
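To make the role of the marginal correlation assumption concrete, here is a small population-level construction of my own (in the spirit of Fan and Lv (2008), not taken from the slides): a true variable whose SIS signal (Σβ)_j vanishes, while the HOLP signal, proportional to β itself, does not.

```python
import numpy as np

# Illustrative population-level construction (assumed, not from the slides):
# compound-symmetric Sigma with rho = 0.5; beta_1 = beta_2 = beta_3 = 1, beta_4 = -3*rho.
p, rho = 10, 0.5
Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
beta = np.zeros(p)
beta[:3] = 1.0
beta[3] = -3 * rho          # chosen so that (Sigma beta)_4 = 3*rho + beta_4 = 0

sis_signal = Sigma @ beta   # population SIS signal, proportional to Sigma beta
holp_signal = beta          # idealised HOLP signal, proportional to beta itself
print(sis_signal[3])        # 0.0  -> x_4 is a true variable yet marginally uncorrelated with y
print(holp_signal[3])       # -1.5 -> HOLP still ranks x_4 as important
```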
19 / 34
Sure screening
Theorem 1
(Screening property of HOLP) Under mild conditions, if we
choose the submodel size d ≪ p properly, the M_d chosen by
HOLP satisfies
P(M_S ⊂ M_d) = 1 − O(exp(−n^{C_1}/ log n)).
20 / 34
Surprise # 1
Theorem 2
(Screening consistency of HOLP) Under mild conditions, the
HOLP estimator satisfies
P( min_{j∈S} |β̂_j| > max_{j∉S} |β̂_j| ) = 1 − O(exp(−n^{C_2}/ log n)).
Surprise: If we choose d = s, then we select the right model.
21 / 34
Another motivation for HOLP
• The ridge regression estimator is
β̂(r) = (rI + X^T X)^{-1} X^T Y,
where r is the ridge parameter.
• Letting r → ∞ gives r β̂(r) → X^T Y, the SIS estimator.
• Letting r → 0 gives β̂(r) → (X^T X)^− X^T Y, i.e. HOLP.
• Applying the Sherman-Morrison-Woodbury formula gives
(rI + X^T X)^{-1} X^T Y = X^T (rI + XX^T)^{-1} Y.
Then letting r → 0 gives
(X^T X)^− X^T Y = X^T (XX^T)^{-1} Y,
which is exactly HOLP.
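A quick numerical check of this limit (a sketch; the helper name ridge_holp, the data and the tolerance are my choices):

```python
import numpy as np

def ridge_holp(X, Y, r):
    """Ridge estimator in the Woodbury form X^T (r I + X X^T)^{-1} Y:
    only an n x n system is solved when p > n."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(r * np.eye(n) + X @ X.T, Y)

# As r -> 0, the ridge estimator approaches the HOLP estimator X^T (X X^T)^{-1} Y.
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.standard_normal((n, p))
Y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)
holp = X.T @ np.linalg.solve(X @ X.T, Y)
print(np.max(np.abs(ridge_holp(X, Y, 1e-8) - holp)))   # essentially zero
```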
22 / 34
Ridge Regression
Theorem 3
(Screening consistency of ridge regression) Under mild
conditions, with a proper ridge parameter r, the ridge
regression estimator satisfies
P( min_{j∈S} |β̂_j(r)| > max_{j∉S} |β̂_j(r)| ) = 1 − O(exp(−n^{C_3}/ log n)).
Remarks
• Surprise # 2: The theorem holds when the ridge parameter
r is fixed.
• Potential to generalise to GLMs, Cox’s model, etc.
23 / 34
Simulation
• (p, n) = (1000, 100) or (10000, 200); fix d = n.
• Signal-to-noise ratio R² = 0.9.
• Σ and β:
(i) Independent predictors:
β_i = (−1)^{u_i}(|N(0, 1)| + 4 log n/√n),
where u_i ∼ Ber(0.4) for i ∈ S and β_i = 0 for i ∉ S
(see the sketch below).
(ii) Compound symmetry: β_i = 5 for i = 1, ..., 5 and β_i = 0
otherwise; ρ = 0.3, 0.6, 0.9.
(iii) Autoregressive correlation:
β_1 = 3, β_4 = 1.5, β_7 = 2, and β_i = 0 otherwise.
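For concreteness, setup (i) might be generated as in the following sketch (the placement of S, the choice s = 5, and the calibration of the noise to R² = 0.9 are my assumptions; the slide does not spell these out):

```python
import numpy as np

def make_beta_indep(p, n, s, rng):
    """Sketch of setup (i): beta_i = (-1)^{u_i} (|N(0,1)| + 4 log n / sqrt(n)) for i in S,
    with u_i ~ Bernoulli(0.4), and beta_i = 0 otherwise.  S is placed at random here."""
    beta = np.zeros(p)
    S = np.sort(rng.choice(p, size=s, replace=False))
    u = rng.binomial(1, 0.4, size=s)
    beta[S] = (-1.0) ** u * (np.abs(rng.standard_normal(s)) + 4 * np.log(n) / np.sqrt(n))
    return beta, S

rng = np.random.default_rng(0)
n, p, s = 100, 1000, 5                       # s = 5 chosen for illustration
beta, S = make_beta_indep(p, n, s, rng)
X = rng.standard_normal((n, p))              # independent standardised predictors
R2 = 0.9
signal_var = np.sum(beta ** 2)               # Var(x^T beta) for independent unit-variance x_j
sigma = np.sqrt(signal_var * (1 - R2) / R2)  # one way to calibrate the noise so that R^2 = 0.9
Y = X @ beta + sigma * rng.standard_normal(n)
```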
24 / 34
More setups
(iv) Factor model: x_i = Σ_{j=1}^{k} φ_j f_{ij} + η_i, where the f_{ij}, η_i
and φ_j are i.i.d. normal. Coefficients as in (ii).
(v) Group structure: 15 true variables in three groups,
x_{j+3m} = z_j + N(0, δ²), m = 0, ..., 4, j = 1, 2, 3,
with β_i = 3 for i ≤ 15, β_i = 0 for i > 15, and δ² = 0.01, 0.05
or 0.1.
(vi) Extreme correlation: x_i = (z_i + w_i)/√2 for i = 1, ..., 5 and
x_i = (z_i + Σ_{j=1}^{5} w_j)/2 for i = 16, ..., p. Coefficients as in (ii).
The response variable is more correlated with a large
number of unimportant variables. To make it even harder,
x_{i+s}, x_{i+2s} = x_i + N(0, 0.01), i = 1, ..., 5.
25 / 34
(p, n) = (1000, 100): R² = 0.9

Example                   HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                  0.935   0.910   0.990   1.000   1.000
(ii) CS       ρ = 0.3     0.980   0.855   0.955   1.000   0.990
              ρ = 0.6     0.830   0.260   0.305   0.575   0.490
              ρ = 0.9     0.050   0.010   0.005   0.000   0.050
(iii) AR(1)   ρ = 0.3     0.990   0.965   1.000   1.000   1.000
              ρ = 0.6     1.000   1.000   1.000   1.000   1.000
              ρ = 0.9     1.000   1.000   0.970   0.985   1.000
(iv) Factor   k = 2       0.940   0.015   0.490   0.950   0.960
              k = 10      0.715   0.000   0.115   0.370   0.455
              k = 20      0.430   0.000   0.015   0.105   0.225
(v) Group     δ² = 0.1    1.000   1.000   0.000   0.000   0.000
              δ² = 0.05   1.000   1.000   0.000   0.000   0.000
              δ² = 0.01   1.000   1.000   0.000   0.000   0.000
(vi) Extreme              0.905   0.000   0.000   0.150   0.110
26 / 34
(p, n) = (10000, 200): R² = 0.9

Example                   HOLP    SIS     ISIS    FR      Tilting
(i) Ind.                  0.960   0.960   1.000   1.000   —
(ii) CS       ρ = 0.3     1.000   0.920   1.000   1.000   —
              ρ = 0.6     0.960   0.280   0.420   0.960   —
              ρ = 0.9     0.100   0.000   0.000   0.000   —
(iii) AR(1)   ρ = 0.3     0.990   0.990   1.000   1.000   —
              ρ = 0.6     1.000   1.000   1.000   1.000   —
              ρ = 0.9     1.000   1.000   1.000   1.000   —
(iv) Factor   k = 2       0.980   0.000   0.350   0.990   —
              k = 10      0.850   0.000   0.060   0.700   —
              k = 20      0.540   0.000   0.010   0.230   —
(v) Group     δ² = 0.1    1.000   1.000   0.000   0.000   —
              δ² = 0.05   1.000   1.000   0.000   0.000   —
              δ² = 0.01   1.000   1.000   0.000   0.000   —
(vi) Extreme              1.000   0.000   0.000   0.210   —
27 / 34
A demonstration of Theorems 2 and 3
We set
p = 4 × [exp(n^{1/3})]   for all examples except Example (vi),
p = 20 × [exp(n^{1/4})]  for Example (vi),
and
s = 1.5 × [n^{1/4}]  for R² = 90%,
s = [n^{1/4}]        for R² = 50%.
Choose d = s.
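With d = s, the event in Theorems 2 and 3 can be checked directly on simulated data; a minimal sketch (helper name mine):

```python
import numpy as np

def holp_exact_screening(X, Y, S):
    """Sketch: check the Theorem 2 event min_{j in S} |beta_hat_j| > max_{j not in S} |beta_hat_j|,
    i.e. with d = s the retained set is exactly the true support."""
    beta_hat = np.abs(X.T @ np.linalg.solve(X @ X.T, Y))
    in_S = np.zeros(X.shape[1], dtype=bool)
    in_S[S] = True
    return beta_hat[in_S].min() > beta_hat[~in_S].max()
```

Averaging this indicator over repeated simulations for increasing n produces curves like those in Figures 1 and 2.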
28 / 34
Theorem 2: Screening consistency
[Figure: two panels, "HOLP with R² = 90%" and "HOLP with R² = 50%"; x-axis: n (100 to 500); y-axis: probability that all |β̂_true| > |β̂_false|; one curve per Example (i)-(vi).]
Figure 1: HOLP: Selection consistency as n increases.
29 / 34
Theorem 3: Ridge regression
[Figure: two panels, "ridge-HOLP with R² = 90%" and "ridge-HOLP with R² = 50%"; x-axis: n (100 to 500); y-axis: probability that all |β̂_true| > |β̂_false|; one curve per Example (i)-(vi).]
Figure 2: ridge-HOLP (r = 10): Selection consistency as n increases.
30 / 34
Computation efficiency: Varying d
[Figure: two panels at p = 1000, n = 100, one including Tilting and one with Tilting excluded; x-axis: d (0 to 100); y-axis: time cost (sec); curves for Tilting, Forward regression, ISIS, HOLP and SIS.]
Figure 3: Computational time when (p, n) = (1000, 100).
31 / 34
Computation efficiency: Varying p
[Figure: two panels at d = 50, n = 100, one including Tilting and one with Tilting excluded; x-axis: p (500 to 2500); y-axis: time cost (sec); curves for Tilting, Forward regression, ISIS, HOLP and SIS.]
Figure 4: Computational time when (d, n) = (50, 100).
32 / 34
Conclusion
We have proposed HOLP for big data analysis:
• Computationally efficient
• Theoretically appealing
• Methodologically simple
• Generalisable via its ridge version
33 / 34
Future work
Our work opens several directions for future work:
• Gaussian graphical model
• GLMs
• Cox’s model
• Grouped variable screening, GAMs
• Big-n-big-p
Thank you!
34 / 34