Complete Least Squares and Variable Screening

Eric M Reyes
Under the Direction of Dennis D Boos and Leonard A Stefanski
North Carolina State University
Department of Statistics
Rose-Hulman Institute of Technology
Mathematics Seminar
15 Feb 2012
RHIT Seminar
Outline
1 Motivation
2 Complete Least Squares
    CLS Objective Function
    CLS Estimator
    Related Estimators
3 Screening via CLS
    CLS Variable Orderings
    Simulation Studies
4 Discussion
Acknowledgments
NIH Grants T32HL079986 and P01 CA142538 for funding support.
Motivation
Variable Selection
What are the risk factors associated with heart failure?
Which genetic biomarkers offer early identification of individuals more likely to develop cancer?
When triaging a stroke victim in the emergency room, which characteristics on their medical chart are predictive of recurrent stroke?
Genetic Studies
Polycystic Ovary Syndrome
Endocrine disorder affecting 10% of reproductive-aged women.
Characterized by high androgen levels.
Goal: identify genes associated with increased androgen levels.
$E[y_i \mid x_{1,i}, \ldots, x_{p,i}] = x_{1,i}\beta_1 + x_{2,i}\beta_2 + x_{3,i}\beta_3 + \cdots + x_{p,i}\beta_p$
Framingham Heart Study
Conducted to identify risk factors for cardiovascular disease.
Genetic and phenotypic data collected on ∼ 9000 subjects
Just over 50% female.
Genetic data includes 50,000 variables.
Variable Screening
Sure Independence Screening (SIS) [Fan and Lv, JRSS-B 2008]
Order predictors by their correlation with the response.
Retain top k predictors for variable selection.
k chosen to be near the sample size (e.g. k = δn, δ ∈ (0, 1)).
Assumes predictors are marginally related to response.
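
The screening rule above can be sketched in a few lines. This is a minimal illustration under assumptions, not the reference implementation from Fan and Lv: the function name `sis_screen`, the simulated data, and the choice δ = 0.5 are all hypothetical.

```python
import numpy as np

def sis_screen(X, y, k):
    """Return indices of the k predictors with largest |corr(x_j, y)|, best first."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Sample correlation of each column with the response
    # (assumes no column of X is constant).
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(-np.abs(corr))
    return order[:k]

# Simulated example: only the first two predictors drive the response.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

delta = 0.5                          # hypothetical value of delta in (0, 1)
kept = sis_screen(X, y, int(delta * n))
```

Because the ranking uses only marginal correlations, a predictor whose effect shows up only jointly with other predictors can be ranked low.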
Drawback of SIS
[Figure: scatterplot of Speed Driven Over Posted Limit (MPH) against Height (inches).]
[Figure: two scatterplots of Speed Driven Over Posted Limit (MPH) against Height (inches); the second marks each point as Female or Male.]
[Figure: scatterplot of Response against Dose.]
[Figure: two scatterplots of Response against Dose; the second marks each point as No Aspirin or Aspirin.]
Complete Least Squares
(CLS)
Notation
Assume the linear model:
$y = X\beta + \varepsilon$
$y$ is an $(n \times 1)$ response vector.
$X$ is an $(n \times p)$ design matrix.
Assume the $\varepsilon_i$ are i.i.d. such that
$E(\varepsilon_i) = 0$.
$V(\varepsilon_i) = \sigma^2$, for unknown $\sigma^2$.
Assume $y$ and $X$ have been centered and scaled:
$y^\top 1 = 0$, $y^\top y = 1$, $X^\top 1 = 0$, and $X^\top X = R$.
$R$ is a valid correlation matrix.
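
As a small illustration of the centering and scaling convention, the sketch below standardizes simulated data and forms R. The helper name `standardize` and the data are assumptions made for the example.

```python
import numpy as np

def standardize(X, y):
    """Center and scale so that X'1 = 0, diag(X'X) = 1, y'1 = 0, y'y = 1."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)   # each column gets unit length
    yc = y - y.mean()
    ys = yc / np.linalg.norm(yc)
    return Xs, ys

# Simulated columns on deliberately different scales and with a nonzero mean.
rng = np.random.default_rng(1)
X_raw = rng.standard_normal((50, 4)) * [1.0, 5.0, 0.1, 2.0] + 3.0
y_raw = rng.standard_normal(50)

Xs, ys = standardize(X_raw, y_raw)
R = Xs.T @ Xs     # equals the sample correlation matrix of the columns
```

With this scaling, the diagonal of X'X is all ones, so diag{X'X} is the identity matrix.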
“Good” Estimate of β
Ordinary Least Squares (OLS)
$\hat\beta_{OLS} = (X^\top X)^{-1} X^\top y$
Best linear unbiased estimator.
“Noisy” when predictors are highly correlated.
Not uniquely defined if $p > n$.
LS Objective Functions for All Possible Models when p = 3
One-Variable Models:
$\|y - x_1\beta_1\|^2$
$\|y - x_2\beta_2\|^2$
$\|y - x_3\beta_3\|^2$
Two-Variable Models:
$\|y - x_1\beta_1 - x_2\beta_2\|^2$
$\|y - x_1\beta_1 - x_3\beta_3\|^2$
$\|y - x_2\beta_2 - x_3\beta_3\|^2$
Three-Variable Models:
$\|y - x_1\beta_1 - x_2\beta_2 - x_3\beta_3\|^2$
Alternative: Use All Information Simultaneously
$$Q(\beta) = \|y - x_1\beta_1\|^2 + \|y - x_2\beta_2\|^2 + \|y - x_3\beta_3\|^2 + \|y - x_1\beta_1 - x_2\beta_2\|^2 + \|y - x_1\beta_1 - x_3\beta_3\|^2 + \|y - x_2\beta_2 - x_3\beta_3\|^2 + \|y - x_1\beta_1 - x_2\beta_2 - x_3\beta_3\|^2$$
CLS Objective Function
The general form of the CLS objective function is a weighted average of the LS objective functions for all possible models:
$$Q_p(\beta, \omega) = \omega_1 \sum_{j=1}^{p} \|y - x_j\beta_j\|^2 + \omega_2 \sum_{j<k} \|y - x_j\beta_j - x_k\beta_k\|^2 + \omega_3 \sum_{j<k<l} \|y - x_j\beta_j - x_k\beta_k - x_l\beta_l\|^2 + \cdots + \omega_p \|y - X\beta\|^2$$
The model weights $\omega_1, \omega_2, \ldots, \omega_p \geq 0$ regulate the contribution of all models of a given size.
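The CLS objective function reduces to a simple form:
$$Q_p(\beta, \omega) = (\lambda_0 - p\lambda_1 + (p-1)\lambda_2)\, y^\top y + \lambda_2 \|y - X\beta\|^2 + (\lambda_1 - \lambda_2) \sum_{k=1}^{p} \|y - x_k\beta_k\|^2$$
where $\lambda_j = \sum_{k=1}^{p} \omega_k \binom{p-j}{k-j}$ for $j = 0, 1, 2$.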
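
The reduction can be checked numerically for small p by summing over every subset directly. The sketch below is only a sanity check under assumptions: the function names `q_bruteforce`, `lam`, and `q_reduced` and the random test data are hypothetical, and binomial terms with k < j are taken to be zero.

```python
import numpy as np
from itertools import combinations
from math import comb

def q_bruteforce(beta, X, y, w):
    """Sum w[k-1] * ||y - X_S beta_S||^2 over every subset S of size k = 1..p."""
    p = X.shape[1]
    total = 0.0
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            idx = list(S)
            r = y - X[:, idx] @ beta[idx]
            total += w[k - 1] * (r @ r)
    return total

def lam(j, w):
    """lambda_j = sum_k w_k * C(p - j, k - j); terms with k < j are dropped."""
    p = len(w)
    return sum(w[k - 1] * comb(p - j, k - j) for k in range(max(j, 1), p + 1))

def q_reduced(beta, X, y, w):
    """The three-term form of the CLS objective."""
    p = X.shape[1]
    l0, l1, l2 = lam(0, w), lam(1, w), lam(2, w)
    full = y - X @ beta
    marg = sum(np.sum((y - X[:, j] * beta[j]) ** 2) for j in range(p))
    return (l0 - p * l1 + (p - 1) * l2) * (y @ y) + l2 * (full @ full) + (l1 - l2) * marg

# Arbitrary data and nonnegative weights (2^5 - 1 = 31 models in total).
rng = np.random.default_rng(0)
n, p = 12, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
w = rng.uniform(0.0, 1.0, size=p)

q_slow = q_bruteforce(beta, X, y, w)
q_fast = q_reduced(beta, X, y, w)
```

The brute-force sum visits 2^p - 1 models, so it is only feasible for small p. The reduced form needs just the full-model and one-variable residuals.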
CLS Estimator
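Theorem
For a fixed set of model weights $\omega = (\omega_1, \omega_2, \ldots, \omega_p)^\top$ such that $\omega_k \geq 0$ for all $k$, the estimator
$$\hat\beta_{CLS} = \left(\tau X^\top X + (1 - \tau) D_{X^\top X}\right)^{-1} X^\top y$$
minimizes the CLS objective function, where $D_{X^\top X} = \mathrm{diag}\{X^\top X\}$, and
$$\tau = \lambda_2 / \lambda_1 = \sum_{k=1}^{p} \omega_k \binom{p-2}{k-2} \Big/ \sum_{k=1}^{p} \omega_k \binom{p-1}{k-1}$$
Proof is similar to that of OLS.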
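
A minimal numerical sketch of the estimator follows, assuming τ is supplied directly. The names `cls_estimator` and `cls_criterion` are hypothetical. The criterion is the β-dependent part of the reduced objective divided by λ1, which has the same minimizer as the full objective.

```python
import numpy as np

def cls_estimator(X, y, tau):
    """Solve (tau * X'X + (1 - tau) * diag(X'X)) beta = X'y."""
    G = X.T @ X
    A = tau * G + (1.0 - tau) * np.diag(np.diag(G))
    return np.linalg.solve(A, X.T @ y)

def cls_criterion(beta, X, y, tau):
    """tau * ||y - X beta||^2 + (1 - tau) * sum_k ||y - x_k beta_k||^2."""
    full = y - X @ beta
    marg = y[:, None] - X * beta       # column k holds y - x_k * beta_k
    return tau * (full @ full) + (1.0 - tau) * np.sum(marg ** 2)

rng = np.random.default_rng(2)
n, p, tau = 20, 30, 0.5                # p > n on purpose
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

b_hat = cls_estimator(X, y, tau)
base = cls_criterion(b_hat, X, y, tau)
# Random small perturbations of b_hat should never lower the criterion.
perturbed = [cls_criterion(b_hat + 0.01 * rng.standard_normal(p), X, y, tau)
             for _ in range(100)]
```

For τ < 1 and predictors with nonzero length, the matrix being inverted is positive definite. That is why the solve succeeds here even though p > n.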
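Choice of Model Weights
$$\hat\beta_{CLS} = \left(\tau X^\top X + (1 - \tau) D_{X^\top X}\right)^{-1} X^\top y$$
$$\tau = \frac{\lambda_2}{\lambda_1} = \frac{\sum_{k=1}^{p} \omega_k \binom{p-2}{k-2}}{\sum_{k=1}^{p} \omega_k \binom{p-1}{k-1}}$$
$$\frac{\omega_k \binom{p-2}{k-2}}{\omega_k \binom{p-1}{k-1}} = \frac{k-1}{p-1} \leq 1.$$
$\Rightarrow \tau \in [0, 1]$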
Ordinary Least Squares
$\omega_p = 1$ and $\omega_k = 0$ for all $k \neq p$
$\tau = 1$
Univariate Marginal Models
$\omega_1 = 1$ and $\omega_k = 0$ for all $k \neq 1$
$\tau = 0$
Sum Across All Models
$\omega_k = 1$ for all $k$
$\tau = 1/2$
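
The three special cases can be reproduced from the definition τ = λ2/λ1. The helper `tau_from_weights` and the choice p = 6 are assumptions made for this sketch; binomial terms with k < 2 in λ2 are taken to be zero.

```python
from math import comb

def tau_from_weights(w):
    """tau = lambda_2 / lambda_1 for weights w = (w_1, ..., w_p), with p >= 2."""
    p = len(w)
    lam1 = sum(w[k - 1] * comb(p - 1, k - 1) for k in range(1, p + 1))
    lam2 = sum(w[k - 1] * comb(p - 2, k - 2) for k in range(2, p + 1))
    return lam2 / lam1

p = 6
ols_w = [0] * (p - 1) + [1]       # only the full model
marginal_w = [1] + [0] * (p - 1)  # only the one-variable models
all_w = [1] * p                   # every model size weighted equally

taus = [tau_from_weights(w) for w in (ols_w, marginal_w, all_w)]
# taus == [1.0, 0.0, 0.5]
```

In the equal-weight case, λ1 = 2^(p-1) and λ2 = 2^(p-2), so τ = 1/2 for any p ≥ 2.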
Ridge Regression
Penalized Objective
The ridge estimator of $\beta$ minimizes
$\|y - X\beta\|^2 + \nu\|\beta\|^2$
where $\nu$ is called a penalty parameter.
Estimator
$\hat\beta_{Ridge}(\nu) = (X^\top X + \nu I_p)^{-1} X^\top y$
Connection to CLS
Ridge Estimator
$(X^\top X + \nu I_p)^{-1} X^\top y$
CLS Estimator
$(\tau X^\top X + (1 - \tau) I_p)^{-1} X^\top y$
Let $\tau = (\nu + 1)^{-1}$, then $\hat\beta_{CLS} = (1 + \nu)\hat\beta_{Ridge}(\nu)$
Comparison
Both estimators exist when p > n.
CLS is an “inflated” Ridge estimator.
Ridge shrinks coefficients to 0, CLS “shrinks” to the marginal.
ν chosen from the data, τ chosen by model weights.
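
The identity between the two estimators can be confirmed numerically on standardized data, where diag{X'X} is the identity. The data and the value ν = 2 below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, nu = 40, 5, 2.0

# Center and scale so that diag(X'X) = I_p and y'y = 1.
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = rng.standard_normal(n)
y = y - y.mean()
y = y / np.linalg.norm(y)

G, I = X.T @ X, np.eye(p)
ridge = np.linalg.solve(G + nu * I, X.T @ y)

tau = 1.0 / (1.0 + nu)
cls = np.linalg.solve(tau * G + (1.0 - tau) * I, X.T @ y)
# cls agrees with (1 + nu) * ridge up to floating-point error.
```

The factor arises because τX'X + (1 - τ)I = (X'X + νI)/(1 + ν) when τ = 1/(1 + ν).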
Ridge and CLS Trace
[Figure: trace plots of the standardized coefficients of X1, X2, and X3 in two panels, CLS and Ridge, with τ from 0 to 1 on the horizontal axis.]
Variable Screening via CLS
CLS Sequence
Standardize the response and covariates.
Compute the full CLS fit ($\tau = 1/2$).
Rank the variables by decreasing magnitude of the coefficients.

Variable   $|\hat\beta|$   Rank ϕ
x1         0.364           5
x2         3.785           1
x3         2.491           2
x4         1.027           3
x5         0.073           6
x6         0.365           4
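
The ordering procedure can be sketched as follows. The function name `cls_order` and the simulated data are assumptions for illustration; the coefficient magnitudes in `abs_beta` are the six values listed in the table.

```python
import numpy as np

def cls_order(X, y, tau=0.5):
    """Standardize, fit CLS with the given tau, and order variables by |beta-hat|."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)
    yc = y - y.mean()
    ys = yc / np.linalg.norm(yc)
    G = Xs.T @ Xs                      # diag(G) = 1 after standardization
    A = tau * G + (1.0 - tau) * np.eye(G.shape[0])
    beta = np.linalg.solve(A, Xs.T @ ys)
    return np.argsort(-np.abs(beta))   # variable indices, largest first

# Converting coefficient magnitudes to ranks (1 = largest).
abs_beta = np.array([0.364, 3.785, 2.491, 1.027, 0.073, 0.365])
order = np.argsort(-abs_beta)
rank = np.empty_like(order)
rank[order] = np.arange(1, len(order) + 1)
# rank is [5, 1, 2, 3, 6, 4]

# Simulated check: only the third predictor (index 2) matters.
rng = np.random.default_rng(5)
X = rng.standard_normal((60, 8))
y = 2.0 * X[:, 2] + rng.standard_normal(60)
sim_order = cls_order(X, y)
```

A screening step would then keep the first k entries of the ordering, in the same way that SIS keeps the k largest marginal correlations.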
Simulation Design
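Data Generation
$X$: $i$-th row i.i.d. $N(0, \Sigma)$
$y = X\beta + \varepsilon$
$\varepsilon \sim N(0, I)$
$\beta_j = c\,(1.5)^{I(j \leq 3)} (-1)^{u_j}$, $j \leq 6$
$\beta_j = 0$ for $j > 6$
$u_j \overset{iid}{\sim} \mathrm{Ber}(0.5)$
Parameters for Simulation
$p = 10n$
100 replicate datasets
$(\Sigma)_{i,j} = \rho^{I(i \neq j)}$
$\rho \in \{0, 0.6\}$
$c$ chosen such that $R^2 = \beta^\top \Sigma \beta / (\beta^\top \Sigma \beta + 1) = 0.6$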
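
One replicate of this design could be generated as in the sketch below. The function name `simulate` is hypothetical, and drawing the equicorrelated rows through a shared random factor is an implementation choice made here, not something stated on the slide.

```python
import numpy as np

def simulate(n, rho, rng, r2=0.6):
    """Draw one dataset with p = 10n and six nonzero coefficients."""
    p = 10 * n
    u = rng.integers(0, 2, size=6)                    # u_j ~ Ber(0.5)
    # Relative sizes 1.5 for j <= 3 and 1 for j = 4, 5, 6, with random signs.
    b = np.where(np.arange(1, 7) <= 3, 1.5, 1.0) * (-1.0) ** u
    # b' Sigma b when Sigma has unit diagonal and rho off the diagonal.
    quad = (1.0 - rho) * (b @ b) + rho * b.sum() ** 2
    c = np.sqrt(r2 / (1.0 - r2) / quad)               # gives beta' Sigma beta = r2 / (1 - r2)
    beta = np.zeros(p)
    beta[:6] = c * b
    # Shared-factor construction: x_ij = sqrt(1 - rho) z_ij + sqrt(rho) w_i
    # has variance 1 and correlation rho between any two columns.
    z = rng.standard_normal((n, p))
    w = rng.standard_normal((n, 1))
    X = np.sqrt(1.0 - rho) * z + np.sqrt(rho) * w
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta

X, y, beta = simulate(50, 0.6, np.random.default_rng(4))
```

The shared-factor draw avoids forming or factoring the p × p matrix Σ, which matters once p = 10n reaches the thousands.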
Comparing Accuracy
Accuracy
A method is said to be “accurate” if it orders the variables correctly, with
respect to the magnitude of the true parameter vector β (after
standardization).
Increasing Sample Size
[Figure: Accuracy of CLS and SIS against Sample Size (p = 10n), for sample sizes from 50 to 300, in two panels: Independence and Equicorrelation.]
Discussion and Summary
Future Work
Screening in GLM
Many screening studies involve a binary endpoint.
We have extended CLS to the GLM framework.
Weighted average of estimating equations.
Algorithm is slow to converge.
Distance Correlation
Distance correlation is a measure of independence.
It is not restricted to linear association.
SIS has been extended for use with distance correlation.
Idea: replace key pieces in CLS with distance correlation measures.
$(\tau X^\top X + (1 - \tau) I)^{-1} X^\top y$
Summary
CLS is a new method of estimation, related to ridge regression.
CLS estimator is a competitive screening technique.
There may be advantages to its use in large samples.
Appendices
References
Fan J and Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society, Series B, 70:849-911, 2008.
Lipovetsky S. Enhanced Ridge Regressions. Mathematical and Computer Modelling, 51:338-348, 2010.
Wang H. Forward Regression for Ultra-High Dimensional Variable Screening. JASA, 104:1512-1524, 2009.