ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel MSU 9/2012

advertisement
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
ROBUSTNESS AND HIGH DIMENSIONAL
DATA
Peter J. Bickel
UC Berkeley
MSU 9/2012
(Joint with Boaz Nadler, Bin Yu, N. el Karoui, Derek Bean and
Chingway Lim)
Outline
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
1
Robust M estimation in Linear Regression for fixed
number of covariates p
2
What is known if,
3
Least squares and Lasso: Some current results
4
Some curious simulations
5
Heuristics
6
Projection Pursuit
7
Discussion
p
n
→ 0, p → ∞?
The Regression Model and Least Squares Data
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
X i = (Zi , Yi )
Peter J. Bickel
i = 1, . . . , n
Zi
p × 1 iid
Assumed model: (n = 1)
Y = ZT β 0 + e
e⊥Z
Used as an approximation to general model
Y = µ(Z) + e,
E (e|Z) ≡ 0
Basic Theorem
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
P
If β̂ = arg min{kY − ZT βk2n } where kf (X )kn ≡ n1 ni=1 f 2 (X i )
−1 (n)
(Z , Y )(n)
a) β̂ = n1 Z (n) [Z (n) ]T
P
where Y ≡ (Y1 , . . . , Yn )T , (Z(n) , Y)(n) ≡ n1 ni=1 Yi Zi
(n)
Zp×n = (Z1 , . . . , Zn )
P
Z (n) [Z (n) ]T = ni=1 Zi ZT
i
b) If Σ ≡ E (ZZT ) is nonsingular, β 0 is TRUE
√
n(β̂ − β 0 ) =⇒ N(0, σ 2 Σ−1 )
β 0 = Σ−1 E (Y Z)
Robust M Estimation in Regression
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
β 0 = arg min E0 ρ(Y − ZT β)
ρ convex, symmetric about 0.
P
β̂ ρ ≡ arg min n1 ni=1 ρ(Yi − ZT
i β)
Thm (Huber) If ψ ≡ ρ0 is smooth, p is fixed, n → ∞,
E0 ψ 2 (e) < ∞, E0 ψ 0 (e) 6= 0 and Z is full dimensional,
E ZZT non singular
p fixed
Robust M Estimation in Regression
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
n
1
ψ(e)
1X
Zi
+ oP (n− 2 )
0
n
E0 ψ (e)
i=1
−1
√
n(β̂ − β 0 ) =⇒ Np 0, E0 (ZZT ) σ 2 (ρ)
β̂ ρ = β 0 −
σ 2 (ρ) =
E0 ψ 2 (e)
[E0 ψ 0 (e)]2
E.g.: ψ(t) = t LSE not robust against heavy tails
ψ(t) = hk (t) = t, |t| ≤ k (Huber)
= k sgn(t), |t| > k
ψ(t) = sgn(t)
(L1)
The curent focus of interest: p, n both large
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
What if p → ∞?
Theorem (Huber) (1973) (Negative)
If pn → c > 0
∃ contrast tT β 0
tT (β̂ LSE − β 0 ) is not asymptotically Gaussian
2
Note: E X T (β̂ LSE − β̂ 0 ) = σ 2 pn
=⇒ Data picked contrast is inconsistent
(Huber, Portnoy) (Positive)
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
(Huber) If the projection matrix [Z (n) ]T {Z (n) [Z (n) ]T }−1 Z (n)
has diagonal πii ≡ pn and
p3
n
→0
aT (β̂ − β) ∼ N 0, σ 2 (a, ψ)
σ 2 (a, ψ) =
E ψ 2 (e) T (n) (n) T −1
a
2 a (Z [Z ]
E ψ 0 (e)
Improved Conditions: Portnoy (1985) AS
What if
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
p
n
→ 0 more slowly or
p
n
→ c, 0 < c ≤ ∞
Gaussian Linear Regression Model
Peter J. Bickel
Yn×1 = Zn×p β p×1 + en×1
e = (e1 , . . . , en )T iid N(0, σ 2 )


Zij


j = 1, . . . , p
Z(j) ≡  ...  ,
Znj
Z ≡ (Z(1) , . . . , Z(p) )n×p = [Z (n) ]T
Suppose |Z(j) |2 = n, Z(a) ⊥ Z(b) a 6= b
(Canonical Gaussian Model)
Equivalent to:
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Gaussian White Noise Model (Donoho, Johnstone,
Kerkyacharian, Picard (1995))
Xj = βj + εj , j = 1, . . . , p, εj ∼ N 0,
Xj =
σ2 iid
n
[Z(j) ]T Y
n
Assume
i) β sparse: If S = {βj ; βj 6= 0}, |S| = s << p.
ii) Signal strong: j ∈ S =⇒ |βj | ≥ δn > 0
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Let,
X̂j ≡ Xj − hK (Xj )
Peter J. Bickel
r
hK ≡ Huber function, K = σ
GWN Result: If δn
p
X
q
n
log p
2 log p
n
→ ∞,
E (X̂j − βj )2 =
j=1
(Best possible if S is known)
q
If δn = Ω logn p , s → s log p.
sσ 2
1 + 0(1)
n
The Lasso: Donoho, Saunders, Chen (1998),
Tibshirani (1996)
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
β̂ L ≡ arg min |Y − Z β|2 + λ|β|1
For canonical model
Z(1) , . . . , Z(p) orthonormal |Z(j) |2 = n, j = 1, . . . , p .
Then, for suitable λ(K )
β̂jL = X̂j .
Conclusion
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
a) If Z(1) , . . . , Z(p) are nearly orthogonal
b) β is sparse
c) The signal is strong
β̂ L behaves as LS when we know S.
Many Results: Buhlmann, van de Geer, Tsybakov,
Meinshausen, Yu, Fan and collaborators, have found minimal
versions of a)-c) extended GWN result.
Robust Case:
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Bradic, Fan, Wang, JRSS(B) (2011)
Variable selection including robust objective functions give
results of this type with
σ2
s
E ψ 2 (e) s
→
n
[E ψ 0 (e)]2 n
Open problems in paralleling the work done for LS + Lasso but
see GLM results of van de Geer and others (Buhlmann, van de
Geer(to appear)) Statistics for High Dimensional Data
Behavior of kβ̂ − β0 k2
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
A. What if a) or b) or c) conditions don’t hold:
Another Lecture
B. What if pn → 0 < c < 1 and robust and least squares are
compared without penalization?
Surprising simulations
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Ratio of Expected Squared Norms, n=100, 1000 simulations
●
●
Peter J. Bickel
●
●
●
●
●
●
●
●
●
1.2
●
●
●
1.0
●
●
0.8
E(|LAD|^2)/E(|OLS|^2)
1.4
●
D.E. errors
Normal errors
●
20
40
60
p
80
n for
n the
1y≥t .
show
the functional inverse of F̄r, .
Integration by parts, symmetry of fr, as well as the above
characterization of c finally lead to the implicit characterization of r`1 (κ) (denoted simply by r for short in the next
ROBUSTNESS
AND HIGH
DIMENSIONAL
0 simulations
DATA
Surprising simulations
E(r2
(p,n))/ E(r2
(p,n)) and r2
(κ)/r2
(κ) computed from system, double exponential errors, 1000 simulations
L1
L2
L1
L2
n=1000
1.4
Peter J. Bickel
1.3
1.2
E(r2L1(p,n))/ E(r2L2(p,n))
1.1
1
0.9
0.8
0.7
0.6
Average over 1000 simulations
Prediction from heuristics
0.5
0
0.1
0.2
0.3
0.4
0.5
p/n
1
0.6
0.7
0.8
0.9
1
(Semi-heuristic) Results of el Karoui, Bean, Bickel,
Lim and Yu
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Yi = XiT β0 + i i = 1, . . . , n
i i.i.d. g ⊥
⊥ {Xi : i = 1, . . . , n}
Xi i.i.d. N (0, Σ)
Define β̂(ρ; β0 , Σ) = argmin
β
ρ convex
n → ∞, p/n → κ < 1.
Pn
i=1 ρ(Yi
− XiT β)
Key Lemma
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
L
β̂(ρ; β0 , Σ) = β0 + kβ̂(ρ; 0, Ip )kΣ1/2 u,
where u is uniform on the p sphere of radius 1.
∴ Can assume β0 = 0, Σ = Ip .
Special case of basic result
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
If rρ (p, n) ≡ kβ̂(ρ; 0, Ip )k
Under suitable regularity conditions,
p
rρ (p, n) → rρ (κ) solving:
E [proxc (ρ)]0 (ẑ ) = 1 − κ
E (ẑ − [proxc (ρ)](ẑ ))2 = κrρ2 (κ),
L
ẑ = + rρ (κ)Z , ⊥
⊥ Z , Z ∼ N (0, 1).
Remarks
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
proxc (ρ)(x) = argmin
y
(x − y )2
ρ(y ) +
2c
Solves, if ρ is differentiable, strictly convex:
y + cρ0 (y ) = x.
Key ideas:
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
I. Leave out 1 predictor {Xpi : i = 1, . . . , n}.
ri,[p] ≡ i − ViT γ̂
where Vi ≡ (X1i , . . . , Xp−1,i )T , γ̂ ≡ estimate of
(β01 , . . . , β0,p−1 )T without X (p) ≡ (Xp1 , . . . , Xpn )T . Then:
P
i Xpi ψ(ri,[p] )
2 0
T −1
i Xpi ψ (ri,[p] ) − vp Sp vp
(∗)
β̂p = P
+ op (n−1/2 )
ψ 0 (ri,[p] )Vi Xpi
T
vpT Sp−1 vp = X (p) D 1/2 ΠV D 1/2 X (p) , Dii = ψ 0 (ri,[p] ).
ΠV is a projection matrix of rank p − 1.
X (p) ⊥⊥ ri,[p] , Vi
vp =
P
i
Key ideas:
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
For LS ψ(x) = x and (∗) holds exactly.
Asymptotically, ri,[p] ∼ g ∗ Gaussian
√
nβ̂p ⇒ N (0, σ 2 (ρ, g , κ)).
σ2
for LS.
σ 2 (ρ, g , κ) = 1−κ
II. Analysis requires leave out one (Xi , Yi ) as well.
Projection Pursuit
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
(J) Kruskal (1969),(1972), Switzer (1970), Switzer and Wright
(1971), Friedman Tukey (1974), Huber (1985), Diaconis and
Freedman (1985)
Given:
X 1, . . . , X n
p×1
iid
Find “interesting” projections i.e.
a, |a| = 1 3
n
1X
Pn,a ≡
δaT X i is as non-normal as possible
n
i=1
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
In expectation Pn,a ≈ Pa ↔ fa ≡ density of aT X
Measures of Nonnormality
Ea (X − Ea X )3
SK (Pn,a ) ↔ 3 ≡ SK (Pa )
Ea (X − Ea X )2 2
Ea (X − Ea X )4
KURT (Pn,a ) ↔ 2 − 3 ≡ K (Pa )
Ea (X − Ea X )2
These are highly nonrobust to outliers.
Robust and “efficient” measures
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Alternatives: Robust Measures of Skewness , Kurtosis by
Trimming
Efficient Measure (Estimate)
Z
1
1
log fa fa (x)dx + log(2πe) 2 Ea (X − Ea X )2 2
Procedure: Maximize over a
A Rationale
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Diaconis, Freedman (1985)
If p → ∞ with n → ∞ under weak conditions e.g.
If Xj iid F , EX 2 < ∞, X = (X1 , . . . , Xp )T ,
Then, almost all Pn,a are asymptotically Gaussian, where
“almost” all is with respect to Lebesgue measure on surface of
unit sphere in R p .
Some Comfort
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Theorem If F is Gaussian iid and
p
n
→ 0, then
P
sup sup Pa (−∞, x] − Pna (−∞, x] −→ 0
a x
Some Caution
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
But, even if F is Gaussian iid, if pn → c > 0,
Z
2
max aT x − µ(Pn,a ) dPn,a 6→ 1
a
(Wigner, Geman)
How bad can things get?
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Suppose
p
n
−→ ∞
Theorem (B, Nadler)
Let X 1 , . . . , X n i.i.d. N(0, Ip ). Let G be any cdf such that
G − Φ doesn’t change sign.. Let F̂a denote the empirical cdf of
aT X 1 , . . . , aT X n , |a| = 1. Then:
h
i
P inf kF̂a − G k∞ → 0 = 1
a
where kf k∞ = supx |f (x)|.
Idea of proof:
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
a) Let ψ : R → R monotone increasing bounded.
P
Let Ψa ≡ n1 ni=1 ψ(aT Xi )
N = λp for λ to be chosen, λ > 1.
A = {aj : 1 ≤ j ≤ N} points in Sp
Then, for any ε > 0,
PΦ [K0 − ε ≤ Ψaj ≤ K0 + ε for some j, 1 ≤ j ≤ N] → 1
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
b) Let ψ (1) , . . . , ψ (m) be as above, ε > 0. For,
Kj arbitrary, j = 1, . . . , m,
Z
(j)
sgn Kj − ψ (ξ)φ(ξ)dξ constant
Then,
(j)
PΦ Kj − ε ≤ Ψa ≤ Kj , 1 ≤ j ≤ m for some a ∈ A → 1
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
c) Let ψ (j) (u) = 1(xj , ∞)
Kj = G (xj )
By b)
P ∃ j ∈ A 3 |Fˆ aj (xk )−G (xk )| ≤ ε for all 1 ≤ k ≤ m → 1
Lemma
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
∃ λ > 1, ε > 0, a1 , . . . , aN ∈ Sp , such that, for N = λp ,
∃ a1 , . . . , aN so that|(aj , aj 0 )| ≤ 1 − ε for all 1 ≤ j 6= j 0 ≤ N,
Then,
T
T
(aT
1 X 1 , . . . , aN X 1 ) ∼ NN (0, R)
R = kρij kN×N , where |ρij | ≤ 1 − ε, i 6= j
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
If X ∼ N(0, R0 )
R0 ≡ (1 − ε)11T + εId
1 ≡ (1, . . . , 1)T ,
X = (1 − ε)z0 1 + 1 − (1 − ε)2
z0 ∼ N1 (0, 1) ⊥ Z ∼ NN (0, Id )
1
2
Z
Slepian’s inequality
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Extended by Joag-dev et al (1983) Ann. Prob.
Let Z(j) ∼ N (0, R (j) ) j = 0, 1,
Peter J. Bickel
(j)
(j)
(j)
RN×N ≡ ρab = δab + (1 − δab )ρab ,
(0)
(1)
and ρab ≤ ρab for all a, b
Let Ψ : R N −→ R, bounded,
∂2ψ
∂xa ∂xb ≥ 0 all a 6= b. Then,
E Ψ(Z(0) ) ≤ E Ψ(Z(1) )
(Valid if ∆2a,b ψ ≥ 0, where
∆2a,b ψ = ψ(xa + ha , xb + hb , xc , c 6= a, b) − ψ(xa + ha , xb , xc , c 6=
a, b) − ψ(xa , xb + hb , xc , c 6= a, b) + ψ(xc , c = 1, . . . , N). )
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
Let ψj , j = 1, . . . , m be bounded non-decreasing function
Consider aT X 1 , . . . , aT X n , a ∈ A, a1 , . . . , aN as in Lemma.
~
~
Consider Ym×n where Yij = aT
i X j and Xc (Y) where
Xc (uik : i = 1, . . . , n, k = 1, . . . , N)
"
#
N
m
Y
Y
(`)
≡
1−
1 X (u1k , . . . , unk ) ≥ c` ,
`=1
k=1
and X (`) (v1 , . . . , vn ) =
1
n
Pn
X satisfies our hypotheses.
`=1 ψ` (ui ).
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
f) Apply large deviation theory to
n
1X
(0)
ψ(Zij )
n
Peter J. Bickel
i=1
where Z01 , . . . , Z0n
(0)
Zi
(0)
(0)
≡ (Zi1 , . . . , ZiN )T
(0)
are iid NN (0, RN×N )
(0)
RN×N = kδab + (1 − δab )(1 − ε)kN×N
to obtain
n
i
h1 X
(0)
P
Zij 6∈ [K0 − δ, K0 + δ] for any 1 ≤ j ≤ N −→ 0
n
i=1
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
g) Apply Slepian’s inequality to get a).
Generalize to b), c) using Joag-dev’s inequality.
Choose the {xj } to be dense to get (∗)
Discussion
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
II: Huber (1985)
1) “Perhaps the practical conclusion to be drawn is that we
shall have to acquiesce to the fact that PP will in practice
reveal not only true but also spurious structure and that
we must weed out the latter by other methods.”
2) What structures survive if we consider a random set of m
projections of the data?
E.g. Suppose the true population is
(1 − ε)N(0, Ip ) + εN(0, Σ)
Σ of rank << p
If we take m = o(n) projections, what chance do we have
of finding N(0, Σ) structure?
3) Conjecture: Result holds for all G .
Discussion
ROBUSTNESS
AND HIGH
DIMENSIONAL
DATA
Peter J. Bickel
I. Proof now available
Questions:
1) “Optimal” ρ, g , κ known: DONE.
2) Robust ρ, optimizing on g given g in a small
neighborhood of φ: OPEN.
Download