ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel UC Berkeley MSU 9/2012 (Joint with Boaz Nadler, Bin Yu, N. el Karoui, Derek Bean and Chingway Lim) Outline ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel 1 Robust M estimation in Linear Regression for fixed number of covariates p 2 What is known if, 3 Least squares and Lasso: Some current results 4 Some curious simulations 5 Heuristics 6 Projection Pursuit 7 Discussion p n → 0, p → ∞? The Regression Model and Least Squares Data ROBUSTNESS AND HIGH DIMENSIONAL DATA X i = (Zi , Yi ) Peter J. Bickel i = 1, . . . , n Zi p × 1 iid Assumed model: (n = 1) Y = ZT β 0 + e e⊥Z Used as an approximation to general model Y = µ(Z) + e, E (e|Z) ≡ 0 Basic Theorem ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel P If β̂ = arg min{kY − ZT βk2n } where kf (X )kn ≡ n1 ni=1 f 2 (X i ) −1 (n) (Z , Y )(n) a) β̂ = n1 Z (n) [Z (n) ]T P where Y ≡ (Y1 , . . . , Yn )T , (Z(n) , Y)(n) ≡ n1 ni=1 Yi Zi (n) Zp×n = (Z1 , . . . , Zn ) P Z (n) [Z (n) ]T = ni=1 Zi ZT i b) If Σ ≡ E (ZZT ) is nonsingular, β 0 is TRUE √ n(β̂ − β 0 ) =⇒ N(0, σ 2 Σ−1 ) β 0 = Σ−1 E (Y Z) Robust M Estimation in Regression ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel β 0 = arg min E0 ρ(Y − ZT β) ρ convex, symmetric about 0. P β̂ ρ ≡ arg min n1 ni=1 ρ(Yi − ZT i β) Thm (Huber) If ψ ≡ ρ0 is smooth, p is fixed, n → ∞, E0 ψ 2 (e) < ∞, E0 ψ 0 (e) 6= 0 and Z is full dimensional, E ZZT non singular p fixed Robust M Estimation in Regression ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel n 1 ψ(e) 1X Zi + oP (n− 2 ) 0 n E0 ψ (e) i=1 −1 √ n(β̂ − β 0 ) =⇒ Np 0, E0 (ZZT ) σ 2 (ρ) β̂ ρ = β 0 − σ 2 (ρ) = E0 ψ 2 (e) [E0 ψ 0 (e)]2 E.g.: ψ(t) = t LSE not robust against heavy tails ψ(t) = hk (t) = t, |t| ≤ k (Huber) = k sgn(t), |t| > k ψ(t) = sgn(t) (L1) The curent focus of interest: p, n both large ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel What if p → ∞? Theorem (Huber) (1973) (Negative) If pn → c > 0 ∃ contrast tT β 0 tT (β̂ LSE − β 0 ) is not asymptotically Gaussian 2 Note: E X T (β̂ LSE − β̂ 0 ) = σ 2 pn =⇒ Data picked contrast is inconsistent (Huber, Portnoy) (Positive) ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel (Huber) If the projection matrix [Z (n) ]T {Z (n) [Z (n) ]T }−1 Z (n) has diagonal πii ≡ pn and p3 n →0 aT (β̂ − β) ∼ N 0, σ 2 (a, ψ) σ 2 (a, ψ) = E ψ 2 (e) T (n) (n) T −1 a 2 a (Z [Z ] E ψ 0 (e) Improved Conditions: Portnoy (1985) AS What if ROBUSTNESS AND HIGH DIMENSIONAL DATA p n → 0 more slowly or p n → c, 0 < c ≤ ∞ Gaussian Linear Regression Model Peter J. Bickel Yn×1 = Zn×p β p×1 + en×1 e = (e1 , . . . , en )T iid N(0, σ 2 ) Zij j = 1, . . . , p Z(j) ≡ ... , Znj Z ≡ (Z(1) , . . . , Z(p) )n×p = [Z (n) ]T Suppose |Z(j) |2 = n, Z(a) ⊥ Z(b) a 6= b (Canonical Gaussian Model) Equivalent to: ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Gaussian White Noise Model (Donoho, Johnstone, Kerkyacharian, Picard (1995)) Xj = βj + εj , j = 1, . . . , p, εj ∼ N 0, Xj = σ2 iid n [Z(j) ]T Y n Assume i) β sparse: If S = {βj ; βj 6= 0}, |S| = s << p. ii) Signal strong: j ∈ S =⇒ |βj | ≥ δn > 0 ROBUSTNESS AND HIGH DIMENSIONAL DATA Let, X̂j ≡ Xj − hK (Xj ) Peter J. Bickel r hK ≡ Huber function, K = σ GWN Result: If δn p X q n log p 2 log p n → ∞, E (X̂j − βj )2 = j=1 (Best possible if S is known) q If δn = Ω logn p , s → s log p. sσ 2 1 + 0(1) n The Lasso: Donoho, Saunders, Chen (1998), Tibshirani (1996) ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel β̂ L ≡ arg min |Y − Z β|2 + λ|β|1 For canonical model Z(1) , . . . , Z(p) orthonormal |Z(j) |2 = n, j = 1, . . . , p . Then, for suitable λ(K ) β̂jL = X̂j . Conclusion ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel a) If Z(1) , . . . , Z(p) are nearly orthogonal b) β is sparse c) The signal is strong β̂ L behaves as LS when we know S. Many Results: Buhlmann, van de Geer, Tsybakov, Meinshausen, Yu, Fan and collaborators, have found minimal versions of a)-c) extended GWN result. Robust Case: ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Bradic, Fan, Wang, JRSS(B) (2011) Variable selection including robust objective functions give results of this type with σ2 s E ψ 2 (e) s → n [E ψ 0 (e)]2 n Open problems in paralleling the work done for LS + Lasso but see GLM results of van de Geer and others (Buhlmann, van de Geer(to appear)) Statistics for High Dimensional Data Behavior of kβ̂ − β0 k2 ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel A. What if a) or b) or c) conditions don’t hold: Another Lecture B. What if pn → 0 < c < 1 and robust and least squares are compared without penalization? Surprising simulations ROBUSTNESS AND HIGH DIMENSIONAL DATA Ratio of Expected Squared Norms, n=100, 1000 simulations ● ● Peter J. Bickel ● ● ● ● ● ● ● ● ● 1.2 ● ● ● 1.0 ● ● 0.8 E(|LAD|^2)/E(|OLS|^2) 1.4 ● D.E. errors Normal errors ● 20 40 60 p 80 n for n the 1y≥t . show the functional inverse of F̄r, . Integration by parts, symmetry of fr, as well as the above characterization of c finally lead to the implicit characterization of r`1 (κ) (denoted simply by r for short in the next ROBUSTNESS AND HIGH DIMENSIONAL 0 simulations DATA Surprising simulations E(r2 (p,n))/ E(r2 (p,n)) and r2 (κ)/r2 (κ) computed from system, double exponential errors, 1000 simulations L1 L2 L1 L2 n=1000 1.4 Peter J. Bickel 1.3 1.2 E(r2L1(p,n))/ E(r2L2(p,n)) 1.1 1 0.9 0.8 0.7 0.6 Average over 1000 simulations Prediction from heuristics 0.5 0 0.1 0.2 0.3 0.4 0.5 p/n 1 0.6 0.7 0.8 0.9 1 (Semi-heuristic) Results of el Karoui, Bean, Bickel, Lim and Yu ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Yi = XiT β0 + i i = 1, . . . , n i i.i.d. g ⊥ ⊥ {Xi : i = 1, . . . , n} Xi i.i.d. N (0, Σ) Define β̂(ρ; β0 , Σ) = argmin β ρ convex n → ∞, p/n → κ < 1. Pn i=1 ρ(Yi − XiT β) Key Lemma ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel L β̂(ρ; β0 , Σ) = β0 + kβ̂(ρ; 0, Ip )kΣ1/2 u, where u is uniform on the p sphere of radius 1. ∴ Can assume β0 = 0, Σ = Ip . Special case of basic result ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel If rρ (p, n) ≡ kβ̂(ρ; 0, Ip )k Under suitable regularity conditions, p rρ (p, n) → rρ (κ) solving: E [proxc (ρ)]0 (ẑ ) = 1 − κ E (ẑ − [proxc (ρ)](ẑ ))2 = κrρ2 (κ), L ẑ = + rρ (κ)Z , ⊥ ⊥ Z , Z ∼ N (0, 1). Remarks ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel proxc (ρ)(x) = argmin y (x − y )2 ρ(y ) + 2c Solves, if ρ is differentiable, strictly convex: y + cρ0 (y ) = x. Key ideas: ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel I. Leave out 1 predictor {Xpi : i = 1, . . . , n}. ri,[p] ≡ i − ViT γ̂ where Vi ≡ (X1i , . . . , Xp−1,i )T , γ̂ ≡ estimate of (β01 , . . . , β0,p−1 )T without X (p) ≡ (Xp1 , . . . , Xpn )T . Then: P i Xpi ψ(ri,[p] ) 2 0 T −1 i Xpi ψ (ri,[p] ) − vp Sp vp (∗) β̂p = P + op (n−1/2 ) ψ 0 (ri,[p] )Vi Xpi T vpT Sp−1 vp = X (p) D 1/2 ΠV D 1/2 X (p) , Dii = ψ 0 (ri,[p] ). ΠV is a projection matrix of rank p − 1. X (p) ⊥⊥ ri,[p] , Vi vp = P i Key ideas: ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel For LS ψ(x) = x and (∗) holds exactly. Asymptotically, ri,[p] ∼ g ∗ Gaussian √ nβ̂p ⇒ N (0, σ 2 (ρ, g , κ)). σ2 for LS. σ 2 (ρ, g , κ) = 1−κ II. Analysis requires leave out one (Xi , Yi ) as well. Projection Pursuit ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel (J) Kruskal (1969),(1972), Switzer (1970), Switzer and Wright (1971), Friedman Tukey (1974), Huber (1985), Diaconis and Freedman (1985) Given: X 1, . . . , X n p×1 iid Find “interesting” projections i.e. a, |a| = 1 3 n 1X Pn,a ≡ δaT X i is as non-normal as possible n i=1 ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel In expectation Pn,a ≈ Pa ↔ fa ≡ density of aT X Measures of Nonnormality Ea (X − Ea X )3 SK (Pn,a ) ↔ 3 ≡ SK (Pa ) Ea (X − Ea X )2 2 Ea (X − Ea X )4 KURT (Pn,a ) ↔ 2 − 3 ≡ K (Pa ) Ea (X − Ea X )2 These are highly nonrobust to outliers. Robust and “efficient” measures ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Alternatives: Robust Measures of Skewness , Kurtosis by Trimming Efficient Measure (Estimate) Z 1 1 log fa fa (x)dx + log(2πe) 2 Ea (X − Ea X )2 2 Procedure: Maximize over a A Rationale ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Diaconis, Freedman (1985) If p → ∞ with n → ∞ under weak conditions e.g. If Xj iid F , EX 2 < ∞, X = (X1 , . . . , Xp )T , Then, almost all Pn,a are asymptotically Gaussian, where “almost” all is with respect to Lebesgue measure on surface of unit sphere in R p . Some Comfort ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Theorem If F is Gaussian iid and p n → 0, then P sup sup Pa (−∞, x] − Pna (−∞, x] −→ 0 a x Some Caution ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel But, even if F is Gaussian iid, if pn → c > 0, Z 2 max aT x − µ(Pn,a ) dPn,a 6→ 1 a (Wigner, Geman) How bad can things get? ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Suppose p n −→ ∞ Theorem (B, Nadler) Let X 1 , . . . , X n i.i.d. N(0, Ip ). Let G be any cdf such that G − Φ doesn’t change sign.. Let F̂a denote the empirical cdf of aT X 1 , . . . , aT X n , |a| = 1. Then: h i P inf kF̂a − G k∞ → 0 = 1 a where kf k∞ = supx |f (x)|. Idea of proof: ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel a) Let ψ : R → R monotone increasing bounded. P Let Ψa ≡ n1 ni=1 ψ(aT Xi ) N = λp for λ to be chosen, λ > 1. A = {aj : 1 ≤ j ≤ N} points in Sp Then, for any ε > 0, PΦ [K0 − ε ≤ Ψaj ≤ K0 + ε for some j, 1 ≤ j ≤ N] → 1 ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel b) Let ψ (1) , . . . , ψ (m) be as above, ε > 0. For, Kj arbitrary, j = 1, . . . , m, Z (j) sgn Kj − ψ (ξ)φ(ξ)dξ constant Then, (j) PΦ Kj − ε ≤ Ψa ≤ Kj , 1 ≤ j ≤ m for some a ∈ A → 1 ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel c) Let ψ (j) (u) = 1(xj , ∞) Kj = G (xj ) By b) P ∃ j ∈ A 3 |Fˆ aj (xk )−G (xk )| ≤ ε for all 1 ≤ k ≤ m → 1 Lemma ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel ∃ λ > 1, ε > 0, a1 , . . . , aN ∈ Sp , such that, for N = λp , ∃ a1 , . . . , aN so that|(aj , aj 0 )| ≤ 1 − ε for all 1 ≤ j 6= j 0 ≤ N, Then, T T (aT 1 X 1 , . . . , aN X 1 ) ∼ NN (0, R) R = kρij kN×N , where |ρij | ≤ 1 − ε, i 6= j ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel If X ∼ N(0, R0 ) R0 ≡ (1 − ε)11T + εId 1 ≡ (1, . . . , 1)T , X = (1 − ε)z0 1 + 1 − (1 − ε)2 z0 ∼ N1 (0, 1) ⊥ Z ∼ NN (0, Id ) 1 2 Z Slepian’s inequality ROBUSTNESS AND HIGH DIMENSIONAL DATA Extended by Joag-dev et al (1983) Ann. Prob. Let Z(j) ∼ N (0, R (j) ) j = 0, 1, Peter J. Bickel (j) (j) (j) RN×N ≡ ρab = δab + (1 − δab )ρab , (0) (1) and ρab ≤ ρab for all a, b Let Ψ : R N −→ R, bounded, ∂2ψ ∂xa ∂xb ≥ 0 all a 6= b. Then, E Ψ(Z(0) ) ≤ E Ψ(Z(1) ) (Valid if ∆2a,b ψ ≥ 0, where ∆2a,b ψ = ψ(xa + ha , xb + hb , xc , c 6= a, b) − ψ(xa + ha , xb , xc , c 6= a, b) − ψ(xa , xb + hb , xc , c 6= a, b) + ψ(xc , c = 1, . . . , N). ) ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel Let ψj , j = 1, . . . , m be bounded non-decreasing function Consider aT X 1 , . . . , aT X n , a ∈ A, a1 , . . . , aN as in Lemma. ~ ~ Consider Ym×n where Yij = aT i X j and Xc (Y) where Xc (uik : i = 1, . . . , n, k = 1, . . . , N) " # N m Y Y (`) ≡ 1− 1 X (u1k , . . . , unk ) ≥ c` , `=1 k=1 and X (`) (v1 , . . . , vn ) = 1 n Pn X satisfies our hypotheses. `=1 ψ` (ui ). ROBUSTNESS AND HIGH DIMENSIONAL DATA f) Apply large deviation theory to n 1X (0) ψ(Zij ) n Peter J. Bickel i=1 where Z01 , . . . , Z0n (0) Zi (0) (0) ≡ (Zi1 , . . . , ZiN )T (0) are iid NN (0, RN×N ) (0) RN×N = kδab + (1 − δab )(1 − ε)kN×N to obtain n i h1 X (0) P Zij 6∈ [K0 − δ, K0 + δ] for any 1 ≤ j ≤ N −→ 0 n i=1 ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel g) Apply Slepian’s inequality to get a). Generalize to b), c) using Joag-dev’s inequality. Choose the {xj } to be dense to get (∗) Discussion ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel II: Huber (1985) 1) “Perhaps the practical conclusion to be drawn is that we shall have to acquiesce to the fact that PP will in practice reveal not only true but also spurious structure and that we must weed out the latter by other methods.” 2) What structures survive if we consider a random set of m projections of the data? E.g. Suppose the true population is (1 − ε)N(0, Ip ) + εN(0, Σ) Σ of rank << p If we take m = o(n) projections, what chance do we have of finding N(0, Σ) structure? 3) Conjecture: Result holds for all G . Discussion ROBUSTNESS AND HIGH DIMENSIONAL DATA Peter J. Bickel I. Proof now available Questions: 1) “Optimal” ρ, g , κ known: DONE. 2) Robust ρ, optimizing on g given g in a small neighborhood of φ: OPEN.