STAT 200C: High-dimensional Statistics
Arash A. Amini
April 14, 2021
1 / 29

Linear regression setup
• The data is (y, X) where y ∈ ℝⁿ and X ∈ ℝ^{n×d}, and the model is y = Xθ* + w.
• θ* ∈ ℝ^d is an unknown parameter.
• w ∈ ℝⁿ is the vector of noise variables.
• Equivalently, y_i = ⟨θ*, x_i⟩ + w_i, i = 1, …, n, where x_i ∈ ℝ^d is the ith row of X, so that X = [x_1ᵀ; x_2ᵀ; … ; x_nᵀ] ∈ ℝ^{n×d} (rows stacked).
• Recall ⟨θ*, x_i⟩ = Σ_{j=1}^d θ*_j x_{ij}.
• We are mainly interested in the case n ≪ d.
2 / 29

Sparsity models
• When n < d, there is no hope of estimating θ*: the system y = Xθ has many solutions in general, so θ* is not identifiable,
• unless we impose some sort of low-dimensional model on θ*.
• Support of θ* (recall [d] = {1, …, d}):
    supp(θ*) := S(θ*) = { j ∈ [d] : θ*_j ≠ 0 }.
• Hard sparsity assumption: s = |S(θ*)| ≪ d.
• Weaker sparsity assumption via ℓ_q balls, for q ∈ [0, 1]:
    B_q(R_q) = { θ ∈ ℝ^d : Σ_{j=1}^d |θ_j|^q ≤ R_q }.
• q = 1 gives the ℓ_1 ball.
• q = 0 gives the ℓ_0 ball, the same as hard sparsity:
    ‖θ*‖_0 := |S(θ*)| = #{ j : θ*_j ≠ 0 }.
  (Despite the notation, ‖·‖_0 is not a norm; it is, however, subadditive: ‖θ + θ'‖_0 ≤ ‖θ‖_0 + ‖θ'‖_0.)
3 / 29

[Figure: unit ℓ_q balls in ℝ² for several values of q ≤ 1 (from the HDS book).]
4 / 29

Basis pursuit
• Consider the noiseless case y = Xθ*.
• We assume that ‖θ*‖_0 is small.
• Ideal program to solve:
    min_{θ ∈ ℝ^d} ‖θ‖_0   subject to   y = Xθ.
  [Handwritten note: if every set of 2s columns of X is linearly independent, then an s-sparse solution of y = Xθ is unique; the question is whether we can recover it efficiently.]
• ‖·‖_0 is highly non-convex; relax it to ‖·‖_1:
    min_{θ ∈ ℝ^d} ‖θ‖_1   subject to   y = Xθ.   (1)
  This is called basis pursuit (regression).
• (1) is a convex program.
• In fact, it can be written as a linear program.¹
• Global solutions can be obtained very efficiently.
¹ Exercise: Introduce auxiliary variables s_j ∈ ℝ and note that minimizing Σ_j s_j subject to |θ_j| ≤ s_j gives the ℓ_1 norm of θ.
5 / 29

Restricted null space property (RNS)
• Define
    C(S) = { Δ ∈ ℝ^d : ‖Δ_{S^c}‖_1 ≤ ‖Δ_S‖_1 }.   (2)

Theorem 1
The following two are equivalent:
• For any θ* ∈ ℝ^d with support contained in S, the basis pursuit program (1) applied to the data (y = Xθ*, X) has the unique solution θ̂ = θ*.
• The restricted null space (RNS) property holds, i.e.,
    C(S) ∩ ker(X) = {0}.   (3)
6 / 29

Proof
• Consider the tangent cone to the ℓ_1 ball (of radius ‖θ*‖_1) at θ*:
    T(θ*) = { Δ ∈ ℝ^d : ‖θ* + tΔ‖_1 ≤ ‖θ*‖_1 for some t > 0 },
  i.e., the set of descent directions for the ℓ_1 norm at the point θ*.
• The feasible set is θ* + ker(X), i.e.,
• ker(X) is the set of feasible directions Δ = θ − θ*.
• Hence, there is a minimizer other than θ* if and only if
    T(θ*) ∩ ker(X) ≠ {0}.   (4)
• It is enough to show that
    C(S) = ⋃_{θ* ∈ ℝ^d : supp(θ*) ⊆ S} T(θ*).
7 / 29

[Figure: d = 2, [d] = {1, 2}, S = {2}. The ℓ_1 ball B_1, the points θ*_{(1)} = (0, 1) and θ*_{(2)} = (0, −1), their tangent cones T(θ*_{(1)}) and T(θ*_{(2)}), a line Ker(X), and the cone C(S) = { (Δ_1, Δ_2) : |Δ_1| ≤ |Δ_2| }.]
8 / 29

• It is enough to show that
    C(S) = ⋃_{θ* ∈ ℝ^d : supp(θ*) ⊆ S} T(θ*).   (5)
• We have Δ ∈ T_1(θ*) iff² ‖θ* + Δ‖_1 ≤ ‖θ*‖_1; since supp(θ*) ⊆ S, decomposability of the ℓ_1 norm over the disjoint supports S and S^c gives
    ‖θ*_S + Δ_S‖_1 + ‖Δ_{S^c}‖_1 ≤ ‖θ*_S‖_1.
• Hence Δ ∈ T_1(θ*) for some θ* ∈ ℝ^d with supp(θ*) ⊆ S iff
    ‖Δ_{S^c}‖_1 ≤ sup_{θ*_S ∈ ℝ^s} [ ‖θ*_S‖_1 − ‖θ*_S + Δ_S‖_1 ] = ‖Δ_S‖_1,
  where the supremum equals ‖Δ_S‖_1 by the triangle inequality.
² Let T_1(θ*) be the subset of T(θ*) where t = 1, and argue that w.l.o.g. we can work with this subset.
9 / 29
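Before turning to sufficient conditions, here is a minimal numerical sketch (not from the slides) of basis pursuit: the program (1) is solved as a linear program via the auxiliary-variable reformulation from the footnote above, and recovery is checked in the noiseless setting of Theorem 1. The Gaussian design, the dimensions, and the use of scipy.optimize.linprog are illustrative assumptions.

# Basis pursuit (1) as a linear program: variables z = (theta, t) with |theta_j| <= t_j,
# minimize sum_j t_j subject to X theta = y.  Illustrative sketch, not from the slides.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d, s = 40, 100, 5                        # n < d, s-sparse signal
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = rng.standard_normal(s)
y = X @ theta_star                          # noiseless observations

c = np.concatenate([np.zeros(d), np.ones(d)])           # objective: sum of t
A_ub = np.block([[ np.eye(d), -np.eye(d)],               #  theta - t <= 0
                 [-np.eye(d), -np.eye(d)]])              # -theta - t <= 0
b_ub = np.zeros(2 * d)
A_eq = np.hstack([X, np.zeros((n, d))])                  # X theta = y
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d)
theta_hat = res.x[:d]
print("max |theta_hat - theta_star| =", np.max(np.abs(theta_hat - theta_star)))

For dimensions like these (n on the order of s log(d/s) or larger), one typically observes exact recovery up to solver precision, consistent with Theorem 1.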
Sufficient conditions for restricted nullspace
• [d] := {1, …, d}.
• For a matrix X ∈ ℝ^{n×d}, let X_j be its jth column (for j ∈ [d]).
• The pairwise incoherence of X is defined as
    δ_PW(X) := max_{i,j ∈ [d]} | ⟨X_i, X_j⟩/n − 1{i = j} |.
• Alternative form: Xᵀ X is the Gram matrix of X, with (Xᵀ X)_{ij} = ⟨X_i, X_j⟩, so
    δ_PW(X) = ‖ Xᵀ X / n − I_d ‖_∞,
  where ‖·‖_∞ is the elementwise ℓ_∞ norm of the matrix (viewing it as a vector).

Proposition 1 (HDS Prop. 7.1)
(Uniform) restricted nullspace holds for all S with |S| ≤ s if δ_PW(X) ≤ 1/(3s).
• Proof: Exercise 7.3.
10 / 29

Restricted isometry property (RIP)
• A more relaxed condition:

Definition 1 (RIP)
X ∈ ℝ^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if
    ||| X_Sᵀ X_S / n − I |||_op ≤ δ_s(X)   for all S with |S| ≤ s.

• PW incoherence is close to RIP with s = 2;
• for example, when ‖X_j/√n‖_2 = 1 for all j, we have δ_2(X) = δ_PW(X).
• In general, for any s ≥ 2 (Exercise 7.4),
    δ_PW(X) ≤ δ_s(X) ≤ s · δ_PW(X).
11 / 29

Definition (RIP)
X ∈ ℝ^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if
    ||| X_Sᵀ X_S / n − I |||_op ≤ δ_s(X)   for all S with |S| ≤ s.

• Let x_iᵀ be the ith row of X. Consider the sample covariance matrix:
    Σ̂ := (1/n) Xᵀ X = (1/n) Σ_{i=1}^n x_i x_iᵀ ∈ ℝ^{d×d}.
• Then Σ̂_SS = (1/n) X_Sᵀ X_S; hence RIP reads
    ||| Σ̂_SS − I |||_op ≤ δ_s < 1,
  i.e., Σ̂_SS ≈ I_s. More precisely,
    (1 − δ)‖u‖_2 ≤ ‖Σ̂_SS u‖_2 ≤ (1 + δ)‖u‖_2,   ∀ u ∈ ℝ^s.
12 / 29

• RIP gives sufficient conditions:

Proposition 2 (HDS Prop. 7.2)
(Uniform) restricted null space holds for all S with |S| ≤ s if δ_2s(X) ≤ 1/3.

• Consider a sub-Gaussian matrix X with i.i.d. entries, E X_ij = 0 and E X_ij² = 1 (Exercise 7.7):
• We have
    n ≳ s² log d  ⟹  δ_PW(X) < 1/(3s)   w.h.p.,
    n ≳ s log(ed/s)  ⟹  δ_2s(X) < 1/3   w.h.p.
• The sample complexity requirement for RIP is milder.
13 / 29

Neither RIP nor PWI is necessary
• For more general covariances Σ, it is harder to satisfy either PWI or RIP.
• Consider X ∈ ℝ^{n×d} with i.i.d. rows X_i ~ N(0, Σ).
• Letting 1 ∈ ℝ^d be the all-ones vector, take
    Σ := (1 − µ) I_d + µ 1 1ᵀ   for µ ∈ [0, 1).
  (A spiked covariance matrix.)
• We have λ_max(Σ_SS) = 1 + µ(s − 1) → ∞ as s → ∞.
• Exercise 7.8:
  (a) Pairwise incoherence is violated w.h.p. unless µ ≪ 1/s.
  (b) RIP is violated w.h.p. unless µ ≪ 1/√s; in fact, δ_2s(X) grows without bound with s for any fixed µ ∈ (0, 1).
• However, for any µ ∈ [0, 1), basis pursuit succeeds w.h.p. if n ≳ s log(ed/s). (A later result shows this.)
14 / 29

Noisy sparse regression
• A very popular estimator is the ℓ_1-regularized least-squares (the Lasso):
    θ̂ ∈ argmin_{θ ∈ ℝ^d} [ (1/2n)‖y − Xθ‖²_2 + λ‖θ‖_1 ].   (6)
• The idea: minimizing the ℓ_1 norm leads to sparse solutions.
• (6) is a convex program; a global solution can be obtained efficiently (e.g., by coordinate descent, as in glmnet).
• Other options: the constrained form of the lasso,
    min_{‖θ‖_1 ≤ R} (1/2n)‖y − Xθ‖²_2.   (7)
15 / 29
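As a concrete illustration of (6), here is a minimal sketch of the Lagrangian Lasso solved by proximal gradient descent (ISTA). The slides do not prescribe an algorithm (coordinate descent, as in glmnet, is the usual choice in practice); the solver, the random instance, and the step-size choice below are assumptions made for illustration only.

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: coordinatewise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    # Minimize (1/2n) ||y - X theta||_2^2 + lam * ||theta||_1 by ISTA.
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the smooth part's gradient
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n   # gradient of (1/2n)||y - X theta||_2^2
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Illustrative random instance (assumed, not from the slides).
rng = np.random.default_rng(1)
n, d, s, sigma = 200, 500, 10, 0.5
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)
lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)   # the order suggested by the theory below
theta_hat = lasso_ista(X, y, lam)
print("||theta_hat - theta_star||_2 =", np.linalg.norm(theta_hat - theta_star))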
[Handwritten side note: recall the variational characterization λ_max(A) = max_{‖u‖_2 = 1} uᵀAu and λ_min(A) = min_{‖u‖_2 = 1} uᵀAu, which motivates the following "restricted" eigenvalue condition.]

Restricted eigenvalue condition
• For a constant α ≥ 1, define
    C_α(S) := { Δ ∈ ℝ^d : ‖Δ_{S^c}‖_1 ≤ α‖Δ_S‖_1 }.
• A strengthening of RNS is:

Definition 2 (RE condition)
A matrix X satisfies the restricted eigenvalue (RE) condition over S with parameters (κ, α) if
    (1/n)‖XΔ‖²_2 ≥ κ‖Δ‖²_2   for all Δ ∈ C_α(S).

• RNS corresponds to
    (1/n)‖XΔ‖²_2 > 0   for all Δ ∈ C_1(S) \ {0}.
16 / 29

Deviation bounds under RE

Theorem 2 (Deterministic deviations)
Assume that y = Xθ* + w, where X ∈ ℝ^{n×d} and θ* ∈ ℝ^d, and
• θ* is supported on S ⊂ [d] with |S| ≤ s,
• X satisfies RE(κ, 3) over S.
Let us define z = Xᵀw/n. Then, we have the following:
(a) Any solution of the Lasso (6) with λ ≥ 2‖z‖_∞ satisfies
    ‖θ̂ − θ*‖_2 ≤ (3/κ)√s λ.
(b) Any solution of the constrained Lasso (7) with R = ‖θ*‖_1 satisfies
    ‖θ̂ − θ*‖_2 ≤ (4/κ)√s ‖z‖_∞.
17 / 29

Example (fixed design regression)
• Assume y = Xθ* + w where w ~ N(0, σ²I_n), and
• X ∈ ℝ^{n×d} is fixed, satisfying the RE condition and the normalization
    max_{j=1,…,d} ‖X_j‖_2/√n ≤ C,
  where X_j is the jth column of X.
• Recall z = Xᵀw/n.
• It is easy to show that, with probability at least 1 − 2e^{−nδ²/2},
    ‖z‖_∞ ≤ Cσ ( √(2 log d / n) + δ ).
• Thus, setting λ = 2Cσ( √(2 log d / n) + δ ), with probability at least 1 − 2e^{−nδ²/2} the Lasso solution satisfies
    ‖θ̂ − θ*‖_2 ≤ (6Cσ/κ) √s ( √(2 log d / n) + δ ).
18 / 29

[Handwritten derivation of this error bound via the basic inequality; see the typed proof on the following slides.]

• Taking λ ≍ σ√(2 log d / n), we have
    ‖θ̂ − θ*‖_2 ≲ σ √( s log d / n )   w.h.p. (i.e., with probability 1 − o(1));
  in particular, the Lasso is consistent when (s log d)/n → 0.
• This is the typical high-dimensional scaling in sparse problems.
• Had we known the support S in advance, our rate would be (w.h.p.)
    ‖θ̂ − θ*‖_2 ≲ σ √( s / n ).
• The log d factor is the price for not knowing the support;
• roughly the price for searching over the (d choose s) collection of candidate supports.
19 / 29

Proof of Theorem 2
• Let us simplify the loss L(θ) := (1/2n)‖Xθ − y‖².
• Setting Δ = θ − θ*,
    L(θ) = (1/2n)‖X(θ − θ*) − w‖²
         = (1/2n)‖XΔ − w‖²
         = (1/2n)‖XΔ‖² − (1/n)⟨XΔ, w⟩ + const.
         = (1/2n)‖XΔ‖² − (1/n)⟨Δ, Xᵀw⟩ + const.
         = (1/2n)‖XΔ‖² − ⟨Δ, z⟩ + const.,
  where z = Xᵀw/n. Hence,
    L(θ) − L(θ*) = (1/2n)‖XΔ‖² − ⟨Δ, z⟩.   (8)
• Exercise: Show that (8) is the Taylor expansion of L around θ*.
20 / 29

Proof (constrained version)
• By optimality of θ̂ and feasibility of θ*: L(θ̂) ≤ L(θ*).
• The error vector Δ̂ := θ̂ − θ* satisfies the basic inequality
    (1/2n)‖XΔ̂‖²_2 ≤ ⟨z, Δ̂⟩.
• Using Hölder's inequality,
    (1/2n)‖XΔ̂‖²_2 ≤ ‖z‖_∞ ‖Δ̂‖_1.
• Since ‖θ̂‖_1 ≤ ‖θ*‖_1, we have Δ̂ = θ̂ − θ* ∈ C_1(S), hence
    ‖Δ̂‖_1 = ‖Δ̂_S‖_1 + ‖Δ̂_{S^c}‖_1 ≤ 2‖Δ̂_S‖_1 ≤ 2√s ‖Δ̂‖_2.
• Combined with the RE condition (Δ̂ ∈ C_3(S) as well),
    (κ/2)‖Δ̂‖²_2 ≤ 2√s ‖z‖_∞ ‖Δ̂‖_2,
  which gives the desired result.
21 / 29
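Before the Lagrangian version of the proof, here is a small simulation (an assumed setup, not from the slides) illustrating the choice of λ in the fixed-design example above: for columns normalized so that ‖X_j‖_2/√n = 1 (i.e., C = 1) and Gaussian noise, ‖Xᵀw/n‖_∞ concentrates at the level σ√(2 log d / n).

import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, reps = 400, 1000, 1.0, 200
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=0) / np.sqrt(n)    # enforce ||X_j||_2 / sqrt(n) = 1, i.e. C = 1

z_inf = np.empty(reps)
for r in range(reps):
    w = sigma * rng.standard_normal(n)
    z_inf[r] = np.max(np.abs(X.T @ w / n))     # ||z||_inf for this noise draw

print("average ||z||_inf         :", z_inf.mean())
print("sigma * sqrt(2 log d / n) :", sigma * np.sqrt(2 * np.log(d) / n))

The two printed numbers are of the same order, which is why λ is taken proportional to σ√(2 log d / n).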
Proof (Lagrangian version)
• Let L_λ(θ) := L(θ) + λ‖θ‖_1 be the regularized loss.
• The basic inequality is
    L(θ̂) + λ‖θ̂‖_1 ≤ L(θ*) + λ‖θ*‖_1.
• Rearranging,
    (1/2n)‖XΔ̂‖²_2 ≤ ⟨z, Δ̂⟩ + λ( ‖θ*‖_1 − ‖θ̂‖_1 ).
• We have
    ‖θ*‖_1 − ‖θ̂‖_1 = ‖θ*_S‖_1 − ‖θ*_S + Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ≤ ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1.
• Since λ ≥ 2‖z‖_∞,
    (1/n)‖XΔ̂‖²_2 ≤ λ‖Δ̂‖_1 + 2λ( ‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ) ≤ λ( 3‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 ).
• It follows that Δ̂ ∈ C_3(S), and the rest of the proof follows: by the RE condition, κ‖Δ̂‖²_2 ≤ 3λ‖Δ̂_S‖_1 ≤ 3λ√s‖Δ̂‖_2, giving ‖Δ̂‖_2 ≤ (3/κ)√s λ.
22 / 29

RE condition for anisotropic design
• For a PSD matrix Σ, let ρ²(Σ) = max_i Σ_ii.

Theorem 3
Let X ∈ ℝ^{n×d} have rows i.i.d. from N(0, Σ). Then there exist universal constants c₁ < 1 < c₂ such that
    ‖Xθ‖²_2 / n ≥ c₁ ‖√Σ θ‖²_2 − c₂ ρ²(Σ) (log d / n) ‖θ‖²_1   for all θ ∈ ℝ^d   (9)
with probability at least 1 − e^{−n/32}/(1 − e^{−n/32}).

• Exercise 7.11: (9) implies the RE condition over C_3(S), uniformly over all subsets of cardinality
    |S| ≤ (c₁ λ_min(Σ) / (32 c₂ ρ²(Σ))) · (n / log d),
  assuming the smallest eigenvalue λ_min(Σ) is positive.
• In other words, n ≳ s log d ⟹ the RE condition holds over C_3(S) for all |S| ≤ s.
[Handwritten proof sketch of Theorem 3: restrict to θ with ‖√Σ θ‖_2 = 1, control the supremum of the resulting Gaussian process via a Gaussian comparison argument (Gordon's inequality, HDS Thm. 6.65) and concentration; details omitted here.]
23 / 29

Comments
• Note that
    E[ ‖Xθ‖²_2 / n ] = ‖√Σ θ‖²_2.
• Bound (9) says that ‖Xθ‖²_2/n is lower-bounded by a multiple of its expectation, minus a slack ∝ ‖θ‖²_1.
• Why is the slack needed?
• In the high-dimensional setting, ‖Xθ‖_2 = 0 for any θ ∈ ker(X), while ‖√Σ θ‖_2 > 0 for any θ ≠ 0, assuming Σ is non-singular.
• In fact, ‖√Σ θ‖_2 is uniformly bounded below in that case:
    ‖√Σ θ‖²_2 ≥ λ_min(Σ) ‖θ‖²_2,
  showing that the population-level version of X/√n, that is, √Σ, satisfies RE, and in fact a global eigenvalue condition.
• (9) gives a nontrivial/good lower bound for θ for which ‖θ‖_1 is small compared to ‖θ‖_2, i.e., sparse vectors.
• Recall that if θ is s-sparse, then ‖θ‖_1 ≤ √s ‖θ‖_2,
• while for general θ ∈ ℝ^d, we only have ‖θ‖_1 ≤ √d ‖θ‖_2.
24 / 29
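The following minimal sketch (an assumed setup, not from the slides) illustrates the point above: when n < d, directions in ker(X) have ‖Xθ‖_2 = 0 even though θᵀΣθ = ‖√Σθ‖²_2 > 0, but such directions have a large ℓ_1/ℓ_2 ratio, so the ‖θ‖²_1 slack in (9) absorbs them, while sparse directions do not pay much for it.

import numpy as np

rng = np.random.default_rng(3)
n, d, mu, s = 50, 200, 0.3, 5
Sigma = (1 - mu) * np.eye(d) + mu * np.ones((d, d))       # spiked covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # rows i.i.d. N(0, Sigma)

# A unit vector in ker(X): a right singular vector outside the row space (exists since n < d).
_, _, Vt = np.linalg.svd(X)
theta_ker = Vt[-1]

# An s-sparse unit vector for comparison.
theta_sp = np.zeros(d)
theta_sp[:s] = 1 / np.sqrt(s)

for name, th in [("kernel direction  ", theta_ker), ("5-sparse direction", theta_sp)]:
    print(name,
          "||X th||^2/n =", round(float(np.sum((X @ th) ** 2) / n), 4),
          " th' Sigma th =", round(float(th @ Sigma @ th), 4),
          " ||th||_1/||th||_2 =", round(float(np.sum(np.abs(th)) / np.linalg.norm(th)), 2))

For the kernel direction, the empirical quadratic form is zero while the population one is at least λ_min(Σ) = 1 − µ, and the ℓ_1/ℓ_2 ratio is close to √d; for the sparse direction the ratio is only √s, so the slack term is of lower order.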
Examples
• Toeplitz family: Σ_ij = ν^{|i−j|}, with ρ²(Σ) = 1 and λ_min(Σ) ≥ (1 − ν)² > 0.
• Spiked model: Σ := (1 − µ)I_d + µ 1 1ᵀ, with ρ²(Σ) = 1 and λ_min(Σ) = 1 − µ.
• For future applications, note that (9) implies
    ‖Xθ‖²_2 / n ≥ α₁‖θ‖²_2 − α₂‖θ‖²_1,   ∀ θ ∈ ℝ^d,
  where α₁ = c₁ λ_min(Σ) and α₂ = c₂ ρ²(Σ) log d / n.
25 / 29

Lasso oracle inequality
• For simplicity, write γ̄ := λ_min(Σ) and ρ² := ρ²(Σ) = max_i Σ_ii.

Theorem 4
Under condition (9), consider the Lagrangian Lasso with regularization parameter λ ≥ 2‖z‖_∞, where z = Xᵀw/n. For any θ* ∈ ℝ^d, any optimal solution θ̂ satisfies the bound
    ‖θ̂ − θ*‖²_2 ≤ (144 λ² / (c₁² γ̄²)) |S| + (16 λ / (c₁ γ̄)) ‖θ*_{S^c}‖_1 + (32 c₂ ρ² / (c₁ γ̄)) (log d / n) ‖θ*_{S^c}‖²_1,   (10)
where the first term is the estimation error and the last two terms are the approximation error, valid for any subset S with cardinality |S| ≤ c₁ γ̄ n / (64 c₂ ρ² log d).

• If θ* is exactly sparse, take S = supp(θ*); then the approximation error vanishes.
26 / 29

• Simplifying the bound:
    ‖θ̂ − θ*‖²_2 ≤ γ₁ λ²|S| + γ₂ λ‖θ*_{S^c}‖_1 + γ₃ (log d / n) ‖θ*_{S^c}‖²_1,
  where γ₁, γ₂, γ₃ are constants depending on Σ.
• Assume σ² = 1 (unit noise variance) for simplicity.
• Since ‖z‖_∞ ≲ √(log d / n) w.h.p., we can take λ of this order:
    ‖θ̂ − θ*‖²_2 ≲ (log d / n)|S| + √(log d / n) ‖θ*_{S^c}‖_1 + (log d / n) ‖θ*_{S^c}‖²_1.
• Optimizing the bound:
    ‖θ̂ − θ*‖²_2 ≲ inf_{|S| ≲ n / log d} [ (log d / n)|S| + √(log d / n) ‖θ*_{S^c}‖_1 + (log d / n) ‖θ*_{S^c}‖²_1 ].
• An oracle that knows θ* can choose the optimal S.
27 / 29

Example: ℓ_q-ball sparsity
• Assume that θ* ∈ B_q, i.e., Σ_{j=1}^d |θ*_j|^q ≤ 1, for some q ∈ [0, 1].
• Then, assuming σ² = 1, we have the rate (Exercise 7.12)
    ‖θ̂ − θ*‖²_2 ≲ ( log d / n )^{1 − q/2}.
Sketch:
• Trick: take S = { i : |θ*_i| > τ } and find a good threshold τ later.
• Show that ‖θ*_{S^c}‖_1 ≤ τ^{1−q} and |S| ≤ τ^{−q}.
• The bound would then be of the form (with ε := √(log d / n))
    ε² τ^{−q} + ε τ^{1−q} + (ε τ^{1−q})².
• Ignore the last term (assuming ε τ^{1−q} ≲ 1, it is not dominant), and balance the remaining two terms by the choice of τ.
28 / 29

Other results in Chapter 7
• Bounds on the prediction error (1/n)‖X(θ̂ − θ*)‖²_2.
• Model selection consistency (support recovery): the Lasso solution is generally sparse, so under hard sparsity one can ask whether it does model selection automatically, i.e., if θ* is actually s-sparse, does supp(θ̂) = supp(θ*)? This is the hardest problem, with the most stringent requirements: additional constraints on the design (an irrepresentability condition) and the nonzero entries |θ*_j| being large enough.

Proof of Theorem 4 (Compact)
• The basic inequality argument and the assumption λ ≥ 2‖z‖_∞ give
    (1/n)‖XΔ̂‖²_2 ≤ λ( 3‖Δ̂_S‖_1 − ‖Δ̂_{S^c}‖_1 + b ),   where b := 2‖θ*_{S^c}‖_1.
• The error satisfies ‖Δ̂_{S^c}‖_1 ≤ 3‖Δ̂_S‖_1 + b, hence
    ‖Δ̂‖²_1 ≤ 32 s ‖Δ̂‖²_2 + 2b²
  (Δ̂ behaves almost like an s-sparse vector).
• Bound (9) can be written as (with α₁, α₂ > 0)
    (1/n)‖XΔ‖²_2 ≥ α₁‖Δ‖²_2 − α₂‖Δ‖²_1,   ∀ Δ ∈ ℝ^d.
• Combining, rearranging, dropping terms, etc., we get
    (α₁ − 32 α₂ s)‖Δ̂‖²_2 ≤ λ( 3√s ‖Δ̂‖_2 + b ) + 2α₂ b².
• Assume α₁ − 32 α₂ s ≥ α₁/2 > 0, so that the quadratic inequality enforces an upper bound on ‖Δ̂‖_2.
• Hint: for a > 0 and x ≥ 0,
    a x² ≤ b x + c  ⟹  x ≤ b/a + √(c/a)  ⟹  x² ≤ 2b²/a² + 2c/a.
29 / 29
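The hint is used without proof; here is a short verification (my own derivation under the stated assumptions a > 0 and b, c, x ≥ 0, not from the slides), written as a LaTeX fragment. Applying it with a = α₁/2, linear coefficient 3√s λ, and constant term λb + 2α₂b² turns the quadratic inequality above into a bound of the form (10).

% Verification of the hint: a > 0, b, c, x >= 0 and a x^2 <= b x + c imply x <= b/a + sqrt(c/a).
% Suppose instead x > b/a + sqrt(c/a); then x > b/a and x > sqrt(c/a), so
\[
  a x^2 \;=\; (a x)\, x \;>\; \bigl(b + \sqrt{a c}\bigr)\, x
         \;=\; b x + \sqrt{a c}\, x \;>\; b x + \sqrt{a c}\,\sqrt{c/a} \;=\; b x + c,
\]
% a contradiction.  Hence x <= b/a + sqrt(c/a), and since (u + v)^2 <= 2 u^2 + 2 v^2,
\[
  x^2 \;\le\; \frac{2 b^2}{a^2} \;+\; \frac{2 c}{a}.
\]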