Sketching as a Tool for Numerical Linear Algebra
David Woodruff, IBM Almaden

Talk Outline
- Regression
  - Exact regression algorithms
  - Sketching to speed up least squares regression
  - Sketching to speed up least absolute deviation (l1) regression
- Low rank approximation
  - Sketching to speed up low rank approximation
- Recent results and open questions
  - M-estimators and robust regression
  - CUR decompositions

Regression
- Linear regression: a statistical method to study linear dependencies between variables in the presence of noise.
- Example: Ohm's law, V = R · I. Find the linear function that best fits the data.
- [Figure: scatter plot of measurements with the best-fit line]

Regression
- Standard setting:
  - One measured variable b
  - A set of predictor variables a_1, …, a_d
  - Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + e
  - e is assumed to be noise and the x_i are model parameters we want to learn
  - Can assume x_0 = 0
- Now consider n observations of b

Regression analysis
- Matrix form
  - Input: an n x d matrix A and a vector b = (b_1, …, b_n)
  - n is the number of observations; d is the number of predictor variables
  - Output: x* so that Ax* and b are close
- Consider the over-constrained case, when n >> d
- Assume that A has full column rank

Regression analysis
- Least squares method
  - Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i - <A_i*, x>)^2, where A_i* is the i-th row of A
  - Has certain desirable statistical properties
  - Closed-form solution: x* = (A^T A)^{-1} A^T b
- Method of least absolute deviation (l1-regression)
  - Find x* that minimizes |Ax-b|_1 = Σ_i |b_i - <A_i*, x>|
  - The cost is less sensitive to outliers than least squares
  - Can solve via linear programming
- Time complexities are at least n·d^2; we want better!

Sketching to solve least squares regression
- How do we find an approximate solution x to min_x |Ax-b|_2?
- Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability
- Draw S from a k x n random family of matrices, for a value k << n
- Compute S*A and S*b
- Output the solution x' to min_x |(SA)x - (Sb)|_2

How to choose the right sketching matrix S?
- Recall: output the solution x' to min_x |(SA)x - (Sb)|_2
- Lots of matrices work
- S can be a d/ε^2 x n matrix of i.i.d. Normal random variables
- But computing S*A may be slow…

How to choose the right sketching matrix S? [S]
- S can be a Johnson-Lindenstrauss transform: S = P*H*D
  - D is a diagonal matrix with ±1 entries on the diagonal
  - H is the Hadamard transform
  - P chooses a random (small) subset of the rows of H*D
- S*A can be computed much faster

Even faster sketching matrices [CW]
- CountSketch matrix: define a k x n matrix S, for k = Õ(d^2/ε^2)
- S is really sparse: a single randomly chosen non-zero entry (±1) per column, e.g.

      [ 0  0  1  0  0  1  0  0 ]
      [ 1  0  0  0  0  0  0  0 ]
      [ 0  0  0 -1  1  0 -1  0 ]
      [ 0 -1  0  0  0  0  0  1 ]

- Surprisingly, this works!

Simpler and Sharper Proofs [MM, NN, N]
- Let B = [A, b] be an n x (d+1) matrix, and let U be an orthonormal basis for the columns of B
- Suffices to show |SUx|_2 = 1 ± ε for all unit x
  - This implies |SBx|_2 = (1 ± ε) |Bx|_2 for all x, which implies |S(Ax-b)|_2 = (1 ± ε) |Ax-b|_2 for all x
- SU is a (d+1)^2/ε^2 x (d+1) matrix; such an S is called a subspace embedding
- Suffices to show |U^T S^T S U - I|_2 ≤ |U^T S^T S U - I|_F ≤ ε
- Matrix product result: |C S^T S D - CD|_F^2 ≤ [1/(# rows of S)] · |C|_F^2 · |D|_F^2
- Set C = U^T and D = U. Then |U|_F^2 = d+1 and (# rows of S) = (d+1)^2/ε^2, so the right-hand side is ε^2
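To make the sketch-and-solve recipe above concrete, here is a minimal NumPy sketch. It applies a CountSketch implicitly (hashing each row of [A, b] to a random bucket with a random sign) and solves the small sketched problem; the sketch size k and all other constants below are illustrative choices, not values prescribed by the talk.

```python
import numpy as np

def countsketch(M, k, rng):
    """Apply a k x n CountSketch matrix S to M without forming S explicitly:
    each row of M is hashed to one of k buckets and added with a random sign."""
    n = M.shape[0]
    buckets = rng.integers(0, k, size=n)        # one random nonzero position per column of S
    signs = rng.choice([-1.0, 1.0], size=n)     # the nonzero entry is +1 or -1
    SM = np.zeros((k, M.shape[1]))
    np.add.at(SM, buckets, signs[:, None] * M)  # unbuffered scatter-add of signed rows
    return SM

def sketched_least_squares(A, b, k, seed=0):
    """Approximate argmin_x |Ax - b|_2 by solving min_x |(SA)x - (Sb)|_2."""
    rng = np.random.default_rng(seed)
    SB = countsketch(np.column_stack([A, b]), k, rng)   # sketch A and b together
    SA, Sb = SB[:, :-1], SB[:, -1]
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x

# Example on an over-constrained system with n >> d.
rng = np.random.default_rng(1)
n, d = 100_000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_sketch = sketched_least_squares(A, b, k=2_000)        # k chosen heuristically
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))  # close to 1
```

Sketching b together with A mirrors the choice B = [A, b] in the subspace-embedding argument above.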
Sketching to solve l1-regression
- How do we find an approximate solution x to min_x |Ax-b|_1?
- Goal: output x' for which |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Natural attempt: draw S from a k x n random family of matrices for a value k << n, compute S*A and S*b, and output the solution x' to min_x |(SA)x - (Sb)|_1
- Turns out this does not work

Sketching to solve l1-regression [SW]
- Why doesn't outputting the solution x' to min_x |(SA)x - (Sb)|_1 work?
- We don't know of k x n matrices S with small k for which, if x' is the solution to min_x |(SA)x - (Sb)|_1, then |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Instead: we can find an S so that |Ax'-b|_1 ≤ (d log d) min_x |Ax-b|_1
- S is a matrix of i.i.d. Cauchy random variables
- Property: |Ax-b|_1 ≤ |S(Ax-b)|_1 ≤ (d log d) |Ax-b|_1 for all x

Cauchy random variables
- If a and b are scalars and C_1 and C_2 are independent Cauchys, then a·C_1 + b·C_2 ~ (|a|+|b|)·C for a Cauchy C (1-stability)
- Cauchy random variables are not as nice as Normal (Gaussian) random variables: they have no mean and infinite variance
- The ratio of two independent Normal random variables is Cauchy

Sketching to solve l1-regression
- Main idea: let B = [A, b] and compute a QR-factorization of S*B, so that Q has orthonormal columns and Q*R = S*B
- B*R^{-1} is a "well-conditioning" of B:
  - Σ_{i=1}^{d} |BR^{-1}e_i|_1 ≤ Σ_{i=1}^{d} |SBR^{-1}e_i|_1 ≤ (d log d)^{1/2} Σ_{i=1}^{d} |SBR^{-1}e_i|_2 ≤ d (d log d)^{1/2}
  - |x|_∞ ≤ |x|_2 = |SBR^{-1}x|_2 ≤ |SBR^{-1}x|_1 ≤ (d log d) |BR^{-1}x|_1
- These two properties make importance sampling work!

Importance Sampling
- Want to estimate Σ_{i=1}^n y_i by sampling, for y_i ≥ 0
- Suppose we sample y_i with probability p_i, and set T = Σ_{i=1}^n δ(y_i sampled) · y_i/p_i
- E[T] = Σ_{i=1}^n p_i · y_i/p_i = Σ_{i=1}^n y_i
- Var[T] ≤ Σ_{i=1}^n p_i · (y_i/p_i)^2 ≤ (Σ_{i=1}^n y_i) · max_i y_i/p_i
- So it suffices to bound max_i y_i/p_i by ε^2 · Σ_{i=1}^n y_i
- For us, y_i = |(Ax-b)_i|, and this holds if p_i = |e_i^T BR^{-1}|_1 · poly(d/ε)

Importance Sampling
- To get a bound for all x simultaneously, use Bernstein's inequality and a net argument
- Sample poly(d/ε) rows of B*R^{-1}, where the i-th row is sampled with probability proportional to its 1-norm
- Let T be the diagonal matrix with T_{i,i} = 0 if row i is not sampled, and T_{i,i} = 1/Pr[row i sampled] otherwise
- Then |TBx|_1 = (1 ± ε) |Bx|_1 for all x
- Solve the regression problem on the (reweighted) samples!

Sketching to solve l1-regression [MM]
- The most expensive operation is computing S*A, where S is the matrix of i.i.d. Cauchy random variables
- All other operations are in the "smaller space"
- Can speed this up by choosing S = (CountSketch matrix) · diag(C_1, …, C_n), i.e., a sparse CountSketch matrix whose i-th column is scaled by an i.i.d. Cauchy random variable C_i

Further sketching improvements [WZ]
- Can show that fewer sampled rows are needed in later steps if one instead chooses S = (CountSketch matrix) · diag(1/E_1, …, 1/E_n), where the E_i are i.i.d. exponential random variables
- Uses the max-stability of exponentials [Andoni]: max_i y_i/E_i ~ |y|_1/E for an exponential E
- For recent work on fast sampling-based algorithms, see Richard's talk!
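The importance-sampling estimator T above is easy to simulate. The sketch below is a toy illustration rather than the algorithm from the talk: it takes y_i = |(Ax - b)_i| for one fixed x and compares uniform sampling with sampling proportional to y_i. In the actual l1 algorithm the probabilities come from the row 1-norms of B·R^{-1}, so that a single distribution controls max_i y_i/p_i for every x at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_sampling_estimate(y, p):
    """The estimator T: include term i with probability p_i and reweight
    included terms by 1/p_i, so that E[T] = sum_i y_i."""
    keep = rng.random(len(y)) < p
    return np.sum(y[keep] / p[keep])

# Toy data: y_i = |(Ax - b)_i| for one fixed x, with a few gross outliers in b
# (the setting where l1 regression is preferred in the first place).
n, d = 10_000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
b[:20] += 1_000.0
x = rng.standard_normal(d)
y = np.abs(A @ x - b)

budget = 500                                             # expected number of sampled terms
p_uniform = np.full(n, budget / n)
p_proportional = np.minimum(1.0, budget * y / y.sum())   # keeps max_i y_i / p_i small

trials_u = [importance_sampling_estimate(y, p_uniform) for _ in range(200)]
trials_p = [importance_sampling_estimate(y, p_proportional) for _ in range(200)]
print("true sum             :", y.sum())
print("uniform sampling     : mean %.0f, std %.0f" % (np.mean(trials_u), np.std(trials_u)))
print("proportional sampling: mean %.0f, std %.0f" % (np.mean(trials_p), np.std(trials_p)))
```

Both estimators are unbiased, but the proportional one has far smaller variance, which is exactly the role of the max_i y_i/p_i bound above.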
Low rank approximation
- A is an n x d matrix
- Typically A is well-approximated by a low rank matrix; e.g., it may only have high rank because of noise
- Want to output a rank-k matrix A' so that |A-A'|_F ≤ (1+ε) |A-A_k|_F w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_F
- (For a matrix C, |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2})

Solution to low-rank approximation [S]
- Given the n x d input matrix A, compute S*A using a sketching matrix S with a small number of rows (<< n); S*A takes random linear combinations of the rows of A
- Project the rows of A onto the row space of SA, then find the best rank-k approximation to the projected points inside the row space of SA
- (A minimal code sketch of this recipe appears after the final slide below)

Low Rank Approximation Idea
- Consider the regression problem min_X |A_k X - A|_F. Its solution is X = I, and the minimum is |A_k - A|_F
- This is a generalized regression problem!
- Suppose S is a subspace embedding for the column space of A_k, and also that for any matrices B, C, |B S^T S C - BC|_F^2 ≤ [1/(# rows of S)] · |B|_F^2 · |C|_F^2
  - S can be a matrix of i.i.d. Normals, a Fast Johnson-Lindenstrauss matrix, or a CountSketch matrix
- Then if X' is the minimizer of min_X |S A_k X - SA|_F, we have |A_k X' - A|_F ≤ (1+ε) min_X |A_k X - A|_F = (1+ε) |A_k - A|_F
- But the minimizer X' = (SA_k)^- SA is in the row span of SA!

Caveat: projecting the points onto SA is slow
- Current algorithm:
  1. Compute S*A (easy)
  2. Project each of the rows of A onto the row space of S*A
  3. Find the best rank-k approximation of the projected points inside the row space of S*A (easy)
- The bottleneck is step 2
- [CW] It turns out you can approximate the projection: use sketching for generalized regression again, on min_X |X(SA) - A|_F^2

M-Estimators and Robust Regression
- Solve min_x |Ax-b|_M, where M: R -> R_{≥0} and |y|_M = Σ_{i=1}^n M(y_i)
- Least squares and l1-regression are special cases
- Huber function, given a parameter c:
  - M(y) = y^2/(2c) for |y| ≤ c
  - M(y) = |y| - c/2 otherwise
- Enjoys the smoothness properties of l2 and the robustness properties of l1
- [CW15] For M-estimators with at least linear and at most quadratic growth, can get an O(1)-approximation in nnz(A) + poly(d) time

CUR Decompositions
- [BW14] Can find a CUR decomposition in O(nnz(A) log n) + n·poly(k/ε) time with O(k/ε) columns, O(k/ε) rows, and rank(U) = k

Open Questions
- Recent monograph in NOW Publishers: D. Woodruff, "Sketching as a Tool for Numerical Linear Algebra"
- Other types of low rank approximation:
  - (Spectral) How quickly can we find a rank-k matrix A' so that |A-A'|_2 ≤ (1+ε) |A-A_k|_2 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_2?
  - (Robust) How quickly can we find a rank-k matrix A' so that |A-A'|_1 ≤ (1+ε) |A-A_k|_1 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_1?
- For other questions regarding Schatten norms and communication efficiency, see the reference above.

Thanks!
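As a companion to the low-rank approximation slides (referenced above), here is a minimal NumPy sketch of the sketch-project-truncate recipe: form S·A, project the rows of A onto the row space of S·A, and take the best rank-k approximation there. A Gaussian S is used for simplicity (the slides note that Fast JL or CountSketch matrices also work); the sketch size 4k is an illustrative choice, and the projection in step 2 is computed exactly rather than approximated as in [CW].

```python
import numpy as np

def sketched_low_rank(A, k, sketch_rows, seed=0):
    """Rank-k approximation via sketching: project the rows of A onto the
    row space of S*A, then truncate to rank k inside that subspace."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.standard_normal((sketch_rows, n))   # Gaussian sketch (illustrative choice)
    SA = S @ A                                  # random linear combinations of the rows of A
    V, _ = np.linalg.qr(SA.T)                   # d x sketch_rows orthonormal basis of rowspace(SA)
    P = A @ V                                   # rows of A expressed in that basis (the projection step)
    U, s, Wt = np.linalg.svd(P, full_matrices=False)
    Pk = (U[:, :k] * s[:k]) @ Wt[:k]            # best rank-k approximation of the projected points
    return Pk @ V.T                             # lift back to R^d; the result has rank <= k

# Example: a noisy rank-k matrix.
rng = np.random.default_rng(1)
n, d, k = 2_000, 300, 10
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) + 0.01 * rng.standard_normal((n, d))
A_prime = sketched_low_rank(A, k, sketch_rows=4 * k)
U, s, Wt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Wt[:k]               # the optimal rank-k approximation, for comparison
print(np.linalg.norm(A - A_prime) / np.linalg.norm(A - A_k))  # Frobenius-norm ratio, close to 1
```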