Sketching as a Tool for Numerical Linear Algebra
David Woodruff, IBM Almaden

Talk Outline
- Regression
  - Exact regression algorithms
  - Sketching to speed up least squares regression
  - Sketching to speed up least absolute deviation (l1) regression
- Low rank approximation
  - Sketching to speed up low rank approximation
- Recent results and open questions
  - M-estimators and robust regression
  - CUR decompositions

Regression
- Linear regression: a statistical method to study linear dependencies between variables in the presence of noise.
- Example: Ohm's law, V = R · I. Find the linear function that best fits the data.
- [Figure: scatter plot of measurements with the best-fit line]

Regression
- Standard setting:
  - One measured variable b
  - A set of predictor variables a_1, …, a_d
  - Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + e
  - e is assumed to be noise and the x_i are model parameters we want to learn
  - Can assume x_0 = 0
- Now consider n observations of b

Regression analysis
- Matrix form
  - Input: an n x d matrix A and a vector b = (b_1, …, b_n)
  - n is the number of observations; d is the number of predictor variables
  - Output: x* so that Ax* and b are close
- Consider the over-constrained case, when n >> d
- Assume that A has full column rank

Regression analysis
- Least squares method
  - Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i - <A_i*, x>)^2, where A_i* is the i-th row of A
  - Has certain desirable statistical properties
  - Closed-form solution: x* = (A^T A)^{-1} A^T b
- Method of least absolute deviation (l1-regression)
  - Find x* that minimizes |Ax-b|_1 = Σ_i |b_i - <A_i*, x>|
  - The cost is less sensitive to outliers than least squares
  - Can solve via linear programming
- Time complexities are at least n·d^2; we want better!

Sketching to solve least squares regression
- How do we find an approximate solution x to min_x |Ax-b|_2?
- Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability
- Draw S from a k x n random family of matrices, for a value k << n
- Compute S*A and S*b
- Output the solution x' to min_x |(SA)x - (Sb)|_2

How to choose the right sketching matrix S?
- Recall: output the solution x' to min_x |(SA)x - (Sb)|_2
- Lots of matrices work
- S can be a d/ε^2 x n matrix of i.i.d. Normal random variables
- But computing S*A may be slow…

How to choose the right sketching matrix S? [S]
- S can be a Johnson-Lindenstrauss transform: S = P*H*D
  - D is a diagonal matrix with ±1 entries on the diagonal
  - H is the Hadamard transform
  - P chooses a random (small) subset of the rows of H*D
- S*A can be computed much faster

Even faster sketching matrices [CW]
- CountSketch matrix: define a k x n matrix S, for k = Õ(d^2/ε^2)
- S is really sparse: a single randomly chosen non-zero entry (±1) per column, e.g.

      [ 0  0  1  0  0  1  0  0 ]
      [ 1  0  0  0  0  0  0  0 ]
      [ 0  0  0 -1  1  0 -1  0 ]
      [ 0 -1  0  0  0  0  0  1 ]

- Surprisingly, this works!

Simpler and Sharper Proofs [MM, NN, N]
- Let B = [A, b] be an n x (d+1) matrix, and let U be an orthonormal basis for the columns of B
- Suffices to show |SUx|_2 = 1 ± ε for all unit x
  - This implies |SBx|_2 = (1 ± ε) |Bx|_2 for all x, which implies |S(Ax-b)|_2 = (1 ± ε) |Ax-b|_2 for all x
- SU is a (d+1)^2/ε^2 x (d+1) matrix; such an S is called a subspace embedding
- Suffices to show |U^T S^T S U - I|_2 ≤ |U^T S^T S U - I|_F ≤ ε
- Matrix product result: |C S^T S D - CD|_F^2 ≤ [1/(# rows of S)] · |C|_F^2 · |D|_F^2
- Set C = U^T and D = U. Then |U|_F^2 = d+1 and (# rows of S) = (d+1)^2/ε^2, so the right-hand side is ε^2
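To make the sketch-and-solve recipe above concrete, here is a minimal NumPy sketch. It applies a CountSketch implicitly (hashing each row of [A, b] to a random bucket with a random sign) and solves the small sketched problem; the sketch size k and all other constants below are illustrative choices, not values prescribed by the talk.

```python
import numpy as np

def countsketch(M, k, rng):
    """Apply a k x n CountSketch matrix S to M without forming S explicitly:
    each row of M is hashed to one of k buckets and added with a random sign."""
    n = M.shape[0]
    buckets = rng.integers(0, k, size=n)        # one random nonzero position per column of S
    signs = rng.choice([-1.0, 1.0], size=n)     # the nonzero entry is +1 or -1
    SM = np.zeros((k, M.shape[1]))
    np.add.at(SM, buckets, signs[:, None] * M)  # unbuffered scatter-add of signed rows
    return SM

def sketched_least_squares(A, b, k, seed=0):
    """Approximate argmin_x |Ax - b|_2 by solving min_x |(SA)x - (Sb)|_2."""
    rng = np.random.default_rng(seed)
    SB = countsketch(np.column_stack([A, b]), k, rng)   # sketch A and b together
    SA, Sb = SB[:, :-1], SB[:, -1]
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x

# Example on an over-constrained system with n >> d.
rng = np.random.default_rng(1)
n, d = 100_000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_sketch = sketched_least_squares(A, b, k=2_000)        # k chosen heuristically
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))  # close to 1
```

Sketching b together with A mirrors the choice B = [A, b] in the subspace-embedding argument above.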
Sketching to solve l1-regression
- How do we find an approximate solution x to min_x |Ax-b|_1?
- Goal: output x' for which |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Natural attempt: draw S from a k x n random family of matrices for a value k << n, compute S*A and S*b, and output the solution x' to min_x |(SA)x - (Sb)|_1
- Turns out this does not work

Sketching to solve l1-regression [SW]
- Why doesn't outputting the solution x' to min_x |(SA)x - (Sb)|_1 work?
- We don't know of k x n matrices S with small k for which, if x' is the solution to min_x |(SA)x - (Sb)|_1, then |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Instead: we can find an S so that |Ax'-b|_1 ≤ (d log d) min_x |Ax-b|_1
- S is a matrix of i.i.d. Cauchy random variables
- Property: |Ax-b|_1 ≤ |S(Ax-b)|_1 ≤ (d log d) |Ax-b|_1 for all x

Cauchy random variables
- If a and b are scalars and C_1 and C_2 are independent Cauchys, then a·C_1 + b·C_2 ~ (|a|+|b|)·C for a Cauchy C (1-stability)
- Cauchy random variables are not as nice as Normal (Gaussian) random variables: they have no mean and infinite variance
- The ratio of two independent Normal random variables is Cauchy

Sketching to solve l1-regression
- Main idea: let B = [A, b] and compute a QR-factorization of S*B, so that Q has orthonormal columns and Q*R = S*B
- B*R^{-1} is a "well-conditioning" of B:
  - Σ_{i=1}^{d} |BR^{-1}e_i|_1 ≤ Σ_{i=1}^{d} |SBR^{-1}e_i|_1 ≤ (d log d)^{1/2} Σ_{i=1}^{d} |SBR^{-1}e_i|_2 ≤ d (d log d)^{1/2}
  - |x|_∞ ≤ |x|_2 = |SBR^{-1}x|_2 ≤ |SBR^{-1}x|_1 ≤ (d log d) |BR^{-1}x|_1
- These two properties make importance sampling work!

Importance Sampling
- Want to estimate Σ_{i=1}^n y_i by sampling, for y_i ≥ 0
- Suppose we sample y_i with probability p_i, and set T = Σ_{i=1}^n δ(y_i sampled) · y_i/p_i
- E[T] = Σ_{i=1}^n p_i · y_i/p_i = Σ_{i=1}^n y_i
- Var[T] ≤ Σ_{i=1}^n p_i · (y_i/p_i)^2 ≤ (Σ_{i=1}^n y_i) · max_i y_i/p_i
- So it suffices to bound max_i y_i/p_i by ε^2 · Σ_{i=1}^n y_i
- For us, y_i = |(Ax-b)_i|, and this holds if p_i = |e_i^T BR^{-1}|_1 · poly(d/ε)

Importance Sampling
- To get a bound for all x simultaneously, use Bernstein's inequality and a net argument
- Sample poly(d/ε) rows of B*R^{-1}, where the i-th row is sampled with probability proportional to its 1-norm
- Let T be the diagonal matrix with T_{i,i} = 0 if row i is not sampled, and T_{i,i} = 1/Pr[row i sampled] otherwise
- Then |TBx|_1 = (1 ± ε) |Bx|_1 for all x
- Solve the regression problem on the (reweighted) samples!

Sketching to solve l1-regression [MM]
- The most expensive operation is computing S*A, where S is the matrix of i.i.d. Cauchy random variables
- All other operations are in the "smaller space"
- Can speed this up by choosing S = (CountSketch matrix) · diag(C_1, …, C_n), i.e., a sparse CountSketch matrix whose i-th column is scaled by an i.i.d. Cauchy random variable C_i

Further sketching improvements [WZ]
- Can show that fewer sampled rows are needed in later steps if one instead chooses S = (CountSketch matrix) · diag(1/E_1, …, 1/E_n), where the E_i are i.i.d. exponential random variables
- Uses the max-stability of exponentials [Andoni]: max_i y_i/E_i ~ |y|_1/E for an exponential E
- For recent work on fast sampling-based algorithms, see Richard's talk!
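The importance-sampling estimator T above is easy to simulate. The sketch below is a toy illustration rather than the algorithm from the talk: it takes y_i = |(Ax - b)_i| for one fixed x and compares uniform sampling with sampling proportional to y_i. In the actual l1 algorithm the probabilities come from the row 1-norms of B·R^{-1}, so that a single distribution controls max_i y_i/p_i for every x at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_sampling_estimate(y, p):
    """The estimator T: include term i with probability p_i and reweight
    included terms by 1/p_i, so that E[T] = sum_i y_i."""
    keep = rng.random(len(y)) < p
    return np.sum(y[keep] / p[keep])

# Toy data: y_i = |(Ax - b)_i| for one fixed x, with a few gross outliers in b
# (the setting where l1 regression is preferred in the first place).
n, d = 10_000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
b[:20] += 1_000.0
x = rng.standard_normal(d)
y = np.abs(A @ x - b)

budget = 500                                             # expected number of sampled terms
p_uniform = np.full(n, budget / n)
p_proportional = np.minimum(1.0, budget * y / y.sum())   # keeps max_i y_i / p_i small

trials_u = [importance_sampling_estimate(y, p_uniform) for _ in range(200)]
trials_p = [importance_sampling_estimate(y, p_proportional) for _ in range(200)]
print("true sum             :", y.sum())
print("uniform sampling     : mean %.0f, std %.0f" % (np.mean(trials_u), np.std(trials_u)))
print("proportional sampling: mean %.0f, std %.0f" % (np.mean(trials_p), np.std(trials_p)))
```

Both estimators are unbiased, but the proportional one has far smaller variance, which is exactly the role of the max_i y_i/p_i bound above.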
Low rank approximation
- A is an n x d matrix
- Typically A is well-approximated by a low rank matrix; e.g., it may only have high rank because of noise
- Want to output a rank-k matrix A' so that |A-A'|_F ≤ (1+ε) |A-A_k|_F w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_F
- (For a matrix C, |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2})

Solution to low-rank approximation [S]
- Given the n x d input matrix A, compute S*A using a sketching matrix S with a small number of rows (<< n); S*A takes random linear combinations of the rows of A
- Project the rows of A onto the row space of SA, then find the best rank-k approximation to the projected points inside the row space of SA
- (A minimal code sketch of this recipe appears after the final slide below)

Low Rank Approximation Idea
- Consider the regression problem min_X |A_k X - A|_F. Its solution is X = I, and the minimum is |A_k - A|_F
- This is a generalized regression problem!
- Suppose S is a subspace embedding for the column space of A_k, and also that for any matrices B, C, |B S^T S C - BC|_F^2 ≤ [1/(# rows of S)] · |B|_F^2 · |C|_F^2
  - S can be a matrix of i.i.d. Normals, a Fast Johnson-Lindenstrauss matrix, or a CountSketch matrix
- Then if X' is the minimizer of min_X |S A_k X - SA|_F, we have |A_k X' - A|_F ≤ (1+ε) min_X |A_k X - A|_F = (1+ε) |A_k - A|_F
- But the minimizer X' = (SA_k)^- SA is in the row span of SA!

Caveat: projecting the points onto SA is slow
- Current algorithm:
  1. Compute S*A (easy)
  2. Project each of the rows of A onto the row space of S*A
  3. Find the best rank-k approximation of the projected points inside the row space of S*A (easy)
- The bottleneck is step 2
- [CW] It turns out you can approximate the projection: use sketching for generalized regression again, on min_X |X(SA) - A|_F^2

M-Estimators and Robust Regression
- Solve min_x |Ax-b|_M, where M: R -> R_{≥0} and |y|_M = Σ_{i=1}^n M(y_i)
- Least squares and l1-regression are special cases
- Huber function, given a parameter c:
  - M(y) = y^2/(2c) for |y| ≤ c
  - M(y) = |y| - c/2 otherwise
- Enjoys the smoothness properties of l2 and the robustness properties of l1
- [CW15] For M-estimators with at least linear and at most quadratic growth, can get an O(1)-approximation in nnz(A) + poly(d) time

CUR Decompositions
- [BW14] Can find a CUR decomposition in O(nnz(A) log n) + n·poly(k/ε) time with O(k/ε) columns, O(k/ε) rows, and rank(U) = k

Open Questions
- Recent monograph in NOW Publishers: D. Woodruff, "Sketching as a Tool for Numerical Linear Algebra"
- Other types of low rank approximation:
  - (Spectral) How quickly can we find a rank-k matrix A' so that |A-A'|_2 ≤ (1+ε) |A-A_k|_2 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_2?
  - (Robust) How quickly can we find a rank-k matrix A' so that |A-A'|_1 ≤ (1+ε) |A-A_k|_1 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_1?
- For other questions regarding Schatten norms and communication efficiency, see the reference above.

Thanks!
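As a companion to the low-rank approximation slides (referenced above), here is a minimal NumPy sketch of the sketch-project-truncate recipe: form S·A, project the rows of A onto the row space of S·A, and take the best rank-k approximation there. A Gaussian S is used for simplicity (the slides note that Fast JL or CountSketch matrices also work); the sketch size 4k is an illustrative choice, and the projection in step 2 is computed exactly rather than approximated as in [CW].

```python
import numpy as np

def sketched_low_rank(A, k, sketch_rows, seed=0):
    """Rank-k approximation via sketching: project the rows of A onto the
    row space of S*A, then truncate to rank k inside that subspace."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.standard_normal((sketch_rows, n))   # Gaussian sketch (illustrative choice)
    SA = S @ A                                  # random linear combinations of the rows of A
    V, _ = np.linalg.qr(SA.T)                   # d x sketch_rows orthonormal basis of rowspace(SA)
    P = A @ V                                   # rows of A expressed in that basis (the projection step)
    U, s, Wt = np.linalg.svd(P, full_matrices=False)
    Pk = (U[:, :k] * s[:k]) @ Wt[:k]            # best rank-k approximation of the projected points
    return Pk @ V.T                             # lift back to R^d; the result has rank <= k

# Example: a noisy rank-k matrix.
rng = np.random.default_rng(1)
n, d, k = 2_000, 300, 10
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) + 0.01 * rng.standard_normal((n, d))
A_prime = sketched_low_rank(A, k, sketch_rows=4 * k)
U, s, Wt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Wt[:k]               # the optimal rank-k approximation, for comparison
print(np.linalg.norm(A - A_prime) / np.linalg.norm(A - A_k))  # Frobenius-norm ratio, close to 1
```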