Sketching as a Tool for Numerical Linear Algebra (slides)

Sketching as a Tool for Numerical Linear Algebra
David Woodruff
IBM Almaden
Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
Regression
Linear Regression
- Statistical method to study linear dependencies between variables in the presence of noise.
Example Regression
- Ohm's law V = R · I
- Find the linear function that best fits the data
[Figure: scatter plot of measurements with the best-fit line; x-axis 0-150, y-axis 0-250]
Regression
Standard Setting
- One measured variable b
- A set of predictor variables a_1, …, a_d
- Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + e
- e is assumed to be noise and the x_i are model parameters we want to learn
- Can assume x_0 = 0
- Now consider n observations of b
Regression analysis
Matrix form
Input: an n × d matrix A and a vector b = (b_1, …, b_n)
- n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close
- Consider the over-constrained case, when n ≫ d
- Assume that A has full column rank
Regression analysis
Least Squares Method
- Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i - ⟨A_i*, x⟩)^2
- A_i* is the i-th row of A
- Certain desirable statistical properties
- Closed-form solution: x* = (A^T A)^{-1} A^T b
Method of least absolute deviation (l1-regression)
- Find x* that minimizes |Ax-b|_1 = Σ_i |b_i - ⟨A_i*, x⟩|
- Cost is less sensitive to outliers than least squares
- Can solve via linear programming
Time complexities are at least n·d^2; we want better!
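As a concrete illustration (not code from the talk), both exact baselines fit in a few lines of numpy/scipy; the synthetic data and problem sizes below are arbitrary choices:

```python
# Minimal sketch of the two exact solvers: normal equations for least
# squares, and a linear program for l1-regression.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Least squares, closed form x* = (A^T A)^{-1} A^T b.
# (np.linalg.lstsq computes the same x* in a numerically stabler way.)
x_ls = np.linalg.solve(A.T @ A, A.T @ b)

# l1-regression as an LP: minimize sum(t) subject to -t <= Ax - b <= t.
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * d + [(0, None)] * n)
x_l1 = res.x[:d]
```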
Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
Sketching to solve least squares regression
- How to find an approximate solution x to min_x |Ax-b|_2 ?
- Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability
- Draw S from a k × n random family of matrices, for a value k << n
- Compute S*A and S*b
- Output the solution x' to min_x |(SA)x - (Sb)|_2
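A hedged rendering of this recipe in numpy, using the dense Gaussian sketch of the next slide (the choice k = d/ε^2 anticipates that slide; sizes are illustrative):

```python
# Sketch-and-solve least squares with a dense Gaussian S.
import numpy as np

def sketched_least_squares(A, b, eps=0.5, rng=np.random.default_rng(1)):
    n, d = A.shape
    k = min(n, int(np.ceil(d / eps**2)))              # k << n rows in the sketch
    S = rng.standard_normal((k, n)) / np.sqrt(k)      # draw S from the random family
    x = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]  # solve the small problem
    return x
```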
How to choose the right sketching matrix S?
- Recall: output the solution x' to min_x |(SA)x - (Sb)|_2
- Lots of matrices work
- S is a d/ε^2 × n matrix of i.i.d. Normal random variables
- Computing S*A may be slow…
How to choose the right sketching matrix S? [S]
- S is a Johnson-Lindenstrauss transform: S = P*H*D
- D is a diagonal matrix with +1, -1 on the diagonal
- H is the Hadamard transform
- P just chooses a random (small) subset of rows of H*D
- S*A can be computed much faster
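A sketch of how S = P*H*D can be applied without ever forming S, via the fast Walsh-Hadamard transform (this assumes n is a power of 2; a real implementation would pad A with zero rows otherwise, and the function names are mine):

```python
import numpy as np

def fwht(X):
    """Fast Walsh-Hadamard transform along axis 0, O(n log n) per column."""
    X = X.copy()
    n, h = X.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = X[i:i + h].copy()
            X[i:i + h] = a + X[i + h:i + 2 * h]       # butterfly: sums
            X[i + h:i + 2 * h] = a - X[i + h:i + 2 * h]  # butterfly: differences
        h *= 2
    return X

def srht(A, k, rng=np.random.default_rng(2)):
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)           # D: random signs on the diagonal
    HDA = fwht(D[:, None] * A)                    # H*D*A via the fast transform
    P = rng.choice(n, size=k, replace=False)      # P: random subset of k rows
    return HDA[P] / np.sqrt(k)                    # rescale so norms are preserved
```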
Even faster sketching matrices [CW]
- CountSketch matrix
- Define a k × n matrix S, for k = Õ(d^2/ε^2)
- S is really sparse: a single randomly chosen non-zero entry per column, e.g.

  [ 0  0  1  0  0  1  0  0 ]
  [ 1  0  0  0  0  0  0  0 ]
  [ 0  0  0 -1  1  0 -1  0 ]
  [ 0 -1  0  0  0  0  0  1 ]

Surprisingly, this works!
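The sparsity is what makes S*A computable in time proportional to nnz(A). A minimal way to apply a CountSketch without materializing S (helper name is mine):

```python
import numpy as np

def countsketch(A, k, rng=np.random.default_rng(3)):
    """Apply a k x n CountSketch to A in O(nnz(A)) time."""
    n = A.shape[0]
    h = rng.integers(0, k, size=n)       # row index of the single non-zero in column i of S
    s = rng.choice([-1.0, 1.0], size=n)  # its random sign
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, h, s[:, None] * A)     # SA[h[i]] += s[i] * A[i] for each row i of A
    return SA
```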
Simpler and Sharper Proofs [MM, NN, N]
- Let B = [A, b] be an n × (d+1) matrix
- Let U be an orthonormal basis for the columns of B
- Suffices to show |SBx|_2 = (1 ± ε) |Bx|_2 for all x; equivalently, |SUx|_2 = 1 ± ε for all unit x. Such an S is called a subspace embedding.
- Implies |S(Ax-b)|_2 = (1 ± ε) |Ax-b|_2 for all x
- SU is a (d+1)^2/ε^2 × (d+1) matrix
- Suffices to show |U^T S^T S U - I|_2 ≤ |U^T S^T S U - I|_F ≤ ε
- Matrix product result: |C S^T S D - CD|_F^2 ≤ [1/(# rows of S)] · |C|_F^2 · |D|_F^2
- Set C = U^T and D = U. Then |U|_F^2 = d+1 and (# rows of S) = (d+1)^2/ε^2
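The subspace-embedding condition is easy to test numerically: all singular values of SU should lie in [1-ε, 1+ε]. A quick sanity check with a Gaussian S (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 2000, 5, 0.5
B = rng.standard_normal((n, d + 1))
U, _ = np.linalg.qr(B)                         # orthonormal basis for the columns of B
k = int((d + 1)**2 / eps**2)                   # (d+1)^2 / eps^2 rows
S = rng.standard_normal((k, n)) / np.sqrt(k)
print(np.linalg.svd(S @ U, compute_uv=False))  # expect all values in [1-eps, 1+eps]
```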
Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
Sketching to solve l1-regression
- How to find an approximate solution x to min_x |Ax-b|_1 ?
- Goal: output x' for which |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Natural attempt: draw S from a k × n random family of matrices, for a value k << n
- Compute S*A and S*b
- Output the solution x' to min_x |(SA)x - (Sb)|_1
- Turns out this does not work
Sketching to solve l1-regression [SW]
- Why doesn't outputting the solution x' to min_x |(SA)x - (Sb)|_1 work?
- We don't know of k × n matrices S with small k for which, if x' is the solution to min_x |(SA)x - (Sb)|_1, then |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Instead: can find an S so that |Ax'-b|_1 ≤ (d log d) · min_x |Ax-b|_1
- S is a matrix of i.i.d. Cauchy random variables
- Property: |Ax-b|_1 ≤ |S(Ax-b)|_1 ≤ (d log d) · |Ax-b|_1
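A hedged sketch of the Cauchy embedding; the row count r = O(d log d) is the regime from [SW], while the constant and the normalization here are arbitrary illustrative choices:

```python
import numpy as np

def cauchy_sketch(A, rng=np.random.default_rng(5)):
    n, d = A.shape
    r = int(8 * d * np.log(d + 2))       # O(d log d) rows; constant is arbitrary
    S = rng.standard_cauchy((r, n)) / r  # illustrative normalization of the sketch
    return S @ A
```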
Cauchy random variables
- Cauchy random variables are not as nice as Normal (Gaussian) random variables
- They don't have a mean and have infinite variance
- The ratio of two independent Normal random variables is Cauchy
- 1-stability: if a and b are scalars and C1 and C2 are independent Cauchys, then a·C1 + b·C2 ~ (|a|+|b|)·C for a Cauchy C
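The 1-stability property is easy to check empirically; since the median of |Cauchy| is 1, both medians below should be close to |a|+|b|:

```python
import numpy as np

rng = np.random.default_rng(6)
a, b, m = 2.0, -3.0, 200_000
lhs = a * rng.standard_cauchy(m) + b * rng.standard_cauchy(m)
rhs = (abs(a) + abs(b)) * rng.standard_cauchy(m)
print(np.median(np.abs(lhs)), np.median(np.abs(rhs)))  # both ~ 5.0
```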
Sketching to solve l1-regression
- Main idea: let B = [A, b] and compute a QR-factorization of S*B
- Q has orthonormal columns and Q*R = S*B
- B*R^{-1} is a "well-conditioning" of B:
  - Σ_{i=1..d} |BR^{-1}e_i|_1 ≤ Σ_{i=1..d} |SBR^{-1}e_i|_1 ≤ (d log d)^{1/2} Σ_{i=1..d} |SBR^{-1}e_i|_2 ≤ d (d log d)^{1/2}
  - |x|_∞ ≤ |x|_2 = |SBR^{-1}x|_2 ≤ |SBR^{-1}x|_1 ≤ (d log d) |BR^{-1}x|_1
  (both chains use that SBR^{-1} = Q has orthonormal columns, so |SBR^{-1}x|_2 = |x|_2)
- These two properties make importance sampling work!
Importance Sampling
- Want to estimate Σ_{i=1..n} y_i by sampling, for y_i ≥ 0
- Suppose we sample y_i with probability p_i
- T = Σ_{i=1..n} δ(y_i sampled) · y_i/p_i
- E[T] = Σ_{i=1..n} p_i · y_i/p_i = Σ_{i=1..n} y_i
- Var[T] ≤ E[T^2] = Σ_{i=1..n} p_i · y_i^2/p_i^2 ≤ (Σ_{i=1..n} y_i) · max_i y_i/p_i
- Bound max_i y_i/p_i by ε^2 · (Σ_{i=1..n} y_i)
- For us, y_i = |(Ax-b)_i|, and this holds if p_i = |e_i B R^{-1}|_1 · poly(d/ε)!
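A toy version of the estimator T (the oversampling factor 100 and the data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.random(1000)                      # non-negative items to sum
p = np.minimum(1.0, 100 * y / y.sum())    # ~100 expected samples, proportional to y_i
keep = rng.random(y.size) < p             # independent coin flip per item
T = np.sum(y[keep] / p[keep])             # reweighted sum; E[T] = sum(y)
print(T, y.sum())
```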
Importance Sampling
- To get a bound for all x, use Bernstein's inequality and a net argument
- Sample poly(d/ε) rows of B*R^{-1}, where the i-th row is sampled proportional to its 1-norm
- T is a diagonal matrix with T_{i,i} = 0 if row i is not sampled, otherwise T_{i,i} = 1/Pr[row i sampled]
- |TBx|_1 = (1 ± ε) |Bx|_1 for all x
- Solve the regression problem on the (reweighted) samples!
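Putting the last two slides together, the sampling step might look as follows (a sketch under the stated probabilities; oversample stands in for the poly(d/ε) factor):

```python
import numpy as np

def l1_sample_rows(B, Rinv, oversample, rng=np.random.default_rng(8)):
    lev = np.abs(B @ Rinv).sum(axis=1)                 # 1-norm of each row of B*R^{-1}
    p = np.minimum(1.0, oversample * lev / lev.sum())  # sampling probabilities
    keep = rng.random(B.shape[0]) < p                  # sample each row independently
    return B[keep] / p[keep, None]                     # rows of T*B with non-zero weight
```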
Sketching to solve l1-regression [MM]
- The most expensive operation is computing S*A, where S is the matrix of i.i.d. Cauchy random variables
- All other operations are in the "smaller space"
- Can speed this up by choosing S to be a CountSketch matrix times a diagonal of i.i.d. Cauchys C_1, …, C_n:

  [ 0  0  1  0  0  1  0  0 ]
  [ 1  0  0  0  0  0  0  0 ]  ·  diag(C_1, C_2, C_3, …, C_n)
  [ 0  0  0 -1  1  0 -1  0 ]
  [ 0 -1  0  0  0  0  0  1 ]
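A minimal way to apply this sparse Cauchy sketch in one pass over A (an illustration of the construction above, not code from the talk):

```python
import numpy as np

def sparse_cauchy_sketch(A, k, rng=np.random.default_rng(11)):
    n = A.shape[0]
    C = rng.standard_cauchy(n)             # Cauchy diagonal entries C_1, ..., C_n
    h = rng.integers(0, k, size=n)         # CountSketch bucket per row of A
    s = rng.choice([-1.0, 1.0], size=n)    # CountSketch signs
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, h, (s * C)[:, None] * A) # S*A = CountSketch * diag(C) * A
    return SA
```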
Further sketching improvements [WZ]
 Can
show
you need a fewer number of sampled rows in
Uses
max-stability
later
if instead choose S as For
follows
recent work
of steps
exponentials
on fast
[Andoni]:
max yof
~ |y|1/e of Cauchy random
 Instead
variables, choose
sampling-based
i/eidiagonal
diagonal of reciprocals of exponential
random variables
algorithms,
see
Richard’s talk!
¢
[
[
[
[
0010 01 00
1000 00 00
0 0 0 -1 1 0 -1 0
0-1 0 0 0 0 0 1
1/E1
1/E2
1/E3
…
1/En
21
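The max-stability property can also be checked empirically; for a fixed non-negative y, max_i y_i/E_i and |y|_1/E should agree in distribution:

```python
import numpy as np

rng = np.random.default_rng(12)
y = np.array([1.0, 2.0, 3.0])              # fixed non-negative vector, |y|_1 = 6
m = 200_000
lhs = (y[None, :] / rng.exponential(size=(m, 3))).max(axis=1)
rhs = y.sum() / rng.exponential(size=m)
print(np.median(lhs), np.median(rhs))      # both ~ 6 / ln 2 ~ 8.66
```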
Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
Low rank approximation
- A is an n × d matrix
- Typically well-approximated by a low rank matrix
- E.g., A may only have high rank because of noise
- Want to output a rank-k matrix A', so that |A-A'|_F ≤ (1+ε) |A-A_k|_F, w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_F
- (For a matrix C, |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2})
Solution to low-rank approximation [S]
- Given an n × d input matrix A
- Compute S*A using a sketching matrix S with k << n rows; S*A takes random linear combinations of the rows of A
[Figure: the tall matrix A is compressed to the much smaller sketch SA]
- Project the rows of A onto SA, then find the best rank-k approximation to the points inside of SA
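A compact rendering of the two steps (a Gaussian S is used for brevity, and the row count 4k is an arbitrary oversampling choice):

```python
import numpy as np

def sketch_low_rank(A, k, rng=np.random.default_rng(9)):
    S = rng.standard_normal((4 * k, A.shape[0]))  # sketch with O(k) rows
    SA = S @ A                                    # random combinations of rows of A
    Q, _ = np.linalg.qr(SA.T)                     # orthonormal basis for rowspace(SA)
    P = A @ Q                                     # project rows of A onto that space
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    Pk = (U[:, :k] * s[:k]) @ Vt[:k]              # best rank-k approx of projected points
    return Pk @ Q.T                               # map back to original coordinates
```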
Low Rank Approximation Idea
- Consider the regression problem min_X |A_k X - A|_F
- Its solution is X = I, and the minimum is |A_k - A|_F
- This is a generalized regression problem!
- Suppose S is a subspace embedding for the column space of A_k, and also, for any matrices B and C, |B S^T S C - BC|_F^2 ≤ [1/(# rows of S)] · |B|_F^2 · |C|_F^2
  - Such an S can be a matrix of i.i.d. Normals, a Fast Johnson-Lindenstrauss matrix, or a CountSketch matrix
- Then if X' is the minimizer of min_X |S A_k X - S A|_F, we get |A_k X' - A|_F ≤ (1+ε) min_X |A_k X - A|_F = (1+ε) |A_k - A|_F
- But the minimizer X' = (S A_k)^- S A is in the row span of SA!
Caveat: projecting the points onto SA is slow
- Current algorithm:
  1. Compute S*A (easy)
  2. Project each of the rows of A onto the row space of S*A
  3. Find the best rank-k approximation of the projected points inside the rowspace of S*A (easy)
- The bottleneck is step 2
- [CW] Turns out you can approximate the projection
- Sketching for generalized regression again: min_X |X(SA) - A|_F^2
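One way to realize the [CW] fix is to sketch the generalized regression on the right with a second sketch R (a hedged sketch; the function name, the Gaussian choice of R, and its size r are mine):

```python
import numpy as np

def approx_projection(A, SA, r, rng=np.random.default_rng(10)):
    """Approximately solve min_X |X(SA) - A|_F by sketching on the right."""
    d = A.shape[1]
    R = rng.standard_normal((d, r)) / np.sqrt(r)  # second, right-hand sketch
    X = np.linalg.lstsq((SA @ R).T, (A @ R).T, rcond=None)[0].T
    return X @ SA                                 # approximate projections of rows of A
```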
Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions
M-Estimators and Robust Regression
- Solve min_x |Ax-b|_M, where M: R -> R_{≥0} and |y|_M = Σ_{i=1..n} M(y_i)
- Least squares and l1-regression are special cases
- Huber function, given a parameter c:
  - M(y) = y^2/(2c) for |y| ≤ c
  - M(y) = |y| - c/2 otherwise
- Enjoys the smoothness properties of l2 and the robustness properties of l1
[CW15] For M-estimators with at least linear and at most quadratic growth, can get an O(1)-approximation in nnz(A) + poly(d) time
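The Huber penalty transcribed directly (note it is continuous at |y| = c, where both branches equal c/2):

```python
import numpy as np

def huber(y, c):
    a = np.abs(y)
    return np.where(a <= c, a**2 / (2 * c), a - c / 2)  # quadratic near 0, linear tails
```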
CUR Decompositions
- A CUR decomposition approximates A as C·U·R, where C is a small subset of actual columns of A and R is a small subset of actual rows
- [BW14] Can find a CUR decomposition in O(nnz(A) log n) + n·poly(k/ε) time with O(k/ε) columns, O(k/ε) rows, and rank(U) = k
Open Questions
- Recent monograph in NOW Publishers: D. Woodruff, "Sketching as a Tool for Numerical Linear Algebra"
- Other types of low rank approximation:
  - (Spectral) How quickly can we find a rank-k matrix A', so that |A-A'|_2 ≤ (1+ε) |A-A_k|_2, w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_2 ?
  - (Robust) How quickly can we find a rank-k matrix A', so that |A-A'|_1 ≤ (1+ε) |A-A_k|_1, w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_1 ?
- For other questions regarding Schatten norms and communication efficiency, see the reference above. Thanks!