Sketching for M-Estimators: A
Unified Approach to Robust
Regression
Kenneth Clarkson
David Woodruff
IBM Almaden
Regression
Linear Regression
• Statistical method to study linear dependencies
between variables in the presence of noise.
Regression
Example
• Ohm's law V = R ∙ I
• Find linear function that best fits the data
[Figure: scatter plot of (I, V) measurements with a best-fit line; axes roughly 0–150 by 0–250]
Regression
Standard Setting
• One measured variable b
• A set of predictor variables a1, …, ad
• Assumption:
b = x0 + a1 x1 + … + ad xd + e
e is assumed to be noise and the xi are model
parameters we want to learn
• Can assume x0 = 0
• Now consider n observations of b
Regression
Matrix form
Input: nd-matrix A and a vector b=(b1,…, bn)
n is the number of observations; d is the number
of predictor variables
Output: x* so that Ax* and b are close
• Consider the over-constrained case, when n ≫ d
Fitness Measures
Least Squares Method
• Find x* that minimizes |Ax-b|2^2
• Ax* is the projection of b onto the column span of A
• Certain desirable statistical properties
• Closed form solution: x* = (A^T A)^{-1} A^T b
Method of least absolute deviation (l1-regression)
• Find x* that minimizes |Ax-b|1 = Σi |bi - <Ai*, x>|
• Cost is less sensitive to outliers than least squares
• Can solve via linear programming
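As a quick illustration (not from the talk), the sketch below solves both problems on synthetic data with a few gross outliers; the data, dimensions, and LP encoding of l1 regression are assumptions for the example, with numpy/scipy assumed available.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data: a good linear fit plus a few gross outliers.
rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.standard_normal((n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.standard_normal(n)
b[:5] += 50.0  # outliers

# Least squares: closed form x* = (A^T A)^{-1} A^T b, via a stable solver.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# l1 regression as a linear program over variables (x, t):
#   minimize sum_i t_i  subject to  -t_i <= (Ax - b)_i <= t_i
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
x_l1 = res.x[:d]

print("least squares error:", np.linalg.norm(x_ls - x_true))  # hurt by outliers
print("l1 regression error:", np.linalg.norm(x_l1 - x_true))  # robust
```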
What about the many other fitness measures used in practice?
M-Estimators
• Measure function
– G: R → R≥0
– G(x) = G(-x), G(0) = 0
– G is non-decreasing in |x|
• |y|M = Σi=1..n G(yi)
• Solve minx |Ax-b|M
• Least squares and L1-regression are special
cases
Huber Loss Function
G(x) = x^2/(2c) for |x| ≤ c
G(x) = |x| - c/2 for |x| > c
Enjoys the smoothness properties of l2^2 and the
robustness properties of l1
Other Examples
• L1-L2
G(x) = 2((1 + x^2/2)^{1/2} - 1)
• Fair estimator
G(x) = c^2 [ |x|/c - log(1 + |x|/c) ]
• Tukey estimator
G(x) = (c^2/6)(1 - [1 - (x/c)^2]^3) if |x| ≤ c
G(x) = c^2/6 if |x| > c
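A rough numpy transcription of these measure functions, plus the M-norm |y|M; the default tuning constants c are conventional choices, not values from the talk.

```python
import numpy as np

# Tuning constants c below are conventional defaults, not from the talk.
def huber(x, c=1.345):
    ax = np.abs(x)
    return np.where(ax <= c, ax**2 / (2 * c), ax - c / 2)

def l1_l2(x):
    return 2.0 * (np.sqrt(1.0 + np.asarray(x) ** 2 / 2.0) - 1.0)

def fair(x, c=1.4):
    ax = np.abs(x)
    return c**2 * (ax / c - np.log1p(ax / c))

def tukey(x, c=4.685):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= c,
                    (c**2 / 6.0) * (1.0 - (1.0 - (x / c) ** 2) ** 3),
                    c**2 / 6.0)

def m_norm(y, G=huber):
    # |y|_M = sum_i G(y_i)
    return float(np.sum(G(np.asarray(y))))
```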
Nice M-Estimators
• An M-Estimator is nice if it has at least linear growth and
at most quadratic growth
• There is CG > 0 so that for all a, a’ with |a| ≥ |a’| > 0,
|a/a’|^2 ≥ G(a)/G(a’) ≥ CG |a/a’|
• Any convex G satisfies the linear lower bound
• Any sketchable G satisfies the quadratic upper bound
– sketchable => there is a distribution on t x n matrices S for which
|Sx|M = Θ(|x|M) with probability 2/3 and t is independent of n
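As a sanity check (illustration only, not part of the talk), one can numerically spot-check the niceness condition for the Huber measure with c = 1, where C_G = 1 suffices:

```python
import numpy as np

# Spot-check |a/a'|^2 >= G(a)/G(a') >= C_G |a/a'| for |a| >= |a'| > 0,
# with G the Huber measure, c = 1, and (empirically) C_G = 1.
c = 1.0
def G(x):
    ax = np.abs(x)
    return np.where(ax <= c, ax**2 / (2 * c), ax - c / 2)

rng = np.random.default_rng(1)
pairs = np.sort(np.abs(rng.standard_normal((100000, 2))) + 1e-9, axis=1)
a_small, a_big = pairs[:, 0], pairs[:, 1]
ratio = G(a_big) / G(a_small)
r = a_big / a_small
print("max of (G-ratio / r^2), should be <= 1:", np.max(ratio / r**2))
print("min of (G-ratio / r), empirical C_G:   ", np.min(ratio / r))
```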
Our Results
Let nnz(A) denote # of non-zero entries of an n x d matrix A
1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm to
output x’ so that w.h.p.
|Ax’-b|H ≤ (1+ε) minx |Ax-b|H
2. [Nice M-Estimators] O(nnz(A)) + poly(d log n) time algorithm to
output x’ so that for any constant C > 1, w.h.p.
|Ax’-b|M ≤ C · minx |Ax-b|M
Remarks:
- For convex nice M-estimators, one can solve via convex
programming, but slowly: poly(nd) time
- Our algorithm for nice M-estimators is universal
Talk Outline
• Huber result
• Nice M-Estimators result
Naive Sampling Algorithm
minx |Ax - b|M   →   x’ = argminx |S·Ax - S·b|M
S uniformly samples poly(d/ε) rows – this is a terrible algorithm: a uniform sample almost certainly misses the few rows that can determine the optimum (see the toy example below)
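A toy example (an assumption of this writeup, not the talk's): a single row can carry all the information about one coordinate of x, and a small uniform sample almost certainly drops it.

```python
import numpy as np

# Suppose only row 0 of A touches the second predictor. A uniform sample of
# 100 of the 100000 rows keeps row 0 with probability only 1/1000, and
# without it the second coordinate of x is unconstrained.
rng = np.random.default_rng(2)
n, d = 100000, 2
A = np.zeros((n, d))
A[:, 0] = 1.0
A[0, 1] = 1.0                      # the single row carrying information on x_2
b = np.ones(n)
b[0] = 100.0
sample = rng.choice(n, size=100, replace=False)
print("row 0 kept by uniform sample?", 0 in sample)   # almost always False
```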
Leverage Score Sampling
• For lp-norms, there are probabilities q1, …, qn with Σi qi = poly(d/ε) so that sampling works:
minx |Ax - b|M   →   x’ = argminx |S·Ax - S·b|M
– For l2, the qi are the squared row norms in an orthonormal basis of A
– For lp, the qi are p-th powers of the p-norms of rows in a “well-conditioned basis” [Dasgupta et al.]
• All qi can be found in O(nnz(A) log n) + poly(d) time
• S is diagonal. Si,i = 1/qi if row i is sampled, 0 otherwise
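A minimal sketch of the p = 2 case, assuming the usual 1/sqrt(pi) rescaling for squared norms; the oversampling parameter k stands in for the talk's poly(d/ε) bound.

```python
import numpy as np

# l2 leverage-score sampling: sample rows with probability proportional to
# their leverage scores, then rescale the kept rows.
def leverage_sample(A, b, k, rng):
    Q, _ = np.linalg.qr(A)                  # orthonormal basis for col span of A
    q = np.sum(Q**2, axis=1)                # leverage scores; they sum to d
    p = np.minimum(1.0, k * q / q.sum())    # row sampling probabilities
    keep = rng.random(A.shape[0]) < p
    w = 1.0 / np.sqrt(p[keep])              # preserves squared norms in expectation
    return w[:, None] * A[keep], w * b[keep]

rng = np.random.default_rng(3)
A = rng.standard_normal((5000, 10))
b = A @ rng.standard_normal(10) + 0.01 * rng.standard_normal(5000)
SA, Sb = leverage_sample(A, b, k=500, rng=rng)
x_full = np.linalg.lstsq(A, b, rcond=None)[0]
x_samp = np.linalg.lstsq(SA, Sb, rcond=None)[0]
print("deviation of sampled solution:", np.linalg.norm(x_full - x_samp))
```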
Huber Regression Algorithm
• [Huber inequality]: For z ∈ R^n,
Θ(n^{-1/2}) min(|z|1, |z|2^2/(2c)) ≤ |z|H ≤ |z|1
• Proof by case analysis
• Sample from a mixture of l1-leverage scores and l2-leverage scores
– pi = n^{1/2} · (qi^(1) + qi^(2))
• Our O(nnz(A) log n) + poly(d/ε) algorithm:
– After one step, number of rows < n^{1/2} poly(d/ε)
– Recursively solve a weighted Huber problem
– Weights do not grow quickly
– Once size is < n^{0.01} poly(d/ε), solve by convex programming (a generic base-case solver is sketched below)
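For the base case, a hedged sketch of a generic Huber solver via iteratively reweighted least squares (IRLS); this is a standard technique standing in for "solve by convex programming", not the talk's recursive sampling algorithm.

```python
import numpy as np

# IRLS for min_x sum_i G((Ax - b)_i), with G the Huber measure above.
def huber_irls(A, b, c=1.345, iters=50):
    x = np.linalg.lstsq(A, b, rcond=None)[0]       # least-squares warm start
    for _ in range(iters):
        r = A @ x - b
        # IRLS weights psi(r)/r for Huber: 1/c in the quadratic region,
        # 1/|r| in the linear region (guarded against division by zero).
        w = np.where(np.abs(r) <= c, 1.0 / c, 1.0 / np.maximum(np.abs(r), 1e-12))
        AtW = A.T * w                              # A^T W, scaling columns by w
        x = np.linalg.solve(AtW @ A, AtW @ b)      # weighted normal equations
    return x
```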
Talk Outline
• Huber result
• Nice M-Estimators result
CountSketch
• For l2 regression, CountSketch with poly(d) rows works
[Clarkson, W]:
S = [ 0  0  1  0  0  1  0  0
      1  0  0  0  0  0  0  0
      0  0  0 -1  1  0 -1  0
      0 -1  0  0  0  0  0  1 ]
• Compute S·A in nnz(A) time
• Compute x’ = argminx |SAx - Sb|2 in poly(d) time
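A minimal sketch of applying CountSketch in one pass over the rows; the bucket count t is an illustrative stand-in for the poly(d) bound, and sketching [A | b] with the same S keeps the regression consistent.

```python
import numpy as np

# Each row of M is hashed to one of t buckets with a random sign, so
# computing S @ M is a single pass over the nonzeros of M.
def countsketch(M, t, rng):
    n = M.shape[0]
    h = rng.integers(0, t, size=n)         # bucket index for each row
    s = rng.choice([-1.0, 1.0], size=n)    # random sign for each row
    SM = np.zeros((t, M.shape[1]))
    np.add.at(SM, h, s[:, None] * M)       # accumulate signed rows: nnz(M) work
    return SM

rng = np.random.default_rng(4)
A = rng.standard_normal((20000, 5))
b = A @ np.arange(1.0, 6.0) + 0.01 * rng.standard_normal(20000)
SAb = countsketch(np.column_stack([A, b]), t=2000, rng=rng)
x = np.linalg.lstsq(SAb[:, :5], SAb[:, 5], rcond=None)[0]
print("sketched solution:", x)             # close to (1, 2, 3, 4, 5)
```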
M-Sketch
• The same M-Sketch T works for all nice M-estimators!
x’ = argminx |TAx - Tb|M,w
• T stacks the blocks S^0·R^0, S^1·R^1, S^2·R^2, …, S^{log n}·R^{log n}
• The S^i are independent CountSketch matrices with poly(d) rows
• R^i is n x n diagonal and uniformly samples a 1/b^i fraction of [n]
• Sketch used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang]
• Note: many uses of this data structure do not work here, since they involve a median operation
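A hedged sketch of applying such a T to a vector y, one CountSketch per subsampling level; the base b = 2, the weights wi = b^i, and the bucket count t are assumptions of this illustration.

```python
import numpy as np

# Apply stacked blocks S^i R^i to y: R^i keeps each coordinate with
# probability b^{-i}, S^i is a fresh CountSketch with t buckets.
def m_sketch_apply(y, t, rng, base=2):
    n = len(y)
    levels = int(np.log(n) / np.log(base)) + 1
    blocks, weights = [], []
    for i in range(levels):
        keep = rng.random(n) < base ** (-i)    # R^i: uniform 1/b^i subsample
        h = rng.integers(0, t, size=n)         # S^i: fresh CountSketch hashes
        s = rng.choice([-1.0, 1.0], size=n)
        Ty = np.zeros(t)
        np.add.at(Ty, h[keep], s[keep] * y[keep])
        blocks.append(Ty)
        weights.append(float(base) ** i)       # w_i undoes the subsampling rate
    return blocks, weights

def weighted_m_norm(blocks, weights, G):
    # |Ty|_{w,M} = sum over levels i of w_i * sum_j G((Ty)_{i,j})
    return sum(w * np.sum(G(Ty)) for Ty, w in zip(blocks, weights))
```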
M-Sketch Intuition
• Consider a fixed y = Ax-b
• For M-Sketch T, output |Ty|w,M = Σi wi G((Ty)i)
• [Contraction] |Ty|w,M ≥ ½ |y|M w.pr. 1 - exp(-d log d)
• [Dilation] |Ty|w,M ≤ 2 |y|M w.pr. 9/10
• Contraction allows for a net argument (no scale-invariance!)
• Dilation implies the optimal y* does not dilate much
M-Sketch Analysis
• Partition coordinates into weight classes:
– Si = { j : G(yj) ∈ (|y|M/b^i, |y|M/b^{i-1}] }
• If |Si| > d log d, there’s a “sampling level” containing about
d log d elements of Si (gives exp(-d log d) failure probability)
– Elements from Sj for j ≤ i do not collide
– Elements from Sj for j > i cancel in a bucket (concentrate to 2-norm)
• If |Si| is small, all its elements are found in the top level, or Si is not
important (relate M to l2)
• If G is close to quadratic growth, need to “clip” the top buckets
– Ky-Fan norm
Conclusions
Summary:
1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm
2. [Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm
Questions:
1. Is there a sketch-based estimator for (1+ε)-approximation?
2. (Meta-question) Apply streaming techniques to linear algebra
- CountSketch -> l2-regression
- p-stable random variables -> lp-regression for p in [1,2]
- CountSketch + heavy hitters -> nice M-estimators
- Pagh’s TensorSketch -> polynomial kernel regression
…