Sketching for M-Estimators: A Unified Approach to Robust Regression
Kenneth Clarkson, David Woodruff
IBM Almaden

Linear Regression
• Statistical method to study linear dependencies between variables in the presence of noise.

Regression Example
• Ohm's law: V = R · I
• Find the linear function that best fits the data
[Figure: scatter plot of measured voltage against current, with the best-fit line]

Regression: Standard Setting
• One measured variable b
• A set of predictor variables a_1, …, a_d
• Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε, where ε is noise and the x_i are the model parameters we want to learn
• Can assume x_0 = 0
• Now consider n observations of b

Regression: Matrix Form
• Input: an n x d matrix A and a vector b = (b_1, …, b_n), where n is the number of observations and d is the number of predictor variables
• Output: x* so that Ax* and b are close
• Consider the over-constrained case, when n ≫ d

Fitness Measures
Least squares method
• Find x* that minimizes |Ax−b|_2^2
• Ax* is the projection of b onto the column span of A
• Certain desirable statistical properties
• Closed-form solution: x* = (A^T A)^{−1} A^T b
Method of least absolute deviation (ℓ1-regression)
• Find x* that minimizes |Ax−b|_1 = Σ_i |b_i − ⟨A_{i*}, x⟩|
• Cost is less sensitive to outliers than least squares
• Can solve via linear programming
What about the many other fitness measures used in practice?

M-Estimators
• Measure function G: R → R_{≥0}
  – G(x) = G(−x), G(0) = 0
  – G is non-decreasing in |x|
• |y|_M = Σ_{i=1}^n G(y_i)
• Solve min_x |Ax−b|_M
• Least squares and ℓ1-regression are special cases

Huber Loss Function
• G(x) = x^2/(2c) for |x| ≤ c
• G(x) = |x| − c/2 for |x| > c
• Enjoys the smoothness properties of ℓ2^2 and the robustness properties of ℓ1

Other Examples
• ℓ1-ℓ2 estimator: G(x) = 2((1 + x^2/2)^{1/2} − 1)
• Fair estimator: G(x) = c^2 [ |x|/c − log(1 + |x|/c) ]
• Tukey estimator: G(x) = (c^2/6)(1 − [1 − (x/c)^2]^3) if |x| ≤ c; G(x) = c^2/6 if |x| > c

Nice M-Estimators
• An M-estimator is nice if it has at least linear growth and at most quadratic growth:
  there is C_G > 0 so that for all a, a′ with |a| ≥ |a′| > 0,
  |a/a′|^2 ≥ G(a)/G(a′) ≥ C_G |a/a′|
• Any convex G satisfies the linear lower bound
• Any sketchable G satisfies the quadratic upper bound
  – sketchable ⇒ there is a distribution on t x n matrices S for which |Sx|_M = Θ(|x|_M) with probability 2/3, where t is independent of n

Our Results
Let nnz(A) denote the number of non-zero entries of an n x d matrix A.
1. [Huber] An O(nnz(A) log n) + poly(d log n / ε) time algorithm that outputs x′ so that, w.h.p., |Ax′−b|_H ≤ (1+ε) min_x |Ax−b|_H
2. [Nice M-estimators] An O(nnz(A)) + poly(d log n) time algorithm that outputs x′ so that, for any constant C > 1, w.h.p., |Ax′−b|_M ≤ C · min_x |Ax−b|_M
Remarks:
– For convex nice M-estimators one can solve via convex programming, but this is slow: poly(nd) time
– Our algorithm for nice M-estimators is universal

Talk Outline
• Huber result
• Nice M-estimators result

Naive Sampling Algorithm
• Replace min_x |Ax − b|_M by x′ = argmin_x |SAx − Sb|_M, where S uniformly samples poly(d/ε) rows
• This is a terrible algorithm: uniform sampling can miss the few rows that dominate the cost

Leverage Score Sampling
• For ℓp-norms, there are probabilities q_1, …, q_n with Σ_i q_i = poly(d/ε) so that sampling works:
  x′ = argmin_x |SAx − Sb|_M
  – For ℓ2, the q_i are the squared row norms of an orthonormal basis of A
  – For ℓp, the q_i are the p-th powers of the p-norms of the rows of a "well-conditioned basis" [Dasgupta et al.]
• All q_i can be found in O(nnz(A) log n) + poly(d) time
• S is diagonal, with S_{i,i} = 1/q_i if row i is sampled and 0 otherwise
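To make the ℓ2 case above concrete, here is a minimal NumPy sketch of leverage-score sampling for least squares. It is a toy version under stated assumptions, not the fast algorithm from the talk: the scores are computed exactly from a QR factorization rather than approximated in O(nnz(A) log n) time, rows are kept by independent coin flips, and kept rows are rescaled by 1/sqrt(p_i) (the natural convention for the squared ℓ2 cost). The function name and parameters are hypothetical.

```python
import numpy as np

def l2_leverage_sample(A, b, s, seed=0):
    """Illustrative l2 leverage-score sampling for least squares.

    Computes exact leverage scores from a thin QR factorization and
    keeps ~s rows in expectation, rescaling each kept row by
    1/sqrt(p_i) so that |S(Ax - b)|_2^2 estimates |Ax - b|_2^2.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    Q, _ = np.linalg.qr(A)               # orthonormal basis for the column span of A
    lev = np.sum(Q**2, axis=1)           # leverage score q_i = squared row norm of Q; sums to d
    p = np.minimum(1.0, s * lev / d)     # inclusion probabilities, ~s rows expected
    keep = rng.random(n) < p             # independent coin flip per row
    w = 1.0 / np.sqrt(p[keep])           # rescale kept rows for unbiasedness
    SA, Sb = w[:, None] * A[keep], w * b[keep]
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x
```

For instance, x = l2_leverage_sample(A, b, s=10 * A.shape[1]) solves a much smaller weighted problem whose solution is close to the full least-squares solution when enough rows are kept.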
Huber Regression Algorithm
• [Huber inequality] For z ∈ R^n:
  Θ(n^{−1/2}) · min(|z|_1, |z|_2^2/(2c)) ≤ |z|_H ≤ |z|_1
• Proof by case analysis
• Sample from a mixture of ℓ1-leverage scores and ℓ2-leverage scores:
  p_i = n^{1/2} · (q_i^{(1)} + q_i^{(2)})
• Our O(nnz(A) log n) + poly(d/ε) algorithm:
  – After one step, the number of rows is < n^{1/2} poly(d/ε)
  – Recursively solve a weighted Huber regression
  – The weights do not grow quickly
  – Once the size is < n^{0.01} poly(d/ε), solve by convex programming

Talk Outline
• Huber result
• Nice M-estimators result

CountSketch
• For ℓ2 regression, CountSketch with poly(d) rows works [Clarkson, W]:

  S = [ 0  0  1  0  0  1  0  0
        1  0  0  0  0  0  0  0
        0  0  0 −1  1  0 −1  0
        0 −1  0  0  0  0  0  1 ]

• Compute SA in nnz(A) time
• Compute x′ = argmin_x |SAx − Sb|_2 in poly(d) time

M-Sketch
• The same M-Sketch works for all nice M-estimators!
• x′ = argmin_x |TAx − Tb|_{M,w}, where T stacks the blocks S^0 · R^0, S^1 · R^1, S^2 · R^2, …, S^{log n} · R^{log n}
• The S^i are independent CountSketch matrices with poly(d) rows
• R^i is an n x n diagonal matrix that uniformly samples a 1/b^i fraction of [n]
• This sketch was used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang]
• Note: many uses of this data structure do not work here, since they involve a median operation

M-Sketch Intuition
• Consider a fixed y = Ax − b
• For the M-Sketch T, output |Ty|_{w,M} = Σ_i w_i G((Ty)_i)
• [Contraction] |Ty|_{w,M} ≥ (1/2) |y|_M with probability 1 − exp(−d log d)
• [Dilation] |Ty|_{w,M} ≤ 2 |y|_M with probability 9/10
• Contraction allows for a net argument (there is no scale-invariance!)
• Dilation implies the optimal y* does not dilate much

M-Sketch Analysis
• Partition the coordinates into weight classes:
  – S_i = {j | G(y_j) ∈ (|y|_M / b^i, |y|_M / b^{i−1}]}
• If |S_i| > d log d, there is a "sampling level" containing about d log d elements of S_i (this gives the exp(−d log d) failure probability):
  – elements from S_j for j ≤ i do not collide
  – elements from S_j for j > i cancel within a bucket (they concentrate to their 2-norm)
• If |S_i| is small, either all of its elements are found in the top level, or S_i is unimportant (relate M to ℓ2)
• If G is close to quadratic growth, we need to "clip" the top buckets (a Ky Fan norm)

Conclusions
Summary:
1. [Huber] An O(nnz(A) log n) + poly(d log n / ε) time algorithm
2. [Nice M-estimators] An O(nnz(A)) + poly(d) time algorithm
Questions:
1. Is there a sketch-based estimator for (1+ε)-approximation?
2. (Meta-question) Apply streaming techniques to linear algebra:
   – CountSketch → ℓ2-regression
   – p-stable random variables → ℓp-regression for p ∈ [1, 2]
   – CountSketch + heavy hitters → nice M-estimators
   – Pagh's TensorSketch → polynomial kernel regression
   …
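As a backup, here is a minimal NumPy sketch of the CountSketch-and-solve step for ℓ2 regression described above. It assumes dense arrays for simplicity (a real implementation would exploit the sparsity of A to run in nnz(A) time); the function name and parameters are illustrative, not from the talk.

```python
import numpy as np

def countsketch_regression(A, b, t, seed=0):
    """Illustrative CountSketch-and-solve for l2 regression.

    Each row of [A b] is hashed to one of t buckets and multiplied by
    a random sign, which applies a t x n CountSketch S (one +-1 per
    column) in a single pass. Solving the t x d sketched problem gives
    x' with |Ax' - b|_2 close to optimal when t = poly(d).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    bucket = rng.integers(0, t, size=n)      # hash h(i) in [t]
    sign = rng.choice([-1.0, 1.0], size=n)   # random sign sigma(i)
    SA = np.zeros((t, d))
    Sb = np.zeros(t)
    np.add.at(SA, bucket, sign[:, None] * A)  # (SA)[h(i)] += sigma(i) * A[i], duplicates accumulate
    np.add.at(Sb, bucket, sign * b)
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x
```

Taking t on the order of d^2/ε^2 buckets suffices for a (1+ε)-approximation in the ℓ2 case; the M-Sketch above stacks such CountSketch matrices composed with uniform subsampling at geometrically decreasing rates.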