Lp Row Sampling by Lewis Weights
Richard Peng (M.I.T.)
Joint with Michael Cohen (M.I.T.)

OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration

DATA
• A: n-by-d matrix, nnz(A) non-zeros
• Columns: components
• Rows: features
Computational tasks:
• Identify patterns
• Interpret new data

SUBSAMPLING MATRICES
What applications need: reduce both rows and columns.
Fundamental problem: row reduction
• #features >> #components, i.e. #rows (n) >> #columns (d)
• A → A' = SA
Approaches:
• Subspace embedding: S that works for most A
• Adaptive: build S based on A
Run more expensive routines on A'.

LINEAR MODEL
Ax = x_1·A_{:,1} + x_2·A_{:,2} + x_3·A_{:,3} + …
• Can add/scale data points
• x: coefficients; their combination: Ax
Interpret a new data point b:

DISTANCE MINIMIZATION
min_x ║Ax − b║_p
• p = 2: Euclidean norm, least squares
• p = 1: least absolute deviations, robust regression
Simplified view:
• Ax − b = [A, b]·[x; −1]
• min ║Ax║_p with one entry of x fixed

ROW SAMPLING
Pick some (rescaled) rows of A so that ║Ax║_p ≈_{1+ε} ║A'x║_p for all x.
A' = SA, where S is:
• Õ(d) × n
• one non-zero per row
This is a form of feature selection.
Error notation ≈: a ≈_k b if there exist k_min, k_max with k_max/k_min ≤ k and k_min·a ≤ b ≤ k_max·a.

ON GRAPHS
A: edge-vertex incidence matrix; x: labels on vertices.
Row for edge uv: |a_i^T x|^p = |x_u − x_v|^p
• p = 1: (fractional) cuts; [Benczur-Karger `96]: cut sparsifiers
• p = 2: energy of voltages; [Spielman-Teng `04]: spectral sparsification
A' with O(d·log d) rows in both cases.
[Naor `11][Matousek `97]: on graphs, L2/spectral sparsifiers, after normalization, work for all 1 ≤ p ≤ 2.

PREVIOUS POLY-TIME ALGORITHMS
Assuming ε = constant.
p        # rows        by
2        d·log d       matrix concentration bounds ([Rudelson-Vershynin `07], [Tropp `12])
2        d             [Batson-Spielman-Srivastava `09]
1        d^{2.5}       [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
1<p<2    d^{p/2+2}     [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
2<p      d^{p+1}       [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
[DMMW `11][CDMMMW `12][CW `12][MM `12][NN `12]; [LMP `12]: input-sparsity time, O(nnz(A) + poly(d)).

GENERAL MATRICES
A: four rows, each equal to 1/2.   A': a single row equal to 1.
1-dimensional, so the only `interesting' vector is x = [1]:
• L2 distance: ║Ax║_2 / ║A'x║_2 = 1/1 = 1
• L1 distance: ║Ax║_1 / ║A'x║_1 = 2/1 = 2
The difference is the distortion between L2 and L1, which can be as large as n^{1/2}.

OUR RESULTS
p        Previous      Our                       Uses
1        d^{2.5}       d·log d                   [Talagrand `90]
1<p<2    d^{p/2+2}     d·log d·(log log d)^2     [Talagrand `95]
2<p      d^{p+1}       d^{p/2}·log d             [Bourgain-Milman-Lindenstrauss `89]
• Runtime: input-sparsity time, O(nnz(A) + poly(d))
• When p < 4, the poly(d) overhead is O(d^ω)
• For p = 1, an elementary proof that gets most details right
Will focus on p = 1 for this talk.

SUMMARY
• Goal: sample rows of matrices to preserve ║Ax║_p.
• Graphs: preserving the p-norm preserves all q < p.
• Different notion needed for general matrices.

OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration

IMPORTANCE SAMPLING
General scheme: a probability p_i for each row
• keep row i with probability p_i
• if picked, rescale by s_i to keep the expectation
Before: |a_i^T x|^p.  After: |a'_i^T x|^p = |s_i·a_i^T x|^p w.p. p_i, and 0 w.p. 1 − p_i.
Need E[|a'_i^T x|^p] = |a_i^T x|^p, i.e. p_i·s_i^p = 1, so s_i = p_i^{-1/p}.
Then E[║A'x║_p^p] = ║Ax║_p^p; what remains is concentration.
(If only one row is non-zero in some direction, that row must be kept.)

ISSUES WITH SAMPLING BY NORM
Norm sampling: p_i ∝ ║a_i║_2^2.
• Column with a single non-zero entry: need ║A[1;0;…;0]║_p ≠ 0.
• Bridge in a graph: need to keep connectivity.

MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores, τ_i = a_i^T (A^T A)^{-1} a_i (a_i: row i of A).
Sampling rows w.p. p_i = τ_i·log d gives ║Ax║_2 ≈ ║A'x║_2 ∀x w.h.p.
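To make this concrete, here is a minimal numpy sketch of leverage-score row sampling for p = 2. It is a dense, exact illustration rather than the input-sparsity-time estimators cited just below; the function names, matrix sizes, and the oversampling constant c are illustrative choices.

```python
import numpy as np

def leverage_scores(A):
    """L2 statistical leverage scores tau_i = a_i^T (A^T A)^{-1} a_i.

    Dense, exact computation for illustration only; the fast algorithms
    cited in the talk estimate these in O(nnz(A) + d^{omega+a}) time.
    """
    G = A.T @ A                                   # d x d Gram matrix, assumed full rank
    AG = np.linalg.solve(G, A.T).T                # row i is a_i^T G^{-1}
    return np.einsum('ij,ij->i', A, AG)           # a_i^T G^{-1} a_i

def sample_rows_l2(A, c=10.0, rng=np.random.default_rng(0)):
    """Keep row i w.p. p_i = min(1, c * tau_i * log d); rescale kept rows by p_i^{-1/2}."""
    n, d = A.shape
    tau = leverage_scores(A)
    p = np.minimum(1.0, c * tau * np.log(d))
    keep = rng.random(n) < p
    return A[keep] / np.sqrt(p[keep])[:, None]

# Quick check: ||A'x||_2 should be close to ||Ax||_2 for a random x.
A = np.random.default_rng(1).standard_normal((2000, 10))
Ap = sample_rows_l2(A)
x = np.random.default_rng(2).standard_normal(10)
print(Ap.shape[0], np.linalg.norm(A @ x) / np.linalg.norm(Ap @ x))
```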
On graphs: τ_i = weight of edge i × its effective resistance.
[Foster `49]: Σ_i τ_i = rank ≤ d, so O(d·log d) rows.
[CW `12][NN `13][LMP `13][CLMMPS `15]: can estimate L2 leverage scores in O(nnz(A) + d^{ω+a}) time.

MATRIX-CHERNOFF BOUNDS
w*: L1 Lewis weights, defined by w_i*^2 = a_i^T (A^T W*^{-1} A)^{-1} a_i — a recursive definition.
Sampling rows w.p. p_i = w_i*·log d gives ║Ax║_1 ≈ ║A'x║_1 ∀x w.h.p.
Equivalent: w_i* = w_i*^{-1}·a_i^T (A^T W*^{-1} A)^{-1} a_i, the leverage score of row i of W*^{-1/2}A.
Σ_i w_i* = d by Foster's theorem.
Will show:
• can get w ≈ w* using calls to L2 leverage score estimators, in O(nnz(A) + d^{ω+a}) time
• existence and uniqueness

WHAT IS A LEVERAGE SCORE
The length of a_i after a `whitening transform'.
Approximations are basis independent: max_x ║Ax║_p/║A'x║_p = max_x ║AUx║_p/║A'Ux║_p.
So we can reorganize the columns of A: transform so that A^T A = I (isotropic position).
Interpretation of matrix-Chernoff: when A is in isotropic position, norm sampling works.

WHITENING FOR L1?
Take A with rows [1, 0] and [0, (1 − ε^2)^{1/2}], and split the remaining ε of the second coordinate into k^2 rows [0, ε/k].
Most of ║A[0; 1]║_1 comes from the small rows: k^2·(ε/k) = kε, which is large when k > ε^{-1}.
• Total sampling probability of the small rows < ε·log n
• Problematic when k > ε^{-1} > log n
(this can also happen in a non-orthogonal manner)

WHAT WORKS FOR ANY p
If
• n > f(d),
• A is isotropic, A^T A = I,
• all row norms satisfy ║a_i║_2^2 < 2d/n,
then sampling with p_i = 1/2 gives ║Ax║_p ≈ ║A'x║_p ∀x.
p        f(d)                      citation
1        d·log d                   [Talagrand `90]
1<p<2    d·log d·(log log d)^2     [Talagrand `95]
2<p      d^{p/2}·log d             [Bourgain-Milman-Lindenstrauss `89]
Symmetrization ([Rudelson-Vershynin `07]): uniformly sampling down to f(d) rows gives ║Ax║_p ≈ ║A'x║_p ∀x.

SPLITTING BASED VIEW
Matrix concentration bound: uniform sampling works when all L2 leverage scores are the same.
"Generalized whitening transformation": split a_i into w_i (fractional) copies so that all rows have L2 leverage score d/n.
To preserve |a_i^T x|^p, each of the w_i copies must be w_i^{-1/p}·a_i.
L2: w_i copies of w_i^{-1/2}·a_i
• Quadratic form: w_i·(w_i^{-1/2}a_i)^T(w_i^{-1/2}a_i) = a_i^T a_i
• w_i = (n/d)·τ_i suffices

SPLITTING FOR L1
To preserve L1: w_i copies of w_i^{-1}·a_i.
Example: A = [1, 0; 0, 2] with w_2 = 4 becomes [1, 0; 0, 1/2; 0, 1/2; 0, 1/2; 0, 1/2],
with new quadratic form [1, 0; 0, 1].
We are measuring leverage scores w.r.t. a different matrix!
Row a_i contributes w_i copies of (w_i^{-1}a_i)^T(w_i^{-1}a_i).
Lewis quadratic form: Σ_i w_i·(w_i^{-1}a_i)^T(w_i^{-1}a_i) = A^T W^{-1} A.

CHOICE OF n
Lewis quadratic form: A^T W^{-1} A.
All L2 leverage scores equal: d/n = w_i^{-2}·a_i^T (A^T W^{-1} A)^{-1} a_i,
or: w_i = (n/d)·w_i^{-1}·a_i^T (A^T W^{-1} A)^{-1} a_i.
Sanity check: Σ_i w_i = (n/d)·Σ_i w_i^{-1}·a_i^T (A^T W^{-1} A)^{-1} a_i = (n/d)·d = n.
For n' rows instead: w' ← w·(n'/n) works. Check:
• A^T W'^{-1} A = (n/n')·A^T W^{-1} A
• a_i^T (A^T W'^{-1} A)^{-1} a_i = (n'/n)·a_i^T (A^T W^{-1} A)^{-1} a_i
• w_i'^{-2}·a_i^T (A^T W'^{-1} A)^{-1} a_i = ((n'/n)·w_i)^{-2}·(n'/n)·a_i^T (A^T W^{-1} A)^{-1} a_i
  = (n/n')·w_i^{-2}·a_i^T (A^T W^{-1} A)^{-1} a_i = d/n'

CHOICE OF n
w_i: weights to split into n rows; for n' rows instead, use w' ← w·(n'/n).
# samples of row i: ≈ w_i·f(d)/n.
Lp Lewis weights: the w that gives n = d rows
• `fractional' copies, w_i* < 1
• recursive definition
• f(d)/d: sampling overhead, akin to O(log d) from L2 matrix-Chernoff bounds

LEWIS WEIGHTS
Lp Lewis weights: w* s.t. w_i*^{2/p} = a_i^T (A^T W*^{1−2/p} A)^{-1} a_i.
Recursive definition; existence and computation shown next.
w* = L2 leverage scores of W*^{1/2−1/p}A
• Sum: d
• p = 2: 1/2 − 1/p = 0, same as the L2 leverage scores of A
2-approximate Lp Lewis weights: w s.t. w_i^{2/p} ≈_2 a_i^T (A^T W^{1−2/p} A)^{-1} a_i.
L1: w_i ≈_2 (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}.

INVOKING EXISTING RESULTS
Symmetrization ([Rudelson-Vershynin `07]): importance sampling using 2-approximate Lp Lewis weights gives A' s.t. ║Ax║_p ≈ ║A'x║_p ∀x.
p        # rows                    citation
1        d·log d                   [Talagrand `90]
1<p<2    d·log d·(log log d)^2     [Talagrand `95]
2<p      d^{p/2}·log d             [Bourgain-Milman-Lindenstrauss `89]
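To make the definition and this sampling statement concrete, below is a minimal dense numpy sketch for p = 1: it computes approximate L1 Lewis weights by the fixed-point iteration w_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2} described in the Computation section that follows, then importance-samples rows with probabilities proportional to the weights. The iteration count, the oversampling constant c, and the exact dense solves are illustrative choices, not the input-sparsity-time algorithm.

```python
import numpy as np

def lewis_weights_l1(A, iters=20):
    """Approximate L1 Lewis weights via the fixed-point iteration
    w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}, starting from w = 1.

    Dense and exact in every step, purely for illustration; the fast version
    in the talk replaces the exact solve with approximate L2 leverage scores.
    """
    n, d = A.shape
    w = np.ones(n)
    for _ in range(iters):
        G = A.T @ (A / w[:, None])                                  # A^T W^{-1} A
        lev = np.einsum('ij,ij->i', A, np.linalg.solve(G, A.T).T)   # a_i^T G^{-1} a_i
        w = np.sqrt(lev)
    return w  # at the fixed point, sum(w) = d

def sample_rows_l1(A, w, c=4.0, rng=np.random.default_rng(0)):
    """Importance sampling for p = 1: keep row i with probability
    p_i = min(1, c * w_i * log d) and rescale kept rows by 1 / p_i."""
    n, d = A.shape
    p = np.minimum(1.0, c * w * np.log(d))
    keep = rng.random(n) < p
    return A[keep] / p[keep][:, None]

# Usage: A' should satisfy ||Ax||_1 ~ ||A'x||_1 for a random x.
A = np.random.default_rng(1).standard_normal((5000, 15))
w = lewis_weights_l1(A)
Ap = sample_rows_l1(A, w)
x = np.random.default_rng(2).standard_normal(15)
print(Ap.shape[0], np.abs(A @ x).sum() / np.abs(Ap @ x).sum())
```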
SUMMARY
• Goal: sample rows of matrices to preserve ║Ax║_p.
• Graphs: preserving the p-norm preserves all q < p.
• Different notion needed for general matrices.
• Sampling method: importance sampling.
• Leverage scores → Lewis weights (their p-norm analog).

OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration

FINDING L1 LEWIS WEIGHTS
Need: w_i ≈_2 (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}.
Algorithm: pretend w is the right answer, and iterate
  w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}.
This resembles iteratively reweighted least squares.
Each iteration computes leverage scores w.r.t. w: O(nnz(A) + d^{ω+a}) time.
We show: if 0 < p < 4, the distance between w and w' decreases rapidly.

CONVERGENCE PROOF OUTLINE
w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2},  w''_i ← (a_i^T (A^T W'^{-1} A)^{-1} a_i)^{1/2};
both have the form (a_i^T P a_i)^{1/2} for some matrix P.
Goal: show the distance between w' and w'' is less than the distance between w and w'.
Spectral similarity of matrices: P ≈_k Q if x^T P x ≈_k x^T Q x ∀x.
Implications of P ≈_k Q:
• P^{-1} ≈_k Q^{-1}
• U^T P U ≈_k U^T Q U for all matrices U

CONVERGENCE FOR L1
Iteration steps: w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2},  w''_i ← (a_i^T (A^T W'^{-1} A)^{-1} a_i)^{1/2}.
Assume w ≈_k w'. Then:
• W ≈_k W'; invert: W^{-1} ≈_k W'^{-1}
• composition: A^T W^{-1} A ≈_k A^T W'^{-1} A
• invert: (A^T W^{-1} A)^{-1} ≈_k (A^T W'^{-1} A)^{-1}
• apply to the vector a_i: a_i^T (A^T W^{-1} A)^{-1} a_i ≈_k a_i^T (A^T W'^{-1} A)^{-1} a_i
• so w'_i^2 ≈_k w''_i^2, i.e. w'_i ≈_{k^{1/2}} w''_i
Fixed-point iteration: log(k) halves per step!

OVERALL SCHEME
We show: if initialized with w_i = 1, after one step we have w ≈_n w'.
The convergence bound then gives w^(t) ≈_2 w^(t+1) after O(log log n) rounds,
and p_i = 2·w_i·log d are then good sampling probabilities.
Input-sparsity time: stop when w^(t) ≈_{n^c} w^(t+1)
• O(log(1/c)) rounds suffice
• over-sampling by a factor of n^c
Uniqueness: w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2} is a contraction mapping; the same convergence rate to the fixed point follows.

OPTIMIZATION FORMULATION
Lp Lewis weights: w_i*^{2/p} = a_i^T (A^T W*^{1−2/p} A)^{-1} a_i.
Poly-time algorithm: solve
  max det(M)  s.t.  Σ_i (a_i^T M a_i)^{p/2} ≤ d,  M P.S.D.,
and read off w_i* = (a_i^T M a_i)^{p/2} from the optimal M.
• Convex problem when p > 2
• Also leads to input-sparsity time algorithms

SUMMARY
• Goal: sample rows of matrices to preserve ║Ax║_p.
• Graphs: preserving the p-norm preserves all q < p.
• Different notion needed for general matrices.
• Sampling method: importance sampling.
• Leverage scores → Lewis weights (their p-norm analog).
• Iterative computation when 0 < p < 4.
• Solutions to a max-determinant problem.

OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration

PROOFS OF KNOWN RESULTS
p        citation                                              # pages
1        [Talagrand `90] + [Pisier `89, Ch. 2]                 8 + ~30
1<p<2    [Talagrand `95] + [Ledoux-Talagrand `90, Ch. 15.5]    16 + 12
2<p      [Bourgain-Milman-Lindenstrauss `89]                   69
Will show: an elementary proof for p = 1.
Tools used by these proofs:
• Gaussian processes
• K-convexity for p = 1
• majorizing measures for p > 1

CONCENTRATION
`Nice' case:
• A^T A = I (isotropic position)
• ║a_i║_2^2 < ε^2 / log n
• sampling with p_i = 1/2
Can use this to show the general case.
Pick half the rows, double them. s_i, the multiplier of row i, is 0 w.p. 1/2 and 2 w.p. 1/2.
║Ax║_1 − ║A'x║_1 = Σ_i |a_i^T x| − Σ_i s_i·|a_i^T x| = Σ_i (1 − s_i)·|a_i^T x|

RADEMACHER PROCESSES
σ_i = 1 − s_i: ±1 w.p. 1/2 each — Rademacher random variables.
Need to bound, over the choices of σ:
  max_{x, ║Ax║_1 ≤ 1} Σ_i σ_i·|a_i^T x|.
Comparison theorem [Ledoux-Talagrand `89]: it suffices to bound (in expectation over σ)
  max_{x, ║Ax║_1 ≤ 1} Σ_i σ_i·a_i^T x = max_{x, ║Ax║_1 ≤ 1} σ^T A x.
Proof via Hall's theorem on the hypercube.

TRANSFORMATION
max_{x, ║Ax║_1 ≤ 1} σ^T A x
  = max_{x, ║Ax║_1 ≤ 1} σ^T A A^T A x        (A^T A = I by assumption)
  ≤ max_{y, ║y║_1 ≤ 1} σ^T A A^T y
  = ║σ^T A A^T║_∞                            (dual norm: max_{║y║_1 ≤ 1} b^T y = ║b║_∞)

EACH ENTRY
(σ^T A A^T)_j = Σ_i σ_i·a_i^T a_j.
Khintchine's inequality (with the (log n)-th moment): w.h.p. Σ_i σ_i·b_i ≤ O((log n · ║b║_2^2)^{1/2}).
Here Σ_i (a_i^T a_j)^2 = Σ_i a_j^T a_i a_i^T a_j = a_j^T A^T A a_j.

INITIAL ASSUMPTIONS
• A^T A = I (isotropic position)
• ║a_i║_2^2 < ε^2 / log n
So a_j^T A^T A a_j = ║a_j║_2^2 < ε^2 / log n, and Khintchine's inequality gives:
w.h.p. each entry ≤ O((log n · ║a_j║_2^2)^{1/2}) = O(ε).
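As a quick numerical sanity check of this per-entry bound, the sketch below builds an explicitly isotropic A with uniformly small rows, draws one Rademacher vector σ, and compares the largest entry of σ^T A A^T against the √(log n · ║a_j║_2^2) scale (up to constants). The dimensions and the single random draw are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 20

# Random A in isotropic position (A^T A = I) with roughly uniform row norms:
# orthonormalize a Gaussian matrix column-wise via reduced QR.
A, _ = np.linalg.qr(rng.standard_normal((n, d)))
row_norm_sq = np.sum(A * A, axis=1)      # each about d/n, well below 1/log n here

# One draw of Rademacher signs sigma_i = +-1 and the vector sigma^T A A^T.
sigma = rng.choice([-1.0, 1.0], size=n)
v = (sigma @ A) @ A.T                    # entries (sigma^T A A^T)_j = sum_i sigma_i a_i^T a_j

# Khintchine-style scale per entry (ignoring constants): sqrt(log n * ||a_j||_2^2).
scale = np.sqrt(np.log(n) * row_norm_sq.max())
print("max_j |(sigma^T A A^T)_j| =", np.abs(v).max())
print("sqrt(log n * max_j ||a_j||_2^2) =", scale)
```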
Unwind the proof stack:
• W.h.p. ║σ^T A A^T║_∞ < ε
• so max_{x, ║Ax║_1 ≤ 1} σ^T A x < ε
• passing the moment generating function through the comparison theorem gives the result

SUMMARY
• Goal: sample rows of matrices to preserve ║Ax║_p.
• Graphs: preserving the p-norm preserves all q < p.
• Different notion needed for general matrices.
• Sampling method: importance sampling.
• Leverage scores → Lewis weights (their p-norm analog).
• Iterative computation when 0 < p < 4.
• Solutions to a max-determinant problem.
• Concentration: reduces to bounding the max entry of a vector.
• Follows from scalar Chernoff bounds.

OPEN PROBLEMS
• What are Lewis weights on graphs?
• Elementary proof for p ≠ 1?
• O(d) rows for 1 < p < 2?
• Better algorithms for p ≥ 4?
• Fewer rows for structured matrices (e.g. graphs) when p > 2? Conjecture: O(d·log^{f(p)} d) for graphs.
• Generalize to low-rank approximations?

Reference: http://arxiv.org/abs/1412.0588