Lp Row Sampling by Lewis Weights
Richard Peng (Georgia Tech, CS)
Joint with Michael Cohen (M.I.T.)

PROBLEM: ROW SAMPLING
Given an n × d matrix A and a norm p, pick a few (rescaled) rows of A to form A' so that ║Ax║_p ≈_{1+ε} ║A'x║_p for all x.
Multiplicative error notation: a ≈_k b if there exist k_min, k_max s.t. k_max/k_min ≤ k and k_min·a ≤ b ≤ k_max·a.

APPLICATION: DATA
Data often lives in dimensions far larger than its true rank:
• over-constrained regression
• PCA / low-rank approximation
• stochastic gradient descent
If ║Ax║_p ≈_{1+ε} ║A'x║_p for all x, we can solve on A' instead.

INSTANTIATION: REGRESSION
min_x ║Ax − b║_p
• p = 2: least squares
• p = 1: robust regression
Simplified view:
• Ax − b = [A, b][x; −1]
• so this is min ║Ax║_p with one entry of x fixed

ON GRAPHS
A: edge-vertex incidence matrix; x: indicator vector of a cut; A': subset of (rescaled) edges.
Widely used tools in graph algorithms:
• [Benczur-Karger `96]: cut sparsifiers
• [Spielman-Teng `04]: spectral sparsification

PREVIOUS POLY-TIME ALGORITHMS
[CW`12][NN`13][LMP`13][CLMMPS`15]: input-sparsity time, expensive steps only on size-d matrices, O(nnz(A) + poly(d)) total.
Row counts (assuming ε = constant):

  p          # rows       by
  2          d log d      matrix concentration bounds ([Rudelson-Vershynin `07], [Tropp `12])
  2          d            [Batson-Spielman-Srivastava `09]
  1          d^2.5        [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
  1 < p < 2  d^{p/2+2}    [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]
  2 < p      d^{p+1}      [Dasgupta-Drineas-Harb-Kumar-Mahoney `09]

L2 VS. L1
1-dimensional example: A has n rows, each equal to 1/√n (pictured with n = 4 rows of 1/2); A' is the single row [1]. The only `interesting' vector is x = [1]:
• L2 distance: 1 vs. 1, ratio 1
• L1 distance: √n vs. 1 (2 vs. 1 when n = 4)
The gap is exactly the distortion between L2 and L1: n^{1/2}.

CONNECTION
Functional analysis:
• ║Ax║_p is a Banach space
• A' = SA is an embedding into a lower-dimensional space
For p = 1:
• embeddability of Banach spaces: A' with O(d log d) rows
• Beck-Fiala conjecture: implied by A' with O(d) rows

OUR RESULTS: ROW COUNT

  p          previous     ours                     uses
  1          d^2.5        d log d                  [Talagrand `90]
  1 < p < 2  d^{p/2+2}    d log d (log log d)^2    [Talagrand `95]
  2 < p      d^{p+1}      d^{p/2} log d            [Bourgain-Lindenstrauss-Milman `89]

• Runtime: input-sparsity time, O(nnz(A) + poly(d))
• When p < 4, the overhead is O(d^ω)
Will focus on p = 1 for this talk.

OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration

IMPORTANCE SAMPLING
If some direction is covered by only one non-zero row, that row must be kept while sampling the rest of A.
Sampling scheme: pick row i with probability p_i. Rescaling the picked rows ensures E[║A'x║_p^p] = ║Ax║_p^p; the real work is proving concentration.

ISSUES WITH SAMPLING BY NORM
Norm sampling: p_i ∝ ║a_i║_2^2. This fails on a column with a single non-zero entry: we need ║A[1; 0; …; 0]║_p ≠ 0, but the row carrying that entry can have tiny norm and get dropped.

MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores, τ_i = a_i^T (A^T A)^{-1} a_i, where a_i is row i of A.
Matrix concentration bounds (e.g. [Rudelson-Vershynin `07], [Tropp `12]): sampling rows w.p. p_i = τ_i·log d gives ║Ax║_2 ≈ ║A'x║_2 for all x w.h.p.
[Foster `49]: Σ_i τ_i = rank ≤ d, so O(d log d) rows in A'.
[CW`12][NN`13][LMP`13][CLMMPS`15]: can estimate L2 leverage scores in O(nnz(A) + d^{ω+α}) time.

WHAT IS A LEVERAGE SCORE
The norm of a_i after a `whitening transform'. Approximations are basis independent:
  max_x ║Ax║_p/║A'x║_p = max_x ║AUx║_p/║A'Ux║_p
so we can reorganize the columns of A: transform so that A^T A = I (isotropic position).
Interpretation of matrix-Chernoff: when A is in isotropic position, norm sampling works.
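To make this concrete, here is a minimal NumPy sketch of the L2 pipeline on the preceding slides: exact leverage scores via a thin QR factorization, then row sampling with rescaling. It is an illustration, not the input-sparsity estimator cited above (this version costs O(nd^2)); the function names and the oversampling constant 10 are our own choices.

```python
import numpy as np

def leverage_scores(A):
    """Exact L2 leverage scores tau_i = a_i^T (A^T A)^{-1} a_i.
    Via thin QR, A = QR: tau_i is the squared norm of row i of Q."""
    Q, _ = np.linalg.qr(A)
    return np.einsum('ij,ij->i', Q, Q)

def sample_rows_l2(A, probs, rng):
    """Keep row i w.p. probs[i]; rescale kept rows by probs[i]^{-1/2}
    so that E[A'^T A'] = A^T A."""
    keep = rng.random(A.shape[0]) < probs
    return A[keep] / np.sqrt(probs[keep])[:, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((100_000, 10))
tau = leverage_scores(A)                    # sums to rank(A) <= d  [Foster `49]
probs = np.minimum(1.0, 10 * np.log(A.shape[1]) * tau)
A_prime = sample_rows_l2(A, probs, rng)
print(A_prime.shape[0], tau.sum())          # O(d log d) rows kept; sum ~ d
```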
WHAT WORKS FOR ANY P
For any p there exists a function f of d such that for any n × d matrix A where
• n > f(d),
• A is isotropic, A^T A = I,
• every row satisfies ║a_i║_2^2 < 2d/n (at most twice the average),
sampling with p_i = 1/2 gives ║Ax║_p ≈ ║A'x║_p for all x.

LEWIS CHANGE OF DENSITY
w*: L1 Lewis weights, defined recursively by
  w*_i^2 = a_i^T (A^T W*^{-1} A)^{-1} a_i
Symmetrization ([Rudelson-Vershynin `07]): importance sampling using 2-approximate Lp Lewis weights gives A' s.t. ║Ax║_p ≈ ║A'x║_p for all x, with row counts:

  p          # rows                   citation
  1          d log d                  [Talagrand `90]
  1 < p < 2  d log d (log log d)^2    [Talagrand `95]
  2 < p      d^{p/2} log d            [Bourgain-Lindenstrauss-Milman `89]

REST OF TALK
An algorithm: computation of approximate Lewis weights; and a proof: an elementary proof of 1-norm concentration.
Key ideas:
• isotropic position
• analyze intermediate matrices as operators, via variants of relative condition number
• Rademacher processes under L1

OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration

ALGORITHM
w*: L1 Lewis weights, w*_i^2 = a_i^T (A^T W*^{-1} A)^{-1} a_i — a recursive definition.
Algorithm: pretend w is the right answer, and iterate
  w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}
This resembles iteratively reweighted least squares. Each iteration computes leverage scores w.r.t. w, in O(nnz(A) + d^{ω+α}) time.

OPERATORS
Key to the analysis: the Lewis quadratic form A^T W^{-1} A.
Spectral similarity: P ≈_k Q if x^T P x ≈_k x^T Q x for all x.
Implications of P ≈_k Q:
• P^{-1} ≈_k Q^{-1}
• U^T P U ≈_k U^T Q U for all matrices U
• the relative condition number of P and Q is ≤ k

CONVERGENCE PROOF OUTLINE
Update rule:
  w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}
  w''_i ← (a_i^T (A^T W'^{-1} A)^{-1} a_i)^{1/2}
  …
Will show:
• if 0 < p < 4, the distance between w and w' rapidly decreases
• the distance between w' and w'' is less than the distance between w and w': a contraction mapping

CONVERGENCE FOR L1
Assume w ≈_k w', i.e. W ≈_k W'.
  Invert:        W^{-1} ≈_k W'^{-1}
  Compose:       A^T W^{-1} A ≈_k A^T W'^{-1} A
  Invert:        (A^T W^{-1} A)^{-1} ≈_k (A^T W'^{-1} A)^{-1}
  Apply to a_i:  a_i^T (A^T W^{-1} A)^{-1} a_i ≈_k a_i^T (A^T W'^{-1} A)^{-1} a_i
so w'_i^2 ≈_k w''_i^2, i.e. w' ≈_{k^{1/2}} w''.
k' ← k^{1/2} per step: convergence in O(log log k) steps!

OVERALL ALGORITHM
Can also show: if initialized with w_i = 1, after 1 step we have w ≈_n w'. The convergence bound then gives w^{(t)} ≈_2 w^{(t+1)} in O(log log n) rounds.
Sample with probabilities p_i = min{1, 2·log d·w_i}.
• Uniqueness: w'_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2} is a contraction mapping, with the same convergence rate to the fixed point.
• Sum: the w*_i are the L2 leverage scores of W*^{-1/2} A, so they sum to at most d.
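A minimal sketch of this fixed-point iteration for p = 1, in dense NumPy for clarity (the talk's algorithm instead estimates each step's quadratic forms in O(nnz(A) + d^{ω+α}) time; the function name and the iteration cap of 20, a generous stand-in for the O(log log n) bound, are ours):

```python
import numpy as np

def l1_lewis_weights(A, iters=20):
    """Fixed-point iteration from the ALGORITHM slide (p = 1):
        w'_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}
    starting from the all-ones vector."""
    w = np.ones(A.shape[0])
    for _ in range(iters):
        M = A.T @ (A / w[:, None])       # A^T W^{-1} A, the Lewis quadratic form
        q = np.einsum('ij,ij->i', A, np.linalg.solve(M, A.T).T)
        w = np.sqrt(q)                   # contraction: approximation k -> k^{1/2}
    return w

rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 8))
w = l1_lewis_weights(A)
print(w.sum())   # at the fixed point, sum_i w_i = d (here 8)
```

Sampling then keeps row i with probability min{1, O(log d)·w_i} and rescales, as on the slide above.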
OPTIMIZATION FORMULATION
Lp Lewis weights: w*_i^2 = a_i^T (A^T W*^{1−2/p} A)^{-1} a_i.
Poly-time algorithm: solve
  max det(M)  s.t.  Σ_i (a_i^T M a_i)^{p/2} ≤ d,  M P.S.D.
The optimal M gives the Lewis weights w*_i.
• convex problem when p > 2
• also leads to input-sparsity time algorithms

OUTLINE
• Row Sampling
• Lewis Weights
• Computation
• Proof of Concentration

PROOFS OF KNOWN RESULTS
Will show: an elementary proof for p = 1.

  p          citation                                            # pages
  1          [Talagrand `90] + [Pisier `89, Ch. 2]               8 + ~30
  1 < p < 2  [Talagrand `95] + [Ledoux-Talagrand `90, Ch. 15.5]  16 + 12
  2 < p      [Bourgain-Lindenstrauss-Milman `89]                 69

Tools used:
• Gaussian processes
• K-convexity for p = 1
• majorizing measures + generic chaining for p > 1

CONCENTRATION
`Nice' case:
• A^T A = I (isotropic position)
• ║a_i║_2^2 < ε^2 / log n
• sampling with p_i = 1/2: pick half the rows, double them
s_i: multiplier of row i, equal to 0 w.p. 1/2 and 2 w.p. 1/2. Then
  ║Ax║_1 − ║A'x║_1 = Σ_i |a_i^T x| − Σ_i s_i |a_i^T x| = Σ_i (1 − s_i) |a_i^T x|

RADEMACHER PROCESSES
Need to bound, over the choices of σ = 1 − s:
  max_{x: ║Ax║_1 ≤ 1} Σ_i (1 − s_i) |a_i^T x|
Rademacher random variables: σ_i = 1 − s_i = ±1 w.p. 1/2 each, so this is
  max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i |a_i^T x|

MAIN STEPS
Goal: show that w.h.p. over the choices of σ = ±1,
  max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i |a_i^T x| ≤ ε
• comparison lemma: turn the sum of absolute values into a single sum
• dual norms: reduce to bounding the maximum of a single operator
• finish with scalar concentration bounds

COMPARISON THEOREM
[Ledoux-Talagrand `89], implication:
  E_σ[max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i |a_i^T x|] ≤ 2 E_σ[max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i a_i^T x]
Proof via Hall's theorem on the hypercube. Invoking it with moment generating functions, it suffices to show w.h.p. over σ:
  max_{x: ║Ax║_1 ≤ 1} Σ_i σ_i a_i^T x ≤ ε

TRANSFORMATION + DUAL NORM
  max_{x: ║Ax║_1 ≤ 1} σ^T A x
    = max_{x: ║Ax║_1 ≤ 1} σ^T A A^T A x    (A^T A = I by assumption)
    ≤ max_{y: ║y║_1 ≤ 1} σ^T A A^T y       (substitute y = Ax, so ║y║_1 ≤ 1)
    = ║σ^T A A^T║_∞                        (dual norm: max_{║y║_1 ≤ 1} b^T y = ║b║_∞)

EACH ENTRY
  (σ^T A A^T)_j = Σ_i σ_i a_i^T a_j
• the σ_i are the only random variables
• Khintchine's inequality (with log n-th moment): w.h.p. Σ_i σ_i b_i ≤ O((log n · ║b║_2^2)^{1/2})
• so it suffices to bound Σ_i b_i^2 = Σ_i (a_i^T a_j)^2

LAST STEPS
  Σ_i (a_i^T a_j)^2 = Σ_i a_j^T a_i a_i^T a_j = a_j^T (Σ_i a_i a_i^T) a_j = a_j^T A^T A a_j = ║a_j║_2^2 < ε^2 / log n
using the `nice case' assumptions A^T A = I (isotropic position) and ║a_i║_2^2 < ε^2 / log n.
Consequence of Khintchine's inequality: w.h.p. each entry is O((log n · ║b║_2^2)^{1/2}) = O(ε).

UNWIND PROOF STACK
• scalar concentration bounds: w.h.p. ║σ^T A A^T║_∞ < ε
• dual norms: max_{x: ║Ax║_1 ≤ 1} σ^T A x < ε
• passing the moment generating function through the comparison theorem gives the result

OPEN PROBLEMS
• Elementary proof for p ≠ 1?
• O(d) rows for 1 < p < 2?
• Better algorithms for p ≥ 4?
• Generalize to low-rank approximations?
• SGD using Lewis weights?
• Fewer rows for structured matrices (e.g. graphs) when p > 2? Conjecture: O(d log^{f(p)} d) for graphs.

Reference: http://arxiv.org/abs/1412.0588
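As a closing aside (our addition, not from the talk): a quick numerical check of the dual-norm step in the `nice case' above. We put a random A into isotropic position with a thin QR so that A^T A = I, draw Rademacher signs, and compare ║σ^T A A^T║_∞ against ε = (max_i ║a_i║_2^2 · log n)^{1/2}, the smallest ε the row-norm assumption allows.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20000, 10
# Isotropic position: A^T A = I via thin QR of a Gaussian matrix.
A, _ = np.linalg.qr(rng.standard_normal((n, d)))
row_norms_sq = np.einsum('ij,ij->i', A, A)
eps = np.sqrt(row_norms_sq.max() * np.log(n))

sigma = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs sigma_i = 1 - s_i
entries = (sigma @ A) @ A.T                  # the n entries of sigma^T A A^T
print(np.abs(entries).max(), eps)            # max entry is O(eps), as the proof predicts
```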