Preconditioning in Expectation
Richard Peng, MIT
Joint with Michael Cohen (MIT), Rasmus Kyng (Yale), Jakub Pachocki (CMU), and Anup Rao (Yale)
CMU theory seminar, April 5, 2014

RANDOM SAMPLING
• Collection of many objects
• Pick a small subset of them

GOALS OF SAMPLING
• Estimate quantities
• Approximate higher dimensional objects
• Use in algorithms

SAMPLE TO APPROXIMATE
• ε-nets / cuttings
• Sketches
• Graphs
• Gradients
This talk: matrices

NUMERICAL LINEAR ALGEBRA
• Linear system in an n x n matrix
• Inverse is dense
• [Concus-Golub-O'Leary `76]: incomplete Cholesky, drop entries

HOW TO ANALYZE?
• Show that the sample is good
• Concentration bounds
• Scalar: [Bernstein `24] [Chernoff `52]
• Matrices: [AW `02] [RV `07] [Tropp `12]

THIS TALK
• Directly show that the algorithm using the samples runs well
• Better bounds
• Simpler analysis

OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments

HOW TO DROP ENTRIES?
• Entry-based representation is hard
• Group entries together
• Symmetric with positive entries: the adjacency matrix of a graph

SAMPLE WITH GUARANTEES
• Sample edges in graphs
• Goal: preserve the size of all cuts
• [BK `96]: graph sparsification
• Generalization of expanders

DROPPING ENTRIES/EDGES
• L: graph Laplacian
• For a 0-1 vector x: |x|_L^2 = size of the cut between the 0s and the 1s
Unit weight case: |x|_L^2 = Σ_uv (x_u – x_v)^2
Matrix norm: |x|_P^2 = x^T P x

DECOMPOSING A MATRIX
L = Σ_uv L_uv, where edge uv contributes the 2 x 2 block [1 -1; -1 1] on rows/columns u, v, so x^T L x = Σ_uv (x_u – x_v)^2
• Sample based on positive representations
• P = Σ_i P_i, with each P_i P.S.D.
• Graphs: one P_i per edge
• P.S.D.: the multi-variate version of positive

MATRIX CHERNOFF BOUNDS
P = Σ_i P_i, with each P_i P.S.D.
Can sample Q with O(n log n ε^-2) rescaled P_i's s.t. P ≼ Q ≼ (1 + ε) P
≼ is the Loewner partial ordering: A ≼ B iff B – A is positive semidefinite
(A code sketch of this sampling step follows the outline at the end of this part.)

CAN WE DO BETTER?
• Yes, [BSS `12]: O(n ε^-2) is possible
• Iterative, cubic time construction
• [BDM `11]: extends to general matrices

DIRECT APPLICATION
Find Q very close to P, solve the problem on Q, return the answer
• For ε accuracy, need P ≼ Q ≼ (1 + ε) P
• The size of Q depends inversely on ε
• ε^-1 is the best that we can hope for

USE INSIDE ITERATIVE METHODS
Find Q somewhat similar to P, solve the problem on P using Q as a guide
• [AB `11]: crude samples give good answers
• [LMP `12]: extensions to row sampling

ALGORITHMIC VIEW
• Crude approximations are ok
• But they need to be efficient
• Can we use [BSS `12]?

SPEED UP [BSS `12]
• Expander graphs, and more
• An 'i.i.d. sampling' variant is related to the Kadison-Singer problem

MOTIVATION
• One dimensional sampling: moment estimation, pseudorandom generators
• Rarely need w.h.p.
• Dimensions should be disjoint

MOTIVATION
• Randomized coordinate descent for electrical flows [KOSZ `13, LS `13]
• ACDM from [LS `13] improves various numerical routines

RANDOMIZED COORDINATE DESCENT
• Related to stochastic optimization
• Known analyses when Q = P_j
• [KOSZ `13] [LS `13] can be viewed as ways of changing bases

OUR RESULT
For numerical routines, a random Q gives the same performance as [BSS `12], in expectation

IMPLICATIONS
• Similar bounds to ACDM from [LS `13]
• Recursive Chebyshev iteration ([KMP `11]) runs faster
• Laplacian solvers in ~ m log^{1/2} n time

OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments
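The sketch below (not from the talk; the function names, graph size, and constants are my own illustrative choices) shows the two ingredients above in numpy: a graph Laplacian written as P = Σ_i P_i with one rank-1 P_i per edge, and sampling O(n log n ε^-2) rescaled P_i's with probability proportional to trace(P_i P^-1). Since a Laplacian is singular, the pseudo-inverse stands in for P^-1.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random graph; each edge uv contributes the rank-1 PSD matrix
#   P_uv = (e_u - e_v)(e_u - e_v)^T,  so  x^T P x = sum_uv (x_u - x_v)^2.
n = 50
edges = [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < 0.2]

def edge_matrix(u, v):
    x = np.zeros(n)
    x[u], x[v] = 1.0, -1.0
    return np.outer(x, x)

Pis = [edge_matrix(u, v) for u, v in edges]
P = sum(Pis)                                   # the graph Laplacian, P = sum_i P_i

# Sampling probabilities ~ trace(P_i P^-1); P is singular, so use the pseudo-inverse.
Pinv = np.linalg.pinv(P)
tau = np.array([np.trace(Pi @ Pinv) for Pi in Pis])   # these sum to rank(P) = n - 1
prob = tau / tau.sum()

# Take O(n log n eps^-2) rescaled samples; rescaling by 1/(k * prob) keeps E[Q] = P.
eps = 0.5
k = int(4 * n * np.log(n) / eps**2)
picks = rng.choice(len(Pis), size=k, p=prob)
Q = sum(Pis[i] / (k * prob[i]) for i in picks)

# Check the approximation: the nonzero eigenvalues of P^{+1/2} Q P^{+1/2}
# typically land in roughly [1 - eps, 1 + eps].
w, V = np.linalg.eigh(P)
keep = w > 1e-8
P_half_inv = V[:, keep] @ np.diag(w[keep] ** -0.5) @ V[:, keep].T
rel = np.linalg.eigvalsh(P_half_inv @ Q @ P_half_inv)
rel = rel[rel > 1e-8]
print("relative eigenvalues lie in", (rel.min().round(3), rel.max().round(3)))
```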
ITERATIVE METHODS
Find Q s.t. P ≼ Q ≼ 10 P
Use Q as a guide to solve the problem on P
• [Gauss, 1823] Gauss-Seidel iteration
• [Jacobi, 1845] Jacobi iteration
• [Hestenes-Stiefel `52] conjugate gradient

[RICHARDSON, 1910]
x(t + 1) = x(t) + (b – P x(t))
• Fixed point: b – P x(t) = 0
• Each step: one matrix-vector multiplication

ITERATIVE METHODS
• Multiplication is easier than division, especially for matrices
• Use a verifier to solve the problem

1D CASE
Know 1/2 ≤ p ≤ 1, so 1 ≤ 1/p ≤ 2
• 1 is a 'good' estimate of 1/p
• Bad when p is far from 1
• Estimate of the error: 1 – p

ITERATIVE METHODS
• 1 + (1 – p) = 2 – p is more accurate
• Two terms of the Taylor expansion
• Can take more terms

ITERATIVE METHODS
1/p = 1 + (1 – p) + (1 – p)^2 + (1 – p)^3 + …
Generalizes to the matrix setting:
P^-1 = I + (I – P) + (I – P)^2 + …

[RICHARDSON, 1910]
x(0) = I b
x(1) = (I + (I – P)) b
x(2) = (I + (I – P)(I + (I – P))) b
…
x(t + 1) = b + (I – P) x(t)
• Error of x(t): (I – P)^t b
• Geometric decrease if P is close to I

OPTIMIZATION VIEW
Residue: r(t) = x(t) – P^-1 b
Error: |r(t)|_2^2
• A quadratic potential function
• Goal: walk down to the bottom
• Direction given by the gradient

DESCENT STEPS
• A step may overshoot the minimum
• Need a smooth function

MEASURE OF SMOOTHNESS
x(t + 1) = b + (I – P) x(t)
Note b = P P^-1 b, so r(t + 1) = (I – P) r(t)
|r(t + 1)|_2 ≤ |I – P|_2 |r(t)|_2

MEASURE OF SMOOTHNESS
• |I – P|_2: smoothness of |r(t)|_2^2
• Distance between P and I
• Related to the eigenvalues of P
• 1/2 I ≼ P ≼ I implies |I – P|_2 ≤ 1/2

MORE GENERAL
• Convex functions
• Smoothness / strong convexity
This talk: only quadratics

OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments

ILL POSED PROBLEMS
Example: P = [.8 0; 0 .1]
• The smoothness of different directions differs
• Progress is limited by the steeper parts

PRECONDITIONING
• Solve a similar problem Q
• Transfer steps across

PRECONDITIONED RICHARDSON
• The optimal step down the energy function of Q is given by Q^-1
• Equivalent to solving Q^-1 P x = Q^-1 b

PRECONDITIONED RICHARDSON
x(t + 1) = Q^-1 b + (I – Q^-1 P) x(t)
Residue: r(t + 1) = (I – Q^-1 P) r(t)
|r(t + 1)|_P = |(I – Q^-1 P) r(t)|_P

CONVERGENCE
The improvement depends on |I – P^{1/2} Q^-1 P^{1/2}|_2
• If P ≼ Q ≼ 10 P, the error halves in O(1) iterations (a code sketch of this iteration follows the MATRIX CHERNOFF slide below)
• How do we find a good Q?

MATRIX CHERNOFF
P = Σ_i P_i, Q = Σ_i s_i P_i, where s has small support
• Take O(n log n) (rescaled) P_i's, each with probability ~ trace(P_i P^-1)
• Matrix Chernoff ([AW `02], [RV `07]): w.h.p. P ≼ Q ≼ 2 P
Note: Σ_i trace(P_i P^-1) = n

WHY THESE PROBABILITIES?
• trace(P_i P^-1) is a matrix 'dot product'
• If P is diagonal, e.g. P = [.8 0; 0 .1] split into its diagonal entries, then trace(P_i P^-1) = 1 for all i
• Need all entries
• Overhead of concentration: a union bound over the dimensions

IS CHERNOFF NECESSARY?
• P: a diagonal matrix (e.g. the identity)
• Missing one entry gives an unbounded approximation factor

BETTER CONVERGENCE?
• [Kaczmarz `37]: random projections onto small subspaces can work
• Better (expected) behavior than what matrix concentration gives!

HOW?
• Q can be very different from P in some directions
• Will still make progress in the good directions
• Can have (finite) badness if it is orthogonal to the goal

QUANTIFY DEGENERACIES
• Have some D ≼ P 'for free'
• D = λmin(P) I (minimum eigenvalue)
• D = a spanning tree when P is a graph
• D = a crude approximation / rank certificate

REMOVING DEGENERACIES
• 'Padding' to remove degeneracies
• If D ≼ P and 0.5 P ≼ Q ≼ P, then 0.5 P ≼ D + Q ≼ 2 P

ROLE OF D
• Implicit in proofs of matrix Chernoff bounds, as well as in [BSS `12]
• A splitting of P in numerical analysis
• D and P can be very different

MATRIX CHERNOFF
• Let D ≼ 0.1 P and t = trace(P D^-1)
• Take O(t log n) samples, each with probability ~ trace(P_i D^-1)
• Q = D + the (rescaled) samples
• W.h.p. P ≼ Q ≼ 2 P
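Before weakening this requirement, here is a small numpy sketch of the preconditioned Richardson iteration described above. The construction of Q (an explicit PSD perturbation forced to satisfy P ≼ Q ≼ 10 P) is my own stand-in for the sampled preconditioner, and the matrix sizes and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# A random SPD matrix P standing in for the system we want to solve.
A = rng.standard_normal((n, 2 * n))
P = A @ A.T / (2 * n)

# Build a crude preconditioner with P ≼ Q ≼ 10 P:  Q = P + S, with S PSD and S ≼ 9 P.
B = rng.standard_normal((n, n))
S = B @ B.T
w, V = np.linalg.eigh(P)
P_half_inv = V @ np.diag(w ** -0.5) @ V.T
S *= 9.0 / np.linalg.eigvalsh(P_half_inv @ S @ P_half_inv).max()
Q = P + S

# Preconditioned Richardson:  x <- x + Q^-1 (b - P x),
# i.e. Richardson applied to Q^-1 P x = Q^-1 b.
b = rng.standard_normal(n)
x_star = np.linalg.solve(P, b)

def P_norm(r):
    return np.sqrt(r @ P @ r)     # |r|_P

Qinv = np.linalg.inv(Q)           # in practice: a fast solver for Q, never an explicit inverse
x = np.zeros(n)
for t in range(20):
    x = x + Qinv @ (b - P @ x)
    print(t, P_norm(x - x_star) / P_norm(x_star))

# Since P ≼ Q ≼ 10 P, the P-norm error contracts by |I - P^{1/2} Q^-1 P^{1/2}|_2 ≤ 0.9
# per step, so it halves every O(1) iterations, matching the CONVERGENCE slide.
```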
WEAKER REQUIREMENT
Q only needs to do well in some directions, on average

EXPECTED CONVERGENCE
• Let t = trace(P D^-1)
• Take rand[t, 2t] samples, each with probability ~ trace(P_i D^-1)
• Add the (rescaled) results to D to form Q
There exists a constant c s.t. for any r: E[ |(I – c Q^-1 P) r|_P ] ≤ 0.99 |r|_P

OUTLINE
• Random matrices
• Iterative methods
• Randomized preconditioning
• Expected inverse moments

ASIDE
Matrix Chernoff:
• f(Q) = exp(P^{-1/2}(P – Q)P^{-1/2})
• show a decrease in the relative eigenvalues
Iterative methods:
• f(x) = |x – P^-1 b|_P
• show a decrease in the distance to the solution
Goal: combine these analyses

SIMPLIFYING ASSUMPTIONS
• P = I (by normalization)
• tr(P_i D^-1) = 0.1 for each i ('unit weight')
• The expected value of picking a P_i at random: (1/t) I

DECREASE
Step: r' = (I – Q^-1 P) r = (I – Q^-1) r
New error: |r'|_P = |(I – Q^-1) r|_2
Expand: |r'|_2^2 – |r|_2^2 = –2 |r|_{Q^-1}^2 + |r|_{Q^-2}^2

DECREASE
Want: –2 |r|_{Q^-1}^2 + |r|_{Q^-2}^2 ≤ –0.1 |r|_2^2
• I ≼ Q ≼ 1.1 I would imply 0.9 I ≼ Q^-1 and Q^-2 ≼ I
• But it also implies Q^-3 ≼ I, etc.
• Don't need the 3rd moment

2 RELAXATIONS
E_Q[ –2 |r|_{Q^-1}^2 + |r|_{Q^-2}^2 ]
• Only need Q^-1 and Q^-2
• By linearity, it suffices to:
• lower bound E_Q[Q^-1]
• upper bound E_Q[Q^-2]

TECHNICAL RESULT
Assumptions: Σ_i P_i = I and trace(P_i D^-1) = 0.1
• Let t = trace(D^-1)
• Take rand[t, 2t] uniform samples
• Add the (rescaled) results to D to form Q
Then 0.9 I ≼ E[Q^-1] and E[Q^-2] ≼ O(1) I

Q^-1
• 0.5 I ≼ E[Q^-1] follows from the matrix arithmetic-harmonic mean inequality ([ST `94])
• Still need: an upper bound on E[Q^-2]

E[Q^-2] ≼ O(1) I?
• Q^-2 is the gradient of Q^-1
• More careful tracking of Q^-1 gives information about Q^-2 as well!

TRACKING Q^-1
• Q: start from D, add rand[t, 2t] random (rescaled) P_i's
• Track the inverse of Q under rank-1 perturbations
Sherman-Morrison formula (M rank one): (A + M)^-1 = A^-1 – A^-1 M A^-1 / (1 + tr(A^-1 M))
(A numerical check of this formula appears after the closing slide.)

BOUNDING Q^-1: DENOMINATOR
Current matrix: Q_j; sample: R
Q_{j+1}^-1 = Q_j^-1 – Q_j^-1 R Q_j^-1 / (1 + tr(Q_j^-1 R))
• D ≼ Q_j, so Q_j^-1 ≼ D^-1
• tr(Q_j^-1 R) ≤ tr(D^-1 R) ≤ 0.1 for any sample R
So, taking expectation over R: E_R[Q_{j+1}^-1] ≼ Q_j^-1 – 0.9 Q_j^-1 E[R] Q_j^-1

BOUNDING Q^-1: NUMERATOR
E_R[Q_{j+1}^-1] ≼ Q_j^-1 – 0.9 Q_j^-1 E[R] Q_j^-1
• R: the random (rescaled) P_i sampled
• By the assumptions above, E[R] = (1/t) I
So: E_R[Q_{j+1}^-1] ≼ Q_j^-1 – (0.9/t) Q_j^-2

AGGREGATION
E_R[Q_{j+1}^-1] ≼ Q_j^-1 – (0.9/t) Q_j^-2
D = Q_0, then Q_1, Q_2, …
• Q_j is itself random
• Need to aggregate the choices of R into a bound on E[Q_j^-1]

HARMONIC SUMS
HrmSum(X, a) = 1/(1/X + 1/a)
• Use harmonic sums of matrices
• Matrix functionals
• Similar to the Stieltjes transform in [BSS `12]
• A proxy for the –2nd power
• Well behaved under expectation: E_X[HrmSum(X, a)] ≤ HrmSum(E[X], a)

HARMONIC SUM
E_R[Q_{j+1}^-1] ≼ Q_j^-1 – (0.9/t) Q_j^-2
For a unit vector u: |u|_{M^-1}^2 – (1/t) |u|_{M^-2}^2 ≤ HrmSum(|u|_{M^-1}^2, t)
So: E[ |u|_{Q_{j+1}^-1}^2 ] ≤ 0.1 |u|_{Q_j^-1}^2 + 0.9 HrmSum(|u|_{Q_j^-1}^2, t)
Initial condition + a telescoping sum gives E[Q_t^-1] ≼ O(1) I

E[Q^-2] ≼ O(1) I
• Q^-2 is the gradient of Q^-1: (0.9/t) Q_j^-2 ≼ Q_j^-1 – E_R[Q_{j+1}^-1]
• Summing: (0.9/t) Σ_{j=t}^{2t-1} E[Q_j^-2] ≼ E[Q_t^-1] – E[Q_{2t}^-1]
• A random j from [t, 2t] is good!

SUMMARY
Un-normalize:
• 0.5 P ≼ E[P Q^-1 P]
• E[P Q^-1 P Q^-1 P] ≼ 5 P
One step of preconditioned Richardson:
|(I – 0.1 Q^-1 P) r|_P^2 = |r|_P^2 – 0.2 |r|_{PQ^-1P}^2 + 0.01 |r|_{PQ^-1PQ^-1P}^2
Taking expectations and using the two bounds above:
E[ |(I – 0.1 Q^-1 P) r|_P^2 ] ≤ |r|_P^2 – 0.1 |r|_P^2 + 0.05 |r|_P^2 = 0.95 |r|_P^2

MORE GENERAL
• Works for some convex functions
• Sherman-Morrison replaced by an inequality, primal/dual

FUTURE WORK
• Expected convergence of Chebyshev iteration? Of conjugate gradient?
• The same bound without D (using the pseudo-inverse)?
• Small error settings
• Stochastic optimization?
• More moments?

THANK YOU!
Questions?
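Appendix (my own illustration, not a slide from the talk): a quick numpy check of the trace form of the Sherman-Morrison formula used on the TRACKING Q^-1 slide, for a rank-1 PSD update. The matrices and dimensions below are arbitrary test data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

# A plays the role of Q_j (SPD); M = v v^T is a rank-1 PSD sample (a rescaled P_i).
G = rng.standard_normal((n, n))
A = G @ G.T + np.eye(n)
v = rng.standard_normal(n)
M = np.outer(v, v)

Ainv = np.linalg.inv(A)

# Trace form of Sherman-Morrison for a rank-1 update, as on the TRACKING Q^-1 slide:
#   (A + M)^-1 = A^-1 - A^-1 M A^-1 / (1 + tr(A^-1 M))
updated = Ainv - Ainv @ M @ Ainv / (1.0 + np.trace(Ainv @ M))

print(np.allclose(updated, np.linalg.inv(A + M)))   # True

# The denominator is what the DENOMINATOR slide controls: if tr(Q_j^-1 R) ≤ 0.1,
# the subtracted term shrinks by at most a factor 1/1.1, which is where the 0.9 comes from.
```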