FAST REGRESSION ALGORITHMS USING SPECTRAL GRAPH THEORY
Richard Peng

OUTLINE
• Regression: why and how
• Spectra: fast solvers
• Graphs: tree embeddings

LEARNING / INFERENCE
• Input: signal s
• Output: find (hidden) pattern in (noisy) data

REGRESSION
Minimize: |x|_p
Subject to: constraints on x
• p ≥ 1: convex
• Convex constraints, e.g. linear equalities

APPLICATION 0: LASSO
[Tibshirani `96]: min |x|_1 s.t. Ax = s
Widely used in practice:
• Structured output
• Robust to noise

APPLICATION 1: IMAGES
Min Σ_{i~j∈E} (x_i − x_j − s_{i~j})^2
Poisson image processing
No bears were harmed in the making of these slides

APPLICATION 2: MIN CUT
Min Σ_{ij∈E} |x_i − x_j| s.t. x_s = 0, x_t = 1
Fractional solution = integral solution
Remove fewest edges to separate vertices s and t

REGRESSION ALGORITHMS
Convex optimization:
• 1940–1960: simplex, tractable
• 1960–1980: ellipsoid, poly time
• 1980–2000: interior point, efficient — Õ(m^(1/2)) interior steps
• m = # non-zeros
• Õ hides log factors

EFFICIENCY MATTERS
• m > 10^6 for most images
• Even bigger (10^9): videos, 3D medical data

KEY SUBROUTINE
Each of the Õ(m^(1/2)) steps of an interior point algorithm finds a step direction: a linear system solve

MORE REASONS FOR FAST SOLVERS
[Boyd-Vandenberghe `04], Figure 11.20: "The growth in the average number of Newton iterations (on randomly generated SDPs) … is very small"

LINEAR SYSTEM SOLVERS
• [1st century CE] Gaussian elimination: O(m^3)
• [Strassen `69]: O(m^2.8)
• [Coppersmith-Winograd `90]: O(m^2.3755)
• [Stothers `10]: O(m^2.3737)
• [Vassilevska Williams `11]: O(m^2.3727)
Total: > m^2

NOT FAST, NOT USED
• Preferred in practice: coordinate descent, subgradient methods
• Solution quality traded for time

FAST GRAPH-BASED L2 REGRESSION [SPIELMAN-TENG `04]
Input: linear system Ax = b where A is related to a graph
Output: solution to Ax = b
Runtime: nearly linear, Õ(m)

GRAPHS USING ALGEBRA
Fast convergence + low cost per step = state-of-the-art algorithms

LAPLACIAN PARADIGM
[Daitch-Spielman `08]: mincost flow
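The Poisson image-processing objective above reduces to a linear system in the graph Laplacian: setting the gradient of Σ (x_j − x_i − s_ij)^2 to zero gives Lx = d, where d accumulates the target differences at each vertex. A minimal pure-Python sketch on a made-up 3-vertex graph (the edges and s-values are illustrative, not from the slides):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Toy graph: directed edges (i, j) with target difference s_ij for x_j - x_i.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.3)]  # cycle inconsistent on purpose
n = 3
L = [[0.0] * n for _ in range(n)]  # Laplacian of the unit-weight graph
d = [0.0] * n                      # "divergence" of the target differences
for i, j, s in edges:
    L[i][i] += 1; L[j][j] += 1; L[i][j] -= 1; L[j][i] -= 1
    d[j] += s; d[i] -= s

# L is singular (constant vectors are in its null space): pin x_0 = 0 and
# solve the reduced system over the remaining vertices.
x = [0.0] + solve([row[1:] for row in L[1:]], d[1:])
print(x)  # least-squares fit: the 0.1 cycle inconsistency is spread evenly
```

The inconsistency around the cycle (1 + 1 vs 2.3) cannot be fit exactly, so the least-squares solution spreads a residual of 0.1 over each edge.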
[Christiano-Kelner-Mądry-Spielman-Teng `11]: approx maximum flow / min cut

EXTENSION 1
[Chin-Mądry-Miller-P `12]: regression, image processing, grouped L2

EXTENSION 2
[Kelner-Miller-P `12]: k-commodity flow
Dual: k-variate labeling of graphs

EXTENSION 3
[Miller-P `13]: faster for structured images / separable graphs

NEED: FAST LINEAR SYSTEM SOLVERS
Implications of fast solvers:
• Fast regression routines
• Parallel, work-efficient graph algorithms

OTHER APPLICATIONS
• [Tutte `66]: planar embedding
• [Boman-Hendrickson-Vavasis `04]: PDEs
• [Orecchia-Sachdeva-Vishnoi `12]: balanced cut / graph separator

OUTLINE
• Regression: why and how
• Spectra: linear system solvers
• Graphs: tree embeddings

PROBLEM
Given: matrix A, vector b
Size of A:
• n-by-n
• m non-zeros

SPECIAL STRUCTURE OF A
A = Deg − Adj
• Deg: diag(degree)
• Adj: adjacency matrix
A_ij = deg(i) if i = j, −w(ij) otherwise
[Gremban-Miller `96]: extensions to SDD matrices

UNSTRUCTURED GRAPHS
• Social networks
• Intermediate systems of other algorithms are almost adversarial

NEARLY LINEAR TIME SOLVERS [SPIELMAN-TENG `04]
Input: n-by-n graph Laplacian A with m non-zeros, vector b, where b = Ax for some x
Output: approximate solution x' s.t. |x − x'|_A < ε|x|_A
Runtime: nearly linear, O(m log^c n log(1/ε)) expected
• Runtime is cost per bit of accuracy
• Error in the A-norm: |y|_A = √(y^T A y)

HOW MANY LOGS
Runtime: O(m log^c n log(1/ε))
Value of c: I don't know
[Spielman]: c ≤ 70   [Miller]: c ≤ 32   [Koutis]: c ≤ 15   [Teng]: c ≤ 12   [Orecchia]: c ≤ 6
When n = 10^6, log^6 n > 10^6

PRACTICAL NEARLY LINEAR TIME SOLVERS [KOUTIS-MILLER-P `10]
Input: n-by-n graph Laplacian A with m non-zeros, vector b, where b = Ax for some x
Output: approximate solution x' s.t. |x − x'|_A < ε|x|_A
Runtime: O(m log^2 n log(1/ε))
• Runtime is cost per bit of accuracy
• Error in the A-norm: |y|_A = √(y^T A y)

PRACTICAL NEARLY LINEAR TIME SOLVERS [KOUTIS-MILLER-P `11]
Input: n-by-n graph Laplacian A with m non-zeros, vector b, where b = Ax for some x
Output: approximate solution x' s.t.
|x − x'|_A < ε|x|_A
Runtime: O(m log n log(1/ε))
• Runtime is cost per bit of accuracy
• Error in the A-norm: |y|_A = √(y^T A y)

STAGES OF THE SOLVER
• Iterative methods
• Spectral sparsifiers
• Low stretch spanning trees

ITERATIVE METHODS
Numerical analysis: can solve systems in A by iteratively solving a spectrally similar, but easier, B

WHAT IS SPECTRALLY SIMILAR?
A ≺ B ≺ kA for some small k
• Ideas from scalars hold!
• A ≺ B: for any vector x, |x|_A^2 < |x|_B^2
[Vaidya `91]: since A is a graph, B should be too!

'EASIER' H
Ways of being easier:
• Fewer vertices
• Fewer edges
Can reduce vertex count if edge count is small
Goal: H with fewer edges that's similar to G

GRAPH SPARSIFIERS
Sparse equivalents of graphs that preserve something:
• Spanners: distance, diameter
• Cut sparsifiers: all cuts
• What we need: spectrum

WHAT WE NEED: ULTRASPARSIFIERS
• Given: G with n vertices, m edges, parameter k
• Output: H with n vertices, n − 1 + O(m log^p n / k) edges
• Goal: G ≺ H ≺ kG
[Spielman-Teng `04]: ultrasparsifiers with n − 1 + O(m log^p n / k) edges imply solvers with O(m log^p n) running time

EXAMPLE: COMPLETE GRAPH
O(n log n) random edges (with scaling) suffice w.h.p.

GENERAL GRAPH SAMPLING MECHANISM
• For edge e, flip coin with Pr(keep) = P(e)
• Rescale to maintain expectation
Number of edges kept: Σ_e P(e)
Also need to prove concentration

EFFECTIVE RESISTANCE
• View the graph as a circuit
• R(u,v) = pass 1 unit of current from u to v, measure resistance of circuit

EE101
Effective resistance in general: solve Gx = e_uv, where e_uv is the indicator vector; R(u,v) = x_u − x_v

(REMEDIAL?) EE101
• Single edge: R(e) = 1/w(e)
• Series: R(u,v) = R(e_1) + … + R(e_l), e.g. two edges of weights w_1, w_2 in series give R(u,v) = 1/w_1 + 1/w_2

SPECTRAL SPARSIFICATION BY EFFECTIVE RESISTANCE
[Spielman-Srivastava `08]: setting P(e) to W(e)R(u,v)O(log n) gives G ≺ H ≺ 2G*
[Foster `49]: Σ_e W(e)R(e) = n − 1
Spectral sparsifier with O(n log n) edges
Ultrasparsifier? Solver???
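The series rule can be checked numerically exactly as the EE101 slide suggests: inject one unit of current at u, ground v, and solve the (reduced) Laplacian system. A pure-Python sketch on a 2-edge path with arbitrarily chosen weights:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Path u(0) -- m(1) -- v(2) with edge weights w1, w2.
w1, w2 = 2.0, 3.0
L = [[w1, -w1, 0.0],
     [-w1, w1 + w2, -w2],
     [0.0, -w2, w2]]

# Solve L x = e_uv with v grounded (x_v = 0): drop v's row and column.
rhs = [1.0, 0.0]                      # +1 unit of current in at u, out at v
reduced = [row[:2] for row in L[:2]]
xu, xm = solve(reduced, rhs)
R = xu - 0.0                          # R(u, v) = x_u - x_v
print(R)                              # series rule: 1/w1 + 1/w2 = 5/6
```

Grounding one vertex is what makes the otherwise singular Laplacian solvable; the answer matches R(e_1) + R(e_2).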
*Ignoring probabilistic issues

THE CHICKEN AND EGG PROBLEM
How to find effective resistances?
[Spielman-Srivastava `08]: use the solver
[Spielman-Teng `04]: need the sparsifier

OUR WORKAROUND
• Use upper bounds on effective resistance, R'(u,v)
• Modify the problem

RAYLEIGH'S MONOTONICITY LAW
R(u,v) can only increase when edges are removed
Calculate effective resistance w.r.t. a tree T

SAMPLING PROBABILITIES ACCORDING TO TREE
Sample probability: edge weight times effective resistance of tree path = stretch
Goal: small total stretch

GOOD TREES EXIST
Every graph has a spanning tree with total stretch O(m log n) (hiding loglog n)
Σ_e W(e)R'(e) = O(m log n)
O(m log^2 n) edges — too many!

'GOOD' TREE???
Unit weight case: stretch ≥ 1 for all edges (e.g. stretch = 1 + 1 = 2 for an off-tree edge whose tree path has two edges)

WHAT ARE WE MISSING?
Haven't used k!
• Need: G ≺ H ≺ kG, n − 1 + O(m log^p n / k) edges
• Generated: G ≺ H ≺ 2G, n − 1 + O(m log^2 n) edges

USE K, SOMEHOW
G ≺ G' ≺ kG
• Tree is good!
• Increase weights of tree edges by a factor of k

RESULT
Stretch = 1/k + 1/k = 2/k
• Tree heavier by factor of k
• Tree effective resistances decrease by factor of k

NOW SAMPLE?
Expected edges in H:
• Tree edges: n − 1
• Off-tree edges: O(m log^2 n / k)
Total: n − 1 + O(m log^2 n / k)

BUT WE CHANGED G!
G ≺ G' ≺ kG and G' ≺ H ≺ 2G' give G ≺ H ≺ 2kG

WHAT WE NEED: ULTRASPARSIFIERS
• Given: G with n vertices, m edges, parameter k
• Output: H with n vertices, n − 1 + O(m log^p n / k) edges
• Goal: G ≺ H ≺ kG
Obtained: G ≺ H ≺ 2kG with n − 1 + O(m log^2 n / k) edges
[Spielman-Teng `04]: ultrasparsifiers with n − 1 + O(m log^p n / k) edges imply solvers with O(m log^p n) running time

PSEUDOCODE OF O(M LOG N) SOLVER
• Input: graph Laplacian G
• Compute low stretch tree T of G
• T ← O(log^2 n) · T
• H ← G + T
• H ← Sample_T(H)
• Solve G by iterating on H and solving recursively, but reuse T

EXTENSIONS / GENERALIZATIONS
• [Koutis-Levin-P `12]: sparsify mildly dense graphs in O(m) time
• [Miller-P `12]: general matrices: find a 'simpler' matrix that's similar in O(m + n^(2.38+α)) time
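The scale-the-tree-then-sample step can be sketched on a toy graph. The sketch below assumes the spanning tree is a simple path (so tree paths are easy to enumerate) and replaces the O(log n) oversampling factor with 1 for readability; the graph, weights, and the value of k are made up for illustration:

```python
import random

# Toy graph on 4 vertices: tree = path 0-1-2-3, plus two off-tree edges.
tree = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
off_tree = [(0, 3, 1.0), (1, 3, 1.0)]
k = 4.0  # tree-scaling factor (illustrative value)

def tree_resistance(a, b, tree_edges):
    """Resistance of the tree path between a and b (sum of 1/w).
    Assumes the tree is the path 0-1-2-..., so the path between a < b
    is exactly the tree edges lying between them."""
    return sum(1.0 / w for i, j, w in tree_edges if a <= i and j <= b)

# Stretch of an off-tree edge: its weight times the tree-path resistance.
stretch = {(u, v): w * tree_resistance(u, v, tree) for u, v, w in off_tree}

# After scaling the tree by k, tree-path resistances drop by a factor of k,
# so each off-tree edge's sampling probability drops to stretch/k.
prob = {e: min(1.0, s / k) for e, s in stretch.items()}

# Sample: keep all tree edges (scaled by k); keep off-tree edge e with
# probability prob[e], rescaled by 1/prob[e] to preserve expectation.
random.seed(0)
H = [(i, j, k * w) for i, j, w in tree]
H += [(u, v, w / prob[(u, v)]) for u, v, w in off_tree
      if random.random() < prob[(u, v)]]
```

The expected number of off-tree edges kept is Σ_e stretch(e)/k, which is exactly the (total stretch)/k bound the slides use.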
SUMMARY OF SOLVERS
• Spectral graph theory allows one to find similar, easier-to-solve graphs
• Backbone: good trees

SOLVERS USING GRAPH THEORY
Fast solvers for graph Laplacians use combinatorial graph theory

OUTLINE
• Regression: why and how
• Spectra: linear system solvers
• Graphs: tree embeddings

LOW STRETCH SPANNING TREES
Sampling probability: edge weight times effective resistance of tree path
Unit weight case: length of tree path
Low stretch spanning tree: small total stretch

DIFFERENT THAN USUAL TREES
n^(1/2)-by-n^(1/2) unit-weighted mesh: the 'haircomb' is both a shortest path and a max weight spanning tree
Some edges have stretch O(1), but others have stretch Ω(n^(1/2))
Total stretch = Ω(n^(3/2))

A BETTER TREE FOR THE GRID
Recursive 'C'

LOW STRETCH SPANNING TREES
[Elkin-Emek-Spielman-Teng `05], [Abraham-Bartal-Neiman `08]: any graph has a spanning tree with total stretch O(m log n) (hiding loglog n)

ISSUE: RUNNING TIME
Algorithms given by [Elkin-Emek-Spielman-Teng `05], [Abraham-Bartal-Neiman `08] take O(n log^2 n + m log n) time
Reason: O(log n) shortest path computations

SPEED UP
[Orlin-Madduri-Subramani-Williamson `10]: shortest path on graphs with k distinct weights can run in O(m log_{m/n} k) time
[Koutis-Miller-P `11]:
• Round edge weights to powers of 2
• k = log n, total work = O(m log n) (hiding loglog n; we actually improve these)

PARALLEL ALGORITHM?
• [Blelloch-Gupta-Koutis-Miller-P-Tangwongsan `11]: current framework parallelizes to O(m^(1/3+α)) depth
• Combine with Laplacian paradigm: fast parallel graph algorithms

PARALLEL GRAPH ALGORITHMS?
• Before this work: parallel time > state-of-the-art sequential time
• Our result: parallel work close to sequential, and O(m^(2/3)) time

FUNDAMENTAL PROBLEM
Long standing open problem: theoretical speedups for BFS / shortest path in directed graphs
Sequential algorithms are too fast!

PARALLEL ALGORITHM?
First step of framework by [Elkin-Emek-Spielman-Teng `05]: shortest path

PARALLEL TREE EMBEDDING
• Workaround: use earlier algorithm by [Alon-Karp-Peleg-West `95]
• Idea: repeated clustering
• Based on ideas from [Cohen `93, `00] for approximating shortest path

THE BIG PICTURE
• Need fast linear system solvers for graph regression
• Need combinatorial graph algorithms for fast solvers

ONGOING / FUTURE WORK
• Better regression?
• Faster / parallel solver?
• Sparse approximate (pseudo)inverse?
• Other types of systems?

THANK YOU!
Questions?