Multifaceted Algorithm Design
Richard Peng
M.I.T.

LARGE SCALE PROBLEMS
Physical simulation, network analysis, optimization
Emphasis on efficient algorithms in:
• Scientific computing
• Graph theory
• (Randomized) numerical routines

WELL STUDIED QUESTIONS
Scientific computing: fast solvers for structured linear systems
Graphs / combinatorics: network flow problems
Randomized algorithms: subsampling matrices and optimization formulations
[Figure: matrix B sampled down to B’]

MY REPRESENTATIVE RESULTS
Current fastest sequential and parallel solvers for linear systems in graph Laplacian matrices
First nearly-linear time algorithm for approximate undirected maxflow
First near-optimal routine for row sampling matrices in a 1-norm preserving manner
[Figure: matrix B sampled down to B’]

RECURRING IDEAS
Can solve a problem by iteratively solving several similar instances
Larger problems can be approximated by smaller ones
Approximations lead to better approximations
[Figure labels: Data, Approximator]

MY APPROACH TO ALGORITHM DESIGN
Identify problems that arise at the intersection of multiple areas and study them from multiple angles
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
Problems at their intersection
This talk: structure-preserving sampling

SAMPLING
Classical use in statistics:
• Extract info from a large data set
• Directly output result (estimator)
Sampling from matrices, networks, and optimization problems:
• Often compute on the sample
• Need to preserve more structure
[Figure: matrix B sampled down to B’]

PRESERVING GRAPH STRUCTURES
Undirected graph, n vertices, $m < n^2$ edges
Are $n^2$ edges (dense) sometimes necessary?
For some information, e.g.
connectivity: encoded by a spanning forest, < n edges
Deterministic, O(m) time algorithm

MORE INTRICATE STRUCTURES
k-connectivity: # of disjoint paths between s and t (Menger’s theorem / maxflow-mincut)
Stronger: weights of all $2^n$ cuts in the graph
Cut: # of edges leaving a subset of vertices
[Benczur-Karger `96]: for ANY G, can sample to get H with $O(n \log n)$ edges s.t. G ≈ H on all cuts
≈: multiplicative approximation

HOW TO SAMPLE?
Widely used: uniform sampling
Works well when data is uniform, e.g. complete graph
Problem: long path, removing any edge changes connectivity
(can also have both in one graph)
More systematic view of sampling?

ALGEBRAIC REPRESENTATION OF GRAPHS
Graph Laplacian matrix L (n vertices, m edges):
• Diagonal: degree
• Off-diagonal: -edge weights
n rows / columns, O(m) non-zeros
Example (one vertex joined to two others by edges of weight 1):
$L = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix}$
Edge-vertex incidence matrix B (m rows, n columns): $B_{eu} = \pm 1$ if u is an endpoint of e (opposite signs at the two endpoints), 0 otherwise
$B = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 0 & 1 \end{pmatrix}$
L is the Gram matrix of B: $L = B^T B$

SPECTRAL SIMILARITY
Numerical analysis: $L_G \approx L_H$ if $x^T L_G x \approx x^T L_H x$ for all vectors x
Gram matrix: $L_G = B_G^T B_G$, so $x^T L_G x = \|B_G x\|_2^2$, where $\|y\|_2^2 = \sum_i y_i^2$
Equivalently: $\|B_G x\|_2 \approx \|B_H x\|_2 \ \forall x$
For edge e = uv: $(B_{e:} x)^2 = (x_u - x_v)^2$
For $x \in \{0, 1\}^V$: $\|B_G x\|_2^2$ = size of the cut given by x, so G ≈ H on all cuts
Example: with $x_u = 1$, $x_v = 0$, $x_z = 1$, an edge uv contributes $(1 - 0)^2 = 1$ and an edge uz contributes $(1 - 1)^2 = 0$

ALGEBRAIC VIEW OF SAMPLING EDGES
L2 row sampling: given B with m >> n, sample a few rows to form B’ s.t. $\|Bx\|_2 \approx \|B'x\|_2 \ \forall x$
[Figure: B has m rows and n columns, B’ has ≈ n rows; a row (0 -1 0 0 0 1 0) of B appears rescaled in B’ as (0 -5 0 0 0 5 0)]
Note: normally use A instead of B, n and d instead of m and n

IMPORTANCE SAMPLING
Keep a row $b_i$ with probability $p_i$, rescale if kept to maintain expectation
Uniform sampling: $p_i = 1/k$ for a factor k size reduction. Issue: only one non-zero row
Norm sampling: $p_i = \frac{m}{k} \|b_i\|_2^2 / \|B\|_F^2$. Issue: column with one entry

THE `RIGHT’ PROBABILITIES
τ: L2 statistical leverage scores
$\tau_i = b_i^T (B^T B)^{-1} b_i = \|b_i\|_{L^{-1}}^2$
$b_i$: row i of B, $L = B^T B$
Examples:
• Only one non-zero row: τ = 0, 0, 1, 0, 0
• Column with one entry: τ = n/m, n/m, n/m, n/m, 1
• Path + clique: 1/n on clique edges, 1 on path edges
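A minimal sketch of the leverage-score sampling from the IMPORTANCE SAMPLING and `RIGHT’ PROBABILITIES slides, in Python with NumPy. The path + clique test graph, the helper names (`incidence_matrix`, `leverage_scores`, `sample_rows`), and the oversampling constant are illustrative assumptions, not part of the talk. Because a graph Laplacian $L = B^T B$ is singular, the sketch substitutes the pseudoinverse for $(B^T B)^{-1}$; that substitution is also an assumption.

```python
import numpy as np

def incidence_matrix(n, edges):
    """Edge-vertex incidence matrix B: one row per edge, +1/-1 at its endpoints."""
    B = np.zeros((len(edges), n))
    for row, (u, v) in enumerate(edges):
        B[row, u], B[row, v] = 1.0, -1.0
    return B

def leverage_scores(B):
    """tau_i = b_i^T L^+ b_i with L = B^T B (pseudoinverse, since L is singular)."""
    L_pinv = np.linalg.pinv(B.T @ B)
    return np.einsum("ij,jk,ik->i", B, L_pinv, B)

def sample_rows(B, tau, oversample, rng):
    """Keep row i with probability p_i = min(1, oversample * tau_i) and rescale
    kept rows by 1/sqrt(p_i), so that E[B'^T B'] = B^T B."""
    p = np.minimum(1.0, oversample * tau)
    keep = rng.random(len(p)) < p
    return B[keep] / np.sqrt(p[keep])[:, None]

# Hypothetical test graph: a clique on vertices 0..19 with a path 19-20-...-29 attached.
k, path_len = 20, 10
clique = [(u, v) for u in range(k) for v in range(u + 1, k)]
path = [(v, v + 1) for v in range(k - 1, k - 1 + path_len)]
B = incidence_matrix(k + path_len, clique + path)

tau = leverage_scores(B)
rng = np.random.default_rng(0)
B_sampled = sample_rows(B, tau, oversample=5.0, rng=rng)

x = rng.standard_normal(B.shape[1])
print("rows kept:", B_sampled.shape[0], "of", B.shape[0])
print("||Bx||_2 =", np.linalg.norm(B @ x), " ||B'x||_2 =", np.linalg.norm(B_sampled @ x))
```

On this graph every path edge gets leverage score 1 and is always kept, each clique edge gets 2/k, and the scores sum to n - 1, the rank of B for a connected graph. The constant 5 follows the slide's remark about what suffices in practice; it is not a proven bound.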
L2 MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores, $\tau_i = b_i^T (B^T B)^{-1} b_i = \|b_i\|_{L^{-1}}^2$
[Rudelson, Vershynin `07], [Tropp `12]: sampling with $p_i \geq \tau_i \cdot O(\log n)$ gives B’ s.t. $\|Bx\|_2 \approx \|B'x\|_2 \ \forall x$ w.h.p.
[Foster `49]: $\sum_i \tau_i = \mathrm{rank} \leq n$, so B’ has $O(n \log n)$ rows
Near optimal:
• L2-row samples of B
• Graph sparsifiers
• In practice, 5 in place of $O(\log n)$ usually suffices
• Can also improve via derandomization

MY APPROACH TO ALGORITHM DESIGN
Extend insights gained from studying problems at the intersection of multiple areas back to these areas
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
Problems at their intersection
Algorithmic extensions of structure-preserving sampling:
Maximum flow
Solving linear systems
Preserving L1-structures

SUMMARY
• Algorithm design approach: study problems at the intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.

Lx = b
Graph Laplacian L:
• Diagonal: degree
• Off-diagonal: -weight
Directly related to:
• Elliptic problems
• SDD, M, and H-matrices
Application: estimate all $\tau_i = \|b_i\|_{L^{-1}}^2$ by solving $O(\log n)$ linear systems

Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
Solvers for linear systems involving graph Laplacians
Current fastest sequential and parallel solvers for linear systems in graph Laplacians

ALGORITHMS FOR Lx = b
Given any graph Laplacian L with n vertices and m edges and any vector b, find vector x s.t. Lx = b
[Vaidya `89]: use graph theory!
[Spielman-Teng `04]: $O(m \log^c n)$
[P-Spielman `14]: alternate, fully parallelizable approach
[Figure: loglog plot of c by year: 2004: 70, 2006: 32, 2009: 15, 2010: 6, 2010: 2, 2011: 1, 2014: 1/2]

ITERATIVE METHODS
Simplification: assume $L = I - A$, A: transition matrix of random walk
Division using multiplication:
$I + A + A^2 + A^3 + \dots$
$= (I - A)^{-1} = L^{-1}$
Spectral theorem: can view as scalars
Richardson iteration: truncate to i terms, approximate $x = (I - A)^{-1} b$ with $x^{(i)} = (I + A + \dots + A^i) b$

RICHARDSON ITERATION
Evaluation (Horner’s rule):
• $(I + A + A^2) b = A(Ab + b) + b$
• i terms: $x^{(0)} = b$, $x^{(i+1)} = A x^{(i)} + b$
i matrix-vector multiplications
Can interpret as gradient descent
Highly connected graphs: few terms ok
# terms needed lower bounded by information propagation: $b, Ab, A^2 b, \dots, A^{\mathrm{diameter}} b$
Need n matrix operations?

DEGREE N → N OPERATIONS?
Repeated squaring: $A^{16} = (((A^2)^2)^2)^2$, 4 operations
$(I - A)^{-1} = I + A + A^2 + A^3 + \dots = (I + A)(I + A^2)(I + A^4) \cdots$
• $O(\log n)$ terms ok
• Similar to multi-level methods
Dense matrix!
Combinatorial view:
• A: step of random walk
• $I - A^2$: Laplacian of the 2-step random walk
Still a graph Laplacian! Can sparsify!

REPEATED SPARSE SQUARING
$(I - A)^{-1} = (I + A)(I + A^2)(I + A^4) \cdots$
Combining known tools: efficiently sparsify $I - A^2$ without computing $A^2$
[P-Spielman `14]: approximate $L^{-1}$ with $O(\log n)$ sparse matrices
Key ideas: modify factorization to allow gradual introduction and control of error

SUMMARY
• Algorithm design approach: study problems at the intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.
• Solve Lx = b efficiently via sparsified squaring.

FEW ITERATIONS OF Lx = b
• [Tutte `61]: graph drawing, embeddings
• [ZGL `03], [ZHS `05]: inference on graphical models
• [CFMNPW `14]: Helmholtz decomp. on 3D mesh
Inverse powering: eigenvectors / heat kernel:
• [AM `85]: spectral clustering
• [OSV `12]: balanced cuts
• [SM `01][KMST `09]: image segmentation

MANY ITERATIONS OF Lx = b
[Karmarkar, Ye, Renegar, Nesterov, Nemirovski …]: convex optimization via
solving $O(m^{1/2})$ linear systems
[DS `08]: optimization on graphs → Laplacian systems
[CKMST `11]: faster approx maximum flow
[KMP `12]: multicommodity flow
[KM `09][MST `14]: random spanning trees

MAXFLOW
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
Maximum flow
First $O(m \, \mathrm{polylog}(n))$ time algorithm for approximate undirected maxflow

MAXIMUM FLOW PROBLEM
(for unweighted, undirected graphs)
Given s, t, find the maximum number of disjoint s-t paths
Applications:
• Clustering
• Image processing
• Scheduling
Dual: separate s and t by removing fewest edges

WHAT MAKES MAXFLOW HARD
Highly connected: route up to n paths
Long paths: a step may involve n vertices
Each ‘easy’ on their own
Goal: handle both and do better than many steps × long paths = $n^2$

ALGORITHMS FOR FLOWS
Ideas introduced:
1970s: blocking flows
1980: dynamic trees
1986: dual algorithms
1989: connections to Lx = b
2010: few calls to Lx = b
2013: modify Lx = b
Current fastest maxflow algorithms:
• Exact (weakly-polytime): invoke Lx = b
• Approximate: modify algorithms for Lx = b
[P `14]: $(1 - \varepsilon)$-approx maxflow in $O(m \log^c n \, \varepsilon^{-2})$ time

MAXIMUM FLOW IN ALMOST LINEAR TIME
Algebraic formulation of min s-t cut: minimize $\|Bx\|_1$ subject to $x_s = 0$, $x_t = 1$, and x integral
$\|\cdot\|_1$: 1-norm, sum of absolute values
[Sherman `13], [Kelner-Lee-Orecchia-Sidford `13]: can find approximate maxflow iteratively via several calls to a structure approximator
[Madry `10]: finding $O(m^{1+\theta})$ sized approximator that requires $O(m^{\theta})$ calls in $O(m^{1+\theta})$ time (for any θ > 0)
Resulting maxflow: $O(m^{1+2\theta} \varepsilon^{-2})$ time
[Racke-Shah-Taubig `14]: O(n) sized approximator that requires $O(\log^c n)$ iterations, via solving maxflows on graphs of total size $O(m \log^c n)$
Resulting maxflow: $O(m \log^c n \, \varepsilon^{-2})$ time?
[Figure labels: Maxflow, Approximator]

ALGORITHMIC SOLUTION
Ultra-sparsifier (e.g.
[Koutis-Miller-P `10]): for any k, can find H close to G, but equivalent to a graph of size O(m/k)
Key step: vertex reductions via edge reductions
[P `14]: build approximator on the smaller graph
Absorb additional (small) error via more calls to approximator
Recurse on instances with smaller total size, total cost: $O(m \log^c n)$
[CLMPPS `15]: extends to numerical data, has close connections to variants of Nystrom’s method

SUMMARY
• Algorithm design approach: study problems at the intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.
• Solve Lx = b efficiently via sparsified squaring.
• Approximate maximum flow routines and structure approximators can be constructed recursively from each other via graph sparsifiers.

RANDOMIZED NUMERICAL LINEAR ALGEBRA
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
L1-preserving row sampling
First near-optimal routine for row sampling matrices in a 1-norm preserving manner
[Figure: matrix B sampled down to B’]

GENERALIZATION
Generalization of row sampling: given A, q, find A’ s.t. $\|Ax\|_q \approx \|A'x\|_q \ \forall x$
q-norm: $\|y\|_q = (\sum_i |y_i|^q)^{1/q}$
1-norm: standard for representing cuts, used in sparse recovery / robust regression
[Figure labels: $\|y\|_2$, $\|y\|_1$]
Applications (for general A):
• Feature selection
• Low rank approximation / PCA

ROW SAMPLING ROUTINES
A’ s.t. $\|Ax\|_q \approx \|A'x\|_q \ \forall x$
nnz: # of non-zeros in A
How special is L2?

Reference                          #rows for q=2        #rows for q=1              Runtime
Dasgupta et al. `09                -                    $n^{2.5}$                  $mn^5$
Magdon-Ismail `10                  $n \log^2 n$         -                          $mn^2$
Sohler-Woodruff `11                -                    $n^{3.5}$                  $mn^{\omega-1+\theta}$
Drineas et al. `12                 $n \log n$           -                          $mn \log n$
Clarkson et al. `12                -                    $n^{4.5} \log^{1.5} n$     $mn \log n$
Clarkson-Woodruff `12              $n^2 \log n$         $n^8$                      nnz
Mahoney-Meng `12                   $n^2$                $n^{3.5}$                  nnz + $n^6$
Nelson-Nguyen `12                  $n^{1+\theta}$       -                          nnz
Li et al. `13, Cohen et al. `14    $n \log n$           $n^{3.66}$                 nnz + $n^{\omega+\theta}$

Omitting corresponding empirical studies
[Naor `11][Matousek `97]: on graphs, L2 approx ⇒ Lq approx ∀ 1 ≤ q ≤ 2
How special are graphs?
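To make the q-norm from the GENERALIZATION slide and its link to cuts concrete, here is a small sketch in Python with NumPy. The 4-cycle graph and the helper name `q_norm` are illustrative assumptions. For a 0/1 vector x, both $\|Bx\|_1$ and $\|Bx\|_2^2$ count the edges crossing the cut, but the two quantities differ once x is fractional, which is one way to see why preserving the 1-norm is a separate question from preserving the 2-norm.

```python
import numpy as np

def q_norm(y, q):
    """||y||_q = (sum_i |y_i|^q)^(1/q)."""
    return float(np.sum(np.abs(y) ** q) ** (1.0 / q))

# Hypothetical example graph: the 4-cycle 0-1-2-3-0, one incidence row per edge.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
B = np.zeros((len(edges), 4))
for row, (u, v) in enumerate(edges):
    B[row, u], B[row, v] = 1.0, -1.0

x_cut = np.array([1.0, 1.0, 0.0, 0.0])    # indicator vector of S = {0, 1}
x_frac = np.array([1.0, 0.5, 0.0, 0.5])   # a fractional vector

# Both quantities count the 2 edges leaving S.
print(q_norm(B @ x_cut, 1), q_norm(B @ x_cut, 2) ** 2)
# On a fractional vector the 1-norm is 2 while the squared 2-norm is 1.
print(q_norm(B @ x_frac, 1), q_norm(B @ x_frac, 2) ** 2)
```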
L1 ROW SAMPLING
[Talagrand `90, “Embedding subspaces of $L_1$ into $\ell_1^N$”] can be analyzed as row-sampling / sparsification
L1 Lewis weights ([Lewis `78]): w s.t. $w_i^2 = a_i^T (A^T W^{-1} A)^{-1} a_i$
Recursive definition!
Sampling with $p_i \geq w_i \cdot O(\log n)$ gives $\|Ax\|_1 \approx \|A'x\|_1 \ \forall x$
Can check: $\sum_i w_i \leq n$, so $O(n \log n)$ rows

[COHEN-P `14]
Update w on LHS with w on RHS: $w'_i \leftarrow (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}$
Aside: similar to iterative reweighted least squares
Converges in $\log \log n$ steps: analyze $A^T W^{-1} A$ spectrally

q          Previous # of rows    New # of rows                  Runtime
1          $n^{2.5}$             $n \log n$                     nnz + $n^{\omega+\theta}$
1 < q < 2  $n^{q/2+2}$           $n \log n (\log \log n)^2$     nnz + $n^{\omega+\theta}$
2 < q      $n^{q+1}$             $n^{q/2} \log n$               nnz + $n^{q/2+O(1)}$

Elementary, optimization motivated proof of w.h.p. concentration for L1

SUMMARY
• Algorithm design approach: study problems at the intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.
• Solve Lx = b efficiently via sparsified squaring.
• Approximate maximum flow routines and cut approximators can be constructed recursively from each other via graph sparsifiers.
• Wider ranges of structures can be sparsified, key statistical quantities can be computed iteratively.

I’VE ALSO WORKED ON
• Dynamic graph data structures
• Graph partitioning
• Parallel algorithms
• Image processing
• Anomaly / sybil detection in graphs

FUTURE WORK: LINEAR SYSTEM SOLVERS
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
Solvers for linear systems involving graph Laplacians
• Wider classes of linear systems
• Relation to optimization / learning

FUTURE WORK: COMBINATORIAL OPTIMIZATION
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
Maximum flow
Faster algorithms for more classical algorithmic graph theory problems?

FUTURE WORK: RANDOMIZED NUMERICAL LINEAR ALGEBRA
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
L1-preserving row sampling
[Figure: matrix B sampled down to B’]
• Other algorithmic applications of Lewis weights?
• Low-rank approximation in L1?
• O(n)-sized L1-preserving row samples? (these exist for L2)

SUMMARY
Combinatorics / Discrete algorithms
Numerical analysis / Optimization
Statistics / Randomized algorithms
Problems at their intersection
[Figure: matrix B sampled down to B’]
Links to arXiv manuscripts and videos of more detailed talks are at: math.mit.edu/~rpeng/
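The Lewis weight update from the [COHEN-P `14] slide can be sketched as a fixed-point iteration in Python with NumPy. The function name `l1_lewis_weights`, the random test matrix, and the fixed iteration count are illustrative assumptions; the sketch also assumes A has full column rank and no zero rows, so that every inverse and division is defined. It is a plain dense implementation for illustration, not the input-sparsity-time routine the talk refers to.

```python
import numpy as np

def l1_lewis_weights(A, iters=40):
    """Iterate w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}, starting from w = 1.

    Assumes A has full column rank and no zero rows (illustrative sketch only).
    """
    m, _ = A.shape
    w = np.ones(m)
    for _ in range(iters):
        M_inv = np.linalg.inv(A.T @ (A / w[:, None]))       # (A^T W^{-1} A)^{-1}
        w = np.sqrt(np.einsum("ij,jk,ik->i", A, M_inv, A))  # per-row quadratic forms
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))   # hypothetical tall test matrix
w = l1_lewis_weights(A)

# Check the defining equation w_i^2 = a_i^T (A^T W^{-1} A)^{-1} a_i at the result.
M_inv = np.linalg.inv(A.T @ (A / w[:, None]))
residual = np.max(np.abs(w ** 2 - np.einsum("ij,jk,ik->i", A, M_inv, A)))
print("fixed-point residual:", residual)
print("sum of weights:", w.sum())
```

At an exact fixed point the weights sum to the number of columns of a full-rank A (4 here), consistent with the bound $\sum_i w_i \leq n$ on the slide. The 40 iterations are a conservative choice for this sketch; the slide states that $\log \log n$ steps suffice for the approximation quality needed by the sampling routine.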