Sampling from Gaussian Graphical Models via Spectral Sparsification

Richard Peng (M.I.T.)
Joint work with Dehua Cheng, Yu Cheng, Yan Liu and Shanghua Teng (U.S.C.)

OUTLINE
• Gaussian sampling, linear systems, matrix roots
• Sparse factorizations of L^p
• Sparsification of random walk polynomials

SAMPLING FROM GRAPHICAL MODELS
Joint probability distribution over the entries of an n-dimensional random variable x.
Graphical models: encode local dependencies via a graph.
Sampling: draw a random point from the distribution defined by the model.

APPLICATIONS
• Rejection / importance sampling
• Estimation of quantities on the samples
Often need many samples.
Ideal sampling routine:
• Efficient, parallel
• Uses limited randomness

PREVIOUS WORKS
Gibbs sampling: locally resample each variable from the joint distribution given by its neighbors.
An instance of Markov chain Monte Carlo.
Parallel sampling algorithms:
• [Gonzalez-Low-Gretton-Guestrin `11]: coloring
• [Niu-Recht-Re-Wright `11] Hogwild: go lock-free
• [Williamson-Dubey-Xing `13]: auxiliary variables

GAUSSIAN GRAPHICAL MODELS AND LINEAR SYSTEMS
Joint distribution specified by a precision matrix M, whose inverse M^-1 is usually denoted Λ^-1.
Goal: sample from the Gaussian distribution N(0, M^-1).
Gibbs sampling: resample based on neighbors.
Iterative methods: x' ← x + αMx, also recomputed from neighbors.

CONNECTION TO SOLVING LINEAR SYSTEMS
[Johnson-Saunderson-Willsky `13]: if the precision matrix M is (generalized) diagonally dominant, then Hogwild Gibbs sampling converges.
Much more restrictive than the `graph' in graphical models!

Further simplification: the graph Laplacian matrix L of a graph with n vertices and m edges:
• Diagonal: degree
• Off-diagonal: negated edge weights
• n rows / columns, O(m) non-zeros
Example (3 vertices, 2 edges):

        [ 2  -1  -1 ]
    L = [-1   1   0 ]
        [-1   0   1 ]

LOCAL METHODS
Number of steps required is lower bounded by information propagation:
b, Mb, M^2 b, …, M^diameter b.
Need up to n matrix operations?
What if we have more powerful algorithmic primitives?

ALGEBRAIC PRIMITIVE
Goal: generate a random variable from the Gaussian distribution N(0, L^-1).
Can generate standard Gaussians, N(0, I):
x ~ N(0, I), y = Cx gives y ~ N(0, CC^T).
Need: an efficiently evaluable linear operator C s.t. CC^T = L^-1.
Assume L is full rank for simplicity.

DIRECT SOLUTION
Factorize L = B^T B, where B is the edge-vertex incidence matrix:
B_eu = -1/1 if u is an endpoint of e, 0 otherwise, e.g.

    B = [ 1  -1   0 ]
        [-1   0   1 ]

Set C = L^-1 B^T:
CC^T = L^-1 B^T (L^-1 B^T)^T = L^-1 B^T B L^-1 = L^-1.
Factorization + black-box access to solvers gives a sampling algorithm.

PARALLEL SAMPLING ROUTINE
C corresponding to: y' ← B^T y, x ← solve(L, y') gives CC^T ≈ L^-1.
≈: spectral similarity:
• A ≈_k B iff for all x: e^-k x^T A x ≤ x^T B x ≤ e^k x^T A x
• Can use B `in place' of A
• Can also boost accuracy
[P-Spielman `14]: Z ≈_ε L^-1 in polylog depth and nearly-linear work,
giving a parallel sampling routine.

RANDOMNESS REQUIREMENT
Sample y from N(0, I); y' ← B^T y; x ← solve(L, y'); return x.
B: m-by-n matrix, m = # of edges, so y needs to be an m-dimensional Gaussian (can get to O(n log n) with some work).
Fewer random variables?
Optimal randomness requirement: n, i.e., a C that is a square matrix.
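Before reducing the amount of randomness, here is a minimal NumPy/SciPy sketch of the routine above (my illustration, not code from the talk): SciPy's `spsolve` stands in for the fast `solve(L, ·)` primitive, and a tiny multiple of the identity makes L full rank, matching the simplification assumed earlier.

```python
import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import spsolve

def sample_gaussian(edges, weights, n, seed=None):
    """Draw x ~ N(0, L^-1) for the graph Laplacian L = B^T B.

    Sketch of the routine above: y ~ N(0, I_m), y' = B^T y,
    x = solve(L, y').  A direct sparse solver stands in for the
    nearly-linear-time solve(L, .) assumed in the talk.
    """
    m = len(edges)
    # Edge-vertex incidence matrix B (m x n): row e carries
    # +sqrt(w_e) and -sqrt(w_e) at the endpoints of edge e,
    # so that B^T B = L (entries are just +/-1 when w_e = 1).
    rows = np.repeat(np.arange(m), 2)
    cols = np.asarray(edges).ravel()
    vals = (np.sqrt(np.asarray(weights))[:, None] * [1.0, -1.0]).ravel()
    B = csr_matrix((vals, (rows, cols)), shape=(m, n))
    L = (B.T @ B).tocsc()
    L = L + 1e-9 * identity(n, format="csc")  # make L full rank, as assumed
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(m)   # m i.i.d. standard Gaussians
    yp = B.T @ y                 # y' = B^T y has covariance B^T B = L
    x = spsolve(L, yp)           # x = L^-1 y' has covariance L^-1 L L^-1 = L^-1
    return x

# Example: the 3-vertex, 2-edge graph whose Laplacian appears earlier.
print(sample_gaussian(edges=[(0, 1), (0, 2)], weights=[1.0, 1.0], n=3))
```

Note that the routine consumes m standard Gaussians, one per edge, which is exactly the randomness requirement the next slide asks to reduce.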
GENERALIZATIONS
Lower randomness requirement: L ≈ C^T C where C is a square matrix.
Akin to a QR factorization; ≈: spectral approximation.
Can also view as a matrix square root.
Alternate definitions of roots:
Z s.t. Z ≈ L^-1/2? Z s.t. Z ≈ L^-1/3?
Application of matrix roots: `half a step' of a random walk.

OUR RESULT
Input: graph Laplacian L with condition number κ, parameter -1 ≤ p ≤ 1.
Output: access to a square operator C s.t. C^T C ≈_ε L^p.
Cost: O(log^c1 m · log^c2 κ · ε^-4) time, O(m · log^c1 m · log^c2 κ · ε^-4) work.
κ: condition number, closely related to the bit complexity of solve(L, b).
Extends to symmetric diagonally dominant (SDD) matrices.

SUMMARY
• Gaussian sampling is closely related to linear system solves and matrix pth roots
• Can approximately factor L^p into a product of sparse matrices
• Random walk polynomials can be sparsified by sampling random walks

OUTLINE
• Gaussian sampling, linear systems, matrix roots
• Sparse factorizations of L^p
• Sparsification of random walk polynomials

SIMPLIFICATION
• Adjust/rescale so the diagonal = I
• Add to the diagonal to make L full rank
L = I – A, where A is a random walk matrix, ║A║ < 1.

PROBLEM
Given a random walk matrix A and a parameter p, produce an easily evaluable C s.t. C^T C ≈ (I – A)^p.
Local approach for p = -1:
I + A + A^2 + A^3 + … = (I – A)^-1.
Each step passes information to neighbors: I, A, A^2, …, A^diameter.
Need A^diameter: evaluate using O(diameter) matrix operations?

FASTER INFORMATION PROPAGATION
Repeated squaring: A^16 = (((A^2)^2)^2)^2, 4 operations.
Framework from [P-Spielman `14]: reduce (I – A)^p to computing (I – A^2)^p.
Recall ║A║ < 1: I – A^(n^3) ≈ I if A corresponds to a random walk on an unweighted graph.
O(log κ) reduction steps suffice.

SQUARING DENSE GRAPHS?!?
Graph sparsification: sparse A' s.t. I – A' ≈_ε I – A^2.
• [ST `04][SS `08][OV `11] + some modifications, or [Koutis `14]: O(n log^c n ε^-2) entries, efficient, parallel
• [BSS `09][ALZ `14]: O(n ε^-2) entries, but quadratic cost
Sparsification also preserves pth powers.

ABSORBING ERRORS
Simplification: work with p = -1.
Direct factorization: (I – A)^-1 = (I + A)(I – A^2)^-1.
Have: I – A' ≈ I – A^2.
Implies: (I – A')^-1 ≈ (I – A^2)^-1.
But NOT: (I + A)(I – A')^-1 ≈ (I + A)(I – A^2)^-1.
Incorporation of matrix approximations needs to be symmetric:
X ≈ X' implies U^T X U ≈ U^T X' U.
Instead use:
(I – A)^-1 = (I + A)^1/2 (I – A^2)^-1 (I + A)^1/2
           ≈ (I + A)^1/2 (I – A')^-1 (I + A)^1/2.

SIMILAR TO
• Multiscale methods
• NC algorithm for shortest path
• Logspace connectivity: [Reingold `02]
• Deterministic squaring: [Rozenman-Vadhan `05]

Method        | Iteration               | Until       | Size reduction | Randomness   | Solution transfer
Connectivity  | A_{i+1} ≈ A_i^2         | ║A_d║ small | low degree     | derandomized | connectivity
Our algorithm | I – A_{i+1} ≈ I – A_i^2 | ║A_d║ small | sparse graph   | randomized   | solution vectors

EVALUATING (I + A)^1/2?
(I – A)^-1 ≈ (I + A)^1/2 (I – A')^-1 (I + A)^1/2.
A_1 ≈ A_0^2:
• eigenvalues in [0, 1]
• eigenvalues of I + A_i in [1, 2] when i > 0
• well-conditioned matrix
• Maclaurin series expansion, approximated well by a low-degree polynomial T_1/2(A_i)
Doesn't work for (I + A_0)^1/2: eigenvalues of A_0 can be -1.

MODIFIED IDENTITY
(I – A)^-1 = (I + A/2)^1/2 (I – A/2 – A^2/2)^-1 (I + A/2)^1/2.
• Modified reduction: I – A_{i+1} ≈ I – A_i/2 – A_i^2/2
• I + A_i/2 has eigenvalues in [1/2, 3/2]
Can approximate (to very high accuracy) with a low-degree polynomial / Maclaurin series, T_1/2(A_i/2).
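To make T_1/2 concrete, here is a small NumPy sketch (my own illustration; the function name T_half and the choice of degree are assumptions, and the talk's actual construction may differ). It evaluates the truncated Maclaurin series of (1 + u)^1/2 at U = A/2 and checks it against the exact matrix square root:

```python
import numpy as np
from scipy.special import binom

def T_half(A, degree=8):
    """Degree-`degree` Maclaurin approximation of (I + A/2)^{1/2}.

    Uses (1 + u)^{1/2} = sum_j binom(1/2, j) u^j with u = A/2;
    since the eigenvalues of I + A/2 lie in [1/2, 3/2], a low
    degree already gives high accuracy.
    """
    n = A.shape[0]
    U = A / 2.0
    P = binom(0.5, degree) * np.eye(n)   # Horner evaluation of the polynomial
    for j in range(degree - 1, -1, -1):
        P = P @ U + binom(0.5, j) * np.eye(n)
    return P

# Sanity check on a random symmetric A with ||A|| < 1.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = (M + M.T) / 2
A /= 1.01 * np.linalg.norm(A, 2)              # enforce ||A|| < 1
w, V = np.linalg.eigh(np.eye(50) + A / 2.0)
exact = (V * np.sqrt(w)) @ V.T                # true (I + A/2)^{1/2}
print(np.linalg.norm(T_half(A) - exact, 2))   # small; shrinks as degree grows
```

The well-conditioned spectrum is what makes this work: with eigenvalues of U = A/2 confined to [-1/2, 1/2], the binomial series converges geometrically, so a constant degree suffices for fixed accuracy.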
APPROX. FACTORIZATION CHAIN
I – A_0
I – A_1 ≈_ε I – A_0/2 – A_0^2/2
I – A_2 ≈_ε I – A_1/2 – A_1^2/2
…
I – A_i ≈_ε I – A_{i-1}/2 – A_{i-1}^2/2
…
I – A_d ≈ I, with d = O(log κ).
(I – A_i)^-1 ≈ T_1/2(A_i/2) (I – A_{i+1})^-1 T_1/2(A_i/2).
C_i = T_1/2(A_d/2) … T_1/2(A_{i+1}/2) T_1/2(A_i/2) gives (I – A_i)^-1 ≈ C_i^T C_i (each factor is symmetric).
For pth roots (-1 ≤ p ≤ 1): use T_{p/2}(A_0/2) T_{p/2}(A_1/2) … T_{p/2}(A_d/2) instead.

WORKING AROUND EXPANSIONS
Alternate reduction step:
(I – A)^-1 = (I + A/2) (I – (3/4)A^2 – (1/4)A^3)^-1 (I + A/2).
Composition is now done with I + A/2: easy.
Hard part: finding a sparse approximation to I – (3/4)A^2 – (1/4)A^3:
• (3/4)(I – A^2): same as before
• (1/4)(I – A^3): cubic power

GENERALIZATION TO PTH POWER
(I – A)^p = (I + kA) ((I + kA)^(-2/p) (I – A))^p (I + kA).
Intuition: scalar operations commute; cancel the extra outer terms against the inner ones.
Can show: if 2/p is an integer and k is chosen appropriately, (I + kA)^(-2/p)(I – A) is a combination of (I – A^c) for integers c up to |2/p| + 1.
Difficulty: sparsifying (I – A^c) for large values of c.

SUMMARY
• Gaussian sampling is closely related to linear system solves and matrix pth roots
• Can approximately factor L^p into a product of sparse matrices

OUTLINE
• Gaussian sampling, linear systems, matrix roots
• Sparse factorizations of L^p
• Sparsification of random walk polynomials

SPECTRAL SPARSIFICATION VIA EFFECTIVE RESISTANCE
[Spielman-Srivastava `08]: it suffices to sample each edge with probability at least O(log n) × weight × effective resistance,
i.e., sample edge uv with probability ~ log n · A_uv · R(u, v).
Issues:
• I – A^3 is dense
• Need to sample without explicitly generating all edges / resistances
Two-step approach: get a sparsifier with edge count close to m, then run a full sparsifier.

TWO STEP APPROACH FOR I – A^2
A: 1 step of the random walk; A^2: 2 steps.
[P-Spielman `14]: for a fixed midpoint, the edges of A^2 form a (weighted) complete graph.
Replace with expanders: O(m log n) edges, then run a black-box sparsifier.

I – A^3
A: one step of the random walk; A^3: 3 steps.
A length-3 path in A, u–y–z–v, with weight A_uy A_yz A_zv, gives (part of) edge uv in I – A^3.

BOUND RESISTANCES ON I – A
Can check: I – A ≈_3 I – A^3 (spectral theorem: can work with scalars).
The resistance between u and v in I – A gives an upper bound for the sampling probability.
Rayleigh's monotonicity law: resistances in subgraphs of I – A are good upper bounds.
Bound R(u, v) using the length-3 path u–y–z–v in A:
sampling probability = log n × w(u–y–z–v) × R(u–y–z–v).

SAMPLING DISTRIBUTION
For the path u–y–z–v:
weight: A_uy A_yz A_zv
resistance: 1/A_uy + 1/A_yz + 1/A_zv
probability: A_yz A_zv + A_uy A_zv + A_uy A_yz

ONE TERM AT A TIME
Probability of picking u–y–z–v: A_yz A_zv + A_uy A_zv + A_uy A_yz.
Total for a fixed choice of uy:
Σ_{z,v} A_yz A_zv = Σ_z A_yz (Σ_v A_zv) ≤ Σ_z A_yz ≤ 1
(A holds random walk transition probabilities, so its row sums are at most 1).
Total over all choices of uy: m.
Interpretation: pick an edge uy, take 2 steps of the random walk, then sample the edge of A^3 corresponding to u–y–z–v.

MIDDLE TERM
Probability of picking u–y–z–v: A_yz A_zv + A_uy A_zv + A_uy A_yz.
Interpretation: pick an edge yz, take one step from y to get u and one step from z to get v; this gives the edge of A^3 corresponding to u–y–z–v.
Total: m again. The term A_uy A_yz is handled similarly.
• O(m log n)-size approximation to I – A^3 in O(m log n) time
• Can then further sparsify in nearly-linear time
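A Monte-Carlo sketch of the first-term rule in NumPy, under simplifying assumptions that are mine rather than the talk's (an unweighted d-regular graph, so every walk step has probability 1/d; the function name and reweighting bookkeeping are also my own):

```python
import numpy as np

def sample_cube_walks(edges, n, num_samples, seed=None):
    """Monte-Carlo sketch of the first-term rule for sparsifying I - A^3.

    Assumptions (mine, not the talk's): the graph is unweighted and
    d-regular, so A = adjacency / d.  Rule: pick a uniformly random
    edge (u, y), take two random-walk steps y -> z -> v; the length-3
    walk u-y-z-v contributes (part of) edge uv of A^3.
    """
    rng = np.random.default_rng(seed)
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    d = len(nbrs[0])                    # d-regularity assumed
    m = len(edges)
    approx = {}                         # sparse surrogate for A^3's edges
    for _ in range(num_samples):
        u, y = edges[rng.integers(m)]   # uniform random edge: prob 1/m
        if rng.random() < 0.5:          # pick an orientation of the edge
            u, y = y, u
        z = nbrs[y][rng.integers(d)]    # walk step y -> z: prob 1/d
        v = nbrs[z][rng.integers(d)]    # walk step z -> v: prob 1/d
        # This directed path was drawn with probability 1/(2*m*d^2)
        # and carries A^3-weight (1/d)^3, so reweight to keep the
        # aggregate estimate of A^3's entries unbiased.
        w = (1.0 / d**3) * (2 * m * d * d) / num_samples
        key = (min(u, v), max(u, v))
        approx[key] = approx.get(key, 0.0) + w
    return approx

# Example: K4 is 3-regular; sample sizes here are hypothetical.
K4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(sample_cube_walks(K4, n=4, num_samples=1000, seed=0))
```

The other two terms of the sampling probability correspond to picking the middle edge yz (as in MIDDLE TERM above) or the edge zv instead; a full implementation would combine all three rules and fold in the O(log n) oversampling factor.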
EXTENSIONS
Even powers: I – A ≈ I – A^2 does not hold.
But I – A^2 ≈_2 I – A^4: certify via the 2-step matrix, same algorithm.
I – A^k in O(mk log^c n) time; O(m log k log^c n) time when k is a multiple of 4.

SUMMARY
• Gaussian sampling is closely related to linear system solves and matrix pth roots
• Can approximately factor L^p into a product of sparse matrices
• Random walk polynomials can be sparsified by sampling random walks

OPEN QUESTIONS
• Generalizations:
  • Batch sampling?
  • Connections to multigrid/multiscale methods?
  • Other functionals of L?
• Sparsification of random walk polynomials:
  • Degree-n polynomials in nearly-linear time?
  • Positive and negative coefficients?
  • Connections with other algorithms based on sampling random walks?

THANK YOU!
Questions?
Manuscripts on arXiv:
• http://arxiv.org/abs/1311.3286
• http://arxiv.org/abs/1410.5392