
Sampling from Gaussian Graphical
Models via Spectral Sparsification
Richard Peng
M.I.T.
Joint work with Dehua Cheng, Yu
Cheng, Yan Liu and Shanghua Teng
(U.S.C.)
OUTLINE
• Gaussian sampling, linear systems, matrix-roots
• Sparse factorizations of L^p
• Sparsification of random walk polynomials
SAMPLING FROM GRAPHICAL MODELS
Joint probability distribution between
entries of n-dimensional random variables x
graphical models: encode as
local dependencies via graph
Sampling: draw a random point
from the model's distribution
APPLICATIONS
Ideal sampling routine:
• Efficient, parallel
• Use limited randomness
• Rejection / importance sampling
• Estimation of quantities on the samples
Often need many samples
PREVIOUS WORKS
Gibbs sampling: locally resample each variable
from the joint distribution given by its neighbors
Instance of Markov Chain Monte-Carlo
Parallel sampling algorithm:
• [Gonzalez-Low-Gretton-Guestrin `11]: coloring.
• [Niu-Recht-Re-Wright `11] Hogwild: go lock-free
• [Williamson-Dubey-Xing `13]: auxiliary variables.
GAUSSIAN GRAPHICAL MODELS AND
LINEAR SYSTEMS
Joint distribution specified
by a precision matrix M,
usually denoted Λ
Goal: sample from the Gaussian
distribution N(0, M^-1)
Gibbs sampling: resample based on neighbors
Iterative methods: x' ← x + αMx,
also only recomputing on neighbors
CONNECTION TO SOLVING LINEAR
SYSTEMS
[Johnson, Saunderson, Willsky `13]: if the precision
matrix M is (generalized) diagonally dominant, then
Hogwild Gibbs sampling converges
Much more restrictive than the
`graph’ in graphical models!
Further simplification:
graph Laplacian matrix L
• n rows / columns, O(m) non-zeros
  (n vertices, m edges)
• Diagonal: degree
• Off-diagonal: -edge weights
Example: unit-weight edges {1,2}, {1,3} give

    [  2  -1  -1 ]
L = [ -1   1   0 ]
    [ -1   0   1 ]
LOCAL METHODS
#steps required lower bounded
by information propagation
b → Mb → M^2 b → … → M^diameter b
Need n matrix operations?
What if we have more powerful
algorithmic primitives?
ALGEBRAIC PRIMITIVE
Goal: generate random variable from
the Gaussian distribution N(0, L^-1)
Can generate uniform Gaussians, N(0, I)
x ~ N(0, I), y = Cx
⟹ y ~ N(0, CC^T)
Need: efficiently evaluable
linear operator C s.t. CC^T = L^-1
Assume L is full rank for simplicity
DIRECT SOLUTION:
Factorize L = B^T B
Set C = L^-1 B^T:
CC^T = L^-1 B^T (L^-1 B^T)^T
     = L^-1 B^T B L^-1
     = L^-1
B: edge-vertex incidence matrix:
B_eu = ±1 if u is an endpoint of e,
       0 otherwise
e.g. B = [  1  -1   0 ]
         [ -1   0   1 ]
Factorization + black-box access to
solvers gives a sampling algorithm
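As a sanity check, this factorization-based sampler can be sketched in a few lines of NumPy. The toy setup is my own, not from the talk: a 3-vertex graph, with a small diagonal shift folded into extra single-entry incidence rows so that L = B^T B stays exact and full rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: unit-weight edges {0,1} and {0,2} (3 vertices).
edges = [(0, 1), (0, 2)]
n, m = 3, len(edges)

# Edge-vertex incidence matrix B: +/-1 at the endpoints of each edge.
B = np.zeros((m, n))
for e, (u, v) in enumerate(edges):
    B[e, u], B[e, v] = 1.0, -1.0

# Add 0.5 to each diagonal entry as extra single-entry rows of B,
# so L = B^T B is full rank (the slides assume full rank).
B = np.vstack([B, np.sqrt(0.5) * np.eye(n)])
L = B.T @ B

# One sample: y ~ N(0, I), x = L^{-1} B^T y, so Cov(x) = L^{-1}.
def sample():
    y = rng.standard_normal(B.shape[0])
    return np.linalg.solve(L, B.T @ y)

X = np.array([sample() for _ in range(100_000)])
emp_cov = X.T @ X / len(X)
print(np.max(np.abs(emp_cov - np.linalg.inv(L))))  # sampling error, shrinks like 1/sqrt(#samples)
```

The empirical covariance of the samples matches L^-1 up to Monte Carlo error, which is the whole point of CC^T = L^-1.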
PARALLEL SAMPLING ROUTINE
C corresponding to:
y' ← B^T y
x ← solve(L, y')
gives L^-1 ≈ CC^T
≈: spectral similarity:
• A ≈_k B iff ∀x we have:
  e^-k x^T A x ≤ x^T B x ≤ e^k x^T A x
• Can use B ‘in place’ of A
• Can also boost accuracy
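The ≈_k relation can be checked numerically: for SPD matrices, the tightest k is the largest |log| of a generalized eigenvalue of the pair (B, A). A small sketch (the helper name is mine, not from the talk):

```python
import numpy as np

def similarity(A, B):
    """Smallest k with e^-k x'Ax <= x'Bx <= e^k x'Ax for all x (A, B SPD)."""
    C = np.linalg.cholesky(A)                   # A = C C^T
    Cinv = np.linalg.inv(C)
    w = np.linalg.eigvalsh(Cinv @ B @ Cinv.T)   # generalized eigenvalues of (B, A)
    return float(np.max(np.abs(np.log(w))))

I = np.eye(4)
print(similarity(I, np.e * I))   # scaling by e gives k = 1, up to floating point
```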
[P-Spielman `14]: Z ≈_ε L^-1 in polylog
depth and nearly-linear work
Parallel sampling routine
RANDOMNESS REQUIREMENT
Sample y from N(0, I)
y’  Bty
x  solve(L, y’)
return x
B: m-by-n matrix,
m = # of edges,
so y needs to be an m-dimensional Gaussian
(can get to O(n log n) with some work)
Fewer random variables?
Optimal randomness requirement: n
⟹ want a C that is a square matrix
GENERALIZATIONS
Lower Randomness Requirement:
L ≈ C^T C where C is a square matrix
Akin to QR factorization
≈: spectral approximation
Can also view as matrix square root?
Alternate definition of square root:
Z s.t. Z ≈ L^-1/2?
Z s.t. Z ≈ L^-1/3?
Application of matrix roots:
‘half a step’ of a random walk
OUR RESULT
Input: graph Laplacian L with condition number κ,
parameter -1 ≤ p ≤ 1
Cost: O(log^c1 m · log^c2 κ · ε^-4) depth,
O(m · log^c1 m · log^c2 κ · ε^-4) work
κ: condition number, closely related
to the bit-complexity of solve(L, b)
Extends to symmetric diagonally dominant (SDD) matrices
SUMMARY
• Gaussian sampling closely related to
linear system solves and matrix pth roots
• Can approximately factor L^p into a
product of sparse matrices
• Random walk polynomials can be
sparsified by sampling random walks
OUTLINE
• Gaussian sampling, linear systems, matrix-roots
• Sparse factorizations of L^p
• Sparsification of random walk polynomials
SIMPLIFICATION
• Add to diagonal to make full rank
• L = I – A
A: random walk matrix, ‖A‖ < 1
PROBLEM
Given random walk matrix A, parameter p,
produce easily evaluable C s.t. C^T C ≈ (I – A)^p
Local approach for p = -1:
I + A + A^2 + A^3 + … = (I – A)^-1
Each step passes information one hop further: I, A, A^2, …
Evaluate using O(diameter) matrix operations?
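The local approach is just a truncated Neumann series; a quick check on a random symmetric A with ‖A‖ < 1 (toy data, my own example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.random((n, n)); A = (A + A.T) / 2
A *= 0.5 / np.linalg.norm(A, 2)      # symmetric, ||A||_2 = 0.5 < 1

# Truncated series I + A + A^2 + ... : error decays like ||A||^k.
S, P = np.zeros_like(A), np.eye(n)
for _ in range(60):
    S, P = S + P, P @ A

print(np.max(np.abs(S - np.linalg.inv(np.eye(n) - A))))  # essentially zero
```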
FASTER INFORMATION PROPAGATION
Repeated squaring: A^16 = (((A^2)^2)^2)^2, 4 operations
Framework from [P-Spielman `14]:
Reducing (I – A)^p to computing (I – A^2)^p
Recall: ‖A‖ < 1; I – A^(n^3) ≈ I if A
corresponds to a random walk on
an unweighted graph
O(log κ) reduction steps suffice
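The squaring step in code form (the slide's example: A^16 via 4 multiplications):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((4, 4))

B = A.copy()
for _ in range(4):        # 4 squarings: A -> A^2 -> A^4 -> A^8 -> A^16
    B = B @ B

assert np.allclose(B, np.linalg.matrix_power(A, 16))
```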
SQUARING ⟹ DENSE GRAPHS?!?
Graph sparsification: sparse A' s.t. I – A' ≈_ε I – A^2
• [ST `04][SS `08][OV `11] + some modifications,
  or [Koutis `14]: O(n log^c n ε^-2) entries, efficient, parallel
• [BSS `09, ALZ `14]: O(n ε^-2) entries, but quadratic cost
Also preserves pth powers
ABSORBING ERRORS
Simplification: work with p = -1
Direct factorization: (I – A)^-1 = (I + A)(I – A^2)^-1
Have: I – A' ≈ I – A^2
Implies: (I – A')^-1 ≈ (I – A^2)^-1
But NOT: (I + A)(I – A')^-1 ≈ (I + A)(I – A^2)^-1
Incorporation of matrix approximations
needs to be symmetric:
X ≈ X' ⟹ U^T X U ≈ U^T X' U
Instead use: (I – A)^-1 = (I + A)^1/2 (I – A^2)^-1 (I + A)^1/2
                        ≈ (I + A)^1/2 (I – A')^-1 (I + A)^1/2
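Since (I + A)^1/2, (I – A^2)^-1 and (I – A)^-1 are all functions of the same A, the exact version of this identity can be verified numerically (random symmetric test matrix and square-root helper are my own):

```python
import numpy as np

def sym_sqrt(M):
    w, V = np.linalg.eigh(M)            # M symmetric PSD
    return V @ np.diag(np.sqrt(w)) @ V.T

rng = np.random.default_rng(3)
n = 5
A = rng.random((n, n)); A = (A + A.T) / 2
A *= 0.9 / np.linalg.norm(A, 2)         # symmetric, ||A|| = 0.9 < 1

I = np.eye(n)
lhs = np.linalg.inv(I - A)
half = sym_sqrt(I + A)                   # (I + A)^{1/2}
rhs = half @ np.linalg.inv(I - A @ A) @ half
assert np.allclose(lhs, rhs)
```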
SIMILAR TO
• Multiscale methods
• NC algorithm for shortest path
• Logspace connectivity: [Reingold `02]

Method        | Iteration               | Until (size reduction) | Randomness   | Solution transfer
Connectivity  | A_{i+1} ≈ A_i^2         | Low degree             | Derandomized | Connectivity
Our algorithm | I – A_{i+1} ≈ I – A_i^2 | Sparse graph           | Randomized   | Solution vectors
EVALUATING (I + A)1/2?
(I – A)^-1 ≈ (I + A)^1/2 (I – A')^-1 (I + A)^1/2
A_1 ≈ A_0^2:
• Eigenvalues of A_i between [0, 1] when i > 0
• So eigenvalues of I + A_i in [1, 2]:
  well-conditioned matrix
• Maclaurin series expansion, approximated
  well by a low-degree polynomial T_1/2(A_i)
Doesn't work for (I + A_0)^1/2:
eigenvalues of A_0 can be close to -1
MODIFIED IDENTITY
(I – A)^-1 = (I + A/2)^1/2 (I – A/2 – A^2/2)^-1 (I + A/2)^1/2
• Modified reduction: I – A_{i+1} ≈ I – A_i/2 – A_i^2/2
• I + A_i/2 has eigenvalues in [1/2, 3/2]
Can approximate (to very high accuracy) with a
low-degree polynomial / Maclaurin series, T_1/2(A_i/2)
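A sketch of such a T_1/2: apply the Maclaurin coefficients of sqrt(1 + t) to A/2, which converges quickly because the eigenvalues of A/2 lie in [-1/2, 1/2]. The degree and test matrix are my own choices:

```python
import numpy as np

# Maclaurin coefficients of sqrt(1 + t): c_0 = 1, c_{k+1} = c_k (1/2 - k)/(k + 1).
deg = 40
c = [1.0]
for k in range(deg):
    c.append(c[-1] * (0.5 - k) / (k + 1))

rng = np.random.default_rng(4)
n = 5
A = rng.random((n, n)); A = (A + A.T) / 2
A /= np.linalg.norm(A, 2)               # eigenvalues of A in [-1, 1]

# T_{1/2}(A/2) = sum_k c_k (A/2)^k  ~=  (I + A/2)^{1/2}
I = np.eye(n); T = np.zeros_like(A); P = I
for ck in c:
    T, P = T + ck * P, P @ (A / 2)

w, V = np.linalg.eigh(I + A / 2)
exact = V @ np.diag(np.sqrt(w)) @ V.T   # exact symmetric square root
assert np.allclose(T, exact)
```

The terms shrink like 2^-k, so degree ~40 already reaches machine precision; this is why the eigenvalue range [1/2, 3/2] matters.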
APPROX. FACTORIZATION CHAIN
I – A_1 ≈_ε I – A_0/2 – A_0^2/2
I – A_2 ≈_ε I – A_1/2 – A_1^2/2
…
I – A_i ≈_ε I – A_{i-1}/2 – A_{i-1}^2/2,  d = O(log κ) levels
(I – A_i)^-1 ≈ T_1/2(A_i/2) (I – A_{i+1})^-1 T_1/2(A_i/2)
C_i = T_1/2(A_i/2) T_1/2(A_{i+1}/2) … T_1/2(A_d/2) gives (I – A_i)^-1 ≈ C_i^T C_i
For pth root (-1 ≤ p ≤ 1): T_p/2(A_0/2) T_p/2(A_1/2) … T_p/2(A_d/2)
WORKING AROUND EXPANSIONS
Alternate reduction step:
(I – A)^-1 = (I + A/2) (I – (3/4)A^2 – (1/4)A^3)^-1 (I + A/2)
Composition now done with I + A/2: easy
Hard part: finding a sparse approximation
to I – (3/4)A^2 – (1/4)A^3
• (3/4)(I – A^2): same as before
• (1/4)(I – A^3): cubic power
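The alternate identity follows from (I – A)(I + A/2)^2 = I – (3/4)A^2 – (1/4)A^3, and checks out numerically (random symmetric A, my own toy example):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
A = rng.random((n, n)); A = (A + A.T) / 2
A *= 0.9 / np.linalg.norm(A, 2)          # symmetric, ||A|| < 1

I = np.eye(n)
# (I - A)^{-1} = (I + A/2)(I - 3/4 A^2 - 1/4 A^3)^{-1}(I + A/2),
# since (I - A)(I + A/2)^2 = I - 3/4 A^2 - 1/4 A^3.
lhs = np.linalg.inv(I - A)
mid = np.linalg.inv(I - 0.75 * A @ A - 0.25 * A @ A @ A)
rhs = (I + A / 2) @ mid @ (I + A / 2)
assert np.allclose(lhs, rhs)
```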
GENERALIZATION TO PTH POWER
(I – A)^p = (I + kA)^-1 ((I + kA)^2/p (I – A))^p (I + kA)^-1
Intuition: scalar operations commute; cancel
away the extra outer terms with the inner ones
Can show: if 2/p is an integer and k > 2/p, (I + kA)^2/p (I – A)
is a combination of (I – A^c) for integer c up to 2/p
Difficulty: sparsifying (I – Ac) for large values of c
SUMMARY
• Gaussian sampling closely related to
linear system solves and matrix pth roots
• Can approximately factor L^p into a
product of sparse matrices
OUTLINE
• Gaussian sampling, linear systems, matrix-roots
• Sparse factorizations of L^p
• Sparsification of random walk polynomials
SPECTRAL SPARSIFICATION VIA
EFFECTIVE RESISTANCE
[Spielman-Srivastava `08]: suffices to
sample with probabilities at least O(log n)
times weight times effective resistance,
i.e. sample edge uv with probability ∝ log n · A_uv · R(u, v)
Issue: I – A^3 is dense
Need to sample without explicitly
generating all edges / resistances
Two step approach: get sparsifier with edge
count close to m, then run full sparsifier
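On a toy graph the effective resistances behind these probabilities are easy to sketch via the Laplacian pseudoinverse; a classic sanity check (Foster's theorem) is that the weighted edge resistances sum to n – 1. Graph and weights below are my own example:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]    # triangle plus a pendant edge
w = [1.0, 1.0, 1.0, 1.0]
n = 4

# Build the graph Laplacian: degree on diagonal, -weight off-diagonal.
L = np.zeros((n, n))
for (u, v), wt in zip(edges, w):
    L[u, u] += wt; L[v, v] += wt
    L[u, v] -= wt; L[v, u] -= wt

Lp = np.linalg.pinv(L)                       # Laplacian pseudoinverse

def reff(u, v):
    e = np.zeros(n); e[u], e[v] = 1.0, -1.0
    return e @ Lp @ e                        # R(u, v) = (e_u - e_v)^T L^+ (e_u - e_v)

total = sum(wt * reff(u, v) for (u, v), wt in zip(edges, w))
print(total)   # n - 1 = 3, up to floating point (Foster's theorem)
```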
TWO STEP APPROACH FOR I – A^2
A: 1 step of random walk
A^2: 2 steps of random walk
[P-Spielman `14]: for a fixed midpoint, the edges
of A^2 form a (weighted) complete graph
Replace each with an expander ⟹ O(m log n) edges
Run black-box sparsifier
I – A^3
A: one step of random walk; A^3: 3 steps of random walk
Length-3 path in A: u-y-z-v,
weight A_uy A_yz A_zv
⟹ (part of) edge uv in I – A^3
BOUND RESISTANCE ON I - A
Can check: I – A ≈_3 I – A^3
Spectral theorem: can work as scalars
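As scalars this is the inequality 1 – λ^3 = (1 – λ)(1 + λ + λ^2), with 1 + λ + λ^2 staying within [3/4, 3] on [-1, 1]; a grid check:

```python
import numpy as np

lam = np.linspace(-1, 1, 2001)[:-1]     # eigenvalue grid, drop lam = 1 (0/0)
ratio = (1 - lam**3) / (1 - lam)        # equals 1 + lam + lam^2
assert ratio.min() >= 0.75 - 1e-9       # lower bound, attained at lam = -1/2
assert ratio.max() <= 3.0               # upper bound, approached as lam -> 1
```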
Resistance between u and v in I - A gives
upper bound for sampling probability
Rayleigh’s monotonicity law: resistances in
subgraphs of I – A are good upper bounds
Bound R(u, v) using the length-3 path u-y-z-v in A:
Sampling probability = log n × w(u-y-z-v) × R(u-y-z-v)
SAMPLING DISTRIBUTION
Sampling probability = log n × w(u-y-z-v) × R(u-y-z-v)
Weight: A_uy A_yz A_zv
Resistance along the path: 1/A_uy + 1/A_yz + 1/A_zv
Probability: A_yz A_zv + A_uy A_zv + A_uy A_yz
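Multiplying out weight and resistance for a single path shows where the three-term probability comes from (scalar sanity check, arbitrary values):

```python
# A_uy, A_yz, A_zv for one length-3 path u-y-z-v
a_uy, a_yz, a_zv = 0.3, 0.5, 0.2

weight = a_uy * a_yz * a_zv
resistance = 1 / a_uy + 1 / a_yz + 1 / a_zv

# weight * resistance = A_yz A_zv + A_uy A_zv + A_uy A_yz
expanded = a_yz * a_zv + a_uy * a_zv + a_uy * a_yz
assert abs(weight * resistance - expanded) < 1e-12
```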
ONE TERM AT A TIME
Probability of picking u-y-z-v: A_yz A_zv + A_uy A_zv + A_uy A_yz
Total for a fixed choice of uy:
Σ_zv A_yz A_zv = Σ_z A_yz (Σ_v A_zv) ≤ Σ_z A_yz ≤ 1
(A: random walk transition probabilities)
Total over all choices of uy: m
Interpretation: pick edge uy, take 2 steps of random
walk, then sample the edge of A^3 corresponding to u-y-z-v
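The bound Σ_zv A_yz A_zv ≤ 1 just says two more steps of a (sub)stochastic walk carry total mass at most 1; e.g. with a row-stochastic transition matrix (my toy P, not the talk's A):

```python
import numpy as np

rng = np.random.default_rng(6)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)      # row-stochastic walk matrix

for y in range(5):
    mass = sum(P[y, z] * P[z, v] for z in range(5) for v in range(5))
    assert mass <= 1 + 1e-9            # exactly 1 for a stochastic walk
```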
MIDDLE TERM
Probability of picking u-y-z-v: A_yz A_zv + A_uy A_zv + A_uy A_yz
Interpretation: pick edge yz, take one step from y
to get u and one step from z to get v: the edge u-y-z-v of A^3
Total: m again
A_uy A_yz handled similarly
• O(m log n)-size approximation to I – A^3 in O(m log n) time
• Can then further sparsify in nearly-linear time
EXTENSIONS
Even powers: I – A ≈ I – A^2 does not hold
But I – A^2 ≈_2 I – A^4:
certify via the 2-step matrix, same algorithm
⟹ I – A^k in O(mk log^c n) time,
I – A^k in O(m log k · log^c n) time
when k is a multiple of 4
SUMMARY
• Gaussian sampling closely related to
linear system solves and matrix pth roots
• Can approximately factor L^p into a
product of sparse matrices
• Random walk polynomials can be
sparsified by sampling random walks
OPEN QUESTIONS
• Generalizations:
• Batch sampling?
• Connections to multigrid/multiscale methods?
• Other functionals of L?
• Sparsification of random walk polynomials:
• Degree n polynomials in nearly-linear time?
• Positive and negative coefficients?
• Connections with other algorithms based on
sampling random walks?
THANK YOU!
Questions?
Manuscripts on arXiv:
• http://arxiv.org/abs/1311.3286
• http://arxiv.org/abs/1410.5392