Multifaceted Algorithm Design Richard Peng M.I.T.

LARGE SCALE PROBLEMS
Physical
Simulation
Network
Analysis
Optimization
Emphasis on efficient algorithms in:
• Scientific computing
• Graph theory
• (randomized) numerical routines
WELL STUDIED QUESTIONS
Scientific computing: fast solvers
for structured linear systems
Graphs / combinatorics:
network flow problems
Randomized algorithms:
subsampling matrices and
optimization formulations
MY REPRESENTATIVE RESULTS
Current fastest sequential and
parallel solvers for linear systems
in graph Laplacian matrices
First nearly-linear time
algorithm for approximate
undirected maxflow
First near-optimal routine for
row sampling matrices in a
1-norm preserving manner
RECURRING IDEAS
Can solve a problem by iteratively
solving several similar instances
Larger problems can be
approximated by smaller ones
Data
Approximator
Approximations lead to
better approximations
MY APPROACH TO ALGORITHM
DESIGN
Identify problems that arise at the intersection of
multiple areas and study them from multiple angles
Combinatorics /
Discrete algorithms
Problems at their intersection
This talk: structure-preserving sampling
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
SAMPLING
Classical use in statistics:
• Extract info from a large data set
• Directly output result (estimator)
Sampling from matrices, networks, and
optimization problems:
• Often compute on the sample
• Need to preserve more structure
PRESERVING GRAPH STRUCTURES
Undirected graph, n vertices, m < n^2 edges
Is n^2 edges (dense) sometimes necessary?
For some information, e.g. connectivity:
encoded by spanning forest, < n edges
Deterministic, O(m) time algorithm
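A minimal sketch of the spanning-forest idea, assuming vertices are numbered 0..n-1 and the graph is given as an edge list (both are illustrative choices, not from the talk). It uses breadth-first search, which runs in O(n + m) time and keeps fewer than n edges while preserving connectivity:

```python
from collections import deque

def spanning_forest(n, edges):
    """Return a subset of the edges that forms a spanning forest (BFS)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = [False] * n
    forest = []
    for root in range(n):
        if seen[root]:
            continue
        seen[root] = True
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    forest.append((u, v))  # first edge that reaches v
                    queue.append(v)
    return forest

# Hypothetical example: a triangle, a separate edge, and an isolated vertex.
example_edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
forest = spanning_forest(6, example_edges)
```

The forest has n minus (number of connected components) edges, so two vertices are connected in the forest exactly when they are connected in the original graph.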
MORE INTRICATE STRUCTURES
k-connectivity: # of disjoint
paths between s-t
(Menger’s theorem / maxflow-mincut)
Cut: # of edges leaving
a subset of vertices
Stronger: weights of
all 2^n cuts in graphs
[Benczur-Karger `96]: for ANY G, can sample to
get H with O(nlogn) edges s.t. G ≈ H on all cuts
≈: multiplicative approximation
HOW TO SAMPLE?
Widely used: uniform sampling
Works well when data is
uniform e.g. complete graph
Problem: long path, removing
any edge changes connectivity
(can also have both in one graph)
More systematic view of sampling?
ALGEBRAIC REPRESENTATION OF GRAPHS
Graph Laplacian matrix L (n rows / columns, O(m) non-zeros)
• Diagonal: degree
• Off-diagonal: -edge weights
Example with n = 3 vertices and m = 2 edges of weight 1:
L =   2 -1 -1
     -1  1  0
     -1  0  1
Edge-vertex incidence matrix B (m rows, n columns):
B_eu = -1/1 if u is an endpoint of e, 0 otherwise
B =   1 -1  0
     -1  0  1
L is the Gram matrix of B, L = B^T B
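As a quick check of L = B^T B, the following sketch rebuilds the 3-vertex example with plain Python lists. The edge orientation (+1 at the first endpoint, -1 at the second) and the helper names are assumptions made for illustration; any consistent orientation gives the same L.

```python
def incidence_matrix(n, edges):
    """One row per edge, one column per vertex: +1 at u, -1 at v."""
    B = [[0] * n for _ in edges]
    for row, (u, v) in zip(B, edges):
        row[u], row[v] = 1, -1
    return B

def gram(B):
    """Return B^T B as a list of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

# Vertex 0 is joined to vertices 1 and 2 by unit-weight edges.
edges = [(0, 1), (2, 0)]
B = incidence_matrix(3, edges)
L = gram(B)
```

The diagonal of L holds the degrees (2, 1, 1) and each off-diagonal entry is minus the weight of the corresponding edge.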
SPECTRAL SIMILARITY
Numerical analysis:
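L_G ≈ L_H if x^T L_G x ≈ x^T L_H x for all vectors x
Gram matrix: L_G = B_G^T B_G, so x^T L_G x = ‖B_G x‖_2^2, where ‖y‖_2^2 = Σ_i y_i^2
Equivalently: ‖B_G x‖_2 ≈ ‖B_H x‖_2 ∀ x
B_eu = -1/1 if u is an endpoint of e, 0 otherwise
For edge e = uv, (B_e: x)^2 = (x_u – x_v)^2
x ∈ {0, 1}^V: ‖B_G x‖_2^2 = size of cut given by x
(e.g. x_u = 1, x_v = 0 contributes (1-0)^2 = 1; x_u = 1, x_z = 1 contributes (1-1)^2 = 0)
So spectral similarity implies G ≈ H on all cuts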
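The identity between the quadratic form and cut sizes can be checked directly. This is a small sketch on a hypothetical unweighted 4-cycle; the function name is illustrative only.

```python
def cut_size(edges, x):
    """||Bx||_2^2 = sum over edges uv of (x_u - x_v)^2."""
    return sum((x[u] - x[v]) ** 2 for u, v in edges)

# Hypothetical 4-cycle 0-1-2-3-0.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
side = [1, 1, 0, 0]            # indicator vector of the vertex subset {0, 1}
crossing = cut_size(cycle, side)
```

For a 0/1 vector x, each edge contributes 1 when its endpoints lie on opposite sides and 0 otherwise, so the sum counts the edges leaving the subset.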
ALGEBRAIC VIEW OF SAMPLING EDGES
L2 row sampling:
Given B with m >> n, sample a few
rows to form B’ s.t. ‖Bx‖_2 ≈ ‖B’x‖_2 ∀ x
[Figure: B is m × n, B’ is ≈n × n; a row 0 -1 0 0 0 1 0 of B appears in B’ rescaled as 0 -5 0 0 0 5 0]
Note: normally use A instead of B,
n and d instead of m and n
IMPORTANCE SAMPLING
Keep a row, bi, with probability pi,
rescale if kept to maintain expectation
Uniform sampling: p_i = 1/k
for a factor k size reduction
Issue: only one non-zero row
Norm sampling:
p_i = (m/k) ‖b_i‖_2^2 / ‖B‖_F^2
Issue: column with one entry
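The sampling scheme above can be sketched as follows. Rescaling a kept row by 1/sqrt(p_i) is the standard way to keep E[B'^T B'] = B^T B; capping the norm-sampling probabilities at 1 is an assumption added here so they stay valid probabilities, and all function names are hypothetical.

```python
import random

def sample_rows(B, p, rng):
    """Keep row i with probability p[i]; rescale kept rows by 1/sqrt(p[i])."""
    kept = []
    for row, pi in zip(B, p):
        if rng.random() < pi:
            scale = pi ** -0.5
            kept.append([scale * entry for entry in row])
    return kept

def uniform_probs(B, k):
    """p_i = 1/k for a factor-k size reduction."""
    return [1.0 / k] * len(B)

def norm_probs(B, k):
    """p_i = (m/k) * ||b_i||_2^2 / ||B||_F^2, capped at 1 (assumption)."""
    m = len(B)
    sq = [sum(e * e for e in row) for row in B]
    total = sum(sq)
    return [min(1.0, (m / k) * s / total) for s in sq]

rng = random.Random(0)
B = [[1, 0], [0, 1], [3, 0], [0, 3]]
B_sampled = sample_rows(B, norm_probs(B, 2), rng)
```

Neither choice of probabilities is safe in general, which is what the two "Issue" lines point out: uniform sampling can miss the single non-zero row, and norm sampling can miss a low-norm row that is the only one touching some column.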
THE `RIGHT’ PROBABILITIES
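τ: L2 statistical leverage scores
τ_i = b_i^T (B^T B)^{-1} b_i = ‖b_i‖^2_{L^{-1}}
b_i: row i of B, L = B^T B
Examples:
• Only one non-zero row: τ = 1 for that row, 0 for the others
• Column with one entry: τ = 1 for the row holding that entry, n/m for the others
• Path + clique: τ = 1 on path edges, 1/n on clique edges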
L2 MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores
τ_i = b_i^T (B^T B)^{-1} b_i = ‖b_i‖^2_{L^{-1}}
[Rudelson, Vershynin `07], [Tropp `12]: sampling with
p_i ≥ τ_i · O(log n) gives B’ s.t. ‖Bx‖_2 ≈ ‖B’x‖_2 ∀ x w.h.p.
[Foster `49]: Σ_i τ_i = rank ≤ n ⇒ O(n log n) rows
Near optimal:
• L2-row samples of B
• Graph sparsifiers
• In practice, replacing the O(log n) factor with 5 usually suffices
• Can also improve via derandomization
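To make the leverage-score definition concrete, here is a minimal sketch that evaluates τ_i = b_i^T (B^T B)^{-1} b_i with a small dense solver. It assumes B has full column rank (so B^T B is invertible), which does not hold for a graph incidence matrix without using a pseudo-inverse; the example matrix and helper names are hypothetical.

```python
def solve(M, v):
    """Solve M y = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, M[i])) + [float(v[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def leverage_scores(B):
    """tau_i = b_i^T (B^T B)^{-1} b_i, assuming full column rank."""
    cols = list(zip(*B))
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    return [sum(a * y for a, y in zip(row, solve(gram, row))) for row in B]

# Four identical rows plus one row that is the only entry in its column.
B = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
tau = leverage_scores(B)
```

Here the last row gets score 1 (it must be kept), the four identical rows share a total score of 1, and the scores sum to the rank, 2, in line with the [Foster `49] identity.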
MY APPROACH TO ALGORITHM
DESIGN
Extend insights gained from studying problems at the
intersection of multiple areas back to these areas
Problems at their intersection: algorithmic extensions of structure-preserving sampling
Combinatorics / Discrete algorithms: maximum flow
Numerical analysis / Optimization: solving linear systems
Statistics / Randomized algorithms: preserving L1-structures
SUMMARY
• Algorithm design approach: study problems at the
intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.
Graph Laplacian
• Diagonal: degree
• Off-diagonal: -weight
Lx = b
Directly related to:
• Elliptic problems
• SDD, M, and H-matrices
Application: estimate all τ_i = ‖b_i‖^2_{L^{-1}} by solving O(log n) linear systems
Combinatorics /
Discrete algorithms
Solvers for linear systems
involving graph Laplacians
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
Current fastest sequential and parallel
solvers for linear systems in graph Laplacians
ALGORITHMS FOR Lx = b
Given any graph Laplacian L with n vertices and
m edges, any vector b, find vector x s.t. Lx = b
[Vaidya `89]: use graph theory!
[Spielman-Teng `04]: O(m log^c n)
[P-Spielman `14]: alternate,
fully parallelizable approach
[Figure: loglog plot of c by year: 2004: 70, 2006: 32, 2009: 15, 2010: 6, 2010: 2, 2011: 1, 2014: 1/2]
ITERATIVE METHODS
Simplification: assume L = I – A,
A: transition matrix of random walk
Division using multiplication:
I + A + A^2 + A^3 + … = (I – A)^{-1} = L^{-1}
Spectral theorem: can view as scalars
Richardson iteration: truncate to i terms,
approximate x = (I – A)^{-1} b with x^(i) = (I + A + … + A^i) b
RICHARDSON ITERATION
Evaluation (Horner’s rule):
• (I + A + A^2) b = A(Ab + b) + b
• i terms: x^(0) = b, x^(i+1) = A x^(i) + b
  (i matrix-vector multiplications)
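The Horner-style recurrence can be sketched in a few lines. The 2 × 2 matrix below is a hypothetical example scaled so that the series converges (its eigenvalues are ±1/2); it is not a full random-walk transition matrix.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def richardson(A, b, iterations):
    """x(0) = b, x(i+1) = A x(i) + b, i.e. (I + A + ... + A^i) b."""
    x = list(b)
    for _ in range(iterations):
        x = [ax + bi for ax, bi in zip(matvec(A, x), b)]
    return x

A = [[0.0, 0.5], [0.5, 0.0]]
b = [1.0, 0.0]
x_two_steps = richardson(A, b, 2)    # A(Ab + b) + b
x_converged = richardson(A, b, 60)   # close to (I - A)^{-1} b = [4/3, 2/3]
```

Each step costs one matrix-vector multiplication, and the error shrinks by the spectral radius of A per step, which is why well-connected graphs need only a few terms.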
Can interpret as gradient descent
Highly connected graphs: few terms ok
#terms needed lower bounded
by information propagation
b, Ab, A^2 b, …, A^diameter b
Need n matrix operations?
DEGREE N ⇒ N OPERATIONS?
Repeated squaring: A^16 = (((A^2)^2)^2)^2, 4 operations
(I – A)^{-1} = I + A + A^2 + A^3 + …
= (I + A)(I + A^2)(I + A^4)…
Dense matrix!
• O(log n) terms ok
• Similar to multi-level methods
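A small sketch of why the product form needs only logarithmically many factors: k factors of (I + A)(I + A^2)(I + A^4)… reproduce the first 2^k terms of the power series. The dense list-of-lists matrices and helper names are illustrative assumptions; this version does not sparsify, so the squared matrices can become dense.

```python
def matmul(X, Y):
    """Dense matrix product."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def product_form(A, k):
    """(I + A)(I + A^2)...(I + A^(2^(k-1))), one squaring per factor."""
    n = len(A)
    result, power = identity(n), A
    for _ in range(k):
        result = matmul(result, add(identity(n), power))
        power = matmul(power, power)  # A^(2^j) -> A^(2^(j+1))
    return result

def truncated_series(A, terms):
    """I + A + ... + A^(terms-1), one multiplication per extra term."""
    n = len(A)
    total, power = identity(n), identity(n)
    for _ in range(terms - 1):
        power = matmul(power, A)
        total = add(total, power)
    return total

A = [[0.0, 0.5], [0.5, 0.0]]
three_factors = product_form(A, 3)
eight_terms = truncated_series(A, 8)
```

Three factors match eight series terms here, so reaching degree n takes about log n squarings rather than n multiplications.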
Combinatorial view:
• A: step of random walk
• I – A^2: Laplacian of the 2 step random walk
Still a graph Laplacian! Can sparsify!
REPEATED SPARSE SQUARING
(I – A)^{-1} = (I + A)(I + A^2)(I + A^4)…
Combining known tools: efficiently
sparsify I – A^2 without computing A^2
[P-Spielman `14]: approximate L^{-1}
with O(log n) sparse matrices
key ideas: modify factorization to allow
gradual introduction and control of error
SUMMARY
• Algorithm design approach: study problems at the
intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.
• Solve Lx=b efficiently via sparsified squaring.
FEW ITERATIONS OF Lx = b
• [Tutte `61]: graph drawing, embeddings
• [ZGL `03], [ZHS `05]: inference on graphical models
• [CFMNPW `14]: Helmholtz decomp. on 3D mesh
Inverse powering: eigenvectors / heat kernel:
• [AM `85] spectral clustering
• [OSV `12]: balanced cuts
• [SM `01][KMST `09]: image segmentation
MANY ITERATIONS OF Lx = b
[Karmarkar, Ye, Renegar, Nesterov, Nemirovski …]:
convex optimization via solving O(m^{1/2}) linear systems
[DS `08]: optimization on graphs ⇒ Laplacian systems
[CKMST `11]: faster approx maximum flow
[KMP `12]: multicommodity flow
[KM `09][MST`14]: random spanning trees
MAXFLOW
Combinatorics /
Discrete algorithms
Maximum flow
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
First O(m polylog(n)) time algorithm for
approximate undirected maxflow
MAXIMUM FLOW PROBLEM
(for unweighted, undirected graphs)
Given s, t, find the maximum number of disjoint s-t paths
Applications:
• Clustering
• Image processing
• Scheduling
Dual: separate s and t by removing the fewest edges
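For reference, the problem statement above can be sketched with the classical augmenting-path method (BFS on residual capacities). This is the textbook approach, not the nearly-linear time algorithm discussed in this talk, and the adjacency-matrix representation is an assumption chosen for brevity.

```python
from collections import deque

def max_flow_unit(n, edges, s, t):
    """Max number of edge-disjoint s-t paths in an undirected graph."""
    cap = [[0] * n for _ in range(n)]
    for u, v in edges:
        cap[u][v] += 1   # an undirected unit edge is usable in both directions
        cap[v][u] += 1
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return flow          # no augmenting path is left
        v = t
        while v != s:            # push one unit along the path found
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1

# Hypothetical example: two routes from 0 to 3 plus a cross edge.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
paths = max_flow_unit(4, diamond, 0, 3)
```

By maxflow-mincut duality the returned value also equals the fewest edges whose removal separates s from t.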
WHAT MAKES MAXFLOW HARD
Highly connected:
route up to n paths
Long paths: a step
may involve n vertices
Each ‘easy’ on their own
Goal: handle both and do better
than many steps × long paths = n^2
ALGORITHMS FOR FLOWS
Ideas introduced:
1970s: blocking flows
1980: dynamic trees
1986: dual algorithms
1989: connections to Lx = b
2010: few calls to Lx = b
2013: modify Lx = b
Current fastest maxflow algorithms:
• Exact (weakly-polytime): invoke Lx=b
• Approximate: modify algorithms for Lx=b
[P `14]: (1 – ε)-approx maxflow in O(m log^c n ε^{-2}) time
MAXIMUM FLOW IN ALMOST LINEAR TIME
[Sherman `13], [Kelner-Lee-Orecchia-Sidford `13]: can find approximate maxflow
iteratively via several calls to a structure approximator
Algebraic formulation of min s-t cut:
Minimize ‖Bx‖_1 subject to x_s = 0, x_t = 1 and x integral
‖*‖_1: 1-norm, sum of absolute values
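The algebraic formulation of min s-t cut can be checked by brute force on a tiny graph. This sketch restricts x to 0/1 vectors (each one marks a side of a cut) and takes exponential time, so it is only an illustration of the objective, not an algorithm from the talk; the function name is hypothetical.

```python
from itertools import product

def min_st_cut(n, edges, s, t):
    """Minimize ||Bx||_1 over x in {0,1}^V with x_s = 0 and x_t = 1."""
    best = None
    for x in product((0, 1), repeat=n):
        if x[s] != 0 or x[t] != 1:
            continue
        value = sum(abs(x[u] - x[v]) for u, v in edges)  # ||Bx||_1
        if best is None or value < best:
            best = value
    return best

# Hypothetical example: two routes from 0 to 3 plus a cross edge.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
cut_value = min_st_cut(4, diamond, 0, 3)
```

On this graph the minimum is 2, the same as the number of edge-disjoint paths from vertex 0 to vertex 3.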
[Madry `10]: finding O(m^{1+θ}) sized approximator that requires O(m^θ)
calls in O(m^{1+θ}) time (for any θ > 0)
Maxflow ↔ Approximator: O(m^{1+2θ} ε^{-2}) time
[Racke-Shah-Taubig `14]: O(n) sized approximator that requires O(log^c n)
iterations via solving maxflows on graphs of total size O(m log^c n)
Maxflow ↔ Approximator: O(m log^c n ε^{-2}) time?
ALGORITHMIC SOLUTION
Ultra-sparsifier (e.g. [Koutis-Miller-P `10]): for any k, can
find H close to G, but equivalent to graph of size O(m/k)
Key step: vertex reductions via edge reductions
[P `14]: build approximator on the smaller graph
Absorb additional (small) error via more calls to approximator
Recurse on instances with smaller total size, total cost: O(m log^c n)
[CLMPPS `15]: extends to numerical data, has close connections to
variants of Nystrom’s method
SUMMARY
• Algorithm design approach: study problems at the
intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.
• Solve Lx=b efficiently via sparsified squaring.
• Approximate maximum flow routines and structure
approximators can be constructed recursively
from each other via graph sparsifiers.
RANDOMIZED NUMERICAL
LINEAR ALGEBRA
Combinatorics /
Discrete algorithms
L1-preserving
row sampling
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
First near-optimal routine for row sampling
matrices in a 1-norm preserving manner
GENERALIZATION
Generalization of row sampling:
given A, q, find A’ s.t. ‖Ax‖_q ≈ ‖A’x‖_q ∀ x
q-norm: ‖y‖_q = (Σ_i |y_i|^q)^{1/q}
1-norm: standard for representing cuts,
used in sparse recovery / robust regression
[Figure: ‖y‖_2 vs. ‖y‖_1]
Applications (for general A):
• Feature selection
• Low rank approximation / PCA
ROW SAMPLING ROUTINES
A’ s.t. ‖Ax‖_q ≈ ‖A’x‖_q ∀ x
How special is L2?
nnz: # of non-zeros in A
Work                              #rows for q=2   #rows for q=1      Runtime
Dasgupta et al. `09               -               n^2.5              mn^5
Magdon-Ismail `10                 n log^2 n       -                  mn^2
Sohler-Woodruff `11               -               n^3.5              mn^{ω-1+θ}
Drineas et al. `12                n log n         -                  mn log n
Clarkson et al. `12               -               n^4.5 log^1.5 n    mn log n
Clarkson-Woodruff `12             n^2 log n       n^8                nnz
Mahoney-Meng `12                  n^2             n^3.5              nnz + n^6
Nelson-Nguyen `12                 n^{1+θ}         -                  nnz
Li et al. `13, Cohen et al. `14   n log n         n^3.66             nnz + n^{ω+θ}
Omitting corresponding empirical studies
[Naor `11][Matousek `97]: on graphs, L2 approx ⇒ Lq approx ∀ 1 ≤ q ≤ 2
How special are graphs?
L1 ROW SAMPLING
[Talagrand `90, “Embedding subspaces of L1 into ℓ_1^N”]
can be analyzed as row-sampling / sparsification
Sampling with p_i ≥ w_i · O(log n) gives ‖Ax‖_1 ≈ ‖A’x‖_1 ∀ x
L1 Lewis weights ([Lewis `78]):
w s.t. w_i^2 = a_i^T (A^T W^{-1} A)^{-1} a_i, where W = diag(w)
Recursive definition!
Can check: Σ_i w_i ≤ n ⇒ O(n log n) rows
[Cohen-P `14]:
Update w on LHS with w on RHS:
w’_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}
Aside: similar to iterative
reweighted least squares
Converges in loglog n steps: analyze A^T W^{-1} A spectrally
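The fixed-point update for L1 Lewis weights can be sketched as below, with W = diag(w). The sketch assumes A has full column rank and no zero rows, starts from w = 1, and runs a fixed number of iterations instead of using a convergence test; the dense solver and all names are illustrative assumptions.

```python
def solve(M, v):
    """Solve M y = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, M[i])) + [float(v[i])] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def lewis_weights_l1(A, iterations=40):
    """Iterate w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^(1/2) from w = 1."""
    n = len(A[0])
    w = [1.0] * len(A)
    for _ in range(iterations):
        # M = A^T W^{-1} A, built from the current weights
        M = [[sum(row[i] * row[j] / wi for row, wi in zip(A, w))
              for j in range(n)] for i in range(n)]
        w = [sum(a * y for a, y in zip(row, solve(M, row))) ** 0.5 for row in A]
    return w

# Two identical rows in one dimension: each should get weight 1/2.
w_pair = lewis_weights_l1([[1.0], [1.0]])
# Hypothetical 3 x 2 example: the weights should sum to n = 2.
w_three = lewis_weights_l1([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

At the fixed point each w_i equals the leverage score of row i of W^{-1/2} A, which is why the weights sum to the rank of A.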
q        Previous # of rows   New # of rows            Runtime
1        n^2.5                n log n                  nnz + n^{ω+θ}
1<q<2    n^{q/2+2}            n log n (loglog n)^2     nnz + n^{ω+θ}
2<q      n^{q+1}              n^{q/2} log n            nnz + n^{q/2+O(1)}
Elementary, optimization motivated
proof of w.h.p. concentration for L1
SUMMARY
• Algorithm design approach: study problems at the
intersection of areas, and extend insights back.
• Can sparsify objects via importance sampling.
• Solve Lx=b efficiently via sparsified squaring.
• Approximate maximum flow routines and cut approximators can be constructed recursively
from each other via graph sparsifiers.
• Wider ranges of structures can be sparsified, key
statistical quantities can be computed iteratively.
I’VE ALSO WORKED ON
• Dynamic graph
data structures
• Graph partitioning
• Parallel algorithms
• Image processing
• Anomaly / sybil
detection in graphs
FUTURE WORK:
LINEAR SYSTEM SOLVERS
Combinatorics /
Discrete algorithms
Solvers for linear systems
involving graph Laplacians
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
• Wider classes of linear systems
• Relation to optimization / learning
FUTURE WORK:
COMBINATORIAL OPTIMIZATION
Combinatorics /
Discrete algorithms
Maximum flow
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
Faster algorithms for more classical
algorithmic graph theory problems?
FUTURE WORK: RANDOMIZED
NUMERICAL LINEAR ALGEBRA
Combinatorics /
Discrete algorithms
L1-preserving
row sampling
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
• Other algorithmic applications of Lewis weights?
• Low-rank approximation in L1?
• O(n)-sized L1-preserving row samples?
(these exist for L2)
SUMMARY
Combinatorics /
Discrete algorithms
Problems at their
intersection
Numerical analysis /
Optimization
Statistics /
Randomized algorithms
Links to arXiv manuscripts and videos of more
detailed talks are at: math.mit.edu/~rpeng/