>> Yuval Peres: All right. Good morning. We're happy to welcome Richard
Peng, who will tell us about faster regression algorithms using spectral
graph theory.
>> Richard Peng: Thanks, Yuval. So the talk will be structured as follows.
I'll first start off by talking about some regression algorithms and why and
how we solve them. Then I'll move on to talk about fast solvers for linear
systems, and then I'll finish by describing some of the graph theoretic tools
needed for such solvers, such as tree embeddings.
So let me start off with a problem that you have probably all seen in one way
or another, which is the learning/inference problem. For this problem, you're
given a signal that's potentially noisy, but you know there is some very good
underlying structure to it. You want to remove the noise so that you can
extract both the real signal, which is the sort of step function on the right,
and the hidden pattern behind it as well.
So because this is an important problem, there have been a lot of tools
developed to solve it. One of the most commonly used tools is a regression
objective. In this kind of objective, I put some constraints on a vector X,
and I want to minimize some norm of this vector, usually a P norm.
So P is usually picked to be at least one for this to be a convex problem, and
the constraints on X are usually convex constraints. The simplest form of
these convex constraints are linear equalities, so X lies in some subspace.
Usually, X can be either the underlying structure or the de-noised signal.
So this kind of problem has many applications, and let me just describe a few
of them very quickly. The first one is probably more well known than the
objective itself, which is lasso. This comes from an area known as compressive
sensing, and one key objective there is the lasso objective by Tibshirani,
which aims to minimize the one norm of X, subject to AX equal to S, the signal.
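To make this objective concrete, here is a minimal sketch of solving min ||x||_1 subject to Ax = s as a linear program, using the standard split x = x_plus minus x_minus. This is my own illustration, not code from the talk; the function name basis_pursuit and the random test instance are made up.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, s):
    """Minimize ||x||_1 subject to A x = s by splitting x = xp - xn with xp, xn >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)              # sum(xp) + sum(xn) equals ||x||_1 at the optimum
    A_eq = np.hstack([A, -A])       # equality constraint: A xp - A xn = s
    res = linprog(c, A_eq=A_eq, b_eq=s, bounds=(0, None))
    xp, xn = res.x[:n], res.x[n:]
    return xp - xn

# Tiny made-up example: a 3-sparse vector recovered from 20 random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x_hat = basis_pursuit(A, A @ x_true)
print(np.round(x_hat[[3, 17, 41]], 3))   # typically recovers the sparse entries
```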
So this was shown to produce extremely structured and sparse output, and it
also is very robust; it's extremely resilient to noise. So this is one -- this
has a lot of applications.
Also, the other application I want to talk about is image processing. So here,
this is an example of a process known as Poisson image blending. The idea is
that I have two images that really don't belong with each other -- for example,
a polar bear and a picture of Mars taken from a space rover -- and I blend them
into each other. So I want to somehow merge the boundaries of these two things.
So the objective to be solved is defined over some underlying graph on these
pixels, and I try to minimize some energy function depending on the differences
of neighboring pixels and the ideal values that I want. So the objective up
there is a sum over edges IJ of X I minus X J minus some target difference, and
by solving this, you get the picture on the right. So you can create these
fairly strange scenes.
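As an illustration (my own sketch, not from the talk): the least-squares version of this gradient-matching objective, a sum over edges of (x_i minus x_j minus g_ij) squared, leads to a linear system in a graph Laplacian. Below is a tiny one-dimensional version on a path graph, with made-up target differences g and two fixed boundary pixels.

```python
import numpy as np

def gradient_domain_fit(n, g, anchors):
    """Minimize sum_i (x[i+1] - x[i] - g[i])^2 over a path graph,
    with some pixel values fixed ('anchors' maps index -> value)."""
    # D is the edge-vertex incidence matrix of the path; D^T D is its Laplacian.
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    L = D.T @ D                     # graph Laplacian of the path
    b = D.T @ g                     # right-hand side from the target differences
    # Enforce anchors by substitution: fix those coordinates, solve for the rest.
    free = [i for i in range(n) if i not in anchors]
    x = np.zeros(n)
    for i, v in anchors.items():
        x[i] = v
    rhs = b[free] - L[np.ix_(free, list(anchors))] @ np.array(list(anchors.values()))
    x[free] = np.linalg.solve(L[np.ix_(free, free)], rhs)
    return x

# Target differences come from one "image", boundary values from the other.
g = np.ones(9) * 0.1
print(np.round(gradient_domain_fit(10, g, {0: 0.0, 9: 5.0}), 2))
```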
And the third application is a problem that perhaps the more theoretically
oriented of us are more familiar with, which is the minimum cut problem. This
problem asks, given a graph and two special vertices S and T, to remove the
fewest edges so that S and T become disconnected in the graph. This is a
classic problem in combinatorial optimization, and it's not very clear that it
is a regression problem, but as it turns out, you can formulate it as one. So
consider the following formulation: I want to label the vertices of the graph
so that S gets labeled zero, T gets labeled one, and I want to minimize the sum
over all edges of the difference of the labels at the endpoints of that edge.
So the connection is still not as clear as I would like to show, but one
direction of it I think I can convince you of: I can label everything on one
side of the cut zero, and label everything on the other side of the cut one.
And the size of the cut is simply obtained by summing over all edges: any edge
that crosses the cut has a difference of one, and any other edge has a
difference of zero. So summing over all these edges gives me the size of the
cut.
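The labeling objective just described, minimize the sum over edges of |x_i - x_j| with x_S = 0 and x_T = 1, can be written as a small linear program. The sketch below is my own illustration (the helper name mincut_lp and the example graph are made up), using one slack variable per edge.

```python
import numpy as np
from scipy.optimize import linprog

def mincut_lp(n, edges, s, t):
    """LP relaxation of s-t min cut: minimize sum_e d_e subject to
    d_e >= x_i - x_j, d_e >= x_j - x_i, x_s = 0, x_t = 1.
    Variables are [x_0..x_{n-1}, d_0..d_{m-1}]."""
    m = len(edges)
    c = np.concatenate([np.zeros(n), np.ones(m)])
    A_ub, b_ub = [], []
    for e, (i, j) in enumerate(edges):
        for sign in (+1, -1):
            row = np.zeros(n + m)
            row[i], row[j], row[n + e] = sign, -sign, -1.0
            A_ub.append(row)            # sign*(x_i - x_j) - d_e <= 0
            b_ub.append(0.0)
    A_eq = np.zeros((2, n + m)); b_eq = np.array([0.0, 1.0])
    A_eq[0, s], A_eq[1, t] = 1.0, 1.0
    bounds = [(None, None)] * n + [(0, None)] * m
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun, res.x[:n]

# 4-cycle with a chord: cutting two edges separates vertex 0 from vertex 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
val, labels = mincut_lp(4, edges, s=0, t=2)
print(round(val, 3), np.round(labels, 2))   # optimal value 2.0
```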
So hopefully, I've convinced you that these problems exist in a variety of
contexts, which means that there is reason to study a very efficient algorithm
for them.
>>: Sorry, for the last problem, why are the Xs restricted to zero/one?
>> Richard Peng: So actually, you can relax them to real numbers, and there
are results that show that the optimal fractional solution is equal to the
optimal integer solution.
>>: All right.
>> Richard Peng: So the algorithms for minimizing these types of objectives
have been studied for a very long time in computer science, and the history of
these algorithms can essentially be divided into 20-year intervals. In the '40s
through the '60s, with the simplex method and the Kuhn-Tucker conditions, it
was shown that these problems are tractable, so there exist algorithms for
solving them.
From the '60s to the '80s, there was work on the ellipsoid algorithm and a
variety of other methods that showed these problems can be solved in polynomial
time, which is kind of the gold standard for theoretical efficiency. And
between the '80s and '90s, there were further developments which led to
interior point algorithms, which showed that these problems can be solved very
efficiently.
So the general idea of interior point algorithms is that they solve one of
these problems using about square root of M Newton steps. And to describe
efficiency, I let M be the number of non-zeros, which is the size of the
problem, and this tilde O notation hides the log factors; I will use it
consistently throughout this talk as well.
So before continuing talking about algorithms, let me first explain why we need
fast algorithms. The reason is that a lot of the data that come from these
regression problems are big. For example, this picture, which is almost like a
cell phone camera picture, has about ten to the six pixels, so M is at least
ten to the six. And if you go to videos or 3D medical data, you can easily get
to data that's about ten to the nine or bigger.
And if you look at the key subroutine that's used in interior point
algorithms, it's this idea of a Newton step. So interior point algorithms
solve one of these problems using about square root of M Newton steps. And
what is a Newton step? There's a fairly simple characterization of it, which
is that I'm solving a linear system. So the key bottleneck in getting fast
algorithms for these regression problems, at least the way we know them today,
is that we need fast linear system solvers.
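To make the point that a Newton step is a linear solve concrete, here is a generic Newton iteration (a minimal sketch of my own, not the interior point machinery from the talk); the toy objective is made up.

```python
import numpy as np

def newton_minimize(grad, hess, x0, iters=20):
    """Generic Newton's method: each step solves the linear system H(x) dx = -g(x)."""
    x = x0.astype(float)
    for _ in range(iters):
        g, H = grad(x), hess(x)
        dx = np.linalg.solve(H, -g)   # the linear solve is the per-step bottleneck
        x = x + dx
    return x

# Toy smooth convex objective: f(x) = sum(exp(x_i) + exp(-x_i)) + 0.5*||x - c||^2.
c = np.array([1.0, -2.0, 0.5])
grad = lambda x: np.exp(x) - np.exp(-x) + (x - c)
hess = lambda x: np.diag(np.exp(x) + np.exp(-x) + 1.0)
print(np.round(newton_minimize(grad, hess, np.zeros(3)), 4))
```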
To give further reason that we actually want fast linear system solvers, I'm
going to use this quote from Boyd and Vandenberghe, which says that if you look
at the number of iterations for a randomly generated typical SDP instance, it
only grows very slowly as the problem size grows. So the chart on the right is
a figure that plots the number of iterations against the log of the problem
size, and as you can see, it doesn't really grow much.
But the sharper-eyed among you may notice that the problem size only tops out
at about ten to the three, so it's only about a thousand. And there's a good
reason for this, which is that we really don't have good ways of solving
general linear systems very fast. Linear system solvers are actually one of
the earliest studied algorithmic problems; they were worked on as early as the
first century, and I think Newton worked on this as well.
It also happens to be one of the most rediscovered algorithms, in that the
algorithm that keeps being rediscovered is known as Gaussian elimination, which
in modern terminology is an N cubed algorithm. It was shown by Strassen that
the exponent can be lowered to about 2.8, and much work followed, leading to
the state of the art today, which is about 2.3727. But if you think about this
number, even if you analyze it in conjunction with interior point algorithms,
we're still looking at a quadratic time algorithm, and what's more problematic
about this approach is that these algorithms also need quadratic space, which
is even more problematic when you're at about 10 to the 12.
So what's often used in practice, at least in practice today, are actually
methods that get around this issue: first order methods, so some type of
coordinate descent, gradient descent or subgradient method. These methods have
the nice property that the cost per iteration is very low, but they trade this
for slow convergence. And often what happens is that they run for a number of
iterations and then, because of time constraints, they're cut off. So one way
to think about these algorithms is that they are essentially trading solution
quality for time, because time is such a crucial resource for these
minimization problems. On the other hand, there has been some work recently
which showed that at least for some instances, we can solve linear systems
extremely fast. This is due to a work by Spielman and Teng in 2004, who showed
that for linear systems related to graphs, you can solve them in nearly linear
time, roughly M poly log N.
I will talk more about this algorithm in 12 slides, but first let me explore
some of the consequences of it. The consequence is that we now have, on one
hand, these second order methods that use linear systems and have a low
iteration complexity, so a low number of iterations; on the other hand, these
linear system solvers give us a low complexity per step as well. So if you
combine these two together, you get extremely efficient algorithms with very
good guarantees as well.
And in the study of graph algorithms, this led to the idea of the Laplacian
paradigm. Some representative works there are a work by Daitch and Spielman,
who gave a minimum cost flow algorithm that runs in about M to the 1.5 time,
and also a more recent work by Christiano, Kelner, Madry, Spielman and Teng,
who gave the state of the art algorithm for approximating maximum flow and
minimum cut in graphs. This algorithm runs in about M to the four thirds time.
So some of the work that I've done in this area is related to extending these
algorithms to the more general setting of graph-based regression. The first
extension that I worked on was in a joint work with Chin, Madry and Miller,
where we showed that it's possible to extend their algorithm to the more
general classes of regression objectives that come out of image processing.
So specifically, we showed that it's possible to solve the grouped L2
objective, which is a close cousin of the group lasso objective. And this
turns out to capture almost any convex image processing objective that we
could find.
So here is an example of denoising an image. So by setting up the objectives
correctly, you can first remove some of the small noise and if you overdo it,
you can also get to the image on the far right, which is a cartoon-like
version, where the person's ear is considered noise as well and also removed.
On the more theoretical side, in a joint work with Kelner and Miller, we showed
that it's possible to extend this algorithm to solving K-commodity flow, which
is where you're trying to route goods through a network, but these goods aren't
exactly interchangeable; so think about trying to route apples and oranges
through the same graph.
And this problem is actually very relevant to a lot of the practical regression
objectives; they're related to multi-variate labelings on graphs. So instead
of labeling every vertex with a single number, as in the minimum cut setting,
you're assigning K labels to each vertex and minimizing some objective
depending on their differences.
And in more recent work, joint with my advisor, Gary Miller, we show that for
more structured graphs, such as the graphs that you get out of images or
videos -- so basically any graph with a good separator structure -- you can
actually speed up the Christiano et al. framework to even faster running
times. So this is probably also relevant for a lot of these image-related
applications, because images do tend to have fairly good separator structures.
So the implication of fast solvers, if you take these general reductions into
account, is that they give fast algorithms for solving a variety of these
graph-based regression objectives. And at the same time, as I will talk about
later, they also lead to parallel algorithms.
There are actually a variety of other applications of fast solvers as well,
such as planar embedding, solving finite element systems and obtaining good
separators for graphs, so finding good balanced cuts of graphs. Due to time
constraints, I will not be able to talk much about them, but I'll be happy to
talk about these offline.
Okay. So for the next bit, I'll talk about how these fast solvers work. The
problem is that, given a matrix A and a vector B, I want to find the vector X
such that AX equals B.
To describe running times, I will let N denote the dimension of A, so A is an
N by N matrix, and it has M non-zero entries, so M is the size of the problem.
But because A is related to a graph, it actually has some very special
structure, which is that it's a graph Laplacian. There are many possible
formal definitions of a graph Laplacian. The simplest one is that it's the
degree matrix, the diagonal matrix containing the degrees, minus the adjacency
matrix of the graph.
The other way to think about it is entry by entry: the entry A II on the
diagonal is the degree of vertex I, and otherwise A IJ is the negation of the
weight of the edge, so it's negative W IJ. So consider this example of a graph
Laplacian on the left; the graph below is the graph corresponding to it. Here
the first vertex has degree three, with three edges coming into it, so it
corresponds to an entry of three in the matrix.
The first edge corresponds to those two negative one entries, because the
matrix is symmetric, so you put one on both sides. The next edge corresponds
to these two negative ones, and this one corresponds to the third pair.
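A minimal sketch of my own (with a made-up edge list) of building the Laplacian L = D minus A from a weighted graph:

```python
import numpy as np

def graph_laplacian(n, edges):
    """L = D - A for a weighted undirected graph; edges is a list of (i, j, w) triples."""
    L = np.zeros((n, n))
    for i, j, w in edges:
        L[i, i] += w          # weighted degree on the diagonal
        L[j, j] += w
        L[i, j] -= w          # negative weight off the diagonal (both sides, symmetric)
        L[j, i] -= w
    return L

# Triangle plus a pendant vertex, unit weights.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 1.0)]
L = graph_laplacian(4, edges)
print(L)
print("row sums:", L.sum(axis=1))   # each row of a Laplacian sums to zero
```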
There is actually an extension of graph Laplacians in numerical analysis known
as symmetric diagonally dominant (SDD) matrices. It was shown in work by
Gremban and Miller that solving graph Laplacians is essentially equivalent to
solving SDD matrices. So in terms of stating the result, this is often phrased
as solvers for SDD linear systems, but in terms of algorithms, it suffices to
just think about graph Laplacians.
So then, what are the issues with dealing with graphs?
>>: Sorry, on the previous one, are the WIJs ones, or can they be different
for every entry?
>> Richard Peng: They can be different for every entry. These are weighted
graphs.
>>: In this case, it's not the degree, it's just the row sum?
>> Richard Peng: Yeah, it's a weighted degree. So at various points in this
talk, I will actually go to unweighted graphs for simplicity; these all
generalize to the weighted case.
So one of the biggest challenges in dealing with graphs is that they can be
very unstructured. For example, this is a graph corresponding to a social
network. These are graphs that at the top level look almost like complete
graphs, in that there's very little good partitioning that happens, but we
know that there are some underlying structures in there.
And the other problem is that if you're using this kind of algorithm inside
other more sophisticated routines, the intermediate linear systems being
solved, because there are so few of them, almost capture the entire difficulty
of that optimization problem. We know that these regression problems are
difficult, which is why we want to solve them, but the difficulty has to go
somewhere.
So in some sense, a lot of the difficulty of the problem actually goes into
solving the linear systems. The result by Spielman and Teng can formally be
stated as follows: given an N by N graph Laplacian with M non-zero entries,
such that I know the exact answer is X -- so B equals AX for some vector X --
this algorithm is able to find you an approximate solution up to some epsilon,
and it runs in nearly linear time.
So notice that this algorithm is approximate in nature, in that there is an
error guarantee, but if you look at the running time, the dependency on
epsilon is only log of one over epsilon, so it's a very strong approximation.
And by nearly linear, I mean M poly log N, so M times log N to the power of C
for some constant C.
Also, the error is being measured in the A norm, so you should think about
this as the norm induced by the matrix: the A norm of X is the square root of
X transpose times A times X. This G on the slide should be an A, by the way.
And so this is a great algorithm from a theoretical point of view, but from a
practical point of view, there is a caveat. Notice that there's this log N to
the power of C in the running time. Personally, I've never been able to track
down what exactly the value of C is, so I just resort to authority for some
estimates. The estimate by Spielman is 70; by my advisor, Gary Miller, 32;
co-author Koutis says 15, Teng says 12, and the lowest I've heard is from
Lorenzo Orecchia, which is 6.
So consider the case where M is about ten to the six, so you're dealing with a
million by million linear system. Here, log N to the power of six is more than
a million. So this algorithm, for a lot of the practical instances that we're
dealing with, is essentially a quadratic time algorithm.
So in two joint works with Koutis and Miller, we showed that essentially the
same kind of guarantees can be obtained with a running time of M log squared
N, and also, in work in 2011, we showed that we can get to M log N. So we
essentially improved the exponent of the log in the running time to a single
log.
For the next bit of the talk, I'll try to describe briefly how this algorithm
works. The algorithm has three key components: iterative methods, graph
sparsifiers and low stretch embeddings.
I'll talk mostly about the last two, which means I'll try to give you the
two-minute version of iterative methods next. An iterative method is one of
these great ideas from numerical analysis, which states that if I want to
solve a linear system in a matrix A, I can solve it by solving linear systems
in a matrix B and iterating. So I can solve it by making a number of calls to
a system that's similar and solving that system instead.
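Here is a minimal sketch of that idea (my own illustration, not the actual solver): preconditioned Richardson iteration, where every step only requires a solve in the easier matrix B.

```python
import numpy as np

def preconditioned_richardson(A, B, b, iters=50):
    """Solve A x = b by repeatedly solving systems in a 'similar' matrix B:
    x <- x + B^{-1} (b - A x). Converges when B approximates A spectrally."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        r = b - A @ x                  # current residual
        x = x + np.linalg.solve(B, r)  # one (cheap) solve in B per iteration
    return x

# A: the triangle's Laplacian plus the identity (so it is nonsingular);
# B: its diagonal, a crude but valid choice of easier matrix.
A = np.array([[ 3.0, -1.0, -1.0],
              [-1.0,  3.0, -1.0],
              [-1.0, -1.0,  3.0]])
B = np.diag(np.diag(A))
b = np.array([1.0, 0.0, -1.0])
print(np.round(preconditioned_richardson(A, B, b), 6))
print(np.round(np.linalg.solve(A, b), 6))   # direct solve, for comparison
```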
The idea of similarity is defined through a spectral condition, which is
written using this spectral less-than-or-equal-to notation. Formally, A
spectrally less than or equal to B means that for any vector X, the A norm of
X is at most the B norm of X. But for all the uses I will make of this
spectral notation in this talk, I think I can appeal to your intuition about
scalars, in that the operations I will do with it are essentially identical to
manipulating less-than-or-equal-to for scalars.
And one of the key ideas in these solvers is that because A is a matrix that's
obtained from a graph, we might as well pick B to be a graph, and analyze this
kind of spectral less-or-equal condition using graph theoretic notions. So I
will replace A with G, B with H, and deal with graphs from this point on.
So because we want to find an easier H and we're dealing with graphs, there are
two obvious candidates for considering what's an easier graph. The first one
is fewer vertices. The second one is fewer edges. As it turns out, we can
reduce vertex count if the edge count is small. So I will just focus on
getting H with a smaller number of edges, so that H is similar to G.
So what I want is graph sparsifiers. These are sparse equivalents of graphs
that preserve some property, and they have been studied extensively in
computer science in various forms. Examples are spanners, which preserve
distances or the diameter of a graph, and cut sparsifiers, which preserve the
values of all the cuts up to some small factor. What we actually need is
spectral sparsifiers, which preserve the eigenvalues and eigenvectors, so the
spectrum of the graph.
Formally, what we need is an object known as an ultra-sparsifier, which is,
for a graph with N vertices and M edges, along with a parameter K, a graph
with N minus one plus M poly log N over K edges, such that these two graphs
are within a factor of K of each other.
So I essentially want to reduce the number of edges by the same factor as the
similarity factor, but I'm willing to lose some poly log factors. As it turns
out, this poly log factor determines the running time: it was shown by
Spielman and Teng that getting ultra-sparsifiers of this quality implies
solvers with essentially the same exponent of the log in the running time.
>>: What is the order of K that you take?
>> Richard Peng: So the order of K that would be used is slightly larger than
this poly log. The value of K that's used for these algorithms in the
Spielman-Teng framework is a poly log factor. So the result is only mildly
sparser from a theoretical point of view, but it's still more than a constant
factor reduction in the problem size.
And the Spielman-Teng work can essentially be viewed as obtaining
ultra-sparsifier constructions for a fairly large value of this poly log
factor.
Before we continue further, let me just do a sanity check and consider one
example of a graph that we want to build an ultra-sparsifier for: the complete
graph. For those of you who are familiar with solving linear systems on
complete graphs, you are probably aware that this is a fairly easy problem, in
the sense that you can just read off the answer.
But let's just say we want to get an ultra-sparsifier for it. I claim that
here random sampling works: I just pick about N log N random edges, but
rescale these edges so that, in expectation, the total weight is roughly the
same, so I rescale them up by a factor of N over log N. And it has been shown
that with high probability, the resulting graph is going to be an
ultra-sparsifier: it's going to have N log N edges, which is far fewer, and
it's also a constant factor approximation.
But this only works for complete graphs, because all the edges are roughly the
same. For general graphs, we do something slightly different, which is that
we want to have different probabilities for keeping each edge, because the
edges aren't all exactly the same as each other anymore.
So the general graph sampling mechanism, which is due to an earlier sampling
framework, is that I want to sample edges such that each edge is kept with
probability P of E. I want to associate a probability value with each edge
and sample each edge with that probability, and if I do pick that edge, I want
to scale it up to maintain the expectation: I scale it by a factor of roughly
one over P of E, so that in expectation I get the same edge back.
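A minimal sketch of this keep-with-probability-p-and-rescale scheme (my own illustration; the probabilities in the example are placeholders, not the carefully chosen values discussed next):

```python
import numpy as np

def sample_graph(edges, probs, rng):
    """Keep edge e with probability probs[e]; if kept, rescale its weight by 1/probs[e]
    so that the expected weight of every edge is unchanged."""
    kept = []
    for (i, j, w), p in zip(edges, probs):
        if rng.random() < p:
            kept.append((i, j, w / p))
    return kept

rng = np.random.default_rng(1)
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 1.0)]
probs = [0.5, 0.5, 0.5, 1.0]          # placeholder values for illustration only
print(sample_graph(edges, probs, rng))
print("expected number of kept edges:", sum(probs))
```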
There are a few quick observations that can be made about this framework. The
first is that the expected number of edges kept is equal to the sum of the
probabilities, just because in expectation each edge contributes its
probability P of E. And what we need to do is prove concentration for this
sampling.
So before I describe the concentration result, I need to take a slight detour
and describe a key quantity that's needed for measuring this concentration,
which is the effective resistance of an edge in a graph. This is another one
of those notions that can be defined in a bunch of ways. The physics
definition is that I'm going to view the graph as an electrical network, so
each edge is a conductor with conductance equal to its weight.
And the way you measure effective resistance is you literally take a voltmeter
and plug its two end points into those two vertices, and you try to run one
unit of current from one vertex to the other vertex. And the resistance value
between these two vertices is just the number you read off your voltmeter.
In general, the way you compute an effective resistance is that you solve GX
equal to an indicator vector and you read off the difference in the solution
between those two vertices. This is something that I think is taught in
circuits courses, but I don't really remember it, and from what I can
remember, it leads to fairly complicated calculations; once you're asked to do
Gaussian elimination on a five by five matrix on a test, there are some horror
stories.
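For small graphs this can be done directly; here is a sketch of my own that solves against the indicator vector via the Laplacian pseudoinverse:

```python
import numpy as np

def effective_resistance(L, u, v):
    """R_eff(u, v): solve L x = e_u - e_v and read off x[u] - x[v].
    Uses the pseudoinverse since a Laplacian is singular; fine for small dense L."""
    n = L.shape[0]
    chi = np.zeros(n)
    chi[u], chi[v] = 1.0, -1.0
    x = np.linalg.pinv(L) @ chi
    return x[u] - x[v]

# Unit-weight triangle: two parallel paths (length 1 and length 2) give R = 2/3.
L = np.array([[ 2.0, -1.0, -1.0],
              [-1.0,  2.0, -1.0],
              [-1.0, -1.0,  2.0]])
print(round(effective_resistance(L, 0, 1), 4))   # expect about 0.6667
```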
So let's just consider a simpler version, which is a single edge. Here, the
effective resistance of this edge is one over the weight of the edge. I will
also use a rule about resistors in series: if I have two resistors chained
together in series, one after the other, the resistance of the combination is
just the sum of the resistances of the two. So the two rules that are going
to be helpful are that the resistance is inversely proportional to the weight
of the edge, and that if I chain resistors together, the resistance of them as
a single component is just the sum of the resistances.
So now that I've revealed a little bit of the connection, I can state the
result by Spielman and Srivastava. What they showed is that by letting the
sampling probability equal the weight of the edge times its effective
resistance, times an extra log N factor, this will give a graph that's within
a factor of two of the original graph, with high probability. There are some
probabilistic issues I'm hiding under the rug, but what's more crucial is this
fact about the sum of these probabilities.
This is an amazing fact, which is that the sum over all edges of the weight
times the effective resistance in a complete graph is exactly N minus one.
It's a bit of an odd result, but let's consider some simple cases. Consider
the case where the graph is a tree and every edge has unit weight. For each
edge, the only path between its two endpoints is the edge itself, so each edge
has effective resistance one, and if you're dealing with a graph that's a
tree, you have N minus one edges, so the sum totals N minus one.
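A quick numerical sanity check of this sum (my own sketch; it recomputes effective resistances densely, which is fine at this scale) on the complete graph and on a path, which is a tree:

```python
import numpy as np

def laplacian(n, edges):
    L = np.zeros((n, n))
    for i, j, w in edges:
        L[i, i] += w; L[j, j] += w
        L[i, j] -= w; L[j, i] -= w
    return L

def sum_weight_times_resistance(n, edges):
    """Sum over edges of w_e * R_eff(e); should come out to n - 1."""
    Lp = np.linalg.pinv(laplacian(n, edges))
    total = 0.0
    for i, j, w in edges:
        chi = np.zeros(n); chi[i], chi[j] = 1.0, -1.0
        total += w * (chi @ Lp @ chi)
    return total

n = 6
complete = [(i, j, 1.0) for i in range(n) for j in range(i + 1, n)]
path = [(i, i + 1, 1.0) for i in range(n - 1)]
print(round(sum_weight_times_resistance(n, complete), 6))   # 5.0
print(round(sum_weight_times_resistance(n, path), 6))       # 5.0
```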
What's amazing is that this is actually true for general graphs. And the
implication of it is that every graph has a spectral sparsifier with N log N
edges. So we've constructed spectral sparsifiers, but we haven't really
gotten to ultra-sparsifiers or solvers. The reason is that there's actually a
slight caveat, which is essentially a chicken and egg problem. The issue is:
how do we compute the effective resistances? As I mentioned earlier, to
compute effective resistances on general graphs, you need to solve a linear
system.
And this was actually the case in the Spielman-Srivastava work as well. The
way they obtained the effective resistances was to use a solver to solve the
linear systems. But if you look at the solver, what it needs is spectral
sparsifiers. So it comes back around very quickly.
So most of our work can be considered a way to work around this issue, and
there are two things that we do. The first is that instead of using exact
effective resistances, we use fairly crude upper bounds on them. The second
is that we actually modify the problem so that we can get fairly good upper
bounds.
To get upper bounds, we use a result from electrical engineering known as
Rayleigh's monotonicity law, otherwise known as "I just unplugged my
computer": if I remove edges from a graph, or if I remove components from an
electrical circuit, the thing will stop working eventually, which means I've
made the resistance go to infinity at some point.
The milder version of that is that if I remove components from an electrical
circuit, the resistance between any two points can only increase. So to get
upper bounds for effective resistances, we can measure effective resistances
with respect to only a part of the graph, and the most aggressive version of
this is to calculate the effective resistance with respect to a tree. We don't
want infinite values, because that's really bad for sampling, but what we can
do is measure them with respect to a tree.
So we toss out all of the edges except a spanning tree of the graph. And
here, to measure the effective resistance between two vertices, only the tree
path between them matters: these two edges hanging off the path don't really
help the effective resistance, so all we need to do to measure the effective
resistance is to sum up the resistances along the path.
It turns out this is actually a known quantity: it's known as the stretch of
an edge with respect to the tree. And what we need is a tree for the graph
such that the total stretch of all the edges is small.
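Here is a small sketch of my own (using networkx for the tree paths) that computes the stretch of an edge with respect to a spanning tree as the edge weight times the sum of resistances along its tree path:

```python
import networkx as nx

def stretch(G, T, u, v):
    """Stretch of edge (u, v) w.r.t. spanning tree T:
    w_uv times the sum of resistances (1/weight) along the tree path from u to v."""
    path = nx.shortest_path(T, u, v)                      # unique path in a tree
    path_resistance = sum(1.0 / T[a][b]['weight'] for a, b in zip(path, path[1:]))
    return G[u][v]['weight'] * path_resistance

# Unit-weight 4-cycle, with the tree being the path 0-1-2-3.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)])
T = nx.Graph()
T.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)])
total = sum(stretch(G, T, u, v) for u, v in G.edges())
print(total)   # tree edges contribute 1 each; the off-tree edge (3, 0) has stretch 3
```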
What's even more amazing is that for any graph, it turns out there exists a
good tree. So for any graph, there exists a spanning tree for which the total
stretch of all the edges is roughly M log N. This is hiding a log log N
factor, and I'll get to more details about this in 12 slides.
>>: How are you going to sum the stretch?
>> Richard Peng: I'm just summing over all edges.
>>: But only, I guess, [indiscernible] every edge, right?
>> Richard Peng: So there is an expectation type guarantee, which is a
stronger guarantee, and it turns out that from a metric embedding point of
view, these notions are interchangeable. But in terms of getting a fast
algorithm, for this M log N algorithm it's crucial that we use the weaker
notion of total stretch; it's actually open whether you can get the
expectation guarantee in that time, but it's probably possible to get the
expectation version as well.
So what this does imply is that our upper bounds for the weight times
effective resistance are going to sum to M log N, which means that if you
combine this with sampling, you get about M log squared N edges back, which is
more than the graph we started with. So this is a problem. And this
situation is actually most evident in the case of an unweighted graph, where
every edge has unit weight.
So consider this tree in red; I'm not even sure how clear it is. If you use
this tree, the stretch of the edge that I'm pointing to is two, because it
takes two tree edges to reach across it: it's a length-two path, so the
resistance is two. And because every edge has length one, the shortest tree
path between any two vertices is at least one, which means that the lowest
sampling probability we're going to get if we do this on an unweighted graph
is at least one, which really doesn't mean anything, because if you sample
everything with probability one, you're going to get the same thing back.
So the question is: is this even a good tree? Now let's just step back a bit
and see what we're missing with this kind of approach. What we needed was an
ultra-sparsifier: something that has a factor K difference with the original
graph but has about N minus one plus M poly log N over K edges. What we just
generated has about N minus one plus M log squared N edges; the N minus one I
put there to make them look about the same. But if you recall the
Spielman-Srivastava guarantee, it says that G and H are within a factor of two
of each other, so it's a very good approximation. And the other thing to
notice is that we have not used K at all: the object we needed has this factor
of K, and we have not introduced K in our construction.
So we're going to use K somehow, and I propose we use it in the simplest way
possible, which is going with the notion that the tree we got is good; it's a
good part of the graph. So let's make it heavier by a factor of K.
And by doing this, we're going to get a different graph, G prime, and this
graph is a factor of K approximation of our original graph. Now I claim
something good happens. An edge in the tree, instead of having weight one,
now has weight K, and since resistances are inversely proportional to the edge
weights, these two resistors now have resistance one over K each.
So if you look at the same edge, its stretch now went down by a factor of K.
This is actually true for any edge, so what you get is that the tree-based
effective resistance between any two points now decreases by a factor of K.
So what we need to do now is the accounting. We are going to pay for the tree
edges, because the tree edges didn't really change relative to the tree: they
got heavier, but their tree paths also got heavier, so they're still going to
be there. So I've got N minus one tree edges. But for the off-tree edges,
all the probabilities went down by a factor of K, so we're looking at M log
squared N over K edges now. And once you sample, you're going to get a
sparser graph.
But the issue is that we kind of cheated when we modified the problem, so we
didn't really sparsify G; we sparsified G prime. But notice that we have
these two qualities: G and G prime are within a factor of K of each other,
and G prime and H are within a factor of two. So --
>>: Just so my intuition keeps track, what is the typical size of K? Is it
smaller than two, bigger than two? A very big number?
>> Richard Peng: It's like poly log N. So think about it as about a thousand
or so.
>>: So it's a big number, but it's small compared to M.
>> Richard Peng: It's very small compared to M, but it's much bigger than
most constants. So there are actually more gradual versions where smaller
values of K are used, but to describe the algorithm, I think the easiest
description uses a reasonably big but not too big value of K.
So if you combine these, you get that G and H are within a factor of 2K of
each other, again appealing to the intuition about scalars. And if you look
at what we needed -- ultra-sparsifiers, so about N minus one plus M log to the
P N over K edges and a factor K difference -- what we just generated is an H
that's within a factor of 2K of G, with about N minus one plus M log squared N
over K edges. So if you look at those two conditions, up to a constant they
are the same condition with P equal to two.
So what you obtain if you combine this with the framework given by Spielman
and Teng is an algorithm that runs in about M log squared N time. And it
turns out you can actually do even better by taking the notion that the tree
is good even further. The idea there is that on the smaller graph -- so now
you've gotten yourself to a smaller graph -- instead of computing another tree
from scratch, you can reuse the tree from the current iteration. Using this
trick, you can get to an M log N time algorithm.
There are some extensions that we have done to this work as well. In joint
work with Koutis and Levin, we showed that mildly dense graphs can essentially
be sparsified in linear time. Also, in a joint work with Gary Miller, we
showed that it's possible to extend some of these ideas to general matrices.
The idea there is that you want to find, for a general matrix, a simpler
version, and you want to do this kind of sparsification in proportion to a
solve. So you're dealing with the class of matrices formed by an outer
product of [indiscernible] matrices, which comes up quite a bit in machine
learning.
So to summarize the solver part: what happened here is that by using ideas
from spectral graph theory, we're able to construct similar graphs, and by
using these similar graphs, we're able to obtain extremely fast algorithms for
solving linear systems. But if you think about this algorithm, there's a key
backbone in it, which is that we need these good trees for graphs.
So to get these fast solvers, a critical component is these algorithms from
combinatorial graph theory, which aim to generate these good trees. For the
rest of this talk, I'll talk about algorithms for getting these good tree
embeddings. These trees are known in the theory literature as low stretch
spanning trees, and they come up because we set the sampling probability of an
edge equal to its weight times the effective resistance of its tree path.
So for simplicity, let's just consider the unit weight case again. Here, the
stretch of an edge is simply the length of the tree path connecting its two
endpoints, and what we want from the spanning tree is that if you sum over all
edges, the total tree distance between their endpoints is fairly small.
This is fairly different from most of the definitions of spanning trees that
we're used to. A good example of this is the square mesh, the square root N
by square root N square mesh. The first tree to consider is the hair comb,
which is: take the first column and every single row. Because the edges are
unit weighted, this is a good tree by a lot of the measures that we're used
to: it's a max weight spanning tree, and it's the shortest path tree if you
start from the top left vertex. But I claim that the stretch of this isn't
that good.
Actually, for quite a few of the edges, the stretch is good. For all the
edges in the second column from the left, the stretch of these edges is three:
you just go one left, one up and come around. But the problem is what happens
once you cross the midpoint of a row. Once you cross the midpoint, you have
to go all the way to the left, go one down and come back. So the stretch of
half the edges in the graph is about square root of N, so you get a total of
roughly N to the three halves.
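As a sanity check of that calculation, here is a small sketch of my own that sums the stretches of all edges of the k by k grid with respect to the hair comb tree; the total grows like N to the three halves.

```python
def comb_total_stretch(k):
    """Total stretch of the k x k grid's edges w.r.t. the 'hair comb' tree
    (first column plus every row). Tree edges have stretch 1; the off-tree
    edges are the vertical edges in columns 2..k, and the vertical edge in
    column j has tree path length 2*(j - 1) + 1."""
    n = k * k
    num_tree_edges = n - 1
    off_tree = sum((2 * (j - 1) + 1) * (k - 1) for j in range(2, k + 1))
    return num_tree_edges + off_tree

for k in (10, 100, 1000):
    n = k * k
    print(k, comb_total_stretch(k), "~", round(comb_total_stretch(k) / n**1.5, 3), "* N^1.5")
```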
And as it turns out, the tree that's good for the grid is this kind of
recursive C construction, where you partition the graph into smaller diameter
pieces and combine the trees for the pieces. There has been a lot of work
done on this, starting with the work by Alon, Karp, Peleg and West, and the
state of the art combines two works, by Elkin, Emek, Spielman and Teng and by
Abraham, Bartal and Neiman, who showed that it's possible to get a total
stretch of about M log N for any graph. That was the state of the art when we
were working on the solver, and it's also hiding a log log N factor.
But if you look closely at the algorithms, you realize there is a slight
problem with them from an efficiency perspective, which is that their running
time is roughly M log squared N. And this M log squared N has a simple
explanation: you're essentially running about log N shortest path computations
on this graph. So you're taking a part of the graph, running shortest path
and repeating, and you're doing this for log N levels. So the running time is
essentially shortest path time times log N.
But to get a faster algorithm for generating these trees, because our solver
runs in about M log N time, we need to get around this bottleneck as well.
The way we did this is we drew some ideas from a work by Orlin, Madduri,
Subramani and Williamson, who considered graphs with only K distinct edge
lengths, so a small number of edge lengths. Recall that shortest path on a
unit weight graph takes about linear time; what they showed is that if a graph
has about K distinct edge lengths, the running time of shortest path can be
reduced to about M log K. And the way we use this fact is that we round all
the edge weights to powers of two, so if you think of polynomially bounded
edge weights, your graph now has about log N distinct edge weights.
So each shortest path runs in about M log log N time, and you get a total
work that's roughly M log N, ignoring log log factors.
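The rounding step itself is simple enough to show directly; a sketch of my own, where each weight is rounded up to the next power of two at the cost of at most a factor of two distortion per edge:

```python
import math

def round_weights_to_powers_of_two(edges):
    """Round each edge weight up to the next power of two, so a graph with
    polynomially bounded weights has only about log(n) distinct edge lengths."""
    return [(u, v, 2.0 ** math.ceil(math.log2(w))) for u, v, w in edges]

edges = [(0, 1, 3.0), (1, 2, 5.5), (2, 3, 1.0), (3, 0, 900.0)]
print(round_weights_to_powers_of_two(edges))
# [(0, 1, 4.0), (1, 2, 8.0), (2, 3, 1.0), (3, 0, 1024.0)]
```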
We actually made some improvements to this result as well: we improved this M
log K bound to M plus N log K, so an almost Dijkstra-style bound.
Generating these trees is actually even more interesting from a parallel
perspective. The reason it's interesting from a parallel perspective is a
joint work with Blelloch, Gupta, Koutis, Miller and Tangwongsan, where we
showed that the Spielman-Teng ultra-sparsifier framework can be readily
parallelized to roughly M to the one third depth and nearly linear work.
If you combine this with the Laplacian paradigm for getting graph algorithms
shown earlier, the result is that you get theoretically fast parallel
algorithms for many graph problems, and somewhat to our surprise, that turns
out to be a difficult thing to do otherwise. The state of parallel graph
algorithms before our work was that the state of the art parallel algorithm
for max flow actually had a higher depth than the total work of the
Goldberg-Rao algorithm, which is the state of the art sequential max flow
algorithm. And what we obtained were parallel algorithms whose work is close
to the state of the art sequential work and which have reasonable depth, so
the parallel depth of these algorithms is roughly about M to the two-thirds.
In terms of parallel graph algorithms, this is actually fairly interesting,
in that our current existing frameworks for getting parallel algorithms --
for example, the most used one is the MapReduce framework -- to quote
Vassilvitskii from Google, are actually seemingly not very good for graph
computation.
So this actually leads to some ideas for interesting ways of doing parallel
graph computations on a massively parallel scale. And this quote was actually
made in the context of one problem more specifically, which is the parallel
shortest path problem in directed graphs, or BFS on directed graphs.
Actually, even for computing S-T reachability on directed graphs, I don't
think there are methods known that get speedups over sequential algorithms.
But if you look at the framework by Elkin, Emek, Spielman and Teng, the first
step of that algorithm, as I mentioned earlier, is a shortest path
computation. So generating these trees in parallel turns out to be not
exactly easy.
The way we were able to get these embeddings was by doing several things. We
first went back and looked at an earlier idea by Alon, Karp, Peleg and West,
which was actually the first low stretch spanning tree construction, and then
we combined that with the idea of repeated local clustering of the graph: you
repeatedly cluster out small pieces of your graph, combine them and cluster
again. Also, you sample at different densities to find these clusterings.
These were ideas that were used in approximate parallel shortest path
algorithms by Cohen, and the pictorial version of this essentially looks like
the following: we have a graph, find some centers, chop off some pieces of
the graph, get rid of them, find more centers, and chop off the rest of the
graph.
So with this, we were able to get a good parallel algorithm for solvers and
therefore a lot of these parallel graph algorithms as well.
So the big picture, to sum things up, is that to solve a lot of these
graph-based regression problems, and even combinatorial graph-based problems,
we need fast linear system solvers; well, fast linear system solvers are a
good way of getting these algorithms. But to get these fast linear system
solvers, we also need a variety of combinatorial tools from the graph
embedding literature.
And I think some interesting questions for future work are: can we get better
regression algorithms, and can we get a faster and more parallel solver? One
way to view a parallel solver is as a sparse representation of an inverse of a
matrix, which is actually not the case in the current parallelization. So one
question is: can we get a sparse and approximate pseudoinverse of the matrix?
And also, can we extend solvers to other types of systems? That's all.
Any questions?
>>: We can take this offline because we're going to meet separately, but I'm
very interested in the kinds of things that arise in image processing, which
are quad meshes; you know, you can think of them as planar graphs, things like
that. You mentioned you have some recent results, so I can ask you about
them, unless you already have slides. The other thing is that the theoretical
result for a regular mesh is the order N multigrid solver, so it seems like
there's still a lot -- what's the current gap between these tree-based
approaches and multigrid?
>> Richard Peng: So for planar graphs, there was a work by my two co-authors,
Koutis and Miller, who showed a linear time algorithm for these regular planar
meshes. And the multigrid algorithms are shown to run extremely fast because
they run what's called a V cycle, so the recursive structure is much more
compact than in our algorithms. So Yiannis and Gary actually have some code
which combines some of the ideas from multigrid solvers with some of these
ideas. And my impression is that to get fast algorithms for these well
structured graphs, ideas from both of these probably need to come in at some
point. Also, I think we have a fairly efficient solver package.
>>: Is that the combinatorial multigrid work, or is it newer than that?
>> Richard Peng: It's an updated version of the CMG solver.
>>: Okay, because I did not get order N on those graphs.
>> Yuval Peres: Any other questions? Thank you, Richard.
>> Richard Peng: Thank you.