>> Yuval Peres: Good morning, everyone. We're happy to welcome Shayan, who will tell us about multi-way spectral partitioning.
>> Shayan Oveis Gharan: Hello, everyone. It's very good to be back here. Thanks very much. So I'm going to talk about spectral partitioning and higher order Cheeger inequalities. The main part of this talk is based on joint work with James Lee and Luca Trevisan, but I'll talk about some newer results as well.
In the talk, I try to be -- since most of the people are from the theory group, I try not to be wishy-washy and give you almost the full proof. So I want to talk about the K clustering problem.
Suppose we are given an undirected graph G and an integer K. We want to find K good clusters in the graph, K disjoint clusters.
So this graph G may represent the friendships in a social network. In that case a good cluster would represent a community in the social network.
Or the graph may come from a set of data points, where the edges could be weighted and the weight of an edge would represent the similarity between the data points. So in that case a cluster in the graph would represent a cluster of the data.
So in the talk, I'm going to assume that the graph is unweighted and regular for simplicity. But all of the results generalize to nonregular and weighted graphs.
Okay. Now let me tell you how I'm going to measure the quality of the
clustering. I'm going to use this notion of expansion to measure the
quality.
So suppose we have a set S of vertices. The expansion of the set is
defined as the ratio of the number of edges leaving the set to the sum
of the degrees of vertices in S.
Okay. So because we assume the graph is d-regular, the denominator is exactly D times the size of S. For example, in this graph, the expansion of the set is one-fifth because three edges are leaving the set and the summation of the vertex degrees is 15.
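As a concrete illustration (my own sketch, not part of the talk), here is how one could compute this expansion in Python for a d-regular unweighted graph given as an adjacency list; the 4-cycle below is just a toy example.

    def expansion(adj, S, d):
        """phi(S) = (# edges leaving S) / (d * |S|) for a d-regular graph."""
        S = set(S)
        leaving = sum(1 for u in S for v in adj[u] if v not in S)
        return leaving / (d * len(S))

    # Toy example: the 4-cycle (2-regular). Two edges leave {0, 1},
    # and the degree sum of {0, 1} is 4, so phi({0, 1}) = 2/4 = 0.5.
    adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(expansion(adj, [0, 1], d=2))  # 0.5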
Okay. There are other parameters to measure the quality of a cluster, such as the diameter and the clustering coefficient, but for those parameters you can find examples where there's a natural clustering with a bad quality.
So I'm going to work with this notion of quality. The expansion
parameter is always between 0 and 1. And the closer to 0 means that we
have a better cluster.
Okay. So ideally we'd like to find a clustering of a graph such that
every cluster has a small expansion, very close to 0. So this is the quality of one set. To measure the quality of the whole clustering, I'm going to look at the maximum expansion over all of the sets.
So I want to find a clustering such that the maximum expansion over all the sets is as small as possible. Okay. And our benchmark, or the
optimum solution will be the one that achieves this. It will be a
clustering into K disjoint sets such that the maximum expansion is as
small as possible.
So I'm going to use this parameter phi of K to denote this optimum. Again, phi of K is the value of the best clustering into K disjoint sets, the one where the maximum expansion is as small as possible.
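In symbols (my transcription of the definitions just stated, for a d-regular graph):

    \phi(S) = \frac{|E(S, V \setminus S)|}{d\,|S|},
    \qquad
    \phi_k(G) \;=\; \min_{\substack{S_1,\dots,S_k \subseteq V \\ \text{disjoint, nonempty}}} \; \max_{1 \le i \le k} \phi(S_i).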
>>: [inaudible] of the S_i's?
>> Shayan Oveis Gharan: Yes.
So next I'm going to characterize phi of K in terms of the eigenvalues of the graph. So it turns out that there is an interesting connection
between the algebraic connectivity of a graph and phi of 2.
So let L be the normalized Laplacian of a graph defined as the identity
minus the adjacency matrix over the degree. It's easy to see that the normalized Laplacian is a positive semidefinite matrix, so the eigenvalues are nonnegative, and moreover they are at most two.
So let lambda one be the first one, and lambda two be the second one.
Lambda one will always be 0. And let's say they are in increasing order. There's a basic fact in algebraic graph theory, with respect to L, which says the number of connected components of G is exactly equal to the multiplicity of the eigenvalue 0.
What this implies is that let's say the spectral gap be the difference
of the first and second eigenvalue, which since lambda 1 is 0 it's
going to be lambda 2. This fact says that the spectral gap is 0 if and
only if the graph is disconnected.
But on the other hand, we know that the graph is disconnected if and only if phi of 2 is 0: if the graph is disconnected I can just choose my two clusters as the two connected components and they would have expansion
0. So putting these two together, I'm going to get the spectral gap is
0 if and only if phi of 2 is 0. In other words, by knowing how the eigenvalues look, knowing whether or not the spectral gap is 0, I can say whether or not phi of 2 is 0.
Is this clear? Now, Cheeger type inequalities provide a robust version
of this fact. Meaning that phi of 2 is very well characterized by
lambda 2. It's at least one half of lambda 2, at most root 2 lambda 2.
So you can think of this as the following: A graph is barely connected
if and only if the spectral gap is very close to 0.
Or lambda 2 is very close to 0. Okay? The importance of this
inequality is that it is independent of the size of the graph. So no
matter how large G is, still you would have the same characterization
of phi of 2 in terms of the eigenvalues. And the proof of this is
algorithmic. It gives you the so-called spectral partitioning algorithm that I'm going to talk about later. But it's a very simple linear
time algorithm to find the two clusters.
Miclo conjectured that the same characterization must generalize to
higher eigenvalues, in the sense that we should be able to characterize phi of K in terms of lambda K without any dependence on the size of the
graph.
And this is the subject of this talk. Okay? I will try to answer the
question. So now let me tell you our main result. We proved that for
any graph G, phi of K is at least one-half of lambda K and it's at most order of K squared times root lambda K. The left side of the inequality is
easy. It was known before. The main part is the right side of the
inequality.
Let me show you an example to understand this better. Suppose our
graph is just a cycle. How would you find -- how would you partition a
cycle into K clusters of a small expansion? The best way to do it is
to just find K paths each of length roughly N over K.
Now, the expansion of each of these paths would be K over N, essentially, up to constant factors. So what this means is that phi of K for a cycle is about K over N. But we know that lambda K for the cycle is about (K over N) squared.
So plugging this in above, we see that for the cycle phi of K is about root lambda K, even without any dependency on K squared.
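As a quick numerical sanity check of these cycle numbers (my own computation, not from the talk), one can compute the normalized Laplacian spectrum of an n-cycle and see that lambda_k scales like (k/n)^2:

    import numpy as np

    n = 200
    A = np.zeros((n, n))
    for i in range(n):                       # adjacency matrix of the n-cycle
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    L = np.eye(n) - A / 2.0                  # normalized Laplacian I - A/d, with d = 2
    lam = np.sort(np.linalg.eigvalsh(L))
    for k in (2, 3, 5, 10):
        # lambda_k * (n/k)^2 stays bounded by a constant, i.e. lambda_k ~ (k/n)^2
        print(k, lam[k - 1], lam[k - 1] * (n / k) ** 2)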
Note that this inequality answers Miclo's question, because although we have a dependency on K on the right-hand side, this is independent of the size of the graph.
Another nice thing is that the proof is algorithmic as well. So not only can we characterize phi of K in terms of eigenvalues, we can give an algorithm that finds a K clustering with the corresponding quality.
>>: Lambda K over K-1 squared, that's for small K. That holds even for K close to N or K bigger?
>>: The asymptotic version [inaudible].
>> Shayan Oveis Gharan: Here or --
>>: Big N.
>> Shayan Oveis Gharan: The top one? I think it holds for large
eigenvalues as well, up to constant factors. But my main interest is small K here. If K is very large this is not going to give a very good estimate.
>>: You're saying K can depend on that?
>> Shayan Oveis Gharan: Yeah, K can depend on that. Okay. So is the result --
>>: N order 1.
>> Shayan Oveis Gharan: Lambda is never more than 1. Okay. So we can improve the dependency on K exponentially if I am allowed to use much larger eigenvalues -- instead of lambda K, if I can use lambda 2K, I can show that phi of K is at most root of lambda 2K log K. And furthermore, if the graph is in low dimension, like a planar graph, bounded genus, or doesn't have a fixed minor, then we can completely get rid of the dependency on K and show phi of K is at most order of root lambda 2K.
>>: Do you know that this dependence is needed or --
>> Shayan Oveis Gharan: Right. So both of these are tight. The bottom one is tight for the example I just showed you, for the cycle.
The middle result is tight for a graph called the noisy hypercube, which is a generalization of the hypercube. The top result is not necessarily tight. It's still open to see -- I will talk about it at the end -- it's interesting to see if we can improve the dependency on K to, say, a polylogarithmic function.
>>: Some kind of [inaudible] algorithm that these are the K parts and two K parts and the type better approximation or something? You said the top result logarithmic.
>>: So what's algorithmic, you say two K parts instead of K?
>> Shayan Oveis Gharan: I'm going to break into K parts. But I'm going to use lambda 2K to measure the quality.
These 2s can also be replaced by any constant greater than one.
>>: Is it easy to have some dependence on K? The K squared, can you just completely remove the K squared, is that possible or not?
>> Shayan Oveis Gharan: As I said, the middle one is tight, right. So you have to have a log K dependence, but we don't know if the dependency should be polylogarithmic or polynomial.
>>: The example lambda K and lambda 2K, is the first substantially --
>> Shayan Oveis Gharan: Well, we can make -- it's easy to construct graphs where lambda 2K is much larger.
>>: [inaudible].
>> Shayan Oveis Gharan: Yeah. Is the result clear? So I'm going to
focus mainly on the first result, and if I have time tell you a little bit about the other two. So here's the outline of the talk.
First I'm going to give an overview of the special cases of our problem and show you how we can deal with it. And then the next part will be the ideas of the proof. And lastly, I will talk about new results based on the tools and techniques that I talk about in this talk. Okay?
We're going to use the Rayleigh quotient as the spectral relaxation of expansion. Suppose we have a function F from the vertices to the reals. Then the Rayleigh quotient of F is defined as this ratio: the summation over edges (u, v) of (F(u) - F(v)) squared, over D times the summation of F(u) squared.
To understand this, let's plug in a 0-1 function. If F is a 0-1 function then the Rayleigh quotient is exactly the expansion of the support of F. The support of F is the set of vertices with value 1. Why? Because the numerator is just the number of edges leaving the support for such an F, and the denominator is just D times the size of the support.
>>: Are you assuming the degree is constant in all of this?
>> Shayan Oveis Gharan: Degree is D.
>>: That's throughout?
>> Shayan Oveis Gharan: Yes, throughout the talk. But everything can be generalized.
So if we could, for example, find the 0-1 function minimizing the Rayleigh quotient, that would give us the best non-expanding set. It would be the sparsest cut of the graph. But that's an NP-hard problem, we cannot do that.
This is why we think of the Rayleigh quotient as a continuous relaxation. Since we cannot solve the discrete version, we will deal with the continuous relaxation.
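Here is a small sketch (mine, under the talk's d-regular convention) of the Rayleigh quotient, checking that a 0-1 indicator function recovers the expansion of its support:

    import numpy as np

    def rayleigh_quotient(edges, f, d):
        """R(f) = sum_{(u,v) in E} (f(u) - f(v))^2 / (d * sum_u f(u)^2)."""
        f = np.asarray(f, dtype=float)
        num = sum((f[u] - f[v]) ** 2 for u, v in edges)
        return num / (d * np.sum(f ** 2))

    # On the 4-cycle, the indicator of {0, 1} gives R(f) = 2 / (2 * 2) = 0.5,
    # which is exactly phi({0, 1}).
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(rayleigh_quotient(edges, [1, 1, 0, 0], d=2))  # 0.5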
Now it turns out that we know the optimizers of this continuous
relaxation, and they are the eigenfunctions, or eigenvectors, of the
normalized Laplacian matrix.
So, for example, the one that minimizes the Rayleigh quotient is F1, and the Rayleigh quotient would be 0. Let me remind you that the first eigenvector, or eigenfunction, of the Laplacian is a constant function, and for that the numerator would be 0.
So R of F1 is 0. It's equal to lambda 1, which is 0. And for any i, R of Fi is lambda i. So F1 is the function that minimizes this. F2 is the function that minimizes the Rayleigh quotient among all functions which are orthogonal to F1, F3 is the one that minimizes the Rayleigh quotient among functions that are orthogonal to F1 and F2, and so on.
So, in fact, F1 to FK give you the best K dimensional subspace minimizing the Rayleigh quotient. So you can think of what I'm going to do, well, what I'm going to prove, as a rounding algorithm: starting from this continuous relaxation of the K clustering problem and rounding it into a K clustering.
Again, here you can see that if F1 to FK were 0-1 functions, then this would give the optimal solution of our problem. Okay. So I want to talk about the rounding problem: how to start from F1 to FK and round them into a K clustering, K disjoint sets.
>>: Minimize before the definition of the Rayleigh quotient but
minimize over F that just gives you F1, right?
>> Shayan Oveis Gharan: Yes. Now, if you minimize over those that are orthogonal to F1, it would give you F2.
>>: To get F2 you need to have a constraint, say F1, based on F1 and it's orthogonal to the first one, you get --
>> Shayan Oveis Gharan: Right.
>>: And it didn't normalize the --
>> Shayan Oveis Gharan: Yeah, I just need to say the function is not all 0, it's not a constant.
>>: No. Mean is 0. If you minimize over all functions F.
>> Shayan Oveis Gharan: That are not 0?
>>: You can --
>> Shayan Oveis Gharan: Then F1 would be optimal.
>>: If you add a huge constant to F2 to make this ratio very small --
>>: Numerator is here.
>> Shayan Oveis Gharan: Think of the denominator as just making the norm of F one.
>>: If you take a non[inaudible] and add a huge constant you make this ratio arbitrarily small.
>> Shayan Oveis Gharan: No.
>>: It's really the same, it's just --
>>: So the normalization is a technicality.
>> Shayan Oveis Gharan: It doesn't make it arbitrarily small. If you add a huge constant you just make it closer and closer to F1.
>>: Time of F1 is --
>> Shayan Oveis Gharan: F1, if I make it smaller and smaller. But F1 is the minimizer.
>>: Okay. You count in the constant.
>>: That's the orthogonal part.
>>: Okay. Fine.
>> Shayan Oveis Gharan: Okay. So now let me tell you how we can do this rounding. I'm going to start with the simplest case, which is -- sorry, before that, for the rest of the talk all I'm going to use from F1 up to FK is that they're orthonormal and they have a small Rayleigh quotient. I'm not going to be interested in whether or not they are eigenvectors or eigenfunctions. So let me tell you how we can do this rounding. If R of F1 and R of F2 are 0, how can we show phi of 2 is 0?
There are two observations. The first one is: if R of F is 0 for a function F, then for each adjacent pair of vertices F of U is equal to F of V, and vice versa. Why? Because you just plug it in here.
What this means, by repeated application, is that every connected component has the same value in F. Now, on the other hand, I know that F1 is orthogonal to F2. So one of them is not constant, so it has at least two different values. So the graph must be disconnected.
Just two observations. Now we can generalize this in two ways. One is to generalize 2 to K: say you have K functions of 0 Rayleigh quotient, how can we show phi of K is 0. And the second one is to generalize 0 to some small number delta: say R of F1 and R of F2 are at most delta, how can we show phi of 2 is at most order of root 2 delta. Let's start with the left one, then go to the right.
So, again, I want to assume R of F1 up to R of FK are 0 and show phi of K is 0. And I'm going to use the spectral embedding. This will be used throughout the talk, so it's important to remember it. The spectral embedding of the graph is an embedding of the graph in a K dimensional space where the value of each vertex, capital F of U, is just the K dimensional vector (F1 of U, F2 of U, ..., FK of U). So here you see an embedding of a cycle based on its first three eigenfunctions, F1, F2, F3. If you remember, F1 is constant and F2 and F3 just give you the cycle.
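For concreteness, here is a minimal sketch (my own, assuming a d-regular graph given by its adjacency matrix) of how the spectral embedding F(u) = (f_1(u), ..., f_k(u)) is formed from the bottom k eigenvectors of the normalized Laplacian:

    import numpy as np

    def spectral_embedding(A, d, k):
        """Rows of the returned F are the points F(u) in R^k."""
        L = np.eye(A.shape[0]) - A / d        # normalized Laplacian I - A/d
        vals, vecs = np.linalg.eigh(L)        # ascending eigenvalues, orthonormal columns
        return vals[:k], vecs[:, :k]

    # e.g. for the 4-cycle adjacency matrix used earlier:
    A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
    lam, F = spectral_embedding(A, d=2, k=3)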
Now let's see how we can use this to prove our claim. First
observation is that since all these Rayleigh quotients are 0, R of F
will be 0 as well. In fact, R of F, the Rayleigh quotient of the embedding, is always at most the maximum of the Rayleigh quotients of the Fi's. That's very easy to see.
Now what this means is again for any adjacent pair of vertices, they
must be mapped to the same point in this high dimensional embedding.
>>: [inaudible] also defining R of F for vector values --
>>: That's why the norm was there earlier.
>> Shayan Oveis Gharan:
Right.
>>: And it's usual or?
>> Shayan Oveis Gharan: It's the L2 distance of adjacent pairs over
the summation of the L2 norm.
So okay, what this means is each connected component of the graph is mapped to a single point in this K dimensional embedding. So to prove that phi of K is 0, I just need to say there are at least K distinct points in this embedding. Right? Why is that true?
Again, we use orthonormality. So remember that F1 to FK are orthonormal. Now construct the matrix with rows F1 to FK. Then the columns of this matrix are exactly my embedding.
Now, because F1 to FK are orthonormal, the matrix has rank K. So there are K linearly independent columns, so there are K distinct points in the embedding.
>>: Can you say that again?
>> Shayan Oveis Gharan: So we construct the matrix with rows F1 to FK. These are orthonormal, so the matrix has rank K. Now look at the columns of this matrix. This is exactly the embedding. There are K linearly independent columns, which means there are K distinct points in the embedding. This thing.
>>: So you're just saying in particular linearly independent implies this thing?
>> Shayan Oveis Gharan: Yeah. So we have K disconnected components.
Now let's see how we can prove the other generalization. Say R of F1 and R of F2 are not 0, but let's say very small, some small number delta. I want to say phi of 2 is order of root delta. This is by the way the
Cheeger inequality. Let me not prove it completely, just give you the
idea.
So, first of all, since F1 and F2 are orthogonal, one of them is not
constant. Say F2 is not constant by what we said before. Now, because
R of F2 is not 0, we cannot say adjacent vertices are mapped to the
same points, but because it's small we can say they're mapped to close
values.
So the idea is to map the vertices on a line based on their values in F2. Now, sweep this line from left to right, consider all the cuts, and just choose the best one. This is exactly the spectral partitioning algorithm. Very simple. It just uses the second eigenfunction. The intuitive idea for why this works is: think of a random cut, a random threshold in this region; if two points are close, the probability that it cuts them is very small.
But we know that the edges, the adjacent vertices, are mapped to close values. So on average we're going to cut a very small number of edges. So we're going to get a sparse, that is non-expanding, cut.
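A minimal sketch of the sweep-cut version of spectral partitioning just described (my own rendering; it assumes a d-regular adjacency list and the values of f_2 as a numpy array):

    import numpy as np

    def sweep_cut(adj, d, f2):
        """Sort vertices by f2 and return the prefix cut with smallest expansion."""
        order = np.argsort(f2)
        in_S, cut = set(), 0
        best_phi, best_S = np.inf, None
        for u in order[:-1]:                       # try every prefix except the whole graph
            cut += sum(-1 if v in in_S else 1 for v in adj[u])
            in_S.add(u)
            phi = cut / (d * len(in_S))            # expansion of the current prefix
            if phi < best_phi:
                best_phi, best_S = phi, set(in_S)
        return best_S, best_phi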
So okay, now our main theorem generalizes both of these special cases. It says if you have K orthonormal functions with Rayleigh quotient at most delta, then phi of K will be small as well: at most a polynomial of K times root delta.
So by what I said so far, our proof must have two main elements. The
first one is that we have to use the fact that the Rayleigh quotient of
capital F is small. And argue that adjacent vertices are mapped to
close points in this high dimensional embedding. To understand this,
look at the cycle.
The second observation is that we have to use the orthonormality of the vectors to argue that the embedding kind of spreads in the space, not concentrated in a small number of places. This is what we did in these special cases.
So in particular, we use the first observation to choose our
nonexpanding sets from clusters of close points in this high
dimensional embedding.
And we use the second observation to argue that we can actually find K clusters, K disjoint clusters.
Okay?
Is there any question?
>>: When it spreads?
>> Shayan Oveis Gharan: I'm going to make it rigorous. But the
intuitive idea is that it's spread in this space. It's not -- see that
cycle?
Okay. So now for the next I think 10 to 15 minutes I'm going to talk
about the ideas of the proof. There are three main ingredients. The
first one is the radial projection distance. This is a particular
distance we use in the proof.
I'm going to define it in the next slide. The second one is a spreading property: we show that the vertices spread out in the space with respect to this particular metric. And the last one is that we show that to prove the theorem it's sufficient to find K regions in the space, each containing a large mass of points, such that they're well separated. We call them dense and well separated.
So let me first define what I mean by radial projection metric. If you
have two points, U and V, the radial projection distance is defined as
follows. First we project the points on the unit sphere around the
origin, and then we compute the Euclidean distance of the projected points.
So this is a Euclidean metric. Why is it useful? It helps us reduce the problem to a simpler case where all the vertices are mapped onto the unit sphere, so they all have the same distance to the origin.
So for simplicity, for the rest of the proof, you can assume that this is the case, that the vertices all have the same distance. I'll point out what would be different if you don't want to assume this.
Now let me tell you the proof plan. There are two main steps. The
first step is to find K regions in the space, X1 up to XK, such that each has a 1 over K fraction of the total L2 mass, and their pairwise distance is at least 2 epsilon.
What do I mean by L2 mass? The L2 mass of a region is just the summation of the norms squared of the vertices in it. If you assume that the vertices are all at the same distance from the origin, you can replace the L2 mass with the number of vertices in it. So think of L2 of X as the number of vertices in X.
In the second step, we show that if you have K well separated regions, we can turn them into K disjoint non-expanding sets, each supported in the epsilon neighborhood of one of the regions. By the epsilon neighborhood, I mean the points at distance at most epsilon.
Now because we assume that the regions are at distance 2 epsilon, my
sets will be disjoint.
>>: [inaudible] subset of --
>> Shayan Oveis Gharan: Yeah, the support is a subset of the epsilon neighborhood of the regions.
>>: Partitioning of -- includes all the points, all the Xs? Not necessarily the points here, some for every --
>> Shayan Oveis Gharan: Yeah.
>>: Why do the assumptions keep all the points, not sure?
>> Shayan Oveis Gharan: I'm not going to say -- S1 up to SK would be
disjoint. I'm not going to say there would be a partitioning, but we
can make it into a partitioning. If you have a clustering into K disjoint sets, we can make it a partitioning by adding the remaining
vertices to the largest set.
So okay. So we say a region is dense if it has a 1 over K fraction of the vertices -- a 1 over K fraction of the mass. And we say the regions are well separated if their pairwise distance is at least two epsilon. So again the plan is to first find K dense well separated regions and then turn them into K disjoint nonexpanding sets.
And epsilon will be a function of K. Okay. Is the plan clear? Let me start with the second step and then talk about the first step.
So suppose we have a region X that contains, let's say, a [inaudible] K fraction of the mass. We want to round it into a nonexpanding set such that the set is a subset of the epsilon neighborhood of the region, and the expansion is small: the square root of a polynomial function of K times the Rayleigh quotient of capital F.
How are we going to do that? This is essentially a generalization of the Cheeger inequality. So the idea is to choose a random threshold in the epsilon neighborhood of X.
So we consider all these balls in the epsilon neighborhood and we
choose the best one. Now, observe that each of these balls contains at least a 1 over K fraction of the total mass. On the other hand, the probability of cutting each edge is at most its length over epsilon.
So putting these two together proves the statement: we can show that there exists a set of expansion at most the summation of F of U minus F of V over epsilon, divided by the 1 over K fraction of the mass, and by a simple application of [indiscernible] we get the claim.
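A simplified sketch (mine; it flattens the L2-mass bookkeeping by assuming all vertices have roughly the same norm, as suggested earlier) of this rounding step: sweep a radius t in [0, epsilon] around a region X and keep the neighborhood with the smallest expansion. Here dist_to_X is assumed to map each vertex to its radial-projection distance to X.

    import numpy as np

    def round_region(adj, d, dist_to_X, eps, n_steps=100):
        """Return the neighborhood {u : dist(u, X) <= t} with smallest expansion."""
        best_phi, best_S = np.inf, None
        for t in np.linspace(0.0, eps, n_steps):
            S = {u for u, dst in dist_to_X.items() if dst <= t}
            if not S or len(S) == len(adj):        # skip empty set and the whole graph
                continue
            leaving = sum(1 for u in S for v in adj[u] if v not in S)
            phi = leaving / (d * len(S))
            if phi < best_phi:
                best_phi, best_S = phi, S
        return best_S, best_phi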
Okay. Now let me tell you how we can find K dense well separated regions. Remember that because we want to find these dense well separated regions with respect to the radial projection distance, the regions would look like narrow cones around the origin.
The vertices in such a cone -- if you have one vertex here, one vertex there -- the distance would be very small, because we project onto the sphere. So the regions would look like these narrow cones.
So let's see how we can find them. Okay. So the proof has two steps. The first step is to prove the spreading property that I promised to talk about. The spreading property says the following: if you have a region of small diameter, say constant diameter one-half, then it cannot have more than roughly a 1 over K fraction of the points.
So in other words, think of a region of constant diameter as a sparse region: it has only a small fraction of the mass. So that's the first thing we prove.
>>: The projection, the radial projection?
>> Shayan Oveis Gharan: Yes.
>>: You're projecting on a sphere of radius one, so the whole space is two? So you say constant as in -- constant, meaning?
>> Shayan Oveis Gharan: Yeah. Yeah, it's one-half out of two, of course.
The second step is the following: We use the literature on random
partitioning of metric spaces to partition this space into
well-separated regions covering almost all of the points.
Now, to prove the theorem, think of the following. Suppose we obtain this random partitioning and get regions that are well separated. Now, because they have a small diameter, one-half, each of them has only a small fraction of the mass.
What we can do is simply merge them and make them dense. Because they were originally well separated, after merging them they're going to be well separated as well. So we're going to get K dense well separated regions.
>>: [inaudible].
>> Shayan Oveis Gharan: By dense I mean we really have a 1 over K fraction. Here they have less than a 1 over K fraction. But if they have 1 over K, we have nothing to do; if they have less, we merge them to make them [inaudible]. Okay. So let's start with the first step and see how we can prove the spreading property.
So, again, we want to assume that we have a region of some diameter delta. I want to show that the L2 mass inside this region is at most 1 over K times (1 minus delta squared) of L2 of V. Okay? For example, if delta is one-half, it's going to be 4 over 3K of the total mass.
How are we going to prove this? This is one place where we use the radial projection distance. It's important here, because if you don't use the radial projection distance, a region of diameter one-half around the origin would contain almost all of the points, because you can show that the points are on average at distance about root of K over N from the origin.
Okay. So it's important to use this. How are we going to prove this? The idea is to prove an isotropy property. What this says is that our embedding is kind of symmetric, it's not skewed in one direction. Mathematically it says that for any unit vector Z, if you project all of the points on Z and take the summation of the norms squared of the projections, it's exactly 1.
And it simply follows from the fact that our embedding comes from an orthonormal basis, so any embedding from an orthonormal set of vectors would have this property. It's very easy to prove.
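A two-line check of the isotropy property (my own illustration): if the embedding matrix has orthonormal columns f_1, ..., f_k, then for any unit vector z the sum over vertices of the squared projections equals ||z||^2 = 1.

    import numpy as np

    n, k = 50, 4
    Q, _ = np.linalg.qr(np.random.randn(n, k))   # rows of Q play the role of F(u); columns orthonormal
    z = np.random.randn(k)
    z /= np.linalg.norm(z)                       # any unit direction
    print(np.sum((Q @ z) ** 2))                  # ~1.0, since Q^T Q = I_k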
Okay. Now let's see how we can use this to prove the claim. I'm going to choose my vector Z to be inside my region X. Now think of delta as being very small, very close to 0. Then for the vertices in X, their norm and their norm after the projection onto Z are very close.
Okay. So this sum, the summation of the norms squared of the projections, will be lower bounded by the L2 mass of X. You would have this delta squared loss from trigonometric inequalities, but if delta is very small you can essentially bound this below by L2 of X, to say that L2 of X is at most 1 over 1 minus delta squared.
Now, on the other hand, we know that L2 of V, the total mass of the vertices, is exactly K. Why? Because our embedding comes from K orthonormal vectors; if you just sum up these things, we can decompose it as the sum of the norms of the functions F1 up to FK and get K.
Now, putting these together proves the claim: L2 of X is roughly a 1 over K fraction of the total mass. So again, what we've proved is that any region of, let's say, diameter one-half has at most essentially a 1 over K fraction of the mass, the L2 mass.
Now let's see how we can use this to find dense well separated regions. So, again, as I said, I want to use the random partitioning of metric spaces to partition the space into well separated regions of constant diameter, and then we're going to merge them and get dense well separated regions.
So I start by putting a grid on the space, based on the radial projection distance, such that the diameter of each cell of the grid is the constant one-half.
So a grid with respect to the radial projection distance looks like this. Then I randomly shift the grid. Okay. And I just delete the points that are close to the boundary, and choose each region to be the points completely inside a cell, far from the boundary.
Now, it's very easy to see that if I choose epsilon to be about 1 over K, each point will be far from the boundary with some constant probability, and we can make it very close.
So essentially what this says is that I can find these regions, regions of points that are well separated: they have distance at least 2 epsilon, and they cover almost all of the points.
Now, since I started from a grid that has small cells, cells of diameter one-half, these regions have diameter at most one-half as well. So by the spreading property, each has a small mass, at most about 1 over K. Now I can merge them: maybe I merge this region with this one, and so on, and get K dense well separated regions.
>>: Choose a random rotation --
>> Shayan Oveis Gharan: Yes.
>>: Hyperplanes instead of like rotating this also works, because
they're like roughly orthogonal.
>>: Why are we choosing -- you should rotate the grid, right?
>>: But instead of -- instead of this, let's say you choose log K random hyperplanes and around each one.
>> Shayan Oveis Gharan: You don't want to start from the grid or --
>>: No, but then intuitively they are almost orthogonal, you get -- related to the previous work, not the talk -- sorry about that.
>> Shayan Oveis Gharan: But that is -- that is different.
>>: I know it's different. I can ask you later.
>> Shayan Oveis Gharan: So this is the final algorithm. First we embed the vertices based on the spectral embedding, the eigenfunctions. Then we consider a random partitioning into cells of diameter one-half, let's say, that covers almost all of the points -- almost all of the points are at distance more than about 1 over K from the boundary. Then we remove the points that are close to the boundary and just look at the regions that are completely in the interior of each cell.
We merge the regions to get K dense well separated regions. And finally we apply the rounding on each of these dense regions to get a nonexpanding set in the epsilon neighborhood.
So this algorithm that I just described is very similar to the spectral
partitioning algorithm, spectral clustering algorithm that people use
in practice. The paper of Ng and Jordan suggested this algorithm, with the only difference that instead of using random partitioning they use K means.
But they do use, for example, the radial projection distance and essentially all the other steps. And here's a nice application of spectral clustering to images, to detect parts of objects.
So essentially our work gives a theoretical justification for these
spectral clustering algorithms.
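For reference, a short sketch (my own; the parameter choices are illustrative, and scikit-learn's KMeans is just one convenient stand-in) of the practical recipe being discussed: spectral embedding, radial projection onto the sphere, then k-means instead of the random partitioning used in the proof.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(A, d, k):
        L = np.eye(A.shape[0]) - A / d                      # normalized Laplacian
        _, vecs = np.linalg.eigh(L)
        F = vecs[:, :k]                                     # spectral embedding, one row per vertex
        F = F / np.linalg.norm(F, axis=1, keepdims=True)    # radial projection onto the unit sphere
        return KMeans(n_clusters=k, n_init=10).fit_predict(F)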
>>: One at a time or [inaudible].
>> Shayan Oveis Gharan: I guess I better --
>>: [inaudible] the objects.
>> Shayan Oveis Gharan: Right. So as I said, so for the image
application, you have to start from the image and then construct a
similarity graph based on how close two data points are.
>>: Exact on the element of the [inaudible] correct -- the vertices are elements of an image, are they images?
>> Shayan Oveis Gharan: The vector --
>>: [inaudible] [cross talk].
>>: This is applying each --
>> Shayan Oveis Gharan: Yes. To start from an image you construct the graph and then apply the spectral clustering algorithm. But there's a very vast literature on how you can start from a graph and make it -- start from an image or other context and make it into a graph. And this work, for example, is very new, they have very novel techniques for how to make these graphs and then --
>>: Somehow -- maybe we don't need to dwell on it too long, but is sort of spectral partitioning better than clustering for this application?
>> Shayan Oveis Gharan: What do you mean?
>>: Partitioning is a method for clustering.
>>: Yes, but maybe spectral partitioning is better than clustering so as to minimize the expansion -- that's what the goal is.
>> Shayan Oveis Gharan: No, the point is each of these -- each of the
parts of these objects would have a small expansion essentially. For
example, look at these green areas here. They would have very small
expansion.
>>: Also what if you just took a little bit of that --
>> Shayan Oveis Gharan: How much time do I have?
>>: Seven minutes.
>> Shayan Oveis Gharan: Seven minutes. So let me talk about new
results. So I'm going to talk about the techniques -- I'm going to
talk about new results based on techniques that I covered here. Let's
zoom back a little and see what we've done. We've used this particular
embedding called spectral embedding. This embedding is an isotropic
embedding. Among all isotropic embeddings it has the minimum energy. What do I mean by energy? The energy of an embedding is just the summation of the distances squared over the edges, the adjacent pairs of vertices.
So it's very easy to see that for the spectral embedding the energy is upper bounded by K times lambda K. And, of course, it has the isotropy property: for any direction, the summation of the inner products squared is exactly one. For example, here you see the spectral embedding of a cycle and a hypercube. And these two properties were in fact the main properties that we used for this proof; everything followed from these.
Now, by abstracting this out and thinking more about it, we can prove a couple of other results that I can talk about. This is joint work with Kwok, Lau, Lee and Trevisan. We proved an improved Cheeger inequality.
So recall that the Cheeger inequality says phi of 2 is at most root lambda 2. We show that for any graph, phi of 2 is at most order of K times lambda 2 over root lambda K.
>>: [inaudible] when you start the sentence for any graph --
>> Shayan Oveis Gharan: Sorry. I have a typo. Any K. Right. Sorry.
>>: Then K.
>> Shayan Oveis Gharan: Any K.
>>: But I thought the inequality [inaudible].
>> Shayan Oveis Gharan: So lambda 2 over root lambda K is always better than root lambda 2.
So in particular you can say phi of 2 is always at most order of lambda 2 over root lambda 3, which is always better than root lambda 2, and the constant is not huge, it's 10.
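Written out (my rendering of the inequality being discussed), the improved bound reads

    \frac{\lambda_2}{2} \;\le\; \phi_2(G) \;\le\; O(k)\,\frac{\lambda_2}{\sqrt{\lambda_k}} \qquad \text{for every } k \ge 2,

and since \lambda_k \ge \lambda_2, taking k = 2 already recovers the classical O(\sqrt{\lambda_2}) bound up to constants, while larger k can only help.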
Okay. So, more interestingly, the analysis shows that the spectral partitioning algorithm, the one that uses the second eigenvector, satisfies this. You don't need to use some fancy algorithm, you can use spectral partitioning. And although the algorithm doesn't know anything about higher eigenvectors, eigenfunctions, eigenvalues, anything like that, it still matches this quality. For example, it shows that if lambda K is a constant for some constant K, we get a constant factor approximation for phi of 2, or the sparsest cut of the graph.
This in fact happens in many applications of spectral clustering. By
using this new result and putting it together with what I've just
talked about, you can even improve the results that I talked about.
Instead of bounding phi of K by root lambda K, you can upper bound it by some polynomial function of K times lambda K over root lambda L, for any L larger than K. For example,
if lambda L is a constant for some L greater than K, then you
essentially get a very tight inequality.
Or you can use it -- okay. Now let me tell you another result. So we use
the spectral embedding technique to lower bound the eigenvalues of
graphs, the eigenvalues of the Laplacian or upper bound eigenvalues of
the random walk matrices, to show that for any unweighted graph G, lambda K is at least omega of K cubed over N cubed. And if the graph is regular it's at least omega of K squared over N squared.
>>: Is it constant?
>> Shayan Oveis Gharan: Absolute constant.
And this proves a whole set of other results that I don't have time for here; it generalizes to infinite graphs, to vertex transitive graphs. You can bound many properties about the mixing time of random walks, probabilities. Here's a nice algorithmic application.
We can use this theorem to design a fast algorithm, local algorithm for
approximating the number of spanning trees of a graph. Okay. Let me
wrap up.
So traditionally we knew a lot about the second eigenvalue, or the spectral gap. We use it to analyze Markov chains, we use it to give an approximation algorithm for the sparsest cut problem, we use it in partitioning and clustering, but we knew very little about higher order eigenvalues. So from all the work I talked about today, you can see that by knowing about higher eigenvalues, we can improve many of these traditional results. We can provide a generalization of the Cheeger inequality to higher eigenvalues and get a K partitioning instead of a two partitioning.
We can get an improved Cheeger inequality even without changing the current algorithms, or we can get a new framework for analyzing reversible Markov chains or the mixing time of random walks.
Also said that our results give a theoretical justification for
spectral partitioning algorithms, spectral clustering algorithms that
use the spectral embedding to partition a graph.
Here are some open problems. The first one, as I said, is to find the right dependency on K for phi of K. For example, is it possible to find K disjoint clusters, each of expansion polylog K times root lambda K? Another interesting question is the connection of this problem, higher order Cheeger inequalities, to the unique games conjecture and the small set expansion problem. For example, it is an open problem whether for some large K -- this time it's important that K is large, think of K as being, for example, N to the 1 over log log N -- it's possible to find a set of size much smaller than N and expansion at most order of root lambda K.
So in particular, note that using our result we can show that for any graph there exists a set of size about N over K and expansion of order of root of lambda K log K.
So the difference here is that you don't want the expansion to depend on K. You're allowed to return a set much larger than N over K, it just needs to be sublinear in N. But you don't want a dependence on K in the expansion. This would have huge impact on the unique games conjecture as well as the small set expansion problem.
The last thing is to analyze different partitioning methods for
spectral clustering such as K means, instead of the random partitioning
that I talked about. We have some partial results for this case. For
example, we can show that if lambda K plus 1 is much larger than lambda K, if there is a large gap between lambda K and lambda K plus 1, then the spectral embedding looks like K separated clouds of points, and in fact K means would work.
But still we don't know a robust version of this statement. And I stop here.

[applause]
>> Yuval Peres: Questions? Sam will be here at some point tomorrow. We're scheduled
but he'll find some time. And also we have some dinner today at 6:30, 6:30 plus
epsilon. So anyone interested please come talk to me.