>> Konstantin Makarychev: It’s a great pleasure to have today’s speaker Grigory Yaroslavtsev.
Grigory graduated from Pennsylvania State University two years ago and then he was a postdoc
at Brown and UPenn. He was an intern at Microsoft Research, a different Microsoft Research.
He works in various areas of theoretical computer science, particularly in approximation
algorithms and various aspects of theoretical big data. Today he will talk about correlation
clustering problems and this work actually started here at Microsoft Research.
>> Grigory Yaroslavtsev: All right. Thanks a lot for the introduction. You actually spelled out
my name properly and I’m very grateful for that. So this work actually started as an internship
project here at MSR, so I’m very happy to give this talk here. This is joint work with Shuchi Chawla,
Konstantin Makarychev and Tselil Schramm.
Today’s talk is about correlation clustering, which I like to think of as one of these happy success
stories of theoretical computer science. The theorists came and addressed a problem that really begged for attention. The problem was developed by practitioners in a machine learning lab called WhizBang Labs. There were a few papers which introduced this model in a semi-rigorous way, and machine learning people noticed that this problem really had to be solved, but they didn’t have any rigorous algorithms for it. Then the theorists Bansal, Blum and Chawla came and actually formalized what those machine learning people were trying to do and developed some basic algorithms for solving this problem.
The problem is really so basic that you can explain it easily to a little kid as the following graph partitioning problem. Suppose I give you a graph, like the one in this picture, where the graph represents similarities between objects, denoted as vertices. If there is an edge between two objects it means these two objects are labeled as similar. If there is no edge between two objects it means they are dissimilar. The goal is to cluster this set of objects into any possible number of clusters, which you can pick on your own, so that your clustering is consistent with this information about similarities and dissimilarities.
I will make everything formal, but for now let’s just look at this solution, which has three clusters in it. I just want you to notice that there are three pairs which are violated in the sense that they were labeled as similar but got cut (this, this, and this), and one pair which is actually dissimilar (there is no edge) but is inside one of these clusters. So in this sense this clustering makes four mistakes: three for cutting similar pairs and one for putting a dissimilar pair inside a cluster.
So more formally, the goal of correlation clustering is to minimize the number of incorrectly classified pairs. A pair is incorrectly classified if it is either a covered non-edge or a non-covered (cut) edge, as in the first example that we’ve seen. Note that these two types of mistakes are treated symmetrically: you just add them up, and the goal is to minimize this sum. So in this example the score is 4, for 1 covered non-edge and 3 cut edges. This is actually the optimal solution for this instance; you cannot do better than that.
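As an aside, here is a minimal sketch of how this objective could be computed; the vertex labels and data structures are illustrative assumptions, not something from the talk:

```python
from itertools import combinations

def disagreements(n, edges, cluster_of):
    """Correlation clustering cost: similar pairs (edges) that are split
    across clusters plus dissimilar pairs (non-edges) put in one cluster."""
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(range(n), 2):
        together = cluster_of[u] == cluster_of[v]
        similar = frozenset((u, v)) in edge_set
        if similar and not together:      # non-covered (cut) edge
            cost += 1
        elif not similar and together:    # covered non-edge
            cost += 1
    return cost
```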
>>: [inaudible].
>> Grigory Yaroslavtsev: This is not very easy to see. All right.
>>: [inaudible].
>> Grigory Yaroslavtsev: That’s another way of looking at it. So there is no small certificate for me to convince you of this. Okay, now for those of you familiar with constraint satisfaction, I just want to say that this can be seen as a Min CSP, but with an unbounded number of labels. The
number of labels here corresponds to the number of clusters that you can pick and because there
is no restriction on the number of clusters that you can pick the number of labels can be as big as
the number of vertices. So it’s not directly amenable to CSP techniques, because CSP
techniques only apply usually when the number of labels is very small.
Since the original introduction of this problem in 2004 there has been a lot of progress on it. The original paper by Bansal, Blum and Chawla gave some constant approximation which was pretty large, and by approximation here I just mean the ratio between the number of mistakes that the algorithm makes and the cost of the optimal solution. Since then there has been a lot of progress, a number of papers, culminating in the 2.5-approximation in the paper by Ailon, Charikar and Newman. This problem is also known to be so-called APX-hard, which means that you cannot achieve an approximation which is arbitrarily close to 1 in polynomial time, but the constant in the hardness result is actually very close to 1; it’s 1.00-something.
There is also a complementary objective: instead of minimizing the number of incorrectly classified pairs we can instead maximize the number of correctly classified pairs. That is usually much easier to do because there are much stronger lower bounds for this objective, so the complementary objective might be much easier to optimize. And in 2004 we already had a polynomial time approximation scheme for that. So today I will focus on the minimization objective.
During the 10 years of existence of correlation clustering it has become one of the most
successful clustering paradigms due to a number of advantages that it has over other clustering
methods. So, one such advantage is that it only uses qualitative information about similarities
between objects. Note that the only data that I had to collect to run a correlation clustering algorithm was information about comparisons between pairs of objects in my dataset, whether they are similar or dissimilar. You can compare this to other popular clustering methods: for example, if you want to cluster using k-means, which is one of the main [indiscernible] of machine learning, then what you need is a d-dimensional vector for each point. You need to embed your points into d-dimensional Euclidean space. This is a lot more information than correlation clustering uses.
Another interesting feature of correlation clustering is that you don’t really have to specify the
number of clusters in advance. The number of clusters is going to be selected by the algorithm
itself to best fit the data that you have. One of the questions that usually comes up with k-means, for example, is: How do we pick the value of k? If you don’t know what a good value of k is, you try all of them, which increases the running time, or you guess it somehow and hope that it is a good guess. But here the number of clusters is really produced as an outcome of the algorithm itself; you don’t have to worry about it. One of the applications of correlation clustering is document deduplication, and image deduplication as well.
So the way it works is as follows: So suppose you have a large collection of documents or
images, and your friends from the machine learning group have come up with a new fancy algorithm for comparing these objects. Now they can tell you whether two objects are similar and how similar they are, but they don’t really have any rigorous guarantees about this comparison measure. It can be quite inconsistent; for example, as in the instance that we had, the triangle inequality does not necessarily have to hold. If A and B are similar and B and C are similar, it does not necessarily mean that A and C are similar. We don’t require any guarantees from the black-box comparison algorithm that we are using here. So for deduplication we can use any machine learning algorithm that spits out a comparison score between objects.
So since the original paper by Bansal, Blum and Chawla we know that the problem is NP-hard
and there are some simple approximation algorithms for it that have been developed. Pretty
much all the algorithms that I’m going to describe today in this talk will be pretty simple. So
using them in practice is actually pretty easy, they are not that complicated. For those of you
who know something about learning theory I just want to throw in these buzz words that you can
see correlation clustering as an agnostic learning problem.
Yeah, you have a question?
>>: No, please finish your sentence first.
>> Grigory Yaroslavtsev: Okay. You can see it as an agnostic learning problem in the following sense: we are trying to fit a concept from the class of all clusterings to the data, and unfortunately in this case there is not necessarily a concept that perfectly fits the data, one that makes zero mistakes. There is not necessarily a clustering that makes zero mistakes, so the optimization goal here is to find the clustering which achieves the minimal possible number of mistakes, and that is exactly agnostic learning.
>>: So you were mentioning that these algorithms are very simple and so if I remember correctly
this original one with Blum and so on –.
>> Grigory Yaroslavtsev: Yea, that one is an exception.
>>: [inaudible]. So you mean simple in the sense that the later algorithms do they also use
[indiscernible] solver?
>> Grigory Yaroslavtsev: So for the algorithms that I’m going to use, for the specific objective
you don’t actually need [indiscernible] programming.
>>: But Blum did use it in 2004?
>> Grigory Yaroslavtsev: So for some other variants of this problem you do use [indiscernible] programming, and those are the variants that I am going to be referring to.
>>: That are what, sorry?
>> Grigory Yaroslavtsev: I’m going to be referring to some other flavors of this problem where
–.
>>: [inaudible].
>> Grigory Yaroslavtsev: For this specific variant you don’t need [indiscernible].
>>: I have another question: How many of these different, you know, through the 10 years of
work, do they also allow them to have weights on the edges?
>> Grigory Yaroslavtsev: Yea that’s a good question and I’m going to talk about weights later.
Some of them do and some of them don’t, and I will treat this topic later. All right. So if you want to learn more about this than I’m going to describe in this talk, you can check a survey by Anthony Wirth. There was a KDD tutorial about correlation clustering and all sorts of flavors and toppings on top of it, and there is a Wikipedia article. That should be enough, I hope.
All right. So now let me move on to the algorithms. The first algorithm that I’m going to describe is something I’m going to call data-based randomized pivoting. This algorithm was developed by Ailon, Charikar and Newman, and it achieves a 3-approximation to the MinDisagree objective in expectation, because it’s a randomized algorithm; the randomness is over the choice of the random pivots of the algorithm. So here is the algorithm; it’s very simple, as I promised. You pick a random pivot vertex, let’s call it V, and for the rest of the talk all the pivot vertices that I’m going to use will be purple. Then you make a cluster consisting of the vertex V together with the neighborhood of V in this graph. This is the first cluster that you pick; you set it aside and recurse on the rest of the vertices. So you remove this cluster from the graph and repeat this process again.
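A minimal sketch of this pivoting procedure in code; the data structures here are assumptions made for illustration, not something shown in the talk:

```python
import random

def pivot_clustering(vertices, adj):
    """Randomized pivoting: repeatedly pick a random pivot and cluster it
    with its still-unclustered neighbors.  `adj` maps each vertex to the
    set of vertices labeled similar to it."""
    remaining = set(vertices)
    clusters = []
    while remaining:
        p = random.choice(list(remaining))    # random pivot vertex
        cluster = {p} | (adj[p] & remaining)  # pivot plus its remaining neighborhood
        clusters.append(cluster)
        remaining -= cluster                  # recurse on the rest
    return clusters
```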
So this is a very simple algorithm, and you can implement it in almost linear time easily. It is very practical, but pretty sequential if you look at it, because you have to pick these pivot vertices one at a time, and in the worst case it might take you a linear number of iterations of this process to produce the resulting clustering. So I just want to mention that there is a parallel modification of this algorithm which achieves a slightly worse approximation, but can be efficiently parallelized. Specifically, in the MapReduce model it runs in a polylogarithmic number of rounds; this is a recent paper which appeared at KDD. I have a blog post about it, which rigorously introduces this model of MapReduce and tells you what the algorithm really does in that case, but this algorithm is so simple that it can actually be done in MapReduce as well.
So let’s look at an example of how this works on this toy graph. We pick a pivot vertex P, which is the purple vertex over here, and it grabs its neighborhood into the first cluster, and then we recurse on the rest. We uniformly at random pick a vertex from the rest, this purple one, and construct the second cluster. Now we just have two singleton vertices, so we finish up by making them clusters of size 1. As you can see, in this example we cut 6 edges and put 2 non-edges inside, so we made 8 mistakes, which is a 2-approximation to the objective that we are trying to optimize.
Okay, so now let’s look at this problem from a different perspective. In order to come up with a
proof I want to introduce an integer program for correlation clustering. So the same clustering
problem as before can actually be expressed as an integer program of the following form. So this
may look complicated, but let’s go over it step by step. The idea is that we introduce a binary
distance XUV between all pairs of vertices. This is binary in the sense that XUV is 0 or 1 only. The interpretation of this binary distance is that if XUV is 0 then U and V are in the same cluster; otherwise they are in different clusters.
Now, using this interpretation, you can convince yourself that the objective is exactly the MinDisagree objective that we had before, because we add up XUV over all edges, so we pay 1 if U and V are in different clusters, and we add up 1 minus XUV over the non-edges. Finally, we have these triangle inequality constraints here, and because the XUV’s are binary variables, 0 or 1 only, the triangle inequalities are equivalent to transitivity constraints: if XUW is 0 and XWV is 0 then XUV is also 0. This means that any valid integral solution to this problem actually induces a partition, a clustering, of the set of vertices, and that’s what we are looking for. The equivalence classes of the relationship “U and V are the same if XUV is 0” actually form a partition, and this is the same as a clustering.
Now we don’t know how to solve integer programs in polynomial time, so we do the standard thing: we relax to a linear program. So instead of using 0 and 1 variables we are going to use real variables between 0 and 1. A solution to this program now gives an embedding of the vertices into a pseudometric. I’m using the term pseudo just to highlight the fact that it could be the case that two vertices coincide.
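Collecting the pieces, the program under discussion is roughly the following (writing x_{uv} for the XUV variables; this is a reconstruction from the description, not a copy of the slide):

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{(u,v)\in E} x_{uv} \;+\; \sum_{(u,v)\notin E} \bigl(1 - x_{uv}\bigr)\\
\text{subject to}\quad & x_{uw} \;\le\; x_{uv} + x_{vw} \qquad \text{for all } u, v, w,\\
& x_{uv} \in \{0,1\} \ \ \text{(integer program)} \quad\text{or}\quad 0 \le x_{uv} \le 1 \ \ \text{(LP relaxation)}.
\end{aligned}
```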
>>: So are you assuming that there could be more than two clusters, right?
>> Grigory Yaroslavtsev: Yeah, there can be any number of clusters, because the equivalence relation can have any number of classes.
>>: So you are saying XUV is 1 only if they are in different clusters?
>> Grigory Yaroslavtsev: That’s right, it’s like a distance. If the distance is 0 then U and V are the same thing and are in the same cluster. If the distance is 1 then they are in different clusters.
>>: [inaudible].
>> Grigory Yaroslavtsev: All right. So we get an embedding of the vertices into a so-called pseudometric, and it’s exactly the same as a usual metric, except that some points are allowed to be at distance 0; that’s what the pseudo is referring to. And now let me show you that there is a gap between the objective values of the integer and linear programs. This is called the integrality gap of this program, and I’m going to show you that it is at least 2 minus o(1). There is a simple example that illustrates this: a star, a center connected by edges to all the other vertices, which are disconnected from each other.
Now remember that a solution to the integer program is exactly the same as a clustering, and the best clustering here that you can construct looks like this: you just pick one cluster of size 2 and the rest of the vertices form singleton clusters. That’s the best you can do. This gives the cost of the integer program to be N minus 2, but the linear program can cheat in the following way. The solution to the linear program can be constructed as follows: you just put one half on every edge of this graph and one on every non-edge of this graph.
Note that because all the values are either 1 or one half, the triangle inequality is satisfied for every triple, and the objective value is just roughly N over 2, because we only pay for the edges in this graph, one half per edge; the non-edges are at distance 1, so they are okay and pay nothing. So this is the integrality gap, and this also means that we cannot achieve an approximation factor better than 2 if we just use the linear program as a lower bound on the optimum. Now the question that we asked in our work was whether we can actually match this integrality gap and achieve this approximation factor, which is roughly 2. The answer is: almost, so we can achieve a 2.06-approximation, which is within 3 percent of the integrality gap. Remember that the previous best was a 2.5-approximation.
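For reference, the arithmetic behind the star example is roughly:

```latex
\mathrm{IP} \;=\; n-2 \quad (\text{cluster one edge, cut the other } n-2 \text{ edges}),
\qquad
\mathrm{LP} \;\le\; \frac{n-1}{2} \quad (x_{uv}=\tfrac12 \text{ on edges},\ 1 \text{ on non-edges}),
```

so the ratio IP/LP approaches 2 as n grows, which is the 2 minus o(1) integrality gap mentioned above.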
So I’m going to introduce the other results that we get more formally later, but for now let me just informally say that we can also use our technique for a bunch of related scenarios, for example achieving an approximation for clustering objects of K different types. So suppose instead of having just objects of 1 type, as we had in this example, we have objects of K different types and we can only compare pairs of objects of different types with each other. In that case we match the integrality gap of 3; the previous best was 4.
I’m going to talk about the weighted case, the one that you asked me about, later, but only about a specific version of the weighted case in this talk, because the general version of the weighted case is actually pretty hard. For the case when the weights satisfy the triangle inequality we get a 1.5-approximation, and there is an integrality gap of 1.2 that we show. So the new results are bold on this slide; that’s the point of the bold here.
>>: So what’s the second point, the K types, what was it?
>> Grigory Yaroslavtsev: So suppose that you have objects of 2 types. Suppose you are [indiscernible] and you have information about comparisons between people and movies, which people like which movies, but you don’t know which people like which people, because you are not Facebook, and you don’t know which movies like which movies, because that doesn’t make any sense. So you want to cluster movies and people using this data.
>>: [inaudible].
>> Grigory Yaroslavtsev: The problems on these slides will be rigorously defined later, but unfortunately not right now, because I’ll need to introduce some notation before I can do that.
All right. So the next algorithm that we are going to see is the one that actually uses this linear
program. I will call it the linear program-based pivoting algorithm, introduced by Ailon, Charikar and Newman. This is the one that achieved a 2.5-approximation. It’s similar to the basic pivoting we have seen before, but it uses a linear program. So first you solve the linear program and you get values XUV, which I’m going to informally call distances. Then you do a similar pivoting process as before: you pick a random pivot vertex, let’s call it P.
Now you form a cluster of P together with a randomly chosen set; I’ll call this set S of P. The randomly chosen set contains every other vertex with probability 1 minus the distance. The reason for this is pretty simple: we want the vertices which are close to be in the cluster with P with large probability, and if they are far the probability should be small. So this is just a linear function of the distance, 1 minus the distance. These decisions are made independently for each of the other vertices. Then we make a cluster consisting of P together with this random set S of P, and we just recurse on the rest as before.
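A minimal sketch of this LP-based pivoting, assuming the LP has already been solved and its distances are available in a dictionary keyed by unordered pairs; the solver call is not shown, and the data structures are hypothetical:

```python
import random

def lp_pivot_clustering(vertices, x):
    """LP-based pivoting: pick a random pivot P, then add each remaining
    vertex V to P's cluster independently with probability 1 - x[{P, V}],
    where x holds the LP distances in [0, 1]."""
    remaining = set(vertices)
    clusters = []
    while remaining:
        p = random.choice(list(remaining))
        cluster = {p}
        for v in remaining - {p}:
            # closer vertices (small LP distance) join with higher probability
            if random.random() < 1.0 - x[frozenset((p, v))]:
                cluster.add(v)
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```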
So this is the algorithm, and let me show you how it works on the integrality gap example, because this will be important later. Remember that in this star-like example the LP solution XUV can be one half on edges and 1 on non-edges, so the LP cost is roughly N divided by 2 here. Now let’s look at what our algorithm does in this case. The algorithm picks a random pivot. Suppose the random pivot is chosen to be one of the vertices which are the spikes of the star. This happens with probability which is almost 1 in this example, 1 minus 1 over N.
In this case, with probability one half we pick a singleton cluster, and with probability one half we also grab the center of the star; so with probability one half we get a cluster of size 2 and with probability one half we get a cluster of size 1. If we get a cluster of size 2, if we cluster this edge together, then we get pretty much the optimum cost, because the rest are just singleton clusters and we only cut the remaining edges. On the other hand, if we just get a singleton cluster consisting of this vertex, then let me informally claim that we made no progress whatsoever. We did cut away one vertex, but in the big picture of things that doesn’t matter at all. So this is why I just want to say that the expected cost conditioned on VI being the pivot is just one half of N plus one half of the original expected cost.
So the other case is when we pick the central vertex of the star to be the pivot, which happens
with probability of 1 over N. It’s a very small probability, but unfortunately if we do this then
we have to pay a lot, because by concentration bound we cluster roughly half of the vertices
together with a pivot and that means that we grab a quadratic number of non-edges inside of the
cluster, roughly N squared over 8 pairs of non-edges will end up inside. That means that we pay a lot here.
So putting this all together and solving for the expected cost, we can see that the expected cost in this case is roughly 5N divided by 4, which seems not that bad compared to the optimum, which is roughly N here. But remember that if we just use the linear program as a lower bound we are already losing a factor of 2. So comparing to the linear program lower bound we lose a factor of 2.5, and this is exactly what the Ailon, Charikar and Newman analysis matches. So you cannot do better than 2.5 this way.
So here is our idea on how to improve this. So the generic scheme will be very similar, but we
are going to account for this issue that we had in the previous example. When we pick the
randomly chosen set S of P our decision about including a vertex into S of P is going to depend
on 2 things now, not just on the LP value or the type of the edge, but on both of them. So
remember that we have seen two algorithms before. One made the decision just based on the
graph alone without looking at the LP value. So it rounded the vertex together with P if PV is an
edge and did not round otherwise. So this was the first algorithm that we’ve seen today. The
second algorithm was using the LP and just set this function F to be 1 minus XPV.
So let’s try to use both. So here is the function that we use. The function makes a distinction based on whether PV is an edge or a non-edge in the graph and applies different rules based on this. If it’s a non-edge, it’s the same rule as in the LP-based pivoting, 1 minus XPV. If it is an edge, it is 1 minus some function F plus of XPV. The function F plus is some carefully chosen function: a piecewise quadratic function which is 0 up to the point 0.19 and 1 after some point B, which is roughly a half or slightly greater than a half, and in between it is just some quadratic polynomial. Note that the fact that we cannot get an approximation factor of 2 really comes from the fact that this B is a little bit above a half, which means that in that example, when we pivot at the central vertex, we are still going to round a really tiny fraction of the vertices together with it. That fraction is really tiny, and it’s something that we are really trying to minimize here.
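As a sketch, such a rounding rule could look as follows; the threshold 0.19 and a value of B slightly above one half come from the description above, but the exact value of B and the particular quadratic piece used in the paper are not given here, so treat them as placeholders:

```python
def join_probability(x_pv, is_edge, a=0.19, b=0.51):
    """Probability that vertex V joins the pivot P's cluster: for non-edges
    it is 1 - x_pv; for edges it is 1 - f_plus(x_pv), where f_plus is 0
    below a, 1 above b, and rises in between.  The constant b and the
    quadratic ramp are illustrative placeholders."""
    if not is_edge:
        return 1.0 - x_pv
    if x_pv < a:
        f_plus = 0.0                          # very close neighbors always join
    elif x_pv >= b:
        f_plus = 1.0                          # far-away neighbors never join
    else:
        f_plus = ((x_pv - a) / (b - a)) ** 2  # assumed quadratic ramp in between
    return 1.0 - f_plus
```

This join probability could be plugged into the pivoting sketch shown earlier in place of the plain 1 minus XPV rule.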
>>: [inaudible].
>> Grigory Yaroslavtsev: That’s the reason why we got 2.06, but not exactly 2. So there is actually some conflicting constraint which does not let us make this function reach 1 all the way at one half. It will come from the analysis.
>>: [inaudible].
>> Grigory Yaroslavtsev: It’s a pretty benign function.
>>: [inaudible].
>> Grigory Yaroslavtsev: It’s something that you study in 7th grade right; you already know
what quadratic functions are so it’s not too bad. So now let’s move onto the analysis and see
where this comes from. All right. So for the analysis let’s break the execution of the algorithm
into stages and look at one stage at a time. So remember that at every stage we just construct one
cluster and we start with a set of vertices V. There are two things to keep track of ST is a cluster
constructed at step T and VT is a set of vertices which were still not processed when we started
step T. So V1 is exactly all vertices, S1 is a first cluster, V2 is what is left after the first cluster,
and so on and so on. So it’s important to keep track of these two notions: ST and VT for the rest.
So let’s look at the analysis. We are going to break it into stages as follows: at every stage we
have the set of vertices VT, which is still unprocessed. We pick the cluster ST from it, and the set of vertices VT plus 1 is what is left unprocessed. And let’s break down the cost that the
algorithm pays into these stages as well.
So suppose we consider this stage T, let’s analyze the cost of the algorithm at this stage. What
the algorithm does at stage T is it makes commitments to make mistakes on two types of pairs.
There are edges in the graph which will get cut when we pick the cluster ST and there will be
non-edges in the graph that will be put inside the new cluster ST. Those are the two types of
mistakes that the algorithm commits to make when it picks the cluster ST. There are no other
mistakes that it has committed to make so far.
And let’s just add up these two types of mistakes. For every edge UV which is still left unprocessed, we pay in expectation the probability that we separate the endpoints of this edge, which is just the probability that we put one of the endpoints inside and the other one outside. For the non-edges it’s just the probability that we put both endpoints inside. So this is this expression. Now let’s compare this to a specifically designated part of the LP that we can use to charge these mistakes that the algorithm makes. So we will use the following part of the LP to charge these mistakes; I will denote it LP sub T, the part of the LP that we use at step T. The idea here is that we can use the pairs in the LP such that at least one of their endpoints is inside our new cluster, because these are the pairs that can be attributed to this cluster directly. And each such pair will be attributed to only one cluster, so we never overcharge.
So for the pairs with at least one endpoint inside the cluster we get contributions: the contribution from an edge is XUV, if you remember the objective, and for a non-edge it is 1 minus XUV, and each of them is multiplied by the probability that we get the contribution from this specific pair, which is just the probability that one of the endpoints ends up inside. So this is already something: we have been able to break down the comparison between the cost of the algorithm and the LP into these stages. It just suffices to show that the expected cost of the algorithm at each stage is at most alpha times the expected LP part for that stage; then we just add them up over all iterations and we are done.
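In symbols, the per-stage comparison sketched here is roughly the following; this is my notation, reconstructed from the description, with S_t the cluster formed at stage t and V_t the unprocessed vertices:

```latex
\mathbb{E}[\mathrm{ALG}_t] \;=\; \sum_{\substack{(u,v)\in E\\ u,v\in V_t}} \Pr\bigl[\,|\{u,v\}\cap S_t|=1\,\bigr]
\;+\; \sum_{\substack{(u,v)\notin E\\ u,v\in V_t}} \Pr\bigl[\,u,v\in S_t\,\bigr],
\qquad
\mathrm{LP}_t \;=\; \sum_{u,v\in V_t} c_{uv}\,\Pr\bigl[\,\{u,v\}\cap S_t\neq\emptyset\,\bigr],
```

where c_{uv} = x_{uv} for edges and c_{uv} = 1 - x_{uv} for non-edges, and the goal is to show that the expected algorithm cost at every stage is at most alpha times the expected LP part for that stage.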
So this is indeed already something, but it’s still too complicated to analyze these things, because
there are too many things going on. The next idea is to break down each stage into something
very, very simple, into analysis of just triples of vertices. So this is what I’m going to call
triangle based analysis. So let’s just focus on a triple of vertices for now and then I will put these
together. Consider a triple of vertices U, V, and W, where W is the purple pivot vertex. Let’s look at the expected error that we get on the pair UV if W is the pivot vertex.
So here is a picture: suppose we have a triple of vertices where W is the purple pivot vertex. If UV is an edge, then we make a mistake on this edge if we cluster one of the endpoints together with W and not the other one, that is, if we cut this edge in this clustering step. There are two situations like this, when we grab either U or V (but not both) together with W. Is this clear? I want to make sure, because it’s kind of complicated. Now if UV is a non-edge, then we only make a mistake if we grab both of the endpoints U and V into the cluster with W. If you remember what F means, then F really captures the probability that we grab an endpoint together with the pivot. So these are just the probabilities of these two events.
The same reasoning can be applied to the LP. The LP contribution for the triangle UVW is as follows. The LP contribution can be used for this triangle if we manage to grab at least one of the endpoints into the cluster. There are three cases when this happens, when we grab one of them, the other, or both, and the probability that at least one of these happens is given by this expression. If this happens, then depending on the type of this pair we get different contributions: if the pair is an edge then the contribution is XUV, and if the pair is a non-edge then the contribution is 1 minus XUV.
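In symbols, writing p for the probability that a vertex joins the cluster of pivot W (my notation; under the scheme above this is 1 minus the rounding function applied to the corresponding LP value), the two quantities for a triple U, V, W are roughly:

```latex
\mathrm{cost}_w(u,v) \;=\;
\begin{cases}
p_{wu}(1-p_{wv}) + p_{wv}(1-p_{wu}), & (u,v)\in E \quad \text{(the edge gets cut)},\\
p_{wu}\,p_{wv}, & (u,v)\notin E \quad \text{(the non-edge gets covered)},
\end{cases}
\qquad
\mathrm{lp}_w(u,v) \;=\; \bigl(1-(1-p_{wu})(1-p_{wv})\bigr)\,c_{uv},
```

with c_{uv} as before; the triangle-based analysis then has to verify that, for every triangle, the sum of the three cost terms is at most alpha times the sum of the three lp terms.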
>>: So I don’t understand why do you have this object in front of XUV, what is it?
>> Grigory Yaroslavtsev: So this is the probability that we can use a contribution of the LP for
this triangle. So remember that we partitioned the cost of the LP between iterations.
>>: [inaudible].
>> Grigory Yaroslavtsev: So the only way we cannot use the cost of the LP is if we manage to separate both U and V from W, and there are a few cases when this does not happen.
So the last trick is to express the cost of each iteration in terms of these triangles that we have
seen. This is very simple; you just write the expected cost as a sum over all pairs UV which are still not processed. For each pair the expected cost is the following: because we choose the pivot W uniformly at random, we choose a specific W with probability 1 over the size of VT. Conditioned on W being the pivot, the expected cost of UV is just this expression LWUV, which was on the previous slide. So this summation really is the expected cost of the algorithm at iteration T.
Now we can rearrange the summations and, instead of adding up over UV and then over W, add up over all triangles UVW. Then we double count each such pair, so there is a factor of 2 here, and we end up with a summation of these expressions LWUV. The key part of the trick is to reduce everything to just triangles. Exactly the same can be done for the LP, because it’s just a summation. So it suffices to compare ALG and LP on triangles only, and this is something that is actually manageable. Now we only have to show that this relationship holds for all triples UVW, but it’s still not that easy. We want to show this for all triangles, but remember that each triangle is specified by one of the possible edge/non-edge configurations and by the LP values assigned to its sides.
So we need to show that this relationship holds no matter what the type of the triangle is, no
matter what the LP values are as long as they satisfy the triangle inequality. The type of the
triangle is fairly easy to deal with; you just enumerate the four possible types. But the LP values satisfying the triangle inequality form a search space which is a 3-dimensional space, so it’s not that easy to enumerate over. This is where our paper sort of gets rough: you have to go through some case analysis and convince yourself that there actually is a choice of the function such that this holds. Remember that here we have flexibility in the choice of the function as well; so first we choose a function and then we show that this holds.
So I encourage you to go and read our paper; this is where the interesting stuff is, how we actually deal with this problem. Eventually we managed to show that for every configuration and every set of LP values which satisfies the triangle inequality this holds. The alpha
that comes out of the analysis is 2.06. Moreover we can show that no matter which function F
you pick using this analysis you cannot get better than 2.025. We can actually push it up I guess,
but the point is you cannot really get all the way down to 2 in this analysis.
>>: So a function which only depends on the edge type and the LP value?
>> Grigory Yaroslavtsev: Right yeah, this all applies only to this framework where you use this
type of function.
>>: [inaudible].
>> Grigory Yaroslavtsev: Exactly.
>>: Can you do a computer search to find the function F?
>> Grigory Yaroslavtsev: Yea, initially we had to do some computer search.
>>: And what type of structure were you looking at, like quadratic functions like this?
>> Grigory Yaroslavtsev: Yea, we were trying to come up with something polynomial.
>>: But did you restrict to quadratic or did you go higher?
>> Grigory Yaroslavtsev: We tried higher polynomials, but the improvement was kind of
marginal.
>>: But you did get an improvement.
>> Grigory Yaroslavtsev: I guess maybe you can improve a little bit, but it’s sort of –. The point here is that you cannot really get all the way down to 2, and we are pretty close to the limitation of this approach. So it’s kind of a tradeoff with your time.
>>: So that’s why you didn’t.
>> Grigory Yaroslavtsev: Yeah.
>>: [inaudible].
>>: So I don’t understand how you prove those lower bounds. How do you do a computer search?
>>: Will you go over this?
>> Grigory Yaroslavtsev: Yea, I won’t go over this unfortunately.
>>: [inaudible]. Again, this is only for this framework. [inaudible].
>> Grigory Yaroslavtsev: Yeah, so the analysis kind of focuses on this: there are some triangles that you can show are bad, and those specific triangles will have a very specific configuration of lengths; for example, the triangle inequality has to be tight for them. That gives a lot of restrictions, and they are also going to have equal lengths, they are going to be isosceles triangles. So that limits the search quite a bit.
>>: Some of those.
>> Grigory Yaroslavtsev: What’s that?
>>: Some of those.
>> Grigory Yaroslavtsev: Yeah, some of those. Okay, so then you can actually parametrize this: instead of having a function of 3 variables you basically only use 1 variable to control your triangle, because there are 2 other restrictions, like the triangle inequality being tight and the triangle being isosceles. Then it becomes a function of 1 variable and you can analyze it.
>>: [inaudible].
>> Grigory Yaroslavtsev: Right, but for them the problem was easier because they had the function fixed. They didn’t have to do a search of the function space.
>>: [inaudible].
>> Grigory Yaroslavtsev: So the framework was similar, but they didn’t have to do a search of the function space. All right. So the setting that I described is usually referred to in the literature as the complete graph case, because here we care about all pairs in the graph. In the objective we add up over all pairs in the graph, and we get a 2.06-approximation for this case. It can be derandomized, which is actually not that trivial in this case; there was a previous paper by these guys which derandomized the 2.5-approximation. This 2.06-approximation also works for the case when there are real weights which satisfy the so-called probability constraints.
Remember that in the setting that I described all the weights were either 0 or 1, but now suppose that in the objective in front of each term there is a weight CUV, and the weights corresponding to XUV and 1 minus XUV are complementary: XUV is multiplied by 1 minus CUV and 1 minus XUV is multiplied by CUV, and these are the weights. So in the previous case these CUV’s were either 0 or 1; if you allow them to be real then everything also works. In the special case, you should really think of these CUV’s as some empirical distances, so it’s natural to assume that they might satisfy triangle inequalities, as empirical distances would. In this case you get a better approximation: we get a 1.5-approximation using some different, carefully chosen rounding functions. The integrality gap for the triangle inequality case is 1.2.
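Written out, the weighted objective just described is roughly the following (with c_{uv} standing for the CUV weights; a reconstruction, not the slide’s exact formula):

```latex
\text{minimize}\quad \sum_{u<v} \Bigl[(1-c_{uv})\,x_{uv} \;+\; c_{uv}\,(1-x_{uv})\Bigr],
\qquad 0 \le c_{uv} \le 1 .
```

Setting c_{uv} to 0 on edges and 1 on non-edges recovers the complete 0/1 case; the probability constraints are just c_{uv} in [0, 1], and the triangle-inequality case additionally assumes c_{uw} is at most c_{uv} + c_{vw}.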
>>: [inaudible].
>> Grigory Yaroslavtsev: It does actually –.
>>: [inaudible].
>> Grigory Yaroslavtsev: That’s also true, but to answer your question, this generalizes the previous setting, because if the CUV’s are 0 and 1, then the pairs which have CUV equal to 1 are non-edges and the pairs which have CUV equal to 0 are edges.
>>: [inaudible].
>> Grigory Yaroslavtsev: So this is what I call just probability constraints on them, and this is strictly a generalization of the complete case. Yeah?
>>: [inaudible].
>> Grigory Yaroslavtsev: If they also satisfy something extra, which makes sense right, they
satisfied something else.
>>: You mean the triangle inequalities?
>> Grigory Yaroslavtsev: That’s right; those previous 0/1 weights did not satisfy triangle inequalities, because they could be inconsistent.
>>: My point is that when you say weighted you intuitively want to generalize beyond the 0/1 weights, and you want to capture the [indiscernible].
>>: [inaudible].
>>: [inaudible].
>> Grigory Yaroslavtsev: And I was referring to this setting before, about objects of K types. So now suppose that in the objective we only count the contribution of pairs which are edges in some graph. So we don’t care about all pairs now, but only about those which are edges in some graph. When we have objects of K types this graph is very specific: it’s just a complete k-partite graph. So in this case we have E, the set of edges of a complete k-partite graph; in the [indiscernible] example this is a complete bipartite graph, and we only sum up the contributions in the objective which correspond to edge pairs. So this is a more general case, and we get a 3-approximation here, again using some different rounding functions, and we have a matching 3 integrality gap.
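In the same notation, the k-partite variant restricts the sum to the edges of the complete k-partite graph (again a reconstruction):

```latex
\text{minimize}\quad \sum_{(u,v)\in E_{k\text{-partite}}} \Bigl[(1-c_{uv})\,x_{uv} \;+\; c_{uv}\,(1-x_{uv})\Bigr],
```

where E_{k-partite} contains only pairs of objects of different types; pairs of the same type simply do not appear in the objective.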
All right, thanks for your attention. Let me just wrap up with some open problems here. There are two main directions for open problems. One is getting a better approximation, and it’s natural to expect that maybe stronger convex programming relaxations can help us here. For example, if you use a natural semidefinite programming relaxation then the integrality gap is at least three halves. It’s natural to ask whether linear programming or semidefinite programming hierarchies can help reduce the integrality gap even further.
Another question is sort of the opposite of that: can we achieve a better running time while maybe compromising on the approximation, or keeping it the same? Can we avoid solving the LP? This is the main technical obstacle; it is the most computationally expensive part of this algorithm. Can we at least get the known 3-approximation in parallel? And let me just mention a few related scenarios. There is a very specific flavor of this problem called consensus clustering. I did not describe it, but I just want to drop it here. And as I think Konstantin mentioned before, the most general case of this problem, where there are weights which don’t necessarily satisfy triangle inequalities or any other constraints whatsoever, is actually pretty hard. It’s as hard as multicut, and getting a constant factor approximation would disprove the Unique Games Conjecture. So it’s a good way to work on UGC. All right, thanks.
[Applause]
>>: What is the hardness?
>> Grigory Yaroslavtsev: I asked a number of people what the APX-hardness constant is; no one knows. It’s something like 1.00-something, and no one knows what it is really, because it’s a reduction from some complicated problem; no one knows what the hardness is, but it’s very close to 1.
>>: [inaudible].
>>: What is consensus clustering?
>> Grigory Yaroslavtsev: Consensus clustering is a special case of the triangle inequality case, the one for which we got 1.5, but where the weights are generated according to some specific process. Let me describe the process to you: the process is taking a convex combination of clusterings. Every clustering gives a 0/1 distance measure. So suppose you have a bunch of clusterings coming to you from somewhere and you want to come up with a single clustering which is somehow a consensus between them, something that represents them all. What you can do is take a convex combination of the weights given by each of these clusterings and put them together; because it’s a convex combination, the weights are going to satisfy triangle inequalities, and that gives you an instance that you [indiscernible]. So you will get 1.5, as for any instance satisfying the triangle inequality, but you can actually do better because the instance is generated using some specific rule here. You can get four-thirds using the Ailon, Charikar and Newman algorithm, but the question is: can you do better? All right.
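Concretely, the generation rule just described can be written as follows (my notation):

```latex
c_{uv} \;=\; \sum_{i=1}^{m} \lambda_i\, d_i(u,v), \qquad \lambda_i \ge 0, \quad \sum_{i=1}^{m} \lambda_i = 1,
```

where d_i(u, v) is 0 if the i-th input clustering puts u and v in the same cluster and 1 otherwise; each d_i satisfies the triangle inequality, so the convex combination c does as well.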
>>: So given such a distance, can I actually also get back these clusterings as well, is that easy?
>> Grigory Yaroslavtsev: No, I don’t believe you can do that, because it’s one-way information; you get a clustering instead of a collection. It’s a good question, but I’m pretty sure that’s impossible.
>> Konstantin Makarychev: Any more questions? Thank you.
[Applause]