>> Konstantin Makarychev: It’s a great pleasure to have today’s speaker, Grigory Yaroslavtsev. Grigory graduated from Pennsylvania State University two years ago and then was a postdoc at Brown and UPenn. He was an intern at Microsoft Research, a different Microsoft Research. He works in various areas of theoretical computer science, particularly in approximation algorithms and various aspects of theoretical big data. Today he will talk about correlation clustering problems, and this work actually started here at Microsoft Research.
>> Grigory Yaroslavtsev: All right. Thanks a lot for the introduction. You actually pronounced my name properly and I’m very grateful for that. This work started as an internship project here at MSR, so I’m very happy to give this talk here. It is joint with Shuchi Chawla, Konstantin Makarychev, and Tselil Schramm. Today’s talk is about correlation clustering, which I like to think of as one of the happy success stories of theoretical computer science: the theorists came and addressed a problem that really begged for attention, a problem raised by practitioners in a machine learning lab called WhizBang Labs. There were a few papers which introduced this model in a semi-rigorous way, and the machine learning people noticed that this problem really had to be solved, but they didn’t have any rigorous algorithms for it. Then the theorists, Bansal, Blum, and Chawla, came and formalized what those machine learning people were trying to do and developed some basic algorithms for solving the problem. The problem is really so basic that you can easily explain it to a little kid, as the following graph partitioning problem. Suppose I give you a graph, like the one in this picture, where the graph represents similarities between objects, denoted as vertices. If there is an edge between two objects, it means these two objects are labeled as similar. If there is no edge between two objects, it means they are dissimilar. The goal is to cluster this set of objects into any number of clusters, which you can pick on your own, so that your clustering is consistent with this information about similarities and dissimilarities. I will make everything formal, but for now let’s just look at this solution, which has three clusters in it. I just want you to notice that there are three pairs which are violated, in the sense that they were labeled as similar but got cut: this, this, and this; and there is one pair which is actually dissimilar, there is no edge, but the pair is inside one of the clusters. So in this sense this clustering makes four mistakes: three for cutting similar pairs and one for putting a dissimilar pair inside a cluster. More formally, the goal of correlation clustering is to minimize the number of incorrectly classified pairs. A pair is incorrectly classified if it is either a covered non-edge or a non-covered edge, as in the example we’ve just seen. Note that these two types of mistakes are treated symmetrically: you just add them up, and the goal is to minimize the sum. So in this example the score is 4, for 1 covered non-edge and 3 cut edges. This is actually the optimal solution for this instance; you cannot do better than that.
>>: [inaudible].
>> Grigory Yaroslavtsev: This is not very easy to see. All right.
>>: [inaudible].
>> Grigory Yaroslavtsev: That’s another way of looking at it.
So there is no small certificate for me to convince you of this. Okay, now for those of you familiar with constraint satisfaction, I just want to say that this can be seen as a MinCSP, but one where the number of labels is unbounded. The number of labels here corresponds to the number of clusters that you can pick, and because there is no restriction on the number of clusters, the number of labels can be as big as the number of vertices. So it’s not directly amenable to CSP techniques, because those usually apply only when the number of labels is very small. Since the original introduction of this problem in 2004 there has been a lot of progress on it. The original paper by Bansal, Blum, and Chawla gave some constant approximation which was pretty large; by approximation here I just mean the ratio between the number of mistakes that the algorithm makes and the cost of the optimal solution. Since then there has been a number of papers, culminating in a 2.5-approximation in the paper by Ailon, Charikar, and Newman. The problem is also known to be so-called APX-hard, which means that you cannot achieve an approximation which is arbitrarily close to 1 in polynomial time, although the constant in the hardness is actually very close to 1; it’s 1.00-something. There is also a complementary objective: instead of minimizing the number of incorrectly classified pairs we can instead maximize the number of correctly classified pairs. That is usually much easier to do because there are much stronger lower bounds on the optimum for that objective, so the complementary objective can be much easier to optimize; already in 2004 there was a polynomial-time approximation scheme for it. Today I will focus on the minimization objective. During the 10 years of existence of correlation clustering it has become one of the most successful clustering paradigms, due to a number of advantages that it has over other clustering methods. One such advantage is that it only uses qualitative information about similarities between objects. Note that the only data I had to collect to run a correlation clustering algorithm was information about comparisons between vertices, between objects in my dataset: whether they are similar or dissimilar. Compare this to other popular clustering methods: for example, if you want to cluster using k-means, which is one of the mainstream methods of machine learning, then what you need is a d-dimensional vector for each point; you need to embed your points into d-dimensional Euclidean space. This is a lot more information than what correlation clustering uses. Another interesting feature of correlation clustering is that you don’t really have to specify the number of clusters in advance. The number of clusters is going to be selected by the algorithm itself to best fit the data that you have. One of the questions that usually comes up with k-means, for example, is: how do we pick the value of k? If you don’t know what a good value of k is, you try all of them, which blows up the running time, or you guess it somehow and hope that it is a good guess; but here the number of clusters is really produced as an outcome of the algorithm itself, so you don’t have to worry about it. One of the applications of correlation clustering is document deduplication, and image deduplication as well.
So the way it works is as follows. Suppose you have a large collection of documents or images, and your friends from the machine learning group have come up with a new fancy algorithm for comparing these objects. They can tell you whether two objects are similar and how similar they are, but they don’t really have any rigorous guarantees about this comparison measure. It can be quite inconsistent; for example, as in the instance we had, the triangle inequality does not necessarily have to hold. If A and B are similar and B and C are similar, it does not necessarily mean that A and C are similar. So we treat this comparison algorithm as a black box and don’t require any guarantees from it; we can use for deduplication any machine learning algorithm that spits out a number comparing two objects. Since the original paper by Bansal, Blum, and Chawla we know that the problem is NP-hard, and some simple approximation algorithms for it have been developed. Pretty much all the algorithms that I’m going to describe in this talk will be pretty simple, so using them in practice is actually pretty easy; they are not that complicated. For those of you who know something about learning theory, I just want to throw in the buzzwords that you can see correlation clustering as an agnostic learning problem. Yeah, you have a question?
>>: No, please finish your sentence first.
>> Grigory Yaroslavtsev: Okay. You can see it as an agnostic learning problem in the following sense: we are trying to fit a concept from the class of all clusterings to the data, and unfortunately in this case there is not necessarily a concept that perfectly fits the data, that makes zero mistakes. There is not necessarily a clustering that makes zero mistakes, so the optimization goal is in fact to find the clustering which achieves the minimum possible number of mistakes, and that’s exactly agnostic learning.
>>: So you were mentioning that these algorithms are very simple, and if I remember correctly this original one by Blum and so on --.
>> Grigory Yaroslavtsev: Yeah, that one is an exception.
>>: [inaudible]. So you mean simple in the sense that the later algorithms, do they also use a [indiscernible] solver?
>> Grigory Yaroslavtsev: So for the algorithms that I’m going to use, for this specific objective, you don’t actually need [indiscernible] programming.
>>: But Blum did use it in 2004?
>> Grigory Yaroslavtsev: For some other variants of this problem you do need a [indiscernible] program, and these are problems that I am going to be referring to.
>>: That are what, sorry?
>> Grigory Yaroslavtsev: I’m going to be referring to some other flavors of this problem where --.
>>: [inaudible].
>> Grigory Yaroslavtsev: For this specific variant you don’t need [indiscernible].
>>: I have another question: through the 10 years of work on these different variants, do they also allow you to have weights on the edges?
>> Grigory Yaroslavtsev: Yeah, that’s a good question and I’m going to talk about weights later. Some of them do and some of them don’t, and I will treat this topic later. All right. If you want to learn more about this than I’m going to describe in this talk, you can check a survey by Anthony Wirth, there was a KDD tutorial about correlation clustering and all sorts of flavors and toppings on top of it, and there is a Wikipedia article. That should be enough, I hope. All right. So now let me move on to the algorithms.
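Before the algorithms, here is the MinDisagree objective from the discussion above written out formally; the notation (a graph G = (V, E) and a partition C with cluster labels C(u)) is mine, introduced only for this recap.

```latex
% MinDisagree objective for correlation clustering.
% Input: a graph G = (V, E); an edge means "similar", a non-edge means "dissimilar".
% Output: a partition C of V into any number of clusters; C(u) is the cluster of u.
\[
\mathrm{cost}(\mathcal{C})
  \;=\;
  \underbrace{\bigl|\{\, \{u,v\} \in E \;:\; \mathcal{C}(u) \neq \mathcal{C}(v) \,\}\bigr|}_{\text{similar pairs that are cut}}
  \;+\;
  \underbrace{\bigl|\{\, \{u,v\} \notin E \;:\; \mathcal{C}(u) = \mathcal{C}(v) \,\}\bigr|}_{\text{dissimilar pairs put together}}
\]
% In the opening example the optimum value is 4: three cut edges plus one covered non-edge.
```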
The first algorithm that I’m going to describe is something I’m going to call data-based randomized pivoting. This algorithm was developed by Ailon, Charikar, and Newman, and it achieves a 3-approximation to the MinDisagree objective in expectation, because it’s a randomized algorithm; the expectation is over the random choices of the algorithm. Here is the algorithm; it’s very simple, as I promised. You pick a random pivot vertex, let’s call it V, and for the rest of the talk all the pivot vertices that I’m going to use will be purple. Then you make a cluster consisting of the vertex V together with the neighborhood of V in the graph. This is the first cluster that you pick; you set it aside and recurse on the rest of the vertices. So you remove this cluster from the graph and repeat the process again. A very simple algorithm, and you can easily implement it in almost linear time. It is very practical, but pretty sequential if you look at it, because you have to pick the pivot vertices one at a time, and in the worst case it might take a linear number of iterations of this process to produce the resulting clustering. I just want to mention that there is a parallel modification of this algorithm which achieves a slightly worse approximation but can be efficiently parallelized; specifically, in the MapReduce model it runs in a polylogarithmic number of rounds, and this is a recent paper which appeared at KDD. I have a blog post about it, which will rigorously introduce you to this model of MapReduce and tell you what the algorithm really does in that case, but the algorithm is so simple that it can actually be done in MapReduce as well. Lastly, here is an example of how this works on this example graph. We pick a pivot vertex P, which is the purple vertex over here, and it grabs its neighborhood into the first cluster, and then we recurse on the rest. We uniformly at random pick a vertex from the rest, this purple one, and construct the second cluster. Now we just have two singleton vertices, so we finish off by picking clusters of size 1. As you can see, in this example we cut 6 edges and put 2 non-edges inside, so we made 8 mistakes, which is a 2-approximation to the objective that we are trying to optimize. Okay, so now let’s look at this problem from a different perspective. In order to come up with a proof, I want to introduce an integer program for correlation clustering. The same clustering problem as before can be expressed as an integer program of the following form. This may look complicated, but let’s go over it step by step. The idea is that we introduce a binary distance x_uv between every pair of vertices; binary in the sense that x_uv is 0 or 1 only. The interpretation of this binary distance is that if x_uv is 0 then u and v are in the same cluster, and otherwise they are in different clusters. Using this interpretation you can convince yourself that the objective is exactly the MinDisagree objective that we had before, because we add up x_uv over all edges, so we pay 1 if u and v are in different clusters, and we add up 1 - x_uv over the non-edges. Finally, we have these triangle inequality constraints, and because the x_uv are binary variables, 0 or 1 only, the triangle inequalities are equivalent to transitivity constraints: if x_uw is 0 and x_wv is 0 then x_uv is also 0. This means that any valid integral solution to this program induces a partition, a clustering of the set of vertices, and that’s what we are looking for.
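Here is a minimal write-up of the integer program just described, together with the linear relaxation used below; it only uses the x_uv notation from the talk.

```latex
% Integer program for MinDisagree: x_{uv} = 0 means u and v share a cluster, 1 otherwise.
\begin{align*}
\text{minimize}   \quad & \sum_{\{u,v\} \in E} x_{uv} \;+\; \sum_{\{u,v\} \notin E} \bigl(1 - x_{uv}\bigr) \\
\text{subject to} \quad & x_{uw} \;\le\; x_{uv} + x_{vw}  & & \text{for all } u, v, w \in V \quad \text{(triangle inequality / transitivity)} \\
                        & x_{uv} \in \{0, 1\}             & & \text{for all } u, v \in V.
\end{align*}
% The LP relaxation replaces the last constraint by 0 <= x_{uv} <= 1, so a feasible
% solution embeds the vertices into a pseudometric.
```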
So the equivalence classes of the relation “u and v are the same if x_uv is 0” actually form a partition, which is the same as a clustering. Now, we don’t know how to solve integer programs in polynomial time, so we do the standard thing: we relax to a linear program. Instead of 0/1 variables we are going to use real variables between 0 and 1. A solution to this relaxation now gives an embedding of the vertices into a pseudometric. I’m using the term “pseudo” just to highlight the fact that two vertices are allowed to coincide.
>>: So you are assuming that there could be more than two clusters, right?
>> Grigory Yaroslavtsev: Yeah, there can be any number of clusters, because an equivalence relation can have any number of classes, of clusters.
>>: So you are saying x_uv is 1 only if they are in different clusters?
>> Grigory Yaroslavtsev: That’s right, it’s like a distance. If your distance is 0 then u and v are the same thing, they are in the same cluster. If the distance is 1 then they are in different clusters.
>>: [inaudible].
>> Grigory Yaroslavtsev: All right. So we get an embedding of the vertices into a so-called pseudometric, which is exactly the same as a usual metric except that some points are allowed to be at distance 0; that’s what “pseudo” refers to. Now let me show you that there is a gap between the objective values of the integer and linear programs. This is called the integrality gap of the program, and I’m going to show you that it is at least 2 minus o(1). There is a simple example that illustrates this: a star on n vertices, a center connected by edges to the other vertices, which are disconnected from each other. Remember that a solution to the integer program is exactly the same as a clustering, and the best clustering here looks like this: you pick one cluster of size 2 and the rest of the vertices form singleton clusters. That’s the best you can do, and it gives an integer program cost of n - 2. But the linear program can cheat in the following way. A solution to the linear program can be constructed as follows: you just put one half on every edge of this graph and one on every non-edge. Note that because all the values are either 1 or one half, the triangle inequality is satisfied for every triple, and the objective value is roughly n over 2, because we only pay for the edges in this graph, one half per edge; the non-edges are at distance 1, so they are okay. This is the integrality gap, and it also means that we cannot achieve an approximation factor better than 2 if we only use the linear program as a lower bound on the optimum. Now, the question that we asked in our work was whether we can actually match this integrality gap and achieve an approximation factor which is roughly 2. The answer is: almost. We can achieve a 2.06-approximation, which is within 3 percent of the integrality gap; remember that the previous best was a 2.5-approximation. I’m going to state the other results that we get formally later, but for now let me just informally say that we can also use our techniques for a bunch of related scenarios, for example achieving approximations for clustering objects of k different types. Suppose that instead of having objects of just one type, as in this example, we have objects of k different types and we can only compare pairs of objects of different types with each other.
In that case we have a matching integrality gap of 3; the previous bound was 4. I’m also going to talk about the weighted case, the one that you asked me about, but only about a specific version of the weighted case in this talk, because the general weighted case is actually pretty hard. For the case when the weights satisfy the triangle inequality we get a 1.5-approximation, and there is an integrality gap of 1.2 that we show. The new results are bold on this slide; that’s the point of the bold.
>>: So what’s the second point, the k types, what was it?
>> Grigory Yaroslavtsev: So suppose that you have objects of 2 types. Suppose you are Netflix and you have information about comparisons between people and movies, which people like which movies, but you don’t know which people like which people, because you are not Facebook, and you don’t know which movies like which movies, because that doesn’t make any sense. So you want to cluster movies and people using this data.
>>: [inaudible].
>> Grigory Yaroslavtsev: The programs on these slides will be rigorously defined later, but unfortunately not right now, because I’ll need to introduce some notation before I can do that. All right. So the next algorithm that we are going to see is the one that actually uses this linear program. I will call it the LP-based pivoting algorithm, introduced by Ailon, Charikar, and Newman; this is the one that achieves the 2.5-approximation. It’s similar to the basic pivoting we have seen before, but it uses the linear program. First you solve the linear program and get the values x_uv, which I’m going to informally call distances. Then you do a similar pivoting process as before: you pick a random pivot vertex, let’s call it P, and you form a cluster of P together with a randomly chosen set, which I’ll call S of P. The randomly chosen set contains every other vertex with probability 1 minus the distance. The reasoning here is pretty simple: we want the vertices which are close to P to be in the cluster with P with large probability, and if they are far the probability should be small. So here it is just a linear function of the distance, 1 minus the distance. These decisions are made independently for each of the other vertices. Then we make a cluster consisting of P together with this random set S of P, and we just recurse on the rest as before. So this is the algorithm, and let me show you how it works on the integrality gap example, because this will be important later. Remember that in this star-like example the LP solution x_uv can look like one half on edges and 1 on non-edges, so the LP cost is roughly n divided by 2. Now let’s look at what the algorithm does in this case. The algorithm picks a random pivot. Suppose the random pivot is chosen to be one of the vertices which are the spikes of the star; this happens with probability which is almost 1 in this example, 1 minus 1 over n. In this case, with probability one half we grab the center into the cluster and with probability one half we do not, so with probability one half we get a cluster of size 2 and with probability one half we get a cluster of size 1. If we get a cluster of size 2, if we cluster this edge together, then we get pretty much the optimum cost, because we just cut the remaining edges away and the rest are just singleton clusters. On the other hand, if we get a singleton cluster consisting of this vertex, then let me informally claim that we made no progress whatsoever.
We did cut away one vertex, but in the big picture that doesn’t matter at all. This is why I just want to say that the expected cost, conditioned on v_i being the pivot, is one half times n plus one half times the original expected cost. The other case is when we pick the central vertex of the star to be the pivot, which happens with probability 1 over n. It’s a very small probability, but unfortunately if this happens then we have to pay a lot, because by a concentration bound we cluster roughly half of the vertices together with the pivot, and that means we grab a quadratic number of non-edges inside the cluster: roughly n squared over 8 pairs which are non-edges will be inside. That means we pay a lot here. Solving for the expected cost, putting this all together, we can see that the expected cost is roughly 5n divided by 4, which seems not that bad compared to the optimum, which is roughly n here; but remember that by using the linear program as a lower bound we are already losing a factor of 2. So comparing this to the linear programming lower bound we lose a factor of 2.5, and this is exactly what the Ailon, Charikar, and Newman analysis matches. You cannot do better than 2.5 this way. So here is our idea for how to improve this. The generic scheme will be very similar, but we are going to account for the issue that we had in the previous example. When we pick the randomly chosen set S of P, our decision about including a vertex into S of P is going to depend on two things now: not just the LP value or the type of the edge, but both of them. Remember that we have seen two algorithms before. One made the decision based on the graph alone, without looking at the LP value: it rounded the vertex together with P if PV is an edge and did not otherwise. This was the first algorithm we’ve seen today. The second algorithm used only the LP and just set this function f to be 1 minus x_PV. So let’s try to use both. Here is the function that we use. The function makes a distinction based on whether PV is an edge or a non-edge in the graph and applies different rules accordingly. If it is a non-edge, it’s the same rule as in the LP-based pivoting: 1 minus x_PV. If it is an edge, it is 1 minus some function f+ of x_PV. The function f+ is carefully chosen: it is a piecewise quadratic function which is 0 up to the point 0.19 and 1 after some point b, which is roughly a half, or slightly greater than a half, and in between it is just some quadratic polynomial. Note that the fact that we cannot get an approximation factor of 2 really comes from the fact that this b is a little bit above a half, which means that in that example, when we pivot on the central vertex, we are still going to round a really tiny fraction of the vertices together with it. It is really, really tiny, and that is something we are really trying to minimize here.
>>: [inaudible].
>> Grigory Yaroslavtsev: That’s the reason why we get 2.06 but not exactly 2. There is actually some kind of conflicting reason which does not let us make this function equal to 1 all the way down to one half; it will come out of the analysis.
>>: [inaudible].
>> Grigory Yaroslavtsev: It’s a pretty benign function.
>>: [inaudible].
>> Grigory Yaroslavtsev: It’s something that you study in 7th grade, right; you already know what quadratic functions are, so it’s not too bad. So now let’s move on to the analysis and see where this comes from.
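To make the pivoting framework concrete, here is a minimal Python sketch of the generic scheme just described: pick a uniformly random pivot p and, independently for every remaining vertex v, put v into p’s cluster with probability f(x_pv, edge type). The three choices of f below correspond to the three algorithms in the talk. The talk only gives the shape of f+ (zero below 0.19, one above a point b slightly over a half, quadratic in between), so the value of B and the quadratic used here are placeholders rather than the function from the paper, and all names are mine.

```python
import random

def pivot_clustering(vertices, edges, x, include_prob):
    """Generic randomized pivoting: repeatedly pick a random pivot p and form a
    cluster containing p plus every unprocessed vertex v, included independently
    with probability include_prob(x_pv, whether pv is an edge).

    edges: set of pairs (u, v); x: dict of LP distances (may be empty for the
    purely graph-based rule, in which case the distance defaults to 1.0)."""
    remaining = set(vertices)
    clusters = []
    while remaining:
        p = random.choice(list(remaining))                 # uniformly random pivot
        cluster = {p}
        for v in remaining - {p}:
            d = x.get((p, v), x.get((v, p), 1.0))          # LP "distance" x_pv
            is_edge = (p, v) in edges or (v, p) in edges
            if random.random() < include_prob(d, is_edge):
                cluster.add(v)
        clusters.append(cluster)
        remaining -= cluster                               # recurse on the rest
    return clusters

# 1) Data-based pivoting (Ailon-Charikar-Newman, 3-approximation):
#    ignore the LP and grab exactly the neighborhood of the pivot.
def f_graph(d, is_edge):
    return 1.0 if is_edge else 0.0

# 2) LP-based pivoting (Ailon-Charikar-Newman, 2.5-approximation):
#    include every vertex with probability 1 - x_pv, regardless of edge type.
def f_lp(d, is_edge):
    return 1.0 - d

# 3) The rounding from this talk: non-edges use 1 - x_pv as before, while edges
#    use 1 - f_plus(x_pv) for a piecewise quadratic f_plus.
A = 0.19      # threshold stated in the talk
B = 0.51      # "roughly a half, slightly greater" -- placeholder value

def f_plus(d):
    if d < A:
        return 0.0
    if d > B:
        return 1.0
    return ((d - A) / (B - A)) ** 2    # placeholder quadratic interpolation

def f_new(d, is_edge):
    return 1.0 - f_plus(d) if is_edge else 1.0 - d
```

Plugging f_graph, f_lp, or f_new into pivot_clustering gives the three variants; for f_graph the dictionary x can simply be empty.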
All right. So for the analysis, let’s break the execution of the algorithm into stages and look at one stage at a time. Remember that at every stage we construct exactly one cluster, and we start with the set of vertices V. There are two things to keep track of: S_t is the cluster constructed at stage t, and V_t is the set of vertices which were still not processed when we started stage t. So V_1 is all the vertices, S_1 is the first cluster, V_2 is what is left after the first cluster, and so on. It is important to keep track of these two notions, S_t and V_t, for the rest of the talk. So let’s look at the analysis. We break it into stages as follows: at every stage we have the set of vertices V_t which is still unprocessed, we pick a cluster S_t from it, and the set of vertices V_{t+1} is what remains unprocessed. Let’s break down the cost that the algorithm pays into these stages as well. Consider stage t and let’s analyze the cost of the algorithm at this stage. What the algorithm does at stage t is commit to making mistakes on two types of pairs: there are edges in the graph which get cut when we pick the cluster S_t, and there are non-edges in the graph which get put inside the new cluster S_t. Those are the two types of mistakes that the algorithm commits to when it picks the cluster S_t; there are no other mistakes that it has committed to so far. Let’s just add up these two types of mistakes. For every edge uv which is still unprocessed, the expected contribution is the probability that we separate the endpoints of this edge, which is the probability that we put one of the endpoints inside and the other one outside; and for the non-edges it is the probability that we put both endpoints inside. So this gives this expression. Now let’s compare this to a specifically designated part of the LP that we can use to charge the mistakes that the algorithm makes. I will denote by LP_t the part of the LP that we use at stage t. The idea is that we can use the pairs in the LP such that at least one of their endpoints is inside our new cluster, because these are the pairs that can be attributed to this cluster directly, and each such pair is attributed to only one cluster, so we never overcharge. So if we use the pairs with at least one endpoint inside the cluster, then we get contributions from edges and non-edges: the contribution from an edge is x_uv, if you remember the objective, and for a non-edge it is 1 minus x_uv, and each of them is multiplied by the probability that we get the contribution from this specific pair, which is the probability that at least one of the endpoints ends up inside. So this is already something: we have been able to break down the comparison between the cost of the algorithm and the LP into these stages. It then suffices to show that the expected cost of the algorithm at each stage is at most some factor alpha times the LP part for that stage; then we just add these up over all iterations and we are done. This is indeed already something, but it is still too complicated to analyze directly, because there are too many things going on. The next idea is to break down each stage into something very, very simple: into the analysis of just triples of vertices. This is what I’m going to call triangle-based analysis. So let’s just focus on a triple of vertices for now, and then I will put things together. Consider a triple of vertices u, v, and w, where w is the purple pivot vertex.
Let’s look at the expected error that we get on the pair uv if w is the pivot vertex. Here is a picture: suppose we have a triple of vertices where w is the purple pivot vertex. If uv is an edge, then we make a mistake on this edge if we grab one of the endpoints together with w and not the other, that is, if we cut this edge in this clustering step. There are two situations like this, when we grab either u or v together with w. So is this clear? I want to make sure, because it’s kind of complicated. Now, if uv is a non-edge, then we only make a mistake if we grab both of the endpoints u and v into the cluster with w. If you remember what f means, f really captures the probability that we grab an endpoint together with the pivot, so these are just the probabilities of these two events. The same reasoning can be applied to the LP. The LP contribution for the triangle uvw is as follows: the LP contribution can be used in this triangle if we manage to grab at least one of the endpoints. There are three cases when this happens, when we grab one of them, the other, or both, and the probability that at least one of these happens is given by this expression. If this happens, then depending on the type of the pair we get different contributions: if the pair is an edge then the contribution is x_uv, and if the pair is a non-edge then the contribution is 1 minus x_uv.
>>: So I don’t understand why you have this object in front of x_uv, what is it?
>> Grigory Yaroslavtsev: This is the probability that we can use the contribution of the LP for this triangle. Remember that we partitioned the cost of the LP between iterations.
>>: [inaudible].
>> Grigory Yaroslavtsev: So the only way we cannot use the cost of the LP for this pair is if we manage to separate both u and v from w, and there are a few cases when this does not happen. So the last trick is to express the cost of each iteration in terms of these triangles that we have seen. This is very simple: you just write the expected cost as a sum over all pairs uv which are still not processed. For each pair the expected cost is the following: because we choose the pivot w uniformly at random, we choose a specific w with probability 1 over the size of V_t, and conditioned on w being the pivot the expected cost of uv is just the expression L_w(u,v) from the previous slide. So the summation really is the expected cost of the algorithm at iteration t. Now we can rearrange the summations and, instead of adding up over uv and then over w, add up over all triangles uvw; then we double count each such pair, so there is a factor of 2 here, and we end up with a summation of these expressions L_w(u,v). The key part of the trick is to reduce everything to just triangles. Exactly the same can be done for the LP, because it is just a summation. So it suffices to compare ALG and LP on triangles only, and this is something that is actually manageable. Now we only have to show that this relationship holds for triples u, v, w, but that is still not that easy. We want to show it for all triangles, but remember that each triangle can have one of several possible edge/non-edge configurations; a triangle is specified by its edge/non-edge configuration and the LP values assigned to its sides. So we need to show that this relationship holds no matter what the type of the triangle is and no matter what the LP values are, as long as they satisfy the triangle inequality.
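Here is my reading of the per-triangle quantities just described, written out; the names L_w(u,v) and lp_w(u,v) and the shorthand p_u, p_v are mine, and the exact bookkeeping in the paper may differ slightly.

```latex
% w is the pivot; p_u and p_v denote the probabilities that u and v, respectively,
% are grabbed into w's cluster (p_u = f(x_{wu}, type of wu), and similarly for p_v).
\[
L_w(u,v) \;=\;
\begin{cases}
  p_u (1 - p_v) + p_v (1 - p_u), & uv \in E \quad \text{(the edge gets cut)},\\[2pt]
  p_u \, p_v, & uv \notin E \quad \text{(both endpoints get grabbed)},
\end{cases}
\]
\[
\mathrm{lp}_w(u,v) \;=\;
\bigl(1 - (1 - p_u)(1 - p_v)\bigr) \cdot
\begin{cases}
  x_{uv}, & uv \in E,\\
  1 - x_{uv}, & uv \notin E,
\end{cases}
\]
% i.e. the LP charge for the pair uv is released exactly when at least one endpoint
% lands in the pivot's cluster. Averaging over the uniformly random pivot and
% regrouping pairs into triples, it suffices to prove one local inequality for every
% edge/non-edge configuration and all LP values obeying the triangle inequality:
\[
L_u(v,w) + L_v(u,w) + L_w(u,v)
\;\le\; \alpha \,\bigl(\mathrm{lp}_u(v,w) + \mathrm{lp}_v(u,w) + \mathrm{lp}_w(u,v)\bigr),
\qquad \alpha = 2.06 .
\]
```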
The type of the triangle is fairly easy to deal with: you just enumerate the four possible types. But the LP values satisfying the triangle inequality form a search space which is 3-dimensional, so it is not that easy to enumerate over. This is where our paper sort of gets rough, and you have to go through some case analysis and convince yourself that there actually is a weighting, a function, such that this holds. Okay, and remember that here we also have flexibility in the choice of the function: first we choose a function, and then we show that this holds. So I encourage you to go and read our paper; this is where the interesting stuff is, how we actually deal with this problem. Eventually we managed to show that for every configuration and every set of weights which satisfies the triangle inequality this holds. The alpha that comes out of the analysis is 2.06. Moreover, we can show that no matter which function f you pick, using this analysis you cannot get better than 2.025. We could probably push that up a bit, I guess, but the point is that you cannot really get all the way down to 2 with this analysis.
>>: So, a function which only depends on the edge type and the [indiscernible]?
>> Grigory Yaroslavtsev: Right, yeah, this all applies only to this framework where you use this type of function.
>>: [inaudible].
>> Grigory Yaroslavtsev: Exactly.
>>: Can you do a computer search to find the function f?
>> Grigory Yaroslavtsev: Yeah, initially we had to do some computer search.
>>: And what type of structure were you looking at, quadratic functions like this?
>> Grigory Yaroslavtsev: Yeah, we were trying to come up with something polynomial.
>>: But did you restrict to quadratic or did you go higher?
>> Grigory Yaroslavtsev: We tried higher-degree polynomials, but the improvement was kind of marginal.
>>: But you did get an improvement.
>> Grigory Yaroslavtsev: I guess maybe you can improve a little bit, but it’s sort of --. The point here is that you cannot really get all the way down to 2, and we are pretty close to the limitation of this approach. So it’s kind of a tradeoff with your time.
>>: So that’s why you didn’t.
>> Grigory Yaroslavtsev: Yeah.
>>: [inaudible].
>>: So I don’t understand how you prove those lower [indiscernible]. How do you do a computer search?
>>: Will you go over this?
>> Grigory Yaroslavtsev: Yeah, I won’t go over this, unfortunately.
>>: [inaudible]. Again, this is only for this framework. [inaudible].
>> Grigory Yaroslavtsev: Yeah, so the analysis kind of focuses on the fact that there are some triangles that you can show are bad, and those specific triangles have a very specific configuration of lengths; for example, the triangle inequality has to be tight for them. That gives a lot of restrictions: we are going to have two equal lengths, these are going to be isosceles triangles. So that limits the search quite a bit.
>>: Some of those.
>> Grigory Yaroslavtsev: What’s that?
>>: Some of those.
>> Grigory Yaroslavtsev: Yeah, some of those. Okay, so then you can actually parametrize this: instead of having a function of 3 variables you basically only use 1 variable to control your triangle, because it has 2 other restrictions, namely the triangle inequality being tight and the triangle being isosceles. Then it becomes a function of 1 variable and you can analyze it.
>>: [inaudible].
>> Grigory Yaroslavtsev: Right, but for them the problem was easier because they had the function fixed; they didn’t have to search the function space.
>>: [inaudible].
>> Grigory Yaroslavtsev: So the framework was similar, but they didn’t have to do the search over the function space. All right. So the setting that I described is usually referred to in the literature as the complete graph case, because here we care about all pairs in the graph: in the objective we add up over all pairs in the graph, and we get a 2.06-approximation for this case. It can be derandomized, which is actually not that trivial in this case; there was a previous paper by these guys which derandomized the 2.5-approximation. This 2.06-approximation also works for the case when there are real weights which satisfy the so-called probability constraints. Remember that in the setting I described all weights were either 0 or 1, but now suppose that in the objective, in front of each term, there is a weight c_uv, and the weights corresponding to x_uv and 1 minus x_uv are complementary: x_uv is multiplied by 1 minus c_uv and 1 minus x_uv is multiplied by c_uv. In the previous case these c_uv were either 0 or 1; if you allow them to be real, everything still works. In a special case you should really think of these c_uv as empirical distances, so it is natural to assume that they might satisfy the triangle inequality, as empirical distances do. In this case you get a better approximation: we get a 1.5-approximation using some different, carefully chosen rounding functions. The integrality gap that we show for the triangle inequality case is 1.2.
>>: [inaudible].
>> Grigory Yaroslavtsev: It does actually --.
>>: [inaudible].
>> Grigory Yaroslavtsev: That’s also true, but to answer your question, this generalizes the complete case, because if the c_uv are 0 and 1, then the pairs with c_uv equal to 1 are non-edges and the pairs with c_uv equal to 0 are edges.
>>: [inaudible].
>> Grigory Yaroslavtsev: So this is what I call probability constraints on the weights, and it is strictly a generalization of the complete case. Yeah?
>>: [inaudible].
>> Grigory Yaroslavtsev: If they also satisfy something extra, which makes sense, right; they satisfy something else.
>>: You mean the triangle inequalities?
>> Grigory Yaroslavtsev: That’s right; those previous 0/1 weights did not satisfy the triangle inequality, because they could be inconsistent.
>>: My point is that when you say weighted, you intuitively want to generalize beyond these weights and you want to capture the [indiscernible].
>>: [inaudible].
>>: [inaudible].
>> Grigory Yaroslavtsev: And I was referring before to the setting with objects of k types. So now suppose that in the objective we only count the contribution of pairs which are edges in some graph; we don’t care about all pairs now, but only about those which are edges in some graph. When we have objects of k types this graph is very specific: it is just a complete k-partite graph. So in this case E is the set of edges of a complete k-partite graph, like in the people-and-movies example, where it is a complete bipartite graph, and we only sum up the contributions in the objective which correspond to edge pairs. This is a more general case, and we get a 3-approximation here, again using some different rounding functions, and we have a matching integrality gap of 3. All right, thanks for your attention. Let me just wrap up with some open problems. There are two main directions for open problems. One is getting a better approximation, and it is natural to expect that maybe stronger convex programming relaxations can help us here.
For example, if you use a natural semidefinite programming relaxation, then the integrality gap is at least three halves. It is natural to ask whether linear programming or semidefinite programming hierarchies can help reduce the integrality gap even further. Another question is sort of the opposite of that: can we achieve a better running time, while maybe compromising on the approximation or keeping it the same? Can we avoid solving the LP? This is a technical obstacle; it is the most computationally expensive part of the algorithm. Can we get the best known approximation in parallel? And let me just mention that there are a few related scenarios; there is a very specific flavor of this problem called consensus clustering, which I did not describe, but I just want to drop the name here. And, as I think Konstantin mentioned before, the most general case of this problem, where the weights don’t necessarily satisfy the triangle inequality or any other constraints whatsoever, is actually pretty hard. It is as hard as multicut, and getting a constant factor approximation would disprove the Unique Games Conjecture. So it’s a good way to work on UGC. All right, thanks. [Applause]
>>: What is the hardness?
>> Grigory Yaroslavtsev: The exact number? No one knows what the APX-hardness constant is; it is something like 1.00-something. No one really knows what it is, because it comes from a reduction from some complicated problem, but it is very close to 1.
>>: [inaudible].
>>: What is consensus clustering?
>> Grigory Yaroslavtsev: Consensus clustering is a special case of the triangle inequality case, the one for which we got 1.5, but where the weights are generated according to a specific process. Let me describe the process to you: the process is taking a convex combination of clusterings. Every clustering is a 0/1 distance measure. So suppose you have a bunch of clusterings coming to you from somewhere and you want to come up with a single clustering which is somehow a consensus between them, something that represents them all. What you can do is take a convex combination of the weights given by each of these clusterings and put them together; because it is a convex combination, the weights are going to satisfy the triangle inequality, and that gives you an instance that you can run the algorithm on. So you get 1.5 for it as an instance satisfying the triangle inequality, but you can actually do better because the instance is generated using this specific rule: you can get four-thirds using the Ailon, Charikar, and Newman algorithm, but the question is, can you do better? All right.
>>: So given such a distance, can I actually also get back these clusterings, is that easy?
>> Grigory Yaroslavtsev: No, I don’t believe you can do that, because it is one-way: you get a single clustering instead of the collection. It’s a good question, but I’m pretty sure that’s impossible.
>> Konstantin Makarychev: Any more questions? Thank you. [Applause]
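For reference, here is a minimal sketch of the consensus-clustering construction from the last answer, assuming the input clusterings are given as lists of cluster labels over the same points; the function name and interface are mine. It builds the convex combination of the 0/1 clustering distances, which lies in [0, 1] and satisfies the triangle inequality (each 0/1 clustering distance is a pseudometric, and a convex combination of pseudometrics is a pseudometric), so the result can be fed to the algorithm for the triangle-inequality case.

```python
from itertools import combinations

def consensus_weights(clusterings, weights=None):
    """Turn a collection of clusterings of the same n points into one weighted
    correlation clustering instance.

    Each clustering is a list of cluster labels, one per point, and induces a 0/1
    distance: 0 if two points share a cluster, 1 otherwise. The returned c[(u, v)]
    is the weighted average of these distances under a convex combination."""
    m = len(clusterings)
    n = len(clusterings[0])
    if weights is None:
        weights = [1.0 / m] * m              # uniform convex combination
    assert abs(sum(weights) - 1.0) < 1e-9    # must be a convex combination

    c = {}
    for u, v in combinations(range(n), 2):
        c[(u, v)] = sum(w * (0.0 if labels[u] == labels[v] else 1.0)
                        for w, labels in zip(weights, clusterings))
    return c

# Example: three clusterings of four points; c[(u, v)] is the fraction of
# clusterings that separate u and v.
c = consensus_weights([[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]])
```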