>> Nikhil Devanur: Hello, everyone. It's my great pleasure to introduce Nina Balcan, who is a professor at Georgia Tech. She did her PhD at CMU. Nina moves seamlessly between machine learning, algorithms, and game theory, and she's going to tell us very interesting things about clusterings. [inaudible] applications. So, Nina. >> Maria-Florina Balcan: All right. Thank you. Great. So I'll talk about finding low error clusterings. And, in fact, an even better title for this work is approximate clustering without the approximation. And the meaning of this title will become clear later in the talk. And this is -- >>: [inaudible]. >> Maria-Florina Balcan: Yes. And this is a line of work joint with a number of people that I'm going to describe throughout the talk. All right. So this talk is about unsupervised learning, or clustering, which as you probably know is a major research topic in both algorithms and machine learning these days. Why? Because problems of clustering data come up everywhere in many real-world applications, and a few examples of applications include clustering news articles or Web pages by topic, clustering protein sequences by function, or, let's say, clustering images by who is in them. Okay. Now, many of these clustering problems can be formally modeled as follows: we assume we are given a set S of N objects, say N documents. And we'll assume that there exists some unknown correct desired target clustering. So it means that each object has some true label, in this case the topic. And then in this context, our goal will be to find a clustering of low error, where the error of a given clustering C prime is defined as the fraction of the points which are misclassified with respect to the target clustering after re-indexing of the clusters. All right. So again the setup here is as follows: we assume that we have a target clustering C, which is a partition C1, C2, ..., CK of the whole set of points. 
And now if we are given a new clustering, a clustering C prime, which is another partition C1 prime, C2 prime, ..., CK prime of the same set of points, we define the error of C prime with respect to C as the fraction of the points that you get wrong in the optimal matching between the clusters in C and the clusters in C prime. That's a natural notion of error rate. In the context of clustering it is really the analog of the 0-1 loss in the context of supervised classification. So it's a very natural notion of error rate. In the context of clustering, we don't really care about getting the names of the clusters right, we only care about getting the clusters and [inaudible] right. >>: [inaudible]. >> Maria-Florina Balcan: We don't get penalized. So we only care about getting the clusters right, not the names right. And this notion of error rate exactly captures it. Because we -- so the error is defined -- it's the fraction of the points that you get wrong in the optimal matching between the clusters. >>: [inaudible]. You need to know the [inaudible]. >> Maria-Florina Balcan: No, no, no, it's the minimum [inaudible] over all possible permutations of the labels. It exactly captures what you want it to capture. All right. So that's a natural goal in the context of clustering, one can argue: to get a clustering of low error according to this notion of error. I'm not claiming that this is the only notion of error that you can consider, but it's definitely a natural one. Okay? So that's our goal, to get a clustering of low error. Now, in order to do so, as is usual in the clustering setting, we assume that we are given a pairwise measure between pairs of points, so a measure of similarity or dissimilarity between pairs of points. So for instance in the document clustering case, this can be something based on the number of keywords in common, or in the protein clustering case it can be something based on the edit distance, and so on. 
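The error notion described here, the fraction of points misclassified under the best relabeling of the clusters, can be sketched as follows. This is only an illustration: the function name `clustering_error` and the encoding of a clustering as a list of cluster indices are assumptions made for the example, and the brute-force search over all relabelings is only practical for small K.

```python
from itertools import permutations


def clustering_error(target, candidate, k):
    """Fraction of points misclassified under the best relabeling.

    target, candidate: lists of cluster indices in range(k), one per point.
    Hypothetical helper: tries all k! relabelings, so use small k only.
    """
    n = len(target)
    best_agreement = 0
    for perm in permutations(range(k)):
        # perm[j] is the target label matched to candidate cluster j.
        agreement = sum(1 for t, c in zip(target, candidate) if perm[c] == t)
        best_agreement = max(best_agreement, agreement)
    return (n - best_agreement) / n
```

Swapping the names of the clusters leaves the error at zero, which is the "we don't get penalized for the names" point made in the talk.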
And now clearly, since our goal is to get a clustering of low error, this measure, this pairwise measure, has to somehow be related to what we are trying to do, has to somehow be related to the topic, because otherwise there will be absolutely no hope to do anything. And actually for the rest of my presentation, I'm going to focus on the case when the pairwise measure that you are given is a dissimilarity measure, and in particular is a distance function that satisfies the triangle inequality. All right. So that's our approach. Now, a classic approach to solve such a problem -- especially in the theoretical computer science community, and in machine learning sometimes -- is to view the data points as nodes in a weighted graph where the weights are based on the distance function which we are given, and then to pick some objective function to optimize, like k-median, k-means, min-sum and so on. And just to remind you in case you have never seen it, in the k-median clustering problem the goal is to find a partition C1 prime, C2 prime, ..., CK prime and center points, or medians, for these clusters in order to minimize the sum over all points of the distance to their corresponding median. So that's the k-median clustering objective. And in the k-means clustering case the goal is to minimize the sum of the squared distances, while in the min-sum clustering case the goal is to minimize the sum of the intracluster dissimilarities. All right. So again, a standard approach to solve such clustering problems in the theoretical computer science community and also in machine learning is to view the points -- the data points as nodes in a weighted graph where the weights are based on the dissimilarity function which we are given, and then to pick some objective to optimize and to develop algorithms that are approximation algorithms for these objectives. 
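As a rough illustration of the three objectives just named, the sketch below evaluates each one for a given partition. The function names and the representation (a list of clusters plus one center per cluster) are assumptions made for the example, and the min-sum version counts every ordered pair inside a cluster, which is one common convention rather than something stated in the talk.

```python
def k_median_cost(clusters, centers, dist):
    """Sum over all points of the distance to their cluster's center."""
    return sum(dist(x, c) for pts, c in zip(clusters, centers) for x in pts)


def k_means_cost(clusters, centers, dist):
    """Sum over all points of the squared distance to their cluster's center."""
    return sum(dist(x, c) ** 2 for pts, c in zip(clusters, centers) for x in pts)


def min_sum_cost(clusters, dist):
    """Sum of pairwise distances inside each cluster (ordered pairs: an assumption)."""
    return sum(dist(x, y) for pts in clusters for x in pts for y in pts)
```

For points on a line with `dist = lambda a, b: abs(a - b)`, the clusters `[[0, 2], [10]]` with centers `[0, 10]` have k-median cost 2, k-means cost 4, and min-sum cost 4.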
So I should say that many of these objectives are actually NP-hard to optimize, and so the best you can hope to do is to design an approximation algorithm for them. And significant effort has been spent in the last few years on developing better approximation algorithms for many of these objectives as well as developing better hardness results for them. So significant [inaudible] has been spent on this. For example, the best known approximation algorithm for k-median is a 3 plus epsilon approximation, and it's also known that beating one plus two over e is NP-hard. And actually this is a result that appears in the paper by Mohammed [inaudible]. Okay. So this is all right. I mean, this effort is all -- well, is justified sometimes. However, in the clustering problems that I was talking about, our goal is to get a clustering of low error. Our goal is to get the points right, to get close to the target clustering. And so that means that in those problems, if we end up using a C approximation algorithm for an objective phi, say k-median, in order to minimize the error rate, in order to, say, cluster our documents, then that means that we must make an implicit assumption that all the clusterings that are within a factor of C of the optimal solution for objective phi are, in fact, close to our target clustering. I mean, it's implicit because otherwise the clusterings we output are not going to be meaningful. So again, if we end up using a C approximation algorithm for objective phi, say for k-median, to cluster our documents, what we really want to do is to get a clustering of low error, to get close to our target clustering, which means that you must implicitly assume that any clustering within a factor of C of the optimal solution for objective phi must be epsilon close, according to the semantic difference distance, to our target clustering. >>: [inaudible]. >> Maria-Florina Balcan: It's -- so it's a parameter. 
It's closeness to the -- it's how close we are to the target clustering. >>: I have a basic question. [inaudible] clustering -- >> Maria-Florina Balcan: Yeah? >>: [inaudible]. >> Maria-Florina Balcan: So we have this notion, I'll introduce the notion of distance between two clusterings. So -- >>: But I don't know that -- >> Maria-Florina Balcan: Right. So the problem is difficult. And that's why people go and come up with surrogate objectives. Right. Exactly. Because it's difficult to think what's -- I mean, I don't know the ground truth; given what I have, the information, I cannot compute the distance to the ground truth, and so that's why people look at these surrogate objectives, sort of objectives like -- let's say, give me something that I can actually measure. Which is one of the reasons for which people develop approximation algorithms for clustering. But the point that we make is that what we really care about is to get close to the target clustering, and then the question is under what conditions can we get close to the target clustering? >>: Are you saying [inaudible] epsilon such that [inaudible] exist in epsilon that is like low enough -- >> Maria-Florina Balcan: No, no, no. >>: [inaudible]. >> Maria-Florina Balcan: So it's an assumption. So okay, so it's an assumption, okay. So we -- I'm not -- so I'm okay. Let me call this the C epsilon property, which says my instance satisfies this C epsilon property if any clustering within a factor of C of the optimal solution is epsilon close to the target clustering. It's an assumption. It might be or might not be satisfied. But the claim is that I guess the -- it's not a claim. It's more like the motivation here is that if you use a C approximation algorithm for objective phi, it must be the case that something like this should be satisfied for a reasonably small epsilon. 
Because otherwise the clustering that you are going to output, your C approximation, is going to be far from the target clustering anyway, so it's going to be meaningless. >>: [inaudible]. >> Maria-Florina Balcan: It's not -- these are -- okay. So we're going to -- what I'm going to show is that you're going to be able to cluster well under such an assumption. >>: [inaudible] you can choose different phis and you -- you say that phi [inaudible] is a good choice if it gives you back the clustering and this is the C epsilon. >> Maria-Florina Balcan: Okay. So think about it as being an assumption, and I'll go through some of the theorems and then we'll see how to [inaudible]. So for now, it's an assumption. So I'm going to say that my instance is a distance function which I am given and a hidden target clustering that I don't know, so that's my instance [inaudible] my algorithm, and I'm going to say that this instance satisfies this property if it is the case that for my instance, all the clusterings that are within a factor of [inaudible] the optimal solution for [inaudible] close to the target clustering. So it's a property of a given instance. And the type of guarantees that we're going to make is that if this property is satisfied, then we're able to cluster well. >>: [inaudible] how truthful is that a sum of -- of course, it's [inaudible] with epsilon equal to one -- >> Maria-Florina Balcan: Exactly. >>: But how truthful is it to expect that this is the case for -- >> Maria-Florina Balcan: Okay. So I'll come on to that a bit later. >>: [inaudible]. >> Maria-Florina Balcan: Okay. So there is a question. So I'll come to that later. Okay. But for now, the motivation that I'm trying to give is that if this is not satisfied anyway, using approximation algorithms to cluster well is not a good idea. 
So let's assume that this property is satisfied, and let's see what we can do with it. Okay. So given the C epsilon property -- so first of all, we can show that under the C epsilon property the problem of finding a C approximation to objective phi is as hard as it is in the general case. So the problem of finding a C approximation for objective phi does not become easier. However, under this property we are able to cluster well. We show that we'll be able to cluster well. We'll be able to get close to the target clustering without approximating the objective at all. So we solve the problem that we really wanted to solve without ever approximating the [inaudible] at all. And that's why we call this approximate clustering without the approximation. And please ask your question again later, Alexander, because I have answers to it. But it's too early to give them now. Okay. And so in particular what we can show is that -- so here is an example of a result. We can show that for the k-median clustering problem, for any C greater than one, under the C epsilon property we'll be able to get order of epsilon close to the target clustering. So we'll be able to output a clustering of low error. And we are able to do so even for values of C where getting a C approximation to the k-median clustering objective is actually NP-hard. And moreover, if the target clusters are sufficiently large, we are able to get even epsilon close to the target clustering, which given the assumption is the best we can hope to do. >>: So [inaudible]. >> Maria-Florina Balcan: Yeah. So this is actually order of epsilon over C minus one. So I'm hiding some factors here, right. If the target clusters are sufficiently large we can get exactly epsilon close, which given the assumption is the best you can hope to do. >>: In those hard cases you would get close to the right cluster but then it would be hard to find the median -- >> Maria-Florina Balcan: Right. 
So what we end up doing is we output a clustering which is close to the target clustering, but it doesn't necessarily have a small k-median objective value. So that's why we do approximate clustering, in some sense, without the approximation. >>: This is [inaudible]. >> Maria-Florina Balcan: Again? I'm sorry? >>: [inaudible] if we can't fix it, it ain't broke. It sounds very similar to [inaudible] do you have any reason to believe in that assumption, the C epsilon property? >> Maria-Florina Balcan: Okay. So no, I have no reason to believe in this. There are multiple answers actually. One of them is actually we did so I -- let me go for the dessert. I'll come back in a second again. Okay. Let me comment on this again later. Okay. So -- but that's a very good question. So is this assumption ever satisfied? And again, I'm going to comment on this later. All right. So now let me make a -- before actually presenting some of our results, let me just make a quick note. So from an approximation algorithms perspective, a natural and absolutely legitimate motivation for trying to improve the approximation ratio from C1 to C2, where C2 is [inaudible] C1, is that the data satisfies this condition for C2 but not for C1. And this is absolutely natural and legitimate because, in fact, we can even show instances, so for any C2 smaller than C1 we can construct instances that satisfy the C2 epsilon property but do not satisfy even the C1, 0.9 -- 0.49 property. Okay? So this is an absolutely natural -- so it's absolutely natural -- yes? >>: [inaudible] just having a softer version of the assumption or is that requiring that all clusterings [inaudible]. >> Maria-Florina Balcan: Okay. So please, please bear with me for a second. I have at the end actually, so I have multiple answers for the three questions that I got that are about the same topic. 
So I just -- I'm -- so this is -- I find that assumption [inaudible] from a theoretical point of view, because we get around an approximate result by using implicit assumptions with our of a basic assumption that I might present anyway. I'm not promoting it as something -- I'm not saying that this is satisfied in the real world. Although I do have some experimental evidence for it. And actually what I really think we should -- and so I'll comment on that. So I'm going to have a whole part of the talk where I'm going to talk more broadly, that it may be more interesting to consider broader properties. But I find this is a great assumption from a theoretical perspective. >>: [inaudible]. >>: [inaudible]. >> Maria-Florina Balcan: So it's along the same lines, please wait, okay. >>: It's not exactly the same line. I just -- so I don't understand what is [inaudible] about the objective phi because if phi is actually the distance, so you know, if phi were to [inaudible] the distance between the clusters and [inaudible] at the beginning then what does that mean? >> Maria-Florina Balcan: So this assumption -- so here we assume -- so it's an assumption about how these distances and [inaudible] relate. The distance is objective phi, the [inaudible] objective phi [inaudible]. >>: [inaudible]. >> Maria-Florina Balcan: Yes, it's an assumption. But not all the instances are going to satisfy this assumption. But the type of guarantees that we make is that if the assumption is satisfied then we'll be able to cluster well. >>: The assumption is for a specific objective function? >> Maria-Florina Balcan: Right. Exactly. It's not a generic assumption. So it's an assumption. Yeah. Okay. Good. So okay. So just a quick comment before I guess I move on to some concrete results and maybe hopefully clarify some of the questions. 
So the comment is that from an approximation algorithms perspective, an absolutely legitimate and natural motivation for trying to improve the approximation ratio from, say, C1 to C2, where C2 is smaller than C1, is that maybe our data satisfies the C2 epsilon property but not even the C1, 0.49 property. Okay? And this is absolutely legitimate. We can show such datasets. However, the very nice and interesting thing is that in our work we can do much better. We are able to cluster well even for values of C where getting a C approximation to the objective phi is actually NP-hard. So that's an interesting fact. All right. Then let me now give examples of results that we can show in this framework. So for instance, here is an example. We can show that for the k-median clustering objective, if the data satisfies the C epsilon property, then we can get order of epsilon over C minus one close to the target clustering. And moreover, if the target clusters are sufficiently large, we can even get epsilon close to the target clustering, where the notion of large depends on C. >>: So does that [inaudible] it needs the C and epsilon? >> Maria-Florina Balcan: It does. Yeah. So it needs -- >>: [inaudible]. >> Maria-Florina Balcan: It's not clear, because again, you need to be careful, because you cannot test -- if I give you a clustering, you cannot test how close it is to the target clustering, because you don't have the target clustering. So you cannot try binary search. So you need to be more careful and to consider the [inaudible]. >>: It says here is [inaudible] of the target cluster. >> Maria-Florina Balcan: Yes. We can do something similar for the C epsilon -- for the k-means clustering objective, and these results appear in the original paper. And we also looked at the min-sum clustering objective. 
And so here we show again that if the data satisfies the C epsilon property and if the target clusters are sufficiently large, then we can get order of epsilon over C minus one close to the target clustering. Now, in the case of arbitrarily small target clusters, if the number of clusters is smaller than log N over log log N, then again we can get order of epsilon over C minus one close to the target. However, if K is larger than log N over log log N, then what we do is we output a small list such that the target clustering is close to one of the clusterings in the list. And this is actually joint work with Mark Braverman. So these are examples of results. And I should also -- >>: K is the number of clusters? >> Maria-Florina Balcan: The number of target clusters. >>: In stone? [inaudible]. >> Maria-Florina Balcan: Yes. >>: [inaudible]. >> Maria-Florina Balcan: Yes. It's part of the input. Okay. So these are examples of positive theoretical results. And I should also point out that in a recent UAI paper we actually implemented the algorithm for large clusters for the k-median objective, and it turns out that this algorithm, or a variant of this algorithm, provides state of the art results for protein clustering. Okay. So this algorithm seems to be useful. So in other words, we [inaudible] dataset where definitely the assumption is satisfied. Obviously it's not going to be satisfied on lots of datasets necessarily. Okay. And I hope that this provides a partial answer to your question. And I have an even more complete answer at the end of my presentation, once I go through the C epsilon property. All right. So this is an overview of the type of results that I'm going to talk about. And for the rest of my presentation I'm going to pick an objective, in particular the k-median clustering objective, and I'm going to show how we can cluster well if the C epsilon property is satisfied. All right. 
So let's assume that indeed for our given dataset the C epsilon property is satisfied, which means any C approximation to the k-median optimal solution is in fact close to the target clustering. Okay? And just to simplify things, just to avoid technicalities, let's assume that the target clustering is the k-median optimal solution and that all the clusters are large enough, so they have size at least two times epsilon times N. This is again just to avoid technicalities in the presentation. We don't need these assumptions in general. Then let's introduce some notation. So for any point X let's denote by W of X the distance to its own center, and let's denote by W2 of X the distance to its second closest center. And let's moreover denote by W average the average over all points X of W of X. And so that means that just by definition OPT will be N times W average, where OPT is the value of the optimal k-median solution, which by my assumption corresponds to the target clustering. Okay. So this is just notation. Now, let me describe two implications that we can derive. The first one is an implication of the C epsilon property. And it says that if the C epsilon property is true, then at most epsilon times N points can have W2 smaller than C minus one times W average over epsilon. Okay? Why? Because otherwise, if more than epsilon N points have W2 small, then what we could do is we could move those points to their second closest cluster, and we would do so without increasing the objective by more than C minus one times W average over epsilon, times epsilon N, which is C minus one times OPT, and so you get a clustering which is still a C approximation to the k-median optimal solution but which is now epsilon far from the target clustering. So the C epsilon property would be contradicted. Okay? 
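Written out, the counting argument just given looks roughly as follows. The symbols $B$, $w(x)$, $w_2(x)$, $w_{avg}$ and $\Phi$ are notation chosen here for the quantities named in the talk. The sketch assumes the optimal centers are kept fixed and takes $B$ to be a set of exactly $\epsilon n$ points with $w_2(x) < (c-1)\,w_{avg}/\epsilon$, each moved to its second closest center to form $\mathcal{C}'$.

```latex
\begin{align*}
\Phi(\mathcal{C}')
  &\le \mathrm{OPT} + \sum_{x \in B} \bigl( w_2(x) - w(x) \bigr) \\
  &<   \mathrm{OPT} + |B| \cdot \frac{(c-1)\, w_{avg}}{\epsilon} \\
  &=   \mathrm{OPT} + \epsilon n \cdot \frac{(c-1)\, w_{avg}}{\epsilon}
   =   \mathrm{OPT} + (c-1)\, n\, w_{avg}
   =   c \cdot \mathrm{OPT},
\end{align*}
```

using $\mathrm{OPT} = n \cdot w_{avg}$ in the last step. So $\mathcal{C}'$ would still be within a factor $c$ of optimal even though $\epsilon n$ points changed cluster, which is the contradiction the talk refers to.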
So a consequence of this C epsilon property is that at most epsilon times N points can have W2 smaller than C minus one times W average over epsilon. Now, an even simpler fact to prove is that at most five epsilon N over C minus one points can have W greater than C minus one times W average over five epsilon. And this just follows from Markov's inequality. So it's very easy to show. And so that means that for the rest of the points, for what we call the good points, we have a huge gap. So for the rest of the points, W of X is at most C minus one times W average over five epsilon, and W2 of X is at least C minus one times W average over epsilon. So W is at most C minus one times W average over five epsilon, and W2 is at least five times this quantity. Okay. So let's denote this quantity by D. Let me call this quantity D critical. And so what we have is that most of the points, what we call the good points, one [inaudible] epsilon of the points, look like this. So they are within distance D critical of their own center. And so that means that by the triangle inequality any two good points in the same cluster will be within distance two times D critical of each other. Now, we also know that any good point is at distance at least five times D critical from any other center. And so again by the triangle inequality this implies that the distance between any two good points in different clusters is at least four times D critical. And so that means that if we now define a new graph G where we connect two points X and Y if they are within distance two times D critical of each other, then what we get is that the good points in the same cluster are going to form a clique, so they are going to be connected in that graph G. 
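The graph G just defined can be sketched as a plain threshold graph. The function name `threshold_graph`, the adjacency-set representation, and the parameter `tau` (standing for two times D critical) are assumptions made for this illustration. In the talk D critical depends on OPT, which is not known in advance, so here `tau` is simply passed in.

```python
def threshold_graph(points, dist, tau):
    """Adjacency sets of the graph with an edge i-j iff dist <= tau.

    Hypothetical helper: tau plays the role of 2 * d_critical.
    Runs in O(n^2) distance evaluations.
    """
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(points[i], points[j]) <= tau:
                adj[i].add(j)
                adj[j].add(i)
    return adj
```

On a toy instance with two well-separated groups on a line, each group becomes a clique, and no point of one group shares a neighbor with a point of the other.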
And we also have that good points in different clusters do not even have a neighbor in common in this graph G, because any two good points in two different clusters are at distance at least four times D critical from each other. And so that means that now basically the world will look like this. So the good points are going to form cliques in the graph G, so these are the good sets over here. Now, good points might connect to bad points, and bad points might connect to bad points. However, any bad point is only going to connect to one good set, because we know that any two good points in two different sets are at distance at least four times D critical from each other. Okay? So the world basically looks like this. So good points form cliques, they connect to bad points, bad points can connect to bad points, however, any bad point can only connect to one good set. And so that means that if we furthermore assume that the clusters are large, in particular if we assume that the clusters have size at least two times the size of the bad set, roughly, then what we can do is we can create a new graph H where we connect two points X and Y if they share enough neighbors in common in the graph G, and then in this new graph the K largest connected components of H are going to look like this, and we can just output this clustering. And this will be a clustering of low error. In particular, the error that we get will be basically order of the fraction of the bad points. Because we know that any connected component of H is going to correspond to a good set plus possibly some bad points. So we correctly cluster all the good points; we might make mistakes on the bad points, but that is a small fraction of the points. So we are able to get error order of epsilon over C minus one. Okay. So that's basically the algorithm for the large clusters case. >>: [inaudible] question. 
Are the bad points the ones that are currently labeled incorrectly or the ones that don't fit the sort of separation assumption? >> Maria-Florina Balcan: Right. So the bad points are defined by the -- I'm defining them in the analysis, right. So all those points that do not satisfy the property that they are much closer to their own center than to any other center. So the bad points are defined in the analysis. Okay. Now, if the target clusters are not so large, then the previous algorithm that I described doesn't really work. Why? Because we could have some clusters that are completely dominated by bad points. However, luckily it turns out that some algorithm that is equally simple is going to do the right thing, okay? And what's the algorithm? So just as before, we construct the graph G where we connect two points within distance two times D critical of each other. Then what we do is we just pick the vertex of the highest degree in the graph G. We pull out its entire neighborhood and we just repeat: again, pick the vertex of the highest degree in G, pull out its entire neighborhood and repeat. And what we can show is that this algorithm gives us a clustering of small error rate. And the main idea here is to basically charge off the errors to the bad points. Okay? And how? So basically here is the analysis. If the vertex V that we picked, the vertex of the highest degree, was a good point, then we are happy, because we know that we pull out an entire good neighborhood. Yeah? So that's a good case. However, if the vertex V that we picked was a bad point, like this one here, then we might end up pulling only part of a good set. Okay? So you might end up missing, say, r_i good points over here. However, since we are greedy, we know that we must have pulled out at least r_i bad points as well. 
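The greedy procedure described here (pick the highest-degree vertex of G, pull out its neighborhood, repeat) can be sketched as below. The function name `greedy_clusters` and the input format (a list of neighbor sets for G) are assumptions made for the example. Stopping after K rounds and returning the leftover points separately are also choices made for this sketch, since the talk does not spell out those details.

```python
def greedy_clusters(adj, k):
    """Greedy clustering on a graph given as a list of neighbor sets.

    Repeats up to k times: take the vertex with the most neighbors among
    the points not yet removed, and pull it out with those neighbors.
    Returns (clusters, leftover points).
    """
    remaining = set(range(len(adj)))
    clusters = []
    for _ in range(k):
        if not remaining:
            break
        # Degree is counted only among points still in the graph;
        # sorting makes ties break toward the smallest index.
        v = max(sorted(remaining), key=lambda u: len(adj[u] & remaining))
        cluster = {v} | (adj[v] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters, remaining
```

With `adj = [{1, 2}, {0, 2}, {0, 1}, {4}, {3}]` and `k = 2`, the first round pulls out `{0, 1, 2}` and the second pulls out `{3, 4}`, leaving nothing behind.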
This is the vertex of the highest degree, and so if we missed r_i good points here, that means that we must have pulled r_i bad points here. And so that means that we can basically charge off all the errors to the bad points. And so we get a clustering with error again order of the fraction of the bad points. And now here it is essential to use the fact that any bad point only attaches to one good set. So this is the only bad case that we have to consider. >>: So this [inaudible] two clusters [inaudible] the original clusters? >> Maria-Florina Balcan: Yes. But the point is that we still don't -- so we -- so that's true, right. So if you pick a bad point here, we might end up only pulling part of a good set plus a bad set. So you might miss some good points. But the claim is that overall, if we sum over all target clusters, we're not going to miss too many good points, because since we are greedy here, if we missed r_i good points here, we must have pulled out at least as many bad points, and so that means that we can charge off such errors to the bad points. Right. So that means that basically overall we get a clustering of low error. Okay. Now, going back to the large clusters case, it turns out that we can do even better. So we can get even epsilon close to the target clustering if the target clusters are large, but now the notion of largeness is going to depend on this parameter C. Okay. And what's the main idea? The main idea is that there are really two kinds of bad points. So epsilon N bad points are confused bad points, in the sense that the distance to their second closest center is not too much larger than the distance to their own center. So these are confused points. Now, the rest of the bad points are not really confused: they have W2 much larger, larger than C minus one times W average over epsilon. However, they are far from their own center. 
So these are non confused bad points. And it turns out that, I mean, it's kind of nice, you can recover the non confused bad points. And in particular what the algorithm does, given the clustering C prime from the algorithm so far, is we just reclassify each point X into the cluster of the lowest median distance. Okay? So we just have a [inaudible] same step. And now why is this giving a clustering of error at most epsilon? The idea is that the non confused bad points have a huge gap, at least five times D critical, between their distance to their own center and the distance to any other center. Now, of course we don't really know what these centers are, but remember that any good point is within distance D critical of its own center. And so combining these two facts we get that the non confused bad points are much closer to good points in their own cluster than to good points in any other cluster. Now, if these clusters are large enough, then the median will be controlled by good points. And so the non confused bad points are going to be pulled in the right direction. So we are going to reclassify them correctly. >>: But you don't compute the median. Do you compute the median? >> Maria-Florina Balcan: Yeah. Yes. So given the clustering so far, for each point X we compute the median distance to each of the clusters. And we assign it to the one of the smallest median distance. I mean, so it's a post-processing step. And if you do so for the large clusters case, this gives you a clustering of small error. >>: [inaudible]. >> Maria-Florina Balcan: The median. By median I mean like the statistical median. >>: [inaudible]. >> Maria-Florina Balcan: Yes, yes, yes, yes, yes. >>: Because the word median it seems -- >> Maria-Florina Balcan: Yeah, yeah. >>: [inaudible]. >> Maria-Florina Balcan: No. Statistical median. All right. 
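The reassignment step described here, where each point goes to the cluster with the smallest median distance, can be sketched as follows. The function name `reassign_by_median` and the index-list representation of clusters are assumptions made for the example. Including a point's zero distance to itself in the median of its current cluster is also a simplification chosen here, not something the talk specifies.

```python
from statistics import median


def reassign_by_median(points, clusters, dist):
    """Move each point to the cluster with the smallest median distance.

    clusters: list of lists of indices into points (the clustering so far).
    Returns a new list of index lists with the same number of clusters.
    """
    new_clusters = [[] for _ in clusters]
    for i, x in enumerate(points):
        # Statistical median of the distances from x to each current cluster.
        med = [
            median(dist(x, points[j]) for j in c) if c else float("inf")
            for c in clusters
        ]
        best = min(range(len(clusters)), key=lambda j: med[j])
        new_clusters[best].append(i)
    return new_clusters
```

For points `[0, 1, 2, 10, 11, 12, 3]` on a line, starting from `[[0, 1, 2], [3, 4, 5, 6]]` (the point 3.0 is in the wrong cluster), one pass moves index 6 into the first cluster, because its median distance to the first cluster is 2 while its median distance to the second is 7.5.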
And now a number of technical issues. So in particular, the description of the algorithm that I had so far depends on this value D critical, which is C minus one times W average over five epsilon, which depends on W average, which in turn depends on knowing OPT, which is the value of the optimal solution, say, for k-median. And now of course we don't really know what that is. Okay. Now, how can we do this? In the large clusters case, what we can do is the following: we can start with a low guess of W and we can keep increasing it until we get to a point where the K largest components of the graph H are large enough and cover a large fraction of the whole set of points. And we can show that if we do so then we do get a clustering of low error. And at a high level the idea is as follows. So we keep increasing the value of W, and if we do reach the right value of W, then we clearly get a clustering of low error, because I just argued that. Now, however, our guess might be too small, so we might stop earlier. But because we make sure that all the components that we output are large enough, of size at least, say, the size of the bad set, we make sure that all the components of the graph H correspond to different good sets. And this, together with the fact that we only output clusters which cover a large fraction of the space, implies that the clustering that we output has a small error rate. And in order to show this we basically use the structure of these graphs G and H, and in particular we use the fact that for smaller values of W these graphs are only sparser. Okay? So that's the high level idea. Yes? >>: So if I remember correctly the algorithm only depended on epsilon and C through D critical?
>> Maria-Florina Balcan: It depends on epsilon and C which assumes that the right [inaudible] parameters but it also depends on W average, which is OPT over N. So -- >>: Which is interdependent on epsilon and C besides the -- >>: Besides [inaudible]. >> Maria-Florina Balcan: Besides D critical. Also be like it's also on -- so the original algorithm, okay, you mean -- no. No. So the well, yeah. Yeah. No. >>: In that case you are using [inaudible] C epsilon. The algorithm is [inaudible] way of finding D critical, so that's important property for the [inaudible] to know what the assumption is. >> Maria-Florina Balcan: Well, you mean practically speaking, yes. And I guess [inaudible] my student goes for that when he implements the algorithm, right, so he comes up with reasonable guesses of D critical, right. But theoretically speaking, right, so this depends technically on epsilon and C and moreover also on W average, which is OPT over N, and so again -- so we don't really know OPT. But luckily it turns out that in the large clusters case we can kind of implement the [inaudible] of this algorithm where we keep increasing the value of W until the K largest components, the connected components of H, cover a large portion of the space and are large enough, and we use the structure of our problem [inaudible] this way we'll get a clustering of low error. But this trick only works in the case where we have large clusters. And -- >>: [inaudible]. >> Maria-Florina Balcan: Is the number of bad points. I guess the fraction of bad points. So this should actually be B times that. B is the fraction of bad points. >>: [inaudible] C minus one -- >> Maria-Florina Balcan: What's the exact [inaudible] basically, okay, you are asking what's the exact quantity. Let me quickly take a look. Whoops. Right. Epsilon over C minus one times N. Yeah. Okay. Now, in the small clusters case, however, the trick that I had on the previous slide doesn't really work.
And what we have to do in this case is to actually first run a constant factor approximation algorithm for k-median and then use its approximation in our algorithm, and then the approximation part only goes into the final error guarantee. But actually it's an interesting -- I mean a good technical question to see if in the small clusters case we can get rid of really running an approximation algorithm first. Okay. And I should also say that we have also looked at an inductive version of this -- so there's an inductive setting where we assume that the set of points that we see is only a small random sample from a much larger abstract instance space, and our goal is to come up with a clustering of the whole instance space. And so algorithmically what we do in this case, we first draw a sample S. We apply our algorithm on the sample and then we insert new points into our clusters as they arrive. And so what we really do, basically, we first run the algorithm on the sample. So, in other words, for example, for the large clusters case we create the graph G, the graph H. We take the largest components of H, and then when new points arrive we assign them based on the median distance to the clusters produced over the sample. And then we can argue that this gives us a clustering of low error. And in particular, the key idea here is that the key property which we use in order to argue correctness for the large clusters case is that in each cluster we have more good points than bad points. Which is what we needed in the case where we knew the value of W; if you don't really know the value of W, you need twice as many good points as bad points. Now, however, if the clusters are large enough and if the sample itself is large enough, this is also going to be true over the sample as well, and so this helps us to argue correctness overall. Okay. So we also have done an inductive version of this algorithm. And I should also -- okay.
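The inductive procedure described above (draw a sample, cluster it, then place arriving points by median distance to the sample clusters) might look roughly like this. It is a sketch under assumptions: `cluster_inductively` and its parameters are hypothetical names, and `cluster_sample` is a stand-in for whatever algorithm is run on the sample, which the sketch does not implement.

```python
import random
from statistics import median

def cluster_inductively(population, sample_size, cluster_sample, dist, seed=0):
    """Cluster a random sample, then place every other point by median distance.

    cluster_sample -- hypothetical callback: takes the sample and returns a list
                      of non-empty lists of sample points
    dist           -- function (x, y) -> nonnegative distance
    Points are assumed to be distinct and hashable.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    sample_clusters = cluster_sample(sample)

    # Sampled points keep the label given by the clustering of the sample.
    labels = {y: i for i, c in enumerate(sample_clusters) for y in c}

    # Every remaining point goes to the sample cluster of smallest median
    # distance; the sample clusters themselves are never modified.
    for x in population:
        if x not in labels:
            medians = [median(dist(x, y) for y in c) for c in sample_clusters]
            labels[x] = min(range(len(sample_clusters)), key=medians.__getitem__)
    return labels
```

Keeping the sample clusters fixed while new points arrive mirrors the description in the talk, where new points are assigned based on the clusters produced over the sample only.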
Let me just mention that we also looked at k-means and min-sum. For k-means, we can just derive a similar argument as for the k-median clustering objective. For min-sum, however, the argument is a bit more involved. The algorithm is again equally simple. So the first solution that we had for min-sum was to connect it in a standard way to something called the balanced k-median objective, where the goal is to find a partition and center points in order to minimize this quantity, which is the sum over all points of the distance to their corresponding cluster center times the size of the cluster they're in. Okay. So it turns out that for this balanced k-median you can derive similar properties as for k-median. However, the main difference is that in this case we don't really have a uniform D critical. So it turns out that here we could have large clusters with lots of points that are very concentrated, of small diameter, and we could have small clusters with very few points that have a very huge diameter. But luckily it turns out that a similar argument can be applied in this case as well if a [inaudible] subtract for the large clusters. And then second largest and so on. And this gives a way to deal with the large clusters case for min-sum or k-median if C is greater than 2. And in order to do the small clusters case we had to do a much more refined argument. All right. And so to summarize this part of the talk, let me go a little bit to a high level.
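For reference, the balanced k-median objective described verbally above can be written as follows. The notation is an assumption made here for illustration (clusters C_1, ..., C_k, centers c_1, ..., c_k, distance d); the talk itself only describes the quantity in words.

```latex
\min_{\substack{C_1,\dots,C_k \\ c_1,\dots,c_k}}
  \;\sum_{i=1}^{k} \, |C_i| \sum_{x \in C_i} d(x, c_i)
```

Each point contributes its distance to its own center, weighted by the size of the cluster it belongs to, so a large spread-out cluster is penalized more heavily than under plain k-median.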
So one can view the usual approach, the usual approximation algorithms approach to clustering, as saying: we cannot really measure what we want, that means closeness to the ground truth -- to go back to your comment -- so we cannot really measure closeness to ground truth, so let's set up a proxy objective that we can optimize and let's approximate it. Okay. So this is one way to think about the usual approximation algorithms approach to clustering. So this is almost like in this picture right here. So this guy is about to jump off the building and we tell him: we couldn't get a psychiatrist, but perhaps you'd like to talk about your skin. Dr. Perry here is a dermatologist. Yeah? So one could interpret the usual approximation algorithms approach to clustering in exactly this way. Okay? However, this is maybe a bit too cynical, because the point is that if we end up using a C-approximation algorithm to cluster our documents, that means that we must make an implicit assumption about how distances and the closeness to the ground truth relate. So we must make an implicit assumption about the structure of our problem. And so what we do in our work, we make it explicit, and if we make it explicit, we get around inapproximability results by using the structure that is implied by assumptions that we are making implicitly anyway. Okay. So this is one way to think about this work. And I think it's kind of cool both from a conceptual point of view and from a theoretical point of view. Now, of course I don't say that this assumption is an assumption that would be satisfied in practice and, in fact, what I say is that we should really analyze other interesting properties, other ways in which the similarity or dissimilarity information relates to the target clustering. Okay.
And actually let me just make a comment that it would be interesting to also use this framework in order to apply to other problems where the standard objective is just a proxy and maybe where implicit assumptions could be exploited to get around inapproximability results. It would be nice to apply this to other problems as well. And just as a comment, recently I have been looking at the approximate epsilon Nash problem in this framework with some progress. Okay. And I think I have a few more minutes. Maybe I should make a few more comments. So this goes back I guess to one of the questions that I had earlier. So maybe the C epsilon property is not reasonable but maybe an approximation of this is reasonable. And, in fact, we did analyze a relaxation of this C epsilon property and in particular we analyzed one that allows for outliers. So for example we say that data satisfies a new C epsilon property if it satisfies the C epsilon property only after a number of outliers, ill-behaved points, have been removed, right. So we say data satisfies the C epsilon property after a number of really bad points have been removed. Okay. And for example in this case we need to output a list of clusterings and we cannot output just one clustering. Now, if we think about it, for the C epsilon property any two clusterings of a given dataset that satisfy the property will be close to each other, on the order of epsilon close to each other. However, it turns out that for this new C epsilon property two different sets of outliers could result in two very different clusterings satisfying this new C epsilon property. And this is true even if most of the points come from large target clusters. And so in this case the best we can hope for is to output a small list of clusterings with the property that any clustering that satisfies this new C epsilon property will be close to one of the clusterings in the list.
So a list of clusterings is here necessary. Okay? And so this is an example of a relaxation of the C epsilon property that we analyzed -- and more generally -- let me just give you this example. More generally, going back to the general picture, where our goal was to get a clustering of low error, one could just imagine analyzing other types of properties about how the target clustering and the similarity relate and developing algorithms that can be used to cluster when those assumptions are satisfied. And, in fact, I have a whole line of work that does this. And the interesting fact is that -- so the C epsilon property was an example of a property where we output just a single clustering which was close to the target clustering. However, it turns out that once you start analyzing more interesting properties, more realistic properties, basically it's not possible to just output a single clustering which is close to the target clustering, and instead you are looking at clustering well up to a tree or up to a list. And I think that actually this is like a really interesting direction to analyze -- >>: [inaudible]. >> Maria-Florina Balcan: Up to a tree, good. So what I mean is to output a tree such that the target clustering is close to a pruning of the tree. So let me give you an example, okay? So here is a different -- okay. So now I'm going away from the C epsilon property, and let's just look at another property, a natural property that you might hope that your target clustering is going to satisfy with respect to your similarity information. So let's assume that you have the following property that says that all points -- I mean, it sounds very strong -- all points are more similar to points in their own cluster than to points in any other cluster. So we are given similarity information that satisfies this property. Yes. So this sounds extremely strong, really strong.
All points are more similar to points in their own cluster than to points in any other cluster. However, it turns out that in this case you don't have enough information to just output a clustering which is close to your target clustering. Why? Because you could have multiple different clusterings of the same data that satisfy this property. And since you have no labeled examples it would be impossible for you to know which of them is your target clustering. And I have an example here, so say that you have these four blobs over here, the similarity within each of the blobs is one, the similarity going down is a half and the similarity across is zero. Now if you think about it, even if I tell you the number of clusters is exactly three, then we still have multiple clusterings, in particular this one here and this one here, that satisfy this property. They both satisfy the property that each point is more similar to points in its own cluster than to points in any other cluster. >>: [inaudible] in this case the assumption that there's two clusters is false and there are really four clusters and we should output the [inaudible] clustering to satisfy the clusters. >> Maria-Florina Balcan: Well, this would be -- >>: [inaudible]. >> Maria-Florina Balcan: This will be one way to solve the problem. But a different way to solve the problem is to -- well, actually no. My target clustering to be honest is maybe this one. I mean, this is -- this is my target clustering. And this I want to see as an output. >>: [inaudible] comes back to my understanding of the [inaudible] if the starting cluster is a [inaudible] objective as [inaudible] so the advantage of the k-means or cluster [inaudible] some objective definition of [inaudible] clustering but here the notion of target clustering which -- how is that.
>> Maria-Florina Balcan: Well, usually in supervised learning we have the notion of a target function and use the input, like in supervised learning, right? We have the notion of a target function, we use labeled examples and we learn it, right? Similarly here we have the notion of a target clustering. We use the similarity information that we are given and we output something that is close to our target clustering. And by these examples -- >>: Where this notion comes from. >> Maria-Florina Balcan: I mean the true clustering of proteins by function, for example. I mean it's a target clustering, right? It's just that -- so for example for clustering protein sequences by function, there's a true clustering of protein sequences by function in that case. It's a simple case. Now here, it's a true -- >>: [inaudible] how do you tell that this is the true -- >> Maria-Florina Balcan: So the user -- the assumption is that the user would know to recognize it, right? So I would be able to -- if you are on -- >>: The user when he knows [inaudible] I mean, they showcased it so -- his knowledge is not [inaudible] there is some way [inaudible]. >>: Just [inaudible] in different forms. You can get these back on the form of [inaudible] and ask them if this is a cluster or not or you can give them pairs and ask them [inaudible] same cluster or not. So there's multiple ways to [inaudible] from users [inaudible]. [brief talking over]. >>: [inaudible] target function which [inaudible] the function is -- you can see what the function [inaudible]. >> Maria-Florina Balcan: Yes. Similarly here I can see what it does. So I can see the partition, right. >>: No. I mean, you can't see a target as [inaudible] action. They can give you something -- I don't know whether it's [inaudible] target or [inaudible] I never see it in action. >> Maria-Florina Balcan: What do you mean by that?
>>: I cannot clarify -- what it is equivalent of, you know, the [inaudible] function but I see that function being acted on by [inaudible]. >> Maria-Florina Balcan: Okay. >>: So there's nothing -- no equivalent of that here. I can see what this function does on -- >> Maria-Florina Balcan: I can see the label of every single -- right. So I can similarly, exactly like [inaudible] what he said, for example, say I can verify it, so maybe I can take a few points from the versatile document and so maybe the algorithm presents me the -- all topics as a proposed cluster. I can draw a few points at random and then I can ask the user, is this really a true cluster? And the answer might be yes, it might be no. In that case, what my algorithm would do would then split it into two and then you get a few points at random from [inaudible] and then you verify whether those points are in the same cluster or not. So you can just verify -- I mean -- >>: But what I'm saying is that is not part of the [inaudible]. >> Maria-Florina Balcan: Well, it is -- >>: [inaudible] saying, you know, you can show the customer and ask him if -- >> Maria-Florina Balcan: No, it's an assumption -- we assume. No, no, no, it's an assumption. It's an assumption that there is a hidden target clustering. And somebody knowing the truth -- the ground truth labels can measure closeness to it. The algorithm cannot measure closeness to it, but the user can verify closeness to it. >>: This target is not part of -- >> Maria-Florina Balcan: It is. It is implicitly. It is. It's just that the algorithm cannot use the labels, cannot use the ground truth labels, but once you deploy, for example like here, a tree and you also propose a clustering, the user can verify closeness to it. I mean, that's implicitly an assumption. >>: I think that that [inaudible] of this discussion [inaudible].
>> Maria-Florina Balcan: Well, it's -- well, the problem with clustering is [inaudible] what I try to do in my work is to make it well posed, and actually the way I want to think about this is that analyzing these properties is in a way trying to make it well posed. And by the way, the way I think about the property -- for example, here's an example of a property. So the way I think about one of these properties or the C epsilon property, it's really like a concept class in supervised learning, right, it captures our prior knowledge about the clustering problem, right? In supervised learning, if you assume that the target function is a linear separator, you use a linear separator learning algorithm. If you assume it's a decision tree, you use a decision tree learning algorithm, and similarly here, depending on my assumption, my prior knowledge about how the similarity function relates to the target clustering, I'm going to use a certain clustering algorithm versus another. I mean, the only interesting fact that comes in the context of clustering is that since we have no labeled examples at all, when you analyze more realistic properties you really want to output not only a clustering but maybe a different structure, maybe, say, a tree that compactly represents many different clusterings that satisfy the property. >>: [inaudible] the results of [inaudible] if you try to mix it [inaudible]. >> Maria-Florina Balcan: So he mentions axioms about the clustering algorithm, which is a bit different. >>: Yes. >> Maria-Florina Balcan: He's -- >>: It's true. But [inaudible] you've taken a different view on that.
But he was trying to say, and this is what we see here, that if you make certain assumptions of what you can expect to happen, so things like, you know, if I just blow up everything [inaudible] expect my cluster to change and stuff like that, and if you make this kind of assumption, still you can come up with different solutions which all satisfy your assumptions. So at this. [brief talking over]. >>: Sorry? >>: In this case [inaudible] this assumption that the number of clusters [inaudible]. [brief talking over]. >> Maria-Florina Balcan: No, in the more general model -- >>: [inaudible]. I don't think that it's a done deal, right? But still I think that the sum total that we see here is exactly this type of assumption where, you know -- >> Maria-Florina Balcan: Right. So even in Kleinberg's case, so he forces the algorithm -- first of all, his approach is a bit different. He comes up with axioms for reasonable clustering algorithms which, as you say, are questionable. But he forces the algorithm to output a partition. Even in this case, if you allow the algorithm to output a hierarchy or something more, then these axioms will break. So yeah. And by the way, for this more general model K is not fixed. It was just for the C epsilon property, the specific part of my talk. Yeah? >>: Yes. Most NP reductions start with extremely bad problems, the clustering can barely be made to work. >> Maria-Florina Balcan: Yeah. >>: [inaudible] where K is of clustering doesn't make sense. >> Maria-Florina Balcan: So one way to interpret this work is -- okay. So, yeah, thank you, that's a good point. So there are two ways to interpret this work. So the C epsilon property, going back to the first part of the talk. So one way to interpret it is, well, we make implicit assumptions explicit. So this is one way to interpret. We make an assumption of how the target function -- the [inaudible] with -- with the similarity information.
So this is one way to interpret it. A different way to interpret it is that we show that we can cluster well if some natural assumptions, natural stability assumptions, are satisfied. And in particular, the stability assumption is that any two clusterings that have a good, say, k-median value are close to each other. Right. And in that case we show that you can cluster well. You can -- yeah. So that's a very good point actually. All right. More questions? >>: Sort of a broad question, I guess. So in supervised learning you have labels. >> Maria-Florina Balcan: Yeah. >>: [inaudible] that you try to [inaudible] the function that gives you the label given the data point. And like in both supervised and unsupervised learning you have to make an assumption with your model. And you could use like a median -- you could use the same model as your median, supervised learning you could say you're in a [inaudible]. >> Maria-Florina Balcan: Yeah. >>: So you could -- you could use the same models in supervised and unsupervised settings? Like [inaudible] -- >> Maria-Florina Balcan: Well, it's -- yeah. So in supervised you have a lot more. So the algorithm has a lot more power because it can get labeled examples. And moreover you can test how well -- I do some calculation and come up with a hypothesis and then I can test it, you know. But in the clustering case the problem is somehow more difficult because -- >>: Yeah. I guess if you have in the unsupervised case if you have the right model -- >> Maria-Florina Balcan: Yeah? >>: So like in this case it's like the -- I guess it seems the C epsilon assumption is basically did you have the right model for the data, like the data separable or whatever feature space. >> Maria-Florina Balcan: Right. So these are assumptions. The C epsilon assumption is an assumption, which would be a different assumption, right.
And the -- and these are -- >>: [inaudible] if you have the right model then is the -- is the unsupervised case sort of -- like are there -- is it at a disadvantage, like is it going to get confused like if the -- is the unsupervised case more likely to say split a cluster [inaudible] as the supervised case? Is it more likely to make those kind of mistakes? I mean, other than the fact that you can't get the actual [inaudible] if you have the right model, is it more likely to split clusters or return clusters [inaudible]? I don't know if I'm -- >> Maria-Florina Balcan: Well, you need to be more specific, right. Right. So having labels like -- even for supervised learning you can use similarity information in addition to labels. And you already have much more power. Like learning with kernel functions, similarity functions. You have similarity functions plus labels. The labels give you a lot more power, right? Because the algorithm can use the labels to check, you know. It does some computations, it comes up with some hypotheses and then you can check, you know, is this close to being something good or not. But for clustering you cannot do that because the only information you have is the similarity information and so -- I mean there is a clear advantage in that sense. >>: I guess the supervised case you definitely would not make the kind of mistakes that an unsupervised -- >> Maria-Florina Balcan: Right. So maybe split the cluster when they should have been together, for example. >>: Yeah. Okay. >> Maria-Florina Balcan: Yeah. >> Nikhil Devanur: So Nina will be here until Monday and [inaudible]. Thank you. [applause]