>> Dengyong Zhou: I am glad to have Ran Gilad-Bachrach speaking today. Ran is a
researcher in the machine learning group at Microsoft Research. Before joining the
department here he was an applied researcher at Bing. Ran has a PhD from Hebrew
University in computer science.
>> Ran Gilad-Bachrach: Thank you Denny. I'm going to be talking today about the
Median Hypothesis. This is joint work with Chris Burges, and to lay the groundwork I'm
going to discuss a PAC-Bayesian point of view on learning. We will need that to define
the median, and actually we will find out that we will also need to define a depth
function by which we will define the median. Once we have that, we will provide
generalization bounds using the median hypothesis, which is what we will call a deep
hypothesis. So this will be the first part of the talk, and in the second part we will start
discussing algorithms. The first one would be how we measure the depth of a hypothesis,
and then we will discuss how we find the deepest hypothesis, which is actually the
median. So this is kind of the outline of the talk.
So let's start by looking at the PAC-Bayesian point of view on learning. Many learning
algorithms look kind of like that. We have some sort of scoring function; I will call it
the energy function. We give a score to every hypothesis based on some regularization
term--pick your favorite, L2, L1, whatever--and a loss function on some observations
that we had. From the PAC-Bayesian point of view we can look at it in the following
way. We can say that the regularization term actually represents some prior belief that
we had. So think about the regularization term as the log likelihood of the prior belief.
And then given the observations we can generate a posterior belief, and this is what we
then use to select the hypothesis and make a prediction.
It's very important to note that these are beliefs. So it doesn't mean that we assume that
something in the real world really behaves according to these probability measures or
distributions or whatever you call them. This is just a representation of our internal
beliefs about what is going on out there. So the way that I would like you to
view it is that the process of learning is made of three stages. We begin with some prior
about the world and then we see some evidence. For example, it could be labeled or
unlabeled examples, from which we generate the posterior. And then once we have this
posterior belief, we select the hypothesis with which we will go ahead and make our
forecasts and our predictions.
My talk today is going to focus on this part. Once we have our posterior belief, what
hypothesis should we choose to use to make our predictions? Let's look at some common
methods of selecting the hypothesis once we have our posterior. One option is to use
the maximal posterior. This is actually what SVM and LASSO do; minimizing
the energy function is equivalent, in my language, to selecting the maximal posterior.
Another option is to use the Bayes function, which means that every time I want to make
a prediction, I will just hold a vote among all hypotheses. I will do weighted voting
and make my prediction. I can use Gibbs sampling, which means just sampling a
hypothesis at random according to your belief. And I can use ensembles, which are kind
of an extension of the Gibbs sampling: instead of just selecting one random
hypothesis, I will select several and just hold a vote between them.
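To make the four selection rules concrete, here is a small sketch in Python (illustrative
only; the names and interfaces are mine, not from the talk), for binary labels in {-1, +1},
with the belief Q represented as hypotheses paired with posterior weights:

    import random

    def map_predict(hypotheses, weights, x):
        # Maximal posterior (SVM/LASSO style): always use the single most
        # probable hypothesis under the posterior.
        h = max(zip(hypotheses, weights), key=lambda hw: hw[1])[0]
        return h(x)

    def bayes_predict(hypotheses, weights, x):
        # Bayes: weighted vote over the whole posterior at prediction time.
        vote = sum(w * h(x) for h, w in zip(hypotheses, weights))
        return 1 if vote >= 0 else -1

    def gibbs_predict(hypotheses, weights, x):
        # Gibbs: sample one hypothesis from the posterior and use it alone.
        h = random.choices(hypotheses, weights=weights, k=1)[0]
        return h(x)

    def ensemble_predict(hypotheses, weights, x, k=10):
        # Ensemble: vote among k hypotheses sampled from the posterior;
        # k=1 is Gibbs, large k approaches the Bayes vote.
        sample = random.choices(hypotheses, weights=weights, k=k)
        return 1 if sum(h(x) for h in sample) >= 0 else -1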
So I would like to compare these methods, but this is a very qualitative comparison, and
to make sure that you don't think that there is anything accurate about it, I drew it by
hand [laughter] to emphasize that. And I went to look at two axes. One is the runtime:
when I want to make a prediction, how complicated is it to make this prediction. And the
second axis is accuracy, but think about accuracy more as how well it captures my
beliefs. So I have the posterior belief; how well does the hypothesis capture all of
the structure of my beliefs? So let's look at some examples. If we look at the Gibbs
method, which is just: I got my belief, I selected a random hypothesis, and then I am just
going to use it to make predictions.
So the runtime is really easy. I just need to evaluate a single hypothesis and see how well
it works. But does it capture all of the complexity of my belief? To a very small extent, I
would say. If we look at the maximal posterior, it seems to capture more, but still it just
looks at the peak. My belief can change dramatically, and as long as the peak
remains in the same place, I would pick the same one. So it doesn't seem to capture all
that I have learned in the learning process. On the other hand, if we look at the Bayes
classifier, it uses a lot of information about my belief, but in terms of runtime, I really
need to hold a vote every time I want to make a prediction. The ensemble method
provides kind of a trade-off. If I use an ensemble of just one hypothesis, then this is just
Gibbs sampling, and so it is going to be fast but not very accurate. Whereas as I let my
ensemble grow larger, I capture more structure of my beliefs, but it becomes
slower.
>>: You are making some assumptions.
>> Ran Gilad-Bachrach: Of course.
>>: For example, for linear [inaudible] classes, if you're just [inaudible] linear classifiers,
then you can show that the Bayes optimum is also itself a linear classifier. So
somehow the fact that the Bayes…
>> Ran Gilad-Bachrach: No. If you are doing binary classification, it is not true that the
Bayes classifier over linear classifiers is itself a linear classifier. It is simply not true.
>>: That depends on what your distributions are. When they do the general--when they
use the--when you have Bayesian generalization bounds for SVM, then in that setting it
turns out that the Bayes classifier is in fact very fast.
>> Ran Gilad-Bachrach: Again, this is why I drew it by hand.
Think about the concept, not about the very specific cases in which things might be
different. Generally speaking, computing the Bayes classifier is hard. Although there are
examples in which it is easy, generally speaking, for a general class with a general
distribution, this tends to be hard. Okay?
>>: So another question. If I think about this from a Bayesian respect here, there are two
types of uncertainty. There is model uncertainty and there is parameter uncertainty. So it
sounds like you are talking about just model uncertainty in this schematic or diagram, is
that right?
>> Ran Gilad-Bachrach: Again, I am setting aside the whole problem of estimating the
parameters and everything. I had my prior belief and I saw some evidence, and this is
currently my belief. This is my belief and that is it. But now I am asking myself, okay,
given that this is my belief, how should I make predictions?
>>: So you are thinking about a distribution over the model classes and the parameters
for the [inaudible]. [inaudible] both the model and parameters.
>> Ran Gilad-Bachrach: Exactly.
>>: Okay thanks.
>> Ran Gilad-Bachrach: What I'm about to propose is a new way to select a hypothesis
from my belief, which is to use the median. Now, the median is going to be just a single
hypothesis, so in terms of runtime it is going to be equivalent to the Gibbs and the
maximal posterior, but I will select the hypothesis that captures as much information as
possible about my belief. So this is where we are heading.
So now I want to start defining what it is I mean by median, but it turns out that in order
to define a median we will need to define something which we will call a depth, a depth
function. So let's start with the case that we all know about: the one-dimensional,
univariate median. Assume I have this sample of points here and I am going to
find a median of these points. One way to look at it is to say the following: every point
kind of splits the sample into two parts, all the points that are to the left and all of the
points that are to the right of this point. And we associate a depth with each point, being
the size of the smaller part. For example, this point has depth three because here on
this side there are only three points, whereas this point has depth one because if I split
the space, only this point is on this side. Now, given this definition, this point is the
deepest point and it is the median. Once we define the median via a depth function, we
can now move to multivariate cases.
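As a concrete sketch (mine, not from the talk; conventions for counting the point itself
vary), the univariate depth and median could be computed like this:

    def univariate_depth(p, sample):
        # Depth of p: the size of the smaller of the two parts that p splits
        # the sample into, counting points weakly on each side.
        left = sum(1 for x in sample if x <= p)
        right = sum(1 for x in sample if x >= p)
        return min(left, right)

    def univariate_median(sample):
        # A median is a deepest point.
        return max(sample, key=lambda p: univariate_depth(p, sample))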
>>: [inaudible] so your points [inaudible].
>> Ran Gilad-Bachrach: Sorry?
>>: [inaudible] so your points [inaudible] distance [inaudible].
>> Ran Gilad-Bachrach: No. I do not consider distance; I just count the number. Let's try
to see if we can extend this definition to the multivariate case. When we move to the
multivariate case this is called the Tukey depth function and the Tukey median. So how
do we find it in the multivariate case? Assume again that I have this sample, but now
it is two-dimensional, and I want to measure the depth of every point. So what I do is I
look at hyperplanes. If I want to compute the depth of this point, I look at all the
hyperplanes that go through this point. Every hyperplane splits the sample into two, and
I take the smaller part of it. I take the minimum over all hyperplanes, and this would be
the depth of this point. So for example, you can see that for this point, the depth would
be two because every hyperplane will have this point and at least one additional point
with it. Whereas for this point the depth would be three because every hyperplane will
have at least this point and two other points with it. So this point is deeper than that
point. And again, once we have a depth function, we automatically get a median
associated with it, which is just the deepest point.
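A rough sketch of the two-dimensional Tukey depth (mine, not from the talk): the exact
depth minimizes over all hyperplanes through the point, which we can approximate by
scanning many random directions:

    import math, random

    def tukey_depth(p, sample, n_directions=1000):
        # Approximate Tukey depth of point p in a 2D sample: for each
        # direction, count the points weakly on each side of the line
        # through p and keep the smallest count seen over all directions.
        best = len(sample)
        for _ in range(n_directions):
            theta = random.uniform(0.0, math.pi)
            nx, ny = math.cos(theta), math.sin(theta)
            proj = [(x - p[0]) * nx + (y - p[1]) * ny for x, y in sample]
            side = min(sum(v >= 0 for v in proj), sum(v <= 0 for v in proj))
            best = min(best, side)
        return best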
A word of caution: you have to know that in the multivariate case there are actually
many definitions of multivariate medians. We can define different definitions of a
depth function, and each one of them would lead to a different median.
>>: [inaudible] example, if your points are in a convex shape, then all of them are
medians.
>> Ran Gilad-Bachrach: No.
>>: So then you're saying--no. Because all of them would be [inaudible] the point one.
>> Ran Gilad-Bachrach: Remember that the median does not have to be a sample point.
So if I have points on a circle, let me draw that. I have points on a circle, right? This
is my sample. So you are correct that every one of these points has a
depth of one. But actually, that point is much deeper.
>>: Oh, so it's any…
>> Ran Gilad-Bachrach: I can compute it for any point in space. And in this case
you can actually see that the median is actually the center. So it is what you expect it to
be.
>>: Well I think that you [inaudible] to compute [inaudible]. What if you can't compute
function in the [inaudible]?
>> Ran Gilad-Bachrach: There are depth functions that are defined only on your sample
and there are depth functions that are defined everywhere. There are different
definitions, and each one of them has its pros and cons. It is an interesting topic, a very
interesting topic, but I will have to skip all of that. I just wanted to point
out the fact that this is not the only way to define a multivariate median. But this is the
one that I will be inspired by, so this is why I brought it up and not other possible
definitions.
>>: [inaudible].
>> Ran Gilad-Bachrach: For example, like in the univariate case, there could be more
than one median. You know from the univariate case that the median is not necessarily a
single point. But here you can show, for example, that the level sets of the depth function
are convex. Therefore if you have more than one median, it will always be a convex set
of points which are the medians. So you can show all sorts of things. There are a lot of
studies on the multivariate median, and just as an anecdote I will tell you that about 10
years ago was the first time I encountered this, and I thought that I had invented the
multivariate median, because I searched for multidimensional median, and when you
search for multidimensional median you can't find anything. And I thought, you know,
this is my big thing, right? And then--I know why. A friend told me, you know that in
statistics they don't call it multidimensional; they call it multivariate. [laughter] Yeah,
so this is why I brought this up, just to mention the fact that there is a lot of literature, a
lot of different definitions, and unlike the univariate case, where when you say the
median we all know what you're talking about, when you go to the multivariate case this
is no longer true.
But now, motivated by this very specific definition of the median that we just saw, I
am going to define a median for function classes. One thing that is important to note is
that unlike what we discussed before, in which we had a sample of points and we
wanted to find the median of these points, now we are talking about functions. We are
not going to find the median of our sample, but the median of our hypotheses. Just a
little bit of notation that we are going to use: our sample space is going to be, I am going
to use the letter X for that, the function class is F, and the most important notation is Q,
which is our belief, our posterior belief. This is a distribution over our function class.
And here is how we define the depth in this case. I have a function whose depth I want
to measure. So first of all I take an instance and define the depth of this function
with respect to this instance X to be the probability that a random function agrees with
the label that F assigns to X.
Now I define the depth of F to be just the infimum over all instances. It is similar to
what we did in the Tukey depth, in which we said every hyperplane defines some
sort of a depth, because it splits the space into two and I take the smaller part, and then I
take the infimum over all hyperplanes. This is what we do over here.
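In symbols (my transcription of the definition as described): for a function f and an
instance x,

    D(f | x) = Pr_{g ~ Q}[ g(x) = f(x) ],        D(f) = inf_{x in X} D(f | x),

so the depth of f is its worst-case rate of agreement with a random function drawn from
the belief Q.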
>>: In the function family here like any, does it have to be [inaudible] space?
>> Ran Gilad-Bachrach: I can compute the depth of a function which is not in my
function class, for example. It is defined for every function. Once we have a definition
of a depth function, we have a median, which is just the deepest point.
>>: So in the previous slide you said that the range of F is just minus one, plus one,
right? So there are just two depths.
>> Ran Gilad-Bachrach: No. For every, again for every X there could be…
>>: Oh, X is either one or minus one.
>> Ran Gilad-Bachrach: Yes. But then I take that infimum of all Xs.
>>: Wait a minute. For a given X, there are just two types of F, ones that are deep and
ones that are shallow.
>> Ran Gilad-Bachrach: Exactly.
>>: Okay.
>>: Wait, wait, there is a probability that when you draw F [inaudible] at random from Q.
>> Ran Gilad-Bachrach: Yes, what [inaudible] says is that if, for a certain point, and we
will see it in a second, 80% of the functions say that the label is +1, then 20% won't say
that. So either you are in this class or in that class.
>>: There are two sums; there are two terms in this [inaudible].
>> Ran Gilad-Bachrach: Yes, but once you take the infimum, this function becomes
continuous. You are absolutely correct. Now…
>>: Wait [laughter] so the infimum is over all of X regardless of [inaudible]; if there is
an X with no probability of this ever occurring, then you still…
>> Ran Gilad-Bachrach: For that question, wait five minutes.
>>: So the [inaudible] is the amount of support that [inaudible] the least popular
[inaudible].
>> Ran Gilad-Bachrach: Exactly. So for example, you can see that if I compute the
depth of the Bayes classifier, it will always be at least half. Do you see that? Because
the Bayes classifier always takes the majority vote, so for every X at least 50% of the
population will agree with it, and therefore the infimum has to be at least half,
right? This is an example to keep in mind. A depth of half is the best we can hope for in
any realistic setting, and the question would be how close to half we can get. I will not
go over all the details here, but we can actually see that the Tukey depth that we
discussed before is a special case of this definition: when the hypothesis
class is some sort of class of linear classifiers, the Tukey depth
becomes a special case of this definition. But there are some interesting things
about the Tukey depth. For example, we know that for the Tukey depth, if we work in D
dimensional space, there is always a point with depth greater than 1 over D+1, regardless
of the distribution. There is always a point with depth 1 over D+1.
We know further that if the distribution Q is log-concave, then there is always a point
with depth 1 over e, regardless of the dimension. Now 1 over e is very good. Remember
that the optimum we hope for is one half, so 1 over e is very close to that.
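Summarized in symbols (my paraphrase of the two facts just stated): for the Tukey depth
with respect to a distribution Q on D-dimensional space,

    there is always a point x with depth(x) >= 1/(D+1),  and
    if Q is log-concave, there is a point x with depth(x) >= 1/e (about 0.37),

the latter regardless of the dimension.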
>>: So the Bayes point, the Bayes function, isn't always in this space F, is that what
you're saying?
>> Ran Gilad-Bachrach: No. It doesn't have to be in the space F.
>>: You're saying [inaudible].
[multiple speakers]. [inaudible].
>>: [inaudible] realizable so you might not be able to realize in the space of the measure
Q that you can get a half [inaudible].
>> Ran Gilad-Bachrach: If we go back to the beginning of our discussion, we said that
many learning algorithms look like that, the energy function looks like that, and we
actually like the loss function to be convex and the regularization term to be convex,
which means that the energy function is convex, which means that Q is log-concave. So
actually in many of the cases which we really work with, we are exactly in the setting
where our posterior belief is log-concave, and therefore we are guaranteed to have a
deep function, a function with a depth of at least 1 over e.
So I hope you got some intuition with respect to the median and the depth and what they
mean, and now I would like to convince you that the median hypothesis is actually
good. But before I do that, a warning: don't always use the median. [laughter] But in
our case, it is actually good to use the median, and so here is the first result. Let's say
we have some target, so the world is distributed with some distribution nu that I don't
know, and R nu of F is just the generalization error of F.
We can prove that for every function F, its generalization error is bounded by one over
its depth times the expected generalization error I would get if I just sampled a
hypothesis from Q. And the proof is trivial; we have to consider just two
cases. Pick a point. There could be two things. If the majority vote is won with a large
majority, then my function, since it is deep, must agree with the majority, and therefore
my risk, my loss on this point, would be very similar to the risk of a random hypothesis.
The other case is one in which Q is not decisive. In that case my
competition, which is just a random hypothesis, would err a lot regardless, and
therefore I can't do much worse than that. So again, this is kind of vaguely
speaking, but when you write it down it is just two lines and you get the proof.
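Written out (my reading of the theorem as described): with R_nu(f) the generalization
error of f under the data distribution nu,

    R_nu(f) <= (1 / D(f)) * E_{g ~ Q}[ R_nu(g) ]    for every f,

so a deep hypothesis can be worse than the average hypothesis drawn from Q by at most
a factor of 1 over its depth.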
The nice thing about it is that we can connect this with more traditional PAC-Bayesian
analysis. Just about a month ago we had Mario Marchand over here, and what
he showed is a theorem of this type; let me try to clear things up a bit.
Basically what it says is that with large probability, the generalization error of just
selecting a hypothesis at random, the expected error of a random hypothesis, is bounded
by a term which is made of two things. One is the training error and the other one is
kind of the KL divergence between the prior and the posterior. And now we can just take
this result, take the theorem that I just showed you, and plug them together, and we get a
generalization bound that says that for every hypothesis, its generalization error is
bounded by 1 over its depth times the same term that we had before.
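Schematically (the exact constants are on the slide and not reproduced here; this is one
standard PAC-Bayesian form): with probability at least 1 - delta over a training sample
S of size m,

    E_{g ~ Q}[ R_nu(g) ] <= R_S(Q) + sqrt( (KL(Q || P) + ln(2*sqrt(m)/delta)) / (2m) ),

and combining this with the previous theorem bounds R_nu(f) by 1 over D(f) times this
right-hand side.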
But now there is something really annoying about our definition, and [inaudible] asked
about it. When we defined the depth, we looked at the
infimum, and this seems to be very harsh. It could be that we have a function that
agrees with the majority on all of the points but, say, one point. So the infimum seems
not the right way to look at it. And therefore we can define a relaxed version of the
depth in which we say: instead of taking the infimum, you are allowed to put aside a
proportion Delta of the points and compute the depth only on the rest of them. So when
we look at this relaxed depth over here, what it means is that for the majority of
the points your depth is greater than that, but there might be a proportion Delta on which
you actually do worse than that, and that is fine.
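One way to write the relaxed depth formally (my formalization; the slide may differ in
details): relative to a distribution over instances x,

    D_Delta(f) = sup { d : Pr_x[ D(f | x) < d ] <= Delta },

that is, the best depth f achieves once a Delta-fraction of its worst instances is set aside;
Delta = 0 essentially recovers the original infimum definition.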
And again we can repeat the same kind of proof that we have seen before, the same kind
of theorems, using this relaxed version of the depth function. So we can bound the
generalization error in terms of this relaxed depth function, with a Delta term over
here, and again we can plug it into the PAC-Bayesian bound and get a generalization
bound in terms of this relaxed depth function.
>>: [inaudible]?
>> Ran Gilad-Bachrach: Uhh.
>>: Or we can choose [inaudible] whatever I want?
>> Ran Gilad-Bachrach: I think you can, but I need to verify that. I think it does. I hope
that by now I have motivated you that finding a deep hypothesis is something
worthwhile, and now we are looking to find a deep hypothesis. The first thing is actually
to be able to measure the depth of a hypothesis: if someone comes with a hypothesis, can
I know what its depth is? And we want more than just estimating its depth; we want a
uniform estimation, meaning that if I have a class of functions, I want to be able to
estimate the depths of all of them simultaneously and make sure that my estimate holds
for all of them.
So here's what we're going to do. Assume I have a bunch of points. As we discussed
earlier, for each point I can look at what is the size of the population that labels it plus 1
and what is the size of the population that labels it minus 1. So say these are
my sample points, and now I want to evaluate a specific function. Say this function
provides the label plus 1 for the first point. I know that its depth on this point is 20%,
because only 20% of my functions agree that the label here is plus 1. And then I can
take the label on the second point and again see what is the size of the population that
agrees with it on the second point. And so on and so forth: I can go over all of the points
and for each one of them mark the depth with respect to that specific point.
Eventually the depth is just the minimum over all of that. So in this case, on this point I
had the smallest agreement, and therefore the depth of this function is .2. And this is
what we are actually going to do. Instead of computing the infimum over
all points, we are just going to take a sample and evaluate the depth only on this sample,
and when we want to evaluate the agreement here, again, we will take a sample of
functions and use them to evaluate what proportion of the population labels the point
+1 and what proportion of the population labels it -1.
And this is actually what the algorithm does. You take two samples, one an unlabeled
sample of instances and one a sample of functions from your belief, and for every point
X you compute what proportion of the population agrees
with the function that you are trying to evaluate, and you take the minimum, and this is
your estimate. This is what we call the empirical depth of the function F. We can
prove the following property, and again I will try to clear up all of the details here, but
the main thing is that with high probability we will have the following: the
empirical depth we compute is going to be greater than the true depth minus epsilon, and
it is going to be no larger than the relaxed depth plus epsilon. So it is going to be
somewhere in this range, and this holds with high probability, simultaneously for all
functions, so it is uniform. This is a uniform bound that holds for all functions.
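A sketch of this estimate (interfaces are mine, not from the talk): given an unlabeled
sample xs and functions gs drawn from the belief Q,

    def empirical_depth(f, xs, gs):
        # Worst fraction, over the sampled instances, of the sampled
        # functions that agree with f; this is the empirical depth of f.
        depth = 1.0
        for x in xs:
            agree = sum(1 for g in gs if g(x) == f(x)) / len(gs)
            depth = min(depth, agree)
        return depth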
If we look at the probability, in terms of what we mean by high probability, let's look
first over here. We have here the growth function; we see the dimension that
we discussed, and we see that we have some polynomial term here but an exponentially
small term over here, and the same thing over here. So we can choose the sizes of the
samples that we need such that the probability will be as high as we want, as close to 1
as we want.
>>: So epsilon [inaudible] back to the fraction and you relaxed that and [inaudible]
parameter, right? In other words, if you go back to the definition of relaxed depth,
there's a [inaudible] you choose, Delta?
>> Ran Gilad-Bachrach: Delta. Here it is.
>>: Okay, so that Delta is…
>> Ran Gilad-Bachrach: That Delta is here, and here and…
>>: Oh, I see it. Okay.
>> Ran Gilad-Bachrach: And again, I'm going to skip that. And the proof is very simple.
The main ingredient is to note the following. We have this slack between the true depth
of the function F and its relaxed depth. If my sample of points contains a point for which
D of F given X is somewhere in this range, then my estimate will be smaller than the
relaxed depth. And my estimate will always be at least the true depth, because that is an
infimum. So all I need to guarantee is that my sample is actually a hitting set, which
means that it hits all of these regions, so I have instances in all of these regions for all of
the functions. This is called a hitting set, and in the machine learning literature this is
called an epsilon net, so we know that if this dimension is finite, a random sample will
actually hit all of these regions with high probability. So this is actually, you know, the
main ingredient in this proof.
So now we know how to measure depth, and we know how to measure it uniformly for
all of the functions in the class, but now we would like to find the median, or
approximate the median. We want to find a deep function. And this is what we are going
to do now. Again, I want to start with motivation. I am going to show it with pictures and
then go over the algorithm. I have a sample of points, and I will tell you in advance what
we are going to do. Instead of trying to find the really deepest function,
we are going to find the function that maximizes the empirical depth and not the true
depth. And it turns out that it is easy to find this function. Assume that I have this
sample of points and on each one of them I know what part of the population
labels it +1 and what part of the population labels it -1.
The deepest possible function will actually agree with the majority vote in all of these
cases. So the Bayes classifier will give the label -1 here and +1 over here and +1 over
here, because the majority says +1. If I can find a function like that, this is the best that
I can hope for. But now assume that I cannot, so I can't find any function in my class
that gives all of the majority labels in all of these cases. Then I am going to
eliminate one of the points from the sample, saying that I allow the function that I am
looking for to mislabel this point. But which point do I want to eliminate? If,
for example, I eliminate this point, I might get a classifier that labels this one +1 and its
depth is going to be .2, so this is not very favorable.
>>: I have a question. In drawing these graphs, how [inaudible]?
>> Ran Gilad-Bachrach: Yes. I want to eliminate the point on which there is the smallest
margin in the voting, because I know that for this point, for example, although 55% of
the classifiers label it -1, still 45% say it's +1. So if I get the label +1 there,
this is not too bad. So I can delete this point and ask: can I find a classifier which
agrees with all of the remaining labels? And I can keep going, and every time I will
delete the point on which there is the smallest margin. I will keep going until I find a
consistent classifier, and not only that: once I have found one, I know that it is actually
maximizing the empirical depth. I also know what its empirical depth is: the
empirical depth is actually the size of the minority set on the last point that I deleted.
Now this is basically it. So again, the algorithm itself will receive two
samples. One is an unlabeled sample of points; the other one is a sample of functions
from my belief distribution. And the output will be a function F which actually
maximizes the empirical depth, together with its empirical depth.
>>: [inaudible] sample point starting with the same thing [inaudible] the sample the
same sample?
>> Ran Gilad-Bachrach: The only thing that I get is a posterior. You might have used--I
don't know what you used to reach this posterior. Maybe you didn't use a labeled
sample, or maybe you have some oracle that gives you some hints. I don't know. But
once you have this posterior, this is all I need in order to select the median hypothesis.
Does it make any sense?
>>: Yes. But I am saying what bounds can you use…?
>> Ran Gilad-Bachrach: I can use a labeled sample, but it doesn't add anything. I don't
use the labels.
>>: What I am trying to say is having a labeled sample doesn't [inaudible]. I mean, when
you are training an algorithm you have several different points, so using additional
unlabeled points is not adding anything either.
>> Ran Gilad-Bachrach: You're asking whether, if I trained using a labeled sample, I
can reuse the sample again for this part?
>>: Yes. And I think you should be.
>> Ran Gilad-Bachrach: You should be. You can kind of reuse it, with the Delta term in
the confidence divided by two, and you should be fine.
>>: You have T. You can build an ensemble over T. Is it true that this ensemble will
always be better than your median?
>> Ran Gilad-Bachrach: I don't know. It needs to be validated. But again, this ensemble,
in terms of where I really want to use it, is much heavier than having just a single
hypothesis, right?
>>: Is N times…
>> Ran Gilad-Bachrach: N times, but at runtime. Here I use it only in training.
>>: I understand. But is it true that the best you can do, the best that you can hope to do
is as good as the ensemble on the [inaudible] because you are using, you are
hypothesizing…
>> Ran Gilad-Bachrach: It makes sense, but can I prove it? I don't know if I can…
>>: The ensemble you can just take it all over zero but the median.
>> Ran Gilad-Bachrach: Sorry?
>>: You can have the ensemble all the way to zero but 1.
>>: But here is the proof. So the proof is very easy. You have Q prime, which is not the
posterior but the empirical posterior, just as the posterior is defined by this. Then the
ensemble on T is the Bayes optimum on that empirical posterior.
>> Ran Gilad-Bachrach: That is true, but then you have to prove that
this empirical posterior is better than the…
>>: [inaudible] average, right? [inaudible].
>>: [inaudible].
>> Ran Gilad-Bachrach: On that Q prime, but we are asking on Q, right?
>>: Right. So I think the answer is that the upper bound for this sample is
better than the best upper [inaudible] that you can get, but we don't know about lower
bounds.
>>: So I was just going to say that the case of the Bayes point machine is the classic
one, and you were referring to this earlier, where in fact the actual [inaudible] is
an approximation, because the actual classifier that you use may not live within the
function [inaudible]. It could be that the best-known ensemble out of T does pretty well,
but if you happen to have a Q prime that you are using and [inaudible] against it that had
that [inaudible] function, you could do better than what was actually in the ensemble.
>> Ran Gilad-Bachrach: If you can use an ensemble, yep, and if you can use the Bayes
classifier, that would probably be the best thing to use, but if you can't…
>>: You know what this reminds me of? Rich Caruana and his whatchamacallit
[inaudible], the thing where he made this gigantic loaded mega-classifier, the mega
ensemble; is Rich here?
>>: There's Rich.
>>: He called it [inaudible]. The giant enormous classifier out of everything and the
kitchen sink.
>>: The ensemble collection.
>>: The ensemble collection, so that is sort of your F, and, sort of going back to
Chris [inaudible]'s question, what is this? Is this like a parameter space? No. This is a
really very weird space where you are saying that your belief is, I don't know, I am
going to try [inaudible] trees and all sorts of things. And that is sort of what your giant
belief domain is. And instead of what you are doing, where you are sort of, I think,
boosting or averaging together, he is saying you could train the little classifier using his
method maybe, and that is sort of a guidance on how he would train a little classifier…
>>: The problem is you have to find one guy who agrees with all of the other guys.
>> Ran Gilad-Bachrach: But this is usually easier. If you think of the task, usually
we say we want to find a classifier that minimizes the number of errors and stuff
like that, and this is hard, right? But what I need is an algorithm--this is what I call
a consistent learning algorithm--an algorithm that says either I found a hypothesis that
agrees with everything, or I failed.
>>: [inaudible] see something because you have zero training…
>>: When you are building ensembles, you are usually building ensembles that are
[inaudible] simply Bayes hypothesis.
>>: Everything, every imaginable algorithm was in the ensemble.
>>: The ensemble is much more expressive than any one of the individual [inaudible].
>>: Right.
>>: And here Rani is saying I want to find one guy who tends to agree, let's say uses all
these things…
[multiple speakers]. [inaudible].
>> Ran Gilad-Bachrach: To simplify the notation, I assume that the functions here are
from the same class from which I am trying to select the final hypothesis, but actually
you don't have to make that assumption. It is just that otherwise I have to hold two
function classes and all of the discussion becomes cumbersome. But it could be that the
ensemble is made of some stumps and then you try to build a tree.
>>: [inaudible] I think you are doing is picking and…
>>: So we all have to… What?
>>: You're missing--if you take 1 million things and you build an ensemble over them,
and you are trying to find one thing which tends to agree with that ensemble, I agree
that Rani is saying that it could be like a big huge tree, but it is going to be a big, huge
tree. So you are going to pay again with computational complexity, because if you want
a very expressive function, you're going to have to make it complicated.
>>: That was the beautiful thing that Rich had done, because he [inaudible] trained all of
this insanely complicated ensemble. I mean they were just enormous, I don't know, 100
times larger than you would ever want to use, but you would in fact label a very large
training set. That is sort of different from here, where you would label a very large
unlabeled set and then you could get a much larger data set. Here he is saying, well,
maybe what you do is you just define the largest training set that you can get zero
training error on that matches the labels of the giant mega ensemble, so it is just a
different way of creating Rich's giant data ensemble.
>>: But you've got to understand--I think you're missing a very good [inaudible], but
let's take it off-line. There is a problem with what you're saying. You can't eliminate the
bias-variance [inaudible]. You can't get something for nothing.
>> Ran Gilad-Bachrach: I think I agree with you. If you don't have any limitations on
runtime, then yeah, use your more sophisticated ensemble. Use the Bayes optimal, right?
I agree with you. But if you have limitations on runtime and you want to moderate the
size of your hypothesis because you have this limitation, this is how I propose to do
that.
>>: That is exactly the motivation for ensemble selection, so that…
>> Ran Gilad-Bachrach: During the training process you can build this
huge ensemble. And this is actually here; this is the ensemble. But then, still in the
training process, you have to prune it. So you can look at it--I think this is what John is
suggesting--you can view all of this as a pruning process.
>>: I guess the motivation is a little bit different, because I think… Like, my perspective
on, well, maybe it's just rattling back, Rich's work was like, I agree with you
philosophically, it feels very similar. But he was kind of like: with this small set, on this
small set I can't train a really simple thing that generalizes well, but a really complex
algorithm can do a good job, so I pull a ton of them and then I train, then I
[inaudible] deal with that to effectively expand the size of my [inaudible] data, and then
I train the super classifier on it. But here it doesn't seem to be so much, at least from
the software perspective, that I have too little data to work
from. It is more, well, maybe not so different. But it feels like you are trying to do
something about generalization; well, maybe it comes down to the same thing.
>> Ran Gilad-Bachrach: Basically I separated the tasks; the training data is just out of
the picture over here. It is just out of the picture. And the same thing in what Rich did:
once you train these models and use them to label, you didn't add new
information. You didn't add new information to this project. This is just a method by
which you actually select your final hypothesis. So instead of just
gluing them together, I said okay, do whatever you want to do in the first part; here is
what to do in the second part. And I am just analyzing the second part. Do whatever
you want to do in the first part.
>>: So I am just trying to understand your algorithm a little better. How does it prevent
situations like the one on the board that she just drew, where [inaudible] set of sample
functions none of which are particularly deep?
>> Ran Gilad-Bachrach: The function that I'm going to find eventually does not have to
be one of these N functions. It is not going to be one of F1 to FN. It could be any
function in my class. So I have this algorithm A, which I am going to feed with
samples, and it just either tells me no, I can't--I am going to feed it with labeled samples.
>>: Labeled by the…
>> Ran Gilad-Bachrach: They are going to be labeled by this ensemble. And I am going
to feed it labeled samples, and this algorithm is either going to tell me that it can't find
any hypothesis which is consistent with the sample, or it will say here is a hypothesis
which is consistent with this sample. And this hypothesis doesn't have to be from F1
to FN. It could be anything. And actually--I didn't want to talk about it, but
it doesn't have to be in the same class.
The algorithm: first I compute, for every point, the proportion of
classifiers in my ensemble that label it +1, and this is Pi plus. Then I
compute the size of the minority group on each point and sort the points such that the
first point has the largest minority group, and the sizes of the minority groups are
decreasing. I compute the label of the majority vote, the label that I
really want to have, and start iterating. The first thing is, I just put in all of the points
with the labels of the majority vote and see if I can find a classifier that classifies
all of them correctly. So I will call the algorithm and see if it can find a consistent
hypothesis. If it does not, I will remove the first point. The first point is the one with the
largest minority, so I will remove it, and again send it to the algorithm, and I will keep
going, keep going, until eventually the algorithm returns a function.
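Put together as a sketch (my reconstruction; consistent_learner stands for the assumed
oracle that returns a hypothesis agreeing with every given label, or None if it fails):

    def approximate_median(xs, gs, consistent_learner):
        n = len(gs)
        stats = []
        for x in xs:
            # pi_plus: fraction of the sampled functions labeling x as +1.
            pi_plus = sum(1 for g in gs if g(x) == 1) / n
            majority = 1 if pi_plus >= 0.5 else -1
            minority = min(pi_plus, 1.0 - pi_plus)
            stats.append((minority, x, majority))
        # Largest minority group (smallest voting margin) first, so that it
        # is the first point to be eliminated.
        stats.sort(key=lambda t: -t[0])
        # Drop points one at a time until a consistent hypothesis exists; a
        # binary search over the prefix length would also work. The loop
        # terminates: with one point left, some hypothesis matches it.
        for i in range(len(stats) + 1):
            labeled = [(x, y) for _, x, y in stats[i:]]
            f = consistent_learner(labeled)
            if f is not None:
                if i == 0:
                    # f matches every majority vote; its worst agreement is
                    # at the most contested point.
                    depth = (1.0 - stats[0][0]) if stats else 1.0
                else:
                    # Per the talk: the minority size at the last deleted point.
                    depth = stats[i - 1][0]
                return f, depth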
>>: Why not do a binary search? Why do a linear scan? It seems like you're asking for a
lot.
>> Ran Gilad-Bachrach: You can do that.
>>: Okay, it just seems like that is log n versus n in the number of training runs that you
have to…
>> Ran Gilad-Bachrach: Actually, yeah. But it doesn't matter.
>>: But you are saying knock off half, knock off half the [inaudible].
>>: Yeah, because otherwise it is [inaudible] neural network or whatever [inaudible].
>> Ran Gilad-Bachrach: You're right.
>>: Does it guarantee that it's true?
>>: Is it…
>>: I mean couldn't it be the case that it's not [inaudible]?
>> Ran Gilad-Bachrach: No. And eventually, once the algorithm returns a function
which is consistent, I can just compute its depth using the formula here, and we can
prove the properties of this algorithm. First of all, it will always terminate, even if you
keep deleting points all the way down. The reason is very trivial: at the last step you
give it a single point with the label that the majority gave this point, so there must be
[inaudible] hypothesis that agrees with that. And we can show that the function that this
algorithm returns is actually the maximizer of the empirical depth, and the depth that it
computes is actually the correct depth for this function. And again, trying to clear the
clouds around this formula, the important thing is the following: the relaxed depth of the
returned function is at least as big as the depth of the median--this is a supremum
[inaudible] of all of the depths--the depth of the median point, minus 2 epsilon. So it
actually does approximate the median.
So before I conclude, just a few small notes. One: we discussed in the beginning that
the motivation for the definition of the depth was the Tukey depth and the Tukey
median. So now we can go back to that and say that the algorithms that we
described, if we use them on this very specific case, approximate the
Tukey depth and the Tukey median, and they are polynomial algorithms, and this
is actually a new result. All of the previous results for approximating the Tukey depth
and Tukey median are exponential in terms of the dimension, and this is polynomial in
the dimension, but I must say that usually the approximation they consider is different
from the type of approximation that I am considering here. But nevertheless, this is
a new result.
I am not going to go over the details here, but we can also discuss the geometric
properties of these depth functions, whether they are convex in some sense, but I'm
going to skip that. But this is the big but. As you noticed, I didn't show any empirical
study here, which raises the question: why don't you have any? If it's good, show us that
it's good. And the problem is that I made the assumption that I can sample functions
from my belief. If we look at the algorithm that approximates the median, we
needed three things as inputs. We needed a sample of unlabeled instances, which is
usually easy to get. We needed a consistent algorithm, which in many cases is easy to
come up with. But what is hard to come up with is a sample of functions from my
belief. This is something that for interesting belief distributions is not so easy to achieve.
So we can use heuristics; you can think about bagging and random forests and
everything like that--this is actually what they do. We can also show that if I can't
sample from the true distribution but I can sample from something that is close to it, that
is fine; we can correct for those mistakes. But the reason why I haven't
yet done the empirical studies is that once you start using the heuristics, if it works,
great. If it doesn't work, then is it because this whole method is broken, or is it because
the heuristic did not deliver what you expected it to deliver? You won't be able to say.
So this is why I preferred first to, you know, complete the theory and know exactly what
I wanted to achieve, and then separately try different heuristics and see how they
perform with respect to this method.
>>: Is there any way to measure, [inaudible] any way to measure how good a heuristic
is with respect to a fair sample [inaudible]?
>>: Sure, if its Bayes error is low. I mean, not the lowest one, but you can compare
heuristics, but compare…
>>: You can compare heuristics, but can you tell if your…
>> Ran Gilad-Bachrach: But you can't compare it. You want to compare it, again, you
want to compare this Bayes error with--if its Bayes error is low, then you are good. But
if it is not low, is it because your heuristic is broken or because your posterior is not
very decisive or not very good? It is hard to isolate these two things, and this is why I
want to separate the theory from the empirical part: because there is an additional
unknown over here.
>>: I'll go back to the hard-Q comment. So it seems like in a lot of cases [inaudible]
Monte Carlo is a sample from the Q?
>> Ran Gilad-Bachrach: Approximate. So even if…
>>: So the question is whether or not it converged, but I think…
>> Ran Gilad-Bachrach: For example, I think that even in the simple case
where you have a uniform distribution over a convex [inaudible], yes, there are
polynomial algorithms for sampling from that, and they have improved
the complexity; the first algorithm was something like, with a
complexity like D to the power of 23, and they managed to reduce it all the way now to
three or four, but the constant there is something like two to the power of 100.
>>: But even [inaudible] finding the mean instead of the posterior, and a lot of
approximation algorithms [inaudible] actually give you a very good approximation of Q.
[inaudible] find the family…
>> Ran Gilad-Bachrach: Again, there are a lot of heuristics for how to do that, and there
is evidence that in many cases they work fine, but the only result that I know of that
actually samples from the distribution, at least in some cases, has this complexity.
So again, once you start…
>>: There are certain other cases where you know that the Markov chain is mixing and
you're going to guess [inaudible].
>>: You are making a very strong assumption that you have a sample from [inaudible]
only sampling continuous values or [inaudible] but that is [inaudible] specific…
>> Ran Gilad-Bachrach: That's true, but then you say okay, let's sample only from Qs
from which I know how to sample. So maybe these Qs--imagine now that I am going to
apply this to a certain problem. First of all, before we dive into this discussion, the
first thing that I want to do is really go ahead and evaluate this. This is obvious. But
there is a reason why I wanted to isolate it from this theoretical work. Once
I want to do that, I have to either restrict the type of Qs, you know, the posteriors that I
am allowed to use, to only Qs I know how to sample from, or say I am not going to
restrict the Qs, but I am going to use some heuristic to sample from this distribution.
Anyway, if it works fine, that is great. But if it doesn't work, what does it mean? Does it
mean that the whole method is broken, or does it mean that I restricted my family of
distributions in a way that these distributions do not work well for this problem?
>>: By definition, since your method has to sample from Q, it is broken if it doesn't
work.
>> Ran Gilad-Bachrach: Yes, but then you can say okay, there is another item
now: how to sample from Q. So I went that far, and obviously, you know, there is still a
way to go. And I am very clear about it. If I just evaluate now and it is
working, we are happy. But if it doesn't work, what does it mean? That this whole line
is broken, or that we need to improve our methods for sampling from Q?
>>: Probably the latter.
>>: Do you have any ideas on what types of Qs you are interested in sampling from?
>> Ran Gilad-Bachrach: So for example, [inaudible].
>>: Exactly, [inaudible]. The kind of form, the function that you have. They have
done work on trying to approximate the posterior [inaudible] like in the Bayesian
[inaudible] community, but even the SVM loss function [inaudible] loss and…
>> Ran Gilad-Bachrach: As far as I know, and I will be happy to learn that I am wrong,
none of them has provable guarantees on the difference between the true
distribution from which you want to sample and the distribution that you actually
sample from. All of them are well justified and seem to be working, but
apart from this result from [inaudible] and one or two others,
none of them is really provable.
>>: No, no. Here is the problem. So you talk about sample selection that comes closer
to a solution. And you were saying that your method [inaudible] median is a good one.
And then I have [inaudible] to evaluate the mode, I have a method to evaluate the mean,
and in these cases people have shown that the mean might be better than the mode,
right? And I cannot pinpoint where this [inaudible] lies in that space, unless there is
some empirical evaluation. And hence my question: what would it require for you to
show a comparison with those [inaudible]?
>> Ran Gilad-Bachrach: So definitely I am going to evaluate that, and definitely I am
going to try these heuristic methods, either sampling, with the inaccuracy of the
sampling process, and seeing if it works, or otherwise restricting the family of Qs to
families from which I know how to sample. But I wanted to separate that from this
work, because once I do that I add another unknown. The evaluation would be
problematic.
>>: Here's another way of looking at it. Maybe the sampling, or the sampled posterior,
I say is really my posterior, because I can adjust my prior. It is whatever I say. So I can
adjust my prior; maybe there are certain [inaudible]s that I don't know, because that is
my prior. It's a crazy prior, but how can you argue with me? It's my prior, and the prior
happens to match what the posterior has to sample.
>> Ran Gilad-Bachrach: But again, you go off to design an experiment, right? And you
think, okay, I am going to design an experiment; let's see what the possible
outcomes are. If the outcome is that it doesn't generalize well, is it because of your
sampling technique, or is it because of the algorithm that I described here?
>>: But [inaudible] I don't think this will [inaudible] maybe they are just inseparable. I
think it doesn't make any sense [inaudible].
>> Ran Gilad-Bachrach: It would have been inseparable if I simply knew how to sample
from a general Q. So, for example, the energy function of SVM: if I could sample from
this energy function, then I could compare the result with the result of SVM, and then I
would know exactly that this is…
>>: Yeah, that is one choice of prior.
>> Ran Gilad-Bachrach: Yes. This is one choice of prior, but an important one.
>>: Yeah. [inaudible].
>> Ran Gilad-Bachrach: That's true. But then again, what is the broken component
here? Is it that the prior is broken or is it that the algorithm is broken? So there is a
problem in running this type of experiment. It is an inherent problem in this kind of
experiment: you cannot control for all of the free variables.
>>: That is true.
>> Ran Gilad-Bachrach: So this is why I preferred to go all the way to establish the
theory and know exactly what I want to achieve, analyze it, and then once I do that,
yeah, I am going to run the experiment, but I know that my ability to evaluate the
results of these experiments is going to be limited.
>>: [inaudible] you are saying that the depth of the Bayes classifier is half, and then
your bounds have one over the depth, so you're going to pay a factor of two in your
error compared to the Gibbs sampling no matter what. So…
>> Ran Gilad-Bachrach: Remember when [inaudible] over here said we get this
annoying factor of two when analyzing the Bayes classifier. We tried to get rid of it and
didn't manage to do that.
>>: No [inaudible] there's a paper where the two goes to almost epsilon in some settings
but…
>> Ran Gilad-Bachrach: I don't know. At least here, when the studies were going on,
they said that we have this annoying factor of two, and we tried to get rid of it and we
couldn't, and…
>>: [inaudible] years ago [inaudible].
>> Ran Gilad-Bachrach: You can see cases in which you can get better than two, for
example with the relaxed depth. So we can ask, why is it two? It's two if you have many
points on which the majority vote is very marginal. But if, for most of the
points, the majority is very decisive, this bound will actually show you that you can get
better than a factor of two; the two is for the worst-case classifier. It could be
better. But even in their analysis, actually what they show is that the best bound they get
is for the expected Gibbs classifier, and then they say okay, the Bayes classifier cannot
be much worse than that. And this is the same thing that we showed here.
>>: I will show you the result where this two [inaudible]. The other thing I wanted to
say is that the trendy thing in learning theory now is to talk about the importance of
strong convexity [inaudible], and you didn't mention this at all. I am really [inaudible]
required was the [inaudible] property of the posterior. Now, what happens, I don't know
what it would be called, when you have a posterior which is defined as the normalized
exp of a strongly convex function rather than just a convex function, which would be
achieved by a strongly convex regularizer [inaudible]. And I think what you get is that
you have this guarantee that says that there exists a function which is at least at depth
[inaudible] over e.
>> Ran Gilad-Bachrach: [inaudible] I don't need the strong convex…
>>: Right, but I'm saying I think that the 1 over e is for the general case. If you
have a ball, you actually have half. And I think strong convexity [inaudible] interpolates
between the two, right? This result is just for a general convex body: through the center
of mass you can cut it such that 1 over e is on one side. So I am
assuming that that is like this 1 over e. If it was a ball, you could actually cut it in half.
And then if you had the strongly convex [inaudible], like an L2 regularizer--think of
the loss function as just an L2 regularizer, no empirical loss, so the weight of the
empirical loss is zero; you are just quadratic--then you have the ball. Then you would
have half.
>> Ran Gilad-Bachrach: Interesting direction. It is not something that I can work out
off the top of my head. Actually, you can also show that if a function is not log-concave,
there is a larger family of these distributions, called rho-concave functions, and for
each one of them you can get a bound which is a function of rho. You can refine the
results, I'm sure.
>>: A question about the energy function that you were showing, regularizing the
function: it doesn't always lead to [inaudible] solutions [inaudible], so if you just
exponentiate the energy [inaudible]?
>> Ran Gilad-Bachrach: Oh, no you always have…
>>: [inaudible] for SVM, for [inaudible], it does not.
>> Ran Gilad-Bachrach: So you always have the partition function.
>>: Yeah, but I think that some people [inaudible]. So consider the case where you
have, the SVM case, then you regularize with a [inaudible] loss function.
>> Ran Gilad-Bachrach: So you are saying that the integral might go to infinity and you
cannot--could be, but then if you bound it, if you bound everything…
>>: So if you can approximate it somehow?
>> Ran Gilad-Bachrach: Thank you very much.
[applause].