>> Dengyong Zhou: I am glad to have Ran Gilad-Bachrach speaking today. Ran is a
researcher in the machine learning group at Microsoft Research. Before joining the
department here he was an applied researcher at Bing. Ran has a PhD from Hebrew
University in computer science.
>> Ran Gilad-Bachrach: Thank you Denny. I'm going to be talking today about the
Median Hypothesis. This is joint work with Chris Burges, and to lay the groundwork I'm
going to discuss a PAC-Bayesian point of view on learning. We will need that to define
the median, and actually we will find out that we will also need to define a depth
function by which we will define the median. Once we have that, we will provide
generalization bounds using the median hypothesis, which is what we will call a deep
hypothesis. So this will be the first part of the talk, and in the second part we will start
discussing algorithms. The first one would be how we measure the depth of a hypothesis,
and then we will discuss how we find the deepest hypothesis, which is actually the
median. So this is kind of the outline of the talk.
So let's start by looking at the PAC-Bayesian point of view on learning. Many learning
algorithms look kind of like that. We have some sort of scoring function; I will call it
the energy function. We give a score to every hypothesis based on some regularization
term--pick your favorite, L2, L1, whatever--and a loss function on some observations
that we had. From the PAC-Bayesian point of view we can look at it in the following
way. We can say that the regularization term actually represents some prior belief that
we had. So think about the regularization term as the log likelihood of the prior belief.
And then given the observations we can generate a posterior belief, and this is what we
then use to select the hypothesis and make a prediction.
It's very important to note that these are beliefs. So it doesn't mean that we assume that
something in the real world really behaves according to these probability measures or
distributions or whatever you call them. This is just a representation of our internal
beliefs about what is going on out there. So the way that I would like you to
view it is that the process of learning is made of three stages. We begin with some prior
about the world and then we see some evidence. For example, it could be labeled or
unlabeled examples, from which we generate the posterior. And then once we have this
posterior belief, we select the hypothesis with which we will go ahead and make our
forecasts and our predictions.
My talk today is going to focus on this part. Once we have our posterior belief, what
hypothesis should we choose to use to make our predictions? Let's look at some common
methods of selecting the hypothesis once we have our posterior. One option is to use
the maximal posterior. This is actually what SVM and LASSO do; minimizing
the energy function is equivalent, in my language, to selecting the maximal posterior.
Another option is to use the Bayes function, which means that every time I want to make
a prediction, I will just hold a vote among all hypotheses. I will do weighted voting
and make my prediction. I can use Gibbs sampling, which means just sampling a
hypothesis at random according to your belief. And I can use ensembles, which are kind
of an extension of the Gibbs sampling: instead of just selecting one random
hypothesis, I will select several and just hold a vote between them.
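To make the four selection rules concrete, here is a small sketch in Python (illustrative
only; the names and interfaces are mine, not from the talk), for binary labels in {-1, +1},
with the belief Q represented as hypotheses paired with posterior weights:

    import random

    def map_predict(hypotheses, weights, x):
        # Maximal posterior (SVM/LASSO style): always use the single most
        # probable hypothesis under the posterior.
        h = max(zip(hypotheses, weights), key=lambda hw: hw[1])[0]
        return h(x)

    def bayes_predict(hypotheses, weights, x):
        # Bayes: weighted vote over the whole posterior at prediction time.
        vote = sum(w * h(x) for h, w in zip(hypotheses, weights))
        return 1 if vote >= 0 else -1

    def gibbs_predict(hypotheses, weights, x):
        # Gibbs: sample one hypothesis from the posterior and use it alone.
        h = random.choices(hypotheses, weights=weights, k=1)[0]
        return h(x)

    def ensemble_predict(hypotheses, weights, x, k=10):
        # Ensemble: vote among k hypotheses sampled from the posterior;
        # k=1 is Gibbs, large k approaches the Bayes vote.
        sample = random.choices(hypotheses, weights=weights, k=k)
        return 1 if sum(h(x) for h in sample) >= 0 else -1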
So I would like to compare these methods, but this is a very qualitative comparison, and
to make sure that you don't think that there is anything accurate about it, I drew it by
hand [laughter] to emphasize that. And I went to look at two axes. One is the runtime:
when I want to make a prediction, how complicated is it to make this prediction. And the
second axis is accuracy, but think about accuracy more as how well it captures my
beliefs. So I have the posterior belief; how well does the hypothesis capture all of
the structure of my beliefs? So let's look at some examples. If we look at the Gibbs
method, which is just: I got my belief, I selected a random hypothesis, and then I am just
going to use it to make predictions.
So the runtime is really easy. I just need to evaluate a single hypothesis and see how well
it works. But does it capture all of the complexity of my belief? To a very small extent, I
would say. If we look at the maximal posterior, it seems to capture more, but still it just
looks at the peak. My belief can change dramatically, and as long as the peak
remains in the same place, I would pick the same one. So it doesn't seem to capture all
that I have learned in the learning process. On the other hand, if we look at the Bayes
classifier, it uses a lot of information about my belief, but in terms of runtime, I really
need to hold a vote every time I want to make a prediction. The ensemble method
provides kind of a trade-off. If I use an ensemble of just one hypothesis, then this is just
Gibbs sampling, and so it is going to be fast but not very accurate. Whereas as I let my
ensemble grow larger, I capture more structure of my beliefs, but it becomes
slower.
>>: You are making some assumptions.
>> Ran Gilad-Bachrach: Of course.
>>: For example, for linear [inaudible] classes, if you're just [inaudible] linear classifiers,
then you can show that the Bayes optimum is also itself a linear classifier. So
somehow the fact that the Bayes…
>> Ran Gilad-Bachrach: No. If you are doing binary classification, it is not true that the
Bayes classifier over linear classifiers is itself a linear classifier. It is simply not true.
>>: That depends on what your distributions are. When they do the general--when they
use the--when you have Bayesian generalization bounds for SVM, then in that setting it
turns out that the Bayes classifier is in fact very fast.
>> Ran Gilad-Bachrach: Again, this is why I drew it by hand.
Think about the concept, not about the very specific cases in which things might be
different. Generally speaking, computing the Bayes classifier is hard. Although there are
examples in which it is easy, generally speaking, for a general class with a general
distribution, this tends to be hard. Okay?
>>: So another question. If I think about this from a Bayesian respect here, there are two
types of uncertainty. There is model uncertainty and there is parameter uncertainty. So it
sounds like you are talking about just model uncertainty in this schematic or diagram, is
that right?
>> Ran Gilad-Bachrach: Again, I am setting aside the whole problem of estimating the
parameters and everything. I had my prior belief and I saw some evidence, and this is
currently my belief. This is my belief and that is it. But now I am asking myself, okay,
given that this is my belief, how should I make predictions?
>>: So you are thinking about a distribution over the model classes and the parameters
for the [inaudible]. [inaudible] both the model and parameters.
>> Ran Gilad-Bachrach: Exactly.
>>: Okay thanks.
>> Ran Gilad-Bachrach: What I'm about to propose is a new way to select a hypothesis
from my belief, which is to use the median. Now, the median is going to be just a single
hypothesis, so in terms of runtime it is going to be equivalent to the Gibbs and the
maximal posterior, but I will select the hypothesis that captures as much information as
possible about my belief. So this is where we are heading.
So now I want to start defining what it is I mean by median, but it turns out that in order
to define a median we will need to define something which we will call a depth, a depth
function. So let's start with the case that we all know about: the one-dimensional,
univariate median. Assume I have this sample of points here and I am going to
find a median of these points. One way to look at it is to say the following: every point
kind of splits the sample into two parts, all the points that are to the left and all of the
points that are to the right of this point. And we associate a depth with each point, being
the size of the smaller part. For example, this point has depth three because here on
this side there are only three points, whereas this point has depth one because if I split
the space, only this point is on this side. Now, given this definition, this point is the
deepest point and it is the median. Once we define the median via a depth function, we
can now move to multivariate cases.
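As a concrete sketch (mine, not from the talk; conventions for counting the point itself
vary), the univariate depth and median could be computed like this:

    def univariate_depth(p, sample):
        # Depth of p: the size of the smaller of the two parts that p splits
        # the sample into, counting points weakly on each side.
        left = sum(1 for x in sample if x <= p)
        right = sum(1 for x in sample if x >= p)
        return min(left, right)

    def univariate_median(sample):
        # A median is a deepest point.
        return max(sample, key=lambda p: univariate_depth(p, sample))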
>>: [inaudible] so your points [inaudible].
>> Ran Gilad-Bachrach: Sorry?
>>: [inaudible] so your points [inaudible] distance [inaudible].
>> Ran Gilad-Bachrach: No. I do not consider distance; I just count the number. Let's try
to see if we can extend this definition to the multivariate case. When we move to the
multivariate case this is called the Tukey depth function and the Tukey median. So how
do we find it in the multivariate case? Assume again that I have this sample, but now
it is two-dimensional, and I want to measure the depth of every point. So what I do is I
look at hyperplanes. If I want to compute the depth of this point, I look at all the
hyperplanes that go through this point. Every hyperplane splits the sample into two, and
I take the smaller part of it. I take the minimum over all hyperplanes, and this would be
the depth of this point. So for example, you can see that for this point, the depth would
be two because every hyperplane will have this point and at least one additional point
with it. Whereas for this point the depth would be three because every hyperplane will
have at least this point and two other points with it. So this point is deeper than that
point. And again, once we have a depth function, we automatically get a median
associated with it, which is just the deepest point.
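A rough sketch of the two-dimensional Tukey depth (mine, not from the talk): the exact
depth minimizes over all hyperplanes through the point, which we can approximate by
scanning many random directions:

    import math, random

    def tukey_depth(p, sample, n_directions=1000):
        # Approximate Tukey depth of point p in a 2D sample: for each
        # direction, count the points weakly on each side of the line
        # through p and keep the smallest count seen over all directions.
        best = len(sample)
        for _ in range(n_directions):
            theta = random.uniform(0.0, math.pi)
            nx, ny = math.cos(theta), math.sin(theta)
            proj = [(x - p[0]) * nx + (y - p[1]) * ny for x, y in sample]
            side = min(sum(v >= 0 for v in proj), sum(v <= 0 for v in proj))
            best = min(best, side)
        return best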
A word of caution: you have to know that in the multivariate case there are actually
many definitions of multivariate medians. We can define different definitions of a
depth function, and each one of them would lead to a different median.
>>: [inaudible] example, if your points are in a convex shape, then all of them are
medians.
>> Ran Gilad-Bachrach: No.
>>: So then you're saying--no. Because all of them would be [inaudible] the point one.
>> Ran Gilad-Bachrach: Remember that the median does not have to be a sample point.
So if I have points on a circle, let me draw that. I have points on a circle, right? This
is my sample. So you are correct that every one of these points has a
depth of one. But actually, that point is much deeper.
>>: Oh, so it's any…
>> Ran Gilad-Bachrach: I can compute it for any point in space. And in this case
you can actually see that the median is actually the center. So it is what you expect it to
be.
>>: Well I think that you [inaudible] to compute [inaudible]. What if you can't compute
function in the [inaudible]?
>> Ran Gilad-Bachrach: There are depth functions that are defined only on your sample
and there are depth functions that are defined everywhere. There are different
definitions, and each one of them has its pros and cons. It is an interesting topic, a very
interesting topic, but I will have to skip all of that. I just wanted to point
out the fact that this is not the only way to define a multivariate median. But this is the
one that I will be inspired by, so this is why I brought it up and not other possible
definitions.
>>: [inaudible].
>> Ran Gilad-Bachrach: For example, like in the univariate case, there could be more
than one median. You know from the univariate case that the median is not necessarily a
single point. But here you can show, for example, that the level sets of the depth function
are convex. Therefore if you have more than one median, it will always be a convex set
of points which are the medians. So you can show all sorts of things. There are a lot of
studies on the multivariate median, and just as an anecdote I will tell you that about 10
years ago was the first time I encountered this, and I thought that I had invented the
multivariate median, because I searched for multidimensional median, and when you
search for multidimensional median you can't find anything. And I thought, you know,
this is my big thing, right? And then--I know why. A friend told me, you know that in
statistics they don't call it multidimensional; they call it multivariate. [laughter] Yeah,
so this is why I brought this up, just to mention the fact that there is a lot of literature, a
lot of different definitions, and unlike the univariate case, where when you say the
median we all know what you're talking about, when you go to the multivariate case this
is no longer true.
But now, motivated by this very specific definition of the median that we just saw, I
am going to define a median for function classes. One thing that is important to note is
that unlike what we discussed before, in which we had a sample of points and we
wanted to find the median of these points, now we are talking about functions. We are
not going to find the median of our sample, but the median of our hypotheses. Just a
little bit of notation that we are going to use: our sample space is going to be, I am going
to use the letter X for that, the function class is F, and the most important notation is Q,
which is our belief, our posterior belief. This is a distribution over our function class.
And here is how we define the depth in this case. I have a function whose depth I want
to measure. So first of all I take an instance and define the depth of this function
with respect to this instance X to be the probability that a random function agrees with
the label that F assigns to X.
Now I define the depth of F to be just the infimum over all instances. It is similar to
what we did in the Tukey depth, in which we said every hyperplane defines some
sort of a depth, because it splits the space into two and I take the smaller part, and then I
take the infimum over all hyperplanes. This is what we do over here.
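In symbols (my transcription of the definition as described): for a function f and an
instance x,

    D(f | x) = Pr_{g ~ Q}[ g(x) = f(x) ],        D(f) = inf_{x in X} D(f | x),

so the depth of f is its worst-case rate of agreement with a random function drawn from
the belief Q.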
>>: In the function family here like any, does it have to be [inaudible] space?
>> Ran Gilad-Bachrach: I can compute the depth of a function which is not in my
function class, for example. It is defined for every function. Once we have a definition
of a depth function, we have a median, which is just the deepest point.
>>: So in the previous slide you said that the range of F is just minus one, plus one,
right? So there are just two depths.
>> Ran Gilad-Bachrach: No. For every, again for every X there could be…
>>: Oh, X is either one or minus one.
>> Ran Gilad-Bachrach: Yes. But then I take that infimum of all Xs.
>>: Wait a minute. For a given X, there are just two types of F, ones that are deep and
ones that are shallow.
>> Ran Gilad-Bachrach: Exactly.
>>: Okay.
>>: Wait, wait, there is a probability that when you draw F [inaudible] at random from Q.
>> Ran Gilad-Bachrach: Yes, what [inaudible] says is that if, for a certain point, and we
will see it in a second, 80% of the functions say that the label is +1, then 20% won't say
that. So either you are in this class or in that class.
>>: There are two sums; there are two terms in this [inaudible].
>> Ran Gilad-Bachrach: Yes, but once you take the infimum, this function becomes
continuous. You are absolutely correct. Now…
>>: Wait [laughter] so the infimum is over all of X regardless of [inaudible]; if there is
an X with no probability of this ever occurring, then you still…
>> Ran Gilad-Bachrach: For that question, wait five minutes.
>>: So the [inaudible] is the amount of support that [inaudible] the least popular
[inaudible].
>> Ran Gilad-Bachrach: Exactly. So for example, you can see that if I compute the
depth of the Bayes classifier, it will always be at least half. Do you see that? Because
the Bayes classifier always takes the majority vote, so for every X at least 50% of the
population will agree with it, and therefore the infimum has to be at least half,
right? This is an example to keep in mind. A depth of half is the best we can hope for in
any realistic setting, and the question would be how close to half we can get. I will not
go over all the details here, but we can actually see that the Tukey depth that we
discussed before is a special case of this definition: when the hypothesis
class is some sort of class of linear classifiers, the Tukey depth
becomes a special case of this definition. But there are some interesting things
about the Tukey depth. For example, we know that for the Tukey depth, if we work in D
dimensional space, there is always a point with depth greater than 1 over D+1, regardless
of the distribution. There is always a point with depth 1 over D+1.
We know further that if the distribution Q is log-concave, then there is always a point
with depth 1 over e, regardless of the dimension. Now 1 over e is very good. Remember
that the optimum we hope for is one half, so 1 over e is very close to that.
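Summarized in symbols (my paraphrase of the two facts just stated): for the Tukey depth
with respect to a distribution Q on D-dimensional space,

    there is always a point x with depth(x) >= 1/(D+1),  and
    if Q is log-concave, there is a point x with depth(x) >= 1/e (about 0.37),

the latter regardless of the dimension.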
>>: So the Bayes point, the Bayes function, isn't always in this space F, is that what
you're saying?
>> Ran Gilad-Bachrach: No. It doesn't have to be in the space F.
>>: You're saying [inaudible].
[multiple speakers]. [inaudible].
>>: [inaudible] realizable so you might not be able to realize in the space of the measure
Q that you can get a half [inaudible].
>> Ran Gilad-Bachrach: If we go back to the beginning of our discussion, we said that
many learning algorithms look like that, the energy function looks like that, and we
actually like the loss function to be convex and the regularization term to be convex,
which means that the energy function is convex, which means that Q is log-concave. So
actually in many of the cases which we really work with, we are exactly in the setting
where our posterior belief is log-concave, and therefore we are guaranteed to have a
deep function, a function with a depth of at least 1 over e.
So I hope you got some intuition with respect to the median and the depth and what they
mean, and now I would like to convince you that the median hypothesis is actually
good. But before I do that, a warning: don't always use the median. [laughter] But in
our case, it is actually good to use the median, and so here is the first result. Let's say
we have some target, so the world is distributed with some distribution nu that I don't
know, and R nu of F is just the generalization error of F.
We can prove that for every function F, its generalization error is bounded by one over
its depth times the expected generalization error I would get if I just sampled a
hypothesis from Q. And the proof is trivial; we have to consider just two
cases. Pick a point. There could be two things. If the majority vote is won with a large
majority, then my function, since it is deep, must agree with the majority, and therefore
my risk, my loss on this point, would be very similar to the risk of a random hypothesis.
The other case is one in which Q is not decisive. In that case my
competition, which is just a random hypothesis, would err a lot regardless, and
therefore I can't do much worse than that. So again, this is kind of vaguely
speaking, but when you write it down it is just two lines and you get the proof.
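Written out (my reading of the theorem as described): with R_nu(f) the generalization
error of f under the data distribution nu,

    R_nu(f) <= (1 / D(f)) * E_{g ~ Q}[ R_nu(g) ]    for every f,

so a deep hypothesis can be worse than the average hypothesis drawn from Q by at most
a factor of 1 over its depth.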
The nice thing about it is that we can connect this with more traditional PAC-Bayesian
analysis. Just about a month ago we had Mario Marchand over here, and what
he showed is a theorem of this type; let me try to clear things up a bit.
Basically what it says is that with large probability, the generalization error of just
selecting a hypothesis at random, the expected error of a random hypothesis, is bounded
by a term which is made of two things. One is the training error and the other one is
kind of the KL divergence between the prior and the posterior. And now we can just take
this result, take the theorem that I just showed you, and plug them together, and we get a
generalization bound that says that for every hypothesis, its generalization error is
bounded by 1 over its depth times the same term that we had before.
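Schematically (the exact constants are on the slide and not reproduced here; this is one
standard PAC-Bayesian form): with probability at least 1 - delta over a training sample
S of size m,

    E_{g ~ Q}[ R_nu(g) ] <= R_S(Q) + sqrt( (KL(Q || P) + ln(2*sqrt(m)/delta)) / (2m) ),

and combining this with the previous theorem bounds R_nu(f) by 1 over D(f) times this
right-hand side.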
But now there is something really annoying about our definition, and [inaudible] asked
about it. When we defined the depth, we looked at the
infimum, and this seems to be very harsh. It could be that we have a function that
agrees with the majority on all of the points but, say, one point. So the infimum seems
not the right way to look at it. And therefore we can define a relaxed version of the
depth in which we say: instead of taking the infimum, you are allowed to put aside a
proportion Delta of the points and compute the depth only on the rest of them. So when
we look at this relaxed depth over here, what it means is that for the majority of
the points your depth is greater than that, but there might be a proportion Delta on which
you actually do worse than that, and that is fine.
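One way to write the relaxed depth formally (my formalization; the slide may differ in
details): relative to a distribution over instances x,

    D_Delta(f) = sup { d : Pr_x[ D(f | x) < d ] <= Delta },

that is, the best depth f achieves once a Delta-fraction of its worst instances is set aside;
Delta = 0 essentially recovers the original infimum definition.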
And again we can repeat the same kind of proof that we have seen before, the same kind
of theorems, using this relaxed version of the depth function. So we can bound the
generalization error in terms of this relaxed depth function, with a Delta term over
here, and again we can plug it into the PAC-Bayesian bound and get a generalization
bound in terms of this relaxed depth function.
>>: [inaudible]?
>> Ran Gilad-Bachrach: Uhh.
>>: Or we can choose [inaudible] whatever I want?
>> Ran Gilad-Bachrach: I think you can, but I need to verify that. I think it does. I hope
that by now I have motivated you that finding a deep hypothesis is something
worthwhile, and now we are looking to find a deep hypothesis. The first thing is actually
to be able to measure the depth of a hypothesis: if someone comes with a hypothesis, can
I know what its depth is? And we want more than just estimating its depth; we want a
uniform estimation, meaning that if I have a class of functions, I want to be able to
estimate the depths of all of them simultaneously and make sure that my estimate holds
for all of them.
So here's what we're going to do. Assume I have a bunch of points. As we discussed
earlier, for each point I can look at what is the size of the population that labels it plus 1
and what is the size of the population that labels it minus 1. So say these are
my sample points, and now I want to evaluate a specific function. Say this function
provides the label plus 1 for the first point. I know that its depth on this point is 20%,
because only 20% of my functions agree that the label here is plus 1. And then I can
take the label on the second point and again see what is the size of the population that
agrees with it on the second point. And so on and so forth: I can go over all of the points
and for each one of them mark the depth with respect to that specific point.
Eventually the depth is just the minimum over all of that. So in this case, on this point I
had the smallest agreement, and therefore the depth of this function is .2. And this is
what we are actually going to do. Instead of computing the infimum over
all points, we are just going to take a sample and evaluate the depth only on this sample,
and when we want to evaluate the agreement here, again, we will take a sample of
functions and use them to evaluate what proportion of the population labels the point
+1 and what proportion of the population labels it -1.
And this is actually what the algorithm does. You take two samples, one an unlabeled
sample of instances and one a sample of functions from your belief, and for every point
X you compute what proportion of the population agrees
with the function that you are trying to evaluate, and you take the minimum, and this is
your estimate. This is what we call the empirical depth of the function F. We can
prove the following property, and again I will try to clear up all of the details here, but
the main thing is that with high probability we will have the following: the
empirical depth we compute is going to be greater than the true depth minus epsilon, and
it is going to be no larger than the relaxed depth plus epsilon. So it is going to be
somewhere in this range, and this holds with high probability, simultaneously for all
functions, so it is uniform. This is a uniform bound that holds for all functions.
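A sketch of this estimate (interfaces are mine, not from the talk): given an unlabeled
sample xs and functions gs drawn from the belief Q,

    def empirical_depth(f, xs, gs):
        # Worst fraction, over the sampled instances, of the sampled
        # functions that agree with f; this is the empirical depth of f.
        depth = 1.0
        for x in xs:
            agree = sum(1 for g in gs if g(x) == f(x)) / len(gs)
            depth = min(depth, agree)
        return depth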
If we look at the probability, in terms of what we mean by high probability, let's look
first over here. We have here the growth function; we see the dimension that
we discussed, and we see that we have some polynomial term here but an exponentially
small term over here, and the same thing over here. So we can choose the sizes of the
samples that we need such that the probability will be as high as we want, as close to 1
as we want.
>>: So epsilon [inaudible] back to the fraction and you relaxed that and [inaudible]
parameter, right? In other words, if you go back to the definition of relaxed depth,
there's a [inaudible] you choose, Delta?
>> Ran Gilad-Bachrach: Delta. Here it is.
>>: Okay, so that Delta is…
>> Ran Gilad-Bachrach: That Delta is here, and here and…
>>: Oh, I see it. Okay.
>> Ran Gilad-Bachrach: And again, I'm going to skip that. And the proof is very simple.
The main ingredient is to note the following. We have this slack between the true depth
of the function F and its relaxed depth. If my sample of points contains a point for which
D of F given X is somewhere in this range, then my estimate will be smaller than the
relaxed depth. And my estimate will always be at least the true depth, because that is an
infimum. So all I need to guarantee is that my sample is actually a hitting set, which
means that it hits all of these regions, so I have instances in all of these regions for all of
the functions. This is called a hitting set, and in the machine learning literature this is
called an epsilon net, so we know that if this dimension is finite, a random sample will
actually hit all of these regions with high probability. So this is actually, you know, the
main ingredient in this proof.
So now we know how to measure depth, and we know how to measure it uniformly for
all of the functions in the class, but now we would like to find the median, or
approximate the median. We want to find a deep function. And this is what we are going
to do now. Again, I want to start with motivation. I am going to show it with pictures and
then go over the algorithm. I have a sample of points, and I will tell you in advance what
we are going to do. Instead of trying to find the really deepest function,
we are going to find the function that maximizes the empirical depth and not the true
depth. And it turns out that it is easy to find this function. Assume that I have this
sample of points and on each one of them I know what part of the population
labels it +1 and what part of the population labels it -1.
The deepest possible function will actually agree with the majority vote in all of these
cases. So the Bayes classifier will give the label -1 here and +1 over here and +1 over
here, because the majority says +1. If I can find a function like that, this is the best that
I can hope for. But now assume that I cannot, so I can't find any function in my class
that gives all of the majority labels in all of these cases. Then I am going to
eliminate one of the points from the sample, saying that I allow the function that I am
looking for to mislabel this point. But which point do I want to eliminate? If,
for example, I eliminate this point, I might get a classifier that labels this one +1 and its
depth is going to be .2, so this is not very favorable.
>>: I have a question. In drawing these graphs, how [inaudible]?
>> Ran Gilad-Bachrach: Yes. I want to eliminate the point on which there is the smallest
margin in the voting, because I know that for this point, for example, although 55% of
the classifiers label it -1, still 45% say it's +1. So if I get the label +1 there,
this is not too bad. So I can delete this point and ask: can I find a classifier which
agrees with all of the remaining labels? And I can keep going, and every time I will
delete the point on which there is the smallest margin. I will keep going until I find a
consistent classifier, and not only that: once I have found one, I know that it is actually
maximizing the empirical depth. I also know what its empirical depth is: the
empirical depth is actually the size of the minority set on the last point that I deleted.
Now this is basically it. So again, the algorithm itself will receive two
samples. One is an unlabeled sample of points; the other one is a sample of functions
from my belief distribution. And the output will be a function F which actually
maximizes the empirical depth, together with its empirical depth.
>>: [inaudible] sample point starting with the same thing [inaudible] the sample the
same sample?
>> Ran Gilad-Bachrach: The only thing that I get is a posterior. You might have used--I
don't know what you used to reach this posterior. Maybe you didn't use a labeled
sample, or maybe you have some oracle that gives you some hints. I don't know. But
once you have this posterior, this is all I need in order to select the median hypothesis.
Does it make any sense?
>>: Yes. But I am saying what bounds can you use…?
>> Ran Gilad-Bachrach: I can use a labeled sample, but it doesn't add anything. I don't
use the labels.
>>: What I am trying to say is having a labeled sample doesn't [inaudible]. I mean, when
you are training an algorithm you have several different points, so using additional
unlabeled points is not adding anything either.
>> Ran Gilad-Bachrach: You're asking whether, if I trained using a labeled sample, I
can reuse the sample again for this part?
>>: Yes. And I think you should be.
>> Ran Gilad-Bachrach: You should be. You can kind of reuse it, with the Delta term in
the confidence divided by two, and you should be fine.
>>: You have T. You can build an ensemble over T. Is it true that this ensemble will
always be better than your median?
>> Ran Gilad-Bachrach: I don't know. It needs to be validated. But again, this ensemble,
in terms of where I really want to use it, is much heavier than having just a single
hypothesis, right?
>>: Is N times…
>> Ran Gilad-Bachrach: N times, but at runtime. Here I use it only in training.
>>: I understand. But is it true that the best you can do, the best that you can hope to do
is as good as the ensemble on the [inaudible] because you are using, you are
hypothesizing…
>> Ran Gilad-Bachrach: It makes sense, but can I prove it? I don't know if I can…
>>: The ensemble you can just take it all over zero but the median.
>> Ran Gilad-Bachrach: Sorry?
>>: You can have the ensemble all the way to zero but 1.
>>: But here is the proof. So the proof is very easy. You have Q prime, which is not the
posterior but the empirical posterior, just as the posterior is defined by this. Then the
ensemble on T is the Bayes optimum on that empirical posterior.
>> Ran Gilad-Bachrach: That is true, but then you have to prove that
this empirical posterior is better than the…
>>: [inaudible] average, right? [inaudible].
>>: [inaudible].
>> Ran Gilad-Bachrach: On that Q prime, but we are asking on Q, right?
>>: Right. So I think the answer is that the upper bound for this sample is
better than the best upper [inaudible] that you can get, but we don't know about lower
bounds.
>>: So I was just going to say that the case of the Bayes point machine is the classic
one, and you were referring to this earlier, where in fact the actual [inaudible] is
an approximation, because the actual classifier that you use may not live within the
function [inaudible]. It could be that the best-known ensemble out of T does pretty well,
but if you happen to have a Q prime that you are using and [inaudible] against it that had
that [inaudible] function, you could do better than what was actually in the ensemble.
>> Ran Gilad-Bachrach: If you can use an ensemble, yep, and if you can use the Bayes
classifier, that would probably be the best thing to use, but if you can't…
>>: You know what this reminds me of? Rich Caruana and his whatchamacallit
[inaudible], the thing where he made this gigantic loaded mega-classifier, the mega
ensemble; is Rich here?
>>: There's Rich.
>>: He called it [inaudible]. The giant enormous classifier out of everything and the
kitchen sink.
>>: The ensemble collection.
>>: The ensemble collection, so that is sort of your F, and, sort of going back to
Chris [inaudible]'s question, what is this? Is this like a parameter space? No. This is a
really very weird space where you are saying that your belief is, I don't know, I am
going to try [inaudible] trees and all sorts of things. And that is sort of what your giant
belief domain is. And instead of what you are doing, where you are sort of, I think,
boosting or averaging together, he is saying you could train the little classifier using his
method maybe, and that is sort of a guidance on how he would train a little classifier…
>>: The problem is you have to find one guy who agrees with all of the other guys.
>> Ran Gilad-Bachrach: But this is usually easier. If you think of the task, usually
we say we want to find a classifier that minimizes the number of errors and stuff
like that, and this is hard, right? But what I need is an algorithm--this is what I call
a consistent learning algorithm--an algorithm that says either I found a hypothesis that
agrees with everything, or I failed.
>>: [inaudible] see something because you have zero training…
>>: When you are building ensembles, you are usually building ensembles that are
[inaudible] simply Bayes hypothesis.
>>: Everything, every imaginable algorithm was in the ensemble.
>>: The ensemble is much more expressive than any one of the individual [inaudible].
>>: Right.
>>: And here Rani is saying I want to find one guy who tends to agree, let's say uses all
these things…
[multiple speakers]. [inaudible].
>> Ran Gilad-Bachrach: To simplify the notation, I assume that the functions here are
from the same class from which I am trying to select the final hypothesis, but actually
you don't have to make that assumption. It is just that otherwise I have to hold two
function classes and all of the discussion becomes cumbersome. But it could be that the
ensemble is made of some stumps and then you try to build a tree.
>>: [inaudible] I think you are doing is picking and…
>>: So we all have to… What?
>>: You're missing--if you take 1 million things and you build an ensemble over them,
and you are trying to find one thing which tends to agree with that ensemble, I agree
that Rani is saying that it could be like a big huge tree, but it is going to be a big, huge
tree. So you are going to pay again with computational complexity, because if you want
a very expressive function, you're going to have to make it complicated.
>>: That was the beautiful thing that Rich had done, because he [inaudible] trained all of
this insanely complicated ensemble. I mean they were just enormous, I don't know, 100
times larger than you would ever want to use, but you would in fact label a very large
training set. That is sort of different from here, where you would label a very large
unlabeled set and then you could get a much larger data set. Here he is saying, well,
maybe what you do is you just define the largest training set that you can get zero
training error on that matches the labels of the giant mega ensemble, so it is just a
different way of creating Rich's giant data ensemble.
>>: But you've got to understand--I think you're missing a very good [inaudible], but
let's take it off-line. There is a problem with what you're saying. You can't eliminate the
bias-variance [inaudible]. You can't get something for nothing.
>> Ran Gilad-Bachrach: I think I agree with you. If you don't have any limitations on
runtime, then yeah, use your more sophisticated ensemble. Use the Bayes optimal, right?
I agree with you. But if you have limitations on runtime and you want to moderate the
size of your hypothesis because you have this limitation, this is how I propose to do
that.
>>: That is exactly the motivation for ensemble selection, so that…
>> Ran Gilad-Bachrach: During the training process you can build this
huge ensemble. And this is actually here; this is the ensemble. But then, still in the
training process, you have to prune it. So you can look at it--I think this is what John is
suggesting--you can view all of this as a pruning process.
>>: I guess the motivation is a little bit different, because I think… Like, my perspective
on, well, maybe it's just rattling back, Rich's work was like, I agree with you
philosophically, it feels very similar. But he was kind of like: with this small set, on this
small set I can't train a really simple thing that generalizes well, but a really complex
algorithm can do a good job, so I pull a ton of them and then I train, then I
[inaudible] deal with that to effectively expand the size of my [inaudible] data, and then
I train the super classifier on it. But here it doesn't seem to be so much, at least from
the software perspective, that I have too little data to work
from. It is more, well, maybe not so different. But it feels like you are trying to do
something about generalization; well, maybe it comes down to the same thing.
>> Ran Gilad-Bachrach: Basically I separated the tasks; the training data is just out of
the picture over here. It is just out of the picture. And the same thing in what Rich did:
once you train these models and use them to label, you didn't add new
information. You didn't add new information to this project. This is just a method by
which you actually select your final hypothesis. So instead of just
gluing them together, I said okay, do whatever you want to do in the first part; here is
what to do in the second part. And I am just analyzing the second part. Do whatever
you want to do in the first part.
>>: So I am just trying to understand your algorithm a little better. How does it prevent
situations like the one on the board that she just drew, where [inaudible] set of sample
functions none of which are particularly deep?
>> Ran Gilad-Bachrach: The function that I'm going to find eventually does not have to
be one of these N functions. It is not going to be one of F1 to FN. It could be any
function in my class. So I have this algorithm A, which I am going to feed with
samples, and it just either tells me no, I can't--I am going to feed it with labeled samples.
>>: Labeled by the…
>> Ran Gilad-Bachrach: They are going to be labeled by this ensemble. And I am going
to feed it labeled samples, and this algorithm is either going to tell me that it can't find
any hypothesis which is consistent with the sample, or it will say here is a hypothesis
which is consistent with this sample. And this hypothesis doesn't have to be from F1
to FN. It could be anything. And actually--I didn't want to talk about it, but
it doesn't have to be in the same class.
The algorithm: first I compute, for every point, the proportion of
classifiers in my ensemble that label it +1, and this is Pi plus. Then I
compute the size of the minority group on each point and sort the points such that the
first point has the largest minority group, and the sizes of the minority groups are
decreasing. I compute the label of the majority vote, the label that I
really want to have, and start iterating. The first thing is, I just put in all of the points
with the labels of the majority vote and see if I can find a classifier that classifies
all of them correctly. So I will call the algorithm and see if it can find a consistent
hypothesis. If it does not, I will remove the first point. The first point is the one with the
largest minority, so I will remove it, and again send it to the algorithm, and I will keep
going, keep going, until eventually the algorithm returns a function.
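Put together as a sketch (my reconstruction; consistent_learner stands for the assumed
oracle that returns a hypothesis agreeing with every given label, or None if it fails):

    def approximate_median(xs, gs, consistent_learner):
        n = len(gs)
        stats = []
        for x in xs:
            # pi_plus: fraction of the sampled functions labeling x as +1.
            pi_plus = sum(1 for g in gs if g(x) == 1) / n
            majority = 1 if pi_plus >= 0.5 else -1
            minority = min(pi_plus, 1.0 - pi_plus)
            stats.append((minority, x, majority))
        # Largest minority group (smallest voting margin) first, so that it
        # is the first point to be eliminated.
        stats.sort(key=lambda t: -t[0])
        # Drop points one at a time until a consistent hypothesis exists; a
        # binary search over the prefix length would also work. The loop
        # terminates: with one point left, some hypothesis matches it.
        for i in range(len(stats) + 1):
            labeled = [(x, y) for _, x, y in stats[i:]]
            f = consistent_learner(labeled)
            if f is not None:
                if i == 0:
                    # f matches every majority vote; its worst agreement is
                    # at the most contested point.
                    depth = (1.0 - stats[0][0]) if stats else 1.0
                else:
                    # Per the talk: the minority size at the last deleted point.
                    depth = stats[i - 1][0]
                return f, depth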
>>: Why not do a binary search? Why do a linear scan? It seems like you're asking for a
lot.
>> Ran Gilad-Bachrach: You can do that.
>>: Okay, it just seems like that is log n versus n in the number of training runs that you
have to…
>> Ran Gilad-Bachrach: Actually, yeah. But it doesn't matter.
>>: But you are saying knock off half, knock off half the [inaudible].
>>: Yeah, because otherwise it is [inaudible] neural network or whatever [inaudible].
>> Ran Gilad-Bachrach: You're right.
>>: Does it guarantee that it's true?
>>: Is it…
>>: I mean couldn't it be the case that it's not [inaudible]?
>> Ran Gilad-Bachrach: No. And eventually, once the algorithm returns a function
which is consistent, I can just compute its depth using the formula here, and we can
prove the properties of this algorithm. First of all, it will always terminate, even if you
keep deleting points all the way down. The reason is very trivial: at the last step you
give it a single point with the label that the majority gave this point, so there must be
[inaudible] hypothesis that agrees with that. And we can show that the function that this
algorithm returns is actually the maximizer of the empirical depth, and the depth that it
computes is actually the correct depth for this function. And again, trying to clear the
clouds around this formula, the important thing is the following: the relaxed depth of the
returned function is at least as big as the depth of the median--this is a supremum
[inaudible] of all of the depths--the depth of the median point, minus 2 epsilon. So it
actually does approximate the median.
So before I conclude, just a few small notes. One: we discussed in the beginning that
the motivation for the definition of the depth was the Tukey depth and the Tukey
median. So now we can go back to that and say that the algorithms that we
described, if we use them on this very specific case, approximate the
Tukey depth and the Tukey median, and they are polynomial algorithms, and this
is actually a new result. All of the previous results for approximating the Tukey depth
and Tukey median are exponential in terms of the dimension, and this is polynomial in
the dimension, but I must say that usually the approximation they consider is different
from the type of approximation that I am considering here. But nevertheless, this is
a new result.
I am not going to go over the details here, but we can also discuss the geometric
properties of these depth functions, whether they are convex in some sense, but I'm
going to skip that. But this is the big but. As you noticed, I didn't show any empirical
study here, which raises the question: why don't you have any? If it's good, show us that
it's good. And the problem is that I made the assumption that I can sample functions
from my belief. If we look at the algorithm that approximates the median, we
needed three things as inputs. We needed a sample of unlabeled instances, which is
usually easy to get. We needed a consistent algorithm, which in many cases is easy to
come up with. But what is hard to come up with is a sample of functions from my
belief. This is something that for interesting belief distributions is not so easy to achieve.
So we can use heuristics; you can think about bagging and random forests and
everything like that--this is actually what they do. We can also show that if I can't
sample from the true distribution but I can sample from something that is close to it, that
is fine; we can correct for those mistakes. But the reason why I haven't
yet done the empirical studies is that once you start using the heuristics, if it works,
great. If it doesn't work, then is it because this whole method is broken, or is it because
the heuristic did not deliver what you expected it to deliver? You won't be able to say.
So this is why I preferred first to, you know, complete the theory and know exactly what
I wanted to achieve, and then separately try different heuristics and see how they
perform with respect to this method.
>>: Is there any way to measure, [inaudible] any way to measure how good a heuristic
is with respect to a fair sample [inaudible]?
>>: Sure, if its Bayes error is low. I mean, not the lowest one, but you can compare
heuristics, but compare…
>>: You can compare heuristics, but can you tell if your…
>> Ran Gilad-Bachrach: But you can't compare it. You want to compare it, again, you
want to compare this Bayes error with--if its Bayes error is low, then you are good. But
if it is not low, is it because your heuristic is broken or because your posterior is not
very decisive or not very good? It is hard to isolate these two things, and this is why I
want to separate the theory from the empirical part: because there is an additional
unknown over here.
>>: I'll go back to the hard-Q comment. So it seems like in a lot of cases [inaudible]
Monte Carlo is a sample from the Q?
>> Ran Gilad-Bachrach: Approximate. So even if…
>>: So the question is whether or not it converged, but I think…
>> Ran Gilad-Bachrach: For example, I think that even in the simple case
where you have a uniform distribution over a convex [inaudible], yes, there are
polynomial algorithms for sampling from that, and they have improved
the complexity; the first algorithm was something like, with a
complexity like D to the power of 23, and they managed to reduce it all the way now to
three or four, but the constant there is something like two to the power of 100.
>>: But even [inaudible] finding the mean instead of the posterior, and a lot of
approximation algorithms [inaudible] actually give you a very good approximation of Q.
[inaudible] find the family…
>> Ran Gilad-Bachrach: Again, there are a lot of heuristics for how to do that, and there
is evidence that in many cases they work fine, but the only result that I know of that
actually samples from the distribution, at least in some cases, has this complexity.
So again, once you start…
>>: There are certain other cases where you know that the Markov chain is mixing and
you're going to guess [inaudible].
>>: You are making a very strong assumption that you have a sample from [inaudible]
only sampling continuous values or [inaudible] but that is [inaudible] specific…
>> Ran Gilad-Bachrach: That's true, but then you say okay, let's sample only from Qs
from which I know how to sample. So maybe these Qs--imagine now that I am going to
apply this to a certain problem. First of all, before we dive into this discussion, the
first thing that I want to do is really go ahead and evaluate this. This is obvious. But
there is a reason why I wanted to isolate it from this theoretical work. Once
I want to do that, I have to either restrict the type of Qs, you know, the posteriors that I
am allowed to use, to only Qs I know how to sample from, or say I am not going to
restrict the Qs, but I am going to use some heuristic to sample from this distribution.
Anyway, if it works fine, that is great. But if it doesn't work, what does it mean? Does it
mean that the whole method is broken, or does it mean that I restricted my family of
distributions in a way that these distributions do not work well for this problem?
>>: By definition, since your method has to sample from Q, it is broken if it doesn't
work.
>> Ran Gilad-Bachrach: Yes, but then you can say okay, there is another item
now: how to sample from Q. So I went that far, and obviously, you know, there is still a
way to go. And I am very clear about it. If I just evaluate now and it is
working, we are happy. But if it doesn't work, what does it mean? That this whole line
is broken, or that we need to improve our methods for sampling from Q?
>>: Probably the latter.
>>: Do you have any ideas on what types of Qs you are interested in sampling from?
>> Ran Gilad-Bachrach: So for example, [inaudible].
>>: Exactly, [inaudible]. The kind of form, the function that you have. They have
done work on trying to approximate the posterior [inaudible] like in the Bayesian
[inaudible] community, but even the SVM loss function [inaudible] loss and…
>> Ran Gilad-Bachrach: As far as I know, and I will be happy to learn that I am wrong,
none of them has provable guarantees on the difference between the true
distribution from which you want to sample and the distribution that you actually
sample from. All of them are well justified and seem to be working, but
apart from this result from [inaudible] and one or two others,
none of them is really provable.
>>: No, no. Here is the problem. So you talk about sample selection that comes closer
to a solution. And you were saying that your method [inaudible] median is a good one.
And then I have [inaudible] to evaluate the mode, I have a method to evaluate the mean,
and in these cases people have shown that the mean might be better than the mode,
right? And I cannot pinpoint where this [inaudible] lies in that space, unless there is
some empirical evaluation. And hence my question: what would it require for you to
show a comparison with those [inaudible]?
>> Ran Gilad-Bachrach: So definitely I am going to evaluate that, and definitely I am
going to try these heuristic methods, either sampling, with the inaccuracy of the
sampling process, and seeing if it works, or otherwise restricting the family of Qs to
families from which I know how to sample. But I wanted to separate that from this
work, because once I do that I add another unknown. The evaluation would be
problematic.
>>: Here's another way of looking at it. Maybe the sampling, or the sampled posterior,
I say is really my posterior, because I can adjust my prior. It is whatever I say. So I can
adjust my prior; maybe there are certain [inaudible]s that I don't know, because that is
my prior. It's a crazy prior, but how can you argue with me? It's my prior, and the prior
happens to match what the posterior has to sample.
>> Ran Gilad-Bachrach: But again, you go off to design an experiment, right? And you
think, okay, I am going to design an experiment; let's see what the possible
outcomes are. If the outcome is that it doesn't generalize well, is it because of your
sampling technique, or is it because of the algorithm that I described here?
>>: But [inaudible] I don't think this will [inaudible] maybe they are just inseparable. I
think it doesn't make any sense [inaudible].
>> Ran Gilad-Bachrach: It would have been inseparable if I simply knew how to sample
from a general Q. So, for example, the energy function of SVM: if I could sample from
this energy function, then I could compare the result with the result of SVM, and then I
would know exactly that this is…
>>: Yeah, that is one choice of prior.
>> Ran Gilad-Bachrach: Yes. This is one choice of prior, but an important one.
>>: Yeah. [inaudible].
>> Ran Gilad-Bachrach: That's true. But then again, what is the broken component
here? Is it that the prior is broken or is it that the algorithm is broken? So there is a
problem in running this type of experiment. It is an inherent problem in this kind of
experiment: you cannot control for all of the free variables.
>>: That is true.
>> Ran Gilad-Bachrach: So this is why I preferred to go all the way to establish the
theory and know exactly what I want to achieve, analyze it, and then once I do that,
yeah, I am going to run the experiment, but I know that my ability to evaluate the
results of these experiments is going to be limited.
>>: [inaudible] you are saying that the depth of the Bayes classifier is half, and then
your bounds have one over the depth, so you're going to pay a factor of two in your
error compared to the Gibbs sampling no matter what. So…
>> Ran Gilad-Bachrach: Remember when [inaudible] over here said we get this
annoying factor of two when analyzing the Bayes classifier. We tried to get rid of it and
didn't manage to do that.
>>: No [inaudible] there's a paper where the two goes to almost epsilon in some settings
but…
>> Ran Gilad-Bachrach: I don't know. At least here, when the studies were going on,
they said that we have this annoying factor of two, and we tried to get rid of it and we
couldn't, and…
>>: [inaudible] years ago [inaudible].
>> Ran Gilad-Bachrach: You can see cases in which you can get better than two, for
example with the relaxed depth. So we can ask, why is it two? It's two if you have many
points on which the majority vote is very marginal. But if, for most of the
points, the majority is very decisive, this bound will actually show you that you can get
better than a factor of two; the two is for the worst-case classifier. It could be
better. But even in their analysis, actually what they show is that the best bound they get
is for the expected Gibbs classifier, and then they say okay, the Bayes classifier cannot
be much worse than that. And this is the same thing that we showed here.
>>: I will show you the result where this two [inaudible]. The other thing I wanted to
say is that the trendy thing in learning theory now is to talk about the importance of
strong convexity [inaudible], and you didn't mention this at all. I am really [inaudible]
required was the [inaudible] property of the posterior. Now, what happens, I don't know
what it would be called, when you have a posterior which is defined as the normalized
exp of a strongly convex function rather than just a convex function, which would be
achieved by a strongly convex regularizer [inaudible]. And I think what you get is that
you have this guarantee that says that there exists a function which is at least at depth
[inaudible] over e.
>> Ran Gilad-Bachrach: [inaudible] I don't need the strong convex…
>>: Right, but I'm saying I think that the 1 over e is for the general case. If you
have a ball, you actually have half. And I think strong convexity [inaudible] interpolates
between the two, right? This result is just for a general convex body: through the center
of mass you can cut it such that 1 over e is on one side. So I am
assuming that that is like this 1 over e. If it was a ball, you could actually cut it in half.
And then if you had the strongly convex [inaudible], like an L2 regularizer--think of
the loss function as just an L2 regularizer, no empirical loss, so the weight of the
empirical loss is zero; you are just quadratic--then you have the ball. Then you would
have half.
>> Ran Gilad-Bachrach: Interesting direction. It is not something that I can work out
off the top of my head. Actually, you can also show that if a function is not log-concave,
there is a larger family of these distributions, called rho-concave functions, and for
each one of them you can get a bound which is a function of rho. You can refine the
results, I'm sure.
>>: A question about the energy function that you were showing, regularizing the
function: it doesn't always lead to [inaudible] solutions [inaudible], so if you just
exponentiate the energy [inaudible]?
>> Ran Gilad-Bachrach: Oh, no you always have…
>>: [inaudible] for SVM, for [inaudible], it does not.
>> Ran Gilad-Bachrach: So you always have the partition function.
>>: Yeah, but I think that some people [inaudible]. So consider the case where you
have, the SVM case, then you regularize with a [inaudible] loss function.
>> Ran Gilad-Bachrach: So you are saying that the integral might go to infinity and you
cannot--could be, but then if you bound it, if you bound everything…
>>: So if you can approximate it somehow?
>> Ran Gilad-Bachrach: Thank you very much.
[applause].