>> Ofer Dekel: Hi, it is our pleasure today to have Mario Marchand from University of
Laval in Québec City and he has come here to talk to us about PAC-Bayesian machine
learning, which is learning by optimizing a performance guarantee. So, Mario, thank
you.
>> Mario Marchand: Thank you very much. I would like to thank Ofer Dekel and
Microsoft Research for this kind invitation. I am having a terrific time. I think that the
people here are very interesting and very kind, so it is a terrific place to be and I am just
very happy and very honored to be here.
So this is the title, and most of the research that we are currently doing at Laval University is in our machine learning group, which in French is the group [inaudible] de Laval, which happens to be the same name as the Holy Grail, which is a nice coincidence. [laughter] So this is our group and basically we are two faculty members, me and François Laviolette who is also another professor, and these are the grad students, some of the grad students that actually participated in the results that I will show you today, but we also have other graduate students. Basically the title summarizes our main line of research: whenever we try to design a learning algorithm, we first try to find a good performance guarantee and then propose the learning algorithm that will optimize that performance guarantee.
Before outlining the talk, I will just present a few slides to introduce the notation and the basic supervised learning setting. So in the supervised learning setting we have a
training set of M examples. So each example is just an input output pair, so this is the
input and the output associated to input X. So the input space is usually arbitrary and the
output space will basically define the kind of supervised learning problem that we have
so if the output space is -1 or +1 this is just binary classification, the real set for
regression or a real interval for regression, and an arbitrary structured set for the
structured output prediction.
So all of the theoretical, well basically the only assumption behind the theoretical results
is this assumption here that each example is drawn independently according to the same
distribution on the input and output space. That is what we assume, that the training
sample, each element of the training sample was drawn according to this distribution. So
the goal of the learner, or the learning algorithm, is to find a predictor h having minimal risk; this is the goal of learning, it is always that goal. And the risk of the predictor is defined as the expected loss, where the expectation is performed according to the real data-generating distribution, and the loss is expressed in terms of a loss function that measures the loss incurred for predicting h of x when the actual output is y.
Of course, the problem is that the learner does not know the distribution D, because the objective of the learner is defined with respect to D, not with respect to the training sample. So we don't know D; we only have access to a training sample drawn according to D, and so the central question in supervised learning is: what should be optimized on the training sample to obtain the predictor with the minimal possible risk? This is the main question in supervised learning, and the usual way to--well, we basically avoid the question most of the time. We will say, well, we don't exactly know what should be optimized on the training data, so let us just try to guess what we should optimize, and we basically just propose a regularized risk functional, so that would typically be like that. It is not always like that, but you always have an empirical risk of the predictor, which is just the empirical loss measured on the training set of examples. And you have some complexity term, a penalty term, which depends on some complexity measure of your predictor. And lambda is some positive real number whose value we don't know, so we normally choose it by cross validation, so basically cross validation on our data.
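(A hedged reconstruction of the regularized risk functional being described; the symbols m, ℓ, C and λ are inferred from the audio, not read off the slide:)

```latex
R_S(h)=\frac{1}{m}\sum_{i=1}^{m}\ell\big(h(x_i),y_i\big),
\qquad
\hat{h}=\operatorname*{arg\,min}_{h\in\mathcal{H}}\;\Big[\,R_S(h)+\lambda\,C(h)\,\Big]
```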
Perhaps we can do better, because this is the normal way of doing things: we propose something, we find that it works empirically, then some theoretical person proposes some bounds and some justification for why it is not a bad idea to do that. But can we do better than this? If we cannot find the predictor with the best possible risk, let's try to find the predictor with the best risk guarantee, a risk guarantee which can be computed on the training data. Let's try to do that instead. The guarantees that we have are called risk bounds; these are the theoretically accepted guarantees that we have.
And it is called a risk bound. A risk bound is just basically a function of the predictor, a
training set and some confidence parameter, normally it just weakly depends on the
confidence parameter. And the property of the risk bound is that, with high probability with respect to random draws of the training set, simultaneously for all predictors, the risk is upper bounded by the bound, and the bound is really the random variable here, which depends on the predictor and the training set.
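(A hedged reconstruction of the risk-bound property just described, writing B for the bound and δ for the confidence parameter:)

```latex
\Pr_{S\sim D^m}\Big(\forall\,h\in\mathcal{H}:\;R(h)\le B(h,S,\delta)\Big)\;\ge\;1-\delta
```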
The optimization problem for supervised learning is then to find the predictor minimizing
the bound. This will be the predictor having the best guarantee which can be computed
on the training set, so this is basically our direction of research. So this makes sense of
course if you have a good guarantee, a tight bound. If the bound is not tight, well perhaps
it won't give good results empirically.
So we approach supervised learning problems by first trying to derive a risk bound for the particular learning setting, which can be computed for any pair of predictor and training sample. And then, once we have the bound, we design a learning algorithm for minimizing it given the training set S; basically, it will return the predictor with the best possible guarantee. So for step one we use PAC-Bayes theory.
Why do we use PAC-Bayes theory, which was initiated quite recently by McAllester at one of the COLT conferences in 1999? First, it can give very tight risk bounds. It is not because you are doing PAC-Bayes theory that you are going to find tight risk bounds, but you can really obtain tight risk bounds if you put in the effort.
Also, it is about distributions, basically distributions over a class of predictors, so it gives guarantees to ensemble methods such as boosting, SVM, random forests and kernel methods, many state-of-the-art learning algorithms basically. It gives guarantees to methods like that. And perhaps the most important point, which I have not written here, is that it is quite easy to master. PAC-Bayes theory is very simple; it is much simpler than VC dimension and even Rademacher complexity. It is really something that a graduate student can understand fully, so this is certainly one important point.
So this is the outline of my talk. Since it is easy to master, I think it is a good occasion to explain to you the elements of PAC-Bayes theory, so you will leave the talk, I think, with the main ideas behind it, and if you work a bit more you can just master the technique, and then we are going to apply that to some particular cases.
The first case is going to be Gaussian distributions over linear classifiers, so this is a kind of posterior distribution over a very important class of predictors, linear classifiers. We're also going to talk about PAC-Bayes sample compression. This is recent work; we presented it at ICML in 2011, so very recently. The idea here is basically that we are going to use the PAC-Bayes approach on a set of classifiers which are described by the training data, by a subset of the training data. And finally, perhaps I will just outline some work that we are currently doing right now.
Elements of PAC-Bayes theory: here we restrict ourselves to the simple classification case, so we consider the 0-1 loss. The true risk of a predictor and its empirical estimator on the training set S are just defined respectively as follows; this is the indicator function, so it is one if the predicate is true and zero otherwise. So it is one when you are making a mistake, and so this is just the probability of making a mistake when you draw an example from the data-generating distribution, and this is just the frequency of errors on the training set. The learner's goal, then--the output of the learning algorithm--will be to produce a posterior distribution Q that depends on the performance of the individual classifiers. We are going to search for a posterior distribution on the space of classifiers such that the risk of the Q-weighted majority vote, which I call B_Q--because sometimes it is also called the Bayes classifier, but here let's call it the majority vote or the Q-weighted majority vote--is as small as possible. That is our aim.
PAC-Bayes bounds in fact bound the risk of the majority vote indirectly, because they bound the risk of another classifier, which is a stochastic classifier: the Gibbs classifier. The majority vote is deterministic; if you present the same example twice it will output the same prediction. But the Gibbs classifier is something different. It is stochastic: to predict the label of an input example x, the Gibbs classifier draws a classifier h from H according to the distribution Q and then predicts the label of x according to h(x). And so if you present x twice it might predict different labels, because it is stochastic; the draws might be different. The risk and the training error of the Gibbs classifier are just given as follows: it is just the expected risk, where the expectation is taken with respect to draws according to Q. So this is the Gibbs risk, and this is the Gibbs empirical risk, which is just the expectation of the empirical risk of the individual predictors.
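(For reference, a hedged reconstruction of the quantities just described, in fairly standard PAC-Bayes notation; the exact slide symbols are assumptions:)

```latex
R(h)=\Pr_{(x,y)\sim D}\big[h(x)\neq y\big],\qquad
R_S(h)=\frac{1}{m}\sum_{i=1}^{m}I\big(h(x_i)\neq y_i\big),\qquad
B_Q(x)=\operatorname{sgn}\Big(\mathop{\mathbb{E}}_{h\sim Q}h(x)\Big),

R(G_Q)=\mathop{\mathbb{E}}_{h\sim Q}R(h),\qquad
R_S(G_Q)=\mathop{\mathbb{E}}_{h\sim Q}R_S(h)
```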
What is the relation between the Gibbs risk and the risk of the majority vote? Well, you can bound the risk of the majority vote by this equation; you can easily prove that the risk of the majority vote is at most twice the risk of the Gibbs classifier. So if we provide a bound for the risk of the Gibbs classifier, we will automatically have a risk bound for the majority vote classifier, by the factor of two rule. So we might lose a bit here because of this factor of two rule.
The way to understand why we have this relation is basically to consider a fixed example (x, y), and consider the two cases: the case where the majority vote makes an error on this example and the case where it does not. Suppose first that it makes an error on this example. If it makes an error, it means that at least half of the classifiers under measure Q make an error on it--it could be 1/2 + epsilon of the classifiers that make an error on it. This means that twice the error of the Gibbs classifier on that example is greater than or equal to one; since the risk of the majority vote on that example is one, and twice the Gibbs error is larger than or equal to one, you have this inequality. And this inequality obviously also holds in the case where the majority vote makes no error, because the Gibbs classifier will in general make some error on it. So in general you have this. I have demonstrated this inequality for a particular example; just take the expectation on both sides and you have this inequality, because it is with respect to the same distributions.
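(A hedged sketch of the argument: if the majority vote errs on (x, y), at least half of the Q-mass errs on it, so per example and then in expectation over D:)

```latex
I\big(B_Q(x)\neq y\big)\;\le\;2\mathop{\mathbb{E}}_{h\sim Q}I\big(h(x)\neq y\big)
\quad\Longrightarrow\quad
R(B_Q)\;\le\;2\,R(G_Q)
```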
So basically we have this factor of two rule, and this is how PAC-Bayes works most of the time: we provide a risk bound for the Gibbs classifier, and then by the factor of two rule this gives a bound on the majority vote classifier. In PAC-Bayes theory, or in the PAC-Bayes setting, I would say the task of the learner is to produce a posterior distribution. We call it a posterior distribution because the bound will also depend on another distribution which must be defined before observing the data, a prior distribution P on your set of classifiers, and basically the PAC-Bayes risk bound will depend on the Kullback-Leibler divergence between the posterior and the prior. This is just the standard definition of the Kullback-Leibler divergence. Notice that in order for it to be finite, the support of the prior must include the support of the posterior. Otherwise it would diverge where P is equal to zero, so this means that the support of Q must be contained within the support of P, the prior.
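(For reference, the standard definition being referred to:)

```latex
\mathrm{KL}(Q\,\|\,P)=\mathop{\mathbb{E}}_{h\sim Q}\ln\frac{Q(h)}{P(h)}
```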
Most PAC-Bayes bounds are also expressed in terms of the Kullback-Leibler divergence between two Bernoulli distributions, one with probability of success q and one with probability of success p, so we will also make use of that function. So this is our main theorem, which we obtained a couple of years ago. We presented it, I think, at ICML 2009. I think it is quite nice because it is quite general: you can obtain all known PAC-Bayes bounds for classification out of this one,
and it is very, very simple to prove. It is stated as follows: for any distribution, for any set of classifiers, for any prior P with support H, and for any--you see here you have this distance between two quantities, two real numbers, which are the Gibbs risk and the Gibbs empirical risk--the bound holds for any distance measure, any real-valued function D; it can even be negative, that is not a problem. So for any D, basically, this bound holds with probability at least one minus delta simultaneously for all posteriors: this distance between the true risk and the empirical risk is bounded by that. I have already mentioned what the Kullback-Leibler divergence is; now you see the bound also depends on some double Laplace transform, a double expectation with respect to the prior and with respect to the training set.
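(A hedged reconstruction of the theorem from this verbal description; it follows the general PAC-Bayes theorem of Germain, Lacasse, Laviolette and Marchand, ICML 2009, with D the chosen distance function, taken convex in the final step of the proof below. With probability at least 1 − δ over the draw of S, simultaneously for all posteriors Q:)

```latex
D\big(R_S(G_Q),R(G_Q)\big)\;\le\;
\frac{1}{m}\left[\mathrm{KL}(Q\,\|\,P)+
\ln\!\left(\frac{1}{\delta}\mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{h\sim P}
e^{\,m\,D(R_S(h),R(h))}\right)\right]
```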
>>: [inaudible] less than or equal to [inaudible].
>> Mario Marchand: Pardon me?
>>: Greater than or equal to [inaudible]…
>> Mario Marchand: It is less or equal to that.
>>: I see. With probability more than one.
>> Mario Marchand: With probability, not more than one, but at least one minus delta.
>>: [inaudible].
>> Mario Marchand: So of course, to obtain a bound this means that you need to upper bound this double Laplace transform. This is the sort of thing that we can do easily in many cases: once you are given a D, you can find an upper bound for that, and then you automatically get a bound from this theorem. So let me go through the steps of the proof, so that when I am saying that it is easy, you can say, yes, it is really easy, and this is the art of PAC-Bayes theory. Basically the idea is to consider this random variable. It is a random variable of S, and it is a nonnegative random variable because it is the exponential of something, so you can use Markov's inequality for it. So it is a random variable that is a function of S; the probability that this random variable exceeds t times its expectation is at most one over t. So I just replace t with one over delta and there you have it.
This is the basic, let's say, inequality that is used by PAC-Bayes. It is only Markov's inequality, so you would say this is very bad. In fact nobody has come up with a better bound for this random variable; I have tried to go to [inaudible] and it does not buy me anything, I get basically the same bound. So we are using Markov. This means that with probability one minus delta I get the negation of that, which is that this random variable is at most one over delta times its expectation, and I take the log on both sides. So it holds with probability one minus delta, and the next step--remember the bound holds uniformly for all posteriors--will be to convert this expectation over the prior into an expectation over the posterior. The standard trick is as follows: f of h is just this function, this function here. The first thing is to convert this expectation over P into an expectation over Q. You would say, well, the expectation over P is equal to the expectation over Q of P over Q times this, but in fact it is greater than or equal: to get P inside the expectation over Q, I just introduce Q and I divide, and they cancel, but we have to be careful where Q becomes zero, so this holds on the support of Q, and on the rest of the support of P the integrand is nonnegative. So this term is greater than or equal to this term, and basically the expectation over P can be lower bounded by an expectation over Q in this way, which is exactly what we need here: we have something greater than or equal to the expectation over Q, so it is on the right side.
The next thing is that we have a log of an expectation, which we are going to convert into the expectation of a log using Jensen's inequality. So the first step was that the expectation over P is greater than or equal to the expectation over Q of this; and then the log of the expectation is greater than or equal to the expectation of the log, because of the concavity of the log. So we have one Jensen inequality here. The log of a product is the sum of the logs, and basically here you have minus the KL divergence, because it is P over Q and the KL is Q over P, so minus the KL plus the log of this. I remind you that f of h is this complicated expression here, the exponential, so we applied this result to the exponential of m times D; this is just the application of the formula. And we are almost done, because the log of the exponential is just the argument, so you see that basically we find that the expectation over Q of m times D is less than or equal to the KL plus this term, which I identified in the theorem. But we are stuck with the expectation of D, and by using Jensen's inequality again--we assume that D is convex--the expectation of D is greater than or equal to D of the expectations.
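(A hedged reconstruction of the proof chain just sketched, with f(h) = m D(R_S(h), R(h)); the first inequality holds with probability at least 1 − δ over S by Markov's inequality, the second is the change of measure, the third is Jensen on the concave log, and the last line is Jensen on the convex D:)

```latex
\ln\!\left(\frac{1}{\delta}\mathop{\mathbb{E}}_{S}\mathop{\mathbb{E}}_{h\sim P}e^{f(h)}\right)
\;\ge\;
\ln\mathop{\mathbb{E}}_{h\sim P}e^{f(h)}
\;\ge\;
\ln\mathop{\mathbb{E}}_{h\sim Q}\frac{P(h)}{Q(h)}e^{f(h)}
\;\ge\;
-\mathrm{KL}(Q\,\|\,P)+\mathop{\mathbb{E}}_{h\sim Q}f(h),

\mathop{\mathbb{E}}_{h\sim Q}f(h)=m\mathop{\mathbb{E}}_{h\sim Q}D\big(R_S(h),R(h)\big)
\;\ge\;m\,D\big(R_S(G_Q),R(G_Q)\big)
```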
So you basically have the theorem. It is not a long proof; you see every step is there, and you have this bound, but now you need to bound this Laplace transform in order to get a bound. That is PAC-Bayes theory: Markov's inequality and two applications of Jensen's inequality, that's it. And the amazing thing is that it is simple, but it can be very, very tight.
So if for D you use the KL divergence between two Bernoulli distributions, if you use that as the distance, you can perform this expectation here. You need to swap the expectations--this is why P must be independent of S--so you swap the expectations, because otherwise it is difficult to perform the expectation over P before the expectation over S, and then you condition this expectation on the number of errors; we know that the number of errors follows a binomial distribution. I am skipping the details. We find that this quantity is in big Theta of the square root of m. So basically you have a bound; you obtain the same bound that was found by Langford and Seeger, but a bit tighter because of this square-root-of-m dependence here.
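(A hedged reconstruction of the resulting bound; kl denotes the KL divergence between two Bernoulli distributions, and the exact Θ(√m) term inside the log varies slightly between versions, so ξ(m) is a stand-in. With probability at least 1 − δ, simultaneously for all Q:)

```latex
\mathrm{kl}(q\,\|\,p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p},
\qquad
\mathrm{kl}\big(R_S(G_Q)\,\|\,R(G_Q)\big)\;\le\;
\frac{\mathrm{KL}(Q\,\|\,P)+\ln\frac{\xi(m)}{\delta}}{m},
\quad \xi(m)\in\Theta(\sqrt{m})
```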
So this is one of the famous PAC-Bayes bounds, found by Langford and Seeger, and this is a graphical illustration of it. The first time, the way the bound is stated is a bit confusing, but to see how it works, assume that you fix the posterior. If you are interested in some posterior and you want an upper bound for this posterior, you fix the posterior. The KL term is then a number, because the prior is fixed, so this will be a number, let's say 10% or 5%, and basically you plot this kl function in terms of the real risk when you have computed some empirical risk value. This is a plot of the function in terms of the real risk when you assume that the empirical risk you have measured is 10%. So basically this kl function, as a function of R of Q, must be less than or equal to this value, so you see that this gives you an upper bound and also a lower bound, and you know that with probability one minus delta the real risk is inside this interval. Now, if you change the distance function--remember the first argument is the empirical risk--and you look for a function which is linear in the empirical risk, which is a natural thing to do because you then have the expectation of a sum of something linear, so it is going to be expressible as a product, you find this result for the Laplace transform. And you can then choose the function so that--you see, this is e to the f to the power m, and this is also something to the power m, so you say, well, I have a number to the power m, so let's choose this number to be one, and I will have one to the power m, which is one. So if you choose this term to be one over that quantity, they cancel out, and for the Laplace transform you basically get log of one over delta, which gives you a small bound. And if you do a bit of algebra afterwards to express the risk from the result, you get this bound.
So one minus exponential of minus x is a function like that: this is x, this is one minus exponential of minus x, so it is this function here. One minus exponential of minus x is smaller than the argument of the exponential, so for simplicity suppose this term is just C times the empirical risk plus one over m times this. So this is another bound that you can obtain immediately from the general theorem, and in fact it was found by another method by Catoni in 2007. It is kind of simple: it is valid for any constant C, any positive real number C, but it is not valid uniformly for all C. You can, by a union bound argument, make it valid for K values; you just need to introduce a ln of K here.
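(A hedged reconstruction of this Catoni-style corollary, following the form in Germain et al. 2009; the slide notation is an assumption. For any fixed C > 0, with probability at least 1 − δ, for all Q:)

```latex
R(G_Q)\;\le\;\frac{1}{1-e^{-C}}
\left[1-\exp\!\Big(-\Big(C\,R_S(G_Q)+\tfrac{1}{m}\big(\mathrm{KL}(Q\,\|\,P)+\ln\tfrac{1}{\delta}\big)\Big)\Big)\right]
```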
So this bound says something simple. If you want to find the Gibbs predictor with the smallest possible bound, it will be the one which minimizes this plus this. So you end up saying that what you need to minimize is this quantity: the posterior Q that you are looking for should minimize that, which is a bit like what we postulated in the beginning. We have some measure, but it is not the risk of an individual classifier; you have to measure the average risk, the risk of the Gibbs classifier, and you have this KL which acts as a regularizer, which grows depending on how far you are from the prior. So you have a hyperparameter C whose value you don't exactly know; you have one extra parameter to tune, which is not the case for the previous bound, the Langford-Seeger case, where there is no hyperparameter to tune.
So we have two bounds: how do they compare? The corollary 2.1, which is the Langford-Seeger bound, the one with the small kl between the two Bernoulli distributions, gives you a bound which is tighter except for a narrow range of values of C, which is normally around one. So normally the Langford-Seeger bound is tighter than the Catoni bound, and the fact that the Catoni bound can be tighter at all is just because, instead of having ln of square root of m divided by delta, it has ln of one over delta. It is just because of that that it can be tighter; if it had ln of the square root of m over delta, then the Langford-Seeger bound would always be tighter than the Catoni bound.
Okay, so let's apply these bounds and let us try to minimize them for a certain class of posterior distributions. Searching the space of all posterior distributions is something very hard--there are too many of them--so let us restrict ourselves to a class of posterior distributions, which will be Gaussian distributions over linear classifiers. Each example is mapped, so it will be a learning algorithm which can use a kernel: each x will be mapped, either explicitly or implicitly, into a feature space phi. The feature space can often be given in terms of a Mercer kernel, and the output of the linear classifier is going to be described by this, the usual formula when you have a feature space phi of x: it is the sign of the scalar product of w with phi of x.
We are going to look for posterior distributions which are isotropic Gaussians centered on w: you have a Gaussian which is isotropic, so the covariance matrix is the identity, with unit variance, so it is a very simple thing, just an isotropic Gaussian centered on some w. So w is the parameter of my distribution, and I will write Q sub w; this is the density on v parameterized by some w. One nice thing about this Gaussian distribution: you have this vector w, you have a Gaussian distribution, and this region here is the space of all the linear separators. Whenever you have an example to classify, say (x, y) here, then in the space of separators it is just a hyperplane, so some of the predictors here predict +1 and some of them predict -1. So if you consider the majority vote with respect to this Gaussian, the weight would be larger on the predictors predicting -1, and this is the output of the majority vote. And you see it is exactly the same as the output of w; by symmetry the majority vote and this single linear classifier produce exactly the same output, so they are the same classifier basically. So the output of the majority vote is the same as that of the deterministic classifier which is the center of the Gaussian. This means that twice the risk of the Gibbs classifier, so a bound on the Gibbs risk, will give you a bound on the single predictor w.
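(A hedged summary of the symmetry argument just given, for the isotropic Gaussian posterior Q_w = N(w, I) over weight vectors v:)

```latex
B_{Q_w}(x)=\operatorname{sgn}\Big(\mathop{\mathbb{E}}_{v\sim N(w,I)}\operatorname{sgn}\big(v\cdot\phi(x)\big)\Big)
=\operatorname{sgn}\big(w\cdot\phi(x)\big)
\quad\Longrightarrow\quad
R(h_w)=R(B_{Q_w})\le 2\,R(G_{Q_w})
```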
Okay, this is how it works. We defined the class of posteriors; we need to compute the KL divergence, because the bound will depend on the KL divergence between the posterior and the prior. So we choose a prior centered on some other vector w_P. Most often w_P will be the origin: if you have not seen your data, there is no preferred direction, so normally w_P will just be the origin. Because these are two Gaussians, the KL divergence can be computed exactly, and this is what it gives: it only depends on the norm of this difference vector. And it is a nice quantity, because then you have a regularizer which does not depend on the dimension of the space--it would be bad if it depended on the dimension, because this feature space phi can have a very large dimensionality. But because these are isotropic Gaussians, the integrals in all the other directions cancel, so there is only one surviving direction, basically, that gives you the KL.
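(The closed form being referred to, for two isotropic unit-variance Gaussians:)

```latex
\mathrm{KL}\big(N(w,I)\,\|\,N(w_P,I)\big)=\tfrac{1}{2}\,\|w-w_P\|^{2}
```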
We need to compute what the risk of the Gibbs classifier will be; in order to have a bound you need to say what the Gibbs risk on the training set is. The Gibbs risk on the training set is just the average of the Gibbs risk on the individual examples, so you need a formula: you need to be able to compute the Gibbs risk on any single example, which means you need to be able to perform this integral. If you are not capable of finding a closed-form expression for that, then you end up needing to perform a Monte Carlo evaluation, and this would mean a very slow learning algorithm: each time you need to compute the empirical risk, which happens often in a learning algorithm, you would need to do a Monte Carlo evaluation, so this is not very interesting. Fortunately, with a Gaussian you can integrate this: suppose it makes an error here, then it is just this volume; again all the other directions cancel and it basically depends only on this quantity here.
So this is what you obtain for the Gibbs risk on a single example. It is just a cumulative Gaussian, basically, which depends on this gamma, the normalized margin between the weight vector w and the example y phi of x--a number between -1 and +1--multiplied by the Euclidean norm of the weight vector. So this is the Gibbs risk, and you can make some plots in terms of the margin. Recall that the margin is just a real number, the cosine of the angle between w and phi, and it behaves like that: it has a sigmoidal shape. It is called the probit loss.
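(A hedged reconstruction of this closed form: for v drawn from N(w, I), the Gibbs risk on a single example is a Gaussian tail in the scaled normalized margin, with Φ the standard normal cumulative distribution function:)

```latex
\Pr_{v\sim N(w,I)}\big(y\,v\cdot\phi(x)\le 0\big)=\Phi\big(-\Gamma(x,y)\,\|w\|\big),
\qquad
\Gamma(x,y)=\frac{y\,w\cdot\phi(x)}{\|w\|\,\|\phi(x)\|}
```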
So this is the probit loss. It gives you a loss function on the example which has this form in terms of the normalized margin, and as the norm of the weight vector becomes very large it converges to the 0-1 loss of the majority vote, so it makes sense to measure the performance of the Gibbs classifier in order to obtain a performance indicator for the majority vote.
If you have a large weight vector, the guarantee that you have for the Gibbs classifier with that large weight vector will be a good indicator of the performance of the majority vote. The way to understand this result is to say, well, suppose this is your weight vector w and you have some example here, your example y phi of x. If the scalar product between these two vectors is positive, then you have a correct classification--but it is a Gibbs classifier, so this means that with some probability you can draw a predictor here which misclassifies that example; this is why you don't have a perfect classification. And this is why, as the norm of this predictor increases, it becomes more and more deterministic.
So we have bounds; the bounds are expressed in terms of the empirical risk and the KL divergence, and we found analytical expressions for those. What is left is to find the predictor w which will minimize the bound. This is the Langford-Seeger bound: we are looking for the w which will minimize this, and the bound is just the largest value of epsilon for which this inequality holds. It is realized at equality, but you have two solutions--remember the plot I gave you, one gives you an upper bound and the other a lower bound--so you must choose the upper one. So basically we are going to try to find the w minimizing the bound, where the bound is given by this formula and you take the solution where B is larger than the empirical risk. That is what we are going to do. It's a pity, because you cannot express B explicitly in terms of the empirical risk and the KL; you just have an implicit definition, but you can have an explicit formula for the gradient. So the gradient can be explicit.
Okay, so for the other bound, the Catoni bound, you basically just minimize this, and if I express the empirical risk in terms of what we found, we find this expression. So we need to find the w which minimizes this, and in the absence of any information about the data we choose w_P equal to the origin, the zero vector. So basically you need to minimize an L2-regularized probit loss; this is what you need to do. You should compare that to what the SVM does: it minimizes a convex hinge loss with the same regularizer. So the regularizer used by the SVM now comes up naturally here, but instead of minimizing the convex hinge loss, you need to minimize a non-convex probit loss.
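(A minimal sketch, not the authors' code, of the kind of objective this describes: a Catoni-style trade-off C times the empirical probit (Gibbs) risk plus the Gaussian KL regularizer, minimized by plain gradient descent with random restarts, as discussed later in the talk. All names are illustrative assumptions, and the exact scaling of C versus m follows the paper only up to constants.)

```python
import numpy as np
from scipy.stats import norm

def objective(w, X, y, C, w_prior):
    """C * (empirical Gibbs/probit risk) + KL(N(w, I) || N(w_prior, I))."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)   # = Gamma(x, y) * ||w||
    gibbs_risk = norm.cdf(-margins).mean()              # probit loss, averaged
    kl = 0.5 * np.sum((w - w_prior) ** 2)               # KL between isotropic Gaussians
    return C * gibbs_risk + kl

def gradient(w, X, y, C, w_prior):
    """Gradient of the objective above; norm.pdf is the standard normal density."""
    norms = np.linalg.norm(X, axis=1)
    margins = y * (X @ w) / norms
    # d/dw of mean Phi(-margin_i) = -(1/m) sum_i pdf(margin_i) * y_i * x_i / ||x_i||
    grad_risk = -((norm.pdf(margins) * y / norms) @ X) / len(y)
    return C * grad_risk + (w - w_prior)

def pbgd_like(X, y, C, lr=0.05, steps=1000, restarts=10, seed=0):
    """Gradient descent with random restarts (the summed probit loss is non-convex)."""
    rng = np.random.default_rng(seed)
    w_prior = np.zeros(X.shape[1])
    best_w, best_val = None, np.inf
    for _ in range(restarts):
        w = w_prior + rng.normal(scale=0.1, size=X.shape[1])
        for _ in range(steps):
            w -= lr * gradient(w, X, y, C, w_prior)
        val = objective(w, X, y, C, w_prior)
        if val < best_val:
            best_w, best_val = w, val
    return best_w
```

(With a kernel one would instead work with the dual variables, as the representer-theorem discussion later in the talk indicates, but the idea is the same.)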
>>: I have a question. So the [inaudible] through here is because you choose hardly no
more than for the… So the [inaudible] through your action you're going to choose
[inaudible] on that theta [inaudible].
>> Mario Marchand: Yes. You can use prior knowledge; this is where prior knowledge can enter. You can say, I will use a Laplace prior instead of a Gaussian prior, and then probably an L1 norm would pop up out of that.
>>: So it's not even just the Gaussian, it's two spherical Gaussians that came from the KL
divergence [inaudible] so it is a very special thing.
>> Mario Marchand: Yeah. But if you are using a Gaussian for the posterior [inaudible] it is natural to use it also for the prior. If you used a Laplace prior then you would probably use something else for the posterior, something that has a nice KL expression.
>>: But isn't it that one of the motivations was that when you present the traditional machine learning framework you say, I pick the regularization to be this, and you ask why do you use L2, why not L1, why not L17?
>> Mario Marchand: This arbitrary [inaudible] wasn't here with the choice of the prior.
>>: Exactly, so you can kind of go back and forth between these and just turn the
regularization term into a prior right?
>> Mario Marchand: Yep.
>>: But you did that anyway. There we say that the regularization was just [inaudible]
the prior and you're doing map estimation, so it's not like it's anything new.
>> Mario Marchand: Anyway, this is what we find. You see, this is what we should minimize, and the hinge loss is not quite that: it is a convex relaxation, but it is not an upper bound, because it can go to zero; it can be larger on some parts of the probit loss and lower on others. We have performed gradient descent on these bounds. For the Langford-Seeger bound you can compute the gradient--I am skipping the details, it is not very interesting--and you can compute the gradient for the Catoni bound: basically you take the differential of this, and if you compute the gradient in terms of w you have this part immediately. You will see that I will sometimes choose to center the prior around some other w; you will see why in a few minutes.
Okay, so why do we use gradient descent? Well, we have a problem here: the probit loss on a single example is not the problem, because it is quasi-convex, which means that its level sets are all convex. The problem is that we have a sum of quasi-convex functions, because you sum over all of your examples and for each example you have a probit loss, and a sum of quasi-convex functions is not guaranteed to be quasi-convex anymore. In fact the empirical risk is not quasi-convex, and because of that the bound does have several local minima. We have seen that on the data: it has several local minima and you can be stuck in some of them, and this is especially true if you take the Catoni bound with the hyperparameter C; when C is large the problem gets more important.
So nevertheless we say, well, the theorem says that we should minimize something which is not convex, so let's try to do it. We tried to do it just by doing a lot of random restarts, doing gradient descent with random restarts, and we increased the number of random restarts for large values of C because we had more local minima. And this is what we did basically. Sorry, I am stuck. Okay. So each algorithm--with these functional expressions you have an empirical loss and an L2 regularizer, so the representer theorem applies and you can write w: the representer theorem for these functions to minimize says that w must lie in the linear span of the examples. So basically we can work either with the primal variables or the dual variables, the alphas, whenever we have a kernel. We tried both versions: the primal version we compared with AdaBoost, and we used decision stumps--AdaBoost seems to perform generally okay with decision stumps, so we compared with AdaBoost--and for the dual version we used the RBF kernel and compared it with the soft-margin SVM, which is the gold standard.
We proposed three learning algorithms. PBGD1--this is for PAC-Bayes gradient descent 1--uses a non-informative prior centered at the origin, and we minimize the Langford-Seeger bound with confidence parameter delta equal to 0.05; actually, it is not very sensitive to the choice of delta. Remember, the Langford-Seeger bound has no hyperparameter in it, so basically you don't need to trade off the empirical risk against the regularizer; it is given in the bound. We tried that. The other thing we tried, version 3 of PAC-Bayes gradient descent, is to say, well, perhaps the regularizer weighs too much, so basically we decided to minimize the Catoni bound but for different values of C--not necessarily the one that gives the smallest risk bound, but several values--and we choose the value by cross validation. So this is sort of okay; let us try to see if this can improve on the other.
PBGD2 is quite different. In fact this is a true bound-minimizing algorithm: at the end you have a predictor and you really have a guarantee that is valid for that predictor. With PBGD3 you are cheating a bit: you will have a predictor, you can compute the bound, the bound is valid for that predictor, but you are not necessarily using the predictor with the smallest risk bound; you rely more on cross validation to select your best predictor. But here, basically, is another way of getting a risk-bound-minimizing algorithm. You say, well, I am going to use half of my data to learn a good prior; the only use you are going to make of the first half of the data is to learn a good prior. The priors are isotropic Gaussians centered on some w, so I am going to try to find a set of w's which would give me good priors. So basically we minimize the Catoni bound with different values of C, ranging over one, 10, 100--we try very large values--and this will give me several solutions.
>>: [inaudible] prior [inaudible]?
>> Mario Marchand: Yep.
>>: But now you are [inaudible]. So are you going to…?
>> Mario Marchand: It's not the correct way of saying it, but basically you take the first part of your training data and you just minimize the second bound, the Catoni bound, with a uniform prior. This gives me a set of predictors, and now for the second half of the data I will put the priors centered on those. So they are independent of the second half of the data.
>>: [inaudible] distributions the first time.
>> Mario Marchand: Yes.
>>: So you can't swap the expectation. It is not legal to swap the expectations
[inaudible]?
>> Mario Marchand: Yes, yes, because it is not the same.
>>: It is conditioned on [inaudible], you know what I mean? You have conditioned on it
because…
>> Mario Marchand: Yes. But as long as you are computing the empirical risk on data that the prior did not see, the second half of the data, it is okay.
>>: [inaudible].
>>: Right, but he is taking it [inaudible] to get the expectation over drawing samples from D and he is swapping it with the…
>>: The prior can depend on D; it just can't depend on S.
>> Mario Marchand: Right.
>>: Or it can depend on D in any way you want.
>>: Okay.
>> Mario Marchand: Yeah. So for the second half of the data we basically do the same thing as PBGD1: we minimize the Langford-Seeger bound, with no hyperparameters, for priors centered on the solutions that we found on the first half of the data, and we keep the solution with the best bound. So this is really a true bound-minimizing learning algorithm, but where we can cheat on the first half of the data to find a better prior than just the non-informative prior. But we lose something by this process, because the empirical risk is going to be computed only on half of the training data.
>>: So is there a reason for choosing half as opposed to a quarter or any other number?
>> Mario Marchand: We tried other fractions and I don't know why, but half is the best; half gave the best result.
>>: You are doing two things here, right? One is you are trying different Cs and the other is you are splitting the data into two halves. So if I were to just use the two-halves trick--say I use the Langford-Seeger kl bound and I just learn one W on half the data, and then on the second half of the data I learn another W which is regularized to be close to the first one.
>> Mario Marchand: It does not buy you anything, because basically you will obtain the same solution as if you had trained on all of the data with the non-informative prior.
>>: But my bound in the second minimization is now tighter, because I am assuming that if I train on half of the data and then train on another random half of the data, it is not going to change the solution by much; it is just the variance of the data.
>> Mario Marchand: Yeah, yeah, the bound will be smaller. I guess.
>>: You are regularizing toward something which is going to be almost exactly the same, so I am just saying, you are doing two things here. The second thing, optimizing over C, maybe is not needed.
>> Mario Marchand: Well, this is a way where basically you can attain very large weight vectors. It is not clear what it would give you; I don't recall exactly what it gave, it bought us nothing in the end. But to say that you will obtain a better bound--the empirical risk might still be quite large. It might be the case that you find a weight vector with a small norm and, because of this probit loss, you allow quite a large empirical risk, say 15%, and this will contribute to the bound. Here, basically, by having large weight vectors, the empirical risk that you're going to find in the second round is going to be smaller; the weight vector is going to be close to one of these very large ones and will just adjust slightly around it. So you will have a small empirical risk and a small KL divergence, and basically the solutions preferred by this trick were generally the ones with large weight vectors.
So PBGD1 and 2 are true bound-minimizing learning algorithms, but not this one, basically: you rely on cross validation for the hyperparameter, so you are minimizing a functional which is inspired by the bound--the quantities are there, you are using the probit loss, you are using the L2 regularizer--but for the relative weight you are relying on cross validation.
This is PBGD3, so let's compare these results. We don't have a lot of data as at Microsoft Research, so we rely on what is publicly available, the UCI data sets and so on. The data are split depending on the data set, but normally it is half and half: we basically train on the first half, then for testing we compute the empirical risk on the remaining half of the data, and we performed the binomial tail inversion test of Langford to determine whether the difference in empirical risk is statistically significant or not.
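(A hedged sketch of the binomial tail inversion referred to here, assuming the usual Langford-style procedure: invert the binomial tail to get upper and lower confidence bounds on each classifier's true test risk, and call the difference significant when the intervals are disjoint. Function names and the bisection depth are illustrative.)

```python
from scipy.stats import binom

def binomial_upper(k, n, delta, iters=60):
    """Smallest p with Pr[Binomial(n, p) <= k] <= delta, found by bisection."""
    lo, hi = k / n, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binom.cdf(k, n, mid) <= delta:
            hi = mid
        else:
            lo = mid
    return hi

def binomial_lower(k, n, delta, iters=60):
    """Largest p with Pr[Binomial(n, p) >= k] <= delta, found by bisection."""
    lo, hi = 0.0, k / n
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binom.sf(k - 1, n, mid) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

def significantly_different(k1, k2, n, delta=0.05):
    """True if the confidence intervals on the two true test risks are disjoint."""
    i1 = (binomial_lower(k1, n, delta), binomial_upper(k1, n, delta))
    i2 = (binomial_lower(k2, n, delta), binomial_upper(k2, n, delta))
    return i1[1] < i2[0] or i2[1] < i1[0]
```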
>>: That is not a matched test; it is just some sort of [inaudible]. It is not like a
[inaudible] test is just some sort of [inaudible] don't know what the paper said but it is
just [inaudible].
>> Mario Marchand: So basically you compute the test upper bound and the test lower bound, and if they are disjoint the difference is statistically significant; otherwise it is not. So these are the results in the primal case, with decision stumps. The bold entries are the ones with better results, but they don't show up quite clearly, unfortunately; I should have put some colors in it.
So this is the result for AdaBoost: the risk computed on the testing set, and the bound--you can compute the risk bound that we had for AdaBoost. This is PBGD1. For PBGD2, if you look at the bound, we obtain better bounds; for instance on mushroom it is 3% for the bound, compared to 13% for PBGD1, so by this trick of using half the data we can obtain really lower, better bounds. And for PBGD3 the guarantees are worse, but the actual results seem better; overall I would say that this is the winner. The statistically significant differences are here; there is almost no statistically significant difference except in these cases, for instance AdaBoost is statistically worse than PBGD1 and PBGD2 here, and vice versa there. And here, three is better than one, and so on; two and three are better. The worst one actually is PBGD1: just doing straight risk-bound minimization with a uniform prior gives you generally much worse results than PBGD2, and PBGD3 is still an improvement. If you rely on cross validation for the trade-off, it gives you better results, which are quite competitive with AdaBoost.
>>: Any idea, or a sense, of how much of this might be an optimization problem? The reason I am thinking that is because [inaudible] in the context of AdaBoost suggested using something very similar to your loss, which is a [inaudible] but it's almost the same. He was getting better results [inaudible] exactly, but I just don't know if it could be the optimization [inaudible] or could be the loss [inaudible]. You have tried multiple restarts, right?
>> Mario Marchand: Yes that's what we tried. It is not indicated in their paper what they
have tried.
>>: AdaBoost [inaudible].
>> Mario Marchand: It's just that our [inaudible] AdaBoost, yeah.
>>: The different optimization could be a different optimization [inaudible]. It
>> Mario Marchand: Yes but you will end up in some local minima unfortunately.
>>: That is true.
>> Mario Marchand: And there is no regularizer, but you stop boosting after, let's say, 200 rounds.
>>: [inaudible] the main [inaudible] was to say we will get a [inaudible] bound and try to
optimize [inaudible] so this evidence doesn't exactly support it.
>> Mario Marchand: Yes. It is still better, basically, to tune the weight between the regularizer and the empirical risk by cross validation; generally it is better. This means that the theorem somehow is too…
>>: [inaudible].
>> Mario Marchand: Yeah, but still, PBGD3 is a learning algorithm that is inspired by the bound, because we are using the loss that was given by the Gibbs risk, and we are using the regularizer that came out of the Gaussian, which is just the standard L2 thing. So it is, let's say, a bound-inspired learning algorithm, but quantitatively you need to tweak the balance. That is that. So these are the results we have with the RBF kernel; we compared with the SVM, and it is the same sort of picture. This is the worst result, these are better, and this is still better. What is nice is that the bound can be quite low here, basically 1%: this is a true risk bound of 1%, and it is doing better than that, better than 1%. And still it is nice to see that on Adult the bound is sort of quite close to the actual risk. Yes?
>>: Going back to John's point again about [inaudible], what, I don't know if this is going to make sense, but can you take the weights from the SVM and apply them to see what bound you would come up with if you…
>> Mario Marchand: Yes. Well, we can take the solution of the SVM and compute the bound, the same bound that we…
>>: So that's the bound that…
>> Mario Marchand: Yeah, yeah, yeah. They are worse.
>>: Okay.
>> Mario Marchand: The bounds say, well you shouldn't pick this solution; you should
in fact pick this one which is better, but this one is even better. It has a higher bound.
>>: Optimization is finding a lower bound than this [inaudible], yeah, okay.
>> Mario Marchand: Yeah, for sure the bounds here are worse than this one.
>>: I see.
>> Mario Marchand: Okay.
>>: Back to that. So it seems as if, you know, there is still a slackness, you know, in the bounds. Do you have any feeling for where the source of the slackness is?
>> Mario Marchand: Well, the regularizer seems somehow too strong. It is difficult to get rid of it, and yeah, in this setting I could not get better bounds than that. This is really…
>>: And then when you look at the choices of the PBGD3, you see that it gives smaller
weight to the…
>> Mario Marchand: Oh yes, these are typically the best results: you multiply the empirical risk by 1000 and this is pretty good. So this means that the weight on the regularizer is just too strong. But PBGD2 sort of knows that: I'm going to cheat on my first half of the data by finding large-weight solutions, and then on the second half the regularizer is smaller.
>>: [inaudible] maybe I am computing the bound wrong, but I thought what the bound really says is that if you were to keep sampling test data, you will be above the bound a 5% [inaudible]. It doesn't tell you anything about what your average risk is. So I don't think that it is really surprising that you're finding, through your test set, that you run the [inaudible] and you got something that is [inaudible] bound, because the bound is talking about your tail loss, right?
>> Mario Marchand: Yes, but this is the, you need to bound the tail. You need to bound
the worst case…
>>: But the results you show are not, you are evaluating the average [inaudible] that you
showed, but the bound is giving you [inaudible] the worst case so I don't think that there
may be a [inaudible] because you have to run this many times to see…
>>: Would you say that maybe some of these are harder seem to be performing better
than actually have larger buy-ins and we don't see that…
>>: We don't see that because you only get one test set.
>>: But another way to look at that would be to say that the examples that you tested on were too easy, and that you could perhaps create some synthetic data sets that would be much harder for the standard algorithms, where your bounded algorithms would do reasonably well and you would know how well they did. The distribution of examples should be harder.
>> Mario Marchand: I don't know.
>>: Your proofs are for any distribution of samples, right? The letter a compared to b, that's not a particularly tough distribution of samples, whereas maybe you could make one that would be much harder.
>>: [inaudible] for all D and all Q [inaudible] posterior and vary distribution and data,
specific the worst ones and maybe the ones that I have and not the worst ones…
>>: Yeah, I don't think those are that bad so this is getting back to that thing that maybe
there really isn't slack there. You assume that it could be [inaudible].
>>: Yeah, but the other thing is how much to lose by this, by the fact that I worst case
and the answer is, you know, one more step closer to the answer. [laughter].
>> Mario Marchand: Going from the Gibbs risk to the majority vote.
>>: There's too much [inaudible] [laughter].
>> Mario Marchand: Okay. Conclusion. Basically PBGD2, the second one, is better than the first one, so using half of the data to learn a prior really helps. PBGD3 is better than the others--I shouldn't say definitely better, it is a bit better--but I think it has a true advantage; the bound is a bit worse than cross validation for finding the proper trade-off. PBGD3, which is inspired by the bound--you take the quantities that came out of the bound--seems to be very competitive with AdaBoost and SVM, but they are much slower because you have several local minima.
What we have done afterwards is to convexify: we have this probit loss, and we have all of these local minima problems, so let's just convexify it with a logistic loss. You take the slope here and you go linear there. So this is somewhat different from, let's say, the hinge loss, but…
>>: That is logistic [inaudible].
>> Mario Marchand: Very similar to L2-regularized logistic regression, and if you do that you find almost identical results, very, very similar. I don't show the numbers, but it is almost the same thing. This was a bit disappointing: I was expecting to obtain much better results because of the non-convexity, saying, well, this is what you have to do, you have to work harder, but at the end you will find better solutions. And basically no, you can just use the logistic loss instead of the probit loss and…
>>: I think Carl is right. Because I think that usually the excuse for having these non-convex loss functions that go to zero derivative for very [inaudible] is when you have a lot of label noise in your data. In fact it is like an adversary coming in trying to smash you, and that is why you are trying to minimize this [inaudible]. So if your data had no adversary--I think you should go to the [inaudible] list and put in like maybe 10%…
>> Mario Marchand: We have tried putting in classification noise and there was no improvement from that. But somebody at ICML or NIPS, from Google, gave a talk on a similar experiment, and he sort of imbalanced the noise between the positives and the negatives, and then he saw a difference, but we didn't do it; we didn't think about it. It was better to use the non-convex one in that case.
>>: So you said the sum of quasi-convex functions [inaudible], and here you are approximating each quasi-convex function by a convex function. So if you could do that for the sum, maybe that would give you a better convex function, although probably not; it might be more complicated than [inaudible].
>>: But not for each individual one.
>>: [inaudible].
>>: No. So you can take a convex, I know for example, the convex [inaudible]. Maybe
if you…
>>: The sum of many [inaudible].
>>: I do not know if that would work well but I was just wondering.
>> Mario Marchand: That would be a lot of hard work to do. I don't know I never
thought about it.
>>: [inaudible].
>> Mario Marchand: So anyway, the second one seems the best choice if you want to obtain a good guarantee. So that is this part of the work. Now the second part of the talk--how much time do I have? [laughter] Very little. Shucks, okay.
So the second part--how many minutes? 10 minutes? Okay, I will try to go fast. This is much more recent work. We have a set of classifiers that we call sample-compressed classifiers. What is a sample-compressed classifier? Well, a sample-compressed classifier is described by a subset of the training set that we call the compression set, and a message, which here is going to be a real number between -1 and +1, plus an information bit, plus or minus. These are not what you would usually call messages, but it is the standard terminology in sample compression. And we are going to use--we will see why--a set of classifiers that is closed under complementation: for each classifier, its negation (minus one times it) is, by definition, also in the set.
Let me give you an example of a set of sample-compressed classifiers: the ones made from a single example. So this is an example of a sample-compressed set of classifiers described by only one training example, and I am going to use any bounded similarity function for that, normalized to be between -1 and +1. Each classifier is going to be described by a single example and a message sigma, such that h-plus here--this is the definition--is going to output +1 if sigma is less than the similarity value at x; otherwise the output is -1. You can write it compactly like that, as the sign of this predicate: if the similarity function, evaluated between x and the example used to describe this classifier, is greater than sigma, then I am outputting +1, otherwise I am outputting -1.
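(A minimal sketch of such a single-example classifier, under the assumptions just stated: any similarity function bounded in [-1, 1], a message sigma in [-1, 1], and an information bit. The names and the RBF choice are illustrative, not from the talk.)

```python
import numpy as np

def rbf_similarity(a, b, gamma=1.0):
    """One choice of bounded similarity (here in (0, 1]); any function into [-1, 1] works."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def sample_compressed_classifier(x_i, sigma, bit, similarity=rbf_similarity):
    """Classifier described by one training example x_i, a message sigma in [-1, 1],
    and an information bit in {+1, -1}: h(x) = bit * sign(similarity(x_i, x) - sigma)."""
    def h(x):
        return bit if similarity(x_i, x) > sigma else -bit
    return h

# The Boolean complement of h-plus is obtained by flipping the information bit.
h_plus = sample_compressed_classifier(x_i=[0.0, 1.0], sigma=0.3, bit=+1)
h_minus = sample_compressed_classifier(x_i=[0.0, 1.0], sigma=0.3, bit=-1)
```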
And so the Boolean complement of this is just going to be -1 up to the plus. So I can
produce, so this is an example of simple compress classifier made of a single example.
You could take pairs of the example and define a hyperplane. You could take a whole
bunch of example and run in SVM and this will output a classifier. So basically it is a
simple compress classifier where the compression set is the set of examples that you have
used. So we are going to be interested in building a majority vote of compress classifier
and we want to attain a PAC-Bayes risk bound for majority vote of simple compressed
classifiers.
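To make this concrete, here is a minimal Python sketch of such a single-example sample-compressed classifier, h(x) = s * sign(k(x_i, x) - sigma); the Gaussian similarity used below is only an assumption for illustration, since any bounded similarity function would do:

import numpy as np

def similarity(xi, x, gamma=1.0):
    # Hypothetical bounded similarity; any k with values in [-1, +1] works.
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def sample_compressed_classifier(xi, sigma, s):
    # Classifier described by one compression example xi and a message (sigma, s).
    def h(x):
        return s * np.sign(similarity(xi, x) - sigma)
    return h

# The Boolean complement is obtained by flipping the information bit s.
xi = np.array([0.3, -1.2])
h_plus = sample_compressed_classifier(xi, sigma=0.5, s=+1)
h_minus = sample_compressed_classifier(xi, sigma=0.5, s=-1)
x = np.array([0.4, -1.0])
assert h_plus(x) == -h_minus(x)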
So basically each sample-compressed classifier is going to be described by a vector i which points to the individual examples used for defining this classifier. We have a prior on this set of sample-compressed classifiers, and likewise any posterior, and such a distribution will be written like that: a distribution on the subset of the training set that is used to construct your classifier, and a distribution on the message given the compression set. Okay. And we are interested in this guy here, the majority vote of these sample-compressed classifiers. We want a risk bound, and we want to design an algorithm that minimizes the risk bound. So consider again the case where each compression set is made of a single example. We are going to use a posterior like that because it is quite natural: you have a distribution on the index i, so each example can make a single classifier, and a distribution on the sign, that is, for each example you have a pair of classifiers, one of which is the Boolean complement of the other. In fact you have continuously many classifiers per example, because you can have several values of sigma, so we are going to use the uniform distribution for sigma over minus one to plus one. And I am going to use this weight, which will appear in the expression of the majority vote, and which is the weight assigned to that example: the difference between the plus and the minus weights. So the weight is bounded by this value, but it can be positive or negative.
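As a rough sketch of this parametrization (the particular numbers below are hypothetical), the posterior can be stored as a distribution over training indices plus a conditional over the sign bit, with sigma uniform over [-1, +1]; the effective weight of example i is then the difference between its plus and minus weights, and it lies between -Q(i) and +Q(i):

import numpy as np

m = 5                                    # number of training examples
Q_i = np.full(m, 1.0 / m)                # distribution over compression indices
Q_plus_given_i = np.array([0.9, 0.2, 0.5, 0.7, 0.1])  # hypothetical conditionals

w = Q_i * (2.0 * Q_plus_given_i - 1.0)   # w_i = Q(i) * (Q(+|i) - Q(-|i))
assert np.all(np.abs(w) <= Q_i + 1e-12)  # bounded, but positive or negative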
These are interesting because a majority vote of sample-compressed classifiers can basically produce any SVM classifier: pick any SVM classifier and you can express it as a majority vote of sample-compressed classifiers of compression size at most one. You can see this if you just look at the output, for any x, of a majority vote of sample-compressed classifiers made out of a single example. So this is the expectation: you sum over all possible examples used for the sample-compressed classifier, sum over the sign, integrate over the uniform distribution of sigma, and what comes out is this expression at the end. It is the sign of a weighted combination, so basically you can generate any weighted combination like that. The w_i are bounded, but it doesn't matter, because given any classifier you can always divide by the largest weight; because of the sign, you can always renormalize the weights.
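A quick numerical check of the key step, namely that for sigma uniform on [-1, +1] the expectation of sign(k - sigma) equals k, so the Q-weighted vote of single-example classifiers reduces to the sign of a weighted sum of similarities (a sketch with an arbitrary similarity value, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)
k_val = 0.37                             # some similarity value in [-1, +1]
sigmas = rng.uniform(-1.0, 1.0, size=1_000_000)
estimate = np.mean(np.sign(k_val - sigmas))
print(estimate)                          # approximately 0.37, matching k_val

Averaging this over the posterior weights w_i therefore yields a decision function of the form sign(sum_i w_i * k(x_i, x)), which is exactly an SVM-style predictor with an arbitrary similarity measure.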
So these majority votes of sample-compressed classifiers of size one are interesting because they contain all of the classical SVM classifiers. But here we can use any similarity measure, so we are not restricted to positive semi-definite kernels, and we have a bound which is valid no matter what the similarity measure is. Now, the usual way to bound the majority vote is through the Gibbs risk, but here that is not good, because in the case where the sample-compressed classifiers are made of a single example, they are very weak classifiers. They are not strong classifiers; they will almost always have an error rate which is close to one half. So the Gibbs error rate will be close to one half, and it is not a good measure of performance, because the Gibbs error rate will be close to one half even if the risk of the majority vote is zero.
So bounding the Gibbs error rate does not inform you at all, in this case, about the performance of the majority vote. We need something other than the Gibbs risk to monitor the performance of the majority vote. And so to see what we need to use, let us
consider what would be the expression of the Gibbs risk. So this is the expression of the Gibbs risk: it is just the expectation of the true risk, and basically you can convert this indicator function in terms of the margins. Since y times h(x) is either +1 or -1, if it is +1 the error is zero, and if it is -1 it gives you an error of one. So here I am going to define the margin of Q on example (x, y) to be this expression; it is the usual notation. Basically the risk of the majority vote is at most twice the Gibbs risk, which is just one minus the expected margin, up to a factor of one half. But one minus the margin is a linear function of the margin, which has nothing to do with the step function of the margin that defines the Bayes risk.
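These two relations, that the Gibbs risk equals one half of one minus the expected margin for plus/minus-one voters, and that the majority-vote risk is at most twice the Gibbs risk, can be checked numerically; the voters and posterior below are purely hypothetical:

import numpy as np

rng = np.random.default_rng(1)
n_voters, n_examples = 7, 1000
Q = rng.dirichlet(np.ones(n_voters))                      # posterior over voters
votes = rng.choice([-1, 1], size=(n_examples, n_voters))  # h_j(x_i) in {-1, +1}
y = rng.choice([-1, 1], size=n_examples)

margins = y * (votes @ Q)                                 # M_Q(x_i, y_i)
gibbs_risk = np.mean((1 - y[:, None] * votes) / 2 @ Q)    # expected 0-1 loss of a random voter
mv_risk = np.mean(margins <= 0)                           # 0-1 risk of the majority vote

assert np.isclose(gibbs_risk, (1 - margins.mean()) / 2)
assert mv_risk <= 2 * gibbs_risk + 1e-12                  # Markov's inequality on 1 - M_Q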
So this is the Bayes risk, the risk of the majority vote, the 0-1 step function of the margin, and this is one minus M_Q, the linear function. Basically it does not make any sense to bound the Bayes risk with this linear function if most of your examples are around here, that is, if they have small margins. So what we are going to use instead is a loss function like that, which is a quadratic function centered on some margin value that can basically be chosen freely. If you find a distribution for which you know all examples have small margins, then the quadratic risk is a better measure than this linear risk in that case. So we are going to use this, and a bound on the expected loss for this quadratic risk also provides a bound on the risk of the majority vote. If you used a linear function that is more inclined than this one, it would not work, because it becomes negative, and you have to upper bound the risk of the majority vote.
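One possible form of such a quadratic surrogate (an assumption for illustration, since the slide is not reproduced here) is ell_mu(m) = (1 - m/mu)^2 for a chosen margin value mu. It upper-bounds the 0-1 step loss of the margin everywhere on [-1, +1], whereas a steeper linear loss such as 1 - m/mu goes negative for margins above mu and so cannot upper-bound the majority-vote risk:

import numpy as np

mu = 0.2
m = np.linspace(-1.0, 1.0, 2001)
step = (m <= 0).astype(float)            # 0-1 loss of the majority vote
quad = (1.0 - m / mu) ** 2               # quadratic surrogate centered on mu
assert np.all(quad >= step)              # valid upper bound on the step loss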
So this is the bound that we have. The bound is actually valid more generally than just for a quadratic loss: it is valid for any loss function that has a Taylor series expansion of finite degree around the zero margin. So this is the result that we have, basically for all posterior distributions Q aligned on the prior P, and I will tell you what aligned means. The risk of this loss function, which could be the quadratic loss, and which also bounds the risk of the majority vote, is less than the empirical measure plus a confidence-interval term; there is no regularizer, but there is this alignment constraint. In the confidence interval you have the square root of m minus d instead of m, where d is the size of the compression sets that you are using. If you are using a single example, d is one; you might use two or three, and this will give a deterioration of the bound, as will the degree of the loss, which is two if you are using the quadratic loss. It is kind of a nice bound because it applies to any sample-compressed classifier defined in terms of a similarity function: you are not restricted to Mercer kernels or whatever, and the similarity can even be nonsymmetric. Okay. There do exist risk bounds that apply to non-PSD kernels, but they have an empirical risk which must be measured on part of the data: they construct the classifier on some part of the data and measure the empirical risk on the other part, whereas ours is computed on all of the data. So basically this is a better bound than what has appeared before. So we have been minimizing, okay, okay, a few words on what does aligned mean?
>> Ofer Dekel: I think we need to wrap up, so, quickly.
>> Mario Marchand: So aligned means just that for each pair of complemented classifiers, the total posterior weight on the complemented pair is always equal to the prior weight. So the only thing the posterior can do is shift weight within each pair. And even with that constraint, you can produce any majority vote, so it is basically not a real constraint, but it gives you a good bound, and it has no KL divergence. This is the proof; I am just going over it. So basically we have used this bound to derive an algorithm, and it gives very good results compared to SVM, for instance. This is the aligned case. There is also the nonaligned case, where you have a KL divergence, and it gives good results, state-of-the-art results in fact, so we are pretty satisfied with that. I am going to…
>>: [inaudible] classification. I saw you guys [inaudible] tested [inaudible]. What is the net result? Is it different…
>> Mario Marchand: The solutions are different. Here, with the L-infinity constraint that we have, the weights are restricted to lie between minus one and plus one, whereas there you basically have an L2 regularizer.
>>: [inaudible].
>> Mario Marchand: Yes, it is in a box. Okay, so let me wrap up by saying that we have proposed a PAC-Bayes bound which compares favorably to currently existing bounds because it applies to similarity measures that need not be symmetric or positive semi-definite, and minimizing the risk bound really gives state-of-the-art results. It is pretty good. The work can be extended to multiple kernel learning; that is what we are working on, along with some new work on sample compression that we are doing right now. We also want to apply this to random forests, because a random forest is also a sample-compressed classifier where the empirical risk you compute is basically the out-of-bag estimate, so we want to approach random forests from a PAC-Bayes point of view. I am also quite involved in PAC-Bayes structured output prediction, and basically we want to apply that to peptide binding prediction: you have a protein and we want to predict which peptide will bind strongly to a given protein. So thanks a lot for your attention, and I am sorry to have exceeded my time. [applause]