>> Ofer Dekel: Hi, it is our pleasure today to have Mario Marchand from University of
Laval in Québec City and he has come here to talk to us about PAC-Bayesian machine
learning, which is learning by optimizing a performance guarantee. So, Mario, thank
you.
>> Mario Marchand: Thank you very much. I would like to thank Ofer Dekel and
Microsoft Research for this kind invitation. I am having a terrific time. I think that the
people here are very interesting and very kind, so it is a terrific place to be and I am just
very happy and very honored to be here.
So this is the title, and most of the research that we are currently doing at Laval University is in our machine learning group, which in French is the group [inaudible] de Laval, which happens to be the same name as the Holy Grail, which is a nice coincidence. [laughter] So this is our group and basically we are two faculty members, me and François Laviolette who is also another professor, and these are the grad students, some of the grad students that actually participated in the results that I will show you today, but we also have other graduate students. Basically the title summarizes our main line of research: whenever we try to design a learning algorithm, we first try to find a good performance guarantee and then propose the learning algorithm that will optimize that performance guarantee.
Before outlining the talk, I will just present a few slides to introduce the notation and the basic supervised learning setting. So in the supervised learning setting we have a
training set of M examples. So each example is just an input output pair, so this is the
input and the output associated to input X. So the input space is usually arbitrary and the
output space will basically define the kind of supervised learning problem that we have
so if the output space is -1 or +1 this is just binary classification, the real set for
regression or a real interval for regression, and an arbitrary structured set for the
structured output prediction.
So all of the theoretical, well basically the only assumption behind the theoretical results
is this assumption here that each example is drawn independently according to the same
distribution on the input and output space. That is what we assume, that the training
sample, each element of the training sample was drawn according to this distribution. So
the goal of the learner, or the learning algorithm, is to find a predictor h having minimal risk; this is the goal of learning, it is always that goal. And the risk of the predictor is defined as the expected loss, where the expectation is performed according to the real data-generating distribution, and the loss is expressed in terms of a loss function that measures the loss incurred for predicting h of x when the actual output is y.
Of course, the problem is that the learner does not know the distribution D, because the objective of the learner is defined with respect to D, not with respect to the training sample. So we don't know D; we only have access to a training sample drawn according to D, and so the central question in supervised learning is: what should be optimized on the training sample to obtain the predictor with the minimal possible risk? This is the main question in supervised learning, and the usual way to--well, we basically avoid the question most of the time. We will say, well, we don't exactly know what should be optimized on the training data, so let us just try to guess what we should optimize, and we basically just propose a regularized risk functional, so that would typically be like that. It is not always like that, but you always have an empirical risk of the predictor, which is just the empirical loss measured on the training set of examples. And you have some complexity term, a penalty term, which depends on some complexity measure of your predictor. And lambda is some positive real number whose value we don't know, so we normally choose it by cross validation, so basically cross validation on our data.
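(A hedged reconstruction of the regularized risk functional being described; the symbols m, ℓ, C and λ are inferred from the audio, not read off the slide:)

```latex
R_S(h)=\frac{1}{m}\sum_{i=1}^{m}\ell\big(h(x_i),y_i\big),
\qquad
\hat{h}=\operatorname*{arg\,min}_{h\in\mathcal{H}}\;\Big[\,R_S(h)+\lambda\,C(h)\,\Big]
```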
Perhaps we can do better, because this is the normal way of doing things: we propose something, we find that it works empirically, then some theoretical person proposes some bounds and some justification for why it is not a bad idea to do that. But can we do better than this? If we cannot find the predictor with the best possible risk, let's try to find the predictor with the best risk guarantee, a risk guarantee which can be computed on the training data. Let's try to do that instead. The guarantees that we have are called risk bounds; these are the theoretically accepted guarantees that we have.
And it is called a risk bound. A risk bound is just basically a function of the predictor, a
training set and some confidence parameter, normally it just weakly depends on the
confidence parameter. And the property of the risk bound is that, with high probability with respect to random draws of the training set, simultaneously for all predictors, the risk is upper bounded by the bound, and the bound is really the random variable here, which depends on the predictor and the training set.
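(A hedged reconstruction of the risk-bound property just described, writing B for the bound and δ for the confidence parameter:)

```latex
\Pr_{S\sim D^m}\Big(\forall\,h\in\mathcal{H}:\;R(h)\le B(h,S,\delta)\Big)\;\ge\;1-\delta
```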
The optimization problem for supervised learning is then to find the predictor minimizing
the bound. This will be the predictor having the best guarantee which can be computed
on the training set, so this is basically our direction of research. So this makes sense of
course if you have a good guarantee, a tight bound. If the bound is not tight, well perhaps
it won't give good results empirically.
So we approach supervised learning problems by first trying to derive a risk bound for the particular learning setting, which can be computed for any pair of predictor and training sample. And then, once we have the bound, we design a learning algorithm for minimizing it given the training set S; basically, it will return the predictor with the best possible guarantee. So for step one we use PAC-Bayes theory.
Why do we use PAC-Bayes theory, which was initiated quite recently by McAllester at one of the COLT conferences in 1999? First, it can give very tight risk bounds. It is not because you are doing PAC-Bayes theory that you are going to find tight risk bounds, but you can really obtain tight risk bounds if you put in the effort.
Also, it is about distributions, basically distributions over a class of predictors, so it gives guarantees to ensemble methods such as boosting, SVM, random forests and kernel methods, many state-of-the-art learning algorithms basically. It gives guarantees to methods like that. And perhaps the most important point, which I have not written here, is that it is quite easy to master. PAC-Bayes theory is very simple; it is much simpler than VC dimension and even Rademacher complexity. It is really something that a graduate student can understand fully, so this is certainly one important point.
So this is the outline of my talk. Since it is easy to master, I think it is a good occasion to explain to you the elements of PAC-Bayes theory, so you will leave the talk, I think, with the main ideas behind it, and if you work a bit more you can just master the technique, and then we are going to apply that to some particular cases.
The first case is going to be Gaussian distributions over linear classifiers, so this is a kind of posterior distribution over a very important class of predictors, linear classifiers. We're also going to talk about PAC-Bayes sample compression. This is recent work; we presented it at ICML in 2011, so very recently. The idea here is basically that we are going to use the PAC-Bayes approach on a set of classifiers which are described by the training data, by a subset of the training data. And finally, perhaps I will just outline some work that we are currently doing right now.
Elements of PAC-Bayes theory: here we restrict ourselves to the simple classification case, so we consider the 0-1 loss. The true risk of a predictor and its empirical estimator on the training set S are just defined respectively as follows; this is the indicator function, so it is one if the predicate is true and zero otherwise. So it is one when you are making a mistake, and so this is just the probability of making a mistake when you draw an example from the data-generating distribution, and this is just the frequency of errors on the training set. The learner's goal, then--the output of the learning algorithm--will be to produce a posterior distribution Q that depends on the performance of the individual classifiers. We are going to search for a posterior distribution on the space of classifiers such that the risk of the Q-weighted majority vote, which I call B_Q--because sometimes it is also called the Bayes classifier, but here let's call it the majority vote or the Q-weighted majority vote--is as small as possible. That is our aim.
PAC-Bayes bounds in fact bound the risk of the majority vote indirectly, because they bound the risk of another classifier, which is a stochastic classifier: the Gibbs classifier. The majority vote is deterministic; if you present the same example twice it will output the same prediction. But the Gibbs classifier is something different. It is stochastic: to predict the label of an input example x, the Gibbs classifier draws a classifier h from H according to the distribution Q and then predicts the label of x according to h(x). And so if you present x twice it might predict different labels, because it is stochastic; the draws might be different. The risk and the training error of the Gibbs classifier are just given as follows: it is just the expected risk, where the expectation is taken with respect to draws according to Q. So this is the Gibbs risk, and this is the Gibbs empirical risk, which is just the expectation of the empirical risk of the individual predictors.
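(For reference, a hedged reconstruction of the quantities just described, in fairly standard PAC-Bayes notation; the exact slide symbols are assumptions:)

```latex
R(h)=\Pr_{(x,y)\sim D}\big[h(x)\neq y\big],\qquad
R_S(h)=\frac{1}{m}\sum_{i=1}^{m}I\big(h(x_i)\neq y_i\big),\qquad
B_Q(x)=\operatorname{sgn}\Big(\mathop{\mathbb{E}}_{h\sim Q}h(x)\Big),

R(G_Q)=\mathop{\mathbb{E}}_{h\sim Q}R(h),\qquad
R_S(G_Q)=\mathop{\mathbb{E}}_{h\sim Q}R_S(h)
```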
What is the relation between the Gibbs risk and the risk of the majority vote? Well, you can bound the risk of the majority vote by this equation; you can easily prove that the risk of the majority vote is at most twice the risk of the Gibbs classifier. So if we provide a bound for the risk of the Gibbs classifier, we will automatically have a risk bound for the majority vote classifier, by the factor of two rule. So we might lose a bit here because of this factor of two rule.
The way to understand why we have this relation is basically to consider a fixed example (x, y), and consider the two cases: the case where the majority vote makes an error on this example and the case where it does not. Suppose first that it makes an error on this example. If it makes an error, it means that at least half of the classifiers under measure Q make an error on it--it could be 1/2 + epsilon of the classifiers that make an error on it. This means that twice the error of the Gibbs classifier on that example is greater than or equal to one; since the risk of the majority vote on that example is one, and twice the Gibbs error is larger than or equal to one, you have this inequality. And this inequality obviously also holds in the case where the majority vote makes no error, because the Gibbs classifier will in general make some error on it. So in general you have this. I have demonstrated this inequality for a particular example; just take the expectation on both sides and you have this inequality, because it is with respect to the same distributions.
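(A hedged sketch of the argument: if the majority vote errs on (x, y), at least half of the Q-mass errs on it, so per example and then in expectation over D:)

```latex
I\big(B_Q(x)\neq y\big)\;\le\;2\mathop{\mathbb{E}}_{h\sim Q}I\big(h(x)\neq y\big)
\quad\Longrightarrow\quad
R(B_Q)\;\le\;2\,R(G_Q)
```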
So basically we have this factor of two rule, and this is how PAC-Bayes works most of the time: we provide a risk bound for the Gibbs classifier, and then by the factor of two rule this gives a bound on the majority vote classifier. In PAC-Bayes theory, or in the PAC-Bayes setting, I would say the task of the learner is to produce a posterior distribution. We call it a posterior distribution because the bound will also depend on another distribution which must be defined before observing the data, a prior distribution P on your set of classifiers, and basically the PAC-Bayes risk bound will depend on the Kullback-Leibler divergence between the posterior and the prior. This is just the standard definition of the Kullback-Leibler divergence. Notice that in order for it to be finite, the support of the prior must include the support of the posterior. Otherwise it would diverge where P is equal to zero, so this means that the support of Q must be contained within the support of P, the prior.
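(For reference, the standard definition being referred to:)

```latex
\mathrm{KL}(Q\,\|\,P)=\mathop{\mathbb{E}}_{h\sim Q}\ln\frac{Q(h)}{P(h)}
```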
Most PAC-Bayes bounds are also expressed in terms of the Kullback-Leibler divergence between two Bernoulli distributions, one with probability of success q and one with probability of success p, so we will also make use of that function. So this is our main theorem, which we obtained a couple of years ago. We presented it, I think, at ICML 2009. I think it is quite nice because it is quite general: you can obtain all known PAC-Bayes bounds for classification out of this one,
and it is very, very simple to prove. It is stated as follows: for any distribution, for any set of classifiers, for any prior P with support H, and for any--you see here you have this distance between two quantities, two real numbers, which are the Gibbs risk and the Gibbs empirical risk--the bound holds for any distance measure, any real-valued function D; it can even be negative, that is not a problem. So for any D, basically, this bound holds with probability at least one minus delta simultaneously for all posteriors: this distance between the true risk and the empirical risk is bounded by that. I have already mentioned what the Kullback-Leibler divergence is; now you see the bound also depends on some double Laplace transform, a double expectation with respect to the prior and with respect to the training set.
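(A hedged reconstruction of the theorem from this verbal description; it follows the general PAC-Bayes theorem of Germain, Lacasse, Laviolette and Marchand, ICML 2009, with D the chosen distance function, taken convex in the final step of the proof below. With probability at least 1 − δ over the draw of S, simultaneously for all posteriors Q:)

```latex
D\big(R_S(G_Q),R(G_Q)\big)\;\le\;
\frac{1}{m}\left[\mathrm{KL}(Q\,\|\,P)+
\ln\!\left(\frac{1}{\delta}\mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{h\sim P}
e^{\,m\,D(R_S(h),R(h))}\right)\right]
```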
>>: [inaudible] less than or equal to [inaudible].
>> Mario Marchand: Pardon me?
>>: Greater than or equal to [inaudible]…
>> Mario Marchand: It is less or equal to that.
>>: I see. With probability more than one.
>> Mario Marchand: With probability, not more than one, but at least one minus delta.
>>: [inaudible].
>> Mario Marchand: So of course, to obtain a bound this means that you need to upper bound this double Laplace transform. This is the sort of thing that we can do easily in many cases: once you are given a D, you can find an upper bound for that, and then you automatically get a bound from this theorem. So let me go through the steps of the proof, so that when I am saying that it is easy, you can say, yes, it is really easy, and this is the art of PAC-Bayes theory. Basically the idea is to consider this random variable. It is a random variable of S, and it is a nonnegative random variable because it is the exponential of something, so you can use Markov's inequality for it. So it is a random variable that is a function of S; the probability that this random variable exceeds t times its expectation is at most one over t. So I just replace t with one over delta and there you have it.
This is the basic, let's say, inequality that is used by PAC-Bayes. It is only Markov's inequality, so you would say this is very bad. In fact nobody has come up with a better bound for this random variable; I have tried to go to [inaudible] and it does not buy me anything, I get basically the same bound. So we are using Markov. This means that with probability one minus delta I get the negation of that, which is that this random variable is at most one over delta times its expectation, and I take the log on both sides. So it holds with probability one minus delta, and the next step--remember the bound holds uniformly for all posteriors--will be to convert this expectation over the prior into an expectation over the posterior. The standard trick is as follows: f of h is just this function, this function here. The first thing is to convert this expectation over P into an expectation over Q. You would say, well, the expectation over P is equal to the expectation over Q of P over Q times this, but in fact it is greater than or equal: to get P inside the expectation over Q, I just introduce Q and I divide, and they cancel, but we have to be careful where Q becomes zero, so this holds on the support of Q, and on the rest of the support of P the integrand is nonnegative. So this term is greater than or equal to this term, and basically the expectation over P can be lower bounded by an expectation over Q in this way, which is exactly what we need here: we have something greater than or equal to the expectation over Q, so it is on the right side.
The next thing is that we have a log of an expectation, which we are going to convert into the expectation of a log using Jensen's inequality. So the first step was that the expectation over P is greater than or equal to the expectation over Q of this; and then the log of the expectation is greater than or equal to the expectation of the log, because of the concavity of the log. So we have one Jensen inequality here. The log of a product is the sum of the logs, and basically here you have minus the KL divergence, because it is P over Q and the KL is Q over P, so minus the KL plus the log of this. I remind you that f of h is this complicated expression here, the exponential, so we applied this result to the exponential of m times D; this is just the application of the formula. And we are almost done, because the log of the exponential is just the argument, so you see that basically we find that the expectation over Q of m times D is less than or equal to the KL plus this term, which I identified in the theorem. But we are stuck with the expectation of D, and by using Jensen's inequality again--we assume that D is convex--the expectation of D is greater than or equal to D of the expectations.
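(A hedged reconstruction of the proof chain just sketched, with f(h) = m D(R_S(h), R(h)); the first inequality holds with probability at least 1 − δ over S by Markov's inequality, the second is the change of measure, the third is Jensen on the concave log, and the last line is Jensen on the convex D:)

```latex
\ln\!\left(\frac{1}{\delta}\mathop{\mathbb{E}}_{S}\mathop{\mathbb{E}}_{h\sim P}e^{f(h)}\right)
\;\ge\;
\ln\mathop{\mathbb{E}}_{h\sim P}e^{f(h)}
\;\ge\;
\ln\mathop{\mathbb{E}}_{h\sim Q}\frac{P(h)}{Q(h)}e^{f(h)}
\;\ge\;
-\mathrm{KL}(Q\,\|\,P)+\mathop{\mathbb{E}}_{h\sim Q}f(h),

\mathop{\mathbb{E}}_{h\sim Q}f(h)=m\mathop{\mathbb{E}}_{h\sim Q}D\big(R_S(h),R(h)\big)
\;\ge\;m\,D\big(R_S(G_Q),R(G_Q)\big)
```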
So you basically have the theorem. It is not a long proof; you see every step is there, and you have this bound, but now you need to bound this Laplace transform in order to get a bound. That is PAC-Bayes theory: Markov's inequality and two applications of Jensen's inequality, that's it. And the amazing thing is that it is simple, but it can be very, very tight.
So if for D you use the KL divergence between two Bernoulli distributions, if you use that as the distance, you can perform this expectation here. You need to swap the expectations--this is why P must be independent of S--so you swap the expectations, because otherwise it is difficult to perform the expectation over P before the expectation over S, and then you condition this expectation on the number of errors; we know that the number of errors follows a binomial distribution. I am skipping the details. We find that this quantity is in big Theta of the square root of m. So basically you have a bound; you obtain the same bound that was found by Langford and Seeger, but a bit tighter because of this square-root-of-m dependence here.
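(A hedged reconstruction of the resulting bound; kl denotes the KL divergence between two Bernoulli distributions, and the exact Θ(√m) term inside the log varies slightly between versions, so ξ(m) is a stand-in. With probability at least 1 − δ, simultaneously for all Q:)

```latex
\mathrm{kl}(q\,\|\,p)=q\ln\frac{q}{p}+(1-q)\ln\frac{1-q}{1-p},
\qquad
\mathrm{kl}\big(R_S(G_Q)\,\|\,R(G_Q)\big)\;\le\;
\frac{\mathrm{KL}(Q\,\|\,P)+\ln\frac{\xi(m)}{\delta}}{m},
\quad \xi(m)\in\Theta(\sqrt{m})
```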
So this is one of the famous PAC-Bayes bounds, found by Langford and Seeger, and this is a graphical illustration of it. The first time, the way the bound is stated is a bit confusing, but to see how it works, assume that you fix the posterior. If you are interested in some posterior and you want an upper bound for this posterior, you fix the posterior. The KL term is then a number, because the prior is fixed, so this will be a number, let's say 10% or 5%, and basically you plot this kl function in terms of the real risk when you have computed some empirical risk value. This is a plot of the function in terms of the real risk when you assume that the empirical risk you have measured is 10%. So basically this kl function, as a function of R of Q, must be less than or equal to this value, so you see that this gives you an upper bound and also a lower bound, and you know that with probability one minus delta the real risk is inside this interval. Now, if you change the distance function--remember the first argument is the empirical risk--and you look for a function which is linear in the empirical risk, which is a natural thing to do because you then have the expectation of a sum of something linear, so it is going to be expressible as a product, you find this result for the Laplace transform. And you can then choose the function so that--you see, this is e to the f to the power m, and this is also something to the power m, so you say, well, I have a number to the power m, so let's choose this number to be one, and I will have one to the power m, which is one. So if you choose this term to be one over that quantity, they cancel out, and for the Laplace transform you basically get log of one over delta, which gives you a small bound. And if you do a bit of algebra afterwards to express the risk from the result, you get this bound.
So one minus exponential of minus x is a function like that: this is x, this is one minus exponential of minus x, so it is this function here. One minus exponential of minus x is smaller than the argument of the exponential, so for simplicity suppose this term is just C times the empirical risk plus one over m times this. So this is another bound that you can obtain immediately from the general theorem, and in fact it was found by another method by Catoni in 2007. It is kind of simple: it is valid for any constant C, any positive real number C, but it is not valid uniformly for all C. You can, by a union bound argument, make it valid for K values; you just need to introduce a ln of K here.
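(A hedged reconstruction of this Catoni-style corollary, following the form in Germain et al. 2009; the slide notation is an assumption. For any fixed C > 0, with probability at least 1 − δ, for all Q:)

```latex
R(G_Q)\;\le\;\frac{1}{1-e^{-C}}
\left[1-\exp\!\Big(-\Big(C\,R_S(G_Q)+\tfrac{1}{m}\big(\mathrm{KL}(Q\,\|\,P)+\ln\tfrac{1}{\delta}\big)\Big)\Big)\right]
```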
So this bound says something simple. If you want to find the Gibbs predictor with the smallest possible bound, it will be the one which minimizes this plus this. So you end up saying that what you need to minimize is this quantity: the posterior Q that you are looking for should minimize that, which is a bit like what we postulated in the beginning. We have some measure, but it is not the risk of an individual classifier; you have to measure the average risk, the risk of the Gibbs classifier, and you have this KL which acts as a regularizer, which grows depending on how far you are from the prior. So you have a hyperparameter C whose value you don't exactly know; you have one extra parameter to tune, which is not the case for the previous bound, the Langford-Seeger case, where there is no hyperparameter to tune.
So we have two bounds: how do they compare? The corollary 2.1, which is the Langford-Seeger bound, the one with the small kl between the two Bernoulli distributions, gives you a bound which is tighter except for a narrow range of values of C, which is normally around one. So normally the Langford-Seeger bound is tighter than the Catoni bound, and the fact that the Catoni bound can be tighter at all is just because, instead of having ln of square root of m divided by delta, it has ln of one over delta. It is just because of that that it can be tighter; if it had ln of the square root of m over delta, then the Langford-Seeger bound would always be tighter than the Catoni bound.
Okay, so let's apply these bounds and let us try to minimize them for a certain class of posterior distributions. Searching the space of all posterior distributions is something very hard--there are too many of them--so let us restrict ourselves to a class of posterior distributions, which will be Gaussian distributions over linear classifiers. Each example is mapped, so it will be a learning algorithm which can use a kernel: each x will be mapped, either explicitly or implicitly, into a feature space phi. The feature space can often be given in terms of a Mercer kernel, and the output of the linear classifier is going to be described by this, the usual formula when you have a feature space phi of x: it is the sign of the scalar product of w with phi of x.
We are going to look for posterior distributions which are isotropic Gaussians centered on w: you have a Gaussian which is isotropic, so the covariance matrix is the identity, with unit variance, so it is a very simple thing, just an isotropic Gaussian centered on some w. So w is the parameter of my distribution, and I will write Q sub w; this is the density on v parameterized by some w. One nice thing about this Gaussian distribution: you have this vector w, you have a Gaussian distribution, and this region here is the space of all the linear separators. Whenever you have an example to classify, say (x, y) here, then in the space of separators it is just a hyperplane, so some of the predictors here predict +1 and some of them predict -1. So if you consider the majority vote with respect to this Gaussian, the weight would be larger on the predictors predicting -1, and this is the output of the majority vote. And you see it is exactly the same as the output of w; by symmetry the majority vote and this single linear classifier produce exactly the same output, so they are the same classifier basically. So the output of the majority vote is the same as that of the deterministic classifier which is the center of the Gaussian. This means that twice the risk of the Gibbs classifier, so a bound on the Gibbs risk, will give you a bound on the single predictor w.
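(A hedged summary of the symmetry argument just given, for the isotropic Gaussian posterior Q_w = N(w, I) over weight vectors v:)

```latex
B_{Q_w}(x)=\operatorname{sgn}\Big(\mathop{\mathbb{E}}_{v\sim N(w,I)}\operatorname{sgn}\big(v\cdot\phi(x)\big)\Big)
=\operatorname{sgn}\big(w\cdot\phi(x)\big)
\quad\Longrightarrow\quad
R(h_w)=R(B_{Q_w})\le 2\,R(G_{Q_w})
```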
Okay, this is how it works. We defined the class of posteriors; we need to compute the KL divergence, because the bound will depend on the KL divergence between the posterior and the prior. So we choose a prior centered on some other vector w_P. Most often w_P will be the origin: if you have not seen your data, there is no preferred direction, so normally w_P will just be the origin. Because these are two Gaussians, the KL divergence can be computed exactly, and this is what it gives: it only depends on the norm of this difference vector. And it is a nice quantity, because then you have a regularizer which does not depend on the dimension of the space--it would be bad if it depended on the dimension, because this feature space phi can have a very large dimensionality. But because these are isotropic Gaussians, the integrals in all the other directions cancel, so there is only one surviving direction, basically, that gives you the KL.
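(The closed form being referred to, for two isotropic unit-variance Gaussians:)

```latex
\mathrm{KL}\big(N(w,I)\,\|\,N(w_P,I)\big)=\tfrac{1}{2}\,\|w-w_P\|^{2}
```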
We need to compute what the risk of the Gibbs classifier will be; in order to have a bound you need to say what the Gibbs risk on the training set is. The Gibbs risk on the training set is just the average of the Gibbs risk on the individual examples, so you need a formula: you need to be able to compute the Gibbs risk on any single example, which means you need to be able to perform this integral. If you are not capable of finding a closed-form expression for that, then you end up needing to perform a Monte Carlo evaluation, and this would mean a very slow learning algorithm: each time you need to compute the empirical risk, which happens often in a learning algorithm, you would need to do a Monte Carlo evaluation, so this is not very interesting. Fortunately, with a Gaussian you can integrate this: suppose it makes an error here, then it is just this volume; again all the other directions cancel and it basically depends only on this quantity here.
So this is what you obtain for the Gibbs risk on a single example. It is just a cumulative Gaussian, basically, which depends on this gamma, the normalized margin between the weight vector w and the example y phi of x--a number between -1 and +1--multiplied by the Euclidean norm of the weight vector. So this is the Gibbs risk, and you can make some plots in terms of the margin. Recall that the margin is just a real number, the cosine of the angle between w and phi, and it behaves like that: it has a sigmoidal shape. It is called the probit loss.
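(A hedged reconstruction of this closed form: for v drawn from N(w, I), the Gibbs risk on a single example is a Gaussian tail in the scaled normalized margin, with Φ the standard normal cumulative distribution function:)

```latex
\Pr_{v\sim N(w,I)}\big(y\,v\cdot\phi(x)\le 0\big)=\Phi\big(-\Gamma(x,y)\,\|w\|\big),
\qquad
\Gamma(x,y)=\frac{y\,w\cdot\phi(x)}{\|w\|\,\|\phi(x)\|}
```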
So this is the probit loss. It gives you a loss function on the example which has this form in terms of the normalized margin, and as the norm of the weight vector becomes very large it converges to the 0-1 loss of the majority vote, so it makes sense to measure the performance of the Gibbs classifier in order to obtain a performance indicator for the majority vote.
If you have a large weight vector, the guarantee that you have for the Gibbs classifier with that large weight vector will be a good indicator of the performance of the majority vote. The way to understand this result is to say, well, suppose this is your weight vector w and you have some example here, your example y phi of x. If the scalar product between these two vectors is positive, then you have a correct classification--but it is a Gibbs classifier, so this means that with some probability you can draw a predictor here which misclassifies that example; this is why you don't have a perfect classification. And this is why, as the norm of this predictor increases, it becomes more and more deterministic.
So we have bounds; the bounds are expressed in terms of the empirical risk and the KL divergence, and we found analytical expressions for those. What is left is to find the predictor w which will minimize the bound. This is the Langford-Seeger bound: we are looking for the w which will minimize this, and the bound is just the largest value of epsilon for which this inequality holds. It is realized at equality, but you have two solutions--remember the plot I gave you, one gives you an upper bound and the other a lower bound--so you must choose the upper one. So basically we are going to try to find the w minimizing the bound, where the bound is given by this formula and you take the solution where B is larger than the empirical risk. That is what we are going to do. It's a pity, because you cannot express B explicitly in terms of the empirical risk and the KL; you just have an implicit definition, but you can have an explicit formula for the gradient. So the gradient can be explicit.
Okay, so for the other bound, the Catoni bound, you basically just minimize this, and if I express the empirical risk in terms of what we found, we find this expression. So we need to find the w which minimizes this, and in the absence of any information about the data we choose w_P equal to the origin, the zero vector. So basically you need to minimize an L2-regularized probit loss; this is what you need to do. You should compare that to what the SVM does: it minimizes a convex hinge loss with the same regularizer. So the regularizer used by the SVM now comes up naturally here, but instead of minimizing the convex hinge loss, you need to minimize a non-convex probit loss.
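(A minimal sketch, not the authors' code, of the kind of objective this describes: a Catoni-style trade-off C times the empirical probit (Gibbs) risk plus the Gaussian KL regularizer, minimized by plain gradient descent with random restarts, as discussed later in the talk. All names are illustrative assumptions, and the exact scaling of C versus m follows the paper only up to constants.)

```python
import numpy as np
from scipy.stats import norm

def objective(w, X, y, C, w_prior):
    """C * (empirical Gibbs/probit risk) + KL(N(w, I) || N(w_prior, I))."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)   # = Gamma(x, y) * ||w||
    gibbs_risk = norm.cdf(-margins).mean()              # probit loss, averaged
    kl = 0.5 * np.sum((w - w_prior) ** 2)               # KL between isotropic Gaussians
    return C * gibbs_risk + kl

def gradient(w, X, y, C, w_prior):
    """Gradient of the objective above; norm.pdf is the standard normal density."""
    norms = np.linalg.norm(X, axis=1)
    margins = y * (X @ w) / norms
    # d/dw of mean Phi(-margin_i) = -(1/m) sum_i pdf(margin_i) * y_i * x_i / ||x_i||
    grad_risk = -((norm.pdf(margins) * y / norms) @ X) / len(y)
    return C * grad_risk + (w - w_prior)

def pbgd_like(X, y, C, lr=0.05, steps=1000, restarts=10, seed=0):
    """Gradient descent with random restarts (the summed probit loss is non-convex)."""
    rng = np.random.default_rng(seed)
    w_prior = np.zeros(X.shape[1])
    best_w, best_val = None, np.inf
    for _ in range(restarts):
        w = w_prior + rng.normal(scale=0.1, size=X.shape[1])
        for _ in range(steps):
            w -= lr * gradient(w, X, y, C, w_prior)
        val = objective(w, X, y, C, w_prior)
        if val < best_val:
            best_w, best_val = w, val
    return best_w
```

(With a kernel one would instead work with the dual variables, as the representer-theorem discussion later in the talk indicates, but the idea is the same.)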
>>: I have a question. So the [inaudible] through here is because you choose hardly no
more than for the… So the [inaudible] through your action you're going to choose
[inaudible] on that theta [inaudible].
>> Mario Marchand: Yes. You can use prior knowledge; this is where prior knowledge can enter. You can say, I will use a Laplace prior instead of a Gaussian prior, and then probably an L1 norm would pop up out of that.
>>: So it's not even just the Gaussian, it's two spherical Gaussians that came from the KL
divergence [inaudible] so it is a very special thing.
>> Mario Marchand: Yeah. But if you are using a Gaussian for the posterior [inaudible] it is natural to use it also for the prior. If you used a Laplace prior then you would probably use something else for the posterior, something that has a nice KL expression.
>>: But isn't it that one of the motivations was that when you present the traditional machine learning framework you say, I pick the regularization to be this, and you ask why do you use L2, why not L1, why not L17?
>> Mario Marchand: This arbitrary [inaudible] wasn't here with the choice of the prior.
>>: Exactly, so you can kind of go back and forth between these and just turn the
regularization term into a prior right?
>> Mario Marchand: Yep.
>>: But you did that anyway. There we say that the regularization was just [inaudible]
the prior and you're doing map estimation, so it's not like it's anything new.
>> Mario Marchand: Anyway, this is what we find. You see, this is what we should minimize, and the hinge loss is not quite that: it is a convex relaxation, but it is not an upper bound, because it can go to zero; it can be larger on some parts of the probit loss and lower on others. We have performed gradient descent on these bounds. For the Langford-Seeger bound you can compute the gradient--I am skipping the details, it is not very interesting--and you can compute the gradient for the Catoni bound: basically you take the differential of this, and if you compute the gradient in terms of w you have this part immediately. You will see that I will sometimes choose to center the prior around some other w; you will see why in a few minutes.
Okay, so why do we use gradient descent? Well, we have a problem here: the probit loss on a single example is not the problem, because it is quasi-convex, which means that its level sets are all convex. The problem is that we have a sum of quasi-convex functions, because you sum over all of your examples and for each example you have a probit loss, and a sum of quasi-convex functions is not guaranteed to be quasi-convex anymore. In fact the empirical risk is not quasi-convex, and because of that the bound does have several local minima. We have seen that on the data: it has several local minima and you can be stuck in some of them, and this is especially true if you take the Catoni bound with the hyperparameter C; when C is large the problem gets more important.
So nevertheless we say, well, the theorem says that we should minimize something which is not convex, so let's try to do it. We tried to do it just by doing a lot of random restarts, doing gradient descent with random restarts, and we increased the number of random restarts for large values of C because we had more local minima. And this is what we did basically. Sorry, I am stuck. Okay. So each algorithm--with these functional expressions you have an empirical loss and an L2 regularizer, so the representer theorem applies and you can write w: the representer theorem for these functions to minimize says that w must lie in the linear span of the examples. So basically we can work either with the primal variables or the dual variables, the alphas, whenever we have a kernel. We tried both versions: the primal version we compared with AdaBoost, and we used decision stumps--AdaBoost seems to perform generally okay with decision stumps, so we compared with AdaBoost--and for the dual version we used the RBF kernel and compared it with the soft-margin SVM, which is the gold standard.
We proposed three learning algorithms. PBGD1--this is for PAC-Bayes gradient descent 1--uses a non-informative prior centered at the origin, and we minimize the Langford-Seeger bound with confidence parameter delta equal to 0.05; actually, it is not very sensitive to the choice of delta. Remember, the Langford-Seeger bound has no hyperparameter in it, so basically you don't need to trade off the empirical risk against the regularizer; it is given in the bound. We tried that. The other thing we tried, version 3 of PAC-Bayes gradient descent, is to say, well, perhaps the regularizer weighs too much, so basically we decided to minimize the Catoni bound but for different values of C--not necessarily the one that gives the smallest risk bound, but several values--and we choose the value by cross validation. So this is sort of okay; let us try to see if this can improve on the other.
PBGD2 is quite different. In fact this is a true bound-minimizing algorithm: at the end you have a predictor and you really have a guarantee that is valid for that predictor. With PBGD3 you are cheating a bit: you will have a predictor, you can compute the bound, the bound is valid for that predictor, but you are not necessarily using the predictor with the smallest risk bound; you rely more on cross validation to select your best predictor. But here, basically, is another way of getting a risk-bound-minimizing algorithm. You say, well, I am going to use half of my data to learn a good prior; the only use you are going to make of the first half of the data is to learn a good prior. The priors are isotropic Gaussians centered on some w, so I am going to try to find a set of w's which would give me good priors. So basically we minimize the Catoni bound with different values of C, ranging over one, 10, 100--we try very large values--and this will give me several solutions.
>>: [inaudible] prior [inaudible]?
>> Mario Marchand: Yep.
>>: But now you are [inaudible]. So are you going to…?
>> Mario Marchand: It's not the correct way of saying it, but basically you take the first part of your training data and you just minimize the second bound, the Catoni bound, with a uniform prior. This gives me a set of predictors, and now for the second half of the data I will put the priors centered on those. So they are independent of the second half of the data.
>>: [inaudible] distributions the first time.
>> Mario Marchand: Yes.
>>: So you can't swap the expectation. It is not legal to swap the expectations
[inaudible]?
>> Mario Marchand: Yes, yes, because it is not the same.
>>: It is conditioned on [inaudible], you know what I mean? You have conditioned on it
because…
>> Mario Marchand: Yes. But as long as you are computing the empirical risk on data that the prior did not see, the second half of the data, it is okay.
>>: [inaudible].
>>: Right, but he is taking it [inaudible] to get the expectation over drawing samples from D and he is swapping it with the…
>>: The prior can depend on D; it just can't depend on S.
>> Mario Marchand: Right.
>>: Or it can depend on D in any way you want.
>>: Okay.
>> Mario Marchand: Yeah. So for the second half of the data we basically do the same thing as PBGD1: we minimize the Langford-Seeger bound, with no hyperparameters, for priors centered on the solutions that we found on the first half of the data, and we keep the solution with the best bound. So this is really a true bound-minimizing learning algorithm, but where we can cheat on the first half of the data to find a better prior than just the non-informative prior. But we lose something by this process, because the empirical risk is going to be computed only on half of the training data.
>>: So is there a reason for choosing half as opposed to a quarter or any other number?
>> Mario Marchand: We tried other fractions and I don't know why, but half is the best; half gave the best result.
>>: You are doing two things here, right? One is you are trying different Cs and the other is you are splitting the data into two halves. So if I were to just use the two-halves trick--say I use the Langford-Seeger kl bound and I just learn one W on half the data, and then on the second half of the data I learn another W which is regularized to be close to the first one.
>> Mario Marchand: It does not buy you anything, because basically you will obtain the same solution as if you had trained on all of the data with the non-informative prior.
>>: But my bound in the second minimization is now tighter, because I am assuming that if I train on half of the data and then train on another random half of the data, it is not going to change the solution by much; it is just the variance of the data.
>> Mario Marchand: Yeah, yeah, the bound will be smaller. I guess.
>>: You are regularizing toward something which is going to be almost exactly the same, so I am just saying, you are doing two things here. The second thing, optimizing over C, maybe is not needed.
>> Mario Marchand: Well, this is a way where basically you can attain very large weight vectors. It is not clear what it would give you; I don't recall exactly what it gave, it bought us nothing in the end. But to say that you will obtain a better bound--the empirical risk might still be quite large. It might be the case that you find a weight vector with a small norm and, because of this probit loss, you allow quite a large empirical risk, say 15%, and this will contribute to the bound. Here, basically, by having large weight vectors, the empirical risk that you're going to find in the second round is going to be smaller; the weight vector is going to be close to one of these very large ones and will just adjust slightly around it. So you will have a small empirical risk and a small KL divergence, and basically the solutions preferred by this trick were generally the ones with large weight vectors.
So PBGD1 and 2 are true bound-minimizing learning algorithms, but not this one, basically: you rely on cross validation for the hyperparameter, so you are minimizing a functional which is inspired by the bound--the quantities are there, you are using the probit loss, you are using the L2 regularizer--but for the relative weight you are relying on cross validation.
This is PBGD3, so let's compare these results. We don't have a lot of data as at Microsoft Research, so we rely on what is publicly available, the UCI data sets and so on. The data are split depending on the data set, but normally it is half and half: we basically train on the first half, then for testing we compute the empirical risk on the remaining half of the data, and we performed the binomial tail inversion test of Langford to determine whether the difference in empirical risk is statistically significant or not.
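(A hedged sketch of the binomial tail inversion referred to here, assuming the usual Langford-style procedure: invert the binomial tail to get upper and lower confidence bounds on each classifier's true test risk, and call the difference significant when the intervals are disjoint. Function names and the bisection depth are illustrative.)

```python
from scipy.stats import binom

def binomial_upper(k, n, delta, iters=60):
    """Smallest p with Pr[Binomial(n, p) <= k] <= delta, found by bisection."""
    lo, hi = k / n, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binom.cdf(k, n, mid) <= delta:
            hi = mid
        else:
            lo = mid
    return hi

def binomial_lower(k, n, delta, iters=60):
    """Largest p with Pr[Binomial(n, p) >= k] <= delta, found by bisection."""
    lo, hi = 0.0, k / n
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if binom.sf(k - 1, n, mid) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

def significantly_different(k1, k2, n, delta=0.05):
    """True if the confidence intervals on the two true test risks are disjoint."""
    i1 = (binomial_lower(k1, n, delta), binomial_upper(k1, n, delta))
    i2 = (binomial_lower(k2, n, delta), binomial_upper(k2, n, delta))
    return i1[1] < i2[0] or i2[1] < i1[0]
```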
>>: That is not a matched test; it is just some sort of [inaudible]. It is not like a
[inaudible] test is just some sort of [inaudible] don't know what the paper said but it is
just [inaudible].
>> Mario Marchand: So basically you compute the test upper bound and the test lower bound, and if they are disjoint the difference is statistically significant; otherwise it is not. So these are the results in the primal case, with decision stumps. The bold entries are the ones with better results, but they don't show up quite clearly, unfortunately; I should have put some colors in it.
So this is the result for AdaBoost: the risk computed on the testing set, and the bound--you can compute the risk bound that we had for AdaBoost. This is PBGD1. For PBGD2, if you look at the bound, we obtain better bounds; for instance on mushroom it is 3% for the bound, compared to 13% for PBGD1, so by this trick of using half the data we can obtain really lower, better bounds. And for PBGD3 the guarantees are worse, but the actual results seem better; overall I would say that this is the winner. The statistically significant differences are here; there is almost no statistically significant difference except in these cases, for instance AdaBoost is statistically worse than PBGD1 and PBGD2 here, and vice versa there. And here, three is better than one, and so on; two and three are better. The worst one actually is PBGD1: just doing straight risk-bound minimization with a uniform prior gives you generally much worse results than PBGD2, and PBGD3 is still an improvement. If you rely on cross validation for the trade-off, it gives you better results, which are quite competitive with AdaBoost.
>>: Any idea, or a sense, of how much of this might be an optimization problem? The reason I am thinking that is because [inaudible] in the context of AdaBoost suggested using something very similar to your loss, which is a [inaudible] but it's almost the same. He was getting better results [inaudible] exactly, but I just don't know if it could be the optimization [inaudible] or could be the loss [inaudible]. You have tried multiple restarts, right?
>> Mario Marchand: Yes that's what we tried. It is not indicated in their paper what they
have tried.
>>: AdaBoost [inaudible].
>> Mario Marchand: It's just that our [inaudible] AdaBoost, yeah.
>>: The different optimization could be a different optimization [inaudible]. It
>> Mario Marchand: Yes but you will end up in some local minima unfortunately.
>>: That is true.
>> Mario Marchand: And there is no regularizer, but you stop boosting after, let's say, 200 rounds.
>>: [inaudible] the main [inaudible] was to say we will get a [inaudible] bound and try to
optimize [inaudible] so this evidence doesn't exactly support it.
>> Mario Marchand: Yes. It is still better, basically, to tune the weight between the regularizer and the empirical risk by cross validation; generally it is better. This means that the theorem somehow is too…
>>: [inaudible].
>> Mario Marchand: Yeah, but still, PBGD3 is a learning algorithm that is inspired by the bound, because we are using the loss that was given by the Gibbs risk, and we are using the regularizer that came out of the Gaussian, which is just the standard L2 thing. So it is, let's say, a bound-inspired learning algorithm, but quantitatively you need to tweak the balance. That is that. So these are the results we have with the RBF kernel; we compared with the SVM, and it is the same sort of picture. This is the worst result, these are better, and this is still better. What is nice is that the bound can be quite low here, basically 1%: this is a true risk bound of 1%, and it is doing better than that, better than 1%. And still it is nice to see that on Adult the bound is sort of quite close to the actual risk. Yes?
>>: Going back to John's point again about [inaudible], what, I don't know if this is going to make sense, but can you take the weights from the SVM and apply them to see what bound you would come up with if you…
>> Mario Marchand: Yes. Well, we can take the solution of the SVM and compute the bound, the same bound that we…
>>: So that's the bound that…
>> Mario Marchand: Yeah, yeah, yeah. They are worse.
>>: Okay.
>> Mario Marchand: The bounds say, well you shouldn't pick this solution; you should
in fact pick this one which is better, but this one is even better. It has a higher bound.
>>: Optimization is finding a lower bound than this [inaudible], yeah, okay.
>> Mario Marchand: Yeah, for sure the bounds here are worse than this one.
>>: I see.
>> Mario Marchand: Okay.
>>: Back to that. So it seems as if, you know, there is still a slackness, you know, in the bounds. Do you have any feeling for where the source of the slackness is?
>> Mario Marchand: Well, the regularizer seems somehow too strong. It is difficult to get rid of it, and yeah, in this setting I could not get better bounds than that. This is really…
>>: And then when you look at the choices of the PBGD3, you see that it gives smaller
weight to the…
>> Mario Marchand: Oh yes, these are typically the best results: you multiply the empirical risk by 1000 and this is pretty good. So this means that the weight on the regularizer is just too strong. But PBGD2 sort of knows that: I'm going to cheat on my first half of the data by finding large-weight solutions, and then on the second half the regularizer is smaller.
>>: [inaudible] maybe I am computing the bound wrong, but I thought what the bound really says is that if you were to keep sampling test data, you will be above the bound a 5% [inaudible]. It doesn't tell you anything about what your average risk is. So I don't think that it is really surprising that you're finding, through your test set, that you run the [inaudible] and you got something that is [inaudible] bound, because the bound is talking about your tail loss, right?
>> Mario Marchand: Yes, but this is the, you need to bound the tail. You need to bound
the worst case…
>>: But the results you show are not, you are evaluating the average [inaudible] that you
showed, but the bound is giving you [inaudible] the worst case so I don't think that there
may be a [inaudible] because you have to run this many times to see…
>>: Would you say that maybe some of these are harder seem to be performing better
than actually have larger buy-ins and we don't see that…
>>: We don't see that because you only get one test set.
>>: But another way to look at that would be to say that the examples that you tested on were too easy, and that you could perhaps create some synthetic data sets that would be much harder for the standard algorithms, where your bounded algorithms would do reasonably well and you would know how well they did. The distribution of examples should be harder.
>> Mario Marchand: I don't know.
>>: Your proofs are for any distribution of samples, right? The letter a compared to b, that's not a particularly tough distribution of samples, whereas maybe you could make one that would be much harder.
>>: [inaudible] for all D and all Q [inaudible] posterior and vary distribution and data,
specific the worst ones and maybe the ones that I have and not the worst ones…
>>: Yeah, I don't think those are that bad so this is getting back to that thing that maybe
there really isn't slack there. You assume that it could be [inaudible].
>>: Yeah, but the other thing is how much to lose by this, by the fact that I worst case
and the answer is, you know, one more step closer to the answer. [laughter].
>> Mario Marchand: Going from the Gibbs risk to the majority vote.
>>: There's too much [inaudible] [laughter].
>> Mario Marchand: Okay. Conclusion. Basically PBGD2, the second one, is better than the first one, so using half of the data to learn a prior really helps. PBGD3 is better than the others--I shouldn't say definitely better, it is a bit better--but I think it has a true advantage; the bound is a bit worse than cross validation for finding the proper trade-off. PBGD3, which is inspired by the bound--you take the quantities that came out of the bound--seems to be very competitive with AdaBoost and SVM, but they are much slower because you have several local minima.
What we have done afterwards is to convexify: we have this probit loss, and we have all of these local minima problems, so let's just convexify it with a logistic loss. You take the slope here and you go linear there. So this is somewhat different from, let's say, the hinge loss, but…
>>: That is logistic [inaudible].
>> Mario Marchand: Very similar to L2-regularized logistic regression, and if you do that you find almost identical results, very, very similar. I don't show the numbers, but it is almost the same thing. This was a bit disappointing: I was expecting to obtain much better results because of the non-convexity, saying, well, this is what you have to do, you have to work harder, but at the end you will find better solutions. And basically no, you can just use the logistic loss instead of the probit loss and…
>>: I think Carl is right. Because I think that usually the excuse for having these non-convex loss functions that go to zero derivative for very [inaudible] is when you have a lot of label noise in your data. In fact it is like an adversary coming in trying to smash you, and that is why you are trying to minimize this [inaudible]. So if your data had no adversary--I think you should go to the [inaudible] list and put in like maybe 10%…
>> Mario Marchand: We have tried putting in classification noise and there was no improvement from that. But somebody at ICML or NIPS, from Google, gave a talk on a similar experiment, and he sort of imbalanced the noise between the positives and the negatives, and then he saw a difference, but we didn't do it; we didn't think about it. It was better to use the non-convex one in that case.
>>: So you said the sum of quasi-convex functions [inaudible], and here you are approximating each quasi-convex function by a convex function. So if you could do that for the sum, maybe that would give you a better convex function, although probably not; it might be more complicated than [inaudible].
>>: But not for each individual one.
>>: [inaudible].
>>: No. So you can take a convex, I know for example, the convex [inaudible]. Maybe
if you…
>>: The sum of many [inaudible].
>>: I do not know if that would work well but I was just wondering.
>> Mario Marchand: That would be a lot of hard work to do. I don't know I never
thought about it.
>>: [inaudible].
>> Mario Marchand: So anyway, the second one seems the best choice if you want to obtain a good guarantee. So that is this part of the work. Now the second part of the talk--how much time do I have? [laughter] Very little. Shucks, okay.
So the second part--how many minutes? 10 minutes? Okay, I will try to go fast. This is much more recent work. We have a set of classifiers that we call sample-compressed classifiers. What is a sample-compressed classifier? Well, a sample-compressed classifier is described by a subset of the training set that we call the compression set, and a message, which here is going to be a real number between -1 and +1, plus an information bit, plus or minus. These are not what you would usually call messages, but it is the standard terminology in sample compression. And we are going to use--we will see why--a set of classifiers that is closed under complementation: for each classifier, its negation (minus one times it) is, by definition, also in the set.
Let me give you an example of a set of sample-compressed classifiers: the ones made from a single example. So this is an example of a sample-compressed set of classifiers described by only one training example, and I am going to use any bounded similarity function for that, normalized to be between -1 and +1. Each classifier is going to be described by a single example and a message sigma, such that h-plus here--this is the definition--is going to output +1 if sigma is less than the similarity value at x; otherwise the output is -1. You can write it compactly like that, as the sign of this predicate: if the similarity function, evaluated between x and the example used to describe this classifier, is greater than sigma, then I am outputting +1, otherwise I am outputting -1.
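(A minimal sketch of such a single-example classifier, under the assumptions just stated: any similarity function bounded in [-1, 1], a message sigma in [-1, 1], and an information bit. The names and the RBF choice are illustrative, not from the talk.)

```python
import numpy as np

def rbf_similarity(a, b, gamma=1.0):
    """One choice of bounded similarity (here in (0, 1]); any function into [-1, 1] works."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def sample_compressed_classifier(x_i, sigma, bit, similarity=rbf_similarity):
    """Classifier described by one training example x_i, a message sigma in [-1, 1],
    and an information bit in {+1, -1}: h(x) = bit * sign(similarity(x_i, x) - sigma)."""
    def h(x):
        return bit if similarity(x_i, x) > sigma else -bit
    return h

# The Boolean complement of h-plus is obtained by flipping the information bit.
h_plus = sample_compressed_classifier(x_i=[0.0, 1.0], sigma=0.3, bit=+1)
h_minus = sample_compressed_classifier(x_i=[0.0, 1.0], sigma=0.3, bit=-1)
```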
And so the Boolean complement of this is just going to be -1 up to the plus. So I can
produce, so this is an example of simple compress classifier made of a single example.
You could take pairs of the example and define a hyperplane. You could take a whole
bunch of example and run in SVM and this will output a classifier. So basically it is a
simple compress classifier where the compression set is the set of examples that you have
used. So we are going to be interested in building a majority vote of compress classifier
and we want to attain a PAC-Bayes risk bound for majority vote of simple compressed
classifiers.
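To make this concrete, here is a minimal Python sketch of such a single-example sample-compressed classifier, h(x) = s * sign(k(x_i, x) - sigma); the Gaussian similarity used below is only an assumption for illustration, since any bounded similarity function would do:

import numpy as np

def similarity(xi, x, gamma=1.0):
    # Hypothetical bounded similarity; any k with values in [-1, +1] works.
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def sample_compressed_classifier(xi, sigma, s):
    # Classifier described by one compression example xi and a message (sigma, s).
    def h(x):
        return s * np.sign(similarity(xi, x) - sigma)
    return h

# The Boolean complement is obtained by flipping the information bit s.
xi = np.array([0.3, -1.2])
h_plus = sample_compressed_classifier(xi, sigma=0.5, s=+1)
h_minus = sample_compressed_classifier(xi, sigma=0.5, s=-1)
x = np.array([0.4, -1.0])
assert h_plus(x) == -h_minus(x)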
So basically each sample-compressed classifier is going to be described by a vector i which points to the individual examples used for defining this classifier. We have a prior on this set of sample-compressed classifiers, and likewise any posterior, and such a distribution will be written like that: a distribution on the subset of the training set that is used to construct your classifier, and a distribution on the message given the compression set. Okay. And we are interested in this guy here, the majority vote of these sample-compressed classifiers. We want a risk bound, and we want to design an algorithm that minimizes the risk bound. So consider again the case where each compression set is made of a single example. We are going to use a posterior like that because it is quite natural: you have a distribution on the index i, so each example can make a single classifier, and a distribution on the sign, that is, for each example you have a pair of classifiers, one of which is the Boolean complement of the other. In fact you have continuously many classifiers per example, because you can have several values of sigma, so we are going to use the uniform distribution for sigma over minus one to plus one. And I am going to use this weight, which will appear in the expression of the majority vote, and which is the weight assigned to that example: the difference between the plus and the minus weights. So the weight is bounded by this value, but it can be positive or negative.
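As a rough sketch of this parametrization (the particular numbers below are hypothetical), the posterior can be stored as a distribution over training indices plus a conditional over the sign bit, with sigma uniform over [-1, +1]; the effective weight of example i is then the difference between its plus and minus weights, and it lies between -Q(i) and +Q(i):

import numpy as np

m = 5                                    # number of training examples
Q_i = np.full(m, 1.0 / m)                # distribution over compression indices
Q_plus_given_i = np.array([0.9, 0.2, 0.5, 0.7, 0.1])  # hypothetical conditionals

w = Q_i * (2.0 * Q_plus_given_i - 1.0)   # w_i = Q(i) * (Q(+|i) - Q(-|i))
assert np.all(np.abs(w) <= Q_i + 1e-12)  # bounded, but positive or negative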
These are interesting because a majority vote of sample-compressed classifiers can basically produce any SVM classifier: pick any SVM classifier and you can express it as a majority vote of sample-compressed classifiers of compression size at most one. You can see this if you just look at the output, for any x, of a majority vote of sample-compressed classifiers made out of a single example. So this is the expectation: you sum over all possible examples used for the sample-compressed classifier, sum over the sign, integrate over the uniform distribution of sigma, and what comes out is this expression at the end. It is the sign of a weighted combination, so basically you can generate any weighted combination like that. The w_i are bounded, but it doesn't matter, because given any classifier you can always divide by the largest weight; because of the sign, you can always renormalize the weights.
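A quick numerical check of the key step, namely that for sigma uniform on [-1, +1] the expectation of sign(k - sigma) equals k, so the Q-weighted vote of single-example classifiers reduces to the sign of a weighted sum of similarities (a sketch with an arbitrary similarity value, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)
k_val = 0.37                             # some similarity value in [-1, +1]
sigmas = rng.uniform(-1.0, 1.0, size=1_000_000)
estimate = np.mean(np.sign(k_val - sigmas))
print(estimate)                          # approximately 0.37, matching k_val

Averaging this over the posterior weights w_i therefore yields a decision function of the form sign(sum_i w_i * k(x_i, x)), which is exactly an SVM-style predictor with an arbitrary similarity measure.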
So these majority votes of sample-compressed classifiers of size one are interesting because they contain all of the classical SVM classifiers. But here we can use any similarity measure, so we are not restricted to positive semi-definite kernels, and we have a bound which is valid no matter what the similarity measure is. Now, the usual way to bound the majority vote is through the Gibbs risk, but here that is not good, because in the case where the sample-compressed classifiers are made of a single example, they are very weak classifiers. They are not strong classifiers; they will almost always have an error rate which is close to one half. So the Gibbs error rate will be close to one half, and it is not a good measure of performance, because the Gibbs error rate will be close to one half even if the risk of the majority vote is zero.
So bounding the Gibbs error rate does not inform you at all, in this case, about the performance of the majority vote. We need something other than the Gibbs risk to monitor the performance of the majority vote. And so to see what we need to use, let us
consider what would be the expression of the Gibbs risk. So this is the expression of the Gibbs risk: it is just the expectation of the true risk, and basically you can convert this indicator function in terms of the margins. Since y times h(x) is either +1 or -1, if it is +1 the error is zero, and if it is -1 it gives you an error of one. So here I am going to define the margin of Q on example (x, y) to be this expression; it is the usual notation. Basically the risk of the majority vote is at most twice the Gibbs risk, which is just one minus the expected margin, up to a factor of one half. But one minus the margin is a linear function of the margin, which has nothing to do with the step function of the margin that defines the Bayes risk.
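These two relations, that the Gibbs risk equals one half of one minus the expected margin for plus/minus-one voters, and that the majority-vote risk is at most twice the Gibbs risk, can be checked numerically; the voters and posterior below are purely hypothetical:

import numpy as np

rng = np.random.default_rng(1)
n_voters, n_examples = 7, 1000
Q = rng.dirichlet(np.ones(n_voters))                      # posterior over voters
votes = rng.choice([-1, 1], size=(n_examples, n_voters))  # h_j(x_i) in {-1, +1}
y = rng.choice([-1, 1], size=n_examples)

margins = y * (votes @ Q)                                 # M_Q(x_i, y_i)
gibbs_risk = np.mean((1 - y[:, None] * votes) / 2 @ Q)    # expected 0-1 loss of a random voter
mv_risk = np.mean(margins <= 0)                           # 0-1 risk of the majority vote

assert np.isclose(gibbs_risk, (1 - margins.mean()) / 2)
assert mv_risk <= 2 * gibbs_risk + 1e-12                  # Markov's inequality on 1 - M_Q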
So this is the Bayes risk, the risk of the majority vote, the 0-1 step function of the margin, and this is one minus M_Q, the linear function. Basically it does not make any sense to bound the Bayes risk with this linear function if most of your examples are around here, that is, if they have small margins. So what we are going to use instead is a loss function like that, which is a quadratic function centered on some margin value that can basically be chosen freely. If you find a distribution for which you know all examples have small margins, then the quadratic risk is a better measure than this linear risk in that case. So we are going to use this, and a bound on the expected loss for this quadratic risk also provides a bound on the risk of the majority vote. If you used a linear function that is more inclined than this one, it would not work, because it becomes negative, and you have to upper bound the risk of the majority vote.
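One possible form of such a quadratic surrogate (an assumption for illustration, since the slide is not reproduced here) is ell_mu(m) = (1 - m/mu)^2 for a chosen margin value mu. It upper-bounds the 0-1 step loss of the margin everywhere on [-1, +1], whereas a steeper linear loss such as 1 - m/mu goes negative for margins above mu and so cannot upper-bound the majority-vote risk:

import numpy as np

mu = 0.2
m = np.linspace(-1.0, 1.0, 2001)
step = (m <= 0).astype(float)            # 0-1 loss of the majority vote
quad = (1.0 - m / mu) ** 2               # quadratic surrogate centered on mu
assert np.all(quad >= step)              # valid upper bound on the step loss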
So this is the bound that we have. The bound is actually valid more generally than just for a quadratic loss: it is valid for any loss function that has a Taylor series expansion of finite degree around the zero margin. So this is the result that we have, basically for all posterior distributions Q aligned on the prior P, and I will tell you what aligned means. The risk of this loss function, which could be the quadratic loss, and which also bounds the risk of the majority vote, is less than the empirical measure plus a confidence-interval term; there is no regularizer, but there is this alignment constraint. In the confidence interval you have the square root of m minus d instead of m, where d is the size of the compression sets that you are using. If you are using a single example, d is one; you might use two or three, and this will give a deterioration of the bound, as will the degree of the loss, which is two if you are using the quadratic loss. It is kind of a nice bound because it applies to any sample-compressed classifier defined in terms of a similarity function: you are not restricted to Mercer kernels or whatever, and the similarity can even be nonsymmetric. Okay. There do exist risk bounds that apply to non-PSD kernels, but they have an empirical risk which must be measured on part of the data: they construct the classifier on some part of the data and measure the empirical risk on the other part, whereas ours is computed on all of the data. So basically this is a better bound than what has appeared before. So we have been minimizing, okay, okay, a few words on what does aligned mean?
>> Ofer Dekel: I think we need to wrap up, so, quickly.
>> Mario Marchand: So aligned means just that for each pair of complemented classifiers, the total posterior weight on the complemented pair is always equal to the prior weight. So the only thing the posterior can do is shift weight within each pair. And even with that constraint, you can produce any majority vote, so it is basically not a real constraint, but it gives you a good bound, and it has no KL divergence. This is the proof; I am just going over it. So basically we have used this bound to derive an algorithm, and it gives very good results compared to SVM, for instance. This is the aligned case. There is also the nonaligned case, where you have a KL divergence, and it gives good results, state-of-the-art results in fact, so we are pretty satisfied with that. I am going to…
>>: [inaudible] classification. I saw you guys [inaudible] tested [inaudible]. What is the net result? Is it different…
>> Mario Marchand: The solutions are different. Here, with the L-infinity constraint that we have, the weights are restricted to lie between minus one and plus one, whereas there you basically have an L2 regularizer.
>>: [inaudible].
>> Mario Marchand: Yes, it is in a box. Okay, so let me wrap up by saying that we have proposed a PAC-Bayes bound which compares favorably to currently existing bounds because it applies to similarity measures that need not be symmetric or positive semi-definite, and minimizing the risk bound really gives state-of-the-art results. It is pretty good. The work can be extended to multiple kernel learning; that is what we are working on, along with some new work on sample compression that we are doing right now. We also want to apply this to random forests, because a random forest is also a sample-compressed classifier where the empirical risk you compute is basically the out-of-bag estimate, so we want to approach random forests from a PAC-Bayes point of view. I am also quite involved in PAC-Bayes structured output prediction, and basically we want to apply that to peptide binding prediction: you have a protein and we want to predict which peptide will bind strongly to a given protein. So thanks a lot for your attention, and I am sorry to have exceeded my time. [applause]