>> Michel Galley: So let's get started. It's my pleasure to introduce David McAllester, who is a Professor and Chief Academic Officer at the Toyota Technological Institute at Chicago. Prior to that he was a faculty member successively at Cornell and MIT, and then he was a researcher at Bell Labs and AT&T Labs. He's an AAAI Fellow and has published in a lot of different areas: machine learning, of course, but also natural language processing, vision, AI planning, automated reasoning, and a few others that I have missed. I just want to mention also that there are options to meet with David. I'll send a signup sheet around, so fill out the sheet, or talk to me, or send me an e-mail, and we can certainly arrange something. So now let David speak.
>> David McAllester: Okay. It occurs to me that we may have a problem with this talk in that there may be people in the audience who have seen it before. There was a workshop on speech and language a year ago in Bellevue, and I gave this talk there. How many people have seen this talk before? Has anybody seen this talk before? Okay. At least the people who have seen it before didn't come. Okay. So we'll go ahead with this. This is joint work with Joseph Keshet at TTI. This is a theoretical talk. How many people have ever published a paper whose point was a generalization bound, at COLT or some similar conference on learning theory? Okay. So this is a theoretical talk, but not necessarily a theoretical audience. I often give this talk to people who don't do natural language processing. I know there are a lot of NLP people here. But my favorite example, which I try to use as a running example in this talk, is actually machine translation. So in machine learning we're often interested in binary classification: does a patient have cancer or not? Learning theory typically studies the problem of binary classification theoretically. But what I'm going to be interested in here is the study of what we call structured labeling. My favorite example of structured labeling is machine translation. We have some input, a structured object like a sentence, and we're interested in not necessarily labeling it but decoding it into a sentence in some other language. So we can think of a decoding problem as a problem where we're given some input X and we want to produce some output Y, where the input X is a structured object like a sentence and the output Y is a structured object like a sentence in another language. And the way we're going to do the decoding is by optimizing a linear score. So we're going to say: find the output, the decode, which maximizes a linear score, which is an inner product of a weight vector and a feature vector. If you're familiar with machine learning theory, support vector machines, and kernels, you realize that almost any kind of scoring function can be represented this way by making the feature vector sufficiently elaborate. You can make it have all kinds of nonlinear features. Even though this is a linear score, the feature map can be nonlinear, so you can represent arbitrarily complicated things this way. Okay. So this is our decoder: it produces the output maximizing this linear score. What I'm going to be talking about is training the weight vector. We're going to hold this feature map fixed and look at the training problem. The training problem is the problem of setting the weight vector.
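(To make the setup concrete, here is a minimal sketch of such a decoder in Python. The candidate set, feature map, and weight vector are all hypothetical toys; a real MT decoder searches an exponentially large space rather than enumerating candidates.)

```python
import numpy as np

def decode(w, x, candidates, phi):
    """Return the candidate y maximizing the linear score <w, phi(x, y)>."""
    scores = [float(np.dot(w, phi(x, y))) for y in candidates]
    return candidates[int(np.argmax(scores))]
```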
We would like to set the weight vector such that it minimizes an expectation over input/output pairs. So I'm going to assume here we've got a corpus of translation pairs -- input/output pairs -- and we've got something like the BLEU score that we're interested in. Or rather, I'm going to use a loss that we're minimizing. What we would like to do is minimize the expectation, over drawing a new reference pair, a new translation pair, of the loss between the reference translation and the translation produced by this decoder.
>>: Why is it reasonable to assume that you have this distribution over input/output pairs?
>> David McAllester: It's a standard COLT assumption that you have a distribution over XY pairs. Intuitively, there are large corpora of translation pairs and they were produced somehow.
>>: But unlike the case of classification, where you have a target -- here these are the translations. So it's --
>> David McAllester: The way I think about it -- it's sort of a cynical way to think about it -- is that we want to win the competition. Right? We know that the competition is going to have translation pairs that we're going to get scored on, and it's probably reasonable to assume that the training pairs are drawn from the same distribution that our evaluation pairs are going to be drawn from, and we want to win.
>>: Wouldn't this [inaudible] instead of the Ys you have the true translations of the Xs -- that's what you have in the corpora. You don't have a random [inaudible] translation [inaudible].
>> David McAllester: All I'm going to need here is some distribution over XY pairs. I don't need that there's a true Y.
>>: Any translations --
>>: But the Y here is not sampled across the entire distribution of Ys. It's very biased -- the Y part is the one that's really biased, as it would be for a random language [inaudible] -- but Y is going to be specific to reasonable translations, close translations.
>>: It's a translation of the Xs.
>> David McAllester: So this is a pair sampled from some corpus of translation pairs. I build a big corpus of translation pairs. I select some as train and some as test.
>>: Why does that seem only to be -- is this all of the good translations?
>> David McAllester: Whatever your translation pair data is like. Right. So this is a gold standard translation, a reference translation. Okay. So what we would like to do is find the W that produces the best performance at evaluation time. Now, the problem is that with the typical way of approximating this, there's a problem of overfitting. Right? If my corpus is not large enough, I can make my training performance very good, but my test performance is going to be bad because I overfit. And the way overfitting is typically controlled is by adding a regularization term. So the training algorithm minimizes a sum over the training data of the loss on the training data, plus a regularization term to drive the norm of the weight vector down, and what we're going to be doing is giving theorems that justify this kind of structure -- that give us generalization properties. Okay, now there's a fundamental problem. I've written this as L sub S; the S stands for surrogate. If I literally use the training loss, this decoding is insensitive to scaling W. I can take W and scale it down -- just multiply it by epsilon -- and it doesn't change the decoding at all. But it completely eliminates this regularization term.
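(A toy check of this scaling point, with a made-up feature map and candidate set: multiplying W by epsilon leaves the argmax decode unchanged while the regularizer collapses.)

```python
import numpy as np

phi = lambda x, y: np.array([x * y, y, 1.0])   # hypothetical feature map
candidates = [-1.0, 0.5, 2.0]                  # hypothetical decode space

def decode(w, x):
    return max(candidates, key=lambda y: float(np.dot(w, phi(x, y))))

w, eps = np.array([0.3, -1.2, 0.7]), 1e-3
assert decode(w, x=2.0) == decode(eps * w, x=2.0)  # same decode either way
print(np.dot(w, w), np.dot(eps * w, eps * w))      # regularizer shrinks by eps**2
```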
So, regularization of the task loss -- I'm going to call this the task loss, like the BLEU score, the thing we're going to get evaluated on at test time -- if I use the task loss as the empirical measure, I become insensitive to scaling W and the regularization is meaningless. We need a surrogate loss, and we need the surrogate loss to be scale sensitive: as we scale W up, this loss changes. It's sensitive to the norm of W. Okay. So this is the surrogate loss function underlying a structural SVM.
>>: When we think about the regularization term, it's also scaling -- [inaudible].
>> David McAllester: Well, you could make -- an SVM is going to use -- you could say W has unit norm and work with the margin. But then the margin becomes scale sensitive. So the standard approach to this, the SVM approach, works with a scale-sensitive regularization. People also don't always use L2; they use some other norm. But that norm also has a scaling issue. Okay. So here is -- how many people have seen structural SVMs? Wow. Okay. This is the structural hinge loss, and this is the binary hinge loss. This is the standard mapping from the structured case to the binary case. In the structured case the feature map takes two inputs, and in the binary case the label is either minus 1 or 1, and this is the standard mapping between the two. Right, I can define the feature map on XY this way, and I can define the margin to be this. Then the standard hinge loss looks like this. So under this standard mapping between the structured case and the binary case, this agrees with the hinge loss. Okay. I'll come back to the structure of this, I think. But this hinge loss is a difference involving something called loss-adjusted inference. This is maximizing the score plus the loss relative to the reference translation. So this is the loss of a weight vector on an input X and a reference translation Y, and this is saying: consider the decoding which decodes in favor of bad BLEU scores. It's a loss-adjusted inference. You take a bad label, favoring bad translations, and take the difference between that score, adjusted by the loss, and this, and that's the structural hinge loss. Okay. I'm just going to go through surrogate loss functions. This is log loss. We define the probability of a decode given X -- if we have a log-linear model we can define this probability in a log-linear way. We take it to be proportional to the exponential of the score, normalized with the partition function. And if we take that to the binary case, we get a smooth curve that is qualitatively similar to the hinge loss. So in the binary case, the log loss and the hinge loss are similar; one's a smooth version of the other. One thing I want to point out, though, is that in the structured case, this max is over an exponentially large set -- all possible decodes. In the binary case, the max is over only a two-element set. So the fact that these are similar in the binary case can be very misleading; in the structured case I believe these are quite different loss functions.
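(Sketches of the two surrogate losses just described, again over a hypothetical finite candidate set, with `task_loss` standing in for something like 1 - BLEU. Note the log loss sums over all candidates -- exactly the exponentially large set just mentioned.)

```python
import numpy as np

def structured_hinge(w, x, y_ref, candidates, phi, task_loss):
    score = lambda y: float(np.dot(w, phi(x, y)))
    # Loss-adjusted inference: decode in favor of high score AND bad task loss.
    y_bad = max(candidates, key=lambda y: score(y) + task_loss(y, y_ref))
    return score(y_bad) + task_loss(y_bad, y_ref) - score(y_ref)

def structured_log_loss(w, x, y_ref, candidates, phi):
    scores = np.array([np.dot(w, phi(x, y)) for y in candidates])
    # Stabilized log of the partition function Z.
    log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    return log_Z - float(np.dot(w, phi(x, y_ref)))   # -log P(y_ref | x)
```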
Okay. What I'm going to be talking about in this talk is consistency. I'm going to give you a learning algorithm, and what I'll do is define two more loss functions: ramp loss and probit loss. Both are also meaningful in the structured and binary cases, and I'm going to show that those loss functions are consistent in a predictive sense. That means that in the limit of infinite training data, the weight vector will converge to the weight vector that's optimal with respect to your loss function. There's no notion of estimation or truth here; there's just optimal performance relative to the BLEU score. Now, the hinge and log loss functions are convex, and there's a fundamental convexity/consistency tension. For any convex loss function, if you've got an outlier -- and especially in machine translation, you're going to have reference translations which your decoder is not going to get. You've got these hopeless reference translations. They're going to have bad BLEU scores and there's nothing you can do about it; your system is not going to get them. You're going to have outliers. You're going to have margins that are bad. And when you have outliers, the loss is large, especially for a convex loss function -- a convex loss function has to assign them large loss. Because it's convex, it also has to be sensitive to how bad these terrible translations are, so your ultimate training algorithm becomes sensitive to the outliers. And that's going to block consistency. So a convex loss function is not going to be consistent. If you're familiar with SVMs, if you have a universal kernel something else happens. People claim that binary SVMs are consistent, and it's because they're talking about a universal kernel. But we can talk about that if there are questions. Okay. Here's the ramp loss. If you remember, the structured hinge loss was a difference between a loss-adjusted inference and the score of the reference translation. Here's the hinge loss: this is the score of a loss-adjusted decode -- the score plus the loss -- minus the score of the reference translation. Now what we're going to do is replace the score of the reference translation by the score of the unadjusted decode. So this is the score of the adjusted decode minus the score of the unadjusted decode. And if we go to the binary case via the standard translation, we get a ramp that is not convex, right? The outliers eventually have constant loss, because once these two things have the same decode, we're just left with the loss. So you get this plateau, and you can see that it's different from the hinge loss. It's not convex; it has this other aspect to it. Here's the probit loss, one more loss function. What the probit loss does is take the weight vector W and add Gaussian noise. If we're in D dimensions, we take D-dimensional unit-variance Gaussian noise, add it to our weight vector, decode with that, take the loss, and then take the expectation over the noise of the decode loss. That's the probit loss. If we take that to the binary case, we get a smoothed ramp. Right? It's going to be a nice continuous function of W. So we get this. But, again, the fact that this looks like this qualitatively is misleading, because the binary case is misleading relative to the structured case.
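(Corresponding sketches of the two new losses, in the same hypothetical finite-candidate setup. The probit loss is an expectation over the noise, so the sketch uses a Monte Carlo estimate rather than the exact value.)

```python
import numpy as np

def structured_ramp(w, x, y_ref, candidates, phi, task_loss):
    score = lambda y: float(np.dot(w, phi(x, y)))
    y_adj = max(candidates, key=lambda y: score(y) + task_loss(y, y_ref))
    y_hat = max(candidates, key=score)              # unadjusted decode
    return score(y_adj) + task_loss(y_adj, y_ref) - score(y_hat)

def probit_loss_mc(w, x, y_ref, candidates, phi, task_loss,
                   n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        w_noisy = w + rng.standard_normal(w.shape)  # unit-variance Gaussian noise
        y_hat = max(candidates, key=lambda y: float(np.dot(w_noisy, phi(x, y))))
        total += task_loss(y_hat, y_ref)
    return total / n_samples                        # estimates E_noise[task loss]
```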
Okay. So here are some basic properties that hold both in the binary case and in the structured case. The structured case is different from the binary case, but these qualitative properties hold. The ramp and probit losses are both bounded to [0, 1]. I should back up: I'm assuming that the task loss itself is bounded to [0, 1]. All right. So assuming that the task loss is bounded to [0, 1], these loss functions are bounded to [0, 1]. That means they're not going to have high-loss outliers. It's a robustness property: no individual data point can have an enormous effect. Another property is that the ramp loss is a tighter upper bound on task loss than is hinge loss. All of these properties are kind of immediate. For example, you get that this is an upper bound on the task loss by sticking the decode value into here: this thing goes down, because I'm taking a value that's not optimal. If I stick the decode that optimizes this into here, these cancel -- it's the same score -- and I'm left with the loss. All of the properties I've shown you are derived by taking one of these maxes and sticking in a value, realizing this goes up or down. Okay, so we have these properties. This property was the original motivation: there was a 2008 paper introducing the structured ramp loss, and their motivation was simply this property, that it's a tighter upper bound on the task loss.
>>: Can you go back to the chart where you show the probit loss? So the ramp loss is an upper bound, but the probit loss is not an upper bound --
>> David McAllester: The probit loss is not, right. But I'm going to argue ultimately that the probit loss is the best thing.
>>: So just checking: for the binary case, the task loss is the 0-1 loss.
>> David McAllester: Right.
>>: So on the previous slide, sorry -- if you drew the task loss into the picture, that would be a motivation for this. That's why it's a better approximation, is that fair?
>> David McAllester: So if you look at hinge here -- the task loss is a step function there.
>>: The thing you're looking at is close to the step function there.
>> David McAllester: Right. Right. In this regime it's closer to the step function than it would be if it went up here. So this is just a slide on the history of some of these ideas. There's a question of whether there is a reference for each of these structured versions of the standard loss functions. So this is the structured hinge loss -- I'm sorry. This is talking about subgradient descent on unregularized ramp loss. What does it mean to do subgradient descent on this? Subgradient descent means you find this maximizer and this maximizer, and you take the gradient of that function of W with respect to W. So you're finding a bad label and a better label, and the gradient with respect to W is the difference in their feature vectors. So there's been work in natural language processing which I'm going to argue is approximately subgradient descent on the ramp loss. The hack is: take an N-best list of your decodes, measure the BLEU score on all of them relative to the reference translation, distinguish the good ones from the bad ones, take the difference between the feature vector of the good one and the feature vector of the bad one -- that's a direction that moves you toward the good one -- and update your weight vector in that direction, toward the good one and away from the bad one. And the argument is that if you do subgradient descent on ramp loss, it's a version of that.
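(A sketch of one step of subgradient descent on the unregularized ramp loss, in the same hypothetical setting: the subgradient is the feature-vector difference between the loss-adjusted decode and the plain decode, so the update moves toward the better decode and away from the worse one.)

```python
import numpy as np

def ramp_subgradient_step(w, x, y_ref, candidates, phi, task_loss, lr=0.1):
    score = lambda y: float(np.dot(w, phi(x, y)))
    y_bad = max(candidates, key=lambda y: score(y) + task_loss(y, y_ref))
    y_good = max(candidates, key=score)
    # Subgradient of the ramp loss wrt w is phi(x, y_bad) - phi(x, y_good).
    return w - lr * (phi(x, y_bad) - phi(x, y_good))
```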
You're taking a bad one and a good one, and you end up moving toward the good one and away from the bad one. In practice you're actually better off finding the decode and a better decode, doing the loss adjustment in the other direction. But this is a theoretical talk, and the theorem is easier this way -- actually, I don't know how to prove the other theorem. The theorem works this way. Okay. So several groups in machine translation have done things like this that are like subgradient descent on ramp loss. We call that direct loss. We had an earlier theorem relating the subgradient, ignoring the regularization, to the gradient of the task loss --
>>: Some people have talked to me about [inaudible].
>> David McAllester: Happily abandoned convexity. My understanding is the machine translation community --
>>: Our task loss function is not convex. It's not that we're happy -- we would love convexity, but we can't find it. Zero-one loss is still not convex.
>> David McAllester: The theoreticians still love convexity. Okay. And this is the slide about empirical results. We've recently experimented with probit loss directly and shown that probit loss gives improvements over hinge loss. There are two ways we've gotten improvement. We've gotten improvement with the direct loss update -- subgradient descent on ramp loss -- using early stopping instead of regularization in practice. And we've gotten improvement using the probit loss with a normal regularization. These are improvements over the structured hinge loss. Okay. So now I'm going to start proving theorems, and I'm going to introduce some notation. If I have a weight vector W, the loss of W is just defined to be the expectation over drawing fresh data -- the expectation of my test-time loss. So this is the expected test-time loss of W. L star is the best expected test-time loss that I can achieve: the inf over W of the loss of W. What's the best loss I could achieve with any W? Now I'm going to build an empirical loss measure. I'm going to prove consistency, and the way I'm going to prove consistency is to assume there's an infinite sequence of training data. I'm going to look at what happens when I train on the first N points and then let N go to infinity. We're always training on the first N, letting N go to infinity. So this is the average loss on the first N training points -- the estimated loss of W based on the first N training points. Okay. This is going to be our learning rule. This is the learning rule I had before. This is the surrogate loss -- the average surrogate loss on the first N training points, the probit loss in this case. We're going to optimize a score which is the measured surrogate loss plus a regularizer. And what's going to happen is that lambda N is going to grow with N. So we're going to regularize somewhat harder than 1 over N as N increases. Okay. So here's our theorem. As long as lambda N increases without bound -- it could increase very slowly, like log N -- but lambda N log N over N converges to 0. Another way to think about this is that lambda N could be any power of N strictly between 0 and 1. So it grows at some rate between these bounds: it increases, but this goes to 0. Then the limit as N goes to infinity of the generalization probit loss -- the generalization probit loss of the learned weight vector -- equals L star.
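(In symbols, a sketch of the learning rule and schedule conditions as I understand them from the talk; the exact form of the regularizer and the conditions on lambda are in the paper.)

```latex
\hat{w}_n \;=\; \operatorname*{argmin}_{w}\;
  \frac{1}{n}\sum_{i=1}^{n} L_{\mathrm{surrogate}}(w, x_i, y_i)
  \;+\; \frac{\lambda_n}{n}\,\lVert w\rVert^2,
\qquad
\lambda_n \to \infty, \quad \frac{\lambda_n \log n}{n} \to 0
\quad (\text{e.g. } \lambda_n = n^{\alpha},\ 0 < \alpha < 1).
```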
>>: [inaudible].
>> David McAllester: What?
>>: It converges with what probability?
>> David McAllester: With probability 1 over the sequence, right -- over the choice of the sequence. Okay. So here's the theorem. I'm a PAC-Bayesian person, so that's what I like to use. I think these theorems are much less awkward than in any other framework, especially this theorem, because there's a tight connection between the PAC-Bayesian framework and this particular loss function. What I'm going to do here is state the general PAC-Bayesian theorem; there are a couple of recent references that put the PAC-Bayesian theorem in this form. So what's going on here? I've got a space of weight vectors W. The space of weight vectors is continuous, so anything that talks about discrete sets of predictors isn't going to work here. In the PAC-Bayesian framework, we assume a prior over the weight vectors. We're using L2 regularization, and the natural prior corresponding to an L2 regularization is a Gaussian prior, so I'm putting an isotropic Gaussian prior on the weight vectors. The PAC-Bayesian theorem is completely general: it says for any set of predictors and for any prior on those predictors, what I'm going to learn is a posterior -- it's kind of Bayesian; it learns a quote-unquote posterior on the predictors. The way I'm going to use that posterior at test time is to randomly draw a predictor from the posterior and use it. So the loss of the posterior is the expected loss when drawing a predictor from the posterior. The theorem says that, with high probability over the draw of the training data, simultaneously for all possible posteriors Q -- for all possible distributions on the space -- the generalization loss of that posterior is bounded by this expression in terms of the training loss of the posterior. Now this is the training task loss, but we're drawing from the posterior, so it's actually going to become scale sensitive. Plus this thing that depends on the KL divergence between the posterior and the prior, a confidence parameter, and the number of points of your training data. All I have to do here is pick my regularization parameter before I look at the data, to make it exactly hold. So it's a very nice, simple statement. You can see that the regularization parameter can't get too large before this term stops mattering, before this factor becomes close to 1. So this theorem is predicting the regularizer in some sense: it's saying your regularizer should be roughly order 1 in this formulation. And in this formulation we're getting something that looks very much like exactly what our learning rule is minimizing. So now what I want to do is bound a certain loss of a weight vector W that I'm learning. My prior is a Gaussian prior centered around 0. I want to say something about a W that's highly non-zero. I take my posterior to be centered around W, with the same isotropic Gaussian shape. That distribution is exactly the distribution I get when I add Gaussian noise to W. So adding Gaussian noise to W defines a posterior over the weight vectors. I simply plug that posterior into this formula, and this becomes the probit loss of W -- that is the probit loss of W, and this is the empirical probit loss of W -- and I get this bound.
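(A sketch of the two displays just described. The constants here are one common form of the PAC-Bayesian theorem; take the precise statement from the paper. With probability at least 1 - delta over the draw of the n training pairs, simultaneously for all posteriors Q, and for a lambda > 1/2 fixed in advance:)

```latex
L(Q) \;\le\; \frac{1}{1-\frac{1}{2\lambda}}
  \left[\, \hat{L}(Q) \;+\; \frac{\lambda}{n}
  \left( \mathrm{KL}(Q \,\|\, P) + \ln\frac{1}{\delta} \right) \right]
```

(Taking the prior P = N(0, I) and the posterior Q = N(w, I) -- the distribution you get by adding unit-variance Gaussian noise to w -- gives KL(Q || P) = ||w||^2 / 2, so the bound becomes a bound on the probit loss:)

```latex
L_{\mathrm{probit}}(w) \;\le\; \frac{1}{1-\frac{1}{2\lambda}}
  \left[\, \hat{L}_{\mathrm{probit}}(w) \;+\; \frac{\lambda}{n}
  \left( \frac{\lVert w\rVert^2}{2} + \ln\frac{1}{\delta} \right) \right]
```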
So basically this is going to give you the theorem. Right? So the theorem says --
>>: [inaudible] Q, the expectation also draws from Q.
>> David McAllester: Right.
>>: The reason this works seems to be that you've got an expectation over the probit noise that's now taken care of -- the expectation over the noise is going to reappear.
>> David McAllester: Right. Right. Yeah, this is an expectation over noise, so it's drawing from Q essentially. And the same is true here: this is the empirical performance when I draw from Q. Okay. Now, we're interested in proving consistency, so we're interested in taking N to infinity, and the conditions I gave on lambda N guarantee that this term goes to zero as N goes to infinity. There's the log term that's coming from this. The other thing I'm going to do is take my delta to be 1 over N squared, to get probability 1 over the sequence. So this term goes to zero as N goes to infinity, but I still have this inequality, and lambda N is going to infinity, which means this factor is going to 1 -- so this term dominates. And then I have to argue that I can consider particular Ws of increasing norm, and for Ws of sufficiently large norm, this term converges to L star. The paper has a little bit of a careful argument there. Basically, L star is defined to be the inf over W of the performance of W, and in the proof I have to say: consider any reference W. All I have to prove is that my performance gets as good as that of any reference W. If I hold a reference W fixed and take N to infinity, the algorithm is going to minimize this, right? Do I have a better slide with this? No. So all I have to do is prove that I do as well as any reference W. I pick a W; I have to prove that I'm doing as well as that W. My learning algorithm is minimizing this expression, so it's going to pick something whose empirical probit loss does as well as W -- and I only have to look at the limit as N goes to infinity. I can also at the same time consider scalings of my reference W: I can scale it up to be larger and larger. As I scale it up, we can prove that this empirical probit loss becomes a valid estimate of the true generalization loss for that particular W as its norm goes to infinity. It's somewhat -- I've given this talk before, and I've really cheated at this point in the talk and said this is a straightforward argument now. There's one issue in this theorem that some people get upset about. What I've proven here -- where's my consistency theorem? -- what I've proven here is that the probit loss of my estimator is converging to L star. I have not proven that the task loss of my estimated W is converging to L star. I justify that by saying, well, this probit loss can be realized: I can actually implement the process that adds Gaussian noise and makes predictions. So it's giving me a prediction algorithm whose performance is approaching L star, because I can achieve this loss.
The reason I can't get that L approaches L star has to do with the fact that in infinite dimensions with latent variables, even though the performance is converging, you can construct a weird example where the vector W is rotating in an infinite-dimensional space forever -- the direction of W never converges. Okay. So here we're going to do the analogous thing for ramp loss. The proof of this is much trickier. Now we're going to replace the surrogate loss with ramp loss: this is the empirical ramp loss, and we're going to minimize the empirical ramp loss plus a regularization term. The theorem is very similar -- there's like a logarithmic factor difference. It's going to be true that if the regularization is a power of N, where that power is strictly between 0 and 1, we still get consistency. So this theorem looks deceptively like ramp loss is similar to probit loss. I'll say later why that's deceptive. But it's essentially the same theorem, up to a logarithmic factor. Okay. How does this theorem work? We have this inequality. We know that ramp is an upper bound on task loss, right? And we know that in the limit as the weight vector goes to infinity -- or equivalently, if we think of adding noise with variance sigma, as we take the variance to zero, which is equivalent to taking the weight vector to infinity -- we have this system of inequalities. So what we want to do is find a finite rate, as a function of sigma, that relates the probit loss to the ramp loss. We know that the limit as sigma goes to zero of this expression is less than the ramp loss; what I'm going to do is give a finite rate for that in terms of sigma. Okay. So this is our theorem: the probit loss at a finite sigma is bounded by the ramp loss plus a penalty that depends on sigma. That's what I'm going to prove. I'm taking this inequality in a limit and giving it a rate. So at the top level, I've got a bound in terms of the probit loss -- I just argued for that based on the PAC-Bayesian theorem -- and what I'm going to do is give inequalities relating ramp to probit and use the bound on the probit loss. And I've got this. So should I skip this slide? I can see people fading away.
>>: In such a case the last term -- we should understand the complexity of the Y space can get --
>> David McAllester: Oh, yes, sorry. This is the number of possible decodes, and that's bad. But at least it's inside a log. You can think of this as the length of the sentence. That's still bad. I believe that using Johnson-Lindenstrauss we can actually prove that we can get that down to log-log. In the original paper on the hinge loss by Taskar, Guestrin, and Koller, they proved a theorem with a log-log term here, basically, in their particular setting. I think you can get a log-log in general. But, yes, this is the number of possible decodings, and that's a troublesome term. Let me just give you the essence of the idea; I'm not going to go through all of this. Let's just look at this one line. What's going to happen here is I'm going to say: we've got an input sentence and a space of possible decodes. For every possible decode there's a margin. What do I mean by the margin? Take the decode that the system's actually producing -- that's the best-scoring decode. Every other decode will have a score worse than that. Take the gap between the two scores, and that's the margin of a potential competing decode. Sometimes these competing decodes are called distractors -- the biological community likes to call them distractors. Every alternate decode has a margin.
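(A tiny sketch of the margin computation over a hypothetical finite candidate set. The "plausible" decodes are those whose margin falls below a threshold; the lemma's union bound is over this set.)

```python
import numpy as np

def margins(w, x, candidates, phi):
    scores = np.array([np.dot(w, phi(x, y)) for y in candidates])
    return scores.max() - scores        # zero for the argmax decode itself

def plausible_decodes(w, x, candidates, phi, threshold):
    m = margins(w, x, candidates, phi)
    return [y for y, gap in zip(candidates, m) if gap <= threshold]
```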
The idea is that there's a certain threshold on that margin such that, provably, anything whose margin exceeds that threshold can be ignored: the decode is just not going to be that thing when I add noise to the weight vector. So the idea is that the probit loss is less than this thing -- something handling the fact that it's not completely true that I can ignore the bad guys, the things with large margin. But I can basically bound the probit loss by sigma -- we're going to take sigma to 0 -- plus the max over all the plausible decodes of the loss of that plausible decode. And that's the fundamental proof method for getting something for ramp loss. Right? Because this quantity, the max over plausible decodes, is something that I can relate to ramp loss. The threshold is picked to make this theorem true when I do a union bound and a high-probability deviation bound. And then this is just finishing the proof: it takes that first thing and does a sequence of steps that relate it to the ramp loss. This max can be replaced by a max over everything, and then this becomes equivalent to a loss-adjusted inference, and you get it related to the ramp loss -- without going through a lot of details there. Then, using this main lemma -- which says the probit loss is less than or equal to the ramp loss plus this -- I get this generalization bound for the ramp loss. This says the generalization probit loss of W over sigma is less than -- and all I've done here is take the probit loss and replace it by the upper bound I just proved. The probit loss was upper bounded in terms of the ramp loss, and I've replaced the probit loss in the original theorem by the ramp loss here. So now I've got a bound on the generalization probit loss in terms of the empirical ramp loss. And now we can just take schedules for sigma and lambda and get our consistency theorem. Okay. The other thing we can do, rather than just taking schedules for these, is actually optimize away sigma and get a generalization bound directly in terms of the ramp loss. It turns out that this is an approximately optimal value for sigma to minimize that bound, and now we get a finite-sample generalization bound in terms of the ramp loss: the generalization probit loss is bounded by this thing in terms of the ramp loss. And really, I think these asymptotic consistency theorems are not so interesting. What's much more interesting to me are these generalization bounds, because they provide a concrete finite-sample generalization guarantee. That tells you more than any kind of asymptotic infinite-limit statement. So this is the finite-sample guarantee we had for generalization using the probit loss as the surrogate loss.
When we optimize away sigma, here's the analogous guarantee in terms of ramp loss. But now look at the differences -- what I want to focus on is this regularization term versus this regularization term. This term is linear: lambda times W squared over N. This one is lambda times W squared over N to the one-third power. So what this says is that, for the bounds we've been able to prove, the ramp loss bound is significantly worse than the probit loss bound. It could be an artifact of the proof technique, because the proof technique was natural and immediate for the probit loss. But my feeling is this is real: you're better off using the probit loss. Okay. So I'm basically done. I've glossed over some of the technical details. The summary is: we know we need surrogate loss functions if we're going to regularize, and we have all these standard surrogate loss functions from the binary case that generalize to the structured case. I haven't talked about it, but all the theorems in the paper are actually written for the structured latent case. In the structured latent case I optimize not only over the decode but also over latent information like parse trees or what have you, and all of this analysis generalizes to that case as well. And we have probit and ramp, which are both provably consistent, but I believe they're significantly different in the structured setting.
>>: I have a question. So your argument that [inaudible] are not consistent is that they are unbounded -- they just keep going up.
>> David McAllester: Which makes them sensitive to outliers.
>>: So that would suggest that if you consider, for example, a bounded monotone function [inaudible] rather than the two you've got -- could this generalize to all such functions? You picked two particular functions, and they're of rather similar shape. Is it enough for the functions to be bounded, to be consistent, or is it just --
>> David McAllester: That's almost certainly true in the binary case. And there's lots of theory that uses a Lipschitz bound -- the functions have to have a Lipschitz constant associated with them, and the bound comes out in terms of the Lipschitz constant. I'd have to think about what properties you might need; probably there's something like that for the structured case. Yeah. Theoreticians talk.
[applause]