>> Michel Galley: So let's get started. So it's my pleasure to introduce David
McAllester, who is a Professor and Chief Academic Officer at the Toyota
Technological Institute at Chicago. Prior to that he was a faculty member,
successively at Cornell and MIT.
And then he was a researcher at Bell Labs and AT&T Labs, and he's an AAAI
fellow and has published in a lot of different areas. Machine learning, of course,
but also natural language processing, vision, AI planning, automatic reasoning
and a few others that I have missed.
And I just want to mention also that there are options to meet with David. I'll send
a signup sheet around, so fill out the sheet or talk to me, or send me an e-mail,
and we can certainly arrange something.
So now let David speak.
>> David McAllester: Okay. It occurs to me that we may have a problem with
this talk in that there may be people in the audience who have seen it before.
So there was a workshop on speech and language a year ago in Bellevue, and I
gave this talk there.
How many people have seen this talk before? Has anybody seen this talk
before? Okay. At least the people who have seen it before didn't come.
Okay. So we'll go ahead with this. This is joint work with Joseph Keshet at TTI.
This is a theoretical talk. Just out of curiosity, how many people have ever
published a paper whose point was a generalization bound, at COLT or some
similar conference on learning theory?
Okay. So this is a theoretical talk, but not necessarily a theoretical
audience.
So I often give this talk to people who don't do natural language processing. I
know there are a lot of NLP people here.
But my favorite example that I try to use as a running example in this talk is
actually machine translation. So in machine learning we're often interested in
binary classification. Does a patient have cancer or not.
So learning theory, the kind of work you see at COLT, typically studies the
problem of binary classification theoretically. But what I'm going to be interested
in here is the study of what we call structured labeling. My favorite example of
structured labeling is machine translation. We have some input, a structured
object, like a sentence, and we're interested in not necessarily labeling it but
decoding it into a sentence in some other language.
So we can think of a decoding problem as a problem where we're given some
input X and we want to produce some output Y, where the input X is a structured
object like a sentence and the output Y is a structured object like a sentence in
another language.
And the way we're going to do the decoding is by optimizing a linear score. So
we're going to say find the output, the decode, which maximizes a linear score,
which is an inner product of a weight vector and a feature vector.
And if you're familiar with machine learning theory and support vector machines
and kernels, you realize that almost any kind of scoring function can be
represented this way by making the feature vector sufficiently elaborate. You
can make it have all kinds of nonlinear features. Even though this is a linear
score the feature map can be nonlinear. So you can represent arbitrarily
complicated things this way.
Okay. So this is our decoder. It's maximizing, it's producing an output
maximizing this linear score.
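Just to make that concrete, here is a minimal, purely illustrative Python sketch of
such a decoder. The feature map phi and the candidate generator are hypothetical
stand-ins; in a real MT system the argmax would be computed by dynamic
programming or beam search rather than by enumerating candidates.

    import numpy as np

    def decode(x, w, phi, candidates):
        # Return the output y maximizing the linear score w . phi(x, y).
        # phi(x, y) -> numpy feature vector; candidates(x) -> iterable of
        # possible outputs.  Both are hypothetical stand-ins: a real system
        # would use dynamic programming or beam search, not enumeration.
        return max(candidates(x), key=lambda y: float(np.dot(w, phi(x, y))))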
We would like -- so what I'm going to be talking about is training the weight
vector. We're going to hold this feature map fixed and we're going to look at the
training problem. The training problem is the problem of setting the weight
vector.
We would like to set the weight vector such that it minimizes the expectation over
input/output pairs. So I'm going to assume here we've got a corpus of translation
pairs.
We've got input/output pairs and we've got something like the BLEU score that
we're interested in -- or rather a loss that we're interested in minimizing. I'm
going to frame it as a loss that we're minimizing.
And what we would like to do is minimize the expectation over drawing a new
reference pair, a new translation pair, of the loss between the reference
translation and the translation produced by this decoder.
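In symbols, with L denoting the task loss (for example one minus BLEU) and rho the
distribution over translation pairs, the quantity being described is roughly:

$$ \min_{w}\; \mathbb{E}_{(x,y)\sim\rho}\!\left[ L\bigl(y,\; \hat{y}_w(x)\bigr) \right],
   \qquad \hat{y}_w(x) \;=\; \arg\max_{\hat{y}}\; w^{\top}\phi(x,\hat{y}) . $$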
>>: Why is it reasonable to assume that you have a distribution over the
input/output pairs -- why is it reasonable to assume you have this situation for
these algorithms?
>> David McAllester: It's the standard learning-theory assumption that you have a distribution
over XY pairs. Intuitively, there are large corpora of translation pairs and they
were produced somehow.
>>: But unlike the usual case, the classification case where you have a target -- here these are
the translations. So it's --
>> David McAllester: The way I think about it is -- it's sort of a cynical way to think
about it, is we want to win the competition. Right? And we know that the
competition is going to have translation pairs that we're going to get scored on
and our training data is probably a -- it's probably reasonable to assume that the
training pairs are drawn from the same distribution that our evaluation pairs are
going to be drawn from, and we want to win.
>>: Wouldn't this [inaudible] instead of the Ys you have the true translation of Xs,
that's what you have in the corpora. You don't have the random [inaudible]
translation [inaudible].
>> David McAllester: All I'm going to need here is some distribution over XY
pairs. I don't need that there's a true Y.
>>: Any translations --
>>: But the Y here is not sampled from across the entire distribution of Ys.
It's very biased; the Y part is the one that's really biased, compared to a random
sentence in the language [inaudible]. Y is going to be specific to reasonable
translations, close translations.
>>: It's a translation of Ys.
>> David McAllester: So this is a pair sampled from some corpus of translation
pairs. So I build a big corpus of translation pairs. I select some as train and some
as test.
>>: Why does that seem only to be -- this is all of the good translations?
>> David McAllester: Whatever your translation pair data is like. Right. So this
is a gold standard translation. A reference translation.
Okay. So what we're interested in, what we would like to do, is find the W that
produces the best performance at evaluation time.
Now, the problem is that the typical way of approximating this, there's a problem
with overfitting. Right? If my corpus is not large enough, I can make my training
performance very good. But my test performance is going to be bad because I
overfit. And the way overfitting is typically controlled is adding a regularization
term.
So the training algorithm minimizes a loss -- a sum over the training data of the
loss on the training data -- plus a regularization term to drive the norm of the
weight vector down. And what we're going to be doing is giving theorems that
justify this kind of structure, that give us generalization properties.
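Written out, the training rule being described is roughly (with L_S the surrogate
loss and lambda the regularization strength):

$$ w \;=\; \arg\min_{w}\; \sum_{i=1}^{n} L_S(w, x_i, y_i) \;+\; \lambda\,\|w\|^2 . $$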
Okay, now there's a fundamental problem. I've written this as L sub S. The
S stands for surrogate. If I literally use the training loss, this decoding is
insensitive to scaling W. So I can scale W. I can take W and scale it down. Just
multiply it by epsilon.
And it doesn't change the decoding at all. But it completely eliminates this
regularization term. So regularization of the task -- I'm going to call this the task
loss, like the BLEU score, the thing we're going to get evaluated on at test time. If I
use the task loss as the empirical measure, I become insensitive to scaling W and
the regularization is meaningless. We need the surrogate loss to be
scale sensitive. As we scale W up, this loss changes. It's
sensitive to the norm of W.
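The scaling point in symbols: for any epsilon greater than zero,

$$ \arg\max_{\hat{y}}\; (\epsilon w)^{\top}\phi(x,\hat{y})
   \;=\; \arg\max_{\hat{y}}\; w^{\top}\phi(x,\hat{y}),
   \qquad \text{while } \lambda\,\|\epsilon w\|^2 \to 0 \text{ as } \epsilon \to 0, $$

so the decoder's task loss is unchanged while the regularization penalty vanishes.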
Okay. So an SVM -- this is the surrogate loss function underlying
a structural SVM.
>>: So when we think about the regularization term, it's also scale -- [inaudible].
>> David McAllester: Well, you could make -- an SVM is going to use -- you
could say W has unit norm and work with the margin. But then the margin becomes
scale sensitive.
So the standard approaches work with -- the standard approach to this, SVM
approach, works with a scale sensitive regularization. People also don't always
use L2. They use some other norm.
But that norm also has a scaling issue. Okay. So here is -- how many people
have seen structural SVMs? This is the general -- wow. Okay. This is the
structural hinge loss. This is the binary hinge loss. And it's easy -- this is the
standard mapping from the structured case to the binary case. In the
structured case the feature map takes two inputs; in the binary case, the label is
either minus 1 or 1, and this is the standard mapping between the two.
Right, I can define the feature map on XY to be this way and I get, I can define
the margin to be this. And then the standard hinge loss looks like this. So under
this standard mapping between the structured case and the binary case, this
agrees with the hinge loss.
Okay. I'm not going to say -- I'll come back to the structure of this, I think. But
this is a difference -- this hinge loss is the difference between something called
the loss adjusted inference. This is maximizing the score plus the loss relative to
the reference translation.
So this is a loss of a weight vector on an input X and a reference translation Y.
And this is saying: consider the decoding which decodes in favor of bad BLEU
scores. It's a loss adjusted inference. You're saying take a bad label favoring
bad translations, take the difference between that score, adjusted by the loss,
and this one, and that's the structural hinge loss.
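Reconstructing the slide's definition from that description -- the loss-adjusted
inference minus the score of the reference -- the structural hinge loss is presumably:

$$ L_{\text{hinge}}(w, x, y) \;=\; \max_{\hat{y}}\,\bigl[\, w^{\top}\phi(x,\hat{y}) + L(y, \hat{y}) \,\bigr]
   \;-\; w^{\top}\phi(x, y) . $$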
Okay. So I'm just going to go through surrogate loss functions. This is log loss.
This is, we define the probability of a decode given X, if we have a log linear
model we can define this probability in a log linear way. We take it to be
proportional to the exponential of the score, normalized with the partition
function.
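That is, presumably the model and the corresponding log loss are:

$$ P_w(\hat{y} \mid x) \;=\; \frac{\exp\bigl(w^{\top}\phi(x,\hat{y})\bigr)}{\sum_{y'} \exp\bigl(w^{\top}\phi(x,y')\bigr)},
   \qquad L_{\text{log}}(w, x, y) \;=\; -\ln P_w(y \mid x) . $$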
But then if we take that to the binary case, we get a smooth curve that is
qualitatively similar to the hinge loss. So in the binary case, the log
loss and the hinge loss are similar. One's a smooth version of the other.
One thing I want to point out, though, is that in the structured case, this max is
over an exponentially large set. It's all possible decodes.
In the binary case, that max is over only a two-element set. So the
fact that these are similar in the binary case can be very misleading; in the
structured case, I believe these are quite different loss functions.
Okay. What I'm going to be talking about in this talk is consistency. I'm going to
give you a learning algorithm. What I'll do is define two more loss functions.
We'll talk about ramp loss and probit loss, both are also meaningful in the
structured and binary case and I'm going to be showing that those loss functions
are consistent in a predictive sense. That means that in the limit of infinite
training data, the weight vector will converge to the weight vector that's optimal
with respect to your loss function.
I'm not -- there's no notion of estimation or truth here. There's just optimal
performance relative to the BLEU score.
The hinge and the log loss are convex functions. So there's a fundamental
convexity-consistency tension. With any convex loss function, if you've got an
outlier -- and especially in machine translation, you're going to have reference
translations which your decoder is not going to get.
So you've got these hopeless reference translations. And they're going to have
bad BLEU scores and there's nothing you can do about it. Your system is not
going to get them.
You're going to have outliers. You're going to have margins that are bad. So
when you have outliers, the loss is large -- especially for a
convex loss function. A convex loss function has to assign them large loss.
And that -- because it's convex it also has to be sensitive to how bad these
terrible translations are so your ultimate training algorithm becomes sensitive to
the outliers. And that's going to block consistency.
So a convex loss function is -- is not going to be consistent. If you're familiar with
SVMs, if you have a universal kernel something else happens. People claim that
binary SVMs are consistent and it's because they're talking about a universal
kernel. But we can talk about that if there are questions.
Okay. Here's the ramp loss. So if you remember the structured hinge loss, it
was a difference between a loss adjusted inference and the score of the
reference translation.
Here's the hinge loss. This is a loss adjusted decode, the score of a loss
adjusted decode. It's the score plus the loss. Minus the score of the reference
translation.
Now what we're going to do is replace the score of the reference translation by
the unadjusted decode. The score of the unadjusted decode. So this is the
score of the adjusted decode minus the score of the unadjusted decode.
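Reading that off -- the score of the loss-adjusted decode minus the score of the
unadjusted decode -- the structured ramp loss is presumably:

$$ L_{\text{ramp}}(w, x, y) \;=\; \max_{\hat{y}}\,\bigl[\, w^{\top}\phi(x,\hat{y}) + L(y, \hat{y}) \,\bigr]
   \;-\; \max_{\hat{y}}\; w^{\top}\phi(x,\hat{y}) . $$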
And if we go to the binary case via the standard translation, we get a ramp that is
not convex, right? The outliers eventually have constant loss. Because once
these two maxes agree -- have the same decode -- we're just left
with the loss.
So you get this plateau. And you can see that it's different from the hinge loss.
So it's not convex. It has this other aspect to it.
Here's the probit loss, one more loss function. The probit loss, what it does is it
takes the weight vector W and adds a Gaussian noise. So if we're in D
dimensions we're going to take a D dimensional unit variance Gaussian noise,
add it to our weight vector, decode with that, take the loss, and then we're taking
the expectation over the noise of the decode loss.
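In symbols, with epsilon a unit-variance isotropic Gaussian in the dimension of w:

$$ L_{\text{probit}}(w, x, y) \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[ L\bigl(y,\; \hat{y}_{w+\epsilon}(x)\bigr) \right] . $$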
That's the probit loss, if we take that to the binary case, we get a smooth -- a
smoothed ramp. Right? It becomes a continuous -- this is going to be a nice
continuous function of W.
So we get this. But, again, the fact that this looks like this qualitatively is
misleading. Because the binary case is misleading relative to the structured
case.
Okay. So here's some basic properties that hold both in the binary case and in
the structured case. So the structured case is different from the binary case, but
these qualitative properties hold. The ramp and the probit are both bounded to
0-1. So even in the structured -- I should back up. I'm assuming that the task
loss itself is bounded to 0-1. All right. So assuming that the task loss is bounded
to 0-1, these loss functions are bounded to 0-1. So that means they're not going
to have high loss outliers. It's a robustness property. No individual data point
can have an enormous effect.
Another property is that the ramp loss is a tighter upper bound on task loss than
is hinge loss. And all of these properties are kind of immediate properties. For
example, you get that this is an upper bound on the task loss by sticking the -- so
if I stick the decode value into here, this thing goes down, because I'm taking a
value that's not optimal. If I stick the decoder that optimizes this into here, these
cancel. It's the same score and I'm left with the loss. All of the properties I've
shown you are derived by taking one of these maxes, sticking in a particular value,
and realizing the expression goes up or down.
And okay so we have these properties. This property was used as the original,
so there was a 2008 paper introducing the structured ramp loss and their
motivation was simply this property, that it's a tighter upper bound on the task
loss.
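Collecting the properties just stated (and assuming, as above, that the task loss is
bounded to 0-1):

$$ L\bigl(y, \hat{y}_w(x)\bigr) \;\le\; L_{\text{ramp}}(w,x,y) \;\le\; L_{\text{hinge}}(w,x,y),
   \qquad 0 \;\le\; L_{\text{ramp}},\, L_{\text{probit}} \;\le\; 1 . $$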
>>: If you can go back to the chart where you show the probit loss. So the
[inaudible] loss is an upper bound, but the probit loss is not an upper bound --
>> David McAllester: The probit loss is not, right. But I'm going to argue ultimately
that the probit loss is the best thing.
>>: So just checking. So for the binary case, the task loss is the 0-1 loss.
>> David McAllester: Right.
>>: So on the next slide -- the previous one, sorry -- if you drew the task loss
directly on that picture, that would be a motivation for this. That's why it's a
better approximation, is that fair?
>> David McAllester: So if you look at hinge, sorry. Hinge here -- so task loss is
going to go. It's a step function there.
>>: The thing you're looking close to there is a step function.
>> David McAllester: Right. Right.
In this regime it's closer to the step function than it would be if it went up here.
So this is just a slide on the history of some of these ideas. So there's a question
of where -- is there a reference for each of these structured versions of the
standard loss functions.
So this is the structured hinge loss -- I'm sorry. I'm sorry. This is talking about
subgradient descent on unregularized ramp loss. If you do subgradient descent
on this, right? So what does it mean to do subgradient descent on this?
Subgradient descent means you find this maximizer, you find this maximizer and
you take the gradient of that function of W with respect to W.
So you're finding a bad label and a better label, and the gradient with respect to
W is the difference in their feature vectors. So there's been work in natural
language which says look, take an N best list.
So I'm going to argue that the following hack is approximately subgradient
descent on the ramp loss. Take an N best list of your decodes, measure the BLEU
score on all of them, relative to the reference translation, distinguish the good
ones from the bad ones, take the feature vector of the good one and the
difference between that and the feature vector of the bad one and that's a
direction that's moving you toward the good one and updates your weight vector
in that direction, toward the good one and away from the bad one.
And the argument is that this, if you do subgradient descent on ramp loss, it's a
version of that. You're taking a bad one and a good one and you end up moving
toward the good one and away from the bad one.
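A rough sketch of that N-best approximation to subgradient descent on the ramp
loss, for concreteness. The helpers nbest, phi, and task_loss (for instance one minus
sentence-level BLEU) are hypothetical stand-ins, not anyone's actual implementation.

    import numpy as np

    def ramp_subgradient_step(w, x, y_ref, nbest, phi, task_loss, eta=0.01):
        # One approximate subgradient-descent step on the structured ramp loss,
        # restricted to an N-best list.  The ramp loss is
        #   max_yh [ w.phi(x,yh) + loss(y_ref,yh) ] - max_yh [ w.phi(x,yh) ],
        # so a subgradient is phi(x, y_bad) - phi(x, y_good); descending on it
        # moves w toward the plain decode and away from the loss-adjusted one.
        cands = list(nbest(x, w))  # N-best decodes; phi returns numpy vectors
        y_good = max(cands, key=lambda yh: float(np.dot(w, phi(x, yh))))
        y_bad = max(cands, key=lambda yh: float(np.dot(w, phi(x, yh))) + task_loss(y_ref, yh))
        return w + eta * (phi(x, y_good) - phi(x, y_bad))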
You're actually better off in practice finding the decode and a better decode,
doing the loss adjustment in the other direction. But this is a theoretical talk. And
the theorem is easier -- actually, I don't know how to prove the other
theorem. The theorem works this way.
Okay. So several groups in machine translation have done things like this that
are like subgradient descent on ramp loss. We call that direct loss. We had an
earlier theorem relating the subgradient, ignoring the regularization, to the
gradient of the task loss --
>>: Some people have talked to me about [inaudible].
>> David McAllester: Happily abandoned convexity. My understanding is the
machine translation community is --
>>: [inaudible].
>>: Our task loss function is not convex. It's not that we're happy about it. We would
love convexity, but we can't find it -- the 0-1 loss is still not convex.
>> David McAllester: The theoreticians still love convexity.
Okay. And we've recently -- this is the slide about empirical results. We've
recently experimented with probit loss directly. And shown that probit loss shows
improvements over hinge loss. There are two ways we've gotten improvement.
We've gotten improvement with the direct loss update -- subgradient descent on
ramp loss -- using early stopping instead of regularization in practice.
And we've gotten improvement using the probit loss with a normal regularization.
These are improvements over the structured hinge loss.
Okay. So now I'm going to start proving theorems. So I'm going to introduce
some notation. So if I have a weight vector W, the loss of W is just defined to be
the expectation over drawing fresh data. The expectation of my test time loss.
All right. So this is the expected test-time loss of W. L star is the best
expected test-time loss that I can achieve: the infimum over W of the expected
test-time loss of W.
What's the best loss I could achieve of any W? Now I'm going to build
empirical -- this is an empirical loss measure. So I'm going to prove consistency.
The way I'm going to prove consistency I'm going to assume there's an infinite
sequence of training data. I'm going to look at what happens when I train on the
first N then I'm going to let N go to infinity.
We're always training on the first N, letting N go to infinity.
So this is the loss, the average loss on the first N training points. That's the
estimated loss of W based on the first N training points.
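Roughly, in symbols (the notation is guessed from the description):

$$ L(w) = \mathbb{E}_{(x,y)}\bigl[ L(y, \hat{y}_w(x)) \bigr], \qquad
   L^{*} = \inf_{w} L(w), \qquad
   \hat{L}_n(w) = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, \hat{y}_w(x_i)\bigr) . $$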
Okay. This is going to be our learning rule. So we have -- so what we're going to
do, this is the learning rule I had before. This is the surrogate loss function. This
is the surrogate loss, the average loss on the first N training points. We're going
to optimize a score which is the measured surrogate loss, the probit loss in this
case, plus a regularization of the norm of -- plus a regularizer. And what's going
to happen is that lambda N is going to grow with N.
So we're going to regularize somewhat harder, somewhat harder than 1 over N,
as N increases. Okay. So here's our theorem. As long as lambda N increases
without bound, it could increase very slowly like log N.
But lambda N log N over N converges to 0. So this is -- another way to think
about this is lambda N could be any power of N strictly between 0 and 1.
So it's -- it grows at some rate between these bounds. It increases but this goes
to 0. Then the limit as N goes to infinity of the generalization probit loss, so this
is the generalization probit loss equals L star.
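Putting the rule and the condition together as I read them off the description (the
exact normalization by n is a reconstruction):

$$ w_n = \arg\min_{w}\; \hat{L}^{\text{probit}}_n(w) + \frac{\lambda_n}{n}\,\|w\|^2, \qquad
   \lambda_n \to \infty, \;\; \frac{\lambda_n \ln n}{n} \to 0
   \;\;\Longrightarrow\;\; \lim_{n\to\infty} L^{\text{probit}}(w_n) = L^{*} . $$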
>>: [inaudible].
>> David McAllester: What?
>>: Is that convergence in probability?
>> David McAllester: With probability 1 over the sequence, right. Over the
choice of the sequence.
Okay. So here's the theorem. I'm a PAC-Bayesian person, so that's what I like to
use. I think these theorems would be much more awkward in any other framework.
Especially this theorem, because there's a tight connection between the
PAC-Bayesian framework and this particular loss function.
So there's a general PAC-Bayesian -- what I'm going to do here is state the
general PAC-Bayesian theorem; there are a couple of recent references that put
the PAC-Bayesian theorem in this form.
So this says that if I have a training loss, okay, so what's going on here? I've got
a space of W, weight vectors. The space of weight vectors is continuous. So
anything that talks about discrete sets of predictors isn't going to work here.
In the PAC-Bayesian framework, we assume a prior over the weight vectors.
And we're using L2 regularization, and the natural prior corresponding to an L2
regularization is a Gaussian prior. So I'm putting an isotropic Gaussian prior on
the weight vectors.
The PAC-Bayesian theorem is completely general. It says for any set of
predictors and for any prior on those predictors, what I'm going to learn is a
posterior -- it's kind of Bayesian. It's going to learn a quote posterior on the
predictors.
And the way I'm going to use that posterior at test time is I'm going to randomly
draw a predictor from the posterior and use it. So the loss of the posterior is the
expected loss of drawing a predictor over the posterior on predictors.
This bounds the generalization loss as follows.
With high probability over the draw of the training data, simultaneously
for all possible posteriors Q -- for all possible weightings on the space -- the
generalization loss of that posterior is bounded by this expression in terms of the
training loss of the posterior. Now this is the training task loss, but we're drawing
from the posterior.
So it's actually going to become scale sensitive. Plus this thing that depends on
the KL divergence between the posterior and the prior, and a confidence
parameter and the number of points of your training data.
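One common way of writing a theorem of that form -- a sketch, and the exact
constants may differ from the slide -- is: with probability at least 1 - delta over
the draw of the sample, simultaneously for all posteriors Q, for a regularization
parameter lambda > 1/2 fixed in advance,

$$ L(Q) \;\le\; \frac{1}{1 - \frac{1}{2\lambda}}
   \left[ \hat{L}_n(Q) \;+\; \frac{\lambda}{n}\Bigl( \mathrm{KL}(Q \,\|\, P) + \ln\frac{1}{\delta} \Bigr) \right] . $$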
So all I'd have to do here is pick my regularization parameter before I look at the
data. Right? This is to make it exactly hold. So it's a very nice simple
statement. You can see that the regularization parameter can't get too large
before this term doesn't matter, before this term becomes close to 1. So this is
sort of predicting the regularizer in some sense, this theorem is saying your
regularizer should be roughly order 1 in this formulation.
And in this formulation we're getting something that looks very much like
regularized loss minimization -- exactly what our learning rule is minimizing.
So just as -- now what I'm going to do, what I'm going to want to
do, is bound a certain loss of a weight vector W that I'm learning. So my prior is
centered around 0, right, my prior is a Gaussian prior centered around 0. I want
to say something about a W that's highly non-zero. I want to take my posterior to
be centered around W, with the same isotropic Gaussian shape. So that
distribution is exactly the distribution I get when I add Gaussian noise to W.
So adding Gaussian noise to W defines a posterior over the weight vectors. I simply
plug that posterior into this formula and this becomes the probit loss of W.
Right? That is the probit loss of W. This is the empirical probit loss of W. Right?
And I get this equation. This bound.
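The step being described uses the standard fact that the KL divergence between two
unit-variance isotropic Gaussians centered at w and at 0 is one half the squared
norm of w; plugging the posterior Q = N(w, I) into the bound above gives (again a
sketch, constants as assumed above):

$$ \mathrm{KL}\bigl(\mathcal{N}(w, I)\,\|\,\mathcal{N}(0, I)\bigr) = \tfrac{1}{2}\|w\|^2
   \;\;\Longrightarrow\;\;
   L^{\text{probit}}(w) \;\le\; \frac{1}{1 - \frac{1}{2\lambda}}
   \left[ \hat{L}^{\text{probit}}_n(w) + \frac{\lambda}{n}\Bigl( \tfrac{1}{2}\|w\|^2 + \ln\tfrac{1}{\delta} \Bigr) \right] . $$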
So now basically this is going to give you the theorem. Right? So the theorem
says --
>>: [inaudible] Q, the expectation also draws from Q.
>> David McAllester: Right.
>>: The reason this works seems like you've got a expectation of probit that's
now taken care of. The expectation of the noise is going to reappear.
>> David McAllester: Right. Right. Yeah, this is an expectation over noise. So
it's drawing from Q essentially. And the same is true here. This is the empirical
performance when I draw from Q. This is also the empirical performance when I
draw from Q.
Okay. Now, we're interested in proving consistency. So we're interested in
taking N to infinity, and the conditions I gave on lambda N are that this term is
going to go to zero as N goes to infinity.
There's the log term that's coming from this. The other thing I'm going to do is
I'm going to use my delta term as 1 over N squared to get probability 1 over the
sequence.
And then if this term is going to zero as N goes to infinity but I still have this
inequality and lambda N is going to infinity which means this is going to 1, I'm
getting that this term is dominating this term.
And then I have to argue that I can take W, I can consider particular Ws of
increasing norm, and for Ws of sufficiently large norm, this term is going to
converge to L star. That's a little -- the paper has a little bit of a careful argument
there, but basically what you want to do is say, for any -- so L star is defined to be
the infimum over W of the performance of W. In the proof I have to say consider
any reference W. All I have to prove is that my performance gets as good as any
reference W.
If I hold a reference W fixed and take N to infinity, the algorithm is going to
minimize this, right? So what I'm going to get is that -- do I have a better slide
with this? No. So all I have to do is prove that I do as well as any reference W. I
pick a W. I have to prove that I'm doing as well as that W. My learning algorithm
is going to be minimizing this expression. As N goes to infinity -- and I only have to
look at the limit as N goes to infinity -- the learning algorithm is going to pick
something whose empirical probit loss is doing as well as that of W.
And I can also at the same time consider scalings of my reference W, I can scale
it up to be larger and larger. As I scale it up to be larger and larger, we can prove
that this empirical probit loss becomes a valid estimate of the true generalization
loss for that particular W as its norm goes to infinity.
It's a somewhat -- I've given this talk before. I've really cheated at this point in
the talk and said this is a straightforward argument now.
There's one issue in this theorem that some people get upset about. In that what
I've proven here -- where's my consistency theorem?
What I've proven here is that the probit loss of my estimator is converging to L
star. I have not proved that the loss of my estimated W is converging to L star.
And I justify that by saying well this probit loss can be realized. I can actually
implement the process that adds Gaussian noise, make predictions.
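That randomized predictor is easy to realize; a minimal sketch (phi and candidates
are again hypothetical stand-ins):

    import numpy as np

    def probit_predict(x, w, phi, candidates, rng=None):
        # Randomized predictor whose expected task loss is the probit loss:
        # add unit-variance Gaussian noise to the weights, then decode with
        # the noisy weights.
        rng = rng or np.random.default_rng()
        w_noisy = w + rng.standard_normal(np.shape(w))
        return max(candidates(x), key=lambda y: float(np.dot(w_noisy, phi(x, y))))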
So it's giving me a prediction algorithm whose performance is approaching L star
because I can achieve this loss. The reason I can't get that L approaches L star
has to do with the fact that in infinite dimension with latent variables, even though
the performance is converging, you can construct this weird example where the
vector W is rotating in an infinite dimensional space forever.
The direction of W never actually converges. Okay. So here we're
going to do the analogous thing for ramp loss. This becomes much more -- the
proof of this is much trickier. But now we're going to replace the surrogate loss
with ramp loss. So this is the empirical ramp loss. We're going to minimize the
empirical ramp loss plus a regularization term.
And the theorem is very similar. There's like a logarithmic factor difference in the
theorem. It's going to be true that if the regularization is a power of N, where that
power is strictly between 0 and 1 we're still going to get consistency.
So this theorem looks deceptively like ramp loss is similar to probit loss. I'll say
why that's not going to work or why it's deceptive later. But it's essentially the
same theorem, up to logarithmic factor.
Okay. How does this theorem work? We have this -- we have this inequality.
We know that ramp is an upper bound on task, right? And we know that the limit
as the weight vector goes to infinity or equivalently if we take -- if we think of
adding noise with variance sigma, if we take the variance to zero, that's equivalent
to taking this weight vector to infinity, that we have this system of inequalities.
So what we want to do is find a finite rate as a function of sigma that relates the
probit -- I'm sorry, that relates the probit to the ramp loss. So we know that we're
going to get, as sigma goes to zero of this expression is less than the ramp loss.
What I'm going to do is give a finite rate for that in terms of sigma.
Okay. So this is our theorem that the probit loss is bounded by the ramp loss.
The probit loss at a finite sigma is bounded by the ramp loss plus a penalty that
depends on sigma.
Okay. So that's what I'm going to prove. I'm taking this inequality in a limit and
giving it a rate. So at the top level what I'm doing I've got a bound in terms of the
probit loss. I just argued for that based on this pack bayesian theorem.
And what I'm going to do is I'm going to give inequalities related to ramp to probit
and use the bound on the probit loss. And I've got this.
So should I skip this slide? I can see people fading away.
>>: In such a case, for the last term, we should understand that the complexity of the Y
space can get --
>> David McAllester: Oh, yes, sorry. This is the number of possible decodes,
and that's bad. But at least it's in a log.
So this is -- you think of this as the length of the sentence. That's still bad. We
believe -- I believe that using Johnson-Lindenstrauss we can actually prove that
we can get that down to log-log. In the original paper on the hinge loss by Taskar,
Guestrin, and Koller, they proved a theorem with a log-log term here, basically, in
their particular setting.
I think you can get a log-log in general. But, yes, this is the number of possible
decodings, and that's a troublesome term.
Let me just give you the essence of the idea. I'm not going to go through all of
this. Let's just look at this one line. What's going to happen here is I'm going to
say for every possible -- so we've got the space of possible decodes. Input
sentence and space of possible decodes.
For every possible decode there's a margin. What do I mean by the margin?
Take the decode that the system's actually producing, that's the best scoring
decode. Every other decode will have a score worse than that.
Take the gap between the two scores and that's the margin of a potential
competing decode. Sometimes these competing decodes are called distractors,
the biological community likes to call them distractors. Every alternate decode
has a margin.
The idea is there's a certain threshold on that margin such that, provably,
anything whose margin exceeds that threshold can be ignored.
The decode is just not going to be that one, when I add noise to the weight vector.
So the idea is that the probit loss is less than this thing, which is going to be
something handling the fact that it's not completely true that I can ignore the bad
guys, the things with large margin. But basically the probit loss is less than or
equal to sigma -- we're going to take sigma to 0 -- plus the max
over all the plausible decodes of the loss of that plausible decode.
And that's the fundamental proof method in getting something for ramp loss.
Right? Because this quantity, the max over plausible decodes, is something that
I can relate to ramp loss. The threshold is picked so as to make this true
when I do a union bound and a high-probability deviation bound.
And then this is just saying, this is just finishing the proof. It's taking that first
thing and doing a sequence of steps that relate it to the ramp loss.
This max can be replaced in here by a max over everything and then this
becomes the decode, this becomes equivalent to a loss adjusted inference and
you get it related to the ramp loss without going through a lot of details there.
And then using this inequality, this main lemma, this says the probit loss is less
than or equal to the ramp loss plus this, I get this generalization bound for the
ramp loss. This is saying the generalization probit loss of W over sigma is less
than -- and all I've done here is I've taken the probit loss and replaced it by an
upper bound, by the upper bound I just proved. So the probit loss was upper
bounded in terms of the ramp loss, and I've replaced the probit loss in the original
theorem by the ramp loss here.
And now I've got a generalization bound on the -- I've got a bound on the
generalization probit loss in terms of the empirical ramp loss. And now we can
just take schedules for sigma and lambda and get our theorem, get our
consistency theorem.
Okay. Now the other thing we can do, rather than just taking schedules for these
to get our theorem, the other thing we can do is actually sort of optimize away
sigma and get a generalization bound directly in terms of the ramp loss.
So it turns out that this is an approximately optimal value for sigma to minimize
that bound. And now we get a finite sample generalization bound in terms of the
ramp loss. So this is saying that the generalization probit loss is bounded by this
thing in terms of the ramp loss.
And really I think that these consistency theorems, these asymptotic consistency
theorems, are not so interesting. What's much more interesting to me are these
generalization bounds, because this is providing a concrete finite sample
generalization guarantee. It's telling you more than any kind of asymptotic infinite
limit statement.
So this is the finite sample guarantee we had for generalization with respect to
using the probit loss as the surrogate loss. When we optimize away sigma, here's the
analogous guarantee in terms of ramp loss. But now look at the differences --
what I want to focus on is this regularization term versus this regularization
term.
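The two regularization-like terms being compared, as read off the spoken description
(ignoring constants and confidence terms; the grouping under the cube root is my
reading):

$$ \frac{\lambda\,\|w\|^2}{n}
   \qquad \text{versus} \qquad
   \left( \frac{\lambda\,\|w\|^2}{n} \right)^{1/3} . $$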
So this term is linear in lambda, in this part of it, lambda times W squared over N.
This is lambda times W squared over N to the one-third. So what this says is
that for the bounds we've been able to prove, the ramp loss bound is significantly
worse than the probit loss bound. And it could be an artifact of the proof
technique, because the proof technique was natural and immediate for the probit
loss.
But I think my feeling is this is real. That you're better off using this probit loss.
Okay. So I'm basically done. I've skipped some of the, glossed over some of the
technical details. And the summary is we know we need surrogate loss functions
if we're going to regularize, and we have all these standard surrogate loss
functions from the binary case that generalized to the structured case. I haven't
talked about it, but all the theorems in the paper also are actually written for the
structured latent case.
So in the structured latent case I optimize not only over the decode but I optimize
over latent information like parse trees or what have you, and all of this analysis
generalizes to that case as well.
And we have probit and ramp that are both provably consistent, but I believe
they're significantly different in the structural setting.
>>: I have a question. So your argument that [inaudible] are not consistent is that
they are [inaudible] -- they just keep going up.
>> David McAllester: Which makes them sensitive to outliers.
>>: So that would suggest that if you consider, for example, some other bounded
function [inaudible] rather than the two you've got -- could this generalize to all
such functions? You picked two particular functions, and they're of rather similar
shape. Is it enough for the loss functions to be bounded, to be consistent, or is it just --
>> David McAllester: That's almost certainly true in the binary case. And there
are -- there's lots of theory that uses a Lipschitz bound, they have to have a
Lipschitz constant associated with them and the bound comes out in terms of the
Lipschitz constant.
I'd have to think about what properties you might need for the -- probably there's
something like that for the structured case. Yeah. Theoreticians talk.
[applause]