>> Max Chickering: Joaquin was in MSR Cambridge and is now in adCenter, dev
manager for the ranking allocation and pricing team.
>> Joaquin Quinonero-Candela: Hi. Good morning. Thanks for coming. So I
think the first question was how do you find time to teach a course and be a
dev lead in adCenter. I think I have done two 75 percent jobs during the
period of time between January and March. The reason I did it is that I moved
from research to a product group in the summer of 2009 and I was just getting
impatient that I hadn't really, how should I say, I hadn't really done anything
like what I was doing before that transition in those two years. And I was
getting a little bit impatient.
So it was great of Paul to actually allow me to go ahead and teach a course
with Carl Edward Rasmussen, who was my Ph.D. advisor a while ago. And he lives
in Cambridge as well, and we're sort of good friends. So 4F13 is a machine
learning course in the engineering department. This is a course for fourth
year engineering students at Cambridge University. And Zoubin and Carl have
been teaching this course for five years, using the same slides that Zoubin
had been using at University College London for another seven years prior to
that. And so Carl and I decided to just do it again from scratch. Which
actually ended up being also a lot of work.
The course was 16 lectures, and it has three parts. One part, it says Gaussian
processes here, but it really is more about an introduction -- oh, this -- oh,
sorry. It's really about an introduction to probabilistic regression and maybe
non-parametric models.
The second part, we talk about latent Dirichlet allocation just to teach them
about an unsupervised discrete model, and then finally, we taught the
TrueSkill model. Today I probably will not talk about LDA. Does everyone
know LDA? Who knows LDA? Okay. Maybe I'll say a little tiny bit
about it, but I'll not say a lot. I think what I'll do is I'll just show you
the result of applying LDA to a corpus of papers, because it's sort of fun.
I'll talk about TrueSkill a little bit later and I'll motivate it. What we
did, of course, was we motivated everything with actual data. We sort of
said, here are sort of three problems we're going to try to address. And for
TrueSkill, we're going to be ranking tennis players. So I wrote a crawler
that crawls the ATP website and gathers all of the games that every player
has played. So I gathered the dataset for 2011 and we'll look at that.
But so what I'll do is I'm actually just going to start from the beginning, as
one does. Do you know if I can go actual full screen from here? Maybe I need
to save it first. I feel like probably -- so sorry.
Apologies.
I should have prepared a little better.
Apologies.
Okay. So because you're not fourth year engineering students, it's very
possible that you might get very bored so if you do get very bored, you have to
shake your hands or do something, just tell me just move on, man, all right?
Okay.
So this is the first slide that the students saw. So you have a certain
dataset for a regression problem and the question is, what do you predict the
value of Y is going to be at X equals minus 0.25, okay? So, in fact, in the
lecture, we actually ask people to give a number. So Charles, what's your
number?
>>: Minus 15.
>> Joaquin Quinonero-Candela: Minus 15. Right. And it's possible that
different people might give different numbers. So how do we try to address
this? Well, we need to postulate some sort of model of the data, right.
>>: [indiscernible].
>> Joaquin Quinonero-Candela: That's a good -- all right. So here's a bunch
of curves, actually. So which one do we draw? So here's a bunch of curves,
and then we say, well, you could for example, you could for example pick a
polynomial, right? People know what a polynomial is. It's an interesting
parametric function. It has a bunch of weights. Polynomials of degree M have
M plus one parameters, okay?
And so now the parameters of the model are those weights, right, okay?
Interesting. But then what comes next? In order to -- oh, sorry. And what
you can do, or maybe one of the first things you should do if you postulate a
model is you should actually look at your model, right. Don't just go ahead
and fit the data immediately. Pick various degrees of your polynomial and then
just pick some values before you've seen the data and then just look at it.
Get some understanding for what your model can do.
And at the moment, well, at the moment we sort of see that with higher degrees,
the functions seem to be able to do more things. They also seem to have some
funky behaviors, but we'll get into that later, okay?
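As a quick illustration of that "look at your model first" step, here is a minimal numpy sketch with toy weights and a grid of my own choosing (not the lecture's data): it evaluates a few random polynomials per degree and summarizes how wiggly each curve is.

```python
import numpy as np

# Before fitting anything, probe what polynomials of different degrees
# can do: pick some weight values and just look at the curves.
xs = np.linspace(-1, 1, 100)
rng = np.random.default_rng(0)
for degree in (1, 3, 9):
    Phi = np.vander(xs, degree + 1, increasing=True)   # basis x^0 .. x^degree
    curves = Phi @ rng.normal(size=(degree + 1, 3))    # 3 random curves
    # Crude summary of "wiggliness": total variation of each curve.
    tv = np.abs(np.diff(curves, axis=0)).sum(axis=0)
    print(degree, tv.round(2))
```

Higher degrees tend to show larger total variation, which is the "funky behavior" visible in the slides.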
So some questions, right. One about model structure. Should we choose a
polynomial? Will, yep, go ahead.
>>: By the way, [indiscernible] showed a very similar set of graphs 15 years
ago at Cambridge.
>> Joaquin Quinonero-Candela: He did?
>>: At a generalization workshop.
>> Joaquin Quinonero-Candela: Cool.
>>: And he took the average of all of them, and it was like a beautiful fit to
the data.
>> Joaquin Quinonero-Candela: But that was after he had actually seen some
data.
>>: Yes.
>> Joaquin Quinonero-Candela: Okay. So we are actually going to do that.
Although unfortunately, I missed that lecture. I was not at Cambridge 15
years ago. So we're going to sort of pause on this question here. We'll just
keep it in mind but we're just going to move on.
So what degree should we choose for the polynomial? And we're going to call
that model structure. And then if I fix a degree, what value should the
weights take? So for now, we're just going to try and do the following. We're
going to try to give ourselves a method for selecting the best polynomial.
Okay, the single best polynomial degree in weights.
So what do we need? We need some sort of objective function to do that, right.
So let's just pick a very simple objective function. If we had postulated one
particular case for a polynomial, the red line, we can measure for every point
the absolute error. We can take the sum of squared errors, and that's the
well-known -- yeah, sum of squared errors, or we could divide by N if we
wanted to compute the mean squared error. It doesn't matter.
The key thing to realize is F of X is really a function of our vector of
weights, okay. So now we have a loss function. We have a parametric model.
We can actually go ahead and solve, right?
So I suppose everyone is, of course, familiar with what's going to happen now.
We need to introduce a little bit of notation. We're going to be stacking
quantities into vectors. I guess this is a laser pointer. So let's remember
this number, N is the number of training data we have. We stack them in vector
Y. We can evaluate function F at all of our N training data. And the error
vector is just Y minus F. So the sum of squared errors is just the norm of
this vector here, right?
And then we're already, we're going to use the opportunity now to introduce a
little bit of notation just to generalize already a little bit from
polynomials, right. We're going to look at linear in the parameters models
where the polynomial is a special type where the basis functions are defined in
this way, where the basis function is just X to the power of J, okay?
And so how do we write F now? Well, we write F as Phi times W. So what are
the dimensions of Phi? This is a real question that I ask in the lecture too.
Phi -- yeah, N times M. Maybe it's N times M plus one if it's a polynomial.
Let's just say N times M, because it's annoying otherwise. So we'll call M
plus 1 and M the same; it's going to be M.
Okay. So say again?
>>: [inaudible].
>> Joaquin Quinonero-Candela: That's cool. Very scary to have mathematicians
in the room.
>>: What was the comment?
>> Joaquin Quinonero-Candela: I said that in general, I'm going to say that we
have M parameters, although in the case, in this case, that we have M plus 1,
but I don't want to say M plus 1, because it's too much work. And then Leon
said if M and M plus 1 are the same, that means M is infinity. That was his
comment.
Okay. So then everybody knows this. You can write down some algebra and you
can take the derivative of this sum of squared errors with respect to vector W
and you, of course, get the normal equations that everybody knows. So there's
one view of the normal equations that I quite like, and that's the geometrical
view. I find it's quite cute. So I'll just try and draw it very quickly on
the board.
So the way I like to look at it is I like to say, well, imagine that we only
had two basis functions, right. So matrix Phi had only two columns, right.
So that means that vector F, right, is N dimensional, because I have N data
points. So I'm going to draw an N dimensional space here on the whiteboard,
although obviously it's not N dimensional, because I don't know how to do
that. But that vector is spanned by only two basis vectors that I'm going to
call phi one, phi two. Right. And so these are N dimensional objects. This
is the first basis function evaluated at all of my N inputs, and that's the
second one.
So I know that F needs to live in this plane, right, because it's generated by
these two basis functions. So F is going to be somewhere here. But Y, Y can
live wherever it feels like, right? Y is not constrained so Y also lives in
this N dimensional space, and Y is going to be, you know, somewhere here, let's
say. Let's say that's Y.
Okay. So now what's the thing that I'm trying to do? Well, F is just W1
times phi 1 plus W2 times phi 2, right? So what I'm trying to do now is I'm
trying to minimize the norm of the difference between these two guys. So I'm
trying to minimize this vector here. It doesn't matter in which direction I
draw it. I just have vector E here and I want to minimize it.
But then I know that to minimize this vector, what I need to do is I need to
make sure that -- so if this is the projection of Y onto the basis, onto phi
1, phi 2, right. Let's imagine it was here. I'm not really great at drawing.
Then in a way there's a component of vector E here that actually can be
explained by phi 1 and phi 2, right. So I should make sure that after I've
picked W1 and W2, E is orthogonal to all of the basis functions, right.
So if I just go ahead and do that, I can write it. This is what I said here.
And then if I do that and I crack E open and then I solve, I actually get, I
end up getting the same equation. So it's actually pretty simple, but it's a
view that I think I like as well.
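The algebra just described is short enough to verify numerically. This is a hedged sketch with made-up toy data (not the lecture's dataset): it solves the normal equations for a small linear-in-the-parameters model and checks the geometric property that the residual E is orthogonal to every basis vector.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 17)                        # N = 17 inputs (toy)
y = np.sin(3 * x) + 0.1 * rng.normal(size=17)     # toy targets

M = 5                                             # number of basis functions
Phi = np.vander(x, M, increasing=True)            # N x M design, phi_j(x) = x**j

# Normal equations: w = (Phi^T Phi)^{-1} Phi^T y
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Geometric picture: the residual e = y - Phi w must be orthogonal
# to every basis vector, i.e. Phi^T e = 0.
e = y - Phi @ w
print(np.abs(Phi.T @ e).max())   # numerically ~0
```

The orthogonality check is exactly the whiteboard argument: any component of E lying in the span of the basis could still be explained, so at the optimum none remains.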
Okay, cool. Let's move on. So imagine we did this now for all of the degrees
of polynomials. We find the best solution. So one thing that I'll tell you is
that we have 17 data points in this dataset and so now the question is, with
the polynomial of degree 16, we have 17 parameters. So in this case already,
the sum of squared errors can be made zero. In this case here as well. In
this case, you even have multiple solutions.
In all other cases, you would only be able to make it exactly zero if it
happened by chance that the points were arranged in such a way that they can be
explained by the polynomial of that degree. If they're arranged completely in
a general way, then you don't have a guarantee that the sum of squared errors
is zero.
So you look at this, and you say oh, fantastic. We've actually solved the
problem, because I can go and in this case, I'm actually going to choose and
pick the polynomial of degree 17 and you'll see in a moment why I do that.
And so the answer, so you were wrong, Charles, because the right answer seems
to be roughly 2. That's it. We're done, right? So are there any objections?
>>: Pretty good.
>> Joaquin Quinonero-Candela: Okay. We've saved some time here.
So the interesting thing is that we can make everybody in the room be right.
So please think of one value of Y at X equals minus 0.25. Just pick one. Just
pick one, and we actually can find a polynomial of degree 17 that will make
your answer be correct, and that will go exactly through the training data as
well. Okay? So actually, we haven't actually solved any problem at all.
We're not able to --
>>: I think you solved everyone's problem.
>> Joaquin Quinonero-Candela: That's right. So we're missing, obviously, some
assumptions. And what's interesting in the course, so to your point, Rich,
is some people complained and said that's a stupid solution. And it's like,
okay, why is it stupid? The mean squared error is zero. It's not stupid.
It's a good solution.
And then they said yeah, but -- and when they look at that, oh, but there's
other ones. And then someone said oh, but this is better. I prefer many
stupid solutions to one stupid solution. So it was sort of interesting to sort
of get them thinking.
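The "everyone can be right" claim is easy to demonstrate. The sketch below uses invented training data (17 points, Chebyshev-spaced for numerical stability, not the lecture's dataset): appending the query point with any target value of our choosing gives 18 equations for the 18 weights of a degree-17 polynomial, and the system has an exact solution.

```python
import numpy as np

# 17 toy training points plus the query point x* = -0.25 with an
# arbitrary target value of our choosing.
x_train = np.cos((2 * np.arange(17) + 1) / 34 * np.pi)
y_train = np.sin(3 * x_train)
x_star, y_star = -0.25, -15.0     # e.g. Charles' answer

x_all = np.append(x_train, x_star)
y_all = np.append(y_train, y_star)

# A degree-17 polynomial has 18 weights: the 18x18 Vandermonde system
# has an exact solution, so the fit goes through every training point
# AND takes the value we dictated at x*.
V = np.vander(x_all, 18, increasing=True)
w = np.linalg.solve(V, y_all)
print(V[-1] @ w)   # ~ -15.0: any answer can be made "correct"
```

Swapping in a different `y_star` gives a different interpolating polynomial, which is the point: with enough flexibility and no further assumptions, every answer in the room is "correct".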
Okay. So now we need to sort of pause for a moment, and we need to sort of ask
ourselves a couple of questions, right. So do we think that all models are
equally probable before we see the data? And then, of course, the question is
what does a probability of a model even mean. Can you even reason in those
terms.
But there's no question that you should reason in terms of multiple models.
And then do we need to commit to picking one polynomial degree and one set of
weights, or do we have actually some way of computing with several of them.
So fine, but we might need some sort of language to do that. And then perhaps
our training targets are contaminated with noise, right, which is also
something that some of the people in the course said. They said actually, I
don't really want to go through all those points exactly. It's a fair thing,
right.
So since this is a much easier question to address than the other two, we can
just go ahead and quickly address it.
So okay. So imagine now we sort of, we're going to be introducing the
likelihood. So imagine that you knew what function generated the data. Then
there's a certain probability that this function actually generated this
particular set of blue dots. And if you assume the noise is additive Gaussian
and sampled independently, then for one little error term, this would be the
probability distribution, of course. And then we go back to our beloved
matrix and vector notation. We stack the N error terms. We can write a joint
Gaussian distribution, which is this guy. This is the variance of the noise.
And then massaging things around a tiny bit, we recognize a good old friend
here, right?
So now what's interesting is if you look at this quantity, and actually the
notation is often going to be a little bit confusing, because wherever I write
F, you could actually write W. You could write one or the other, it doesn't
matter, because F is equal to Phi times W. So you can sort of see that. If
you wanted to maximize this quantity with respect to F, right, like now you say
well, my Ys are given and I want to find the weights that maximize the
likelihood, then you obviously, this is exactly equivalent to minimizing the
sum of squared errors.
So in a way, you can sort of say, okay, so very quickly on the likelihood
also, there's some terminology. If you read David MacKay's books or attend
his lectures, he's very religious about always saying the likelihood is the
likelihood of the parameters. It's not the likelihood of the data. It's the
likelihood of the parameters, and it's the probability of the data given the
parameters.
Okay. So the interesting thing is that the maximum likelihood solution here is
exactly equivalent to the minimum sum of squared errors. So this hasn't
worked. There was a nice -- it was a nice try, but it doesn't help us.
Okay. So back to what you were saying, Rich, earlier, I think. So we don't
really know what particular function generated the data. And it turns out if
we pick a polynomial of degree 17 that more than one of our models can
perfectly fit the data. And ideally, we'd like to be able to reason in terms
of multiple possible explanations of the data, not just one.
And so what's going on over the next couple of slides, we're going to explain
the mechanics of how you can go from a collection of plausible functions to a
collection of plausible functions after you've seen the data. And then if you
do that, you can indeed average them. So the black sort of line here is a
result of averaging all those sort of light gray samples. And then we've
plotted two standard deviations up and down, just sample standard deviations.
Nothing fancy, okay?
So in the course, what we do at this stage is we don't necessarily assume that
they remember the rules of probability. So we actually just quickly, we give a
quick example, which I'm sure everyone is familiar with, this type of example,
just to introduce Bayes' rule and sort of the two rules of probability. So
I'll try to just quickly go through it.
So this is actually real data, and it's actually based on a real study that
was made on doctors to ask them what their inference would be. So the data is
that
1 percent of scanned women have breast cancer and 80 percent of women who have
breast cancer will get a positive mammography. But 9.6 percent of women
without breast cancer will also get a positive mammography. So the question
is, so, a woman gets a scan and it is positive, what is the probability that
she has breast cancer? So these are the four options. I would like you to
vote, please.
>>: Can we have average options?
>> Joaquin Quinonero-Candela: Average of options. What would the average be?
It would be 50. Yes, you can introduce an option 2.5 if you want. Okay. So
who
votes for one? Nobody seems to vote for one. You have to vote for one of
them. Who votes for two? Okay, roughly three or four people. Who votes for
three? One, two, three, four people. And who votes for four? No, that's
obviously not. Who votes for 2.5?
Okay. So how do we work out the answer to this question? The easiest way to
work out the answer to this question, I'm actually going to skip this slide,
is to actually just build an example. So you can pick 10,000 subjects if you
want, and then you can fill in a table like this. You have all the data to
fill in this table, because we're given the probability of cancer, we say, is
1 percent, right? We assume that all the subjects have been scanned, right.
So that means that the sum here, the marginals across the rows, must be 1
percent to 99 percent. If I have 100 subjects, sorry, 10,000 subjects in this
case, I know this has to add up to a hundred percent, okay?
I know that. And then I can, by looking at data here I know that if there is
cancer, the probability of a positive mammography is 80 percent. So that
allows me to actually split things in this way. So if there is cancer, I know
that 80 of the cases where there is cancer would have a positive mammography
and that means, as a consequence, 20 will not, and I can do the same down here
and I
can complete my table that way.
Okay. So once I've completed that table, what's the question we were asking?
The question we were asking is you know that the mammography is positive so you
know you're in this column. What's the probability of cancer? So that is
simply 80 divided by 80 plus 950, right? That's what it is. And what's that?
That's
7.8 percent. So you shouldn't feel too bad, because, I don't remember exactly
what it was, a number between 60 and 70 percent of the doctors in that study
actually chose the same option that the majority chose here, which is option
three: around 90 percent. So that's quite worrying, actually, if you think
about it, because that means that doctors don't really understand
probabilities. But that's fine.
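The table-filling argument translates directly into a few lines of arithmetic (0.096 × 9,900 gives 950.4, which the slide rounds to 950):

```python
# Fill in the 10,000-subject table and read off
# P(cancer | positive mammography).
subjects = 10_000
cancer = 0.01 * subjects                  # 100 subjects with cancer
no_cancer = subjects - cancer             # 9,900 without
pos_given_cancer = 0.80 * cancer          # 80 true positives
pos_given_no_cancer = 0.096 * no_cancer   # 950.4 false positives (~950)

p = pos_given_cancer / (pos_given_cancer + pos_given_no_cancer)
print(round(100 * p, 1))   # 7.8 (percent)
```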
>>: [inaudible].
>> Joaquin Quinonero-Candela: Okay.
>>: Just for the backhander, isn't it right to just go back and say that the
previous one would [indiscernible] the problem of the choices?
>> Joaquin Quinonero-Candela: Yes.
>>: Isn't it just always easiest to look at your odds and multiply by your
current estimate?
>> Joaquin Quinonero-Candela: Okay. So how --
>>: So roughly, you know, you can do a backhand arithmetic [indiscernible] you
can say that's roughly 10. And you maintain a current estimate, right. That's
the [indiscernible] trick.
>> Joaquin Quinonero-Candela: And that's going to be presumably an
approximation to the correct answer? So you should have been a doctor.
So the reason we're giving this example is we need to introduce Bayes' rule,
because otherwise we would not be able to go on. At the moment, we introduced
the likelihood, we introduced the probability of the data given any particular
function, but we want to consider several functions. So we're going to
have -- we're very soon going to introduce a probability distribution over
functions, and so we will, of course, be interested in the posterior.
>>: Is it still legal to teach Bayes' rule in the U.K.? You've heard about
these cases where judges have decided that Bayes' rule is not allowed in
evidentiary reasoning?
>> Joaquin Quinonero-Candela: Yes. So what's his name? Bill Davies is very
involved. Yes, that's right. All right. So this is just an example of Bayes'
rule. And this is the slide that's really important. You only need to
remember two things. You need to remember the sum rule, which states that if
you have a joint distribution, either continuous or discrete, if you sum over
the variable you're not interested in, you get sort of the marginal. And then
the product rule, of course. And then you combine them both, you get Bayes'
rule. And the important thing to remember about Bayes' rule really is one
thing. It's that if you look at the numerator here, ultimately you want to
get a proper distribution over A, right. So it has to add up to one as you
sum over all values of A. But this guy doesn't, because you're multiplying
one that does by one that doesn't. So if you just sort of sum over A here,
you sort of get the marginal P of B. And that's it. And that's your sort of
normalizing constant.
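Spelled out in symbols (using the slide's A and B, and a sum that becomes an integral in the continuous case), the two rules and their combination are:

```latex
% Sum rule: marginalize out the variable you are not interested in
p(B) = \sum_{A} p(A, B)

% Product rule
p(A, B) = p(B \mid A)\, p(A)

% Combining the two gives Bayes' rule; the denominator is the
% normalizing constant p(B), obtained by summing the numerator over A
p(A \mid B) = \frac{p(B \mid A)\, p(A)}{\sum_{A'} p(B \mid A')\, p(A')}
```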
Okay. So then we can go back. Then we can go back to our business here. So
the little tiny leap of faith for now is that we have one way of defining P of
F. So let's imagine -- the way I call these guys is I call them Santa Claus
bags full of functions. And so you grab your hand inside the bag and you pull
out a function and you put it here and you just do it again, right. And
there's a certain probability of grabbing any particular function. Let's not
worry how that happens. Let's just imagine you have a P of F, right.
And we did just look at P of Y given F a moment ago when we looked at the
likelihood. And now Bayes' rule tells us how to combine these two and in the
next few slides, I'm actually going to tell you how you go from here to there.
And I'll give you both a Monte Carlo version of it and I'll give you an
analytic version of it.
Okay. Am I going too slow? Am I going too fast? Tell me something.
>>: For a different way, it's interesting to see how you present it.
[indiscernible]. Now if you take Charles' question, his value is like zero
point something. Why is it good? Tell me why is it good. Why would this
value [inaudible]. It looks good. I agree. But are we still back to
[indiscernible] that's good --
>>: [inaudible]. That's a different question. What is the value of the
functions [indiscernible].
>>: [multiple people speaking]
>>: But in the end you need to say, why is this curve a better curve for this
[indiscernible].
>> Joaquin Quinonero-Candela: So I'm going to actually postpone an attempt to
answer. I don't think I'll be able to answer that question in a satisfying
way.
>>: I'm looking at how you get the course going and I see that at this point, you
can go, I can see very well you can go all the Bayesian way. But there is this
question that annoys me that you can't insert [indiscernible].
>> Joaquin Quinonero-Candela: Let me try to go through --
>>: [indiscernible].
>> Joaquin Quinonero-Candela: That's true. Okay. So if we stick with our
particular choice of a structure, and the structure is a linear in the
parameter model, then if we fix the basis functions for some reason and if we
have many of them, it's probably okay to fix them, it's not too bad, then we're
left with having to make choices for the Ws. So what we could do with the Ws
is, for example, we could decide to sample them from some distribution. So
imagine we said, well, we're also going to use Gaussian here with a certain
variance and it's centered around zero.
So how do I actually sample a function? Well, it's simple. If I have M equals
17, I sample 18 weights independently and then I multiply the basis functions
by them and then I sum and then I get sort of one particular sample. I can do
that sort of again and again. So that's sort of one way of defining a prior
over our functions. It's a bit of an indirect way, because I've done it in
two steps. First, I've postulated some functional form, and then I've defined
a prior on the parameters.
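The two-step sampling procedure just described (fix the basis functions, then draw Gaussian weights) looks like this in numpy; the grid, seed, and weight variance are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 17
xs = np.linspace(-1, 1, 200)
Phi = np.vander(xs, M + 1, increasing=True)   # 18 basis functions x^0..x^17

# One sample from the prior over functions: draw the 18 weights
# independently from N(0, sigma_w^2), then form f(x) = sum_j w_j phi_j(x).
sigma_w = 1.0
w = rng.normal(0.0, sigma_w, size=M + 1)
f = Phi @ w

# Repeat to fill the "Santa Claus bag" with more functions:
samples = Phi @ rng.normal(0.0, sigma_w, size=(M + 1, 5))
print(f.shape, samples.shape)   # (200,) (200, 5)
```

Each column of `samples` is one grab from the bag: a whole function, induced indirectly by the prior over weights.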
>>: [inaudible] we're looking at the data which is in the prior
[indiscernible].
>> Joaquin Quinonero-Candela: Oh, no, no. I've done that. I mean, we
happened to have looked at the data because we had to motivate things here.
But no, no, this is not a prior that depends on the data.
>>: Okay. But these are priors that the average of the [indiscernible].
>> Joaquin Quinonero-Candela: Say again, sir?
>>: You have a zero there.
>> Joaquin Quinonero-Candela: I have a zero here, that's right.
>>: It's a prior.
>> Joaquin Quinonero-Candela: That's a prior.
>>: So [indiscernible].
>> Joaquin Quinonero-Candela: What this is saying, indeed, is it's saying,
indeed, that I don't believe the weights should be arbitrarily large. That's
true. And then what I should do is I should just sort of look at this.
So to your question, the one thing that I'm doing is I'm just fixing this prior
variance here, right. And I'm also saying I'm defining this to have zero mean.
For the parameter that's the bias, I could decide not to do that if I knew
better. And then we're not going to go into that, but I could actually define
the prior over this variable and I could also sample there.
>>: But the point is, in terms of the [indiscernible], because it kind of
would show that: look at your data. Based on your data, choose your prior.
And then I can, if I'm allowed to do that, answer 15, which is, we know that
our goal is to get to the answer 15. I can force my prior to give me this
answer.
>> Joaquin Quinonero-Candela: You could force your prior to give you that
answer by only picking functions from the prior that are equal to minus 15 at
minus 0.25, and that can sort of wiggle around everywhere else. Absolutely
right. I agree with that.
>>: But the prior implies that you think the derivatives are not very good.
>> Joaquin Quinonero-Candela: Yes.
>>: He's just explained how you can express the prior. And --
>>: [multiple people speaking]
>> Joaquin Quinonero-Candela: So let me actually try to make it through the
next couple of slides, because what I'll show you is that you can actually be
extremely vague about your prior and you can still get -- you can still model
the data. So let me -- just bear with me for a moment.
>>: [inaudible].
>> Joaquin Quinonero-Candela: That's fine. I'll skip through those. This is
just the mechanics of the maximum likelihood estimate where you don't have
priors. This is the case where you introduce a prior. This is just all
mechanics. There's one thing that I'd like to talk about, however. I'll sort
of skip that as well.
Yes, okay. This one I want to talk about a little bit. So I'm just going to
pause here for a second. If I wanted to do what I think you were saying, which
is I want to choose my prior based on the data. Okay. So what's my prior?
I'm going to denote my prior now by this very curly M. So this very curly M
doesn't mean the degree of the polynomial. This means a choice of the basis
functions, how many of them, right, and it can even mean what sort of priors I
choose on the weights, okay? And I can have as many as I want.
And then what I'm going to spend time doing now is I'm going to spend time
telling you about the marginal likelihood or the evidence.
So the evidence is going to allow me to -- it's sort of one potentially
dangerous way to choose retrospectively one out of several priors, right. And
I think, instead of one prior, I could actually have very many Santa bags, and
these are different priors. And one of them is maybe Charles' one that is
equal to minus 15 at minus 0.25, right. But I could have
others, right. I could have different types of basis functions. I could have,
to Patrice's point, I could, depending on how, what the variance of those guys
is and depending on what the degree of the polynomial is, I can allow for more
wiggliness or less wiggliness, okay?
So what's the marginal likelihood like? What does it do? It's actually a
pretty simple object. This here, okay, this happens to be the weights here.
You could plug in F if you wanted. Like I said earlier, it kind of doesn't
matter. This sort of says, well, if I have chosen one way to generate
functions from my prior, I'm going to compute the average likelihood of the
functions from this prior. That's all it is.
And I'm going to show you how to do that in a more intuitive way in a second.
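The "average likelihood of functions from the prior" description is exactly a Monte Carlo estimator of the evidence. Here is a sketch on an invented one-weight linear model (my own toy data and hyperparameters), where the evidence is also available in closed form for comparison:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.array([-0.5, 0.0, 0.5, 1.0])    # toy inputs
y = np.array([-0.4, 0.1, 0.6, 0.9])    # toy targets
sigma = 0.2      # assumed noise standard deviation
sigma_w = 1.0    # prior std of the single weight in f(x) = w * x

# Monte Carlo evidence: average the Gaussian likelihood p(y | w)
# over weights drawn from the prior.
ws = rng.normal(0.0, sigma_w, size=200_000)
E = y[None, :] - ws[:, None] * x[None, :]
liks = (np.exp(-0.5 * (E**2).sum(axis=1) / sigma**2)
        / (2 * np.pi * sigma**2) ** (len(y) / 2))
evidence_mc = liks.mean()

# For this linear-Gaussian model the evidence has a closed form:
# y ~ N(0, sigma_w^2 x x^T + sigma^2 I).
C = sigma_w**2 * np.outer(x, x) + sigma**2 * np.eye(len(y))
evidence_exact = np.exp(-0.5 * y @ np.linalg.solve(C, y)) / np.sqrt(
    (2 * np.pi) ** len(y) * np.linalg.det(C))
print(evidence_mc, evidence_exact)   # the two estimates agree closely
```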
>>: So quick question. So this is a [indiscernible] you called out
specifically the likelihood of the weights and not the data. And then this is
the likelihood of the data. Curious about that.
>> Joaquin Quinonero-Candela: No. If I have said that, then I was wrong. And
if it says so in the slide, then it's wrong. It's a marginal likelihood of the
model.
>>: I don't mean the marginal likelihood. So you said somebody's very careful
about saying likelihood of weights. I see Gaussian likelihood of weights and
then [indiscernible] likelihood of data.
>> Joaquin Quinonero-Candela: Oh, sorry.
>>: Bullet point two says Gaussian likelihood of the weights, P of Y given X.
>> Joaquin Quinonero-Candela: But that's the probability of the data. The
probability of Y given W is the likelihood of the weights.
>>: I see.
>> Joaquin Quinonero-Candela: It's probably not important, actually. I was
just quoting David and his obsession with getting that right. This object,
if you want to describe it as a function of Y, is a probability of Y. Or
it's a likelihood of W.
>>: So how do you find the likelihood? What does that mean?
>> Joaquin Quinonero-Candela: The likelihood is a bit like reverse
engineering.
>>: [inaudible].
>>: [multiple people speaking]
>> Joaquin Quinonero-Candela: Yeah, here it is --
>>: [indiscernible]. You will calibrate it with the [indiscernible].
>> Joaquin Quinonero-Candela: Yeah, here, we can write it. It's very easy to
write. The likelihood here is the probability of the data, Y, given a function
F. So specifically what is it? Because we said that the noise was going to be
independent, it's the product of -- it's the product of 17 objects. I have 17
data points. And for each of them, it's a Gaussian evaluated -- so it's a
Gaussian in Y that has mean the red dot, right. So it would be Y I minus F of
X I, squared, divided by whatever I think the variance of the noise is. That
for now is a parameter, so the likelihood is a function of it -- you can also
learn what the variance of the noise is.
So yeah, and if you take logarithms, you're going to have essentially minus the
sum of squares. So yeah, you can write it down.
Okay. So now I have three Santa bags. These are three possible priors I could
have chosen, and I'm sorry that yours is not there. But that's just the way it
is.
So actually, we're going to try to do two things. We're going to try to obtain
what the predictive distribution would be, right. What the figures show, they
show samples from the prior, ignoring the data. And then on top of that, I've
plotted some even smaller, simpler data set, okay. That's all it is.
And now what we're going to do is we're going to have to define a likelihood.
The prior we already have. In a way, I've said I'm not telling you what it is.
It's some prior. It's actually not a polynomial. These functions are drawn
from three different Gaussian process priors. But let's just not worry about
that. I just have a Santa bag. That's the only thing that matters.
Then I have this data. And the other thing we're going to do is we're going to
compute the marginal likelihood for each of those three models. So that if I
wanted to do this thing of picking a prior retrospectively, I get some sort of
score to do that.
And then I'll tell you, it's not really on the slides, but I'll tell you how
you can actually relax that. You don't have to commit to any of them. You can
use them all three. And it's very easy to compute where the mixing proportions
should be.
Okay. So we're going to choose a different likelihood now. So we're going to
choose a different noise model. And here's why. So we're going to assume that
the noise is uniform in the interval -- well, in this interval here, minus
0.2 to plus 0.2. And all noise terms are independent.
So what does that mean? That means that when I compute the likelihood, I have
to compute each of the individual terms here. And they're either going to be
zero or they're going to take this constant value, right. One of the two,
okay? So they're either zero or they're equal to 1 over 0.4, so they're equal
to 2.5, okay?
So that means that if I look at the likelihood of a given function, and I have
N data points, I just sort of -- it's either going to be zero, if any of the
Ys is further from F than 0.2 in absolute value, or otherwise I just get 2.5
to the N, right. So if I draw that, it looks like this. So all we're saying
is that if I have here some function and this is the data that I have
observed, okay -- sorry. These are my inputs here. So I'm just going to put
some data down here.
If by any chance -- and now I'm going to draw sort of the -- I'm going to try
to draw something like a uniform of some -- of this. So imagine that this here
was 0.2, okay. So you can see that this likelihood term is zero, right,
because I've said that this data point has to be in a uniform distribution
center around the function value, right. This guy just makes it and this guy
just makes it on the other end. But the likelihood here is zero.
If this guy had been inside, I would have had 2.5 times 2.5 times 2.5.
Okay? So why did we construct this funky noise model? Because it allows us to
evaluate the -- to actually compute that integral in a very interesting way.
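That noise model is easy to write down. Here is a sketch of it; the names are illustrative:

```python
def uniform_likelihood(xs, ys, f, half_width=0.2):
    """Likelihood under independent Uniform(-0.2, 0.2) noise.
    Each term is either 0 (the point misses f by more than 0.2) or the
    constant density 1 / 0.4 = 2.5, so the product over N points is
    either 0 or 2.5 ** N."""
    if any(abs(y - f(x)) > half_width for x, y in zip(xs, ys)):
        return 0.0
    return (1.0 / (2 * half_width)) ** len(ys)
```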
Okay. So now we have a likelihood. So remember, the marginal likelihood is
just again, you know, I look at all the functions that I can generate from my
prior and then I compute the likelihood for each of them and then I just report
the average likelihood or marginal likelihood, if you will.
So because I've made this funky choice of a likelihood, it turns out that the
likelihood is zero in the fraction of the cases where the function that I
pulled from my Santa bag misses one of the points by more than 0.2 -- then
that one is zero, right.
So say that S sub A is the number of accepted ones, the number of ones that
actually fit my data, and I sample a total of S of them. So I can approximate
this integral by a sum. I can say, well, imagine that I do the following
procedure. I go ahead and I draw a function F of S from my bag, and then I
evaluate the likelihood, and then I just take the average of all of those
guys, right.
So you agree that is sort of the value that the evidence would have.
So now, let's just look at what happens. So in this case, we've had to sample
a million samples from each of the three bags. And only the super wiggly guy
here, Mr. Green, only generated 8 functions that actually conformed to the data
given our likelihood. Given our noise model, okay?
Mr. Blue generated ten times as many. He generated 88 such functions, right.
And Mr. Red, you know, 17. So the interesting thing is that these functions
here are actually -- this is a valid way, an incredibly inefficient but valid
way of sampling functions from the posterior distribution. Say if I had
written P of F given Y is equal to P of Y given F times P of F and then
normalized and I wanted to sample functions from the posterior after seeing the
data. Then this is a valid, crazy mechanism to do that.
So what have we learned? So if we divide, say, by S, by one million and
multiply by 2.5 to the N in this case, N is five data points, then we get those
four numbers here. But these four numbers here are proportional to the numbers
here. So it doesn't matter. You can look at either, right.
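The whole procedure -- draw from the prior, keep the functions that survive the uniform likelihood, and average -- can be sketched directly. The "Santa bag" below is a toy prior over constant functions, a stand-in for the Gaussian process priors in the talk:

```python
import random

def uniform_likelihood(xs, ys, f, half_width=0.2):
    """0 if any point misses f by more than half_width, else 2.5 ** N."""
    if any(abs(y - f(x)) > half_width for x, y in zip(xs, ys)):
        return 0.0
    return (1.0 / (2 * half_width)) ** len(ys)

def estimate_evidence(xs, ys, sample_prior, n_samples=100000):
    """Monte Carlo estimate of the marginal likelihood: draw functions
    from the prior and average their likelihoods. With the uniform noise
    model this reduces to (accepted / n_samples) * 2.5 ** N, and the
    accepted functions are (inefficiently obtained) posterior samples."""
    total = 0.0
    accepted = []
    for _ in range(n_samples):
        f = sample_prior()
        lik = uniform_likelihood(xs, ys, f)
        total += lik
        if lik > 0:
            accepted.append(f)
    return total / n_samples, accepted

# toy "Santa bag": constant functions with a standard normal prior on the level
random.seed(0)
def sample_prior():
    c = random.gauss(0.0, 1.0)
    return lambda x, c=c: c
```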
So what that is saying is that if you look at the marginal likelihood score,
then Mr. Blue is ten times more -- has a ten times higher score than Mr.
Green. And if you wanted to sort of use proper terminology, you would say that
the probability of the data under model blue is ten times higher than the
probability of the data under model green.
>>:
So what prevents [indiscernible] gazillion.
>> Joaquin Quinonero-Candela: You could use a gazillion models, and if you
were doing -- if you were using a Gaussian process prior, rather than having
sort of a discrete index for models --
>>: I just want to take the maximum [indiscernible] sample with the largest
SA, this would be my choice.
>> Joaquin Quinonero-Candela: So you'd pick model blue.
>>: I'm not picking just three. I'm going to pick a gazillion many colors.
For each one of them, I'll sample a million functions. For each one of them
I'll get an S A, and I'll pick the parameter, the color, for which I got the
largest ratio.
>> Joaquin Quinonero-Candela: And so what you're trying to say is that, I'm
guessing you're trying to say this, is that the marginal likelihood is
dangerous, is almost as dangerous as the likelihood because you can overfit.
If you try hard enough, you can overfit with it. So the answer is that the
safer way to pick a model, of course, is to consider them all. Because this is
P of Y given your model choice, given your bag, right?
But now what if you defined a P of M I, right? You said you wanted to use
many of them. A hundred of them. Sure, why not. If you were to, in the
simplest of all cases, if you chose them in such a way that you didn't have
any particular preference --
>>: [inaudible].
>> Joaquin Quinonero-Candela: That's right.
>>: [inaudible].
>> Joaquin Quinonero-Candela: So the key thing here is that if you're able to
specify your prior completely, say you said that P of -- let me just write it.
If you had a bunch of models, MI, with I goes from one to, I don't know, some
number L, right. And, you know, for simplicity, imagine that you decided that
the prior probability of any of those guys is 1 over L. If you know better,
then you pick something else. If you don't, then you just do that.
And then actually, you could combine the predictions from all of those guys by
weighting them by the posterior probability of the model. By P of M I --
>>: When you're talking about at this level, M I is not specified. So the two
layers that you created don't have -- because you can say in advance, take
these bags, pool them together and --
>> Joaquin Quinonero-Candela: Go ahead.
>>: You use a uniform prior [indiscernible]. Is there any [indiscernible] to
get a prior over models by digging into the complexity of the
[indiscernible]?
>> Joaquin Quinonero-Candela: There is. There's this nice paper that Zubin
and Carl wrote early 2000, 2001 maybe. It's a paper about -- it's about
Occam's Razor, where they say Occam's Razor doesn't exist. Where they use
linear in the parameter models. I think they might have used polynomials, but
I'm not sure, or maybe they used something else.
But anyway, what they did is they -- the models were defined in such a way that
it could be polynomials. Anyway, something about the basis functions, basis
functions of higher index had more energy or more variance, if you will. So
they scaled down the prior variances on those in such a way that the amount of
variance from the function stayed constant or something like that. And then
you could add as many as you wanted.
I don't know whether that made sense.
>>: So what happens if you plug the same logic through where we have a
discontinuous function like the left side is a [indiscernible].
>> Joaquin Quinonero-Candela: I'm not sure I understand the question.
>>: So you just, you have the graph. Slide [indiscernible] is the real
function space. But your model has this assumption of continuity, so you're
always going to force this weird connection between the two. Any human coming
along and drawing a curve would look at it and go instantly, this continuity
[indiscernible] you've thrown two species together. If you separated by
species, your actual two functions behave very differently. If you throw them
together, you guess discontinuity. So I'm just wondering for this Gaussian
process, this explanation here has this assumption of continuity that seems to
be broken in a lot of the real cases.
>> Joaquin Quinonero-Candela: So you're saying something about how we've --
you're saying something about the prior functions itself. You're sort of
complaining about --
>>: It's actually the model structure. Given the model structure, you can't
express that [indiscernible].
>> Joaquin Quinonero-Candela: So you're saying I cannot, with the
linear-in-the-parameters model, I cannot express -- what if
some of those basis functions were zero in a range of the data, and then I
haven't really said how I'm choosing them. I could choose them however I want.
I think, yeah, with the Gaussian process, certain things are a little bit more
difficult because the Gaussian process is obtained by using an infinite amount
of basis functions and that actually means extra smoothness. So there are
certain things you can't define with Gaussian processes.
Yeah, so which is more general, the Gaussian process prior or a finite linear
combination of basis functions. I don't think there's an obvious answer to
that.
Okay. So in this case here, what we do is you just sort of take the average of
the samples. Here, we get all this wiggliness because we didn't have many
samples. We would have needed more. But then as you were saying, in a way, I
could have done this whole thing by combining all of these functions in one bag
and then just doing this whole thing. And that would be equivalent to doing
this and then reweighting -- combining these three but reweighting them by the
posterior probabilities of each of the three models.
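The reweighting step is just Bayes' rule over models. A sketch, using the accepted-sample counts from the talk (8, 88, 17) as stand-ins for the evidences, since they are proportional to them:

```python
def posterior_model_probs(evidences, priors=None):
    """Posterior over models: P(M_i | data) is proportional to
    P(data | M_i) * P(M_i). With a uniform prior, the mixing
    proportions are just the normalized marginal likelihoods."""
    if priors is None:
        priors = [1.0 / len(evidences)] * len(evidences)
    unnorm = [e * p for e, p in zip(evidences, priors)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# accepted counts from the talk: Mr. Green 8, Mr. Blue 88, Mr. Red 17
weights = posterior_model_probs([8.0, 88.0, 17.0])
```

Any common scale factor (here, dividing by a million samples and multiplying by 2.5 to the N) cancels in the normalization, which is why the raw counts suffice.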
Okay. So I have like three minutes, actually, time passes pretty fast. I
think what I'll do is I'm going to show you -- I'm just going to show you this
thing here. I will not bother -- if you can see it well enough, I'll not
bother downloading it again. So here, we can go back to our original question.
So okay. I'll address this in a second. What degree? So what we're going to
do now is we're going to use the marginal likelihood to select the degree of
the polynomial for our dataset here. And, oh, there's something we didn't talk
about in the previous one. Sorry for that. So I completely forgot.
So in the slide where we had the marginal likelihood, we also had a predictive
distribution. So a probability distribution of Y at minus 0.25 or whatever.
And you sort of see that you get a probability distribution.
And you do that because you don't get a point estimate of the weights. You
actually get -- if you have a Gaussian prior on the weights and you have
Gaussian noise, you actually get a Gaussian multivariate distribution on the
weights, and the mean of that guy is pretty similar to the normal equations,
except the pseudo-inverse is regularized. So you get phi transpose phi plus a
diagonal term, all inverted. And it's a bit like capping or limiting how
small the [indiscernible] values can be.
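That regularized pseudo-inverse is only a couple of lines. A sketch, assuming a design matrix `Phi`, an isotropic Gaussian prior with variance `prior_var`, and noise variance `noise_var` (the names are mine):

```python
import numpy as np

def posterior_over_weights(Phi, y, noise_var, prior_var):
    """Posterior over the weights with a Gaussian prior and Gaussian
    noise. The mean looks like the normal equations, except the
    pseudo-inverse is regularized by a diagonal term:
    (Phi^T Phi + (noise_var / prior_var) I)^-1 Phi^T y."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + (noise_var / prior_var) * np.eye(d)  # regularized Gram matrix
    mean = np.linalg.solve(A, Phi.T @ y)
    cov = noise_var * np.linalg.inv(A)
    return mean, cov
```

As the prior variance grows, the diagonal term vanishes and the mean tends to the ordinary least-squares solution.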
Anyway, so remember how we solved this thing by sampling? We could do it here
again if we wanted to, right. So here we're back. Say we're back to a
Gaussian likelihood. So we're back to Gaussian noise. We don't have any more
of this uniform noise, okay. We could in principle do the same. We could
sample some functions F of S from the prior and then compute the likelihood for
each of them.
This time, it's not going to be a constant or zero. This time, it's going to
be whatever it is and I can average them, and this sort of would be a sampling
approximation to the evidence. Or I can just go ahead and compute it, because
this is the product of two, if I'm using a model that's, you know, F is equal
to some linear combination of the weights and the weights have Gaussian priors,
these two objects here look like Gaussians in the weights so I can multiply
them together and I get a joint Gaussian distribution so I can just compute
this evidence in one shot.
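Computing the evidence "in one shot" can be sketched as follows, assuming f = Phi w with an isotropic Gaussian prior on the weights; the function names are mine, not the course code:

```python
import numpy as np

def log_evidence(Phi, y, noise_var, prior_var):
    """Marginal likelihood in one shot: with f = Phi w and Gaussian
    priors on w, y is jointly Gaussian,
    y ~ N(0, prior_var * Phi Phi^T + noise_var * I)."""
    n = len(y)
    K = prior_var * Phi @ Phi.T + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def polynomial_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)
```

Evaluating `log_evidence(polynomial_features(x, d), y, ...)` over a range of degrees `d` and comparing is exactly the degree-selection plot described next.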
And if I plot the logarithm of the evidence, evidence is the same as marginal
likelihood. As a function of the degree of the polynomial, you can sort of see
that it very strongly prefers degree three, right.
Then again, the more Bayesian thing to do would be to not commit. Just use
them all, right. Like you don't -- you define a prior over the degree of a
polynomial --
>>: [indiscernible] otherwise I can't [indiscernible].
>> Joaquin Quinonero-Candela: You do, you do. That's absolutely right. And
then back to I think it was your question. I could decide -- so I need a
distribution over polynomial degrees that adds up to one. And I could decide
to have a shape, you know, it could sort of be exponentially decaying. I could
say well, the probability of degree 7,000 is really small. Or I could just
call it zero at some point.
>>:
[indiscernible].
>> Joaquin Quinonero-Candela: You could. You could. And then you have to
sort of assume the consequences.
>>: The problem, I think the problem with this presentation is that we've
never presented consequences.
>> Joaquin Quinonero-Candela: Right.
>>: So if I choose -- my figure is 7,000. I believe all problems in the
world should be solved by polynomial 7,000, so my prior would be just one
of --
>> Joaquin Quinonero-Candela: I will show you a slide in a second.
>>: [multiple people speaking]
>>: You show [indiscernible] that draws very nice curves. They look good.
Which is why they're good. And this [indiscernible] never give an answer to
that question. Why is this curve that is good and nice that I like, looks good
visually to me. Why is that good and therefore if I do that in 1,000
dimensions, 1 million dimensions, if I apply [indiscernible] I'm not going to
see the curve.
>> Joaquin Quinonero-Candela: I'm going to try.
>>: The prior is [indiscernible] to the degree, how sensitive is that to the
noise as opposed to just my prior degree of polynomial?
>> Joaquin Quinonero-Candela: Oh, so both things matter here. So when you
look at the evidence, the choice of the likelihood and the choice of the prior
both matter, yeah, definitely.
>>: If you add more noise, it has to prefer a lower degree of polynomial.
>> Joaquin Quinonero-Candela: That's right. So why does it -- if we decided
to use the evidence and in practice, it could be impractical to average over --
>>: Isn't it the opposite. If your noise variance goes to zero, you have to
prefer the higher degree [inaudible].
>>: As the noise becomes large, your ability to detect the [indiscernible]
polynomial drops. So you end up preferring simpler and simpler --
>>: Yes.
>> Joaquin Quinonero-Candela: Yes, in this case, the intuition is that if the
degree of the polynomial is too simple, it's too small, then it's always going
to be missing some of the data for sure. If it's very high, there exists in
that bag functions that exactly fit your data. They're there. But they're
very unlikely.
So although they exist, the resulting evidence is not going to be very high.
So the evidence has this way of preferring priors that hit a certain
equilibrium between the ability to nail the data and how often they can achieve
that.
And I'm trying to remember what else I have here. This is a discussion about
polynomials and whether they're good priors over functions. And they're a bit
problematic, as everyone knows. So a degree of 7,000 here would be quite
spectacular in the sense that once you leave the range where X is between minus
1 and 1, then the functions sort of start shooting like crazy. So if you
thought here that you wanted to limit the variance at 2, then you need to limit
the variance in this interval even more.
So the rest of the lecture is about relaxing the polynomial model structure and
I'm, of course, not going to cover it because we're over time. But what it
does is it sort of talks about Gaussian processes and let me just sort of try
to show -- this is a Gaussian process entity. And so I suppose many people are
familiar with Gaussian processes in the room? Are you?
>>:
[inaudible].
>> Joaquin Quinonero-Candela: Okay. I can show you that very, very, very
quickly. I feel terrible that I went over time. That's funny, actually.
>>:
[inaudible].
>> Joaquin Quinonero-Candela: Say again?
>>: [inaudible].
>> Joaquin Quinonero-Candela: Okay. So this part of the course was about
ranking. And to motivate the problem, to motivate the problem, what we did is
just show them this. And sort of say, at the end of 2011 -- the last day on
which a tennis match that counted for the ATP ranking was played was the 28th
of December -- this was the ranking back then.
And so you see that there are some points here, right. You also see that
different people have played different tournaments. And so actually I'm just
going to explain this later. I'm going to explain this first.
So let's just focus on the top four guys, right. So the first question,
right, is, is a player that's ranked higher than another player more likely
to win? I think it's a fair question to ask. And I suppose most people think
that's true.
So we might think that Nadal has a higher chance of winning than Federer if
they were to play against each other. And then a trickier question is, well,
what's the actual probability that Murray defeats Djokovic, right? And to
translate this into units that people can understand, how much will you
actually be willing to bet on Murray.
So now I need to actually explain this to you.
This is interesting.
So the association of tennis professionals, that's what ATP stands for, ranking
is obtained as follows. You have a sliding window of 52 weeks. So you can
only count results for the past 52 weeks. Of those, you can count only 18
results. So you have 18 slots as a tennis player that you can use.
But now there are some constraints. You must include all four Grand Slams.
That means if you didn't play any of the four Grand Slams, you lose those
slots -- you can only use 14 slots, or you get like a zero on each of those.
And then which are the four Grand Slams? It's the Australian Open, the French
Open, Wimbledon and the U.S. Open.
But then you have to use eight Masters 1000 series events. Okay. And they're
listed here. There's a bunch of them. Okay, cool. But that's not all. So
then that's essentially four and eight makes 12. So you still have six
results that you can use. And of those, four of the six have to be 500
events. And 500 events, there's a bunch of them. You can see here, right.
So that means that the two last slots can be any of the 250 tournaments, for
example, and there's even some other tournaments, challenger and whatnot.
Okay. Cool. So now if you play your Grand Slam, these are the points that you
get depending on how far you make it. If you win it, you get 2,000 points. If
you reach the final, 1,200, et cetera, et cetera, okay. However, if it was a
Masters 1000 tournament, these are the points you get. And then oh, I didn't
talk about the ATP world tour finals. This is a tournament that only the eight
top players -- I don't know when they take the cut in time. But at some point,
they take the top eight, and then those are allowed to play this extra
tournament which gives 1,500 points to the winner only.
Okay. So this is roughly how the ATP ranking system works. So you can see
that someone has sat down and thought, you know, a little bit hard about how to
do it right. So this was sort of the motivation for actually writing down a
model. So I don't know how much more you want me to say.
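The slot-counting rules as described above can be sketched in a few lines. This is a simplification (the real ATP rules have more special cases), and the category labels are invented for illustration:

```python
def atp_ranking_points(results):
    """Sketch of the ATP counting rules as described in the talk.
    `results` is a list of (category, points) pairs from the last 52
    weeks. 18 slots: the 4 Grand Slams and the 8 best Masters 1000
    results are mandatory (a missed one counts as zero), 4 slots must
    come from 500-level events, and the last 2 take anything else."""
    def best(category, k):
        pts = sorted((p for c, p in results if c == category), reverse=True)
        pts += [0] * k  # unfilled mandatory slots count as zero
        return pts[:k]
    return (sum(best("grand_slam", 4)) + sum(best("masters_1000", 8))
            + sum(best("500", 4)) + sum(best("other", 2)))
```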
>>: I have a question. So suppose you have some kind of model for
[inaudible]. And then it's actually [indiscernible] tennis matches
[inaudible]. Do these kind of considerations enter in?
>>:
[inaudible].
>>:
Or they choose their games carefully to be consistent.
>> Joaquin Quinonero-Candela: Yeah, I -- there are things you can do, like
what Patrice was saying. You can try to confuse the model maybe by being very
erratic or sort of trying to increase --
>>: These things are not robust to an adversary.
>> Joaquin Quinonero-Candela: No, they're not designed to be. That's right.
So anyway, so in a nutshell, the model is every player has got a given skill.
Then you look at the difference of the skills.
So the assumption is that if players performed exactly according to their
skill, then the one with the higher skill would always win, 100 percent of
the time. But performance is a noisy version of skill.
So even if someone is more skilled than someone else, there's still a
probability that they might lose.
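Under those assumptions the win probability has a closed form. This is a sketch: the parameterization, with a shared performance-noise scale `beta`, is an assumption consistent with the talk, not the exact course code:

```python
import math

def win_probability(mu1, sigma1, mu2, sigma2, beta):
    """P(player 1 beats player 2). Each performance is a noisy version
    of skill, p_i ~ N(w_i, beta^2), and skill itself is uncertain,
    w_i ~ N(mu_i, sigma_i^2). The performance difference is then
    Gaussian with mean mu1 - mu2 and variance
    sigma1^2 + sigma2^2 + 2 * beta^2; player 1 wins when it's positive."""
    denom = math.sqrt(sigma1 ** 2 + sigma2 ** 2 + 2 * beta ** 2)
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf((mu1 - mu2) / (denom * math.sqrt(2.0))))
```

This is the number that answers "how much would you bet on Murray": equally skilled players get 0.5, and more performance noise pushes any matchup back toward 0.5.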
And then the story, the short story here is that the actual graph is a graph
that has one skill for every player, and then these nodes G are games, and so
it's a graph that can have loops, actually, because even if, sort of, if two
players play two games, you already have a loop. So you have sort of a
massive graph.
And there's a couple of things you can do to it. You can write down all of
the sort of Bayesian machinery, but it's intractable, and so there's two
things you can do.
In the course, we presented a Gibbs sampling way of solving things. I guess
we're over time, right? We had until -- what did we have?
>>: You have the room until 12:00.
>> Joaquin Quinonero-Candela: I see, but the meeting was --
>>: We'll give you eight more minutes.
>> Joaquin Quinonero-Candela: Did I get confused about this actually? I don't
know. Okay. Good, okay. Eight more minutes? Okay. I'll try to use them
wisely. So what we realized is that you can actually do Gibbs sampling for
this model. And actually, I'm not aware and I haven't seen this done before
for this model. It's actually very simple. So I'm not saying it's hard or
anything.
But the key thing is if you look at this graph, what's happening here is so
we're going to have Gaussian priors on the skills. You take the difference.
Node G just takes the difference. This is a factor graph where these are the
variables and the black squares are just functions that depend on all of the
variables that are connected to them. And then the final functional form is
just a product of all the factors. This is sort of my one, my ten-second
explanation of a factor graph.
So in the course, we actually had a proper introduction to factor graphs.
>>: [inaudible].
>> Joaquin Quinonero-Candela: Yes. Of the model?
>>: Of how [indiscernible].
>> Joaquin Quinonero-Candela: Ah. I can try to do that. So what happens is
that you observe a game outcome here Y, which is the binary quantity. You
either -- so player one either won or lost, right. Here, you have the
performance difference. So if the performance difference is positive, then the
model predicts that player one won. If the performance difference is negative,
then player two won.
Now, this performance difference has uncertainty for two reasons. A, the
skills are modeled not as numbers but as Gaussian distributions, so even
their difference will be uncertain. In addition to that, from S to T, we add
some
additional noise, which is just the sort of performance noise, performance
versus skill.
>>: The skill difference is different in two players [indiscernible]. Is
this the case?
>> Joaquin Quinonero-Candela: Okay. I'm not sure I understand.
>>: If you choose the first, is it raining, humid, hot?
>>: No, it is something [indiscernible]. The [indiscernible] is attached to
the player, and S is attached to a pair of players, and T is attached to a
different player, and the Y is attached to the -- is that correct?
>> Joaquin Quinonero-Candela: Let me go through that.
>> Joaquin Quinonero-Candela: I'm going to skip variable S. We're going to
get rid of it. So you could have two games, and game two -- these two guys
could have played this game and these two guys could have played this game --
and then you would have here T1 and here T2, and then you have this other
node that sort of takes a threshold. This is Y1. And this is Y2.
>>: [inaudible].
>> Joaquin Quinonero-Candela: Sorry, you can't see this, I guess. Say again?
>>: What's Y?
>> Joaquin Quinonero-Candela: Y is a binary variable. It says whether player
1 won or player 2 won. This is the one that says who won. T. So what happens
is this.
>>: The thing you don't express in the graph, you have the least appearance
and for which event, we know that player I played against player J, and the
winner was this guy.
>> Joaquin Quinonero-Candela: Yeah, I'm just going to try to take off my mic
so I can actually show you. And, of course, you could have, you know, you
could have sort of a game three that players one and three played, and so on
and so forth, right? And this sort of goes.
So every game is one such node. It connects these two skills. So if you have
-- you could have -- this could be P of W1, and then this could be P of W2,
right. And then T1 in this case is equal to W1 minus W2 plus some epsilon.
And so to your question, I could have an epsilon that knows about the weather.
It could be like I know that this game was played in rainy conditions. In the
model that we wrote, and actually we gave the students code, we just used a
global epsilon across the board, but you could definitely be more complex or
more granular. Yes?
>>: So you're going to build a model from just the pairings, from the
winnings, so it's not connected to --
>> Joaquin Quinonero-Candela: Say again?
>>: This isn't connected to the model, the scoring model that they use in the
tennis rankings?
>> Joaquin Quinonero-Candela: No, this is sort of -- we do our own thing from
scratch.
>>:
I see.
>> Joaquin Quinonero-Candela: And the goal is, at the end of the day, we want
to rank the players by skill. But the skills, but any given skill is
actually --
>>: Does not have a probabilistic interpretation.
>> Joaquin Quinonero-Candela: Is actually some sort of a distribution so you
know the mean skill. And what you'll do if you want to be conservative is you
probably report an order statistic or something like that. You sort of say,
well, for this player, I could rank maybe by the tenth percentile or something
like that if I wanted to be conservative.
>>:
Is the variance of the skill different for different players?
Or is it --
>> Joaquin Quinonero-Candela: It's initialized to be the same, initially,
when we run the code. The way we run this is we run it by doing Gibbs
sampling. Let me tell you -- I'm just sort of freestyling, I suppose.
So this guy here, if you have observed the game outcome, this marginal
distribution here is a truncated Gaussian, right. Because if you know that
player 1 won, then you know that negative values are impossible, right?
So there is a, you know, that's a bit of a problem, because you have a
truncated Gaussian here. Factor graphs are nice, because they tell you how to
compute marginal distributions, and so what we want to do is we want to
compute the marginal distributions of all of these guys after we've observed
all of the games, and you can express things in terms -- so who knows about
message passing and factor graphs? So you sort of compute [indiscernible]
partial computations, partial [indiscernible] sums, and you just propagate
them up. You don't need a factor graph to do it. You can just write all the
equations --
>>: [inaudible].
>> Joaquin Quinonero-Candela: I'm going to have loops. That's absolutely
true. I'm going to have loops. And because I'm going to have loops, I'll
have to iterate back and forth. If I had a tree, I would only need to do sort
of one propagation of things. So I'm going to have to iterate because I have
a loop. And I will be approximating this guy here by a Gaussian in order to
be able to keep everything Gaussian -- if I decide to do an analytic
approximation. If I decide to do Gibbs sampling, I don't need to do that.
The reason I don't need to do that, if I do Gibbs sampling, is I'm going to
be alternating between sampling all of the Ts given all the distribution --
given a sample of all of the skills -- and then in the other iteration, what
I'll do is I'll fix my samples from T, and then, conditioned on the value of
T, everything else is actually really, really Gaussian. I'm not taking any
shortcuts.
So I can iterate between sampling skills and sampling performance differences.
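That alternation can be sketched directly. This is my own minimal version of such a Gibbs sampler, not the course code, for a toy league where only the winner of each game is observed:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_trueskill(wins, n_players, n_iter=1000, prior_var=1.0, beta=0.5):
    """Gibbs sampler sketch for the skill model. `wins` is a list of
    (winner, loser) index pairs. Alternates between
    (1) sampling each performance difference t from a Gaussian
        truncated to t > 0, because we observed who won, and
    (2) sampling all skills w jointly, which is exactly Gaussian once
        the t's are fixed -- no approximation needed."""
    w = np.zeros(n_players)
    t = np.ones(len(wins))
    samples = []
    for _ in range(n_iter):
        # (1) t_g | w, y: truncated Gaussian, here via simple rejection
        for g, (i, j) in enumerate(wins):
            mean = w[i] - w[j]
            while True:
                cand = rng.normal(mean, beta)
                if cand > 0:
                    t[g] = cand
                    break
        # (2) w | t: a Gaussian linear model, so the conditional is Gaussian
        prec = np.eye(n_players) / prior_var
        lin = np.zeros(n_players)
        for g, (i, j) in enumerate(wins):
            a = np.zeros(n_players)
            a[i], a[j] = 1.0, -1.0
            prec += np.outer(a, a) / beta ** 2
            lin += a * t[g] / beta ** 2
        cov = np.linalg.inv(prec)
        w = rng.multivariate_normal(cov @ lin, cov)
        samples.append(w.copy())
    return np.array(samples)
```

After discarding a burn-in, the sample means and percentiles of each column give the skill estimates (and the conservative order statistics mentioned later).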
And I can actually, if I can show --
>>: So I'm assuming that in the end, you get a different ranking from --
>> Joaquin Quinonero-Candela: You do. Yes.
>>: And with this ranking, you can predict the outcome of the match and verify
how accurate it is.
>> Joaquin Quinonero-Candela: Yes.
>>: And if you do this with the other one, assuming there's some sort of
normalization, you take that ranking, you normalize it so it's the best
probability for that ranking. Otherwise, it will not be a fair comparison
[indiscernible].
>> Joaquin Quinonero-Candela: Oh, no, no. In the course, we don't even
compare. We don't even try to use the ATP ranking to predict anything.
>>: But you could. You could take that. And if you normalize it correctly,
then you could be able to predict, and then you could compare the two
predictions. If you don't normalize it [indiscernible].
>>: [inaudible]. Do something like this and compare with [indiscernible].
And, of course, you will probably see that the [indiscernible] because you know
whether the player is consistent or not.
>>: Here there's a worse problem, because the matches are never
[indiscernible]. They're not from a distribution. They always take into
account the ATP ranking, and they say this guy will play the worst and this
guy will play the second worst. And so, I mean, it's a sample from a
distribution which depends on the ATP and is very [indiscernible].
>> Joaquin Quinonero-Candela: That's right.
>>: I'm still confused by doing all this gymnastics at this level, where I
think most of the energy should be put into the right features for the
opponent.
>> Joaquin Quinonero-Candela: Yeah. This is actually a very, very simple --
this is a very simple model. But this is a way of counting, if you will,
that is much more clever than just counting marginals, right. So if I manage
to get the [indiscernible] running, I might be able to show it to you.
But here, because you have -- because you sort of model things jointly and you
take into account how strong the player that you played against is, you're sort
of discounting -- you're sort of discounting that, right?
>>:
Have you thought about building the [indiscernible].
>>:
A model like this [indiscernible].
>>:
So in that version, how do you update the estimate of skill?
>>: [multiple people speaking]
>>: So one of the properties you want of any skill system is it asks them to
build a zero gradient, right. So in a typical example, right, you don't want
somebody to always play low players and add a little every time, so that
eventually they look like a grand master. They have to play a grand master
to look like a grand master. And so the natural -- the probability captures
that nicely. I'm not saying it can't be worked into the optimization. But
that's one of the things: your expectation, or the likelihood of observing
this outcome, actually gives you a way of saying.
>>: [inaudible].
>>: It's actually, if you go back to the skill --
>>: No, in the ATP. But in the probabilistic [indiscernible].
>>: Yeah, it's the delta S. Sorry. It's a generalization of [indiscernible],
so the difference in the scores gives you an expectation of seeing the
outcome. And when you do the update -- sorry, I'm speaking for you -- when you
do the update, based on your observed outcome, that will modify your
expectation of skill.
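[Editor's note: a minimal sketch of the update rule being described, assuming a standard Elo-style scheme; the K-factor and rating scale are illustrative choices, not values from the talk. The score difference yields an expected outcome, and the rating moves in proportion to (observed - expected), so an upset moves ratings more than an expected result does.]

```python
def elo_update(r_a, r_b, outcome, k=32.0, scale=400.0):
    """Elo-style update: the rating difference gives an expected outcome,
    and each rating moves in proportion to (observed - expected).
    outcome is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k and scale are illustrative assumptions, not values from the talk."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))
    delta = k * (outcome - expected_a)
    return r_a + delta, r_b - delta

# An upset (lower-rated A beats higher-rated B) produces a larger
# rating change than an expected win would.
a, b = elo_update(1400.0, 1600.0, 1.0)
```

This is the property discussed above: beating weak players barely moves you, because the expected outcome was already close to the observed one, so the gradient is small.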
>>: [inaudible]. This is not really the [inaudible]. You can do this kind of
thing. The thing that's a bit weird, and that's [indiscernible], is that the
future examples will be modified by the ranking itself, and then you have
[indiscernible].
>>: But your delta is small in those cases, same as Elo. So actually, even
though it's modified, the gradient that you apply when you get that outcome
will be smaller than if you observed the expected outcome.
>>: The way you select the future matches, the players are [indiscernible] is
something that ensures that the low ball process [indiscernible].
>>: But they're not looking for [indiscernible]. You're taking the best and
playing the worst.
>>: You get the point when you get to the top, the top level you get to. So
the idea of building the hierarchy is that you're expected to be playing
somebody of comparable level.
>>: At some point.
>>: When you eventually play. So basically, when you fall out of the graph
either by winning or losing, you're expected to be playing somebody
[indiscernible].
>>: And that's why you wouldn't want to do the opposite: put all the weaker
players on one side of the tree and the strongest players on the other side of
the tree. At the end of the graph, you'd have one of the weakest players
consistently playing one of the best players. You wouldn't want that.
>> Joaquin Quinonero-Candela: Anyway, so yeah, you can't really see it,
unfortunately. So there are two types of things you can do: for any pair of
players, you can compute either the probability that one has greater skill
than the other, or the probability that one will beat the other. And
typically, the probability that one will beat the other will be more moderate,
more sort of close to 0.5, because you have additional noise. Once you do
that, you can just compute -- you can sort of say, well, if I have this matrix
that tells me the probability that any player beats every other player, I can
sort of compute the average probability that this player will win if the games
are arranged truly at random. If I don't know who I'm playing and it's really
sampled at random.
You can't really see very well what's there. But the interesting thing is that
Roger Federer and Rafael Nadal are sort of swapped compared to the ATP ranking.
To be quite honest, I haven't actually spent the time to understand why.
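[Editor's note: a minimal sketch of the pairwise computation just described, assuming TrueSkill-style Gaussian skill marginals where each performance adds Gaussian noise; the performance-noise parameter beta and the helper names are illustrative assumptions, not from the talk.]

```python
import math

def p_beats(mu1, sigma1, mu2, sigma2, beta=4.0):
    """Probability that player 1 beats player 2, assuming Gaussian skill
    marginals N(mu, sigma^2) and per-game performance noise of variance
    beta^2 for each player: Phi of the scaled mean skill difference.
    The extra beta noise makes this more moderate (closer to 0.5) than
    the probability that player 1's skill exceeds player 2's."""
    denom = math.sqrt(2.0 * beta**2 + sigma1**2 + sigma2**2)
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + math.erf((mu1 - mu2) / (denom * math.sqrt(2.0))))

def avg_win_prob(players, i, beta=4.0):
    """Average probability that player i beats a uniformly random opponent:
    a row average of the pairwise win-probability matrix, i.e. the win
    chance when opponents are arranged truly at random."""
    mu_i, s_i = players[i]
    probs = [p_beats(mu_i, s_i, mu_j, s_j, beta)
             for j, (mu_j, s_j) in enumerate(players) if j != i]
    return sum(probs) / len(probs)
```

Ranking players by this row average is what produces the ordering mentioned above, and it can differ from the ATP ranking because it marginalizes over all opponents rather than weighting tournament results.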
>>: [inaudible] expand your graph with [indiscernible].
>>: You mean per point?
>>: If you want to make one [indiscernible] so you have all this graph.
>>: [multiple people speaking]
>> Joaquin Quinonero-Candela: Say this was a new game here, it was a new game;
at this stage, the difference is that these guys have been observed, but this
guy hasn't been observed.
>>: So you should [indiscernible] with all the possible -- you want to compute
P of Y, given --
>>: That is correct, but I'm conditioning it on everything else if I have the
marginals.
>>: [inaudible].
>> Joaquin Quinonero-Candela: Say again?
>>: [inaudible].
>> Joaquin Quinonero-Candela: Yeah. And so what I'm showing here is that we're
effectively sampling in this case, right. I mean, there's a way to do an
approximation of this model, but this is the one where we're sampling: you
have sort of the means of the skills and you have the variances of the skills.
This guy here seems to sort of be converging to something. If you had maybe a
more complex model where this chain kept sort of mixing around, then you would
have to average over a bunch of samples, I agree.
>>: [inaudible]. Then do it in a way that doesn't depend on the
[indiscernible] matches [indiscernible], regardless of how they've been
decided. But now when you cut, you somehow [indiscernible]; the next match
that you want to predict doesn't depend on which match you played before.
>> Joaquin Quinonero-Candela: But I have absorbed that already in my marginal.
>>: Here's the [indiscernible] here's the mean and the variance conditioned on
all the observations in the past.
>>: Yes, okay. So if you make the assumption that the next match that's going
to be played is decided on the basis of the rankings, and the rankings only,
then [indiscernible] to compute the mean. Then the ranking and the next match
will depend only on that, so you can [indiscernible]. But the next match will
also depend, in fact, because it's [indiscernible], the next match will depend
on the outcome of [indiscernible] matches with the same guy.
>>: I see what you're saying.
>>: So it's not completely correct. This is probably why you say it's an
approximation.
>>: So for example, if we have a player who is high variance, the odds of them
winning each of the matches to get to the final is --
>> Joaquin Quinonero-Candela: Let's officially finish the presentation now.
>>: So you're saying you should play the tournament as opposed to playing the
match.
>>: Yes, because --