>> Max Chickering: Joaquin was in MSR Cambridge and is now the -- in adCenter, dev manager for the ranking allocation and pricing team.
>> Joaquin Quinonero-Candela: Hi. Good morning. Thanks for coming. So I think the first question was how do you find time to teach a course and be a dev lead in adCenter. I think I have done two 75 percent jobs during the period of time between January and March. The reason I did it is that I moved from research to a product group in the summer of 2009 and I was just getting impatient that I hadn't really, how should I say, I hadn't really done anything like what I was doing before that transition in those two years. And I was getting a little bit impatient. So it was great of Paul to actually allow me to go ahead and teach a course with Carl Edward Rasmussen, who was my Ph.D. advisor a while ago. And he lives in Cambridge as well, and we're sort of good friends. So 4F13 is a machine learning course in the engineering department. This is a course for fourth year engineering students at Cambridge University. And [indiscernible] and Carl have been teaching this course for five years and using the same slides that Zoubin had been using at University College in London for another seven years prior to that. And so Carl and I decided to just do it again from scratch. Which actually ended up being also a lot of work. The course was 16 lectures, and it has three parts. One part, it says Gaussian processes here, but it really is more about an introduction -- oh, this -- oh, sorry. It's really about an introduction to probabilistic regression and maybe non-parametric models. The second part, we talk about latent Dirichlet allocation just to teach them about an unsupervised discrete model, and then finally, we taught the TrueSkill model and I'll probably, today I probably will not talk about LDA. Does everyone know LDA? Who knows LDA? Okay. Maybe I'll say a little tiny bit about it, but I'll not say a lot.
I think what I'll do is I'll just show you the result of applying LDA to a corpus of papers, because it's sort of fun. I'll talk about TrueSkill a little bit later and I'll motivate. What we did, of course, was we motivated everything with actual data. We sort of said, here's sort of three problems we're going to try to address. And for TrueSkill, we're going to be ranking tennis players. So I wrote a crawler that crawls the ATP website and gathers all of the games that every player has played. So I gathered a dataset for 2011 and we'll look at that. But so what I'll do is I'm actually just going to start from the beginning, as one does. Do you know if I can go actual full screen from here? Maybe I need to save it first. I feel like probably -- so sorry. Apologies. I should have prepared a little better. Apologies. Okay. So because you're not fourth year engineering students, it's very possible that you might get very bored so if you do get very bored, you have to shake your hands or do something, just tell me just move on, man, all right? Okay. So this is the first slide that the students saw. So you have a certain dataset for a regression problem and the question is, what do you predict the value of Y is going to be at minus point 25, okay? So, in fact, in the lecture, we actually ask people to give a number. So Charles, what's your number?
>>: Minus 15.
>> Joaquin Quinonero-Candela: Minus 15. Right. And it's possible that different people might give different numbers. So how do we try to address this effect? Well, we need to sort of postulate some sort of model of the data, right.
>>: [indiscernible].
>> Joaquin Quinonero-Candela: That's a good -- all right. So here's a bunch of curves, actually. So which one do we draw? So here's a bunch of curves, and then we say, well, you could for example, you could for example pick a polynomial, right? People know what a polynomial is. It's an interesting parametric function. It has a bunch of weights.
Polynomials of degree M have M plus one parameters, okay? And so now the parameters of the model are those weights, right, okay? Interesting. But then what comes next? In order to -- oh, sorry. And what you can do, or maybe one of the first things you should do if you postulate a model is you should actually look at your model, right. Don't just go ahead and fit the data immediately. Pick various degrees of your polynomial and then just pick some values for the weights before you see the data and then just look at it. Get some understanding for what your model can do. And at the moment, well, at the moment we sort of see that with higher degrees, the functions seem to be able to do more things. They also seem to have some funky behaviors, but we'll get into that later, okay? So some questions, right. One about model structure. Should we choose a polynomial? Will, yep, go ahead.
>>: By the way, [indiscernible] showed a very similar set of graphs 15 years ago at Cambridge.
>> Joaquin Quinonero-Candela: He did?
>>: At a generalization workshop.
>> Joaquin Quinonero-Candela: Cool.
>>: And he took the average of all of them, and it was like a beautiful fit to the data.
>> Joaquin Quinonero-Candela: But that was after he had actually seen some data.
>>: Yes.
>> Joaquin Quinonero-Candela: Okay. So we are actually going to do that. Although unfortunately, I missed that lecture. I was not at Cambridge 15 years ago. So we're going to sort of pause on this question here. We'll just keep it in mind but we're just going to move on. So what degree should we choose for the polynomial. And we're going to call that model structure. And then if I fix a degree, what value should the weights take? So for now, we're just going to try and do the following. We're going to try to give ourselves a method for selecting the best polynomial. Okay, the single best polynomial degree and weights. So what do we need? We need some sort of objective function to do that, right.
So let's just pick a very simple objective function. If we had postulated one particular case for a polynomial, the red line, we can measure for every point the absolute error. We can take the sum of squared errors and that's the well-known -- yeah, sum of squared errors or we could divide by N if we wanted to compute the mean squared error. It doesn't matter. The key thing to realize is F of X is really a function of our vector of weights, okay. So now we have a loss function. We have a parametric model. We can actually go ahead and solve, right? So I suppose everyone is, of course, familiar with what's going to happen now. We need to introduce a little bit of notation. We're going to be stacking quantities into vectors. I guess this is a laser pointer. So let's remember this number, N is the number of training data we have. We stack them in vector Y. We can evaluate function F at all of our N training data. And the error vector is just Y minus F. So the sum of squared errors is just the squared norm of this vector here, right? And then we're already, we're going to use the opportunity now to introduce a little bit of notation just to generalize already a little bit from polynomials, right. We're going to look at linear in the parameters models where the polynomial is a special type where the basis functions are defined in this way, where the basis function is just X to the power of J, okay? And so how do we write F now? Well, we write F as Phi times W. So what are the dimensions of Phi? This is a real question that I ask in the lecture too. Phi -- yeah, N times M. Maybe it's N times M plus one if it's a polynomial. Let's just say N times M, because it's annoying otherwise. So we'll say M plus 1 and M is going to be the same, it's going to be M. Okay.
>>: That's cool. So say again? [inaudible].
>> Joaquin Quinonero-Candela: Very scary to have mathematicians in the room.
>>: What was the comment?
>> Joaquin Quinonero-Candela: I said that in general, I'm going to say that we have M parameters, although in this case we have M plus 1, but I don't want to say M plus 1, because it's too much work. And then Leon said if M and M plus 1 are the same, that means M is infinity. That was his comment. Okay. So then everybody knows this. You can write down some algebra and you can take the derivative of this sum of squared errors with respect to vector W and you, of course, get the normal equations that everybody knows. So there's one view of the normal equations that I quite like, and that's the geometrical view. I find it's quite cute. So I'll just try and draw it very quickly on the board. So the way I like to look at it is I like to say well, imagine that we only had two basis functions, right. So matrix Phi had only two columns, right. So that means that vector F, right, is N dimensional, because I have N data points. So I'm going to draw an N dimensional space here on the white board, although obviously it's not N dimensional, because I don't know how to do that. But that vector is spanned by only two basis functions that I'm going to call Phi 1 and Phi 2. Right. And so these are N dimensional objects. This is the first basis function evaluated at all of my N inputs and that's the second one. So I know that F needs to live in this plane, right, because it's generated by these two basis functions. So F is going to be somewhere here. But Y, Y can live wherever it feels like, right? Y is not constrained so Y also lives in this N dimensional space, and Y is going to be, you know, somewhere here, let's say. Let's say that's Y. Okay. So now what's the thing that I'm trying to do? Well, F is just W1 times Phi 1 plus W2 times Phi 2, right? So what I'm trying to do now is I'm trying to minimize the norm of the difference between these two guys. So I'm trying to minimize this vector here. Doesn't matter in which direction I draw it.
I just have vector E here and I want to minimize it. But then I know that to minimize this vector, what I need to do is I need to make sure that -- so if this is the projection of Y on to the basis, on to Phi 1, Phi 2, right. Let's imagine it was here. I'm not really great at drawing. Then in a way there's a component of vector E here that is actually -- that actually can be explained by Phi 1 and Phi 2, right. So I should make sure that after I've picked W1 and W2, that E is orthogonal to all of the basis functions, right. So if I just go ahead and do that, I can write it. This is what I said here. And then if I do that and I crack E open and then I solve, I actually get, I end up getting the same equation. So it's actually pretty simple, but it's a view that I think I like as well. Okay, cool. Let's move on. So imagine we did this now for all of the degrees of polynomials. We find the best solution. So one thing that I'll tell you is that we have 17 data points in this dataset and so now the question is, with the polynomial of degree 16, we have 17 parameters. So in this case already, the sum of squared errors can be made zero. In this case here as well. In this case, you even have multiple solutions. In all other cases, you would only be able to make it exactly zero if it happened by chance that the points were arranged in such a way that they can be explained by the polynomial of that degree. If they're arranged completely in a general way, then you don't have a guarantee that the sum of squared errors is zero. So you look at this, and you say oh, fantastic. We've actually solved the problem, because I can go and in this case, I'm actually going to choose and pick the polynomial of degree 17 and you'll see in a moment why I do that. And so the answer, so you were wrong, Charles, because the right answer seems to be roughly 2. That's it. We're done, right? So are there any objections?
>>: Pretty good.
>> Joaquin Quinonero-Candela: Okay.
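As a rough illustration of the fit just described, the sketch below builds the polynomial design matrix Phi and solves the least-squares problem for a few degrees. The data is synthetic, since the lecture's 17 points are not reproduced in the transcript, and `design_matrix` and `fit_least_squares` are names chosen here for illustration. It uses `np.linalg.lstsq`, which minimizes the same sum of squared errors as the normal equations but is numerically safer at degree 16.

```python
import numpy as np

def design_matrix(x, degree):
    # Column j holds the basis function x**j evaluated at every input.
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(x, y, degree):
    Phi = design_matrix(x, degree)
    # Minimise ||y - Phi w||^2; the minimiser satisfies the normal equations.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    residual = y - Phi @ w
    return w, float(residual @ residual)

# Synthetic stand-in for the 17 training points (an assumption of this sketch).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 17)
y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(17)

# Sum of squared errors of the best fit for a few polynomial degrees.
sse_by_degree = {m: fit_least_squares(x, y, m)[1] for m in (1, 3, 16)}
print(sse_by_degree)
```

With 17 points, the degree-16 entry should come out at numerical zero, which is the "we're done" moment in the talk. For any degree, the residual of the fitted model is orthogonal to the columns of Phi, which is the geometric view drawn on the board.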
We've saved some time here. So the interesting thing is that we can make everybody in the room be right. So please think of one value of Y at X equals minus 0.25. Just pick one. Just pick one, and we actually can find a polynomial of degree 17 that will make your answer be correct, and that will go exactly through the training data as well. Okay? So actually, we haven't actually solved any problem at all. We're not able to --
>>: I think you solved everyone's problem.
>> Joaquin Quinonero-Candela: That's right. So we're missing, obviously, some assumptions. And what's interesting in the course is so to your point, Rich, what's interesting in the course is some people complained and said that's a stupid solution. And it's like okay, why is it stupid? The mean squared error is zero. It's not stupid. It's a good solution. And then they said yeah, but -- and when they look at that, oh, but there's other ones. And then someone said oh, but this is better. I prefer many stupid solutions to one stupid solution. So it was sort of interesting to sort of get them thinking. Okay. So now we need to sort of pause for a moment, and we need to sort of ask ourselves a couple of questions, right. So do we think that all models are equally probable before we see the data? And then, of course, the question is what does a probability of a model even mean. Can you even reason in those terms. But there's no question that you should reason in terms of multiple models. And then do we need to commit to picking one polynomial degree and one set of weights, or do we have actually some way of computing with several of them. So fine, but we might need some sort of language to do that. And then perhaps our training targets are contaminated with noise, right, which is also something that some of the people in the course said. They said actually, I don't really want to go through all those points exactly. It's a fair thing, right.
So since this is a much easier question to address than the other two, we can just go ahead and quickly address it. So okay. So imagine now we sort of, we're going to be introducing the likelihood. So imagine that you knew what function generated the data. Then there's a certain probability that this function actually generated this particular set of blue dots. And if you assume the noise is additive Gaussian and sampled independently, then for one little error term, this would be the probability distribution, of course. And then we go back to our beloved matrix and vector notation. We stack the N error terms. We can write a joint Gaussian distribution, which is this guy. This is the variance of the noise. And then massaging things around a tiny bit, we recognize a good old friend here, right? So now what's interesting is if you look at this quantity, and actually the notation is often going to be a little bit confusing, because wherever I write F, you could actually write W. You could write one or the other, it doesn't matter, because F is equal to Phi times W. So you can sort of see that. If you wanted to maximize this quantity with respect to F, right, like now you say well, my Ys are given and I want to find the weights that maximize the likelihood, then obviously this is exactly equivalent to minimizing the sum of squared errors. So in a way, you can sort of say, okay, so very quickly on the likelihood also, there's some terminology. If you read David MacKay's books or attend his lectures, he's very religious about always saying the likelihood is -- it's the likelihood of the parameters. It's not the likelihood of the data. It's the likelihood of the parameters and it's the probability of the data given the parameters. Okay. So the interesting thing is that the maximum likelihood solution here is exactly equivalent to the minimum sum of squared errors. So this hasn't worked. There was a nice -- it was a nice try, but it doesn't help us. Okay.
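Written out, the step from the Gaussian likelihood to least squares looks roughly as follows. The symbol \sigma^2 for the noise variance is notation assumed here, since the slide's own symbols are not in the transcript.

```latex
\begin{aligned}
p(\mathbf{y} \mid \mathbf{w}, \sigma^2)
  &= \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \mid f(x_i), \sigma^2\right)
   = (2\pi\sigma^2)^{-N/2}
     \exp\!\left(-\frac{\lVert \mathbf{y} - \Phi\mathbf{w} \rVert^2}{2\sigma^2}\right), \\
\log p(\mathbf{y} \mid \mathbf{w}, \sigma^2)
  &= -\frac{N}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\,\lVert \mathbf{y} - \Phi\mathbf{w} \rVert^2 .
\end{aligned}
```

Only the last term depends on the weights, and it is the sum of squared errors with a negative sign, so maximizing the likelihood over W and minimizing the sum of squared errors pick out the same weights.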
So back to what you were saying, Rich, earlier, I think. So we don't really know what particular function generated the data. And it turns out if we pick a polynomial of degree 17 that more than one of our models can perfectly fit the data. And ideally, we'd like to be able to reason in terms of multiple possible explanations of the data, not just one. And so what's going on over the next couple of slides, we're going to explain the mechanics of how you can go from a collection of plausible functions to a collection of plausible functions after you've seen the data. And then if you do that, you can indeed average them. So the black sort of line here is a result of averaging all those sort of light gray samples. And then we've plotted two standard deviations up and down, just sample standard deviations. Nothing fancy, okay? So in the course, what we do at this stage is we don't necessarily assume that they remember the rules of probability. So we actually just quickly, we give a quick example, which I'm sure everyone is familiar with this type of example just to introduce Bayes' rule and sort of the two rules of probability. So I'll try to just quickly go through it. So this is actually real data and it's actually based on a real study that was made on doctors to ask them what their inference would be. So the data is that 1 percent of scanned women have breast cancer and 80 percent of women who have breast cancer will get a positive mammography. But 9.6 percent of women without breast cancer will also get a positive mammography. So the question is, so, a woman gets a scan and it is positive, what is the probability that she has breast cancer? So these are the four options. I would like you to vote, please.
>>: Can we have average options?
>> Joaquin Quinonero-Candela: Average of options. What would the average be? It would be 50. Yes, you can introduce an option 2.5 if you want. Okay. So who votes for one? Nobody seems to vote for one.
You have to vote for one of them. Who votes for two? Okay, roughly three or four people. Who votes for three? One, two, three, four people. And who votes for four? No, that's obviously not. Who votes for 2.5? Okay. So how do we work out the answer to this question? The easiest way to work out the answer to this question, and I'm actually going to skip this slide, is to actually just build an example. So you can pick 10,000 subjects if you want, and then you can fill in a table like this. You have all the data to fill in this table, because we're given the probability of cancer, we say, is 1 percent, right? We assume that all the subjects have been scanned, right. So that means that the sum here, the marginals across the rows must be 1 percent and 99 percent. If I have 100 subjects, sorry, 10,000 subjects in this case, then I know this has to add up to a hundred, okay? I know that. And then I can, by looking at data here I know that if there is cancer, the probability of a positive mammography is 80 percent. So that allows me to actually split things in this way. So if there is cancer, I know that 80 of the cases where there is cancer would have a positive mammography and that means, as a consequence, 20 will not and I can do the same down here and I can complete my table that way. Okay. So once I've completed that table, what's the question we were asking? The question we were asking is you know that the mammography is positive so you know you're in this column. What's the probability of cancer? So that is simply 80 divided by 80 plus 950, right? That's what it is. And what's that? That's 7.8 percent. So you shouldn't feel too bad, because I don't remember exactly what it was, it was a number between 60 and 70 percent of the doctors in that study actually chose the same option that the majority chose here, which is option three. Around 90 percent. So that's quite worrying, actually, if you think about it, because that means that doctors don't really understand probabilities.
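The table calculation can be checked in a few lines. This is a minimal sketch using the three numbers quoted in the talk (1 percent, 80 percent, 9.6 percent) and 10,000 subjects. The function name `mammography_table` and the dictionary layout are choices made here, not something from the slides.

```python
def mammography_table(n_subjects, p_cancer, p_pos_given_cancer, p_pos_given_healthy):
    # Split the subjects by the marginal first, then by the test outcome.
    cancer = n_subjects * p_cancer
    healthy = n_subjects - cancer
    return {
        ("cancer", "positive"): cancer * p_pos_given_cancer,
        ("cancer", "negative"): cancer * (1 - p_pos_given_cancer),
        ("healthy", "positive"): healthy * p_pos_given_healthy,
        ("healthy", "negative"): healthy * (1 - p_pos_given_healthy),
    }

table = mammography_table(10_000, 0.01, 0.80, 0.096)

# Condition on a positive scan: look only at the "positive" column.
positives = table[("cancer", "positive")] + table[("healthy", "positive")]
p_cancer_given_positive = table[("cancer", "positive")] / positives
print(f"{p_cancer_given_positive:.3f}")  # about 0.078, i.e. 7.8 percent
```

The healthy-and-positive cell is 950.4 rather than the rounded 950 used on the board, which does not change the answer at this precision.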
But that's fine.
>>: [inaudible].
>> Joaquin Quinonero-Candela: Okay.
>>: Just for the backhander, isn't it right to just go back and say that the previous one would [indiscernible] the problem of the choices?
>> Joaquin Quinonero-Candela: Yes.
>>: Isn't it just always easiest to look at your odds and multiply by your current estimate?
>> Joaquin Quinonero-Candela: Okay. So how --
>>: So roughly, you know, you can do a backhand arithmetic [indiscernible] you can say that's roughly 10. And you maintain a current estimate, right. That's the [indiscernible] trick.
>> Joaquin Quinonero-Candela: And that's going to be presumably an approximation to the correct answer? So you should have been a doctor. So the reason we're giving this example is we need to introduce Bayes' rule, because otherwise we would not be able to go on. At the moment, we introduced the likelihood, we introduced the probability of the data given any particular function, but we want to consider several functions. So we're going to have -- we're very soon going to introduce a probability distribution over functions and so we will, of course, be interested in the posterior.
>>: Is it still legal to teach Bayes' rule in the U.K.? You've heard about these cases where judges have decided that Bayes' rule is not allowed in evidentiary reasoning?
>> Joaquin Quinonero-Candela: Yes. So what's his name? Bill Davies is very involved. Yes, that's right. All right. So this is just an example of Bayes' rule. And this is the slide that's really important. You only need to remember two things. You need to remember the sum rule, which states that if you have a joint distribution, either continuous or discrete, if you add up over the variable you're not interested in, you get sort of the marginal. And then the product rule, of course. And then you combine them both, you get Bayes' rule. And the important thing to remember about Bayes' rule really is one thing.
Is that if you look at the numerator here, ultimately you want to get a proper distribution over A, right. So it has to add up to one as you add over all values of A. But this guy doesn't because you're multiplying one that does by one that doesn't. So you just, if you sort of sum over A here, you sort of get the marginal P of B. And that's it. And that's your sort of normalizing constant. Okay. So then we can go back. Then we can go back to our business here. So the little tiny leap of faith for now is that we have one way of defining P of F. So let's imagine -- the way I call these guys is I call them Santa Claus bags full of functions. And so you put your hand inside the bag and you pull out a function and you put it here and you just do it again, right. And there's a certain probability of grabbing any particular function. Let's not worry how that happens. Let's just imagine you have a P of F, right. And we did just look at P of Y given F a moment ago when we looked at the likelihood. And now Bayes' rule tells us how to combine these two and in the next few slides, I'm actually going to tell you how you go from here to there. And I'll give you both a Monte Carlo version of it and I'll give you an analytic version of it. Okay. Am I going too slow? Am I going too fast? Tell me something.
>>: For a different way, it's interesting to see how you present it. [indiscernible]. Now if you take Charles' question, his values like zero point something, why is it good? Tell me why is it good. Why would this value [inaudible]. It looks good. I agree. But are we still back to [indiscernible] that's good --
>>: [inaudible]. That's a different question. What is the value of the functions [indiscernible].
>>: [multiple people speaking]
>>: But in the end you need to say, why is this curve a better curve for this [indiscernible].
>> Joaquin Quinonero-Candela: So I'm going to actually postpone an attempt to answer.
I don't think I'll be able to answer that question in a satisfying way.
>>: I'm looking how you get the course going and I see that at this point, you can go, I can see very well you can go all the Bayesian way. But there is this question that annoys me that you can't insert [indiscernible].
>> Joaquin Quinonero-Candela: Let me try to go through --
>>: [indiscernible].
>> Joaquin Quinonero-Candela: That's true. Okay. So if we stick with our particular choice of a structure, and the structure is a linear in the parameters model, then if we fix the basis functions for some reason and if we have many of them, it's probably okay to fix them, it's not too bad, then we're left with having to make choices for the Ws. So what we could do with the Ws is, for example, we could decide to sample them from some distribution. So imagine we said, well, we're also going to use a Gaussian here with a certain variance and it's centered around zero. So how do I actually sample a function? Well, it's simple. If I have M equals 17, I sample 18 weights independently and then I multiply the basis functions by them and then I sum and then I get sort of one particular sample. I can do that sort of again and again. So that's sort of one way of defining a prior over functions. It's a bit of an indirect way, because I've done it in two steps. First, I postulated some functional form and then I've defined a prior on the parameters.
>>: [inaudible] we're looking at the data which is in the prior [indiscernible].
>> Joaquin Quinonero-Candela: Oh, no, no. I've done that. I mean, we happened to have looked at the data because we had to motivate things here. But no, no, this is not a prior that depends on the data.
>>: Okay. But these are priors that the average of the [indiscernible].
>> Joaquin Quinonero-Candela: Say again, sir?
>>: You have a zero there.
>> Joaquin Quinonero-Candela: I have a zero here, that's right.
>>: It's a prior.
>> Joaquin Quinonero-Candela: That's a prior.
>>: So [indiscernible].
>> Joaquin Quinonero-Candela: What this is saying, indeed, is it's saying, indeed, that I don't believe the weights should be arbitrarily large. That's true. And then what I should do is I should just sort of look at this. So to your question, the one thing that I'm doing is I'm just fixing this prior variance here, right. And I'm also saying I'm defining this to have zero mean. For the parameter that's the bias, I could decide to not do that if I knew better. And then we're not going to go into that, but I could actually define the prior over this variable and I could also sample there.
>>: But the point is in terms of the [indiscernible] because kind of would it show that look at your data. Based on your data, choose your prior. And then I can, if I'm allowed to do that, answer 15, which is we know that our goal is to get to the answer is 15. I can force my prior to give me this answer.
>> Joaquin Quinonero-Candela: You could force your prior to give you that answer by only picking functions from the prior that are equal to minus 15 at minus 0.25, and that can sort of wiggle around everywhere else. Absolutely right. I agree with that.
>>: But the priors implies that you think the derivatives are not very good.
>> Joaquin Quinonero-Candela: Yes.
>>: He's just explained how you can express the prior. And --
>>: [multiple people speaking]
>> Joaquin Quinonero-Candela: So let me actually try to make it through the next couple of slides, because what I'll show you is that you can actually be extremely vague about your prior and you can still get -- you can still model the data. So let me -- just bear with me for a moment.
>>: [inaudible].
>> Joaquin Quinonero-Candela: That's fine. I'll skip through those. This is just the mechanics of the maximum likelihood estimate where you don't have priors. This is the case where you introduce a prior. This is just all mechanics. There's one thing that I'd like to talk about, however.
I'll sort of skip that as well. Yes, okay. This one I want to talk about a little bit. So I'm just going to pause here for a second. If I wanted to do what I think you were saying, which is I want to choose my prior based on the data. Okay. So what's my prior? I'm going to denote my prior now by this very curly M. So this very curly M doesn't mean the degree of the polynomial. This means a choice of the basis functions, how many of them, right, and it can even mean what sort of priors I choose on the weights, okay? And I can have as many as I want. And then what I'm going to spend time doing now is I'm going to spend time telling you about the marginal likelihood or the evidence. So the evidence is going to allow me to -- it's sort of one potentially dangerous way to choose retrospectively one out of several priors, right. And I think of those priors, I could have actually not one, but I could have very many Santa bags and these are different priors. And one of them is maybe Charles' one that is equal to minus 15 at minus 0.25, right. But I could have others, right. I could have different types of basis functions. I could have, to Patrice's point, I could, depending on how, what the variance of those guys is and depending on what the degree of the polynomial is, I can allow for more wiggliness or less wiggliness, okay? So what's the marginal likelihood like? What does it do? It's actually a pretty simple object. This here, okay, this happens to be the weights here. You could plug in F if you wanted. Like I said earlier, it kind of doesn't matter. This sort of says, well if I have chosen one way to generate functions from my prior, I'm going to compute the average likelihood of the functions from this prior. That's all it is. And I'm going to show you how to do that in a more intuitive way in a second.
>>: So quick question. So this is a [indiscernible] you called out specifically the likelihood of the weights and not the data.
And then this is the likelihood of the data. Curious about that.
>> Joaquin Quinonero-Candela: No. If I have said that, then I was wrong. And if it says so in the slide, then it's wrong. It's a marginal likelihood of the model.
>>: I don't mean the marginal likelihood. So you said somebody's very careful about saying likelihood of weights. I see Gaussian likelihood of weights and then [indiscernible] likelihood of data.
>> Joaquin Quinonero-Candela: Oh, sorry.
>>: Bullet point two says Gaussian likelihood of the weights. P, Y and X.
>> Joaquin Quinonero-Candela: But that's the probability of the data. That's the probability of Y given W, which is the likelihood of the weights.
>>: I see.
>> Joaquin Quinonero-Candela: It's probably not important, actually. I was just quoting David and his obsession with getting that right. This object if you want to describe it in terms of what it is of Y, it's a probability of Y. Or it's a likelihood of W.
>>: So how do you find likelihood? What does that mean?
>> Joaquin Quinonero-Candela: The likelihood is a bit like reverse engineering.
>>: [inaudible].
>>: [multiple people speaking]
>> Joaquin Quinonero-Candela: Yeah, here it is --
>>: [indiscernible]. You will calibrate it with the [indiscernible].
>> Joaquin Quinonero-Candela: Yeah, here, we can write it. It's very easy to write. The likelihood here is the probability of the data, Y, given a function F. So specifically what is it? Because we said that the noise was going to be independent, it's the product of -- it's the product of 17 objects. I have 17 data points. And for each of them, it's a Gaussian evaluated so it's a Gaussian in Y that has mean, the red dot, right. So it would be Y i minus F of X i, squared. Divided by whatever I think the variance of the noise is. That for now is a parameter so the likelihood is a function of the -- you can also learn what the variance of the noise is.
So yeah, and if you take logarithms, you're going to have essentially minus the sum of squares. So yeah, you can write it down. Okay. So now I have three Santa bags. These are three possible priors I could have chosen, and I'm sorry that yours is not there. But that's just the way it is. So actually, we're going to try to do two things. We're going to try to obtain what the predictive distribution would be, right. What the figures show, they show samples from the prior, ignoring the data. And then on top of that, I've plotted some even smaller, simpler data set, okay. That's all it is. And now what we're going to do is we're going to have to define a likelihood. The prior we already have. In a way, I've said I'm not telling you what it is. It's some prior. It's actually not a polynomial. These functions are drawn from three different Gaussian process priors. But let's just not worry about that. I just have a Santa bag. That's the only thing that matters. Then I have this data. And the other thing we're going to do is we're going to compute the marginal likelihood for each of those three models. So that if I wanted to do this thing of picking a prior retrospectively, I get some sort of score to do that. And then I'll tell you, it's not really on the slides, but I'll tell you how you can actually relax that. You don't have to commit to any of them. You can use all three of them. And it's very easy to compute what the mixing proportions should be. Okay. So we're going to choose a different likelihood now. So we're going to choose a different noise model. And here's why. So we're going to assume that the noise is uniform in the interval -- well, in this interval here, minus 0.2 to plus 0.2. And all noise terms are independent. So what does that mean? That means that when I compute the likelihood, I have to compute each of the individual terms here. And they're either going to be zero or they're going to take this constant value, right. One of the two, okay?
So they're either zero or they're equal to 1 over 0.4, so they're equal to 2.5, okay? So that means that if I look at the likelihood of a given function, and I have N data points, I just sort of -- it's either going to be zero, if any of the Ys is further apart from F than 0.2 in absolute value, or otherwise I just get 2.5 to the N, right. So if I draw that it looks like this. So all we're saying is that if I have here some function and this is the data that I have observed, okay -- sorry. These are my inputs here. So I'm just going to put some data down here. If by any chance -- and now I'm going to draw sort of the -- I'm going to try to draw something like a uniform of some -- of this. So imagine that this here was 0.2, okay. So you can see that this likelihood term is zero, right, because I've said that this data point has to be in a uniform distribution centered around the function value, right. This guy just makes it and this guy just makes it on the other end. But the likelihood here is zero. If this guy had been inside, I would have had 2.5 times 2.5 times 2.5. Okay? So why did we construct this funky noise model? Because it allows us to evaluate the -- to actually compute that integral in a very interesting way. Okay. So now we have a likelihood. So remember, the marginal likelihood is just again, you know, I look at all the functions that I can generate from my prior and then I compute the likelihood for each of them and then I just report the average likelihood or marginal likelihood, if you will. So because I've made this funky choice of a likelihood, it turns out that the likelihood is zero in a fraction of the cases: if the function that I pulled from my Santa bag misses one of the points by more than 0.2, then that one is zero, right. So say that S_A is the number of accepted ones, the number of ones that actually fit my data, and I sample a total of S of them. So I can approximate this integral by a sum.
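A minimal sketch of this accept/reject estimate of the marginal likelihood, assuming NumPy is available. The squared-exponential Gaussian process prior, the length scales, and the five data points are illustrative assumptions and are not the ones on the slides; `se_kernel` and `evidence_by_rejection` are hypothetical names.

```python
import numpy as np

def se_kernel(x, lengthscale, variance=1.0):
    """Squared-exponential covariance between all pairs of inputs (assumed prior)."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def evidence_by_rejection(x, y, lengthscale, half_width=0.2,
                          n_samples=100_000, seed=0):
    """Estimate p(y | model) under uniform noise on [-half_width, +half_width].

    Draw functions from the prior at the inputs x, count how many pass within
    half_width of every observation (S_A), and scale by the constant density.
    """
    rng = np.random.default_rng(seed)
    K = se_kernel(x, lengthscale) + 1e-8 * np.eye(len(x))   # jitter for stability
    L = np.linalg.cholesky(K)
    f = (L @ rng.standard_normal((len(x), n_samples))).T    # shape (S, N)
    accepted = np.all(np.abs(f - y) <= half_width, axis=1)
    s_a = int(accepted.sum())
    density = (1.0 / (2.0 * half_width)) ** len(x)          # 2.5 ** N for 0.2
    return s_a, density * s_a / n_samples

# Hypothetical data set with N = 5 points.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = np.array([0.1, 0.3, 0.2, -0.1, -0.3])
for ell in (0.2, 0.5, 1.0):
    s_a, ev = evidence_by_rejection(x, y, ell)
    print(f"lengthscale={ell}: accepted={s_a}, evidence estimate={ev:.5f}")
```

The accepted functions are also samples from the posterior under this noise model, which is why the scheme is valid but very inefficient: most draws are thrown away.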
I can say, well, imagine that I do the following procedure. I go ahead and I draw a function F of S from my bag, and then I evaluate the likelihood and then I just take the average of all of those guys, right. So you agree that is sort of the value that the evidence would have. So now, let's just look at what happens. So in this case, we've had to draw a million samples from each of the three bags. And only the super wiggly guy here, Mr. Green, only generated 8 functions that actually conformed to the data given our likelihood. Given our noise model, okay? Mr. Blue generated ten times as many. He generated 88 such functions, right. And Mr. Red, you know, 17. So the interesting thing is that these functions here are actually -- this is a valid way, an incredibly inefficient but valid way of sampling functions from the posterior distribution. Say if I had written P of F given Y is equal to P of Y given F times P of F and then normalized and I wanted to sample functions from the posterior after seeing the data. Then this is a valid, crazy mechanism to do that. So what have we learned? So if we divide, say, by S, by one million and multiply by 2.5 to the N, in this case N is five data points, then we get those four numbers here. But these four numbers here are proportional to the numbers here. So it doesn't matter. You can look at either, right. So what that is saying is that if you look at the marginal likelihood score, then Mr. Blue is ten times more -- has a ten times higher score than Mr. Green. And if you wanted to sort of use proper terminology, you would say that the probability of the data under model blue is ten times higher than the probability of the data under model green. >>: So what prevents [indiscernible] gazillion.
>> Joaquin Quinonero-Candela: You could use a gazillion models, and if you were doing -- if you were using a Gaussian process prior, rather than having sort of a discrete index for models -- >>: I just want to take the maximum [indiscernible] sample with the largest S_A, this would be my choice. >> Joaquin Quinonero-Candela: So you'd pick model blue. >>: I'm not picking just three. I'm going to pick gazillion many colors. For each one of them I'll sample a million functions. For each one of them I'll get S_A and I'll pick the parameter, the color for which I got the largest relation. >> Joaquin Quinonero-Candela: And so what you're trying to say is that, I'm guessing you're trying to say this, is that the marginal likelihood is dangerous, is almost as dangerous as the likelihood because you can overfit. If you try hard enough, you can overfit with it. So the answer is that the safer way to pick a model, of course, is to consider them all. Because this is P of Y given your model choice, given your bag, right? But now what if you defined a P of M, right? You said you wanted to use many of them. A hundred of them. Sure, why not. If you were to, in the simplest of all cases, if you chose them in such a way that you didn't have any particular preference -- >>: [inaudible]. >> Joaquin Quinonero-Candela: That's right. >>: [inaudible]. >> Joaquin Quinonero-Candela: So the key thing here is that if you're able to specify your prior completely, say you said that P of -- let me just write it. If you had a bunch of models, M_i, where i goes from one to, I don't know, some number L, right. And, you know, for simplicity, imagine that you decided that the prior probability of any of those guys is 1 over L. If you know better, then you pick something else. If you don't, then you just do that. And then actually, you could combine the predictions from all of those guys by weighting them by the posterior probability of the model.
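As a sketch of the weighting just described, with $y_*$ standing for a new prediction target (a symbol not used in the talk), the posterior over models and the combined prediction are:

```latex
p(M_i \mid \mathbf{y})
  = \frac{p(\mathbf{y} \mid M_i)\, p(M_i)}{\sum_{j=1}^{L} p(\mathbf{y} \mid M_j)\, p(M_j)}
  \;\overset{p(M_i) = 1/L}{=}\;
  \frac{p(\mathbf{y} \mid M_i)}{\sum_{j=1}^{L} p(\mathbf{y} \mid M_j)},
\qquad
p(y_* \mid \mathbf{y}) = \sum_{i=1}^{L} p(y_* \mid \mathbf{y}, M_i)\, p(M_i \mid \mathbf{y}).
```

With the accepted counts quoted for the three bags (8, 88 and 17, each proportional to its evidence) and a uniform prior over the three models, the weights would come out as $8/113 \approx 0.07$ for green, $88/113 \approx 0.78$ for blue and $17/113 \approx 0.15$ for red.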
By P of M_i -- >>: When you're talking about at this level M_i not specified. So the two layers that you created don't have -- because you can say in advance, take these bags, pool them together and -- >> Joaquin Quinonero-Candela: Go ahead. >>: You use a uniform prior [indiscernible]. Is there any [indiscernible] to get a prior over models by digging into a complexity of the [indiscernible]. >> Joaquin Quinonero-Candela: There is. There's this nice paper that Zubin and Carl wrote early 2000, 2001 maybe. It's a paper about -- it's about Occam's Razor, where they say Occam's Razor doesn't exist. Where they use linear-in-the-parameters models. I think they might have used polynomials, but I'm not sure, or maybe they used something else. But anyway, what they did is they -- the models were defined in such a way that it could be polynomials. Anyway, something about the basis functions, basis functions of higher index had more energy or more variance, if you will. So they scaled down the prior variances on those in such a way that the amount of variance from the function stayed constant or something like that. And then you could add as many as you wanted. I don't know whether that made sense. >>: So what happens if you plug the same logic through where we have a discontinuous function like the left side is a [indiscernible]. >> Joaquin Quinonero-Candela: I'm not sure I understand the question. >>: So you just, you have the graph. Slide [indiscernible] is the real function space. But your model has this assumption of continuity, so you're always going to force this weird connection between the two. Any human coming along and drawing a curve would look at it and go instantly, this continuity [indiscernible] you've thrown two species together. If you separated by species, your actual two functions behave very differently. If you throw them together, you guess discontinuity.
So I'm just wondering for this Gaussian process, this explanation here has this assumption of continuity that seems to be broken in a lot of the real cases. >> Joaquin Quinonero-Candela: So you're saying something about how we've -- you're saying something about the prior over functions itself. You're sort of complaining about -- >>: It's actually the model structure. Given the model structure, you can express that [indiscernible]. >> Joaquin Quinonero-Candela: So you're saying I cannot, with the linear-in-the-parameters model, I cannot express -- what if some of those basis functions were zero in a range of the data, and then I haven't really said how I'm choosing them. I could choose them however I want. I think, yeah, with the Gaussian process, certain things are a little bit more difficult because the Gaussian process is obtained by using an infinite number of basis functions and that actually means extra smoothness. So there are certain things you can't define with Gaussian processes. Yeah, so which is more general, the Gaussian process prior or a finite linear combination of basis functions? I don't think there's an obvious answer to that. Okay. So in this case here, what we do is you just sort of take the average of the samples. Here, we get all this wiggliness because we didn't have many samples. We would have needed more. But then as you were saying, in a way, I could have done this whole thing by combining all of these functions in one bag and then just doing this whole thing. And that would be equivalent to doing this and then reweighting -- combining these three but reweighting them by the posterior probabilities of each of the three models. Okay. So I have like three minutes, actually, time passes pretty fast. I think what I'll do is I'm going to show you -- I'm just going to show you this thing here. I will not bother -- if you can see it well enough, I'll not bother downloading it again.
So here, we can go back to our original question. So okay. I'll address this in a second. What degree. So what we're going to do now is we're going to use the marginal likelihood to select the degree of the polynomial for our dataset here. And, oh, there's something we didn't talk about in the previous one. Sorry for that. So I completely forgot. So in the slide where we had the marginal likelihood, we also had a predictive distribution. So a probability distribution of Y at minus 0.25 or whatever. And you sort of see that you get a probability distribution. And you do that because you don't get a point estimate of the weights. You actually get -- if you have a Gaussian prior on the weights and you have Gaussian noise, you actually get a multivariate Gaussian distribution on the weights, and the mean of that guy is pretty similar to the normal equations. Except the pseudo inverse is regularized. So you get Phi transpose Phi plus a diagonal term, inverse. And it's a bit like capping or limiting how small the [indiscernible] values can be. Anyway, so remember how we solved this thing by sampling? We could do it here again if we wanted to, right. So here we're back. Say we're back to a Gaussian likelihood. So we're back to Gaussian noise. We don't have any more of this uniform noise, okay. We could in principle do the same. We could sample some functions F of S from the prior and then compute the likelihood for each of them. This time, it's not going to be a constant or zero. This time, it's going to be whatever it is and I can average them, and this sort of would be a sampling approximation to the evidence.
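A minimal sketch of that sampling approximation under Gaussian noise, assuming NumPy, a polynomial basis, and a zero-mean Gaussian prior with one shared variance on the weights. The function name `log_evidence_mc`, the data, and the default settings are hypothetical.

```python
import numpy as np

def log_evidence_mc(Phi, y, noise_var, prior_var, n_samples=50_000, seed=0):
    """Monte Carlo estimate of log p(y): average the Gaussian likelihood over
    weight vectors drawn from the prior N(0, prior_var * I)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, np.sqrt(prior_var), size=(n_samples, Phi.shape[1]))
    F = W @ Phi.T                                   # sampled functions, shape (S, N)
    resid = y[None, :] - F
    log_lik = (-0.5 * np.sum(resid ** 2, axis=1) / noise_var
               - 0.5 * len(y) * np.log(2.0 * np.pi * noise_var))
    m = log_lik.max()                               # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_lik - m)))

# Hypothetical toy problem: three points, straight-line model (degree 1).
x_toy = np.array([-1.0, 0.0, 1.0])
y_toy = np.array([0.3, -0.1, 0.5])
Phi_toy = np.vander(x_toy, 2, increasing=True)      # columns: 1, x
print(log_evidence_mc(Phi_toy, y_toy, noise_var=1.0, prior_var=1.0))
```

The estimate degrades quickly as the noise variance shrinks or the number of weights grows, because fewer prior samples land near the data.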
Or I can just go ahead and compute it, because this is the product of two, if I'm using a model that's, you know, F is equal to some linear combination of the weights and the weights have Gaussian priors, these two objects here look like Gaussians in the weights so I can multiply them together and I get a joint Gaussian distribution so I can just compute this evidence in one shot. And if I plot the logarithm of the evidence, evidence is the same as marginal likelihood, as a function of the degree of the polynomial, you can sort of see that it very strongly prefers degree three, right. Then again, the more Bayesian thing to do would be to not commit. Just use them all, right. Like you don't -- you define a prior over the degree of a polynomial -- >>: [indiscernible] otherwise I can't [indiscernible]. >> Joaquin Quinonero-Candela: You do, you do. That's absolutely right. And then back to I think it was your question. I could decide -- so I need a distribution over polynomial degrees that adds up to one. And I could decide to have a shape, you know, it could sort of be exponentially decaying. I could say well, the probability of degree 7,000 is really small. Or I could just call it zero at some point. >>: [indiscernible]. >> Joaquin Quinonero-Candela: You could. You could. And then you have to sort of assume the consequences. >>: The problem, I think the problem with this presentation is that we've never presented consequences. >> Joaquin Quinonero-Candela: Right. >>: So if I choose -- my figure is 7,000. I believe all problems in the world should be solved by polynomial 7,000 so my prior would be just one of -- >> Joaquin Quinonero-Candela: I will show you a slide in a second. >>: [multiple people speaking] >>: You show [indiscernible] that draws very nice curves. They look good. Which is why they're good. And this [indiscernible] never give an answer to that question. Why is this curve that is good and nice that I like, looks good visually to me.
Why is that good and therefore if I do that in 1,000 dimensions, 1 million dimensions, if I apply [indiscernible] I'm not going to see the curve. >> Joaquin Quinonero-Candela: I'm going to try. >>: The prior is [indiscernible] to the degree, how sensitive is that to the noise as opposed to just my prior degree of polynomial? >> Joaquin Quinonero-Candela: Oh, so both things matter here. So when you look at the evidence, the choice of the likelihood and the choice of the prior both matter, yeah, definitely. >>: If you had more noise, it has to prefer a lower degree of polynomial. >> Joaquin Quinonero-Candela: That's right. So why does it -- if we decided to use the evidence and in practice, it could be impractical to average over -- >>: Isn't it the opposite. If your noise variance goes to zero, you have to prefer the higher degree [inaudible]. >>: As the noise becomes large, your ability to detect the [indiscernible] polynomial drops. So you end up preferring simpler and simpler -- >>: Yes. >> Joaquin Quinonero-Candela: Yes, in this case, the intuition is that if the degree of the polynomial is too simple, it's too small, then it's always going to be missing some of the data for sure. If it's very high, there exist in that bag functions that exactly fit your data. They're there. But they're very unlikely. So although they exist, the resulting evidence is not going to be very high. So the evidence has this way of preferring priors that hit a certain equilibrium between the ability to nail the data and how often they can achieve that. And I'm trying to remember what else I have here. This is a discussion about polynomials and whether they're good priors over functions. And they're a bit problematic, as everyone knows. So a degree of 7,000 here would be quite spectacular in the sense that once you leave the range where X is between minus 1 and 1, then the functions sort of start shooting like crazy.
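The trade-off described here can be sketched with the closed-form log evidence of a polynomial model, assuming NumPy, Gaussian noise, and a zero-mean Gaussian prior with one shared variance on the weights. The cubic data set, the noise level, and the name `log_evidence` are illustrative assumptions rather than the lecture's own data or code.

```python
import numpy as np

def log_evidence(x, y, degree, noise_var=0.01, prior_var=1.0):
    """Exact log marginal likelihood: y ~ N(0, prior_var * Phi Phi^T + noise_var * I)."""
    Phi = np.vander(x, degree + 1, increasing=True)         # 1, x, ..., x^degree
    C = prior_var * Phi @ Phi.T + noise_var * np.eye(len(x))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + len(x) * np.log(2.0 * np.pi))

# Hypothetical data: 17 noisy observations of a cubic on [-1, 1].
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 17)
y = 0.5 - x + 1.5 * x ** 3 + rng.normal(0.0, 0.1, size=x.shape)

scores = {d: log_evidence(x, y, d) for d in range(11)}
for d, s in scores.items():
    print(f"degree {d:2d}: log evidence = {s:8.2f}")
```

Degrees that are too low cannot reach the data, so their score is poor; higher degrees pay for spreading prior mass over functions the data do not need, so the score tends to flatten or fall once the degree exceeds what the data support.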
So if you thought here that you wanted to limit the variance at 2, then you need to limit the variance in this interval even more. So the rest of the lecture is about relaxing the polynomial model structure and I'm, of course, not going to cover it because we're over time. But what it does is it sort of talks about Gaussian processes and let me just sort of try to show -- this is a Gaussian process entity. And so I suppose many people are familiar with Gaussian processes in the room? Are you? >>: [inaudible]. >> Joaquin Quinonero-Candela: Okay. I can show you that very, very, very quickly. I feel terrible that I went over time. That's funny, actually. >>: [inaudible]. >> Joaquin Quinonero-Candela: Say again? >>: [inaudible]. >> Joaquin Quinonero-Candela: Okay. So this part of the course was about ranking. And to motivate the problem, what we did is just show them this. And sort of say, in 2011, the last day on which a tennis match counted for the ATP ranking was the 28th of December, and this was the ranking back then. And so you see that there are some points here, right. You also see that different people have played different tournaments. And so actually I'm just going to explain this later. I'm going to explain this first. So let's just focus on the top four guys, right. So the first question, right, is: is a player that's ranked higher than another player more likely to win? I think it's a fair question to ask. And I suppose most people think that's true. So we might think that Nadal has a higher chance of winning than Federer if they were to play against each other. And then a trickier question is, well, what's the actual probability that Murray defeats Djokovic, right? And to translate this into units that people can understand, how much would you actually be willing to bet on Murray? So now I need to actually explain this to you. This is interesting.
So the Association of Tennis Professionals, that's what ATP stands for, ranking is obtained as follows. You have a sliding window of 52 weeks. So you can only count results for the past 52 weeks. Of those, you can count only 18 results. So you have 18 slots as a tennis player that you can use. But now there are some constraints. You must include all four Grand Slams. That means if you didn't play any of the four Grand Slams, then you lose four slots, you can only use 14 slots, or you get like a zero on each of those. And then which are the four Grand Slams? It's the Australian Open, the French Open, Wimbledon and the U.S. Open. But then you have to use eight Masters 1000 series events. Okay. And they're listed here. There's a bunch of them. Okay, cool. But that's not all. So then that's essentially four and eight makes 12. So you still have six results that you can use. And of those six, four have to be 500 events. And 500 events, there's a bunch of them. You can see here, right. So that means that the two last slots can be any of the 250 tournaments, for example, and there's even some other tournaments, Challenger and whatnot. Okay. Cool. So now if you play your Grand Slam, these are the points that you get depending on how far you make it. If you win it, you get 2,000 points. If you reach the final, 1,200, et cetera, et cetera, okay. However, if it was a Masters 1000 tournament, these are the points you get. And then oh, I didn't talk about the ATP World Tour Finals. This is a tournament that only the eight top players -- I don't know when they take the cut in time. But at some point, they take the top eight, and then those are allowed to play this extra tournament which gives 1,500 points to the winner only. Okay. So this is roughly how the ATP ranking system works. So you can see that someone has sat down and thought, you know, a little bit hard about how to do it right.
So this was sort of the motivation for actually writing down a model. So I don't know how much more you want me to say. >>: I have a question. So suppose you have some kind of model for [inaudible]. And then it's actually [indiscernible] tennis matches [inaudible]. Do these kind of considerations enter in? >>: [inaudible]. >>: Or they choose their games carefully to be consistent. >> Joaquin Quinonero-Candela: Yeah, I -- there are things you can do, like what Patrice was saying. You can try to confuse the model maybe by being very erratic or sort of trying to increase -- >>: These things are not robust to adversaries. >> Joaquin Quinonero-Candela: No, they're not designed to be. That's right. So anyway, so in a nutshell, the model is every player has got a given skill. Then you look at the difference of the skills. So the assumption is that if players performed according to their skill, then if one player has a higher skill than the other, they would always win, 100 percent of the time. But performance is a noisy version of skill. So even if someone is more skilled than someone else, there's still a probability that they might lose. And then the story, the short story here is that the actual graph is a graph that has one skill for every player and then these nodes G are games and so it's a graph that can have loops, actually, because even if, sort of if two players play two games, you already have a loop. So you have sort of a massive graph. And there's a couple of things you can do to it. You can write down all of the sort of Bayesian machinery, but it's intractable and so there's two things you can do. In the course, we presented a Gibbs sampling way of solving things. We're over time, right? We had until -- what did we have? >>: You have the room until 12:00. >> Joaquin Quinonero-Candela: I guess I see, but the meeting was -- >>: We'll give you eight more minutes. >> Joaquin Quinonero-Candela: Did I get confused about this actually? I don't know. Okay.
Good, okay. Eight more minutes? Okay. I'll try to use them wisely. So what we realized is that you can actually do Gibbs sampling for this model. And actually, I'm not aware and I haven't seen this done before for this model. It's actually very simple. So I'm not saying it's hard or anything. But the key thing is if you look at this graph, what's happening here is so we're going to have Gaussian priors on the skills. You take the difference. Node G just takes the difference. This is a factor graph where these are the variables and the black squares are just functions that depend on all of the variables that are connected to them. And then the final function or form is just a product of all the factors. This is sort of my one, my ten-second explanation of a factor graph. So in the course, we actually had a proper introduction to factor graphs. >>: [inaudible]. >> Joaquin Quinonero-Candela: Of the model? >>: Yes. Of how [indiscernible]. >> Joaquin Quinonero-Candela: Ah. I can try to do that. So what happens is that you observe a game outcome here, Y, which is a binary quantity. You either -- so player one either won or lost, right. Here, you have the performance difference. So if the performance difference is positive, then the model predicts that player one won. If the performance difference is negative, then player two won. Now, this performance difference has uncertainty for two reasons. A, the skills are modeled not as numbers but as Gaussian distributions, so even their difference will be uncertain. In addition to that, from S to T, we add some additional noise, which is just the sort of performance noise, performance versus skill. >>: The skill difference is different in two players [indiscernible]. >> Joaquin Quinonero-Candela: Okay. >>: Is this the case? I'm not sure I understand. >>: If you choose the first, is it raining, humid, hot? >>: No, it is something [indiscernible].
The [indiscernible] is attached to the player and S is attached to a pair of players, and T is attached to a different player and the Y is attached to the -- >> Joaquin Quinonero-Candela: Let me go through that. >>: Is that correct? >> Joaquin Quinonero-Candela: I'm going to skip variable S. We're going to get rid of it. So you could have two games, and these two guys could have played this game and these two guys could have played this game, and then you would have here T1 and here T2 and then you have this other node that sort of takes a threshold. This is Y1. And this is Y2. Sorry, you can't see this, I guess. >>: [inaudible]. >> Joaquin Quinonero-Candela: Say again? >>: What's Y? >> Joaquin Quinonero-Candela: Y is a binary variable. It says whether player 1 won or player 2 won. This is the one that says who won. T. So what happens is this. >>: The thing you don't express in the graph, you have the least appearance and for which event, we know that player I played against player J, and the winner was this guy. >> Joaquin Quinonero-Candela: Yeah, I'm just going to try to take off my mic so I can actually show you. And, of course, you could have, you know, you could have sort of a game three that one and three played and so on and so forth, right? And this sort of goes. So every game is one such node. It connects these two skills. So if you have -- you could have -- this could be P of W1, and then this could be P of W2, right. And then T1 in this case is equal to W1 minus W2 plus some epsilon. And so to your question, I could have an epsilon that knows about the weather. It could be like I know that this game was played in rainy conditions. In the model that we wrote, and actually we gave the students code, we just used a global epsilon across the board, but you could definitely be more complex or more granular. Yes?
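The generative story described here, T equal to W1 minus W2 plus epsilon with the outcome Y given by the sign of T, can be sketched as follows, assuming NumPy and one global performance-noise level. The function name `simulate_games`, the skill values, and the pairings are hypothetical; this is not the code handed out to the students.

```python
import numpy as np

def simulate_games(skills, pairs, perf_noise_std=1.0, seed=0):
    """Simulate outcomes for games between (player_1, player_2) index pairs.

    t = w_1 - w_2 + eps is the performance difference, with eps drawn from one
    global Gaussian; y = +1 if player 1 wins (t > 0), otherwise -1.
    """
    rng = np.random.default_rng(seed)
    pairs = np.asarray(pairs)
    eps = rng.normal(0.0, perf_noise_std, size=len(pairs))
    t = skills[pairs[:, 0]] - skills[pairs[:, 1]] + eps
    y = np.where(t > 0, 1, -1)
    return t, y

# Hypothetical skills for three players and a schedule of repeated pairings.
skills = np.array([2.0, 0.0, -2.0])
pairs = [(0, 1), (1, 2), (0, 2)] * 200
t, y = simulate_games(skills, pairs)
print("fraction of games won by the first-listed player:", np.mean(y == 1))
```

A per-game noise level (for example one that depends on weather) would replace the single `perf_noise_std` with an array of the same length as `pairs`.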
>>: So you're going to build a model from just the pairings from the winnings, so it's not connected to -- >> Joaquin Quinonero-Candela: Say again? >>: This isn't connected to the model, the scoring model that they use in the tennis rankings? >> Joaquin Quinonero-Candela: No, this is sort of we do our own thing from scratch. >>: I see. >> Joaquin Quinonero-Candela: And the goal is at the end of the day, we want to rank the players by skill. But the skills, but any given skill is actually -- >>: Does not have a probabilistic interpretation. >> Joaquin Quinonero-Candela: Is actually some sort of a distribution so you know the mean skill. And what you'll do if you want to be conservative is you probably report an order statistic or something like that. You sort of say, well, for this player, I could rank maybe by the tenth percentile or something like that if I wanted to be conservative. >>: Is the variance of the skill different for different players? Or is it -- >> Joaquin Quinonero-Candela: It's initialized to be the same. Initially, when we run the code. The way we run this is we run it by doing Gibbs sampling. Let me tell you, I'm just going -- I'm just sort of freestyling, I suppose. So this guy here, if you have observed the game outcome, this marginal distribution here is a truncated Gaussian, right. Because if you know that player 1 won, then you know that negative values are impossible, right? So, you know, that's a bit of a problem, because you have a truncated Gaussian here. Factor graphs are nice, because they tell you how to compute marginal distributions and so what we want to do is we want to compute the marginal distributions of all of these guys after we've observed all of the games and you can express things in terms -- so who knows about message passing and factor graphs? So you sort of compute [indiscernible] partial computations, partial [indiscernible] sums and you just propagate them up. You don't need a factor graph to do it.
You can just write all the equations -- >>: [inaudible]. >> Joaquin Quinonero-Candela: I'm going to have loops. That's absolutely true. I'm going to have loops. And because I'm going to have loops, I'll have to iterate back and forth. If I had a tree, I would only need to do sort of one propagation of things. So I'm going to have to iterate because I have a loop. And I will be approximating this guy here by a Gaussian in order to be able to keep everything Gaussian, if I decide to do an analytic approximation. If I decide to do Gibbs sampling, I don't need to do that. The reason I don't need to do that, if I do Gibbs sampling, is I'm going to be alternating between sampling all of the Ts given all the distribution -- given a sample of all of the skills, and then in the other iteration, what I'll do is I'll fix my samples of T and then, conditioned on the value of T, everything else is actually really, really Gaussian. I'm not taking any short cuts. So I can iterate between sampling skills and sampling performance differences. And I can actually, if I can show -- >>: So I'm assuming that in the end, you get a different ranking from -- >> Joaquin Quinonero-Candela: You do. Yes. >>: And with this ranking, you can predict the outcome of the match and verify how accurate it is. >> Joaquin Quinonero-Candela: Yes. >>: And if you do this with the other one, assuming there's some sort of normalization, you take that ranking, you normalize it so it's the best probability for that ranking. Otherwise, it won't be a fair comparison [indiscernible]. >> Joaquin Quinonero-Candela: Oh, no, no. In the course, we don't even compare. We don't even try to use the ATP ranking to predict anything. >>: But you could. You could take that. And if you normalize it correctly, then you could be able to predict, and then you could compare the two predictions. If you don't normalize it [indiscernible]. >>: [inaudible]. Do something like this and compare with [indiscernible].
And, of course, you will probably see that the [indiscernible] because you know whether the player is consistent or not. >>: Here there's a worse problem because the matches are never [indiscernible]. They're not from a distribution. They always take into account the ATP ranking and they say this guy will play the worst and this guy will play the second worst. And so, I mean, a sample from the distribution, which depends on the ATP and is very [indiscernible]. >> Joaquin Quinonero-Candela: That's right. >>: I'm still confused by doing all this gymnastics to this level where I think most of the energy should be putting the right features to the opponent. >> Joaquin Quinonero-Candela: Yeah. This is actually a very, very simple -- this is a very simple model. This is a way of counting, if you will, that is much more clever than just counting marginals, right. So in the course -- if I manage to get [indiscernible] running, I might be able to show it to you. But here, because you have -- because you sort of model things jointly and you take into account how strong the player that you played against is, you're sort of discounting -- you're sort of discounting that, right? >>: Have you thought about building the [indiscernible]. >>: A model like this [indiscernible]. >>: So in that version, how do you update the estimate of skill? >>: [multiple people speaking] >>: So one of the properties you want of any skill system is it asks them to build a zero gradient, right. So in a typical example, right, you don't want somebody to always play low players and add a slot every time so that eventually every time they look like grand master. They have to play a grand master to look like a grand master. And so the natural -- the probability captures that nicely. I'm not saying it can't be worked into the optimization. But that's one of the things, your expectation or the likelihood of observing this outcome actually gives you a way of saying. >>: [inaudible].
>>: It's actually, if you go back to the skill -- >>: No, in the ATP. But in the probabilistic [indiscernible]. >>: Yeah, it's the delta S. Sorry. It's a generalization of [indiscernible] so the difference in the scores will give you an observation of expecting to see the outcome. And when you do the update, sorry, I'm speaking for you, you give the update, you actually, based on your observed outcome, that will modify your expectation of skill. >>: [inaudible]. This is not really the [inaudible]. You can do this kind of thing. The thing that's a bit weird and that's [indiscernible] the future examples will be modified by the ranking itself and then you have [indiscernible]. >>: But your delta is small in those cases, same as Elo. So actually, even though it's modified, the gradient that you apply when you get that outcome will be smaller than if you observe the unexpected outcome. >>: The way you select the future matches, the players are [indiscernible] is something that ensures that the low ball process [indiscernible]. >>: But they're not looking for [indiscernible]. You're taking the best and playing the worst. >>: You get the point when you get to the top. The top level you get to. So the idea of building the hierarchy is you're expected to be playing somebody of comparable level. >>: At some point. >>: When you eventually play. So basically, when you fall out of the graph either by winning or losing, you're expected to be playing somebody [indiscernible]. >>: And that's why you wouldn't want to do the open, put all the weaker players on one side of the tree and the strongest players on the other side of the tree. At the end of the graph, you'd have one of the weakest players playing one of the best players consistently. You wouldn't want that. >> Joaquin Quinonero-Candela: Anyway, so yeah, you can't really see, unfortunately.
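The Gibbs scheme described earlier in the talk, alternating between sampling the performance differences given the skills and sampling the skills given the performance differences, can be sketched as below. This assumes NumPy, zero-mean Gaussian skill priors with a shared variance, unit performance noise, and games recorded as (winner, loser) pairs so that every performance difference is positive. The names `sample_positive_normal` and `gibbs_skills` and the toy game list are hypothetical, and this is not the code handed out in the course.

```python
import numpy as np

def sample_positive_normal(mean, rng):
    """Draw from N(mean, 1) truncated to t > 0 by simple resampling.
    Fine for a sketch; an inverse-CDF sampler would be preferable when a
    mean is strongly negative, since resampling then becomes very slow."""
    t = rng.normal(mean, 1.0)
    bad = t <= 0
    while np.any(bad):
        t[bad] = rng.normal(mean[bad], 1.0)
        bad = t <= 0
    return t

def gibbs_skills(games, n_players, n_iters=600, prior_var=1.0, seed=0):
    """Gibbs sampler over skills w and performance differences t."""
    rng = np.random.default_rng(seed)
    games = np.asarray(games)
    n_games = len(games)
    X = np.zeros((n_games, n_players))            # +1 for the winner, -1 for the loser
    X[np.arange(n_games), games[:, 0]] = 1.0
    X[np.arange(n_games), games[:, 1]] = -1.0
    # Given t, the skills are exactly Gaussian: precision = prior + X^T X.
    cov = np.linalg.inv(np.eye(n_players) / prior_var + X.T @ X)
    chol = np.linalg.cholesky(cov)
    w = np.zeros(n_players)
    samples = np.empty((n_iters, n_players))
    for it in range(n_iters):
        t = sample_positive_normal(X @ w, rng)    # step 1: t | w, outcomes
        w = cov @ (X.T @ t) + chol @ rng.standard_normal(n_players)  # step 2: w | t
        samples[it] = w
    return samples

# Hypothetical results: player 0 beats 1 and 2, player 1 beats 2.
games = [(0, 1)] * 10 + [(1, 2)] * 10 + [(0, 2)] * 5
samples = gibbs_skills(games, n_players=3)
posterior_mean = samples[100:].mean(axis=0)       # discard the first 100 draws
print("posterior mean skills:", posterior_mean)
```

No Gaussian approximation of the truncated factor is needed here, because each conditional is sampled exactly; the cost is that successive samples are correlated and several must be averaged.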
So the two types of things that you can do is you can either try and, for any pair of players, you can compute the probability that one will either have the greater skill than the other or that one will beat the other. And typically, the probability that one will beat the other will be more moderate, more sort of close to 0.5 because you have additional noise. Once you do that, you can just compute -- you can sort of say, well, if I have this matrix that tells me the probability that any player beats every other player, I can sort of compute the average probability that this player will win if the games are arranged truly at random. If I don't know who I'm playing and it's really sampled at random. You can't really see very well what's there. But the interesting thing is that Roger Federer and Rafael Nadal are sort of swapped compared to the ATP ranking. To be quite honest, I haven't actually spent the time to understand why. >>: [inaudible] expand your graph with [indiscernible]. >>: You mean per point? >>: If you want to make one [indiscernible] so you have all this graph. >>: [multiple people speaking] >> Joaquin Quinonero-Candela: Say this guy was a new game here, it was a new game, at this stage, the difference is that these guys have been observed, but this guy hasn't been observed. He has not been observed. >>: So you should [indiscernible] with all the possible -- you want to compute P of Y, given -- >>: That is correct, but I'm conditioning it on everything else if I have the marginals. >>: [inaudible]. >> Joaquin Quinonero-Candela: Say again? >>: [inaudible]. >> Joaquin Quinonero-Candela: Yeah. And so what I'm showing here is we're effectively sampling in this case, right. I mean, there's a way to do it. I take an approximation of this model, but this is the one where we're sampling, you have sort of the means of the skills and you have the variances of the skills. This guy here seems to be, it seems to sort of be converging to something. 
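[Editor's sketch, not part of the talk: the two pairwise quantities described above, and the average win probability under random pairings, can be computed from Gaussian skill marginals as follows. It assumes independent marginals with the given means and variances, and a single Gaussian noise term with variance noise_var = 1.0 added to the skill difference. The function names and that noise value are assumptions, not taken from the slides.]

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_higher_skill(mu_i, var_i, mu_j, var_j):
    """P(skill_i > skill_j) for independent Gaussian skill marginals."""
    return normal_cdf((mu_i - mu_j) / math.sqrt(var_i + var_j))


def prob_win(mu_i, var_i, mu_j, var_j, noise_var=1.0):
    """P(i beats j): same as above, plus assumed performance noise.

    The extra variance pulls the probability toward 0.5.
    """
    return normal_cdf((mu_i - mu_j) / math.sqrt(var_i + var_j + noise_var))


def average_win_probability(mus, variances, noise_var=1.0):
    """Average of P(i beats j) over all opponents j != i, per player."""
    n = len(mus)
    averages = []
    for i in range(n):
        total = sum(
            prob_win(mus[i], variances[i], mus[j], variances[j], noise_var)
            for j in range(n)
            if j != i
        )
        averages.append(total / (n - 1))
    return averages


if __name__ == "__main__":
    mus = [1.0, 0.0, -1.0]          # hypothetical skill means
    variances = [0.1, 0.1, 0.1]     # hypothetical skill variances
    print(prob_higher_skill(mus[0], variances[0], mus[1], variances[1]))
    print(prob_win(mus[0], variances[0], mus[1], variances[1]))
    print(average_win_probability(mus, variances))
```

[Sorting players by the averaged value gives a ranking that assumes opponents are drawn uniformly at random, which is the caveat raised in the discussion about how tournament draws are actually made.]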
If you had maybe a more complex model where this chain kept sort of mixing around, then you would have to average over a bunch of samples, I agree. >>: [inaudible]. Then do it in a way that doesn't depend on the [indiscernible] matches [indiscernible] regardless of how they've been decided. But now when you cut, you somehow he [indiscernible] the next match that you want to predict doesn't depend on which match you played before. >> Joaquin Quinonero-Candela: But I have absorbed that already in my marginal. >>: Here's the [indiscernible] here's the mean and the variance conditioned on all the observations in the past. >>: Yes, okay. So if you make the assumption that the next match that's going to be played is decided on the basis of the rankings, and the rankings only, then [indiscernible] to compute the mean. That the ranking and the next match will depend only on that so you can [indiscernible]. But the next match will also depend, in fact, because it's [indiscernible], the next match will depend on the outcome of [indiscernible] matches with the same guy. >>: I see what you're saying. >>: So it's not completely correct. This is probably why you say it's an approximation. >>: So for example if we have a player who is high variance, the odds of them winning each of the matches to get to the final is -- >> Joaquin Quinonero-Candela: Let's officially finish the presentation now. >>: So you're saying you should play the tournament as opposed to playing the match. >>: Yes, because --