>> Chris Bishop: My name's Chris Bishop. I'm a machine learning researcher from MSR in Cambridge, which is the real Cambridge, the olde Cambridge over in England. This morning and for the first part of this afternoon we have a series of three lectures on, if you like, a slightly different perspective on machine learning, not so different from what you've heard before, but different in some important ways. We call this Model-Based Machine Learning. Let me give you a little sort of road map of where we're going. We have these three lectures and the first one is called A Gentle Introduction. Really the idea is to assume no prerequisites. You may have, you know, read about machine learning; some of you may be experts on machine learning. But this talk is really aimed at people who are coming at this afresh. We'll go over some very basic ideas. There's a little bit of overlap with some previous talks. I think that's fine. I think the basic ideas are very important. It's always good to hear them several times and sometimes good to hear them from a different perspective. In this first talk we're going to discuss some very basic ideas: the idea of reasoning backwards and why that's a very ubiquitous problem that we have to solve. We're going to talk about computing with uncertainty and how we're going to do that using probabilities. We'll introduce this idea of factor graphs. In the second lecture, that's after the coffee break, we'll go a little bit deeper. We'll understand what models are, how we build models, and how we do learning in models; learning I will call inference, and when I say inference that's the equivalent of training and learning. Then after lunch my colleague John Bronskill, also from MSR Cambridge, is going to show you how to turn some of these ideas into practice using a platform called Infer.NET, which we've been building over the last five years, and which makes this sort of vision of model-based machine learning a reality. He'll be giving you samples of code and talking you through some case studies. I also have a little icon. If you see that little icon on a slide it's just a little warning that there's a little bit of math in there. In this first lecture especially, in fact in the first two lectures, there's almost no mathematics. But just occasionally I can't resist showing you a little bit of mathematics because it makes something so clear, or it's so beautiful, that I just want to share it with you. I think there may be only one or two slides in each talk where there's a little bit of mathematics. If you see that icon just be prepared: a couple of little equations. But for the most part we're going to do this all without using mathematics. Okay, so our goal then is to build intelligent software, in other words software which can adapt, and which can learn, and which can reason about the world. Let's look at some examples of where we might want to build intelligent software. Let's consider Xbox and people playing games. One of the things we're interested in is the skill of the players. We care about the skill of the players because of course the players want to compete with each other. We need leaderboards of some kind: who's the best, who's the second best. We also care about the skills because on Xbox Live we need to do real-time matchmaking. We need to match up people of similar skills so that they have a good gaming experience. We can't actually measure a person's skill. There's no way of measuring a person's skill.
What we can do though is to observe the results of that person playing games, let's say games of Halo for example. We observe the outcome of the games. The outcome of the games depends on their skill. If they have a very high skill level they're going to win a lot of games. Another example might be recommendation systems. Recommendation systems are very important. The Amazon CEO claimed that something like twenty percent of Amazon's revenue comes from their recommender system, so that's a nice application of machine learning. Let's take the particular case of movies. We care about people's preferences for movies. Now, people's preferences for movies can be quite subtle, quite personal to them. It's not just that they always like action and they hate romantic comedy; it could be more subtle. We can't measure their preferences for movies, unfortunately. What we can do though is look at ratings. People say, well, I liked that movie, I didn't like this other movie. Clearly their preferences for movies will influence which ones they rate highly and which ones they give a low rating to. Just a third example: imagine we're putting words into a computer using a pen. Somewhere in the user's head are some words, the words they want to input into the computer. We can't measure that directly. But what happens is of course the user makes gestures using the pen, we observe that electronic ink, and the ink of course is determined in part by the words that they have in their head. Now in each of these cases we can build something called a model to describe what's going on. I'll explain a little bit later more precisely what we mean by model. But [indiscernible] when I use the term model you can think of it as a mathematical or a computational description of this process: the process by which players have skills, and those skills give rise to game results. That's what I mean by a model. Now the problem that we have to solve is effectively the reverse problem. Look at the skill problem. What we observe are the game results. What we actually want to know are the players' skills. We have to reason backwards. Likewise with the movie preferences: what we observe are the ratings that people give to movies. What we'd like to do is to infer their preferences so that we can then recommend movies they're going to like. Likewise with the pen input, what we really care about are the words the person has in mind. What we observe is the electronic ink. We have to reason backwards. We have to go from the ink to infer what words they had in their head. This is the idea of reasoning backwards. It's pretty ubiquitous. Another very central concept is the idea of dealing with uncertainty. Again, if we look at that example of somebody playing Halo on Xbox Live: we don't know a player's skill. We're uncertain about a player's skill. We don't know what it is. We have some idea, but we don't know exactly what their skill level is. But every time they play a game and they win or they lose, we gain some information about their skill, so that's relevant information. But even after we've observed several games we don't know their exact skill level. We're constantly having to deal with uncertainty. We need to reason in the face of uncertainty. What we need is a principled, rather than ad-hoc, way of dealing with uncertainty. Again, this is pretty ubiquitous; uncertainty is all over the place. Which movie should the user watch next? What word did they write? What did they say? Which web page are they trying to find?
What link are they going to click on? What products are they going to buy? What gesture are they making, and so on? Computers fundamentally are deterministic; they're logical things. The chip manufacturers go to a lot of trouble to make sure each transistor is either on or it's off. It's deterministic. But more and more applications of computing today involve interacting with the real world, interacting with users, operating in a world of uncertainty. We're going to use a kind of calculus of uncertainty. That's the calculus of probability, so probability theory. Probability theory in essence is actually not very complicated, certainly at the level that we need for machine learning applications. Probability is a way, in fact a unique way, an optimal way, of quantifying uncertainty. Now, you know, when you're at school and you're introduced to the idea of probability, you might be introduced to probability as the limit of an infinite number of trials. If I say that a coin has a probability of landing heads of point five, then we can formalize that by saying, well, if I take a very large number of coin flips and I look at the fraction of times that it lands heads, that fraction, in the limit of an infinite number of trials, will tend towards point five. So that's the idea of repeated events. Well, we're going to use probability in a much more general sense. Probability is a way of quantifying uncertainty. I'll just give you a little example. Supposing we have a coin, and imagine the coin's a bent coin. Let's suppose the physics of this coin is such that sixty percent of the time it will land concave side upwards, and forty percent of the time it will land concave side downwards. Okay, so in a repeated infinite number of trials, sixty percent of the trials it will land concave side upwards. Now let's say one side of this coin is heads, the other is tails. You don't know which it is. I ask you: what's the probability of it landing heads next time round? Well, the rational answer is point five. There's no reason to prefer one or the other. The rational answer is point five. It doesn't mean that you think that if you flip the coin an infinite number of times, fifty percent of the time it will land heads. You know that it will either land heads sixty percent of the time or it will land heads forty percent of the time. You're uncertain which. We're applying probability here to a one-off event: which side of the coin is heads, which is tails. You can ask, you know, were the dinosaurs wiped out by a meteorite or by a volcano? You can express your uncertainty using a probability, even though it's a one-off event, even though it happened in the past.
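The bent-coin reasoning can be written out as one line of arithmetic. Here is a minimal Python sketch (illustrative only, not from the talk's slides); the two hypotheses about which side is heads are the only assumption:

```python
# Hypothesis A: "heads" is the concave side, so P(heads | A) = 0.6.
# Hypothesis B: "heads" is the convex side,  so P(heads | B) = 0.4.
# We have no reason to prefer either hypothesis: P(A) = P(B) = 0.5.
p_hypothesis = {"A": 0.5, "B": 0.5}
p_heads_given = {"A": 0.6, "B": 0.4}

# Average over the unknown hypothesis (this is the sum rule, which comes
# up later in the lecture): P(heads) = sum_h P(heads | h) P(h)
p_heads = sum(p_heads_given[h] * p_hypothesis[h] for h in p_hypothesis)
print(p_heads)  # 0.5 -- the rational answer, even though the coin's
                # long-run frequency of heads is never 0.5
```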
Many of you have come across a lot of techniques in machine learning. Machine learning as a research field is sort of half a century old or more. Thousands of machine learning researchers have been developing lots of techniques, lots of algorithms. They've given them names, and here's a list of just a tiny fraction of them. I'm going to tend to call this traditional machine learning. I'm struggling to find a good word for it, so I'll call it traditional machine learning. That doesn't mean it was all developed in the last century. Much of it was, but things like deep networks, which you'll hear about after lunch, are quite a recent development. So this is traditional machine learning: a whole bunch of techniques. Now, you've got an application that you want to solve. In this frame, what you do is try to map your application onto one of these techniques. You try to find a technique which you think is going to be good for solving your particular problem. Not only that, but you may be influenced by the availability of software. You may not have software implementations of all of these. You may actually have a rather restricted set of things, so you have to map your problem onto a rather restricted set of algorithms for which you have software available. In this series of three lectures I want to contrast that with a slightly different philosophy of machine learning, which we call model-based machine learning. The key idea is very simple: it is just to have a single development environment which supports the creation of a wide range of models, bespoke models. In the traditional view you map your problem onto one of the standard tools for which you have software. In the model-based approach you're going to ask: what is the model which describes my particular problem? You're going to build a bespoke model. Now there are some potential benefits if we can achieve this vision. The first benefit is that your solution can be optimized for your particular application. You're not shoehorning your application into some predefined framework. You're developing something bespoke, customized to your application, so potentially it's a better solution because it's optimized. Another nice advantage is transparency of functionality. You'll see, in the particular case of Infer.NET, that the model can be specified in some cases by, you know, ten or twenty lines of code. That specifies the model, rather than the thousands of lines of code that define a traditional method. It's quite easy to see what the code is doing, quite easy to modify it, quite easy to pass it over to a friend who has a similar but different problem, who can take that model and then modify it for their application. Another nice advantage: there's a clear segregation between the model definition and the training code. I should call it inference code; we'll see what that means later. It means we can build general-purpose inference engines that apply to a wide range of models. When we get smarter at doing inference, that smarter inference is available to a whole load of different models. Equally, as a user you can build a model for your application, and you kind of don't have to worry about the inference engine; that should take care of itself. Another possible advantage is for newcomers to the field. I think even for people in the machine learning field there's quite a bewildering variety of different techniques around, there's a vast literature, and getting your head around all those different techniques is pretty hard. Here you can learn a single modeling environment. Within that environment, as special cases, you can recover a lot of those standard techniques. It might be that you've got a particular application and you build a bespoke solution, and it so happens that your solution is pretty much equivalent to something developed thirty years ago and called the Bloggs algorithm. Well, you didn't really care about that. All you care about is that you're building a solution for your application. You don't have to know all about the Bloggs method. Another, I think, very important point here is the potential to avoid what I call ad-hoc solutions. We'll see some examples of this as we go through the lectures. But very often you're a domain expert in your particular application. You have a lot of intuitions about how things behave.
You know, well, in any sensible solution to my problem, if this thing gets bigger this other thing should get smaller. Okay, and you kind of know that if that doesn't happen there's something wrong. You have this intuition. You could try and code that up. You could say, well, I could make one thing the reciprocal of the other, so if one goes up the other goes down. But should it be one over the thing, or should it be one over the thing squared, or, you know... When you're in that sort of ad-hoc environment there are lots of different solutions, and it's not clear which one you should choose. For me, my background is theoretical physics, quantum field theory. One of the things I love about this framework is that it's tremendously elegant. You simply describe your application as a little probabilistic model, and then the rules of probability will do the right thing automatically. If one thing should go up when another thing goes down, that will happen automatically in your model. You don't have to code it up. You don't have to figure it out; it happens automatically. For me that's one of the most compelling aspects of this view of machine learning. This of course is a sort of vision, an idealization. What we have in Infer.NET is a platform that's still in development, but it's already very usable. You'll see quite a few applications mentioned during these talks. It's already available for download. You'll hear a lot more about that later on in the day. Okay, so how are we going to do all this in practice then? Well, there are really three key ideas that we need to think about. The first one we've encountered already, and we'll learn a lot more about it as we go through: the idea of using probabilities as a quantification of uncertainty. The next idea that we're going to introduce is the idea of graphs to express our models. Now we don't have to use graphs, but graphs are rather nice. People like pictures; it's often quite easy to see from the picture what it is you're expressing. We're going to introduce a particular type of graph called a factor graph. We'll use that to describe the models in these lectures. Then finally we need to use the models to make predictions. That process we'll call inference. This is analogous to the sort of training or learning in some of the traditional methods. This is where all the computation happens. This is the computationally expensive part; this computation is, you know, as costly as in traditional methods. We care about making it efficient, and I'll say a few words about how we make it efficient. Okay, so in these lectures you'll see quite a few real applications. But in order to introduce some very basic concepts I'm going to use a toy problem. It's sort of like Hello World for model-based machine learning. It is a toy problem, but we can introduce about seventy or eighty percent of the important ideas using this very simple example. This is the murder mystery, then. A fiendish murder has been committed. We want to know whodunit. There's our sleuth. Now suppose that there are only two suspects. We've got, of course, as always, the Butler. But was it the Butler? Perhaps it was the Cook. We'll suppose that either the Butler done it or the Cook done it. We'll also suppose that there are three possible murder weapons that could have been used. There's a butcher's knife, there's a pistol, there's a fireplace poker. [laughter] Alright, so let's introduce our first concept, the idea of a prior distribution. We have some domain knowledge.
We always have some domain knowledge. In this case we know that the Butler is a fine, upstanding Butler; he's served the family for many years. The Cook, on the other hand, was hired pretty recently. There were various rumors about a dodgy history, and we're not quite sure about the Cook. We can represent this information probabilistically. We'll say that the probability, as far as we can tell at the moment, that the Butler was the person what done it is twenty percent, and the Cook is eighty percent. We think it's much more likely that the Cook was responsible for the murder than the Butler, given the information we have so far. Notice the notation: the notation here is probability that Culprit equals Butler is twenty percent. P stands for probability. Culprit is an interesting quantity. It's a variable, but it's not like the variables we normally use. We normally think about integers and Booleans, and double precision floating point numbers and so on. Those are all deterministic variables. This is something more general. This is a random variable. If you think about a Boolean variable, that's either zero or one. Well, you know, it's stored in memory and it either has the value zero or it has the value one. It has a particular value. Here we're interested in a two-state variable called Culprit. Culprit can either take the value Butler or it can take the value Cook; we're not sure which. But we do know something about it. We know that it's eighty percent likely to be the Cook and twenty percent likely to be the Butler. This is the probability that the random variable takes on a particular value. You'll notice the values add up to a hundred percent, because we're assuming in this model that either the Butler did it or the Cook did it, and nobody else. We can now represent this as a graph. This is our first example of a factor graph. Factor graphs are quite simple really. What you do is you have a circle for every random variable. We only have one random variable at the moment. The random variable's called Culprit, so Culprit can take the states Butler or Cook. It's represented by the circle. The thing above it, which is called a factor, is this little square; it represents the probability distribution of that random variable. That square represents P of Culprit, and P of Culprit is just a summary of these two lines here. It just encapsulates both of those statements. This thing is called a factor graph; we'll see a little bit later why it's called a factor graph. That's our first factor graph. Now of course so far things aren't very interesting. What we need now is some evidence. Let's look at the murder weapon. What do we know about the murder weapon? Well, the Butler, before he was our Butler, was in the Army, and he kept hold of his nice British Webley revolver. He keeps it locked away in his bedroom, so the Butler's got a gun. Well, the Cook has access to lots of knives, because the Cook works in the kitchen. We'll suppose that the Butler is fairly old, he's getting rather frail, and perhaps using the fireplace poker is not so plausible because that's quite a physically demanding weapon. The Cook, on the other hand, is young, very fit, and potentially could have used the poker. What we're going to do is to capture that, again, as a little probability distribution. First of all, let's suppose it was the Cook what done it. We know the Cook was responsible. What's the probability of the Cook choosing these different weapons?
Well, we don't think it's very likely that the Cook would have used the pistol, because the pistol's locked away by the Butler. There's a good chance they would have used the knife; they work in the kitchen, lots of knives. Possibly they used the poker as well. Now these probabilities again add up to a hundred percent, because if the Cook was the murderer then they must have chosen one, and only one, of these three weapons. The probabilities add to a hundred percent. On the other hand, let's suppose that it wasn't the Cook. Let's suppose it was the Butler what done it. Then we might have some different probabilities. The Butler has access to the pistol. Let's say eighty percent probability they would have chosen the pistol, and some small probabilities, ten percent each, for the knife and the poker. Again, these add up to a hundred percent. These are called conditional distributions, because they're conditional on knowing who committed the murder. Okay, so there's one distribution if the Cook did it, and a different distribution if the Butler did it. We have a notation for this. This is a random variable which we'll call Weapon. It has three states: pistol, knife, and poker. We have P of Weapon, but the probability distribution of the Weapon depends on who the Culprit was. So we have this kind of notation: P of Weapon, then this vertical bar, and on the right hand side of the bar we have Culprit. It's called a conditional distribution. The way to read this is probability of Weapon given Culprit. Okay, it means the probability distribution over the Weapon if we know who the Culprit is. Okay, so this represents these two little tables. Now we can extend our factor graph to combine the prior distribution with the conditional distribution. This is the prior distribution, the Culprit, and P of Culprit. Now we can introduce this random variable Weapon, which is a three-state random variable, together with its distribution P of Weapon. But P of Weapon depends upon Culprit. We've got a line joining the Culprit random variable to this factor, because this factor depends on Culprit. That dependency is shown by this extra link in the graph. We call this the conditional distribution, and we call that the prior distribution. Yeah?

>>: In this case, are the [indiscernible] arrows directional?

>> Chris Bishop: In this case we have arrows. I won't dive into too much detail; I want to keep this fairly high level. But essentially those arrows denote the fact that this is a probability distribution. It's a distribution over the variable which the factor's arrow is pointing at. Yep, it means it's a normalized distribution, yep. What we have now is a joint distribution. What do I mean by the joint distribution? Well, I can ask a simple question. There are two possible murderers and three possible weapons, so there are six combinations of murderer and weapon. I can say: what's the probability that it was the Cook that committed the murder, and they did it using the pistol? Okay, well, I can easily calculate that. Because I know that the probability it was the Cook is eighty percent. Conditioned on it being the Cook, the probability that the Cook would have chosen the pistol is five percent. The probability of it being the pistol and the Cook is obtained by just multiplying those together.
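Here is a minimal Python sketch of those two tables and the multiplication just described. The Butler's row and the Cook's pistol entry are the numbers from the talk; the Cook's knife and poker entries aren't stated explicitly, so the split below (sixty-five and thirty percent) is an assumption chosen to make the row sum to one:

```python
prior = {"Butler": 0.20, "Cook": 0.80}                         # P(Culprit)
conditional = {                                                # P(Weapon | Culprit)
    "Butler": {"pistol": 0.80, "knife": 0.10, "poker": 0.10},
    "Cook":   {"pistol": 0.05, "knife": 0.65, "poker": 0.30},  # knife/poker assumed
}

# Product rule: P(Weapon, Culprit) = P(Weapon | Culprit) x P(Culprit)
joint = {(c, w): prior[c] * conditional[c][w]
         for c in prior for w in conditional[c]}

for (c, w), p in joint.items():
    print(f"P({w}, {c}) = {p:.2f}")
# e.g. P(pistol, Cook) = 0.05 * 0.80 = 0.04, and all six entries sum to one
```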
Okay, so you can think of it this way; I'm just going to call this a generative view. Imagine repeating this situation many, many times; it's sort of rolling a biased dice to draw these random numbers. Eighty percent of the time it would have been the Cook that did it. Of all those instances where it was the Cook, in five percent of those the Cook would have chosen the pistol, so overall it's eighty percent of five percent, which is four percent, for it being the Cook using the pistol, okay. Again, we have a little bit of notation. If you see P of Weapon comma Culprit, that's the joint distribution of the two variables. It's to be read as P of Weapon and Culprit. Obviously there are five other combinations, so we can do the math for all of them. It's very simple; we come up with a table. This is called the joint distribution. Each entry in the table, if we take say this entry here for instance, represents a choice for the Culprit and for the Weapon. There's the probability that the Butler did it using the knife, and that's two percent. Again, all of these numbers add up to a hundred percent, because it must have been one, and only one, of those six possible combinations. Here we have a little rule for calculating with probability. We call this the product rule. It says the probability of Weapon and Culprit is given by multiplying the probability for Culprit by the probability of the Weapon given the Culprit. Or in general, for two variables X and Y, the probability of X and Y is the probability of Y given X times the probability of X. That's called the product rule of probability. There are only two rules that we need: the product rule, and an equally simple one called the sum rule. We'll come to that in a moment. Those two rules of probability, that's all we need. Here's our factor graph. Let's just hide the variables so we just look at the factors. Well, we've just seen that the joint distribution is obtained by multiplying the distribution at this factor by the distribution at this factor. In general, that's what these factor graphs mean. The factor graphs tell us that the joint distribution of all the, perhaps millions of, variables in our problem can be expressed as the product of distributions over little subsets of the variables, each described by a factor. Okay, so it's just the product rule of probability. It says the joint distribution of everything described by our model is obtained by multiplying the factors together. Hence the term factor graph. So far we've kind of been reasoning in this direction. Okay, so this is a bit like going from the players' skills to the game outcome. What we're going to need to do is to work in the reverse direction. We need to reason backwards. Just before we do, though, let's have a look at one final concept: the idea of a marginal distribution. This is our joint distribution table. Each of these entries is the probability of a particular Culprit with a particular Weapon. We could ask the question: what's the probability that it was the Butler that did it, irrespective of which Weapon they used? Let's say we don't know what the Weapon was; we don't care what the Weapon was. We just want to know the probability that it was the Butler. Well, all we have to do is add up the probability for each possible Weapon used by the Butler. Okay, and we get twenty percent. Same for the Cook: we get eighty percent. That's a relief, because that was what we fed into the model and we've got it back out again. Okay, so we haven't got the math wrong. We could do it the other way around, though. Instead of adding up the rows we could add up the columns.
Oh, by the way, this is the sum rule. Remember, this is the probability of Weapon and Culprit. If we only want the probability of Culprit, we simply sum over the different values of the Weapon random variable. Or in general, if you've got P of X and Y and we just want X, we sum over the thing we're not interested in, Y, the thing we don't know. That's called the sum rule. That's all the probability theory you need: the product rule and the sum rule, and that's it. Instead of summing the rows we could sum the columns. What that tells us is the marginal distribution over the different Weapons. This is the probability, twenty percent, that the murder was committed using the pistol when we don't know, or don't care, who committed the crime. It was either the Cook or the Butler; we don't care. We just want to know: what's the chance it was done using the pistol? That's obtained by adding up the columns. Again, these numbers must all add up to a hundred percent, because it must have been done by one, and only one, of those Weapons. Now we come to the interesting bit. Okay, this is the bit that we really care about, which is when we reason backwards to find the thing that we're interested in. What we have now is some evidence. We make an observation. In this case our sleuth has discovered a pistol lying next to the body. That's surely pretty relevant to this crime. What does it tell us? Well, let's look at that joint table. These are the six possibilities that could have occurred. But we know it was done with the pistol, so we can just rule out these two columns. Okay, we know that they didn't occur. We're left with these two numbers, four percent and sixteen percent. They don't add up to a hundred percent. But what they tell us, in this repeated-sample generative thought experiment if you like, is the fraction of times that it was done by the Cook using the pistol and the fraction of times it was done by the Butler using the pistol. We can normalize those fractions to a hundred percent. It says that twenty percent of the time it was done by the Cook and eighty percent of the time it was done by the Butler. That's the reverse of the probabilities we started with. We started out thinking it was the dodgy Cook, but having found a pistol it's changed things around; it looks like it was the Butler what done it. Things look pretty bad for the Butler, which was obvious, because it's a murder mystery, so we knew that all along. What we're doing now is reasoning backwards. Here's our little factor graph. This is the Culprit. This is the Weapon. What we've done now is we've made an observation. We now know the value of Weapon. Weapon is going to be a bit like data in a machine learning application. We're going to build a model and then we're going to fix the variables that we know. The things we know, that's our training data. The third step, then, is to reason backwards and work out the updated distribution over the Culprit. That's the thing we just did by crossing out the columns of that joint distribution table. We can formalize that in a little piece of mathematics called Bayes' theorem. Here it is, sort of in words, first of all. What we have is a prior distribution. That was the initial probabilities of Butler and Cook based on their, sort of, history. After we make the observation that the Weapon was a pistol, we can update that distribution, and we get the distribution after seeing the data, which is called the posterior distribution.
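Continuing the sketch from above, here are the sum rule and the reverse reasoning in code. The joint table is the one just derived (remember the Cook's knife and poker entries rest on assumed values); the pistol entries and both answers match the numbers in the talk:

```python
joint = {("Butler", "pistol"): 0.16, ("Butler", "knife"): 0.02,
         ("Butler", "poker"): 0.02,  ("Cook", "pistol"): 0.04,
         ("Cook", "knife"): 0.52,    ("Cook", "poker"): 0.24}

# Sum rule: P(Culprit) = sum over weapons of P(Weapon, Culprit)
for culprit in ("Butler", "Cook"):
    print(culprit, sum(p for (c, w), p in joint.items() if c == culprit))

# Evidence: the weapon was the pistol. Keep the consistent entries and
# renormalize -- this is Bayes' theorem done directly on the table.
kept = {c: p for (c, w), p in joint.items() if w == "pistol"}
total = sum(kept.values())                    # P(Weapon = pistol) = 0.20
posterior = {c: p / total for c, p in kept.items()}
print(posterior)                              # Butler: 0.8, Cook: 0.2
```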
What happens if we now have more data? Supposing some more information came along in some kind of application. Well, what the posterior distribution really represents is our current state of knowledge of the world, taking into account all the things we know so far: all the prior knowledge and all the data we've seen so far. If more data comes along we can just apply the same machinery. We can think of the posterior distribution as being like our prior distribution for the next observation. We'll see lots of examples of that as we go through. Notice it's intrinsically incremental. I think increasingly a lot of machine learning applications are online, online in the sense of real-time, interactive. With a lot of traditional machine learning algorithms you collect the data in the laboratory, you train up your machine learning solution, you tune it all up and get it working really nicely in the lab, and then you give a million identical copies of that to your million users. Okay, and it's sort of a frozen solution. That's great for a lot of things. I mean, that's how the skeletal tracking system in Kinect works: all tuned up in the lab, and then everybody who's got a Kinect has got the same trained decision tree classifier. But for a lot of the problems we want to solve, we're trying to make the computer intelligent in a real-time sense: data's constantly being collected, the computer's making decisions, it's constantly making inferences. We'll see examples of that again as we go through these lectures. But this framework is intrinsically incremental. It's automatically online. All the information you've got so far you use to compute your current distribution, your current uncertainty expressed as probabilities. That forms the prior distribution for any future data that you receive. Okay, so, warning: a little bit of math coming up here. Remember, it's not that hard actually. Remember the product rule of probability: P of X and Y is P of Y given X times P of X. But by symmetry I could equally well write it as P of X given Y times P of Y. I've just applied the product rule twice to this joint distribution. Now if I divide through by P of X, what I get is this. It's called Bayes' theorem: P of Y given X is P of X given Y, times P of Y, divided by P of X. It's a way of reversing a conditional probability. The reason why this is so fundamental to us is that, supposing we're interested in the quantity Y. Y might be the skill of the player: the thing we'd like to know but we don't. We've expressed our uncertainty in terms of a distribution P of Y. Along comes some data X that's relevant. What we need to do is compute this thing, the likelihood, P of X given Y. We multiply it by the prior; this thing P of X is just a normalizing constant. What we get is the posterior distribution. It's a new distribution for Y, taking account of this new data X, okay. If another data point comes along, X prime, we just take this P of Y given X, multiply it by the new likelihood, and we get the new posterior, and so on. What we've seen in the murder mystery example is that there are kind of three phases to solving a problem in this model-based framework. The first stage is to build a model. Now I can be a little bit more precise about what I mean by a model. By model I mean a joint probability distribution over all the variables of interest in your application. A convenient way of doing that is to express it as a factor graph. It's not the only way, and not the most general way, but for many applications it's sufficient. That's the first stage: build a model.
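The incremental, posterior-becomes-prior idea can be shown with the bent coin from earlier. A small Python sketch (illustrative, not from the talk): two hypotheses about which side is heads, and a made-up sequence of flips arriving one at a time:

```python
# Hypothesis A: heads is the concave side, P(heads | A) = 0.6.
# Hypothesis B: heads is the convex side,  P(heads | B) = 0.4.
p = {"A": 0.5, "B": 0.5}                       # prior before any flips
p_heads = {"A": 0.6, "B": 0.4}

for flip in ["H", "H", "T", "H"]:              # data arriving one flip at a time
    like = {h: p_heads[h] if flip == "H" else 1 - p_heads[h] for h in p}
    norm = sum(like[h] * p[h] for h in p)      # P(flip), the normalizing constant
    p = {h: like[h] * p[h] / norm for h in p}  # Bayes: posterior is the new prior
    print(flip, p)
```

Each pass through the loop is one application of Bayes' theorem; nothing special has to be done to make it online.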
The second stage is to incorporate your observed data. Set known variables to their known values. They cease to be random variables; they become fixed to their known, observed values. That's like our training data. Then the third stage, and this is where all the computational grunt comes in, is we have to do inference. What inference means is that we have to update the distributions over the variables that we care about. Okay, we saw that when we updated the distribution over the Culprit once we knew what the Weapon was. Again, we'll see lots more examples as we go through. Now, if we're in a sort of real-time scenario, we can simply iterate steps two and three. We've done some inference; we now observe some more data; we incorporate those new observations and we do some more inference. All the time our probability distributions are evolving, reflecting our improved understanding of the world, or the computer's improved understanding of the world. That's what learning means in this context. Learning here means the computer is updating its probability distributions, which quantify uncertainty, in light of data that it's received. Another thing we can do is that if the domain changes, or somebody wants to ship version two and it's more complicated, we can simply extend the model as required for our particular application, perhaps by adding some more variables and some more factors.

>>: Some [indiscernible] question. [indiscernible] building the model up, like if you happen to have, say, two thousand variables and we don't exactly know how they interact with each other. Is there an algorithm or computational way to [indiscernible] the models in there?

>> Chris Bishop: Okay, it's a great question. I'll come back to the question at the end, if I may. But the question was: what if we've got thousands of variables and you don't exactly know how they relate to each other? It's a great question. Generally speaking, you know something about your problem domain. I think once you've seen some specific examples you'll see what I mean by a sort of typical graph; you'll think of your application and say, golly, I can begin to see how this works. In extreme cases you may not be sure. There may be some uncertainty in the model: should it be like this, or should it be like that? Well, you guys know what to do. If you're uncertain about something, you quantify your uncertainty using probabilities. You allow both models to coexist. You might have an additional variable which says the truth is model A or it's model B. You put a prior distribution over that. You run the whole thing through your inference algorithm, and you get a posterior distribution saying the data says I'm ninety-eight percent sure it's the right-hand model, okay.
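Here is a toy Python sketch of that switch-variable idea (entirely illustrative: the models, the data, and the numbers are made up, and a real Infer.NET version would express this as a factor graph). Model A says a coin is fair; model B says it lands heads ninety percent of the time; the switch has a fifty-fifty prior:

```python
p_model = {"A": 0.5, "B": 0.5}   # prior over the switch variable
p_heads = {"A": 0.5, "B": 0.9}   # model A: fair coin; model B: biased coin

data = "HHHHHHHHHT"              # ten observed flips (made up)
for m in p_model:
    for flip in data:            # multiply in the likelihood of each flip
        p_model[m] *= p_heads[m] if flip == "H" else 1 - p_heads[m]

norm = sum(p_model.values())
posterior = {m: v / norm for m, v in p_model.items()}
print(posterior)                 # about {'A': 0.02, 'B': 0.98} for this data
```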
I'm pretty much done with the lecture. What I'm going to do now is show you a demo, and then we can have time for questions. The demo that I'm going to show you, we call this the Movie Recommender Demo. This is a little demo that we actually built for a public exhibition. I guess it was, was it last year? It was the three hundred and fiftieth anniversary of The Royal Society, which had a huge exhibition in London on the River Thames, on the South Bank. We built this as an interactive demo for people to play with, to try to convey the basic ideas of machine learning from this sort of model-based perspective. The demo was a failure, in a sense, because it recommends movies: people loved it so much we couldn't pry them away from the demos, because they wanted to know which movie to watch next. It was kind of hard to explain machine learning. But hopefully you won't suffer from that problem today. The engine behind this is something called Matchbox. Matchbox is built on Infer.NET. Matchbox is used for large-scale recommendation applications. What we've done, though, is put a simple demo frontend on it, so I can use this to explain the idea of probabilities. The system has already seen data; I think the data came from Netflix. There are a hundred movies here. The database has more movies than this. But for these hundred movies we have hundreds of thousands of recommendations from tens of thousands of people. The system has already seen that data. What we're going to do now is a bit of personalization. It's going to customize to my movie preferences. Now, in a real recommendation system, and in Matchbox itself, you know a lot about the movies. You might know something about the user. You might know the gender of the user or the age of the user. For each movie you know that it's a romantic comedy or an action adventure, and so on. Already, from known population correlations between features of the user and features of the items, which here are movies, you can make recommendations out of the box. Purely for the purposes of this demo we're not using any of that information. Alright, so for this demo each movie is just, say, movie a hundred and twenty-seven, okay. I'm just a new user. The system knows nothing at all about me. We're just going to use collaborative filtering. I'm going to watch movies and I'm going to say I like this one, I don't like that one. It's going to combine that information with the likes and dislikes taken from that database. Just to prove a point, we're not going to use any of the features. But Matchbox itself, which is described by a factor graph and which uses inference to make these recommendations, can use both features and collaborative filtering. This is a nice example, I guess, of the avoidance of ad-hoc solutions which I mentioned earlier. Out of the box we know nothing about the user, or we've got a new movie and we know nothing about the movie, in the sense that we have no recommendations for it. Then the only thing you can use is features, right? I like action movies and this is an action movie, so the chances are I'll like it. But once you start to see recommendations from an individual user, you can start to tune or customize the recommendations. After I've made thousands of recommendations you want to base it mainly on recommendations, not on these features. You've got to sort of gradually fade from initially using features to giving more and more weight to the recommendations as we see more and more recommendations. That happens automatically in Matchbox, just because of the sum and product rules of probability. You don't have to code that in, in some sort of ad-hoc way. Okay, but for this demo, then, it just knows about recommendations. I have to find the cursor. Okay, so initially it knows nothing at all about me. Let's say I've watched a movie, let's say I watched Pretty Woman. What I'm going to do is drag that across into the green area, which tells the computer I've watched the movie and I like that movie. What it's done now is to arrange all the other movies on the screen in a particular way. Now, the vertical position on the screen is irrelevant. We've just spread them out vertically to make them easier to see. What matters is the horizontal position.
The horizontal position of each movie is the probability that the computer thinks I will like that movie. If a movie is up against the right hand side here, that's probability of one: the computer is certain that I'm going to like that movie. If the movie's down the left hand side, that's probability zero: it's certain that I'm not going to like the movie. Movies down the middle are fifty-fifty. Now, at the moment, what does it know? It's got tens of thousands of people and hundreds of thousands of recommendations. But as far as I'm concerned, all it knows about me is that I liked Pretty Woman. But already that's enough to start to make recommendations, because people who liked Pretty Woman also liked other movies and hated certain other movies, so it can already assign a probability to each of these movies. But what you'll notice is they're sort of clustered around the middle. Most of the movies are sort of near the middle. There's a lot of white space down the left and right. After all, all it knows is that I like this one movie. It hardly knows anything about me. It's pretty uncertain about which movies I'll like and which ones I won't like. Let's carry on. Let's give it some more information. Let's say I watched another movie, and say I didn't like that one. Even after just two movies you'll see what's happening: things are spreading out. They're spreading out towards the right and towards the left. They're moving towards zero and one. The system is now a bit less uncertain about which movies I'll like and which I won't like. That's what learning means in this context. It's a reduction in uncertainty as a result of seeing data. It's intrinsically online, because I can just carry on and give it more examples of movies that I like and movies that I don't like. I keep losing the cursor because it's a dual screen; there we go. Let's pick another movie that I don't like; again there's sort of some rearranging, but generally things are spreading out to the sides. Or maybe I do like that movie; again we get different results. You can play with this all day. But even after just three examples, or maybe just four examples, okay, so there's four examples, two that I like and two that I don't like, you can see now what's happening. There's now a lot more white space down the middle. A lot of movies are crowded down the sides. Nothing's right up against the side; it can't be certain that I'm not going to like a movie. But it's pretty confident that I'm not going to like these movies, and it's pretty confident that I will like these movies. The ones down the middle, it's just completely fifty-fifty about whether I'm going to like those ones. We'll just illustrate one more point. Let's take a movie right down the right hand side. Oh, sorry, yeah?

>>: What I'm not hearing about is how does the model capture the fact that somebody who likes this wouldn't like that other movie. You said it doesn't take into account the [indiscernible] movie type role and stuff like that. What prior information does it have?

>> Chris Bishop: Okay, I think the question was basically: how does it work? I haven't really explained how it works. The question really is: there's no metadata, so how can it know what's going on? Really, what it's doing is just, you know, your intuition is that if there was somebody else in the room that liked Pretty Woman and liked Chicago, and they hated The Sound of Music and they hated Elf, and I asked them, well, what did you think about Closer, and they say, oh, great movie.
Okay, well, you know, I seem to be like that person, and so maybe I too will like Closer. That's kind of the intuition.

>>: [inaudible] people this before you showed it at the exhibition? Some, a bunch of people, like human judges and numbers?

>>: It's based on that point.

>>: Oh, alright.

>> Chris Bishop: Before the lecture, when we built the demo, we already, you know, trained it, in inverted commas, on a database. I think it's Netflix data, where we've got, you know, hundreds of thousands of ratings from tens of thousands of people. You imagine there's a sort of big matrix of movies and people. It's a sparse matrix, but there are entries: occasionally there's an entry where somebody said like, or an entry where somebody said dislike. It's kind of like the intuition that, you know, people who have similar taste to me will like the movies that I will like in future. But what we haven't done is code up that intuition, because there are lots of ways of doing it, and how would you know which to choose? It's an ad-hoc mess. Instead we built a model which describes the relationship between these variables, expressed probabilistically. Are you showing this?

>>: I actually show the [indiscernible] model in factor graph form this afternoon.

>> Chris Bishop: You show them the code. Okay, so this afternoon John Bronskill's going to show you the factor graph for how this works, and probably some Infer.NET code for it as well. I'm just going to show you one more thing. Let's take a movie down the right hand side. That's a movie that the system is very confident that I'm going to like. I'll just pick one of them. Let's say I've watched that movie, and let's say I do like it. Watch what happens to the other movies when I let go of the mouse button. You know, not a lot, okay. It was really confident that I was going to like the movie; I said, yep, I like it. Okay, it hasn't learned very much, because it kind of knew that already. Let's do the other extreme. Let's take a movie that's down the left hand side. Now here's a movie where it's really confident that I'm going to hate that movie. Let's say, naturally, I watched that movie and actually I liked that one. I'm going to drag this across to here. Now look what happens when I let go of the mouse button. Okay, that was hugely informative. In fact, information theory defines information as the degree of surprise. Okay, towards the right hand side the amount of surprise in saying I like that movie goes to zero. Across the left hand side it goes to infinity. There's much more information in telling it something surprising that it wasn't expecting than in telling it something it kind of already knew.
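That "information as the degree of surprise" remark corresponds to the surprisal, negative log probability. A tiny Python sketch (not from the talk) of how surprise behaves at the two sides of the screen:

```python
import math

# Surprisal of an event with probability p is -log2(p), measured in bits.
# As p -> 1 (the system already expected it) the surprisal goes to zero;
# as p -> 0 (a big surprise) it grows without bound.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"P = {p:4.2f}   surprisal = {-math.log2(p):5.2f} bits")
```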
>>: [indiscernible] source of the error and the noise affect your factor model a lot?

>> Chris Bishop: The question was, the error and the noise do…

>>: The noise, if I actually don't like it [indiscernible] made a mistake.

>> Chris Bishop: Right, great question. The question is, well, could this be affected by noise, could it be affected by mistakes? It could be affected by mislabeling. You know, somebody watched the movie, they really loved it, and they're in a big hurry and they click dislike. Then they carried on, they didn't notice, or the system wouldn't let them change it, or whatever. That's a sort of label error, and so on. Generally speaking, the answer, when somebody comes along and says, oh, this is all very well, but in my application domain it's different, because in my application domain I have users who make label errors, right, they occasionally flip the label: I say, great, you've just told me how to extend the model for your domain. In your domain there's a label error, so we'll put in a label error variable. What we might have is, if you like, the true variable, which we can't observe. What we actually observe is the label the user gave. That's a noisy version of the truth, right. We have a little probability table that says, well, ninety-nine percent of the time the label they give is the thing they meant to give, and one percent of the time they flipped it. Okay, so what we do is we just model it. Then the rules of probability will do the right thing. Yeah?

>>: Do you mean that in that case you're actually doubling the number of variables? Because you want one of them as the actual state and the other one is the state you observe, right?

>> Chris Bishop: The question was, am I doubling the number of variables? Do you mean in the little example I gave of the noisy…

>>: Yes.

>> Chris Bishop: Yes, you're introducing extra variables, yes. Yeah?

>>: This will never be [indiscernible] the system actually say I am using [indiscernible]. [indiscernible] ruled out actually two people are the users and they have very different taste. [indiscernible] one user. Is there a way to quantify that your model is wrong? Like, I have seen so much data and I'm still not able to learn anything, still not able to successfully predict something. Is there a way to quantify that this [indiscernible] model is wrong and we should reconsider it or not?

>> Chris Bishop: Okay, I'm not sure I understood exactly that scenario. But I think the question was: can I quantify the fact that I may have the wrong model? Let me give a slightly general answer to that, which is about model misspecification. What you've done is you've written down a model of the world. What happens if that model is wrong? That's a very general question in machine learning. Okay, it applies equally to traditional methods and to model-based methods. If you use a linear regression system, your model of the world is a linear model. If the world is highly non-linear, then you can get very wrong answers. If you make some assumption about the world that turns out to be violated, then you can make arbitrarily bad predictions. That's true in any approach, and that's certainly true here as well. What you can do, though, is allow for the fact that the world may be more complex than you thought. It might be that there are other processes going on. We've had a couple of examples, sort of label noise and so on. You can model those. You can model those causes of misspecification if you can anticipate them. Maybe, to come back to the earlier point, you're not sure whether you should include a particular effect or not. I don't know whether my users are flipping the labels ten percent of the time, or whether actually they're all completely correct. You can do model comparison. A very nice way of doing model comparison, and this is how you do model comparison in Infer.NET, is to have one model represented by a graph over here, and another model represented by a graph over here. Now you construct a sort of Uber graph where you have a switch. The switch switches between the models. That switch is a little binary variable that says model A or model B. You put a prior distribution on that, maybe fifty-fifty because you're not too sure, maybe sixty-forty, whatever it is. That's now your model. That big model contains the two sub-models.
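The label-noise extension described a moment ago is easy to sketch numerically. In this toy Python version (illustrative; the prior of seventy percent is an assumed number, while the ninety-nine-to-one flip table is the one from the talk), we infer the unobserved true label from the noisy observed click:

```python
p_true_like = 0.7        # assumed prior that the user truly likes the item
p_keep = 0.99            # P(observed label matches the true label)

observed = "like"        # the click we actually saw

# Likelihood of the observed click under each value of the hidden true label
like_if_true  = p_keep if observed == "like" else 1 - p_keep
like_if_false = 1 - p_keep if observed == "like" else p_keep

num = like_if_true * p_true_like
den = num + like_if_false * (1 - p_true_like)
print(num / den)         # posterior that the user truly likes it (~0.996)
```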
Now you do the second step, which is to condition on the data: you observe the data. You run inference. What you get is inferences made by model A, inferences made by model B, and a posterior distribution over which is the right model.

>>: You are [indiscernible] by the user; if there's a new movie it can also handle that?

>> Chris Bishop: The question is, if there's a new user or a new movie, can it handle it? Yes. I mean, what's essentially going on in here? We'll look at the factor graph, I guess, this afternoon. There's a very general, or at least quite widespread, technique called matrix factorization. You take that big matrix and try to represent it in a low dimensional space. Matchbox is, if you like, a probabilistic version of matrix factorization, okay. What's going on is we have some low dimensional latent space, you know, five dimensional or ten dimensional. The users are mapped down into that space. That mapping is one of the things we learn. There are parameters governing that mapping, and there are prior distributions over those parameters. There's another mapping from items down into that latent space, again governed by a bunch of parameters, and again there are distributions over those parameters. We have some notion of affinity or closeness, some metric within that space, which represents the alignment between, you know, the vector of the user and the items: whether a user tends to like particular items and tends to dislike other items. Two users that are very similar will have vectors that are quite closely aligned; users that have very different tastes will be far apart in that space. Again, all of that is represented by probability distributions expressed as a factor graph. Yeah?

>>: If you have a very large dataset, what would the sense be to kind of decrease [indiscernible]? If you have a new user, because it's all dependent on the influence of other users on that user, is there a point where, whatever you do, you really do not get the crystallization that you really want from that result?

>> Chris Bishop: Okay, so I'll repeat the question and tell me if I've got it right. The idea is, supposing we've had a million users, or a billion users, and along comes little old me. You know, I'm the billionth and first user. Isn't my data going to be completely swamped by the data from those billion users? Won't I have to rate a billion movies before it even starts to personalize to me? Absolutely not. Again, you construct the model the way you want to construct the model. We're going to have a look at a nice example of personalization this afternoon in the context of email, and we can talk a little bit more about that then. Again, it does the right thing. It's making the right tradeoffs between sort of community predictions versus personalized predictions. It starts off, out of the box, with the community prediction because it knows nothing about you, then makes appropriate adaptations as it starts to know more and more about you. That tradeoff, that gradual fading out of the effect of the community and fading in of the effect of your personal data, happens automatically from the sum rule and the product rule of probability. Okay, you don't have to think, I'll make it one over T or something. You know, it just happens.
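For intuition about that matrix factorization idea, here is a rough, deliberately non-probabilistic Python sketch (illustrative only; Matchbox itself places distributions over these vectors and runs inference on a factor graph, whereas this just fits point estimates by gradient descent). Users and movies are mapped into a two-dimensional latent space, and a rating is modeled by the alignment, the dot product, of the two vectors; the tiny ratings table is made up:

```python
import random

K, LR = 2, 0.05                                 # latent dimensions, step size
ratings = {(0, 0): 1.0, (0, 1): 0.0,            # (user, movie) -> like/dislike
           (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}

random.seed(0)
users  = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(3)]
movies = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

for _ in range(2000):                           # stochastic gradient descent
    for (u, m), r in ratings.items():
        err = r - dot(users[u], movies[m])
        for k in range(K):                      # nudge both vectors together
            users[u][k], movies[m][k] = (users[u][k] + LR * err * movies[m][k],
                                         movies[m][k] + LR * err * users[u][k])

# Predict an unseen (user, movie) pair from the alignment of their vectors
print(round(dot(users[2], movies[2]), 2))
```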
>>: How does this differ from the [indiscernible] hashing from multi-task learning kind of methods? Like, you know, we saw that I think yesterday. How does that differ? Is there an edge for this model over that one? How does it relate to other traditional learning methods?

>> Chris Bishop: I'm going to confess, I've only just flown in, and I haven't listened to all of the other lectures. What we can maybe do is take that offline; there's also an email list around this, you know, so I may have a chance to watch the lecture and give you an answer. I'm very happy to make points of comparison between how we tackle a challenge in this approach versus some more traditional methods. But I wasn't familiar with the particular piece of jargon you used, so I'll defer that, if I may, to offline. Yeah?

>>: [indiscernible] take on, your take on selecting prior model?

>> Chris Bishop: My take on how to select priors?

>>: Yeah.

>> Chris Bishop: Okay, so, you know, this is a gentle introduction, alright, and there's more to come, so hopefully in the next lecture you may get some insights into how we select priors. I've used the terms prior and posterior because, you know, they're commonly used and they're one way to think about these things. But I tend not to think too much, really, about priors and posteriors, just about models and distributions and their relationships. It's really about building a model of the world. Now, in the next lecture you'll see some nice examples where, so, in the murder mystery I just pulled those numbers out of thin air. I just said, well, okay, the Cook's more likely to have done it than the Butler, so I'll make it eighty percent Cook and twenty percent Butler. Why eighty percent? Why not eighty-five percent? Why not seventy-five percent? Quite often we have distributions, the distributions have parameters, and we don't know what values to set the parameters to. Well, we know what to do, right? We model the uncertainty in those parameters by using random variables which themselves have distributions. That leads naturally to hierarchical models. We have distributions with parameters. Those parameters are uncertain; we want to learn them from data, we don't want to hand craft them, so we put distributions over those parameters. But those distributions themselves have other parameters; we might call them hyper-parameters. We have to stop that hierarchy at some point. The answer, in an engineering sense, is: well, just look at some of the examples. I think the best thing to do is just look at a dozen examples of the applications, and you'll get the flavor of how we build these models. I guess the more philosophical answer is that in machine learning, no matter how you approach it, you can't learn anything purely from data. There's some fundamental mathematics in machine learning that says you can't do this. You only learn in the context of assumptions, or a model, or prior knowledge, or background information, call it what you like. You assume something about the world. Sometimes those assumptions are quite general. You might assume that the output varies continuously with the input, or that it varies smoothly. If you model it with a neural network and you put a regularizer on the weights, you're saying: I don't think the output is going to vary too much if I change the input a little bit. Okay, that's a form of prior knowledge. You might have much stronger prior knowledge. You might say the output is a linear function of the input with some Gaussian noise, or something. Okay, that's a very strong form of prior knowledge.
I guess the more philosophical answer is that in machine learning, no matter how you approach it, you can't learn anything purely from data. There's some fundamental mathematics in machine learning that says you can't do this. You only learn in the context of assumptions, or a model, or prior knowledge, or background information, call it what you like. You assume something about the world. Sometimes those assumptions are quite general. You might assume that the output varies continuously with the input, or that it varies smoothly. If you model it with a neural network and you put a regularizer on the weights, you're saying: I don't think the output is going to vary too much if I change the input a little bit. That's a form of prior knowledge. You might have much stronger prior knowledge. You might say the output is a linear function of the input with some Gaussian noise. That's a very strong form of prior knowledge. Roughly speaking, if you constrain the world very tightly you get a lot of juice out of your data; you'll learn a lot from each data point. That's good news provided your assumptions map to the real world. If they don't, if you assume the world's linear and it's actually non-linear, then not only can you make bad predictions, you can even be very confident about those bad predictions. That's not good. At the other extreme, you make almost no assumptions about the world and you have very, very flexible models. In the traditional paradigm you then hit a major headache called overfitting. Suppose I've got a hundred data points and I say: well, I don't want to assume anything about the world, I just want the data to tell me everything, so I'm going to fit a hundredth-order polynomial to my data points. A hundredth-order polynomial is really flexible; it can model all sorts of different things. What happens is you just tune to the noise on the data, because in the traditional methods you're typically optimizing parameters and making point estimates of them. You can overfit to the data. That's one of the challenges you face. There are ways to deal with overfitting in that traditional framework, of course; I'm sure you've heard about them this week. In this framework there isn't really overfitting. Overfitting is a pathology that arises when you make point estimates. What happens in this domain is that if I have a very flexible model with very broad distributions, then when I observe data my distributions get narrower and I learn stuff about the world. But I still have a lot of uncertainty: I'm hopefully still encompassing the truth, just not very confident about it. If we get time, and I don't think now is quite the right minute, we'll probably say a little bit more about this. But one of the nice things about a probabilistic model is that you can get the data to choose between different models for you. I mentioned how we can do this in Infer.NET with a little switching variable. Let me try to give you some intuition behind this with essentially zero math. It's like this: imagine I've got three models. The first is a very rigid model; it imposes a lot of prior knowledge: the world's just linear, a very restrictive model. The second model says, well, it could go up and down a little bit; one or two oscillations is fine. The third model is hugely flexible: it can go up and down a million times, it can have step functions, it can be a [indiscernible] to all sorts of things, a very flexible model. If you simply fit those three models to the data by optimizing the parameters then you'll always favor the most complex model, because it will just tune to all the data; it will fit the data points exactly. You'll overfit, and it'll say it's the third model. So you can't choose the model by tuning or optimizing the parameters. What you do instead, traditionally, is divide your dataset in two, train on one half, and see how predictive it is on the held-out half. That's the standard technique, and widely established. In this probabilistic setting, let's suppose the truth is actually the middle model: the real world goes up and down a little bit, but not too much. Let's look at the posterior probability of the three models. I'm uncertain about which model, so I'm going to keep all three.
I'm going to have a prior probability of a third, a third, a third. What's the posterior probability of those models when I run inference? Well, the first model, the rigid model, will have a low probability: it basically can't fit the data. The data's going up and down and it's trying to fit it with a straight line. There's a really bad mismatch between the data and the model, so that model has a low posterior probability. The middle model does a lot better because it can fit the data beautifully; it's got a much better data fit. The third model can fit the data just as well. In fact, it fits the data even better, because it can tune to all the little noise. But the problem is this: remember, all the probabilities add up to a hundred percent. Each of these models has a unit amount of predictive probability which it can spread around. The first model piled all its predictive probability on straight lines, and none of them fitted the data, so that was bad. The middle model spread its bets a bit more. It said: it could be a straight line, it could go up and down a little, or down and up a little. The probability it actually assigns to the data is quite high, because it can assign a decent amount of probability to the actual data you observe. The third model can explain the data perfectly, but its unit amount of predictive probability mass has been spread incredibly thin, because it can predict straight lines, things which go up and down a little bit, things that go up and down a lot, things which have steps, and so on. The amount of probability mass it assigns to the observed dataset is actually lower than for the middle model. The posterior probability peaks around the middle model: that's the most probable model. And that's without having to use any held-out data; that's on the training data. Now you're saying at this point: ah, who cares, I've got a big computer, what's the problem with training on my training data and comparing on a test set? Well, there's nothing wrong with that. In fact you should always do it, no matter what method you're using. Before you ship something to the customers I strongly recommend you test it on some of your data; that's just good, sound engineering. The thing is, suppose it's not just one thing you're trying to tune. If you're just trying to tune a regularization parameter, sure, I could run it ten times with ten different regularization parameters on my training set, compare the ten models on the test set, and pick the best. But what if I've got a million data points and one regularization parameter per data point? I've got a million parameters to tune. I can't run all the different combinations of a million parameters and keep testing them. I have to be able to learn the parameters and the hyperparameters from the training data. That's where some of these methods can be very powerful. That was a very long winded answer to a short question.
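Here is a sketch of that intuition in code, with illustrative settings throughout: three Bayesian polynomial models of increasing flexibility, scored on the training data alone by their marginal likelihood, the probability each model assigns to the observed data once its parameters are integrated out. The point-estimate fit always prefers the most flexible model, while the evidence typically peaks at the model whose flexibility matches the data.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(2)

    # A "middle complexity" world: one gentle oscillation plus noise.
    x = np.linspace(0.0, 1.0, 15)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

    noise_var, prior_var = 0.2 ** 2, 1.0      # assumed known, for simplicity

    for degree in (1, 3, 9):                  # rigid, moderate, very flexible
        phi = np.vander(x, degree + 1)        # polynomial features

        # Point estimate: least squares drives the training error down as
        # the model gets more flexible, so the flexible model always wins.
        w, *_ = np.linalg.lstsq(phi, y, rcond=None)
        rmse = np.sqrt(np.mean((y - phi @ w) ** 2))

        # Marginal likelihood: with a Gaussian prior the weights integrate
        # out exactly, y ~ N(0, noise_var*I + prior_var * Phi Phi^T), so a
        # model that spreads its predictive mass too thin scores poorly.
        cov = noise_var * np.eye(len(x)) + prior_var * phi @ phi.T
        log_evidence = multivariate_normal(np.zeros(len(x)), cov).logpdf(y)

        print(degree, round(float(rmse), 4), round(float(log_evidence), 2))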
>>: [indiscernible] the question he asked [indiscernible]. You had a very similar [indiscernible]. We had a cold start problem: an ad that you've never seen, how do you surface it? I'm just trying to understand what we could have done differently, because people are breaking their heads over that particular problem.

>> Chris Bishop: Okay, so I think this question is about your specific application, a particular issue that you've encountered. My experience with these questions is that I'm going to have to ask you to expand on it, and then you and I are going to have a half hour discussion at the whiteboard, which I'd love to have, but that's probably not possible here. That's quite a specific question. Maybe we can talk, or exchange email, and discuss that particular question. The question was to do with cold start problems, I think in an advertising situation. You've already thought about it very hard; it's obviously a hard problem, and it probably isn't a ten second answer. We should discuss that offline, I think. Maybe last question.

>>: Yeah, actually I have two. [indiscernible] specific [indiscernible] like how many users [indiscernible] need to have to give a reasonably good result? The second one is that the graph could be very large. Is that a good idea, and is there a way to control the size? Each [indiscernible] may have many connections, and those connections also have foundations. Can you reduce the [indiscernible] and [indiscernible] faster?

>> Chris Bishop: Let's deal with the second part first. The question was: if the underlying graph, which we haven't [indiscernible] yet for this recommender system, becomes very large, what was the question then, what's the problem?

>>: Oh, so maybe, let's say one node has many parents and many children, but maybe there are some connections between the parents and children individually, so maybe you can reduce the parents and reduce the number of children somehow?

>> Chris Bishop: Okay, so there are lots of design issues in designing the graph. Again, can I give you a general answer rather than a specific one? I think one of the advantages of this framework is, as I said, that we have to make assumptions. You cannot do machine learning from data alone; there's only data in the context of some assumptions. One of the nice things here is that you make the assumptions very explicit. In fact you're forced to make them explicit, because step one is to write down your model. You really have to come clean and tell the world what you believe about the world. It can't be buried away implicitly in the training algorithm somehow; it has to be in the model. You have to make your assumptions explicit in the model, and part of those assumptions has to do with the structure of the graph. Again, that comes back to the points we made earlier: this is really the domain knowledge that you're encoding. But sometimes, and I think you were talking about skip level connections between parents and children of other variables, are they present, are they not present? One thing you could do is just compare the two models. If you're genuinely uncertain you could look at both models and see which one the data prefers. Whoops, so, going back to the first part, what was the first question again?

>>: [indiscernible] wonder, like, to do [indiscernible] a model like this [indiscernible]. How many training [indiscernible]?

>> Chris Bishop: Lovely question: how many data points do I need to get good performance on this application, or any other application? You know the answer is forty-two. [laughter] The answer is I don't know. [laughter] No, it's so dependent on the problem, right? There's no magic. The answer is it will depend on the particular application that you have.
But the nice thing about this framework is that if you only have a little bit of data you can still use your full model. You don't have to adjust the model to the amount of data that you have. That's a funny thing to do: I've only got a hundred data points, so I'd better only fit three parameters, otherwise it will overfit. Now hang on a bit. If you really think the world has a hundred and twenty-seven parameters, surely that's how you should model the world. So you can use the full blown, complex model that you really think describes the world, even if you only see a few data points. That's fine; you won't have any overfitting. What you will have, of course, is still a lot of residual uncertainty. If you've only seen three data points, again, there's no magic here. After a couple of movie ratings it knows something about me; it doesn't know everything about me, and it's still quite uncertain. I was told to allow twenty minutes for the break. We'll take a break now, and then we'll go a little bit deeper into this topic, starting at eleven o'clock. [applause]