
>> Chris Bishop: My name’s Chris Bishop. I’m a Machine Learning Researcher from MSR in Cambridge,
which is the real Cambridge, the olde Cambridge over in England. This morning and for the first part of
this afternoon we have a series of three lectures on, if you like, a slightly different perspective on
machine learning, not so different from what you’ve heard before, but different in some important
ways. We call this Model-Based Machine Learning.
Let me give you a little road map of where we’re going. We have these three lectures, and the first one
is called A Gentle Introduction. Really the idea is to assume no prerequisites. You may have read about
machine learning; you may even be experts on machine learning. But this talk is really aimed at people
who are coming at this afresh. We’ll go over some very basic ideas. There’s a little bit of overlap with
some previous talks, and I think that’s fine. The basic ideas are very important. It’s always good to
hear them several times, and sometimes good to hear them from a different perspective.
In this first talk we’re going to discuss some very basic ideas: the idea of reasoning backwards and why
that’s a ubiquitous problem we have to solve. We’re going to talk about computing with
uncertainty and how we’re going to do that using probabilities. We’ll introduce the idea of factor
graphs.
In the second lecture, after the coffee break, we’ll go a little bit deeper. We’ll understand what
models are, how we build models, and how we do learning in models. Learning I will call inference;
when I say inference, that’s the equivalent of training or learning.
Then after lunch my colleague John Bronskill, also from MSR Cambridge, is going to show you how to turn
some of these ideas into practice using a platform called Infer.NET, which we’ve been building over the
last five years, and which makes this vision of model-based machine learning a reality. He’ll be giving
you samples of code and talking you through some case studies.
I also have a little icon. If you see that icon on a slide it’s just a warning that there’s a little bit of
math in there. Especially in this first lecture, indeed in the first two lectures, there’s almost no
mathematics. But just occasionally I can’t resist showing you a little bit of mathematics, because it
makes something so clear, or it’s so beautiful, that I just want to share it with you. I think there may be
only one or two slides in each talk where there’s a little bit of mathematics. When you see
that icon just be prepared for a couple of little equations. But for the most part we’re going to do this all
without using mathematics.
Okay, so our goal then is to build intelligent software, in other words software which can adapt,
which can learn, and which can reason about the world. Let’s look at some examples of where we might
want to build intelligent software. Consider Xbox and people playing games. One of the things
we’re interested in is the skill of the players. We care about the skill of the players because of course
the players want to compete with each other. We need leader boards: who’s the best, who’s the
second best.
We also care about skills because on Xbox Live we need to do real-time matchmaking. We need to
match up people of similar skill so that they have a good gaming experience. We can’t actually
measure a person’s skill; there’s no way of measuring it directly. What we can do, though, is
observe the results of that person playing games, let’s say games of Halo for example. We observe the
outcomes of the games, and the outcomes of the games depend on their skill. If they have a very high skill
level they’re going to win a lot of games.
Another example might be recommendation systems. Recommendation systems are very important.
The Amazon CEO claimed that something like twenty percent of Amazon’s revenue comes from their
recommender system, so it’s a nice application of machine learning. Let’s take the particular case of movies.
We care about people’s preferences for movies. Now, people’s preferences for movies can be quite
subtle, quite personal to them. It’s not just that they always like action and hate romantic comedy; it
can be more subtle. We can’t measure their preferences for movies, unfortunately. What we can do,
though, is look at ratings. People say: well, I liked that movie, I didn’t like this other movie. Clearly their
preferences for movies will influence which ones they rate highly and which ones they give a low rating to.
Just as a third example, imagine we’re putting words into a computer using a pen. Somewhere in the
user’s head are some words, the words they want to input into the computer. We can’t measure that
directly. But what happens, of course, is that the user makes gestures using the pen, we observe that
electronic ink, and the ink is determined in part by the words that they have in their head.
Now in each of these cases we can build something called a model to describe what’s going on. I’ll
explain a little more precisely later what we mean by model. But for now, when I use the term
model you can think of it as a mathematical or computational description of this process: the process
by which players have skills, and those skills give rise to game results. That’s what I mean by a model.
Now the problem that we have to solve is effectively the reverse problem. Look at the skill problem:
what we observe are the game results, and what we actually want to know are the players’ skills. We have
to reason backwards. Likewise with the movie preferences: what we observe are the ratings that people
give to movies, and what we’d like to do is infer their preferences so that we can then recommend
movies they’re going to like. Likewise with pen input: what we really care about are the words the
person has in mind, and what we observe is the electronic ink. We have to reason backwards; we have
to go from the ink to infer what words they had in their head. This is the idea of reasoning
backwards. It’s pretty ubiquitous.
Another very central concept is the idea of dealing with uncertainty. Again, look at that
example of somebody playing Halo on Xbox Live. We don’t know a player’s skill; we’re uncertain about
it. We have some idea, but we don’t know exactly what their skill
level is. But every time they play a game and they win or they lose, we gain some relevant information
about their skill. Even after we’ve observed several games, though, we don’t know their skill level
exactly. We’re constantly having to deal with uncertainty. We need to reason in the face of
uncertainty, and what we need is a principled rather than an ad-hoc way of dealing with it.
Again, this is pretty ubiquitous; uncertainty is all over the place. Which movie should the user watch
next? What word did they write? What did they say? Which web page are they trying to find? What
link are they going to click on? What products are they going to buy? What gesture are they making?
And so on.
Computers fundamentally are deterministic, logical things. The chip manufacturers go to a lot of
trouble to make sure each transistor is either on or off. It’s deterministic. But more and more
applications of computing today involve interacting with the real world, interacting with users, operating
in a world of uncertainty.
We’re going to use a kind of calculus of uncertainty, and that’s the calculus of probability, so probability
theory. Probability theory in essence is not very complicated, certainly at the level that we need
for machine learning applications. Probability is a way, in fact a unique way, an optimal way, of
quantifying uncertainty. Now, when you’re at school and you’re introduced to the idea of
probability, you might be introduced to probability as the limit of an infinite number of trials.
If I say that a coin has a probability of landing heads of point five, then we can formalize that by saying
that if I take a very large number of coin flips and look at the fraction of times that it lands heads, that
fraction, in the limit of an infinite number of trials, will tend towards point five. So that’s the idea of
repeated events.
Well, we’re going to use probability in a much more general sense: probability as a way of quantifying
uncertainty. I’ll give you a little example. Suppose we have a coin, and imagine the coin is a bent
coin. Let’s suppose the physics of this coin is such that sixty percent of the time it will land concave side
upwards, and forty percent of the time it will land concave side downwards. So in an infinite number of
repeated trials, sixty percent of the trials it will land concave side upwards.
Now let’s say one side of this coin is heads and the other is tails, but you don’t know which is which. I
ask you: what’s the probability of it landing heads next time round? Well, the rational answer is point
five. There’s no reason to prefer one side or the other. It doesn’t mean that you think that
if you flip the coin an infinite number of times, fifty percent of the time it will land heads. You know
that it will either land heads sixty percent of the time or it will land heads forty percent of the time;
you’re just uncertain which. We’re applying probabilities here to a one-off event: which side of the coin
is heads and which is tails. You can ask, you know, were the dinosaurs wiped out by a meteorite or by a
volcano? You can express your uncertainty using a probability, even though it’s a one-off event, even
though it happened in the past.
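A minimal Python sketch of the bent-coin reasoning (illustrative only, not part of the talk): the physics is fixed, but the unknown side assignment is itself a fifty-fifty hypothesis that we marginalize over.

```python
# Bent coin: the physics says it lands concave side up 60% of the time.
p_concave_up = 0.6
p_concave_down = 1.0 - p_concave_up

# We don't know which side is heads, so we put a 50/50 distribution
# over the two hypotheses ("heads is the concave-up side" or not)
# and marginalize over that uncertainty.
p_heads_is_up = 0.5

p_heads = p_heads_is_up * p_concave_up + (1 - p_heads_is_up) * p_concave_down
print(p_heads)  # 0.5 -- rational, even though no physical 50/50 coin exists here
```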
Many of you have come across a lot of techniques in machine learning. Machine learning as a research
field is half a century old or more. Thousands of machine learning researchers have
been developing lots of techniques, lots of algorithms. They’ve given them names, and here’s a list of
just a tiny fraction of them. I’m going to tend to call this traditional machine learning; I’m struggling to
find a better word for it. It doesn’t mean that it was all
developed in the last century. Much of it was, but things like deep networks, which you’ll hear
about after lunch, are quite a recent development.
That’s traditional machine learning: a whole bunch of techniques. Now, you’ve got an
application that you want to solve. In this framework, what you do is try to map your application onto
one of these techniques. You try to find a technique which you think is going to be good for solving your
particular problem. Not only that, but you may be influenced by the availability of software. You
may not have software implementations of all of these; you may actually have a rather restricted set.
You have to map your problem onto a rather restricted set of algorithms for which you have
software available.
In this series of three lectures I want to contrast that with a slightly different philosophy of
machine learning, which we call model-based machine learning. The key idea is very simple:
have a single development environment which supports the creation of a wide range of
models, bespoke models. In the traditional view you map your problem onto one of the standard
tools for which you have software. In the model-based approach you ask: what is the model which
describes my particular problem? You build a bespoke model.
Now, there are some potential benefits if we can achieve this vision. The first benefit is that the solution
can be optimized to your particular application. You’re not shoehorning your application into some
predefined framework. You’re developing something bespoke, customized to your application, so it
could potentially be a better solution because it’s optimized.
Another nice advantage is transparency of functionality. You’ll see this in the particular case of
Infer.NET: the model can be specified in some cases by ten or twenty lines of code. That
specifies the model, rather than the thousands of lines of code that define a traditional method. It’s
quite easy to see what the code is doing, quite easy to modify it, and quite easy to pass it over to a friend
who has a similar but different problem, who can take that model and modify it for their application.
Another nice advantage: there’s a clear segregation between the model definition and the training code,
which I should really call inference code; we’ll see what that means later. It means we can build
general-purpose inference engines that apply to a wide range of models. When we get smarter at doing
inference, that smarter inference is available to a whole load of different models. Equally, as a user you
can build a model for your application and you don’t have to worry about the inference engine; that
should take care of itself.
Another possible advantage is for newcomers to the field. I think even for people in the machine
learning field there’s quite a bewildering variety of different techniques around, a vast literature,
and getting your head around all those different techniques is pretty hard. Here you can learn a single
modeling environment, and within that environment, as special cases, you can recover a lot of those
standard techniques. It might be that you’ve got a particular application and you build a bespoke
solution, and it so happens that your solution is pretty much equivalent to something developed thirty
years ago and called the Bloggs algorithm. Well, you don’t really care about that. All you care about is
that you’re building a solution for your application. You don’t have to know all about the Bloggs method.
Another, I think, very important point here is the potential to avoid what I call ad-hoc solutions. We’ll
see some examples of this as we go through the lectures. But very often you’re a domain
expert in your particular application. You have a lot of intuitions about how things behave. You know
that in any sensible solution to your problem, if this thing gets bigger, this other thing should get
smaller; if that doesn’t happen, there’s something wrong. You have this intuition, and you could try to
code it up. You could say, well, I could make one thing the reciprocal
of the other, so if one goes up the other goes down. But should it be one over the
thing, or one over the thing squared, or what? When you’re in that sort of ad-hoc
environment there are lots of different solutions and it’s not clear which one you should choose.
For me, my background is theoretical physics, quantum field theory, and one of the things I love about this
framework is that it’s tremendously elegant. You simply describe your application as a little probabilistic
model, and then the rules of probability will do the right thing automatically. If one thing should go
down when another thing goes up, that will happen automatically in your model. You don’t have to code
it up; you don’t have to figure it out; it happens automatically. For me that’s one of the most
compelling aspects of this view of machine learning.
This of course is a vision, an idealization. What we have is Infer.NET, a platform that’s still in
development but is already very usable. You’ll see quite a few applications mentioned during these
talks, and it’s already available for download. You’ll hear a lot more about that later on in the day.
Okay, so how are we going to do all this in practice? Well, there are really three key ideas that we need to
think about. The first one we’ve encountered already, and we’ll learn a lot more about it as we go
through: the idea of using probabilities as a quantification of uncertainty.
The next idea we’re going to introduce is the idea of graphs to express our models. Now, we don’t
have to use graphs, but graphs are rather nice. People like pictures; it’s often quite easy to see from the
picture what it is you’re expressing. We’re going to introduce a particular type of graph called a factor
graph, and we’ll use that to describe the models in these lectures.
Then finally we need to use the models to make predictions. That process we’ll call inference. This is
analogous to the training or learning in some of the traditional methods. This is where all the
computation happens; this is the computationally expensive part. This computation is as
costly as in traditional methods, so we care about making it efficient, and I’ll say a few words about how
we do that.
Okay, so in these lectures you’ll see quite a few real applications. But in order to introduce some very
basic concepts I’m going to use a toy problem. It’s sort of like Hello World for model-based machine
learning. It is a toy problem, but we can introduce about seventy or eighty percent of the important
ideas using this very simple example.
This is the murder mystery then. A fiendish murder has been committed and we want to know whodunit.
There’s our sleuth. Now, suppose that there are only two suspects. We’ve got, of course, as always, the
Butler. But was it the Butler? Perhaps it was the Cook. We’ll suppose that either the Butler done it or the
Cook done it. We’ll also suppose that there are three possible murder weapons that could have been
used: there’s a butcher’s knife, there’s a pistol, and there’s a fireplace poker.
[laughter]
Alright, so let’s introduce our first concept, the idea of a prior distribution. We have some domain
knowledge; we always have some domain knowledge. In this case we know that the Butler is a fine,
upstanding Butler who has served the family for many years. The Cook, on the other hand, was hired pretty
recently, there were various rumors about a dodgy history, and we’re not quite sure about the Cook.
We can represent this information probabilistically. We’ll say that, as far as we can
tell at the moment, the probability that the Butler was the person what done it is twenty percent, and
for the Cook it’s eighty percent. We think it’s much more likely that the Cook was responsible for the
murder than the Butler, given the information we have so far.
Notice the notation here: the probability that Culprit equals Butler is twenty percent. P stands
for probability. Culprit is an interesting quantity. It’s a variable, but it’s not like the variables we
normally use. We normally think about integers and Booleans and double precision floating point
numbers and so on. Those are all deterministic variables. This is something more general: a
random variable. If you think about a Boolean variable, it’s either zero or one. It’s
stored in memory and either has the value zero or it has the value one. It has a particular value.
Here we’re interested in a two-state variable called Culprit. Culprit can either take the value Butler or
the value Cook; we’re not sure which. But we do know something about it: we know that it’s
eighty percent likely to be the Cook and twenty percent likely to be the Butler. This is the probability
that the random variable takes on a particular value. You’ll notice the values add up to a hundred
percent, because we’re assuming in this model that either the Butler did it or the Cook did it, and nobody
else.
We can now represent this as a graph. This is our first example of a factor graph. Factor graphs are
quite simple really. You have a circle for every random variable. We only have one
random variable at the moment; it’s called Culprit, and it can take the states
Butler or Cook. It’s represented by the circle. The thing above it, this little square, is called a factor; it
represents the probability distribution of that random variable. That square represents P of
Culprit, which is just a summary of those two statements; it encapsulates both of them. This thing is
called a factor graph, and we’ll see a little bit later why. That’s
our first factor graph.
Now, of course, so far things aren’t very interesting. What we need now is some evidence. Let’s look at
the murder weapon. What do we know about the murder weapon? Well, the Butler, before he was our
Butler, was in the Army, and he kept hold of his nice British Webley revolver. He keeps it locked away in
his bedroom, so the Butler’s got a gun. The Cook has access to lots of knives, because the Cook
works in the kitchen. We’ll suppose that the Butler is fairly old and getting rather frail, so perhaps
using the fireplace poker is not so plausible, because that’s quite a physically demanding weapon. The
Cook, on the other hand, is young and very fit, and potentially could have used the poker.
What we’re going to do is capture that, again, as a little probability distribution. First of all, let’s
suppose it was the Cook what done it: we know the Cook was responsible. What’s the probability of
the Cook choosing these different weapons? Well, we don’t think it’s very likely that the Cook would
have used the pistol, because the pistol’s locked away by the Butler. There’s a good chance they would
have used the knife: they work in the kitchen, lots of knives. Possibly they used the poker as well.
These probabilities again add up to a hundred percent, because if the Cook was the murderer then
they must have chosen one, and only one, of these three weapons. On the other hand, let’s suppose it
wasn’t the Cook; let’s suppose it was the Butler what
done it. Then we might have some different probabilities. The Butler has access to the pistol, so let’s
say an eighty percent probability they would have chosen the pistol, and some small probabilities, ten
percent each, for the knife and the poker. Again, these add up to a hundred percent.
These are called conditional distributions, because they’re conditional on knowing who committed the
murder. There’s one distribution if the Cook did it and a different distribution if the Butler did it.
We have a notation for this. This is a random variable which we’ll call Weapon; it has
three states: pistol, knife, and poker. We have P of Weapon, but the probability distribution of the
Weapon depends on who the Culprit was. So we have this kind of notation: P of Weapon, then a
vertical bar, and on the right hand side of the bar we have Culprit. It’s called a conditional distribution.
The way to read it is “probability of Weapon given Culprit”. It means the probability distribution
over the Weapon if we know who the Culprit is. So this represents these two little tables. Now
we can extend our factor graph to combine the prior distribution with the conditional distribution. This
is the prior distribution: the Culprit, and P of Culprit.
Now we can introduce the random variable Weapon, which is a three-state random variable, together
with its distribution P of Weapon. But P of Weapon depends upon Culprit, so we’ve got a line joining the
Culprit random variable to this factor, because this factor depends on Culprit. That dependency is shown
by this extra link in the graph.
We call this one the conditional distribution, and that one the prior distribution. Yeah?
>>: In this case, are the [indiscernible] arrows directional?
>> Chris Bishop: In this case we have arrows. I won’t dive into too much detail; I want to keep this fairly
high level. But essentially those arrows denote the fact that this is a probability distribution. It’s a
distribution over the variable which the factor’s arrow is pointing at. Yep, it means it’s a normalized
distribution, yep.
What we have now is a joint distribution. What do I mean by the joint distribution? Well, I can ask a
simple question. There are two possible murderers and three possible weapons, so there are six
combinations of murderer and weapon. I can ask: what’s the probability that it was the Cook that
committed the murder, and they did it using the pistol? Well, I can easily calculate that. I know that the
probability it was the Cook is eighty percent. Conditioned on it being the Cook, the
probability that the Cook would have chosen the pistol is five percent. The probability of it being the
Cook with the pistol is obtained by just multiplying those together.
Okay, so you can think of it this way; I’m just going to call this a generative view. Imagine repeating this
situation many, many times. Imagine rolling biased dice to draw these random numbers. Eighty percent
of the time it would have been the Cook that did it, and of all those instances where it was the Cook, in
five percent of those the Cook would have chosen the pistol. Overall that’s eighty percent of five
percent, which is four percent, for it being the Cook using the pistol, okay.
Again, we have a little bit of notation. If you see P of Weapon comma Culprit, that’s the joint distribution
of the two variables. It’s to be read as “P of Weapon and Culprit”. Obviously there are five other
combinations, so we can do the math for all of them; it’s very simple, and we come up with this table.
This is called the joint distribution. Take this entry here, for instance: it represents a particular choice
for the Culprit and for the Weapon, the probability that the Butler did it using the knife, and that’s
two percent. Again, all of these numbers add up to a hundred percent, because it must have been one,
and only one, of those six possible combinations.
Here we have a little rule for calculating with probabilities. We call this the product rule. It says the
probability of Weapon and Culprit is given by multiplying the probability of Culprit by the probability of
Weapon given Culprit. Or, in general, for two variables X and Y, the probability of X and
Y is the probability of Y given X times the probability of X. That’s called the product rule of probability.
There are only two rules that we need: the product rule, and an equally simple one called the sum rule,
which we’ll come to in a moment. Those two rules of probability are all we need.
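A few lines of illustrative Python for the product rule on this example (not part of the talk). The prior and the pistol probabilities are the ones given in the lecture; the Cook’s knife/poker split below (0.65/0.30) is an assumed illustration, since the talk only says “a good chance” and “possibly”, chosen so each row sums to one.

```python
# Product rule: P(weapon, culprit) = P(weapon | culprit) * P(culprit)
prior = {"Butler": 0.20, "Cook": 0.80}

# Conditional distributions P(weapon | culprit). The pistol entries and
# the Butler's row are from the talk; the Cook's knife/poker split is assumed.
likelihood = {
    "Butler": {"pistol": 0.80, "knife": 0.10, "poker": 0.10},
    "Cook":   {"pistol": 0.05, "knife": 0.65, "poker": 0.30},
}

joint = {
    (culprit, weapon): prior[culprit] * likelihood[culprit][weapon]
    for culprit in prior
    for weapon in likelihood[culprit]
}

for combo, p in joint.items():
    print(combo, f"{p:.0%}")   # e.g. ('Cook', 'pistol') 4%

print(sum(joint.values()))     # 1.0 -- the six entries cover all cases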
Here’s our factor graph. Let’s just hide the variables and look at the factors. We’ve just
seen that the joint distribution is obtained by multiplying the distribution at this factor by the
distribution at this factor. In general, that’s what these factor graphs mean. A factor graph tells us that
the joint distribution over all the, perhaps millions of, variables in our problem can be expressed as the
product of distributions over little subsets of the variables, each described by a factor. It’s just the
product rule of probability. It says the joint distribution of everything described by our model is
obtained by multiplying the factors together. Hence the term factor graph.
So far we’ve been reasoning in this direction. This is a bit like going from the players’
skills to the game outcomes. What we’re going to need to do is work in the reverse direction; we
need to reason backwards. Just before we do, though, let’s have a look at one final concept: the
idea of a marginal distribution.
This is our joint distribution table; each of these entries is the probability of a
particular Culprit with a particular Weapon. We could ask the question: what’s the probability that it was
the Butler that did it, irrespective of which Weapon they used? Let’s say we don’t know what the
Weapon was, or we don’t care. We just want to know the probability that it was
the Butler. Well, all we have to do is add up the probabilities for each possible Weapon used by the
Butler, and we get twenty percent. Same for the Cook: we get eighty percent. That’s a relief, because
that was what we fed into the model and we’ve got it back out again, so we haven’t got the math
wrong.
We could do it the other way around, though: instead of adding up the rows we could add up
the columns. Oh, by the way, this is the sum rule. Remember this is the probability of Weapon and
Culprit. If we only want the probability of Culprit, we simply sum over the different values of the
Weapon random variable. Or, in general, if you’ve got P of X and Y and we just want X, we sum over the
thing we’re not interested in, Y, the thing we don’t know. That’s called the sum rule. That’s all the
probability theory you need: the product rule and the sum rule, and that’s it.
Instead of summing the rows we can sum the columns. What that tells us is the marginal distribution
over the different Weapons. This twenty percent is the probability that the murder was
committed using the pistol when we don’t know, or don’t care, who committed the crime. It was
either the Cook or the Butler; we don’t care. We just want to know: what’s the chance it was done
using the pistol? That’s obtained by adding up the columns. Again, these numbers must all add up to a
hundred percent, because it must have been done by one, and only one, of those Weapons.
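The sum rule on the same table, as an illustrative sketch; the joint entries repeat the ones computed above, including the assumed knife/poker split for the Cook.

```python
# Sum rule: P(culprit) = sum over weapons of P(weapon, culprit),
# and P(weapon) = sum over culprits.
joint = {
    ("Butler", "pistol"): 0.16, ("Butler", "knife"): 0.02, ("Butler", "poker"): 0.02,
    ("Cook",   "pistol"): 0.04, ("Cook",   "knife"): 0.52, ("Cook",   "poker"): 0.24,
}

marginal_culprit = {}
marginal_weapon = {}
for (culprit, weapon), p in joint.items():
    marginal_culprit[culprit] = marginal_culprit.get(culprit, 0.0) + p
    marginal_weapon[weapon] = marginal_weapon.get(weapon, 0.0) + p

# Up to floating point, we recover the prior we fed in, plus the
# weapon marginals: pistol 0.20, knife 0.54, poker 0.26.
print(marginal_culprit)  # ~{'Butler': 0.2, 'Cook': 0.8}
print(marginal_weapon)
```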
Now we come to the interesting bit, the bit that we really care about, which is when we
reason backwards to find the thing we’re interested in. What we have now is some evidence; we
make an observation. In this case our sleuth has discovered a pistol lying next to the body. That’s surely
pretty relevant to this crime. What does it tell us?
Well, let’s look at that joint table. These are the six possibilities that could have occurred. But we know
it was done with the pistol, so we can just rule out these two columns; we know they didn’t
occur. We’re left with these two numbers, four percent and sixteen percent. They don’t add up to a
hundred percent, but what they tell us, in this repeated-sample generative thought experiment if you
like, is the fraction of times that it was done by the Cook using the pistol and the fraction of
times it was done by the Butler using the pistol. We can normalize those fractions to a hundred percent.
It says that twenty percent of the time it was done by the Cook and eighty percent of the time it was
done by the Butler. That’s the reverse of the probabilities we started with. We started out thinking it
was the dodgy Cook, but having found the pistol that’s changed things around; it looks like it was the
Butler what done it. Things look pretty bad for the Butler, which was obvious, because it’s a murder
mystery, so we knew that all along.
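Conditioning and renormalizing, in the same illustrative style:

```python
# Reasoning backwards: observe weapon == pistol, keep only the
# consistent column of the joint table, then renormalize.
joint_pistol = {"Butler": 0.16, "Cook": 0.04}   # P(culprit, weapon=pistol)

evidence = sum(joint_pistol.values())           # P(weapon=pistol) = 0.20
posterior = {c: p / evidence for c, p in joint_pistol.items()}

print(posterior)  # {'Butler': 0.8, 'Cook': 0.2} -- the 20/80 prior reversed
```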
What we’re doing now is reasoning backwards. Here’s our little factor graph: this is the Culprit, this is
the Weapon. What we’ve done now is make an observation. We now know the value of
Weapon. Weapon is going to be a bit like data in a machine learning application: we’re going to build a
model and then we’re going to fix the variables that we know. The things we know, that’s our training
data. The third step then is to reason backwards and work out the updated distribution over the Culprit.
That’s the thing we just did by crossing out the columns of that joint distribution table.
We can formalize that in a little piece of mathematics called Bayes’ theorem. Here it is in words
first of all. What we have is a prior distribution; that was the initial probabilities of Butler and Cook
based on their history. After we make the observation that the Weapon was a pistol, we can
update that distribution, and we get the distribution after seeing the data, which is called the posterior
distribution.
What happens if we now have more data, supposing some more information comes along in some kind
of application? Well, what the posterior distribution really represents is our current state of knowledge
of the world, taking into account all the things we know so far: all the prior knowledge and all the data
we’ve seen. If more data comes along we can just apply the same machinery: we can think of the
posterior distribution as being the prior distribution for the next observation. We’ll see lots of
examples of that as we go through.
Notice it’s intrinsically incremental. I think increasingly a lot of machine learning applications are online,
online in the sense of real-time and interactive. With a lot of traditional machine learning algorithms you
collect the data in the laboratory, you train up your machine learning solution, you tune it all up and get
it working really nicely in the lab, and then you give a million identical copies of it to your million users.
It’s sort of a frozen solution. That’s great for a lot of things; that’s how the skeletal
tracking system in Kinect works, all tuned up in the lab, and then everybody who’s got a Kinect has got
the same trained decision tree classifier.
But for a lot of the problems we want to solve, we’re trying to make the computer intelligent in a real-
time sense: data’s constantly being collected, the computer’s making decisions, it’s constantly making
inferences. We’ll see examples of that again as we go through these lectures. But this framework is
intrinsically incremental. It’s automatically online. All the information you’ve got so far you use to
compute your current distribution, your current uncertainty expressed as probabilities. That forms the
prior distribution for any future data that you receive.
Okay, so warning: a little bit of math coming up here. Remember, it’s not that hard actually. Remember
the product rule of probability: P of X and Y is P of Y given X times P of X. But by symmetry I could
equally well write it as P of X given Y times P of Y. I’ve just applied the product rule twice to this joint
distribution. Now if I divide through by P of X, what I get is this. It’s called Bayes’ theorem: P of Y given
X is P of X given Y, times P of Y, divided by P of X. It’s a way of reversing a conditional probability.
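Written out as equations, the derivation being described is just:

```latex
P(X, Y) = P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)
\qquad\Longrightarrow\qquad
P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}
```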
The reason why this is so fundamental to us is the following. Suppose we’re interested in the quantity Y;
Y might be the skill of the player, the thing we’d like to know but don’t. We’ve expressed our uncertainty
in terms of a distribution P of Y. Along comes some data X that’s relevant. What we need to do is
compute the likelihood, P of X given Y, and multiply it by the prior; this thing in the denominator is just a
normalizing constant. What we get is the posterior distribution: a new distribution for Y taking
account of this new data X. If another data point comes along, X prime, we just take this P of Y
given X, multiply it by the new likelihood, get the new posterior, and so on.
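A sketch of that sequential updating in illustrative Python; the first likelihood column is the pistol evidence from the talk, and the second clue is invented purely to show the chaining of posterior into prior.

```python
# Sequential Bayesian updating: yesterday's posterior is today's prior.
def update(prior, likelihood):
    """One Bayes step: posterior proportional to likelihood * prior."""
    unnorm = {h: likelihood[h] * p for h, p in prior.items()}
    z = sum(unnorm.values())            # the normalizing constant P(data)
    return {h: p / z for h, p in unnorm.items()}

belief = {"Butler": 0.20, "Cook": 0.80}                   # prior
belief = update(belief, {"Butler": 0.80, "Cook": 0.05})   # pistol found
belief = update(belief, {"Butler": 0.90, "Cook": 0.50})   # a second, invented clue
print(belief)  # each observation just reuses the previous posterior as prior
```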
What we’ve seen in the murder mystery example is that there are three phases to solving a
problem in this model-based framework. The first stage is to build a model. Now I can be a little more
precise about what I mean by a model: by model I mean a joint probability distribution over
all the variables of interest in your application. A convenient way of doing that is to express it as a
factor graph. It’s not the only way, and not the most general way, but for many applications it’s sufficient.
That’s the first stage: build a model.
The second stage is to incorporate your observed data. Set known variables to their known values:
they cease to be random variables and become fixed to their known observed values. That’s like our
training data. Then the third stage, and this is where all the computational grunt comes in, is that we
have to do inference. What inference means is that we have to update the distributions over the
variables that we care about. We saw that when we updated the distribution over the Culprit once we
knew what the Weapon was. Again, we’ll see lots more examples as we go through.
Now if we’re in a real-time scenario we can simply iterate steps two and three. We’ve done
some inference; we now observe some more data; we incorporate those new observations and we do
some more inference. All the time our probability distributions are evolving, reflecting our improved
understanding of the world, or the computer’s improved understanding of the world. That’s what
learning means in this context. Learning here means the computer is updating its probability
distributions, which quantify uncertainty, in light of the data it’s received.
Another thing we can do, if the domain changes or somebody wants to ship version two and it’s
more complicated, is simply extend the model as required for our particular application,
perhaps by adding some more variables and some more factors.
>>: Some [indiscernible] question. [indiscernible] building the model up, like if you happen to have, say,
two thousand variables and we don’t exactly know how they interact with each other. Is there an
algorithm or computational way to [indiscernible] the models in there?
>> Chris Bishop: Okay, it’s a great question. I’ll come back to it at the end if I may. But the
question was: what if we’ve got thousands of variables and you don’t exactly know how they relate to
each other? Generally speaking, you know something about your problem domain. I
think once you’ve seen some specific examples you’ll see what I mean by a typical graph. You’ll
think of your application and go: golly, I can begin to see how this works.
In extreme cases you may not be sure. There may be some uncertainty in the model: should it be like
this, or should it be like that? Well, you know what to do. If you’re uncertain about something you
quantify your uncertainty using probabilities. You allow both models to coexist. You might have an
additional variable which says whether the truth is model A or model B. You put a prior distribution
over that, you run the whole thing through your inference algorithm, and you get a posterior
distribution saying: the data says I’m ninety-eight percent sure it’s the right-hand model, okay.
I’m pretty much done with the lecture. What I’m going to do now is show you a demo, and then we can
have time for questions. The demo I’m going to show you is what we call the Movie Recommender
Demo. It’s a little demo that we actually built for a public exhibition. I guess it was last year:
the three hundred and fiftieth anniversary of The Royal Society, which had a huge exhibition in London
on the River Thames, on the South Bank. We built this as an interactive demo for people to play with, to
try to convey the basic ideas of machine learning from this model-based perspective. The demo
was a failure in a sense: it recommends movies, and people loved it so much we couldn’t pry them
away because they wanted to know which movie to watch next, so it was kind of hard to
explain the machine learning. But hopefully you won’t suffer from that problem today.
The engine behind this is something called Matchbox. Matchbox is built on Infer.NET and is used
for large-scale recommendation applications. What we’ve done, though, is put a simple demo frontend
on it, so I can use it to explain the idea of probabilities. The system has already seen some data; I
think the data came from Netflix. There are a hundred movies here; the database has
more movies than this, but for these hundred movies we have hundreds of thousands of ratings
from tens of thousands of people. The system has already seen that data.
What we’re going to do now is a bit of personalization; it’s going to customize to my movie preferences.
Now, in a real recommendation system, and in Matchbox itself, you know a lot about the movies, and you
might know something about the user: the gender of the user, or the age of the user.
For each movie you might know that it’s a romantic comedy or an action adventure, and so on. Already,
from known population correlations between features of the users and features of the items, which
here are movies, you can make recommendations out of the box.
Purely for the purposes of this demo we’re not using any of that information. So for this demo each
movie is just “movie one hundred and twenty-seven”, and I’m just a new user; the
system knows nothing at all about me. We’re just going to use collaborative filtering. I’m going to
watch movies and I’m going to say I like this one, I don’t like that one, and it’s
going to combine that information with the likes and dislikes taken from that database.
Just to prove a point, we’re not going to use any of the features. But Matchbox itself, which is described
by a factor graph and uses inference to make these recommendations, can use both features and
collaborative filtering. This is a nice example of the avoidance of ad-hoc solutions which I
mentioned earlier. Out of the box we know nothing about the user, or we’ve got a new movie and
we know nothing about the movie, in the sense that we have no ratings for it. Then the only
thing you can use is features, right?
I like action movies and this is an action movie, so the chances are I’ll like it. But once you start to see
ratings from an individual user you can start to tune or customize the recommendations.
After I’ve rated thousands of movies you want to base it mainly on those ratings, not on
the features. You’ve got to gradually fade from initially using features to giving more and more
weight to the ratings as you see more and more of them.
That happens automatically in Matchbox, just because of the sum and product rules of probability. You
don’t have to code it in, in some ad-hoc way. Okay, but for this demo it just knows about
ratings. I have to find the cursor. Okay, so initially it knows nothing at all about me. Let’s
say I’ve watched a movie, let’s say I watched Pretty Woman. What I’m going to do is drag that across
into the green area, which tells the computer I’ve watched the movie and I liked it.
What it’s done now is arrange all the other movies on the screen in a particular way. Now the vertical
position on the screen is irrelevant; we’ve just spread them out vertically to make them easier to see.
What matters is the horizontal position. The horizontal position of each movie is the probability that the
computer thinks I will like that movie. If a movie is up against the right-hand side, that’s a probability
of one: the computer is certain that I’m going to like that movie. If the movie is down the left-hand side,
that’s probability zero: it’s certain that I’m not going to like the movie. Movies down the middle are
fifty-fifty.
Now, at the moment, what does it know? It’s got tens of thousands of people and hundreds of thousands
of ratings. But as far as I’m concerned, all it knows about me is that I liked Pretty Woman.
Already that’s enough to start making recommendations, because people who liked Pretty Woman also
liked certain other movies and hated certain other movies, so it can already assign a probability to each of
these movies.
But what you’ll notice is that they’re clustered around the middle. Most of the movies are
near the middle; there’s a lot of white space down the left and right. After all, all it knows is that I like this
one movie. It hardly knows anything about me, so it’s pretty uncertain about which movies I’ll like and
which ones I won’t. Let’s carry on and give it some more information. Let’s say I watched another
movie, and say I didn’t like that one. Even after just two movies you’ll see what’s happening: things are
spreading out, towards the right and towards the left. They’re moving towards
zero and one. The system is now a bit less uncertain about which movies I’ll like and which I won’t.
That’s what learning means in this context: a reduction in uncertainty as a result of seeing data. It’s
intrinsically online, because I can just carry on and give it more examples of movies that I like and movies
that I don’t like. I keep losing the cursor because it’s a dual screen; there we go. Let’s pick
another movie that I don’t like; again there’s some rearranging, but generally things are
spreading out. Or maybe I do like that movie; again we get different results. You can play with
this all day. But look, even after just four examples, two that I like and two that I don’t like.
You can see now what’s happening. There’s now a lot more white space down the middle, and a lot of
movies are crowded down the sides. Nothing’s right up against the side; it can’t be certain that I’m not
going to like a movie. But it’s pretty confident that I’m not going to like these movies, and pretty
confident that I will like these. The ones down the middle it’s just completely fifty-fifty about
whether I’m going to like them.
I’ll just illustrate one more point. Let’s take a movie right down the right-hand side.
Oh, sorry, yeah?
>>: What I’m not hearing is how the model captures the fact that somebody who likes this
wouldn’t like that other movie. You said it doesn’t take into account the [indiscernible] movie type
and stuff like that. What prior information does it have?
>> Chris Bishop: Okay, I think the question was basically: how does it work? I haven’t really
explained how it works; the question is, with no metadata, how can it know what’s going on?
Really, the intuition is that if there was somebody else in the room who liked Pretty Woman and liked
Chicago, and they hated The Sound of Music and they hated Elf, and I asked them,
well, what did you think about Closer, and they say oh, great movie; well, I seem to
be like that person, and so maybe I too will like Closer. That’s kind of the intuition.
>>: [inaudible] people this before you showed it at the exhibition? A bunch of people, like
human judges and numbers?
>>: It’s based on that point.
>>: Oh, alright.
>> Chris Bishop: Before the lecture, when we built the demo, we already, you know, “trained” it, in
inverted commas, on a database. I think it’s Netflix data, where we’ve got hundreds of thousands of
ratings from tens of thousands of people. You can imagine a sort of big matrix of movies and
people. It’s a sparse matrix, but there are entries: occasionally there’s an entry where somebody said
like, or an entry where somebody said dislike.
It’s kind of like the intuition that people who have similar taste to me will like the movies that I will
like in future. But what we haven’t done is code up that intuition, because there are lots
of ways of doing it, and how would you know which to choose? It’s an ad-hoc mess. Instead we built a
model which describes the relationship between these variables, expressed probabilistically. Are you
showing this?
>>: I actually show the [indiscernible] model in factor graph form this afternoon.
>> Chris Bishop: You show them the code. Okay, so this afternoon John Bronskill’s
going to show you the factor graph for how this works, and probably some Infer.NET code
for it as well.
I’m just going to show you one more thing. Let’s take a movie down the right-hand side. That’s a movie
that the system is very confident I’m going to like. I’ll just pick one of them. Let’s say I’ve watched
that movie, and let’s say I do like it. Watch what happens to the other movies when I let go of the mouse
button. You know, not a lot, okay. It was really confident that I was going to like the
movie; I said yep, I like it. It hasn’t learned very much, because it kind of knew that already.
Let’s do the other extreme. Let’s take a movie that’s down the left-hand side. Now here’s a movie it’s
really confident that I’m going to hate. Let’s say, naturally, I watched that movie and actually I
liked it. I’m going to drag it across to here. Now look what happens when I let go of the mouse
button.
Okay, that was hugely informative. In fact, information theory defines information as the degree of
surprise. Towards the right-hand side, the amount of surprise in saying I like that movie goes to
zero; towards the left-hand side it goes to infinity. There’s much more information in telling it something
surprising that it wasn’t expecting than in telling it something it already knew.
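A tiny illustration of that surprisal measure, the negative log probability being alluded to here:

```python
import math

# Information as "degree of surprise": surprisal = -log2(p) bits.
for p in (0.99, 0.5, 0.01):
    print(p, -math.log2(p))  # ~0.014, 1.0, ~6.6 bits

# An outcome the system already expected (p near 1) carries almost no
# information; a near-impossible outcome (p near 0) carries a lot,
# and infinitely much in the limit p -> 0.
```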
>>: [indiscernible] does the source of the error and the noise affect your factor model a lot?
>> Chris Bishop: The question was, the error and the noise, do…
>>: The noise, if I actually don’t like it [indiscernible] made a mistake.
>> Chris Bishop: Right, great question. The question is: could this be affected by noise, could it be
affected by mistakes? It could be affected by mislabeling. You know, somebody watched the movie,
they really loved it, and they’re in a big hurry and they click dislike. Then they carried on and didn’t
notice, or the system wouldn’t let them change it, or whatever. We can model label error and so on.
Generally speaking, the answer is this. When somebody comes along and says oh, this is all very well, but
in my application domain it’s different, because in my application domain I have users who make label
errors, right, they occasionally flip the label; I say great, you’ve just told me how to extend the model for
your domain. In your domain there’s a label error, so we’ll put in a label error variable.
What we might have, if you like, is the true variable, which we can’t observe. What we actually observe is
the label the user gave; that’s a noisy version of the truth, right. We have a little conditional probability
table that says: ninety-nine percent of the time the label they give is the thing they meant to give, and one
percent of the time they flipped it. Okay, so what we do is just model it, and then the rules of
probability will do the right thing. Yeah?
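An illustrative sketch of that label-noise table as a factor; the ninety-nine/one split is the number from the talk, and the 0.9 belief below is an assumed example.

```python
# Label noise as a conditional probability table: the observed label is
# a noisy copy of the (unobserved) true opinion.
p_flip = 0.01
p_obs_given_true = {            # P(observed | true)
    True:  {True: 1 - p_flip, False: p_flip},
    False: {True: p_flip,     False: 1 - p_flip},
}

# Suppose the model currently thinks the user truly likes the movie
# with probability 0.9; what is the chance we *observe* a "like"?
p_true_like = 0.9
p_observe_like = (p_true_like * p_obs_given_true[True][True]
                  + (1 - p_true_like) * p_obs_given_true[False][True])
print(p_observe_like)  # 0.892 -- the noise model is just another factor
```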
>>: Do you mean that in that case you’re actually doubling the number of variables? Because you
want one for the actual state, and the other one is the state you observe, right?
>> Chris Bishop: The question was, am I doubling the number of variables? Do you mean in the little
example I gave of the noisy label…
>>: Yes.
>> Chris Bishop: Yes you’re introducing extra variables, yes. Yeah?
>>: This will never be [indiscernible] the system actually say I am using [indiscernible]. [indiscernible]
it turned out actually two people are the users and they have very different tastes, [indiscernible] one
user. Is there a way to quantify that your model is wrong? Like, I have seen so much data and I’m still
not able to learn anything, still not able to successfully predict something. Is there a way to quantify
that this model is wrong and we should reconsider it or not?
>> Chris Bishop: Okay, I’m not sure I understood exactly that scenario. But I think the question was:
can I quantify the fact that I may have the wrong model? Let me give a slightly general answer to that,
which is about model misspecification. What you’ve done is write down a model of the world;
what happens if that model is wrong? That’s a very general question in machine learning, and it
applies equally to traditional methods and to model-based methods.
If you use a linear regression system, your model of the world is a linear model. If the world is
highly non-linear, then you can get very wrong answers. If you make some assumption about the world
that’s badly violated, then you can make arbitrarily bad predictions. That’s true in any approach, and
it’s certainly true here as well. What you can do, though, is allow for the fact that the world may be
more complex than you thought. It might be that there are other processes going on; we’ve had a
couple of examples, label noise and so on. You can model those causes of
misspecification if you can anticipate them.
Maybe, to come back to the earlier point, you’re not sure whether you should include a
particular effect or not. I don’t know whether my users are flipping the labels ten percent of the time,
or whether actually they’re all completely correct. You can do model comparison. A very nice way of
doing model comparison, and this is how you do it in Infer.NET, is to have one model
represented by a graph over here and another model represented by a graph over there.
Now you construct a sort of über-graph where you have a switch. The switch switches between the
models, and it’s a little binary variable that says model A or model B. You put a prior
distribution on that, maybe fifty-fifty because you’re not too sure, maybe sixty-forty, whatever it is.
That’s now your model: one big model containing the two sub-models. Now you do the second step,
which is to condition on the data: you observe the data and run inference. What you get is inferences
made by model A, inferences made by model B, and a posterior distribution over which is the right
model.
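An illustrative sketch of that switch construction; the two marginal-likelihood values are made-up stand-ins, and this is not the actual Infer.NET mechanics, just the arithmetic of the idea.

```python
# Bayesian model comparison: a binary switch variable selects model A
# or model B; its posterior is driven by how probable each model finds
# the observed data (the marginal likelihood).
prior = {"A": 0.5, "B": 0.5}

def marginal_likelihood_A(data):
    return 0.02   # stand-in: probability model A assigns to this data

def marginal_likelihood_B(data):
    return 0.001  # stand-in: model B finds the same data much less likely

data = ["observations"]
unnorm = {"A": prior["A"] * marginal_likelihood_A(data),
          "B": prior["B"] * marginal_likelihood_B(data)}
z = sum(unnorm.values())
posterior = {m: p / z for m, p in unnorm.items()}
print(posterior)  # ~{'A': 0.95, 'B': 0.05}: the data favors model A
```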
>>: You are [indiscernible] by the user; if there’s a new movie, can it also handle that?
>> Chris Bishop: The question is: a new user, a new movie, can it handle it? Yes. What’s
essentially going on in here? We’ll look at the factor graph this afternoon, I guess. There’s a very
general, or at least quite widespread, technique called matrix factorization: you take that big, sparse
matrix and try to represent it in a low-dimensional space.
Matchbox is, if you like, a probabilistic version of matrix factorization. What’s going on is that we have
some low-dimensional latent space, you know, five-dimensional or ten-dimensional. The users are
mapped down into that space, and that mapping is one of the things we learn: there are parameters
governing that mapping, and there are prior distributions over those parameters.
There’s another mapping from items down into that latent space, again governed by a bunch of
parameters, and again there are distributions over those parameters. We have some notion of affinity
or closeness, some metric within that space, which represents the alignment between the
vector of the user and the vector of the item: whether a user tends to like particular items and
dislike other items. Two users that are very similar will have vectors that are quite closely
aligned; users that have very different tastes will be far apart in that space. Again, all of that is
represented by probability distributions expressed as a factor graph. Yeah?
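A rough sketch of that matrix-factorization intuition; real Matchbox maintains full distributions over these latent vectors rather than point values, so this only shows the shape of the idea, with all names and numbers assumed.

```python
import math
import random

# Users and items live in a shared low-dimensional latent space; the
# affinity is (roughly) an inner product squashed to a probability.
random.seed(0)
K = 5  # latent dimension

def random_vector():
    return [random.gauss(0.0, 1.0) for _ in range(K)]

users = {"alice": random_vector(), "bob": random_vector()}
items = {"Pretty Woman": random_vector(), "Elf": random_vector()}

def p_like(user, item):
    affinity = sum(u * v for u, v in zip(users[user], items[item]))
    return 1.0 / (1.0 + math.exp(-affinity))   # squash to (0, 1)

print(p_like("alice", "Pretty Woman"))
print(p_like("alice", "Elf"))
```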
>>: If you have a very large dataset, what would the sensitivity be? If you have a new user, because
it’s all dependent on the influence of other users on that user, is there a point where, whatever you do,
you really do not get the personalization that you really want from the result?
>> Chris Bishop: Okay, so I’ll repeat the question and tell me if I’ve got it right. The
idea is, supposing we’ve had a million users or a billion users, and along comes little old me, the
billion-and-first user. Isn’t my data going to be completely swamped by the data from those
billion users? Would I have to rate a billion movies before it even starts to personalize to me?
Absolutely not. Again, you construct the model the way you want to construct the model. We’re going
to have a look at a nice example of personalization this afternoon in the context of email, so we can
talk a little bit more about that then.
Again, it does the right thing. It’s making the right tradeoff between community predictions
and personalized predictions. It starts off, out of the box, with the community prediction,
because it knows nothing about you, and then makes appropriate adaptations as it gets to know more and
more about you. That tradeoff, gradually fading out the effect of the community and fading in the
effect of your personal data, happens automatically from the sum rule and the product rule of
probability. You don’t have to think, I’ll make it one over T or something. It just
happens.
>>: How does this differ from the [indiscernible] hashing or multi-task learning kind of
methods? Like, you know, we saw that I think yesterday. Is there
an edge for this model over that one? How does it relate to other traditional learning methods?
>> Chris Bishop: I’m going to confess, I’ve only just flown in and I haven’t listened to all of the
other lectures. What we can maybe do is take that offline; there’s also an email list around this, so I
may have a chance to watch the lecture and give you an answer. I’m very happy to
make points of comparison between how we tackle a challenge in this approach versus some more
traditional methods. But I wasn’t familiar with the particular piece of jargon you used, so I’ll
defer that, if I may, until offline. Yeah?
>>: [indiscernible] your take on selecting priors for the model?
>> Chris Bishop: Take on how to select priors?
>>: Yeah.
>> Chris Bishop: Okay, so this is a gentle introduction, alright, and there's more to come, so hopefully in the next lecture you'll get some insights into how we select priors. I've used the terms prior and posterior because they're commonly used and they're one way to think about these things. But I tend not to think too much about priors and posteriors, just about models and distributions and their relationships.
It's really about building a model of the world. Now, in the next lecture you'll see some nice examples. In the murder mystery I just pulled those numbers out of thin air. I just said, well, okay, the Butler's more likely to have done it than the Cook. Oh sorry, the Cook's more likely than the Butler; I'll make it eighty percent Cook and twenty percent Butler. Why eighty percent? Why not eighty-five percent? Why not seventy-five percent?
Quite often we have distributions, and the distributions have parameters, and we don't know what value to set the parameters to. Well, we know what to do, right? We model the uncertainty in those parameters by using random variables which themselves have distributions. That leads naturally to hierarchical models.
We have distributions with parameters. Those parameters are uncertain; we want to learn them from data, not hand-craft them, so we put distributions over those parameters. But those distributions themselves have other parameters, which we might call hyperparameters, so we have to stop that hierarchy at some point. The answer, in an engineering sense, is to just look at some of the examples. I think the best thing to do is look at a dozen examples of the applications, and you'll get a flavor for how we build these models.
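For concreteness, a minimal generative sketch of such a hierarchy in Python; the distributions and numbers are purely illustrative, not taken from any of the applications discussed:

import numpy as np

rng = np.random.default_rng(1)

# Level 2: hyperprior over the scale of the prior. This is where the
# hierarchy is cut off, with a broad, weakly informative choice.
prior_scale = rng.gamma(shape=2.0, scale=1.0)

# Level 1: the parameter is drawn from a prior whose width we did not
# hand-craft; it is itself uncertain and can be learned from data.
theta = rng.normal(0.0, prior_scale)

# Level 0: the observed data are generated from the parameter.
data = rng.normal(theta, 1.0, size=10)
print(prior_scale, theta, data.mean())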
I guess the more philosophical answer is that in machine learning, no matter how you approach it, you can't learn anything purely from data. There's some fundamental mathematics in machine learning that says you can't do this. You can only learn in the context of assumptions, or a model, or prior knowledge, or background information, call it what you like. You assume something about the world.
Sometimes those assumptions are quite general. You might assume that the output varies continuously with the input, or that it varies smoothly. If you model it with a neural network and you put a regularizer on the weights, you're saying: I don't think the output is going to vary too much if I change the input a little bit. Okay, that's a form of prior knowledge. You might have much stronger prior knowledge. You might say the output is a linear function of the input with some Gaussian noise or something. Okay, that's a very strong form of prior knowledge.
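For reference, the standard identity behind the regularizer remark, written in LaTeX and stated for a linear model with Gaussian noise (the neural-network case is analogous); here $\sigma$ is the noise scale and $\alpha$ the prior scale:

\begin{align}
-\ln p(\mathbf{w}\mid\mathcal{D})
 &= -\ln p(\mathcal{D}\mid\mathbf{w}) - \ln p(\mathbf{w}) + \text{const} \\
 &= \frac{1}{2\sigma^{2}} \sum_{n} \bigl(y_{n} - \mathbf{w}^{\top}\mathbf{x}_{n}\bigr)^{2}
    + \frac{1}{2\alpha^{2}} \lVert \mathbf{w} \rVert^{2} + \text{const},
\end{align}

so maximizing the posterior is least squares plus an L2 weight penalty with strength $\lambda = \sigma^{2}/\alpha^{2}$: the regularizer is a Gaussian prior in disguise.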
Roughly speaking, if you constrain the world very tightly you get a lot of juice out of your data; you'll learn a lot from each data point. That's good news provided your assumptions map to the real world. If they don't map to the real world, if you assume the world's linear and it's non-linear, then not only can you make bad predictions but you can even be very confident about those bad predictions. That's not good.
The other extreme is if you make almost no assumptions about the world. You have very, very flexible models. In the traditional paradigm you hit a major headache; it's called overfitting. Supposing I've got a hundred data points and I say, well, I don't want to assume anything about the world, I just want the data to tell me everything. I'm going to fit a hundredth-degree polynomial to my hundred data points, and my hundredth-degree polynomial is really flexible. It can model all sorts of different things.
What happens is you just tune to the noise on the data, because in the traditional methods you're typically optimizing parameters and making point estimates of parameters. You can overfit to the data. That's one of the challenges that you face. There are of course ways to deal with overfitting in that traditional framework; I'm sure you've heard about them this week.
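A ten-line Python demonstration of that failure mode, with made-up data (the sine curve and noise level are arbitrary): the high-degree polynomial, fit by a point estimate, drives the training error toward zero while the error on fresh data blows up.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)  # noisy truth

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)  # maximum-likelihood point estimate
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # Fresh data from the same process exposes the overfitted model.
    x_test = rng.uniform(0, 1, 200)
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=200)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")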
In this framework there isn't really overfitting. Overfitting is a pathology that arises when you make point estimates. What happens in this domain is that I have a very flexible model with very broad distributions. When I observe data my distributions get narrower and I learn something about the world. But I still have a lot of uncertainty. I'm hopefully still encompassing the truth, but not very confident about it.
If we get time, and I don't think now is quite the right minute, we'll probably say a little bit more about this. But one of the nice things in a probabilistic model is that you can get the data to choose between different models for you. I mentioned how we can do this in Infer.NET with a little switching variable.
Let me try and give you some intuition behind this with zero math. It's sort of like this: imagine I've got three models. The first model is very rigid; it imposes a lot of prior knowledge. The world's just linear, right, a very restrictive model. I've got a second model which says, well, it could go up and down a little bit, but one or two oscillations is fine. The third model is hugely flexible. It can go up and down a million times and have step functions. It can be [indiscernible] to all sorts of things, right, a very flexible model.
If you simply fit those three models to the data by optimizing the parameters, then you'll always favor the most complex model, because it will just tune to all the data; it will fit the data points exactly. You'll overfit and it'll say it's the third model. You can't choose the model by tuning or optimizing the parameters. What you have to do instead is divide your dataset in two, train on one half, and then see how predictive it is on some holdout data. Okay, that's the standard technique and it's widely established.
In this probabilistic setting, let's suppose that the truth is actually the middle model, okay. The real world sort of goes up and down a little bit but not too much. Let's look at the posterior probability of the three models. I'm uncertain about which model, so I'm going to keep all three models. I'm going to have a prior probability of a third, a third, a third.
What's the posterior probability of those models when I run inference? Well, the first model, the rigid model, will have a low probability; it basically can't fit the data. The data's going up and down, and it's trying to fit it with a straight line. There's a really bad mismatch between the data and the model, so that model has a low posterior probability.
The middle model does a lot better because it can fit that data beautifully. Okay, so it's got a much better data fit. The third model can fit the data just as well; in fact it can fit the data even better, because it can tune to all the little noise. But the problem is this: remember, all the probabilities add up to a hundred percent, so each of these models has got a unit amount of predictive probability which it can spread around.
The first model has piled all its predictive probability on just straight lines, and none of them fitted the data, so that was bad. The middle model spread its bets a bit more. It said, well, it could be a straight line, it could go up and down a little, or down and up a little bit. The probability which it actually assigns to the data is high, because it can assign a decent amount of probability to the actual data you observe.
The third model can explain the data perfectly. But its unit amount of predictive probability mass has been spread incredibly thin, because it predicts straight lines, things which go up and down a little bit, things that go up and down a lot, things which have steps, and so on. The amount of probability mass that it assigns to the observed dataset is actually lower than for the middle model. The posterior probability peaks around the middle model.
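This intuition can be checked numerically. For Bayesian linear regression the marginal likelihood (evidence) is available in closed form, so the following Python sketch scores three polynomial models on training data alone; the data, priors, and noise level are all invented for illustration, and with these settings the middle-complexity model typically wins:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 15)
y = 1.0 - 2.0 * x + 0.8 * x**2 + rng.normal(0, 0.1, size=x.size)  # "middle" truth

def log_evidence(degree, prior_var=1.0, noise_var=0.01):
    # Marginalizing the weights of y = Phi w + noise, with w ~ N(0, prior_var I),
    # gives y ~ N(0, prior_var * Phi Phi^T + noise_var * I).
    Phi = np.vander(x, degree + 1)
    C = prior_var * Phi @ Phi.T + noise_var * np.eye(x.size)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y) + x.size * np.log(2 * np.pi))

for degree in (1, 2, 9):
    print(f"degree {degree}: log evidence {log_evidence(degree):.1f}")
# The rigid model cannot explain the data; the flexible model spreads its
# unit of predictive mass too thinly; the middle model scores highest.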
Okay, that's the probabilistic model. That's without having to use any holdout data; that's on the training data, okay. Now you're saying at this point, ah, who cares, I've got a big computer. What's the problem with training on my training data and comparing on a test set? Well, there's nothing wrong with that. In fact you should always do it no matter what method you're using. Before you ship something to the customers I strongly recommend you test it on some of your data, right? That's just good sound engineering.
The thing is, suppose it's not just one thing you're trying to tune. If you're just trying to tune a regularization parameter, sure, I could run it ten times with ten different regularization parameters on my training set, compare the ten models on the test set, and pick the best. But what if I've got a million data points and I've got one regularization parameter per data point? I've got a million parameters to tune. I can't run all the different combinations of a million parameters and keep testing them. I have to be able to learn the parameters and the hyperparameters from the training data. That's where some of these methods can be very powerful.
That was a very long-winded answer to a short question, but there we are.
>>: [indiscernible] the question he asked [indiscernible]. We had a very similar [indiscernible]. We had a cold start problem: an ad that you've never seen, how do you surface it? I'm just trying to understand what we could have done differently, because people are breaking their heads over that particular problem.
>> Chris Bishop: Okay, so I think this question is about your specific application, a particular issue that you've encountered. My experience with these questions is that I'm going to have to ask you to expand on it, and then you and I are going to have a half-hour discussion at the whiteboard, which I'd love to have, but that's probably not practical here. That's quite a specific question. Maybe we can talk or exchange email and discuss that particular question. The question was to do with cold-start problems, I think, in an advertising situation. You've already thought about it very hard; it's obviously a hard problem, and it probably doesn't have a ten-second answer. We should discuss that offline, I think. Maybe last question.
>>: Yeah, actually I have two. [indiscernible] specific [indiscernible], like how many users do you need to have to get a reasonably good result? The second one is that the graph could be very large; is that a good idea even if there's a certain way to control the size? Like, each [indiscernible] may have many connections, and those connections have further connections. Can you reduce the [indiscernible] and make [indiscernible] faster?
>> Chris Bishop: Let's just deal with that second part first then. The question was: if the underlying graph, which we haven't [indiscernible] yet for this recommender system, becomes very large, what was the question then, what's the problem?
>>: Oh, so maybe, let's say one node has many parents and many children, but maybe there are some connections between the parents and children individually, so maybe you can reduce the number of parents and reduce the number of children somehow?
>> Chris Bishop: Okay, so there are lots of design issues in terms of designing the graph. Again, can I give you a general answer rather than a specific one? I think one of the advantages of this framework, as I said, is that we have to make assumptions. You cannot do machine learning from data alone; there's data in the context of some assumptions. One of the nice things here is that you make the assumptions very explicit. In fact you're forced to make them explicit, because step one is to write down your model. You really have to come clean and tell the world what you believe about the world. It can't be buried away implicitly in the training algorithm somehow. Right, it has to be in the model. You have to make your assumptions explicit in the model.
Part of those assumptions has to do with the structure of the graph. Again, that comes back to the points we made earlier: really, this is the domain knowledge that you're encoding. But sometimes, and I think you were talking about skip-level connections between parents and children of other variables, are they present or are they not present? Again, one thing you could do is just compare the two models. If you're genuinely uncertain you could look at both models and see which one the data prefers.
Whoops, so going back to that first part, what was the first question again?
>>: [indiscernible] wonder like to do [indiscernible] a model like this [indiscernible]. How many training
[indiscernible]?
>> Chris Bishop: Lovely question, how many data points do I need to get good performance on this
application or any other application? You know the answer is forty-two.
[laughter]
The answer is I don’t know.
[laughter]
No, it's so dependent on the problem, right. I mean, there's no magic, alright. The answer is it will depend upon the particular application that you have. But the nice thing about this framework is that if you only have a little bit of data you can still use your full model. You don't have to adjust the model to the amount of data that you have. That's a funny thing to do: I've only got a hundred data points, so I'd better only fit three parameters, otherwise it will overfit.
Now hang on a bit. If you really think the world has a hundred and twenty-seven parameters, surely that's how you should model the world. Okay, so you can use your full-blown complex model that you really think describes the world even when you've only seen a few data points. That's fine; you won't have any overfitting. What you will have, of course, is still a lot of residual uncertainty. If you've only seen three data points, again, there's no magic here. After a couple of movie ratings it knows something about me; it doesn't know everything about me, and it's still quite uncertain.
I was told to allow twenty minutes for the break. We'll take a break now, and then we'll go a little bit deeper into this topic starting at eleven o'clock.
[applause]