>> John Langford: Okay, let's begin. This is the machine-learning class. The plan for the week
is to teach everybody machine learning, and the first question, of course, is what is machine
learning? Mute. Okay, so if you go and you look at what a lot of modern machines look like
when they're actually running, you look at what's in their RAM, you see that there's some portion
that's compiled interpreted code and some portion that is actually just derived from data, and this
is really what machine learning is about. It's about trying to use data to make useful predictions.
The amount of RAM being used for data-derived predictions versus compiled or interpreted
varies greatly from machine to machine, but there are some machines where a large fraction is in
fact data derived. Okay, so now it's useful to think about examples of machine learning, and
there's quite a few examples, many of them at Microsoft. So search engine results are driven by
machine learning to a large extent. Interesting thing is I understand that Google search results
are not quite as driven as Bing is. There's a guy who's pulling a Paul Bunyan, and eventually, I
expect he will break. Kinect uses machine learning to recognize poses of people. It's pretty cool.
Spam prediction is the example of machine learning that you can tell everybody, including your
mother. Automated arbitraging, when you see what's happening on Wall Street, there's a lot of
machine learning going on there. And then I have a friend who's actually working for the
Obama campaign, trying to figure out how to best persuade people to vote for Obama with
machine learning. Okay, so these are just examples of machine learning. It's helpful to
understand what are the characteristics of machine-learning problems, and there are several basic
paradigms of machine learning which exist right now. The one that we're going to focus on for most of the class is supervised machine learning. So in supervised machine learning, what
happens is you pick a problem, where you can't program the solution very easily, so where you
fail due to case complexity. So an example of this would be spam prediction, so in spam
prediction, you could of course program something which says, if the email has the word Viagra,
then reject it. Turn it into spam. But sometimes emails with the word Viagra actually aren't
spam, and sometimes, there's Cialis instead of Viagra. And if you start thinking about this
carefully, what happens is, you come up with exceptions, and then you come up with exceptions
to the exceptions, and exceptions to the exceptions to the exceptions and you go crazy, trying to
program these things. So machine learning is what can help in these kind of situations. If there
are more straightforward solutions, then of course you should use them, but when you start
getting this incredible case complexity, and no obvious program can actually do what you want it
to do, then it's time to start thinking about a data-driven solution. So once you have this
problem, you need to -- in classical supervised machine learning, you hire people to label input
and output example pairs. So you're going to be trying to learn a function which predicts, given
the email, is this spam or not? And the classical solution would be that you hire people to label
things as spam or not. And then you use your learning algorithm to find this function which
maps the input to the output. And there's lots of different learning algorithms, so we'll be
discussing several of them, but in general, all they're doing is trying to find this mapping from
input to output. And then, of course, you apply your learned function to solve the problem,
wherever it happens to exist. Are there questions? Yeah.
>>: How do you assure the quality of the labeling? Because different people might label things
differently, depending on how much [indiscernible]?
>> John Langford: How can you assure the quality of labeling? That's a tricky question in many
situations. Often, things get labeled multiple times, and you get a sense of the quality of the
labeler. This comes up to a much greater degree the lower the expertise of the labeler. So in
particular, if you're using Mechanical Turk to label things, which is fairly common nowadays,
there's a lot of tricks people do to try to weed out the bad labelers and keep the good labelers, and
so forth. Okay, so this is classical supervised machine learning. This is what existed even 20
years ago. It's been slowly improving over time, and it gets routinely applied in many situations.
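[Editor's note: a minimal sketch of the classical supervised pipeline just described -- labeled input/output pairs, a learning algorithm that finds the mapping, and application of the learned function. It assumes Python with scikit-learn, which is not one of the tools covered in this class (TLC and VW are); the tiny email dataset is invented for illustration.

# Minimal supervised-learning sketch: learn a function mapping
# email text -> spam (1) / not spam (0) from labeled examples.
# Assumes scikit-learn is installed; the dataset is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

emails = [
    "cheap viagra available now",
    "meeting notes attached for review",
    "cialis discount limited offer",
    "lunch at the commons tomorrow?",
]
labels = [1, 0, 1, 0]  # hypothetical human-provided labels

# Turn text into feature vectors, then fit a linear classifier.
vectorizer = CountVectorizer().fit(emails)
model = LogisticRegression().fit(vectorizer.transform(emails), labels)

# Apply the learned function to a new input.
print(model.predict(vectorizer.transform(["discount viagra offer"])))  # likely [1]
]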
There's also interactive machine learning. So interactive machine learning, the learning
algorithm starts controlling which inputs it learns from, and that introduces another level of
complexity. However, it can do some very cool things. So, for example, when you start hiring
labelers, it starts costing money to label things. And then if your learning algorithm can reduce
the number of things that you need to label by a factor of two or a factor of 10, that can greatly
improve your budget. So there are these systems for automated labeling of various tasks, so
UHRS is the internal Microsoft one, Mechanical Turk is from Amazon, both in Seattle.
Interesting. And then the other thing which happened is, there's a large number of recorded
computer-mediated actions, so what I mean by this is trades on an exchange or search sessions,
or here I displayed a news story, and somebody clicked on it and read it. So this is essentially
free data, free labels, as long as you can actually store the data. This is not quite labeling the
correct output given the input, and this is a tricky process here compared to standard supervised
learning, but this is cheaper and potentially more powerful, and this is much cheaper and
potentially much more powerful, because you can express a lot more complex function with the
data that you gather from all of these recorded computer-mediated actions. Another example of
this which comes up and which is very relevant to OSD is trying to find the [ATA] which most
interests people. You find the [ATA] which most interests people, then a lot of money gets
made. Okay, so in interactive machine learning, this is a newer area. It's been developing over
the last I would say decade, maybe decade and a half. There's typically much more data
involved, particularly for number two, but there's a more tenuous connection between the data
and the solution. It's not like, for this input, I want that output, for this input, I want that output.
It's more like for this input, that output kind of worked. Okay, so that's interactive machine
learning. And there's another branch of machine learning, which is model-based machine
learning. So the goal here is typically to model how the world behaves. So the last day is the
modeling day, and this is a huge subject in general, so this is very broad. The basics will be
covered in this class. This is particularly useful, I think, for hypothesis testing. So if you have a
belief about how the world behaves -- not necessarily a complete belief, but you believe there's
some structure to how the world behaves, this is dependent upon that, and it's independent of
this, given that, then you can create graphical models. You can test these, and if you are correct
in your beliefs about the world, you can get very good predictive ability. This is also the tool
which gets used when you don't know what to do. So if you have some data and you're like,
hmm, what's the input, what's the output? I have no idea. Then, often these modeling-based
approaches are used to try to summarize the data in some manner which is human
understandable, or which can then be fed into a supervised learning algorithm for later
predictions. Okay, so these are sort of three branches of machine learning. Now, I'll discuss the
class itself. So Vijay, right after me, is going to talk about statistics for machine learning. The
first thing that you need to know with machine learning is whether you have succeeded. So you need
to be able to say, I have succeeded, right? And that's essentially what Vijay's talk is about. So
an understanding of basic statistics is necessary to know that you have succeeded, because
machine learning is different in kind. With constraint programming, you have succeeded if you have found the optimum, which you can prove when you're at some corner of some optimization space. With machine learning, you have succeeded when you have good test error
performance, when you can predict things well, and defining well is something which is
inherently noisy, because some problems are just impossible to solve all the time perfectly. Is
this email spam or not? Well, maybe it depends upon the person who's actually receiving it, and
maybe you don't have information about that person which would allow the algorithm to
determine whether or not it's going to be flagged as spam. And that means that there is some
fundamental noise. You can only achieve a small error rate up to some limit, which is going to
be dependent upon the problem. That means it's noisy, and that means that you need to
understand statistics so that you can understand how confident to be in your solution results. The
simplest version of learning -- the simplest representation is linear learning, so Miro is going to
cover this in great detail. And then we're going to finish up with two tools later in the day, so
Matt is going to talk about linear learning and TLC, and I'm going to talk about linear learning
with VW. So at the end of the day, you should understand how we've succeeded -- understand
what success looks like in machine learning, and you should get some familiarity with basic
tools, so you should be able to just apply machine learning, or very simple versions of machine
learning, on your own. Okay, the second day is about a new representation, decision trees.
Decision trees get used in the core search engine at Microsoft, and in general, Microsoft has a
very highly optimized decision tree package, which makes it an attractive algorithm to actually
use. So Rich is going to talk about decision trees, and then Vijay and I are going to follow up
with some more details. When you are doing machine learning, it's often very helpful to have a
lot of different metrics in your learning algorithm. What the learning algorithm does is often a
bit difficult to understand, and so there's a lot of different ways to evaluate what it's doing to help
you to debug if something is going wrong. And then I'm going to talk about the different
learning problem types. When you run into machine-learning problems, almost always, they're
not exactly the simplest machine -- they don't fit the exact setting that is the simplest, and so it's
very helpful to understand the known variations on the basic setting and how to address them.
And then Misha and Matt are going to do a full tutorial on TLC. So at this point, you should
know how to do machine learning fairly well. So then the next three days are on advanced
topics. The third day, Wednesday, is about making things go fast. In general, a lot of machine
learning is about optimizing stuff, and there's a lot of tricks to optimization which can make an
enormous difference in your life. It can be like I wait a day for this algorithm to finish, or I wait
10 seconds. So understanding various tricks to make things go really fast is very helpful for you.
So Alekh will talk about online learning. I will talk about feature hashing. So this is a technique
to improve the speed of optimization. This is a technique to compress the size of your learned
predictor. Alekh will then talk about parallel learning. This is again to speed things up. Then
we'll have Jianfeng and Misha and Matt talk about learning algorithms that can run directly on
Cosmos. Cosmos is OSD's internal combined compute and data cluster. And so it's very
desirable to have learning algorithms which run directly on Cosmos. And then the last talk will
be about clustering. This doesn't quite fit the rest of the day, but it was what we could fit in. So
clustering is a handy technique. It actually fits into the modeling category. Typically, you find
clusters in the data, and then that helps you understand how the data is useful, and it also helps
you extract features, which is essentially the cluster that individual things belong to, which you can
then feed into other learning algorithms. Okay, so the fourth day is about interactive machine
learning. Miro is going to talk about learning and evaluation in this setting. So these are settings
where you're interacting with the user and you're just getting feedback from the data traces that
you actually observe from these interactions. Then, I'll talk about gathering exploration data, and
then Daniel will talk about active learning. Daniel's thesis was on active learning, and it's quite a
good one. Okay, the last day is about modeling. Chris Bishop and John Bronskill will introduce
the basic modeling setting and approach. John is going to talk about Infer.Net, which is one of
the core machine-learning tools at Microsoft, and then Li Deng is going to talk about deep
learning, which is a very fun topic. It lets you do things which we don't know how to do with
other approaches, like some of the really advanced feature recognition applications. Okay, so
this is a typical class schedule. There's a few variations from day to day. The thing which is -- on our break, we have some refreshments, assuming we didn't run out of them this morning. For
lunch, the Commons is recommended. It's nearby. There's also a cafeteria right over there, so
the Commons is that way, about two buildings over, maybe three buildings over, and the
cafeteria is right inside of this building, and then we finish up at 4:30. So this is a picture. We're
here. I'm pointing that way. So I want to give you a sense of other machine-learning resources
at Microsoft. In general, Microsoft Research is one of the great centers of machine learning, and
there's a large number of experts on machine learning at Microsoft Research. So if you have
questions about specific things, you should post them on the machine learning mailing list, and
that can help you connect with people at Microsoft Research. There's many applications of
machine learning. OSD has many people who use machine learning all the time. STB and IEB
are both using machine learning and increasing the amount that they're using it. In general, I
think it's happening in the rest of the company, as well, but these are the areas that I'm most
aware of. Okay, there's also previous course material, which is on the SharePoint site, so there's
pointers to it on the SharePoint site, which you can use to fill in additional things. Everybody -- this is an academic field, so everybody has their own view on what is the most important and so
forth. For different views, you can go here. And each of these are good talks, good courses,
where you will learn some more. Okay. Last thing is tools. So TLC is a tool which is
developed here in Redmond by Matt and Misha. VW is a tool which I worked on at Yahoo!
Research. It's an open-source project, and so it is quite possible to use it inside of Microsoft.
OWL-QN is a linear learner in Cosmos, and Infer.Net is a compiler for model-based learning.
So each of these -- these are sort of what I think of as the core tools at Microsoft. There may be
more, which I will learn about, but these are the ones that are going to be covered in this class.
Okay, so this is a class. Not all of us are teaching all the time, so I think the most important
thing is that you should ask questions. If you can, try out the tools in real time. These lectures
are being recorded on Resnet, so it can be hard to do things in real time, but you can easily pause
Resnet and continue and check things out more thoroughly yourself later. Distractions are
definitely interesting. If you have questions and it's more convenient to ask offline, then use the
ML October mailing list. I'll be watching it, and I'm sure many other people, as well, so your
questions can be answered. Are there questions right now?
Okay. Let's begin, then.
>> Vijay Narayanan: All right. Thanks, John. So in this session, we'll primarily focus on
understanding some basic concepts in probability and statistics, especially as they relate to
machine learning. The goal here is not to introduce anything very deep or mathematical off the
bat, but hopefully to gain an intuitive understanding of some of the basic concepts. There are
lots of rigorous definitions that will follow in subsequent classes, but in this session, what I really
want us to go away with is some appreciation of some deep concepts, some very basic concepts
which are taken for granted in many of the other sessions. So some of these are really, really
cool results to know, irrespective of whether you are learning machine learning or not, and I'm
sure you must have seen them in a lot of your earlier classes, basic classes in statistics or
probability, earlier, as well. In which case, think of this as a refresher, and more importantly,
there are two aspects in machine learning. One, as you know, is a lot of theory behind machine
learning. How do humans learn? What is the basic structure of the learning patterns? How does
learning happen in real life? All those nice, cool things. But also, very often, when you start
applying these techniques to real problems in the industry, you are dealing with practical data.
You are often dealing with noisy data, you are dealing with incomplete data. You are dealing
with partially labeled data and a host of other problems. In these cases, it's very, very
important to understand some of the statistics of the data and what the data looks like.
Before you start applying the techniques, it's very important to analyze and explore and gain an
intuitive understanding of the data, some of the basic properties of the data before you start
throwing in algorithms. So this class will primarily -- hopefully give you an appreciation of
what are some of the basic tools and techniques that you can look at for gaining an intuitive
understanding of the data. So the basic outline is here. We'll cover some basic probability
concepts, go onto some statistics, and then I will take you through a gentle introduction to
machine learning, what the typical machine-learning process looks like, what to look out for,
etc. This is not about teaching you how to drive, but hopefully to teach you how not to cause an
accident. So it's less about the machine-learning techniques, per se, but more about the
guardrails and some of the places where you need to be careful when you apply machine
learning to real-world problems. So let's start with reviewing some basic concepts in probability.
We'll keep hearing the term random variables mentioned. So we'll quickly go over what random variables are, how they are different from deterministic variables, and cover some basic concepts of probability and probability distributions, joint probability, conditional probability, Bayes' theorem, etc. All right, so what is a random variable? It's a very simple definition.
Basically, it's a result of any experiment that can yield one out of many possible outcomes. If
you have a variable that says X equals five, then you know exactly what is the value of X. But if
I tell you, hey, I'm going to flip a coin and the outcome of that is my variable, guess what, now
it can take two values. It could be heads, it could be tails. All right. So now, if I want to
quantify what is the outcome of a coin toss, I call that a random variable, because now it can take
one out of many possible values. And when you do this experiment repeatedly, you're going to
get some or all of these values. So it's any variable whose value varies due to chance. Another
very simple example of a random variable is the outcome of rolling a die. So if you just roll a
very fair die that has six faces in it, you're going to get one of the following outcomes, right? It
could be any number from one through six, the face of the die that is showing up could be any
one of these six numbers. Now, if it is a fair die, you would expect each of these outcomes to
occur equally likely, so this is a very simple random variable. Now, remember, suppose if we
ask another question: suppose I roll a pair of dice, then what is the value of the total? Now,
remember here again, each die by itself is a very simple random variable, and your sum is the
combination of these two random variables, which is, again, another random variable. But here,
the sum can only take on 11 possible values, so it can take on values from two through 12.
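[Editor's note: a short Python sketch, standard library only, that enumerates the 36 equally likely outcomes of rolling a pair of dice and tallies each sum, making the next point concrete.

# Count how often each sum (2..12) occurs among the 36 equally
# likely outcomes of rolling a pair of fair dice.
from collections import Counter

counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
for total in range(2, 13):
    print(total, counts[total] / 36)  # 7 occurs 6/36; 2 and 12 occur 1/36 each
]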
However, notice here that not all the values are going to be equally likely. Some of the values -- for instance, seven -- are much more likely to occur. The extreme values like two or 12 occur much
less frequently. So unlike in the first example, where you could just roll a single die, where all
the six possible outcomes are equally likely, when you look at the total that you get from rolling
a pair of dice, the outcomes are not all equally likely. There is a certain variation in how often
they show up. Does that give a reasonable idea of what a random variable is? Okay. There are
basically two different types of random variables. One is a discrete random variable, where
X just takes values from a very countable set, so the possible values are all countable. You can
start rolling up your fingers and start counting them. For example, in the case of a coin toss or
when you roll a die, you can actually enumerate all the different possible outcomes. Note,
however, here that just because it is countable doesn't mean that you can finish counting. So the
size of this countable set can be finite, which in the case of the coin toss or the die roll are all
things that you can finish within a few seconds. You can keep enumerating all the possible
outcomes in a few seconds. They could also be countably infinite. For example, if I turn -- if I
ask you a question, how many tosses, how many coin tosses do you need until the first tail
occurs, guess what? You could flip a coin, maybe the first time it occurs. Good for you. You
flip it two times, and it occurs the second time. Maybe the first time, it fell heads, on the second
time, it fell tails. But it's also possible that both of the first two coin tosses came up heads. So you
can keep count, and as you can see here, you can keep potentially tossing the coin infinite
number of times, and you might never get a tail. Now, the probability of such an event is very,
very small, but potentially, you can keep tossing the coin infinite number of times and you will
never end, and you will never see a tail. So this is an example of a set that is countably infinite. You can keep counting how many times it falls, but the probability that it never ends is very small. Go ahead.
>>: You said infinite number of times. If the probability is zero [indiscernible].
>> Vijay Narayanan: It just asymptotically goes to zero.
>> John Langford: Vijay, repeat the questions or the comments.
>> Vijay Narayanan: Yes. So the question is, if you toss it infinite number of times, the
probability goes to zero. Yes, asymptotically yes, but there is still a very, very, very small, finite
nonvanishing probability; it is definitely decreasing, but it only hits zero, theoretically, at infinity. So this is a discrete random variable, and as you can see here, it can be either finite or
countably infinite. Now, every finite random variable is almost by definition discrete, because
you can just enumerate them -- you can just enumerate all the possible outcomes. You can just
count them up, and it is finite. So all the possible outcomes can be discretized. So this is a
discrete random variable. The other type of random variable is the continuous random variable. So here, the random variable takes values from an uncountable set.
So basically, for example, if I go and ask, okay, what is the distribution -- what are all the heights
of some animals in a given region? So what are the lengths of all the fish in Lake Washington,
for example? Or if I ask a question like what is the lifetime of a light bulb, it's going to be a real
number. It's going to be some number between zero and potentially infinity, but it can take any
possible real value between these two extremes. Things like time to failure of a hard disk, for
example. All these sorts of variables are continuous random variables, and these continuous
random variables take on an infinite number of values. Now, these values can all be bounded
within a given range. So, for example, all possible values between the numbers zero and one. There are going to be infinitely many such numbers between zero and one, but the extremes are still
bounded. Or it could be if you take values over an unbounded range -- for example, all the
numbers greater than one, so it could just be all the real numbers between -- well, there is a typo
there. It should be greater than zero, so between zero and infinity. For example, something like
the time to failure of a hard disk is an example of unbounded -- is an example of a continuous
random variable that has an unbounded range. So now, with the definitions -- having some
rough idea about what these variables are, discrete, continuous, finite, countably infinite, etc.,
let's move on to the next possible question you might ask out of these random variables, which is
how likely is it to see a given value? Now that you have said that a random variable takes values from a given
set of possible values, you can then go back and ask, okay, how likely is it that this random
variable takes one value as opposed to another value? How likely is it that this random variable
takes the value -- for example, if I toss a coin, how likely is it that it takes the value heads? Go
ahead. You had a question.
>>: So the random [indiscernible]. It doesn't have [percent] value, so [indiscernible].
>> Vijay Narayanan: Theoretically, it could be multidimensional.
>> John Langford: You should echo the --
>> Vijay Narayanan: So the question is, can random variables be composite? If it's a sum of two random variables or a function of many random variables, then that by itself is another random variable.
>>: What I mean is [indiscernible].
>> Vijay Narayanan: Yes, it could be a vector. Sure. Higher-order variations are possible, and
I'll actually touch a little bit about multivariate distributions later on. Yes, it's certainly possible.
Any other questions? Okay. So probability is nothing but a quantitative answer to this question -- how likely is it to see one value from among all the
possible values of a given random variable? So you can also turn it and ask a related question, so
you can also frame it in a different manner. We mentioned that the random variable is an outcome of an experiment, so the question is, if I keep repeating the experiment, then what
fraction of all the experiments will have a given outcome? Now, this is what is addressed by
probability. So what a probability says is, every outcome, every possible outcome in the set of
possible outcomes is associated with a given number, a number between zero and one, and for
example, when you roll a die, each of the possible outcomes from one through six is associated
with a number between zero and one. In a fair die, each of the possible outcomes can occur one-sixth of the time, so if you do the experiment, say, 6 million times, if you keep flipping a coin 6
million times, then 1 million times, you will expect to see heads -- one. Sorry, it's not a coin. It's
a die. If you roll a die 6 million times, then 1 million times you will expect to see one, 1 million
times you'll expect to see two, approximately, right? So probability is just the quantitative
definition -- is just the quantitative answer to this question, how likely is a given value going to
occur for a random variable? Now, there are a couple of very basic axioms. I'm sure you must
have all seen this multiple times over. The probability, it's a bounded number between zero and
one. And the sum of all the probabilities of all possible outcomes within the set is one. This is
just a way of saying that if you do an experiment, then one of the possible outcomes has to occur.
And the set of all possible outcomes is what you have enumerated as the possible range of values
of this random variable. Now, for discrete random variables, each of the possible outcomes has
an associated probability. Because it is a countable set, if you take, say, a die roll, then
each of the possible six outcomes has one real number associated with it. In a fair die, it's about
one-sixth. However, what does it mean to define a probability for a random variable that takes continuous values? Suppose a random variable can take any value between zero and one. What does it mean to say that, hey, this random variable has a probability of 0.1 of taking the value of exactly 0.2, or 0.3, or 0.4, and so on? It doesn't really make much sense. So in this case,
for continuous random variables, you define probability over a range of the possible outcomes,
over a range of the continuous values that this random variable can take. So you define, hey,
now, I can turn the same problem around and say, between -- what is the probability that this
random variable takes the value between 1.1 and 1.15? Then you can start defining the notion of
how likely it is that, if I do an experiment, my outcome is going to lie within this range, between 1.1 and 1.15. So this is the slight difference between defining probability for discrete random variables
and continuous random variables. But as for discrete random variables, every possible outcome
has an associated probability value. For continuous random variables, you only define
probability over a range of values of the continuous variable. Okay. Let's go on to probability
distributions. The thing to just keep in mind here is there are different types of distributions that,
at a very high level, will just be called out as probability distributions. And exactly which of the
probability distributions that I mention here -- I mean, that are illustrated here -- is being referred to
depends on the context. So we will loosely talk about the term probability distributions, but
there are actually different types of distributions that you need to be aware of, and it's important
to know that, depending on the context, different types are referred to. A probability distribution
is just a function that describes how likely a random variable is to take a given value.
For discrete variables, as we saw earlier, you can count all the possible occurrences of the
random variable, so what is referred to as a distribution is really what is called a probability
mass function. Here, it's just the probability that this random variable takes one of the possible
discrete values that it is allowed to take. So let's just take a simple function like this, right? So
here, the X is a discrete random variable. In this instance, it takes five possible values, one,
three, four, five and seven, and in each of these points, in each of these possible discrete
outcomes, there is a probability associated that tells you if you do an experiment, then maybe
40% of the time, you're likely to get the value one, 20% of the time, you are likely to get the
value three, 10% of the time, you can get the value four, and 10% of the time, you get the value
five. The value six can never occur for this random variable, so that's why it's a zero, and 20% of
the time, it's likely to be a seven. So here, you've just defined your probability mass for each
possible outcome of this discrete random variable. Sorry, did you have a question there? No?
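[Editor's note: the probability mass function on this slide, written as a plain Python dictionary; the numbers are the ones read off in the talk, and they sum to one.

# P(X = x) for each allowed discrete value; six has probability zero.
pmf = {1: 0.4, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.0, 7: 0.2}

assert abs(sum(pmf.values()) - 1.0) < 1e-9  # probabilities sum to one
print(pmf.get(6, 0.0))  # 0.0 -- the value six can never occur
]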
Okay. So another type of distribution is this is something you keep hearing often, is this notion
called the probability density function. People very often also call this the probability distribution function, but it's the probability density function. This is what is referred to when people
typically talk about continuous random variables. So here, if you want to know what the
probability is, then you only can define it over a range of possible values of the continuous
random variable. But for a given point, you can still associate a function, which when integrated
over this range, or when summed over all possibilities within this range, gives you the probability of this random variable falling within that range. And unlike probability,
which can only take on values between zero and one, the probability density function can take on
any nonnegative value. So if you want to say here is the probability that it takes -- what is the
probability that this continuous random variable takes values between X1 and X2, then you
integrate the PDF -- it's commonly called the PDF, the probability density function -- over the range X1 through X2. Now, it's important to note here that -- this is an
example of a PDF, of a very commonly occurring PDF, where the X here is your continuous
random variable that takes on all possible values. Here, in this case, it's all nonnegative values,
and this function, which has a peak and then dies down very smoothly, is an example of a probability
density function. Now, to know what is a probability that this random variable takes on values
between, say, five and 10, you integrate this function between the values five through 10, and
that tells you what is the probability that this continuous random variable takes a value between
five and 10, between this range of values five and 10? Yes.
>>: [Indiscernible] F of X, DX? Shouldn't this be E of X, F of X, DX?
>> Vijay Narayanan: That is if you want to do the expectation of F of X over this range, so here,
you're just looking at the probability, not the expected value of the function. I have not even
come to expectations yet.
>>: If the PDF can go to infinity, then the integral of any range of the PDF can be greater than
one and can be fairly large. Then it's not a probability anymore.
>> Vijay Narayanan: So you can only define this -- strictly, you can only define this if the
integral is still finite. The function can still go to infinity, but the integral of F of X, DX can still be finite.
>>: All right, so finite, but still greater than one, which would not be a probability.
>> Vijay Narayanan: No, not really. So you typically normalize the PDF, and you'll also
normalize this by the integral from zero to infinity, to keep that within bounds.
>>: What if you normalize the PDFs and the PDF values aren't between zero and one?
>> Vijay Narayanan: No, no, no. PDF values can be between zero and infinity, but you normalize this by the integral from zero to infinity -- you divide by it to bring the total to one. The total value of
probability contained under this curve is the value that you normalize this by. Okay. And
typically, that gets absorbed within this P of X in most common distributions, so that's why I
didn't really call this out. Those are [twiddles] that I really didn't want to get into. So another notion
of a distribution that frequently comes about when you talk about probability distributions is this
cumulative distribution. So remember, until now, we talked about what is the probability that
something takes the value -- takes the given, specific value in the case of a discrete random
variable or a value within a given range in the case of a continuous random variable. The
cumulative distribution is the probability that the random variable takes some value that is either
less than or equal to a given value. So if I go back to this previous PDF here, and I ask you, what
is the cumulative distribution at five, then it is the probability of this entire curve between minus
infinity to five. Now, in this example, there is no extension of this curve to negative values,
but technically, there's nothing that prevents you from having a probability distribution where the
random variable takes on negative values. So for the same distribution here, the red color shows
the cumulative distribution function. In the case of a normalized PDF, the red curve, as you can
see here, starts off at zero, at very low levels of the random variable, and nicely settles down to
around one as you go to higher and higher values of the random variable. So all that this CDF tells you is the probability contained when the value of the random variable is either
at a given value or anything lower than that value. So it is just the integral of the probability
function until -- from minus infinity to X. Note here that the CDF itself is a function of the X.
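[Editor's note: in standard notation, the relations just described are

F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\,dt,
\qquad
P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} p(x)\,dx = F(x_2) - F(x_1).
]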
So the CDF is actually a function, and sometimes, you can even -- if the CDF is a closed
analytical form, it may be easier to work with the CDF instead of the PDF, because especially if
the PDF has some funny behavior, like going to infinity at some point, etc., then the CDF might
be a more analytically tractable function. Okay. So before I go to multivariate probability, just a
quick point. So as John mentioned, we typically take a break at 10:30, but today I believe we
will go to 10:45, right? So this session will go until about 10:45. We'll take a half an hour
break, come back at 11:15 and go until 12:30. Now, so far, we have all talked about just the
distributions, the probability distributions, random variables of just one dimension, but
technically, you can extend it to all -- all these concepts can be easily extended to just multiple
random variables. So, for example, you can ask the question, what is the probability that the
random variable X takes a value between X1 and X2 and another random variable, Y, takes the
value between Y1 and Y2. For example, in the case of rolling a pair of dice, we can ask, what is the probability that the first die takes a value between two and four and the second die takes a value between, say, four and six? So you can extend this concept of random
variables, probability, all the distributions, to multivariate cases, as well. So in that case, the
integrals basically become higher-dimensional integrals, so in this bivariate distribution where
you have two random variables, X and Y, the probability that the variable X is between X1 and
X2 and the probability that the variable Y is between Y1 and Y2 is just this integral of the joint
probability density. Here, when you go to higher-dimensional distributions, you typically refer
to them as joint distributions, so P of X comma Y, dxdy, where X now goes over X1 and X2 and
Y ranges from Y1 to Y2. All right. So the first few slides are just to get some of the concepts
out of the way, and hopefully, we'll do a lot more examples as we go forward. Now, when you're
given a higher-order distribution, when you have a multivariate distribution, you can still ask,
okay, you've given me this joint distribution of two variables, X and Y, how can I compute
just univariate distribution, just the distribution of P of X? You can easily do that by basically
integrating out all the possible values of your other dimensions, of your other random variables.
For example, if you want to look at P of X, if you want to compute P of X from the joint
distribution P of X comma Y, you integrate over all possible values of Y the joint distribution.
And when you do this, this P of X is what you keep hearing called the marginal distribution. So here, the marginal is the joint integrated over all the other dimensions, all possible values of your other dimensions. Now, just a point to note here: I have shown some integrals that are relevant for continuous distributions, but for discrete variables, we basically replace all these integrals with sums, with a sum over all the possible values of the other random variables. So you can
even have a distribution where X is the continuous random variable and Y is the discrete random
variable, in which case, if you want to do this marginal probability density of X, you will replace
this integral over Y by a sum over all possible values, all possible discrete values of Y. Okay.
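[Editor's note: a minimal Python sketch of marginalizing a discrete joint distribution by summing out the other variable; the joint table here is invented for illustration.

# Marginalize a discrete joint distribution P(X, Y): sum out Y to get P(X).
joint = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.30,
    ("x2", "y1"): 0.25, ("x2", "y2"): 0.35,
}

marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p  # sum over all values of Y

print(marginal_x)  # {'x1': 0.4, 'x2': 0.6}
]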
Now, it starts to get a little bit more interesting. So far, we've just looked at individual
distributions, probabilities, etc., but when you bring in more than one random variable into the
picture, we can start asking questions like what is the probability that one random variable takes
a given value or within a range of values in the case of a continuous variable when the other
random variable has taken a specific value? The types of questions that you can ask -- so this is referred to as conditional probability. So in this example, you're asking, what is the
probability distribution? What is the probability of X given that another random variable Y has
taken a specific value? This is called the conditional probability of X given Y, and why this gets
very interesting is a lot of the machine-learning problems boil down to asking the questions,
suppose here is the state of the universe, here is the state of the system I observe, tell me what
will be the outcome. Many of the ML types -- especially the applications of ML, boil down to
asking the questions, here is my state of the system. For example, in the case of, say, if you want
to predict if a given transaction is fraudulent or not -- say you have done a credit card transaction
and if somebody wants to predict if it is fraudulent or not, you're going to ask the question, look,
here is the behavior of this credit card. Now, tell me if it is fraudulent or not. So the questions
become very conditional. We're asking, given that this has happened, now tell me, what is the
probability that this is my outcome. Can you learn these relationships? So this conditional
probability is typically used to represent these given that type of questions. For example, one
simple question you can ask is, given that the lawn is wet, what is the probability that it rained last night? Well, the lawn could be wet because maybe the sprinkler was on last night, or maybe
it really rained. But you can still ask the question, right? What is the probability that it had
rained last night, given that the lawn is wet. So the fact that your lawn is wet is an observed
variable. You have fixed the value of that random variable. There is no more uncertainty there.
The lawn could have been dry. It could have been wet. But you have gone out and seen that the
lawn is wet. Now, you're turning around and asking the question, given that I have seen that the
lawn is wet, what is the probability that it had rained last night? Another simple question is,
given that it had rained earlier, what is the probability that there will be a rainbow today? So all
these sorts of questions, like the given that questions, where you're saying that one random
variable has taken a value, has a specific value -- I have done an experiment and I have seen that
this random variable has taken a specific value or a specific outcome. Now, tell me, what is the
distribution of another random variable. So this is called conditional probability. And the reason
-- so I'll just quickly cover this concept called Bayes' Theorem. I won't go too deeply into it,
but I guess this will be covered in a lot more detail on the last day, when Chris Bishop and John
Bronskill talk about modeling and inference. So if you take two random variables, D and H, and
I'll later come down to why I chose this notation, then suppose, say, this red circle tells you all
the possible values that the variable D can take. And the green circle tells you all the possible
values that H can take. And brown here, the intersection, is basically where it's a place where D
takes on a specific value and H also takes on a value within that brown circle. So red is basically
all possible values of D, H is all possible values of -- sorry, green is all possible values of H, and
brown is the intersection region. So now you can ask, what is the probability that D will occur,
given that H has already occurred. So as you can see here, it's just the ratio of the brown area
over the green area. So probability of D and H divided by the probability of H happening by
itself. You can also ask the other question, what is the probability of H, given that D has
occurred? So here, it's again the overlap area, the brown area, but now, you are asking what is
the probability that H will occur given that D has already occurred, so the denominator is the total possible values of D, the red area, P of D.
>>: Are D and H in this case variables or outcomes?
>> Vijay Narayanan: They are possible outcomes. That red area is the set of all possible
outcomes of the variable D. So you can just combine these two and say the probability of H
given D is the probability of D given H, times the probability of H, divided by the probability of D.
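[Editor's note: written out in standard notation, the statement just made is Bayes' Theorem,

P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)},
\qquad
P(D) = \sum_{H'} P(D \mid H')\, P(H'),

where the second identity is the sum over all possible hypotheses that comes up in the Q&A below.]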
Now, this is a little bit abstract right now, but really, this equation is Bayes Theorem, and I'll just
tell you why it's very important. H here -- think of it as your hypothesis. Suppose
somebody has given you some data, and you want to see, is my hypothesis -- so, my hypothesis
is that this credit card transaction is fraudulent. So what you have is the data which is all the
transaction streams, the history of all the transactions that has occurred on this credit card, for
example. Now, you're asking, given this data, what is the probability that my hypothesis, which
in this case is that this card has gone fraudulent, is correct? So for the probability of hypothesis given
data, you're just turning it around and saying probability of data given the hypothesis times the
probability of the hypothesis itself, divided by the probability of data occurring by itself. Okay, I
know this seems to be a little bit confusing. I see a lot of people just staring. Yes.
>>: This implies you knowing the probability of the occurrence of data by itself, which in many
cases is hard to come by?
>> Vijay Narayanan: Yes, so you can replace that with the sum over all possible hypotheses, of
this data being generated by all possible hypotheses. So this is also frequently put as follows: what you have is the data, and now you are asking, what is the probability of this hypothesis being correct? That's typically referred to as the posterior probability. It's posterior,
because you are asking these questions after you have had the data, after you have collected the
data. And now you are asking, before I collect the data, I don't have any intuitions about which
hypothesis could be right or wrong. I might have some expectation saying, hey, look fraud is
inherently a lot less likely to occur, so maybe my probability of this card being fraudulent is not
very high. But then, you actually collect the data, and then you can ask the same question again.
Now, tell me, how has my hypothesis changed? How has the probability of my hypothesis
changed, given that I have observed this data? So that's called the posterior probability, and the
probability of data, given the hypothesis, is frequently referred to as likelihood. So here, how
likely are you to have generated this data, given that your underlying hypothesis was correct?
The probability of the hypothesis is also called the prior probability, which, for example, in this credit
card fraudulent transaction is what is my expected rate of fraud before I had any data? All right.
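[Editor's note: a worked numeric sketch of the fraud example in Python; the 1% prior echoes the figure mentioned later in this session, and the two likelihood numbers are invented for illustration.

# Bayes' theorem on the credit-card fraud example.
p_fraud = 0.01             # prior P(H): expected fraud rate before any data
p_data_given_fraud = 0.60  # likelihood P(D | H) -- invented for illustration
p_data_given_legit = 0.05  # P(D | not H) -- invented for illustration

# P(D): sum over hypotheses of P(D | H') * P(H')
p_data = p_data_given_fraud * p_fraud + p_data_given_legit * (1 - p_fraud)

posterior = p_data_given_fraud * p_fraud / p_data
print(round(posterior, 3))  # ~0.108: the data raises the 1% prior to about 11%
]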
So at some level, that's been a very broad overview of just some basic concepts in probability.
Any questions?
>>: What if I go through all the [indiscernible] all the ways in which the general data will be
delivered?
>> Vijay Narayanan: Yes. What if my hypothesis space is not easily --
>>: [Indiscernible].
>> Vijay Narayanan: So if you have absolutely no clue about the hypothesis space, then I
suggest you get some domain understanding to be able to --
>>: What if I do not have a complete understanding? I can have a partial understanding.
>> Vijay Narayanan: That's fine. So in that case, your modeling is only going to be as good as how complete your hypothesis space is. If you are operating in a region -- and we'll actually
come down to this concept a little bit later. This shows up again as to how rich is your
hypothesis space, what if you don't have a rich enough hypothesis space, or what if we have too
rich of a hypothesis space, right? Both of them are bad. We want to have some reasonable
assumptions about what are possible hypotheses to test. So there are some ways to constrain -- there are some ways to measure if your hypothesis is reasonably good or not, but you need to
have some understanding of even what is possible, what hypotheses are even possible. At the
same time, we don't want to make it overly complicated, either. Go ahead.
>>: Can you go back to the previous slide for a second? I really like the concrete examples you
gave for what prior probability are and likelihood and posterior probability are. For prior
probability, you said expected rate of fraud before you had any data. What was the likelihood
and the posterior probability again?
>> Vijay Narayanan: The likelihood is, supposing your hypothesis is correct, how likely is it that this hypothesis led to the data that you actually observed? So you are basically turning this
around and say probability of hypothesis given data -- look, what I have done here is, initially, I
know nothing about which hypothesis is correct. I have some vague notion saying that, look,
frauds are inherently less likely to occur, so it's probably not a 50% probability that this card is
fraudulent. The probability may be about 1%. But you can then ask the question, look, here is a
series of transactions I have seen. Now, what is the probability that this card is fraudulent, given
that I see this level of activity, this pattern of activity on this card? So that is your posterior
probability. The likelihood is, given that the card is actually fraudulent, what is the probability
that you see this type of transaction stream, your observed transaction stream? So that's your
likelihood. Okay, any questions? So I think we will stop at probability at this level and take a
look at some statistics. Any questions so far? All right. So let me start with statistics, or should
we just pause here, take a break and then come back at 11:00, John?
>> John Langford: We can take a break.
>> Vijay Narayanan: Take a break. Okay, so we'll cover statistics and machine learning after
we come back, maybe at 11:00. All right. Thanks.