>> John Langford: Okay, let's begin. This is the machine-learning class. The plan for the week is to teach everybody machine learning, and the first question, of course, is what is machine learning? Mute. Okay, so if you go and you look at what a lot of modern machines look like when they're actually running, you look at what's in their RAM, you see that there's some portion that's compiled or interpreted code and some portion that is actually just derived from data, and this is really what machine learning is about. It's about trying to use data to make useful predictions. The amount of RAM being used for data-derived predictions versus compiled or interpreted code varies greatly from machine to machine, but there are some machines where a large fraction is in fact data derived. Okay, so now it's useful to think about examples of machine learning, and there are quite a few examples, many of them at Microsoft. So search engine results are driven by machine learning to a large extent. The interesting thing is, I understand that Google's search results are not quite as driven by it as Bing's. There's a guy who's pulling a Paul Bunyan, and eventually, I expect he will break. Kinect uses machine learning to recognize poses of people. It's pretty cool. Spam prediction is the example of machine learning that you can tell everybody about, including your mother. Automated arbitraging -- when you see what's happening on Wall Street, there's a lot of machine learning going on there. And then I have a friend who's actually working for the Obama campaign, trying to figure out how to best persuade people to vote for Obama with machine learning. Okay, so these are just examples of machine learning. It's helpful to understand the characteristics of machine-learning problems, and there are several basic paradigms of machine learning which exist right now. The one that we're going to focus on for most of the class is supervised machine learning. So in supervised machine learning, what happens is you pick a problem where you can't program the solution very easily -- where you fail due to case complexity. An example of this would be spam prediction. In spam prediction, you could of course program something which says, if the email has the word Viagra, then reject it -- mark it as spam. But sometimes emails with the word Viagra actually aren't spam, and sometimes there's Cialis instead of Viagra. And if you start thinking about this carefully, what happens is, you come up with exceptions, and then you come up with exceptions to the exceptions, and exceptions to the exceptions to the exceptions, and you go crazy trying to program these things. So machine learning is what can help in these kinds of situations. If there are more straightforward solutions, then of course you should use them, but when you start getting this incredible case complexity, and no obvious program can actually do what you want it to do, then it's time to start thinking about a data-driven solution. So once you have this problem -- in classical supervised machine learning, you hire people to label input and output example pairs. So you're going to be trying to learn a function which predicts, given the email, is this spam or not? And the classical solution would be that you hire people to label things as spam or not. And then you use your learning algorithm to find this function which maps the input to the output.
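To make that pipeline concrete, here is a minimal supervised-learning sketch in Python -- my own toy illustration with made-up example data, using scikit-learn rather than the tools covered later in the class (TLC, VW):

```python
# A minimal supervised-learning sketch: learn a function from labeled
# (input, output) pairs that maps an email's text to spam / not-spam.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical hand-labeled example pairs, as hired labelers would produce.
emails = ["cheap viagra now", "meeting at noon tomorrow",
          "cialis discount offer", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer().fit(emails)      # turn text into word counts
model = LogisticRegression().fit(vectorizer.transform(emails), labels)

# Apply the learned function to a new input.
print(model.predict(vectorizer.transform(["discount viagra offer"])))  # expected: [1] (spam)
```

The point of the sketch is that the learned function generalizes from examples, which is exactly what the hand-coded exception-upon-exception rules above fail to do.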
And there's lots of different learning algorithms, so we'll be discussing several of them, but in general, all they're doing is trying to find this mapping from input to output. And then, of course, you apply your learned function to solve the problem, wherever it happens to exist. Are there questions? Yeah. >>: How do you assure the quality of the labeling? Because different people might label things differently, depending on how much [indiscernible]? >> John Langford: How can you assure the quality of labeling? That's a tricky question in many situations. Often, things get labeled multiple times, and you get a sense of the quality of the labeler. This comes up to a much greater degree the lower the expertise of the labeler. So in particular, if you're using Mechanical Turk to label things, which is fairly common nowadays, there's a lot of tricks people do to try to weed out the bad labelers and keep the good labelers, and so forth. Okay, so this is classical supervised machine learning. This is what existed even 20 years ago. It's been slowly improving over time, and it gets routinely applied in many situations. There's also interactive machine learning. In interactive machine learning, the learning algorithm starts controlling which inputs it learns from, and that introduces another level of complexity. However, it can do some very cool things. So, for example, when you start hiring labelers, it starts costing money to label things. And then if your learning algorithm can reduce the number of things that you need to label by a factor of two or a factor of 10, that can greatly improve your budget. So there are these systems for automated labeling of various tasks: UHRS is the internal Microsoft one, Mechanical Turk is from Amazon, both in Seattle. Interesting. And then the other thing which happened is, there's a large number of recorded computer-mediated actions, so what I mean by this is trades on an exchange, or search sessions, or here I displayed a news story and somebody clicked on it and read it. So this is essentially free data, free labels, as long as you can actually store the data. This is not quite labeling the correct output given the input, and that makes it a trickier process than standard supervised learning, but it is much cheaper and potentially much more powerful, because you can express a much more complex function with the data that you gather from all of these recorded computer-mediated actions. Another example of this which comes up and which is very relevant to OSD is trying to find the ad which most interests people. If you find the ad which most interests people, then a lot of money gets made. Okay, so interactive machine learning is a newer area. It's been developing over the last, I would say, decade, maybe decade and a half. There's typically much more data involved, particularly for number two, but there's a more tenuous connection between the data and the solution. It's not like, for this input, I want that output, for this input, I want that output. It's more like, for this input, that output kind of worked. Okay, so that's interactive machine learning. And there's another branch of machine learning, which is model-based machine learning. The goal here is typically to model how the world behaves. So the last day is the modeling day, and this is a huge subject in general, so this is very broad. The basics will be covered in this class.
This is particularly useful, I think, for hypothesis testing. So if you have a belief about how the world behaves -- not necessarily a complete belief, but you believe there's some structure to how the world behaves, this is dependent upon that, and it's independent of this given that -- then you can create graphical models. You can test these, and if you are correct in your beliefs about the world, you can get very good predictive ability. This is also the tool which gets used when you don't know what to do. So if you have some data and you're like, hmm, what's the input, what's the output? I have no idea. Then, often, these modeling-based approaches are used to try to summarize the data in some manner which is human understandable, or which can then be fed into a supervised learning algorithm for later predictions. Okay, so these are sort of three branches of machine learning. Now, I'll discuss the class itself. So Vijay, right after me, is going to talk about statistics for machine learning. The first thing that you need to know with machine learning is that you have succeeded. So you need to be able to say, I have succeeded, right? And that's essentially what Vijay's talk is about. So an understanding of basic statistics is necessary to know that you have succeeded, because machine learning is kind of different -- with constraint programming, you have succeeded when you have found the optimum, which you can prove when you're at some corner of some optimization space. With machine learning, you have succeeded when you have good test error performance, when you can predict things well, and defining "well" is something which is inherently noisy, because some problems are just impossible to solve all the time perfectly. Is this email spam or not? Well, maybe it depends upon the person who's actually receiving it, and maybe you don't have information about that person which would allow the algorithm to determine whether or not it's going to be flagged as spam. And that means that there is some fundamental noise. You can only achieve a small error rate up to some limit, which is going to be dependent upon the problem. That means it's noisy, and that means that you need to understand statistics so that you can understand how confident to be in your solution results. The simplest version of learning -- the simplest representation -- is linear learning, so Miro is going to cover this in great detail. And then we're going to finish up with two tools later in the day, so Matt is going to talk about linear learning with TLC, and I'm going to talk about linear learning with VW. So at the end of the day, you should understand how we've succeeded -- understand what success looks like in machine learning -- and you should get some familiarity with basic tools, so you should be able to just apply machine learning, or very simple versions of machine learning, on your own. Okay, the second day is about a new representation, decision trees. Decision trees get used in the core search engine at Microsoft, and in general, Microsoft has a very highly optimized decision tree package, which makes it an attractive algorithm to actually use. So Rich is going to talk about decision trees, and then Vijay and I are going to follow up with some more details. When you are doing machine learning, it's often very helpful to have a lot of different metrics on your learning algorithm.
What the learning algorithm does is often a bit difficult to understand, and so there's a lot of different ways to evaluate what it's doing, to help you to debug if something is going wrong. And then I'm going to talk about the different learning problem types. When you run into machine-learning problems, they almost never fit the exact setting that is the simplest, and so it's very helpful to understand the known variations on the basic setting and how to address them. And then Misha and Matt are going to do a full tutorial on TLC. So at this point, you should know how to do machine learning fairly well. So then the next three days are on advanced topics. The third day, Wednesday, is about making things go fast. In general, a lot of machine learning is about optimizing stuff, and there's a lot of tricks to optimization which can make an enormous difference in your life. It can be like, I wait a day for this algorithm to finish, or I wait 10 seconds. So understanding various tricks to make things go really fast is very helpful for you. So Alekh will talk about online learning, which is a technique to improve the speed of optimization. I will talk about feature hashing, which is a technique to compress the size of your learned predictor -- there's a small sketch of the idea below. Alekh will then talk about parallel learning. This is again to speed things up. Then we'll have Jianfeng and Misha and Matt talk about learning algorithms that can run directly on Cosmos. Cosmos is OSD's internal combined compute and data cluster, and so it's very desirable to have learning algorithms which run directly on Cosmos. And then the last talk will be about clustering. This doesn't quite fit the rest of the day, but it was what we could fit in. So clustering is a handy technique. It actually fits into the modeling category. Typically, you find clusters in the data, and then that helps you understand how the data is useful, and it also helps you extract features -- essentially, the cluster that each individual thing belongs to -- which you can then feed into other learning algorithms. Okay, so the fourth day is about interactive machine learning. Miro is going to talk about learning and evaluation in this setting. These are settings where you're interacting with the user, and you're just getting feedback from the data traces that you actually observe from these interactions. Then, I'll talk about gathering exploration data, and then Daniel will talk about active learning. Daniel's thesis was on active learning, and it's quite a good one. Okay, the last day is about modeling. Chris Bishop and John Bronskill will introduce the basic modeling setting and approach. John is going to talk about Infer.Net, which is one of the core machine-learning tools at Microsoft, and then Li Deng is going to talk about deep learning, which is a very fun topic. It lets you do things which we don't know how to do with other approaches, like some of the really advanced feature recognition applications. Okay, so this is a typical class schedule. There's a few variations from day to day. The thing which is -- on our break, we have some refreshments, assuming we didn't run out of them this morning. For lunch, the Commons is recommended. It's nearby. There's also a cafeteria right over there, so the Commons is that way, about two buildings over, maybe three buildings over, and the cafeteria is right inside of this building, and then we finish up at 4:30. So this is a picture. We're here. I'm pointing that way.
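Here is that feature-hashing sketch -- a toy illustration of my own, not the implementation from the talk: hash each feature name into a fixed number of weight slots, so the predictor's size stays bounded no matter how large the vocabulary grows.

```python
# Minimal sketch of the hashing trick for compressing a learned predictor.
import hashlib

D = 2 ** 10  # fixed number of weight slots, regardless of vocabulary size

def feature_index(name: str) -> int:
    # Stable hash of the feature name, reduced modulo the table size.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % D

weights = [0.0] * D
for word in "cheap viagra now".split():
    weights[feature_index(word)] += 1.0  # collisions are tolerated by design
```

Collisions between feature names are accepted as a small amount of noise in exchange for the fixed memory footprint.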
So I want to give you a sense of other machine-learning resources at Microsoft. In general, Microsoft Research is one of the great centers of machine learning, and there's a large number of experts on machine learning at Microsoft Research. So if you have questions about specific things, you should post them on the machine learning mailing list, and that can help you connect with people at Microsoft Research. There's many applications of machine learning. OSD has many people who use machine learning all the time. STB and IEB are both using machine learning and increasing the amount that they're using it. In general, I think it's happening in the rest of the company as well, but these are the areas that I'm most aware of. Okay, there's also previous course material -- there's pointers to it on the SharePoint site -- which you can use to fill in additional things. Everybody -- this is an academic field, so everybody has their own view on what is the most important and so forth. For different views, you can go here. And each of these are good talks, good courses, where you will learn some more. Okay. Last thing is tools. So TLC is a tool which is developed here in Redmond by Matt and Misha. VW is a tool which I worked on at Yahoo! Research. It's an open-source project, and so it is quite possible to use it inside of Microsoft. OWL-QN is a linear learner on Cosmos, and Infer.Net is a compiler for model-based learning. So each of these -- these are sort of what I think of as the core tools at Microsoft. There may be more, which I will learn about, but these are the ones that are going to be covered in this class. Okay, so this is a class. Not all of us are teaching all the time, so I think the most important thing is that you should ask questions. If you can, try out the tools in real time. These lectures are being recorded on Resnet, so it can be hard to do things in real time, but you can easily pause Resnet and continue and check things out more thoroughly yourself later. Distractions are definitely interesting. If you have questions and it's more convenient to ask offline, then use the ML October mailing list. I'll be watching it, and I'm sure many other people will be as well, so your questions can be answered. Are there questions right now? Okay. Let's begin, then. >> Vijay Narayanan: All right. Thanks, John. So in this session, we'll primarily focus on understanding some basic concepts in probability and statistics, especially as they relate to machine learning. The goal here is not to introduce anything very deep or mathematical off the bat, but hopefully to gain an intuitive understanding of some of the basic concepts. There are lots of rigorous definitions that will follow in subsequent classes, but in this session, what I really want us to come away with is some appreciation of some deep concepts, some very basic concepts which are taken for granted in many of the other sessions. So some of these are really, really cool results to know, irrespective of whether you are learning machine learning or not, and I'm sure you must have seen them in a lot of your earlier classes, basic classes in statistics or probability, as well. In which case, think of this as a refresher. And more importantly, there are two aspects to machine learning. One, as you know, is a lot of theory behind machine learning. How do humans learn? What is the basic structure of the learning patterns? How does learning happen in real life? All those nice, cool things.
But also, very often, when you start applying these techniques to real problems in the industry, you are dealing with practical data. You are often dealing with noisy data, you are dealing with incomplete data, you are dealing with partially labeled data and a whole host of other problems. In these cases, it's very, very important to understand what some of the statistics of the data are, what the data looks like. Before you start applying the techniques, it's very important to analyze and explore and gain an intuitive understanding of the data, some of the basic properties of the data, before you start throwing in algorithms. So this class will primarily -- hopefully -- give you an appreciation of some of the basic tools and techniques that you can use for gaining an intuitive understanding of the data. So the basic outline is here. We'll cover some basic probability concepts, go on to some statistics, and then I will take you through a gentle introduction to machine learning: what the typical machine-learning process looks like, what to look out for, etc. This is not about teaching you to drive, but it is hopefully about teaching you not to cause an accident. So it's less about the machine-learning techniques, per se, and more about the guardrails and some of the places you need to be aware of when you apply machine learning to real-world problems. So let's start with reviewing some basic concepts in probability. We'll keep hearing the mention of the term random variables. So we'll quickly go over what random variables are, how they are different from other, deterministic variables, and cover some basic concepts of probability and probability distributions, joint probability, conditional probability, Bayes' theorem, etc. All right, so what is a random variable? It's a very simple definition. Basically, it's the result of any experiment that can yield one out of many possible outcomes. If you have a variable that says X equals five, then you know exactly what the value of X is. But if I tell you, hey, I'm going to flip a coin and the outcome of that is my variable, guess what, now it can take two values. It could be heads, it could be tails. All right. So now, if I want to quantify the outcome of a coin toss, I call that a random variable, because now it can take one out of many possible values. And when you do this experiment repeatedly, you're going to get some or all of these values. So it's any variable whose value varies due to chance. Another very simple example of a random variable is the outcome of rolling a die. So if you just roll a fair die that has six faces on it, you're going to get one of the following outcomes, right? It could be any number from one through six -- the face of the die that is showing up could be any one of these six numbers. Now, if it is a fair die, you would expect each of these outcomes to occur equally likely, so this is a very simple random variable. Now, suppose we ask another question: suppose I roll a pair of dice, then what is the value of the total? Remember, here again, each die by itself is a very simple random variable, and your sum is the combination of these two random variables, which is, again, another random variable. But here, the sum can only take on 11 possible values, so it can take on values from two through 12. However, notice here that not all the values are going to be equally likely. Some of the values -- for instance, six -- are much more likely to occur.
The extreme values like two or 12 occur much less frequently. So unlike in the first example, where you just roll a single die and all six possible outcomes are equally likely, when you look at the total that you get from rolling a pair of dice, the outcomes are not all equally likely. There is a certain variation in how often they show up. Does that give a reasonable idea of what a random variable is? Okay. There are basically two different types of random variables. One is a discrete random variable, where X just takes values from a countable set, so the possible values are all countable. You can start counting them off on your fingers. For example, in the case of a coin toss or when you roll a die, you can actually enumerate all the different possible outcomes. Note, however, that just because it is countable doesn't mean that you can finish counting. So the size of this countable set can be finite, which in the case of the coin toss or the die roll are all things that you can finish within a few seconds -- you can enumerate all the possible outcomes in a few seconds. It could also be countably infinite. For example, if I ask you the question, how many coin tosses do you need until the first tail occurs -- guess what? You could flip a coin, and maybe the tail occurs the first time. Good for you. Or you flip it two times, and it occurs the second time: the first time, it fell heads, and the second time, it fell tails. But it's also possible that both of the coin tosses came up heads. So you can keep count, and as you can see here, you could potentially keep tossing the coin an infinite number of times and never get a tail. Now, the probability of such an event is very, very small, but potentially, you can keep tossing the coin an infinite number of times and never see a tail. So this is an example of a set that is countably infinite. You can keep counting how many tosses it takes, but the probability that it keeps going is very small. Go ahead. >>: You said infinite number of times. If the probability is zero [indiscernible]. >> Vijay Narayanan: It just asymptotically goes to zero. >> John Langford: Vijay, repeat the questions or the comments. >> Vijay Narayanan: Yes. So the question is, if you toss it an infinite number of times, does the probability go to zero? Asymptotically, yes, but there is still a very, very small, finite, nonvanishing probability: it is definitely decreasing, but it only hits zero, theoretically, at infinity. So this is a discrete random variable, and as you can see here, it can be either finite or countably infinite. Now, every finite random variable is almost by definition discrete, because you can just enumerate all the possible outcomes. You can just count them up, and it is finite. So all the possible outcomes can be discretized. So this is a discrete random variable. The other type of random variable is the continuous random variable -- it's not that it is indiscreet, it's just that it is not discrete. Here, the random variable takes values from an uncountable set. So, for example, suppose I go and ask, okay, what is the distribution -- what are all the heights of some animals in a given region? What are the lengths of all the fish in Lake Washington, for example? Or if I ask a question like, what is the lifetime of a light bulb, it's going to be a real number.
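Both of these examples are easy to check by direct enumeration. Here is a small Python sketch (my own illustration, not from the lecture) that tabulates the distribution of the sum of a pair of dice and the tosses-until-first-tail probabilities:

```python
# Toy checks of the dice and coin examples above (plain Python).
from collections import Counter
from fractions import Fraction

# Sum of a pair of dice: enumerate all 36 equally likely outcomes.
sums = Counter(a + b for a in range(1, 7) for b in range(1, 7))
for total in range(2, 13):
    # Middle values (six, seven) are far more likely than the extremes (two, twelve).
    print(total, Fraction(sums[total], 36))

# Tosses until the first tail: P(first tail on toss k) = (1/2)**k.
# The set of outcomes is countably infinite -- the probability of needing
# k tosses shrinks with k but never reaches zero at any finite k.
print(sum(Fraction(1, 2) ** k for k in range(1, 51)))  # partial sums approach 1
```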
It's going to be some number between zero and potentially infinity, but it can take any possible real value between these two extremes. Things like the time to failure of a hard disk, for example. All these sorts of variables are continuous random variables, and these continuous random variables take on an infinite number of values. Now, these values can all be bounded within a given range -- so, for example, all possible values between the numbers zero and one. There are infinitely many such numbers between zero and one, but the extremes are still bounded. Or they could take values over an unbounded range -- for example, all the numbers greater than one -- well, there is a typo there. It should be greater than zero, so between zero and infinity. For example, something like the time to failure of a hard disk is an example of a continuous random variable that has an unbounded range. So now, having some rough idea about what these variables are -- discrete, continuous, finite, countably infinite, etc. -- let's move on to the next question you might ask about these random variables, which is: how likely is it to see a given value? Now that you have said that a random variable takes values from a given set of possible values, you can then go back and ask, okay, how likely is it that this random variable takes one value as opposed to another value? For example, if I toss a coin, how likely is it that it takes the value heads? Go ahead. You had a question. >>: So the random [indiscernible]. It doesn't have [percent] value, so [indiscernible]. >> Vijay Narayanan: Theoretically, it could be multidimensional. >> John Langford: You should echo the -- >> Vijay Narayanan: So the question is, can random variables be composite? If it's a sum of two random variables or if it's a function of many random variables, then that by itself is another random variable. >>: What I mean is [indiscernible]. >> Vijay Narayanan: Yes, it could be a vector. Sure. Higher-order variations are possible, and I'll actually touch a little bit on multivariate distributions later on. Yes, it's certainly possible. Any other questions? Okay. So probability is nothing but a quantitative answer to this question: how likely is it to see one value from among all the possible values of a given random variable? You can also frame it in a different manner. We said that the random variable is an outcome of an experiment, so the question is, if I keep repeating the experiment, then what fraction of all the experiments will have a given outcome? This is what is addressed by probability. So what probability says is, every possible outcome in the set of possible outcomes is associated with a given number, a number between zero and one. For example, when you roll a die, each of the possible outcomes from one through six is associated with a number between zero and one. With a fair die, each of the possible outcomes occurs one-sixth of the time, so if you do the experiment, say, 6 million times -- if you keep flipping a coin 6 million times, then 1 million times, you will expect to see heads -- one. Sorry, it's not a coin. It's a die.
If you roll a die 6 million times, then 1 million times you will expect to see one, 1 million times you'll expect to see two, approximately, right? So probability is just the quantitative answer to this question: how likely is a given value to occur for a random variable? Now, there are a couple of very basic axioms. I'm sure you must have all seen these multiple times over. A probability is a bounded number between zero and one. And the sum of the probabilities of all possible outcomes within the set is one. This is just a way of saying that if you do an experiment, then one of the possible outcomes has to occur, and the set of all possible outcomes is what you have enumerated as the possible range of values of this random variable. Now, for discrete random variables, each of the possible outcomes has an associated probability, because it is a countable set. So if you go through, say, a die roll, then each of the six possible outcomes has one real number associated with it. For a fair die, it's one-sixth. However, what does it mean to define a probability for a random variable that takes continuous values? Suppose a random variable can take any value between zero and one. What does it mean to say that, hey, this random variable has a probability of 0.1 of taking the value 0.2, or 0.3, or 0.4, or 0.5, and so on? It doesn't really make much sense. So in this case, for continuous random variables, you define probability over a range of the possible outcomes, over a range of the continuous values that this random variable can take. So you turn the same problem around and say: what is the probability that this random variable takes a value between 1.1 and 1.15? Then you can start defining the notion of how likely it is that, if I do an experiment, my outcome is going to lie within this range, between 1.1 and 1.15. So this is the slight difference between defining probability for discrete random variables and continuous random variables. For discrete random variables, every possible outcome has an associated probability value. For continuous random variables, you only define probability over a range of values of the continuous variable. Okay. Let's go on to probability distributions. The thing to keep in mind here is that there are different types of distributions that, at a very high level, will just be called probability distributions, and exactly which type is being referred to depends on the context. So we will loosely talk about the term probability distributions, but there are actually different types of distributions that you need to be aware of, and it's important to know that, depending on the context, different types are referred to. A probability distribution is just a function that describes how likely a random variable is to take a given value. For discrete variables, as we saw earlier, you can count all the possible occurrences of the random variable, so what is referred to as a distribution is really what is called a probability mass function. Here, it's just the probability that this random variable takes one of the possible discrete values that it is allowed to take. So let's just take a simple function like this, right? So here, X is a discrete random variable.
In this instance, it takes five possible values -- one, three, four, five and seven -- and at each of these points, each of these possible discrete outcomes, there is a probability associated that tells you, if you do an experiment, then 40% of the time you're likely to get the value one, 20% of the time you are likely to get the value three, 10% of the time you get the value four, and 10% of the time you get the value five. The value six can never occur for this random variable, so that's why it's a zero, and 20% of the time, it's likely to be a seven. So here, you've just defined your probability mass for each possible outcome of this discrete random variable. Sorry, did you have a question there? No? Okay. So another type of distribution -- this is something you keep hearing often -- is this notion called the probability density function. People often also call it the probability distribution function, but it's the probability density function. This is what is typically being referred to when people talk about continuous random variables. So here, if you want to know what the probability is, then you can only define it over a range of possible values of the continuous random variable. But to a given point, you can still associate a function which, when integrated over a range -- or summed over all possibilities within that range -- gives you the probability of this random variable falling within that range. And unlike probability, which can only take on values between zero and one, the probability density function can take on any nonnegative value. So if you want to ask, what is the probability that this continuous random variable takes values between X1 and X2, then you integrate the PDF -- it's commonly called the PDF, the probability density function -- over the range X1 through X2. Now, it's important to note here -- this is an example of a PDF, of a very commonly occurring PDF, where X here is your continuous random variable that takes on all possible values. In this case, it's all nonnegative values, and this function, the density, dies down very smoothly. It's an example of a probability density function. Now, to know the probability that this random variable takes on values between, say, five and 10, you integrate this function between the values five through 10, and that tells you the probability that this continuous random variable takes a value in that range between five and 10. Yes. >>: [Indiscernible] F of X, dX? Shouldn't this be P of X, F of X, dX? >> Vijay Narayanan: That is if you want to take the expectation of F of X over this range. Here, you're just looking at the probability, not the expected value of the function. I have not even come to expectations yet. >>: If the PDF can go to infinity, then the integral of any range of the PDF can be greater than one and can be fairly large. Then it's not a probability anymore. >> Vijay Narayanan: So you can only define this -- strictly, you can only define this if the integral is still finite. The density can still go to infinity, but the integral of P of X, dX, can still be finite. >>: All right, so finite, but still greater than one, which would not be a probability. >> Vijay Narayanan: No, not really. So you typically normalize the PDF -- you normalize it by the integral from zero to infinity -- to keep that within bounds.
>>: What if you normalize the PDF and the PDF values aren't between zero and one? >> Vijay Narayanan: No, no, no. PDF values can be between zero and infinity, but when you normalize -- you divide by the integral from zero to infinity -- the total comes to one. The total value of probability contained under this curve is the value that you normalize by. Okay. And typically, that gets absorbed within this P of X in most common distributions, so that's why I didn't really call it out. Those are [twiddles] that I really didn't want to get into. So another notion of a distribution that frequently comes up when you talk about probability distributions is the cumulative distribution. So remember, until now, we talked about the probability that something takes a given, specific value, in the case of a discrete random variable, or a value within a given range, in the case of a continuous random variable. The cumulative distribution is the probability that the random variable takes some value that is less than or equal to a given value. So if I go back to this previous PDF here, and I ask you, what is the cumulative distribution at five, then it is the probability under this entire curve from minus infinity to five. Now, in this example, there is no extension of this curve to negative values, but technically, there's nothing that prevents you from having a probability distribution where the random variable takes on negative values. So for the same distribution here, the red color shows the cumulative distribution function. In the case of a normalized PDF, the red curve, as you can see here, starts off at zero, at very low levels of the random variable, and nicely settles down to around one as you go to higher and higher values of the random variable. So all that this CDF tells you is the probability contained when the value of the random variable is either at a given value or anything lower than that value. So it is just the integral of the probability density function from minus infinity to X. Note here that the CDF itself is a function of X. So the CDF is actually a function, and sometimes -- if the CDF has a closed analytical form, it may be easier to work with the CDF instead of the PDF, because especially if the PDF has some funny behavior, like going to infinity at some point, etc., the CDF might be a more analytically tractable function. Okay. So before I go to multivariate probability, just a quick point. As John mentioned, we typically take a break at 10:30, but today I believe we will go to 10:45, right? So this session will go until about 10:45. We'll take a half an hour break, come back at 11:15 and go until 12:30. Now, so far, we have talked about distributions, probability distributions, for random variables of just one dimension, but technically, all these concepts can be easily extended to multiple random variables. So, for example, you can ask the question, what is the probability that the random variable X takes a value between X1 and X2 and another random variable, Y, takes a value between Y1 and Y2? For example, in the case of a pair of dice, we can ask, what is the probability that the first die takes a value between two and four and the second die takes a value between, say, four and six? So you can extend this concept of random variables, probability, all the distributions, to multivariate cases, as well.
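Before the multivariate case, here is a small numeric sketch of the PDF/CDF relationship just described. The exponential density p(x) = e^(-x) for x >= 0 stands in for the curve on the slide -- an assumption of mine, since the slide's exact density wasn't stated:

```python
# PDF -> probability over a range, and PDF -> CDF, by numerical integration.
import math

def pdf(x: float) -> float:
    # Exponential density: nonnegative everywhere, dies down smoothly,
    # and integrates to one over [0, infinity).
    return math.exp(-x) if x >= 0 else 0.0

def integrate(f, lo: float, hi: float, steps: int = 100_000) -> float:
    # Crude midpoint rule; enough for an illustration.
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

# P(5 <= X <= 10): integrate the density over the range.
print(integrate(pdf, 5, 10))   # ~ exp(-5) - exp(-10), about 0.0067

# CDF(5) = P(X <= 5): integrate from the left end of the support up to 5.
print(integrate(pdf, 0, 5))    # ~ 1 - exp(-5), about 0.9933
```

Computing integrate(pdf, 0, x) for increasing x traces out exactly the red CDF curve described above, climbing from zero toward one.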
So in that case, the integrals basically become higher-dimensional integrals. So in this bivariate distribution, where you have two random variables, X and Y, the probability that the variable X is between X1 and X2 and the variable Y is between Y1 and Y2 is just the integral of the joint probability density -- when you go to higher-dimensional distributions, you typically refer to them as joint distributions -- so P of X comma Y, dxdy, where X now ranges over X1 to X2 and Y ranges from Y1 to Y2. All right. So the first few slides are just to get some of the concepts out of the way, and hopefully, we'll do a lot more examples as we go forward. Now, when you're given a higher-order distribution, when you have a multivariate distribution, you can still ask: okay, you've given me this joint distribution of two variables, X and Y, but how can I compute just the univariate distribution, just P of X? You can easily do that by basically integrating out all the possible values of your other dimensions, of your other random variables. For example, if you want to compute P of X from the joint distribution P of X comma Y, you integrate the joint distribution over all possible values of Y. And when you do this, you keep hearing the term marginal distribution for this P of X. So here, the marginal is the joint integrated over all the other dimensions, over all possible values of your other dimensions. Now, just a point to note here: I have shown some integrals that are relevant for continuous distributions, but for discrete variables, we basically replace all these integrals by sums, by a sum over all the possible values of the other random variables. You can even have a distribution where X is a continuous random variable and Y is a discrete random variable, in which case, if you want the marginal probability density of X, you will replace this integral over Y by a sum over all possible discrete values of Y. Okay. Now, it starts to get a little bit more interesting. So far, we've just looked at individual distributions, probabilities, etc., but when you bring more than one random variable into the picture, we can start asking questions like, what is the probability that one random variable takes a given value -- or a value within a range, in the case of a continuous variable -- when the other random variable has taken a specific value? This is referred to as conditional probability. So in this example, you're asking, what is the probability of X given that another random variable Y has taken a specific value? This is called the conditional probability of X given Y, and the reason this gets very interesting is that a lot of machine-learning problems boil down to asking the question: here is the state of the universe, here is the state of the system I observe, tell me what the outcome will be. Many ML tasks -- especially the applications of ML -- boil down to asking the question, here is my state of the system. For example, say you want to predict if a given transaction is fraudulent or not -- say a credit card transaction has been made and somebody wants to predict if it is fraudulent or not. You're going to ask the question: look, here is the behavior of this credit card. Now, tell me if it is fraudulent or not. So the questions become very conditional.
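As a tiny illustration of marginalizing a discrete joint distribution (toy numbers of my own choosing, not from the slides):

```python
# A discrete joint distribution P(X, Y): rows index X, columns index Y.
joint = [[0.10, 0.20, 0.10],   # P(X=0, Y=0), P(X=0, Y=1), P(X=0, Y=2)
         [0.05, 0.30, 0.25]]   # P(X=1, Y=0), P(X=1, Y=1), P(X=1, Y=2)

# Marginal of X: sum the joint over all values of Y -- the discrete
# analogue of integrating out Y.
p_x = [sum(row) for row in joint]                       # [0.40, 0.60]

# Marginal of Y: sum the joint over all values of X.
p_y = [sum(row[j] for row in joint) for j in range(3)]  # [0.15, 0.50, 0.35]
```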
We're asking: given that this has happened, now tell me, what is the probability that this is my outcome? Can you learn these relationships? So this conditional probability is typically used to represent these "given that" types of questions. For example, one simple question you can ask is: given that the lawn is wet, what is the probability that it rained last night? Well, the lawn could be wet because maybe the sprinkler was on last night, or maybe it really rained. But you can still ask the question, right? What is the probability that it rained last night, given that the lawn is wet? So the fact that your lawn is wet is an observed variable. You have fixed the value of that random variable. There is no more uncertainty there. The lawn could have been dry. It could have been wet. But you have gone out and seen that the lawn is wet. Now, you're turning around and asking the question: given that I have seen that the lawn is wet, what is the probability that it rained last night? Another simple question is: given that it rained earlier, what is the probability that there will be a rainbow today? So all these sorts of questions are the "given that" questions, where you're saying that one random variable has taken a specific value -- I have done an experiment and I have seen that this random variable has taken a specific value, a specific outcome. Now, tell me, what is the distribution of another random variable? So this is called conditional probability. And the reason -- so let me just quickly cover this concept called Bayes' Theorem. I won't go too deeply into it, but I guess this will be covered in a lot more detail on the last day, when Chris Bishop and John Bronskill talk about modeling and inference. So take two random variables, D and H -- I'll come back later to why I chose this notation -- and suppose, say, this red circle shows all the possible values that the variable D can take, and the green circle shows all the possible values that H can take. And brown here, the intersection, is basically the region where D takes on a value and H also takes on a value within that brown circle. So red is basically all possible values of D, H is all possible values of -- sorry, green is all possible values of H, and brown is the intersection region. So now you can ask, what is the probability that D will occur, given that H has already occurred? As you can see here, it's just the ratio of the brown area over the green area: the probability of D and H divided by the probability of H happening by itself. You can also ask the other question: what is the probability of H, given that D has occurred? Here, it's again the overlap area, the brown area, but now you are asking what the probability is that H will occur given that D has already occurred, so you divide by the total possible values of D, the red area, P of D. >>: Are D and H in this case variables or outcomes? >> Vijay Narayanan: They are possible outcomes. That red area is the set of all possible outcomes of the variable D. So you can just combine these two and say the probability of H given D is the probability of D given H, times the probability of H, divided by the probability of D. Now, this is a little bit abstract right now, but really, this equation is Bayes' Theorem, and I'll just tell you why it's very important. H here -- think of it as your hypothesis.
Suppose somebody has given you some data, and you want to see, is my hypothesis correct -- so, say my hypothesis is that this credit card transaction is fraudulent. What you have is the data, which is all the transaction streams, the history of all the transactions that have occurred on this credit card, for example. Now, you're asking: given this data, what is the probability that my hypothesis -- which in this case is that this card has gone fraudulent -- is correct? So the probability of the hypothesis given the data: you're just turning it around and saying it's the probability of the data given the hypothesis, times the probability of the hypothesis itself, divided by the probability of the data occurring by itself. Okay, I know this seems to be a little bit confusing. I see a lot of people just staring. Yes. >>: This implies knowing the probability of the occurrence of the data by itself, which in many cases is hard to come by? >> Vijay Narayanan: Yes, so you can replace that with the sum over all possible hypotheses of the probability of this data being generated by each hypothesis. So this is also frequently put as follows. What you have is the data, and you are asking, what is the probability of this hypothesis being correct? That's typically referred to as the posterior probability. It's posterior because you are asking these questions after you have had the data, after you have collected the data. Before I collect the data, I don't have any intuitions about which hypothesis could be right or wrong. I might have some expectation saying, hey, look, fraud is inherently a lot less likely to occur, so maybe my probability of this card being fraudulent is not very high. But then, you actually collect the data, and then you can ask the same question again: now, tell me, how has my hypothesis changed? How has the probability of my hypothesis changed, given that I have observed this data? So that's called the posterior probability. And the probability of the data, given the hypothesis, is frequently referred to as the likelihood. Here, how likely are you to have generated this data, given that your underlying hypothesis was correct? The probability of the hypothesis is also called the prior probability, which, for example, in this credit card fraud case is my expected rate of fraud before I had any data. All right. So at some level, that's been a very broad overview of just some basic concepts in probability. Any questions? >>: What if I go through all the [indiscernible] all the ways in which the general data will be delivered? >> Vijay Narayanan: Yes. What if my hypothesis space is not easily -- >>: [Indiscernible]. >> Vijay Narayanan: So if you have absolutely no clue about the hypothesis space, then I suggest you get some domain understanding to be able to -- >>: What if I do not have a complete understanding? I can have a partial understanding. >> Vijay Narayanan: That's fine. So in that case, your modeling is only going to be as good as your hypothesis space is complete. We'll actually come back to this concept a little bit later. This shows up again as: how rich is your hypothesis space? What if you don't have a rich enough hypothesis space, or what if you have too rich a hypothesis space? Both of them are bad. We want to have some reasonable assumptions about what are possible hypotheses to test.
So there are some ways to constrain -- there are some ways to measure if your hypothesis is reasonably good or not, but you need to have some understanding of even what is possible, what hypotheses are even possible. At the same time, we don't want to make it overly complicated, either. Go ahead. >>: Can you go back to the previous slide for a second? I really like the concrete examples you gave for what prior probability, likelihood and posterior probability are. For prior probability, you said the expected rate of fraud before you had any data. What were the likelihood and the posterior probability again? >> Vijay Narayanan: The likelihood is: supposing your hypothesis is correct, how likely is it that this hypothesis led to the data that you actually observed? So you are basically turning around the probability of hypothesis given data -- look, what I have done here is, initially, I know nothing about which hypothesis is correct. I have some vague notion saying that, look, frauds are inherently less likely to occur, so it's probably not a 50% probability that this card is fraudulent. The probability may be about 1%. But you can then ask the question: look, here is a series of transactions I have seen. Now, what is the probability that this card is fraudulent, given that I see this level of activity, this pattern of activity on this card? So that is your posterior probability. The likelihood is: given that the card is actually fraudulent, what is the probability that you see this type of transaction stream, your observed transaction stream? So that's your likelihood. Okay, any questions? So I think we will stop probability at this level and take a look at some statistics. Any questions so far? All right. So let me start with statistics -- or should we just pause here, take a break and then come back at 11:00, John? >> John Langford: We can take a break. >> Vijay Narayanan: Take a break. Okay, so we'll cover statistics and machine learning after we come back, maybe at 11:00. All right. Thanks.
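To tie the prior/likelihood/posterior recap together, here is a quick numeric sketch of Bayes' theorem on the fraud example. The numbers are made up for illustration -- the session only suggested the roughly 1% prior:

```python
# Bayes' theorem on the fraud example: posterior = likelihood * prior / evidence.
prior_fraud = 0.01              # P(fraud): expected fraud rate before any data
p_data_given_fraud = 0.70       # likelihood: odd activity pattern if fraudulent
p_data_given_ok = 0.05          # same activity pattern on a legitimate card

# P(data): sum over both hypotheses, as noted in the Q&A above.
p_data = (p_data_given_fraud * prior_fraud
          + p_data_given_ok * (1 - prior_fraud))

posterior_fraud = p_data_given_fraud * prior_fraud / p_data
print(posterior_fraud)  # ~0.12
```

With these assumed numbers, seeing the unusual activity moves the belief from the 1% prior to roughly a 12% posterior: the data updates the hypothesis rather than settling it outright.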