Eric Horvitz>>: Okay. It's our honor today to have Eyal Amir with us, visiting from the University of Illinois at Urbana-Champaign. Eyal is a professor there and has been there for several years. Before that he did postdoctoral work at UC Berkeley, and right before that he did his Ph.D. at Stanford University working with John McCarthy, (inaudible) and others. I came to know Eyal through some of his innovative work coming out of his Ph.D. and shortly thereafter, linking logical reasoning with probabilistic methods and issues of separability of axiomatic (phonetic) bases. I thought it was very interesting work. He has gone on to do more central UAI work since then. He is the winner of a number of awards, including the Arthur L. Samuel award for the best computer science Ph.D. thesis at Stanford University. That award is often a harbinger of great things to come; I've known other winners of that award who have gone on to do great things. Today, Eyal will be talking about two topics: one, reachability under uncertainty, and the other, Bayesian Inverse Reinforcement Learning.

Eyal Amir>>: Thank you, Eric. Thank you all for coming and actually being in this room, and thanks also to the virtual audience watching this talk. Before I start, this is my group circa a couple of years ago. As always, the work that you're going to see is really my students' and only partially mine. Allen Chang here has since gone into the startup world to make a lot of money, but he is the contributor of half of this talk. The other contributor is Deepak Ramachandran, and next year he is going to be on the job market, so look forward to seeing him, perhaps, in person.

All right. I'll go now to the topic of this talk. I'm going to talk about two separate topics. They come from trying to cast traditional problems in a slightly different light. One of them is the problem of reachability in graphs, and the other one is Inverse Reinforcement Learning, learning from the actions of experts. The two talks are somewhat separate, but I'm going to try to tie them together towards the end.

Here is the first problem. It's something that we started working on about a year and a half ago. It's an interesting, very simple problem: you're trying to find reachability, or the shortest path, in a graph. We have a source and a sink. In this case the graph is directed; you can think about it as an undirected graph as well. You're trying to find the shortest path, the path that minimizes the sum of costs along the edges of the graph. This is a very typical, well-known problem, and there are various standard solutions for it. What we're looking at are generalizations of this problem, and here we're looking at cases where the edges may fail. Typically, if each edge has some probability of failing, then we can say that a path altogether is reliable if the product of the edge reliability probabilities is high. Now, there is a hidden assumption there, and people have used that assumption over the years: that the probabilities are independent, that the edges fail independently. We want to drop that assumption.
By the way, the motivation for this was a discussion that I had with some of the security people in my department, who were looking at finding paths on the Internet, or on any network, where the probability of the path failing is the least. So what we're going to do is make some dependence assumptions about that. We're going to look at one particular example of a model, and we'll see what we can do about it. I'm going to first present the notion of correlation that I'm going to use in this talk, and then I'll show why standard probabilistic techniques don't work. I'll show that the problem is actually NP-hard, and I'll show an approximation algorithm for solving it.

So here is the stochastic setting. Let's say that we have this very simple path, and we have random variables that represent whether those edges fail. Now, we're going to create an extra node here, and that will represent the correlation between those three random variables. Remember, each one of those variables represents whether the corresponding edge fails. This is a very simple graphical model; to those of you who are familiar with machine learning, this is a simple naive Bayes-like model. When you look at the distribution that governs the probability of the path being reliable, you notice that the path is reliable if the marginal here is high. The probability that edge one is reliable, edge two is reliable, and edge three is reliable -- that is the marginal probability that the path is indeed reliable. Notice that what we do to get this marginal is sum out the hidden variable X, the variable that we use to represent the correlation.

So here's one application of this, the application that we started with. Links may fail stochastically and routers may fail stochastically, and there is some correlation between the times that a link or router will fail. If two routers have the same version of software on them, then they will be susceptible to the same viruses, to the same problems. Similarly, if certain lines were built with the same material and so forth, you can see that those lines will be correlated. And the fact that we don't know whether those routers have the same kind of -- let's say, if we are not sure which virus is going to attack this system, or we're not sure about some other elements, those will be the hidden variables that we're going to include in the model.

>>: (Inaudible) same back, and now you're assuming (inaudible).

Eyal Amir>>: Right. I'll flip back to that model. Here I showed this on a path, but really it's a variable that controls all of the edges in the graph; it's just that I focused on a particular path. Okay. So this probabilistic model here is a representation of the joint probability that we will look at. The probability that edge one is reliable, and edge two, and edge three, and so on up to edge N, together with X, is the product of these terms (indicating). Actually, there's a missing conditional here: the probability of each edge E_i should be conditioned on X.
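As a small sketch of the computation being described, here is how the marginal reliability of a path can be evaluated under a single shared hidden variable. The function name and the numbers are illustrative, not from the talk's slides.

```python
# A minimal sketch of the model just described: one hidden variable X is shared by
# all edges, and a path's reliability is the marginal obtained by summing X out.
# All names and numbers below are illustrative.

def path_reliability(prior_x, edge_reliability, path):
    """
    prior_x:          dict mapping each value x of the hidden variable to P(X = x).
    edge_reliability: dict mapping (edge, x) to P(edge reliable | X = x).
    path:             list of edge identifiers.
    Returns P(all edges on the path are reliable) = sum_x P(x) * prod_e P(e reliable | x).
    """
    total = 0.0
    for x, px in prior_x.items():
        prod = 1.0
        for e in path:
            prod *= edge_reliability[(e, x)]
        total += px * prod
    return total

# Example with two hidden states and a three-edge path (made-up numbers):
prior_x = {1: 0.5, 2: 0.5}
edge_reliability = {
    ("e1", 1): 0.9, ("e1", 2): 0.4,
    ("e2", 1): 0.8, ("e2", 2): 0.7,
    ("e3", 1): 0.1, ("e3", 2): 0.9,
}
print(path_reliability(prior_x, edge_reliability, ["e1", "e2", "e3"]))
```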
All right. So here is another example of the use of such models. Let's say we're trying to look at parsing with weighted finite state automata. If we have some probabilities on transitions, and those edge probabilities are correlated, then we can use this kind of model to represent the system. And also robot navigation, if you're looking for the best, most reliable path. I'll give you an example from my Monday: if you're trying to find an airline route that is most likely to be on time, and you're considering two routes, one through Chicago and the other through Indianapolis, you can say, well, if Chicago has some weather issues, then Indianapolis will probably have some weather issues too, with some correlation between them. If the weather in Chicago is really nice, then the weather in Indianapolis will probably be nice as well.

Here is the difficulty with this new problem. I showed it originally to some theory people and they said, yeah, it should be solvable with some nice techniques. Here's what fails. In traditional algorithms, for the traditional problem, the optimal path that we get from the source to the sink also uses subpaths that are optimal. So if the path from S to V is the best path, and the path from V to T is the best path between them, then the concatenation is the best path that goes through V. Maybe there is a better path that goes somewhere else, but if V is like an island here that we have to go through, then we can concatenate the two best paths and get the best path overall. So we can essentially reuse those sub-problems. What happens with the stochastic variant of this problem is that, if I find the optimal path from S to V and the optimal path from V to T, their concatenation is not necessarily the optimal path overall.

Here is a very simple example, and then I'll give you some intuition for why this fails. I put some probabilities here. We have one hidden variable with two values, X equals one or X equals two, and this is path one (indicating), this is path two (indicating), and this is path three (indicating). Just to read this for you: the probability that path one is reliable, given that X equals one, is 0.9. We have this prior over X here (indicating): X equals one with probability one half, X equals two with probability one half. When we try to find the best path from S to V, we compute the marginal probability of success of this path; this is just summing X out of the joint for path one, and we do the same thing for path two and path three. So let me give you all of this and show you what happens. According to the summing out, the probability that path one is reliable is 0.65 and the probability that path two is reliable is 0.6, so path one is better. We have only one path here from V to T; that's path three, so we have to take it. Now, if we were going to use Dijkstra's algorithm or something like that, we would choose path one from S to V and then concatenate it with path three. But notice what happens: the probability of taking path one followed by path three is significantly lower than that of taking path two followed by path three.
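Here is a small numerical check of that counterexample. The talk states only P(path 1 reliable | X = 1) = 0.9 and the marginals 0.65 and 0.6; the remaining conditional values below are assumed for illustration, chosen to be consistent with those numbers.

```python
# Numerical sketch of the counterexample. Only P(path1 | X=1) = 0.9 and the
# marginals 0.65 / 0.6 are stated in the talk; the other conditionals are assumed,
# chosen to agree with those numbers.

prior = {1: 0.5, 2: 0.5}          # uniform prior over the hidden variable X
p1 = {1: 0.9, 2: 0.4}             # P(path 1 reliable | X)
p2 = {1: 0.3, 2: 0.9}             # P(path 2 reliable | X)   (assumed)
p3 = {1: 0.1, 2: 0.9}             # P(path 3 reliable | X)   (assumed)

marginal = lambda p: sum(prior[x] * p[x] for x in prior)
concat   = lambda a, b: sum(prior[x] * a[x] * b[x] for x in prior)

print(marginal(p1), marginal(p2))       # 0.65 vs 0.6: path 1 looks better on its own
print(concat(p1, p3), concat(p2, p3))   # 0.225 vs 0.42: path 2 + path 3 wins overall
```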
Now, why did that happen? There are two intuitions that I can give you. One is that when you concatenate paths, you do something very similar to what people in probabilistic reasoning and machine learning do when they exchange the order of a summation and a product. By exchanging the order of the summation and the product, you assume that these pieces can essentially be concatenated. In this case, that's not true; algebraically it's not always valid, but sometimes it's a good simplifying assumption that we make to get things practical.

Here's the other intuition for why it doesn't work. Look at what happens when we try to compute the probability of failure of the entire path: we should look at the case X equals one separately and the case X equals two separately, and then combine them. Here, instead, we first combined the solutions for X equals one and X equals two for this path, and then looked at a separate combination of X equals one and X equals two for that path. But they are different cases. This path will be really good when X equals one; this path will be really good when X equals two; and this one will also be really good when X equals two and really bad when X equals one. So overall, taking path one and then path three will almost always be bad: either it's bad in the first segment or it's bad in the second segment. Here, at least for one of the cases of X, it will be good. Yeah?

>>: (Inaudible.)

Eyal Amir>>: Right.

>>: (Inaudible.)

Eyal Amir>>: Right. But what is the value? Okay, you say, let's compute it for X equals one?

>>: (Inaudible.)

Eyal Amir>>: Right. But then what? Let's take that suggestion. For X equals one, we compute that the optimal path is this one; for X equals two, maybe the optimal path is this one -- sorry, the optimal path for X equals two is this one, and for X equals one this one is better. Then it's not clear how to combine those two paths. You could say, well, maybe one of them is the best. But you can find examples where the optimal paths for either value of X are both as far as you want from the overall optimum. Neither of them will even be close to the optimum.

>>: Given the (inaudible), given the distribution, there is nothing better to do. Eventually you can combine it by (inaudible) all possible hidden variables, but that's the only information (inaudible).

Eyal Amir>>: True, but we have a very specific optimization problem here. We're trying to find the optimal path, the path that maximizes this formula: the sum over X of the probability of X times this product. Okay. So what we tried to do is actually solve this, and what I'm going to show is that there is a way to approximate it.

>>: The probability (inaudible.)

Eyal Amir>>: I know the prior over X.

>>: So why not, for every possible value of X, compute the shortest path and then average with respect to the probability over X?

Eyal Amir>>: (Inaudible) I still need one path at the end. I find the best path for one value of X, I find the best path for another value of X, but I'm still looking for the single best overall path, the one that gives me the best expectation of success. Say I'm trying to send my truck from San Francisco to New York.

>>: (Inaudible.)

Eyal Amir>>: Right. Well, right. The full combinatorial solution would be just to try all of the paths between San Francisco and New York.

>>: They do in (inaudible.)

Eyal Amir>>: Right.

>>: Where you start from the end, go back (inaudible.)

Eyal Amir>>: So that's exactly what fails here. Dynamic programming doesn't work here; that's exactly the argument. What would dynamic programming do for us?
It would find the best solution up to Chicago, and then from Chicago the best path to New York. That's exactly what fails here. We tried all kinds of tree-width approaches, all kinds of dynamic programming; it all failed, and I couldn't believe why it would all fail until we actually showed that it's NP-hard. So let's examine why. The example gives us an intuition for why it won't work, but now let's see that it is indeed NP-hard. We cannot reuse solutions to sub-problems; that's the most important intuition that we got from this.

So here is a simple reduction. I see that the colors here are not going to be very favorable to me, but I'll try, and I'll read the first few lines. This is a reduction from string template matching. String template matching is the problem where you are given several string templates -- strings of ones, zeros, and stars, where a star means "don't care" -- and you're trying to find a bit string that matches at least K of the templates. In the case above, for K equals two the answer is yes. Let's see: there is nothing that will match both S1 and S2, because a one and a zero here don't agree, and there is nothing that will match both S2 and S3, because a zero and a one here don't agree. But there is something that will match both string template one and string template three. So this bit string is a solution for K equals two, and there is no solution for K equals three. That's one standard NP-complete problem.

What we showed is that this problem can be reduced to our shortest path problem, the stochastic shortest path. What we do is this: for every string template, we're going to have one hidden variable value, and we're going to ask, essentially, which templates a given string matches. If we're given a string, a bit is zero if the path takes the upper edge and one if it takes the lower edge; so if you give me a path in this graph, that path corresponds to a bit string. Now I'm going to ask how many of the templates that particular string corresponds to. Remember the hidden variable: if I have N templates, it will have N values. And I'm going to essentially count -- I'm going to ask, what is the path that gives me the highest probability in this graph? So if I have a path, for example, that covers K strings, then its probability will be proportional to K over N.

>>: (Inaudible) taking the locally optimal path, and -- I'm assuming that the distribution over X is uniform -- you want something that's, on average, the best, that matches or follows the locally optimal paths?

Eyal Amir>>: Right. Remember, this is not an algorithm; this is just a reduction. If you find a path that matches K strings, then its probability will be K times one over N, because one over N is the probability that you fit one bit string. It's not exactly one over N, but that's the intuition.
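One plausible way to formalize the gadget being sketched is the following; the exact constants are my assumption (the talk itself notes the true values are "not exactly 1/N").

```latex
% One plausible formalization of the reduction (the exact constants are assumed;
% the talk notes the true probabilities are "not exactly 1/N").
% Given templates s_1, ..., s_N over {0,1,*}^n, let X be uniform on {1, ..., N} and
% build a chain of n two-edge gadgets, where taking the upper/lower edge of gadget i
% encodes the bit b_i of a candidate string b.
\[
  P(\text{edge chosen at position } i \text{ is reliable} \mid X = j) =
  \begin{cases}
    1 & \text{if } b_i \text{ is consistent with the } i\text{-th symbol of } s_j,\\[2pt]
    0 & \text{otherwise,}
  \end{cases}
\]
\[
  P(\text{path } b \text{ is reliable})
  = \frac{1}{N}\sum_{j=1}^{N} \prod_{i=1}^{n} P(e_i \mid X = j)
  = \frac{\bigl|\{\, j : b \text{ matches } s_j \,\}\bigr|}{N} = \frac{K}{N},
\]
% so a source-to-sink path of reliability at least K/N exists if and only if some
% bit string matches at least K templates.
```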
So if each one of the strings you match accounts for one over N of your probability mass, then you're summing those probabilities over the strings you match, and the problem of deciding whether you can match K strings becomes the problem of deciding whether there is a path that reaches or surpasses a certain probability in this graph. All right. So this is the intuition for why the problem is hard. SAT can be reduced to the string matching problem, and the string matching problem can be reduced to this, so what you get at the end is that this seemingly simple combinatorial problem is NP-hard. Notice, by the way, where the hardness lies: it is NP-hard in the number of values of the hidden variable. So if we have N strings, it's NP-hard in that N. Actually, I'll write it down because this is an interesting point: it's NP-hard in the number of values of this hidden variable.

Okay. So here's an approximation scheme, and what I want to point out is that this approximation scheme doesn't really get around the NP-hardness; I'll explain in a second. We're going to assume for a moment that all the probabilities have this property: the minus log of the conditional probability of an edge, given a value of the hidden variable, is an integer between zero and some Q minus one. So we will have Q possible values for those log probabilities. What I'm going to do is build a dynamic program that uses those discrete values as its setup. The first thing to notice is that the minus log probability that a path is reliable, given a certain value of the hidden variable, is then also an integer between zero and N times Q minus one. As a result, we can use those integer values as our islands in this search.

So here is the algorithm, essentially. At each node we compute the Pareto-optimal solutions over a discrete set of values. I'll focus on this for a second, because it is really the main ingredient of this dynamic program. If we have D values for the hidden variable, then we consider this product of all the possible values between zero and Q minus one, one coordinate for each of those hidden variable values. So for example, I will have this: say, zero, one, one, five, three, two. These are the entries in that table. Remember, this quantity is the minus log of the probability of edge i given a certain value of X, but now, instead of a single edge, we're going to store the minus log probability of the best path that comes in with this value. So we have here one, two, three, four, five, six -- the domain of X has size six. For each value -- this is X equals one, X equals two, and so forth up to X equals six -- for each one of those I'm going to ask: what is the path?
Give me the best path that comes in with these minus log probabilities. So when X equals one, it gives me zero; when X equals six, it gives me two. Okay. So we're going to have this matrix, and for each entry I'm going to compute the best path -- one of the paths that comes in with these minus log probabilities -- and that's what I'm going to use for my dynamic program. Here's what happens.

>>: (Inaudible.)

Eyal Amir>>: That's right, it is. The number of things that I will have to evaluate is indeed exponential -- not in the number of nodes in the network, but in D, the size of the hidden variable's domain. So it doesn't really get around the NP-hardness.

>>: (Inaudible.)

Eyal Amir>>: True. Right.

>>: (Inaudible.) I mean, you find the approximation (inaudible.)

Eyal Amir>>: That's right. You immediately see the gap. We were not able to show that the problem is hard when the domain size is fixed, and we were not able to show an approximation for the NP-hard problem.

So here is how the algorithm goes. If we have computed the dynamic program up to a certain node, we can then use those cached paths, with their probabilities, to extend the computation; the important thing is that we use all of the paths that came in, all of those that we cached in our dynamic program. Remember how many we cached: something polynomial in N raised to the power D. All right. So this is a lot of combinatorics for the machine learning people in the crowd, but it's an interesting problem that really doesn't yield to the standard techniques that people use in machine learning.

>>: (Inaudible) dynamic programming needs (inaudible.)

Eyal Amir>>: Yeah. I can show you very simple examples where it fails badly -- as far as you want from the optimal.

>>: (Inaudible) performance of this?

Eyal Amir>>: Time?

>>: No.

Eyal Amir>>: The approximation? The approximation that you get is essentially this, in time that is proportional to one over epsilon: if opt is the reliability of the optimal solution -- remember, this is the probability that the path is reliable -- you get a path that is more reliable than opt to the power one plus epsilon. I assume that all of my probabilities are in those discrete buckets, so you would use that epsilon to round them in some way. Okay.

>>: (Inaudible.)

Eyal Amir>>: Opt is less than one. Remember --

>>: (Inaudible.)

Eyal Amir>>: Oh, I see. One minus --

>>: (Inaudible.)

Eyal Amir>>: One minus epsilon times --

>>: (Inaudible.) Is it the same here?

Eyal Amir>>: I have to say, I don't know. My intuition is that this is worse than one minus epsilon times that, but we can sit down afterwards and just write it down. It's very simple.

>>: Is the probability worse?

Eyal Amir>>: It's worse.

>>: (Inaudible) very small, very low (inaudible.)
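The dynamic program being described can be sketched roughly as follows, assuming the minus-log edge reliabilities have already been rounded to integers in {0, ..., Q-1}. The function and variable names are mine, and this is a schematic sketch rather than the authors' implementation; as noted above, the table at each node can grow exponentially in D, the domain size of the hidden variable.

```python
import math

# Rough sketch of the Pareto-front dynamic program described above; not the
# authors' implementation. Assumes -log P(edge reliable | X = x) has already been
# rounded to non-negative integers (the "discrete buckets" mentioned in the talk).

def most_reliable_path(edges, source, sink, prior, logcost, num_nodes):
    """
    edges:   list of directed edges (u, v, edge_id).
    prior:   list of D floats, P(X = x) for each hidden-variable value x.
    logcost: dict edge_id -> tuple of D integers, rounded -log P(edge reliable | X = x).
    Returns (reliability, path) for the best source-to-sink path found.
    At every node we keep one path per Pareto-optimal (componentwise minimal)
    vector of accumulated -log costs, one component per value of X.
    """
    D = len(prior)
    table = {source: {(0,) * D: [source]}}        # node -> {cost vector: path}

    def insert(entries, vec, path):
        # keep only non-dominated cost vectors at a node
        if any(all(old[i] <= vec[i] for i in range(D)) for old in entries):
            return False
        for old in [o for o in entries if all(vec[i] <= o[i] for i in range(D))]:
            del entries[old]
        entries[vec] = path
        return True

    for _ in range(num_nodes - 1):                # Bellman-Ford-style relaxation rounds
        changed = False
        for u, v, e in edges:
            if u not in table:
                continue
            c = logcost[e]
            entries = table.setdefault(v, {})
            for vec, path in list(table[u].items()):
                new = tuple(vec[i] + c[i] for i in range(D))
                changed |= insert(entries, new, path + [v])
        if not changed:
            break

    best = (0.0, None)
    for vec, path in table.get(sink, {}).items():
        # reliability = sum_x P(x) * exp(-(accumulated -log probability under x))
        rel = sum(prior[x] * math.exp(-vec[x]) for x in range(D))
        if rel > best[0]:
            best = (rel, path)
    return best
```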
>>: The way you labeled the problem, it might be interesting to see more probabilities here, because you have nice results in random graph theory on reachability, based on things like critical values and phase (inaudible). Imagine defining a problem where you have some dependency with some named distributions, and seeing whether you can actually prove something about critical values and dependencies and so on. It's sort of an interesting playground for a new set of problems: not necessarily solving reachability directly, but looking at particular values, critical thresholds like the ones mentioned. It sounds like a very tough area to work in, but it's interesting. Have you thought about that at all -- the results in random graph theory on reachability, the parameters, and what kind of class of problems you could define for that community (inaudible) in this model?

Eyal Amir>>: There are different dependence models for different applications. The answer to your question is no, I haven't gone in that direction. But there really are different dependency models that one would like to have, like the dependencies between different roads that I'm going to use through Chicago, or dependencies on many different hidden random variables that affect the system.

>>: (Inaudible) we have some interesting routing problems that might -- depending on -- you have traffic issues that are changing, and depending on your forecast of your waits over time, (inaudible) the world you reach at each segment when you finally get there. And depending on the (inaudible) dependency of what will happen, given where you go, that might map onto (inaudible) this problem.

Eyal Amir>>: One interesting piece of gossip about this: we presented it at UAI last year, and people asked me why we didn't submit it to a theory conference. My answer was that people in theory were interested in this, but to them it's not a fundamental problem, whereas the AI community is more interested in those dependency structures between random variables. The UAI community, or the AI community, is not as strong at theory and combinatorial algorithms, but this is a set of problems that are of real use.

>>: (Inaudible) people are having a hard enough time with independence right now.

Eyal Amir>>: Right.

>>: (Inaudible) the complexity.

Eyal Amir>>: All right. So let me just summarize this part, and then I'm going to switch quickly to the second part -- and again, try not to get you too tired. By the way, you can still ask me questions; I love any questions that you have. So here it is. What we've shown so far is that there is an interesting, different model for reachability. One can use different uncertainty models, and it's a very interesting problem where a lot of different techniques come into play. It is NP-complete, but we still have a very important gap between the hardness, which depends on the size of the hidden variable's domain, and the approximation scheme we have, which is exponential in the size of that domain.

All right. The second part of the talk is completely different and more in line with machine learning theory. It's a different problem, where we're looking at Inverse Reinforcement Learning and trying to add knowledge to it in the form of a probabilistic prior on top of the learning problem. Just to explain what Inverse Reinforcement Learning is, here is an example. We have this game -- an adventure game where a player moves around in a dungeon, collects treasures, avoids monsters, and so forth. If I look at somebody who plays that game, I can learn a little bit about how the person solves the game.
And the interesting problem from the perspective of AI is how to really learn there, and what you are learning as a result. What people typically do is pose the problem as a Markov decision process -- I'm going to describe that in a bit -- and more recently people have tried approaches that learn those models, the Markov decision processes, and that's what we're going to do here. We're going to describe a Bayesian model for Inverse Reinforcement Learning -- learning a model of the expert's choice of actions -- and we're going to use that model to find the most efficient or most rewarding set of decisions in that domain. Our intuition here is that we would like, in the learning process, to incorporate some domain knowledge, something that will help us learn a little better from that teacher or expert, or from what we observe.

Here is the model that we're going to use. It's a very standard model in machine learning: a Markov decision process, where we have a set of states, transitions, actions, and a discount factor. We assume that the expert uses a policy that maximizes his expected future reward, which is really the sum of the reward from the current state, the next state, the state after that, and so forth, with some discount: the farther we are in the future, the more discount we apply. This Gamma here is less than one. Yeah?

>>: (Inaudible.)

Eyal Amir>>: Right. So our problem here today will be to actually learn the reward structure.

>>: (Inaudible.)

Eyal Amir>>: Right. We know that the expert uses a reward function. We're going to assume something about that reward function, and we're going to use those assumptions to learn a reward function that is more likely to be the correct one for the expert.

>>: (Inaudible.) It's okay to assume that people (inaudible), but when you say that you want to learn a reward from an expert, you're saying that the reward is not something that is directly observable; it is something that you can play with.

Eyal Amir>>: Right. I mean, we did not come up with the Inverse Reinforcement Learning problem; it was originally posed by Andrew Ng and Stuart Russell about ten years ago. Their motivation -- in Andrew's case -- was: I want to learn a model that will allow me to control, say, a helicopter or some unmanned aerial vehicle. I don't get those rewards; I can only see what the expert is doing. I don't know the expert's reward structure. I could posit some rewards, but I would really like to know the rewards that person is using, so that I can use those rewards in my solution of the problem. There are those cases, and in fact, sometimes what people do in Inverse Reinforcement Learning is assume some reward structure, either by guessing or by smoothing some threshold reward function. Instead of doing that, what we are trying to do in this work is to learn the reward of that expert. We see the expert doing these actions, we don't know what the reward is; let's try to learn that reward.

>>: (Inaudible) what are you maximizing?
If you get to choose the reward, then I choose that the reward will always be one million, and then I maximize, right?

Eyal Amir>>: Right. So there will be two optimization problems that we are going to solve here. One is to have minimum distance from the real reward of the expert. The other is to maximize the expected rewards: we don't know what those rewards are, but we would like to maximize the expectation of our future rewards. All right. I'm going to give a few more details and explain that a little more.

I'm going to use this Q function that is typically used in Reinforcement Learning. For a policy pi -- a policy is the way that you choose your actions -- it represents the expected reward for a certain state and a certain action that you perform in that state. Now, we're going to modify this Q function and add an extra argument R for the reward; this reward function will be part of our Q function representation. Remember, what we're going to try to do is learn that R; but if you give me R, here is how that Q function is evaluated.

All right. So here are our assumptions -- and these are assumptions that were actually made before this work; we are going to relax them in this presentation. One is that the expert X tries to maximize the accumulated expected reward, and the second is that X executes a stationary policy, and that this stationary policy was learned by a Reinforcement Learning algorithm. These are assumptions about the expert. Previous work also assumed, by the way, that the expert is doing something optimal. We're going to drop that here: we're going to allow the expert to be suboptimal, and we're also going to have some prior over what the rewards could be. I'll give some examples in a couple of slides.

So let's say that we are given this prior over rewards, and now let's analyze what one could do with that prior. I'll write it down here, for example: the reward of being in a state where you have the gold, or where you get the gold, is, say, the number 10 -- if you have never seen anything like this, this is just to give you some intuition -- and the reward for no gold is, say, minus one. Sorry, this is a little low here; let's see how to raise this. This is just an example. The reward function is a function from states, or from state-action pairs, to real numbers; it doesn't matter, most of the time you can translate from one model to the other.

So, as I said, we have a prior over those rewards. We don't know what the rewards are, but we have a probability distribution that governs the reward function. And we're going to observe some things: the observations are the actions that the expert took in particular states. In state one he did action one, in state two he did action two, and so forth. We're going to assume that, given that reward function, all of those actions are chosen independently.
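In standard MDP notation, the setup just described can be written as follows; the symbols here are my choice, not necessarily the ones on the slides.

```latex
% The setup in standard notation (the symbols are my choice, not the slides').
% MDP: states S, actions A, transition model T(s' | s, a), discount \gamma < 1,
% and an unknown reward function R. For a policy \pi, the Q function with the
% reward made explicit is
\[
  Q^{\pi}(s, a, R) \;=\; R(s, a) \;+\; \gamma \sum_{s'} T(s' \mid s, a)\,
  Q^{\pi}\!\bigl(s', \pi(s'), R\bigr).
\]
% The expert's demonstration is O_X = \{(s_1, a_1), (s_2, a_2), \dots\}, and the
% actions are assumed conditionally independent given the reward function:
\[
  \Pr(O_X \mid R) \;=\; \prod_{i} \Pr(a_i \mid s_i, R).
\]
```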
What does that mean? If somebody already gave that reward function to me, then the probability that the expert chooses action one in state one is independent of the probability that the expert chooses action two in state two. Okay, it's a very simplifying assumption. In some ways it even lets you mix in many different reward structures, many different experts; it doesn't matter -- just assume this. What it allows us to do is break the probability into a product: the probability of all my observations, all of the things I observed the expert do, is the product over the separate actions the expert took.

So here is what this boils down to. For every state and every action, the probability of that action given the reward can be represented with this simple exponential form: a normalizing factor times e to the Alpha times the Q function I mentioned before. Remember, we don't know R, but if we were given R, we would be able to compute this Q function. What is this Alpha? It's a measure of the expertness of the person giving the demonstration: we allow the person to be wrong, and Alpha measures how often he will be wrong. Okay. So, since the choice of the action at state i, given R, is independent of the choices at other states, the probability of all of the observations together is just this exponential with Alpha times the sum of those Q values in the exponent. It boils down to a nice formula. And what we can do with that is invert it using Bayes' theorem: now we can compute the probability of R given the observations. That probability is just some normalization factor times this likelihood, and now we can include the prior here.

So far I've just done very simple manipulation of those formulas. The real question is what we get from this, and how we get the optimal reward. There are two optimization criteria that we try to satisfy in different situations. One of them is learning the rewards, where we can use a loss function -- L1, L2, or whatever is most useful in your particular case. That's one problem we would like to solve. The other problem, in other cases, is to find the best policy: the policy that minimizes the expected loss according to whatever loss function we use here; you can use whatever norm you like in this policy loss, the loss for the policy given the reward. These are the two optimization problems we would like to solve.
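Written out, the likelihood, the posterior, and the two optimization criteria just described look roughly like this; again the notation is mine, and the norm ||·|| and the policy loss L(π, R) stand for whichever loss functions are chosen.

```latex
% The likelihood and posterior just described (Z, Z' are normalizing constants):
\[
  \Pr(a_i \mid s_i, R) \;=\; \frac{1}{Z_i}\, e^{\alpha\, Q^{*}(s_i, a_i, R)},
  \qquad
  \Pr(O_X \mid R) \;=\; \frac{1}{Z}\, e^{\alpha \sum_i Q^{*}(s_i, a_i, R)},
\]
\[
  \Pr(R \mid O_X) \;=\; \frac{1}{Z'}\, e^{\alpha \sum_i Q^{*}(s_i, a_i, R)}\, P(R).
\]
% The two optimization criteria: estimate the reward minimizing the expected reward
% loss, or the policy minimizing the expected policy loss, under this posterior
% (the specific loss/norm is a modeling choice):
\[
  \hat{R} \;=\; \arg\min_{R'} \; \mathbb{E}_{R \sim \Pr(R \mid O_X)}\bigl[\, \lVert R - R' \rVert \,\bigr],
  \qquad
  \hat{\pi} \;=\; \arg\min_{\pi} \; \mathbb{E}_{R \sim \Pr(R \mid O_X)}\bigl[\, L(\pi, R) \,\bigr].
\]
```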
So far we have just posed a model and stated the optimization criteria; now we would like to solve this. There is essentially one main solution method, with two different cases. The main solution we're going to use is an MCMC sampling technique, where we essentially evaluate Q-star and then use it to decide which reward function to move to next.

So what we could do is change the policy as we go. Say we are looking for the optimal policy: rather than recompute Q-star from scratch, we can cache some of its values and reuse the policy that we computed with the previous R. Now we perturb R a little bit, and once we change R a little bit, not all of the Q values change; so we can use the previous computation of the chosen policy to estimate the new Q values.

>>: (Inaudible) keep track of (inaudible) policy?

Eyal Amir>>: Let's say that for a particular reward function we compute the optimal policy.

>>: (Inaudible.)

Eyal Amir>>: That's right. Now we're going to use MCMC to perturb this reward function a little bit. We choose a different reward function and evaluate the probability that it is the reward function, using the model that I presented. I cannot optimize the reward function in closed form from that formula, but I can evaluate it: for every reward function, I can evaluate its probability given the observations that I had. So I evaluate the new reward function that I'm proposing, and with some probability I move to that reward. Now, the problem is that every time I try an MCMC step, I need to evaluate this Q value -- remember, Q is a participant in the computation of the probability of R given the observations of the expert's actions; that was the exponential with the sum of Q values. So I need to use this (indicating) in order to evaluate this (indicating). If, every time I try to make a step in my MCMC, I have to spend a lot of time computing that Q function, that's a problem; so it's important to do something smarter than brute-force computation.

>>: (Inaudible.)

Eyal Amir>>: Say it again?

>>: So you are trying to estimate the reward function given observations, (inaudible) the Q values in terms of (inaudible)?

Eyal Amir>>: Right.

>>: But your actor or your tutor has been following a different policy (inaudible) is there an optimal one?

Eyal Amir>>: You're right. The tutor may have followed his own policy. I don't know what the optimal policy is, and it might not be the tutor's policy. But I'm going to try to estimate the reward function, and I'm going to use that estimate later in choosing my own optimal policy. I don't know whether his policy is the optimal one; for the two different optimization problems I'm going to do a different kind of estimate. But here, if I'm looking for the most likely R, I'm going to need to use and update the policy, because the Q value really depends on the policy, and I don't know what the expert's policy is.

>>: So you're not necessarily --

Eyal Amir>>: I just see --

>>: There are two policy estimation problems here: one is an estimation (inaudible), and then there's another estimation policy?
Eyal Amir>>: No. I'm not trying to estimate the expert's policy; I just see the actions that the expert took. I'm going to try to estimate my own optimal policy given my prior, because I don't know what the rewards are. All right. So what happens is we compute Q, and then we use that to compute the probability of the R that we're testing right now, and we use that in a hill-climbing -- it's not really hill climbing, it's an MCMC step -- to get to an estimate of the most likely R.

Some interesting things about this procedure. I described it in only a few words, but what happens with MCMC is that you create samples: you take many steps of this perturbation of the reward function, and after, say, a thousand steps, I reach some reward function and take it as a sample. That's one sample. Then I run the chain again and again, many times, until I have generated samples of my reward function. Using those samples, I can then estimate the most likely reward function; I can also estimate, say, the policy with the maximum expected sum of future rewards. I can use those samples for whatever I want. But in this MCMC procedure, the basic step really is generating those samples of rewards.

There are two interesting things that I want to say about the sampling method. One is that when we have a uniform prior over the rewards, it turns out that this Markov chain mixes rapidly, so you don't need to take too many steps to create the samples for your estimation problem. Typically, you don't know how fast your Markov chain will mix, so you have to run it and just hope that it will be fine. This is our result, but it really derives from somebody else's result on sampling from uniform distributions over convex regions; this is the citation that we use. The second thing is that even without the uniform prior, in experiments this converged relatively well, relatively fast, and to solutions that are better than standard IRL without such a prior.
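The sampling loop just described might look roughly like the sketch below. This is a schematic illustration, not the authors' implementation: the MDP representation, the value-iteration helper, and all names here are my own assumptions, and the caching trick from the talk (reusing work across perturbations) is only hinted at in a comment.

```python
import math
import random

# Schematic sketch of the MCMC reward-sampling loop described above; not the
# authors' code. The MDP is assumed to be given as dictionaries, and `states`
# is assumed to be a list.

def compute_q(states, actions, T, R, gamma, iters=200):
    """Plain value iteration returning Q[s][a] for a state-based reward R (dict state -> float)."""
    V = {s: 0.0 for s in states}
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(iters):
        for s in states:
            for a in actions:
                Q[s][a] = R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
            V[s] = max(Q[s].values())
    return Q

def sample_rewards(states, actions, T, gamma, demos, alpha, log_prior,
                   n_samples, burn_in=1000, step=0.1):
    """Random-walk Metropolis over reward functions; returns a list of sampled reward dicts."""
    R = {s: 0.0 for s in states}

    def log_post(R):
        # In practice one would warm-start the MDP solver from the previous policy,
        # as described in the talk, instead of solving from scratch every step.
        Q = compute_q(states, actions, T, R, gamma)
        return alpha * sum(Q[s][a] for s, a in demos) + log_prior(R)

    lp = log_post(R)
    samples = []
    for t in range(burn_in + n_samples):
        R_new = dict(R)
        s = random.choice(states)
        R_new[s] += random.uniform(-step, step)                # perturb one reward entry
        lp_new = log_post(R_new)
        if random.random() < math.exp(min(0.0, lp_new - lp)):  # Metropolis accept/reject
            R, lp = R_new, lp_new
        if t >= burn_in:
            samples.append(dict(R))
    return samples                                             # use these to estimate R or a policy
```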
So here I compared two versions of Bayesian IRL, one with Q-learning and the other one with k-greedy. What you see here: these two lines are the expected reward loss -- this is when we were trying just to learn the rewards. This is the standard IRL and these are the two Bayesian IRLs. Just to explain the plot, the lower the curve, the better, because what we are looking for is something with zero reward loss. So what you would --

>>: (Inaudible.)

Eyal Amir>>: Sorry. N is, I think, the size of the domain that we experimented with -- yes, the number of states. And here I want to show you the policy loss -- (off microphone) Sorry, just a second. And these are the same curves, but now for finding the policy that maximizes the expected sum of future rewards. Again, you don't know what the reward is, but given the expected reward with these priors, this is what we get.

>>: You said that the (inaudible) expert is suboptimal, right? You said that earlier.

Eyal Amir>>: Right.

>>: So what you're showing here is the difference from the policy of the expert, whoever the known expert is, or from the optimal policy?

Eyal Amir>>: This is the difference from the optimal policy. Actually, we didn't try it with different Alphas; we used a fixed one, I think a relatively high one, but I know we didn't change the Alpha. But this is the difference from the optimal policy in that particular domain. We tried it on two different domains; I think these graphs are from the adventure game example.

I want to show you one more graph from this adventure game. Here is one more example of a prior. We used an (inaudible)-model-like prior where, say, when you have the treasure, you have high reward -- this is something we can assume as a prior over the reward -- and then we correlate states that are adjacent: if two states are adjacent, supposedly they will have similar rewards. We applied this kind of prior to the adventure game, and you can see that with this nice prior you get, again, a better reward loss compared to our expert that is trying to play the game and knows what it is doing when it plays that game. Over there we used, I think, a shaped reward function. So this is just a very simple measure of how the reward loss came out a little better.

All right. I said I would try to tie things together. I've given you about an hour of talk, so let me just summarize what we've shown in the second part. We tried to generalize the Inverse Reinforcement Learning problem and introduce different priors. The motivation was both to allow us to learn reward functions when we have some knowledge about the domain, and, secondly, because in Inverse Reinforcement Learning you usually get relatively little data and you have a convex domain with many possible solutions; choosing one of those solutions over another is a matter of heuristics unless you have some prior knowledge over R, and that's what we tried to address here. The computational problems are still pretty big, but for the small domains that we experimented with, it was manageable. So that's it. If you have any more questions, I'm happy to answer them. Thank you.

(Applause).

>>: (Inaudible.)

Eyal Amir>>: Right. So we have papers on this. The first one is in UAI, Uncertainty in AI '07, and the other one was in IJCAI of last year as well. This is number one (indicating), this is number two (indicating).

>>: I think -- I'm not sure, but even for the first problem, even if you have only two options, it's already (inaudible)?

Eyal Amir>>: You think so?

>>: I have some kind of proof (inaudible.)

Eyal Amir>>: I would love that. That would be interesting. All right. Thank you very much, then.