
>> Eric Horvitz: Okay. I think we'll get started. We're excited to have Emma Brunskill here again. She visited us maybe, what, eight, nine months ago?

>> Emma Brunskill: Yeah, last year.

>> Eric Horvitz: For a talk. And is now actually interviewing for positions around the country. We hope to impress her with Microsoft Research as being an interesting platform and a promising one for her future while she's here today and tomorrow. Emma is an NSF mathematical sciences postdoc at UC Berkeley, where she's working with Stuart Russell, an old friend, and others.

She did her PhD in computer science at MIT. And before that, she did her master's in neuroscience at Oxford as a Rhodes Scholar. And she's interested in core AI. I think she would bring to the table some interesting interests: in planning under uncertainty, sequential decision making under uncertainty, the POMDP area, even robot interaction, which is a growing area on our team. And education, as well as information and communication technology for development, which I think are two areas that we could use some bolstering in; particularly education, I think, is very promising.

So with that, I'll let Emma take over here on leveraging structure to efficiently make good decisions in an uncertain world. Emma.

>> Emma Brunskill: Thanks, Eric. I'm real excited to be here. So thanks everyone for coming. I know it's fairly early still in computer science time.

So as Eric said, I'm primarily interested in how we make sequences of related decisions, when the outcome of each of those decisions is uncertain, in order to achieve a good outcome. And this is known as sequential decision making under uncertainty. For example, you could imagine trying to sequentially select which of a new set of advertisements to display to users, given that you don't know what the click rate is for each of those advertisements.

You could also think about how you would compute a strategy for harvesting a natural resource like fish, where the population of fish changes over time and is hard to observe.

These instances of sequential decision making under uncertainty come up in a wide range of other problems as well, in things like personalized user interfaces, healthcare, and a wide number of other applications.

I'd also argue that this is really a fundamental challenge in artificial intelligence.

So to me, our ability to make sequences of decisions under uncertainty is really a core part of what makes us smart. And so tackling these types of challenges is a core aspect of AI.

If we want to make sequences of decisions when part of the world is hidden, and I'll quantify that a little more clearly in a moment, these problems are formally hard. They lie in PSPACE or EXPTIME, which means that it's going to be very hard for us to solve these types of problems in their full generality. And, in fact, most artificial intelligence planners have generally handled very small problems, or fairly small problems. There are obviously some exceptions to that.

But part of my inspiration has really been, well, what could we do if we could scale up to enormous problems? What types of things would that enable? And to me it would enable some really exciting types of applications. So for example, I think we could do things like huge security and surveillance operations, over things like where to patrol and when to raise alerts. I think we could create better physician's assistants that would help physicians make decisions over the course of a patient's lifetime, in order to make sure that that patient maximizes their health and well-being over the course of 50 to 70 years. And I think we could really revolutionize education through the use of personalized tutoring assistants.

So my approach to trying to scale up to these types of problems is really to leverage structure. And in particular, I've been focusing on how we pick models that well approximate these types of real world problems but also provide us with significant amounts of computational leverage, which has enabled us to scale up to some problems where we couldn't previously achieve good performance.

So I'm going to say a quick word about the approach that I have to these type of problems. The first is that I think that formal performance guarantees are important. And the reason I think that these are particularly important in addition to being theoretically appealing, is that if we're looking at really large domains, it's going to be impossible for us to simulate all the types of scenarios that might occur. And so in order for us to have confidence in the type of systems we develop, formal performance guarantees can give us some measure of promise and confidence in these type of approaches.

The second thing I think is important is to actually do real world instantiations. I've just told you the way I'm going to scale up to some of these large problems is by leveraging structure. But the way I'm getting that structure is by being inspired by particular real world domains, and we need to make sure that the assumptions are actually correct. And we can only do that by going back into the real world and seeing whether or not the approaches we developed are actually creating good solutions in those cases.

All right. For some of you this is going to be a review, but this is a bit of a mixed audience, so I'll just briefly do an introduction to Markov decision processes.

So a Markov decision process can be specified by a tuple, where S is a set of states. So the state in this case for a sailboat is its location, and the set of states is all the possible locations of the sailboat.

>>: This is Jim Gray missing?

>> Emma Brunskill: It is not. It's actually the Spice Trader example. That would be sadder.

>>: [inaudible] when he went missing I look at the blog that was created, I did create kind of a --

>> Emma Brunskill: A Markov [inaudible].

>>: Analysis that I suggested would be [inaudible] so this [inaudible]. [laughter].

I was looking at the concurrence [inaudible].

>> Emma Brunskill: Exactly.

>>: Triage based on that. If you [inaudible] satellite photos, [inaudible].

>> Emma Brunskill: Right. It's locality.

>>: In fact, then I said that let's condition the fact that he's alive and given that he's alive what would you do.

>> Emma Brunskill: Right.

>>: [inaudible] worry about right now.

>> Emma Brunskill: Absolutely. Yes. The optimistic planning, definitely is a good way to go.

So, just as Eric had mentioned, we're going to assume that the dynamics of the world are stochastic. So we have actions that can change the state of the world, but this is not going to be a deterministic process, because of things like weather and current. So we're going to maintain a probability distribution over what state we might end up in after we take an action starting from a particular state.

And then we're going to have a reward function. And this is going to be the reward we get from taking a particular action in a particular state. So if we go over to Indonesia and we collect nutmeg that's going to be a high reward. If we get attacked by Somali pirates, that's going to be a low reward. And then we're going to have a discount factor which is going to tell us how much we care about immediate reward versus reward in the future.

And within this setting we're really interested in creating decision policies, which are mappings from states to what action or decision we should make. And I'm going to be particularly interested in optimal policies, which are defined by: what is the sequence of decisions we could make in order to maximize our expected sum of future rewards?
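In symbols (a standard way of writing this; the notation is mine, not from the slides), a policy \pi maps states to actions, and the optimal policy maximizes the expected discounted sum of rewards:

\pi^* = \arg\max_{\pi} \; \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \Big].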

Now, in some cases we won't actually know what the state of the world is. This is particularly true for our sailors. They did not have GPS. So they could only get local observations. And in this case, we can augment the MDP framework with two additional variables. One is a set of possible observations, and the other is the probability of receiving those observations given you've ended up in a particular state after taking a particular action.

Now, in this framework we can't make decisions based on the state we're in, because we don't know the state we're in, but we can maintain a probability distribution over what state could we be in in the world. And we will -- this has to sum to one because we have to be somewhere in the world.

And in the discrete setting, that means that we're going to have an S-dimensional continuous-valued vector, which is just this probability distribution over where you might be in the world. And we're going to base our policy on that probability vector, and then figure out what to do based on where we think we might be in the world.
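A minimal sketch of the belief update just described, assuming a discrete state space with known dynamics and observation models (the array layout and names are illustrative, not from the talk):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One Bayes-filter step for a discrete-state POMDP belief.

    b : length-S probability vector over states (sums to 1)
    T : T[a][s, s2] = P(s2 | s, a)   -- dynamics model
    O : O[a][s2, o] = P(o | s2, a)   -- observation model
    """
    predicted = T[a].T @ b            # predict: where might we be after action a?
    updated = O[a][:, o] * predicted  # correct: weight by likelihood of what we saw
    return updated / updated.sum()    # renormalize so it stays a distribution
```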

Now, there are a couple of different axes in terms of these types of decision scenarios. One is fully observable versus partially observable, which I just discussed, and another is whether or not you know the model parameters for your reward model and your dynamics model.

And so in this talk I'm going to talk about two different types of decision scenarios. One is reinforcement learning, where we're going to assume we know where our ship is, so the state is fully observable, but don't know what our model parameters are.

And then in the second case, I'm going to talk about when our state is partially observable but we know what the model parameters are in our world.

And I'm going to spend most of my time today talking about three different contributions of thinking about how we could leverage structure to scale up to some very large problems. And the first thing I'm going to talk about is how we can do fast exploration in large continuous state Markov Decision Processes with typed noisy-offset dynamics.

So this is the problem. We would like to figure out a decision policy for acting in order to maximize our expected sum of future rewards, but we don't know what those model parameters are. We don't know the dynamics of the world, and we don't know the reward model.

Now, one thing you could imagine doing in this scenario is representing a distribution over all the possible model parameters that could be out there, all the different transition dynamics and all the different reward models. And you could compute a policy with respect to that current distribution and the future possible distributions and try to optimize that exactly. And while that's really appealing, that's PSPACE-hard to do for general problems. So it's going to be an intractable thing for us to do in the general case.

This is the intuition for that. So there are sort of two types of things that we're trying to do if we don't know the model parameters. The first is we're trying to get experience in the world that allows us to estimate them. So you could imagine that if you were a spice trader and you were just starting off, you might sail all over the world in order to try to figure out where there are good spices and what the best routes are to get there. And as you move around in the world, you're going to be getting experience that you can use to estimate those model parameters, to estimate where there are good spices and what the dynamics are in the Indian Ocean and the Atlantic Ocean, so you can figure out how best to plan.

But eventually your boss gets kind of frustrated and says well, we have to start making money sometime soon. You can't keep going to Antarctica, and so at some point we're going to want to use the information we have to converge on some model parameters and then decide on what's the best route to go in and where are the best spices.

And so most work has focused on approximations to trying to do this tradeoff between exploration and exploitation. And in particular, some of the original work focused on asymptotic guarantees which says that in the limit of lots and lots and lots of experience and much time, then we're guaranteed to converge to a policy which is identical to the policy we would take if we actually knew what those optimal parameters -- or those hidden model parameters were.

>>: [inaudible]. So any model parameters, are they in truth the parameters of the [inaudible].

>> Emma Brunskill: In the general case they do. In the case that I'm going to talk about shortly, they don't, but in the general case they do. But this only provides asymptotic guarantees, and that's not very satisfying, because that's sort of saying we're going to visit Antarctica 40 times and we're still going to be checking there for nutmeg, you know, in years to come. And so there's been a lot of interest recently in probably approximately correct approaches, which essentially give us a bounded amount of time to do exploration.

So these say on all but a finite number of samples we're going to choose a decision whose value is epsilon close to the optimal -- the value of the optimal decision that we'd take if we actually knew the model parameters.

And these were created by people in the late '90s and early 2000s.

And the question that these are really trying to answer is how many samples do we need in order to build a good model that we can use to act well in the world?

So what do I mean by good model? Well, we're going to be estimating our model parameters from experience. And in general the more experience we have, the closer our estimated parameters will be to the true underlying model parameters with high probability.

In terms of acting well in the world, if we actually knew what those model parameters were, we could compute an epsilon optimal policy. So we have techniques for solving Markov decision processes to compute a policy if we actually know what those model parameters are.

And if we can bound the difference between what our true model parameters are and what our estimated model parameters are, we can then propagate that through to bound the error in computing a decision policy using our estimated model parameters instead of our true model parameters.

And so we can combine these two ideas together to talk about something called sample complexity, which is essentially the number of time steps on which we might not act well. This is kind of our budget of exploration, and sort of how much evidence we need in the world until our estimated model parameters will be close to our true model parameters and we can compute a good policy.

And in this prior work, the sample complexity that they computed was a polynomial function of the number of states in the environment.

Now, the challenge is that in a large number of domains the state is specified by a multi-dimensional continuous value. So in the case of, say, a quadruped robot, you have a 15-dimensional state space to specify all the joints. In the case of something like our sailboat or your car, often the state is specified by GPS coordinates. But if you have continuous-valued state spaces, then you effectively have an infinite number of states, which means that these prior approaches are going to be exploring for a polynomial function of infinity. So they're not going to be able to handle these types of domains very well.

And so what we consider -- decided to focus on was a particular subclass of continuous state Markov Decision Processes where the dynamics could be described by typed noisy offsets. And the idea here is that each state here has a known type. And your type plus your action determine what your dynamics model is.

So the new state -- if you start off in a particular state, your new value after you take an action is just going to be an offset from where you were before, so like a delta, plus some noise where that noise is a Gaussian function.

And the parameters that determine that offset and that noise just depend on the type of the state and the action that you took.

So for example, we might imagine that the Atlantic Ocean, because of weather and current, might have really large noise and a smaller offset on average when we open up our sails, compared to the Indian Ocean, which might have calmer currents and calmer weather, so maybe we have a smaller amount of noise and we get to move further.
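Written compactly (notation mine), the typed noisy-offset dynamics just described are

s_{t+1} = s_t + \delta_{\tau(s_t), a_t} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}\big(0, \Sigma_{\tau(s_t), a_t}\big),

where \tau(s) is the known type of state s, so the offset \delta and the noise covariance \Sigma are shared by every state of the same type.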

And even though these sound like quite a simple model, they're actually very expressive, and they've been used a lot in inference for a wide variety of different types of problems. So people have thought about these for modeling things like human motion. People have used them for honey bee dances; they've also been used for the stock market. They're quite a simple model, but by using these types of switching systems, you can express a wide class of different phenomena.

So this is how our algorithm works. And I'll tell you in a second how we compute this policy. But imagine for a second we've computed a policy and we're using it to make a decision and act in the world. So once we act, we're going to update our type-action counts. So this is, say, how many times we've opened up our sails in the Atlantic Ocean. And then we check whether or not this exceeds a threshold. And if it exceeds a threshold, then we label the type-action dynamics as known.

And we use that to define an optimistic Markov Decision Process. So for anything which is known, then we're going to actually estimate the parameters for that known type action from the data we've had in the world. So we can just use our actual experience to estimate the model parameters.

But for all the other type actions that we haven't had very much experience in, we essentially just hallucinate that they're incredibly high reward areas. And we set the dynamics for those to be a self-loop. And what this means is that when we actually solve that optimistic Markov Decision Process it's going to drive us towards parts of the world where we don't have very much experience.

And so we'll solve that optimistic Markov decision process and then use that to decide how we act for our new state. And we repeat this process over and over again. And eventually this is going to, say, drive us towards -- if we start off in the Atlantic Ocean, it's going to drive us towards the Indian Ocean, because we're going to hypothesize that there's lots and lots of reward there. And so our policy will drive us that way.
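A minimal sketch of that loop's bookkeeping, under the assumptions above (the class and names are illustrative; a real implementation would also include the planner that solves the optimistic model):

```python
import numpy as np
from collections import defaultdict

class TypedOffsetExplorer:
    """Track per-(type, action) experience and build the optimistic model:
    known type-actions are fit from data, unknown ones become high-reward
    self-loops that pull the planner toward unexplored regions."""

    def __init__(self, types, actions, threshold, r_max=1.0):
        self.types, self.actions = types, actions
        self.threshold, self.r_max = threshold, r_max
        self.offsets = defaultdict(list)     # (type, action) -> observed offsets

    def record(self, state_type, action, offset):
        self.offsets[(state_type, action)].append(np.asarray(offset, dtype=float))

    def optimistic_model(self):
        model = {}
        for t in self.types:
            for a in self.actions:
                data = self.offsets[(t, a)]
                if len(data) >= self.threshold:       # "known": estimate from experience
                    arr = np.stack(data)
                    model[(t, a)] = {"known": True,
                                     "mean": arr.mean(axis=0),
                                     "var": arr.var(axis=0)}
                else:                                  # "unknown": hallucinated reward,
                    model[(t, a)] = {"known": False,   # self-loop dynamics
                                     "reward": self.r_max}
        return model
```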

Now, the important thing is that the sample complexity, that is, the amount of experience we need in the world to estimate these model parameters, is only a polynomial function of the state dimension in this case. So a little bit more formally, if we're given a delta and epsilon and a noisy-offset dynamics Markov decision process with M types, then on all but a number of steps which is a polynomial function of the dimension of the state space, the number of types, the number of actions, and several other problem parameters, we're going to be guaranteed to be epsilon close to the optimal value with high probability.
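Stated as an inequality (my paraphrase; the exact polynomial is in the paper): with probability at least 1 - \delta,

V^{\mathcal{A}}(s_t) \ge V^{*}(s_t) - \epsilon \quad \text{on all but} \quad N = \mathrm{poly}\big(d,\, M,\, |A|,\, \tfrac{1}{\epsilon},\, \tfrac{1}{\delta},\, \ldots\big) \ \text{steps},

where d is the state dimension, M the number of types, and |A| the number of actions.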

So how does this relate to prior work? So originally, when people first started thinking about these kinds of bounded-exploration models, people did them for very expressive classes of dynamics. So people thought about this where each state could have a different dynamics model. And those were very expressive, but they needed an amount of exploration time exponential in the number of state dimensions.

There were also some other models which thought about something closer to switching state models, but they also needed an exponential amount of time to sort of learn those parameters.

There were some other models that thought about a much more restricted set of dynamics models. So these were linear Gaussian models where you know what the variance is. And these could learn really efficiently, but they were too restricted for a large number of domains. So in order to capture a lot of real world phenomena, we wouldn't be able to make this restrictive an assumption.

>>: Do you know what kind of assumptions they were making?

>> Emma Brunskill: Yes. They're assuming that you know the variance. So they were assuming that you sort of have linear Gaussian dynamics but that --

>>: I see. [inaudible] domains would that not apply to or apply to [inaudible] helicopter [inaudible].

>> Emma Brunskill: Right. So one of them was basically a theoretical paper, but one of the instances that we're motivated by was if you think of something like traffic. I feel like the variance on different streets is really important when you do planning. So the offset for a highway is much better than the offset of a small side street. But the variance is really large. So if you're doing something like getting to the airport, you really care about estimating that variance parameter.

Because you might actually go on the side street even though it's slower. So these types of models wouldn't capture that.

So you can think of our approach really as kind of getting closer to the upper right-hand corner here, which is where we'd really like to be. So we are a sort of more expressive class of models than some of these previous approaches, but we still scale polynomially with the number of dimensions in terms of the amount of experience we need in the world.

And you can see this on this simulation example. So this was a simulation example that came up in the mid '90s. And this is just to demonstrate that there were previously existing examples which fit our assumptions about the types of models that we could handle. And you can see that compared to some other approaches we learn really quickly, because essentially the dynamics here are the same throughout the environment, and we can learn those parameters really quickly and then start to do very well in terms of our performance. Compared to some other approaches, which are essentially trying to model very different dynamics all over the state space, and so they take much longer to try to find a good policy.

>>: I'm sorry. [inaudible] so on this opportunistic -- sorry, the optimistic aspect of things [inaudible] when does that get you in trouble though [inaudible] spaces that you think would be [inaudible].

>> Emma Brunskill: They're going to be fabulous. Well, it's bad -- so I guess one assumption [inaudible] you have to assume there are bounded rewards. So if it's possible that, like, you could -- if the world was flat, for example, this would have been a really bad assumption, because you would have basically jumped off the cliff at least once, in fact multiple times, to make sure that you do die every time you jump off a cliff. So you have to assume that the world is bounded and there aren't catastrophic failures, in terms of things that you could do.

>>: What about the classes of those -- wondering if the [inaudible].

>> Emma Brunskill: So that -- you assume that you will get hit by those but that then you'll get to act really well on pretty much everything else so over the long term you're going to still be getting good performance.

>>: One thing I [inaudible] so you're assuming a finite dimensional state space. Is that correct?

>> Emma Brunskill: Well, it's not infinite dimensional. Finite dimensional. Yes, infinite number of states.

>>: [inaudible]. So that means, like, no matter how much exploration I do there will always be someplace in the world where I'll have good reward, right? And why wouldn't I just keep exploring -- I mean, what makes me --

>> Emma Brunskill: Why do you keep exploring? I'm assuming there's only a finite number of types and actions and that that's what's determining the dynamics model. So, yeah.

>>: So in a sense you're kind of considering a cluster of states -- I mean in each cluster is finite dimensional. I see.

>> Emma Brunskill: Yes. So I'm only doing -- I'm only [inaudible] the parameters at that more aggregate level. And that's what's saving me.

>>: Do you [inaudible] how do you know that you learned something?

>> Emma Brunskill: When you get that threshold. So I --

>>: So you initially know the threshold?

>> Emma Brunskill: Yes. So you can compute the threshold for a particular domain based on the number of types of actions. And that gives you how much experience you need for each of the type actions. And after you've hit that threshold you label it as known and you don't do any more exploration.

>>: So is it a static threshold, like visiting the state three times, or can I define a dynamic threshold, like if the variance over the times I've traveled this drops below some threshold -- can I do a dynamic threshold?

>> Emma Brunskill: So for the bounds, it's a fixed threshold. So the bounds give you a number. They'll say -- I mean, people use five or 10. The bounds are actually more like a million. They're very, very loose bounds.

So I mean, they're tighter than if you assume an infinite dimensional -- or infinite state space, but they're still loose. There have been people thinking about the variance of the dynamics recently in terms of similar bounds. But it's still a hard threshold; it's still, you know, 17 or things like that.

>>: Is the bounds -- [inaudible] of those bounds somewhere in a previous slide or [inaudible].

>> Emma Brunskill: I have not shown you the proof of how we do that.

>>: [inaudible].

>> Emma Brunskill: Yes. This is the equation. This is --

>>: The bound are where?

>> Emma Brunskill: So this is the number of steps you need. And essentially the number of steps you need per type-action is if you remove this. So you basically need a polynomial function of the number of state dimensions and all of these parameters, and that is the amount of experience you need until a type-action becomes known.

>>: And that would give -- that will make you epsilon close to the optimal policy. [inaudible].

>>: You've clarified a bunch of things [inaudible] I don't understand [inaudible]. So I understand you've broken things up into classes, and now you have a finite number per class, so you can say, okay, now I've learned [inaudible]. So the question I have, though, is if the classes are not pre-known, like because you're inferring these classes as you go along [inaudible].

>> Emma Brunskill: No. So these are --

>>: [inaudible].

>> Emma Brunskill: [inaudible] you know you're in the Atlantic Ocean or you know [inaudible].

>>: I see. I see. So you have a fixed set of classes [inaudible].

>> Emma Brunskill: Yes. And they're fully observable. So, yes.

>>: Okay. Got it.

>>: I [inaudible].

>> Emma Brunskill: Thanks for asking.

>>: So I can imagine that kind of thing being a nice refinement that wouldn't necessarily explode, having a probability distribution over class types. And updating that, that's kind of more the standard Bayesian update problem on top of all this, that would create another loop.

>> Emma Brunskill: Right.

>>: [inaudible] bounds --

>> Emma Brunskill: Yes. Because the other would be quite scary.

>>: So is this hard switch from like unknown to known kind of required for those bounds are -- I mean, could you compute bounds if the same kind of setup where you do the switching model but where your exploration, exploitation is kind of more continuous and can keep going for all time, or is it --

>> Emma Brunskill: Or do you have to have -- you can do it in a very similar way when you did have this more fluid approach.

>>: Okay.

>> Emma Brunskill: So some other people have thought about this, not for switching state models, but there are these sort of interval estimation approaches where essentially you just have a bonus that depends on how much experience you've had so far, and those smoothly trade off. Even those, often they'll at some point say I've had enough experience and stop updating it. But they can be entirely smooth. And you can prove similar PAC-style bounds.

>>: [inaudible].

>> Emma Brunskill: Yeah.

>>: But also, like, your parameters are just a mean and a variance [inaudible].

>> Emma Brunskill: Yes. For each of the type actions.

>>: So you can simply do a Gaussian inference where, you know, you basically -- like complete Bayesian inference for a Gaussian model, and that will allow you to do a small [inaudible].

>> Emma Brunskill: Yes. You can do that as well. And then you just want to -- basically you have to propagate that information into how much uncertainty you could have over your value function in terms of the bounds. So how far away could you be in a Gaussian solution.

>>: I hope you don't mind --

>> Emma Brunskill: Oh, no, it's great.

>>: [inaudible].

>> Emma Brunskill: So the other question then is, you know, these models are fairly simple and so it's good to make sure that they are realistic to capture some of the domains that we would like to. I'll explain the robot navigation first, because it's moving.

So we were interested in this -- I mean, this is obviously a small example, but you could imagine the Mars rover has to go over different types of terrain. And so in this case, we have a robot. The pattern on top of it allows the camera to recognize where it is in the environment. And it's trying to learn how to navigate over rocks and carpet in order to reach destinations. And so it's learning that there's a different type of dynamics over rocks than there is over carpet.

And in particular, it's a little hard to see, I guess, but we have this outline, and outside of that is out of bounds, and it gets a high penalty if it falls out of bounds. So it does kind of a conservative turn on the rocks, because it's more likely to get tripped up by the rocks and maybe go out of bounds. And then it can do a tighter turn over here by the carpet, because its dynamics have a smaller amount of noise.

And then the other case we looked at, and we did this in simulation, but we sampled from real data -- we thought this could be really relevant for things like roads, because there are very different variances and different offsets.

And so we thought about interstate, local highway and small side street roads.

And MIT has this project called CarTel, where they collected a bunch of data with real cars over different types of roads.

And so we did this in simulation for a small sort of grid world environment, but for the different types we'd sample from these actual datasets of highways and small side streets. And we found that even though the switching model was a pretty coarse approximation, it was still able to get very good performance, and having multiple different types was beneficial. And I'm happy to talk more about that at the end, if you would like.

>>: [inaudible] that topic. So the idea is you have variances by -- means and variances by time of day?

>> Emma Brunskill: No, just by type of road.

>>: Just by type.

>> Emma Brunskill: You could have it -- I mean time of day could be that type as well. But in this case, it's by interstate local highway and small side streets.

>>: And are you -- are you sampling to learn in real time [inaudible] plans for kind of over averages.

>> Emma Brunskill: You're coming -- you're doing it in real time, so, like, you've just moved to a new city and you're trying to figure out the best way to go to work. So each day you're trying out different things, and how much experience do you need in terms of figuring out the right way to drive to work, in terms of the variance and the mean offset that you want to have?

>>: [inaudible] data around here about variances of roads and high speeds and so on.

>> Emma Brunskill: I would imagine.

So really the key contribution here was the theoretical aspect, showing that we could do provably efficient exploration in this expressive class of continuous-state Markov decision processes.

I'm next going to fairly briefly talk about how we did efficient planning in a large linear Gaussian partially observable Markov Decision Process using macro-actions.

So just to recall, in the partially observable Markov decision process framework, we're now assuming that we know what the model parameters are for our dynamics and our reward functions, but our state is hidden. So we're going to be computing a policy over a distribution over where we think we might be in the world.

And the challenge in this scenario is that again, if we're interested in continuous-valued states, then we would have a probability distribution over an infinite-dimensional space. Because we would have sort of an infinite number of states we could be in, and we want a probability distribution over that.

And so aside from some restricted cases, the general -- generally existing methods don't scale to this type of problem.

And so, just briefly, we assume that you have sort of linear-Gaussian dynamics, which is very common particularly in robotic domains, and an exponential family observation model. And essentially, if we restrict our world to this, we can compute very efficiently what beliefs we might end up in after a sequence of actions, regardless of the observations we might receive during that sequence of actions. And that allowed us to evaluate conditional sequences of short open-loop policies, which we call macro-actions. And that has allowed us to scale up to much larger long-horizon POMDPs.
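An illustrative sketch of why this is cheap, assuming Gaussian observations for simplicity (a Kalman filter; not the talk's actual code): the belief covariance after a fixed open-loop macro-action depends neither on the controls, which only shift the mean, nor on the observations actually received, so it can be rolled forward once per macro-action.

```python
import numpy as np

def covariance_after_macro_action(P, horizon, A, Q, C, R):
    """Roll the belief covariance forward through `horizon` steps of a
    macro-action for a linear-Gaussian model (A, Q: dynamics; C, R: observations).
    No actual observation values are needed."""
    n = P.shape[0]
    for _ in range(horizon):
        P = A @ P @ A.T + Q               # predict through the dynamics noise
        S = C @ P @ C.T + R               # predicted observation covariance
        K = P @ C.T @ np.linalg.inv(S)    # Kalman gain
        P = (np.eye(n) - K @ C) @ P       # measurement update, independent of z
    return P
```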

I'm not sure if this is going to be at all visible when I go to -- all right. Let's see. I wonder, is there any way to turn down some of the lights?

>> Eric Horvitz: Yeah. We can.

>> Emma Brunskill: I was going to say ask and it will occur. This is fabulous. All right. Let's see if I can make this go. Oops. That didn't work. Okay. Good.

All right. So we did this for helicopters, so we had an autonomous helicopter which you can see flying around a little bit up here. And it was a monitoring and surveillance task. So the idea was that you wanted to know whether or not these suspicious vehicles were going inside highly sensitive areas. And the helicopter has a camera underneath it. So you can see that if it flies higher up, it can see more of the environment. But at a lower level of resolution. And then when it flies lower down, it can see less of the environment but it gets more accurate observations.

And its goal was to sort of fly around and correctly report whenever any of these targets go into this area.

>>: [inaudible].

>> Emma Brunskill: Oh, good.

>>: And flying up and down is basically just to trade off the resolution, but it can see it [inaudible].

>> Emma Brunskill: Yes.

>>: [inaudible] shot down [laughter].

>> Emma Brunskill: We did not include that. Yes. You could add that in as well. The visibility of the helicopter.

>>: Catastrophic failure.

>> Emma Brunskill: Yes. We did not model this as accurately as one might imagine.

>>: That could be interesting though too.

>> Emma Brunskill: It would be. It would be cool because you would want to be like lower --

>>: Get down and get back up again quickly.

>> Emma Brunskill: Exactly. And in this case -- and what you can see here is the colors, is just that we have a laser range scanner on top of the helicopter. So the helicopter actually doesn't know exactly where it is, so it's localizing itself through the laser range.

But the important thing about this environment is that it's really large. So depending on the number of targets you have, it's somewhere between 12 to 24 dimensions, because each of the agents has an XY position and orientation, plus the helicopter's own XYZ and orientation.

So it's a really large problem. And a lot of the sort of standard discrete state POMDP planners just can't scale up to this size of domain. It's very, very large.

But they say the -- and the lights can go up again if there is -- that would be pretty crazy. Well, awesome. So the exciting thing too is that even for this type of domain you can always come up with some kind of heuristic approach to solve really large problems. And the cool thing about this was not only could the sort of principled approaches that were reasoning with discrete states not scale up, but the heuristic approaches just didn't do as well. So we videotaped this to get ground truth. And we compared the number of times the algorithms correctly reported when targets went in versus the number of true target entries. And our approach did much better than the other approaches in terms of correctly reporting when targets entered. This approach, nominal belief optimization, has been really popular actually over the last year. There's been a bunch of papers all thinking about this idea. And essentially it's just to assume that the observation you get is the one you expect to get. And that really reduces planning, because you don't have to think about all the observations you'd get; you just have to think about the mean observation.

So it's much more efficient. But it doesn't do as well in a lot of environments where you really need to think about the conditional nature of things -- under different observations you'd really want to do different things. So for example, in this environment you might have a really large variance, and if the agent stopped, then you might need to wait and see whether or not that target actually continues going towards the sensitive area or not; whereas if they have already passed the sensitive area, maybe you don't need to worry about that anymore. So you really want to think about, under different observations, what would you do, compared to assuming the mean.

Okay. So now I'm going to talk about efficient planning in large discrete-state POMDPs with prerequisite structure. So I've mostly talked about so far examples where we can characterize the state space by a number of different continuous-valued variables. But there are obviously also a lot of other applications where that's not a very good model. So for example, I think right now there's a lot of interest, particularly with what's going on in speech recognition and speech assistant systems. And you could imagine for something like travel reservations you want to identify where someone is going from and where they're going to and the date. And while there are a lot of airports in the world, there's still a finite number of airports. And it doesn't make sense to model things like the from and the to airport as a continuous-valued variable.

Similarly something like network structure you might have a bunch of different nodes and they would either be up or down. Again, that's not a continuous value.

And then finally education, which is an area I'm real excited about.

So if we think about how you would do automated tutoring for arithmetic we'd like to be able to compute a policy to select actions, which are these pedagogical activities to help a student master a bunch of skills.

And we can frame this as a partially observable Markov decision process. So the set of states we have is perhaps a number of different skills, where the student knows or doesn't know each of those skills.

And then the actions we have access to are essentially going to be along a continuum between testing actions and teaching actions. And then we'll get observations back from the students, which are things like whether or not they answered a question correctly, if they're asleep in class, or a number of other types of responses that the student could give us back.

And then we have two different models -- we have the transition model and we have the observation model. The transition model captures how different activities help a student probabilistically transition between a previous knowledge state and a new knowledge state -- so how much does a calculus test help a student know calculus -- versus an observation model, which says what's the probability of a student responding with a particular observation after experiencing an action and transitioning to a new state?

So this might be something if a student has a calculus test and knows calculus, what's the probability that they'll do well on that test?

And so using these model, we can represent a belief state over the student state.

So if we had a sequence of different activities and answers from the student, we could represent a distribution over the states we think the student might end up in.

So for example, it might be most likely, given that sequence and assuming our model parameters, that the student has mastered one by two digit multiplication and one digit division, but still doesn't understand two digit long division.

And using this, we want to compute a policy to optimize learning. So for example, a policy might look something like this. You might start with some instruction, and the student would listen, which doesn't give you very much information about whether or not they understood it. And then you would give them a drill exercise. Based on whether they got that right or wrong, then you would give them another activity, such as another drill exercise. And if they got the second one right, maybe they're now ready for some harder material.

If they got the second one wrong, maybe they were just lucky and guessed correctly on the first one, and we actually need to back off and check that they actually really understood that earlier skill.

Now, there's been a lot of work on intelligent tutoring systems. I think people have been interested in education and AI pretty much as long as AI has been around. But I'm going to focus particularly on other approaches that have to do with this notion of how you do planning and sequential decision making.

Some of you might have taken the adaptive GRE. That's basically trying to give you different types of questions but according to what they think you know and don't know, but only for assessment. So it's not actually trying to help you learn anything.

There's been quite a lot of interest, though, in people doing sort of Bayesian modeling of student state, keeping these belief states over what people know or don't know. But there's been very little on trying to think about how we would do teaching policies as sequential decision making under uncertainty.

There's been some work where people thought about the state being fully observed, but I think that it's really much more flexible and robust to think about the state being hidden and getting sort of noisy observations of the state from the student.

There's also been work where people think of the state as being hidden but use a greedy approach, so not sequential decisions. And the reason I think that's important is because of the following example: if you think of only getting a single interaction to teach someone calculus, you wouldn't give them a test, because if you want them to learn as much calculus as possible, you would maybe give them a lecture or give them an exercise or something like that, but you wouldn't get any benefit from the diagnosis. So there would be no reason, if you only get a single interaction with a student, to do diagnosis. But if you know you get to interact with a student for an entire semester, doing a diagnostic test could be really helpful, because then you would know what they understand and don't understand, and you could focus on that for the rest of the semester.

So if you do greedy approaches, information gathering or diagnosis just is never going to be worthwhile. But if you could think about it for a longer term sequence, then you can start to interleave these type of things.

Okay. So let's think again about arithmetic in particular. I would argue there are a lot of different skills included even in the very simple arithmetic you would like something like a fourth grader to learn. So they may have things like one by two digit multiplication, subtracting fractions with different denominators -- you could write down a lot of different skills.

And I'm going to pick 122 skills, because you can easily come up with this many, and it's an example we have later. But if we have 122 skills, that's a 2 to the 122 state space, which is approximately 10 to the 30. And that is a massive state space for any sort of existing partially observable planner. We can't scale up to anything like that. We can maybe scale up to 10 to the 5. Definitely 10 to the 4. Probably 10 to the 5.

But I would argue there's sort of two concerning things. One is that this is only a small problem. I'd love to be able to do this for K through 12, and this is only fourth grade math. And secondly, we don't even know which state the student's in. So which of those ten to the 30 states is the student in?

>>: Also, there's an interesting [inaudible] independent and other [inaudible] distributions having between these states but sometimes [inaudible] and you actually get everything else for free [inaudible].

>> Emma Brunskill: Yes. [inaudible].

>>: [inaudible].

>> Emma Brunskill: So the first idea is that to tackle this, we're only going to consider the probability of the student being in a subset of states. So imagine this is your belief state, and this is the probability over all the states you're in. And I've put up 16 there, but you could imagine, you know, a billion. It just would go off the page.

So instead of thinking of the student being in all of those states, we can kind of think about the student being in a subset of states, plus a massive aggregate out state which basically captures all the other states the student may be in. And this is really a novel extension of the notion of envelope planners, which came up in the Markov decision process community in the mid '90s. And what they were interested in there is, even if you know what the state space is, if your state space is really, really large, it may be intractable to plan over all of it, and so maybe you could just consider a subset of states. And what I'm proposing here is that we could do something similar in the partially observable space, where we just think of the probability of being in a subset of states, instead of all possible states.

And what this really is, is a mapping approach. We're mapping our state space into a much smaller state space, in the hopes that if we get it to a small enough size then we could use existing approaches to solve that smaller state space.

Of course kind of the key question is how do you pick the smaller state space?

And this has to do with something that was just alluded to. So we can leverage the dynamics of these types of problems in order to efficiently compute a good subset.

And in particular, we use these things called learning hierarchies. So these were first developed by the education community in the '70s and '80s, because people were really interested in what the prerequisite structure among skills is, for teachers to know when they're teaching things. And this essentially expresses what skills you need to have before you can master other skills.

So for example I don't know anybody that understands calculus that doesn't understand addition. And so you don't really need to know about the student that has mastered calculus but not addition because that's a subset.

So just to be kind of precise about the type of POMDP problems that I'm creating the solution for, we're focusing on things with positive-only effects, which means that once we make one of those skills true, we assume they stay true. So essentially, no forgetting, which certainly doesn't describe me. But this is a common approximation in the intelligent tutoring systems community, so it was a reasonable place to start.

And then the other, probably most key, assumption really is that we assume that if all your preconditions are satisfied for a skill, then the cost of making that skill true is independent of the values of all your other variables. So this is like saying, if you've got all your preconditions for calculus, then it doesn't matter whether you know medieval history or abstract algebra; the cost of helping that student master calculus is the same, regardless of those other skills.

>>: [inaudible] obvious example so that fails.

>> Emma Brunskill: Yeah, I think there are cases where they may not be strict preconditions, but two things might help you learn something else faster. I've definitely had that experience in my own life. Like if you're sort of learning about, say, computer architecture and assembly language, sometimes things from the bottom up and the top down can help you learn things faster, even though they're sort of independent.

So to create this initial envelope, you can sample an initial state from that possible initial belief state. And then you just create a trajectory from that start state to the goal state. So our goal state is going to be where all of the skills are true, and our start state is going to be where some of those skills are not true yet. And we can just compute any trajectory that goes from the start to the goal that satisfies the prerequisite assumptions. And this is just computing a topological order, so it can be done really cheaply -- just linear in the number of variables and the number of precondition edges.

>>: [inaudible] so you're taking the structure [inaudible].

>> Emma Brunskill: Yes. Yes. And there are definitely really interesting questions about how you get that structure. So what this does is that this means that the number of states we end up with is linear in the number of state variables, because when I create that trajectory, I'm essentially just flipping each bit to be true, so at most there can be N flips if there are N bits. So this is taking us from the 2 to the 100 state space to 100 states. Which is good. We can hope to plan with 100 states.

So our algorithm is called a reachable anytime planner for partially sensed domains. And the idea is that we use the method I just talked about to create a subset of states we're going to plan for. And then we create a new POMDP where we put all the other states into an aggregate out state and we redirect any transitions to that out state if they would have taken us outside the envelope.

We make this out state a sink state and we give it a large negative reward. And that basically just penalizes us for falling outside of the envelope. And then we can do planning. So this is, you know, a hundred or a couple hundred size state space POMDP, and there are existing planners that can be used to solve this.

And then if we have additional time we're going to expand the envelope to include more states and then repeat this process and go back to two. So it's an anytime planner.
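A minimal sketch of the envelope construction just described (the names are illustrative; `prereqs` is the learning hierarchy, mapping each skill to its prerequisites):

```python
from collections import defaultdict, deque

def initial_envelope(num_skills, prereqs, start_known):
    """Build one trajectory from a sampled start state to the all-skills-true
    goal by mastering skills in a topological order of the prerequisite graph.
    Each partial assignment along the way is a state in the envelope, so the
    envelope size is linear in the number of skills."""
    indegree = {s: 0 for s in range(num_skills)}
    children = defaultdict(list)
    for skill, pres in prereqs.items():
        for p in pres:
            children[p].append(skill)
            indegree[skill] += 1

    # Kahn's algorithm for a topological order
    queue = deque(s for s in range(num_skills) if indegree[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for c in children[s]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    envelope, known = [frozenset(start_known)], set(start_known)
    for s in order:
        if s not in known:                 # master one new skill per step
            known.add(s)
            envelope.append(frozenset(known))
    return envelope
```

Everything outside this set is then lumped into the aggregate out state before planning.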

>>: Just for clarification, are you assuming that when I ask you a question for [inaudible] you give me an answer, after that I certainly know what you know?

>> Emma Brunskill: No, I'm not assuming that it's an oracle, no. So I'm assuming there's noise and you could guess correctly even if you didn't know or you could make a mistake even if you did know.

So the first thing is sort of a sanity check. The hope would be that if you have tons and tons of computational time, eventually you could guarantee your approach would relax to an optimal approach or an epsilon optimal approach. And eventually, if you keep doing this expansion process, your envelope will eventually include all the reachable states. So at that point, as long as you're using a planner that has some optimality guarantees, you're guaranteed to converge to the epsilon optimal approach.

This is sort of a good sanity check. But in practice most of these problems are going to be way too large to include all of the possible states. So it's important to have performance bounds in the meantime. So we provide upper and lower performance bounds. For upper performance bounds, it turns out that if you think about the fully observed problem, it's really cheap to compute the cost of making a number of different skills true in the fully observable case. It's essentially a stochastic shortest path problem.

And because you have this independence relationship -- that as long as all of your prerequisites are true, the cost of making this new skill true is independent of everything else -- you have a lot of path independence in terms of the values. So it turns out that we can compute the fully observable MDP values in time that's linear in the number of state variables.

And then to get an upper bound on our POMDP, we just take a weighted expectation over our belief that we're in each of those different possible states times the MDP value.

Now, for lower performance bounds we essentially just put a large negative reward on that sink out state, and that means when we compute our POMDP values they're guaranteed to be a lower bound on the true optimal values. Because we're essentially saying every time you fall out of your envelope, it's really bad and you'll never escape, and there's really a negative reward there.
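In symbols (notation mine), the upper bound weights the cheap fully observable values by the current belief,

V_{\text{upper}}(b) = \sum_{s} b(s)\, V_{\mathrm{MDP}}(s) \;\ge\; V^{*}(b),

while planning in the envelope POMDP with the heavily penalized sink state gives the corresponding lower bound.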

Uh-huh?

>>: Maybe there's something that I missed earlier. So then are the states that are in your [inaudible] only the achievable state space on that hierarchy? Is that where that's really coming from?

>> Emma Brunskill: Yeah.

>>: I see. How much does that reduce the overall -- you start out with 10 to the

30 heuristic how far [inaudible].

>> Emma Brunskill: Right. So well, I guess it is two parts. There's the reachable state space is smaller than 10 to the 30 just because of the precondition structure. And then inside of our envelope there is a subset of that.

>>: Oh, I see.

>> Emma Brunskill: Yeah.

>>: How do you choose the subset.

>> Emma Brunskill: We choose the subset by just picking any pre-condition -- we basically start from a possible start state to our goal state and we pick any sequence of states that satisfies the pre-condition structure.

>>: I see.

>> Emma Brunskill: So just some possible trajectory.

>>: Some path --

>> Emma Brunskill: Some path.

>>: [inaudible].

>> Emma Brunskill: And then as we have more time, we add in more possible paths to [inaudible].

>>: I see. So the person somehow -- well, that's interesting because then it seems that the person might be able to step off the path and end up in the out state --

>> Emma Brunskill: Absolutely. Yes. And so that's the -- that's the common -- like you could, depending on how many trajectories you've added so far. So normally it turns out -- I'll show you in a second in terms of the results -- but normally only one trajectory isn't sufficient. But if you have several then you can start to do pretty well.

>>: [inaudible].

>> Emma Brunskill: There are a lot of different ways. We do it pretty simply. We just add in -- first we prioritize adding paths from other initial start states to the goal. Then you could also think about when you hit fringe states under your policy, things like that. But I think there are a lot of interesting ways to go.

All right. So this gives us a lower performance bound. And then we did two simulations of tutoring problems, because I'm particularly excited about those.

And the goal in this is to make all of the skill variables true. And you get a big reward when that occurs.

And our baselines are a couple of generic POMDP approaches that are sort of state of the art but handle any type of domain, and a heuristic that's actually used in a commercial tutoring system. So this heuristic essentially identifies all of the goal variables whose probability of being true is below a threshold. So, I want you to learn calculus, but I'm only 50 percent sure you've learned calculus and I need it to be 95 percent. And then they just select any of those things that are below the threshold, or they select the one with the lowest probability, and then pick the best action to make that true.

So they assume that all states are -- all skills are independent, and then just focus on getting all of them to a higher level.
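A minimal sketch of that heuristic (illustrative names; not the vendor's code): skills are treated as independent, and the next action targets the weakest skill still below threshold.

```python
def heuristic_next_action(p_mastered, threshold, best_action_for):
    """p_mastered: skill -> probability the student has mastered it.
    best_action_for: skill -> the action judged best for teaching that skill.
    Returns None when every skill is above threshold (student judged done)."""
    below = {s: p for s, p in p_mastered.items() if p < threshold}
    if not below:
        return None
    weakest = min(below, key=below.get)   # lowest-probability unmastered skill
    return best_action_for[weakest]
```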

>>: Have these systems been evaluated formally?

>> Emma Brunskill: Yeah, I think they're used with roughly 500,000 people. Not -- you mean like decision-theoretically formally, or in terms of, like, do they actually help students learn?

>>: The latter.

>> Emma Brunskill: Yes, they do help students learn.

>>: Is it characterized -- in other words, if you're going to pose it differently than this heuristic, it has to go through a neck-and-neck evaluation step: how much better is it to do X versus Y versus human, for example?

>> Emma Brunskill: Oh, how much better is this than human or --

>>: How much better is a non -- less heuristic approach than this?

>> Emma Brunskill: Right. So I don't know -- I know that they've been used in schools, and they -- they do pretty well. I mean, I think they often get like one standard deviation better than not using them but I'm not sure if they compare it to human tutors or not for this.

>>: [inaudible] your methodology --

>> Emma Brunskill: They -- I haven't -- so all of the stuff I'm talking about right now is simulated students. So far the results are just for simulation, whereas these have been extensively evaluated with real students. These are commercial systems.

>>: But they actually use the probability [inaudible].

>> Emma Brunskill: Uh-huh.

>>: And typically what is the base source of the probabilistic inference [inaudible].

>> Emma Brunskill: In these other systems how are they -- or how are they doing [inaudible].

>>: Yeah. Yeah.

>> Emma Brunskill: I mean, they assume all skills are independent, so it's pretty easy to do the updating. So all the skills are independent. And then they're just sort of modeling -- they basically have a slip parameter and a guess parameter: what's the probability of someone accidentally messing up even if they know something, what's the probability of them guessing correctly even if they don't, and what's the probability of them learning it with each activity. But they model all the skills independently, which makes it much simpler.
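
A minimal sketch of this style of independent per-skill updating, in the spirit of Bayesian knowledge tracing, with slip, guess, and learn parameters; the parameter values below are placeholders rather than values from any deployed system.

```python
def knowledge_tracing_update(p_known, correct, p_slip, p_guess, p_learn):
    """One Bayesian knowledge-tracing step for a single, independent skill.

    p_known: prior probability the student already knows the skill.
    correct: whether the observed answer was correct.
    p_slip:  probability of answering wrong despite knowing the skill.
    p_guess: probability of answering right without knowing it.
    p_learn: probability of acquiring the skill from this practice opportunity.
    """
    if correct:
        posterior = (p_known * (1 - p_slip)) / (
            p_known * (1 - p_slip) + (1 - p_known) * p_guess)
    else:
        posterior = (p_known * p_slip) / (
            p_known * p_slip + (1 - p_known) * (1 - p_guess))
    # Learning can also happen on this practice opportunity.
    return posterior + (1 - posterior) * p_learn

# Illustrative parameters only.
p = 0.4
p = knowledge_tracing_update(p, correct=True, p_slip=0.1, p_guess=0.2, p_learn=0.15)
print(round(p, 3))
```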

>>: Is your question whether they learn those from data or [inaudible] experts?

>> Emma Brunskill: Some of both. Yeah. Now they have a lot of data so they can learn those better.

>>: How long have these been used with probabilistic --

>> Emma Brunskill: I think the knowledge tracing stuff that this sort of independent updating came out of is from the mid '90s or maybe early '90s. John Anderson from CMU, I think, and Albert Corbett.

>>: I'm surprised that I [inaudible] probabilities.

>> Emma Brunskill: I'm pretty sure it was Albert [inaudible] -- I think it -- pretty sure it was John Anderson and Corbett, first for computer programming tutoring systems.

So the first thing we looked at was a very creatively named small math domain because it's small. And there are 19 different skills. And you can see here the pre-condition relationship among these skills. So there's a lot of pre-condition structure.

And we had different types of actions. We had actions that were drill exercises, which would give us more information about whether or not our simulated student had mastered that skill, and then tutorial actions that were supposed to be sort of something like instruction.

And if we compare to these standard algorithms, sort of the state of the art but generic algorithms, they just fail to scale up to the size of the domain. It's a very large domain, and they just can't handle it. But what you can see here, and this is a little bit related to one of the questions that was just asked, is that after a small number of envelope expansions -- so if you only do one envelope expansion you're not doing very well, but if you add in some more states, then after a small number of envelope expansions you start to get pretty good performance. And you can see there an upper bound on the type of performance we could achieve.

And if we quantify this in a slightly different way, we can think about the average number of steps until our simulated student gets to the goal, so they can master all of the skills. In our approach, it took them on average 34.5 steps, averaged over 200 trials. Whereas if we looked at the commercial tutor heuristic policy, it took almost double that number of steps to get to the goal. And that was even if we picked the best threshold. So you could think, do we want them to be 95 percent sure they've mastered the skill or 80 percent sure they've mastered the skill? When do you consider them to be done? And this was statistically significant. And this was encouraging because I think it really highlights the potential benefit of thinking about this as a formal sequential decision process as opposed to just heuristic approaches.

The second one we looked at was a larger arithmetic simulation over a number of different types of problems. And we combined together learning hierarchies from the literature; that's what this network represents. It's 122 variables from these combined learning hierarchies. This is just much too large to try with standard state of the art POMDP approaches.

So really the only one we could compare to was this heuristic policy. And there were two instances that we tried. First we used the standard algorithm, which ignores any structure among the precondition skills. And if we did that, then it just failed. It can't handle this type of domain that's really large; it will be working on skills that look low probability, but the student won't have done the prerequisite skills yet, so it just fails.

If you include the prerequisite structure, which is something the standard algorithm doesn't have, it still performed worse than our approach over a very large number of the thresholds we tried and performed comparably at one threshold.

So I think that this really highlights the benefit of this style of approach over the prior approaches. Yes?

>>: So when you say you include the [inaudible] structure, is the baseline then basically the same [inaudible] hierarchy is not even going to work on skills that have precedents until they're mastered and then [inaudible].

>> Emma Brunskill: Yes. Exactly.

>>: [inaudible] based on that. Okay.

>> Emma Brunskill: But we'd really like to see whether or not this actually impacts learning, not just with simulated students. It's a little hard to see this one, but I'm also really interested in how we can use information and communication technologies for international development. And there's been interest in how to do intelligent tutoring systems in the classroom for places which only have a few computers, not enough for one per child. So in a lot of these scenarios, people have been interested -- including work coming from Microsoft -- in having multiple input devices for the students to use at once, as a way to allow multiple students to interact and have good experiences with the same computer.

But within these settings there can be a challenge in that if people have different abilities or different backgrounds or different interests, it can be hard to keep all the students engaged and challenged at once. And particularly in competitive scenarios, one person may dominate the interactions, which may mean that it's really hard for the other students to stay engaged, and that may end up having negative learning repercussions.

So our idea was well, what if we tried to do some amount of personalization. So we'd like to make it so that all students are kind of working at the -- their own level, and by doing that, we should sort of be able to equalize the playing field.

So we made this simple group math game where we split up the screen into different segments. So each student is working on their own segment but they're competing to be the first to answer 12 questions correctly. So they're each doing their own drill exercises.

And then the idea is that we're going to personalize what questions they actually get using our Markov -- our partially observable Markov Decision Process. So we're going to keep our belief state estimate over what they know or don't know.

And then ask them questions accordingly. So maybe Eric's working on calculus and I'm working on addition but we still have the same chance to win.
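
One simple way the per-student personalization described here could work, assuming a per-skill belief state and a per-question success model (both hypothetical): pick the question whose expected success probability under the student's current belief is closest to a shared target, so students of different levels face a comparable challenge.

```python
def pick_question(belief, questions, target_success=0.7):
    """Choose the question whose expected success probability, under the
    student's belief state, is closest to a shared target.

    belief: dict skill -> P(skill mastered) for this student.
    questions: list of (question_id, skill, p_correct_if_known, p_correct_if_unknown).
    Aiming every student at the same target success rate is one way to keep
    a mixed-ability group competitive on the same screen.
    """
    def expected_success(q):
        _, skill, p_known, p_unknown = q
        p = belief.get(skill, 0.0)
        return p * p_known + (1 - p) * p_unknown

    return min(questions, key=lambda q: abs(expected_success(q) - target_success))[0]

questions = [
    ("q1", "addition", 0.95, 0.30),
    ("q2", "fractions", 0.90, 0.15),
]
print(pick_question({"addition": 0.9, "fractions": 0.2}, questions))  # q1
```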

And we did this user study with two elementary schools in low-income areas of Bangalore, with fourth and fifth grade students. And they had four sessions with the software tutor.

And encouragingly we found that game dominance was reduced. We tried to quantify game dominance in two different ways. One is that a single student was winning 80 percent or more of the games, and another was where a single student won 10 or more times as often as any other student in the group. And in both of these settings our adaptive condition, where we're using the POMDPs, had many fewer instances of dominance compared to if we were randomly selecting questions for each student. So this is hopefully going to increase engagement amongst all the students.
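
A small sketch of the two dominance measures as described; the reading of the second criterion as one student winning at least ten times as often as any other is an assumption.

```python
from collections import Counter

def group_is_dominated(winners, win_frac=0.8, win_ratio=10):
    """Check the two dominance criteria described: one student wins at least
    80 percent of the games, or wins at least 10 times as often as any other
    student in the group.

    winners: list of student ids, one entry per game played by the group.
    """
    counts = Counter(winners)
    total = len(winners)
    top, top_wins = counts.most_common(1)[0]
    others = [c for s, c in counts.items() if s != top] or [0]
    return (top_wins / total >= win_frac) or (top_wins >= win_ratio * max(others))

print(group_is_dominated(["a"] * 9 + ["b"]))          # True: one student wins 90 percent
print(group_is_dominated(["a", "b", "c", "a", "b"]))  # False: wins are spread out
```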

But again, you know, we care about engagement, but really the long-term objective is learning. And so while we've got some encouraging results on a concept learning task that I was doing with some computational psychologists at Berkeley, I'm also collaborating with the Pittsburgh Science of Learning Center and Carnegie Learning on doing classroom studies with 9th grade algebra students to try to see whether or not these decision theoretic approaches actually lead to better learning.

>>: Yes, I think evaluations [inaudible] in the hard studies, the really soft stuff, it's really nice to figure out whether or not some of the, you know, people reasoning methods [inaudible] and how much if any.

>> Emma Brunskill: Right. Yeah. Because you can always have heuristic methods to solve these [inaudible] by doing this [inaudible].

>>: You mentioned [inaudible] very sensitive to formulation, and they're so sensitive that the heuristic methods [inaudible] better at capturing expertise might be better, but by the fluke of the expertise, not by the machinery [inaudible].

>> Emma Brunskill: Right. Yes.

>>: [inaudible].

>> Emma Brunskill: Yeah. Definitely. And one thing that I was actually going to talk about this morning was just that in general these parameters probably will be impossible to know exactly, and so how can you be robust to uncertainty in the parameters.

But the contribution here really was showing that this prerequisite structure really enabled us to do this exponential reduction in the size of the POMDP.

>>: [inaudible]. So my intuition is that this was very focused on steps in the conceptual space of learning for a human, and in reality it might turn out, for example, that engagement and attention management is so much more important than the subtleties of the sequence. And if anything, the sequence reveals something about [inaudible] based more in notions of attention -- failed comprehension leads to a lack of attention. So it might turn out that the whole [inaudible] sequence of attention [inaudible] is about attention and engagement over time, and only secondarily about the actual content.

>> Emma Brunskill: Definitely.

>>: [inaudible] just a [inaudible].

>> Emma Brunskill: Yeah [inaudible].

>>: [inaudible] example of [inaudible]. Engagement and attention is really critical here.

>> Emma Brunskill: No, I definitely think that, and I'm really excited to talk to Desi Tamleader [phonetic] because I feel like being able to model that and monitor that, in terms of maybe brain-computer interfaces and stuff like that, you know, for motivation -- I think it is really important. Motivation and frustration are very critical.

But just to go back to here. So this technique really allows us -- I mean, it's definitely relevant to tutoring, and that's why I was first excited about it.

But it also applies to other generic forms of POMDPs and this notion of just restricting yourself to planning over a restricted state space and doing belief space planning over that is more general even than the envelope planner that I proposed here. And this can really enable us to plan in domains that are much, much larger than was previously possible.

So I just talked to you so far about three different projects that I've done in decision making under uncertainty and that's really the bulk of the way that I spend my research time. But I've also been fortunate to spend some time on a few other projects that I just wanted to briefly mention.

So I come from a robotics lab from my graduate school days, and a lot of the interest there was how do we build maps of our environment, and how do we build maps of our environment that might be particularly suitable for humans to interact with the computers that are using them -- computers or robots.

So things like topological representations of environments or hierarchical representations of environments. We were also interested in things like how you could enable better robot -- human-computer interaction in terms of inferring, when someone gives a set of directions in natural language speech, what underlying physical path they are referring to. And I'm also interested, as I said, in international development and technology, and I've worked on a number of projects related to that, including things like what's the best way to have low technology-literacy users use cell phones in order to enter in, say, healthcare data? We were doing this for a tuberculosis application where we wanted healthcare workers to be collecting data regularly. And was it better to use the mobile phone for voice or for text messages or other interfaces?

Also, we've been interested in how you optimize health worker visits. A lot of health workers travel really far distances. This is Rwanda, where the travel time is significant, and you could use, say, a travelling salesman approach to try to optimize.
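
As a rough illustration of a traveling-salesman-style optimization for ordering visits, a greedy nearest-neighbor tour is sketched below; the coordinates and the greedy heuristic are illustrative assumptions, not the system actually used.

```python
import math

def nearest_neighbor_route(depot, visits):
    """Greedy nearest-neighbor tour: a simple traveling-salesman heuristic for
    ordering health worker visits. Locations are (x, y) pairs.
    """
    route, current, remaining = [depot], depot, list(visits)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route

print(nearest_neighbor_route((0, 0), [(5, 1), (1, 1), (2, 4)]))
```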

And I've also been interested in things like health modeling. So how do you know what the prevalence is of a disease across the world based on a meta analysis of studies?

>>: Is that what that map is [inaudible].

>> Emma Brunskill: Is that really -- yeah. So this is hearing loss. And these are -- I mean, these unfortunately can't show you the uncertainty over these estimates, but, yeah, the highest hearing loss is generally more in that area.

>>: Why would that be?

>> Emma Brunskill: Why -- why -- why would it be higher in Africa?

>>: Yeah.

>> Emma Brunskill: Well, the studies have shown that it's generally pretty high. I mean, you can sometimes see --

>>: [inaudible] industrialized cities you would have [inaudible] with your in-laws --

>> Emma Brunskill: Well, you can see some interesting ups and downs. So sometimes it is bell-curved: in really rural areas it will be lower, and then it's higher in, say, industrial areas, and then it's lower again as you get to higher income.

There's also war. So that also has an impact.

>>: [inaudible] pretty striking in the pictures.

>> Emma Brunskill: Yeah. Well, this is a little over -- this is one of the -- this isn't the final one that we ended up doing, but you could see a map of just how it changes over the world. And this isn't the [inaudible].

So just going forward, I'm still really excited about decision making under uncertainty. I think it's a really fascinating area. And I'm particularly excited about continuing to tackle some fundamental questions inspired by real applications. So one of the things I'm really excited about is thinking about how we can use population data to do better personalized policies for new individuals. The example for this that I like to think about is: if you have a new patient who comes in who was just diagnosed with colon cancer, how do you leverage all of the prior data you have about other colon cancer patients, and the sequences of treatment decisions made for them and their outcomes, in order to construct a treatment plan for the new individual?

And the reason this is hard in terms of a decision theoretic approach is that you're basically doing off-policy learning and off-model learning, because the prior patients were slightly different and you didn't explore all of the policy space within that prior set of people.

So I think that's a really interesting question. And I think there's a lot of potential impact on things like health and marketing and stuff like that.

Second, I mostly talked today about how a single agent could make decisions in order to optimize some objective. But I think it's also really interesting to think about how we could modify another agent's decision making through the use of incentives. And I think of this as really being kind of the intersection between inverse reinforcement learning and mechanism design. So how could I observe another agent's decisions, and then from that figure out enough about their preferences, in order to provide them with alternative incentives to get them to follow a policy that I'd like them to follow instead?

And I think this would certainly be relevant to things like modifying people's health behavior, things like exercise, things like nutrition. I'm talking to someone from Berkeley who works on an e-mail system to help people change their nutrition and exercise, which is currently rule based, but I'm talking to her a little bit about whether we could do something like this.

And I think there's some really interesting, challenging questions here. So in general preference elicitation is also a PSPACE-hard problem -- it's a very hard problem. But we don't really need to know exactly what people's preferences are in order to change them. So it might be that as long as I know you like a Big Mac roughly this much and you like salad roughly this much, I could figure out incentives to change the ordinal relationship amongst those in order to change your policy.

Because I think one of the exciting things we do get in terms of making decisions is that ultimately, [inaudible] in a discrete action space, we just care about ordinal relationships amongst actions instead of their actual values.
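
A toy sketch of this ordinal view, assuming a deterministic utility-maximizing agent and made-up utility numbers: the smallest subsidy that flips the ordinal relationship between two options is all that is needed to change the choice.

```python
def minimal_incentive(utility_current, utility_target, margin=0.01):
    """Smallest subsidy on the target option that flips the ordinal preference.

    We only need rough utilities: once the subsidized target utility exceeds
    the current favorite's, a deterministic utility-maximizer switches.
    margin is a small tie-breaking buffer; all values here are illustrative.
    """
    gap = utility_current - utility_target
    return max(0.0, gap + margin)

# Illustrative numbers: the agent likes the Big Mac a bit more than the salad.
print(minimal_incentive(utility_current=3.0, utility_target=2.5))  # 0.51
```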

And then finally I'm still real excited about the education space and particularly group tutoring. In the case I just showed you earlier, we treated each student individually. But I think once you start to think about the students together, there's some really exciting things.

One is you get this combinatorial action space. So if I recognize that Johnny doesn't understand fractions and Susan does, one of the things I can do is have Susan help Johnny with fractions.

I also think there's some really challenging things there with content creation. I haven't thought in detail about this, but I could imagine doing something like inverse theorem proving to try to automatically generate content. And I also think there's some really exciting challenges in terms of visualization and decision support for teachers.

So I love this picture because it's from a colleague in Chile, and he does these great things where he projects on the wall sort of these massive grids where each student is working on questions on a different grid cell and then the teacher can like monitor all of them and go in and change which exercise each student is doing.

But this is a massive space. I can't imagine looking at 40 different belief states, one for each student, and trying to figure out, well, what does that mean about what I should teach tomorrow? And so I think there's really cool decision support questions there, where you could say things like 40 percent of your students don't understand fractions. Or instead of having everyone go to the same recitation every week, you'd make it dynamic. So each week you would say, all right, you get assigned to recitation two because everybody else in recitation two doesn't understand long division. And so you could reduce the variance over what teachers are trying to teach. So I think there's some really exciting questions there as well.
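
One hypothetical form this kind of decision support could take, with made-up data: group students by the skill their belief state says is weakest, so each recitation covers a narrower range of needs.

```python
from collections import defaultdict

def assign_recitations(beliefs, skills):
    """Group students by their weakest skill so each recitation covers a
    narrower range of needs.

    beliefs: dict student -> {skill: P(skill mastered)}.
    The grouping rule and data are illustrative assumptions, not a proposed system.
    """
    groups = defaultdict(list)
    for student, mastery in beliefs.items():
        weakest = min(skills, key=lambda s: mastery.get(s, 0.0))
        groups[weakest].append(student)
    return dict(groups)

beliefs = {
    "ana":  {"fractions": 0.3, "long_division": 0.9},
    "ben":  {"fractions": 0.8, "long_division": 0.2},
    "cara": {"fractions": 0.2, "long_division": 0.7},
}
print(assign_recitations(beliefs, ["fractions", "long_division"]))
# {'fractions': ['ana', 'cara'], 'long_division': ['ben']}
```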

So just to summarize, I think that sequential decision making under uncertainty is really a key part of artificial intelligence and machine learning. I presented several algorithms for subclasses of problems where we achieved orders of magnitude improvement compared to prior approaches. And I'm really excited to continue to work on some fundamental challenges, inspired by real-world applications. Thanks.

[applause].

>> Eric Horvitz: Any more questions? [inaudible]. Thank you very much.

>> Emma Brunskill: Thanks.
