>> Tim Paek: Thank you for coming. It's my pleasure and honor to introduce our guest speaker for today, Steve Young. Steve is currently professor of information engineering at Cambridge University and head of the Information Engineering Division. However, he has very close ties to Microsoft, not only because he has advised many students who are now Microsoft employees, but because he co-founded and was the technical director of Entropic, which we acquired in 1999. So he was actually a blue badge, full-fledged Microsoft employee for a while as an architect, but he decided to go back into academia, so he went back to Cambridge University. He has had an illustrious research career in the area of spoken language technologies, from speech recognition and speech synthesis to, more recently, dialogue management. Among his notable contributions, he's the inventor and author of the HTK toolkit. He's been doing work on POMDPs lately, which has been gaining a lot of speed. And with this kind of career, you would expect a lot of distinctions, and he does have them. He's a fellow of the Royal Academy of Engineering, the Institution of Electrical Engineers, and so forth and so forth. I think you guys have all seen his bio. So without further ado, Steve Young.

>> Steve Young: Okay, thank you.

[applause]

>> Steve Young: So I'm going to talk today about some work we've been doing for the last few years at Cambridge, which is kind of to one side of the speech recognition work, but arose initially out of the assumption that however hard we work on speech recognition, it was never going to be perfect. So how can you improve a spoken dialogue system, given a recognizer that is going to make errors? That was sort of the original motivation. Since then, I've thought a little bit more about what a human computer interface should be doing, and I'll say a little more about that in the introduction, about why use POMDPs for HCI. And then I'll quickly go through a simple example, which you may find too trivial to bother with, but it's to try to illustrate, for those who haven't really thought about them, what POMDPs are and how you might use them in an interface. It makes it perhaps a little clearer than the speech case, where the complexity sometimes hides the big picture. And then I'll talk about how you scale POMDPs to big problems like speech recognition -- spoken dialogue systems. And then, by way of example, I'll talk about a system we started working on about five or six years ago, something called the hidden information state system. And then very briefly at the end, I'll say something about the more recent system, which is the Bayesian update system. And then I'll wrap up.

So why use POMDPs? If you're going to build an interface which is going to be robust, whether it's speech or any kind of human interface, you're going to have to deal with uncertainty, and I think if you don't model uncertainty explicitly in the way you manage a dialogue system, you're never going to be able to do very well. And as part of that, I think it's important to be able to track what the user's trying to do, because the only way you're going to interpret something which is noisy and probably ambiguous is if you've got a pretty good context in which to interpret it. The third thing is that communication is always trying to serve some goal. So it's good if you can quantify those goals, and then you've got something to optimize. And then finally, you need to be able to adapt.
So that suggested that however you build an HCI interface, it really needs to be mostly parametric and not just hand crafted rules, because otherwise you're not going to be able to adapt to rapidly changing environments in the near term, or even in the long term. One of the things that always strikes me about most deployed spoken dialogue systems is that people put a lot of work into them on the day they install them, and then they might run for six months, a year or several years, and the performance doesn't really change with time at all. There's no sense in which the longer you use it, the smarter it gets. So I'm going to argue the POMDP framework is the natural way to deal with all of those issues.

So let me very quickly -- sometimes people stop me when I do this and I spend the entire talk on this example, which would be a bit sad. An iPhone uses a gesture interface. Sorry to mention an iPhone example in this particular building. Sorry for that. So think about the interface for -- suppose you've just taken lots and lots of photos and you want to quickly skim through them and delete the ones you really don't want to keep. The current iPhone is a bit clunky. You have to select the photo, select the delete key and then I think you've got to confirm that you really want to delete it. Suppose you want to have an interface that is really quick and consists of swipes. Left swipe, right swipe, down to delete. In other words, something that looks like this. And delete and so on. The problem with this, of course, is if you do it quickly, you'll make errors, probably, and you'll start to delete the things you don't intend to delete.

Now, traditionally -- remember, this is a toy example for illustrative purposes, so don't start saying, Billy, don't do it that way -- just imagine that the gestures, in this case, are just identified by the angle they make. So it would divide up the compass like this. The blotches, the green blotches, are forward gestures, backward gestures, delete gestures. Of course, you have to have something in the device that is going to measure the angle. That's going to make errors. This has made some sort of error. The usual way to do this in the classic sort of framework is you say okay, well, we'll sort of use some kind of pattern classification approach. We'll record some data. We'll annotate it. We'll get some distributions that might look like this. We'll put some decision boundaries in there, and at least in some sense, we're making an optimal decision. We could even put some risk in there. And so when we get this gesture that we don't know about, we compare with the decision boundaries and we decide that's a backwards gesture. And then we can go a bit further than that: since we've got the distributions and we know what the overlap is, we can compute the probability of error and compute some kind of confidence threshold from that. And so then -- the application is typically hard wired, so you have something maybe like this: recognize a backwards gesture, check the confidence. If it's greater than some threshold, move back, maybe. Otherwise, do nothing. So that's kind of a classic implementation strategy. As far as I know, that's how most of the deployed spoken dialogue systems really work, in essence.

>>: So in that case, if you make a deletion --

>> Steve Young: You interrupted me.

>>: If you have an easy recovery strategy for something like delete, you can simply have one gesture to recover, then you don't really have to --

>> Steve Young: Oh, yes.
I'm telling you why POMDPs are good. I'm not telling you how to make an interface, okay? But even then, you know -- well, let me continue, okay?

So what's missing? Okay. There's no model of uncertainty, as such. The iPhone is not trying to track what I want to do. It's just responding to my inputs, okay? So it's not trying to track my -- it has no belief about my intentions. And there are no quantifiable objectives, so in some sense the decision making is [unintelligible]. Now, there's a very simple example of what quantifiable objectives could be: as we'll see, we could code the risk in terms of rewards.

So how do we deal with that? Well, the first thing we need to do is model uncertainty. We use Bayes Rule -- our old friend Thomas Bayes. In graphical network terms, what we might do is treat the problem like this. We say okay, I'm going to imagine the user has three possible intentions here: to move backwards, move forwards and delete. But the system doesn't know what they are. So we'll say that's a hidden variable. And we'll say the probability of having some intention at time T depends on the intention at time T minus 1 and also the last thing the machine did. You might think, in the delete example, that surely these are independent events. They're not really, actually, because typically people scroll forward rapidly through the photographs. They'll go past the one they see and think, I don't really need that. Then they'll go back and then they'll tend to delete. So there is structure, okay. Not a lot, but a little bit. In the speech example, there's much more structure than that.

Then you model the -- then you say, I don't know what the actual intention is, so I will represent -- I'll compute -- I mean, this is what the graph means, of course: this is a hidden variable. So all we ever know about this variable is its distribution. And I'm calling it B rather than P, but this is the probability of each state, S. And I'm calling it B because that's going to be my belief. And you'll see later that the critical thing is that the actions that the system takes depend on B and not S. So now I have this set up so that when we move to a new time slot, we can compute a new belief by looking at the data. We're not classifying the gesture anymore. We're just looking at the angle it makes, and based on that angle and this observation distribution, we can update the belief distribution. And we are not going to, as I say, we're not going to use this as some kind of threshold or some kind of adaptive thresholding system. We are going to base what the device does on the entire distribution and not do any kind of maximum likelihood estimate of the intention.

So that's the belief framework. And this is the framework that is implied when we say it's a POMDP. Because the second part of the puzzle is the optimization of goals, which depends on Bellman's optimality principle, which comes in many forms. This is just one of them here. And essentially, it's a recursive equation. The key idea is that you can associate with each pair of belief state and action a local reward, and then what you want to do is treat the whole dialogue or sequence of interactions as an entity and compute some total reward for the entire interaction. And Bellman pointed out that if the process is Markov, you can compute an expected value for any belief state, B. You can compute the expected value of the total reward from that belief state by a recursion that looks like this.
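For reference, a standard way of writing the recursion being described here (a reconstruction of the idea on the slide, not a verbatim copy of it) is:

$$V^*(b) = \max_{a}\Big[\, r(b,a) + \sum_{o} P(o \mid b, a)\, V^*(b'_{a,o}) \,\Big], \qquad r(b,a) = \sum_{s} b(s)\, r(s,a),$$

where $b'_{a,o}$ is the belief in the next time slot after taking action $a$ and observing $o$. A discount factor can multiply the expected-future-value term; the essential point is that the value of a belief state is the local reward plus the expected value of the next belief state, maximized over actions.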
As I said, it comes in different forms. In this case, it's just saying the recursion is essentially the local reward plus the expected reward in the next belief state -- the prime just means the next time slot here, and it's an expected value with respect to the observations. And the max is so you can optimize -- find the optimal reward and, hence, the optimal policy.

So in terms of our problem: set out the graphical model, treat it as a DBN, which it is, extend it out over the T time slots, and then what we're saying is okay, I'm going to have my policy, which, instead of this hard wired decision network, says each action is a function of the belief state -- not the most likely state, but the distribution over all states. So we get this sequence of actions driven by a policy, and then we can sum the local rewards to get a total reward, and the expected value of this is V. And then we can apply various algorithms for doing this. But we can essentially iterate: we can use the policy to compute the reward, and the max here allows us to adjust the policy to incrementally increase the reward, and if we do this iteratively in a process called reinforcement learning, we'll end up with the optimal policy, under certain constraints which aren't too interesting.

Okay. So if we do that for this simple example, then I've got my user's goals. These are the things we don't know, but we assume they're in the user's head. We've got the system actions. We define some rewards, so let's give a modest small reward for moving in the right direction, a bigger reward for deleting when we want to delete, but then give a big negative reward for deleting when the user didn't want to delete and just wanted to move forwards or backwards. You can change these rewards, of course, to suit your design criteria. It's a design option, in effect. And then we can iteratively optimize. This is just a toy example, right, to illustrate the idea, and so I didn't actually compute -- obviously, this depends on probability distributions for the transition function and the observation. I didn't train these. I just chose some plausible looking parameters, just to illustrate how this might work. And then I also set up a simulator to generate gestures with error rates varying from zero to 50 percent, and the vertical axis here is the average reward per turn. So this is the zero axis here. Going below the zero axis is probably bad news. So this is a simple but reasonable hand crafted policy which just uses a fixed confidence threshold, and it basically comes down pretty rapidly as the error rate increases. So if you go to a party and you drink enough, you really wouldn't get very far with this interface. Now, if we use the POMDP framework and train it at a fixed 30% error rate, you get a curve that looks something like this, whereby we see we've made it more robust at the higher error rates, but we've lost a bit at low error rates. Indeed, if you look at the policy, you find basically it's become risk averse.

>>: [inaudible].

>> Steve Young: Sorry, the policy. I'm not going to train the parameters of the transition probabilities, but I am going to optimize the policy. Yeah, the training here means the -- so --

>>: [inaudible].

>> Steve Young: So I used Q learning, and I used the user simulator to train it with Q learning, and I set the error rate to be 30%. Yeah, problem? Yeah?

>>: [inaudible] observations.

>> Steve Young: Of the policy. I'm trying to learn this mapping from belief states into actions.
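To make the toy set-up concrete, here is a minimal sketch in Python of the belief-tracking half of the gesture example. All of the numbers (transition matrix, angle means, rewards) are made up for illustration, just as in the talk, and the action choice shown here is a greedy one-step decision on the whole belief rather than the reinforcement-learned policy being discussed:

import numpy as np

STATES = ["back", "forward", "delete"]          # hypothetical user intentions
ACTIONS = ["move_back", "move_forward", "delete", "do_nothing"]

# Made-up transition model P(s' | s): users mostly keep the same intention.
T = np.full((3, 3), 0.1) + np.eye(3) * 0.7      # rows: s, columns: s'; each row sums to 1

# Made-up observation model: mean swipe angle (degrees) per intention, with noise.
ANGLE_MEAN = {"back": 180.0, "forward": 0.0, "delete": 270.0}
ANGLE_STD = 40.0

def obs_likelihood(angle, state):
    """P(observed angle | intention), up to a constant, using a wrapped Gaussian."""
    diff = (angle - ANGLE_MEAN[state] + 180.0) % 360.0 - 180.0
    return np.exp(-0.5 * (diff / ANGLE_STD) ** 2)

def belief_update(belief, angle):
    """Predict with the transition model, then weight by the observation likelihood."""
    predicted = belief @ T
    weighted = np.array([predicted[i] * obs_likelihood(angle, s)
                         for i, s in enumerate(STATES)])
    return weighted / weighted.sum()

# Made-up rewards r(s, a): small reward for moving correctly, large penalty for a wrong delete.
R = np.array([
    # move_back  move_forward  delete  do_nothing
    [  1.0,        -1.0,       -20.0,    0.0],   # user wants: back
    [ -1.0,         1.0,       -20.0,    0.0],   # user wants: forward
    [ -1.0,        -1.0,        20.0,    0.0],   # user wants: delete
])

def act(belief):
    """Greedy one-step action on the whole belief (the full POMDP optimizes a policy instead)."""
    expected = belief @ R
    return ACTIONS[int(np.argmax(expected))]

belief = np.ones(3) / 3.0
for observed_angle in [10.0, 350.0, 265.0]:      # a hypothetical sequence of noisy swipes
    belief = belief_update(belief, observed_angle)
    print(observed_angle, belief.round(3), act(belief))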
>>: [inaudible].

>> Steve Young: This is a POMDP, yes.

>>: Okay.

>> Steve Young: I mean, not literally. Yes -- I don't mean literally Q learning, sorry. I use a Monte Carlo based training method, which is doing an approximate POMDP learning algorithm. It's not important. It's a reasonable algorithm.

>>: So in the original hand crafted set-up, there's thresholds that had to be set, having to do with confidence and understanding --

>> Steve Young: Just one.

>>: Okay.

>> Steve Young: Just one threshold, yeah. And I just fixed a reasonable value.

>>: Then in this POMDP setup, there were several different rewards that had to be set up.

>> Steve Young: Yes.

>>: Are these rewards that you've set by hand? Are they qualitatively different from the thresholds we set by hand before?

>> Steve Young: Different operating point, and I haven't explored this. This is not a serious example. This is tutorial, Jeff. This is motivational. Okay? We're not trying to design and persuade you that this is the way to produce an iPhone interface. So if you want to disbelieve the results, fine. We can go on to the real results for speech systems later, where we get the same performance. I'm trying to illustrate a different way of approaching the problem.

>>: Will the results of the speech systems involve setting rewards, or is that also --

>> Steve Young: Yes, but a very straightforward reward. We just have dialogue systems and we give a big reward for giving the right answer and zero for getting the wrong answer, and we penalize -- give a small penalty every turn to keep it moving along. So -- and we've not tried to optimize that or say what do you really want from the design. We've just typically used that kind of straightforward big reward at the end.

>>: Do you actually have to know what the error rate is?

>> Steve Young: Let me just put the other curve on, okay. If you did know what the error rate is, and you trained the policy at different error rates and you updated the observation parameters to match the error rate, you get a curve like this. Okay. And so this is the kind of upper bound on the performance you could get with this kind of setup. Okay. Now, how you would know what the error rate is [unintelligible] is a different issue. But that's an upper bound on the performance for this particular setup. And it's a toy example. It's just meant to be tutorial.

The point is, one of the things that's making the difference here is that the system is using the transition probabilities to implicitly adjust its threshold, remember, because it's updating its belief model based on the transition probabilities, and that makes a difference. That's one of the reasons why you're getting a significant difference between this and this.

Okay. So the bottom line of this example is simply: don't think of a speech system as being a command driven interface where you speak commands and the system responds. Think of it as being a system where the system is just trying to track what the user wants to do, and it's regarding inputs as observations that it's using to refine those beliefs. And then the policy training stuff is almost secondary to that. But that's the key difference in terms of designing an interface. Okay. So, well, basically, I've just recapped that. So that's the basic idea. Now, the problem, of course, with speech or a speech-based system is it's pretty complicated.
So in my iPhone example I only have three possible states, and there's a whole load of packages which will train POMDPs in different ways for small state spaces and make a reasonable job of it. The big problem comes where you have a very large state space and a large action space. So how do we set up a system where we can do Bayesian inference tractably in real time over this very large state space, and how can we actually optimize policies as well, using reinforcement learning, which tends to get very difficult in large systems? But if we could do that, I'm arguing that this is a really rather principled approach to handling uncertainty and planning, and that's what you need in any kind of interface which is driven by human interaction.

So what is the scaling problem? Just to set the context, this is the generic sort of spoken dialogue system architecture -- the one we've been using. And what we do in our systems is -- I should say, this is a limited domain application. The domain we've been working on is actually tourist information. So the user can say things like, I want to find a restaurant, the usual stuff. I want to find a hotel. The system I'll show some examples of is actually for an artificial town; we now have a version for Cambridge that we're about to start making accessible to the public, which has got many more entities in it. But the example I'm going to talk about here is essentially that you can ask questions about hotels, restaurants and bars in this fictitious town, and our architecture is that we convert words into these abstract representations. We have a set of these dialogue acts, like confirm and negate, inform, request. And then you have attribute value pairs, which are arguments to these dialogue acts, and we use this as a well-defined interface to our dialogue manager. I'll play you a little example of a demo system running a bit later, but just in passing, all of the components are statistically trained entirely from data, apart from the message generator. And I should also stress the dialogue manager has no application dependent code in it at all. It doesn't know anything about towns or hotels or bars. It's all learned from interacting. Learned from data.

So the first system is the hidden information state system. This was built primarily as a demonstrator that this notion of tracking belief over multiple states would give you increased robustness. It's not necessarily meant to be the way to do it; it tries to mimic the basic ideas of the POMDP framework. So it takes the speech understanding output as an observation, and, in fact, we compute an N-best list of these abstract dialogue acts. So the interface is a list of the dialogue acts that the system can decode from the user's input --

>>: [inaudible].

>> Steve Young: That's what I just -- these kind of abstract representations, like confirm here equals tower. You've got a list of those here. This thing is trying to update a belief over a set of states. I'll tell you a little bit more about the states in a minute. And then we have a policy, which generates actions, which get converted into speech and so on. And we make this tractable by two mechanisms. First of all, we group states into what we call equivalence classes, called partitions. So rather than having to compute beliefs over every possible state, we have far fewer partitions.
And then secondly, we don't compute the dialogue policy directly in belief space. We map this rather complex belief space into a summary space, which I'll explain a bit more in a minute, and then we implement the policy and also optimize the policy in this summary space. And then we heuristically map these summary actions back into the higher level space. And this basic model was developed originally with Jason Williams, who was a Ph.D. student of mine.

So the actual state we record in this system has three factors: the user's goal -- that corresponds to what the user wants to do: move back, move forward, delete -- the user's act, that's the last thing that the user said, and a dialogue history. Just to put a little bit of flesh on that: the goals are grouped into partitions. So what we actually do in practice is we have a list of things which represent possible groups of goals, represented textually here just to be able to read them. So this is the set of all goals which involve finding a hotel in the east part of town. This is the set of all goals which involve finding a bar in the east, a hotel in the west. Find venue is just the goal of finding something. This is meant to be a mutually exclusive set. And this is composed with our N-best list, and then we have a grounding state. So any of the entities in any of these partitions will have a state associated with it, and the state is something like: it's been queried, it's been grounded, it's been requested, and so on. And this is conventional stuff in dialogue systems. This is essentially David Traum's grounding model. But remember, this is sort of probabilistic, so anything like area equals east can have multiple states. In fact, it's a distribution over all possible grounding states for all possible arguments of these goals. So what we actually do, in practice, is we take instances of each of these distributions, so a particular partition, with a particular assumed last user act and a particular set of grounding states, makes a single instance for which we compute a probability. And we compute all of the most likely combinations, rank them and prune them. So over the millions or even billions of potential combinations here, we typically maintain the top two to three hundred values, if you like, and their probabilities.

If you work through the maths, you get something that looks a bit like this. So this is the actual belief update, and I'm not going to go into detail on this. In fact, I'm not showing the partitions here. This is for actual states. If you plug in some -- do some algebra on this, and you have S divided into partitions, you can get something similar to this. But this illustrates the basic idea.

>>: So the partition is done manually?

>> Steve Young: No, the partitioning is completely -- so I didn't really want to go into this, because it's kind of almost boring implementation detail. We start off with everything in one partition, because we don't know anything. The user says something, the recognizer generates all of these hypotheses about what might have been said. On any mention in any hypothesis of anything like a Chinese restaurant, a cheap hotel, the set of partitions gets divided. So you can always associate each possibility of what the user said with a specific partition. So this splitting doesn't change the maths at all. It's just a computational device, in effect.
It would work exactly the same as if you were able to maintain every possible combination and you computed the beliefs individually for each combination.

So if you look at the equations, what you find is this is B -- remember B, this is a probability -- and you're recomputing this distribution each turn. And this is the old belief and the prime is the new belief, and so if you look at the terms in this update equation -- and this is standard textbook stuff -- what you see is these three components. The transition model is just taking account of state changes and, in fact, we assume there are no state changes. So we assume that whatever the user wants, they don't change their mind in the course of a dialogue, and this is a weakness, and I'll come back to that. So you can more or less ignore this. It's basically an identity transform. And then these two terms are the important ones. So this is the user action model. This says, okay -- remember that this S prime here includes the three components: the user's goal, the last user act and the grounding model. The grounding model is hidden in here. So this term says, what's the probability of the user saying something, given that their goal is this. So if you're hypothesizing the user says, I want a cheap hotel, and in the particular G here part of the goal is they're looking for a hotel, you'd expect this to have a high probability. If the particular goal is they are looking for a bar and they say, I want a cheap hotel, you'd expect this probability to be small. So this probability -- we call this the user action model -- this is the thing you don't get with many alternative formulations of this problem. And then the observation model here essentially takes the place of the confidence measure, which is the probability of the observation, given the specific user goal.

>>: Just going back to the question, this is just a factorization of the partition set, right?

>> Steve Young: This is taking the -- if you look at the graphical model, you figure out the relationship, the update equation, and then you substitute in the factored version. The S is actually represented as these three components and you'll get this, and then, as I say, in the actual system we go a bit further because we group the Ss into partitions, which requires a little bit more manipulation.

>>: [inaudible] part of the partition. So you partition basically at run time based on things that you hear from the recognizer, if I understood you correctly.

>> Steve Young: Yeah.

>>: And from the system. Sorry.

>> Steve Young: And from the system. So if the system says, I could suggest a nice hotel in the east part of town, that would also split the partitions.

>>: Right. Just makes me wonder about, like, I guess what's left is priors. I'm wondering if I give you this information and you're reasoning, your partition based on Chinese versus everything else, restaurant --

>> Steve Young: Can I come back to that?

>>: Sure.

>> Steve Young: Because I'm going to talk about that specific point.

>>: All right.

>>: So when you say the [inaudible] transition is --

>> Steve Young: The goal, so G doesn't change.

>>: I knew that, but what about the rest, right?

>> Steve Young: They can change. Oh, in the transition model, no, that's true. No, they don't change either. So you're assuming that there's an underlying sort of set of fixed values for these. All you're trying to do is estimate them from the sequence of observations.
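In the standard textbook form being referred to, with the state factored into the user's goal $g$, the last user act $u$ and the dialogue history $h$, the update looks roughly like this (a reconstruction consistent with the description above, not the actual slide):

$$b'(g', u', h') = k \cdot \underbrace{P(o' \mid u')}_{\text{observation model}} \cdot \underbrace{P(u' \mid g', a)}_{\text{user action model}} \cdot \sum_{h} \underbrace{P(h' \mid g', u', h, a)}_{\text{history model}}\; b(g', h),$$

where $a$ is the last system action, $k$ is a normalising constant, and the goal is assumed not to change, which is why the transition term reduces to an identity.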
But I'll come back to these issues again, because you're picking up problems with this model, right. So if you want to just think of it as us hacking various things here to get it to work, that's just fine.

So we have a user action model, which is a factored model. I haven't got time to talk about that. So that's the first part, partitioning, okay. Lots of detail in there, but that's the big picture.

The second thing is the master space mapping. So we have these partitions and their grounding states and the user act, and each one has a belief shown by the size of the bar here. I'm just showing the goal bit, but each of these lines is meant to represent what we actually call a hypothesis, for obvious reasons, in the system. So you're maintaining sets of hypotheses about what the user goal is, what the last act is and what the grounding states are. You maintain a list of these. That's this list. To do policy optimization, we do some gross approximations. We first of all try to characterize this complete distribution by a fixed length vector, a B prime. So this is summary space. And this is something we've been refining, but in the version I'll show you, the one we trialed, it consists of a mixture of continuous and discrete variables: the probability of the top distribution, the probability of the next distribution, indicator variables -- for example, T12same, whether the top two could potentially refer to the same entity -- and various other things, the last user act, the last system act.

>>: So the summary space now is manual.

>> Steve Young: Well, the choice of this, what features to extract, is manually done, yes.

>>: [inaudible].

>> Steve Young: No, this is completely independent of the application. These are entirely structural things. T12same means if you treat these -- I mean, okay, it's generic to the database type information retrieval system. The system knows nothing about east, west and so on, but it does know what the fields are in the database.

>>: So the schema.

>> Steve Young: So there's a schema that sits between this and the database. So if mapping these to the schema gave you the same set of entities in the database, you'd say they were equivalent.

>>: Okay.

>>: But the size of the summary space determines the scaleability of the system?

>> Steve Young: Yes, I guess it does.

>>: So somehow to address that issue, you need to [unintelligible] how small a summary space eventually you need to have?

>> Steve Young: Yeah, okay. Yes. This is something we've just started to look at, a little bit about the trade-offs in the size of the summary space. But as you'll see, the policy optimization is really quite crude in this system. So what we do is we take this reduced mapping into fixed length vectors, and then we use a codebook -- we actually quantize this. We have a distance metric on this, and we [unintelligible] it. Then for each member of the codebook we associate an action, which is a so-called summary space action. So it doesn't have these arguments. It's just basically the dialogue act itself. And we learn this, we optimize this policy -- this is now effectively an MDP rather than a POMDP -- and we optimize this online; actually, it's Q learning. And then we have a heuristic which maps these summary actions back into master space. And this was originally hand crafted, but, in fact, we now have a data driven way of doing this. So this mapping is essentially learned from data as well. But it's actually not difficult to do.
So for example, if the policy says you should confirm something, then what it does is it looks at the grounding states of the various entities and picks something to confirm, for example.

So that's sort of the architecture. Let me just show you an example of it working. So what you'll see is -- it's a bit -- you won't be able to read it very clearly, but you'll get the basic idea. This panel here just shows these hypotheses appearing as you speak. The recognizer output traces along the top here; the output of the synthesizer is here. The recognizer is a simple off-the-shelf HTK recognizer. The output is a [unintelligible] based synthesizer. It sounds strangely like Alan Black, for anyone that knows Alan Black. What you'll see is that the bars here, if you look at the spread over the hypotheses -- and this is in clean speech, so there actually aren't many errors -- but you'll see that the guy, who was one of my RAs, is asking for a venue for which there isn't a satisfiable solution.

>>: Hello. May I help you?

>>: Hi.

>>: The Kings Schilling is a great bar in the north part of town.

>>: What kind of music do they play?

>>: The Kings Schilling is known for great rock music.

>>: Are there any bars playing jazz?

>>: I am sorry, but there is no bar that plays jazz music.

>>: I would like a bar in the north.

>> Steve Young: It's now confused about what the user wants.

>>: Are there any bars in the center playing jazz?

>>: Ice Skill is a nice bar in the central part of town and they play jazz music.

>>: What kind of drinks do they serve?

>>: They serve cocktails at Ice Skill.

>>: And what is the address?

>> Steve Young: These top two hypotheses are actually identical, apart from one of the grounding states. So this represents a very sharp distribution.

>>: Thank you, good bye.

>> Steve Young: And this is essentially zero. It's on a non-linear scale so we can just see the colors, and don't worry about what the colors are. So this is basically -- it's 100% certain. So in this example, the recognizer got it exactly right the first time, and it was presumably pretty confident about it. So he effectively got a hit straightaway. But then the hit actually didn't suit the user, and so the uncertainty came in in that way. Now, the user changed their goal in that case, but the system just assumes that it simply misrecognized, if you like, the intention in the first place. So the goal stays fixed. It's just recomputing the beliefs.

Now, that's to give you an idea. We have a simulated user that we actually use to train our systems as well. Unlike Nuance, and maybe you guys, we don't have access to millions of dialogues. So actually we have a corpus of dialogues, a relatively modest number, probably about 1,000 dialogues, and as a separate Ph.D. project we have a statistical user simulator, which was trained on the data, and then we use that user simulator both for training and testing. So take this with a pinch of salt, in the sense that it's essentially testing on the training data. And what's shown here is a comparison of the performance of this hidden information state system with an MDP system. So the MDP system is using all the same components, except it has no model of uncertainty. It's always just selecting the most likely state, and it's optimizing the policy based on that in much the same way as the HIS system. But the MDP system is not a broken system in any sense. The guy who made this, it was part of his Ph.D.,
and he worked pretty hard and he was also competitive, so he was certainly trying hard to get the best performance possible. So this is a reasonable hand crafted system. So learning a policy helps, but in high noise, unless you're tracking multiple alternatives, there's only so much you can do, and the potential gain of the HIS system is demonstrated by these simulated results.

So we also ran some trials -- actually, we've run a couple of trials with students. We had, I think, a total across the trials of about 80 students who worked their way through, I suppose, 20 or 30 dialogues each. And what we did was we basically used the -- this wasn't a live system on the telephone; they came into the lab. But we had noise sources -- I don't actually remember the noise source. It was something from the noise database that used to be around a few years ago. So we basically had artificial noise in the background, and we increased the noise level to generate a range of noise conditions. And we'd hoped to reproduce these curves. But the results were not statistically significant if we tried to rank them by error rate. This is just the result for the pooled data, which is statistically significant. And this is the percentage success we get on the user trial, and the average error rate, overall, of all of the dialogues is about 30%.

>>: You say statistically significant, you mean the difference between --

>> Steve Young: This is significant.

>>: But not between HDC --

>> Steve Young: Probably not, no. My significance comment is we can actually bin the dialogues at different noise levels and plot a graph like that. But if we do, the error bars on the data points are so wide that there's, you know -- you can't get nice curves, basically.

>>: [inaudible].

>> Steve Young: Yeah, for each individual bin, yeah.

>>: Like I'm wondering --

>> Steve Young: Well, we've tried, yeah, all sorts of fitting curves, but it's not that -- yeah. We don't feel that confident about it. But we're confident about this result.

>>: So the purpose of adding noise is to make the condition --

>> Steve Young: Yeah, just to make it misrecognize.

>>: [inaudible].

>> Steve Young: Yeah.

>>: On the other hand, this system you talked about earlier also has a summary state. Is that helping or not helping?

>> Steve Young: Oh, does it hurt? We don't know. We don't know. We haven't explored that part of the system very much at all yet. I'm going to run out of time and I need to say a bit more. But yeah.

>>: When you add the noise, do the subjects hear the noise, or --

>> Steve Young: Yes, yes. So there's [unintelligible] effects, yes. They are suffering, yes.

>>: [unintelligible].

>> Steve Young: Well, we changed the noise level so that the error rate, as I said, the measured error rate, was varying between 20% and 40%.

>>: Do you notice that subjects do different things if you make it noisy?

>> Steve Young: Pass. We videoed them if you want to --

>>: Besides the task completion rate, did you happen to qualitatively assess the differences? I mean, are these --

>> Steve Young: Yes, we did. We have a paper that's just come out in Computer Speech and Language with the detailed results, if you want to go have a look at them. And we do do some --

>>: In general, can you notice that --
>> Steve Young: Subjective, yeah. Although we did another trial -- we tried to get over the statistical significance problem with another trial, and there we started to have significant problems. The notion of paying people to do scenarios that are artificial, when you start to look at the results, is very iffy, because people are -- if you give them something that doesn't exist in the town, they'll accept almost anything as an alternative. Even though we said, you know, you're really keen on jazz music, they would accept something else. And then subjectively, they'd say it was great, I got everything I needed. And you look at the objective results, and it didn't satisfy the criteria we gave them. So we've kind of more or less given up trying to do this kind of trial. We need to do live assessment. It's probably the only way to do this.

>>: So how do you explain the difference between the POMDP and the MDP? I would expect the POMDP noticed the lack of information and asked questions to clarify.

>> Steve Young: And the policy will do that. So the POMDP, for example, will do things like, did you say X or Y?

>>: And the MDP wouldn't?

>> Steve Young: The MDP wouldn't, because it doesn't know what the alternative is. In fact, that's relatively rare. Where this gains is also that repetition works. So you could keep repeating the same thing over and over again, and it's never in the top one or two from the recognizer. But it's consistently somewhere in the list, and if you're persistent, then you actually find the belief in that climbs to the point where, you know, somehow it gets to the top of the pile. So that's one of the most obvious things when you look at the data: repetition makes a difference.

>>: In real situations, real data, what kind of human annotation do you need on top of the data itself? Do you need any?

>> Steve Young: To score these things?

>>: To train. Yeah, you'd hope to improve with more data. Do you need any kind of human annotation?

>> Steve Young: No. What we need are the dialogue acts in our representation. That's all we need for the training data.

>>: But does a human have to work those out? I mean, if you have live telephones where people are calling in, do you need something on top of the raw data?

>> Steve Young: Well, the raw data -- it depends upon what you call the raw data. Our interest is in what the sequence of dialogue acts was and also what their goal was. That's what we need to know. The user simulator uses expectation maximization. So it doesn't need detailed state by state annotation. It figures that out for itself. It does need to know what the reward should be. And it needs to know what the dialogue acts were.

Can I move on? So this was meant as a demonstrator of the potential, right. It has some severe problems. One is that -- I skipped the slide that explained how the user action model works, but the user action model's hand crafted. It's not application dependent. It's actually a set of linguistic rules which define how well a dialogue act matches the goal representation. But it is hand crafted and there's nothing really to learn. And then we have this problem of changing state. So in the example here, when the system offered a different place that didn't satisfy the user's request, you can think of it two ways. The user always had this other place in mind -- that's actually not true.
The users are guided by whether it exists or not and will change their mind. We really would like to model that.

So when you're looking at what we're actually doing here, what we're saying is our fundamental problem is we have a large joint distribution to model -- things like the type of venue, the location, the price, the food. In our system, currently, there are, I think, 12 possible variables here. And so what the HIS system is doing is it's taking the most likely combinations of values, finding their probabilities, ranking them and pruning off all the unlikely ones. So it's maintaining the top few members of this joint. But then it can't possibly do transitions, because you can't put a matrix on this to get the updated thing, because most of these values are missing.

The obvious alternative is just to use a graphical model, because that's how I presented this originally. So what would you do if you had a graphical model? Well, it's going to get big very rapidly, so if you think about the minimum possible dependency, you might say that things like what dialogue designers might call slot values -- like location, price and food -- are independent. So you might say location, whether it's a hotel or a bar, doesn't matter. But price and food probably depend on what the type of venue is. So you make the simplest possible graphical model that you can. Of course, this has got its own problems. In fact, I know the east part of town is cheaper than the west part of town, so location does matter. But suppose you did this. Then you end up with a graphical model -- a dynamic Bayesian network, in effect -- which looks something like this. You've got something to represent the goals, which might be the food, depending on the type. It's all going to depend on the last system action. You've got the need to somehow get what the user says into this, so you can imagine plucking out the individual components referring to food and representing these as hidden variables. You can have your grounding states as hidden variables. And this is time T, time T plus 1. You can have a few slices of this. And that's kind of the minimum model you could possibly make.

And so if you do that for this simple artificial town and convert it to a factor graph, then it looks like this. Actually, that's the first of several pages of it. And then you can do something like loopy belief propagation to update the parameters. And if you try and do that with an off the shelf LBP, it's very slow. It's quite a big graph. However, you can exploit features of the particular setup. And I haven't got time to go into detail about message passing, but you can do two things. You can partition the values again, just like we did in the HIS system, but this time on a per slot basis. So there's no point doing an update over all possible values for food if the user's only ever mentioned French and Italian. Just chunk the rest, and do the partitions dynamically. And then you can also, instead of having a full transition matrix, say I'm really only interested in whether the goal's changing or not. So you have a constant probability of change. And if you do that, then the compute time for standard loopy belief propagation looks like a curve something like this, just plotting time against branching factor. With these optimizations, you can get something which is tractable. The details don't matter.
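One way to write the simplified per-slot transition just described, assuming the probability mass for a change is spread uniformly over the other values (a detail the talk does not specify), is:

$$P(g_t = y \mid g_{t-1} = x) = \begin{cases} 1 - p_{\text{change}} & \text{if } y = x,\\[4pt] \dfrac{p_{\text{change}}}{|\mathcal{G}| - 1} & \text{if } y \neq x, \end{cases}$$

so each slot needs only a single scalar $p_{\text{change}}$ rather than a full $|\mathcal{G}| \times |\mathcal{G}|$ matrix, and values the user has never mentioned can stay grouped in a single partition.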
My only point here is, if you're going to start building these large graphical models, you probably need -- we need a serious amount of optimization to get them to work in real time.

But the big advantage of this is that -- and this is what we're starting to do now -- we can actually not just model the variables of the dialogue, we can also throw in all the parameters as well. So we can make it not just a discrete network; we can put in the parameters of the distributions. Then we can switch from loopy belief propagation to expectation propagation, and we can update the parameters online as well as running the dialogue. So that's one of the things that we're just starting to get working.

The other problem is how you actually build a policy on top of this very large Bayesian network. Again, there are ideas in the literature for doing this, and given the time, I should probably not spend long on this. Essentially, what we do is we construct a stochastic policy. There's no summary space now. We're building the policy directly on the full Bayesian network. But what we are doing is assuming that there's a very limited set of actions. So we represent the policy [unintelligible] using a softmax with basis functions, and we have a basis function for each possible action -- and there's a limited set of these -- and then we factorize out the dependency on the various bits of the network: we partition it into components and then we discretize each component with a very crude binary lookup table. And, again, we have a paper coming out very soon in Computer Speech and Language that describes the details of this. And then we use a standard algorithm, much like the actor-critic algorithm, to optimize this.

So doing all of that -- and remember there are huge approximations now in the conditions -- we get a performance which, actually, I should have overlaid these, is very similar to the HIS system. In fact, the performance currently is indistinguishable from the HIS system. The colors have changed because of the randomness of PowerPoint, but this is the MDP system from my previous slide. This is the BUDS system, and if I put the HIS system in there, which I should have done, it would have been almost the same. This is the reward against success, but again, the curves are essentially identical. And again, the same user trial gave a statistically significant difference. The only reason I'm not combining these into one graph is because actually the dialogues were -- it was done on different days. It's not strictly fair to actually combine them. But the bottom line is it's essentially the same performance.

So I'll stop there -- apologies for going on a little bit too long. My basic claim is that this kind of framework of POMDPs, Bayesian belief tracking, automatic strategy optimization, provides a good way to design HCI type systems. The HIS system, the BUDS system -- they both, I think, have demonstrated we can get improved robustness. We haven't dealt with the adaptation problem yet. Adaptation work is focusing on the Bayesian network system. The HIS system is still interesting, mind you, because the people building industrial systems are much more interested in the HIS system, because they can relate it to what they're doing. They can see that there's an incremental way of going forward, so you can think of the HIS system as maintaining multiple standard dialogue managers in parallel.
And instead of taking actions on the best one, you're trying to look at the whole bunch of parallel dialogue managers and saying, what would be the best thing to do? And so anything you can think of, any trick that you currently use in your system, you can, in principle, put in the HIS system. As an evolutionary path, that looks interesting. From the point of view of working on adaptation and so on, online parameter learning, the BUDS system is more interesting. In the long term, that seems to be the way to do it. Moving forward, we need to develop scaleable solutions, particularly for the Bayesian network systems. We need to be able to deal with more linguistic phenomena. We need to be able to deal with multimodal things. But multimodal is trivial in that framework: you add in an extra observation function for each of your inputs. It's rather easy to integrate them. And there are issues of migrating to industrial systems. Tim Paek has pointed out some of the issues, in his paper with Roberto, about how you guarantee performance to a client when your system is essentially statistical and who knows what it's going to do next. But that's where we are. So I'll stop there.

>> Tim Paek: We have time for questions.

>>: So the HIS and BUDS systems -- in the final comparison, you did compare HIS versus BUDS? I suppose BUDS is better than HIS. They use the Bayesian learning.

>> Steve Young: They turn out to be currently about the same. They're pretty much the same. That's presumably because -- the HIS system is able -- that's coming back to a question maybe Guy or Dan asked about the condition -- I mean, the HIS system is not throwing away any conditional dependencies, right, because it's taking the full -- it's sampling the full joint. So some things it does better. I mean, location and price, for example, happen to make a bit of a difference. It models that, whereas it's thrown away in the BUDS system.

>>: They both have a similar kind of summary space?

>> Steve Young: Well, the BUDS system doesn't have a summary space. It has a summary action space, and the summary action space is very similar.

>>: If you don't have a summary space, how can you justify the scaleability of the BUDS system?

>> Steve Young: They're both operating on exactly the same domain.

>>: In a situation where you have real data, not simulated data, how efficiently do these systems use the data? I mean, obviously, data is free if it's simulated, but --

>> Steve Young: Yes, that's a very good question. We don't know the answer to that. Clearly, on the simulator, typically, we're talking about the training curves for both systems running to a total of about 500,000 dialogues. So, you know, they're getting pretty good by 50,000 to 100,000, but we train up to about 500,000, typically.

>>: Is that depending upon the size of the problem?

>> Steve Young: That will almost certainly depend on the size of the problem. As we move to richer domains, I expect that number to go up. But actually, if you look at statistics on Nuance calls, where you can look at [unintelligible] statistics, I'm sure, this is actually not completely, you know, out of the ballpark.

>>: I didn't find the answer to my previous question. Like if you have calls coming in, you say you need to annotate with the dialogue acts -- does that mean just the acts the computer took or the actual real --

>> Steve Young: So I perhaps misunderstood your question.
What we need to do now, because we use our data to train our user simulator, my answer is what data our user simulator needs. Currently, we have a two-step system. We take data to train the simulator, and use the simulator to train the dialogue system. If the dialogue system was connected directly to the real users, all you would ever need to know is what the rewards are. And that just is --

>>: [inaudible].

>> Steve Young: [inaudible] any annotation at all, no.

>>: Do the numbers still hold, 100,000 or 500,000?

>> Steve Young: No idea. Don't know. I can't do the experiment.

>> Tim Paek: Go ahead.

>>: I think the whole issue with rewards is interesting, because one of the things I would love to see an analysis of, and I don't know if you've published this, is when the system fails, how bad are the failures. Like, you know, it's okay to assign high reward to completing the task, and then everything else gets, you know, negative reward, for instance. But to some degree, that's not really true. Some experiences are better than others, right? Even when they fail, right? And you're kind of biasing your system to kind of -- do you know what I'm saying?

>> Steve Young: Absolutely. So this is the sort of thing Lynn Walker is doing and so on, trying to build a model between the reward function and user satisfaction. And if you could do that, then presumably, instead of optimizing this naive reward function -- maybe, I don't know, does Microsoft want to optimize user satisfaction? Let's assume yes, okay? So I guess that's what you'd do. I mean, that's an interesting research topic, I guess. But it's not something we've looked at.

>>: When you looked at the failures, were there any kind of --

>> Steve Young: Failures.

>>: With [inaudible] systems, one good thing you can do is anticipate what the bad things are and kind of make sure they're not as bad.

>> Steve Young: Yeah, okay. So the HIS system fails -- really does fail. So there is an implementation problem with the HIS system at the moment, which is that it's tricky to recombine partitions. And so what you get with a long dialogue is more and more partitions, and the dialogue slows down. Okay. And so users give up. The BUDS system doesn't suffer from that. You can talk to the BUDS system forever. So in some sense, the BUDS system never fails. If you're persistent enough and you sit there long enough, you will probably get -- it will fail in the sense of not being able to get the answer, but presuming you'll sit there long enough to figure out how to get the recognizer to recognize your voice sufficiently well, it will just keep on talking, because all you're doing essentially is, you know, updating beliefs in the various slots to the point where it can actually get a match.

>>: I thought that the reward function would penalize the length --

>> Steve Young: Yes.

>>: [inaudible].

>> Steve Young: Yes, it does, but we're talking about failures, right? The user doesn't know about the reward function. For them, failure, presumably, is -- and we just arbitrarily chop the dialogue at 20 turns. We say if you haven't got it by 20, you know, stop. Yeah?

>>: These dialogue systems never say that they don't understand what you're saying. And wouldn't users give up long before 20 turns?

>> Steve Young: Probably. And no, it never says, I don't understand what you're saying.

>>: That might be an interesting --

>>: That's an obvious POMDP action.

>> Steve Young: We don't have that in our action set. It is obvious, yes.

>>: Your information is not good enough --
I'm going to try to get more information.

>> Steve Young: And we haven't got that in. It's probably got surrogate actions. Often, when it gets in a complete mess, it starts to ask you, can I help you with anything else? Which is perhaps not the most helpful thing to say. So perhaps it should say, I'm not really understanding you, but, you know, it's the kind of back off when everything --

>>: [inaudible] the users are constantly [inaudible] the system. The users learn how to play with the system, how to -- I wonder, using this kind of statistical system, actually the system becomes less predictable, and will that actually hurt the user performance, the user experience, because users see inconsistent behaviors from the system, given the same input [inaudible]?

>> Steve Young: Yeah. So first of all, all the trials we've done are with people -- we deliberately did not reuse subjects. So subjects were all essentially never used before. And we sort of didn't let them have enough interactions to really learn very much about the behavior. I guess my answer mostly would be that we should be personalizing the user experience, right? So if we figure out how to parameterize these models and we can have something equivalent to the MLLR transform for the dialogue, then if you can recognize the caller, you might plug in the transform. Other than that, the other thing is that, as I said, it currently takes a large number of dialogues to change. So if you had this with a real system live, unless we can figure out ways of adapting much more quickly, which we'd obviously like to do, then it's going to be quite a long time period over which it adapts. Hopefully, users would say, you know, I used that system yesterday; I used it three or four months ago and it was awful, but it's actually getting quite reasonable now. But --

>>: I think it's also hard to tell, because with speech, you have to produce the same speech each time, right, before you can detect a consistency, and then if you find that there's a particular way in which it's being recognized, well, we'll take advantage of that anyway.

>>: Just for a second, I'm wondering if there's anything interesting in looking at, or if you guys have looked at, how stable this is with respect to assigning rewards. If I assign these rewards and I come up with minus 1 and plus 20 -- if I say minus 1 and plus 18, how stable is the policy?

>> Steve Young: So the --

>>: It won't make a difference for task completion, but I don't know about the perceived --

>>: So I think you're going to optimize -- that's exactly my question. At least to me it's not clear whether it will make a difference or not for task completion. We're optimizing rewards here. I don't know. The mapping to task completion or to user perception, like how linear or how varied that --

>> Steve Young: Yeah, so the only two metrics we ever look at are percentage success and the reward. And our reward is essentially the same metric. So --

>>: You could still look -- I mean, if you vary some -- let's say -- is there a way, I don't know, I'm just talking off the top of my head. But is there a way you can vary the reward structure to look -- obviously, the rewards you're going to get are going to mean different things. But if you look at the policy, can you somehow inspect the policy and see whether in similar states it takes similar actions?
>> Steve Young: On the HIS system, you certainly can, because it's basically a lookup table and you can actually go through the lookup table and see why it seems to be choosing a particular action. One of the things we've done recently which actually improves performance is we don't associate a single action anymore with each belief point. We associate an N-best list, since we use a version of Q learning, so we have this Q function associated with every belief point. We look at the N-best list, and then you combine the generation problem with the choice of which action to take and say, well, in this context, it's not really obvious how I would do a confirm -- constructing a confirm from the current state of my grounding states and so on.

>>: It's interesting, because it allows you to put the heuristics in place that can check against really bad things happening, that Tim was talking about, if you have choices in the action.

>> Steve Young: Yes. So we can use a ranked list now. And it actually does improve performance quite a bit. Okay.

>> Tim Paek: If there's no further questions, let's give our --

[applause]