>> Alex Acero: Okay. So good morning, everyone. It's my pleasure today to introduce to you -- I'll try my best to pronounce this, Milica Gasic, who just had her Ph.D. defense at the University of Cambridge working with Steve Young. And she's been working in the area of statistical dialogue modeling using POMDPs and other techniques that she's going to tell us about. And we're very excited to have Milica here. >> Milica Gasic: Thank you very much, Alex, and thank you for inviting me here. It's a great pleasure to be here today. Is this not on? >> Alex Acero: It's not showing up on the screen. It should be. There you go. >> Milica Gasic: Yeah. I'm going to talk today about statistical dialogue modeling. This is work that we did in the Cambridge Dialogue Systems Group, and some of the things that I'm going to mention are part of my Ph.D. So let's start, then. Spoken dialogue systems enable human-computer interaction where the primary input is speech. But they enable more than that. They allow us to have natural dialogue between a human and the computer, which is one of the most natural ways in which people communicate with each other. So they are very appealing for many applications, because such applications have a zero learning curve; the users don't have to get used to the system at all. So they have really innumerable benefits. But building them to operate robustly is a big challenge. And I'm going to talk today about how a statistical approach can be used to overcome these challenges. So let's start from the traditional approach to spoken dialogue systems. They're built using rules that try to cover every possible stage that the dialogue can be in. So, for example, the system may start the dialogue by asking the user: are you looking for a hotel or a restaurant? And then the user may reply, I want a restaurant, or, I want a hotel. If the user said I want a restaurant, the system may ask what kind of food they would like, or in the case of a hotel, it may ask how many stars they would like, and so on. So the designer has to hand code these rules for every possible stage that the dialogue can be in. And what is more, they need to specify the processing at each of these nodes. So let's see what the node processing can look like, for example, for the question what type of food would you like. The system starts by asking the question, that question gets generated, and then the user's reply needs to be recognized. One way of doing that would be to load a grammar based on the question, so that speech understanding is able to recognize that answer more correctly, because if the system said what kind of food would you like, an expected answer would be I want Chinese, or I want Italian food, and so on. But then the speech recognizer may not be correct. So how do we deal with an error that may occur? Well, we can use an error handler, and this is an example of an error handler which makes use of the confidence score. So, for example, if the confidence score is very low, then the system may ask the question again. If it's high, then the system just carries on asking other questions. And if it's medium, then it may make a confirmation. So this represents just one node in that big directed graph that is called the call-flow. But what does this look like for real systems? Well, this is a part of a call-flow of a real deployed dialogue system. And you can see from here just how laborious it is for the designer to hand code each of these rules manually.
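As an illustration of the hand-coded, per-node error handling described here, a minimal sketch might look like the following; the thresholds and action names are purely illustrative assumptions, not taken from any actual deployed call-flow.

```python
# Minimal sketch of the confidence-based error handling for one call-flow node.
# Thresholds and action names are illustrative assumptions only.

def handle_food_question(recognized_answer: str, confidence: float) -> str:
    """Hand-coded error handler for the 'what type of food would you like?' node."""
    if confidence < 0.3:            # low confidence: ask the question again
        return "ask_again"
    elif confidence < 0.7:          # medium confidence: confirm what was heard
        return f"confirm:{recognized_answer}"
    else:                           # high confidence: accept the answer and move on
        return f"accept:{recognized_answer}"
```

The point of the sketch is simply that rules like these have to be written, and possibly tuned differently, for every node in the graph.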
The designer has to think of the right way to ask questions. There is no automatic measure which says this is good or this is bad. The designer has to use their own intuition about what is the best way to design the dialogue. So that is clearly a limitation, because such dialogue systems are expensive to build. They rely on manual effort. And once the designer builds this system for one domain, it is not directly usable for another domain, because all these questions relate to a particular dialogue domain. What is more, these systems are typically fragile to speech recognition errors. You saw before that every node represents a particular stage in the dialogue, essentially assuming that that's what the user really wants -- that if the user said I want a restaurant and we recognized that, it's correct input. But that's not always the case. And we need to use external error handlers to deal with it. And these need to be hand coded, because they might be different from node to node, from application to application. So that is clearly a problem. Finally, once these systems are built, they stay fixed unless the designer decides to change their functionality. Once they're deployed, they can serve tens of thousands of dialogues, and still they behave exactly the same as they behaved in the first dialogue, unless the designer notices a problem and changes it. So I'm going to tell you today how a statistical approach to dialogue management can be used to overcome each of these problems. So let's start from the first problem: dialogue systems are expensive to build. And why is that so? We have many nodes, many states that the dialogue can be in, and then the designer has to hand code the rules for each of these states. If we could define the whole space of states that the dialogue can be in, and also define all the rules, or actions, that the dialogue manager can take, we could then automatically search through this space and find the best path, find the best policy that the dialogue manager should follow. The essential ingredients for achieving this are to model dialogue as a Markov decision process, to define a measure of dialogue success, a reward, that is to be optimized, and then to use reinforcement learning techniques to find the best policy, and that is the policy that generates the highest expected reward over time. So let's see what the structure of a statistical dialogue system looks like. Now the speech understanding unit is completely separate from the dialogue manager. There is no need to constrain the user to answer questions. The user may say whatever they want at any point in the dialogue, because that's how humans naturally speak. What is more, the dialogue manager is now decoupled into a part which maintains the dialogue state and a part which is responsible for defining the policy. That way we can easily change the dialogue policy, which is not very easy with traditional rule-based approaches. During the process of training, we can replace the user and the speech understanding unit with a simulated user and then train the dialogue manager, or the dialogue policy, in interaction with the simulated user. The simulated user is able to evaluate dialogues and give a reward to the system. And that way the system can calculate the expected reward for every particular state and then change the policy so that it takes actions which lead to the highest expected reward.
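To make this MDP formulation concrete, here is a minimal tabular Q-learning sketch for a dialogue manager trained against a simulated user; the reward shape (a small per-turn penalty plus a bonus for a successful dialogue), the epsilon-greedy exploration, and the `simulated_user` interface are illustrative assumptions rather than the group's actual training setup.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch for an MDP dialogue manager trained against a
# simulated user. Reward shape and environment interface are assumptions.

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.2
Q = defaultdict(float)                          # Q[(state, action)] -> expected long-term reward

def choose_action(state, actions):
    """Epsilon-greedy action selection over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def train_episode(simulated_user, actions):
    """Run one dialogue against the simulated user and update Q along the way."""
    state = simulated_user.reset()
    done = False
    while not done:
        action = choose_action(state, actions)
        # Assumed interface: e.g. reward of -1 per turn, +20 if the dialogue succeeds.
        next_state, reward, done = simulated_user.step(action)
        best_next = max(Q[(next_state, a)] for a in actions) if not done else 0.0
        # One-step temporal-difference update towards reward plus discounted future value.
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state
```

Running many such episodes against the simulator is what "searching through the space for the best policy" amounts to in practice.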
>>: So is writing a simulated user easier than hand coding all those rules? I mean, isn't -- >> Milica Gasic: Well, ideally we would want not to use a simulated user at all, but to train with real people. But that is not always possible. And I'm going to come back to that later in my talk, how we can actually avoid using a simulated user. However, there are techniques for building a simulated user directly from data, to make a statistical simulated user. So then you start off with a corpus of dialogues, create a simulated user, and then train the dialogue manager with that simulated user. But I think we should get rid of the simulated user altogether, and that is clearly a limitation. Okay. So, more theoretically, we model dialogue as a Markov decision process. Dialogue can be seen as a sequence of turns, and every turn corresponds to the state that the dialogue is in. And then in each of these states -- so in the state at time t -- the dialogue manager takes an action and then moves to the next state. The next state only depends on the previous state and the action that was taken. So in that way the model satisfies the Markov property. What is more, at every time step the system receives a reward. And the aim is to take actions in such a way that they generate the highest expected reward over time. And in order to find a policy that gives us such actions, we can use reinforcement learning. The Q-function gives the expected long-term reward that can be achieved when an action is taken in a particular state. Reinforcement learning algorithms can be used to optimize this function, and that can give us a policy. In that way dialogue is perceived as a planning problem. So actions are taken now in such a way that they generate the highest expected reward later in the future. This was first introduced by Levin in '98, and many subsequent researchers showed that this approach really matches human-coded rule-based systems. But it still doesn't deal with the problem of robustness. So I'm going to move on to the next problem, which is that dialogue systems are fragile in noisy conditions. The problem originates from the fact that every node in the call-flow, every state in the MDP dialogue model, is assumed to be correct. It is assumed to correctly represent what really is the case. But obviously if there was an error in the speech recognizer, that won't be the case anymore. So one way of dealing with this is to take into account the uncertainty about which state the dialogue is in when making decisions. And for that the essential ingredients are to assume that the state is unobservable and is only seen through a noisy observation, and then to model the dialogue as a partially observable Markov decision process. The outcome is increased robustness in performance. So let's go back to the graphical model of an MDP. We have a sequence of states, actions are taken, and the next state only depends on the previous state. The crucial difference here is that the state is unobservable. So that is why it is not shaded here. It is only seen through a noisy observation. And now the action selection cannot be based on the particular state, because we don't know which state the dialogue is in. But we maintain the distribution over all possible states at every dialogue turn. And then the action selection can be based on this distribution.
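For reference, the update of this distribution over states after each system action and observation can be written in the standard POMDP form (a textbook statement rather than a formula shown on the slides):

$$ b_{t+1}(s') \;=\; \eta \, P(o_{t+1} \mid s') \sum_{s} P(s' \mid s, a_t)\, b_t(s), $$

where $b_t$ is the distribution over states at turn $t$, $a_t$ is the system action, $o_{t+1}$ is the noisy observation, $P(o \mid s)$ and $P(s' \mid s, a)$ are the observation and transition probabilities, and $\eta$ is a normalizing constant.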
The distribution over all possible states at every particular turn is called the belief state, and it represents the system's belief about which state it is in. But you can probably see here that if you had N states to start with, you now base action selection on something that is a probability distribution. So it is basically a vector of length N whose elements sum to one, so it belongs to this huge multidimensional space. And obviously doing that would seem intractable. And it is. >>: Go back. >> Milica Gasic: Yeah. >>: Wouldn't you want the reward to depend on the observations? >> Milica Gasic: Well, your observations are not correct. You never know whether what you see is really what the user wants. >>: Why are you [inaudible] why is the state [inaudible] action depends on the state? >> Milica Gasic: So we model actions as a random variable. And then we want to take actions in such a way that the overall reward that we get is maximal. So in the model itself, the next state that the system is in depends on the action that the system took. So, for example, if the state represents the user goal -- the simple case is if the state represents the user goal, and then the system says something -- if the user wanted a cheap Chinese restaurant in the center and then the system said there is no cheap Chinese restaurant in the center, then the user goal may change to an Italian restaurant. So that is why the state depends on the action. >>: [inaudible] user goal is a state? >> Milica Gasic: Well, it's -- I'll come back to that. It's a part of the state. There are other elements in the state as well. Also the most important one is what the user actually said, because that is what is unobservable, and that is why we have to model it as an unobservable variable. >>: So for [inaudible] the state recognition tend to be different. >> Milica Gasic: It's the same state; it's just that here it is unobservable, whereas in the MDP we assume that it's observable. So if the user says something, we assume that that is correct, that we really understood what the user wants. >>: So I think [inaudible] because this state is different from the dialogue state. Previously we thought [inaudible] we have tens of thousands of dialogue states, but this state is actually more like what the user wants. >> Milica Gasic: It can also be even larger. So in the call-flow, the state, or the node, doesn't depend only on the previous node you are in, but on the whole path of the tree from the beginning to where you are. Here we assume that the state is Markov, so it's much richer and it encodes many more elements. >>: [inaudible] question, I thought A is obtained by optimizing whatever [inaudible] but that's a function of state, for a given state you have [inaudible]. >> Milica Gasic: The policy is a function of state. And for every state it gives us the action. But this is just a graphical model. So here we just see the action as a random variable. >>: So the way I understand it, the action A_t is connected to S_t+1, and that's understandable, because once you take an action, the state will change. >> Milica Gasic: Yes. >>: But on the other hand how to arrive at [inaudible] so they should be linked from S_t. >>: I think the point is that -- >> Milica Gasic: The policy is linking those. Yeah. In practice you can take just any -- I mean, this model would be valid for any -- >>: What exactly does this model represent?
Is this representing a joint probability distribution over some variables? >> Milica Gasic: Yes. So ideally -- so using this graphical model we can update the distribution over these states given the distribution over the previous states, given the observation probability and the transition probability. >>: So if it's a joint probability distribution over variables, then the thing that you would like to maximize is your probability [inaudible] consistent with maximizing the reward. >> Milica Gasic: No, you don't maximize the probability. You just update the probability given -- so you have observations and you know what action you took. >>: [inaudible] learning dynamics of the model, so the [inaudible] probabilities, that's one problem. Say you have the dynamics of the model, so you know P of S_t+1 given [inaudible], then you have another problem, how will I find the policy that chooses the action so that I optimize [inaudible]. >> Milica Gasic: Yeah, it's two separate [inaudible]. >>: [inaudible] that's why maybe confusing. >>: A policy is just a procedure. >> Milica Gasic: Yeah. >>: It's [inaudible] so for that reason I would think that A should be a hidden variable as well, because [inaudible]. >> Milica Gasic: You can't do that if you want to infer from data the action the system took. But here we -- it's the system's choice which action to take. So once -- >>: [inaudible] do you say this. That could be one action. That action could be applied for many different states, the same action. Right? [inaudible] if the confidence score is below this, say this action, which is then you say X, and it's the same action for many different states, then your point is it wouldn't tell you -- >> Milica Gasic: Which one, yeah. But the difference here is that the policy is based on the distribution rather than -- >>: Is it deterministic? >> Milica Gasic: The policy can be either deterministic or stochastic. But if the model doesn't change over time, there exists a deterministic policy. So if your observation probabilities and transition probabilities are stationary -- >>: It pauses deterministic thing [inaudible] you can update -- >> Milica Gasic: Yeah, yeah, yeah. Well, that's -- >>: [inaudible] treat it as a fixed variable. >> Milica Gasic: Yeah. >>: But if it's deterministic, there will be random -- there's a randomness in how to choose A. >> Milica Gasic: Yeah. >>: [inaudible] random variable. >> Milica Gasic: Yeah. So you can -- well, I mean, some people put that in a square node saying it's deterministic. Okay. Yeah. Okay. >> Alex Acero: Any more questions? So this is a reward [inaudible]. >> Milica Gasic: So for now we assume that the reward is observable, that there is some external factor which is giving you a reward at every dialogue step. During simulation, that would be the user simulator. But obviously -- so in order to perform learning, you really need the expected reward. So even if you don't get the exact same reward every time, if you know what the expectation of the reward is, that's enough to calculate the highest expected reward and to optimize the policy. >>: [inaudible] more recent research [inaudible] a few months ago of people in reinforcement learning talking about how to automatically learn the reward. Has that kind of research -- >> Milica Gasic: Well, that would be -- that would be very interesting too.
Well, yeah, we assume that it is given, and I think that's a part of future research, especially in the context of dialogue, to know what the reward is. I think ideally you really want to get the reward from the user, but not by just asking, say from 1 to 5, how good was the dialogue, because that's not good. But some other factors may be more applicable. >>: So as an additive [inaudible] the reward presumably comes from the user. >> Milica Gasic: Yeah. >>: The user doesn't know anything about the internal state of the dialogue. >> Milica Gasic: Yeah. The user is only aware of the action. >>: So why is the reward conditioned on the internal state of the dialogue system? >> Milica Gasic: Well, the user knew what happened in the dialogue prior to that point, right? The user doesn't only evaluate the action the system took at that particular point. So sometimes it's perfectly fine to say would you like an Italian restaurant, but not at every particular stage. So it does depend on where in the dialogue you are. >>: [inaudible] paradigm, so in your research do you actually have any additional [inaudible] system anything that goes beyond -- >> Milica Gasic: You mean what this graph looks like? I'm going to come to that. Yeah? So I already said applying this model directly to real problems would not work, simply because it's intractable to optimize the policy directly in the POMDP for everything but the very simplest cases. However, there are approximations which enable this to be used for large dialogue domains. Examples include the hidden information state dialogue system and the Bayesian update of dialogue state system. I'm going to concentrate in my talk on the hidden information state system. So how does this system achieve tractability in updating the distribution over dialogue states? The idea is to decompose the state into conditionally independent elements. So the state is decomposed into the user goal -- this represents the user's intention, what the user wants to achieve -- then the user action -- this represents the true user action, what the user actually said to the system -- and the dialogue history. The dialogue history is essentially a component to keep the state Markov, so that the state only depends on the previous state. If we now put this in the graphical model, what we get is that the new user goal depends on the previous user goal and the action the system took. The user action, however, only depends on the user goal. The dialogue history depends on the previous dialogue history, the new user goal, the new user action, but also the system action. The observation, however, only depends on the user action. But even with this model, it would still be intractable to directly do the belief update for a dialogue system, because the space of user goals is huge, and the space of possible dialogue histories is also huge. So the idea is to further simplify each of these components to allow tractability. >>: Yes, the question here, the semantics of any connection in the Bayesian network that you show here is the conditional distribution. >> Milica Gasic: Yes. Yeah. >>: So I assume that for each of them you have -- >> Milica Gasic: There is a distribution, yes. >>: So do you have the structure of the distribution [inaudible]?
>> Milica Gasic: So here in this -- in the HIS system, the distributions are hand coded and they are very simple, to allow this model to tractably perform the update. In the Bayesian update of the [inaudible], the BUDS system, the state is further decomposed into conditionally independent elements that depend on the particular domain that we are working on. So, for example, they would correspond to slots in dialogue domains. And then you can assume a parameterized distribution between them, like a multinomial distribution, and then you can infer these distributions from data. >>: Is the user goal a long-term goal or is it -- >> Milica Gasic: So in this model, the user goal can change at every dialogue turn. But we do not assume that the user goal changes over time. But still there are ways to support that -- ideally, you would want to infer from data when the user goal changes and how the user goal evolves over time. >>: So the data that you're talking about, oh, that's expressed in terms of words [inaudible]. >> Milica Gasic: So this is the observation -- so, yeah, that's on my next slide. So what does the dialogue state look like? The observation is a scored N-best list of user acts. And in the HIS system, the scores of the N-best list form a probability distribution. So what we do, we assume that the true user act appears somewhere in this list. I know this is a big approximation and obviously it's not ideal. Some researchers did have an additional probability, which is the probability of the true user act not being in the N-best list, which obviously is better. Then we have the user goal. The space of all possible user goals for a real-world domain is huge. So it is intractable to directly update the belief over these goals. So what we do, we group goals into mutually exclusive sets. And these sets are called partitions. They are built using the domain ontology. And the domain ontology tells us what the plausible goals are. For example, I'm looking for a restaurant serving Chinese food would be a plausible goal, whereas I'm looking for a hotel serving Chinese food, well, not so much. So these rules are defined by the domain ontology. Finally we have the dialogue history. I already said the dialogue history has to encode everything that happened prior to that moment in the dialogue, but it would be just intractable to use that. So what we do instead, for each concept that appears in the user goal, such as restaurant or Chinese, we keep track of a state called the grounding state, which tells us what is important for that concept: whether the user informed about that concept, or the user requested the value of that concept, and so on. So we basically keep track of only a small number of states that represent the dialogue history. So a combination of an element from the N-best list, the partition, and the corresponding dialogue history represents one state of the HIS system, and that is called a hypothesis. A distribution over all these hypotheses is kept throughout the dialogue. >>: So is that what was earlier called summary states? >> Milica Gasic: No, I'm going to come back to the summary space, yeah. This is the full space. This is the full belief space. >>: [inaudible] already summarized all these things already. >> Milica Gasic: Yeah, yeah. >>: So there is a summary of the summary. >> Milica Gasic: There is a summary of the summary. >>: Were any of the visible variables conditioned on D?
Was anything conditioned on D in that previous picture? >> Milica Gasic: Of the observable variables? >>: Or actually any variable. Was any -- >> Milica Gasic: So I don't think so, no. >>: Besides D itself? >> Milica Gasic: No. So the D itself in the next turn, yeah. >>: So how does D have any effect on anything? >> Milica Gasic: So it makes sure that the dialogue state remains Markov. >>: The reward depends on the -- >> Milica Gasic: The reward depends on the whole state. >>: I see. >>: Oh, I see. Oh, so all the visible depend on all three. >> Milica Gasic: Yeah, sorry. Yeah. >>: I see. Okay. I got it. >>: So what is the summary here? The summary [inaudible] previous -- >> Milica Gasic: No, this is -- these are partitions. These are dialogue states. And this -- so one element -- we assume -- so the observation is a list, an N-best list of user acts. We assume that at every particular time step the true one belongs to this list, is somewhere in the list. >>: Okay. So the summary -- sorry, the hypotheses [inaudible]. >> Milica Gasic: Yeah, yeah. So the hypothesis is all of this together. Yeah. >>: There's two different kinds of variables, one crosses the circle -- >> Milica Gasic: Yeah, so basically the circle means just that [inaudible] depends on all of them. I didn't want to put them [inaudible] because then it would be too much. >>: Right. But then on the O -- >> Milica Gasic: So O only -- only [inaudible]. >>: [inaudible] this whole big circle you will have -- >> Milica Gasic: Yeah, so -- >>: [inaudible] summary, and everything else depends on summary. So this is kind of shorthand for this [inaudible]. >> Milica Gasic: Yeah. >>: Okay. >> Milica Gasic: Yeah? Okay. So I'm going to talk about the representation of the user goal in more detail, using an example. The system said how may I help you. This is internally represented as request(task). Then the user says I'd like a restaurant in the center. So this is semantically decoded as inform(entity=venue, type=restaurant, area=center). I'm not going to talk here about how we actually do the decoding, but we use this information to build the partitions, the groups of user goals. We start with the concept that covers all possible user goals for a particular domain, called the entity. So in task-oriented dialogues, this represents anything that the user wants, such as a hotel or a flight or a booking. And then we refine the user goal using the slot-value pairs from the user action. So we start off with entity venue and partition the goal space into two partitions: the first one represents entity venue, and it can then have more attributes such as type or area, and the other one represents everything but the venue -- so, for example, a flight or whatever the domain is about. Then we go on and represent the next one, which is a restaurant. So now we have a representation of restaurant and not restaurant, and then area central and not central. And in that way we basically represent the whole space of user goals with just five partitions. And this enables us to tractably update the belief space. But remember we have here an N-best list of user actions. And each element of the N-best list will take part in the partitioning. Also the dialogue may be just very long, depending on the domain. And this has an exponential nature. So how do we ensure that this space of partitions remains bounded? >>: Entity here is another variable in the earlier graphical models? >> Milica Gasic: So this is not a variable.
This is just -- this is basically an element that G can take. If you go back here. Yeah. >>: Because of creating the ontology for other Gs. >> Milica Gasic: Yeah. >>: Okay. All right. [inaudible]. >> Milica Gasic: Yeah. >>: And this is manually designed, right? >> Milica Gasic: So the ontology is manually designed. So it's a file which defines the domain. >>: So for each application you have to design this. >> Milica Gasic: You have to have -- yeah, what the application is about. >>: [inaudible] >> Milica Gasic: Well, not really. It's like a -- I don't know, a 50-line file. But you need to -- >>: It's the database of, you know, all the columns in the database. >> Milica Gasic: Exactly, yeah. >>: Hotels, location, flight, departure, airline. >>: I see. Okay. >>: [inaudible] you cannot do any unconstrained dialogue for any -- >> Milica Gasic: So but you can plug any ontology into this system. The dialogue manager itself doesn't depend on the ontology; it just uses it to build these partitions. >>: A question. Actually, so I'm having a hard time understanding what's going on here at all. Is this going to be important for the rest of the -- >>: No, with the restaurant and the not-restaurant and the central and the way the [inaudible], can you explain that -- >> Milica Gasic: So this represents one -- let's use this one. So this represents one group of goals. These are all the restaurants that are in the center of town. Because if the user wants a restaurant in the center of town, this represents their goal. >>: Oh, so that's like -- >> Milica Gasic: So we don't have to look at the top, just the bottom ones are -- >>: Some rows in your database -- >> Milica Gasic: Yeah. >>: [inaudible] targets. >> Milica Gasic: Exactly. Because even if we have recognized that the user wants a restaurant, we still need to maintain the probability that the user doesn't want a restaurant, which is this, or that the user said I want a restaurant that is not in the center, so that would be this. >>: I think I interpret this as an interesting way of phrasing a [inaudible]. >> Milica Gasic: Yeah. >>: So when you look at -- I'm not sure -- >> Milica Gasic: But you need -- the point here is that you need to keep the probability of the complement. At no point in the dialogue should you assume that what you heard is correct. So you always need to take care of the complement. >>: What's the -- the arrows are just different -- >> Milica Gasic: So this is just the way we build them, so you can forget about those; what we are interested in in the end are these. >>: [inaudible] >> Milica Gasic: Yeah. >>: [inaudible] >> Milica Gasic: But they are useful for what actually comes next, which is how -- >>: [inaudible] user to design this dialogue system [inaudible] consider central is one of the elements I need to consider, right? [inaudible] you have the area central. >> Milica Gasic: Yeah. >>: Typically it was [inaudible] for the user to actually come out with this kind of classification. >>: [inaudible] in your database, this is your database which has [inaudible] area, and the area column of your database shows up with center or whatever values it potentially has, these are automatically constructible -- >>: I think the question is if you -- when you build the database, do you think about the column that says is it in the center or not. [multiple people speaking at once] >> Milica Gasic: Well, we can just assume that we have these elements in the database. But obviously you can do -- okay?
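As a rough illustration of the partition-splitting idea discussed above, here is a minimal Python sketch; the data structures, the toy ontology check, and the "!" prefix for complements are simplifying assumptions for exposition, not the actual HIS implementation.

```python
# Illustrative sketch of goal partitioning: each slot-value pair from the user
# splits every compatible group of goals into the part that matches the
# constraint and its complement, so the complement's probability is never lost.

class Partition:
    def __init__(self, constraints, belief=0.0):
        self.constraints = constraints          # e.g. {"entity": "venue", "type": "restaurant"}
        self.belief = belief

def ontology_allows(constraints, slot, value):
    """Toy plausibility check: type and area only make sense for venues."""
    if slot in ("type", "area"):
        return constraints.get("entity") == "venue"
    return True

def split(partitions, slot, value):
    """Refine the partitions with one slot-value pair, keeping complements."""
    result = []
    for p in partitions:
        if slot in p.constraints or not ontology_allows(p.constraints, slot, value):
            result.append(p)                    # already refined here, or implausible
            continue
        result.append(Partition({**p.constraints, slot: value}))        # matching goals
        result.append(Partition({**p.constraints, slot: "!" + value}))  # everything else
    return result

# "I'd like a restaurant in the center": entity=venue -> type=restaurant -> area=center.
parts = [Partition({})]                         # root partition: all possible goals
for slot, value in [("entity", "venue"), ("type", "restaurant"), ("area", "center")]:
    parts = split(parts, slot, value)
print(len(parts))                               # 5 partitions cover the whole goal space
```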
What I'm going to stress is how we deal with this if it becomes very large. We use a pruning technique. So remember we keep the probability of every hypothesis, and each hypothesis has a partition that it belongs to. So we can calculate the marginal probability over all partitions. And since partitions are built using slot-value pairs, we can calculate the marginal probability of each slot-value pair. And that allows us to identify the lowest probability slot-value pair. So that means that if it has a very low probability, it probably doesn't contribute that much to the whole representation of the user goal. So we can get rid of that information. How do we do that? Well, we find the partitions which have that information and their complementary partitions, and then we merge the partitions back together. We do that everywhere in this tree where that slot-value pair appears. And that leaves us with a smaller number of partitions. And that basically is the same as if the user had never said anything about restaurant. So this allows us to deal with arbitrary lengths of N-best lists, and with arbitrary dialogue lengths. So this part was about tractably maintaining the distribution over all states. Now I'm going to talk about policy representation and, more importantly, policy optimization. It would be intractable to directly optimize the policy in this huge belief space, which is what we would ideally like to do. So what we do instead, we look at the most important features of this belief state and then map them into a smaller summary space. This summary space is still continuous. So what we do, we discretize it into discrete elements, and that allows us to apply MDP algorithms to optimize the policy. This is obviously not a desirable way to do this, because we need heuristics to build the summary space, and then we need heuristics to map the action that the algorithm in the summary space proposed back to the master space. But I'm going to come back to that later, how we can avoid doing this. But this is what we do for now. Also, optimization is performed at increasing noise levels. So we add an error model to the simulator, so that the simulator is able to randomly insert errors into the input, and that allows the dialogue manager to gain robustness to speech recognition errors. And still this whole approximation framework takes on the order of tens of thousands of dialogues to optimize the policy. And typically in the beginning of training the dialogues are not very good, so that is why we really needed to do this with a simulated user. >>: I thought typically the length of the dialogue is one of the components in the reward function; you penalize [inaudible]. >> Milica Gasic: Yes. Yeah. Yeah. >>: So if you require so long, you might just ask [inaudible] it's to define the [inaudible] function in such a way that the length of [inaudible] becomes a negative. >> Milica Gasic: Yeah. So we do that. We penalize the dialogue for -- but, I mean, sometimes, if the nature of the dialogue is, for example, negotiation, you still want to allow the dialogue to take longer. You don't want to make a hard decision to stop the dialogue. >>: So that comes back to exactly how to weight between the content [inaudible]. >> Milica Gasic: Yeah, yeah. >>: So how do you -- you don't [inaudible].
>> Milica Gasic: Well, you define that in the reward function. >>: Vary from user to user. >> Milica Gasic: That's true, yeah. Well, as I said, ideally you would want to infer the reward from the user. >>: So that's an easy. >> Milica Gasic: That's one of the issues, yeah. Yeah. Exactly. Yeah. Okay. So I'm going to give you a demo now of our dialogue system. >> Audio playing: Hello, how may I help you? Hi, I would like a bar in the north. The Kings Shilling is a great bar in the north part of town. >> Milica Gasic: So you see here we have five hypotheses, each of them corresponds to one partition. So the user said I would like a bar in the north, and then we have a partition that is for venue, area north, type bar, and then another one which would be just for north, and then the third one just for any bar that's not in the north, so that's not included here. So you'll see during the dialogue how this probability distribution will change, but also how the number of hypotheses will become larger and -- >>: So those aren't marginal probabilities, I guess. >> Milica Gasic: No, this is actually the full probability, because in this first turn you only have five elements in the state. But later on you will have more. >>: Wouldn't the probability of north be greater than the probability of north and bar? >> Milica Gasic: No. Because this is -- so this is for not bar. It's not -- yeah, it's not written here. But this is the probability of area north and type not bar. So, sorry, yeah, the components are not shown in the graphical interface. So this would be not bar, and this would be not north, and this would be just anything that's -- sorry. This is not north, and this is not bar and not north, and this is not venue at all. >>: So for that application, [inaudible]. >> Milica Gasic: So in this case we had about 50 venues -- this is tourist information for an imaginary town. But we have the same application for a Cambridge domain where we have 500 different entities that the user may ask about, and each of them has about ten different slots. And the users can also refer to a particular venue just by name, so then -- >>: 500. >> Milica Gasic: Yeah. So then it's 500 -- the slots can take as many values, I mean, because they can take at least 500 venues. >>: [inaudible] >> Milica Gasic: Yeah. But, I mean, you optimize the policy in this summary space. So it's difficult to give an exact figure for how large it is. But it is usable and the users can [inaudible]. >>: So in this particular case -- so what makes the system think the user's goal is to find any bar. >> Milica Gasic: Yeah. >>: Why not just -- why not give a list of bars? >> Milica Gasic: We did that as well. So in the Cambridge application we acquired the database from the Web, where we had ratings for different entities. So then we used a representation saying that we found -- so that's part of the natural language [inaudible], how they want to present something. So let's say there are, I don't know, tens of bars in the north, the Kings Shilling is the best rated, and then they can ask for different ones. Also we have a Web application where we actually present everything that currently matches the user goal and also put that on the map. But -- yeah, can we -- >> Audio playing: What kind of music do they play? The Kings Shilling is known for great rock music. Are there any bars playing jazz?
I am sorry, but there is no bar that plays jazz music in the north part of town. Are there any bars in the center playing jazz? Ice Cool is a nice bar in the central part of town and they play jazz music. What kind of drinks do they serve? They serve cocktails at Ice Cool. And what is the address? Sure, Ice Cool is on [inaudible] Street. Thank you, good bye. Thank you, good bye. >> Milica Gasic: To show you how robust this system is, we evaluated it on the user simulator by adding different noise levels, which is controlled by the error rate. So a 0 error rate corresponds to no noise, and 0.5 is a very high noise level. And we compared this to an MDP system that is built for the same task. And here we measure the average success. So a dialogue is considered successful if the system provided all the information to the user. >>: [inaudible] >> Milica Gasic: Yes. So you can see that at the 0 noise level both systems perform roughly the same, which is expected, because when there is no noise your interpretation of what the user wants, the state, is correct, so there is no need to take into account the uncertainty. But it is at the high noise levels that the value of the POMDP is really visible, because there it takes into account the varying probabilities between different states when it makes a decision, and that is why it performs much more robustly than an MDP. >>: [inaudible] train your model and give it [inaudible] error rate? >> Milica Gasic: We train the model at increasing noise levels. So we start from 0 and then we gradually add noise. But it's the designer's choice. It's important to expose the system to errors, because otherwise it wouldn't learn. >>: [inaudible] >> Milica Gasic: Sorry? >>: [inaudible] is for one system? >> Milica Gasic: Yeah, yeah, so it's for one fully trained system on -- >>: [inaudible] >> Milica Gasic: Yeah. >>: [inaudible] you train the system with multiple error rates. >> Milica Gasic: Yeah. >>: And once you have a single system, you test it multiple -- >> Milica Gasic: Yeah, yeah. So this is testing. >>: Yeah, so the system is trained using user simulated data. >> Milica Gasic: Yes. Yeah. >>: How much data does it -- >> Milica Gasic: On the order of tens of thousands of dialogues, about. Yeah. >>: [inaudible] you have the error rate, the results are sensitive to the training data. If you take it up by 50 percent -- >> Milica Gasic: Well, I think here we trained only up to 0.3, but it's a choice how -- I mean, some researchers just train at a very high noise level. And it still -- it is a choice. >>: So this is also evaluated using the same simulator. >> Milica Gasic: That's true, yes. We did evaluate with real people and it did show improvements. But it's very difficult to get statistical significance, because you would need a much bigger corpus of dialogues. >>: But there was an open challenge. >> Milica Gasic: Oh, in the open challenge the BUDS system was tested, which is also a POMDP dialogue system, and in the control test it did show robust performance to noise. >>: How does training it in this robust way affect the reward function? Is it making the dialogue system more conservative [inaudible]? >> Milica Gasic: Yes. Yeah. So the dialogues are typically longer and the system uses more confirmations. >>: So that might not lead to better user satisfaction? >> Milica Gasic: Well, if you -- so the user -- the way the dialogue behaves [inaudible].
So whatever you define in the -- sometimes if you give -- for instance, if you give a small reward for a successful dialogue, you can see during training it just says good bye. It doesn't want to talk to the user at all, so -- >>: If we have a lot more, maybe two other [inaudible] model hypothetically, do you think this gap will shrink between MDP and [inaudible]? >>: No, the MDP doesn't have a means of representing multiple states at the same turn. So it will never be able to do that, no. >>: How do you measure task completion data? >> Milica Gasic: So there are two ways you can do that. One is, if you know what the user wanted, then you can just look at what the system offered. And another one is by inferring the user goal from the transcriptions. So if you transcribe the dialogue and you know what the user said, then you can infer what the user goal was, and then based on that define the success, the completion. Yeah? Okay. So this brings us to the third problem, which is that dialogue systems do not improve with time. We saw that in the rule-based approach this is really the case, because the designer has to manually add more nodes and rules in the call-flow. It's not entirely the case for the POMDP approach, because the POMDP approach really can learn in direct interaction. But it still needs quite a lot of dialogues to train the policy. So in practice it's not really applicable for direct learning in interaction with real users. So what we really want to have is faster policy optimization. So how can we learn the policy faster? One solution may be to take into account the similarities between different belief states. And for that we model the expected reward as a Gaussian process. A Gaussian process allows us to estimate the function with just a few observations, and it gives us an estimate of uncertainty for every prediction. It uses a kernel function to define the similarities between different belief states, and that can help to obtain fast policy optimization. >>: Yeah, well Gaussian [inaudible] machine learning [inaudible] perform the question [inaudible]. >> Milica Gasic: Yeah. So I -- so here we want to just go to the next slide. It's easier. So we want to model the Q-function as a Gaussian process. The Q-function gives us the expected long-term reward, in the case of an MDP for a state and an action, and in the case of a POMDP for a belief state and an action. So we basically want to use regression in the framework of reinforcement learning. >>: I see. So it doesn't [inaudible]. >> Milica Gasic: Sorry? >>: So is the new thing [inaudible]. >> Milica Gasic: Yeah. Yeah. >>: Haven't seen anything. >> Milica Gasic: So the machine learning people employed Gaussian processes in reinforcement learning, and we applied that to dialogue management. >>: [inaudible] approximation? >> Milica Gasic: Yes. So we basically want to model this Q-function as a stochastic process. And to give you a bit of insight into how we do this, I'm going to use a toy problem. It's a voicemail domain where the user asks the system to delete or save a message. And the user input is corrupted with noise. So the system doesn't know what the user wants. Instead, it has a distribution over all possible states, and in this case there are only three states: save, delete, or the dialogue has ended. And that's the belief state.
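To give a feel for what modeling the Q-function as a Gaussian process means in this toy setting, here is a minimal numpy sketch of GP regression over belief points for a single action; the belief points, returns, noise level, and the simple linear kernel are illustrative assumptions only, not her actual data or kernel.

```python
import numpy as np

# GP regression sketch for a Q-function over the toy voicemail belief space
# (probabilities of save / delete / ended). All numbers below are assumed.

def linear_kernel(X, Y):
    return X @ Y.T                              # dot product between belief vectors

def gp_posterior(X_train, y_train, x_query, noise=0.1):
    """Posterior mean and variance of Q(belief, action) at a query belief point."""
    K = linear_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    k_star = linear_kernel(X_train, x_query[None, :]).ravel()
    alpha = np.linalg.solve(K, y_train)
    mean = k_star @ alpha
    var = linear_kernel(x_query[None, :], x_query[None, :])[0, 0] \
          - k_star @ np.linalg.solve(K, k_star)
    return mean, var

# Observed returns for the 'delete' action at two belief points (assumed values).
beliefs = np.array([[0.5, 0.5, 0.0],            # very uncertain between save and delete
                    [0.1, 0.9, 0.0]])           # fairly sure the user wants delete
returns = np.array([-5.0, 8.0])
mean, var = gp_posterior(beliefs, returns, np.array([0.45, 0.55, 0.0]))
# 'var' is the uncertainty estimate that active learning, discussed next, can use
# to decide which actions are still worth exploring.
```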
And so we want to estimate the expected reward that can be obtained from this belief state when different actions are taken. The system can take only three actions: ask the user to confirm, save the message, or delete the message. And for the expected long-term reward, in this case we use a negative reward if the dialogue is very bad, so if the user is unhappy, or a positive reward if the user is happy. Normally the Q value of an action and a belief state would be just a number, say minus 5 if the action was delete in this belief state, because it may not be a very good idea to just delete the message when there is such high uncertainty, or if the system said confirm, that may actually lead to a successful dialogue. So these are just numbers. And we basically assume that we are perfectly confident in our estimation. What the Gaussian process enables us to do is to view these values as a distribution. So it gives us an additional estimate of uncertainty about the Q-function value. So, for example, for the Q-function value of taking action delete in this belief state, if the mean was minus 5, the Gaussian process also gives us a variance. So in this case you can see that it's very confident that this is a bad action to take. Whereas here, it's not as confident, but it still thinks this is a good action to take. I'm going to say in a moment how we can use this uncertainty measure, but now I'm going to talk about the role of the kernel function in a Gaussian process. So if we had this belief state where there is almost equal probability of save and delete, and we take action confirm and someone tells us it's a very good action to take, does this help us when we want to take the confirm action in a belief state which just has a slightly smaller probability on save and a bit larger probability on delete? Does this estimate give us any information? The kernel function defines the correlation between these two values given these two belief states. And that is why the kernel function is the essential ingredient for speeding up the process of policy optimization. In addition, we can use active learning in Gaussian process reinforcement learning. So remember in the beginning I said we have the whole space of states -- belief states in this case -- and actions, and we need to explore it exhaustively to find the best path. If we want to learn with real people, we can't afford to explore this whole space. And that is where we can use the Gaussian process model for the Q-function, because it gives us the estimate of uncertainty. So in the previous case, there is not really a point in trying out the delete action in that belief state, because the model was already confident that it's a bad action to take. So during exploration, it's good to take actions that the system is uncertain about, and during exploitation just take the best actions. So we tested this in the Cambridge tourist information domain. This still uses the summary space, I have to say that. And we compare this to a regular standard reinforcement learning algorithm that discretizes the summary space into discrete elements. So here we use a very coarse grid, to have a very small number of grid points. And we did the training in batches. So after every 200 dialogues, we tested the performance of the resulting policy. So you can see that the standard reinforcement learning algorithm reaches a bit more than 60 percent success rate on average after 200 dialogues.
When we use the Gaussian process instead of that algorithm, we obtain about 800 -- sorry, 80 percent success rate. And, finally, if we use active learning, we get another 5 percent improvement. So this shows that with only 200 dialogues you can reach more than 85 percent success rate. And this has innumerable benefits. We can get rid of the summary space if we can define a kernel function on the belief space. We don't need any more heuristics for defining the summary space. But more importantly, we can potentially get rid of the simulated user, because this says that we can train our policy in direct interaction with real people. And -- >>: So what is the kernel function [inaudible]? >> Milica Gasic: So the kernel function -- so which kernel function did I use here? Oh, here it was just a very simple linear kernel. So it was just a dot product of the features of the summary space. So it was very -- you can put more effort into building a kernel function, and ideally you want to estimate the kernel function from data. So if you had a corpus of dialogues that is labeled just with rewards -- you don't really need to know what the user actually said, because this all operates on the unobservable state -- then you can learn the kernel parameters. >>: So how do you estimate the [inaudible] Gaussian process? >> Milica Gasic: So you obtain rewards at every particular state. And then there is a linear relationship [inaudible]; the rewards are related to the Q-function and then you can estimate it. Yeah, I don't have the -- maybe in my thesis -- in my thesis there are lots of details on this. And in this paper as well. But it's a short paper, so -- >>: [inaudible] learning kernels [inaudible]? >> Milica Gasic: Yeah, I did, on -- on a toy problem. So for the toy problem I was able to find the optimal policy using a POMDP, because it's such a small problem I can directly use POMDP algorithms, and then I generated a corpus from which I learned the kernel. And it shows that using that kernel you can learn the policy very quickly. But it's part of the future work to try to learn the kernel from a real corpus. >>: So the reason why you're doing better with less data is because the number of parameters you are learning is much smaller compared to the discrete-space reinforcement learning, right? >> Milica Gasic: So in the discrete -- in this case you take into account the similarity between different parts of the space. In the discrete -- >>: Right. You do that because you have fewer parameters. >> Milica Gasic: Well, it's nonparametric. Gaussian processes are nonparametric. >>: You need to learn means and variances and things like that. >> Milica Gasic: Yeah, yeah, but I don't have parameters over these. >>: Oh, the mean and the variances have parameters, right? >> Milica Gasic: No, but it's a function. So the mean -- so for every point I have a mean and a variance. But not as -- yeah. It's a distribution over functions. >>: So wouldn't you -- sorry. Go ahead. >>: Is this a related question? >>: [inaudible] basically like when you use the kernel and when you did the Gaussian, like it's like [inaudible] approximations.
So you're kind of having -- you're imposing some sort of modeling assumption and structure on the problem, and that makes you kind of [inaudible] little data, like you're putting in some sort of knowledge -- >> Milica Gasic: You put knowledge in the kernel function. >>: Right. >> Milica Gasic: So the kernel function is crucial. What is nice about Gaussian processes is that even if you make a mistake with the kernel function, it is guaranteed to converge to the optimal solution given enough data points. However, Gaussian processes in themselves have a cost that is cubic in the number of data points, so it's very easy to get stuck, especially in reinforcement learning where you want to visit states many times. So I used one approximation, and with that approximation unfortunately the Gaussian process no longer has that nice property that it converges with enough data, so you have to be careful about how you choose the kernel function. But in general there are other approximations of Gaussian processes which allow you to still be able to converge even if you -- >>: So compared to other [inaudible] function approximation [inaudible], like are there issues with like runtime performance of running this Gaussian process? >> Milica Gasic: So you have to make it tractable. So, as I said -- >>: So not the learning, I mean just when you apply it [inaudible]. >> Milica Gasic: It's slightly more computationally -- it needs more computational time, but it still works -- I mean, in comparison to how much time the ASR takes, the dialogue manager is not -- yeah. And as long as you keep -- you can keep it bounded. You don't have to -- yeah. >>: Can you explain again the link you made between the success of this Gaussian process on the summary space features and your hope that without the summary space the Gaussian process would still be tractable to learn and useful? Because you've already put a lot of knowledge into creating the summary space. >> Milica Gasic: Yeah, so as I said, the crucial point about using a Gaussian process is to define the kernel function on the space. What is nice about the approximation framework for Gaussian processes that I use is that it doesn't depend on the dimensionality of the input. Because kernel functions can be seen as dot products, as such, they don't depend on how big your input vector is. So that is why I think that this method can be applied to the whole belief space. >>: Okay. It was my understanding that maybe you were able to get away with the simple dot product because a lot of the work of organizing the space had been gone -- had been accomplished. >> Milica Gasic: In the summary space. >>: [inaudible] summary space. >> Milica Gasic: That's true. Yeah. I mean, I'm not saying that the dot product is the greatest kernel -- I mean, the linear kernel that I used. But if you can define the kernel function on -- so, for example, if we go back to the trees that I was showing, these partitions, then you would be able to learn the policy directly. Because all you need is some sort of similarity measure between the points in the belief space. >>: Yeah. So one of the -- the very hot topic now, which people actually tend to say -- I don't still fully understand that -- is that if you learn the kernel automatically, that actually becomes a special case of a multilayer neural network. >> Milica Gasic: I did hear that, yeah.
But there is a difference, and I think the difference is that -- yeah. I think it's because it's nonparametric, so you don't predefine any parameters beforehand. >>: Whereas a neural network gives you training parameters [inaudible]. >> Milica Gasic: Yeah. So here -- so basically it's nonparametric in the sense that it's not parameter free -- you have a kernel function which can be parameterized, and then you can infer these parameters -- but it's nonparametric in the sense that it doesn't constrain your final policy or your final function approximation. Which I'm not sure is the case with neural nets. >>: [inaudible] compare [inaudible] parametric method? >> Milica Gasic: No, I haven't, no. But -- yeah. >>: Why not? >> Milica Gasic: Well, I wasn't -- there are other people that work on parametric methods. And parametric methods require you to think of basis functions, and then the solution is only optimal within that space. I wanted to look into nonparametric methods that don't constrain where the solution belongs. >>: So do you explore this Bayesian nonparametric at all in your work? >> Milica Gasic: Sorry? >>: Bayesian nonparametrics. >> Milica Gasic: Well, this is where I used it, yes. So to wrap up, I hope I've shown you how we can learn the dialogue manager behavior automatically with a statistical approach, how the policies that we learn become robust to speech recognition errors, and how we can speed up the process of optimization. Future work, as I already mentioned during the talk, includes a better reward function -- ideally I think it would be great if we could infer the reward function from users, maybe from user emotions, especially if the domain is such that the user can get excited or show negative emotions about the input. And finally we would really like to learn the dialogue manager behavior from real users. Thank you. >> Alex Acero: Thank you. [applause] >> Alex Acero: We have a few minutes for questions, if anyone has [inaudible]. >>: One more question here. Sorry. >> Alex Acero: We have a little time. >>: Okay. So here you talk about, in terms of the learning, really focusing on this Q learning for the policy. >> Milica Gasic: Yeah. >>: But how can some other parameters consist of [inaudible]? >> Milica Gasic: Exactly. Yeah. So as I said, [inaudible] did work on -- he further factorized this belief Bayesian network and then optimized the parameters. I think here, if you come up with a good kernel for the state, which you use in learning the Q-function, you can use the same kernel and learn the probability distributions, the observation probability and transition probability, as a Gaussian process. >>: I see. >> Milica Gasic: So [inaudible] did work on learning probability distributions as Gaussian processes. The problem there is that obviously any sample from a Gaussian process doesn't guarantee you that what you get is a distribution, because you have a mean and then some variance over it. But you can put in the constraint that the mean of the Gaussian process is a distribution, which is all you need, and then you can make the whole system fully statistical. >>: Okay. So that work hasn't been published, though? >> Milica Gasic: No, it hasn't. It's our future work. >>: Okay. >>: So can we consider that -- since you are using kernel functions, can we consider that as actually [inaudible]? >> Milica Gasic: It's a nonparametric one.
>>: So since you are using kernel functions, can we consider that as actually [inaudible]? >> Milica Gasic: It's a nonparametric one. >>: [inaudible] >> Milica Gasic: Yeah. As I said previously, it's not parameter free -- kernel functions are parameterized functions. But it doesn't constrain your solution. If you use a normal parametric approach, then what you find is only optimal within that basis. Here that's not the case, and the Gaussian process also gives you an uncertainty measure, so it allows you to make a mistake and it can give you a good estimate of how much you are potentially wrong. >>: [inaudible] your current point is based on some nearby points you have [inaudible]? >> Milica Gasic: Yes. >>: Right. So in that case -- >> Milica Gasic: So that is one deficiency of a Gaussian process: its variance only depends on the data points, not on the observations at those data points. So the more data points you see, the smaller the uncertainty becomes, but it doesn't actually take into account which values you observed. But, yeah, if you know of a better model which does that, I'd be interested to hear about it. >>: So assuming you work with Carmine -- >> Milica Gasic: Well, I'm working with Steve Young, but we are in the same department, so I actually talked about this with -- >>: [inaudible] process. >> Milica Gasic: Sorry? Yes. >>: So is there any relevance to doing this in terms of how [inaudible] your state is with [inaudible]? >> Milica Gasic: Yeah, I don't know the details enough to comment. >>: You do the learning and kernel optimization over all users. If you can use [inaudible] and properties [inaudible], where would you put them and how could you use this? For example, males are more probably [inaudible] than females, for short -- >> Milica Gasic: Well, ideally -- >>: That was just a toy example. >> Milica Gasic: I'd rather you give me the corpus, and then I learn from the corpus what I think is important, and then -- >>: Yeah, but when the user calls, you can basically have prior information -- okay, this is a female voice and this is a male voice. So can you incorporate [inaudible]? >> Milica Gasic: So I did actually do adaptation using a Gaussian process. >>: [inaudible] >> Milica Gasic: What I did for my thesis is that I had two Gaussian process models -- in my case it was for inexperienced and experienced users. I basically trained these on two data sets, and then online I was trying to figure out, without knowing the user type, how to match the current dialogue to one of these two. So I guess you can do something similar if you just train on female and on male data. And you can obviously use additional features from -- yeah.
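A rough sketch of the kind of model matching just described: two Gaussian process models trained on different user populations, with the system deciding online which one explains the current dialogue better. The class and function names are invented, the code reuses `rbf_kernel` from the sketch above, and the cumulative predictive log-likelihood criterion is only an assumption about how the matching could be done, not the exact procedure from the thesis.

```python
import numpy as np
from scipy.stats import norm

class GPModel:
    # A GP fitted to one user population (e.g. experienced users); predict()
    # returns a Gaussian over the observed return at a given belief point.
    def __init__(self, B_train, y_train, noise=0.1):
        self.B, self.y, self.noise = B_train, y_train, noise
        K = rbf_kernel(B_train, B_train) + noise**2 * np.eye(len(B_train))
        self.K_inv = np.linalg.inv(K)

    def predict(self, b):
        k_s = rbf_kernel(self.B, b[None, :])                  # shape (n, 1)
        mean = float(k_s.T @ self.K_inv @ self.y)
        var = float(rbf_kernel(b[None, :], b[None, :]) - k_s.T @ self.K_inv @ k_s)
        return mean, np.sqrt(max(var, 1e-9) + self.noise**2)

def match_user_model(models, dialogue_points):
    # Accumulate the predictive log-likelihood of the observed returns under
    # each pre-trained model and pick the one that explains the dialogue best.
    scores = np.zeros(len(models))
    for b, y in dialogue_points:                              # (belief features, return)
        for i, m in enumerate(models):
            mean, std = m.predict(b)
            scores[i] += norm.logpdf(y, mean, std)
    return int(np.argmax(scores))
```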
>>: [inaudible] suppose you don't have speech input but text input, whereby you don't have any speech recognition errors. To what extent can you simplify this whole thing to make it text based [inaudible]? >> Milica Gasic: Oh. Well, you could say you don't have partial observability, but the user doesn't really say everything in their input, right? So you can still use the same model to infer the user goal and to build the dialogue in the same way. >>: And then the input is not -- probably is [inaudible]. >> Milica Gasic: It won't be an N-best list, it will be just a one-best list. [multiple people speaking at once] >> Milica Gasic: Yeah, exactly. Francois Mairesse, who was in our group, did statistical semantic decoding, where basically you get the input -- so the idea, if I go back to this slide quickly, is to have probability distributions everywhere. So if this was ASR and then semantic decoding, you would have a probability distribution, but -- >>: A distribution on the concepts. >> Milica Gasic: On the concepts, yeah. So the speech understanding can be just semantic decoding, but the point is that it can give you [inaudible] -- >>: So errors [inaudible]. >> Milica Gasic: Yeah, yeah. >>: [inaudible] do you think [inaudible] bigger? Because when you [inaudible] the probability in the interim, similarly with the speech input you get more errors, because the ASR makes errors and therefore the semantics, you know -- >> Milica Gasic: Yeah, yeah. But, I mean, a semantic decoder can make errors as well -- two things can simply be interpreted in two different ways, and that happens for humans too. >>: So this approach is not specific to speech-style [inaudible] -- >> Milica Gasic: No, it works for any -- >>: In that case, what is it that -- do you think that -- in [inaudible] compared to speech [inaudible] -- >> Milica Gasic: Well, I think they are mostly interested in open-domain dialogue, like chatbots where you can talk about anything. This is still limited-domain dialogue. >> Alex Acero: Okay. Just in time. [applause]
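Relating to the last exchange about semantic decoding: a toy illustration of what "probability distributions everywhere" can look like on the input side. The belief over user goals is updated from an N-best list of semantic hypotheses with confidence scores; with text input the list typically collapses to a single hypothesis, but the update is the same. The confusion model, the goal names, and the numbers are made up purely for illustration.

```python
def update_belief(belief, nbest, p_correct=0.8):
    # belief: dict mapping each candidate user goal to its probability.
    # nbest:  list of (hypothesised goal, confidence) pairs from the
    #         semantic decoder (length 1 for plain text input).
    goals = list(belief)
    unnormalised = {}
    for g in goals:
        # P(observation | goal): confidence mass on the matching hypothesis,
        # the rest spread uniformly over the other goals (a toy confusion model).
        p_obs = sum(conf * (p_correct if hyp == g else (1.0 - p_correct) / (len(goals) - 1))
                    for hyp, conf in nbest)
        unnormalised[g] = p_obs * belief[g]
    total = sum(unnormalised.values())
    return {g: p / total for g, p in unnormalised.items()}

# Example: a uniform prior over three food types and a 2-best semantic list.
belief = {"chinese": 1/3, "italian": 1/3, "indian": 1/3}
print(update_belief(belief, [("chinese", 0.7), ("italian", 0.3)]))
# -> roughly {'chinese': 0.59, 'italian': 0.31, 'indian': 0.10}
```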