>> Tim Paek: Thank you for coming. It's my pleasure and honor to introduce our
guest speaker for today. Steve Young. Steve is currently professor of information
engineering at Cambridge University and head of the Information Engineering
Division. However, he has very close ties to Microsoft, not only because he has
advised many students who are now Microsoft employees, but he co-founded and was
the technical director of Entropic, which we acquired in 1999.
So he was actually a blue badge, full-fledged Microsoft employee for a while as an
architect, but he decided to go back into academia, so he went back to Cambridge
University.
He has had an illustrious research career in the area of spoken language
technologies, from speech recognition, speech synthesis, to more recently dialogue
management. Among kind of his notable contributions, he's the inventor and author
of the HTK toolkit. He's been doing work on POMDPs lately, which has been gaining
a lot of speed.
And with this kind of career, you would expect a lot of distinctions, and he does
have them. He's a fellow of the Royal Academy of Engineering, the Institution of
Electrical Engineers, and so forth and so forth. I think you guys have all seen
his bio. So without further ado, Steve Young.
>> Steve Young:
Okay, thank you.
[applause]
>> Steve Young: So I'm going to talk today about some work we've been doing for
the last few years at Cambridge, which is kind of to one side of the speech recognition
work, but arose out of the -- initially out of the assumption that however hard we
work on speech recognition, it was never going to be perfect. So how can you improve
a spoken dialogue system, given a recognizer that is going to make errors? That
was sort of the original motivation.
Since then, I thought sort of a little bit more about what a human computer interface
should be doing, and I'll just say a little more about that in an introduction about
why use POMDPs for HCI. And then I'll quickly go through a simple example, which
you may find too trivial to bother with, but it's to try to illustrate, for those
who haven't really thought about them, what POMDPs are and how you might use them
in an interface. It makes it perhaps a little clearer than the speech case, where
the complexity sometimes hides the big picture. And then I'll talk
about how you scale POMDPs to big problems like speech recognition -- spoken dialogue
systems.
And then I'll talk, as a way of example, talk about a system we've started working
on first, about five or six years ago, something called the hidden information state
system. And then very briefly at the end, I'll say something about the more recent
system, which is the Bayesian update system. And then I'll wrap up.
So why use POMDPs? If you're going to build an interface which is going to be robust,
whether it's speech or any kind of human interface, you're going to have to deal
with uncertainty and I think if you don't model uncertainty explicitly in the way
you manage a dialogue system, you're never going to be able to do very well. And
as part of that, I think it's important to be able to track what the user's trying
to do, because the only way you're going to interpret something which is noisy and
probably ambiguous is if you've got a pretty good context in which to interpret it.
The third thing is that communication is always trying to serve some goal. So it's
good if you can quantify those goals and then you've got something to optimize. And
then finally, you need to be able to adapt. So that suggests that however you build
an HCI interface, it really needs to be mostly parametric and not just hand-crafted
rules. Because otherwise, you're not going to be able to adapt to rapidly changing
environments in the near term or even in the long term. One of the things that always
strikes me about most deployed spoken dialogue systems is that people put a lot of
work into them on the day they install them and then they might run for six months,
a year or several years, and the performance doesn't really change with time at all.
There's no sense in which the longer you use it, the smarter it gets.
So I'm going to argue the POMDP framework is the natural way to deal with all of
those issues. So let me go through a simple example very quickly. Sometimes people stop me when I do this and
I spend the entire talk on this example, which would be a bit sad. The iPhone uses
a gesture interface. Sorry to mention an iPhone example in this particular
building. Sorry for that. So think about the interface for -- suppose you've just
taken lots and lots of photos and you want to quickly skim through them and delete
the ones you really don't want to keep. The current iPhone is a bit clunky. You
have to select the photo, select the delete key and then I think you've got to confirm
that you really want to delete it.
Suppose you want to have an interface that is really quick and consists of swipes.
Left swipe, right swipe, down to delete. In other words, something that looks like
this.
And delete and so on.
The problem with this, of course, is if you do it quickly, you'll make errors,
probably, and you'll start to delete the things you don't intend to delete. Now,
traditionally, remember, this is a toy example for illustrative purposes. So don't
start saying, Billy, don't do it that way. Just imagine that the gestures are, in
this case, are just identified by the angle they make. So you it would divide up
the compass like this. The blotches, the green blotches are forward gestures,
backward gestures, delete gestures. Of course, you have to have something in the
device that is going to measure the angle. That's going to make errors. This has
made some sort of error.
The usual way to do this in the classic sort of framework is you say okay, well,
we'll sort of use some kind of pattern classification approach. We'll record some
data. We'll annotate it. We'll get some distributions that might look like this.
We'll put some decision boundaries in there, and at least in some sense, we're making
an optimal decision. We could even put some risk in there.
And so when we get this gesture that we don't know about, we compare with the decision
boundaries and we decide that's a backwards gesture. And then we can go a bit further
than that, since we've got the distributions and we know what the overlap is, we
can compute the probability of error and compute some kind of confidence threshold
from that.
And then the actual application is typically hard-wired, so you have
something maybe like this: recognize a backwards gesture, check the confidence. If
it's greater than some threshold, move back, maybe. Otherwise, do nothing.
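To make the contrast with what follows concrete, here is a minimal sketch of that classic pipeline, with a made-up Gaussian angle classifier and a single hand-tuned confidence threshold; the class means, variances and threshold are placeholders for illustration, not values from the talk.

```python
import math

# Illustrative gesture models: mean angle (degrees) and standard deviation.
# These numbers are made up for the sketch, not taken from the talk.
GESTURE_MODELS = {
    "forward": (0.0, 15.0),
    "backward": (180.0, 15.0),
    "delete": (270.0, 15.0),
}
CONFIDENCE_THRESHOLD = 0.8  # the single hand-tuned parameter

def likelihood(angle, mean, std):
    """Gaussian likelihood of an observed angle under one gesture class
    (ignoring angle wrap-around for brevity)."""
    return math.exp(-0.5 * ((angle - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def classify(angle):
    """Pick the most likely gesture and a posterior-style confidence."""
    scores = {g: likelihood(angle, m, s) for g, (m, s) in GESTURE_MODELS.items()}
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total

def handcrafted_policy(angle):
    """Hard-wired rule: act only if the classifier is confident enough."""
    gesture, confidence = classify(angle)
    return gesture if confidence > CONFIDENCE_THRESHOLD else "do_nothing"
```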
So that's kind of a classic implementation strategy. As far as I know, that's how
most commercially deployed spoken dialogue systems work, in essence.
>>: So in that case if you make a deletion --
>> Steve Young: You interrupted me.
>>: If you have an easy recovery strategy for something like delete, you can simply have
one gesture to recover, then you don't really have to --
>> Steve Young: Oh, yes. I'm telling you why POMDPs are good. I'm not telling
you how to make an interface, okay? But even then, you know -- well, let me continue,
okay?
So what's missing? Okay. There's no model of uncertainty, as such. The iPhone
is not trying to track what I want to do. It's just responding to my inputs, okay?
So it's not trying to track my -- it has no belief about my intentions. And there
are no quantifiable objectives so in some sense, the decision making is
[unintelligible]. Now, that is a very simple example of what quantifiable objectives
could be. As we'll see, we could code the risk in terms of rewards.
So how do we deal with that? Well, the first thing we need to do is model uncertainty.
We use Bayes Rule. Our old friend Thomas Bayes. In graphical network terms, what
we might do is treat the problem like this. We say okay, I'm going to imagine the
user has three possible intentions here: to move backwards, move forwards and
delete. But the system doesn't know what they are. So we'll say that's
a hidden variable. And we'll say the probability of having some intention at time
T depends on the intention at time T minus 1 and also the last thing the machine
did.
You might think, in the delete example, that surely these are independent events.
They're not really, actually, because typically people scroll forward rapidly
through the photographs. They'll go past the one they see and think I don't really
need that. Then they'll go back and then they'll tend to delete. So there is
structure, okay. Not a lot, but a little bit. In the speech example, there's much
more structure than that.
Then you model the -- then you say I don't know what the actual intention is, so
I will represent -- I'll compute -- I mean, this is what the graph means, of course,
this is a hidden variable. So all we ever know about this variable is its
distribution. And I'm calling it B rather than P, but this is the probability of
each state, S. And I'm calling it B, because that's going to be my belief.
And you'll see later, that the critical thing is that the actions that the system
takes depend on B and not S. So now, I have this set up so when I want to -- when
we move to a new time slot, we can compute a new belief by looking at the data. We're
not classifying the gesture anymore. We're just looking at the angle it makes and
based on that angle and this observation distribution, we can update the belief
distribution.
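A minimal sketch of that belief update for the three-intention example, assuming the application supplies a transition model conditioned on the last machine action and an observation likelihood for the measured angle; both are placeholders here, not the distributions used in the talk.

```python
STATES = ["backward", "forward", "delete"]

def update_belief(belief, machine_action, observed_angle, transition_prob, observation_prob):
    """Standard POMDP belief update:
        b'(s') is proportional to P(o | s') * sum_s P(s' | s, a) * b(s)
    `transition_prob(s_next, s, a)` and `observation_prob(angle, s_next)`
    are assumed to be supplied by the application."""
    new_belief = {}
    for s_next in STATES:
        # Predict: propagate last turn's belief through the transition model.
        predicted = sum(transition_prob(s_next, s, machine_action) * belief[s] for s in STATES)
        # Correct: weight by how likely the observed angle is under this intention.
        new_belief[s_next] = observation_prob(observed_angle, s_next) * predicted
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}
```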
And we are not going to, as I say, we're not going to use this as some kind of threshold
or some kind of adaptive thresholding system. We are going to base what the device
does on the entire distribution and not do any kind of maximum likelihood estimate
of the intention.
So that's the belief framework. And this is the -- this is the framework that is
implied when we say it's a POMDP. Because the second part of the puzzle is the
optimization bit of goals, which depends on Bellman's optimality principle which
comes in many forms.
This is just one of them here. And essentially, it's a recursive equation. The key
idea is that you can associate with each pair of belief state and
action a local reward, and then what you want to do is treat the whole dialogue or
sequence of interactions as an entity and compute some total reward for the entire
interaction. And Bellman pointed out that if the process is Markov, you can compute
an expected value for any belief state, B. You can compute the expected value of
total reward from that belief state by a recursion that looks like this. As I said,
it comes in different forms. In this case, it's just saying it's the recursion
over -- essentially it's the local reward plus the expected reward and the next
belief state, the prime just means next time slot here and it's an expected value
with respect to the observations.
And the max is what you can use to optimize -- to find the optimal reward and, hence, the
optimal policy.
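Written out, the recursion being described is the standard belief-state Bellman optimality equation; this is the textbook form in notation of my choosing, not copied from the slide:

```latex
V^{*}(b) \;=\; \max_{a}\Big[\, r(b,a) \;+\; \gamma \sum_{o} P(o \mid b, a)\, V^{*}\!\big(b'_{a,o}\big) \Big],
\qquad
r(b,a) \;=\; \sum_{s} b(s)\, r(s,a)
```

Here $b'_{a,o}$ is the belief obtained from $b$ after taking action $a$ and observing $o$, $\gamma$ is a discount factor (often taken as 1 for episodic dialogues), and the optimal policy is $\pi^{*}(b) = \arg\max_a[\,\cdot\,]$.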
So in terms of our problem, so set out the graphical model, treat it as a DBN, which
it is, extend it out over the T time slots and then what we're saying then is okay,
I'm going to have my policy, which is instead of this hard wired decision network,
I'm going to say each action is a function of the belief state, not the
most likely state, but the distribution over all states so we get this sequence of
actions driven by a policy and then we can sum the local rewards to get a total reward
and the expected value of this is V. And then we can apply various algorithms for
doing this. But we can essentially iterate. We can use the policy to compute the
reward, and the max here allows us to adjust the policy to incrementally increase
the reward and if we do this iteratively in a process called reinforcement learning,
we'll end up with the optimal policy, under certain constraints, which aren't too
interesting.
Okay. So if we do that for this simple example, then I've got my user's goals. These
are the things we don't know, but we assume they're in the user's head. We've got
the system actions. We define some rewards so let's give a modest small reward for
moving in the right direction. Bigger reward for deleting when we want to delete,
but then give a big negative reward for deleting when the user didn't want to delete
and just wanted to move forwards or backwards. You can change these rewards, of
course, to suit your design criteria. It's a design option, in effect.
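As a concrete illustration of the kind of reward table being described, a sketch along these lines would do; the magnitudes are placeholders chosen only to match the spirit of a small positive reward for correct moves and a large negative one for a wrong delete.

```python
# Rows: the user's (hidden) intention.  Columns: the action the system takes.
# Values are illustrative rewards, not the ones used in the talk.
REWARD = {
    "want_back":    {"back": +1, "forward": -1, "delete": -20, "none": 0},
    "want_forward": {"back": -1, "forward": +1, "delete": -20, "none": 0},
    "want_delete":  {"back": -1, "forward": -1, "delete": +5,  "none": 0},
}

def expected_reward(belief, action):
    """Expected immediate reward of an action under the current belief over intentions."""
    return sum(belief[goal] * REWARD[goal][action] for goal in belief)
```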
And then we can iteratively optimize. This is just a toy example, right, to
illustrate the idea and so I didn't actually compute -- obviously, this depends on
probability distributions for the transition function and the observation. I
didn't train these. I just chose some plausible looking parameters, just to
illustrate how this might work. And then also set up a simulator to generate
gestures with error rates varying from zero to 50 percent and the vertical axis here
is the average reward per turn.
So this is zero axis here. Going below the zero axis is probably bad news. So this
is a simple but reasonable hand crafted policy which just uses a fixed confidence
on the threshold and it basically comes down pretty rapidly if the error rate
increases.
So if you go to a party and you drink enough, you really wouldn't get very far with
this interface.
Now, if we use the POMDP framework and train it at a fixed 30% error rate, you get a
curve that looks something like this, where you see we've made it more robust at the
higher error rates but we've lost a bit at low error rates. Indeed, if you look at the
policy, you find it basically becomes risk-averse.
>>:
[inaudible].
>> Steve Young: Sorry, the policy. I'm not going to train the parameters of the
transition probabilities, but I am going to optimize the policy. Yeah, the training
here means the -- so --
>>:
[inaudible]
>> Steve Young: So I used Q learning, and I used the user simulator to train it
with learning, and I set the error rate to be 30%. Yeah, problem? Yeah?
>>:
[inaudible] observations.
>> Steve Young: Of the policy. I'm trying to learn this mapping from belief states
into actions.
>>: [inaudible].
>> Steve Young: This is a POMDP, yes.
>>: Okay.
>> Steve Young: I mean, not literally. Yes I don't mean literally Q learning,
sorry. I use a Monte Carlo based training method, which is doing an approximate
POMDP learning algorithm. It's not important. It's a reasonable algorithm.
>>: So in the original hand-crafted set-up, there's thresholds that had to be set,
having to do with confidence and understanding --
>> Steve Young: Just one.
>>: Okay.
>> Steve Young: Just one threshold, yeah. And I just fixed it at a reasonable value.
>>: Then in this POMDP setup, there were several different rewards that had to be
set up.
>> Steve Young: Yes.
>>: Are these rewards that you've set by hand? Are they qualitatively different
from the thresholds we set by hand before?
>> Steve Young: Different operating point and I haven't explored this. This is
not a serious example. This is tutorial, Jeff. This is motivational. Okay?
We're not trying to design and persuade you that this is the way to produce an iPhone
interface. So if you want to disbelieve the results, fine. We can go on to the
real results for speech systems later where we get the same performance. I'm trying
to illustrate a different way of approaching the problem.
>>: Will the results of the speech systems involve setting rewards, or is that
also --
>> Steve Young: Yes, but very straightforward reward. We just have dialogue
systems and we give a big reward for giving the right answer and zero for getting
the wrong answer, and we penalize -- give a small penalty every turn to keep it going
along. So -- and we've not tried to optimize that or say what do you really want
from the design. We've just typically used that kind of straightforward, big reward
at the end.
>>:
Do you actually have to know what the error rate is?
>> Steve Young: Let me just put the other curve on, okay. If you did know what
the error rate is and you train the policy at different error rates and you updated
the observation parameters to match the error rate, you get a curve like this. Okay.
And so this is the kind of upper bound on the performance you could get with this
kind of setup. Okay. Now, how you would know what the error rate is
[unintelligible] is a different issue. But that's an upper bound on the
performance for this particular setup. And it's a toy example. It's just meant
to be tutorial.
The point is, one of the things that's making the difference here is that the system
is using the transition probabilities to implicitly adjust its threshold,
remember, because it's updating its belief model based on the transition
probabilities, and that makes a difference. That's one of the reasons why you're
getting a significant difference between this and this.
Okay. So the bottom line of this example is simply that don't think of a speech
system as being a command driven interface that you speak commands and the system
responds. Think of it as being a system where you -- the system is just trying to
track what the user wants to do, and it's regarding inputs as observations that
it's using to refine those beliefs. And then the policy training stuff is almost
secondary to that. But that's the key difference in terms of designing an interface.
Okay. So, well, I just basically, I've just recapped that. So that's the basic
idea. Now, the problem, of course, with speech or a speech-based system is it's
pretty complicated. In my iPhone example, I only have three
possible states, and there's a whole load of packages which will train POMDPs in
different ways for small state spaces and make a reasonable job of it.
The big problem comes where you have a very large state space and a large action
space. So how do we set up a system where we can do Bayesian inference tractably
in realtime over this very large state space, and how can we actually optimize policies
as well, using reinforcement learning, which tends to get very difficult in large
systems.
But if we could do that I'm arguing that this is a really rather principled approach
to handling uncertainty and planning and that's what you need in any kind of interface
which is driven by human interaction. So what is the scaling
problem?
So just to set the context. This is the generic sort of spoken dialogue system
architecture. The one we've been using. And what we do in our systems is we
have -- I should say, this is a limited domain application. The domain is actually
tourist information we've been working on. So the
user can say things like, I want to find a restaurant, the usual stuff. I want
to find a hotel. The system I'll show some examples of actually uses an
artificial town, we're just about to start -- we have now a version for Cambridge
that we're about to start making accessible to the public, which has got many more
entities in it.
But in the example I'm going to talk about here, essentially you can ask questions
about hotels, restaurants and bars in this fictitious town, and our
architecture is that we convert words into these abstract representations. We have a
set of these dialogue acts like confirm and negate, inform, request. And then you
have attribute-value pairs, which are arguments to these dialogue acts, and we
use this as a well-defined interface to our dialogue manager.
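For concreteness, the dialogue-act interface being described amounts to an act type plus attribute-value arguments, something along the following lines; the slot names are illustrative rather than the exact ones used in the system.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueAct:
    """Abstract dialogue act: an act type plus attribute=value arguments."""
    act_type: str                        # e.g. "inform", "request", "confirm", "negate"
    args: dict = field(default_factory=dict)

# What "I want a cheap hotel in the east" might decode to:
user_act = DialogueAct("inform", {"type": "hotel", "pricerange": "cheap", "area": "east"})

# A system act asking for a missing slot:
system_act = DialogueAct("request", {"slot": "area"})
```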
I'll play you a little example of a demo system running a bit later, but just in
passing, all of the components are statistically trained entirely from data apart
from the message generator. And I should also stress the dialogue manager has no
application dependent code in it at all. It doesn't know anything about towns or
hotels or bars. It's all learned from interacting. Learned from data.
So the first system is a hidden information state system. This was built primarily
as a demonstrator that this notion of tracking belief over multiple states would
give you increased robustness, and not necessarily meant to be the way to do it. It
tries to mimic the basic ideas of a POMDP framework, so it takes the output of the speech
understanding system as an observation, and, in fact, we compute an N-best list
of these abstract dialogue acts. So the interface is a list of the dialogue acts
that the system can decode from the user's input --
>>: [inaudible].
>> Steve Young: That's what I just -- these kind of abstract representations like
confirm here equals tower. You've got a list of those here. This thing is trying
to update a belief over a set of states. I'll tell you a little bit more about the
states in a minute. And then we have a policy, which generates actions, which gets
converted into speech and so on.
And we make this tractable by two mechanisms. First of all, we group states into
equivalence classes that we call partitions. So rather than having to
compute beliefs over every possible state, we have far fewer partitions. And then
secondly, we don't compute the dialogue policy directly in belief space. We map
this rather complex belief space into a summary space, which I'll explain a bit more
in a minute, and then we implement the policy and also optimize the policy in this
summarized space. And then we heuristically map back these similar actions into
the higher level space.
And this basic model was developed originally with Jason Williams, who is a Ph.D.
student of mine.
So the actual state we record in this system has three factors. The user's goal,
that corresponds to what the user wants to do: move back, move forward,
delete. The user's act, that's the last thing that the user said. And a dialogue
history. Just to put a little bit of flesh on that. The goals are grouped into
partitions. So what we actually do in practice is we have a list of things which
represent possible groups of goals. So represented textually here, just to be able
to read it.
So this is the set of all goals, which involves finding a hotel in the east part
of town. This is a set of all goals, which involves finding a bar in the east, hotel
in the west. Find venue is just a goal of finding something. This is meant to be
a mutually exclusive set. And this is combined with the N-best list, and then we
have a grounding state. So any of the entities in any of these partitions will have
a state associated with it, and the state is something like: it's been queried, it's
been grounded, it's been requested and so on. And this is conventional stuff in
dialogue systems. This is essentially David Traum's grounding model. But
remember, this is sort of probabilistic, so anything like area equals east can have
multiple states. In fact, it's a distribution over all possible grounding states
for all possible arguments for these goals.
So what we actually do, in practice, is we take instances of each of these
distributions so a particular partition, with a particular assumed last user act
and a particular set of grounding states makes a single instance for which we compute
a probability.
And we compute all of the most likely combinations, rank them and prune them. So
over the millions or even billions of potential combinations here, we typically
maintain the top two to three hundred values, if you like, and their probabilities.
If you work through the maths, you get something that looks a bit like this. So
this is the actual belief update, and I'm not going to go into detail on this. In
fact, I'm not showing the partitions here. This is for actual states. If you
do some algebra on this, and you have S divided into partitions, you
can get something similar to this. But this illustrates the basic idea.
>>:
So the partition is done manually?
>> Steve Young: No, the partition is completely -- so I didn't really want to go
into this, because it's kind of almost boring implementation detail. We start off
with everything in one partition, because we don't know anything.
The user says something, the recognizer generates all of these hypotheses about what
might have been said. Any mention in any hypothesis of anything like a Chinese
restaurant, a cheap hotel, then the set of partitions gets divided.
So you can always associate each hypothesis about what the user said with a specific
partition. So this splitting doesn't change the maths at all. It's just a
computational device, in effect.
It would work exactly the same as if you were able to maintain every possible
combination and you computed the beliefs individually for each combination.
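A rough sketch of that splitting idea: start with one partition covering all goals, and split a partition whenever a recognizer hypothesis or a system act mentions a slot-value pair it does not yet distinguish. This is a simplified rendering of the mechanism, not the actual HIS code, and the 0.5 split prior is a placeholder.

```python
class Partition:
    """A group of user goals described by the constraints they all share."""
    def __init__(self, constraints=None, belief=1.0):
        self.constraints = dict(constraints or {})  # e.g. {"type": "hotel", "area": "east"}
        self.belief = belief

    def split(self, slot, value, prior=0.5):
        """Split into the sub-partition with slot=value and its complement, dividing the
        belief mass according to a prior (0.5 is a placeholder).  The complement should
        really also record slot != value; that bookkeeping is omitted for brevity."""
        if slot in self.constraints:
            return [self]  # this partition already distinguishes the slot
        matching = Partition({**self.constraints, slot: value}, self.belief * prior)
        remainder = Partition(self.constraints, self.belief * (1.0 - prior))
        return [matching, remainder]

def refine(partitions, mentioned_pairs):
    """Split partitions for every slot=value pair mentioned in any hypothesis
    (or in a system act), so beliefs only ever need to be computed per partition."""
    for slot, value in mentioned_pairs:
        partitions = [p for part in partitions for p in part.split(slot, value)]
    return partitions
```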
So if you look at the equations, what you find is this is B -- remember, B is
a probability -- and you're recomputing this distribution each turn. And
this is the old belief and the prime is the new belief, and if you look at the
terms in this update equation, and this is standard textbook stuff, what you see
is these three components. The transition model is just taking account of state
changes and, in fact, we assume there are no state changes so we assume that whatever
the user wants, they don't change their mind in the course of a dialogue and this
is a weakness and I'll come back to that.
So you can more or less ignore this.
It's basically an identity transform.
And then these two terms are the important ones. So this is the user action model.
This says, okay, remember that this S prime here includes the three components: the
user's goal, the last user act and the grounding model. The grounding model is
hidden in here.
So this is -- this term says what's the probability of the user saying something,
given that their goal is this. So if you're hypothesizing, the user says I want
a cheap hotel and in the particular G here, part of the goal is they're looking for
a hotel, you'd expect this to have a high probability. If the particular goal is
they are looking for a bar and they say I want a cheap hotel, you'd expect this
probability to be small.
So this probability, we call this the user action model, this is the thing you don't
get with many alternative formulations of this problem.
And then the observation model here essentially is the -- takes the place of the
confidence measure, which is a probability of the observation, given the specific
user goal.
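In symbols, the factored update being walked through has roughly this shape, in notation of my choosing, with g the user goal (assumed fixed across turns), u the last user act, h the grounding/history state, a the last system act and o the observation:

```latex
b'(g, u', h') \;\propto\;
\underbrace{P(o' \mid u')}_{\text{observation model}}\;
\underbrace{P(u' \mid g, a)}_{\text{user action model}}\;
\sum_{h}
\underbrace{P(h' \mid g, u', h, a)}_{\text{dialogue history model}}\;
b(g, h)
```

Because the goal is assumed not to change between turns, the goal transition reduces to an identity, which is the term that can be more or less ignored.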
>>: Just going back to the question, this is just a factorization of the partition
set, right?
>> Steve Young: This is taking the -- if you look at the graphical model, you figure
out the relationship, the update equation, and then you substitute in the factored
version. The S is actually represented as these three components and you'll get
this and then as I say, in the actual system we go a bit further because we group
the Ss into partitions. Which requires a little bit more manipulation.
>>: [inaudible] part of the partition. So you partition basically at run time
based on things that you hear from the recognizer if I understood you correctly.
>> Steve Young: Yeah.
>>: And from the system.
>> Steve Young: Sorry -- and from the system, yes. So if the system says I could suggest a nice
hotel in the east part of town, that would also split the partitions.
>>: Right. Just makes me wonder about, like, I guess what's left is priors. I'm
wondering if I give you this information and you're reasoning your partition based
on Chinese versus everything else, restaurant --
>> Steve Young: Can I come back to that?
>>: Sure.
>> Steve Young: Because I'm going to talk about that specific point.
>>: All right.
>>:
So when you say the [inaudible] transition is --
>> Steve Young: The goal, so G doesn't change.
>>: I knew that, but what about the rest, right?
>> Steve Young: They can change. Oh, in the transition model, no, that's true.
No, they don't change either. So you're assuming that -- you're assuming that
there's an underlying sort of set of fixed values for these. All you're trying to
do is estimate them from the sequence of observations. But I'll come back to these
issues again, because you're picking up problems with this model, right. So if you
want to just think of it as us hacking various things here to get it to work, that's
just fine.
So we have a user action model which is a factored model. I haven't got time to
talk about that. So that's the first part, partitioning, okay. Lots of detail in
there, but that's the big picture.
The second thing is the master space mapping. So we have these partitions and their
grounding states and the user act and each one has a belief shown by the size of
the bar here. So I'm just showing the goal bit, but this, each of these lines is
meant to represent what we actually call a hypothesis, for obvious reasons, in the
system.
So you're maintaining sets of hypotheses about what the user goal is, what the last
act is and what the grounding states are. You maintain a list of these. That's
this list. To do policy optimization, we do some gross approximations. We first
of all try to characterize this complete distribution by a fixed length vector. A
B prime.
So this is summary space. And this is something we've been refining, but in the
version I'll show you, the one we trialed, it consists of a
mixture of continuous and discrete variables: the probability of the top
hypothesis, the probability of the next hypothesis, indicator variables, for
example, T12same, whether the top two could potentially refer to the same entity,
and various other things, last user act, last system act.
>>: So the summary space now is manual.
>> Steve Young: Well, the choice of this, what features to extract, is manually
done, yes.
>>: [inaudible].
>> Steve Young: No, this is completely independent of the application. These are
entirely structural things. T12same means if you treat these -- I mean, okay, it's
generic to database-type information retrieval systems. The system knows nothing
about east, west and so on, but it does know what the fields are in the database.
>>:
So the schema.
>> Steve Young: So there's a schema that sits between this and the database. So
if mapping these to the schema gave you the same set of entities in the database,
you'd say they were equivalent.
>>:
Okay.
>>: But the size of the summary space determines the scalability of the system?
>> Steve Young: Yes, I guess it does.
>>: So somehow to address that issue, you need to [unintelligible] how small summary
space eventually you need to have?
>> Steve Young: Yeah, okay. Yes. This is something we've just started to look
at a little bit about the trade-offs in the size of the summary space. But as you'll
see, the policy optimization is really quite crude in this system. So what we do
is we take this reduced mapping into a fixed-length vector, and then we use a codebook --
we actually quantize this. We have a distance metric on this, and we
[unintelligible] it. Then for each member of the codebook, we associate
an action which is a so-called summary space action.
So it doesn't have these arguments. It's just basically the dialogue act itself.
And we learn this, we optimize this policy. This is now effectively an MDP rather
than a POMDP, and we optimize this online -- actually, it's Q-learning.
And then we have a heuristic, which maps these summary vectors back into master space.
And this is originally hand crafted, but, in fact, we now have a data driven way
of doing this. So this mapping is essentially learned from data as well. But it's
actually not difficult to do. So for example, if the policy says you should confirm
something, then what it does is it looks at the grounding states of the various
entities and picks something to confirm, for example.
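A minimal sketch of the master-to-summary mapping and the codebook lookup; the feature set, distance metric and helper functions here are invented for illustration rather than taken from the real system.

```python
def encode_act(act):
    """Placeholder helper: map a dialogue-act type to a number."""
    return float(hash(act) % 10)

def same_entities(h1, h2):
    """Placeholder helper: would the two hypotheses match the same database entities?"""
    return h1.constraints == h2.constraints

def to_summary(hypotheses, last_user_act, last_system_act):
    """Compress the full belief (a ranked list of hypotheses) into a small,
    fixed-length summary vector."""
    ranked = sorted(hypotheses, key=lambda h: h.belief, reverse=True)
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else ranked[0]
    return [
        top.belief,                                  # probability of the top hypothesis
        second.belief,                               # probability of the runner-up
        1.0 if same_entities(top, second) else 0.0,  # "T12same": could they be the same venue?
        encode_act(last_user_act),                   # last user act, encoded as a number
        encode_act(last_system_act),                 # last system act, encoded as a number
    ]

def nearest_codebook_action(summary_vector, codebook):
    """Policy lookup: return the summary action attached to the closest codebook point."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point.vector, summary_vector))
    return min(codebook, key=sq_dist).summary_action
```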
So that's the sort of the architecture. Let me just show you an example of it
working. So what you'll see is the -- it's a bit -- you won't be able to read it
very clearly, but you'll get the basic idea. This panel here just shows these
hypotheses appearing as you speak. The recognizer output traces are in the top here; the
output of the synthesizer is here. The recognizer is a simple off-the-shelf HTK
recognizer. The output is a [unintelligible]-based synthesizer. It sounds
strangely like Alan Black, for anyone that knows Alan Black.
What you'll see is that the bars here, if you look at the spread over the hypotheses,
you see -- and this is in clean speech so there actually aren't many errors, but
you'll see that the guy, who was one of my RAs, is asking for a venue for which there
isn't a satisfiable solution.
>>: Hello. May I help you?
>>: Hi. I would like a bar in the north.
>>: The Kings Schilling is a great bar in the north part of town.
>>: What kind of music do they play?
>>: The Kings Schilling is known for great rock music.
>>: Are there any bars playing jazz?
>>: I am sorry, but there is no bar that plays jazz music.
>> Steve Young:
It's now confused about what the user wants.
>>:
Are there any bars in the center playing jazz?
>>:
Ice Skill is a nice bar in the central part of town and they play jazz music.
>>:
What kind of drinks do they serve?
>>:
They serve cocktails at Ice Skill.
>>:
And what is the address?
>> Steve Young: These top two hypotheses are actually identical, apart from one
of the grounding states. So this represents a very sharp distribution.
>>:
Thank you, good bye.
>> Steve Young: And this is essentially zero. It's on a non-linear scale so we
can just see the colors and don't worry about what the colors are. So this is
basically -- it's 100% certain. So in this example, the recognizer got it exactly
right the first time and it was presumably pretty confident about it. So he
effectively got a hit straightaway. But then the hit actually didn't suit the user
and so the uncertainty came in in that way.
Now, the user changed their goal in that case, but the system just assumes that it
simply misrecognized if you like, the intention in the first place. So the goal
stays fixed. It's just recomputing the beliefs. Now, that's to give you an idea.
If you -- we have a simulated user that we actually use to train our systems as well.
We don't -- unlike Nuance and maybe you guys, we don't have access to millions of
dialogues. So actually we have a corpus of dialogues, a relatively modest number,
probably about 1,000 dialogues, and, as a separate Ph.D. project, we have
a statistical user simulator, which was trained on the data, and then we use that
user simulator both for training and testing. So take this with a pinch
of salt in the sense that it's essentially testing on the training data.
And what's shown here is a comparison of the performance of this hidden information state
system with an MDP system. So the MDP system is using all the same components except
it has no model of uncertainty. It's always just selecting the most likely state and
it's optimizing the policy, based on that in much the same way as the HIS system.
But the MDP system is not a broken system in any sense. The guy who made this, it
was part of his Ph.D. and he worked pretty hard, and he was also competitive, so he
was trying to -- he was certainly trying hard to get the best performance possible.
And this is a reasonable hand-crafted system.
So learning a policy helps, but in high noise, unless you're tracking multiple
alternatives, there's only so much you can do and the potential, again, of the HIS
system is demonstrated by these simulated results.
So we also ran some trials, actually, we've run a couple of trials with students.
We had, I think, a total across the trials of about 80 students who worked
their way through, I suppose, 20 or 30 dialogues each.
And what we did was we basically used the -- this wasn't a live system on the
telephone, they came into the lab. But we had noise sources -- I don't actually
remember the noise source. It was something from the noise database that used to
be around a few years ago.
So we basically had artificial noise in the background, and we increased the noise
level to generate a range of noise conditions. And we'd hoped to
reproduce these curves. But the results were not statistically significant if we
tried to rank them by error rate. This is just the result for the pooled data, which
is statistically significant. And this is the percentage success we
get on the user trial, and the average error rate, overall, of all of
the dialogues is about 30%.
>>: You say statistically significant, you mean the difference between --
>> Steve Young: This is significant.
>>: But not between HDC --
>> Steve Young: Probably not, no. My significance comment is we can actually bin
the dialogues at different noise levels and plot a graph like that. But if we do, the
error bars on the data points are so wide that, you know,
you can't get nice curves, basically.
>>:
[inaudible].
>> Steve Young:
Yeah, for each individual bin, yeah.
>>:
Like I'm wondering --
>> Steve Young: Well, we've tried, yeah, all sorts of fitting curves, but it's not
that -- yeah. We don't feel that confident about it. But we're confident about
this result.
>>: So the purpose of adding noise is to make the condition --
>> Steve Young: Yeah, just to make it recognize --
>>: [inaudible].
>> Steve Young: Yeah.
>>: On the other hand this system you talked about earlier also has a summary state.
Is that helping or not helping?
>> Steve Young: Oh, does it hurt? We don't know. We don't know. We haven't -- we
haven't explored that part of the system very much at all yet. I'm going to run
out of time and I need to say a bit more. But yeah.
>>:
When you add the noise, do the subjects hear the noise, or --
>> Steve Young: Yes, yes. So there's [unintelligible] effects, yes. They are
suffering, yes.
>>: [unintelligible].
>> Steve Young: Well, we change the noise level so that the error rate, as I said,
the error rate, the measured error rate it was varying between 20% and 40%.
>>:
Do you notice that subjects do different things if you make it noisy?
>> Steve Young: Pass. We videoed them, if you want to --
>>: Besides the task completion rate, did you happen to qualitatively assess the
differences? I mean, are these --
>> Steve Young: Yes, we did. We have a paper that's just come out in Computer Speech
and Language with the detailed results, if you want to go have a look at them. And
we do do some --
>>: In general, can you notice that it's --
>> Steve Young: Subjective, yeah, although we did another trial, we tried to get
over the statistical significance problem with another trial, and there we started
to have significant problems. The notion of paying people to do scenarios
that are artificial, when you start to look at the results, is very iffy, because
it's -- people are -- if you ask them for something that doesn't
exist in the town, they'll accept almost anything as an alternative. Even though
we said, you know, you're really keen on jazz music, they would accept something
else.
And then subjectively, they'd say it was great, I got everything I needed. And you
look at the objective results, and it didn't satisfy the criteria we gave them. So
we've kind of more or less given up trying to do this kind of trial. We need to do
live assessment. It's probably the only way to do this.
>>: So how do you explain the difference between the POMDP and the MDP? I would
expect the POMDP noticed the lack of information and asked questions to clarify.
>> Steve Young: And the policy will do that. So the POMDP, for example, will do
things like, did you say X or Y?
>>: And the MDP wouldn't?
>> Steve Young: The MDP wouldn't, because it doesn't know what the alternative is.
In fact, that's relatively rare. Where this gains is also that repetition works.
So you could keep repeating the same thing over and over again, and it's never in
the top one or two from the recognizer. But it's consistently somewhere in the list,
and if you're persistent, then you actually find the belief in that actually climbs
to the point where, you know, somehow it gets to the top of the pile. So that's
one of the most obvious things when you look at the data is repetition makes a
difference.
>>: In real situations, real data, what kind of human annotation do you need on
top of the data itself? Do you need any?
>> Steve Young: To score these things?
>>: To train?
>> Steve Young: To train.
>>: Yeah, you hope to improve with more data. Do you need any kind of human annotation?
>> Steve Young: No. What we need are the dialogue acts in our representation.
That's all we need for the training data.
>>: But does a human have to work those out? I mean, if you have a live telephone
system where people are calling in, do you need something on top of the raw data?
>> Steve Young: Well, the raw data is -- depends upon what you call the raw data.
Our interest is in what the sequence of dialogue -- what the sequence of dialogue
acts was and also what the goal -- what their goal was. That's what we need to know.
The user simulator uses expectation maximization. So it doesn't need detailed state
by state annotation. It figures that out for itself. It does need to know what
the reward should be. And it needs to know what the dialogue acts were.
Can I move on?
So this was meant as a demonstrator of the potential, right. It has some severe
problems. One is that the -- I skipped the slide that explained how the user action
model works, but the user action model's hand crafted. It's not application
dependent. It's actually a set of linguistic rules which define how well a dialogue
act matches the goal representation. But it is handcrafted and there's nothing
really to learn.
And then we have this problem of changing state. So in the example here, the
system offered a different place that didn't satisfy the user's request. You can
think of it in two ways. The user always had this other place in mind -- that's actually
not true. The users are guided by whether it exists or not and will change their
mind. We really would like to model that.
So when you're looking at what we're actually doing here, what we're saying is our
fundamental problem is we have a large joint distribution to model things like the
type of venue, the location, the price, the food. In our system, currently, there
are, I think, 12 possible variables here. And so what the HIS system is doing is
it's taking the most likely combinations of values, finding their probabilities,
ranking them and pruning off all the unlikely ones.
So it's maintaining the top few members of this joint. But then it can't possibly
do transitions, because it has -- you can't put a matrix on this to get the updated
thing, because most of these values are missing.
The obvious alternative is just to use a graphical model, because that's how I
presented this originally. So what would you do if you had a graphical model. Well,
it's going to get big very rapidly, so if you think about the minimum possible
dependency, you might say that things like what people -- dialogue designers might
call slot values, like location, price and food. You might say, well, location is
independent -- whether it's a hotel or a bar doesn't matter.
But price and food probably depends on what the type of venue is. So you make the
simplest possible graphical model that you can do.
Of course, this has got its own problems. In fact, I know the east part of town
is cheaper than the west part of town so location does matter. But suppose you did
this. Then you end up with a graphical model -- a dynamic Bayesian network, in effect --
which looks something like this. You've got something to represent the goals, which
might be the food, depending on the type. It's all going to depend on the last
system action. You need to somehow get what the user says into
this so you can imagine plucking out the individual components, referring to food
and representing these as hidden variables.
You can have your grounding states as hidden variables. And this is time T. Time
T plus 1. You can have a few slices of this. And that's kind of the minimum model
you could possibly make.
And so if you do that for this simple artificial town, you get -- and convert to
it a factor graph, then it looks like this. Actually, that's the first of several
pages of it. And then you can do something like loopy belief propagation to update
the parameters.
And if you try and do that with an off the shelf LBP, it's very slow. It's quite
a big graph. However, you can exploit features of the particular setup. And I
haven't got time to go into detail about message passing, but you can do two things.
You can partition the values again, just like we did in the HIS system, but this time
on a per-slot basis. So there's no point doing an update over all possible values for food
if the user's only ever mentioned French and Italian. Just lump the rest together and do
the partitions dynamically.
And then you can also, instead of having a full transition matrix, you can say, I'm
really only interested in whether the goal's changing or not. So you have a constant
probability of change. And if you do that, then the compute time for standard
loopy belief propagation looks like this curve, just plotting time against
branching factor. With these optimizations, you can get something which is
tractable. The details don't matter. My only point here is if you're going to start
building these large graphical models, you probably need -- we need a serious amount
of optimization to get them to work in realtime.
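To make those two tricks concrete, here is a toy rendering of them for a single slot; the change probability and the grouping of unmentioned values into one catch-all bucket are the key ideas, and the numbers are placeholders.

```python
P_CHANGE = 0.05  # placeholder constant probability that the user changes this slot's goal

def slot_transition(prev_dist):
    """Constant-probability-of-change transition for one slot: stay put with
    probability 1 - P_CHANGE, otherwise move uniformly to another partition."""
    k = len(prev_dist)
    if k == 1:
        return dict(prev_dist)
    return {
        value: (1.0 - P_CHANGE) * p + P_CHANGE * (1.0 - p) / (k - 1)
        for value, p in prev_dist.items()
    }

def partition_slot_values(mentioned, prior):
    """Dynamic per-slot partitioning: keep one partition per value actually
    mentioned so far, and lump all unmentioned values into '__other__'."""
    dist = {value: prior[value] for value in mentioned}
    dist["__other__"] = 1.0 - sum(dist.values())
    return dist
```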
But the big advantage of this is that we can -- and this is what we're starting to
do now is we can actually not just model the variables of the dialogue, we can also
throw in all the parameters as well. So we can make it not just a discrete network,
we can put in the parameters of the distributions. Then we can switch from loopy
belief propagation to expectation propagation and we can update the parameters
online as well as running the dialogue. So that's one of the things that we're just
starting to get working.
The other problem is how you actually build a policy on top of this very large Bayesian
network. Again, there are ideas in the literature for doing this and probably, given
the time, I should probably not spend long on this.
Essentially, what we do is we construct a stochastic policy. There's no summary
space now. We're building the policy directly on the full Bayesian network. But
what we are doing is assuming that there's a very limited set of actions. So we
have a -- we represent the policy [unintelligible] using a softmax with basis
functions, and we have a basis function for each possible action, and there's a limited
set of these. Then we factorize out the dependency on the various bits of the
network -- we partition it into components and then we discretize each
component with a very crude binary lookup table. And, again, we have a paper coming
out very soon in Computer Speech and Language that describes the details of this.
And then we use a standard algorithm, much like the actor-critic algorithm, to optimize
this.
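In equation form, the stochastic policy being described is a softmax over per-action basis functions of the belief; this is notation of my choosing, with theta the learned parameters:

```latex
\pi_\theta(a \mid b) \;=\;
\frac{\exp\big(\theta_a^{\top} \phi_a(b)\big)}
     {\sum_{a'} \exp\big(\theta_{a'}^{\top} \phi_{a'}(b)\big)}
```

where each $\phi_a(b)$ is a coarse, discretized feature vector extracted from the relevant factors of the network, and $\theta$ is optimized with a policy-gradient, actor-critic style algorithm.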
So doing all of that -- and remember there's huge approximations now
in the conditional dependencies -- we get a performance which, actually, I should have overlaid
these, is very similar to the HIS system. In fact, the performance
currently is indistinguishable from the HIS system. The colors have changed because
of the randomness of PowerPoint, but this is the MDP system from my previous slide.
This is the BUDS system, and if I put the HIS system in there, which I should have
done, it would have been almost the same. This is the reward against success. But
again, the curves are essentially identical.
And again, the same user trial gave statistically significant difference again. The
only reason I'm not combining these into one graph is because actually the dialogues
were -- it was done on different days. It's not strictly fair to actually combine
them. But the bottom line is it's essentially the same performance.
So I'll stop there -- apologies for going on a little bit too long. My basic claim
is that this kind of framework of POMDPs, Bayesian belief tracking and automatic strategy
optimization provides a good way to design HCI-type systems.
With the HIS and BUDS systems, I think we've demonstrated we can get improved robustness.
We haven't dealt with the adaptation problem yet. Adaptation work is focusing on
the Bayesian network system.
The HIS system is still interesting mind you because the people building industrial
systems are much more interested in the HIS system because they can relate that to
what they're doing. They can see that there's an incremental way of going forward
so you can think of the HIS system as maintaining multiple standard dialogue managers
in parallel. And instead of taking actions on the best, you're looking, trying to
look at the whole bunch of parallel dialogue managers and saying what would be the
best thing to do?
And so any trick that you currently use in your system, you
can, in principle, put in the HIS system. As an evolutionary path, that looks
interesting. From the point of view of working on adaptation and so on, online
parameter learning, then the BUDS system is more interesting. In the long-term that
seems to be the way to do it.
Moving forward, we need to develop scaleable solutions, particularly for the
Bayesian network systems. We need to be able to deal with more linguistic phenomena.
We need to be able to deal with multimodal things. But multimodal input is trivial in
that framework. You add in an extra observation function for each of your inputs.
It's rather easy to integrate them.
And there are issues of migrating to industrial systems. Tim Paek has pointed out
some of the issues, in his paper with Roberto, about how you guarantee performance
to a client when your system is essentially statistical and who knows what it's going
to do next, but that's where we are. So I'll stop there.
>> Tim Paek:
We have time for questions.
>>: So the HIS and BUDS systems, in the final comparison, you did compare HIS versus
BUDS. I suppose BUDS is better than HIS. It uses the Bayesian learning.
>> Steve Young: They turn out to be currently about the same. They're pretty much
the same. That's presumably because -- the HIS system is able, that's coming back
to a question maybe Guy or Dan asked about the condition -- I mean, the HIS system
is not throwing away any conditional dependencies, right, because it's taking the
full -- it's sampling the full joint. So some things it does better. It doesn't
get -- I mean, location and price, for example, happens to make a bit of a difference.
It models that, whereas it's thrown away in the BUDS system.
>>:
They both have similar kind of summary space?
>> Steve Young: Well, the BUDS system doesn't have a summary space. It has a
summary action space, and the summary action space is very similar.
>>: If you don't have a summary space, how can you justify the scalability of the
BUDS system?
>> Steve Young:
They're both operating in exactly the same domain.
>>: In a situation where you have real data, not simulated data, how efficiently
do these systems use the data? I mean, obviously, data is free if it's simulated,
but --
>> Steve Young: Yes, that's a very good question. We don't know the answer to that.
Clearly, on the simulator, typically, we're talking about training curves for
both systems that run out to about 500,000 dialogues. So, you know, this is -- I mean,
they're getting pretty good by 50,000 to 100,000, but we train up to about 500,000,
typically.
>>:
Is that depending upon the size of the problem?
>> Steve Young: That will almost certainly depend on the size of the problem. As
we move to richer domains, I expect that number to go up. But actually, if you look at
statistics on Nuance calls, where you can look at [unintelligible] statistics, I'm
sure this is actually not completely, you know, out of the ballpark.
>>: I didn't find this answer to my previous question. Like if you have calls
coming in, you say you need to annotate with the dialogue act, does that mean just
the acts the computer took or the actual real --
>> Steve Young: So I perhaps misunderstood your question. What we need to do now,
because we use our data to train our user simulator, my answer is what data our user
simulator needs. Currently, we have a two-step system. We take data to train the
simulator, then use the simulator to train the dialogue system. If the dialogue system
was connected directly to the real users, all you would ever need to know is what
the rewards are. And that just is --
>>:
[inaudible].
>> Steve Young: [inaudible] any annotation at all, no.
>>: Do the numbers still hold, 100,000 or 500,000?
>> Steve Young: No idea. Don't know. I can't do the experiment.
>> Tim Paek: Go ahead.
>>: I think the whole issue with rewards is interesting, because one of the things
I would love to see an analysis of, and I don't know if you published this is when
the system fails, how bad are the failures. Like, you know, it's okay to assign
high reward to completing the task. And then everything else gets, you know,
negative reward, for instance.
But to some degree, that's not really true. Some experiences are better than others,
right? Even when they fail, right? And you're kind of biasing your system to kind
of -- do you know what I'm saying?
>> Steve Young: Absolutely. So this is the sort of thing Lyn Walker is doing and
so on, trying to build a model between the reward function and user satisfaction.
And if you could do that, then presumably, instead of optimizing this naive reward
function, maybe, I don't know, does Microsoft want to optimize user satisfaction?
Let's assume yes, okay?
So I guess that's what you'd do. I mean, and that's an interesting research topic,
I guess. But it's not something we've looked at.
>>:
When you looked at the failures, were there any kind of --
>> Steve Young:
Failures.
>>: With [inaudible] systems, one good thing you can do is anticipate what the bad
things are and kind of make sure they're not as bad.
>> Steve Young: Yeah, okay. So the HIS system fails, really does fail. The
implementation problem with the HIS system at the moment is that it's tricky to
recombine partitions. And so what you get with a long dialogue is more and more
partitions and the dialogue slows down.
Okay. And so users give up. The BUDS system doesn't suffer from that. You can
talk to the BUDS system forever. So in some sense, the BUDS system never fails.
If you're persistent enough and you sit there long enough, you will probably
get the answer -- it won't fail in the sense of not being able to get the answer, presuming you'll sit
there long enough to figure out how to get the recognizer to recognize your voice
sufficiently well. It will just keep on talking, because all you're doing
essentially is, you know, updating beliefs in the various slots to the point where
it can actually get a match.
>>: I thought that the reward function would penalize the length --
>> Steve Young: Yes.
>>: [inaudible].
>> Steve Young: Yes, it does, but we're talking about failures right. The user
doesn't know about the reward function. For them, failure, presumably, is -- and
we just arbitrarily chop the dialogue at 20 turns. We say if you haven't got it
by 20, you know, stop. Yeah?
>>: These dialogue systems never say that they don't understand what you're saying.
And wouldn't users give up long before 20 turns?
>> Steve Young: Probably. And no, it never says I don't understand what you're
saying.
>>: That might be an interesting --
>> Steve Young: We don't have that in our action set.
>>: That's an obvious POMDP action.
>> Steve Young: It is obvious, yes.
>>: Your information is not good enough. I'm going to try to get more information.
>> Steve Young: And we haven't got that in. It's probably got surrogate actions.
Often, when it gets in a complete mess, it starts to ask you, can I help you with
anything else? Which is perhaps not the most helpful thing to say. So perhaps it
should say, I'm not really understanding you. But, you know, it's
the kind of back-off when everything --
>>: [inaudible] the users are constantly [inaudible] the system. The users learn
how to play with the system, how to -- I wonder, using this kind of statistical system
actually the system becomes less predictable and will that actually hurt the user
performance, user experience, because users see inconsistent behaviors from the
system, given the same input [inaudible].
>> Steve Young: Yeah. So first of all, all the trials we've done are with people,
we deliberately did not reuse subjects. So the subjects had all
essentially never used it before. And we sort of didn't let them have enough
interactions to really learn very much about the behavior.
I guess my -- my answer mostly would be we should be personalizing the user
experience, right? So if we figure out how to parameterize these models and we can
have something equivalent to the MLLR transform for the dialogue, then you
would -- if you can recognize the caller, you might plug in the transform. Other
than that, the other thing is that as I said, it takes currently a large number of
dialogues to change. So this thing, if you had this with the real system live, unless
we can figure out ways of adapting much more quickly, which we'd obviously like to
do, then it's going to be quite a long time period over which it adapts. Hopefully,
users would say, you know, I used that system yesterday. I used it three or four
months ago and it was awful, but it's actually getting quite reasonable now.
But --
>>: I think it's also hard to tell because with speech you have to produce the same
speech each time, right, before you can detect a consistency and then if you find
that there's a particular way in which it's being recognized, well, we'll take
advantage of that anyway.
>>: Just for a second, I'm wondering if there's anything interesting in looking
at or if you guys have looked at how stable this is with respect to assigning rewards.
If I assign these rewards, and I come up with minus 1 and plus 20 if I say minus
1 and plus 18, how stable is the policy?
>> Steve Young:
So the --
>>: It won't make a difference for task completion, but I don't know about the
perceived --
>>: So I think you're going to optimize. That's exactly my question. At least
to me it's not clear. Will it make a difference or not for task completion? We're
optimizing rewards here. I don't know. The mapping to task completion or to user
perception, like how linear or how varied that is --
>> Steve Young: Yeah, so the only two metrics we ever look at are percentage
success and the reward. And our reward is essentially the same metric. So --
>>: You could still look, I mean, if you vary some -- let's say -- is there a way,
I don't know. I'm just talking off the top of my head. But is there a way you can
vary the reward structure to look -- obviously, the rewards you're going to get are
going to mean different things. But if you look at the policy, can you somehow
inspect the policy and see whether in similar states it takes similar actions?
>> Steve Young: On the HIS system, you certainly can because it's basically a lookup
table and you can actually go through the lookup table and see why it seems to be
choosing a particular action. One of the things we've done recently which actually
improves performance is we actually don't associate a single action anymore with
the -- with each belief point. We associate an N-best list, since we use a version
of Q-learning. So we have this Q function associated with every belief point. We
look at the N-best list, and then you can combine the generation problem with the
choice of which action to take and say, well, in this context, it's not really obvious
how I would do a confirm -- constructing a confirm from the current state of my
grounding states and so on.
>>: It's interesting because it allows you to put the heuristics in place that can
check against really bad things happening that Tim was talking about, if you have
choices in the action.
>> Steve Young: Yes. So we can use a ranked list now. And it actually does improve
performance quite a bit. Okay.
>> Tim Paek: If there's no further questions, let's give our --
[applause]