>> Kuansan Wang: Okay. I think we'll get started. My name is Kuansan Wang. Today
I have this honor to host my former intern, Antoine, to interview with us. So Antoine
interned with me probably two years ago, and he did such a good job that I cannot resist
to bring him back for interview now that he's graduating.
So today he's going to tell us what he has been doing [inaudible] thesis, so without
further ado --
>> Antoine Raux: Hi. This working? Yes? Okay. So good morning everyone. So as
Kuansan said, I'm Antoine Raux. I'm finishing my Ph.D. at Carnegie Mellon. And I'll be
talking about my thesis topic which is turn-taking, meaning conversational turn-taking as
a dynamic decision process.
So my general field of research is spoken dialogue systems. So -- well, let's jump right
into it. So what -- current dialogue systems, what do they do? There's been lots of research
on dialogue systems for the past maybe two decades at least now, and lots of research has
been done on making them more robust or able to ground information, able to deal with
more complex structured information, et cetera.
So that leads us to something like this. This is something that could be described as a
state-of-the-art dialogue system, I guess.
>> Antoine Raux: So of course that continues with the system
providing results. By many accounts, this is a very successful dialogue. I mean, the person
is getting their information without any problems. There are just two confirmations to
make sure the information is right, but that's not really a big deal.
However, I would say this is still far. So as you've understood, this is a system that
provides bus schedule for people. And now let's listen to a recording of a very similar
conversation with a human. That's an actual recording from a call to the customer service
of the bus company in Pittsburgh. And let's see what that sounds like.
>> Antoine Raux: Okay. So it's still -- sorry about that. There's still some difference
here. We're not quite there yet, as you can see. And it's not a matter of speech
recognition accuracy or even speech recognition speed because that's not where the
problem lies here in this particular example. There are dialogues with problems in
speech recognition, to be sure, but in this example we do have now lots of dialogues that
go well as far as understanding goes.
Now, what are the differences between these two dialogues? One was the specific
problem in prompt design, or difference in prompt design, where the prompts or the
questions from the human operator tend to be much more -- much shorter and much more
efficient than the one in the system side. That's not a very hard problem to solve at first
sight, right, just design your prompts to be short and you're good.
Now, the problem with this is that if you're using recorded prompts, that's probably fine.
If you're using synthesis, synthesizing very short utterances, like 51C, you need to
convey the meaning right, you need to use prosody in the right way. And humans are
very good at that. That's why he's able to have -- to do it in his -- this call. It's not that
trivial to produce in synthesis; it's actually not really solved yet. That's the big -- one of the big
challenges of speech synthesis, having conversational prosody. The human is using it
in the confirmation here. Also when he provides results, he's pretty good at
emphasizing the important bits of information in his utterance so that the user gets it.
And that allows -- by the way, sorry, that allows the human to speak very fast most of the
time, much faster than the speech synthesizer does, because he can emphasize the right
bits and the rest can be kind of all blurry and that's okay.
Another big difference is turn-taking. Specifically here, the fact that the human operator is much
faster to respond to the caller than the system is. That happened not all the time. It's not at
every point that that happened. Like, before saying 51C, there was a significant
delay here, and that's not a problem. However, at certain points, like after the user
confirmed yes, the system was very fast. Sorry. The operator was very fast to respond,
let me get that for you. And then there was a long delay to actually get the results.
But this pace of conversation is very different between the two, and it's much more
variable and flexible in the human case and there are good reasons for that, and that's
something our systems are not able to do yet.
And finally something that maybe is not as obvious, but there was some incremental
processing going on on the human side here. You could hear as the caller was asking the
question, you could hear the operator actually going through paper and starting to get the
information about the 51Cs and then confirming actually that information later on. But
that's something also systems usually don't do, starting to process things as the user is
speaking to the system.
And that's particularly relevant when you get systems that get more and more natural
language input with longer input from the user. As long as you get yes-no answers or
one-word answers, it's not really that relevant. But when you move towards more natural
language, which is what current systems are doing, it makes more sense to start
addressing issues like incremental processing.
Now, these are many different problems that are very hard to solve. So we're not going
to be there tomorrow. Now, in the meantime, let's see what -- like a more reasonable, I'd
say, short-term goal. So this is -- I'm just like playing back the original.
>> Antoine Raux: Now, let's see. What I did then, I edited this audio to make it more
like something like we would like to be, but still a system but maybe the next generation
or something.
>> Antoine Raux: Okay. So two modifications that happened here. Remember these
four things I was looking at, actually in this specific example I addressed prompt design
by shortening the prompt, which again is not the hardest thing to do in all these
[inaudible], probably the easiest thing to do. And then I also shortened the latency in the
right places.
And this is -- if we could get a system that has this behavior, it would be a first step
towards completely humanlike interaction. It's not there yet, but it's part --
to achieve the whole humanlike interaction, you need all four of -- at least all four of
these and maybe more.
So as a first step, I'm proposing -- in this particular talk I'm proposing to address the
turn-taking problem. The prompt design, I'll leave that as a separate issue. Yes.
>>: [inaudible]
>> Antoine Raux: I'm actually going to talk about that very specifically. But the -- yeah.
It's basically the fact that you don't know -- you don't want to interrupt the user in the
middle of their turn, and so you don't know if your user is going to pause in the middle of
their turn or by the end of their turn. Right? So that's -- using a fixed threshold for
endpointing is what triggers longer latencies. It's not a matter of computation power. So
that's exactly what I'm going to talk about now.
So the current approaches: what do systems generally do now in terms of turn-taking?
Typically there's no explicit model of turn-taking in these systems. It's more addressed
from an engineering point of view through an ad hoc combination of low-level tasks. I
mean, it has to be dealt with, because otherwise you can't have a dialogue, but it's usually
just a combination of [inaudible] detection, which is the minimum thing you'd need to be
able to do to have a dialogue, and [inaudible] barge-in [phonetic] detection and handling,
if the system allows barge-in.
But there's no general framework. And the problem with not having a general framework
to model turn-taking is that, first, it makes it hard to optimize, and it's not clear
what it means to optimize. And, second, it's not well integrated in the overall dialogue
model. And that's maybe more of a theoretical problem, but it prevents the lower levels
like turn-taking from informing the higher levels like grounding and dialogue processing, and the
opposite, too. It separates the two so that turn-taking cannot be well informed by
higher-level information.
I'm going to now propose two approaches to address these specific issues
in this talk. So the [inaudible] of the talk is about these two pieces of work that I have done during
my thesis. The first one is about optimizing endpointing thresholds still within the very
standard framework, but just changing how we set the threshold, and the second one is a
new model for turn-taking itself, which is a more generic model that's going to
encompass different turn-taking phenomena. And I'll give two examples of application
of this model to specific problems like endpointing and interruption detection.
So how is turn endpointing done in general today? It's usually a combination of two
levels of processes. There's first a voice activity detector that's going to just discriminate
between speech and silence in the incoming audio. Something like this. Oh, sorry. So
let me explain this example first. You have a -- the system says: "What can I do for you?"
The user responds: "I'd like to go to the airport," with a pause after "to." And so your voice
activity detector tells you that there is speech until after "to," when it detects there is
silence, and typically the system uses a threshold here: if the silence here gets longer than
700 milliseconds [inaudible] endpoints.
In this case, because the user starts speaking again before the threshold, nothing happens
and the utterance continues until the next silence is detected, same threshold is set, and
this time the user doesn't start again, so it's endpointed. That's the very standard, very
simple approach that's used in most systems.
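
A minimal sketch of the fixed-threshold endpointing just described, assuming the voice activity detector emits one speech/silence decision per 10 ms frame. The function name, the frame size, and the example durations are assumptions for illustration; only the 700 millisecond threshold comes from the talk.

```python
from typing import Iterable, Optional


def endpoint_time(vad_frames: Iterable[bool], frame_ms: int = 10,
                  threshold_ms: int = 700) -> Optional[int]:
    """Return the time in ms at which the turn is endpointed, or None."""
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # the user resumed, so the pending pause is cancelled
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= threshold_ms:
                return (i + 1) * frame_ms  # silence reached the threshold: endpoint
    return None


# Hypothetical turn: 1.0 s of speech, a 0.4 s pause, 0.8 s of speech, then silence.
frames = [True] * 100 + [False] * 40 + [True] * 80 + [False] * 100

print(endpoint_time(frames))                    # endpoint after the final silence
print(endpoint_time(frames, threshold_ms=300))  # a shorter threshold cuts in at the pause
```

With the 700 ms threshold the internal pause is ignored and the turn is endpointed 700 ms after the user stops; with a 300 ms threshold the same pause triggers an endpoint in the middle of the utterance.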
There are two different issues with endpointing in general, I would say. The first one is
[inaudible], which is when the system -- when the threshold is shorter than an internal
silence, the system is going to interrupt the user in the middle of that utterance, which is
in general not a desirable behavior. Or if you want to do that, you don't want it to be
triggered by silences. You want the system to be aware that it's going to interrupt the
user. And, in general, that's not something you want.
And the second problem is latency. That's exactly what I was talking about a minute ago.
Because you're using this fixed threshold, at the end of every turn you're going to have at
least the duration of the latency before you can respond. And in addition to that,
depending on how the system works, you might have additional processing after the
latencies. But at the very least you would get the latency.
And so you have a tradeoff here. Either you can set long thresholds, which will -- I'm
sorry. I'll turn that off. Okay. Which will set -- which will trigger few cut-ins because
you will have few pauses longer than that threshold, but they will trigger long latency
at the end of every turn; or you can set short thresholds, and then you'll have many cut-ins
and short latencies. Yes.
>>: I would have thought that in [inaudible] case you wouldn't see other things, like
[inaudible] that when I've got that kind of almost filled pause case, that I would have
thought the endpoint of the previous, of the first phrase would be really different from the
case where I'm saying over your turn to talk. And it isn't just the silence.
>> Antoine Raux: I totally agree with you. That's a very good point. It's exactly what I
will discuss in just the next few slides. The thing is, most current systems basically
ignore all that information. Because all the input we have is voice activity.
>>: [inaudible] distortion [inaudible] you have to correct for, not something you can
actually use as a signal.
>> Antoine Raux: Right. Basically, yeah, that's -- I mean, there's a lot of things
definitely positive -- I mean, there are many, many different features. I'll just explain
those in a minute. So there is information. It's just not used.
Sorry. Let me just try and turn off -- don't think that's going to help. Right. So what we
would want, because we have this tradeoff, is something that tells us that we should set
different thresholds for different silences, right? We want to set a long threshold when it's
an internal silence to be sure we don't produce cut-ins, and we want to set a short
threshold at the end of the utterance, using all these kinds of information we can get. So
specifically, what kind of information do we have in the spoken dialogue system at our
disposal for this, to inform the threshold setting?
First you have discourse; you know which dialogue state you're in. Like in the example I
will use, which is based on the Let's Go! bus information system, I have a very simple
abstraction of dialogue state. It's three different states: Is it an open question, like what can
I do for you; is it a closed question, like where do you want to go; or is it a confirmation
question, like more yes/no type of question. But that kind of information helps you set a
Another very important source of information is semantics. And particularly here I'm
talking about using the partial recognition hypothesis that you get
at the beginning of the silence. When you detect your silence, your recognizer
can tell you what it has recognized so far, and you can use that to inform, again, your
threshold setting.
Because that's going to tell you if the user is likely to be done or not. You can use
[inaudible] exactly, like we were just saying, which includes intonation, duration, vowel
lengthening, et cetera, all those that are very well known in human-human dialogue to
actually affect the perception of turn-taking.
You can use timing. Very simple features like how long ago did this utterance start is
likely to be some source of information on whether it's going to finish soon or not. And
we can also use information about the particular speaker we're dealing with. And that
can be either you have access to many different sessions with the same speaker.
In my case it's not -- wasn't the case, so I'm using information from this particular
dialogue. But you can use information from the first few turns to inform your behavior in
the later turns of the dialogue. Information like how many pauses does this particular
user make, how long are they, et cetera.
And so we have all this wide, large set of features that we could use to set the threshold.
What I did, I'm not going to describe my algorithm here, but I designed an algorithm that
specifically built a decision tree based on these features by asking binary questions of
these features. And the leaves of the decision tree are thresholds. So it's a fairly
simple -- in just two words, it first clusters the pauses that are in the training sets. So of
course it's based on the training set of collected dialogues where each pause is annotated
as internal or final. And this is done ultimately -- this annotation is on an automatic
[inaudible]. It's completely unsupervised in that way.
So it clusters the pauses based on the features. And in the second step, you set the
optimal threshold in each cluster. And that's how you build a decision tree this way.
This is published in my SIGDial paper this year.
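
To make the shape of such a tree concrete, here is a minimal sketch of looking up a threshold at runtime: internal nodes ask binary questions about the features, and leaves hold thresholds. This only illustrates the data structure, not the training algorithm, which the talk does not describe. The feature names, the questions, and every threshold value below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Features = Dict[str, float]


@dataclass
class Leaf:
    threshold_ms: int  # endpointing threshold used for pauses that reach this leaf


@dataclass
class Node:
    question: Callable[[Features], bool]  # binary question about the features
    yes: "Tree"
    no: "Tree"


Tree = Union[Leaf, Node]


def threshold_for(tree: Tree, feats: Features) -> int:
    """Walk the tree from the root, answering each question, until a leaf."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(feats) else tree.no
    return tree.threshold_ms


# Hypothetical tree; the questions and numbers are made up for illustration.
tree = Node(
    question=lambda f: f["is_confirmation_state"] > 0.5,
    yes=Node(lambda f: f["semantic_score"] > 0.5, Leaf(300), Leaf(700)),
    no=Node(lambda f: f["pause_start_ms"] < 2000, Leaf(1000), Leaf(600)),
)

pause = {"is_confirmation_state": 1.0, "semantic_score": 0.9, "pause_start_ms": 500.0}
print(threshold_for(tree, pause))
```

At each detected silence the system would compute the features, walk the tree, and wait for the returned duration before endpointing.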
So there is no point in going through the tree in detail, but what I want to show here is
that it does make use of many different features, so the different colors
represent different feature sets. If you see things about -- like the first one is timing here,
pause-start time, the green ones are all about semantics using partial recognition
hypothesis; the orange one is the dialogue state; and the other red one here, red ones, are
about the behavior of the user in this particular dialogue. Yes.
>>: Are you going to tell us how you unsupervise [inaudible]?
>> Antoine Raux: It wasn't included in this particular, because of time constraints, I
didn't include it here. I can deal with that maybe after the talk, the questions.
>>: The key point is unsupervised.
>> Antoine Raux: It is unsupervised, yes. Yes.
>>: Can I just ask one question [inaudible] so how you get the training data? Is it a
system that just waits?
>> Antoine Raux: Wait. Okay. Right. A very good point. This -- in this case, it was a
system that used the fixed threshold 700 milliseconds, a reasonable -- not very, very long.
And so that introduces -- that's where the unsupervised learning is risky, because your
system at runtime is going to make errors, right?
And so what I used is a heuristic that after the fact can tell me whether a decision was
right or wrong, a particular endpointing decision. And there are hints -- I mean, it's not a
perfect algorithm, but there are hints that tell you if the user starts speaking again right
after [inaudible], it's probably that they weren't done speaking, because they're not
responding to anything, the system hasn't spoken yet.
But if you hear them speaking right after your endpoint, that probably meant that that
decision was wrong. I'm using this kind of heuristics to kind of relabel or to correct the
annotation, the annotation of the training set. And based on that, I retrained it.
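
A minimal sketch of this kind of after-the-fact relabeling, assuming each logged endpointing decision records when the user next spoke and when the system prompt started. The field names and the one-second "right after" window are assumptions; the talk does not give the exact rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointEvent:
    last_speech_end_ms: int                # when the user stopped speaking
    endpoint_ms: int                       # when the system decided the turn was over
    system_prompt_start_ms: Optional[int]  # when the system started speaking, if it did
    next_user_speech_ms: Optional[int]     # when the user next started speaking, if ever


def relabel(ev: EndpointEvent, resume_window_ms: int = 1000) -> str:
    """Return 'internal' if the endpoint looks like a cut-in, else 'final'."""
    nxt = ev.next_user_speech_ms
    if nxt is None:
        return "final"
    # The user resumed shortly after the endpoint...
    resumed_quickly = nxt - ev.endpoint_ms <= resume_window_ms
    # ...and before the system had said anything they could be responding to.
    before_system = ev.system_prompt_start_ms is None or nxt < ev.system_prompt_start_ms
    return "internal" if resumed_quickly and before_system else "final"


def observed_pause_ms(ev: EndpointEvent) -> Optional[int]:
    """Pause duration measured from the log, even though the system endpointed."""
    if ev.next_user_speech_ms is None:
        return None
    return ev.next_user_speech_ms - ev.last_speech_end_ms


ev = EndpointEvent(last_speech_end_ms=1000, endpoint_ms=1700,
                   system_prompt_start_ms=2400, next_user_speech_ms=1900)
print(relabel(ev), observed_pause_ms(ev))
```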
>>: But you cannot [inaudible]?
>> Antoine Raux: Well, you can know it's an error, for one. And you can have some
hints, because if the user did start speaking again, you can use that start time to measure a
pause duration, even if you had made the decision to endpoint, the system at runtime
made the decision, you still have the last time the user spoke and the next time user
speaks. And you can use that as a pause duration that's longer than the threshold used at
>>: So is there a question that you ask [inaudible] come up with this new system and the
question you're asking is can I just deploy this new system, how well is [inaudible]?
>> Antoine Raux: Right. Well, I'm going to explain -- I have two different evaluations.
I'm going to explain that with the next slide. Yes.
>>: I'm wondering why unsupervised methods are off the table. So I'm thinking if I have
a large amount of recording or dialogue from, say, two speakers that are easy to separate,
say a man and a woman, and then I can tell after the fact who was speaking when I can
look at all the pauses in that arbitrary speech, and at any point in there during the pause I
could say can you guess whether the, you know, speaker A or speaker B is going to speak
next. And I would think you could get arbitrary amounts of data [inaudible].
>> Antoine Raux: Oh, you mean by detecting the switch between the speakers?
>>: [inaudible] I could do speaker identification or separation.
>> Antoine Raux: Right.
>>: And if I picked dialogue from which the separation was easy.
>> Antoine Raux: Right. That's -- that is true. The thing is that that requires
human-human -- that would be based on a human-human dialogue. So then that's true,
but then there are differences.
>>: [inaudible] I think I could do a very good job [inaudible].
>> Antoine Raux: Right.
>>: [inaudible] two speakers is going to speak next.
>> Antoine Raux: Right. I see the point. The approach I was taking was basically
starting from a system and kind of improving this particular system over time. So I was
trying to run directly on this system. And so [inaudible] evaluation, how did that
perform. So this is performance in the -- by cross-validation on the actual training set that
I collected in the way I just explained.
The double line here is the baseline using a fixed threshold, and so what this represents is
a -- it's a kind of ROC curve, if you want. It's the tradeoff between your latency and the
cut-in rate. So, of course, if you have high latencies, you're going to get small cut-in
rates, and if you have low latencies, you're going to have high cut-in rates.
And you'll have to trust me on that, but reasonable cut-in rates for a workable system tend
to be in the range of 2 to 5 percent, I would say. And that corresponds to fairly
standard -- if you look at the baseline, fairly standard thresholds are used, which are
usually between 500 milliseconds and 1 second.
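
A minimal sketch of how the fixed-threshold baseline curve could be computed, assuming each turn is logged as the list of its internal pause durations and that a turn counts as cut in when any internal pause reaches the threshold. The definition of cut-in rate and the toy data are assumptions; the talk does not spell them out. For a fixed threshold, the latency at the end of a turn is at least the threshold itself, so each threshold gives one point on the curve.

```python
from typing import List, Sequence, Tuple


def cut_in_rate(turns: Sequence[Sequence[int]], threshold_ms: int) -> float:
    """Fraction of turns in which some internal pause reaches the threshold."""
    cut = sum(1 for pauses in turns if any(p >= threshold_ms for p in pauses))
    return cut / len(turns)


def tradeoff_curve(turns: Sequence[Sequence[int]],
                   thresholds_ms: Sequence[int]) -> List[Tuple[int, float]]:
    """One (latency lower bound, cut-in rate) point per candidate threshold."""
    return [(t, cut_in_rate(turns, t)) for t in thresholds_ms]


# Made-up data: ten turns, each listed with its internal pause durations in ms.
turns = [[], [250], [], [400, 900], [], [], [650], [], [1200], []]

for threshold, rate in tradeoff_curve(turns, [300, 500, 700, 1000]):
    print(f"threshold {threshold} ms -> cut-in rate {rate:.0%}")
```

Raising the threshold moves along the curve toward fewer cut-ins and longer latency.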
And so what did [inaudible] what you can see from this graph, so the different lines
represent using different subsets of the features to perform the training. And the blue line
on the bottom is using all the features that I described before, feature sets that I described
So first thing is that, well, to some extent it works, because we did get a significant
improvement, about 22 percent reduction of latency at a certain cut-in rate, if you take 4
percent here, or you can look at it the other way and keep the latency constant
and reduce your cut-in rate by 38 percent.
The other thing that this shows, when you look at the different feature sets, the semantics
is by far the most useful feature type. It's not that the other ones don't work at all; I mean,
they all bring about half of the overall except for semantics. But once you add semantics
into the mix, like the partial [inaudible] results, given -- it's also given the structure of this
particular dialogue that's fairly constrained, et cetera, of course. But given all that,
semantics are really bringing you most of the gain here.
>>: What do you mean by discourse?
>> Antoine Raux: The discourse is mostly dialogue-state-level features. And, yeah.
>>: [inaudible]
>> Antoine Raux: Yeah. I mean, it's kind of a blurry -- I'm not making strong statements
by the name of these sets. Because semantics also contains the understanding of the
particular partial recognition hypotheses given the current state. It's semantics within the
current state.
So if you ask a yes/no question and you get a yes, this is a high semantic score; if
you ask a yes/no question and you get a time, that's a low semantic score.
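
As a rough illustration of a state-dependent semantic score, the sketch below scores a partial hypothesis by the fraction of its concepts that the current dialogue state expects. The scoring rule, the state names, and the concept names are all assumptions; the talk only gives the yes/no example.

```python
from typing import Dict, Sequence, Set

# Hypothetical mapping from dialogue state to the concept types it expects.
EXPECTED: Dict[str, Set[str]] = {
    "confirmation": {"yes", "no"},
    "closed_question": {"place"},
    "open_question": {"place", "time", "route"},
}


def semantic_score(state: str, concepts: Sequence[str]) -> float:
    """Fraction of recognized concepts that the current state expects."""
    if not concepts:
        return 0.0  # nothing understood so far
    expected = EXPECTED[state]
    return sum(c in expected for c in concepts) / len(concepts)


print(semantic_score("confirmation", ["yes"]))   # expected answer: high score
print(semantic_score("confirmation", ["time"]))  # unexpected answer: low score
```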
>>: [inaudible]
>> Antoine Raux: It has some [inaudible].
>>: Just to clarify, with respect to the decision tree that you had, that decision tree is
giving you a threshold on how long to wait?
>> Antoine Raux: Wait. Yeah, pause.
>>: In a pause, right? And with respect to the cut-in rate, the cut-in can be -- what I'm
trying to figure out is whether it's that bad to cut-in. I mean, sometimes, I mean, like, for
instance, you have some semantic information, like some [inaudible] information and the
user's still kind of trying to figure out, well, where am I going to, it's okay to come in and
say, okay, I heard that you want to go front gear to whatever, now, where do you want to
go --
>> Antoine Raux: Yes. I see your point. It's true that cut-ins might have different
costs in different circumstances. It's not -- I totally agree it's different. It's hard to
actually get ahold of this cost. I mean, definitely if you have some information already
that helps, but that might still be very confusing for the user because the user starts speaking
again and then that barges in on the system and then you can get into actual turn-taking
conflicts in the system, and that's -- can be very confusing to the user. So --
>>: So what is it currently doing right now when it comes in? Is it just implicitly
confirming anything that it knows or --
>> Antoine Raux: It's trying to explicitly confirm what it knows, or if there's nothing, it's
sending a nonunderstanding prompt, a repair prompt. Yes.
>>: I'm wondering, what are these durations in natural turns, you know, among
human-human [inaudible]?
>> Antoine Raux: That actually depends a lot. I did -- unfortunately I don't have that
right here, but I did a study on the corpus of human-human dialogues like the one -- the
example I played is from that corpus. And what happens is you have a really wide
variability in the human-human case. Some cases are actually one second or more, and they're
not necessarily strange or uncomfortable, but some cases are really, really, really short,
and they have to be. So it really depends a lot on the context, on the type of dialogue
you're having. So it's hard to give --
>>: [inaudible]
>> Antoine Raux: Oh, yes. It's zero. You cannot overlap actually. It's virtually zero.
>>: Like Tim is saying, in fact, in our conversation, at least depending on the different
>> Antoine Raux: Right. The fact that it's zero doesn't mean that you're cutting in on the
user. You might just be very, very confident that they're finished saying whatever they're
saying. And you're just very, very close to their turn, but you don't really interrupt them
in the sense that they were intending to say more after it.
>>: So it turns out that sociolinguists and psychologists [inaudible] different pauses,
pause lengths, and found that certain [inaudible] like you mentioned before with respect
to prosody, if someone says an "um" that the pause is much longer than if they say an "uh"
and users don't interrupt because they don't understand that "um" to mean that they're
>>: That's probably what phrase [inaudible].
>>: Probably don't even have to wait for the final [inaudible].
>> Antoine Raux: So yeah. So the human-human case is actually much more complex in
general. And also I think the complexity is maybe arguable. But definitely --
>>: If we looked at the short end of this, I'll bet 300 milliseconds is [inaudible].
>> Antoine Raux: In terms -- well, it depends. I mean, for like in terms of being
humanlike, yes, I agree with you. The cases where it's slow and it's okay to be slow are
fine. But in terms of short end, yes, I agree. We're not there yet. Definitely. Even with
the improved [inaudible]. It's going there, but it's not there yet.
>>: But is it your goal to fully reproduce a human-human conversation experience?
>> Antoine Raux: It's one goal you can think of. The other one is to relate the -- these metrics to user
feedback about the interaction they have with the systems, so basically to user
satisfaction. Unfortunately I don't have that here. It's a very hard thing to get, because
it's hard to get feedback on these kinds of subtle things directly from the user.
And in my particular case, I'm using a system that's used by the general public. And we
don't have access to the users after they've used this system.
>>: There's nothing [inaudible] to let user know that they are interacting with a machine,
>> Antoine Raux: Yes. On the other hand, something I forgot to mention, when -- like
the two examples, the two machine examples I played at the very beginning, it's not only
about interaction, but if you just look at dialogue duration, these two, when it gets to very
short dialogues like these ones, the difference between the two was like 20, 25 percent.
Just by shortening the prompts and the latencies. And so that can also mean something
just economically speaking, if you have a system deployed out there and you want short
dialogues. So gaining here and there can actually add up in the end. But I think it's
mostly in how the interaction flows.
>>: The only problem I have with this [inaudible] timeout [inaudible] it has to make
decisions without knowing those -- what [inaudible].
>> Antoine Raux: Well, so --
>>: [inaudible]
>> Antoine Raux: Right. So actually there's a good plug in here. Because I [inaudible]
on that system that I played, that I showed, because I'm not discussing the whole
architecture here, but it's a research dialogue system, it has a fairly complex
architecture. And particularly I designed it so it has a central component that's in charge
of low-level interaction.
So what the SR does, it just feeds information about voice activity detection, partial
hypothesis, et cetera. The decision to endpoint is actually made by an external module
that also has access to high-level information, et cetera. And that might add a little bit of
delay, actually, but it's very small compared to all the other factors here. And that makes
it more computationally expensive than a standard -- a commercial dialogue system.
That's from the research perspective.
So I did implement the tree, the specific tree that you've seen. The example I showed was I
[inaudible] in the Let's Go! interaction manager, which is the central component. What
you might have noticed in that tree is that there was no prosody. Because for the live
version, because [inaudible] features are very expensive to compute and not as readily
accessible as the others and didn't bring anything in terms of the batch evaluation, I didn't
include them in the live version.
I picked a working point, because now I have these curves. Now to run the system I have
to decide where I run on the curve. I picked the curve -- I picked the one that -- 3 percent
cut-in, which was 635 millisecond average in the batch evaluation. And I ran a study in
the -- May '08 when I just set the system -- the public system to use randomly one of two
versions, either fixed threshold of [inaudible] or the decision tree.
Now, let's look at the results of this. So first the average latency here: the left side is
the average latency overall, and then this is split by dialogue state. Note that this time this
is real latency. This is not threshold value. This is the actual time between the end of the
user utterance and the beginning of the system utterance. It includes additional
processing, which is why it's higher than previous values.
Our first kind of disappointment here is that there's no difference at this point. We'll see
why in the next slide. It's not a bad thing. But even then, the behavior induced by the
proposed algorithm is very different from the control one. The control one, of course,
uses a fixed threshold; there is very little difference in terms of latency between different
states. That's mostly random difference.
However, in the proposed case, basically the algorithm learned that after an open
question, which is typically where you get users to say long utterances with pauses,
hesitations, et cetera, it sets a longer threshold, which ends up being longer than the
baseline. For closed questions, it has an average behavior similar to -- the same as the baseline.
For yes/no questions, which are much more predictable, where there are definitely
fewer pauses because the user mostly responds yes or no, not only but mostly, the system
learns to shorten the latency. It's still not shortened enough, I would say, but it's
definitely significantly shortened.
Now, if we look at the cut-in rate, this is what explains why we had the same latency. We
were actually working kind of at a horizontal line here. We significantly reduced the
cut-in rate here overall, and this came mostly from the -- mostly from the open requests
state, which, for the same reason as before, the system learned to set longer thresholds
there because that's where most of the cut-ins happened.
You can see in the control case, 12 percent of the turns are cut-in in that state. That's
definitely a significant number of them. Even if they're not very costly cut-ins, some of
them are definitely bound to be problematic. Yes.
>>: So [inaudible] your results make me wonder what would have happened if you had
as a baseline policy just don't cut-in unless you have any kind of semantic information
which would reduce the open questions. So, in other words, you know, just cut-in only
when you have some [inaudible] information, like, you know --
>> Antoine Raux: At a fixed threshold?
>>: Not necessarily a fixed threshold, just whenever -- yeah, maybe you have to have
some thresholds, but, you know, having just a -- you know, without using a sophisticated
decision tree, just basically only cutting in when -- after a threshold when you have some
information. Otherwise it'll just let the user continue.
>> Antoine Raux: Right. The thing is, like specifically in these turns, it's likely that the
user would say more than one piece of information. So you would still end up cutting in.
Now, if you decide that these cut-ins don't matter, they might be okay, that there's still
uncertainty there. In terms of raw number of cut-ins, there would still be -- I don't have
the numbers, so I can't completely answer. But I think there would still be some because
of that. Because people would start saying something, and they would not be done but
there would still be -- already be some semantics.
But the other thing you can imagine doing is to set a state specific fixed threshold, right?
You could just learn. The interesting thing here was to learn that from the features. We
couldn't know for sure beforehand which one would be more useful or less useful. I
mean, we could have guessed definitely some intuitions, but this actually learned it from
data and it did optimize it.
>>: That was actually something I wanted to ask you. So in -- I've actually worked on
designing [inaudible] systems. So [inaudible] is there is, you know, you know that
[inaudible] special and tune that separately, and so the first question is how much of this
improvement comes from just the combinations? And the second question is, the other
baseline would be, like you said, just a static -- just set per dialogue state. How much are
you improving from that baseline?
>> Antoine Raux: Right. So I don't have the numbers. I did compute it, but
unfortunately I've forgotten the numbers for the state-specific threshold. It was not
exactly -- it was kind of halfway between -- it was going there, but not completely, not as
good as the proposed approach.
In terms of confirmation, I do agree that -- because Let's Go! has a lot of explicit
confirmation, that's a big factor in the improvement. It's not all, because like this
improvement here is something that's not related to a confirmation. But for the gain in latency,
a lot of it comes from the confirmation questions, like we've seen in a previous slide.
But the interesting thing about this is that because it's a learning algorithm, if you
have more complex systems, with different states and different things, you can still
rerun this. And since additionally it's unsupervised, you can always
rerun it if you add a different state to your dialogue that's nonstandard. You can
learn from interaction with users.
Now, the last thing I want you to see on this live evaluation, which I couldn't see in the
batch one at all, was the impact of the algorithm on speech recognition
performance. Now the problem here is that this data is not transcribed, so I don't have
word error rate values directly, so I looked at the nonunderstanding rate and the
proportion of rejects in the user's speech, the proportion of utterances that
have been rejected, between the two conditions, and there is a [inaudible]
significant reduction both overall and in the yes/no questions. And so overall it's like 1
percent, it's a small reduction, but it's statistically significant.
The biggest surprise to me was that the improvement was here in the yes/no questions
because the reduction in cut-ins happened in the open-question case here.
Now, the reason why this happens is that with the recognizer, when you
have a very short word and you add a long silence to it, you're more likely to mess up
your recognition basically. The impact of the silence, if you have background
noise and background voices, is going to hurt recognition even of the actual word itself
more. So that's the explanation I have for this reduction here.
>>: Do you see an overall significant difference in task completion?
>> Antoine Raux: No. No, no. It wasn't sensitive -- like there's not -- my trees are not
sensitive enough for that kind of improvement.
>>: [inaudible] you're saying that in this system [inaudible] get the answer to a question
yes/no, it's a binary question [inaudible] that we're only getting, you know, 5 to 10
percent -- I mean, 95 to 90 percent right?
>> Antoine Raux: No. That's not exactly [inaudible] --
>>: Or at least that we understand [inaudible].
>> Antoine Raux: Right. The thing is that some of these are not even speech in the first
place.
>>: [inaudible] question and the user didn't answer it.
>> Antoine Raux: Right. There was some background noise, some baby crying on their
lap. Or some -- we have lots of data like that in our corpus, unfortunately.
>>: Right. But is it actually set [inaudible] what's a fraction if you ask a yes/no question
that you're actually getting a yes or a no as the answer?
>> Antoine Raux: I don't know how that -- it's maybe -- I would say it's 80 percent --
>>: Really. So you're saying 20 percent of the time you don't, and there's a third choice,
that they didn't --
>> Antoine Raux: Either they say something else or --
>>: No, that I understand, they would say something else. But [inaudible] would be
interpreted as either yes or a no.
>> Antoine Raux: Oh. Oh. That's not what I said. And I don't have that specific
number. But in this case I'm not saying it was interpreted as yes or no. It was just
something that happened after a yes/no question.
>>: Right.
>>: And you're saying if a baby cries, you'll get the wrong answer?
>> Antoine Raux: No. The system might misread that as speech, but
in this case it would lead to a nonunderstanding, say, "I didn't get that." The
system would respond "I didn't get that," which is not as bad as misinterpreting that as a
yes, for example.
>>: But, still, 10 percent is pretty [inaudible].
>> Antoine Raux: That's tough data.
>>: And when you say speech, do you include [inaudible] speech like uh-huh, huh-uh
>> Antoine Raux: Yeah. Well, what I mean here is the state -- the question that the
system asked, what comes from the user can be --
>>: It can be speech [inaudible].
>> Antoine Raux: Yeah, right. Right.
>>: [inaudible]
>> Antoine Raux: Not necessarily.
>>: [inaudible]
>> Antoine Raux: Yes. Because here I don't have any label of what is actually said. I
don't know what is actually said. So they are included, yes. Yes.
>>: So one thing that you could do to see if users can perceive -- I mean, you were saying
before that, you know, after the users use the system they do [inaudible] ask them
[inaudible] well, you could certainly just pick the recorded audio and present it to a
bunch of raters and have them rate along certain dimensions, to see if people can
distinguish between, you know, the small, subtle changes that you made. And I'm wondering,
have you done that, and, if so, what were the results?
>> Antoine Raux: Right. So I've started to look into that. My take [inaudible] result was
that I think we need more improvement than this. I didn't do a formal evaluation like
this, but from what I looked at -- I can hear the difference, but to make it really
perceptible to listeners, to third-party listeners, I think we need
to go further than this.
Even in the best cases -- there are lots of yes/no cases where we
reduce latency by 50 percent, say from 1 second to 500 milliseconds -- that's very hard
to perceive from the -- it's surprisingly hard to perceive actually. So I think with -- we're
not completely there yet is the answer. But I don't have this formal evaluation. I started
looking into it, but I'm actually trying to improve more before doing that evaluation
because it's more costly than the [inaudible] batch.
Okay. So but -- so that's it for the first part, first part of my talk. So it was still a fairly --
it was a principled approach in the sense that I was optimizing from data the turn-taking
behavior or the endpointing behavior. And it was very much in the standard framework
of setting a threshold and just making a decision, the endpointing decision based on the
Now, I want to take one step further and propose a new way of addressing turn-taking in
general. It's more of a theoretical model that will be applied to specific problems. And I
will explain that.
So if we want to describe what turn-taking is in the kinds of dialogue I'm looking at,
which are two-party dialogue between the system and the user, at the most fundamental
level, the floor, or the turn, is basically alternating between system and user in a loop like
this, right?
Now, the interesting bit is not that phenomenon, which is well known and not very
special. The interesting thing to look at is what happens at the transitions, which is what we
want because this is where we can improve behavior.
So one typical standard behavior is, say, the system speaks, finishes its utterance,
frees the floor, stops speaking, the floor is free, and it's marked as being for the
user but the user still hasn't spoken yet here. And then the user starts
speaking, gets the floor, finishes their utterance, the floor becomes free but marked for the
system and we have this loop and they take turns with slight pauses maybe between each
That's not the only way transitions happen. The other way is when they actually overlap.
And so in this case the system speaks, and before the system finishes, the user starts
speaking so it's marked, it goes to this state, it's marked as both speaking or both claiming
the floor with mark because the user is trying to claim the floor now. And if the system
stops speaking, then we get a switch to user.
So you have this kind of transition. And I'm going to in the next few slides explain how
this fairly simple machine, which actually has a few more details -- so first each state
can -- you can stay in each state indefinitely -- well, yeah. And also the difference -- I
don't know if you saw the difference, but I had arrows that you can actually go back, even
if the -- for example, the system speaks, user starts speaking on top of them, but then the
user stops speaking, you go back to systems, so you can actually go both ways in certain
So this is still a fairly simple state machine, but it captures really many of the
typical turn-taking phenomena that we observe in a two-party conversation.
The first, smooth transition is basically the one I explained first, in this case user to system:
user speaks, user yields the floor, floor becomes free, system grabs the floor, we have
a smooth transition.
Now, if after the user yielded the floor the system waits, then we have a latency. That
corresponds to exactly the latency that I was talking about in the first part. Okay. That's
staying in this state here.
Now, if the system grabs while the floor is still the user's and not free, then we have a
cut-in, same thing we were talking about.
Now something for a completely different kind of behavior: say the system speaks,
finishes speaking, so we switch to the free U, but the user doesn't get back to the system, and at
some point the system decides to grab the floor back. That's typically a timeout type of
behavior. Didn't get a response from the user, decide to grab the floor again.
Another phenomenon that's very common, barge-in -- user barge-in. So the system is
speaking, user grabs the floor, so they're both claiming the floor at the same time and the
system yields, so we go to the user. It was a smooth barge-in transition.
And there are many other types, but the point is that this fairly simple [inaudible]
captures really a lot of the phenomena that happened and kind of formalizes them in a
nice way. And so based on this, we can construct a -- build an algorithm to train to make
the decisions. It's based on decision theory; I call it dynamic decision process.
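As a rough illustration of the floor state machine described above, here is a minimal Python sketch. The state names `FREE_U` and `FREE_S` follow the "free U" and "free S" labels used in the talk; `BOTH_S` and `BOTH_U`, the string labels for actors and actions, and the transitions out of `BOTH_U` are assumptions made for illustration, not details taken from the slides.

```python
from enum import Enum


class Floor(Enum):
    SYSTEM = "system"  # system holds the floor
    USER = "user"      # user holds the floor
    FREE_U = "free_u"  # floor is free, marked for the user (system just finished)
    FREE_S = "free_s"  # floor is free, marked for the system (user just finished)
    BOTH_S = "both_s"  # both claim the floor; the system held it first (assumed name)
    BOTH_U = "both_u"  # both claim the floor; the user held it first (assumed name)


# (state, actor, action) -> next state. Only transitions mentioned in the talk,
# plus their assumed mirror images, are listed.
TRANSITIONS = {
    (Floor.SYSTEM, "system", "yield"): Floor.FREE_U,
    (Floor.FREE_U, "user", "grab"): Floor.USER,       # smooth transition
    (Floor.FREE_U, "system", "grab"): Floor.SYSTEM,   # timeout
    (Floor.USER, "user", "yield"): Floor.FREE_S,
    (Floor.FREE_S, "system", "grab"): Floor.SYSTEM,   # smooth transition
    (Floor.SYSTEM, "user", "grab"): Floor.BOTH_S,     # user starts over the system
    (Floor.BOTH_S, "system", "yield"): Floor.USER,    # barge-in completes
    (Floor.BOTH_S, "user", "yield"): Floor.SYSTEM,    # user backs off
    (Floor.USER, "system", "grab"): Floor.BOTH_U,     # cut-in
    (Floor.BOTH_U, "user", "yield"): Floor.SYSTEM,    # assumed by symmetry
    (Floor.BOTH_U, "system", "yield"): Floor.USER,    # assumed by symmetry
}


def step(state, actor, action):
    """Apply one event; unlisted events leave the state unchanged."""
    return TRANSITIONS.get((state, actor, action), state)


def run(state, events):
    """Apply a sequence of (actor, action) events starting from `state`."""
    for actor, action in events:
        state = step(state, actor, action)
    return state
```

For example, `run(Floor.SYSTEM, [("user", "grab"), ("system", "yield")])` walks through the barge-in path and ends in `Floor.USER`.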
The idea here is that at every point in time system has partial knowledge of state,
meaning that at some point the system is going to be pretty certain who has the floor, like
during, in the middle of utterances by the system or the user, it's usually pretty clear that
the floor is user or system. But there are transitions, there are points where suddenly
you're not sure anymore who has the floor, and that's the point where you have to make
the decision. So it's important. So you need to keep track of this uncertainty if you want
to use that to take your -- make your turn-taking decisions.
There are four types of actions that can be taken either by the system or the user, but in terms
of -- I mean, we take the viewpoint of the system here: grabbing the floor, yielding the
floor, keeping the floor when you have it, and waiting, which is not doing anything when the
other person has the floor.
And these different actions yield different costs in the different states we have there.
Not all of them are possible in all states, because for yield and keep, the
system needs to have the floor in the first place, whereas for grab and wait, it needs to not
have the floor in the first place. But most importantly they have different costs, and we'll
see what that means.
And so once you have that, you have states in the world with the belief on what the
current state is, and you have actions with cost. This is very decision-theory oriented.
And so you can use this in theory to pick the action with the lowest expected cost by
using all this information.
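The selection rule just described, picking the action with the lowest expected cost under the current belief, can be sketched in a few lines of Python. The function names and the numeric beliefs and costs in the usage note are hypothetical; only the rule itself comes from the talk.

```python
def expected_cost(belief, costs, action):
    """Expected cost of `action`: sum over states of P(state) * cost(state, action)."""
    return sum(p * costs[(state, action)] for state, p in belief.items())


def best_action(belief, costs, actions):
    """Return the action with the lowest expected cost under `belief`."""
    return min(actions, key=lambda a: expected_cost(belief, costs, a))


# Hypothetical cost table for two states and two actions.
EXAMPLE_COSTS = {
    ("user", "grab"): 1.0,    # grabbing while the user still has the floor
    ("free_s", "grab"): 0.0,  # grabbing a free floor is the right action
    ("user", "wait"): 0.0,    # waiting while the user has the floor is right
    ("free_s", "wait"): 0.5,  # waiting on a free floor adds latency
}
```

With a belief of `{"user": 0.3, "free_s": 0.7}` the expected costs are 0.3 for grab and 0.35 for wait, so `best_action` returns `"grab"`; with `{"user": 0.8, "free_s": 0.2}` it returns `"wait"`.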
Now let me explain, using two examples, how this can be applied. So
endpointing again: typically you start, you know that the user has the floor, you're at a
point where you're pretty confident the floor is to the user, they've been speaking. And at
some point because there's a pause in the user's speech, you're uncertain on whether the
user just freed the floor to this state or the user still has the floor because they intend to
continue speaking.
And in either of these states, the system can take two actions, grab or
wait. If you're in this state, grab is the right thing to do because the user is done speaking
and you don't want to induce latency, which is what wait does. If you're in this state, wait
is the right thing to do, because the user is not done yet, so you want to wait. If you grab,
you're going to induce cut-in.
So we can translate that into an actual cost function or cost matrix here. Again, this
is once a silence has been detected in user speech. We'll see that how that [inaudible]
So the possible actions are wait and grab, the states are user or free S, and
for the right actions I set a cost of zero; the right thing to do has a cost of zero. And for
the other ones I set different costs. For waiting, the cost is equal to the time since the
silence has been detected. So the longer you wait, the higher the cost of waiting more is
going to be.
For cut-ins, I set the cost of the cut-in to be a constant across the whole system and all the
states, which is kind of -- both of them are kind of approximations. You can imagine
cut-ins having different costs in different circumstances, and you can imagine this
dependency not being strictly linear. But this is a first approximation to show how the
system can be used.
So now we have again a typical decision theory problem where you just want to take the
action that minimizes the expected cost, given that the user is still silent at time T.
Because if the user starts speaking again, then we're going back to being certain of being in
the user state and so we don't have this problem anymore.
So we assume that the pause is still going on at time T, and the cost of grabbing is the
probability of being in the user state times the cut-in cost plus the probability of being in
the free state times the cost, which is zero. And same thing here for wait. So we have
these two expressions that express the expected cost.
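Written out, the two expected costs described above would look roughly as follows. The labels $C_G$ and $C_W$ are introduced here for convenience; $U$ and $F$ stand for the user and free states, $K$ for the constant cut-in cost, and $t$ for the time since the pause was detected.

```latex
\begin{align*}
C_G(t) &= P(U \mid \text{pause at } t)\, K + P(F \mid \text{pause at } t) \cdot 0 \\
C_W(t) &= P(U \mid \text{pause at } t) \cdot 0 + P(F \mid \text{pause at } t)\, t
\end{align*}
```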
Now, what we need here is to compute these state probabilities. We need to know the
probability that the state is user or free at time T. Now, here I'm using -- the actual
expression is with conditional probabilities, but because they're both conditioning on the
same thing you can use joint probabilities. It's equivalent in terms of solving the equation
in the end, of finding the points. So I'm actually expressing joint
probabilities.
You can use this with the definition of conditional probabilities here: the probability of
being free times the probability of being in a pause given that the floor is free. Now, I
make this approximation that if the user released the floor, they're not going to start
speaking again ever. So it's an approximation in the sense that we do get this
transition back to the user state if the system never gets back to the user.
But this usually happens fairly -- after a fairly long time, which is not the same scale as
what we're dealing with. So [inaudible] I'm saying that if the user did free the floor,
they're never going to start speaking, so they're going to be silent regardless of T. So this
probability is 1.
And the other thing here is that because this doesn't depend on whether the silence is
going on or not, after we removed that conditioning, it doesn't depend on T anymore.
So it's the same as the probability at the very beginning of the pause that the floor was
free, that the user had finished their utterance basically.
So we have this. Now, we'll need to compute this parameter. But this is no longer
dependent on time. Now, on this side, the other probability, of being in the user state,
it's the same decomposition here. The only thing here is that now the probability of being in the
pause at time T given that the user keeps the floor is the probability that they haven't
spoken again yet. And so it's the probability that the duration of this current silence is
bigger than T. And we can use that if we compute the probability distribution of internal
silences based on data.
And that distribution is well approximated by an exponential distribution, which is
something other people have found in the literature in the past, and I have confirmed that
it's a good match on my data. So I'm using this approximation, which leads to this
probability being just equal to exponential of minus T over mu; mu is a parameter which is
equal to the mean pause duration. It's something you can compute easily.
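A minimal sketch of the exponential approximation described above: estimate mu as the mean duration of internal pauses, then evaluate the probability that a pause lasts longer than T. The function names are hypothetical and the durations in the usage note are made-up values in seconds.

```python
import math


def mean_pause(durations):
    """Estimate mu as the mean duration of the internal pauses (in seconds)."""
    return sum(durations) / len(durations)


def prob_pause_longer_than(t, mu):
    """P(pause duration > t) under an exponential distribution with mean mu."""
    return math.exp(-t / mu)
```

With toy durations `[0.2, 0.4, 0.6]`, `mean_pause` gives about 0.4 seconds, and `prob_pause_longer_than(0.4, 0.4)` is `exp(-1)`, roughly 0.37.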
And, on the other hand, you have the same phenomenon as before: because it's no longer
dependent on whether it's a silence, it's equal to the probability at time zero that the user
kept the floor or keeps the floor now. And this is 1 minus the probability we had in the
previous state. So we just have two parameters to compute. Mu is just the mean pause
duration and we still need to compute this probability. So at the beginning of the pause,
did the user yield the floor or did they -- are they keeping the floor is the question.
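Putting the pieces of this derivation together, the two joint probabilities would read roughly as below. Here $P(F)$ is the probability, estimated at the very beginning of the pause, that the user has released the floor, $P(U) = 1 - P(F)$, and $\mu$ is the mean internal pause duration; the notation is introduced here and is not taken from the slides.

```latex
\begin{align*}
P(F, \text{pause at } t) &= P(F)\, P(\text{pause at } t \mid F) \approx P(F) \cdot 1 \\
P(U, \text{pause at } t) &= P(U)\, P(\text{pause at } t \mid U) = \bigl(1 - P(F)\bigr)\, e^{-t/\mu}
\end{align*}
```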
So this is a simple -- or very standard binary classification problem. You can
take all the pauses in your training data and label them as internal or final and use
the features in the same way that I was doing in the previous algorithm, use the features
that are available at the beginning of the pause to try to predict this outcome, whether it's
final or internal. And I did that using logistic regression, and here are the results.
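To make the classification step concrete, here is a small pure-Python logistic regression trained by gradient descent. It is only a sketch: the single binary feature and the toy labels in the usage note are invented for illustration, and the actual feature set and training setup used in the thesis are not shown here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this pause: P(final | x) minus the label.
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def prob_final(x, weights, bias):
    """Estimated probability that a pause with features `x` is final."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Toy data: one hypothetical binary feature per pause; label 1 = final pause.
TOY_FEATURES = [[1], [1], [1], [0], [0], [0], [1], [0]]
TOY_LABELS = [1, 1, 1, 0, 0, 0, 0, 1]
```

After `weights, bias = train_logistic(TOY_FEATURES, TOY_LABELS)`, `prob_final([1], weights, bias)` comes out higher than `prob_final([0], weights, bias)`. What matters for the decision process is this probability estimate, not the hard classification.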
In terms of raw classification error, the majority baseline is 24.9 percent and it went down
to 21.1 percent. So, of course, that might seem like it's not a very large
improvement here. But that should not come as a surprise. Because if we could do a
very good job here, we wouldn't have to wait for any threshold. If we could know at the
very beginning of the pause whether the user is done or not, we wouldn't have to wait at
all, ever.
And it's not only a matter of having the right features and the right algorithm;
there is some intrinsic ambiguity here. There are points in time where the user
might be done or not and you're just not sure about it. Even a human is not sure
about it. Even the speaker might not be sure about it actually. So it's not a surprise that
you can't get very, very high accuracy here. You could probably get better with more
features and with different classifiers maybe. But I don't think it would get much better.
But that doesn't matter because what we want is an estimate of the probability. We don't
need to have perfect classification. And because we did get improvement in both hard
and soft [inaudible] likelihood, this at least improved over a simple prior distribution. So
we can use that and plug that in our equations.
So what happens here is we have -- remember we had these two costs: the cost of
grabbing and the cost of waiting. And they [inaudible] cost of waiting was directly
proportional to T, was T times K, so it's just linearly increasing here. The cost of
grabbing -- the cost knowing the state was fixed, but our knowledge, or belief probability
on the state, decreases exponentially, so we have these decreasing costs and these
increasing costs, and basically you want to endpoint when the cost of waiting gets higher
than the cost of grabbing.
And so the intersection is where you want to set your threshold. Because we're in
a pause, we can actually go back to setting a threshold here, which is the time that
corresponds to the intersection. So, in other words, the threshold is the solution to this
Now, this doesn't have unfortunately an analytical solution. But you can find numerical
methods and solve that very easily, even at runtime.
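One simple numerical method for this is bisection. The sketch below assumes the waiting cost is `t * p_free` and the grabbing cost is `cut_in_cost * (1 - p_free) * exp(-t / mean_pause)`, which is one reading of the costs described in the talk; the function name, argument names, and example values are hypothetical.

```python
import math


def endpoint_threshold(p_free, cut_in_cost, mean_pause, tol=1e-6):
    """Find t where the cost of waiting equals the cost of grabbing.

    Solves t * p_free = cut_in_cost * (1 - p_free) * exp(-t / mean_pause)
    by bisection. Requires 0 < p_free <= 1 and positive costs and durations.
    """
    if not 0.0 < p_free <= 1.0:
        raise ValueError("p_free must be in (0, 1]")

    def gap(t):
        # Waiting cost minus grabbing cost; increasing in t.
        return t * p_free - cut_in_cost * (1.0 - p_free) * math.exp(-t / mean_pause)

    lo, hi = 0.0, 1.0
    while gap(hi) < 0.0:  # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

The more confident the classifier is that the user is done (higher `p_free`), the shorter the resulting threshold: `endpoint_threshold(0.9, 1.0, 0.5)` is smaller than `endpoint_threshold(0.5, 1.0, 0.5)`.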
Now, these are the results using this approach. Again, this is the baseline with a fixed
threshold, and this is the dynamic decision process approach. It's slightly better -- I don't
have it on this graph, but it's slightly better than the previous approach. I don't have it
here because this was done on different data. But I also have it -- the same computation
[inaudible] as before, it was very slightly better.
Now, the interesting thing is we're not done yet. So it wasn't -- the goal was not -- the
hope was not that this would be significantly better than the previous approach. The
interesting thing here is that we can extend this approach much more than we could the
previous one. Because we have this generic framework, we can apply this at different
places in different ways.
So an example of an extension here is to not only do endpointing at pauses but also at
speech. You can have a similar process within utterances before you detect a pause. And
that's important because pause detection itself incurs a delay. To be sure you have a
pause, you first need to wait something like 200 milliseconds for your voice activity
detection to know, also because there are silences within speech, at consonants like
Ps and Ts, that are potentially long silences without being a pause at all.
>>: [inaudible]
>> Antoine Raux: Depends on language. Not English, but in some [inaudible].
>>: How fast can you compute this dynamic decision process?
>> Antoine Raux: So, well --
>>: [inaudible]
>> Antoine Raux: I don't think the computation of the threshold is going to be high. So
before the threshold itself, there's the feature computation, which also happens
during these 200 milliseconds that I mentioned; that time is both to detect the pause for
sure and to compute the features. So you have this thing that kind of introduces a delay
right off the bat here before you can make any decision.
So if we can start detecting the floor switches before we detect the pauses, [inaudible] for
the cases where we're pretty sure that the user is done, then that would improve over this,
over the current state. So what we can do is use partial recognition hypotheses directly,
even without pause detection.
And in this case I define the floor switch differently, because we don't have a pause to classify
in a binary way between internal and final. But I say that the floor switch is at the first partial
hypothesis where the text of the hypothesis is identical to the text of the final hypothesis
in the training data. Because in the training data the system actually went on and did
standard endpointing at the end. And if the recognition results are the same, that means
we didn't gain anything by waiting some more. So we should have endpointed earlier on.
So this is how I label the training data here.
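The labeling rule described above can be sketched as follows. The function names and the example strings are hypothetical; the rule itself (the floor switch is the first partial hypothesis whose text matches the final one) is the one stated in the talk.

```python
def floor_switch_index(partials, final_text):
    """Index of the first partial hypothesis identical to the final text, or None."""
    for i, text in enumerate(partials):
        if text == final_text:
            return i
    return None


def label_partials(partials, final_text):
    """One training label per partial: True if it is identical to the final text."""
    return [text == final_text for text in partials]
```

For partials `["to the", "to the air", "to the airport", "to the airport"]` and final text `"to the airport"`, the floor switch index is 2 and the labels are `[False, False, True, True]`.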
And now we can use, again, a similar cost matrix. We're in the same states with the same
transitions and actions. Again, the right action has cost zero, the cost of a cut-in is
still the same constant K, and the cost of not endpointing when we could have during
speech, I set it to be another constant, D. Because we
don't have this dependency on time anymore, we don't have this thing where we wait
a certain duration once the pause starts; it's something that happens every time we
get a partial hypothesis. This one is no longer dependent on time.
So what that means is that these equations basically lead to a threshold on the
probability, at this particular partial hypothesis, that the floor has switched
or not. So we can do the same thing as before, use logistic regression, but this time we
classify not pauses but partial hypotheses as being final or not, identical to the
final or not.
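With constant costs on both sides, the comparison of expected costs reduces to a fixed threshold on the probability, as sketched below. The names `cut_in_cost` (the constant K) and `missed_cost` (the constant D) and the example values are assumptions for illustration.

```python
def should_endpoint(p_switched, cut_in_cost, missed_cost):
    """Endpoint when waiting is expected to cost more than grabbing.

    Grabbing costs cut_in_cost if the floor has not switched;
    waiting costs missed_cost if it has.
    """
    grab_cost = (1.0 - p_switched) * cut_in_cost
    wait_cost = p_switched * missed_cost
    return wait_cost > grab_cost


def probability_threshold(cut_in_cost, missed_cost):
    """Probability above which endpointing has the lower expected cost."""
    return cut_in_cost / (cut_in_cost + missed_cost)
```

With K = 3 and D = 1 the threshold is 0.75, so a partial hypothesis with probability 0.8 of being final triggers endpointing and one with 0.7 does not.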
So we have a classification. The baseline is higher and the relative
improvement is much bigger than the previous one. We're still at 20 percent, we're still at
the same value, 20 percent error. But, again, we want to use the probability estimates
rather than the pure classification. So that's not necessarily too bad.
So what happens here in terms of results, we do get some improvements. They are small,
I would say, over the endpointing at silences only. Now, what's interesting to do here --
the point is that this latency metric, the average overall latency metric, is not necessarily
exactly capturing the user experience very well.
So it's kind of interesting to look at the distribution of the thresholds picked by the two
approaches. So these are the different values for the threshold, and this is the proportion of turns
that have this threshold given the algorithm. And the blue one is endpointing at
silences, the first approach; the second one is endpointing anytime, including
potentially during speech.
So what happens: first, at the end of it, they are completely overlapping.
There is nothing that changes in terms of long thresholds. What happens is of course at
zero, which is during an utterance, the original approach didn't allow
this. It has zero percent of the utterances here. The new one has 10 percent. So we
have 10 percent of the turns where we're going to have much faster response than we had
And basically some of the high -- like the original blue distribution has two peaks: one
short and one long kind of. And what happened is some of this peak has been transferred
to zero latencies. So these cases where we're pretty sure the user is done, turns out some
of them we can actually decide before we detect a pause and some of them we can't.
But it still leads to 10 percent of the turns where we will reduce the latency significantly.
And I'm currently working on doing the live evaluation for this because
it's not clear that just putting a zero latency translates exactly to what's happening in the
real-life system. Yes.
>>: [inaudible] the model for trying to get -- to make decisions about, you know,
whether you grab or wait [inaudible] human-human [inaudible] so you have
human-human [inaudible] this interaction [inaudible] how well would that naturally
>> Antoine Raux: Right. So that's a good question. The problem with that is getting the
features in human-human calls. Things like semantic features, et cetera, I mean,
it's possible, but it would require heavy annotation basically of --
>>: [inaudible] it's the same task, right?
>> Antoine Raux: Right. Right. But I'm not doing any human annotation at all so far.
I'm purely relying on automatic features.
>>: So, yeah, it would require --
>> Antoine Raux: So, yeah.
>>: [inaudible] annotation.
>> Antoine Raux: Right. So it is an interesting point. I don't have the data labeled for it.
And the other thing with that kind of evaluation, it's interesting to see. Now, if you don't
match exactly the observed behavior, it doesn't necessarily mean you're completely wrong
either, because there are different acceptable behaviors. But it's definitely more
>>: [inaudible] the system actually barges in and you measure that as zero for the
>> Antoine Raux: Well, I wouldn't call that barge-in, but for the red curve, yes.
>>: [inaudible]
>> Antoine Raux: I count that as a zero threshold.
>>: Zero. And so why is it not at this [inaudible] zero then? So you get --
>> Antoine Raux: Yeah. Well, these curves are smoothed, actually, so there is a
discontinuity. That's a good point. Yeah. So there's zero, and the first available point
after that is 150 milliseconds, and in this case the minimal threshold is 150. But the points
are actually every 150 milliseconds in this case. But you can't get anything between the
two. That's a good point.
Okay. So that was -- well, I've talked enough about endpointing for today, I think, so
now I'm going to show how this can be applied to different problems. This is not
completed yet; it's something I'm working on right now. So I don't have final results. But
it's interesting because it shows how the model can be applied to different problems from
the same starting model.
So barge-in detection, this time we're in the case where the system has the floor and then
we come to this point of uncertainty where the user might be trying to barge in. So our
voice activity detector tells us there's speech detected [inaudible]. So we -- it might be
here but it might be some noise and it might be the user just back-channelling on the
system, not trying to grab the floor, but just providing some simple feedback without
trying to interrupt the system.
And so the kind of actions -- there are two actions that can be taken again, but this time
it's keep the floor or yield the floor. And if you're still in a system state, you want to
keep; if you're in the both state, when the user actually did start claiming the floor, you
want to yield.
And so you have a very similar structure as before, actually, cost structure as before with
different constants. And you can have the same thing, that the cost of staying in the both
state is going to be equal to the duration that we've been in that state. So the longer you
stay, the higher it's going to cost. And the cost of yielding when you shouldn't, meaning
that the system interrupts itself when the user wasn't even speaking or wasn't trying to
interrupt, is set as a constant.
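A minimal sketch of this keep-or-yield comparison follows. It assumes a fixed belief `p_both` that the user is really claiming the floor; in the full model that belief would presumably be updated over time. All names and example values are hypothetical.

```python
def barge_in_action(p_both, seconds_since_speech_detected, false_yield_cost):
    """Choose between keeping and yielding the floor after speech is detected.

    Keeping while both parties claim the floor costs the time spent in that
    state; yielding when the user was not barging in costs a constant.
    """
    keep_cost = p_both * seconds_since_speech_detected
    yield_cost = (1.0 - p_both) * false_yield_cost
    return "yield" if keep_cost > yield_cost else "keep"
```

With `p_both = 0.6` and a false-yield cost of 1.0, the sketch keeps the floor after 0.1 seconds of detected speech and yields after 1.0 second.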
And the interesting point here, the difference with before, I mean, the rest would be very
similar equations, obviously. The difference with before is we can use prior information
here. In addition to the standard [inaudible] regression approach or any other probability
estimation [inaudible] you can introduce priors which we didn't have before.
In this case you can look at previous data or corpus recorded data where you have the
prompts, system prompts, and you can measure when the users barge in. You can
compute the distribution of the times where users barge in for specific prompts. Because
barge-in is going to be triggered by certain things in the prompt, the content or the
So we can look at that actual data and use that as prior information on whether at this particular
time the user is likely to actually barge in. Because random noises are not going to
follow a specific distribution. They're going to come randomly with a flat distribution.
So this can account both for discriminating between real speech and noises and, as I just
said, it can also account for [inaudible] back channel versus barge-in, because it's likely
that certain points in the prompt are more likely to trigger back channel from the user.
At others the user might be more likely to try to barge in.
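One way such a prior could be combined with a flat noise distribution is a simple Bayes rule over the position in the prompt. This is an illustrative assumption, not the method presented in the talk: positions are taken to be normalized to the range 0 to 1 over the prompt, the histogram binning is arbitrary, and `prior_real` (the overall fraction of detections that are real barge-ins) is a made-up parameter.

```python
def barge_in_density(positions, n_bins=10):
    """Histogram density of observed barge-in positions, each in [0, 1]."""
    counts = [0] * n_bins
    for x in positions:
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    total = len(positions)
    # Scale so the density integrates to 1 over [0, 1].
    return [c * n_bins / total for c in counts]


def prob_real_barge_in(x, density, prior_real):
    """Posterior that speech detected at position x is a real barge-in.

    Noise is assumed to be uniformly distributed over the prompt (density 1).
    """
    n_bins = len(density)
    f_real = density[min(int(x * n_bins), n_bins - 1)]
    f_noise = 1.0
    numerator = prior_real * f_real
    return numerator / (numerator + (1.0 - prior_real) * f_noise)
```

With toy positions clustered at 0.05, 0.45, and 0.95, a detection at 0.45 gets a high posterior of being a real barge-in, while a detection at 0.25, where no barge-ins were observed, gets zero.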
And I did compute this on actual data, on a set of dialogues, of 6,000 dialogues,
about 127,000 prompts total. And just as an example from one prompt, [inaudible] "Going
to ..." -- this was normalized because, of course, this part varies every time, so I normalized the
varying part to give it the mean duration -- "... Is this correct?" So: "Going to the airport. Is this
correct?" And you can see how barge-in happens here. It happens first very, very much at
the beginning of the utterance, which means the users barging in here are not responding
to this utterance. They're responding to something that happened before, and they're
speaking -- it might be actually a cut-in where the user was not finished speaking and the
system started to respond, but the user continued their utterance so it resulted in a very
early barge-in. So that's one type of barge-in we get.
And we might want to consider it not as a barge-in, actually; that's an interesting
choice to make here. Another one here is after going to the airport, people typically
back-channel, give the response yes or no or -- particularly when it's yes, they will just
give you one single answer here. And in this case, because we're doing explicit
confirmation all the time, it's okay if that's taken as a barge-in because anyway we were
going to ask this question. But we can use this information to structure -- to help implicit
confirmation as well.
You can imagine an implicit confirmation that would be exactly like this except saying
going to the airport, where are you leaving from. You can change the second part
because of the way the prompt is worded, you can just change the second half and have
something that's an implicit confirmation rather than explicit.
And then you can use this, that if you get the barge-in here, you know that this is not
going to be -- you really should continue speaking the remainder of the prompt. Because
the user particularly could say yes, the user is just providing an answer to the
back-channel, to this part, and they still want to hear the rest of the question.
So you can use this to inform -- although the data contains explicit confirmation, you can
use that to inform the design for -- and the models for implicit confirmation. And the fact
that we get all these barge-ins and different user reaction with implicit confirmation is
one big reason why it's so hard to actually implement implicit confirmation and really
deploy system, because the behavior's so much harder to predict and you get -- you know,
in this case you would say going to the airport, where are you leaving from, and you get
the user saying yes and then you interrupt this prompt because you heard yes, and so you
don't want to do all this.
And so this is something that could help. It could be used plugged in very easily in the
framework I proposed because it's just a prior probability on the state given the time at
which it happens.
And so there are other extensions to this framework. So the idea of the whole thing is
that it's a general framework to think about turn-taking. It's not the specific solution of
this [inaudible] that are important, it's just a way to think about it as this transition, finite
state machine and plugging in decision theory on top of these states and actions.
The possible extensions include changing the topology of the finite state machine itself,
particularly if you have more than two speakers, and obviously you're going to have a
different structure. You can change cost functions. I explained during the talk these
functions are fairly simple, the one I described, and you can maybe do a better job.
The ideal thing would be to relate the cost functions to user experience. If you have
measures -- high-level measures of user -- of task success, I don't think is going to relate
necessarily highly with this, but user satisfaction might relate more. So if you can get
some correlation, some way of fixing the cost of -- deciding the cost based on that, that
would be a much better optimization criterion than the current one.
And you can also improve just the probability models you're using, again, using different
features, classification or regression models, and different types of priors. Yes.
>>: I agree with what you were saying with regards to the cost function; that is, if you
use user satisfaction. But user satisfaction is -- I think that relates a little bit more to
something that, as far as I can tell, you haven't really discussed function with respect to
[inaudible] which are decisions that you make over time. So so far what you've presented
seems like a greedy approach where all you're going to do is you're just trying to
maximize -- or you're minimizing the expected cost, right, for every particular way, but
not across the sequence of [inaudible]. Clearly if you grab too much, you're going to
upset the user and the cost is going to be greater. So have you thought about --
>> Antoine Raux: Right. So that's a very good point. So what you're saying kind of
leads towards reinforcement learning approaches. Actually modeling this as more like a
Markov decision process where all the actions are in a chain over all the different turns.
>>: Not necessarily. Depending on the optimization [inaudible], I mean, clearly what
you're not including so far at least are even just dependencies between the decisions that
you make.
>> Antoine Raux: Right. So, I mean, one way to answer this, one thing I've thought
about is -- I mean, you can't -- first, you can't -- well, you can use this machine also to
remediate when something went wrong. So, I mean, that's not exactly what you're
saying, but it's a first step. So, you know, when you did interrupt the user, when you did
a cut-in here -- sorry for going from here to here -- then you can use this to at least make
the right decision then, right, so repair the problem. So you can use this for repair.
And what I was thinking, that it's a good point that it's not embedded in the current --
current decision framework itself, but using -- when you observe these things, you can
change your cost structure and make it more costly to have a cut-in if you already had
cut-ins before. For example, I mean, a very simple --
>>: [inaudible] you are not trying to understand what user wants. And if [inaudible] user
[inaudible] your timing information is going to be such.
>> Antoine Raux: Right. Well, basically -- yeah.
>>: [inaudible] you research, your [inaudible] model an area that relates, is somehow
related to a dialogue [inaudible].
>> Antoine Raux: Right. I agree. But I think that comes from -- I don't think that
questions the overall approach, but it's more that you will need more complex cost
functions and maybe topologies as well.
But I think this will still capture a lot of the phenomena, even in a different task where,
for example, it's not necessarily always good to be as fast as possible, for example, which
is what this particular approach did.
>>: [inaudible]
>> Antoine Raux: So [inaudible] argument for the approach I've taken, which is very
local in terms of decisions, can come from the conversation analysis work, which -- well,
it can be questioned, but like in the Sacks and Schegloff original turn-taking
paper about human-human turn-taking, they actually made it explicit that turn-taking is a
local phenomenon that's independent of context.
Now, you can actually question that. But that -- I mean, there's actually a take on
turn-taking which, like a theoretical take on it that makes that assumption.
So I think that's why it at least works at all, making that assumption, right? Like you
couldn't do that optimizing the general flow of the dialogue like the actual prompt that
you're seeing, because that wouldn't make any sense. In turn-taking you can at least work
with that assumption first and then improve over that by introducing more dependency.
>>: That's why I think it would be really interesting to try to [inaudible] to look at
conversation analysis and see if they actually can apply.
>> Antoine Raux: Yes. I agree. I agree that would be interesting. I don't think I'll do
that in my thesis, but, no, that's very good. I wish someone did that.
Okay. So I've presented two principled approaches in different ways, to turn-taking.
First one is an optimization algorithm to get endpointing thresholds, and it showed --
basically it showed that [inaudible] features can help turn-taking, particularly semantics if
you have them. It might be that on some complex domains it's much harder to get
reliable semantics, and then you might want to rely more on other features. In this
particular domain, semantics did help a lot.
And second I proposed an approach based on the finite state turn-taking machine that
modeled turn-taking and captured most phenomena, at least in a dyadic conversation, in
a fairly standard state transition way. And because of that, we can use that as the basis of
what I call the dynamic decision process model, which is based on decision theory to
make decisions -- turn-taking decisions at every point based on the probability of what
state we're in and the cost of the different actions.
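The decision rule summarized here can be sketched generically: keep a probability over the possible states and pick the action with the lowest expected cost. The state names, action names, and cost values below are hypothetical placeholders for an endpointing decision, not the ones used in the thesis.

```python
def best_action(belief, costs):
    """Pick the action that minimizes expected cost.

    belief: dict mapping state name -> probability of being in that state.
    costs: dict mapping action name -> {state name: cost of taking that
           action when that state is the true one}; missing entries cost 0.
    Returns (chosen_action, expected_cost_per_action).
    """
    expected = {
        action: sum(belief[state] * per_state.get(state, 0.0) for state in belief)
        for action, per_state in costs.items()
    }
    return min(expected, key=expected.get), expected


# Hypothetical endpointing example: has the user finished, or only paused?
belief = {"user_finished": 0.8, "user_pausing": 0.2}
costs = {
    "grab": {"user_pausing": 1.0},   # cut-in: grabbing the floor too early
    "wait": {"user_finished": 0.5},  # latency: waiting when the turn is over
}
action, expected = best_action(belief, costs)
```

Changing the topology of the state machine or the cost functions only changes the two dictionaries; the decision rule itself stays the same.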
And I showed examples of two applications for endpointing and barge-in detection. I have
actually two slides on how that could all fit or I could all fit in MSR.
So first let me talk about my general research goals. I mean, this was specifically about
my thesis work, but in general what I want to do is improve human-machine interaction.
Human-machine interaction is really where I want to be, and improving it by designing
models that can learn and can learn either from -- in both cases from data, but either from
previously collected interaction data that you have or, even better, through new
interaction, or both.
I mean, you can start from something collected, like I did in this particular work, and you
can -- again, because this is unsupervised, you can actually let it run and tune it. Because
the user behavior might change once you change the behavior of the system, and so
having the thing stay unsupervised actually helps it do online learning, continuous
learning. I didn't [inaudible] the online version of this, but there's no reason it wouldn't
And also the other aspect I'm really interested in is leveraging high-level task
information, whatever the task may be, to optimize core technology. It's like being at the
interface between the core technology, like speech recognition, or in other domains,
information retrieval, et cetera, and an actual task you're trying to do with it, like dialogue
in the case of speech, or other, like Web search in the case of information retrieval.
So in terms of domain [inaudible] application to relate to the reason I'm here these two
days, first for situated interaction work that's done in [inaudible] group.
Well, turn-taking itself is I believe crucial to realistic dialogue systems, so the kind of
systems that they are working on in that group, something that you want to be much more
aware of the environment and much more -- like lead to much smoother interaction with
the users, and natural interaction.
I think that requires to have a good turn-taking model. Whether it's exactly this one or
not, that's a separate question, but I think it does need it. And it's kind of a natural
extension to the finite state -- I mean, the whole approach to multimodal, potentially
multiparty situations. So you can grow -- build on top of this, grow the state machine
and change the cost structure, et cetera.
For Web search it's not -- it's not directly an extension, but it still fits within my --
definitely within my general research interests in the sense that -- well, presumably with
Kuansan, the idea is to take interactive search as a dialogue, frame it as a dialogue
between the user and the machine. And that -- given that, it's very interesting to me to
explore it -- again, to relate these low level and high level, to exploit information on
search behavior you have through the interaction to inform the core technology
information retrieval [inaudible].
And so I think there are two ways; both of them would be very interesting to me actually
to pursue within the context of MSR, at least in the near future. Okay. Sorry. I didn't
have a thank you slide. But thank you for your attention.
>>: So do you think that for [inaudible] taking [inaudible]?
>> Antoine Raux: I think it's different. I think it depends also what kind of multimodal
interaction, multiapplication. If it's -- so multimodal probably is not the big factor. If it's
multimodal but not at all humanlike, like multimodal because you have the map and you
can speak to it, then it depends if you have to construct a new model of what it is to take
turns and what it is -- what is the interaction. So I think it's important.
>>: [inaudible]
>> Antoine Raux: Yes.
>>: The other way [inaudible] the multimedia.
>> Antoine Raux: Right. Well --
>>: In that case you're coming [inaudible] between the system and the users, it's much
more [inaudible].
>> Antoine Raux: Right. So basically you have something where you might have
several floors. I mean, there are different ways of modeling of that, right? Instead of
having a -- if you're using just speech, you have a single channel as -- that's why the floor
is there, because you can't easily share that one.
>>: [inaudible]
>> Antoine Raux: Right.
>>: But if you have a very high bandwidth [inaudible] between user and system?
>> Antoine Raux: Right. So I think it leads to a different -- a different model of what the
floor would mean. I don't think it removes it completely, because it's not infinite, right?
At least because the user is never going to be able to process an infinite amount of
information presented to them. So you still have constraints on it. They're not the same
as just a speech-only conversation kind of thing.
>>: [inaudible] is generic to all human-machine interaction.
>> Antoine Raux: Right.
>>: My question is the way you approach turn-taking in your research you have focused
a lot on the [inaudible] information [inaudible] speech. The question I was asking is the
multimedia environment, how you [inaudible] is that particular to speech interaction or as
important in multimedia communication?
>> Antoine Raux: I don't think it's just because of speech. It's potentially as important,
depending on what multimedia presentation you do.
>>: [inaudible]
>> Antoine Raux: Right. But how you combine the two. So if you're talking about like
static Web page that has different elements to it, for example, then you don't have -- like
within this context, turn-taking, if it's not directly timed as in millisecond problems like
I'm talking here, it's at least about the sequence of things, right.
So but as soon as you have things happening in sequence, like basically you do have even
in any application, you do have the system and the user taking turns in some fashion. Not
necessarily strictly one at a time -- it's slightly more interesting when it's not -- but you
still have turn-taking happening here because of the transition. There's still a temporal
transition, is what I mean. There's a cycle. And so that's still -- even if it's not like
optimizing the thresholds like I did in the first part, it's not going to apply directly,
obviously, because you don't have this specific problem.
Now, the second approach is more generic and can actually lead more, because you can
have these transitions, these state machines [inaudible].
>>: My question is that in the case -- let's say in your Web search, this latency or
duration of this issue, how important is that? I mean, if it's [inaudible] obviously
>> Antoine Raux: Right. Well, that's a good question, actually, how important is that. I
think it's not working at the same scale. Now, is it not important at all or ->>: Well, maybe [inaudible] my dialogue [inaudible] ->> Antoine Raux: Sure.
>>: -- how long does it take.
>> Antoine Raux: Right, right. Now, how does the way it does turn-taking influence that
aspect -- that is the challenge, I guess.
>> Kuansan Wang: We have time for one more question.
>>: Just your thoughts on if you were to work on Web search as a dialogue, how would
you approach this problem?
>> Antoine Raux: That's a good question. Well, so briefly, I mean, I just had a
discussion with Kuansan about these kind of issues. But so the first thing, given what we
know about dialogue, like human dialogue would be to structure the problem. Because if
we want to approach it as a dialogue, I think that introduces some kind of structure, so
you can be in a state, get some input, make a decision and move to a different state.
You need to define what these states are.
And the general -- the plain old information retrieval paradigm doesn't have -- it's very
specifically unstructured, actually. And so you need to find some -- to define some
structure either by clustering documents or -- I mean, there are already some approaches
to that, and then moving around. But the -- there might be other ways to do it, actually,
better ways to do it maybe.
>>: [inaudible] it's like you said [inaudible] barging you will be able to detect when you
are actually required, needed to get some information [inaudible] more efficient, just
don't have to try to prevent [inaudible].
>> Antoine Raux: That's an interesting -- that's an interesting point. Like, can we have
the system to be more productive by being more aware of what's happening in terms of
floor even if the floor means something else in here.
>>: [inaudible] the case search [inaudible].
>> Antoine Raux: Right. For example.
>>: [inaudible] by the time scale there is actually more -- much smaller than the speech.
>> Antoine Raux: Right. It's --
>>: As you type, every character you type is trying to reformulate your suggestion, and so
the latency -- and all the tiny information.
>> Antoine Raux: It's [inaudible] different scales, yes, yes. [inaudible].
>>: So you don't need to respond when you don't have enough information [inaudible].
[multiple people speaking at once]
>> Antoine Raux: Thank you.
>> Kuansan Wang: And thank you everybody for coming.