>> Kuansan Wang: Okay. I think we'll get started. My name is Kuansan Wang. Today I have the honor to host my former intern, Antoine, to interview with us. So Antoine interned with me probably two years ago, and he did such a good job that I couldn't resist bringing him back to interview now that he's graduating. So today he's going to tell us what he has been doing [inaudible] thesis, so without further ado -- >> Antoine Raux: Hi. Is this working? Yes? Okay. So good morning, everyone. As Kuansan said, I'm Antoine Raux. I'm finishing my Ph.D. at Carnegie Mellon, and I'll be talking about my thesis topic, which is turn-taking -- conversational turn-taking -- as a dynamic decision process. My general field of research is spoken dialogue systems. So -- well, let's jump right into it. What do current dialogue systems do? There's been lots of research on dialogue systems for the past two decades at least now, and lots of it has gone into making them more robust, able to ground information, able to deal with more complex structured information, et cetera. So that leads us to something like this. This is something that could be called a state-of-the-art dialogue system, I guess. [audio playing] >> Antoine Raux: So of course that continues with the system providing results. By many accounts, this is a very successful dialogue. I mean, the person is getting their information without any problems. There are just two confirmations to make sure the information is right, but that's not really a big deal. However, I would say this is still far from where we want to be. As you've understood, this is a system that provides bus schedules for people. Now let's listen to a recording of a very similar conversation with a human. That's an actual recording from a call to the customer service of the bus company in Pittsburgh. Let's see what that sounds like. [audio playing] >> Antoine Raux: Okay. So it's still -- sorry about that. There's still some difference here. We're not quite there yet, as you can see. And it's not a matter of speech recognition accuracy or even speech recognition speed, because that's not where the problem lies in this particular example. There are dialogues with problems in speech recognition, to be sure, but we do now have lots of dialogues that go well as far as understanding goes. Now, what are the differences between these two dialogues? One was a specific difference in prompt design, where the prompts or the questions from the human operator tend to be much shorter and much more efficient than the ones on the system side. That's not a very hard problem to solve at first sight, right? Just design your prompts to be short and you're good. Now, the problem with this is that if you're using recorded prompts, that's probably fine. If you're using synthesis, synthesizing very short utterances like "51C", you need to convey the meaning right; you need to use prosody in the right way. And humans are very good at that. That's why the operator is able to do it in this call. It's not that trivial to produce in synthesis -- it's actually not really solved yet. That's one of the big challenges of speech synthesis, having conversational prosody. The human is using it both in the confirmation here, and also when he provides results: he's pretty good at emphasizing the important bits of information in his utterance so that the user gets it.
And that allows -- by the way, sorry, that allows the human to speak very fast most of the time, much faster than the speech synthesizer does, because he can emphasize the right bits and the rest can be kind of blurry and that's okay. Another big difference is turn-taking -- specifically here, the fact that the human is much faster to respond to the caller than the system is. That didn't happen all the time; it's not at every point. Like, before saying 51C, there was a significant delay, and that's not a problem. However, at certain points, like after the user confirmed yes, the operator was very fast to respond, "let me get that for you," and then there was a long delay to actually get the results. So the pace of conversation is very different between the two, and it's much more variable and flexible in the human case, and there are good reasons for that; and that's something our systems are not able to do yet. And finally, something that maybe is not as obvious, but there was some incremental processing going on on the human side here. You could hear, as the caller was asking the question, the operator actually going through paper and starting to get the information about the 51C, and then confirming that information later on. That's something systems usually don't do: starting to process things as the user is speaking to the system. And that's particularly relevant when you get systems that take more and more natural language input, with longer input from the user. As long as you get yes-no answers or one-word answers, it's not really that relevant. But when you move towards more natural language, which is what current systems are doing, it makes more sense to start addressing issues like incremental processing. Now, these are many different problems that are very hard to solve, so we're not going to be there tomorrow. In the meantime, let's see what a more reasonable, I'd say, short-term goal looks like. So this is -- I'm just playing back the original. [audio playing] >> Antoine Raux: Now, let's see. What I did then is I edited this audio to make it more like something we would like to have -- still a system, but maybe the next generation or something. [audio playing] >> Antoine Raux: Okay. So two modifications happened here. Remember these four things I was looking at; in this specific example I addressed prompt design by shortening the prompt, which again is not the hardest thing to do of all these [inaudible], probably the easiest thing to do. And then I also shortened the latency in the right places. And if we could get a system that has this behavior, it would be a first step towards completely humanlike interaction. It's not there yet, but it's part of it -- to achieve the whole humanlike interaction, you need at least all four of these and maybe more. So as a first step, in this particular talk I'm proposing to address the turn-taking problem. The prompt design, I'll leave that as a separate issue. Yes. >>: [inaudible] >> Antoine Raux: I'm actually going to talk about that very specifically. But the -- yeah. It's basically the fact that you don't want to interrupt the user in the middle of their turn, and you don't know whether your user is going to pause in the middle of their turn or at the end of their turn. Right?
So that's -- using a fixed threshold for endpointing is what triggers longer latencies. It's not a matter of computation power. And that's exactly what I'm going to talk about now. So, the current approaches: what do systems generally do now in terms of turn-taking? Typically there's no explicit model of turn-taking in these systems. It's addressed more from an engineering point of view, through an ad hoc combination of low-level tasks. I mean, it has to be dealt with, because otherwise you can't have a dialogue, but usually it's just a combination of [inaudible] detection, which is the minimum thing you need to be able to do to have a dialogue, and [inaudible] barging [phonetic] detection and handling, if the system allows barge-in. But there's no general framework. And the problem with not having a general framework to model turn-taking is that, first, it makes it hard to optimize -- it's not even clear what it means to optimize. And, second, it's not well integrated in the overall dialogue model. That's maybe more of a theoretical problem, but it prevents the lower levels like turn-taking from informing the higher levels like grounding and dialogue processing, and the other way around. It separates the two, so that turn-taking cannot be well informed by higher-level information. I'm going to propose two approaches to address these specific issues in this talk. So the [inaudible] of the talk is about these two pieces of work that I have done during my thesis. The first one is about optimizing endpointing thresholds, still within the very standard framework but changing how we set the threshold, and the second one is a new model for turn-taking itself, a more generic model that's going to encompass different turn-taking phenomena. And I'll give two examples of applying this model to specific problems like endpointing and interruption detection. So how is turn endpointing done in general today? It's usually a combination of two levels of processing. First there's a voice activity detector that's just going to discriminate between speech and silence in the incoming audio. Something like this. Oh, sorry. Let me explain this example first. The system says: What can I do for you. The user responds: I'd like to go to the airport, with a pause after "to". And so your voice activity detector tells you that there is speech until after "to", when it detects silence, and typically the system uses a threshold here: if the silence gets longer than 700 milliseconds, it [inaudible] endpoints. In this case, because the user starts speaking again before the threshold, nothing happens and the utterance continues until the next silence is detected; the same threshold is set, and this time the user doesn't start again, so it's endpointed. That's the very standard, very simple approach that's used in most systems. There are two different issues with endpointing in general, I would say. The first one is [inaudible], which is when the threshold is shorter than an internal silence: the system is going to interrupt the user in the middle of their utterance, which is in general not a desirable behavior. Or if you want to do that, you don't want it to be triggered by silences; you want the system to be aware that it is going to interrupt the user. And, in general, that's not something you want. The second problem is latency. That's exactly what I was talking about a minute ago.
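As a minimal sketch of the standard fixed-threshold endpointer just described -- the 10 ms frame step, the 700 ms threshold, and the function names are illustrative assumptions, not the actual Let's Go! code:

```python
# Sketch of the standard fixed-threshold endpointer: a voice activity detector
# labels frames as speech or silence, and the turn is endpointed once silence
# outlasts a fixed threshold.  Frame step and threshold are assumed values.

FRAME_STEP = 0.010          # seconds per VAD frame (assumed)
THRESHOLD = 0.700           # fixed endpointing threshold in seconds

def endpoint(vad_frames):
    """Return the time at which the turn is endpointed, or None."""
    silence = 0.0
    t = 0.0
    in_turn = False
    for is_speech in vad_frames:
        t += FRAME_STEP
        if is_speech:
            in_turn = True
            silence = 0.0        # user resumed speaking: reset the silence timer
        elif in_turn:
            silence += FRAME_STEP
            if silence >= THRESHOLD:
                return t         # silence outlasted the threshold: endpoint
    return None
```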
Because you're using this fixed threshold, at the end of every turn you're going to have at least the duration of the threshold as latency before you can respond. And in addition to that, depending on how the system works, you might have additional processing after that. But at the very least you get the threshold as latency. And so you have a tradeoff here: either you set long thresholds, which will -- I'm sorry. I'll turn that off. Okay. Which will trigger few cut-ins, because you will have few pauses longer than that threshold, but will produce long latency at the end of every turn; or you set short thresholds, and then you'll have many cut-ins and short latencies. Yes. >>: I would have thought that in [inaudible] case you wouldn't see other things, like [inaudible] that when I've got that kind of almost filled-pause case, I would have thought the endpoint of the previous, of the first phrase would be really different from the case where I'm saying it's over, your turn to talk. And it isn't just the silence. >> Antoine Raux: I totally agree with you. That's a very good point. It's exactly what I will discuss in just the next few slides. The thing is, most current systems basically ignore all that information, because all the input we have is voice activity. >>: [inaudible] distortion [inaudible] you have to correct for, not something you can actually use as a signal. >> Antoine Raux: Right. Basically, yeah, that's -- I mean, there are many, many different features. I'll explain those in a minute. So there is information; it's just not used. Sorry. Let me just try and turn off -- don't think that's going to help. Right. So what we would want, because we have this tradeoff, is something that tells us to set different thresholds for different silences, right? We want to set a long threshold when it's an internal silence, to be sure we don't produce cut-ins, and we want to set a short threshold at the end of the utterance, using all the kinds of information we can get. So specifically, what kind of information do we have at our disposal in a spoken dialogue system to inform the threshold setting? First you have discourse: you know which dialogue state you're in. In the example I will use, which is based on the Let's Go! bus information system, I have a very simple abstraction of dialogue state into three different states: is it an open question, like "what can I do for you"; is it a closed question, like "where do you want to go"; or is it a confirmation question, more of a yes/no type of question. That kind of information helps you set a threshold. Another very important source of information is semantics. In particular, I'm talking about using the partial recognition hypothesis that you get at the beginning of the silence. When you detect your silence, your recognizer can tell you what it has recognized so far, and you can use that to inform, again, your threshold setting, because that's going to tell you whether the user is likely to be done or not. You can use [inaudible], exactly like we were just saying, which includes intonation, duration, vowel lengthening, et cetera -- all those things that are very well known in human-human dialogue to actually affect the perception of turn-taking. You can use timing: very simple features like how long ago this utterance started are likely to be some source of information on whether it's going to finish soon or not.
And we can also use information about the particular speaker we're dealing with. That can be because you have access to many different sessions with the same speaker -- in my case that wasn't the case, so I'm using information from this particular dialogue. You can use information from the first few turns to inform your behavior in the later turns of the dialogue: information like how many pauses does this particular user make, how long are they, et cetera. And so we have this wide, large set of features that we could use to set the threshold. What I did -- I'm not going to describe my algorithm in detail here -- is design an algorithm that builds a decision tree based on these features by asking binary questions about them, and the leaves of the decision tree are thresholds. So it's fairly simple. In just two words: it first clusters the pauses that are in the training set. Of course it's based on a training set of collected dialogues where each pause is annotated as internal or final, and this annotation is done automatically [inaudible]; it's completely unsupervised in that way. So it clusters the pauses based on the features, and in the second step you set the optimal threshold in each cluster. That's how you build the decision tree. This is published in my SIGDial paper this year. So there is no point in going through the tree in detail, but what I want to show here is that it does make use of many different features; the different colors represent different feature sets. The first one here is timing, pause-start time; the green ones are all about semantics, using the partial recognition hypothesis; the orange one is the dialogue state; and the other red ones here are about the behavior of the user in this particular dialogue. Yes. >>: Are you going to tell us how you unsupervise [inaudible]? >> Antoine Raux: It wasn't included in this particular talk; because of time constraints, I didn't include it here. I can deal with that maybe after the talk, in the questions. >>: The key point is unsupervised. >> Antoine Raux: It is unsupervised, yes. Yes. >>: Can I just ask one question [inaudible] so how do you get the training data? Is it a system that just waits? >> Antoine Raux: Wait. Okay. Right. A very good point. In this case, it was a system that used a fixed threshold of 700 milliseconds -- reasonable, not very, very long. And so that introduces -- that's where the unsupervised learning is risky, because your system at runtime is going to make errors, right? And so what I used is a heuristic that after the fact can tell me whether a particular endpointing decision was right or wrong. And there are hints -- I mean, it's not a perfect algorithm, but there are hints that tell you: if the user starts speaking again right after [inaudible], it's probably because they weren't done speaking, because they're not responding to anything -- the system hasn't spoken yet. But if you hear them speaking right after your endpoint, that probably means that that decision was wrong. I'm using these kinds of heuristics to relabel, or correct, the annotation of the training set. And based on that, I retrain it. >>: But you cannot [inaudible]? >> Antoine Raux: Well, you can know it's an error, for one.
And you can have some hints, because if the user did start speaking again, you can use that start time to measure a pause duration: even if the system at runtime made the decision to endpoint, you still have the last time the user spoke and the next time the user speaks, and you can use that as a pause duration that's longer than the threshold used at runtime. >>: So is there a question that you ask [inaudible] come up with this new system and the question you're asking is can I just deploy this new system, how well is [inaudible]? >> Antoine Raux: Right. Well, I'm going to explain -- I have two different evaluations. I'm going to explain that with the next slide. Yes. >>: I'm wondering why unsupervised methods are off the table. So I'm thinking if I have a large amount of recorded dialogue from, say, two speakers that are easy to separate, say a man and a woman, then I can tell after the fact who was speaking; when I look at all the pauses in that arbitrary speech, at any point during a pause I could say, can you guess whether, you know, speaker A or speaker B is going to speak next. And I would think you could get arbitrary amounts of data [inaudible]. >> Antoine Raux: Oh, you mean by detecting the switch between the speakers? >>: [inaudible] I could do speaker identification or separation. >> Antoine Raux: Right. >>: And if I picked dialogue from which the separation was easy. >> Antoine Raux: Right. That is true. The thing is that that would be based on human-human dialogue. So that's true, but then there are differences. >>: [inaudible] I think I could do a very good job [inaudible]. >> Antoine Raux: Right. >>: [inaudible] two speakers is going to speak next. >> Antoine Raux: Right. I see the point. The approach I was taking was basically starting from a system and improving this particular system over time. So I was trying to run directly on this system. And so [inaudible] evaluation: how did that perform? This is performance, by cross-validation, on the actual training set that I collected in the way I just explained. The double line here is the baseline using a fixed threshold, and what this represents is a kind of ROC curve, if you want: the tradeoff between your latency and the cut-in rate. So, of course, if you have high latencies, you're going to get small cut-in rates, and if you have low latencies, you're going to have high cut-in rates. And you'll have to trust me on that, but reasonable cut-in rates for a workable system tend to be in the range of 2 to 5 percent, I would say. And that corresponds, if you look at the baseline, to the fairly standard thresholds that are used, usually between 500 milliseconds and 1 second. And so what you can see from this graph -- the different lines represent using different subsets of the features to perform the training, and the blue line on the bottom is using all the feature sets that I described before. So the first thing is that, well, to some extent it works, because we did get a significant improvement: about a 22 percent reduction of latency at a given cut-in rate, if you take 4 percent here; or you can look at it the other way, keep the latency constant, and reduce your cut-in rate by 38 percent.
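To make the learning step behind these curves concrete: each leaf of the tree ends up holding one threshold, chosen for the cluster of pauses routed to it. A minimal sketch of that per-cluster choice, under an assumed objective (the smallest threshold whose cut-in rate stays below a target) that may differ from the exact criterion in the SIGDial paper:

```python
# Sketch of the second step described above: picking a threshold per cluster
# of pauses.  Each pause is (duration_seconds, is_final).  The objective here
# is an illustrative choice, not necessarily the paper's exact criterion.

def cutin_rate(pauses, threshold):
    internal = [d for d, is_final in pauses if not is_final]
    if not internal:
        return 0.0
    # an internal pause that outlasts the threshold gets endpointed -> cut-in
    return sum(d >= threshold for d in internal) / len(internal)

def pick_threshold(pauses, target_cutin=0.03, candidates=None):
    if candidates is None:
        candidates = [0.2 + 0.05 * i for i in range(30)]   # 0.2 s .. 1.65 s
    for theta in sorted(candidates):                        # shortest first
        if cutin_rate(pauses, theta) <= target_cutin:
            return theta                                    # minimal latency
    return max(candidates)

# At run time, a detected silence is routed to a leaf using the discourse,
# semantic, and timing features available when the silence starts, and the
# leaf's stored threshold is applied.
```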
The other thing this shows, when you look at the different feature sets, is that semantics is by far the most useful feature type. It's not that the other ones don't work at all -- I mean, they each bring about half of the overall gain -- but once you add semantics into the mix, like the partial [inaudible] results -- given also the structure of this particular dialogue, which is fairly constrained, et cetera, of course -- but given all that, semantics really brings you most of the gain here. >>: What do you mean by discourse? >> Antoine Raux: Discourse is mostly dialogue-state-level features. And, yeah. >>: [inaudible] >> Antoine Raux: Yeah. I mean, it's kind of blurry -- I'm not making strong statements with the names of these sets. Because semantics also contains the understanding of the particular partial recognition hypotheses given the current state; it's semantics within the current state. So if you ask a yes/no question and you get a yes, that's a high semantic score; if you ask a yes/no question and you get a time, that's a low semantic score. >>: [inaudible] >> Antoine Raux: It has some [inaudible]. >>: Just to clarify, with respect to the decision tree that you had, that decision tree is giving you a threshold on how long to wait? >> Antoine Raux: To wait. Yeah, in a pause. >>: In a pause, right? And with respect to the cut-in rate, the cut-in can be -- what I'm trying to figure out is whether it's that bad to cut in. I mean, sometimes, like, for instance, you have some semantic information, like some [inaudible] information, and the user's still kind of trying to figure out, well, where am I going to, it's okay to come in and say, okay, I heard that you want to go from here to wherever, now, where do you want to go -- >> Antoine Raux: Yes. I see your point. It's true that cut-ins might have different costs in different circumstances. I totally agree it's different. It's hard to actually get hold of that cost. I mean, definitely if you have some information already, that helps; but it might still be very confusing for the user, because the user starts speaking again and then barges in on the system, and then you can get into actual turn-taking conflicts in the system, and that can be very confusing to the user. So -- >>: So what is it currently doing right now when it comes in? Is it just implicitly confirming anything that it knows or -- >> Antoine Raux: It's trying to explicitly confirm what it knows, or if there's nothing, it's sending a nonunderstanding prompt, a repair prompt. Yes. >>: I'm wondering, what are these durations in natural turns, you know, among human-human [inaudible]? >> Antoine Raux: That actually depends a lot. Unfortunately I don't have that right here, but I did a study on the corpus of human-human dialogues that the example I played came from. And what happens is you have really wide variability in the human-human case. Some gaps are actually one second or more, and they're not necessarily strange or uncomfortable, but some are really, really, really short, and they have to be. So it really depends a lot on the context, on the type of dialogue you're having. So it's hard to give -- >>: [inaudible] >> Antoine Raux: Oh, yes. It's zero. It can even overlap, actually. It's virtually zero. >>: Like Tim is saying, in fact, in our conversation, at least depending on the different [inaudible]. >> Antoine Raux: Right. The fact that it's zero doesn't mean that you're cutting in on the user.
You might just be very, very confident that they're finished saying whatever they're saying. And you're just very, very close to their turn, but you don't really interrupt them in the sense that they were intending to say more after it. >>: So it turns out that sociolinguists and psychologists [inaudible] different pauses, pause lengths, and found that certain [inaudible] like you mentioned before with respect to prosody, if someone says an "um" the pause is much longer than if they say an "uh", and users don't interrupt because they understand that "um" to mean that they're thinking. >>: That's probably what phrase [inaudible]. >>: Probably don't even have to wait for the final [inaudible]. >> Antoine Raux: So yeah. The human-human case is actually much more complex in general. And also I think the complexity is maybe arguable. But definitely -- >>: If we looked at the short end of this, I'll bet 300 milliseconds is [inaudible]. >> Antoine Raux: In terms -- well, it depends. I mean, in terms of being humanlike, yes, I agree with you. The cases where it's slow and it's okay to be slow are fine. But in terms of the short end, yes, I agree, we're not there yet. Definitely. Even with the improved [inaudible]. It's going there, but it's not there yet. >>: But is it your goal to fully reproduce a human-human conversation experience? >> Antoine Raux: It's one goal you can think of. The other one is to relate these metrics to user feedback about the interaction they have with the system -- so basically to user satisfaction. Unfortunately I don't have that here. It's a very hard thing to get, because it's hard to get feedback on these kinds of subtle things directly from the user. And in my particular case, I'm using a system that's used by the general public, and we don't have access to the users after they've used the system. >>: There's nothing [inaudible] to let the user know that they are interacting with a machine, right? >> Antoine Raux: Yes. On the other hand, something I forgot to mention: for the two machine examples I played at the very beginning, it's not only about interaction. If you just look at dialogue duration, when it gets to very short dialogues like these ones, the difference between the two was like 20, 25 percent, just by shortening the prompts and the latencies. And so that can also mean something economically speaking, if you have a system deployed out there and you want short dialogues. Gaining here and there can actually add up in the end. But I think it's mostly about how the interaction flows. >>: The only problem I have with this [inaudible] timeout [inaudible] it has to make decisions without knowing those -- what [inaudible]. >> Antoine Raux: Well, so -- >>: [inaudible] >> Antoine Raux: Right. So actually there's a good plug here. Because I [inaudible] on that system that I played, that I showed -- I'm not discussing the whole architecture here, but it's a research dialogue system; it has a fairly complex architecture. In particular, I designed it so it has a central component that's in charge of low-level interaction. So the speech recognizer just feeds in information about voice activity detection, partial hypotheses, et cetera. The decision to endpoint is actually made by an external module that also has access to high-level information, et cetera. And that might add a little bit of delay, actually, but it's very small compared to all the other factors here.
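A rough sketch of that division of labor: the recognizer only streams low-level events, and a separate interaction manager, which also sees dialogue state, makes the endpointing decision. The class, method, and event names here are illustrative assumptions, not the actual Let's Go! interfaces.

```python
# Sketch of the architecture just described: the recognizer streams voice
# activity and partial hypotheses; a central interaction manager that also
# has the dialogue state decides when to endpoint.  Names are made up.

class InteractionManager:
    def __init__(self, threshold_policy, dialogue_state):
        self.policy = threshold_policy        # e.g. the learned decision tree
        self.state = dialogue_state           # open / closed / confirmation ...
        self.silence_start = None
        self.latest_partial = ""

    def on_vad(self, is_speech, now):
        if is_speech:
            self.silence_start = None         # user is speaking
        elif self.silence_start is None:
            self.silence_start = now          # a silence just started

    def on_partial_hypothesis(self, text):
        self.latest_partial = text            # updated as the user speaks

    def should_endpoint(self, now):
        if self.silence_start is None:
            return False
        threshold = self.policy(self.state, self.latest_partial)
        return (now - self.silence_start) >= threshold
```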
And that makes it more computationally expensive than a standard, commercial dialogue system; that's from the research perspective. So I did implement the tree, the specific tree that you've seen in the example I showed, [inaudible] in the Let's Go! interaction manager, which is that central component. What you might have noticed in that tree is that there was no prosody. For the live version, because [inaudible] features are very expensive to compute and not as readily accessible as the others, and didn't bring anything in the batch evaluation, I didn't include them. I picked a working point, because now I have these curves and to run the system I have to decide where on the curve I run. I picked the one at 3 percent cut-ins, which was a 635 millisecond average threshold in the batch evaluation. And I ran a study in May '08 where I set the public system to randomly use one of two versions, either a fixed threshold of [inaudible] or the decision tree. Now, let's look at the results. So first the average latency: the left side is the average latency overall, and then it's split by dialogue state. Note that this time this is real latency, not the threshold value. This is the actual time between the end of the user utterance and the beginning of the system utterance. It includes additional processing, which is why it's higher than the previous values. Our first kind of disappointment here is that there's no difference at this point. We'll see why in the next slide -- it's not a bad thing. But even then, the behavior induced by the proposed algorithm is very different from the control one. The control one, of course, uses a fixed threshold; there is very little difference in terms of latency between different states -- that's mostly random difference. However, in the proposed case, the algorithm basically learned that after an open question, which is typically where you get users saying long utterances with pauses, hesitations, et cetera, it should set a longer threshold, which ends up being longer than the baseline. For closed questions, it has an average behavior about the same as the baseline. For yes/no questions, which are much more predictable, where there are definitely fewer pauses because the user mostly responds yes or no -- not only, but mostly -- the system learns to shorten the latency. It's still not shortened enough, I would say, but the latency is definitely significantly shortened. Now, if we look at the cut-in rate, this is what explains why we had the same latency: we were actually moving kind of along a horizontal line here. We significantly reduced the cut-in rate overall, and this came mostly from the open request state, where, for the same reason as before, the system learned to set longer thresholds, because that's where most of the cut-ins happened. You can see that in the control case, 12 percent of the turns are cut in in that state. That's definitely a significant number of them. Even if they're not very costly cut-ins, some of them are definitely bound to be problematic. Yes. >>: So [inaudible] your results make me wonder what would have happened if you had as a baseline policy just don't cut in unless you have some kind of semantic information, which would reduce the open-question cut-ins. So, in other words, you know, just cut in only when you have some [inaudible] information, like, you know -- >> Antoine Raux: At a fixed threshold?
>>: Not necessarily a fixed threshold, just whenever -- yeah, maybe you have to have some thresholds, but, you know, without using a sophisticated decision tree, just basically only cutting in after a threshold when you have some information; otherwise just let the user continue. >> Antoine Raux: Right. The thing is, specifically in these turns, it's likely that the user would say more than one piece of information, so you would still end up cutting in. Now, if you decide that these cut-ins don't matter, they might be okay, but there's still uncertainty there. In terms of raw number of cut-ins, there would still be -- I don't have the numbers, so I can't completely answer, but I think there would still be some because of that: people would start saying something, and they would not be done, but there would already be some semantics. But the other thing you can imagine doing is to set a state-specific fixed threshold, right? You could just learn that. The interesting thing here was to learn it from the features. We couldn't know for sure beforehand which one would be more useful or less useful. I mean, we could definitely have guessed some intuitions, but this actually learned it from data and it did optimize it. >>: That was actually something I wanted to ask you. So I've actually worked on designing [inaudible] systems. So [inaudible] is there is, you know, you know that [inaudible] special and tune that separately, and so the first question is how much of this improvement comes from just the confirmations? And the second question is, the other baseline would be, like you said, just a static threshold set per dialogue state. How much are you improving over that baseline? >> Antoine Raux: Right. So I don't have the numbers. I did compute it, but unfortunately I've forgotten the numbers for the state-specific threshold. It was kind of halfway -- it was going there, but not completely, not as good as the proposed approach. In terms of confirmation, I do agree that, because Let's Go! has a lot of explicit confirmations, that's a big factor in the improvement. It's not all of it, because this improvement here is something that's not related to confirmation. But for the gain in latency, a lot of it comes from the confirmation questions, like we've seen on a previous slide. The interesting thing about this is that because it's a learning algorithm, if you have more complex systems, with different states and different things, you can still rerun it. And since additionally it's unsupervised, you can always rerun it if you add a different, nonstandard state to your dialogue. You can learn from interaction with the users. Now, the last thing I want to show on this live evaluation, which I couldn't see in the batch one at all, is the impact of the algorithm on speech recognition performance. Now, the problem here is that this data is not transcribed, so I don't have word error rate values directly, so I looked at the nonunderstanding rate, the proportion of rejects in the user's speech -- the proportion of utterances that have been rejected -- between the two conditions, and there is a significant reduction both overall and in the yes/no questions. Overall it's like 1 percent -- it's a small reduction, but it's statistically significant.
The biggest surprise to me was that the improvement was in the yes/no questions, while the reduction in cut-ins happened in the open-question case. Now, the reason this happens is the recognizer: when you have a very short word and you add a long silence to it, you're more likely to mess up your recognition, basically -- the impact of the silence, if you have background noise and background voices, is going to hurt recognition even of the actual word itself more. So that's the explanation I have for this reduction here. >>: Do you see an overall significant difference in task completion? >> Antoine Raux: No. No, no. It wasn't sensitive -- the tests are not sensitive enough for that kind of improvement. >>: [inaudible] you're saying that in this system [inaudible] get the answer to a question yes/no, it's a binary question [inaudible] that we're only getting, you know, 5 to 10 percent -- I mean, 95 to 90 percent right? >> Antoine Raux: No. That's not exactly [inaudible] -- >>: Or at least that we understand [inaudible]. >> Antoine Raux: Right. The thing is that some of these are not even speech in the first place. >>: [inaudible] question and the user didn't answer it. >> Antoine Raux: Right. There was some background noise, a baby crying on their lap, or some -- we have lots of data like that in our corpus, unfortunately. >>: Right. But is it actually set [inaudible] what's the fraction, if you ask a yes/no question, that you're actually getting a yes or a no as the answer? >> Antoine Raux: I don't know how that -- maybe -- I would say it's 80 percent -- >>: Really. So you're saying 20 percent of the time you don't, and there's a third choice, that they didn't -- >> Antoine Raux: Either they say something else or -- >>: No, that I understand, they would say something else. But [inaudible] would be interpreted as either a yes or a no. >> Antoine Raux: Oh. Oh. That's not what I said. And I don't have that specific number. But in this case I'm not saying it was interpreted as a yes or a no; it was just something that happened after a yes/no question. >>: Right. >>: And you're saying if a baby cries, you'll get the wrong answer? >> Antoine Raux: No. You'll get something that might be misread as speech, but in this case it would lead to a nonunderstanding -- the system would respond "I didn't get that," which is not as bad as misinterpreting it as a yes, for example. >>: But, still, 10 percent is pretty [inaudible]. >> Antoine Raux: That's tough data. >>: And when you say speech, do you include [inaudible] speech like uh-huh, huh-uh [inaudible]? >> Antoine Raux: Yeah. Well, what I mean here is the question that the system asked; what comes from the user can be -- >>: It can be speech [inaudible]. >> Antoine Raux: Yeah, right. Right. >>: [inaudible] >> Antoine Raux: Not necessarily. >>: [inaudible] >> Antoine Raux: Yes. Because here I don't have any label of what is actually said. I don't know what was actually said. So they are included, yes. Yes. >>: So one thing that you could do to see if users can perceive -- I mean, you were saying before that, you know, after the users use the system they do [inaudible] ask them [inaudible] -- well, you could certainly just take the recorded audio and present it to a bunch of raters and have them rate it along certain dimensions, to see if people can distinguish the small, subtle changes that you made.
And I'm wondering, have you done that, and, if so, what were the results? >> Antoine Raux: Right. So I've started to look into that. My take [inaudible] result was that I think we need more improvement than this. I didn't do a formal evaluation like this, but from what I looked at, to make the difference really perceptible to listeners, to third-party listeners, I think we need to go further than this. Even in the best cases -- there are lots of yes/no cases where we reduce latency by 50 percent, say from 1 second to 500 milliseconds -- it's surprisingly hard to perceive, actually. So I think we're not completely there yet is the answer. But I don't have this formal evaluation. I started looking into it, but I'm actually trying to improve more before doing that evaluation, because it's more costly than the [inaudible] batch. Okay. So that's it for the first part of my talk. It was still a fairly -- it was a principled approach in the sense that I was optimizing the turn-taking behavior, or the endpointing behavior, from data. And it was very much in the standard framework of setting a threshold and making the endpointing decision based on that threshold. Now, I want to take one step further and propose a new way of addressing turn-taking in general. It's more of a theoretical model that will then be applied to specific problems, and I will explain that. So if we want to describe what turn-taking is in the kinds of dialogue I'm looking at, which are two-party dialogues between the system and the user, at the most fundamental level, the floor, or the turn, is basically alternating between system and user in a loop like this, right? Now, the interesting bit is not that phenomenon, which is well known and not very special. The interesting thing to look at is what happens at the transitions; that's what we want, because this is where we can improve behavior. So one typical standard behavior is: the system speaks, it finishes its utterance, it frees the floor -- it stops speaking, the floor is free, and it's marked as being for the user, but the user hasn't spoken yet here. Then the user starts speaking, gets the floor, finishes their utterance, the floor becomes free but marked for the system, and we have this loop where they take turns, with slight pauses maybe between each turn. That's not the only way transitions happen. The other way is when they actually overlap. In this case the system speaks, and before the system finishes, the user starts speaking, so it goes to this state, marked as both speaking, or both claiming the floor, with a mark because the user is trying to claim the floor now. And if the system stops speaking, then we get a switch to the user. So you have this kind of transition. And I'm going to explain in the next few slides how this fairly simple machine -- which actually has a few more details: first, you can stay in each state indefinitely -- well, yeah. And also -- I don't know if you saw the difference, but I had arrows showing that you can actually go back: for example, the system speaks, the user starts speaking on top of it, but then the user stops speaking, and you go back to system. So you can actually go both ways in certain transitions.
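A minimal sketch of that six-state floor machine, as I read it off the description above; the state and event names are my own shorthand for what is on the slide, not an official notation.

```python
# Sketch of the two-party floor model: who holds the floor, and how it moves
# on start/stop-of-speech events.  Names and event labels are assumptions.

from enum import Enum, auto

class Floor(Enum):
    SYSTEM = auto()   # system is speaking / holds the floor
    USER = auto()     # user is speaking / holds the floor
    FREE_U = auto()   # nobody speaks, floor marked for the user
    FREE_S = auto()   # nobody speaks, floor marked for the system
    BOTH_U = auto()   # both speaking, user claiming the floor
    BOTH_S = auto()   # both speaking, system claiming the floor

TRANSITIONS = {
    (Floor.SYSTEM, "system_stops"): Floor.FREE_U,    # system yields
    (Floor.FREE_U, "user_starts"):  Floor.USER,      # smooth switch
    (Floor.FREE_U, "system_starts"): Floor.SYSTEM,   # timeout: system grabs back
    (Floor.USER,   "user_stops"):   Floor.FREE_S,    # user yields
    (Floor.FREE_S, "system_starts"): Floor.SYSTEM,   # system takes the turn
    (Floor.FREE_S, "user_starts"):  Floor.USER,      # user resumes after a pause
    (Floor.SYSTEM, "user_starts"):  Floor.BOTH_U,    # user barges in
    (Floor.BOTH_U, "system_stops"): Floor.USER,      # system yields: barge-in
    (Floor.BOTH_U, "user_stops"):   Floor.SYSTEM,    # user backs off
    (Floor.USER,   "system_starts"): Floor.BOTH_S,   # system cuts in
    (Floor.BOTH_S, "user_stops"):   Floor.SYSTEM,
    (Floor.BOTH_S, "system_stops"): Floor.USER,
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)    # otherwise stay put
```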
So this is still a fairly simple state machine, but it captures many of the typical turn-taking phenomena that we observe in a two-party conversation. The first smooth transition is basically the one I explained first, in this case user to system: the user speaks, the user yields the floor, the floor becomes free, the system grabs the floor -- we have a transition. Now, if after the user yielded the floor the system waits, then we have a latency. That corresponds exactly to the latency I was talking about in the first part: staying in this state here. Now, if the system grabs the floor while it still belongs to the user and is not free, then we have a cut-in, the same thing we were talking about. Now a completely different kind of behavior: say the system speaks, finishes speaking, so we switch to free-U, but the user doesn't get back to the system, and at some point the system decides to grab the floor back. That's typically a timeout type of behavior: didn't get a response from the user, decide to grab the floor again. Another phenomenon that's very common: barge-in -- user barge-in. The system is speaking, the user grabs the floor, so they're both claiming the floor at the same time, and the system yields, so we go to the user. That was a smooth barge-in transition. And there are many other types, but the point is that this fairly simple [inaudible] captures really a lot of the phenomena that happen and formalizes them in a nice way. And so based on this, we can build an algorithm to make the decisions. It's based on decision theory; I call it a dynamic decision process. The idea here is that at every point in time the system has partial knowledge of the state, meaning that at some points the system is going to be pretty certain who has the floor -- like in the middle of utterances by the system or the user, it's usually pretty clear that the floor is the user's or the system's. But at the transitions, there are points where suddenly you're not sure anymore who has the floor, and those are exactly the points where you have to make decisions. So it's important: you need to keep track of this uncertainty if you want to use it to make your turn-taking decisions. There are types of actions that can be taken either by the system or the user, but we take the viewpoint of the system here: grabbing the floor, yielding the floor, keeping the floor when you have it, and waiting, which is not doing anything when the other person has the floor. These different actions yield different costs in the different states we have there. Not all of them are possible in all states, because for keep and yield the system needs to have the floor in the first place, whereas for grab and wait it needs to not have the floor. But most importantly they have different costs, and we'll see what that means. And so once you have that, you have states of the world with a belief over what the current state is, and you have actions with costs. This is very decision-theory oriented. And so you can use this, in theory, to pick the action with the lowest expected cost, using all this information. Now let me explain, using two examples, how this can be applied. So endpointing again: typically you start out knowing that the user has the floor; you're at a point where you're pretty confident the floor is the user's -- they've been speaking.
And at some point, because there's a pause in the user's speech, you're uncertain whether the user just freed the floor -- this state -- or the user still has the floor because they intend to continue speaking. And in either of these states, the system can take two actions: grab or wait. If you're in this state, grab is the right thing to do, because the user is done speaking and you don't want to induce latency, which is what wait does. If you're in this state, wait is the right thing to do, because the user is not done yet, so you want to wait; if you grab, you're going to induce a cut-in. So we can translate that into an actual cost function, or cost matrix, here. Again, this is once a silence has been detected in user speech -- we'll see how that [inaudible] later. So the possible actions are wait and grab, the states are user or free-S, and for the right actions I set a cost of zero -- the right thing to do gets a cost of zero. And for the other ones I set different costs. For waiting, the cost is equal to the time since the silence was detected: the longer you wait, the higher the cost of waiting is going to be. For cut-ins, I set the cost of a cut-in to be a constant over the whole system and all the states. Both of these are approximations, of course: you can imagine cut-ins having different costs in different circumstances, and you can imagine the waiting cost not being strictly linear. But this is a first approximation to show how the framework can be used. So now we have again a typical decision theory problem where you just want to take the action that minimizes the expected cost, given that the user is still silent at time T. Because if the user starts speaking again, then we're back to being certain of being in the user state, and we don't have this problem anymore. So we assume that the pause is still going on at time T, and the cost of grabbing is the probability of being in the user state times the cut-in cost, plus the probability of being in the free state times its cost, which is zero. And the same thing here for wait. So we have these two expressions for the expected costs. Now, what we need here is to compute these state probabilities. We need to know the probability that the state is user or free at time T. Now, the actual expression is with conditional probabilities, but because they're both conditioned on the same thing, you can use joint probabilities; it's equivalent in terms of solving the equation in the end, of finding the point. So I'm actually expressing joint probabilities. You can decompose this with the definition of conditional probability here: the probability of the floor being free times the probability of being in a pause given that the floor is free. Now, I make the approximation that if the user released the floor, they're not going to start speaking again, ever. It's an approximation in the sense that we're ignoring the transition back to the user state, which happens if the system never gets back to the user. But that usually happens only after a fairly long time, which is not on the same scale as what we're dealing with. So [inaudible] I'm saying that if the user did free the floor, they're never going to start speaking again, so they're going to be silent regardless of T. So this probability is 1.
And the other thing here is that, because this no longer depends on whether the silence is still going on, once we've removed that conditioning it doesn't depend on T anymore. So it's the same as the probability, at the very beginning of the pause, that the floor was free -- that the user had finished their utterance, basically. So we have this; we still need to compute this parameter, but it is no longer dependent on time. Now, on the other side, the probability of being in the user state: doing the same decomposition here, the only new thing is the probability of being in the pause at time T given that the user keeps the floor, which is the probability that they haven't spoken again yet. And so it's the probability that the duration of this current silence is bigger than T. And we can get that if we compute the probability distribution of internal silences from data. That distribution is well approximated by an exponential distribution, which is something other people have found in the literature in the past, and I have confirmed it is a good match on my data. So I'm using this approximation, which leads to this probability being just equal to exp(-T/mu), where mu is a parameter equal to the mean pause duration -- something you can compute easily. And, on the other hand, you have the same phenomenon as before: because it no longer depends on the silence, it's equal to the probability at time zero that the user keeps the floor, and this is 1 minus the probability we had before. So we just have two parameters to compute: mu is just the mean pause duration, and we still need to compute this probability. So: at the beginning of the pause, did the user yield the floor or are they keeping the floor -- that is the question. This is a fairly simple, very standard binary classification problem. You can take all the pauses in your training data, label them as internal or final, and use the features in the same way I was doing in the previous algorithm -- the features that are available at the beginning of the pause -- to try to predict this outcome, whether it's final or internal. And I did that using logistic regression, and here are the results. In terms of raw classification error, the majority baseline is 24.9 percent and it went down to 21.1 percent. So, of course, that might seem like not a very large improvement. But that should not come as a surprise, because if we could do a very good job here, we wouldn't need any threshold: if we could know at the very beginning of the pause whether the user is done or not, we wouldn't have to wait at all, ever. And it's not only a matter of having the right features and the right algorithm; there is some intrinsic ambiguity here. There are points in time where the user might be done or not and you're just not sure about it. Even a human is not sure about it. Even the speaker might not be sure about it, actually. So it's not a surprise that you can't get very, very high accuracy here. You could probably get better with more features and maybe with different classifiers, but I don't think it would get much better. But that doesn't matter, because what we want is an estimate of the probability; we don't need perfect classification. And because we did get improvement in both hard and soft [inaudible] likelihood, this at least improves over a simple prior distribution.
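A minimal sketch of that classification step, using scikit-learn's logistic regression; the toy feature vectors and feature names are placeholders standing in for the discourse, semantic, and timing features described earlier, not the actual feature set.

```python
# Sketch of the pause classifier: logistic regression over features available
# at the start of a pause, predicting whether the floor was released (final
# pause) or kept (internal pause).  Training rows here are made-up examples.

import numpy as np
from sklearn.linear_model import LogisticRegression

# one row per pause in the training corpus:
# [seconds_since_turn_start, partial_hyp_is_interpretable, state_is_confirmation]
X = np.array([[2.1, 1, 0],
              [0.6, 0, 0],
              [1.4, 1, 1],
              [3.0, 0, 0]])
y = np.array([1, 0, 1, 0])        # 1 = final (floor released), 0 = internal

clf = LogisticRegression().fit(X, y)

def p_floor_released(features):
    """Estimate P(floor released | features at pause start)."""
    return clf.predict_proba([features])[0][1]
```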
So we can use that and plug it into our equations. So what happens here: remember we had these two costs, the cost of grabbing and the cost of waiting. The cost of waiting was directly proportional to T -- it was T times K -- so it's just linearly increasing here. For the cost of grabbing, the cost knowing the state was fixed, but our belief probability on the state decreases exponentially, so we have this decreasing cost and this increasing cost, and basically you want to endpoint when the cost of waiting gets higher than the cost of grabbing. And so the intersection is where you want to set your threshold -- because we're in a pause, we can actually go back to setting a threshold here, which is the time that corresponds to the intersection. In other words, the threshold is the solution to this equation. Unfortunately, this doesn't have an analytical solution, but you can use numerical methods and solve it very easily, even at runtime. Now, these are the results using this approach. Again, this is the baseline with a fixed threshold, and this is the dynamic decision process approach. It's slightly better -- I don't have it on this graph, but it's slightly better than the previous approach. I don't have it here because this was done on different data, but I also did the same computation [inaudible] as before, and it was very slightly better. Now, the interesting thing is we're not done yet. The hope was not that this would be significantly better than the previous approach. The interesting thing is that we can extend this approach much further than we could the previous one: because we have this generic framework, we can apply it at different places in different ways. So an example of an extension here is to do endpointing not only at pauses but also during speech. You can have a similar process within utterances, before you detect a pause. And that's important because pause detection itself incurs a delay: to be sure you have a pause, you first need to wait something like 200 milliseconds for your voice activity detector to know, and also because there are silences within speech -- at consonants like Ps and Ts -- that can be fairly long without being a pause at all. >>: [inaudible] >> Antoine Raux: Depends on the language. Not in English, but in some [inaudible]. >>: How fast can you compute this dynamic decision process? >> Antoine Raux: So, well -- >>: [inaudible] >> Antoine Raux: I don't think the computation of the threshold is going to be expensive. Before the threshold itself, there's the feature computation, which also happens during this 200 milliseconds I mentioned -- that time is used both to detect the pause for sure and to compute the features. So you have this thing that introduces a delay right off the bat, before you can make any decision. So if we can start detecting the floor switches before we detect the pauses, [inaudible] for the cases where we're pretty sure that the user is done, then that would improve over the current state. So what we can do is use partial recognition hypotheses directly, even without pause detection. And in this case I define the floor switch -- because we don't have a pause to classify as internal or final -- I say that the floor switch is at the first partial hypothesis whose text is identical to the text of the final hypothesis in the training data.
Because in the training data the system actually went on and did standard endpointing at the end, and if the recognition results are the same, that means we didn't gain anything by waiting more; we could have endpointed earlier. This is how I label the training data here. And now we can use, again, a similar cost matrix. We're in the same states with the same transitions and actions. Again, the wait action has cost zero, the cost of a cut-in is still the same constant, and the cost of not endpointing during speech when we could have, I set to be another constant. Because we no longer have the dependency on time -- we don't have this thing where the pause starts and we wait for a certain duration; it's something that happens every time we get a partial hypothesis -- this one is no longer dependent on time. What that means is that these equations basically lead to a threshold on the probability, at this particular partial hypothesis, that the floor has switched or not. So we can do the same thing as before, use logistic regression, but this time we classify not pauses but partial hypotheses, as being identical to the final one or not. So we have a classification. The baseline is higher and the relative improvement is much bigger than the previous one; we end up at the same value, about 20 percent error. But, again, we want to use the probability estimates rather than the pure classification, so that's not necessarily too bad. So what happens here in terms of results: we do get some improvements. They are small, I would say, over endpointing at silences only. Now, what's interesting to do here -- the point is that this latency metric, averaged over all turns, is not necessarily capturing the user experience very well. So it's interesting to look at the distribution of the thresholds picked by the two approaches. This is the different values for the threshold, and this is the proportion of turns that get that threshold under each algorithm. The blue one is endpointing at silences, the first approach; the second one is endpointing anytime, including potentially during speech. So what happens: first, at the long end they are completely overlapping -- nothing changes in terms of long thresholds. What happens is of course at zero, which is during an utterance: the original approach didn't allow this, so it has zero percent of the utterances here. The new one has 10 percent. So we have 10 percent of the turns where we're going to have a much faster response than we had before. And basically the original blue distribution has two peaks, one short and one long, kind of, and what happened is that some of that peak has been transferred to zero latencies. These are cases where we're pretty sure the user is done; it turns out some of them we can actually decide before we detect a pause and some of them we can't. But it still leads to 10 percent of the turns where we will reduce the latency significantly. And I'm currently working on the live evaluation for this, because it's not clear that just putting a zero latency here translates exactly into what happens in the real-life system. Yes.
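Pulling the two variants of the decision rule together in one sketch: in a pause, the threshold is the time where the expected cost of waiting overtakes the expected cost of grabbing, found numerically; at a partial hypothesis, the same cost matrix collapses to a fixed probability threshold. The constants, the cap, and the bisection search are illustrative assumptions, not the exact values or solver used in the system.

```python
# Sketch of the two dynamic-decision rules.  K (waiting cost per second),
# C (cut-in cost), D (cost of not endpointing during speech when we could
# have) are illustrative constants; mu is the mean internal-pause duration.

import math

def pause_threshold(p_released, mu, K=1.0, C=5.0, max_threshold=2.0):
    """Smallest t (seconds) where waiting is expected to cost more than
    grabbing:  K * t * p_released >= C * (1 - p_released) * exp(-t / mu)."""
    if p_released < 1e-3:
        return max_threshold          # almost surely an internal pause: wait long
    def wait_minus_grab(t):
        return K * t * p_released - C * (1.0 - p_released) * math.exp(-t / mu)
    lo, hi = 0.0, max_threshold
    if wait_minus_grab(hi) < 0.0:     # no crossing below the cap
        return max_threshold
    for _ in range(40):               # bisection; there is no closed-form solution
        mid = 0.5 * (lo + hi)
        if wait_minus_grab(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi

def endpoint_at_partial(p_floor_switched, C=5.0, D=2.0):
    """During speech the cost matrix collapses to a probability threshold:
    grab iff (1 - p) * C < p * D, i.e. p > C / (C + D)."""
    return p_floor_switched > C / (C + D)
```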
>>: [inaudible] the model for trying to make decisions about, you know, whether you grab or wait [inaudible] human-human [inaudible] so you have human-human [inaudible] this interaction [inaudible] how well would that naturally [inaudible].
>> Antoine Raux: Right. So that's a good question. The problem with that is getting the features in human-human calls. Things like semantic features and so on -- I mean, it's possible, but it would require heavy annotation, basically, of --
>>: [inaudible] it's the same task, right?
>> Antoine Raux: Right. Right. But I'm not doing any human annotation at all so far. I'm purely relying on automatic features.
>>: So, yeah, it would require --
>> Antoine Raux: So, yeah.
>>: [inaudible] annotation.
>> Antoine Raux: Right. So it is an interesting point. I don't have the data labeled for it. And the other thing with that kind of evaluation -- it would be interesting to see, but if you don't match the observed behavior exactly, that doesn't necessarily mean you're completely wrong either, because there are different acceptable behaviors. But it's definitely more information.
>>: [inaudible] the system actually barges in and you measure that as zero for the [inaudible].
>> Antoine Raux: Well, I wouldn't call that barge-in, but for the red curve, yes.
>>: [inaudible]
>> Antoine Raux: I count that as a zero threshold.
>>: Zero. And so why is it not at this [inaudible] zero then? So you get --
>> Antoine Raux: Yeah. Well, these curves are smoothed, actually, so there is a discontinuity. That's a good point. So there's zero, and the first available point after that is 150 milliseconds -- in this case the minimal threshold is 150, and the points are every 150 milliseconds -- but you can't get anything between the two. That's a good point.
Okay. So that was -- well, I've talked enough about endpointing for today, I think, so now I'm going to show how this can be applied to different problems. This is not completed yet; it's something I'm working on right now, so I don't have final results. But it's interesting because it shows how the model can be applied to different problems from the same starting model. So, barge-in detection. This time we're in the case where the system has the floor, and then we come to this point of uncertainty where the user might be trying to barge in. Our voice activity detector tells us there's speech detected [inaudible]. It might be the user, but it might be some noise, and it might be the user just back-channelling on the system -- not trying to grab the floor, just providing some simple feedback without trying to interrupt the system. And so there are two actions that can be taken, again, but this time it's keep the floor or yield the floor. If you're still in the system state, you want to keep; if you're in the both state, where the user actually did start claiming the floor, you want to yield. And so you have a very similar cost structure as before, with different constants. You can have the same thing where the cost of staying in the both state is equal to the duration we've been in that state -- the longer you stay, the higher it's going to cost. And the cost of yielding when you shouldn't, meaning that the system interrupts itself when the user wasn't even speaking or wasn't trying to interrupt, is set as a constant.
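[A minimal sketch of that keep-or-yield decision under the cost structure just described: keeping the floor while the user is actually claiming it costs more the longer the overlap lasts, while falsely yielding costs a constant. The belief value would come from whatever classifier or prior is available; the constants here are made up for illustration.]

```python
def barge_in_decision(p_user_claiming, overlap_duration,
                      k_keep=1.0, c_false_yield=5.0):
    """Decide whether to keep or yield the floor while the voice activity
    detector reports overlapping speech.

    p_user_claiming : current belief that the user is really trying to grab
                      the floor (vs. noise or a back channel)
    overlap_duration: how long the overlap has lasted, in seconds

    Expected cost of keeping : p_user_claiming * k_keep * overlap_duration
    Expected cost of yielding: (1 - p_user_claiming) * c_false_yield
    """
    cost_keep = p_user_claiming * k_keep * overlap_duration
    cost_yield = (1.0 - p_user_claiming) * c_false_yield
    return "yield" if cost_yield < cost_keep else "keep"

# With the same belief, a sustained overlap eventually tips the decision:
print(barge_in_decision(0.7, 0.2))   # short overlap -> "keep"
print(barge_in_decision(0.7, 3.0))   # long overlap  -> "yield"
```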
And the interesting point here -- the difference from before; the rest is very similar equations, obviously -- is that we can use prior information. In addition to the standard [inaudible] regression approach, or any other probability estimation [inaudible], you can introduce priors, which we didn't have before. In this case you can look at previously recorded corpus data where you have the system prompts, and you can measure when the users barge in. You can compute the distribution of the times at which users barge in for specific prompts, because barge-in is going to be triggered by certain things in the prompt -- the content or the prosody. So we can look at that actual data and use it as prior information on how likely the user is to actually barge in at this particular time. Random noises are not going to follow a specific distribution; they're going to come randomly, with a flat distribution. So this can account both for discriminating between real speech and noises and, as I just said, for [inaudible] back channel versus barge-in, because certain points in the prompt are more likely to trigger a back channel from the user, and others might be more likely to trigger a barge-in. And I did compute this on corpus data, a set of about 6,000 dialogues, about 127,000 prompts total. And just as an example from one prompt [inaudible] -- going to the airport, is this correct -- this was normalized because, of course, the varying part changes every time, so I normalized it to its mean duration. And you can see how barge-in happens here. First, it happens very much at the beginning of the utterance, which means the users barging in there are not responding to this utterance. They're responding to something that happened before -- it might actually be a cut-in, where the user was not finished speaking and the system started to respond, but the user continued their utterance, so it results in a very early barge-in. So that's one type of barge-in we get, and we might want to consider it not as a barge-in, actually; that's an interesting choice to make here. Another one is right after "going to the airport": people typically back-channel, give the response yes or no -- particularly when it's yes, they will just give you a single answer there. And in this case, because we're doing explicit confirmation all the time, it's okay if that's taken as a barge-in, because we were going to ask this question anyway. But we can use this information to help implicit confirmation as well. You can imagine an implicit confirmation that would be exactly like this except saying going to the airport, where are you leaving from -- because of the way the prompt is worded, you can just change the second half and have something that's an implicit rather than explicit confirmation. And then you can use this: if you get a barge-in here, you know that you really should continue speaking the remainder of the prompt, because the user might just be saying yes -- just providing an answer to this first part, a back channel -- and they still want to hear the rest of the question. So although the data contains explicit confirmations, you can use it to inform the design and the models for implicit confirmation.
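[One way to picture how such a timing prior could be plugged in, as a hedged sketch: treat the corpus-derived density of barge-in times for this prompt as the likelihood under a genuine barge-in, treat noise as uniform over the prompt, and combine them. The Gaussian prior shape and all constants below are illustrative assumptions, not the actual corpus estimates.]

```python
import numpy as np

def p_real_barge_in(t, prior_density, p_barge_in=0.3, prompt_duration=4.0):
    """Posterior probability that speech detected t seconds into the prompt
    is a genuine barge-in rather than noise.

    prior_density(t): corpus estimate of when users barge in on this prompt
                      (e.g. a kernel density over observed barge-in times)
    Noise is assumed to arrive uniformly over the prompt duration.
    """
    like_barge = prior_density(t) * p_barge_in
    like_noise = (1.0 / prompt_duration) * (1.0 - p_barge_in)
    return like_barge / (like_barge + like_noise)

# Hypothetical prior for one prompt: most barge-ins arrive right after the
# confirmation question, around 2.5 s into the prompt.
prior = lambda t: np.exp(-0.5 * ((t - 2.5) / 0.4) ** 2) / (0.4 * np.sqrt(2 * np.pi))

print(p_real_barge_in(2.5, prior))   # near the peak: likely a real barge-in
print(p_real_barge_in(0.9, prior))   # far from it: probably noise
```

[The resulting probability can then feed the keep-or-yield rule sketched earlier, so that speech detected where this prompt historically attracts barge-ins is taken more seriously than speech detected at an arbitrary point.]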
And the fact that we get all these barge-ins and different user reactions with implicit confirmation is one big reason why it's so hard to actually implement implicit confirmation and really deploy it in a system: the behavior is so much harder to predict. In this case you would say going to the airport, where are you leaving from, the user says yes, and then you interrupt the prompt because you heard the yes as a barge-in -- and you don't want to do that. So this is something that could help, and it could be plugged into the framework I proposed very easily, because it's just a prior probability on the state given the time at which it happens.
And so there are other extensions to this framework. The idea of the whole thing is that it's a general framework to think about turn-taking. It's not the specific solutions [inaudible] that are important; it's a way to think about turn-taking as a finite state machine and to plug decision theory on top of these states and actions. The possible extensions include changing the topology of the state machine itself -- in particular, if you have more than two speakers, you're obviously going to have a different structure. You can change the cost functions. The ones I described during the talk are fairly simple, and you can maybe do a better job. The ideal thing would be to relate the cost functions to user experience. If you have high-level measures of task success -- I don't think task success is necessarily going to correlate highly with this, but user satisfaction might relate more. So if you could get some correlation, some way of setting the costs based on that, that would be a much better optimization criterion than the current one. And you can also just improve the probability models you're using -- different features, different classification or regression models, and different types of priors. Yes.
>>: I agree with what you were saying with regard to the cost function; that is, if you use user satisfaction. But user satisfaction, I think, relates a little bit more to something that, as far as I can tell, you haven't really discussed: the cost function with respect to [inaudible], which are decisions that you make over time. So far what you've presented seems like a greedy approach, where all you're doing is minimizing the expected cost for every particular decision, but not across the sequence of [inaudible]. Clearly if you grab too much, you're going to upset the user and the cost is going to be greater. So have you thought about --
>> Antoine Raux: Right. So that's a very good point. What you're saying kind of leads towards reinforcement learning approaches -- actually modeling this more like a Markov decision process where all the actions are chained over all the different turns.
>>: Not necessarily. Depending on the optimization [inaudible], I mean, clearly what you're not including so far, at least, are even just dependencies between the decisions that you make.
>> Antoine Raux: Right. So, I mean, one way to answer this, one thing I've thought about -- well, you can use this machine also to remediate when something went wrong. That's not exactly what you're saying, but it's a first step.
So, you know, when you did interrupt the user, when you did a cut-in here -- sorry for jumping from here to here -- then you can use this to at least make the right decision afterwards and repair the problem. So you can use it for repair. And what I was thinking -- it's a good point that this is not embedded in the current decision framework itself -- but when you observe these things, you can change your cost structure and make it more costly to have a cut-in if you already had cut-ins before, for example. I mean, a very simple --
>>: [inaudible] you are not trying to understand what the user wants. And if [inaudible] user [inaudible] your timing information is going to be such.
>> Antoine Raux: Right. Well, basically -- yeah.
>>: [inaudible] your research, your [inaudible] model an area that relates, is somehow related to, a dialogue [inaudible].
>> Antoine Raux: Right. I agree. But I don't think that questions the overall approach; it's more that you would need more complex cost functions, and maybe topologies as well. I think this will still capture a lot of the phenomena, even in a different task where, for example, it's not necessarily always good to be as fast as possible -- which is what this particular approach assumed.
>>: [inaudible]
>> Antoine Raux: So [inaudible] an argument for the approach I've taken, which is very local in terms of decisions, can come from the conversation analysis work -- which can be questioned, but in the Sacks and Schegloff original turn-taking paper about human-human turn-taking, they actually made it explicit that turn-taking is a local phenomenon that's independent of context. Now, you can question that, but there is a theoretical take on turn-taking that makes that assumption. So I think that's why making that assumption works at all, right? You couldn't do that when optimizing the general flow of the dialogue, like the actual prompts that you're saying -- that wouldn't make any sense. In turn-taking you can at least start with that assumption and then improve over it by introducing more dependencies.
>>: That's why I think it would be really interesting to try to [inaudible] to look at conversation analysis and see if they actually can apply.
>> Antoine Raux: Yes. I agree that would be interesting. I don't think I'll do that in my thesis, but that's very good -- I wish someone did that.
Okay. So I've presented two principled approaches, in different ways, to turn-taking. The first one is an optimization algorithm for endpointing thresholds, and it showed basically that [inaudible] features can help turn-taking -- particularly semantics, if you have them. It might be that in some complex domains it's much harder to get reliable semantics, and then you might want to rely more on other features; in this particular domain, semantics did help a lot. And second, I proposed an approach based on the finite state turn-taking machine, which models turn-taking and captures most phenomena, at least in a dyadic conversation, in a fairly standard state-transition way. And because of that, we can use it as the basis of what I call the dynamic decision process model, which is based on decision theory and makes turn-taking decisions at every point based on our belief about what state we're in and the costs of the different actions.
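[The decision rule at the heart of that dynamic decision process can be stated very compactly. A hedged sketch (the state names and numbers are illustrative, not the exact ones from the system): keep a belief distribution over the turn-taking states, a cost for each action in each state, and pick the action with minimum expected cost.]

```python
def choose_action(belief, cost):
    """One step of the dynamic decision process: given a belief distribution
    over turn-taking states and a table cost[action][state], return the
    action with minimum expected cost.  Both the endpointing and the
    barge-in decisions sketched above are instances of this rule."""
    def expected(action):
        return sum(belief[state] * c for state, c in cost[action].items())
    return min(cost, key=expected)

# Toy endpointing instance: USER means the pause is turn-internal, FREE means
# the user is done.  Waiting is free while the user holds the floor but costs
# 2.0 once the floor is free; grabbing costs 10.0 if it turns into a cut-in.
belief = {"USER": 0.15, "FREE": 0.85}
cost = {
    "wait": {"USER": 0.0, "FREE": 2.0},
    "grab": {"USER": 10.0, "FREE": 0.0},
}
print(choose_action(belief, cost))   # -> "grab"
```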
And I showed examples of two applications: endpointing and barge-in detection. I actually have two slides on how that could all fit -- or how I could fit -- at MSR. So first let me talk about my general research goals. This was specifically about my thesis work, but in general what I want to do is improve human-machine interaction. Human-machine interaction is really where I want to be, and I want to improve it by designing models that can learn -- in both cases from data, but either from previously collected interaction data that you have or, even better, through new interaction, or both. You can start from something collected, like I did in this particular work, and -- because this is unsupervised -- you can actually let it run and keep tuning itself. The user behavior might change once you change the behavior of the system, so keeping the whole thing unsupervised actually helps it do online, continuous learning. I didn't [inaudible] the online version of this, but there's no reason it wouldn't work. And the other aspect I'm really interested in is leveraging high-level task information, whatever the task may be, to optimize core technology. It's being at the interface between the core technology -- like speech recognition, or in other domains, information retrieval, et cetera -- and the actual task you're trying to do with it, like dialogue in the case of speech, or Web search in the case of information retrieval.
So in terms of domains [inaudible] applications, to relate to the reason I'm here these two days: first, the situated interaction work that's done in [inaudible] group. Well, turn-taking itself is, I believe, crucial to realistic dialogue systems, the kind of systems they are working on in that group -- systems you want to be much more aware of the environment and to lead to much smoother, more natural interaction with users. I think that requires having a good turn-taking model. Whether it's exactly this one or not is a separate question, but I think it does need one. And it's kind of a natural extension of the whole finite-state approach to multimodal, potentially multiparty situations. What you can build on top of this is growing the state machine, changing the cost structure, et cetera. For Web search, it's not a direct extension, but it still fits definitely within my general research interests in the sense that -- well, presumably with Kuansan, the idea is to take interactive search as a dialogue, to frame it as a dialogue between the user and the machine. And given that, it's very interesting to me to explore -- again, to relate the low level and the high level, to use the information on search behavior you get through the interaction to inform the core information retrieval technology [inaudible]. So those are two directions, and both of them would be very interesting for me to pursue within the context of MSR, at least in the near future. Okay. Sorry, I didn't have a thank-you slide, but thank you for your attention.
>>: So do you think that for [inaudible] taking [inaudible]?
>> Antoine Raux: I think it's different. I think it also depends what kind of multimodal interaction, what application. If it's -- so multimodal probably is not the big factor.
If it's multimodal but not at all humanlike -- multimodal because, say, you have a map and you can speak to it -- then you have to construct a new model of what it is to take turns and what the interaction is. So I think it's still important.
>>: [inaudible]
>> Antoine Raux: Yes.
>>: The other way [inaudible] the multimedia.
>> Antoine Raux: Right. Well --
>>: In that case you're coming [inaudible] between the system and the users, it's much more [inaudible].
>> Antoine Raux: Right. So basically you have something where you might have several floors. I mean, there are different ways of modeling that, right? If you're using just speech, you have a single channel -- that's why the floor is there, because you can't easily share that one channel.
>>: [inaudible]
>> Antoine Raux: Right.
>>: But if you have a very high bandwidth [inaudible] between user and system?
>> Antoine Raux: Right. So I think it leads to a different model of what the floor would mean. I don't think it removes it completely, because the bandwidth is not infinite, right? At the very least, the user is never going to be able to process an infinite amount of information presented to them. So you still have constraints; they're just not the same as in a speech-only conversation.
>>: [inaudible] is generic to all human-machine interaction.
>> Antoine Raux: Right.
>>: My question is, the way you approach turn-taking in your research, you have focused a lot on the [inaudible] information [inaudible] speech. The question I was asking is, in the multimedia environment, how you [inaudible] -- is that particular to speech interaction, or is it as important in multimedia communication?
>> Antoine Raux: I don't think it's just because of speech. It's potentially as important, depending on what multimedia presentation you do.
>>: [inaudible]
>> Antoine Raux: Right. But it's about how you combine the two. If you're talking about, say, a static Web page that has different elements to it, then within this context turn-taking, even if it's not directly timed in milliseconds like the problems I'm talking about here, is at least about the sequence of things, right? As soon as you have things happening in sequence -- and basically in any application you do have the system and the user taking turns in some fashion, not necessarily strictly one at a time; it's slightly more interesting when it's not -- you still have turn-taking happening, because there's still a temporal transition. There's a cycle. So even if optimizing the thresholds like I did in the first part is not going to apply directly -- obviously, because you don't have that specific problem -- the second approach is more generic and can apply more readily, because you can have these transitions, these state machines [inaudible].
>>: My question is that in the case -- let's say in your Web search -- this latency or duration issue, how important is that? I mean, if it's [inaudible] obviously [inaudible].
>> Antoine Raux: Right. Well, that's a good question, actually, how important that is. I think it's not working at the same scale. Now, is it not important at all, or --
>>: Well, maybe [inaudible] my dialogue [inaudible] --
>> Antoine Raux: Sure.
>>: -- how long does it take.
>> Antoine Raux: Right, right. Now, how the turn-taking you end up doing influences that aspect -- that is the challenge, I guess.
>> Kuansan Wang: We have time for one more question.
>>: Just your thoughts: if you were to work on Web search as a dialogue, how would you approach the problem?
>> Antoine Raux: That's a good question. Well, briefly -- I just had a discussion with Kuansan about these kinds of issues. The first thing, given what we know about dialogue, human dialogue, would be to structure the problem. Because if we want to approach it as a dialogue, I think that introduces some kind of structure, so that you can be in a state, get some input, make a decision and move to a different state. You need to define what these states are. And the plain old information retrieval paradigm doesn't have that -- it's very specifically unstructured, actually. So you need to define some structure, either by clustering documents or -- I mean, there are already some approaches to that -- and then move around in it. But there might be other ways to do it, actually, maybe better ways.
>>: [inaudible] it's like you said [inaudible] barge-in you will be able to detect when you are actually required, needed to get some information [inaudible] more efficient, just don't have to try to prevent [inaudible].
>> Antoine Raux: That's an interesting point -- whether we can have the system be more productive by being more aware of what's happening in terms of the floor, even if the floor means something else here.
>>: [inaudible] the case search [inaudible].
>> Antoine Raux: Right. For example.
>>: [inaudible] but the time scale there is actually much smaller than in speech.
>> Antoine Raux: Right. It's --
>>: As you type, every character you type, it's trying to reformulate the suggestions, and so the latency -- and all the timing information.
>> Antoine Raux: It's [inaudible] different scales, yes, yes. [inaudible].
>>: So you don't need to respond when you don't have enough information [inaudible].
[multiple people speaking at once]
[applause]
>> Antoine Raux: Thank you.
>> Kuansan Wang: And thank you, everybody, for coming.