>> Dan Bohus: Okay, everyone. I'm very excited to have Candy Sidner visit us today. She's a fantastic researcher. She's done tons of great and seminal work in dialogue and communication collaboration. And one of the things that I really admire about her is her ability to sort of like thread the whole space from sort of theory to practical engineering issues. Like she's done great work, I think in the past, on like discourse structure models with Barbara Grosz and SharedPlans. And she's brought that to life actually in the Collagen framework, which has been used to develop a number of different systems. And then she's also done great work at the intersection of dialogue and robotics, both at Mitsubishi Electric and WPI. And she's going to talk to us about some of her recent efforts on engagement in human-robot dialogue. So Candy. >> Candy Sidner: All right. I hope the mic is working. Especially for the people who weren't present. First of all, thank you for the nice introduction. And it's nice to see all of you here today. For those of you who know the work that Eric and Dan have been doing, I'm going to be talking about what is, I think, maybe in the framework, the perspective of their framework, how it is you make decisions about when you engage with someone. So the work that I've been doing over the past several years is to look at the question of what happens when people actually try and engage with one another in an interaction. So first I'm going to describe what engagement is. I'm going to show you some analysis that I've been doing of human-human dialogue. So what do people do. What's their situation. I'll define for you what a connection event is, as a way of giving some more grounded meaning to the notion of engagement. And then I'll talk about a module that's being developed by my collaborators, Chuck Rich, Aaron Holroyd and Brett Ponsler. Chuck is my long-standing colleague, and he also happens to be my spouse. And Brett and Aaron are two students who are in the research group with us. I'll talk about the module we've been developing and I'll show you a demonstration of this work in what's called the pointing game. Okay. So what is engagement? Engagement is a process and it's a collaborative process by which two participants, in my case it's really more if you want to look at the n-person case, establish, maintain and terminate or end their perceived connection to one another. There's a problem -- the initial problem is how do they actually do that? I come from the language community. And people in the language community, you just say well you say hello and everything happens. It's really not like that. And Dan and Eric's work has been really looking at how this actually goes. Once the connection is made, then you have to negotiate this. And you negotiate it as part of the whole collaboration for whatever it is you're doing. The collaboration can be simply to talk to one another. It can be you're talking to one another while you're doing some task. So all of those things are part of it. Then there's the interesting problem of checking. So each participant in the interaction checks to see what the other one is doing, if they're still engaged. What kinds of things do they do? Well, they talk, they look, they track the deictic references. I'll show you more about that. They do mutual gaze. There's some other interesting things we'll look at. And at some point they have to decide to terminate the interaction. They have to terminate their connection. 
And again it's not simply that you say good-bye. It's actually more complicated than that. I won't say too much about that today. You'll just have to believe me. So where does the evidence for this sort of thing come from? Well, there's the collaboration itself, whatever tasks we're doing. There's conversation management. That is the whole process by which we take turns. Now, conversations go on. We don't just say everything and then the other person says something. They're very complicated. But it's an indication that in fact we're engaged with one another. Gaze is a very important effect for how it is that we indicate that we're connected to one another. We use it to take turns. We use it to track objects. We use it to check the attention of the other person: are they looking at the objects I'm pointing at or are they looking at me, are they looking around the room, what are they doing? Are they looking at something else? There are hand gestures. The obvious ones I'll talk about have to do with pointing, with presenting things, with explaining things. But there's a whole other range of gestures which I will say nothing about, which I call semantic gestures. They're all the things like it was really big or I'm not really sure about that. All these wonderful things we do with our hands. Some of which are cliches. They're kind of fixed and they're almost like fixed little presentation signals. And others vary, as far as everyone knows, from person to person, and the meanings are much more tenuous. Nonetheless, I'm not going to say anything about it, because I think that is probably the hardest problem I can think of in communication. And I'm probably not going to get to it in my lifetime. Head gestures. We nod our heads at one another. We shake our heads. We tilt our heads. There's all these things that go on there. Body stance. Normally when we're communicating with someone, we address that person with our bodies. But we can't always do that if we're doing some other task. If I'm washing the dishes and you're there, I have to stand like this to wash the dishes. So I have to do something to counter the fact that my body stance is not providing the right kind of information. There are facial gestures. Facial gestures are enormously complicated. It's another really interesting area. I'm going to say a tiny little bit about it in the course of showing you some video. Lastly, there are things that have to do with social relationships and cultural norms. These are incredibly important in how it is that we recognize and understand how the connectedness between the two of us is going. They have to do with things like our relative status to one another, how much we are connected to the other person, a whole range of things that I'm not going to touch on at all, but you have to acknowledge that they're there. Okay. So how do you know when someone is engaged with you? All of the behaviors I'm going to talk about are two-sided things. So there's what I do and there's what you do. And I'm trying to influence you and you're trying to influence me. So when we look at computational systems, the system has to not only be generating behavior but recognizing it at the same time. When one participant generates a behavior and the other one responds, we call that a connection event. And the question is what are the kinds of behaviors that come up, and we'll talk about that in a minute. If the second person ignores the behavior of the first person, then we say that the event failed. So it's very simple. 
I do something; you either respond or you don't respond to it. >>: Candy, so for multi-party, would you say that the second party is the group or do you ->> Candy Sidner: Oh, Dan and I were just talking about this this morning. It's a really complicated matter, because there's the initial part of the group, it's who you're talking to in the group. There's the overhearers, the people who are sort of in the background. And then beyond overhearers there are bystanders who are not really even listening in but are sort of there in some capacity. So there's a really interesting question. And the real problem is, in multi-group things, is who do you actually address. So if I produce a behavior, who is it that I address it to? So right now I'm addressing sort of my comments to you but I'm trying also to catch the eye of other people in the room. But since you asked the question, I'm really addressing you. Well, is the group everybody in the room or is it you? That's the real question. And it's a really interesting problem about multi-group things. And I mostly have not looked at multiple group interactions. So you can ask this question of Dan and hear what he has to say. In fact, if you want to answer, go right ahead. This is a seminar. People are allowed to have their opinions. But it's really interesting what a group actually comes to. So, okay. So what kinds of connection events are there? Well, the first type of connection event is what I call an adjacency pair. This is a term that comes from the work of Sacks and Schegloff, who were ethnomethodologists who did their work in the '70s. Their notion was that somebody says something. This is entirely linguistic. Somebody says something and the other person responds, and their response is engendered by what the first person actually did. So the paradigmatic example is the question. I say what time is it. You say 10:00 p.m. It's harder when we start talking about adjacency pairs for other things, because sometimes I will say something like: I'm going to go to the store. I'm thinking about buying a new dress. And depending on who I'm talking to I'm getting ahas or maybe not ahas. Hi Dave. So there's this interesting question about what constitutes the adjacency pair. In the work that I'm going to talk about we've reformulated the notion to not just be strictly linguistic. And that's because in the data we see things like someone will say to the other person, and this is a kitchen scene, knife, and the other person hands them a knife. That's clearly a response to what was initially a linguistic event. But the response doesn't have to be linguistic. In fact, I'll show you, I hope, in the video I'm going to show you, one of the cases where there's a response. The response is not even a task response, that is, giving somebody something, but in fact their faces do the responding. So there's all kinds of nonverbal behaviors that count as a response, including when you nod at me when I'm saying something. That counts as part of the adjacency pair. So we've expanded this notion. The other thing to say about adjacency pairs is in the original work that Sacks and Schegloff did, an adjacency pair was first thing, second thing. They had a notion of what was called a third turn repair. If the first person said something like do you want to go to the store and the second person said which store, and the first person responded: Macy's, then that was the third -- you had three turns instead of just two. 
It turns out that you really want to be able to expand that notion even out to four turns, because people will say things, and this comes from my data. One person asks the other: Can we eat these when we're done, and the second person says: Well, I don't really think so. And the first person says damn it, and the second person says yeah, that's the way it is, ha, ha. Now you could count the damn it and the ha, ha as their own thing and say well they're just engendered by themselves. But clearly they're not. The damn it is in response to what the second person said. So we actually have an adjacency pair that goes on. That's not really a pair. It's a set of pairs. So we've allowed for that in the data. My adjacency pairs also include back channels. Back channels are the phenomenon of when I'm speaking and you nod your head as I speak. I keep going on. You don't actually get a turn, but you're using a nonverbal signal to give me information. Another kind of connection event is directed gaze. And this is how we use our heads and our eyes to tell the other person that we want them to do something. So there's sort of two ways to do this. One is to just use your physical head, to turn to look at something. Now if I do that right here it's a little odd because there's nothing there on the floor for me to actually, for you to actually see. But if we're doing a task together and what's over there is something that I want you to actually look at, that's a way that I can actually get your gaze directed there, and that's something -- my behavior and your response counts as one of these connection events. Similarly, if I use my hand and I point, that's the same kind of thing. In that case I'm using both my head gaze, because I can't really point for the most part accurately without turning my head to the right place, plus the use of my appendage, to indicate a place that I also want you to look at. So that's another kind of directed gaze. Those are two other cases that I've been coding in the data I've been looking at. The third case that I'm going to talk about is mutual facial gaze. Whenever we have an interaction, there are times we actually need to look at each other. In conversations that just run along, where there's not some kind of other thing going on, most of the time we look at one another. But not all of the time, by a lot. One study I did, I looked at how two people interacted when one of them was talking about various things that they were showing them in a laboratory with all kinds of cool new toys that people invent. And in that experiment, what I discovered is that people look at each other about half the time. The rest of the time they're looking around the room at various things. They're doing other things like they have to look for the water glass. They have to look at it, pick it up, when they drink from it. A whole bunch of things like that. And that was in a fairly benign setting. If you think about other settings in people's lives, for example, when you're outdoors you don't even look at people that much when you talk to them because you have to navigate in the environment. And that takes up a good amount of your face and eye time. Nonetheless, we do look at each other at various points in the interaction. It serves a purpose, not only to connect us. It often has other additional roles to play, to tell us what it is we're actually paying attention to, whether the person's understanding, and so forth. 
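As a rough illustration of the connection-event vocabulary described above (adjacency pairs with back channels, directed gaze, and mutual facial gaze), here is a minimal sketch in Python of how such events might be represented for annotation or recognition. The class and field names are illustrative assumptions on my part, not the actual coding scheme or software used in this work.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class EventType(Enum):
    ADJACENCY_PAIR = auto()      # one party's behavior engenders the other's response
    DIRECTED_GAZE = auto()       # initiator looks (and perhaps points) at an object
    MUTUAL_FACIAL_GAZE = auto()  # initiator looks at the other participant's face
    BACKCHANNEL = auto()         # nod, "uh-huh", etc. while the initiator keeps the floor


@dataclass
class ConnectionEvent:
    """One attempted connection between two participants."""
    event_type: EventType
    initiator: str                     # e.g. "instructor" or "student"
    start: float                       # seconds from the start of the interaction
    response_start: Optional[float]    # None if the other participant never responded
    end: float

    @property
    def succeeded(self) -> bool:
        # An event fails when the second participant ignores the behavior.
        return self.response_start is not None

    @property
    def response_delay(self) -> Optional[float]:
        # Delay between the initiator's behavior and the responder's uptake.
        if self.response_start is None:
            return None
        return self.response_start - self.start


# Example: the instructor points at the cracker box; the student looks 0.6 s later.
event = ConnectionEvent(EventType.DIRECTED_GAZE, "instructor", 12.3, 12.9, 15.0)
print(event.succeeded, event.response_delay)
```

A representation like this is only meant to capture the timing relationships the talk describes (initiator behavior, optional response, success or failure); the actual annotations were done in ELAN, as discussed later.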
Now, those are just three kinds of connection events that are pretty obvious ones. There may be others. So, for example, touch may turn out to be a way that we make connection to one another. It's very hard to think about that when you're in a setting like this, because in public settings, when we're not among close family members, touch is a very constrained kind of behavior. Another thing is emotion. If I'm really angry with you and conveying that, that's another kind of connection event. It means that we're really connected, especially if you're responding to what it is that I'm doing emotionally. So it may turn out that that's something else that indicates what the perceived connection is between us. But I'm guessing here because I don't have good data for that sort of thing. All right. So now I'm going to show you some of what it is we've been doing in human-human studies. We have collected a set of videotapes of pairs of people doing a very simple task. And it's making canapés. So there's an instructor who teaches the student, another person, how to make canapés. These are crackers with lots of spreads on them, that sort of thing. And after they do a set of these and arrange them on a plate, then the person who was the student takes the role of the instructor with a third person who enters the environment, the original instructor leaves, and we go through this all again. And we have eight sets of these -- four sets of these interactions. So that's the study. And what I'm going to show you now is just a piece of one of these interactions so you can see what's going on. One of the things you should notice is that the instructor points a lot. He does other kinds of gestures, namely the iconic ones, things like really big. Although he doesn't do that particular one. He also does the sort of metaphorical gestures. You'll see as he points, there are responses on the part of the other person. >>: In this case, the instructor is the subject, not the confederate? >> Candy Sidner: In this case, the instructor is the confederate. Okay. So there's a confederate and the person who is learning from them. You won't see the case where they're essentially both nonconfederates. I'm not going to show you that. >>: So the confederate is aware of the purpose of ->> Candy Sidner: That's right. That's right. And in the other case, neither person is aware of exactly what's going on. So first we're going to take a look -- so I have videos that are -- they're done with two sets of cameras. You can see one camera there. That's the one that gets the student's face. You can't of course see the camera that we're getting the view from, which is from this person who is the instructor. And so let's take a look at what he does. [video -- cracker followed by either cream cheese or [indiscernible] something. And then either put on pimentos or olives or something small like that]. >> Candy Sidner: Okay. So that's a very brief bit of information. But what you see is he gestures at a bunch of things that he's talking about, telling this person about, on one side of him, and he switches hands and uses the other side to gesture at some things. Interestingly, one of the things you may not have noticed, and maybe I should play this again so you can see it again, is that while he's pointing, he does not point this way. In fact, I have very few examples of this in all of these videos. Most of the pointing you'll see that goes on is this kind of stuff, I'm showing you things, this kind of thing -- there are a bunch of different kinds. 
There are half taps, which is what this is, and there are full taps. I don't think we see one of these, but one of the things that the subjects do is kind of voila, here it is, I'm presenting it to you. So watch again. You can see this. Whoops. [video: Basically we start with a cracker followed by either cream cheese or [indiscernible] type thing. And then either put on the pimentos or olives or [indiscernible] something small like that]. >> Candy Sidner: Of course, what you notice is at the end the student nods his head. He doesn't actually say anything. So here's a case where that adjacency pair, his part went on pretty long. He points out a bunch of things. There was a very small nod. You'll see this when we look at the video from the other side. And then at the end there was a much larger kind of nod. So here's the same video from the other side. [followed by] not quite. We're off a little. Okay. [video: To start off basically start with cracker followed by either cream cheese or [indiscernible] something and either put on pimentos or olives or something small like that.] >> Candy Sidner: So one of the -- he did one of the things that I think is so interesting, which is his response to what the teacher says about the things on this side is he nods his head as the teacher's speaking, but at the end of it he doesn't nod his head, he makes this funny expression with his face. So clearly he's understood, the interaction is happening, there's no confusion about that. But it certainly wasn't done with anything resembling language or even a standard back channel. And in the other case he actually nods his head fairly seriously. So that's fun. It's always fun to look at what people do. But the real hard job is annotating data. And we've been using the ELAN system. I'll show you an example of that in a minute. The annotators are myself and a student. There's a whole slew of things we've obviously been annotating. We've been annotating what happens with people's heads in terms of directions, where their eyes gaze, and to some extent this involves judgment on our part, because when you're looking at a video, you're to some extent taking the part of the other person in the interaction and trying to say, okay, what exactly is it that they are actually looking at. You look at -- we look at pointing and we annotate which kind of pointing they actually do, what's being pointed at, what their body position is. Largely in these videos, because there's a table, they're facing one another, but we set it up so that for part of the video the instructor actually has to turn to the side and get some materials off of another table and do something with them and bring them back to the other table. So during that time you get some of this phenomenon of how do you talk to the other person when you're in fact not even facing them. Okay. Obviously we transcribed their speech. We transcribed time intervals of referring expressions, what the referents of referring expressions are, and those are not the same thing as the referring expression. So a lot of what people say are elided expressions for what they really mean. So the instructor says, for example, crackers. And he taps on what's the cracker box. He's pointing to the cracker box and saying crackers. Easy for us, we know the crackers are in the cracker box, no big deal. But the point is the utterance and the thing he was pointing to were not exactly the same. 
We also code what the adjacency pairs are, that come from all of those other things, where the mutual facial gazes are and what the responses are to pointing. All right. To give you a sense of how this all looks: here are the two videos, and here are all of the different annotation channels that we're keeping track of. Now, before I say something about what we've been learning from all of that stuff, I'm going to give you a little bit of a definition about what we actually mean by these things. So what's directed gaze? Directed gaze happens because an initiator gazes at something for a period of time, which is what this piece of stuff is right there. And then there's a response, if there's going to be one, by the other participant. And when that response actually happens we mark that here. But in fact there may or may not be a little break between what the first participant does and what the second participant does. So optionally there may be a space in here where one is gazing and the other is not, or it might be that there's in fact a long overlap. And that's what this dotted line is meant to suggest. This part is the delay relative to the responder, and this point, when they're actually doing something together, is the shared gaze. Okay. And if there's pointing, if it's not just the face, then sometime after the gazer, the initiator, starts gazing, they point and that happens for a period of time. Usually before the responder actually gets to turn and look at something. Okay. Mutual facial gaze is a lot simpler, the initiator gazes at the other person, obviously. And then there's a response from the other person. This is the gaze point and this is where mutual facial gaze occurs. >>: Terminology, do you consider the mutual gaze phase to end at the time when either one of them stops gazing or when the responder stops? >> Candy Sidner: When either one of them does it. So that experience happens and then somebody looks away. And it can be either one of them. Okay. So adjacency pairs, I think I said a fair amount about this. But there's a person who says something. They're the initiator. Then there's the response by the responder. It can overlap with what the first person said. It can start immediately after, which actually happens surprisingly often in my data, or there can be a break before the responder says something. And then, of course, there can be, in the case of third turns, there can be another response by the initiator. And I don't show it here, but of course in the case where there are actually four sayings, four things, before the whole thing ends, it will go on out here. Okay. And finally there are back channels. Back channels are the initiator saying something, the responder responds with some kind of head motion, or I think it's possible that there could be some kind of other facial expression like the one I showed you. And then that stops at some point. >>: Verbal or do you put things like ->> Candy Sidner: We allow for uh-huhs as well. So there can be verbal expressions as well as nonverbal ones. Thank you. I didn't think to mention that. Back channels are a very complicated matter. It's been talked a lot about in the literature in terms of can you back channel anywhere? Is it controlled in some way? Those are open questions as far as I'm concerned. Okay. So we're interested in the amount of time that occurs between these various kinds of connection events. And we've defined that time as the time from one connection event starting until the next connection event occurs. 
And that's because that allows us to have overlaps. And the reason we've done this is we have a hypothesis that the mean time between connection events captures what we all informally experience as the pace of the conversation or interaction as a whole. When you're interacting with someone else, there's the kind of uptake that the other person has in the interaction. And sometimes a conversation, an interaction as a whole, can have a very kind of slow pace. You say your words pretty slowly, the other person doesn't pick up right away. They may speak very slowly. And that feels very different than a conversation where you say something and the other person says something or nods and there's this very quick uptake and that keeps happening, kind of rolls along in that kind of way. So we're interested in understanding this, because we think pace is a very important indicator of how it is that the engagement process is actually going. So the faster the pace is, the less time there is between connection events. So basically pace is approximately one over the mean time between events. Okay. The reason we're -- yeah. >>: Curious. I would see how that would make sense maybe with a task like the making of the hors d'oeuvres or whatever, but as the cognitive complexity of the task increases, if I ask you a question that requires you to really think, I may purposely sort of back out of the conversation in order to give you more space to think, and so you're deeply engaged in our conversation but processing it in a different way. Is that -- would you consider that engagement or would you consider that some other ->> Candy Sidner: So you've asked exactly the right question. This is a question we've been asking ourselves as well. When we're interacting, a very fast pace means boy, it's very clear we're doing something together. If you ask me a question and I have to say, I sit there for a minute and I think about it, obviously the pace of the conversation has changed. In fact, it changes sort of almost instantaneously, if you will, whereas what we see is that the pace may change if you take a sliding window over time in the conversation and look at connection events and the mean time over time. So the question is where do you get evidence that the other person is no longer involved? Certainly if they're indicating, they've slowed the pace down, but it's clear that they're still involved, that is, they do the kinds of behaviors which largely are kind of looking up like this, and the other one people do is look down, they serve different cognitive purposes, apparently. But in that case you're clearly not ignoring whatever it is I've asked you, or vice versa. On the other hand, there can be in some of those kinds of circumstances, it's clear that the participants are no longer engaged with one another. So one of the reasons that we're interested in gestures in all of this is, because if it's just what I say and what you say, clearly that's not enough, and so if I ask you a question and your gestural behavior comes relatively quickly, even though you haven't said anything yet you've begun to respond to my interaction. That's why we want to count the nonverbal stuff as part of that particular process. Now, I'll show you in a minute -- just a minute Tim, I'll show you in a minute some other issues that I think come up that make this all a little more problematic. So your turn, Tim. 
>>: I was going to say even outside of the nonverbal cues, psycholinguists have found that people will signal they need more time based on the feeling of knowing, so they use hums and ahs to signal what you just asked me is a difficult thing and I'm going to take some time to think about it. So they're much more likely to say um instead of ah, and put you through a whole bunch of [indiscernible], and the length of time between an um and a pick-up is much longer than for an ah. People seem to be estimating how long information seeking will take and signalling that to show they're still engaged. They still want to be part of it but they need some time. >>: So um versus ah is cross-cultural? >> Candy Sidner: It's probably American, frankly. >>: No. It's not. I know some countries don't use the same kind of disfluencies. And I know that someone did a similar kind of analysis for Japanese and found that there were slightly different markers, but they serve the same kind of purpose. >> Candy Sidner: Yes. I actually had a colleague many years ago who didn't do ums and ahs; he did [indiscernible]. So whenever he didn't, whenever he wanted to indicate what was going on, and he certainly didn't want to not have the floor, you got these Latinisms that were very interesting and very strange. And it drove everybody else crazy. But, nonetheless, there are some people who can do something much more interesting than um or ah. Okay. So let me give you some statistics that come from one of my pairs. This is nine minutes of their interaction. So it's not even their whole interaction, because most of these pairs run about 12 minutes. So what do we see? Okay. Things like directed gaze. There are 19 directed gazes and, what do you know, most of them succeed. Usually when the one person turns their head the other person actually pays attention and responds. The mean times, et cetera, are not very far off from one another. On the other hand if we look at mutual facial gaze, it's really different. It succeeds roughly half the time. And the rest of the time it doesn't work. Does that mean the other person's not interested in what the first person is trying to find out about them? The answer is I don't think so. Because this task, the task itself is actually very significant and takes a lot of eye space. So if you're making crackers and you have to spread stuff on them and cut up little doodads to put on top of them you actually have to use your eyes a lot to make that process happen. So that consumes one of the participants, namely the student, for a lot of the time. So he seems to miss a lot of the mutual facial gaze requests that the other person gives him. Adjacency pairs. There are about 30, and about two-thirds of the whole bunch succeed. But there are a surprising number that fail. Why is this? Again, I think it has to do with this particular task. When the student is busy doing things, the teacher occasionally is explaining other things about what's going on. And the student is interested enough in what he's doing that he simply doesn't indicate any response at all to what the person said. Did he not hear them? Well, probably not. But he simply doesn't respond, even with head shakes. He doesn't do any of that stuff to indicate what it is that's going on. Finally, there are about 15 back channels. That's not a tremendously big number. 
One of the interesting things is the mean time between connection events, and the thing that's interesting here is that while the mean time is a little under six seconds, the maximum time is huge: 70 seconds. So what's going on here? Well, again, this has to do with the particular task that they're doing. For a long stretch of this interaction, one of them is actually making a bunch of canapés. And the other one is organizing the plate, putting the plate together. So in some sense they're kind of doing parallel play, parallel activities. They don't have to say anything to each other during those periods of time, and they don't. They're not people who know each other well. Whereas, if they were, they might be using that kind of dead air time to make jokes, talk about something that they're both doing, what's happening on Saturday night, whatever. None of that goes on here, because they don't know each other very well. And also possibly because this is an experimental setting. One of the ugly things about putting people in a laboratory is that they freeze up a little bit. And they act in a sort of more formal way than they do in their kind of more normal circumstances. So that may be another effect on this. Nonetheless, there are long stretches where they're just doing their own thing. So the question is are they disengaged? Well, in a certain way they are. But clearly they're not in the sense that they are still both committed to the task that they've undertaken and they're doing what they need to do to get it done. So they're engaged at the level of their task and therefore have this connection to one another. But their connection is not reflected at all in things like what they say to each other, how they look at each other, or the other objects that they have in the room. So clearly a component of the nature of how we see ourselves as connected to other people is mediated by the nature of the activities that we have to undertake. And so the interesting question for us in the long term is how do we bring that to bear. All right. Now we're going to switch gears from what the data tells us and say, okay, let's think about how we get a robot to be involved in this kind of activity. And here's the setting for how it is we're thinking about these problems now. We are not yet to the point of having the robot make canapés. In fact our robot can't do that. But he has hands. He has them like this. He can point to things, because he's got enough degrees of freedom in his arm he can do tapping actions, but he can't pick anything up. So he's never actually going to make canapés. The simple case is that the robot points to something and the person points to something. So these are two sides of the pointing game. There's the human. The human can nod and shake their head. The camera that the robot has makes it possible to recognize head nods and head shakes. We're using the Watson system from MIT to actually do that. And the robot can also nod and shake his head because he's got the right kind of degrees of freedom in his neck. He can say things to the human. But for the moment whatever the human says back is gobbledygook as far as the robot is concerned. It's not because we don't want to do speech, but we've done a simplified version of the game for the moment. We started that originally because the robot's got motors, that you will get to hear, in his arms, that made so much noise it made the speech system impossible. We have thankfully, two weeks ago, finally gotten this fixed. 
We had to take the robot back to its original designers and they put in a different kind of motor, and so we're a lot happier, because it was a really screechy sound and it was horrible to even work with the robot. That's the setup that we have. And what we're doing is we're developing a reusable software module for robots. And it implements the recognition of engagement only. So engagement has two parts to it. It's recognizing that someone else wants you to be engaged with them or is indicating their engagement. And the second part is what do you do about that? That is, do you respond or not respond? And we're trying to develop a set of generic algorithms that are independent of particular robot software details and that can be an independent package, and this is being packaged up as a set of ROS messages, which is the framework that's been developed by Willow Garage. Here's the basic picture in which this engagement module actually sits. There's a whole lot of the rest of cognition going on. The robot has sensors to the world. In our case they happen to be vision. There's some sensory input. We know when it is that the person actually says something and where the sound is coming from, we just don't know what it means. But you can imagine having real speech understanding. The human is gazing and pointing and nodding and shaking, doing all that stuff. And the robot has to make similar kinds of behaviors, and he actually has to do that in terms of actuators. Okay. So what is the engagement recognition module getting? It's getting essentially three sources of information. From the sensory information it's getting what the human actually did: did they gaze, did they point, nod, shake, that kind of thing. It's also making use of what the robot itself is going to do. What the robot actually decided to do is going to be important for this engagement recognition module. Furthermore, we need to know what the rest of the cognition has said about what kind of goals the robot actually has. Does the robot actually want to be engaged? What kind of engagement is he trying to do? And of course how the floor changes, that is, how turn-taking actually happens. But what it gives back are basically two pieces of information. One, did the engagement goal succeed or not? And, secondly, ongoing statistics, over a sliding window, about what the mean time is between connection events. So to summarize, again, the information about where the human looks, what they point at, et cetera, we want to be able to recognize what the human-initiated connection events actually are. And we want to be able to know when they terminate, because that tells us what the mean time between connection events is. So we've got to recognize that process actually happening, and that will be because the sensory information will tell us something about what it was that the human actually did. Okay. Similarly, information on where the robot looks and points and stuff is to allow us to recognize that the robot completed whatever it was that the human asked of it. So if the human wanted the robot to recognize a directed gaze, for example, then the recognition module has to know that the robot actually did it. Okay. Did the robot actually turn and look at it and keep track? That means there's a connection event that succeeded as opposed to one that failed. The engagement goals: we need to know when to begin to start waiting for a human response, if the robot produces some kind of connection event. 
Because the real problem for the robot is when should I stop waiting? I've done this thing, when should I stop doing this, when should I stop expecting the human to actually do something? So that business about the pace of the conversation is going to guide what it is that the robot's going to actually do, and that's why we need to know about the engagement goals. And finally we need to know about floor exchange, because you have to know when it is you're actually supposed to be taking over and doing something. Okay. Now, when we started this, we thought, oh, the engagement recognition module is going to do all these things and all this stuff. As we pared away what really needed to be there, it turned out there only were two things that the engagement module needed to be able to tell the rest of the cognition. And that was: did a robot-initiated connection event succeed or fail, so that the robot could decide when to stop looking and when to stop pointing. So if the human says: I want you to do something, you've got to decide all these things about how you do it -- when you're getting the human to do something you have to decide when to stop. And finally, the statistics, in terms of a sliding window, about current pace, as a way to know whether the connection for engagement is weakening. Now, these are just statistics that are then provided to the rest of the cognitive architecture, because some other component, for example the engagement production component, is going to have to decide what to do with the fact that the pace has changed. It has to make decisions about how it is that it should respond, and it may involve a whole lot of other planning processes than just the production component. So that's why we were providing those kinds of statistics. Okay. The architecture we're using here involves four different recognizers. So there's one for each of the different kinds of connection events I talked about, and we distinguish out the back channel cases as a special case of adjacency pair recognition. And these operate in parallel because in fact, of course, somebody can be saying something as well as moving their head around. So we have to be able to keep track and make all of those things operate at the same time. Now, I'm not going to talk today about all four of these. I'm just going to pick one. I think I'll take the first one, to give you a sense of how one of those recognizers actually works. And so we'll start with directed gaze. If you remember, there was the basic picture of what directed gaze is about. And so there are two different kinds of things that could be going on here. At the start, either the human's pointing at an object or the robot itself has directed the human's gaze to something, and if it's that case, then the robot, of course, is waiting to see if the human will actually respond within the window that it currently has from its notion of what the mean time is. And if that happens, then in fact the statistics can report that it succeeded, and then at some point either the human or the robot will look away and then the activity is over. In the case where the robot's waiting and there's a timeout, because we've now moved past what it currently thinks the pace of the conversation is, or because some other goal comes up like mutual facial gaze, where the robot's not pointing, then we get a failure circumstance. On the other side, if the robot is in the situation where it's the human who is going to point, then this module is saying, okay, the human is waiting for the robot to do something. 
And either the robot's going to succeed in gazing and this module is going to get information from the actuators that it actually did what it was supposed to do, or again there's going to be a timeout, and so this module is going to be able to report that it failed. So it's a fairly simple mechanism. >>: Given that the sensors are noisy, how do you account for certain [indiscernible]? >> Candy Sidner: I'm really glad you raised this issue. This is clearly a very finite state interpretation of all this stuff. And we've been talking about what it would mean to do this stuff with a much more serious model of uncertainty built in. The Watson system is itself, you know, based on probabilistic kinds of models. So it's got -- it kind of in its own way does uncertainty. But at any one of these particular points, you could be correctly or incorrectly making assumptions about -- I mean, you know about the robot's goals very clearly. But when it comes to the human, you know, did they really look at it or didn't they? You get a certain amount of information from the Watson system and you make some decisions. And obviously we'd be a lot better off if we had some probabilistic ways to look at those things. That really turns on having enough data to be able to do that kind of thing. Which we don't at the moment have. So it's an issue that's been kind of racking our brains. In the work I did at Mitsubishi the way we got the data was to run tons and tons of subjects. I had 100 people interact with Mel the Penguin. So we ended up with lots and lots of data. We are not at that point yet. So we just don't have a data source to give us the possibility of looking at other ways besides some very simple finite state models. So I think of this as a bootstrapping process. You get the whole thing going and then you're ready to run subjects and you get lots of people to play with your thing, and then you have enough data to think about other ways to do it. So we'll see. >>: This one will be a little easier. If you recognize that the human is gazing at something, right? What is this thing doing to say the robot ought to respond by ->> Candy Sidner: It's not doing anything. It's not doing anything. That production problem, which is -- that's a production problem. You recognize that the human is gazing. Some other component, namely a production component, has to decide am I going to engage, am I going to do that? Am I too busy doing something else that I can't actually do it? That's not the job of this component. This component's job is to, when that production component decides, okay, I'm going to actually look at it, the job of this component is to say, fine, that happened. >>: Production. >> Candy Sidner: I beg your pardon. >>: You consider it part of the production thing's job to recognize that a request to engage has happened, rather than part of your module's job to recognize that a request to engage has happened. That seems odd as a structure to me. >> Candy Sidner: Let me be clear about this. That the request has happened is something that this component actually needs to know about. >>: Yes, that's right. >> Candy Sidner: So it gets information that says the human did some kind of pointing or something like that. >>: Which some other process may or may not have cause to go take some action on. >> Candy Sidner: Right. So this is the other component that has to decide, okay, something like that happened, I could decide to respond. I could use that connection event and respond to the connection event, or not. 
All this component does is say, fine, when that actually happens or doesn't happen; it keeps track of that information. It's a kind of, you know, it's a kind of secretary, accountant, bean counter. >>: When it fails do you put out reasoning for the failure? Because that's a very distinct thing, the reason we -- we decide not to engage. >> Candy Sidner: We report out which module it was that failed, okay? So we're just talking about directed gaze. So we report out that the directed gaze failed. And in the case where it's an adjacency phenomenon we report, et cetera, et cetera; for each one of those boxes in -- for each one of these things, we report out which component it was that failed as well as the failure. So that's the critical information. It might turn out that we would really want more than that to reason about it, et cetera, and I don't know -- we just haven't reached the point to know whether that's critical yet or not. Okay. So there's obviously these things for everything. We're not going to go through them. As I mentioned, we are providing the results of our recognition component to ROS. Time? >>: No, I have a question, because you're not going to talk necessarily, I think, about the pointing. Do you, like, do you use Watson for the gaze, and do you do pointing recognition also? >> Candy Sidner: We do use pointing recognition. We use a totally different way to do that. >>: Vision based? >> Candy Sidner: A simple tracker that's tracking what the robot's hand is doing and also what the person's hand is doing. We don't use colored gloves at the moment. We thought we were going to have to do that. But it turns out that the algorithms we're using are good enough without that. They're doing very simple kind of geometric stuff. Little blob stuff. It works pretty well for the setup that we're talking about. Okay. So the recognizer we've created is now available as a ROS package if you want to play with it. I realize it's not in Microsoft Robotics Studio, but it's there if somebody wants to look at it nonetheless. Now I'm going to demonstrate the pointing game and I'm going to demonstrate two versions of it. This is the first version. And here the setup will always be the same, so I'll tell you a little bit about what is going on. This is the robot obviously. Two degrees of freedom in his neck, two in his shoulder. One degree of freedom here. He's got eyes that go back and forth in his head. He's got two degrees of freedom in his eyes. He's got two degrees of freedom in his little lips, and he's got eyebrows that go up and down. He has two cameras. The camera you can see up there is the stereo optic camera. That's how he recognizes what the person's head is doing. There also is a camera you can't see which is up here. And this is where we get the information about what the objects are that are on the table, where the hands are, and what he's actually doing. And those are the kinds of various things in the setup. It matters that the plates are different colors, because object recognition is a problem in its own right and we decided to use a very simple algorithm. So having colors makes object recognition really easy. Okay. So there are two versions of this. The first version, done last spring, was an undergraduate project. And it's one giant state machine. It's really cute. You'll see it works pretty well. It's kind of neat. But everything is all kind of mooshed in there together. There are some generic engagement rules but they're all just part of this big ugly finite state machine. 
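To make the flavor of the recognition module concrete, here is a minimal sketch in Python of the kind of logic described above: a sliding-window statistic over recent connection events that yields the current pace (roughly one over the mean time between events), and a simple finite-state directed-gaze recognizer that uses that statistic as its timeout for deciding that a robot-initiated event has failed. All class names, method names, and thresholds here are illustrative assumptions; the actual module is packaged as ROS messages and is not reproduced here.

```python
import time
from collections import deque
from typing import Optional


class PaceTracker:
    """Sliding window over the start times of recent connection events."""

    def __init__(self, window: int = 10):
        self.starts = deque(maxlen=window)

    def record(self, t: float) -> None:
        self.starts.append(t)

    def mean_time_between_events(self) -> Optional[float]:
        if len(self.starts) < 2:
            return None
        gaps = [b - a for a, b in zip(self.starts, list(self.starts)[1:])]
        return sum(gaps) / len(gaps)

    def pace(self) -> Optional[float]:
        # Pace is approximately one over the mean time between connection events.
        m = self.mean_time_between_events()
        return None if m is None or m == 0 else 1.0 / m


class DirectedGazeRecognizer:
    """Waits for the human to look where the robot has gazed or pointed."""

    IDLE, WAITING = "idle", "waiting"

    def __init__(self, tracker: PaceTracker, default_timeout: float = 5.0):
        self.tracker = tracker
        self.default_timeout = default_timeout
        self.state = self.IDLE
        self.started = 0.0
        self.target: Optional[str] = None

    def robot_directed_gaze(self, target: str) -> None:
        # The robot has looked (and perhaps pointed) at `target`; start waiting.
        self.state, self.target, self.started = self.WAITING, target, time.time()

    def human_gaze(self, target: str) -> Optional[bool]:
        # Sensor report of where the human is looking; True means the event succeeded.
        if self.state == self.WAITING and target == self.target:
            self.state = self.IDLE
            self.tracker.record(self.started)
            return True
        return None

    def tick(self) -> Optional[bool]:
        # Called periodically; False means the event failed (timed out at the current pace).
        timeout = self.tracker.mean_time_between_events() or self.default_timeout
        if self.state == self.WAITING and time.time() - self.started > timeout:
            self.state = self.IDLE
            return False
        return None
```

As in the talk, the recognizer here only does the bookkeeping: a production component elsewhere in the architecture would decide whether and how to act on these success and failure reports and on the pace statistic.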
The second version, which I'll show you, the observed behavior is sort of equivalent. It's not really identical. It's much weaker on engagement generation. You'll see that when we look at it, but it has this engagement recognition module set up to make use of that stuff. So now, let's see, okay. This is the first one. So [beep] I'm a little worried. It's not making noise. I wonder why. [Pause] [hello, my name is Melvin. Let's play the pointing game] [pause] you pointed at the orange plate, please... To start out with you'll notice something very interesting about this. The robot's just kind of looking forward in space. And that will change in a minute, but at the very beginning they didn't have him do any kind of natural movements in that way. >>: You have a camera [indiscernible] the hand function? >> Candy Sidner: Yeah, but the. >>: The robot's eyes I notice don't follow whatever the camera on the top is doing. So that's a little bit unnatural. >> Candy Sidner: What the robot should be doing is either looking a little bit down at the subject, at Brett as it turns out, or looking at Brett's hands. We'll see he does that better as it goes along. [point again] [pause] [you pointed at the orange plate. Please point again. ] [pause] [you pointed at the blue plate. Please point again. ] [Pause] [chuckling] [you pointed at the orange plate. Please point again. ] You see the big sigh at the end; that's because Brett was very happy it all worked. We didn't have to do the video again. Okay. Now, we're going to see the more recent version. And what you will notice -- I first have to make this thing bigger. Full screen. What you'll notice about this case -- is it paused? Wait a minute. First of all, stop. You'll also get to hear how noisy the motors are. What you'll notice about this case is that the generation behavior of the robot is much, much simpler. Whoops. Where is the thing that makes it -- oh, there it is. Okay. [hello, my name is Melvin. Let's play the pointing game] [pause] [you pointed at the blue plate. Please point again. ] [pause] [you pointed at the blue plate. Please point at a plate. ] [pause] okay. Thanks for playing. Yea. Now, one of the things you'll notice about this is that Aaron is actually using -- so that's our other student. Aaron is actually using nonverbal behavior to communicate with the robot. So when the robot says please point, he nods yes and then later on he shakes his head no. You'll notice, however, of course, that there's only one plate to point at. And that's because the robot has a much simpler model of what it is actually going to do as part of generation. All right. So what I've shown you today are a couple of things. One, I've defined the notion of connection events for various types of indicators of engagement. And secondly I've given you an architecture that clarifies at least where this one particular piece fits. That is, how we do the recognition of engagement with respect to the much larger part of the architecture. And, thirdly, I've talked about how we have created a reusable module for doing engagement recognition and made it available in ROS. So what's next? Well, lots more data analysis. I have been through all of the videotapes, but not coded them all for adjacency pairs and so forth. So a lot of the things I mentioned, like where the head turns and all that stuff, is done, but I haven't done all the adjacency pair work yet. So there's a lot of work to be done there. 
The other part of the analysis is to look at the question in more detail and start counting in terms of deictic behavior: what kinds of deictics do we actually see, that is, how often do people actually say something where they're pointing to something and they're actually naming the thing in a direct way, and how much do they do this sort of elided crackers while pointing to the cracker box, and pimentos when you're pointing to a jar of pimentos, and so forth. The reason this matters is that when we start talking about robots we're not talking about people anymore, and things that have to do with elision mean you need some other interesting bunch of information about the nature of how objects are structured, what they contain, all this stuff, in order for all that stuff to make sense. So it's a much bigger undertaking. We want to do some studies to actually evaluate this module, but that is going to wait unfortunately until we have a better generation module as well, so that we can actually do a version of the pointing game that fits together nicely. We also want to be able to do the reverse pointing game, where the person tells the robot to point to a particular plate, and that of course means that we have to get speech into a better state than it currently is. I mentioned one of the things that I'm interested in pursuing more, which is this question of parallel activity as a weak form of engagement. Dan and I were talking today about, you know, is engagement kind of an all or nothing thing, or is there this kind of gray scale of it. And if that's the case how do we want to begin to model that kind of thing. The other question is what other kinds of behaviors are there that signal engagement? The one about facial displays, I showed you an example of that, because it occurs in the dataset. We don't really have any way we envision at the moment to be able to deal with those things very nicely. There are people who are working on the recognition of emotion, if you will, in faces. It comes from Paul Ekman's original work on identifying emotion in people's faces, and there have been various vision algorithms that can do some of this kind of thing. But some facial displays are not really about emotion quite so much as they're just some indication of change in the face that we pay attention to. So it's not clear what that really means. And there's the question about engagement being modeled on some kind of scale. I've already talked about this a little bit, and there's the question of how we represent uncertainty from robot sensing in the kind of finite state models that we have. And it's pretty important because human behavior is really, really unpredictable, even in this relatively controlled setting of two people sitting across the table from one another. Okay. That's it. Questions? [applause] >>: I have a question about mutual facial [indiscernible], the high failure rate that you saw there. Was it perhaps that it was not always intended -- for example, looking at another person to see what they're doing, where the other person, knowing that, doesn't need to react and maybe continues what they're doing? So it's not that the intent was to have mutual facial gaze; it was just to continue on. Maybe there are two things happening. >> Candy Sidner: There's a difference in the circumstance. Because of the nature of the activity they're doing, you can tell what somebody is doing by looking at what they're doing with their hands. 
That's very different than directing your attention to their face. And so I distinguish them: when I'm talking about mutual facial gaze I'm talking about looking at their face. And so in this circumstance, if I want to know what the student is doing, I can look at their hands and see that. I look at their face presumably because I have something that I want to convey to them. There's some reason I need to get their attention in terms of their face. >>: Looking at the face can also give you other information, maybe like are they happy doing the activity, are they sad doing the activity. Looking at the hands doesn't really communicate that. So you might be checking for that, but at the same time the person doing the activity might not be ->> Candy Sidner: Yep, that's very possible. So one of the questions about the rate of failure is, you know, what's the typical rate of failure? Nobody knows. Nobody's ever -- there's been lots written about mutual facial gaze. Nobody has ever counted before as far as I can tell. We don't really know about that stuff. I know a little bit about back channels because Tim Bickmore, for example, did a count of back channels. I've seen counts from other interactions with robots, so I know something about what the back channel behavior is there. But we don't really know. >>: Now, I don't know how long the term failure has been used in that context -- not my field -- but I would think that particularly with something like facial gaze, if it's a class of interaction that you don't fully understand yet, the use of the word failure sets up assumptions. >> Candy Sidner: Yes, it sets up assumptions, you're quite right about that. >>: That may not prove out to be the case, but they can have a real impact on how quickly you converge on what it actually is, just because of the assumption that we are all trying to accomplish something. Which maybe we are. But it sounds like you don't have the data to ->> Candy Sidner: I do know that in other circumstances, you know, facial gaze -- people don't look at each other all the time. They move around and the other person doesn't even track them always when they move on things. So there is -- so it is a very complex phenomenon, and you're right, it may be misleading to call these failures, but the problem is that at the moment I don't have a way to even talk about -- because I only have what I can observe the two humans doing. I don't know when the one person looks at the other's face, did they do it because they're just trying to find out if they're happy or sad or bored, or if in fact they intend to get information from them. I might be able to get a little bit. I'll go back, look at the data in terms of this. Do they then say something, for example? So if I do mutual facial gaze with you because I'm about to say something to you and I think it's important that you should pay attention, then clearly that's a case where I really want to do it to convey information. I really need that facial gaze, as opposed to I'm just kind of checking in to see how life is going with you. >>: I would hypothesize that if you ran the same test that you showed at this distance, my guess is that the rate of mutual gaze would skyrocket, because it's a more comfortable distance for me to make eye contact with a stranger, whereas at a three and a half foot distance, in our culture, that's a much less common thing to do with a stranger. I don't know. >> Candy Sidner: Well, they're strangers, remember, but they're strangers that have undertaken an activity. 
It changes things. It's really different if you and I are standing this close to each other. That would be really strange, because we don't know each other and so forth. But once they're undertaking this activity, there is this question about how much that mediates. These are all questions I don't know the answer to. But you're right, there are clearly those kinds of effects. I also have been looking at other data with another colleague, and those people are sitting fairly far apart. So you don't even see their eyes move very much, because they don't have to; the distance is far enough that you don't have to worry about eye movement in order to see what's going on. >>: Have you tried coding interactions in films? >> Candy Sidner: You mean like in movies? >>: Right. >> Candy Sidner: Actors doing stuff? >>: Exactly. In order to -- >> Candy Sidner: No. >>: There the director, as somebody who has spent a lot of time observing human interactions, is attempting to convey a class of engagement interaction, et cetera, et cetera. >>: If you want to get really good insight into what you're doing, sit down with a very good director. >> Candy Sidner: Find out what they do. >>: They use the word pace. I just got off a production, and the director was always on about pace, and his observation was: if you want a faster pace, don't speak faster, that just makes it confusing. Cut the gaps down. That works in plays because you know what the next person is going to say. So we know how to do it there. >> Candy Sidner: And you also know when they're going to finish what they're going to say, so you get more information presumably. >>: In real time, but you want someone with professional experience observing people, seeing whether it looks realistic. >> Candy Sidner: Yeah, right. >>: So you could engage in a very interesting conversation with him about that. >> Candy Sidner: Okay. >>: One of the things that is unnatural about plays and movies, with few exceptions for particular directors, is that people talk simultaneously in plays and movies a lot less than they do in real life, unless you're talking about Robert Altman or Woody Allen, somebody like that who makes a point of trying to mimic real life. >> Candy Sidner: Yeah, that's interesting. >>: A very long time ago I did narrative analysis and conversational analysis with [indiscernible], and we did studies where we annotated somebody else's taped conversation and we also annotated conversations in which we ourselves participated. >> Candy Sidner: Yes. >>: And the annotations are different, because you know what it is that you were trying to accomplish at the time. >> Candy Sidner: Right. >>: It would make it very hard for you to annotate your own interaction with the robot, because you know so much about the system. >> Candy Sidner: Yeah, sure. >>: But have you -- like if you were to take an interaction with Dan's assistant and then sit back and annotate your own interaction, I wonder if it would be interesting and whether you would end up with a different annotation.
>> Candy Sidner: I do know that a standard technique with people interacting with some kind of software system, whether it's got faces or whatever, is that they have their interaction in whatever the laboratory setting is, and then one of the things you do is go back with them through the videotape and ask them about various points -- especially the points you really are interested in, since you've often set these things up -- and ask them to tell you what was going on there. And that often is very revealing about that particular process. It's pretty hard in the case of the human-human data for me to go back and do that now, because it's been a fair length of time. This data was captured about 10 months ago, so the participants probably wouldn't remember any more, and a bunch of them were undergraduates who have graduated, so that's no help either. But it is an interesting question whether, when we collect these kinds of videos, we should plan to look at them quickly, come up with a set of questions, and then go and do that. But frankly, if I had looked at them really quickly, I wouldn't have known what I know now having spent a lot of time on this stuff. Now I would really know what kinds of questions I would want to ask. So there is this problem about what you're looking at. So, Dan. >>: So this is what you were asking about earlier, about the [indiscernible]. I like the notion of connection events and identifying some of these classes. But it also seems like -- I don't know if this is influenced by the particular setup you have, where you [indiscernible] -- have you thought of looking at sort of the disconnection? Right now everything is a positive definition: the space is defined by these positive events that all indicate engagement. So the measure of engagement is dictated by that positive measure, and I'm wondering if it makes sense to have signals for the opposite of that, where the lack of engagement is not just the lack of the positive but actual negative events that -- >> Candy Sidner: Well, all of the failures constitute an interesting class. They're essentially negative events. I say something, you don't respond. I point at something, you don't respond. The face: I look at you expecting to have you look at me so I can say something, and you don't. And so we could in fact -- we haven't done this yet -- try and think about what kind of measure one wants out of the failures. So the mean time between connection events is for the ones that succeeded, not the ones that failed; that's the other side. And I think that's true, it would be useful to look at that kind of thing. You have a question? >>: I was just curious about the use of physical robots, as opposed to watching the robot I'm talking to on TV. I'm sure that's something that people have looked at, thought about, et cetera. Do you have a sense how that distinction then [indiscernible]. >> Candy Sidner: There are people who have looked at this. People are trying to understand what's the difference between having a character on a screen, the kind of thing that Dan has been doing, versus having this thing in front of you. There clearly is something different, and the question is, what is that? And so people have been struggling to come up with ways to try and measure what this is. Mostly they've been able to find out that there are differences in how much people trust the thing -- sort of not very revealing, in a certain way, bits of information.
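To make the measures just discussed concrete -- mean time between successful connection events on one side, and failures treated as negative events on the other -- here is a minimal sketch, not from the talk or from Sidner's system, of how both might be computed from an annotated event log. The event representation, field names, and example numbers are assumptions made purely for illustration.

# Minimal sketch (assumptions only): engagement measures over a log of
# annotated connection events, each marked as succeeded or failed.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ConnectionEvent:
    time: float       # seconds from the start of the interaction (assumed annotation)
    kind: str         # e.g. "mutual_facial_gaze", "pointing", "backchannel"
    succeeded: bool   # did the second participant respond to the behavior?

def mean_time_between_successes(events: List[ConnectionEvent]) -> float:
    """Mean gap between successive successful connection events."""
    times = sorted(e.time for e in events if e.succeeded)
    if len(times) < 2:
        return float("inf")
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

def failure_rate(events: List[ConnectionEvent], kind: Optional[str] = None) -> float:
    """Fraction of attempted connection events that were ignored,
    optionally restricted to one kind (e.g. mutual facial gaze)."""
    pool = [e for e in events if kind is None or e.kind == kind]
    if not pool:
        return 0.0
    return sum(1 for e in pool if not e.succeeded) / len(pool)

# A small made-up log: one mutual-facial-gaze attempt is ignored (a failure).
log = [
    ConnectionEvent(2.0, "mutual_facial_gaze", True),
    ConnectionEvent(5.5, "pointing", True),
    ConnectionEvent(9.0, "mutual_facial_gaze", False),
    ConnectionEvent(12.5, "backchannel", True),
]
print(mean_time_between_successes(log))          # 5.25 seconds
print(failure_rate(log, "mutual_facial_gaze"))   # 0.5

The positive measure and the failure measure come from the same annotated log; the sketch simply splits events by whether the responder reacted, which is one way to operationalize the "other side" mentioned above.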
I originally got interested in robotic stuff because I wanted to look at the problem of how robots point in the world. You can't do that very well. You can't point in this physical situation very well with something that's on a screen, something 2-D. Whether that's a robot on television or a character you create with animation, it looks weird. It's hard to make it work very well. >>: It was interesting that your robot was ambidextrous. I assume that was a technical limitation in how far it could move. >> Candy Sidner: Yes, it's a technical limitation. It turns out that, in order that he won't break himself by hitting himself, he can't get his hands any closer together than this. So these were all stupid things you have to worry about when you have these physical devices. He in fact was designed so he could never clap his hands together, because if he did he'd break his arms. This is a good robot in one sense: his arms are very lightweight, so he can't hurt a person, which is the thing you have to worry about with robots. But he can hurt himself really quickly. And we did it: when he was first mobile, he went into a doorway and broke his elbow joint. This is the kind of thing you have to deal with. So that's a technical limitation. But also, as you may have noticed in the short video clip I showed you, the teacher says, you know about the stuff here, and then when he turns and talks about stuff there, he switches hands. So that's another thing: people are ambidextrous and they do these things. It's not a bad thing. >>: Person to person, people are so one-hand dominant. >> Candy Sidner: Yeah, that they would go the other way. I've seen both among the other subjects that are involved. >>: Is when they switch hands related to the fixation? >> Candy Sidner: I think so. I mean, it's much more awkward to go like this to point at something over here when you've got an appendage that will do it this way instead. But there are people -- but again, it may have something to do with a certain amount of dominance. >>: There's the way the visual field is split in the brain -- the left is split relative to sensation from the left -- so it's kind of easier to coordinate the corresponding hand than the opposing hand. >> Candy Sidner: That may in fact be why it's natural to do this -- you get this effect for things over here. But there are people who will go to the trouble to point like that. I mean, I do have some cases where people point that way. So it may have -- I think that's probably their right hand, but nonetheless. So there's another interesting question in production, which is what's the right way to use the limbs of the robot -- not at the level of conveying engagement, but it is an interesting question. We could make it completely arbitrary. Well, thank you very much. [applause]
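As a footnote to the closing point about how to use the limbs of a robot when pointing, here is a minimal sketch, not from the talk or the actual system, of one simple arm-selection rule: prefer the arm on the same side as the target, with a small dead zone near the body midline so the hands are never driven toward each other (echoing the clap-avoidance constraint described above). The coordinate convention, threshold value, and function name are assumptions for illustration only.

# Minimal sketch (assumptions only): choose which arm a robot points with.
# Targets are given by their lateral offset in the robot's frame:
# negative x = robot's left, positive x = robot's right, in meters.

def choose_pointing_arm(target_x: float, midline_dead_zone: float = 0.15) -> str:
    """Prefer the arm on the same side as the target; near the midline,
    fall back to a fixed default arm so the hands never converge."""
    if abs(target_x) < midline_dead_zone:
        return "right"   # arbitrary default, as the talk notes the choice could be
    return "right" if target_x > 0 else "left"

# Example: the hand switch seen in the video clip, where the teacher points
# with one hand at objects on one side and switches hands for the other side.
print(choose_pointing_arm(0.6))    # "right"
print(choose_pointing_arm(-0.6))   # "left"
print(choose_pointing_arm(0.05))   # "right" (near the midline, use the default)

The dead zone stands in for the real self-collision limits mentioned above; an actual controller would check reachability and joint limits rather than a single lateral threshold.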