>> Eric Horvitz: Okay. It's an honor to introduce Ingrid Zukerman who's visiting us today from Melbourne -- or Melbourne as they say, I guess, as they pronounce it correctly. I was there last Christmas visiting Ingrid there. Ingrid was actually most famous as a visiting researcher at Microsoft Research in 1999. >> Ingrid Zukerman: 2000. >> Eric Horvitz: 2000. The 1999-2000 academic year. And she's a professor in computer science at Monash University. She did her bachelor's and master's work in operations research at Technion and her Ph.D. research at UCLA. Famously, again, she is one of the earlier grad students of Judea Pearl -- the first Judea Pearl grad student working in Bayesian nets and uncertainty, I think, right? One of the early ones. >> Ingrid Zukerman: No, the only one working in natural language, actually. >> Eric Horvitz: But with uncertainty -- oh, certainly that's the case. That's probably still the case until this day almost, right? >> Ingrid Zukerman: But, yeah, I was the first person to actually do Bayesian propagation by hand ever on the planet. >> Eric Horvitz: Okay. It's nice to have an interactive introduction, isn't it? You can check my years and my details here. We're doing a dialogue-based introduction, thematically placed. Anyway, her interest areas are discourse planning and interpretation, plan recognition, and dialogue. So Ingrid. >> Ingrid Zukerman: Thanks. Sorry to interrupt. >> Eric Horvitz: That was planned. >> Ingrid Zukerman: Okay. What I'll be talking about today is a spoken dialogue system we're developing for a robot and, well, about the motivations for the way we're developing the system and where we are so far. So the name of the system is DORIS, and it actually is an acronym, Dialogue Oriented Roaming Interactive System. And DORIS is a dialogue module for a robotic agent in a home, and eventually it will combine spoken and visual information, and actually we used to have this [inaudible] robot, but [inaudible] took it back. But I still like the picture. So this is the sort of thing that we are aiming for. The user might say can you get my mug. We are fixated with mugs. And DORIS might say something like I don't know which is your mug, but I can see a blue mug, and the user might say my mug is the one on the table. And if DORIS can do that, we will be very proud of DORIS. Because this type of dialogue highlights a number of issues. First of all, it highlights the fact that DORIS needs to be able to identify objects in the world and it also needs to -- is this thing -- yeah. It also needs to be able to understand from this that the user's mug is not the blue mug, which is actually a very tall order. So if DORIS can do all that, we'll be very happy. At the moment we are on our way. So what I'll be talking about is I'll be motivating our design decisions, then I will be talking about our interpretation of spoken language. And, yes, what I should mention is at the moment we are not doing any dialogue as such; all we're doing is interpreting what people are saying. We are getting to dialogue. Then I will be talking about how we estimate the probability of an interpretation of a person's utterance, and then how we evaluate that part of the system. Now, what you see in light blue is what, if time permits, I'll be talking about: our more recent work on the interpretation of a sequence of utterances. Because people don't normally express themselves in one very smooth sentence, but they're actually quite segmented. And then conclusion and future work. 
So, first of all, the motivation. So what are the things that we want DORIS to do? We want DORIS to make decisions on the basis of the result of the interpretation process. And these decisions are both dialogue actions and physical actions. Now, what are the results of the interpretation process? They're not just what we understand one utterance to be but also how many alternatives there were, how close they were to the best interpretation. What were the differences between these alternatives -- just one word that was different, or was the whole thing different? And the decisions about what to do depend on all these aspects of an interpretation. We also would like DORIS to modify decisions on the fly if it receives new information. So it has heard an utterance, made some type of decision, and then a new utterance comes. And also we would like DORIS to recover from flawed or partial interpretations. So it got it wrong and then there is new information and how do we recover. So in order to support all these, we want -- we decided that we need two things: We need to be able to maintain multiple interpretations and we need to be able to rank them. So you need multiple interpretations so you can do all that. First of all, you need to know how close they were to each other, what were the differences and so on. If you want to change your mind, you want to know what to change your mind to. And if you want to recover again, you want to have your alternatives in hand. And of course to know where you stand you need to be able to rank all of these interpretations. So these are the two things we want DORIS to be able to do. And Scusi? is the speech interpretation module of DORIS. Yeah. No idea why we called it Scusi? anymore, but that's how it stays. So what I will be focusing on from now on is Scusi?, and Scusi? has to do these two things: maintain multiple interpretations and apply a ranking process. And to maintain multiple interpretations what we have is a multistage interpretation mechanism and each stage maintains multiple options. And so that this thing doesn't explode for us, we have an anytime algorithm to do that. So our algorithm basically keeps track of all these multiple options until time runs out -- only in our case until we have generated a particular number of alternatives -- and then proceeds to pick the top few alternatives. And the ranking process is done through a mechanism that estimates the probability that a particular interpretation matches the speaker's intentions. So these are the two things that I'm going to talk about: our algorithm and the mechanism for estimating probability. So the stages of the interpretation process -- so far we are using a pipeline process that is quite normal. We have a speech recognizer that produces text, syntactic parsing that produces a parse tree, semantic interpretation that produces a concept graph that is based on Sowa's conceptual structures. Now, the idea for this, we know that in spoken dialogue systems a lot of people use semantic parsing. But the idea for this particular structure was that we want to get to the situational context as late as possible so that we don't have to redo a lot of the initial work. And in fact there is one piece of research that we are fond of citing by Knight et al. who say that if you have a situation where people will say unexpected things or are not familiar with the system, you are better off with this type of model rather than semantic parsing. So the first stage is speech recognition. 
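As a minimal sketch of the bookkeeping this multistage, anytime scheme implies -- the class and field names below are hypothetical, not taken from Scusi? itself -- the idea is that every stage's candidates are kept around with their probabilities, so the system can re-rank, change its mind, or recover later:

```python
from dataclasses import dataclass, field

# The pipeline stages: ASR text -> parse tree -> UCG -> ICG.
STAGES = ("text", "parse_tree", "ucg", "icg")

@dataclass
class Candidate:
    stage: str
    content: object                 # the structure produced at this stage
    prob: float                     # estimated probability of this partial interpretation
    parents: list = field(default_factory=list)  # a candidate may have several parents

@dataclass
class InterpretationState:
    """All options generated so far, at every stage. The anytime search keeps
    adding candidates, one at a time, until a budget (e.g. 300 alternatives)
    is reached, and then the ICGs are ranked by probability."""
    candidates: dict = field(default_factory=lambda: {s: [] for s in STAGES})
    budget: int = 300

    def add(self, cand: Candidate) -> bool:
        total = sum(len(v) for v in self.candidates.values())
        if total >= self.budget:
            return False            # the budget (our stand-in for "time") has run out
        self.candidates[cand.stage].append(cand)
        return True

    def top_interpretations(self, k=3):
        # Rank the fully instantiated interpretations and keep the top few.
        return sorted(self.candidates["icg"], key=lambda c: c.prob, reverse=True)[:k]
```

A candidate records all of its parents because, as comes up below, the same UCG can be produced by more than one parse tree, and the same ICG by more than one UCG.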
And, yes, we are using Microsoft. It works very well. And in fact we're using an old Microsoft. Tim Paek assures me that the new one is even better. So the Microsoft API produces a range of interpretations and they are all scored. And we transform the scores into probabilities just using a linear mapping function. Nothing too fancy. >>: What do you mean by that? >> Ingrid Zukerman: Oh. The API produces numbers, and then you take -- you normalize and you turn them into numbers between zero and one. >>: I see. >> Ingrid Zukerman: Yeah. >>: [inaudible] those numbers as probability directly? >> Ingrid Zukerman: Yeah. >>: Okay. >> Ingrid Zukerman: Yeah. Exactly. Then syntactic parsing. We use Charniak's parser. And that parser returns any number of interpretations. And then you get something that looks like that. And in the semantic interpretation, as I said, it relies on concept graphs, which are structures by Sowa that represent entities and relationships between them. And we have broken down the process into two stages: uninstantiated concept graphs and instantiated concept graphs. So uninstantiated are basically purely semantic and instantiated are contextualized in the environment. And as I said before, the reason again even for this breakup is that we want to leave the domain instantiation for as late as possible. So an uninstantiated concept graph, all it does is -- it's basically a different -- it is a different representation of a parse tree. And it is produced directly from a parse tree. However, one uninstantiated concept graph, or UCG, can have multiple parent parse trees if they mean similar things. For example, you could have a parse tree like the blue mug, and sometimes it's a direct adjective and sometimes it is parsed as an adjectival phrase. And both of them will lead to the same UCG. And as I said, one more time, domain independent, so we are not looking at the domain just yet. So, for example, here is find the blue mug in the kitchen for Susan. And what we have is a UCG that has all the right bits directly generated from the parse tree. And the only thing of note is that attributes of the nouns are included in the boxes for the nouns. So if we say the blue mug, the big mug, a mug, all those things are included in the box. They don't have a special node associated with them. The instantiated concept graphs pertain to the domain. So now that we have an uninstantiated we want to know what things in the domain actually can match this situation. So here we have again the blue mug. And now we have a particular instantiation which means a particular interpretation of find, like we have different kinds of find, you can find somebody's office, which means you just locate it, whereas you can find somebody's mug, which means locate and retrieve, and so on, so forth. And all the things have numbers. Even the patient. We've only got one kind of patient and one kind of beneficiary. We have three kinds of cups, two Susans, one kitchen, and so on. So this is a particular instance that we have generated from what we have said. Then of course all the cups and mugs in our domain could be possible instances and all the Susans could be possible instances. >>: So here you generate by just going across all options and get all possible combinations? >> Ingrid Zukerman: Almost. Conceptually it is all possible combinations. In practice you'll see in about five minutes that we reduce the search. But conceptually, yes. It's all. Okay. So how do we search for an interpretation. 
We have -- this would be the result of a search so far. So we have a speech wave, we have a bunch of text. Each text could have a bunch of parse trees. Each parse tree can generate one UCG. But you see, for example, the UCG over there, 22, can come from two different parse trees, and two UCGs can also produce an ICG. So this is the search structure that we generate. And the algorithm works like this. While there is time, select a level to expand, preferring low levels. Basically we want a beeline to a solution. We don't want to meander around the top and not have an answer. And this is done -- like the preference for low levels is done probabilistically. Basically each level has a probability and the lower levels have a higher probability of being selected. Then select an item, preferring promising candidates. And what is a promising candidate? A promising candidate is a good parent. So how do we decide who's a good parent? This is like Darwinian. Okay. If a candidate has not generated any children but it has a good probability coming from above, he's allowed to procreate. Once you have generated a child, you're only allowed to procreate so long as the probability of your children is good. Once you have generated a bad child, you can no longer procreate. That's it. You're banned. So that's a promising candidate. And once you expand an item, you generate only one child at a time. So you don't produce -- even though the parser has generated 50 parse trees, we pick them out one at a time. And, likewise, for the texts we pick them out one at a time. I mean, the fact that this speech recognizer and the parser generate all the options at once, it's an artifact of those products. But if this is truly anytime, we want to pick them one at a time. >>: [inaudible] explain this score, like that you give to each child later or -- >> Ingrid Zukerman: Yes, yes. It's all coming. Here. Yeah. So okay. So how do we estimate the probability of an interpretation? This is the -- I mean, the complete development of this formula is by doing the Bayes voodoo that one does, so chain rule, conditional probability. So this is the bottom line. Now, the fancy Cs pertain to context. So what does this formula tell us? It tells us the probability of an interpretation I given a speech wave W in some context is -- let's leave the summation aside for the moment. The probability of a text given the speech wave, the parse tree given the text, the UCG given the parse tree, and the ICG given the UCG and the context. And it works out with the probabilities. It works out very beautifully. Why do you need a summation? Because everybody can have multiple parents. The text can be a parent. The same parse tree can have multiple text parents. The same UCG can have multiple parse tree parents and so on. And the ICG can have multiple UCG parents. So that is the formula. And now where do we get that stuff? Okay. From speech wave to text from the ASR, from text to parse tree from the parser. This is a 1 because it's algorithmic. And this is a hard part. So this is what we're going to talk about now. >>: How complex are utterances in this domain? Do you actually need all this machinery for the parser, say, or could you achieve a similar result if there were just a few arguments with [inaudible]? >> Ingrid Zukerman: Okay. Okay. The thing is -- yeah. We have done trials with people, and they meander all over the place. And in fact you're going to see at the end -- and this is -- this is why -- this is not a command and control situation. 
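Written out, the estimate described a moment ago is roughly the following; the notation here is a reconstruction from the spoken description, so treat it as a paraphrase of the slide rather than the slide itself:

$$\Pr(I \mid W, \mathcal{C}) \;\approx\; \sum_{T,\,P,\,U} \Pr(T \mid W)\,\Pr(P \mid T)\,\Pr(U \mid P)\,\Pr(I \mid U, \mathcal{C})$$

where W is the speech wave, T a text, P a parse tree, U a UCG, I the interpretation (ICG), and the calligraphic C the context; the sum is over all the parent paths through the search graph that lead to the same interpretation.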
Yes, you could train people to say, you know, tea, Earl Grey, hot. For [inaudible] geeks that means something; for the rest, never mind. But most people actually meander all over the place. And we are telling them pretend you are talking to a robot. So even with all this machinery, you'll see later, we can process only a fraction of what we are being told. And there is another piece of research actually from the last InterSpeech that the older people are, the more they meander in their discourse. So, yeah, if this is going to assist anybody of age... Okay. So now how do we calculate this probability? So we have three different sections and it will all become clear. The first section is an intrinsic match probability between a node in the ICG and the corresponding node in its parent UCG. So if you have a match, node for node, like we saw before, you want the probability that the person uttered a particular description when they intended a particular object in the domain. This is the first section here. The second section is a structural probability. It's the probability -- I mean, in principle we would like to calculate the probability of a complete structure, but we don't have that. So what we use, we use trigrams to represent the structural probability of an ICG. And this is the probability of a node given its parent and its grandparent. And we'll see an example in a moment. And finally we have the probability of a node. This is an instantiated node given the context. And at the moment we're not doing anything with this other than our knowledge base of nodes. In the future we want to include visual salience and discourse salience. So this is where you will put all of that information. But at the moment all we have is do we know about this thing or not. Okay. So now we are going to go into how we calculate -- how we make these calculations. But before that we have some simplifying assumptions. The robot is co-present with the user and all possible referents, and it has an unobstructed view. And each object is uniquely identifiable. Now, the first two assumptions we make because at the moment we only want the robot to make linguistic decisions. We don't want it to make a decision that says there is nothing really good in this room, maybe I should go look elsewhere. Or maybe I should move a little bit to get a better look. I mean, these are decisions that will have to be made in the future, but at the moment we don't want to deal with this. So we have co-presence and full view. Unique identification. That has to be removed if you're going to deal with a vision system. Because realistic vision systems don't have unique identification. And in fact we have a version of this system that we call visual DORIS that has a simulation of a vision system. Okay. So now we're going to go for the features that we are using in our comparison. So intrinsic matches, which is what is the probability that we said a particular description when meaning a particular node, pertain to lexical item, color and size. And, yes, we can extend it to texture and shape and other things. But once you figure out lexical item, color and size, you don't get brownie points for extra things. And structural probabilities pertain to ownership and location. So these pertain to the relationships between concepts. And we need to distinguish between intrinsic and structural because intrinsic are features of a thing itself. So if I want the blue mug, this is it. The mug has to be blue. 
But if I want the mug on the table, I have to consider all the mugs I know about and all the tables. So I have to consider these trigrams. Okay. So this partially answers Dan's question, are we doing everybody against everybody when we're building these structures. No. We want to find good candidate objects. And what is a candidate object? A candidate object would be also something composite, like the mug on the table. So to do that we do it in two stages. The first stage we make a list of candidate objects for each UCG concept U. For each mug, each table, each concept that was mentioned, we make a list of candidates. And we estimate the match probabilities of those candidates. So if I ask for a blue mug, I look at all the cups and mugs and dishes and the ones that are blue win. And then we rank these objects in descending order. Next we make mini-ICGs of candidate composite objects. So now we want to -- basically we want to disambiguate referring expressions. And the mini-ICG is an ICG just for a referring expression, no verb. So the mug on the table near the lamp in the corner. We want to find candidates for that. So we combine individual candidates into composite objects and we estimate their structural probabilities and rank those. So we retain -- at the end of this game, we retain candidates for composite objects. And now we can put them together in a larger utterance. So in principle it's everybody against everybody, but in practice it's only different composite objects against other composite objects. Okay. So what is the probability of a match -- so now we are going -- we are drilling down to intrinsic probabilities. So what's the probability of a match between a UCG node and a candidate instantiated concept. So the probability that we meant K and we said U. And as I said, at the moment we are only doing lexical item, color and size. So if we assume some feature independence, we can break it up like this. And, yes, color is not a hundred percent kosher because color also depends on lexical item. Red hair and a red mug are totally different reds. But we are looking for household objects. So we are pretending color only depends -- so the probability that you called the thing with a particular name, that you indicated its color and that you indicated its size. Now, size does depend on what you asked for. A small elephant and a small mug are two different sizes. And then, again, we have heuristics to estimate each probability and we do a linear mapping. Sorry. We have heuristics to estimate scores and we do linear mapping to estimate probabilities. So let's have a look how we estimate the scores. For lexical item we use one of the many WordNet similarity metrics, and we picked Leacock and Chodorow's because it gave the best result. And it didn't crash. >>: After a massive search through all possible methods? >> Ingrid Zukerman: No, like -- no, what's his name, Pedersen? He has a list of like six different methods, and we just tried them out and this one was the most sensible. So it returns a similarity score between the name that you used and the name that you might call a particular object. So a mug is called a mug and then you might call it a cup, a mug, a dish, et cetera. And it returns a similarity score, and this is matched against the maximum possible similarity. >>: [inaudible] >> Ingrid Zukerman: Leacock and Chodorow? WordNet distance, distance on the tree and maximum distance. But there are all sorts now. >>: But most of these [inaudible]. >> Ingrid Zukerman: Yeah. I mean, yeah. 
We didn't use the [inaudible] based ones, just that one seems to work. And color, our model for color similarity is based on the CIE model, which the authors claim is psychologically grounded. So we use the Euclidean distance, that's ED, between the color that was stated and the actual color of the object based on these L, a, b coordinates. And L, a, b stands for brightness, some number in the green-red spectrum and some number in the blue-yellow spectrum. And for size this is actually a fairly new size function -- well, a different one. We map -- we have heuristics that map a requested size into two states: large or small. And then we estimate the probability that an object could be designated large or designated small. And this probability depends on the average mu -- the average size of objects of that type -- and sigma lex, the standard deviation for objects of that type. Where did we get these numbers? The Ikea catalog and Amazon. So if you want a bed or a mug or whatever, we just took a lot of mugs, average sizes. And then you have the probability that something could be called large or something could be called small. So now that we have all these probabilities, how do we combine them. According to Dale & Reiter and Wyatt and our own survey, people use the type most often, absolute adjective second and relative adjective third. So color is an absolute adjective and size a relative adjective. So we have a weighting scheme. We raise each probability to a particular power and we did experiments to find out which numbers give the best result. And that's what we found. And, yes, you could use machine learning to do that, but we didn't. We just tried a few things until we got something that we liked. Okay. So how does this whole thing work? As I said, we're fixated with mugs. And -- >>: It sounds like you can do whole studies on each of these components. >> Ingrid Zukerman: Yes. >>: [inaudible] component, you're finding lots of graphs and tradeoffs and optimizations. >> Ingrid Zukerman: We did a few. But, yeah, the thing is that there is always a tradeoff between moving on and dwelling on a particular issue, and we move on. But, yeah, if I had unlimited manpower, I would dwell. So say the large black mug. And we want to know the probability that cup 1 could be called a large black mug or mug 1 or mug 2. And then this is what we get. We get for cup 1 on lexical we're doing medium, because we said mug and the best name for it is a cup. Color is good. Size. Now it's kind of small. Mug 1 is doing bad on color and mug 2 is doing pretty good. So in this case, mug 2 is the winner. Yeah, it's a clear winner. But if we don't have a winner, the question is what would we do? And this is where we get into the next part of the project that we haven't started, which is, is it worth asking. I mean, how crucial is it to this person to have a large black mug? So this is the next part of the project. Okay. So now we finish with our intrinsic probabilities and we move to the structural one -- >>: And you're asking in realtime or asking as part of training? >> Ingrid Zukerman: No, asking in realtime like a person, you know, get me the large black mug. DORIS goes to the kitchen. Should I go back and ask -- >>: If the black mug still has coffee in it and the other mug is empty in the cupboard, there's a big difference. >> Ingrid Zukerman: Depends if they want coffee or water. >>: But which one -- >> Ingrid Zukerman: Yeah. >>: It's not just mug, it's the actual experience and location and so on. 
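As a very rough sketch of how the three intrinsic scores might fit together: the exponents, the normal-CDF size model and the toy numbers below are illustrative assumptions, not the system's actual values.

```python
import math

# Illustrative exponents only: the talk reports that the type (lexical item)
# matters most, then an absolute adjective (colour), then a relative one (size).
W_LEX, W_COLOUR, W_SIZE = 1.0, 0.7, 0.4

def lexical_score(similarity, max_similarity):
    """similarity stands in for a WordNet measure such as Leacock & Chodorow's
    between the word the speaker used and a name for the candidate object,
    scaled against the maximum possible similarity."""
    return similarity / max_similarity

def colour_score(said_lab, object_lab, max_dist=100.0):
    """Closer in CIE L*a*b* space gives a higher score; ED is the Euclidean
    distance between the stated colour and the object's actual colour."""
    ed = math.dist(said_lab, object_lab)
    return max(0.0, 1.0 - ed / max_dist)

def size_score(requested, size, mu, sigma):
    """Probability that an object of this size would be called 'large' (or
    'small') for its type, with mu and sigma taken from catalogue data.
    The normal-CDF form used here is an assumption about the heuristic."""
    p_large = 0.5 * (1.0 + math.erf((size - mu) / (sigma * math.sqrt(2.0))))
    return p_large if requested == "large" else 1.0 - p_large

def intrinsic_match(p_lex, p_colour, p_size):
    # Weighted combination: raise each probability to its weight and multiply.
    return (p_lex ** W_LEX) * (p_colour ** W_COLOUR) * (p_size ** W_SIZE)

# e.g. "the large black mug" against one candidate mug (made-up numbers):
p = intrinsic_match(lexical_score(2.9, 3.6),
                    colour_score((20, 0, 0), (25, 2, -1)),
                    size_score("large", 0.12, mu=0.10, sigma=0.03))
```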
So it can be definitely worth asking sometimes if you don't know. >> Ingrid Zukerman: Oh, absolutely. >>: I'm surprised you actually [inaudible]. >> Ingrid Zukerman: No, no. I say like sometimes you want to ask and sometimes you don't. But it's a nontrivial decision whether to ask or not. Okay. Ownership. Very simple heuristic function. Like this is what you would have in your tree, whether, you know, Susan's mug. So you want to know the probability that -- of this particular bit of the structure, whether Susan owns this mug, and it's zero if you know for sure that Susan doesn't, one if Susan does, and some alpha -- >>: And here's another [inaudible] in the future it'll be you're at a party, get rid of all your little "this is my wineglass" dangling thing and have the robots track it, where's my mug. That's your motive [inaudible]. >> Ingrid Zukerman: Your RFID tag. >>: [inaudible] just a little camera up in the ceiling, they can find your mug for you or your glass. You don't just sort of wash one or get a new one. So save on all those little cutesy little dangling jewelry you put on your mugs. >> Ingrid Zukerman: Yep. So that is a very simple heuristic. Our positional stuff is -- well, I think it is kind of cool. Again, our assumption is all objects are rigid, so that means that they can be represented by a box, and they have three dimensions: length, width and height. So if somebody wants their pants, we assume their pants are not just draped all over the place but they're in a box. So we have heuristic functions for locations such as on, under and above. And for that what we do is if the directional semantics is satisfied, then the probability of a particular location is proportional to the shared surface area. And I'll show an example in a moment. We have in, which is proportional to shared volume, and near, which is based on Kelleher's gravitational attraction. So just to clarify how does this work, if I want the book on the table, directional semantics means that the zed coordinate is actually correct, the zed of the book is the table plus the height -- the zed of the table plus its height. Now that I got the directional, then I'm going for shared surface area, which means the probability that the table is on the book is the shared area -- sorry, that the book is on the table is the shared area between the table and the book over the minimum of the two. Because you got -- you could have a book overhanging slightly, a table, and it's still -- the book is still on the table. So this is for on. And now that we have seen an example for on, we can go back to in. So that's shared volume. And near is basically the idea is that near objects -- well, the nearer an object is, the better, but the measure of nearness for bigger objects is more slack than for smaller objects. The pen next to the laptop is one thing; the sofa next to the table doesn't have to be like this. Okay. So now we start our evaluation of this part of the project, and we have a few experiments. We had more, but I'm talking about these. So the first experiment was referring expressions only with intrinsic features. The second was for referring expressions, intrinsic and structural. And the third was complete utterances. In all experiments we generated at most 300 subinterpretations, so everything: texts, parse trees, UCGs, ICGs. We set it to 300 as the limit. And one speaker spoke all the utterances, and that was Enes. And, yes, people complain about this, why don't we have multiple speakers. Because you have to train the API. 
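Going back to the positional heuristics above, here is a minimal sketch of the "on" computation, treating objects as axis-aligned boxes; the Box layout and the tolerance value are assumptions, and the analogous "in" (shared volume) and "near" (gravitational attraction) functions are not shown.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A rigid object approximated by an axis-aligned box: position of one
    corner plus length, width, height (the rigid-object assumption above)."""
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float

def overlap(a0, a1, b0, b1):
    # Length of the overlap between intervals [a0, a1] and [b0, b1].
    return max(0.0, min(a1, b1) - max(a0, b0))

def p_on(obj: Box, support: Box, tol=0.05):
    """Score for 'obj on support': the directional test checks that the object
    sits at the support's top surface (z of the object = z of the support plus
    its height); if so, the score is the shared footprint area over the smaller
    of the two footprints, so a book overhanging a table a little still counts."""
    if abs(obj.z - (support.z + support.height)) > tol:
        return 0.0
    shared = (overlap(obj.x, obj.x + obj.length, support.x, support.x + support.length) *
              overlap(obj.y, obj.y + obj.width, support.y, support.y + support.width))
    smaller = min(obj.length * obj.width, support.length * support.width)
    return shared / smaller if smaller > 0 else 0.0

# e.g. a book resting on a table, overhanging slightly:
table = Box(0, 0, 0, 1.5, 0.8, 0.75)
book = Box(1.4, 0.1, 0.75, 0.3, 0.2, 0.03)
print(p_on(book, table))   # shared footprint 0.1*0.2 over the book's 0.3*0.2
```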
You guys know that. >>: [inaudible] >> Ingrid Zukerman: Hmm. Well, somebody has to train it. >>: Well, I mean, you can just use the defaults. >> Ingrid Zukerman: The defaults are American. >>: [inaudible] >> Ingrid Zukerman: Yeah. It didn't work, and that's why poor Enes had to read Moby Dick to it for two days. >>: I think that's a good point, essentially like it would be interesting to see if you vary the quality of the recognition, you know, because combining a number of factors, there's the uncertainties that come from the recognition and then from the parsing and the understanding. [multiple people speaking at once] >>: [inaudible] from a long time ago on exactly this question that involved asking -- this is many years ago. I don't remember the citation at all, but getting people to chat with a -- in a Wizard of Oz setup with the [inaudible] that had varied synthesis quality. So they were actually talking to a human being behind the curtain, but the synthesis quality varied from human quality to really tinny, cheesy, and there were huge differences in how people talked to the person behind the curtain. When it was tinny, poor quality, robotic sounding, they shortened things. They used fewer adjectives, much simpler syntax. And when it was human, they just [inaudible]. >>: This is the listening side, though, right, is what I'm talking about. >>: Yeah, that's true. >>: I mean, one of my big concerns is -- or I think so far it's fabulous and very interesting. The concern is we're seeing intelligence and the glimmer of hope when it comes to very carefully crafted, very carefully spoken good recognitions more generally. And what happens when it's more -- when there's noise in the recognitions and people talk in -- really out-of-band ways, will the various safety nets and expansions you have from coordinate similarities and so on, is the structure for the physical environment enough to collapse it down at that point? >> Ingrid Zukerman: That is -- >>: Big question, I think. >> Ingrid Zukerman: Oh, yes. Totally. I mean, at the moment -- I mean, we are poor over there. So the Microsoft API was free. And we had to train it. So Enes was it. Because Enes' accent in English is better than mine. He's from Bosnia actually, so -- >>: He's from Boston? >> Ingrid Zukerman: Bosnia. >>: Oh, Bosnia. >> Ingrid Zukerman: Yeah. >>: I said Boston? >> Ingrid Zukerman: No. No, no. >>: Boston is a bit different than Chicago. >> Ingrid Zukerman: No. But I totally agree that we should have multiple speakers. What we did do, though, is people did generate the utterances and Enes just read them out. It wasn't all of Enes's [inaudible]. >>: I see. I see. >> Ingrid Zukerman: No, Enes was the reader. >>: Okay. >> Ingrid Zukerman: No, no, a lot of the utterances belong to other people. >>: How are they generated by the other people? >> Ingrid Zukerman: All will come. [multiple people speaking at once] >>: [inaudible] transition into a Bosnian accent -- [laughter] >>: -- [inaudible] recognized. >> Ingrid Zukerman: Okay. So this is our first experiment. Yes. This is Ikea. And Ikea in Australia. And we had six worlds and we asked eight people to identify labeled objects. And the worlds were the pen world, the mug world, the furniture world and so on. And we would say like describe D. Or describe L. And they had to generate a description. So there were in total 57 possible referents. And people generated 178 descriptions out of which we could use only 75 because people do not follow instructions. 
We said our system can do at the moment lexical item, color and size. Nothing else. They asked for the rectangular tray. Okay. So -- [multiple people speaking at once] >>: In reality that's always going to happen. Our systems will always be incomplete. And what's your sense for if you're given a million dollars by the -- >> Ingrid Zukerman: Netflix. >>: No, say a million dollars from some government in Australia to do general research to fix that problem, I'm assuming incompleteness, how would you solve that problem? >> Ingrid Zukerman: Well, I would go for the number. Like assuming I get that, we do a corpus study: you say, okay, so people talk about -- well, there are two issues: one, what they talk about and, two, what the robot can recognize. Because people spoke about texture. There is no way on earth a robot can recognize the velvet couch as opposed to the woollen couch. >>: Ingrid, what about three, a robot that understands [inaudible] doesn't know everything it's going to hear. >> Ingrid Zukerman: That's fine. That we can deal with. I mean, we can't -- but we -- well, no, okay. So the thing is the first order of business is we do the corpus study. So things like shapes. So I would incorporate the other things that we're not incorporating because we're not getting brownie points for incorporating them. After that has been done, the rest of the stuff is in the vocabulary. It's just not stuff that we can deal with. So then you say, okay, you said the woollen. Sorry. I cannot identify wool. >>: [inaudible] these other constraints. >> Ingrid Zukerman: Exactly. So it is doable. >>: [inaudible] learn what woollen means later. >> Ingrid Zukerman: But still visually the robot would have to go up and touch it and do a tactile identification of wool as opposed to velvet. I think people will have to get used to asking DORIS for things -- I mean, once you tell it it has no tactile sense, it can only see, then I hope people will become sensible. >>: Just popped in, I just want to say Canberra. A million dollars from Canberra. >> Ingrid Zukerman: Canberra? Canberra isn't giving anybody a million dollars. [laughter] >>: [inaudible] >> Ingrid Zukerman: Canberra just gave every Australian $900 to spend on plasma screens, but we can talk about that later. Okay. So, in any event, this was the experiment and we ended up using only 75. There were some duplicates. But, yeah, people were a bit naughty that way. So these were our results. ASR is 26 percent, so it's not like -- maybe the next version will do better. And gold reference in the top 1 out of the 75 we got 49. But in the top 3 we got 87 percent. So we're actually overcoming ASR error for top 3. >>: What does gold refs mean? >> Ingrid Zukerman: Oh. We knew what was the gold reference, like what we want -- the gold is the correct one. So we say like we spotted the correct one in 87 percent of the cases for top 3. We didn't find one. Which one was the one we didn't find? Oh, poof. DORIS didn't know poof. And -- oh, yeah, and also the API had trouble with Australianisms, like ramekin and biro. Not a chance of getting it to recognize that. >>: What does it mean? >> Ingrid Zukerman: Ramekin? It's a cooking -- it's a little cooking dish. And a biro is a pen. >>: You say it like it's common. >> Ingrid Zukerman: In Australia it is. That's the thing. We're not used to it. >>: [inaudible] ramekin, isn't that the thing that they do the creme brulee? >>: Yes. There you go. >>: I'm showing my [inaudible] my wife would be upset to hear my conversation right now. 
>> Ingrid Zukerman: Well, so we didn't find the poof. And the average rank was 0.82. So what is the rank? Rank zero is best. >>: And what's poof? >> Ingrid Zukerman: Poof is the little round chair, like a -- do we have a poof here? >>: Is it Europe also? >> Ingrid Zukerman: This is a poof. Page is a poof. >>: Ottoman. >> Ingrid Zukerman: Well, some people call it a poof. >>: It's a little round -- it's a little round -- >>: Oh, really? >>: Yeah. >> Ingrid Zukerman: Jake could also be a poof, no? >>: Now, is that used here also and I just don't know about it? >> Ingrid Zukerman: There you go. You've been illuminated regarding furnishings. [laughter] >>: [inaudible] >> Ingrid Zukerman: Yeah, but a different accent. >>: No. >> Ingrid Zukerman: No? >>: Not in the U.K. >> Ingrid Zukerman: In Australia it's two syllables actually too. >>: Poof? >> Ingrid Zukerman: No, poofter they call it. >>: Oh. >> Ingrid Zukerman: Okay. We proceed. They're recording this. >>: It's PG-13 in here. >>: So gold references, those aren't the actual ultimate like semantical representation that you generate or -- I don't understand how [inaudible] means -- >> Ingrid Zukerman: The ASR -- it means that the top text returned by the ASR was not the correct one in 26 percent of the cases. >>: And it sent those, though. >> Ingrid Zukerman: Yes. >>: But you're only focusing on a small number of words in those sentences, right? >> Ingrid Zukerman: Well, actually we use the entire vocabulary, whatever [inaudible] has, whatever the ASR has, yeah, we're not restricting. >>: No, but I mean the bits that you're actually focusing on for the behavior of the robot, it's typically a small subset of -- a substring probably in a longer utterance, so much of it could go wrong, and as long as the -- >> Ingrid Zukerman: Well, for this experiment it was only referring expressions, like, you know, the pink tray, the large blue tray. We're getting to the bigger expressions in a moment. So this was only referring expressions. And we got the 26 percent error rate. So in totality we're doing pretty good. And the rank, the rank means like your best rank is zero. So average rank is 0.82, which means the best interpretation is either the top -- the gold is either coming up top or second. So it's not bad. Okay. Next experiment. We had a simulated house. This is a top view of our house. There are five people in the house, 54 objects distributed among four areas. And this is one area of the house. Now, these descriptions we generated, and I'll tell you why. Because we are just -- I hope we were justified, but please complain -- we did the same -- experiment 1 we also did with our own utterances, and our performance was the same. Whether we did it or other people did it, the system performed the same. So we figured okay, whatever we're generating is not outrageous and it's not particularly tailored to the system, so for this one we can do our own. And we generated noun phrases comprising between one and three concepts. So things like this, the wardrobe under the fan, the mug near the book on the table, Paul's book. These are the descriptions we generated. And we had -- now we compare against the baseline that was the top interpretation for each stage, so just take the top. ASR error was higher now. We are sitting on 30 percent ASR error. And the baseline had like -- so taking the top, in 49 percent of the cases the gold is top 1 -- well, of course if you only take the top, top 3 is going to be the same. But not found is 46. 
For Scusi?, even the top 1 is already at 91 percent. So top 1 performance overcomes ASR error. >>: I'm not sure I understand what you mean when you say that [inaudible] overcomes ASR error. >> Ingrid Zukerman: Well, if you are taking top interpretation, like top ASR, top parse tree, top everything and in 30 percent of the cases your top ASR is wrong, then that's it. There's no way I can find the right thing. >>: Why not? Because the ASR could be wrong in like -- well, I guess because of the nature of their sentences, but if the ASR number is sentence level, I could have, you know, like a -- like the bits that you're using like we're saying could be all okay, something else could be incorrect and then I go and parse it and I generate just using the top and I still get the right result. Wouldn't that be possible? >> Ingrid Zukerman: How could you do that? Like if you said the, whatever, like let's -- >>: I want -- yeah. So let's say -- >> Ingrid Zukerman: The mug near the book on the table. And the ASR got the mud near the book on the table. >>: So the ASR got mug near the book on the table without the initial "the." Okay? >> Ingrid Zukerman: Right. >>: You count that as an ASR error. >> Ingrid Zukerman: Yes. I would come to the center, but it also doesn't happen. You'll get an article there, but, yes, I would count it as an error, agreed. >>: And so then if I go with that through the whole chain, I might still recover the right thing. >> Ingrid Zukerman: Yes. That is true. And -- okay. We should check what kinds of errors we are getting. That is true. But the fact is, I mean, if you look at this, you get 46 not founds. >>: Right. >> Ingrid Zukerman: So, anyway, this is what you are getting. So whatever is the case -- and the ICG is not taking into consideration articles and things like that. So it's really not finding the right thing. Whereas, yeah, Scusi? is doing okay. We are happy with Scusi?. And the last experiment is complete utterances, full requests. We picked out a hundred utterances, and this is -- we actually ran an experiment -- okay. First I tell you what we have. So we have a hundred utterances, 43 declarative, 57 imperative. We have 135 concepts in the knowledge base, objects and relations. And we actually collected a corpus from people. We asked people to interact with a robot -- Michael was the robot -- in two scenarios: tidy up and help me. Tidy up they have to ask the robot to clean the house, and they have to say, you know, pick up this, collect that, make the bed, et cetera. And help me you have to be pretending to be disabled in some way and you have to get the robot to do things for you. And people actually interacted with Michael in any way they could. So -- how do you call that thing that Microsoft does, the chatting, the Microsoft chatting thing, MSN? Messenger. Yeah. That. Messenger, Google Chat, whatever people could come up with. And they interacted with Michael. So it was typed interactions and then Enes read them. So our whole dialogue was a thousand utterances and we extracted a hundred out of them. And this is our tidy-up scenario. And I don't know how clearly you can see, but this house is very messy. Can you see the chair is out of place, there is a puddle near the door, there is a puddle in the kitchen, there are plates on the floor. Can you see them or -- [multiple people speaking at once] >> Ingrid Zukerman: Yeah. And these are sample utterances, open the door to the cupboard in the lounge, the book is on the desk. 
And what we want them to achieve is this, where the house is all neat and tidy. And so the help-me scenario, again, you know, people -- I forgot my pills, where did I put them, and so on and so forth. So in this experiment ASR error is lower. I have no idea why the ASR error fluctuates so much. And ASR error is lower and again we have the baseline and Scusi?. And out of 100 utterances, we have again -- we're beating baseline and the rank is pretty good and we have only seven not found as opposed to baseline 47 not found. So again top 3 we're doing better than ASR error. And here I collected some of the utterances that we had trouble with. Henry's little bowl needs a wash. So I suppose that would be an Australianism. That wouldn't be American English. Wash the shirt on the sofa in the basin, and it had trouble with inside the wooden box. Okay. So how are we going for time? Okay. I can do a fly-by of interpreting sentence sequences. Bill was interested in that. This is like brand-new stuff. We're still debugging it. So okay. What happened is that people do not speak in these smooth sentences but they actually speak in segmented form. And so what we want to do before -- okay. Sorry. Before we proceed to dialogue, we want to interpret a sentence sequence. And after that we'll proceed to dialogue. So what we do is for each sentence we generate UCGs. We determine their mode, declarative or imperative, and we determine coreferents. Then we generate UCG sequences. So using the most promising UCGs for each sentence, so again combinations. We generate mode sequences. We generate coreferent sequences. And while there is time we continue playing this game and selecting the best. So this might clarify this concept. So this is actually one of the things we -- one of the utterances we collected. Go to the desk near the computer. The mug is on the desk near the phone. Fetch it for me. So then after parsing -- these experiments we did only with typed input. We didn't even dare go for speech. Even with typed input we were having problems. So after parsing you get the two prepositional attachments. The desk could be near the computer or the going could be near the computer. Again, the mug is on the desk near the phone, the mug is near the phone or the desk is near the phone. And fetch it for me. And again prepositional attachment problems. So these are top 2 for every sentence, top 2 parses, or structures. Now we can generate two UCG sequences. Then we have the modes, and the modes are pretty clear here: imperative, declarative, imperative. The second mode is very low probability in all cases here. And then we want to do coreference resolution. Like which desk? Is it the self-reference desk or the desk in the previous utterance? What is "it", the mug or the desk? So we play this game and generate a lot of possibilities and what's very pleasing is that we can actually continue our probabilistic calculations and extend them to multiple sentences. So if we continue with the same formalism, we actually get the probability of a UCG given the text, the mode given the parse tree and the text, and the [inaudible] coreferents given all the parse trees. And the first component we get from a single sentence, like that's what we've done up to now. The second component, the mode, declarative or imperative, we get from a classifier based on the corpus, and the coreferents we get from heuristics plus the corpus. 
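Roughly, the extended estimate just described factors per sentence as follows; again this is a reconstruction from the spoken description rather than the slide:

$$\Pr(I_{1..n} \mid T_{1..n}) \;\approx\; \prod_{i=1}^{n} \Pr(U_i \mid T_i)\;\Pr(m_i \mid P_i, T_i)\;\Pr(c_i \mid P_1,\ldots,P_i)$$

where, for sentence i, T_i is its text, P_i its parse tree, U_i its UCG, m_i its mode (declarative or imperative), and c_i its coreference assignment.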
And the heuristics we plan to replace with corpus-based information, but at the moment that's what we are using. So just quickly. Coreference resolution we use -- we can do pronouns, [inaudible] and just noun phrases, like the book. And because we don't have a lot of data, we do our Bayesian voodoo that we like to do. And we end up with this formula for coreference resolution. So this is the probability that referent I refers to -- that referring expression I refers to J. Sorry, sorry, sorry. No. No. This is the probability of a particular referring expression J in sentence I. Sorry. And now we have the type of the referring expression, the probability of referring to a particular sentence, and the probability of referring to a particular noun phrase in that sentence. And, yes, we had to break it up that way because we just don't have enough data to do all the noun phrases in all the sentences. So you break it up into referring to a previous sentence and referring to a noun phrase inside that sentence. Okay. So this is just very -- the math very, very quickly. And now we can go for the evaluation for which we need to do a lot more work. So we had an experiment where we asked people to pretend they've gone to a meeting and they've forgotten something and DORIS has to get something for them. And we got -- again, and this was a Web experiment. So people just typed stuff over the Web. And we got 115 requests and the vast majority had more than one sentence, and some people went to town and had up to nine sentences. Now, this is -- refers to what Bill said that these -- we actually had to make manual changes to even be able to process most of the sentences. Like things like I need you to fetch. It's just fetch. It's not the -- or, well, and we don't do composite nouns at the moment. So coffee table was just table. And things like that. So we did all our systematic changes and, yeah, how -- well, we have a plan -- we have some hopes to apply machine learning techniques to translate from people English to DORIS English. And this is an example of what people gave. DORIS, I left my mug in my office and I want the coffee. Can you go to my office and get my mug. It is on top of the cabinet. It is on the left side of my desk. This is actual input. And what we generated was this. This is what we can process. So -- yes. >>: What are the bottlenecks? What stops you from processing the [inaudible]? >> Ingrid Zukerman: Well, some things are just hacks, like DORIS please, things like that. [inaudible] that's all. I mean, those are little things. I left my mug in my office. This is narrative. Right? It really means my mug is in my office or I think my mug is in my office. But there is a lot of irrelevant narrative where what it indicates is a declarative sentence about where something is supposed to be. So that is a problem. Then other things -- >>: I guess my question is like if I was to try and think this [inaudible] and put it through the machine [inaudible], why would it -- like what would you expect -- >> Ingrid Zukerman: Oh, well, it doesn't do conjoined sentences. So I left my mug in my office and I want a coffee. >>: Like who doesn't do it? The parser? The recognizer can do that, right? >> Ingrid Zukerman: We are getting between 20 and 30 percent error even on simple stuff. Like whether the recognizer can do it or not is a big if. The parser I'm not sure. From parsing to UCG definitely not. We are not doing conjoined at the moment. I mean, these are all like -- some of the things are pretty easy. 
If the parser behaves, you can break up conjoined sentences. Right? So some of the things -- but a lot of it is really irrelevant narrative, or at the moment irrelevant, like, yes, you might argue I want the coffee, maybe DORIS should get the cup that has the coffee as opposed to the cup that doesn't have any coffee in it, for example. But if you're going bottom line, like translating from verbiage to command and control, all you want is -- this is what you really want. My mug is in my office or -- >>: [inaudible] transformation, is this -- >> Ingrid Zukerman: Manual. >>: [inaudible] or manual? >> Ingrid Zukerman: No. This is totally manual, but we have rules. Like we have somebody sitting there following -- we have a scribe following rules. >>: But these are not rules that could be easily implemented [inaudible]. >> Ingrid Zukerman: Well, depends on what your program can do. Like -- >>: Okay. Got it. Yeah. >> Ingrid Zukerman: Yeah, some of them -- well, our plan is to apply machine learning techniques to see can you go from this type of structure to that type of structure, which is what -- >>: So this goes back to my question earlier. The syntactic [inaudible] may actually hurt in a case like this where you may be better off [inaudible] -- >>: Word spotting? >>: Yeah [inaudible], word spotting approach. Out of all this verbiage, you're looking for a few pretty straightforward types of information. >> Ingrid Zukerman: But the question is you want to look for -- I mean, if you do straight word spotting, then you'll find what you are looking for and you may not find what is being said if they want something else. >>: Right. It will be kind of an ROC curve in all the methods. I guess one comment is sort of an alternative approach. So Tim Paek and I engaged in Receptionist version 1, had sort of a goal hierarchy admittedly for a limited domain, can a receptionist do with a -- and other things [inaudible] but we actually did it -- we used a [inaudible] parse plus word spotting to use parts of speech and words that kind of coded into a handcrafted [inaudible] and that had certain characteristics. That's kind of [inaudible] suggested here, but kind of the -- not just spotting words but spotting structure [inaudible] structure as well in a pretty tight context. >> Ingrid Zukerman: Yeah. That's the thing that -- I mean, we were trying to stay away from context until now. But, yeah, I prefer to do it -- as I said, I prefer to do it at the end, but also for other reasons. Like at the moment we do this, right, we don't consider that the person wants their coffee. But if we want to do more sophisticated planning, actually recording that they want the coffee and knowing that will be important. It's just that at the moment we're pretending it's command and control. So, yeah, I take your point about the spotting, but I worry that it may be too restricted. So in any event, okay, here, the evaluation, we didn't do as well as we hoped. But at least we partially know why. So as I said, this was typed. There was no speech recognition. And for complete requests we are at 51 percent in the top. And it actually doesn't degrade. It's pretty binary, the whole thing. Either we find it, and if we find it we're doing pretty well. Like you see the median rank is zero and the 75th percentile rank is 1. If we find the right interpretation. But if -- but we don't find it in a lot of cases. Like in 31 percent of the cases we don't find it at all. Now, we also did this evaluation for partials, which is just for UCGs inside an interpretation. 
Because for complete requests you have to get all the sentences right, whereas for UCGs, you have to get -- well, you want to know how many you got right. So we do a little bit better, but still we need to do more work. And now, why aren't we doing -- I mean, where did we fail? Okay. The first place where we failed is that we did not compose interpretations from several imperatives. Okay. One thing that I forgot to mention is that we're combining declaratives in this interpretation process. Where is it? Here. Merge UCGs -- I forgot. I kind of glossed over that one -- which means if you say my mug is on the table, it is blue, then you get two merged interpretations. I have a blue mug on the table and I have a mug on the blue table. And we did all these mergers but only for the declaratives and for declaratives and imperatives. Initially we thought that people would not add extra information into imperatives. We were wrong. People add extra information when they are giving more than one imperative sentence. So this actually accounts for 19 requests that we just didn't process properly. And anaphora resolution accounts for 6 -- well, I mean, this is how it worked out and we probably need better mechanisms for anaphora resolution. And PP-attachment was also problematic. While we know that the mug near the laptop -- the mug on the table near the laptop, it's the mug that's near the laptop. DORIS is not quite sure about that. So we have some ideas about how to improve the PP-attachment, but at the moment this is problematic. So -- yes. >>: [inaudible] I was noticing that when you were showing PP-attachment, especially with near, you're still going to find the mug. >> Ingrid Zukerman: Yes. That is a good point. Yes. That is a very good point, that we often actually find the correct object but because our evaluation -- like whether the mug near the laptop on the table, whatever, you end up finding the right mug but with the wrong structure. So at the moment we're punishing ourselves for that structure. But, yes, we end up finding the -- yes, that is correct. So what have we done? Okay. We have our speech interpretation module that I've motivated, that keeps track of [inaudible] interpretations, and we have provided this probabilistic formalism that -- well, to handle the uncertainty, but also what I like about it is that we're able to extend it to multiple sentences. And we have an anytime search algorithm, and we are extending this formalism to sentence sequences. So where are we going next? Yes, we have to deal with ASR error, but also we have to do it for these multiple sentences. We hope to improve performance for multiple sentences to something reasonable, and then we'll go to ASR error. Extending grammatical capabilities. As I said, applying machine learning techniques is the direction I want to go to. We need additional dialogue, and then we'll move on to dialogue and integrating with vision. If somebody gives us another robot, we can integrate with vision. Okay. That's it. Thanks. [applause] >> Eric Horvitz: Any questions? We've been pretty interactive the whole way through. >>: So I wonder if you might use like the environment to resolve some of these prepositional phrase attachments, like can you -- >> Ingrid Zukerman: Yes. >>: -- [inaudible] there's a mug next to the laptop. >> Ingrid Zukerman: Yes. That's exactly what we are planning. 
Actually we are going to abandon the full pipeline and, yeah, what we are going to do is for the UCGs we're going to go down to the ICG level and see, okay, which ones actually exist. Yes. >>: It would also be interesting -- I guess I don't know. So you have -- there's a chain here of components, and each of them has uncertainties and failures and blind spots. It would be interesting to see which [inaudible]. Like I wonder [inaudible] people have done all this work on just ASR reranking, right? If you were to take just your ASR results and apply some sort of like reranking technique that uses features that describe, you know, some of these relationships that you're trying to capture, just rerank the ASR, you know, and pick the best hypothesis and parse it, like how would that fare against, you know, this approach where you're sort of keeping uncertainty later in the chain. I think that's really interesting, just that comparison I think would be interesting to see. I don't know. >> Ingrid Zukerman: No, we tried it once actually. We used just vocabulary to rerank. But to be honest, I don't remember what happened. Probably nothing much, if I don't remember what happened. >>: And the other thing I think is like so the ASR, I have a question about it. So you're saying like you're getting these 20 percent, 30 percent errors. And you're seeing those as high. And I'm kind of wondering -- like in my head they're actually pretty good, especially if you're in a robot setting. And I'm assuming like the experiments like the ones you did with ASR, was it [inaudible]. >> Ingrid Zukerman: Oh, yeah, totally. >>: And so you [inaudible] the robot in a home where the TV is going and the robot is like five feet away, you kind of have a slim chance. Like it's actually interesting, and I think these techniques have more space for improving things and for robustifying things, when ASR is actually bad. >> Ingrid Zukerman: Yes. Well, actually an interesting point in one of the other experiments that we did is we -- with visual DORIS we had a simulated vision system that -- and we cranked the vision error up to 200 percent, so by the end you could think a mug is an elephant. And the longer your description -- like the mug on the table near the phone -- the better the performance got because you had so many factors to disambiguate that even if you thought an elephant was a mug at the end you didn't. >>: So it would be interesting, yeah, to see like -- I don't know if you get a chance in your teaching experiments to actually -- I mean, it's hard to vary unless you do a simulated sort of model, like to vary the word [inaudible] speech recognition accuracy. But what you can do is get a lot of people, and inherently some of them will have different error rates. And then you can just have that as a post hoc [inaudible] analysis and look at the correlations between when my error rate is low, how much do I improve [inaudible]. >> Ingrid Zukerman: Yes. Yeah. Like you can -- yes. Yes. And -- yeah. We can do the reranking thing again, because I don't remember what happened. No, it was years ago and for some reason we didn't pursue it. Half of me thinks it's because we thought it was a bit of a cheat, if you just add your vocabulary. Like, you know, you saw, we have up to 150 things in the knowledge base. So if you're just going to say, okay, I'm just playing with my own vocabulary and nothing else, it's a bit of a cheat. And also people could be referring to things by their own vocabulary, like the chalice. 
Well, the chalice could still be a mug. Okay, it's a stretch. But, you know, it's some type of vessel. So even though it's not part of your object vocabulary, if people want to call something in that way and it's in some vocabulary, they're entitled to. But, yeah, we should do a bit more with this. >> Eric Horvitz: All right. Thanks very much. [applause]