>> Eric Horvitz: Okay. It's an honor to introduce Ingrid Zukerman who's visiting us
today from Melbourne -- or Melbourne as they say, I guess, as they pronounce it
correctly. I was there last Christmas visiting Ingrid there.
Ingrid was actually most famous as a visiting researcher at Microsoft Research in 1999.
>> Ingrid Zukerman: 2000.
>> Eric Horvitz: 2000. The 1999-2000 academic year. And she's a professor in
computer science at Monash University. She did her bachelor's and master's work in
operations research at Technion and her Ph.D. research at UCLA. Famously, again, is
one of the earlier grad students of Judea Pearl -- the first Judea Pearl grad student
working in Bayesian nets and uncertainty, I think, right? One of the early ones.
>> Ingrid Zukerman: No, the only one working in natural language, actually.
>> Eric Horvitz: But with uncertainty -- oh, certainly that's the case. That's probably still
the case until this day almost, right?
>> Ingrid Zukerman: But, yeah, I was the first person to actually do Bayesian
propagation by hand ever on the planet.
>> Eric Horvitz: Okay. It's nice to have an interactive introduction, isn't it? You can
check my years and my details here. We're doing a dialogue-based introduction,
thematically placed.
Anyway, her interest areas are discourse planning and interpretation, plan recognition,
and dialogue.
So Ingrid.
>> Ingrid Zukerman: Thanks. Sorry to interrupt.
>> Eric Horvitz: That was planned.
>> Ingrid Zukerman: Okay. What I'll be talking about today is a spoken dialogue system
we're developing for a robot and, well, about the motivations for the way we're
developing the system and where we are so far.
So the name of the system is DORIS, and it actually is an acronym, Dialogue Oriented
Roaming Interactive System.
And DORIS is a dialogue module for a robotic agent in a home and eventually to
combine spoken and visual information, and actually we used to have this [inaudible]
robot, but [inaudible] took it back. But I still like the picture.
So this is the sort of thing that we are aiming for. The user might say can you get my
mug. We are fixated with mugs. And DORIS might say something like I don't know
which is your mug, but I can see a blue mug, and the user might say my mug is the one
on the table.
And if DORIS can do that, we will be very proud of DORIS. Because this type of
dialogue highlights a number of issues. First of all, it highlights the fact that DORIS
needs to be able to identify objects in the world and it also needs to -- is this thing -- yeah.
It also needs to be able to understand from this that the user's mug is not the blue
mug, which is actually a very tall order. So if DORIS can do all that, we'll be very
happy. At the moment we are on our way.
So what I'll be talking about is I'll be motivating our design decisions, then I will be
talking about our interpretation of spoken language. And, yes, what I should mention is
at the moment we are not doing any dialogue as such; all we're doing is interpreting what
people are saying. We are getting to dialogue.
Then I will be talking about how we estimate the probability of an interpretation of a
person's utterance, and then how we evaluate that part of the system.
Now, what you see in light blue is if time permits I'll be talking about our more recent
work about the interpretation of a sequence of utterances. Because people don't normally
express themselves in one very smooth sentence, but they're actually quite segmented.
And then conclusion and future work.
So, first of all, the motivation. So what are the things that we want DORIS to do? We
want DORIS to make decisions on the basis of the result of the interpretation process.
And these decisions are both dialogue actions and physical actions.
Now, what are the results of the interpretation process? They're not just what we
understand one utterance to be but also how many alternatives there were, how close they
were to the best interpretation. What were the differences between these alternatives, just
one word that was different or was the whole thing different. And the decisions about
what to do depend on all these aspects of an interpretation.
We also would like DORIS to modify decisions on the fly if it receives new information.
So it has heard an utterance, made some type of decision, and then a new utterance
comes.
And also we would like DORIS to recover from flawed or partial interpretations. So it
got it wrong and then there is new information and how do we recover.
So in order to support all these, we want -- we decided that we need two things: We need
to be able to maintain multiple interpretations and we need to be able to rank them.
So you need multiple interpretations so you can do all that. First of all, you need to know
how close they were to each other, what were the differences and so on. If you want to
change your mind, you want to know what to change your mind to. And if you want to
recover again, you want to have your alternatives in hand. And of course to know where
you stand you need to be able to rank all of these interpretations. So these are the two
things we want DORIS to be able to do.
And Scusi? is the speech interpretation module of DORIS. Yeah. No idea why we called
it Scusi? anymore, but that's how it stays.
So what I will be focusing on and from now on is about Scusi?, and Scusi? has to do
these two things: maintain multiple interpretations and apply a ranking process.
And to maintain multiple interpretations what we have is a multistage interpretation
mechanism and each stage maintains multiple options. And so that this thing doesn't
explode for us, we have an anytime algorithm to do that.
So our algorithm basically keeps track of all these multiple options until time runs out,
only in our case until we have generated a particular number of alternatives, and then
proceeds to pick the top few alternatives.
And the ranking process is done through a mechanism that estimates the probability that a
particular interpretation matches the speaker's intentions.
So these are the two things that I'm going to talk about: our algorithm and the
mechanism for estimating probability.
So the stages of the interpretation process is we -- so far we are using a pipeline process
that is quite normal. We have a speech recognizer that produces text, syntactic parsing
that produces a parse tree, semantic interpretation that produces a concept graph that is
based on Sowa's conceptual structures.
Now, the idea for this, we know that in spoken dialogue systems a lot of people use
semantic parsing. But the idea for this particular structure was that we want to get to the
situational context as late as possible so that we don't have to redo a lot of the initial
work.
And in fact there is one piece of research that we are fond of citing by Knight et al. who
say that if you have a situation where people will say unexpected things or are not
familiar with the system, you are better off with this type of model rather than semantic
parsing.
So the first stage is speech recognition. And, yes, we are using Microsoft. It works very
well. And in fact we're using an old Microsoft. Tim Paek assures me that the new one is
even better.
So the Microsoft API produces a range of interpretations and they are all scored. And we
transform the scores into probabilities just using a linear mapping function. Nothing too
fancy.
>>: What do you mean by that?
>> Ingrid Zukerman: Oh. The API produces numbers, and then you take -- you
normalize and you turn them into numbers between zero and one.
>>: I see.
>> Ingrid Zukerman: Yeah.
>>: [inaudible] those numbers as probability directly?
>> Ingrid Zukerman: Yeah.
>>: Okay.
>> Ingrid Zukerman: Yeah. Exactly. Then syntactic parsing. We use Charniak's parser.
And that parser returns any number of interpretations. And then you get something that
looks like that.
And the semantic interpretation, as I said, relies on concept graphs, a
formalism by Sowa that represents entities and relationships between them. And we have
broken down the process into two stages: uninstantiated concept graphs and instantiated
concept graphs. So uninstantiated are basically pure semantic and instantiated are
contextualized in the environment.
And as I said before, the reason again even for this breakup is that we want to leave the
domain instantiation for as late as possible.
So an uninstantiated concept graph, all it does is -- it's basically a different -- it is a
different representation of a parse tree. And it is produced directly from a parse tree.
However, one uninstantiated concept graph, or UCG, can have multiple parent parse trees
if they mean similar things.
For example, you could have a parse tree like the blue mug, and sometimes it's a direct
adjective and sometimes it is parsed as an adjectival phrase. And both of them will lead
to the same UCG.
And as I said, one more time, domain independent, so we are not looking at the domain
just yet.
So, for example, here is find the blue mug in the kitchen for Susan. And what we have is
a UCG that has all the right bits directly generated from the parse tree. And the only
thing of note is that attributes of the nouns are included in the boxes for the nouns. So if
we say the blue mug, the big mug, a mug, all those things are included in the box. They
don't have a special node associated with them.
The instantiated concept graphs pertain to the domain. So now that we have an
uninstantiated we want to know what things in the domain actually can match this
situation.
So here we have again the blue mug. And now we have a particular instantiation which
means a particular interpretation of find, like we have different kinds of find, you can
find somebody's office, which means you just locate it, whereas you can find somebody's
mug, which means locate and retrieve, and so on, so forth.
And all the things have numbers. Even the patient. We've only got one kind of patient
and one kind of beneficiary. We have three kinds of cups, two Susans, one kitchen, and
so on.
So this is a particular instance that we have generated from what we have said. Then of
course all the cups and mugs in our domain could be possible instances and all the Susans
could be possible instances.
>>: So here you generate by just going across all options and get all possible
combinations?
>> Ingrid Zukerman: Almost. Conceptually it is all possible combinations. In practice
you'll see in about five minutes that we reduce the search. But conceptually, yes. It's all.
Okay. So how do we search for an interpretation. We have -- this would be the result of
a search so far. So we have a speech wave, we have a bunch of text. Each text could
have a bunch of parse trees. Each parse tree can generate one UCG. But you see, for
example, the UCG over there, 22, can come from two different parse trees, and two
UCGs can also produce an ICG.
So this is the search structure that we generate. And the algorithm works like this. While
there is time, select a level to expand, preferring low levels. Basically we want a
beeline to a solution. We don't want to meander around the top and not have an answer.
And this is done -- like the preference for low levels is done probabilistically. Basically
each level has a probability and the lower levels have a higher probability of being selected.
Select an item, preferring promising candidates. And what is a promising candidate?
A promising candidate is a good parent. So how do we decide who's a good parent? This
is like Darwinian. Okay. If a candidate has not generated any children but it has a good
probability coming from above, he's allowed to procreate. Once you have generated a
child, you're only allowed to procreate so long as the probability of your children is good.
Once you have generated a bad child, you can no longer procreate. That's it. You're
banned. So that's a promising candidate.
And once you expand an item, you generate only one child at a time. So you don't
produce -- even though the parser has generated 50 parse trees, we pick them out one
at a time. And, likewise, for the speech recognizer we pick them out one at a time. I mean, the fact that this
speech recognizer and the parser generate all the options at once, it's an artifact of those
products. But if this is truly anytime, we want to pick them one at a time.
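For concreteness, here is a minimal sketch of that anytime expansion loop, assuming hypothetical candidate items that expose a probability, an is_promising() test and a generate_next_child() method; the level weighting and the greedy parent choice are illustrative, not the exact scheme used in Scusi?.

import random

def anytime_search(levels, max_candidates, top_n=3):
    """Sketch of the anytime expansion loop: `levels` is the ordered list
    [texts, parse trees, UCGs, ICGs]; each level has an `items` list."""
    generated, attempts = 0, 0
    expandable = levels[:-1]                     # ICGs have nothing below them
    while generated < max_candidates and attempts < 10 * max_candidates:
        attempts += 1
        # Prefer lower levels probabilistically so the search beelines
        # toward complete interpretations instead of meandering at the top.
        weights = [2 ** i for i in range(len(expandable))]
        idx = random.choices(range(len(expandable)), weights=weights)[0]
        level, child_level = expandable[idx], levels[idx + 1]

        # Prefer promising parents: an item keeps "procreating" only while
        # it has not yet produced a low-probability child.
        parents = [it for it in level.items if it.is_promising()]
        if not parents:
            continue
        parent = max(parents, key=lambda it: it.probability)

        # Expand one child at a time, even though the ASR and the parser
        # could hand back all their alternatives at once.
        child = parent.generate_next_child()
        if child is not None:
            child_level.items.append(child)
            generated += 1

    # Pick the top few complete interpretations generated so far.
    return sorted(levels[-1].items, key=lambda it: it.probability,
                  reverse=True)[:top_n]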
>>: [inaudible] explain this force, like that you give to each child later or --
>> Ingrid Zukerman: Yes, yes. It's all coming. Here. Yeah. So okay. So how do we
estimate the probability of an interpretation? This is the -- I mean, the complete
development of this formula is by doing the Bayes voodoo that one does, so chain rule,
conditional probability. So this is the bottom line. Now, the fancy Cs pertain to context.
So what does this formula tell us? It tells us the probability of an interpretation I given a
speech wave W in some context is -- let's leave the summation aside for the moment.
The probability of the text given the speech wave, the parse tree given the text, the UCG
given the parse tree, and the ICG given the UCG and the context. And it works out
with the probabilities. It works out very beautifully.
Why do you need a summation? Because everybody can have multiple parents. The text
can be a parent. The same parse tree can have multiple text parents. The same UCG can
have multiple parse tree parents and so on. And the ICG can have multiple UCG parents.
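The slide formula itself is not in the transcript; reconstructed from this description, with T the text, P the parse tree and U the UCG, it is roughly

P(I \mid W, \mathcal{C}) \;=\; \sum_{T,\,P,\,U} P(T \mid W)\; P(P \mid T)\; P(U \mid P)\; P(I \mid U, \mathcal{C})

where the sum runs over all parent chains of texts, parse trees and UCGs that can lead to the interpretation I.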
So that is the formula. And now where do we get that stuff? Okay. From speech wave
to text from the ASR, from text to parse tree from the parser. This is a 1 because it's
algorithmic. And this is the hard part. So this is what we're going to talk about now.
>>: How complex are utterances in this domain? Do you actually need all this
machinery for the parser to say, or could you achieve a similar result if there was just a
few arguments with [inaudible]?
>> Ingrid Zukerman: Okay. Okay. The thing is -- yeah. We have done trials with
people, and they meander all over the place. And in fact you're going to see at the end --
and this is why -- this is not a command and control situation. Yes, you could train
people to say, you know, tea, Earl Grey, hot. For [inaudible] geeks that means
something; for the rest, never mind.
But most people actually meander all over the place. And we are telling them pretend
you are talking to a robot.
So even with all this machinery, you'll see later, we can process only a fraction of what
we are being told. And there is another piece of research actually from the last
InterSpeech that the older people are the more they meander in their discourse. So, yeah,
if this is going to assist anybody of age...
Okay. So now how do we calculate this probability? So we have three different sections
and it will all become clear. The first section is an intrinsic match probability between a node
in the ICG and the corresponding node in its parent UCG. So if you have a match for a
node, like we saw before, you want the probability that the person uttered a particular
description when they intended a particular object in the domain. This is the first section
here.
The second section is a structural probability. It's the probability -- I mean, in principle
we would like to calculate the probability of a complete structure, but we don't have that.
So what we use, we use trigrams which have the structural probability to represent the
structural probability of an ICG. And this is the probability of a node given its parent and
its grandparent. And we'll see an example in a moment.
And finally we have the probability of a node. This is an instantiated node given the
context. And at the moment we're not doing anything with this other than our knowledge
base of nodes. In the future we want to include visual salience and discourse salience.
So this is where you will put all of that information. But at the moment all we have is do
we know about this thing or not.
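As a rough reconstruction of those three sections (the exact factorisation shown on the slide is not in the transcript), writing u_k for the UCG node that corresponds to ICG node k:

P(I \mid U, \mathcal{C}) \;\approx\; \prod_{k \in I} P(u_k \mid k)\; P\big(k \mid \mathrm{pa}(k), \mathrm{gpa}(k)\big)\; P(k \mid \mathcal{C})

with the three factors being the intrinsic match, the structural trigram (node given parent and grandparent), and the node given the context.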
Okay. So now we are going to go into how we calculate -- how we make these
calculations. But before that we have some simplifying assumptions. The robot is
co-present with the user and all possible referents, and it has an unobstructed view. And
each object is uniquely identifiable.
Now, the first two assumptions we make because at the moment we only want the robot
to make linguistic decisions. We don't want it to make a decision that says there is
nothing really good in this room, maybe I should go look elsewhere. Or maybe I should
move a little bit to get a better look.
I mean, these are decisions that will have to be made in the future, but at the moment we
don't want to deal with this. So we have co-presence and full view.
Unique identification. That has to be removed if you're going to deal with a vision
system. Because realistic vision systems don't have unique identification.
And in fact we have a version of this system that we call visual DORIS that has a
simulation of a vision system.
Okay. So now we're going to go for the features that we are using in our comparison. So
intrinsic matches, which is what is the probability that we said a particular description
when meaning a particular node, pertain to lexical item, color and size. And, yes, we can
extend it to texture and shape and other things. But once you figure out lexical item,
color and size, you don't get brownie points for extra things.
And structural probabilities pertain to ownership and location. So these pertain to the
relationships between concepts. And we need to distinguish between intrinsic and
structural because intrinsic are features of a thing itself.
So if I want the blue mug, this is it. The mug has to be blue. But if I want the mug on the
table, I have to consider all the mugs I know about and all the tables. So I have to
consider these trigrams.
Okay. So this partially answers Dan's question are we doing everybody against
everybody when we're building these structures. No. We want to find good candidate
objects.
And what is a candidate object? A candidate object would be also something composite,
like the mug on the table.
So to do that we do it in two stages. The first stage we make a list of candidate objects
for each UCG concept U. For each mug, each table, each concept that was mentioned,
we make a list of candidates. And we estimate the match probabilities of those candidates.
So if I ask for a blue mug, I look at all the cups and mugs and dishes and the ones that are
blue win.
And then we rank these objects in descending order.
Next we make mini-ICGs of candidate composite objects. So now we want to -- basically
we want to disambiguate referring expressions. And the mini-ICG is an ICG
just for a referring expression, no verb. So the mug on the table near the lamp in the
corner. We want to find candidates for that.
So we combine individual candidates into composite objects and we estimate their
structure probabilities and rank those. So we retain -- at the end of this game, we retain
candidates for composite objects. And now we can put them together in a larger
utterance.
So in principle it's everybody against everybody, but in practice it's only
different composite objects against other composite objects.
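A minimal sketch of that two-stage candidate search, with match_prob and struct_prob standing in for the intrinsic and structural estimates discussed next; the helper names and the pruning limit are assumptions, not the actual Scusi? code.

from itertools import product

def candidate_composites(ucg_concepts, objects, match_prob, struct_prob, top_k=5):
    """Stage 1 ranks candidate objects per UCG concept; stage 2 combines
    the survivors into composite objects (mini-ICGs) and ranks those."""
    # Stage 1: for each concept (mug, table, ...) keep the best-matching objects.
    ranked = {}
    for concept in ucg_concepts:
        scored = sorted(((obj, match_prob(concept, obj)) for obj in objects),
                        key=lambda pair: pair[1], reverse=True)
        ranked[concept] = scored[:top_k]

    # Stage 2: combine the survivors into composites ("the mug on the table")
    # and rank them by intrinsic times structural probability.
    composites = []
    for combo in product(*(ranked[c] for c in ucg_concepts)):
        assignment = {c: obj for c, (obj, _) in zip(ucg_concepts, combo)}
        intrinsic = 1.0
        for _, p in combo:
            intrinsic *= p
        composites.append((assignment, intrinsic * struct_prob(assignment)))
    composites.sort(key=lambda pair: pair[1], reverse=True)
    return composites[:top_k]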
Okay. So what is the probability of a match -- so now we are going -- we are drilling
down to intrinsic probabilities. So what's a probability of a match between a UCG node
and a candidate instantiated concept. So the probability that we meant K and we said U.
And as I said, at the moment we are only doing lexical item, color and size. So if we
assume some feature independence, we can break it up like this. And, yes, color is not a
hundred percent kosher because color also depends on lexical item. Red hair and a red
mug are totally different reds. But we are looking for household objects.
So we are pretending color only depends -- so the probability that you call the thing with
a particular name that you indicated its color and that you indicated its size. Now, size
does depend on what you asked for. A small elephant and a small mug are two different
sizes.
And then, again, we have heuristics to estimate each probability and we do a linear
mapping. Sorry. We have heuristics to estimate scores and we do linear mapping to
estimate probabilities.
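Reconstructed from that description (keeping size conditional on the lexical item, and pretending colour is independent of it), the break-up is roughly

P(u \mid k) \;\approx\; P(\mathrm{lex}_u \mid k)\; P(\mathrm{colour}_u \mid k)\; P(\mathrm{size}_u \mid \mathrm{lex}_u, k)

where u is the UCG description and k the candidate instantiated concept.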
So let's have a look how we estimate the scores. For lexical item we use one of the many
WordNet similarity metrics, and we picked Leacock and Chodorow's because it gave the
best result. And it didn't crash.
>>: After a massive search through all possible methods?
>> Ingrid Zukerman: No, like -- no, what's his name, Pedersen? He has a list of like six
different methods, and we just tried them out and this one was the most sensible.
So it returns a similarity score between the name that you used and the name that you
might call a particular object. So a mug is called a mug and then you might call it a cup,
a mug, a dish, et cetera. And it returns a similarity score, and we scale this by the maximum
possible similarity.
>>: [inaudible]
>> Ingrid Zukerman: Leacock and Chodorow? WordNet distance, distance on the tree
and maximum distance. But there are all sorts now.
>>: But most of these [inaudible].
>> Ingrid Zukerman: Yeah. I mean, yeah. We didn't use the [inaudible] based ones, just
that one seems to work.
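A small sketch of that lexical score using NLTK's WordNet interface (assuming NLTK and the WordNet corpus are installed; the sense handling and the scaling by self-similarity are assumptions, since the talk does not spell them out):

from nltk.corpus import wordnet as wn

def lexical_match_score(spoken_word, object_name):
    """Leacock-Chodorow similarity between the word the speaker used and a
    candidate object's name, scaled by the maximum possible similarity."""
    spoken = wn.synsets(spoken_word, pos=wn.NOUN)
    target = wn.synsets(object_name, pos=wn.NOUN)
    if not spoken or not target:
        return 0.0
    sims = [s.lch_similarity(t) for s in spoken for t in target]
    sims = [x for x in sims if x is not None]
    if not sims:
        return 0.0
    max_sim = spoken[0].lch_similarity(spoken[0])   # self-similarity is the ceiling
    return max(sims) / max_sim

# e.g. lexical_match_score("mug", "cup") is high; ("mug", "sofa") is much lower.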
And color, our model for color similarity is based on the CIE model that the authors
claimed that it's psychologically grounded. So we use the Euclidean distance, that's ED,
between the color that was stated and the actual color of the object based on these L, a, b
coordinates. And L, a, b stands for brightness, some number in the green-red spectrum
and some number in the blue-yellow spectrum.
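A sketch of the colour score, with colours given as CIE L*a*b* triples; mapping the Euclidean distance to a score via a fixed maximum distance is an illustrative assumption.

import math

def color_match_score(stated_lab, object_lab):
    """Euclidean distance in L*a*b* between the colour the speaker named
    (a prototypical value for, say, "blue") and the object's colour,
    turned into a score between 0 and 1."""
    dist = math.dist(stated_lab, object_lab)
    max_dist = math.dist((0, -128, -128), (100, 127, 127))   # rough Lab extremes
    return max(0.0, 1.0 - dist / max_dist)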
And for size this is actually a fairly new size function -- well, a different one. We map -- we have heuristics that map a requested size into two states: large or small. And
then we estimate the probability that an object could be designated large or designated
small.
And this probability depends on mu, the average size of objects of that
type, and sigma, the standard deviation for objects of that type. Where did we get
these numbers? The Ikea catalog and Amazon. So if you want a bed or a mug or
whatever, we just took a lot of mugs, average sizes. And then you have the probability
that something could be called large or something could be called small.
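A sketch of that size probability; using a normal CDF over the catalogue statistics is an assumption, since the talk only says the probability depends on the type's mean size and standard deviation.

import math

def prob_designated(size, mean_size, std_size, designation):
    """Probability that an object of some type could be called "large" or
    "small", given the average size and standard deviation for that type
    (gathered from the Ikea catalogue and Amazon)."""
    z = (size - mean_size) / std_size
    p_large = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # normal CDF
    return p_large if designation == "large" else 1.0 - p_large

# A 0.15 m mug when mugs average 0.10 m (sigma 0.03) is quite plausibly "large":
# prob_designated(0.15, 0.10, 0.03, "large") is about 0.95.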
So now that we have all these probabilities, how do we combine them. According to
Dale & Reiter and Wyatt and our own survey, people use the type most often, absolute
adjective second and relative adjective third. So color is an absolute adjective and size a
relative adjective. So we have a weighting scheme. We raise each probability to a
particular power and we did experiments to find out which numbers give the best result. And
that's what we found.
And, yes, you could use machine learning to do that, but we didn't. We just tried a few
things until we got something that we liked.
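Putting the three intrinsic features together, the weighting amounts to something like the following; the exponents here are placeholders, since the talk says the real ones were found empirically but does not give them.

def intrinsic_match(p_lex, p_colour, p_size, w_lex=1.0, w_colour=0.7, w_size=0.4):
    """Combine per-feature match probabilities so that lexical item counts
    most, colour (absolute adjective) next and size (relative adjective)
    least: a smaller exponent flattens a probability toward 1 and so
    reduces that feature's influence on the product."""
    return (p_lex ** w_lex) * (p_colour ** w_colour) * (p_size ** w_size)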
Okay. So how does this whole thing work? As I said, we're fixated with mugs. And --
>>: It sounds like you can do whole studies on each of these components.
>> Ingrid Zukerman: Yes.
>>: [inaudible] component, you're finding lots of graphs and tradeoffs and optimizations.
>> Ingrid Zukerman: We did a few. But, yeah, the thing is that there is always a tradeoff
being moving on and dwelling on a particular issue and we move on. But, yeah, if I had
unlimited manpower, I would dwell.
So say the large black mug. And we want to know the probability that cup 1 could be
called a large black mug or mug 1 or mug 2. And then this is what we get. We get for
cup 1 on lexical we're doing medium, because we said mug and the best name for it is a
cup. Color is good. Size. Now it's kind of small. Mug 1 is doing bad on color and mug
2 is doing pretty good.
So in this case, mug 2 is the winner. Yeah, it's a clear winner. But if we don't have a
winner, the question is what would we do? And this is where we get into the next part of
the project, which we haven't started: is it worth asking? I mean, how crucial is it to this
person to have a large black mug? So this is the next part of the project.
Okay. So now we finish with our intrinsic probabilities and we move to the structural
one --
>>: And you're asking in realtime or asking as part of training?
>> Ingrid Zukerman: No, asking in realtime like a person, you know, get me the large
black mug. DORIS goes to the kitchen. Should I go back and ask --
>>: If the black mug still has coffee in it and the other mug is empty in the cupboard, there's a
big difference.
>> Ingrid Zukerman: Depends if they want coffee or water.
>>: But which one --
>> Ingrid Zukerman: Yeah.
>>: It's not just mug, it's the actual experience and location and so on. So it can be
definitely worth asking sometimes if you don't know.
>> Ingrid Zukerman: Oh, absolutely.
>>: I'm surprised you actually [inaudible].
>> Ingrid Zukerman: No, no. I say like sometimes you want to ask and sometimes you
don't. But it's a nontrivial decision whether to ask or not.
Okay. Ownership. Very simple heuristic function. Like this is what you would have in
your tree, whether, you know, Susan's mug. So you want to know the probability that --
of this particular bit of the structure, whether Susan owns this mug, and it's zero if you
know for sure that Susan doesn't, one if Susan does, and some alpha --
>>: And here's another [inaudible] in the future it'll be you're at a party, get rid of all
your little "this is my wineglass" dangling thing and have the robots track it, where's my
mug. That's your motive [inaudible].
>> Ingrid Zukerman: Your RFID tag.
>>: [inaudible] just a little camera up in the ceiling, they can find your mug for you or
your glass. You don't just sort of wash one or get a new one. So save on all those little
cutesy little dangling jewelry you put on your mugs.
>> Ingrid Zukerman: Yep. So that is a very simple heuristic. Our positional stuff is -- well, I think it is kind of cool. Again, our assumption is all objects are rigid, so that
means that they can be represented by a box, and they have three dimensions: Length,
width and height. So if somebody wants their pants, we assume their pants are not just
draped all over the place but they're in a box.
So we have heuristic functions for locations such as on, under and above. And for that
what we do is if the directional semantics is satisfied, then the probability of a particular
location is proportional to the shared surface area. And I'll show an example in a
moment. We can have in that is proportional to shared volume, and near, that is based on
Kelleher's gravitational attraction.
So just to clarify how does this work, if I want the book on the table, directional
semantics means that the z coordinate is actually correct, the z of the book is the table
plus the height -- the z of the table plus its height.
Now that I got the directional, then I'm going for shared surface area, which means the
probability that the table is on the book is the shared area -- sorry, that the book is on the
table is the shared area between the table and the book over the minimum of the
two. Because you got -- you could have a book overhanging slightly over a table, and it's
still -- the book is still on the table.
So this is for on. And now that we have seen an example for on, we can go back to in.
So that's shared volume. And near is basically the idea is that near objects -- well, the
nearer an object is the better, but the measure of nearness for bigger objects is more slack
than for smaller objects. The pen next to the laptop is one thing; the sofa next to the table
doesn't have to be like this.
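A sketch of the "on" heuristic for axis-aligned boxes; the (x, y, z, width, depth, height) layout and the tolerance are illustrative assumptions.

def prob_on(top_box, bottom_box, tolerance=0.02):
    """Heuristic probability that top_box is on bottom_box. Directional
    semantics first: the top object's base must sit at the supporting
    object's top surface. Then the probability is the shared footprint
    area over the smaller footprint, so a book slightly overhanging the
    table still counts as on it."""
    tx, ty, tz, tw, td, th = top_box
    bx, by, bz, bw, bd, bh = bottom_box

    if abs(tz - (bz + bh)) > tolerance:          # directional check fails
        return 0.0

    overlap_x = max(0.0, min(tx + tw, bx + bw) - max(tx, bx))
    overlap_y = max(0.0, min(ty + td, by + bd) - max(ty, by))
    return (overlap_x * overlap_y) / min(tw * td, bw * bd)

# "in" would use shared volume instead of shared footprint, and "near" a
# distance measure whose slack grows with the size of the objects.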
Okay. So now we start our evaluation of this part of the project, and we have a few
experiments. We had more, but I'm talking about these. So the first experiment was
referring expressions only with intrinsic features. The second was for referring
expressions, intrinsic and structural. And the third was complete utterances.
In all experiments we generated at most 300 subinterpretations, so everything: texts,
parse trees, UCGs, ICGs. We set it to 300 as the limit. And one speaker spoke all the
utterances, and that was Enes. And, yes, people complain about this, why don't we have
multiple speakers. Because you have to train the API. You guys know that.
>>: [inaudible]
>> Ingrid Zukerman: Hmm. Well, somebody has to train it.
>>: Well, I mean, you can just use the defaults.
>> Ingrid Zukerman: The defaults are American.
>>: [inaudible]
>> Ingrid Zukerman: Yeah. It didn't work, and that's why poor Enes had to read
Moby Dick to it for two days.
>>: I think that's a good point, essentially like it would be interesting to see if you vary
the quality of the recognition, you know, because combining a number of factors, there's
the uncertainties that come from the recognition and then from the parsing and the
understanding.
[multiple people speaking at once]
>>: [inaudible] from a long time ago on exactly this question that involved asking -- this
is many years ago. I don't remember the citation at all, but getting people to chat with
a -- in a Wizard of Oz setup with the [inaudible] that had varied synthesis quality.
So they were actually talking to a human being behind the curtain, but the synthesis
quality varied from human quality to really tinny, cheesy, and there were huge
differences in how people talked to the person behind the curtain. When it was tinny,
poor quality, robotic sounding, they shortened things. They used fewer adjectives, much
simpler syntax. And when it was human, they just [inaudible].
>>: This is listening side, though, right, is what I'm talking about.
>>: Yeah, that's true.
>>: I mean, one of my big concerns is -- or I think so far it's fabulous and very
interesting. The concern is we're seeing intelligence and the glimmer of hope when it
comes to very carefully crafted, very carefully spoken good recognitions more generally.
And what happens when it's more -- when there's noise in the recognitions and people
talk in -- really out-of-band ways, will the various safety nets and expansions you have
from coordinate similarities and so on, is the structure for the physical environment
enough to collapse it down at that point?
>> Ingrid Zukerman: That is --
>>: Big question, I think.
>> Ingrid Zukerman: Oh, yes. Totally. I mean, at the moment -- I mean, we are poor
over there. So Microsoft API was free. And we had to train it. So Enes was it. Because
Enes' accent in English is better than mine. He's from Bosnia actually, so --
>>: He's from Boston?
>> Ingrid Zukerman: Bosnia.
>>: Oh, Bosnia.
>> Ingrid Zukerman: Yeah.
>>: I said Boston?
>> Ingrid Zukerman: No. No, no.
>>: Boston is a bit different than Chicago.
>> Ingrid Zukerman: No. But I totally agree that we should have multiple speakers.
What we did do, though, is people did generate the utterances and Enes just read them
out. It wasn't all of Enes's [inaudible].
>>: I see. I see.
>> Ingrid Zukerman: No, Enes was the reader.
>>: Okay.
>> Ingrid Zukerman: No, no a lot of the utterances belong to other people.
>>: How are they generated by the other people?
>> Ingrid Zukerman: All will come.
[multiple people speaking at once]
>>: [inaudible] transition into a Bosnian accent --
[laughter]
>>: -- [inaudible] recognized.
>> Ingrid Zukerman: Okay. So this is our first experiment. Yes. This is Ikea. And Ikea
in Australia. And we had six worlds and we asked eight people to identify labeled
objects. And the worlds were the pen world, the mug world, the furniture world and so
on. And we would say like describe D. Or describe L. And they had to generate a
description.
So there were in total 57 possible referents. And people generated 178 descriptions out
of which we could use only 75 because people do not follow instructions. We said our
system can do at the moment lexical item, color and size. Nothing else. They ask for the
rectangular tray. Okay. So --
[multiple people speaking at once]
>>: In reality that's always going to happen. Our systems will always be incomplete.
And what's your sense for if you're given a million dollars by the --
>> Ingrid Zukerman: Netflix.
>>: No, say a million dollars from some government in Australia to do general research
to fix that problem, I'm assuming incompleteness, how would you solve that problem?
>> Ingrid Zukerman: Well, I would go for the number. Like assuming I get such a
corpus study, you say, okay, so people talk about -- well, there are two issues: one,
what they talk about and, two, what the robot can recognize. Because people spoke about
texture. There is no way on earth a robot can recognize the velvet couch as opposed to
the woollen couch.
>>: Ingrid, what about three, a robot that understands [inaudible] doesn't know
everything it's going to hear.
>> Ingrid Zukerman: That's fine. That we can deal with. I mean, we can't -- but we -- well,
no, okay. So the thing is the first order of business is we do the corpus study. So
things like shapes. So I would incorporate the other things that we're not incorporating
because we're not getting brownie points for incorporating them. After that has been
done, the rest of the stuff is in the vocabulary. It's just not stuff that we can deal with. So
then you say, okay, you said the woollen. Sorry. I cannot identify wool.
>>: [inaudible] these other constraints.
>> Ingrid Zukerman: Exactly. So it is doable.
>>: [inaudible] learn what woollen means later.
>> Ingrid Zukerman: But still visually the robot would have to go up and touch it and do
a tactile identification of wool as opposed to velvet. I think people will have to get
used to asking DORIS for things -- I mean, once you tell it it has no tactile sense, it can
only see, then I hope people will become sensible.
>>: Just popped in, I just want to say Canberra. A million dollars from Canberra.
>> Ingrid Zukerman: Canberra? Canberra isn't giving anybody a million dollars.
[laughter]
>>: [inaudible]
>> Ingrid Zukerman: Canberra just gave every Australian $900 to spend on plasma
screens, but we can talk about that later. Okay.
So, in any event, this was the experiment and we ended up using only 75. There were
some duplicates. But, yeah, people were a bit naughty that way.
So these were our results. ASR error is 26 percent, so it's not like -- maybe the next version will
do better. And gold reference in the top 1 out of the 75 we got 49 percent. But in the top 3 we
got 87 percent. So we're actually overcoming ASR error for top 3.
>>: What does gold refs mean?
>> Ingrid Zukerman: Oh. We knew what was the gold reference, like what we want --
the gold is the correct one. So we say like we spotted the correct one in 87 percent of the
cases for top 3. There was one we didn't find. Which one was the one we didn't find? Oh, the poof.
DORIS didn't know poof. And -- oh, yeah, and also the API had trouble with
Australianisms, like ramekin and biro. Not a chance of getting it to recognize
that.
>>: What does it mean?
>> Ingrid Zukerman: Ramekin? It's a cooking -- it's a little cooking dish. And a biro is
a pen.
>>: You say it like it's common.
>> Ingrid Zukerman: In Australia it is. That's the thing. We're not used to it.
>>: [inaudible] ramekin, isn't that the thing that they do the creme brulee?
>>: Yes. There you go.
>>: I'm showing my [inaudible] my wife would be upset to hear my conversation right
now.
>> Ingrid Zukerman: Well, so we didn't find the poof. And the average rank was 0.82.
So what is the rank? Rank zero is best.
>>: And what's poof?
>> Ingrid Zukerman: Poof is the little round chair, like a -- do we have a poof here?
>>: Is it Europe also?
>> Ingrid Zukerman: This is a poof. Page is a poof.
>>: Ottoman.
>> Ingrid Zukerman: Well, some people call it a poof.
>>: It's a little round -- it's a little round --
>>: Oh, really?
>>: Yeah.
>> Ingrid Zukerman: Jake could also be a poof, no?
>>: Now, is that used in here also and I just don't know about it?
>> Ingrid Zukerman: There you go. You've been illuminated regarding furnishings.
[laughter]
>>: [inaudible]
>> Ingrid Zukerman: Yeah, but a different accent.
>>: No.
>> Ingrid Zukerman: No?
>>: Not in the U.K.
>> Ingrid Zukerman: In Australia it's two syllables actually too.
>>: Poof?
>> Ingrid Zukerman: No, poofter they call it.
>>: Oh.
>> Ingrid Zukerman: Okay. We proceed. They're recording this.
>>: It's PG-13 in here.
>>: So gold references, those aren't the actual ultimate like semantical representation that
you generate or -- I don't understand how [inaudible] means --
>> Ingrid Zukerman: The ASR -- it means that the top text returned by the ASR was not
the correct one in 26 percent of the cases.
>>: And you sent those through, though.
>> Ingrid Zukerman: Yes.
>>: But you're only focusing on a small number of words in those sentences, right?
>> Ingrid Zukerman: Well, actually we use the entire vocabulary, whatever [inaudible] has,
whatever the ASR has, yeah, we're not restricting.
>>: No, but I mean the bits that you're actually focusing on for the behavior of the robot,
it's typically a small subset of -- a substring probably in a longer utterance, so much of it
could go wrong, and as long as the --
>> Ingrid Zukerman: Well, for this experiment it was only referring expressions, like,
you know, the pink tray, the large blue tray. We're getting to the bigger expressions in a
moment. So this was only referring expressions.
And we got the 26 percent error rate. So in totality we're doing pretty good. And the
rank, the rank means like your best rank is zero. So average rank is 0.82, which means
the best interpretation is either the top -- the gold is either coming up top or second. So
it's not bad. Okay.
Next experiment. We had a simulated house. This is a top view of our house. There are
five people in the house, 54 objects distributed among four areas. And this is one area of
the house.
Now, these descriptions we generated, and I'll tell you why. Because we are just -- I
hope we were justified, but please complain -- we did the same -- experiment 1 we also
did with our own utterances, and our performance was the same. Whether we did it or
other people did it, the system performed the same. So we figured okay, whatever we're
generating is not outrageous and it's not particularly tailored to the system, so for this one
we can do our own.
And we generated noun phrases comprising between one and three concepts. So things
like this, the wardrobe under the fan, the mug near the book on the table, Paul's book.
These are the descriptions we generated.
And we had -- now we compare against the baseline that takes the top interpretation at each
stage, so just take the top. ASR error was higher now. We are sitting on 30 percent ASR
error. And the baseline had like -- so taking the top, in 49 percent of the cases the gold is
top 1 -- well, of course if you only take the top, top 3 is going to be the same. But not
found is 46 percent. For Scusi?, even the top 1 is already at 91 percent. So top
1 performance overcomes ASR error.
>>: I'm not sure I understand what you mean when you say that [inaudible] overcomes
ASR error.
>> Ingrid Zukerman: Well, if you are taking top interpretation, like top ASR, top parse
tree, top everything and in 30 percent of the cases your top ASR is wrong, then that's it.
There's no way I can find the right thing.
>>: Why not? Because the ASR could be wrong in like -- well, I guess because of the
nature of their sentences, but if the ASR number is sentence level, I could have, you
know, like a -- like the bits that you're using like we're saying could be all okay,
something else could be incorrect and then I go and parse it and I generate just using the
top and I still get the right result. Wouldn't that be possible?
>> Ingrid Zukerman: How could you do that? Like if you said the, whatever, like let's --
>>: I want -- yeah. So let's say --
>> Ingrid Zukerman: The mug near the book on the table. And the ASR got the mud
near the book on the table.
>>: So the ASR got mug near the book on the table without the initial "the." Okay?
>> Ingrid Zukerman: Right.
>>: You count that as an ASR error.
>> Ingrid Zukerman: Yes. I would count that, but it also doesn't happen. You'll
get an article there, but, yes, I would count it as an error, agreed.
>>: And so then if I go with that through the whole chain, I might still recover the right
thing.
>> Ingrid Zukerman: Yes. That is true. And -- okay. We should check what kinds of
errors we are getting. That is true. But the fact is, I mean, if you look at this, you get 46
not founds.
>>: Right.
>> Ingrid Zukerman: So, anyway, this is what you are getting. So whatever is the case --
and the ICG is not taking into consideration articles and things like that. So it's really not
finding the right thing. Whereas, yeah, Scusi? is doing okay. We are happy with Scusi?.
And the last experiment is complete utterances, full requests. We picked out a hundred
utterances, and this is -- we actually ran an experiment -- okay. First I tell you what we
have. So we have a hundred utterances, 43 declarative, 57 imperative. We have 135
concepts in the knowledge base, objects and relations. And we actually collected a
corpus from people.
We asked people to interact with a robot -- Michael was the robot -- in two scenarios:
tidy up and help me. Tidy up they have to ask the robot to clean the house, and they have
to say, you know, pick up this, collect that, make the bed, et cetera. And help me you
have to be pretending to be disabled in some way and you have to get the robot to do
things for you.
And people actually interacted with Michael in any way they could. So -- how do you call
that thing that Microsoft does, the chatting, the Microsoft chatting thing, MSN?
Messenger. Yeah. That. Messenger, Google Chat, whatever people could come up with.
And they interacted with Michael. So they were typed interactions and then Enes read
them.
So our whole dialogue was a thousand utterances and we extracted a hundred out of
them. And this is our tidy-up scenario. And I don't know how clear you can see, but this
house is very messy. Can you see the chair is out of place, there is a puddle near the
door, there is a puddle in the kitchen, there are plates on the floor. Can you see them or --
[multiple people speaking at once]
>> Ingrid Zukerman: Yeah. And these are sample utterances, open the door to the
cupboard in the lounge, the book is on the desk. And what we want them to achieve is
this, where the house is all neat and tidy. And so the help-me scenario, again, you know,
people -- I forgot my pills, where did I put them, and so on and so forth.
So in this experiment ASR error is lower. I have no idea why the ASR error fluctuates so
much. And ASR error is lower and again we have the baseline and Scusi?. And out of
100 utterances, we have again -- we're beating baseline and the rank is pretty good and
we have only seven not found as opposed to baseline 47 not found.
So again top 3 we're doing better than ASR error.
And here I collected some of the utterances that we had trouble with. Henry's little bowl
needs a wash. So I suppose that would be an Australianism. That wouldn't be American
English. Wash the shirt on the sofa in the basin, and it had trouble with inside the
wooden box.
Okay. So how are we going for time? Okay. I can do a quick fly-by of interpreting a
sequence of sentences. Bill was interested in that. This is like brand-new stuff.
We're still debugging it.
So okay. What happened is that people do not speak in these smooth sentences but they
actually speak in segmented form. And so what we want to do before -- okay. Sorry.
Before we proceed to dialogue, we want to interpret a sentence sequence. And after that
we'll proceed to dialogue. So what we do is for each sentence we generate UCGs. We
determine their mode, declarative or imperative, and we determine coreferents. Then we
generate UCG sequences, using the most promising UCGs for each sentence, so
again combinations.
We generate mode sequences. We generate coreferent sequences. And while there is
time we continue playing this game and selecting the best. So this might clarify this
concept. So this is actually one of the things we -- one of the utterances we collected.
Go to the desk near the computer. The mug is on the desk near the phone. Fetch it for
me.
So then after parsing -- these experiments we did only with typed input. We didn't even dare go
for speech. Even with typed input we were having problems. So after parsing you get the "to"
prepositional attachment. The desk could be near the computer or the going could be
near the computer. Again, the mug is on the desk near the phone, the mug is near the
phone or the desk is near the phone. And fetch it for me. And again prepositional
attachment problems. So these are top 2 for every sentence, top 2 parsing, or structures.
Now we can generate two UCG sequences. Then we have the modes, and the modes are
pretty clear here: imperative, declarative, imperative. The second mode is very low
probability in all cases here.
And then we want to do coreference resolution. Like which desk? Is it the desk in this
sentence itself or the desk in the previous utterance? And what is "it," the mug or the desk?
So we play this game and generate a lot of possibilities and what's very pleasing is that
we can actually continue our probabilistic calculations and extend them to multiple
sentences.
So if we continue with the same formalism, we actually get the probability of a UCG
given the text, the mode given the parse tree and the text, and the [inaudible] coreferents
given all the parse trees.
And the first component we get from a single sentence, like that's what we've done up to
now. The second component, the mode, declarative or imperative, we get from a classifier
based on the corpus, and the coreferents we get from heuristics plus the corpus. And the
heuristics we plan to replace with corpus-based information, but at the moment that's
what we are using.
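Reconstructed from this description, with T_i, P_i, U_i and M_i the text, parse tree, UCG and mode of the i-th sentence and R the coreference assignment, the extension is roughly

P(U_{1..n}, M_{1..n}, R \mid T_{1..n}) \;\approx\; \Big[\prod_{i=1}^{n} P(U_i \mid T_i)\; P(M_i \mid P_i, T_i)\Big]\; P(R \mid P_{1..n})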
So just quickly. Coreference resolution we use -- we can do pronouns [inaudible] and just
noun phrases, like the book. And because we don't have a lot of data, we do our
Bayesian voodoo that we like to do. And we end up with this formula for coreferent
resolution. So this is the probability that referent I refers to -- that referring expression I
refers to J.
Sorry, sorry, sorry. No. No. This is the probability of a particular referring expression J
in sentence I. Sorry. And now we have the type of the referring expression, the
probability of referring to a particular sentence, and the probability of referring to a
particular noun phrase in that sentence.
And, yes, we had to break it up that way because we just don't have enough data to do all
the noun phrases in all the sentences. So you break it up into referring to previous
sentence and referring to a noun phrase inside that sentence.
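As a rough reconstruction of that factorisation (the slide formula is not in the transcript), the probability that referring expression j of sentence i resolves to noun phrase k of an earlier sentence s is broken up as

P(r_{ij} \rightarrow \mathrm{NP}_{sk}) \;\approx\; P\big(\mathrm{type}(r_{ij})\big)\; P(s \mid i)\; P\big(\mathrm{NP}_{k} \mid s\big)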
Okay. So this is just very -- the math very, very quickly. And now we can go for the
evaluation for which we need to do a lot more work.
So we had an experiment where we asked people to pretend they've gone to a meeting
and they've forgotten something and DORIS has to get something for them. And we
got -- again, and this was a Web experiment. So people just typed stuff over the Web.
And we got 115 requests and the vast majority had more than one sentence, and some
people went to town and had up to nine sentences.
Now, this is -- refers to what Bill said that these -- we actually had to make manual
changes to even be able to process most of the sentences. Like things like I need you to
fetch. It's just fetch. It's not the -- or, well, and we don't do composite nouns at the
moment. So coffee table was just table. And things like that.
So we did all our systematic changes and, yeah, how -- well, we have a plan -- we have
some hopes to apply machine learning techniques to translate from people English to
DORIS English. And this is an example of what people gave.
DORIS, I left my mug in my office and I want the coffee. Can you go to my office and
get my mug. It is on top of the cabinet. It is on the left side of my desk. This is actual
input. And what we generated was this. This is what we can process.
So -- yes.
>>: What are the bottlenecks? What stops you from processing the [inaudible]?
>> Ingrid Zukerman: Well, some things are just hacks, like DORIS please, things like
that. [inaudible] that's all. I mean, those are little things. I left my mug in my office.
This is narrative. Right? It really means my mug is in my office or I think my mug is in
my office.
But there is a lot of irrelevant narrative, and what it indicates is a declarative sentence of
where something is supposed to be. So that is a problem.
Then other things --
>>: I guess my question is like if I was to try and think this [inaudible] and put it through
the machine [inaudible], why would it -- like what would you expect --
>> Ingrid Zukerman: Oh, well, it doesn't do conjoined sentences. So I left my mug in
my office and I want a coffee.
>>: Like who doesn't do it? The parser? The recognizer can do that, right?
>> Ingrid Zukerman: We are getting between 20 and 30 percent error even on simple
stuff. Like whether the recognizer can do it or not is a big if. The parser I'm not sure.
From parsing to UCG definitely not. We are not doing conjoined at the moment.
I mean, these are all like -- some of the things are pretty easy. If the parser behaves, you
can break up conjoined sentences. Right? So some of the things -- but a lot of it is really
irrelevant narrative, or at the moment irrelevant, like, yes, you might argue I want the
coffee, maybe DORIS should get the cup that has the coffee as opposed to the cup that
doesn't have any coffee in it, for example.
But if you're going bottom line, like translated from verbiage to command and control, all
you want is -- this is what you really want. My mug is in my office or --
>>: [inaudible] transformation, is this --
>> Ingrid Zukerman: Manual.
>>: [inaudible] or manual?
>> Ingrid Zukerman: No. This is totally manual, but we have rules. Like we have
somebody sitting there following -- we have a scribe following rules.
>>: But these are not rules that could be easily implemented [inaudible].
>> Ingrid Zukerman: Well, depends on what your program can do. Like --
>>: Okay. Got it. Yeah.
>> Ingrid Zukerman: Yeah, some of them -- well, our plan is to apply machine learning
techniques to see can you go from this type of structure to that type of structure, which is
what ->>: So this goes back to my question earlier. The syntactic [inaudible] may actually hurt
in a case like this where you may be better off [inaudible] --
>>: Word spotting?
>>: Yeah [inaudible], word spotting approach. Out of all this verbiage, you're looking
for a few pretty straightforward types of information.
>> Ingrid Zukerman: But the question is you want to look for -- I mean, if you do
straight word spotting, then you'll find what you are looking for and you may not find
what is being said if they want something else.
>>: Right. It will be kind of an ROC curve in all the methods. I guess one comment is
sort of an alternative approach. So Tim Paek and I engaged in Receptionist version 1,
had sort of a goal hierarchy admittedly for a limited domain, can a receptionist do with
a -- and other things [inaudible] but we actually did it -- we as a [inaudible] parse plus
word spotting to use parts of speech and words that kind of coded into a handcrafted
[inaudible] and that had certain characteristics.
That's kind of [inaudible] suggested here, but kind of the -- not just spotting words but
spotting structure [inaudible] structure as well in a pretty tight context.
>> Ingrid Zukerman: Yeah. That's the thing that -- I mean, we were trying to stay away
from context until now. But, yeah, I prefer to do it -- as I said, I prefer to do it at the end,
but also for other reasons. Like at the moment we do this, right, we don't consider that
the person wants their coffee. But if we want to do for sophisticated planning, actually
recording that they want the coffee and knowing that will be important. It's just that at
the moment we're pretending it's command and control.
So, yeah, I take your point about the spotting, but I worry that it may be too restricted.
So in any event, okay, here, the evaluation, we didn't do as well as we hoped. But at least
we partially know why.
So as I said, this was typed. There was no speech recognition. And for complete
requests we had 51 percent in the top. And it actually doesn't degrade. It's pretty binary,
the whole thing. Either we find it, and if we find it we're doing pretty well. Like you see
the median rank is zero and the 75 percentile rank is 1. If we find the right interpretation.
But if -- but we don't find it in a lot of cases. Like in 31 percent of the cases we don't find
it at all.
Now, we also did this evaluation for partials, which is just for UCGs inside an
interpretation. Because for complete requests you have to get all the sentences right,
whereas for UCGs, you have to get -- well, you want to know how many you got right.
So we do a little bit better, but still we need to do more work.
Now, why aren't we doing -- I mean, where did we fail? Okay. The first place where we
failed is that we did not compose interpretations from several imperatives. Okay. One
thing that I forgot to mention is that we're combining declaratives in this interpretation
process.
Where is it? Here. Merge UCGs -- I forgot. I kind of glossed over that one -- which
means if you say my mug is on the table, it is blue, then you get two merged
interpretations. I have a blue mug on the table and I have a mug on the blue table. And
we did all these mergers but only for the declaratives and for declaratives and
imperatives.
Initially we thought that people would not add extra information into imperatives. We
were wrong. People add extra information when they are giving more than one
imperative sentence.
So this actually accounts for 19 requests that we just didn't process properly. And
anaphora resolution accounts for 6 -- well, I mean, this is how it worked out and we need
probably better mechanisms for anaphora resolution.
And PP-attachment was also problematic. While we know that the mug near the laptop --
the mug on the table near the laptop -- it's the mug that's near the laptop, DORIS is not
quite sure about that. So we have some ideas about how to improve the PP-attachment,
but at the moment this is problematic.
So -- yes.
>>: [inaudible] I was noticing that when you were showing PP-attachment, especially
with near, you're still going to find the mug.
>> Ingrid Zukerman: Yes. That is a good point. Yes. That is a very good point, that we
often actually find the correct object but because our evaluation -- like whether the mug
near the laptop on the table, whatever, you end up finding the right mug but with the
wrong structure. So at the moment we're punishing ourselves for that structure. But, yes,
we end up finding the -- yes, that is correct.
So what have we done? Okay. We have our speech interpretation module that I've
motivated, keep track of [inaudible] interpretations, and we have provided this
probabilistic formalism that -- well, to handle the uncertainty, but also what I like about it
is that we're being able to extend it to multiple sentences. And we have an anytime
search algorithm, and we are extending this formalism to sentence sequences.
So where are we going next? Yes, we have to deal with ASR error, but also we have to
do it for these multiple sentences. We hope to improve performance for multiple
sentences to something reasonable, and then we'll go to ASR error.
Extending grammatical capabilities. As I said, applying machine learning techniques is
the direction I want to go to. We need additional dialogue, and then we'll move on to
dialogue and integrating with vision. If somebody gives us another robot, we can
integrate with vision. Okay. That's it. Thanks.
[applause]
>> Eric Horvitz: Any questions? We've been pretty interactive the whole way through.
>>: So I wonder if you might use like the environment to resolve some of these
prepositional phrase attachments, like can you --
>> Ingrid Zukerman: Yes.
>>: -- [inaudible] there's a mug next to the laptop.
>> Ingrid Zukerman: Yes. That's exactly what we are planning. Actually we are going
to abandon the full pipeline and, yeah, what we are going to do is for the UCGs we're
going to go down to the ICG level and see, okay, which ones actually exist. Yes.
>>: It would also be interesting -- I guess I don't know. So you have -- there's a chain
here of components, and each of them has uncertainties and failures and blind spots. It
would be interesting to see which [inaudible]. Like I wonder [inaudible] people have
done all this work on just ASR reranking, right? If you were to take just your ASR
results and apply some sort of like reranking technique that uses features, that describe,
you know, some of these relationships that you're trying to capture, just rerank the ASR,
you know, and pick the best hypothesis and parse it, like how would that fare against, you
know, this approach where you're sort of keeping uncertainty later in the chain. I think is
really interesting, just that comparison I think would be interesting to see. I don't know.
>> Ingrid Zukerman: No, we tried it once actually. We used just vocabulary to rerank.
But to be honest, I don't remember what happened. Probably nothing much, if I don't
remember what happened.
>>: And the other thing I think is like so the ASR, I have a question about it. So you're
saying like you're getting these 20 percent, 30 percent errors. And you're seeing those as
high. And I'm kind of wondering -- like in my head they're actually pretty good,
especially if you're in a robot setting. And I'm assuming like the experiments like the
ones you did with ASR, was it [inaudible].
>> Ingrid Zukerman: Oh, yeah, totally.
>>: And so you [inaudible] the robot in a home where the TV is going and the robot is
like five feet away, you kind of have a slim chance. Like it's actually interesting, and I
think these techniques have more space for improving things and for robustifying things,
when ASR is actually bad.
>> Ingrid Zukerman: Yes. Well, actually an interesting point in one of the other
experiments that we did is we -- with visual DORIS we had a simulated vision system
that -- and we cranked up the vision error up to 200 percent, so by the end you could think a
mug is an elephant. And the longer your description -- like the mug on the table near the
phone -- the better the performance got because you had so many factors to disambiguate
that even if you thought an elephant was a mug at the end you didn't.
>>: So it would be interesting, yeah, to see like -- I don't know if you get a chance in
your teaching experiments to actually -- I mean, it's hard to vary unless you do a
simulated sort of model, like to vary the word [inaudible] speech recognition accuracy.
But what you can do is get a lot of people, and inherently some of them will have
different error rates. And then you can just have that as a post factor [inaudible] analysis
and look at the correlations between when my error rate is low, how much do I improve
[inaudible].
>> Ingrid Zukerman: Yes. Yeah. Like you can -- yes. Yes. And -- yeah. We can do
the reranking thing again, because I don't remember what happened. No, it was years ago
and for some reason we didn't pursue it. Half of me thinks it's because we thought it was
a bit of a cheat, if you just add your vocabulary. Like, you know, you saw, we have up to
150 things in the knowledge base. So if you're just going to say, okay, I'm just playing
with my own vocabulary and nothing else, it's a bit of a cheat.
And also people could be referring to things by their own vocabulary, like the chalice.
Well, the chalice could still be a mug. Okay, it's a stretch. But, you know, it's some type
of vessel.
So even though it's not part of your objects, if people want to call something in that
way and it's in some vocabulary, they're entitled to. But, yeah, we should do a bit more
with this.
>> Eric Horvitz: All right. Thanks very much.
[applause]