>> Michael Gamon: And so today it’s the 31st Symposium, and Emily actually pointed
that out. So we have two interesting connections to history today. The first
connection is the first talk, because Emily gave a talk exactly ten years ago at the first
symposium, and I think she’s already working on the slides for 2023. And, of course,
now we’re going to do exactly the same speakers again for the next 10 years in
sequence. With the exception of Meg, who is taking the role of Bob Muller today,
who was the first speaker from Microsoft 10 years ago. But Meg also has a great
connection to the program because she’s a graduate, so you can actually see that
people who go through this program can turn out to be perfectly well-adjusted
people, which may be encouraging.
[laughter]
And Fay is going to introduce Emily and Woodley for the first talk, and then I’m
going to introduce Meg for the second talk, and as usual we’re going to do about 30
minutes for each talk, including questions.
>>: Okay, so I guess I don’t really need to introduce Emily. Everybody knows who
Emily is. She’s our associate professor in the linguistics department. She also has an
adjunct position in CSE, and I guess she’s faculty director for the CLMS program,
and I can see maybe 50 percent or more of the audience actually come from that
program or are related to that program.
She has a new book coming out, 100 Essentials from Morphology and Syntax. That’s
her second book. And she’s interested in [inaudible] engineering, computational
semantics, and the relationship between computing and linguistics. And our second
speaker, [inaudible]. So Woodley Packard is a second-year CLMS student, and he got
his bachelor’s and master’s from Stanford, and I guess she’s going to -- he’s, I keep
getting it wrong, sorry about this -- he’s going to, I guess, graduate from our program
in maybe a quarter or two quarters --
>>: End of the year.
[laughter]
>>: I don’t want to add any pressure. I’m just counting the quarters. So I guess
Emily will start.
>> Emily Bender: Thanks, Fay.
[applause]
So you might think it’s a little confusing that Woodley is the first author on this work, but the
way we decided to divide up the presentation, I get to go first. This is a paper that
represents some joint work with some colleagues of ours at multiple different
institutions in Europe, and the title is predicting the scope of negation using minimal
recursion semantics. I’m sure you all had the chance to read that.
So I’m going to start with just some motivation for the topic. Who cares about
negation? Well, we should all care about negation. So here’s an example from
machine translation that another colleague of ours, Francis Bond, turned
up. This Japanese sentence, [inaudible], should be translated by a human as "we
shouldn’t have any prejudice," but if you put it through a fairly standard SMT system,
a configuration of Moses, you get "you should have a bias."
Right? And this work that Francis was reporting on found that Moses lost the
negation two thirds of the time, and that’s pretty important. Without the
negation you can still gist what the text is about, but not what it’s saying. Another
example comes from Gmail. So there are these add-ins that helpfully try to connect
to your calendar. So actually I have a meeting next Thursday. [inaudible].
Well, no thank you. I’d rather not do that. And also, perhaps more seriously, there’s a
lot of interest in detecting negation and the scope of negation in the biomedical NLP
field. So the BioScope corpus was the first large-scale effort to provide an annotated
resource here.
And this is to support applications like coding clinical notes for insurance claims,
right? So if you miss negation and you code something that was mentioned but
actually isn’t there, you can get in trouble because you’re charging the insurer too
much; but also epidemiological studies and other uses of text mining.
And so BioScope annotates things like "a small amount of" -- I don’t know how to
pronounce that. Adenopathy? -- "cannot be completely excluded." Did I get
it right? Adenopathy?
So this is a particularly interesting example because we have cannot, which is sort of
a grammatical clausal negation, but then excluded is also a lexical representation of
negation. And so if you really want to know what this sentence is saying you have to
be able to handle both of those markers of negation, which leads into this slide here.
So there are many different ways to mark negation. There’s clausal negation, which
in English is a separate word, as in "she decided not to go home," where the
embedded clause "go home" is negated with that not. And we also have constituent
negation on noun phrases, where it shows up as a determiner: "he had no friends
in Seattle."
But there are also negative affixes. So a prefix: "space flight was thought impossible."
A suffix: "that kid was clueless." And an infix -- at least, according to the annotated
data that we’re working with, this was considered an infix: "we got hopelessly lost."
They called it an infix because the -less was in the middle of the string.
A linguist would call that a suffix that happens to have another suffix after it. Read my
book!
So what was the thing I was talking about? There was a shared task at *SEM,
the Joint Conference on Lexical and Computational Semantics, in 2012,
on detecting the scope and focus of negation. And so the organizers of that shared
task formalized a notion of negation cue and a notion of the scope of negation, and also
focus, which is what within the scope is primarily negated, and then negated events.
So there are lots of different subtasks.
And they developed careful annotation guidelines and then annotated a bunch of
Sherlock Holmes data for training and testing. So you have annotated data where the
cue is marked, and the scope according to that cue, and then further annotations that are
not what we’re focusing on here.
There was good interest in this problem. So there were twelve
submissions from seven different institutions, and to give you a sense of the task,
here is some of the data. So you have sentences like "the German was sent for,
but professed to know nothing of the matter." It gives you a little bit of the flavor of
the style of this text too, which is fun to work with.
It may be that you are not yourself luminous, but you are a conductor of light. This
is way more fun than most [inaudible] text. And I trust that there is nothing of
consequence, which I have overlooked.
So we have nothing and not showing up in these examples, but there are also the
affixes, which they [inaudible] in a similar way.
And what’s particularly interesting, I think, is that the scope is based on the semantic
dependencies. So in the first example, "the German was sent for but professed
to know nothing of the matter," we have know being negated because it has a negated
argument, and then the rest of its arguments are also in the scope, so
the German is counted as in scope.
Of those previous approaches, the winning submission came from the University
of Oslo, and it was submitted by our coauthors, among others, so we teamed up
with the winners here. They used an SVM to detect the cues and then to rank the
syntactic constituents for scopes. So their training data included, I believe, turnac
[phonetic] parser data: automatically produced parses for all the
sentences, which the system then used to come up with what might or might
not be in the scope.
There was also a submission from the University of Washington by Jim White; I
don’t see Jim here today. That approach used regular expressions for cue detection
and then CRF sequence labeling for the scopes. So it wasn’t working with semantic
structures, in this case not even really working with the syntactic structures, just
looking at the surface strings. And that wasn’t too far behind the winner. There was
one system that actually approached this semantic problem from the point of view
of an explicit semantic representation, and that was the University of Groningen
submission, where they used DRS, discourse representation
structures, as produced by the C&C parser and the Boxer system, and that gives a
fairly explicit notion of cue and scope.
So the only real modification they made was to change some of the lexical entries in
Boxer so that more things gave the negation symbol in the semantics, and then they
just read the scope off of their semantic representations and then sort of filled in the
blanks to try to handle the semantically empty words.
They did not do well in the competition, somewhat disappointingly. It seems like
that’s a principled approach. It should have worked.
So in the paper describing this corpus, [inaudible] characterized negation as what
they call an extra-propositional aspect of meaning, which struck me as very strange,
because what I know of semantics is that negation is actually a core piece of
compositionally constructed representations, and that’s what I think propositional
semantics means.
Their notion of scope of negation is not quite the same as the way we use scope in
mainstream underspecified semantics. It is more tied to predicate-argument
structure and uninterested in scope ambiguities with quantifiers. But looking at the
annotations, and looking at the minimal recursion semantics structures I’m going to
tell you about in a minute here, we actually noticed that there was a nice alignment.
And so the [inaudible] structures give us a good starting point for trying to model
the task-specific notion of scope of negation.
So, very quickly, what minimal recursion semantics is: it’s a flat, underspecified
representation of formulae in predicate logic with generalized quantifiers. It
makes explicit the predicate-argument links, and also the links between scopal
arguments and the entities that fill those argument positions. Some of those are
fixed: negation takes a scopal argument, but it’s fixed by the syntax. That’s in
contrast with quantifiers, which are also scopal things but can float around.
And so an MRS is an underspecified representation that can be monotonically
refined further to get fully scoped representations. When we first started working
on this we thought that we actually wanted those fully scoped representations, and
we quickly discovered that actually that’s not the relevant notion of scope. So we’re
working directly with the underspecified representations.
The main reason we’re using MRS is that it can be produced automatically at scale
for English by the English Resource Grammar. So the English Resource Grammar will
give us a syntactic structure like this. This is very simplified, in the sense that what’s
actually behind each of those node labels is an enormous feature structure with
hundreds of feature-value pairs, all right? But that gives you a sense of the
constituent structure. And part of the big feature structure of the big S node at the
top is the semantic representation for "the German was sent for but professed to
know nothing of the matter." And it’s this representation that we’re going to be
crawling around in to pick out the scope of negation. And here’s where I hand it
over to Woodley to tell you how we actually do that.
>> Woodley Packard: Thanks, Emily. All right. So this structure, you can see that
it’s pretty machine-readable. It’s got a little bit more information than we actually
need, but the critical thing is it’s got -- maybe I should use my laser pointer here
instead of putting my hand up there -- the character positions that link these
predicates back to the string, and these argument positions that describe how the
different predicates relate to each other. It’s also got, in curly braces here,
properties on variables, and we’re actually going to totally ignore those for this
purpose. But you can see, for instance, how this works: the thing that’s known by
this know relation here is its ARG2, which is x18, and that’s a thing.
So the sentence here was one of our examples from before: "the German was sent
for but professed to know nothing of the matter." And that’s how that works out in the
argument structure.
So I’m going to show you a slightly different way of looking at this picture that throws away
some of the information we don’t care about in this project, but makes it easier to
see the graph structure here. So there’s more or less a graph underlying this, and it
looks like that. The links here are the arguments, and the nodes are the
predicates.
So the link I just showed you a moment ago was from know to thing. So there’s a link
from the no quantifier to the thing, and also from know to the thing.
All right. So to attack the problem of scope resolution, the first thing we do is find
out what the cue is, and we assume the cue is actually one of these predicates. Our
system assumes that somebody else has done the cue identification for us at the string
level, and then we project that onto the MRS using those standoffs that I showed you.
So this shows us where in the string the no was, and we call that predicate in scope.
And then, based on that, we can say: how are we going to look around the rest of this
graph to figure out what else is in scope? Because what we thought we would do is
figure out what parts of this graph are in scope and then project those back, using
those string standoffs, onto the surface.
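To make that projection step concrete, here is a minimal sketch in Python; the record layout and function names are illustrative assumptions, not the actual system’s API.

```python
# Minimal sketch of cue projection, assuming each MRS predication
# carries a character span (cfrom, cto) linking it to the surface
# string. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Predication:
    predicate: str   # e.g. "_no_q" or "_know_v_1"
    cfrom: int       # start character offset in the sentence
    cto: int         # end character offset in the sentence

def project_cue(predications, cue_from, cue_to):
    """Return the predication(s) whose span overlaps the string-level cue."""
    return [p for p in predications
            if p.cfrom < cue_to and cue_from < p.cto]

# Usage: if the annotated cue "nothing" spans characters 33-40, this picks
# out the predication(s) contributed by that word, seeding the in-scope set.
```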
So the basic idea is to just crawl around these links, and the trick is which ones we
should crawl and which ones we should not. So since this is an NP
negation with a quantifier here, the first step was to identify
the cue, and the second step was to activate verbs that take it as an argument.
So you can see these green links are links from some other predicate back into an
activated node. There’s one from of and one from know, but know is the only
verb, so we activate that. And once we’ve done that -- we’ve got
a couple of different kinds of crawling here that we give different names -- we call that
functor crawling, because it’s crawling backwards along the links. From then on we
only crawl forward.
We call those argument crawling and label crawling; we just did the functor crawling
as an initialization step. So looking from here, we can color in the different arcs that
are accessible from these in-scope nodes: the red arcs are argument crawling, and the
green arcs are what we could do if we were willing to functor crawl. And you can
see we want to be able to get from know to its arguments -- well, this may be a bad
example because they’re both pronounced the same way -- but from know to its
arguments, and also from the no quantifier to its arguments.
So those are both argument crawling, and we just go ahead and do that. And then
label crawling. So from here, what’s available is more green arcs, if we were
willing to do the functor crawling. And there are a few instances where we do allow
that, for modals and certain types of subordinating conjunctions, but there aren’t any
of those in this example.
I colored the label arc there blue. So there are a couple of different types of arcs here,
and the most interesting ones are the arguments, but the label arcs relate to
the scope structure that’s defined by the underspecified representation of scope
here. And the important thing is that modifiers of a noun or a verb will
share the label with it. So that’s this label crawling link here in blue, and we’re almost
always willing to crawl across that. There are a couple of exceptions, but they’re not
important right here.
So you can see we’ve already crawled all of the red links; now we’re going to crawl
this blue link to get to of, and then that enables us to get down a couple more red
links, and then there’s nothing more that can be done. So at this point we’ve
crawled around the MRS graph and activated everything that we could reach with
these rules, and it turns out these are exactly the words that we wanted
to have in scope according to the annotations.
So we project that back with the string standoffs to the surface string, and we find
that "the German ... know nothing of the matter" is what we’re predicting to be in scope,
and that’s correct.
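Pulling the pieces together, here is a compact sketch of that crawling loop; the graph encoding and the exact rule conditions are simplifying assumptions for illustration, not the actual implementation.

```python
# Sketch of scope crawling: seed with the cue predication, do one
# backward (functor) step to verbs taking it as an argument, then
# repeatedly follow argument and label links forward to a fixpoint.

def crawl_scope(arg_edges, label_edges, cue, is_verb):
    # arg_edges: set of (functor, argument) predicate pairs
    # label_edges: set of symmetric (a, b) pairs sharing a scopal label
    active = {cue}
    # Functor crawling, used only at initialization: activate verbs
    # that take an already-active predicate as an argument.
    active |= {f for (f, a) in arg_edges if a in active and is_verb(f)}
    changed = True
    while changed:
        changed = False
        for f, a in arg_edges:        # argument crawling: functor -> argument
            if f in active and a not in active:
                active.add(a); changed = True
        for a, b in label_edges:      # label crawling: shared-label neighbor
            if a in active and b not in active:
                active.add(b); changed = True
    return active                      # predicates, projected back to the string
```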
And that’s basically the way the system works in general. There’s a wrinkle, which
is these so-called semantically empty words. So in the semantic representation that
the English Resource Grammar produces, there’s in most cases one predication
(one predicate, one node in that graph) per word, but there are some cases where a word
just doesn’t contribute anything to the semantics. A good example is a
complementizer like to or that. And there are some cases where a word has some
meaning, but it’s not totally compositional, so it’s best to encode that as part of a
predicate from another word. So "send for" was an example.
Although that one, I think, was out of scope, so it didn’t matter here. Yeah, so actually
I think for is probably considered semantic with [inaudible] in this analysis, but it
didn’t matter to us because that was out of scope. But I’m going to give you another
example here that has quite a few semantically vacuous words, the ones in blue here:
"I trust that there is nothing of consequence which I have overlooked."
The curly braces there show the scope that we want to get, that the annotators
marked, and the red shows what the crawling rules I just described will get for
us: so is, thing, of, consequence, overlooked. Actually, the I should also be red; I think
you can’t see that. But which, have, that, and there are vacuous, in that they don’t
contribute an explicit predication to the graph. Most of those we want in scope, so we
have to figure out how to do that.
Since they don’t show up in the MRS, we have to back off to the syntax tree. So Emily
showed you before the tree structure and the MRS structure that are produced with
the ERG. So here’s the tree structure for that sentence, "I trust that there is nothing of
consequence which I have overlooked," and what we do is initially annotate the
words that the crawling system marked as in scope, and then walk up the lexical head
paths from those.
So something like nothing is the head of this entire noun phrase, but is not the head
once you get to the verb phrase. We mark from nothing as in scope all the way up to
here, and from is, which we marked as in scope, up to there, and from consequence
up to here, and so forth, so that a large portion of the tree around the area that we
marked as in scope is flagged. Then, from there, each of these semantically
vacuous words is relatively close to information about whether that section of the
string is in scope or not. And the types of semantically vacuous words
(relative pronouns, complementizers, helping verbs, things like that; there are about
four or five different classes of them) each have a set of rules
describing where in the tree to look to see whether you’re in scope or not.
So, for instance, have says "I’m in scope if my complement is in scope," and which says
"I’m in scope if I fill a gap in an in-scope sentence," or something like that.
So we iterate these rules a few times, because you can see that which is not actually
filling a gap in an in-scope sentence yet, because the head of that sentence is have,
syntactically. So after one step we can mark that have is in scope, because it takes
an in-scope complement; and after two steps (its head path went up to the S,
because it was the head of that constituent) which now fills a gap in that,
and we can mark it as in scope too.
And you can see that that out here doesn’t get marked: the rule for complementizers
is that they’re in scope if they are the complement of an in-scope verb. And there are
some idiosyncrasies in the way that the annotation guidelines work.
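A minimal sketch of that iterated pass might look like the following; the rule classes and their conditions are paraphrases for illustration, not the real rule set.

```python
# Sketch of the vacuous-word pass: after head-path projection marks
# tree regions as in scope, per-class rules are applied repeatedly
# until nothing changes, since one word's status can depend on
# another's (e.g. "which" waits on "have").

def resolve_vacuous(vacuous_words, in_scope, rules):
    # vacuous_words: tokens with a .cls attribute naming their class
    # rules: dict mapping class name -> predicate(word, in_scope) -> bool
    changed = True
    while changed:
        changed = False
        for w in vacuous_words:
            if w not in in_scope and rules[w.cls](w, in_scope):
                in_scope.add(w)
                changed = True
    return in_scope

# e.g. rules["aux"] might say: in scope if my complement is in scope;
# rules["relpron"]: in scope if I fill a gap in an in-scope clause.
```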
All right, so we had to evaluate this. We used the Sherlock Holmes data and we used
the gold cues. It comes with a train/dev/test split that was used for the shared task,
and we used that. We designed these rules using the training data, we looked at the
development data once, and we applied it all to the one-best analysis from the ERG.
So the ERG has lots of different analyses for any particular string (ambiguity), but
there’s a statistical model that will tell you which one is probably the right one, and
there is actually a confidence metric that goes with that.
So we did this, and the results were kind of okay. You can see it’s a high-precision,
low-recall system. One of the reasons the recall is low is that the grammar
doesn’t have full coverage: frequently there is no parse, or the statistical parse
selection gets the wrong parse.
There were a few cases where there were rare cues that we didn’t know what to do
with. So there are a lot of cases where the system doesn’t put out the wrong answer
per se, but says "I don’t know." So the obvious thing to do is: let’s combine with
another system that’s a high-recall system. And the winners from the competition
were certainly that. And we’re good friends with them, so we said, okay, let’s do a
system combination with you guys.
And this system combination is: we use our results whenever they are available,
and use their results whenever our results aren’t available. And you can see the
recall pops way up. It was 67 percent token-level recall on the development set; it pops
up to 83, and all the other numbers are better too.
Well, the precision drops a little bit, but not much. So that’s a lot better. It’s not
quite consistently beating the winner of the competition yet, but it’s actually
relatively close, and there is one more game we can play, which is that sometimes we
put out a guess about the scopes even when we have a very low-confidence parse.
So we have this confidence metric (a maximum entropy parse selection model), and
we can just say: all right, what was the probability that that model assigned to this
tree, conditioned on the input string?
So if we say, well, if our probability of having the right parse was at least 50 percent,
then we’ll use our system, otherwise we’ll fall back. By now we’re only
producing guesses for about 25 percent of the negation cues, but it’s still a
meaningful portion, and using that we can actually outperform the published
results with the system combination here. You can see that the published results
down here, from the winners of the competition last year, are here, and we’re able to
outperform them in almost every box. They won in recall in one box there.
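In code form, the fallback logic just described amounts to something like this; the interface names are assumptions, and only the 50 percent cutoff comes from the talk.

```python
# Sketch of the confidence-thresholded system combination: use the
# MRS-based prediction when a parse exists and its parse-selection
# probability is high enough, otherwise defer to the high-recall
# SVM-based system.

CONFIDENCE_THRESHOLD = 0.5  # the 50 percent cutoff from the talk

def combined_scope(sentence, mrs_system, fallback_system):
    result = mrs_system.predict(sentence)  # may be None: no parse or unknown cue
    if result is not None and result.parse_probability >= CONFIDENCE_THRESHOLD:
        return result.scope
    return fallback_system.predict(sentence).scope
```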
It’s not a big win, but it is one percent, five percent error reduction, ten percent
error reduction, something like that. So we feel like that’s kind of exciting. I can’t
tell you statistical tests about whether those are significant or not. It seems like at
least the -- anyway --
>>: What was the size of it?
>> Woodley Packard: So the size of the test set is -- do you remember the number
on that? I think there are 200 negation cues; there are more, but we’re not doing
the task of identifying which ones are cues or not. And the number of tokens in it has
got to be 3,000 tokens or so, of which several hundred are in scope.
In conclusion, the MRS-based system is high-precision but low-recall, and in a
system combination it seems to be able to perform at least as well as, and probably a
little bit better than, the best published results. We also did an oracle experiment where,
instead of using the confidence metric to decide which system to use, we used
whichever one performed best according to the gold, and that was able to perform
even quite a bit better than the system combination.
Actually, the numbers there, I think, on the test set went up to 90 point something.
And we wanted to note that we implemented our rules looking at the data, but
actually, if you look at the guidelines and compare them to our rules, you can see that
they line up pretty well. The nice thing about a rule-based system is, of course, that you
can interpret the rules, and it looked like the rules did just about what the guidelines
said.
The fact that we were able to converge on those rules, and that the rules matched up
with the semantic representations that the ERG produced, we think kind of
validates each of them in a way.
So we think it’s neat that an explicitly semantic approach to this problem was able to
build on a purely machine learning approach.
Thank you.
[applause]
Questions?
>>: So among sentences where the correct parse was chosen by the selection model
and there was coverage, did you find that most of the recall errors disappeared in that
case?
>> Woodley Packard: So it’s hard to say exactly. I mean, we did some treebanking
to determine what the -- so in general we don’t know when the right parse was
found or not found. You’d have to go in manually and say: what was the right parse,
did we get it or not?
A lot of the cases in error analysis, where we looked through to see what was going
wrong, we found that it was the wrong parse. On the other hand, we did some
manual treebanking and produced gold analyses for a bunch of sentences, and
when we ran those through our rules they didn’t perform that well. It looked like
the reason was that our gold analyses had a lot more of these rare cues in them,
things that the statistical model just wasn’t picking up on because they were low
frequency. And so our rules didn’t have them either.
>>: A corollary to that: how did you get the 50 percent figure that you decided on?
>> Woodley Packard: For the confidence? We did play with pushing
that back and forth. 50 percent was actually the first thing we tried, and it turned
out there was not much difference between doing that and, say, 25 percent or 75 percent.
Yeah?
>>: Have you thought of using the [inaudible] approximation model that
[inaudible]?
>> Woodley Packard: No, we haven’t done that. Actually, E [phonetic] hasn’t
released the software, that I know of, to make that possible.
>>: You could ask [inaudible].
>> Woodley Packard: Yeah, so to fill in some context: there is a PCFG approximation
to the ERG that lets you produce a derivation tree, at least in cases where the
grammar wouldn’t be able to parse, and there’s a method for trying to produce an
MRS from those derivation trees even when the unifications according to the
unification-based grammar wouldn’t actually succeed, and you can get a graph that
looks like our graphs out of that. That’s relatively new research, and it’s not
software that we have access to. But we’d like to try that.
The fact that the lower scoring trees yielded noticeably worse results suggests that
maybe trees coming out of that wouldn’t work well either, but we don’t know. But
it’s worth trying.
Any other questions?
Okay.
[applause]
>> Michael Gamon: So it’s my pleasure to introduce Meg Mitchell. I’m very lucky she
joined our group recently. Meg, as mentioned before, actually got her master’s
through the UW program, got a Ph.D. at the University of Aberdeen, and was also a
visiting scholar at Oregon Health and Science University. And Meg is particularly
interested in the connections between language and vision, which is a really
interesting and, I think, emerging field that’s drawing more and more attention.
And today she is going to be talking about generating human reference to visible
objects.
>> Margaret Mitchell: Can you guys hear me okay? I’m good? Okay. So I’m Meg.
I’m going to be giving sort of a general talk on generating human reference, to visible
objects in particular. This is something I’ve worked on from different angles for a
while, including collaboration with a bunch of people. I kind of keep the same slide
deck and keep switching in slides, but then I’m not sure what names to take in and
out, so this includes a lot of input from a lot of people spanning a lot of universities.
And I thought it would be kind of fun to look at this from the perspective of when I
started in the CLMA program (now it’s called CLMS, but then it was called CLMA),
and just sort of pick up from the last presentation I gave there, when I was doing my
thesis proposal, and then start sort of mapping out the space and the problems that
fell out from that beginning.
So let’s see. So my plan is to briefly cover some psycholinguistic studies I’ve run to
tease out what it is useful to model when you are trying to look at human reference
to visible objects, and some models that I’ve developed in pursuit of that goal. I
hope it’s clear from this talk that this is a wide-open space where there is a lot of
future work that can be done, but we have some sort of nice starts.
Okay, so back to 2008. This is a slide from my thesis proposal in the CLMA program;
Mike nicely made me these pretty spheres. I was trying to dive into the task,
looking specifically at referring expression generation, which is a subtask in natural
language generation, and I was concerned with how a system could produce
referring expressions that sound natural.
So how can we produce expressions that uniquely pick out items in a scene in a way
that sounds human-like? That sort of brought up the questions: what does natural
mean? What could the input possibly be to get to that point? And then, if we actually
do get to the point of outputting something that sounds natural, how do we evaluate
it?
So this can be seen within the general context of natural language generation,
which deals with taking non-linguistic input, associating it with some sort of syntactic
and semantic structures, and from there figuring out how to output sentences that
sound cohesive and fluent. So referring expression generation can be seen as just
a subtask within this larger set, where you’re looking at generating expressions that
can identify a referent to a listener or a hearer.
So the state of the art in 2008 was basically: given some sort of computer graphics
representation of an image and some very clear, simple semantics of that
object (something like color: grey, size: large, orientation: front), find the best way to
refer to that one thing in context.
So, say something like "the grey desk." Moving forward, in my master’s thesis at CLMA
I was starting to wonder what happens when you move outside of the computer
graphics world and move to actual objects that people are talking about. So I ran this
Mechanical Turk study, where I had prompts like "the blank on the left is tied with the
blank" over different configurations of objects, and tried to tease out what kinds of things
people were saying when they were referring to real objects.
One of the things that came out of this was the fact that there is an enormous
amount of speaker variation in the way that people talk about visual objects. So I
had 64 different participants and 32 different expressions to refer to this orange
bubbly thing. I didn’t have some underlying semantic representation; I just had the
image, and so it was sort of up to me to figure out how to get from this image to some
sort of set of possible ways to talk about the object.
And it’s sort of like a dual problem, where I’m trying to figure out what sorts of things
are worth mentioning, and then how they should be realized in the surface form. So
maybe spiked and spiky are from the same sort of underlying semantics, but these
are two separate problems to tease out.
Right. So going from the sort of traditional task of trying to find the best expression
that you can use to refer to an item, given some predefined set of attributes and
values, I started respinning the task as: given some visual input, how do we get the
set of possible things that people might say to talk about it?
One thing that’s interesting about this is that the task of trying to figure out how to
talk about one specific object is similar to the task of figuring out how to describe
that object, right? So something like "the bright, bumpy ball-thing randomly floating
on the right side of the slide" is something you could say about it given the visual
input.
And once you start to look at it in terms of description, you realize that it’s not a big
jump to go from talking about an object to saying full sentences, right? So this is just
a copular transition from a noun phrase to a sentence: "the bright, bumpy ball-thing
is randomly floating on the right side of the slide." And for some reason I, like, blew my
thesis advisor’s mind when I made this point.
But the idea is that, looking at the object, you can start to reason about the semantic
representations and move this to full-on sentence descriptions, it seems. Okay.
Along with sort of interesting variations: randomly floats, is randomly floating.
Right. So why look at object reference? Why is this interesting at all? There are
a bunch of reasons why I find this interesting; I kind of change around the reasons
depending on the audience. Because I don’t have a ton of time, I’ll just say here are a
few. One is that it anchors visual language. So, as I mentioned before, being able to
reason about the semantics of an object in a visual scene gives you a way to reason
about the visual scene as a whole. So they are nice little bits of meaning that
you can start adding further syntactic structure around.
It’s useful in things like assistive technology. So at the University of Aberdeen I was
involved in this project called How was School Today, where the idea was that kids with
cerebral palsy who have difficulty talking about the world might be able to have a
camera that takes pictures of things, and then that could be used to generate
descriptions of objects that they could talk about, that they could play with, and
things like that.
It’s useful for devices that can scan the world and interact with humans as an
assistant. So this is kind of like the AI problem, but it cuts right in there.
It’s also useful for automatic summarization of videos. So by being able to analyze
videos and attach them to linguistic structures, we can start getting at how people are
going to search through videos and find them without having to watch the videos
themselves. And it’s useful for creative tasks, just generating descriptions of
things for the fun of it.
Okay. It also kind of makes sense to look at the object within reference, because it’s
sort of how we evolved to be able to talk about the world, right? So if we’re going
towards a model that can talk about the visual world, it makes some sense to kind
of borrow from how humans have evolved to start learning how to talk in
general.
So we know that the object is a principal unit in early language learning. We know
that some of the earliest vocabulary items for children are usually nouns that refer
to concrete objects. And these are the kinds of visual things you’re interacting with
in the world. There has been a lot of work showing that referring nouns seem to be
conceptually more basic than the concepts referred to by verbs and prepositions, so it is
clearly sort of a nice starting space.
And here is the early reference. I try to have an early reference in all my -- anyway.
Nouns also provide a way to communicate about the natural world’s variety as
distinct entities. Right? So, like, when pidgins start up, people are usually just
using nouns and basic verbs in order to talk about the world. And something
that is also sort of handy (this is just sort of a happy coincidence) is that computer
vision, which is the obvious sort of input if you’re trying to automatically talk about
the visual world, is centered around the object.
So action detection doesn’t work very well, pose detection doesn’t work very well,
but object detection works reasonably well, and action and pose given object is
much better than action and pose alone.
Okay, so I ran a bunch of studies to try and start teasing out what it means to refer to
objects in these real visual domains. Previous corpora were sort of impoverished for
my needs. So the TUNA corpus was this corpus that looked at things like grey
desks in certain configurations, and the GRE3D3 corpus is a similar thing but with
cubes and spheres. And I was trying to sort of get at what people are going to do, not
when they’re asked specifically to refer to this thing, but when they’re given a bunch
of objects within a broader task and they are talking about them.
And how can we model that sort of stuff? There are further
details on each of these, one in INLG 2010 and one in CogSci 2013. At the same time,
it makes sense to think about what the input would be to eventually get to that point,
right? So I was simultaneously working with some computer vision people to try
and think about what computer vision can actually provide when I’m trying to get
just this sort of naturalistic output.
So things like object detection (these are bounding boxes around the objects) can
work reasonably well in small domains.
Okay. So there is no one corpus, there is no one study you can do, to sort of
understand how people talk about the visual world and get appropriate models for
that. But if you do a bunch of studies and you do corpus-based work, you can start to
see the sort of general tendencies that come out of that, and know what it’s useful to
focus your research towards. So that’s sort of what I did. I did this one study where
I had craft objects, which I chose because those are visual manifestations of visible
properties, right?
That’s the whole point: to just be these masses of visual properties. And I had
people describe how to use the objects on the board to put together a face.
And then I ran another study with sort of office objects, kinds of things I found
around the house, and I had them tell an assistant how to place them in these sort of
grid-like patterns in order to replicate some images that they saw on the screen.
What I found sort of going through all these things is a bunch of things that sort of
seem obvious, but there hasn’t been a lot of work done on them so it’s important to
kind of focus on them in particular. One thing is that people refer differently.
So this I mentioned in my master’s thesis work, but it’s obviously come out again
and again, and again: one-shot descriptions don’t make sense when you’re trying to
talk about the visual world. That’s true for generating referring expressions; that’s
true for generating full descriptions. Different people will refer in different ways,
and a single person will refer differently in the same context. So unless you’re trying
to build an agent that is very repetitive, it makes sense to start modeling the
distribution over both the sort of semantic content that people will pick when they
talk about an object, as well as the surface forms that that semantic content can be
used to realize.
Okay. It also becomes clear that ordering really matters. And this is true both for
modifiers and for nouns, and for a bunch of other things. So this sort of led to a
bunch of projects, where I noticed that people were comfortable saying
things like "comfortable red chair," but "red comfortable chair" was dispreferred.
And this is true for even longer phrases, right? "Big, rectangular, green, Chinese silk
carpet" is awkward, but sounds somehow more correct than "Chinese, big, silk, green,
rectangular carpet."
And there were sort of no existing models for this kind of thing. There are further
details on this in the ACL 2011 paper, but the sort of punch line from this work is: if you
automatically parse a ton of corpora, extract the noun phrases, and then train a
language model on those noun phrases, you can do really well at doing this
automatically.
If you want to do a class-based approach, it works pretty well with a hidden Markov
model, which you can update using EM to sort of learn these latent classes that
give rise to the surface forms.
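The n-gram version of that idea can be sketched in a few lines; the scoring function here is an abstract stand-in for whatever language model has been trained on extracted noun phrases.

```python
# Sketch of LM-based prenominal modifier ordering: score every
# permutation of the modifiers under a language model trained on
# noun phrases, and emit the best-scoring order. Fine for the
# handful of modifiers a typical NP contains.
from itertools import permutations

def best_order(modifiers, head, logprob):
    """logprob: function scoring a token sequence under the trained LM."""
    candidates = [list(p) + [head] for p in permutations(modifiers)]
    return max(candidates, key=logprob)

# best_order(["red", "comfortable"], "chair", lm_score) should come back
# as ["comfortable", "red", "chair"] if the LM has learned the preference.
```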
There’s another problem that comes out of this where there is still a lot of work
to be done; I don’t think anyone has touched it. And that’s: given some sort of
semantic representation, when are things postnominal and when are things
prenominal, and how do you make that decision?
Ordering also matters for nouns. So within a description of a scene, there is a
tendency to put animate things at the front of the sentence and inanimate things
near the end of the sentence, and people are actually kind of predictable about what
they choose as subject and object.
So I did some work on this using WordNet hypernyms, where you could
look at object positions in some three-noun description, so a description that had
three nouns in it. And animate things tend to be in first position; things like
structures and boxes tend to be in the second position. And you can actually use this in a
generative model to order a set of given objects.
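As a toy illustration of the hypernym idea (the class inventory and position preferences below are invented, and it assumes the NLTK WordNet data is installed):

```python
# Sketch of hypernym-based noun ordering: map each noun to coarse
# WordNet classes and sort by a learned position preference
# (animate entities early, structures and artifacts later).
from nltk.corpus import wordnet as wn

POSITION_PREFERENCE = {"person.n.01": 0, "animal.n.01": 0,
                       "structure.n.01": 1, "artifact.n.01": 2}

def position_key(noun):
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return 3  # unknown nouns go last
    hypernyms = {h.name() for path in synsets[0].hypernym_paths() for h in path}
    return min((v for k, v in POSITION_PREFERENCE.items() if k in hypernyms),
               default=3)

def order_nouns(nouns):
    return sorted(nouns, key=position_key)

# order_nouns(["box", "dog"]) -> ["dog", "box"]
```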
Okay. So I was a little unsure about which attributes are important to
focus on. So throughout these studies you see that color tends to be massively
preferred, also size, sort of regardless of the task, material and shape as well, and
analogies and part-whole relations, which I don’t think have had any work for generation.
So there is a lot of research to be done there. So in the study with craft objects I
found that there’s a dominant preference for color, followed by size, material, and
shape. And also, material and shape are realized as head nouns, which
reflects people’s tendency to use shape when they’re giving object categories.
So things of the same object category tend to have roughly the same shape. And you
can see that in the way that people talk about objects.
This is also true for the study with office objects. Again, color, size, material, shape.
You see part-whole relations and you see fun things like analogies, sea urchin,
whatever that is.
It seems to be sort of like there’s a wealth of problems that you can work on here
and we’ve just started to scratch the surface.
So keeping in mind what the actual possible visual input is, it becomes clear that
although we can get some sort of understanding about the objects in the scene and
some of the attributes that apply to them, there are some things that referring
expression generation has assumed would just be available that really aren’t.
So in particular, size, right? You can’t look at a scene and just know "small"; that
actually takes some sort of reasoning about the X, Y, and Z dimensions of the objects
in the scene. So it kind of opens up the sort of problem space that’s available within the
sort of vision-to-language connection when you’re trying to generate human-like
reference. Vision provides things like color, shape, and material, but the relative
properties that are common for people to use, things like size, location, and spatial
relationships, require some reasoning about the object coordinates and
segmentation of the image, and there is still a lot of work to be done there.
I did some work on size in particular, because it was particularly common. It’s kind
of boring to work on size, but you know, it sort of needs to be solved, right, if you
want to start looking at this problem.
So if you have an object on the right here that’s a bit smaller than the one on the
left, there’s sort of some point where you start to say: okay, that’s still small, but
maybe it’s thin; now it’s maybe thin; now it’s just thin, not small; now it’s tall
and thin, or maybe just tall.
There seems to be something about the height, width, and depth that you can reason
about in order to figure out what kind of size language people will use in a visual
scene, and you can actually do pretty well on this using a discriminative model that
just uses these sort of object coordinates in different ways to try and predict some
basic size types that people will tend to use.
I distinguished six different basic size types, covering small and big as well as
things like tall, thin, short, and fat, and we can actually get pretty close to approaching
what a [inaudible] oracle method does at understanding the kind of size language that
people use.
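A sketch of what such a discriminative model could look like; the feature set and the training call are illustrative assumptions, not the model from the talk.

```python
# Sketch of size-type prediction from bounding-box dimensions with a
# discriminative classifier: featurize width/height/depth and their
# ratios, then predict one of a few basic size words.
from sklearn.linear_model import LogisticRegression

SIZE_TYPES = ["small", "big", "tall", "thin", "short", "fat"]

def size_features(w, h, d, scene_mean_volume):
    vol = w * h * d
    return [w, h, d, h / w, w / h, vol, vol / scene_mean_volume]

# With X as feature vectors and y as annotated size words:
# clf = LogisticRegression(max_iter=1000).fit(X, y)
# clf.predict([size_features(0.2, 1.5, 0.2, 1.0)])  # plausibly "tall" or "thin"
```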
Okay. So with all these sort of ideas, although nothing is completely solved, you can
still put them together in a proof of concept to actually start generating descriptions
of images. So this is some work I did with the computer vision people back in 2011,
where I was trying to understand how I could take what I know about the tendencies
of what people tend to mention, and prenominal modifier ordering, and put them
together with a given sort of computer vision input, to create descriptions of images
that sound fluent and natural. And what I found is that by collecting the object nouns
(so again, the computer vision is just providing the nouns, the sort of objects), you can
collect the sort of subtrees that these tend to occur in, using just basic Penn
Treebank syntax; you can build likely subtrees around them during generation; and
you can put them together to make reasonably well-formed syntactic structures that
make sense.
So, things like "duck on grass" and "duck by grass." And using just a sort of method
that uses both the syntax and the computer vision input, we outperformed other recent
methods on that sort of task, given a computer vision input.
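To give a flavor of that subtree idea, here is a toy sketch; the counts and the lookup structure are invented for illustration.

```python
# Sketch of composing descriptions from detected object nouns: look up
# frequently observed subtrees (here reduced to noun-preposition-noun
# triples) harvested from parsed corpora, and glue the likeliest together.

SUBTREE_COUNTS = {
    ("duck", "on", "grass"): 42,  # made-up corpus counts
    ("duck", "by", "grass"): 17,
}

def link_objects(head, other):
    """Choose the most frequent preposition linking two detected nouns."""
    options = [(count, prep) for (n1, prep, n2), count in SUBTREE_COUNTS.items()
               if n1 == head and n2 == other]
    if not options:
        return None
    _, prep = max(options)
    return f"{head} {prep} {other}"

# link_objects("duck", "grass") -> "duck on grass"
```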
There are still more projects to be done here. I haven’t touched analogies; I haven’t
touched part-whole relations. These are all things that you all should solve and then
let me use your code. So, future directions: here I think a lot of work can be done
sort of learning a better semantic space to map between objects, attributes, and the
subtrees there. Everything has been kind of coarse-grained, and I think that we can
do a lot better. The syntax that I’ve used so far has largely been based on Penn
Treebank syntax, and I think that’s a bit impoverished, especially for noun phrases. I
think we can do a lot better.
I think it’s useful to start making some better decisions about when to refer to an
object by description, when to just list its attributes, when to make analogies; sort of
choosing between these different ways of talking about objects.
There needs to be more work on capturing relative attributes. And we have very
little knowledge base building that we’re using currently throughout these models,
and I think that there can be a lot more to guide what’s useful to say and what’s not.
Okay. Thanks.
[applause]
>>: It seems like a difficult thing to control for is the sort of meta purpose to which
the descriptions are ultimately going to be put, and that’s going to affect the human study,
where the people don’t maybe know why you’re asking this. And then also it can
affect just sort of computer purposes, where there’s not a sort of well-defined way to
define the problem.
I see you covering a lot of ground here in terms of different ultimate applications,
but I guess could you comment on zeroing in a little bit on one particular application
where this type of thing shows up, where the description is targeted for it? For
example, on the school bus, you know, some people could just say, you know, "a bus,"
and that’s sufficient. But then your description had "bus with a blue sky" --
>> Margaret Mitchell: Yeah, yeah, yeah.
>>: Why is that motivated?
>> Margaret Mitchell: Yeah, so I think that -- I mean, that’s a really good point. And
I think that part of what I’ve been trying to do, by looking at a bunch of different
corpora and a bunch of different experiments, is just try and zero in on what it is
about visual language in general that people are picking out. So I can have some
generalizations that will hold across the different domains, right?
So I’m still going to need to do, like, prenominal modifier ordering no matter what
domain I’m in. It’s still important to talk about color and size basically no matter
what domain I’m in. In terms of the larger structures that you generate, I think that
goes to what corpus you should be training on, right?
So if you’re doing some sort of conversational agent and you have them talking
about visual objects in the world like for direction giving, then you want to be
training on a corpus that has that kind of language in it. The same problems sort of
fall out but the actual syntax changes.
>>: So does it make sense to talk about the reverse of the problem? Like given the
natural language how do you [inaudible]?
>> Margaret Mitchell: Yeah. There’s so much I can tell you about this. So there has
been a lot of work on this, and I think that this is something that’s really nice for going
towards less supervised computer vision, right? So if you just parse out a sentence
and pull out the objects, then you can actually start training detectors on the scene as
a whole in order to learn the sort of objects there; you don’t have to have people
manually figuring out the sort of object categories.
>>: To what extent do you need to do NLP in order to enable --
>> Margaret Mitchell: NLP in order to enable --
>>: Enable problematic process or --
>> Margaret Mitchell: Right. So that goes into parsing and NP chunking and I mean
it depends on how sophisticated you want to get I guess, but my preference is
generally to start at the parse structure.
>>: It’s fun to see what’s happened in the intervening years. So I was intrigued by
your statement that the [inaudible] is what goes prenominally and what goes
postnominally, because unlike the order of prenominal modifiers, there are actually some
pretty hard constraints in English grammar about what can go before and after.
So is it more like which kind of expression are you going to use: a prenominal-type
expression or a postnominal-type expression, or something else?
>> Margaret Mitchell: Yeah, that, as well as just the basic learning of a mapping
between the sort of given attribute values and the prenominal and postnominal
forms.
So I’m given things like circular and I want to know do I say that this is a circular cup
or do I want to say the cup that’s in a circle. And perhaps it’s trivial to find out but
there are no models that do it yet. So it’s still open.
You. Sorry.
>>: So it seems like in a lot of cases the modifiers chosen might be chosen to
specifically distinguish an object from other similar objects nearby. Is that --
>> Margaret Mitchell: Yeah. That’s the state of the art for referring expression
generation: selecting modifiers based solely on whether or not they rule out other
items in the same space.
No, it’s not state of the art. It was state of the art until recently up until this year.
Yeah?
>>: I kind of have a similar question. I guess there are two parts to it. For the
experiments you did where you had people moving the craft objects and things,
did you look at all at how the descriptors that they chose relate to the entropy
of the set in general, and whether they’re choosing the descriptors that maximize it?
Because depending on the set -- given a set of things where they’re all the same
color --
>> Margaret Mitchell: Totally. For the craft stuff that would be a really cool thing to
do, actually: trying to pull out the visual attributes and seeing if there’s something
there, some sort of information gain in their selection. I think that would be awesome.
For the office supply things I was actually reporting on the fillers but I was running a
larger study where I was trying to tease out if typicality of shape and material
affected how people named things, and I found that things that are an atypical shape
tend to be called by their atypical shape, but things that are an atypical material
tend not to be.
So it seems to be some sort of interesting interplay between some stored
knowledge, sort of prototype thing, and the given visual scene.
>>: So the second part of that is: another type of context is over time, right?
If people refer to something, the next time they’re probably going to refer to it with
less information. Did you do any work with that?
>> Margaret Mitchell: Yeah. So that gets into dialogue stuff, and you get things like
lexical entrainment, so you start to actually agree on some sort of object name, and
then when someone else comes in you actually start without an object name again,
and then you realize that you get -- there have been some really nice studies on how
this works, and I have not implemented them.
So this work has been mainly a first pass at when you’re initially describing an
object, and then, yeah, the clear next extension is figuring out how those are
shortened and changed in dialogue.
>>: Thank you.
>>: I was curious about cross-cultural rather than cross-linguistic. If you’re looking
at an object, if you’re from another culture you might use different language to
describe it visually, where a white bus might be a "white bus" to us in our culture but
not in another culture.
>> Margaret Mitchell: Yeah. So when I was looking at the typicality study, this in
particular was something that was an issue. I didn’t think it would be an issue, so I
conducted the study in the U.S. and in the U.K., using sort of semantic feature
production norms (I don’t know if you are familiar with them, but it’s the sort of listing
of normal attributes) from people in Canada.
And I found that although the U.S. people used the sort of language I expected, the
U.K. people would randomly say things that made no sense to me at all.
[laughter]
Well, yeah, so I mean there are cultural differences. They kept saying "the pencil with
the rubber on it" and things like that -- like, of course it has a rubber on it; I don’t know,
it’s an eraser. But yeah, so there is an issue of tuning for specifically what kind of
group you are working with. Definitely. And not just culturally; just smaller groups,
groups of friends, will tend to use sort of different language than other groups of
friends.
>>: [inaudible]
>> Margaret Mitchell: Right, yeah. That’s true. Yeah?
>>: You kind of mentioned the order of modifiers. There is some work in kind of a
different domain, in theoretical syntax, that works on the order of adjectives and the
order of adverbs, and [inaudible] does all that stuff. So I’m interested if you’re
looking to incorporate that kind of stuff, so the order of projections, kind of
nanosyntax [inaudible], to explain that distribution. I’m wondering if you’re --
>> Margaret Mitchell: Yeah. So at the CLMA program I was actually starting to do
this in terms of the underlying semantics, so you learn, like, color tends to be closer
to the noun, followed by size, and it seems like something I could learn
automatically. Frustratingly, what worked the best was just an n-gram language
model.
So it’s like do I keep playing with this stuff or do I do the thing that works the best?
You know. So yeah, I kind of left it there like, okay, a language model works really
well, that’s frustrating. I’ll move on to another problem. But I think that it would be
really cool to be able to incorporate that kind of stuff.
Chris?
>>: [inaudible].
>> Margaret Mitchell: But presumably if you could get a corpus you could train on
that corpus and that language and learn it automatically.
>>: Is the preference for selecting a descriptor based on [inaudible]?
>> Margaret Mitchell: Yeah, that’s interesting. Yeah?
>>: So you mentioned early on the concept of natural speech, and we’ve been talking
a lot about variation. Do you have any sort of measure for something being really
unnatural? It seems like there’s a huge range of things that could potentially be
considered natural. Do you want the range to be narrower over specific
domains?
>> Margaret Mitchell: That’s all the more reason to learn the distribution. So you
want to have sort of the trailing [inaudible] of less and less natural constructions. So
you’re not just trying to learn a natural thing; you’re trying to learn the set of natural
things that people might say, with a very low probability space for things that
people don’t say.
Yeah?
>>: [inaudible] and I’m just kind of curious. I can imagine cases where someone is
referring to a bird, and there is a very scientific name for that bird, but they would
probably just say "bird" most of the time. Did you find that hierarchy to be
particularly useful, or did you apply it to compensate for that?
>> Margaret Mitchell: Context where there was some shared or lack of shared
knowledge.
>>: It seems like you were mostly working with pretty basic objects.
>> Margaret Mitchell: Right, basic level categories.
>>: That they don’t have, like, complex names. I was just kind of curious if you’ve
ever seen work in areas where common people might refer to one
differently than [inaudible].
>> Margaret Mitchell: Yeah. So in a lot of the sort of traditional referring expression
generation work, they have a function that they call "user knows," which sort of starts
at a base-level class, and if the user is not familiar with that base-level class they
move up, and if the user is familiar with that base-level class you have the option of
moving down.
This is still within the realm of sort of hand-written stuff, so I haven’t looked at doing
something more statistical with it. But there is definitely precedent for looking at
that. I haven’t.
[applause]