

>> Lucy Vanderwende: Thank you for coming. I'd like to introduce you to Meg Mitchell. She's joining us from Johns Hopkins, where she's doing a postdoc right now, working on sentiment analysis, named entity recognition and other things that are not generation, but that's actually because she did her PhD work at University of Aberdeen specifically in generation, on the topic that she'll be talking about today, which is generating referring expressions for visible objects, which is something that she's been pursuing also since her master's degree at University of Washington. So she's no stranger to being here in Seattle. And I think it's interesting to watch how Meg has been very -- setting this direction as an interesting direction to pursue, specifically for generation, really opening up a new field for generation.

Meg would really enjoy fielding questions during her talk. She says it's a dynamic talk, so we can go with some combination of where we're interested and what she wants to talk about, but I'll be the policeman in case we rat hole too far and don't get to where we need to be. But please ask questions as we go along. Thanks, Meg. Thanks for coming.

>> Margaret Mitchell: All right. Hi, I'm Meg. The work I'll be talking about is work that I've worked on through at least three universities with a bunch of people, particularly Brian Roark from OHSU, Ehud Reiter and Kees van Deemter from the University of Aberdeen, and a lot of this work was done as part of the Johns Hopkins 2011 Summer Workshop.

So, before I start, I want to say that I've been doing additional research on sentiment analysis in social media, which I can't talk about here, or I'm not talking about here, but if anyone has a sentiment system, I'm organizing the TAC Sentiment Track, so you should submit.

Okay. Why connect vision to language? So say that you want to search for something on the Internet. As it stands, it's easy enough to do when you have some textual information, but ideally, you could go further than that. You can start giving images and search for things based on image content and also search for pictures based on given textual descriptions. So, ideally, there's some sort of connection between getting information textually and getting information visually, and these two can work together. I started working in connecting vision to language with a particular interest in assistive technology, so at Aberdeen, there's an assistive device where kids with cerebral palsy, who can't otherwise communicate, can have a computer vision system recognize objects, and that gives them things that they can talk about and select, so they can discuss them in a natural way.

Clearly, this is also useful for improving search, so connecting visual characteristics to language characteristics, so we can actually query with text and query with images and go back and forth there. It's useful for devices that scan the world and interact with humans, so, for example, like GPS devices that can give further information about what's around in the world. Functionality within gaming devices, and this is kind of a nice area to look at, because you don't have to have computer vision, necessarily. You sort of have these objects already given to you by the system, so it's easier to make distinctions between them. It's also useful for things like automatic summarization of videos, so if you have a ton of content, video content, you don't have to rely on a human to provide information about it, but maybe you can actually extract that information textually based on the visual characteristics. This also can aid summarization, along with textual descriptions, taking also the visual content. It's fun for things like caption generation and creative tasks, like story generation, and also, I find that working on this ends up helping computer vision. So computer vision can do a lot better when you can start constraining its output based on linguistic information.

So my plan for this talk, I'm going to talk about a few things. First, I'm just doing a broad overview of human object perception and human language production, which has informed my work as I go. I don't work towards cognitive models, but this kind of gives me some modeling hints and future ideas that I can then incorporate to try and approximate human-like language.

And I'm going to talk about some work I did on prenominal modifier ordering, which is one of the first tasks that sort of falls out of this, if you want to start describing objects, ordering the modifiers around them. Then, I'm going to talk briefly about what computer vision provides and what natural language generation provides, with the idea of starting to automatically connect these two in light of what human systems do, what we know about what people talk about and how we can actually start describing the visual world.

Then I'm going to be talking about two systems. One is an algorithm just for describing visual objects, given some ideal computer vision input. And then one is an end-to-end system called Midge, which is me seeing what happens when you actually give it real visual input and everything breaks, and what you can do to handle that. Okay, so how does visual object perception work? So free viewing of the scene is guided by rods and cones that respond to light reflecting off of objects. This is passed through the ganglion cells to the primary visual cortex, where we first start recognizing color and size, in addition to edge, orientation and contrast. This feeds forward to two areas that process these in parallel. One is the dorsal or "where" pathway, which deals with things like size, location, orientation. The other is the ventral or the "what" pathway, which deals with things like color, texture, material, these sort of absolute properties.

So there's this nice kind of distinction in the brain, it seems, between relative properties that require reasoning about spatial characteristics and absolute properties of the objects themselves.

The interesting thing about this is you can start to see vision influencing language from this trend. So color and size are some of the first things to be processed in the visual system. Color is available pre-attentively. Size tends to be available pre-attentively. Things like shape and material are harder to recognize when just free viewing the scene. You have to fixate on it. And what we see, when people describe objects, is a preference for color above all, followed by size.

So these are just some graphs from a couple of studies I ran where I just had people describe objects and tried to figure out what visual properties I wanted to focus on in order to approximate what kind of language people would generate. I found that across the board, they prefer color and size, followed by things like shape and material, these sort of more difficult-to-process attributes, but there's a clear preference for these basic visual properties. So if we can get at those, then we can get a higher recall by tackling those first.

So how does language production work? The important bit about this is there's thought to be a sort of horizontal and a vertical aspect to language generation, where we have three basic stages that correspond to conceptualization, thinking of what you're going to say, formulation, the sort of grammatical component where you figure out the sort of syntax, and the articulation stage, phonological, where you're actually starting to say it.

The thing about this is that each stage seems to have processes that run in parallel, so it's possible to start blurting out, articulating things, while at the same time you're formulating and also conceptualizing. So you don't plan the whole message before you begin talking, but you speak and plan at the same time.

This has ramifications specifically when you're talking about describing visual objects, because it falls in nicely with what we know about how people view scenes. So Pechmann 1989 showed that subjects will begin referring to objects before they have even begun scanning the alternatives. So they see an object, they start describing it, and then they start looking around.

They don't tend to look back, probably because of inhibition of return, and then this complementary finding in the vision community, Treisman and Gelade, that visual characteristics of different objects will tend to pop out without focusing on surrounding items.

So you see in this little set of Xs there, the red X and the O, you can begin talking about and describing without having to do any sort of serial search for them. And this suggests a possible intersection for combining vision with language at the level of the noun, where how people view nouns or view objects and then talk about them is something we can begin modeling. Okay, so to specify the task a little bit more, I've been looking at descriptions of objects, specifically inanimate objects, so I've largely avoided working on descriptions of people. That's encoded in the brain differently. That brings up a lot of different problems. People in the generation community tend to conflate the two, but I'd like to separate them out, at least for now, and say just focus on what's interesting about objects, and then maybe extend the models to people.

I'm looking at generating initial reference, so in a discourse, this would be the first thing introduced between two speakers for someone else to read or hear. So the idea is that I want these models to be usable for both summarization or for caption generation, but also when initiating dialogue and beginning to bring up an object for a speaker to respond to. And I'm calling this form of reference an identifying description, which I believe is identical to Searle's identifying description.

If anyone is very familiar with philosophy of language and disagrees, let me know, because I'm starting to publish calling it that, so I hope I'm right. So when you first start thinking about the object and how to describe objects in the visible world, a sort of low-hanging fruit that immediately comes out is how you order the different modifiers that you use to talk about these objects.

So this was one of my first forays into working on this problem, and basically, my hope was that I could have a story of these underlying classes that gave rise, somehow, to the ordering that you see there, and these might roughly correspond to semantic classes. I ended up modeling it as an HMM, where I just do transition probabilities and observation probabilities, but I put hard constraints on the possible transitions between classes. So we can only go C4 to C3, C3 to C2, and the word classes themselves predict the position in the sequence.

So doing this worked reasonably well. Rather than training to full convergence, I ended up holding out part of the data set to empirically determine the stopping point, and I found that, as I was doing this, the constraints on the transition probabilities were tending to skew the probability of the emissions. So I used this generalized procedure where I held out some of the parameters, trained to full convergence, and then added in the rest of the parameters.

Specifically, I just trained emission probabilities until convergence, and then added in the transition probabilities. And the stopping point was where there was no improvement over five iterations on the held-out set.
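To make that concrete, here is a minimal sketch in Python of an HMM with hard transition constraints used to score modifier orderings. The number of classes, the toy vocabulary, and all of the probabilities are invented for illustration; in the actual work the emission table is learned with EM (emissions trained to convergence first, then transitions added in, with early stopping on held-out data), which is only noted in the comments here.

```python
import numpy as np

# Toy setup: hidden classes C1..C4, where C4 is furthest from the noun.
# Hard constraint: from class Ci we can only move to Cj with j < i
# (i.e., strictly toward the head noun), so with states ordered C1..C4 the
# transition matrix is zero on and above the diagonal.
N_CLASSES = 4

# transitions[i, j] = P(next class = C(j+1) | current class = C(i+1));
# only j < i is allowed; the C1 row has no legal successor and stays all-zero.
transitions = np.zeros((N_CLASSES, N_CLASSES))
for i in range(1, N_CLASSES):
    transitions[i, :i] = 1.0 / i          # uniform over the legal successors

# Invented emission table P(modifier | class); in the real model these are
# learned with EM (emission probabilities first, then the transition
# probabilities, stopping on a held-out set).
VOCAB = ["big", "red", "wooden", "old"]
emissions = np.array([
    [0.10, 0.20, 0.50, 0.20],   # C1: closest to the noun
    [0.10, 0.30, 0.30, 0.30],   # C2
    [0.20, 0.50, 0.10, 0.20],   # C3
    [0.60, 0.10, 0.10, 0.20],   # C4: furthest from the noun
])
start = np.array([0.1, 0.2, 0.3, 0.4])    # P(first class)

def sequence_likelihood(modifiers):
    """Forward algorithm: P(observed prenominal modifier sequence)."""
    obs = [VOCAB.index(m) for m in modifiers]
    alpha = start * emissions[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ transitions) * emissions[:, o]
    return alpha.sum()

# With these toy numbers, "big red wooden" scores higher than "wooden red big",
# so the model prefers the first ordering of that modifier set.
print(sequence_likelihood(["big", "red", "wooden"]))
print(sequence_likelihood(["wooden", "red", "big"]))
```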

>>: A question. I'm just curious, so were the states naturally given then, because you trained the emission probabilities under -- we'd have to know which state corresponded to it.

>> Margaret Mitchell: No, so I held beliefs over all of them. I ended up using five classes, C1 through C8, and then just updated beliefs on the probability of the modifier, given all of the underlying possibilities. Yes, so it tells me what the classes are. It's a generative model.

So this ends up doing reasonably well. I tried a bunch of different techniques. At the end of the day, what ended up working the best, frustratingly, was an n-gram model. So if you train on the New York Times section of the Gigaword, this is training on New York Times and then testing on Wall Street Journal, it does better than all the other approaches I tried, significantly, and it actually is relatively robust across these different corpora. So Switchboard, which is conversational data, Brown, which is a balanced corpus, including literature, it does reasonably well. And then it does even better if you start adding some in-domain training, and you see significant improvement. So just adding a tenth of the Wall Street Journal automatically parsed, we saw a statistically significant improvement, just using an n-gram model.
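As a rough illustration of the n-gram approach that won out, here is a small sketch that scores every permutation of a modifier set with an add-one-smoothed bigram model and keeps the best one. The tiny noun-phrase list is an invented stand-in for the parsed newswire counts (e.g. the New York Times portion of Gigaword) used in the actual experiments.

```python
from collections import Counter
from itertools import permutations
import math

# Tiny stand-in corpus of noun phrases; the real counts came from parsed
# newswire text such as the New York Times portion of Gigaword.
noun_phrases = [
    "big red wooden box", "small old wooden box", "big old red ball",
    "nice little old lady", "large black plastic bag",
]

# Collect unigram and bigram counts over the modifier+noun strings.
bigrams, unigrams = Counter(), Counter()
for np_ in noun_phrases:
    toks = ["<s>"] + np_.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)

def logprob(seq):
    """Add-one-smoothed bigram log probability of a token sequence."""
    toks = ["<s>"] + seq
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
        for a, b in zip(toks, toks[1:])
    )

def order_modifiers(modifiers, noun):
    """Return the permutation of the modifiers that the n-gram model prefers."""
    return max(
        (list(p) + [noun] for p in permutations(modifiers)),
        key=logprob,
    )

print(order_modifiers(["wooden", "big", "red"], "box"))
# -> ['big', 'red', 'wooden', 'box'] with this toy corpus
```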

So the sort of lesson learned from that is, thinking about this in a generative way, maybe it makes sense to think about these actual underlying classes as being somehow semantic, so something like color and size giving rise to these actual modifiers that you see. And they might not exactly correspond to the ordering, but maybe they're corresponding to what is generated, and then they're ordered on the fly or later. With that kind of model, you can start putting observation probabilities as the probability that you see some kind of modifier, be it postnominal or prenominal or whatever, given some underlying attribute. So you can start generating things like color and size, and if you begin creating this separation, similar to the separation we see in the human brain, you start to be able to generate them at the same time, and you actually find over-specification that mirrors the kind of over-specification that people use.

I ran this sort of side study -- I haven't published on it, it's just sort of a small thing -- where I had people refer to cubes and spheres that I manipulated for color and size, and I found that even though either color or size alone was enough to distinguish the referent in all cases, people tended to over-specify, including both size and color. So you can get these nice things in this sort of model.

>>: Do you know [specifically] whether these results hold in something like this?

>> Margaret Mitchell: Which results?

>>: The idea that you get these very strict ordering of what you might think of as semantic categories?

>> Margaret Mitchell: So I'm not proposing that there's a strict ordering of semantic.

>>: In those same ways, or even that they're broken out in those same ways.

>> Margaret Mitchell: You mean, is it true cross-linguistically that we'll see a preference for color over size? I haven't tested it in other languages. I mean, there are some languages...

>>: I know there's a big black hole that you can easily fall into there, but just as you start to edge into something like the cognitive plane, it seems like you should...

>> Margaret Mitchell: Oh, I'm not meaning to make a cognitive claim. I'm just sort of trying to learn from cognitive structures to see what I can do to approximate human language, but specifically I'm testing in English. I moved on to Spanish recently, but that's not different enough. So there's some argument that some languages have -- I think the smallest number of colors is three. That would be fun to try, to see what happens. I'm not sure.

>>: And what are some of the expressions that people use for this, where you're saying that they're over-specifying.

>> Margaret Mitchell: So "big blue cube." So you can say "big cube" or you can say "blue cube," but people just say "big blue cube." And this really goes nicely with the story that people tend to parallel process things, so they're not making sure that they're uniquely identifying the referent, but they're actually saying visually salient things as it occurs to them.

>>: What was the prompt for the subjects?

>> Margaret Mitchell: So this study, if I go -- I'll tell you later. Can I tell you later? I'll just skip forward a bit to get to that, but they were fillers in another task, so it was part of a task where they were describing how to put together a configuration of objects, and they had to tell the hearer which object to grab, so it was an identification task.

>>: But were they specifically asked to talk about characteristics that differentiated them? But you're saying that was the...

>> Margaret Mitchell: Totally open domain. I don't control this -- well, I control the properties, but I try and not control what I ask people to do, other than "explain." Okay. So we have this idea of what we can do in order to generate descriptions of visible objects, kind of at a high level, and we have some sort of model to order descriptors once we have settled on them, so what can we actually do to start plugging these together? What does computer vision provide?

The thing to keep in mind about computer vision is that it generally doesn't work very well. This is a bunch of detectors we ran on just an open-domain image from Flickr. I don't know if you can see here, but you can see it recognizes things like plants and tiger faces and dogs and chairs.

I think there's a sheep there. I don't really see that there. The computer sees it there. But the thing about this is, if you tell the vision what you're looking for, you tell it to run the noun detector or whatever, it can do reasonably well. This is especially true on the PASCAL data sets that everyone uses and tests on. It kind of falls apart in our open domain, but general rule of thumb, if you just run arbitrary detectors, it doesn't do so well, but if you can constrain it in some way, then it does a lot better.

So the task of converting some object detection to some noun is a bit trickier, until you can start understanding what you expect there to be in the scene, semantically, and you can use sort of like distributional information to do this sort of thing. Generally, in these approaches, there are discriminative models that just figure out the different categories. There are sort of binary classification tasks. Some particularly recently relevant work is this work on attribute detection, so that plugs in really nicely to the kind of stuff I've been working on.

This is the Farhadi et al. 2009 paper, just I pulled out a screenshot of it, where they started taking gold-standard object detections, and then within those detections running these binary classifiers to figure out whether or not it has text. They look at materials, like has wood. They look at a lot of parts, has metals and other material, has plastic and other material. They're looking at shapes a bit as well, is 3D boxy, is vert cylinder? It's a bit haphazard which properties they decide to pick out, so that kind of speaks to the need to be clear about what people are actually going to be talking about, rather than training arbitrary detectors. But the nice thing about this is that it starts to provide a connection to talk about objects in the computer vision world. So now we can actually start describing aspects of those objects.

So I worked on this a bit in 2011 as part of the Johns Hopkins Summer Workshop, and our idea was this, maybe we can start getting closer to describing objects if we can figure out the attributes better, and to do this, what we can do is take captions of images -- so we used Flickr, parse it, pull out the noun phrases, and then figure out what modifiers are being used for specific objects. So given some object that we have a detector for, like bear, you can try and figure out where it is on the image and then start training our models to learn those attributes automatically.

So we did this. I don't know how familiar people here are with computer vision, but the basic idea was that we used an SVM where we used histograms of oriented gradients, texture and color, these sort of low-level visual features, and we treated the attribute detections we found in the captions as positive training samples and captions that did not have the given attribute as the negative training samples.
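A minimal sketch of that training setup, assuming the low-level features (HOG, texture and color histograms) have already been extracted for each detected region; the feature matrix and the caption modifiers here are invented placeholders, and scikit-learn's LinearSVC stands in for whatever SVM implementation was actually used in the workshop.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-ins: one low-level feature vector (e.g. concatenated HOG, texture
# and color histograms) per detected object region, plus the modifiers that
# the parsed Flickr caption used for that object.
features = rng.random((200, 512))                      # invented feature matrix
caption_modifiers = [
    {"brown", "furry"} if i % 2 else {"white", "small"} for i in range(200)
]

def train_attribute_classifier(attribute, X, modifier_sets):
    """Binary SVM for one attribute: positives are regions whose caption
    mentioned the attribute, negatives are regions whose caption did not."""
    y = np.array([attribute in mods for mods in modifier_sets], dtype=int)
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf

# One classifier per attribute of interest (color worked best in practice).
classifiers = {
    attr: train_attribute_classifier(attr, features, caption_modifiers)
    for attr in ["brown", "white", "furry", "small"]
}

# At test time, run each classifier on the features inside a detected bounding box.
test_region = rng.random((1, 512))
detected = [a for a, clf in classifiers.items() if clf.predict(test_region)[0] == 1]
print("attributes predicted for this region:", detected)
```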

What we found was that we could do pretty well for color. That makes sense. Color is also a low-level feature. We can do okay for materials, and that's probably largely because of its connection to color. So materials are often clear from color. But what we can't do well on are relative attributes. So things like size require this additional reasoning about the three-dimensional characteristics of the scene that it's just not quite possible to train a detector on yet, especially because object detectors tend to return sort of areas of the object where -- areas of the image where the object maybe is, where it believes the object to be, but not actually specific segments of the object. Then you have to move to image segmentation.

Okay, so computer vision might be useful for things like color, and it's getting better at material, shape. For relative properties, there's still a lot of work to be done. So what does natural language generation provide? What can we actually do to start taking what the computer vision does and what natural language generation expects in order to start talking about objects? So the state of the art in natural language generation right now -- actually, I just released this corpus, so this is the new state of the art, I guess. But before this recent release, it's generally these very basic images of objects that are computer generated. They're very limited: the properties are color, size, orientation, location, and that's about it.

The sort of mindset in generation here is that, given some clear specifications of the type, the color, the size and the orientation, the question is, how do you select which ones to generate?

The truth is that from a computer vision perspective, what you're getting is confidence levels for different kinds of object detections or stuff detections, which is not using edge detectors, basically. And then you can get likelihoods of attributes within the bounding box where the detected object is likely to be. You can get a bounding box, which is nice, but you can't get small.

So this basic connection here is already not working. Yes?

>>: Sorry. Just a (inaudible) the difference between stuff and object?

>> Margaret Mitchell: So they're just different kinds of object detectors, basically. So this is from work by Tamara and Alex Berg at Stony Brook University, and they just trained object detectors using histograms of oriented gradients, and they used Felzenszwalb object detectors, and then for the stuff they removed any sort of edge detection to detect sky, grass, things that tended not to have edges.

Okay. So given this kind of mindset within natural language generation and how they conceptualize the problem, the two basic approaches that you use now to talk about objects tends to focus on unique identification. One of the top algorithms that everyone uses is the graph-based algorithm, which defines the problem as a sub-graph construction problem, basically. Cost functions are defined over the edges and then self-loops mark object properties, and then these directed edges are used to mark spatial relations between objects.

They developed -- the best way to do this tends to be using a branch and bound method, where you try and find the vertex of the object that you're interested in and then recursively construct sub-graphs around that so you make sure it's connected. And that's -- yes.

>>: Does that model itself then learn from data, or is it specified?

>> Margaret Mitchell: Basically, so this doesn't have any costs on it, so this is just -- you can hand specify. If it's an absolute -- sorry, actually, the closed side here. If it's a self-property, then it's a self-edge, and if it's a spatial relation, then it's a directed edge to the next.

>>: It doesn't have discrepancies.

>> Margaret Mitchell: Yes.

>>: Then they run the weight through it.

>> Margaret Mitchell: Yes, yes. I'll talk about that, too. So then the problem becomes, okay, if you have this nice structure, then maybe we can start seeing how we can model what humans do by defining the weights differently.

>>: Because the original issue is not to get the structure (inaudible).

>> Margaret Mitchell: I don't think so. I think it was just something that they were playing with and ended up working reasonably well on the data that they were working with. And then the other sort of basic approach in referring expression generation, where you're talking about objects, tends to be this incremental algorithm. It's more complicated than this, but the basic gist of how this works can just be written in this sort of four lines of pseudo code. So you have an empty set, and then for every attribute you have, in some pre-specified order, which you can learn from the data, based on frequency, if a value for that attribute rules out something else in the scene -- so this is nicely predefined for you already, then you add that to the set. Then, that becomes your identifying -- your distinguishing description to identify the referent. So those are two main approaches to generating descriptions of objects in NLG.
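The incremental algorithm really does fit in a few lines; here is a sketch of that Dale-and-Reiter-style procedure, with an invented toy scene (the attribute names and the preference order below are illustrative, not taken from any particular corpus).

```python
def incremental_algorithm(target, distractors, preference_order):
    """Incremental algorithm sketch: walk the attributes in a fixed
    preference order and keep any value that rules out at least one
    remaining distractor, stopping once the target is uniquely identified."""
    description = {}
    remaining = list(distractors)
    for attr in preference_order:
        value = target.get(attr)
        if value is None:
            continue
        ruled_out = [d for d in remaining if d.get(attr) != value]
        if ruled_out:
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:
            break
    description.setdefault("type", target["type"])  # always include the head noun
    return description

# Invented example scene.
target = {"type": "cube", "color": "blue", "size": "large"}
distractors = [
    {"type": "cube", "color": "red", "size": "large"},
    {"type": "sphere", "color": "blue", "size": "small"},
]
# The preference order would be taken from training frequencies (e.g. color first).
print(incremental_algorithm(target, distractors, ["color", "size", "type"]))
# -> {'color': 'blue', 'size': 'large', 'type': 'cube'}
```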

Okay, now let's talk about newer models for describing visual objects. As I mentioned before, across the corpora I've looked at, there's this preference for color, followed by size, so I wanted to get color and size in particular processed well in the models I was working on. We know computer vision can, to some extent, provide object detections and object attributes within the bounding boxes. It does okay with color. Size is still an open problem.

Natural language generation is working on object reference. That's obviously a clear connection, using attributes, but it expects size to be predefined, and color, the connection is already there reasonably well. So working on this connection, it seems like the next most reasonable thing to do is just start looking at size and how people use size. This is kind of a fun problem to think about. It's sort of not as simple as you think, right? If you have two boxes and one is smaller than the other, if the y-axis is a bit longer as you go, does that become small and long? Now what do you call it? Now it's thin, or smaller? Now it's definitely thinner, not smaller, I guess.

At this point, what is it? Is it larger? The volume might be more.

There's this interplay between the height and width and the depth, and that's interacting with the other objects in the scene to determine the kind of size language that people are going to end up using. So you can just use this in a discriminative approach to figure out what sort of size language to use. So taking the features, like height features, ratio features, distance between the objects and things like that, you can use segmented images -- so these are clearly segmented -- in order to try and predict the kind of language that people are going to use. I ended up distinguishing six different basic size types, so there's two sort of overall, one for large and one for small, and then there are four that correspond to these thinner, fatter, shorter, taller. And you can train this as a bunch of binary classification tasks.

I ended up using decision trees. I tried other things, but decision trees ended up working reasonably well. They overfit at first, but if you start taking away measurements of the target object, specific measurements, and just converting these to ratios, you can actually do really well at generalizing, even to new domains. So I tried looking at how people were referring to objects in these groups, using their size, and I found that instead of just proposing that there are two objects, so sort of the basic case -- you have one and another and how do they refer to the size of the one on the right -- in this case you can actually just take the average of all of the heights and widths. So the average height and the average width of objects of the same type, and you can do even better, actually, at predicting the kinds of size language that people describe.
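Here is a rough sketch of that setup: ratio-only features computed against the average height and width of same-type objects, with one binary decision tree per size word. The six size labels, the particular ratio features, and the toy training examples are illustrative stand-ins, not the actual feature set from the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

SIZE_TYPES = ["large", "small", "tall", "short", "wide", "thin"]

def ratio_features(box, same_type_boxes):
    """Relative features only (no raw measurements), which generalized
    better across domains: target vs. the average of same-type objects."""
    h, w = box
    heights = np.array([b[0] for b in same_type_boxes])
    widths = np.array([b[1] for b in same_type_boxes])
    return np.array([
        h / heights.mean(),                     # relative height
        w / widths.mean(),                      # relative width
        h / w,                                  # aspect ratio of the target
        (h * w) / (heights * widths).mean(),    # relative area
    ])

# Invented training data: (target box, other same-type boxes, size words used).
examples = [
    ((4.0, 2.0), [(2.0, 2.0), (2.5, 2.0)], {"tall", "large"}),
    ((1.0, 1.0), [(2.0, 2.0), (2.5, 2.0)], {"small"}),
    ((2.0, 5.0), [(2.0, 2.0), (2.5, 2.0)], {"wide", "large"}),
    ((2.0, 0.5), [(2.0, 2.0), (2.5, 2.0)], {"thin"}),
    ((0.8, 2.0), [(2.0, 2.0), (2.5, 2.0)], {"short", "small"}),
]
X = np.array([ratio_features(b, others) for b, others, _ in examples])

# One binary decision tree per size type.
trees = {}
for size_word in SIZE_TYPES:
    y = np.array([size_word in words for _, _, words in examples], dtype=int)
    trees[size_word] = DecisionTreeClassifier(max_depth=3).fit(X, y)

new_box = ratio_features((3.5, 2.0), [(2.0, 2.0), (2.5, 2.0)]).reshape(1, -1)
print([w for w, t in trees.items() if t.predict(new_box)[0] == 1])
```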

So this ends up working reasonably well. So as soon as we can start getting computer vision to the point of reasonable segmentation in addition to having some sort of sense of distance -- that's sort of one of the crucial things -- we can actually start using color and size language in a way that mimics what humans seem to be doing. So that's one step, right? So this is the state of the art in REG, referring expression generation. They're requiring things like "small" to be given. I moved it to a bounding box.

So then, thinking further about this generative story, we have this color. We can learn probabilities to generate color. We can learn how to generate size, and then we can use this to build some set for an identifying description. In order to keep the model from over-generating, so generating a ton of different properties, just putting a basic length penalty ends up working reasonably well, so people tend not to include more than four modifiers when they're describing objects. And then I just tuned this empirically based on the training data. And so at that point, the likelihood of generating some set of properties for an object is a function of the likelihood of those properties and the length penalty.

Going further with this, you can actually start using these for nondeterministic generation, so instead of just generating the top output, you can generate a distribution over possible things that people will say, and then you get a ton of different forms and capture speaker variation, which is one of the key components if you want to start generating language that sounds natural, right? You need to be able to vary it.
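A minimal sketch of that scoring idea: the likelihood of mentioning a set of attributes is the product of the learned attribute probabilities, damped by a length penalty, and the resulting distribution can either be maximized or sampled for nondeterministic generation. The probabilities and the penalty value below are invented.

```python
import random
from itertools import combinations

# Invented per-attribute probabilities for one object, as might be learned
# from the training corpus (how often each attribute type gets mentioned).
attribute_probs = {"color": 0.70, "size": 0.45, "shape": 0.15, "material": 0.10}
LENGTH_PENALTY = 0.6   # the talk says this was tuned empirically on training data

def score(attr_set):
    """Likelihood of mentioning exactly this set of attributes: product of
    the attribute probabilities, damped by a length penalty."""
    p = 1.0
    for a in attribute_probs:
        p *= attribute_probs[a] if a in attr_set else (1 - attribute_probs[a])
    return p * (LENGTH_PENALTY ** len(attr_set))

# Enumerate all candidate sets (small attribute inventory, so this is cheap).
candidates = [
    frozenset(c)
    for r in range(len(attribute_probs) + 1)
    for c in combinations(attribute_probs, r)
]
scores = {c: score(c) for c in candidates}
total = sum(scores.values())
distribution = {c: s / total for c, s in scores.items()}

# Deterministic use: take the single best set.
best = max(distribution, key=distribution.get)
print("best set:", sorted(best))

# Nondeterministic use: sample, which captures speaker variation.
sampled = random.choices(list(distribution), weights=list(distribution.values()))[0]
print("sampled set:", sorted(sampled))
```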

So if you want to get really fancy and make sure that you uniquely identify the object, you can just compare against objects and then use a basic incremental approach to compare against some frequency-based properties, and if one rules out the object that you want to rule out, then add it to the set. Okay. Now, how do you evaluate all these different systems? So the sort of general approach in referring expression generation, looking at identifying descriptions, or generating descriptions of objects, assumes these nice two poles here. I'm removing this from my algorithm, so the size is actually the bounding box of the image. For the other two algorithms I talked about, the graph-based algorithm and the incremental algorithm, I had to say small, large, but that's okay. It just makes it harder for mine, and I still beat them.

So I compared against these algorithms that people tend to use, so there's the incremental algorithm, which you have to specify a preference order in order to figure out how to iterate through the attributes, and for that I just based it on training frequencies, and then adding object type at the end. The graph-based algorithm obviously requires some sort of definition of the costs of the edges. For this, I just used k-means clustering. I wanted to encourage overspecification to bring it closer to descriptive reference, so I just used two clusters, where one corresponded to zero cost and one corresponded to one cost, cost of one, just on the frequencies of the properties and the data. And then to choose between edges, when there's a tie, there's a preference order in place similar to the incremental algorithm.

And then I'm comparing it against this newer algorithm that I've been working on that generates the attributes from some sort of underlying probabilities, and I just learned these from the training data. Because it's nondeterministic, I run it for all test instances and then can compare the set of human-produced descriptions against the set of generated descriptions.

>>: So just so that -- so they're all seeing the same input, which is these two attributes?

>> Margaret Mitchell: No. Okay, so the small difference is that the incremental algorithm and the graph-based algorithm, this says small and this says large, but for mine, I'm making it actually decide.

>>: This has bounding boxes.

>> Margaret Mitchell: I'm plugging in the size algorithm there.

>>: Okay, so if you go to the graph model, and if this is a long answer, you can take it offline, too, but I'm just curious. Where you go, this is going to be dependent on what emissions you observe in the world. If these five things light up, then you wouldn't position all the other things, so you just (inaudible) those. And it seems like the edge weight should have to do with the initial probabilities of the orders of those being talked about in a language corpus or something?

>> Margaret Mitchell: The orders of them? I don't know about the order, because it's abstracted from the order. You're just looking at the frequency.

>>: Okay. Well, I'm just thinking that if there's like four or five attributes or something, a person would say something about them, right? You can imagine the transitions -- this is a directed graph. It seems like the transitions, wouldn't they come from what those transition probabilities were like in language?

>> Margaret Mitchell: Yes. So that's just defining the cost, right? When you're trying to construct the sub-graph, how you're going to choose which edges you're going to connect.

>>: And what the weights would be.

>> Margaret Mitchell: Right, yes. That's the cost, the weight.

>>: So the attributes that you observe keep a certain set of nodes and you knock everything else out?

>> Margaret Mitchell: So you're trying to find a sub-graph that uniquely identifies the referent, so it has to be a sub-graph that only corresponds to the referent and is not isomorphic to any other referent.

>>: Okay.

>> Margaret Mitchell: Okay, so to evaluate these algorithms, commonly, in machine translation, people use things like BLEU and ROUGE, which tend to measure n-gram based string comparisons. But it's been shown that they're sort of a reasonable measure of fluency, but they're not a very good measure of content quality. So they correlate well with human judgments of how fluent it sounds, but they don't do very well at capturing whether or not the content is reasonable.

So in the generation community, people have largely moved over to DICE, which requires you to have attributes. So rather than measuring string overlap, it's looking at the set of attributes that are generated by the system, the set of attributes that are generated by the person and what the overlap is.

One of the nitpicky things I have about this is that this score depends on what's been annotated, so obviously, if you have a very fine-grained annotation, you're not going to do as well as if you have a very coarse-grained annotation. So my thought about this was, why don't we just evaluate over Boolean values for a fixed set of attributes? So say we are interested in color, size, location, orientation, that's four, and we're going to evaluate the overlap over those four sets. The nice thing about this is now it's equivalent to accuracy and precision and recall, right? Because a false positive is a false negative, and there are no true negatives, so all metrics become equivalent, and that's nice.
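A sketch of that evaluation idea, with the four-attribute inventory she mentions: each description is reduced to which attribute types it includes, and Dice is computed over those Boolean sets. The example sets are invented.

```python
ATTRIBUTE_INVENTORY = ["color", "size", "location", "orientation"]

def to_boolean(attrs):
    """Reduce a description to which attribute types from the fixed
    inventory it mentions (ignoring how finely each was annotated)."""
    return {a for a in ATTRIBUTE_INVENTORY if a in attrs}

def dice(system_attrs, human_attrs):
    """Dice coefficient over the Boolean attribute sets."""
    a, b = to_boolean(system_attrs), to_boolean(human_attrs)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Invented example: the system mentioned color, size and location,
# the person mentioned color and location.
print(dice({"color", "size", "location"}, {"color", "location"}))  # 0.8
```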

Okay, so for a nondeterministic algorithm like the one I've been introducing, it becomes a more difficult problem because now you have a whole bunch of possible expressions that you've generated. So to approach this, I'm just trying to find the best alignment between the algorithm-produced expressions and then the people-produced expressions and then measuring the DICE overlap for those. This might look familiar. This is the same as an assignment problem for finding a maximum-weighted bipartite matching, so you can solve this in polynomial time, rather than having to look at all permutations, using the Hungarian method, which is really fun and cute.
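Here is a small sketch of that alignment step using SciPy's Hungarian-method solver: build a cost matrix of one minus Dice between every system-produced and every human-produced description, solve the assignment, and average the Dice scores under the best alignment. The attribute sets are invented examples.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dice(a, b):
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Invented data: several descriptions (as attribute sets) from the
# nondeterministic system and several from different human subjects.
system = [{"color"}, {"color", "size"}, {"size"}]
humans = [{"color", "size"}, {"color"}, {"color", "size", "location"}]

# Cost matrix: 1 - Dice, so that minimizing cost maximizes total overlap.
cost = np.array([[1 - dice(s, h) for h in humans] for s in system])

# Hungarian method (polynomial time, no need to try every permutation).
rows, cols = linear_sum_assignment(cost)
aligned_dice = [1 - cost[r, c] for r, c in zip(rows, cols)]
print("alignment:", list(zip(rows.tolist(), cols.tolist())))
print("mean Dice under best alignment:", sum(aligned_dice) / len(aligned_dice))
```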

And in the case where we don't care about nondeterminism and we just want one answer, we can just take the majority, whatever it predicts the most, plurality in some cases.

So I evaluated on the corpora that are available now to generate in terms of specific objects and their attributes, and this is the GRE3D3 corpus and the TUNA corpus, and I find that on the GRE3D3 corpus, split into two sets, they're just sort of mirrors of one another, my algorithm performs competitively with the incremental algorithm and the graph-based algorithm, at least as I have implemented them. I tried to make them perform optimally under the development conditions. And then for majority, it's pretty reasonable, as well, sort of tying with the graph-based algorithm.

Similarly, in this other domain, the TUNA domain, we see that across the different conditions of this domain, they had a condition where people were discouraged from using location information, which I think is a bad way to go, because now you're just priming your subjects into knowing that you're looking at these kinds of things, but I evaluated on that anyway. And then in the condition where they weren't primed with anything, it also does reasonably well. The incremental algorithm does really well in the minus location condition, just looking at majority agreement, but it's not robust across both of them, so mine is kind of giving us more stable values across the different domains. So it seems to be working reasonably well.

So now let's talk about how to put this together in an end-to-end system, actually using real visual input. So this is output from some detectors that we ran as part of the Summer Workshop in 2011, with object detections that were primarily Felzenszwalb stuff detections that included these low-level features like HOG and texture, as well as Canny edge detectors. And these are the sort of bounding boxes that it's giving for this scene. So we see a bus, a sky, a road, and these have sort of various attributes associated with them. The road is thought to be wooden, perhaps.

The bus is thought to be black. But there's also some likelihood for all attributes, right, because these are binary classifications.

So my idea here was we should just take the computer vision output and stick it into natural language generation input, so instead of backing off to some sort of underlying representation, let's see what we can do to just connect the two together. And so this is the top caption that my system ends up generating for this image. The bus by the road with a clear blue sky. It's a cherry-picked example, obviously, but it is a nice one, because it's kind of showing the kinds of things that I'm trying to do. Specifically, I'm trying to put together a syntactically well-formed sentence based on the object detections by figuring out what the reasonable spatial relations are and what the most likely attributes are, given what the computer vision detects.

So I mentioned earlier that computer vision is very noisy. But one thing that you can start doing is actually using distributional information to constrain what the computer vision sees and just take the intersection of what the language predicts to be likely and the computer vision system thinks that it sees, and then you end up getting descriptions that are relatively reasonable. I'm going to talk about previous work just very briefly to sort of situate what I'm doing with respect to prior work.

There are sort of two modes of generation for this kind of task. One is this approach where you try and find a caption of a similar image, so you can use low-level visual features and, from there, try and generate the given caption that you've already stored as the new caption for this novel image. It obviously looks human like, because it's written by humans, but it's often incorrect and misleading. It could be like, two girls walking in China, but it's somewhere else.

But you don't know this, because these are things that are just grabbed straight from the caption.

And then the other approach that people have been sort of working on is this template-based approach, right, where you can just fill in verbs. You just set up a basic declarative sentence, fill in object detections as you go. Although any one looks natural, the difficulty with that is that it's not varied as it talks about different images. So it'll just keep telling you -- you can start seeing that it's a template, the more you see. One sort of interesting thing about this work that sort of stuck out to me is that the perceived relevance of description was highest with a simple baseline that did not add any verbs, so computer vision can detect objects. There's also work on action and pose estimation. Action and pose estimation tends to perform not that well, but the idea was that if we take some sort of distributionally similar or likely verb for some given object detection, then we can just add it in, even if we haven't seen it in the computer vision, and create a full sentence. But, actually, people tended not to like this. It didn't end up producing the correct verb in a lot of cases, and the preference was to just say what you do have evidence for.

So the input to the system I'm using is the front end of this previous system called Baby Talk, which is from Kulkarni et al. 2011. And that provides object detections, stuff detections and attribute detections. Basically, how this approach has worked before is just to fill in template slots, similarly to the Yang et al. paper, but the nice thing about it is it's a set of 20 different object detections and a bunch of attribute detections, and you could start using these to see what else you could generate. It also is a nice way to directly compare automatically generated descriptions with template-based descriptions, using the exact same input.

So I ended up being able to compare three different systems. I luckily knew the authors of all of these, so I was able to annoy them into giving me their systems. The Kulkarni et al. system for this picture, bear, generated something like, "This is a picture of two potted plants, one dog and one person. The black dog is by the black person and near the second feathered potted plant."

The Yang et al. system for this generated, "The person is sitting in the chair in the room," and my system generated, "A person in black with a black dog, by potted plants." You can kind of see how these two are a bit more templatic if you compare this one to this one. So this one says, "Three people are showing the bottle on the street," so you can see they're taking the prepositional phrase, they're sticking it at the end, and they're taking the -- they're just collapsing the object detections for the people into one.

The Kulkarni et al. system has -- it's a bit of a sort of recursive template, so they say, "This is a picture of three persons, one bottle and one dining table. The first rusty person is beside." And then I kind of tend not to read the rest. I just get bored at that point. I'm like, okay, this is just wrong. And so for my system, which uses the exact same visual input, we can generate something like, "People with a bottle at a table," so that's nice.

>>: So are the attributes not tagged with the particular object you're referring to?

>> Margaret Mitchell: They are.

>>: Oh, they are.

>> Margaret Mitchell: Yes.

>>: Okay, and so that was just a misclassification, in this case, then, the rusty person thing.

>> Margaret Mitchell: The rusty person, yes. So it detected rust.

>>: That was a vision error, then.

>> Margaret Mitchell: Yes, that was a vision error. Yes, right. That's the whole niceness of working on this vision to language thing, because even if you have an incorrect vision detection, you can have a language model, if your language model is good enough, to say, "Yeah, you don't see that, so let's not generate that."

>>: This is a problem, right, because the more vague you are, in many cases, the better the sentence will be. So, basically, when you become not vague, you're basically rolling the dice and taking a chance. The question is, when should you do that?

>> Margaret Mitchell: Yes, so there's...

>>: Can you come up with an evaluation metric that gives you special bonus points for not being vague.

>> Margaret Mitchell: That's a good point. There should be some sort of tradeoff between the confidence level of the vision detection and then the likelihood of the language model. So you don't want to rule out things that the vision is pretty confident about.

>>: How acceptable is it if you have a single template for most of the pictures that you're trying to caption? Because do people react negatively to that, where they really want these long, weird descriptions?

>> Margaret Mitchell: So I haven't tried just giving people only template-generated descriptions.

But for the goals I'm working towards, like I had this ongoing goal of trying to make a generation system for kids with cerebral palsy, I don't want to just give them the same sentences to say over and over again. I want to be able to capture human-like variation so they sound more like people.

>>: Because I don't know what a corpus of captions looks like.

>> Margaret Mitchell: Flickr.

>>: No, but if you look at the properties of that corpus, are they really templatic? Do people say the same kinds of things when they're talking about photos over and over again?

>> Margaret Mitchell: I don't have a direct measure of that, but I can just say, by looking through them, no. This is a problem I don't want you guys to ask me, but I'm going to tell you about it anyway. So there's this underlying knowledge of the person who takes the picture, and they might actually describe things that are not visual, right? It's like, "Sam's birthday party."

Maybe there are cues that you can pick up, like, "Sam is having fun," or these things that are a lot harder. And then there's also the captions that are just sort of describing the scene. So I'm just working towards these captions that are just describing the scene.

>>: I think Larry's question, and maybe Bill's question also, there's a question here of where you want to be in the [random] area depends a lot on the variation. And it seems like if you were to -- a human judgment of the useful correctness, it might depend a lot on whether someone was looking for pictures. For instance, even though this rusty person one is a little bit wacky, it does have a lot of detail. If I was looking for a (inaudible) quick kind of thing.

>> Margaret Mitchell: Yeah, yeah.

>>: So in a search task, if I was looking through the corpus database or something to find something useful, but on the other hand, if I want something that's going to appear automatically in a news story, your translation might be a lot better.

>> Margaret Mitchell: And these are simple parameters you can just set in a system, so I just have a threshold on the probabilities on the confidences that I want as an input. And so you can decide if you want to accept something like rusty person from a computer vision system, given that it's above some likelihood threshold in your language model.

>>: But is there a -- I'm thinking of a human or Turing test. If you gave somebody pairs of these and said which do you prefer over a course of 1,000 or something, go up to people.

>> Margaret Mitchell: Well, that's how I evaluated it. I can't talk about evaluation yet, but I will, because I want to talk about how I'm actually generating first. So the basic idea behind this generation approach is that an image is processed by computer vision algorithms, and it can be characterized as a simple triple, where A is the set of object or stuff detections with bounding boxes and associated attribute detections, B is the set of action or pose detections, if those have been run, and C is a set of spatial relations holding between each bounding box. Similarly, a description of an image can be characterized as a triple, where A is the set of nouns in the description, B is the set of verbs and C is the set of prepositions that hold between them, right?

So if you look at it that way, connecting the two is simple. It's just a matter of figuring out how to order the things around it in a way that sounds natural. But the basic idea is to center the generation process on the object, modifiers and attributes are defined given the object, same with verbs and actions. They're defined given the object and prepositions and spatial relations.
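As a sketch of what those two triples might look like as data structures (the field names and the example detections are invented for illustration, not Midge's actual representation):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Detection:
    noun: str                                   # mapped object/stuff label
    box: Tuple[int, int, int, int]              # x, y, width, height
    attributes: Dict[str, float] = field(default_factory=dict)  # attr -> confidence

@dataclass
class VisionTriple:
    detections: List[Detection]                 # objects/stuff plus attributes
    actions: List[Tuple[str, str]]              # (detection noun, action), if run
    relations: List[Tuple[str, str, str]]       # (noun1, spatial relation, noun2)

@dataclass
class DescriptionTriple:
    nouns: List[str]
    verbs: List[Tuple[str, str]]                # (noun, verb)
    prepositions: List[Tuple[str, str, str]]    # (noun1, preposition, noun2)

# Invented example for the bus image discussed above.
vision = VisionTriple(
    detections=[
        Detection("bus", (40, 80, 300, 180), {"black": 0.8}),
        Detection("sky", (0, 0, 640, 90), {"blue": 0.9, "clear": 0.6}),
        Detection("road", (0, 260, 640, 120), {"wooden": 0.4}),
    ],
    actions=[],
    relations=[("bus", "by", "road"), ("bus", "under", "sky")],
)
# Generation then amounts to mapping this onto a DescriptionTriple and
# ordering/realizing it, e.g. "the bus by the road with a clear blue sky".
print(vision.relations)
```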

Okay, so the problems to address in this is, first, how to order the objects so they are mentioned in natural ways, so what's the subject of the sentence and what are going to be in the relative clauses that follow and how are we going to order them? How to filter out incorrect detections, so this is just a current problem with the state of the art, just dealing with the noisiness of the data. How to connect these together within some sort of syntactically and semantically well-formed sentence so that it sounds actually natural, and then, if you want, how to add further descriptive information, even if you don't have a specific detection but you think it's likely that this is the case.

So the first problem I handled in this is just learning an ordering model to figure out, given some set of object detections, how am I going to talk about them? And I used Flickr captions for this and WordNet. I found that, after parsing it through, 92% of sentences had no more than three physical objects, so that kind of gives you just a cap on how many objects you're going to want to put in one sentence. And then, using WordNet, you can just estimate the probability of a position in a sentence, like first, second, third, given a hypernym class and the number of objects in the phrase. And that's the way to learn fun things like animate objects tend to be closer to the beginning of the sentence than inanimate objects.

So here's an example of how this kind of thing works. So we have cat in a box contemplating the light. Cat, from WordNet, we know is an animal. Box we know is a structure. Animals tend to prefer first and second positions, structures tend to be equally distributed across the different positions, maybe a slight preference for second, and you see this sentence is reflecting these trends, right? Animal, which prefers to be in first, is in first, box is in second, light is put at the end.
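A minimal sketch of that ordering model: estimate the probability of a sentence position given a hypernym class and the number of objects, then greedily hand out positions. The noun-to-class dictionary here stands in for a WordNet hypernym lookup, and the counts are invented.

```python
from collections import defaultdict

# Stand-in for a WordNet hypernym lookup (in the real system the class is
# the WordNet hypernym of the object noun).
HYPERNYM = {"cat": "animal", "duck": "animal", "box": "structure",
            "grass": "plant", "light": "artifact"}

# counts[(n_objects, class, position)] = how often that class appeared in
# that position in parsed captions mentioning n_objects objects. Invented numbers.
counts = defaultdict(int, {
    (3, "animal", 0): 50, (3, "animal", 1): 20, (3, "animal", 2): 5,
    (3, "structure", 0): 15, (3, "structure", 1): 25, (3, "structure", 2): 20,
    (3, "artifact", 0): 10, (3, "artifact", 1): 15, (3, "artifact", 2): 30,
})

def position_prob(cls, pos, n_objects):
    total = sum(counts[(n_objects, cls, p)] for p in range(n_objects))
    return counts[(n_objects, cls, pos)] / total if total else 1.0 / n_objects

def order_objects(nouns):
    """Greedy ordering: repeatedly give the next sentence position to the
    noun whose class most prefers that position."""
    n = len(nouns)
    remaining, ordered = list(nouns), []
    for pos in range(n):
        best = max(remaining, key=lambda w: position_prob(HYPERNYM[w], pos, n))
        ordered.append(best)
        remaining.remove(best)
    return ordered

print(order_objects(["box", "cat", "light"]))  # animals tend to come first
```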

>>: Would the condition of the set where both an animate and an inanimate object appear, or...

>> Margaret Mitchell: It's over all Flickr descriptions, so it's animate and inanimate.

>>: I see. So if you scoped it to the set where there is both animate objects and inanimate objects, these patterns might be even stronger, potentially.

>> Margaret Mitchell: Yes. That's probably true. I wanted to work with noisy data. I just wanted to take -- the set now, this Flickr corpus, is I think a million captions. At the time I was working with it, it was 700,000, and I just wanted to just parse it all and see what I could learn from it without filtering, other than HTML and norming it a little bit.

So this just says what I just said, but preparing the data for generation. So I used 700,000 Flickr images with associated descriptions, parsed each caption using the Berkeley parser, and then the key to this is, once it's parsed, you can start to extract word co-occurrences within local subtrees of given object nouns. So this is somewhat similar to tree substitution grammars, if anyone is familiar with that. It's not quite the same way, and I've been working with Matt Post at Hopkins to bring it closer to the TSG framework, but basically, the idea is that nouns are objects. They're these seeds that give rise to the rest of the syntactic information around the description. So they're used to define all the different things that they co-occur with, the adjectives and the determiners.

We can learn, if we take the parse, we have these basic template structures over subtrees that we expect to find for these declarative sentences, so we expect there to be an NP. It might have a determiner or not. It will be headed by an N and can have any number of things in between.

This is the Penn Treebank syntax, which has these really flat phrases, but just defining that, we can start to learn, like, "Well, given box, the probability of A versus that versus another versus no determiner," and we can automatically figure out what's a mass noun, what's a count noun and these kinds of preferences like that.

We can do something similarly for adjectives, so this is helping to filter out the incorrect detections from the computer vision system, right? So we know that boxes tend to be red, wooden, old, small, whatever. Using distributional information, we can associate these adjectives to their attributes, and then we're actually starting to get closer to what referring expression generation is expecting, where you can reason over the attributes and then select values depending on that. And that's just attached to the adjectives that we learned in the corpus.
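A small sketch of that counting step, with simple tuples standing in for the NP subtrees extracted from the parsed captions; the point is just the per-noun determiner and adjective distributions that fall out of it. The example NPs are invented.

```python
from collections import Counter, defaultdict

# Stand-in for NPs extracted from Berkeley-parser trees over Flickr captions:
# (determiner or None, list of prenominal adjectives, head noun).
parsed_nps = [
    ("a", ["red", "wooden"], "box"), ("the", ["old"], "box"),
    (None, ["green"], "grass"), ("the", ["tall"], "grass"),
    ("a", ["small", "white"], "duck"), ("the", [], "duck"),
]

det_counts = defaultdict(Counter)
adj_counts = defaultdict(Counter)
for det, adjs, noun in parsed_nps:
    det_counts[noun][det] += 1          # None records the bare/mass-noun case
    adj_counts[noun].update(adjs)

def prob(counter, item):
    total = sum(counter.values())
    return counter[item] / total if total else 0.0

# e.g. P(no determiner | grass) is relatively high, which flags mass-noun-like
# behavior, and P("wooden" | box) > 0 while P("wooden" | grass) == 0, which is
# the kind of evidence used to filter out unlikely attribute detections.
print(prob(det_counts["grass"], None))
print(prob(adj_counts["box"], "wooden"), prob(adj_counts["grass"], "wooden"))
```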

Okay, so I cast generation in the three-tiered procedure that is common. It's also roughly reflective of how humans seem to generate. Basically, the idea is in content determination, you figure out what you're going to talk about. In micro-planning, you sort of associate this to linguistic structures, and then in surface realization, you put it out in some nice fluent-sounding sentence. So for the content-determination stage, I'm collecting all the object nouns from the vision detections and then I'm grouping and ordering the nouns using this WordNet model that I learned. To order them, I'm just doing a greedy search over the maximum probability of the position given the class for each object. So things that really prefer to be first can be first and things that kind of prefer to be third can be third. And then I build these likely subtrees around object nouns. So from the data, I've defined -- sorry.

In the data, this is all parsed, so I can define these basic subtrees in order to figure out what is the likely verbs, given the noun, what is the likely determiners, given the noun, what is the likely prepositions, given the noun, all within these nice subtrees. And then, from there, you can start creating full trees. So if you have duck and grass, you've ordered them, so duck is at the beginning, grass is at the end, you pull out the likely subtrees that they tend to co-occur in, along with their closed class information to get prepositional information, so ducks tend to be above, on and by, grass tends to be on, by and over, and then you can just take the intersection between the two and start saying things like "duck on grass" and "duck by grass."

In order to maintain the Penn Treebank syntax, you have to add these auxiliary trees for concatenation, but that's the general idea. And the fun thing about this is that you can also start hallucinating, like some of this other work has looked at, where you just take something that's likely, given the subtrees that duck tends to occur in, given the subtrees that grass tends to occur in, and given this overlap, we can actually stick them together to say something like "a duck sitting in the grass." All right.
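And a sketch of that intersection step for spatial relations, with invented preposition preferences for duck and grass: take the prepositions both nouns license (one as the left argument, the other as the right argument), rank them by the combined probability, and realize the top one.

```python
# Invented preposition preferences learned from parsed captions:
# P(preposition | noun appears to its left) and P(preposition | noun to its right).
left_prefs = {"duck": {"on": 0.4, "by": 0.3, "above": 0.2, "under": 0.1}}
right_prefs = {"grass": {"on": 0.5, "by": 0.2, "over": 0.2, "near": 0.1}}

def relate(noun1, noun2):
    """Intersect the prepositions licensed on both sides and rank them
    by the product of the two probabilities."""
    shared = set(left_prefs[noun1]) & set(right_prefs[noun2])
    ranked = sorted(
        shared,
        key=lambda p: left_prefs[noun1][p] * right_prefs[noun2][p],
        reverse=True,
    )
    return [f"{noun1} {p} the {noun2}" for p in ranked]

print(relate("duck", "grass"))
# -> ['duck on the grass', 'duck by the grass']
```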

And then after that is surface realization, so prenominal modifier ordering ended up being a problem. Yay, so I could solve that. I used the top-scoring n-gram model from our earlier work, and this basically turns things like blue clear sky into clear blue sky so it kind of has this roughly more fluent sound. Postmodifier generation, so if a person is detected to be green, I want to say something like person in green rather than green person. That still needs work. For now, they're hand-coded rules. And then the question is how to select the final tree, so you have a distribution over all these different possible trees, which one are you going to select? We tried a bunch of different approaches. The one that ended up working reasonably well was just taking the longest character length. This tended to be the things that were the most descriptive.

So to evaluate, I used 840 PASCAL images, and I compared it against the human-written, the Kulkarni et al. Baby Talk system, the Yang et al. system, which is this template system, and my system. I put it up on Mechanical Turk, where each subject rated three descriptions from each system and used a five-point Likert scale from strongly disagree to strongly agree, with a neutral middle position, which has been shown to get reasonable judgments from humans. And I had them judge on a bunch of different levels. One was grammaticality -- this description is grammatically correct. One is main aspects, correctness, order, so getting at the ordering model, and human likeness. So that's the Turing test one.

So humans win. Participants could tell the human-written descriptions from the Midge-written descriptions, but Midge outperformed the other state-of-the-art systems that were based on these sort of more template-based approaches. So on correctness and order, it outperformed all automatic -- all these automatic approaches. On human likeness and main aspects, it outperformed the Yang et al. So this is the more simple template one, and people are finding these automatically generated structures a bit more human-like.

>>: When you evaluated this, did you ever try comparing one sentence to the other, like which of these two sentences is more human like? Which of these two sentences is more...

>> Margaret Mitchell: Yes, I didn't.

>>: That would be really interesting versus a human.

>> Margaret Mitchell: That would be nice. So I have this friend, Advaith Siddharthan, that's how he tends to evaluate things. I like that idea. It would be fun to do, but for this, each subject just had a randomly selected set of images and descriptions, and they each judged three from each system.

>>: They would not judge the same image's descriptions under different systems.

>> Margaret Mitchell: Right. Yes. Okay. Great. So that's putting it together in an end-to-end system. The problems of generating these nuanced descriptions of objects sort of take second stage to the problem of figuring out how to constrain and control the computer vision system, but we can put them together and start approximating the sort of attributes and values that natural language generation tends to work with.

Okay, so I've talked about a bunch of things. One is just this basic overview of visible object perception and language production and how we can use this to learn models and features for generation. Computational perspective of what computer vision provides and what natural language generation provides, starting to put two and two together at just the basic level of ordering modifiers around a noun to figure out how to make descriptions sound fluid. Then, I delved a bit into describing visible objects, specifically looking at size and generating identifying descriptions, so not necessarily attempting to uniquely identify the referent, but generating things that are visually salient. And then I described an evaluation approach and an end-to-end system, plugging this into computer vision, which I call Midge. And you can try the Midge system online, if you want to see how it goes.

>> Lucy Vanderwende: Thank you so much. So you've had some questions throughout, but do people have other questions?

>>: I was curious about your results on the PASCAL data set. So, generally, this is considered a pretty hard vision challenge, so in this data set, did you try to solve the vision thing manually?

>> Margaret Mitchell: We used manual thresholding.

>>: Well, it's like -- let's say you say that this is -- let's think of the perfect vision algorithm.

This is what I want the output to be, and then if you plug that into the language model, does it do significantly better?

>> Margaret Mitchell: Every time I have mentioned this system, someone has said I should do this. I still haven't done that. I need to do that. Clearly, that would be a really nice thing to see how it works.

>>: But it seems that in the other comparison with Kulkarni et al., there was a large body of text where there was a lot of detail. I was wondering if the vision task, if you do it at a very fine-grained level, whether it's a bad thing for language generation or not. That's something that sort of came to my mind. Because you could do the vision task at a fairly abstract level and we would produce maybe a very nice sentence.

>> Margaret Mitchell: Yes, I think that's right, and I think that that would be a really nice way to just push on the generation. The goal of this work was to connect the two together. I was very focused on what do I have to do to make these fit. But I think that as I've been working more on the language generation component and trying to make it a bit more nuanced and sophisticated, I think that would be a really nice way to go. I'd like to do that.

>>: Yes, because -- so from this description, I get the impression that basically you ended up constraining the vision output or constraining, actually, the vision output by your generation system. But I think you're doing this for an understandable (inaudible). It would be interesting to constrain the training of the vision system by what the NLG system is going to do.

>> Margaret Mitchell: Yes, that is the way to go.

>>: Because, this way, you could actually avoid the waste, some of the effort, this waste in the vision system, by actually predicting the things that you would use later.

>> Margaret Mitchell: Yes, and now that I have this visible objects algorithm, it would make it really easy to just stick that in at the object description level. So, yes, that's clearly -- yes. That would be a nice thing to do.

>> Lucy Vanderwende: So a question from me, since you've talked about generating the expression, the referring expressions, of visible objects. Do you -- what work is there for generating referring expressions for not visible -- and would that be like when we're just talking, for example? What research is there in that area? How would it be different?

>> Margaret Mitchell: So there's been some work. Jordan and Walker were working on developing dialogue systems that sort of modeled how people would talk about furniture, how they would decide the costs and weigh those kinds of things, so it wasn't all visual. They ended up just using something just like a decision tree. It was 2000. I think it was a step in the direction of nonvisual domains, but nonvisual features were handled the same way as visual features, and that's been the philosophy of most of the work in referring expression generation.

The examples that they used and then the corpora that they built -- the examples that they used, they tend to talk about visual characteristics, like, "Say you have a large black dog and a small white dog," without focusing on the fact that what they're talking about is a visual modality. And there's been a lot of work, psycholinguistic work, showing that reference is different whether it's written or spoken, whether the hearer and the speaker know one another, the register, stuff like that. But in terms of the state of the art for referring expression generation, it's just mostly been focusing on give me a set of attributes, I don't care what they are, and I will order them using one of these algorithms. So what I'm doing is saying, "Well, you're only evaluating on visual domains, so let's just say this is visual reference and see what we can do when we actually attack that task head on."

>> Lucy Vanderwende: Any other questions? I know a number of you will have a chance to talk to Meg in person, also. Okay, thank you very much.
