>> Michael Gamon: And so today it’s the 31st Symposium, and Emily actually pointed
that out. So we have two interesting connections to history today. The first
connection is the first talk, because Emily gave a talk exactly ten years ago at the first
symposium, and I think she’s already working on the slides for 2023. And, of course,
now we’re going to do exactly the same speakers again for the next 10 years in
sequence. With the exception of Meg, who is taking the role of Bob Muller today,
who was the first speaker from Microsoft 10 years ago. But Meg also has a great
connection to the program because she’s a graduate, so you can actually see that
people who go through this program can turn out to be perfectly well-adjusted
people, which may be encouraging.
[laughter]
And Fay is going to introduce Emily and Woodley for the first talk, and then I’m
going to introduce Meg for the second talk, and as usual we’re going to do about 30
minutes for each talk, including questions.
>>: Okay, so I guess I don’t really need to introduce Emily. Everybody knows who
Emily is. She’s our associate professor in the linguistics department. She also has an
adjunct position in CSE, and I guess she’s faculty director for the CLMS program,
and I can see maybe 50 percent or more of the audience actually come from that
program or are related to that program.
She has a new book coming out, 100 Essentials from Morphology and Syntax. That’s
her second book. And she’s interested in [inaudible] engineering, computational
semantics, and the relationship between computing and linguistics. And our second
speaker, [inaudible]. So Woodley Packard is a second-year CLMS student, and he got
his bachelor’s and master’s from Stanford, and I guess she’s going to -- he’s, I keep
getting it wrong, sorry about this -- he’s going to, I guess, graduate from our program
in maybe a quarter or two quarters --
>>: End of the year.
[laughter]
>>: I don’t want to add any pressure. I’m just counting the quarters. So I guess
Emily will start.
>> Emily Bender: Thanks, Fay.
[applause]
So you might think it’s a little confusing that Woodley is the first author on this work, but the
way we decided to divide up the presentation, I get to go first. This is a paper that
represents some joint work with some colleagues of ours at multiple different
institutions in Europe, and the title is predicting the scope of negation using minimal
recursion semantics. I’m sure you all had the chance to read that.
So I’m going to start with just some motivation for the topic. Who cares about
negation? Well, we should all care about negation. So here’s an example from
machine translation that another colleague of ours, Francis Bond, turned
up. This Japanese sentence, [inaudible], should be translated by a human as "we
shouldn’t have any prejudice," but if you put it through a fairly standard SMT system,
a configuration of Moses, you get "you should have a bias."
Right? And this work that Francis was reporting on found that Moses lost the
negation two thirds of the time, and that’s pretty important. Without the
negation you can still gist what the text is about, but not what it’s saying. Another
example comes from Gmail. So there are these add-ins that helpfully try to connect
to your calendar. So actually I have a meeting next Thursday. [inaudible].
Well, no thank you. I’d rather not do that. And also, perhaps more seriously, there’s a
lot of interest in detecting negation and the scope of negation in the biomedical NLP
field. So the BioScope corpus was the first large-scale effort to provide an annotated
resource here.
And this is to support applications like coding clinical notes for insurance claims,
right? So if you miss negation and you code something that was mentioned but
actually isn’t there, you can get in trouble because you’re charging the insurer too
much; but also epidemiological studies and other uses of text mining.
And so BioScope annotates things like "a small amount of" -- I don’t know how to
pronounce that. Adenopathy? -- "cannot be completely excluded." Did I get
it right? Adenopathy?
So this is a particularly interesting example because we have cannot, which is sort of
a grammatical clausal negation, but then excluded is also a lexical representation of
negation. And so if you really want to know what this sentence is saying you have to
be able to handle both of those markers of negation, which leads into this slide here.
So there are many different ways to mark negation. There’s clausal negation, which
in English is a separate word, as in "she decided not to go home," where the
embedded clause "go home" is negated with that not. And we also have constituent
negation on noun phrases, where it shows up as a determiner: "he had no friends
in Seattle."
But there are also negative affixes. So a prefix: "space flight was thought impossible."
A suffix: "that kid was clueless." And an infix -- at least, according to the annotated
data that we’re working with, this was considered an infix: "we got hopelessly lost."
They called it an infix because the -less was in the middle of the string.
A linguist would call that a suffix that happens to have another suffix after it. Read my
book!
So what was the thing I was talking about? There was a shared task at *SEM,
the Joint Conference on Lexical and Computational Semantics, in 2012,
on detecting the scope and focus of negation. And so the organizers of that shared
task formalized a notion of negation cue and a notion of the scope of negation, and also
focus, which is what within the scope is primarily negated, and then negated events.
So there are lots of different subtasks.
And they developed careful annotation guidelines and then annotated a bunch of
Sherlock Holmes data for training and testing. So you have annotated data where the
cue is marked, and the scope according to that cue, and then further annotations that are
not what we’re focusing on here.
There was good interest in this problem. So there were twelve
submissions from seven different institutions, and to give you a sense of the task,
here is some of the data. So you have sentences like "the German was sent for,
but professed to know nothing of the matter." It gives you a little bit of the flavor of
the style of this text too, which is fun to work with.
It may be that you are not yourself luminous, but you are a conductor of light. This
is way more fun than most [inaudible] text. And I trust that there is nothing of
consequence, which I have overlooked.
So we have nothing and not showing up in these examples, but there are also the
affixes, which they [inaudible] in a similar way.
And what’s particularly interesting, I think, is that the scope is based on the semantic
dependencies. So in the first example, "the German was sent for but professed
to know nothing of the matter," we have know being negated because it has a negated
argument, and then the rest of its arguments are also in the scope, so
the German is counted as in scope.
Of those previous approaches, the winning submission came from the University
of Oslo, and it was submitted by our coauthors, among others, so we teamed up
with the winners here. They used an SVM to detect the cues and then to rank the
syntactic constituents for scopes. So their training data included, I believe, turnac
[phonetic] parser data: automatically produced parses for all the
sentences, which the system then used to come up with what might or might
not be in the scope.
There was also a submission from the University of Washington by Jim White; I
don’t see Jim here today. That approach used regular expressions for cue detection
and then CRF sequence labeling for the scopes. So it wasn’t working with semantic
structures, in this case not even really working with the syntactic structures, just
looking at the surface strings. And that wasn’t too far behind the winner. There was
one system that actually approached this semantic problem from the point of view
of an explicit semantic representation, and that was the University of Groningen
submission, where they used DRS, discourse representation
structures, as produced by the C&C parser and the Boxer system, and that gives a
fairly explicit notion of cue and scope.
So the only real modification they made was to change some of the lexical entries in
Boxer so that more things gave the negation symbol in the semantics, and then they
just read the scope off of their semantic representations and then sort of filled in the
blanks to try to handle the semantically empty words.
They did not do well in the competition, somewhat disappointingly. It seems like
that’s a principled approach. It should have worked.
So in the paper describing this corpus, [inaudible] characterized negation as what
they call an extra-propositional aspect of meaning, which struck me as very strange,
because what I know of semantics is that negation is actually a core piece of
compositionally constructed representations, and that’s what I think propositional
semantics means.
Their notion of scope of negation is not quite the same as the way we use scope in
mainstream underspecified semantics. It is more tied to predicate-argument
structure and uninterested in scope ambiguities with quantifiers. But looking at the
annotations, and looking at the minimal recursion semantics structures I’m going to
tell you about in a minute here, we actually noticed that there was a nice alignment.
And so the [inaudible] structures give us a good starting point for trying to model
the task-specific notion of scope of negation.
So, very quickly, what minimal recursion semantics is: it’s a flat, underspecified
representation of formulae in predicate logic with generalized quantifiers. It
makes explicit the predicate-argument links, and also the links between scopal
arguments and the entities that fill those argument positions. Some of those are
fixed: negation takes a scopal argument, but it’s fixed by the syntax. That’s in
contrast with quantifiers, which are also scopal things but can float around.
And so an MRS is an underspecified representation that can be monotonically
refined further to get fully scoped representations. When we first started working
on this we thought that we actually wanted those fully scoped representations, and
we quickly discovered that actually that’s not the relevant notion of scope. So we’re
working directly with the underspecified representations.
The main reason we’re using MRS is that it can be produced automatically at scale
for English by the English Resource Grammar. So the English Resource Grammar will
give us a syntactic structure like this. This is very simplified, in the sense that what’s
actually behind each of those node labels is an enormous feature structure with
hundreds of feature-value pairs, all right? But that gives you a sense of the
constituent structure. And part of the big feature structure of the big S node at the
top is the semantic representation for "the German was sent for but professed to
know nothing of the matter." And it’s this representation that we’re going to be
crawling around in to pick out the scope of negation. And here’s where I hand it
over to Woodley to tell you how we actually do that.
>> Woodley Packard: Thanks, Emily. All right. So this structure, you can see that
it’s pretty machine-readable. It’s got a little bit more information than we actually
need, but the critical thing is it’s got -- maybe I should use my laser pointer here
instead of putting my hand up there -- the character positions that link these
predicates back to the string, and these argument positions that describe how the
different predicates relate to each other. It’s also got, in curly braces here,
properties on variables, and we’re actually going to totally ignore those for this
purpose. But you can see, for instance, how this works: the thing that’s known by
this know relation here is its ARG2, which is x18, and that’s a thing.
So the sentence here was one of our examples from before: "the German was sent
for but professed to know nothing of the matter." And that’s how that works out in the
argument structure.
So I’m going to show you a slightly different way of looking at this picture that throws away
some of the information we don’t care about in this project, but makes it easier to
see the graph structure here. So there’s more or less a graph underlying this, and it
looks like that. The links here are the arguments, and the nodes are the
predicates.
So the link I just showed you a moment ago was from know to thing. So there’s a link
from the no quantifier to the thing, and also from know to the thing.
All right. So to attack the problem of scope resolution, the first thing we do is find
out what the cue is, and we assume the cue is actually one of these predicates. Our
system assumes that somebody else has done the cue identification for us at the string
level, and then we project that onto the MRS using those standoffs that I showed you.
So this shows us where in the string the no was, and we call that predicate in scope.
And then, based on that, we can say: how are we going to look around the rest of this
graph to figure out what else is in scope? Because what we thought we would do is
figure out what parts of this graph are in scope and then project those back, using
those string standoffs, onto the surface.
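To make that projection step concrete, here is a minimal sketch in Python; the record layout and function names are illustrative assumptions, not the actual system’s API.

```python
# Minimal sketch of cue projection, assuming each MRS predication
# carries a character span (cfrom, cto) linking it to the surface
# string. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Predication:
    predicate: str   # e.g. "_no_q" or "_know_v_1"
    cfrom: int       # start character offset in the sentence
    cto: int         # end character offset in the sentence

def project_cue(predications, cue_from, cue_to):
    """Return the predication(s) whose span overlaps the string-level cue."""
    return [p for p in predications
            if p.cfrom < cue_to and cue_from < p.cto]

# Usage: if the annotated cue "nothing" spans characters 33-40, this picks
# out the predication(s) contributed by that word, seeding the in-scope set.
```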
So the basic idea is to just crawl around these links, and the trick is which ones we
should crawl and which ones we should not. So since this is an NP
negation with a quantifier here, the first step was to identify
the cue, and the second step was to activate verbs that take it as an argument.
So you can see these green links are links from some other predicate back into an
activated node. There’s one from of and one from know, but know is the only
verb, so we activate that. And once we’ve done that -- we’ve got
a couple of different kinds of crawling here that we give different names -- we call that
functor crawling, because it’s crawling backwards along the links. From then on we
only crawl forward.
We call those argument crawling and label crawling; we just did the functor crawling
as an initialization step. So looking from here, we can color in the different arcs that
are accessible from these in-scope nodes: the red arcs are argument crawling, and the
green arcs are what we could do if we were willing to functor crawl. And you can
see we want to be able to get from know to its arguments -- well, this may be a bad
example because they’re both pronounced the same way -- but from know to its
arguments, and also from the no quantifier to its arguments.
So those are both argument crawling, and we just go ahead and do that. And then
label crawling. So from here, what’s available is more green arcs, if we were
willing to do the functor crawling. And there are a few instances where we do allow
that, for modals and certain types of subordinating conjunctions, but there aren’t any
of those in this example.
I colored the label arc there blue. So there are a couple of different types of arcs here,
and the most interesting ones are the arguments, but the label arcs relate to
the scope structure that’s defined by the underspecified representation of scope
here. And the important thing is that modifiers of a noun or a verb will
share the label with it. So that’s this label crawling link here in blue, and we’re almost
always willing to crawl across that. There are a couple of exceptions, but they’re not
important right here.
So you can see we’ve already crawled all of the red links; now we’re going to crawl
this blue link to get to of, and then that enables us to get down a couple more red
links, and then there’s nothing more that can be done. So at this point we’ve
crawled around the MRS graph and activated everything that we could reach with
these rules, and it turns out these are exactly the words that we wanted
to have in scope according to the annotations.
So we project that back with the string standoffs to the surface string, and we find
that "the German ... know nothing of the matter" is what we’re predicting to be in scope,
and that’s correct.
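Pulling the pieces together, here is a compact sketch of that crawling loop; the graph encoding and the exact rule conditions are simplifying assumptions for illustration, not the actual implementation.

```python
# Sketch of scope crawling: seed with the cue predication, do one
# backward (functor) step to verbs taking it as an argument, then
# repeatedly follow argument and label links forward to a fixpoint.

def crawl_scope(arg_edges, label_edges, cue, is_verb):
    # arg_edges: set of (functor, argument) predicate pairs
    # label_edges: set of symmetric (a, b) pairs sharing a scopal label
    active = {cue}
    # Functor crawling, used only at initialization: activate verbs
    # that take an already-active predicate as an argument.
    active |= {f for (f, a) in arg_edges if a in active and is_verb(f)}
    changed = True
    while changed:
        changed = False
        for f, a in arg_edges:        # argument crawling: functor -> argument
            if f in active and a not in active:
                active.add(a); changed = True
        for a, b in label_edges:      # label crawling: shared-label neighbor
            if a in active and b not in active:
                active.add(b); changed = True
    return active                      # predicates, projected back to the string
```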
And that’s basically the way the system works in general. There’s a wrinkle, which
is these so-called semantically empty words. So in the semantic representation that
the English Resource Grammar produces, there’s in most cases one predication
(one predicate, one node in that graph) per word, but there are some cases where a word
just doesn’t contribute anything to the semantics. A good example is a
complementizer like to or that. And there are some cases where a word has some
meaning, but it’s not totally compositional, so it’s best to encode that as part of a
predicate from another word. So "send for" was an example.
Although that one, I think, was out of scope, so it didn’t matter here. Yeah, so actually
I think for is probably considered semantic with [inaudible] in this analysis, but it
didn’t matter to us because that was out of scope. But I’m going to give you another
example here that has quite a few semantically vacuous words, the ones in blue here:
"I trust that there is nothing of consequence which I have overlooked."
The curly braces there show the scope that we want to get, that the annotators
marked, and the red shows what the crawling rules I just described will get for
us: so is, thing, of, consequence, overlooked. Actually, the I should also be red; I think
you can’t see that. But which, have, that, and there are vacuous, in that they don’t
contribute an explicit predication to the graph. Most of those we want in scope, so we
have to figure out how to do that.
Since they don’t show up in the MRS, we have to back off to the syntax tree. So Emily
showed you before the tree structure and the MRS structure that are produced with
the ERG. So here’s the tree structure for that sentence, "I trust that there is nothing of
consequence which I have overlooked," and what we do is initially annotate the
words that the crawling system marked as in scope, and then walk up the lexical head
paths from those.
So something like nothing is the head of this entire noun phrase, but is not the head
once you get to the verb phrase. We mark from nothing as in scope all the way up to
here, and from is, which we marked as in scope, up to there, and from consequence
up to here, and so forth, so that a large portion of the tree around the area that we
marked as in scope is flagged. Then, from there, each of these semantically
vacuous words is relatively close to information about whether that section of the
string is in scope or not. And the types of semantically vacuous words
(relative pronouns, complementizers, helping verbs, things like that; there are about
four or five different classes of them) each have a set of rules
describing where in the tree to look to see whether you’re in scope or not.
So, for instance, have says "I’m in scope if my complement is in scope," and which says
"I’m in scope if I fill a gap in an in-scope sentence," or something like that.
So we iterate these rules a few times, because you can see that which is not actually
filling a gap in an in-scope sentence yet, because the head of that sentence is have,
syntactically. So after one step we can mark that have is in scope, because it takes
an in-scope complement; and after two steps (its head path went up to the S,
because it was the head of that constituent) which now fills a gap in that,
and we can mark it as in scope too.
And you can see that that out here doesn’t get marked: the rule for complementizers
is that they’re in scope if they are the complement of an in-scope verb. And there are
some idiosyncrasies in the way that the annotation guidelines work.
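A minimal sketch of that iterated pass might look like the following; the rule classes and their conditions are paraphrases for illustration, not the real rule set.

```python
# Sketch of the vacuous-word pass: after head-path projection marks
# tree regions as in scope, per-class rules are applied repeatedly
# until nothing changes, since one word's status can depend on
# another's (e.g. "which" waits on "have").

def resolve_vacuous(vacuous_words, in_scope, rules):
    # vacuous_words: tokens with a .cls attribute naming their class
    # rules: dict mapping class name -> predicate(word, in_scope) -> bool
    changed = True
    while changed:
        changed = False
        for w in vacuous_words:
            if w not in in_scope and rules[w.cls](w, in_scope):
                in_scope.add(w)
                changed = True
    return in_scope

# e.g. rules["aux"] might say: in scope if my complement is in scope;
# rules["relpron"]: in scope if I fill a gap in an in-scope clause.
```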
All right, so we had to evaluate this. We used the Sherlock Holmes data and we used
the gold cues. It comes with a train/dev/test split that was used for the shared task,
and we used that. We designed these rules using the training data, we looked at the
development data once, and we applied it all to the one-best analysis from the ERG.
So the ERG has lots of different analyses for any particular string (ambiguity), but
there’s a statistical model that will tell you which one is probably the right one, and
there is actually a confidence metric that goes with that.
So we did this, and the results were kind of okay. You can see it’s a high-precision,
low-recall system. One of the reasons the recall is low is that the grammar
doesn’t have full coverage: frequently there is no parse, or the statistical parse
selection gets the wrong parse.
There were a few cases where there were rare cues that we didn’t know what to do
with. So there are a lot of cases where the system doesn’t put out the wrong answer
per se, but says "I don’t know." So the obvious thing to do is: let’s combine with
another system that’s a high-recall system. And the winners from the competition
were certainly that. And we’re good friends with them, so we said, okay, let’s do a
system combination with you guys.
And this system combination is: we use our results whenever they are available,
and use their results whenever our results aren’t available. And you can see the
recall pops way up. It was 67 percent token-level recall on the development set; it pops
up to 83, and all the other numbers are better too.
Well, the precision drops a little bit, but not much. So that’s a lot better. It’s not
quite consistently beating the winner of the competition yet, but it’s actually
relatively close, and there is one more game we can play, which is that sometimes we
put out a guess about the scopes even when we have a very low-confidence parse.
So we have this confidence metric (a maximum entropy parse selection model), and
we can just say: all right, what was the probability that that model assigned to this
tree, conditioned on the input string?
So if we say, well, if our probability of having the right parse was at least 50 percent,
then we’ll use our system, otherwise we’ll fall back. By now we’re only
producing guesses for about 25 percent of the negation cues, but it’s still a
meaningful portion, and using that we can actually outperform the published
results with the system combination here. You can see that the published results
down here, from the winners of the competition last year, are here, and we’re able to
outperform them in almost every box. They won in recall in one box there.
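In code form, the fallback logic just described amounts to something like this; the interface names are assumptions, and only the 50 percent cutoff comes from the talk.

```python
# Sketch of the confidence-thresholded system combination: use the
# MRS-based prediction when a parse exists and its parse-selection
# probability is high enough, otherwise defer to the high-recall
# SVM-based system.

CONFIDENCE_THRESHOLD = 0.5  # the 50 percent cutoff from the talk

def combined_scope(sentence, mrs_system, fallback_system):
    result = mrs_system.predict(sentence)  # may be None: no parse or unknown cue
    if result is not None and result.parse_probability >= CONFIDENCE_THRESHOLD:
        return result.scope
    return fallback_system.predict(sentence).scope
```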
It’s not a big win, but it is one percent, five percent error reduction, ten percent
error reduction, something like that. So we feel like that’s kind of exciting. I can’t
tell you statistical tests about whether those are significant or not. It seems like at
least the -- anyway --
>>: What was the size of it?
>> Woodley Packard: So the size of the test set is -- do you remember the number
on that? I think there are 200 negation cues; there are more, but we’re not doing
the task of identifying which ones are cues or not. And the number of tokens in it has
got to be 3,000 tokens or so, of which several hundred are in scope.
In conclusion, the MRS-based system is high-precision but low-recall, and in a
system combination it seems to be able to perform at least as well as, and probably a
little bit better than, the best published results. We also did an oracle experiment where,
instead of using the confidence metric to decide which system to use, we used
whichever one performed best according to the gold, and that was able to perform
even quite a bit better than the system combination.
Actually, the numbers there, I think, on the test set went up to 90 point something.
And we wanted to note that we implemented our rules looking at the data, but
actually, if you look at the guidelines and compare them to our rules, you can see that
they line up pretty well. The nice thing about a rule-based system is, of course, that you
can interpret the rules, and it looked like the rules did just about what the guidelines
said.
The fact that we were able to converge on those rules, and that the rules matched up
with the semantic representations that the ERG produced, we think kind of
validates each of them in a way.
So we think it’s neat that an explicitly semantic approach to this problem was able to
build on a purely machine learning approach.
Thank you.
[applause]
Questions?
>>: So among sentences where the correct parse was chosen by the selection model
and there was coverage, did you find that most of the recall errors disappeared in that
case?
>> Woodley Packard: So it’s hard to say exactly. I mean, we did some treebanking
to determine what the -- so in general we don’t know when the right parse was
found or not found. You’d have to go in manually and say: what was the right parse,
did we get it or not?
A lot of the cases in error analysis, where we looked through to see what was going
wrong, we found that it was the wrong parse. On the other hand, we did some
manual treebanking and produced gold analyses for a bunch of sentences, and
when we ran those through our rules they didn’t perform that well. It looked like
the reason was that our gold analyses had a lot more of these rare cues in them,
things that the statistical model just wasn’t picking up on because they were low
frequency. And so our rules didn’t have them either.
>>: A corollary to that: how did you get the 50 percent figure that you decided on?
>> Woodley Packard: For the confidence? We did play with pushing
that back and forth. 50 percent was actually the first thing we tried, and it turned
out there was not much difference between doing that and, say, 25 percent or 75 percent.
Yeah?
>>: Have you thought of using the [inaudible] approximation model that
[inaudible]?
>> Woodley Packard: No, we haven’t done that. Actually, E [phonetic] hasn’t
released the software, that I know of, to make that possible.
>>: You could ask [inaudible].
>> Woodley Packard: Yeah, so to fill in some context: there is a PCFG approximation
to the ERG that lets you produce a derivation tree, at least in cases where the
grammar wouldn’t be able to parse, and there’s a method for trying to produce an
MRS from those derivation trees even when the unifications according to the
unification-based grammar wouldn’t actually succeed, and you can get a graph that
looks like our graphs out of that. That’s relatively new research, and it’s not
software that we have access to. But we’d like to try that.
The fact that the lower scoring trees yielded noticeably worse results suggests that
maybe trees coming out of that wouldn’t work well either, but we don’t know. But
it’s worth trying.
Any other questions?
Okay.
[applause]
>> Michael Gamon: So it’s my pleasure to introduce Meg Mitchell. I’m very lucky she
joined our group recently. Meg, as mentioned before, actually got her master’s
through the UW program, got a Ph.D. at the University of Aberdeen, and was also a
visiting scholar at Oregon Health and Science University. And Meg is particularly
interested in the connections between language and vision, which is a really
interesting and, I think, emerging field that’s drawing more and more attention.
And today she is going to be talking about generating human reference to visible
objects.
>> Margaret Mitchell: Can you guys hear me okay? I’m good? Okay. So I’m Meg.
I’m going to be giving sort of a general talk on generating human reference, to visible
objects in particular. This is something I’ve worked on from different angles for a
while, including collaboration with a bunch of people. I kind of keep the same slide
deck and keep switching in slides, but then I’m not sure what names to take in and
out, so this includes a lot of input from a lot of people spanning a lot of universities.
And I thought it would be kind of fun to look at this from the perspective of when I
started in the CLMA program (now it’s called CLMS, but then it was called CLMA),
and just sort of pick up from the last presentation I gave there, when I was doing my
thesis proposal, and then start sort of mapping out the space and the problems that
fell out from that beginning.
So let’s see. So my plan is to briefly cover some psycholinguistic studies I’ve run to
tease out what it is useful to model when you are trying to look at human reference
to visible objects, and some models that I’ve developed in pursuit of that goal. I
hope it’s clear from this talk that this is a wide-open space where there is a lot of
future work that can be done, but we have some sort of nice starts.
Okay, so back to 2008. This is a slide from my thesis proposal in the CLMA program;
Mike nicely made me these pretty spheres. I was trying to dive into the task,
looking specifically at referring expression generation, which is a subtask in natural
language generation, and I was concerned with how a system could produce
referring expressions that sound natural.
So how can we produce expressions that uniquely pick out items in a scene in a way
that sounds human-like? That sort of brought up the questions: what does natural
mean? What could the input possibly be to get to that point? And then, if we actually
do get to the point of outputting something that sounds natural, how do we evaluate
it?
So this can be seen within the general context of natural language generation,
which deals with taking non-linguistic input, associating it with some sort of syntactic
and semantic structures, and from there figuring out how to output sentences that
sound cohesive and fluent. So referring expression generation can be seen as just
a subtask within this larger set, where you’re looking at generating expressions that
can identify a referent to a listener or a hearer.
So the state of the art in 2008 was basically: given some sort of computer graphics
representation of an image and some very clear, simple semantics of that
object (something like color: grey, size: large, orientation: front), find the best way to
refer to that one thing in context.
So, say something like "the grey desk." Moving forward, in my master’s thesis at CLMA
I was starting to wonder what happens when you move outside of the computer
graphics world and move to actual objects that people are talking about. So I ran this
Mechanical Turk study, where I had prompts like "the blank on the left is tied with the
blank" over different configurations of objects, and tried to tease out what kinds of things
people were saying when they were referring to real objects.
One of the things that came out of this was the fact that there is an enormous
amount of speaker variation in the way that people talk about visual objects. So I
had 64 different participants and 32 different expressions to refer to this orange
bubbly thing. I didn’t have some underlying semantic representation; I just had the
image, and so it was sort of up to me to figure out how to get from this image to some
sort of set of possible ways to talk about the object.
And it’s sort of like a dual problem, where I’m trying to figure out what sorts of things
are worth mentioning, and then how they should be realized in the surface form. So
maybe spiked and spiky are from the same sort of underlying semantics, but these
are two separate problems to tease out.
Right. So going from the sort of traditional task of trying to find the best expression
that you can use to refer to an item, given some predefined set of attributes and
values, I started respinning the task as: given some visual input, how do we get the
set of possible things that people might say to talk about it?
One thing that’s interesting about this is that the task of trying to figure out how to
talk about one specific object is similar to the task of figuring out how to describe
that object, right? So something like "the bright, bumpy ball-thing randomly floating
on the right side of the slide" is something you could say about it given the visual
input.
And once you start to look at it in terms of description, you realize that it’s not a big
jump to go from talking about an object to saying full sentences, right? So this is just
a copular transition from a noun phrase to a sentence: "the bright, bumpy ball-thing
is randomly floating on the right side of the slide." And for some reason I, like, blew my
thesis advisor’s mind when I made this point.
But the idea is that, looking at the object, you can start to reason about the semantic
representations and move this to full-on sentence descriptions, it seems. Okay.
Along with sort of interesting variations: randomly floats, is randomly floating.
Right. So why look at object reference? Why is this interesting at all? There are
a bunch of reasons why I find this interesting; I kind of change around the reasons
depending on the audience. Because I don’t have a ton of time, I’ll just say here are a
few. One is that it anchors visual language. So, as I mentioned before, being able to
reason about the semantics of an object in a visual scene gives you a way to reason
about the visual scene as a whole. So they are nice little bits of meaning that
you can start adding further syntactic structure around.
It’s useful in things like assistive technology. So at the University of Aberdeen I was
involved in this project called How was School Today, where the idea was that kids with
cerebral palsy who have difficulty talking about the world might be able to have a
camera that takes pictures of things, and then that could be used to generate
descriptions of objects that they could talk about, that they could play with, and
things like that.
It’s useful for devices that can scan the world and interact with humans as an
assistant. So this is kind of like the AI problem, but it cuts right in there.
It’s also useful for automatic summarization of videos. So by being able to analyze
videos and attach them to linguistic structures, we can start getting at how people are
going to search through videos and find them without having to watch the videos
themselves. And it’s useful for creative tasks, just generating descriptions of
things for the fun of it.
Okay. It also kind of makes sense to look at the object within reference, because it’s
sort of how we evolved to be able to talk about the world, right? So if we’re going
towards a model that can talk about the visual world, it makes some sense to kind
of borrow from how humans have evolved to start learning how to talk in
general.
So we know that the object is a principal unit in early language learning. We know
that some of the earliest vocabulary items for children are usually nouns that refer
to concrete objects. And these are the kinds of visual things you’re interacting with
in the world. There has been a lot of work showing that referring nouns seem to be
conceptually more basic than the concepts referred to by verbs and prepositions, so it is
clearly sort of a nice starting space.
And here is the early reference. I try to have an early reference in all my -- anyway.
Nouns also provide a way to communicate about the natural world’s variety as
distinct entities. Right? So, like, when pidgins start up, people are usually just
using nouns and basic verbs in order to talk about the world. And something
that is also sort of handy (this is just sort of a happy coincidence) is that computer
vision, which is the obvious sort of input if you’re trying to automatically talk about
the visual world, is centered around the object.
So action detection doesn’t work very well, pose detection doesn’t work very well,
but object detection works reasonably well, and action and pose given object is
much better than action and pose alone.
Okay, so I ran a bunch of studies to try and start teasing out what it means to refer to
objects in these real visual domains. Previous corpora were sort of impoverished for
my needs. So the TUNA corpus was this corpus that looked at things like grey
desks in certain configurations, and the GRE3D3 corpus is a similar thing but with
cubes and spheres. And I was trying to sort of get at what people are going to do, not
when they’re asked specifically to refer to this thing, but when they’re given a bunch
of objects within a broader task and they are talking about them.
And how can we model that sort of stuff? There are further
details on each of these, one in INLG 2010 and one in CogSci 2013. At the same time,
it makes sense to think about what the input would be to eventually get to that point,
right? So I was simultaneously working with some computer vision people to try
and think about what computer vision can actually provide when I’m trying to get
just this sort of naturalistic output.
So things like object detection (these are bounding boxes around the objects) can
work reasonably well in small domains.
Okay. So there is no one corpus, there is no one study you can do, to sort of
understand how people talk about the visual world and get appropriate models for
that. But if you do a bunch of studies and you do corpus-based work, you can start to
see the sort of general tendencies that come out of that, and know what it’s useful to
focus your research towards. So that’s sort of what I did. I did this one study where
I had craft objects, which I chose because those are visual manifestations of visible
properties, right?
That’s the whole point: to just be these masses of visual properties. And I had
people describe how to use the objects on the board to put together a face.
And then I ran another study with sort of office objects, kinds of things I found
around the house, and I had them tell an assistant how to place them in these sort of
grid-like patterns in order to replicate some images that they saw on the screen.
What I found sort of going through all these things is a bunch of things that sort of
seem obvious, but there hasn’t been a lot of work done on them so it’s important to
kind of focus on them in particular. One thing is that people refer differently.
So this I mentioned in my master’s thesis work, but it’s obviously come out again
and again, and again: one-shot descriptions don’t make sense when you’re trying to
talk about the visual world. That’s true for generating referring expressions; that’s
true for generating full descriptions. Different people will refer in different ways,
and a single person will refer differently in the same context. So unless you’re trying
to build an agent that is very repetitive, it makes sense to start modeling the
distribution over both the sort of semantic content that people will pick when they
talk about an object, as well as the surface forms that that semantic content can be
used to realize.
Okay. It also becomes clear that ordering really matters. And this is true both for
modifiers and for nouns, and for a bunch of other things. So this sort of led to a
bunch of projects, where I noticed that people were comfortable saying
things like "comfortable red chair," but "red comfortable chair" was dispreferred.
And this is true for even longer phrases, right? "Big, rectangular, green, Chinese silk
carpet" is awkward, but sounds somehow more correct than "Chinese, big, silk, green,
rectangular carpet."
And there were sort of no existing models for this kind of thing. There are further
details on this in the ACL 2011 paper, but the sort of punch line from this work is: if you
automatically parse a ton of corpora, extract the noun phrases, and then train a
language model on those noun phrases, you can do really well at doing this
automatically.
If you want to do a class-based approach, it works pretty well with a hidden Markov
model, which you can update using EM to sort of learn these latent classes that
give rise to the surface forms.
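The n-gram version of that idea can be sketched in a few lines; the scoring function here is an abstract stand-in for whatever language model has been trained on extracted noun phrases.

```python
# Sketch of LM-based prenominal modifier ordering: score every
# permutation of the modifiers under a language model trained on
# noun phrases, and emit the best-scoring order. Fine for the
# handful of modifiers a typical NP contains.
from itertools import permutations

def best_order(modifiers, head, logprob):
    """logprob: function scoring a token sequence under the trained LM."""
    candidates = [list(p) + [head] for p in permutations(modifiers)]
    return max(candidates, key=logprob)

# best_order(["red", "comfortable"], "chair", lm_score) should come back
# as ["comfortable", "red", "chair"] if the LM has learned the preference.
```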
There’s another problem that comes out of this where there is still a lot of work
to be done; I don’t think anyone has touched it. And that’s: given some sort of
semantic representation, when are things postnominal and when are things
prenominal, and how do you make that decision?
Ordering also matters for nouns. So within a description of a scene, there is a
tendency to put animate things at the front of the sentence and inanimate things
near the end of the sentence, and people are actually kind of predictable about what
they choose as subject and object.
So I did some work on this using WordNet hypernyms, where you could
look at object positions in some three-noun description, so a description that had
three nouns in it. And animate things tend to be in first position; things like
structures and boxes tend to be in the second position. And you can actually use this in a
generative model to order a set of given objects.
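As a toy illustration of the hypernym idea (the class inventory and position preferences below are invented, and it assumes the NLTK WordNet data is installed):

```python
# Sketch of hypernym-based noun ordering: map each noun to coarse
# WordNet classes and sort by a learned position preference
# (animate entities early, structures and artifacts later).
from nltk.corpus import wordnet as wn

POSITION_PREFERENCE = {"person.n.01": 0, "animal.n.01": 0,
                       "structure.n.01": 1, "artifact.n.01": 2}

def position_key(noun):
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return 3  # unknown nouns go last
    hypernyms = {h.name() for path in synsets[0].hypernym_paths() for h in path}
    return min((v for k, v in POSITION_PREFERENCE.items() if k in hypernyms),
               default=3)

def order_nouns(nouns):
    return sorted(nouns, key=position_key)

# order_nouns(["box", "dog"]) -> ["dog", "box"]
```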
Okay. So I was a little unsure about which attributes are important to
focus on. So throughout these studies you see that color tends to be massively
preferred, also size, sort of regardless of the task, material and shape as well, and
analogies and part-whole relations, which I don’t think have had any work for generation.
So there is a lot of research to be done there. So in the study with craft objects I
found that there’s a dominant preference for color, followed by size, material, and
shape. And also, material and shape are realized as head nouns, which
reflects people’s tendency to use shape when they’re giving object categories.
So things of the same object category tend to have roughly the same shape. And you
can see that in the way that people talk about objects.
This is also true for the study with office objects. Again, color, size, material, shape.
You see part-whole relations and you see fun things like analogies, sea urchin,
whatever that is.
It seems to be sort of like there’s a wealth of problems that you can work on here
and we’ve just started to scratch the surface.
So keeping in mind what the actual possible visual input is, it becomes clear that
although we can get some sort of understanding about the objects in the scene and
some of the attributes that apply to them, there are some things that referring
expression generation has assumed would just be available that really aren’t.
So in particular, size, right? You can’t look at a scene and just know "small"; that
actually takes some sort of reasoning about the X, Y, and Z dimensions of the objects
in the scene. So it kind of opens up the sort of problem space that’s available within the
sort of vision-to-language connection when you’re trying to generate human-like
reference. Vision provides things like color, shape, and material, but the relative
properties that are common for people to use, things like size, location, and spatial
relationships, require some reasoning about the object coordinates and
segmentation of the image, and there is still a lot of work to be done there.
I did some work on size in particular, because it was particularly common. It’s kind
of boring to work on size, but you know, it sort of needs to be solved, right, if you
want to start looking at this problem.
So if you have an object on the right here that’s a bit smaller than the one on the
left, there’s sort of some point where you start to say: okay, that’s still small, but
maybe it’s thin; now it’s maybe thin; now it’s just thin, not small; now it’s tall
and thin, or maybe just tall.
There seems to be something about the height, width, and depth that you can reason
about in order to figure out what kind of size language people will use in a visual
scene, and you can actually do pretty well on this using a discriminative model that
just uses these sort of object coordinates in different ways to try and predict some
basic size types that people will tend to use.
I distinguished six different basic size types, covering small and big as well as
things like tall, thin, short, and fat, and we can actually get pretty close to approaching
what a [inaudible] oracle method does at understanding the kind of size language that
people use.
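A sketch of what such a discriminative model could look like; the feature set and the training call are illustrative assumptions, not the model from the talk.

```python
# Sketch of size-type prediction from bounding-box dimensions with a
# discriminative classifier: featurize width/height/depth and their
# ratios, then predict one of a few basic size words.
from sklearn.linear_model import LogisticRegression

SIZE_TYPES = ["small", "big", "tall", "thin", "short", "fat"]

def size_features(w, h, d, scene_mean_volume):
    vol = w * h * d
    return [w, h, d, h / w, w / h, vol, vol / scene_mean_volume]

# With X as feature vectors and y as annotated size words:
# clf = LogisticRegression(max_iter=1000).fit(X, y)
# clf.predict([size_features(0.2, 1.5, 0.2, 1.0)])  # plausibly "tall" or "thin"
```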
Okay. So with all these sort of ideas, although nothing is completely solved, you can
still put them together in a proof of concept to actually start generating descriptions
of images. So this is some work I did with the computer vision people back in 2011,
where I was trying to understand how I could take what I know about the tendencies
of what people tend to mention, and prenominal modifier ordering, and put them
together with a given sort of computer vision input, to create descriptions of images
that sound fluent and natural. And what I found is that by collecting the object nouns
(so again, the computer vision is just providing the nouns, the sort of objects), you can
collect the sort of subtrees that these tend to occur in, using just basic Penn
Treebank syntax; you can build likely subtrees around them during generation; and
you can put them together to make reasonably well-formed syntactic structures that
make sense.
So, things like "duck on grass" and "duck by grass." And using just a sort of method
that uses both the syntax and the computer vision input, we outperformed other recent
methods on that sort of task, given a computer vision input.
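To give a flavor of that subtree idea, here is a toy sketch; the counts and the lookup structure are invented for illustration.

```python
# Sketch of composing descriptions from detected object nouns: look up
# frequently observed subtrees (here reduced to noun-preposition-noun
# triples) harvested from parsed corpora, and glue the likeliest together.

SUBTREE_COUNTS = {
    ("duck", "on", "grass"): 42,  # made-up corpus counts
    ("duck", "by", "grass"): 17,
}

def link_objects(head, other):
    """Choose the most frequent preposition linking two detected nouns."""
    options = [(count, prep) for (n1, prep, n2), count in SUBTREE_COUNTS.items()
               if n1 == head and n2 == other]
    if not options:
        return None
    _, prep = max(options)
    return f"{head} {prep} {other}"

# link_objects("duck", "grass") -> "duck on grass"
```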
There are still more projects to be done here. I haven’t touched analogies; I haven’t
touched part-whole relations. These are all things that you all should solve and then
let me use your code. So, future directions: here I think a lot of work can be done
sort of learning a better semantic space to map between objects, attributes, and the
subtrees there. Everything has been kind of coarse-grained, and I think that we can
do a lot better. The syntax that I’ve used so far has largely been based on Penn
Treebank syntax, and I think that’s a bit impoverished, especially for noun phrases. I
think we can do a lot better.
I think it’s useful to start making some better decisions about when to refer to an
object by description, when to just list its attributes, when to make analogies; sort of
choosing between these different ways of talking about objects.
There needs to be more work on capturing relative attributes. And we have very
little knowledge base building that we’re using currently throughout these models,
and I think that there can be a lot more to guide what’s useful to say and what’s not.
Okay. Thanks.
[applause]
>>: It seems like a difficult thing to control for is the sort of meta purpose to which
the descriptions are ultimately going to be put, and that’s going to affect the human study,
where the people don’t maybe know why you’re asking this. And then also it can
affect just sort of computer purposes, where there’s not a sort of well-defined way to
define the problem.
I see you covering a lot of ground here in terms of different ultimate applications,
but I guess could you comment on zeroing in a little bit on one particular application
where this type of thing shows up, where the description is targeted for it? For
example, on the school bus, you know, some people could just say, you know, "a bus,"
and that’s sufficient. But then your description had "bus with a blue sky" --
>> Margaret Mitchell: Yeah, yeah, yeah.
>>: Why is that motivated?
>> Margaret Mitchell: Yeah, so I think that -- I mean, that’s a really good point. And
I think that part of what I’ve been trying to do, by looking at a bunch of different
corpora and a bunch of different experiments, is just try and zero in on what it is
about visual language in general that people are picking out. So I can have some
generalizations that will hold across the different domains, right?
So I’m still going to need to do, like, prenominal modifier ordering no matter what
domain I’m in. It’s still important to talk about color and size basically no matter
what domain I’m in. In terms of the larger structures that you generate, I think that
goes to what corpus you should be training on, right?
So if you’re doing some sort of conversational agent and you have them talking
about visual objects in the world like for direction giving, then you want to be
training on a corpus that has that kind of language in it. The same problems sort of
fall out but the actual syntax changes.
>>: So does it make sense to talk about the reverse of the problem? Like given the
natural language how do you [inaudible]?
>> Margaret Mitchell: Yeah. There’s so much I can tell you about this. So there has
been a lot of work on this, and I think that this is something that’s really nice for going
towards less supervised computer vision, right? So if you just parse out a sentence
and pull out the objects, then you can actually start training detectors on the scene as
a whole in order to learn the sort of objects there; you don’t have to have people
manually figuring out the sort of object categories.
>>: To what extent do you need to do NLP in order to enable --
>> Margaret Mitchell: NLP in order to enable --
>>: Enable problematic process or --
>> Margaret Mitchell: Right. So that goes into parsing and NP chunking and I mean
it depends on how sophisticated you want to get I guess, but my preference is
generally to start at the parse structure.
>>: It’s fun to see what’s happened in the intervening years. So I was intrigued by
your statement that the [inaudible] is what goes prenominally and what goes
postnominally, because unlike the order of prenominal modifiers, there are actually some
pretty hard constraints in English grammar about what can go before and after.
So is it more like which kind of expression are you going to use: a prenominal-type
expression or a postnominal-type expression, or something else?
>> Margaret Mitchell: Yeah, that, as well as just the basic learning of a mapping
between the sort of given attribute values and the prenominal and postnominal
forms.
So I’m given things like circular and I want to know do I say that this is a circular cup
or do I want to say the cup that’s in a circle. And perhaps it’s trivial to find out but
there are no models that do it yet. So it’s still open.
You. Sorry.
>>: So it seems like in a lot of cases the modifiers chosen might be chosen to
specifically distinguish an object from other similar objects nearby. Is that --
>> Margaret Mitchell: Yeah. That’s the state of the art for referring expression
generation: selecting modifiers based solely on whether or not they rule out other
items in the same space.
No, it’s not state of the art. It was state of the art until recently up until this year.
Yeah?
>>: I kind of have a similar question. I guess there are two parts to it. For the
experiments you did where you had people moving the craft objects and things,
did you look at all at how the descriptors that they chose relate to the entropy
of the set in general, and whether they’re choosing the descriptors that maximize it?
Because depending on the set -- given a set of things where they’re all the same
color --
>> Margaret Mitchell: Totally. For the craft stuff that would be a really cool thing to
do, actually: trying to pull out the visual attributes and seeing if there’s something
there, some sort of information gain in their selection. I think that would be awesome.
For the office supply things I was actually reporting on the fillers but I was running a
larger study where I was trying to tease out if typicality of shape and material
affected how people named things, and I found that things that are an atypical shape
tend to be called by their atypical shape, but things that are an atypical material
tend not to be.
So it seems to be some sort of interesting interplay between some stored
knowledge, sort of prototype thing, and the given visual scene.
>>: So the second part of that is: another type of context is over time, right?
If people refer to something, the next time they’re probably going to refer to it with
less information. Did you do any work with that?
>> Margaret Mitchell: Yeah. So that gets into dialogue stuff, and you get things like
lexical entrainment, so you start to actually agree on some sort of object name, and
then when someone else comes in you actually start without an object name again,
and then you realize that you get -- there have been some really nice studies on how
this works, and I have not implemented them.
So this work has been mainly a first pass at when you’re initially describing an
object, and then, yeah, the clear next extension is figuring out how those are
shortened and changed in dialogue.
>>: Thank you.
>>: I was curious about cross-cultural rather than cross-linguistic. If you’re looking
at an object, if you’re from another culture you might use different language to
describe it visually, where a white bus might be a "white bus" to us in our culture but
not in another culture.
>> Margaret Mitchell: Yeah. So when I was looking at the typicality study, this in
particular was something that was an issue. I didn’t think it would be an issue, so I
conducted the study in the U.S. and in the U.K., using sort of semantic feature
production norms (I don’t know if you are familiar with them, but it’s the sort of listing
of normal attributes) from people in Canada.
And I found that although the U.S. people used the sort of language I expected, the
U.K. people would randomly say things that made no sense to me at all.
[laughter]
Well, yeah, so I mean there are cultural differences. They kept saying "the pencil with
the rubber on it" and things like that -- like, of course it has a rubber on it; I don’t know,
it’s an eraser. But yeah, so there is an issue of tuning for specifically what kind of
group you are working with. Definitely. And not just culturally; just smaller groups,
groups of friends, will tend to use sort of different language than other groups of
friends.
>>: [inaudible]
>> Margaret Mitchell: Right, yeah. That’s true. Yeah?
>>: You kind of mentioned the order of modifiers. There is some work in kind of a
different domain, in theoretical syntax, that works on the order of adjectives and the
order of adverbs, and [inaudible] does all that stuff. So I’m interested if you’re
looking to incorporate that kind of stuff, so the order of projections, kind of
nanosyntax [inaudible], to explain that distribution. I’m wondering if you’re --
>> Margaret Mitchell: Yeah. So at the CLMA program I was actually starting to do
this in terms of the underlying semantics, so you learn, like, color tends to be closer
to the noun, followed by size, and it seems like something I could learn
automatically. Frustratingly, what worked the best was just an n-gram language
model.
So it’s like do I keep playing with this stuff or do I do the thing that works the best?
You know. So yeah, I kind of left it there like, okay, a language model works really
well, that’s frustrating. I’ll move on to another problem. But I think that it would be
really cool to be able to incorporate that kind of stuff.
Chris?
>>: [inaudible].
>> Margaret Mitchell: But presumably if you could get a corpus you could train on
that corpus and that language and learn it automatically.
>>: Is the preference for selecting a descriptor based on [inaudible]?
>> Margaret Mitchell: Yeah, that’s interesting. Yeah?
>>: So you mentioned early on the concept of natural speech, and we’ve been talking
a lot about variation. Do you have any sort of measure for something being really
unnatural? It seems like there’s a huge range of things that could potentially be
considered natural. Do you want the range to be narrower over specific
domains?
>> Margaret Mitchell: That’s all the more reason to learn the distribution. So you
want to have sort of the trailing [inaudible] of less and less natural constructions. So
you’re not just trying to learn a natural thing; you’re trying to learn the set of natural
things that people might say, with a very low probability space for things that
people don’t say.
Yeah?
>>: [inaudible] and I’m just kind of curious. I can imagine cases where someone is
referring to a bird, and there is a very scientific name for that bird, but they would
probably just say "bird" most of the time. Did you find that hierarchy to be
particularly useful, or did you apply it to compensate for that?
>> Margaret Mitchell: Context where there was some shared or lack of shared
knowledge.
>>: It seems like you were mostly working with pretty basic objects.
>> Margaret Mitchell: Right, basic level categories.
>>: That they don’t have, like, complex names. I was just kind of curious if you’ve
ever seen work in areas where common people might refer to one
differently than [inaudible].
>> Margaret Mitchell: Yeah. So in a lot of the sort of traditional referring expression
generation work, they have a function that they call "user knows," which sort of starts
at a base-level class, and if the user is not familiar with that base-level class they
move up, and if the user is familiar with that base-level class you have the option of
moving down.
This is still within the realm of sort of hand-written stuff, so I haven’t looked at doing
something more statistical with it. But there is definitely precedent for looking at
that. I haven’t.
[applause]