>> Lucy Vanderwende: Hi, everyone. This morning it's my pleasure to introduce Dipanjan Das, who's visiting us from CMU. His talk will be on robust shallow parsing, which is a task that many of us have spent a long time thinking about and which we all think is very important for upcoming possible applications of our natural language processing techniques. He's about to graduate. His advisor is Noah Smith. He received the best paper award at ACL 2011 for a paper on unsupervised part-of-speech tagging, but since the majority of us had already seen that talk, we asked Dipanjan to speak about his other work, but certainly the part-of-speech tagging is very noteworthy work that you've done. So thank you very much. >> Dipanjan Das: Thank you. Okay. Firstly, thanks for inviting me to give the talk here. I'll talk about this work, which is a major part of my Ph.D. dissertation. It deals with shallow semantic parsing of text using the theory of frame semantics. So this talk is about natural language understanding. Given a sentence like "I want to go to Seattle on Sunday," the goal is to analyze it by putting layers of information on it, such as part-of-speech tags. If we want to go deeper, we can look at dependency parses. However, here we are interested in even deeper structures. So given predicates which are salient in the sentence -- for example, the word go -- we want to find the meaning of the word go and also how different words and phrases in the sentence relate to the predicate and its meaning. So over here Travel is the semantic frame that the word go evokes, and it encodes an event or a scenario which this word means over here, and there are other words and phrases in the sentence that relate to the semantic frame that go evokes. So over here, these roles are participants that relate to this particular frame which the predicate go evokes. So I is the traveler over here, to Seattle is the goal where the person wants to go, and the time is an auxiliary semantic role. So there are multiple predicate-argument structures possible in a sentence. So over here the word want also evokes a semantic frame called Desiring, and there is the subject I that fulfills the semantic role experiencer. So this talk will be about the automatic prediction of these kinds of predicate-argument structures. Mostly it has been described in these two papers from NAACL 2010 and ACL 2011, but I'll be talking about some unpublished work as well. Okay. So one thing is that please stop me if you have clarification questions. So, broadly, my talk is divided into three sections. The first is why we are doing semantic analysis, and I have subdivided it into the motivation, why we chose this kind of formalism for semantic analysis, and, finally, applications. The second part of the talk is the core, where I talk about statistical models for structured prediction, and this task can be divided naturally into frame identification, finding the right frame for a predicate, and argument identification, which is basically the task of semantic role labeling, finding the different arguments once the frame is disambiguated. And, importantly, the frame identification section will focus on the use of latent variables to improve coverage, and in argument identification we will note the use of dual decomposition, but a new twist on dual decomposition which we call dual decomposition with many overlapping components.
And in the final section I'll talk about some work on semi-supervised learning to improve the coverage of this parser, and I'll primarily focus on some novel graph-based learning algorithms. So this is our claim. So let me first start with why we're doing semantic analysis and motivating it. So given this sentence, "Bengal's massive stock of food was reduced to nothing," let's put a couple of layers of syntax on the sentence, part-of-speech tags and dependency parses. Now, a lot of work in natural language processing has focused on this kind of syntactic analysis. Some of my own work is on part-of-speech tagging and dependency parsing. However, we are going to look at deeper structures. Basically, syntactic structures such as dependency parses are not enough to answer certain questions. So over here let's look at the word stock, which is a noun. It is unclear from the dependency parse whether it's a store or a financial entity, so it's ambiguous. Moreover, if you want to ask questions such as store of what, of what size, whose store, these are also not answerable just from the syntactic analysis. Similarly, if you look at the verb reduced, we can ask similar questions like what was reduced, to what. That is also not apparent from the syntactic analysis. So the type of structures that we are going to predict, frame-semantic parses, can easily answer these questions. So the frame Store, which is evoked by the word stock, says that it is indeed a store, and then there are these different semantic roles which are fulfilled by words and phrases in the sentence. Similarly, there is this frame and its roles for the verb reduced. Now, I will take some time to trace back the history of this kind of work in a couple of slides. Basically it started with the formalism of case grammar from the late 1960s by Charles Fillmore. So given a sentence such as -- this is a classic example -- I gave some money to him, Fillmore talked about cases, which are basically words and phrases that are required by a predicate. So over here I, the subject, is an agent, to him is a beneficiary, and so forth. So the case grammar theory talked about three salient things: it talked about the semantic [inaudible] of a predicate, it talked about the correlation of predicate-argument structures like these with syntax, and it also talked about cases or roles such as obligatory cases and optional cases. So we are mostly familiar with this kind of theory. Now, around the same time in AI, Marvin Minsky talked about frames, basically for representing knowledge. And Fillmore extended case grammar with the help of this theory about frames to frame semantics in the late '70s and early '80s. Now, frame semantics basically relates the meaning of a word with world knowledge, which is new in comparison to case grammar. It also presents the abstraction of predicates into frames. So, for example, the word gave in the previous slide evokes a Giving frame, which has several participating roles, but this frame is also evoked by other words like bequeath, contribute, and donate. So basically all these predicates are associated with this Giving frame, and they can evoke this frame in their instantiations. Now, frame semantics and other related predicate-argument semantic theories gave rise to annotated data sets like FrameNet, PropBank, VerbNet and currently OntoNotes, which is a popular corpus that people are using. And now we're doing data-driven shallow semantic parsing.
Now, very roughly, in the world of AI, frames gave rise around the same time to the theory of scripts, developed by Schank and Abelson, and template-filling information extraction, which is extremely popular, came about with the help of annotated data sets like MUC, ACE, and so forth. So, broadly, we can partition this into CL and AI -- computational linguistics and AI work -- but today machine learning is bridging these two areas, and we're not really doing very different things in the two of them. Structurally, shallow semantic parsing is very similar to information extraction. Okay. So enough about motivation and the history. So why did we choose this linguistic formalism? So I'm representing semantic analysis on this spectrum from shallow semantic parsing to deep analysis. This is a very approximate representation, so don't take it too seriously. At the shallow end we have PropBank-style semantic role labeling, which is an extremely popular shallow semantic analysis task. So given a sentence with some syntactic analysis, this kind of semantic role labeling takes verbs. So over here the word reduced is a predicate of interest, and there are symbolic semantic roles like A1 and A4 which are arguments of this verbal predicate. Now, these core semantic roles are only six in number according to PropBank, and these labels have verb-specific meanings. However, there has been some work which has noted that these six semantic roles, since they take different meanings for different verbs, conflate the meanings of different roles due to oversimplification. So from a learning point of view, you are learning classes which have different meanings for different verbs, which is not really desirable. On the other hand, at the deep end, there is semantic parsing into logical forms, which takes sentences and produces logical structures like, say, lambda calculus expressions. Now, these are really good because they give you the entire semantic structure, but these parsers are trained on really restricted domains so far and have poor lexical coverage. So basically we cannot take logical form parsers trained on these restricted domains and just run them on free text. So our work lies in between these two types of popular parsing formalisms. Frame-semantic parsing has certain negative sides -- it doesn't model quantification and negation, unlike logical forms -- but there are certain advantages. It is deeper than PropBank-style semantic role labeling because we look at explicit frame and role labels, which are more than a thousand in number, and we also model all types of part-of-speech categories: verbs, nouns, adjectives, adverbs, prepositions, and so forth. We have larger lexical coverage than logical form parsers because our models are trained on a wide variety of text, much larger than the restricted domains of logical form parsers. And, finally, the lexicon and the supervised data that we use are actively increasing in size. So every year we can train on new data and get better and better performance. So it's an ongoing, moving thing, and in tandem we can develop our statistical models. Okay. So I come to applications next. Basically lots of applications are possible for this parser. The first is question answering.
So let's say, for the example that I showed you before, we take a question that tries to extract some information from a large data set. Basically, if we parse both the question and the answer with a frame-semantic parser, we will get isomorphic structures that can be used, say, in features or constraints or in whatever way to answer questions. Moreover, if we use other lexical items like resource instead of stock, since frame semantics abstracts predicates through frames, we can actually get the same semantic structure, and we can leverage the use of frames in question answering. So this type of work has been done previously by Bilotti, et al., in information retrieval, where they used PropBank-style semantic role labeling systems to build better question answering systems. And right now the DeepQA engine of Watson, which was partly developed at CMU, is using my parser for question answering. Another application is information extraction in general. So there's lots of text, and if you parse it with a shallow semantic parser, basically we can get labels like this. The bold items are the words that evoke frames, and then there are semantic roles underlined, and all of these can be used to fill up a database of stores, and the different roles can form the database columns. Yes? >>: I'm intrigued by your [inaudible] idea of the system. Has there been [inaudible]? >> Dipanjan Das: No. >>: They haven't done it or -- >> Dipanjan Das: So there is this Bilotti, et al., paper that basically gave the idea of including it in the pipeline of DeepQA. So they have a quantification of using a PropBank-style semantic role labeler, as to how it can improve question answering, but there hasn't been any quantification in the Watson system. Okay. Right. So the last thing I'll comment on about applications is multilingual applications. So let's take the translation of this sentence in Bengali, which is my native language. It is a syntactically divergent language from English. It is free word ordered. But the roles and the frames actually work -- most of them work in Bengali as well. So we can use word alignments to associate different parts of the English sentence with the Bengali sentence, and we can do things like translation, cross-lingual information retrieval, or filling up world knowledge bases. So these are just some hand-waving suggestions at doing multilingual things with frame-semantic parses. Okay. So I will next come to the core of the talk, statistical models for structured prediction. Most of this work has been described in this paper from NAACL two years back, but some of this is under review right now. So before I go on to the models I'll briefly talk about the structure of the lexicon and the data that we use to train our model. Now, the lexicon is basically this ever-growing thing called FrameNet, which is a popular lexical resource. So this is a frame where Placing is the name of the frame. It, again, encodes some event or a scenario. There are semantic roles like agent, cause, goal, theme, and so forth, and the black ones are core roles and the white ones are non-core roles. These non-core roles are shared across frames. So these are like the ArgM roles from PropBank. There are some interesting linguistic relationships and constraints that are provided by the expert annotators. So over here there is this excludes relationship. I'll talk about it later on. These are binary relationships between semantic roles.
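As an aside, here is a rough sketch of how one such frame entry could be represented in code; the field names, and the particular non-core roles listed, are my own illustrative assumptions rather than FrameNet's actual schema, while the lexical units, the excludes constraint, and the frame relation shown are the examples mentioned in the talk.

```python
# Hypothetical representation of one frame entry from the lexicon. The keys are
# illustrative; FrameNet's own schema is richer than this.
PLACING = {
    "name": "Placing",
    "core_roles": ["Agent", "Cause", "Goal", "Theme"],
    "non_core_roles": ["Time", "Place", "Manner"],        # assumed examples; shared across frames, like PropBank's ArgM roles
    "role_constraints": [("excludes", "Agent", "Cause")], # binary relations given by the expert annotators
    "lexical_units": ["arrange.v", "bag.v", "bestow.v", "bin.v"],  # predicates that can evoke the frame
    "frame_relations": [("inherited_by", "Dispersal")],   # frames form a graph via such relations
}
```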
And there are potential predicates listed with each frame that can evoke the frame when instantiated in a sentence. So arrange, bag, bestow, bin -- these are things that can evoke the Placing frame. There are several frames in the lexicon -- actually, there are around 900 currently. So I'm showing some -- like six examples over here. There are some interesting relationships between these frames. They form sort of a graph structure, and the red arrows over here are the inheritance relationship, which says that Dispersal, for example, inherits its meaning from the Placing frame. And there is this used-by relationship also, and there are multiple other relationships that the linguists use to create this frame and role hierarchy. Now, there was this benchmark data set that we used to train and test our model. Now, it is a tiny data set in comparison to the data sets that are used to train other semantic role labeling systems. So basically there were around 665 frames. The training set contained only around 2,000 annotated sentences with 11,000 predicate tokens. We use this for comparison with past state of the art. Very tiny data set. Now, in 2010 there was a better lexicon that was released. It nearly doubled the amount of annotated data, and there were more frames and role labels, more predicate types also in the annotations. So increasingly, more information is being added, and it's easy to train our systems as it becomes available. So we will also show numbers on this data set. Right. >>: So are the frames being created in conjunction with the training set sentences? >> Dipanjan Das: Yes. Right. >>: [inaudible] >> Dipanjan Das: Exactly. Yes. So previously, what happened was before this full -- we call this full-text annotation -- that was released in SemEval in 2007, they used to come up with frames without looking at a corpus. So right now, for the development of this corpus, whenever a word or predicate cannot be assigned an existing frame, a new frame is created. That works for roles also. >>: So that list of verbs that can be associated with each frame comes from the [inaudible]? >> Dipanjan Das: In our model it comes both from the lexicon, which was present before the corpus was annotated, as well as the corpus -- the union of those two. And these are not only verbs. They're nouns, adjectives, adverbs, prepositions. >>: So what's the chance that you're going to need a new frame -- >> Dipanjan Das: There is a big chance. So we work on the assumption that these 900 frames that we have currently have broad coverage. That's the assumption. But it can be an interesting research problem to come up with new frames automatically. So we will see some work where we increase the lexicon size by getting more predicates into the lexicon. That will be the last part of the talk. But not frames. So we assume that the frame set is fixed. >>: [inaudible] >> Dipanjan Das: This is what we work with currently. So this was released in 2010. I think they haven't made a formal -- >>: [inaudible] >> Dipanjan Das: Yes. Yes. So the lexicon is the frames, roles, and predicates, and sometimes I interchangeably use lexicon with the annotated data also. >>: [inaudible] >> Dipanjan Das: Yes. 9,263. >>: [inaudible] >> Dipanjan Das: Not necessarily. >>: [inaudible] >> Dipanjan Das: That's a great question, yeah. It is not [inaudible].
So I have semi-supervised algorithms that can handle those predicates which were not seen either in the lexicon or the training data. >>: So suppose in the test set you have some [inaudible]. >> Dipanjan Das: You'll get partial [inaudible]. But that's another great question. We do not exploit the relationships between frames during learning. That can be an extension where you can use the hierarchy information to train your model and get better -- like an example would be, if you use a max-margin sort of training, your loss function can use the partial -- like the related frames, for example. Okay. So this is the set of statistical models that we used to train our system. So I'll first talk about frame identification, which also addresses Lucy's point about these new predicates that we don't see in the training data or the lexicon. So let's say we have this predicate identified as the one evoking a frame, and it is ambiguous -- we don't know which frame it evokes -- so the goal is to find the best frame among all the frames in the lexicon. So we can use this simple strategy of selecting the best frame according to a score. That score is a function of a frame, a predicate, and the sentence -- basically the observation. Now, we like probabilistic models. So let's say that we use this conditional probability of a frame given a predicate and a sentence. This can be modeled using a logistic regression model. Now, there are certain problems with using a simple logistic regression model here. Look at the feature function. It takes the frame, the predicate, and the sentence. Now, what happens when a predicate is unseen -- you do not see it in the lexicon or the training data? The feature function will find it difficult to get informative features for predicates that you did not see before. So it is unable to model unknown predicates at test time. This is the first problem. The second problem is that if you look at the total feature set, its order is the number of feature templates, capital T, multiplied by the total number of frames, multiplied by the total number of predicate types in the lexicon. So this is the number of features in the model. Now, for our data set, this turns out to be 50 million features. And we can handle 50 million features now, but we thought, can we do it with fewer features and can we do better? So instead we have a logistic regression with a latent variable, where the feature function doesn't look at the predicate surface form at all. So the feature function looks at the frame, something called a proto-predicate, the sentence, and the lexical-semantic relationships between the predicate and the proto-predicate. So basically we're assuming that a proto-predicate is evoking the frame, but through some lexical-semantic relationships with the surface predicate. Okay. So there is a frame, there's a proto-predicate, and then there are lexical-semantic relationships with the actual predicate. And since we don't know which proto-predicate it is, we marginalize it out in this model. Is this kind of clear? With an example that I'm going to show you now, it will become clear. >>: What's the relation between this and [inaudible]? >> Dipanjan Das: Yeah. So that was nearly unsupervised -- mostly fully unsupervised. So you take prototypes and then expand your knowledge. It is kind of related, but -- >>: So the proto-predicate would be the one that you see in the [inaudible]? >> Dipanjan Das: Yes. Yes. So here is an example.
So let's say for the Store frame you saw cargo, inventory, reserve, stockpile, store, and supply, but if your actual predicate was stock, which we saw before, we just use the lexical-semantic relationships between these and stock, and the features only look at those relationships, which are only three or four in number. >>: [inaudible] >> Dipanjan Das: Right. >>: But at test time you don't know the [inaudible]. >> Dipanjan Das: Yes. >>: [inaudible] >> Dipanjan Das: It is not. But you will get feature weights -- let's say that in training store was the predicate and your proto-predicate that you're currently working with is inventory. The lexical-semantic relationship will be something like a synonym, and your feature weight for the synonym feature will be given more weight, for example. So you will learn feature weights for features that look at a proto-predicate and a lexical-semantic relationship. That's the idea. >>: So the inventory proto-predicates are actual other English tokens or other English types that evoke that same -- >> Dipanjan Das: Exactly. >>: They're not truly just some floating variable or -- >> Dipanjan Das: Yes. So it is a list of predicates that appeared in your supervised data. Now -- okay. Note that the features never look at the predicate surface form itself. Now, let's take a concrete example. As probably already became clear, if the predicate was stocks and the proto-predicate was stockpile, then the lexical-semantic relationship is synonym, and let's say feature No. 10245 fires only when the frame is Store, the proto-predicate is stockpile, and synonym belongs to this LexSem set. And this for us comes from WordNet, but for the lexical-semantic relationships you could use any semi-supervised, distributional-similarity type of lexicon. >>: So does this introduce some sort of bias towards frames that have many predicates that [inaudible]? You're summing over all the proto-predicates, so [inaudible] feels like if you have a frame with lots of predicates, it's going to get more terms? You're training discriminatively, so maybe not. Right? Maybe each of their weights is going to be reduced a little bit, but I'm just curious. >> Dipanjan Das: Yeah. That's a good analysis which I haven't done. >>: Okay. >> Dipanjan Das: Actually, this does well in comparison to -- no, no, it doesn't do really well. It does well [laughter]. It does well in comparison to a model that doesn't use latent variables. It does much worse than, like, a semi-supervised model. And we'll see how we can do better. Any more questions? Yes? >>: So if the features are still looking at all of the predicates that are in the training set, are the features still going to be [inaudible]? Because the number of predicates is kind of equal to the number of proto-predicates. >> Dipanjan Das: No. But let's say this stockpile predicate -- there will be a feature with a high weight: if the feature was stockpile with synonym, it will get a high weight during training, ideally, and that feature with high weight will fire during test time when it sees this new predicate. So from that perspective, you should get the right frame. >>: What kind of feature reduction do you get? You go from 50 million to -- >> Dipanjan Das: I'll come to that. >>: Okay. Sorry. >> Dipanjan Das: The number of features now is the number of feature templates multiplied by the number of frames multiplied by the max number of proto-predicates per frame, which comes to only 500,000 features, one percent of the features we had before. Okay.
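To summarize the model just described in code form, here is a schematic sketch of the latent-variable scoring, with the proto-predicate summed out; the helper functions for feature extraction and for the WordNet relations are assumed placeholders, not the speaker's actual implementation.

```python
import math

# Schematic latent-variable frame scoring: the proto-predicate is marginalized
# out, and the features never look at the surface predicate itself, only at its
# lexical-semantic relations (e.g. from WordNet) to each proto-predicate.
def frame_score(frame, predicate, sentence, proto_predicates, weights,
                lexsem_relations, features):
    """Unnormalized marginal score of `frame` for this predicate and sentence."""
    total = 0.0
    for proto in proto_predicates[frame]:          # predicates seen with this frame in the data/lexicon
        rels = lexsem_relations(predicate, proto)  # e.g. {"synonym"}
        feats = features(frame, proto, sentence, rels)
        total += math.exp(sum(weights.get(f, 0.0) for f in feats))
    return total

def identify_frame(predicate, sentence, frames, **model):
    # The normalizer of the conditional model is constant across frames,
    # so picking the largest marginal score suffices at decode time.
    return max(frames, key=lambda f: frame_score(f, predicate, sentence, **model))
```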
Now, I worked on this problem of paraphrase identification, which is of interest to researchers here as well, where we used all these similar resources like dependency parses, lexical semantics and latent structure to get good paraphrase identification. We trained this model using L-BFGS, although it is non-convex, and for fast inference we have a tweak: only if the predicate is unseen do we score all the frames. Otherwise, we score only the frames that the predicate evoked in the data. And inference becomes much faster when we do this. So I would give you a caution beforehand: the numbers are not as good as on other NLP problems because we have really small amounts of data, but there are some frames and roles on which our model is very confident, and those structures are good. So on this benchmark data set, in comparison to UT Dallas and LTH, which were the best systems at SemEval, we do better, significantly better. Now, this evaluation is on automatically identified predicates, and all of these three systems use similar heuristics to mark the predicates which can evoke frames. Now, in a more controlled experiment where we have the predicates already given, the frame identification accuracy is 74 percent. And when the data set doubled, this goes to 91 percent, which is good, because we only get one out of ten frames incorrect. And without the latent variable, this number is much worse. So basically we're getting better performance as well as reducing the number of features. Yes? >>: Do you have some error analysis on when you get the frame wrong? Because it could be -- the wrong frame could be completely wrong, something that absolutely makes no sense, or it could be something that actually is not -- >> Dipanjan Das: Yeah. >>: -- exactly the correct one but it's meaningful. >> Dipanjan Das: So that's a great point. So this number actually uses this partial matching thing. If you only do exact matching, it is 83 percent. So that is also still pretty good, because we have, like, a thousand frames to choose from. Now, often there are some really closely related frames, and it's hard to actually find the right one. So there is a question of interannotator agreement there also, like what is the upper limit, and I think it is just like 92 or 93 percent. So we're doing pretty well in terms of frame identification, especially for known predicates. This number is abysmally low for unknown predicates, and we'll come to that in the final section of the talk. >>: [inaudible] >> Dipanjan Das: No, no. Actually, in this model I don't use unsupported features. That is a great question. So unsupported features are the ones that appear only in the partition function. So if you include those, then it is 50 million. But if you don't use them, then it is a few million, basically. >>: Okay. >> Dipanjan Das: The unsupported features actually help you a lot in this problem, which is what we observed. >>: I'm sorry. I didn't understand what you meant by given predicates. You mean given the correct predicate for the correct frame? >> Dipanjan Das: No, it's given a sentence, you can mark which predicates can evoke frames. If you do that automatically using heuristics, that is a non-trivial problem. So if you -- >>: So you're saying you tell it this is the predicate, but you don't tell it -- it doesn't necessarily match the -- you still have to do your proto-predicate -- >> Dipanjan Das: Yes. You still have to do frame identification. >>: Okay.
>>: But you do know there is a frame out there for that predicate? >> Dipanjan Das: Yes. >>: So would that be a case where you would have different words that both evoke the same frame? So, for example, you have relation [inaudible]? >> Dipanjan Das: So you are asking -- let me just rephrase it. So in a sentence, whether two predicates can evoke the same frame? Is that the question? That happens, but not very often. But since we do not jointly model all the predicates of a sentence together, from a learning or inference point of view we don't care about what other predicates are evoking. But that can also be done -- at a document level especially we can place constraints or soft constraints as to whether different predicates in a document which have the same or similar meaning should evoke the same frame. >>: So the lexical semantics, is it a binary feature? >> Dipanjan Das: Feature. >>: [inaudible]. >> Dipanjan Das: Yeah, it's like whether there is a lexical-semantic relationship of synonym, whether there is a lexical-semantic relationship of [inaudible]. >>: I'm just curious, suppose you're trying to incorporate, say, the lexical-semantic relationship [inaudible]. >> Dipanjan Das: So this model can handle that like a real-valued feature. There should not be any problem. But another hack would be to just make bins of distributional similarities, like 100 bins, for example, or 10 bins, and use those as [inaudible]. Okay. This is the more interesting part. And I think that we have done some good modeling over here in comparison to previous semantic role labeling systems. So let's say we have already identified the frame for stock, and now we have to find the different roles. So there are potential roles that come from the FrameNet lexicon once the frame is identified. So let's say these are these five roles over here, and the task is, from a set of spans which can serve as arguments, to find a mapping between the roles and the spans. So this is just a bipartite matching -- a maximum bipartite matching problem from our point of view. And at the end over here we have this phi symbol, which is the null span, and when a few roles map to this null span, it means those roles are not overt in the sentence, or absent from the analysis. >>: But why do you choose Bengal's as the ideal mapping rather than Bengal? >> Dipanjan Das: It's just a convention that annotators use. It's strange. It is like -- >>: Like for resource, it's [inaudible]. >> Dipanjan Das: Yes. It is a choice that both the PropBank annotators and the [inaudible] annotators have taken. Like labeling the entire prepositional phrase for -- >>: [inaudible] >> Dipanjan Das: Yes. Yes. In dependency parsing also these conventions are there, right? Because conjunction is a case where you don't really know what [inaudible]. Okay. Now, there are certain problems. This is not just a simple bipartite matching problem, because we may make mistakes like this. If we map supply to food over here, this is linguistically infeasible according to the annotators because it violates some overlap constraints. So basically two roles cannot have overlapping arguments. So this is a typical thing in standard semantic role labeling work. For example, Kristina had this paper in 2004 or 2005 where they did dynamic programming to solve this problem. Now, there are other things which are more interesting and have not been explored much previously, like the mutual exclusion constraint that two roles cannot appear together.
An example: basically, for this Placing frame, if an agent places something, there cannot be a cause role in the sentence, because both the agent and the cause are ideally placing the thing. So two examples are here: The waiter placed food on the table, and In Kabul, hauling water put food on the table. So these have very different meanings, but hauling water is the cause while the waiter is the agent, and they cannot appear together in an analysis. There are more interesting constraints like the requires constraint, which means that if one role is present, the other has to be present in the analysis. So over here, the frame is Similarity. So resembles is the predicate evoking the frame Similarity. The mulberry resembles a loganberry. The first one is entity 1, the second one is entity 2, but a sentence like "a mulberry resembles" is meaningless. So you have to have both these roles in the structure. So there is more such linguistic information that we will have to use to constrain the maximum bipartite matching problem, so it's a constrained optimization problem. Other people have also done this sort of thing, but we are doing the semantic role labeling task in one step, which is just one optimization problem. So what we do is we use scores over these edges. The edges look at the role and the span. So remember that this bidirectional arrow with role and span is the tuple that we operate on, and the score -- we can assume it to have a linear form, like it's a linear function. There is a weight vector and there is this feature function g that just looks at the role, the span, the frame, and the sentence, which I've omitted over here. So it's a standard thing that we will do. Now let's introduce this binary variable z. It's a binary variable for each role and span tuple. z equal to 1 means that span fulfills the role; a zero means that span doesn't fulfill the role. And let's assume that there is a binary vector for all the role-span tuples. So it's the entire z vector for all the roles and spans. Now, we're trying to maximize this function with respect to z: basically a sum over all the roles and spans of each role-span variable multiplied by its score. And for all the roles we have this uniqueness constraint, which says that a role can be satisfied by only one span, so this is a constraint that expresses that. And there will be more constraints, like this one, which is a little more complicated. It constrains each sentence position to be covered by at most one role-span variable. So it will prevent overlap. And there are many other structural constraints that impose the requires and excludes relationships. And people will be familiar with this: this is an integer linear program. Okay. Now, there has been a lot of work on this. So this is one of the seminal papers on semantic role labeling that uses ILP to do this inference problem. But ILP solvers are often very slow, and many of them are proprietary. And my parser is public, and we want to make it open source so that people can contribute, so we use this technique called dual decomposition with the alternating direction method of multipliers that solves this ILP problem. It solves a relaxation of the integer linear program using a very nice technique that doesn't require us to use a solver. So this has been developed with colleagues at CMU. This is a paper under review. So basically we introduce this thing called a basic part. A role and span tuple forms a basic part.
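Before going on with the decomposition into basic parts, here is a toy sketch of the integer linear program that was just set up, written with the open-source PuLP library purely for illustration (the decoder described next deliberately avoids off-the-shelf ILP solvers); the role names, spans, and scores are made up.

```python
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum

roles = ["role_a", "role_b", "role_c"]        # hypothetical roles of the identified frame
spans = [(0, 0), (0, 1), (2, 2), None]        # candidate spans; None is the null span (role not overt)
def score(r, s):                              # stand-in for w . g(role, span, frame, sentence)
    return 1.0 if (r, s) == ("role_a", (2, 2)) else 0.1

prob = LpProblem("argument_identification", LpMaximize)
z = {(r, j): LpVariable(f"z_{r}_{j}", cat=LpBinary)
     for r in roles for j in range(len(spans))}

# Objective: total score of the selected role-span assignments.
prob += lpSum(score(r, spans[j]) * z[r, j] for r in roles for j in range(len(spans)))

# Uniqueness: every role is filled by exactly one span (possibly the null span).
for r in roles:
    prob += lpSum(z[r, j] for j in range(len(spans))) == 1

# Non-overlap: each token position is covered by at most one selected span.
for i in range(3):                            # 3 tokens in this toy sentence
    prob += lpSum(z[r, j] for r in roles for j, s in enumerate(spans)
                  if s is not None and s[0] <= i <= s[1]) <= 1

# Excludes / requires constraints over whether a role is overt (non-null).
overt = lambda r: lpSum(z[r, j] for j, s in enumerate(spans) if s is not None)
prob += overt("role_a") + overt("role_b") <= 1   # role_a excludes role_b
prob += overt("role_a") <= overt("role_c")       # role_a requires role_c

prob.solve()                                  # default open-source CBC solver
```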
So the entire space is all the role and span tuples that we operate on. So what we do is we break down the whole bipartite matching problem into many small components, like find the best span for a role, or for a sentence position find the best role-span tuple, and so on. These are really simple problems that can be solved really fast. And at a global level we impose agreement between these components so that for the entire structure there's a consensus between these small problems. So I think people are now familiar with dual decomposition. It's a really trending thing in NLP. What people usually do is that they have two big problems -- two big problems that can each be solved using, say, dynamic programming -- and then there is a consensus step at the end. But this is very different from that, because we didn't break the problem down into two big things. That is not really possible for this task. We have many, many small things, and then we try to impose agreement. So for each component, we now define a copy of that binary vector. So this is z superscript component. So a component is basically one of those small things, and each constraint in our ILP maps to one component. So there is one component for each constraint in the ILP. So in graphical models language, each component is like a factor. So we have this z vector for each component. So basically we define this function called [inaudible] that scores this entire binary vector, and it uses that role-span score that we saw before. So given an entire z component vector, we have this function that gives us a score. Now, the ILP that we saw before can be expressed in this way. So we sum up over all the components with this score, and we have this constraint where all the z's for all the components, roles, and spans have to agree, and that agreement is done by using this consensus vector u. So basically all the z's for the different components are equal to this u, so we have a consensus. Okay. Please stop me if there are questions. Now -- so this is the primal problem, where the z's are integer. Now, we try to solve an easier problem, which is primal prime, where the integer constraints are relaxed. So it's a linear program. Now, we convert that to an augmented Lagrangian function. So this is a new thing in comparison to the [inaudible] work, where we augment the Lagrangian function with this term. This term is a quadratic penalty that should ideally be zero, because the z's are equal to the u. So it doesn't make a difference to the primal LP, but it actually brings consensus faster. So that's the reason why people use it. It's a standard trick that people do in the optimization community. Now, for this augmented Lagrangian function, which looks like this, the saddle point can be found using something called alternating minimization, which is, again, another standard trick in optimization. And what is nice is that the saddle point which we seek can be found using several decoupled worker problems. So it's basically an iterative optimization technique where there are three steps: there are Lagrange multiplier updates, there are consensus variable updates, and there are z updates. Now, the z updates can be solved by decoupled workers. They can be solved in parallel. Now, what do these decoupled workers look like? So remember that one component is basically one constraint. So let's say one z update that we need to do is, for each role, we have a worker that imposes the uniqueness constraint.
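That per-role worker turns out to be a projection onto the probability simplex, as described next. For concreteness, here is the standard sort-based version of that projection -- my own small implementation under that assumption, not the paper's code.

```python
import numpy as np

# Euclidean projection of a vector onto the probability simplex
# {x : x >= 0, sum(x) = 1}, via one sort -- the kind of fast subproblem a
# uniqueness-constraint worker has to solve.
def project_onto_simplex(v):
    u = np.sort(v)[::-1]                             # decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]   # last coordinate kept active
    theta = (css[rho] - 1.0) / (rho + 1.0)           # threshold to subtract
    return np.maximum(v - theta, 0.0)

print(project_onto_simplex(np.array([0.9, 0.6, -0.2])))  # -> [0.65, 0.35, 0.]
```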
So this is, again, a small, tiny optimization problem that solves the z update, and it looks like this. And let me just state that this is just a projection onto a probability simplex, and it can be solved using a sort operation, which is really fast. And the challenge, for a new problem where you have such constraints and you want to use our technique of dual decomposition, is to define fast and simple workers that make the optimization fast. Okay. So the advantages of this approach are that there is significant speedup over ILP solvers, and we don't need a proprietary solver, and the speedups are really marvelous. We get nearly 10 percent speedup in comparison to CPLEX, which is a very strong state-of-the-art solver. And that is parallelized and this is not. Okay. We also get a certificate of optimality, like exact solutions, for more than 99 percent of examples. So back to the -- yes? >>: You were saying that the augmented Lagrangian where you have a quadratic penalty and a linear penalty [inaudible]? Do you have numbers there? >> Dipanjan Das: Not for this problem. For dependency parsing we have numbers. So this was this paper by Martins [inaudible] at EMNLP. But we haven't done any analysis in comparison. Yeah, it will be slower. Firstly, we don't want to use that because the subgradient approach of Collins uses these two big things, or two big problems. We don't -- you cannot break up this problem into -- >>: [inaudible]. >> Dipanjan Das: -- a couple of big things. There are some possibilities, but we haven't found any we like. The dynamic programming trick to reduce overlap -- remove overlap -- can be one big component, but I have no way to impose those other small constraints in a big [inaudible]. Okay. So learning. We use straightforward learning. We use local learning, actually, to learn these weights by just taking role and span pairs. We have also done some experiments where we do global inference using dual decomposition and use a max-margin trick; that doesn't work very well for some reason. So we're stuck with maximum conditional log-likelihood training. So here are some numbers. Just for argument identification we get a precision boost of 2 percent using joint inference, and these -- actually these numbers do not -- so we are doing better than local inference, but these numbers do not reflect things like whether the output is respecting the linguistic constraints that we placed. So we measured those. The local model makes 501 violations, which include overlaps and requires and excludes relationships, while we make no violations. >>: [inaudible] >> Dipanjan Das: Yes. >>: [inaudible] >> Dipanjan Das: Yes. Yes. So we have some numbers about [inaudible] search. The accuracies are similar, but it makes like 550 or 560 violations. >>: Do you know why -- so given that there's a substantial reduction of the violations, why there is [inaudible]. >> Dipanjan Das: Great question. We also wondered about that. So basically let's say there is this requires relationship, and the local model predicted entity 1 and didn't predict entity 2. Now, the script that measures performance will give the local model some score for entity 1. On the other hand, this model is very conservative and did not produce either entity 1 or entity 2. So it got lower scores for that example. >>: So have you measured -- so this is including [inaudible]? >> Dipanjan Das: No, this is just argument identification. >>: [inaudible] >> Dipanjan Das: The frames are given. >>: Okay.
But, I mean, it's sort of [inaudible] >> Dipanjan Das: That doesn't happen, no. We can change the script to do that, but we don't do that. >>: My question is, in this goal -- you're saying the local [inaudible]. >> Dipanjan Das: Yes. >>: [inaudible] >> Dipanjan Das: It is not really a partial score. So when the local model chose entity 1 and predicted it correctly, like mapped it to the right span, the script gave it some points. >>: Right. >> Dipanjan Das: But this model predicted neither entity 1 nor entity 2, because it respects that constraint. >>: So my suggestion is actually -- suppose you look at a complete matching, so just like parsing, you look at just the [inaudible]. >>: [inaudible] >> Dipanjan Das: Yeah. That can be -- >>: [inaudible] >> Dipanjan Das: Yeah. That we haven't done. Right. We can -- >>: [inaudible] >> Dipanjan Das: That happens, actually. >>: [inaudible] >> Dipanjan Das: We can discuss this more, because I want to have numbers which reflect the better quality of the parses [laughter]. >>: [inaudible] >> Dipanjan Das: Right. >>: [inaudible] >> Dipanjan Das: Okay. Okay. >>: [inaudible] >> Dipanjan Das: Right. Right. >>: [inaudible] >> Dipanjan Das: Okay. Okay. Okay. Yeah, I'm really scared about this. This is a good point, because the reviewers will also point this out for us. Okay. On a benchmark -- so this is full parsing, like identifying frames as well as arguments. Again, this is on automatically identified predicates, because these systems actually identified predicates automatically and gave us their output. So we do, again, much better, nearly 5 percent improvement, which is significant. On given predicates we do even better, like 54 percent, while on the new data it is like 69 percent. Now, we can get much better than this -- we are training on only very few examples in comparison to other SRL systems, and as more data becomes available, we are hopeful that this will get better. Now, the last part of the talk is about semi-supervised learning for robustness. I am really excited about this topic because I'm interested in semi-supervised learning. This has been presented in this paper from last year's ACL, but I've done more work on this, and I'm going to present it as well. So on unknown predicates, frame identification accuracy is just 47 percent, which is half of what we get for the entire test set. So on all the unknown predicates, the new predicates, the model does really badly. And this is reflected in the full parsing performance also, because once you get wrong frames, your whole parsing accuracy will go down. So what we do is this: the 9,263 predicates that we saw in the supervised data -- we only have knowledge about those. But English has many more predicates; we actually filtered out around 65,000 or 66,000 words from newswire English which can potentially evoke frames. >>: [inaudible] >> Dipanjan Das: Yeah. So these are basically words other than proper nouns and some other parts of speech which do not evoke frames in the lexicon. Okay. So what we do is both interesting from the linguistics point of view as well as from an NLP point of view. So we are doing lexicon expansion using graph-based semi-supervised learning. So we build a graph over predicates as vertices and compute a similarity matrix over these predicates using distributional similarity. And the label distribution at each vertex, in the language of graph-based learning, is the distribution over frames that the predicate can evoke.
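For concreteness, here is a rough sketch of one common way to build such a graph: PMI-weighted syntactic co-occurrence vectors, cosine similarity, and k-nearest-neighbor sparsification. The function is mine, and the k-NN sparsification is an assumption; the talk itself only mentions computing a similarity matrix from distributional similarity over a parsed corpus.

```python
import numpy as np

def build_similarity_graph(pmi_vectors, k=10):
    """pmi_vectors: dict mapping each predicate to a 1-D numpy array of
    PMI-weighted co-occurrence scores extracted from a parsed corpus."""
    preds = sorted(pmi_vectors)
    X = np.stack([pmi_vectors[p] for p in preds])
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = X @ X.T                                    # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                   # no self-edges
    graph = {}
    for i, p in enumerate(preds):
        nbrs = np.argsort(sim[i])[::-1][:k]          # k most similar predicates
        graph[p] = [(preds[j], float(sim[i, j])) for j in nbrs]
    return graph                                     # edge weights form the symmetric weight matrix W
```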
So this is very similar to some work that I've done with [inaudible] Petrov on unsupervised lexicon expansion for POS tagging. So here is an example graph, a real graph that we use. The green predicates come from the lexicon and the supervised data, and I'm showing the most probable frame over each of these green vertices. Similarity is on the right side, and unemployment rate and poverty are on the left. And the black predicates are the ones which can potentially serve as predicates, and we want to find the best set of distributions, the best set of frames, for each of these predicates. And we call the green ones seed predicates, and the unlabeled predicates are the black ones, and we want to do graph propagation to spread labels throughout the graph to increase our lexicon size. And this is an iterative procedure that continues until convergence. So a brief overview. A graph is like lots of data points. So we get the geometry of the graph from the symmetric weight matrix, which for us comes from distributional similarity, and high weights say that two predicates are similar, and low weights say that they're not. Now, these r1 through r5 on the gray vertices are the ones which come from labeled data. So we have supervised label distributions r on the labeled vertices, and the q's are the distributions to be found on the vertices. Now, we use this new objective function to derive these q distributions. So this is some work -- this is under review right now. So basically, if you look closely at what we're doing here, the first term in the objective function that we're trying to minimize looks at the Jensen-Shannon divergence between q_t and r_t for the labeled vertices. So basically the observed and the induced distributions over labeled vertices should be brought closer by the first term. The Jensen-Shannon divergence is basically an extension of the KL divergence which is symmetric and smoother. The second term over here, which is more interesting, uses the weight matrix to bring the distributions of neighboring vertices closer, by penalizing the Jensen-Shannon divergence between them. And, finally, this is another interesting term that induces sparse distributions at each vertex. So this penalty is called an L1,2 penalty. It's called a [inaudible] lasso in the regression world. So basically what it does is try to induce vertex-level sparsity. And the intuition is that a predicate in our data can evoke only one or two frames. So out of the 900 frames in the distribution, you only want one or two entries to have positive weight, and the rest of them should be zero. So that's the idea. And we have seen that having this sparsity penalty not only gives us better results, it also gives us tiny lexicons. So the lexicon size will be really small, which we can store and use later on much more efficiently than with a graph objective function that doesn't use sparsity. So the constrained inference is a really straightforward thing. If the predicate is seen, we only score the frames that the predicate evoked in the supervised data. Else, if the predicate is in the graph, we score only the top k frames in the graph's distribution. And, otherwise, we score all the frames. Okay. And this is an instance where semi-supervised learning is making inference much faster on unknown predicates. So instead of scoring hundreds of frames, we score only two, which makes the parser fast.
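A small sketch of the constrained inference just described; the function and variable names are mine, not the parser's.

```python
def candidate_frames(predicate, supervised_frames, graph_frame_dists,
                     all_frames, top_k=2):
    """Which frames to score for a given predicate."""
    if predicate in supervised_frames:
        # Seen predicate: only the frames it evoked in the supervised data.
        return supervised_frames[predicate]
    if predicate in graph_frame_dists:
        # Unseen but in the graph: only the top-k frames of its propagated,
        # sparse frame distribution.
        dist = graph_frame_dists[predicate]
        return sorted(dist, key=dist.get, reverse=True)[:top_k]
    # Otherwise, fall back to scoring every frame in the lexicon.
    return list(all_frames)
```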
So we get a huge amount of improvement, but still not as good as the known-predicate case. So we get 65 percent accuracy on unknown predicates in comparison to 46 percent for the purely supervised model. And this is reflected in the full parsing performance also. Okay. So I'm at the end of the talk. >>: [inaudible] >> Dipanjan Das: The weight? That's distributional similarity. Just take a parsed corpus with dependency parses, look at the subject relationships of predicates, then use point-wise mutual information and take cosine similarity. It's standard distributional similarity. But great things can be done there also. You can learn those weights. We are working on that, but we haven't gotten any results. >>: [inaudible] >> Dipanjan Das: On development data, cross-validation. >>: [inaudible] >> Dipanjan Das: Two. So it turns out that if you fix one, you can tune the other one. >>: So there's a global JS divergence between the -- sorry, [inaudible] between the labeled distribution and the unlabeled distribution, and then there were two other terms, each with a distinct weight? >> Dipanjan Das: Yes. So -- interesting. So basically where we diverge from previous work is that in label propagation, people treat these things as matrices and do updates, and we don't do that. We take gradients. Firstly, we make the distributions unnormalized. That's the first thing we do. So none of the distributions q are normalized. So that makes optimization much easier for us. It becomes really, really nasty if things are normalized. So you can't -- I mean, there has been some very interesting work at UW on this by [inaudible]. Now, we make the thing unnormalized, and then if you take gradients with respect to each component of q, the gradient updates become really trivial. So you can just operate on vertices and make updates, and it is trivially parallelizable across cores also. >>: [inaudible] >> Dipanjan Das: So on this graph with 65,000 vertices and 900 labels, running the optimization on 16 cores takes two minutes. It's really fast. Okay. So conclusions. We did some shallow semantic -- any more questions? Yes? >>: Just with the distributional similarity, you get a lot of information for nouns and verbs. >> Dipanjan Das: Yes. >>: But do you get information on adjectives and adverbs? >> Dipanjan Das: Yes. Yes. >>: How is that [inaudible]? >> Dipanjan Das: Yes. Other relationships, other -- [inaudible] dependency corpus; he has a lexicon of similar [inaudible] adverbs and adjectives there too. This is from the '90s. And we make use of that in this work. So the graph construction, or the similarity learning -- if we can use the frame [inaudible] supervised data and learn those weights, that will be really interesting. But I have not -- there's an undergraduate student working with me on that problem, a generic weight-learning problem, so a matrix-learning problem for graphs. And if we can tune the distributional similarity calculation for FrameNet, for the frames, it will be very interesting. Okay. So, shallow semantic parsing using frame semantics: richer output than SRL systems, more domain-general than deep semantic parsers. And on benchmark data sets we get better results. In comparison to prior work, we make fewer independence assumptions. We just have two statistical models. We have semi-supervised extensions, and we saw an interesting use of dual decomposition with many overlapping components in this work. Lots of possible future work.
One trivial extension is to train this for other languages. These six languages have frame-semantic annotations. We can use these techniques for deeper semantic analysis tasks: logical form parsers have this problem of lexical coverage, so maybe we can use this type of semi-supervised extension to merge distributional similarity and logical form parsing in some way. And we can use the parser for NLP applications. So right now, actually, our parser is being used to bootstrap more annotations by Fillmore and his team, which is interesting, because it closes the circle and gives us more data. And the parser is available. It is an obscure task, but, still, probably 200 people have been using it. Thank you. [applause]