>> Lucy Vanderwende: Hi, everyone. This morning it's my pleasure to introduce
Dipanjan Das who's visiting us from CMU. His talk will be on robust shallow
parsing, which is a task that many of us have spent a long time thinking about
and which we all think is very important for upcoming possible applications of our
natural language processing techniques.
He's about to graduate. His advisor is Noah Smith. He received the best paper
award at ACL 2011 for a paper on unsupervised part-of-speech tagging, but
since the majority of us had already seen that talk, we asked Dipanjan to speak
about his other work, though certainly the part-of-speech tagging is very noteworthy
work that you've done.
So thank you very much.
>> Dipanjan Das: Thank you.
Okay. Firstly, thanks for inviting me to give the talk here. I'll talk about this work,
which is a major part of my Ph.D. dissertation. It's about shallow semantic
parsing of text using the theory of frame semantics.
So this talk is about natural language understanding. Given a sentence like I
want to go to Seattle on Sunday, the goal is to analyze it by putting on
layers of information such as part-of-speech tags. If we want to go deeper, we
can look at dependency parses.
However, here we are interested in even deeper structures. So given predicates
which are salient in the sentence -- for example, the word go -- we want to find
the meaning of the word go and also how different words and phrases in the
sentence relate to the predicate and its meaning.
So over here travel is the semantic frame that the word go evokes, and it just
encodes an event or a scenario which this word means over here, and there are
other words and phrases in the sentence that relate to the semantic frame that
go evokes.
So over here, these roles are participants that basically relate to this particular
frame which the predicate go evokes. So I is the traveler over here, to Seattle is
the goal where the person wants to go, and on Sunday fills the time role, which is
an auxiliary semantic role.
So there are multiple predicate argument structures possible in a sentence. So
over here the word want also evokes a semantic frame called desiring, and there
is the subject I that fulfills the semantic role experiencer.
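As a rough illustration, the two predicate-argument structures just described could be written down as simple data like this (the field names and frame and role spellings here are illustrative, not the parser's actual output format):

    # Illustrative rendering of the two analyses for
    # "I want to go to Seattle on Sunday" (hypothetical structure, not real parser output).
    analyses = [
        {"predicate": "go", "frame": "Travel",
         "roles": {"Traveler": "I", "Goal": "to Seattle", "Time": "on Sunday"}},
        {"predicate": "want", "frame": "Desiring",
         "roles": {"Experiencer": "I"}},
    ]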
So this talk will be about the automatic prediction of this kind of predicate
argument structures. Mostly it has been described in these two papers from
NAACL 2010 and ACL 2011, but I'll be talking about some unpublished work as
well.
Okay. So one thing is that please stop me if you have clarification questions.
So, broadly, my talk is divided into three sections. The first is why we are doing
semantic analysis, and I have subdivided it into the motivation, why we
chose this kind of formalism for semantic analysis, and, finally, applications.
The second part of the talk is the core where I talk about statistical models for
structured prediction, and this task can be divided naturally into frame
identification, finding the right frame for a predicate, and argument identification,
which is basically the task of semantic role labeling, finding the different
arguments once the frame is disambiguated.
And, importantly, the frame identification section will focus on the use of latent
variables to improve coverage, and in the argument identification section we
will note the use of dual decomposition, but with a new twist
which we call dual decomposition with many overlapping components.
And the final section I'll talk about some work on semi-supervised learning to
improve the coverage of this parser, and I'll primarily focus on some novel
graph-based learning algorithms.
So this is our claim. So let me first start with why we're doing semantic analysis
and motivating it.
So given this sentence, Bengal's massive stock of food was reduced to nothing,
let's put a couple of layers of syntax on the sentence: part-of-speech tags and
dependency parses.
Now, a lot of work in natural language processing has focused on this kind of
syntactic analysis. Some of my work is on part-of-speech tagging and
dependency parsing. However, we are going to look at deeper structures.
Basically, syntactic structures such as dependency parses are not enough to
answer certain questions.
So over here let's look at the word stock, which is a noun. It is unclear from the
dependency parse whether it's a store or a financial entity, so it's ambiguous.
Moreover, if you want to ask questions such as store of what, of what size,
whose store, these are also not answerable just from the syntactic analysis.
Similarly, if you look at the verb reduced, we can ask similar questions like what
was reduced, to what. That is also not apparent from the syntactic analysis.
So the type of structures that we are going to predict, frame-semantic parses, can
easily answer these questions.
So the frame store, which is evoked by the word stock, says that it
is indeed a store, and then there are these different semantic roles which are
fulfilled by words and phrases in the sentence.
Similarly, there is this frame and its roles for the verb reduced.
Now, I will take some time to trace back the history of this kind of work like in a
couple of slides. Basically it started with the formalism of case grammar from the
late 1960s by Charles Fillmore. So given a sentence such as I gave -- so this is
a classic example -- I gave some money to him, Fillmore talked about cases that
are basically words and phrases that are required by a predicate. So over here I,
the subject, is an agent, to him is a beneficiary, and so forth.
So the case grammar theory talked about three salient things: the
semantic [inaudible] of a predicate, the correlation of
predicate-argument structures like these with syntax, and cases
or roles like obligatory cases and optional cases. So we are mostly familiar with
this theory.
Now, around the same time in AI, Marvin Minsky talked about
frames, basically for representing knowledge. And Fillmore extended case
grammar, with the help of this theory about frames, to frame semantics in the late
'70s and early '80s.
Now, frame semantics basically relates the meaning of a word with world
knowledge, which is new in comparison to case grammar. It also
presents the abstraction of predicates into frames. So, for example, the word
gave in the previous slide evokes a giving frame, which has several participating
roles, but this frame is also evoked by other words like bequeath, contribute, and
donate. So basically all these predicates are associated with this giving frame and
they can evoke this frame in their instantiations.
Now, frame semantics and other related predicate argument semantic theories
gave rise to annotated data sets like FrameNet, PropBank, VerbNet and currently
OntoNotes, which is a popular corpus that people are using. And now we're
doing data-driven shallow semantic parsing.
Now, very roughly, in the world of AI, around the same time, frames gave rise to
the theory of scripts developed by Schank and Abelson, and template-filling
information extraction, which is extremely popular, came about with the help of
annotated data sets like MUC, ACE, and so forth.
So, broadly, we can partition this into CL and AI -- computational
linguistics and AI work -- but today machine learning is bridging these two areas, and
we're not really doing very different things in the two. Structurally,
shallow semantic parsing is very similar to information extraction.
Okay. So enough about motivation and the history. So why did we choose this
linguistic formalism?
So I have -- like I'm representing semantic analysis in this spectrum from shallow
semantic parsing to deep analysis.
So in the shallow end -- so this is like a very approximate representation. Don't
take it very seriously.
So at the shallow end we have PropBank style semantic role labeling which is an
extremely popular shallow semantic analysis task. So given a sentence with
some syntactic analysis, this kind of semantic role labeling actually takes verbs.
So over here the word reduced is a predicate of interest, and there
are symbolic semantic roles like A1 and A4 which are arguments of this verbal
predicate.
Now, according to PropBank there are only six of these core semantic roles, and
the labels have verb-specific meanings.
However, there has been some work which has noted that these six semantic
roles, since they take different meanings for different verbs, they conflate the
meaning of different roles due to oversimplification. So from a learning point of
view, you are learning classes which have different meanings for different verbs,
which is not really desirable.
On the other hand, at the deep end, there is semantic parsing into logical forms,
which takes sentences and then produces logical structures like, say, lambda
calculus expressions.
Now, these are really good because they give you the entire semantic structure,
but they are trained on really restricted domains so far and have poor lexical
coverage. So basically we cannot take logical form parsers trained on these
restricted domains and then run them on free text.
So our work lies in between these two popular types of parsing formalisms.
Frame-semantic parsing has certain negative sides: it doesn't model
quantification and negation, unlike logical forms. But there are certain advantages.
It is deeper than PropBank-style semantic role labeling because we look at
explicit frame and role labels, of which there are more than a thousand. And we
also model all types of part-of-speech categories:
verbs, nouns, adjectives, adverbs, prepositions, and so forth.
We have larger lexical coverage than logical form parsers because our models
are trained on a wide variety of text, much larger than the restricted domains
used to train logical form parsers.
And, finally, the lexicon in the supervised data that we use is actively increasing
in size. So every year we can train on new data and get better and better
performance. So it's an ongoing moving thing, and in tandem we can develop
our statistical models.
Okay. So I come to applications next. Basically lots of applications are possible
for this parser. The first is question answering. So let's say the example that I
showed you before, if we take a question that tries to extract some information
from a large data set, basically if we parse both the question and the answer with
frame-semantic parses, we will get isomorphic structures that can be used, say,
in features or constraints or in whatever way to answer questions.
So, moreover, if we use other lexical items like resource instead of stock, since
frame semantics abstracts predicates through frames, we can actually get the
same semantic structure and we can leverage the use of frames in question
answering. So this type of work has been done previously by Bilotti, et al., in
information retrieval where they used PropBank style semantic role labeling
systems to build better question answering systems.
And right now the DeepQA engine of Watson, which was partly developed at
CMU, is using my parser for question answering.
Another application is information extraction in general. So there's lots of text,
and if you parse it with a shallow semantic parser, basically we can get labels like
this. The bold items are the words that evoke frames, and then there are
semantic roles underlined, and all of these can be used to fill up a database of
stores, and there are different roles that can form the database columns.
Yes?
>>: I'm intrigued by your [inaudible] idea of the system. Has there been
[inaudible]?
>> Dipanjan Das: No.
>>: They haven't done it or --
>> Dipanjan Das: So there is this Bilotti, et al., paper that basically gave the idea
of including it in the pipeline of deep QA. So they have quantification of using a
PropBank-style semantic role labeler as to how it can improve question
answering, but there hasn't been any quantification in the Watson system.
Okay. Right. So the last thing I'll comment on about applications is multilingual
applications. So let's take the translation of this sentence in Bengali, which is my
native language.
It is a syntactically divergent language from English. It has free word order. But
the roles and the frames actually work -- most of them work in Bengali as well.
So we can use word alignments to associate different parts of the English
sentence with the Bengali sentence, and we can do things like translation,
cross-lingual information retrieval, or filling up world knowledge bases. So
these are just some hand-waving suggestions at doing multilingual things with
frame-semantic parses.
Okay. So I will next come to the core of the talk of statistical models for
structured prediction. Most of this work has been described in this paper from
NAACL two years back, but some of this is under review right now.
So before I go on to the models I'll briefly talk about the structure of the lexicon
and the data that we use to train our model.
Now, the lexicon is basically this ever-growing thing called FrameNet, which is a
popular lexical resource. So this is a frame where placing is the name of the
frame. It, again, encodes some event or a scenario. There are semantic roles
like agent, cause, goal, theme, and so forth, and the black ones are core roles
and the white ones are non-core roles. These non-core roles are shared across
frames. So these are like the argM roles from PropBank.
There are some interesting linguistic relationships and constraints that are
provided by the expert annotators. So over here there is this excludes
relationship. I'll talk about it later on. These are binary relationships between
semantic roles.
And there are potential predicates listed with each frame that can evoke
this frame when instantiated in a sentence. So arrange, bag, bestow, bin --
these are things that can evoke the placing frame.
There are several frames in the lexicon. Actually, there are like 900 in number
currently. So I'm showing some -- like six examples over here. There are some
interesting relationships between these frames. They form sort of a graph
structure, and the red arrows over here indicate the inheritance relationship, which
says that dispersal, for example, inherits its meaning from the placing frame.
And there is this used by relationship also, and there are some other multiple
relationships that the linguists use to create this frame and role hierarchy.
Now, there was this benchmark data set that we used to train and test our model.
Now, it is a tiny data set in comparison to the data sets that are used to train
other semantic role labeling systems.
So basically there were around 665 frames. The training set contained only
around 2,000 annotated sentences with 11,000 predicate tokens.
We use this for comparison with past state of the art. Very tiny data set.
Now, in 2010 a better lexicon was released. It nearly doubled the
amount of annotated data, and there were more frames and role labels, and more
predicate types in the annotations.
So increasingly, more information is being added, and it's
easy to retrain our systems as it becomes available.
So we will also show numbers on this data set.
Right.
>>: So are the frames being treated in conjunction with the training set
sentences?
>> Dipanjan Das: Yes. Right.
>>: [inaudible]
>> Dipanjan Das: Exactly. Yes. So previously, before these
full-text annotations -- that's what we call the data released at SemEval in 2007 --
they used to come up with frames without looking at a corpus.
So right now, for the development of this corpus, whenever a word
or predicate cannot be assigned an existing frame, a new frame is created. And
that works for roles also.
>>: So that list of verbs that can be associated with each frame comes from the
[inaudible]?
>> Dipanjan Das: In our model it comes both from the lexicon, which was
present before the corpus was like annotated, as well as the corpus, the union of
those two.
And these are not only verbs. They're nouns, adjectives, adverbs, prepositions.
>>: So what's the chance that you're going to need a new frame --
>> Dipanjan Das: There is a big chance. So we work on the assumption that
these 900 frames that we have currently, they have broad coverage. That's the
assumption. But it's interesting -- it can be an interesting research problem to
come up with new frames automatically.
So we will see some work where we assigned or increase the lexicon size by
getting more predicates into the lexicon. So that will be the last part of the talk.
But not frames. So we assume that the frame set is fixed.
>>: [inaudible]
>> Dipanjan Das: This is what we work with currently. So this was released in
2010. I think they haven't made a formal --
>>: [inaudible]
>> Dipanjan Das: Yes. Yes. So the lexicon is the frames, roles, predicates, and
sometimes I interchangeably use lexicon with the annotated data also.
>>: [inaudible]
>> Dipanjan Das: Yes. 9,263.
>>: [inaudible]
>> Dipanjan Das: Not necessarily.
>>: [inaudible]
>> Dipanjan Das: That's a great question, yeah. It is not [inaudible]. So I have
semi-supervised algorithms that can handle those predicates which were not
seen either in the lexicon or the training data.
>>: So suppose in the test set you have some [inaudible].
>> Dipanjan Das: You'll get partial [inaudible]. But that's another great question.
We do not exploit the relationships of frames during learning. That can be an
extension where you can use the hierarchy information to train your model and
get better -- like an example would be like if you use a max margin sort of
training, your loss function can use the partial -- like the related frames, for
example.
Okay. So this is the set of statistical models that we used to train our system.
So I'll first talk about frame identification, which also addresses Lucy's point about
these new predicates that we don't see in the training data or the lexicon.
So let's say we have this predicate identified as the one evoking a frame, and it is
ambiguous -- we don't know which frame it evokes -- so the goal is to find the best
frame among all the frames in the lexicon.
So we can use this simple strategy of selecting the best frame according to a
score. That score is a function of a frame, a predicate, and the sentence --
basically the observation.
Now, we like probabilistic models. So let's say that we use the conditional
probability of a frame given a predicate and a sentence. This can be framed
as a logistic regression model.
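As a minimal sketch of that conditional model, assuming a hypothetical feature function and weight dictionary (these names and signatures are illustrative, not the actual implementation):

    import math

    def frame_probability(frame, predicate, sentence, frames, weights, features):
        """p(frame | predicate, sentence) under a log-linear (logistic regression) model.
        `features(f, p, s)` returns the names of firing features; `weights` maps names to
        floats. Both are assumed helpers here, not part of the real system."""
        def score(f):
            return sum(weights.get(name, 0.0) for name in features(f, predicate, sentence))
        log_partition = math.log(sum(math.exp(score(f)) for f in frames))
        return math.exp(score(frame) - log_partition)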
Now, there are certain problems with using a simple logistic regression model
here. Look at the feature function. It takes the frame, the predicate, and the
sentence. Now, what happens when a predicate is unseen -- you will not see it
in the lexicon or the training data -- the feature function will find it difficult to get
informative features for predicates that you did not see before.
So, firstly, the model is unable to handle unknown predicates at test time. This is
the first problem.
The second problem is that if you look at the total feature set, its order is the
number of feature templates, capital T, multiplied by the total number of frames,
multiplied by the total number of predicate types in the lexicon. So this is the
number of features in the model.
Now, for our data set, this turns out to be 50 million features. And we can handle
50 million features now, but we thought: can we do it with fewer
features, and can we do better?
So instead we have a logistic regression with a latent variable where the feature
function doesn't look at the predicate surface form at all. So the feature function
is basically the frame, something called a proto-predicate, the sentence, and the
lexical-semantic relationships between the predicate and the proto-predicate.
So basically we're assuming that a proto-predicate is evoking the frame but
through some lexical-semantic relationships with the surface predicate. Okay.
So there is a frame, there's a proto-predicate, and then there are lexical-semantic
relationships with the actual predicate. And since we don't know what
proto-predicate it is, we marginalize it in this model.
Is this kind of clear?
With an example that I'm going to show you now, it will become clear.
>>: What's the relation between this and [inaudible]?
>> Dipanjan Das: Yeah. So that was like nearly unsupervised, like fully mostly
unsupervised. So you take prototypes and then expand your knowledge. It is
kind of related, but --
>>: So theory of prototype would be the one that you see in the [inaudible]?
>> Dipanjan Das: Yes. Yes. So here is an example.
So let's say for the store frame you saw cargo, inventory, reserve, stockpile,
store, and supply, but if your actual predicate was stock, which we saw before,
we just use the lexical-semantic relationships between these and stock, and
the features only look at those relationships, which are only three or four in
number.
>>: [inaudible]
>> Dipanjan Das: Right.
>>: But at test time you don't know the [inaudible].
>> Dipanjan Das: Yes.
>>: [inaudible]
>> Dipanjan Das: It is not. But you will get feature weights -- let's say that in
training, store was the predicate and the proto-predicate that you're currently
working with is inventory. The lexical-semantic relationship will be something like
synonym, and the synonym feature will be given more
weight, for example.
So you will learn feature weights for features that look at a proto-predicate and
lexical-semantics relationship. That's the idea.
>>: So the inventory proto-predicates are actual other English tokens or other
English types that evoke that same --
>> Dipanjan Das: Exactly.
>>: They're not truly just some floating variable or --
>> Dipanjan Das: Yes. So it is a list of predicates that appeared in your
supervised data.
Now -- okay. Let's now look at the predicate surface form. Let's take a
concrete example. As probably already became clear, if the predicate was
stock and the proto-predicate was stockpile, then the lexical-semantic relationship
is synonym, and let's say feature No. 10245 fires only when the frame is store,
the proto-predicate is stockpile, and synonym belongs to this LexSem set.
And this for us comes from WordNet, but instead of WordNet lexical-semantic
relationships you can use any semi-supervised, distributional-similarity type
of lexicon.
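A rough sketch of that latent-variable scoring, with hypothetical names for the lexical-semantic resource and the feature function (this shows only the shape of the marginalization, not the actual implementation):

    import math

    def latent_frame_score(frame, predicate, sentence, proto_predicates,
                           weights, lexsem_relations, features):
        """Marginalize over proto-predicates seen with `frame` in the supervised data.
        The surface predicate enters only through lexsem_relations(predicate, proto),
        e.g. {"synonym"} from a resource such as WordNet (assumed helper API)."""
        total = 0.0
        for proto in proto_predicates[frame]:
            rels = lexsem_relations(predicate, proto)
            feats = features(frame, proto, sentence, rels)   # no surface form of `predicate`
            total += math.exp(sum(weights.get(name, 0.0) for name in feats))
        return math.log(total)   # unnormalized log-score; normalize over frames as before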
>>: So does this evoke some sort of bias towards frames that have many
predicates that [inaudible]? You're summing over all the proto-predicates
[inaudible] feels like if you have a frame with lots of predicates, it's going to get
more terms? You're training discriminatively, so maybe not. Right? Maybe each
of their weights are going to be reduced a little bit, but I'm just curious.
>> Dipanjan Das: Yeah. That's a good analysis which I haven't done.
>>: Okay.
>> Dipanjan Das: Actually, this does well in comparison to -- no, no, it doesn't do
really well. It does well [laughter]. It does well in comparison to a model that
doesn't use latent variables. It does much worse than you can -- like a
semi-supervised model. And we'll see how we can do better.
Any more questions?
Yes?
>>: So if the features are still looking at all of the predicates that are in the
training set, are the features still going to be [inaudible]? Because the number of
predicates is kind of equal to the number of proto-predicates.
>> Dipanjan Das: No. But let's say this stockpile predicate, there will be a
feature with a high weight that will get -- like if the feature was stockpile with
synonym, it will get a high weight during training, ideally, and that feature with
high weight will fire during test time when it sees this new predicate.
So from that perspective, you should get the right frame.
>>: What kind of feature reduction do you get? You go from 50 million to --
>> Dipanjan Das: I'll come to that.
>>: Okay. Sorry.
>> Dipanjan Das: The number of features now is the number of feature templates
multiplied by the number of frames multiplied by the max number of proto-predicates
per frame, which comes to only 500,000 features, one percent of the features we
had before.
Okay. Now, I worked on this problem of paraphrase identification, which is of
interest to researchers here as well, where we used all these
similar resources like dependency parses, lexical semantics, and latent structure
to get good paraphrase identification.
We trained this model using L-BFGS, although the objective is non-convex, and for
fast inference we have a tweak: basically, only if the predicate is unseen do we
score all the frames. Otherwise, we score only the frames that the predicate
evoked in the data. And this becomes much faster when we do this.
So I would give you a caution beforehand: the numbers are not as good as in other
NLP problems because we have really small amounts of data, but there are
some frames and roles on which our model is very confident, and those
structures are good.
So on this benchmark data set, in comparison to UT Dallas and LTH, which were
the best systems at SemEval, we do better, significantly better. Now, this
evaluation is on automatically identified predicates, and all of these three
systems use similar heuristics to get the predicates, to mark the predicates which
can evoke frames.
Now, in a more controlled experiment where we have the predicates already
given, the frame identification accuracy is 74 percent. And when the data set
doubled, this goes to 91 percent, which is good, because we only get one out of
ten frames incorrect.
And without the latent variable, this number is much worse. So basically we're
getting better performance as well as reducing the number of features.
Yes?
>>: Do you have some written analysis on when you get the frame wrong?
Because it could be -- the wrong frame could be completely wrong, something that
absolutely makes no sense, or it could be something that actually is not --
>> Dipanjan Das: Yeah.
>>: -- exactly the correct one but it's meaningful.
>> Dipanjan Das: So that's a great point. So this number actually uses this
partial matching thing. If you only do exact matching, it is 83 percent.
So that is also still pretty good, because we have, like, a thousand frames to
choose from.
Now, often there are some really closely related frames, and it's hard to actually
find the right one.
So there is a question of interannotator agreement there also, like what is the
upper limit, and I think it is just 92 or 93 percent. So we're doing pretty well
in terms of frame identification, especially for known predicates. This number is
abysmally low for unknown predicates, and we'll come to that in the final section
of the talk.
>>: [inaudible]
>> Dipanjan Das: No, no. Actually, this model, I don't use unsupported features.
That is a great question.
So unsupported features are the ones that appear only in the partition function. So
if you include those, then it is 50 million. But if you don't use them, then it is a few
million, basically.
>>: Okay.
>> Dipanjan Das: The unsupported features, they actually help you a lot in this
problem, which is what we observed.
>>: I'm sorry. I didn't understand what you meant by given predicates. You
mean given the correct predicate for the correct frame?
>> Dipanjan Das: No, it's: given a sentence, you can mark which
predicates can evoke frames. So if you do that automatically using heuristics,
that is a non-trivial problem. So if you --
>>: So you're saying you tell it this is the predicate, but you don't tell it -- it
doesn't necessarily match the -- you still have to do your proto-predicate
>> Dipanjan Das: Yes. You still have to do frame identification.
>>: Okay.
>>: But you do know there is a frame out there for that predicate?
>> Dipanjan Das: Yes.
>>: So would that be a case where you would have different words that both
evoke the same frame? So, for example, you have relation [inaudible]?
>> Dipanjan Das: So you are asking -- let me just rephrase it. So in a sentence,
whether two predicates can evoke the same frame? Is that the question?
That happens, but not very often. But since we do not jointly model all the
predicates of a sentence together, from a learning or inference point of view
we don't care about what other predicates are evoking. But that can also be
done -- at a document level especially, we can place hard or soft
constraints saying that different predicates in a document which have the
same or similar meaning should evoke the same frame.
>>: So lexical semantic, is it a binary feature?
>> Dipanjan Das: Feature.
>>: [inaudible].
>> Dipanjan Das: Yeah, it's like whether there is a lexical-semantic relationship of
synonymy, whether there is a lexical-semantic relationship of [inaudible].
>>: I'm just curious why suppose you're trying to incorporate, say, the lexical
semantic relationship [inaudible].
>> Dipanjan Das: So this model can handle that as a real-valued feature. There
should not be any problem. But another hack would be to just make bins of
the distributional values, like 100 bins, for example, or 10 bins, and use those as
[inaudible].
Okay. This is the more interesting part. And I think that we have done some
good modeling over here in comparison to previous semantic role labeling systems.
So let's say we have already identified the frame for stock, and now we have to
find the different roles.
So there are potential roles that come from the frame lexicon once the frame is
identified. So let's say these are these five roles over here, and the task is, from
a set of spans which can serve as arguments, to find a mapping
between the roles and the spans. So this is just a maximum
bipartite matching problem from our point of view.
And at the end over here we have this phi symbol, which is the null
span, and when a role maps to this null span, it means that the role is not
overt in the sentence or it is absent from the analysis.
>>: But why do you choose Bengal's as the ideal mapping rather than Bengali?
>> Dipanjan Das: It's just a convention that annotators use. It's strange. It is
like --
>>: Like for resource, it's [inaudible].
>> Dipanjan Das: Yes. It is a choice that both PropBank annotators and [inaudible]
annotators have taken. Like modeling the -- labeling the entire prepositional
phrase for --
>>: [inaudible]
>> Dipanjan Das: Yes. Yes. In dependency parsing also these are there, right?
Because conjunction is a case where you don't really know what [inaudible].
Okay. Now, there are certain problems. This is not just a simple bipartite
matching problem because we may make mistakes like this. If we map supply to
food over here, this is linguistically infeasible according to the annotators
because it violates some overlapping constraints. So basically two roles cannot
have overlapping arguments.
So this is a typical thing in standard semantic role labeling work. For example,
Kristina had this paper in 2004 or 2005 where they did dynamic programming to
solve this problem.
Now, there are other things which are more interesting and have been explored
previously like the mutual exclusion constraint that these two roles cannot appear
together. An example is basically for this placing frame, if an agent places
something, there cannot be a cause role in the sentence because both the agent
and the cause are ideally placing the thing.
So two examples are here. The waiter placed food on the table, and in Kabul,
hauling water put food on the table. So these are very different meanings, but
hauling water is the cause while agent is the waiter, and they cannot appear
together in an analysis.
There are more interesting constraints like the requires constraint, which means
that if one role is present, the other has to be present in the analysis.
So over here, the frame is similarity. So resembles is the predicate evoking the
frame similarity. The mulberry resembles a loganberry. The first one is entity
one, the second one is entity two, but a sentence like a mulberry resembles is
meaningless. So you have to have both these roles in the structure.
So there is more such linguistic information that we use
to constrain the maximum bipartite matching problem, so it's a constrained
optimization problem.
So this is sort of -- other people have also done this, but we are doing the
semantic role labeling task in one step, which is just one optimization problem.
So what we do is we use the scores over these edges. The edges look at the
role and the span. So remember that this
bidirectional arrow with role and span is the tuple that we operate on, and the
score -- we can assume it to have a linear form, like it's a linear function.
There is a weight vector and there is this feature function g that just looks at the
role, the span, the frame, and the sentence, which I've omitted over here. So it's
a standard thing that we do.
Now let's introduce this binary variable z. It's a binary variable for each role and
span tuple. z equal to 1 means that the span fulfills the role. A
zero means that the span doesn't fulfill the role. And let's assume that there is a
binary vector for all the role-span tuples. So it's the entire z vector for all the
roles and spans.
Now, we're trying to maximize this function with respect to z: basically, a sum over
all the roles and spans of each role-span variable multiplied by its score. And for
all the roles we have this uniqueness constraint, which says that a role can be
satisfied by only one span, so this is a constraint that expresses that.
And there will be more constraints like this one which is a little more complicated.
It constrains each sentence position to be covered by only one role-span
variable. So it will prevent overlap.
And there are many other structural constraints that impose the
requires and excludes relationships. And people will be familiar with this: this
is an integer linear program.
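Putting those pieces together, the integer linear program has roughly this shape, where c(r, s) is the score on a role-span edge and the requires and excludes relationships are written only schematically (this is a reconstruction from the description above, not the exact formulation on the slide):

    \max_{z}\;\; \sum_{r \in \mathcal{R}} \sum_{s \in \mathcal{S}} c(r, s)\, z_{r,s}
    \text{s.t.}\;\; \sum_{s \in \mathcal{S}} z_{r,s} = 1 \quad \forall r \in \mathcal{R}
    \;\;\;\;\; \sum_{(r,s)\,:\, i \in s,\; s \neq \phi} z_{r,s} \le 1 \quad \text{for every sentence position } i
    \;\;\;\;\; \text{linear constraints encoding requires and excludes between role pairs}
    \;\;\;\;\; z_{r,s} \in \{0, 1\} \quad \forall r, s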
Okay. Now, there has been a lot of work on this. So this is one of the seminal
papers on semantic role labeling that uses ILP to do this inference problem.
But ILP solvers are often very slow, and many of them are proprietary. And my
parser is public, and we want to make it open source so that people can
contribute, so we use this thing called dual decomposition with the alternating
direction method of multipliers to solve this ILP problem. It solves a relaxation
of the integer linear program using a very nice technique that doesn't require us
to use a solver.
So this has been developed with colleagues at CMU. This is a paper under
review.
So basically we introduce this thing called a basic part. A role and span tuple
forms a basic part. So the entire space is all the role and span tuples that we
operate on.
So what we do is we break down the whole bipartite matching problem into many
small components, like find the best span for a role, or for a sentence position,
find the best role-span tuple, and so on. These are really simple problems that
can be solved really fast. And at a global level we impose agreement between
these components so that over the entire structure there's a consensus between
these small problems.
So I think people are now familiar with dual decomposition. It's a really trending
thing in NLP. What people usually do is that they have two big problems, two big
problems that can be solved using, say, dynamic programming, and then there is
a consensus step at the end. But this is very different from that because we
didn't break the problem down into two big things. That is not really
possible for this task. We have many, many small things, and then we try to
impose agreement.
So for each component, now we define a copy of that binary vector. So this is z
superscript component. So a component is basically one of those small things,
and each constraint in our ILP can map to one component. So there is one
component for each constraint in the ILP. So this is like in graphical models
language, each component is a factor.
So we have this z vector for each component. So basically we define this
function called [inaudible] that scores this entire binary vector, and it
uses that role-span score that we saw before. So given an entire z component
vector, we have this function that gives us a score.
Now, the ILP that we saw before can be expressed in this way. So we sum up
over all the components with this score, and we have this constraint where the
z's for all the components, roles, and spans have to agree, and that
agreement is done by using this consensus vector u. So basically when all
the z's for the different components are equal to this u, we have a
consensus.
Okay. Please stop me if there are questions.
Now -- so this is the primal problem, where the z's are integers. Now, we try to
solve an easier problem, which is primal prime, where the integer constraints are
relaxed. So it's a linear program.
Now, we convert that to an augmented Lagrangian function. So this is a new thing
in comparison to the [inaudible] work where we augment the Lagrangian function
with this term. This term is a quadratic penalty that actually should be ideally
zero because the z's are equal to the u. So it doesn't make a difference to the
primal LP, but it actually brings consensus faster. So that's the reason why
people use it.
So this is a standard trick that people do in the optimization community.
Now, this augmented Lagrangian function which looks like this, the saddle point
can be found using something called alternating minimization which is, again,
another standard trick in optimization.
And what is nice is that the saddle point which we seek can be solved using
several decoupled worker problems. So it's basically an iterative optimization
technique where there are three steps. There are Lagrange multiplier updates,
there are consensus variable updates, and there are z updates.
Now, the z updates can be solved at decoupled workers. They can be solved in
parallel.
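A minimal sketch of that iterative scheme, assuming each constraint component exposes a small local solver (the component API, the variable names, and the plain averaging in the consensus step are illustrative simplifications, not the exact updates used in the parser):

    def admm_decode(components, scores, rho=1.0, iterations=100):
        """Alternating updates: decoupled z-updates per component (parallelizable),
        a consensus update for u, and Lagrange multiplier updates.
        `components` is a list of objects with a `.parts` set of role-span tuples and a
        `.solve(scores, u, lam, rho)` method returning a local z copy (assumed API)."""
        u = {part: 0.0 for part in scores}                        # consensus variables
        lam = {(c, p): 0.0 for c in components for p in c.parts}  # Lagrange multipliers
        for _ in range(iterations):
            # 1) z-updates: each worker solves its own small subproblem.
            z = {c: c.solve(scores, u, {p: lam[(c, p)] for p in c.parts}, rho)
                 for c in components}
            # 2) consensus update: average the copies touching each role-span part.
            for p in scores:
                copies = [z[c][p] for c in components if p in c.parts]
                if copies:
                    u[p] = sum(copies) / len(copies)
            # 3) multiplier updates: penalize disagreement between copies and consensus.
            for c in components:
                for p in c.parts:
                    lam[(c, p)] += rho * (z[c][p] - u[p])
        return u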
Now, what do these decoupled workers look like? So remember that one
component is basically one constraint. So let's say one z update that we need to
do is for each role we have a worker that imposes a uniqueness constraint. So
this is, again, a small, tiny optimization problem that solves the z
update, and it looks like this.
And let me just state that this is just a projection onto a probability simplex, and it
can be solved using a sort operation which is really fast.
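For reference, the sort-based projection onto the probability simplex looks roughly like this (a generic sketch of the well-known algorithm, not the exact code used in the parser):

    def project_onto_simplex(v):
        """Euclidean projection of a score vector v onto the probability simplex,
        computed with one sort plus a linear scan."""
        u = sorted(v, reverse=True)
        cumulative, rho, cum_at_rho = 0.0, 0, 0.0
        for i, ui in enumerate(u, start=1):
            cumulative += ui
            if ui - (cumulative - 1.0) / i > 0:
                rho, cum_at_rho = i, cumulative
        theta = (cum_at_rho - 1.0) / rho
        return [max(x - theta, 0.0) for x in v]

    # Example: project_onto_simplex([0.5, 2.0, -0.3]) sums to 1 with no negative entries.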
And the challenge, for a new problem where you have such constraints and you
want to use our technique of dual decomposition, is to define fast and
simple workers that make the optimization converge faster.
Okay. So the advantages of this approach are that there is a significant speedup
over ILP solvers, and we don't need a proprietary solver, and the speedups are
really marvelous. We get nearly 10 percent speedup in comparison to CPLEX,
which is a very strong state-of-the-art solver. And that is parallelized and this is
not.
Okay. We also get certificates of optimality, like exact solutions, for more than 99
percent of examples.
So back to the -- yes?
>>: You were saying that the augmented Lagrangian where you have a quadratic
penalty and a linear penalty [inaudible]? Do you have numbers there?
>> Dipanjan Das: Not for this problem. For dependency parsing we have
numbers.
So there was this paper by Martins [inaudible] at EMNLP. But we haven't done any
analysis in comparison.
Yeah, it will be slower. Firstly, we don't want to use that because the subgradient
approach of Collins uses these two big things, or two big problems. You cannot
break up this problem into --
>>: [inaudible].
>> Dipanjan Das: -- many things. There are some possibilities, but we haven't
found any we like. The dynamic programming trick to find -- reduce overlap,
remove overlap can be one big component, but I have no way to impose those
other small constraints in a big [inaudible].
Okay. So learning. We use straightforward learning. We use local learning,
actually, to learn these weights by just taking role and span pairs. We also have
done some experiments where we do global inference using dual decomposition
and use a max-margin trick, but that doesn't work very well for some reason. So
we're stuck with maximum conditional log-likelihood training.
So here are some numbers. Just for argument identification we get a precision
boost of 2 percent using joint inference. So we are doing better than local
inference, but these numbers do not reflect things like whether the output is
respecting the linguistic constraints that we placed. So we measured those. The
local model makes 501 violations that include overlaps and requires and excludes
relationships, while we make no violations.
>>: [inaudible]
>> Dipanjan Das: Yes.
>>: [inaudible]
>> Dipanjan Das: Yes. Yes. So we have some numbers about [inaudible]
search. The accuracies are similar, but it makes like 550 or 560 violations.
>>: Do you know why -- so given that there's a substantial reduction of the
violation, why there is [inaudible].
>> Dipanjan Das: Great question. We also wondered about that.
So basically let's say we are trying to -- there is this requires relationship, and the
local model predicted entity 1 and didn't predict entity 2. Now, the script that
measures performance, it will give us -- give the local model some score for
entity 1.
On the other hand, this model is very conservative and did not produce either
entity 1 or entity 2. So it got lower scores for that example.
>>: So have you measured -- so this is including [inaudible]?
>> Dipanjan Das: No, this is just argument identification.
>>: [inaudible]
>> Dipanjan Das: The frames are given.
>>: Okay. But, I mean, it's sort of [inaudible]
>> Dipanjan Das: That doesn't happen, no. We can change the script to do that,
but we don't do that.
>>: My question is in this goal -- you're saying the local [inaudible].
>> Dipanjan Das: Yes.
>>: [inaudible]
>> Dipanjan Das: It is not really a partial score. It will -- so when the local model
chose entity 1 and it predicted it correctly, like mapped it to the right span,
the script gave it some points.
>>: Right.
>> Dipanjan Das: But this model neither predicted entity 1 or entity 2 because it
respects that constraint.
>>: So my suggestion is actually -- suppose if you look at a complete matching,
so just like parsing, you look at just the [inaudible].
>>: [inaudible]
>> Dipanjan Das: Yeah. That can be --
>>: [inaudible]
>> Dipanjan Das: Yeah. That we haven't done.
Right. We can --
>>: [inaudible]
>> Dipanjan Das: That happens, actually.
>>: [inaudible]
>> Dipanjan Das: We can discuss this more, because I want to have numbers
which reflect the better quality of parses [laughter].
>>: [inaudible]
>> Dipanjan Das: Right.
>>: [inaudible]
>> Dipanjan Das: Okay. Okay.
>>: [inaudible]
>> Dipanjan Das: Right. Right.
>>: [inaudible]
>> Dipanjan Das: Okay. Okay.
Okay. Yeah, I'm really scared about this. This is a good point, because the
reviewers will also point this out for us.
Okay. On a benchmark -- so this is full parsing, like identifying frames as well as
arguments. Again, automatically identified predicates, because these systems
identified predicates automatically and gave us their output.
So we do, again, much better, nearly 5 percent improvement, which is significant.
On given predicates we do even better, like 54 percent, while on the new data it
is like 69 percent.
Now, we can get much better than this, and we are training only on very few
examples in comparison to other SRL systems, and as more data becomes
available, we are hopeful that this will get better.
Now, the last part of the talk is about semi-supervised learning for robustness. I
am really excited about this topic because I'm interested in semi-supervised
learning.
Now, this has been presented in this paper from last year's ACL, but I've done
more work on this, and I'm going to present it as well.
So for unknown predicates, frame identification accuracy is just 47 percent, which
is half of what we get for the entire test set. So on all the unknown predicates, the
new predicates, the model does really badly. And this is reflected in the full
parsing performance also, because once you get wrong frames, your whole
parsing accuracy will go down.
So the issue is that we only have knowledge about the 9,263 predicates that we
saw in the supervised data, while English has many more predicates; we actually
filtered out around 65,000 or 66,000 word types from newswire English which can
potentially evoke frames.
>>: [inaudible]
>> Dipanjan Das: Yeah. So these are basically words other than proper nouns
and some other parts of speech which do not evoke frames in the
lexicon.
Okay. So what we do is interesting both from the linguistics point of
view as well as from the point of view of an NLP task. So we are doing lexicon
expansion using graph-based semi-supervised learning.
So we build a graph over predicates as vertices and compute the similarity matrix
over these predicates using distributional similarity. And the label distribution at
each vertex in the language of graph-based learning is the distribution of frames
that the predicate can evoke.
So here is -- so this is very similar to some work that I've done with Slav
Petrov on unsupervised lexicon expansion for POS tagging.
So here is an example graph, a real graph that we use. The green predicates
come from the lexicon in the supervised data, and I'm showing the most probable
frame over each of these green vertices.
Similarity is on the right side, and unemployment rate and poverty is on the left.
And the black predicates are the ones which can potentially serve as predicates,
and we want to find the best set of distributions, the best set of frames for each of
these predicates.
And we call the green ones seed predicates and the unlabeled predicates are the
black ones, and we want to do graph propagation to spread labels throughout the
graph to increase our lexicon size. And this is an iterative procedure that
continues until convergence.
So a brief overview. A graph is like lots of data points. The gray ones are -- so
we have the geometry of the graph given by the symmetric weight matrix, which
comes from distributional similarity for us, and there are high weights which say
that two predicates are similar, and low weights which say that they're not.
Now, these R1, R5 on the gray vertices are the ones which come from
labeled data. So we have supervised label distributions R on the labeled
vertices, and the Q's are basically the distributions to be found on the vertices.
Now, we use this new objective function to derive these Q distributions. So this
is some work -- this is under review right now.
So basically, if you look closely at what we're doing here, the first term in the
objective function that we're trying to minimize looks at the Jensen-Shannon
divergence between Q_t and R_t for the labeled vertices.
So basically the observed and the induced distributions over labeled vertices
should be brought closer by the first term. The Jensen-Shannon divergence
is basically an extension of the KL divergence which is symmetric and smoother.
The second term over here, which is more interesting, uses the weight matrix to
bring down the Jensen-Shannon divergence between the neighboring vertices'
distributions.
And, finally, this is another interesting term that induces sparse distributions at
each vertex. So this penalty is called an L1,2 penalty. It's called a
[inaudible] lasso in the regression world.
So basically what it does is it tries to induce vertex-level sparsity. And the
intuition is that a predicate in our data can evoke only one or two frames. So out
of the 900 frames in the distribution, you only want one or two entries to have
positive weight and the rest of them should be zero. So that's the idea.
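Schematically, the objective just described looks roughly like this, where V_L is the set of labeled vertices, w are the similarity weights, N(t) is the neighborhood of vertex t, JS is the Jensen-Shannon divergence, Omega_{1,2} stands for the L1,2 sparsity penalty, and mu and lambda are hyperparameters (this is a paraphrase of the description, not the exact formulation in the paper under review):

    \min_{Q \,\ge\, 0}\;\; \sum_{t \in V_L} \mathrm{JS}\!\left(Q_t \,\|\, R_t\right)
      \;+\; \mu \sum_{t} \sum_{t' \in \mathcal{N}(t)} w_{t t'}\, \mathrm{JS}\!\left(Q_t \,\|\, Q_{t'}\right)
      \;+\; \lambda \sum_{t} \Omega_{1,2}\!\left(Q_t\right)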
And we have seen that having this sparsity penalty not only gives us
better results, it also gives us tiny lexicons. So the lexicon size will be really
small, which we can store and use later on much more efficiently than with a
graph objective function that doesn't use sparsity.
So the constrained inference is a really straightforward thing. If the predicate is
seen, we only score the frames that the predicate evoked in the supervised data.
Else, if the predicate is in the graph, we score only the top k frames in the
graph's distribution. And, otherwise, we score all the frames.
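That filtering step is simple enough to sketch directly (the container names and the top-k default here are illustrative):

    def candidate_frames(predicate, seen_frames, graph_frames, all_frames, k=2):
        """Return the set of frames to score for a predicate.
        seen_frames: predicate -> frames observed in the supervised data / lexicon.
        graph_frames: predicate -> {frame: weight} from the propagated graph lexicon."""
        if predicate in seen_frames:
            return seen_frames[predicate]                     # seen: score only attested frames
        if predicate in graph_frames:
            ranked = sorted(graph_frames[predicate].items(),  # graph: top-k propagated frames
                            key=lambda item: item[1], reverse=True)
            return [frame for frame, _ in ranked[:k]]
        return all_frames                                     # unknown everywhere: score everything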
Okay. And this is an instance where semi-supervised learning is making
inference much faster on unknown predicates. So instead of scoring a hundred
frames, we score only two, which makes the parser fast.
So we get a huge amount of improvement, but still not as good as the known
predicate case. So we get 65 percent accuracy on unknown predicates in
comparison to 46 percent for the purely supervised model. And this is reflected in
the full parsing performance also.
Okay. So I'm at the end of the talk.
>>: [inaudible]
>> Dipanjan Das: Weight. That's distributional similarity. Just take a parsed
corpus with dependency parses, look at the syntactic relationships of predicates,
then use pointwise mutual information and take cosine similarity. It's standard
distributional similarity.
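A generic sketch of that similarity computation, assuming co-occurrence counts of predicates with their syntactic contexts have already been extracted from a parsed corpus (the function names are illustrative, and keeping only positive PMI values is a common simplification, not necessarily what the parser does):

    import math

    def pmi_vectors(cooccur, pred_totals, ctx_totals, grand_total):
        """Build PMI-weighted context vectors from (predicate, context) counts."""
        vectors = {}
        for (pred, ctx), count in cooccur.items():
            pmi = math.log((count * grand_total) / (pred_totals[pred] * ctx_totals[ctx]))
            if pmi > 0:                                  # keep positive PMI values only
                vectors.setdefault(pred, {})[ctx] = pmi
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse vectors stored as dicts."""
        dot = sum(weight * v.get(ctx, 0.0) for ctx, weight in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0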
But great things can be done there also. You can learn those weights. We are
working on that, but we haven't gotten any results.
>>: [inaudible]
>> Dipanjan Das: On development data, with cross-validation.
>>: [inaudible]
>> Dipanjan Das: Two. So it turns out that if you fix one, you can tune the other
one.
>>: So there's a global tail divergence between the -- sorry, [inaudible] between
the labeled distribution and the unlabeled distribution and then there were two
other terms, each with a distinct weight?
>> Dipanjan Das: Yes. So -- interesting. So basically where we diverge from
previous work is that in label propagation, people treat these things as matrices
and do updates, and we don't do that. We take gradients.
We -- firstly, we make the distributions unnormalized. That's the first thing we do.
So none of the distributions Q are normalized. So that makes optimization much
easier for us.
So it becomes really, really nasty if things are normalized. So you can't -- I
mean, there has been some very interesting work at UW on this by [inaudible].
Now, we make the thing unnormalized, and then if you take gradients with
each -- with respect to each Q's component, then the problem, the gradient
updates, become really trivial. So you can just operate on vertices and make
updates, and it is trivially parallelizable across cores also.
>>: [inaudible]
>> Dipanjan Das: So on this graph with 65,000 vertices and 900 labels running
optimization on 16 cores takes two minutes. It's really fast.
Okay. So conclusions. We did some shallow semantic -- any more questions?
Yes?
>>: Just with the distributional similarity, you get a lot of information for nouns
and verbs.
>> Dipanjan Das: Yes.
>>: But do you get information on adjectives and adverbs?
>> Dipanjan Das: Yes. Yes.
>>: How is that [inaudible]?
>> Dipanjan Das: Yes. Other relationships, other -- [inaudible] dependency
corpus, he has a lexicon for similar [inaudible] adverbs and adjectives there too.
This is from the '90s. And we make use of that in this work.
So the graph construction or the similarity learning. If we can use the frame
[inaudible] supervised data and learn those weights, that will be really interesting.
But I have not -- there's an undergraduate student working with me for that
problem, a generic weight-learning problem, so a matrix learning problem for
graphs. And if we can tune the distributional similarity calculation for FrameNet,
the frames, it will be very interesting.
Okay. So shallow semantic parsing using frame semantics, richer output than
SRL systems, more domain general than deep semantic parsers. And on
benchmark data sets we get better results. In comparison to prior work, we make
fewer independence assumptions. We just have two statistical models. We have
semi-supervised extensions and we saw an interesting use of dual
decomposition with many overlapping components in this work.
Lots of possible future work. We can train this -- one trivial extension is to train
this for other languages. These six languages have frame semantic annotations.
Use these techniques for deeper semantic analysis tasks. Like logical form
parsers have this problem of lexical coverage, whether we can use this type of
semi-supervised extensions for like merging distributional similarity and logical
form parsing in some way.
And using the parser for NLP applications. So right now, actually, our parser is being
used to bootstrap more annotations by Fillmore and his team, which is
interesting, because it closes the circle and gives us more data.
And the parser is available. It is an obscure task, but, still, probably 200 people
have been using it.
Thank you.
[applause]