Robust Semantic Role Labeling for Nominals
Robert Munro
Department of Linguistics
Stanford University
rmunro@stanford.edu

Aman Naimat
Department of Computer Science and School of Business
Stanford University
anaimat@stanford.edu
Abstract
We designed a semantic role labeling system for nominals that consistently outperformed the current state-of-the-art system. Focusing on the core task of classifying the known arguments of nominals, we devised a novel set of features that model the syntactic context and animacy of the nominals, improving on the current state-of-the-art system by 0.012 F1 on the NomBank corpus and, most significantly, by 0.033 F1 on test items with unseen predicate/headword pairs. This corresponds to an overall reduction in error of 10% and 15% respectively.
1 Introduction
Semantic role labeling (SRL) for verbs is an established task in NLP, and a core component of tasks like Information Extraction and Question Answering (Gildea & Jurafsky 2002, Carreras & Màrquez 2005, Toutanova, Haghighi & Manning 2005), but semantic role labeling for nominals has received much less attention (Lapata 2002, Pradhan et al. 2004, Liu & Ng 2007). In general, nominal-SRL has proved to be a more difficult task than verb-SRL. Features that were successful for verb-SRL have not always produced significant results for nominal-SRL, and in general the error rates for nominal-SRL have been at least twice as high (Pradhan et al. 2004, Jiang & Ng 2006).
SRL is typically divided into two tasks: identifying the arguments of a predicate, and classifying
those arguments according to their semantic role.
These are known as semantic role identification
and semantic role classification respectively. The
two are typically tested both independently and in
combination. The best-reported results for both tasks and their combination are given in (Liu & Ng 2007), tested on the NomBank corpus (Meyers et al. 2004). Constrained by time, in this paper we focus
solely on semantic role classification, comparing
our results directly to those of (Liu & Ng 2007).
1.1 SRL for nominals
It is easy to demonstrate why nominal-SRL is a
more complicated task than verb-SRL:
1) [the police AGT] [investigated PRED] [the crime PAT]
2) [the crime PAT] was [investigated PRED] by [the police AGT]
3) [the police PAT] were [investigated PRED] by [the governor AGT]
Examples (1-3) show semantic roles predicated on a verb, and all three are unambiguous. For the predicate investigate, the Subject of an active sentence is the Agent, and the Object is the Patient. For a passive sentence, this is reversed. Provided a verb-SRL system models active/passive sentences, this is uncomplicated. Compare these to some equivalent nominalizations:
4) [The police AGT] filed the report after 3 weeks, causing the governor to declare the [investigation PRED] closed.
5) [The investigation PRED] took 3 weeks.
6) [the crime's PAT] [investigation PRED] ...
7) [the police's AGT] [investigation PRED] ...
8) [The investigation PRED] of [the police AGT/PAT??] took 3 weeks.
Example (4) shows that an argument may be realized some distance from the predicate; example (5) shows that arguments are not mandatorily subcategorized by the predicate; and examples (6-8) show that different roles may be realized in the same syntactic position, and can be inherently ambiguous. Therefore, successful nominal-SRL is a more difficult task than verb-SRL.
1.2 Our contribution
We report on new features, and interactions of features, that consistently improved the accuracy of
semantic role classification for nominals.
In particular, we demonstrate that the new features modeling the syntactic context of the nominals improve the accuracy of semantic role classification, especially over unseen predicate/argument pairs, and report on the distribution of such structures across the labels in NomBank.
From some basic strategies to model a predicate’s arguments holistically, we find that modeling the relative animacy of the arguments improves
the classification of their roles, especially in combination with other features.
We report on the influence of training-set size
on classification accuracy, and the relative success
in classifying unseen predicate/argument pairs.
We followed the standard split of the corpus, training on sections 2-22, validating on section 24, and
testing on section 23.
2 NomBank

The NomBank corpus uses the PropBank set of labels (Palmer et al. 2005) to annotate all arguments of nominals in the Wall Street Journal corpus. It differs from PropBank in two ways. Firstly, an argument may overlap the predicate. For example, investigator is both the predicate and its own Agent. Secondly, it is possible for an argument to realize two roles. For example, truckdriver contains both the Agent driver and the Patient truck.

There are 20 labels in NomBank: ARG0, ARG1, ARG2, ARG3, ARG4, ARG5, ARG8, ARG9, ARGM-ADV, ARGM-CAU, ARGM-DIR, ARGM-DIS, ARGM-EXT, ARGM-LOC, ARGM-MNR, ARGM-MOD, ARGM-NEG, ARGM-PNC, ARGM-PRD, and ARGM-TMP. This results in sentences like:

[The police's ARG0] [investigation PRED] [of a crime ARG1] [in Boston ARGM-LOC] [took 3 weeks ARGM-TMP].

3 Maximum entropy classifier

Maximum entropy (MaxEnt) classifiers have been the staple for nominal-SRL (Jiang & Ng 2006). More sophisticated discriminative learning algorithms have also been used, including multitask linear optimizers and alternating structure optimizers, but they have not significantly improved the results for semantic role classification over MaxEnt classifiers (Liu & Ng 2007), and so we used the Stanford MaxEnt classifier. The objective function was the conditional log-likelihood of the training data:

L(\lambda) = \sum_i \log p_\lambda(y_i \mid x_i)

where:

p_\lambda(y \mid x) = \frac{\exp \sum_j \lambda_j f_j(x, y)}{\sum_{y'} \exp \sum_j \lambda_j f_j(x, y')}

and the derivatives of the log likelihood correspond to:

\frac{\partial L}{\partial \lambda_j} = \sum_i f_j(x_i, y_i) - \sum_i \sum_y p_\lambda(y \mid x_i) f_j(x_i, y)

The objective function was smoothed by assuming a Gaussian distribution over the weights, and penalized accordingly:

L'(\lambda) = L(\lambda) - \sum_j \frac{\lambda_j^2}{2\sigma^2}

where the derivatives were adjusted by:

\frac{\partial L'}{\partial \lambda_j} = \frac{\partial L}{\partial \lambda_j} - \frac{\lambda_j}{\sigma^2}

Brief experimentation with Naïve Bayes and KNN classifiers produced much less accurate results. For future work, it would be interesting to compare our results across a greater number of classifiers, and with joint learning of the labels (Toutanova et al. 2005).
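For our experiments we used the Stanford MaxEnt classifier directly. As a minimal illustrative sketch only (not the actual configuration used here), the same penalized objective corresponds to multiclass logistic regression with an L2 penalty, for example via scikit-learn; the feature values below are invented for the example.

# Minimal sketch of a MaxEnt role classifier with a Gaussian (L2) prior.
# Multinomial logistic regression optimizes the same penalized log-likelihood;
# the regularization strength C plays the role of the prior variance sigma^2.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training examples: one feature dictionary per candidate argument.
train_X = [
    {"predicate": "investigation", "ptype": "NP", "headword": "police", "position": "left"},
    {"predicate": "investigation", "ptype": "PP", "headword": "crime", "position": "right"},
]
train_y = ["ARG0", "ARG1"]

model = make_pipeline(
    DictVectorizer(),                          # binary indicator features f_j
    LogisticRegression(C=1.0, max_iter=1000),  # L2 penalty = Gaussian prior
)
model.fit(train_X, train_y)
print(model.predict([{"predicate": "investigation", "ptype": "NP",
                      "headword": "governor", "position": "left"}]))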
Figure 1: The relative frequency of different roles in different sentence syntactic positions (ARG0, ARG1, ARG2+ and ARGM in Sentence Subject, Direct Object and Other position).
4 Data analysis
No previous attempt at modeling the semantic
roles of nominals has looked at the broader syntactic context of the constructions. Almost immediately, our data analysis revealed strong tendencies for
different roles to be realized in different sentence
positions.
We found strong likelihoods for certain roles to
appear when the argument is realized in the Subject, Object or Adjunct position of a sentence. For
example:
The police’s [investigation PRED] took 3 weeks.
(Subject)
They reported the police’s [investigation PRED]
(Direct Object)
The case was closed after the police’s [investigation PRED] (Adjunct)
Figure 1 gives the distributions of sentence positions for each role type in the training data. With the exception of the ARG0/ARGM distinction for other positions, within each sentence position the results are significant at p < 0.001 by χ².

It is clear from the graph that the difference is important. An argument in the Subject position is 55% likely to be an ARG0, only 10% likely to be one of the ARG2+ roles, and not at all likely to be one of the ARGM roles. For a Direct Object, on the other hand, an ARGM is almost 50% likely, and for other syntactic positions ARG1 is >50% likely.
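The significance test can be reproduced with a standard chi-square test of independence over the role-by-position counts; the counts in the sketch below are invented placeholders rather than the actual NomBank figures.

# Chi-square test of independence between semantic role and sentence position.
# The contingency table uses made-up counts purely to illustrate the procedure.
from scipy.stats import chi2_contingency

#          Subject  DirectObj  Other
counts = [
    [550,     120,      180],   # ARG0
    [300,     260,      520],   # ARG1
    [100,     150,      140],   # ARG2+
    [  5,     480,      160],   # ARGM
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3g}")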
When running the baseline system, sentences with possessives were often misclassified, such as:

[P&G's ARG0] [share PRED] of the Japanese market

This led us to create a feature modeling whether the constituent was a possessive.
Figure 2: The relative frequency of possessives realizing different types of named entities (PER, ORG, LOC).
The animacy of constituents is also correlated
with the distribution of semantic roles, as more
animate constituents will tend to more frequently
realize agents, and therefore ARG0s. More general
knowledge of named entities will also help identify
locations, and therefore ARGM-LOCs.
We defined features based on animacy of the
constituent using gazetteers of named entities and
pronouns, and modeled an order of animacy from
people to objects, based on the presence of known
entities, proper nouns or nouns in a constituent.
Figure 2 shows the intersection of the animacy
and possessive features, clearly showing that possessives realizing a person or organization are
more likely to take the ARG0 role.
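A minimal sketch of how such features might be computed is given below; the tiny gazetteers and the animacy ordering are illustrative stand-ins for the named-entity and pronoun lists we actually used.

# Illustrative animacy and possessive features for a constituent (a token list).
# The gazetteers below are placeholders for the full named-entity/pronoun lists.
PERSON_WORDS = {"i", "you", "he", "she", "we", "they", "police", "driver", "governor"}
ORG_WORDS = {"p&g", "salomon", "census", "bureau"}
LOC_WORDS = {"boston", "dallas", "japan"}

# Higher rank = more animate: person > organization > location > proper noun > noun.
ANIMACY_RANK = {"PER": 4, "ORG": 3, "LOC": 2, "PROPER": 1, "NOUN": 0}

def animacy(tokens):
    words = {t.lower() for t in tokens}
    if words & PERSON_WORDS:
        return "PER"
    if words & ORG_WORDS:
        return "ORG"
    if words & LOC_WORDS:
        return "LOC"
    if any(t[0].isupper() for t in tokens):
        return "PROPER"
    return "NOUN"

def is_possessive(tokens):
    # e.g. our, her, its, or a trailing 's
    return tokens[-1].lower() in {"'s", "our", "her", "his", "its", "their"}

constituent = ["P&G", "'s"]
print({"animacy": animacy(constituent),
       "possessive": is_possessive(constituent),
       # interaction features are formed by concatenating the component values
       "possessive&animacy": f"{is_possessive(constituent)}&{animacy(constituent)}"})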
The semantic roles of a given predicate are highly interdependent (Jiang et al. 2005, Toutanova et al. 2005). For example, if we know that an ARG0 already exists for a predicate, then the chances of another ARG0 for that predicate are greatly reduced. For reasons of time, we did not implement the joint learning of Toutanova et al. (2005), which gives the best reported results for verb-SRL. However, we did define features that took
into account all arguments for a given predicate,
capturing at least part of the interdependency. Using the animacy feature, we included a feature to
indicate whether a given argument was the (equal)
most animate, medially animate, or least animate
of the arguments for the current predicate. We also
included the total number of arguments for the given predicate, as predicates with only one argument
were more likely to realize either ARG0 or ARG1,
while predicates with a high number of arguments were evidence that a low-frequency role might be present.
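The sketch below illustrates this predicate-level bookkeeping, assuming the per-constituent animacy labels have already been assigned using a ranking like the one above.

# Derive relative-animacy and numargs features for all arguments of one predicate.
# ANIMACY_RANK is the illustrative ordering from the previous sketch.
ANIMACY_RANK = {"PER": 4, "ORG": 3, "LOC": 2, "PROPER": 1, "NOUN": 0}

def predicate_level_features(arg_animacies):
    ranks = [ANIMACY_RANK[a] for a in arg_animacies]
    highest, lowest = max(ranks), min(ranks)
    features = []
    for rank in ranks:
        if rank == highest:
            relative = "highest"   # (equal) most animate argument
        elif rank == lowest:
            relative = "lowest"
        else:
            relative = "medial"
        features.append({"relativeanimacy": relative,
                         "numargs": len(ranks),
                         "highestanimacy": highest})
    return features

# e.g. the three arguments of one nominal predicate
print(predicate_level_features(["PER", "NOUN", "LOC"]))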
1. position: whether the constituent is to the left, right or overlaps with the predicate
2. ptype: phrase type of the constituent C
3. firstword: first word spanned by C
4. lastword: last word spanned by C
5. ptype.r: phrase type of right sister
6. nomtype: the NOM-TYPE of the predicate, as defined by the NOMLEX dictionary
7. predicate & ptype
8. predicate & lastword
9. predicate & headword
10. predicate & position
11. nomtype & position

Table 1. Baseline features ('&' indicates an interaction feature)
5 Features

We used the features of (Liu and Ng 2007) as a baseline, as these produced the previous highest-reported results on this task. These features are given in Table 1, and give the results labeled L&N in the Results section.

Table 2 gives the new features that we developed based on observations discussed in the previous section. As the interaction of features is vital to an accurate classification, we defined a number of interaction features; in the final set of features, we had more interaction features than regular ones.

Features sig. to final classification:
12. relativeanimacy: the animacy of the constituent relative to the animacy of the other arguments for the current predicate: (equal) highest/medial/lowest    [sig. 0.002]
13. head: the headword of the constituent    [sig. 0.001]
14. subject & position    [sig. 0.002]
15. predicate & animacy & subject    [sig. 0.006]
16. possessive & nomtype    [sig. 0.001]
17. possessive & position & animacy    [sig. 0.001]

Features not sig. to final classification:
18. subject: whether the constituent is in the sentence subject position    [sig. 0.001]
19. possessive: whether the constituent is a possessive (e.g. our, her, its, -'s)    [sig. 0]
20. sentposition: whether the constituent is in the topic position, the 1st/2nd half of the sentence, or is sentence-final    [sig. 0]
21. animacy: whether the constituent is a person, organization, location, other (unknown) proper noun, or a noun    [sig. 0]
22. predsyn: syntactic category of the predicate    [sig. 0]
23. numargs: the number of arguments for the current predicate    [sig. 0.004]
24. modifies: whether the constituent is a premodifier of the predicate    [sig. 0]
25. highestanimacy: the highest animacy of all the arguments for the current predicate    [sig. 0.003]
26. sentposition & position    [sig. 0]
27. pred & animacy    [sig. 0.004]
28. nomtype & animacy    [sig. 0]
29. possessive & position    [sig. 0]
30. numargs & animacy    [sig. 0.004]
31. modifies & animacy    [sig. 0]
32. modifies & sentposition    [sig. 0]

Table 2. Novel features and interactions ('&' indicates an interaction feature; sig. gives the increase in overall F1 from adding the feature to the baseline features)

5.1 Processing for significance

Testing all possible feature combinations is prohibitively expensive, so we devised two ways to test the significance of each feature, leaving a greater study of significance as future work.

We iteratively removed each feature from the training data and looked at the change in accuracy. If the removal of a feature did not result in a significant change in accuracy, then we considered it to be not significant to the final classification. This is the split of the data in Table 2.

We expected that some of the non-significant features are good indicators of semantic roles, but were correlated strongly enough with other features for there to be no gain from their inclusion. We therefore also looked at the significance of adding each new feature to the baseline features. The resulting increase in overall F1 given by each feature is the sig. value in Table 2.

The final results report the use of features 1-17. Among the features that were not significant, many of them, like subject and possessive, were significant when interacting with other features. It is also likely that the correlation between subject and possessive, and the interactions they take part in, somewhat masked their significance, for the reasons described in the last two paragraphs.

The biggest surprise among the features that were not significant, either alone or in combination, was sentposition. While it was very significant to model whether an argument was in a certain syntactic position, including the sentence position made no difference. This supports the theory that it is the explicit interaction of syntactic properties that produces the distributions in Figure 1.
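The two checks can be expressed as simple loops over the feature set. The sketch below assumes a helper train_and_f1 (not shown) that trains the classifier on examples restricted to the named features and returns F1 on the validation set; it is illustrative rather than our exact tooling.

# Leave-one-out ablation (drop a feature, retrain, compare) and add-one-to-baseline
# testing (the sig. column). train_and_f1 is an assumed helper, not shown here.
def restrict(examples, keep):
    # examples are (feature_dict, gold_label) pairs; keep only the named features
    return [({k: v for k, v in feats.items() if k in keep}, label)
            for feats, label in examples]

def ablation(train, dev, features, train_and_f1, tol=0.001):
    full_f1 = train_and_f1(restrict(train, features), restrict(dev, features))
    not_significant = []
    for feature in features:
        reduced = [f for f in features if f != feature]
        f1 = train_and_f1(restrict(train, reduced), restrict(dev, reduced))
        if full_f1 - f1 < tol:      # removing the feature barely changes accuracy
            not_significant.append(feature)
    return not_significant

def added_f1(train, dev, baseline, new_features, train_and_f1):
    base_f1 = train_and_f1(restrict(train, baseline), restrict(dev, baseline))
    return {feature: train_and_f1(restrict(train, baseline + [feature]),
                                  restrict(dev, baseline + [feature])) - base_f1
            for feature in new_features}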
Figure 3. Accuracy on test items with increasing training set sizes (F1, F1: ARG0,1 and F1: ARG2+ for our system and the L&N baseline; x-axis: percent of training data).

Figure 4. Accuracy over test items with unseen predicate/headword pairs, with increasing training set sizes (same series as Figure 3; x-axis: percent of training data).

6 Results
The ARG0 and ARG1 labels, roughly corresponding to the Agent and Patient roles, made up
the majority of the examples and were more easily
classified than the other labels. We therefore compared the overall F1 values to the F1 values for
ARG0 and ARG1 (ARG0,1), and to the F1 values
for all other labels (ARG2+).
Figure 3 gives the results for different training set sizes, comparing our results to those of the baseline features (L&N), which represent the current state-of-the-art performance (Liu & Ng 2007).
The results show that we consistently outperform the baseline, especially among the less frequent items. Our F1 over the full set was 0.884,
with 0.902 on ARG0,1 and 0.847 on ARG2+. This
corresponds to an increase over the L&N results of
0.012, 0.009 and 0.019 respectively. While this beats the current state-of-the-art results, the margin is modest. Nonetheless, it is a significant increase in accuracy when we take into account its consistency over different training set sizes. The difference between the L&N results and those reported in (Liu & Ng 2007) is negligible, and
probably the result of a slightly different MaxEnt
algorithm and/or NomBank version.
We might have expected slightly better results given the significance of the features. For the most part, this was because the classifier was simply identifying and correctly labeling known predicate/headword pairs. This motivated us to investigate our performance over unseen predicate/argument combinations.
6.1 Unseen predicate/heads
The increase in accuracy is more apparent when
we look only at predicate/head items that do not
occur in the training data, given in Figure 4.
Our analysis revealed that predicate & headword alone classified the test items with about 0.76
F1. On closer inspection, we found that predicate/head pairs that did appear in the training data
were being classified with 0.96 accuracy, and that
this accounted for 75% of the test items. This might make practical implementations of a nominal-SRL system easier in a closed domain, but it is not
very interesting from a research perspective. In
addition, labeling known predicate/head strings is
not a very robust method of classification, and so
we focused on the accuracy of unseen items.
Here, we found a greater relative increase in the
accuracy of our results. Our F1 over the full set
was 0.770, with 0.803 on ARG0,1 and 0.707 on
ARG2+. This corresponds to an increase over the
L&N results of 0.033, 0.021 and 0.058 respectively.
The results in Figure 4 are more indicative of the
robustness of our model than those in Figure 3, and
lead us to conclude that we have successfully
demonstrated that our features better modeled the
data.
There is, however, still plenty of room for improvement, especially with the ARG2+ labels, and so nominal-SRL remains an open and unsolved task. We discuss the remaining errors and possible future strategies in the following sections.
Figure 5. Confidence in the label assigned to test items (the x-axis is simply the ordering of confidence; separate series for correctly and incorrectly classified items).

6.2 Semi-supervised learning
The gradient of the results in Figures 3 and 4 indicates that more training items should produce more
accurate results. Figure 5 shows that semi-supervised learning has the potential to be a successful strategy. 75% of correctly labeled test items
were classified with confidence >0.95, compared
to just 10% of incorrectly labeled items, so it could
be possible to use our test-guesses to achieve a better result without introducing too much noise.
We extended our system to a semi-supervised
model that attempted to bootstrap performance
using the unlabeled test data. We added the test
items that were classified with confidence greater
than c to the training data, with their predicted label, and reclassified the remaining test items. We
repeated this process until no test items could confidently be added. We varied the threshold c and
the initial size of the training data. When the initial
size of the data was < 100%, we allowed the learner to label and add the unseen training items.
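A sketch of this bootstrapping loop is given below; predict_proba is the usual scikit-learn interface, and model is assumed to be a classifier pipeline like the MaxEnt sketch in Section 3, not our exact implementation.

# Self-training: repeatedly add items classified with confidence >= c to the
# training data with their predicted labels, then reclassify the remaining items.
def self_train(model, train_X, train_y, unlabeled_X, c=0.95):
    train_X, train_y = list(train_X), list(train_y)
    pool = list(unlabeled_X)
    while pool:
        model.fit(train_X, train_y)
        probs = model.predict_proba(pool)                # (n_items, n_labels)
        best = probs.max(axis=1)
        labels = model.classes_[probs.argmax(axis=1)]
        confident = [i for i, p in enumerate(best) if p >= c]
        if not confident:
            break                                        # nothing left to add confidently
        train_X += [pool[i] for i in confident]
        train_y += [labels[i] for i in confident]
        pool = [x for i, x in enumerate(pool) if i not in set(confident)]
    model.fit(train_X, train_y)                          # final model
    return model, pool                                   # pool: items never added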
This resulted in <0.01 increase in F1 across the
board, regardless of the confidence threshold, and
so the results are not included here. This seemed to be because the most confidently classified items were from ARG1 or ARG0, which we had the least trouble identifying in the first place, and so it did not improve our prediction of hard-to-classify test items.
Nonetheless, we do not rule out the possibility
that a more sophisticated semi-supervised learning
strategy could produce better results.
7 Discussion

The results show that the intuitions and analysis
of the Data Analysis section proved to be correct.
Of the significant features in Table 2, 14 and 15
use the subject for feature interaction, features 12
and 5 used the animacy feature, and 16 and 17
used the possessive feature. The example below,
which we correctly identify but the baseline misses, combines all three features:
[Salomon 's ARG1] [warrants PRED] are the first here
to be issued by a third party .
Here, Salomon is an organization in the Subject
position realizing a possessive.
Another example of a sentence we gain shows that low animacy in a non-Subject position implies a Patient:
And his outlook improved after successful
[cataract ARG1] [surgery PRED] in August
You would not expect a cataract to be performing the surgery. However, there were some sentences that we misclassified but the baseline classified correctly:
This small [Dallas ARG0] [suburb PRED] 's got trouble
We labeled Dallas as an ARGM-LOC for this sentence, no doubt influenced by the identification of
Dallas as a location, whereas the GOLD label is
ARG0.
Looking at the persistent errors, most were from non-local arguments. Across the corpus, approximately 40% of arguments are realized outside the
maximal projection of the noun-phrase containing
the predicate, and it is not surprising that these
made up the majority of our errors, as they are notoriously difficult to classify. Support verbs that
relate the predicate to the argument are a good potential source of information for semantic roles, but
they will not always be easy to exploit:
[The population of all four states ARG1] is on the
upswing , according to new Census Bureau [estimates PRED] , following declines throughout the
early 1980s .
In order to correctly identify that the argument is the ARG1, it would be necessary to parse a fair
amount of intermediate data, such as the verb
phrase headed by according to, the modification of
estimates by Census Bureau and possibly roles of
the upswing. Examples such as this are outside the abilities of any current nominal-SRL system, but provide an interesting challenge for future work.
References
Carreras, Xavier and Lluís Màrquez, editors. 2005. Proceedings of the CoNLL shared task: Semantic role labeling.

Gildea, D. and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics.

Jiang, Zheng Ping and Hwee Tou Ng. 2006. Semantic role labeling of NomBank: A maximum entropy approach. Proceedings of EMNLP 2006.

Jiang, Zheng Ping, Jia Li, and Hwee Tou Ng. 2005. Semantic argument classification exploiting argument interdependence. Proceedings of IJCAI 2005.

Lapata, Maria. 2002. The disambiguation of nominalisations. Computational Linguistics, 28(3):357-388.

Liu, Chang and Hwee Tou Ng. 2007. Learning predictive structures for semantic role labeling of NomBank. Proceedings of ACL 2007.

Palmer, M., D. Gildea, and P. Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics.

Pradhan, S., H. Sun, W. Ward, J. Martin, and D. Jurafsky. 2004. Parsing arguments of nominalizations in English and Chinese. Proceedings of HLT-NAACL 2004.

Toutanova, K., A. Haghighi, and C. Manning. 2005. Joint learning improves semantic role labeling. Proceedings of ACL 2005.