Robust Semantic Role Labeling for Nominals

Robert Munro
Department of Linguistics
Stanford University
rmunro@stanford.edu

Aman Naimat
Department of Computer Science and School of Business
Stanford University
anaimat@stanford.edu

Abstract

We designed a semantic role labeling system for nominals that consistently outperformed the current state-of-the-art system. Focusing on the core task of classifying the known arguments of nominals, we devised a novel set of features that modeled the syntactic context and animacy of the nominals, reducing the error of the current state-of-the-art system by F1 0.012 on the NomBank corpus and, most significantly, by 0.033 on test items with unseen predicate/headword pairs. This corresponds to an overall reduction in error of 10% and 15% respectively.

1 Introduction

Semantic role labeling (SRL) for verbs is an established task in NLP and a core component of tasks like Information Extraction and Question Answering (Gildea & Jurafsky 2002, Carreras & Màrquez 2005, Toutanova, Haghighi & Manning 2005), but semantic role labeling for nominals has received much less attention (Lapata 2002, Pradhan et al. 2004, Liu & Ng 2007). In general, nominal-SRL has proved to be a more difficult task than verb-SRL. Features that were successful for verb-SRL have not always produced significant results for nominal-SRL, and in general the error rates for nominal-SRL have been at least twice as high (Pradhan et al. 2004, Jiang & Ng 2006).

SRL is typically divided into two tasks: identifying the arguments of a predicate, and classifying those arguments according to their semantic role. These are known as semantic role identification and semantic role classification respectively. The two are typically tested both independently and in combination. In (Liu & Ng 2007) the best reported results for both tasks and their combination are given, tested on the NomBank corpus (Meyers et al. 2004). Constrained by time, in this paper we focus solely on semantic role classification, comparing our results directly to those of (Liu & Ng 2007).

1.1 SRL for nominals

It is easy to demonstrate why nominal-SRL is a more complicated task than verb-SRL:

1) [the police AGT] [investigated PRED] [the crime PAT]
2) [the crime PAT] was [investigated PRED] by [the police AGT]
3) [the police PAT] were [investigated PRED] by [the governor AGT]

Examples (1-3) show semantic roles predicated on a verb, and all three are unambiguous. For the predicate investigate, the Subject of an active sentence is the Agent and the Object is the Patient. For a passive sentence, this is reversed. Provided a verb-SRL system models active/passive sentences, this is uncomplicated. Compare these to some equivalent nominalizations:

4) [The police AGT] filed the report after 3 weeks, causing the governor to declare the [investigation PRED] closed.
5) [The investigation PRED] took 3 weeks.
6) [the crime's PAT] [investigation PRED] ...
7) [the police's AGT] [investigation PRED] ...
8) [The investigation PRED] of [the police AGT/PAT??] took 3 weeks.

Example (4) shows that an argument may be realized some distance from the predicate; example (5) shows that arguments are not obligatorily subcategorized by the predicate; and examples (6-8) show that different roles may be realized in the same syntactic position, and can be inherently ambiguous. Therefore, successful nominal-SRL is a more difficult task than verb-SRL.
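To make the two subtasks concrete, the following is a minimal Python sketch (ours, not the authors') of how a role-labeled nominal instance and the identification/classification split might be represented; all class and function names here are hypothetical illustrations rather than part of the system described in this paper.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class Constituent:
        text: str                   # surface string, e.g. "the police"
        span: Tuple[int, int]       # (start, end) token offsets within the sentence

    @dataclass
    class NominalInstance:
        sentence: List[str]         # the tokenized sentence
        predicate: Constituent      # the nominal predicate, e.g. "investigation"
        candidate: Constituent      # a candidate argument constituent
        role: Optional[str] = None  # a gold label such as "ARG0", or None if not an argument

    def identify(instance: NominalInstance) -> bool:
        """Semantic role identification: is this constituent an argument of the predicate?"""
        raise NotImplementedError   # e.g. a binary classifier over syntactic features

    def classify(instance: NominalInstance) -> str:
        """Semantic role classification: which role does a known argument realize?"""
        raise NotImplementedError   # e.g. the MaxEnt classifier of Section 3

This paper is concerned only with the second function: assigning a role to an argument that is already known to be one.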
1.2 Our contribution

We report on new features, and interactions of features, that consistently improved the accuracy of semantic role classification for nominals. In particular, we demonstrate that new features modeling the syntactic context of the nominals improve the accuracy of semantic role classification systems, especially over unseen predicate/argument pairs, and report on the distribution of such structures across the labels in NomBank. From some basic strategies for modeling a predicate's arguments holistically, we find that modeling the relative animacy of the arguments improves the classification of their roles, especially in combination with other features. We report on the influence of training-set size on classification accuracy, and the relative success in classifying unseen predicate/argument pairs.

2 NomBank

The NomBank corpus uses the PropBank set of labels (Palmer et al. 2005) to annotate all arguments of nominals in the Wall Street Journal corpus. It differs from PropBank in two ways. Firstly, an argument may overlap the predicate: for example, investigator is both the predicate and its own Agent. Secondly, it is possible for an argument to realize two roles: for example, truckdriver contains both the Agent driver and the Patient truck.

There are 20 labels in NomBank: ARG0, ARG1, ARG2, ARG3, ARG4, ARG5, ARG8, ARG9, ARGM-ADV, ARGM-CAU, ARGM-DIR, ARGM-DIS, ARGM-EXT, ARGM-LOC, ARGM-MNR, ARGM-MOD, ARGM-NEG, ARGM-PNC, ARGM-PRD, and ARGM-TMP. This results in sentences like:

[The police's ARG0] [investigation PRED] [of a crime ARG1] [in Boston ARGM-LOC] [took 3 weeks ARGM-TMP].

We followed the standard split of the corpus, training on sections 2-22, validating on section 24, and testing on section 23.

3 Maximum entropy classifier

Maximum entropy (MaxEnt) classifiers have been the staple for nominal-SRL (Jiang & Ng 2006). More sophisticated discriminative learning algorithms have also been used, including multitask linear optimizers and alternating structure optimizers, but they have not significantly improved the results for semantic role classification over MaxEnt classifiers (Liu & Ng 2007), and so we used the Stanford MaxEnt classifier. The objective function was the conditional log-likelihood:

    L(\lambda) = \sum_i \log P_\lambda(y_i \mid x_i), \quad P_\lambda(y \mid x) = \frac{\exp\left(\sum_j \lambda_j f_j(x, y)\right)}{\sum_{y'} \exp\left(\sum_j \lambda_j f_j(x, y')\right)}

and the derivatives of the log likelihood correspond to the difference between empirical and expected feature counts:

    \frac{\partial L}{\partial \lambda_j} = \sum_i f_j(x_i, y_i) - \sum_i \sum_y P_\lambda(y \mid x_i) f_j(x_i, y)

The objective function was smoothed by assuming a Gaussian distribution over the weights, and penalized accordingly:

    L'(\lambda) = L(\lambda) - \sum_j \frac{\lambda_j^2}{2\sigma^2}

where the derivatives were adjusted by:

    \frac{\partial L'}{\partial \lambda_j} = \frac{\partial L}{\partial \lambda_j} - \frac{\lambda_j}{\sigma^2}

Brief experimentation with Naïve Bayes and KNN classifiers produced much less accurate results. For future work, it would be interesting to compare our results across a greater number of classifiers, and with joint learning of the labels (Toutanova et al. 2005).
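As a concrete illustration of the objective above, here is a minimal NumPy sketch of the Gaussian-penalized log-likelihood and its gradient. The actual system used the Stanford MaxEnt classifier, so this re-implementation is only illustrative, not the authors' code.

    import numpy as np

    def penalized_log_likelihood(weights, X, y, sigma=1.0):
        """Conditional log-likelihood of a MaxEnt (multinomial logistic) model
        with a Gaussian (L2) penalty, plus its gradient.

        X: (n_examples, n_features) feature matrix
        y: (n_examples,) integer class labels
        weights: (n_classes, n_features) parameter matrix
        """
        scores = X @ weights.T                       # (n, K) unnormalized log-scores
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)    # P(y | x) for every class

        n = X.shape[0]
        ll = np.log(probs[np.arange(n), y]).sum() - (weights ** 2).sum() / (2 * sigma ** 2)

        # gradient: empirical feature counts minus expected counts, minus the penalty term
        empirical = np.zeros_like(weights)
        for k in range(weights.shape[0]):
            empirical[k] = X[y == k].sum(axis=0)
        expected = probs.T @ X
        grad = empirical - expected - weights / sigma ** 2
        return ll, grad

A gradient-based optimizer such as L-BFGS would then maximize this penalized likelihood over the training data.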
[Figure 1: The relative frequency of different roles (ARG0, ARG1, ARG2+, ARGM) in different sentence syntactic positions (Sentence Subject, Direct Object, Other position).]

4 Data analysis

No previous attempt at modeling the semantic roles of nominals has looked at the broader syntactic context of the constructions. Almost immediately, our data analysis revealed strong tendencies for different roles to be realized in different sentence positions. We found strong likelihoods for certain roles to appear when the argument is realized in the Subject, Object or Adjunct position of a sentence. For example:

The police's [investigation PRED] took 3 weeks. (Subject)
They reported the police's [investigation PRED]. (Direct Object)
The case was closed after the police's [investigation PRED]. (Adjunct)

Figure 1 gives the distributions of sentence positions for each role type in the training data. With the exception of the ARG0/ARGM distinction for other positions, within each sentence position the results are significant at p < 0.001 by a χ² test. It is clear from the graph that the differences are substantial. An argument in the Subject position is 55% likely to be an ARG0, only 10% likely to be one of the ARG2+ roles, and not at all likely to be one of the ARGM roles. For a Direct Object, on the other hand, an ARGM is almost 50% likely, and for other syntactic positions ARG1 is >50% likely.

When running the baseline system, sentences with possessives were often misclassified, such as:

[P&G's ARG0] [share PRED] of the Japanese market

This led us to create a feature modeling whether the constituent was a possessive.

[Figure 2: The relative frequency of possessives realizing different types of named entities (PER, ORG, LOC).]

The animacy of constituents is also correlated with the distribution of semantic roles, as more animate constituents will tend to more frequently realize Agents, and therefore ARG0s. More general knowledge of named entities will also help identify locations, and therefore ARGM-LOCs. We defined features based on the animacy of the constituent using gazetteers of named entities and pronouns, and modeled an order of animacy from people to objects, based on the presence of known entities, proper nouns or nouns in a constituent. Figure 2 shows the intersection of the animacy and possessive features, clearly showing that possessives realizing a person or organization are more likely to take the ARG0 role.

The semantic roles of a given predicate are highly interdependent (Jiang et al. 2005, Toutanova et al. 2005). For example, if we know that an ARG0 already exists for a predicate, then the chances of another ARG0 for that predicate are greatly reduced. For reasons of time constraints, we did not implement the joint learning of (Toutanova et al. 2005), which has the best reported results for verb-SRL. However, we did define features that took into account all arguments for a given predicate, capturing at least part of the interdependency. Using the animacy feature, we included a feature to indicate whether a given argument was the (equal) most animate, medially animate, or least animate of the arguments for the current predicate. We also included the total number of arguments for the given predicate, as predicates with only one argument were more likely to realize either ARG0 or ARG1, while predicates with a high number of arguments were evidence that a low-frequency role might be present.
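The following is a rough sketch of how the possessive, animacy, and relative-animacy features just described might be computed for a constituent. The gazetteers here are tiny placeholders: the paper describes gazetteers of named entities and pronouns but does not publish them, so everything below is an assumption rather than the authors' code.

    # Hypothetical gazetteers standing in for the named-entity and pronoun lists.
    PERSON_PRONOUNS = {"i", "you", "he", "she", "we", "they", "who"}
    KNOWN_PERSONS = {"john", "mary"}
    KNOWN_ORGS = {"p&g", "census bureau"}
    KNOWN_LOCS = {"boston", "dallas"}

    def animacy(tokens):
        """Map a constituent to a coarse animacy rank: person > organization > location > other."""
        words = {t.lower() for t in tokens}
        if words & (PERSON_PRONOUNS | KNOWN_PERSONS):
            return 3                  # person: most animate
        if words & KNOWN_ORGS:
            return 2                  # organization
        if words & KNOWN_LOCS:
            return 1                  # location
        return 0                      # other proper noun or noun

    def is_possessive(tokens):
        """Possessive constituents such as "our", "her", "its", or anything ending in 's."""
        last = tokens[-1].lower()
        return last in {"our", "her", "its", "his", "their", "my"} or last.endswith("'s")

    def relative_animacy(constituent, other_arguments):
        """(Equal) highest, medial, or lowest animacy among the predicate's arguments."""
        mine = animacy(constituent)
        others = [animacy(a) for a in other_arguments] or [mine]
        if mine >= max(others):
            return "highest"
        if mine <= min(others):
            return "lowest"
        return "medial"

In the full system the gazetteer lookups would be replaced by the actual named-entity and pronoun lists, and the numargs feature would simply be len(other_arguments) + 1.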
5 Features

We used the features of (Liu & Ng 2007) as a baseline, as these produced the previous highest reported results on this task. These features are given in Table 1, and give the results labeled L&N in the Results section. Table 2 gives the new features that we developed based on the observations discussed in the previous section. As the interaction of features is vital to accurate classification, we defined a number of interaction features; in the final set, there were more interaction features than regular ones. The final results report the use of features 1-17.

1. position: whether the constituent is to the left of, to the right of, or overlaps with the predicate
2. ptype: phrase type of the constituent C
3. firstword: first word spanned by C
4. lastword: last word spanned by C
5. ptype.r: phrase type of the right sister
6. nomtype: the NOM-TYPE of the predicate, as defined by the NOMLEX dictionary
7. predicate & ptype
8. predicate & lastword
9. predicate & headword
10. predicate & position
11. nomtype & position

Table 1. Baseline features ('&' indicates an interaction feature)

Features significant to the final classification:
12. relativeanimacy: the animacy of the constituent relative to the animacy of the other arguments for the current predicate: (equal) highest / medial / lowest — sig. 0.002
13. head: the headword of the constituent — sig. 0.001
14. subject & position — sig. 0.002
15. predicate & animacy & subject — sig. 0.006
16. possessive & nomtype — sig. 0.001
17. possessive & position & animacy — sig. 0.001

Features not significant to the final classification:
18. subject: whether the constituent is in the sentence Subject position — sig. 0.001
19. possessive: whether the constituent is a possessive (e.g., our, her, its, -'s) — sig. 0
20. sentposition: whether the constituent is in the topic position, in the 1st/2nd half of the sentence, or is sentence final — sig. 0
21. animacy: whether the constituent is a person, organization, location, other (unknown) proper noun, or a noun — sig. 0
22. predsyn: syntactic category of the predicate — sig. 0
23. numargs: the number of arguments for the current predicate — sig. 0.004
24. modifies: whether the constituent is a premodifier of the predicate — sig. 0
25. highestanimacy: the highest animacy of all the arguments for the current predicate — sig. 0.003
26. sentposition & position — sig. 0
27. pred & animacy — sig. 0.004
28. nomtype & animacy — sig. 0
29. possessive & position — sig. 0
30. numargs & animacy — sig. 0.004
31. modifies & animacy — sig. 0
32. modifies & sentposition — sig. 0

Table 2. Novel features and interactions ('&' indicates an interaction feature; sig. is the increase in overall F1 when the feature is added to the baseline, see Section 5.1)

5.1 Processing for significance

Testing all possible feature combinations is prohibitively expensive, so we devised two ways to test the significance of each feature, leaving a greater study of significance as future work.

First, we iteratively removed each feature from the training data and looked at the change in accuracy. If the removal of a feature did not result in a significant change in accuracy, then we considered it not significant to the final classification. This is the basis of the split in Table 2.

We expected that some of the non-significant features were good indicators of semantic roles, but were correlated strongly enough with other features for there to be no gain from their inclusion. We therefore also looked at the significance of adding each new feature to the baseline features. The resulting increase in overall F1 given by each feature is reported in the sig. column of Table 2.

Among the features that were not significant, many of them, like subject and possessive, were significant when interacting with other features. It is also likely that the correlation between subject and possessive, and the interactions they take part in, somewhat masked their significance, for the reasons described in the last two paragraphs. The biggest surprise among the features that were not significant, either alone or in combination, was sentposition. While it was very significant to model whether an argument was in a certain syntactic position, including the sentence position made no difference. This supports the theory that it is the explicit interaction of syntactic properties that produces the distributions in Figure 1.
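A schematic version of the two significance tests described in Section 5.1 might look like the following. The helper train_and_eval and the threshold value are hypothetical stand-ins for training the MaxEnt classifier on a feature set and returning overall F1 on the validation data; the paper does not specify these details.

    def feature_significance(all_features, baseline_features, train_and_eval, threshold=0.001):
        """Two simple significance tests over a feature set, as sketched in Section 5.1.

        train_and_eval(features) -> overall F1 on the validation set (hypothetical helper).
        """
        full_f1 = train_and_eval(all_features)

        # Test 1: remove each feature in turn; a negligible drop means "not significant".
        significant = {}
        for feat in all_features:
            reduced_f1 = train_and_eval([f for f in all_features if f != feat])
            significant[feat] = (full_f1 - reduced_f1) > threshold

        # Test 2: add each new feature to the baseline alone and record the F1 gain
        # (this gain is what the sig. column of Table 2 reports).
        gain = {}
        for feat in all_features:
            if feat in baseline_features:
                continue
            gain[feat] = train_and_eval(baseline_features + [feat]) - train_and_eval(baseline_features)

        return significant, gain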
[Figure 3: Accuracy on test items with increasing training set sizes (F1, F1: ARG0,1 and F1: ARG2+ for our system vs. the L&N baseline, from 1% to 100% of the training data).]

[Figure 4: Accuracy over test items with unseen predicate/headword pairs, with increasing training set sizes (same measures as Figure 3).]

6 Results

The ARG0 and ARG1 labels, roughly corresponding to the Agent and Patient roles, made up the majority of the examples and were more easily classified than the other labels. We therefore compared the overall F1 values to the F1 values for ARG0 and ARG1 (ARG0,1), and to the F1 values for all other labels (ARG2+). Figure 3 gives the results for different training set sizes, comparing our results to those of the baseline features (L&N), which represent the current state-of-the-art performance (Liu & Ng 2007). The results show that we consistently outperform the baseline, especially among the less frequent items.

Our F1 over the full set was 0.884, with 0.902 on ARG0,1 and 0.847 on ARG2+. This corresponds to an increase over the L&N results of 0.012, 0.009 and 0.019 respectively. While this beats the current state-of-the-art results, it does not blow them out of the water. Nonetheless, it is a significant increase in accuracy when we take into account the consistency over different training set sizes. The difference between the L&N results reported here and those reported in (Liu & Ng 2007) is negligible, and probably the result of a slightly different MaxEnt algorithm and/or NomBank version.

We might have expected slightly better results given the significance of the features. For the most part, this was because the classifier was simply identifying and correctly labeling a known predicate/headword pair. This motivated us to investigate our performance over unseen combinations of predicates and arguments.

6.1 Unseen predicate/heads

The increase in accuracy is more apparent when we look only at predicate/head items that do not occur in the training data, given in Figure 4. Our analysis revealed that predicate & headword alone classified the test items with about 0.76 F1. On closer inspection, we found that predicate/head pairs that did appear in the training data were being classified with 0.96 accuracy, and that these accounted for 0.75 of the test items. This might make practical implementations of nominal-SRL easier in a closed domain, but it is not very interesting from a research perspective. In addition, labeling known predicate/head strings is not a very robust method of classification, and so we focused on the accuracy of unseen items.

Here, we found a greater relative increase in the accuracy of our results. Our F1 over the full set was 0.770, with 0.803 on ARG0,1 and 0.707 on ARG2+. This corresponds to an increase over the L&N results of 0.033, 0.021 and 0.058 respectively. The results in Figure 4 are more indicative of the robustness of our model than those in Figure 3, and lead us to conclude that we have successfully demonstrated that our features better model the data. There is still plenty of room for improvement, however, especially with the ARG2+ labels, and so nominal-SRL remains an open and unsolved task. We discuss the remaining errors and possible future strategies in the following sections.
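A minimal sketch of the seen/unseen evaluation split used above follows. The field names and the use of micro-averaged F1 (which, with exactly one gold and one predicted label per argument, reduces to accuracy) are our assumptions about the setup rather than details given in the paper.

    def split_by_seen(test_items, train_pairs):
        """Partition test items by whether their (predicate, headword) pair occurred in training.

        test_items: iterable of dicts with 'predicate', 'headword', 'gold', 'predicted'.
        train_pairs: set of (predicate, headword) tuples observed in the training data.
        """
        seen, unseen = [], []
        for item in test_items:
            bucket = seen if (item["predicate"], item["headword"]) in train_pairs else unseen
            bucket.append(item)
        return seen, unseen

    def micro_f1(items, labels=None):
        """Micro-averaged F1 over role labels, optionally restricted to a label group
        such as {"ARG0", "ARG1"}; equivalent to accuracy in this single-label setting."""
        if labels is not None:
            items = [i for i in items if i["gold"] in labels]
        correct = sum(1 for i in items if i["gold"] == i["predicted"])
        return correct / len(items) if items else 0.0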
[Figure 5: Confidence in the label assigned to test items, for correctly and incorrectly classified items (the x-axis is simply the ordering of confidence).]

6.2 Semi-supervised learning

The gradient of the results in Figures 3 and 4 indicates that more training items should produce more accurate results. Figure 5 shows that semi-supervised learning has the potential to be a successful strategy: 75% of correctly labeled test items were classified with confidence >0.95, compared to just 10% of incorrectly labeled items, so it could be possible to use our test-set guesses to achieve a better result without introducing too much noise.

We extended our system to a semi-supervised model that attempted to bootstrap performance using the unlabeled test data. We added the test items that were classified with confidence greater than c to the training data, with their predicted labels, and reclassified the remaining test items. We repeated this process until no test items could confidently be added. We varied the threshold c and the initial size of the training data. When the initial size of the data was <100%, we also allowed the learner to label and add the unseen training items. This resulted in a <0.01 increase in F1 across the board, regardless of the confidence threshold, and so the results are not included here. This seemed to be because the most confidently classified items were from ARG1 or ARG0, which we had the least trouble identifying in the first place, and so it did not improve our prediction of hard-to-classify test items. Nonetheless, we do not rule out the possibility that a more sophisticated semi-supervised learning strategy could produce better results.
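The bootstrapping loop just described could be sketched as follows; train and predict_with_confidence are hypothetical stand-ins for the MaxEnt classifier, and the default threshold is only a placeholder for the values of c that were varied in the experiments.

    def self_train(train_items, test_items, train, predict_with_confidence, c=0.95):
        """Bootstrapping loop of Section 6.2: add test items classified with confidence
        greater than c to the training data (with their predicted labels) and re-classify
        the rest, repeating until nothing more can be confidently added.

        train(labeled_items) -> model, and predict_with_confidence(model, item)
        -> (label, confidence) are hypothetical helpers.
        """
        labeled = list(train_items)          # (item, label) pairs
        remaining = list(test_items)
        predictions = {}                     # latest predicted label per test item

        while remaining:
            model = train(labeled)
            confident, unsure = [], []
            for item in remaining:
                label, confidence = predict_with_confidence(model, item)
                predictions[id(item)] = label
                (confident if confidence > c else unsure).append((item, label))
            if not confident:
                break                        # no item passes the confidence threshold
            labeled.extend(confident)        # treat confident guesses as training data
            remaining = [item for item, _ in unsure]

        return predictions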
7 Discussion

The results show that the intuitions and analysis of the Data Analysis section proved to be correct. Of the significant features in Table 2, 14 and 15 use the subject feature in an interaction, features 12 and 15 use the animacy feature, and 16 and 17 use the possessive feature. The example below, which we correctly identify but the baseline misses, combines all three:

[Salomon 's ARG1] [warrants PRED] are the first here to be issued by a third party .

Here, Salomon is an organization in the Subject position realizing a possessive. Another sentence that we gain over the baseline shows that low animacy in a non-Subject position implies a Patient:

And his outlook improved after successful [cataract ARG1] [surgery PRED] in August

You would not expect a cataract to be performing the surgery. However, there were some sentences that we misclassified that the baseline got correct:

This small [Dallas ARG0] [suburb PRED] 's got trouble

We labeled Dallas as an ARGM-LOC for this sentence, no doubt influenced by the identification of Dallas as a location, whereas the gold label is ARG0.

Looking at the persistent errors, most were from non-local arguments. Across the corpus, approximately 40% of arguments are realized outside the maximal projection of the noun phrase containing the predicate, and it is not surprising that these made up the majority of our errors, as they are notoriously difficult to classify. Support verbs that relate the predicate to the argument are a good potential source of information for semantic roles, but they will not always be easy to exploit:

[The population of all four states ARG1] is on the upswing , according to new Census Bureau [estimates PRED] , following declines throughout the early 1980s .

In order to correctly identify that the argument is the ARG1, it would be necessary to parse a fair amount of intermediate structure, such as the verb phrase headed by according to, the modification of estimates by Census Bureau, and possibly the roles of the upswing. Examples such as this are outside the abilities of any current nominal-SRL system, but provide an interesting challenge for future work.

References

Carreras, Xavier and Lluís Màrquez, editors. 2005. Proceedings of the CoNLL-2005 shared task: Semantic role labeling.

Gildea, Daniel and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics.

Jiang, Zheng Ping and Hwee Tou Ng. 2006. Semantic role labeling of NomBank: A maximum entropy approach. In Proceedings of EMNLP 2006.

Jiang, Zheng Ping, Jia Li, and Hwee Tou Ng. 2005. Semantic argument classification exploiting argument interdependence. In Proceedings of IJCAI 2005.

Lapata, Maria. 2002. The disambiguation of nominalisations. Computational Linguistics, 28(3):357-388.

Liu, Chang and Hwee Tou Ng. 2007. Learning predictive structures for semantic role labeling of NomBank. In Proceedings of ACL 2007.

Meyers, A., R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman. 2004. The NomBank project: An interim report. In Proceedings of the HLT-NAACL Workshop on Frontiers in Corpus Annotation.

Palmer, Martha, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics.

Pradhan, S., H. Sun, W. Ward, J. Martin, and D. Jurafsky. 2004. Parsing arguments of nominalizations in English and Chinese. In Proceedings of HLT-NAACL 2004.

Toutanova, Kristina, Aria Haghighi, and Christopher Manning. 2005. Joint learning improves semantic role labeling. In Proceedings of ACL 2005.