Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Learning Cause Identifiers from Annotator Rationales

Muhammad Arshad Ul Abedin and Vincent Ng and Latifur Rahman Khan
Department of Computer Science
University of Texas at Dallas
Richardson, TX 75080-3021
{arshad,vince,lkhan}@utdallas.edu

Abstract

In the aviation safety research domain, cause identification refers to the task of identifying the possible causes responsible for the incident described in an aviation safety incident report. This task presents a number of challenges, including the scarcity of labeled data and the difficulties in finding the relevant portions of the text. We investigate the use of annotator rationales to overcome these challenges, proposing several new ways of utilizing rationales and showing that through judicious use of the rationales, it is possible to achieve significant improvement over a unigram SVM baseline.
1 Introduction

The Aviation Safety Reporting System (ASRS) was established by NASA in 1976 to collect voluntarily submitted reports about aviation safety incidents written by flight crews, attendants, controllers and other related parties. Each report contains a free-text narrative that describes, among other things, the cause of the incident. Knowing the causes of the incidents can play a significant role in developing aviation safety measures. As a result, we seek to automatically identify the causes of the incident described in a report narrative from a set of 14 possible causes, or shaping factors, identified by the NASA researchers [Posse et al., 2005]. Since several factors may have caused an incident, cause identification is a multi-class, multi-labeled text classification problem.

This under-studied problem presents two major challenges to NLP researchers. One is the scarcity of labeled data: the original ASRS reports were not labeled with the shaping factors, and only a small portion of them has so far been manually annotated, which makes it difficult for a supervised learning algorithm to generate an accurate model for all the shaping factors. Another is that the narratives chiefly describe the incidents themselves, while the parts of a report discussing the actual causes may be very small, which means the learner has to be careful as to which features it chooses as useful. This gives rise to two challenging questions: (1) how to solve the paucity of labeled data problem, and (2) how to enable the learner to make better use of the small pieces of clues lying in the text. One way to solve the data paucity problem is, of course, to annotate more data, but obtaining human-annotated data can be expensive.

Interestingly, using annotator rationales may provide an answer to these two questions. An annotator rationale, as used by Zaidan et al. [2007], is the fragment of text that motivated an annotator to assign a particular label to a document. In their work on classifying the sentiment expressed in movie reviews as positive or negative, they generate additional training instances by removing rationales from documents. Since these pseudo-examples, known as contrast examples, lack information that the annotators regarded as important, the learning algorithm should be less confident about the label of these weaker instances. A learner that successfully learns this difference in confidence assigns a higher importance to the pieces of text that are present only in the original instances. Thus the contrast examples help the learning algorithm both by providing an indication of which parts of the documents are important and by increasing the number of available training instances. However, their approach explores only one way such rationales can be utilized.

We propose two ways to utilize the rationales for cause identification in a supervised learning framework. Specifically, the rationales are used to generate (1) new features and (2) a new type of pseudo-examples for the learner. This new type of pseudo-examples, called the residue examples, complements Zaidan et al.'s [2007] contrast examples in that they contain only the rationales present in a document. We motivate the use of residue examples in Section 4. Experimental results on a set of manually annotated ASRS reports suggest that rationales are indeed helpful for cause identification. Using rationales as features in addition to generating both types of pseudo-examples, we find an improvement of over 7% in absolute F-score compared to a unigram SVM baseline that does not employ rationales.

The rest of the paper is organized as follows. In Section 2 we discuss the ASRS dataset. Section 3 describes Zaidan et al.'s [2007] framework for exploiting annotator rationales. We discuss and evaluate our own approach to utilizing the rationales in the cause identification problem in Sections 4 and 5, respectively. Finally, we perform a detailed error analysis in Section 6 and conclude in Section 7.

2 Dataset

The data set used here is based on the ASRS data set previously created by us [Abedin et al., 2010]. It contains 1333 reports labeled with shaping factors [1] (see the column "Shaping Factor" in Table 1 for the list of the 14 shapers [2]). These reports are divided into three subsets: a training set containing 233 reports, a development set containing 100 reports and a test set containing 1000 reports. Since we need reports annotated with rationales, we decided to generate some additional training data, choosing 1000 unlabeled reports randomly from those we provided in Abedin et al. [2010] and adding them to the training set. Then we annotated the newly-added reports with shaping factors, and identified annotation rationales for all the reports in the training set. Below, we discuss the procedures for annotating a report with shaping factors (Section 2.1) and rationales (Section 2.2).

[1] See http://www.utdallas.edu/~maa056000/asrs.html
[2] Space limitations preclude the inclusion of the definitions of these shaping factors. See Abedin et al. [2010] for details.

2.1 Annotation With Shaping Factors

We followed the same annotation procedure described in Abedin et al. [2010] to assign shaping factor labels to the 1000 unlabeled reports that we added to the training set. One author (A1) and one student worker (A2) independently annotated 100 randomly chosen unlabeled reports, using the descriptions of the shaping factors in Posse et al. [2005] and Ferryman et al. [2006]. The inter-annotator agreement, calculated using Krippendorff's [2004] α statistic as described by Artstein and Poesio [2008] with the MASI scoring metric [Passonneau, 2004], was found to be 0.82. The annotators then resolved each disagreement through discussion. Given the high agreement rate, we had A1 annotate the remaining reports. Thus, at the end of the annotation process, we had a training set with 1233 reports labeled with shaping factors.
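As a rough illustration of how such an agreement figure can be obtained (the paper does not describe the tooling used), the following is a minimal sketch that computes Krippendorff's α with the MASI distance using NLTK's agreement module; the report identifiers and shaping-factor label sets below are hypothetical.

```python
# Sketch only: Krippendorff's alpha over multi-label annotations with the
# MASI distance between label sets, via NLTK.  Not the authors' code.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Each record is (annotator, report_id, set_of_shaping_factor_ids).
# Label sets must be hashable, hence frozenset.  All values are made up.
annotations = [
    ("A1", "report_1", frozenset({6, 12})),
    ("A2", "report_1", frozenset({6})),
    ("A1", "report_2", frozenset({8})),
    ("A2", "report_2", frozenset({8})),
]

task = AnnotationTask(data=annotations, distance=masi_distance)
print("Krippendorff's alpha (MASI):", task.alpha())
```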
2.2 Annotation With Rationales

The same two annotators, A1 and A2, went through the same 100 reports and answered the question below for each report: For each shaping factor identified for the incident described in the report, is there a fragment of text that is indicative of that shaping factor? If so, the fragment is marked as a rationale. We provide A1 and A2 with the same instructions that Zaidan et al. had for their annotators for marking up the rationales (see Section 4.1 of their paper). Briefly, the annotators are asked to "do their best to mark enough rationales to provide convincing support for the class of interest", but are not expected to "go out of their way to mark everything".

The inter-annotator agreement for the rationales was calculated in the same manner as Zaidan et al. [2007], where two rationales are considered overlapping if they have at least one word in common. In the 100 reports, A1 annotated 257 rationales, 54.5% of which overlap with A2's rationales, and A2 identified 182 rationales, with a 76.9% overlap. Statistics for all the classes are shown in Table 2. A1 then proceeded to annotate the rationales for the remaining 1133 reports in the training set, resulting in a total of 5482 rationales over the 1233 training reports, for an average of 4.4 rationales per document. To get a better sense of these rationales, we list the five most frequently-occurring rationales annotated for each shaping factor in Table 1. Since rationales are meant to improve classifier acquisition, the test reports do not require (and therefore do not contain) any rationale annotations.

Table 2: Inter-annotator agreements for rationales.
Class    A1 Rationales   Overlap with A2   A2 Rationales   Overlap with A1
1            12              75.0%               9              100.0%
2            22              77.3%              17              100.0%
3             6              50.0%               3              100.0%
4             8              87.5%               9               77.8%
5             0               0.0%               0                0.0%
6            26              42.3%              15               73.3%
7            42              50.0%              34               61.8%
8            16              75.0%              12              100.0%
9             9              44.4%               9               44.4%
10           10             100.0%              11               90.9%
11           25              32.0%              12               66.7%
12           64              37.5%              28               85.7%
13           11              72.7%              10               80.0%
14            6             100.0%              13               46.2%
Total       257              54.5%             182               76.9%

Table 1: Five most frequently-occurring rationales associated with each shaping factor and their frequencies of occurrence.
Id   Shaping Factor               Rationales
1    Attitude                     attitude (24), complacency (13), complacent (6), playing a game (2), overconfident (2)
2    Communication Environment    noise (28), no response (18), did not hear (14), static (12), congestion (12)
3    Duty Cycle                   last leg (7), last night (4), reduced rest (3), longitude duty days (2), longitude duty day (2)
4    Familiarity                  new (123), unfamiliar (16), aligned (9), not familiar (5), very familiar (4)
5    Illusion                     bright lights (2), wall of white (1), black hole (1)
6    Physical Environment         weather (144), visibility (77), turbulence (51), clouds (43), winds (38)
7    Physical Factors             fatigue (25), tired (14), sick (9), fatigued (7), very tired (5)
8    Preoccupation                busy (78), attention (76), distraction (30), distracted (17), DISTRS (9)
9    Pressure                     late (56), pressure (45), expedite (14), short time (12), rushed (12)
10   Proficiency                  training (57), mistake (42), inadvertently (27), mistakes (18), forgotten (11)
11   Resource Deficiency          off (410), more (194), further (133), damage (85), warning (84)
12   Taskload                     very busy (13), solo (13), extremely busy (9), many aircraft (8), single pilot (7)
13   Unexpected                   surprised (14), suddenly (13), unexpected (7), unusual event (2), unknown to me (2)
14   Other                        Resolution Advisory (59), confusion (58), confused (21), confusing (13), UNCLR (9)
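The rationale agreement measure described above (a rationale counts as matched if it shares at least one word with some rationale of the other annotator) is simple enough to sketch directly; the rationale strings below are hypothetical and the tokenization is a simplification.

```python
# Sketch of the word-overlap criterion for rationale agreement in
# Section 2.2; not the authors' code.
def overlaps(r1: str, r2: str) -> bool:
    """Two rationales overlap if they share at least one word."""
    return bool(set(r1.lower().split()) & set(r2.lower().split()))

def overlap_rate(rationales_a, rationales_b) -> float:
    """Fraction of annotator A's rationales that overlap with any of B's."""
    if not rationales_a:
        return 0.0
    matched = sum(1 for ra in rationales_a
                  if any(overlaps(ra, rb) for rb in rationales_b))
    return matched / len(rationales_a)

# Hypothetical rationales for one report/shaper pair.
a1 = ["very tired", "reduced rest"]
a2 = ["tired after a long day"]
print(overlap_rate(a1, a2))  # 0.5: only "very tired" shares a word with A2
```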
3 Rationales in Sentiment Classification

In this section, we describe Zaidan et al.'s [2007] approach to training an SVM with annotator rationales for classifying the sentiment of a movie review as positive or negative. Let $x_i$ be the vector representation of document $R_i$. Given the rationale annotations on a positive example $x_i$ for the SVM learner, Zaidan et al. construct one or more not-so-positive contrast examples $v_{ij}$. Specifically, they create $v_{ij}$ by removing rationale $r_{ij}$ from $R_i$. Since $v_{ij}$ lacks evidence that an annotator found relevant for the classification task, the correct SVM model should be less confident of a positive classification on $v_{ij}$. This idea can be implemented by imposing additional constraints on the correct SVM model, which can be defined in terms of a weight vector $w$. Recall that the usual SVM constraint on positive example $x_i$ is $w \cdot x_i \geq 1$, which ensures that $x_i$ is on the positive side of the hyperplane. In addition to these usual constraints, we desire that for each $j$, $w \cdot x_i - w \cdot v_{ij} \geq \mu$, where $\mu \geq 0$. Intuitively, this constraint specifies that $x_i$ should be classified as more positive than $v_{ij}$ by a margin of at least $\mu$.

Let us define the annotator rationale framework more formally. Recall that in a standard soft-margin SVM, the goal is to find $w$ and $\xi$ to minimize

    $\frac{1}{2}|w|^2 + C \sum_i \xi_i$

subject to

    $\forall i: \; c_i (w \cdot x_i) \geq 1 - \xi_i, \quad \xi_i \geq 0$,

where $x_i$ is a training example; $c_i \in \{-1, 1\}$ is the class label of $x_i$; $\xi_i$ is a slack variable that allows $x_i$ to be misclassified if necessary; and $C > 0$ is the misclassification penalty. To enable this standard soft-margin SVM to also learn from contrast examples, Zaidan et al. add contrast constraints:

    $\forall i, j: \; w \cdot x_i - w \cdot v_{ij} \geq \mu (1 - \xi'_{ij})$,

where $v_{ij}$ is the contrast example created from $R_i$ by removing rationale $r_{ij}$, and $\xi'_{ij} \geq 0$ is the slack variable associated with $v_{ij}$. These new slack variables should have their own misclassification cost, so a new term is added to the SVM objective function, which becomes:

    $\frac{1}{2}|w|^2 + C \sum_i \xi_i + C' \sum_{i,j} \xi'_{ij}$

Here, $C'$ is the misclassification cost associated with the new slack variables $\xi'_{ij}$.
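For readability, the objective and the two families of constraints above can be collected into a single optimization problem. This is only a restatement of the formulation just given (written, as in the text, for the positive examples that carry rationales), not an addition to it:

```latex
\min_{w,\,\xi,\,\xi'} \;\; \frac{1}{2}|w|^{2} + C \sum_{i} \xi_i + C' \sum_{i,j} \xi'_{ij}
\qquad \text{subject to} \qquad
\begin{cases}
c_i \,(w \cdot x_i) \;\geq\; 1 - \xi_i, & \xi_i \geq 0 \quad \forall i,\\[2pt]
w \cdot x_i - w \cdot v_{ij} \;\geq\; \mu\,(1 - \xi'_{ij}), & \xi'_{ij} \geq 0 \quad \forall i, j.
\end{cases}
```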
4 Rationales in Cause Identification

Before describing how we employ rationales for cause identification, recall that cause identification is a 14-class classification task. To employ Zaidan et al.'s annotator rationale framework, which relies on SVM training and hence assumes only positive and negative examples, we recast cause identification as a set of 14 binary classification problems, one for predicting each shaper. More specifically, in the binary classification problem for predicting shaper $s_i$, we create one training example from each document in the training set, labeling it as positive if the document has $s_i$ as one of its labels, and negative otherwise. In essence, we are adopting a one-versus-all scheme for creating the original training instances.

Next, we describe how we employ rationales for cause identification. We use rationales to provide additional features and additional training examples. Using rationales to create additional features is fairly straightforward. It involves (1) constructing a special lexicon that contains all the rationales in the training data, (2) creating one feature for each rationale in the lexicon, and (3) augmenting the feature set with these rationale features. The value of a rationale feature is 1 if the string representing the rationale appears in the given report and 0 otherwise.

We use rationales to provide two types of additional training examples: contrast examples and residue examples. To use rationales to generate contrast examples, we follow Zaidan et al.'s approach, as described in Section 3. However, there is a caveat. In Zaidan et al.'s framework, rationales are used to generate both positive and negative examples, since rationales were collected from both positive and negative reviews. On the other hand, in cause identification, since each report can be labeled with more than one shaping factor, the presence of a rationale for shaper $s_j$ in a report does not provide evidence that the report should not be labeled with shaper $s_i$, where $i \neq j$. Rather, it is the absence of rationales for $s_i$, rather than the presence of rationales for $s_j$, that indicates that the report should not be labeled with $s_i$. Hence, when recasting cause identification as binary classification tasks, the absence of rationales for the negative examples implies that we will only generate positively-labeled contrast examples from the rationales. In particular, these positively-labeled contrast examples are generated to train each binary classifier in the same way as in Zaidan et al.'s framework.

Now, recall that we propose to generate a new type of training example from the rationales: residue examples. Hence, Zaidan et al.'s framework needs to be modified so that the SVM learner can take the residue examples into account. Before describing the modifications, let us motivate the use of residue examples. Recall from Section 2.2 that the rationale annotators were not expected to "go out of their way" to identify all the rationales in a document. Hence, any amount of rationale annotation is unlikely to mark all the relevant portions of the text, and thus the text outside of the rationales may also contain some relevant portions. Thus, we propose residue examples, which are examples that contain only the rationales. Specifically, for each rationale $r_{ij}$ appearing in training example $x_i$, we generate one (positively-labeled) residue example that contains only the features extracted from $r_{ij}$ (we overload $r_{ij}$ to also denote its vector representation). In a sense, the contrast examples help the learner understand the importance of the rationale, whereas the residue examples aid in learning the importance of the rest of the text. As with contrast examples, residue examples intuitively contain less relevant information for classification than the original documents, and hence the SVM model should also be less sure of their class values than of the original documents. To capture this intuition in Zaidan et al.'s framework, we add residue constraints:

    $\forall i, j: \; w \cdot x_i - w \cdot r_{ij} \geq \mu (1 - \xi''_{ij})$,

where $r_{ij}$ is the residue example created from $x_i$ by retaining only rationale $r_{ij}$, $\mu \geq 0$ is the size of the minimum separation between original and residue examples, and $\xi''_{ij} \geq 0$ is the slack variable associated with $r_{ij}$. These slack variables should have their own misclassification cost, so a new term is added to Zaidan et al.'s objective function, which becomes:

    $\frac{1}{2}|w|^2 + C \sum_i \xi_i + C' \sum_{i,j} \xi'_{ij} + C'' \sum_{i,j} \xi''_{ij}$

Here, $C''$ is the misclassification cost associated with $\xi''_{ij}$.
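To make the data construction in this section concrete, the following is a schematic sketch (not the released system) of how the rationale features and the two kinds of positively-labeled pseudo-examples could be built for one shaper's one-versus-all task. The tokenizer, the report text and the rationale strings are simplified placeholders.

```python
# Sketch only: rationale features plus contrast/residue pseudo-examples
# for a single shaper's binary task (Section 4).  All examples generated
# here would carry the positive label for the shaper under consideration.
from typing import List, Set, Tuple

def tokenize(text: str) -> List[str]:
    return text.lower().split()

def unigram_features(text: str) -> Set[str]:
    return set(tokenize(text))

def rationale_features(text: str, rationale_lexicon: Set[str]) -> Set[str]:
    # One boolean feature per rationale string in the lexicon; it fires
    # if the rationale string appears in the report.
    return {"RAT=" + r for r in rationale_lexicon if r in text.lower()}

def pseudo_examples(text: str, rationales: List[str]) -> List[Tuple[str, Set[str]]]:
    """For each rationale: one contrast example (rationale removed) and
    one residue example (only the rationale retained)."""
    examples = []
    for r in rationales:
        contrast_text = text.lower().replace(r.lower(), " ")
        examples.append(("contrast", unigram_features(contrast_text)))
        examples.append(("residue", unigram_features(r)))
    return examples

# Hypothetical positive report for the shaper Physical Factors.
report = "Captain reported he was very tired after the long flight"
print(rationale_features(report, {"very tired", "weather"}))
print(pseudo_examples(report, ["very tired"]))
```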
5 Evaluation

In this section, we evaluate the usefulness of rationales for cause identification in a supervised learning framework.

5.1 Baseline Results

Following Zaidan et al. [2007], our baseline uses SVMlight [Joachims, 1999] with a linear kernel to train one-versus-all classifiers in conjunction with a feature set comprising only unigrams on the 1233 training reports, without pseudo-examples. Numbers, punctuation, stopwords, and tokens appearing less than 3 times in the training set are excluded from the feature set. The feature values are all boolean: if the token appears in the report, its value is 1, and 0 otherwise. The only SVM learning parameter in this case is $C$, the misclassification penalty, which was selected as the one that achieved the best performance on the 100-report development set (see Section 2). Results are expressed in terms of precision (P) and recall (R), both of which are micro-averaged over those of the 14 binary classifiers, as well as F-score (F). As we can see from row 1 of Table 3, the baseline achieves an F-score of 42.2 on the 1000-document test set.

5.2 Results Using Rationales

Parameter tuning. We tune the cost parameters $C$, $C'$ and $C''$, and the pseudo-example parameter $\mu$, to maximize the F-score of the SVM model on the development set. Due to the large number of parameters, we use the design-of-experiment-based search method for parameter tuning proposed by Staelin [2002] to find the parameter combination that gives the best F-score on the development set. Specifically, we specify the search space of each parameter, and the algorithm specifies several points of evaluation. At each point, the SVM models are learned from the training data using the parameter values at that point, and the learned models are used to classify the development set instances. The point at which the best performance (F-score) is observed is then selected as the new center of the search, and the search space ranges are halved. This search process is repeated 5 times. The models built using the best parameter combination after 5 iterations are then applied to the test set.
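The following is a simplified sketch of the iterative recenter-and-halve search just described; it is inspired by Staelin's design-of-experiment approach but is not a faithful reimplementation. The `evaluate` function (train the SVMs with the given parameters and return the development-set F-score), the parameter names `C1`/`C2` (stand-ins for $C'$/$C''$) and the ranges are hypothetical.

```python
# Sketch only: iterative grid search with range halving around the best point.
import itertools

def refine_search(evaluate, ranges, points_per_dim=3, iterations=5):
    """ranges: {name: (low, high)}; returns the best parameter setting found."""
    best_params, best_f = None, float("-inf")
    for _ in range(iterations):
        # Evaluate a small grid spanning the current ranges.
        grids = {
            name: [lo + k * (hi - lo) / (points_per_dim - 1)
                   for k in range(points_per_dim)]
            for name, (lo, hi) in ranges.items()
        }
        names = list(grids)
        for values in itertools.product(*(grids[n] for n in names)):
            params = dict(zip(names, values))
            f = evaluate(params)
            if f > best_f:
                best_params, best_f = params, f
        # Recenter on the best point and halve each range (no clipping here;
        # a real implementation would keep ranges within sensible bounds).
        ranges = {
            name: (best_params[name] - (hi - lo) / 4,
                   best_params[name] + (hi - lo) / 4)
            for name, (lo, hi) in ranges.items()
        }
    return best_params, best_f

# Hypothetical usage with the four parameters tuned in this section:
# best, f = refine_search(evaluate, {"C": (0.01, 10), "C1": (0.01, 10),
#                                    "C2": (0.01, 10), "mu": (0.0, 2.0)})
```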
Using rationales as features. Results of using the rationales as features are shown in rows 2 and 3 of Table 3. Note that the frequency threshold of 3 mentioned earlier was employed to filter the unigram features but not the rationale features. When only the rationales are present in the feature set, the SVM model performs marginally better than the baseline, achieving an F-score of 42.5. However, when rationales are used as additional features to augment the unigram-based feature set, the F-score increases considerably to 45.8. This shows that the rationales are useful when used as features.

Table 3: Results of the baseline and the SVM approach with rationales as features.
Expt   Feature Set             P      R      F
       SVM Baseline
1      Unigrams                50.8   36.2   42.2
       SVM with Rationales as Features
2      Rationales              54.8   34.8   42.5
3      Unigrams, Rationales    56.0   38.8   45.8

Using rationales as pseudo-examples. Results of using the rationales to generate pseudo-examples are shown in Table 4. As we can see, using only contrast examples (row 4) does not improve much upon the SVM baseline. Interestingly, using our proposed residue examples (row 5) improves the SVM baseline considerably, to F=49.0. When both types of pseudo-examples are applied (row 6), the F-score drops to 47.5, which is still well above the baseline, however. Next, we repeat the above experiments, but augment the feature set with rationales. Comparing the three pairs of experiments (rows 4 and 7; rows 5 and 8; rows 6 and 9), we can see that using rationales both as features and as contrast examples consistently yields better performance than using rationales only as contrast examples. Hence, rationale features are useful for cause identification in the presence of contrast examples. Finally, when rationales are used as features, neither residue examples nor contrast examples are more effective than the other in improving the SVM model (rows 7 and 8), but employing both types of pseudo-examples enables the model to achieve substantially better performance (row 9).

Table 4: Results of approaches using pseudo-examples.
Expt   Feature Set             Pseudo-Instances     P      R      F
4      Unigrams                Contrast             49.9   38.1   43.2
5      Unigrams                Residue              36.9   72.8   49.0
6      Unigrams                Contrast, Residue    36.4   68.5   47.5
7      Unigrams, Rationales    Contrast             52.2   40.7   45.7
8      Unigrams, Rationales    Residue              55.9   38.8   45.8
9      Unigrams, Rationales    Contrast, Residue    39.7   66.0   49.5

Using rationales as pseudo-negatives. So far we have only used rationales to provide pseudo-positive examples. A natural question is: while we do not have rationales that explain why a document should not be labeled with a particular shaper, is it still possible to generate pseudo-negative examples? We experiment with a simple idea for generating pseudo-negatives: to train the binary classifier for predicting shaper $s_i$, we create negatively-labeled contrast examples from the rationale annotations in the negative examples in the same way as we create positively-labeled examples from the rationales in the positive examples. The next question, of course, is: since the rationales in the negative examples are not necessarily indicators that a document should not be labeled as $s_i$, will these negative contrast examples be useful? To answer this question, we repeat the experiments in Table 4, but augment the training set in each experiment with negative contrast examples. Our preliminary results indicate that additionally employing pseudo-negatives yields an improvement of 1.0–2.6 in F-score, but further experiments are needed to precisely determine their usefulness.

Generating pseudo-examples from multiple rationales. So far each pseudo-example is generated from exactly one rationale. It is conceivable that a contrast example can be generated by removing multiple rationales and a residue example can be generated by retaining multiple rationales. Adding this flexibility to the generation of pseudo-examples may result in an explosion in the number of pseudo-examples, however. As a result, we limit the number of pseudo-examples that can be generated for each training report. Specifically, to generate pseudo-examples for a report with $r$ rationales, we (1) choose $l$ such that $\sum_{k=1}^{l} \binom{r}{k} \leq 50$, and (2) remove/retain rationale combinations of sizes $1, 2, \ldots, l$ from the report when generating contrast/residue examples. Adding this flexibility to the generation of positive and negative contrast and residue examples, we achieve an F-score of 52.9 in an experiment where both unigrams and rationales are employed as features. This preliminary result indicates that it may be beneficial to generate pseudo-examples from multiple rationales.
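The combination-size limit described above is easy to make concrete; the sketch below (not the authors' code) picks the largest $l$ whose cumulative number of combinations stays within the budget of 50 and then enumerates those combinations (for contrast examples the selected rationales would be removed; for residue examples they would be retained).

```python
# Sketch only: choose l with sum_{k=1..l} C(r, k) <= 50, then enumerate
# the rationale combinations of sizes 1..l.
from itertools import combinations
from math import comb

def max_combo_size(r: int, budget: int = 50) -> int:
    total, l = 0, 0
    while l < r and total + comb(r, l + 1) <= budget:
        l += 1
        total += comb(r, l)
    return l

def rationale_combinations(rationales):
    l = max_combo_size(len(rationales))
    for size in range(1, l + 1):
        yield from combinations(rationales, size)

print(max_combo_size(4))    # 4: 4 + 6 + 4 + 1 = 15 <= 50
print(max_combo_size(10))   # 1: 10 <= 50, but 10 + 45 = 55 > 50
```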
6 Error Analysis

To better understand why the system fails to correctly classify a number of test instances, we analyze the errors made by the best-performing system (i.e., the system that yielded an F-score of 52.9) on a set of 100 reports chosen randomly from the test set. We investigate two types of errors, namely false positives, in which the system labels a report as positive but the annotator does not, and false negatives, in which the annotator labels a report as positive but the system does not.

6.1 Analysis Method

Analyzing the errors made by an SVM learner is a rather daunting task, since the actual reason why a particular instance is classified as positive or negative is buried in the vectors representing the separating hyperplane and the document being classified. The SVM learner takes the dot product of these two vectors (and adds the bias term if biased hyperplanes are being used) to find the distance of the instance from the separating hyperplane, and if the distance is greater than zero then it is classified as positive; otherwise it is labeled negative. Hence, to analyze the causes of the false positives and the false negatives, we need to look at the actual feature values present in the document vector and their weights in the hyperplane vector. However, even with a subset of 100 reports, this means looking at more than 60,000 feature values, and thus we confine ourselves to looking at only the highest contributing features. In other words, for the false positives, we look at the feature with the highest positive contribution to the distance, and for the false negatives, we look at the one making the biggest negative contribution.
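The per-feature decomposition used in this analysis is straightforward; the sketch below (not the authors' tooling) splits the SVM score $w \cdot x$ into per-feature contributions $w_k x_k$ and reports the largest positive and negative contributors. The feature names and weights are illustrative.

```python
# Sketch only: top positive/negative per-feature contributions to w.x.
def top_contributors(weights, doc_features):
    """weights: {feature: w_k}; doc_features: {feature: x_k} for one report.
    Features absent from the report contribute nothing, so only the
    report's own features are examined."""
    contrib = {f: weights.get(f, 0.0) * v for f, v in doc_features.items()}
    top_pos = max(contrib, key=contrib.get)
    top_neg = min(contrib, key=contrib.get)
    return top_pos, contrib[top_pos], top_neg, contrib[top_neg]

w = {"busy": 1.3, "weather": -0.2, "late": 0.4}    # hypothetical weights
x = {"busy": 1, "weather": 1, "controller": 1}     # boolean features of one report
print(top_contributors(w, x))  # ('busy', 1.3, 'weather', -0.2)
```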
6.2 False Positives

We analyzed the 62 false positive errors made by the system and discovered several different reasons why the top positive contributing features actually misled the learner. These reasons are summarized in Table 5 and discussed in detail below.

Table 5: Error analysis findings for false positives.
Reason                                               Count   Percentage
Concept present but not contributing to incident       21      33.87%
Wrong context                                           17      27.42%
Bad feature                                             12      19.35%
Too general feature                                      7      11.29%
Wrong word sense                                         3       4.84%
Hypothetical context                                     1       1.61%
Ambiguous feature                                        1       1.61%
Total                                                   62

Concept present but not contributing to incident. In several cases, the top positive contributing feature is in fact relevant to the shaping factor, but in the specific report, it is not contributing to the incident in the opinion of the annotator. For example, the term "busy" is a feature that is relevant to the shaping factor Preoccupation, but not every report in which this term appears has Preoccupation as a shaper. For example, Report#230334 contains the sentence "CLBING to cruise altitude we were cleared to FL230 by a busy controller." Even though the controller was evidently busy, that did not contribute to the occurrence of the incident.

Wrong context. In this category of errors, the feature is in fact relevant to the shaping factor, but it appears in a context that does not imply the shaping factor. For example, the feature "damage", related to the shaping factor Resource Deficiency, appears in Report#389873 in the following context: "minimal damage." In this context, the word appears as an effect of what happened, not as a cause. The SVM learner is unable to make this distinction, and the feature ends up contributing heavily to the positiveness of the document.

Bad feature. There are several cases in which the feature is not intuitively relevant to the shaping factor, but was assigned a high weight by the learner merely because it happened to occur frequently enough in the positive examples in the training set. For example, the feature "ability" is the top positive contributor for the shaping factor Attitude in Report#101846, as in the sentence "aircraft performance charts and PDCS affirmed ability to do so", but it is clear that the word "ability" has nothing to do with an attitude problem.

Too general feature. There are some words that are related to the shaping factor in a very general manner, but because of this generality, may appear in any report without actually indicating the shaping factor. Similar to the bad features described above, the learner assigns them a high weight because of their frequent occurrence in the positive reports. For example, the feature "weather" is related to the shaping factor Physical Environment in a general sense, since this shaping factor is chiefly concerned with weather elements hampering a pilot, but in the context of Report#325010, where it appears in the sentence "the weather was clear with good visibility for the los angeles area", it merely describes the weather conditions.

Wrong word sense. In this case, one sense of the feature is relevant to the shaping factor, but it appears in a different sense in the falsely positive report. For example, Report#642907 is labeled as a positive example for the class Attitude, and the feature with the highest positive contribution is the word "attitude". This word is apparently a good indicator for this class, as it has also been identified by the annotators as a rationale for this class. However, this is true only as long as it means the mental state of someone. In this particular report, the word appears in the following context: "I became slightly disoriented and got into a dangerous attitude." Here, "attitude" is actually used to mean the orientation of the aircraft and not the mental state of the pilot. The SVM learner understandably cannot distinguish between these two meanings, and this feature makes a high contribution to the false positive label of the document.

Others. In one case, the word "tired", a good feature for the shaper Physical Factors, appears in Report#534432 in a hypothetical context: "I feel this could happen to a pilot especially if he was single pilot; behind on the approach; a little rusty; tired; etc. ...". In another case, the top positive contributor is the feature "fatigue", which is relevant to two shapers, Duty Cycle and Physical Factors. Thus the report gets labeled positive for Duty Cycle, resulting in a false positive.

6.3 False Negatives

In the analysis of the false negatives, we look at the highest negative contributing feature for the reports falsely labeled as negative. However, in this case it is harder to identify whether the top contributing feature is a good negative feature or not. It is relatively easy to judge whether a given feature is relevant to a given shaping factor, but how do we judge whether a feature would be a good indicator of the shaping factor not being present? This exposes one interesting property of our problem of cause identification. When a shaping factor is present, it is reflected in the report as such, and by utilizing the clues in the document, it is possible to identify its presence to a reasonable degree. However, when the shaping factor is not present, there is rarely any clue. In other words, the person writing the report usually does not discuss the shaping factors that were not present. When we look at the top negative contributing features for the false negatives, we find that the SVM learner faces this issue. In some of the cases, it tries to solve the problem by assigning high negative weights to features important to other classes, while in other cases it simply focuses on random words that appear in the negative examples more than the positive ones. The first approach creates false negatives when that other shaping factor is also present in the test document (as we discuss in the next paragraph), while the second approach mostly selects bad features.

Feature associated with other labels of the document. We take each feature and find the shaping factor for which the feature has the highest positive weight in the vector representation of the hyperplane. Then, when we look at a false negative example, we find the highest negative contributing feature and look up its most relevant shaping factor. Then we see if this shaping factor is also present in the set of labels assigned to the report. In 36 of the 87 false negatives we analyzed, we found that the feature is affiliated with one of the other shaping factors assigned to the report. Thus, since the features relevant to the other shaping factors have been assigned negative weights, their presence in the report makes them contribute negatively even though they are present because of the presence of other shaping factors in the report.
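The affiliation check just described can be sketched as follows (illustrative only, not the authors' analysis scripts); the per-shaper weight dictionaries and labels are hypothetical.

```python
# Sketch only: for a false negative, find the shaper whose classifier gives
# the top negative contributing feature its highest positive weight, and
# test whether that shaper is among the report's gold labels.
def most_affiliated_shaper(feature, per_shaper_weights):
    """per_shaper_weights: {shaper_id: {feature: weight}}."""
    return max(per_shaper_weights,
               key=lambda s: per_shaper_weights[s].get(feature, float("-inf")))

def explained_by_other_label(top_neg_feature, report_labels, per_shaper_weights):
    return most_affiliated_shaper(top_neg_feature, per_shaper_weights) in report_labels

# Hypothetical weights for two shapers and one report gold-labeled with shaper 8.
weights = {8: {"busy": 1.2}, 12: {"busy": 0.9}}
print(explained_by_other_label("busy", {8}, weights))  # True
```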
7 Conclusions

We have proposed to use annotator rationales for cause identification, suggesting two novel ways to use rationales: as features and for generating a new type of pseudo-examples, namely residue examples. Experimental results on a subset of the ASRS reports demonstrated the usefulness of rationales for cause identification. Overall, an SVM learner that exploits rationales substantially improves over one that does not, by 7.3% in F-score. Moreover, we believe that our detailed analysis of the errors made by our system can provide insights into the problem as well as directions for future research.

References

[Abedin et al., 2010] M. Abedin, V. Ng, and L. Khan. Weakly supervised cause identification from aviation safety reports via semantic lexicon construction. JAIR, 2010.
[Artstein and Poesio, 2008] R. Artstein and M. Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 2008.
[Ferryman et al., 2006] T. A. Ferryman, C. Posse, L. J. Rosenthal, A. N. Srivastava, and I. C. Statler. What happened, and why: Toward an understanding of human error based on automated analyses of incident reports, Volume II. Technical Report NASA/TP-2006-213490, NASA, 2006.
[Joachims, 1999] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[Passonneau, 2004] R. J. Passonneau. Computing reliability for coreference annotation. In LREC, 2004.
[Posse et al., 2005] C. Posse, B. Matzke, C. Anderson, A. Brothers, M. Matzke, and T. Ferryman. Extracting information from narratives: An application to aviation safety reports. In 2005 IEEE Aerospace Conference.
[Staelin, 2002] C. Staelin. Parameter selection for support vector machines. Technical Report HPL-2002-354R1, HP Labs Israel, 2002.
[Zaidan et al., 2007] Omar F. Zaidan, J. Eisner, and C. Piatko. Using "annotator rationales" to improve machine learning for text categorization. In NAACL HLT, 2007.