Attribution and the PDTB

Silvia Pareti
The University of Edinburgh
School of Informatics

Outline
• Introduction
• Attribution in the PDTB
• Annotation schema extension
• Resources development
• Preliminary achievements
• Future directions

Introduction - Attribution
(wsj_0961)

PDTB - Attribution Annotation (Prasad et al., 2008)
Mr. Nemeth said in parliament that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)

____Explicit____
9163..9165
#### Text ####
if
#### Features ####
Ot, Comm, Null, Null
9067..9096
#### Text ####
Mr. Nemeth said in parliament
##############
if, Contingency.Condition.Unreal present
____Arg1____
9097..9162
#### Text ####
that Czechoslovakia and Hungary would suffer environmental damage
#### Features ####
Inh, Null, Null, Null
____Arg2____
9166..9201
#### Text ####
the twin dams were built as planned
#### Features ####
Inh, Null, Null, Null

Other corpora with attribution
• MPQA Opinion Corpus (Wiebe et al., 2002)
  – 692 articles
  – intra-sentential annotation
• RST Discourse Treebank (Carlson & Marcu, 2001)
  – 385 articles
  – intra-sentential, only explicit sources, verb cues or "according to"
• GraphBank (Wolf & Gibson, 2005)
  – 135 articles
  – only attributions not overlapping with other discourse relations
• Other smaller or low-coverage projects
  – Sydney Morning Herald Corpus (O’Keefe et al., submitted)
  – Corpus TCC and RHETALHO (Pardo et al., 2004)

PDTB - Advantages
Large corpus: less frequent structures and strategies are better observed, e.g.:
Groused Robert Antolini, head of over-the-counter trading at Donaldson, Lufkin & Jenrette: "It's making it tough for traders to make money." (wsj_1142)
For some at the SEC, an agency that covets its independence, Mr. Breeden may be too much of a Washington insider.
(wsj_0955)

PDTB - Advantages
The range of attributions covered is not pre-defined
• Attributions are not limited to the sentence level
• A wide range of attributions are annotated:
  – direct, indirect and mixed
  – having named or not named, explicit as well as implicit sources (e.g. it is believed…)
  – having verb and non-verb cues (e.g. idea, for)
• Includes some relevant features

PDTB - Extensions
• Finer grained annotation of the attribution span: source, cue, circumstantial information
• Completing content spans of some direct or mixed attributions
"It's just sort of a one-upsmanship thing with some people," added Larry Shapiro. "They like to talk about having the new Red Rock Terrace one of Diamond Creek's Cabernets or the Dunn 1985 Cabernet, or the Petrus. Producers have seen this market opening up and they're now creating wines that appeal to these people." (wsj_0071)

PDTB - Extensions
• Annotation of attributions not overlapping with discourse relations
• Annotation of nested attributions
["The Caterpillar people aren't too happy when they see their equipment used like that,"] [shrugs] [Mr. George]. ["They figure it's not a very good advert."] (wsj_1121)
[They] [figure] [it's not a very good advert]

Annotation Schema
source, cue, SUPPLEMENT → [PDTB attribution span]
content → PDTB discourse connective/Arg1/Arg2 text spans
[Mr. Nemeth said IN PARLIAMENT] that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)

PDTB Attribution Features
attribution type
• assertion (e.g. say, mention)
• belief (e.g. think, doubt)
• fact (e.g. remember, know)
• eventuality (e.g. allow)
source type
• writer (if explicit, e.g.
I think...)
• other (e.g. Mr. Brown, a witness)
• arbitrary (e.g. one, people)
• mixed (e.g. My assessment and everyone's assessment is… (wsj_2012))

PDTB Attribution Features
factuality (determinacy)
• factual
• non-factual
scopal change (scopal polarity)
• none
• scopal change
Se c’è, cioè, una maggioranza in Parlamento in grado di affrontare seriamente una fase di riforme anche elettorali, Ø penso che la legislatura possa utilmente proseguire. (re075)
If there is a majority in Parliament able to seriously face a phase of reforms, also electoral, (I) think that the legislature could usefully continue.

New Attribution Features
source attitude
• neutral (e.g. say, add)
• positive (e.g. welcome, beam)
• critical (e.g. lament, fume)
• tentative (e.g. believe, suggest)
• other (e.g. joke)
authorial stance
• committed (e.g. admit, know)
• not-committed (e.g. lie, claim)
• neutral (e.g. say, suggest)

New Attribution Features
Mr. Nemeth said IN PARLIAMENT that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)
Attribution type: assertion
Source type: other
Factuality: factual
Scopal change: none
Source attitude: neutral
Authorial stance: neutral

Confronted, Mrs.
Yeargin admitted she had given the questions and answers two days before the examination to two low-ability geography classes. (wsj_0044)
Authorial stance: committed

"I think that this magazine is not only called Garbage, but it is practicing journalistic garbage," fumes a spokesman for Campbell Soup. (wsj_0062)
Source attitude: critical

Inter-Annotator Agreement
• 2 annotators
• Data: 14 articles (PDTB)
  – 491 attributions (22% are nested)
• Annotation manual (Pareti, 2012 submitted)
• Training on an article
• MMAX2 annotation tool (Müller & Strube, 2006)
• Complete annotation schema

Results - Existence of Attribution
0.87 agr
agr: proportion of commonly annotated relations with respect to the annotations identified overall by Annotator A and Annotator B
NOTE: writer attributions were annotated only if explicit
Span selection tasks (agr metric):
Cue          0.97
Source       0.94
Content      0.95
Supplement   0.37

Results - Features
                   PERCENT AGREEMENT   COHEN'S KAPPA
TYPE               83.42 (317)         0.63
SOURCE             95 (361)            0.71
SCOPAL CHANGE      98.68 (375)         0.60
AUTHORIAL STANCE   94.47 (359)         0.20
SOURCE ATTITUDE    82.36 (313)         0.48
FACTUALITY         97.63 (371)         0.73

Italian Attribution Corpus - ItAC (Pareti and Prodanof, 2010)
• 50 articles (37,000 tokens) from Italian newspaper corpora (e.g.
La Repubblica)
• 460 attribution relations
• Freely available from: http://homepages.inf.ed.ac.uk/s1052974/resources.php

PDTB Attribution Corpus (Pareti, 2012)
9868 attributions
Stand-off annotation of attribution based on the PDTB:
• Comprises all attribution relations annotated in the PDTB (reconstructed from the current annotation)
• The annotation is further extended according to the revised annotation schema

PDTB Attribution Corpus
Annotation of the attribution span: source, cue, SUPPLEMENT
80% annotated automatically using 48 matching rules, then manually revised, e.g.:
(NP-SBJ)(VP)          one person said
(PP-LOC)(NP)(VB)      IN DALLAS, LTV said
(NP-SBJ)(VBP)(JJ)     I am sure
20% had rarer syntax and was manually annotated, e.g.:
Judge Curry ordered the refunds to begin Feb. 1 and said (wsj_0015)

PDTB Attribution Corpus
Further annotation of the content span:
– adding punctuation (direct quotation marks)
– completing content spans that had only been partially annotated
– annotating the quote status of the attribution based on the position of quote span QS and content span CS:
  • direct: QS = CS
  • indirect: CS outside or contained in QS
  • mixed: CS overlaps QS or QS contained in CS

PDTB Attribution Corpus
ATTRIBUTION ID: wsj_0003.pdtb_05
SOURCE SPAN: Darrell Phillips, vice president of human resources for Hollingsworth & Vose
CUE SPAN: said
CONTENT SPAN: “There’s no question that some of those workers and managers contracted asbestos-related diseases,” “But you have to recognize that these events took place 35 years ago.
It has no bearing on our work force today.”
SUPPLEMENT SPAN: None
FEATURES: Ot, Comm, Null, Null
QUOTE STATUS: Direct

Use of PDTB Attribution Corpus
Independent analysis of attribution:
• cue composition
  – several cues other than verbs (prepositions, nouns, adverbs)
  – wide range of attributional verbs (266 types in the corpus)
• source composition
  – NEs account for only about 50% of the sources
• attribution structures

Use of PDTB Attribution Corpus
Testing a system for the identification of direct quotes and their speakers in the literature and news domains. University of Sydney and Sydney Morning Herald (O’Keefe et al., 2012, submitted).
• Rule-based and machine-learning approaches have been tested on 3 corpora
• The results show that direct quotes differ by domain and style

Future
• Development of an attribution extraction system using the data to train a classifier
• Semi-automatic extension of the annotation to comprise all attributions in the corpus
• Annotation of the level of nesting of each attribution
• Release of the corpus for development, testing and use in shared tasks

Conclusion
• Advantages of attribution in the PDTB
• Development of a finer-grained annotation schema and its inter-annotator agreement results
• Application of the schema to a small corpus of Italian
• Collection and further annotation of attribution in the PDTB
• Importance of this resource for the analysis of attribution and its ‘long tail’ and for testing and developing attribution extraction systems

Bibliography
Carlson, L. and Marcu, D. Discourse tagging reference manual. Technical report ISI-TR-545, ISI, University of Southern California, September 2001.
Müller, C. and Strube, M. Multi-level annotation of linguistic data with MMAX2. In Sabine Braun, Kurt Kohn and Joybrato Mukherjee, editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, English Corpus Linguistics, Vol. 3, pages 197–214. Peter Lang, Frankfurt, 2006.
O’Keefe, T., Pareti, S., Curran, J., Koprinska, I. and Honnibal, M. A sequence labelling approach to quote attribution. Manuscript submitted for publication, 2012.
Pardo, T., das Graças Volpe Nunes, M. and Rino, L. Dizer: An automatic discourse analyzer for Brazilian Portuguese. In Ana Bazzan and Sofiane Labidi, editors, Advances in Artificial Intelligence – SBIA 2004, volume 3171 of Lecture Notes in Computer Science, pages 224–234. Springer Berlin / Heidelberg, 2004.
Pareti, S. and Prodanof, I. Annotating attribution relations: Towards an Italian discourse treebank. In Proceedings of LREC10, 2010.
Pareti, S. A database of attribution relations. In Proceedings of LREC12, Istanbul, 23-25 May 2012 (to appear).
Pareti, S. Theory and practise of annotating attributions. Manuscript submitted for publication, 2012.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A. and Webber, B. The Penn Discourse Treebank 2.0. In Proceedings of LREC08, 2008.
Wiebe, J. Instructions for annotating opinions in newspaper articles. Technical report, University of Pittsburgh, 2002.
Wolf, F. and Gibson, E. Representing discourse coherence: A corpus-based study. Comput. Linguist., 31:249–288, June 2005.
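
PDTB Attribution Corpus - Quote Status (sketch)
The quote-status rule from the content-span slide (direct when QS = CS, indirect when CS lies outside or inside QS, mixed when CS overlaps QS or contains it) could be written roughly as below. This is only an illustrative sketch, not the tooling used to build the corpus: the function name `quote_status`, the representation of each span as a single pair of half-open character offsets, the reading of "contained" as also covering a shared boundary, and the choice to return "indirect" when no quote span exists are all assumptions made here.

```python
from typing import Optional, Tuple

# Assumption: a span is one contiguous (start, end) pair of character
# offsets, half-open. Real content spans may be discontinuous.
Span = Tuple[int, int]


def quote_status(cs: Span, qs: Optional[Span]) -> str:
    """Classify an attribution as direct, indirect or mixed.

    cs is the content span, qs the quote span (None if the text has
    no quotation marks near the content; treated as indirect here,
    which is an assumption).
    """
    if qs is None:
        return "indirect"

    # direct: QS = CS
    if cs == qs:
        return "direct"

    cs_start, cs_end = cs
    qs_start, qs_end = qs

    # indirect: CS entirely outside QS
    if cs_end <= qs_start or qs_end <= cs_start:
        return "indirect"

    # indirect: CS contained in QS (equality was handled above)
    if qs_start <= cs_start and cs_end <= qs_end:
        return "indirect"

    # mixed: CS partially overlaps QS, or QS is contained in CS
    return "mixed"


if __name__ == "__main__":
    # Hypothetical offsets, chosen only to exercise each branch.
    examples = {
        "same span": ((0, 10), (0, 10)),
        "no overlap": ((0, 10), (20, 30)),
        "content inside quote": ((5, 8), (0, 10)),
        "partial overlap": ((0, 10), (5, 15)),
        "quote inside content": ((0, 20), (5, 10)),
    }
    for label, (cs, qs) in examples.items():
        print(label, "->", quote_status(cs, qs))
```

The checks run from the most specific condition (identity) to the least specific, so each attribution receives exactly one label; handling discontinuous content spans or several quote spans per attribution would need a richer span type than the one assumed here.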