The University of Edinburgh

Attribution Relation Cues Across Genres
A comparison of verbal and nonverbal cues in news and thread summaries

Alice Bracchi
Università degli Studi di Pavia
alice.bracchi01@ateneopv.it

Abstract

Previous work addressing the automatic detection of opinion and quotation Attribution Relations (ARs) has looked at the cue, the lexical anchor connecting the attributed text to its source, as the central element of the task. Most Attribution Extraction approaches are built upon lists of verb cues that are thought to be sufficiently exhaustive and reliable in signalling ARs in a text. The purpose of this project is to test how reliable such lists are once we move away from the news genre they have mostly been applied to. In order to investigate this, I have compared data from a news corpus annotating attribution cues to a small corpus of thread summaries I have compiled for the purpose. The comparison shows not only that cues are highly genre, register and domain specific, but also that attribution cue analysis should not be restricted to verbs. Thus, basing an analysis on pre-established lists of generally valid cues, or even attempting to compile new lists from annotated cues, proves highly impractical.

Attribution Relations: the role of the cue

What is an AR?
An attribution relation is the relation ascribing the ownership of an attitude towards some linguistic material (i.e. the text itself, a portion of it or their semantic content) to an entity. (Pareti & Prodanof, 2010)

[Jeff Garzik]SOURCE [said]CUE [that the RTC driver was pointless without interrupts]CONTENT.

Why focus on the cue?
• It is the lexical anchor linking source and content;
• It expresses the attitude (e.g. assertion, belief), defining the nature of the AR;
• Verb semantics and syntax can help identify source and content (e.g. in the above sentence: SOURCE-Vsbj – Vcue – CONTENT-Vobj).

The cue in automatic AR extraction
• Most previous studies used pre-compiled lists of 30-50 verbs, based on news articles, in an attempt to include exhaustive and reliable verb cues.
• Those lists were not based on any annotation of the phenomenon in text, but rather on the verbs most expected to signal an AR.

[Figure: Cue verbs used by Krestel et al. (2008).]

The corpora

I compared the PARC (Pareti, 2012), the only available AR corpus annotating attribution cues, to KT-Pilot. KT-Pilot is a sample of the Kernel Traffic Summaries of the Linux Kernel Mailing List (http://kt.earth.li/kernel-traffic/archives.html), annotated with ARs for the purpose.

                 KT-Pilot            PARC
N. of tokens     75k                 1,139k
N. of ARs        1,766               10,526
ARs/1k tokens    23                  9.2
Genre            Thread summaries    News
Register         Informal            Formal
Domain           Computer science    Diverse

Verb Cues Distribution

[Figure: The 20 most frequent verbal cues in the two corpora.]
PARC: say, add, note, think, believe, tell, expect, argue, report, estimate, contend, suggest, predict, agree, recall, indicate, explain, acknowledge, show
KT-Pilot: say, reply, add, want, think, ask, announce, point, report, explain, expect, require, suggest, see, feel, tell, remark, show, hope, note

1. Phrasal verb frequency

A higher percentage of phrasal verb cues is found in KT-Pilot, with predicates such as go on, come back with (sth), come down on (smb), etc.
• KT-Pilot: 4.14%
• PARC: 0.42%

KT-Pilot: David came back with: Look folks. All of these arguments are going on deaf ears, because the old behavior is not coming back without a solution to the problem which was solved.

2. Verb type frequencies

10 most frequent verb cue types in KT-Pilot:

                  KT-Pilot           PARC
 1  say           300    18.08%      6493   69.65%
 2  reply         181    10.91%      10     0.12%
 3  add           170    10.24%      308    3.30%
 4  want          106    6.38%       17     0.18%
 5  think         94     5.67%       180    1.93%
 6  ask           84     5.06%       28     0.30%
 7  announce      63     4.00%       31     0.33%
 8  point         52     3.14%       28     0.30%
 9  report        38     2.29%       76     0.82%
10  explain       34     2.04%       37     0.40%

10 most frequent verb cue types in PARC:

                  PARC               KT-Pilot
 1  say           6493   69.65%      300    18.08%
 2  add           308    3.30%       170    10.24%
 3  note          186    2.00%       15     1%
 4  think         180    1.93%       94     5.67%
 5  believe       141    1.51%       13     1%
 6  tell          103    1.10%       20     1.21%
 7  expect        92     1.00%       31     1.90%
 8  argue         78     0.84%       5      0.28%
 9  report        76     0.82%       38     2.29%
10  estimate      71     0.76%       1      0.05%

This divergence in frequency is due to THREAD STRUCTURE. One user asks a question, the community posts replies until the question is answered, then the thread is closed.

The striking prevalence of the verb say is also genre-specific. Journalists often report what people actually said, whilst communication in thread summaries is written.

3. Attribution type distribution

[Figure: Types of verbal cues in the two corpora (panels: WSJ, KT-Pilot), by attribution type: assertion, belief, fact, eventuality. Final results may vary due to the incompleteness of PARC annotation.]

Journalistic prose presents:
• ASSERTIONS: 91% of verbal cues, of which
• SAY: 70% of verbal cues

Overall, news language shows a more neutral choice of predicates. Thread summaries, by contrast, show:
• Wider variety of predicate choices;
• Wider presence of BELIEF cues.

Domain-specific differences

• Some verbs are considered particularly reliable in attribution cue lists (Krestel et al. 2008).
• Yet some of these verbs are, in fact, polysemous, and do not occur in thread language as verbs with attributional meaning.

MAINTAIN (statement/code)
PARC: Mexican officials maintain the Japanese reserve is only a result of unfamiliarity.
KT-Pilot: We will have the resources to maintain arch-xen on Linux 2.6 going forward.

SUPPORT (*) (idea/software)
PARC: They, as well as numerous Latin American and East European countries […], are supporting the direction Spain is taking.
KT-Pilot: The comx drivers support lapb thru the lapb stack.

DECLARE (statement/variable)
PARC: "This is the peak of my wine-making experience," Mr. Winiarski declared when he introduced the wine at a dinner in New York.
KT-Pilot: I changed the name of the structure that must be declared from struct driver_file_entry to struct device_attribute.

(*) Of the three verbs, support is the least associated with an attribution meaning. Nonetheless, in thread summaries it almost never occurs as a cue, whereas in the PARC it can associate with belief attributions.

Non-verb cues

Previous work shows an almost complete disregard for non-verb cues, except for the phrase according to. As far as news language is concerned, this does not significantly jeopardize the outcome of the research, given how rarely they occur (< 1% in PARC). However, not analyzing them in thread summaries would miss 4.36% of cues.

THREAD SUMMARIES: 4.36% of cues are nonverbal
NEWS LANGUAGE: 0.46% of cues are nonverbal

[Figure: Non-verbal cues in KT-Pilot and PARC. Category labels: word, question, thought, reply, info, approval, idea, quote, acronyms (IMO, IMHO, IMNSHO, AFAIK), argument, discussion, objection, assumption, suggestion, opinion, other.]

1. Acronyms as cues: the presence of acronyms is to be considered highly genre and register specific, since no occurrences are found in PARC.
• AFAIK: As far as I know, […]
  KT-Pilot: AFAIK, it is a workaround for a gcc-2.7 bug discovered by John Davis.
• IMO: In my opinion, […]
  KT-Pilot: IMO it would need quite some cleanup work first.
• IMHO: In my humble opinion, […]
  KT-Pilot: This IMHO is a good thing for all Real Time SMP.
• IMNSHO: In my not so humble opinion, […]
  KT-Pilot: And IMNSHO it is needed, since it will make devfs users much cleaner.

2. Punctuation cues:
• Absence of quotation marks when direct speech is reported:
  KT-Pilot: Alexander also said, If Linus decides to remove devfs, I certainly won't weep for it.
  KT-Pilot: Linus Torvalds: This one is a whole lot harder to fix.
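
The practical consequence of the verb type frequencies can be illustrated with a small calculation: how much of each corpus a fixed, news-derived cue list would cover. The sketch below is only an illustration under stated assumptions. It treats the ten most frequent PARC verb cues as a stand-in "precompiled list" (this is not the list of Krestel et al.), and it estimates the total number of verb cue tokens in each corpus from the count and share reported for say, since the totals themselves are not given in the tables. The names COUNTS, estimated_total and coverage are hypothetical.

```python
# Counts copied from the "10 most frequent verb cue types in PARC" table.
# verb -> (count in PARC, count in KT-Pilot)
COUNTS = {
    "say": (6493, 300),
    "add": (308, 170),
    "note": (186, 15),
    "think": (180, 94),
    "believe": (141, 13),
    "tell": (103, 20),
    "expect": (92, 31),
    "argue": (78, 5),
    "report": (76, 38),
    "estimate": (71, 1),
}

# Assumption: the PARC top 10 stands in for a news-based precompiled cue list.
NEWS_CUE_LIST = list(COUNTS)


def estimated_total(count: int, share_percent: float) -> float:
    """Estimate the total number of verb cue tokens from one verb's count and share."""
    return count / (share_percent / 100.0)


def coverage(cue_list, counts, corpus_index: int, total: float) -> float:
    """Fraction of a corpus's verb cue tokens matched by the verbs in cue_list."""
    covered = sum(counts[verb][corpus_index] for verb in cue_list)
    return covered / total


# Assumption: totals are back-computed from the reported figures for "say"
# (69.65% of PARC verb cues, 18.08% of KT-Pilot verb cues).
parc_total = estimated_total(6493, 69.65)
kt_total = estimated_total(300, 18.08)

parc_coverage = coverage(NEWS_CUE_LIST, COUNTS, 0, parc_total)
kt_coverage = coverage(NEWS_CUE_LIST, COUNTS, 1, kt_total)

print(f"PARC coverage:     {parc_coverage:.1%}")
print(f"KT-Pilot coverage: {kt_coverage:.1%}")
```

Under these assumptions the same ten verbs account for roughly four fifths of verb cue tokens in PARC but well under half in KT-Pilot, which is consistent with the genre divergence described above. The figures are rough, because the totals are estimated rather than taken from the annotation.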

Conclusion

Data from the two corpora highlight critical divergences in AR cues:
• Genre: greater neutrality in the choice of predicates within news language, and a remarkable difference in verb type distribution across genres;
• Domain: polysemous verbs consistently used with an attributional meaning in the news domain can predominantly assume a non-attributional meaning in the computer domain (e.g. declare);
• Register: data from the KT-Pilot corpus show an increased proportion of non-verb cues in thread summaries, the use of acronyms as cues and a less standard use of punctuation.

This suggests that basing an AR extraction task, or any kind of analysis, on precompiled lists of cues would be rather impractical.

References

• Ralf Krestel, Sabine Bergler, and René Witte. 2008. Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles. In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC 2008). ELRA, Marrakech, Morocco.
• Silvia Pareti, Timothy O'Keefe, Ioannis Konstas, James R. Curran, and Irena Koprinska. 2013. Automatically Detecting and Attributing Indirect Quotations. In EMNLP. ACL, 989–999.
• Silvia Pareti. 2012. A Database of Attribution Relations. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). Istanbul.