Attribution Relation Cues Across Genres.

THE UNIVERSITY OF EDINBURGH
A comparison of verbal and nonverbal cues in news and thread summaries
Alice Bracchi
UNIVERSITÀ DEGLI STUDI DI PAVIA
alice.bracchi01@ateneopv.it
The corpora
I compared PARC (Pareti, 2012), the only available AR corpus that
annotates attribution cues, with KT-Pilot.
KT-Pilot is a sample of the Kernel Traffic Summaries of the Linux
Kernel Mailing List (http://kt.earth.li/kernel-traffic/archives.html),
annotated with ARs for this purpose.
                  KT-Pilot            PARC
N. of tokens      75k                 1,139k
N. of ARs         1,766               10,526
ARs/1k tokens     23                  9.2
Genre             Thread summaries    News
Register          Informal            Formal
Domain            Computer science    Diverse
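The ARs/1k tokens row can be rechecked from the two rows above it. The snippet below is a minimal sketch of that arithmetic; the dictionary layout and function name are illustrative choices, and the token counts are read as thousands (75k and 1,139k).

```python
# Figures copied from the corpus table; token counts are in thousands.
corpora = {
    "KT-Pilot": {"tokens_k": 75, "ars": 1766},
    "PARC": {"tokens_k": 1139, "ars": 10526},
}


def ars_per_1k_tokens(ars: int, tokens_k: float) -> float:
    """Attribution relations per 1,000 tokens."""
    return ars / tokens_k


density = {
    name: ars_per_1k_tokens(c["ars"], c["tokens_k"])
    for name, c in corpora.items()
}

for name, value in density.items():
    print(f"{name}: {value:.2f} ARs per 1k tokens")
```

On these counts, thread summaries come out roughly two and a half times as dense in ARs as news text.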
[Figure: The 20 most frequent verbal cues in the two corpora.]
Verb Cues Distribution
3. Attribution Type distribution
[Figure: attribution type distribution (ASSERTION, BELIEF, EVENTUALITY, FACT); panel label: WSJ.]
1. Phrasal Verbs frequency:
Attribution Relations: the role of the cue
What is an AR?
An attribution relation is the relation ascribing the ownership of an
attitude towards some linguistic material (i.e. the text itself, a portion
of it or their semantic content) to an entity.
(Pareti & Prodanof, 2010)
Jeff Garzik said that the RTC driver was pointless without interrupts.
[SOURCE: Jeff Garzik] [CUE: said] [CONTENT: that the RTC driver was pointless without interrupts]
The cue in automatic AR extraction
•  Most previous studies used pre-compiled lists of 30-50 verbs, drawn from news
articles and intended to be exhaustive and reliable sets of verb cues.
•  Those lists were not based on any annotation of the phenomenon in text, but
rather on the verbs that are most expected to signal an AR.
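To make the list-based approach concrete, here is a minimal sketch of cue matching against a fixed verb list. The cue list and the tiny inflection table are hypothetical stand-ins (they are not the lists used in the cited studies, and a real system would use a proper lemmatizer); the test sentences are quoted from this poster.

```python
import re
from typing import List

# Hypothetical pre-compiled cue list, for illustration only.
CUE_VERBS = {"say", "declare", "maintain", "support", "think"}

# Toy inflection table standing in for a real lemmatizer (an assumption).
LEMMAS = {
    "said": "say", "says": "say",
    "declared": "declare", "declares": "declare",
    "maintained": "maintain", "maintains": "maintain",
    "supported": "support", "supporting": "support", "supports": "support",
    "thought": "think", "thinks": "think",
}


def find_list_cues(sentence: str) -> List[str]:
    """Return the lemmas of all tokens that appear in the cue list."""
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return [lemma for lemma in lemmas if lemma in CUE_VERBS]


news = "Mexican officials maintain the Japanese reserve is only a result of unfamiliarity."
thread = "We will have the resources to maintain arch-xen on Linux 2.6 going forward."
phrasal = "David came back with: Look folks."

news_hits = find_list_cues(news)        # attributional use of "maintain"
thread_hits = find_list_cues(thread)    # same verb, non-attributional use
phrasal_hits = find_list_cues(phrasal)  # phrasal cue absent from the list
```

The matcher flags maintain in both the news and the thread sentence, although only the first is attributional, and it finds nothing in the sentence whose cue is the phrasal verb come back with.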
Cue verbs used by Krestel et al. [2008]
Journalistic prose presents:
• ASSERTIONS: 91%, of which
   • SAY: 70% of verbal cues
[Figure: Types of verbal cues in the two corpora (ASSERTION, BELIEF, FACT). Final results may vary due to the incompleteness of PARC annotation.]
This divergence in frequency is due to THREAD STRUCTURE. One user asks a question, the community posts replies until the question is answered, then the thread is closed. The striking prevalence of the verb say is also genre-specific. Journalists often report what people actually said, whilst communication in thread summaries is written.

Rank  Verb      KT-Pilot  KT-Pilot %  PARC   PARC %
1     say       300       18.08%      6493   69.65%
2     reply     181       10.91%      10     0.12%
3     add       170       10.24%      308    3.30%
4     want      106       6.38%       17     0.18%
5     think     94        5.67%       180    1.93%
6     ask       84        5.06%       28     0.30%
7     announce  63        4.00%       31     0.33%
8     point     52        3.14%       28     0.30%
9     report    38        2.29%       76     0.82%
10    explain   34        2.04%       37     0.40%
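One way to read these figures is as a ratio between a verb's share of verb cues in KT-Pilot and its share in PARC. The sketch below uses the percentages listed for the ten most frequent KT-Pilot verb cues; the ratio itself is an illustrative measure, not one reported on the poster.

```python
# (verb, KT-Pilot share in %, PARC share in %), copied from the figures above.
shares = [
    ("say", 18.08, 69.65),
    ("reply", 10.91, 0.12),
    ("add", 10.24, 3.30),
    ("want", 6.38, 0.18),
    ("think", 5.67, 1.93),
    ("ask", 5.06, 0.30),
    ("announce", 4.00, 0.33),
    ("point", 3.14, 0.30),
    ("report", 2.29, 0.82),
    ("explain", 2.04, 0.40),
]


def share_ratio(kt_share: float, parc_share: float) -> float:
    """How many times larger a verb's share is in KT-Pilot than in PARC."""
    return kt_share / parc_share


# Rank verbs from most thread-specific to most news-specific.
ranked = sorted(shares, key=lambda row: share_ratio(row[1], row[2]), reverse=True)

for verb, kt, parc in ranked:
    print(f"{verb:<9} {share_ratio(kt, parc):6.1f}x")
```

On this measure reply is the most thread-specific cue and say is the only one of the ten that is proportionally rarer in KT-Pilot than in PARC.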
10 most frequent verb cue types in PARC:

Rank  Verb      PARC   PARC %   KT-Pilot  KT-Pilot %
1     say       6493   69.65%   300       18.08%
2     add       308    3.30%    170       10.24%
3     note      186    2.00%    15        1%
4     think     180    1.93%    94        5.67%
5     believe   141    1.51%    13        1%
6     tell      103    1.10%    20        1.21%
7     expect    92     1.00%    31        1.90%
8     argue     78     0.84%    5         0.28%
9     report    76     0.82%    38        2.29%
10    estimate  71     0.76%    1         0.05%

MAINTAIN
PARC: Mexican officials maintain the Japanese reserve is only a result of unfamiliarity.
KT-Pilot: We will have the resources to maintain arch-xen on Linux 2.6 going forward.

SUPPORT (*) (idea/software)
PARC: They, as well as numerous Latin American and East European countries […], are supporting the direction Spain is taking.
KT-Pilot: The comx drivers support lapb thru the lapb stack.

(*) Of the three verbs, support is the least associated with an attribution meaning.
Nonetheless, in thread summaries it almost never occurs as a cue, whereas in
PARC it can be associated with belief attributions.
1. Acronyms as cues: the presence of acronyms is to be considered
highly genre and register specific, since no occurrences are found in PARC.
•  AFAIK: As far as I know, […]
   KT-Pilot: AFAIK, it is a workaround for a gcc-2.7 bug discovered by John Davis.
•  IMO: In my opinion, […]
   KT-Pilot: IMO it would need quite some cleanup work first.
•  IMHO: In my humble opinion, […]
(statement/code)
However, not analyzing non-verb cues in thread summaries would miss 4.36% of
cues.
PARC: “This is the peak of my wine-making experience,” Mr. Winiarski declared
when he introduced the wine at a dinner in New York.
KT-Pilot: I changed the name of the structure that must be declared from struct
driver_file_entry to struct device_attribute.
   KT-Pilot: This IMHO is a good thing for all Real Time SMP.
•  IMNSHO: In my not so humble opinion, […]
   KT-Pilot: And IMNSHO it is needed, since it will make devfs users much cleaner.
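Acronym cues of this kind are easy to spot with a pattern match. The following is a minimal sketch that assumes the four acronyms listed here and upper-case spelling only; the names ACRONYM_CUES and find_acronym_cues are illustrative.

```python
import re
from typing import List

# The four acronym cues discussed above, with their expansions.
ACRONYM_CUES = {
    "AFAIK": "as far as I know",
    "IMO": "in my opinion",
    "IMHO": "in my humble opinion",
    "IMNSHO": "in my not so humble opinion",
}

# Word boundaries keep "IMO" from matching inside longer tokens.
ACRONYM_RE = re.compile(r"\b(" + "|".join(ACRONYM_CUES) + r")\b")


def find_acronym_cues(text: str) -> List[str]:
    """Return every acronym cue found in the text, in order of appearance."""
    return ACRONYM_RE.findall(text)


examples = [
    "AFAIK, it is a workaround for a gcc-2.7 bug discovered by John Davis.",
    "This IMHO is a good thing for all Real Time SMP.",
    "And IMNSHO it is needed, since it will make devfs users much cleaner.",
]
found = [find_acronym_cues(sentence) for sentence in examples]
```

A matcher like this only locates the cue; deciding who the source is (usually the author of the quoted message) still needs the surrounding context.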
2. Punctuation cues:
• Absence of quotation marks when direct speech is reported:
•  Some verbs are considered particularly reliable in attribution cue lists
(Krestel et al. 2008).
•  Yet some of these verbs are, in fact, polysemous, and do not occur in
thread language as verbs with attributional meaning.
As far as news language is concerned, this neglect of non-verb cues does not
significantly jeopardize the outcome of the research, given how rarely such cues
occur (< 1% in PARC).
Overall, news language shows a
more neutral choice of predicates.
Thread summaries, by contrast, show:
• Wider variety of predicate choices;
• Wider presence of BELIEF cues.
DECLARE (statement/variable)
Previous work shows an almost complete disregard for non-verb cues,
except for the phrase "according to".
Domain Specific differences
2. Verb type frequencies
10 most frequent verb cue types in KT-Pilot:
David came back with: Look folks. All of these arguments are going on deaf ears,
because the old behavior is not coming back without a solution to the problem which
was solved.
Why focus on the cue?
•  It is the lexical anchor linking source and content;
•  It expresses the attitude (e.g. assertion, belief), defining the nature of
the AR;
•  Verb semantics and syntax can help identify source and content
(e.g. in "Jeff Garzik said that the RTC driver was pointless without interrupts": SOURCE-Vsbj – Vcue – CONTENT-Vobj).
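The SOURCE – cue – CONTENT pattern can be illustrated with a deliberately naive split around a known cue word. This is only a sketch under strong assumptions: the cue is given in advance, the source is everything before it and the content everything after it; a real extractor would rely on syntactic parsing. The names Attribution and split_on_cue are hypothetical.

```python
from typing import NamedTuple, Optional


class Attribution(NamedTuple):
    source: str
    cue: str
    content: str


def split_on_cue(sentence: str, cue: str) -> Optional[Attribution]:
    """Split a sentence into source, cue and content around a known cue word."""
    before, found, after = sentence.partition(f" {cue} ")
    if not found:
        return None  # the cue word does not occur as a separate token
    content = after.strip().rstrip(".")
    return Attribution(source=before.strip(), cue=cue, content=content)


ar = split_on_cue(
    "Jeff Garzik said that the RTC driver was pointless without interrupts.",
    "said",
)
```

Here the subject of the cue verb is taken as the source and its clausal object as the content, which is the configuration the bullet above describes.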
A higher percentage of phrasal verb cues is found in KT-Pilot, with predicates
such as go on, come back with (sth), come down on (smb), etc.
•  KT-Pilot: 4.14%
•  PARC: 0.42%
THREAD SUMMARIES:
4.36% of cues are nonverbal
NEWS LANGUAGE:
0.46% of cues are nonverbal
[Figure axis labels, most frequent verbal cues in PARC: say, add, note, think, believe, tell, expect, argue, report, estimate, contend, suggest, predict, agree, recall, indicate, explain, acknowledge, show.]
[Figure axis labels, most frequent verbal cues in KT-Pilot: say, reply, add, want, think, ask, announce, point, report, explain, expect, require, suggest, see, feel, tell, remark, show, hope, note.]
Abstract
Previous work on the automatic detection of opinion and quotation
Attribution Relations (ARs) has treated the cue, the lexical anchor
connecting the attributed text to its source, as the central element of the
task. Most Attribution Extraction approaches are built
upon lists of verb cues that are thought to be sufficiently exhaustive and
reliable in signalling ARs in a text.
The purpose of this project is to test how reliable such lists are once we
move away from the news genre they have mostly been applied to. To
investigate this, I compared data from a news corpus annotating
attribution cues to a small corpus of thread summaries compiled for
the purpose. The comparison shows not only that cues are
highly genre, register and domain specific, but also
that attribution cue analysis should not be restricted to verbs.
Thus, basing an analysis on pre-established lists of generally valid cues, or
even attempting to compile new lists from annotated cues, proves
highly impracticable.
Non-verb cues
KT-Pilot
Alexander also said, If Linus decides to remove devfs, I certainly won’t weep for it.
KT-Pilot
Linus Torvalds: This one is a whole lot harder to fix.
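When a colon is the only cue, as in the Linus Torvalds example, a simple pattern can separate the source from the content. The sketch below assumes the line starts with a capitalised name followed directly by a colon; the pattern and function name are illustrative, not part of the original study.

```python
import re
from typing import Optional, Tuple

# Assumed shape: one or more capitalised words, a colon, then the reported text.
SPEAKER_COLON = re.compile(r"^([A-Z][a-z]+(?: [A-Z][a-z]+)*): (.+)$")


def split_colon_cue(line: str) -> Optional[Tuple[str, str]]:
    """Return (source, content) if the line has the 'Name: content' shape."""
    match = SPEAKER_COLON.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


linus = split_colon_cue("Linus Torvalds: This one is a whole lot harder to fix.")
david = split_colon_cue("David came back with: Look folks.")
```

The second call returns None because the colon there follows a verbal cue (came back with) rather than the bare name, so that sentence has to be handled by verb-cue analysis instead.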
Conclusion
Data from the two corpora highlight critical divergences in AR cues:
•  Genre: greater neutrality in the choice of predicates within news language,
and a remarkable difference in verb type distribution across genres;
•  Domain: polysemous verbs consistently used with an attributional meaning
in the news domain can predominantly assume a non-attributional meaning
in the computer domain (e.g. declare);
•  Register: data from the KT-Pilot corpus show an increased proportion of non-verb
cues in thread summaries, the use of acronyms as cues and a less
standard use of punctuation.
This suggests that basing an AR extraction task, or any kind of analysis, on
pre-compiled lists of cues would be rather impractical.
References
•  Ralf Krestel, Sabine Bergler, and René Witte. 2008. Minding the Source: Automatic Tagging of
Reported Speech in Newspaper Articles. In Proceedings of the Sixth International Language
Resources and Evaluation Conference (LREC 2008). ELRA, Marrakech, Morocco.
•  Silvia Pareti, Timothy O’Keefe, Ioannis Konstas, James R. Curran, and Irena Koprinska. 2013.
Automatically Detecting and Attributing Indirect Quotations. In EMNLP. ACL, 989–999.
•  Silvia Pareti. 2012. A Database of Attribution Relations. In Proceedings of the Eighth Conference
on International Language Resources and Evaluation (LREC12), Istanbul.