Attribution and the PDTB

Silvia Pareti
The University of Edinburgh
School of Informatics
Outline
• Introduction
• Attribution in the PDTB
• Annotation schema extension
• Resources development
• Preliminary achievements
• Future directions
Introduction - Attribution
(wsj 0961)
PDTB - Attribution Annotation
(Prasad et al., 2008)
Mr. Nemeth said in parliament that Czechoslovakia and Hungary would suffer environmental damage if the twin dams were built as planned. (wsj_0037)

____Explicit____
9163..9165
#### Text ####
if
#### Features ####
Ot, Comm, Null, Null
9067..9096
#### Text ####
Mr. Nemeth said in parliament
##############
if, Contingency.Condition.Unreal present
____Arg1____
9097..9162
#### Text ####
that Czechoslovakia and Hungary would suffer environmental damage
#### Features ####
Inh, Null, Null, Null
____Arg2____
9166..9201
#### Text ####
the twin dams were built as planned
#### Features ####
Inh, Null, Null, Null
Other corpora with attribution
• MPQA Opinion Corpus (Wiebe et al., 2002)
– 692 articles
– intra-sentential annotation
• RST Discourse Treebank (Carlson & Marcu, 2001)
– 385 articles
– intra-sentential, only explicit sources, verb cues or "according to"
• GraphBank (Wolf & Gibson, 2005)
– 135 articles
– only attributions not overlapping with other discourse relations
• Other smaller or low-coverage projects
– Sydney Morning Herald Corpus (O’Keefe et al., submitted)
– Corpus TCC and RHETALHO (Pardo et al., 2004)
PDTB - Advantages
Large corpus
→ less frequent structures and strategies are better observed, e.g.:
Groused Robert Antolini, head of over-the-counter
trading at Donaldson, Lufkin & Jenrette: "It's making it
tough for traders to make money". (wsj_1142)
For some at the SEC, an agency that covets its
independence, Mr. Breeden may be too much of a
Washington insider. (wsj_0955)
PDTB - Advantages
The range of attributions covered is not pre-defined
• Attributions are not limited to the sentence level
• A wide range of attributions are annotated:
– direct, indirect and mixed
– having named or not named, explicit as well as
implicit sources (e.g. it is believed…)
– having verb and non-verb cues (e.g. idea, for)
• Includes some relevant features
PDTB - Extensions
• Finer grained annotation of the attribution span:
source, cue, circumstantial information
• Completing content spans of some direct or mixed
attributions
"It's just sort of a one-upsmanship thing with some people,"
added Larry Shapiro. "They like to talk about having the
new Red Rock Terrace one of Diamond Creek's Cabernets or
the Dunn 1985 Cabernet, or the Petrus.
Producers have seen this market opening up and they're
now creating wines that appeal to these people."
(wsj 0071)
PDTB - Extensions
• Annotation of attributions not overlapping with
discourse relations
• Annotation of nested attributions
["The Caterpillar people aren't too happy when they see their
equipment used like that,"]
[shrugs] [Mr. George].
["They figure it's not a very good advert."] (wsj 1121)
[They] [figure] [it's not a very good advert]
Annotation Schema
source, cue, SUPPLEMENT: from the [PDTB attribution span]
content: from the PDTB discourse connective/Arg1/Arg2 text spans
[Mr. Nemeth said IN PARLIAMENT] that Czechoslovakia
and Hungary would suffer environmental damage if the
twin dams were built as planned. (wsj_0037)
PDTB Attribution Features
attribution type
• assertion (e.g. say, mention)
• belief (e.g. think, doubt)
• fact (e.g. remember, know)
• eventuality (e.g. allow)
source type
• writer (if explicit, e.g. I think...)
• other (e.g. Mr. Brown, a witness)
• arbitrary (e.g. one, people)
• mixed (e.g. My assessment and
everyone's assessment is…(wsj_2012))
PDTB Attribution Features
factuality (determinacy)
• factual
• non-factual
scopal change (scopal polarity)
• none
• scopal change
Se c’è, cioè, una maggioranza in Parlamento in grado di affrontare
seriamente una fase di riforme anche elettorali, Ø penso che la
legislatura possa utilmente proseguire. (re075)
If, that is, there is a majority in Parliament able to seriously tackle a
phase of reforms, including electoral ones, (I) think that the legislature
could usefully continue.
New Attribution Features
source attitude
• neutral (e.g. say, add)
• positive (e.g. welcome, beam)
• critical (e.g. lament, fume)
• tentative (e.g. believe, suggest)
• other (e.g. joke)
authorial stance
• committed (e.g. admit, know)
• not-committed (e.g. lie, claim)
• neutral (e.g. say, suggest)
New Attribution Features
Mr. Nemeth said IN PARLIAMENT that Czechoslovakia and Hungary
would suffer environmental damage if the twin dams were built
as planned. (wsj_0037)
Attribution type: assertion
Source type: other
Factuality: factual
Scopal change: none
Source attitude: neutral
Authorial stance: neutral
Confronted, Mrs. Yeargin admitted she had given the questions
and answers two days before the examination to two low-ability
geography classes. (wsj 0044)
Authorial stance: committed
"I think that this magazine is not only called Garbage, but it is
practicing journalistic garbage," fumes a spokesman for
Campbell Soup. (wsj 0062)
Source attitude: negative
Inter-Annotator Agreement
• 2 annotators
• annotation manual
• training on an article
• MMAX2 annotation tool (Müller & Strube, 2006)
• complete annotation schema
Data: 14 articles (PDTB)
• 491 attributions (22% are nested)
(Pareti, 2012 submitted)
Results - Existence of Attribution
0.87 agr
proportion of commonly annotated
relations with respect to the annotations
identified overall by Annotator A and
Annotator B
NOTE: writer attributions were annotated only if explicit
Span selection tasks (agr metric):
Cue     Source  Content  Supplement
0.97    0.94    0.95     0.37
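The slide does not spell out the agr formula. A minimal sketch, assuming the common directional reading (the share of one annotator's relations that the other annotator also marked) and assuming that the two directions are averaged into a single score; the relation identifiers are hypothetical:

```python
def agr(a, b):
    """Directional agreement: share of a's relations that b also annotated."""
    return len(a & b) / len(a) if a else 0.0


def mean_agr(a, b):
    """Average of the two directions (an assumption, not stated on the slide)."""
    return (agr(a, b) + agr(b, a)) / 2


# Hypothetical relation ids found by each annotator.
annotator_a = {"r1", "r2", "r3", "r4"}
annotator_b = {"r2", "r3", "r4", "r5", "r6"}

score_ab = agr(annotator_a, annotator_b)   # 3 of A's 4 relations are shared
score_ba = agr(annotator_b, annotator_a)   # 3 of B's 5 relations are shared
score_mean = mean_agr(annotator_a, annotator_b)
```

The same functions could be applied to span selection by replacing relation ids with matching span offsets.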
Results- Features
FEATURE            PERCENT AGREEMENT   COHEN'S KAPPA
TYPE               83.42 (317)         0.63
SOURCE             95 (361)            0.71
SCOPAL CHANGE      98.68 (375)         0.60
AUTHORIAL STANCE   94.47 (359)         0.20
SOURCE ATTITUDE    82.36 (313)         0.48
FACTUALITY         97.63 (371)         0.73
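Several features in this table pair a high percent agreement with a much lower kappa. A minimal sketch of the two measures on hypothetical toy labels (not the study's data) illustrates how that can happen when one label dominates, since kappa discounts the agreement expected by chance:

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items that received the same label from both annotators."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical, heavily skewed labels for 20 items.
ann_a = ["neutral"] * 17 + ["committed", "committed", "neutral"]
ann_b = ["neutral"] * 17 + ["committed", "neutral", "committed"]

toy_agreement = percent_agreement(ann_a, ann_b)
toy_kappa = cohen_kappa(ann_a, ann_b)
```

On this toy data the annotators agree on 18 of 20 items, yet kappa is well below that figure because most of the agreement is on the dominant label.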
Italian Attribution Corpus-ItAC
(Pareti and Prodanof, 2010)
• 50 articles (37,000 tokens) from Italian newspaper
corpora (e.g. La Repubblica)
• 460 attribution relations
• Freely available from:
http://homepages.inf.ed.ac.uk/s1052974/resources.php
PDTB Attribution Corpus
(Pareti, 2012)
9868 attributions
Stand-off annotation of attribution based on the PDTB:
• Comprises all attribution relations annotated in the
PDTB (reconstructed from the current annotation)
• The annotation is further extended according to the
revised annotation schema
PDTB Attribution Corpus
Annotation of the attribution span: source, cue, SUPPLEMENT
80% automatically, then manually revised, using 48
matching rules, e.g.:
(NP-SBJ)(VP): one person said
(PP-LOC)(NP)(VB): IN DALLAS, LTV said
(NP-SBJ)(VBP)(JJ): I am sure
20% had rarer syntax and were manually annotated, e.g.:
Judge Curry ordered the refunds to begin Feb. 1 and
said (wsj 0015)
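The rule-based step above can be sketched as a lookup from a sequence of constituent labels to span roles. This is only an illustration: the three patterns come from the slide, but the role assigned to each constituent, the `RULES` table, and the function name are assumptions, and the actual 48 rules are not reproduced here.

```python
# Pattern of constituent labels -> role of each constituent (assumed mapping).
RULES = {
    ("NP-SBJ", "VP"): ("source", "cue"),
    ("PP-LOC", "NP", "VB"): ("supplement", "source", "cue"),
    ("NP-SBJ", "VBP", "JJ"): ("source", "cue", "cue"),
}


def split_attribution_span(chunks):
    """Split a PDTB attribution span given as (label, text) chunks.

    Returns a dict with source, cue and supplement text, or None when no
    rule matches (the rarer syntax that was left for manual annotation).
    """
    labels = tuple(label for label, _ in chunks)
    roles = RULES.get(labels)
    if roles is None:
        return None
    parts = {"source": [], "cue": [], "supplement": []}
    for role, (_, text) in zip(roles, chunks):
        parts[role].append(text)
    return {role: " ".join(texts) for role, texts in parts.items()}


example = split_attribution_span(
    [("PP-LOC", "In Dallas"), ("NP", "LTV"), ("VB", "said")]
)
unmatched = split_attribution_span([("NP-SBJ", "Judge Curry"), ("CC", "and")])
```

Returning `None` for unmatched patterns keeps the automatic and manual portions clearly separated.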
PDTB Attribution Corpus
Further annotation of the content span:
– adding punctuation (direct quotation marks)
– completing content spans that had only been partially
annotated
– annotating the quote status of the attribution based
on the position of quote span QS and content span CS:
• direct: QS = CS
• indirect: CS outside or contained in QS
• mixed: CS overlaps QS or QS contained in CS
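The three quote-status conditions can be sketched as a small classifier. The sketch assumes each span is a single half-open `(start, end)` character-offset pair and that a missing quote span counts as CS lying outside QS; the function name is hypothetical and multi-part spans are not handled.

```python
def quote_status(cs, qs):
    """Classify an attribution as direct, indirect or mixed.

    cs: (start, end) offsets of the content span.
    qs: (start, end) offsets of the quote span, or None if there is no quote.
    """
    if qs is None:
        return "indirect"          # no quotation marks at all
    if cs == qs:
        return "direct"            # QS = CS
    cs_start, cs_end = cs
    qs_start, qs_end = qs
    outside = cs_end <= qs_start or qs_end <= cs_start
    contained = qs_start <= cs_start and cs_end <= qs_end
    if outside or contained:
        return "indirect"          # CS outside or contained in QS
    return "mixed"                 # CS overlaps QS or QS contained in CS
```

For instance, `quote_status((0, 20), (5, 10))` falls into the mixed case because the quoted material is only part of the content.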
PDTB Attribution Corpus
ATTRIBUTION ID: wsj_0003.pdtb_05
SOURCE SPAN: Darrell Phillips, vice president of human resources for Hollingsworth & Vose
CUE SPAN: said
CONTENT SPAN: “There’s no question that some of those workers and managers contracted asbestos-related diseases,”
“But you have to recognize that these events took place 35 years ago. It has no bearing on our work force today.”
SUPPLEMENT SPAN: None
FEATURES: Ot, Comm, Null, Null
QUOTE STATUS: Direct
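A record like this one could be held in a small data structure. The sketch below is not the corpus's release format: the class and field names are hypothetical, and the decoding assumes the PDTB feature order source, type, scopal polarity, determinacy, with `Null` read as "no scopal change" and "factual" respectively.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed mapping from PDTB abbreviations to the labels used on the slides.
SOURCE_TYPES = {"Wr": "writer", "Ot": "other", "Arb": "arbitrary", "Inh": "inherited"}
ATTRIBUTION_TYPES = {"Comm": "assertion", "PAtt": "belief", "Ftv": "fact", "Ctrl": "eventuality"}


@dataclass
class AttributionRecord:
    attribution_id: str
    source: str
    cue: str
    content: List[str]          # a content span may have several parts
    supplement: Optional[str]
    features: str               # raw PDTB feature string
    quote_status: str

    def decoded_features(self):
        """Expand the raw feature string into named values."""
        src, typ, polarity, determinacy = [f.strip() for f in self.features.split(",")]
        return {
            "source_type": SOURCE_TYPES.get(src, src),
            "attribution_type": ATTRIBUTION_TYPES.get(typ, typ),
            "scopal_change": polarity != "Null",
            "factual": determinacy == "Null",
        }


record = AttributionRecord(
    attribution_id="wsj_0003.pdtb_05",
    source="Darrell Phillips, vice president of human resources for Hollingsworth & Vose",
    cue="said",
    content=["There's no question ...", "But you have to recognize ..."],  # abbreviated here
    supplement=None,
    features="Ot, Comm, Null, Null",
    quote_status="Direct",
)
decoded = record.decoded_features()
```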
Use of PDTB Attribution Corpus
Independent analysis of attribution:
• cue composition
– several cues other than verbs (prepositions, nouns,
adverbs)
– wide range of attributional verbs (266 types in the corpus)
• source composition
– named entities (NEs) make up only about 50% of the sources
• attribution structures
Use of PDTB Attribution Corpus
Testing a system for the identification of direct quotes
and their speaker in the literature and news domains.
University of Sydney and Sydney Morning Herald
(O’Keefe et al. 2012, submitted).
• rule-based and machine-learning based approaches
have been tested on 3 corpora.
• The results show that direct quotes differ by
domain and style
Future
• Development of an attribution extraction system
using the data to train a classifier
• Semi-automatic extension of the annotation to
comprise all attributions in the corpus
• Annotation of the level of nesting of each attribution
• Release of the corpus for development/testing and
use in shared tasks
Conclusion
• Advantages of attribution in the PDTB
• Development of a finer-grained annotation schema
and its inter-annotator agreement results
• Application of the schema to a small corpus of Italian
• Collection and further annotation of attribution in
the PDTB
• Importance of this resource for the analysis of
attribution and its ‘long tail’ and for testing and
developing attribution extraction systems
Bibliography
Carlson, L. and Marcu, D. Discourse tagging reference manual. Technical report ISI-TR-545, ISI, University of Southern California, September 2001.
Müller, C. and Strube, M. Multi-level annotation of linguistic data with MMAX2. In
Sabine Braun, Kurt Kohn and Joybrato Mukherjee, editors, Corpus Technology and Language
Pedagogy: New Resources, New Tools, New Methods (English Corpus Linguistics, Vol. 3), pages 197–214. Peter Lang, Frankfurt, 2006.
O’Keefe, T., Pareti, S., Curran, J., Koprinska, I. and Honnibal, M. A sequence labelling
approach to quote attribution. Manuscript submitted for publication, 2012.
Pardo, T., das Graças Volpe Nunes, M. and Rino, L. Dizer: An automatic discourse
analyzer for Brazilian Portuguese. In Ana Bazzan and Sofiane Labidi, editors, Advances
in Artificial Intelligence – SBIA 2004, volume 3171 of Lecture Notes in Computer
Science, pages 224–234. Springer Berlin / Heidelberg, 2004.
Pareti, S. and Prodanof, I. Annotating attribution relations: Towards an Italian
discourse treebank. In Proceedings of LREC10, 2010.
Pareti, S. A database of attribution relations. In Proceedings of LREC12, Istanbul, 23-25
May 2012 (to appear).
Pareti, S. Theory and practise of annotating attributions. Manuscript submitted for
publication, 2012.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A. and Webber, B. The
Penn Discourse Treebank 2.0. In Proceedings of LREC08, 2008.
Wiebe, J. Instructions for annotating opinions in newspaper articles. Technical report,
University of Pittsburgh, 2002.
Wolf, F. and Gibson, E. Representing discourse coherence: A corpus-based study.
Computational Linguistics, 31:249–288, June 2005.