Dense Event Ordering with a Multi-Pass Architecture

advertisement
Dense Event Ordering with a Multi-Pass Architecture
Nathanael Chambers
United States Naval Academy
Taylor Cassidy
Army Research Lab
IBM Research
nchamber@usna.edu
taylorcassidy64@gmail.com
Bill McDowell
Carnegie Mellon University
Steven Bethard
University of Alabama at Birmingham
forkunited@gmail.com
bethard@cis.uab.edu
Abstract
The past 10 years of event ordering research
has focused on learning partial orderings over
document events and time expressions. The
most popular corpus, the TimeBank, contains
a small subset of the possible ordering graph.
Many evaluations follow suit by only testing
certain pairs of events (e.g., only main verbs
of neighboring sentences). This has led most
research to focus on specific learners for partial labelings. This paper attempts to nudge
the discussion from identifying some relations
to all relations. We present new experiments
on strongly connected event graphs that contain ∼10 times more relations per document
than the TimeBank. We also describe a shift
away from the single learner to a sieve-based architecture that naturally blends multiple learners into a precision-ranked cascade of sieves.
Each sieve adds labels to the event graph one
at a time, and earlier sieves inform later ones
through transitive closure. This paper thus describes innovations in both approach and task.
We experiment on the densest event graphs to
date and show a 14% gain over state-of-the-art.
1
Introduction
Event ordering in the NLP community usually refers
to a specific labeling task: given pairs of events and
time expressions, choose ordering relations for them.
The TimeBank corpus (Pustejovsky et al., 2003) provided a corpus of labeled pairs that motivated machine learning approaches to this task. However,
only a small subset of possible pairs are labeled. The
annotators identified the most central and obvious relations, and left the rest unlabeled. Partly because of
this, there has been little research into complete event
graphs. This paper describes our recent exploration
into the dramatic changes required of both learning
algorithms and experimental setup when faced with
the task of a more complete graph labeling.
The most obvious shift required of a complete
graph labeling algorithm is that of relation identification. Research on the TimeBank and in the TempEval
contests has largely focused on relation classification.
The event pairs are given, and the task is to classify
them. We are now forced to first determine which
events should be paired up before classification. Our
experiments here fully integrate and evaluate both
identification and classification.
We are not the first to approach relation identification with classification, but this paper is the first
to directly and comprehensively address it. Most
recently, TempEval 3 (UzZaman et al., 2013a) proposed a labeling of raw text without prior relation
identification, but the challenge ultimately relied on
the TimeBank. Systems were only evaluated on its
subset of labeled event pairs. This meant that relation
identification was largely ignored. The top system optimized only relation classification and intentionally
left many pairs unlabeled (Bethard, 2013).
This paper presents CAEVO, a CAscading EVent
Ordering architecture. It is a novel sieve-based architecture for temporal event ordering that directly
addresses the interplay between identification and
classification. We shift away from the idea of monolithic learners, and propose smaller specialized classifiers. Inspiration comes from the recent success in
named entity coreference with sieve-based learning
(Lee et al., 2013). CAEVO contains a host of classi-
273
Transactions of the Association for Computational Linguistics, 2 (2014) 273–284. Action Editor: Ellen Riloff.
c
Submitted 11/2013; Revised 5/2014; Published 10/2014. 2014
Association for Computational Linguistics.
fiers that each specialize on different types of edges.
The classifiers are ordered by their individual precision, and run in order starting with the most precise.
Each sieve informs those below it by passing on its
decisions as graph constraints. These more precise
constraints are then used by later sieves to assist their
less precise decisions.
One of the main advantages of this CAEVO architecture is the natural inclusion of transitive reasoning.
After each sieve proposes its labels, the architecture
infers transitive links from the new labels and adds
them to the graph. This type of expansion has not
been adequately evaluated due to our community’s
sparsely labeled corpora. We are the first to report the
surprising result that edges from transitive closure are
more precise than the original sieve-provided labels.
Finally, CAEVO enables quick experimentation.
Classifiers are pulled in and out with ease, quickly
testing new theories. As in D’Souza and Ng (2013),
we created rule-based classifiers based on linguistic
theory. We naturally integrate these with machine
learned classifiers using the sieve architecture. Each
classifier focuses on its independent decisions, and
the architecture then imposes transitivity constraints.
CAEVO contains 12 sieves ordered by precision, and
it outperforms the top systems in event ordering.
2
Previous Work
Two lines of research are particularly relevant to our
work: research on event ordering annotation, and
research on event ordering models. We briefly review
both here.
2.1 Event Ordering Annotation
Most prior annotation efforts constructed corpora
with only a small subset of temporal relations annotated. In the TimeBank (Pustejovsky et al., 2003),
temporal relations were only annotated when the relation was judged to be salient by the annotator. The
TempEval competitions (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013b) aimed to
improve coverage by annotating relations between
all events and times in the same sentence. However,
event tokens that were mentioned fewer than 20 times
were excluded and only one TempEval task considered relations between events in different sentences.
274
Events Times Relations R
TimeBank
7935 1414
6418 0.7
Bramsen 2006
627
–
615 1.0
TempEval 2007
6832 1249
5790 0.7
TempEval 2010
5688 2117
4907 0.6
TempEval 2013 11145 2078
11098 0.8
Kolomiyets 2012 1233
–
1139 0.9
1
Do 2012
324
232
3132 5.6
This work
1729
289
12715 6.3
Table 1: Events, times, temporal relations and the ratio of
relations to events + times (R) in various corpora.
To avoid such sparse annotations, researchers have
explored schemes that encourage annotators to connect all events. Bramsen et al. (2006) annotated
timelines as directed acyclic graphs, though they annotated multi-sentence segments of text rather than
individual events. Kolomiyets et al. (2012) annotated
“temporal dependency structures” (i.e. dependency
trees of temporal relations), though they only focused
on pairs of events. In Do et al. (2012), “the annotator was not required to annotate all pairs of event
mentions, but as many as possible”, and then more
relations were automatically inferred after the annotation was complete. In contrast, in our work we
required annotators to label every possible pair of
events/times in a given window, and our event graphs
are guaranteed to be strongly connected, with every
edge verified by the annotators. Table 1 compares the
density of relation annotation across various corpora.
A major dilemma from this prior work is that unlabeled event/time pairs are inherently ambiguous. The
unlabeled pair holds 3 possibilities:
1. The annotator looked at the pair of events and
decided that multiple valid relations could apply.
2. The annotator looked at the pair of events and
decided that no temporal relation exists
3. The annotator failed to look at the pair of events,
so a single relation may exist.
This ambiguity makes accurate training and evaluation difficult. In the current work, we adopt the
VAGUE relation, introduced by TempEval 2007, and
force our annotators to indicate pairs for which no
clear temporal relation exists. We are thus the first to
eliminate the 3rd possibility from our corpus.
1
Do 2012 reports 6264 relations, but we list half of this
because they include relations and their inverses in the count.
2.2 Event Ordering Models
The TimeBank
In part because of the sparsity of temporal relations in
the available training corpora, most existing models
formulate temporal ordering as a pair-wise classification task, where each pair of events and/or times
is examined and classified as having a temporal relation or not. Early work on the TimeBank took this
approach (Boguraev and Ando, 2005), classifying relations between all events and times within 64 tokens
of each other. Most of the top-performing systems
in the TempEval competitions also took this pairwise classification approach for both event-time and
event-event temporal relations (Bethard and Martin,
2007; Cheng et al., 2007; UzZaman and Allen, 2010;
Llorens et al., 2010; Bethard, 2013). These systems
have sometimes even explicitly focused on a small
subset of temporal relations; for example, the topranked ordering system in TempEval 2013 (Bethard,
2013) only classified relations in certain syntactic
constructions and with certain relation types.
Systems have tried to take advantage of global
information to ensure that the pair-wise classifications satisfy temporal logic transitivity constraints,
using frameworks like integer linear programming
and Markov logic networks (Bramsen et al., 2006;
Chambers and Jurafsky, 2008; Tatu and Srikanth,
2008; Yoshikawa et al., 2009; UzZaman and Allen,
2010). The gains have been small, likely because
of the disconnectedness that is common in sparsely
annotated corpora (Chambers and Jurafsky, 2008).
An approach that has not been leveraged for event
ordering, but that has been successful in the coreference community is the sieve-based architecture. The
top performer in CoNLL-2011 shared task was one
such system (Lee et al., 2013). The core idea is to
begin with the most reliable classifier first, and inform those below it. This idea also appeared in the
early IBM MT models (Brown et al., 1993) and in
the “islands of reliability” approaches to parsing and
speech (Borghesi and Favareto, 1982; Corazza et al.,
1991). D’Souza and Ng (2013) recently combined
a rule-based model with a machine learned model,
but lacked the fine-grained formality of a cascade of
sieves. This paper is inspired by the above and is the
first to apply it to temporal ordering as an extensible,
formal architecture.
275
TimeBank-Dense
There were four or five people
inside, and they just started firing
There were four or five people
inside, and they just started firing
Ms. Sanders was hit several times
and was pronounced dead at the
scene.
Ms. Sanders was hit several times
and was pronounced dead at the
scene.
The other customers fled, and the
police said it did not appear that
anyone else was injured.
The other customers fled, and the
police said it did not appear that
anyone else was injured.
Figure 1: TimeBank document on the left with TimeBankDense on the right. Solid arrows indicate BEFORE and
dotted INCLUDED IN. Relations with the DCT not shown.
3
TimeBank-Dense: A Dense Ordering
We use a new corpus, called TimeBank-Dense (Cassidy et al., 2014), to motivate and evaluate our architecture. This section highlights its main features.
The TimeBank-Dense corpus was created to address the sparsity in current corpora. It is unique
in that the annotators were forced to label all local
edges, even in ambiguous cases. The corpus is not
a complete graph over events and time expressions,
but it approximates completeness by labeling locally
complete graphs over neighboring sentences. All
pairs of events and time expressions in the same sentence and all pairs of events and time expressions
in the immediately following sentence were labeled.
It also includes edges between every event and the
document creation time (DCT). The document event
graphs in TimeBank-Dense are thus strongly connected2 , but they are not complete graphs because
edges that cross 2 or more sentences are not explicitly included3 . See Figure 1 for an illustration of
the significant difference between a TimeBank document and a TimeBank-Dense document. The specific
edges that were labeled are as follows:
1. Event-Event, Event-Time, and Time-Time pairs in
the same sentence
2. Event-Event, Event-Time, and Time-Time pairs between the current and next sentence
3. Event-DCT pairs for every event in the document
(DCT is the document creation time)
4. Time-DCT pairs for every time expression in the
document
2
Viewed as an undirected graph: since all events connect the
DCT, there exists a path from every event to every other event.
3
Though many are inferred through transitive closure.
TimeBank-Dense contains 12,715 labeled relations over 36 TimeBank newspaper articles. It uses
TimeBank annotated events and time expressions,
and then annotates the edges between them. The set
of temporal relations are a subset of the 13 original
Allen relations (Allen, 1983). The TempEval contests have used both a smaller set of relations (TempEval 1) and all 13 relations (TempEval 3). Published
work mirrors this trend, and different groups focus
on different aspects of the semantics. The TimeBankDense set is a middle ground between coarse and finegrained distinctions: BEFORE , AFTER , INCLUDES ,
IS INCLUDED , SIMULTANEOUS , and VAGUE .
The main reason for not using a more fine-grained
set is because we annotate pairs that are more ambiguous than those considered in previous efforts.
Decisions between relations like BEFORE and IM MEDIATELY BEFORE complicate an already difficult
task. The added benefit of a system that makes finegrained distinctions is also not clear. We lean toward
higher annotator agreement by using relations that
have a greater separation between their semantics.4
Annotators of TimeBank-Dense also had to mutually agree on each labeled edge, otherwise it was
labeled VAGUE. Precision between annotators averaged 65.1%, and the kappa ranged from .56 to
.64. For comparison, TimeBank annotators averaged
77% precision with a .71 kappa. The breakdown
of relation counts is shown in Table 2. T3 refers
to TempEval-3. TempEval-3 used TimeBank documents, but removed a small portion of its events.
This paper evaluates on T3 data to maintain a connection to the recent competition, but this resulted in the
removal of a portion of the densely annotated edges.
The most significant change in this new corpus is
the inclusion of the VAGUE relation. This allows a
system to separate relation identification from classification, if desired. The VAGUE relation makes up
46% of the annotated graph. This is the first empirical measurement of how many edges do not carry a
clear semantics. This annotation eliminates 1 of the 3
ambiguities (discussed in Section 2) that has plagued
ordering research: the annotator failed to look at the
pair of events, so a single relation may exist.
4
For instance, a relation like STARTS is a special case of
if events are viewed as open intervals, and IMME DIATELY BEFORE is a special case of BEFORE . We avoid this
overlap and only use INCLUDES and BEFORE.
INCLUDES
276
TimeBank-Dense Relation Count
All
T3 train
T3 dev
T3 test
b
2590
1444
242
589
a
2104
1148
218
428
i
ii
s
v
Total
836 1060 215 5910 12715
473 629 120 3013 6827
37 73 20 359
949
116 159 39 900 2231
Table 2: The number of each relation type in the
TimeBank-Dense corpus. T3 is the TempEval3 version
used in this paper’s experiments.
# of Mutual Vague Partial Vague No Vague
1657 (28%)
3234 (55%) 1019 (17%)
Table 3: VAGUE relation origins. Partial vague: one annotator does not choose vague. No vague: neither annotator
chooses vague, but they disagree on the specific relation.
This is a significant improvement over previous
ordering annotations. Several questions can now be
answered to benefit learning algorithms. Which set
of edges in the graph are unambiguous? Can I know
for certain that I have all BEFORE relations in this
document? These questions are critical to supervised
learning, and we see the removal of this ambiguity as
an important step toward better classification.
Of course, an ambiguity still remains: does VAGUE
mean that no relation exists, or that multiple relations
are possible? This is not unique to our corpus, but
is present in all current ordering corpora. We erred
on the side of caution because it is not clear how this
can be resolved. Though we don’t solve this problem
here, we can more deeply analyze how VAGUE relations come about during annotation. Table 3 shows
the 3 annotation scenarios that result in a VAGUE relation, and how often each occurred: (1) mutual vague:
annotators agreed on VAGUE, (2) partial vague: one
annotator chose VAGUE but the other chose a nonvague relation, and (3) no vague: annotators chose
different non-vague relations. These disagreements
will be released with the corpus, enabling future research into distinguishing these fine-grained semantics.
TimeBank-Dense is split into a 22 document training set, 5 document dev set, and a 9 document test
set 5 . We follow this split in all experiments.
5
Available at http://www.usna.edu/Users/cs/nchamber/caevo/
4
Sieve-Based Temporal Ordering
based approaches to capture these phenomena.
The sieve architecture applies a sequence of temporal
relation classifiers, one at a time, to label the edges of
a graph of events and time expressions. The individual classifiers are called sieves, and each sieve passes
its temporal relation decisions onto the next sieve.
The input to each sieve is the partially labeled graph
from previous sieves. These previous decisions can
help inform a sieve’s own decisions. The sieves are
ordered by precision, running the most precise models first. Events and time expressions are assumed to
be annotated with their standard attributes (such as
verb tense). This work uses gold event attributes for
consistent analysis of ordering decisions.
One of the benefits of this architecture is the seamless enforcement of transitivity constraints. Independent pairwise classifiers often produce inconsistent labelings. The architecture avoids inconsistent graphs
by inferring all transitive relations from each sieve’s
output before the graph is passed on to the next one.
Specifically, the nth sieve cannot add an inconsistent
edge that is inferrable from the previous n − 1 sieves.
Transitive inference is run after each stage, so it is
not possible to relabel an already inferred edge. In
a similar fashion, the architecture enforces the precision preference rule: the nth sieve may not label
edge e if one of the previous n − 1 sieves has already
labeled it. Transitive inferences are made with all
possible relations (e.g., A before B and B includes
C infers A before C). The only relation that does not
infer new edges is VAGUE.
This architecture facilitates a seamless integration
of a wide variety of classifiers. We experimented
with both rule-based and machine learned classifiers.
The following subsections describe the final set of 12
temporal classifiers in CAEVO. We begin with the
deterministic rule-based sieves.
4.1 Deterministic Sieves
The most precise sieves come from linguistic insight
into syntactic and semantic patterns. Many pairs of
events share common syntactic attributes that clearly
indicate a temporal relation. Machine learning is
not needed for these known linguistic phenomena,
and a trained classifier can incorrectly label these
obvious pairs if other features incorrectly override
the decision. We therefore rely on deterministic rule277
4.1.1 Verb/Timex Adjacency
Many prepositions in English carry either explicit
(e.g., before, during) or implicit temporal semantics
(e.g., in, over). Several are so reliable that a simple
series of rules can label event-timex edges with very
high precision. This Verb/Timex Adjacency sieve
thus applies to event words and time expressions that
are connected by a direct path in the syntactic parse
tree, as well as all event/time pairs such that the event
is at most 2 tokens before the time expression.
Time expressions under a preposition are handled
with specific rules. For instance, in, on, over, during, and within all resolve to IS INCLUDED. The
prepositions for, at, and throughout resolve to SI MULTANEOUS . In the absence of a preposition, a
timex might simply be a modifier of the verb. The
following is one such example.
Police confirmed Friday that the body
found along a highway...
The confirmed event IS INCLUDED in Friday. The
edge between an event and a direct time modifier all
default to IS INCLUDED. The precision of this sieve
on the development set is 92%, the top performer.
4.1.2 Timex-Timex Relations
One of the most precise sieves is a small set of
deterministic rules that address relations between
time expressions (including the DCT). This sieve
uses the normalized time values for time expressions
to compare their start/end times. Given two time
spans with defined starting and ending time points,
the temporal relation is unambiguous.
The more difficult cases involve time expressions
without stated start/end points. Examples of these
include references to the abstract future or past, such
as in phrases like ‘last year’ and ‘next quarter’. This
sieve chooses BEFORE if the abstract references are
not the same, but in clear opposition to each other.
For instance, a past reference and a future reference
are labeled BEFORE. If the abstract references are the
same, then VAGUE is typically applied. We defer to
the publicly released code for further details.
4.1.3 Reporting Event with DCT
Newspaper articles contain an abundance of reporting events (e.g., said, reply, tell). We crafted a
targeted sieve to handle this genre-specific phenomena. Since the document creation time is often the
day of the reporting events, edges between reporting events and the DCT are labeled as IS INCLUDED.
This is essentially a high baseline for event-DCT
edges. One of the advantages of the sieve-based architecture is that this sieve can be placed later in
the cascade of sieves. More sophisticated reasoners
run before this one, making more informed decisions
when appropriate, and this sieve cleans up the remaining reporting-DCT edges that are still unlabeled.
4.1.4 Reporting Event with Dominated Event
We also created rules for edges between a reporting event and another event that is syntactically dominated by the reporting event. Statistical classifiers
can pick up patterns for reporting verbs, but the labels are consistent enough that a few deterministic
rules can handle many of the cases. The following
rules are solely based on the tense and grammatical
aspect of each event.
The following is a sampling of these rules, where
gov is the governing event and dep the dependent.
1. If gov is present, and dep is past
then (gov AFTER dep)
2. If gov is present, and dep is present perfect
then (gov AFTER dep)
3. If gov is present, and dep is future
then (gov BEFORE dep)
4. If gov is past, and dep is past perfective
then (gov AFTER dep)
5. If gov is past, and dep is past progressive
then (gov IS INCLUDED dep)
4.1.5 General Event with Dominated Event
We generalized the above reporting-event sieve to
apply to all types of events. This ‘backoff’ sieve handles the non-reporting event pairs where one event
immediately dominates another. It operates similar to
the preceding sieve and uses event features like tense
and aspect to make its decisions. However, unlike the
preceding, the decisions are split into different rule
sets based on the specific typed dependency relation
that connects the two events. We have separate rules
nsubj, dobj, xcomp, ccomp, advcl, and conj.
4.1.6 Reichenbach Rules
Reichenbach (1947) defined the role played by the
various tenses of English verbs in conveying tem278
Simple Past: John left the room
E,R
S
Anterior Past: John had left the room
E
R
S
Posterior Past: (I did not know that) John would leave the room
R
E,S
Figure 2: Timelines in which the ordering S < R is fixed
because the verb to leave is in the past tense.
poral information in a discourse. Each tense is distinguished in terms of the relative ordering of the
following time points:
• Point of speech (S) The time at which an act of
speech is performed.
• Point of event (E) The time at which the the event
referred to by the verb occurs.
• Point of reference (R) A single time with respect to
which S is ordered by one dimension of tense, and
to which E is ordered by another.
Reichenbach’s account maps the set of possible orderings of S, E, and R, where each pair
of elements can be ordered with < (before), =
(simultaneous), or > (after), onto a set of tense
names given by: {anterior, simple, posterior} ×
{past, present, f uture}. The relative ordering of
E and R is indicated by the first dimension, while
that of S and R is given by the second. Figure 2
depicts examples of the ordering R < S.
Similar to Derczynski and Gaizauskas (2013) we
map a subset of Reichenbach’s tenses onto pairs of
TimeML tense and aspect attribute values, which
are given by {simple, perf ect, progressive} ×
{past, present, f uture}, where the first dimension
is grammatical aspect and the second is tense. We
refer to an event’s tense/aspect combination as its
tense-aspect profile. Consider two events e1 and
e2 associated with S1 /E1 /R1 and S2 /E2 /R2 . Intuitively, given R1 = R2 and the time point orderings
for each event (derived from its tense-aspect profile)
we can enumerate the possible interval relations that
might hold between E1 and E2 , which we can interpret as a disjunction of the possible orderings of
e1 and e2 (Derczynski and Gaizauskas, 2013). We
identify a subset of tense/aspect profile pairs that
unambiguously yield an interval relation from our
6 relations, and implement them as rules in the Reichenbach Sieve. Three such rules are given here:
1. If e1 is past/simple and e2 is past/perfect
then (e1 AFTER e2)
2. If e1 is future/simple and e2 is present/perfect
then (e1 AFTER e2)
3. If e1 is past/simple and e2 is future/simple
then (e1 BEFORE e2)
The Reichenbach sieve achieved a precision of
61% on the entire dataset, and 91% when links labeled VAGUE were excluded, a notably larger increase over other sieves. The lower 61% is primarily
due to the abundance of non-canonical usages of various tense/aspect profiles. For example, the present
tense may convey habitual action, an event conveyed
with the future tense in a quotation may have already
occurred, the present perfect may be used to indicate
that an event occurred at least once in the past, etc.
4.1.7
WordNet Rules
Determining whether two words are precisely coreferent is notoriously difficult (Hovy et al., 2013).
Synonymous words, or even those sharing the same
surface form may be only partially co-referent and
thus likely labeled VAGUE by annotators. For example, news articles state what someone said multiple
times without revealing temporal ordering between
the events. Event-event edges whose events share
the same lemma, or are each a member of the same
synset are thus labeled VAGUE. Time-time edges with
the same lemma/synset are labeled SIMULTANEOUS.
4.1.8
All Vague
The majority class baseline for this task is to label
all edges as VAGUE. This sieve is added to the end of
the gauntlet, labeling any remaining unlabeled edges.
4.2 Machine Learned Sieves
Current state-of-the-art models for temporal ordering
are machine learned classifiers. The top systems in
the latest TempEval-3 contest used supervised classifiers for the different types of edges in the event
graph (Bethard, 2013; Chambers, 2013). In the spirit
of the sieve architecture, rather than training one
large classifier for all types of edges, we create targeted classifiers for each type of edge. The resulting
models are again ranked by precision and mutually
279
Event-Time Features
Token, lemma, POS tag of event
Tense, aspect, class of event
Bigram of event and time expression words
Token path from event to time
Syntactic parse tree path between the event and time
Typed dependency edge path between the events
Boolean: syntactically dominates or is-dominated
Boolean: time expression concludes sentence?
Boolean: time expression day of week?
Full time expression phrase
Figure 3: Features for event-time edges.
inform each other through the architecture’s transitive constraints. All models use the MaxEnt classifier
from the Stanford CoreNLP with its default settings.
Below we describe the features for each sieve.
4.2.1
Event-Time Edges
We created two different sieves for labeling edges
between an event and a time expression. Similar to
the event-event distinction, one classifier applies to
intra-sentence edges and the other to inter-sentence.
The features for these classifiers are shown in Figure 3. The same features from the above eventevent classifier are used where applicable. These
include the parse paths, syntactic dominance, and
event features with POS tags and token context. Features specific to event-time edges include the event
word’s tense, aspect, and class (included independently from the time expression). Time features included a boolean feature if it was a day of the week,
the head word of the expression, and the entire expression if it is more than one word. A boolean feature
indicating if the time expression ended the sentence
is also included.
4.2.2
Event-Event Edges
We created three different sieves for event-event
edges. They use the same features, but are trained
and applied to different subsets of the graph edges.
1. intra-sentence: Label edges between all event pairs
that occur in the same sentence.
2. intra-sentence dominance: Label edges between all
event pairs that occur in the same sentence and one
event dominates the other in the syntactic parse tree.
3. inter-sentence: Label edges between all event pairs
that occur in neighboring sentences.
Event-Event Features
4.3
Text order normalized (put the first event first)
Boolean: syntactically dominates or is-dominated
Boolean: is there an event in between these two?
Syntactic parse tree path between the events
Typed dependency edge path between the events
Token n-grams, POS tag n-grams
Pairs of tense, aspect, event class
Prepositions attached to each event, if any
We now present the sieves used by CAEVO. Although the final system contains 12 sieves, we experimented with over 25 different classifiers. The
final set of 12 are sorted by precision:
precision(sieve) =
Figure 4: Features for intra-sentence event-event edges.
The features for these classifiers are summarized
in Figure 4. The POS tag features for each event
include its previous two tags, the event word’s tag,
and the POS bigram ending on the event word. A
POS bigram is also created from the POS tag of
each event. The token word, its lemma, its WordNet
synset, and the token bigram from each event word
are also used. Various syntactic features are used, including preposition words that dominate either event,
boolean dominance features if one event is above the
other in the parse tree, the parse path between the
two events (append the non-terminals together), and
the typed dependency path between the two. Finally,
we use event attributes of tense, aspect, and class to
create event pairs of their values (e.g., if both events
are past tense, then the feature’s value is ‘past past’).
The final CAEVO architecture includes both intrasentence learners, but not inter-sentence. The latter’s
precision was not high enough to warrant its inclusion. As mirrored in recent TempEval competition
results, inter-sentence event relations continue to be
an area of future research.
4.2.3
The Complete Sieve Gauntlet
Event-DCT Edges
Temporal relations between events and the document creation time (DCT) are also separately classified. The features for this machine-learned sieve
focus solely on the event word, including its POS tag,
the two previous POS tags, event lemma, WordNet
synset, token unigrams from a window of size 2, and
event attributes: tense, aspect, and class. Since this
sieve relies entirely on the event’s features with no
DCT specific features, it can be viewed as a singleevent classifier. We also experimented with a number
of rule-based sieves for Event-DCT classification, but
they were all outperformed by the machine-learning
sieve on the training and development sets.
280
#correct
#proposed by sieve
(1)
Transitive closure is not included when calculating
precision. Sieves are ranked according to precision,
placed in order from highest to lowest. Sieves with
a precision lower than the All Vague baseline were
removed. Table 4 shows the final sieve order.
5
Experiments and Results
We now present the first experiments on the
TimeBank-Dense corpus, using the set of events used
for TempEval-3. These experiments highlight two
separate contributions. The first is a new approach
to temporal ordering, the CAEVO architecture with
a cascade of classifiers. The second is an evaluation
of an ordering system on the densest event graphs
to date. We provide per-sieve results and compare
against two state-of-the-art systems.
All sieves were developed on and motivated by
the TimeBank-Dense training data. Precision was
initially computed on the development set, and final
sieve ordering is based on individual performance.
Final performance is reported on the test set.
We compare against the All Vague baseline that
labels all edges as VAGUE. We also compare against
the winner of TempEval-3, ClearTK (Bethard, 2013).
We include ClearTK in four configurations: (1) the
base TempEval-3 system, (2) with CAEVO transitivity, (3) with transitivity and altered to output a VAGUE
relation where it normally would have skipped the decision, and (4) retrained models on TimeBank-Dense
with transitivity and all VAGUE. We also include the
second place TempEval-3 system, NavyTime (Chambers, 2013), retrained and run as (3) and (4). Results
are shown in Table 5. The optimized TempEval-3
systems ultimately achieve ∼44 F1. The transitivity architecture automatically boosted ClearTK from
.147 to .264 with no additional changes. CAEVO
outperforms at 50.7 F1, improving over the retrained
TempEval-3 systems by 14% relative F1.
Ordered Sieves
Sieve
Verb/Time Adjacent
TimeTime
Reporting Governor
Reichenbach
General Governor
WordNet
Reporting DCT
ML-E-T SameSent
ML-E-E SameSent
ML-E-E Dominate
ML-E-DCT
AllVague
Brief Description
Rules for edges between one verbal event and one Timex
Rules for edges between two Timex
Rules for when a reporting event dominates an event
Rules following Reichenbach theory for edges between two events
Rules for edges where one event dominates another via typed dependencies
Rules for edges between two events/times based on WordNet synonym lookup
Rules for reporting events and the DCT
Trained classifier for event-time edges
Trained classifier for all intra-sentence event-event edges
Trained classifier for edges where one event syntactically dominates another
Trained classifier for all event-DCT edges
Labels all edges as VAGUE
Dev
P R
.92 .01
.88 .02
.69 .01
.63 .02
.63 .04
.52 .02
1.0 .00
.50 .03
.45 .11
.41 .05
.40 .06
.38 .38
Test
P R
.74 .01
.90 .02
.91 .01
.67 .03
.51 .02
.57 .01
1.0 .00
.42 .03
.44 .10
.44 .05
.53 .07
.40 .40
Table 4: The final order of sieves in CAEVO. Precision was computed on the development set. ReportingDCT only
matched one pair in the dev set, so we moved it below the other more common rule-based sieves.
CAEVO Component Ablation
System Comparison
System
ClearTK
ClearTK + trans
ClearTK VAGUE + trans
ClearTK Dense VAGUE + trans
NavyT Dense VAGUE
NavyT Dense VAGUE + trans
Baseline: All Vague
CAEVO
P
.397
.390
.431
.460
.485
.483
.405
.508
R
.091
.199
.431
.426
.415
.427
.405
.506
F1
.147
.264
.431
.442
.447
.453
.405
.507
Table 5: Comparison with top systems. The “Dense” systems were retrained on TimeBank-Dense. The “Opt” systems were optimized to label unknown edges as VAGUE.
The bottom of Table 6 illustrates the contribution
of transitive closure in the labeling decisions. Prior research has hypothesized that transitive closure should
help individual decisions (Chambers and Jurafsky,
2008; Yoshikawa et al., 2009), but this has not been
adequately tested due to the sparsely labeled graphs
available to the community. We computed the precision of edges directly labeled by sieves, and the
precision of edges inferred from transitive closure.
Transitive inferences achieve better performance than
directly classified edges (54.5% to 49.8% precision).
Table 7 breaks apart precision by edge type. Since
CAEVO labels all edges, precision and recall are
the same. Edges between an event and the DCT
show 55% precision, while event-event and event281
P
CAEVO with only ML Sieves .458
+ Rule-Based Sieves
.486
+ Transitivity
.505
+ All Vague Sieve
.507
Direct Sieve Predictions
.498
Predictions from Transitivity .544
R
.202
.240
.328
.507
.398
.109
F1
.280
.321
.398
.507
.442
.182
Table 6: System precision on the test set with and without
transitivity, and precision of transitive labels.
time edges achieve the lowest performance at 49%.
Linking time expressions to each other is helpful.
While far less numerous, time-time edges are the
easiest to classify (71% precision). Upon removing
the Time-Time sieve, overall results drop almost 5%,
despite the time-time edges only making up a smaller
2.6% of the data. Combined with transitive closure,
they positively influence the other event decisions.
Finally, Table 8 shows CAEVO’s per-relation results. We also reversed the order of sieves (but with
all vague still last) to see what effect it has. Performance on the dev set dropped from .48 F1 to .46.
6
Discussion
These are the first experiments on event ordering
where the document event graphs were annotated so
as to guarantee high density of relations and strong
Edge Type Performance
Edge Type
Total P/R/F1
Event-Event Edges 1427 .494
Event-Time Edges 423 .494
Event-DCT
311 .553
Time-Time Edges
59 .712
Table 7: The number of edges for each edge type, and the
system’s overall precision on each. Since CAEVO labels
all possible edges, precision and recall are the same.
Per-Relation Performance
Relation
P
.52
AFTER
.55
INCLUDES
.44
IS INCLUDED
.57
SIMULTANEOUS .71
VAGUE
.48
BEFORE
R
.45
.38
.21
.43
.31
.66
F1
.49
.45
.28
.49
.43
.56
Table 8: Performance on individual relation types.
connectivity. Most related to our experiments is the
dense annotation work of Do et al. (2012), but as they
did not force annotators to label every pair, unannotated pairs in their corpus are ambiguous between a
VAGUE relation and a relation that was simply missed
by the annotators. This ambiguity causes problems
for evaluation that are not present in a corpus where
annotators were required to annotate all pairs.
Constructing event ordering systems that will be
evaluated on dense, strongly connected document
event graphs is fundamentally different than optimizing a system to label a subset of edges. This can
be seen in the performance of the ClearTK-TimeML
system, which achieved the top performance (36.26)
on the sparse relation task of TempEval 2013, but
performed dramatically worse (15.8) on our dense
relation task. This is not to fault ClearTK, but to
illustrate the difference of evaluating only a subset of
an event graph’s edges rather than the entire graph.
We also presented the first sieve-based architecture for event/time ordering. The CAEVO architecture allowed us to focus on specific classifiers for
local decisions, and leave constraint enforcement
to the architecture itself. There are several intangible benefits to this setup that don’t appear in the
results: (1) seamless integration of new/existing classifiers, (2) extremely quick ablation experiments by
282
adding/removing sieves, and (3) facilitation of the
development of specific classifiers that are agnostic
to global constraints.
The final results are encouraging. Our multi-pass
system achieved an F1 of 50.7, a full 10 F1 points
over the All-Vague baseline, and a 14% relative increase over the top two systems in TempEval 3. This
increase is based not just on the top systems out of
the box, but after retraining them with TimeBankDense. The largest factor contributing to this gain
is the enforcement of global constraints. Transitive
inference alone boosted performance by about 10%.
Finally, we note that a lot of effort was put into
linguistically motivated rule-based classifiers. The
overall gain from these high precision sieves was 2
F1 points. The Reichenbach sieve’s application of
a more traditional linguistic theory of time helped
increase performance. Error analysis reveals there
are inherent difficulties in relying solely on tense
and aspect for ordering decisions, but we hope to
continue experiments into this and other theories.
CAEVO and its code is now publicly available6 .
In addition to temporal classification (the focus of
this paper), it also automatically extracts events and
time expressions. CAEVO is a complete system that
produces dense event/time graphs from raw text. We
hope it provides a new platform for quick experimentation and development. The TimeBank-Dense corpus is also publicly available, as well as the new annotation tool that forces annotators to label all edges.
It is a unique tool that helps with transitive inference
as annotators work. We hope it encourages further
dense data creation for temporal research.
Acknowledgments
This work was supported, in part, by the Johns Hopkins Human Language Technology Center of Excellence. Any opinions, findings, and conclusions
or recommendations expressed in this material are
those of the authors. It was also supported by Grant
Number R01LM010090 from the National Library
Of Medicine. Finally, thanks to Benjamin Van Durme
for assistance and insight.
6
http://www.usna.edu/Users/cs/nchamber/caevo
References
James F Allen. 1983. Maintaining knowledge about
temporal intervals. Communications of the ACM,
26(11):832–843.
Steven Bethard and James H. Martin. 2007. CU-TMP:
Temporal relation classification using syntactic and semantic features. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval2007), pages 129–132, Prague, Czech Republic, June.
Association for Computational Linguistics.
Steven Bethard. 2013. ClearTK-TimeML: A minimalist
approach to tempeval 2013. In Second Joint Conference on Lexical and Computational Semantics (*SEM),
Volume 2: Proceedings of the Seventh International
Workshop on Semantic Evaluation (SemEval 2013),
pages 10–14, Atlanta, Georgia, USA, June. Association
for Computational Linguistics.
Branimir Boguraev and Rie Kubota Ando.
2005.
TimeBank-driven TimeML analysis. In Annotating,
Extracting and Reasoning about Time and Events.
Springer.
Luigi Borghesi and Chiara Favareto. 1982. Flexible parsing of discretely uttered sentences. In Proceedings
of the 9th conference on Computational linguisticsVolume 1, pages 37–42. Academia Praha.
Philip Bramsen, Pawan Deshpande, Yoong Keok Lee, and
Regina Barzilay. 2006. Inducing temporal graphs.
In Proceedings of the 2006 Conference on Empirical
Methods in Natural Language Processing, pages 189–
198. Association for Computational Linguistics.
Peter F Brown, Vincent J Della Pietra, Stephen A Della
Pietra, and Robert L Mercer. 1993. The mathematics
of statistical machine translation: Parameter estimation.
Computational linguistics, 19(2):263–311.
Taylor Cassidy, Bill McDowell, Nathanael Chambers, and
Steven Bethard. 2014. An annotation framework for
dense event ordering. In The 52nd Annual Meeting of
the Association for Computational Linguistics, Baltimore, MD, USA, June. Association for Computational
Linguistics.
Nathanael Chambers and Dan Jurafsky. 2008. Jointly
combining implicit constraints improves temporal ordering. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pages 698–
706. Association for Computational Linguistics.
Nathanael Chambers. 2013. Navytime: Event and time
ordering from raw text. In Second Joint Conference
on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta,
Georgia, USA, June. Association for Computational
Linguistics.
283
Yuchang Cheng, Masayuki Asahara, and Yuji Matsumoto.
2007. NAIST.Japan: Temporal relation identification
using dependency parsed tree. In Proceedings of the
Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 245–248, Prague, Czech
Republic, June. Association for Computational Linguistics.
A Corazza, R De Mori, R Gretter, and G Satta. 1991.
Stochastic context-free grammars for island-driven
probabilistic parsing. In Proceedings of Second International Workshop on Parsing Technologies (IWPT91),
pages 210–217.
Leon Derczynski and Robert Gaizauskas. 2013. Empirical validation of reichenbach’s tense framework. In
Proceedings of the 10th International Conference on
Computational Semantics (IWCS 2013), pages 71–82.
Quang Do, Wei Lu, and Dan Roth. 2012. Joint inference
for event timeline construction. In Proceedings of the
2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning, pages 677–687, Jeju Island, Korea,
July. Association for Computational Linguistics.
Jennifer DSouza and Vincent Ng. 2013. Classifying
temporal relations with rich linguistic knowledge. In
Proceedings of NAACL-HLT, pages 918–927.
Eduard Hovy, Teruko Mitamura, Felisa Verdejo, Jun Araki,
and Andrew Philpot. 2013. Events are not simple:
Identity, non-identity, and quasi-identity. NAACL HLT
2013.
Oleksandr Kolomiyets, Steven Bethard, and MarieFrancine Moens. 2012. Extracting narrative timelines
as temporal dependency structures. In Proceedings of
the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
88–97, Jeju Island, Korea, July. Association for Computational Linguistics.
Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael
Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013.
Deterministic coreference resolution based on entitycentric, precision-ranked rules. Computational Linguistics.
Hector Llorens, Estela Saquete, and Borja Navarro. 2010.
TIPSem (English and Spanish): Evaluating CRFs and
semantic roles in TempEval-2. In Proceedings of the
5th International Workshop on Semantic Evaluation,
pages 284–291, Uppsala, Sweden, July. Association for
Computational Linguistics.
James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew
See, Robert Gaizauskas, Andrea Setzer, Dragomir
Radev, Beth Sundheim, David Day, Lisa Ferro, et al.
2003. The timebank corpus. In Corpus linguistics,
volume 2003, page 40.
Hans Reichenbach. 1947. Elements of Symbolic Logic.
Macmillan.
Marta Tatu and Munirathnam Srikanth. 2008. Experiments with reasoning for temporal relations between
events. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages
857–864. Association for Computational Linguistics.
Naushad UzZaman and James Allen. 2010. TRIPS and
TRIOS system for TempEval-2: Extracting temporal
information from text. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages
276–283, Uppsala, Sweden, July. Association for Computational Linguistics.
Naushad UzZaman, Hector Llorens, Leon Derczynski,
James Allen, Marc Verhagen, and James Pustejovsky.
2013a. Semeval-2013 task 1: Tempeval-3: Evaluating time expressions, events, and temporal relations. In
Second Joint Conference on Lexical and Computational
Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation
(SemEval 2013), pages 1–9, Atlanta, Georgia, USA,
June. Association for Computational Linguistics.
Naushad UzZaman, Hector Llorens, Leon Derczynski,
Marc Verhagen, James Allen, and James Pustejovsky.
2013b. Semeval-2013 task 1: Tempeval-3: Evaluating time expressions, events, and temporal relations.
In Proceedings of the 7th International Workshop on
Semantic Evaluation. Association for Computational
Linguistics.
Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark
Hepple, Graham Katz, and James Pustejovsky. 2007.
Semeval-2007 task 15: Tempeval temporal relation
identification. In Proceedings of the 4th International
Workshop on Semantic Evaluations, pages 75–80. Association for Computational Linguistics.
Marc Verhagen, Roser Sauri, Tommaso Caselli, and James
Pustejovsky. 2010. Semeval-2010 task 13: Tempeval2. In Proceedings of the 5th International Workshop
on Semantic Evaluation, pages 57–62. Association for
Computational Linguistics.
Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. 2009. Jointly identifying
temporal relations with Markov Logic. In Proceedings
of the Joint Conference of the 47th Annual Meeting of
the ACL and the 4th International Joint Conference
on Natural Language Processing of the AFNLP, pages
405–413. Association for Computational Linguistics.
284
Download