Using Query Patterns to Learn the Durations of Events
Andrey Gusev
joint work with
Nate Chambers, Pranav Khaitan, Divye Khilnani, Steven Bethard, Dan Jurafsky
Examples of Event Durations
• Talk to a friend – minutes
• Driving – hours
• Study for an exam – days
• Travel – weeks
• Run a campaign – months
• Build a museum – years
Why are we interested in durations?
• Event Understanding
• Duration is an important aspectual property
• Can help build timelines and events
• Event coreference
• Duration may be a cue that events are coreferent
• Gender (learned from the web) helps nominal coreference
• Integration into search products
• Query: “healthy sleep time for age groups”
• Query: “president term length in [country x]”
How can we learn event durations?
Approach 1: Supervised System
Dataset (Pan et al., 2006)
• Labeled 58 documents from TimeBank with event durations
• Average of minimum and maximum labeled durations
• A Brooklyn woman who was watching her clothes dry in a laundromat.
• Min duration – 5 minutes
• Max duration – 1 hour
• Average – 1950 seconds
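The gold label computation is simple enough to show directly; below is a minimal sketch (not the original code) that reproduces the 1950-second average from the example above.

```python
# Minimal sketch: average the annotated minimum and maximum durations
# after converting both to seconds (laundromat example from the slide).
SECONDS_PER_UNIT = {"minute": 60, "hour": 3600}

min_duration = 5 * SECONDS_PER_UNIT["minute"]   # 5 minutes -> 300 s
max_duration = 1 * SECONDS_PER_UNIT["hour"]     # 1 hour    -> 3600 s
average = (min_duration + max_duration) / 2     # 1950.0 s, as on the slide
```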
Original Features (Pan et al., 2006)
• Event Properties
• Event token, lemma, POS tag
• Subject and Object
• Head words of the syntactic subject and object of the event, along with their lemmas and POS tags.
• Hypernyms
• WordNet hypernyms for the event, its subject, and its object.
• Starting from the first synset of each lemma, three hypernyms were extracted from the WordNet hierarchy.
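As a rough illustration of the hypernym features (a sketch using NLTK's WordNet interface, not the original feature extractor; the function name is hypothetical):

```python
# Sketch: walk up to three hypernym levels from the first WordNet synset
# of a lemma, mirroring the feature described above. Requires NLTK with
# the WordNet corpus downloaded.
from nltk.corpus import wordnet as wn

def hypernym_features(lemma, pos=wn.VERB, levels=3):
    synsets = wn.synsets(lemma, pos=pos)
    if not synsets:
        return []
    features = []
    synset = synsets[0]                 # first synset only
    for _ in range(levels):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]           # follow the first hypernym link
        features.append(synset.name())
    return features

# e.g. hypernym_features("talk") returns up to three hypernym synset names
```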
New Features
• Event Attributes
• Tense, aspect, modality, event class
• Named Entity Class of Subjects and Objects
• Person, organization, location, or other.
• Typed Dependencies
• Binary feature for each typed dependency
• Reporting Verbs
• Binary feature for reporting verbs (say, report, reply, etc.)
Limitations of the Supervised Approach
Need explicitly annotated datasets
• Sparse and limited data
• Limited to the annotated domain
• Low inter-annotator agreement
• More than a day and less than a day – 87.7%
• Duration buckets – 44.4%
• Approximate duration buckets – 79.8%
Overcoming Supervised Limitations
Statistical Web Count approach
• Lots of text/data that can be used
• Not limited to the annotated domain
• Implicit annotations from many sources
• Hearst (1998), Ji and Lin (2009)
How can we learn event durations?
Approach 2: Statistical Web Counts
Terms – Duration Buckets and Distributions
• "talked for * seconds" – 1,638 hits
• "talked for * minutes" – 61,816 hits
• "talked for * hours" – 68,370 hits
• "talked for * days" – 4,361 hits
• "talked for * weeks" – 3,754 hits
• "talked for * months" – 5,157 hits
• "talked for * years" – 103,336 hits
[Chart: distribution of hit counts over duration buckets]
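A minimal sketch of turning these raw hit counts into a distribution over duration buckets (the counts are copied from the slide; the normalization itself is an assumed bookkeeping step, not necessarily the exact pipeline):

```python
# Sketch: normalize raw web hit counts for "talked for * <unit>" queries
# into a probability distribution over duration buckets.
BUCKETS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years"]

hits = {"seconds": 1638, "minutes": 61816, "hours": 68370, "days": 4361,
        "weeks": 3754, "months": 5157, "years": 103336}

total = sum(hits.values())
distribution = {bucket: hits[bucket] / total for bucket in BUCKETS}
# "years" alone gets about 0.42 of the mass, which motivates the base-rate
# correction discussed on the later slides.
```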
Two Duration Prediction Tasks
• Coarse grained prediction
• “Less than a day” or “Longer than a day”
• Fine grained prediction
• Second, minute, hour, etc.
Task 1: Coarse Grained Prediction
Yesterday Pattern for Coarse Grained Task
• <eventpast> yesterday
• <eventpastp> yesterday
• eventpast = past tense
• eventpastp = past progressive tense
• Normalize yesterday event pattern counts with counts of event occurrence in general
• Average the two ratios
• Find threshold on the training set
Example: “to say” with Yesterday Pattern
• “said yesterday” – 14,390,865 hits
• “said” – 1,693,080,248 hits
• “was saying yesterday” – 29,626 hits
• “was saying” – 14,167,103 hits
• Ratio_past = 14,390,865 / 1,693,080,248 ≈ 0.0085
• Ratio_pastp = 29,626 / 14,167,103 ≈ 0.0021
• Average Ratio = 0.0053
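Putting the pieces together, a hypothetical sketch of the coarse grained classifier; get_hits stands in for a web hit-count lookup, and the decision direction (a high ratio means the event tends to fit within a day) is inferred from the "said" example:

```python
# Sketch of the yesterday classifier. get_hits is a hypothetical function
# returning the number of web hits for an exact-phrase query.
def yesterday_classifier(past, past_prog, get_hits, threshold=0.002):
    # e.g. past = "said", past_prog = "was saying"
    ratio_past = get_hits(f"{past} yesterday") / get_hits(past)
    ratio_pastp = get_hits(f"{past_prog} yesterday") / get_hits(past_prog)
    avg = (ratio_past + ratio_pastp) / 2
    # Events frequently reported as happening "yesterday" tend to be short.
    return "less than a day" if avg >= threshold else "more than a day"
```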
Threshold for Yesterday Pattern
[Chart: accuracy as a function of the average-ratio threshold; accuracy peaks around t = 0.002]
Task 2: Fine Grained Prediction
Fine Grained Durations from Web Counts
• How long does the event "X" last?
• Ask the web:
• "X for * seconds"
• "X for * minutes"
• …
• Output distribution over time units
[Chart: duration distribution for "said"]
Not All Time Units are Equal
• Need to look at the base distribution
• "for * seconds"
• "for * minutes"
• …
• In habituals, etc., people like to say "for years"
Conditional Frequencies for Buckets
• Divide "X for * seconds" by "for * seconds"
• Reduces credit for seeing "X for years"
[Chart: conditional duration distribution for "said"]
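A rough sketch of the conditional frequencies (get_hits is again a hypothetical exact-phrase hit-count lookup):

```python
# Sketch: divide event-specific counts by the base counts of the bare
# "for * <unit>" pattern, then renormalize into a distribution.
BUCKETS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years"]

def conditional_distribution(event, get_hits):
    ratios = {}
    for unit in BUCKETS:
        event_hits = get_hits(f"{event} for * {unit}")
        base_hits = get_hits(f"for * {unit}")
        ratios[unit] = event_hits / base_hits if base_hits else 0.0
    total = sum(ratios.values())
    return {u: r / total for u, r in ratios.items()} if total else ratios
```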
Double Peak Distribution
• Two interpretations
• Durative
• Iterative
• The distributions for "to smile" and "to run" show this, with two peaks
[Chart: duration distributions for "to smile" and "to run" over the duration buckets, each with two peaks]
Merging Patterns
• Multiple patterns
• Distributions averaged
• Reduces noise from individual patterns
• A pattern needs to have more than 100 and fewer than 100,000 hits
[Chart: merged duration distribution for "said"]
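A small sketch of the merging step under the hit-count filter mentioned above (per-pattern distributions and their total hit counts are assumed inputs):

```python
# Sketch: keep only patterns with between 100 and 100,000 total hits,
# then average the surviving per-pattern distributions bucket by bucket.
def merge_patterns(pattern_dists, pattern_hits, low=100, high=100_000):
    kept = [d for d, h in zip(pattern_dists, pattern_hits) if low < h < high]
    if not kept:
        return None  # no usable pattern for this event
    buckets = kept[0].keys()
    return {b: sum(d[b] for d in kept) / len(kept) for b in buckets}
```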
Fine Grained Patterns
• Used Patterns
• <eventpast> for * <bucket>
• <eventpastp> for * <bucket>
• spent * <bucket> <eventger>
• Patterns not used
• <eventpast> in * <bucket>
• takes * <bucket> to <event>
• <eventpast> last <bucket>
Evaluation and Results
Evaluation
• TimeBank annotations (Pan, Mulkar and Hobbs 2006)
• Coarse Task: Greater or less than a day
• Fine Task: Time units (seconds, minutes, hours, …, years)
• Counted as correct if within 1 time unit
• Baseline: Majority Class
• Fine Grained – months
• Coarse Grained – greater than a day
• Compare with a re-implementation of the supervised system (Pan, Mulkar and Hobbs 2006)
New Split for TimeBank Dataset
• Train – 1664 events (714 unique verbs)
• Test – 471 events (274 unique verbs)
• TestWSJ – 147 events (84 unique verbs)
• Split info is available at
• http://cs.stanford.edu/~agusev/durations/
Web Counts System Scoring
• Fine grained
• Smooth over the adjacent buckets and select top bucket
score(b_i) = b_{i-1} + b_i + b_{i+1}
• Coarse grained
• “Yesterday” classifier with a threshold (t = 0.002)
• Use fine grained approach
• Select coarse grained bucket based on fine grained bucket
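A minimal sketch of the fine grained scoring rule; treating missing neighbors at the edges as zero is an assumption, since the slide does not say how the endpoints are handled:

```python
# Sketch: smooth each bucket with its neighbors,
# score(b_i) = b_{i-1} + b_i + b_{i+1}, and return the top-scoring bucket.
BUCKETS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years"]

def predict_fine_bucket(dist):
    values = [dist.get(b, 0.0) for b in BUCKETS]
    scores = []
    for i in range(len(BUCKETS)):
        left = values[i - 1] if i > 0 else 0.0
        right = values[i + 1] if i < len(BUCKETS) - 1 else 0.0
        scores.append(left + values[i] + right)
    return BUCKETS[scores.index(max(scores))]
```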
Results
System            Coarse - Test   Fine - Test   Coarse - WSJ   Fine - WSJ
Baseline          62.4            59.2          57.1           52.4
Supervised        73.0            62.4          74.8           66.0
Bucket Counts     72.4            66.5          73.5           68.7
Yesterday Counts  70.7            N/A           74.8           N/A
Web counts perform as well as the fully supervised system
Backoff Statistics (“Spent” Pattern)
• Events in training dataset
  Both: 356   Subject: 446   Object: 195   None: 548
• Had at least 10 hits
  Both: 3   Subject: 86   Object: 84   None: 1372
Effect of the Event Context
• Supervised classifiers use context in their features
• Web counts system doesn’t use context of the events
• Significantly fewer hits when including context
• Better accuracy with more hits than with context
• What is the effect of subject/object context on the understanding of event duration?
Can humans do this task without context?
Human Annotation: Mechanical Turk
MTurk Setup
• 10 MTurk workers for each event
• Without the context
• Event – choice for each duration bucket
• With the context
• Event with subject/object – choice for each duration bucket
Sometimes Context Doesn’t Matter
[Charts: duration distributions for "exploded" and "intolerant"]
Web counts vs. Turk distributions
[Charts: "said" (web count) vs. "said" (MTurk)]
Web counts vs. Turk distributions
[Charts: "looking" (web count) vs. "looking" (MTurk)]
Web counts vs. Turk distributions
[Charts: "considering" (web count) vs. "considering" (MTurk)]
Results: Mechanical Turk Annotations
• Compare accuracy
  – Event with context
  – Event without context

System             Coarse - Test   Fine - Test   Coarse - WSJ   Fine - WSJ
Baseline           62.4            59.2          57.1           52.4
Event only         52.0            42.1          49.4           43.8
Event and context  65.0            56.7          70.1           59.9
Context significantly improves accuracy of MTurk annotations
Event Duration Lexicon
• Distributions for the 1000 most frequent verbs from the NYT portion of Gigaword, with the 10 most frequent grammatical objects of each verb
• Due to thresholds, not all the events have distributions
EVENT=to use,
ID=e13-7,
OBJ=computer,
PATTERNS=2,
DISTR=[0.009;0.337;0.238;0.090;0.130;0.103;0.092;0.002;]
http://cs.stanford.edu/~agusev/durations/
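A small sketch of reading one lexicon entry, assuming the comma-separated key=value format shown above (the field handling is inferred from the example, not from a published spec):

```python
# Sketch: parse one lexicon line such as
# "EVENT=to use, ID=e13-7, OBJ=computer, PATTERNS=2, DISTR=[0.009;...;0.002;]"
def parse_entry(line):
    entry = {}
    for field in line.strip().rstrip(",").split(", "):
        key, value = field.split("=", 1)
        if key == "DISTR":
            entry[key] = [float(x) for x in value.strip("[]").split(";") if x]
        else:
            entry[key] = value
    return entry
```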
Summary
• We learned aspectual information from the web
• Event durations from the web counts are as accurate as a supervised system
• Web counts are domain-general, work well even without context
• New lexicon with 1000 most frequent verbs with 10 most frequent objects
• MTurk suggests that context can improve accuracy of event duration annotation
Thanks! Questions?