Extracting Rich Event Structure from Text: Models and Evaluations
Evaluations and More
Nate Chambers, US Naval Academy

Experiments
1. Schema Quality – Did we learn valid schemas/frames?
2. Schema Extraction – Do the learned schemas prove useful?

Experiments
1. Schema Quality
   – Human judgments
   – Comparison to other knowledgebases
2. Schema Extraction
   – Narrative Cloze
   – MUC-4
   – TAC
   – Summarization

Schema Quality: Humans
"Generating Coherent Event Schemas at Scale" – Balasubramanian et al., 2013
Relation coherence:
1) Are the relations in a schema valid?
2) Do the relations belong to the schema topic?
Actor coherence:
3) Do the actors have a useful role within the schema?
4) What fraction of instances fit the role?

Schema Quality: Humans
Amazon Turk experiment: relation coherence
1. Ground the arguments with a single entity.
   – Randomly sample the head word for each argument, weighted by frequency.
2. Present the schema as a grounded list of tuples.
Grounded schema:
   Carey veto legislation
   Legislation be sign by Carey
   Legislation be pass by State Senate
   Carey sign into law
   …

Schema Quality: Humans
Amazon Turk questions: relation coherence
1. Is each of the grounded tuples valid (i.e., meaningful in the real world)?
2. Do the majority of relations form a coherent topic?
3. Does each tuple belong to the common topic?
* Turkers were told to ignore grammar.
* Five annotators per schema.

Schema Quality: Humans
Actor coherence
1. Ground ONE argument with a single entity.
2. Show the top 5 head words for the second argument.
Grounded schema:
   Carey veto legislation, bill, law, measure
   Legislation be sign by Carey, John, Chavez, She
   Legislation be pass by State Senate, Assembly, House, …
   Carey sign into law …
"Do the actors represent a coherent set of arguments?"
(A yes/no question? It is unclear what answers were allowed.)

Results

Schema Quality: Knowledgebases
• FrameNet events and roles
• MUC-3 templates
(Chambers and Jurafsky, 2009)

FrameNet (Baker et al., 1998)

Comparison to FrameNet
• Narrative Schemas
   – Focus on events that occur together in a narrative.
   – Schemas represent larger situations.
• FrameNet (Baker et al., 1998)
   – Focuses on events that share core roles.
   – Frames typically represent single events.

Comparison to FrameNet
1. How similar are schemas to frames?
   – Find the "best" FrameNet frame by event overlap (sketched in the example below).
2. How similar are schema roles to frame elements?
   – Evaluate argument types as FrameNet frame elements.

FrameNet Schema Similarity
1. How many schemas map to frames?
   – 13 of 20 schemas mapped to a frame.
   – 26 of 78 (33%) verbs are not in FrameNet.
2. Verbs present in FrameNet:
   – 35 of 52 (67%) matched the frame.
   – 17 of 52 (33%) did not match.

FrameNet Schema Similarity
• Why were 33% unaligned?
   – FrameNet represents subevents as separate frames.
   – Schemas model sequences of events.
One schema (trade, rise, fall) maps to two FrameNet frames: Exchange and Change Position on a Scale.

FrameNet Argument Similarity
2. Argument role mapping to frame elements:
   – 72% of arguments were appropriate as frame elements.
Example: FrameNet frame Enforcing, frame element Rule
   law, ban, rule, constitutionality, conviction, ruling, lawmaker, tax (INCORRECT)
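To make the overlap-based frame mapping above concrete, here is a minimal Python sketch that scores a learned schema against FrameNet frames by counting shared verbs. The frame inventory format, the "highest overlap wins" rule, and the toy data are illustrative assumptions, not the exact procedure used in the comparison.

# Minimal sketch (assumed data format): map a learned schema to its "best"
# FrameNet frame by counting how many of the schema's verbs the frame covers.
def best_frame(schema_verbs, frame_lexical_units):
    """schema_verbs: set of verbs in one learned schema.
    frame_lexical_units: dict of frame name -> set of verbs that evoke it."""
    best, best_overlap = None, 0
    for frame, verbs in frame_lexical_units.items():
        overlap = len(schema_verbs & verbs)
        if overlap > best_overlap:
            best, best_overlap = frame, overlap
    return best, best_overlap  # (None, 0) if no schema verb appears in FrameNet

# Toy inventory mirroring the trade/rise/fall example above (not real FrameNet data):
frames = {
    "Exchange": {"trade", "exchange", "swap"},
    "Change_position_on_a_scale": {"rise", "fall", "climb", "drop"},
}
print(best_frame({"trade", "rise", "fall"}, frames))
# -> ("Change_position_on_a_scale", 2): no single frame covers all three verbs,
#    which is the misalignment the slides describe.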
FrameNet to MUC?
• FrameNet represents smaller, more atomic events rather than larger scenarios.
• Do we have a resource with larger scenarios?
   – Not really.
   – MUC-4?

Schema Quality
Template types: 1. Attack  2. Bombing  3. Kidnapping  4. Arson
Slots: Perp, Victim, Target, Instrument, Location, Time
Recall: 71%

MUC-4 Issues
• MUC-4 is a very limited domain.
• 6 template types.
• No good way to evaluate the learned knowledge except through the extraction task.
   – PROBLEM: you can do extraction without learning an event representation.

Can we label more MUC?
• Extremely time-consuming.
• Still domain-dependent.
One possibility: crowd-sourcing.
• Regneri et al. (2010)
   – Used Turk for 22 scenarios.
   – Asked Turkers to list events in order for each.

Regneri Example

Experiments
1. Schema Quality
   – Human judgments
   – Comparison to other knowledgebases
2. Schema Extraction
   – Narrative Cloze
   – MUC-4
   – TAC
   – Turkers

Cloze Evaluation
Wilson Taylor. Cloze Procedure: A New Tool for Measuring Readability. Journalism Quarterly, 1953.
• Predict the missing event, given a set of observed events.
Text:
   McCann threw two interceptions early…
   Toledo pulled McCann aside and told him he'd start…
   McCann quickly completed his first two passes…
Gold events:
   X threw
   pulled X
   told X
   X ?????
   start X
   X completed

Narrative Cloze Results
36.5% improvement

Narrative Cloze Evaluation
What was the original goal of this evaluation?
1. A "comparative measure to evaluate narrative knowledge."
2. It was "never meant to be solvable by humans."
Do you need narrative schemas to perform well?
As with all things NLP, the community optimized evaluation performance, not the big-picture goal.

Narrative Cloze Evaluation
Jans et al. (2012): use the text ordering information in a cloze evaluation. It is no longer a bag of events that have occurred, but a specific order, and you know where in that order the missing event occurred in the text.
This has developed into events as language models:
   P(x | previousEvent) * P(nextEvent | x)

Narrative Cloze Evaluation
Two major changes:
• The cloze includes the text order.
• Cloze tests are auto-generated from parses and coreference systems. The event chains aren't manually verified as gold (as the original Narrative Cloze did).
Jans et al. (2012); Pichotta and Mooney (2014); Rudinger et al. (2015)

Narrative Cloze Evaluation
Language modeling with Jans et al. (2012)
• Event: (verb, dependency)
• Pointwise mutual information between events with coreferring arguments (Chambers and Jurafsky, 2009)
• Event bigrams, in text order
• Event bigrams with one intervening event (skip-grams)
• Event bigrams with two intervening events (skip-grams)
• Varied which coreference chains they trained on: all, a subset, or just the single longest event chain.

Narrative Cloze Evaluation
Language modeling with Jans et al. (2012)
• Introduced the scoring metric Recall@N: the number of cloze tests where the system guesses the missing event in the top N of its ranked list (a sketch follows below).
• PMI events scored worse than the bigram/skip-gram approaches.
• Skip-grams outperformed vanilla bigrams; 2-skip-grams and 1-skip-grams performed similarly.
• Training on a subset of chains (the long ones) performed best.
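As a concrete reading of the bigram scoring and the Recall@N metric just described, here is a minimal Python sketch. The string event representation, the add-alpha smoothing, and the toy protagonist chain are assumptions for illustration; this is not the published implementation.

from collections import Counter

def score_candidate(cand, context, bigram, unigram, alpha=1.0):
    """Score a candidate event for the blank position in text order:
    roughly P(cand | previous event) * P(next event | cand), with add-alpha smoothing."""
    prev_e, next_e = context
    vocab = len(unigram)
    score = (bigram[(prev_e, cand)] + alpha) / (unigram[prev_e] + alpha * vocab)
    score *= (bigram[(cand, next_e)] + alpha) / (unigram[cand] + alpha * vocab)
    return score

def recall_at_n(cloze_tests, candidates, bigram, unigram, n=10):
    """Fraction of cloze tests whose held-out event is ranked in the top n guesses."""
    hits = 0
    for (prev_e, next_e), gold in cloze_tests:
        ranked = sorted(candidates, reverse=True,
                        key=lambda c: score_candidate(c, (prev_e, next_e), bigram, unigram))
        hits += gold in ranked[:n]
    return hits / len(cloze_tests)

# Toy usage: counts from one protagonist chain, events kept in text order.
chain = ["X_threw", "pulled_X", "told_X", "X_start", "X_completed"]
unigram = Counter(chain)
bigram = Counter(zip(chain, chain[1:]))
tests = [(("told_X", "X_completed"), "X_start")]  # hide one event, keep its neighbors
print(recall_at_n(tests, list(unigram), bigram, unigram, n=2))  # -> 1.0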
Narrative Cloze Evaluation
Pichotta and Mooney (2014)
• Extended and reproduced much of Jans et al. (2012).
• Main contribution: multi-argument events in the bigram cloze evaluation.
   Single-argument: arrested _Y_, convicted _Y_
   Multi-argument: _X_ arrested _Y_, _Z_ convicted _Y_
• Fun finding: multi-argument bigrams improve performance even on single-argument cloze tests.
• Not so fun: unigrams are an extremely high baseline.

Narrative Cloze Evaluation
Rudinger et al. (2015)
• Duplicated the Jans et al. skip-grams and the Pichotta/Mooney unigrams.
• Contribution: a log-bilinear language model (Mnih and Hinton, 2007).
• Single-argument events, not multi-argument: arrested _Y_, convicted _Y_.

Narrative Cloze Evaluation
Rudinger et al. (2015)
• Main finding: unigrams are essentially as good as the bigram models (confirms Pichotta).
• Main finding: the log-bilinear language model reaches ~36% recall in top-10 ranking, compared to ~30% with bigrams.

Narrative Cloze Evaluation
Remaining observations:
1. Language modeling is better than PMI on the Narrative Cloze.
2. PMI and other learners appear to learn attractive representations that LMs do not.
Remaining questions:
1. Does this mean the Narrative Cloze is useless?
   • Do we care about predicting "X said"?
2. Should text order be part of the test?
   • Originally, it was not.
   • Real-world order is what we care about.
3. Perhaps it is one of a bag of evaluations…

IE as an Evaluation
• MUC-4
• TAC

MUC-4 Extraction
MUC-4 corpus, as before.
Experiment setup:
• Train on all 1700 documents.
• Evaluate the inferred labels in the 200 test documents.

Evaluations
1. Flat Mapping
2. Schema Mapping
Mapping choice leads to very different extraction performance (a small sketch of the two mappings follows this discussion).

Evaluations
1. Flat Mapping – Map each learned slot to any MUC-4 slot.
   Learned: Schema 1 (Roles 1–4), Schema 2 (Roles 1–4), Schema 3 (Roles 1–4)
   MUC-4: Bombing (Perpetrator, Victim, Target, Instrument); Arson (Perpetrator, Victim, Target, Instrument)

Evaluations
2. Schema Mapping – Slots are bound to a single MUC-4 template.
   Learned: Schema 1 (Roles 1–4), Schema 2 (Roles 1–4), Schema 3 (Roles 1–4)
   MUC-4: Bombing (Perpetrator, Victim, Target, Instrument); Arson (Perpetrator, Victim, Target, Instrument)

MUC-4 Evaluations
• Cheung et al. (2013)
   – Learned schemas
   – Flat mapping
• Chambers (2013)
   – Learned schemas
   – Flat and schema mapping
• Nguyen et al. (2015)
   – Learned a bag of slots, not schemas
   – Flat mapping (unable to do schema mapping)

Evaluations
1. Flat Mapping – Didn't learn schema structure.
   Learned: Roles 1–12 (a flat bag of slots, no schema structure)
   MUC-4: Bombing (Perpetrator, Victim, Target, Instrument); Arson (Perpetrator, Victim, Target, Instrument)

MUC-4 Evaluation
Optimizing to the evaluation
1. The latest efforts appear to be optimizing to the evaluation again.
2. Don't evaluate with structure, so don't learn structure (this gives higher evaluation results).
   – Similar to the Narrative Cloze: the best rankings come from a model that doesn't learn good sets of events.
3. But if the goal is learning rich event structure, perhaps the flat mapping is inappropriate?
   – Then again, if we extract better with it, why does it matter?

MUC-4 Evaluation
A way forward?
1. Yes, perform the MUC-4 extraction task.
2. Also compare to the knowledgebase of templates.
This prevents a specialized extractor from "winning" when it may not represent any useful knowledge beyond the task. It also prevents a cute way of learning event knowledge that has no practical utility.
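The difference between the two mappings can be sketched in a few lines of Python. The greedy selection by a per-slot F1 score, the slot names, and the toy data are illustrative assumptions; the real evaluations use the full MUC-4 scorer over extracted strings.

def flat_mapping(learned_slots, muc_slots, f1):
    """Flat mapping: each learned slot may align to ANY MUC-4 slot, across templates."""
    return {s: max(muc_slots, key=lambda m: f1(s, m)) for s in learned_slots}

def schema_mapping(schemas, templates, f1):
    """Schema mapping: all slots of a learned schema must align within ONE MUC-4 template.
    Pick, per schema, the template whose best per-slot alignments score highest overall."""
    mapping = {}
    for schema_id, slots in schemas.items():
        best_template, best_score = None, -1.0
        for template_id, template_slots in templates.items():
            score = sum(max(f1(s, m) for m in template_slots) for s in slots)
            if score > best_score:
                best_template, best_score = template_id, score
        mapping[schema_id] = best_template
    return mapping

# Toy example: a made-up per-slot scoring function and two MUC-4-style templates.
f1 = lambda s, m: 1.0 if s.split(":")[1] == m.split(":")[1] else 0.1
templates = {"Bombing": ["Bombing:Perp", "Bombing:Victim"],
             "Arson": ["Arson:Perp", "Arson:Target"]}
schemas = {"Schema1": ["S1:Perp", "S1:Victim"]}
print(flat_mapping(schemas["Schema1"], [m for t in templates.values() for m in t], f1))
print(schema_mapping(schemas, templates, f1))

Because the flat mapping lets any bag of slots borrow the best-matching MUC-4 slot anywhere, it does not reward learning schema structure; the schema mapping does, which is the concern raised in the slides above.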
TAC 2010
Cheung et al. (2013): TAC 2010 Guided Summarization
• Write a 100-word summary for 10 newswire articles.
• Documents come from the AQUAINT datasets.
• http://nist.gov/tac/2010/Summarization/GuidedSumm.2010.guidelines.html
• KEY: each topic comes with a "topic statement", essentially an event template.

TAC 2010
Example TAC template: Accidents and Natural Disasters
   WHAT: what happened
   WHEN: date, time, other temporal placement markers
   WHERE: physical location
   WHY: reasons for the accident/disaster
   WHO_AFFECTED: casualties (death, injury), or individuals otherwise negatively affected by the accident/disaster
   DAMAGES: damages caused by the accident/disaster
   COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts, other reactions to the accident/disaster

TAC 2010
Example TAC summary text:
   (WHEN During the night of July 17,) (WHAT a 23-foot tsunami hit the north coast of Papua New Guinea (PNG),) (WHY triggered by a 7.0 undersea earthquake in the area).
You can map this data to a MUC-style evaluation.
BENEFIT: another domain beyond the niche MUC-4 domain.

Summary of Evaluations
• Chambers and Jurafsky (2008) – Narrative Cloze and FrameNet
• Regneri et al. (2010) – Turkers
• Chambers and Jurafsky (2011) – MUC-4
• Chen et al. (2011) – Custom annotation of documents for relations
• Jans et al. (2012) – Narrative Cloze
• Cheung et al. (2013) – MUC-4; TAC-2010 Summarization
• Chambers (2013) – MUC-4
• Pichotta and Mooney (2014) – Narrative Cloze
• Rudinger et al. (2015) – Narrative Cloze
• Bamman et al. (2013) – Learned actor roles, gold movie clusters
• Balasubramanian et al. (2013) – Turkers
• Nguyen et al. (2015) – MUC-4

References
Niranjan Balasubramanian, Stephen Soderland, Mausam, and Oren Etzioni. Generating Coherent Event Schemas at Scale. EMNLP 2013.
David Bamman, Brendan O'Connor, and Noah Smith. Learning Latent Personas of Film Characters. ACL 2013.
Nathanael Chambers. Event Schema Induction with a Probabilistic Entity-Driven Model. EMNLP 2013.
Nathanael Chambers and Dan Jurafsky. Template-Based Information Extraction without the Templates. ACL 2011.
Harr Chen, Edward Benson, Tahira Naseem, and Regina Barzilay. In-domain Relation Discovery with Meta-constraints via Posterior Regularization. ACL 2011.
Jackie Cheung, Hoifung Poon, and Lucy Vanderwende. Probabilistic Frame Induction. ACL 2013.
Bram Jans, Steven Bethard, Ivan Vulić, and Marie-Francine Moens. Skip N-grams and Ranking Functions for Predicting Script Events. EACL 2012.
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, and Romaric Besançon. Generative Event Schema Induction with Entity Disambiguation. ACL 2015.
Karl Pichotta and Raymond J. Mooney. Statistical Script Learning with Multi-Argument Events. EACL 2014.
Michaela Regneri, Alexander Koller, and Manfred Pinkal. Learning Script Knowledge with Web Experiments. ACL 2010.
Rachel Rudinger, Pushpendre Rastogi, Francis Ferraro, and Benjamin Van Durme. Script Induction as Language Modeling. EMNLP 2015.