CLEF 2012, Rome
QA4MRE: Question Answering for Machine Reading Evaluation

Anselmo Peñas (UNED, Spain), Eduard Hovy (USC-ISI, USA), Pamela Forner (CELCT, Italy), Álvaro Rodrigo (UNED, Spain), Richard Sutcliffe (U. Limerick, Ireland), Roser Morante (U. Antwerp, Belgium), Walter Daelemans (U. Antwerp, Belgium), Caroline Sporleder (U. Saarland, Germany), Corina Forascu (UAIC, Romania), Yassine Benajiba (Philips, USA), Petya Osenova (Bulgarian Academy of Sciences)

The Question Answering Track at CLEF, 2003-2012
• Multiple Language QA Main Task, later extended with temporal restrictions and list questions
• Answer Validation Exercise (AVE)
• ResPubliQA
• GikiCLEF
• Real Time QA over Speech Transcriptions (QAST)
• WiQA
• WSD QA
• QA4MRE (2011-2012), with pilots on Negation and Modality and on the Biomedical domain

The traditional QA pipeline
Question → question analysis → passage retrieval (0.8) → answer extraction (× 0.8) → answer ranking (× 1.0) → answer
Per-module accuracies multiply along the pipeline: 0.8 × 0.8 × 1.0 = 0.64.
Over the years we learnt that this architecture is one of the main limitations for improving QA technology, so we bet on a reformulation (a small numeric sketch of this error propagation appears at the end of this section):

Hypothesis generation + validation
Question → hypothesis generation functions → search over the space of candidate answers → answer validation functions → answer

We focus on validation …
Is the candidate answer correct?
QA4MRE setting: multiple-choice reading comprehension tests.
They measure progress in two reading abilities:
• answering questions about a single text
• capturing knowledge from text collections

… and knowledge
Why capture knowledge from text collections? Because we need knowledge to understand language. The ability to make inferences about a text is correlated with the amount of knowledge considered, and texts always omit information that we need to recover:
• to build the complete story behind the document
• and to be sure about the answer

Text as a source of knowledge
Background collection: a set of documents that contextualize the one under reading (20,000-100,000 documents).
• We can imagine this being done on the fly by the machine (retrieval).
• The collection must be big and diverse enough to acquire knowledge from.
• We define a scalable strategy: topic by topic, with one reference collection per topic.
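The 0.64 figure on the pipeline slide above is just the product of the per-module accuracies. A minimal numeric sketch of that error propagation, assuming (my assumption, not a claim from the slides) that a question is answered correctly only when every module succeeds and that module errors are independent:

    from functools import reduce

    # Illustrative per-module accuracies taken from the pipeline slide.
    module_accuracies = {
        "passage retrieval": 0.8,
        "answer extraction": 0.8,
        "answer ranking": 1.0,
    }

    # Upper bound on end-to-end accuracy of a strictly sequential pipeline.
    end_to_end = reduce(lambda acc, a: acc * a, module_accuracies.values(), 1.0)
    print(f"end-to-end accuracy <= {end_to_end:.2f}")  # 0.64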
Background collections
They must serve to acquire:
• general facts (with categorization and relevant relations)
• abstractions (such as …)
Acquisition is sensitive to how often things occur in the texts, and therefore to the way we create the collection.
Key: retrieve all the relevant documents and only them (the classical IR problem).
There is an interdependence with the topic definition:
• the topic is defined by the set of queries that produce the collection.

Example: the Alzheimer's Disease Literature Corpus (biomedical task)
PubMed was searched for Alzheimer-related abstracts (a retrieval sketch appears at the end of this section) with the query:
(((((("Alzheimer Disease"[Mesh] OR "Alzheimer's disease antigen"[Supplementary Concept] OR "APP protein, human"[Supplementary Concept] OR "PSEN2 protein, human"[Supplementary Concept] OR "PSEN1 protein, human"[Supplementary Concept]) OR "Amyloid beta-Peptides"[Mesh]) OR "donepezil"[Supplementary Concept]) OR ("gamma-secretase activating protein, human"[Supplementary Concept] OR "gamma-secretase activating protein, mouse"[Supplementary Concept])) OR "amyloid beta-protein (142)"[Supplementary Concept]) OR "Presenilins"[Mesh]) OR "Neurofibrillary Tangles"[Mesh] OR "Alzheimer's disease"[All Fields] OR "Alzheimer's Disease"[All Fields] OR "Alzheimer s disease"[All Fields] OR "Alzheimers disease"[All Fields] OR "Alzheimer's dementia"[All Fields] OR "Alzheimer dementia"[All Fields] OR "Alzheimer-type dementia"[All Fields] NOT "nonAlzheimer"[All Fields] NOT ("non-AD"[All Fields] AND "dementia"[All Fields]) AND (hasabstract[text] AND English[lang])
The search returned 66,222 abstracts.

Questions (Main Task)
Distribution of question types:
• 27 PURPOSE
• 30 METHOD
• 36 CAUSAL
• 36 FACTOID
• 31 WHICH-IS-TRUE
Distribution of answer types:
• 75 require no extra knowledge
• 46 require background knowledge
• 21 require inference
• 20 require gathering information from different sentences

Questions (Biomedical Task)
Question types:
1. experimental evidence/qualifier
2. protein-protein interaction
3. gene synonymy relation
4. organism source relation
5. regulatory relation
6. increase (higher expression)
7. decrease (reduction)
8. inhibition
Answer types:
• Simple: the answer is found almost verbatim in the paper
• Medium: the answer is rephrased
• Complex: the answer requires combining pieces of evidence and inference
All of them involve a predefined set of entity types.

Main Task
16 test documents, 160 questions, 800 candidate answers.
4 topics, drawn from popular (divulgative) sources such as blogs, web pages and news:
1. AIDS
2. Music and Society
3. Climate Change
4. Alzheimer's disease
4 reading tests per topic; each test is a document plus 10 questions with 5 choices per question.
6 languages: English, German, Spanish, Italian, Romanian and, new this year, Arabic.

Biomedical Task (new)
Same setting, but scientific language and a focus on one disease: Alzheimer's.
Alzheimer's Disease Literature Corpus (ADLC): 66,222 abstracts from PubMed plus 9,500 full articles.
Most of them were preprocessed with:
• the GDep dependency parser (Sagae and Tsujii, 2007)
• a UMLS-based named-entity tagger (CLiPS)
• the ABNER named-entity tagger (Settles, 2005)

Task on Modality and Negation
Given an event in the text, decide whether it is:
1. asserted (NONE: no negation and no speculation)
2. negated (NEG: negation and no speculation)
3. negated and speculated (NEGMOD)
4. speculated and not negated (MOD)
The annotation decision tree: Is the event presented as certain? If yes: did it happen? (yes → NONE, no → NEG). If no: is it negated? (yes → NEGMOD, no → MOD).
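The four labels above are the cross-product of two binary properties, negation and speculation. A minimal sketch of that mapping (the function and argument names are mine, not from the task guidelines):

    def modality_negation_label(negated: bool, speculated: bool) -> str:
        """Map the two binary properties onto the four QA4MRE labels."""
        if speculated:
            return "NEGMOD" if negated else "MOD"
        return "NEG" if negated else "NONE"

    # An asserted event, neither negated nor speculated:
    assert modality_negation_label(negated=False, speculated=False) == "NONE"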
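Relating back to the corpus-construction slide earlier in this section: the ADLC abstracts were obtained by running a long boolean query against PubMed. A minimal sketch of how such a retrieval could be scripted with NCBI's public E-utilities (the query is shortened to a stand-in here, and the batching, rate limiting and error handling that a real 66,222-abstract harvest would need are omitted):

    import requests

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
    # Shortened stand-in for the full ADLC query shown above.
    query = '"Alzheimer Disease"[Mesh] AND hasabstract[text] AND English[lang]'

    # esearch: PubMed IDs matching the query.
    search = requests.get(f"{EUTILS}/esearch.fcgi",
                          params={"db": "pubmed", "term": query,
                                  "retmax": 100, "retmode": "json"}).json()
    pmids = search["esearchresult"]["idlist"]

    # efetch: plain-text abstracts for those IDs.
    abstracts = requests.get(f"{EUTILS}/efetch.fcgi",
                             params={"db": "pubmed", "id": ",".join(pmids),
                                     "rettype": "abstract", "retmode": "text"}).text
    print(f"{len(pmids)} abstracts retrieved")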
Participation (roughly a 100% increase over 2011)
Task                    Registered groups   Participant groups   Submitted runs
Main                    25                  11                   43
Biomedical              23                  7                    43
Modality and Negation   3                   3                    6
Total                   51                  21                   92
[Bar chart: participants and runs, 2011 vs. 2012]

Evaluation and results
QA perspective: c@1 computed over all questions (random baseline 0.2, since each question has five choices). A minimal worked example of the c@1 measure appears at the very end of this document.
• Best systems, Main Task: 0.65 and 0.40
• Best systems, Biomedical Task: 0.55 and 0.47
Reading perspective: results aggregated test by test, where a test is passed if its c@1 > 0.5.
• Best systems, Main Task: 12/16 and 6/16 tests passed
• Best system, Biomedical Task: 3/4 tests passed

More details during the workshop:
• Monday 17th Sep., 17:00-18:00: poster session
• Tuesday 18th Sep., 10:40-12:40: invited talk + overviews
• Tuesday 18th Sep., 14:10-16:10: reports from participants (Main + Biomedical)
• Tuesday 18th Sep., 16:40-17:15: reports from participants (Modality & Negation)
• Tuesday 18th Sep., 17:15-18:10: breakout session

Thanks!
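As referenced in the evaluation section above, here is a minimal worked example of the c@1 measure used in both tasks. The formula is the one defined by Peñas and Rodrigo (ACL 2011); the variable names are mine:

    def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
        """c@1 = (nR + nU * nR / n) / n  (Peñas and Rodrigo, 2011).

        Equals plain accuracy when every question is answered; unanswered
        questions are credited at the system's observed accuracy rate.
        """
        return (n_correct + n_unanswered * n_correct / n_total) / n_total

    # Example: 160 questions, 80 answered correctly, 20 left unanswered.
    print(round(c_at_1(80, 20, 160), 3))  # 0.562

    # Answering every question at random among 5 options gives nR ≈ n / 5
    # and nU = 0, hence the 0.2 random baseline quoted on the results slide.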