Question Answering (QA) – Lecture 2
© Johan Bos, April 2008

Lecture 1 (recap)
• What is QA?
• Query Log Analysis
• Challenges in QA
• History of QA
• System Architecture
• Methods
• System Evaluation
• State-of-the-art

Lecture 2
• Question Analysis
• Background Knowledge
• Answer Typing

Lecture 3
• Query Generation
• Document Analysis
• Semantic Indexing
• Answer Extraction
• Selection and Ranking

[Pronto architecture diagram: question → parsing (CCG) → boxing (DRS) → query generation → document retrieval (Indri, over indexed documents) → answer extraction → answer selection → answer reranking → answer; background knowledge from WordNet and NomLex feeds answer typing]

Question Analysis – Why?
• The aim of QA is to output answers, not documents
• We need question analysis to
  – determine the type of answer we are trying to find
  – estimate the number of answers we want to return
  – calculate the probability that an answer is correct

Natural Language Processing
• We need ways to automate the processing of natural language:
  – punctuation
  – the way words are composed
  – the relationships between words
  – the structure of phrases
  – representing the meaning of phrases
• This is where NLP comes in!
  – (NLP = Natural Language Processing)

How to use NLP tools?
• A large set of tools is available on the web, most of them free for research
• Examples of integrated text processing environments:
  – GATE (University of Sheffield)
  – TTT (University of Edinburgh)
  – LingPipe
  – C&C (used by the Pronto QA system)
• For a general overview of NLP tools, see http://registry.dfki.de/

Architecture of PRONTO
[Pronto architecture diagram]

Question Analysis
• Tokenisation
• Part-of-speech tagging
• Lemmatisation
• Syntactic analysis (parsing)
• Semantic analysis (boxing)
• Named entity recognition
• Anaphora resolution

Tokenisation
• Tokenisation is the task of splitting words from punctuation:
  – semicolons and colons  ; :
  – exclamation marks and question marks  ! ?
  – commas and full stops  , .
  – quotes  " ' `
• Tokens are normally separated by spaces
  – in the following slides, we use | as the token separator

Tokenisation: Example 1
• Input (9 tokens):
  When | was | the | Buckingham | Palace | built | in | London, | England?
• Output (11 tokens):
  When | was | the | Buckingham | Palace | built | in | London | , | England | ?

Tokenisation: Example 2
• Input (7 tokens):
  What | year | did | "Snow | White" | come | out?
• Output (10 tokens):
  What | year | did | " | Snow | White | " | come | out | ?
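The splitting rules above can be sketched with a few regular expressions. This is a toy illustration, not the tokeniser actually used by Pronto or C&C, and it deliberately ignores the abbreviation problem discussed on a later slide (it would wrongly split "U.S."):

```python
import re

def tokenise(text):
    """Toy tokeniser: split punctuation and English clitics off words."""
    # Separate quotes, commas, semicolons, question/exclamation marks
    text = re.sub(r'([;:,!?"\u201c\u201d])', r' \1 ', text)
    # Split a full stop only at the end of a word (naive: breaks on "U.S.")
    text = re.sub(r'(\w)\.(\s|$)', r'\1 . \2', text)
    # Split clitics: won't -> wo | n't, I'd -> I | 'd, country's -> country | 's
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    text = re.sub(r"(\w)'(s|d|m|re|ve|ll)\b", r"\1 '\2", text)
    return text.split()
```

On the two slide examples this yields the expected 11 and 10 tokens; a real tokeniser would add rules for abbreviations, numbers, and multiword units.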
Tokenisation: combined words
• Combined words are split:
  – I'd → I | 'd
  – country's → country | 's
  – won't → wo | n't
  – "don't!" → " | do | n't | ! | "
• Some Italian examples:
  – gliel'ha detto → glie | l' | ha | detto
  – posso prenderlo → posso | prender | lo

Difficulties with tokenisation
• Abbreviations and acronyms:
  – When was the U.S. invasion of Haiti?
• Especially difficult when the abbreviation or acronym is the last word of a sentence
  – look at the next word: if it starts with an uppercase letter, assume the end of a sentence
  – but think of cases such as "Mr. Jones"

Why is tokenisation important?
• It is required for all subsequent stages of processing:
  – parsing
  – named entity recognition
  – lemmatisation
  – looking up a word in an electronic dictionary (such as WordNet)

Question Analysis
• Tokenisation
→ Part-of-speech tagging
• Named entity recognition
• Lemmatisation
• Syntactic analysis (parsing)
• Semantic analysis (boxing)

Traditional parts of speech
• verb • noun • adjective • adverb
• preposition • pronoun • conjunction • interjection

Parts of speech in NLP
• Tagsets used in NLP are much more fine-grained, for example:
  – CLAWS1 (132 tags)
  – Penn Treebank (45 tags)

• CLAWS1 examples:
  NN     singular common noun (boy, pencil, …)
  NN$    genitive singular common noun (boy's, parliament's, …)
  NNP    singular common noun with word-initial capital (Austrian, American, Sioux, Eskimo, …)
  NNP$   genitive singular common noun with word-initial capital (Sioux', Eskimo's, …)
  NNPS   plural common noun with word-initial capital (Americans, …)
  NNPS$  genitive plural common noun with word-initial capital (Americans', …)
  NNS    plural common noun (pencils, skeletons, days, weeks, …)
  NNS$   genitive plural common noun (boys', weeks', …)
  NNU    abbreviated unit of measurement, unmarked for number (in, cc, kg, …)

• Penn Treebank examples:
  JJ     adjective (green, …)
  JJR    adjective, comparative (greener, …)
  JJS    adjective, superlative (greenest, …)
  MD     modal (could, will, …)
  NN     noun, singular or mass (table, …)
  NNS    noun, plural (tables, …)
  NNP    proper noun, singular (John, …)
  NNPS   proper noun, plural (Vikings, …)
  PDT    predeterminer (both the boys)
  POS    possessive ending (friend's)
  PRP    personal pronoun (I, he, it, …)
  PRP$   possessive pronoun (my, his, …)
  RB     adverb (however, usually, naturally, here, good, …)
  RBR    adverb, comparative (better, …)

POS tagged example
  What  year  did  "   Snow  White  "   come  out  ?
  WP    NN    VBD  ``  NNP   NNP    ''  VB    IN   .

Why is POS-tagging important?
• To disambiguate words, for instance to distinguish "book" used as a noun from "book" used as a verb:
  – Where can I find a book on cooking?
  – Where can I book a room?
• It is a prerequisite for further processing stages, such as parsing

Question Analysis
• Tokenisation
• Part-of-speech tagging
→ Lemmatisation
• Syntactic analysis (parsing)
• Semantic analysis (boxing)

Lemmatisation
• Lemmatising means grouping morphological variants of words under a single headword
• For example, you could group the words am, was, are, is, were, and been together under the word be
• In linguistic terminology, the variants taken together form the lemma of a lexeme
  – lexeme: a "lexical unit", an abstraction over specific constructions
• Other examples:
  – dying, die, died, dies → die
  – car, cars → car
  – man, men → man

Question Analysis
• Tokenisation
• Part-of-speech tagging
• Lemmatisation
→ Syntactic analysis (parsing)
• Semantic analysis (boxing)

What is Parsing?
• Parsing is the process of assigning a syntactic structure to a sequence of words
• The syntactic structure is defined using a grammar
• A grammar consists of a set of symbols (terminal and non-terminal symbols) and production rules (grammar rules)
• The lexicon is built over the terminal symbols (i.e., the words)

Syntactic Categories
• The non-terminal symbols correspond to syntactic categories:
  – Det (determiner), N (noun), IV (intransitive verb), TV (transitive verb), PN (proper name), Prep (preposition)
  – NP (noun phrase), e.g. "the car"
  – PP (prepositional phrase), e.g. "at the table"
  – VP (verb phrase), e.g. "saw a car"
  – S (sentence), e.g. "Mia likes Vincent"

Example Grammar
• Lexicon:
  Det: which, a, the, …
  N: rock, singer, …
  IV: die, walk, …
  TV: kill, write, …
  PN: John, Lithium, …
  Prep: on, from, to, …
• Grammar rules:
  S → NP VP      VP → TV NP
  NP → Det N     VP → IV
  NP → PN        VP → VP PP
  N → N N        PP → Prep NP
  N → N PP

The Parser
• A parser automates the process of parsing
• The input of the parser is a string of words (annotated with POS tags)
• The output of a parser is a parse tree, connecting all the words
• The way a parse tree is constructed is also called a derivation

Derivation Example: "Which rock singer wrote Lithium"
• Lexical stage: Which/Det, rock/N, singer/N, wrote/TV, Lithium/PN
• Use rule NP → Det N (over "Which rock")
• Use rule NP → PN (over "Lithium")
• Use rule VP → TV NP (over "wrote Lithium")
• Backtracking: the NP "Which rock" cannot combine with the remaining words into S, so undo NP → Det N
• Use rule N → N N (over "rock singer")
• Use rule NP → Det N (over "Which rock singer")
• Use rule S → NP VP
[Tree diagrams omitted]

Wide coverage parsers
• Normally expect tokenised and POS-tagged input
• Examples of wide-coverage parsers:
  – Charniak parser
  – Collins parser
  – RASP (Carroll & Briscoe)
  – CCG parser (Clark & Curran; used in Pronto)

Output of the C&C parser
• For "For which newspaper does Krugman write?":

  ba('S[wq]',
     fa('S[wq]',
        fa('S[wq]/(S[q]/PP)',
           fc('(S[wq]/(S[q]/PP))/N',
              lf(1,'(S[wq]/(S[q]/PP))/(S[wq]/(S[q]/NP))'),
              lf(2,'(S[wq]/(S[q]/NP))/N')),
           lf(3,'N')),
        fc('S[q]/PP',
           fa('S[q]/(S[b]\NP)',
              lf(4,'(S[q]/(S[b]\NP))/NP'),
              lex('N','NP',lf(5,'N'))),
           lf(6,'(S[b]\NP)/PP'))),
     lf(7,'S[wq]\S[wq]')).

  w(1, 'For',      for,       'IN',  'O',     '(S[wq]/(S[q]/PP))/(S[wq]/(S[q]/NP))').
  w(2, which,      which,     'WDT', 'O',     '(S[wq]/(S[q]/NP))/N').
  w(3, newspaper,  newspaper, 'NN',  'O',     'N').
  w(4, does,       do,        'VBZ', 'O',     '(S[q]/(S[b]\NP))/NP').
  w(5, 'Krugman',  krugman,   'NNP', 'I-PER', 'N').
  w(6, write,      write,     'VB',  'O',     '(S[b]\NP)/PP').
  w(7, ?,          ?,         '.',   'O',     'S[wq]\S[wq]').
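The toy grammar and the derivation above can be turned into a tiny recogniser. The sketch below uses the CYK algorithm instead of the slides' top-down derivation, so no explicit backtracking is needed: all analyses are built bottom-up. The grammar and lexicon are the toy ones from the slides (the binary rules are already in the required shape; NP → PN and VP → IV are handled as unary rules):

```python
# Toy grammar from the slides: binary rules A -> B C, unary rules A -> B.
LEXICON = {
    'which': 'Det', 'a': 'Det', 'the': 'Det',
    'rock': 'N', 'singer': 'N',
    'kill': 'TV', 'write': 'TV', 'wrote': 'TV',
    'die': 'IV', 'walk': 'IV',
    'John': 'PN', 'Lithium': 'PN',
    'on': 'Prep', 'from': 'Prep', 'to': 'Prep',
}
BINARY = {
    ('NP', 'VP'): 'S', ('Det', 'N'): 'NP', ('N', 'N'): 'N',
    ('N', 'PP'): 'N', ('TV', 'NP'): 'VP', ('VP', 'PP'): 'VP',
    ('Prep', 'NP'): 'PP',
}
UNARY = {'PN': 'NP', 'IV': 'VP'}

def closure(cats):
    """Add categories reachable via unary rules (NP -> PN, VP -> IV)."""
    cats = set(cats)
    while True:
        new = {UNARY[b] for b in cats if b in UNARY} - cats
        if not new:
            return cats
        cats |= new

def parse(words):
    """CYK recogniser: chart[(i, j)] = categories spanning words[i:j]."""
    n = len(words)
    chart = {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = closure({LEXICON.get(w) or LEXICON[w.lower()]})
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cats = {BINARY[(b, c)]
                    for k in range(i + 1, j)
                    for b in chart[(i, k)]
                    for c in chart[(k, j)]
                    if (b, c) in BINARY}
            chart[(i, j)] = closure(cats)
    return 'S' in chart[(0, n)]
```

For "Which rock singer wrote Lithium" the chart contains both the dead-end NP "Which rock" and the successful analysis via N → N N, which is exactly the ambiguity that forces backtracking in the top-down derivation.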
Question Analysis
• Tokenisation
• Part-of-speech tagging
• Lemmatisation
• Syntactic analysis (parsing)
→ Semantic analysis (boxing)

Architecture of PRONTO
[Pronto architecture diagram]

Boxing (Semantic Analysis)
• Providing a semantic analysis on the basis of the syntactic analysis
• A semantic analysis of a question offers an abstract representation of the meaning of the question
• Boxer uses a particular semantic theory: Discourse Representation Theory

Discourse Representation Theory
• The meaning of natural language expressions is represented in first-order logic
• No formulas, but box representations (without explicit quantification and conjunction)
• DRT covers a wide range of linguistic phenomena (Kamp & Reyle)

Output of Boxer
• DRS (Discourse Representation Structure) for
  "Paul Krugman. For which newspaper does Krugman write?":

   _______________________     ____________________________________
  | x0                    |   | x1                                 |
  |_______________________|   |____________________________________|
  | named(x0,krugman,per) | + | write(x1)                          |
  | named(x0,paul,per)    |   | event(x1)                          |
  |_______________________|   | agent(x1,x0)                       |
                              |  _______________     ____________  |
                              | | x2            |   |            | |
                              | |_______________|   |____________| |
                              | | newspaper(x2) | ? | event(x1)  | |
                              | |_______________|   | for(x1,x2) | |
                              |                     |____________| |
                              |____________________________________|

Focus and Topic
• Information expressed in a question can be structured into two parts:
  – the focus: the information that is asked for
  – the topic: information about the focus
• Example:
  How many inhabitants | does Rome have?
  FOCUS                | TOPIC

Focus in DRS
• In the DRS above, the focus is the discourse referent introduced by the question (here x2, the newspaper being asked for)

Question Answering (QA) – Lecture 2
• Question Analysis
→ Background Knowledge
• Answer Typing

Architecture of PRONTO
[Pronto architecture diagram]

Knowledge Construction
• The knowledge component in Pronto constructs a local knowledge base for the question under consideration
  – this knowledge is used in subsequent components
• The task of the knowledge component is to find all relevant knowledge that might be used
  – and as little as possible, to ensure efficiency

Manually Constructed Knowledge
• Linguistic knowledge:
  – WordNet
  – NomLex
  – FrameNet
• General knowledge:
  – CYC
  – CIA Factbook
  – Gazetteers

WordNet
• An electronic dictionary
• Not only words and definitions, but also relations between words
• Four parts of speech:
  – nouns
  – verbs
  – adjectives
  – adverbs

WordNet SynSets
• Words are organised in SynSets
• A SynSet is a group of words with the same meaning, in other words, a set of synonyms
• Example: { Rome, Roma, Eternal City, Italian Capital, capital of Italy }

Senses
• A word can have several different meanings
• Example: plant
  – a building for industrial labour
  – a living organism lacking the power of locomotion
• The different meanings of a word are called senses
• Therefore, one word can occur in more than one SynSet in WordNet

SynSet Example
• {mug, mugful} = the quantity that can be held in a mug
• {chump, fool, gull, mark, patsy, fall guy, sucker, soft touch, mug} = a person who is gullible and easy to take advantage of
• {countenance, physiognomy, phiz, visage, kisser, smiler, mug} = the human face

Hypernyms and Hyponyms
• Hypernymy is a WordNet relation defined between two SynSets
  – if A is a hypernym of B, then A is more generic than B
• The inverse of hypernymy is hyponymy
  – if A is a hyponym of B, then A is more specific than B
• Take the transitive closure of these relations
• Examples:
  – "cow" and "horse" are hyponyms of "animal"
  – "publication" is a hypernym of "book"

Examples using WordNet
• Which rock singer wrote Lithium?
  – WordNet: singer is a hyponym of person
  – Knowledge: ∀x(singer(x) → person(x))
• What is the population of Andorra?
  – WordNet: population is a hyponym of number
  – Knowledge: ∀x(population(x) → number(x))

NomLex
• NomLex is a database of nominalisation paraphrases
  – a nominalisation is a "verb promoted to a noun"
  – a paraphrase links the noun to the root verb
• Examples:
  – X is an invention by Y ↔ Y invented X
  – the killing of X ↔ X was killed

Harvesting Knowledge
• Existing knowledge bases are often incomplete for particular applications
• There are various ways to construct knowledge bases automatically:
  – instances and hyponyms [e.g. Hearst]
  – paraphrases [e.g. Lin & Pantel]

Hyponyms (X such as Y)
TREC 20.2 (Concorde): What airlines have Concorde in their fleets?
• WordNet has no instances of airlines
• Search for "Xs such as Y" patterns in large corpora, such as the web
  – here: X = airline, Y a hyponym of X
  – corpus: "…airlines such as Continental and United now fly…"
• Knowledge harvested (from the AQUAINT corpus):
  Air Asia, Air Canada, Air France, Air Mandalay, Air Zimbabwe, Alaska, Aloha,
  American Airlines, Angel Airlines, Ansett, Asiana, Bangkok Airways,
  Belgian Carrier Sabena, British Airways, Canadian, Cathay Pacific,
  China Eastern Airlines, China Xinhua Airlines, Continental, Garuda,
  Japan Airlines, Korean Air, Lai, Lao Aviation, Lufthansa, Malaysia Airlines,
  Maylasian Airlines, Midway, Northwest, Orient Thai Airlines, Qantas,
  Seage Air, Shanghai Airlines, Singapore Airlines, Skymark Airlines Co.,
  South Africa, Swiss Air, US Airways, United, Virgin, Yangon Airways

Paraphrases
• Several methods have been developed for automatically finding paraphrases in large corpora
• This usually proceeds by starting with seed patterns of known positive instances
• Using bootstrapping, new patterns are found, and new seeds can be used

Seed example
• Start: "Oswald killed JFK"
• Search for "Oswald * JFK"
• Results:
  – Oswald assassinated JFK
  – Oswald shot JFK
• Use these new patterns to find other pairs, and start again

Paraphrase Example
TREC 4.2 (James Dean): When did James Dean die?
• Knowledge:
  ∀x∀t(∃e(kill(e) & theme(e,x) & in(e,t)) → ∃e(die(e) & agent(e,x) & in(e,t)))
• APW19990929.0165: "In 1955, actor James Dean was killed in a two-car collision near Cholame, Calif."
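The "Xs such as Y" harvesting step can be sketched with a single regular expression over raw text. This is a toy version: it assumes the hyponyms are capitalised names, does no lemmatisation beyond stripping a plural "s" from the hypernym, and the example sentences are invented for illustration; real systems run such patterns over very large corpora:

```python
import re

# Hearst-style pattern: "Xs such as Y1, Y2 and Y3"
# (toy assumption: hyponyms are sequences of capitalised words)
PATTERN = re.compile(
    r"(\w+?)s?\s+such\s+as\s+"
    r"((?:[A-Z]\w*(?:\s[A-Z]\w*)*)"
    r"(?:\s*(?:,|and|or)\s+[A-Z]\w*(?:\s[A-Z]\w*)*)*)"
)

def harvest(text):
    """Extract (hypernym, hyponym) pairs from raw text."""
    pairs = []
    for hyper, names in PATTERN.findall(text):
        # Split the conjoined name list on commas and "and"/"or"
        for hypo in re.split(r"\s*,\s*|\s+(?:and|or)\s+", names):
            pairs.append((hyper, hypo))
    return pairs
```

On the corpus snippet from the slides this yields (airline, Continental) and (airline, United); the harvested pairs can then be added to the local knowledge base as hyponym facts.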
Question Answering (QA) – Lecture 2
• Question Analysis
• Background Knowledge
→ Answer Typing

Architecture of PRONTO
[Pronto architecture diagram]

Answer Typing
• Providing information on the expected answer type:
  – the type of question
  – the answer type (from a sortal ontology or taxonomy)
  – the answer cardinality
• Issues:
  – ambiguities
  – vagueness
  – classification problems

Question Types
• Wh-questions:
  – Where was Franz Kafka born?
  – How many countries are member of OPEC?
  – Who is Thom Yorke?
  – Why did David Koresh ask the FBI for a word processor?
  – How did Frank Zappa die?
  – Which boxer beat Muhammed Ali?
• Yes-no questions:
  – Does light have weight?
  – Scotland is part of England – true or false?
• Choice questions:
  – Did Italy or Germany win the World Cup in 1982?
  – Who is Harry Potter's best friend – Ron, Hermione or Sirius?

Indirect Questions
• Imperative mood:
  – Name four European countries that produce wine.
  – Give the date of birth of Franz Kafka.
• Declarative mood:
  – I would like to know when Jim Morrison was born.

Answer Type Taxonomies
• A simple answer-type taxonomy:
  PERSON, NUMERAL, DATE, MEASURE, LOCATION, ORGANISATION

Expected Answer Types
• PERSON:
  – Who won the Nobel prize for Peace?
  – Which rock singer wrote Lithium?
• NUMERAL:
  – How many inhabitants does Rome have?
  – What's the population of Scotland?
• DATE:
  – When was JFK killed?
  – In what year did Rome become the capital of Italy?
• MEASURE:
  – How much does a 125 gallon fish tank cost?
  – How tall is an African elephant?
  – How heavy is a Boeing 777?
Expected Answer Types
• LOCATION:
  – Where does Angus Young of AC/DC live?
  – What city gives a Christmas tree to Westminster every year as a gift?
• ORGANISATION:
  – Which company invented the compact disk?
  – Who purchased Gilman Paper Company?

Using background knowledge
• Which rock singer …
  – singer is a hyponym of person, therefore the expected answer type is PERSON
• What is the population of …
  – population is a hyponym of number, hence answer type NUMERAL

Answer type tagging
• Simple rule-based systems:
  Who …      → PERSON
  Where …    → LOCATION
  When …     → DATE
  How many … → NUMERAL
• … often fail:
  – Who launched the iPod?
  – Where in the human body is the liver?
  – When is it time to go to bed?

Complex taxonomies
• Simple ontologies cannot account for the large variety of questions
• An example of a more complex ontology is proposed by Li & Roth
• Pronto uses its own complex ontology
• Machine learning approaches are often used to automatically tag questions with answer types

Taxonomy of Li & Roth (1/3)
• ENTITY
  – animal      animals
  – body        organs of body
  – color       colors
  – creative    inventions, books and other creative pieces
  – currency    currency names
  – dis.med.    diseases and medicine
  – event       events
  – food        food
  – instrument  musical instruments
  – lang        languages
  – letter      letters like a–z
  – other       other entities
  – plant       plants
  – product     products
  – religion    religions
  – sport       sports
  – substance   elements and substances
  – symbol      symbols and signs
  – technique   techniques and methods
  – term        equivalent terms
  – vehicle     vehicles
  – word        words with a special property

Taxonomy of Li & Roth (2/3)
• DESCRIPTION: description and abstract concepts
  – definition   definition of sth.
  – description  description of sth.
  – manner       manner of an action
  – reason       reasons
• HUMAN: human beings
  – group        a group or organization of persons
  – ind          an individual
  – title        title of a person
  – description  description of a person
• LOCATION: locations
  – city         cities
  – country      countries
  – mountain     mountains
  – other        other locations
  – state        states

Taxonomy of Li & Roth (3/3)
• NUMERIC: numeric values
  – code       postcodes or other codes
  – count      number of sth.
  – date       dates
  – distance   linear measures
  – money      prices
  – order      ranks
  – other      other numbers
  – period     the lasting time of sth.
  – percent    fractions
  – speed      speed
  – temp       temperature
  – size       size, area and volume
  – weight     weight
• ABBREVIATION
  – abb        abbreviation
  – exp        expansion

Pronto Answer Type Taxonomy
[Taxonomy diagram omitted]

Answer typing: problems
• Ambiguities:
  – How long … → distance or duration?
• Vague wh-words:
  – What do penguins eat?
  – What is the length of a football pitch?
• Taxonomy gaps:
  – Which alien race featured in Star Trek?
  – What is the cultural capital of Italy?

Answer Cardinality
• How many distinct answers does a question have?
• Examples:
  – When did Louis Braille die? → 1 answer
  – Who won a Nobel prize in chemistry? → 1 or more answers
  – What are the seven wonders of the world? → exactly 7 answers

Class activity: answer typing
1. How many islands does Italy have?
2. When did Inter win the Scudetto?
3. What are the colours of the Lithuanian flag?
4. Where is St. Andrews located?
5. Why does oil float in water?
6. How did Frank Zappa die?
7. Name the Baltic countries.
8. Which seabird was declared extinct in the 1840s?
9. Who is Noam Chomsky?
10. List names of Russian composers.
11. Edison is the inventor of what?
12. How far is the moon from the sun?
13. What is the distance from New York to Boston?
14. How many planets are there?
15. What is the exchange rate of the Euro to the Dollar?
16. What does SPQR stand for?
17. What is the nickname of Totti?
18. What does the Scottish word "bonnie" mean?
19. Who wrote the song "Paranoid Android"?

Lecture 3 (next)
• Query Generation
• Document Analysis
• Semantic Indexing
• Answer Extraction
• Selection and Ranking
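Returning to the Answer Typing section: the simple rule-based tagger from the slides, together with its failure cases, can be sketched as follows. This is a deliberately naive illustration over the simple six-class taxonomy, not Pronto's actual classifier:

```python
def answer_type(question):
    """Naive rule-based answer typing over the simple taxonomy
    (PERSON, NUMERAL, DATE, MEASURE, LOCATION, ORGANISATION)."""
    q = question.lower()
    if q.startswith('how many'):
        return 'NUMERAL'
    if q.startswith(('how much', 'how tall', 'how heavy', 'how far')):
        return 'MEASURE'
    if q.startswith('who'):
        return 'PERSON'    # wrong for "Who launched the iPod?" (ORGANISATION)
    if q.startswith('where'):
        return 'LOCATION'  # wrong for "Where in the human body is the liver?"
    if q.startswith('when'):
        return 'DATE'      # wrong for "When is it time to go to bed?"
    return 'UNKNOWN'       # "What …" / "Which …" need WordNet + a taxonomy
```

The UNKNOWN branch is exactly where the background-knowledge step earns its keep: "Which rock singer …" only resolves to PERSON once WordNet tells us that singer is a hyponym of person.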