Answering List Questions using Co-occurrence and Clustering
Majid Razmara and Leila Kosseim
Concordia University
m_razma@cs.concordia.ca

Introduction
• Question Answering
• TREC QA track
    - Question series
    - Corpora

Target: American Girl dolls
    FACTOID: In what year were American Girl dolls first introduced?
    LIST: Name the historical dolls.
    LIST: Which American Girl dolls have had TV movies made about them?
    FACTOID: How much does an American Girl doll cost?
    FACTOID: How many American Girl dolls have been sold?
    FACTOID: What is the name of the American Girl store in New York?
    FACTOID: What corporation owns the American Girl company?
    OTHER: Other

Hypothesis
• Answer instances:
    1. have the same semantic entity class,
    2. co-occur within sentences, or
    3. occur in different sentences sharing similar context
• Based on the Distributional Hypothesis: "Words occurring in the same contexts tend to have similar meanings" [Harris, 1954].

Target 232: "Dulles Airport"
Question 232.6: "Which airlines use Dulles?"
• Ltw_Eng_20050712.0032 (AQUAINT-2): United, which operates a hub at Dulles, has six luggage screening machines in its basement and several upstairs in the ticket counter area. Delta, Northwest, American, British Airways and KLM share four screening machines in the basement.
• Ltw_Eng_20060102.0106 (AQUAINT-2): Independence said its last flight Thursday will leave White Plains, N.Y., bound for Dulles Airport. Flyi suffered from rising jet fuel costs and the aggressive response of competitors, led by United and US Airways.
• New York Times (Web): Continental Airlines sued United Airlines and the committee that oversees operations at Washington Dulles International Airport yesterday, contending that recently installed baggage-sizing templates inhibited competition.
• Wikipedia (Web): At its peak of 600 flights daily, Independence, combined with service from JetBlue and AirTran, briefly made Dulles the largest low-cost hub in the United States.

Our Approach
1. Create an initial candidate list
    - Answer Type Recognition
    - Document Retrieval
    - Candidate Answer Extraction
    - The list may also be imported from an external source (e.g. a factoid QA system)
2. Extract co-occurrence information
3. Cluster candidates based on their co-occurrence

Answer Type Recognition
• 9 types: Person, Country, Organization, Job, Movie, Nationality, City, State, and Other
• Lexical patterns:
    ^ (Name | List | What | Which) (persons | people | men | women | players | contestants | artists | opponents | students)  →  PERSON
    ^ (Name | List | What | Which) (countries | nations)  →  COUNTRY
• Syntagmatic patterns for Other types:
    ^ (WDT | WP | VB | NN) (DT | JJ)* (NNS | NNP | NN | JJ | )* (NNS | NNP | NN | NNPS) (VBN | VBD | VBZ | WP | $)
    ^ (WDT | WP | VB | NN) (VBD | VBP) (DT | JJ | JJR | PRP$ | IN)* (NNS | NNP | NN | )* (NNS | NNP | NN)
• Type Resolution

Type Resolution
• Resolves the answer subtype to one of the main types
    Example: "List previous conductors of the Boston Pops."
    Type: OTHER, Subtype: Conductor  →  PERSON
• Uses WordNet's hypernym hierarchy
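A minimal sketch of this hypernym-based resolution, assuming NLTK's WordNet interface; the MAIN_TYPES anchor synsets (person.n.01, organization.n.01, occupation.n.01) and the resolve_subtype helper are illustrative assumptions, not the system's actual implementation.

```python
# Minimal sketch of WordNet-based type resolution (requires NLTK with the
# WordNet corpus).  The anchor synsets below are illustrative assumptions.
from nltk.corpus import wordnet as wn

MAIN_TYPES = {
    "PERSON": wn.synset("person.n.01"),
    "ORGANIZATION": wn.synset("organization.n.01"),
    "JOB": wn.synset("occupation.n.01"),
}

def resolve_subtype(subtype):
    """Climb WordNet's hypernym hierarchy from the subtype (e.g. 'conductor')
    and return the first main type whose anchor synset is an ancestor."""
    for synset in wn.synsets(subtype, pos=wn.NOUN):
        ancestors = set(synset.closure(lambda s: s.hypernyms()))
        for main_type, anchor in MAIN_TYPES.items():
            if anchor in ancestors:
                return main_type
    return "OTHER"   # no main type reached; keep the generic class

print(resolve_subtype("conductor"))   # PERSON
```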
Document Retrieval
• Document collections:
    - Source document collection: few documents, used to extract candidate answers
    - Domain document collection: many documents, used to extract co-occurrence information
• Query generation:
    - Google query on the Web
    - Lucene query on the corpora

Candidate Answer Extraction
• Term extraction: extract all terms that conform to the expected answer type
    - Person, Organization, Job: intersection of several NE taggers (LingPipe, Stanford tagger and GATE NE), for better precision
    - Country, State, City, Nationality: gazetteer, for better precision
    - Movie, Other: capitalized and quoted terms
• Verification of Movie: numHits(GoogleQuery intitle:Term site:www.imdb.com)
• Verification of Other: numHits("SubType Term" OR "Term SubType") / numHits("Term")

Co-occurrence Information Extraction
• Domain collection documents are split into sentences
• Each sentence is checked as to whether it contains candidate answers
[Figure: sentences x candidates occurrence matrix from which co-occurrence counts are derived]

Hierarchical Agglomerative Clustering
• Steps:
    1. Put each candidate term t_i in a separate cluster C_i
    2. Compute the similarity between each pair of clusters using average linkage:
       similarity(C_i, C_j) = 1 / (|C_i| · |C_j|) · Σ_{t_m ∈ C_i} Σ_{t_n ∈ C_j} similarity(t_m, t_n)
    3. Merge the two clusters with the highest inter-cluster similarity
    4. Update all relations between this new cluster and the other clusters
    5. Go to step 3 until there are only N clusters, or the similarity is less than a threshold

The Similarity Measure
• Similarity between each pair of candidates, based on their co-occurrence within sentences:

                 term_i       ¬term_i      Total
    term_j       O11          O21          O11 + O21
    ¬term_j      O12          O22          O12 + O22
    Total        O11 + O12    O21 + O22    N

• Using chi-square (χ²):
    χ² = N (O11 · O22 - O12 · O21)² / [(O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22)]
• Shortcoming: χ² does not work well with sparse data (see Future Work)

Pinpointing the Right Cluster
• Question and target keywords are used as "spies"
• Spies are:
    - inserted into the list of candidate answers,
    - treated as candidate answers, so their similarity to one another and to the candidate answers is computed, and
    - clustered along with the candidate answers
• The cluster containing the most spies is returned (the spies themselves are removed)
• Other approaches
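The χ² similarity, average-linkage clustering, and spy-based cluster selection described in the preceding slides fit together as in the following sketch. It is illustrative only: the token sets, the threshold of 1.0, the treatment of negatively associated pairs as dissimilar, and the omission of the fixed-number-of-clusters stopping criterion are assumptions, not part of the original system.

```python
from itertools import combinations

def chi_square(term_i, term_j, sentences):
    """Sentence-level chi-square association between two candidate terms."""
    n = len(sentences)
    o11 = sum(1 for s in sentences if term_i in s and term_j in s)      # both
    o12 = sum(1 for s in sentences if term_i in s and term_j not in s)  # term_i only
    o21 = sum(1 for s in sentences if term_j in s and term_i not in s)  # term_j only
    o22 = n - o11 - o12 - o21                                           # neither
    # Plain chi-square cannot tell positive from negative association (one of
    # its shortcomings); here negatively associated pairs are treated as dissimilar.
    if o11 * o22 <= o12 * o21:
        return 0.0
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom

def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    return sum(sim[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

def cluster(candidates, sentences, threshold):
    """Agglomerative clustering; merging stops when the best inter-cluster
    similarity drops below the threshold (the fixed-N criterion is omitted)."""
    sim = {a: {b: chi_square(a, b, sentences) for b in candidates}
           for a in candidates}
    clusters = [{c} for c in candidates]
    while len(clusters) > 1:
        (i, j), best = max(
            ((pair, average_linkage(clusters[pair[0]], clusters[pair[1]], sim))
             for pair in combinations(range(len(clusters)), 2)),
            key=lambda x: x[1])
        if best < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Toy run: sentences are token sets from the domain collection; "airline"
# plays the role of a spy taken from the question keywords.
sentences = [{"united", "delta", "airline"},
             {"delta", "northwest", "airline"},
             {"united", "northwest", "airline"},
             {"boeing", "airbus", "aircraft"},
             {"boeing", "aircraft"},
             {"airbus", "aircraft"}]
candidates = ["united", "delta", "northwest", "boeing", "airbus", "airline"]
spies = {"airline"}
best_cluster = max(cluster(candidates, sentences, threshold=1.0),
                   key=lambda c: len(c & spies))
print(sorted(best_cluster - spies))   # ['delta', 'northwest', 'united']
```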
Target 269: Pakistan earthquakes of October 2005
Question 269.2: What countries were affected by this earthquake?
[Figure: example output clusters (Cluster-2, Cluster-9, Cluster-31) over candidates such as pakistan, afghanistan, india, oman, u.s, spain, bangladesh, japan, germany, haiti, nepal, china, sweden, iran, mexico, vietnam, belgium, lebanon, iraq, russia, turkey, along with the stemmed spies 2005, octob, affect, earthquak]
Recall = 2/3, Precision = 2/3, F-score = 2/3

Results in TREC 2007
[Figure: F-measure of list runs in TREC 2007]
• Best: 0.479
• Median: 0.085
• Worst: 0.000
• Our run: F = 0.145

Evaluation of Clustering
• Baseline: list of candidate answers prior to clustering
• Our approach: list of candidate answers filtered by the clustering
• Theoretical maximum: the best possible output of clustering based on the initial list

    TREC 2004-2006 (237 questions):
        Precision: Baseline 0.064, Our Approach 0.141, Theoretical Max 1
        Recall:    Baseline 0.407, Our Approach 0.287, Theoretical Max 0.407
        F-score:   Baseline 0.098, Our Approach 0.154, Theoretical Max 0.472
    TREC 2007 (85 questions):
        Precision: Baseline 0.075, Our Approach 0.165, Theoretical Max 1
        Recall:    Baseline 0.388, Our Approach 0.248, Theoretical Max 0.388
        F-score:   Baseline 0.106, Our Approach 0.163, Theoretical Max 0.485

Evaluation of each Question Type
[Pie chart: percentage of each question type in the training set]
[Bar chart: F-score of each type in the training and test sets (Person, Other, Country, State, Organization, Job, Movie, Nationality, City)]

Future Work
• Developing a module that verifies whether each candidate is a member of the answer type (for the Movie and Other types)
• Using co-occurrence at the paragraph level rather than the sentence level; anaphora resolution can be used
• Another method for the similarity measure: χ² does not work well with sparse data; for example, using Yates' correction for continuity (Yates' χ²); a small sketch follows the last slide
• Using different clustering approaches
• Using different similarity measures, e.g. Mutual Information

Questions?
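The alternative association measures mentioned under Future Work can be sketched as drop-in replacements for the plain χ² similarity, operating on the same 2x2 sentence-level counts (O11, O12, O21, O22). This is an illustrative sketch, not part of the submitted system.

```python
import math

def yates_chi_square(o11, o12, o21, o22):
    """Chi-square with Yates' continuity correction on a 2x2 table;
    better behaved than plain chi-square when counts are sparse."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0
    corrected = max(0.0, abs(o11 * o22 - o12 * o21) - n / 2)
    return n * corrected ** 2 / denom

def pmi(o11, o12, o21, o22):
    """Pointwise mutual information between term_i and term_j
    computed from the same sentence-level contingency counts."""
    n = o11 + o12 + o21 + o22
    p_both = o11 / n
    p_i = (o11 + o12) / n   # marginal probability of term_i
    p_j = (o11 + o21) / n   # marginal probability of term_j
    if p_both == 0 or p_i == 0 or p_j == 0:
        return float("-inf")
    return math.log2(p_both / (p_i * p_j))

print(yates_chi_square(2, 0, 1, 3))   # 0.75 for these example counts
print(pmi(2, 0, 1, 3))                # 1.0: positive association
```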