Question Answering from Frequently Asked Question Files
Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro and Scott Schoenberg
AI Magazine; Summer 1997

Finding Similar Questions in Large Question and Answer Archives
Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee
ACM CIKM ’05

Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
December 13, 2011

What is FAQ Finder?
• Matches answers to questions already asked in a site’s FAQ file
• 4 Assumptions
1. Information is in QA format
2. All information needed to determine the relevance of a QA pair can be found in the QA pair
3. The Q half of the QA pair is the most relevant for matching to the user’s question
4. Broad, shallow knowledge of language is sufficient for question matching

How Does It Work?
• Uses the SMART IR system to narrow the focus to relevant FAQ files
• Iterates through the QA pairs in a FAQ file, comparing each against the user’s question and computing a score from 3 metrics
– Statistical term-vector similarity score t
– Semantic similarity score s
– Coverage score c
• Overall match score: $m = \frac{tT + sS + cC}{T + S + C}$
• T, S and C are constant weights that adjust the reliance of the system on each metric.

Calculating Similarity
• QA pair represented as a term vector with a significance value for each term in the pair
• Significance value = tfidf
• n (term frequency) = number of times the term appears in the QA pair
• m = number of QA pairs in the file in which the term appears (not the match score m above)
• M = number of QA pairs in the file
• tfidf = n × log(M/m)
• Evaluates the relative rarity of a term within the documents
– Used as a factor to weight the frequency of the term in a document

Nuances
• Many ways to express the same question
– In large documents, synonymous terms are often all used
– Thus, variations in wording have no effect there
• However, because FAQ Finder matches on a small number of terms, the system needs a means of matching synonyms
– How do I reboot my system?
– What do I do when my computer crashes?
– Causal relationship resolved with WordNet

WordNet
• Semantic network of English words
• Provides relations between words and synonym sets, and between synonym sets themselves
• FAQ Finder uses it through a marker-passing algorithm
– Compares each word in the user’s question to each word in the FAQ file question

WordNet (cont…)
• Not a single semantic network; different subnetworks exist for nouns, verbs, etc.
• Syntactically ambiguous words (e.g. run) appear in more than one network.
• Simply relying on the default word sense worked as well as any more sophisticated technique

Evaluating Performance
• Corpus drawn from the log file of the system’s use, May–Dec 1996
• 241 questions used
• A manual scan found answers for 138 questions; 103 questions were unanswered
• Assumes there is a single correct QA pair
• Because this task differs from the conventional IR problem, recall and precision have to be redefined

Why Redefine Recall & Precision?
• RECALL – typically a measure of the % of relevant docs in the collection that are retrieved for a query
• PRECISION – typically a measure of the % of retrieved docs that are relevant
• Because there is only one correct doc, these are not independent
• e.g.
query returns 5 QA pairs
– FAQ Finder returns either 100% recall and 20% precision, OR
– 0% recall and 0% precision
– If no answer exists, precision = 0% and recall is undefined

Redefining Recall & Precision
• New recall = % of questions for which FAQ Finder returns a correct answer when one exists
– Does not penalize the system if more than one correct answer exists (the original definition would)
• Instead of precision, calculate rejection
• Rejection = % of questions that FAQ Finder correctly reports as being unanswered in the file
– Adjusted by setting a cutoff point for the minimum allowable match score
• There is still a tradeoff between rejection and recall
– Rejection threshold too high: some correct answers are eliminated
– Rejection threshold too low: incorrect answers are given to the user when no answer exists

Results
• Correct file appears within the top 5 files returned 88% of the time, and in first position 48% of the time
– Equates to 88% recall, 23% precision
• System confidently returns garbage when there is no correct answer in the file

Ablation Study
• Evaluation of the different components of the matching scheme by disabling parts of it, under 4 conditions:
1. QA pairs selected randomly from the FAQ file
2. Coverage score used by itself
3. Semantic scores from WordNet used by themselves
4. Term-vector comparison used in isolation

Conditions’ Contributions
• WordNet and the statistical technique both contribute strongly
• Their combination yields results that are better than either individually.

Where FAQ Finder Fails
• Biggest cause of missed answers is undue weight given to semantically useless words
– Where can I find woodworking plans for a futon?
– woodworking is weighted as strongly as futon
– Inside the woodworking FAQ, futon should be much more important than woodworking, which applies to everything
• Other problem: violation of the assumptions about FAQ files

Conclusion
• When there is an existing collection of Qs & As, question answering can be reduced to matching new questions against QA pairs
• The power of the approach comes from FAQ Finder’s use of highly organized knowledge sources that are designed to answer commonly asked Qs.
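
The FAQ Finder scoring summarized above (tf-idf term vectors plus a weighted average of the three metric scores) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors’ implementation: the function names, the whitespace tokenizer, the use of cosine similarity to compare term vectors, and the default weights of 1.0 are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(qa_pairs):
    """Build one sparse tf-idf vector per QA pair: weight = n * log(M / m)."""
    M = len(qa_pairs)  # M: number of QA pairs in the file
    # Assumption: plain lowercase whitespace tokenization.
    counts = [Counter(text.lower().split()) for text in qa_pairs]
    df = Counter()  # m: number of QA pairs in which each term appears
    for c in counts:
        df.update(c.keys())
    return [{term: n * math.log(M / df[term]) for term, n in c.items()}
            for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed metric)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_score(t, s, c, T=1.0, S=1.0, C=1.0):
    """Weighted average m = (tT + sS + cC) / (T + S + C).

    The default weights are hypothetical placeholders, not tuned values.
    """
    return (t * T + s * S + c * C) / (T + S + C)
```

A term that occurs in every QA pair gets weight log(M/M) = 0, which is the sense in which tf-idf discounts words that apply to everything in the file.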
Citing Paper’s Objectives
• Find questions in the archive that are semantically similar to the user’s question.
• Resolve:
– Two questions that have the same meaning may use very different wording
– Similarity measures developed for document retrieval work poorly when there is little word overlap.

Approaches Toward The Word Mismatch Problem
1. Use knowledge databases such as machine-readable dictionaries (as required in the first paper)
– Current quality and structure are insufficient
2. Employ manual rules and templates
– Expensive and hard to scale to large collections
3. Use statistical techniques from IR and natural language processing
– Most promising, given enough training data

Problems with the Statistical Approach
• Need: a large number of semantically similar but lexically different sentences or question pairs
– No such collection exists on a large scale
• Researchers artificially generate collections through methods like translation and subsequent reverse translation
• The paper proposes an automatic way of building collections of semantically similar questions from existing Q&A collections

Question & Answer Archives
• Naver – leading portal site in S. Korea. Example entry:
– Question Title: How to make multi-booting systems?
– Question Body: I am using Windows98. I’d like to multi-boot with Windows XP. How can I do this?
– Answer: You must partition your hard disk, then install windows98 first. If there is no problem with windows98, then, install windows XP on…
• Avg length of question title field = 5.8 words
• Avg question body = 49 words
• Avg answer = 179 words
• Made 2 test collections from the archive
– A: 6.8M QA pairs across all categories
– B: 68k QA pairs from the “Computer Novice” category
• Need: sets of topics with relevance judgments
– 2 sets of 50 QA pairs, randomly selected
• First set from Collection A, chosen across all categories
• Second set from Collection B, chosen from the “Computer
Novice” category
• Each pair converted to a topic
– Question title: short query
– Question body: long query
– Answer: supplemental query, used only in the relevance judgment procedure

Find Relevant QA Pairs
• Given a topic, employ the TREC pooling technique
• 18 different retrieval results generated by varying the retrieval algorithm, query type and search field
• Retrieval models used include Okapi BM25, query likelihood and overlap coefficient
• Pooled the top 20 QA pairs from each result, then made manual relevance judgments
– A QA pair is considered relevant as long as it is semantically identical or very similar to the query
– If no relevant QA pairs are found for a given topic, manually browse the collection to find ≥1 QA pair
• Result = 785 relevant QA pairs for A, 1557 for B

Verifying Field Importance
• Previous research: similarity between questions is more important than similarity between questions and answers in the FAQ retrieval task
• Exp. 1: Search only the question title field
• Exp. 2: Search only the question body field
• Exp. 3: Search only the answer field
• For all experiments, use the query likelihood model with Dirichlet smoothing and Okapi BM25
• Regardless of retrieval model, the best performance comes from searching the question title field. The performance gaps to the other fields are significant.

Collecting Semantically Similar Questions
• Many people don’t search to see whether a question has already been asked, so they ask a semantically similar question.
• Assume: if two answers are similar, then the corresponding questions are semantically similar but lexically different.
• Sample semantically similar questions with little word overlap:
– I’d like to insert music into Powerpoint. / How can I link sounds in Powerpoint?
– How can I shut down my system in Dos-mode. / How to turn off computers in Dos-mode
– Photo transfer from cell phones to computers. / How to move photos taken by cell phones.

Algorithm
• Consider 4 popular document similarity measures:
1. Cosine similarity with vector space model
2. Negative KL divergence between language models
3. Output score of query likelihood model
4.
Score of the Okapi model

Finding a Similarity Measure: The Cosine Similarity Model
• Lengths of answers vary considerably
– Some very short (factoids)
– Others very long (copied and pasted from the web)
• Any similarity measure affected by length is not appropriate

Finding a Similarity Measure: Negative KL Divergence & Okapi
• Values are not symmetric and are not probabilities
– A pair of answers with a higher negative KL divergence than another pair does not necessarily have stronger semantic connections
• Hard to rank pairs
• The Okapi model has similar problems

Finding a Similarity Measure: Query Likelihood Model
• Score is a probability.
• Can be used across different answer pairs
• Scores are NOT symmetric

Overcoming Problems
• Using ranks instead of scores was more effective
– If answer A retrieves answer B at rank r1 and answer B retrieves answer A at rank r2, then the similarity between the 2 answers is the reverse harmonic mean of the two ranks: $sim(A, B) = \frac{1}{2}\left(\frac{1}{r_1} + \frac{1}{r_2}\right)$
– Use the query likelihood model to calculate the initial ranks

Experiments & Results
• 68,000*67,999/2 answer pairs possible from the 68,000 Q&A pairs in Collection B
• All ranked using the established measure
• Empirically set threshold of 0.005
– Used to judge whether a pair is related or not
– Higher threshold = smaller but better-quality collection
– To acquire enough training samples, the threshold cannot be too high
• 331,965 pairs have a score above the threshold

Word Translation Probabilities
• The question pair collection is treated as a parallel corpus
• IBM model 1
– Does not require any linguistic knowledge of the source/target language; treats every word alignment equally
– Translation probability from source word s to target word t: $P(t \mid s) = \lambda_s^{-1} \sum_{i=1}^{N} c(t \mid s; J^i)$
– λs = normalization factor, so that the probabilities sum to 1
– N = number of training samples
– J^i = ith pair in the training set

Word Translation Probabilities (cont)
• $c(t \mid s; J^i) = \frac{P(t \mid s)}{P(t \mid s_1) + \dots + P(t \mid s_n)} \, \#(t, J^i) \, \#(s, J^i)$
• {s1,…,sn} = words in the source sentence in J^i
• #(t, J^i) = number of times t occurs in J^i
• Computing the counts still requires the old translation probabilities
• Initialize the translation probabilities with random values, then estimate new translation probabilities
– Repeat until the probabilities converge
– Procedure always converges to the same final solution [1]

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19(2):263-311, 1993.

Experiments & Results (Word Translation)
• Removed stop words
• Collection of 331,965 question pairs duplicated by switching the source and target parts, then used as input
• Usually, the most similar word to a given word is the word itself
• Found semantic relationships, e.g. “bmp” found to be similar to “jpg” and “gif”

Question Retrieval
• How do we get from word translation probabilities to retrieving question titles?
• Similarity between query and document: $sim(Q, D) = P(Q \mid D) = \prod_{w \in Q} P(w \mid D)$
• To avoid zero probabilities and estimate more accurate language models, smooth with the collection: $P(w \mid D) = (1 - \lambda) P_{ml}(w \mid D) + \lambda P_{ml}(w \mid C)$
• P_ml(w | C) and P_ml(w | D) = maximum-likelihood probability that term w is generated from collection C or document D
• In the translation model, this becomes: $P(w \mid D) = (1 - \lambda) \sum_{t \in D} \left( T(w \mid t) \, P_{ml}(t \mid D) \right) + \lambda P_{ml}(w \mid C)$

Experiments & Results (Question Retrieval)
• 50 short queries from Collection B, searching only the title field
• Similarities between
query Q and question titles are calculated
• Compare the model’s performance with the vector space model with cosine similarity, Okapi BM25 and the query likelihood language model

Experiments & Results cont… (Question Retrieval)
• Approach outperforms the other baseline models at all recall levels
• QL and Okapi show comparable performance
• In all evaluations, the approach outperforms the other models

Model              Cosine  LM     Okapi  Trans
MAP                0.183   0.258  0.251  0.314
R-Precision @ 5    0.368   0.492  0.476  0.520
R-Precision @ 10   0.310   0.456  0.436  0.480

Conclusions and Seminal Paper Relevance
• A retrieval model based on translation probabilities learned from the archive significantly outperforms other approaches in finding semantically similar questions despite lexical mismatch
• Using translation probabilities and determining the similarity of answers is a much more robust approach for resolving similar QA pairs, with fewer prerequisites on the corpus

References
• Burke, R. D., Hammond, K. J., Kulyukin, V. A., Lytinen, S. L., Tomuro, N., & Schoenberg, S. (1997). Question answering from frequently asked question files: Experience with the FAQ Finder system (Tech. Rep.). Chicago, IL, USA.
• Jeon, J., Croft, W. B., & Lee, J. H. (2005). Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM ’05). ACM, New York, NY, USA, 84-90.
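
Backup: the translation-based retrieval model from the Question Retrieval slides can be illustrated with a small Python sketch. It is not the authors’ code. The function name, the dictionary layout of the translation table T (keyed by hypothetical (w, t) pairs), the toy probabilities and the smoothing weight lam = 0.5 are all assumptions chosen for illustration, and the collection is assumed to be non-empty.

```python
import math
from collections import Counter


def translation_lm_score(query, doc, collection, T, lam=0.5):
    """Return log P(Q|D), where
    P(w|D) = (1 - lam) * sum_{t in D} T(w|t) * P_ml(t|D) + lam * P_ml(w|C).

    query, doc, collection: lists of tokens (collection must be non-empty).
    T: dict mapping (w, t) -> T(w|t); missing entries count as 0 (assumption).
    lam: smoothing weight; 0.5 is a hypothetical value, not a tuned one.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    log_p = 0.0
    for w in query:
        # Probability that some document term t translates into query term w.
        translated = sum(T.get((w, t), 0.0) * n / len(doc)
                         for t, n in doc_counts.items())
        # Collection language model used for smoothing.
        background = coll_counts[w] / len(collection)
        p = (1 - lam) * translated + lam * background
        if p == 0.0:
            return float("-inf")  # query term unseen everywhere
        log_p += math.log(p)
    return log_p


# Toy usage: "jpg" never occurs in doc1, but T links it to "bmp".
T = {("jpg", "jpg"): 0.6, ("jpg", "bmp"): 0.3}
doc1 = ["convert", "bmp"]
doc2 = ["convert", "mp3"]
collection = doc1 + doc2 + ["jpg"]
score1 = translation_lm_score(["jpg"], doc1, collection, T)
score2 = translation_lm_score(["jpg"], doc2, collection, T)
```

In this toy setup doc1 scores higher than doc2 for the query “jpg” even though neither title contains the word, which is the lexical-mismatch effect the translation model is meant to capture.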