Question Answering from
Frequently Asked Question Files
Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro, and Scott Schoenberg
AI Magazine, Summer 1997
Finding Similar Questions in Large
Question and Answer Archives
Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee
ACM CIKM ‘05
Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
December 13, 2011
What is FAQ Finder?
• Matches answers to questions already asked in a
site’s FAQ file
• 4 Assumptions
1. Information in QA Format
2. All information needed to determine the relevance of a QA pair can be found in the QA pair
3. Q half of QA pair most relevant for matching to
user’s question
4. Broad, shallow knowledge of language is sufficient
for question matching
How Does It Work?
• Uses the SMART IR system to narrow the search to relevant FAQ files
• Iterates through QA pairs in FAQ file,
comparing against user’s question and
computing a score using 3 metrics
– Statistical term-vector similarity score t
– Semantic similarity score s
– Coverage score c
• Combined match score: m = (tT + sS + cC) / (T + S + C)
• T, S, and C are constant weights that adjust the system's reliance on each metric (sketch below)
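A minimal sketch of this weighted combination; the weight values here are illustrative assumptions, not the paper's settings:

def combined_score(t, s, c, T=1.0, S=1.0, C=1.0):
    """Combine term-vector (t), semantic (s), and coverage (c) scores,
    weighted by the constants T, S, C and normalized by their sum.
    Equal weights are an illustrative default, not the paper's values."""
    return (t * T + s * S + c * C) / (T + S + C)

# With equal weights the combined score is the plain average of the metrics
print(combined_score(0.6, 0.4, 0.5))  # -> 0.5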
Calculating Similarity
• QA pair represented as a term vector with significance values for each term in the pair
• Significance value = tfidf
• n (term freq) = # of times the term appears in the QA pair
• m = # of QA pairs in the file in which the term appears
• M = total # of QA pairs in the file
• tfidf = n × log(M / m)
• Evaluates the relative rarity of a term across QA pairs
  – Used as a factor to weight the frequency of the term in a given pair (sketch below)
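A minimal tf-idf sketch using the definitions above; QA pairs are modeled as plain token lists and the toy data is illustrative:

import math
from collections import Counter

def tfidf(term, qa_pair, all_qa_pairs):
    """tfidf = n * log(M / m): n = term frequency in this QA pair,
    M = number of QA pairs in the file, m = QA pairs containing the term."""
    n = Counter(qa_pair)[term]
    m = sum(1 for pair in all_qa_pairs if term in pair)
    M = len(all_qa_pairs)
    return n * math.log(M / m) if m else 0.0

pairs = [["reboot", "system", "restart"],
         ["install", "windows", "system"],
         ["futon", "plans", "woodworking"]]
print(tfidf("reboot", pairs[0], pairs))  # rare term -> higher weight
print(tfidf("system", pairs[0], pairs))  # common term -> lower weight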
Nuances
• Many ways to express the same question
  – In large documents, synonymous terms tend to co-occur, so such variations have little effect on term-vector matching
• However, FAQ Finder matches on a small # of terms, so the system needs a means of matching synonyms
– How do I reboot my system?
– What do I do when my computer crashes?
– Causal relationship resolved with WordNet
WordNet
• Semantic network of English words
• Provides relations between words and synonym sets, and between the synonym sets themselves
• FAQ Finder utilizes it through a marker-passing algorithm
  – Compares each word in the user's question to each word in the FAQ file question (rough stand-in sketched below)
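The marker-passing details are not reproduced here; as a rough stand-in, a sketch using NLTK's WordNet interface and path similarity to score word pairs (assumes the WordNet corpus has been downloaded):

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def word_similarity(w1, w2):
    """Best path similarity over all synset pairs of the two words --
    a crude proxy for their closeness in the WordNet graph,
    not FAQ Finder's marker-passing score."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

print(word_similarity("reboot", "crash"))
print(word_similarity("computer", "machine"))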
WordNet (cont…)
• Not a single semantic network; different subnetworks exist for nouns, verbs, etc.
• Syntactically ambiguous words (e.g., "run") appear in more than one network
• Simply relying on the default word sense worked as well as more sophisticated techniques
Evaluating Performance
• Corpus from log file of system’s use
– May-Dec 1996.
• 241 questions used
• Manually scanned: found answers for 138 questions; 103 questions were unanswered
• Assumes there is a single correct QA pair per question
• Because this task differs from the conventional IR problem, recall and precision have to be redefined
Why Redefine Recall & Precision?
• RECALL – typically the % of relevant docs in the collection that are retrieved for a query
• PRECISION – typically the % of retrieved docs that are relevant
• Here there is only one correct doc, so the two measures are not independent
• e.g., a query returns 5 QA pairs
  – FAQ Finder gets either 100% recall and 20% precision
  OR
  – 0% recall and 0% precision
  – If no answer exists, precision = 0% and recall is undefined
Redefining Recall & Precision
• Recall_new = % of questions for which FAQ Finder returns the correct answer when one exists
  – Does not penalize the system when >1 correct answer exists (as the original recall would)
• Instead of precision, calculate rejection
• Rejection = % of unanswerable questions FAQ Finder correctly reports as having no answer
  – Adjusted by setting a cutoff point for the minimum allowable match score
• There is still a tradeoff between rejection and recall (sketch below)
  – Rejection threshold too high: some correct answers eliminated
  – Rejection threshold too low: incorrect answers given to the user when no answer exists
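A small sketch of the redefined measures over a labeled test set; the tuple layout is an assumption for illustration:

def recall_and_rejection(results):
    """results: list of (has_answer, answered_correctly, rejected) per question.
    Recall = fraction of answerable questions answered correctly;
    rejection = fraction of unanswerable questions correctly rejected."""
    answerable = [r for r in results if r[0]]
    unanswerable = [r for r in results if not r[0]]
    recall = sum(1 for r in answerable if r[1]) / len(answerable)
    rejection = sum(1 for r in unanswerable if r[2]) / len(unanswerable)
    return recall, rejection

# toy set: 2 answerable (1 answered correctly), 2 unanswerable (1 rejected)
print(recall_and_rejection([(True, True, False), (True, False, False),
                            (False, False, True), (False, False, False)]))
# -> (0.5, 0.5)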
Results
• Correct file appears within the top 5 files returned 88% of the time, and in first position 48% of the time
• Equates to 88% recall, 23% precision
• System confidently returns garbage when there is no correct answer in the file
Ablation Study
• Evaluation of the different components in the matching scheme by disabling them:
  1. QA pairs selected randomly from the FAQ file
  2. Coverage score used by itself
  3. Semantic scores from WordNet used by themselves
  4. Term vector comparison used in isolation
Conditions’ Contributions
• WordNet and the statistical technique each contribute strongly
• Their combination yields results that are better than either individually
Where FAQ Finder Fails
• Biggest culprit in failed matches is undue weight given to semantically useless words
  – "Where can I find woodworking plans for a futon?"
  – "woodworking" is weighted as strongly as "futon"
  – "futon" should be much more important inside the woodworking FAQ than "woodworking", which applies to everything in that file
• Other problem: violation of assumptions
about FAQ files
Conclusion
• When there is an existing collection of questions & answers, answering new questions can be reduced to matching them against the QA pairs
• The power of the approach comes from FAQ Finder's use of highly organized knowledge sources that are designed to answer commonly asked questions
Citing Paper’s Objectives
• Find questions in archive semantically similar
to user’s question.
• Resolve:
– Two questions that have the same meaning use
very different wording
– Similarity measures developed for document
retrieval work poorly when there is little word
overlap.
Approaches Toward The Word
Mismatch Problem
1. Use knowledge databases as machine-readable dictionaries (as in the first paper)
   – Current quality and structure are insufficient
2. Employ manual rules and templates
   – Expensive and hard to scale to large collections
3. Use statistical techniques from IR and natural language processing
   – Most promising, given enough training data
Problems with the Statistical
Approach
• Need: a large # of semantically similar but lexically different sentence or question pairs
  – No such collection exists at large scale
• Researchers artificially generate collections through methods like translation and subsequent reverse translation
• This paper proposes an automatic way of building collections of semantically similar questions from existing Q&A archives
Question & Answer Archives
• Naver – leading portal site in S. Korea. Example QA pair:
  – Question Title: How to make multi-booting systems?
  – Question Body: I am using Windows98. I'd like to multi-boot with Windows XP. How can I do this?
  – Answer: You must partition your hard disk, then install Windows98 first. If there is no problem with Windows98, then install Windows XP on…
• Avg length of question title field = 5.8 words
• Avg question body = 49 words
• Avg answer = 179 words
• Made 2 test collections from the archive
  – A: 6.8M QA pairs across all categories
  – B: 68K QA pairs from the "Computer Novice" category
• Need: sets of topics with relevance judgments
  – 2 sets of 50 QA pairs randomly selected
    • First set from Collection A, chosen across all categories
    • Second set from Collection B, chosen from the "Computer Novice" category
• Each pair converted to a topic
  – QTITLE → short query
  – QBODY → long query
  – Answer → supplemental query (used only in the relevance judgment procedure)
Find Relevant QA Pairs
• Given a topic, employ the TREC pooling technique
• 18 different retrieval results generated by varying the retrieval algorithm, query type & search field
• Retrieval models such as Okapi BM25, query likelihood, and the overlap coefficient used
• Pooled the top 20 QA pairs from each run, then did manual relevance judgments
  – As long as it is semantically identical or very similar to the query, a QA pair is considered relevant
  – If no relevant QA pairs are found for a given topic, manually browse the collection to find ≥1 QA pair
• Result = 785 relevant QA pairs for A, 1,557 for B
Verifying Field Importance
• Prev. research: similarity between questions is more important than similarity between questions & answers in the FAQ retrieval task
• Exp. 1: Search only the QTitle field
• Exp. 2: Only QBody
• Exp. 3: Only Answer
• For all experiments, use the query likelihood model with Dirichlet smoothing and Okapi BM25 (Dirichlet-smoothed scorer sketched below)
• Regardless of retrieval model, the best performance comes from searching the question title field; the performance gaps for the other fields are significant
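For reference, a minimal sketch of the Dirichlet-smoothed query likelihood scorer named above; mu and the toy data are illustrative assumptions:

import math
from collections import Counter

def ql_dirichlet(query, doc, collection, mu=1000.0):
    """log P(Q|D) with Dirichlet smoothing:
    P(w|D) = (c(w; D) + mu * P(w|C)) / (|D| + mu).
    mu is an illustrative smoothing value, not the paper's setting."""
    d, c = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_c = c[w] / len(collection)
        p_w = (d[w] + mu * p_c) / (len(doc) + mu)
        if p_w > 0:                      # skip terms unseen everywhere
            score += math.log(p_w)
    return score

doc = ["how", "to", "make", "multi", "booting", "systems"]
collection = doc + ["install", "windows", "partition", "disk"]
print(ql_dirichlet(["multi", "booting"], doc, collection))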
Collecting Semantically
Similar Questions
• Many people don’t search to see if a question has already been asked; instead they ask a semantically similar question
• Assume: if two answers are similar, then the corresponding questions are semantically similar but lexically different
Sample semantically similar questions with little word overlap:
  – I’d like to insert music into Powerpoint. / How can I link sounds in Powerpoint?
  – How can I shut down my system in Dos-mode? / How to turn off computers in Dos-mode?
  – Photo transfer from cell phones to computers. / How to move photos taken by cell phones.
Algorithm
• Consider 4 popular document similarity
measures:
1. Cosine similarity with vector space model
2. Negative KL divergence between language
models
3. Output score of query likelihood model
4. Score of Okapi model
Finding a Similarity Measure:
The Cosine Similarity Model
• Lengths of answers vary considerably
  – Some very short (factoids)
  – Others very long (copied & pasted from the web)
• Any similarity measure affected by length is not appropriate
Finding a Similarity Measure:
Negative KL Divergence & Okapi
• Values are not symmetric and are not probabilities
  – A pair of answers with a higher negative KL divergence than another pair does not necessarily have a stronger semantic connection
• Hard to rank pairs
• Okapi Model has Similar Problems
Finding a Similarity Measure:
Query Likelihood Model
• Score is a probability
• Can be used across different answer pairs
• Scores are NOT symmetric
Overcoming Problems
• Using ranks instead of scores was more
effective
– If answer A retrieves answer B @ rank r1 and
answer B retrieves answer A @ rank r2 then
similarity between 2 answers = reverse harmonic
mean of two ranks:
1 1 1 
sim ( A, B)    
2  r1 r 2 
– Use query likelihood model to calc init. ranks
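A minimal sketch of the rank-based similarity, assuming the two ranks have already been obtained with the query likelihood model:

def answer_similarity(r1, r2):
    """Reverse harmonic mean of the two retrieval ranks:
    sim(A, B) = 1/2 * (1/r1 + 1/r2)."""
    return 0.5 * (1.0 / r1 + 1.0 / r2)

# A retrieves B at rank 2; B retrieves A at rank 4
print(answer_similarity(2, 4))  # -> 0.375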
Experiments & Results
• 68,000 × 67,999 / 2 possible answer pairs from the 68,000 Q&A pairs in Collection B
• All pairs ranked using the similarity measure above
• Empirically set threshold = 0.005
  – Used to judge whether a pair is related or not
  – Higher threshold = smaller but better-quality collections
  – To acquire enough training samples, the threshold cannot be too high
• 331,965 pairs have a score above the threshold (see sketch below)
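A sketch of how the similar-question collection could be assembled from these scores; rank_of stands in for a query likelihood retrieval run and is not a real API:

def build_question_pairs(questions, answers, rank_of, threshold=0.005):
    """Keep question pairs whose answers retrieve each other with a
    reverse-harmonic-mean rank score above the empirical threshold.
    rank_of(query_answer, target_answer) is a placeholder for the
    rank assigned by a query likelihood retrieval run."""
    kept = []
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            r1 = rank_of(answers[i], answers[j])
            r2 = rank_of(answers[j], answers[i])
            if 0.5 * (1.0 / r1 + 1.0 / r2) > threshold:
                kept.append((questions[i], questions[j]))
    return kept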
Word Translation Probabilities
• Question pair collection is treated as a parallel corpus
• IBM Model 1
  – Does not require any linguistic knowledge of the source/target language; treats every word alignment equally
  – Translation probability from source word s to target word t:
    P(t|s) = λ_s Σ_{i=1..N} c(t|s; J_i)
  – λ_s = normalization factor, so the probabilities sum to 1
  – N = # of training samples
  – J_i = ith pair in the training set
Word Translation Probabilities (cont)
c(t|s; J_i) = [ P(t|s) / (P(t|s_1) + … + P(t|s_n)) ] × #(t, J_i) × #(s, J_i)

• {s_1, …, s_n} = words in the source sentence of J_i
• #(t, J_i) = number of times t occurs in J_i (and likewise #(s, J_i) for s)
• Still needed: the old translation probabilities P(t|s)
• Initialize the translation probabilities with random values, then estimate new ones with the formula above (EM sketch below)
  – Repeat until the probabilities converge
  – The procedure always converges to the same final solution [1]
[1] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263-311, 1993.
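A compact sketch of the IBM Model 1 EM loop on toy, tokenized question pairs; uniform rather than random initialization is used here for brevity:

from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities P(t|s) from (source, target)
    token-list pairs with the IBM Model 1 EM procedure."""
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # uniform init for simplicity (the paper starts from random values)
    p = defaultdict(lambda: 1.0 / len(tgt_vocab))   # P(t|s), keyed by (t, s)
    for _ in range(iterations):
        count = defaultdict(float)                  # expected counts c(t|s)
        total = defaultdict(float)                  # per-source normalizers
        for src, tgt in pairs:
            for t in tgt:
                denom = sum(p[(t, s)] for s in src)
                for s in src:
                    delta = p[(t, s)] / denom
                    count[(t, s)] += delta
                    total[s] += delta
        p = defaultdict(float,
                        {ts: count[ts] / total[ts[1]] for ts in count})
    return p

# toy question pairs (source tokens, target tokens)
pairs = [(["shut", "down", "dos"], ["turn", "off", "dos"]),
         (["insert", "music", "powerpoint"], ["link", "sounds", "powerpoint"])]
probs = ibm_model1(pairs)
print(probs[("turn", "shut")])   # nonzero: "turn" co-occurs with "shut"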
Experiments & Results
(Word Translation)
• Removed stop words
• The collection of 331,965 question pairs was duplicated by swapping source and target, then used as input
• Usually, the most similar word to a given word is the word itself
• Found semantic relationships: e.g., "bmp" is similar to "jpg" and "gif"
Question Retrieval
• How do we retrieve question titles using the word translation probabilities?
• Similarity between query Q and document D (query likelihood):
  sim(Q, D) = P(Q|D) = Π_{w∈Q} P(w|D)
• To avoid zero probabilities and estimate more accurate language models, smooth the document model with the collection C:
  P(w|D) = (1 − λ) P_ml(w|D) + λ P_ml(w|C)
• In the translation-based model this becomes (sketch below):
  P(w|D) = (1 − λ) Σ_{t∈D} T(w|t) P_ml(t|D) + λ P_ml(w|C)
Experiments & Results
(Question Retrieval)
• 50 short queries from Collection B, searching only the title field
• Similarities between the query Q and question titles calculated
• Compare the model's performance with the vector space model with cosine similarity, Okapi BM25, and the query likelihood language model
Experiments & Results cont…
(Question Retrieval)
• Approach outperforms the other baseline models at all recall levels
• QL and Okapi show comparable performance
• In all evaluations, the approach outperforms the other models

Model    MAP     R-Precision @ 5   R-Precision @ 10
Cosine   0.183   0.368             0.310
LM       0.258   0.492             0.456
Okapi    0.251   0.476             0.436
Trans    0.314   0.520             0.480
Conclusions and Seminal Paper
Relevance
• A retrieval model based on translation probabilities learned from the archive significantly outperforms other approaches at finding semantically similar questions despite lexical mismatch
• Using translation probabilities and determining question similarity from answers is a more robust approach to resolving similar QA pairs, with fewer prerequisites on the corpus
References
• Burke, R. D., Hammond, K. J., Kulyukin, V. A., Lytinen, S. L., Tomuro, N., & Schoenberg, S. (1997). Question answering from frequently asked question files: Experience with the FAQ Finder system (Tech. Rep.). Chicago, IL, USA.
• Jeon, J., Croft, W. B., & Lee, J. H. (2005). Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM '05), 84-90. New York, NY, USA: ACM.