Modeling and Solving Term Mismatch for Full-Text Retrieval Dissertation Presentation Le Zhao

Modeling and Solving
Term Mismatch for Full-Text Retrieval
Dissertation Presentation
Le Zhao
Committee:
Jamie Callan (Chair)
Jaime Carbonell
Yiming Yang
Bruce Croft (UMass)
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
July 26, 2012
1
What is Full-Text Retrieval
• The task
[Diagram: User → Query → Retrieval Engine (over the Document Collection) → Results → User]
• The Cranfield evaluation [Cleverdon 1960]
– abstracts away the user,
– allows objective & automatic evaluations
2
Where are We (Going)?
• Current retrieval models
– formal models from 1970s, best ones 1990s
– based on simple collection statistics (tf.idf),
no deep understanding of natural language texts
• Perfect retrieval
– Query: “information retrieval”, A: “… text search …” (the answer implies the query)
– Textual entailment (difficult natural language task)
– Searcher frustration [Feild, Allan and Jones 2010]
– Still far away; what has been holding us back?
3
Two Long Standing Problems in Retrieval
• Term mismatch
– [Furnas, Landauer, Gomez and Dumais 1987]
– No clear definition in retrieval
• Relevance (query dependent term importance – P(t | R))
– Traditionally, idf (rareness)
– P(t | R) [Robertson and Spärck Jones 1976; Greiff 1998]
– Few clues about estimation
• This work
– connects the two problems,
– shows that solving them can yield huge gains in retrieval,
– and uses a predictive approach toward solving both problems.
4
What is Term Mismatch & Why Care?
• Job search
– You look for information retrieval jobs on the market. They want text search skills.
– costs you job opportunities (50% even if you are careful)
• Legal discovery
– You look for bribery or foul play in corporate documents. They say grease, pay off.
– costs you cases
• Patent/Publication search
– costs businesses
• Medical record retrieval
– costs lives
5
Prior Approaches
• Document:
– Full text indexing
• Instead of only indexing key words
– Stemming
• Include morphological variants
– Document expansion
• Inlink anchor, user tags
• Query:
– Query expansion, reformulation
• Both:
– Latent Semantic Indexing
– Translation based models
6
Main Questions Answered
• Definition
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution
7
Definition
Importance
Prediction
Solution
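Definition of Mismatch P(t̄ | Rq)
[Venn diagram over the Collection: the set of all relevant jobs, Relevant(q), overlaps the set of documents that contain t (“retrieval”); the relevant jobs outside the overlap are the jobs mismatched.]
mismatch P(t̄ | Rq) == 1 – term recall P(t | Rq)
Directly calculated given relevance judgments for q:
$$P(\bar{t} \mid R_q) = \frac{|\{d : t \notin d \;\wedge\; d \in R_q\}|}{|R_q|}$$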
[CIKM 2010]
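The definition above can be computed directly once relevance judgments are available. The following is a minimal sketch, assuming each relevant document is represented as a set of (already stemmed) terms; the function names and the toy job postings are hypothetical and only illustrate the arithmetic.

```python
def term_recall(term, relevant_docs):
    """P(t | R_q): fraction of the relevant documents that contain the term."""
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    matched = sum(1 for doc in relevant_docs if term in doc)
    return matched / len(relevant_docs)


def term_mismatch(term, relevant_docs):
    """P(t-bar | R_q) = 1 - P(t | R_q): fraction of relevant documents missing the term."""
    return 1.0 - term_recall(term, relevant_docs)


# Toy relevant set for the query "information retrieval" (invented data).
relevant_docs = [
    {"text", "search", "engineer"},
    {"information", "retrieval", "scientist"},
    {"search", "ranking", "developer"},
    {"retrieval", "models", "researcher"},
]

recall = term_recall("retrieval", relevant_docs)
mismatch = term_mismatch("retrieval", relevant_docs)
print(recall, mismatch)
```

In this toy set, half of the relevant postings never mention “retrieval”, so a query that requires the term would miss them.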
8
Definition
Importance
Prediction
Solution
How Often do Terms Match?
Query (containing the term) | P(t | R)
Oil Spills | 0.9914
Term limitations for US Congress members | 0.9831
Insurance Coverage which pays for Long Term Care | 0.6885
School Choice Voucher System and its effects on the US educational program | 0.2821
Vitamin the cure or cause of human ailments | 0.1071
(Example TREC-3 topics)
9
Main Questions
• Definition
• P(t̄ | R) or P(t | R), simple,
• estimated from relevant documents,
• analyze mismatch
• Significance (theory & practice)
• Mechanism (what causes the problem)
• Model and solution
10
Definition
Importance: Theory
Prediction
Solution
Term Mismatch &
Probabilistic Retrieval Models
Binary Independence Model
– [Robertson and Spärck Jones 1976]
– Optimal ranking score for each document d, with a term recall part and an idf (rareness) part
– Term weight for Okapi BM25
– Other advanced models behave similarly
– Used as effective features in Web search engines
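For reference, the Binary Independence Model score is commonly written in the following textbook form. The notation is an assumption of this sketch rather than a copy of the slide: $\bar{R}$ denotes the non-relevant documents, and the approximation $P(t \mid \bar{R}) \approx df_t / N$ is the usual one when almost all documents are non-relevant.

```latex
\mathrm{score}(d, q) \;=\; \sum_{t \,\in\, q \cap d}
  \left[
    \underbrace{\log \frac{P(t \mid R)}{1 - P(t \mid R)}}_{\text{term recall part}}
    \;+\;
    \underbrace{\log \frac{1 - P(t \mid \bar{R})}{P(t \mid \bar{R})}}_{\text{idf (rareness) part}}
  \right],
\qquad
\log \frac{1 - P(t \mid \bar{R})}{P(t \mid \bar{R})} \;\approx\; \log \frac{N - df_t}{df_t}
```

Under this reading, the second term can be computed from collection statistics alone, while the first term depends on P(t | R) and therefore on the query and its relevant set.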
11
Definition
Importance: Theory
Prediction
Solution
Term Mismatch &
Probabilistic Retrieval Models
Binary Independence Model
– [Robertson and Spärck Jones 1976]
– Optimal ranking score for each document d, with a term recall part and an idf (rareness) part
– “Relevance Weight”, “Term Relevance”
• P(t | R): the only part about the query & relevance
12
Main Questions
• Definition
• Significance
• Theory (as idf & only part about relevance)
• Practice?
• Mechanism (what causes the problem)
• Model and solution
13
Definition
Importance: Practice: Mechanism
Prediction
Solution
Without Term Recall
• The emphasis problem for tf.idf retrieval models
– Emphasize high idf (rare) terms in query
• “prognosis/viability of a political third party in U.S.” (Topic 206)
15
Definition
Importance: Practice: Mechanism
Prediction
Solution
Ground Truth (Term Recall)
Query: prognosis/viability of a political third party
Term | party | political | third | viability | prognosis
True P(t | R) (Emphasis) | 0.9796 | 0.7143 | 0.5918 | 0.0408 | 0.0204
idf (Wrong Emphasis) | 2.402 | 2.513 | 2.187 | 5.017 | 7.471
16
Definition
Importance: Practice: Mechanism
Prediction
Solution
Top Results (Language model)
Query: prognosis/viability of a political third party
1. … discouraging prognosis for 1991 …
2. … Politics … party … Robertson's viability as a candidate …
3. … political parties …
4. … there is no viable opposition …
5. … A third of the votes …
6. … politics … party … two thirds …
7. … third ranking political movement…
8. … political parties …
9. … prognosis for the Sunday school …
10. … third party provider …
All are false positives. This is an emphasis / mismatch problem, not a precision problem.
(Large search engines do better, but still have false positives in the top 10. Emphasis / mismatch is also a problem for large search engines!)
17
Definition
Importance: Practice: Mechanism
Prediction
Solution
Without Term Recall
• The emphasis problem for tf.idf retrieval models
– Emphasize high idf (rare) terms in query
• “prognosis/viability of a political third party in U.S.” (Topic 206)
– False positives throughout rank list
• especially detrimental at top rank
– No term recall hurts precision at all recall levels
• How significant is the emphasis problem?
18
Definition
Importance: Practice: Mechanism
Prediction
Solution
Failure Analysis of 44 Topics from TREC 6-8
[Pie chart of failure causes]
• Emphasis 64% → Recall term weighting
• Mismatch 27% → Mismatch guided expansion
• Precision 9%
Basis: Term Mismatch Prediction
RIA workshop 2003 (7 top research IR systems, >56 expert*weeks)
Failure analyses of retrieval models & techniques still standard today
19
Main Questions
• Definition
• Significance
• Theory: as idf & only part about relevance
• Practice: explains common failures,
other behavior: Personalization, WSD, structured
• Mechanism (what causes the problem)
• Model and solution
20
Definition
Importance: Practice: Potential
Prediction
Solution
True Term Recall Effectiveness
• +100% over BIM (in precision at all recall levels)
– [Robertson and Spärck Jones 1976]
• +30-80% over Language Model, BM25 (in MAP)
– This work
• For a new query w/o relevance judgments,
– Need to predict
– Predictions don’t need to be very accurate
to show performance gain
22
Main Questions
• Definition
• Significance
• Theory: as idf & only part about relevance
• Practice: explains common failures, other behavior,
• +30 to 80% potential from term weighting
• Mechanism (what causes the problem)
• Model and solution
23
Definition
Importance
Prediction: Idea
Solution
How Often do Terms Match?
Query (containing the term) | P(t | R) | idf
Oil Spills | 0.9914 | 5.201
Term limitations for US Congress members | 0.9831 | 2.010
Insurance Coverage which pays for Long Term Care | 0.6885 | 2.010
School Choice Voucher System and its effects on the US educational program | 0.2821 | 1.647
Vitamin the cure or cause of human ailments | 0.1071 | 6.405
• Same term, different recall
• P(t | R) varies from 0 to 1
• P(t | R) differs from idf
(Examples from TREC 3 topics)
24
Definition
Importance
Prediction: Idea
Solution
Statistics
Term recall across all query terms (average ~55-60%)
[Two plots of term recall P(t | R) per query term, y-axis from 0 to 1]
– TREC 3 titles, 4.9 terms/query: average 55% term recall
– TREC 9 descriptions, 6.3 terms/query: average 59% term recall
25
Definition
Importance
Prediction: Idea
Solution
Statistics
Term recall on shorter queries (average ~70%)
[Two plots of term recall P(t | R) per query term, y-axis from 0 to 1]
– TREC 9 titles, 2.5 terms/query: average 70% term recall
– TREC 13 titles, 3.1 terms/query: average 66% term recall
26
Definition
Importance
Prediction: Idea
Solution
Statistics
Query dependent (but for many terms, variance is small)
Term Recall for Repeating Terms
364 recurring words from TREC 3-7, 350 topics
27
Definition
Importance
Prediction: Idea
Solution
P(t | R) vs. idf
[Two scatter plots with P(t | R) on the y-axis (0 to 1): P(t | R) vs. df/N (Greiff, 1998), and P(t | R) vs. idf for TREC 4 desc query terms]
28
Definition
Importance
Prediction: Idea
Solution
Prior Prediction Approaches
• Croft/Harper combination match (1979)
– treats P(t | R) as a tuned constant, or estimated from PRF
– when >0.5, rewards docs that match more query terms
• Greiff’s (1998) exploratory data analysis
– Used idf to predict overall term weighting
– Improved over basic BIM
• Metzler’s (2008) generalized idf
– Used idf to predict P(t | R)
– Improved over basic BIM
• Simple feature (idf), limited success
– Missing piece: P(t | R) = term recall = 1 – term mismatch
29
Definition
Importance
Prediction: Idea
Solution
What Factors can Cause Mismatch?
• Topic centrality (Is concept central to topic?)
– “Laser research related or potentially related to defense”
– “Welfare laws propounded as reforms”
• Synonyms (How often they replace original term?)
– “retrieval” == “search” == …
• Abstractness
– “Laser research … defense”, “Welfare laws”
– “Prognosis/viability” (rare & abstract)
30
Main Questions
• Definition
• Significance
• Mechanism
• Causes of mismatch: Unnecessary concepts,
replaced by synonyms or more specific terms
• Model and solution
31
Definition
Importance
Prediction: Implement Solution
Designing Features to Model the Factors
• We need to
– Identify synonyms/searchonyms of a query term
– in a query dependent way
• External resource? (WordNet, wiki, or query log)
– Biased (coverage problem, collection independent)
– Static (not query dependent)
– Not easy, not used here
• Term-term similarity in concept space!
– Local LSI (Latent Semantic Indexing)
[Diagram: Query → Retrieval Engine (over the Document Collection) → Results → Top (500) Results → Concept Space (150 dim)]
32
Definition
Importance
Prediction: Implement Solution
Synonyms from Local LSI
[Bar charts: similarity with the query term (y-axis, 0 to 0.5) for the top similar terms (x-axis), one chart per query]
– “Term limitation for US Congress members”: P(t | Rq) = 0.9831
– “Insurance Coverage which pays for Long Term Care”: P(t | Rq) = 0.6885
– “Vitamin the cure or cause of human ailments”: P(t | Rq) = 0.1071
33
Definition
Importance
Prediction: Implement Solution
Synonyms from Local LSI
[Bar charts: similarity with the query term (y-axis, 0 to 0.5) for the top similar terms (x-axis), one chart per query]
– “Term limitation for US Congress members”: P(t | Rq) = 0.9831
– “Insurance Coverage which pays for Long Term Care”: P(t | Rq) = 0.6885
– “Vitamin the cure or cause of human ailments”: P(t | Rq) = 0.1071
What the charts show:
(1) Magnitude of self similarity – Term centrality
(2) Avg similarity of supporting terms – Concept centrality
(3) How likely synonyms replace term t in collection
34
Definition
Importance
Prediction: Experiment Solution
Features that Model the Factors
(numbers are correlations with P(t | R))
• idf: –0.1339
• Term centrality: 0.3719
– Self-similarity (length of t) after dimension reduction
• Concept centrality: 0.3758
– Avg similarity of supporting terms (top synonyms)
• Replaceability: –0.1872
– How frequently synonyms appear in place of original query term in collection documents
• Abstractness: 0.1278
– Users modify abstract terms with concrete terms
– Examples: “effects on the US educational program”, “prognosis of a political third party”
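The two centrality features can be sketched once every term has a vector in the reduced concept space. The snippet below is only an illustration under stated assumptions: the vectors are invented toy values rather than real LSI output, similarity is taken to be the dot product, self-similarity is taken to be the vector length, and `centrality_features` is a hypothetical helper name.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def centrality_features(term, vectors, k=3):
    """Return (term centrality, concept centrality) for a query term.

    vectors: dict mapping each term to its concept-space vector (assumed given).
    Term centrality: length of the term's own vector after dimension reduction.
    Concept centrality: average similarity of the k most similar other terms.
    """
    t_vec = vectors[term]
    term_centrality = math.sqrt(dot(t_vec, t_vec))
    sims = sorted(
        (dot(t_vec, vec) for other, vec in vectors.items() if other != term),
        reverse=True,
    )
    top = sims[:k]
    concept_centrality = sum(top) / len(top) if top else 0.0
    return term_centrality, concept_centrality


# Invented 2-dimensional concept space, for illustration only.
vectors = {
    "oil": [0.6, 0.0],
    "spill": [0.5, 0.1],
    "crude": [0.2, 0.0],
    "dr": [0.0, 0.1],
}
term_c, concept_c = centrality_features("oil", vectors, k=2)
print(term_c, concept_c)
```

A term with a long vector and strongly similar supporting terms is treated as central to the topic, which the slides associate with higher term recall.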
35
Definition
Importance
Prediction: Implement Solution
Prediction Model
Regression modeling
– Model:
M: <f1, f2, .., f5> → P(t | R)
– Train on one set of queries (known relevance),
– Test on another set of queries (unknown relevance)
– RBF kernel Support Vector regression
36
Definition
Importance
Prediction
Solution
A General View of Retrieval Modeling as
Transfer Learning
• The traditional restricted view sees a retrieval model as
– a document classifier for a given query.
• More general view: A retrieval model really is
– a meta-classifier, responsible for many queries,
– mapping a query to a document classifier.
• Learning a retrieval model == transfer learning
– Using knowledge from related tasks (training queries)
to classify documents for a new task (test query)
– Our features and model facilitate the transfer.
– More general view → more principled investigations and more advanced techniques
37
Definition
Importance
Prediction: Experiment Solution
Experiments
• Term recall prediction error
– L1 loss (absolute prediction error)
• Term recall based term weighting retrieval
– Mean Average Precision (overall retrieval success)
– Precision at top 10 (precision at top of rank list)
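The three measures named above have standard definitions, sketched below. The helper names and the toy ranking are assumptions made for illustration; only the definitions themselves (mean absolute error, precision at a cutoff, and average precision, whose mean over queries is MAP) are standard.

```python
def l1_loss(predicted, truth):
    """Average absolute error between predicted and true term recall values."""
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data, invented for illustration.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranking, relevant)
p_at_2 = precision_at_k(ranking, relevant, k=2)
loss = l1_loss([0.5, 0.75], [1.0, 0.25])
print(ap, p_at_2, loss)
```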
38
Definition
Importance
Prediction: Experiment Solution
Term Recall Prediction Example
Query: prognosis/viability of a political third party. (Trained on TREC 3)
Term | party | political | third | viability | prognosis
True P(t | R) | 0.9796 | 0.7143 | 0.5918 | 0.0408 | 0.0204
Predicted (Emphasis) | 0.7585 | 0.6523 | 0.6236 | 0.3080 | 0.2869
39
Definition
Importance
Prediction: Experiment Solution
Term Recall Prediction Error
Average Absolute Error (L1 loss) on TREC 4
[Bar chart: L1 loss (y-axis, 0 to 0.35; the lower, the better) for five predictors: Average (constant), IDF only, All 5 features, Tuning metaparameters, TREC 3 recurring words]
40
Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
• Can be predicted;
Framework to design and evaluate features
41
Definition
Importance
Prediction
Solution: Weighting
Using 𝑃 (t | R) in Retrieval Models
• In BM25
– Binary Independence Model
• In Language Modeling (LM)
– Relevance Model [Lavrenko and Croft 2001]
Only term weighting, no expansion.
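One way to picture recall-based term weighting in the Binary Independence Model is to add a log-odds term for P(t | R) to an idf-like term. This is a hedged sketch rather than the exact weighting used in the experiments: `rsj_weight` is a hypothetical name, the 0.5 smoothing and the clipping constant `eps` are assumptions, and P(t | non-relevant) is approximated by df/N.

```python
import math


def rsj_weight(p_recall, df, num_docs, eps=1e-6):
    """BIM-style term weight: log-odds of term recall plus an idf-like part."""
    # Clip the probability so the log-odds stay finite at 0 or 1.
    p = min(max(p_recall, eps), 1.0 - eps)
    recall_part = math.log(p / (1.0 - p))
    # Smoothed idf-like part, assuming nearly all documents are non-relevant.
    idf_part = math.log((num_docs - df + 0.5) / (df + 0.5))
    return recall_part + idf_part


# Two terms with the same document frequency but different (hypothetical) recall.
high = rsj_weight(0.9, df=100, num_docs=10000)
low = rsj_weight(0.1, df=100, num_docs=10000)
print(high, low)
```

With equal document frequency, the term with the higher recall receives the larger weight, which is the shift in emphasis that idf alone cannot provide.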
42
Definition
Importance
Prediction
Solution: Weighting
Predicted Recall Weighting
10-25% gain (MAP)
[Bar chart: MAP (y-axis, 0 to 0.25) of Baseline LM desc vs. Recall (Necessity) LM desc, for each train -> test dataset pair: 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14]
“*”: significantly better by sign & randomization tests
43
Definition
Importance
Prediction
Solution: Weighting
Predicted Recall Weighting
10-20% gain (top Precision)
[Bar chart: Prec@10 and Prec@20 (y-axis, 0 to 0.6) of Baseline LM desc vs. Recall (Necessity) LM desc, for each train -> test dataset pair: 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14]
“*”: Prec@10 is significantly better.
“!”: Prec@20 is significantly better.
44
Definition
Importance
Prediction
Solution: Weighting
vs. Relevance Model
Relevance Model [Lavrenko and Croft 2001]
– Term occurrence in top docs
– Unsupervised
– RM weight (x) ~ Term recall (y)
• Pm(t1 | R) ~ P(t1 | R)
• Pm(t2 | R) ~ P(t2 | R)
– Predicted term recall is 5-10% better than the unsupervised weights
[Scatter plot: Relevance Model weights (normalized; x-axis, 0 to 1) vs. predicted term recall (y-axis, 0 to 1) for TREC 3, TREC 7 and TREC 13; diagram label: Query Likelihood]
45
Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
• Term weighting solves emphasis problem for long
queries
• Mismatch problem?
46
Definition
Importance
Prediction
Solution: Expansion
Recap: Term Mismatch
• Term mismatch ranges 30%-50% on average
• Relevance matching can degrade quickly for multi-word
queries
• Solution: Fix every query term
[SIGIR 2012]
48
Definition
Importance
Prediction
Solution: Expansion
Conjunctive Normal Form (CNF) Expansion
Example keyword query:
placement of cigarette signs on television watched by children
→ Manual CNF:
(placement OR place OR promotion OR logo OR sign OR signage OR merchandise)
AND (cigarette OR cigar OR tobacco)
AND (television OR TV OR cable OR network)
AND (watch OR view)
AND (children OR teen OR juvenile OR kid OR adolescent)
– Expressive & compact (1 CNF == 100s alternatives)
– Highly effective (this work: 50-300% over base keyword)
– Used by lawyers, librarians and other expert searchers
– But tedious & difficult to create; little research
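The compactness claim can be checked on the example CNF itself: every choice of one term per clause is a separate conjunctive keyword query. The sketch below represents the CNF as a list of clauses (a representation assumed here for illustration, with terms lowercased) and counts those alternatives; `matches` is a hypothetical helper that applies the Boolean semantics to a document given as a set of terms.

```python
from math import prod

# The manual CNF from the slide, one inner list per AND-ed clause.
cnf = [
    ["placement", "place", "promotion", "logo", "sign", "signage", "merchandise"],
    ["cigarette", "cigar", "tobacco"],
    ["television", "tv", "cable", "network"],
    ["watch", "view"],
    ["children", "teen", "juvenile", "kid", "adolescent"],
]


def num_alternatives(cnf):
    """Number of distinct conjunctive keyword queries the CNF stands for."""
    return prod(len(clause) for clause in cnf)


def matches(cnf, doc_terms):
    """A document matches if every clause has at least one term in the document."""
    return all(any(term in doc_terms for term in clause) for clause in cnf)


print(num_alternatives(cnf))
print(matches(cnf, {"tobacco", "logo", "tv", "view", "kid"}))
print(matches(cnf, {"tobacco", "tv"}))
```

The five clauses have 7, 3, 4, 2 and 5 terms, so this one CNF covers 840 keyword conjunctions.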
49
Definition
Importance
Prediction
Solution: Expansion
Diagnostic Intervention
Query: placement of cigarette signs on television watched by children
Diagnosis by low 𝑃(𝑡|𝑅) terms:
– Query: placement of cigarette signs on television watched by children
– Expansion (CNF):
(placement OR place OR promotion OR sign OR signage OR merchandise)
AND cigar AND television AND watch
AND (children OR teen OR juvenile OR kid OR adolescent)
Diagnosis by high idf (rare) terms:
– Query: placement of cigarette signs on television watched by children
– Expansion (CNF):
(placement OR place OR promotion OR sign OR signage OR merchandise)
AND cigar
AND (television OR tv OR cable OR network)
AND watch AND children
• Goal
– Least amount of user effort → near optimal performance
– E.g. expand 2 terms → 90% of total improvement
50
Definition
Importance
Prediction
Solution: Expansion
Diagnostic Intervention
Query: placement of cigarette signs on television watched by children
Diagnosis by low 𝑃(𝑡|𝑅) terms:
– Query: placement of cigarette signs on television watched by children
– Expansion (bag of word), with the original query weighted 0.9 and the expansion query weighted 0.1:
[ 0.9 (placement cigar television watch children)
0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 teen 0.4 juvenile 0.2 kid 0.1 adolescent) ]
Diagnosis by high idf (rare) terms:
– Query: placement of cigarette signs on television watched by children
– Expansion (bag of word):
[ 0.9 (placement cigar television watch children)
0.1 (0.4 place 0.3 promotion 0.2 logo 0.1 sign 0.3 signage 0.3 merchandise 0.5 tv 0.4 cable 0.2 network) ]
• Goal
– Least amount of user effort → near optimal performance
– E.g. expand 2 terms → 90% of total improvement
51
Definition
Importance
Prediction
Solution: Expansion
Diagnostic Intervention (We Hope to)
[Flow diagram]
– User → Keyword query, e.g. (child AND cigar)
– Diagnosis system (P(t | R) or idf) → Problem query terms, e.g. (child > cigar)
– User expansion → Expansion terms, e.g. (child → teen)
– Query formulation (CNF or Keyword), e.g. (child OR teen) AND cigar
– Retrieval engine → Evaluation
52
Definition
Importance
Prediction
Solution: Expansion
We Ended up Using Simulation
[Flow diagram]
– Offline: Expert user → Full CNF, e.g. (child OR teen) AND (cigar OR tobacco)
– Online simulation: Keyword query, e.g. (child AND cigar)
– Diagnosis system (P(t | R) or idf) → Problem query terms, e.g. (child > cigar)
– Online simulation: User expansion → Expansion terms, e.g. (child → teen)
– Query formulation (CNF or Keyword), e.g. (child OR teen) AND cigar
– Retrieval engine → Evaluation
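The simulated loop can be sketched in a few lines: take the expert's full CNF, rank the keyword query terms by predicted P(t | R), and expand only the k most problematic ones. Everything below is an illustrative assumption built around the child/cigar example: the function names, the dictionary representation of the full CNF, and the predicted recall values are all hypothetical.

```python
def diagnostic_expansion(keyword_query, full_cnf, predicted_recall, k):
    """Expand only the k query terms with the lowest predicted P(t | R).

    keyword_query: list of query terms.
    full_cnf: dict mapping a query term to the expert's expansion terms.
    predicted_recall: dict mapping a query term to its predicted P(t | R).
    Returns a CNF as a list of clauses (lists of OR-ed terms).
    """
    problem_terms = set(sorted(keyword_query, key=lambda t: predicted_recall[t])[:k])
    clauses = []
    for term in keyword_query:
        if term in problem_terms:
            clauses.append([term] + list(full_cnf.get(term, [])))
        else:
            clauses.append([term])
    return clauses


def format_cnf(clauses):
    """Render clauses as a Boolean query string."""
    parts = []
    for clause in clauses:
        parts.append(clause[0] if len(clause) == 1 else "(" + " OR ".join(clause) + ")")
    return " AND ".join(parts)


full_cnf = {"child": ["teen"], "cigar": ["tobacco"]}   # from the expert's full CNF
predicted = {"child": 0.3, "cigar": 0.8}               # hypothetical predictions
partial = diagnostic_expansion(["child", "cigar"], full_cnf, predicted, k=1)
print(format_cnf(partial))
```

Setting k to the number of query terms reproduces the full CNF, and k = 0 leaves the keyword query unchanged, which gives the two end points of the evaluation.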
54
Definition
Importance
Prediction
Solution: Expansion
Diagnostic Intervention Datasets
• Document sets
– TREC 2007 Legal track, 7 million tobacco company documents
– TREC 4 Ad hoc track, 0.5 million newswire documents
• CNF Queries, 50 topics per dataset
– TREC 2007 by lawyers, TREC 4 by Univ. Waterloo
• Relevance Judgments
– TREC 2007 sparse, TREC 4 dense
• Evaluation measures
– TREC 2007 statAP, TREC 4 MAP
55
Definition
Importance
Prediction
Solution: Expansion
Results – Diagnosis
P(t | R) vs. idf diagnosis
[Line chart: Diagnostic CNF expansion on TREC 4 and 2007. x-axis: # query terms selected (0 = No Expansion, 1, 2, 3, 4, All = Full Expansion); y-axis: gain in retrieval (MAP), 0% to 100%; curves: P(t | R) on TREC 2007, idf on TREC 2007, P(t | R) on TREC 4, idf on TREC 4]
56
Definition
Importance
Prediction
Solution: Expansion
Results – Form of Expansion
CNF vs. bag-of-word expansion
[Line chart: P(t | R) guided expansion on TREC 4 and 2007. x-axis: # query terms selected (0, 1, 2, 3, 4, All); y-axis: retrieval performance (MAP), 0 to 0.35; curves: CNF on TREC 4, CNF on TREC 2007, Bag of word on TREC 4, Bag of word on TREC 2007]
– 50% to 300% gain
– Similar level of gain in top precision
57
Main Questions
• Definition
• Significance
• Mechanism
• Model and solution
• Term weighting for long queries
• Term mismatch prediction diagnoses problem terms,
and produces simple & effective CNF queries
58
Definition
Importance
Prediction: Efficiency
Solution: Weighting
Efficient P(t | R) Prediction
• 3-10X speedup (close to simple keyword retrieval),
while maintaining 70-90% of the gain
• Predict using P(t | R) values from similar, previously-seen
queries
[CIKM 2012]
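The idea of reusing P(t | R) values from previously seen queries can be pictured with a simple lookup table. The sketch below is a deliberate simplification and not the method of the cited paper: it averages the historic values of each term and falls back to a default for unseen terms. The function names and the default of 0.55 are assumptions; the three training values are the term recall figures quoted earlier for the TREC 3 example topics.

```python
from collections import defaultdict


def build_history(training):
    """training: iterable of (term, true P(t | R)) pairs from earlier queries.

    Returns a dict mapping each term to the mean of its historic values.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for term, recall in training:
        sums[term][0] += recall
        sums[term][1] += 1
    return {term: total / count for term, (total, count) in sums.items()}


def predict_recall(term, history, default=0.55):
    """Predict P(t | R) from history, with a default for unseen terms."""
    return history.get(term, default)


training = [("term", 0.9831), ("term", 0.6885), ("oil", 0.9914)]
history = build_history(training)
print(predict_recall("term", history), predict_recall("vitamin", history))
```

A lookup like this needs no extra retrieval pass at query time, which is the source of the speedup; the price is that query dependent variation of the same term is averaged away.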
59
Definition
Importance
Prediction
Solution
Contributions
• Two long standing problems: mismatch & P(t | R)
• Definition and initial quantitative analysis of mismatch
– Do better/new features and prediction methods exist?
• Role of term mismatch in basic retrieval theory
– Principled ways to solve term mismatch
– What about advanced learning to rank, transfer learning?
• Ways to automatically predict term mismatch
– Initial modeling of causes of mismatch, features
– Efficient prediction using historic information
– Are there better analyses or modeling of the causes?
60
Definition
Importance
Prediction
Solution
Contributions
• Effectiveness of ad hoc retrieval
– Term weighting & diagnostic expansion
– How to do automatic CNF expansion?
– Better formalisms: transfer learning, & more tasks?
• Diagnostic intervention
– Mismatch diagnosis guides targeted expansion
– How to diagnose specific types of mismatch problems
or different problems (mismatch/emphasis/precision)?
• Guide NLP, Personalization, etc. to solve the real problem
– How to proactively identify search and other user needs?
61
Acknowledgements
• Committee: Jamie Callan, Jaime Carbonell, Yiming Yang, Bruce Croft
• Ni Lao, Frank Lin, Siddharth Gopal, Jon Elsas, Jaime Arguello, Hui (Grace) Yang, Stephen Robertson, Matthew Lease, Nick Craswell, Yi Zhang (and her group), Jin Young Kim, Yangbo Zhu, Runting Shi, Yi Wu, Hui Tan, Yifan Yanggong, Mingyan Fan, Chengtao Wen
– Discussions & references & feedback
• Reviewers: papers & NSF proposal
• David Fisher, Mark Hoy, David Pane
– Maintaining the Lemur toolkit
• Andrea Bastoni and Lorenzo Clemente
– Maintaining LSI code for Lemur toolkit
• SVM-light, Stanford parser
• TREC: data
• NSF Grant IIS-1018317
• Xiangmin Jin, and my whole family and volleyball packs at CMU & SF Bay
62
END
63
Prior Definition of Mismatch
• Vocabulary mismatch (Furnas et al., 1987)
– How likely 2 people disagree in vocab choice
– Domain experts disagree 80-90% of the times
– Leads to Latent Semantic Indexing (Deerwester et al., 1988)
– Query independent
– = Avgq P(t̄ | Rq)
– can be reduced to our query dependent definition of term mismatch
64
Knowledge
How Necessity explains behavior of IR techniques
• Why weight query bigrams 0.1, while query unigrams 0.9?
– Bigram decreases term recall, weight reflects recall
• Why are bigrams not giving stable improvements?
– Term recall is more of a problem
• Why is using document structure (field, semantic annotation) not improving performance?
– Improves precision, need to solve structural mismatch
• Word sense disambiguation
– Enhances precision; instead, it should be used in mismatch modeling!
• Identify query term sense, for searchonym id, or learning across queries
• Disambiguate collection term sense for more accurate replaceability
• Personalization
– biases results to what a community/person likes to read (precision)
– may work well in a mobile setting, short queries
65
Why Necessity?
System Failure Analysis
• Reliable Information Access (RIA) workshop (2003)
– Failure analysis for 7 top research IR systems
• 11 groups of researchers (both academia & industry)
• 28 people directly involved in the analysis (senior & junior)
• >56 human*weeks (analysis + running experiments)
• 45 topics selected from 150 TREC 6-8 (difficult topics)
– Causes (necessity in various disguise)
• Emphasize 1 aspect, missing another aspect (14+2 topics)
• Emphasize 1 aspect, missing another term (7 topics)
• Missing either 1 of 2 aspects, need both (5 topics)
• Missing difficult aspect that need human help (7 topics)
• Need to expand a general term e.g. “Europe” (4 topics)
• Precision problem, e.g. “euro”, not “euro-…” (4 topics)
66
67
68
Local LSI Top Similar Terms
(for each query: the top similar terms with their similarity values, in the order given on the slide)
– Oil spills: spill, oil 0.5828; oil 0.4210; tank 0.0986; crude 0.0972; water 0.0830
– Insurance coverage which pays for long term care: term, term 0.3310; long 0.2173; nurse 0.2114; care 0.1694; home 0.1268
– Term limitations for US Congress members: term, term 0.3339; limit 0.1696; ballot 0.1115; elect 0.1042; care 0.0997
– Vitamin the cure of or cause for human ailments: ail, ail 0.4415; health 0.0825; disease 0.0720; basler 0.0718; dr 0.0695
69
Error plot of necessity predictions
[Plot: probability (y-axis, -0.8 to 1.2); series: Necessity truth, Predicted necessity, Prediction trend (3rd order polynomial fit)]
70
Necessity vs. idf (and emphasis)
71
True Necessity Weighting
TREC | 4 | 6 | 8 | 9 | 10 | 12 | 14
Document collection | disk 2,3 | disk 4,5 | d4,5 w/o cr | WT10g | WT10g | .GOV | .GOV2
Topic numbers | 201-250 | 301-350 | 401-450 | 451-500 | 501-550 | TD1-50 | 751-800
LM desc – Baseline | 0.1789 | 0.1586 | 0.1923 | 0.2145 | 0.1627 | 0.0239 | 0.1789
LM desc – Necessity | 0.2703 | 0.2808 | 0.3057 | 0.2770 | 0.2216 | 0.0868 | 0.2674
Improvement | 51.09% | 77.05% | 58.97% | 29.14% | 36.20% | 261.7% | 49.47%
p - randomization | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001
p - sign test | 0.0000 | 0.0000 | 0.0000 | 0.0005 | 0.0000 | 0.0000 | 0.0002
Multinomial-abs | 0.1988 | 0.2088 | 0.2345 | 0.2239 | 0.1653 | 0.0645 | 0.2150
Multinomial RM | 0.2613 | 0.2660 | 0.2969 | 0.2590 | 0.2259 | 0.1219 | 0.2260
Okapi desc – Baseline | 0.2055 | 0.1773 | 0.2183 | 0.1944 | 0.1591 | 0.0449 | 0.2058
Okapi desc – Necessity | 0.2679 | 0.2786 | 0.2894 | 0.2387 | 0.2003 | 0.0776 | 0.2403
LM title – Baseline | N/A | 0.2362 | 0.2518 | 0.1890 | 0.1577 | 0.0964 | 0.2511
LM title – Necessity | N/A | 0.2514 | 0.2606 | 0.2058 | 0.2137 | 0.1042 | 0.2674
72
Predicted Necessity Weighting
10-25% gain (necessity weight); 10-20% gain (top Precision)
TREC train sets | 3 | 3-5 | 3-7 | 7
Test/x-validation | 4 | 6 | 8 | 8
LM desc – Baseline | 0.1789 | 0.1586 | 0.1923 | 0.1923
LM desc – Necessity | 0.2261 | 0.1959 | 0.2314 | 0.2333
Improvement | 26.38% | 23.52% | 20.33% | 21.32%
Prec@10 – Baseline | 0.4160 | 0.2980 | 0.3860 | 0.3860
Prec@10 – Necessity | 0.4940 | 0.3420 | 0.4220 | 0.4380
Prec@20 – Baseline | 0.3450 | 0.2440 | 0.3310 | 0.3310
Prec@20 – Necessity | 0.4180 | 0.2900 | 0.3540 | 0.3610
73
Predicted Necessity Weighting (ctd.)
TREC train sets | 3-9 | 9 | 11 | 13
Test/x-validation | 10 | 10 | 12 | 14
LM desc – Baseline | 0.1627 | 0.1627 | 0.0239 | 0.1789
LM desc – Necessity | 0.1813 | 0.1810 | 0.0597 | 0.2233
Improvement | 11.43% | 11.25% | 149.8% | 24.82%
Prec@10 – Baseline | 0.3180 | 0.3180 | 0.0200 | 0.4720
Prec@10 – Necessity | 0.3280 | 0.3400 | 0.0467 | 0.5360
Prec@20 – Baseline | 0.2400 | 0.2400 | 0.0211 | 0.4460
Prec@20 – Necessity | 0.2790 | 0.2810 | 0.0411 | 0.5030
74
vs. Relevance Model
Relevance Model: #weight( 1-λ #combine( t1 t2 ) λ #weight( w1 t1 w2 t2 w3 t3 … ) )
x ~ y: w1 ~ P(t1|R), w2 ~ P(t2|R)
[Scatter plot: x vs. y, both axes from 0 to 1]
– Supervised > Unsupervised (5-10%)
– Weight Only ≈ Expansion
Test/x-validation | 4 | 6 | 8 | 8 | 10 | 10 | 12 | 14
LM desc – Baseline | 0.1789 | 0.1586 | 0.1923 | 0.1923 | 0.1627 | 0.1627 | 0.0239 | 0.1789
Relevance Model desc | 0.2423 | 0.1799 | 0.2352 | 0.2352 | 0.1888 | 0.1888 | 0.0221 | 0.1774
RM reweight-Only desc | 0.2215 | 0.1705 | 0.2435 | 0.2435 | 0.1700 | 0.1700 | 0.0692 | 0.1945
RM reweight-Trained desc | 0.2330 | 0.1921 | 0.2542 | 0.2563 | 0.1809 | 0.1793 | 0.0534 | 0.2258
75
Feature Correlation
Feature | f1 Term | f2 Con | f3 Repl | f4 DepLeaf | f5 idf | RMw
Correlation | 0.3719 | 0.3758 | -0.1872 | 0.1278 | -0.1339 | 0.6296
Predicted Necessity: 0.7989 (TREC 4 test set)
76
vs. Relevance Model
[Bar chart: MAP (y-axis, 0 to 0.3) of Baseline LM desc, Relevance Model desc, RM Reweight-Only desc and RM Reweight-Trained desc, for each train -> test dataset pair: 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14]
– Weight Only ≈ Expansion
– Supervised > Unsupervised (5-10%)
– RM is unstable
77
Definition
Importance
Prediction: Idea
Solution
Efficient Prediction of Term Recall
• Currently:
– slow query dependent features that require retrieval
• Can they be more effective and more efficient?
– Need to understand the causes of the query
dependent variation
– Design a minimal set of efficient features to capture
the query dependent variations
78
Causes of Query Dependent Variation (1)
• Example
• Cause
– Different word sense
79
Causes of Query Dependent Variation (2)
• Example
• Cause
– Different word use, e.g. term in phrase vs. not
80
Causes of Query Dependent Variation (3)
• Example
• Cause
– Different Boolean semantics of the queries, AND vs.
OR
81
Causes of Query Dependent Variation (4)
• Example
• Cause
– Different association level with topic
82
Efficient P(t | R) Prediction (2)
• Causes of P(t | R) variation of same term in different
queries
– Different query semantics: Canada or Mexico vs.
Canada
– Different word sense: bear (verb) vs. bear (noun)
– Different word use: Seasonal affective disorder
syndrome (SADS) vs. Agoraphobia as a disorder
– Difference in association level with topic
• Use historic occurrences to predict current
– 70-90% of the total gain
– 3-10X faster, close to simple keyword retrieval
83
Efficient P(t | R) Prediction (2)
• Low variation of same term in different queries
• Use historic occurrences to predict current
– 3-10X faster, close to the slower method & real time
[Bar chart: MAP (y-axis, 0 to 0.3) of Baseline LM desc, Necessity LM desc and Efficient Prediction, for each train -> test dataset pair: 3 -> 4, 3-5 -> 6, 3-7 -> 8, 7 -> 8, 3-9 -> 10, 9 -> 10, 11 -> 12, 13 -> 14; “*” marks significant improvements]
84
Using Document Structure
• Stylistic: XML
• Syntactic/Semantic: POS, Semantic Role Label
• Current approaches
– All precision oriented
• Need to solve mismatch first?
85
Motivation
• Search is important, information portal
• Search is research worthy
– SIGIR, WWW, CIKM, ASIST, ECIR, AIRS, …
– Search is difficult
• Retrieval modeling difficulty >= sentence paraphrasing
– Since 1970s, but still not fully understood, basic problem like mismatch
– Adapt to changing requirements of mobile, social and semantic Web
• Modeling user’s needs
[Diagrams: (1) User → Query → Retrieval Model (over the Document Collection) → Results; (2) User, Query, Activities and Results feeding a Retrieval Model over Collections]
86
Online or Offline Study?
• Controlling confounding variables
– Quality of expansion terms
– User’s prior knowledge of the topic
– Interaction form & effort
• Enrolling many users and repeating experiments
• Offline simulations can avoid all these and still make
reasonable observations
87
Simulation Assumptions
• Real full CNFs to simulate partial expansions
• 3 assumptions about user expansion process
– Expansions of individual terms are independent of each other
• A1: always same set of expansion terms for a given query term, no matter which subset of query terms gets expanded.
• A2: same sequence of expansion terms, no matter …
– A3: Keyword query is re-constructed from the CNF query
• Procedure to ensure vocabulary faithful to that of the original keyword description
• Highly effective CNF queries ensure reasonable kw baseline
88
Take Home Message for
Ordinary Search Users
(people and software)
89
Be mean!
Is the term Necessary for doc relevance?
If not, remove, replace or expand.
90