Statistical Models for Information
Retrieval and Text Mining
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
Course Overview
[Figure: scope of the course: information retrieval over text data (and multimedia data), drawing on computer vision, natural language processing, machine learning, and statistics.]
Goal of the Course
• Overview of techniques for information retrieval (IR)
• Detailed explanation of a few statistical models for IR and text mining
  – Probabilistic retrieval models (for search)
  – Probabilistic topic models (for text mining)
• Potential benefit for you:
  – Some ideas that work well for text retrieval may also work for computer vision
  – Techniques for computer vision may be applicable to IR
  – IR and text mining raise new challenges as well as opportunities for machine learning
Course Plan
• Lecture 1: Overview of information retrieval
• Lecture 2: Statistical language models for IR: Part 1
• Lecture 3: Statistical language models for IR: Part 2
• Lecture 4: Formal retrieval frameworks
• Lecture 5: Probabilistic topic models for text mining
Lecture 1: Overview of IR
• Basic Concepts in Text Retrieval (TR)
• Evaluation of TR
• Common Components of a TR system
• Overview of Retrieval Models
Basic Concepts in TR
What is Text Retrieval (TR)?
• There exists a collection of text documents
• User gives a query to express the information
need
• A retrieval system returns relevant documents to
users
• Known as “search technology” in industry
History of TR on One Slide
• Birth of TR
  – 1945: V. Bush's article "As We May Think"
  – 1957: H. P. Luhn's idea of word counting and matching
• Indexing & Evaluation Methodology (1960s)
  – SMART system (G. Salton's group)
  – Cranfield test collection (C. Cleverdon's group)
  – Indexing: automatic indexing can be as good as manual indexing (controlled vocabulary)
• TR Models (1970s & 1980s) …
• Large-scale Evaluation & Applications (1990s-Present)
  – TREC (D. Harman & E. Voorhees, NIST)
  – Web search, PubMed, …
  – Boundaries with related areas are disappearing
Short vs. Long Term Info Need
• Short-term information need (Ad hoc retrieval)
– “Temporary need”, e.g., info about used cars
– Information source is relatively static
– User “pulls” information
– Application examples: library search, Web search
• Long-term information need (Filtering)
– “Stable need”, e.g., new data mining algorithms
– Information source is dynamic
– System “pushes” information to user
– Application example: news filtering
Importance of Ad hoc Retrieval
• Directly manages any existing large collection of
information
• There are many, many "ad hoc" information needs
• A long-term information need can be satisfied
through frequent ad hoc retrieval
• Basic techniques of ad hoc retrieval can be used for
filtering and other “non-retrieval” tasks, such as
automatic summarization.
Formal Formulation of TR
• Vocabulary V = {w1, w2, …, wN} of the language
• Query q = q1,…,qm, where qi ∈ V
• Document di = di1,…,dimi, where dij ∈ V
• Collection C = {d1, …, dk}
• Set of relevant documents R(q) ⊆ C
  – Generally unknown and user-dependent
  – The query is a "hint" on which docs are in R(q)
• Task = compute R'(q), an approximation of R(q)
Computing R(q)
• Strategy 1: Document selection
  – R(q) = {d∈C | f(d,q)=1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
  – The system must decide if a doc is relevant or not ("absolute relevance")
• Strategy 2: Document ranking
  – R(q) = {d∈C | f(d,q) > θ}, where f(d,q) ∈ ℜ is a relevance measure function and θ is a cutoff
  – The system must decide if one doc is more likely to be relevant than another ("relative relevance")
Document Selection vs. Ranking
[Figure: the true relevant set R(q) contains relevant (+) documents mixed with non-relevant (-) ones in the collection. Document selection, with f(d,q) ∈ {0,1}, returns an approximate set R'(q) that inevitably admits some non-relevant documents and misses some relevant ones. Document ranking, with a real-valued f(d,q), orders the documents by score (e.g., d1 scored 0.98, d2 0.95, d3 0.83, d4 0.80, …), and R'(q) is whatever prefix of the ranking the user chooses to examine.]
Problems of Doc Selection
• The classifier is unlikely to be accurate
– “Over-constrained” query (terms are too specific): no
relevant documents found
– “Under-constrained” query (terms are too general):
over delivery
– It is extremely hard to find the right position between
these two extremes
• Even if the classifier is accurate, not all relevant documents are equally relevant
• Relevance is a matter of degree!
Ranking is often preferred
• Relevance is a matter of degree
• A user can stop browsing anywhere, so the boundary
is controlled by the user
– High recall users would view more items
– High precision users would view only a few
• Theoretical justification: Probability Ranking Principle
[Robertson 77]
Probability Ranking Principle
[Robertson 77]
• As stated by Cooper
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
• Robertson provides two formal justifications
• Assumptions: Independent relevance and sequential
browsing (not necessarily all hold in reality)
According to the PRP, all we need is a relevance measure function f which satisfies: for all q, d1, d2,
f(q,d1) > f(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2)
Most IR research has focused on finding a good function f
Evaluation in Information
Retrieval
Evaluation Criteria
• Effectiveness/Accuracy
– Precision, Recall
• Efficiency
– Space and time complexity
• Usability
– How useful for real user tasks?
Methodology: Cranfield Tradition
• Laboratory testing of system components
– Precision, Recall
– Comparative testing
• Test collections
– Set of documents
– Set of questions
– Relevance judgments
The Contingency Table
                 Retrieved              Not Retrieved
Relevant         Relevant Retrieved     Relevant Rejected
Not relevant     Irrelevant Retrieved   Irrelevant Rejected

Precision = |Relevant Retrieved| / |Retrieved|
Recall = |Relevant Retrieved| / |Relevant|
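A minimal Python sketch (with made-up document ids) of how precision and recall follow from the contingency table:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from lists/sets of document ids."""
    relevant_retrieved = set(retrieved) & set(relevant)
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved docs are relevant; the collection holds 4 relevant docs.
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"], ["d1", "d3", "d5", "d9"])
print(p, r)  # 0.6 0.75
```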
How to measure a ranking?
• Compute the precision at every recall point
• Plot a precision-recall (PR) curve
[Figure: two precision-recall curves plotted with precision on the y-axis and recall on the x-axis; the question posed is "Which is better?"]
Summarize a Ranking: MAP
• Given that n docs are retrieved
  – Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k relevant docs
    • E.g., if the first relevant doc is at the 2nd rank, then p(1) = 1/2
    • If a relevant document is never retrieved, we take the precision corresponding to that relevant doc to be zero
• Compute the average over all the relevant documents
  – Average precision = (p(1)+…+p(k))/k
• This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document
• Mean Average Precision (MAP)
  – MAP = arithmetic mean of average precision over a set of topics
  – gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
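A minimal sketch of (non-interpolated) average precision and MAP as defined above; the relevance flags are made up and match the worked example on a later slide:

```python
def average_precision(ranked_relevance, num_relevant):
    """ranked_relevance: 0/1 flags for the ranked list (1 = relevant doc at that rank).
    num_relevant: total number of relevant docs; unretrieved ones contribute precision 0."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at the rank of this relevant doc
    return sum(precisions) / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    """runs: one (ranked_relevance, num_relevant) pair per topic; MAP is their arithmetic mean."""
    return sum(average_precision(r, k) for r, k in runs) / len(runs)

# Ranked list + + - - + - with 4 relevant docs in total:
print(average_precision([1, 1, 0, 0, 1, 0], 4))  # (1/1 + 2/2 + 3/5 + 0) / 4 = 0.65
```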
Summarize a Ranking: NDCG
• What if relevance judgments are graded, on a scale of [1, r] with r > 2?
• Cumulative Gain (CG) at rank n
  – Let the ratings of the n documents be r1, r2, …, rn (in ranked order)
  – CG = r1 + r2 + … + rn
• Discounted Cumulative Gain (DCG) at rank n
  – DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
  – We may use any base b for the logarithm; rank positions below b are not discounted
• Normalized Discounted Cumulative Gain (NDCG) at rank n
  – Normalize the DCG at rank n by the DCG value at rank n of the ideal ranking
  – The ideal ranking first returns the documents with the highest relevance level, then the next highest relevance level, etc.
• NDCG is now quite popular in evaluating Web search
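A minimal sketch of DCG/NDCG for graded judgments; the ratings in the example are invented:

```python
import math

def dcg(ratings, base=2):
    """Discounted cumulative gain; ranks below the log base are not discounted."""
    total = 0.0
    for rank, rating in enumerate(ratings, start=1):
        discount = math.log(rank, base) if rank >= base else 1.0
        total += rating / discount
    return total

def ndcg(ratings, base=2):
    """Normalize by the DCG of the ideal (descending-rating) ranking."""
    ideal = dcg(sorted(ratings, reverse=True), base)
    return dcg(ratings, base) / ideal if ideal > 0 else 0.0

# Ratings of the top 5 retrieved docs, in ranked order (3 = highly relevant, 0 = non-relevant).
print(ndcg([3, 2, 0, 1, 2]))
```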
When There’s only 1 Relevant Document
• Scenarios:
– known-item search
– navigational queries
• Search Length = Rank of the answer:
– measures a user’s effort
• Mean Reciprocal Rank (MRR):
– Reciprocal Rank: 1/Rank-of-the-answer
– Take an average over all the queries
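A minimal sketch of reciprocal rank and MRR over a set of known-item queries (the rankings and answers are made up):

```python
def mean_reciprocal_rank(ranked_lists, answers):
    """One ranked list of doc ids per query, and the single relevant doc per query.
    Reciprocal rank is 0 if the answer is not retrieved at all."""
    rr = []
    for ranking, answer in zip(ranked_lists, answers):
        rr.append(1.0 / (ranking.index(answer) + 1) if answer in ranking else 0.0)
    return sum(rr) / len(rr)

# The answer is ranked 2nd for the first query and 1st for the second.
print(mean_reciprocal_rank([["d3", "d7", "d1"], ["d9", "d2"]], ["d7", "d9"]))  # (1/2 + 1) / 2 = 0.75
```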
Precision-Recall Curve
[Figure: an annotated precision-recall evaluation for one run. Out of 4728 relevant docs, 3212 were retrieved, so recall = 3212/4728. Precision at 10 docs is about 0.55, i.e., about 5.5 of the top 10 docs are relevant on average. The breakeven point is where precision = recall, and Mean Average Precision (MAP) summarizes the whole curve.]
Worked example: the system returns 6 docs ranked D1 +, D2 +, D3 -, D4 -, D5 +, D6 -, and the total number of relevant docs is 4.
Average precision = (1/1 + 2/2 + 3/5 + 0)/4 = 0.65
What Query Averaging Hides
[Figure: per-query precision-recall curves, precision (0 to 1, y-axis) vs. recall (0 to 1, x-axis); individual queries vary enormously, which averaging over queries hides.]
Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation
The Pooling Strategy
• When the test collection is very large, it is impossible to judge all the documents completely
• TREC's strategy: pooling
  – Appropriate for relative comparison of different systems
  – Given N systems, take the top K results from each and combine them to form a "pool"
  – Human judges assess all the documents in the pool; unjudged documents are assumed to be non-relevant
• Advantage: less human effort
• Potential problems:
  – Bias due to incomplete judgments (okay for relative comparison)
  – The pool favors the systems that contributed to it; when the collection is reused, a new system's performance may be under-estimated
• Reuse the data set with caution!
User Studies
• Limitations of the Cranfield evaluation strategy:
  – How do we evaluate a technique for improving the interface of a search engine?
  – How do we evaluate the overall utility of a system?
• User studies are needed
• General user study procedure:
  – Experimental systems are developed
  – Subjects are recruited as users
  – Variation can be in the system or in the users
  – Users use the system and their behavior is logged
  – User information is collected (before: background; after: experience with the system)
• Clickthrough-based real-time user studies:
  – Assume clicked documents to be relevant
  – Mix results from multiple methods and compare their clickthroughs
Common Components in a
TR System
Typical TR System Architecture
[Diagram: documents are run through a tokenizer and indexer to build the index (the document representation); the user's query is tokenized into a query representation; a scorer matches the query representation against the index and returns ranked results to the user; the user's judgments feed a feedback component that updates the query representation.]
Text Representation/Indexing
• Making it easier to match a query with a document
• Query and document should be represented using
the same units/terms
• Controlled vocabulary vs. full text indexing
• Full-text indexing is more practically useful and has
proven to be as effective as manual indexing with
controlled vocabulary
What is a good indexing term?
• Specific (phrases) or general (single word)?
• Luhn found that words with middle frequency are
most useful
– Not too specific (low utility, but still useful!)
– Not too general (lack of discrimination, stop words)
– Stop word removal is common, but rare words are kept
• All words or a (controlled) subset? When term weighting is used, it becomes a matter of weighting rather than of selecting indexing terms
Tokenization
• Word segmentation is needed for some languages
  – Is it really needed?
• Normalize lexical units: words with similar meanings should be mapped to the same indexing term
  – Stemming: mapping all inflectional forms of words to the same root form, e.g.
    • computer -> compute
    • computation -> compute
    • computing -> compute (but king -> k?)
  – Are we losing finer-granularity discrimination?
• Stop word removal
  – What is a stop word? What about a query like "to be or not to be"?
Relevance Feedback
[Diagram: the query is sent to the retrieval engine, which searches the document collection and returns ranked results (e.g., d1 3.5, d2 2.4, …, dk 0.5, …). The user examines the results and provides relevance judgments (d1 +, d2 -, d3 +, …). A feedback component uses these judgments to produce an updated query, which goes back to the retrieval engine.]
Pseudo/Blind/Automatic Feedback
[Diagram: the same loop as relevance feedback, but with no user in it: the top 10 results (d1, d2, d3, …) are simply assumed to be relevant, and these pseudo-judgments drive the feedback component that produces the updated query.]
Implicit Feedback
[Diagram: the same loop again, except that the judgments are inferred from user activities (e.g., clickthroughs) rather than given explicitly; the inferred judgments are used to update the query.]
Important Points to Remember
• The PRP provides a justification for ranking, which is generally preferred to document selection
• How to compute the major evaluation measures (precision, recall, precision-recall curve, MAP, gMAP, breakeven precision, NDCG, MRR)
• What pooling is
• What tokenization is (word segmentation, stemming, stop word removal)
• What relevance feedback, pseudo relevance feedback, and implicit feedback are
Overview of Retrieval Models
[Diagram: a taxonomy of retrieval models, organized by how "relevance" is formalized:
• Relevance ≈ similarity between Rep(q) and Rep(d), with different representations & similarity measures:
  – Vector space model (Salton et al., 75)
  – Prob. distr. model (Wong & Yao, 89)
• Relevance as probability of relevance, P(r=1|q,d), r ∈ {0,1}:
  – Regression model (Fox, 83)
  – Learning to rank (Joachims, 02; Burges et al., 05)
  – Generative models, P(d|q) or P(q|d):
    • Document generation: classical prob. model (Robertson & Sparck Jones, 76)
    • Query generation: language model approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Relevance via probabilistic inference, with different inference systems:
  – Inference network model (Turtle & Croft, 91)
  – Prob. concept space model (Wong & Yao, 95)]
Retrieval Models: Vector Space
Relevance = Similarity
• Assumptions
  – Query and document are represented similarly
  – A query can be regarded as a "document"
  – Relevance(d,q) ≈ similarity(d,q)
• R(q) = {d∈C | f(d,q) > θ}, where f(q,d) = Δ(Rep(q), Rep(d))
• Key issues
  – How to represent the query/document?
  – How to define the similarity measure Δ?
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between the
query vector and document vector in the vector
space
VS Model: illustration
[Figure: documents D1-D11 and the query plotted as vectors in a three-dimensional term space with axes "Java", "Starbucks", and "Microsoft"; relevance is judged by how close each document vector is to the query vector, and the "??" marks ask which documents should be considered closest.]
What the VS model doesn’t say
• How to define/select the “basic concept”
– Concepts are assumed to be orthogonal
• How to assign weights
– Weight in query indicates importance of term
– Weight in doc indicates how well the term
characterizes the doc
• How to define the similarity/distance measure
What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and hopefully
accurately
• Many possibilities: Words, stemmed words, phrases,
“latent concept”, …
How to Assign Weights?
• Very very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more information about contents
• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
– TF normalization
TF Weighting
• Idea: A term is more important if it occurs more
frequently in a document
• Some formulas: Let f(t,d) be the frequency count of
term t in doc d
– Raw TF: TF(t,d) = f(t,d)
– Log TF: TF(t,d)=log f(t,d)
– Maximum frequency normalization:
TF(t,d) = 0.5 +0.5*f(t,d)/MaxFreq(d)
– “Okapi/BM25 TF”:
TF(t,d) = k f(t,d)/(f(t,d)+k(1-b+b*doclen/avgdoclen))
• Normalization of TF is very important!
TF Normalization
• Why?
– Document length variation
– “Repeated occurrences” are less informative than
the “first occurrence”
• Two views of document length
– A doc is long because it uses more words
– A doc is long because it has more contents
• Generally penalize long docs, but avoid over-penalizing them (pivoted normalization)
TF Normalization: How?
[Figure: several candidate curves mapping raw TF (x-axis) to normalized TF (y-axis).]
Which curve is more reasonable? Should normalized TF be upper-bounded?
Normalization interacts with the similarity measure.
Regularized/“Pivoted” Length Normalization
[Figure: normalized TF vs. raw TF curves for different values of b.]
"Pivoted normalization": use the average doc length to regularize the normalization:
  normalizer = 1 - b + b*doclen/avgdoclen   (b varies from 0 to 1)
What happens when doclen is >, <, or = avgdoclen?
Advantage: stabilizes parameter setting
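A minimal sketch of the pivoted length normalizer above (the value b=0.75 is only an illustration):

```python
def pivoted_length_normalizer(doclen, avgdoclen, b=0.75):
    """Returns 1 - b + b*doclen/avgdoclen: > 1 for longer-than-average docs
    (their raw TF gets discounted more), < 1 for shorter-than-average docs."""
    return 1 - b + b * doclen / avgdoclen

print(pivoted_length_normalizer(200, 100))  # 1.75
print(pivoted_length_normalizer(50, 100))   # 0.625
```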
IDF Weighting
• Idea: A term is more discriminative if it occurs in fewer documents
• Formula:
IDF(t) = 1+ log(n/k)
n – total number of docs
k -- # docs with term t (doc freq)
TF-IDF Weighting
• TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
  – Common in the doc → high TF → high weight
  – Rare in the collection → high IDF → high weight
• Imagine a word count profile: what kind of terms would have high weights?
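A minimal sketch of raw-TF times IDF weighting over a tiny in-memory collection (the documents are made up; the IDF form is the 1 + log(n/k) from the previous slide):

```python
import math
from collections import Counter

docs = {
    "d1": "java coffee starbucks coffee".split(),
    "d2": "java programming java code".split(),
    "d3": "microsoft code programming".split(),
}
n = len(docs)
doc_freq = Counter(term for words in docs.values() for term in set(words))  # k = # docs containing term

def tf_idf(term, doc_words):
    tf = doc_words.count(term)                   # raw TF
    if tf == 0:
        return 0.0
    idf = 1 + math.log(n / doc_freq[term])       # IDF(t) = 1 + log(n/k)
    return tf * idf

print(tf_idf("java", docs["d2"]))    # frequent in d2, but occurs in 2 of 3 docs
print(tf_idf("coffee", docs["d1"]))  # frequent in d1 and occurs in only 1 doc
```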
How to Measure Similarity?
Document vector: Di = (wi1, …, wiN); query vector: Q = (wq1, …, wqN); w = 0 if a term is absent.

Dot product similarity:  sim(Q, Di) = Σ_{j=1..N} wqj * wij

Cosine similarity:  sim(Q, Di) = [ Σ_{j=1..N} wqj * wij ] / [ sqrt(Σ_{j=1..N} wqj^2) * sqrt(Σ_{j=1..N} wij^2) ]
(i.e., the length-normalized dot product)

How about Euclidean distance,  sqrt( Σ_{j=1..N} (wqj - wij)^2 ) ?
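A minimal sketch of the dot-product and cosine measures over sparse term-weight vectors stored as dicts (the weights below are arbitrary illustrations):

```python
import math

def dot(q, d):
    """Dot product of two sparse vectors (term -> weight); absent terms have weight 0."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine(q, d):
    """Length-normalized dot product."""
    norm = math.sqrt(sum(w * w for w in q.values())) * math.sqrt(sum(w * w for w in d.values()))
    return dot(q, d) / norm if norm else 0.0

query = {"java": 1.0, "coffee": 1.0}
doc = {"java": 2.8, "programming": 1.4, "code": 1.4}
print(dot(query, doc), cosine(query, doc))
```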
What Works the Best?
[Figure omitted; the following techniques are listed as what works best (Singhal 2001):]
• Use single words
• Use statistical phrases
• Remove stop words
• Stemming
• Others(?)
“Extension” of VS Model
• Alternative similarity measures
– Many other choices (tend not to be very effective)
– P-norm (Extended Boolean): matching a Boolean
query with a TF-IDF document vector
• Alternative representation
– Many choices (performance varies a lot)
– Latent Semantic Indexing (LSI) [TREC performance
tends to be average]
• Generalized vector space model
– Theoretically interesting, not seriously evaluated
Relevance Feedback in VS
• Basic setting: learn from examples
  – Positive examples: docs known to be relevant
  – Negative examples: docs known to be non-relevant
  – How do we learn from these to improve performance?
• General method: query modification
  – Adding new (weighted) terms
  – Adjusting the weights of old terms
  – Doing both
• The most well-known and effective approach is Rocchio [Rocchio 1971]
Rocchio Feedback: Illustration
[Figure: in the vector space, the original query vector q is moved toward the centroid of the relevant (+) documents and away from the non-relevant (-) documents, yielding the modified query vector.]
Rocchio Feedback: Formula
The standard Rocchio update, with parameters α, β, γ and Dr / Dn the sets of known relevant / non-relevant docs:

  q_new = α * q_original + (β / |Dr|) * Σ_{d ∈ Dr} d - (γ / |Dn|) * Σ_{d ∈ Dn} d

i.e., the new query vector is the original query vector pushed toward the centroid of the relevant docs and away from the centroid of the non-relevant docs.
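A minimal sketch of the Rocchio update over sparse term-weight vectors; the α, β, γ values are illustrative defaults, not the lecture's:

```python
from collections import defaultdict

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """query and each doc are sparse term -> weight dicts.
    Returns alpha*q + beta*centroid(rel) - gamma*centroid(nonrel), dropping non-positive weights."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in rel_docs:
        for t, w in doc.items():
            new_q[t] += beta * w / len(rel_docs)
    for doc in nonrel_docs:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrel_docs)
    return {t: w for t, w in new_q.items() if w > 0}

q = {"java": 1.0}
rel = [{"java": 2.0, "programming": 1.0}, {"java": 1.0, "code": 2.0}]
nonrel = [{"java": 1.0, "coffee": 3.0}]
print(rocchio(q, rel, nonrel))
```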
Rocchio in Practice
• Negative (non-relevant) examples are not very important (why?)
• Often truncate the modified query vector to a lower dimension, i.e., keep only a small number of words that have high weights in the centroid vector (an efficiency concern)
• Avoid overfitting by keeping a relatively high weight on the original query terms (why?)
• Can be used for relevance feedback and pseudo feedback
• Usually robust and effective
Advantages of VS Model
• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/Most evaluated
• The SMART system
– Developed at Cornell: 1960-1999
– Still available
• Warning: Many variants of TF-IDF!
Disadvantages of VS Model
• Assume term independence
• Assume query and document to be the same
• Lack of “predictive adequacy”
– Arbitrary term weighting
– Arbitrary similarity measure
• Lots of parameter tuning!
Probabilistic Retrieval Models
Overview of Retrieval Models
[Diagram: a taxonomy of retrieval models, organized by how "relevance" is formalized:
• Relevance ≈ similarity between Rep(q) and Rep(d), with different representations & similarity measures:
  – Vector space model (Salton et al., 75)
  – Prob. distr. model (Wong & Yao, 89)
• Relevance as probability of relevance, P(r=1|q,d), r ∈ {0,1}:
  – Regression model (Fox, 83)
  – Learning to rank (Joachims, 02; Burges et al., 05)
  – Generative models, P(d|q) or P(q|d):
    • Document generation: classical prob. model (Robertson & Sparck Jones, 76)
    • Query generation: language model approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Relevance via probabilistic inference, with different inference systems:
  – Inference network model (Turtle & Croft, 91)
  – Prob. concept space model (Wong & Yao, 95)]
Probability of Relevance
• Three random variables
  – Query Q
  – Document D
  – Relevance R ∈ {0,1}
• Goal: rank D based on P(R=1|Q,D)
  – Evaluate P(R=1|Q,D)
  – Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents
• Several different ways to refine P(R=1|Q,D)
Refining P(R=1|Q,D), Method 1: conditional models
• Basic idea: relevance depends on how well a query matches a document
  – Define features on Q x D, e.g., # matched terms, highest IDF of a matched term, doc length, …
  – P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
  – Use training data (known relevance judgments) to estimate the parameters θ
  – Apply the model to rank new documents
• Early work (e.g., logistic regression [Cooper 92, Gey 94])
  – Attempted to compete with other models
• Recent work (e.g., Ranking SVM [Joachims 02], RankNet [Burges et al. 05])
  – Attempts to leverage other models
  – More features (notably PageRank, anchor text)
  – More sophisticated learning (Ranking SVM, RankNet, …)
Logistic Regression
(Cooper 92, Gey 94)
  log [ P(R=1|Q,D) / (1 - P(R=1|Q,D)) ] = β0 + Σ_{i=1..6} βi Xi

equivalently,

  P(R=1|Q,D) = 1 / (1 + exp(-(β0 + Σ_{i=1..6} βi Xi)))

logit function: log( x / (1-x) )
logistic (sigmoid) function: exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x))

The model uses 6 features X1, …, X6.
[Figure: the sigmoid curve of P(R=1|Q,D) as a function of the linear score, saturating at 1.0.]
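A minimal sketch of scoring with a fitted logistic model; the coefficients and the six feature values below are made up, whereas in the original work they would be estimated from relevance judgments:

```python
import math

def logistic_score(features, betas, beta0):
    """P(R=1|Q,D) = sigmoid(beta0 + sum_i beta_i * X_i)."""
    z = beta0 + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))

x = [0.4, 0.3, 0.5, 0.2, 1.1, 0.7]        # hypothetical X1..X6 for one (query, doc) pair
betas = [0.9, -0.2, 0.4, -0.1, 0.6, 1.2]  # hypothetical fitted coefficients
print(logistic_score(x, betas, beta0=-1.0))
```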
Features/Attributes
X1 = (1/M) Σ_{j=1..M} log QAF_tj     Average absolute query frequency
X2 = QL                              Query length
X3 = (1/M) Σ_{j=1..M} log DAF_tj     Average absolute document frequency
X4 = DL                              Document length
X5 = (1/M) Σ_{j=1..M} log IDF_tj     Average inverse document frequency, where IDF_tj = (N - n_tj) / n_tj
X6 = log M                           Number of terms in common between query and document (logged)

(M = number of terms in common; N = total number of documents; n_tj = number of documents containing term tj.)
Learning to Rank
• Advantages
  – Can combine multiple features (helps improve accuracy and combat web spam)
  – Can re-use all past relevance judgments (self-improving)
• Problems
  – Does not learn the semantic associations between query words and document words
  – Not much guidance on feature generation (relies on traditional retrieval models)
• All current Web search engines use some kind of learning algorithm to combine many features, such as PageRank and many different representations of a page
The PageRank Algorithm (Page et al. 98)
Random surfing model: at any page,
  – with probability α, randomly jump to a (uniformly chosen) page
  – with probability 1-α, randomly pick an out-link to follow

[Figure: a four-page example graph d1-d4 with its "transition matrix" M = (m_ki), where m_ki is the probability of following a link from page d_k to page d_i; N = # pages.]

p(d_i) = (1-α) Σ_{k=1..N} m_ki p(d_k) + Σ_{k=1..N} (α/N) p(d_k)
       = Σ_{k=1..N} [ α/N + (1-α) m_ki ] p(d_k)

In matrix form: p = (α I + (1-α) M)^T p, where I_ij = 1/N.

This is the stationary ("stable") distribution, so time can be ignored.
Initialize with p(d) = 1/N and iterate until convergence.
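A minimal sketch of the power iteration just described, on a small made-up link graph; treating zero-outlink pages as linking to every page is one common fix and an assumption here (the next slide mentions a page-specific damping alternative):

```python
def pagerank(out_links, alpha=0.15, iters=50):
    """out_links: dict page -> list of pages it links to.
    Returns an approximation of the stationary distribution p."""
    pages = list(out_links)
    n = len(pages)
    p = {d: 1.0 / n for d in pages}                # initial value p(d) = 1/N
    for _ in range(iters):
        new_p = {d: alpha / n for d in pages}      # random-jump term: alpha/N (since sum_k p(d_k) = 1)
        for d, links in out_links.items():
            targets = links if links else pages    # assumed fix for pages with no out-links
            share = (1 - alpha) * p[d] / len(targets)
            for t in targets:
                new_p[t] += share                  # link-following term
        p = new_p
    return p

links = {"d1": ["d3", "d4"], "d2": ["d3"], "d3": ["d1", "d2"], "d4": ["d1", "d4"]}
print(pagerank(links))
```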
PageRank in Practice
• Interpretation of the damping factor α (typically 0.15):
  – Probability of a random jump
  – Smoothing of the transition matrix (avoids zeros)
    p(d_i) = (1-α) Σ_{k=1..N} m_ki p(d_k) + Σ_{k=1..N} (α/N) p(d_k),  i.e.,  p = (α I + (1-α) M)^T p
• Normalization doesn't affect ranking, leading to some variants: letting p'(d_i) = c · p(d_i) for any constant c, the same fixed-point equation holds for p',
    p'(d_i) = (1-α) Σ_{k=1..N} m_ki p'(d_k) + Σ_{k=1..N} (α/N) p'(d_k)
  so the random-jump term can be replaced by a constant, giving variants such as
    p'(d_i) = (1-α) Σ_{k=1..N} m_ki p'(d_k) + α
• The zero-outlink problem: the p(d_i)'s don't sum to 1
  – One possible solution: a page-specific damping factor (α = 1.0 for a page with no out-link)
HITS: Capturing Authorities & Hubs
• Intuitions [Kleinberg 98]
  – Pages that are widely cited are good authorities
  – Pages that cite many other pages are good hubs
• The key idea of HITS
  – Good authorities are cited by good hubs
  – Good hubs point to good authorities
  – Iterative reinforcement…
The HITS Algorithm [Kleinberg 98]
[Figure: a four-page example graph d1-d4 with its "adjacency matrix" A, where A_ij = 1 if page d_i links to page d_j.]

Initial values: a(d_i) = h(d_i) = 1

  h(d_i) = Σ_{d_j ∈ OUT(d_i)} a(d_j)
  a(d_i) = Σ_{d_j ∈ IN(d_i)} h(d_j)

In matrix form: h = A a and a = A^T h, so iterating gives h = A A^T h and a = A^T A a.
Normalize after each iteration: Σ_i a(d_i)^2 = Σ_i h(d_i)^2 = 1
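A minimal sketch of the HITS iteration on a small made-up link graph:

```python
import math

def hits(out_links, iters=50):
    """out_links: dict page -> list of pages it links to.
    Returns (authority, hub) score dicts, L2-normalized after each iteration."""
    pages = list(out_links)
    auth = {d: 1.0 for d in pages}   # initial a(d) = 1
    hub = {d: 1.0 for d in pages}    # initial h(d) = 1
    for _ in range(iters):
        # a(d_i) = sum of h(d_j) over the pages d_j that link to d_i
        auth = {d: sum(hub[j] for j in pages if d in out_links[j]) for d in pages}
        # h(d_i) = sum of a(d_j) over the pages d_j that d_i links to
        hub = {d: sum(auth[j] for j in out_links[d]) for d in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for d in scores:
                scores[d] /= norm
    return auth, hub

links = {"d1": ["d3"], "d2": ["d3", "d4"], "d3": ["d1"], "d4": ["d3"]}
authority, hubs = hits(links)
print(authority, hubs)
```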
Refining P(R=1|Q,D), Method 2: generative models
• Basic idea
  – Define P(Q,D|R)
  – Compute the odds O(R=1|Q,D) using Bayes' rule:
      O(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D) = [P(Q,D|R=1) / P(Q,D|R=0)] * [P(R=1) / P(R=0)]
    (the last factor does not depend on D and is ignored for ranking D)
• Special cases
  – Document "generation": P(Q,D|R) = P(D|Q,R) P(Q|R)
  – Query "generation": P(Q,D|R) = P(Q|D,R) P(D|R)
Document Generation
O(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D)
           ∝ P(Q,D|R=1) / P(Q,D|R=0)
           = [P(D|Q,R=1) / P(D|Q,R=0)] * [P(Q|R=1) / P(Q|R=0)]
           ∝ P(D|Q,R=1) / P(D|Q,R=0)

where P(D|Q,R=1) is the model of relevant docs for Q and P(D|Q,R=0) is the model of non-relevant docs for Q.

Assume independent attributes A1…Ak (why?). Let D = d1…dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q = q1…qk). Then

P(R=1|Q,D) / P(R=0|Q,D)
  ∝ Π_{i=1..k} P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0)
  = Π_{i: di=1} P(Ai=1|Q,R=1) / P(Ai=1|Q,R=0) * Π_{i: di=0} P(Ai=0|Q,R=1) / P(Ai=0|Q,R=0)
  = Π_{i: di=1} [P(Ai=1|Q,R=1) P(Ai=0|Q,R=0)] / [P(Ai=1|Q,R=0) P(Ai=0|Q,R=1)] * Π_{i=1..k} P(Ai=0|Q,R=1) / P(Ai=0|Q,R=0)
  ∝ Π_{i: di=qi=1} [P(Ai=1|Q,R=1) P(Ai=0|Q,R=0)] / [P(Ai=1|Q,R=0) P(Ai=0|Q,R=1)]

(The last full product over all i does not depend on D and can be dropped for ranking. The restriction to terms with di = qi = 1 uses the assumption P(Ai=1|Q,R=1) = P(Ai=1|Q,R=0) whenever qi = 0, i.e., non-query terms are equally likely to appear in relevant and non-relevant docs.)
Robertson-Sparck Jones Model
(Robertson & Sparck Jones 76)
logO( R  1 | Q, D)
Rank
k
 
i 1, d i  qi 1
log
pi (1  qi )
qi (1  pi )
(RSJ model)
Two parameters for each term Ai:
pi = P(Ai=1|Q,R=1): prob. that term Ai occurs in a relevant doc
qi = P(Ai=1|Q,R=0): prob. that term Ai occurs in a non-relevant doc
How to estimate parameters?
Suppose we have relevance judgments,
pˆ i 
# ( rel . doc with Ai )  0.5
# ( rel .doc)  1
qˆ i 
# ( nonrel. doc with Ai )  0.5
# ( nonrel.doc)  1
“+0.5” and “+1” can be justified by Bayesian estimation
2008 © ChengXiang Zhai
China-US-France Summer School, Lotus Hill Inst., 2008
75
RSJ Model: No Relevance Info
(Croft & Harper 79)
log O(R=1|Q,D) is rank-equivalent to Σ_{i: di=qi=1} log [ pi (1 - qi) / ( qi (1 - pi) ) ]      (RSJ model)

How to estimate the parameters when we have no relevance judgments?
  – Assume pi to be a constant
  – Estimate qi by assuming all documents to be non-relevant

log O(R=1|Q,D) is then rank-equivalent to Σ_{i: di=qi=1} log [ (N - ni + 0.5) / (ni + 0.5) ]

N: # documents in the collection; ni: # documents in which term Ai occurs.
This is essentially an IDF weight: IDF' = log [ (N - ni) / ni ]
Improving RSJ: Adding TF
Basic doc. generation model:
  P(R=1|Q,D) / P(R=0|Q,D) ∝ P(D|Q,R=1) / P(D|Q,R=0)

Let D = d1…dk, where di is now the frequency count of term Ai. Then

  P(R=1|Q,D) / P(R=0|Q,D)
    ∝ Π_{i=1..k} P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0)
    = Π_{i: di>=1} P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0) * Π_{i: di=0} P(Ai=0|Q,R=1) / P(Ai=0|Q,R=0)
    ∝ Π_{i: di>=1} [P(Ai=di|Q,R=1) P(Ai=0|Q,R=0)] / [P(Ai=di|Q,R=0) P(Ai=0|Q,R=1)]

2-Poisson mixture model:
  p(Ai=f|Q,R) = p(E|Q,R) p(Ai=f|E) + p(Ē|Q,R) p(Ai=f|Ē)
              = p(E|Q,R) (λ_E^f / f!) e^(-λ_E) + p(Ē|Q,R) (λ_Ē^f / f!) e^(-λ_Ē)

Many more parameters to estimate! (how many exactly?)
BM25/Okapi Approximation
(Robertson et al. 94)
• Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties
• Observations:
  – log O(R=1|Q,D) is a sum of term weights Wi
  – Wi = 0 if TFi = 0
  – Wi increases monotonically with TFi
  – Wi has an asymptotic limit
• The simple function is
    Wi = [ TFi (k1 + 1) / (k1 + TFi) ] * log [ pi (1 - qi) / ( qi (1 - pi) ) ]
Adding Doc. Length & Query TF
• Incorporating doc length
– Motivation: The 2-Poisson model assumes equal
document length
– Implementation: “Carefully” penalize long doc
• Incorporating query TF
– Motivation: Appears to be not well-justified
– Implementation: A similar TF transformation
• The final formula is called BM25, achieving top
TREC performance
The BM25 Formula
[The full BM25 formula appears as an image on the original slide, with the "Okapi TF/BM25 TF" component labeled. In its commonly used form, the score of document D for query Q sums, over the query terms t, an IDF-style weight times the Okapi TF transformation introduced earlier, with an optional query-TF component:

  score(D,Q) = Σ_{t ∈ Q} IDF(t) * [ f(t,D) (k1 + 1) ] / [ f(t,D) + k1 (1 - b + b |D|/avgdl) ] * [ qtf(t) (k3 + 1) / (k3 + qtf(t)) ]  ]
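A minimal sketch of BM25-style scoring under the standard form sketched above, without the query-TF component; k1 = 1.2 and b = 0.75 are common defaults rather than values from the lecture, and the IDF is the RSJ-style log((N - n + 0.5)/(n + 0.5)) from the earlier slide:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=1.2, b=0.75):
    """Score one document (list of terms) against a query (list of terms)."""
    tf = Counter(doc_terms)
    doclen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        n = doc_freq.get(t, 0)
        if n == 0 or tf[t] == 0:
            continue
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))       # RSJ-style IDF
        norm = 1 - b + b * doclen / avgdl                      # pivoted length normalization
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)  # Okapi/BM25 TF
    return score

# Toy collection statistics, made up for illustration.
docs = {"d1": "java coffee coffee".split(),
        "d2": "java code programming code".split(),
        "d3": "starbucks coffee latte".split()}
df = Counter(t for words in docs.values() for t in set(words))
avgdl = sum(len(words) for words in docs.values()) / len(docs)
for d, words in docs.items():
    print(d, bm25_score("code programming".split(), words, df, len(docs), avgdl))
```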
Extensions of
“Doc Generation” Models
• Capture term dependence (Rijsbergen & Harper 78)
• Alternative ways to incorporate TF (Croft 83, Kalt96)
• Feature/term selection for feedback (Okapi’s TREC reports)
• Other Possibilities (machine learning … )
Query Generation
O(R=1|Q,D) ∝ P(Q,D|R=1) / P(Q,D|R=0)
           = [P(Q|D,R=1) / P(Q|D,R=0)] * [P(D|R=1) / P(D|R=0)]
           ∝ P(Q|D,R=1) * [P(D|R=1) / P(D|R=0)]       (assuming P(Q|D,R=0) ≈ P(Q|R=0))

The first factor is the query likelihood p(Q|D); the second is a document prior.
Assuming a uniform prior, we have O(R=1|Q,D) ∝ P(Q|D,R=1).

Now the question is how to compute P(Q|D,R=1). Generally this involves two steps:
(1) estimate a language model based on D
(2) compute the query likelihood according to the estimated model

This leads to the so-called "Language Modeling Approach" …
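A minimal sketch of the two steps just listed, using a maximum-likelihood unigram model of D linearly interpolated with a collection model so that unseen query terms do not get zero probability; the interpolation weight 0.5 and the toy documents are assumptions (smoothing is treated properly in the next lectures):

```python
import math
from collections import Counter

docs = {"d1": "java coffee coffee shop".split(),
        "d2": "java code programming code".split()}
collection = Counter(t for words in docs.values() for t in words)
coll_total = sum(collection.values())

def query_log_likelihood(query_terms, doc_terms, lam=0.5):
    """log P(Q|D) under a unigram LM estimated from D, mixed with the collection LM."""
    tf = Counter(doc_terms)
    doclen = len(doc_terms)
    logp = 0.0
    for t in query_terms:
        p_doc = tf[t] / doclen                              # step (1): ML estimate from D
        p_coll = collection[t] / coll_total                 # background model
        logp += math.log(lam * p_doc + (1 - lam) * p_coll)  # step (2): accumulate query likelihood
    return logp

for d, words in docs.items():
    print(d, query_log_likelihood("java code".split(), words))
```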
Lecture 1: Key Points
• The Vector Space Model is a family of models, not a single model
• There are many variants of TF-IDF weighting, and some are more effective than others
• State-of-the-art retrieval performance is achieved through
  – Bag-of-words representation
  – TF-IDF weighting (BM25) + length normalization
  – Pseudo relevance feedback (mostly for recall)
  – For web search: add PageRank, anchor text, …, plus learning to rank
• Principled approaches didn't lead to good performance directly (before the "language modeling approach" was proposed); heuristic modification has been necessary
Readings
• Amit Singhal’s overview:
– http://sifaka.cs.uiuc.edu/course/ds/mir.pdf
• My review of IR models:
– http://sifaka.cs.uiuc.edu/course/ds/irmod.pdf
Discussion
• Text retrieval and image retrieval
  – Query language:
    • Keywords vs. image features
    • Query by example
  – Content representation:
    • Bag of words vs. "bag of image features"?
    • Phrase indexing vs. units from image parsing?
    • Sentiment analysis
  – Retrieval heuristics:
    • TF-IDF weighting vs. ?
    • Passage retrieval vs. image region retrieval?
    • Proximity
• Text retrieval and video retrieval
• Multimedia retrieval?