TEXT RETRIEVAL part I
Text Retrieval
• Text retrieval is the task of returning relevant textual documents from a given collection, according to a user’s information need as expressed in a query
• The main differences from database retrieval concern:
  – Information
    • Unstructured text vs. structured data
    • Ambiguous vs. well-defined semantics
  – Query expression
    • Ambiguous vs. well-defined semantics
    • Incomplete vs. complete specification
  – Answers
    • Relevant documents vs. matched records
• Formally, the elements of the problem are as follows:
  – Vocabulary V = {w1, w2, …, wN}, where the wi are the words of the language
  – Query q = q1,…,qm, where qi ∈ V
  – Document dk = dk1,…,dkmk, where dki ∈ V
  – Collection C = {d1, …, dn}
  – Set of relevant documents R(q) ⊆ C
    • Generally unknown and user-dependent
    • The query is a “hint” on which documents are in R(q)
• Based on this definition, the task is to compute R’(q), an approximation of R(q).
• This can be done according to two different strategies:
  – Document selection
    • R’(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function such that it is possible to decide whether a document is relevant or not (“absolute relevance”)
  – Document ranking
    • R’(q) = {d ∈ C | f(d,q) > θ}, where θ is a cutoff and f(d,q) ∈ ℜ is a relevance measure function such that it is possible to decide whether one document is more likely to be relevant than another (“relative relevance”)
Document Selection vs. Ranking
[Figure: starting from the true R(q), document selection applies the binary indicator f(d,q) ∈ {0,1}, which splits the collection into R’(q) and the rest; document ranking assigns each document a relevance score (e.g., 0.98 for d1, 0.95 for d2, …, 0.21 for d9) and R’(q) is obtained by cutting the ranked list at a threshold]
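The two strategies can be sketched in a few lines of Python; the relevance function f and the cutoff theta below are hypothetical placeholders, not part of the original slides.

```python
# A minimal sketch contrasting document selection and document ranking.

def select(collection, q, f):
    """Document selection: keep the documents the binary indicator marks as relevant."""
    return [d for d in collection if f(d, q) == 1]

def rank(collection, q, f, theta=0.0):
    """Document ranking: sort documents by relevance score, keep those above the cutoff."""
    scored = sorted(((f(d, q), d) for d in collection), reverse=True)
    return [(score, d) for score, d in scored if score > theta]
```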
• With Document Selection, the classifier is inaccurate:
  – “Over-constrained” query (terms are too specific) ⇒ no relevant documents are found
  – “Under-constrained” query (terms are too general) ⇒ over-delivery
  – Even when the classifier is accurate, not all relevant documents are equally relevant
  – It is, however, easier to implement
• Document Ranking allows the user to control the boundary according to his/her preferences
• Measures for the evaluation of retrieval sets are needed
Evaluation of retrieval sets
• The two most frequent and basic measures for information retrieval are precision and recall. These are first defined for the simple case where the information retrieval system returns a set of documents for a query:

  Recall = |RelRetrieved| / |Rel in Collection|
  Precision = |RelRetrieved| / |Retrieved|

  [Figure: Venn diagram of all docs, the retrieved set, and the relevant set]

• The advantage of having two numbers is that one is more important than the other in many circumstances: Web surfers would like every result on the first page to be relevant (i.e., high precision), whereas professional searchers are very concerned with high recall and will tolerate low precision.
[Figure: three retrieval outcomes — very high precision but very low recall; high precision and high recall; high recall but low precision]
F-Measure
• The F-Measure is a single measure that trades off precision versus recall; it is the weighted harmonic mean of precision and recall:

  F = 1 / [ α (1/P) + (1 − α) (1/R) ] = (β² + 1) P R / (β² P + R),   where β² = (1 − α) / α

  – α > 1/2 (β < 1): precision is weighted more
  – α < 1/2 (β > 1): recall is weighted more
  – For β = 1: F = 2 P R / (P + R)
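The unranked-set measures above can be computed directly; here is a minimal sketch, where `retrieved` and `relevant` are hypothetical sets of document ids.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Precision, recall, and F-measure for an unranked retrieved set."""
    rel_retrieved = len(retrieved & relevant)
    p = rel_retrieved / len(retrieved) if retrieved else 0.0
    r = rel_retrieved / len(relevant) if relevant else 0.0
    b2 = beta * beta
    f = (b2 + 1) * p * r / (b2 * p + r) if (p + r) > 0 else 0.0
    return p, r, f

# Example: 3 of the 5 retrieved docs are relevant; 6 docs are relevant overall.
print(precision_recall_f({1, 2, 3, 4, 5}, {2, 3, 5, 7, 8, 9}))  # (0.6, 0.5, ~0.545)
```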
Evaluation of ranked retrieval results
• Precision and recall figures are appropriate for unranked retrieval sets. In a ranking context, appropriate sets of retrieved documents are given by the top k retrieved documents.
• For each such set, precision and recall values can be plotted to give a precision-recall curve. Precision-recall curves have a distinctive saw-tooth shape:
  – if the (k + 1)th document retrieved is non-relevant, then recall is the same as for the top k documents, but precision drops;
  – if it is relevant, then both precision and recall increase, and the curve jags up and to the right.
[Figure: a precision-recall curve with the characteristic saw-tooth shape]
Interpolated Average Precision
• It is often useful to remove the jiggles with an interpolated precision: the interpolated precision at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r:

  p_int(r) = max_{r′ ≥ r} p(r′)

  [Figure: precision-recall curve with the interpolated precision marked at recall level k]

• Precision should be measured at different levels of recall: this gives an average measure over many queries.
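A minimal sketch of both ideas, computing (recall, precision) points for the top-k sets of a ranked list and then interpolating; the ranked list of relevance judgments is a hypothetical example.

```python
def pr_points(ranking, total_relevant):
    """(recall, precision) after each of the top-k cutoffs of a ranked result list."""
    points, hits = [], 0
    for k, is_relevant in enumerate(ranking, start=1):
        hits += is_relevant
        points.append((hits / total_relevant, hits / k))
    return points

def interpolated_precision(points, r):
    """p_int(r) = highest precision at any recall level >= r."""
    return max((p for rec, p in points if rec >= r), default=0.0)

points = pr_points([True, False, True, True, False], total_relevant=4)
print([round(interpolated_precision(points, r / 10), 2) for r in range(11)])
```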
Different solutions (models)
• Boolean model
  – Based on the notion of sets
  – Documents are retrieved only if they satisfy the Boolean conditions specified in the query
  – Does not impose a ranking on retrieved documents
  – Exact match
• Vector space model
  – Based on geometry, the notion of vectors in a high-dimensional space
  – Documents are ranked based on their similarity to the query (ranked retrieval)
  – Best/partial match
• Language models
  – Based on the notion of probabilities and processes for generating text
  – Documents are ranked based on the probability that they generated the query
  – Best/partial match
Relevance models
[Figure: taxonomy of retrieval models, organized by how relevance is modeled]
• Relevance as similarity, ∆(Rep(q), Rep(d)), with different representations and similarity measures:
  – Vector space model (Salton et al., 75)
  – Probabilistic distribution model (Wong & Yao, 89)
• Relevance as probability of relevance, P(r=1|q,d), r ∈ {0,1}:
  – Regression model (Fox, 83)
  – Generative models:
    • Document generation: classical probabilistic model (Robertson & Sparck Jones, 76)
    • Query generation: language modeling approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Relevance as probabilistic inference, P(d → q) or P(q → d), with different inference systems:
  – Probabilistic concept space model (Wong & Yao, 95)
  – Inference network model (Turtle & Croft, 91)
• Boolean model
Vector Space Model: Relevance (d,q) = Similarity (d,q)
• Assumption: query and document are represented in the same way
• Retrieved documents are such that:
  R’(q) = {d ∈ C | f(d,q) > θ}
  where f(q,d) = ∆(Rep(q), Rep(d)), ∆ being a similarity measure and Rep a chosen representation for queries and documents
• Key issues are:
  – How to represent queries and documents
  – How to define the similarity measure ∆
• In the Vector Space Model approach, a document/query is represented by a term vector:
  – A term is a basic concept, e.g., a word or phrase
  – Each term defines one dimension
  – Elements of the vector correspond to term weights
  – E.g., d = (x1,…,xN), where xi is the “importance” of term i
• A collection of n documents with t distinct terms is represented by a (sparse) term-document matrix:
         T1    T2    …    Tt
  D1     w11   w21   …    wt1
  D2     w12   w22   …    wt2
  :       :     :          :
  Dn     w1n   w2n   …    wtn

• A query can also be represented as a vector, like a document
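A minimal sketch of building such a matrix from a toy collection (dense here for readability; real systems use sparse structures); the three documents are hypothetical.

```python
from collections import Counter

docs = ["java microsoft starbucks", "java java starbucks", "microsoft"]
terms = sorted({t for d in docs for t in d.split()})   # the dimensions of the space

matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts[t] for t in terms])          # one row of term weights per document

print(terms)
for row in matrix:
    print(row)
```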
Dimensions in document space
• Some terms must be assumed to be orthogonal, so that they form linearly independent basis vectors; they must be non-overlapping in meaning
• Which terms?
  – Remove stop words (common function words)
  – Stemming (standardize morphological variants; strip endings, etc.: eat, eating, … ⇒ eat)
[Figure: documents D1–D11 and a query plotted in a three-dimensional term space with axes “Starbucks”, “Microsoft”, and “Java”]
Term weighting
• Binary Weighting: only the presence (1) or absence (0) of a term is encoded in the vector
• Binary weighting is not effective.
  docs   t1   t2   t3
  D1      1    0    1
  D2      1    0    0
  D3      0    1    1
  D4      1    0    0
  D5      1    1    1
  D6      1    1    0
  D7      0    1    0
  D8      0    1    0
  D9      0    0    1
  D10     0    1    1
  D11     1    0    1
Empirical distribution of words
• There are stable, language-independent patterns in how people use natural languages: a few words occur very frequently, while most occur rarely. E.g., in news articles:
  – Top 4 words: 10–15% of word occurrences
  – Top 50 words: 35–40% of word occurrences
• The most frequent word in one corpus may be rare in another
• Zipf’s law, publicized by the Harvard linguist George Kingsley Zipf, states that in a corpus of natural language utterances, the frequency of any word is roughly inversely proportional to its rank in the frequency table. So the most frequent word occurs approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.
• Zipf’s Law: rank × frequency ≈ constant

  F(w) = C / r(w)^α,   with α ≈ 1, C ≈ 0.1

  [Figure: word frequency vs. word rank (by frequency) — the head of the curve is dominated by stop words, the middle contains the most useful words, and the long tail of most rare words takes up the biggest share of the data structure; the long tail implies that almost all words are rare]

• Generalized Zipf’s law, applicable in many domains:

  F(w) = C / [r(w) + B]^α
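A minimal sketch that checks the rank × relative-frequency ≈ constant relation empirically; the inline text is a toy placeholder that is far too small to show the law cleanly, but on a realistic English corpus the printed products stay roughly constant (around 0.1).

```python
from collections import Counter

# Toy text; substitute any large English corpus to see the effect clearly.
text = ("the cat sat on the mat and the dog sat on the log "
        "while the cat and the dog watched the other cat") * 50
counts = Counter(text.split())
total = sum(counts.values())

ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"{rank:2d}  {word:8s}  rank * rel_freq = {rank * freq / total:.3f}")
```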
• TF (Term Frequency) Weighting: accounts for the number of occurrences of term t in document d (salience of t in d). A term is more important if it occurs more frequently in a document
• There exist different formulas for the computation of TF. Let f(t,d) be the frequency count of term t in doc d:
  – Raw TF:  TF(t,d) = f(t,d)
  – Log TF:  TF(t,d) = 1 + ln(1 + ln(f(t,d)))
  docs   t1   t2   t3
  D1      2    0    3
  D2      1    0    0
  D3      0    4    7
  D4      3    0    0
  D5      1    6    3
  D6      3    5    0
  D7      0    8    0
  D8      0   10    0
  D9      0    0    1
  D10     0    3    5
  D11     4    0    1
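A minimal sketch of the raw-TF and log-TF formulas above; the counts passed in are hypothetical.

```python
import math

def raw_tf(count):
    return count

def log_tf(count):
    # 1 + ln(1 + ln(f(t,d))); only defined for counts >= 1
    return 1 + math.log(1 + math.log(count)) if count > 0 else 0.0

for c in (1, 3, 10):
    print(c, raw_tf(c), round(log_tf(c), 3))
# Log TF dampens repetition: counts 1, 3, 10 map to 1.0, ~1.74, ~2.19
```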
• It is important that TF is normalized, because of:
  – Document length variation
  – Repeated occurrences being less informative than the first occurrence
• Generally long docs should be penalized, but over-penalization should be avoided (pivoted normalization)
• Two views of document length:
  – A doc is long because it uses more words
  – A doc is long because it has more content
• Pivoted normalization uses the average document length to regularize normalization:
  normalizer = 1 − b + b (doclen / avgdoclen),   where b varies from 0 to 1

  [Figure: normalized TF as a function of raw TF under pivoted normalization]

• “Okapi/BM25 TF”:
  TF(t,d) = k f(t,d) / [ f(t,d) + k (1 − b + b (doclen / avgdoclen)) ],   where k and b are parameters
  – doclen = avgdoclen ⇒ normalizer = 1 ⇒ TF = TFref
  – doclen > avgdoclen ⇒ normalizer > 1 ⇒ TF < TFref
  – doclen < avgdoclen ⇒ normalizer < 1 ⇒ TF > TFref
  (TFref denotes the TF value obtained at the average document length)
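A minimal sketch of the Okapi/BM25-style TF above; k, b, and the document statistics in the example are hypothetical values.

```python
def bm25_tf(count, doclen, avgdoclen, k=1.2, b=0.75):
    """k * f / (f + k * (1 - b + b * doclen/avgdoclen)), as in the slide."""
    norm = 1 - b + b * (doclen / avgdoclen)   # pivoted length normalization
    return k * count / (count + k * norm) if count > 0 else 0.0

# Same raw count, different document lengths:
print(bm25_tf(3, doclen=100, avgdoclen=100))  # reference length     -> ~0.86
print(bm25_tf(3, doclen=300, avgdoclen=100))  # longer than average  -> ~0.60
print(bm25_tf(3, doclen=50,  avgdoclen=100))  # shorter than average -> ~0.96
```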
• IDF (Inverse Document Frequency) Weighting: accounts for the number of documents in which t appears (informativeness of t). A term is more discriminative if it occurs in fewer documents
  IDF(t) = 1 + log(n/k)
  where:
  – n = total number of docs
  – k = number of docs containing term t (document frequency)
  – n = k ⇒ log(n/k) = 0 ⇒ IDF = 1
  – n > k ⇒ log(n/k) > 0 ⇒ IDF > 1
• IDF provides high values for rare words and low values for common words
• For a collection of 10000 documents (base-10 logarithms):
  log(10000/10000) = 0
  log(10000/5000) = 0.301
  log(10000/20) = 2.698
  log(10000/1) = 4
• TF-IDF Weighting: a more effective weighting is obtained when weights are assigned by combining the two basic heuristics TF and IDF
• TF-IDF Weighting assigns the document weight as weight(t,d) = TF(t,d) × IDF(t). This implies that a term that is common in the doc (high TF) and rare in the collection (high IDF) gets the highest TF-IDF weight
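A minimal sketch of TF-IDF weighting over a toy collection (the documents reuse the example that follows below); the whitespace tokenizer is a simplification.

```python
import math
from collections import Counter

docs = ["information retrieval search engine information",
        "travel information",
        "map travel government president congress"]

tokenized = [d.split() for d in docs]
n = len(tokenized)
df = Counter(t for toks in tokenized for t in set(toks))   # document frequency

def tf_idf(term, toks):
    tf = toks.count(term)                                  # raw TF
    idf = 1 + math.log(n / df[term]) if df[term] else 0.0
    return tf * idf

print(round(tf_idf("information", tokenized[0]), 2))  # frequent in doc1, common in collection
print(round(tf_idf("congress", tokenized[2]), 2))     # rare in the collection -> high IDF
```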
Similarity measures in vector space
With the vector space model, similarity has a geometric interpretation.
Assumption: documents that are “close together” in space are similar in meaning.
Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3

  [Figure: D1, D2, and Q plotted in the three-dimensional space spanned by T1, T2, T3]
• One measure of similarity between two vectors is the angle between them:
  – 0°: overlapping vectors ⇒ identical
  – 90°: orthogonal vectors ⇒ totally dissimilar
  – The smaller the angle, the more similar Q is to D
• The cosine of the angle varies monotonically from 0 to 1 as the angle varies from 90° to 0°.
  – For unit-length vectors, the cosine is the dot product:
    cos(x, y) = x · y = Σ_{i=1..n} xi yi
• For non-normalized vectors, the following expression can be used for similarity, also called the normalized correlation coefficient:
    cos(x, y) = (x · y) / (|x| |y|) = Σ_{i=1..n} xi yi / ( sqrt(Σ_{i=1..n} xi²) × sqrt(Σ_{i=1..n} yi²) )
  where:
  – the dot product measures vector correlation
  – the denominator normalizes for length
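A minimal sketch of the normalized correlation (cosine) similarity, applied to the example vectors D1, D2, and Q above.

```python
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(round(cosine(Q, D1), 3))   # ~0.811: D1 forms a smaller angle with Q
print(round(cosine(Q, D2), 3))   # ~0.130
```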
Example: TF-IDF & Dot Product
• Document contents:
  doc1 = “information retrieval search engine information”
  doc2 = “travel information”
  doc3 = “map travel government president congress ……”
  query = “information retrieval”
• IDF values: info 2.4, retrieval 4.5, travel 2.8, map 3.3, search 2.1, engine 5.4, govern 2.2, president 3.2, congress 4.3

  [Table: TF-IDF weights per document and for the query; e.g., doc1 has info 4.8 (TF = 2), retrieval 4.5, search 2.1, engine 5.4; doc2 has info 2.4; doc3 contains no query term; the query has info 2.4 and retrieval 4.5]

• Dot-product similarities:
  Sim(q,doc1) = 4.8 × 2.4 + 4.5 × 4.5
  Sim(q,doc2) = 2.4 × 2.4
  Sim(q,doc3) = 0
Most commonly used similarity measures:
  – Simple matching (coordination level match):  |Q ∩ D|
  – Dice’s Coefficient:      2 |Q ∩ D| / (|Q| + |D|)
  – Jaccard’s Coefficient:   |Q ∩ D| / |Q ∪ D|
  – Cosine Coefficient:      Q • D / (|Q|^½ × |D|^½)
  – Overlap Coefficient:     |Q ∩ D| / min(|Q|, |D|)
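A minimal sketch of these coefficients, treating the query and a document as sets of terms (binary weighting); the example sets are hypothetical.

```python
import math

def simple_matching(Q, D): return len(Q & D)
def dice(Q, D):            return 2 * len(Q & D) / (len(Q) + len(D))
def jaccard(Q, D):         return len(Q & D) / len(Q | D)
def cosine_coef(Q, D):     return len(Q & D) / math.sqrt(len(Q) * len(D))
def overlap(Q, D):         return len(Q & D) / min(len(Q), len(D))

Q = {"information", "retrieval"}
D = {"information", "retrieval", "search", "engine"}
print(simple_matching(Q, D), dice(Q, D), jaccard(Q, D),
      cosine_coef(Q, D), overlap(Q, D))   # 2  0.666…  0.5  0.7071…  1.0
```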
Criticisms
• Unwarranted orthogonality assumptions
• Reliance on terms:
  – Ambiguity: many terms have more than one meaning (affects precision)
  – Synonymy: many concepts can be expressed by more than one term (affects recall)
• Nevertheless, the vector space model is highly effective
Vector Space Model extensions:
from terms to concepts
• Latent semantic indexing
  – Dimensionality reduction (Singular Value Decomposition); see the sketch after this list
  – Projects vectors in the document-by-term space onto a lower-dimensionality document-by-concept space
  – Leverages term co-occurrence in documents to approximate “latent concepts”
• Blind relevance feedback
  – Adds terms from the top-ranked documents to a new query (another way to leverage term co-occurrence)
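A minimal sketch of LSI-style dimensionality reduction via a truncated SVD; the tiny term-document matrix and the choice of k = 2 concepts are hypothetical.

```python
import numpy as np

# rows = terms, columns = documents (toy counts)
A = np.array([[2., 1., 0.],
              [3., 0., 1.],
              [0., 4., 4.],
              [0., 3., 5.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                         # number of latent concepts to keep
doc_concepts = (np.diag(s[:k]) @ Vt[:k]).T    # documents projected into concept space

print(doc_concepts)                           # one k-dimensional row per document
```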