Information Retrieval

Introduction to Information Retrieval
Definition
• Information retrieval (IR) is the task of
– finding material (usually documents)
– of an unstructured nature (usually text)
– that satisfies an information need (usually expressed as a query)
– from within large collections (usually stored on computers).
Structured vs. Unstructured
• Text has a natural (linguistic) structure
– Sequential structure
– Latent grammatical structure
– Latent logical representations
– Connections to knowledge bases
• When we say ‘unstructured data’, we really mean that we’re just going to ignore all of that structure.
A more detailed look at the task
Information retrieval can involve:
• Filtering: Finding the set of relevant search results (Boolean
retrieval)
• Organizing
– Often ranking
– Can be more complicated: clustering the results, classifying into
different categories, …
• User interface
– Can be simple HTML (e.g., Google)
– Can involve complex visualization techniques, to allow the user to navigate the space of potentially relevant search results
• In general, the system tries to reduce the user’s effort in finding exactly the information they’re looking for.
Preliminaries
Terminology and Preprocessing
Terminology
• Collection (or corpus): a set of documents
• Word token (or just token): a sequence of characters in
a document that constitutes a meaningful semantic unit
• Word type (or just type): an equivalence class of tokens
with the exact same sequence of characters
• Index: a data structure used for efficient processing of
large datasets
• Inverted index: The name for the main data structure
used in information retrieval
• Term: An equivalence class of word types, perhaps
transformed, used to build an inverted index.
• Vocabulary: the set of distinct terms in an index.
Preprocessing a corpus
1. Tokenize: split a document into tokens
– Mostly easy in English
– Hard in, e.g., Mandarin or Japanese
2. Morphological processing, which may include
– Stemming / lemmatization (removing inconsequential endings of words)
– Removing capitalization, punctuation, or diacritics and accents
– Identifying highly synonymous terms, or “equivalence classes” (e.g., Jan. and January)
– Removing very common word types (“stopwords”)
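As a rough illustration, here is a minimal Python sketch of this preprocessing pipeline (the stopword list and the drop-the-final-s "stemmer" are toy placeholders, not what a real system would use):

    import re

    STOPWORDS = {"the", "a", "an", "and", "or", "to", "of"}   # toy stopword list

    def preprocess(document):
        terms = []
        for tok in document.split():                      # 1. tokenize on whitespace
            tok = re.sub(r"[^\w]", "", tok.lower())       # 2. lowercase, strip punctuation
            if not tok or tok in STOPWORDS:               # drop stopwords and empty strings
                continue
            if tok.endswith("s") and len(tok) > 3:        # crude stand-in for stemming
                tok = tok[:-1]
            terms.append(tok)
        return terms

    print(preprocess("We like to eat, eat, eat apples and bananas"))
    # ['we', 'like', 'eat', 'eat', 'eat', 'apple', 'banana']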
Tokenizing English
• Easy rule (about 90% accurate):
split on whitespace
• Some special cases:
– Contractions: I’m → I + ’m, Jack’s → Jack + ’s, aren’t → are + n’t
– Multi-word expressions: New York, Hong Kong
– Hyphenation: “travel between San Francisco-Los Angeles” vs. co-occurrence or Hewlett-Packard.
Tokenization in other languages
• Word segmentation: In East Asian character-based writing systems (Mandarin, Thai, Korean, Japanese, …), word boundaries are not indicated by whitespace.
• In Germanic languages, words can often be compounded together to form very long words.
• Many languages have special forms of contractions.
The Inverted Index
Indexing
• Indexing is a technique borrowed from
databases
• An index is a data structure that supports
efficient lookups in a large data set
– E.g., hash indexes, R-trees, B-trees, etc.
Inverted Index
An inverted index stores an entry for every word, and a pointer to every document where that word is seen.

Vocabulary    Postings List
term1      →  Document17, Document45123
  .
  .
  .
termN      →  Document991, Document123001
Example
Document D1: “yes we got no bananas”
Document D2: “Johnny Appleseed planted apple seeds.”
Document D3: “we like to eat, eat, eat apples and bananas”

Vocabulary    Postings List
yes        →  D1
we         →  D1, D3
got        →  D1
no         →  D1
bananas    →  D1, D3
Johnny     →  D2
Appleseed  →  D2
planted    →  D2
apple      →  D2, D3
seeds      →  D2
like       →  D3
to         →  D3
eat        →  D3
and        →  D3

Query “apples bananas”:
“apples”  → {D2, D3}
“bananas” → {D1, D3}
The whole query gives the intersection: {D2, D3} ∩ {D1, D3} = {D3}
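A toy version of this index in Python, using preprocess() from the earlier sketch (so “apples” and “bananas” collapse to the terms “apple” and “banana”, and stopwords like “to” and “and” are dropped). Boolean AND is just set intersection here; a real engine merges sorted postings lists instead:

    docs = {
        "D1": "yes we got no bananas",
        "D2": "Johnny Appleseed planted apple seeds.",
        "D3": "we like to eat, eat, eat apples and bananas",
    }

    index = {}
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index.setdefault(term, set()).add(doc_id)    # postings as a set

    def boolean_and(*terms):
        result = index.get(terms[0], set())
        for t in terms[1:]:
            result = result & index.get(t, set())        # intersect postings
        return result

    print(boolean_and("apple", "banana"))                # {'D3'}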
Variations
A word-level index stores document IDs and offsets for the position of the words in each document.
• Why?
– Supports phrase-based queries

Vocabulary    Postings List
yes        →  D1 (+0)
we         →  D1 (+1), D3 (+0)
got        →  D1 (+2)
no         →  D1 (+3)
bananas    →  D1 (+4), D3 (+8)
…
eat        →  D3 (+3,+4,+5)
…
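A word-level variant of the same toy index, keeping raw token offsets (no stopword removal here, so the offsets match the table above):

    positional = {}
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.split()):
            term = tok.strip(".,").lower()
            positional.setdefault(term, {}).setdefault(doc_id, []).append(pos)

    print(positional["eat"])        # {'D3': [3, 4, 5]}
    print(positional["bananas"])    # {'D1': [4], 'D3': [8]}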
Variations
• Real search engines add all kinds of other information to their postings lists
– for efficiency
– to support better ranking of results

Vocabulary       Postings List
yes (docs=1)  →  D1 (freq=1)
we (docs=2)   →  D1 (freq=1), D3 (freq=1)
…
eat (docs=1)  →  D3 (freq=3)
Index Construction
Algorithm:
1. Scan through each document, word by word
– Write a (term, docID) pair for each word to a TempIndex file
2. Sort TempIndex by terms
3. Iterate through the sorted TempIndex: merge all entries for the same term into one postings list.
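The same three steps on the toy corpus, again leaning on preprocess() and docs from the earlier sketches (a real implementation writes TempIndex to disk; here it is just a list):

    from itertools import groupby

    temp_index = []
    for doc_id, text in docs.items():
        for term in preprocess(text):
            temp_index.append((term, doc_id))    # step 1: (term, docID) pairs

    temp_index.sort()                            # step 2: sort by term

    final_index = {}
    for term, group in groupby(temp_index, key=lambda pair: pair[0]):
        # step 3: merge all entries for one term into a single postings list
        final_index[term] = sorted({doc_id for _, doc_id in group})

    print(final_index["banana"])                 # ['D1', 'D3']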
Index Construction
Algorithm:
1. Scan through each document, word by word
– Write a (term, docID) pair for each word to a TempIndex file

Document D1: “yes we got no bananas”
Document D2: “Johnny Appleseed planted apple seeds.”
Document D3: “we like to eat, eat, eat apples and bananas”

TempIndex:
yes       → D1
we        → D1
got       → D1
no        → D1
bananas   → D1
Johnny    → D2
Appleseed → D2
planted   → D2
apple     → D2
seeds     → D2
we        → D3
like      → D3
to        → D3
eat       → D3
eat       → D3
eat       → D3
apples    → D3
and       → D3
bananas   → D3
Index Construction
Algorithm:
2. Sort the TempIndex by terms

TempIndex (sorted):
and       → D3
apple     → D2
apple     → D3
Appleseed → D2
bananas   → D1
bananas   → D3
eat       → D3
eat       → D3
eat       → D3
got       → D1
Johnny    → D2
like      → D3
no        → D1
planted   → D2
seeds     → D2
to        → D3
we        → D1
we        → D3
yes       → D1
Index Construction
Algorithm:
3. Merge postings lists for matching terms

Merging proceeds term by term. Part-way through (after apple, bananas, and eat have been merged):

TempIndex:
and       → D3
apple     → D2, D3
Appleseed → D2
bananas   → D1, D3
eat       → D3
got       → D1
Johnny    → D2
like      → D3
no        → D1
planted   → D2
seeds     → D2
to        → D3
we        → D1
we        → D3
yes       → D1
Index Construction
Algorithm:
3. Merge postings lists for matching terms

Final Index:
and       → D3
apple     → D2, D3
Appleseed → D2
bananas   → D1, D3
eat       → D3
got       → D1
Johnny    → D2
like      → D3
no        → D1
planted   → D2
seeds     → D2
to        → D3
we        → D1, D3
yes       → D1
Efficient Index Construction
Problem: Indexes can be huge. How can we build them efficiently?
– Blocked Sort-Based Indexing (BSBI)
– Single-Pass In-Memory Indexing (SPIMI)
What’s the difference?
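As a sketch of the blocked, sort-based idea (not the full BSBI or SPIMI algorithms): sort runs of (term, docID) pairs that fit in memory, then k-way merge the sorted runs. Here the “runs” are in-memory lists; a real system would write each run to disk and stream the merge:

    import heapq

    def build_runs(pairs, block_size):
        # sort each block of pairs independently (one "disk run" per block)
        return [sorted(pairs[i:i + block_size])
                for i in range(0, len(pairs), block_size)]

    def merge_runs(runs):
        index = {}
        for term, doc_id in heapq.merge(*runs):      # k-way merge of sorted runs
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)              # skip duplicate docIDs
        return index

    merged = merge_runs(build_runs(temp_index, block_size=5))
    print(merged["banana"])                          # ['D1', 'D3']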
Vector Space Model
Problem: Too many matching results for every query
Using an inverted index is all well and good, but if your document collection has 10^12 documents and someone searches for “banana”, they’ll get 90 million results.
We need to be able to return the “most relevant” results.
We need to rank the results.
Vector Space Model
Idea: treat each document and query as a
vector in a vector space.
Then, we can find “most relevant” documents by finding the
“closest” vectors.
• But how can we make a document into a vector?
• And how do we measure “closest”?
Example: Documents as Vectors
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”

            yes  we  got  no  bananas  what  you  I  like
Vector V1:   1    1   1    1     1       0     0   0   0
Vector V2:   0    0   1    0     0       1     1   0   0
Vector V3:   1    0   1    0     0       1     1   1   1
What about queries?
The vector space model treats queries as (very short) documents.
Example query: “you got”

            yes  we  got  no  bananas  what  you  I  like
Query Q1:    0    0   1    0     0       0     1   0   0
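A sketch of the document-to-vector mapping for this example, using raw words (no preprocessing) over a fixed vocabulary, as in the tables above:

    vocab = ["yes", "we", "got", "no", "bananas", "what", "you", "I", "like"]

    def to_vector(text):
        words = set(text.split())
        return [1 if w in words else 0 for w in vocab]   # Boolean vector

    v1 = to_vector("yes we got no bananas")      # [1, 1, 1, 1, 1, 0, 0, 0, 0]
    v2 = to_vector("what you got")               # [0, 0, 1, 0, 0, 1, 1, 0, 0]
    v3 = to_vector("yes I like what you got")    # [1, 0, 1, 0, 0, 1, 1, 1, 1]
    q1 = to_vector("you got")                    # [0, 0, 1, 0, 0, 0, 1, 0, 0]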
Measuring Similarity
Similarity metric: the cosine of the angle between document vectors.
“Cosine Similarity”:

CS(v1, v2) = cos θ_{v1,v2} = (v1 · v2) / (‖v1‖ ‖v2‖)
Why cosine?
It gives some intuitive similarity judgments:
cos(0)   = cos(0°)   =  1
cos(π/4) = cos(45°)  ≈  0.71
cos(π/2) = cos(90°)  =  0
cos(π)   = cos(180°) = −1
Example: Computing relevance

            yes  we  got  no  bananas  what  you  I  like
Query Q1:    0    0   1    0     0       0     1   0   0
Vector V1:   1    1   1    1     1       0     0   0   0
Vector V2:   0    0   1    0     0       1     1   0   0
Vector V3:   1    0   1    0     0       1     1   1   1

CS(q1, v1) = (q1 · v1) / (‖q1‖ ‖v1‖) = 1 / (√2 √5) ≈ 0.32
CS(q1, v2) = (q1 · v2) / (‖q1‖ ‖v2‖) = 2 / (√2 √3) ≈ 0.82
CS(q1, v3) = (q1 · v3) / (‖q1‖ ‖v3‖) = 2 / (√2 √6) ≈ 0.58
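The same numbers, computed directly (cosine_similarity is a hypothetical helper; v1, v2, v3, and q1 come from the earlier vector sketch):

    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    for name, v in [("V1", v1), ("V2", v2), ("V3", v3)]:
        print(name, round(cosine_similarity(q1, v), 3))
    # V1 0.316, V2 0.816, V3 0.577  ->  ranking: D2, D3, D1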
Relevance Ranking in the Vector Space Model
Definition: the relevance between query q and document d in the VSM is:

rel(q, d) = (v_q · v_d) / (‖v_q‖ ‖v_d‖)

For a given query, the VSM ranks documents from largest to smallest relevance score.
All words are equal?
Our example so far has converted documents to Boolean vectors, where each dimension indicates whether a term is present or not.
Problems:
• If a term appears many times in a document, it should probably count more.
• Some words are more “informative” than others (e.g., stopwords are not informative).
• Longer documents contain more terms, and will therefore be considered relevant to more queries, but perhaps not for good reason.
Weighting Scheme
• A weighting scheme is a technique for converting documents to real-valued vectors (rather than Boolean vectors).
• The real value for a term in a document is called the weight of that term in that document.
– We’ll write this as w_{t,d}
Term Frequency Weighting
• Term Frequency (TF) weighting is a heuristic that sets the weight of a term to be the number of times it appears in the document:

w^{tf}_{t,d} = count(t, d)
Log-TF
• A common variant of TF is to reduce the effect of high-frequency terms by taking the logarithm:

w^{log-tf}_{t,d} = 1 + log(count(t, d))   if count(t, d) ≥ 1
                 = 0                      if count(t, d) = 0
Inverse Collection Frequency Weighting
• Inverse Collection Frequency (ICF) weighting is a heuristic that reduces the weight of a term according to the number of times it appears in the whole collection.
– This makes really common words (e.g., “the”) have really low weight.

w^{icf}_{t,d} = log( T / count(t, C) )

where T is the total number of tokens in collection C.
Inverse Document Frequency Weighting
• Inverse Document Frequency (IDF) weighting is a heuristic that reduces the weight of a term according to the number of documents in the whole collection it appears in.
– This makes really common words (e.g., “the”) have really low weight.

w^{idf}_{t,d} = log( N / count(d' : t ∈ d') )

where N is the total number of documents in collection C.
ICF vs. IDF

Word       CF     DF    ICF     IDF
try        10422  8760  0.3010  0.1632
insurance  10440  3997  0.3007  0.5040

Nobody uses ICF!
– Informative words tend to “clump together” in documents
– As a result, words may appear many times, but only in a few documents.
– This may make their CF large, while their DF isn’t as big.
– In practice, IDF is better at discriminating between “informative” and “non-informative” terms.
Term Frequency – Inverse Document Frequency
TF-IDF weighting combines TF and IDF (duh):

w^{tf-idf}_{t,d} = w^{tf}_{t,d} × w^{idf}_{t,d} = count(t, d) × log( N / count(d' : t ∈ d') )

Probably the most common weighting scheme in practice.
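A sketch of TF-IDF scoring over the toy corpus, reusing docs, preprocess(), and final_index from the earlier sketches (natural log here; the log base only rescales all scores equally):

    import math

    N = len(docs)                                    # total number of documents

    def tf_idf(term, doc_id):
        tf = preprocess(docs[doc_id]).count(term)    # count(t, d)
        df = len(final_index.get(term, []))          # documents containing t
        if tf == 0 or df == 0:
            return 0.0
        return tf * math.log(N / df)                 # w_tf * w_idf

    print(tf_idf("eat", "D3"))    # 3 * log(3/1) ≈ 3.296
    print(tf_idf("we", "D3"))     # 1 * log(3/2) ≈ 0.405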
Limitations
The vector space model has the following limitations:
• Long documents are poorly represented because they
have poor similarity values.
• Semantic sensitivity: documents with similar context but
different term vocabulary won't be associated, resulting
in a false negative match.
• The order in which the terms appear in the document is
lost in the vector space representation.
Language Models for Ranking
VSM and heuristics
• One complaint about the VSM is that it relies too heavily on ad hoc heuristics
– Why should similarity be cosine similarity, as opposed to some other similarity function?
– TF-IDF seems like a hack
• Can we formulate a more principled approach that works well?
Language Models
• In theory classes, you’ve all seen models
that “accept” or “reject” a string:
– Finite automata (deterministic and non-)
– Context-sensitive and context-free grammars
– Turing machines
• Language models are probabilistic
versions of these models
– Instead of saying “yes” or “no” to each string,
they assign a probability of acceptance
Language Model Definition
• Let V be the vocabulary (set of word types) for a language.
• A language model is a distribution P(·) over V*, the set of sequences made from V.
Example Language Models
• Really simple: If string s contains “the”, P(s) = 1; otherwise, P(s) = 0.*
• Slightly less simple: let p(“the”) = count(“the”, C) / T. If string s contains “the”, let P(s) = p(“the”).
• More sophisticated: P(s) = ∏_{terms t ∈ s} p(t)

*This is not a proper distribution, but for ranking, we’re not going to be too picky.
Ranking with a Language Model
We can use a language model to rank documents for a given query, by determining:

P(d | q)

and ranking according to the probability scores.
Query Likelihood
• Most (but not all) techniques apply Bayes’ Rule to this score:

P(d | q) = P(q | d) P(d) / P(q)

• Since P(q) doesn’t depend on the document, it’s not interesting for ranking, so we can get rid of it.
• Most systems ignore P(d) as well (although there are good reasons for not ignoring it).
• What’s left is called the query likelihood: P(q | d)
Language Models for Query Likelihood
To compute the query likelihood:
1. Systems automatically learn a language model P_{LM(d)} from each document d
2. They then set P(q | d) = P_{LM(d)}(q).
Two questions:
1. What language model to use?
2. How to automatically learn it from d?
Multinomial Model
Multinomial models are the standard choice for LMs for information retrieval.

[Figure: an urn of colored balls (the language model) generating a document]

This model generates one word at a time (with replacement), with probability equal to the fraction of balls with that color in the urn.
Multinomial Model
The multinomial model throws out the sequence information…

[Figure: urn model and a document reduced to per-word counts, e.g., 6, 9, 3, 4, 0, 0, 2]
Multinomial Model
Question: what’s the probability of observing a document with these particular frequencies for the words?
Multinomial Model
Question: what’s the probability of observing a document with these particular frequencies for the words, given our language model?
Answer: Let p_w be the probability of word w, according to the language model (frequencies in the urn). And let c_w be the count of word w in the document.

P_{LM}(d) = ( length(d)! / ∏_{w∈V} c_w! ) · ∏_{w∈V} (p_w)^{c_w}
Simplifications
• Starting with

P_{LM}(d) = ( length(d)! / ∏_{w∈V} c_w! ) · ∏_{w∈V} (p_w)^{c_w}

• Since we’re only going to compute P_{LM}(q), and q’s length is a constant for a given query, we can simplify to:

P_{LM}(q) = ∏_{w∈V} (p_w)^{c_w}
Learning a Multinomial Model
• To use the multinomial model for ranking, we need to be able to compute P_{LM(d)}(q).
– That’s the probability of query q, according to the language model we learned from document d.
• But what is P_{LM(d)}?
– We need to determine all of the parameters of the LM.
– For the multinomial model, this means we need to find p_w for each word w.
Maximum Likelihood Parameter Estimation
• To find good settings of p_w from document d, we’ll use a technique called maximum likelihood estimation (MLE)
– This means: set p_w to the value that makes P_{LM(d)}(d) biggest (most likely).
• It’s possible to prove (on board) that the MLE for p_w is:

p_w = c_w / Σ_{w'∈V} c_{w'}
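A sketch of the MLE for a document’s multinomial model, again using preprocess() from the earlier sketch (each p_w is just the term’s relative frequency in the document):

    from collections import Counter

    def mle_model(text):
        counts = Counter(preprocess(text))
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}   # p_w = c_w / sum of c_w'

    print(mle_model("we like to eat, eat, eat apples and bananas"))
    # {'we': 1/7, 'like': 1/7, 'eat': 3/7, 'apple': 1/7, 'banana': 1/7}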
Missing Words
• Let’s say you’re searching for “cheap digital cameras” and document d is all about this topic, but never mentions “digital” explicitly.
• What’s P_{LM(d)}(q)?
– p_{“digital”} = 0
– therefore, P_{LM(d)}(q) = 0
– So, this document will get ranked at the bottom.
Smoothing
• Mission: to allow documents that are missing query words to have non-zero query likelihood.
• Method:
1. Construct a “backoff” language model, estimated from the whole corpus: P_{LM(C)}(·)
2. Combine P_{LM(C)} and P_{LM(d)}:

P'_{LM(d)}(q) = α P_{LM(d)}(q) + (1 − α) P_{LM(C)}(q)
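A sketch of smoothed query likelihood ranking with this interpolation, using mle_model() and docs from the earlier sketches (α = 0.8 is an arbitrary choice, not a value from the slides):

    ALPHA = 0.8
    corpus_model = mle_model(" ".join(docs.values()))    # backoff model P_LM(C)

    def query_likelihood(query, doc_id):
        doc_model = mle_model(docs[doc_id])
        score = 1.0
        for w in preprocess(query):
            # P'_LM(d): interpolate the document model with the corpus model
            score *= ALPHA * doc_model.get(w, 0.0) + (1 - ALPHA) * corpus_model.get(w, 0.0)
        return score

    query = "apples bananas"
    for d in sorted(docs, key=lambda d: query_likelihood(query, d), reverse=True):
        print(d, query_likelihood(query, d))
    # D3 ranks first; D1 and D2 still get nonzero scores thanks to smoothing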
Experimental Results
Comparison between LM and VSM
• Advantages of LM:
– Experimentally, seems to be more accurate
– Mathematically, more principled
• Multinomial model directly incorporates notions of TF and
IDF
• Disadvantages:
– It’s “harder” to incorporate some of the other
techniques that are widely used in information
retrieval (which we’re not going to discuss)
• Relevance feedback
• Personalization
– It can be computationally expensive
Homework
• Gather a corpus of documents
– Approximately 1 million tokens worth of text
– Divide it into distinct “documents”
– Tokenize it
• Potential sources
– Wikipedia
– News sites or web portals (like Yahoo)
– Collections of literature (see, for instance, the American National Corpus)
– Web crawl