Indexing Implementation and Indexing Models
CSC 575
Intelligent Information Retrieval
[Diagram: overview of an IR system — an information need is expressed as a query; document collections are pre-processed (lexical analysis, stop-word removal), parsed, and indexed; the query is matched against the index and result sets are ranked. Focus of this lecture: how is the index constructed?]
Indexing Implementation
 Inverted files
 Primary data structure for text indexes
 Source file: collection, organized by document
 Inverted file: collection organized by term (one record per term, listing
locations where term occurs)
 Query: traverse the lists for each query term
OR: the union of the component lists
AND: the intersection of the component lists
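As a minimal sketch (not from the slides), assuming each term's postings are kept as a sorted list of document IDs, OR and AND reduce to list union and intersection:

```python
# Hypothetical sketch: Boolean AND/OR over sorted postings lists of doc IDs.
def intersect(p1, p2):
    """AND: documents that appear in both postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: documents that appear in either postings list."""
    return sorted(set(p1) | set(p2))

# Example inverted file: term -> sorted list of doc IDs (made-up data)
index = {"nova": [1, 3], "galaxy": [1, 2, 3, 5]}
print(intersect(index["nova"], index["galaxy"]))  # AND -> [1, 3]
print(union(index["nova"], index["galaxy"]))      # OR  -> [1, 2, 3, 5]
```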
The Vector-Space Model for IR
 Based on the view of documents as vectors in an n-dimensional space
n = the number of index terms used for indexing
Each document is a bag of words (a vector) with a direction and a magnitude
The Vector Space Model
 Vocabulary V = the set of terms left after pre-processing
the text (tokenization, stop-word removal, stemming, ...).
 Each document or query is represented as an n-dimensional vector (n = |V|):
dj = [w1j, w2j, ..., wnj]
wij is the weight of term i in document j.
the terms in V form the orthogonal dimensions of a vector space
 Document = Bag of words:
Vector representation doesn’t consider the ordering of words:
John is quicker than Mary vs. Mary is quicker than John.
Document Vectors and Indexes
 Conceptually, the index can be viewed as a document-term matrix
 Each document is represented as an n-dimensional vector (n = no. of terms in
the dictionary)
 Term weights represent the scalar value of each dimension in a document
 The inverted file structure is an “implementation model” used in practice to
store the information captured in this conceptual representation
[Example figure: a document-term matrix — the dictionary terms (nova, galaxy, heat, hollywood, film, role, diet, fur) label the rows, documents A through I label the columns, and each cell holds a (normalized) term weight; each column is a document vector.]
Example: Documents and Query in 3D Space
 Documents in term space
 Terms are usually stems
 Documents (and the query) are represented as vectors of terms
 Query and Document weights
 based on length and direction of their vector
 Why use this representation?
 A vector distance measure between the query and documents can be used to
rank retrieved documents
Recall: Inverted Index Construction
 Invert documents into a big index
 vector file “inverted” so that rows become columns and columns become rows
 Basic idea:
 list all the tokens in the collection
 for each token, list all the docs it occurs in (together with frequency info.)
Term-by-document view (one row per term, one column per document):

docs:  D1  D2  D3  D4  D5  D6  D7  D8  D9  D10
t1      1   1   0   1   1   1   0   0   0   0
t2      0   0   1   0   1   1   1   1   0   1
t3      1   0   1   0   1   0   0   0   1   1

Document-by-term view (one row per document, i.e. document vectors):

docs:  t1  t2  t3
D1      1   0   1
D2      1   0   0
D3      0   1   1
D4      1   0   0
D5      1   1   1
D6      1   1   0
D7      0   1   0
…

Sparse matrix representation: in practice this data is very sparse; we do not need to store all the 0's. Hence, the sorted array implementation …
How Are Inverted Files Created
 Sorted Array Implementation
 Documents are parsed to extract tokens. These are
saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term       Doc #
now         1
is          1
the         1
time        1
for         1
all         1
good        1
men         1
to          1
come        1
to          1
the         1
aid         1
of          1
their       1
country     1
it          2
was         2
a           2
dark        2
and         2
stormy      2
night       2
in          2
the         2
country     2
manor       2
the         2
time        2
was         2
past        2
midnight    2
How Inverted Files are Created
 After all documents have been parsed, the inverted
file is sorted (with duplicates retained for
within-document frequency stats)
 If frequency information is not needed, then the
inverted file can be sorted with duplicates
removed.
The same list sorted by term (duplicates retained):

Term       Doc #
a           2
aid         1
all         1
and         2
come        1
country     1
country     2
dark        2
for         1
good        1
in          2
is          1
it          2
manor       2
men         1
midnight    2
night       2
now         1
of          1
past        2
stormy      2
the         1
the         1
the         2
the         2
their       1
time        1
time        2
to          1
to          1
was         2
was         2
How Inverted Files are Created
 Multiple term entries for a
single document are merged
 Within-document term
frequency information is
compiled
 If proximity operators are
needed, then the location of
each occurrence of the term
must also be stored.
 Terms are usually represented by unique integers (term
IDs) to fix the record size and minimize storage space.
After merging, with within-document frequencies:

Term       Doc #   Freq
a           2       1
aid         1       1
all         1       1
and         2       1
come        1       1
country     1       1
country     2       1
dark        2       1
for         1       1
good        1       1
in          2       1
is          1       1
it          2       1
manor       2       1
men         1       1
midnight    2       1
night       2       1
now         1       1
of          1       1
past        2       1
stormy      2       1
the         1       2
the         2       2
their       1       1
time        1       1
time        2       1
to          1       2
was         2       2
How Inverted Files are Created
Then the file can be split into a Dictionary and a Postings file
Dictionary:

Term       N docs (DF)   Tot Freq
a              1            1
aid            1            1
all            1            1
and            1            1
come           1            1
country        2            2
dark           1            1
for            1            1
good           1            1
in             1            1
is             1            1
it             1            1
manor          1            1
men            1            1
midnight       1            1
night          1            1
now            1            1
of             1            1
past           1            1
stormy         1            1
the            2            4
their          1            1
time           2            2
to             1            2
was            1            2

Postings — one (Doc #, Freq) entry per (term, document) pair, in dictionary order:
a: (2,1); aid: (1,1); all: (1,1); and: (2,1); come: (1,1); country: (1,1) (2,1);
dark: (2,1); for: (1,1); good: (1,1); in: (2,1); is: (1,1); it: (2,1); manor: (2,1);
men: (1,1); midnight: (2,1); night: (2,1); now: (1,1); of: (1,1); past: (2,1);
stormy: (2,1); the: (1,2) (2,2); their: (1,1); time: (1,1) (2,1); to: (1,2); was: (2,2)
(Term labels are shown only for readability; the postings file itself stores just the
(Doc #, Freq) entries, with the dictionary pointing to each term's run of entries.)
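A minimal sketch of this whole construction (token extraction, sorting, merging duplicates into frequencies, and splitting into a dictionary and postings), assuming whitespace tokenization, lowercasing, and naive punctuation stripping rather than the full pre-processing pipeline:

```python
from collections import Counter, defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# 1. Parse: extract (term, doc_id) tokens.
pairs = [(tok.strip(".").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# 2. Sort by term (then doc) and merge duplicates into within-document frequencies.
tf = Counter(pairs)                      # (term, doc_id) -> frequency

# 3. Split into dictionary (term -> DF, total freq) and postings (doc_id, freq).
postings = defaultdict(list)             # term -> [(doc_id, freq), ...]
for (term, doc_id), freq in sorted(tf.items()):
    postings[term].append((doc_id, freq))

dictionary = {term: (len(plist), sum(f for _, f in plist))
              for term, plist in postings.items()}

print(dictionary["the"])   # (2, 4): DF = 2 docs, total frequency = 4
print(postings["the"])     # [(1, 2), (2, 2)]
```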
Inverted Indexes and Queries
 Permit fast search for individual terms
 For each term, you get a hit list consisting of:
 document ID
 frequency of term in doc
 position of term in doc (optional)
 These lists can be used to quickly solve Boolean queries:
country ==> {d1, d2}
manor ==> {d2}
country AND manor ==> {d2}
 Full advantage of this structure can be taken by statistical ranking
algorithms such as the vector space model
 in case of Boolean queries, term or document frequency information is not used
(just set operations performed on hit lists)
 We will look at the vector model later; for now let’s examine
Boolean queries more closely
Scalability Issues: Number of Postings
An Example: Reuters RCV1 Collection
 Number of docs = m = 800,000
 Average number of tokens per doc: 200
 Number of distinct terms = n = 400K
 100 million (non-positional) postings in the inverted
index
Bottleneck
 Parse and build postings entries one doc at a time
 Sort postings entries by term (then by doc within each
term)
 Doing this with random disk seeks would be too slow –
must sort N=100M records
If every comparison took 2 disk seeks (10 milliseconds
each), and N items could be sorted with N log2N
comparisons, how long would this take?
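A rough worked answer: N log2 N ≈ 10^8 × log2(10^8) ≈ 10^8 × 26.6 ≈ 2.7 × 10^9 comparisons; at 2 seeks × 10 ms = 20 ms per comparison, that is roughly 5.3 × 10^7 seconds — on the order of a year and a half, which is why sorting with random disk seeks is ruled out.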
Sorting with fewer disk seeks
12-byte (4+4+4) records (term, doc, freq)
These are generated as we parse docs
Must now sort 100M such 12-byte records by
term
Define blocks of ~10M records each
Sort within each block first and write it to disk, then
merge the sorted blocks into one long sorted order.
Blocked Sort-Based Indexing (BSBI)
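A minimal BSBI-style sketch (an assumption-laden toy: blocks are kept as in-memory lists where real BSBI would write each sorted block to disk, and heapq.merge stands in for the external k-way merge):

```python
import heapq
from itertools import islice

def bsbi_index(postings_stream, block_size):
    """Blocked sort-based indexing sketch over a stream of (term_id, doc_id) pairs."""
    stream = iter(postings_stream)
    sorted_blocks = []
    while True:
        block = list(islice(stream, block_size))   # read up to block_size postings
        if not block:
            break
        block.sort()                               # sort the block (by term, then doc)
        sorted_blocks.append(block)                # stand-in for "write block to disk"
    # Final step: k-way merge of the sorted blocks into one sorted run of postings.
    return list(heapq.merge(*sorted_blocks))

pairs = [(2, 1), (1, 1), (3, 2), (1, 2), (2, 2), (1, 3)]   # (term_id, doc_id)
print(bsbi_index(pairs, block_size=3))
# [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 2)]
```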
Sec. 4.3
Problem with sort-based algorithm
 Assumption: we can keep the dictionary in memory.
 We need the dictionary (which grows dynamically) in
order to implement a term to termID mapping.
 Actually, we could work with term, docID postings
instead of termID, docID postings . . .
 . . . but then intermediate files become very large. (We
would end up with a scalable, but very slow index
construction method.)
Sec. 4.3
SPIMI:
Single-pass in-memory indexing
 Key idea 1: Generate separate dictionaries for each
block – no need to maintain term-termID mapping
across blocks.
 Key idea 2: Don’t sort. Accumulate postings in postings
lists as they occur.
 With these two ideas we can generate a complete
inverted index for each block.
 These separate indexes can then be merged into one big
index.
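A minimal SPIMI-style sketch under the same caveat (each block's index would be written to disk before merging; here everything stays in memory):

```python
from collections import defaultdict

def spimi_invert(doc_stream):
    """Build a complete index for one block: no sorting of postings as they arrive,
    no global term->termID mapping; postings are accumulated as terms occur."""
    index = defaultdict(list)                  # fresh dictionary for this block
    for doc_id, tokens in doc_stream:
        for term in tokens:
            plist = index[term]
            if not plist or plist[-1] != doc_id:
                plist.append(doc_id)           # append the posting as it occurs
    # Terms are sorted only once, when the block index is written out.
    return {term: index[term] for term in sorted(index)}

def merge_blocks(block_indexes):
    """Merge the separate per-block indexes into one big index."""
    merged = defaultdict(list)
    for block in block_indexes:
        for term, plist in block.items():
            merged[term].extend(plist)
    return dict(merged)
```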
Distributed indexing
 For web-scale indexing
must use a distributed computing cluster
 Individual machines are fault-prone
Can unpredictably slow down or fail
 How do we exploit such a pool of machines?
Maintain a master machine directing the indexing job –
considered “safe”.
Break up indexing into sets of (parallel) tasks.
Master machine assigns each task to an idle machine
from a pool.
Parallel tasks
 Use two sets of parallel tasks
 Parsers
 Inverters
 Break the input document corpus into splits
 Each split is a subset of documents
 E.g., corresponding to blocks in BSBI
 Master assigns a split to an idle parser machine
 Parser reads a document at a time and emits (term, doc)
pairs
 writes pairs into j partitions
 Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here
j = 3.
 Inverter collects all (term, doc) pairs for a partition; sorts
and writes to postings list
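A toy sequential sketch of the parser and inverter tasks (assuming j = 3 partitions keyed by a term's first letter; in a real deployment the master would assign each parse and invert task to a different machine):

```python
from collections import defaultdict

def partition_of(term):
    """Map a term to one of three partitions by its first letter: a-f, g-p, q-z."""
    c = term[0]
    return "a-f" if c <= "f" else ("g-p" if c <= "p" else "q-z")

def parse(split):
    """Parser task: emit (term, doc_id) pairs into per-partition segment files."""
    segments = defaultdict(list)
    for doc_id, tokens in split:
        for term in tokens:
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Inverter task: collect, sort, and write the postings for one partition."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)
```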
Sec. 4.4
Data flow
[Diagram: MapReduce-style data flow — the master assigns document splits to parsers (map phase); each parser writes its (term, doc) pairs into segment files partitioned by term range (a-f, g-p, q-z); one inverter per partition (reduce phase) collects those segment files and produces the postings.]
Dynamic indexing
Problem:
 Docs come in over time
postings updates for terms already in dictionary
new terms added to dictionary
 Docs get deleted
 Simplest Approach
 Maintain a “big” main index
 New docs go into a “small” auxiliary index
 Search across both, merge results
 Deletions
Invalidation bit-vector for deleted docs
Filter the docs returned by a search through this invalidation bit-vector
 Periodically, re-index into one main index
Index on disk vs. memory
Most retrieval systems keep the dictionary in
memory and the postings on disk
Web search engines frequently keep both in
memory
massive memory requirement
feasible for large web service installations
less so for commercial usage where query loads are
lighter
Retrieval From Indexes
 Given the large indexes in IR applications, searching for
keys in the dictionaries becomes a dominant cost
 Two main choices for dictionary data structures: Hashtables
or Trees
Using Hashing
requires the derivation of a hash function mapping terms to locations
may require collision detection and resolution for non-unique hash
values
Using Trees
Binary search trees
nice properties, easy to implement, and effective
enhancements such as B+ trees can improve search efficiency
but they require storing keys in each internal node
Sec. 3.1
Hashtables
Each vocabulary term is hashed to an integer
(We assume you’ve seen hashtables before)
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants:
judgment/judgement
No prefix search
[tolerant retrieval]
If vocabulary keeps growing, need to occasionally do
the expensive operation of rehashing everything
Sec. 3.1
Trees
 Simplest: binary tree
 More usual: B-trees
 Trees require a standard ordering of characters and
hence strings … but we typically have one
 Pros:
 Solves the prefix problem (e.g., terms starting with hyp)
 Cons:
 Slower: O(log M) lookups [and this requires a balanced tree]
 Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
Sec. 3.1
Tree: binary tree
[Diagram: a binary tree over the dictionary — the root splits the vocabulary into a-m and n-z; a-m splits into a-hu and hy-m, and n-z into n-sh and si-z.]
Sec. 3.1
Tree: B-tree
[Diagram: a B-tree whose root has children covering the ranges a-hu, hy-m, and n-z.]
Definition: Every internal node has a number of children in
the interval [a,b] where a, b are appropriate natural
numbers, e.g., [2,4].
Recall: Steps in Basic Automatic Indexing
 Parse documents to recognize structure
 Scan for word tokens
 Stopword removal
 Stem words
 Weight words
Indexing Models (aka “Term Weighting”)
 Basic issue: which terms should be used to index a
document, and how much should each count?
 Some approaches
 binary weights
 Terms either appear or they don’t; no frequency information used.
 term frequency
Either raw term counts or (more often) term counts normalized by
document length (the total number of term occurrences in the document)
 TF.IDF (inverse document frequency model)
 Term discrimination model
 Signal-to-noise ratio (based on information theory)
 Probabilistic term weights
Binary Weights
 Only the presence (1) or absence (0) of a term is
included in the vector
This representation can be particularly useful, since the
documents (and the query) can be viewed as simple bit
strings. This allows query operations to be performed
using logical bit operations.
docs   t1   t2   t3
D1      1    0    1
D2      1    0    0
D3      0    1    1
D4      1    0    0
D5      1    1    1
D6      1    1    0
D7      0    1    0
D8      0    1    0
D9      0    0    1
D10     0    1    1
D11     1    0    1
Binary Weights:
Matching of Documents & Queries
 In the case of binary weights, matching between documents and queries can be
seen as the size of the intersection of two sets (of terms): |Q ∩ D|. This in turn
can be used to rank the relevance of documents to a query.
docs   t1   t2   t3    Rank = Q · Di
D1      1    0    1     2
D2      1    0    0     1
D3      0    1    1     2
D4      1    0    0     1
D5      1    1    1     3
D6      1    1    0     2
D7      0    1    0     1
D8      0    1    0     1
D9      0    0    1     1
D10     0    1    1     2
D11     1    0    1     2
Q       1    1    1     (q1, q2, q3)

[Diagram: a Venn-style view of the same data, grouping the documents by which of the query terms t1, t2, t3 they contain.]
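A small sketch of ranking by simple matching, using the binary matrix above; with binary weights the dot product Q · Di equals |Q ∩ Di|:

```python
# Hypothetical sketch: rank documents by the dot product Q . Di.
docs = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1], "D11": [1, 0, 1],
}
q = [1, 1, 1]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

ranking = sorted(docs.items(), key=lambda kv: dot(q, kv[1]), reverse=True)
print([(name, dot(q, vec)) for name, vec in ranking])
# D5 scores 3; D1, D3, D6, D10, D11 score 2; the rest score 1.
```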
Beyond Binary Weight
 More generally, similarity between the query and the document can be seen as the
dot product of the two vectors: Q · D (this is also called simple matching)
 Note that if both Q and D are binary this is the same as: |Q ∩ D|
docs   t1   t2   t3    Rank = Q · Di
D1      1    0    1     4
D2      1    0    0     1
D3      0    1    1     5
D4      1    0    0     1
D5      1    1    1     6
D6      1    1    0     3
D7      0    1    0     2
D8      0    1    0     2
D9      0    0    1     3
D10     0    1    1     5
D11     1    0    1     3
Q       1    2    3     (q1, q2, q3)

Given two vectors X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), simple matching
measures the similarity between X and Y as the dot product of X and Y:
sim(X, Y) = X · Y = Σ_i (x_i · y_i)
Raw Term Weights
 The frequency of occurrence for the term in each
document is included in the vector
Now the notion of simple matching (dot product) incorporates the term
weights from both the query and the documents. Using raw term weights
provides the ability to better distinguish among retrieved documents.
Note: Although "term frequency" is commonly used to mean the raw
occurrence count, technically it implies that the raw count is divided by
the document length (the total number of term occurrences in the document).
docs   t1   t2   t3    RSV = Q · Di
D1      2    0    3     11
D2      1    0    0      1
D3      0    4    7     29
D4      3    0    0      3
D5      1    6    3     22
D6      3    5    0     13
D7      0    8    0     16
D8      0   10    0     20
D9      0    0    1      3
D10     0    3    5     21
D11     4    0    1      7
Q       1    2    3     (q1, q2, q3)
Term Weights: TF
 More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j.
 May want to normalize term frequency (tf) by dividing
by the frequency of the most common term in the
document:
tfij = fij / maxi{fij}
 Or sublinear tf scaling:
tfij = 1 + log fij
Normalized Similarity Measures
 With or without normalized weights, it is possible to incorporate
normalization into various similarity measures
 Example (Vector Space Model)
 in simple matching, the dot product of two vectors measures the similarity of
these vectors
 the normalization can be achieved by dividing the dot product by the product of
the norms of the two vectors
 given a vector X = (x1, x2, ..., xn), the norm of X is:
||X|| = sqrt( Σ_i x_i^2 )
 the similarity of vectors X and Y is:
sim(X, Y) = (X · Y) / (||X|| · ||Y||) = Σ_i (x_i · y_i) / ( sqrt(Σ_i x_i^2) · sqrt(Σ_i y_i^2) )
Note: this measures the cosine of the angle between the two vectors; it is thus
called the normalized cosine similarity measure.
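A minimal sketch of the normalized cosine similarity just defined:

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between x and y: dot product over the product of norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

q, d1 = [1, 2, 3], [2, 0, 3]
print(round(cosine_sim(q, d1), 2))   # 0.82, matching the D1 row in the next example
```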
Normalized Similarity Measures
docs   t1   t2   t3    RSV = Q · Di    SIM(Q, Di) using normalized cosine
D1      2    0    3     11              0.82
D2      1    0    0      1              0.27
D3      0    4    7     29              0.96
D4      3    0    0      3              0.27
D5      1    6    3     22              0.87
D6      3    5    0     13              0.60
D7      0    8    0     16              0.53
D8      0   10    0     20              0.53
D9      0    0    1      3              0.80
D10     0    3    5     21              0.96
D11     4    0    1      7              0.45
Q       1    2    3
Note that the relative ranking among documents has changed!
tf x idf Weighting
 tf x idf measure:
 term frequency (tf)
 inverse document frequency (idf) -- a way to deal with the
problems of the Zipf distribution
 Recall the Zipf distribution
 Want to weight terms highly if they are
frequent in relevant documents … BUT
infrequent in the collection as a whole
 Goal: assign a tf x idf weight to each term in each
document
tf x idf
w_ik = tf_ik × log(N / n_k)

where:
T_k = term k in document D_i
tf_ik = frequency of term T_k in document D_i
idf_k = inverse document frequency of term T_k in collection C
N = total number of documents in the collection C
n_k = the number of documents in C that contain T_k
idf_k = log(N / n_k)
Inverse Document Frequency
 IDF provides high values for rare words and low values
for common words
 10000 
log 
0
 10000 
 10000 
log 
  0.301
 5000 
 10000 
log 
  2.698
 20 
 10000 
log 
4
 1 
Intelligent Information Retrieval
39
tf x idf normalization
Normalize the term weights (so longer documents
are not unfairly given more weight)
 normalize usually means force all values to fall within a certain range,
usually between 0 and 1, inclusive
 this is more ad hoc than normalization based on vector norms, but the
basic idea is the same.
tf x idf Example
The initial Term × Doc matrix (Inverted Index):

        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6    df   idf = log2(N/df)
T1        0      2      4      0      1      0       3      1.00
T2        1      3      0      0      0      2       3      1.00
T3        0      1      0      2      0      0       2      1.58
T4        3      0      1      5      4      0       4      0.58
T5        0      4      0      0      0      1       2      1.58
T6        2      7      2      1      3      0       5      0.26
T7        1      0      0      5      5      1       4      0.58
T8        0      1      1      0      0      3       3      1.00

Documents represented as vectors of words.

The tf × idf Term × Doc matrix:

        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
T1      0.00   2.00   4.00   0.00   1.00   0.00
T2      1.00   3.00   0.00   0.00   0.00   2.00
T3      0.00   1.58   0.00   3.17   0.00   0.00
T4      1.75   0.00   0.58   2.92   2.34   0.00
T5      0.00   6.34   0.00   0.00   0.00   1.58
T6      0.53   1.84   0.53   0.26   0.79   0.00
T7      0.58   0.00   0.00   2.92   2.92   0.58
T8      0.00   1.00   1.00   0.00   0.00   3.00
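A small sketch that reproduces the computation above (raw term frequencies, df, idf = log2(N/df), and the tf × idf matrix):

```python
import math

# Term-by-document matrix of raw term frequencies (terms T1..T8, docs Doc1..Doc6).
tf = [
    [0, 2, 4, 0, 1, 0],   # T1
    [1, 3, 0, 0, 0, 2],   # T2
    [0, 1, 0, 2, 0, 0],   # T3
    [3, 0, 1, 5, 4, 0],   # T4
    [0, 4, 0, 0, 0, 1],   # T5
    [2, 7, 2, 1, 3, 0],   # T6
    [1, 0, 0, 5, 5, 1],   # T7
    [0, 1, 1, 0, 0, 3],   # T8
]
N = len(tf[0])                                           # number of documents

df  = [sum(1 for f in row if f > 0) for row in tf]       # document frequency per term
idf = [math.log2(N / d) for d in df]                     # idf = log2(N / df)
tfidf = [[f * idf[i] for f in row] for i, row in enumerate(tf)]

print(df)                                # [3, 3, 2, 4, 2, 5, 4, 3]
print([round(w, 2) for w in tfidf[3]])   # T4 row: [1.75, 0.0, 0.58, 2.92, 2.34, 0.0]
```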
Alternative TF.IDF
Weighting Schemes
 Many search engines allow for different weightings for
queries vs. documents:
 A very standard weighting scheme is:
 Document: logarithmic tf, no idf, and cosine normalization
 Query: logarithmic tf, idf, no normalization
Keyword Discrimination Model
 The Vector representation of documents can be used as the
source of another approach to term weighting
 Question: what happens if we removed one of the words used as
dimensions in the vector space?
 If the average similarity among documents changes significantly, then
the word was a good discriminator
 If there is little change, the word is not as helpful and should be
weighted less
 Note that the goal is to have a representation that makes it
easier for queries to discriminate among documents
 Average similarity can be measured after removing each
word from the matrix
 Any of the similarity measures can be used (we will look at a variety of
other similarity measures later).
Keyword Discrimination
 Measuring average similarity (assume there are N documents)
sim(D1,D2) = similarity score for pair of documents D1 and D2
AVG-SIM = (1/N^2) · Σ_{i,j} sim(Di, Dj)
SIM_k = AVG-SIM computed with term k removed
This is computationally expensive.
 Better way to calculate AVG-SIM
 Calculate the centroid D* (the average document vector = sum of the document vectors / N)
 Then: AVG-SIM = (1/N) · Σ_i sim(Di, D*)
Keyword Discrimination
 Discrimination value (discriminant) and term weights
disc_k = SIM_k − AVG-SIM
disc_k > 0 ==> term k is a good discriminant
disc_k < 0 ==> term k is a poor discriminant
disc_k = 0 ==> term k is indifferent
 Computing Term Weights
 The new weight for a term k in a document i is the original term
frequency of k in i times the discriminant value:
w_ik = tf_ik × disc_k
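A minimal sketch of the discrimination-value computation using the centroid shortcut (cosine similarity to D*, with and without each term; the data is from the example that follows):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def avg_sim(doc_vectors):
    """Average similarity of the documents to their centroid D*."""
    n = len(doc_vectors)
    centroid = [sum(col) / n for col in zip(*doc_vectors)]
    return sum(cosine(d, centroid) for d in doc_vectors) / n

def discrimination_values(doc_vectors):
    """disc_k = SIM_k (term k removed) - AVG-SIM (all terms)."""
    base = avg_sim(doc_vectors)
    n_terms = len(doc_vectors[0])
    discs = []
    for k in range(n_terms):
        reduced = [[w for i, w in enumerate(d) if i != k] for d in doc_vectors]
        discs.append(avg_sim(reduced) - base)
    return discs

docs = [[10, 1, 0], [9, 2, 10], [8, 1, 1], [8, 1, 50], [19, 2, 15], [9, 2, 0]]
print([round(d, 3) for d in discrimination_values(docs)])   # approx. [-0.139, -0.001, 0.191]
```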
Keyword Discrimination - Example
docs    t1      t2      t3
D1      10       1       0
D2       9       2      10
D3       8       1       1
D4       8       1      50
D5      19       2      15
D6       9       2       0
D*    10.50    1.50   12.67

Doc-Sim to Centroid (using normalized cosine):
sim(D1, D*) = 0.641
sim(D2, D*) = 0.998
sim(D3, D*) = 0.731
sim(D4, D*) = 0.859
sim(D5, D*) = 0.978
sim(D6, D*) = 0.640
AVG-SIM = (1/N) · Σ_i sim(Di, D*) = 0.808

SIM_k = AVG-SIM when term k is removed:

term removed:      t1       t2       t3
sim_k(D1, D*)    0.118    0.638    0.999
sim_k(D2, D*)    0.997    0.999    0.997
sim_k(D3, D*)    0.785    0.729    1.000
sim_k(D4, D*)    0.995    0.861    1.000
sim_k(D5, D*)    1.000    0.978    0.999
sim_k(D6, D*)    0.118    0.638    0.997
SIM_k            0.669    0.807    0.999

Note: D* for each of the SIM_k is now computed with only two terms.
Keyword Discrimination - Example
This shows that t1 tends to be a poor
discriminator, while t3 is a good
discriminator. The new term weight
will now reflect the discrimination
value for these terms. Note that further
normalization can be done to make all
term weights positive.
disc_k = SIM_k − AVG-SIM

Term    disc_k
t1      -0.139
t2      -0.001
t3       0.191

New weights (w_ik = tf_ik × disc_k) for terms t1, t2, and t3:

docs     t1       t2       t3
D1     -1.392   -0.001    0.000
D2     -1.253   -0.001    1.908
D3     -1.114   -0.001    0.191
D4     -1.114   -0.001    9.538
D5     -2.645   -0.001    2.861
D6     -1.253   -0.001    0.000
Signal-To-Noise Ratio
 Based on work of Shannon in 1940’s on Information Theory
 Developed a model of communication of messages across a noisy channel
 Goal is to devise an encoding of messages that is most robust in the face
of channel noise
 In IR, messages describe the content of documents
 Amount of information about the document from a word is inversely
proportional to its probability of occurrence
 The least informative words are those that occur approximately uniformly
across the corpus of documents
a word that occurs with similar frequency across many documents (e.g.,
"the", "and", etc.) is less informative than one that occurs with high
frequency in one or two documents
Shannon used entropy (a logarithmic measure) to measure average
information gain with noise defined as its inverse
Signal-To-Noise Ratio
p_ik = Prob(term k occurs in document i) = tf_ik / tf_k
Info_ik = − p_ik · log2(p_ik)      (note: here we always take logs to be base 2)
AVG-INFO_k = Σ_i Info_ik — the average information (entropy) of term k; this is the
NOISE of term k, so only one of the two needs to be computed in practice.
SIGNAL_k = log2(tf_k) − NOISE_k
w_ik = tf_ik × SIGNAL_k      (the weight of term k in document i)
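A minimal sketch of the signal-to-noise computation (p_ik = tf_ik / tf_k, entropy as AVG-INFO, SIGNAL_k = log2(tf_k) − AVG-INFO_k, w_ik = tf_ik × SIGNAL_k), using the data from the example that follows:

```python
import math

# Rows = documents D1..D6, columns = terms t1, t2, t3 (raw frequencies).
tf = [[10, 1, 1], [9, 2, 10], [8, 1, 1], [8, 1, 50], [19, 2, 15], [9, 2, 1]]

n_terms = len(tf[0])
tf_k = [sum(row[k] for row in tf) for k in range(n_terms)]   # total freq of each term

def avg_info(k):
    """Entropy of term k: sum of -p * log2(p) over the docs in which it occurs."""
    total = 0.0
    for row in tf:
        if row[k] > 0:
            p = row[k] / tf_k[k]
            total += -p * math.log2(p)
    return total

signal = [math.log2(tf_k[k]) - avg_info(k) for k in range(n_terms)]
weights = [[row[k] * signal[k] for k in range(n_terms)] for row in tf]

print([round(s, 3) for s in signal])        # approx. [3.476, 0.667, 4.795]
print([round(w, 3) for w in weights[3]])    # D4: approx. [27.808, 0.667, 239.753]
```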
Signal-To-Noise Ratio - Example
p_ik = tf_ik / tf_k

docs   t1   t2   t3
D1     10    1    1
D2      9    2   10
D3      8    1    1
D4      8    1   50
D5     19    2   15
D6      9    2    1
tf_k   63    9   78

docs   Prob(t1)  Prob(t2)  Prob(t3)   Info(t1)  Info(t2)  Info(t3)
D1      0.159     0.111     0.013      0.421     0.352     0.081
D2      0.143     0.222     0.128      0.401     0.482     0.380
D3      0.127     0.111     0.013      0.378     0.352     0.081
D4      0.127     0.111     0.641      0.378     0.352     0.411
D5      0.302     0.222     0.192      0.522     0.482     0.457
D6      0.143     0.222     0.013      0.401     0.482     0.081
AVG-INFO                               2.501     2.503     1.490

Note: by definition, if term k does not appear in a document, we take Info(k) = 0 for that document.
AVG-INFO_k is the "entropy" of term k in the collection.
Signal-To-Noise Ratio - Example
Term        t1       t2       t3
AVG-INFO   2.501    2.503    1.490
NOISE      2.501    2.503    1.490     (NOISE_k = AVG-INFO_k)
SIGNAL     3.476    0.667    4.795     (SIGNAL_k = log2(tf_k) − NOISE_k)

w_ik = tf_ik × SIGNAL_k — the weight of term k in document i:

docs   Weight t1   Weight t2   Weight t3
D1      34.760       0.667       4.795
D2      31.284       1.333      47.951
D3      27.808       0.667       4.795
D4      27.808       0.667     239.753
D5      66.044       1.333      71.926
D6      31.284       1.333       4.795

Additional normalization can be performed to have values in the range [0,1]
Probabilistic Term Weights
 Probabilistic model makes explicit distinctions between
occurrences of terms in relevant and non-relevant documents
 If we know
pi: probability of term xi appears in relevant doc.
qi: probability of term xi appears in non-relevant doc.
with the binary and independence assumptions, the weight of
term xi in document Dk is:
w_ik = log [ p_i (1 − q_i) / ( q_i (1 − p_i) ) ]
 Estimates of pi and qi require relevance information:
 using test queries and test collections to "train" the values of pi and qi
 other AI/learning techniques?
Phrase Indexing and Phrase Queries
 Both statistical and syntactic methods have been used to
identify “good” phrases
 Example: Expected Mutual Information to find "collocations"
 Linguistic Approaches: using a part-of-speech tagger to identify simple
noun phrases
 Phrases can have an impact on effectiveness and efficiency
 phrase indexing will speed up phrase queries
 improve precision by disambiguating word senses:
e.g., "grass field" vs. "magnetic field"
 effectiveness not straightforward and depends on retrieval model
e.g. “information retrieval”, how much do individual words count?
• For phrase queries, it no longer suffices to store only <term :
docs> entries
Phrase Detection and Weighting
 Typical Approach
 Compute pairwise co-occurrence for high-frequency words
 If co-occurrence value is less than some threshold a, do not consider
the pair any further
 For qualifying pairs of terms (ti, tj), compute the cohesion value:
cohesion(ti, tj) = s · freq(ti, tj) / ( totfreq(ti) · totfreq(tj) )      (Salton and McGill, 1983)
where s is a size factor determined by the size of the vocabulary; OR
cohesion(ti, tj) = freq(ti, tj) / ( freq(ti) · freq(tj) )      (Rada, 1986)
 But indexing all pairwise (or longer) frequent co-occurrences can be computationally very expensive
Sec. 2.4.2
Better Solution: Positional indexes
In the postings, store, for each term, the
position(s) at which its tokens appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Sec. 2.4.2
Positional Index Example
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1,2,4,5
could contain “to be
or not to be”?
 For phrase queries, we can use a merge algorithm
recursively at the document level
Sec. 2.4.2
Processing a Phrase Query
• Extract inverted index entries for each distinct term: to,
be, or, not.
• Merge their doc:position lists to enumerate all positions
with “to be or not to be”.
– to:
• 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
– be:
• 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
• Same general method for proximity searches
• West Law Example: “LIMIT! /3 STATUTE /3 FEDERAL /2 TORT”
– /k means “within k words of”
– Positional indexes can be used for such queries; phrase indexes
cannot.
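A minimal sketch of phrase matching over a positional index (assumption: postings map term → {doc: sorted positions}; the query terms must appear at consecutive positions). The position lists below are the ones from the "to"/"be" example above:

```python
def phrase_docs(positional_index, phrase_terms):
    """Return the doc IDs in which the terms occur as a contiguous phrase."""
    postings = [positional_index[t] for t in phrase_terms]
    # Candidate docs must contain every term of the phrase.
    candidates = set(postings[0]).intersection(*postings[1:])
    result = []
    for doc in sorted(candidates):
        first_positions = postings[0][doc]
        # The phrase starts at position p if term i occurs at p + i for every i.
        if any(all(p + i in postings[i][doc] for i in range(1, len(phrase_terms)))
               for p in first_positions):
            result.append(doc)
    return result

index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(index, ["to", "be"]))   # [4]: e.g. "to" at 16 is followed by "be" at 17
```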
Sec. 2.4.2
Positional Index Size
 A positional index expands postings storage substantially
 Even though indices can be compressed
 Nevertheless, a positional index is now standardly used because of the
power and usefulness of phrase and proximity queries
 Need an entry for each occurrence, not just once per
document
 Index size depends on average document size and average frequency
of each term
Average web page has <1000 terms
SEC filings, books, even some epic poems … easily 100,000 terms
 Rule of Thumb
 A positional index is 2–4 times as large as a non-positional index
 Positional index size 35–50% of volume of original text
Concept Indexing
 More complex indexing could include concept or thesaurus classes
 One approach is to use a controlled vocabulary (or subject codes) and map
specific terms to “concept classes”
 Automatic concept generation can use classification or clustering to
determine concept classes
 Automatic Concept Indexing
 Words, phrases, synonyms, linguistic relations can all be evidence used to
infer presence of the concept
 e.g. the concept “automobile” can be inferred based on the presence of the
words “vehicle”, “transportation”, “driving”, etc.
 One approach is to represent each word as a “concept vector”
 each dimension represents a weight for a concept associated with the term
 phrases or index items can be represented as weighted averages of concept
vectors for the terms in them
 Another approach: Latent Semantic Indexing (LSI)
Next
 Retrieval Models and Ranking Algorithms
 Boolean Matching and Boolean Queries
 Vector Space Model and Similarity Ranking
 Extended Boolean Models
 Basic Probabilistic Models
 Implementation Issues for Ranking Systems