Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture: Index construction
Ch. 4
How do we construct an index?
What strategies can we use with limited main memory?
What about dynamic text collections?
Sec. 4.1
Disk I/O is block-based: reading and writing happen in entire blocks (as opposed to smaller chunks).
Block sizes: 8 KB to 256 KB.
Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
Available disk space is several (2–3) orders of magnitude larger.
Fault tolerance is very expensive: it is much cheaper to use many regular machines than one fault-tolerant machine.
Sec. 4.2
Documents are parsed to extract words, and these are saved together with the document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Resulting (term, doc#) pairs, in order of occurrence:
(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
(so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)
After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step.
The same pairs sorted by term (ties broken by doc#):
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (was,1) (was,2) (with,2) (you,2)
In-memory index construction does not scale.
How can we construct an index for very large collections?
Taking into account memory, disk, speed, etc.
As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit compression tricks (we can, but it gets much more complex).
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (term, doc, freq), a large collection demands a lot of space.
T = 100,000,000 in the case of Reuters.
So… we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire.
Thus: we need to store intermediate results on disk.
BSBI: blocked sort-based indexing
Take Wikipedia in Italian, and compute word frequencies:
A few GBs, n ≈ 10^9 words
How do you proceed?
Tokenize into a sequence of strings
Sort the strings
Create tuples <word, freq>
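The three steps above can be sketched as a toy in-memory version (for a multi-GB dump the sort would have to run externally, as the following slides show):

```python
import re
from itertools import groupby

def word_freq(text):
    # Tokenize into a sequence of lowercase strings
    tokens = re.findall(r"[a-z']+", text.lower())
    # Sort the strings so that equal words become adjacent
    tokens.sort()
    # Create <word, freq> tuples by counting each run of equal words
    return [(w, sum(1 for _ in run)) for w, run in groupby(tokens)]

print(word_freq("to be or not to be"))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```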
Merge-Sort(A, i, j)
  if (i < j) then              // Divide
    m = (i + j) / 2
    Merge-Sort(A, i, m)        // Conquer
    Merge-Sort(A, m+1, j)      // Conquer
    Merge(A, i, m, j)          // Combine

Merge is linear in the number of items to be merged.
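A runnable Python version of the pseudocode (a sketch; the list pops keep it short rather than efficient):

```python
def merge_sort(A, i, j):
    if i < j:                      # Divide
        m = (i + j) // 2
        merge_sort(A, i, m)        # Conquer left half
        merge_sort(A, m + 1, j)    # Conquer right half
        merge(A, i, m, j)          # Combine

def merge(A, i, m, j):
    # Linear in the number of items to be merged
    left, right = A[i:m + 1], A[m + 1:j + 1]
    k = i
    while left and right:
        A[k] = left.pop(0) if left[0] <= right[0] else right.pop(0)
        k += 1
    A[k:j + 1] = left + right      # one side is empty; append the rest

A = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(A, 0, len(A) - 1)
print(A)  # [1, 2, 5, 7, 9, 10, 13, 19]
```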
Items = (short) strings = atomic…
Θ(n log n) memory accesses (I/Os??)
[5 ms] × n log_2 n ≈ 3 years
In practice it is faster. Why?
[Figure: binary merge-sort on disk, merging sorted runs pairwise over log_2 (N/M) levels.]
Start from N/M runs, each sorted in internal memory (no I/Os).
Each pass (one read + one write of the whole data) costs 2 (N/B) I/Os.
So the I/O-cost of binary merge-sort is ≈ 2 (N/B) log_2 (N/M).
After a few steps, every run is longer than B!
[Figure: two sorted runs on disk merged through 3 memory pages: one input page of size B per run, plus one output buffer flushed to disk when full.]
We are using only 3 pages, but memory contains M/B pages ≈ 2^30 / 2^15 = 2^15.
Sort N items with main memory M and disk pages of size B:
Pass 1: produce N/M sorted runs.
Pass i: merge X = M/B − 1 runs at a time, so log_X (N/M) merge passes suffice.
[Figure: main memory holds X input buffers of B items, one page per run, plus one output page; the runs live on disk.]
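One merge pass with fan-out X can be sketched with a heap over the current head of each run; here in-memory lists stand in for the disk-resident runs and buffers:

```python
import heapq

def multiway_merge(runs):
    # Merge X sorted runs in one pass; the heap holds one "page head" per run
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []                          # stands in for the output buffer
    while heap:
        val, r, idx = heapq.heappop(heap)
        out.append(val)
        if idx + 1 < len(runs[r]):    # refill from the same run
            heapq.heappush(heap, (runs[r][idx + 1], r, idx + 1))
    return out

print(multiway_merge([[1, 5, 9], [2, 6], [3, 4, 8]]))
# [1, 2, 3, 4, 5, 6, 8, 9]
```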
Number of passes = log_X (N/M) = log_{M/B} (N/M)
Total I/O-cost is Θ( (N/B) log_{M/B} (N/M) ) I/Os.
In practice M/B ≈ 10^5, so #passes = 1 and the sort takes a few minutes.
A large fan-out (M/B) decreases #passes.
Tuning depends on disk features.
Compression would decrease the cost of a pass!
If sorting needs to manage arbitrarily long strings:
Indirect sort: array A is an "array of pointers to objects"; the strings themselves stay in memory.
Key observation: each object-to-object comparison A[i] vs A[j] makes 2 random accesses, to the 2 memory locations pointed to by A[i] and A[j].
Θ(n log n) random memory accesses (I/Os??)
Again caching helps, but it may be less effective than before.
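Indirect sorting can be sketched as sorting an array of indices ("pointers") by the strings they reference; each comparison dereferences two of them:

```python
from functools import cmp_to_key

strings = ["pear", "apple", "fig", "banana"]
A = list(range(len(strings)))     # A is an array of "pointers" (indices)

def cmp(i, j):
    # each comparison A[i] vs A[j] makes 2 random accesses into the string pool
    return (strings[i] > strings[j]) - (strings[i] < strings[j])

A.sort(key=cmp_to_key(cmp))
print([strings[p] for p in A])    # ['apple', 'banana', 'fig', 'pear']
```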
Sec. 4.3
SPIMI: single-pass in-memory indexing
Key idea #1: Generate separate dictionaries for each block.
Key idea #2: Don’t sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.
Appending is easy with one file per postings list (docIDs are increasing across blocks); otherwise, emulate multi-way merging.
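A minimal sketch of one SPIMI block (function name and structure are illustrative, not the book's pseudocode; it assumes the token stream arrives document by document):

```python
def spimi_block(pairs):
    # Key idea #1: a fresh dictionary for this block
    dictionary = {}
    for term, doc_id in pairs:
        # Key idea #2: don't sort; accumulate postings as they occur
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)   # docs arrive in order, so append suffices
    # Sort terms only once, when the block is written out
    return sorted(dictionary.items())

block = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 2)]
print(spimi_block(block))
# [('brutus', [1, 2]), ('caesar', [1, 2])]
```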
Merging of blocks is analogous to BSBI (but now dictionary-driven).
Compression makes SPIMI even more efficient.
Compression of terms
Compression of postings
See next lectures
For web-scale indexing: must use a distributed computing cluster
Individual machines are fault-prone: they can unpredictably slow down or fail.
How do we exploit such a pool of machines?
Sec. 4.4
Google data centers mainly contain commodity machines.
Data centers are distributed around the world.
Sec. 4.4
Estimate: Google installs 100,000 servers each quarter, based on expenditures of 200–250 million dollars per year.
This is 10% of the computing capacity of the world!?
Maintain a master machine directing the indexing job; the master is considered "safe".
Break up indexing into sets of (parallel) tasks.
Master machine assigns tasks to idle machines
We will use two sets of parallel tasks: Parsers and Inverters.
Break the input document collection into splits; each split is a subset of documents.
Term partitioning: one machine handles a subrange of terms.
Document partitioning: one machine handles a subrange of documents.
[Figure: the Master assigns splits to Parsers; each Parser feeds an Inverter, producing one index IL_1 … IL_k per document subset (document partitioning).]
Queries must also be distributed.
[Figure: the Master assigns splits to Parsers; each Parser writes postings into the term partitions a-f, g-p, q-z; each Inverter handles one partition (term partitioning).]
Each query term goes to one machine.
Master assigns a split to an idle parser machine.
Parser reads a document at a time and emits (term, doc) pairs.
Parser writes the pairs into j partitions; each partition covers a range of terms' first letters (e.g., a-f, g-p, q-z; here j = 3).
An inverter collects all (term,doc) pairs (= postings) for one term-partition.
It sorts them and writes out the postings lists.
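A toy sketch of the two task types (partition boundaries and names are illustrative):

```python
def parse(split, partitions=("af", "gp", "qz")):
    # Parser: emit (term, doc) pairs into j partitions by first letter
    out = {p: [] for p in partitions}
    for doc_id, text in split:
        for term in text.lower().split():
            for p in partitions:
                if p[0] <= term[0] <= p[1]:
                    out[p].append((term, doc_id))
    return out

def invert(pairs):
    # Inverter: sort all (term, doc) pairs of one partition, build postings lists
    postings = {}
    for term, doc_id in sorted(pairs):
        postings.setdefault(term, []).append(doc_id)
    return postings

parts = parse([(1, "caesar was ambitious"), (2, "brutus killed caesar")])
print(invert(parts["af"]))
# {'ambitious': [1], 'brutus': [2], 'caesar': [1, 2]}
```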
Index construction is an instance of MapReduce
a robust and conceptually simple framework for distributed computing
… without having to write code for the distribution part.
The Google indexing system (ca. 2002) consisted of a number of phases, each implemented in MapReduce.
[Figure: the same data flow viewed as MapReduce: the Parsers form the map phase, the Inverters the reduce phase.]
Sec. 4.5
Up to now, we have assumed static collections.
This used to be rare, but now it frequently happens that:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
And this induces:
Postings updates for terms already in dictionary
New terms are added to (and deleted from) the dictionary
Maintain “big” main index
New docs go into “small” auxiliary index
Search across both, and merge the results
Deletions
Invalidation bit-vector for deleted docs
Filter search results (i.e. docs) by the invalidation bit-vector
Periodically, re-index into one main index
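The query path over main + auxiliary index with an invalidation bit-vector can be sketched as (indexes are plain dicts here; names are illustrative):

```python
def search(term, main_index, aux_index, invalidated):
    # Search across both indexes and merge the results
    docs = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    # Filter out deleted docs via the invalidation bit-vector
    return sorted(d for d in docs if not invalidated[d])

main = {"caesar": [0, 1, 3]}
aux = {"caesar": [4]}                          # new docs live in the small index
deleted = [False, True, False, False, False]   # doc 1 has been deleted
print(search("caesar", main, aux, deleted))
# [0, 3, 4]
```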
Merges may touch a lot of stuff, giving poor performance.
Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list [new docIDs are greater], since the merge is then a simple append.
But this needs a lot of files, which is inefficient for the O/S.
Assumption: the index is one big file.
In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)
Maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z_0) in memory; the larger ones (I_0, I_1, …) on disk.
If Z_0 gets too big (> n), either write it to disk as I_0 (if no I_0 exists) or merge it with I_0 to form Z_1.
Then either write Z_1 to disk as I_1 (if no I_1 exists) or merge it with I_1 to form Z_2; etc.
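A sketch of logarithmic merging, with sorted lists standing in for on-disk indexes (the names Z_0 and I follow the slide; the merge itself is simplified):

```python
def log_merge_add(posting, Z0, I, n):
    Z0.append(posting)                  # new postings accumulate in memory
    if len(Z0) > n:                     # Z0 too big: push it down the hierarchy
        carry = sorted(Z0)
        Z0.clear()
        i = 0
        # like binary-counter carries: merge with each occupied level I_i
        while i < len(I) and I[i] is not None:
            carry = sorted(I[i] + carry)
            I[i] = None
            i += 1
        if i == len(I):
            I.append(None)
        I[i] = carry                    # first free level receives the result

Z0, I = [], []
for d in range(8):
    log_merge_add(d, Z0, I, n=1)
print(I)
# [None, None, [0, 1, 2, 3, 4, 5, 6, 7]]
```

Each posting moves through O(log T) levels, which is where the O(T log T) construction cost below comes from.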
Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T).
So logarithmic merge is much more efficient for index construction, but its query processing requires merging O(log T) indexes.
Most search engines now support dynamic indexing
News items, blogs, new topical web pages
But (sometimes/typically) they also periodically reconstruct the index from scratch
Query processing is then switched to the new index, and the old index is then deleted
Positional indexes: the same type of sorting problem, just larger.
Character n-gram indexes: for each n-gram, we need pointers to all dictionary terms containing it – the "postings".
Permuterm index: consider all cyclic rotations of T$, where T is any term.
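Generating the cyclic rotations of T$ can be sketched as:

```python
def permuterm_rotations(term):
    # Append the end marker $, then take every cyclic rotation
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```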