Information Retrieval Techniques
MS(CS) Lecture 5
AIR UNIVERSITY MULTAN CAMPUS
Quick Review
• Inverted Index Construction (Exercise)
• Query Processing Using Inverted Index
• Faster Posting Merges: Skip Pointers
• Phrase Queries
– Bi-word Index
– Extended Bi-word Index
– Positional Index
Sec. 2.4.2
Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
– Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for such
queries; biword indexes cannot.
• Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it work
for any value of k?
– This is a little tricky to do correctly and efficiently
– See Figure 2.12 of IIR
– There’s likely to be a problem on it!
Proximity search
We just saw how to use a positional index for phrase
searches.
We can also use it for proximity search.
For example: employment /4 place
Find all documents that contain EMPLOYMENT and
PLACE within 4 words of each other.
“Employment agencies that place healthcare workers are seeing growth” is a hit.
“Employment agencies that have learned to adapt now place healthcare workers” is not a hit.
Proximity search?
Use the positional index
Simplest algorithm: look at cross-product of positions
of (i) EMPLOYMENT in document and (ii) PLACE in
document
Very inefficient for frequent words, especially stop
words
Note that we want to return the actual matching
positions, not just a list of documents.
This is important for dynamic summaries etc.
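The linear merge hinted at above (cf. Figure 2.12 of IIR) avoids the full cross-product. The sketch below assumes each postings list is a sorted list of (docID, positions) pairs, with positions sorted ascending; it returns the actual matching positions, not just documents:

```python
def positional_intersect(postings1, postings2, k):
    """Return (docID, pos1, pos2) triples where the two terms occur
    within k words of each other in the same document."""
    answer = []
    i = j = 0
    while i < len(postings1) and j < len(postings2):
        doc1, pp1 = postings1[i]
        doc2, pp2 = postings2[j]
        if doc1 == doc2:
            window = []   # positions of term 2 near the current pos1
            it2 = 0
            for pos1 in pp1:
                # admit term-2 positions up to k words to the right
                while it2 < len(pp2) and pp2[it2] <= pos1 + k:
                    window.append(pp2[it2])
                    it2 += 1
                # discard positions more than k words to the left
                while window and window[0] < pos1 - k:
                    window.pop(0)
                for pos2 in window:
                    answer.append((doc1, pos1, pos2))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer
```

On the employment /4 place example, only the document where the terms are at most 4 words apart survives.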
Sec. 2.4.2
Positional index size
• Position values/offsets can be compressed
• Nevertheless, a positional index expands postings storage substantially
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.
Sec. 2.4.2
Positional index size
• Need an entry for each occurrence, not just
once per document
Why?
• Index size depends on average document size
– Average web page has <1000 terms
– SEC filings, books, even some epic poems … easily
100,000 terms
• Consider a term with frequency 0.1%

  Document size   Postings   Positional postings
  1,000           1          1
  100,000         1          100
Sec. 2.4.2
Rules of thumb
• A positional index is 2–4 times as large as a non-positional index
• Positional index size 35–50% of volume of
original text
• Caveat: all of this holds for “English-like”
languages
Sec. 2.4.3
Combination schemes
• These two approaches can be profitably
combined
– For particular phrases (“Michael Jackson”,
“Britney Spears”) it is inefficient to keep on
merging positional postings lists
• Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more
sophisticated mixed indexing scheme
– A typical web query mixture was executed in ¼ of
the time of using just a positional index
– It required 26% more space than having a
positional index alone
Inverted Index Construction
• Positional index size
• Dictionary size
• Hardware issues
• Large collection requirements analysis
• Inverted index
Dictionaries
The dictionary is the data structure for storing the
term vocabulary.
Term vocabulary: the data
Dictionary: the data structure for storing the term
vocabulary
Dictionary as array of fixed-width entries
For each term, we need to store a couple of items:
document frequency
pointer to postings list
. . .
Assume for the time being that we can store this
information in a fixed-length entry.
Assume that we store these entries in an array.
Dictionary as array of fixed-width entries
Space needed per entry: term — 20 bytes, document frequency — 4 bytes, pointer to postings list — 4 bytes.
How do we look up a query term qi in this array at query time?
That is: which data structure do we use to locate the entry (row)
in the array where qi is stored?
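The fixed-width layout above (20-byte term field, 4-byte document frequency, 4-byte postings pointer) can be sketched with Python's struct module; the field names and null-padding scheme are illustrative, not prescribed by the slide:

```python
import struct

# One fixed-width dictionary entry: 20-byte term (null-padded),
# 4-byte document frequency, 4-byte offset into the postings file.
ENTRY = struct.Struct("=20sii")   # "=": standard sizes, no alignment padding

def pack_entry(term, df, postings_ptr):
    """Serialize one dictionary row into its 28-byte fixed-width form."""
    return ENTRY.pack(term.encode("ascii"), df, postings_ptr)

def unpack_entry(raw):
    """Decode a 28-byte row back into (term, df, postings_ptr)."""
    term, df, ptr = ENTRY.unpack(raw)
    return term.rstrip(b"\x00").decode("ascii"), df, ptr
```

Because every entry has the same size, row i of the array starts at byte offset i * ENTRY.size, which is what makes array lookup by row number trivial.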
Data structures for looking up term
Two main classes of data structures: hashes and trees
Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
Is there a fixed number of terms or will it keep
growing?
What are the relative frequencies with which
various keys will be accessed?
How many terms are we likely to have?
Hashes
Each vocabulary term is hashed into an integer.
Try to avoid collisions
At query time, do the following: hash query term, resolve
collisions, locate entry in fixed-width array
Pros: Lookup in a hash is faster than lookup in a tree.
Lookup time is constant.
Cons
no way to find minor variants (resume vs. résumé)
no prefix search (all terms starting with automat)
need to rehash everything periodically if vocabulary keeps
growing
Trees
Trees solve the prefix problem (find all terms starting with
automat).
Simplest tree: binary tree
Search is slightly slower than in hashes: O(log M), where M is
the size of the vocabulary.
O(log M) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of children
in the interval [a, b] where a, b are appropriate positive
integers, e.g., [2, 4].
[Figure: Binary tree]
[Figure: B-tree]
Ch. 4
Index construction
• How do we construct an index?
• What strategies can we use with limited main
memory?
Sec. 4.1
Hardware basics
• Many design decisions in information retrieval
are based on the characteristics of hardware
• We begin by reviewing hardware basics
Sec. 4.1
Hardware basics
• Access to data in memory is much faster than
access to data on disk.
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
• Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
• Block sizes: 8 KB to 256 KB.
Sec. 4.1
Hardware basics
• Servers used in IR systems now typically have
several GB of main memory, sometimes tens
of GB.
• Available disk space is several (2–3) orders of
magnitude larger.
• Fault tolerance is very expensive: It’s much
cheaper to use many regular machines rather
than one fault tolerant machine.
Sec. 4.2
Recall IIR 1 index construction
• Documents are parsed to extract words and these
are saved with the Document ID.
Doc 1
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 2
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
i'          1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2
Sec. 4.2
Key step
• After all documents have been
parsed, the inverted file is
sorted by terms.
We focus on this sort step.
We have 100M items to sort.
Sorted postings (term, docID):

Term        Doc #
ambitious   2
be          2
brutus      1
brutus      2
capitol     1
caesar      1
caesar      2
caesar      2
did         1
enact       1
hath        2
I           1
I           1
i'          1
it          2
julius      1
killed      1
killed      1
let         2
me          1
noble       2
so          2
the         1
the         2
told        2
you         2
was         1
was         2
with        2
Sec. 4.2
Scaling index construction
• In-memory index construction does not scale
– Can’t stuff entire collection into memory, sort,
then write back
• How can we construct an index for very large
collections?
• Taking into account the hardware constraints
we just learned about . . .
• Memory, disk, speed, etc.
Sec. 4.2
Sort-based index construction
• As we build the index, we parse docs one at a time.
– While building the index, we cannot easily exploit
compression tricks (you can, but much more complex)
• The final postings for any term are incomplete until the end.
• At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
• T = 100,000,000 in the case of RCV1
– So … we can do this in memory in 2009, but typical
collections are much larger. E.g., the New York Times
provides an index of >150 years of newswire
• Thus: We need to store intermediate results on disk.
Sec. 4.2
Sort using disk as “memory”?
• Can we use the same index construction
algorithm for larger collections, but by using
disk instead of memory?
• No: Sorting T = 100,000,000 records on disk is
too slow – too many disk seeks.
• We need an external sorting algorithm.
RCV1 collection
Shakespeare’s collected works are not large
enough for demonstrating many of the points in
this course.
As an example for applying scalable index
construction algorithms, we will use the
Reuters RCV1 collection.
English newswire articles sent over the wire in
1995 and 1996 (one year).
“External” sorting algorithm
(using few disk seeks)
We must sort T = 100,000,000 non-positional postings.
Each posting has size 12 bytes (4+4+4: termID, docID,
document frequency).
Define a block to consist of 10,000,000 such postings
We can easily fit that many postings into memory.
We will have 10 such blocks for RCV1.
Basic idea of algorithm:
For each block: (i) accumulate postings, (ii) sort in memory,
(iii) write to disk
Then merge the blocks into one long sorted order.
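The basic idea above can be sketched as follows, with in-memory lists standing in for the on-disk runs (block_size plays the role of the 10,000,000-posting block; heapq.merge performs the final k-way merge with a single sequential pass over each run):

```python
import heapq

def bsbi_sort(postings, block_size):
    """Blocked sort-based sorting of (termID, docID) postings:
    (i) accumulate a block, (ii) sort it in memory, (iii) 'write' it
    out as a sorted run, then merge all runs into one sorted order."""
    runs = []
    for start in range(0, len(postings), block_size):
        block = postings[start:start + block_size]
        runs.append(sorted(block))        # one sorted run per block
    # k-way merge of the sorted runs (what the disk merge step does)
    return list(heapq.merge(*runs))
```

On disk, each run would be a file read sequentially, so the merge needs only a few seeks per block rather than one per record.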
[Figure: Merging two blocks]
Blocked Sort-Based Indexing
Key decision: What is the size of one block?
Problem with sort-based algorithm
Our assumption was: we can keep the dictionary in
memory.
We need the dictionary (which grows dynamically) in
order to implement a term to termID mapping.
Actually, we could work with (term, docID) postings
instead of (termID, docID) postings . . .
. . . but then intermediate files become very large.
(We would end up with a scalable, but very slow index
construction method.)
Single-pass in-memory indexing
Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each
block – no need to maintain term-termID mapping
across blocks.
Key idea 2: Don’t sort. Accumulate postings in
postings lists as they occur.
With these two ideas we can generate a complete
inverted index for each block.
These separate indexes can then be merged into one
big index.
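The two key ideas can be sketched as below, assuming a stream of (term, docID) pairs; yielding a sorted block stands in for writing it to disk, and no term-to-termID mapping is maintained across blocks:

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings):
    """SPIMI sketch: build a fresh per-block dictionary and append
    docIDs to postings lists as they occur (no sorting of postings);
    terms are sorted only once, when the block is emitted."""
    index = defaultdict(list)
    count = 0
    for term, doc_id in token_stream:
        index[term].append(doc_id)        # accumulate posting directly
        count += 1
        if count >= max_postings:         # block is full: "write to disk"
            yield sorted(index.items())
            index = defaultdict(list)
            count = 0
    if index:
        yield sorted(index.items())
```

Each yielded block is a complete inverted index for its slice of the collection; a final pass would merge these block indexes into one big index.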
Using wildcards in queries
Wildcard queries
mon*: find all docs containing any term beginning with mon
Easy with B-tree dictionary: retrieve all terms t in the range:
mon ≤ t < moo
*mon: find all docs containing any term ending with mon
Maintain an additional B-tree for terms written backwards
Then retrieve all terms t in the range: nom ≤ t < non
Result: A set of terms that are matches for wildcard query
Then retrieve documents that contain any of these terms
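The B-tree range lookup mon ≤ t < moo can be mimicked over a sorted vocabulary array with binary search; this is a sketch that assumes the prefix's last character is not the maximum code point:

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """All terms with the given prefix, via the range
    prefix <= t < prefix-with-last-character-bumped
    (e.g. mon <= t < moo)."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # smallest string greater than every string carrying this prefix
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]
```

The returned term set is then used to retrieve every document containing any of those terms.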
How to handle * in the middle of a term
Example: m*nchen
We could look up m* and *nchen in the B-tree and
intersect the two term sets.
Expensive
Alternative: permuterm index
Basic idea: Rotate every wildcard query, so that the *
occurs at the end.
Store each of these rotations in the dictionary, say, in
a B-tree
Permuterm index
For term HELLO: add hello$, ello$h, llo$he,
lo$hel, and o$hell to the B-tree where $ is a
special symbol
[Figure: Permuterm → term mapping]
Permuterm index
For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel, and
o$hell
Queries
For X, look up X$
For X*, look up X*$
For *X, look up X$*
For *X*, look up X*
For X*Y, look up Y$X*
Example: For hel*o, look up o$hel*
Permuterm index would better be called a permuterm tree.
But permuterm index is the more common name.
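Generating the stored rotations and rotating a query can be sketched as below; note the code emits all rotations of term + “$” (the slide lists five of the six for HELLO), and rotate_query follows the X*Y → Y$X* rule for queries with a single *:

```python
def permuterm_rotations(term):
    """All rotations of term + '$' to store in the B-tree."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(wildcard):
    """Rotate a single-* wildcard so the * ends at the right:
    X*Y -> Y$X* (X or Y may be empty)."""
    x, y = wildcard.split("*")
    return y + "$" + x + "*"
```

For hel*o this yields the lookup key o$hel*, which is a prefix of the stored rotation o$hell, so HELLO is retrieved.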
Processing a lookup in the permuterm index
Rotate query wildcard to the right
Use B-tree lookup as before
Problem: Permuterm more than quadruples the size
of the dictionary compared to a regular B-tree.
(empirical number)
k-gram indexes
More space-efficient than permuterm index
Enumerate all character k-grams (sequence of k characters)
occurring in a term
2-grams are called bigrams.
Example: from April is the cruelest month we get the bigrams:
$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo
on nt h$
$ is a special word boundary symbol, as before.
Maintain an inverted index from bigrams to the terms that
contain the bigram
[Figure: Postings list in a 3-gram inverted index]
k-gram (bigram, trigram, . . . ) indexes
Note that we now have two different types of
inverted indexes
The term-document inverted index for finding
documents based on a query consisting of terms
The k-gram index for finding terms based on a query
consisting of k-grams
Processing wildcarded terms in a bigram index
Query mon* can now be run as: $m AND mo AND on
Gets us all terms with the prefix mon . . .
. . . but also many “false positives” like MOON.
We must postfilter these terms against the query.
Surviving terms are then looked up in the term-document
inverted index.
k-gram index vs. permuterm index
k-gram index is more space efficient.
Permuterm index doesn’t require postfiltering.
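The whole bigram-index path for a trailing-* query can be sketched as follows; the helper names are illustrative, and the postfilter here is a simple prefix check, which suffices for queries of the form X*:

```python
def bigrams(term):
    """Character bigrams of a term, with $ marking word boundaries."""
    t = "$" + term + "$"
    return [t[i:i + 2] for i in range(len(t) - 1)]

def build_bigram_index(vocabulary):
    """Inverted index from bigrams to the set of terms containing them."""
    index = {}
    for term in vocabulary:
        for g in set(bigrams(term)):
            index.setdefault(g, set()).add(term)
    return index

def wildcard_terms(index, query):
    """Run mon* as $m AND mo AND on over the bigram index, then
    postfilter false positives such as MOON."""
    prefix = query.rstrip("*")
    grams = bigrams(prefix)[:-1]   # drop the gram crossing the final $
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(t for t in candidates if t.startswith(prefix))
```

With the vocabulary {month, moon, monkey, other}, the conjunction $m AND mo AND on also matches moon, and only the postfilter removes it.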
Exercise
Google has very limited support for wildcard queries.
For example, this query doesn’t work very well on Google:
[gen* universit*]
Intention: you are looking for the University of Geneva, but
don’t know which accents to use for the French words for
university and Geneva.
According to Google search basics, 2010-04-29: “Note that the
* operator works only on whole words, not parts of words.”
But this is not entirely true. Try [pythag*] and [m*nchen]
Exercise: Why doesn’t Google fully support wildcard queries?
Processing wildcard queries in the term-document index
Problem 1: we must potentially execute a large number of
Boolean queries.
Most straightforward semantics: Conjunction of disjunctions
For [gen* universit*]: geneva university OR geneva université
OR genève university OR genève université OR general
universities OR . . .
Very expensive
Problem 2: Users hate to type.
If abbreviated queries like [pyth* theo*] for [pythagoras’
theorem] are allowed, users will use them a lot.
This would significantly increase the cost of answering queries.
Somewhat alleviated by Google Suggest
Spelling correction
Two principal uses
Correcting documents being indexed
Correcting user queries
Two different methods for spelling correction
Isolated word spelling correction
Check each word on its own for misspelling
Will not catch typos resulting in correctly spelled words, e.g., an
asteroid that fell form the sky
Context-sensitive spelling correction
Look at surrounding words
Can correct form/from error above
Correcting documents
We’re not interested in interactive spelling correction
of documents (e.g., MS Word) in this class.
In IR, we use document correction primarily for
OCR’ed documents. (OCR = optical character
recognition)
The general philosophy in IR is: don’t change the
documents.
Correcting queries
First: isolated word spelling correction
Premise 1: There is a list of “correct words” from which the
correct spellings come.
Premise 2: We have a way of computing the distance between
a misspelled word and a correct word.
Simple spelling correction algorithm: return the “correct” word
that has the smallest distance to the misspelled word.
Example: informaton → information
For the list of correct words, we can use the vocabulary of all
words that occur in our collection.
Why is this problematic?
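The simple algorithm above needs a distance measure; the standard choice is Levenshtein (edit) distance, sketched here with dynamic programming (the slide does not fix a particular distance, so this is one reasonable instantiation):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions,
    deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary):
    """Return the 'correct' word with the smallest distance to `word`."""
    return min(vocabulary, key=lambda v: edit_distance(word, v))
```

For informaton, the distance to information is 1 (one inserted character), so it wins over more distant vocabulary entries.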
Alternatives to using the term vocabulary
A standard dictionary (Webster’s, OED etc.)
An industry-specific dictionary (for specialized IR
systems)
The term vocabulary of the collection, appropriately
weighted
Hash function for strings:
Example: key = “ali”, KeySize = 3, TableSize = 10,007
ASCII codes: key[0] = ‘a’ = 97, key[1] = ‘l’ = 108, key[2] = ‘i’ = 105
hash(“ali”) = (105 * 1 + 108 * 37 + 97 * 37²) % 10,007 = 6,803
So the hash function maps “ali” to slot 6,803 of the table (slots 0 … 10,006).
CENG 213 Data Structures
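A minimal sketch of this polynomial string hash, written with Horner's rule so the first character receives the highest power of 37 (table size and multiplier follow the slide; ord supplies the character codes):

```python
def string_hash(key, table_size=10_007):
    """Polynomial hash of a string: sum of character codes weighted
    by powers of 37, reduced modulo the table size."""
    h = 0
    for ch in key:
        h = h * 37 + ord(ch)   # Horner's rule for the polynomial
    return h % table_size
```

Multiplying by a small prime like 37 spreads keys across the table while keeping the per-character work to one multiply and one add.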
QUESTIONS?