CS 430 / INFO 430 Information Retrieval Searching Full Text 4 Lecture 4

CS 430 / INFO 430
Information Retrieval
Lecture 4
Searching Full Text 4
Course Administration
Assignment 1 has been posted. It is a programming
assignment and is due on Saturday, September 17 at
midnight.
Follow the submission instructions carefully.
Send questions to cs430-l@cs.cornell.edu.
Organization of Files for Full Text
Searching
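[Figure: three linked components. The word list holds each term (ant, bee, cat, dog, elk, fox, gnu, hog) with a pointer to its postings; the postings file holds the inverted lists; the inverted lists refer to documents in the documents store.]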
Representation of Inverted Files
Document store: Stores the documents. Important for user
interface design. [Repositories for the storage of document
collections are covered in CS 431.]
Word list (vocabulary file): Stores list of terms (keywords).
Designed for searching and sequential processing, e.g., for
range queries, (lexicographic index). Often held in memory.
Postings file: Stores an inverted list (postings list) of
postings for each term. Designed for rapid merging of lists
and calculation of similarities. Each list is usually stored
sequentially.
Document Store
The Documents Store holds the corpus that is being indexed.
The corpus may be:
• primary documents, e.g., electronic journal articles or Web
pages
• surrogates, e.g., catalog records or abstracts, which refer to
the primary documents
Document Store
The storage of the document store may be:
Central (monolithic) - all documents stored together on a single
server (e.g., library catalog)
Distributed database - all documents managed together but
stored on several servers (e.g., Medline, Westlaw, Dialog)
Highly distributed - documents stored on independently
managed servers (e.g., the Web)
Each requires: a document ID, which is a unique identifier that
can be used by the search system to refer to the document, and a
location counter, which can be used to specify location of words
or characters within a document.
Documents Store for Web Search
Systems
For Web search systems:
• A document is a Web page.
• The documents store is the Web.
• The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page
on the Web for indexing. After indexing, the local copy of each
page is discarded, unless stored in a cache.
(In addition to the usual word list and postings file the indexing
system stores contextual information, which will be discussed in a
later lecture.)
Inverted File
Inverted file:
An inverted file is a list of search terms that are organized for
associative look-up, i.e., to answer the questions:
• In which documents does a specified search term appear?
• Where within each document does each term appear?
(There may be several occurrences.)
The word list and the postings file together provide an
inverted file system for free text searching. In addition, they
contain the data needed to calculate weights and information
that is used to display results.
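A minimal sketch of building such an inverted file in memory. The stop-word list, the sample documents, and the use of word offsets as locations are assumptions made for illustration; they are not taken from the lecture.

```python
from collections import defaultdict

# Hypothetical stop-word list and sample documents, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "in"}

def build_inverted_file(docs):
    """Map each term to a list of (doc_id, location) postings.

    Location is the word offset within the document (an assumption).
    Documents are visited in doc_id order, so each list is sorted.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for location, word in enumerate(docs[doc_id].lower().split()):
            if word not in STOP_WORDS:
                index[word].append((doc_id, location))
    return dict(index)

docs = {
    1: "the actor and the abacus",
    2: "an abacus in the aspen grove",
}
inverted = build_inverted_file(docs)
word_list = sorted(inverted)  # vocabulary in lexicographic order

print(word_list)
print(inverted["abacus"])
```

Each list answers both questions: the first element of a posting says which document, the second says where in that document.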
Inverted File -- Basic Concept
Word      Documents
abacus    3, 19, 22
actor     2, 19, 29
aspen     5
atoll     11, 34
Stop words are removed before building the index.
Inverted List -- Definitions
Inverted List: A list of all the entries in an inverted file that
apply to a specific word, e.g.
abacus    3, 19, 22
Posting: Entry in an inverted list that applies to a single
instance of a term within a document, e.g., there are three
postings for "abacus":
abacus    3
abacus    19
abacus    22
Use of Inverted Files for Calculating
Similarities
In the term vector space, if q is a query and dj a document,
then q and dj have no terms in common iff q · dj = 0.
1. To calculate all the non-zero similarities, find R, the set of all
the documents, dj, that contain at least one term in the query.
2. Merge the inverted lists for each term ti in the query, with a
logical or, to establish the set, R.
3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate
weights.
4. Return the elements of R in ranked order.
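The four steps can be sketched as follows. The weights are made-up numbers, and the similarity used here is the plain inner product q · dj without length normalization, which is an assumption; the lecture leaves the choice of weights open.

```python
from collections import defaultdict

# Hypothetical weighted postings: term -> {doc_id: weight of the term in that document}.
postings = {
    "abacus": {3: 0.5, 19: 0.8, 22: 0.2},
    "actor": {2: 0.4, 19: 0.3, 29: 0.9},
}

def ranked_similarities(query_weights, postings):
    """Return [(doc_id, score)] for every document that shares a term with the query.

    Walking the inverted list of each query term visits exactly the set R,
    so documents with zero similarity are never touched.
    """
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in postings.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight  # one term of the inner product
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranking = ranked_similarities({"abacus": 1.0, "actor": 1.0}, postings)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

Document 19 ranks first because it is the only document that appears in both inverted lists.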
Enhancements to Inverted Files -- Concept
Location: Each posting holds information about the location of
each term within the document.
Uses
user interface design -- highlight location of search term
adjacency and near operators (in Boolean searching)
Frequency: Each inverted list includes the number of postings
for each term.
Uses
term weighting
query processing optimization
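A short sketch of how location and frequency information might be used. The postings are (document, location) pairs matching the lecture's abacus and actor example; treating locations as word offsets, and the function names, are assumptions.

```python
# Postings as (document, location) pairs for two terms of the running example.
postings = {
    "abacus": [(3, 94), (19, 7), (19, 212), (22, 56)],
    "actor": [(2, 66), (19, 213), (29, 45)],
}

def adjacent(first, second, postings, distance=1):
    """Documents in which `second` occurs exactly `distance` words after `first`."""
    second_set = set(postings.get(second, []))
    hits = {doc for doc, loc in postings.get(first, [])
            if (doc, loc + distance) in second_set}
    return sorted(hits)

def frequency(term, postings):
    """Number of postings for a term, as would be stored with its inverted list."""
    return len(postings.get(term, []))

print(adjacent("abacus", "actor", postings))
print(frequency("abacus", postings))
```

Processing the rarer term first (lowest frequency) is one way the stored counts can help query optimization.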
Inverted File -- Concept (Enhanced)
Word      Postings   Document   Location
abacus    4          3          94
                     19         7
                     19         212
                     22         56
actor     3          2          66
                     19         213
                     29         45
aspen     1          5          43
atoll     3          11         3
                     11         70
                     34         40
The three rows for actor form the inverted list for the term actor.
Lexicographic Order
It is important that the word list can be processed sequentially,
i.e., in alphabetic order:
• To search with wild cards, e.g., comp*, which expands to
every term beginning with the letters "comp".
• To list results for browsing lists of search terms.
Alphabetic order is a special case of the mathematical concept of
lexicographic order.
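A sketch of wild-card expansion over a sorted word list. A binary search finds the first candidate, and the matches then sit in one contiguous run. The word list is hypothetical.

```python
from bisect import bisect_left

def prefix_range(word_list, prefix):
    """Return every term in the sorted word_list that starts with prefix."""
    start = bisect_left(word_list, prefix)  # first term >= prefix
    end = start
    while end < len(word_list) and word_list[end].startswith(prefix):
        end += 1
    return word_list[start:end]

# Hypothetical word list, kept in lexicographic order.
word_list = sorted(["compass", "compute", "computer", "comic", "cone", "abacus"])
matches = prefix_range(word_list, "comp")
print(matches)
```

The same trick does not work for *comp, because terms ending in "comp" are scattered throughout the sorted list.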
Postings File
The postings file stores the elements of a sparse matrix, the term
assignment matrix, with weights.
It is stored as a separate inverted list for each column, i.e., a list
corresponding to each term in the index file.
Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document.
Each list consists of one or many individual postings.
Postings File:
A Linked List for Each Term
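1 abacus -> (3, 94) -> (19, 7) -> (19, 212) -> (22, 56)
2 actor  -> (2, 66) -> (19, 213) -> (29, 45)
3 aspen  -> (5, 43)
4 atoll  -> (11, 3) -> (11, 70) -> (34, 40)
(Each posting is shown as document, location.)
A linked list for each term is convenient to process sequentially,
but slow to update when the lists are long.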
Length of Postings File
For a common term there may be a very large number of postings.
Example:
1,000,000,000 documents
1,000,000 distinct words
average length 1,000 words per document
10^12 postings
By Zipf's law, the 10th ranking word occurs, approximately:
(10^12/10)/10 times
= 10^10 times
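The arithmetic can be checked with a few lines. The sketch reads the first division by 10 as a Zipf constant of about 1/10 (occurrences ≈ total / (10 × rank)); that reading is an assumption, since the slide does not name the constant.

```python
documents = 10**9
words_per_document = 10**3
total_postings = documents * words_per_document  # 10**12

def zipf_occurrences(rank, total):
    """Approximate occurrences of the word at `rank`, assuming a Zipf constant of 1/10."""
    return total // (10 * rank)

print(total_postings)
print(zipf_occurrences(10, total_postings))
```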
Postings File
Merging inverted lists is the most computationally intensive task
in many information retrieval systems.
Since inverted lists may be long, it is important to match postings
efficiently.
Usually, the inverted lists will be held on disk and paged into
memory for matching. Therefore algorithms for matching
postings process the lists sequentially.
For efficient matching, the inverted lists should all be sorted in the
same sequence.
Inverted lists are commonly cached to minimize disk accesses.
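A sketch of sequential matching. Both lists are assumed to hold distinct document IDs in ascending order, so a single pass with two cursors is enough and neither list has to be held in memory as a whole. The function names are illustrative.

```python
def merge_and(a, b):
    """Documents present in both sorted lists (logical and)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_or(a, b):
    """Documents present in either sorted list (logical or), without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

abacus = [3, 19, 22]
actor = [2, 19, 29]
print(merge_and(abacus, actor))
print(merge_or(abacus, actor))
```

If the two lists were sorted in different sequences, each lookup would need a search instead of a single forward step.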
Data for Calculating Weights
The calculation of weights requires extra data to be held in
the inverted file system.
For each term, tj, and document, di:
fij = number of occurrences of tj in di
For each term, tj:
nj = number of documents containing tj
For each document, di:
mi = maximum frequency of any term in di
For the entire document file:
n = total number of documents
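The slide lists the data but not a formula. As an illustration only, the sketch below assumes one common scheme: term frequency normalized by the maximum frequency in the document, multiplied by an inverse document frequency. Other weighting schemes use the same four quantities.

```python
import math

def term_weight(f_ij, m_i, n_j, n):
    """Weight of term tj in document di under an assumed tf.idf scheme.

    f_ij: occurrences of tj in di
    m_i:  maximum frequency of any term in di
    n_j:  number of documents containing tj
    n:    total number of documents
    """
    tf = f_ij / m_i            # normalized term frequency, between 0 and 1
    idf = math.log2(n / n_j)   # rarer terms get larger values
    return tf * idf

# Hypothetical numbers: the term occurs twice, the most frequent term four times,
# and the term appears in 250 of 1,000 documents.
print(term_weight(2, 4, 250, 1000))
```

A term that occurs in every document gets weight 0 under this scheme, since log2(n/n) = 0.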
Word List: Individual Records for
Each Term
The record for term j in the word list contains:
term j
pointer to inverted (postings) list for term j
number of documents in which term j occurs (nj)
Decisions in Building an Inverted File:
Efficiency and Query Languages
Some query options may require huge computation, e.g.,
Regular expressions
If inverted files are stored in lexicographic order,
comp* can be processed efficiently
*comp cannot be processed efficiently
Logical operators
If A and B are search terms
A or B can be processed by comparing two moderate sized lists
(not A) or (not B) requires two very large lists
Efficiency Criteria
Storage
Inverted files are big, typically 10% to 100% the size of the
collection of documents.
Update performance
It must be possible, with a reasonable amount of computation, to:
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance
Retrieval must be fast enough to satisfy users and not use
excessive resources.
Word List
On disk
If a word list is held on disk, search time is dominated by the
number of disk accesses.
In memory
Suppose that a word list has 1,000,000 distinct terms.
Each index entry consists of the term, some basic statistics and
a pointer to the inverted list, average 100 characters.
Size of index is 100 megabytes, which can easily be held in
memory of a dedicated computer.
File Structures for Inverted Files:
Linear Index
Advantages
Can be searched quickly, e.g., by binary search, O(log n)
Good for lexicographic processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages
Index must be rebuilt if an extra term is added
File Structures for Inverted Files:
Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
           elk
         /     \
      bee       hog
     /   \     /
  ant     cat fox
            \    \
            dog   gnu
File Structures for Inverted Files:
Binary Tree
Advantages
Can be searched quickly
Convenient for batch updating
Easy to add an extra term
Economical use of storage
Disadvantages
Less good for lexicographic processing, e.g., comp*
Tree tends to become unbalanced
If the index is held on disk, important to optimize
the number of disk accesses
File Structures for Inverted Files:
Binary Tree
Calculation of maximum depth of tree.
Worst case: depth = n
O(n)
Ideal case: depth = log(n + 1)/log 2
O(log n)
Illustrates importance of balanced trees.
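A small experiment that makes the point concrete, using the lecture's eight terms. Depth is counted in nodes, as in the worst case above. The class and function names are illustrative.

```python
import math

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into an unbalanced binary search tree and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right

def depth(node):
    """Number of nodes on the longest path from this node down to a leaf."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

terms = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
print(depth(build(terms)))                 # insertion order from the lecture
print(depth(build(sorted(terms))))         # sorted input degenerates into a chain
print(math.ceil(math.log2(len(terms) + 1)))  # smallest possible depth for 8 nodes
```

With sorted input every new term goes to the right of the previous one, which gives the worst case depth = n.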
File Structures for Inverted Files:
Right Threaded Binary Tree
Threaded tree:
A binary search tree in which each node uses an
otherwise-empty left child link to refer to the node's in-order
predecessor and an empty right child link to refer
to its in-order successor.
Right-threaded tree:
A variant of a threaded tree in which only the right
thread, i.e. link to the successor, of each node is
maintained. Can be used for lexicographic processing.
A good data structure when held in memory
Knuth vol 1, 2.3.1, page 325.
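A minimal sketch of a right-threaded tree, written for illustration and not taken from Knuth's presentation. The field `is_thread` (a hypothetical name) records whether `right` is a real child or a thread to the in-order successor. Traversal in lexicographic order then needs neither recursion nor a stack.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None      # right child, or thread to the in-order successor
        self.is_thread = True  # True while `right` is a thread (None for the last node)

def insert(root, key):
    """Insert key and keep the right threads consistent; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                new = Node(key)
                new.right = node       # successor of a new left leaf is its parent
                node.left = new
                return root
            node = node.left
        else:
            if node.is_thread:
                new = Node(key)
                new.right = node.right  # the new node inherits the parent's thread
                node.right = new
                node.is_thread = False
                return root
            node = node.right

def in_order(root):
    """List the keys in lexicographic order by following children and threads."""
    node = root
    while node is not None and node.left is not None:
        node = node.left               # start at the leftmost node
    out = []
    while node is not None:
        out.append(node.key)
        if node.is_thread:
            node = node.right          # jump straight to the successor
        else:
            node = node.right
            while node.left is not None:
                node = node.left       # leftmost node of the right subtree
    return out

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
print(in_order(root))
```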
File Structures for Inverted Files:
Right Threaded Binary Tree
[Figure: a right-threaded binary tree over the terms ant, bee, cat, dog, elk, fox, gnu, hog, with one link shown as NULL.]
File Structures for Inverted Files:
B-trees
B-tree of order m:
A balanced, multiway search tree:
• Each node stores many keys
• Root has between 2 and 2m keys.
All other internal nodes have between m and 2m keys.
• If ki is the ith key in a given internal node
-> all keys in the (i-1)th child are smaller than ki
-> all keys in the ith child are bigger than ki
• All leaves are at the same depth
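A sketch of searching a multiway tree with this key ordering. The class name and the small two-level tree are hypothetical; insertion and node splitting, which keep the tree balanced, are not shown.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys held in this node
        self.children = children or []  # empty for a leaf, else len(keys) + 1 nodes

def contains(node, key):
    """Return True if key is stored somewhere in the tree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)  # index of the first key >= the search key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        # Child i holds the keys that fall between keys[i-1] and keys[i].
        node = node.children[i] if node.children else None
    return False

# Hypothetical order-2 tree: a root with two keys and three leaves.
root = BTreeNode([50, 65], [
    BTreeNode([10, 19, 35]),
    BTreeNode([55, 59]),
    BTreeNode([70, 90, 98]),
])
print(contains(root, 59))
print(contains(root, 60))
```

Each step of the loop reads one node, so with the index on disk the number of disk accesses is bounded by the depth of the tree.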
File Structures for Inverted Files:
B-trees
B-tree example (order 2)
Root:            50 65
Internal nodes:  10 19 35 | 55 59 | 70 90 98
Leaves shown:    1 5 8 9 | 12 14 18 | 21 24 28 | 36 47 | 66 68 | 72 73 | 91 95 97
(Other leaves are not shown.)
Every arrow points to a node containing between 2 and 4 keys.
A node with k keys has k + 1 pointers.
File Structures for Inverted Files:
B+-tree
Example: B+-tree of order 2, bucket size 4
• A B-tree is used as an index
• Data is stored in the leaves of the tree, known as buckets
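[Figure: a B-tree index with root keys 50, 65 and internal nodes (10, 25), (55, 59), (70, 81, 90). Their pointers lead to leaf buckets holding the data records: ... D9; D51 ... D54; D66 ...; D81 ...]
(Implementation of B+-trees is covered in CS 432.)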