IR Indexing
Thanks to
B. Arms
SIMS
Baldi, Frasconi, Smyth
Manning, Raghavan, Schutze
What we have covered
• What is IR
• Evaluation
• Tokenization, terms and properties of text
• Web crawling; robots.txt
• Vector model of documents
  – Bag of words
• Queries
• Measures of similarity
• This presentation
  – Indexing
  – Inverted files
Last time
• Vector models of documents and queries
  – Bag of words model
• Similarity measures
  – Text similarity typically used
  – Similarity is a measure of relevance (and ranking)
  – Query is a small document
  – Match query to document
• All stored and indexed before a query is matched.
Summary: What’s the point of using vector spaces?
• A well-formed algebraic space for retrieval
• Key: A user’s query can be viewed as a (very) short document.
• Query becomes a vector in the same space as the docs.
• Can measure each doc’s proximity to it.
• Natural measure of scores/ranking – no longer Boolean.
  – Queries are expressed as bags of words
• Other similarity measures: see
  http://www.lans.ece.utexas.edu/~strehl/diss/node52.html for a survey
A Typical Web Search Engine
[Diagram: the Web feeds a Crawler; an Indexer builds the Index; a Query Engine answers Users through the Interface. Example tools annotated: Solr/Lucene, YouSeer, Nutch, LucidWorks.]
Indexing
[The same search-engine diagram (Crawler, Indexer, Index, Query Engine, Interface, Users, Web), repeated with the indexing components in focus.]
Why indexing?
• For efficient searching for documents with unstructured text (not databases) using terms as queries
  – Databases can still be searched with an index
    • Sphinx
  – Online sequential text search (grep)
    • Small collection
    • Text volatile
  – Data structures – indexes
    • Large, semi-stable document collection
    • Efficient search
Sec. 1.1
Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not the answer?
  – Slow (for large corpora)
  – NOT Calpurnia is non-trivial
  – Other operations (e.g., find the word Romans near countrymen) not feasible
  – Ranked retrieval (best documents to return)
Sec. 1.1
Term-document incidence matrices

             Antony &   Julius   The
             Cleopatra  Caesar   Tempest  Hamlet  Othello  Macbeth
  Antony         1         1        0        0       0        1
  Brutus         1         1        0        1       0        0
  Caesar         1         1        0        1       1        1
  Calpurnia      0         1        0        0       0        0
  Cleopatra      1         0        0        0       0        0
  mercy          1         0        1        1       1        1
  worser         1         0        1        1       1        0

1 if play contains word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them:
  – Brutus:            110100 AND
  – Caesar:            110111 AND
  – NOT Calpurnia:     101111 =
  –                    100100
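The bitwise AND above can be sketched with Python integers used as bit vectors, one bit per play (a minimal sketch; the literals are taken from the incidence matrix above):

```python
# Incidence vectors, one bit per play:
# Antony&Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

MASK = 0b111111  # six plays in the collection

# Brutus AND Caesar AND NOT Calpurnia
result = brutus & caesar & (~calpurnia & MASK)
print(f"{result:06b}")  # -> 100100: Antony&Cleopatra and Hamlet match
```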
Sec. 1.1
Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Sec. 1.1
Bigger collections
• Consider N = 1 million documents, each
with about 1000 words.
• Avg 6 bytes/word including
spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms
among these.
Sec. 1.1
Can’t build the matrix
• 500K x 1M matrix has half-a-trillion 0’s and
1’s.
• But it has no more than one billion 1’s.
– matrix is extremely sparse.
• What’s a better representation?
– We only record the 1 positions.
Sec. 1.2
Inverted index
• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?

  Brutus    → 1 2 4 11 31 45 173 174
  Caesar    → 1 2 4 5 6 16 57 132
  Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?
Sec. 1.2
Inverted index
• We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable-length arrays
• Some tradeoffs in size/ease of insertion

  Dictionary    Postings (each entry is a posting)
  Brutus    →   1 2 4 11 31 45 173 174
  Caesar    →   1 2 4 5 6 16 57 132
  Calpurnia →   2 31 54 101

Sorted by docID (more later on why).
Sec. 1.2
Inverted index construction

Documents to be indexed:  "Friends, Romans, countrymen."
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend     → 2 4
                          roman      → 1 2
                          countryman → 13 16
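The tokenize → normalize → invert pipeline can be sketched in a few lines of Python (a sketch only: the "linguistic module" here is lowercasing plus a toy plural rule, not a real stemmer):

```python
import re
from collections import defaultdict

def tokenize(text):
    # Cut the character sequence into word tokens.
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    # Toy linguistic module: lowercase and strip a plural 's'.
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_index(docs):
    # docs: {docID: text}; returns term -> sorted list of docIDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: "Friends, Romans, countrymen.", 2: "Romans in Rome"})
print(index["roman"])  # -> [1, 2]
```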
Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, a state-of-the-art solution
• Normalization
– Map text and query term to same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
Sec. 3.1
Dictionary data structures for inverted indexes
• The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?
Sec. 3.1
A naïve dictionary
• An array of structs:

    term: char[20]   docFreq: int   postings: Postings*
    20 bytes         4/8 bytes      4/8 bytes

• How do we store a dictionary in memory efficiently?
• How do we quickly look up elements at query time?
Sec. 3.1
Dictionary data structures
• Two main choices:
– Hashtables
– Trees
• Some IR systems use hashtables, some
trees
Hashtables
Sec. 3.1
• Each vocabulary term is hashed to an
integer
– (We assume you’ve seen hashtables before)
• Pros:
– Lookup is faster than for a tree: O(1)
• Cons:
– No easy way to find minor variants:
• judgment/judgement
– No prefix search
[tolerant retrieval]
– If vocabulary keeps growing, need to
occasionally do the expensive operation of
rehashing everything
Sec. 3.1
Tree: binary tree
[Diagram: binary tree over the dictionary. The root splits terms into a-m and n-z; a-m splits into a-hu and hy-m; n-z splits into n-sh and si-z.]
Sec. 3.1
Tree: B-tree
[Diagram: B-tree whose children cover the ranges a-hu, hy-m, n-z.]
Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].
Sec. 3.1
Trees
• Simplest: binary tree
• More usual: B-trees
• Trees require a standard ordering of characters and hence
strings … but we typically have one
• Pros:
– Solves the prefix problem (terms starting with
hyp)
• Cons:
– Slower: O(log M) [and this requires balanced
tree]
– Rebalancing binary trees is expensive
• But B-trees mitigate the rebalancing problem
Unstructured vs structured data
• What’s available?
– Web
– Organizations
– You
• Unstructured usually means “text”
• Structured usually means “databases”
• Semistructured somewhere in between
Unstructured (text) vs. structured (database) data companies in 1996
[Bar chart, scale 0–160: data volume and market cap compared for unstructured (text) vs. structured (database) data companies in 1996.]
Unstructured (text) vs. structured (database) data companies in 2006
[Bar chart, scale 0–160: the same comparison of data volume and market cap in 2006.]
IR vs. databases:
Structured vs unstructured data
• Structured data tends to refer to information in “tables”

  Employee  Manager  Salary
  Smith     Jones    50000
  Chang     Smith    60000
  Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries including operators
– More sophisticated “concept” queries e.g.,
• find all web pages dealing with drug abuse
• Classic model for searching text documents
Semi-structured data
• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified
zones such as the Title and Bullets
• Facilitates “semi-structured” search such as
– Title contains data AND Bullets contain search
… to say nothing of linguistic structure
Decisions in Building an Inverted File:
Efficiency and Query Languages
Some query options may require huge computation, e.g., regular expressions.
If inverted files are stored in lexicographic order:
  comp* can be processed efficiently
  *comp cannot be processed efficiently
Boolean terms: if A and B are search terms,
  A or B can be processed by comparing two moderate-sized lists
  (not A) or (not B) requires two very large lists
Major Indexing Methods
• Inverted index
– effective for very large collections of documents
– associates lexical items to their occurrences in the
collection
• Positional index
• Non-positional indexes
– Block-sort
• Suffix trees and arrays
– Faster for phrase searches; harder to build and maintain
• Signature files
– Word oriented index structures based on hashing
(usually not used for large texts)
Sparse Vectors
• Terms (tokens) as vectors
• Vocabulary and therefore dimensionality of vectors can be very large, ~10^4.
• However, most documents and queries do
not contain most words, so vectors are
sparse (i.e. most entries are 0).
• Need efficient methods for storing and
computing with sparse vectors.
Sparse Vectors as Lists
• Store vectors as linked lists of non-zero-weight tokens paired with a weight.
  – Space proportional to number of unique terms (n) in document.
  – Requires linear search of the list to find (or change) the weight of a specific term.
  – Requires quadratic time in worst case to compute vector for a document:

      Σ_{i=1}^{n} i = n(n+1)/2 = O(n²)
Sparse Vectors as Trees
• Index tokens in a document in a balanced binary tree or trie, with weights stored with the tokens at the leaves.
[Diagram: balanced binary tree comparing on “film”, “memory”, “variable”, routing to leaves with weights bit: 2, film: 1, memory: 1, variable: 2.]
Sparse Vectors as Trees (cont.)
• Space overhead for tree structure: ~2n nodes.
• O(log n) time to find or update weight of a
specific term.
• O(n log n) time to construct vector.
• Need software package to support such data
structures.
Sparse Vectors as Hash Tables
• Hashing:
– well-defined procedure or mathematical function
that converts a large, possibly variable-sized
amount of data into a small datum, usually a
single integer that may serve as an index to an
array
• Store tokens in hash table, with token string as key
and weight as value.
– Storage overhead for hash table ~1.5n
– Table must fit in main memory.
– Constant time to find or update weight of a specific
token (ignoring collisions).
– O(n) time to construct vector (ignoring collisions).
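In Python, a dict (hash table) gives exactly this behavior; the sketch below builds a sparse document vector in O(n) and computes cosine similarity by iterating only over non-zero entries (a sketch; weights are raw term counts for brevity):

```python
from collections import Counter
import math

def doc_vector(tokens):
    # O(n) construction: hash each token, increment its weight.
    return Counter(tokens)

def cosine(u, v):
    # Iterate only over the shorter vector's non-zero entries;
    # looking up a missing key in a Counter yields 0.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v[t] for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

d1 = doc_vector("now is the time the time".split())
d2 = doc_vector("the time was past".split())
print(round(cosine(d1, d2), 3))  # -> 0.632
```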
Implementation Based on
Inverted Files
• In practice, document vectors are not stored
directly; an inverted organization provides much
better efficiency.
• The keyword-to-document index can be
implemented as a hash table, a sorted array, or a
tree-based data structure (trie, B-tree).
• Critical issue is logarithmic or constant-time
access to token information.
Index File Types - Lucene
• Field File
– Field infos: .fnm
– Stored Fields
• field index: .fdx
• field data: .fdt
• Term Dictionary file
– Term infos: .tis
– Term info index: .tii
• Frequency file: .frq
• Position file: .prx
Organization of Inverted Files

[Diagram: three linked Lucene files. The Term Dictionary stores, per term, a prefix length, suffix, field number and doc frequency, prefix-compressed (e.g., “cathy” stored as prefix 3 of “cat” plus suffix “hy”; “doom” as prefix 2 of “dog” plus suffix “om”). The Term Frequency file (posting file) stores, per term (implicit), a list of <DocDelta, Freq?> entries. The Term Position File (inverted index) stores, per term (implicit), the term positions in Doc1, Doc2, …, DocN. Postings point into the Document Files.]
Efficiency Criteria
Storage
Inverted files are big, typically 10% to 100% the size of the
collection of documents.
Update performance
It must be possible, with a reasonable amount of computation, to:
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance
Retrieval must be fast enough to satisfy users and not use
excessive resources.
Document File
The documents file stores the documents that are being
indexed. The documents may be:
• primary documents, e.g., electronic journal articles
• surrogates, e.g., catalog records or abstracts
Postings File
Merging inverted lists is the most computationally intensive task
in many information retrieval systems.
Since inverted lists may be long, it is important to match postings
efficiently.
Usually, the inverted lists will be held on disk and paged into
memory for matching. Therefore algorithms for matching
postings process the lists sequentially.
For efficient matching, the inverted lists should all be sorted in the
same sequence.
Inverted lists are commonly cached to minimize disk accesses.
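Sequential matching of two postings lists sorted in the same order is the classic two-pointer merge; a minimal sketch, using the Brutus/Caesar postings from the earlier slides:

```python
def intersect(p1, p2):
    # Both lists must be sorted by docID; runs in O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # -> [1, 2, 4]
```

This is why the lists must all be sorted in the same sequence: the merge advances through each list exactly once, which also matches the sequential disk access pattern described above.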
Document File
The storage of the document file may be:
Central (monolithic) - all documents stored together on a single
server (e.g., library catalog)
Distributed database - all documents managed together but
stored on several servers (e.g., Medline, Westlaw, Dialog)
Highly distributed - documents are stored on independently
managed servers (e.g., Web)
Each requires: a document ID, which is a unique identifier that
can be used by the inverted file system to refer to the document,
and a location counter, which can be used to specify location
within a document.
Documents File for Web Search
System
For web search systems:
• A document is a web page.
• The documents file is the web.
• The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page
on the web (or a subset). After indexing, each page is discarded,
unless stored in a cache.
(In addition to the usual index file and postings file the indexing
system stores contextual information, which will be discussed in a
later lecture.)
Term Frequency (Postings File)
The postings file stores the elements of a sparse matrix, the term
assignment matrix.
It is stored as a separate inverted list for each term, i.e., a list
corresponding to each term in the index file.
Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
Each list consists of one or many individual postings.
Length of Postings File
For a common term there may be very large numbers of postings.
Example:
  1,000,000,000 documents
  1,000,000 distinct words
  average length 1,000 words per document
  → 10^12 postings
By Zipf’s law, the 10th-ranking word occurs approximately
  (10^12/10)/10 times = 10^10 times
Inverted Index
Primary data structure for text indexes
• Basically two elements:
– (Vocabulary, Occurrences)
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the terms/tokens in the collection
– For each term/token, list all the docs it occurs in.
• Possibly location in document (Lucene stores the positions)
– Compress to reduce redundancy in the data structure
• Also reduces I/O and storage required
Inverted Indexes
We have seen “vector files”. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows.

As document vectors (rows = docs):

       t1  t2  t3
  D1    1   0   1
  D2    1   0   0
  D3    0   1   1
  D4    1   0   0
  D5    1   1   1
  D6    1   1   0
  D7    0   1   0
  D8    0   1   0
  D9    0   0   1
  D10   0   1   1

Inverted file (rows = terms):

       D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
  t1    1  1  0  1  1  1  0  0  0  0
  t2    0  0  1  0  1  1  1  1  0  1
  t3    1  0  1  0  1  0  0  0  1  1
How Are Inverted Files Created
• Documents are parsed one document at a time to extract tokens/terms. These are saved with the Document ID (DID): <term, DID>

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

<term, DID> pairs in parsing order:
  now:1  is:1  the:1  time:1  for:1  all:1  good:1  men:1  to:1  come:1  to:1
  the:1  aid:1  of:1  their:1  country:1
  it:2  was:2  a:2  dark:2  and:2  stormy:2  night:2  in:2  the:2  country:2
  manor:2  the:2  time:2  was:2  past:2  midnight:2
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically and in document order.

sorted <term, DID>:
  a:2  aid:1  all:1  and:2  come:1  country:1  country:2  dark:2  for:1  good:1
  in:2  is:1  it:2  manor:2  men:1  midnight:2  night:2  now:1  of:1  past:2
  stormy:2  the:1  the:1  the:2  the:2  their:1  time:1  time:2  to:1  to:1
  was:2  was:2
How Inverted Files are Created
• Multiple term/token entries for a single document are merged.
• Within-document term frequency information is compiled.
• Result: <term, DID, tf>, e.g., <the, 1, 2>

merged <term, DID, tf>:
  a:(2,1)  aid:(1,1)  all:(1,1)  and:(2,1)  come:(1,1)  country:(1,1)(2,1)
  dark:(2,1)  for:(1,1)  good:(1,1)  in:(2,1)  is:(1,1)  it:(2,1)  manor:(2,1)
  men:(1,1)  midnight:(2,1)  night:(2,1)  now:(1,1)  of:(1,1)  past:(2,1)
  stormy:(2,1)  the:(1,2)(2,2)  their:(1,1)  time:(1,1)(2,1)  to:(1,2)  was:(2,2)
How Inverted Files are Created
• Then the inverted file can be split into
– A Dictionary file
• File of unique terms
• Fit in memory if possible
and
– A Postings file
• File of what document the term/token is in and how often.
• Sometimes where the term is in the document.
• Store on disk
• Worst case O(n) space, where n is the number of tokens.
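The whole parse → sort → merge → split sequence described above fits in a short sketch, using the two toy documents from these slides (whitespace tokenization only; the dictionary entry holds an offset into the postings array plus a document count):

```python
from itertools import groupby

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# Parse: emit <term, DID> pairs, then sort alphabetically and by docID.
pairs = sorted((term, did) for did, text in docs.items()
               for term in text.split())

# Merge duplicates into <DID, tf> runs and split into dictionary + postings.
dictionary, postings = {}, []
for term, group in groupby(pairs, key=lambda p: p[0]):
    entries = [(did, len(list(g)))
               for did, g in groupby(d for _, d in group)]
    dictionary[term] = (len(postings), len(entries))  # (offset, #docs)
    postings.extend(entries)

off, n = dictionary["the"]
print(postings[off:off + n])  # -> [(1, 2), (2, 2)]
```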
Dictionary and Posting Files

Dictionary <term, N docs, total freq>:
  a:1,1  aid:1,1  all:1,1  and:1,1  come:1,1  country:2,2  dark:1,1  for:1,1
  good:1,1  in:1,1  is:1,1  it:1,1  manor:1,1  men:1,1  midnight:1,1  night:1,1
  now:1,1  of:1,1  past:1,1  stormy:1,1  the:2,4  their:1,1  time:2,2  to:1,2
  was:1,2

Postings <DID, freq>, one run per dictionary term, in dictionary order:
  a:(2,1)  aid:(1,1)  all:(1,1)  and:(2,1)  come:(1,1)  country:(1,1)(2,1)
  dark:(2,1)  for:(1,1)  good:(1,1)  in:(2,1)  is:(1,1)  it:(2,1)  manor:(2,1)
  men:(1,1)  midnight:(2,1)  night:(2,1)  now:(1,1)  of:(1,1)  past:(2,1)
  stormy:(2,1)  the:(1,2)(2,2)  their:(1,1)  time:(1,1)(2,1)  to:(1,2)  was:(2,2)
Inverted indexes
• Permit fast search for individual terms
• For each term/token, you get a list consisting of:
– document ID (DID)
– frequency of term in doc (optional, implied in Lucene)
– position of term in doc (optional, Lucene)
– <term,DID,tf,position>
– <term,(DIDi,tf,positionij),…>
– Lucene:
• <positionij,…> (term and DID are implied from other files)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
How Inverted Files are Used
Query on “time” AND “dark” (using the dictionary and posting files built above):
• 2 docs with “time” in the dictionary → IDs 1 and 2 from the posting file
• 1 doc with “dark” in the dictionary → ID 2 from the posting file
• Therefore, only doc 2 satisfies the query.
Inverted index
• Associates a posting list with each term
– POSTING LIST example
• a (d1, 1)
• …
• the (d1,2) (d2,2)
• Replace term frequency (tf) with tf-idf
  – Lucene only stores tf; idf is added at query time
• Compress the index and add hash links
  – Hashing (a type of compression) speeds up access
• Match query to index and rank
Position in inverted file posting
– POSTING LIST example
• now (1, 1)
•…
• time (1,4) (2,13)
Doc 1
Now is the time
for all good men
to come to the aid
of their country
Doc 2
It was a dark and
stormy night in
the country
manor. The time
was past midnight
Change weight
• Multiple term entries
for a single document
are merged.
• Within-document term
frequency information
is compiled.
Example: WestLaw
http://www.westlaw.com/
• Largest commercial (paying subscribers) legal
search service (started 1975; ranking added
1992)
• About 7 terabytes of data; 700,000 users
• Majority of users still use boolean queries
• Example query:
– What is the statute of limitations in cases involving
the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT
/3 CLAIM
• Long, precise queries; proximity operators;
incrementally developed; not like web search
Time Complexity of Indexing
• Complexity of creating vector and indexing a
document of n terms is O(n).
• So building an index of m such documents is O(m n).
• Computing vector lengths is also O(m n), which is
also the complexity of reading in the corpus.
Retrieval with an Inverted Index
• Terms that are not in both the query and the document do not affect cosine similarity.
– Product of term weights is zero and does not contribute
to the dot product.
• Usually the query is fairly short, and therefore its
vector is extremely sparse.
• Use inverted index to find the limited set of
documents that contain at least one of the query
words.
• Retrieval time O(log M) due to hashing where M is
the size of the document collection.
Inverted Query Retrieval
Efficiency
• Assume that, on average, a query word appears in B documents:
    Q = q1, q2, …, qN
    q1 → D11 … D1B
    q2 → D21 … D2B
    …
    qN → DN1 … DNB
• Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval, which examines all M documents: O(|V| M), because |Q| << |V| and B << M (|V| is the size of the vocabulary).
Processing the Query
• Incrementally compute cosine similarity of
each indexed document as query words are
processed one by one.
• To accumulate a total score for each retrieved
document, store retrieved documents in a
hashtable, where DocumentReference is the
key and the partial accumulated score is the
value.
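The accumulator idea can be sketched with a dict keyed by docID (a sketch: the toy index and its raw-tf weights are illustrative, and the query term weights are assumed to be 1):

```python
from collections import defaultdict

# Toy inverted index: term -> list of (docID, term weight in that doc).
index = {
    "time": [(1, 1.0), (2, 1.0)],
    "dark": [(2, 1.0)],
    "country": [(1, 1.0), (2, 1.0)],
}

def score(query_terms):
    # Accumulate a partial dot-product score per retrieved document;
    # the dict of accumulators plays the role of the hashtable.
    acc = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            acc[doc_id] += weight  # query weight assumed 1
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(score(["time", "dark"]))  # -> [(2, 2.0), (1, 1.0)]
```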
Index Files
On disk
If an index is held on disk, search time is dominated by the
number of disk accesses.
In memory
Suppose that an index has 1,000,000 distinct terms.
Each index entry consists of the term, some basic statistics and
a pointer to the inverted list, average 100 characters.
Size of index is 100 megabytes, which can easily be held in
memory of a dedicated computer.
Index File Structures: Linear
Index
Advantages
Can be searched quickly, e.g., by binary search, O(log M)
Good for sequential processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages
Index must be rebuilt if an extra term is added
Time Complexity of Indexing
• Complexity of creating vector and indexing a document of n tokens is O(n).
• So indexing m such documents is O(m n).
• Computing token IDFs for a vocabulary V is O(|V|).
• Computing vector lengths is also O(m n).
• Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.
Index on disk vs. memory
• Most retrieval systems keep the dictionary
in memory and the postings on disk
• Web search engines frequently keep both in
memory
– massive memory requirement
– feasible for large web service installations
– less so for commercial usage where query loads
are lighter
Indexing in the real world
• Typically, don’t have all documents sitting on a
local filesystem
– Documents need to be crawled
– Could be dispersed over a WAN with varying
connectivity
– Must schedule distributed spiders/indexers
– Could be (secure content) in
• Databases
• Content management applications
• Email applications
Content residing in applications
• Mail systems/groupware, content management
contain the most “valuable” documents
• http often not the most efficient way of fetching
these documents - native API fetching
– Specialized, repository-specific connectors
– These connectors also facilitate document viewing
when a search result is selected for viewing
Secure documents
• Each document is accessible to a subset of users
– Usually implemented through some form of Access
Control Lists (ACLs)
• Search users are authenticated
• Query should retrieve a document only if user can
access it
– So if there are docs matching your search but you’re
not privy to them, “Sorry no results found”
– E.g., as a lowly employee in the company, I get “No
results” for the query “salary roster”
• Solr has this with post filtering
Users in groups, docs from
groups
• Index the ACLs and filter results by them
  [Matrix: Users × Documents; entry is 0 if the user can’t read the doc, 1 otherwise.]
• Often, user membership in an ACL group is verified at query time – slowdown
Compound documents
• What if a doc consisted of components
– Each component has its own ACL.
• Your search should get a doc only if your query
meets one of its components that you have access
to.
• More generally: doc assembled from computations
on components
– e.g., in Lotus databases or in content management
systems
• How do you index such docs?
No good answers …
“Rich” documents
• (How) Do we index images?
• Researchers have devised Query based on Image
Content (QBIC) systems
– “show me a picture similar to this orange circle”
– Then use vector space retrieval
• In practice, image search usually based on metadata such as file name e.g., monalisa.jpg
• New approaches exploit social tagging
– E.g., flickr.com
Passage/sentence retrieval
• Suppose we want to retrieve not an entire document matching a query, but only a passage/sentence – say, in a very long document
• Can index passages/sentences as mini-documents – what should the index units be?
• This is the subject of XML search
Sec. 2.4
Phrase queries
• We want to be able to answer queries such as
“stanford university” – as a phrase
• Thus the sentence “I went to university at
Stanford” is not a match.
– The concept of phrase queries has proven easily
understood by users; one of the few “advanced
search” ideas that works
– Many more queries are implicit phrase queries
• For this, it no longer suffices to store only
<term : docs> entries
Sec. 2.4.1
A first attempt: Biword indexes
• Index every consecutive pair of terms in the text
as a phrase
• For example the text “Friends, Romans,
Countrymen” would generate the biwords
– friends romans
– romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query-processing is now
immediate.
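Generating biwords and answering a two-word phrase query then becomes a direct dictionary lookup (a sketch with toy documents):

```python
from collections import defaultdict

def biwords(text):
    # Every consecutive pair of terms becomes one dictionary entry.
    toks = text.lower().split()
    return [f"{a} {b}" for a, b in zip(toks, toks[1:])]

index = defaultdict(set)
docs = {1: "friends romans countrymen", 2: "romans and countrymen"}
for did, text in docs.items():
    for bw in biwords(text):
        index[bw].add(did)

# A two-word phrase query is a single lookup.
print(sorted(index["romans countrymen"]))  # -> [1]
```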
Sec. 2.4.1
Longer phrase queries
• Longer phrases can be processed by breaking
them down
• stanford university palo alto can be broken
into the Boolean query on biwords:
stanford university AND university palo AND
palo alto
Without the docs, we cannot verify that the docs
matching the above Boolean query do contain
the phrase.
Can have false positives!
Sec. 2.4.1
Issues for biword indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
– Infeasible for more than biwords, big even for
them
• Biword indexes are not the standard
solution (for all biwords) but can be part of
a compound strategy
Sec. 2.4.2
Solution 2: Positional indexes
• In the postings, store, for each term the
position(s) in which tokens of it appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
Sec. 2.4.2
Positional index example
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1,2,4,5
could contain “to be
or not to be”?
• For phrase queries, we use a merge
algorithm recursively at the document level
• But we now need to deal with more than
just equality
Sec. 2.4.2
Processing a phrase query
• Extract inverted index entries for each distinct
term: to, be, or, not.
• Merge their doc:position lists to enumerate all
positions with “to be or not to be”.
– to:
• 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
– be:
• 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
• Same general method for proximity searches
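A positional merge for a two-term phrase can be sketched as follows, using the `to`/`be` postings above; longer phrases follow by applying it recursively (a sketch; real systems merge the position lists with two pointers rather than a set):

```python
def phrase_match(pos1, pos2):
    # pos1, pos2: docID -> sorted positions of term1 / term2.
    # A doc matches the phrase "term1 term2" if some occurrence of
    # term2 sits at position p+1 for an occurrence p of term1.
    hits = {}
    for doc in pos1.keys() & pos2.keys():
        second = set(pos2[doc])
        starts = [p for p in pos1[doc] if p + 1 in second]
        if starts:
            hits[doc] = starts
    return hits

to_ = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be_ = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_match(to_, be_))  # -> {4: [16, 190, 429, 433]}
```

For a proximity query, `p + 1 in second` generalizes to checking whether some position of the second term lies within k of p.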
Sec. 2.4.2
Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
– Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for
such queries; biword indexes cannot.
• Exercise: Adapt the linear merge of postings to
handle proximity queries. Can you make it
work for any value of k?
– This is a little tricky to do correctly and efficiently
– See Figure 2.12 of IIR
Sec. 2.4.2
Positional index size
• A positional index expands postings storage
substantially
– Even though indices can be compressed
• Nevertheless, a positional index is now
standardly used because of the power and
usefulness of phrase and proximity queries
… whether used explicitly or implicitly in a
ranking retrieval system.
Sec. 2.4.2
Positional index size
• Need an entry for each occurrence, not just
once per document
• Index size depends on average document
size
Why?
– Average web page has <1000 terms
– SEC filings, books, even some epic poems …
easily 100,000 terms
• Consider a term with frequency 0.1%:

  Document size   Postings   Positional postings
  1000            1          1
  100,000         1          100
Sec. 2.4.2
Rules of thumb
• A positional index is 2–4× as large as a non-positional index
• Positional index size is 35–50% of the volume of the original text
– Caveat: all of this holds for “English-like”
languages
Sec. 2.4.3
Combination schemes
• These two approaches can be profitably
combined
– For particular phrases (“Michael Jackson”,
“Britney Spears”) it is inefficient to keep on
merging positional postings lists
• Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more
sophisticated mixed indexing scheme
– A typical web query mixture was executed in ¼ of
the time of using just a positional index
– It required 26% more space than having a
positional index alone
Sec. 4.4
Distributed indexing
• For web-scale indexing (don’t try this at
home!):
must use a distributed computing cluster
• Individual machines are fault-prone
– Can unpredictably slow down or fail
• How do we exploit such a pool of
machines?
Sec. 4.4
Web search engine data centers
• Web search data centers (Google, Bing,
Baidu) mainly contain commodity
machines.
• Data centers are distributed around the
world.
• Estimate: Google ~1 million servers, 3
million processors/cores (Gartner 2007)
Sec. 4.4
Massive data centers
• If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
• Answer: 0.999^1000 ≈ 37%, i.e., the system is down about 63% of the time
• Exercise: Calculate the number of servers
failing per minute for an installation of 1
million servers.
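With independent failures, the system is up only when every node is up, so the figure is a one-line check:

```python
# Probability that all 1000 independent nodes are up simultaneously.
node_uptime = 0.999
system_uptime = node_uptime ** 1000
print(f"{system_uptime:.1%}")  # -> 36.8%; the system is down ~63% of the time
```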
Inside Google Data Center
• https://www.youtube.com/watch?v=avP5d1
6wEp0
Sec. 4.4
Distributed indexing
• Maintain a master machine directing the
indexing job – considered “safe”.
• Break up indexing into sets of (parallel)
tasks.
• Master machine assigns each task to an idle
machine from a pool.
Sec. 4.4
Parallel tasks
• We will use two sets of parallel tasks
– Parsers
– Inverters
• Break the input document collection into
splits
• Each split is a subset of documents
(corresponding to blocks in BSBI/SPIMI)
Sec. 4.4
Parsers
• Master assigns a split to an idle parser
machine
• Parser reads a document at a time and
emits (term, doc) pairs
• Parser writes pairs into j partitions
• Each partition is for a range of terms’ first
letters
– (e.g., a-f, g-p, q-z) – here j = 3.
• Now to complete the index inversion
Sec. 4.4
Inverters
• An inverter collects all (term,doc) pairs (=
postings) for one term-partition.
• Sorts and writes to postings lists
Sec. 4.4
Data flow
[Diagram: the Master assigns splits to Parsers and partitions to Inverters. Map phase: each Parser reads a split and writes segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: one Inverter per partition (a-f, g-p, q-z) reads the segment files and writes the postings.]
Sec. 4.4
MapReduce
• The index construction algorithm we just
described is an instance of MapReduce.
• MapReduce (Dean and Ghemawat 2004) is
a robust and conceptually simple
framework for distributed computing …
• … without having to write code for the
distribution part.
• They describe the Google indexing system
(ca. 2002) as consisting of a number of
phases, each implemented in MapReduce.
MapReduce
Sec. 4.4
• Index construction was just one phase.
• Another phase: transforming a term-partitioned
index into a document-partitioned index.
– Term-partitioned: one machine handles a
subrange of terms
– Document-partitioned: one machine handles a
subrange of documents
• As we’ll discuss in the web part of the
course, most search engines use a
document-partitioned index … better load
balancing, etc.
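The routing difference between the two schemes can be sketched as follows (names are illustrative; hashing terms to machines is one common choice for term partitioning, not something the slides prescribe):

```java
import java.util.*;

// Term-partitioned vs. document-partitioned query routing
// across k machines (illustrative sketch).
public class Partitioning {
    // Term-partitioned: one machine owns the term's whole postings
    // list, so a query term is routed to exactly one machine.
    static int termPartition(String term, int k) {
        return Math.floorMod(term.hashCode(), k);
    }

    // Document-partitioned: each machine indexes its own subset of
    // documents, so every query term must be sent to ALL k machines
    // (but load balances better across machines).
    static List<Integer> docPartition(int k) {
        List<Integer> all = new ArrayList<>();
        for (int m = 0; m < k; m++) all.add(m);
        return all;
    }
}
```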
Sec. 4.4
Schema for index construction in
MapReduce
• Schema of map and reduce functions:
  – map: input → list(k, v)
  – reduce: (k, list(v)) → output
• Instantiation of the schema for index construction:
  – map: collection → list(termID, docID)
  – reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …)
    → (postings list 1, postings list 2, …)
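The schema can be sketched in plain Java, run locally rather than on a cluster (names are illustrative; terms stand in for termIDs):

```java
import java.util.*;

// In-memory sketch of the index-construction schema: map emits
// (term, docID) pairs; reduce groups each term's docIDs into a
// sorted postings list. Not a real MapReduce job -- just the same
// two functions run in one process.
public class IndexMapReduce {
    // map: (docID, text) -> list of (term, docID) pairs
    static List<Map.Entry<String, Integer>> map(int docID, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+"))
            if (!term.isEmpty()) out.add(Map.entry(term, docID));
        return out;
    }

    // reduce: group (term, docID) pairs -> term -> sorted postings list
    static Map<String, List<Integer>> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, TreeSet<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new TreeSet<>())
                   .add(p.getValue());
        Map<String, List<Integer>> index = new TreeMap<>();
        grouped.forEach((term, docs) -> index.put(term, new ArrayList<>(docs)));
        return index;
    }
}
```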
Sec. 4.5
Dynamic indexing
• Up to now, we have assumed that
collections are static.
• They rarely are:
– Documents come in over time and need to be
inserted.
– Documents are deleted and modified.
• This means that the dictionary and postings
lists have to be modified:
– Postings updates for terms already in
dictionary
– New terms added to dictionary
Sec. 4.5
Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
• Deletions
– Invalidation bit-vector for deleted docs
– Filter docs output on a search result by this
invalidation bit-vector
• Periodically, re-index into one main index
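A toy Java sketch of this scheme (illustrative names; the “index” here is just a term → docID map, and the invalidation bit-vector is a BitSet):

```java
import java.util.*;

// Main + auxiliary index sketch: new docs go into the auxiliary
// index, searches hit both indexes and merge results, and deleted
// docs are filtered through an invalidation bit-vector.
public class MainAuxIndex {
    final Map<String, List<Integer>> main = new HashMap<>();
    final Map<String, List<Integer>> aux  = new HashMap<>();
    final BitSet deleted = new BitSet();   // invalidation bit-vector

    void addDoc(int docID, String[] terms) {          // new docs -> aux
        for (String t : terms)
            aux.computeIfAbsent(t, k -> new ArrayList<>()).add(docID);
    }

    void deleteDoc(int docID) { deleted.set(docID); } // just flip a bit

    List<Integer> search(String term) {               // search both, filter
        List<Integer> hits = new ArrayList<>();
        for (int d : main.getOrDefault(term, List.of()))
            if (!deleted.get(d)) hits.add(d);
        for (int d : aux.getOrDefault(term, List.of()))
            if (!deleted.get(d)) hits.add(d);
        return hits;
    }
}
```

Periodic re-indexing would fold `aux` into `main` and clear the bit-vector.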
Sec. 4.5
Issues with main and auxiliary
indexes
• Problem of frequent merges – you touch stuff a lot
• Poor performance during merge
• Actually:
– Merging of the auxiliary index into the main index is efficient if we
keep a separate file for each postings list.
– Merge is the same as a simple append.
– But then we would need a lot of files – inefficient for OS.
• Assumption for the rest of the lecture: The index is one
big file.
• In reality: Use a scheme somewhere in between (e.g., split
very large postings lists, collect postings lists of length 1 in
one file etc.)
Indexing Subsystem
[Diagram: indexing subsystem flow. Documents → assign document
IDs → break into tokens → stop list* → non-stoplist tokens →
stemming* → stemmed terms → term weighting* → terms with
weights → index database. Document numbers and *field numbers
are carried along with the terms. *Indicates optional operation.]
Search Subsystem
[Diagram: search subsystem flow. Query → parse query → query
tokens → stop list* → non-stoplist tokens → stemming* →
stemmed terms → lookup in index database → Boolean operations*
→ retrieved document set → relevant document set → ranking* →
ranked document set. *Indicates optional operation.]
Lucene’s index (conceptual)
[Diagram: an Index contains Documents; each Document contains
Fields; each Field is a (Name, Value) pair.]
Create a Lucene index (step 1)
• Create Lucene document and add fields
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public void createDoc(String title, String body)
{
    Document doc = new Document();
    doc.add(Field.Text("Title", title));
    doc.add(Field.Text("Body", body));
}
Create a Lucene index (step 2)
• Create an Analyser
– Options
• WhitespaceAnalyzer
– divides text at whitespace
• SimpleAnalyzer
– divides text at non-letters
– convert to lower case
• StopAnalyzer
– SimpleAnalyzer
– removes stop words
• StandardAnalyzer
– good for most European Languages
– removes stop words
– convert to lower case
Create a Lucene index (step 3)
• Create an index writer, add Lucene document into the index
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void writeDoc(Document doc, String idxPath)
{
    try {
        IndexWriter writer =
            new IndexWriter(idxPath, new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
    }
    catch (IOException exp) { System.out.println("I/O Error!"); }
}
Lucene Index – Behind the Scenes
• Inverted Index (Inverted File)
• Example documents:
  – Doc 1: “Penn State Football …”
  – Doc 2: “Football players … State”
• Posting table (offsets are word positions):

  id | word     | postings (doc, offset)
  ---+----------+-------------------------------------
  1  | football | (Doc 1, 3), (Doc 1, 67), (Doc 2, 1)
  2  | penn     | (Doc 1, 1)
  3  | players  | (Doc 2, 2)
  4  | state    | (Doc 1, 2), (Doc 2, 13)
Posting table
• Posting table is a fast look-up mechanism
– Key: word
– Value: posting id, satellite data (#df, offset, …)
• Lucene implements the posting table with Java’s hash
table
– Built on java.util.Hashtable
– Hash function is java.lang.String’s:
  • hc2 = hc1 * 31 + nextChar
• Posting table usage
– Indexing: insertion (new terms), update (existing terms)
– Searching: lookup, and construct document vector
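A minimal sketch of such a posting table (illustrative names; modern Java would use HashMap, shown here, rather than Hashtable; the `df` field stands in for the satellite data):

```java
import java.util.*;

// Posting table sketch: a hash map from word to posting id plus
// satellite data (here just a document-frequency counter).
public class PostingTable {
    record Entry(int postingId, int df) {}   // satellite data
    final Map<String, Entry> table = new HashMap<>();
    int nextId = 1;

    // Indexing: insertion for new terms, update for existing terms.
    void insert(String word) {
        Entry e = table.get(word);
        table.put(word, e == null ? new Entry(nextId++, 1)
                                  : new Entry(e.postingId(), e.df() + 1));
    }

    // Searching: fast lookup by word.
    Entry lookup(String word) { return table.get(word); }

    // The hc2 = hc1 * 31 + nextChar recurrence from the slide,
    // written out explicitly (it is exactly String.hashCode()).
    static int hash(String s) {
        int h = 0;
        for (char c : s.toCharArray()) h = h * 31 + c;
        return h;
    }
}
```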
Search Lucene’s index (step 1)
• Construct a query (automatic)
import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void formQuery(String ques) throws ParseException
{
    Query query = QueryParser.parse(
        ques, "body", new StandardAnalyzer()
    );
}
Search Lucene’s index (step 1)
• Types of query:
– Boolean: [IST441 Giles] [IST441 OR Giles]
[java AND NOT SUN]
– wildcard: [nu?ch] [nutc*]
– phrase: [“JAVA TOMCAT”]
– proximity: [“lucene nutch”~10]
– fuzzy: [roam~] matches roams and foam
– date range
– …
Search Lucene’s index (step 2)
• Search the index
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public void searchIdx(String idxPath, Query query) throws IOException
{
    Directory fsDir = FSDirectory.getDirectory(idxPath, false);
    IndexSearcher is = new IndexSearcher(fsDir);
    Hits hits = is.search(query);
}
Search Lucene’s index (step 3)
• Display the results
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    // show your results
}
Example: WestLaw
http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search
service (started 1975; ranking added 1992)
• About 7 terabytes of data; 700,000 users
• Majority of users still use boolean queries
• Example query:
– What is the statute of limitations in cases involving the federal
tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• Long, precise queries; proximity operators; incrementally
developed; not like web search
Lucene workflow
[Diagram: Lucene workflow. A Document with fields (super_name:
Spider-Man, name: Peter Parker, category: superhero, powers:
agility, spider-sense) is passed to IndexWriter via addDocument(),
which writes the Lucene Index; a Query (powers:agility) is passed
to IndexSearcher via search(), which returns Hits (matching docs).]

Steps:
1. Get the Lucene jar file
2. Write indexing code to get data and create Document objects
3. Write code to create query objects
4. Write code to use/display results
What we covered
• Indexing modes
• Inverted files
– Storage and access advantages
– Posting files
• Examples from Lucene