IR Indexing
Thanks to B. Arms (SIMS); Baldi, Frasconi, Smyth; Manning, Raghavan, Schütze

What we have covered
• What is IR; evaluation
• Tokenization, terms, and properties of text
• Web crawling; robots.txt
• Vector model of documents – bag of words
• Queries; measures of similarity
• This presentation – indexing; inverted files

Last time
• Vector models of documents and queries – bag-of-words model
• Similarity measures
– Text similarity typically used
– Similarity is a measure of relevance (and ranking)
– The query is a small document
– Match the query to the documents
• All documents are stored and indexed before a query is matched.

Summary: What's the point of using vector spaces?
• A well-formed algebraic space for retrieval
• Key: a user's query can be viewed as a (very) short document.
• The query becomes a vector in the same space as the docs.
• We can measure each doc's proximity to it.
• A natural measure of scores/ranking – no longer Boolean.
– Queries are expressed as bags of words
• Other similarity measures: see http://www.lans.ece.utexas.edu/~strehl/diss/node52.html for a survey

[Diagram: Solr/Lucene in a typical web search engine – a crawler (e.g., Nutch, YouSeer; LucidWorks shown as an example distribution) feeds an indexer, which builds the index; the query engine serves users through the index interface.]

[Diagram: a typical web search engine – crawler → indexer → index; the query engine matches user queries against the index through the index interface.]

Why indexing?
• For efficient searching of documents with unstructured text (not databases), using terms as queries
– Databases can still be searched with an index, e.g., Sphinx
• Online sequential text search (grep) suits a small collection of volatile text
• Data structures – indexes – suit a large, semi-stable document collection and efficient search

Sec. 1.1 Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) are not feasible
– Ranked retrieval (best documents to return) is not supported

Sec. 1.1 Term-document incidence matrices
(1 if the play contains the word, 0 otherwise)

Term        Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1               1              0          0       0        1
Brutus             1               1              0          1       0        0
Caesar             1               1              0          1       1        1
Calpurnia          0               1              0          0       0        0
Cleopatra          1               0              0          0       0        0
mercy              1               0              1          1       1        1
worser             1               0              1          1       1        0

Query: Brutus AND Caesar BUT NOT Calpurnia

Sec. 1.1 Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them:
  110100 AND 110111 AND 101111 = 100100

Sec. 1.1 Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Sec. 1.1 Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• At an average of 6 bytes/word including spaces/punctuation, that is 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.

Sec. 1.1 Can't build the matrix
• A 500K x 1M matrix has half a trillion 0's and 1's.
• But it has no more than one billion 1's – the matrix is extremely sparse.
• What's a better representation? We record only the 1 positions.

Sec. 1.2 Inverted index
• For each term t, we must store a list of all documents that contain t.
– Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?
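The bitwise AND over incidence vectors described above can be sketched as follows (a minimal illustration; the 0/1 vectors come straight from the example matrix, with one bit per play in column order):

```python
# Term-document incidence vectors from the example matrix
# (bit order: Antony&Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth).
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

def boolean_query(must, must_not):
    """AND the vectors of `must` terms with the complements of `must_not` terms."""
    mask = (1 << len(plays)) - 1           # 111111: all plays
    result = mask
    for term in must:
        result &= incidence[term]
    for term in must_not:
        result &= mask & ~incidence[term]  # complemented vector
    # The highest bit corresponds to the first play in the list.
    return [plays[i] for i in range(len(plays))
            if result & (1 << (len(plays) - 1 - i))]

# Brutus AND Caesar AND NOT Calpurnia -> 100100
print(boolean_query(["Brutus", "Caesar"], ["Calpurnia"]))
# -> ['Antony and Cleopatra', 'Hamlet']
```

This works for a handful of plays; the slides' point is that it cannot scale to the 500K x 1M matrix, which motivates the inverted index.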
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?

Sec. 1.2 Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, we can use linked lists or variable-length arrays
• Some tradeoffs in size / ease of insertion

Dictionary → Postings (each entry in a list is a posting):
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Postings are sorted by docID (more later on why).

Sec. 1.2 Inverted index construction
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman
→ Indexer → inverted index: friend → 2 4; roman → 1 2; countryman → 13 16

Initial stages of text processing
• Tokenization – cut the character sequence into word tokens
– Deal with "John's", a state-of-the-art solution
• Normalization – map text and query terms to the same form
– You want U.S.A. and USA to match
• Stemming – we may wish different forms of a root to match
– authorize, authorization
• Stop words – we may omit very common words (or not)
– the, a, to, of

Sec. 3.1 Dictionary data structures for inverted indexes
• The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list ... in what data structure?

Sec. 3.1 A naïve dictionary
• An array of structs:
  char[20]    (20 bytes)  – term
  int         (4/8 bytes) – document frequency
  Postings *  (4/8 bytes) – pointer to postings list
• How do we store a dictionary in memory efficiently?
• How do we quickly look up elements at query time?

Sec. 3.1 Dictionary data structures
• Two main choices: hashtables and trees
• Some IR systems use hashtables, some trees

Sec. 3.1 Hashtables
• Each vocabulary term is hashed to an integer
– (We assume you've seen hashtables before)
• Pros:
– Lookup is faster than for a tree: O(1)
• Cons:
– No easy way to find minor variants: judgment/judgement
– No prefix search [tolerant retrieval]
– If the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything

Sec. 3.1 Tree: binary tree
[Diagram: a binary tree over the vocabulary – the root splits a-m / n-z; a-m splits into a-hu / hy-m; n-z splits into n-sh / si-z.]

Sec. 3.1 Tree: B-tree
[Diagram: a B-tree whose root has children covering a-hu, hy-m, n-z.]
Definition: every internal node has a number of children in the interval [a,b], where a, b are appropriate natural numbers, e.g., [2,4].

Sec. 3.1 Trees
• Simplest: binary tree; more usual: B-trees
• Trees require a standard ordering of characters and hence strings ... but we typically have one
• Pros:
– Solves the prefix problem (terms starting with hyp)
• Cons:
– Slower: O(log M) [and this requires a balanced tree]
– Rebalancing binary trees is expensive
• But B-trees mitigate the rebalancing problem

Unstructured vs structured data
• What's available? The web, organizations, you
• Unstructured usually means "text"
• Structured usually means "databases"
• Semistructured is somewhere in between

[Bar chart: data volume and market cap of unstructured vs. structured data companies in 1996.]
[Bar chart: the same comparison for 2006.]

IR vs. databases: structured vs unstructured data
• Structured data tends to refer to information in "tables"

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact-match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
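The hashtable-vs-tree tradeoff discussed above can be illustrated with a sorted-array dictionary: an ordered structure gives O(log M) exact lookup and, unlike a hash table, supports prefix queries such as hyp*, because all matching terms form one contiguous run. A minimal sketch using Python's bisect as a stand-in for an ordered tree:

```python
import bisect

# A sorted-array dictionary: O(log M) exact lookup, prefix search for free.
terms = sorted(["hydra", "hygiene", "hyp", "hyperbola", "hyphen",
                "hypothesis", "judge", "judgment"])

def exact_lookup(term):
    """Binary search for an exact term, O(log M)."""
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term

def prefix_search(prefix):
    """All terms t with prefix <= t < prefix + <max char> are one contiguous slice."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + chr(0x10FFFF))
    return terms[lo:hi]

print(exact_lookup("hyphen"))   # -> True
print(prefix_search("hyp"))     # -> ['hyp', 'hyperbola', 'hyphen', 'hypothesis']
```

A hash table answers exact_lookup in O(1) but can only answer prefix_search by scanning the whole vocabulary, which is exactly the "no prefix search" con from the slide.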
Unstructured data
• Typically refers to free text
• Allows
– Keyword queries, including operators
– More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
• The classic model for searching text documents

Semi-structured data
• In fact almost no data is "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates "semi-structured" search such as
– Title contains data AND Bullets contain search
... to say nothing of linguistic structure

Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g.:
• Regular expressions – if inverted files are stored in lexicographic order, comp* can be processed efficiently; *comp cannot be processed efficiently
• Boolean terms – if A and B are search terms, A OR B can be processed by comparing two moderate-sized lists; (NOT A) OR (NOT B) requires two very large lists

Major Indexing Methods
• Inverted index
– effective for very large collections of documents
– associates lexical items with their occurrences in the collection
– positional and non-positional variants; block-sort construction
• Suffix trees and arrays
– faster for phrase searches; harder to build and maintain
• Signature files
– word-oriented index structures based on hashing (usually not used for large texts)

Sparse Vectors
• Terms (tokens) as vectors
• The vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4.
• However, most documents and queries do not contain most words, so vectors are sparse (i.e., most entries are 0).
• We need efficient methods for storing and computing with sparse vectors.

Sparse Vectors as Lists
• Store vectors as linked lists of non-zero-weight tokens paired with a weight.
– Space proportional to the number of unique terms (n) in the document.
– Requires a linear search of the list to find (or change) the weight of a specific term.
– Requires quadratic time in the worst case to compute the vector for a document:
  sum_{i=1}^{n} i = n(n+1)/2 = O(n^2)

Sparse Vectors as Trees
• Index the tokens in a document in a balanced binary tree or trie, with the weights stored with the tokens at the leaves.
[Diagram: a balanced binary tree over the tokens bit, film, memory, variable, with comparisons such as memory < film directing the search and the weights stored at the leaves.]

Sparse Vectors as Trees (cont.)
• Space overhead for the tree structure: ~2n nodes.
• O(log n) time to find or update the weight of a specific term.
• O(n log n) time to construct the vector.
• Needs a software package to support such data structures.

Sparse Vectors as Hash Tables
• Hashing: a well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index into an array
• Store the tokens in a hash table, with the token string as key and the weight as value.
– Storage overhead for the hash table: ~1.5n
– The table must fit in main memory.
– Constant time to find or update the weight of a specific token (ignoring collisions).
– O(n) time to construct the vector (ignoring collisions).

Implementation Based on Inverted Files
• In practice, document vectors are not stored directly; an inverted organization provides much better efficiency.
• The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree).
• The critical issue is logarithmic or constant-time access to token information.

Index File Types – Lucene
• Field files
– Field infos: .fnm
– Stored fields: field index .fdx, field data .fdt
• Term dictionary files
– Term infos: .tis
– Term info index: .tii
• Frequency file: .frq
• Position file: .prx

Organization of Inverted Files
[Diagram: the term dictionary (prefix/suffix-compressed terms with field, document frequency, and pointers) references the term frequency/posting file (per-term lists of DocDelta and optional Freq entries) and the term position file (per-term, per-document position lists), which in turn reference the document files; the example terms shown are cat, cathy, dog, doom.]

Efficiency Criteria
• Storage: inverted files are big, typically 10% to 100% of the size of the collection of documents.
• Update performance: it must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document.
• Retrieval performance: retrieval must be fast enough to satisfy users and not use excessive resources.

Document File
The document file stores the documents that are being indexed. The documents may be:
• primary documents, e.g., electronic journal articles
• surrogates, e.g., catalog records or abstracts

Postings File
Merging inverted lists is the most computationally intensive task in many information retrieval systems. Since inverted lists may be long, it is important to match postings efficiently. Usually the inverted lists are held on disk and paged into memory for matching; therefore, algorithms for matching postings process the lists sequentially. For efficient matching, the inverted lists should all be sorted in the same sequence. Inverted lists are commonly cached to minimize disk accesses.

Document File (storage)
The storage of the document file may be:
• Central (monolithic) – all documents stored together on a single server (e.g., a library catalog)
• Distributed database – all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog)
• Highly distributed – documents stored on independently managed servers (e.g., the Web)
Each requires: a document ID, a unique identifier that the inverted file system can use to refer to the document, and a location counter, which can be used to specify a location within a document.

Documents File for Web Search Systems
For web search systems:
• A document is a web page.
• The documents file is the web.
• The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded unless it is stored in a cache. (In addition to the usual index file and postings file, the indexing system stores contextual information, which will be discussed in a later lecture.)

Term Frequency (Postings File)
The postings file stores the elements of a sparse matrix, the term assignment matrix. It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file. Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document. Each list consists of one or many individual postings.

Length of Postings File
For a common term there may be very large numbers of postings. Example:
• 1,000,000,000 documents
• 1,000,000 distinct words
• average length 1,000 words per document
→ 10^12 postings
By Zipf's law, the 10th-ranking word occurs approximately (10^12/10)/10 = 10^10 times.

Inverted Index
The primary data structure for text indexes.
• Basically two elements: (Vocabulary, Occurrences)
• Main idea: invert the documents into one big index
• Basic steps:
– Make a "dictionary" of all the terms/tokens in the collection
– For each term/token, list all the docs it occurs in, possibly with its locations in the document (Lucene stores the positions)
– Compress to reduce redundancy in the data structure; this also reduces the I/O and storage required

Inverted Indexes
We have seen "vector files". An inverted file is a vector file "inverted" so that rows become columns and columns become rows:

docs:  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
t1      1  1  0  1  1  1  0  0  0  0
t2      0  0  1  0  1  1  1  1  0  1
t3      1  0  1  0  1  0  0  0  1  1

Inverted, with terms as columns:

       t1 t2 t3
D1      1  0  1
D2      1  0  0
D3      0  1  1
D4      1  0  0
D5      1  1  1
D6      1  1  0
D7      0  1  0
...

How Are Inverted Files Created?
• Documents are parsed one at a time to extract tokens/terms. These are saved with the document ID (DID).
<term, DID>
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Parsing yields one <term, doc#> pair per token, in document order:
<now,1> <is,1> <the,1> <time,1> <for,1> <all,1> <good,1> <men,1> <to,1> <come,1> <to,1> <the,1> <aid,1> <of,1> <their,1> <country,1>
<it,2> <was,2> <a,2> <dark,2> <and,2> <stormy,2> <night,2> <in,2> <the,2> <country,2> <manor,2> <the,2> <time,2> <was,2> <past,2> <midnight,2>

How Inverted Files Are Created
• After all documents have been parsed, the inverted file is sorted alphabetically and, within a term, in document order:
<a,2> <aid,1> <all,1> <and,2> <come,1> <country,1> <country,2> <dark,2> <for,1> <good,1> <in,2> <is,1> <it,2> <manor,2> <men,1> <midnight,2> <night,2> <now,1> <of,1> <past,2> <stormy,2> <the,1> <the,1> <the,2> <the,2> <their,1> <time,1> <time,2> <to,1> <to,1> <was,2> <was,2>

How Inverted Files Are Created
• Multiple term/token entries for a single document are merged.
• Within-document term frequency information is compiled.
• Result: <term, DID, tf> entries, e.g., <the,1,2>:
<a,2,1> <aid,1,1> <all,1,1> <and,2,1> <come,1,1> <country,1,1> <country,2,1> <dark,2,1> <for,1,1> <good,1,1> <in,2,1> <is,1,1> <it,2,1> <manor,2,1> <men,1,1> <midnight,2,1> <night,2,1> <now,1,1> <of,1,1> <past,2,1> <stormy,2,1> <the,1,2> <the,2,2> <their,1,1> <time,1,1> <time,2,1> <to,1,2> <was,2,2>

How Inverted Files Are Created
• Then the inverted file can be split into:
– A dictionary file: the file of unique terms; fit it in memory if possible
– A postings file: which documents each term/token is in and how often (sometimes also where in the document); stored on disk
• Worst case O(n), where n is the number of tokens.
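The parse-sort-merge pipeline above can be sketched end to end (a minimal illustration on the two example documents; a real indexer would sort in blocks on disk rather than in memory):

```python
from collections import Counter

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# 1. Parse: emit one <term, DID> pair per token, in document order.
pairs = [(term, did) for did, text in docs.items() for term in text.split()]

# 2. Sort alphabetically by term, then by document ID.
pairs.sort()

# 3. Merge duplicates within a document, compiling term frequencies: <term, DID, tf>.
merged = sorted(Counter(pairs).items())   # each item is ((term, did), tf)

# 4. Split into a dictionary of unique terms pointing at postings lists.
inverted = {}
for (term, did), tf in merged:
    inverted.setdefault(term, []).append((did, tf))

print(inverted["the"])      # -> [(1, 2), (2, 2)]
print(inverted["country"])  # -> [(1, 1), (2, 1)]
```

Each step is linear apart from the sort, matching the slide's "worst case O(n)" framing (O(n log n) if the sort is counted).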
Dictionary and Posting Files
From the merged <term, DID, tf> entries, the dictionary stores, for each unique term, the number of documents it occurs in and its total frequency, plus a pointer into the postings file; the postings file stores the (doc#, freq) pairs:

Term      NDocs TotFreq  Postings (doc#, freq)
a           1     1      (2,1)
aid         1     1      (1,1)
all         1     1      (1,1)
and         1     1      (2,1)
come        1     1      (1,1)
country     2     2      (1,1) (2,1)
dark        1     1      (2,1)
for         1     1      (1,1)
good        1     1      (1,1)
in          1     1      (2,1)
is          1     1      (1,1)
it          1     1      (2,1)
manor       1     1      (2,1)
men         1     1      (1,1)
midnight    1     1      (2,1)
night       1     1      (2,1)
now         1     1      (1,1)
of          1     1      (1,1)
past        1     1      (2,1)
stormy      1     1      (2,1)
the         2     4      (1,2) (2,2)
their       1     1      (1,1)
time        2     2      (1,1) (2,1)
to          1     2      (1,2)
was         1     2      (2,2)

Inverted indexes
• Permit fast search for individual terms
• For each term/token, you get a list consisting of:
– document ID (DID)
– frequency of the term in the doc (optional; implied in Lucene)
– position of the term in the doc (optional; Lucene)
– i.e., <term, DID, tf, position> or <term, (DIDi, tf, positionij), …>
– Lucene: <positionij, …> (term and DID are implied from other files)
• These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2

How Inverted Files Are Used
Query: "time" AND "dark"
• "time" has 2 docs in the dictionary -> IDs 1 and 2 from the postings file
• "dark" has 1 doc in the dictionary -> ID 2 from the postings file
• Therefore, only doc 2 satisfies the query.
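The "time AND dark" lookup above amounts to intersecting two docID lists fetched through the dictionary. Since postings are kept sorted by docID, a linear merge suffices; a sketch using the worked example's postings:

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by docID, O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Dictionary mapping term -> sorted docID postings, from the worked example.
index = {"time": [1, 2], "dark": [2], "country": [1, 2], "manor": [2]}

print(intersect(index["time"], index["dark"]))      # -> [2]
print(intersect(index["country"], index["manor"]))  # -> [2]
```

This sorted-by-docID merge is why the earlier slide insists postings be "sorted by docID": the intersection never has to back up or re-scan either list.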
Inverted index
• Associates a posting list with each term
– POSTING LIST example:
  a   → (d1, 1)
  ...
  the → (d1, 2) (d2, 2)
• Replace term frequency (tf) with tf-idf
– Lucene stores only tf; idf is applied at query time
• Compress the index and add hash links
– Hashing (a type of compression) speeds up access
• Match the query to the index and rank

Position in an inverted file posting
– POSTING LIST example:
  now  → (1, 1)
  ...
  time → (1, 4) (2, 13)
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Change weight
• Multiple term entries for a single document are merged, and within-document term frequency information is compiled, exactly as in the <term, DID, tf> merge shown earlier.

Example: WestLaw (http://www.westlaw.com/)
• The largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
• About 7 terabytes of data; 700,000 users
• The majority of users still use Boolean queries
• Example query:
– What is the statute of limitations in cases involving the federal tort claims act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• Long, precise queries; proximity operators; incrementally developed; not like web search

Time Complexity of Indexing
• The complexity of creating the vector and indexing a document of n terms is O(n).
• So building an index of m such documents is O(m n).
• Computing vector lengths is also O(m n), which is also the complexity of reading in the corpus.
Retrieval with an Inverted Index
• Terms that are not in both the query and the document do not affect cosine similarity.
– The product of the term weights is zero and does not contribute to the dot product.
• Usually the query is fairly short, and therefore its vector is extremely sparse.
• Use the inverted index to find the limited set of documents that contain at least one of the query words.
• Retrieval time is O(log M) due to hashing, where M is the size of the document collection.

Inverted Query Retrieval Efficiency
• Assume that, on average, a query word appears in B documents:
  Q = q1, q2, …, qN, where each qi matches documents Di1 … DiB
• Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval, which examines all M documents in O(|V| M) time, because |Q| << |V| and B << M (|V| is the size of the vocabulary).

Processing the Query
• Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one.
• To accumulate a total score for each retrieved document, store the retrieved documents in a hashtable, where DocumentReference is the key and the partial accumulated score is the value.

Index Files: On Disk vs. In Memory
• On disk: if an index is held on disk, search time is dominated by the number of disk accesses.
• In memory: suppose an index has 1,000,000 distinct terms, and each index entry (the term, some basic statistics, and a pointer to the inverted list) averages 100 characters. The index is then about 100 megabytes, which can easily be held in the memory of a dedicated computer.

Index File Structures: Linear Index
Advantages:
• Can be searched quickly, e.g., by binary search, O(log M)
• Good for sequential processing, e.g., comp*
• Convenient for batch updating
• Economical use of storage
Disadvantages:
• The index must be rebuilt if an extra term is added

Documents File for Web Search Systems
For web search systems:
• A document is a web page.
• The documents file is the web.
• The document ID is the URL of the document.
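The "Processing the Query" scheme above (a hashtable of partial scores keyed by document, filled term by term) can be sketched as a term-at-a-time scorer. The toy index, weights, and vector lengths below are illustrative, not from the lecture:

```python
import math
from collections import defaultdict

# Toy inverted index: term -> list of (docID, term weight in that doc).
index = {
    "dark":  [(2, 0.7)],
    "time":  [(1, 0.4), (2, 0.3)],
    "night": [(2, 0.5)],
}
doc_lengths = {1: 1.0, 2: 1.1}   # precomputed document vector lengths (illustrative)

def score(query_terms):
    """Term-at-a-time scoring with a hashtable of partial accumulated scores."""
    acc = defaultdict(float)                 # docID -> partial dot product
    for term in query_terms:
        for did, weight in index.get(term, []):
            acc[did] += weight               # query term weight taken as 1 for simplicity
    q_len = math.sqrt(len(query_terms))
    # Normalize to cosine similarity and rank, best first.
    return sorted(((acc[d] / (doc_lengths[d] * q_len), d) for d in acc), reverse=True)

for sim, did in score(["dark", "time"]):
    print(f"doc {did}: cosine {sim:.3f}")
```

Only documents containing at least one query term ever enter the accumulator, which is the efficiency point of the slide: docs sharing no terms with the query contribute nothing and are never touched.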
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded unless it is stored in a cache. (In addition to the usual index file and postings file, the indexing system stores special information.)

Time Complexity of Indexing
• The complexity of creating the vector and indexing a document of n tokens is O(n).
• So indexing m such documents is O(m n).
• Computing token IDFs for a vocabulary V is O(|V|).
• Computing vector lengths is also O(m n).
• Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.

Index on disk vs. memory
• Most retrieval systems keep the dictionary in memory and the postings on disk
• Web search engines frequently keep both in memory
– massive memory requirement
– feasible for large web service installations
– less so for commercial usage, where query loads are lighter

Indexing in the real world
• Typically you don't have all documents sitting on a local filesystem
– Documents need to be crawled
– They could be dispersed over a WAN with varying connectivity
– You must schedule distributed spiders/indexers
– Content could be (secure content) in databases, content management applications, or email applications

Content residing in applications
• Mail systems/groupware and content management systems contain the most "valuable" documents
• HTTP is often not the most efficient way of fetching these documents; use native API fetching
– Specialized, repository-specific connectors
– These connectors also facilitate document viewing when a search result is selected for viewing

Secure documents
• Each document is accessible to a subset of users
– Usually implemented through some form of Access Control Lists (ACLs)
• Search users are authenticated
• A query should retrieve a document only if the user can access it
– So if there are docs matching your search but you're not privy to them: "Sorry, no results found"
– E.g., as a lowly employee in the company, I get "No results" for the query "salary
roster” • Solr has this with post filtering Users in groups, docs from groups • Index the ACLs and filter results by them Documents Users 0/1 0 if user can’t read doc, 1 otherwise. • Often, user membership in an ACL group verified at query time – slowdown Compound documents • What if a doc consisted of components – Each component has its own ACL. • Your search should get a doc only if your query meets one of its components that you have access to. • More generally: doc assembled from computations on components – e.g., in Lotus databases or in content management systems • How do you index such docs? No good answers … “Rich” documents • (How) Do we index images? • Researchers have devised Query based on Image Content (QBIC) systems – “show me a picture similar to this orange circle” – Then use vector space retrieval • In practice, image search usually based on metadata such as file name e.g., monalisa.jpg • New approaches exploit social tagging – E.g., flickr.com Passage/sentence retrieval • Suppose we want to retrieve not an entire document matching a query, but only a passage/sentence - say, in a very long document • Can index passages/sentences as minidocuments – what should the index units be? • This is the subject of XML search Sec. 2.4 Phrase queries • We want to be able to answer queries such as “stanford university” – as a phrase • Thus the sentence “I went to university at Stanford” is not a match. – The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works – Many more queries are implicit phrase queries • For this, it no longer suffices to store only <term : docs> entries Sec. 2.4.1 A first attempt: Biword indexes • Index every consecutive pair of terms in the text as a phrase • For example the text “Friends, Romans, Countrymen” would generate the biwords – friends romans – romans countrymen • Each of these biwords is now a dictionary term • Two-word phrase query-processing is now immediate. Sec. 
2.4.1 Longer phrase queries
• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean query on biwords:
  stanford university AND university palo AND palo alto
• Without examining the docs, we cannot verify that the docs matching the above Boolean query actually contain the phrase. We can have false positives!

Sec. 2.4.1 Issues for biword indexes
• False positives, as noted before
• Index blowup due to a bigger dictionary
– Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy

Sec. 2.4.2 Solution 2: Positional indexes
• In the postings, store, for each term, the position(s) at which tokens of it appear:
<term, number of docs containing term; doc1: position1, position2 …; doc2: position1, position2 …; etc.>

Sec. 2.4.2 Positional index example
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
• For phrase queries, we use a merge algorithm recursively at the document level
• But we now need to deal with more than just equality

Sec. 2.4.2 Processing a phrase query
• Extract the inverted index entries for each distinct term: to, be, or, not.
• Merge their doc:position lists to enumerate all positions of "to be or not to be".
– to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
– be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
• The same general method works for proximity searches

Sec. 2.4.2 Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
– Again, here /k means "within k words of".
• Clearly, positional indexes can be used for such queries; biword indexes cannot.
• Exercise: adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
– This is a little tricky to do correctly and efficiently
– See Figure 2.12 of IIR
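The positional merge just described (equality on docIDs, adjacency on positions) can be sketched for a two-term phrase, using the "to" and "be" postings from the slide:

```python
def phrase_match(post1, post2):
    """Find docs where a token of term 1 is immediately followed by a token of term 2.
    post1, post2: dict mapping docID -> sorted list of positions."""
    result = {}
    for did in post1.keys() & post2.keys():   # merge at the document level first
        positions2 = set(post2[did])
        hits = [p for p in post1[did] if p + 1 in positions2]  # adjacency test
        if hits:
            result[did] = hits
    return result

# Positional postings for "to" and "be", from the slide (docID -> positions).
to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}

print(phrase_match(to, be))   # "to be" -> {4: [16, 190, 429, 433]}
```

Replacing the adjacency test `p + 1 in positions2` with a window test (any position within k) gives the /k proximity operator from the WestLaw-style queries; doing that efficiently for sorted position lists is exactly the exercise the slide poses.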
2.4.2 Positional index size
• A positional index expands postings storage substantially
– even though indices can be compressed
• Nevertheless, a positional index is now standardly used, because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system.

Sec. 2.4.2 Positional index size
• Need an entry for each occurrence, not just one per document
• Index size depends on average document size. Why?
– The average web page has <1000 terms
– SEC filings, books, even some epic poems ... easily 100,000 terms
• Consider a term with frequency 0.1%:

Document size   Postings   Positional postings
1000                1            1
100,000             1          100

Sec. 2.4.2 Rules of thumb
• A positional index is 2–4 times as large as a non-positional index
• Positional index size is 35–50% of the volume of the original text
– Caveat: all of this holds for "English-like" languages

Sec. 2.4.3 Combination schemes
• The two approaches can be profitably combined
– For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
– Even more so for phrases like "The Who"
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
– A typical web query mixture was executed in 1/4 of the time of using just a positional index
– It required 26% more space than a positional index alone

Sec. 4.4 Distributed indexing
• For web-scale indexing (don't try this at home!) you must use a distributed computing cluster
• Individual machines are fault-prone
– They can unpredictably slow down or fail
• How do we exploit such a pool of machines?

Sec. 4.4 Web search engine data centers
• Web search data centers (Google, Bing, Baidu) mainly contain commodity machines.
• Data centers are distributed around the world.
• Estimate: Google ~1 million servers, 3 million processors/cores (Gartner 2007)

Sec.
4.4 Massive data centers
• If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system?
• Answer: it is down about 63% of the time (uptime is 0.999^1000 ≈ 37%)
• Exercise: calculate the number of servers failing per minute for an installation of 1 million servers.

Inside Google Data Center
• https://www.youtube.com/watch?v=avP5d16wEp0

Sec. 4.4 Distributed indexing
• Maintain a master machine directing the indexing job – considered "safe".
• Break up indexing into sets of (parallel) tasks.
• The master machine assigns each task to an idle machine from a pool.

Sec. 4.4 Parallel tasks
• We will use two sets of parallel tasks: parsers and inverters
• Break the input document collection into splits
• Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)

Sec. 4.4 Parsers
• The master assigns a split to an idle parser machine
• The parser reads one document at a time and emits (term, doc) pairs
• The parser writes the pairs into j partitions
• Each partition covers a range of terms' first letters (e.g., a-f, g-p, q-z) – here j = 3
• Now to complete the index inversion

Sec. 4.4 Inverters
• An inverter collects all (term, doc) pairs (= postings) for one term partition.
• It sorts them and writes them to postings lists.

Sec. 4.4 Data flow
[Diagram: the master assigns splits to parsers (map phase), which write segment files partitioned a-f, g-p, q-z; in the reduce phase, each inverter reads one partition across all segment files and writes its postings.]

Sec. 4.4 MapReduce
• The index construction algorithm we just described is an instance of MapReduce.
• MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing ...
• ... without having to write code for the distribution part.
• They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.

Sec. 4.4 MapReduce
• Index construction was just one phase.
• Another phase: transforming a term-partitioned index into a document-partitioned index.
– Term-partitioned: one machine handles a subrange of terms.
– Document-partitioned: one machine handles a subrange of documents.
• As we'll discuss in the web part of the course, most search engines use a document-partitioned index (better load balancing, etc.).

Sec. 4.4 Schema for index construction in MapReduce
• Schema of the map and reduce functions:
– map: input → list(k, v)
– reduce: (k, list(v)) → output
• Instantiation of the schema for index construction:
– map: collection → list(termID, docID)
– reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …)

Sec. 4.5 Dynamic indexing
• Up to now, we have assumed that collections are static. They rarely are:
– Documents come in over time and need to be inserted.
– Documents are deleted and modified.
• This means that the dictionary and postings lists have to be modified:
– Postings updates for terms already in the dictionary.
– New terms added to the dictionary.

Sec. 4.5 Simplest approach
• Maintain a "big" main index; new docs go into a "small" auxiliary index.
• Search across both and merge the results.
• Deletions:
– Keep an invalidation bit-vector for deleted docs.
– Filter the docs returned by a search through this invalidation bit-vector.
• Periodically, re-index into one main index.

Sec. 4.5 Issues with main and auxiliary indexes
• Frequent merges touch a lot of data, and performance is poor during a merge.
• Actually, merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list — the merge is then a simple append — but that requires a huge number of files, which is inefficient for the OS.
• Assumption for the rest of the lecture: the index is one big file.
• In reality, use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).

Indexing subsystem
[Diagram: documents → assign document IDs → break into tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → term weighting* → terms with weights → index database, with document numbers and *field numbers carried along. *Indicates optional operation.]

Search subsystem
[Diagram: query → parse query → query tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → Boolean operations* against the index database → retrieved document set → ranking* → ranked document set → relevant document set. *Indicates optional operation.]

Lucene's index (conceptual)
• An Index contains Documents; each Document contains Fields; each Field has a Name and a Value.

Create a Lucene index (step 1)
• Create a Lucene document and add fields:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public void createDoc(String title, String body) {
  Document doc = new Document();
  doc.add(Field.Text("Title", title));
  doc.add(Field.Text("Body", body));
}

Create a Lucene index (step 2)
• Create an Analyzer. Options:
– WhitespaceAnalyzer: divides text at whitespace.
– SimpleAnalyzer: divides text at non-letters; converts to lower case.
– StopAnalyzer: SimpleAnalyzer plus stop-word removal.
– StandardAnalyzer: good for most European languages; removes stop words and converts to lower case.

Create a Lucene index (step 3)
• Create an index writer and add the Lucene document to the index:

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void writeDoc(Document doc, String idxPath) {
  try {
    IndexWriter writer = new IndexWriter(idxPath, new StandardAnalyzer(), true);
    writer.addDocument(doc);
    writer.close();
  } catch (IOException exp) {
    System.out.println("I/O Error!");
  }
}

Lucene index – behind the scenes
• The inverted index (inverted file) for two toy documents:
– Doc 1: "Penn State Football …"
– Doc 2: "Football players … State"

Posting id | word | (doc, offset) postings
1 | football | (Doc 1, 3), (Doc 1, 67), (Doc 2, 1)
2 | penn | (Doc 1, 1)
3 | players | (Doc 2, 2)
4 | state | (Doc 1, 2), (Doc 2, 13)

Posting table
• The posting table is a fast look-up mechanism.
– Key: word.
– Value: posting id, plus satellite data (#df, offset, …).
• Lucene implements the posting table with a Java hash table, built on java.util.Hashtable.
– Hash function: hc2 = hc1 * 31 + nextChar (Java's string hash).
• Posting table usage:
– Indexing: insertion (new terms), update (existing terms).
– Searching: lookup, and construction of the document vector.

Search Lucene's index (step 1)
• Construct a query (the parser builds it from the query string):

import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void formQuery(String ques) throws Exception {
  Query query = QueryParser.parse(ques, "body", new StandardAnalyzer());
}

Search Lucene's index (step 1, continued)
• Types of query:
– Boolean: [IST441 Giles], [IST441 OR Giles], [java AND NOT SUN]
– wildcard: [nu?ch], [nutc*]
– phrase: ["JAVA TOMCAT"]
– proximity: ["lucene nutch"~10]
– fuzzy: [roam~] matches roams and foam
– date range
– …

Search Lucene's index (step 2)
• Search the index:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public void searchIdx(String idxPath) throws IOException {
  Directory fsDir = FSDirectory.getDirectory(idxPath, false);
  IndexSearcher is = new IndexSearcher(fsDir);
  Hits hits = is.search(query); // query from step 1
}

Search Lucene's index (step 3)
• Display the results:

for (int i = 0; i < hits.length(); i++) {
  Document doc = hits.doc(i);
  // show your results
}

Example: WestLaw (http://www.westlaw.com/)
• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992).
• About 7 terabytes of data; 700,000 users.
• The majority of users still use Boolean queries.
• Example query:
– Information need: What is the statute of limitations in cases involving the Federal Tort Claims Act?
– Query: LIMIT!
/3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• Long, precise queries; proximity operators; incrementally developed; not like web search.

Lucene workflow
[Diagram: a Document (super_name: Spider-Man; name: Peter Parker; category: superhero; powers: agility, spider-sense) is added to the Lucene index via IndexWriter.addDocument(); a Query (powers:agility) is run via IndexSearcher.search(), returning Hits (the matching docs).]
1. Get the Lucene jar file.
2. Write indexing code to get the data and create Document objects.
3. Write code to create query objects.
4. Write code to use/display the results.

What we covered
• Indexing modes
• Inverted files
– Storage and access advantages
– Posting files
• Examples from Lucene
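The uptime question from the massive-data-centers slide is easy to check numerically. A minimal sketch (class name is illustrative): with 1,000 independent nodes, each up 99.9% of the time, the whole non-fault-tolerant system is up only when every node is.

```java
public class Uptime {
    public static void main(String[] args) {
        // System is up only if all 1000 nodes are up simultaneously.
        double systemUptime = Math.pow(0.999, 1000);
        System.out.printf("System uptime: %.1f%%%n", systemUptime * 100); // about 36.8%
    }
}
```

Hence the system is down roughly 63% of the time, which is why fault tolerance is unavoidable at this scale.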
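The MapReduce schema for index construction above — map: collection → list(termID, docID); reduce: grouped pairs → postings lists — can be simulated in plain Java. This is a single-process sketch of the schema, not a real distributed job; class and method names are illustrative.

```java
import java.util.*;

public class MapReduceIndexSketch {
    // map: (docID, text) -> list of (term, docID) pairs, one per token
    public static List<String[]> map(int docID, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) pairs.add(new String[]{term, String.valueOf(docID)});
        }
        return pairs;
    }

    // reduce: group pairs by term into sorted, de-duplicated postings lists
    public static SortedMap<String, SortedSet<Integer>> reduce(List<String[]> pairs) {
        SortedMap<String, SortedSet<Integer>> index = new TreeMap<>();
        for (String[] p : pairs) {
            index.computeIfAbsent(p[0], k -> new TreeSet<>())
                 .add(Integer.parseInt(p[1]));
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map(1, "Penn State Football"));
        pairs.addAll(map(2, "Football players State"));
        SortedMap<String, SortedSet<Integer>> index = reduce(pairs);
        System.out.println(index.get("football")); // prints [1, 2]
    }
}
```

In the real scheme the map outputs are split across parser machines and partitioned by term range before the inverters run the reduce; here everything happens in one list.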
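The posting-table structure from the behind-the-scenes slides — word → (doc, offset) postings, backed by a hash table — can be sketched as follows. This is a toy illustration of the idea, not Lucene's actual implementation; the class name and offset convention (counting from 1, as on the slide) are assumptions.

```java
import java.util.*;

public class PostingTableSketch {
    // word -> list of (docID, offset) postings, as in the posting-table slide
    private final Map<String, List<int[]>> table = new HashMap<>();

    public void addDocument(int docID, String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int i = 0; i < tokens.length; i++) {
            table.computeIfAbsent(tokens[i], k -> new ArrayList<>())
                 .add(new int[]{docID, i + 1}); // offsets counted from 1
        }
    }

    // lookup: return the (docID, offset) postings for a word, empty if absent
    public List<int[]> lookup(String word) {
        return table.getOrDefault(word.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        PostingTableSketch idx = new PostingTableSketch();
        idx.addDocument(1, "Penn State Football");
        idx.addDocument(2, "Football players");
        for (int[] p : idx.lookup("football"))
            System.out.println("Doc " + p[0] + ", offset " + p[1]);
        // Doc 1, offset 3
        // Doc 2, offset 1
    }
}
```

Indexing inserts new terms and updates existing ones; searching is a hash lookup, exactly the two usages listed on the slide.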
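The "simplest approach" to dynamic indexing — a big main index, a small auxiliary index for new documents, an invalidation bit-vector for deletions, and a periodic re-index — can be sketched in memory like this (an illustrative sketch; the class and method names are not from any library).

```java
import java.util.*;

public class DynamicIndexSketch {
    private final Map<String, SortedSet<Integer>> main = new HashMap<>();
    private final Map<String, SortedSet<Integer>> aux = new HashMap<>();
    private final Set<Integer> invalidated = new HashSet<>(); // invalidation bit-vector

    private static void add(Map<String, SortedSet<Integer>> idx, int docID, String text) {
        for (String t : text.toLowerCase().split("\\W+"))
            idx.computeIfAbsent(t, k -> new TreeSet<>()).add(docID);
    }

    public void addDocument(int docID, String text) { add(aux, docID, text); } // new docs -> auxiliary
    public void delete(int docID) { invalidated.add(docID); } // mark; don't rewrite postings

    // search both indexes, merge, then filter through the invalidation set
    public SortedSet<Integer> search(String term) {
        SortedSet<Integer> result = new TreeSet<>();
        result.addAll(main.getOrDefault(term, Collections.emptySortedSet()));
        result.addAll(aux.getOrDefault(term, Collections.emptySortedSet()));
        result.removeAll(invalidated);
        return result;
    }

    // periodic re-index: merge auxiliary into main and drop invalidated docs
    public void mergeIntoMain() {
        for (Map.Entry<String, SortedSet<Integer>> e : aux.entrySet())
            main.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
        aux.clear();
        for (SortedSet<Integer> postings : main.values()) postings.removeAll(invalidated);
        invalidated.clear();
    }

    public static void main(String[] args) {
        DynamicIndexSketch idx = new DynamicIndexSketch();
        idx.addDocument(1, "penn state football");
        idx.addDocument(2, "football players");
        idx.delete(1);
        System.out.println(idx.search("football")); // prints [2]
    }
}
```

The merge cost that the slides warn about corresponds to `mergeIntoMain()`: cheap here because everything is in memory, but expensive on disk when the index is one big file.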