IR Indexing
Thanks to B. Arms (SIMS), Baldi, Frasconi, Smyth, Manning, Raghavan, Schutze

What we have covered
• What is IR
• Evaluation
• Tokenization, terms and properties of text
• Web crawling; robots.txt
• Vector model of documents
  – Bag of words
• Queries
• Measures of similarity
• This presentation
  – Indexing
  – Inverted files

Last time
• Vector models of documents and queries
  – Bag of words model
• Similarity measures
  – Text similarity typically used
  – Similarity is a measure of relevance (and ranking)
  – Query is a small document
  – Match query to document
• All stored and indexed before a query is matched.

Summary: What's the point of using vector spaces?
• A well-formed algebraic space for retrieval
• Key: A user's query can be viewed as a (very) short document.
• Query becomes a vector in the same space as the docs.
• Can measure each doc's proximity to it.
• Natural measure of scores/ranking – no longer Boolean.
  – Queries are expressed as bags of words
• Other similarity measures: see http://www.lans.ece.utexas.edu/~strehl/diss/node52.html for a survey

A Typical Web Search Engine
[Diagram: a crawler fetches pages from the Web, the indexer builds the index, and the query engine and index interface serve users. A second version of the diagram labels the components with Solr/Lucene, Nutch, YouSeer, and LucidWorks as example implementations.]

Why indexing?
• For efficient searching of documents with unstructured text (not databases) using terms as queries
  – Databases can still be searched with an index, e.g., Sphinx
• Online sequential text search (grep)
  – Small collection
  – Volatile text
• Data structures – indexes
  – Large, semi-stable document collection
  – Efficient search

Unstructured vs. structured data
• What's available?
  – Web
  – Organizations
  – You
• Unstructured usually means "text"
• Structured usually means "databases"
• Semistructured is somewhere in between

Unstructured (text) vs. structured (database) data companies, 1996 and 2006
[Bar charts comparing companies dealing in unstructured vs. structured data on two measures, data volume and market cap, for 1996 and 2006.]

IR vs. databases: structured vs. unstructured data
• Structured data tends to refer to information in "tables"

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

• Typically allows numerical range and exact match (for text) queries, e.g.,
  Salary < 60000 AND Manager = Smith.

Unstructured data
• Typically refers to free text
• Allows
  – Keyword queries including operators
  – More sophisticated "concept" queries, e.g.,
    • find all web pages dealing with drug abuse
• Classic model for searching text documents

Semi-structured data
• In fact almost no data is "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates "semi-structured" search such as
  – Title contains data AND Bullets contain search
… to say nothing of linguistic structure

More sophisticated semi-structured search
• Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator
• Issues:
  – how do you process "about"?
  – how do you rank results?
• The focus of XML search.

Clustering and classification
• Given a set of docs, group them into clusters based on their contents.
• Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
• Not discussed in this course
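To make the "query is a very short document" idea concrete, here is a minimal sketch (not from the original slides) of cosine similarity over sparse bag-of-words vectors held as term-to-weight maps; the class name, method names, and toy weights are illustrative.

  import java.util.HashMap;
  import java.util.Map;

  /** Cosine similarity between two bag-of-words vectors stored as term -> weight maps. */
  public class CosineSimilarity {

      static double cosine(Map<String, Double> q, Map<String, Double> d) {
          double dot = 0.0;
          // Iterate over the (short) query vector; terms absent from the document add nothing.
          for (Map.Entry<String, Double> e : q.entrySet()) {
              Double w = d.get(e.getKey());
              if (w != null) dot += e.getValue() * w;
          }
          return dot / (norm(q) * norm(d));
      }

      static double norm(Map<String, Double> v) {
          double sum = 0.0;
          for (double w : v.values()) sum += w * w;
          return Math.sqrt(sum);
      }

      public static void main(String[] args) {
          Map<String, Double> doc = new HashMap<>();
          doc.put("country", 1.0); doc.put("time", 1.0); doc.put("good", 1.0);
          Map<String, Double> query = new HashMap<>();
          query.put("time", 1.0); query.put("dark", 1.0);
          System.out.println(cosine(query, doc));   // only the shared term "time" contributes
      }
  }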
The web and its challenges
• Unusual and diverse documents
• Unusual and diverse users, queries, information needs
• Beyond terms, exploit ideas from social networks
  – link analysis, clickstreams ...
• How do search engines work? And how can we make them better?

More sophisticated information retrieval not covered here
• Cross-language information retrieval
• Question answering
• Summarization
• Text mining
• …

Search Subsystem
[Diagram: a query is parsed into query tokens; an optional stop list removes stopwords; optional stemming produces stemmed terms; optional Boolean operations against the index database yield a retrieved document set; optional ranking produces a ranked document set, from which the relevant document set is drawn. * indicates an optional operation.]

Lexical Analysis: Tokens
What is a token? (Free text indexing)
• A token is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word.
• Usually, tokens are strings of letters, digits or other specified characters, separated by punctuation, spaces, etc.

Decisions in Building Inverted Files: What is the Term/Token?
• Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
• Is there a controlled vocabulary? If so, what words are included? Stemming?
• List of stopwords.
• Rules to decide the beginning and end of words, e.g., spaces or punctuation.
• Character sequences not to be indexed, e.g., sequences of numbers.

Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g.,
• Regular expressions
  – If inverted files are stored in lexicographic order, comp* can be processed efficiently
  – *comp cannot be processed efficiently
• Boolean terms (if A and B are search terms)
  – A OR B can be processed by comparing two moderate-sized lists
  – (NOT A) OR (NOT B) requires two very large lists

Major Indexing Methods
• Inverted index
  – effective for very large collections of documents
  – associates lexical items to their occurrences in the collection
• Positional index
• Non-positional indexes
  – Block-sort construction
• Suffix trees and arrays
  – Faster for phrase searches; harder to build and maintain
• Signature files
  – Word-oriented index structures based on hashing (usually not used for large texts)

Sparse Vectors
• Terms (tokens) as vectors
• The vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4.
• However, most documents and queries do not contain most words, so vectors are sparse (i.e., most entries are 0).
• Need efficient methods for storing and computing with sparse vectors.

Sparse Vectors as Lists
• Store vectors as linked lists of non-zero-weight tokens paired with a weight.
  – Space proportional to the number of unique terms (n) in the document.
  – Requires linear search of the list to find (or change) the weight of a specific term.
  – Requires quadratic time in the worst case to compute the vector for a document:
    sum_{i=1..n} i = n(n+1)/2 = O(n^2)

Sparse Vectors as Trees
• Index the tokens in a document in a balanced binary tree or trie, with weights stored with the tokens at the leaves.
[Figure: a balanced binary tree whose internal nodes direct the search ("< film", etc.) and whose leaves hold the tokens bit, film, memory, variable with their weights (e.g., bit 2, film 1, memory 1, variable 2).]

Sparse Vectors as Trees (cont.)
• Space overhead for the tree structure: ~2n nodes.
• O(log n) time to find or update the weight of a specific term.
• O(n log n) time to construct the vector.
• Need a software package to support such data structures.
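As a minimal sketch (not from the slides) of the tree-based idea, Java's TreeMap can hold a sparse vector: it is a red-black balanced tree keyed by token rather than the leaf-weighted tree in the figure, but it gives the same O(log n) lookup and update; the class and method names are illustrative.

  import java.util.TreeMap;

  /** Sparse document vector kept in a balanced tree keyed by token. */
  public class TreeVector {
      private final TreeMap<String, Integer> weights = new TreeMap<>();

      // O(log n) insert/update of a token's weight.
      void addOccurrence(String token) {
          weights.merge(token, 1, Integer::sum);
      }

      // O(log n) lookup; 0 for tokens not in the document.
      int weight(String token) {
          return weights.getOrDefault(token, 0);
      }

      public static void main(String[] args) {
          TreeVector v = new TreeVector();
          for (String t : "to be or not to be".split(" ")) v.addOccurrence(t);
          System.out.println(v.weight("to"));   // 2
          System.out.println(v.weight("cat"));  // 0
      }
  }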
Sparse Vectors as Hash Tables
• Hashing: a well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index into an array.
• Store tokens in a hash table, with the token string as key and the weight as value.
  – Storage overhead for the hash table ~1.5n.
  – Table must fit in main memory.
  – Constant time to find or update the weight of a specific token (ignoring collisions).
  – O(n) time to construct the vector (ignoring collisions).

Implementation Based on Inverted Files
• In practice, document vectors are not stored directly; an inverted organization provides much better efficiency.
• The keyword-to-document index can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree).
• The critical issue is logarithmic or constant-time access to token information.

Index File Types – Lucene
• Field files
  – Field infos: .fnm
  – Stored fields
    • field index: .fdx
    • field data: .fdt
• Term dictionary files
  – Term infos: .tis
  – Term info index: .tii
• Frequency file: .frq
• Position file: .prx

Organization of Inverted Files
[Diagram: the term dictionary stores each term (prefix/suffix compressed) with its field number and document frequency plus pointers into the other files; the term frequency (posting) file lists, per term, DocDelta and Freq entries; the term position file (inverted index) lists within-document positions; the document files hold the documents themselves. The example shows the terms cat, cathy, dog, and doom with their postings and positions.]

Efficiency Criteria
• Storage
  – Inverted files are big, typically 10% to 100% of the size of the collection of documents.
• Update performance
  – It must be possible, with a reasonable amount of computation, to:
    (a) add a large batch of documents
    (b) add a single document
• Retrieval performance
  – Retrieval must be fast enough to satisfy users and not use excessive resources.

Document File
The documents file stores the documents that are being indexed. The documents may be:
• primary documents, e.g., electronic journal articles
• surrogates, e.g., catalog records or abstracts

Postings File
• Merging inverted lists is the most computationally intensive task in many information retrieval systems.
• Since inverted lists may be long, it is important to match postings efficiently.
• Usually, the inverted lists are held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially.
• For efficient matching, the inverted lists should all be sorted in the same sequence.
• Inverted lists are commonly cached to minimize disk accesses.

Document File
The storage of the document file may be:
• Central (monolithic) – all documents stored together on a single server (e.g., library catalog)
• Distributed database – all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog)
• Highly distributed – documents are stored on independently managed servers (e.g., Web)
Each requires: a document ID, a unique identifier that the inverted file system can use to refer to the document, and a location counter, which can be used to specify a location within a document.

Documents File for Web Search System
For web search systems:
• A document is a web page.
• The documents file is the web.
• The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores contextual information, which will be discussed in a later lecture.)
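To illustrate the sequential matching of postings described in the Postings File slide above, here is a minimal sketch (not from the slides) of intersecting two posting lists sorted by document ID, as a Boolean AND requires; the class name and the toy lists are illustrative.

  import java.util.ArrayList;
  import java.util.List;

  /** Sequential intersection of two postings lists sorted by document ID,
      as in a Boolean AND; one pass over each list, O(len1 + len2). */
  public class PostingsMerge {

      static List<Integer> intersect(List<Integer> a, List<Integer> b) {
          List<Integer> result = new ArrayList<>();
          int i = 0, j = 0;
          while (i < a.size() && j < b.size()) {
              int da = a.get(i), db = b.get(j);
              if (da == db) { result.add(da); i++; j++; }
              else if (da < db) i++;     // advance the list with the smaller doc ID
              else j++;
          }
          return result;
      }

      public static void main(String[] args) {
          List<Integer> time = List.of(1, 2);   // "time" occurs in docs 1 and 2
          List<Integer> dark = List.of(2);      // "dark" occurs in doc 2
          System.out.println(intersect(time, dark));   // [2]
      }
  }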
Term Frequency (Postings File)
• The postings file stores the elements of a sparse matrix, the term assignment matrix.
• It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file.
• Each element in an inverted list is called a posting, i.e., an occurrence of a term in a document.
• Each list consists of one or many individual postings.

Length of Postings File
For a common term there may be very large numbers of postings.
Example:
  1,000,000,000 documents
  1,000,000 distinct words
  average length 1,000 words per document
  => 10^12 postings
By Zipf's law, the 10th-ranking word occurs approximately (10^12/10)/10 = 10^10 times.

Inverted Index
Primary data structure for text indexes
• Basically two elements:
  – (Vocabulary, Occurrences)
• Main idea:
  – Invert documents into a big index
• Basic steps:
  – Make a "dictionary" of all the terms/tokens in the collection
  – For each term/token, list all the docs it occurs in.
    • Possibly the location in the document (Lucene stores the positions)
  – Compress to reduce redundancy in the data structure
    • Also reduces I/O and storage required

Inverted Indexes
We have seen "vector files". An inverted file is a vector file "inverted" so that rows become columns and columns become rows.

  Vector file (docs as rows):        Inverted file (terms as rows):
  docs  t1  t2  t3                   terms  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
  D1     1   0   1                   t1      1  1  0  1  1  1  0  0  0  0
  D2     1   0   0                   t2      0  0  1  0  1  1  1  1  0  1
  D3     0   1   1                   t3      1  0  1  0  1  0  0  0  1  1
  D4     1   0   0
  D5     1   1   1
  D6     1   1   0
  D7     0   1   0
  D8     0   1   0
  D9     0   0   1
  D10    0   1   1

How Are Inverted Files Created
• Documents are parsed one document at a time to extract tokens/terms. These are saved with the Document ID (DID): <term, DID>

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

<now,1> <is,1> <the,1> <time,1> <for,1> <all,1> <good,1> <men,1> <to,1> <come,1> <to,1> <the,1> <aid,1> <of,1> <their,1> <country,1>
<it,2> <was,2> <a,2> <dark,2> <and,2> <stormy,2> <night,2> <in,2> <the,2> <country,2> <manor,2> <the,2> <time,2> <was,2> <past,2> <midnight,2>

How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically and in document order:

<a,2> <aid,1> <all,1> <and,2> <come,1> <country,1> <country,2> <dark,2> <for,1> <good,1> <in,2> <is,1> <it,2> <manor,2> <men,1> <midnight,2> <night,2> <now,1> <of,1> <past,2> <stormy,2> <the,1> <the,1> <the,2> <the,2> <their,1> <time,1> <time,2> <to,1> <to,1> <was,2> <was,2>

How Inverted Files are Created
• Multiple term/token entries for a single document are merged.
• Within-document term frequency information is compiled.
• Result: <term, DID, tf>, e.g., <the,1,2>

<a,2,1> <aid,1,1> <all,1,1> <and,2,1> <come,1,1> <country,1,1> <country,2,1> <dark,2,1> <for,1,1> <good,1,1> <in,2,1> <is,1,1> <it,2,1> <manor,2,1> <men,1,1> <midnight,2,1> <night,2,1> <now,1,1> <of,1,1> <past,2,1> <stormy,2,1> <the,1,2> <the,2,2> <their,1,1> <time,1,1> <time,2,1> <to,1,2> <was,2,2>
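The parse, sort, and merge process above can be sketched in a few lines of Java (not from the slides; an in-memory simplification that uses sorted maps instead of an on-disk sort, with illustrative names):

  import java.util.*;

  /** Minimal in-memory inverted index builder: parse each document, record
      <term, docId> pairs, then merge duplicates into <term, docId, tf> postings. */
  public class BuildInvertedIndex {

      // term -> (docId -> term frequency)
      static Map<String, TreeMap<Integer, Integer>> build(List<String> docs) {
          Map<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
          int docId = 0;
          for (String doc : docs) {
              docId++;
              // crude tokenizer: lowercase, split on non-letters
              for (String token : doc.toLowerCase().split("[^a-z]+")) {
                  if (token.isEmpty()) continue;
                  index.computeIfAbsent(token, t -> new TreeMap<>())
                       .merge(docId, 1, Integer::sum);   // merge duplicates into tf
              }
          }
          return index;
      }

      public static void main(String[] args) {
          List<String> docs = List.of(
              "Now is the time for all good men to come to the aid of their country",
              "It was a dark and stormy night in the country manor. The time was past midnight");
          Map<String, TreeMap<Integer, Integer>> index = build(docs);
          System.out.println(index.get("the"));      // {1=2, 2=2}
          System.out.println(index.get("country"));  // {1=1, 2=1}
      }
  }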
How Inverted Files are Created
• Then the inverted file can be split into
  – A Dictionary file
    • File of unique terms
    • Fit in memory if possible
  and
  – A Postings file
    • File of which documents the term/token is in and how often.
    • Sometimes where the term is in the document.
    • Stored on disk
    • Worst case O(n); n = number of tokens.

Dictionary and Posting Files
Dictionary (term, number of docs, total frequency):
  a(1,1) aid(1,1) all(1,1) and(1,1) come(1,1) country(2,2) dark(1,1) for(1,1) good(1,1) in(1,1) is(1,1) it(1,1) manor(1,1) men(1,1) midnight(1,1) night(1,1) now(1,1) of(1,1) past(1,1) stormy(1,1) the(2,4) their(1,1) time(2,2) to(1,2) was(1,2)
Postings (doc #, freq), one list per dictionary entry, e.g.:
  the -> (1,2) (2,2)    country -> (1,1) (2,1)    to -> (1,2)    was -> (2,2)

Inverted indexes
• Permit fast search for individual terms
• For each term/token, you get a list consisting of:
  – document ID (DID)
  – frequency of term in doc (optional, implied in Lucene)
  – position of term in doc (optional, Lucene)
  – <term, DID, tf, position>
  – <term, (DID_i, tf, position_ij), …>
  – Lucene:
    • <position_ij, …> (term and DID are implied from other files)
• These lists can be used to solve Boolean queries:
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2

How Inverted Files are Used
Using the dictionary and postings files above, query on "time" AND "dark":
  2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file
  1 doc with "dark" in the dictionary -> ID 2 from the postings file
Therefore, only doc 2 satisfies the query.

Inverted index
• Associates a posting list with each term
  – POSTING LIST example
    • a (d1, 1)
    • …
    • the (d1,2) (d2,2)
• Replace term frequency (tf) with tf-idf
  – Lucene only stores tf; idf is added at query time
• Compress the index and add hash links
  – Hashing (a type of compression) speeds up access
• Match query to index and rank

Position in inverted file posting
  – POSTING LIST example
    • now (1, 1)
    • …
    • time (1,4) (2,13)
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Change weight
• Multiple term entries for a single document are merged, and within-document term frequency information is compiled (as in the <term, DID, tf> lists shown earlier).
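As a sketch of adding idf at query time (not from the slides; it uses the textbook idf = log(N/df) rather than Lucene's exact scoring formula, and the names and toy df values are illustrative):

  import java.util.Map;

  /** Query-time tf-idf weighting: the index stores raw tf per posting and the
      dictionary stores document frequency (df); idf is computed when scoring. */
  public class TfIdf {

      // idf = log(N / df), N = number of documents in the collection
      static double idf(int numDocs, int docFreq) {
          return Math.log((double) numDocs / docFreq);
      }

      static double tfIdf(int tf, int numDocs, int docFreq) {
          return tf * idf(numDocs, docFreq);
      }

      public static void main(String[] args) {
          int numDocs = 2;
          // df values from the toy dictionary: "the" appears in 2 docs, "dark" in 1
          Map<String, Integer> df = Map.of("the", 2, "dark", 1);
          // "the" in doc 1 has tf = 2 but idf = log(2/2) = 0, so it contributes nothing
          System.out.println(tfIdf(2, numDocs, df.get("the")));
          // "dark" in doc 2 has tf = 1 and a positive idf
          System.out.println(tfIdf(1, numDocs, df.get("dark")));
      }
  }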
Time Complexity of Indexing
• Complexity of creating a vector and indexing a document of n terms is O(n).
• So building an index of m such documents is O(m n).
• Computing vector lengths is also O(m n), which is also the complexity of reading in the corpus.

Retrieval with an Inverted Index
• Terms that are not in both the query and the document do not affect cosine similarity.
  – The product of the term weights is zero and does not contribute to the dot product.
• Usually the query is fairly short, and therefore its vector is extremely sparse.
• Use the inverted index to find the limited set of documents that contain at least one of the query words.
• Retrieval time O(log M) due to hashing, where M is the size of the document collection.

Inverted Query Retrieval Efficiency
• Assume that, on average, a query word appears in B documents:
    Q = q_1, q_2, …, q_N
    q_1 -> D_11 … D_1B
    q_2 -> D_21 … D_2B
    …
    q_N -> D_N1 … D_NB
• Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval that examines all M documents, O(|V| M), because |Q| << |V| and B << M.
  (|V| = size of the vocabulary)

Processing the Query
• Incrementally compute the cosine similarity of each indexed document as query words are processed one by one.
• To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, where DocumentReference is the key and the partial accumulated score is the value.

Index Files
On disk
  If an index is held on disk, search time is dominated by the number of disk accesses.
In memory
  Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, averaging 100 characters. The size of the index is 100 megabytes, which can easily be held in the memory of a dedicated computer.

Index File Structures: Linear Index
Advantages
  • Can be searched quickly, e.g., by binary search, O(log M)
  • Good for sequential processing, e.g., comp*
  • Convenient for batch updating
  • Economical use of storage
Disadvantages
  • Index must be rebuilt if an extra term is added

Documents File for Web Search System
For web search systems:
• A document is a web page.
• The documents file is the web.
• The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores special information.)

Time Complexity of Indexing
• Complexity of creating a vector and indexing a document of n tokens is O(n).
• So indexing m such documents is O(m n).
• Computing token IDFs for a vocabulary V is O(|V|).
• Computing vector lengths is also O(m n).
• Since |V| <= m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.

Index on disk vs. memory
• Most retrieval systems keep the dictionary in memory and the postings on disk
• Web search engines frequently keep both in memory
  – massive memory requirement
  – feasible for large web service installations
  – less so for commercial usage where query loads are lighter
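A minimal sketch (not from the slides) of the hashtable score accumulator described in "Processing the Query", processing one query term's posting list at a time; the class name, record, and toy postings are illustrative.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** Term-at-a-time scoring: walk each query term's postings and accumulate
      partial scores per document in a hash table (docId -> score). */
  public class TermAtATimeScoring {

      record Posting(int docId, double weight) {}

      static Map<Integer, Double> score(Map<String, List<Posting>> index, List<String> queryTerms) {
          Map<Integer, Double> accumulators = new HashMap<>();
          for (String term : queryTerms) {
              for (Posting p : index.getOrDefault(term, List.of())) {
                  // add this term's contribution to the document's running score
                  accumulators.merge(p.docId(), p.weight(), Double::sum);
              }
          }
          return accumulators;   // normalize by document length before final ranking
      }

      public static void main(String[] args) {
          Map<String, List<Posting>> index = Map.of(
              "time", List.of(new Posting(1, 1.0), new Posting(2, 1.0)),
              "dark", List.of(new Posting(2, 1.0)));
          System.out.println(score(index, List.of("time", "dark")));  // {1=1.0, 2=2.0}
      }
  }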
Indexing in the real world
• Typically, you don't have all documents sitting on a local filesystem
  – Documents need to be crawled
  – Could be dispersed over a WAN with varying connectivity
  – Must schedule distributed spiders/indexers
  – Could be (secure content) in
    • Databases
    • Content management applications
    • Email applications

Content residing in applications
• Mail systems/groupware and content management systems contain the most "valuable" documents
• http is often not the most efficient way of fetching these documents – native API fetching
  – Specialized, repository-specific connectors
  – These connectors also facilitate document viewing when a search result is selected for viewing

Secure documents
• Each document is accessible to a subset of users
  – Usually implemented through some form of Access Control Lists (ACLs)
• Search users are authenticated
• A query should retrieve a document only if the user can access it
  – So if there are docs matching your search but you're not privy to them, "Sorry, no results found"
  – E.g., as a lowly employee in the company, I get "No results" for the query "salary roster"
• Solr supports this with post filtering

Users in groups, docs from groups
• Index the ACLs and filter results by them
  [Matrix: documents × users, with a 0 if the user can't read the doc and a 1 otherwise.]
• Often, user membership in an ACL group is verified at query time – slowdown

Compound documents
• What if a doc consists of components
  – Each component has its own ACL.
• Your search should get a doc only if your query matches one of its components that you have access to.
• More generally: a doc assembled from computations on components
  – e.g., in Lotus databases or in content management systems
• How do you index such docs? No good answers …

"Rich" documents
• (How) Do we index images?
• Researchers have devised Query By Image Content (QBIC) systems
  – "show me a picture similar to this orange circle"
  – Then use vector space retrieval
• In practice, image search is usually based on metadata such as the file name, e.g., monalisa.jpg
• New approaches exploit social tagging
  – E.g., flickr.com

Passage/sentence retrieval
• Suppose we want to retrieve not an entire document matching a query, but only a passage/sentence – say, in a very long document
• Can index passages/sentences as mini-documents
  – what should the index units be?
• This is the subject of XML search

Indexing Subsystem
[Diagram: documents are assigned document IDs; the text is broken into tokens; an optional stop list removes stopwords; optional stemming produces stemmed terms; optional term weighting produces terms with weights, which are stored, along with document numbers and field numbers, in the index database. * indicates an optional operation. A sketch of this pipeline follows below.]

Search Subsystem
[Diagram, repeated from earlier: a query is parsed into query tokens; an optional stop list and optional stemming produce stemmed terms; optional Boolean operations against the index database yield a retrieved document set; optional ranking produces a ranked document set, from which the relevant document set is drawn. * indicates an optional operation.]
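A minimal sketch (not from the slides) of the indexing pipeline in the Indexing Subsystem diagram: tokenize, apply a stop list, stem, and compile term weights for one document. The tiny stop list and the suffix-stripping "stemmer" are illustrative stand-ins for real components such as a Porter stemmer.

  import java.util.*;

  /** Sketch of the indexing pipeline: tokenize, drop stopwords,
      apply a (very crude) stemmer, then compile term weights for one document. */
  public class IndexingPipeline {

      static final Set<String> STOP_LIST = Set.of("the", "is", "a", "of", "to", "and", "was", "in");

      // stand-in for a real stemmer such as Porter's algorithm
      static String stem(String token) {
          if (token.endsWith("ing")) return token.substring(0, token.length() - 3);
          if (token.endsWith("s") && token.length() > 3) return token.substring(0, token.length() - 1);
          return token;
      }

      static Map<String, Integer> indexDocument(String text) {
          Map<String, Integer> weights = new HashMap<>();
          for (String token : text.toLowerCase().split("[^a-z]+")) {
              if (token.isEmpty() || STOP_LIST.contains(token)) continue;  // stop list*
              weights.merge(stem(token), 1, Integer::sum);                 // term weighting (raw tf)
          }
          return weights;
      }

      public static void main(String[] args) {
          System.out.println(indexDocument("It was a dark and stormy night in the country manor."));
      }
  }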
Lucene's index (conceptual)
• An Index holds a collection of Documents
• A Document holds a set of Fields
• A Field has a Name and a Value

Create a Lucene index (step 1)
• Create a Lucene document and add fields

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public void createDoc(String title, String body) {
      Document doc = new Document();
      doc.add(Field.Text("Title", title));
      doc.add(Field.Text("Body", body));
  }

Create a Lucene index (step 2)
• Create an Analyzer
• Options
  – WhitespaceAnalyzer
    • divides text at whitespace
  – SimpleAnalyzer
    • divides text at non-letters
    • converts to lower case
  – StopAnalyzer
    • SimpleAnalyzer
    • removes stop words
  – StandardAnalyzer
    • good for most European languages
    • removes stop words
    • converts to lower case

Create a Lucene index (step 3)
• Create an index writer, add the Lucene document to the index

  import java.io.IOException;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public void writeDoc(Document doc, String idxPath) {
      try {
          IndexWriter writer = new IndexWriter(idxPath, new StandardAnalyzer(), true);
          writer.addDocument(doc);
          writer.close();
      } catch (IOException exp) {
          System.out.println("I/O Error!");
      }
  }

Lucene Index – Behind the Scenes
• Inverted index (inverted file)

  Doc 1: Penn State Football … football …
  Doc 2: Football players … State …

  Posting table:
  id  word      doc: offsets
  1   football  Doc 1: 3, 67;  Doc 2: 1
  2   penn      Doc 1: 1
  3   players   Doc 2: 2
  4   state     Doc 1: 2;  Doc 2: 13

Posting table
• The posting table is a fast look-up mechanism
  – Key: word
  – Value: posting id, satellite data (df, offsets, …)
• Lucene implements the posting table with Java's hash table
  – Objectified from java.util.Hashtable
  – Hash function depends on the JVM
    • hc2 = hc1 * 31 + nextChar
• Posting table usage
  – Indexing: insertion (new terms), update (existing terms)
  – Searching: lookup, and construct the document vector

Search Lucene's index (step 1)
• Construct a query (automatic)

  import org.apache.lucene.search.Query;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  public Query formQuery(String ques) throws Exception {
      return QueryParser.parse(ques, "body", new StandardAnalyzer());
  }

• Types of query:
  – Boolean: [IST441 Giles] [IST441 OR Giles] [java AND NOT SUN]
  – wildcard: [nu?ch] [nutc*]
  – phrase: ["JAVA TOMCAT"]
  – proximity: ["lucene nutch"~10]
  – fuzzy: [roam~] matches roams and foam
  – date range
  – …

Search Lucene's index (step 2)
• Search the index

  import java.io.IOException;
  import org.apache.lucene.search.*;
  import org.apache.lucene.store.*;

  public Hits searchIdx(Query query, String idxPath) throws IOException {
      Directory fsDir = FSDirectory.getDirectory(idxPath, false);
      IndexSearcher is = new IndexSearcher(fsDir);
      return is.search(query);
  }

Search Lucene's index (step 3)
• Display the results

  for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      // show your results
  }

Example: WestLaw  http://www.westlaw.com/
• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
• About 7 terabytes of data; 700,000 users
• Majority of users still use Boolean queries
• Example query:
  – What is the statute of limitations in cases involving the federal tort claims act?
  – LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
• Long, precise queries; proximity operators; incrementally developed; not like web search
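Tying the steps above together, here is a minimal end-to-end sketch written against the same legacy Lucene API used on these slides (Field.Text, Hits, and the static QueryParser.parse from the Lucene 1.x/2.x era); current Lucene versions have replaced these calls, and the index path and field values are illustrative.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  public class LuceneDemo {
      public static void main(String[] args) throws Exception {
          String idxPath = "demo-index";

          // Indexing steps 1-3: build a document and add it to the index
          Document doc = new Document();
          doc.add(Field.Text("Title", "Penn State Football"));
          doc.add(Field.Text("Body", "Football players at Penn State"));
          IndexWriter writer = new IndexWriter(idxPath, new StandardAnalyzer(), true);
          writer.addDocument(doc);
          writer.close();

          // Search steps 1-3: parse a query, run it, display matching titles
          Query query = QueryParser.parse("football", "Body", new StandardAnalyzer());
          IndexSearcher searcher = new IndexSearcher(idxPath);
          Hits hits = searcher.search(query);
          for (int i = 0; i < hits.length(); i++) {
              System.out.println(hits.doc(i).get("Title"));
          }
          searcher.close();
      }
  }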
Lucene workflow
[Diagram: an application builds a Document with fields such as super_name: Spider-Man, name: Peter Parker, category: superhero, powers: agility, spider-sense, and passes it to an IndexWriter via addDocument(); a Query such as (powers:agility) is passed via search() to an IndexSearcher, which returns Hits (the matching docs) from the Lucene index.]
1. Get the Lucene jar file
2. Write indexing code to get data and create Document objects
3. Write code to create query objects
4. Write code to use/display results

What we covered
• Indexing modes
• Inverted files
  – Storage and access advantages
  – Posting files
• Examples from Lucene