Memory-Based Indexing

Requires: No temporary files.

Memory needed: A hashtable (in memory), approx. 3 entries per unique term in the collection. This hashtable has a linked list of (doc_id, tf) entries for each term. Essentially, dict and post are both held in memory prior to output.

Creates: Hash dictionary file.

One-Pass Algorithm
Goal: Create a global hash table of all unique words. For each entry in the hash table, add a linked list storing the postings for that word. Once complete, there is an inverted index in the memory data structure: just write the hash table to disk as the dict file, and the postings as the post file.

Algorithm:
    Init the global hash table
    For each input document
        // Process all tokens in the document
        Init the document hash table and total_freq (the number of terms in the document)
        For all terms in the document
            If the term is not in the document hash table
                Add it with tf = 1
            Else
                Increment the tf for that term
            Increment total_freq
        For all entries in the document hash table
            Add a posting (doc_id, tf) OR (doc_id, rtf = tf / total_freq)
            to the global hash table entry for the term
    // Write the global hash table to disk as the dict and post files
    Init start to 0
    For each entry in the global hash table
        Calculate idf for the term
        Write (term, numdocs, start) to the dict file
        For each posting
            Write (doc_id, tf) OR (doc_id, rtf * idf) to the post file

Output:

Dict file (hash table of fixed-length entries):
    term     numdocs  start
    quickly   1        0
    cat       1        1
    blank    -1       -1
    dog       2        2
    blank    -1       -1
    doctor    1        4
    blank    -1       -1
    duck      1        5
    ate       4        6
    blank    -1       -1
    dry       1       10

Post file (sequential file of fixed-length entries):
    doc_id  tf OR rtf OR wt
    0       1
    3       1
    0       1
    1       1
    2       2
    2       1
    0       1
    1       2
    2       1
    3       1
    2       1

Pros: Hashtable for dictionary, therefore fast query processing.
Cons: Need to be able to store essentially the entire document collection in main (not virtual) memory.
This inverts 5 GB in 6 hours using 4 GB of memory and 0 GB for temporary files.

Multiway Merging (multiple-file variation)

Requires: 1 set of temporary files, one per input document.
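The one-pass memory-based algorithm above can be sketched in Python. This is a minimal sketch, assuming pre-tokenized documents and idf = log(N / numdocs), one common choice; the function name `memory_based_index` and the in-memory return values are illustrative stand-ins for writing the actual fixed-length dict and post files.

```python
import math
from collections import Counter, defaultdict

def memory_based_index(docs):
    """docs: list of token lists; doc_id is the list index.
    Returns (dict_entries, postings) mirroring the dict and post files."""
    index = defaultdict(list)        # global hash table: term -> postings list
    for doc_id, tokens in enumerate(docs):
        doc_table = Counter(tokens)  # document hash table; total_freq = len(tokens)
        for term, tf in doc_table.items():
            index[term].append((doc_id, tf))   # or (doc_id, tf / len(tokens)) for rtf
    # Write phase: dict holds (term, numdocs, start); post holds (doc_id, tf).
    dict_entries, postings, start = [], [], 0
    num_docs = len(docs)
    for term, plist in index.items():
        idf = math.log(num_docs / len(plist))  # idf = log(N / numdocs), one common choice
        dict_entries.append((term, len(plist), start))
        for doc_id, tf in plist:
            postings.append((doc_id, tf))      # or tf * idf for a weighted post file
            start += 1
    return dict_entries, postings
```

Note that the postings for one term are contiguous in the post file, so the dict entry only needs `start` and `numdocs` to locate them.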
Memory needed: A hashtable (in memory), 2-3 entries per unique term in the collection, plus one merge buffer (one entry per document in the collection). Essentially, dict is in memory prior to output; postings are always on disk.

Creates: Hash dictionary file.

Pass I
Goal: Create multiple output files, one per input file. Each record contains term (or term_id), doc_id (optional), and tf (or relative tf). Simultaneously, build a global hash table of all unique words.

Algorithm:
    Init the global hash table
    For each input document
        Init the document hash table and total_freq (the number of terms in the document)
        For all terms in the document
            If the term is not in the document hash table
                Add it with tf = 1
            Else
                Increment the tf for that term
            Increment total_freq
        For all terms in the document hash table
            If the term is not in the global hash table
                Add it with numdocs = 1
                Give it the next term_id
            Else
                Increment numdocs
        Sort all document hash table entries by term
        Open a new temporary file
        Write (term, doc_id, tf OR rtf = tf / total_freq) to the temporary file
        Close the temporary file

After Pass I:

    0.out
    ate      0  1
    dog      0  1
    quickly  0  1

    1.out
    ate  1  2
    dog  1  1

    2.out
    ate     2  1
    doctor  2  2
    dry     2  1
    duck    2  1

    3.out
    ate  3  1
    cat  3  1

Global hash table:
    term     term_id  numdocs  start
    quickly   3        1       -1
    cat       7        1       -1
    blank    -1       -1       -1
    dog       1        2       -1
    blank    -1       -1       -1
    doctor    4        1       -1
    blank    -1       -1       -1
    duck      6        1       -1
    ate       2        4       -1
    blank    -1       -1       -1
    dry       5        1       -1

Pass II
Goal: Merge the output files, calculate idf, and write final dict and post files identical to the Memory-Based Indexing output (except that postings in the post file are in alphabetical order by term).

Algorithm:
    Create a buffer array with one entry per document
    (each entry stores term and tf; doc_id is implicit in the buffer index)
    For all documents i
        Open file i
        Load the first record into buffer[i]
    Until all postings are written
        Find the alphabetically first term (token) in the buffer
        Look up the term in the global hash table
        Update the "start" field for the token
        Calculate idf for the term from numdocs
        Loop over the buffer
            If buffer[i] == token
                Write a postings record for the token (i.e., doc_id, tf OR rtf * idf)
                Reload buffer[i] with the next record from file i
                numrecords++
    Write the global hash table to disk as the dict file

Improvements:
    Eliminate doc_id from the temporary files: if all documents can be merged at once, doc_id is implicitly indicated by the filename.
    Compress the temporary file size: store term_id in the hash table and use term_id instead of term in the temporary files.
    If all documents cannot be open at once (there is usually a limit of several thousand open files), do multiple merge passes. Now we do need to store doc_id.
        e.g., merge 1.out and 2.out -> 1-2.out
              merge 3.out and 4.out -> 3-4.out
              merge 1-2.out and 3-4.out -> posting file

Pros: Hashtable for dictionary, therefore fast query processing.
Cons: Needs 1 set of temporary files, one array entry in the merge buffer per document, and space for the global hash table (one entry per unique word x 3).
This inverts 5 GB in 11 hours using 40 MB of memory and 0.5 GB of temporary files.

Sort-Based Inversion

Requires: 2 temporary files: 1 unsorted (from Pass I) and 1 sorted (from Pass II). Also requires the ability to hold all postings for a given term in memory, and the Unix sort must be able to store and sort the single file of all postings created by Pass I.

Creates: Sorted dictionary file.

Pass I
Goal: Create a large temporary file, one entry per term per document, in random order within each document. Requires a memory data structure (hashtable) per document to count the frequency of each term in that document.
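Before turning to sort-based inversion's algorithm, the two passes of multiway merging described above can be sketched in Python. This is a minimal sketch under stated assumptions: temp files hold text lines "term doc_id tf", the global table tracks only numdocs, idf is computed but postings store raw tf, and the function names are illustrative.

```python
import os
from collections import Counter

def multiway_pass1(docs, tmpdir):
    """Pass I: write one term-sorted temp file per document, and build
    the global table mapping term -> numdocs."""
    global_table = {}
    paths = []
    for doc_id, tokens in enumerate(docs):
        tf = Counter(tokens)                       # document hash table
        for term in tf:
            global_table[term] = global_table.get(term, 0) + 1  # numdocs
        path = os.path.join(tmpdir, f"{doc_id}.out")
        with open(path, "w") as f:
            for term in sorted(tf):                # entries sorted by term
                f.write(f"{term} {doc_id} {tf[term]}\n")
        paths.append(path)
    return paths, global_table

def multiway_pass2(paths, global_table):
    """Pass II: merge the per-document files. buffer[i] holds the next
    (term, tf) record from file i; doc_id is implicit in the index.
    Returns (dict_entries, postings) with postings in term order."""
    files = [open(p) for p in paths]
    def load(i):
        line = files[i].readline()
        if not line:
            return None                            # file i is exhausted
        term, _doc_id, tf = line.split()
        return term, int(tf)
    buffer = [load(i) for i in range(len(files))]
    dict_entries, postings, start = [], [], 0
    while any(buffer):
        token = min(e[0] for e in buffer if e)     # alphabetically first term
        dict_entries.append((token, global_table[token], start))  # fill in "start"
        for i, entry in enumerate(buffer):
            if entry and entry[0] == token:
                postings.append((i, entry[1]))     # doc_id = buffer index i
                buffer[i] = load(i)                # reload from file i
                start += 1
    for f in files:
        f.close()
    return dict_entries, postings
```

With the four example documents above, `multiway_pass2` emits the `ate` postings first, then `cat`, and so on, matching the alphabetical post-file order described for Pass II.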
Algorithm:
    Loop over all documents
        Init document_hashtable
        Loop over all tokens in the document
            Create a valid term
            If keep(term)
                Increment the count of terms in the document
                Insert the term into document_hashtable
                    If new, set tf to 1
                    Else, increment tf
        Loop over all terms in the hashtable
            If normalizing, rtf = tf / num_terms_in_document
            Write (term, doc_id, tf OR rtf) to the temporary file

Output: Temporary File 1, with terms grouped by document (random order within each document):

    Term     Doc_id  tf
    quickly  0       1
    dog      0       1
    ate      0       1
    dog      1       1
    ate      1       2
    dry      2       2
    doctor   2       1
    ate      2       1
    duck     2       1
    ate      3       1
    cat      3       1

Pass II
Goal: Group all postings for the same term together. Sort the temporary file using the Unix sort tool (which is implemented using mergesort) with term_id (or term) as the primary key. Note that if the sort is stable, doc_id is a secondary key "for free."

Algorithm:
    $ sort temp1 > temp2
    $ rm temp1

Output: Temporary File 2, with postings in sorted order by term:

    Term     Doc_id  tf
    ate      0       1
    ate      1       2
    ate      2       1
    ate      3       1
    cat      3       1
    doctor   2       1
    dog      0       1
    dog      1       1
    dry      2       2
    duck     2       1
    quickly  0       1

Pass III
Goal: Process Temporary File 2 one line at a time to create the dict and post files.

Algorithm:
    numdocs = 0; start = 0
    Read a line of the temporary file from Pass II
    If term == prev_term  // another posting for the same term
        numdocs++
        Add (doc_id, tf) to the linked list of postings for that term
    Else  // the list holds all the postings for the previous term
        Calculate idf
        Write the dict entry for the term (word, numdocs, start)
        Loop over the list
            Write a post entry (doc_id, tf OR rtf * idf)
            start++
        // handle the first posting for the new term
        Free the postings list; numdocs = 1
        Add (doc_id, tf) to the linked list of postings for that term

Pros: Low memory use.
Cons: Sorted dictionary file – slow for query processing. 2 sets of temporary files – slow for indexing.
This inverts 5 GB in 20 hours using 40 MB of memory and 8 GB of temporary files.
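The Pass III loop above can be sketched in Python, operating on the sorted lines from Pass II. This is a minimal sketch: `sort_based_pass3` and its in-memory return values are illustrative stand-ins for writing the dict and post files, and idf = log(N / numdocs) is one common choice.

```python
import math

def sort_based_pass3(sorted_lines, num_docs):
    """Process the sorted 'term doc_id tf' lines from Pass II one at a
    time, producing (dict_entries, postings) for the dict and post files."""
    dict_entries, postings = [], []
    start = 0
    prev_term, plist = None, []

    def flush():
        # All postings for prev_term are collected: write its dict entry,
        # then its postings, advancing start as we go.
        nonlocal start
        if prev_term is None:
            return
        idf = math.log(num_docs / len(plist))  # idf from numdocs (one common choice)
        dict_entries.append((prev_term, len(plist), start))
        for doc_id, tf in plist:
            postings.append((doc_id, tf))      # or tf * idf for weighted postings
            start += 1

    for line in sorted_lines:
        term, doc_id, tf = line.split()
        if term == prev_term:                  # another posting for the same term
            plist.append((int(doc_id), int(tf)))
        else:                                  # new term: flush the previous one
            flush()
            prev_term, plist = term, [(int(doc_id), int(tf))]
    flush()                                    # emit the final term
    return dict_entries, postings
```

Because the input is grouped by term, a single sequential pass suffices and only one term's postings list is in memory at a time, which is why this method has the lowest memory requirement of the three.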