Authors: Abhishek Das (Google Inc., USA), Ankit Jain (Google Inc., USA)
Presented By: Anamika Mukherji
3/26/2013
Indexing The World Wide Web

Is Indexing Difficult? - Yes!
- Words are not known beforehand
- Content is available in different languages
- Variations in grammar and style
- No structure: riddled with colors, fonts, images, etc.
- Various byte-encoding schemes

Answering The User’s Query
- Retrieval for a typical query
  - Find the terms in the dictionary
  - Start with the least frequent term, since its posting list will be the shortest
  - Fetch the corresponding posting lists
  - Intersect the lists on document identifiers to get the relevant documents
  - Rank and re-order the documents to present them to the user
- To get quality results as fast as possible, an understanding of how each resource is used is required
  - Disk space
  - Disk transfer
  - Memory
  - CPU time
- Choice of data structure impacts CPU and storage
  - A fixed-length array is wasteful if posting lists are kept in memory
  - A singly linked list allows cheap insertions and updates
  - A variable-length array requires less CPU time
  - A linked list of fixed-length arrays can be used for each term
  - Avoid pointers when storing the posting list in memory

Better Understanding of User Intent
- Check the proximity of different terms
- Positional index: expands storage and slows down query processing
- Phrase-based indexing: expensive, with no accurate mechanism for identifying which phrases might be used
  - Use a good phrase

Document vs. Term Based Partitioning

Memory vs. Disk Storage

Compressing The Index
- Advantages of a compressed index
  - Faster transfer of data from disk to memory
  - Reduces disk seek time
- Compression schemes
  - Variable byte encoding
  - Bit-level encoding
- Using gaps
  Original posting lists:
    the: ⟨1, 9⟩ ⟨2, 8⟩ ⟨3, 8⟩ ⟨4, 5⟩ ⟨5, 6⟩ ⟨6, 9⟩
    to: ⟨1, 5⟩ ⟨3, 1⟩ ⟨4, 2⟩ ⟨5, 2⟩ ⟨6, 6⟩
    john: ⟨2, 4⟩ ⟨4, 1⟩ ⟨6, 4⟩
  With gaps:
    the: ⟨1, 9⟩ ⟨1, 8⟩ ⟨1, 8⟩ ⟨1, 5⟩ ⟨1, 6⟩ ⟨1, 9⟩
    to: ⟨1, 5⟩ ⟨2, 1⟩ ⟨1, 2⟩ ⟨1, 2⟩ ⟨1, 6⟩
    john: ⟨2, 4⟩ ⟨2, 1⟩ ⟨2, 4⟩

Variable Byte Encoding
- Uses an integral but adaptive number of bytes, depending on the gap size
- The first bit of each byte is a continuation bit
- The remaining 7 bits of each byte encode part of the gap
- To decode a variable-byte code:
  - Read a sequence of bytes until the continuation bit flips
  - Extract and concatenate the 7-bit parts to get the magnitude of the gap

Bit Level Encoding
- Used when disk space is at a premium
- These codes adapt the length of the code at a finer-grained bit level
- The codeword is divided into 2 parts: prefix and suffix
  - The prefix indicates the binary magnitude of the value and tells the decoder how many bits there are in the suffix
  - The suffix indicates the value of the number within the corresponding binary range
- Query processing is more time-consuming

Ordering by Highest Impact First
Example posting list (⟨doc id, term frequency⟩):
  ⟨12, 2⟩ ⟨17, 2⟩ ⟨29, 1⟩ ⟨32, 1⟩ ⟨40, 6⟩ ⟨78, 1⟩ ⟨101, 3⟩ ⟨106, 1⟩
When the list is reordered by term frequency, it becomes:
  ⟨40, 6⟩ ⟨101, 3⟩ ⟨12, 2⟩ ⟨17, 2⟩ ⟨29, 1⟩ ⟨32, 1⟩ ⟨78, 1⟩ ⟨106, 1⟩
The repeated frequency information can then be factored out into a prefix component with a counter that indicates how many documents share the same frequency value:
  ⟨6 : 1 : 40⟩ ⟨3 : 1 : 101⟩ ⟨2 : 2 : 12, 17⟩ ⟨1 : 4 : 29, 32, 78, 106⟩
Not storing the repeated frequencies gives a considerable saving.
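
The reordering and factoring steps described so far can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `impact_order` and the tuple layout `(frequency, count, doc_ids)` are assumptions made for this example, and ties in frequency are assumed to keep ascending document-id order.

```python
from itertools import groupby


def impact_order(postings):
    """Reorder (doc_id, tf) postings by descending term frequency and
    factor each run of equal frequencies into (tf, count, [doc_ids]).

    Hypothetical helper for illustration only.
    """
    # Highest frequency first; within one frequency keep doc ids ascending.
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    grouped = []
    # groupby works here because equal frequencies are now adjacent.
    for tf, run in groupby(ordered, key=lambda p: p[1]):
        doc_ids = [doc_id for doc_id, _ in run]
        grouped.append((tf, len(doc_ids), doc_ids))
    return grouped


# The posting list from the slide: (doc id, term frequency).
postings = [(12, 2), (17, 2), (29, 1), (32, 1),
            (40, 6), (78, 1), (101, 3), (106, 1)]

for tf, count, doc_ids in impact_order(postings):
    print(f"<{tf} : {count} : {', '.join(map(str, doc_ids))}>")
```

Each printed triple corresponds to one ⟨frequency : count : doc ids⟩ segment of the factored list.
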
Finally, if differences of document identifiers are taken, we get the following:
  ⟨6 : 1 : 40⟩ ⟨3 : 1 : 101⟩ ⟨2 : 2 : 12, 5⟩ ⟨1 : 4 : 29, 3, 46, 28⟩
The document gaps within each equal-frequency segment of the list are now, on average, larger than when the whole list was sorted by document identifier, and therefore require more bits/bytes to encode.

Managing Multiple Indices
- Multiple indices, bucketed by rate of refreshing
  - The large, rarely refreshing pages index
  - The small, ever-refreshing pages index
  - The dynamic real-time/news pages index
- Waterfall approach
  - Pages discovered in one tier can be passed over to the next tier over time
  - Invalidate older index and crawl file entries

SCALING THE SYSTEM
- Web search engines use distributed indexing algorithms for index construction
- Distributed File System
  - To manage large amounts of data across large commodity clusters, a distributed file system is essential. It must provide efficient remote file access, file transfers, and the ability to carry out concurrent independent operations, while being extremely fault tolerant.
- Map-Shuffle-Reduce
  - Map: The master node chops up the problem into small chunks and assigns each chunk to a worker. The worker either processes its chunk with the mapper and returns the result to the master, or chops up the input data further and assigns it hierarchically.
  - Shuffle: Group the key-value pairs emitted by the mappers.
  - Reduce: Take the sub-answers and combine them to create the final output.

FUTURE RESEARCH DIRECTIONS
- Real Time Data and Search: what can we do with each tweet?
  - Create a social graph
  - Extract and index links
  - Real-time related topics
  - Sentiment analysis
- Social and Personalized Web Search: Facebook, Twitter, etc.
- Facebook users post a wealth of information
  - Static: book and movie interests
  - Dynamic: user locations, status updates, wall posts
- Learning a user’s personal information can personalize search results
- Facebook is impacting the world of search
  - Opened data to third party service
  - Search for 2 degrees of user

Pros and Cons
- What I liked about it
  - Delves into the history of search engines
  - Talks about future enhancements
  - Explains how a search engine works
- What I didn’t like
  - Skims the surface without going deep
  - Includes very few examples, which makes understanding difficult
  - The Compressing the Index section lacks structure, which makes it difficult to understand
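
As a supplementary example for the Using gaps and Variable Byte Encoding slides, the sketch below turns document identifiers into gaps and encodes each gap in a variable number of bytes. It is a minimal sketch under stated assumptions, not the scheme of any particular search engine: the slides do not say which value of the continuation bit marks the end of a number, so this example assumes the high bit is 1 on the last byte of a gap and 0 on the bytes before it. All function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace ascending doc ids with differences (the first id is kept)."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vb_encode_number(n):
    """Encode one non-negative integer, 7 payload bits per byte."""
    chunks = [n & 0x7F]
    n >>= 7
    while n > 0:
        chunks.insert(0, n & 0x7F)
        n >>= 7
    chunks[-1] |= 0x80  # assumed convention: high bit set on the final byte
    return bytes(chunks)


def vb_encode(numbers):
    """Concatenate the variable-byte codes of all numbers."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    """Read bytes until the continuation bit flips, then emit the number."""
    numbers, n = [], 0
    for byte in data:
        if byte < 0x80:              # high bit 0: more bytes follow
            n = (n << 7) | byte
        else:                        # high bit 1: last byte of this number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
    return numbers


# Doc ids from the impact-ordering slide, plus one large hypothetical id
# (215406) so that at least one gap needs more than one byte.
doc_ids = [12, 17, 29, 32, 40, 78, 101, 106, 215406]
encoded = vb_encode(to_gaps(doc_ids))
restored = from_gaps(vb_decode(encoded))
assert restored == doc_ids
print(len(encoded), "bytes for", len(doc_ids), "postings")
```

Small gaps fit in a single byte, which is why gap encoding and variable byte encoding are usually applied together.
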