Practical considerations for a web-scale search engine
Michael Isard
Microsoft Research Silicon Valley

Search and research
• Lots of research motivated by web search
  – Explore specific research questions
  – Small to moderate scale
• A few large-scale production engines
  – Many additional challenges
  – Not all purely algorithmic/technical
• What are the extra constraints for a production system?

Production search engines
• Scale up
  – Tens of billions of web pages, images, etc.
  – Tens of thousands to millions of computers
• Geographic distribution
  – For performance and reliability
• Continuous crawling and serving
  – No downtime, need fresh results
• Long-term test/maintenance
  – Simplicity a core goal

Disclaimer
• Not going to describe any particular web-scale search engine
  – No detailed public description of any engine
• But, general principles apply

Outline
• Anatomy of a search engine
• Query serving
• Link-based ranking
• Index generation

Structure of a search engine
[Diagram: the Web feeds document crawling, link structure analysis, page feature training, index building, ranker training, query serving, user behavior analysis, and auxiliary answers]

Some index statistics
• Tens of billions of documents
  – Each document contains thousands of terms
  – Plus metadata
  – Plus snippet information
• Billions of unique terms
  – Serial numbers, etc.
• Hundreds of billions of nodes in web graph
• Latency a few ms on average
  – Well under a second worst-case

Query serving pipeline
[Diagram: the Web → replicated front-end web servers, caches, etc. → replicated index servers]

Page relevance
• Query-dependent component
  – Query/document match, user metadata, etc.
• Query-independent component
  – Document rank, spam score, click rate, etc.
• Ranker needs:
  – Term frequencies and positions
  – Document metadata
  – Near-duplicate information
  – …

Single-box query outline
[Diagram: the query "Hello world" + {EN-US,…} is looked up against per-term posting lists; the matching postings, e.g. (45.48, 45.29), (1125.3, 1125.4), …, go to the ranker along with document metadata, and results come back ordered by score, e.g. 1125.3, 45.48, …]

  term    posting list (doc.position, …)
  a       1.2, 1.10, 1.16, …, 1040.23, …
  hello   3.76, …, 45.48, …, 1125.3, …
  world   7.12, …, 45.29, …, 1125.4, …

  doc     metadata
  1       foo.com/bar, EN-US, …
  45      go.com/hw.txt, EN-US, …
  1125    bar.com/a.html, EN-US, …

  doc     snippet data
  1       "once a week …"
  …       …

Query statistics
• Small number of terms (fewer than 10)
• Posting lists length 1 to 100s of millions
  – Most terms occur once
• Potentially millions of documents to rank
  – Response is needed in a few ms
  – Tens of thousands of near duplicates
  – Sorting documents by QI rank may help
• Tens or hundreds of snippets

Distributed index structure
• Tens of billions of documents
• Thousands of queries per second
• Index is constantly updated
  – Most pages turn over in at most a few weeks
  – Some very quickly (news sites)
  – Almost every page is never returned
How to distribute?

Distributed index: split by term
• Each computer stores a subset of terms
• Each query goes only to a few computers
• Document metadata stored separately
[Diagram: the query "Hello world" + {EN-US,…} goes to a ranker, which contacts the index servers for term ranges A-G, H-M, N-S, and T-Z, plus separate metadata servers]

Split by term: pros
• Short queries only touch a few computers
  – With high probability all are working
• Long posting lists improve compression
  – Most words occur many times in corpus
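To make the term-partitioned layout concrete before weighing its drawbacks, here is a minimal, purely in-memory sketch. The shard ranges, posting lists, and document ids echo the single-box example above; the scoring is a toy stand-in, and real shards and metadata servers are remote machines rather than local dictionaries, so none of this is any engine's actual implementation.

# Toy term-partitioned index: each "shard" owns an alphabetical range of terms
# and holds posting lists of (doc_id, position) pairs. Document metadata lives
# on separate servers keyed by doc id. All contents are illustrative.
term_shards = {
    "A-G": {"a": [(1, 2), (1, 10), (1, 16), (1040, 23)]},
    "H-M": {"hello": [(3, 76), (45, 48), (1125, 3)]},
    "N-S": {},
    "T-Z": {"world": [(7, 12), (45, 29), (1125, 4)]},
}
metadata_servers = {
    1: ("foo.com/bar", "EN-US"),
    45: ("go.com/hw.txt", "EN-US"),
    1125: ("bar.com/a.html", "EN-US"),
}

def shard_for(term):
    """Route a term to the shard that owns its alphabetical range."""
    for label, shard in term_shards.items():
        low, high = label.split("-")
        if low.lower() <= term[0].lower() <= high.lower():
            return shard
    return {}

def serve_query(terms, user_lang="EN-US"):
    # 1. Ship each term's full posting list back from its shard
    #    (the expensive step for long lists and multi-term queries).
    lists = {t: shard_for(t).get(t, []) for t in terms}
    # 2. Intersect centrally to find documents containing every term.
    docs = set.intersection(*({d for d, _ in pl} for pl in lists.values()))
    # 3. Extra round trip to the metadata servers for each candidate, then
    #    rank at the front end with a toy score (term count + language match).
    ranked = []
    for doc in docs:
        tf = sum(1 for pl in lists.values() for d, _ in pl if d == doc)
        score = tf + (0.1 if metadata_servers[doc][1] == user_lang else 0.0)
        ranked.append((score, doc, metadata_servers[doc][0]))
    return sorted(ranked, reverse=True)

print(serve_query(["hello", "world"]))   # docs 45 and 1125 contain both terms

Even in this toy form the structural costs are visible: whole posting lists travel from the shards to the front end, document metadata needs its own lookup, and the ranking work lands on the front end. Those are exactly the drawbacks itemized next.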
Split by term: cons (1)
• Must ship posting lists across network
  – Multi-term queries make things worse
  – But maybe pre-computing can help?
    • Intersections of lists for common pairs of terms
    • Needs to work with constantly updating index
• Extra network roundtrip for doc metadata
  – Too expensive to store in every posting list
• Where does the ranker run?
  – Hundreds of thousands of ranks to compute

Split by term: cons (2)
• Front-ends must map terms to computers
  – Simple hashing may be too unbalanced
  – Some terms may need to be split/replicated
    • Long posting lists
    • "Hot" posting lists
• Sorting by QI rank is a global operation
  – Needs to work with index updates

Distributed index: split by document
• Each computer stores a subset of docs
• Each query goes to many computers
• Document metadata stored inline
[Diagram: the query "Hello world" + {EN-US,…} goes to an aggregator, which fans out to rankers co-located with document partitions (docs 1-1000, 1001-2000, 2001-3000, 3001-4000)]

Split by document: pros
• Ranker on same computer as document
  – All data for a given doc in the same place
  – Ranker computation is distributed
    • Can get low latency
• Sorting by QI rank local to each computer
• Only ranks+scores need to be aggregated
  – Hundreds of results, not millions

Split by document: cons
• A query touches hundreds of computers
  – One slow computer makes query slow
  – Computers per query is linear in corpus size
  – But query speeds are not i.i.d.
• Shorter posting lists: worse compression
  – Each word split into many posting lists

Index replication
• Multiple copies of each partition
  – Needed for redundancy, performance
• Makes things more complicated
  – Can mitigate latency variability
    • Ask two replicas, one will probably return quickly
  – Interacts with data layout
    • Split by document may be simpler
• Consistency may not be essential

Splitting: word vs document
• Original Google paper split by word
• All major engines split by document now?
  – Tens of microseconds to rank a document

Link-based ranking
• Intuition: "quality" of a page is reflected somehow in the link structure of the web
• Made famous by PageRank
  – Can be seen as stationary distribution of a random walk on the web graph
  – Google's original advantage over AltaVista?

Some hints
• PageRank is (no longer) very important
• Anchor text contains similar information
  – BM25F includes a lot of link structure
• Query-dependent link features may be useful
[Chart: NDCG@10 of individual ranking features, from "Comparing the Effectiveness of HITS and SALSA", M. Najork, CIKM 2007. bm25f scores highest (≈0.22), SALSA authority features next (≈0.12-0.16), in-degree, PageRank, and HITS authority features around 0.09-0.11, and hub, out-degree, and random features lowest (≈0.01-0.04)]

Query-dependent link features
[Diagram: example web-graph neighborhood of nodes A-N around a result page]

Real-time QD link information
• Lookup of neighborhood graph
• Followed by SALSA
• In a few ms
Seems like a good topic for approximation/learning
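Since these slides lean on SALSA for query-dependent link features, a minimal sketch of the SALSA authority computation on a toy in-memory neighborhood graph may help. The graph, node names, and iteration count below are invented for illustration, and this is the textbook algorithm (power iteration of SALSA's backward/forward random walk: from an authority, step back along a random in-link to a hub, then forward along a random out-link), not any engine's implementation.

from collections import defaultdict

def salsa_authority(edges, iterations=50):
    """SALSA authority scores for a small query-neighborhood graph.

    edges: iterable of (hub, authority) pairs, i.e. "hub links to authority".
    Power iteration of the authority-side random walk, starting from the
    uniform distribution over nodes with at least one in-link.
    """
    out_links = defaultdict(list)   # hub -> authorities it points to
    in_links = defaultdict(list)    # authority -> hubs pointing at it
    for hub, auth in edges:
        out_links[hub].append(auth)
        in_links[auth].append(hub)

    scores = {a: 1.0 / len(in_links) for a in in_links}
    for _ in range(iterations):
        nxt = dict.fromkeys(scores, 0.0)
        for hub, outs in out_links.items():
            # Probability mass arriving at this hub from the authorities it
            # links to, spread evenly back over the hub's out-links.
            mass = sum(scores[a] / len(in_links[a]) for a in outs)
            for a in outs:
                nxt[a] += mass / len(outs)
        scores = nxt
    return scores

# Toy neighborhood graph of (hub, authority) edges; a real engine builds this
# graph per query from the link index.
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "E"), ("C", "E")]
print(sorted(salsa_authority(edges).items(), key=lambda kv: -kv[1]))

At web scale both the neighborhood-graph lookup and the iteration above have to finish within a few milliseconds per query, which is why the slide flags this as a natural target for approximation or learning.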
Index building
• Catch-all term
  – Create inverted files
  – Compute document features
  – Compute global link-based statistics
  – Which documents to crawl next?
  – Which crawled documents to put in the index?
• Consistency may be needed here

Index lifecycle
[Diagram: a cycle linking the Web, page crawling, index selection, query serving, and usage analysis]

Experimentation
• A/B testing is best
  – Ranking, UI, etc.
  – Immediate feedback on what works
  – Can be very fine-grained (millions of queries)
• Some things are very hard
  – Index selection, etc.
  – Can run parallel build processes
    • Long time constants: not easy to do brute force

Implementing new features
• Document-specific features much "cheaper"
  – Spam probability, duplicate fingerprints, language
• Global features can be done, but with a higher bar
  – Distribute anchor text
  – PageRank et al.
• Danger of "butterfly effect" on system as a whole

Distributing anchor text
[Diagram: crawlers emit anchor text, which is routed across the network to the indexers that own the target documents (replicated partitions such as docs f0-ff)]

Distributed infrastructure
• Things are improving
  – Large scale partitioned file systems
    • Files commonly contain many TB of data
    • Accessed in parallel
  – Large scale data-mining platforms
  – General-purpose data repositories
• Data-centric
  – Traditional supercomputing is cycle-centric

Software engineering
• Simple always wins
• Hysteresis
  – Prove a change will improve things
    • Big improvement needed to justify big change
  – Experimental platforms are essential

Summary
• Search engines are big and complicated
• Some things are easier to change than others
• Harder changes need more convincing experiments
• Small datasets are not good predictors for large datasets
• Systems/learning may need to collaborate
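As a closing concrete note on the "Distributing anchor text" diagram above: the core operation is a shuffle that groups anchor text by the shard owning the target document, so the text can be indexed alongside that document. The sketch below uses a hypothetical shard count and hex-prefix bucketing chosen only to echo the "docs f0-ff" partitions in the diagram; it is not any engine's actual scheme.

import hashlib
from collections import defaultdict

NUM_INDEX_SHARDS = 256   # hypothetical shard count; real deployments differ

def shard_of(url):
    """Map a target URL to the indexer shard that owns its document,
    using a two-hex-character hash prefix (echoing docs f0-ff style buckets)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest[:2], 16) % NUM_INDEX_SHARDS

def route_anchor_text(links):
    """links: iterable of (source_url, target_url, anchor_text) from crawled pages.
    Returns {shard_id: [(target_url, anchor_text), ...]} so each indexer receives
    the anchor text describing the documents it owns."""
    buckets = defaultdict(list)
    for _source, target, text in links:
        buckets[shard_of(target)].append((target, text))
    return dict(buckets)

# Illustrative links reusing URLs from the earlier examples.
links = [
    ("foo.com/bar", "go.com/hw.txt", "hello world example"),
    ("bar.com/a.html", "go.com/hw.txt", "hw notes"),
]
print(route_anchor_text(links))

In production this shuffle runs continuously over links from tens of billions of pages, which is part of why the earlier slide treats global features such as anchor-text distribution as having a higher bar than document-local ones.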