Instant Indexing
Greg Lindahl
CTO, Blekko
October 21, 2010 - BCS Search Solutions 2010

Blekko Who?
• Founded in 2007, $24m in funding
• Whole-web search engine
• Currently in invite-only beta
  – 3B page crawl
  – innovative UI
• … but this talk is about indexing

What whole-web search was
• Sort by relevance only
• News and blog search done with separate engines
• Main index updated slowly with a batch process
• Months-to-weeks update cycle

What web-scale search is now
• Relevance and date sorting
• Everything in a single index
• Incremental updating
• Live-crawled pages should appear in the main index in seconds
• All data stored as tables

Instant Search Indexing
• [/date screenshot]

Another Example

Google’s take on the issue
• Daniel Peng and Frank Dabek, Large-scale Incremental Processing Using Distributed Transactions and Notifications
• “Databases do not meet the storage or throughput requirements of these tasks… MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.”

Percolator details
• ACID, with multi-row transactions
• triggers (“observers”), which can be cascaded
• crawler is a cascade of triggers:
  – MapReduce writes new documents into Bigtable
  – trigger parses and extracts links
  – cascaded trigger does clustering
  – cascaded trigger exports changed clusters
  – 10 triggers total in the indexing system
• max 1 observer per column, for complexity reasons
• message collapsing when there are multiple updates to a column

Blekko’s take on this
• We want to run the same code in a mapjob or in an incremental crawler/indexer
• Our bigtable-like thingie shouldn’t need a Percolator-sized addition to do it
• Needs to be more efficient than other approaches
• OK with non-ACID, relaxed eventual consistency, etc.

Combinators
• Task: gather incoming links and anchortext
• Each crawled webpage has dozens of outlinks
• Crawler wants to write into dozens of inlink lists, each in a separate cell in a table
• TopN combinator: list of the N highest-ranked items (sketch below)
• If a cell is frequently written, writes can be combined before hitting disk (sketch below)

Combining combinators
• Combine within the writing process
• Combine within the local write daemon
• Combine within the 3 disk daemons, and the RAM daemon
  – highly contended cells result in 1 disk transaction per 30 seconds
• Combinators are represented as strings and can be used without the database
• Using combinators seems to be a significant reduction in RPCs compared to Percolator, but I have no idea what the relative performance is.
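A minimal Python sketch of the TopN combinator idea above (Python because the deck says cells hold native Perl/Python data structures). The class and method names are illustrative assumptions, not Blekko's actual interface; the point it shows is that a TopN merge is associative and commutative, so partial results can be combined anywhere along the write path.

```python
# Illustrative TopN combinator cell: keeps only the N highest-ranked
# (rank, key, data) items, deduplicated by key. Names are assumptions.

class TopN:
    def __init__(self, n, items=None):
        self.n = n
        self.items = list(items or [])      # (rank, key, data) tuples

    def add(self, rank, key, data):
        """Combine one new item into the cell value."""
        return self.merge(TopN(self.n, [(rank, key, data)]))

    def merge(self, other):
        """Combine two cells: keep the best-ranked entry per key, then the
        top N overall. Merging is associative and commutative, so it can
        happen in the writer process, the local write daemon, or the disk
        daemons before a single transaction hits disk."""
        best = {}                            # highest rank seen per key
        for r, k, d in self.items + other.items:
            if k not in best or r > best[k][0]:
                best[k] = (r, k, d)
        self.items = sorted(best.values(), reverse=True)[:self.n]
        return self


# Two crawler processes see links to the same URL; their writes can be
# merged locally and sent as a single update to that URL's inlinks cell.
a = TopN(50)
a.add(1000, "www.disney.com", "great website")
b = TopN(50)
b.add(540, "britishmuseum.com/dance", "16th century dance manuals in facsimile")
inlinks_cell = a.merge(b)
```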
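The combining slide says updates are merged at several levels (writer process, local write daemon, disk and RAM daemons), so a hot cell costs roughly one disk transaction per 30 seconds. Below is a sketch of one such combining layer, assuming combinator values expose a merge method as in the TopN sketch; the class name and the flush_to_disk callback are invented for illustration, and only the 30-second interval comes from the slide.

```python
import time

class CombiningWriteBuffer:
    """Sketch of one combining layer: merge updates to the same cell in
    memory and flush at most once per interval (checked on each write)."""

    def __init__(self, flush_to_disk, interval=30.0):
        self.pending = {}                  # (table, row, column) -> combinator value
        self.flush_to_disk = flush_to_disk
        self.interval = interval
        self.last_flush = time.monotonic()

    def write(self, cell, value):
        if cell in self.pending:
            self.pending[cell].merge(value)    # combine before hitting disk
        else:
            self.pending[cell] = value
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        for cell, value in self.pending.items():
            self.flush_to_disk(cell, value)    # one transaction per hot cell
        self.pending.clear()
        self.last_flush = time.monotonic()
```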
TopN example
• table: /index/32/url   row: pbm.com/~lindahl/   column: inlinks
  – a list of: rank, key, data
  – 1000, www.disney.com, “great website”
  – 540, britishmuseum.com/dance, “16th century dance manuals in facsimile”
  – 1, www.ehow.com/dance, “renaissance dance”

MapReduce from a combinator perspective
• MapReduce is really map, shuffle, reduce
• input: a file/table; output: a file/table
• An incremental job to do the same MapReduce looks completely different; you have to implement the shuffle+reduce yourself
• Could write into Bigtable cells…

MapJobs+Combinators
• Map function runs on shards
• All output is done by writing into a table, using combinators
• The same map function can also be run incrementally on individual inputs (sketch below)
• The shuffle+reduce is still there, it’s just done by the database+combinators

Combinator types
• topN
• lastN = topN, using time as the rank
• sum, avg, eavg, min, max
• counting things
  – logcount: ±50% count of unique strings in 16 bytes (sketch below)
• set -- everything is a combinator
• Cells in our tables are native Perl/Python data structures
• hence: atomic updates at the sub-cell level

Combinators for indexing
• The basic data structure for search is the posting list:
  – for each term, a list of (docid, rank) rows
• Sounds like a custom topN to us
  – rank = rank or date or …
  – lists heavily compressed
• Each posting list has N shards

Combinators for crawling
• Pick a site, crawl the most important uncrawled pages
  – that’s stored as a topN
• (the “livecrawl” uses other criteria)
• Crawl, parse, and spew writes
  – outlinks into inlinks cells
  – page ip/geo into incoming ips, geos
  – page hashes into the duptext-detection table
  – count everything under the sun
  – 100s of writes total

Instant index step
• Crawler does the indexing (sketch below)
• Decides which terms to index based on page contents and incoming anchortext
• Writes into posting lists
  – if indexed before, use the list of previously indexed terms to delete any obsolete terms
• Heavily contended posting lists are not a problem due to combining -- that’s how a naked [/date] query works.

Supporting date queries
• /date queries fetch about 3X the posting lists of a relevance query
• to support [/health /date], we keep a posting list of the most recent dated pages for each website
• date needs some relevance; every date-sorted posting list has a companion date-sorted list of only highly-relevant articles

Example: [obama /date]
• The term posting list for ‘obama’ has overflowed -- moderately relevant dated pages have probably been smushed out
• The date posting list for ‘obama’ has overflowed
• The date posting list for highly-relevant dated ‘obama’ is not full

To Sum Up
• There’s more than one way to do it
  – yes, we use Perl
• I don’t think Blekko’s scheme is better or worse than Google’s, but at least it’s very different
• See me if you’d like an invite to our beta-test
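The MapJobs+Combinators slide says the same map function can run over whole shards in a batch mapjob or incrementally on a single live-crawled page, because all output goes through combinator writes and the shuffle+reduce is absorbed by the database. A hedged Python sketch of that shape; the Table class, extract_links, and the table/column/combinator names are stand-ins, not Blekko's API.

```python
# Sketch: one map function, two drivers. All "reduce" work happens in the
# store via combinators (topn, sum), so batch and incremental paths share code.

class Table:
    """Stand-in for the combinator-aware store: it just prints the writes."""
    def write(self, table, row, column, combinator, value):
        print(f"write {table} {row} {column} [{combinator}] {value!r}")

def extract_links(page):
    """Stand-in parser: yields (link_url, anchortext, rank) triples."""
    return page.get("links", [])

def map_page(url, page, table):
    """One map function; all output is combinator writes into tables."""
    for link_url, anchortext, rank in extract_links(page):
        # gather incoming links and anchortext into the target URL's cell
        table.write("/index/32/url", link_url, "inlinks",
                    combinator="topn", value=(rank, url, anchortext))
    # "count everything under the sun"
    table.write("/stats/crawl", "global", "pages_crawled",
                combinator="sum", value=1)

def run_as_mapjob(shard, table):
    """Batch mode: the same map function runs over every page in a shard."""
    for url, page in shard:
        map_page(url, page, table)

def on_page_crawled(url, page, table):
    """Incremental mode: the crawler calls this for one live-crawled page;
    the shuffle+reduce is done by the database's combinators."""
    map_page(url, page, table)
```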
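The deck describes logcount only as a ±50% count of unique strings in 16 bytes; Blekko's actual algorithm isn't given, so the sketch below is a stand-in: a Flajolet-Martin-style bitmap that fits conceptually in 128 bits, gives a rough distinct count, and merges with a bitwise OR, which is the property that lets it behave like any other combinator.

```python
import hashlib

PHI = 0.77351   # Flajolet-Martin correction constant

class LogCount:
    """Rough distinct-string counter kept in a (conceptual) 16-byte bitmap.
    A stand-in illustration, not Blekko's logcount algorithm."""

    def __init__(self, bits=0):
        self.bits = bits                                 # 128-bit bitmap as an int

    def add(self, s):
        h = int.from_bytes(hashlib.md5(s.encode()).digest(), "big")
        low = (h & -h).bit_length() - 1 if h else 127    # lowest set bit of the hash
        self.bits |= 1 << min(low, 127)

    def merge(self, other):
        """Merging two counters is a bitwise OR, so partial counts can be
        combined before hitting disk, like any other combinator."""
        self.bits |= other.bits
        return self

    def estimate(self):
        r = 0                                            # index of the lowest unset bit
        while (self.bits >> r) & 1:
            r += 1
        return int(2 ** r / PHI)
```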
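The indexing slides describe posting lists as custom topN cells that the crawler writes into directly, deleting entries for terms a re-crawled page no longer contains. A sketch of that flow; choose_terms, score, the Store stand-in, and the "topn_delete" write are assumptions for illustration, not the real system.

```python
class Store:
    """Minimal in-memory stand-in so the sketch runs; it ignores combinator
    semantics and just keeps the last value written to each cell."""
    def __init__(self):
        self.cells = {}
    def read(self, table, row, column):
        return self.cells.get((table, row, column))
    def write(self, table, row, column, combinator, value):
        self.cells[(table, row, column)] = value

def choose_terms(page_text, anchortext):
    """Stand-in: index every word from the page plus incoming anchortext."""
    return set(page_text.split()) | set(anchortext.split())

def score(term, page_text):
    """Stand-in rank; per the slides, rank could be relevance or a date."""
    return page_text.count(term)

def index_page(docid, page_text, anchortext, store):
    """The crawler does the indexing itself, one page at a time."""
    terms = choose_terms(page_text, anchortext)

    # If indexed before, delete posting-list entries for obsolete terms.
    old_terms = store.read("/index/terms-by-doc", docid, "terms") or set()
    for term in old_terms - terms:
        store.write("/index/postings", term, "docs",
                    combinator="topn_delete", value=docid)

    # Write (rank, docid) into each term's posting list. Heavily contended
    # lists are fine because writes are combined before hitting disk.
    for term in terms:
        store.write("/index/postings", term, "docs",
                    combinator="topn", value=(score(term, page_text), docid))

    # Remember what was indexed, for the next re-crawl of this page.
    store.write("/index/terms-by-doc", docid, "terms",
                combinator="set", value=terms)
```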