Instant Indexing
Greg Lindahl
CTO, Blekko
October 21, 2010 - BCS Search Solutions 2010
Blekko Who?
• Founded in 2007, $24m in funding
• Whole-web search engine
• Currently in invite-only beta
– 3B page crawl
– innovative UI
• … but this talk is about indexing
What whole-web search was
• Sort by relevance only
• News and blog search done with separate engines
• Main index updated slowly with a batch process
• Months to weeks update cycle
What web-scale search is now
• Relevance and date sorting
• Everything in a single index
• Incremental updating
• Live-crawled pages should appear in the main index in seconds
• All data stored as tables
Instant Search Indexing
• /date screenshot
Another Example
Google’s take on the issue
• Daniel Peng and Frank Dabek, Large-scale Incremental Processing Using Distributed Transactions and Notifications
• “Databases do not meet the storage or throughput requirements of these tasks… MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.”
Percolator details
• ACID, with multi-row transactions
• triggers ("observers"), can be cascaded
• crawler is a cascade of triggers:
– MapReduce writes new documents into bigtable
– trigger parses and extracts links
– cascaded trigger does clustering
– cascaded trigger exports changed clusters
– 10 triggers total in indexing system
• max 1 observer per column for complexity reasons
• message collapsing when there are multiple updates to a column
Blekko’s take on this
• We want to run the same code in a mapjob or in an incremental crawler/indexer
• Our bigtable-like thingie shouldn’t need a percolator-sized addition to do it
• Needs to be more efficient than other approaches
• OK with non-ACID, relaxed eventual consistency, etc
Combinators
• Task: gather incoming links and anchortext
• Each crawled webpage has dozens of outlinks
• Crawler wants to write into dozens of inlists, each in a separate cell in a table
• TopN combinator: list of the N highest-ranked items (sketch below)
• If a cell is frequently written, writes can be combined before hitting disk
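A minimal sketch of what a TopN combinator could look like, assuming it stores (rank, key, data) tuples as in the example two slides down; the class and method names are illustrative, not Blekko's actual API. The important property is that two TopN values merge into one, which is what lets writes to the same cell be combined before they reach disk.

```python
import heapq

class TopN:
    """Keeps only the N highest-ranked (rank, key, data) items.
    Two TopN values merge into one, which is what lets writes to the
    same cell be combined at any layer before they reach the database."""

    def __init__(self, n):
        self.n = n
        self.items = {}                     # key -> (rank, data)

    def add(self, rank, key, data):
        old = self.items.get(key)
        if old is None or rank > old[0]:    # keep the best entry per key
            self.items[key] = (rank, data)
        self._trim()

    def merge(self, other):
        # Combining two pending writes is just adding the other's items.
        for key, (rank, data) in other.items.items():
            self.add(rank, key, data)
        return self

    def _trim(self):
        if len(self.items) > self.n:
            keep = heapq.nlargest(self.n, self.items.items(),
                                  key=lambda kv: kv[1][0])
            self.items = dict(keep)

# A crawler writing anchortext into a page's "inlinks" cell:
inlinks = TopN(50)
inlinks.add(1000, "www.disney.com", "great website")
inlinks.add(1, "www.ehow.com/dance", "renaissance dance")
```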
Combining combinators
• Combine within the writing process
• Combine within the local write daemon
• Combine within the 3 disk daemons, and the RAM daemon
– highly contended cells result in 1 disk transaction per 30 seconds (sketch below)
• Combinators are represented as strings and can be used without the database
• Using combinators seems to be a significant reduction of RPCs over Percolator, but I have no idea what the relative performance is.
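One way the layered combining above could work, sketched under the assumption that every combinator value knows how to merge with a pending value for the same cell; the buffer, the injected sink, and the 30-second flush interval are illustrative stand-ins for the write daemons.

```python
import time

class CombiningWriteBuffer:
    """Buffers combinator writes per (table, row, column) and merges them
    in memory, so a hot cell costs one downstream write per flush interval
    instead of one per update. `sink` is whatever the next layer is
    (local write daemon, disk daemon, ...)."""

    def __init__(self, sink, flush_interval=30.0):
        self.sink = sink
        self.pending = {}            # (table, row, col) -> combinator value
        self.flush_interval = flush_interval
        self.last_flush = time.time()

    def write(self, table, row, col, value):
        key = (table, row, col)
        if key in self.pending:
            self.pending[key].merge(value)   # combine with the queued write
        else:
            self.pending[key] = value
        # Real daemons would flush on a timer; this sketch flushes lazily.
        if time.time() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        for (table, row, col), value in self.pending.items():
            self.sink(table, row, col, value)
        self.pending.clear()
        self.last_flush = time.time()
```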
TopN example
• table: /index/32/url  row: pbm.com/~lindahl/  column: inlinks
– a list of: rank, key, data
– 1000, www.disney.com, “great website”
– 540, britishmuseum.com/dance, “16th century dance manuals in facsimile”
– 1, www.ehow.com/dance, “renaissance dance”
MapReduce from a combinator perspective
• MapReduce is really map, shuffle, reduce
• input: a file/table, output: a file/table
• An incremental job to do the same MapReduce looks completely different; you have to implement the shuffle+reduce
• Could write into BigTable cells…
MapJobs+Combinators
• Map function runs on shards
• All output is done by writing into a table, using combinators
• The same map function can also be run incrementally on individual inputs (sketch below)
• The shuffle+reduce is still there, it’s just done by the database+combinators
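A sketch of running the same map function in batch or incrementally; the page fields and the store.write interface are assumptions for illustration, not the real API.

```python
def index_page(page, store):
    """Map function: all output is combinator writes into tables.
    There is no explicit shuffle or reduce; the database merges
    writes that land in the same cell."""
    for url, anchortext in page["outlinks"]:
        store.write(table="/index/url", row=url, col="inlinks",
                    op="topN", value=(page["rank"], page["url"], anchortext))
    store.write(table="/crawl/site", row=page["site"], col="pages_crawled",
                op="sum", value=1)

def run_as_mapjob(pages, store):
    # Batch mode: iterate over a whole shard of crawled pages.
    for page in pages:
        index_page(page, store)

def run_incrementally(page, store):
    # Incremental mode: one freshly crawled page, same map function.
    index_page(page, store)
```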
Combinator types
• topN
• lastN = topN, using time as the rank
• sum, avg, eavg, min, max
• counting things
– logcount: ±50% count of strings in 16 bytes (one possible construction sketched below)
• set -- everything is a combinator
• Cells in our tables are native Perl/Python data structures
• hence: atomic updates on a sub-cell level
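The slides don't say how logcount is built internally; one plausible construction that matches the description (a rough distinct-count of strings in 16 bytes, mergeable by OR) is a Flajolet-Martin style bitmap, sketched here.

```python
import hashlib

M = 128  # 16 bytes of sketch state

def _rho(h):
    """0-based position of the lowest set bit of h."""
    return (h & -h).bit_length() - 1

def logcount_add(bitmap, s):
    """Fold one string into a 128-bit sketch; returns the new bitmap."""
    h = int.from_bytes(hashlib.md5(s.encode()).digest(), "big") | (1 << (M - 1))
    return bitmap | (1 << _rho(h))

def logcount_merge(a, b):
    """Merging two sketches is a bitwise OR -- a natural combinator."""
    return a | b

def logcount_estimate(bitmap):
    """Rough distinct count: 2^R / 0.77351, R = lowest unset bit position."""
    if bitmap == 0:
        return 0
    r = 0
    while bitmap & (1 << r):
        r += 1
    return int((1 << r) / 0.77351)

sketch = 0
for word in ["obama", "blekko", "dance", "obama"]:
    sketch = logcount_add(sketch, word)
print(logcount_estimate(sketch))   # approximate number of distinct strings
```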
Combinators for indexing
• The basic data structure for search is the posting list:
– for each term, a list with rows
• docid, rank
• Sounds like a custom topN to us (sketch below)
– rank = rank or date or …
– lists heavily compressed
• Each posting list has N shards
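A sketch of a posting-list write as a sharded topN combinator operation; the shard count, list size, table name, and store.write interface are illustrative, and real posting lists are heavily compressed rather than stored as plain tuples.

```python
import zlib

N_SHARDS = 16        # illustrative; "each posting list has N shards"
LIST_SIZE = 10000    # illustrative cap on postings kept per shard

def posting_shard(docid):
    # Route a document's postings to one shard of the term's list.
    return zlib.crc32(docid.encode()) % N_SHARDS

def write_posting(store, term, docid, rank):
    """One posting is one combinator write: (rank, docid) into a topN cell.
    Using the page date instead of rank gives the date-sorted variant."""
    store.write(table="/index/postings",
                row="%s/%d" % (term, posting_shard(docid)),
                col="postings",
                op="topN",
                n=LIST_SIZE,
                value=(rank, docid, None))
```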
Combinators for crawling
• Pick a site, crawl the most important uncrawled pages (sketch below)
– that’s stored as a topN
• (the “livecrawl” uses other criteria)
• Crawl, parse, and spew writes
– outlinks into inlinks cells
– page ip/geo into incoming ips, geos
– page hashes into duptext detection table
– count everything under the sun
– 100s of writes total
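A sketch of pulling work from the per-site frontier, assuming the uncrawled-pages cell is a topN list of (rank, url, data); the table and column names are made up for illustration.

```python
def pick_next_pages(store, site, count=10):
    """Read the site's topN of uncrawled URLs, highest importance first,
    and hand the best ones to the crawler."""
    cell = store.read(table="/crawl/frontier", row=site, col="uncrawled") or []
    # cell is a topN-style list of (rank, url, data) tuples
    return [url for rank, url, data in cell[:count]]
```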
Instant index step
• Crawler does the indexing
• Decides which terms to index based on page contents and incoming anchortext (sketch below)
• Writes into posting lists
– if indexed before, use list of previously indexed terms to delete any obsolete terms
• Heavily-contended posting lists are not a problem due to combining -- that’s how a naked [/date] query works.
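A sketch of the instant index step described above: diff the newly chosen terms against the terms indexed on the previous crawl, delete obsolete postings, and write the rest. The store helpers and table names are hypothetical.

```python
def instant_index(store, page, new_terms):
    """new_terms: {term: rank}, chosen from page text and incoming anchortext."""
    row = page["url"]
    old_terms = store.read(table="/index/url", row=row, col="indexed_terms") or set()

    # Delete postings for terms this page no longer matches.
    for term in old_terms - set(new_terms):
        store.delete_posting(term, docid=row)

    # Write (or refresh) postings; heavily contended lists are fine
    # because the writes are combined before hitting disk.
    for term, rank in new_terms.items():
        store.write_posting(term, docid=row, rank=rank)

    # Remember what was indexed, for the next recrawl of this page.
    store.write(table="/index/url", row=row, col="indexed_terms",
                op="set", value=set(new_terms))
```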
Supporting date queries
• /date queries fetch about 3X the posting lists of a relevance query
• to support [/health /date], we keep a posting list of the most recent dated pages for each website
• date needs some relevance; every date-sorted posting list has a companion date-sorted list of only highly-relevant articles
Example: [obama /date]
• The term posting list for ‘obama’ has overflowed -- moderately relevant dated pages are probably smushed out
• The date posting list for ‘obama’ has overflowed
• The date posting list for highly-relevant dated ‘obama’ is not full
To Sum Up
• There’s more than one way to do it
– yes, we use Perl
• I don’t think Blekko’s scheme is better or worse than Google’s, but at least it’s very different
• See me if you’d like an invite to our beta-test