Practical considerations for a web-scale search engine
Michael Isard
Microsoft Research Silicon Valley
Search and research
• Lots of research motivated by web search
– Explore specific research questions
– Small to moderate scale
• A few large-scale production engines
– Many additional challenges
– Not all purely algorithmic/technical
• What are the extra constraints for a production system?
Production search engines
• Scale up
– Tens of billions of web pages, images, etc.
– Tens of thousands to millions of computers
• Geographic distribution
– For performance and reliability
• Continuous crawling and serving
– No downtime, need fresh results
• Long-term test/maintenance
– Simplicity a core goal
Disclaimer
• Not going to describe any particular web-scale search engine
– No detailed public description of any engine
• But, general principles apply
Outline
• Anatomy of a search engine
• Query serving
• Link-based ranking
• Index generation
Structure of a search engine
[Diagram: components of a search engine — the Web, document crawling, link structure analysis, page feature training, index building, ranker training, query serving, user behavior analysis, auxiliary answers.]
Some index statistics
• Tens of billions of documents
– Each document contains thousands of terms
– Plus metadata
– Plus snippet information
• Billions of unique terms
– Serial numbers, etc.
• Hundreds of billions of nodes in web graph
• Latency a few ms on average
– Well under a second worst-case
Query serving pipeline
[Diagram: queries from the Web pass through front-end web servers and caches, which fan out to banks of index servers.]
Page relevance
• Query-dependent component
– Query/document match, user metadata, etc.
• Query-independent component
– Document rank, spam score, click rate, etc.
• Ranker needs:
– Term frequencies and positions
– Document metadata
– Near-duplicate information
–…
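To make the split concrete, here is a toy sketch (not any engine's actual formula) in which the final score mixes a query-dependent match score with query-independent document signals; all feature names and weights below are hypothetical.

```python
# Hypothetical illustration: final relevance as a mix of query-dependent
# and query-independent signals. Feature names and weights are made up.

def relevance(qd_match_score, doc):
    """Combine a query/document match score with per-document signals."""
    query_independent = (
        0.5 * doc["static_rank"]     # e.g. link-based document rank
        - 0.8 * doc["spam_score"]    # demote likely spam
        + 0.3 * doc["click_rate"]    # historical click-through rate
    )
    return 0.7 * qd_match_score + 0.3 * query_independent

doc = {"static_rank": 0.6, "spam_score": 0.05, "click_rate": 0.2}
print(relevance(qd_match_score=0.8, doc=doc))
```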
Single-box query outline
[Diagram: single-machine query processing. A term table maps each term to a posting list of doc.position entries (e.g. "a" → 1.2, 1.10, 1.16, …, 1040.23, …; "hello" → 3.76, …, 45.48, …, 1125.3, …; "world" → 7.12, …, 45.29, …, 1125.4, …); a document table maps doc ids to metadata such as URL and language (e.g. foo.com/bar, EN-US, …); a third table maps doc ids to snippet data (e.g. "once a week …"). For the query "Hello world" + {EN-US, …}, the ranker intersects the posting lists, finding matches such as (45.48, 45.29) and (1125.3, 1125.4), and returns ranked results (e.g. 1125.3, 45.48, …) with their metadata (go.com/hw.txt, bar.com/a.html, …).]
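A minimal single-machine sketch of the structures in the diagram above, with made-up data: an inverted index of (doc, position) postings plus per-document metadata and snippets, queried for "hello world". The ranker here is only a stand-in that rewards close term positions.

```python
# Toy single-box index with made-up data. A posting d.p in the diagram is
# represented here as the pair (doc_id, position).
postings = {
    "hello": [(3, 76), (45, 48), (1125, 3)],
    "world": [(7, 12), (45, 29), (1125, 4)],
}
metadata = {45: ("go.com/hw.txt", "EN-US"), 1125: ("bar.com/a.html", "EN-US")}
snippets = {45: "once a week ...", 1125: "hello world ..."}

def search(term_a, term_b):
    # Intersect the two posting lists on doc_id (two-term query for brevity).
    docs_a = {d for d, _ in postings.get(term_a, [])}
    docs_b = {d for d, _ in postings.get(term_b, [])}
    results = []
    for doc in docs_a & docs_b:
        pos_a = [p for d, p in postings[term_a] if d == doc]
        pos_b = [p for d, p in postings[term_b] if d == doc]
        # Stand-in ranker: reward documents where the terms appear close together.
        proximity = min(abs(a - b) for a in pos_a for b in pos_b)
        url, lang = metadata[doc]
        results.append((1.0 / (1 + proximity), url, snippets[doc]))
    return sorted(results, reverse=True)

print(search("hello", "world"))   # bar.com/a.html ranks above go.com/hw.txt
```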
Query statistics
• Small number of terms (fewer than 10)
• Posting lists length 1 to 100s of millions
– Most terms occur once
• Potentially millions of documents to rank
– Response is needed in a few ms
– Tens of thousands of near duplicates
– Sorting documents by QI rank may help
• Tens or hundreds of snippets
Distributed index structure
• Tens of billions of documents
• Thousands of queries per second
• Index is constantly updated
– Most pages turn over in at most a few weeks
– Some very quickly (news sites)
– Almost every page is never returned
How to distribute?
Distributed index: split by term
• Each computer stores a subset of terms
• Each query goes only to a few computers
• Document metadata stored separately
[Diagram: the query "Hello world" + {EN-US, …} goes to a ranker, which fetches posting lists from term shards A-G, H-M, N-S, T-Z and document metadata from separate metadata servers.]
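A minimal sketch of term partitioning under these assumptions: terms are hashed to shards, each shard holds only its own terms' posting lists, and the front end fetches and intersects the lists. The shard count and data are made up.

```python
# Sketch of a term-partitioned index: each shard holds the posting lists
# for a subset of terms, so a query fetches each term's list from the
# shard that owns it and intersects them centrally. All data is made up.

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # term -> set of doc ids

def shard_for(term):
    return hash(term) % NUM_SHARDS

def index(term, doc_id):
    shards[shard_for(term)].setdefault(term, set()).add(doc_id)

def query(terms):
    # One "network fetch" per term, then an intersection at the front end.
    lists = [shards[shard_for(t)].get(t, set()) for t in terms]
    return set.intersection(*lists) if lists else set()

for doc_id, text in [(1, "hello world"), (2, "hello there"), (3, "goodbye world")]:
    for term in text.split():
        index(term, doc_id)

print(query(["hello", "world"]))   # -> {1}
```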
Split by term: pros
• Short queries only touch a few computers
– With high probability all are working
• Long posting lists improve compression
– Most words occur many times in corpus
Split by term: cons (1)
• Must ship posting lists across network
– Multi-term queries make things worse
– But maybe pre-computing can help?
• Intersections of lists for common pairs of terms
• Needs to work with constantly updating index
• Extra network roundtrip for doc metadata
– Too expensive to store in every posting list
• Where does the ranker run?
– Hundreds of thousands of ranks to compute
Split by term: cons (2)
• Front-ends must map terms to computers
– Simple hashing may be too unbalanced
– Some terms may need to be split/replicated
• Long posting lists
• “Hot” posting lists
• Sorting by QI rank is a global operation
– Needs to work with index updates
Distributed index: split by document
• Each computer stores a subset of docs
• Each query goes to many computers
• Document metadata stored inline
[Diagram: the query "Hello world" + {EN-US, …} goes to an aggregator, which fans out to rankers co-located with document shards (docs 1-1000, 1001-2000, 2001-3000, 3001-4000) and merges their results.]
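A minimal scatter/gather sketch of document partitioning, with made-up documents and a stand-in ranker: each shard ranks its own documents locally and returns only a few (score, doc) pairs for the aggregator to merge.

```python
import heapq

# Sketch of a document-partitioned index: every shard holds complete data
# for its own documents, ranks them locally, and returns only its top few
# results; an aggregator merges the per-shard lists. Data is made up.

class DocShard:
    def __init__(self, docs):
        self.docs = docs  # doc_id -> text

    def top_k(self, terms, k=3):
        scored = []
        for doc_id, text in self.docs.items():
            words = text.split()
            score = sum(words.count(t) for t in terms)  # stand-in ranker
            if score > 0:
                scored.append((score, doc_id))
        return heapq.nlargest(k, scored)

def aggregate(shards, terms, k=3):
    # Scatter the query to every shard, gather a few results from each.
    partials = [r for shard in shards for r in shard.top_k(terms, k)]
    return heapq.nlargest(k, partials)

shards = [
    DocShard({1: "hello world hello", 2: "goodbye world"}),
    DocShard({1001: "hello there world", 1002: "unrelated page"}),
]
print(aggregate(shards, ["hello", "world"]))
```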
Split by document: pros
• Ranker on same computer as document
– All data for a given doc in the same place
– Ranker computation is distributed
• Can get low latency
• Sorting by QI rank local to each computer
• Only ranks+scores need to be aggregated
– Hundreds of results, not millions
Split by document: cons
• A query touches hundreds of computers
– One slow computer makes query slow
– Computers per query is linear in corpus size
– But query speeds are not iid
• Shorter posting lists: worse compression
– Each word split into many posting lists
Index replication
• Multiple copies of each partition
– Needed for redundancy, performance
• Makes things more complicated
– Can mitigate latency variability
• Ask two replicas, one will probably return quickly (sketched below)
– Interacts with data layout
• Split by document may be simpler
• Consistency may not be essential
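A toy sketch of the "ask two replicas" idea referenced above: issue the same lookup to two replicas and return whichever answers first, so one slow machine does not slow the query. The latencies are simulated, and replica_lookup is a hypothetical stand-in for a real index-server RPC.

```python
import concurrent.futures as cf
import random
import time

# Sketch of "hedged" reads against replicas: send the same request to two
# replicas of a partition and use whichever answers first, masking one
# slow machine. Latencies here are simulated, not real measurements.

def replica_lookup(replica_id, term):
    time.sleep(random.uniform(0.001, 0.050))   # simulated variable latency
    return f"postings for {term!r} from replica {replica_id}"

def hedged_lookup(term, replicas=(0, 1)):
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica_lookup, r, term) for r in replicas]
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)          # don't block on the slower replica
    return next(iter(done)).result()

print(hedged_lookup("hello"))
```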
Splitting: word vs document
• Original Google paper split by word
• All major engines split by document now?
– Tens of microseconds to rank a document
Link-based ranking
• Intuition: the “quality” of a page is reflected somehow in the link structure of the web
• Made famous by PageRank
– Can be seen as the stationary distribution of a random walk on the web graph (sketched below)
– Google’s original advantage over AltaVista?
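A minimal power-iteration sketch of PageRank on a tiny made-up graph: the score vector converges to the stationary distribution of a random walk that follows a random out-link with probability d and teleports to a random page otherwise.

```python
# Minimal PageRank power iteration on a tiny made-up web graph.

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = d * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            else:  # dangling page: spread its rank over all pages
                for p in pages:
                    new[p] += d * rank[page] / len(pages)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))
```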
Some hints
• PageRank is (no longer) very important
• Anchor text contains similar information
– BM25F includes a lot of link structure
• Query-dependent link features may be useful
[Chart: NDCG@10 for rankers based on different link features, from “Comparing the Effectiveness of HITS and SALSA”, M. Najork, CIKM 2007. bm25f scores highest (≈.22); salsa-aut variants are the strongest link-only features (≈.12–.16); pagerank, degree-in, and hits-aut variants score roughly .09–.11; hits-hub, salsa-hub, and degree-out variants score only ≈.03; a random ordering scores ≈.01.]
Query-dependent link features
[Diagram: an example neighborhood graph of pages (nodes A-N) around a result set, connected by hyperlinks.]
Real-time QD link information
• Lookup of neighborhood graph
• Followed by SALSA
• In a few ms
Seems like a good topic for approximation/learning
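A rough sketch of the idea, with a made-up neighborhood graph and a fixed iteration count: expand the result set to its link neighborhood, then run SALSA-style updates in which each node spreads hub/authority mass divided by its out-/in-degree.

```python
# Rough sketch of query-dependent link analysis: given result pages, look up
# their immediate link neighborhood, then iterate SALSA-style hub/authority
# updates. The graph below and the iteration count are made up.

def salsa_scores(out_links, iterations=20):
    nodes = set(out_links) | {t for ts in out_links.values() for t in ts}
    in_links = {n: [] for n in nodes}
    for src, targets in out_links.items():
        for t in targets:
            in_links[t].append(src)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each hub spreads its mass over its out-links, each authority over its in-links.
        auth = {n: sum(hub[s] / len(out_links[s]) for s in in_links[n])
                for n in nodes}
        hub = {n: sum(auth[t] / len(in_links[t]) for t in out_links.get(n, []))
               for n in nodes}
    return auth

# Hypothetical neighborhood graph around a result set {"B", "C"}.
neighborhood = {"A": ["B", "C"], "B": ["C"], "D": ["B"], "C": []}
print(salsa_scores(neighborhood))
```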
Index building
• Catch-all term
– Create inverted files
– Compute document features
– Compute global link-based statistics
– Which documents to crawl next?
– Which crawled documents to put in the index?
• Consistency may be needed here
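As one concrete piece, a toy sketch of the "create inverted files" step: turn crawled documents into term-to-postings lists of (doc_id, position). A real builder would also compress the lists and merge updates incrementally, which this sketch omits.

```python
from collections import defaultdict

# Toy inverted-file builder: term -> sorted list of (doc_id, position).

def build_inverted_file(documents):
    postings = defaultdict(list)
    for doc_id in sorted(documents):
        for position, term in enumerate(documents[doc_id].lower().split()):
            postings[term].append((doc_id, position))
    return dict(postings)

docs = {1: "Hello world", 2: "hello again hello"}
print(build_inverted_file(docs))
# {'hello': [(1, 0), (2, 0), (2, 2)], 'world': [(1, 1)], 'again': [(2, 1)]}
```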
Index lifecycle
[Diagram: the index lifecycle — page crawling from the Web, index selection, query serving, and usage analysis form a continuous loop.]
Experimentation
• A/B testing is best
– Ranking, UI, etc.
– Immediate feedback on what works
– Can be very fine-grained (millions of queries)
• Some things are very hard
– Index selection, etc.
– Can run parallel build processes
• Long time constants: not easy to do brute force
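A toy sketch of fine-grained A/B assignment, with a hypothetical experiment name: hash a stable identifier together with the experiment name so each experiment gets an independent, repeatable split of traffic.

```python
import hashlib

# Toy A/B bucketing: the same user always lands in the same arm of a given
# experiment, and different experiments split traffic independently.

def bucket(user_id, experiment, arms=("control", "treatment")):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return arms[digest[0] % len(arms)]

for user in ["u1", "u2", "u3", "u4"]:
    print(user, bucket(user, "new-snippet-ranker"))
```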
Implementing new features
• Document-specific features much “cheaper”
– Spam probability, duplicate fingerprints, language
• Global features can be done, but with a higher bar
– Distribute anchor text
– PageRank et al.
• Danger of “butterfly effect” on system as a whole
Distributing anchor text
[Diagram: crawlers emit anchor text that is routed to indexers, each of which builds the index for a range of documents (e.g. docs f0-ff).]
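A toy sketch of the flow in the diagram: crawlers emit (target URL, anchor text) pairs that are routed by a hash of the target URL, so the indexer responsible for a document receives all anchor text pointing at it. The indexer count and data are made up.

```python
import hashlib

# Sketch of distributing anchor text: route each (target_url, anchor_text)
# pair to the indexer that owns the target document. Toy data only.

NUM_INDEXERS = 3
indexer_anchors = [dict() for _ in range(NUM_INDEXERS)]   # url -> [anchors]

def indexer_for(url):
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_INDEXERS

def emit_anchor(target_url, anchor_text):
    owner = indexer_anchors[indexer_for(target_url)]
    owner.setdefault(target_url, []).append(anchor_text)

# A crawler parsing one page might emit:
emit_anchor("example.com/a", "great article")
emit_anchor("example.com/b", "related work")
emit_anchor("example.com/a", "see also")

print(indexer_anchors)
```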
Distributed infrastructure
• Things are improving
– Large scale partitioned file systems
• Files commonly contain many TB of data
• Accessed in parallel
– Large scale data-mining platforms
– General-purpose data repositories
• Data-centric
– Traditional supercomputing is cycle-centric
Software engineering
• Simple always wins
• Hysteresis
– Prove a change will improve things
• Big improvement needed to justify big change
– Experimental platforms are essential
Summary
• Search engines are big and complicated
• Some things are easier to change than others
• Harder changes need more convincing experiments
• Small datasets are not good predictors for large datasets
• Systems/learning may need to collaborate