Facebook

Scalable Data Management@facebook
Srinivas Narayanan
11/13/09
Scale
Over 300 million active users
>200 billion monthly page views
>3.9 trillion feed actions processed per day
100 million search queries per day
Over 1 million developers in 180 countries
2 billion pieces of content per week
#2 site on the Internet (time on site)
More than 232 photos…
6 billion minutes per day
Growth Rate
[Chart: active users over time, reaching 300M in 2009]
Social Networks
The social graph links everything
Scaling Social Networks
▪
Much harder than typical websites where...
▪
Typically 1-2% online: easy to cache the data
▪
Partitioning & scaling relatively easy
▪
What do you do when everything is interconnected?
[Diagram: a densely connected social graph. Every node is labeled with its own data: name, status, privacy, and either a profile photo or a video thumbnail.]
System Architecture
Architecture
Load Balancer (assigns a web server)
Web Server (PHP assembles data)
Memcache (fast, simple)
Database (slow, persistent)
Memcache
▪
Simple in-memory hash table
▪
Supports get, set, delete, multiget, multiset
▪
Not a write-through cache
▪
Pros and Cons
▪
The Database Shield!
▪
Low latency, very high request rates
▪
Can be easy to corrupt, inefficient for very small items
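Because memcache is not a write-through cache, the client code has to fill and invalidate it. A minimal sketch of that look-aside pattern follows, assuming plain dicts as stand-ins for memcache and the database; the function names and the delete-on-write policy are illustrative assumptions, not taken from the slides.

```python
# Stand-ins for the real tiers: plain dicts, not a memcache or MySQL client.
cache = {}                                # memcache: fast, volatile
database = {"user:1:name": "alice"}       # database: slow, persistent
stats = {"db_reads": 0}                   # counts how often the database is hit

def read(key):
    """Look-aside read: try the cache, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        stats["db_reads"] += 1
        value = database.get(key)
        if value is not None:
            cache[key] = value            # the client, not the cache, fills the entry
    return value

def write(key, value):
    """Write to the database, then drop the cached copy (no write-through)."""
    database[key] = value
    cache.pop(key, None)
```

Repeated reads of the same key reach the database only once until the next write, which is the "database shield" effect named above.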
Memcache Optimization
▪
Multithreading and efficient protocol code - 50k req/s
▪
Polling network drivers - 150k req/s
▪
Breaking up stats lock - 200k req/s
▪
Batching packet handling - 250k req/s
▪
Breaking up cache lock - future
Network Incast
[Diagram, four builds: a PHP client talks to many memcache servers through a single switch. The client sends many small get requests; the servers answer with many big data packets that all return through the same switch.]
Memcache Clustering
Many small objects per server
Many servers per large object
[Diagram: a PHP client fetches 10 objects from one memcache server]
[Diagram: two memcache servers hold 5 objects each: 2 round trips total, 1 round trip per server]
[Diagram: three memcache servers hold 4, 3, and 3 objects]
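The trade-off in these diagrams can be sketched in a few lines: the more servers a set of keys is spread over, the more round trips one page needs. The snippet below is a rough illustration that assumes a simple hash-modulo placement; `server_for`, `plan_multiget`, and the server names are hypothetical.

```python
import zlib
from collections import defaultdict

def server_for(key, servers):
    """Pick a server by hashing the key (a simple modulo scheme, assumed here)."""
    return servers[zlib.crc32(key.encode()) % len(servers)]

def plan_multiget(keys, servers):
    """Group keys by server so that each server receives one multiget request."""
    batches = defaultdict(list)
    for key in keys:
        batches[server_for(key, servers)].append(key)
    return dict(batches)

keys = [f"obj:{i}" for i in range(10)]
for pool in (["mc1"], ["mc1", "mc2"], ["mc1", "mc2", "mc3"]):
    plan = plan_multiget(keys, pool)
    # One round trip per server that holds at least one of the requested keys.
    print(len(pool), "servers:", len(plan), "round trips",
          {server: len(batch) for server, batch in plan.items()})
```

Each extra server adds capacity but also adds a request per page, so the fan-out is worth keeping small for groups of objects that are fetched together.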
Memcache Pool Optimization
▪
Currently a manual process
▪
Replication for obvious hot data sets
▪
Interesting problem: Optimize the allocation based on access patterns
Vertical Partitioning of Object Types
Specialized Replica 1
Shard 1
Shard 2
Specialized Replica 2
Shard 1
Shard 2
General pool with wide fanout
Shard 1
Shard 2
Shard 3
...
Shard n
MySQL has played a role from the beginning
Thousands of MySQL servers in two datacenters
Scribe
MySQL Usage
• Pretty solid transactional persistent store
• Logical migration of data is difficult
• Logical-Physical db mapping
• Rarely use advanced query features
• Performance
• Database resources are precious
• Web tier CPU is relatively cheap
• Distributed data - no joins!
• Sound administrative model
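With distributed data and no joins, the join moves into the web tier, where CPU is cheap. The sketch below illustrates the idea under loudly stated assumptions: the shard count, host names, table contents, and helper names are all hypothetical, and nested dicts stand in for SQL queries against each host.

```python
# Hypothetical layout: 4 logical shards mapped onto 2 physical database hosts.
LOGICAL_TO_PHYSICAL = {0: "db-a", 1: "db-a", 2: "db-b", 3: "db-b"}

# Stand-in for per-host tables; a real system would issue SQL to each host.
HOSTS = {
    "db-a": {"friends": {4: [5, 6]}, "names": {4: "Dana", 5: "Eli"}},
    "db-b": {"friends": {}, "names": {6: "Fay"}},
}

def host_for(user_id):
    """Map a user to a logical shard, then to the physical host that holds it."""
    logical_shard = user_id % len(LOGICAL_TO_PHYSICAL)
    return LOGICAL_TO_PHYSICAL[logical_shard]

def friend_names(user_id):
    """Join friends to names in application code, one simple lookup per row."""
    friend_ids = HOSTS[host_for(user_id)]["friends"].get(user_id, [])
    return [HOSTS[host_for(fid)]["names"][fid] for fid in friend_ids]
```

In this model, moving a logical shard to another machine only changes one entry in the mapping; the application code and the shard function stay the same.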
MySQL is better because it is Open Source
We can enhance or extend the database
▪
...as we see fit
▪
...when we see fit
▪
Facebook extended MySQL to support distributed cache invalidation for memcache
INSERT table_foo (a,b,c) VALUES (1,2,3)
MEMCACHE_DIRTY key1,key2,...
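The slide shows only the extended statement, not how it is processed. One plausible reading, sketched here purely as an assumption, is that whoever replays the statement strips the `MEMCACHE_DIRTY` clause, runs the plain SQL, and deletes the listed keys from the local memcache. The helper names below are hypothetical and a dict stands in for memcache.

```python
def split_dirty_clause(statement):
    """Split a statement into its SQL part and the keys named after MEMCACHE_DIRTY."""
    sql, marker, tail = statement.partition("MEMCACHE_DIRTY")
    keys = [k.strip() for k in tail.split(",") if k.strip()] if marker else []
    return sql.strip(), keys

def apply_replicated(statement, local_cache):
    """Hypothetical replica-side hook: drop the dirty keys, return the SQL to run."""
    sql, keys = split_dirty_clause(statement)
    for key in keys:
        local_cache.pop(key, None)        # invalidate; the next read refills it
    return sql
```

Tying the invalidation to the replicated statement means a remote cache is cleared at the point where the new row arrives, not before it.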
Scaling across datacenters
West Coast
East Coast
SC Web
SF Web
VA Web
Memcache Proxy
Memcache Proxy
SC
Memcache
SF
Memcache
VA
Memcache
Memcache Proxy
SC MySQL
MySQL replication
VA MySQL
Other Interesting Issues
▪
Application level batching and parallelization
▪
Super hot data items
▪
Cache key versioning with continuous availability
Photos
Photos + Social Graph = Awesome!
Photos: Scale
▪
20 billion photos x4 = 80 billion
▪
Would wrap around the world more than 10 times!
▪
Over 40M new photos per day
▪
600K photos / second
Photos Scaling - The easy wins
▪
Upload tier - handles uploads, scales images, stores on NFS
▪
Serving tier: Images served from NFS via HTTP
▪
However...
▪
File systems are not good at supporting large number of files
▪
Metadata too large to fit in memory causing too many IOs for each file read
▪
Limited by I/O not storage density
Easy wins
▪
CDN
▪
Cachr (http server + caching)
▪
NFS file handle cache
Photos: Haystack
Overlay file system
Index in memory
One IO per read
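The three points above can be illustrated with a toy store: photos are appended to one large file, and a small in-memory index maps each photo to its offset and size, so a read needs no metadata lookup on disk. This is only a sketch of the idea, not Haystack's actual format; `TinyHaystack` and its methods are hypothetical, and an in-memory `BytesIO` stands in for the big file.

```python
import io

class TinyHaystack:
    """Append-only blob file plus an in-memory index of (offset, size)."""

    def __init__(self, blob_file):
        self.blob = blob_file
        self.index = {}                       # photo_id -> (offset, size), in memory

    def put(self, photo_id, data):
        self.blob.seek(0, io.SEEK_END)        # always append at the end
        offset = self.blob.tell()
        self.blob.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]   # metadata comes from memory
        self.blob.seek(offset)
        return self.blob.read(size)           # the single read of the blob file

store = TinyHaystack(io.BytesIO())
store.put(1, b"jpeg-bytes-1")
store.put(2, b"jpeg-bytes-22")
```

Compared with one file per photo on NFS, the per-photo metadata shrinks to a dictionary entry, which is what lets the whole index stay in memory.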
Data Warehousing
Data: How much?
▪
200GB per day in March 2008
▪
2+TB(compressed) raw data per day in April 2009
▪
4+TB(compressed) raw data per day today
The Data Age
▪
Free or low-cost user services
▪
Consumer behavior hard to predict
▪
Data and analysis are critical
▪
More data beats better algorithms
Deficiencies of existing technologies
▪
Analysis/storage on proprietary systems too expensive
▪
Closed systems are hard to extend
Hadoop & Hive
Hadoop
▪
Superior availability/scalability/manageability despite lower single node performance
▪
Open system
▪
Scalable costs
▪
Cons: Programmability and Metadata
▪
Map-reduce hard to program (users know sql/bash/python/perl)
▪
Need to publish data in well known schemas
Hive
▪
A system for managing and querying structured data built on top of Hadoop
▪
Components
▪
Map-Reduce for execution
▪
HDFS for storage
▪
Metadata in an RDBMS
Hive: New Technology, Familiar Interface
hive> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
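For comparison, the same filter-and-count can be written by hand as a map step and a reduce step. The Python sketch below runs in a single process and only mimics the shape of a map-reduce job; the sample rows are made up, and sorting stands in for the shuffle that Hadoop performs between the two phases.

```python
from itertools import groupby

def mapper(rows):
    """Emit the key of every (key, value) row whose key is greater than 100."""
    for key, _value in rows:
        if key > 100:
            yield key

def reducer(keys):
    """Count each key; sorting stands in for the shuffle between map and reduce."""
    for key, group in groupby(sorted(keys)):
        yield key, sum(1 for _ in group)

kv1 = [(99, "a"), (150, "b"), (150, "c"), (200, "d")]   # made-up sample rows
counts = list(reducer(mapper(kv1)))
```

The one-line Hive query hides exactly this plumbing, which is the point of the slide.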
Hive: Sample Applications
▪
Reporting
▪
E.g.: Daily/Weekly aggregations of impression/click counts
▪
Measures of user engagement
▪
Ad hoc Analysis
▪
E.g.: how many group admins broken down by state/country
▪
Machine Learning (Assembling training data)
▪
Ad Optimization
▪
E.g.: User Engagement as a function of user attributes
▪
Lots More
Hive: Server Infrastructure
▪
4800 cores, Storage capacity of 5.5 PetaBytes, 12 TB per node
▪
Two level network topology
▪
1 Gbit/sec from node to rack switch
▪
4 Gbit/sec to top level rack switch
Hive & Hadoop: Usage Stats
▪
4 TB of compressed new data added per day
▪
135TB of compressed data scanned per day
▪
7500+ Hive jobs per day
▪
80K compute hours per day
▪
200 people run jobs on Hadoop/Hive
▪
Analysts (non-engineers) use Hadoop through Hive
▪
95% of jobs are Hive Jobs
Hive: Technical Overview
Hive: Open and Extensible
▪
Query your own formats and types with your own Serializer/Deserializers
▪
Extend the SQL functionality through User Defined Functions
▪
Do any non-SQL transformations through TRANSFORM operator that sends data from Hive to any user program/script
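A script used with TRANSFORM reads rows from standard input, one tab-separated row per line, and writes transformed rows to standard output. The sketch below is a hypothetical example of such a script: the column names (`user_id`, `url`), the domain-extraction logic, and the file name are assumptions. It could be invoked with a query along the lines of `SELECT TRANSFORM(user_id, url) USING 'python domain.py' AS (user_id, domain) FROM page_views;`, where the table is likewise made up.

```python
import io

def transform(lines):
    """Turn tab-separated (user_id, url) rows into (user_id, domain) rows."""
    for line in lines:
        user_id, url = line.rstrip("\n").split("\t")
        domain = url.split("/")[2] if "://" in url else url
        yield f"{user_id}\t{domain}"

def main(stdin, stdout):
    """Hive streams rows to the script's stdin and reads rows back from its stdout."""
    for row in transform(stdin):
        stdout.write(row + "\n")

# A real script would end with main(sys.stdin, sys.stdout); StringIO stands in here.
out = io.StringIO()
main(io.StringIO("7\thttp://example.com/a\n8\texample.org\n"), out)
```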
Hive: Smarter Execution Plans
▪
Map-side Joins
▪
Predicate Pushdown
▪
Partition Pruning
▪
Hash based Aggregations
▪
Parallel execution of operator trees
▪
Intelligent Scheduling
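As one illustration from this list, a map-side join avoids the shuffle and reduce phases when one table is small enough to fit in memory: each mapper loads the small table into a hash table and probes it while scanning its split of the big table. The sketch below shows only that idea in plain Python, with made-up tables; it is not Hive's implementation.

```python
def map_side_join(big_rows, small_table):
    """Join each big-table row against an in-memory dict; no shuffle or reduce step."""
    for user_id, clicks in big_rows:
        country = small_table.get(user_id)
        if country is not None:               # inner join: drop unmatched rows
            yield user_id, clicks, country

countries = {1: "US", 2: "IN"}                # small table, fits in memory
click_rows = [(1, 10), (2, 7), (3, 4)]        # one mapper's split of the big table
joined = list(map_side_join(click_rows, countries))
```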
Hive: Possible Future Optimizations
▪
Pipelining?
▪
Finer operator control (controlling sorts)
▪
Cost based optimizations?
▪
HBase
Spikes: The Username Launch
System Design
▪
Database tier cannot handle the load
▪
Dedicated memcache tier for assigned usernames
▪
Miss => Available
▪
Avoid database hits altogether
▪
Blacklists: bucketize, local tier cache
Username Memcache Tier
▪
Parallel pool in each data center
▪
Writes replicated to all nodes
▪
8 nodes per pool
▪
Reads can go to any node (hashed by uid)
[Diagram: PHP client connected to the username memcache nodes UN0, UN1, ..., UN7]
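The tier described above can be modeled in a few lines: a write goes to every node, a read goes to one node chosen by hashing the uid, and a miss is read as "the username is available". This is a toy model under stated assumptions; the class, its method names, and the CRC-based hash are hypothetical, and Python sets stand in for memcache nodes.

```python
import zlib

class UsernameTier:
    """Toy model of one pool: every write goes to all nodes, a read goes to one."""

    def __init__(self, node_count=8):
        self.nodes = [set() for _ in range(node_count)]    # UN0 .. UN7

    def assign(self, username):
        for node in self.nodes:                # writes replicated to all nodes
            node.add(username)

    def node_for(self, uid):
        """Choose the node to read from by hashing the uid."""
        return zlib.crc32(str(uid).encode()) % len(self.nodes)

    def is_available(self, uid, username):
        # A miss means the name is free, so the database is never consulted.
        return username not in self.nodes[self.node_for(uid)]
```

Because every node holds the full set of assigned names, adding nodes increases read capacity without changing the answer any node gives.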
Write Optimization
▪
Hashout store
▪
Distributed key-value store (MySQL backed)
▪
Lockless (optimistic) concurrency control
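Optimistic concurrency control replaces locks with a check at write time: read a row together with its version, then write only if the version has not changed. The sketch below shows that general technique with an in-memory stand-in; it is not the Hashout store's actual design, and the class and function names are hypothetical. In a MySQL-backed store the compare-and-set would have to be a single atomic statement, for example a conditional update.

```python
class VersionedStore:
    """In-memory stand-in for a key-value store with versioned rows."""

    def __init__(self):
        self.rows = {}                          # key -> (version, value)

    def read(self, key):
        return self.rows.get(key, (0, None))

    def write_if_version(self, key, expected_version, value):
        """Compare-and-set: succeed only if nobody wrote since our read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False
        self.rows[key] = (current_version + 1, value)
        return True

def claim_username(store, username, uid):
    """Claim a name without taking a lock; lose cleanly if someone else won."""
    version, owner = store.read(username)
    if owner is not None:
        return False
    return store.write_if_version(username, version, uid)
```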
Fault Tolerance
▪
Memcache nodes can go down
▪
Always check another node on miss
▪
Replay from a log file (scribe)
▪
Memcache sets are not guaranteed to succeed
▪
Self-correcting code: write again to mc if we detect it during db writes
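Two of these safeguards can be sketched together: a read that checks a second node before trusting a miss, and a database write path that re-populates memcache when it finds a name that memcache had lost. The code below is a rough model, assuming sets as memcache nodes, `None` as a node that is down, a dict as the database, and "the next node" as the choice of fallback; all names are hypothetical.

```python
def lookup_with_fallback(nodes, first, username):
    """Check the chosen node, then one other node, before trusting a miss."""
    second = (first + 1) % len(nodes)           # assumed choice of "another node"
    for index in (first, second):
        node = nodes[index]
        if node is not None and username in node:   # None models a node that is down
            return True
    return False

def record_in_db(db, nodes, username, uid):
    """On a db write that finds the name taken, write it to memcache again."""
    if username in db:
        for node in nodes:                      # self-correct a lost memcache set
            if node is not None:
                node.add(username)
        return False
    db[username] = uid
    return True
```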
Nuclear Options
▪
Newsfeed
▪
Reduce number of stories
▪
Turn off scrolling, highlights
▪
Profile
▪
Reduce number of stories
▪
Make info tab the default
▪
Chat
▪
Reduce buddy list refresh rate
▪
Turn it off!
How much load?
▪
200k in 3 min
▪
1M in 1 hour
▪
50M in first month
▪
Prepared for over 10x!
Some interesting problems
▪
Graph models and languages
▪
Low latency fast access
▪
Slightly more expressive queries
▪
Consistency, Staleness can be a bit loose
▪
Analysis over large data sets
▪
Privacy as part of the model
▪
Fat data pipes
▪
Push enormous volumes of data to several third party applications (E.g., entire newsfeed to search partners).
▪
Controllable QoS
Some interesting problems (contd.)
▪
Search relevance
▪
Storage systems
▪
Middle tier (cache) optimization
▪
Application data access language
Questions?