Scalable Data Management @ Facebook
Srinivas Narayanan
11/13/09

Scale
▪ Over 300 million active users
▪ >200 billion monthly page views
▪ >3.9 trillion feed actions processed per day
▪ 100 million search queries per day
▪ Over 1 million developers in 180 countries
▪ 2 billion pieces of content per week
▪ #2 site on the Internet (time on site)
▪ More than 232 photos…
▪ 6 billion minutes per day

Growth Rate
[Chart: active users over time, reaching 300M in 2009]

Social Networks
▪ The social graph links everything

Scaling Social Networks
▪ Much harder than typical websites, where...
  ▪ Typically 1-2% of users are online: easy to cache the data
  ▪ Partitioning & scaling are relatively easy
▪ What do you do when everything is interconnected?
[Diagram: a densely interconnected graph of users and objects, each node carrying name, status, privacy, and profile photo or video thumbnail data]
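To make the interconnection problem concrete, here is a minimal sketch of why a social graph is hard to cache by activity: rendering one page for one online user touches data for every one of that user's friends, so even a small active fraction pulls most of the graph into cache. The data model and function name here are illustrative assumptions, not Facebook's code.

```python
# Illustrative sketch only: a toy social graph and profile store.
friends = {
    "alice": ["bob", "carol", "dave"],
    "bob": ["alice", "erin"],
}
profiles = {u: {"name": u, "status": "...", "privacy": "friends"}
            for u in ["alice", "bob", "carol", "dave", "erin"]}

def fetch_page_data(viewer):
    """One page view needs the viewer's record plus every friend's
    name, status, and privacy data -- a fanout over the graph."""
    needed = {viewer} | set(friends.get(viewer, []))
    return {u: profiles[u] for u in needed}

# Even if only alice is online, her single page view touches 4 users' records.
touched = fetch_page_data("alice")
```

With only 1-2% of users online, a typical site caches that 1-2%; here, the friends of the online users span nearly everyone.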
System Architecture
▪ Load Balancer (assigns a web server)
▪ Web Server (PHP assembles data)
▪ Memcache (fast, simple)
▪ Database (slow, persistent)

Memcache
▪ Simple in-memory hash table
▪ Supports get/set, delete, multiget, multiset
▪ Not a write-through cache
▪ Pros and cons
  ▪ The database shield!
  ▪ Low latency, very high request rates
  ▪ Can be easy to corrupt; inefficient for very small items

Memcache Optimization
▪ Multithreading and efficient protocol code - 50k req/s
▪ Polling network drivers - 150k req/s
▪ Breaking up the stats lock - 200k req/s
▪ Batching packet handling - 250k req/s
▪ Breaking up the cache lock - future

Network Incast
[Diagram sequence: a PHP client multigets from many memcache servers through a single switch; the many small get requests fan out, and the many big reply packets converge on the switch at once, overwhelming it]

Memcache Clustering
▪ Many small objects per server
▪ Many servers per large object
[Diagram: 10 objects on one server; split 5/5 across two servers - 1 round trip per server, 2 round trips total; split 4/3/3 across three servers]

Memcache Pool Optimization
▪ Currently a manual process
▪ Replication for obvious hot data sets
▪ Interesting problem: optimize the allocation based on access patterns

Vertical Partitioning of
Object Types
[Diagram: Specialized Replica 1 (Shards 1-2), Specialized Replica 2 (Shards 1-2), and a general pool with wide fanout (Shards 1, 2, 3, ... n)]

MySQL has played a role from the beginning
▪ Thousands of MySQL servers in two datacenters
[Diagram: many Scribe log servers]

MySQL Usage
▪ Pretty solid transactional persistent store
▪ Logical migration of data is difficult
▪ Logical-physical db mapping
▪ Rarely use advanced query features
  ▪ Performance
  ▪ Database resources are precious
  ▪ Web tier CPU is relatively cheap
▪ Distributed data - no joins!
▪ Sound administrative model

MySQL is better because it is Open Source
▪ We can enhance or extend the database...
  ▪ ...as we see fit
  ▪ ...when we see fit
▪ Facebook extended MySQL to support distributed cache invalidation for memcache:
  INSERT table_foo (a,b,c) VALUES (1,2,3) MEMCACHE_DIRTY key1,key2,...

Scaling across datacenters
[Diagram: West Coast (SC and SF) and East Coast (VA) sites; each has a web tier, a memcache tier, and a memcache proxy; MySQL replication flows from SC MySQL to VA MySQL]

Other Interesting Issues
▪ Application-level batching and parallelization
▪ Super hot data items
▪ Cache-key versioning with continuous availability

Photos
▪ Photos + Social Graph = Awesome!

Photos: Scale
▪ 20 billion photos x 4 sizes = 80 billion images
▪ Would wrap around the world more than 10 times!
▪ Over 40M new photos per day
▪ 600K photos / second

Photos Scaling - The Easy Wins
▪ Upload tier - handles uploads, scales images, stores on NFS
▪ Serving tier - images served from NFS via HTTP
▪ However...
  ▪ File systems are not good at supporting large numbers of files
  ▪ Metadata too large to fit in memory, causing too many I/Os for each file read
  ▪ Limited by I/O, not storage density
▪ Easy wins
  ▪ CDN
  ▪ Cachr (HTTP server + caching)
  ▪ NFS file handle cache

Photos: Haystack
▪ Overlay file system
▪ Index in memory
▪ One I/O per read

Data Warehousing
Data: How much?
▪ 200 GB per day in March 2008
▪ 2+ TB (compressed) raw data per day in April 2009
▪ 4+ TB (compressed) raw data per day today

The Data Age
▪ Free or low cost of user services
▪ Consumer behavior hard to predict
▪ Data and analysis are critical
▪ More data beats better algorithms

Deficiencies of existing technologies
▪ Analysis/storage on proprietary systems too expensive
▪ Closed systems are hard to extend

Hadoop & Hive

Hadoop
▪ Superior availability/scalability/manageability despite lower single-node performance
▪ Open system
▪ Scalable costs
▪ Cons: programmability and metadata
  ▪ Map-reduce is hard to program (users know sql/bash/python/perl)
  ▪ Need to publish data in well-known schemas

Hive
▪ A system for managing and querying structured data, built on top of Hadoop
▪ Components
  ▪ Map-Reduce for execution
  ▪ HDFS for storage
  ▪ Metadata in an RDBMS

Hive: New Technology, Familiar Interface

hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh

Hive: Sample Applications
▪ Reporting
  ▪ E.g., daily/weekly aggregations of impression/click counts
  ▪ Measures of user engagement
▪ Ad hoc analysis
  ▪ E.g., how many group admins, broken down by state/country
▪ Machine learning (assembling training data)
  ▪ Ad optimization
  ▪ E.g., user engagement as a function of user attributes
▪ Lots more

Hive: Server Infrastructure
▪ 4800 cores, storage capacity of 5.5 petabytes, 12 TB per node
▪ Two-level network topology
  ▪ 1 Gbit/sec from node to rack switch
  ▪ 4 Gbit/sec to top-level rack switch

Hive & Hadoop: Usage Stats
▪ 4 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7500+ Hive jobs per day
▪ 80K compute hours per day
▪ 200 people run jobs on Hadoop/Hive
▪ Analysts (non-engineers) use Hadoop through Hive
▪ 95% of jobs are Hive jobs

Hive: Technical Overview

Hive: Open and Extensible
▪ Query your own formats and types with your own Serializer/Deserializers
▪ Extend the SQL functionality through User Defined Functions
▪ Do any non-SQL transformations through the TRANSFORM operator, which sends data from Hive to any user program/script

Hive: Smarter Execution Plans
▪ Map-side joins
▪ Predicate pushdown
▪ Partition pruning
▪ Hash-based aggregations
▪ Parallel execution of operator trees
▪ Intelligent scheduling

Hive: Possible Future Optimizations
▪ Pipelining?
▪ Finer operator control (controlling sorts)
▪ Cost-based optimizations?
▪ HBase

Spikes: The Username Launch

System Design
▪ Database tier cannot handle the load
▪ Dedicated memcache tier for assigned usernames
  ▪ Miss => available
  ▪ Avoid database hits altogether
▪ Blacklists: bucketize, local-tier cache
▪ Username memcache tier
  ▪ Parallel pool in each data center
  ▪ Writes replicated to all nodes
  ▪ 8 nodes per pool
  ▪ Reads can go to any node (hashed by uid)
[Diagram: PHP client reading from username memcache nodes UN0, UN1, ..., UN7]

Write Optimization
▪ Hashout store
  ▪ Distributed key-value store (MySQL backed)
  ▪ Lockless (optimistic) concurrency control

Fault Tolerance
▪ Memcache nodes can go down
  ▪ Always check another node on a miss
  ▪ Replay from a log file (Scribe)
▪ Memcache sets are not guaranteed to succeed
  ▪ Self-correcting code: write again to memcache if we detect it during db writes

Nuclear Options
▪ Newsfeed
  ▪ Reduce number of stories
  ▪ Turn off scrolling, highlights
▪ Profile
  ▪ Reduce number of stories
  ▪ Make the info tab the default
▪ Chat
  ▪ Reduce buddy list refresh rate
  ▪ Turn it off!

How much load?
▪ 200k in 3 min
▪ 1M in 1 hour
▪ 50M in the first month
▪ Prepared for over 10x!

Some interesting problems
▪ Graph models and languages
  ▪ Low-latency, fast access
  ▪ Slightly more expressive queries
  ▪ Consistency; staleness can be a bit loose
  ▪ Analysis over large data sets
  ▪ Privacy as part of the model
▪ Fat data pipes
  ▪ Push enormous volumes of data to several third-party applications (e.g., the entire newsfeed to search partners)
  ▪ Controllable QoS

Some interesting problems (contd.)
▪ Search relevance
▪ Storage systems
▪ Middle-tier (cache) optimization
▪ Application data access language

Questions?
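As a closing illustration, the username-launch design above (miss => available, writes replicated to all nodes in the pool, reads hashed by uid, and another node checked on a miss) can be sketched roughly as follows. The node count matches the slides; everything else — names, the in-process dicts standing in for memcache nodes — is a simplified assumption, not the production code.

```python
# Rough sketch of the username memcache tier described above.
# Each dict stands in for one memcache node in the 8-node pool.
NUM_NODES = 8
pool = [dict() for _ in range(NUM_NODES)]

def claim_username(username, uid):
    """Writes are replicated to all nodes in the pool."""
    for node in pool:
        node[username] = uid

def is_available(username, uid):
    """Reads go to one node chosen by hashing the requesting uid;
    a miss means the username is still available."""
    node = pool[hash(uid) % NUM_NODES]
    if username not in node:
        # Fault tolerance: on a miss, double-check another node
        # before declaring the name free (nodes can go down or
        # miss a set), then answer without touching the database.
        other = pool[(hash(uid) + 1) % NUM_NODES]
        return username not in other
    return False

claim_username("srinivas", uid=42)
```

Because a miss means "available", the expensive case (a name nobody has taken) never reaches the database tier at all — the property the launch design depends on.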