Scaling Memcache at Facebook
Presenter: Rajesh Nishtala (rajesh.nishtala@fb.com)
Co-authors: Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, Venkateshwaran Venkataramani

Infrastructure Requirements for Facebook
1. Near real-time communication
2. Aggregate content on-the-fly from multiple sources
3. Be able to access and update very popular shared content
4. Scale to process millions of user requests per second

Design Requirements
• Support a very heavy read load
  – Over 1 billion reads / second
  – Insulate backend services from high read rates
• Geographically Distributed
• Support a constantly evolving product
  – System must be flexible enough to support a variety of use cases
  – Support rapid deployment of new features
• Persistence handled outside the system
  – Support mechanisms to refill after updates

Need more read capacity
• Two orders of magnitude more reads than writes
• Solution: deploy a few memcache hosts to handle the read capacity

How do we store data?
• Demand-filled look-aside cache
• Common case: data is available in the cache
• Read path: 1. get(key) from memcache; 2. on a miss, 3. look the value up in the database and 4. set(key) in memcache

Handling updates
• Memcache needs to be invalidated after a DB write: 1. update the database, 2. delete the key from memcache
• Prefer deletes to sets
  – Idempotent
  – Demand filled
• Up to the web application to specify which keys to invalidate after a database update

While evolving our system we prioritize two major design goals
• Any change must impact a user-facing or operational issue. Optimizations that have limited scope are rarely considered.
• We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.

Roadmap
• Single front-end cluster
  – Read-heavy workload
  – Wide fanout
  – Handling failures
• Multiple front-end clusters
  – Controlling data replication
  – Data consistency
• Multiple regions
  – Data consistency

Single front-end cluster
• Reducing Latency: focus on the memcache client
  – serialization
  – compression
  – request routing
  – request batching
• Parallel requests and batching
• Client-server communication
  – Embed the complexity of the system into a stateless client rather than in the memcached servers
  – We rely on UDP for get requests and TCP via mcrouter for set and delete requests
• Incast congestion
  – Memcache clients implement flow-control mechanisms to limit incast congestion
• Reducing Load
  – Leases: a 64-bit token that memcached hands to a client on a read miss; it addresses two problems, stale sets and thundering herds (a sketch of the look-aside path with leases follows this slide)
  – Stale sets: a stale set occurs when a web server sets a value in memcache that does not reflect the latest value that should be cached. This can occur when concurrent updates to memcache get reordered.
  – Thundering herd: happens when a specific key undergoes heavy read and write activity
• Stale values
  – When a key is deleted, its value is transferred to a data structure that holds recently deleted items, where it lives for a short time before being flushed. A get request can return a lease token or data that is marked as stale.
• Memcache Pools
  – We designate one pool (named wildcard) as the default and provision separate pools for keys whose residence in wildcard is problematic.
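The look-aside read path, delete-based invalidation, and leases above fit together as sketched below. This is a minimal illustration under assumptions not taken from the deck: a single in-process LeasingCache class stands in for a memcached server, a plain dict stands in for the database, and a 10-second lease reissue interval is used as an example rate limit. It is not Facebook's implementation.

```python
import time
import uuid


class LeasingCache:
    """Minimal in-process stand-in for a memcached server that hands out
    leases on misses. Illustrative only, not Facebook's implementation."""

    def __init__(self, reissue_interval=10.0):
        self.data = {}                     # key -> cached value
        self.leases = {}                   # key -> (token, time issued)
        self.reissue = reissue_interval    # rate-limit lease hand-out per key

    def get(self, key):
        """Return (value, lease_token). On a miss, at most one client per
        reissue interval receives a 64-bit token (thundering-herd control)."""
        if key in self.data:
            return self.data[key], None
        token, issued = self.leases.get(key, (None, 0.0))
        if token is None or time.time() - issued > self.reissue:
            token = uuid.uuid4().int >> 64     # 64-bit lease token
            self.leases[key] = (token, time.time())
            return None, token                 # this client may fill from the DB
        return None, None                      # other clients back off and retry

    def set(self, key, value, token):
        """Accept a fill only while the lease is still valid (stale-set control)."""
        current = self.leases.get(key)
        if current is None or current[0] != token:
            return False                       # a delete voided the lease; drop the set
        del self.leases[key]
        self.data[key] = value
        return True

    def delete(self, key):
        """Invalidate after a database write; also voids any outstanding lease."""
        self.data.pop(key, None)
        self.leases.pop(key, None)


def read(cache, db, key):
    """Demand-filled look-aside read: 1. get, 2. miss, 3. DB lookup, 4. set."""
    while True:
        value, token = cache.get(key)
        if value is not None:
            return value
        if token is not None:
            value = db[key]                    # 3. database lookup
            cache.set(key, value, token)       # 4. fill the cache under the lease
            return value
        time.sleep(0.05)                       # another client holds the lease
```

A web server would call read(cache, db, key) on every request and cache.delete(key) right after committing a database update; because the delete voids any outstanding lease, a reader that fetched the pre-update value cannot install it, which is the stale-set case that leases guard against.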
• Replication Within Pools
  – We choose to replicate a category of keys within a pool when:
    • the application routinely fetches many keys simultaneously
    • the entire data set fits in one or two memcached servers
    • the request rate is much higher than what a single server can manage
• Handling Failures
  – There are two scales at which we must address failures:
    • a small number of hosts are inaccessible due to a network or server failure
    • a widespread outage that affects a significant percentage of the servers within the cluster
  – Gutter: a small set of idle servers that temporarily takes over for failed memcached hosts

Multiple front-end clusters: Regions
• Region
  – We split our web and memcached servers into multiple front-end clusters. These clusters, along with a storage cluster that contains the databases, define a region.
  – We trade replication of data for more independent failure domains, a tractable network configuration, and a reduction of incast congestion.

Databases invalidate caches
• Cached data must be invalidated after database updates
• Solution: tail the MySQL commit log and issue deletes based on transactions that have been committed (McSqueal runs alongside the databases and sends the deletes to the memcached servers in every front-end cluster)
• Allows caches to be resynchronized in the event of a problem

Invalidation pipeline
• Sending deletes directly from every database to every memcached server generates too many packets, so deletes are routed through dedicated memcache routers
• Aggregating deletes reduces the packet rate by 18x
• Makes configuration management easier
• Each stage buffers deletes in case a downstream component is down

Regional Pools
• We can reduce the number of replicas by having multiple front-end clusters share the same set of memcached servers. We call this a regional pool.

Cold Cluster Warmup
• A newly provisioned front-end cluster (a "cold cluster") starts with an empty cache and a poor hit rate. A system called Cold Cluster Warmup mitigates this by allowing clients in the cold cluster to retrieve data from a "warm cluster" rather than the persistent storage.

Across Regions: Geographically distributed clusters
• One region holds the master databases and the other regions contain read-only replicas
• Advantages
  – putting web servers closer to end users can significantly reduce latency
  – geographic diversity can mitigate the effects of events such as natural disasters or massive power failures
  – new locations can provide cheaper power and other economic incentives

Writes in non-master regions
• Database updates go directly to the master region, creating a race between MySQL replication and a subsequent read from the local replica:
  1. Web server writes to the master DB
  2. Web server deletes the key from memcache
  3. A later request misses in memcache and reads the local replica DB, racing with MySQL replication
  4. Web server sets a potentially stale value into memcache

Remote markers
• Set a special flag (a remote marker) that indicates whether a race is likely
• Read-miss path: if the marker is set, read from the master DB; else read from the replica DB
• Write path from a non-master region (sketched below):
  1. Set the remote marker
  2. Write to the master DB
  3. Delete the key from memcache
  4. MySQL replication propagates the update to the replica DB
  5. Delete the remote marker
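A minimal sketch of the remote-marker protocol follows, under these assumptions: the Cache class is a tiny in-process stand-in for the regional memcache pool, plain dicts stand in for the master and replica databases, and the "remote:" key prefix is a hypothetical marker-naming scheme. This is not Facebook's implementation.

```python
class Cache:
    """Tiny stand-in for a regional memcache pool."""
    def __init__(self):
        self.items = {}
    def set(self, key, value):
        self.items[key] = value
    def get(self, key):
        return self.items.get(key)
    def delete(self, key):
        self.items.pop(key, None)


def marker(key):
    return "remote:" + key            # remote marker stored in the regional pool


def write_from_non_master(cache, master_db, key, value):
    cache.set(marker(key), 1)         # 1. set the remote marker
    master_db[key] = value            # 2. write to the master database
    cache.delete(key)                 # 3. invalidate the cached value
    # 4. MySQL replication later copies the row to the local replica, and
    # 5. the replication stream deletes the remote marker once it arrives.


def read_after_miss(cache, master_db, replica_db, key):
    # Read-miss path: a set marker means the local replica may still lag the
    # master, so pay the cross-region latency and read from the master DB.
    if cache.get(marker(key)) is not None:
        return master_db[key]
    return replica_db[key]
```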
Single Server Improvements
• Performance optimizations
  – Allow the internal hash table to expand (rehash) automatically
  – Replace the single global lock used under multi-threading with finer-grained locks
  – Give each thread its own UDP port to avoid contention
• Adaptive Slab Allocator
  – Allows memory to move between slab classes. When the item a slab class is about to evict was used at least 20% more recently than the average, memory from the least recently used slab class is transferred to the class that needs it. The overall behavior approximates a global LRU.
• The Transient Item Cache
  – Some keys are set with short expiration times; after such a key expires but before it reaches the tail of the LRU, the memory it occupies is wasted. These keys are therefore placed in a separate circular buffer of linked lists (in effect a circular array indexed by expiration time, where each slot is a linked list). Every second, the items in the head bucket are evicted (a sketch of this structure appears after the Conclusion).
• Software Upgrades
  – Memcached's cached data is kept in System V shared memory regions so that it can survive software upgrades on the machine.

Memcache Workload
• Fanout
• Response size
• Pool Statistics
• Invalidation Latency

Conclusion
• Separating cache and persistent storage systems allows us to independently scale them
• Features that improve monitoring, debugging and operational efficiency are as important as performance
• Managing stateful components is operationally more complex than stateless ones. As a result, keeping logic in a stateless client helps iterate on features and minimize disruption.
• Simplicity is vital
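As a closing illustration, the Transient Item Cache described under Single Server Improvements can be sketched as a circular buffer of per-second buckets. The class and method names below are hypothetical and the structure is simplified for clarity; it is not memcached's actual data layout.

```python
import collections


class TransientItemCache:
    """Sketch of the transient item cache: a circular array of lists indexed
    by expiration time, drained one head bucket per second. Illustrative
    structure only, not memcached's internals."""

    def __init__(self, horizon_seconds=60):
        # One bucket per second of time-to-live; holds short-lived keys only.
        self.buckets = [collections.deque() for _ in range(horizon_seconds)]
        self.head = 0                                  # advances once per second

    def add(self, key, ttl_seconds):
        # Place the key in the bucket for its expiry second
        # (assumes ttl_seconds is shorter than the buffer's horizon).
        slot = (self.head + ttl_seconds) % len(self.buckets)
        self.buckets[slot].append(key)

    def tick(self, evict):
        # Called once per second: evict every item in the head bucket, so an
        # expired short-lived key frees memory before reaching the LRU tail.
        while self.buckets[self.head]:
            evict(self.buckets[self.head].popleft())
        self.head = (self.head + 1) % len(self.buckets)
```

A background task would call tick(cache.delete) once per second, which is how expired short-lived items are reclaimed promptly instead of lingering until they drift to the LRU tail.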