
Scaling Memcache
at Facebook
Presenter: Rajesh Nishtala (rajesh.nishtala@fb.com)
Co-authors: Hans Fugal, Steven Grimm, Marc
Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy,
Mike Paleczny, Daniel Peek, Paul Saab, David Stafford,
Tony Tung, Venkateshwaran Venkataramani
Infrastructure Requirements
for Facebook
1. Near real-time communication
2. Aggregate content on-the-fly from
multiple sources
3. Be able to access and update very popular
shared content
4. Scale to process millions of user requests
per second
Design Requirements
Support a very heavy read load
• Over 1 billion reads / second
• Insulate backend services from high read rates
Geographically Distributed
Support a constantly evolving product
• System must be flexible enough to support a variety of use cases
• Support rapid deployment of new features
Persistence handled outside the system
• Support mechanisms to refill after updates
Need more read capacity
• Two orders of magnitude more reads than writes
• Solution: Deploy a few memcache hosts to handle the read capacity
How do we store data?
• Demand-filled look-aside cache
• Common case is data is
available in the cache
[Diagram: web server, memcache, and databases. 1. get(key); 2. miss(key); 3. DB lookup; 4. set(key). A sketch of this flow follows.]
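A minimal sketch of the look-aside read path (the mc and db clients here are hypothetical stand-ins):

def lookaside_get(mc, db, key):
    value = mc.get(key)        # 1. get(key) from memcache
    if value is None:          # 2. miss(key)
        value = db.query(key)  # 3. DB lookup
        mc.set(key, value)     # 4. set(key) to fill the cache
    return value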
Handling updates
• Memcache needs to be invalidated after DB write
• Prefer deletes to sets
  • Idempotent
  • Demand filled
• Up to web application to specify which keys to invalidate after database update
[Diagram: web server: 1. database update; 2. delete from memcache.]
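The matching write path, again with hypothetical mc, db, and keys_for helpers:

def lookaside_update(mc, db, row):
    db.update(row)             # 1. database update
    for key in keys_for(row):  # web application picks the keys
        mc.delete(key)         # 2. delete: idempotent, refilled on demand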
While evolving our system we prioritize two major
design goals.
• Any change must impact a user-facing or operational issue.
Optimizations that have limited scope are rarely considered.
• We treat the probability of reading transient stale data as a
parameter to be tuned, similar to responsiveness. We are
willing to expose slightly stale data in exchange for insulating a
backend storage service from excessive load.
Roadmap
• Single front-end cluster
– Read heavy workload
– Wide fanout
– Handling failures
• Multiple front-end clusters
– Controlling data replication
– Data consistency
• Multiple Regions
– Data consistency
Single front-end cluster
• Reducing Latency: focusing on the memcache client
  – serialization
  – compression
  – request routing
  – request batching
• Parallel requests and batching
• Client-server communication:
– embed the complexity of the system into a stateless client rather than
in the memcached servers
– We rely on UDP for get requests, and on TCP (via mcrouter) for set and
delete requests.
• Incast congestion:
– Memcache clients implement flow-control mechanisms to limit incast
congestion (sketch below).
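One plausible shape for that client-side flow control, sketched as a sliding window that grows on success and shrinks on timeouts, loosely following the paper's TCP-like scheme (all names and constants here are illustrative):

class SlidingWindow:
    def __init__(self, max_window=300):
        self.window = 2            # max outstanding requests allowed
        self.outstanding = 0
        self.max_window = max_window

    def can_send(self):
        return self.outstanding < self.window

    def on_send(self):
        self.outstanding += 1

    def on_success(self):
        self.outstanding -= 1
        self.window = min(self.window + 1, self.max_window)  # grow slowly

    def on_timeout(self):
        self.outstanding -= 1
        self.window = max(self.window // 2, 1)  # back off sharply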
• Reducing Load
– Leases: a 64-bit token that memcache hands to a client on a read miss;
it can be used to solve two problems: stale sets and thundering herds
(see the sketch after these bullets).
– Stale sets: A stale set occurs when a web server sets a value in
memcache that does not reflect the latest value that should be
cached. This can occur when concurrent updates to memcache get
reordered.
– Thundering herd: happens when a specific key undergoes
heavy read and write activity.
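A sketch of how a client might use leases for both problems. A set is accepted only if it presents a valid token (preventing stale sets), and memcached regulates how often it hands out a token per key (the paper uses once per 10 seconds), so other readers wait briefly instead of stampeding the database. The get_with_lease and set_with_lease calls are hypothetical stand-ins for the lease-aware protocol:

import time

def lease_get(mc, db, key):
    while True:
        value, token = mc.get_with_lease(key)
        if value is not None:
            return value
        if token is not None:      # we hold the lease: refill the cache
            value = db.query(key)
            mc.set_with_lease(key, value, token)  # rejected if a delete
            return value                          # invalidated the token
        time.sleep(0.05)  # another client holds the lease; wait and retry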
• Stale values:
– When a key is deleted, its value is transferred to a data
structure that holds recently deleted items, where it lives
for a short time before being flushed. A get request can
return a lease token or data that is marked as stale.
• Memcache Pools: We designate one pool (named wildcard)
as the default and provision separate pools for keys whose
residence in wildcard is problematic.
• Replication Within Pools: We choose to replicate a
category of keys within a pool when:
  – the application routinely fetches many keys simultaneously
  – the entire data set fits in one or two memcached servers
  – the request rate is much higher than what a single server can manage
• Handling Failures
– There are two scales at which we must address failures:
• a small number of hosts are inaccessible due to a network or
server failure
• a widespread outage that affects a significant percentage of the
servers within the cluster
– Gutter: a small pool of otherwise idle machines (about 1% of the
cluster) that temporarily takes over for failed servers; Gutter
entries carry a short expiration so stale data ages out quickly
(sketch below).
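A sketch of the Gutter fallback on the read path; primary_for and gutter_pool are hypothetical routing helpers:

GUTTER_TTL = 10  # short expiration: entries age out instead of being invalidated

def get_with_gutter(key, db):
    try:
        return primary_for(key).get(key)
    except ConnectionError:            # primary host unreachable
        gutter = gutter_pool(key)
        value = gutter.get(key)
        if value is None:
            value = db.query(key)
            gutter.set(key, value, ttl=GUTTER_TTL)
        return value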
Multiple front-end clusters: Regions
• Region:
– We split our web and memcached servers into multiple
front-end clusters. These clusters, along with a storage
cluster that contains the databases, define a region.
– We trade replication of data for more independent failure
domains, tractable network configuration, and a reduction
of incast congestion
Databases invalidate caches
[Diagram: front-end clusters #1, #2, and #3, each with web servers and memcache (MC) hosts; a storage server runs MySQL with a commit log, from which McSqueal drives invalidations.]
• Cached data must be invalidated after database updates
• Solution: Tail the MySQL commit log and issue deletes based
on transactions that have been committed
  • Allows caches to be resynchronized in the event of a problem
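The McSqueal idea, sketched with hypothetical tail_commit_log and extract_delete_keys helpers:

def mcsqueal(commit_log, routers):
    for txn in tail_commit_log(commit_log):   # committed transactions only
        for key in extract_delete_keys(txn):  # keys recorded alongside the SQL
            routers.send_delete(key)          # fan out through mcrouter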
Invalidation pipeline
• Too many packets: instead of every database contacting every
memcache host, McSqueal daemons funnel deletes through a layer of
memcache routers
[Diagram: McSqueal on each DB, then memcache routers, then the MC hosts in each front-end cluster.]
• Aggregating deletes reduces packet rate by 18x
• Makes configuration management easier
• Each stage buffers deletes in case a downstream component is down
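A sketch of the buffering-and-batching behavior at each stage of the pipeline (names and the batch size are illustrative):

class DeleteBatcher:
    def __init__(self, downstream, max_batch=64):
        self.downstream = downstream
        self.max_batch = max_batch
        self.buffer = []   # also allows replay if downstream is briefly down

    def enqueue(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.downstream.send_batch(self.buffer)  # one packet, many deletes
            self.buffer = []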
• Regional Pools
– We can reduce the number of replicas by having multiple
front-end clusters share the same set of memcached
servers. We call this a regional pool.
• Cold Cluster Warmup
– A system called Cold Cluster Warmup mitigates the poor hit rate
of a newly provisioned front-end cluster by allowing clients in the
“cold cluster” to retrieve data from a “warm cluster” rather than
the persistent storage.
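The cold-cluster read path, sketched with hypothetical cold_mc, warm_mc, and db clients:

def warmup_get(cold_mc, warm_mc, db, key):
    value = cold_mc.get(key)
    if value is None:
        value = warm_mc.get(key)   # borrow the warm cluster's hit rate
        if value is None:
            value = db.query(key)  # last resort: persistent storage
        cold_mc.set(key, value)    # fill the cold cluster as we go
    return value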
Across Regions: Geographically distributed
clusters
• One region to hold the master databases and
the other regions to contain read-only replicas;
• Advantages
– putting web servers closer to end users can significantly
reduce latency
– geographic diversity can mitigate the effects of events such
as natural disasters or massive power failures
– new locations can provide cheaper power and other
economic incentives
Geographically distributed clusters
[Diagram: one master region and two read-only replica regions.]
Writes in non-master regions
• Database updates go directly to the master
• Race between DB replication and a subsequent DB read
[Diagram: a web server in a replica region: 1. write to master DB; 2. delete from memcache; 3. MySQL replication lags while a read misses in memcache and queries the replica DB; 4. the web server sets a potentially stale value into memcache. Race!]
Remote markers
Set a special flag that indicates whether a race is likely
Read miss path:
If marker set
read from master DB
else
read from replica DB
[Diagram: 1. set remote marker; 2. write to master DB; 3. delete from memcache; 4. MySQL replication; 5. delete remote marker once the write reaches the replica.]
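A sketch of both halves of the remote-marker protocol; mc, master_db, and replica_db are hypothetical clients:

def write_from_replica_region(mc, master_db, key, value):
    marker = "remote:" + key
    mc.set(marker, 1)            # 1. set remote marker
    master_db.write(key, value)  # 2. write to master; replication later
                                 #    deletes the marker (step 5)
    mc.delete(key)               # 3. delete from memcache

def read_on_miss(mc, master_db, replica_db, key):
    if mc.get("remote:" + key):  # replication may still lag: race likely
        return master_db.query(key)
    return replica_db.query(key)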
Single Server Improvements
• Performance Optimizations:
– Allow the internal hash table to grow (rehash) automatically
– Replace the single global lock used under multi-threading with finer-grained locks
– Give each thread its own UDP port to avoid contention
• Adaptive Slab Allocator
– Lets slab classes exchange memory: when the item about to be evicted
from a slab class has been used at least 20% more recently than the
average least recently used item of the other slab classes, memory is
taken from the least recently used slab class and given to the needy
one. Overall the idea approximates a global LRU.
• The Transient Item Cache
– Some keys are set with short expiration times; once such a key
expires, its memory is wasted until the item drifts to the tail of the
LRU. These keys are therefore placed in a circular buffer of linked
lists (effectively a circular array whose slots are linked lists,
indexed by seconds until expiration); every second, the items in the
head bucket are evicted (sketch below).
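A sketch of that circular buffer, with an illustrative 60-second horizon (memcached uses linked lists per slot; plain lists stand in here):

BUCKETS = 60  # one slot per second until expiration

class TransientItemCache:
    def __init__(self):
        self.ring = [[] for _ in range(BUCKETS)]
        self.head = 0  # bucket holding items that expire this second

    def add(self, key, ttl_seconds):   # assumes ttl_seconds < BUCKETS
        self.ring[(self.head + ttl_seconds) % BUCKETS].append(key)

    def tick(self, evict):             # called once per second
        for key in self.ring[self.head]:
            evict(key)                 # reclaim memory immediately
        self.ring[self.head] = []
        self.head = (self.head + 1) % BUCKETS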
• Software Upgrades
– Memcached keeps its cached data in System V shared memory regions,
so the software on a machine can be upgraded without losing the cache.
Memcache Workload
• Fanout
• Response size
• Pool Statistics
• Invalidation Latency
Conclusion
• Separating cache and persistent storage systems allows us to
independently scale them
• Features that improve monitoring, debugging and operational
efficiency are as important as performance
• Managing stateful components is operationally more complex
than stateless ones. As a result, keeping logic in a stateless
client helps iterate on features and minimize disruption.
• Simplicity is vital