Finding a needle in Haystack
Facebook’s Photo Storage
Shakthi Bachala
Outline
• Scenario
• Goal
• Problem
• Previous Approach
• Current Approach
• Evaluation
• Advantages
• Critique
• Conclusion
Scenario:

                 April 2009                            October 2011
Total            15 billion photos                     65 billion photos
                 4 x 15 billion = 60 billion images    4 x 65 billion = 260 billion images
                 1.5 petabytes of data                 20 petabytes of data
Upload Rate      220 million photos / week             1 billion photos / week
                 25 terabytes of data                  60 terabytes of data
Serving Rate     550,000 images / sec                  1 million images / sec

(The 4x factor reflects that each photo is stored in 4 different sizes.)
Goal
• High throughput and low latency
• Fault-tolerant
• Cost-effective
• Simple
Previous Approach: Typical Design for Photo Sharing
Previous Approach: NFS-based Design for Photo Sharing at Facebook
Previous Approach – NFS-based Design
• The traditional file system architecture performs poorly under Facebook's kind of workload
• NFS-based design: the CDN effectively serves the hottest photos (profile pictures and recently uploaded photos), but Facebook also generates many requests for less popular images (long-tail images), which are not handled by the CDN
• A typical website sees a roughly 99% CDN hit rate, but Facebook's was only around 80%
Long Tail Issue
Previous Approach (cont.)
Problems with that approach were:
• Wasted storage capacity due to metadata
  – Large amount of metadata per file
  – Each image stored as a separate file
• Large number of disk operations per read
  – Caused by large directories containing thousands of files
  – Restructuring from large directories to smaller ones brought the disk operations per read down from approximately 10 to about 2.5-3.0
Current Approach: Haystack Architecture
Current Approach: Haystack Components
The main components of the Haystack architecture are:
1. Haystack Directory
2. Haystack Cache
3. Haystack Store
Current Approach: Haystack Directory
The main goals of the Directory are:
• Map logical volumes to physical volumes
  – 3 physical volumes (on 3 different nodes) per logical volume
• Load balancing
  – Writes across logical volumes
  – Reads across physical volumes (any of the 3 Store machines)
• Caching strategy: decide whether a photo request should be handled by the CDN or by the Cache
  – URL generation (see the sketch below):
    http://<CDN>/<Cache>/<Node>/<Logical volume id, Image id>
• The Directory also identifies logical volumes that are read-only, either for operational reasons or because they have reached their storage capacity
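A minimal sketch of the Directory's roles described above: volume mapping, load balancing, and URL construction. The class Directory and the helpers pick_write_volume, pick_read_replica, and make_url are illustrative assumptions, not Facebook's actual implementation; only the URL format comes from the slide.

```python
import random

# Illustrative sketch of the Haystack Directory. Names are hypothetical.

class Directory:
    def __init__(self):
        # logical volume id -> list of 3 (node, physical volume) replicas
        self.logical_to_physical = {}
        self.writable = set()        # logical volumes still accepting writes

    def pick_write_volume(self):
        # Load-balance writes across write-enabled logical volumes.
        return random.choice(sorted(self.writable))

    def pick_read_replica(self, logical_id):
        # Load-balance reads across the 3 physical replicas.
        return random.choice(self.logical_to_physical[logical_id])

    def make_url(self, cdn, cache, logical_id, photo_id, use_cdn=True):
        # URL format from the slide:
        # http://<CDN>/<Cache>/<Node>/<Logical volume id, Image id>
        node, _ = self.pick_read_replica(logical_id)
        prefix = f"http://{cdn}/" if use_cdn else "http://"
        return f"{prefix}{cache}/{node}/{logical_id},{photo_id}"
```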
Current Approach: Haystack Cache
• Approach (see the sketch below):
  – The Cache receives HTTP requests for photos from browsers or CDNs
  – It is a distributed hash table that uses the photo id as the key to locate cached data
  – If the photo id misses in the Cache, the Cache fetches the photo from the Store (photo server) and returns it to the browser or CDN, depending on who made the request
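A minimal sketch of that lookup path, assuming a plain in-process dict in place of the real distributed hash table; fetch_from_store and the cacheable flag are hypothetical stand-ins for the RPC to a Store machine and for the caching policy on the next slide.

```python
# Minimal sketch of the Cache lookup path. A plain dict stands in for the
# real distributed hash table; fetch_from_store is a hypothetical RPC to a
# Store machine, and `cacheable` is decided by the policy on the next slide.

class HaystackCache:
    def __init__(self, fetch_from_store):
        self.table = {}                      # photo id -> photo bytes
        self.fetch_from_store = fetch_from_store

    def get(self, photo_id, logical_volume, cacheable):
        data = self.table.get(photo_id)
        if data is None:
            # Miss: fetch from the Store, then reply to the browser/CDN.
            data = self.fetch_from_store(logical_volume, photo_id)
            if cacheable:
                self.table[photo_id] = data
        return data
```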
Current Approach: Haystack Cache
A photo is cached only if it satisfies the following two conditions (sketched below):
• The request comes directly from a user rather than from a CDN
  – Facebook's experience with the NFS-based design showed that post-CDN caching is ineffective: a request that misses in the CDN is unlikely to hit in the internal cache
• The photo is fetched from a write-enabled Store machine
  – Photos are accessed most heavily soon after they are uploaded
  – File systems generally perform better when doing either writes or reads, but not both
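A sketch of how those two conditions might combine into the cacheable flag used above; request_from_cdn and volume_is_writable are assumed inputs that the real system would derive from the request path and the Directory.

```python
# Sketch of the two caching conditions as a single predicate; the inputs
# request_from_cdn and volume_is_writable are illustrative assumptions.

def cacheable(request_from_cdn: bool, volume_is_writable: bool) -> bool:
    # 1. Cache only requests that came directly from a user, not via a CDN.
    # 2. Cache only photos on write-enabled volumes: these are young photos
    #    that are likely to be read again soon.
    return (not request_from_cdn) and volume_is_writable
```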
Current Approach: Haystack Cache Hit Rate
Current Approach: Haystack Store
• Replaces the storage and photo server layers of the NFS-based design with the following structure:
Current Approach: Haystack Store
• Storage:
  – 12 x 1 TB SATA drives, RAID-6
• Filesystem:
  – A single approx. 10 TB XFS filesystem
• Haystack:
  – Log-structured, append-only object store containing needles as object abstractions (see the needle sketch below)
  – 100 haystacks per node, each 100 GB in size
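A rough sketch of what one needle record in the append-only haystack file might look like, loosely following the field order described in the Haystack paper (header magic, cookie, key, alternate key, flags, size, data, footer magic, checksum). The magic numbers and field widths here are assumptions for illustration.

```python
import struct
import zlib
from dataclasses import dataclass

# Illustrative needle record for the append-only haystack file. Field order
# loosely follows the paper; magic numbers and field widths are assumptions.

HEADER_MAGIC = 0xFACEB00C
FOOTER_MAGIC = 0xC00BECAF
FLAG_DELETED = 0x01

@dataclass
class Needle:
    cookie: int      # random value embedded in the photo URL
    key: int         # photo id
    alt_key: int     # e.g. which of the stored image sizes
    flags: int
    data: bytes

    def pack(self) -> bytes:
        header = struct.pack("<IQQQBI", HEADER_MAGIC, self.cookie,
                             self.key, self.alt_key, self.flags,
                             len(self.data))
        footer = struct.pack("<II", FOOTER_MAGIC, zlib.crc32(self.data))
        return header + self.data + footer
```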
Current Approach: Haystack Store File
Current Approach: Operations in Haystack
• Photo Read (sketched below)
  – Look up the offset/size of the image in the in-core index
  – Read the data (approx. 1 disk operation)
• Photo Write (sketched below)
  – Asynchronously append images one by one to the haystack file
  – Move on to the next haystack file when the current one becomes full
  – Asynchronously append index records to the index file
  – Flush the index file if there are too many dirty index records
  – Update the in-core index
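A minimal sketch of the read and write paths against a single haystack file, under the simplifying assumption that the in-core index is a plain dict mapping (key, alt_key) to (offset, length); Needle is the record sketch above, and the asynchronous index-file handling is only noted in comments.

```python
import os

# Sketch of read/write against one haystack file. The in-core index is a
# plain dict here; the real Store also maintains an on-disk index file.

class HaystackVolume:
    def __init__(self, path):
        self.index = {}                              # (key, alt_key) -> (offset, length)
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_APPEND, 0o644)

    def write(self, needle):                         # needle: Needle from the sketch above
        record = needle.pack()
        offset = os.lseek(self.fd, 0, os.SEEK_END)   # current end of the log
        os.write(self.fd, record)                    # append-only write
        self.index[(needle.key, needle.alt_key)] = (offset, len(record))
        # The real Store also appends a record to the index file asynchronously
        # and flushes it once too many index records are dirty.

    def read(self, key, alt_key):
        offset, length = self.index[(key, alt_key)]  # in-core index lookup
        return os.pread(self.fd, length, offset)     # ~1 disk operation
```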
Current Approach: Operations in Haystack
• Photo Delete (sketched below)
  – Look up the offset of the image in the in-core index
  – Mark the needle's flags as "DELETED"
  – Update the in-core index
• Index File:
  – Provides the minimum metadata needed to locate a needle in the Haystack Store
  – A subset of the needle header metadata
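A sketch of delete and of an on-disk index record, continuing the assumptions of the earlier sketches; the flags-byte offset and the index record layout are illustrative, not the actual on-disk format.

```python
import os
import struct

# Sketch of delete plus an index-file record, under the field layout assumed
# in the needle sketch above (magic, cookie, key, alt_key, flags, size).

FLAG_DELETED = 0x01
INDEX_RECORD = struct.Struct("<QQQI")            # key, alt_key, offset, length

def delete_photo(fd, index, key, alt_key):
    offset, length = index[(key, alt_key)]
    flags_offset = offset + 4 + 8 + 8 + 8        # flags byte inside the header
    os.pwrite(fd, bytes([FLAG_DELETED]), flags_offset)
    index[(key, alt_key)] = (offset, length)     # in-core index keeps the entry

def index_record(key, alt_key, offset, length):
    # Just enough header metadata to rebuild the in-core index at startup
    # without scanning the whole haystack file.
    return INDEX_RECORD.pack(key, alt_key, offset, length)
```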
Current Approach: Haystack Index File
Haystack Based Design - Photo Upload
Haystack Based Design - Photo Download
Current Approach: Operations in Haystack
• Filesystem:
  – Haystack uses XFS, an extent-based file system
• XFS has two main advantages:
  – The block maps for several contiguous large files can be small enough to be stored in main memory
  – XFS provides efficient file preallocation, mitigating fragmentation and reining in how large block maps can grow (see the sketch below)
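A small sketch of preallocating a haystack file so the filesystem can reserve contiguous extents up front; it assumes a Linux host (os.posix_fallocate) and reuses the 100 GB per-haystack size from the Store slide.

```python
import os

# Preallocate a haystack file so XFS can reserve large contiguous extents
# up front (Linux-only; os.posix_fallocate wraps posix_fallocate(3)).
# The 100 GB size matches the Store slide earlier.

HAYSTACK_SIZE = 100 * 1024**3        # 100 GB per haystack

def create_haystack(path):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, HAYSTACK_SIZE)   # reserve the space now
    finally:
        os.close(fd)
```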
Current Approach: Haystack Optimization
• Compaction (sketched below):
  – An infrequent online operation
  – Creates a copy of the haystack, skipping duplicate and deleted photos
  – Deletes follow a pattern similar to photo views: young photos are much more likely to be deleted
  – Over the last year, about 25% of photos were deleted
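A sketch of compaction under the assumptions of the earlier sketches: keep only the newest non-deleted needle per (key, alt_key) and copy it into a fresh haystack. The old_needles iterator and write_to_new_volume callback are hypothetical.

```python
# Compaction sketch: later entries win (dropping duplicates), and needles
# marked DELETED are skipped when copying into the new haystack file.

FLAG_DELETED = 0x01

def compact(old_needles, write_to_new_volume):
    # old_needles yields (key, alt_key, flags, record_bytes) in log order.
    latest = {}
    for key, alt_key, flags, record in old_needles:
        latest[(key, alt_key)] = (flags, record)
    for flags, record in latest.values():
        if flags & FLAG_DELETED:
            continue                       # drop deleted photos
        write_to_new_volume(record)        # copy surviving needle
```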
Current Approach: Haystack Optimization
• Saving More Memory (sketched below):
  – The following two techniques reduced the Store machines' main-memory footprint by about 20%
  – Eliminate the in-memory representation of flags by setting the offset to 0 for deleted photos
  – Store machines do not keep cookie values in main memory; instead they check the supplied cookie after reading the needle from disk
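A sketch of how those two savings might look against the in-core index used in the earlier sketches; read_needle is a hypothetical disk read that returns a parsed needle.

```python
# Sketch of the two memory savings: no flags or cookies in memory. A deleted
# photo is an index entry whose offset is 0, and the request cookie is
# checked against the cookie read back from disk.

def mark_deleted_in_memory(index, key, alt_key):
    _, length = index[(key, alt_key)]
    index[(key, alt_key)] = (0, length)        # offset 0 means "deleted"

def read_checked(index, key, alt_key, request_cookie, read_needle):
    offset, length = index.get((key, alt_key), (0, 0))
    if offset == 0:
        return None                            # deleted or unknown photo
    needle = read_needle(offset, length)       # hypothetical disk read
    if needle.cookie != request_cookie:
        return None                            # wrong cookie: reject request
    return needle.data
```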
Current Approach: Haystack Optimization
• Batch Uploads:
  – Disks perform better with large sequential writes than with many small random writes, so Facebook batches uploads whenever possible
  – Many users upload entire albums rather than individual pictures, which provides a natural opportunity to batch the uploads
Evaluation – Data
Evaluation – Production Workload
Advantages
• Simple design
• Decreases the number of disk operations by reducing the average metadata per photo
• The system is robust enough to handle a very large amount of data
• Fault tolerant
Critique
• I think this approach is very Facebook-specific.
• Any other critiques?
Conclusion
• Facebook built a simple but robust storage system for its photos that accommodates the long tail of photo requests, which previous approaches could not handle.
References
1. http://www.usenix.org/event/osdi10/tech/
2. http://www.cs.uiuc.edu/class/sp11/cs525/sched.htm