Finding a needle in Haystack: Facebook’s Photo Storage
Shakthi Bachala

Outline
• Scenario
• Goal
• Problem
• Previous Approach
• Current Approach
• Evaluation
• Advantages
• Critique
• Conclusion

Scenario

                April 2009                                 October 2011
Total           15 billion photos                          65 billion photos
                4 × 15 billion images = 60 billion images  4 × 65 billion images = 260 billion images
                1.5 petabytes of data                      20 petabytes of data
Upload rate     220 million photos / week                  1 billion photos / week
                25 terabytes of data                       60 terabytes of data
Serving rate    550,000 images / sec                       1 million images / sec

Goal
• High throughput and low latency
• Fault-tolerant
• Cost-effective
• Simple

Previous Approach: Typical design for photo sharing

Previous Approach: NFS-based design for photo sharing at Facebook

Previous Approach: NFS-based design
• A traditional file system architecture performs poorly under Facebook's kind of workload
• NFS-based design: the CDN effectively serves the hottest photos (profile pictures and recently uploaded photos), but Facebook also generates many requests for less popular images (long-tail images), which the CDN does not handle
• A typical website had a 99% CDN hit rate, but Facebook had only around 80%

Long Tail Issue

Previous Approach (cont.)
Problems with that approach:
• Storage capacity wasted on metadata
  – Large metadata per file
  – Each image stored as a file
• Large number of disk operations for reads
  – Caused by large directories containing thousands of files
  – Restructuring from large directories to small ones brought the I/O operations per read down from approximately 10 to 2.5-3.0

Current Approach: Haystack Architecture

Current Approach: Haystack Components
The main components of the Haystack architecture are:
1. Haystack Directory
2. Haystack Cache
3. Haystack Store

Current Approach: Haystack Directory
The main goals of the Directory are:
• Map logical volumes to physical volumes
  – 3 physical volumes (on 3 nodes) per logical volume
• Load balance
  – Writes across logical volumes
  – Reads across physical volumes (any of the 3 stores)
• Caching strategy: decide whether a photo request should be handled by the CDN or by the Cache
  – URL generation: http://<CDN>/<Cache>/<Node>/<Logical volume id, Image id>
• Identify the logical volumes that are read-only, either for operational reasons or because they have reached their storage capacity

Current Approach: Haystack Cache
• Approach:
  – The Cache receives HTTP requests for photos from browsers or CDNs
  – It is a distributed hash table that uses the photo id as the key to locate cached data
  – If the photo id is missing from the Cache, the Cache fetches the photo from the Store and returns it to the browser or CDN, depending on where the request came from

Current Approach: Haystack Cache
The Cache stores a photo only if both of the following conditions hold:
• The request comes directly from a user rather than from the CDN
  – Facebook’s experience with the NFS-based design showed that post-CDN caching is ineffective: a request that misses in the CDN is unlikely to hit in the internal cache
• The photo is fetched from a write-enabled Store machine
  – Photos are most heavily accessed soon after they are uploaded
  – File systems generally perform better when doing either reads or writes, but not both

Current Approach: Haystack Cache Hit Rate

Current Approach: Haystack Store
• Replaces the storage and photo server layers of the NFS-based design with this structure:

Current Approach: Haystack Store
• Storage:
  – 12 x 1 TB SATA, RAID6
• Filesystem:
  – Single XFS filesystem of approx. 10 TB
• Haystack:
  – Log-structured, append-only object store containing needles as object abstractions
  – 100 haystacks per node, each 100 GB in size

Current Approach: Haystack Store File

Current Approach: Operations in Haystack
• Photo Read
  – Look up the offset and size of the image in the in-core index
  – Read the data (approx. 1 I/O operation)
• Photo Write
  – Asynchronously append images one by one to the haystack file
  – Move on to the next haystack file when the current one becomes full
  – Asynchronously append index records to the index file
  – Flush the index file if there are too many dirty index records
  – Update the in-core index

Current Approach: Operations in Haystack
• Photo Delete
  – Look up the offset of the image in the in-core index
  – Mark the image's needle flag as “DELETED”
  – Update the in-core index
• Index File:
  – Provides the minimum metadata needed to locate a needle in the Haystack store
  – Holds a subset of the needle header metadata

Current Approach: Haystack Index File

Haystack-Based Design: Photo Upload

Haystack-Based Design: Photo Download

Current Approach: Operations in Haystack
• Filesystem:
  – Haystack uses XFS, an extent-based file system
• It has two main advantages:
  – The block maps for several contiguous large files can be small enough to be stored in main memory
  – XFS provides efficient file preallocation, mitigating fragmentation and reining in how large block maps can grow

Current Approach: Haystack Optimization
• Compaction:
  – Infrequent online operation
  – Creates a copy of a haystack, skipping duplicates and deleted photos
  – Deletes follow a pattern similar to photo views: young photos are much more likely to be deleted
  – Over the last year, about 25% of the photos were deleted

Current Approach: Haystack Optimization
• Saving More Memory:
  – With the following two techniques, Store machines reduced their main memory footprint by 20%
  – Eliminate the need for an in-memory representation of flags by setting the offset to 0 for deleted photos
  – Store machines do not keep track of cookie values in main memory; instead they check the supplied cookie after reading the needle from disk

Current Approach: Haystack Optimization
• Batch Uploads:
  – Disks perform better with large sequential writes than with small random writes, so Facebook batches uploads whenever possible
  – Many users upload entire albums to Facebook rather than individual pictures, which gives an opportunity to batch the uploads

Evaluation: Data

Evaluation: Production Workload

Advantages
• Simple design
• Fewer disk operations, achieved by reducing the average metadata per photo
• Robust enough to handle a very large amount of data
• Fault-tolerant

Critique
• The approach seems very Facebook-specific
• Any others?

Conclusion
• Built a simple but robust storage mechanism for Facebook photos that accommodates the long tail of photo requests, which previous approaches could not handle

References
1. http://www.usenix.org/event/osdi10/tech/
2. http://www.cs.uiuc.edu/class/sp11/cs525/sched.htm
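
The Store operations listed in the slides (append on write, in-core index lookup on read, flag on delete, copy on compaction) can be sketched in a few lines of Python. This is only an illustrative toy, not Facebook's implementation. The class name `Haystack`, its method names, and the simplified needle header (key, cookie, flags, size) are hypothetical, and a `bytearray` stands in for the on-disk haystack file. One deliberate simplification: a delete removes the key from the in-core index, whereas the slides describe setting the in-memory offset to 0.

```python
import struct

# Simplified, hypothetical needle header: key, cookie, flags, data size.
# The real needle format carries more fields; this is an assumption for the sketch.
HEADER = struct.Struct("<QQBI")
FLAG_DELETED = 1


class Haystack:
    """Toy append-only store; a bytearray stands in for one haystack file."""

    def __init__(self):
        self.log = bytearray()  # stand-in for the on-disk haystack file
        self.index = {}         # in-core index: key -> (offset, size)

    def write(self, key, cookie, data):
        """Append a needle and point the in-core index at it."""
        offset = len(self.log)
        self.log += HEADER.pack(key, cookie, 0, len(data)) + data
        # Rewriting an existing key simply supersedes the older needle.
        self.index[key] = (offset, len(data))

    def read(self, key, cookie):
        """Look up offset/size in memory, then read the needle once."""
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, size = entry
        _, stored_cookie, flags, _ = HEADER.unpack_from(self.log, offset)
        # The cookie is checked only after the needle has been read.
        if flags & FLAG_DELETED or stored_cookie != cookie:
            return None
        start = offset + HEADER.size
        return bytes(self.log[start:start + size])

    def delete(self, key):
        """Set the DELETED flag in place and update the in-core index."""
        entry = self.index.pop(key, None)
        if entry is None:
            return False
        offset, _ = entry
        k, cookie, flags, size = HEADER.unpack_from(self.log, offset)
        HEADER.pack_into(self.log, offset, k, cookie, flags | FLAG_DELETED, size)
        return True

    def compact(self):
        """Copy live needles to a fresh store, skipping duplicates and deletes."""
        fresh = Haystack()
        for key, (offset, size) in self.index.items():
            _, cookie, _, _ = HEADER.unpack_from(self.log, offset)
            start = offset + HEADER.size
            fresh.write(key, cookie, bytes(self.log[start:start + size]))
        return fresh
```

In this sketch, writing the same key twice and then deleting another key leaves dead needles in `log`; calling `compact()` returns a new store whose log holds only the latest live needle per key, which is the space reclamation the Compaction slide describes.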