Slide 1: BlueDBM: An Appliance for Big Data Analytics
Sang-Woo Jun*, Ming Liu*, Sungjin Lee*, Jamey Hicks+, John Ankcorn+, Myron King+, Shuotao Xu*, Arvind*
*MIT Computer Science and Artificial Intelligence Laboratory   +Quanta Research Cambridge
ISCA 2015, Portland, OR. June 15, 2015
This work is funded by Quanta, Samsung, and Lincoln Laboratory. We also thank Xilinx for their hardware and expertise donations.

Slide 2: Big data analytics
Analysis of previously unimaginable amounts of data can provide deep insight:
- Google has predicted flu outbreaks a week earlier than the Centers for Disease Control and Prevention (CDC)
- Analyzing a personal genome can determine predisposition to diseases
- Social network chatter analysis can identify political revolutions before newspapers
- Scientific datasets can be mined to extract accurate models
Big data analytics is likely to be the biggest economic driver for the IT industry for the next decade.

Slide 3: A currently popular solution: RAM Cloud
A cluster of machines with large DRAM capacity and a fast interconnect:
+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn't fit in DRAM
What if enough DRAM isn't affordable? Flash-based solutions may be a better alternative:
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- Slower than DRAM
- The legacy storage access interface is a burden

Slide 4: Related work
- Use of flash: SSDs, FusionIO, Pure Storage, ZetaScale; SSDs for the database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
- Networks: QuickSAN [ISCA 2013]; Hadoop/Spark on InfiniBand RDMA [SC 2012]
- Accelerators: SmartSSD [SIGMOD 2013], Ibex [VLDB 2014], Catapult [ISCA 2014], GPUs

Slide 5: Latency profile of distributed flash-based analytics
Distributed processing involves many system components:
- Flash device access
- Storage software (OS, FTL, ...)
- Network interface (10GbE, InfiniBand, ...)
- Actual processing
[Latency breakdown figure: Flash Access 75 μs; Storage Software 100~1000 μs; Network 20~1000 μs; Processing 50~100 μs]
Latency is additive.

Slide 6: Latency profile of distributed flash-based analytics
Architectural modifications can remove unnecessary overhead:
- Near-storage processing
- Cross-layer optimization of flash management software*
- A dedicated storage area network
- Accelerators
[Latency breakdown figure: the 75 μs flash access remains; with these modifications the storage software and network components shrink to under 20 μs]
These ideas are difficult to explore using flash packaged as off-the-shelf SSDs. (A rough latency-budget sketch follows.)
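The figures on slides 5 and 6 can be combined into a back-of-the-envelope budget. Below is a minimal sketch, assuming one representative value per component; the specific per-component numbers and the "after modification" values are illustrative assumptions drawn from the slide's ranges, not measurements from the paper.

```cpp
// Illustrative latency budget for one remote flash access, before and after the
// architectural changes listed on slide 6. All figures are assumed round numbers.
#include <cstdio>

int main() {
    // Conventional distributed-SSD stack (microseconds)
    double flash      = 75;   // raw flash access
    double software   = 100;  // OS / FTL / storage stack
    double network    = 20;   // commodity network hop
    double processing = 100;  // query processing on the host

    // After the modifications (assumed values): cross-layer flash management and a
    // dedicated controller-to-controller network cut the middle layers to a few
    // microseconds, and an in-storage accelerator absorbs part of the processing.
    double software_opt   = 5;
    double network_opt    = 1;   // the controller links run at roughly 0.5 us per hop
    double processing_acc = 50;

    std::printf("conventional: %.0f us\n", flash + software + network + processing);             // 295 us
    std::printf("modified:     %.0f us\n", flash + software_opt + network_opt + processing_acc); // 131 us
    // In the modified stack the raw flash access dominates, which is the point of
    // the argument: the device, not the software, should set the latency floor.
    return 0;
}
```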
Slide 7: A custom flash card had to be built
[Board diagram: an Artix-7 FPGA managing a flash array (chips on both sides) over four buses (Bus 0-3), with network ports and an HPC FMC port connecting to the VC707]

Slide 8: BlueDBM: a platform with near-storage processing and inter-controller networks
- 20 24-core Xeon servers
- 20 BlueDBM storage devices, each with:
  - 1 TB of flash storage
  - 4x 20 Gbps controller network links
  - A Xilinx VC707 FPGA board
  - A 2 GB/s PCIe link to the host

Slide 9: BlueDBM: a platform with near-storage processing and inter-controller networks
[Photograph: one of the two racks (10 nodes) and a BlueDBM storage device; specifications as on the previous slide]

Slide 10: BlueDBM node architecture
[Block diagram: the flash device contains the flash controller, in-storage processor, and network interface, and connects over PCIe to the host server]
- Lightweight flash management with very low overhead: the flash controller adds almost no latency
- A custom network protocol with low latency and high bandwidth: 4x 20 Gbps links at 0.5 us latency, virtual channels with flow control, and ECC support
- Software has very low-level access to flash storage, so high-level information can be used for low-level management (the FTL is implemented inside the file system)
- No time to go into the gritty details!

Slide 11: BlueDBM software view
[Software stack diagram: userspace hardware-assisted applications; kernelspace file system, block device driver, and Connectal proxy (generated by Connectal, by Quanta); on the FPGA, a Connectal wrapper around the flash controller, hardware accelerator, accelerator manager, and network interface, in front of the NAND flash]
BlueDBM provides a generic file system interface as well as an accelerator-specific interface (aided by Connectal).

Slide 12: Power consumption is low
Storage device:
  Component          Power (Watts)
  VC707              30
  Flash board (x2)   10
  Total              40
Node:
  Component          Power (Watts)
  Storage device     40
  Xeon server        200+
  Total              240+
The storage device power consumption is a very conservative estimate. A GPU-based accelerator would double the node's power.

Slide 13: Applications
- Content-based image search*: faster flash with accelerators as a replacement for DRAM-based systems
- BlueCache, an accelerated memcached*: a dedicated network and accelerated caching with larger capacity
- Graph analytics: the benefit of lower-latency access into distributed flash for computation on large graphs
* Results obtained since the paper submission

Slide 14: Content-based image retrieval
- Takes a query image and returns similar images from a dataset of tens of millions of pictures
- Image similarity is determined by measuring the distance between the histograms of each image (a small sketch of this comparison appears after the image-search results below)
- The histogram is generated from RGB, HSV, "edgeness", etc.
- Better algorithms are available!

Slide 15: Image search accelerator (Sang-Woo Jun, Chanwoo Chung)
[Pipeline diagram: Flash -> Flash Controller -> Sobel Filter -> Histogram Generator -> Comparator (fed the query histogram) -> Software; all but the last stage run inside the FPGA]

Slide 16: Image query performance without sampling
[Chart: BlueDBM + FPGA vs. BlueDBM + CPU (CPU bottleneck) vs. an off-the-shelf M.2 SSD]
Faster flash with acceleration can perform at DRAM speed.

Slide 17: Sampling to improve performance
- Intelligent sampling methods (e.g., locality-sensitive hashing) improve performance by dramatically reducing the search space
- But they introduce a random access pattern
[Diagram: a locality-sensitive hash table pointing into the data; the accesses corresponding to a single hash-table entry result in many random accesses]

Slide 18: Image query performance with sampling
[Chart]
A disk-based system cannot take advantage of the reduced search space.
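To make the similarity measure from slide 14 concrete, here is a minimal host-side sketch. It assumes a histogram is simply a normalized vector of bin counts and uses the L1 distance between two such vectors; the bin count and the exact distance metric are illustrative assumptions, since the accelerator builds its histograms (RGB, HSV, Sobel "edgeness") inside the FPGA.

```cpp
// A minimal sketch of histogram-based image comparison. Bin layout and metric
// are assumptions for illustration, not the exact formulation used in the paper.
#include <array>
#include <cmath>
#include <cstdio>

constexpr int kBins = 64;                    // assumed number of bins per histogram
using Histogram = std::array<float, kBins>;  // normalized so the bins sum to 1.0

// Smaller distance means the two images are considered more similar.
float l1_distance(const Histogram& a, const Histogram& b) {
    float d = 0.0f;
    for (int i = 0; i < kBins; ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

int main() {
    Histogram query{};  query[0] = 0.5f; query[1] = 0.5f;   // toy query histogram
    Histogram cand{};   cand[0]  = 0.4f; cand[1]  = 0.6f;   // toy candidate histogram
    std::printf("distance = %.2f\n", l1_distance(query, cand));  // prints 0.20
    return 0;
}
```

Without sampling, this comparison is run against every image in the dataset, which is why the pipeline is bandwidth-bound and benefits from the in-storage accelerator; with sampling (slide 17), only the hashed candidates are compared, which is why the workload becomes random-access-bound instead.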
Slide 19: The memcached service
- A distributed in-memory key-value store that caches database results indexed by query strings
- Accessed via socket communication
- Uses system DRAM for caching (~256 GB)
- Extensively used by database-driven websites: Facebook, Flickr, Twitter, Wikipedia, YouTube, ...
[Diagram: browsers and mobile apps send web requests to application servers, which issue memcached requests to the memcached servers; the memcached response and data return along the same path]
Networking contributes about 90% of the overhead.

Slide 20: BlueCache: an accelerated memcached service (Shuotao Xu)
[Diagram: web servers attached over PCIe to BlueCache accelerators, each with a flash controller, 1 TB of flash, and a network port, linked by the inter-controller network]
- A memcached server implemented in hardware: hashing and flash management are implemented in the FPGA
- 1 TB of hardware-managed flash cache per node
- The hardware server is accessed via local PCIe
- A direct network connects the hardware controllers

Slide 21: Effect of the architectural modification (no flash, only DRAM)
[Chart: GET throughput, key size = 64 bytes, value size = 64 bytes; BlueCache reaches 4012 KOps/s vs. 357 KOps/s for local memcached and 273 KOps/s for remote memcached, about an 11x improvement]
- PCIe DMA and the inter-controller network reduce access overhead
- FPGA acceleration of memcached is effective

Slide 22: A high cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash)
[Chart: throughput (KOps/s) vs. miss rate from 0% to 50%; key size = 64 bytes, value size = 8 KB; 5 ms penalty per cache miss; BlueCache with 0.5 TB of flash (assuming no cache misses) vs. local memcached with 50 GB of DRAM]
- BlueCache starts performing better at a 5% miss rate
- A "sweet spot" for large flash caches exists

Slide 23: Graph traversal
- A very latency-bound problem, because the next node to visit often cannot be predicted (see the traversal sketch at the end of these notes)
- It is beneficial to reduce latency by moving computation closer to the data
[Diagram: in-store processors attached to Flash 1-3, serving Hosts 1-3]

Slide 24: Graph traversal performance
[Chart: nodes traversed per second for Software + DRAM, Software + separate network, Software + controller network, and Accelerator + controller network; the fast BlueDBM network was used even for the "separate network" configuration, for fairness]
A flash-based system can achieve comparable performance with a much smaller cluster.

Slide 25: Other potential applications
- Genomics
- Deep machine learning
- Complex graph analytics
- Platform acceleration: Spark, MATLAB, SciDB, ...
Suggestions and collaboration are welcome!

Slide 26: Conclusion
- Fast flash-based distributed storage systems with low-latency random access may be a good platform for supporting complex queries on big data
- Reducing the access latency of distributed storage requires architectural modifications, including in-storage processors and fast storage networks
- Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration
Thank you

Backup slides (27-29)

Slide 28: A near-data accelerator is preferable
[Diagram: in the traditional approach the NIC, flash, CPU, DRAM, and FPGA accelerator all sit on the motherboard, so hardware and software latencies are additive; in BlueDBM the FPGA accelerator sits directly between the flash and the network, in front of the CPU and DRAM]

Slide 29: [Photograph of a BlueDBM storage device: the VC707 (Virtex-7, DRAM, PCIe, network cable and ports) with the custom flash card (Artix-7 and flash chips)]
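The traversal sketch referenced on slide 23: a plain breadth-first search in which every neighbor list must be fetched from flash before the next step can be taken, so the traversal rate is set by access latency rather than bandwidth. The read_adjacency() helper is a hypothetical stand-in for one random read against the distributed flash store; it is not part of the BlueDBM API.

```cpp
// A minimal sketch of why graph traversal is latency-bound on flash.
// read_adjacency() is hypothetical: one random flash read per call.
#include <cstdint>
#include <queue>
#include <unordered_set>
#include <vector>

std::vector<uint64_t> read_adjacency(uint64_t vertex) {
    // Stub: in the real system this would issue a read to flash, either over
    // PCIe or, with in-storage processing, entirely inside the storage network.
    (void)vertex;
    return {};
}

void bfs(uint64_t root) {
    std::queue<uint64_t> frontier;
    std::unordered_set<uint64_t> visited{root};
    frontier.push(root);
    while (!frontier.empty()) {
        uint64_t v = frontier.front();
        frontier.pop();
        // The neighbors of v are unknown until this read completes, so the reads
        // form a dependent chain: each hop pays the full access latency. Cutting
        // that latency (in-storage processors, the controller network) directly
        // raises the nodes-traversed-per-second figure on slide 24.
        for (uint64_t n : read_adjacency(v)) {
            if (visited.insert(n).second) {
                frontier.push(n);
            }
        }
    }
}

int main() {
    bfs(0);  // terminates immediately with the stub above
    return 0;
}
```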