BlueDBM: An Appliance for
Big Data Analytics
Sang-Woo Jun* Ming Liu* Sungjin Lee* Jamey Hicks+
John Ankcorn+ Myron King+ Shuotao Xu* Arvind*
*MIT
Computer Science and Artificial Intelligence Laboratory
+Quanta Research Cambridge
ISCA 2015, Portland, OR.
June 15, 2015
This work is funded by Quanta, Samsung and Lincoln Laboratory.
We also thank Xilinx for their hardware and expertise donations.
Big data analytics
Analysis of previously unimaginable amounts of data can provide deep insight
- Google has predicted flu outbreaks a week earlier than the Centers for Disease Control (CDC)
- Analyzing a personal genome can determine predisposition to diseases
- Social network chatter analysis can identify political revolutions before newspapers
- Scientific datasets can be mined to extract accurate models
Likely to be the biggest economic driver for the IT industry for the next decade
A currently popular solution: RAM Cloud
Cluster of machines with large DRAM capacity and fast interconnect
+ Fastest as long as data fits in DRAM
- Power hungry and expensive
- Performance drops when data doesn’t fit in DRAM
What if enough DRAM isn’t affordable?
Flash-based solutions may be a better alternative
+ Faster than disk, cheaper than DRAM
+ Lower power consumption than both
- Legacy storage access interface is burdensome
- Slower than DRAM
Related work
Use of flash
- SSDs, FusionIO, Purestorage
- Zetascale
- SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
Networks
- QuickSAN [ISCA 2013]
- Hadoop/Spark on Infiniband RDMA [SC 2012]
Accelerators
- SmartSSD [SIGMOD 2013], Ibex [VLDB 2014]
- Catapult [ISCA 2014]
- GPUs
Latency profile of distributed flash-based analytics
Distributed processing involves many system components:
- Flash device access
- Storage software (OS, FTL, …)
- Network interface (10GbE, Infiniband, …)
- Actual processing
[Figure: per-access latency of each component: flash access 75 μs, storage software 100 μs, network 20 μs, processing 50~100 μs; in practice the software and network overheads range up to 100~1000 μs and 20~1000 μs]
Latency is additive
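As a rough illustration of what "additive" means here, summing the component latencies from the figure (nominal values from the figure, not measurements) shows that most of an access is spent outside the flash chip itself:

```python
# Back-of-envelope sum of the per-access latencies shown above (microseconds).
# These are the figure's nominal values, not measured data.
components = {
    "flash_access": 75,
    "storage_software": 100,  # can grow to 100~1000 us
    "network": 20,            # can grow to 20~1000 us
    "processing": 75,         # midpoint of the 50~100 us range
}

total = sum(components.values())
overhead = total - components["flash_access"]
print(f"end-to-end ~{total} us; ~{overhead} us of that is not flash access")
```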
Latency profile of distributed flash-based analytics
Architectural modifications can remove unnecessary overhead:
- Near-storage processing
- Cross-layer optimization of flash management software*
- Dedicated storage area network
- Accelerator
[Figure: the same latency breakdown with the software, network, and processing overheads reduced (e.g., from 50~100 μs to < 20 μs)]
Difficult to explore using flash packaged as off-the-shelf SSDs
Custom flash card had to be built
[Photo: custom flash card with an Artix-7 FPGA, flash chips on four buses (Bus 0-3, populated on both sides), network ports, and an HPC FMC port connecting to the VC707]
BlueDBM: Platform with near-storage processing and inter-controller networks
20 24-core Xeon servers
20 BlueDBM storage devices:
- 1TB flash storage
- x4 20Gbps controller network
- Xilinx VC707
- 2GB/s PCIe
[Photos: 1 of 2 racks (10 nodes); a BlueDBM storage device]
BlueDBM node architecture
[Diagram: flash device containing an in-storage processor, network interface, and flash controller, connected to the host server over PCIe]
Lightweight flash management with very low overhead
- ECC support
- Adds almost no latency
Custom network protocol with low latency / high bandwidth
- x4 20Gbps links at 0.5 μs latency
- Virtual channels with flow control
Software has very low-level access to flash storage
- FTL implemented inside the file system
- High-level information can be used for low-level management
(No time to go into gritty details!)
BlueDBM software view
[Diagram: userspace hardware-assisted applications; kernelspace file system, block device driver, and Connectal proxy (generated by Connectal, by Quanta); FPGA with Connectal wrapper, accelerator manager, HW accelerator, flash controller, and network interface; NAND flash]
BlueDBM provides a generic file system interface as well as an accelerator-specific interface (aided by Connectal)
Power consumption is low

Component             Power (Watts)
VC707                 30
Flash board (x2)      10
Storage device total  40

Storage device power consumption is a very conservative estimate

Component             Power (Watts)
Storage device        40
Xeon server           200+
Node total            240+

A GPU-based accelerator would double the power
Applications
Content-based image search*
- Faster flash with accelerators as a replacement for DRAM-based systems
BlueCache: an accelerated memcached*
- Dedicated network and accelerated caching systems with larger capacity
Graph analytics
- Benefits of lower-latency access into distributed flash for computation on large graphs
* Results obtained since the paper submission
Content-based image retrieval
Takes a query image and returns similar images in a dataset of tens of millions of pictures
Image similarity is determined by measuring the distance between the histograms of each image
- Histograms are generated using RGB, HSV, “edgeness”, etc.
- Better algorithms are available!
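As a reference point for the retrieval step, here is a minimal software sketch of histogram-based similarity (plain RGB histograms and L1 distance are illustrative assumptions; the accelerator's exact histogram features and metric may differ):

```python
# Minimal sketch of histogram-based image similarity (not the FPGA pipeline).
import numpy as np

def rgb_histogram(image, bins=8):
    """image: HxWx3 uint8 array -> flattened, normalized 3-D color histogram."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3), bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def histogram_distance(h1, h2):
    """L1 distance between two histograms; smaller means more similar."""
    return np.abs(h1 - h2).sum()

def query(query_image, dataset_images):
    """Rank dataset images by histogram distance to the query image."""
    q = rgb_histogram(query_image)
    scored = [(histogram_distance(q, rgb_histogram(img)), i)
              for i, img in enumerate(dataset_images)]
    return sorted(scored)
```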
Image search accelerator (Sang-Woo Jun, Chanwoo Chung)
[Diagram: images streamed from flash through the flash controller into a Sobel filter and histogram generator on the FPGA; a histogram comparator scores each image against the query histogram and passes results to software]
Image query performance without sampling
[Chart: query throughput of BlueDBM + FPGA vs. BlueDBM + CPU (CPU bottleneck) vs. an off-the-shelf M.2 SSD]
Faster flash with acceleration can perform at DRAM speed
Sampling to improve performance
Intelligent sampling methods (e.g., locality-sensitive hashing) improve performance by dramatically reducing the search space
- But they introduce a random access pattern
[Diagram: a locality-sensitive hash table pointing into the data; the data accesses corresponding to a single hash-table entry result in many random accesses]
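For concreteness, a minimal random-hyperplane LSH sketch is shown below (an assumed scheme for illustration; the paper's actual sampling method and parameters may differ). Only items that land in the query's bucket are compared, which shrinks the search space but turns the flash reads into random accesses:

```python
# Minimal sketch of locality-sensitive hashing over feature vectors.
import numpy as np

class LSHIndex:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random hyperplanes
        self.table = {}                                    # bucket -> item ids

    def _key(self, v):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple((self.planes @ v > 0).astype(int))

    def insert(self, item_id, v):
        self.table.setdefault(self._key(v), []).append(item_id)

    def candidates(self, v):
        # Only items in the query's bucket are compared exhaustively,
        # reducing the search space at the cost of random reads into flash.
        return self.table.get(self._key(v), [])
```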
Image query performance with sampling
A disk-based system cannot take advantage of the reduced search space
memcached service
A distributed in-memory key-value store
- Caches DB results indexed by query strings
- Accessed via socket communication
- Uses system DRAM for caching (~256GB)
Extensively used by database-driven websites
- Facebook, Flickr, Twitter, Wikipedia, YouTube, …
[Diagram: web requests reach application servers, which send memcached requests to the memcached servers and return the cached data to browsers/mobile apps]
Networking contributes 90% of the overhead
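For readers unfamiliar with memcached, a minimal cache-aside sketch of the usage pattern is below (the dictionary stands in for a memcached cluster reached over sockets, and db_query is a hypothetical callback, both illustrative assumptions):

```python
# Minimal sketch of the cache-aside pattern that memcached enables.
import hashlib

cache = {}  # stand-in for a memcached cluster reached over sockets

def get_page_data(query_string, db_query):
    key = hashlib.sha1(query_string.encode()).hexdigest()
    value = cache.get(key)             # memcached GET
    if value is None:                  # cache miss: fall back to the database
        value = db_query(query_string)
        cache[key] = value             # memcached SET for later requests
    return value
```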
BlueCache: accelerated memcached service (Shuotao Xu)
[Diagram: each web server connects over PCIe to a BlueCache accelerator with its own flash controller and 1TB of flash; the accelerators are linked directly by the inter-controller network]
Memcached server implemented in hardware
- Hashing and flash management implemented in the FPGA
- 1TB of hardware-managed flash cache per node
Hardware server accessed via local PCIe
Direct network between hardware
Effect of architecture modification (no flash, only DRAM)
[Chart: GET throughput in KOps/s for 64-byte keys and 64-byte values: BlueCache 4012, local memcached 357, remote memcached 273; an ~11x performance improvement]
PCIe DMA and the inter-controller network reduce access overhead
FPGA acceleration of memcached is effective
High cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash)
[Chart: throughput (KOps/s) vs. miss rate for 64-byte keys and 8KB values, assuming a 5 ms penalty per cache miss and no cache misses for BlueCache: BlueCache (0.5TB flash) vs. local memcached (50GB DRAM)]
BlueCache starts performing better at a 5% miss rate
A “sweet spot” for large flash caches exists
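A simple back-of-envelope model reproduces the shape of this trade-off (the hit latencies below are assumptions; only the 5 ms miss penalty comes from the slide): a small DRAM cache is fast on hits but pays the miss penalty often, while a large flash cache is slower per hit but rarely misses:

```python
# Assumed hit latencies; the 5 ms miss penalty is the number from the chart.
DRAM_HIT_US, FLASH_HIT_US, MISS_PENALTY_US = 10, 200, 5000

def throughput_kops(hit_us, miss_rate):
    """KOps/s for one outstanding request, from the average latency per request."""
    avg_us = (1 - miss_rate) * hit_us + miss_rate * MISS_PENALTY_US
    return 1e3 / avg_us  # 1e6 us/s divided by avg_us, reported in KOps/s

for miss_pct in (0, 5, 10, 20, 50):
    dram = throughput_kops(DRAM_HIT_US, miss_pct / 100)  # small DRAM cache
    flash = throughput_kops(FLASH_HIT_US, 0.0)           # big flash cache, ~no misses
    print(f"{miss_pct:2d}% miss: DRAM cache {dram:6.1f} KOps/s, flash cache {flash:6.1f} KOps/s")
```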
Graph traversal
A very latency-bound problem, because the next node to visit often cannot be predicted
- Beneficial to reduce latency by moving computation closer to the data
[Diagram: in-store processors traverse data spread across Flash 1-3 directly, rather than routing every access through Hosts 1-3]
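A minimal breadth-first traversal sketch makes the latency dependence explicit (read_adjacency is a hypothetical stand-in for a random flash read of a node's edge list; it sits on the critical path of every step):

```python
# Minimal sketch of why graph traversal is latency-bound on flash: each step
# needs the just-fetched adjacency list before the next read can be issued.
from collections import deque

def bfs(start, read_adjacency):
    visited, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        # This dependent read is on the critical path; lowering its latency
        # (e.g., issuing it from an in-store processor) directly speeds up traversal.
        for neighbor in read_adjacency(node):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return visited
```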
Graph traversal performance
[Chart: nodes traversed per second (DRAM vs. flash) for Software + DRAM, Software + separate network, Software + controller network, and Accelerator + controller network; the fast BlueDBM network was used even for the separate-network configuration for fairness]
A flash-based system can achieve comparable performance with a much smaller cluster
Other potential applications
Genomics
Deep machine learning
Complex graph analytics
Platform acceleration
- Spark, MATLAB, SciDB, …
Suggestions and collaboration are welcome!
Conclusion
Fast flash-based distributed storage systems with low-latency random access may be a good platform to support complex queries on Big Data
Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks
Flash-based analytics hold a lot of promise, and we plan to continue demonstrating more application acceleration
Thank you
Near-Data Accelerator is Preferable
[Diagram: Traditional approach: the FPGA accelerator hangs off the motherboard alongside the CPU, DRAM, NIC, and flash, so hardware and software latencies are additive. BlueDBM: the FPGA accelerator sits directly between the flash and the network, off the host path.]
[Photo: VC707 board with Virtex-7 FPGA, DRAM, and PCIe, plus the attached custom flash card (Artix-7, flash, network ports and cable)]