
Flash-based (cloud) storage systems

Lecture 25

Aditya Akella

• BufferHash: invented in the context of network de-dup (e.g., inter-DC log transfers)

• SILT: more “traditional” key-value store

Cheap and Large CAMs for High Performance Data-Intensive Networked Systems

Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella

University of Wisconsin-Madison

Suman Nath

Microsoft Research

New data-intensive networked systems

Large hash tables (10s to 100s of GBs)

New data-intensive networked systems

WAN optimizers

[Figure: a branch office and a data center communicate over the WAN. Objects are split into 4 KB chunks; each chunk is indexed by a 20 B key that maps to a chunk pointer into an object store (~4 TB). The index is a large hash table (~32 GB).]

High speed (~10 K/sec) lookups for a 500 Mbps link

High speed (~10 K/sec) inserts and evictions
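As a rough sanity check on the lookup-rate requirement (a back-of-the-envelope sketch; the 4 KB average chunk size is taken from the figure above):

    # Assumes 4 KB chunks, as in the WAN-optimizer figure above.
    link_bps = 500e6              # 500 Mbps WAN link
    chunk_bits = 4 * 1024 * 8     # one 4 KB chunk
    print(link_bps / chunk_bits)  # ~15,000 chunk lookups/sec, i.e. on the order of 10 K/sec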

New data-intensive networked systems

• Other systems

– De-duplication in storage systems (e.g., Datadomain)

– CCN cache (Jacobson et al., CONEXT 2009)

– DONA directory lookup (Koponen et al., SIGCOMM 2007)

Cost-effective large hash tables

CLAMs: Cheap and Large CAMs

Candidate options:

                Random reads/sec   Random writes/sec   Cost (128 GB)+
  Disk          250                250                 $30       Too slow
  DRAM          300K               300K                $120K     Too expensive (2.5 ops/sec/$)
  Flash-SSD     10K*               5K*                 $225      Slow writes

  * Derived from latencies on Intel M-18 SSD in experiments
  + Price statistics from 2008-09

How to deal with slow writes of Flash SSD?

CLAM design

• New data structure “BufferHash” + Flash

• Key features

– Avoid random writes, and perform sequential writes in a batch

• Sequential writes are 2X faster than random writes (Intel SSD)

• Batched writes reduce the number of writes going to Flash

– Bloom filters for optimizing lookups

BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$

Flash/SSD primer

• Random writes are expensive

– Avoid random page writes

• Reads and writes happen at the granularity of a flash page

– I/O smaller than a page should be avoided, if possible

Conventional hash table on Flash/SSD

Keys are likely to hash to random locations on flash, causing random page writes.

SSDs: the FTL handles random writes to some extent, but the garbage collection overhead is high.

~200 lookups/sec and ~200 inserts/sec with the WAN-optimizer workload, well below the required 10 K/s and 5 K/s.

Conventional hash table on Flash/SSD

Can't assume locality in requests – using DRAM as a cache won't work.

Our approach: Buffering insertions

• Control the impact of random writes

• Maintain small hash table ( buffer ) in memory

• As in-memory buffer gets full, write it to flash

– We call each in-flash copy of the buffer an incarnation

Buffer: in-memory hash table
Incarnation: in-flash hash table (on the SSD)
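A minimal sketch of the buffering idea in Python (illustrative only; the class and method names are hypothetical, and the real system flushes hash-table pages to flash rather than Python dicts):

    class BufferHashSketch:
        """Toy model: small in-memory buffer, flushed to flash as an 'incarnation' when full."""

        def __init__(self, buffer_capacity):
            self.buffer = {}                   # in-memory hash table
            self.buffer_capacity = buffer_capacity
            self.incarnations = []             # oldest first; each is a read-only table "on flash"

        def insert(self, key, value):
            self.buffer[key] = value
            if len(self.buffer) >= self.buffer_capacity:
                # One large sequential (batched) write instead of many random writes.
                self.incarnations.append(self.buffer)
                self.buffer = {}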

Two-level memory hierarchy

[Figure: DRAM holds the buffer; flash holds the incarnation table with incarnations 4, 3, 2, 1, from the latest (4) down to the oldest (1).]

Net hash table = buffer + all incarnations

Lookups are impacted due to buffers

[Figure: a lookup key is checked against the in-memory buffer first, then against incarnations 4, 3, 2, 1 in the flash incarnation table.]

Multiple in-flash lookups. Can we limit it to only one?

Bloom filters for optimizing lookups

[Figure: a lookup key is checked against the in-memory buffer and against one in-memory Bloom filter per incarnation; only incarnations whose filters match (including occasional false positives) are read from flash.]

Configure carefully! 2 GB of Bloom filters for 32 GB of Flash gives a false positive rate < 0.01.
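A sketch of the Bloom-filter-guided lookup path (illustrative; here each per-incarnation "filter" is a plain Python set standing in for a real Bloom filter, so this toy has no false positives, unlike the real structure):

    def lookup(key, buffer, incarnations, filters):
        """buffer: in-memory dict; incarnations: on-flash tables, oldest first;
        filters: one in-memory filter per incarnation, in the same order."""
        if key in buffer:                      # 1. check the in-memory buffer
            return buffer[key]
        # 2. newest incarnation first; touch flash only if the filter says "maybe"
        for flt, inc in zip(reversed(filters), reversed(incarnations)):
            if key in flt:                     # a false positive here wastes one flash read
                if key in inc:                 # the actual flash read
                    return inc[key]
        return None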

Update: naïve approach

[Figure: an update key is looked up through the buffer and Bloom filters, and the matching entry is overwritten in place in its incarnation on flash, causing expensive random writes.]

Discard this naïve approach.

Lazy updates

[Figure: the update key is simply inserted into the in-memory buffer with its new value; the old value remains in an older incarnation on flash.]

Lookups check the latest incarnations first, so they return the new value.
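Lazy update as a sketch (hypothetical names, continuing the toy model above): an update never touches flash, it just shadows the old value.

    def update(buffer, key, new_value):
        # The old value stays in some older incarnation on flash; since lookup()
        # checks the buffer and the newest incarnations first, the stale copy is never returned.
        buffer[key] = new_value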

Eviction for streaming apps

• Eviction policies may depend on application

– LRU, FIFO, priority-based eviction, etc.

• Two BufferHash primitives

– Full Discard: evict all items

• Naturally implements FIFO

– Partial Discard: retain a subset of items

• Enables priority-based eviction by retaining high-priority items (both primitives are sketched below)

• BufferHash best suited for FIFO

– Incarnations arranged by age

– Other useful policies at some additional cost

• Details in paper
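A sketch of the two discard primitives (hypothetical helper names, again on the toy buffer/incarnation model):

    def full_discard(incarnations, filters):
        """Evict the oldest incarnation wholesale -- this naturally implements FIFO."""
        if incarnations:
            incarnations.pop(0)                # drop the oldest incarnation entirely
            filters.pop(0)

    def partial_discard(incarnations, filters, keep):
        """Evict the oldest incarnation but hand back the items accepted by keep(key, value)
        (e.g. high-priority entries) so the caller can re-insert them."""
        if not incarnations:
            return {}
        oldest = incarnations.pop(0)
        filters.pop(0)
        return {k: v for k, v in oldest.items() if keep(k, v)}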

Issues with using one buffer

• A single buffer in DRAM handles all operations and eviction policies

• High worst-case insert latency

– Flushing a 1 GB buffer to flash takes a few seconds

– New lookups stall during the flush

[Figure: one DRAM buffer with its Bloom filters above the incarnation table 4, 3, 2, 1 on flash.]

Partitioning buffers

• Partition buffers

– Based on the first few bits of the key space

– Size > page: avoid I/O smaller than a page

– Size >= block: avoid random page writes

• Reduces worst-case latency

• Eviction policies apply per buffer

[Figure: keys with prefix 0 (0XXXXX) and prefix 1 (1XXXXX) go to separate buffers in DRAM, each with its own chain of incarnations 4, 3, 2, 1 in the flash incarnation table.]
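Choosing a partition by key prefix, as a sketch (hypothetical helper; the real system partitions on the first few bits of the key, while this sketch hashes arbitrary byte strings first):

    import hashlib

    def partition_index(key: bytes, num_partitions: int) -> int:
        """Map a key to one of num_partitions buffers using the top bits of its hash.
        num_partitions is assumed to be a power of two (at most 256 here)."""
        first_byte = hashlib.sha1(key).digest()[0]
        bits = num_partitions.bit_length() - 1       # e.g. 16 partitions -> 4 bits
        return first_byte >> (8 - bits) if bits else 0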

BufferHash: Putting it all together

• Multiple buffers in memory

• Multiple incarnations per buffer in flash

• One in-memory bloom filter per incarnation

[Figure: K buffers in DRAM (Buffer 1 … Buffer K), each with its own chain of incarnations on flash.]

Net hash table = all buffers + all incarnations

Latency analysis

• Insertion latency

– Worst case ∝ size of the buffer

– Average case is constant for buffer > block size

• Lookup latency

– Average case ∝ number of incarnations

– Average case ∝ false positive rate of the Bloom filters

Parameter tuning: Total size of Buffers

Total size of buffers = B1 + B2 + … + BN

Given fixed DRAM, how much should be allocated to buffers?

Total Bloom filter size = DRAM – total size of buffers

[Figure: buffers B1 … BN in DRAM, each with its incarnations on flash.]

Lookup cost ∝ #incarnations * false positive rate

#incarnations = flash size / total buffer size

The false positive rate increases as the Bloom filters shrink.

Too small a total buffer size is not optimal (too many incarnations); too large is not optimal either (Bloom filters become too small, raising the false positive rate).

Optimal = 2 * SSD/entry
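A toy calculation of the two competing terms (a sketch with illustrative numbers, using the standard Bloom-filter false-positive approximation fp ≈ 0.6185^(bits per entry); not the paper's analysis):

    def lookup_cost_terms(buffer_bytes, dram_bytes, flash_bytes, entry_bytes):
        """Return (#incarnations, Bloom false-positive rate) for a given DRAM split."""
        num_incarnations = flash_bytes / buffer_bytes
        bloom_bits = (dram_bytes - buffer_bytes) * 8
        entries_on_flash = flash_bytes / entry_bytes
        fp_rate = 0.6185 ** (bloom_bits / entries_on_flash)
        return num_incarnations, fp_rate

    # Illustrative numbers only: 4 GB DRAM, 32 GB flash, 64-byte hash-table entries.
    for buf_gb in (1, 2, 3):
        n_inc, fp = lookup_cost_terms(buf_gb * 2**30, 4 * 2**30, 32 * 2**30, 64)
        print(buf_gb, "GB of buffers:", round(n_inc), "incarnations, fp =", f"{fp:.1e}",
              "-> expected extra flash reads =", f"{n_inc * fp:.1e}")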

Parameter tuning: Per-buffer size

What should the size of a partitioned buffer (e.g., B1) be?

[Figure: buffers B1 … BN in DRAM above the flash.]

It affects the worst-case insertion latency and is adjusted according to application requirements (128 KB – 1 block).

SILT: A Memory-Efficient,

High-Performance Key-Value Store

Hyeontaek Lim, Bin Fan, David G. Andersen

Michael Kaminsky†

Carnegie Mellon University

†Intel Labs

2011-10-24

Key-Value Store

[Figure: clients issue PUT(key, value), value = GET(key), and DELETE(key) requests to a key-value store cluster.]

• E-commerce (Amazon)

• Web server acceleration (Memcached)

• Data deduplication indexes

• Photo storage (Facebook)

• SILT goal: use much less memory than previous systems while retaining high performance.

Three Metrics to Minimize

Memory overhead = Index size per entry

• Ideally 0 (no memory overhead)

Read amplification = Flash reads per query

• Limits query throughput

• Ideally 1 (no wasted flash reads)

Write amplification = Flash writes per entry

• Limits insert throughput

• Also reduces flash life expectancy

• Must be small enough for flash to last a few years


Landscape before SILT

[Plot: read amplification (0–6) vs. memory overhead in bytes/entry (0–12) for SkimpyStash, HashCache, BufferHash, FlashStore, and FAWN-DS; the region near the origin (low on both axes) is marked with a "?", and no existing system reaches it.]

Solution Preview: (1) Three Stores with (2) New Index Data Structures

Queries look up stores in sequence (from new to old)

Inserts only go to Log

Data are moved in background

[Figure: in memory, three index structures: the SILT Sorted Index (memory efficient), the SILT Filter, and the SILT Log Index (write friendly); each indexes its data on flash.]
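A sketch of the new-to-old query order across the stores (hypothetical class/variable names; each store object is assumed to expose a get(key) that returns None on a miss):

    def silt_get(key, log_store, hash_stores, sorted_store):
        """Query stores from newest to oldest and return the first hit."""
        value = log_store.get(key)             # newest data: the write-friendly log
        if value is not None:
            return value
        for hs in reversed(hash_stores):       # then any not-yet-merged HashStores
            value = hs.get(key)
            if value is not None:
                return value
        return sorted_store.get(key)           # finally the memory-efficient SortedStore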

LogStore: No Control over Data Layout

[Figure: on flash, an append-only log grows from older to newer entries; in memory, a naive hashtable would need 48+ B/entry to index it, while the SILT Log Index needs 6.5+ B/entry.]

Memory overhead: 6.5+ bytes/entry. Write amplification: 1 (inserted entries are simply appended).

SortedStore: Space-Optimized Layout

[Figure: in memory, the SILT Sorted Index (0.4 B/entry); on flash, a sorted array.]

Memory overhead: 0.4 bytes/entry. Write amplification: high; bulk inserts are needed to amortize the cost of rewriting the sorted array.

Combining SortedStore and LogStore

[Figure: the LogStore (SILT Log Index over an on-flash log) is merged into the SortedStore (SILT Sorted Index over an on-flash sorted array).]
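A sketch of the merge step (illustrative only; real SILT streams entries in sorted key order on flash, while this toy uses in-memory lists of (key, value) pairs):

    def merge_into_sorted_store(sorted_array, log_entries):
        """Bulk-merge log entries into the sorted array; newer (log) values win."""
        merged = dict(sorted_array)            # existing sorted data (older)
        merged.update(log_entries)             # log entries overwrite older values
        return sorted(merged.items())          # written back out as a new sorted array

    # Usage sketch:
    # sorted_array = merge_into_sorted_store(sorted_array, dict(log_store_items))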

Achieving both Low Memory Overhead and Low Write Amplification

• SortedStore: low memory overhead, but high write amplification

• LogStore: low write amplification, but high memory overhead

Combining SortedStore and LogStore, we can achieve simultaneously:

Write amplification = 5.4 (enough for a ~3-year flash life)

Memory overhead = 1.3 B/entry

With "HashStores", memory overhead = 0.7 B/entry!

SILT’s Design (Recap)

SortedStore: SILT Sorted Index over an on-flash sorted array
HashStore: SILT Filter over on-flash hashtables
LogStore: SILT Log Index over an on-flash log

LogStores are converted into HashStores, and HashStores are merged into the SortedStore.

Memory overhead: 0.7 bytes/entry. Read amplification: 1.01. Write amplification: 5.4.

New Index Data Structures in SILT

• SILT Sorted Index: entropy-coded tries

– For the SortedStore; highly compressed (0.4 B/entry)

• SILT Filter & Log Index: partial-key cuckoo hashing

– For the HashStore & LogStore; compact (2.2 & 6.5 B/entry) and very fast (> 1.8 M lookups/sec)
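A sketch in the spirit of partial-key cuckoo hashing (simplified and hypothetical: it derives the alternate bucket from a short tag via an XOR trick, not SILT's exact scheme; the point is that each in-memory slot stores only a small tag plus a flash offset, so displacements never need the full key, which lives only on flash):

    import hashlib

    NUM_BUCKETS = 1 << 16       # toy table size; must be a power of two
    MAX_KICKS = 128

    def _h(data: bytes) -> int:
        return int.from_bytes(hashlib.sha1(data).digest()[:4], "little")

    def index_and_tag(key: bytes):
        tag = (_h(b"tag" + key) & 0x7FFF) or 1           # 15-bit tag, never zero
        return _h(key) % NUM_BUCKETS, tag

    def alt_index(index: int, tag: int) -> int:
        # The alternate bucket is derivable from (index, tag) alone.
        return (index ^ _h(tag.to_bytes(2, "little"))) % NUM_BUCKETS

    def insert(table, key: bytes, flash_offset: int) -> bool:
        """table: list of NUM_BUCKETS slots, each None or a (tag, flash_offset) pair."""
        index, tag = index_and_tag(key)
        entry = (tag, flash_offset)
        for _ in range(MAX_KICKS):
            for j in (index, alt_index(index, tag)):
                if table[j] is None:
                    table[j] = entry
                    return True
            # Both candidate buckets are full: displace the occupant of `index`.
            table[index], entry = entry, table[index]
            tag = entry[0]
            index = alt_index(index, tag)
        return False            # table too full; SILT would then convert the LogStore

    def lookup(table, key: bytes):
        index, tag = index_and_tag(key)
        for j in (index, alt_index(index, tag)):
            if table[j] is not None and table[j][0] == tag:
                return table[j][1]     # flash offset; the full key on flash is then verified
        return None

    # Usage: table = [None] * NUM_BUCKETS; insert(table, b"some key", 12345)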

Landscape

[Plot: same axes as before, read amplification (0–6) vs. memory overhead (0–12 bytes/entry); SILT sits near the origin, well below SkimpyStash, HashCache, BufferHash, FlashStore, and FAWN-DS.]

BufferHash: Backup

Outline

• Background and motivation

• Our CLAM design

– Key operations (insert, lookup, update)

– Eviction

– Latency analysis and performance tuning

• Evaluation

Evaluation

• Configuration

– 4 GB DRAM, 32 GB Intel SSD, Transcend SSD

– 2 GB buffers, 2 GB bloom filters, 0.01 false positive rate

– FIFO eviction policy

BufferHash performance

• WAN optimizer workload

– Random key lookups followed by inserts

– Hit rate of 40%

– Also used workloads from real packet traces

• Comparison with BerkeleyDB (traditional hash table) on Intel SSD

Average latency   BufferHash   BerkeleyDB
Lookup (ms)       0.06         4.6          Better lookups!
Insert (ms)       0.006        4.8          Better inserts!

Insert performance

[CDF of insert latency (ms, log scale 0.01 to 1+) on the Intel SSD: 99% of BufferHash inserts take < 0.1 ms (buffering effect), while 40% of BerkeleyDB inserts take > 5 ms (random writes are slow).]

Lookup performance

[CDF of lookup latency (ms, log scale 0.01 to 10) for the 40%-hit workload: 99% of BufferHash lookups take < 0.2 ms, while 40% of BerkeleyDB lookups take > 5 ms due to garbage-collection overhead from writes.]

Performance in Ops/sec/$

• 16K lookups/sec and 160K inserts/sec

• Overall cost of $400

• 42 lookups/sec/$ and 420 inserts/sec/$

– Orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables

Other workloads

• Varying fractions of lookups

• Results on Transcend SSD

Average latency per operation:

Lookup fraction   BufferHash   BerkeleyDB
0                 0.007 ms     18.4 ms
0.5               0.09 ms      10.3 ms
1                 0.12 ms      0.3 ms

• BufferHash is ideally suited for write-intensive workloads

Evaluation summary

• BufferHash performs orders of magnitude better in ops/sec/$ compared to traditional hashtables on DRAM (and disks)

• BufferHash is best suited for the FIFO eviction policy

– Other policies can be supported at additional cost; details in paper

• A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps with BerkeleyDB

– Details in paper

Related Work

• FAWN (Vasudevan et al., SOSP 2009)

– Cluster of wimpy nodes with flash storage

– Each wimpy node has its hash table in DRAM

– We target…

• Hash table much bigger than DRAM

• Low latency as well as high throughput systems

• HashCache (Badam et al., NSDI 2009)

– In-memory hash table for objects stored on disk

WAN optimizer using BufferHash

• With BerkeleyDB, throughput up to 10 Mbps

• With BufferHash, throughput up to 200 Mbps with Transcend SSD

– 500 Mbps with Intel SSD

• At 10 Mbps, average throughput per object improves by 65% with BufferHash

SILT Backup Slides

Evaluation

1. Various combinations of indexing schemes

2. Background operations (merge/conversion)

3. Query latency

Experiment Setup

CPU:            2.80 GHz (4 cores)
Flash drive:    SATA, 256 GB (48 K random 1024-byte reads/sec)
Workload size:  20-byte key, 1000-byte value, ≥ 50 M keys
Query pattern:  Uniformly distributed (worst case for SILT)

LogStore Alone: Too Much Memory

Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)


LogStore+SortedStore: Still Much Memory

Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)


Full SILT: Very Memory Efficient

Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)


Small Impact from Background Operations

Workload: 90% GET (~100 M keys) + 10% PUT

[Throughput plot: about 40 K queries/sec, dipping to about 33 K queries/sec; occasional bursty dips ("Oops!") are caused by TRIM from the ext4 FS.]

Low Query Latency

Workload: 100% GET (100 M keys)

[Plot of throughput vs. number of I/O threads: best throughput at 16 threads; median latency = 330 μs, 99.9th percentile = 1510 μs.]

Conclusion

• SILT provides a memory-efficient, high-performance key-value store

– Multi-store approach

– Entropy-coded tries

– Partial-key cuckoo hashing

• Full source code is available

– https://github.com/silt/silt
