Paper presented in USENIX FAST '14, Santa Clara, CA

Resolving Journaling of Journal Anomaly in Android I/O:
Multi-Version B-tree with Lazy Split

Wook-Hee Kim¹, Beomseok Nam¹, Dongil Park², Youjip Won²
¹ Ulsan National Institute of Science and Technology, Korea
² Hanyang University, Korea
Outline
- Motivation
  • Journaling of Journal Anomaly
- How to resolve the Journaling of Journal anomaly
  • Multi-Version B-Tree (MVBT)
- Optimizations of the Multi-Version B-Tree
  • Lazy Split
  • Reserved Buffer Space
  • Lazy Garbage Collection
  • Metadata Embedding
  • Disabling Sibling Redistribution
- Evaluation
- Demo

USENIX FAST '14, Santa Clara, CA
Android is the most popular mobile platform
Storage in Android platform
Storage is the performance bottleneck and a major cause of device breakdown.
The cause: excessive I/O.
Android I/O Stack
insert/update/delete/select
  |
SQLite
  |  read/write (misaligned interaction)
EXT4
  |  read/write
Block Device Driver
SQLite insert (Truncate mode)
1. Open the rollback journal.
2. Record the data to the journal.
3. Put the commit mark in the journal.
4. Insert the entry into the DB.
5. Truncate the journal.
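The five steps above can be sketched as plain file operations. This is a minimal sketch under assumptions of mine: `insert_truncate_mode` is an invented name and the file layout is a stand-in, not SQLite's real page format. The point it makes visible is that every insert pays at least two fsync() calls, and each fsync() is exactly what EXT4 then journals in turn.

```python
import os

def insert_truncate_mode(db_path, journal_path, record):
    """Truncate-mode insert protocol (sketch; hypothetical file layout).
    Returns the number of fsync() calls issued."""
    fsyncs = 0
    # 1. Open the rollback journal; capture the before-image of the DB.
    old = b""
    if os.path.exists(db_path):
        with open(db_path, "rb") as f:
            old = f.read()
    with open(journal_path, "wb") as j:
        j.write(old)             # 2. record the data to the journal
        j.write(b"\nCOMMIT")     # 3. put the commit mark in the journal
        j.flush()
        os.fsync(j.fileno())
        fsyncs += 1              # first fsync: the journal is durable
    # 4. Insert the entry into the DB file.
    with open(db_path, "ab") as db:
        db.write(record)
        db.flush()
        os.fsync(db.fileno())
        fsyncs += 1              # second fsync: the DB update is durable
    # 5. Truncate the journal: the transaction is complete.
    with open(journal_path, "wb"):
        pass
    return fsyncs
```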
EXT4 Journaling (ordered mode)
For each write() issued by SQLite, EXT4:
1. Writes the data block.
2. Writes the EXT4 journal.
3. Writes the journal metadata.
4. Writes the journal commit block.
Journaling of Journal Anomaly
One insert of 100 bytes turns into 9 random writes of 4 KByte: SQLite
journals the insert, and for each resulting write() EXT4 writes the data
block, the EXT4 journal, the journal metadata, and the journal commit
block.
How to Resolve Journaling of Journal?
A single insert reaches EXT4 twice: once for database journaling and
once for the database insertion, each ending in its own fsync().
The fix is to reduce the number of fsync() calls by combining journaling
and the database into one structure, the Multi-Version B-Tree (MVBT), so
that journaling + database = one fsync().
Version-based B-Tree (MVBT)
T1: Insert 10          -> 10 [T1, ∞)
T2: Update 10 with 20  -> 10 [T1, T2), 20 [T2, ∞)
Old-version data is never overwritten, so no rollback journal is needed.
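The interval scheme above can be sketched as a toy leaf node in a few lines of Python. Class and method names are mine, not the paper's or SQLite's; the point is only that updates close the live interval and append a new version instead of overwriting.

```python
INF = float("inf")

class MVBTLeaf:
    """Versioned entries (key, value, [start, end)). Old versions are never
    overwritten, which is why no rollback journal is needed: aborting
    transaction T only means discarding entries stamped [T, inf) and
    reopening the ones T closed."""
    def __init__(self):
        self.entries = []            # each entry: [key, value, start, end]

    def insert(self, key, value, txn):
        self.entries.append([key, value, txn, INF])

    def update(self, key, value, txn):
        for e in self.entries:
            if e[0] == key and e[3] == INF:
                e[3] = txn           # close the live version: [start, txn)
        self.entries.append([key, value, txn, INF])

    def lookup(self, key, version):
        for k, v, start, end in self.entries:
            if k == key and start <= version < end:
                return v
        return None

leaf = MVBTLeaf()
leaf.insert(10, 10, 1)   # T1: insert 10         -> 10 [T1, inf)
leaf.update(10, 20, 2)   # T2: update 10 with 20 -> 10 [T1, T2), 20 [T2, inf)
```

A reader at version T1 still sees the old value; a reader at T2 sees the new one.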
Node Split in Multi-Version B-Tree
Insert key 25 at version 5 into full Page 1: 5 [3~∞), 10 [2~∞),
12 [4~∞), 40 [1~∞).
1. Version split: mark every live entry in Page 1 dead at version 5,
   giving 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5). Page 1 is now a dead
   node.
2. Key split: copy the live versions, plus the new key, into two new
   pages. Page 3: 5 [5~∞), 10 [5~∞). Page 2: 12 [5~∞), 25 [5~∞),
   40 [5~∞).
3. Update the parent, Page 4:
   10 [5~∞) : Page 3
   ∞ [5~∞) : Page 2
   ∞ [0~5) : Page 1
Dirty pages: Page 1, Page 2, Page 3, and Page 4. That is one more dirty
page than a split in the original B-Tree.
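The version split plus key split just walked through can be modeled as a toy function. The entry layout ([key, start, end] lists) and the function name are assumptions of mine for illustration, not the paper's page format.

```python
INF = float("inf")

def legacy_mvbt_split(page, new_key, txn):
    """Legacy MVBT split (sketch): a version split dead-marks every live
    entry in the old page, then a key split distributes the live versions
    plus the new key across two fresh pages. Entries are
    [key, start, end) lists; values are omitted for brevity."""
    live = sorted(e[0] for e in page if e[2] == INF)
    for e in page:
        if e[2] == INF:
            e[2] = txn                           # version split: dead-mark
    keys = sorted(live + [new_key])
    mid = len(keys) // 2
    lower = [[k, txn, INF] for k in keys[:mid]]  # first new page
    upper = [[k, txn, INF] for k in keys[mid:]]  # second new page
    # Dirty pages: the dead old page, the two new pages, and the parent
    # that gains pointers to them.
    return lower, upper, 4
```

With the slide's Page 1 and key 25 at version 5, this yields new pages {5, 10} and {12, 25, 40}, a fully dead-marked old page, and an updated parent: four dirty pages.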
I/O Traffic in MVBT
Each fsync() flushes all dirty pages in the DB buffer cache. MVBT
dirties several pages per transaction, so the goal of LS-MVBT is to
reduce the number of dirty pages flushed per fsync().
Optimizations in Android I/O
In Android I/O, t_write >> t_read, and the workload is dominated by
single-insert write transactions. The optimizations below therefore
target the write path, accepting a small cost on reads.
Lazy Split Multi-Version B-Tree (LS-MVBT)
Five optimizations:
• Lazy Split
• Reserved Buffer Space
• Lazy Garbage Collection
• Metadata Embedding
• Disabling Sibling Redistribution
Optimization 1: Lazy Split
Recall the legacy split in MVBT: it leaves 4 dirty pages.
- Page 1 (dead node): 5 [3~5), 10 [2~5), 12 [4~5), 40 [1~5)
- Page 3: 5 [5~∞), 10 [5~∞)
- Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞)
- Page 4 (parent): 10 [5~∞) : Page 3, ∞ [5~∞) : Page 2, ∞ [0~5) : Page 1
Optimization 1: Lazy Split
Insert key 25 at version 5 into full Page 1: 5 [3~∞), 10 [2~∞),
12 [4~∞), 40 [1~∞).
1. Dead-mark only the upper half of the entries in place: 12 [4~5),
   40 [1~5). Page 1 becomes a lazy node: half-dead, half-live.
2. Copy only the upper half's live versions, plus the new key, into one
   new page, Page 2: 12 [5~∞), 25 [5~∞), 40 [5~∞).
3. Update the parent, Page 3:
   10 [5~∞) : Page 1
   ∞ [0~5) : Page 1
   ∞ [5~∞) : Page 2
Dirty pages: Page 1, Page 2, and Page 3.
Optimization 1: Lazy Split (lazy node overflow)
Insert key 8 at version 6 into lazy node Page 1, which is full because
it still holds the dead entries 12 [4~5) and 40 [1~5).
Resolution: garbage collect the dead entries in place and insert the new
entry, leaving Page 1 with 5 [3~∞), 8 [6~∞), 10 [2~∞).
But what if the dead entries are being accessed by other transactions?
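The lazy split itself can be sketched the same way; entries are again modeled as [key, start, end) lists, which is my own stand-in layout, not the paper's page format.

```python
INF = float("inf")

def lazy_split(page, new_key, txn):
    """Lazy split (sketch): dead-mark only the upper half of the live
    entries in place. They stay in the old page, which becomes a lazy
    node (half-dead, half-live), instead of being copied into a fresh
    lower-half page. Only the upper half's live versions plus the new
    key go to one new sibling."""
    live = sorted(e[0] for e in page if e[2] == INF)
    median = live[len(live) // 2]
    sibling = []
    for e in page:
        if e[2] == INF and e[0] >= median:
            e[2] = txn                       # dead-mark in place
            sibling.append([e[0], txn, INF])
    sibling.append([new_key, txn, INF])
    sibling.sort()
    # Dirty pages: old page, new sibling, parent = 3 (vs 4 for legacy).
    return sibling, 3
```

With the slide's Page 1 and key 25 at version 5, the sibling gets {12, 25, 40} while 5 and 10 stay live in Page 1 next to the dead-marked 12 and 40: one fewer dirty page than the legacy split.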
Optimization 2: Reserved Buffer Space
What if the dead entries in a full lazy node are still being accessed by
other read transactions? They cannot simply be garbage collected.
- Option 1: Wait for the read transactions to finish. The writer stalls.
- Option 2: Split as in the legacy MVBT, dirtying extra pages again.
LS-MVBT instead reserves buffer space in each node. When the lazy node
is full, the new entry (e.g., 8 [6~∞)) is placed in the reserved buffer
space. Only if the buffer space is also full does LS-MVBT fall back to
the legacy MVBT split.
Rollback in LS-MVBT
Rollback is similar to rollback in MVBT. If transaction 5 crashes, its
new entries (12 [5~∞), 25 [5~∞), 40 [5~∞) in Page 2) are discarded, the
dead marks it placed in Page 1 are reverted (12 [4~5) back to 12 [4~∞),
40 [1~5) back to 40 [1~∞)), and the parent entry pointing to Page 2 is
removed. The number of dirty nodes touched by a rollback in LS-MVBT is
also smaller than in MVBT.
Optimization 3: Lazy Garbage Collection
With periodic garbage collection, a separate GC pass stops the running
transaction, collects dead entries from pages (P1, P2, P3) through a GC
buffer cache, and produces extra dirty pages that fsync() must flush.
Lazy garbage collection instead does not garbage collect if no space is
needed: dead entries are reclaimed by the inserting transaction itself,
within pages it already dirties, so GC adds no extra dirty pages.
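The deferral rule can be stated in a few lines; the function name, the
fixed capacity, and the [key, start, end) entry layout are illustrative
assumptions of mine.

```python
INF = float("inf")

def insert_with_lazy_gc(page, entry, capacity=4):
    """Lazy garbage collection (sketch): dead entries (closed [start, end)
    intervals) are reclaimed only when an insert actually needs the slot,
    so GC only ever touches a page the ongoing transaction dirties anyway
    and never adds extra dirty pages of its own."""
    if len(page) >= capacity:
        # Reclaim dead versions in place, inside the inserting transaction.
        page[:] = [e for e in page if e[2] == INF]
    if len(page) >= capacity:
        raise OverflowError("page full of live entries: split instead")
    page.append(entry)
```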
Optimization 4: Metadata Embedding
The version number is the "File Change Counter" in the DB header page.
Every committing write transaction increments it (5 -> 6), so each
commit flushes 2 dirty pages: the header page and the page the
transaction actually modified (e.g., Page N holding 15 [6~∞),
25 [5~∞), 40 [1~∞)).
LS-MVBT instead flushes the file change counter embedded in the last
modified page, with a copy on a RAMDISK for fast reads by read
transactions: 1 dirty page per commit. The RAMDISK copy is volatile;
after a crash, the counter is recovered from the copies embedded in the
B-tree pages.
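The embedding idea can be sketched as a toy database file; `DBFile` and
its fields are hypothetical names of mine, and real page contents are
reduced to a dict.

```python
class DBFile:
    """Metadata embedding (sketch): the per-commit file change counter is
    written into the page the transaction already dirtied, plus a volatile
    RAMDISK copy for fast reads, instead of into the header page. A commit
    therefore flushes 1 dirty page instead of 2."""
    def __init__(self, npages):
        self.pages = [{"data": None, "counter": 0} for _ in range(npages)]
        self.ramdisk_counter = 0            # volatile copy; lost on crash

    def commit(self, page_no, data):
        new_counter = self.current_counter() + 1
        self.pages[page_no]["data"] = data
        self.pages[page_no]["counter"] = new_counter  # embedded in same page
        self.ramdisk_counter = new_counter
        return 1                            # dirty pages flushed

    def current_counter(self):
        # After a crash the RAMDISK copy is gone: recover the counter as
        # the maximum value embedded in any page.
        return max(self.ramdisk_counter,
                   max(p["counter"] for p in self.pages))
```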
Optimization 5: Disable Sibling Redistribution
Sibling redistribution hurts insertion performance: on insertion of
key 20 into a full node, redistributing entries among the siblings
dirties 4 pages (the siblings, the overflowing node, and the parent).
With sibling redistribution disabled, the node simply splits, dirtying
3 pages: the split node, the new node, and the parent.
Summary
• Lazy Split: avoid dirtying an extra page when a split occurs
• Reserved Buffer Space: reduce the probability of node split
• Lazy Garbage Collection: delete dead entries on the next mutation
• Metadata Embedding: avoid dirtying the header page
• Disabling Sibling Redistribution: do not touch sibling pages
Performance Evaluation
- Implemented LS-MVBT in SQLite 3.7.12
- Testbed: Samsung Galaxy S4 (Android 4.3, Jelly Bean)
  • Exynos 5 Octa 5410, quad-core 1.6 GHz
  • 2 GB memory
  • 16 GB eMMC
- Performance analysis tools: Mobibench and MOST
Analysis of insert Performance
SQLite insert response time, broken down into fsync() time and B-tree
computation: LS-MVBT is 30~78% faster than WAL mode, depending on the
WAL checkpointing interval.
Number of dirty pages per SQLite insert: LS-MVBT flushes only one dirty
page.
I/O Traffic at Block Device Level
I/O traffic per 10 msec during 1000 insertions: LS-MVBT saves 67% of the
I/O traffic, 9.9 MB (LS-MVBT) vs 31 MB (WAL), which implies roughly 3x
longer lifetime for the NAND flash.
Search Performance
LS-MVBT is optimized for write performance, so how does it fare on
searches? Measuring transaction throughput with a varying search/insert
ratio: LS-MVBT wins unless more than 93% of the transactions are
searches.
Recovery
WAL mode exhibits long recovery latency due to log replay. How about
LS-MVBT? Measuring recovery time with a varying number of insert
operations to roll back: LS-MVBT recovers 5~6x faster than WAL.
How much performance improvement does each optimization contribute?

Effect of Disabling Sibling Redistribution
SQLite insert (response time and number of dirty pages): disabling
sibling redistribution alone yields 20% faster insertion.
Performance Quantification
SQLite throughput with all optimizations combined:
- LS-MVBT: 704 insertions/sec
- WAL mode: 417 insertions/sec, so LS-MVBT is a 70% (x1.7) improvement
  over WAL
- Truncate mode: 53 insertions/sec, so LS-MVBT is a 1220% (x13.2)
  improvement over Truncate
Query response time
Response time of MVBT is 80% of WAL
Response time of LS-MVBT is 58% of WAL
Conclusion
- LS-MVBT resolves the Journaling of Journal anomaly by:
  • Reducing the number of fsync() calls with the version-based B-tree.
  • Reducing the overhead of a single fsync() via lazy split, disabling
    sibling redistribution, metadata embedding, reserved buffer space,
    and lazy garbage collection.
- LS-MVBT achieves a 70% performance gain over WAL mode
  (417 ins/sec -> 704 ins/sec) solely via software optimization.
Thank you…