LG Electronics Seminar, Mar 18, 2014
Paper presented at USENIX FAST '14, Santa Clara, CA

Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split

Wook-Hee Kim (1), Beomseok Nam (1), Dongil Park (2), Youjip Won (2)
(1) Ulsan National Institute of Science and Technology, Korea
(2) Hanyang University, Korea

Outline
• Motivation: the Journaling of Journal anomaly
• How to resolve the Journaling of Journal anomaly: Multi-Version B-Tree (MVBT)
• Optimizations of the Multi-Version B-Tree
  • Lazy Split
  • Reserved Buffer Space
  • Lazy Garbage Collection
  • Metadata Embedding
  • Disabling Sibling Redistribution
• Evaluation
• Demo

Android is the most popular mobile platform

Storage in the Android platform
• Storage is a performance bottleneck and a major cause of device breakdown.
• The cause: excessive I/O.

Android I/O Stack
• SQLite: insert/update/delete/select
• EXT4: read/write — the interaction between SQLite and EXT4 is misaligned
• Block device driver

SQLite insert (Truncate mode)
1. Open the rollback journal.
2. Record the original data in the journal.
3. Put a commit mark in the journal.
4. Insert the entry into the DB.
5. Truncate the journal.

EXT4 Journaling (ordered mode)
For each write() from SQLite, EXT4:
1. Writes the data block.
2. Writes the EXT4 journal.
3. Writes the journal metadata.
4. Writes the journal commit block.

Journaling of Journal Anomaly
• One SQLite insert of 100 bytes becomes 9 random writes of 4 KB:
  SQLite journals the database, and EXT4 then journals each of SQLite's
  writes again (data block, EXT4 journal, journal metadata, journal commit).

How to Resolve Journaling of Journal?
• Combine database journaling and database insertion in a single structure.
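As a concrete illustration of the cost being attacked, the truncate-mode steps listed earlier can be sketched as follows. This is a simplified sketch, not SQLite's actual journal format; the function name and the `fsync_log` parameter are illustrative.

```python
import os

PAGE_SIZE = 4096

def truncate_mode_insert(db_path, journal_path, page_no, new_page, fsync_log):
    """One insert under truncate-mode rollback journaling (sketch).

    fsync_log collects a label per fsync() so the caller can count the
    synchronous barriers a single logical insert costs."""
    # 1-3. Open the rollback journal, record the original page image,
    #      then append a commit mark.
    with open(db_path, "rb") as db:
        db.seek(page_no * PAGE_SIZE)
        old_page = db.read(PAGE_SIZE)
    with open(journal_path, "wb") as j:
        j.write(old_page)
        j.flush(); os.fsync(j.fileno()); fsync_log.append("journal data")
        j.write(b"COMMIT")
        j.flush(); os.fsync(j.fileno()); fsync_log.append("journal commit")
    # 4. Insert the entry: overwrite the page in the database file.
    with open(db_path, "r+b") as db:
        db.seek(page_no * PAGE_SIZE)
        db.write(new_page)
        db.flush(); os.fsync(db.fileno()); fsync_log.append("db page")
    # 5. Truncate the journal to invalidate it.
    with open(journal_path, "r+b") as j:
        j.truncate(0)
        os.fsync(j.fileno()); fsync_log.append("journal truncate")
```

With EXT4 ordered-mode journaling underneath, each of these four fsync() barriers triggers its own data-block and EXT4-journal writes, which is how a 100-byte insert fans out into roughly nine 4 KB random writes.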
• Reduce the number of fsync() calls: fold journaling into the database itself.
• Journaling + database = Version-based B-Tree (MVBT).

Version-based B-Tree (MVBT)
• Example: insert key 10 at time T1, then update it with 20 at T2.
  The entry 10 [T1, ∞) becomes 10 [T1, T2), and a new entry 20 [T2, ∞) is added.
• Old version data is never overwritten, so no rollback journal is needed.

Node Split in Multi-Version B-Tree
• Insert key 25 at version 5 into a full node, Page 1, holding
  5 [3,∞), 10 [2,∞), 12 [4,∞), 40 [1,∞).
• Every live entry in Page 1 has its interval closed at version 5
  (5 [3,5), 10 [2,5), 12 [4,5), 40 [1,5)): Page 1 becomes a dead node.
• The live entries are copied into two new nodes with intervals [5,∞):
  Page 3 gets 5 and 10, Page 2 gets 12, 25, and 40.
• A new parent, Page 4, indexes all three:
  10 [5,∞) → Page 3, ∞ [5,∞) → Page 2, ∞ [0,5) → Page 1.
• Result: Pages 1-4 are all dirty — one more dirty page than an original
  B-tree split.

I/O Traffic in MVBT
• Each fsync() flushes the dirty pages in the DB buffer cache.
• Goal: reduce the number of dirty pages per fsync()!
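The no-overwrite versioning rule behind MVBT can be sketched in a few lines. This models only the version intervals, not the paper's on-disk page layout; the class and method names are illustrative.

```python
INF = float("inf")

class MVBTNode:
    """Entries are (key, value, t_start, t_end): visible to transactions
    t with t_start <= t < t_end. Updates never overwrite old versions,
    so no rollback journal is needed -- the history lives in the tree."""

    def __init__(self):
        self.entries = []

    def insert(self, key, value, t):
        # A new entry is live from t onward.
        self.entries.append((key, value, t, INF))

    def update(self, key, value, t):
        for i, (k, v, ts, te) in enumerate(self.entries):
            if k == key and te == INF:
                self.entries[i] = (k, v, ts, t)       # close the old version
        self.entries.append((key, value, t, INF))     # append the new version

    def lookup(self, key, t):
        for k, v, ts, te in self.entries:
            if k == key and ts <= t < te:
                return v
        return None
```

Inserting key 10 at T1 and updating it with 20 at T2 reproduces the slide: a lookup as of T1 still returns the old value, while T2 sees the new one.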
I/O Traffic in MVBT
• MVBT: each fsync() flushes several dirty pages from the DB buffer cache.
• LS-MVBT: each fsync() flushes fewer dirty pages.

Optimizations in Android I/O
• t_write >> t_read: write performance dominates.
• Workloads are dominated by single-insert transactions rather than
  multi-insert transactions.

Lazy Split Multi-Version B-Tree (LS-MVBT) Optimizations
• Lazy Split
• Reserved Buffer Space
• Lazy Garbage Collection
• Metadata Embedding
• Disabling Sibling Redistribution

Optimization 1: Lazy Split
• A legacy split in MVBT dirties 4 pages: the dead node (Page 1), the two
  new nodes (Pages 2 and 3), and the parent (Page 4).
• Lazy split instead, when inserting key 25 at version 5 into full Page 1
  (5 [3,∞), 10 [2,∞), 12 [4,∞), 40 [1,∞)):
  1. Close the intervals of only the upper-half entries
     (12 [4,5), 40 [1,5)) — Page 1 becomes a lazy node: half-dead, half-live.
  2. Copy those entries into one new sibling, Page 2
     (12 [5,∞), 25 [5,∞), 40 [5,∞)).
  3. The parent, Page 3, indexes both:
     10 [5,∞) → Page 1, ∞ [0,5) → Page 1, ∞ [5,∞) → Page 2.
• Lazy Node Overflow: a later insert (e.g., key 8 at version 6) may find the
  lazy node full of dead entries → garbage collect the dead entries
  (12 [4,5), 40 [1,5)) and reuse their slots.
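The lazy-split step above can be sketched as a pure function over version intervals. The entry representation `(key, t_start, t_end)` and the function name are illustrative, and the sketch covers only how intervals move, not page allocation.

```python
INF = float("inf")

def lazy_split(entries, version):
    """Lazy split sketch: close the intervals of only the upper-half live
    entries, leave them in place as dead entries (the node becomes
    half-live, half-dead), and copy them into one new sibling.

    A legacy MVBT split kills the whole node and rewrites every live entry
    into two fresh nodes; this variant dirties one page fewer.
    Entries are (key, t_start, t_end); t_end == INF means live."""
    live = sorted(e for e in entries if e[2] == INF)
    moved = set(live[len(live) // 2:])        # upper half, by key order
    lazy_node = [(k, ts, version) if (k, ts, te) in moved else (k, ts, te)
                 for k, ts, te in entries]
    sibling = [(k, version, INF) for k, ts, te in live[len(live) // 2:]]
    return lazy_node, sibling
```

Running it on the slide's Page 1 at version 5 leaves 5 and 10 live, marks 12 and 40 dead in place, and produces a sibling holding fresh copies of 12 and 40.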
• After garbage collection, Page 1 holds 5 [3,∞), 8 [6,∞), 10 [2,∞).
• But what if the dead entries are being accessed by other transactions?

Optimization 2: Reserved Buffer Space
• Option 1: wait for the read transactions to finish.
• Option 2: split as in a legacy MVBT split — the lazy node is killed and
  its live entries are copied into fresh nodes, dirtying extra pages.
• LS-MVBT instead reserves buffer space in each node: the reserved space
  is used when the lazy node is full, so the insert neither waits nor
  splits. If the reserved buffer space is also full, split as in legacy MVBT.

Rollback in LS-MVBT
• Similar to rollback in MVBT: if transaction 5 crashes, its entries
  (intervals starting at version 5) are discarded and the intervals it
  closed are reopened (e.g., 12 [4,5) → 12 [4,∞), 40 [1,5) → 40 [1,∞)).
• The number of dirty nodes touched by a rollback in LS-MVBT is also
  smaller than in MVBT.
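The decision an insert faces at a full lazy node can be summarized in one function. This is my reading of the slides' three options; the function name, parameter names, and the capacity constant are illustrative, not the paper's tuning.

```python
RESERVED_SLOTS = 2  # illustrative reserved-buffer capacity per node

def on_lazy_node_full(oldest_active_reader, dead_version, used_reserved):
    """What to do when an insert finds the lazy node full (sketch).

    dead_version: the version at which the node's dead entries were closed.
    oldest_active_reader: the oldest version any running read transaction
    may still need to see."""
    if oldest_active_reader >= dead_version:
        # No running reader can see the dead entries: reclaim them in place.
        return "garbage-collect dead entries"
    if used_reserved < RESERVED_SLOTS:
        # Readers still need the dead entries: spill the new entry into
        # the node's reserved buffer space instead of waiting or splitting.
        return "use reserved buffer space"
    # Reserved buffer space is also full: fall back to a legacy MVBT split.
    return "legacy MVBT split"
```

The ordering reflects the slides' cost argument: in-place garbage collection and the reserved buffer keep the commit at one dirty page, while a legacy split is the last resort.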
Optimization 3: Lazy Garbage Collection
• Periodic garbage collection stops the running transactions, and the GC's
  own fsync() flushes extra dirty pages.
• Lazy garbage collection: do not garbage collect if no space is needed.
  Dead entries are reclaimed by the insert that needs their slots, so GC
  adds no extra dirty page and no extra fsync().

Optimization 4: Metadata Embedding
• The version is the "File Change Counter" in the DB header page.
• Normally, a write transaction commit increments the counter in the
  header page, so every commit dirties 2 pages: the header page and the
  modified data page.
• Instead, flush the "File Change Counter" into the last modified page,
  and keep a copy in RAMDISK for read transactions.
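The dirty-page accounting behind metadata embedding can be sketched by contrasting the two commit paths. Pages are modeled as plain dicts and the function names are illustrative; the point is only which pages a commit dirties.

```python
def commit_with_header_counter(header_page, data_page, new_value):
    """Baseline: the file-change counter lives in the DB header page,
    so every commit dirties the header page plus the modified data page."""
    data_page["data"] = new_value
    header_page["change_counter"] += 1
    return 2  # dirty pages flushed by this commit's fsync()

def commit_with_embedded_counter(data_page, ramdisk, new_value):
    """Metadata embedding: the counter rides along inside the last
    modified page, and a volatile RAMDISK copy serves read transactions,
    so the header page stays clean."""
    data_page["data"] = new_value
    data_page["embedded_counter"] = ramdisk["change_counter"] + 1
    ramdisk["change_counter"] += 1   # volatile copy, lost on a crash
    return 1  # only the data page is flushed
```

The RAMDISK copy costs nothing durable: it is rebuilt after a crash, while the embedded counter persisted to flash together with the data page.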
Optimization 4: Metadata Embedding (crash case)
• The RAMDISK copy of the counter is volatile and is lost on a crash, but
  the counter embedded in the last modified page survives on flash.
• Result: only 1 dirty page per write transaction commit.

Optimization 5: Disable Sibling Redistribution
• Sibling redistribution hurts insertion performance: inserting key 20
  into a full node redistributes keys with its siblings, dirtying 4 pages
  (the node, its siblings, and the parent).
• With sibling redistribution disabled, the full node simply splits,
  dirtying 3 pages instead of 4.

Summary of Optimizations
• Lazy Split: avoid dirtying an extra page when a split occurs.
• Reserved Buffer Space: reduce the probability of a node split.
• Lazy Garbage Collection: delete dead entries on the next mutation.
• Metadata Embedding: avoid dirtying the header page.
• Disabling Sibling Redistribution: do not touch sibling pages.

Performance Evaluation
• Implemented LS-MVBT in SQLite 3.7.12.
• Testbed: Samsung Galaxy S4 (Jelly Bean 4.3)
  • Exynos 5 Octa 5410 quad-core 1.6 GHz
  • 2 GB memory
  • 16 GB eMMC
• Performance analysis tools: Mobibench, MOST

Analysis of Insert Performance
• SQLite insert response time: LS-MVBT is 30~78% faster than WAL mode,
  depending on the WAL checkpointing interval
  (time breakdown: fsync() vs. B-tree computation).
• Dirty pages per insert: LS-MVBT flushes only one dirty page.

I/O Traffic at the Block Device Level
• For 1,000 insertions, LS-MVBT saves 67% of the I/O traffic:
  9.9 MB (LS-MVBT) vs. 31 MB (WAL).
• This translates into up to 3x longer NAND flash lifetime.

How about search performance?
• LS-MVBT is optimized for write performance.

Search Performance
• Transaction throughput with a varying search/insert ratio:
  LS-MVBT wins unless more than 93% of the transactions are searches.

How about recovery latency?
• WAL mode exhibits longer recovery latency due to log replay.
• Recovery time with a varying number of insert operations to roll back:
  LS-MVBT recovers 5~6x faster than WAL.

How much performance improvement comes from each optimization?
• Effect of disabling sibling redistribution alone
  (insert response time and number of dirty pages): 20% faster insertion.

Performance Quantification
• Throughput of SQLite with all optimizations combined:
  • LS-MVBT: 704 insertions/sec
  • WAL: 417 insertions/sec
  • Truncate: 53 insertions/sec
• LS-MVBT improves throughput by 70% (x1.7) over WAL mode and by
  1220% (x13.2) over Truncate mode.

Query Response Time
• Response time of MVBT is 80% of WAL's.
• Response time of LS-MVBT is 58% of WAL's.

Conclusion
LS-MVBT resolves the Journaling of Journal anomaly by
• reducing the number of fsync() calls with a version-based B-tree, and
• reducing the overhead of a single fsync() via lazy split, disabling
  sibling redistribution, metadata embedding, reserved buffer space, and
  lazy garbage collection.
LS-MVBT achieves a 70% performance gain over WAL mode
(417 ins/sec → 704 ins/sec) solely via software optimization.

Thank you.