A Lightweight Transac0onal Design LightTx: in Flash-­‐based SSDs to Support Flexible Transac0ons Youyou Lu1, Jiwu Shu1, Jia Guo1, Shuai Li1, Onur Mutlu2 1Tsinghua University 2Carnegie Mellon University ExecuHve Summary • Problem: Flash-­‐based SSDs support transacHons naturally (with out-­‐of-­‐place updates) but inefficiently: – Only a limited set of isolaHon levels are supported (inflexible) – IdenHfying transacHon status is costly (heavyweight) • Goal: a lightweight design to support flexible transacHons • ObservaHons and Key Ideas: – Simultaneous updates can be wriQen to different physical pages, and the FTL mapping table determines the ordering => (Flexibility) make commit protocol page-­‐independent – TransacHons have birth and death, and the near-­‐logged update way enables efficient tracking => (Lightweight) track recently updated flash blocks, and reHre the dead transacHons • Results: up to 20.6% performance improvement, stable GC overhead, fast recovery with negligible persistence overhead 2 SSD Basics Flash Memory Pkg #3 Die 0 Plane 1 Flash Memory Pkg #2 FTL Plane 0 H/W Inter face Flash Memory Pkg #1 Plane 0 Host Interconnect Flash Memory Pkg #0 Plane 1 • FTL (Flash TranslaHon Layer) • Address mapping, garbage collecHon, wear leveling • Out-­‐of-­‐place Update (address mapping) • Pages are updated to new physical pages instead of overwriHng original pages • Internal Parallelism • New pages are allocated from different pkgs/planes • Page metadata (OOB): (4096 + 224)Bytes Die 1 Flash Memory Pkg 3 Two ObservaHons Flash Memory Pkg #3 Die 0 Plane 1 Flash Memory Pkg #2 FTL Block Plane 0 H/W Inter face Flash Memory Pkg #1 Plane 0 Host Interconnect Flash Memory Pkg #0 Plane 1 • Simultaneous updates and FTL ordering • (Out-­‐of-­‐place update) pages for the same LBA can be updated simultaneously • (Ordering in mapping table) Only when the mapping table is updated, the write is visible to the external • Near-­‐logged update way • Pages are allocated from blocks over different parallel units • Pages are sequenHally allocated from each block Die 1 Flash Memory Pkg 4 Outline • ExecuHve Summary • Background • TradiHonal Sodware TransacHons • ExisHng Hardware TransacHons • LightTx Design • EvaluaHon • Conclusions 5 TradiHonal S/W TransacHon • TransacHon: Atomicity and Durability • Sodware TransacHon – Duplicate writes – SynchronizaHon for ordering Logical View Logical View Data Area Data Area Log Area HDD Log Area FTL Mapping Table SSD 6 We have both old and new versions in the SSD (out-­‐of-­‐place update). Why shall we write the log? Why not support transacHons inside the SSD? 7 ExisHng H/W Approaches • Atomic-­‐Write [HPCA’11] – Log-­‐structured FTL – Commit protocol: Tag the last page “1”, while the others “0” – Limited Parallelism: one tx at a Hme – High mapping persistence overhead: persistence on each commit • SCC/BPCC (Cyclic commit protocols) [OSDI’08] – Commit Protocol: Link all flash pages in a cyclic list by keeping pointers in page metadata – High overhead in differenHate broken cyclic lists for par$al erased commi.ed txs and aborted txs – SCC forces aborted pages erased before wriHng the new one – BPCC delays the erase of pages to its previous aborted pages are erased – Limited Parallelism: txs without overlapped accesses are allowed [HPCA’11] Beyond block i/o: Rethinking tradiHonal storage primiHves [OSDI’08] TransacHonal flash 8 Problems: • Tx support is inflexible (limited parallelism) – Cannot meet the flexible demands from sodware – Cannot fully exploit the internal parallelism of SSDs • Tx state tracking causes high overhead in the device Our Goal: A lightweight design to support flexible transacHons 9 Outline • ExecuHve Summary • Background • LightTx Design • Design Overview • Page Independent Commit Protocol • Zone-­‐based TransacHon State Tracking • EvaluaHon • Conclusions 10 Goal A lightweight design to support flexible transacHons Flexible • Page-­‐independent commit protocol: support simultaneous updates, to enable flexible isolaHon level choices in the system Lightweight • Zone-­‐based transacHon state tracking scheme: track only blocks that have live txs and reHre the dead ones, to reduce lower the cost 11 Page-­‐independent Commit Protocol • ObservaHons: – Simultaneous Updates (Out-­‐of-­‐place update) – Version order (FTL mapping table) • How to support this? A B C D Page E 1 2 – Extend the storage 3 interface Version – Make commit protocol number page-­‐independent 12 Design Overview Applications Database Systems File Systems READ, WRITE, BEGIN, COMMIT, ABORT FTL Mapping Table Active TxTable Commit Logic Recovery Logic Garbage Collection Data Area • TransacHon PrimiHve Read/Write Cache Wear Levelling FTL Free Blocks and Zone Mgmt. Mapping Table Flash Media – BEGIN(TxID) – COMMIT(TxID ) – ABORT(TxID) – WRITE(TxID, LBA, len …) 13 Page-­‐independent Commit Protocol LPN TxID • TransacHonal metadata: <TxID, TxCnt, TxVer> TxCnt ECC Page Metadata (OOB) Page Data A – TxID B – TxCnt: (00…0N) T0, T0, 0 – TxVer: commit sequence 1 0 • Keep it in the page metadata of each flash page TxVer T1, 2 0 T1, 2 3 T3, 1 Version number C T0, 0 Page D E T0, 0 T0, 5 14 Zone-­‐based TransacHon State Tracking • TransacHon LifeHme BEGIN COMMIT/ABORT CHECKPOINT ERASED Completed Active Live Dead – ReHre the dead: write back the mapping table, and remove the dead from tx state maintenance Can we write back the mapping back for each commit? -­‐ Ordering cost (waiHng for mapping table persistence) -­‐ Mapping persistent is not atomic • Writes appended in the free flash blocks – Track the recently updated flash blocks 15 • Block Zones – Free block: all pages are free – Available block: pages are available for allocaHon – Unavailable block: all pages have been wriQen to but some pages belong to (1) a live tx, or (2) a dead tx but has at least one page in some available block – Checkpointed block: all pages have been wriQen to and all pages belong to dead txs • RespecHvely, we have Free, Available, Unavailable and Checkpointed Zones. 16 • Checkpoint – Periodically write back the mapping table (making the txs dead) – And, sliding the zones (available + unavailable) • Zone Sliding – Check all blocks in available and unavailable zones • Move the block to the checkpointed zone if the block is checkpointed • Move the block to the unavailable zone if the block is unavailable – Pre-­‐allocate free blocks to the available zone – Garbage collecHon is only performed on the checkpointed zone 17 (1) Available Zone UpdaHng Tx3 Tx4 Tx5 2-­‐0 4-­‐0 6-­‐1 6-­‐2 2-­‐2 7-­‐0 2-­‐3 4-­‐1 3-­‐3 5-­‐0 5-­‐1 Tx4, 0 Tx4, 0 Tx5, 0 Tx3, 0 Tx4, 0 Tx3, 0 Tx3, 0 Tx3, 4 Tx5, 0 Tx5, 0 Tx4, 0 18 (2) Zone Sliding Tx3 Tx4 Tx5 Tx6 Tx7 Tx8 2-­‐0 4-­‐0 6-­‐1 6-­‐2 2-­‐2 7-­‐0 2-­‐3 4-­‐1 3-­‐3 5-­‐0 5-­‐1 Abort Tx5 4-­‐2 4-­‐3 Tx9 7-­‐3 Tx3, 0 Tx3, 0 Tx3, 4 Tx7, 0 Tx7, 2 Tx4, 0 Tx4, 0 Tx3, 0 Tx4, 0 Tx6, 0 Tx6, 0 Tx4, 0 Tx8, 0 Tx8, 0 Tx4, 5 Tx8, 3 Tx9, 0 Tx5, 0 Tx5, 0 Tx5, 0 Tx9, 0 Tx5, 0 6-­‐3 8-­‐0 7-­‐1 7-­‐2 5-­‐2 9-­‐1 9-­‐0 19 (3) Zone Sliding Tx7 Tx3 Tx8 Tx4 Tx5 Tx6 2-­‐0 4-­‐0 6-­‐1 6-­‐2 2-­‐2 7-­‐0 2-­‐3 4-­‐1 3-­‐3 5-­‐0 5-­‐1 Abort Tx5 4-­‐2 4-­‐3 Tx9 7-­‐3 Tx3, 0 Tx3, 0 Tx3, 4 Tx7, 0 Tx7, 2 Tx4, 0 Tx4, 0 Tx3, 0 Tx4, 0 Tx6, 0 Tx6, 0 Tx4, 0 Tx8, 0 Tx8, 0 Tx4, 5 Tx8, 3 Tx9, 0 Tx5, 0 Tx5, 0 Tx5, 0 Tx9, 0 Tx5, 0 6-­‐3 8-­‐0 7-­‐1 7-­‐2 5-­‐2 9-­‐1 9-­‐0 20 (4) System Failure Tx3 Tx4 Tx5 Tx6 Tx7 Tx8 2-­‐0 4-­‐0 6-­‐1 6-­‐2 2-­‐2 7-­‐0 2-­‐3 4-­‐1 3-­‐3 5-­‐0 5-­‐1 Abort Tx5 4-­‐2 4-­‐3 Tx9 7-­‐3 Tx10 Tx11 6-­‐3 8-­‐0 7-­‐1 7-­‐2 9-­‐0 5-­‐2 9-­‐1 9-­‐2 11-­‐0 11-­‐1 10-­‐0 8-­‐1 Tx3, 0 Tx3, 0 Tx3, 4 Tx7, 0 Tx7, 2 Tx10, 0 Tx4, 0 Tx4, 0 Tx3, 0 Tx4, 0 Tx6, 0 Tx6, 0 Tx4, 0 Tx8, 0 Tx8, 0 Tx4, 5 Tx8, 3 Tx9, 0 Tx9, 0 Tx11, 0 Tx11, 3 Tx5, 0 Tx5, 0 Tx5, 0 Tx9, 0 Tx5, 0 Tx11, 0 21 Recovery Tx7 • Scan the available zone • Scan the unavailable zone Tx8 Tx9 Tx10 Tx11 8-­‐0 6-­‐3 9-­‐0 7-­‐1 7-­‐2 9-­‐1 9-­‐2 5-­‐2 11-­‐0 11-­‐1 10-­‐0 8-­‐1 Tx3, 0 Tx4, 0 Tx6, 0 Tx6, 0 Tx3, 0 Tx3, 0 Tx3, 4 Tx7, 0 Tx7, 2 Tx10, 0 Tx5, 0 Tx5, 0 Tx9, 0 Tx5, 0 Tx4, 0 Tx8, 0 Tx8, 0 Tx4, 5 Tx8, 3 Tx9, 0 Tx9, 0 Tx11, 0 Tx11, 3 Tx11, 0 22 • Recovery – Scan the available zone • If TxCnt matches, completed tx • If not, add the tx to the pending list – Scan the unavailable zone • If TxID in the pending list, check TxCnt again. If TxCnt matches, completed tx • If txID not in the pending list, discard it • If TxCnt sHll doesn’t match, uncompleted tx – Replay with the sequence of TxVer 23 Outline • • • • • ExecuHve Summary Background LightTx Design EvaluaHon Conclusions 24 Experimental Setup • SSD simulator – SSD add-­‐on from Microsod on DiskSim – Parameters from Samsung K9F8G08UXM NAND flash • Trace – TPC-­‐C benchmark: DBT2 on PostgreSQL 8.4.10 25 Flexibility (1) For a given isolaHon level, LightTx provides as good or beQer tx throughput than other protocols. (2) In LightTx, no-­‐page-­‐conflict and serializaHon isolaHon improve throughput by 19.6% and 20.6% over strict isolaHon. 26 A bort R a tio (0% ) A bort R a tio (10% ) A bort R a tio (20% ) A bort R a tio (50% ) T ra ns a c tion T hroug hput (tx s /s ) 600 500 400 300 200 100 0 A tom ic -­‐W rite SCC BPC C T ra ns a c tion P rotoc ol L ig htT x N orm a liz e d G a rba g e C olle c tion C os t Garbage CollecHon Cost 45 A bort R a tio (0% ) A bort R a tio (10% ) A bort R a tio (20% ) A bort R a tio (50% ) 40 35 30 25 20 15 10 5 0 A tom ic -­‐W rite SCC BPC C T ra ns a c tion P rotoc ol L ig htT x (1) LightTx significantly outperforms SCC/BPCC when abort raHo is not zero. (2) Garbage collecHon overhead in SCC/BPCC goes extremely high when abort raHo goes up. 27 8 R e c ove ry T im e (s e c onds ) 7 A tom ic -­‐W rite S C C /B P C C L ig htT x 6 0.30 0.25 0.20 0.15 0.10 0.05 0.00 4M 16M 64M 256M S iz e of the A va ila ble Z one (byte ) 1G Ma pping P e rs is te nc e O ve rhe a d (% ) Recovery Time and Persistence Overhead 5 4 A tom ic -­‐W rite S C C /B P C C L ig htT x 3 2 1 0 4M 16M 64M 256M S iz e of the A va ila ble Z one (byte ) 1G LightTx achieves fast recovery with low mapping persistence overhead. 28 Outline • • • • • ExecuHve Summary Background LightTx Design EvaluaHon Conclusions 29 Conclusion • Problem: Flash-­‐based SSDs support transacHons naturally (with out-­‐of-­‐place updates) but inefficiently: – Only a limited set of isolaHon levels are supported (inflexible) – IdenHfying transacHon status is costly (heavyweight) • Goal: a lightweight design to support flexible transacHons • ObservaHons and Key Ideas: – Simultaneous updates can be wriQen to different physical pages, and the FTL mapping table determines the ordering => (Flexibility) make commit protocol page-­‐independent – TransacHons have birth and death, and the near-­‐logged update way enables efficient tracking => (Lightweight) track recently updated flash blocks, and reHre the dead transacHons • Results: up to 20.6% performance improvement, stable GC overhead, fast recovery with negligible persistence overhead 30 Thanks A Lightweight Transac0onal Design LightTx: in Flash-­‐based SSDs to Support Flexible Transac0ons Youyou Lu1, Jiwu Shu1, Jia Guo1, Shuai Li1, Onur Mutlu2 1Tsinghua University 2Carnegie Mellon University 31 Backup Slides 32 ExisHng Approaches (1) + No logging, no commit record + No tx state maintenance cost – Log-­‐structured FTL – TransacHon state: <00…1> -­‐ Poor parallelism -­‐ Mapping persistence overhead • Atomic Writes [HPCA’11] Beyond block i/o: Rethinking tradiHonal storage primiHves 33 ExisHng Approaches (2) • SCC/BPCC (Cyclic commit protocol) – Use pointers in the page metadata to put all pages in a cycle for each tx TransacHonal flash [OSDI’08] A B C D Page E 1 2 3 4 Version number 34 • SCC A – Block erase forced for aborted pages B C D Page E 1 -­‐ Low garbage collecHon efficiency: lots of data moves due to forced block erase 2 3 4 Version number 35 A • BPCC – SRS: Straddle Responsibility Set – Erasable only ader SRS is empty -­‐ Complex and costly SRS updates -­‐ Low garbage collecHon efficiency: wait unHl SRS is empty B C D Page E 1 2 3 4 Version number SRS(D3) = {C2, D2} SRS(E3) = {D2} SRS(D4) = {C2, D2, D3} 36 Atomic Writes SCC/BPCC + No logging, no commit + No logging, no commit record record + No tx state + Improved parallelism maintenance overhead -­‐ Limited parallelism -­‐ Poor parallelism -­‐ High tx state -­‐ Mapping persistence maintenance overhead overhead Problems: • Tx support is inflexible (limited parallelism) – Cannot meet the flexible demands from sodware – Cannot fully exploit the internal parallelism of SSDs • Tx state tracking causes high overhead in the device A lightweight design to support flexible transacHons 37