Speculations: Speculative Execution in a Distributed File System and Rethink the Sync
Edmund Nightingale, Kaushik Veeraraghavan, Peter Chen, Jason Flinn
Presentation by Ji-Yong Shin (some slides are from Nightingale's talk)

Agenda
• CAP theorem, consistency semantics, consistency models
• Papers
  – Speculative Execution in a Distributed File System (Award Paper from SOSP '05)
  – Rethink the Sync (Best Paper from OSDI '06)

CAP Theorem by Eric Brewer
• At most two of C, A, and P can be satisfied simultaneously
  – Consistency: correctness of data
  – Availability: guaranteed immediate access to data
  – Partition tolerance: guaranteed functioning despite network disruption or partition
[Diagram: two nodes N1 and N2 syncing replicas of objects A and B]

ACID vs. BASE (not exact opposites, but...)
ACID
• Atomicity
• Consistency
• Isolation
• Durability
BASE
• Basically Available
• Soft state
• Eventually consistent

Distributed File System Design

Consistency Semantics by Leslie Lamport
• Atomic (single copy)
  – Every read returns the value of the most recent write
• Regular
  – A read not concurrent with any write returns the most recent write
  – A read concurrent with some writes returns either the most recent write or the value of a concurrent write
• Safe
  – A read not concurrent with any write returns the most recent write
  – A read concurrent with some writes may return any value

Consistency Models
• Strict consistency
  – All executions in strict order
• Sequential consistency
  – All execution results exposed in strict order
• Causal consistency
  – All execution results with causal dependencies exposed in strict order
• Close-to-open consistency
  – All execution results of a process that closed the file are exposed to a process opening the file
• Delta consistency
  – After a fixed period of time, all memory copies become consistent
• Eventual consistency
  – After a sufficiently long period of time, all memory copies become consistent

Agenda
• Consistency and the CAP theorem
• Papers
  – Speculative Execution in a Distributed File System (Award Paper from SOSP '05)
  – Rethink the Sync (Best Paper from OSDI '06)

Authors
• Edmund B. Nightingale
  – PhD from UMich (advisor: Jason Flinn)
  – Microsoft Research
  – Both papers are part of his PhD thesis
• Kaushik Veeraraghavan
  – PhD student at UMich (Jason Flinn)
• Peter M. Chen
  – PhD from UC Berkeley (David Patterson)
  – Faculty at UMich
• Jason Flinn
  – PhD from CMU (Mahadev Satyanarayanan)
  – Faculty at UMich

Speculation
• Execute using an assumption
  – If the assumption holds: performance gain
  – If the assumption fails: roll back and restart execution
• Not much performance overhead
• Examples
  – Branch prediction
  – Transactions
  – Thread-level speculation in multiprocessors (or multicores)
[Figure: a pipeline (IF/DEC/EX/WB) speculatively executing "if (a == 1) { b = 0; c = 1; } else { b = 1; c = 0; }", rolling back and restarting when the prediction is wrong]

Motivation and Approach
• Distributed file systems
  – Significant cost for consistency and safety
    • Block and wait on sync messages and writes
  – Tradeoff between consistency and performance
    • Weak consistency for high performance
• Speculative distributed file system
  – Execute sync operations in an async manner
  – While syncing, execute the next operation on cached files
  – Check correctness later and roll back if necessary
  – Guarantee single-copy semantics

Big Idea: Speculator
• Slow way: the client blocks on each RPC request until the server's response arrives
• With Speculator:
  1) Checkpoint
  2) Speculate!
  3) Correct?
     – Yes: discard the checkpoint
     – No: restore the process from the checkpoint and re-execute
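Aside: the checkpoint/speculate/resolve flow above can be demonstrated in user space. The sketch below is illustrative only and is not the authors' in-kernel implementation: server_agrees() stands in for the real RPC round trip, the hard-coded guess of 42 models a value served from the local cache, and a copy-on-write fork() plays the role of the checkpoint. A real implementation must also buffer the speculative output until the speculation commits, as the later Ensuring Correctness slide explains.

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/wait.h>

  /* Stands in for the slow round trip to the file server. */
  static int server_agrees(int guess)
  {
      sleep(1);
      return guess == 42;
  }

  int main(void)
  {
      int ctl[2];
      if (pipe(ctl) != 0)
          return 1;

      int guess = 42;                  /* value speculated from the local cache */
      pid_t ckpt = fork();             /* 1) checkpoint via copy-on-write fork() */
      if (ckpt == 0) {
          /* The checkpoint process just waits to learn its fate. */
          char verdict = 'd';
          read(ctl[0], &verdict, 1);
          if (verdict == 'r')
              printf("rolled back: re-executing without speculation\n");
          _exit(0);
      }

      /* 2) speculate: the original process runs ahead using the guess
       *    while the "RPC" completes in the background.               */
      printf("speculative work using value %d\n", guess);

      /* 3) correct? commit (discard the checkpoint) or fail (restore). */
      char verdict = server_agrees(guess) ? 'd' : 'r';
      write(ctl[1], &verdict, 1);
      waitpid(ckpt, NULL, 0);
      return 0;
  }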
Conditions for Success
1. Highly predictable operations
   – Misprediction can worsen performance
     • Mispredictions must be rare
2. Checkpointing faster than remote I/O
   – Slow checkpointing is not worth doing
     • 52 µs for a small process < network I/O
3. Spare resources available for speculation
   – Speculation requires memory and CPU cycles
     • Modern computers have abundant resources

Implementing Speculation
1) System call
2) Create speculation (create_speculation)
[Figure: timeline of a speculative system call — a copy-on-write fork() takes the checkpoint, an undo log keeps an ordered list of speculative operations, and the spec object tracks kernel objects that depend on it]

Speculation Success
1) System call
2) Create speculation
3) Commit speculation (commit_speculation)
[Figure: on commit, the checkpoint and the speculation's undo log entries are discarded]

Speculation Failure
1) System call
2) Create speculation
3) Fail speculation (fail_speculation)
[Figure: on failure, the process is restored from the checkpoint and the undo log rolls back the kernel objects that depended on the speculation]

Multi-Process Speculation
• Processes often cooperate
  – Example: "make" forks children to compile, link, etc.
  – Would block if speculation were limited to one task
• Supports
  – Propagating dependencies among objects
  – Rolling objects back to prior states when speculations fail
[Figure: two processes (pid 8000 and pid 8001) with checkpoints for Spec 1 and Spec 2, both depending on inode 3456 through the speculative operations Chown-1 and Write-1]

Ensuring Correctness
• Speculative state must never be visible to
  1. The user or an external device
  2. A process that does not depend on it
• Controlling a speculative process
  – Block access to the external environment
    • Read-only calls (getpid) and private state updates (dup2) are allowed
  – Buffer writes to external devices
  – Propagate speculation if necessary

Multi-Process Speculation
• Supports
  – Objects in the distributed file system
    • Explained in the next slides
  – Objects in a local memory file system (RAMFS)
  – Objects in a local disk file system
    • Uses a buffering strategy for speculation
    • Shared on-disk metadata: only valid state is committed, using redo and undo
    • Journal: only non-speculative operations are committed
  – Etc.: pipes, FIFOs, UNIX sockets, signals, fork, exit
• Doesn't support
  – Write-shared memory, including System V IPC and futexes

Using Speculation
[Figure: Client 1 runs "cat foo > bar" while Client 2 runs "cat bar"; version numbers such as foo(0) and bar(1) track which copy each client and the server holds]
• Mutating operations
  – The server determines speculation success or failure
    • State at the server is never speculative
    • Can be made durable across a server crash
  – Requires the server to track failed speculations
  – Requires in-order processing of messages

Group Commit
• Previously sequential operations are now concurrent
• Sync operations are usually committed to disk
• Speculator makes group commit possible
[Figure: two clients updating different files — without Speculator, each write/commit pair is a separate round trip and disk write; with Speculator, the writes and commits are batched]
• Can significantly improve disk throughput
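As a rough illustration of why batching helps, the sketch below queues commit requests on the server side so that one synchronous journal write covers many clients. All of the names here (commit_req, enqueue_commit, flush_batch, disk_write, MAX_BATCH) are hypothetical and only convey the idea; this is not the SpecNFS server interface.

  #include <stddef.h>
  #include <stdio.h>

  #define MAX_BATCH 64

  typedef struct {
      int client_id;                 /* which client requested this commit */
      /* ...file data and metadata to be made durable would go here...     */
  } commit_req;

  static commit_req pending[MAX_BATCH];
  static size_t npending = 0;

  /* Called per commit request. Because Speculator lets clients continue
   * speculatively instead of blocking, many requests can queue up
   * between journal flushes. */
  static void enqueue_commit(commit_req r)
  {
      if (npending < MAX_BATCH)
          pending[npending++] = r;
  }

  /* Called once per journal flush: a single ordered disk write makes the
   * whole batch durable, instead of one synchronous write per request. */
  static void flush_batch(void (*disk_write)(const commit_req *, size_t))
  {
      if (npending > 0) {
          disk_write(pending, npending);
          npending = 0;
      }
  }

  /* Stub standing in for the journal write. */
  static void fake_disk_write(const commit_req *reqs, size_t n)
  {
      printf("one disk write covering %zu commits (first from client %d)\n",
             n, reqs[0].client_id);
  }

  int main(void)
  {
      for (int id = 1; id <= 5; id++)
          enqueue_commit((commit_req){ .client_id = id });
      flush_batch(fake_disk_write);    /* 5 commits, 1 disk write */
      return 0;
  }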
Implementation
• SpecNFS
  – Modified NFSv3 in the Linux 2.4 kernel to support Speculator
    • Same RPCs issued (but many are now asynchronous)
    • SpecNFS has the same close-to-open consistency and safety as NFS
• BlueFS – a new file system built for Speculator
  – Single-copy semantics
  – Each file, directory, etc. has a version number
  – Checks the server on every operation
• Evaluation setup
  – Two Dell Precision 370 desktops as the client and the file server
  – Packets routed through a NISTnet network emulator to insert delay

Apache Build
[Graph: build time in seconds for NFS, SpecNFS, BlueFS, and ext3, with no delay and with a 30 ms delay]
• With delays, SpecNFS is up to 14 times faster

The Cost of Rollback
[Graph: time in seconds for NFS, SpecNFS, and ext3 with 0%, 10%, 50%, and 100% of files invalid, under no delay and a 30 ms delay]
• Even with all files out of date, SpecNFS is up to 11x faster

Group Commit & Sharing State
[Graph: time for NFS, SpecNFS, and BlueFS under 0 ms and 30 ms delay, comparing the default configuration against no propagation, no group commit, and neither]

Conclusion
• Speculator greatly improves the performance of existing distributed file systems
• Speculator enables new file systems that are safe, consistent in some sense, and fast

Discussion
• Starvation (infinite rollback)?
• Overhead of maintaining speculations?
  – Memory or CPU?
• Multiple-server environments?
• Consistency?
  – Consistency semantics?
  – Consistency model?
  – CAP theorem?

Agenda
• CAP theorem, consistency semantics, consistency models
• Papers
  – Speculative Execution in a Distributed File System (Award Paper from SOSP '05)
  – Rethink the Sync (Best Paper from OSDI '06)

Synchronization
• Asynchronous I/O
  – High performance: non-blocking
  – Low reliability: vulnerable to crashes, ordering not guaranteed
• Synchronous I/O
  – Low performance: blocking
  – High reliability: resilient to crashes, guaranteed I/O ordering

External Synchrony
• High performance, close to asynchronous I/O
  – Async-like execution until externalization
• High reliability, close to synchronous I/O
  – User-centric view of guaranteed durability

External Synchrony
• Delay the commit of data until an externally observable operation requires it
  – Printing to the screen
  – Sending a packet
• Externally observable behavior implies
  – Operations before the observed behavior have been committed

Example: Synchronous I/O
  write(buf_1);
  write(buf_2);
  print("work done");
  foo();
[Figure: process, OS kernel, and disk timeline — the application blocks on each write until the data reaches disk, and "work done" appears on the screen only after both writes are durable]

Example: External Synchrony
  write(buf_1);
  write(buf_2);
  print("work done");
  foo();
[Figure: the same timeline under external synchrony — both writes return immediately, and the buffered data is committed to disk just before "work done" is externalized to the screen]
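The externally synchronous behavior above can be approximated in user space. The sketch below uses hypothetical wrappers xs_write and xs_print (this is not the xsyncfs interface, which lives in the kernel), along with a made-up dirty_fd flag and log.txt file: file writes only mark the file as having uncommitted data, and the commit (here a plain fsync) is forced just before anything is shown to the user, so whatever the user sees is guaranteed durable.

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  static int dirty_fd = -1;            /* file with uncommitted writes */

  /* "Asynchronous" write: returns immediately; the data sits in the
   * page cache and is not yet durable. */
  static ssize_t xs_write(int fd, const void *buf, size_t len)
  {
      dirty_fd = fd;                   /* later output now depends on fd */
      return write(fd, buf, len);
  }

  /* Externalizing output: commit the dependency first, then release the
   * message, so the user never sees output that could be lost in a crash. */
  static void xs_print(const char *msg)
  {
      if (dirty_fd != -1) {
          fsync(dirty_fd);             /* commit before externalization */
          dirty_fd = -1;
      }
      write(STDOUT_FILENO, msg, strlen(msg));
  }

  int main(void)
  {
      int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
      xs_write(fd, "buf_1\n", 6);      /* fast, not yet durable          */
      xs_write(fd, "buf_2\n", 6);      /* fast, not yet durable          */
      xs_print("work done\n");         /* one commit, then the message   */
      close(fd);
      return 0;
  }

With these wrappers, the slide's example looks synchronous to the user but pays for only a single commit, triggered by the externalizing print, mirroring the figure above.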
Improving Performance
• Group commit of multiple modifications
  – Atomic commit reduces disk accesses
• Buffering of output
  – The output function runs while the commit is in progress
  – Buffered output is released after the commit completes

Multiprocess Support
• Necessary functions
  – Tracking causal dependencies
    • Concept borrowed from Speculator
  – Output-triggered commit
    • Output buffering borrowed from Speculator

Multiprocess Support
  Process 1:             Process 2:
    write(file1);          print("hello");
    do_something();        read(file1);
                           print("world");
[Figure: "hello" depends on nothing and is released immediately; because Process 2 reads file1, "world" inherits a dependency on Process 1's uncommitted write and is buffered until that commit completes]

Limitations
• Application-specific recovery is difficult
  – Delayed commits make it hard to reason backwards about what reached disk
• A commit may be delayed indefinitely
  – A 5-second rule is applied, but behavior may still not match users' expectations
• Data in multiple file systems is difficult to commit in a single transaction
  – Journals live in different locations

Implementation
• Implemented an externally synchronous file system, xsyncfs
  – Based on the ext3 file system and Speculator
  – Uses journaling to preserve the order of writes
  – Uses write barriers to flush the volatile disk cache
    • Writes to disk are guaranteed durable

Speculator vs. External Synchrony
• Speculator
  – Hides speculative state until the RPC response arrives
  – Traces causal dependencies for commit and rollback
  – Buffers output
  – Group commit
• External Synchrony
  – Delays the commit until externalization
  – Traces causal dependencies for commit
  – Buffers output
  – Group commit

Evaluation
• Compare xsyncfs to three other file system configurations
  – Default asynchronous ext3
  – Default synchronous ext3
  – Synchronous ext3 with write barriers

When is data safe?
  File system configuration        Data durable on write()   Data durable on fsync()
  Asynchronous                     No                         Not on power failure
  Synchronous                      Not on power failure       Not on power failure
  Synchronous w/ write barriers    Yes                        Yes
  External synchrony               Yes                        Yes

Postmark benchmark
[Graph: time in seconds (log scale) for ext3-async, xsyncfs, ext3-sync, and ext3-barrier]
• Xsyncfs is within 7% of ext3 mounted asynchronously

The MySQL benchmark
[Graph: New Order transactions per minute vs. number of DB clients for xsyncfs and ext3-barrier]
• MySQL's own group commit can reach xsyncfs performance when the number of clients is large

SPECweb99 throughput
[Graph: throughput in Kb/s for ext3-async, xsyncfs, ext3-sync, and ext3-barrier]
• Xsyncfs is within 8% of ext3 mounted asynchronously
• Many operations buffered, more externalization

Conclusion
• A new concept, external synchrony, is proposed
• External synchrony performs within 8% of asynchronous I/O

Discussion
• What happens when an externally synchronous system fails?
• Consistency?
  – Consistency semantics?
  – Consistency model?
  – CAP theorem?