Speculations:
Speculative Execution in a Distributed File System [1]
and
Rethink the Sync [2]
Edmund Nightingale [1, 2], Kaushik Veeraraghavan [2],
Peter Chen [1, 2], Jason Flinn [1, 2]
Presentation by Ji-Yong Shin
(Some slides are from Nightingale’s talk)
Agenda
• CAP theorem, Consistency Semantics,
Consistency Model
• Papers
– Speculative Execution in a Distributed File System
(Award Paper from SOSP’05)
– Rethink the Sync (Best Paper from OSDI’06)
CAP Theorem by Eric Brewer
• At most two of CAP can be satisfied
simultaneously
– Consistency: correctness of data
– Availability: guaranteed immediate access to data
– Partition Tolerance: guaranteed functioning
despite network disruption or partition
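[Figure: the three properties C, A, and P, shown with two nodes N1 and N2 that hold data items A and B and sync with each other]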
ACID vs BASE
(Not exact opposites, but...)
ACID
• Atomicity
• Consistency
• Isolation
• Durability
BASE
• Basically Available
• Soft-state
• Eventually consistent
Distributed File
System Design
Consistency Semantics by Leslie Lamport
• Atomic (single copy)
– Every read returns the value of the most recent write
• Regular
– Read not concurrent with any write returns the most
recent write
– Read concurrent with some writes returns either the
most recent write or a value of concurrent write
• Safe
– Read not concurrent with any write returns the most
recent write
– Read concurrent with some writes returns any value
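The difference between safe and regular semantics can be sketched as a function that lists the values a read is allowed to return. This is an illustrative model only: the `Write` record, the `allowed_reads` function, and the use of numeric start/end times are assumptions made for the example, and writes are assumed to come from a single writer so that "most recent write" is well defined. Atomic semantics are left out because they also constrain the order of reads relative to each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    start: float  # time the write begins
    end: float    # time the write completes


def allowed_reads(initial, writes, r_start, r_end, semantics):
    """Return the set of values a read over [r_start, r_end] may return.

    Returns None to mean "any value at all" (safe semantics under concurrency).
    """
    # Writes that finished before the read began.
    completed = [w for w in writes if w.end < r_start]
    # Writes whose interval overlaps the read interval.
    concurrent = [w for w in writes if w.start <= r_end and w.end >= r_start]
    latest = max(completed, key=lambda w: w.end).value if completed else initial

    if not concurrent:
        # Safe and regular agree: the most recent write.
        return {latest}
    if semantics == "safe":
        return None
    if semantics == "regular":
        return {latest} | {w.value for w in concurrent}
    raise ValueError(f"unsupported semantics: {semantics}")
```

For a read that overlaps a write of "v2" after a completed write of "v1", the regular result is {"v1", "v2"}, while the safe result is unconstrained.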
Consistency Models
• Strict consistency
– All executions in strict order
• Sequential consistency
– All execution results exposed in strict order
• Causal consistency
– All execution results with causal dependency exposed in strict order
• Close-to-open consistency
– All execution results of processes that closed the file should be exposed to
any process that subsequently opens the file
• Delta consistency
– After a fixed period of time, all copies of the data will be consistent
• Eventual consistency
– After a sufficiently long period of time, all copies of the data will be consistent
Speculative Execution in a Distributed File System
(Award Paper from SOSP’05)
Authors
• Edmund B Nightingale
– PhD from UMich (Jason Flinn)
– Microsoft Research
– Both papers are part of PhD Thesis
• Kaushik Veeraraghavan
– PhD student at UMich (Jason Flinn)
• Peter M Chen
– PhD from UCB (David Patterson)
– Faculty at UMich
• Jason Flinn
– PhD at CMU (Mahadev Satyanarayanan)
– Faculty at UMich
Speculation
• Execute using assumption
– If assumption holds
• performance gain
– If assumption fails
• Restart execution
• Not much performance overhead
• Example
– Branch prediction
– Transaction
– Thread level speculation in multiprocessor (or multicore)
Code shown on the slide:
    if (a == 1) {
        b = 0;
        c = 1;
    } else {
        b = 1;
        c = 0;
    }
[Figure: animation of the branch above executing speculatively through the pipeline stages IF, DEC, EX, WB per clock cycle, with the labels "Sync", "Delay", "Complete", and "Rollback and restart"]
Motivation and Approach
• Distributed file system
– Significant cost for consistency and safety
• Clients block and wait on synchronous messages and writes
– Tradeoff between consistency and performance
• Weak consistency for high performance
• Speculative distributed file system
– Execute sync operations in async manner
– While syncing, execute next operation on cached files
– Check correctness later and roll back if necessary
– Guarantee single copy semantics
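The approach in the list above can be sketched in a few lines. This is a minimal, sequential illustration and not the Speculator implementation: `run_speculatively`, `fetch_from_server`, and `operation` are hypothetical names, the checkpoint is a plain deep copy rather than a process checkpoint, and the server reply is fetched inline where a real system would receive it asynchronously.

```python
import copy


def run_speculatively(state, cached_value, fetch_from_server, operation):
    """Run `operation` on a cached value, then validate against the server.

    Returns (resulting_state, outcome) where outcome is "committed" or
    "rolled back".
    """
    checkpoint = copy.deepcopy(state)   # 1) checkpoint before speculating
    operation(state, cached_value)      # 2) continue using the cached value
    actual = fetch_from_server()        # reply for the request sent earlier
    if actual == cached_value:          # 3) was the assumption correct?
        return state, "committed"       # yes: the checkpoint is simply dropped
    state = checkpoint                  # no: restore the checkpointed state
    operation(state, actual)            # and re-execute with the real value
    return state, "rolled back"
```

When the cached value matches the server's answer, the work done during the wait is kept; otherwise the only visible effect is the re-executed operation on the restored state.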
Big Idea: Speculator
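[Figure: client and server timelines exchanging RPC Req and RPC Resp messages]
Slow Way
• Block! The client waits for each RPC Resp before continuing
Speculator
1) Checkpoint
2) Speculate!
3) Correct?
– Yes: discard ckpt.
– No: restore process & re-execute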
Conditions for Success
1. Highly predictable operations
– Misprediction can worsen performance
• Mispredictions must be rare
2. Faster checkpointing compared to remote IO
– Slow checkpointing is not worth doing
• 52 µs to checkpoint a small process, less than a network IO
3. Available spare resources for speculation
– Speculation requires memory and CPU cycles
• Modern computers have abundant resources
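One hedged way to read the three conditions together is as a break-even inequality. The symbols are assumptions introduced for this illustration and do not come from the papers: t_ckpt is the checkpoint cost, t_rpc the remote round trip, t_work the work the client can do while waiting, p the misprediction probability, and t_redo the cost of re-executing after a rollback.

```latex
\[
T_{\text{block}} = t_{\text{rpc}} + t_{\text{work}}, \qquad
T_{\text{spec}} = t_{\text{ckpt}} + \max(t_{\text{rpc}}, t_{\text{work}}) + p\, t_{\text{redo}}
\]
\[
T_{\text{spec}} < T_{\text{block}}
\iff
t_{\text{ckpt}} + p\, t_{\text{redo}} < \min(t_{\text{rpc}}, t_{\text{work}})
\]
```

Under this simple model, a 52 µs checkpoint and a small p keep the left side well below a network round trip, which is the situation the conditions describe.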
Implementing Speculation
1) System call
2) Create speculation (create_speculation)
– Checkpoint: copy on write fork()
– Undo log: ordered list of speculative operations
– Spec: tracks kernel objects that depend on it
[Figure: timeline showing the checkpoint, undo log, and Spec structure]
Speculation Success
1) System call
2) Create speculation
3) Commit speculation (commit_speculation)
[Figure: timeline with the same checkpoint, undo log (ordered list of speculative operations), and Spec structure (tracks kernel objects that depend on it)]
Speculation Failure
1) System call
2) Create speculation
3) Fail speculation (fail_speculation)
[Figure: timeline with the same checkpoint, undo log (ordered list of speculative operations), and Spec structure (tracks kernel objects that depend on it)]
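The role of the undo log in the commit and fail paths can be sketched as follows. This is an illustrative model, not kernel code: the `Speculation` class and its method names are hypothetical (loosely mirroring create_speculation, commit_speculation, and fail_speculation on the slides), kernel objects are modeled as plain dictionaries, and the process checkpoint taken with copy-on-write fork() is not modeled.

```python
_MISSING = object()  # marks a key that did not exist before the speculative write


class Speculation:
    def __init__(self):
        self.undo_log = []      # ordered list of (object, key, old value)
        self.state = "active"

    def record_write(self, obj, key, new_value):
        """Apply a speculative write and remember how to undo it."""
        self.undo_log.append((obj, key, obj.get(key, _MISSING)))
        obj[key] = new_value

    def commit(self):
        """The assumption held: the undo information is no longer needed."""
        self.undo_log.clear()
        self.state = "committed"

    def fail(self):
        """The assumption failed: undo the writes in reverse order."""
        for obj, key, old in reversed(self.undo_log):
            if old is _MISSING:
                del obj[key]
            else:
                obj[key] = old
        self.undo_log.clear()
        self.state = "failed"
```

Undoing in reverse order matters when the same field is written more than once during a speculation, since the oldest saved value must be the one that survives.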
Multi-Process Speculation
• Processes often cooperate
– Example: “make” forks children to compile, link,
etc.
– Would block if speculation were limited to one task
• Speculator supports this by
– Propagating dependencies among objects
– Rolling objects back to prior states when speculations fail
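Dependency propagation can be pictured as sets of speculation ids that flow between processes and the objects they touch. The sketch below is an assumption-laden illustration: `Tracked`, `write`, `read`, and `must_roll_back` are hypothetical names, and the real system tracks kernel objects rather than Python objects.

```python
class Tracked:
    """A process or kernel object together with the speculations it depends on."""

    def __init__(self, name, deps=()):
        self.name = name
        self.deps = set(deps)  # ids of speculations this entity depends on


def write(process, obj):
    # An object modified by a speculative process inherits its dependencies.
    obj.deps |= process.deps


def read(process, obj):
    # A process that observes speculative state becomes speculative itself.
    process.deps |= obj.deps


def must_roll_back(failed_spec, tracked):
    """Names of every entity that depends on the failed speculation."""
    return sorted(t.name for t in tracked if failed_spec in t.deps)
```

If pid 8000 depends on speculation 1 and writes an inode that pid 8001 later reads, both processes and the inode are rolled back when speculation 1 fails, while an unrelated process is untouched.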
Multi-Process Speculation
[Figure: processes pid 8000 and pid 8001 with their checkpoints, speculations Spec 1 (Stat A) and Spec 2 (Stat B), and inode 3456 with the entries Chown-1 and Write-1]
Ensuring Correctness
• Speculative state must never be visible to
1. The user or an external device
2. A process that does not depend on it
• Controlling speculative process
– Block access to external environment
• Read-only (getpid) and private state updates (dup2) allowed
– Buffer writes to external devices
– Propagate speculation if necessary
Multi-Process Speculation
• Supports
– Objects in distributed file system
• Will be explained in next slides
– Objects in local memory file system (RAMFS)
– Objects in local disk file system
• Use buffering strategy for speculation
• Shared on-disk metadata: only valid state is committed, using redo and undo
• Journal: only non-speculative operations are committed
– Others
• Pipes, FIFOs, Unix sockets, signals, fork, exit
• Doesn’t support
– Write-shared memory, including System V IPC and futexes
Using Speculation
[Figure: Client 1, Server, and Client 2, each holding versioned copies of the files foo and bar (for example foo(0), bar(1)); the commands shown are "cat foo > bar" and "cat bar"]
Using Speculation
[Figure: Client 1, Server, and Client 2, each holding versioned copies of the files foo and bar; the commands shown are "cat foo > bar" and "cat bar"]
• Mutating Operation
– Server determines speculation success/failure
• State at server never speculative
• Can survive a server crash
– Requires server to track failed speculations
– Requires in-order processing of messages
Group Commit
• Previously sequential ops now concurrent
• Sync ops usually committed to disk
• Speculator makes group commit possible
[Figure: Updating different files: two client/server timelines of write and commit messages]
• Can significantly improve disk throughput
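The throughput benefit of group commit comes from paying the commit cost once per batch instead of once per write. The sketch below counts commits in both styles; `Disk`, `sequential`, and `grouped` are hypothetical names for this example, and a commit is modeled as a single dictionary update rather than a real disk flush.

```python
class Disk:
    def __init__(self):
        self.data = {}
        self.commits = 0  # number of (expensive) commit operations performed

    def commit(self, batch):
        """Make every write in `batch` durable with one commit."""
        self.data.update(batch)
        self.commits += 1


def sequential(disk, writes):
    # Synchronous style: each write is committed before the next one starts.
    for name, value in writes:
        disk.commit({name: value})


def grouped(disk, writes):
    # Group commit: writes that were issued concurrently share one commit.
    disk.commit(dict(writes))
```

For three writes to different files, the sequential path performs three commits and the grouped path performs one, with the same final contents on disk.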
Implementation
• SpecNFS
– Modified NFSv3 in Linux 2.4 kernel to support Speculator
• Same RPCs issued (but many now asynchronous)
• SpecNFS has same close-to-open consistency, safety as NFS
• BlueFS
– new file system for Speculator
• Single copy semantics
• Each file, directory, etc. has version number
• Check server for every operation
• Two Dell Precision 370 desktops as the client and file server
• Packets routed through the NISTnet network emulator to insert
delay
Apache Build
[Figure: Apache build time (seconds) for NFS, SpecNFS, BlueFS, and ext3, with no delay and with 30 ms delay]
• With delays SpecNFS up to 14 times faster
The Cost of Rollback
[Figure: time (seconds) for NFS, SpecNFS, and ext3, with no delay and with 30 ms delay; series: no files invalid, 10% files invalid, 50% files invalid, 100% files invalid]
• With all files out of date, SpecNFS is up to 11x faster
Group Commit & Sharing State
[Figure: time (seconds) for NFS, SpecNFS, and BlueFS, with 0 ms delay and with 30 ms delay; series: Default, No prop, No grp commit, No grp commit & no prop]
Conclusion
• Speculator greatly improves performance of existing
distributed file systems
• Speculator enables new file systems to be safe,
consistent in some sense and fast
Discussion
• Starvation (Infinite rollback)?
• Overhead for maintaining speculation?
– Memory or CPU?
• Multiple server environment?
• Consistency?
– Consistency Semantics?
– Consistency Model?
– CAP Theorem?
Rethink the Sync (Best Paper from OSDI’06)
Synchronization
Asynchronous IO
• High performance
– Non-blocking
• Low reliability
– Vulnerable to crash
– Ordering not guaranteed
Synchronous IO
• Low performance
– Blocking
• High reliability
– Resilient to crash
– Guaranteed ordering of IO
External Synchrony
• High performance close to async IO
– Async-like execution until externalization
• High reliability close to synchronous
– User centric view of guaranteed durability
External Synchrony
• Delay commit of data until externally
observable operation is necessary
– Print to screen
– Packet send
• An externally observable behavior implies
– All operations before the observed behavior have been
committed
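The rule above can be sketched as a small model in which writes return immediately, output is held back, and a commit makes the writes durable before releasing the output. This is an illustration under stated assumptions rather than the xsyncfs design: `ExternallySynchronousIO` and its methods are hypothetical names, and `commit` is called explicitly where a real system would trigger it when output is pending.

```python
class ExternallySynchronousIO:
    def __init__(self):
        self.pending = []   # writes accepted but not yet durable
        self.disk = []      # durable writes, in order
        self.buffered = []  # output waiting on a commit
        self.screen = []    # output the user can actually see
        self.commits = 0

    def write(self, buf):
        # Returns immediately, like asynchronous IO.
        self.pending.append(buf)

    def print_text(self, text):
        # Output depends on all earlier writes, so it is buffered, not shown.
        self.buffered.append(text)

    def commit(self):
        # One group commit makes every pending write durable, in order...
        if self.pending:
            self.disk.extend(self.pending)
            self.pending.clear()
            self.commits += 1
        # ...and only then is the buffered output released.
        self.screen.extend(self.buffered)
        self.buffered.clear()
```

From the user's point of view the guarantee matches synchronous IO: whenever "work done" appears on the screen, both earlier writes are already on disk.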
Example: Synchronous I/O
    write(buf_1);          (application blocks)
    write(buf_2);          (application blocks)
    print("work done");
    foo();
[Figure: Process, OS kernel, and disk; the terminal shows "work done" at the prompt]
Example: External synchrony
    write(buf_1);
    write(buf_2);
    print("work done");
    foo();
[Figure: Process, OS kernel, and disk; the terminal shows "work done" at the prompt]
Improving Performance
• Group commit of multiple modifications
– Atomic commit reduces disk accesses
• Buffering of output
– Output function runs while committing
– Buffered output is released after completion of
commit
Multiprocess support
• Necessary functions
– Tracking down causal dependencies
• Speculator concept borrowed
– Output triggered commit
• Buffering output borrowed from Speculator
Multiprocess support
Process 1
    write(file1);
    do_something();
Process 2
    print("hello");
    read(file1);
    print("world");
[Figure: Process 1, Process 2, OS kernel, and disk, with a commit dependency (Commit Dep 1) on Process 1; the terminal for Process 2 shows "hello" and then "world"]
Limitation
• Application-specific recovery is difficult
– Delayed commit makes it difficult to track back
• Commit may be delayed indefinitely
– A 5-second rule is applied, but it may not meet users’
expectations
• Data in multiple file systems is difficult to
commit in a single transaction
– Journals are in different locations
Implementation
• Implemented an externally synchronous file system, Xsyncfs
– Based on the ext3 file system and Speculator
– Use journaling to preserve order of writes
– Use write barriers to flush volatile cache
– Write to disk guaranteed
Speculator
• Hide speculative state until RPC response
• Trace of causal dependency for commit and roll back
• Buffers output
• Group commit
External Synchrony
• Delay commit until externalization
• Trace of causal dependency for commit
• Buffers output
• Group commit
Evaluation
• Compare Xsyncfs to 3 other file systems
– Default asynchronous ext3
– Default synchronous ext3
– Synchronous ext3 with write barriers
When is data safe?
File System Configuration        Data durable on write()   Data durable on fsync()
Asynchronous                     No                        Not on power failure
Synchronous                      Not on power failure      Not on power failure
Synchronous w/ write barriers    Yes                       Yes
External synchrony               Yes                       Yes
Postmark benchmark
[Figure: Postmark time (seconds, axis from 1 to 10000) for ext3-async, xsyncfs, ext3-sync, and ext3-barrier]
• Xsyncfs within 7% of ext3 mounted asynchronously
The MySQL benchmark
[Figure: New Order transactions per minute versus number of db clients (0 to 20) for xsyncfs and ext3-barrier]
• MySQL’s group commit can reach xsyncfs
performance when the number of clients is large
Specweb99 throughput
[Figure: Specweb99 throughput (Kb/s) for ext3-async, xsyncfs, ext3-sync, and ext3-barrier]
• Xsyncfs within 8% of ext3 mounted asynchronously
• Lots of operations buffered, more externalization
Conclusion
• A new concept, external synchrony, is proposed
• External synchrony performs within 8% of async
Discussion
• What happens when external synchrony
system fails?
• Consistency?
– Consistency Semantics?
– Consistency Model?
– CAP Theorem?