Ed Nightingale
Peter Chen
Jason Flinn
University of Michigan
Best Paper at SOSP 2005
Modified for CS739
• Why are distributed file systems slow(er)?
– Sync network messages provide consistency
– Sync disk writes provide safety
• Sacrifice guarantees for speed
• Can DFS can be safe, consistent and fast?
– Yes!
With OS support for speculative execution
2
1) Checkpoint
Client Server
3) Correct?
RPC Req
& re-execute
RPC Resp
• Guarantees without blocking I/O!
3
• Operations are highly predictable
– Conflicts are rare
• Checkpoints are cheaper than network I/O
– 52 µs for small process
(6.3 ms for 64 MB process)
• Computers have resources to spare
– Need memory and CPU cycles for speculation
4
• Operations are highly predictable
– What happens if not (lots of sharing)?
– Will cost of checkpointing and rolling back ever harm performance?
• Checkpoints are cheaper than network I/O
– What happens with large memory processes?
• Computers have resources to spare (Clients)
– What happens when running multiple processes?
– Can network become bottleneck?
– Can server become bottleneck?
5
• Motivation
• Implementing speculation
• Multi-process speculation
• Using Speculator
• Evaluation
6
• Implementation
– All within OS
– No changes needed to applications
– Three new interfaces
• Create_speculation()
• Commit_speculation()
• Fail_speculation()
• Goal: Ensure speculative state is
• Never externalized (sent to terminal, disk, network)
• Never directly observed by non-speculative processes
7
1) System call 2) Create speculation
Time
Copy-on-write fork Checkpoint
For each kernel object Undo log
Spec Tracks objects that depend on speculation
8
1) System call 2) Create speculation 3) Commit speculation
Time
Checkpoint
Spec
Undo log 9
1) System call 2) Create speculation 3) Fail speculation
Time
Checkpoint
Spec
Undo log 10
• Does code after checkpoint need to be deterministic for correct results?
11
• Spec processes often affect external state
– Speculative state should never be visible to user or any external device
– Process should never view speculative state unless it is already speculatively dependent on that state
• Three ways to ensure correct execution
– Block
– Buffer
– Propagate speculations (dependencies)
12
• Modify system call jump table
• Block calls that externalize state
– Allow read-only calls (e.g. getpid)
– Allow calls that modify only task state (e.g. dup2)
• File system calls -- need to dig deeper
– Mark file systems that support Speculator getpid reboot mkdir
Call sys_getpid()
Block until specs resolved
Allow only if fs supports Speculator
13
1) sys_stat 2) sys_mkdir 3) Commit spec 1
Time
“stat worked”
“mkdir worked”
Checkpoint
Checkpoint
Undo log
Spec
(stat)
Spec
(mkdir)
14
• Processes often cooperate
– Example: “make” forks children to compile, link, etc.
– Would block if speculation limited to one task
• Allow kernel objects to have speculative state
– Examples: inodes, signals, pipes, Unix sockets, etc.
– Propagate dependencies among objects
– Objects rolled back to prior states when specs fail
15
pid 8000
Stat A Stat B
Checkpoint
Spec 2 pid 8001
Chown -1
Write -1 inode 3456
16
• What we handle:
– DFS objects, RAMFS, Ext3, Pipes & FIFOs
– Unix Sockets, Signals, Fork & Exit
• What we don’t (i.e. we block)
– System V IPC
– Multi-process write-shared memory
17
• Motivation
• Implementing speculation
• Multi-process speculation
• Using Speculator
• Evaluation
18
Client 1 Server Client 2
Write
Modify B
Commit
Getattr
Open B
Why were asynchronous writes added to NFSv3?
Is this situation of no overlap expected to be the common case?
When can the commit return from the server?
19
speculate
Client 1 Server
Modify B
Write+Commit
Client 2
Getattr
Getattr
Open B speculate
Open B speculate
When can the commit return from the server?
20
Client 1
1. cat foo > bar
Client 2
2. cat bar
• bar depends on cat foo
• What does client 2 view in bar?
21
• Server determines speculation success/failure
– State at server never speculative
• Send server hypothesis speculation based on
– List of speculations an operation depends on
• Requires server to track failed speculations
– Would like to be convinced OK when server crashes!
• Requires in-order processing of messages
22
• Previously sequential ops now concurrent
• Sync ops usually committed to disk
• Speculator makes group commit possible
Updating different files…
Client write commit write
Server commit
Client Server
• Apply Speculator to an existing file system
• Modified NFSv3 in Linux 2.4 kernel
– Same RPCs issued (but many now asynchronous)
– SpecNFS has same consistency, safety as NFS
– Getattr, lookup, access speculate if data in cache
– Create, mkdir, commit, etc. always speculate
• Choose aliases for unknown file handles
24
• Design a new file system for Speculator
– Single copy semantics
– Synchronous I/O
• Each file, directory, etc. has version number
– Incremented on each mutating op (e.g. on write)
– Checked prior to all operations.
– Many ops speculate and check version async
25
• Motivation
• Implementing speculation
• Multi-process speculation
• Using Speculator
• Evaluation
26
300 4500
250
NFS
SpecNFS
BlueFS ext3
4000
3500
200
150
100
50
3000
2500
2000
1500
1000
500
0
0
No delay
• SpecNFS up to 14 times faster
30 ms delay
27
300
250
200
150
100
50
0
N
FS
S pe cN
FS
S pe cB
FS
C od a
No delay ex t3
4500
4000
3500
3000
2500
2000
1500
1000
500
0
Remove
Configure
N
FS
S pe cN
FS
S pe cB
FS
C od a
30ms delay
Make
Untar ex t3
28
140
120
100
80
60
40
20
0
2000
1800
1600
1400
1200
1000
800
600
400
200
0
No files invalid
10% files invalid
50% files invalid
100% files invalid
NFS SpecNFS
No delay ext3 NFS SpecNFS
30ms delay ext3
• All files out of date SpecNFS up to 11x faster
29
• Other experiments you would like to see?
30
• Other experiments you would like to see?
• Impact of checkpoint size
• Resource constraints: Scalability
– How many clients can server handle?
– Is network a constraint?
– Server must still handle poll requests, even if in bg
• Fault tolerance
– Server now maintains state: per-client list of failed speculations
31
• Speculator greatly improves performance of existing distributed file systems
– Especially in wide-area
• Speculator enables new file systems to be safe, consistent and fast
• How general do you think Speculator is?
– Speculation clearly improves poll-based consistency
(NFSv3, BlueFS)
– How much improvement with callbacks and leases?
(AFS, Coda, NFSv4)?
32
500
450
400
350
300
250
200
150
100
50
0
4500
4000
3500
3000
2500
2000
1500
1000
500
0
Default
No prop
No grp commit
No grp commit & no prop
NFS SpecNFS BlueFS
0 ms delay
NFS SpecNFS BlueFS
30ms delay
33
• Chang & Gibson, Fraser & Chang
– Speculative pre-fetching
• Time Warp
– Virtual Time: distributed simulations
• Hardware branch prediction
• Transactional file systems
34