Speculative Execution in a Distributed File System Ed Nightingale Peter Chen

advertisement

Speculative Execution in a

Distributed File System

Ed Nightingale

Peter Chen

Jason Flinn

University of Michigan

Best Paper at SOSP 2005

Modified for CS739

Motivation

• Why are distributed file systems slow(er)?

– Sync network messages provide consistency

– Sync disk writes provide safety

• Sacrifice guarantees for speed

• Can DFS can be safe, consistent and fast?

– Yes!

With OS support for speculative execution

2

1) Checkpoint

Client Server

3) Correct?

RPC Req

& re-execute

RPC Resp

• Guarantees without blocking I/O!

3

Conditions for Success

• Operations are highly predictable

– Conflicts are rare

• Checkpoints are cheaper than network I/O

– 52 µs for small process

(6.3 ms for 64 MB process)

• Computers have resources to spare

– Need memory and CPU cycles for speculation

4

Assumptions --> Experiments

• Operations are highly predictable

– What happens if not (lots of sharing)?

– Will cost of checkpointing and rolling back ever harm performance?

• Checkpoints are cheaper than network I/O

– What happens with large memory processes?

• Computers have resources to spare (Clients)

– What happens when running multiple processes?

– Can network become bottleneck?

– Can server become bottleneck?

5

Outline

• Motivation

• Implementing speculation

• Multi-process speculation

• Using Speculator

• Evaluation

6

Implementing Speculation

• Implementation

– All within OS

– No changes needed to applications

– Three new interfaces

• Create_speculation()

• Commit_speculation()

• Fail_speculation()

• Goal: Ensure speculative state is

• Never externalized (sent to terminal, disk, network)

• Never directly observed by non-speculative processes

7

Implementing Speculation

1) System call 2) Create speculation

Time

Copy-on-write fork Checkpoint

For each kernel object Undo log

Spec Tracks objects that depend on speculation

8

Speculation Success

1) System call 2) Create speculation 3) Commit speculation

Time

Checkpoint

Spec

Undo log 9

Speculation Failure

1) System call 2) Create speculation 3) Fail speculation

Time

Checkpoint

Spec

Undo log 10

Replay after Failure

• Does code after checkpoint need to be deterministic for correct results?

11

Ensuring Correctness

• Spec processes often affect external state

– Speculative state should never be visible to user or any external device

– Process should never view speculative state unless it is already speculatively dependent on that state

• Three ways to ensure correct execution

– Block

– Buffer

– Propagate speculations (dependencies)

12

Systems Calls

• Modify system call jump table

• Block calls that externalize state

– Allow read-only calls (e.g. getpid)

– Allow calls that modify only task state (e.g. dup2)

• File system calls -- need to dig deeper

– Mark file systems that support Speculator getpid reboot mkdir

Call sys_getpid()

Block until specs resolved

Allow only if fs supports Speculator

13

Output Commits: Buffer

1) sys_stat 2) sys_mkdir 3) Commit spec 1

Time

“stat worked”

“mkdir worked”

Checkpoint

Checkpoint

Undo log

Spec

(stat)

Spec

(mkdir)

14

Multi-Process Speculation

• Processes often cooperate

– Example: “make” forks children to compile, link, etc.

– Would block if speculation limited to one task

• Allow kernel objects to have speculative state

– Examples: inodes, signals, pipes, Unix sockets, etc.

– Propagate dependencies among objects

– Objects rolled back to prior states when specs fail

15

pid 8000

Multi-Process Speculation

Stat A Stat B

Checkpoint

Spec 2 pid 8001

Chown -1

Write -1 inode 3456

16

Multi-Process Speculation

• What we handle:

– DFS objects, RAMFS, Ext3, Pipes & FIFOs

– Unix Sockets, Signals, Fork & Exit

• What we don’t (i.e. we block)

– System V IPC

– Multi-process write-shared memory

17

Outline

• Motivation

• Implementing speculation

• Multi-process speculation

• Using Speculator

• Evaluation

18

Example: NFSv3 Linux

Client 1 Server Client 2

Write

Modify B

Commit

Getattr

Open B

Why were asynchronous writes added to NFSv3?

Is this situation of no overlap expected to be the common case?

When can the commit return from the server?

19

speculate

Example: SpecNFS

Client 1 Server

Modify B

Write+Commit

Client 2

Getattr

Getattr

Open B speculate

Open B speculate

When can the commit return from the server?

20

Problem: Mutating Operations

Client 1

1. cat foo > bar

Client 2

2. cat bar

• bar depends on cat foo

• What does client 2 view in bar?

21

Solution: Mutating Operations

• Server determines speculation success/failure

– State at server never speculative

• Send server hypothesis speculation based on

– List of speculations an operation depends on

• Requires server to track failed speculations

– Would like to be convinced OK when server crashes!

• Requires in-order processing of messages

22

Group Commit

• Previously sequential ops now concurrent

• Sync ops usually committed to disk

• Speculator makes group commit possible

Updating different files…

Client write commit write

Server commit

Client Server

Putting it all Together: SpecNFS

• Apply Speculator to an existing file system

• Modified NFSv3 in Linux 2.4 kernel

– Same RPCs issued (but many now asynchronous)

– SpecNFS has same consistency, safety as NFS

– Getattr, lookup, access speculate if data in cache

– Create, mkdir, commit, etc. always speculate

• Choose aliases for unknown file handles

24

Putting it all Together: BlueFS

• Design a new file system for Speculator

– Single copy semantics

– Synchronous I/O

• Each file, directory, etc. has version number

– Incremented on each mutating op (e.g. on write)

– Checked prior to all operations.

– Many ops speculate and check version async

25

Outline

• Motivation

• Implementing speculation

• Multi-process speculation

• Using Speculator

• Evaluation

26

Apache Benchmark

300 4500

250

NFS

SpecNFS

BlueFS ext3

4000

3500

200

150

100

50

3000

2500

2000

1500

1000

500

0

0

No delay

• SpecNFS up to 14 times faster

30 ms delay

27

Apache Benchmark

300

250

200

150

100

50

0

N

FS

S pe cN

FS

S pe cB

FS

C od a

No delay ex t3

4500

4000

3500

3000

2500

2000

1500

1000

500

0

Remove

Configure

N

FS

S pe cN

FS

S pe cB

FS

C od a

30ms delay

Make

Untar ex t3

28

140

120

100

80

60

40

20

0

The Cost of Rollback (Q1)

2000

1800

1600

1400

1200

1000

800

600

400

200

0

No files invalid

10% files invalid

50% files invalid

100% files invalid

NFS SpecNFS

No delay ext3 NFS SpecNFS

30ms delay ext3

• All files out of date SpecNFS up to 11x faster

29

Evaluation

• Other experiments you would like to see?

30

Evaluation

• Other experiments you would like to see?

• Impact of checkpoint size

• Resource constraints: Scalability

– How many clients can server handle?

– Is network a constraint?

– Server must still handle poll requests, even if in bg

• Fault tolerance

– Server now maintains state: per-client list of failed speculations

31

Conclusion

• Speculator greatly improves performance of existing distributed file systems

– Especially in wide-area

• Speculator enables new file systems to be safe, consistent and fast

• How general do you think Speculator is?

– Speculation clearly improves poll-based consistency

(NFSv3, BlueFS)

– How much improvement with callbacks and leases?

(AFS, Coda, NFSv4)?

32

Group Commit & Sharing State

500

450

400

350

300

250

200

150

100

50

0

4500

4000

3500

3000

2500

2000

1500

1000

500

0

Default

No prop

No grp commit

No grp commit & no prop

NFS SpecNFS BlueFS

0 ms delay

NFS SpecNFS BlueFS

30ms delay

33

Related Work

• Chang & Gibson, Fraser & Chang

– Speculative pre-fetching

• Time Warp

– Virtual Time: distributed simulations

• Hardware branch prediction

• Transactional file systems

34

Download