Diskless Checkpointing 15 Nov 2001

Motivation
- Checkpointing on stable storage: disk access is a major bottleneck!
- Techniques to reduce checkpointing cost:
  - Incremental checkpointing
  - Copy-on-write
  - Compression
  - Memory exclusion
  - Diskless checkpointing
Diskless?
- Extra memory is available (e.g., on a NOW, a network of workstations)
- Use memory instead of disk
- Good: network bandwidth > disk bandwidth
- Bad: memory is not stable
Bottom-line
- A NOW with (n+m) processors
- The application runs on exactly n processors, and should proceed as long as
  - the number of working processors in the system is at least n, and
  - the failures occur within certain constraints.
[Figure: n application processors + m chkpnt processors = n+m available processors]
Overview
- Coordinated chkpnt (sync-and-stop)
- To checkpoint:
  - Application processors: chkpnt their state in memory
  - Chkpnt processors: encode the application chkpnts and store the encodings in memory
- To recover:
  - Non-failed processors roll back
  - Replacement processors are chosen
  - Replacement processors: calculate the chkpnts of the failed processors using the surviving chkpnts & encodings
Outline
- Application processor chkpnt
  - Disk-based
  - Simple diskless
  - Incremental
  - Forked (or copy-on-write)
  - Optimizations
- Encoding the chkpnts
  - Parity (RAID level 5)
  - Mirroring
  - 1-dimensional parity
  - 2-dimensional parity
  - Reed-Solomon coding
  - Optimizations
- Results
Application Processor Chkpnt
- Goal: the processor should be able to roll back to its most recent chkpnt.
- Must tolerate failures that occur while checkpointing: make sure that each coordinated chkpnt remains valid until the next coordinated chkpnt has been completed.
Disk-based Chkpnt
- To chkpnt: save all values in the stack, heap, and registers to disk
- To recover: overwrite the address space with the stored chkpnt
- Space demands: 2M on disk (M: the size of an application processor's address space; two copies, so the old chkpnt stays valid while the new one is written)
Simple Diskless Chkpnt
- To chkpnt:
  - Wait until the encoding has been calculated
  - Then overwrite the old diskless chkpnt in memory
- To recover: roll back from the in-memory chkpnts
- Space demands: extra M in memory (M: the size of an application processor's address space)
Incremental Diskless Chkpnt
- To chkpnt:
  - Initially set all pages read-only
  - On a page fault, copy the page and set it read-write
- To recover: restore all read-write (i.e., modified) pages
- Space demands: extra I in memory (I: the incremental chkpnt size)
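As a hedged sketch, the copy-on-write scheme above can be simulated at user level in Python (the tiny `PAGE` size and the `Memory` class are illustrative inventions, not from the paper; a real implementation would use mprotect() and a fault handler):

```python
# Sketch of incremental diskless checkpointing: only pages written since
# the last chkpnt are copied, and recovery restores exactly those pages.
PAGE = 4  # illustrative tiny page size (bytes)

class Memory:
    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.chk = {}                 # page number -> contents at chkpnt time

    def checkpoint(self):
        self.chk = {}                 # conceptually: set every page read-only

    def write(self, addr: int, value: int):
        p = addr // PAGE
        if p not in self.chk:         # simulated "page fault" on a clean page
            self.chk[p] = bytes(self.data[p*PAGE:(p+1)*PAGE])  # copy, set RW
        self.data[addr] = value

    def recover(self):
        for p, page in self.chk.items():   # restore only the modified pages
            self.data[p*PAGE:(p+1)*PAGE] = page

mem = Memory(b"\x00" * 16)            # 4 pages of 4 bytes each
mem.checkpoint()
mem.write(5, 0xAB)                    # dirties page 1
mem.write(9, 0xCD)                    # dirties page 2
mem.recover()                         # roll back to the chkpnt
assert mem.data[5] == 0 and mem.data[9] == 0
assert len(mem.chk) == 2              # only 2 of 4 pages saved (the extra I)
```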
Forked Diskless Chkpnt
- To chkpnt: the application clones itself
- To recover:
  - Overwrite the state with the clone's, or
  - The clone assumes the role of the application
- Space demands: extra 2I in memory (I: the incremental chkpnt size)
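A minimal sketch of the forked approach, assuming a POSIX fork(): the child's address space is a copy-on-write snapshot frozen at fork() time, so it serves as the chkpnt even while the parent keeps computing (the pipe-based "recovery" is illustrative, not the paper's mechanism):

```python
import os

# Sketch of forked diskless chkpnt: fork() freezes a copy-on-write image
# of the application's state in the child; the parent keeps computing
# and can later read the frozen image back to roll back.
state = bytearray(b"AAAA")            # the application's in-memory state

r, w = os.pipe()
pid = os.fork()
if pid == 0:                          # child = the chkpnt clone
    os.close(r)
    os.write(w, bytes(state))         # its copy was frozen at fork() time
    os._exit(0)

os.close(w)
state[:] = b"BBBB"                    # application keeps computing
frozen = os.read(r, 4)                # "recover": fetch the clone's copy
os.close(r)
os.waitpid(pid, 0)
state[:] = frozen                     # overwrite state with the clone's
assert bytes(state) == b"AAAA"
```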
Optimizations
- Breaking the chkpnt into chunks: more efficient use of memory
- Sending diffs (incremental): bitwise XOR of the current copy and the chkpnt copy; unmodified pages need not be sent
- Compressing diffs: unmodified regions of memory XOR to zero, so the diffs compress well
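A small sketch of the diff-plus-compression idea (the page size and the single-byte change are illustrative):

```python
import zlib

PAGE = 4096
previous = bytes(PAGE)                # page contents at the last chkpnt
current = bytearray(previous)         # same page now...
current[100] = 0x5A                   # ...with only one byte modified

# Diff = bitwise XOR with the chkpnt copy: unmodified bytes become 0.
diff = bytes(a ^ b for a, b in zip(current, previous))
packed = zlib.compress(diff)          # long zero runs compress very well
assert len(packed) < 64

# The receiver reconstructs the page by XORing the diff back in.
restored = bytes(a ^ b for a, b in zip(previous, diff))
assert restored == bytes(current)
```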
Application Processor Chkpnt (review)
- Simple diskless chkpnt: extra M in memory
- Incremental diskless chkpnt: extra I in memory
- Forked diskless chkpnt: extra 2I in memory, less CPU activity
- Optimizations: chkpnt in chunks, diffs, and compressed diffs
Encoding the chkpnts
- Goal: the extra chkpnt processors should store enough information that the chkpnts of failed processors may be reconstructed.
- Notation:
  - n: number of application processors
  - m: number of chkpnt processors
Parity (RAID level 5, m=1)
- Notation: b_i^j is the j-th byte of application processor i's chkpnt
- To chkpnt:
  b_ckp^j = b_1^j ⊕ b_2^j ⊕ ... ⊕ b_n^j
- Example: n=4, m=1 (four application processors, one chkpnt processor)
- On failure of the i-th processor:
  b_i^j = b_1^j ⊕ ... ⊕ b_{i-1}^j ⊕ b_{i+1}^j ⊕ ... ⊕ b_n^j ⊕ b_ckp^j
- Can tolerate: only one processor failure
- Remarks: the chkpnt processor is a bottleneck of communication and computation
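The encode and recover equations above can be sketched directly (the 8-byte toy chkpnts are illustrative, not the paper's implementation):

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

n = 4
chkpnts = [bytes([i] * 8) for i in range(1, n + 1)]   # toy 8-byte chkpnts

# To chkpnt: b_ckp = b_1 ⊕ b_2 ⊕ ... ⊕ b_n, byte-wise.
parity = reduce(xor, chkpnts)

# On failure of processor i: XOR the survivors with the parity.
i = 2                                                  # processor 3 fails
survivors = [c for k, c in enumerate(chkpnts) if k != i]
recovered = reduce(xor, survivors + [parity])
assert recovered == chkpnts[i]
```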
Mirroring (m=n)
- Notation: b_i^j is the j-th byte of application processor i's chkpnt
- To chkpnt, each chkpnt processor stores a copy of its partner's chkpnt:
  b_ckp_i^j = b_i^j
- Example: n=m=4
- On failure of the i-th processor:
  b_i^j = b_ckp_i^j
- Can tolerate:
  - Up to n processor failures
  - Except the failure of both an application processor and its chkpnt processor
- Remarks: fast, no calculation needed
1-Dimensional Parity (1<m<n)
- Application processors are partitioned into m groups; the i-th chkpnt processor calculates the parity of the chkpnts in group i
- Example: n=4, m=2 (two groups of two)
- To chkpnt / on failure of the i-th processor: same as in parity encoding, within each group
- Can tolerate: one processor failure per group
- Remarks: more efficient in communication and computation than single parity
2-Dimensional Parity
- Application processors are arranged logically in a two-dimensional grid; each chkpnt processor calculates the parity of one row or one column
- Example: n=4, m=4 (a 2x2 grid with two row-parity and two column-parity processors)
- To chkpnt / on failure of the i-th processor: same as in parity encoding, along the failed processor's row or column
- Can tolerate: any two processor failures
- Remarks: each chkpnt must be sent to both a row and a column parity processor, so multicast helps
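A sketch of two-failure recovery by "peeling" rows and columns (the 2x2 grid and one-byte chkpnts are illustrative):

```python
# 2-D parity recovery: n=4 app processors in a 2x2 grid with one parity
# value per row and per column (m=4). Any two erasures can be peeled:
# a row or column with exactly one missing cell is solvable by XOR.
grid = [[0x11, 0x22], [0x33, 0x44]]            # one toy byte per processor
row_par = [row[0] ^ row[1] for row in grid]
col_par = [grid[0][c] ^ grid[1][c] for c in range(2)]

lost = {(0, 0), (1, 1)}                        # two failed processors
data = {(r, c): v for r, row in enumerate(grid)
        for c, v in enumerate(row) if (r, c) not in lost}

while lost:
    for r, c in list(lost):
        row_rest = [data[r, k] for k in range(2) if (r, k) in data]
        col_rest = [data[k, c] for k in range(2) if (k, c) in data]
        if len(row_rest) == 1:                 # lone gap in its row
            data[r, c] = row_par[r] ^ row_rest[0]
            lost.discard((r, c))
        elif len(col_rest) == 1:               # lone gap in its column
            data[r, c] = col_par[c] ^ col_rest[0]
            lost.discard((r, c))

assert data[0, 0] == 0x11 and data[1, 1] == 0x44
```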
Reed-Solomon Coding (general m)
- Vandermonde matrix F, s.t. f(i,j) = j^(i-1)
- To chkpnt: use matrix-vector multiplication (the chkpnt encodings are F times the vector of application chkpnts)
- To recover: use Gaussian elimination on the surviving equations
- Can tolerate: any m failures
- Remarks:
  - Use Galois fields to perform the arithmetic
  - Computation overhead
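The paper performs this arithmetic over a Galois field GF(2^w); as an illustrative stand-in, the same encode/recover structure can be sketched over a small prime field (GF(257) here is an assumption for readability, not what the paper uses):

```python
P = 257                        # illustrative prime field (paper uses GF(2^w))
n, m = 4, 2
data = [10, 20, 30, 40]        # one word per application processor

# Vandermonde matrix F, f(i, j) = j^(i-1): m rows, n columns.
F = [[pow(j, i - 1, P) for j in range(1, n + 1)] for i in range(1, m + 1)]

# To chkpnt: each chkpnt word is one row of the matrix-vector product F*data.
chk = [sum(F[i][j] * data[j] for j in range(n)) % P for i in range(m)]

# Suppose processors 1 and 3 (indices 0 and 2) fail. The survivors plus
# the m encodings give m linear equations in m unknowns; solve them by
# Gaussian elimination (division = multiplicative inverse mod P).
failed = [0, 2]
alive = [j for j in range(n) if j not in failed]
A = [[F[i][j] for j in failed] for i in range(m)]          # unknown columns
b = [(chk[i] - sum(F[i][j] * data[j] for j in alive)) % P  # known terms on
     for i in range(m)]                                    # the right side

for col in range(m):
    piv = next(r for r in range(col, m) if A[r][col])      # nonzero pivot
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    inv = pow(A[col][col], -1, P)                          # Python 3.8+
    A[col] = [x * inv % P for x in A[col]]
    b[col] = b[col] * inv % P
    for r in range(m):
        if r != col and A[r][col]:
            f = A[r][col]
            A[r] = [(x - f * y) % P for x, y in zip(A[r], A[col])]
            b[r] = (b[r] - f * b[col]) % P

assert b == [data[j] for j in failed]                      # recovers 10 and 30
```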
Optimizations
- Sending and calculating the encoding in RAID level 5-based encodings (e.g., parity):
  (a) DIRECT: every processor sends its chkpnt to the single chkpnt processor C1, which becomes a bottleneck
  (b) FAN-IN: processors combine partial results pairwise in a tree, taking log(n) steps
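A sketch contrasting the two strategies (the 8-byte toy chkpnts and the pairing scheme are illustrative):

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

chkpnts = [bytes([i] * 8) for i in range(1, 9)]   # n = 8 toy chkpnts

# FAN-IN: pair up and XOR partial results tree-style; each while-loop
# iteration corresponds to one parallel communication step.
level = chkpnts[:]
steps = 0
while len(level) > 1:
    nxt = [xor(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
    if len(level) % 2:
        nxt.append(level[-1])          # odd processor out waits a round
    level = nxt
    steps += 1

# Same parity as DIRECT (all n processors sending to C1), but in
# log2(8) = 3 parallel steps instead of one n-message bottleneck.
assert level[0] == reduce(xor, chkpnts) and steps == 3
```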
Encoding the Chkpnts (review)
- Parity (RAID level 5, m=1): only one failure; bottleneck
- Mirroring (m=n): up to n failures (unless both an app and its chkpnt processor fail); fast
- 1-dimensional parity: one failure per group; more efficient than parity
- 2-dimensional parity: any two failures; communication overhead w/o multicast
- Reed-Solomon coding: any m failures; computation overhead
- DIRECT vs. FAN-IN
Testing Applications (1)
- CPU-intensive parallel programs; instances that took 1.5~2 hrs on 16 processors
- NBODY: N-body interactions among particles in a system
  - Particles are partitioned among processors; the location field of each particle is updated
  - Expectation: poor with incremental chkpnt, good with diff-based compression
- MAT: FP matrix product of two square matrices (Cannon's algorithm)
  - All three matrices are partitioned into square blocks among the processors; each step adds a product and passes the input submatrices along
  - Expectation: good with incremental chkpnt, very poor with diff-based compression
Testing Applications (2)
- PSTSWM: nonlinear shallow water equations on a rotating sphere
  - The majority of pages are touched, but only a few bytes per page are modified
  - Expectation: poor with incremental chkpnt, good with diff-based compression
- CELL: parallel cellular automaton simulation program
  - Two (sparse) grids of cellular automata (current/next)
  - Expectation: poor with incremental chkpnt, good with compression
- PCG: solves Ax=b for a large, sparse matrix
  - First converted to a small, dense format
  - Expectation: good with incremental chkpnt, very poor with diff-based compression
Diskless Checkpointing
20 Nov 2001
Disk-based vs. Diskless Chkpnt

                    Disk-based                     Diskless
  Where to chkpnt?  In stable storage              In local memory
  How to recover?   Restore from stable storage    Re-calculate from chkpnts & encodings
  Remarks           Can tolerate whole-system      Cannot tolerate whole-system failure;
                    failure; low BW to stable      memory is much faster; encoding
                    storage                        (+ communication) overhead
Recalculate the lost chkpnt?
- Error detection & correction in digital communication:
  - 1-bit parity (m=1):
    11001011[1] (right)
    11000011[1] (detectable)
    11001011[0] (detectable)
    11000011[0] (oops)
  - Mirroring (m=n):
    11001011[11001011] (right)
    11001011[11001010] (detectable)
    11001011[00111100] (detectable)
    11001010[11001010] (oops)
- Chkpnt recovery in diskless chkpnt:
  - 1-bit parity:
    11001011[1] (chkpnt)
    1100X011[1] (tolerable)
    11001011[X] (tolerable)
    1100X011[X] (intolerable)
  - Mirroring:
    11001011[11001011] (right)
    11001011[1100101X] (tolerable)
    11001011[XXXXXXXX] (tolerable)
    1100101X[1100101X] (intolerable)
- Remarks:
  - Difference: in a chkpnt system we can easily tell which node is faulty (erasures rather than errors)
  - Some codings can be used to recover from errors in digital communication, too (e.g., Reed-Solomon)
Performance
- Criteria:
  - Latency: time between when a chkpnt is initiated and when it is ready for recovery
  - Overhead: increase in execution time caused by chkpnt
- Applications:

  App     Description                                Pattern
  NBODY   N-body interactions                        Majority of pages touched, but only
  PSTSWM  Simulation of the states on a 3-D system     a few bytes per page are modified
  CELL    Parallel cellular automaton
  MAT     FP matrix multiplication (Cannon's)        Only small parts are updated, but
  PCG     PCG for a sparse matrix                      updated in their entirety
Implementation
- BASE: no chkpnt
- DISK-FORK: disk-based chkpnt w/ fork()
- SIMP: simple diskless
- INC: incremental diskless
- FORK: forked diskless
- INC-FORK: incremental, forked diskless
- C-SIMP / C-INC / C-FORK / C-INC-FORK: the corresponding diskless scheme w/ diff-based compression
Experiment Framework
- Network of 24 Sun Sparc5 workstations connected by fast, switched Ethernet: ~5 MB/s
- Each workstation has:
  - 96 MB of physical memory
  - 38 MB of local disk storage
- Disks (bandwidth 1.7 MB/s) are connected via Ethernet; NFS over Ethernet achieved a bandwidth of 0.13 MB/s
Discussion
- Latency: diskless has much lower latency than disk-based, which lowers the expected running time of the application in the presence of failures (small recovery time)
- Overhead: comparable between the two
Recommendations
- DISK-FORK: if chkpnts are small, or if the likelihood of wholesale system failure is high
- C-FORK: if many pages, but only a few bytes per page, are modified
- INC-FORK: if only a modest number of pages are modified
Reference
- J. S. Plank, K. Li, and M. A. Puening. "Diskless checkpointing." IEEE Transactions on Parallel and Distributed Systems, 9(10):972-986, Oct. 1998.