Diskless Checkpointing
15 Nov 2001

Motivation
- Checkpointing to stable storage: disk access is a major bottleneck!
- Standard mitigations: incremental checkpointing, copy-on-write, compression, memory exclusion
- Diskless checkpointing: when extra memory is available (e.g. on a network of workstations, NOW), use memory instead of disk
  - Good: network bandwidth > disk bandwidth
  - Bad: memory is not stable

Bottom Line
- A NOW with (n+m) processors: n application processors, m checkpoint processors, (n+m) available processors
- The application runs on exactly n processors and should proceed as long as
  - the number of processors in the system is at least n, and
  - the failures occur within certain constraints

Overview: Coordinated Checkpointing (Sync-and-Stop)
- To checkpoint:
  - application processors checkpoint their state in memory
  - checkpoint processors encode the application checkpoints and store the encodings in memory
- To recover:
  - non-failed processors roll back, and replacement processors are chosen
  - replacement processors calculate the checkpoints of the failed processors from the other checkpoints and the encodings

Outline
- Application-processor checkpointing: incremental, forked (copy-on-write), optimizations
- Encoding the checkpoints: disk-based vs. diskless; parity (RAID level 5), mirroring, 1-dimensional parity, 2-dimensional parity, Reed-Solomon coding; optimizations
- Results

Application-Processor Checkpointing
- Goal: each processor must be able to roll back to its most recent checkpoint
- Failures must be tolerated even while a checkpoint is in progress: each coordinated checkpoint must remain valid until the next coordinated checkpoint has been completed
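The sync-and-stop protocol can be sketched in a few lines: every application processor copies its state into memory, a checkpoint processor stores a parity encoding of those copies, and a replacement processor rebuilds a lost checkpoint from the survivors plus the encoding. This is a minimal single-process illustration, not the paper's implementation; the function names are mine.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def sync_and_stop_checkpoint(app_states):
    """All application processors checkpoint in memory; the checkpoint
    processor stores the bytewise-XOR (parity) encoding of those checkpoints."""
    chkpnts = [bytes(s) for s in app_states]   # in-memory copies
    encoding = reduce(xor_bytes, chkpnts)      # held by the checkpoint processor
    return chkpnts, encoding

def recover(chkpnts, encoding, failed: int):
    """A replacement processor rebuilds the failed processor's checkpoint
    from the surviving checkpoints and the encoding."""
    surviving = [c for i, c in enumerate(chkpnts) if i != failed]
    return reduce(xor_bytes, surviving, encoding)

states = [b"aaaa", b"bbbb", b"cccc", b"dddd"]     # n = 4 application processors
chkpnts, enc = sync_and_stop_checkpoint(states)
assert recover(chkpnts, enc, failed=2) == b"cccc"  # proc 2's checkpoint is rebuilt
```

XOR is self-inverse, so XOR-ing the encoding with the three surviving checkpoints cancels them out and leaves exactly the lost one.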
Disk-Based Checkpointing
- To checkpoint: save all values in the stack, heap, and registers to disk
- To recover: overwrite the address space with the stored checkpoint
- Space demands: 2M on disk (M = the size of an application processor's address space)

Simple Diskless Checkpointing
- To checkpoint: wait until the encoding has been calculated, then overwrite the diskless checkpoint in memory
- To recover: roll back from the in-memory checkpoints
- Space demands: extra M in memory

Incremental Diskless Checkpointing
- To checkpoint: initially set all pages read-only; on a page fault, copy the page and set it read-write
- To recover: restore all read-write pages
- Space demands: extra I in memory (I = the incremental checkpoint size)

Forked Diskless Checkpointing
- To checkpoint: the application clones itself with fork()
- To recover: overwrite the state with the clone's, or let the clone assume the role of the application
- Space demands: extra 2I in memory

Optimizations
- Break the checkpoint into chunks: more efficient use of memory
- Send diffs (incremental): bitwise XOR of the current copy and the checkpoint copy; unmodified pages need not be sent
- Compress diffs: unmodified regions of memory compress well

Application-Processor Checkpointing (review)
- Simple diskless: extra M in memory
- Incremental diskless: extra I in memory
- Forked diskless: extra 2I in memory, less CPU activity
- Optimizations: chunked checkpoints, diffs, compressed diffs

Encoding the Checkpoints
- Goal: the extra checkpoint processors should store enough information that the checkpoints of failed processors can be reconstructed
- Notation: m = number of checkpoint processors, n = number of application processors, b_i^j = j-th byte of application processor i

Parity (RAID level 5, m=1)
- To checkpoint: b_ckp^j = b_1^j XOR b_2^j XOR ... XOR b_n^j  (example: n=4, m=1)
- On failure of the i-th processor: b_i^j = b_1^j XOR ... XOR b_(i-1)^j XOR b_(i+1)^j XOR ... XOR b_n^j XOR b_ckp^j
- Can tolerate: only one processor failure (application or checkpoint processor)
- Remarks: the checkpoint processor is a bottleneck of communication and computation

Mirroring (m=n)
- To checkpoint: b_ckp_i^j = b_i^j, i.e. checkpoint processor i holds a copy of application processor i's checkpoint  (example: n=m=4)
- On failure of the i-th processor: b_i^j = b_ckp_i^j
- Can tolerate: up to n processor failures, except the failure of both an application processor and its own checkpoint processor
- Remarks: fast, no calculation needed

1-Dimensional Parity (1<m<n)
- The application processors are partitioned into m groups; the i-th checkpoint processor calculates the parity of the checkpoints in group i  (example: n=4, m=2)
- To checkpoint / on failure: same as in parity encoding, within each group
- Can tolerate: one processor failure per group
- Remarks: more efficient in communication and computation than plain parity

2-Dimensional Parity
- The application processors are arranged logically in a two-dimensional grid; each checkpoint processor calculates the parity of one row or one column  (example: n=4, m=4)
- To checkpoint / on failure: same as in parity encoding, per row or column
- Can tolerate: any two processor failures
- Remarks: each checkpoint is sent to both a row and a column processor, so multicast reduces the communication overhead

Reed-Solomon Coding (general m)
- To checkpoint: use a Vandermonde matrix F, with f(i,j) = j^(i-1), and calculate the encodings by matrix-vector multiplication
- To recover: use Gaussian elimination
- Can tolerate: any m failures
- Remarks: uses Galois Field arithmetic; computation overhead

Optimizations
- Sending and calculating the encoding in RAID level 5-based encodings (e.g. parity):
  (a) DIRECT: every application processor sends to checkpoint processor C1, which becomes the bottleneck
  (b) FAN-IN: the parity is combined pairwise along a tree in log(n) steps

Encoding the Checkpoints (review)
- Parity (RAID level 5, m=1): only one failure; bottleneck
- Mirroring (m=n): up to n failures (unless an application processor and its checkpoint processor both fail); fast
- 1-dimensional parity: one failure per group; more efficient than parity
- 2-dimensional parity: any two failures; communication overhead without multicast
- Reed-Solomon coding: any m failures; computation overhead
- DIRECT vs. FAN-IN

Testing Applications (1)
- CPU-intensive parallel programs; instances that took 1.5-2 hours on 16 processors
- NBODY: N-body interactions among particles in a system
  - Particles are partitioned among processors; the location field of each particle is updated
  - Expectation: poor with incremental checkpointing, good with diff-based compression
- MAT: floating-point matrix product of two square matrices (Cannon's algorithm)
  - All three matrices are partitioned into square blocks among the processors; in each step, a processor adds a product and passes the input submatrices on
  - Expectation: good with incremental checkpointing, very poor with diff-based compression

Testing Applications (2)
- PSTSWM: nonlinear shallow water equations on a rotating sphere
  - The majority of pages are modified, but only a few bytes per page
  - Expectation: poor with incremental checkpointing, good with diff-based compression
- CELL: parallel cellular automaton simulation program
  - Two (sparse) grids of cellular automata (current/next)
  - Expectation: poor with incremental checkpointing, good with compression
- PCG: solves Ax=b for a large, sparse matrix; first converted to a small, dense format
  - Expectation: good with incremental checkpointing, very poor with diff-based compression

Diskless Checkpointing
20 Nov 2001

Disk-Based vs. Diskless Checkpointing
- Disk-based: checkpoints to stable storage and recovers by restoring from it; can tolerate whole-system failure, but bandwidth to stable storage is low
- Diskless: checkpoints to local memory and recovers by recalculating the lost checkpoint; cannot tolerate whole-system failure, but memory is much faster; adds encoding (and communication) overhead
- How can the lost checkpoint be recalculated?
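The DIRECT vs. FAN-IN trade-off above can be made concrete: DIRECT needs n sends into one checkpoint processor, while FAN-IN combines partial parities pairwise along a tree, finishing in ceil(log2(n)) steps. A minimal sketch, with function names of my own choosing:

```python
import math

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def fan_in_parity(chkpnts):
    """Tree-style reduction: each step halves the number of partial parities,
    and all XORs within a step could run in parallel on different processors."""
    level = list(chkpnts)
    steps = 0
    while len(level) > 1:
        nxt = [xor_bytes(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]  # pairwise combine
        if len(level) % 2:
            nxt.append(level[-1])                     # odd one out waits a step
        level = nxt
        steps += 1
    return level[0], steps

chkpnts = [bytes([i] * 4) for i in range(8)]          # n = 8 application processors
parity, steps = fan_in_parity(chkpnts)
assert steps == math.ceil(math.log2(len(chkpnts)))    # 3 steps instead of 8 sends
```

The final parity is identical to the DIRECT result; only the communication pattern changes.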
Error Detection & Correction in Digital Communication
- 1-bit parity (m=1):
  11001011[1]  (right)
  11000011[1]  (detectable)
  11001011[0]  (detectable)
  11000011[0]  (oops)
- Mirroring (m=n):
  11001011[11001011]  (right)
  11001011[11001010]  (detectable)
  11001011[00111100]  (detectable)
  11001010[11001010]  (oops)

Checkpoint Recovery in Diskless Checkpointing
- 1-bit parity:
  11001011[1]  (checkpoint)
  1100X011[1]  (tolerable)
  11001011[X]  (tolerable)
  1100X011[X]  (intolerable)
- Mirroring:
  11001011[11001011]  (right)
  11001011[1100101X]  (tolerable)
  11001011[XXXXXXXX]  (tolerable)
  1100101X[1100101X]  (intolerable)
- Remarks:
  - The difference: in a checkpointing system we can easily tell which node failed, so losses are erasures rather than silent errors
  - Some codes can also recover from errors in digital communication (e.g. Reed-Solomon)

Performance Criteria
- Latency: the time between when a checkpoint is initiated and when it is ready for recovery
- Overhead: the increase in execution time caused by checkpointing

Applications

  App     Description                              Pattern
  ------  ---------------------------------------  -------------------------------------------------------
  NBODY   N-body interactions                      Majority of pages modified, but only a few bytes per page
  PSTSWM  Simulation of states on a 3-D system     (same)
  CELL    Parallel cellular automaton              (same)
  MAT     FP matrix multiplication (Cannon's)      Only small parts updated, but updated in their entirety
  PCG     PCG for a sparse matrix                  (same)

Implementation
- BASE: no checkpointing
- DISK-FORK: disk-based checkpointing with fork()
- SIMP: simple diskless
- INC: incremental diskless
- FORK: forked diskless
- INC-FORK: incremental, forked diskless
- C-SIMP, C-INC, C-FORK, C-INC-FORK: the above with diff-based compression

Experiment Framework
- A network of 24 Sun Sparc5 workstations connected by a fast, switched Ethernet: ~5 MB/s
- Each workstation has 96 MB of physical memory and 38 MB of local disk storage
- Disks with a bandwidth of 1.7 MB/s are reached via the Ethernet; NFS over the Ethernet achieved a bandwidth of 0.13 MB/s

Discussion
- Latency: diskless checkpointing has much lower latency than disk-based.
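The contrast drawn on these slides, that a checkpointing system knows which node failed while a communication channel does not, can be demonstrated directly: with 1-bit parity, a known-location erasure (the X cases above) is always recoverable, while a silent bit flip at an unknown location is only detectable. A small illustration:

```python
def parity_bit(bits):
    """Even parity over a sequence of 0/1 values."""
    p = 0
    for b in bits:
        p ^= b
    return p

data = [1, 1, 0, 0, 1, 0, 1, 1]
check = parity_bit(data)            # stored on the "checkpoint" node

# Erasure: we KNOW position 3 was lost, so parity pins down its value.
erased = data.copy()
erased[3] = None                    # failed node, location known
recovered = check
for i, b in enumerate(erased):
    if b is not None:
        recovered ^= b
assert recovered == data[3]         # the lost bit is reconstructed

# Error: a flip at an UNKNOWN position is detectable (parity mismatch)
# but not correctable -- any single position could be the culprit.
flipped = data.copy()
flipped[5] ^= 1
assert parity_bit(flipped) != check
```

This is why the same m=1 code that merely detects a single error on a channel fully corrects a single failure in the diskless scheme.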
  - This lowers the expected running time of the application in the presence of failures, since recovery time is small
- Overhead: comparable between the two

Recommendations
- DISK-FORK: if checkpoints are small, or if the likelihood of wholesale system failures is high
- C-FORK: if many pages are modified, but only a few bytes per page
- INC-FORK: if only a modest number of pages are modified

Reference
J. S. Plank, K. Li, and M. A. Puening. "Diskless Checkpointing." IEEE Transactions on Parallel and Distributed Systems, 9(10):972-986, Oct. 1998.
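The C-FORK recommendation rests on the diff idea from the first lecture: XOR the current address space against the forked checkpoint copy, page by page. Unchanged pages XOR to all zeros and need not be sent, and the mostly-zero diffs of sparsely modified pages compress well. A rough sketch, with the page size and names chosen only for illustration:

```python
import zlib

PAGE = 64  # illustrative page size

def page_diffs(chkpnt: bytes, current: bytes):
    """Yield (page_index, xor_diff) only for pages that actually changed."""
    for i in range(0, len(chkpnt), PAGE):
        a, b = chkpnt[i:i + PAGE], current[i:i + PAGE]
        if a != b:
            yield i // PAGE, bytes(x ^ y for x, y in zip(a, b))

chkpnt = bytes(16 * PAGE)                    # previous checkpoint: 16 pages
current = bytearray(chkpnt)
current[3 * PAGE + 7] = 0xFF                 # a few bytes change on one page

diffs = list(page_diffs(chkpnt, bytes(current)))
assert [idx for idx, _ in diffs] == [3]      # only the modified page is sent

# The sparse XOR diff compresses to far below the raw checkpoint size.
payload = b"".join(d for _, d in diffs)
assert len(zlib.compress(payload)) < len(chkpnt)
```

Applying the same XOR diff to the checkpoint copy reproduces the current page, so the scheme is lossless in both directions.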