DMTCP: A New Linux Checkpointing Mechanism For Vanilla Universe Jobs
Condor Project
Computer Sciences Department
University of Wisconsin-Madison
www.cs.wisc.edu/Condor

Why DMTCP?
› Why checkpoint at all?
› Problems with Condor’s Standard Universe
  Single process.
  No pthreads.
  No mmap() support.
  Forced re-link to form a static executable.
› DMTCP removes these restrictions!

What is DMTCP?
› Distributed Multi-Threaded CheckPointing.
› Works with Linux Kernel 2.6.9 and later.
› Supports sequential and multi-threaded computations across single/multiple hosts.
› Entirely in user space (no kernel modules or root privilege).
› Transparent (no recompiling, no re-linking).
› Written at Northeastern U. and MIT and under active development for 4+ years.
› LGPL’d and freely available.
› No Remote I/O.

Process Structure
[Diagram: DMTCP Coordinator and Process 1 through Process N, each process with a checkpoint thread (CT) and user threads (T1, T2); labeled links: Signal (USR2), Network Socket.]
CT = DMTCP checkpoint thread
T = User Thread

How Does It Work?
› ./dmtcp_checkpoint a.out # starts coordinator too
› ./dmtcp_command -c # talks to coordinator
› ./dmtcp_restart ckpt_a.out-*.dmtcp
› Coordinator is a stateless synchronization server for the distributed checkpointing algorithm.
› Checkpoint/Restart performance is related to size of memory, disk write speed, and synchronization.

How Does It Work?
› LD_PRELOAD: Transparently preloads checkpoint libraries, which install libc wrappers and checkpointing code.
› SIGUSR2: Used internally from checkpoint thread to user threads.
› Wrappers: Only on less heavily used calls to libc
  fork, exec, system, pipe, bind, listen, setsockopt, connect, accept, clone, close, ptsname, openlog, closelog, signal, sigaction, sigvec, sigblock, sigsetmask, sigprocmask, rt_sigprocmask, pthread_sigmask
  Overhead is negligible.

How Does It Work?
› Additional wrappers when process id & thread id virtualization is enabled
  getpid, getppid, gettid, tcgetpgrp, tcsetpgrp, getpgrp, setpgrp, getsid, setsid, kill, tkill, tgkill, wait, waitpid, waitid, wait3, wait4

How Does It Work?
› Checkpoint image compression on-the-fly (default).
› Currently only supports dynamically linking to libc.so. Support for static libc.a is feasible, but not implemented.
› Stays close to POSIX API standards.

A Checkpoint Under DMTCP
› dmtcphijack.so & mtcp.so present in executable’s memory.
› Ask coordinator process for checkpoint via dmtcp_command.
› Now what happens?

A Checkpoint Under DMTCP
› Suspend user threads with SIGUSR2.
› Elect shared file descriptor leaders.
› Drain kernel buffers and do network handshake with peers.
› Write checkpoint to disk.
› Refill kernel buffers.
› Resume user threads.

Where Is the Checkpoint?
› In the cwd of the application.
  A set of ckpt_<exec>_<id>.dmtcp files.
› In the cwd of the coordinator.
  A dmtcp_restart_script.sh file.
  The dmtcp_restart_script.sh may need tweaking depending upon circumstance.

A Restart Under DMTCP
› Restart Process loads in memory.
› Reopen files and recreate ptys.
› Recreate and reconnect sockets.
› Fork into user processes.
› Rearrange file descriptors to initial layout.
› Restore memory and threads.
› Refill kernel buffers.
› Resume user threads.

Supported OS Features
› Threads, mutexes/semaphores, fork, exec, shared memory (via mmap), TCP/IP sockets, UNIX domain sockets, pipes, ptys, terminal modes, ownership of controlling terminals, signal handlers, open and/or shared fds, I/O (including the readline library), parent-child process relationships, process id & thread id virtualization, session and process group ids, and more…
› Trying to keep the implementation small!
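A minimal sketch, in Python, of how a wrapper script might locate the checkpoint images described under "Where Is the Checkpoint?" and assemble a dmtcp_restart command line. The helper names (find_checkpoint_images, build_restart_command) and the ckpt_*.dmtcp glob are assumptions made for illustration; they are not part of DMTCP or Condor, and the generated dmtcp_restart_script.sh remains the normal way to restart.

```python
import glob
import os


def find_checkpoint_images(app_cwd):
    """Return the sorted paths of checkpoint images found in app_cwd.

    Assumption: images are named ckpt_<something>.dmtcp and sit in the
    application's working directory, as the slides describe.
    """
    pattern = os.path.join(app_cwd, "ckpt_*.dmtcp")
    return sorted(glob.glob(pattern))


def build_restart_command(images, restart_binary="./dmtcp_restart"):
    """Build an argv list that passes every image to dmtcp_restart.

    restart_binary is a hypothetical default path; adjust it to wherever
    the DMTCP programs were transferred.
    """
    if not images:
        raise ValueError("no checkpoint images found")
    return [restart_binary] + list(images)


if __name__ == "__main__":
    # Only print the command; actually running it is left to the caller.
    images = find_checkpoint_images(os.getcwd())
    if images:
        print(" ".join(build_restart_command(images)))
    else:
        print("no checkpoint images in", os.getcwd())
```

The sketch only builds the command and leaves execution to the caller, since a restart may also need a running coordinator and a tweaked restart script.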

Supported Applications
› MPICH-2, OpenMPI, SciPy/iPython, Python cmsRun, Perl, Ruby, PHP, GHCi (Glasgow Haskell Compiler), OCaml, Octave, Macaulay2, GNUPlot, slsh (S-Lang scripts), MZScheme, GST (GNU Smalltalk virtual machine), tcsh, dash, csh, tclsh (tcl-based interpreter), SQLite. And many others!

Planned Application Support
› Bash, gcl (GNU Common Lisp), maxima (based on gcl), and the Sun JVM.
› These programs use sbrk() for their own memory management and induce a bug in DMTCP.
› A fix is planned and will go in soon.

Planned Application Support
› Matlab
  Directly calling the binary without graphics works, but matlab uses bash, which needs the sbrk() fix.

Condor/DMTCP Integration
› Experimental at this time.
  Determining scalability, stability, and extent of “weird edge cases” of DMTCP mixed with Condor.
› Completely outside of Condor source code.
  A vanilla job called “shim_dmtcp” that wraps the user’s job and stdfiles with DMTCP.
  A submit description file which transfers needed dmtcp files over to the remote side and saves intermediate checkpoints.
  No remote I/O!

Shim Script Execution
[Diagram: condor_starter, shim_dmtcp, Coordinator, Job]

Submit File Example
  universe = vanilla
  executable = shim_dmtcp
  arguments = logfile stdinf stdoutf stderrf a.out arg0 arg1…
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT
  transfer_input_files = <dmtcp libraries and programs>,\
      a.out, stdinf, stdoutf, stderrf
  environment = DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null
  kill_sig = 2
  output = shim.$(Cluster).$(Process).out
  error = shim.$(Cluster).$(Process).err
  log = shim.log
  queue

Condor/DMTCP Integration
› Early Results
  It works with our test case and thousands of jobs.
  Problems
  • Checkpointing between Physical Address Extension (PAE) kernels and normal kernels is a challenge.
  • DMTCP’s API needs some improvement.
  • Coordinator failure means job failure.
  • Shim script is clunky, e.g. no streaming I/O.
› Next: Integration into our stduniv test suite for full regression testing.

Future Condor Integration
› Add WantCheckpoint = True and CheckpointMethod = DMTCP for a vanilla universe job.
› Condor takes care of the wrapping of the job with DMTCP and transferal of needed DMTCP files--no shim script voodoo.
› Condor should honor CheckpointPlatform for Vanilla universe jobs in case of pool segmentation.
› Parallel universe support with single coordinator.
› Doug Thain’s Parrot for remote I/O.

Challenges
› C/C++ runtime library compatibility issues.
  Recompile DMTCP on slot before job execution?
› Dynamic library incompatibilities.
› No Checkpoint Server.
  Condor file transfer protocol enhancement?
› Debugging methods and practices?

Further Reading
› “DMTCP: Transparent Checkpointing for Cluster Computation and the Desktop”
  http://arxiv.org/abs/cs/0701037
› Source Code
  http://dmtcp.sourceforge.net

Questions?
› DMTCP
  http://dmtcp.sourceforge.net
  Gene Cooperman: gene@ccs.neu.edu
› Condor/DMTCP Integration
  Pete Keller: psilord@cs.wisc.edu
  Ask me if you want to try the Alpha Version out!

Thank you