Yavor Todorov

Contents
• Introduction
• How it works
• OS level checkpointing
• Application level checkpointing
• CPR for parallel programming
• CPR functionality
• References

Introduction
• Checkpoint/restart (CPR) mechanisms allow a machine that crashes and is subsequently restarted to continue from the last checkpoint with no loss of data, just as if no failure had occurred.
• At some firms and supercomputing centers, it is common practice to break up long-running computational programs into several batches. Programs such as a gene-sequencing application search through enormous databases and execute complex algorithms that can take several weeks to complete.
• While the concept is easy to understand, the technical mechanism to checkpoint and restart an operating system or application is quite complex.

How it works
• Checkpointing can occur either within the operating system or at the application level.
• Most high-end mainframes have automated CPR utilities.
• CPR at the operating system level saves the state of everything that is being done within a given application at periodic checkpoints and allows the system to restart from the last checkpoint.
• On very large computers with hundreds or thousands of processes running, saving the entire state of an operating system can take a long time.

How it works (cont’d)
• It also takes a long time to later restart the machine at that state. On large jobs, it could take several hours.
• The recovery is delayed because a large amount of data must be restored, whether or not the application requires that information to fully restart.
• When a process is checkpointed, the saved process image includes:
  • Text
  • Data
  • Stack
  • Heap
  • Register set values
  • Status of open files (file handles)
  • Sockets
  • Signals
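
To make the basic idea concrete, here is a minimal sketch of checkpoint/restart in a long-running loop. It is an illustration under stated assumptions, not the mechanism of any particular CPR utility: the function names (save_checkpoint, load_checkpoint, run_job), the use of pickle, the state layout and the crash_at parameter (which only simulates a failure) are all hypothetical. Unlike a full process image, it saves only the two values the job needs to resume.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, crash_at=None):
    """Sum 0..total_steps-1, checkpointing at the end of every loop iteration."""
    state = load_checkpoint(path, {"next_step": 0, "total": 0})
    for step in range(state["next_step"], total_steps):
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated crash")  # hypothetical failure
        state["total"] += step
        state["next_step"] = step + 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling run_job a second time with the same path after a failure resumes from the step recorded in the checkpoint rather than from step 0, and produces the same total as an uninterrupted run.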

Checkpoint at OS level
• "Checkpointing at the operating system is useful but very costly, in that the operating system does not know what data the application really needs to restore it later, so it blindly saves everything," according to James Kasdorf, director of a supercomputing center.
• At the OS level, CPR saves the system state. That includes unneeded copies of data, program code and system libraries.
• Most supercomputing centers try to avoid CPR at the OS level.

Checkpointing at OS level drawbacks
• Since the whole process image is saved, it is an expensive operation.
• It takes more time.
• It usually needs kernel modifications, since most operating systems, such as Linux, were not built with CPR functionality.

Checkpoint at application level
• The application uses OS hooks to save the information needed for restart.
• It is more efficient: checkpointing takes less time and restarting the application is faster.
• It allows you to choose the optimal checkpoint location, which is typically at the end of a loop.
• Only the needed data gets saved.

CPR at app level drawbacks
• It is difficult in some cases, e.g. when the application has an open communications channel to an external device or runs on a clustered computer.
• A distributed application's state is hard to save because the program's state is changing across multiple nodes.
• CPR for applications with large memory buffers takes longer.

CPR for parallel programs
• Each process is responsible for taking its own checkpoint.
• Checkpoint timing is the responsibility of a coordinating process.
• CPR data includes: in-transit message data, data section, file offsets, signal state, executable information, stack contents and register contents, CPU state, information about open files, and pending signals.
• The checkpoint file can be stored on either local or global storage.
• When the program is restarted, each process initiates its own restart.
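
The division of labour described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: threads stand in for the parallel processes, two threading.Barrier objects and a fixed interval (every) stand in for the coordinating process's timing, and the names worker, run and rank-N.ckpt are hypothetical. In-transit messages, sockets and signals are not modeled.

```python
import os
import pickle
import threading


def worker(rank, steps, every, stop, done, ckpt_dir, results):
    """One simulated process: restores its own state, computes, checkpoints itself."""
    path = os.path.join(ckpt_dir, f"rank-{rank}.ckpt")
    state = {"step": 0, "acc": 0}
    if os.path.exists(path):
        # On restart, each process initiates its own restart from its own file.
        with open(path, "rb") as f:
            state = pickle.load(f)
    while state["step"] < steps:
        state["acc"] += rank * state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % every == 0:
            stop.wait()  # all workers pause at the same step
            with open(path, "wb") as f:
                pickle.dump(state, f)  # each worker writes its own checkpoint
            done.wait()  # nobody resumes until every checkpoint is written
    results[rank] = state["acc"]


def run(n_workers, steps, every, ckpt_dir):
    """Coordinator stand-in: fixes the checkpoint timing and records each round."""
    committed = []
    stop = threading.Barrier(n_workers)
    # The barrier action runs once per round, after all workers have saved.
    done = threading.Barrier(
        n_workers, action=lambda: committed.append(len(committed) + 1)
    )
    results = {}
    threads = [
        threading.Thread(
            target=worker, args=(r, steps, every, stop, done, ckpt_dir, results)
        )
        for r in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, committed
```

Because every worker stops at the same step before saving, the set of per-rank files always describes one consistent point in the computation. Running run again on the same directory with a larger steps value continues from the saved step instead of starting over.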

CPR for parallel programs (cont’d)
• All migrating processes have to be stopped at the same time, to avoid the loss of a signal.
• The socket IP address space has to be taken into consideration (it can be virtualized).
• I/O speed is pivotal for any CPR process (the faster, the better).

CPR functionality
• Process migration
• Load balancing
• Crash recovery
• Transaction rollback
• Job control

Sources
• Duell, J. (2005). The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Berkeley: Lawrence Berkeley National Lab.
• Litzkow, M., Tannenbaum, T., Basney, J., & Livny, M. (1997). Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System.
• Zhong, H., & Nieh, J. (2001). CRAK: Linux Checkpoint/Restart As a Kernel Module. Department of CS, Columbia University.
• http://www.computerworld.com/s/article/68930/Checkpoint_and_Restart