Cloud Computing Technology and Science, 11/30 – 12/3, 2010.

Fault Tolerance for HPC with OpenVZ Virtualization by Lite Migration Toolkit

Chang-Hsing Wu, Hui-Shan Chen, Yi-Lun Pan, Hsi-En Yu, Kuo-Yang Cheng and Weicheng Huang
National Center for High-performance Computing, Taiwan, R.O.C.
{hsing, chwhs, serenapan, yun, KuoYang, whuang}@nchc.org.tw

Abstract

The reliability of large-scale parallel jobs within a cluster, or even across multiple clusters in a Grid or distributed computing environment, is a long-standing issue because of the difficulty of monitoring and managing a large number of compute nodes. To address this issue, a Lite Migration toolkit with a fault tolerance feature has been developed by the Distributed Computing Team at the National Center for High-performance Computing (NCHC). The proposed approach relies on virtualization techniques, exemplified here by OpenVZ [2], an open source implementation of virtualization. It automatically and transparently provides fault tolerance to parallel HPC applications, and its extremely lightweight design extends the resource manager's services with a checkpoint/restart mechanism. The approach leverages virtualization techniques combined with a cluster queuing system and a load-balancing migration mechanism.

Keywords: Virtualization Techniques, OpenVZ, Fault Tolerance, HPC.

Research Objective and System Architecture

TORQUE [1] is an open source resource manager providing control over batch jobs and distributed compute nodes; it is a community effort based on the original PBS project. OpenVZ [2] is an open source container-based virtualization solution built on Linux. By integrating these two technologies, we have built a new and straightforward toolkit, Lite Migration (LMT), for acquiring fault tolerance services in an HPC environment. It automatically and transparently provides fault tolerance to parallel HPC applications, and its extremely lightweight design extends the resource manager's services with a checkpoint/restart mechanism. LMT focuses on leveraging virtualization techniques combined with a cluster queuing system and a load-balancing migration mechanism. The main features of LMT are as follows:

Lightweight to Operate & Use – With LMT, users submit jobs as usual and need no additional configuration.

Web Monitor & Control – Users can operate the mechanism through a Web-based module.

Checkpoint & Restart – LMT takes snapshots periodically during job execution. Further, jobs can be stopped when their nodes are needed for other purposes; once those purposes are served, the jobs can be restarted automatically without any problem.

Automatic Recovery & Live Migration – This is the most important part, because it migrates jobs away from a failing node so that the computation continues without much interruption. The proposed toolkit provides a checkpoint mechanism for MPI tasks: a migrated MPI job is automatically rolled back to a specific checkpoint identified by the user.

With the Lite Migration toolkit (LMT), users do not need to worry about possible failures of their jobs during computation, especially in the dynamic environment of the Grid. A job on a failed node is migrated to another node automatically, without restarting the whole job. To avoid changing the usual practice of users, the various operations and steps are handled within the architecture itself. The system architecture of Lite Migration is sketched in Figure 1.

Figure 1. The System Architecture of Lite Migration Middleware

Furthermore, users can operate the mechanism through the designed Web Monitor/Control Module, as shown in Figure 2.

Figure 2. Web Monitor/Control Module
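The periodic snapshot behaviour described above can be sketched in a few lines. The following is a minimal illustration only, not the toolkit's actual code: the container ID, backup directory, interval, and dump-file naming are hypothetical, while the vzctl chkpnt stages (--suspend, --dump, --resume) are standard OpenVZ checkpointing commands.

#!/usr/bin/env python
# Minimal sketch of LMT-style periodic checkpointing of an OpenVZ container.
# CTID, BACKUP_DIR, and INTERVAL are illustrative values, not from the paper.
import os
import subprocess
import time

CTID = "101"                # hypothetical container hosting the MPI processes
BACKUP_DIR = "/backup/lmt"  # hypothetical checkpoint backup directory
INTERVAL = 300              # seconds between snapshots (illustrative)

def take_snapshot():
    """Freeze the container, dump its state to a timestamped file, resume it."""
    dumpfile = os.path.join(BACKUP_DIR, "ct%s-%d.dump" % (CTID, int(time.time())))
    # Standard OpenVZ live-snapshot sequence: suspend, dump, resume.
    subprocess.check_call(["vzctl", "chkpnt", CTID, "--suspend"])
    subprocess.check_call(["vzctl", "chkpnt", CTID, "--dump", "--dumpfile", dumpfile])
    subprocess.check_call(["vzctl", "chkpnt", CTID, "--resume"])
    return dumpfile

if __name__ == "__main__":
    while True:
        take_snapshot()
        time.sleep(INTERVAL)

Presumably, in LMT this loop is coordinated with TORQUE's job life cycle rather than run stand-alone; the dump files are then the checkpoints kept for later restoration.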
Challenges - Scenarios

The Lite Migration toolkit takes snapshots periodically during job execution. The snapshots are kept as checkpoints for possible future restorations. Once a failure of a job or a node is detected by the VM Monitor/Control Module, the latest or a selected checkpoint is restored, on another node if necessary, to recover the failed instance. The computation thus continues without much interruption. Two scenarios are illustrated to demonstrate the features of the proposed mechanism: (1) nodes fail during job execution, as illustrated in Figure 3, and (2) a job is stopped because its compute nodes are needed for other purposes, as shown in Figure 4.

Figure 3. Node Failure during Job Execution

Figure 4. Nodes can be Utilized for Other Purposes

In the first scenario, LMT periodically takes snapshots of the virtual machines during job computation and keeps the checkpoint files in a backup directory; the corresponding TORQUE command is shown in Figure 5. Afterward, the user can access the checkpoint backup directory to restore the latest or a specified version of the checkpoint files, and the job can be restarted to continue computing; the corresponding TORQUE command is shown in Figure 6. Finally, while a job is executing, its processes on failed nodes are automatically migrated to healthy ones without stopping the job.

Figure 5. Periodic-checkpoint command of LMT

Figure 6. Restoring-snapshot command of LMT

The second scenario illustrates the situation in which the cycles allocated to a job are reclaimed and the dispatched job is frozen, so that the compute nodes can be utilized for other computing purposes. The checkpoint and migration mechanisms then ensure the continuation of the work on other compute nodes.
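The restore side of both scenarios can be sketched similarly. Again, this is an illustration under the same hypothetical names as in the earlier sketch (CTID, BACKUP_DIR), not the actual command shown in Figure 6; vzctl restore is the standard OpenVZ counterpart of vzctl chkpnt.

#!/usr/bin/env python
# Minimal sketch of restoring a job from its latest (or a chosen) checkpoint
# on a healthy node. CTID and BACKUP_DIR are the same hypothetical names used
# in the checkpointing sketch above.
import glob
import os
import subprocess

CTID = "101"
BACKUP_DIR = "/backup/lmt"  # must be reachable from the target node (e.g., shared storage)

def restore(dumpfile=None):
    """Recreate the container from a dump file; defaults to the newest one."""
    if dumpfile is None:
        dumps = sorted(glob.glob(os.path.join(BACKUP_DIR, "ct%s-*.dump" % CTID)))
        if not dumps:
            raise RuntimeError("no checkpoint found for container " + CTID)
        dumpfile = dumps[-1]  # timestamped names sort chronologically
    # 'vzctl restore' undumps the container state and resumes its processes.
    subprocess.check_call(["vzctl", "restore", CTID, "--dumpfile", dumpfile])
    return dumpfile

if __name__ == "__main__":
    print("restored from " + restore())

Because the dump file captures the whole container, running this step on a spare node amounts to the migration described above, provided the backup directory sits on storage shared by the nodes.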
Experimental Results

Several experimental preliminaries had to be set up, including the multi-node computing environment, the convergence criterion of the Message Passing Interface (MPI) matrix computation, and the number of required computing cores. The benchmark used is a Computational Fluid Dynamics (CFD) computing kernel, the Additive Schwarz Preconditioned Conjugate Gradient (ASPCG) solver, which solves the two-dimensional Laplace equation with a preconditioned conjugate gradient method [3], [4]. Cases solved include 4-, 8-, 12-, 16-, 20-, and 32-CPU jobs on a heterogeneous research testbed, whose characteristics are summarized in Table 1. In each test, all of the processes were launched concurrently and run on randomly chosen compute nodes. For the Lite Migration tests, 1024 MB of memory was used.

Table 1. Summary of Environment Characteristics

Resource Name          Nacona              Opteron                         LVM
CPU Model              Intel(R) Xeon(TM)   AMD Opteron(tm) Processor 248   Intel(R) Core(TM)2 CPU Q6600
CPU Clock Rate (GHz)   3.2                 2.2                             2.4
Memory (GB)            4                   4                               2
# of CPUs              16                  16                              11
# of Nodes             8                   8                               11
Job Manager            Torque              Moab                            Maui

The performance of each of these clusters was first measured with the High-Performance Linpack (HPL) benchmark [5] and is listed in Table 2 for reference.

Table 2. High-Performance Linpack Benchmark of NCHC Resources

                              Nacona Cluster   Opteron Cluster   LVM Cluster
Rmax (Gflops)                 46.791424        34.08             60.5
Rpeak (Gflops)                102              70                115.5
Number of Cores               16               16                20
CPU Efficiency (Gflops/CPU)   2.924            2.13              3.025

Figure 7 shows the overhead of using Lite Migration for different matrix sizes: 512 by 512, 1024 by 1024, 2048 by 2048, and 4096 by 4096. The vertical axis is the proportion of overhead (%), and the horizontal axis is the number of CPUs. The overhead of using Lite Migration amounts to less than 9%, especially when solving the largest matrix size, 4096 by 4096. However, these results were not satisfactory, so we adopted the parallel file system GPFS and switched the network channel to InfiniBand to reduce the overhead. As shown in Figure 8, the overhead of using LMT then amounts to less than 3% when solving the 4096 by 4096 matrix.

Figure 7. The Overhead of Using Lite Migration

Figure 8. The Overhead of Using Lite Migration with GPFS and InfiniBand Network

Next, Figure 9 depicts the run times of the migration tests across multiple nodes. The times for dump, restore, and memory copy are almost steady; they do not grow as the problem size scales up.

Figure 9. Run-Times with Migration

Conclusion and Future Work

The proposed Lite Migration toolkit ensures the successful execution of a parallel HPC job both within a cluster and in a Grid or distributed computing environment, and thus improves the reliability of computing resources. Through this work, an important property of Lite Migration was observed: it is best suited to jobs with large problem sizes, for which the overhead of using the toolkit can be kept below 3%. Furthermore, the ability to distribute and balance the workload across multiple clusters and Grid resources has been shown to be beneficial for better resource utilization and improved turn-around time of computing jobs. The contribution of this paper is to avoid a complete restart and to retain the execution of parallel Message Passing Interface (MPI) jobs of all kinds when nodes fail. The Lite Migration toolkit periodically and automatically takes snapshots of the MPI processes on healthy nodes, and migrates the MPI processes of failed nodes onto spare nodes. This solution lets users submit their jobs as usual, without any changes, and it removes any re-queuing overhead by reusing existing resources in a seamless and transparent manner.

References

[1] TORQUE Resource Manager, http://www.clusterresources.com/products/torque-resource-manager.php
[2] OpenVZ Wiki, http://wiki.openvz.org/Main_Page
[3] W. Huang, "Dynamic Computing Power Balancing for Adaptive Mesh Refinement Applications," International Parallel Computational Fluid Dynamics Conference '02, pp. 411-418, Nara, Japan, April 2002.
[4] W. Huang and Y. G. Lai, "A Parallel Implementation of a Multi-Block Three-Dimensional Incompressible Flow Solver on a DSM Machine," Fourth International Conference on Hydroinformatics, Iowa City, Iowa, USA, July 23-27, 2000.
[5] HPL - A Portable Implementation of the High-Performance Linpack Benchmark, http://www.netlib.org/benchmark/hpl/