
IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2010), November 30 – December 3, 2010.
Fault Tolerance for HPC with OpenVZ Virtualization by Lite Migration Toolkit
Chang-Hsing Wu, Hui-Shan Chen, Yi-Lun Pan, Hsi-En Yu, Kuo-Yang Cheng and Weicheng Huang
National Center for High-performance Computing, Taiwan, R.O.C.
{hsing, chwhs, serenapan, yun, KuoYang, whuang}@nchc.org.tw
Abstract
The reliability of large-scale parallel jobs within a cluster, or even across multiple clusters under a Grid or distributed computing environment, is a long-standing issue due to the difficulty of monitoring and managing a large number of compute nodes. To address this issue, a Lite Migration toolkit with a fault tolerance feature has been developed by the Distributed Computing Team at the National Center for High-performance Computing (NCHC). The proposed approach relies on virtualization techniques, exemplified by OpenVZ [2], an open source implementation of container-based virtualization. It automatically and transparently provides fault tolerance to parallel HPC applications: an extremely lightweight mechanism extends the resource manager with checkpoint/restart services, combining virtualization techniques with a cluster queuing system and a load-balancing migration mechanism.

Keywords: Virtualization Techniques, OpenVZ, Fault
Tolerance, HPC.
Research Objective and System Architecture
TORQUE [1] is an open source resource manager
providing control over batch jobs and distributed
compute nodes. It is a community effort based on the
original PBS project. OpenVZ [2] is an open source, container-based virtualization solution built on Linux. By integrating OpenVZ with TORQUE, we have come up with a new and straightforward toolkit, Lite Migration (LMT), to acquire fault tolerance services in an HPC environment. LMT automatically and transparently provides fault tolerance to parallel HPC applications: an extremely lightweight approach extends the resource manager with a checkpoint/restart mechanism, combining virtualization techniques with the cluster queuing system and a load-balancing migration mechanism. LMT has the following features:

Lightweight to Operate & Use – With LMT, users submit jobs as usual and do not need to set up any additional configuration.

Web Monitor & Control – Users can operate the mechanism via a Web-based module.

Checkpoint & Restart – LMT takes snapshots periodically during job execution. Further, jobs can be stopped when their nodes are needed for other purposes; once those purposes are finished, the jobs can be restarted automatically without any problem.

Automatic Recovery & Live Migration – This is the most important feature, because it migrates jobs away from a failing node, so that the computation can continue without much interruption.

The proposed toolkit provides a checkpoint mechanism for MPI tasks. An MPI job under migration is rolled back automatically to a specific checkpoint identified by the user. With LMT, users do not need to worry about possible failures of their jobs during computation, especially in the dynamic environment of the Grid. A job on a failed node is migrated to another node automatically, without restarting the whole job. In order not to change the usual practice of users, the multifarious operations and steps are handled inside the architecture. The system architecture of Lite Migration is sketched in Figure 1.
Figure 1. The System Architecture of the Lite Migration Middleware
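The paper does not include code for this path, but the core idea of the architecture, running the MPI task inside an OpenVZ container so that the container becomes the checkpointable unit, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: vzctl is OpenVZ's real management command, while the container ID and the application command line are hypothetical.

    # Minimal sketch of launching a job inside an OpenVZ container so that
    # the container (and thus the job's whole process tree) is the unit
    # that gets checkpointed. Assumes container 101 already exists; the
    # MPI command line below is hypothetical.
    import subprocess

    CTID = "101"                        # illustrative OpenVZ container ID
    APP = "mpirun -np 4 /opt/aspcg"     # hypothetical MPI command line

    def run(cmd):
        """Run one step of the batch script, aborting on failure."""
        subprocess.run(cmd, shell=True, check=True)

    run(f"vzctl start {CTID}")          # boot the container
    run(f"vzctl exec {CTID} '{APP}'")   # run the MPI task inside it
    run(f"vzctl stop {CTID}")           # shut the container down when done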
Furthermore, users can operate the mechanism through
the designed Web Monitor/Control Module, as shown in
Figure 2.
Figure 2. Web Monitor/Control Module
Challenges and Scenarios
Once a job is submitted, the Lite Migration toolkit takes snapshots periodically during its execution. The snapshots are kept as checkpoints for possible future restorations. Once a failure of a job or node is detected by the VM Monitor/Control Module, the latest or a selected checkpoint is restored, on another node if necessary, to recover the failed instance. The computation thus continues without much interruption. Two scenarios are illustrated to demonstrate the features of the proposed mechanism: (1) nodes fail during job execution, as illustrated in Figure 3, and (2) jobs are stopped because the compute nodes are needed for other purposes, as shown in Figure 4.
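Before walking through the two scenarios, the detect-and-restore cycle can be made concrete with the following sketch. It is an assumption-laden illustration, not the actual VM Monitor/Control Module: pbsnodes, ssh, and vzctl restore are real commands, but the node names, dump path, and polling interval are invented here.

    # Hedged sketch of a detect-and-restore cycle in the spirit of the
    # VM Monitor/Control Module; node names and the dump path are
    # illustrative, and the backup directory is assumed to sit on
    # shared storage (e.g. GPFS) visible to the spare node.
    import subprocess
    import time

    CTID = "101"                         # container hosting the MPI task
    DUMP = "/backup/ckpt/Dump.101"       # latest checkpoint (illustrative)

    def node_is_down(node):
        # 'pbsnodes -l' lists TORQUE nodes that are down or offline.
        out = subprocess.run(["pbsnodes", "-l"],
                             capture_output=True, text=True)
        return node in out.stdout

    while True:
        if node_is_down("node07"):       # hypothetical failing node
            # Recreate the container from its newest snapshot on a
            # healthy node.
            subprocess.run(["ssh", "node08",   # hypothetical spare node
                            "vzctl", "restore", CTID,
                            "--dumpfile", DUMP],
                           check=True)
            break
        time.sleep(30)                   # polling interval (assumed)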
In the first scenario, LMT periodically takes snapshots of the virtual machines during job computation and stores the checkpoint files in a backup directory; the corresponding TORQUE command is shown in Figure 5. Afterwards, users can access the checkpoint backup directory to restore the latest or a specified version of the checkpoint files, and the job can be restarted to continue computing, as shown in Figure 6. Finally, while the job executes, the threads of the job on failed nodes are automatically migrated to healthy ones without stopping the job.
Figure 3. Node Failure during Job Execution
Figure 4. Nodes can be Utilized for Other Purposes
Figure 5. Periodic-checkpoint command of LMT
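Figure 5 itself is not reproduced here. As a hedged illustration of what a periodic checkpoint step could look like, the sketch below drives OpenVZ's staged vzctl chkpnt interface; the backup directory and interval are assumed values, not taken from the paper.

    # Sketch of periodic checkpointing: suspend the container, dump its
    # state into the backup directory, then resume it so the job keeps
    # computing. Directory and interval are assumptions.
    import subprocess
    import time

    CTID = "101"
    BACKUP_DIR = "/backup/ckpt"     # checkpoint backup dir (illustrative)
    INTERVAL = 600                  # seconds between snapshots (assumed)

    while True:
        stamp = time.strftime("%Y%m%d-%H%M%S")
        dumpfile = f"{BACKUP_DIR}/Dump.{CTID}.{stamp}"
        subprocess.run(["vzctl", "chkpnt", CTID, "--suspend"], check=True)
        subprocess.run(["vzctl", "chkpnt", CTID, "--dump",
                        "--dumpfile", dumpfile], check=True)
        subprocess.run(["vzctl", "chkpnt", CTID, "--resume"], check=True)
        time.sleep(INTERVAL)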
The second scenario illustrates the situation in which the cycles allocated to a job are exhausted and the dispatched job is frozen, so that the compute nodes can be utilized for other computing purposes. The checkpoint and migration mechanisms then ensure the continuation of the work on other compute nodes.
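The migration step maps naturally onto OpenVZ's vzmigrate tool. A hedged sketch, in which the destination host name is illustrative:

    # Live-migrate a running container to another host so the job inside
    # it continues after only a brief pause; 'vzmigrate --online' is
    # OpenVZ's live-migration command, while the host name is invented.
    import subprocess

    CTID = "101"
    DEST = "node08"   # hypothetical destination node

    subprocess.run(["vzmigrate", "--online", DEST, CTID], check=True)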
Figure 6. Restoring-snapshot command of LMT
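As with Figure 5, the exact command in Figure 6 is not reproduced here; the restore step presumably builds on OpenVZ's vzctl restore, roughly as follows (the dump-file path is illustrative):

    # Restore a container from a chosen checkpoint file, so the MPI task
    # continues from that point instead of restarting from scratch.
    import subprocess

    CTID = "101"
    DUMP = "/backup/ckpt/Dump.101.20101130-120000"   # chosen version

    subprocess.run(["vzctl", "restore", CTID, "--dumpfile", DUMP],
                   check=True)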
Experimental Results
Some experimental preliminaries need to be set up, including the multi-node computing environment, the matrix to be converged with the Message Passing Interface (MPI), and the number of required computing cores. We use a Computational Fluid Dynamics (CFD) computing kernel, the Additive Schwarz Preconditioned Conjugate Gradient (ASPCG), which solves the two-dimensional Laplace equation with a preconditioned conjugate gradient method [3], [4]. Cases solved include 4-, 8-, 12-, 16-, 20-, and 32-CPU jobs on a heterogeneous research testbed, whose characteristics are summarized in Table 1. In each test, all of the processes were launched concurrently and sent to run on randomly chosen compute nodes. For the Lite Migration tests, 1024 MB of memory was used.
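For reference, the kernel's target problem is the two-dimensional Laplace equation. The paper does not spell out the discretization, but on a uniform grid the standard five-point finite-difference scheme gives

\[
\nabla^2 u \;=\; \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \;=\; 0,
\qquad
u_{i,j} \;=\; \tfrac{1}{4}\bigl(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}\bigr),
\]

and it is the resulting sparse linear system that the ASPCG kernel iterates on to convergence.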
Table 1. Summary of Environment Characteristics

Resource Name          Nacona               Opteron                         LVM
CPU Model              Intel(R) Xeon(TM)    AMD Opteron(tm) Processor 248   Intel(R) Core(TM)2 CPU Q6600
CPU Clock Rate (GHz)   3.2                  2.2                             2.4
Memory (GB)            4                    4                               2
# of CPUs              16                   16                              11
# of Nodes             8                    8                               11
Job Manager            Torque               Moab                            Maui

The performance of each of these clusters was first measured using the High-Performance Linpack Benchmark (HPL) [5] and is listed in Table 2 for reference.
Table 2. High-Performance Linpack Benchmark of NCHC Resources

High-Performance Linpack Benchmark   Nacona Cluster   Opteron Cluster   LVM Cluster
Rmax (Gflops)                        46.791424        34.08             60.5
Rpeak (Gflops)                       102              70                115.5
Number of Cores                      16               16                20
CPU Efficiency (Gflops/CPU)          2.924            2.13              3.025
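The CPU efficiency row follows directly from the two rows above it: Rmax divided by the number of cores, e.g. 46.791424 / 16 ≈ 2.924 Gflops/CPU for the Nacona cluster.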
Figure 7 shows the overhead of using Lite Migration for different matrix sizes: 512 by 512, 1024 by 1024, 2048 by 2048, and 4096 by 4096. The vertical axis is the proportion of overhead (%), and the horizontal axis is the number of CPUs. As shown in Figure 7, the overhead of using Lite Migration amounts to less than 9%, even when solving the largest matrix size, 4096 by 4096. However, these results were not satisfactory. Hence, we adopted the GPFS parallel file system and switched the network channel to InfiniBand to reduce the overhead. As shown in Figure 8, the overhead of using LMT then notably amounts to less than 3% when solving the 4096 by 4096 matrix.
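The overhead metric is not defined explicitly in the text; a natural reading, stated here as an assumption, is the relative run-time increase with LMT enabled:

\[
\text{overhead (\%)} \;=\; \frac{T_{\mathrm{LMT}} - T_{\mathrm{plain}}}{T_{\mathrm{plain}}} \times 100,
\]

where $T_{\mathrm{LMT}}$ is the run-time with periodic checkpointing and $T_{\mathrm{plain}}$ is the run-time of the same job without it.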
Figure 7. The Overhead of Using Lite Migration
Figure 8. The Overhead of Using Lite Migration with GPFS and InfiniBand Network

Next, Figure 9 depicts the run-times of the migration tests on multiple nodes. Notably, the times for dump, restore, and memory copy are almost constant; they do not grow as the problem size scales up.

Figure 9. Run-Times with Migration
Conclusion and Future Work
The proposed Lite Migration toolkit can ensure the successful execution of a parallel HPC job both within a cluster and under a Grid or distributed computing environment, and thus improve the reliability of computing resources. Through this work, an important property of Lite Migration is observed: it is more appropriate for jobs with large problem sizes, for which the overhead amounts to less than 3%. Furthermore, the
ability to distribute and balance the workload across
multiple clusters and Grid resources has been shown to
be beneficial for better resource utilization along with
improved turn-around time of computing jobs.
The contribution of this paper is to avoid a complete restart and to retain the execution of parallel and all kinds of Message Passing Interface (MPI) jobs when nodes fail. The Lite Migration toolkit takes snapshots of the MPI processes on healthy nodes periodically and automatically, and migrates the MPI processes of failed nodes onto spare nodes. This solution lets users submit their jobs as usual without any changes. Moreover, it also
removes any re-queuing overhead by reusing existing
resources in a seamless and transparent manner.
References
[1] TORQUE Resource Manager, http://www.clusterresources.com/products/torque-resource-manager.php
[2] OpenVZ, http://wiki.openvz.org/Main_Page
[3] W. Huang, "Dynamic Computing Power Balancing for Adaptive Mesh Refinement Applications," International Parallel Computational Fluid Dynamics Conference '02, pp. 411-418, Nara, Japan, April 2002.
[4] W. Huang and Y.G. Lai, "A Parallel Implementation of a Multi-Block Three-Dimensional Incompressible Flow Solver on a DSM Machine," Fourth International Conference on Hydroinformatics, Iowa City, Iowa, USA, July 23-27, 2000.
[5] High-Performance Linpack Benchmark (HPL), http://www.netlib.org/benchmark/hpl/