Notes for slides

Scientists around the globe carry out experiments that generate petabytes of data that
have to be analyzed. These analyses, coupled with the enormous volume of data, call for
fast, efficient and transparent delivery of large data sets and effective resource sharing to
expedite the process. This sets the stage for grid computing, which enables organizations
to coordinate, collaborate and share data and resources to achieve common goals. The grid
term for this collaboration is a "Virtual Organisation" (VO): a
group of organizations working together to achieve common goals. Grid computing is
being used in many physics projects such as DØ, iVDGL, PPDG and GriPhyN.
In this paper, we have taken the DØ grid as an example and studied its architecture and
structure. Clusters are the favored job sites with the head node having the grid package
installed on it as a gateway between the cluster and the grid. Rendering highly available
services is increasingly important as critical applications shift to grid systems.
Failures caused by hardware faults, software faults and human error can occur at any time.
Though the nature of grid computing is distributed, inevitable errors can make a site (a
member of the VO, e.g. a computational cluster resource of that VO) unusable, thus
reducing the number of resources available and, in turn, slowing down the overall speed
of computation.
4. ...
Efforts need to concentrate on making critical systems highly available and improving
fault tolerance by removing Single Point of Failure (SPoF) components, such as the
single head node, thus reducing potential overall system downtime.
[Figure: Job submission using the grid (for reference). Labels: "Start and monitor jobs, manage datasets, return results" (a.k.a. entry point); "Job Manager" (submits and monitors jobs).]
The grid exacerbates this vulnerability because a SPoF can affect an entire site:
for example, an entire site may be lost due to the failure of a single head node (as it runs
critical grid processes such as the gatekeeper). Hence, there is a need to focus on the
high-availability aspect of the grid and thereby eliminate the SPoFs to minimize downtime.
"Smart Failover" feature tries to make the HA-OSCAR failover graceful in terms of job
5. Traditional intra site cluster configuration
The site-manager is the head node of a cluster having grid services running on it. It runs
critical services such as the gatekeeper and gridFTP, as well as the Globus interface to the cluster
scheduler used to submit jobs to the cluster through the grid. The site-manager is critical to
the site being used to its full potential. Failure of the site-manager means that
the VO (of which the site is a part) loses a critical resource until recovery.
The recovery time depends on the type of failure and could range from a
simple reboot to the replacement of a failed hardware component.
These failures are non-periodic and unpredictable, so measures should be taken to
ensure high availability of a site; hence the proposed architecture.
6. Critical Service Monitoring & Failover-Failback capability for site-manager
The head node runs critical grid services such as the gatekeeper (the entry point to the grid, as it
authenticates and authorizes remote users) and gridFTP, as well as services such as PBS and NFS.
We monitor these critical services using HA-OSCAR’s service monitoring core and
based on the policy based framework we decide whether to restart the failed service or
7. Proposed Framework
Most task-level (Grid Workflow) fault-tolerance techniques attempt to restart the job on
alternative resources in the grid in the event of a host crash. This means a significant
decrease in computational resources. Moreover, the downtime of the site-manager can
range from a few minutes to a few hours or even days depending on the severity of the
problem, leaving the otherwise healthy compute nodes of that site unused.
While most efforts have concentrated on task-level fault tolerance, there is a dearth of
fault detection and recovery services for critical grid services.
The OS forms the lowest layer; each layer is used by the layers above it.
Cluster management environments, such as ROCKS [14] and OSCAR, form the next
layer. The next layer in the hierarchy is the grid layer that sets up a computation grid. The
Globus Toolkit is used as the grid enabling technology and it includes the grid services
and daemons like gatekeeper, gridFTP, MDS etc.
The fourth layer is divided into two parts, namely the HA-OSCAR service monitoring
and the HA-OSCAR policy based recovery mechanism. The service monitoring sub-layer
keeps track of the status of critical grid services such as the gatekeeper and gridFTP, along with
other critical services such as NFS, PBS and SGE. Depending on the status of
the grid services, an appropriate action is triggered as specified in the policy framework.
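The split between service monitoring and policy-based recovery can be sketched as below. This is a minimal illustration in Python, not HA-OSCAR's actual implementation: the `POLICIES` table, the `max_restarts` threshold, the choice between restart and failover, and the function name `decide_action` are all hypothetical, chosen only to show how a monitored status could be turned into an action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # Hypothetical knob: local restarts to attempt before failing over.
    max_restarts: int


# Illustrative policy table; service names follow the notes, values are made up.
POLICIES = {
    "gatekeeper": Policy(max_restarts=2),
    "gridftp": Policy(max_restarts=2),
    "pbs": Policy(max_restarts=1),
    "nfs": Policy(max_restarts=0),
}


def decide_action(service: str, is_running: bool, restarts_so_far: int) -> str:
    """Return 'none', 'restart' or 'failover' for one monitored service."""
    if is_running:
        return "none"
    # Services without an explicit policy are treated conservatively.
    policy = POLICIES.get(service, Policy(max_restarts=0))
    if restarts_so_far < policy.max_restarts:
        return "restart"
    return "failover"
```

Under these assumed values, a gatekeeper that has already been restarted twice would trigger a failover, while a first failure would only trigger a local restart.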
9. …
globus-url-copy is a Globus command that uses the gridFTP protocol to transfer files from
one location to another. It also supports parallel-channel file transfer.
e.g. globus-url-copy gsiftp://src file://dest (copies a remote file to local disk)
globus-job-run is an interactive job submission mechanism.
e.g. globus-job-run hostdest path_to_executable
globus-job-run /home/gt3/hello
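For scripted use, the two commands above can be assembled as argument lists. The sketch below only builds the command lines and does not run them, so it works without a Globus installation. The helper names and the example host and paths are hypothetical; `-p` is the globus-url-copy option for the number of parallel data channels.

```python
import shlex
from typing import List, Optional


def build_url_copy(src_url: str, dest_url: str,
                   parallel_streams: Optional[int] = None) -> List[str]:
    """Build the argument list for a globus-url-copy transfer."""
    cmd = ["globus-url-copy"]
    if parallel_streams is not None:
        # -p requests this many parallel data channels for the transfer.
        cmd += ["-p", str(parallel_streams)]
    cmd += [src_url, dest_url]
    return cmd


def build_job_run(host: str, executable: str, *args: str) -> List[str]:
    """Build the argument list for an interactive globus-job-run submission."""
    return ["globus-job-run", host, executable, *args]


if __name__ == "__main__":
    # Hypothetical host and paths, for illustration only.
    copy_cmd = build_url_copy("gsiftp://headnode.example.org/data/run1.dat",
                              "file:///tmp/run1.dat", parallel_streams=4)
    job_cmd = build_job_run("headnode.example.org", "/home/gt3/hello")
    print(shlex.join(copy_cmd))
    print(shlex.join(job_cmd))
```

On a machine with the Globus client tools and a valid proxy credential, either list could be passed to `subprocess.run`.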
10. Smart Failover Framework.
The current active/hot-standby model in HA-OSCAR provides an excellent solution for
stateless services, where the transition from the primary head node to the backup is
executed smoothly. However, this mechanism is not graceful if stateful services, such as
job management, are involved. The “Smart Failover” feature in HA-OSCAR tries to achieve
graceful failover by monitoring the job queue and propagating changes in it to the
backup of the primary head node.
The framework consists of three components: the event monitor, the job monitor and the backup
updater. Critical system events, such as repeated service failures, memory leaks and
system overload, are analyzed by the event monitor using the HA-OSCAR monitoring
core. The job monitor is a daemon that checks the job
queues at a user-specified interval. It may also be triggered by the event monitor in case
of a critical event. Whenever the job monitor senses a change in the job queues, it
invokes the backup updater to synchronize the backup server with the changes in the job
queue and other critical directories.
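The interaction between the job monitor and the backup updater can be sketched as follows. This is an illustrative model, not the HA-OSCAR code: the class name `JobQueueMonitor`, the snapshot format (job ID mapped to state) and the two callbacks are assumptions made for the example.

```python
from typing import Callable, Dict, Optional

# Assumed snapshot format: job ID -> state, e.g. "queued" or "running".
QueueSnapshot = Dict[str, str]


class JobQueueMonitor:
    """Detects job queue changes and hands them to a backup updater."""

    def __init__(self,
                 read_queue: Callable[[], QueueSnapshot],
                 update_backup: Callable[[QueueSnapshot], None]) -> None:
        self._read_queue = read_queue        # e.g. wraps a scheduler query
        self._update_backup = update_backup  # e.g. syncs the backup server
        self._last: Optional[QueueSnapshot] = None

    def poll(self) -> bool:
        """Check the queue once; sync the backup only if it changed."""
        current = self._read_queue()
        if current == self._last:
            return False
        self._update_backup(current)
        self._last = dict(current)
        return True
```

A daemon would call `poll()` in a loop, sleeping for the user-specified interval between calls, and the event monitor could call `poll()` directly when it sees a critical event.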
The mapping between the Globus-assigned job ID and the scheduler-assigned job ID is the key
data structure for transparent head node failover and job restart on an HA-OSCAR cluster
for jobs submitted through the grid.
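A minimal sketch of such a mapping is shown below, assuming it is kept as a dictionary and serialized so the backup updater can copy it to the standby node. The class name `JobIdMap`, the JSON format and the example IDs are hypothetical; the notes do not specify how the mapping is stored.

```python
import json
from typing import Dict, Optional


class JobIdMap:
    """Maps Globus-assigned job IDs to scheduler-assigned job IDs."""

    def __init__(self, entries: Optional[Dict[str, str]] = None) -> None:
        self._by_globus_id: Dict[str, str] = dict(entries or {})

    def record(self, globus_id: str, scheduler_id: str) -> None:
        # Also used after a restart, when the scheduler issues a new ID
        # for a job that keeps its original Globus ID.
        self._by_globus_id[globus_id] = scheduler_id

    def scheduler_id(self, globus_id: str) -> Optional[str]:
        return self._by_globus_id.get(globus_id)

    def to_json(self) -> str:
        # Serialized form that could be copied to the backup head node.
        return json.dumps(self._by_globus_id, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "JobIdMap":
        return cls(json.loads(text))
```

With a structure like this, a client that only knows the Globus job ID could still locate its job after the backup restarts it under a new scheduler ID.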
12. Experiment
The head node was running the Red Hat Linux 9 operating system. OSCAR 3.0 was used to
build the cluster and set up the environment between the head node and the 3 clients. We
installed Globus 3.2 on the head node, along with its interface to the OpenPBS jobmanager.
We later installed HA-OSCAR 1.0 on the head node, which created the backup.
HA-OSCAR handles the re-establishment of NFS between the backup and the clients
after the backup takes over. The job queue monitor and backup updater were running on
the head node, periodically updating the backup with the critical directories and the mapping
from Globus jobID to scheduler-assigned jobID (PBS in our case). The failover-aware
client, written in Python using PyGlobus, would submit MPI jobs to the PBS
14. Time needed to complete jobs with/without “Smart Failover”
The figure gives the total time needed for jobs submitted through scheduler primitives (not
through the grid) to run with and without “smart failover”. Here we have assumed that
there is no checkpoint support for the scheduler. The Mean Time to Repair (MTTR) is the
time needed for the head node to be back to normal health and this period can be small or
large based on the cause of failure. We have varied it from two minutes (basic reboot
period) to two hours (e.g. needed for hardware replacement). If we do not use the smart
failover feature, the scheduler queue restarts after reboot, but we lose the jobs
in the “running” state at the time of failure. This is undesirable, as there could be
multiple jobs in the “running” state. The time taken to complete the last running jobs (TLR) is a
critical factor when we evaluate the total time needed to complete jobs with the “smart
failover” feature. As specified earlier, all running jobs on the primary have “queued”
status on the backup. Whenever there is a failover, these jobs are restarted from scratch, so we do
not lose any of the running jobs.
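The comparison can be made concrete with a simple timing model. The sketch below is an illustration under stated assumptions, not the model behind the figure: it assumes that without smart failover the lost running jobs are resubmitted by hand once the head node is repaired, that rerunning them takes TLR, and that failover to the backup takes a short, fixed time. The function names and all input values are hypothetical.

```python
def total_time_without_failover(t_fail: float, mttr: float,
                                tlr: float, t_remaining: float) -> float:
    """Total minutes when the site waits for the head node to be repaired.

    Assumes the running jobs lost at failure are resubmitted after repair.
    """
    return t_fail + mttr + tlr + t_remaining


def total_time_with_failover(t_fail: float, t_failover: float,
                             tlr: float, t_remaining: float) -> float:
    """Total minutes when the backup takes over and reruns the running jobs."""
    return t_fail + t_failover + tlr + t_remaining


if __name__ == "__main__":
    # Hypothetical inputs, in minutes: failure 10 minutes in, 30 minutes to
    # rerun the interrupted jobs, 60 minutes of queued work, 1 minute failover.
    for mttr in (2, 120):
        without = total_time_without_failover(10, mttr, 30, 60)
        with_sf = total_time_with_failover(10, 1, 30, 60)
        print(f"MTTR={mttr} min: without={without} min, with={with_sf} min")
```

Under these assumptions the two cases differ only by MTTR minus the failover time, so the benefit of smart failover grows with the repair time and is small when a quick reboot is enough.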