Notes for slides

3. Introduction
Scientists around the globe carry out experiments that generate petabytes of data that have to be analyzed. These analyses, coupled with the enormous amount of data, call for fast, efficient and transparent delivery of large data sets and effective resource sharing to expedite the process. This sets the stage for grid computing, which enables organizations to coordinate, collaborate and share data and resources to achieve common goals. In grid terminology this collaboration is called a "Virtual Organisation" (VO): a group of organizations working together to achieve common goals. Grid computing is being used in many physics projects such as DØ, iVDGL, PPDG and GriPhyN. In this paper, we have taken the DØ grid as an example and studied its architecture and structure. Clusters are the favored job sites, with the head node having the grid package installed on it as a gateway between the cluster and the grid. Providing highly available services is increasingly important as critical applications shift to grid systems. Failures can occur at any time because of hardware, software or human error. Although grid computing is distributed by nature, such failures can make a site (a member of the VO, e.g. a computational cluster resource of that VO) unusable, reducing the number of available resources and, in turn, slowing down the overall computation.

4. ...
Efforts need to concentrate on making critical systems highly available and on improving fault tolerance by removing Single Point of Failure (SPoF) components, such as a single head node, thus reducing potential overall system downtime.
[Figure: Job submission using the grid (for reference). The client starts and monitors jobs, manages datasets and receives results through GRAM-based submission. The gatekeeper (entry point) authenticates and authorizes; the job manager submits and monitors the job through the local resource manager (PBS, Condor, fork), which runs processes on the compute nodes. Output and error logs return to the client.]

The grid exacerbates this vulnerability, because a SPoF can affect an entire site: for example, a whole site may be lost due to the failure of a single head node, since it hosts critical grid processes such as the gatekeeper. Hence, there is a need to focus on the high-availability aspect of the grid and eliminate the SPoFs to minimize downtime. The "Smart Failover" feature tries to make the HA-OSCAR failover graceful in terms of job management.

5. Traditional intra site cluster configuration
The site-manager is the head node of a cluster that has grid services running on it. It hosts critical services such as the gatekeeper and gridFTP, as well as the Globus interface to the cluster scheduler, which is used to submit jobs to the cluster through the grid. The site-manager is critical if the site is to be used to its full potential. Failure of the site-manager means that the VO of which the site is a part loses a critical resource until recovery. The recovery time depends on the type of failure, which could range from a fault fixed by a simple reboot to a hardware component failure. These failures are non-periodic and unpredictable, so measures should be taken to ensure high availability of a site. Hence the proposed architecture.

6. Critical Service Monitoring & Failover-Failback capability for site-manager
The head node runs critical grid services such as the gatekeeper (the entry point to the grid, as it authenticates and authorizes remote users) and gridFTP, as well as services such as PBS, NFS etc.
We monitor these critical services using HA-OSCAR’s service monitoring core and, based on the policy-based framework, decide whether to restart the failed service or fail over.

7. Proposed Framework
Most task-level (Grid Workflow) fault-tolerance techniques attempt to restart the job on alternative resources in the grid in the event of a host crash. This means a significant decrease in computational resources. Moreover, the downtime of the site-manager can be anywhere from a few minutes to a few hours or even days, depending on the severity of the problem, leaving the otherwise healthy compute nodes of that site unused. While most efforts have concentrated on task-level fault tolerance, there is a dearth of fault detection and recovery services for critical grid services.

The OS forms the lowest layer, as each lower layer is used by the layers above it. Cluster management environments, such as ROCKS and OSCAR, form the next layer. The next layer in the hierarchy is the grid layer, which sets up a computational grid. The Globus Toolkit is used as the grid-enabling technology; it includes grid services and daemons such as the gatekeeper, gridFTP and MDS. The fourth layer is divided into two parts: the HA-OSCAR service monitoring and the HA-OSCAR policy-based recovery mechanism. The service monitoring sub-layer keeps track of the status of critical grid services such as the gatekeeper and gridFTP, along with other critical services such as NFS, PBS and SGE. Depending on the status of the grid services, an appropriate action is triggered as specified in the policy framework of HA-OSCAR.

9. …
globus-url-copy is a Globus command that uses the gridFTP protocol to transfer files from one location to another. It also supports parallel channel file transfer. For example, to copy a remote file to local disk:
    globus-url-copy gsiftp://src file://dest
globus-job-run is an interactive job submission mechanism.
e.g.
    globus-job-run hostdest path_to_executable
    globus-job-run oscar.cenit.latech.edu:/jobmanager-pbs /home/gt3/hello

10. Smart Failover Framework
The current active/hot-standby model in HA-OSCAR provides an excellent solution for stateless services, where the transition from the primary head node to the backup is executed smoothly. However, this mechanism is not graceful when stateful services, such as job management, are involved. The “Smart Failover” feature in HA-OSCAR tries to achieve graceful failover by monitoring the job queue and propagating its changes to the backup of the primary head node. The framework consists of three components: the event monitor, the job monitor and the backup updater. Critical system events, such as repeated service failures, memory leaks and system overload, are analyzed by the event monitor using the HA-OSCAR monitoring core. The second component, the job monitor, is a daemon that checks the job queues periodically at a user-specified interval. It may also be triggered by the event monitor in case of a critical event. Whenever the job monitor senses a change in the job queues, it invokes the backup updater to synchronize the backup server with the changes in the job queue and other critical directories. The mapping between the Globus-assigned job ID and the scheduler-assigned job ID is the key data structure for transparent head node failover and job restart on an HA-OSCAR cluster for jobs submitted through the grid.

12. Experiment
The head node was running the Red Hat 9 operating system. OSCAR 3.0 was used to build the cluster and set up the environment between the head node and the three clients. We installed Globus 3.2 on the head node; its interface to the OpenPBS jobmanager was also installed. We later installed HA-OSCAR 1.0 on the head node, and it created the backup. HA-OSCAR handles the re-establishment of NFS between the backup and the clients after the backup takes over.
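As a rough illustration of the job monitor and backup updater described under slide 10, the Python sketch below polls a queue snapshot, detects changes, and hands the queue state together with the Globus-to-scheduler job ID mapping to an updater. It is not the HA-OSCAR implementation: the class name `JobQueueMonitor`, the callables `read_queue` and `update_backup`, and the JSON payload layout are all hypothetical stand-ins for whatever queries the scheduler and copies state to the standby head node.

```python
import json
import time
from typing import Callable, Dict, Optional


class JobQueueMonitor:
    """Hypothetical sketch: poll the job queue and sync changes to a backup."""

    def __init__(
        self,
        read_queue: Callable[[], Dict[str, str]],   # scheduler job id -> state (assumed shape)
        update_backup: Callable[[str], None],       # stand-in for the backup updater
        interval_s: float = 30.0,                   # user-specified polling interval
    ) -> None:
        self.read_queue = read_queue
        self.update_backup = update_backup
        self.interval_s = interval_s
        self.id_map: Dict[str, str] = {}            # Globus job id -> scheduler job id
        self._last_payload: Optional[str] = None

    def register(self, globus_id: str, scheduler_id: str) -> None:
        """Record which scheduler job corresponds to a grid-submitted job."""
        self.id_map[globus_id] = scheduler_id

    def poll_once(self) -> bool:
        """Check the queue once; sync to the backup only if something changed.

        An event monitor could also call this directly on a critical event.
        """
        payload = json.dumps(
            {"queue": self.read_queue(), "id_map": self.id_map}, sort_keys=True
        )
        if payload == self._last_payload:
            return False
        self.update_backup(payload)
        self._last_payload = payload
        return True

    def run(self, cycles: int) -> None:
        """Poll a fixed number of times, sleeping between polls."""
        for _ in range(cycles):
            self.poll_once()
            time.sleep(self.interval_s)


if __name__ == "__main__":
    queue = {"101.head": "R"}                       # made-up scheduler job id and state
    synced = []
    monitor = JobQueueMonitor(lambda: queue, synced.append, interval_s=0)
    monitor.register("globus-job-1", "101.head")    # made-up Globus job id
    monitor.run(cycles=2)                           # second poll sees no change
    queue["102.head"] = "Q"
    monitor.poll_once()
    print(f"backup updated {len(synced)} time(s)")
```

Comparing serialized snapshots keeps the change detection independent of how the queue is read, so the same loop would work whether the snapshot comes from scheduler output or from a spool directory.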
The job queue monitor and backup updater were running on the head node, periodically updating the backup with the critical directories and the mapping from Globus job ID to scheduler-assigned job ID (PBS in our case). The failover-aware client, written in Python using PyGlobus, submitted MPI jobs to the PBS scheduler.

14. Time needed to complete jobs with/without “Smart Failover”
The figure gives the total time needed for jobs submitted through scheduler primitives (not through the grid) to run with and without “smart failover”. Here we have assumed that the scheduler has no checkpoint support. The Mean Time to Repair (MTTR) is the time needed for the head node to return to normal health; this period can be short or long depending on the cause of failure. We have varied it from two minutes (a basic reboot) to two hours (e.g. for hardware replacement). If we do not use the smart failover feature, the scheduler queue restarts after the reboot, but we lose the jobs that were in the “running” state at the time of failure. This is undesirable, as there could be multiple jobs in the “running” state. The time taken to complete the last running jobs (TLR) is a critical factor when we evaluate the total time needed to complete jobs with the “smart failover” feature. As specified earlier, all running jobs on the primary have “queued” status on the backup. Whenever there is a failover, these jobs are restarted from scratch, so none of the running jobs is lost.
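To make the roles of MTTR and TLR concrete, here is a deliberately simple Python model of the total completion time after a single head-node failure. It is an illustrative assumption, not the model behind the figure: it treats the total as time before the failure, plus the outage, plus TLR, plus the time for the still-queued jobs. Without smart failover the outage is taken to be the MTTR, and the lost running jobs are assumed to be resubmitted by hand and rerun from scratch. With smart failover the outage is a short failover time. The function names and all default values (including the half-minute failover time) are hypothetical.

```python
def completion_time(elapsed_min, outage_min, tlr_min, queued_min):
    """Total wall-clock minutes to finish all jobs after one head-node failure.

    elapsed_min: time already spent before the failure
    outage_min:  time during which no head node can schedule jobs
    tlr_min:     time to rerun, from scratch, the jobs that were running (TLR)
    queued_min:  time to run the jobs that were still queued
    """
    return elapsed_min + outage_min + tlr_min + queued_min


def compare(mttr_min, failover_min=0.5, elapsed_min=10, tlr_min=20, queued_min=60):
    """Return (without_smart_failover, with_smart_failover) in minutes.

    All defaults are made-up illustration values, not measurements.
    """
    # Assumption: without smart failover the site waits out the full MTTR and
    # the lost running jobs are resubmitted manually afterwards.
    without = completion_time(elapsed_min, mttr_min, tlr_min, queued_min)
    # Assumption: with smart failover the backup takes over after failover_min
    # and restarts the previously running jobs itself.
    with_sf = completion_time(elapsed_min, failover_min, tlr_min, queued_min)
    return without, with_sf


if __name__ == "__main__":
    # MTTR range taken from the notes: two minutes up to two hours.
    for mttr in (2, 30, 120):
        without, with_sf = compare(mttr)
        print(f"MTTR {mttr:>3} min: without={without} min, with={with_sf} min")
```

Under these assumptions the gap between the two cases is simply the MTTR minus the failover time, so the benefit grows with the severity of the failure while TLR is paid in both cases.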