International Journal of Engineering Trends and Technology (IJETT) - Volume 4 Issue 8 - August 2013
ISSN: 2231-5381

Improvement of Fault Tolerance Using Checkpoint Optimization Technique in Grid Computing Environment

Sumant Jain1, Jyoti Choudhary2
1 PG Student, CSE Department, M.D.U. Rohtak, Haryana, India
2 Assistant Professor, IT Department, M.D.U. Rohtak, Haryana, India

Abstract - A grid is an association of computer resources from several administrative domains that work toward a common goal, with the origin of the service abstracted from the user. Fault tolerance is an important property in grid computing because the dependability of individual grid resources may not be guaranteed. A fault-tolerant approach is useful for preventing a malicious node from degrading the overall performance of the application. In this paper, grid computing and related work are discussed.

Keywords - GARA, physical faults, network faults.

I. INTRODUCTION

Grid computing, or the computational grid, is a vast research field in both academia and industry. A grid is an association of computer resources from several administrative domains that work toward a common goal, with the origin of the service abstracted from the user [2]. A grid computing infrastructure is the hardware and software that provides reliable, consistent, pervasive and inexpensive access to high-level computational capabilities [1].

Grid computing is a distributed computing model in which geographically distributed resources cooperate to solve large problems, and it provides easy access to heterogeneous resources that are geographically dispersed. Grid resources are heterogeneous, belong to different organizations and locations with different access policies, and have inherently dynamic workloads; even so, using the grid for sharing, selecting and aggregating computing resources has become popular. The main goal of a grid is to provide services with high reliability and the lowest cost to large numbers of users and to support group work. The most important issues in grid computing are resource management and control, reliability and security.

Today, increasing the efficiency of the grid is an important issue, and doing so requires proper and effective scheduling. Unfortunately, the dynamic nature of grid resources and the demands of different users make grid scheduling complex [3]. The scheduling process in a grid can be organized in three stages: resource discovery; resource selection and scheduling based on clear goals; and, last, assignment of the request to the most appropriate resource. To achieve the full potential of the grid environment, grid scheduling must be performed effectively. Grid scheduling is the process of making scheduling decisions involving resources over multiple administrative domains. This process can consist of searching multiple administrative domains to use a single machine, or scheduling a single job to use multiple resources at a single site or multiple sites [8].

II.

Fault tolerance is an important property in grid computing because the dependability of individual grid resources may not be guaranteed. In many cases, an organization may send out jobs for remote execution on resources upon which no trust can be placed; for example, the resources may be outside its organizational boundaries, or may be shared by different users at the same time. A fault-tolerant approach may therefore be useful for preventing a malicious node from affecting the overall performance of the application. As applications scale to take advantage of grid resources, their size and complexity will increase dramatically. A major challenge in a dynamic grid with thousands of interconnected nodes is fault tolerance: the more resources and components are involved, the more complicated and error-prone the system becomes.
To compare fault tolerance mechanisms, we first need to distinguish between faults, errors and failures. A fault is a violation of a system's basic assumptions. An error is an internal data state that reflects a fault. A failure is an externally visible deviation from specifications. In reality, a fault need not result in an error, nor an error in a failure. Different types of faults, classified according to several factors, are listed below:

1. Physical faults: faulty storage, faulty CPUs, faulty memory.
2. Unconditional termination: typically, the user pressed Ctrl+C.
3. Network faults: packet corruption, faults due to network partition, packet loss.
4. Lifecycle faults: legacy or versioning faults.
5. Processor faults: machine or operating system crashes.
6. Media faults: disk head crashes.
7. Service expiry fault: the service time of a resource may expire while the application is using the resource in the grid.
8. Process faults: software bugs, resource scarcity.
9. Interaction faults: timing overhead, protocol incompatibilities, security incompatibilities, policy problems [4].

Many grid applications require heterogeneous resources that are independently controlled or administered. Resources belonging to different administrative domains do not share their schedules. If a user's application needs to access more than one resource simultaneously, the user must either arrange this through the domain administrators or submit the tasks of the job to the queues of the different resources without any guarantee that all resources will be available at the same time. To address this problem, advance reservations (ARs) were introduced as part of the Globus Architecture for Reservation and Allocation (GARA) [5].
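
The co-allocation problem described above can be made concrete with a small sketch. It assumes, purely for illustration, that each resource exposes a list of already reserved time intervals, and it searches for the earliest window in which every resource is free. The function names, the interval representation and the resource names are hypothetical; this is not GARA's interface.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a reservation is a half-open interval [start, end) in arbitrary time units.
Interval = Tuple[int, int]


def is_free(busy: List[Interval], start: int, end: int) -> bool:
    """Return True if [start, end) overlaps none of the busy intervals."""
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)


def earliest_common_slot(
    schedules: Dict[str, List[Interval]], duration: int, horizon: int
) -> Optional[int]:
    """Earliest start at which every resource is free for `duration` time units.

    Returns None if no such window ends at or before `horizon`.
    """
    # The earliest feasible start is either time 0 or the end of some
    # existing reservation, so only those instants need to be tried.
    candidates = {0}
    for busy in schedules.values():
        candidates.update(b_end for _, b_end in busy)

    for start in sorted(candidates):
        if start + duration > horizon:
            break
        if all(is_free(busy, start, start + duration) for busy in schedules.values()):
            return start
    return None


# Hypothetical schedules for two resources in different domains.
schedules = {"cluster_a": [(0, 4), (10, 12)], "cluster_b": [(2, 6)]}
print(earliest_common_slot(schedules, duration=3, horizon=24))  # 6
```

Without shared schedules, a user has no way to run such a search across domains, which is the gap that advance reservations are meant to close.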
Reserving resources in advance for a specific future time ensures that all resources will be simultaneously available when the application executes. Because reserving resources in advance gives an upper bound on the response time, ARs can also be used to ensure end-to-end quality of service. For jobs with sequential tasks, the response time of the first resource in the sequence can become the start time of the reservation for the second resource, and so on, thus guaranteeing the end-to-end response time.

III. RELATED WORK

Leili Mohammad Khanli and Maryam discussed a strategy named Reliable Job Scheduler using RFOH in Grid Computing. This strategy maintains the fault occurrence history of resources. Whenever a resource broker has jobs to schedule, it finds the optimal resources using fault occurrence and response time. It does not break resource failure down into separate aspects such as processor, memory and bandwidth. In our work we consider these different aspects of resource failure, which leads to optimal resource utilization [6].

Two major problems that are critical to the effective utilization of computational resources are efficient job scheduling and reliable fault tolerance. The authors of [7] address these problems by combining a checkpoint-replication-based fault tolerance mechanism with the Minimum Total Time to Release (MTTR) job scheduling algorithm. The TTR includes the service time of the job, the waiting time in the queue, and the transfer of input and output data to and from the resource. The MTTR algorithm minimizes the TTR by selecting a computational resource based on job requirements, job characteristics and hardware features of the resources. The fault tolerance system sets job checkpoints based on the resource failure rate, and a Replica Resource Selection Algorithm (RRSA) is proposed to provide the Checkpoint Replication Service (CRS).
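
The TTR-based selection just described can be sketched as follows. The sketch assumes that the TTR is simply the sum of the three components named above and that an estimate of each is already available per resource; the class, field and function names are hypothetical and are not taken from [7].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceEstimate:
    """Per-resource estimates for one job (all fields are assumed inputs)."""
    name: str
    service_time: float   # estimated execution time of the job on this resource
    queue_wait: float     # estimated waiting time in the resource's queue
    transfer_time: float  # estimated input plus output data transfer time


def total_time_to_release(r: ResourceEstimate) -> float:
    # Assumption: TTR is the plain sum of its three components.
    return r.service_time + r.queue_wait + r.transfer_time


def select_min_ttr(resources: List[ResourceEstimate]) -> ResourceEstimate:
    """Pick the resource with the smallest estimated TTR."""
    if not resources:
        raise ValueError("no candidate resources")
    return min(resources, key=total_time_to_release)


candidates = [
    ResourceEstimate("r1", service_time=100.0, queue_wait=20.0, transfer_time=5.0),
    ResourceEstimate("r2", service_time=80.0, queue_wait=30.0, transfer_time=10.0),
]
print(select_min_ttr(candidates).name)  # r2 (TTR 120.0 versus 125.0)
```

In this reading, a resource with a faster processor can still lose to a slower one if its queue or data transfer dominates the total.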
In [7], the Globus Toolkit is used as the grid middleware to set up a grid environment and evaluate the performance of the proposed approach.

In [8], a new fault-tolerance-based scheduling approach for statically available meta-tasks is proposed, in which the failure rate and a fitness value are calculated. The performance of the fault-tolerant scheduling policy is compared with that of a non-fault-tolerant scheduling policy, and the proposed policy performs better, with a lower TTR in the presence of failures. The number of tasks successfully completed is also higher than with the non-fault-tolerant scheduling strategy.

Grid computing, or the computational grid, is a vast research field in both academia and industry. A computational grid provides resource sharing through multi-institutional virtual organizations for dynamic problem solving. Diverse resources from different administrative domains are virtually distributed across different networks in computational grids. Thus a failure of any type can occur at any time, and a job running in the grid environment may fail. Hence fault tolerance is an important and challenging issue in grid computing, as the dependability of individual grid resources may not be guaranteed, and a fault-tolerant system is necessary to make computational grids more reliable and consistent. The authors of [4] reviewed the existing fault tolerance techniques applicable to grid computing, presenting the state of the art of various fault tolerance techniques and a comparative study of the existing algorithms.

The author of [9] presented and evaluated a fault-tolerant job scheduling system based on the checkpointing technique. When scheduling a job, the system uses both the average failure time and the failure rate of grid resources, combined with the resources' response time, to generate scheduling decisions. The system uses the failure rate of the assigned resources to calculate the checkpoint gap (the interval between checkpoints) for each job.
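
The text does not state how the checkpoint gap is derived from the failure rate. As an illustration only, the sketch below substitutes Young's well-known first-order approximation, in which the interval is the square root of twice the checkpoint cost divided by the failure rate. This is a stand-in, not the formula used in [9], and the function and parameter names are hypothetical.

```python
import math


def checkpoint_interval(failure_rate: float, checkpoint_cost: float) -> float:
    """Checkpoint gap from Young's approximation: sqrt(2 * C / lambda).

    failure_rate    -- assumed failures per second of the assigned resource (lambda)
    checkpoint_cost -- assumed time in seconds needed to save one checkpoint (C)
    """
    if failure_rate < 0 or checkpoint_cost <= 0:
        raise ValueError("failure_rate must be >= 0 and checkpoint_cost > 0")
    if failure_rate == 0:
        # A resource that never fails needs no checkpoints.
        return math.inf
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


# One failure per hour on average, 18 s per checkpoint: roughly 360 s between checkpoints.
stable = checkpoint_interval(failure_rate=1.0 / 3600.0, checkpoint_cost=18.0)
# One failure every 10 minutes on average: a shorter gap.
unstable = checkpoint_interval(failure_rate=1.0 / 600.0, checkpoint_cost=18.0)
print(round(stable), round(unstable))
```

Whatever formula is used, the design point is the same: a less stable resource is checkpointed more often, and a stable one is spared the overhead.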
Simulation experiments are conducted in [9] to evaluate the performance of the proposed system. The experiments show that the proposed system can considerably improve throughput, turnaround time and failure tendency.

IV. CONCLUSION

Computational grids are used to execute jobs. Users submit their jobs to the Grid Scheduler (GS) along with their requirements, which may include the job deadline, the resources required and the proposal needed. In the fault-free case, the results of executing the job are returned to the user after the job completes. If the grid resource fails during execution of the job, the job is rescheduled on another resource, which starts executing the job from scratch. The job then takes more time than expected, so the user's requirements are not satisfied.

To address this problem, the job checkpointing mechanism is used. With checkpointing, a partially completed job can be restored from the last saved checkpoint, which avoids restarting the job from scratch. The main disadvantage of the checkpointing mechanism is that it behaves identically regardless of the stability of the resource. Such inappropriate checkpointing can delay job execution and increase the grid load.

In computational grid environments, there are resources that satisfy the requirements but tend to fail. The GS selects resources to execute the job according to their response time combined with the resource fault index. If the selected resource fails and it is the only available resource that can execute the job at that time, the job must wait for that resource to rejoin the system and become available again. This waiting time delays job execution and reduces the throughput of the grid. The system will be implemented using GridSim.

REFERENCES

[1] I. Foster and C. Kesselman, "The Grid: Blueprint for a Future Computing Infrastructure", Morgan Kaufmann, USA, 1999.
[2] Arindam Das and Ajanta De Sarkar, "On Fault Tolerance of Resources in Computational Grids", International Journal of Grid Computing & Applications, Vol. 3, No. 3, September 2012.
[3] Ran Zheng, Hai Jin, "An Integrated Management and Scheduling Scheme for Computational Grid".
[4] Arindam Das and Ajanta De Sarkar, "On Fault Tolerance of Resources in Computational Grids", International Journal of Grid Computing & Applications, Vol. 3, September 2012.
[5] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, A. Roy, "A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation", in Proceedings of the 7th International Workshop on Quality of Service, UK, May 1999.
[6] Leili Mohammad Khanli and Maryam, "Reliable Job Scheduler using RFOH in Grid Computing", 2010.
[7] V. Rhymend Uthariaraj, Malarvizhi Nandagopal, "Fault Tolerant Scheduling Strategy for Computational Grid Environment", International Journal of Engineering Science and Technology, Vol. 2, 2010, 4361-4372.
[8] P. Keerthika, Dr. N. Kasthuri, "A New Proactive Fault Tolerant Approach for Scheduling in Computational Grid", International Conference on Web Services Computing (ICWSC), Proceedings published by International Journal of Computer Applications, 2011.
[9] Mohammed Amoon, "A Fault Tolerant Scheduling System Based on Check pointing for Computational Grids", International Journal of Advanced Science and Technology, Vol. 48, November 2012.

http://www.ijettjournal.org, Pages 3294-3296