International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue8- August 2013
Improvement of Fault Tolerance Using Checkpoint Optimization Technique
in Grid Computing Environment
Sumant Jain (1), Jyoti Choudhary (2)
(1) PG Student, CSE Department, M.D.U Rohtak, Haryana, India
(2) Assistant Professor, IT Department, M.D.U Rohtak, Haryana, India
Abstract - A grid is an association of computer resources
from several administrative domains that work toward a
common goal while presenting the user with an abstraction
of service origination. Fault tolerance is an important
property in grid computing because the dependability of
individual grid resources cannot be guaranteed. A
fault-tolerant approach is useful for preventing a
malicious node from affecting the overall performance of
an application. This paper discusses grid computing and
related work on fault tolerance.
Keywords - GARA, physical faults, network faults.
I. INTRODUCTION
Grid computing, or the computational grid, has long been a
vast research field in both academia and industry. A grid
is an association of computer resources from several
administrative domains that work toward a common goal while
presenting the user with an abstraction of service
origination [2]. A computational grid is a hardware and
software infrastructure that provides reliable, consistent,
pervasive and inexpensive access to high-level
computational capabilities [1].
II.
Grid computing is a distributed computing model in which
geographically distributed resources cooperate to solve
large problems, and which provides easy access to
heterogeneous, geographically dispersed resources. Grid
resources are heterogeneous, belong to different
organizations and locations with different access
policies, and have inherently dynamic workloads; even so,
the use of grids for sharing, selecting and aggregating
computing resources has become popular. The main goal of a
grid is to provide services with high reliability and the
lowest cost to large numbers of users and to support group
work. The most important issues in grid computing are
resource management and control, reliability and security.
Increasing the efficiency of the grid is also an important
issue today, and it requires proper and useful scheduling.
Unfortunately, the dynamic nature of grid resources and
the demands of different users make grid scheduling
complex [3].
The scheduling process in a grid can be organized in three
stages: resource discovery, resource selection and
scheduling based on clear goals, and, as the last stage,
assignment of the request to the most appropriate
resource. To achieve the full potential of a grid
environment, grid scheduling must be performed
effectively. Grid scheduling is the process of making
scheduling decisions involving resources over multiple
administrative domains. This process can consist of
searching multiple administrative domains to use a single
machine, or scheduling a single job to use multiple
resources at a single site or at multiple sites [8].
ISSN: 2231-5381
Fault tolerance is an important property in grid computing
because the dependability of individual grid resources
cannot be guaranteed. In many cases, an organization may
send out jobs for remote execution on resources upon which
no trust can be placed; for example, the resources may be
outside its organizational boundaries, or may be shared by
different users at the same time.
A fault-tolerant approach may therefore be useful for
preventing a malicious node from affecting the overall
performance of the application. As applications scale to
take advantage of grid resources, their size and
complexity will increase dramatically.
A major challenge in a dynamic grid with thousands of
interconnected nodes is fault tolerance. The more
resources and components are involved, the more
complicated and error-prone the system becomes. To compare
fault tolerance mechanisms, it is necessary to distinguish
between faults, errors and failures. A fault is a
violation of a system's basic assumptions. An error is an
internal data state that reflects a fault. A failure is an
externally visible deviation from specifications. In
reality, a fault need not result in an error, nor an error
in a failure. Different types of faults, classified by
several factors, are listed below:
1. Physical faults: Faulty storage, faulty CPUs, faulty
   memory.
2. Unconditional termination: Typically, the user pressed
   Ctrl+C.
3. Network faults: Packet corruption, faults due to
   network division, packet loss.
4. Lifecycle faults: Legacy or versioning faults.
5. Processor faults: Machine or operating system crashes.
6. Media faults: Disk head crashes.
7. Service expiry fault: The service time of a resource
   may expire while an application is using the resource
   in the grid.
8. Process faults: Software bugs, resource scarcity.
9. Interaction faults: Timing overhead, protocol
   incompatibilities, security incompatibilities, policy
   problems [4].
http://www.ijettjournal.org
Many grid applications require heterogeneous resources
that are independently controlled or administered. Since
resources belonging to different administrative domains do
not share their schedules, a user whose application needs
to access more than one resource simultaneously either has
to arrange for it through the domain administrators or
submit the tasks of the job to the queues of different
resources without any guarantee that all resources will be
available simultaneously. To address this problem, advance
reservations (ARs) were introduced as part of the Globus
Architecture for Reservation and Allocation (GARA) [5].
Advance reservation of resources for a specific time in
the future ensures that all resources will be
simultaneously available at the execution time of the
application. Because reserving resources in advance
provides an upper bound on the response time, ARs can also
be used to ensure end-to-end quality of service. For jobs
with sequential tasks, the response time of the first
resource in the sequence can become the start time of the
reservation for the second resource, and so on, thus
guaranteeing the end-to-end response time.
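The chaining rule for sequential tasks can be sketched as
follows. This is a minimal illustration under stated
assumptions, not the GARA interface: the Reservation class,
the chain_reservations helper, the resource names and the
response-time bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    resource: str
    start: float  # reservation start time, in seconds after submission
    end: float    # latest completion time covered by the reservation


def chain_reservations(tasks, submit_time=0.0):
    """Chain advance reservations for sequential tasks.

    tasks: list of (resource_name, response_time_bound) pairs in
    execution order. Each reservation starts at the moment the
    previous one is guaranteed to have finished.
    """
    reservations = []
    start = submit_time
    for resource, bound in tasks:
        end = start + bound
        reservations.append(Reservation(resource, start, end))
        start = end  # the next reservation begins here
    return reservations


# Hypothetical three-task job; bounds are illustrative only.
plan = chain_reservations(
    [("cluster-a", 30.0), ("storage-b", 10.0), ("cluster-c", 20.0)]
)
for reservation in plan:
    print(reservation)
print("end-to-end bound:", plan[-1].end)  # 60.0
```

The end time of the last reservation is the upper bound on
the end-to-end response time of the whole job.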
III. RELATED WORK
Leili Mohammad Khanli and Maryam discussed a strategy
named Reliable Job Scheduler using RFOH in Grid
Computing [6]. This strategy maintains the fault
occurrence history of resources. Whenever a resource
broker has jobs to schedule, it finds the optimal
resources using fault occurrence and response time. It
does not distinguish between different aspects of resource
failure such as processor, memory and bandwidth. Our work
considers these different aspects of resource failure and
hence leads to optimal resource utilization.
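A broker that combines fault occurrence history with
response time might look like the sketch below. The text
does not give the ranking formula used in [6], so the
penalty rule response_time * (1 + fault_fraction), the
FaultHistory class and the select_resource helper are
assumptions made only for illustration.

```python
from collections import defaultdict


class FaultHistory:
    """Tracks how often each resource failed while running a job."""

    def __init__(self):
        self.jobs = defaultdict(int)
        self.faults = defaultdict(int)

    def record(self, resource, failed):
        """Record the outcome of one job executed on a resource."""
        self.jobs[resource] += 1
        if failed:
            self.faults[resource] += 1

    def fault_fraction(self, resource):
        """Fraction of recorded jobs that failed (0.0 if no history)."""
        jobs = self.jobs[resource]
        return self.faults[resource] / jobs if jobs else 0.0


def select_resource(response_times, history):
    """Pick the resource with the lowest penalized response time.

    The penalty response_time * (1 + fault_fraction) is an assumed
    rule, not the formula from the cited work.
    """
    return min(
        response_times,
        key=lambda r: response_times[r] * (1.0 + history.fault_fraction(r)),
    )


# Hypothetical history: r1 failed 2 of 4 jobs, r2 never failed.
history = FaultHistory()
for failed in (True, False, True, False):
    history.record("r1", failed)
history.record("r2", False)

best = select_resource({"r1": 10.0, "r2": 12.0}, history)
print(best)  # r2: 12.0 * 1.0 beats 10.0 * 1.5
```

Under this assumed rule, a faster resource loses to a
slower one once its failure history is poor enough.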
Two major problems that are critical to the effective
utilization of computational resources are efficient
scheduling of jobs and reliable fault tolerance. The
authors of [7] address these problems by combining a
checkpoint replication based fault tolerance mechanism
with the Minimum Total Time to Release (MTTR) job
scheduling algorithm. The total time to release (TTR)
includes the service time of the job, its waiting time in
the queue, and the time to transfer input and output data
to and from the resource. The MTTR algorithm minimizes the
TTR by selecting a computational resource based on job
requirements, job characteristics and hardware features of
the resources. The fault tolerance mechanism sets job
checkpoints based on the resource failure rate. A Replica
Resource Selection Algorithm (RRSA) is proposed to provide
a Checkpoint Replication Service (CRS). The Globus Toolkit
is used as the grid middleware to set up a grid
environment and evaluate the performance of the proposed
approach.
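The TTR definition above, and the idea of choosing the
resource that minimizes it, can be sketched in a few lines.
This is only an illustration of the three-term sum
described in the text: the ResourceEstimate class, the
select_min_ttr helper and all numbers are hypothetical, and
the matching on job requirements and hardware features
performed by the algorithm in [7] is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ResourceEstimate:
    name: str
    service_time: float   # estimated execution time of the job
    waiting_time: float   # estimated time spent in the resource queue
    transfer_time: float  # time to move input and output data

    @property
    def ttr(self):
        """Total time to release: service + waiting + transfer."""
        return self.service_time + self.waiting_time + self.transfer_time


def select_min_ttr(estimates):
    """Return the estimate with the smallest total time to release."""
    return min(estimates, key=lambda e: e.ttr)


# Hypothetical estimates for one job on two candidate resources.
candidates = [
    ResourceEstimate("site-a", service_time=40.0, waiting_time=5.0, transfer_time=10.0),
    ResourceEstimate("site-b", service_time=30.0, waiting_time=20.0, transfer_time=8.0),
]
best = select_min_ttr(candidates)
print(best.name, best.ttr)  # site-a 55.0
```

Note that site-b has the shorter service time but loses
because of its longer queue, which is the reason for
ranking by the full TTR rather than by execution time.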
The authors of [8] proposed a new fault tolerance based
scheduling approach for statically available meta-tasks,
in which a failure rate and a fitness value are
calculated. The performance of the fault-tolerant
scheduling policy is compared with that of a
non-fault-tolerant scheduling policy, and the results show
that the proposed policy performs better, with a lower TTR
in the presence of failures. The number of tasks
successfully completed is also higher than with the
non-fault-tolerant scheduling strategy.
Grid computing, or the computational grid, has long been a
vast research field in both academia and industry. A
computational grid provides resource sharing through
multi-institutional virtual organizations for dynamic
problem solving. In a computational grid, diverse
resources from different administrative domains are
virtually distributed across different networks. Thus any
type of failure can occur at any time, and a job running
in the grid environment may fail. Fault tolerance is
therefore an important and challenging issue in grid
computing, as the dependability of individual grid
resources cannot be guaranteed. A fault-tolerant system is
necessary to make computational grids more reliable and
consistent. The authors of [4] reviewed the existing fault
tolerance techniques applicable to grid computing,
presenting the state of the art of various fault tolerance
techniques and a comparative study of the existing
algorithms.
The author of [9] presented and evaluated a fault-tolerant
job scheduling system based on the checkpointing
technique. When scheduling a job, the system uses both the
average failure time and the failure rate of grid
resources, combined with the resources' response time, to
generate scheduling decisions. The failure rate of the
assigned resource is used by the system to calculate the
checkpoint interval for each job. Simulation experiments
were conducted to evaluate the performance of the proposed
system. The experiments showed that the proposed system
can considerably improve throughput, turnaround time and
failure tendency.
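The text does not state how [9] derives the checkpoint
interval from the failure rate. As a stand-in, the sketch
below uses Young's well-known first-order approximation,
interval = sqrt(2 * C / lambda), where C is the cost of
writing one checkpoint and lambda is the failure rate of
the resource. The function name, the units (seconds and
failures per second) and the sample values are assumptions.

```python
import math


def checkpoint_interval(checkpoint_cost, failure_rate):
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost: time needed to write one checkpoint (seconds).
    failure_rate: failures per second of the assigned resource,
    i.e. 1 / mean time between failures.
    This is a substitute rule, not the formula used in the cited work.
    """
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("checkpoint_cost and failure_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


# Hypothetical resources: one fails every 2 hours on average,
# the other every 20 hours; a checkpoint takes 60 seconds.
unstable = checkpoint_interval(60.0, 1.0 / 7200.0)
stable = checkpoint_interval(60.0, 1.0 / 72000.0)
print(round(unstable), round(stable))
```

The less stable resource gets the shorter interval, so
checkpoints are taken more often where failures are more
likely and less often where they would mostly add overhead.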
IV. CONCLUSION
Computational grids are used to execute jobs. Users submit
their jobs to the Grid Scheduler (GS) along with their
requirements. These requirements may include the deadline
of the job, the resources required and the proposal
needed. In the fault-free case, the results of executing
the job are returned to the user after the job completes.
If the grid resource fails during execution of the job,
the job is rescheduled on another resource, which starts
executing the job from scratch. This leads to more time
being consumed by the job than expected. Thus, the user's
requirements are not satisfied.
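The cost of restarting from scratch can be made concrete
with a small calculation. The sketch assumes that every
resource runs the job at the same speed, that rescheduling
itself takes no time, and that the final attempt succeeds;
the function name and the numbers are hypothetical.

```python
def completion_time_with_restart(job_length, failure_times):
    """Elapsed time when every failure forces a restart from scratch.

    job_length: run time the job needs on a fault-free resource.
    failure_times: for each failed attempt, how long the job had
    been running when the resource failed (must be < job_length).
    The last attempt is assumed to run to completion.
    """
    for t in failure_times:
        if not 0 <= t < job_length:
            raise ValueError("each failure time must be in [0, job_length)")
    # All work done before each failure is lost and repeated.
    return sum(failure_times) + job_length


# A 100-minute job whose first resource fails after 80 minutes.
total = completion_time_with_restart(100.0, [80.0])
print(total)  # 180.0
```

In this example the user waits 180 minutes for a
100-minute job, which is how a single late failure can
push a job past its deadline.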
To address this problem, the job checkpointing mechanism
is used. With checkpointing, a partially completed job can
be restored from the last saved checkpoint, which avoids
restarting the job from scratch. The main disadvantage of
the checkpointing mechanism is that it performs
identically regardless of the stability of the resource.
Such inappropriate checkpointing can delay job execution
and increase the grid load.
In computational grid environments, there are resources
that satisfy the requirements but tend to fail. The GS
selects resources to execute the job according to the
response time combined with the resource fault index. If
the selected resource has failed and it is the only
available resource that can execute the job at that time,
the job must wait for that resource to rejoin the system
and become available again. This waiting time delays the
job execution and reduces the throughput of the grid. The
system will be implemented using GridSim.
REFERENCES
[1] I. Foster and C. Kesselman, "The Grid: Blueprint for a New
Computing Infrastructure", Morgan Kaufmann, USA, 1999.
[2] Arindam Das and Ajanta De Sarkar, "On Fault Tolerance of
Resources in Computational Grids", International Journal of Grid
Computing & Applications, Vol. 3, No. 3, September 2012.
[3] Ran Zheng, Hai Jin, "An Integrated Management and Scheduling
Scheme for Computational Grid".
[4] Arindam Das and Ajanta De Sarkar, "On Fault Tolerance of
Resources in Computational Grids", International Journal of Grid
Computing & Applications, Vol. 3, September 2012.
[5] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, A. Roy,
"A Distributed Resource Management Architecture that Supports
Advance Reservations and Co-Allocation", in Proceedings of the 7th
International Workshop on Quality of Service, UK, May 1999.
[6] Leili Mohammad Khanli and Maryam, "Reliable Job Scheduler using
RFOH in Grid Computing", 2010.
[7] V. Rhymend Uthariaraj, Malarvizhi Nandagopal, "Fault Tolerant
Scheduling Strategy for Computational Grid Environment",
International Journal of Engineering Science and Technology,
Vol. 2, 2010, 4361-4372.
[8] P. Keerthika, Dr. N. Kasthuri, "A New Proactive Fault Tolerant
Approach for Scheduling in Computational Grid", International
Conference on Web Services Computing (ICWSC), Proceedings
published by International Journal of Computer Applications, 2011.
[9] Mohammed Amoon, "A Fault Tolerant Scheduling System Based on
Check pointing for Computational Grids", International Journal of
Advanced Science and Technology, Vol. 48, November 2012.