Document 13324066

advertisement
Co-Alloation, Fault Tolerane and Grid Computing
1
2
Jon MaLaren,1
Mark M Keown,2 and Stephen Pikles2
Center for Computation and Tehnology, Louisiana State University,
Baton Rouge, Louisiana 70803, United States.
Manhester Computing, The University of Manhester, Oxford Road, Manhester M13 9PL.
Experiene gained from the TeraGyroid and SPICE projets has shown that o-alloation and fault
tolerane are important requirements for Grid omputing. Co-alloation is neessary for distributed
appliations that require multiple resoures that must be reserved before use. Fault tolerane is
important beause a omputational Grid will always have faulty omponents and some of those
faults may be Byzantine. We present HARC, Highly-Available Robust Co-alloator, an approah to
building fault tolerant o-alloation servies. HARC is designed aording to the REST arhitetural
style and uses the Paxos and Paxos Commit algorithms to provide fault tolerane. HARC/1, an
implementation of HARC using HTTP and XML, has been demonstrated at SuperComputing 2005
and iGrid 2005.
I.
INTRODUCTION
In this paper we disuss the importane of oalloation to Grid omputing. We provide some
bakground on the diulties assoiated with oalloation before presenting HARC, Highly-Available
Robust Co-alloator, a fault tolerant approah to oalloation. The paper also inludes a disussion on
the problem of fault tolerane and Grid omputing.
We make the ase that a omputational Grid will always have faulty omponents and that some of those
faults will be Byzantine [26, 27℄. However, we also
demonstrate that with suitable approahes it is still
possible to make a omputational Grid a fault tolerant system.
There are many denitions of Grid omputing but
reurring themes are: large sale or internet sale
distributed omputing and sharing resoures aross
multiple administrative domains. The goal of sharing resoures between organizations dates bak to the
ARPANET [33℄ projet and progress towards that
goal an be seen in the development of the Internet,
the World Wide Web [22℄ and now omputational
Grids. While the goals of the ARPANET projet
are still relevant today the underlying infrastruture
has hanged, powerful omputers have beome heap
and plentiful while high performane networks have
beome pervasive, presenting developers with a different set of hallenges and opportunities. We believe that o-alloation is a new hallenge, while the
falling ost of omponents makes fault tolerane a
new opportunity.
Parallel to the evolution of ARPANET through
to Grid omputing has been the development of the
theory of distributed systems providing us with a
deeper understanding of distributed systems and a
set of algorithms for building fault tolerant systems.
Representational State Transfer [12℄, REST, is an arhitetural style for building large sale distributed
systems that was used to develop the protools that
make up the World Wide Web. Paxos [28℄ is a
fault tolerant onsensus algorithm that an be used
to build highly available systems [31℄. Paxos Commit [18℄ is Paxos applied to the distributed transation ommit problem.
HARC uses REST and Paxos to provide a system that is fault tolerant and suitable for a large
sale distributed system suh as a omputational
Grid. HARC's fous on fault tolerane is unique
among approahes to designing o-alloation servies [3, 9, 24, 34, 39℄.
II.
CO-ALLOCATION
Running distributed appliations on a omputational Grid often requires that the resoures needed
by the appliation are available at the same time.
The resoures may need to be booked (eg Aess
Grid nodes) or they may use a bath submission system (eg HPC systems). We dene o-alloation as
the provision of a set of resoures at the same time
or at some o-ordinated set of times. Co-alloation
an be ahieved by making a set of reservations for
the required resoures with the respetive resoure
providers.
Experiene from the award winning TeraGyroid [7℄
and SPICE [23℄ projets has shown that support for
o-alloation on urrent prodution Grids [37, 38℄ is
ad-ho and often requires time onsuming diret negotiation with the resoure administrators. The resoures required by TeraGyroid and SPICE inluded
HPC systems, speial high performane networks,
high end visualization systems, Aess Grid nodes,
hapti devies and people. The lak of onvenient
o-alloation servies prevents the routine use of
the tehniques pioneered by TeraGyroid and SPICE.
Without o-alloation omputational Grids are limited in the type of appliations they an support, and
so are limited in their potential. Co-alloation's im-
2
portane to Grid omputing means that it must be
a reliable servie.
III.
FAULT TOLERANCE AND GRID
COMPUTING
Whatever denition of Grid omputing is used we
are lead to two inesapable onlusions: a omputational Grid will always have faulty omponents and
some of those faults will be Byzantine.
Computational Grids are internet sale distributed
systems, implying large numbers of omponents and
wide area networks. At this sale there will always
be faulty omponents.
The ase that a omputational Grid will always
have faulty omponents is illustrated by the fat that
on average over ten perent of the daily operational
monitoring tests, GITS [4℄, run on the UK National
Grid Servie, UK NGS [38℄, report failures. Sine the
start of the UK NGS a number of the ore nodes have
been unavailable for days at a time due to planned
and unplanned maintenane.
Crossing administrative boundaries raises issues of
trust: Can a user trust that a resoure provider has
ongured and administrates the resoure properly?
Can a resoure provider trust that a user will use
the resoure orretly? Distributed transations are
rarely used between organizations beause of a lak
of trust; one organization will not allow another organization to hold a lok on its internal databases.
Without trust users, resoure providers and middleware developers must be prepared for Byzantine fault
behaviour: when a omponent faults but ontinues
to operate in an unpreditable and potentially maliious way. Although Grid omputing supports the
reation of limited trust relationships between organizations through the onept of the Virtual Organization [14℄ our experiene has been that Grids exhibit
Byzantine fault behaviour.
To illustrate the point we provide three examples of Byzantine behaviour whih we have enountered. The rst ase involved the UK eSiene Grid's
MDS2 [2, 10℄ hierarhy. A GRIS [10℄ at a site
was rewalled preventing GIIS [10℄ higher up in the
MDS2 hierarhy from querying it. The GRIS ontinued to report that it was publishing information but
whenever a GIIS attempted to query it the rewall
would blok the onnetion. The GIIS would blok
waiting for the GRIS to respond ausing the whole
MDS2 hierarhy to blok. The rewall was raised by
the site's network administrators. The loal rewall
on the GRIS server and the GRIS servie itself were
ongured orretly.
The seond ase involved a job submission node
on the TeraGrid [37℄. The node onsisted of two
servers and utilized a DNS round robin to alloate
requests to a node. Unfortunately the DNS entries
were mis-ongured and reported the wrong hostname for one of the nodes ausing job submissions
to fail randomly. Loal testing at the site did not
reveal the problem.
The third ase involved a resoure broker that assigned jobs to omputational resoures. Oasionally
a omputational resoure would fail in a Byzantine
way: it would aept jobs from the resoure broker,
fail to exeute the job but report to the resoure
broker that the job had ompleted suessfully. The
resoure broker would ontinue assigning jobs to the
faulty resoure until it was drained of jobs.
The examples illustrate the importane of end-toend arguments in system design error reovery
at the appliation level is absolutely neessary for
a reliable system, and any other error detetion or
reovery is not logially neessary but is stritly for
performane [35℄. In eah ase making the individual
omponents more reliable would not have prevented
the problem. Retrotting reliability to an existing
design is very diult [30℄.
For o-alloation a very real example of Byzantine
fault behaviour ours when a resoure provider aepts a reservation for a resoure but at the sheduled
time the user, and possibly the resoure provider,
disover the resoure is unavailable.
IV.
REST
REST is an arhitetural style for building large
sale distributed systems. It onsists of a set of priniples and design onstraints whih were used in designing the protools that make up the World Wide
Web.
We provide a brief desription of REST and refer the reader to the thesis [12℄ for a omplete desription. REST is based on a lient-server model
whih supports ahing and where interations between lient and server are stateless, all interation
state is stored on the lient for server salability. The
onept of a resoure is entral to REST, resoures
have identity and anything that an have an identity an be a resoure. Resoures are manipulated
through their representations and are networked together through linking hypermedia is the engine
of appliation state. Together the last set of onstraints ombine to make up the priniple of uniform
interfae. REST also has an optional onstraint for
the support of mobile ode.
HTTP [11℄ is an example of a protool that has
been designed aording to REST. On the World
Wide Web a resoure is identied by a URI [6℄ and
lients an retrieve a representation of the resoure
using a HTTP GET. HTTP an also supply ahing
information along with the representation to allow
3
intermediaries to ahe the representation. The representation may ontain links to other resoures reating a network of resoures. Clients an hange the
representation of a resoure by replaing the existing representation with a new one, for example using a HTTP PUT. All resoures on the World Wide
Web have a uniform interfae allowing generi piees
of software suh as web browsers to interat with
them. Web servers are also able to send ode, for
example JavaSript, to the lient to be exeuted in
the browser.
V.
PROBLEMS OF CO-ALLOCATION
To illustrate the problems assoiated with oalloation we present two possible approahes and
disuss their shortomings. Most existing solutions [3, 9, 24, 34, 39℄ for o-alloation are based
on variations of these approahes.
A.
One Phase Approah
In this approah a suessful reservation is made
in a single step.
The user sends a booking request to eah of the
Resoure Managers (RM) requesting a reservation.
The RMs either aept or rejet the booking. If one
or more of the RMs rejets the booking the user must
anel any other bookings that have been made. This
approah has the advantage of being simple but a
potential drawbak is that the user may be harged
eah time he anels a reservation.
Supporting reservations prevents a RM from running the optimal workload on a resoure as it has to
shedule around the reservations. Even if a reservation is anelled before the sheduled time it may already have delayed exeution of some jobs. RMs may
harge for the use of the resoure and may harge
more for jobs that are submitted via reservation to
ompensate for the loss in throughput. They may
also harge for reservations that are anelled.
Any harging poliy is the prerogative of the RM.
Not all RMs may harge and it is unreasonable to
make any assumptions about or try to mandate a
harging poliy. A o-alloation protool should aommodate the issues assoiated with harging but
annot depend on RMs supporting harging.
Another potential problem with this approah
arises if one of the RMs rejets a booking but the
user does not anel the other reservations that have
been made. For example the user may fail before he
has had a hane to anel the reservations. Even if
the user reovers he may not have stored suient
state before failing to anel the reservations. This
Initial
State
Abort
Booking
Held
Commit
Booking
Committed
Abort
Booking
Rejected
FIG. 1: The statetransition diagram for a RM in the
two phase approah. The RM reeives the booking request message in the initial state. It an either rejet the
request and move to the booking rejeted state or aept
the request and move to the Booking Held state. The
RM waits in the Booking Held state until it reeives the
Commit or Abort message from the TM.
is related to the question of trust the RMs must
trust the user to anel any unwanted reservations.
A partial solution is to add an intermediary servie that is trusted by the user and RM to orretly
make and anel reservations as neessary. The user
sends the set of bookings he requires to the intermediary whih handles all the interations with the
RMs, anelling all the reservations if the purposed
shedule is unaeptable to any of the RMs. The intermediary is still a single point of failure unless it
an be repliated.
B.
Two Phase Approah
In this approah a suessful reservation is made
in two steps.
To resolve the problem of RMs harging for anelling a reservation we introdue a new state for the
RM, Booking Held. The lient an anel a reservation in the Booking Held state without being harged.
The approah is similar to the two phase ommit [17℄
algorithm for distributed transations but does not
neessarily have the ACID (Atomiity, Consisteny,
Isolation and Durability [19℄) properties assoiated
with two phase ommit.
A proess, the Transation Manager (TM), tries
to get a group of RMs to aept or rejet a set of
reservations. In the rst phase the TM sends the
booking requests to the RMs, the RMs reply with a
Booking Held message if the request is aeptable or
an Abort message if the request is unaeptable. If
all the RMs reply with Booking Held the TM sends
a Commit message to the RMs and the reservations
are ommitted. Otherwise the TM sends an Abort
message to the RMs and no reservations are made.
4
Fig. 1 shows the state transitions for the RM in the
two phase approah.
Unfortunately the two phase approah an blok.
If the TM fails after the rst phase but before starting the seond phase, before sending the Commit or
Abort message, the RMs are left in the Booking Held
state until the TM reovers. While in the Booking
Held state the RMs may be unable to aept reservations for the sheduled time slot and the resoure
sheduler may be unable to run the optimal work
load on the resoure.
The bloking nature of the two phase approah
may be aeptable for ertain senarios depending
on how long the TM is unavailable and providing it
reovers orretly. To overome the bloking nature
of the two phase approah requires a three phase
protool [36℄ suh as Paxos.
Another potential problem with the two phase approah is that RMs must support the Booking Held
state. A solution is for RMs that do not support the
Booking Held state to move straight to the Booking
Committed state if the booking request is aeptable
in the rst phase. For the seond phase they an ignore a Commit message and treat an Abort message
as a anellation of the reservation for whih the user
may be harged. This a relaxation of the onsisteny
property of two phase ommit.
It is also possible to relax the isolation property
of two phase ommit. If a RM reeives a booking
request while in the Booking Held state it an rejet
the request, advising the lient that it is holding an
unommitted booking whih may be released in the
future. This prevents deadlok [19℄, when two users
request the same resoures at the same time.
The role of the TM an be played by the user but
should be arried out by a trusted intermediary servie.
VI.
CO-ALLOCATION AND CONSENSUS
Co-alloation is a onsensus problem. Consensus
is onerned with how to get a set of proesses to
agree a value [26℄. Distributed transation ommit,
of whih two phase ommit is one approah, is a
speial ase of onsensus were all the proesses must
agree on either ommit or abort for a transation.
In the ase of o-alloation the RMs must agree to
either aept or rejet a purposed shedule.
Consensus is a well understood problem with over
twenty ve years of researh [13, 2628℄. For example it has been shown that distributed onsensus is
impossible in an asynhronous system with a perfet
network and just one faulty proessor [13℄. In an
asynhronous system it is impossible to dierentiate
between a proessor that is very slow and one that
has failed. To handle faults requires either fault de-
tetors or partial synhrony. Consensus is an important problem beause it an be used to build replias: start a set of deterministi state mahines in the
same state and use onsensus to make them agree on
the order of messages to proess [25, 31℄.
Paxos is a well known fault tolerant onsensus algorithm. We present only a desription of its properties and refer the reader to the literature [28, 29, 31℄
for a full desription of the algorithm. In Paxos a
leader proess tries to guide a set of aeptor proesses to agree a value. Paxos will reah onsensus
even if messages are lost, delayed or dupliated. It
an tolerate multiple simultaneous leaders and any
of the proesses, leader or aeptor, an fail and reover multiple times. Consensus is reahed if there is
a single leader for a long enough time during whih
the leader an talk to a majority of the aeptor proesses twie. It may not terminate if there are always
too many leaders. There is also a Byzantine version
of Paxos [8℄ to handle the ase were aeptors may
have Byzantine faults.
Paxos Commit [18℄ is the Paxos algorithm applied
to the distributed transation ommit problem. Effetively the transation manager of two phase ommit is replaed by a group of aeptors if a majority of aeptors are available for long enough the
transation will omplete. Gray and Lamport [18℄
showed that Paxos Commit is eient and has the
same message delay as two phase ommit for the
fault free ase. They also showed that two phase
ommit is Paxos Commit with only one aeptor.
Though it may be possible to apply Paxos diretly
to the o-alloation problem by having the RMs at
as aeptors we hose to use Paxos Commit instead.
The advantage Paxos Commit has over using Paxos
diretly for o-alloation is that the role played by
the RMs is no more omplex than in the two phase
approah. Limiting the role of the RM makes it more
aeptable to resoure providers and redues the possibilities for faulty behaviour from the RM.
VII.
HARC
The HARC approah to o-alloation is similar to
the two phase approah desribed in Setion II exept the TM is replaed with a set of Paxos aeptors. The user sends a booking request to an aeptor
who rst repliates the message to the other aeptors using Paxos and then it uses Paxos Commit to
make the reservations with the RMs. HARC terminates one it has deided to ommit or abort a set of
reservations. Users and RMs may modify or anel
reservations after HARC has terminated, but this is
outside the sope of HARC.
The aims of HARC are deliberately limited. It
does not address the issues of how to hoose the op-
5
timal shedule; how to manage a set of reservations
one they have been made; how to negotiate quality of servie with the resoure provider or how to
support ompensation mehanisms when a resoure
fails to full a reservation or when a user anels a
reservation. HARC is designed so that other servies
and protools an be ombined with it to solve these
problems.
A.
Choosing a Shedule
The RM advertises the shedule when a resoure
may be available through a URI. The user retrieves
the shedule using a HTTP GET. The information
retrieved is for guidane only, the RM is under no
obligation by advertising it. It is the RM's prerogative as to how muh information it makes available, it may hoose not to advertise a shedule at
all. The RM may supply ahing information along
with the shedule using the ahe support failities of
HTTP. The ahing information an indiate when
the shedule was last modied or how long the shedule is good for. The RM may also support onditional GET [11℄ so that lients do not have to retrieve and parse the shedule if it hasn't hanged
sine the last retrieval. From the shedules for all
the resoures the user hooses a suitable time when
the resoures he requires might be free.
A o-sheduling servie for HARC ould use
HTTP onditional GET to maintain a ahe of resoure shedules from whih to reate o-shedules
for the user. The o-sheduling servie would have
to store only soft state making it easy to repliate.
B.
Submitting the Booking Request
The user onstruts a booking request whih ontains a sub-request for eah resoure that he wants to
reserve. The booking request also ontains an identier hosen by the user, the UID. The ombination
of the identity of the user and the UID should be
globally unique. The user sends the booking request
to an aeptor using a HTTP POST. This aeptor
will at as the leader for the whole booking proess
unless it fails in whih ase another aeptor will take
over.
The leader piks a transation identier, the TID,
for the booking request from a set of TIDs that it
has been initially assigned. Eah aeptor has a different range of TIDs to hoose from to prevent two
aeptors trying to use the same TID. The aeptor
eetively repliates the message to the other aeptors by having Paxos agree the TID for the message.
This instane of Paxos agrees a TID for the ombination of the user's identity and the user hosen UID.
If the user resends the message to the aeptor or to
any other aeptor he will reeive the same TID.
The user should ontinue submitting the request
to any of the aeptors until he reeives a TID the submission of the booking request is idempotent
aross all the aeptors.
If a user reuses a UID he will get the TID of the
previous booking request assoiated with that UID.
To avoid reusing a UID the user an reord all the
UIDs he has previously used, however a more pratial approah is to randomly hoose a UID from a
large namespae. Two users an use the same UID
as it is the ombination of the user's identity and the
UID that is relevant.
The TID is a RequestURI [11℄. The user an use a
HTTP GET with the TID to retrieve the outome of
the booking request from any of the aeptors. The
TID is returned to user using the HTTP Loation
header in the response to the POST ontaining the
booking request. The HTTP response ode is 201
indiating that a new resoure has been reated. The
aeptor should return a 303 HTTP response ode if
a TID has already been hosen for the request to
allow the user to detet the ase when a UID may
have been reused.
C.
Making the Reservations
After the TID has been hosen the leader uses
Paxos Commit to make the bookings with the RMs.
The booking request is broken down into the subrequests and the sub-requests are sent to the appropriate RM using HTTP POST. The TID is also sent
to the RM as the Referer HTTP header in the POST
message.
The RM responds with a URI that will represent
the booking loal to the RM and whether the booking is being held or has been rejeted. It also broadasts this message to all the other aeptors.
As in the Paxos Commit algorithm the aeptors
deide whether the shedule hosen by the user is
aeptable to all the RMs or has been rejeted by any
of the RMs. The leader informs eah RM whether
to ommit or abort the booking using a HTTP PUT
on the URI provided by the RM.
It is the RM's obligation to disover the outome
of the booking. If the RM does not reeive the ommit or abort message it should use a HTTP GET
with the TID on any of the aeptors to disover the
outome of the booking. One the aeptors have
made a deision it an be ahed by intermediaries.
A HTTP GET on the TID returns the outome of
the booking request and a set of links to the reservations loal to eah RM. Detailed information on a
reservation at a partiular RM an be retrieved by
6
using a HTTP GET on the link assoiated with that
reservation.
The user an anel the reservation through the
URI provided by the RM and the RM an advertise that it has anelled the reservation using the
same URI. The user should poll the URI to monitor
the status of the reservation, HTTP HEAD or onditional GET an be used to optimize the polling.
There is a question of what happens if a RM deides to unilaterally abort from the Booking Held
state whih is not allowed in two phase ommit or
Paxos Commit. Sine HARC terminates when it
deides whether a shedule should be ommitted or
aborted we an say that the RM anelled the reservation after HARC terminated. This is possible beause there is only a partial ordering of events in a
distributed system [25℄. The user an only nd out
that the RM anelled the reservation after HARC
terminated. The onus is on the user to monitor the
reservations one they have been made and to deal
with any eventualities that may arise.
D.
HARC Ations
HARC supports a set of ations for making and
manipulating reservations: Make, Modify, Move and
Canel. Make is used to reate a new reservation, Modify to hange an existing reservation (eg
to hange the number of CPUs requested), Move to
hange the time of a reservation and Canel to anel a reservation. A booking request an ontain a
mixture of ations and HARC will deide whether
to ommit or abort all of the ations. The HARC
ations provide the user with exibility for dealing
with the ase of a RM anelling a reservation, for
example he ould make a new reservation on another
resoure at a dierent time and move the existing
reservations to the new time.
A third party servie ould monitor reservations on
the behalf of the user and deal with any eventualities
that arise using the HARC ations in aordane to
some poliy provided by the user.
E.
Seurity
The omplete booking request sent to the aeptor
is digitally signed [5℄ by the user with a X.509 [1℄ ertiate. The aeptors use the signature to disover
the identity of the user. Individual sub-requests are
also digitally signed by the user so that the RMs
an verify that the sub-request ame from an authorized user. The aeptors also digitally sign the
sub-requests before passing them to the RMs so that
the RM an verify that the sub-request ame from
a trusted aeptor. If required a sub-request an be
digitally enrypted [21℄ by the user so that the aeptors annot read it.
A HTTP GET using the TID returns only the outome of a booking request and a set of links to the
individual bookings so there is no requirement for aeptors to provide aess ontrol to this information.
The individual RMs ontrol aess to any detailed
information on reservations they hold.
VIII.
HARC AVAILABILITY
HARC has the same fault tolerane properties as
Paxos Commit whih means it will progress if a majority of aeptors are available. If a majority of
aeptors are not available HARC will blok until a
majority is restored, sine bloking is unaeptable
we dene this as HARC failing. To alulate the
MTTF for HARC we use the notation and denitions from [19℄.
The probability that an aeptor fails is 1/M T T F
were MTTF is the Mean Time To Failure of the aeptor. The probability that an aeptor is in a failed
state is approximately M T T R/M T T F , were MTTR
is the Mean Time To Repair of the aeptor.
Given 2F + 1 aeptors, the probability that
HARC bloks is the probability that F aeptors are
in the failed state and another aeptors fails.
The probability that F out of the 2F + 1 aeptors
are in the failed state is:
F
(2F + 1)! M T T R
(1)
F !(F + 1)! M T T F
The probability that one of the remaining F + 1 aeptors fails is:
1
(F + 1)
(2)
MT T F
The probability that HARC will blok is the produt
of (1) and (2):
M T T RF
(2F + 1)!
(3)
(F !)2
M T T F F +1
The MTTF for HARC is the reiproal of (3):
(F !)2
M T T F F +1
M T T FHARC ≈
(2F + 1)!
M T T RF
(4)
Using 5 aeptors, F = 2, eah with a MTTF of 120
days and a MTTR of 1 day, M T T FHARC is approximately 57,600 days, or 160 years.
IX.
BYZANTINE FAULTS AND HARC
HARC is based on Paxos Commit whih assumes
the aeptors do not have Byzantine faults but whih
7
an happen in a Grid environment. We believe that
aeptors an be implemented as failstop [19℄ servies that are provided as part of a Virtual Organization's role of reating trust between organizations.
Provision of trusted servies is one way of realizing
the Virtual Organization onept. If Byzantine fault
tolerane is neessary then Byzantine Paxos [8℄ ould
be used in HARC.
If a HARC aeptor does have a Byzantine fault it
annot make a reservation without a signed request
from a user. HARC has been designed to deal with
Byzantine fault behaviour from users and RMs.
No o-alloation protool an guarantee that the
reserved resoures will atually be available at the
sheduled time. HARC is a fault tolerant protool
for making a set of reservations, it does not attempt
to make any guarantees one the reservations have
been made. The user may be able to deal with the
situation were a resoure is unavailable at the sheduled time by booking extra resoures that an at as
bakup. This is an appliation spei solution and
again illustrates the importane of end-to-end arguments [35℄. HARC provides a fault tolerant servie
to the user but fault tolerane is still neessary at
the appliation level.
X.
HARC/1
HARC/1, an implementation of HARC, has been
suessfully demonstrated at SuperComputing 2005
and iGrid 2005 [20℄ where it was used to o-shedule
ten ompute and two network resoures. HARC/1
is implemented using Java with sample RMs implemented in Perl. HARC/1 is available at http://
www.t.lsu.edu/personal/malaren/CoShed.
mean that the reliability of the overall system improves. Just as Beowulf systems built out of ommodity omponents have displaed expensive superomputers so lusters of PCs are displaing expensive
mainframe type systems [15℄. HARC demonstrates
how important servies an be made fault tolerant
to reate a fault tolerant Grid.
HARC has been demonstrated to be a seure, fault
tolerant approah to building o-alloation servies.
It has also been shown that the user and RM roles in
HARC are simple. The statetransitions for the RM
in HARC are the same as for the two phase approah
illustrated in Fig. 1. The only extra requirement
for the RM over the two phase approah is that it
must broadast a opy of its Booking Held or Abort
message to all aeptors.
The use of HTTP ontributes to the simpliity of
HARC. HTTP is a well understood appliation protool with strong library support in many programming languages. HARC demonstrates through its
use of HTTP and URIs the eetiveness of REST as
an approah to developing Grid servies.
HARC's funtionality an be extended by adding
other servies, for example to support o-sheduling
and reservation monitoring. Extending funtionality
by adding servies, rather than modifying existing
servies, indiates good design and a salable system
in aordane to REST.
The opaqueness of the sub-requests to the aeptors means that HARC has the potential to be used
for something other than o-alloation. For example
a HARC implementation ould be used as an implementation of Paxos Commit to support distributed
transations.
XII.
XI.
ACKNOWLEDGEMENTS
CONCLUSION
As the size of a distributed system inreases so to
does the probablity that some omponent in the system is faulty. However, if a fault tolerant approah
is applied then inreasing the size of the system an
The authors would like to thank Jim Gray, Savas
Parastatidis and Dean Kuo for disussion and enouragement. The work is supported in part through
NSF Award #0509465, "EnLIGHTened Computing".
[1℄ C. Adams and S. Farrell. Internet X.509 Publi
Key Infrastruture Certiate Management Protools. IETF RFC 2510, 1999.
[2℄ R. Allan et al. Building the e-Siene Grid in the
UK: Grid Information Servies. Proeedings of UK
e-Siene All Hands Meeting. 2003.
[3℄ A. Andrieux et al. Web Servies Agreement. Draft
GGF Reommendation, September 2005.
[4℄ D. Baker and M. M Keown. Building the e-Siene
Grid in the UK: Providing a software toolkit to en-
able operational monitoring and Grid integration.
Proeedings of UK e-Siene All Hands Meeting.
2003.
[5℄ M. Bartel et al. XML-Signature Syntax and Proessing. W3C Reommendation, February 2002.
[6℄ T. Berners-Lee, R. Fielding and L. Masinter. Uniform Resoure Identier (URI): Generi Syntax.
IETF RFC 3986, 2005.
[7℄ R. Blake et al. The teragyroid experiment
superomputing 2003. Sienti Computing,
8
13(1):1-17, 2005.
[8℄ M. Castro and B. Liskov. Pratial Byzantine fault
tolerane. Proeedings of 3rd OSDI, New Orleans,
1999.
[9℄ K. Czajkowski et al. A protool for negotiating
servie level agreements and oordinating resoure
management on distributed systems. Proeedings
of 8th International Workshop on Job Sheduling
Strategies for Parallel Proessing, eds D Feitelson,
L. Rudolph and U. Shwiegelshohn, Leture Notes
in Computer Siene, 2537, Springer Verlag, 2002.
[10℄ K. Czajkowski et al. Grid Information Servies for
Distributed Resoure Sharing. Proeedings of the
Tenth IEEE International Symposium on HighPerformane Distributed Computing (HPDC-10),
IEEE Press, 2001.
[11℄ R. Fielding et al. Hypertext Transfer Protool HTTP/1.1, IETF RFC 2616, 1999.
[12℄ R. Fielding. Arhitetural Styles and the Design of
Network-based Software Arhitetures. PhD thesis.
University of California, Irvine, 2000.
[13℄ M. Fisher, N. Lynh, and M. Paterson. Impossibility of distributed onsensus with one faulty proess.
Journal of the ACM, 32(2), 1985.
[14℄ I. Foster, C. Kesselman and S. Tueke. The Anatomy
of the Grid: Enabling Salable Virtual Organizations. International J. Superomputer Appliations,
15(3), 2001.
[15℄ S. Ghemawat, H. Gobio, and S. Leung. The Google
Filesystem. 19th ACM Symposium on Operating
Systems Priniples, Lake George, NY, Otober,
2003.
[16℄ M. Gudgin et al. SOAP Version 1.2 Part 1: Messaging Framework. W3C Reommendation, June 2003.
[17℄ J. Gray. Notes on data base operating systems. In
R. Bayer, R. Graham, and G. Seegmuller, editors,
Operating Systems: An Advaned Course, volume
60 of Leture Notes in Computer Siene. SpringerVerlag, Berlin, Heidelberg, New York, 1978.
[18℄ J. Gray and L. Lamport. Consensus on Transation Commit. Mirosoft Researh Tehnial Report
MSR-TR-2003-96, 2005.
[19℄ J. Gray and A. Reuter. Transation Proessing:
Conepts and Tehniques. Morgan Kaufmann Publishers In., San Franiso, CA, USA, 1992.
[20℄ A. Hutanu et al. Distributed and ollaborative visualization of large data sets using high-speed networks. Submitted to proeedings of iGrid 2005, Future Generation Computer Systems. The International Journal of Grid Computing: Theory, Methods
and Appliations, 2005.
[21℄ T. Imamura, B. Dillaway and E. Simon. XML Enryption Syntax and Proessing. W3C Reommendation, Deember 2002.
[22℄ I. Jaobs and N. Walsh. Arhiteture of the World
Wide Web, Volume One. W3C Reommendation,
Deember 2004.
[23℄ S. Jha et al. Spie: Simulated pore interative omputing environment using grid omputing to understand dna transloation aross protein nanopores
embedded in lipid membranes. Proeedings of the
UK e-Siene All Hands Meetings, 2005.
[24℄ D. Kuo and M. M Keown. Advane reservation
and o-alloation for Grid Computing. In First International Conferene on e-Siene and Grid Computing, volume e-siene, IEEE Computer Soiety
Press, 2005.
[25℄ L. Lamport. Time, Cloks and the Ordering of
Events in a Distributed System. Communiations of
the ACM 21, 7, 1978.
[26℄ L. Lamport, M. Pease and R. Shostak. Reahing
Agreement in the Presene of Faults. Journal of the
Assoiation for Computing Mahinery 27, 2, 1980.
[27℄ L. Lamport, M. Pease and R. Shostak. The Byzantine Generals Problem. ACM Transations on Programming Languages and Systems 4, 3, 382-401,
1982.
[28℄ L. Lamport. The Part-Time Parliament. ACM
Transations on Computer Systems 16, 2, 133-169,
1998.
[29℄ L. Lamport. Paxos Made Simple. ACM SIGACT
News (Distributed Computing Column) 32, 4
(Whole Number 121, Deember 2001) 18-25, 2001.
[30℄ B. Lampson. Hints for omputer system design.
ACM Operating Systems Rev. 17, 5, pp 33-48, 1983.
[31℄ B. Lampson. How to build a highly available system using onsensus. In Distributed Algorithms, ed.
Babaoglu and Marzullo, Leture Notes in Computer
Siene 1151, Springer, 1996
[32℄ E. Resorla. HTTP Over TLS. IETF RFC 2818,
2000.
[33℄ L. Roberts. Resoure Sharing Computer Networks.
IEEE International Conferene, New York City.
1968.
[34℄ A. Roy. End-to-End Quality of Servie for High-end
Appliations. PhD thesis. University Of Chiago,
Illinois, 2001.
[35℄ J. Saltzer, D. Reed and D. Clark. End-to-end arguments in system design. Proeedings of the 2nd International Conferene Distributed Computing Systems, Paris, 1981.
[36℄ D. Skeen. Nonbloking ommit protools. In SIGMOD '81: Proeedings of the 1981 ACM SIGMOD
International Conferene on Management of Data.
ACM Press, 1981.
[37℄ TeraGrid. http://www.teragrid.org/.
[38℄ UK National Grid Servie. http://www.ngs.a.uk/.
[39℄ K. Yoshimoto, P. Kovath and P. Andrews. Cosheduling with usersettable reservations. In Job
Sheduling Strategies for Parallel Proessing, eds E.
Frahtenberg, L. Rudolph and U. Shwiegelshohn,
Leture Notes in Computing Siene, 3834, Springer,
2005.
Download