d-functions of individual elements are:

advertisement
1
Reliability and Performance of Star Topology Grid Service with
Precedence Constraints on Subtask Execution
Gregory Levitin a, Yuan-Shun Dai b, Hanoch Ben-Haim a
a
The Israel Electric Corporation Ltd., Reliability Department, P. O. Box 10,
Haifa 31000, Israel
b
Department of Computer & Information Science, Purdue University School of Science,
Indiana University, Purdue University, Indianapolis, 46202, USA
Abstract - The paper considers grid computing systems with star architecture in which
the resource management system (RMS) divides service task into subtasks and sends
the subtasks to different specialized resources for execution. In order to provide desired
level of service reliability the RMS can assign the same subtasks to several independent
resources for parallel execution. Some subtasks cannot be executed until they have
received input data, which can be the result of other subtasks. This imposes precedence
constraints on the order of subtask execution. The service reliability and performance
indices are introduced and a fast numerical algorithm for their evaluation given any
subtask distribution is suggested. Illustrative examples are presented.
Index Terms - Grid system, service time, service reliability, subtask distribution,
precedence constraints, universal generating function.
ACRONYMS
RMS
pmf
u-function
resource management system
probability mass function
universal generating function
2
T*
R(T*)
W
tij
NOTATION
unity function: 1(TRUE) = 1, 1(FALSE) = 0
probability of event e
number of subtasks
computational complexity of subtask j
probability that resource j does not fail during time t
processing speed of resource j
failure rate of resource j
data transmission speed (bandwidth) of communication channel j
amount of data transmitted between the RMS and the resource processing
subtask i
failure rate of communication channel j
maximum allowed service time
probability that service time is less than T*
conditional expected system execution time
random time of subtask i execution by resource j
tˆij
~
tij
realization of tij when subtask i is successfully completed by resource j
random completion time for subtask i executed by resource j
Ti
random time of the beginning of subtask i execution
T̂il
~
Ti
l-th realization of Ti
1(x)
Pr(e)
m
cj
pj(t)
xj
j
sj
ai
j
qil
uij (z)
u~ij ( z, t s )
U i (z )
u~i ( z, t s )
~
Ui ( z)
i
i
random completion time for subtask i
Pr(T  Tˆ )
i
il
u-function representing pmf of tij
u-function representing conditional pmf of ~
tij given the execution of subtask i
starts at time ts
u-function representing pmf of Ti
~
u-function representing conditional pmf of Ti given the execution of subtask i
starts at time ts
~
u-function representing pmf of Ti
set of resources processing subtask i
set of immediate predecessors of subtask i
I. INTRODUCTION
Grid computing (Foster & Kesselman, 2003) is a newly developed technology for
complex systems with large-scale resource sharing, wide-area communication, and multiinstitutional collaboration. This technology attracts much attention last years. Foster et al.
(2001) anatomized the grid computing and presented the concept of virtualization in the grid
that masks its heterogeneous nature; Kumar (2000) presented a protocol based on the
3
SuperGrid that can reach not only high performance but also the high availability; and
Berman et al. (2003) presented a practical grid project, called AppLeS (Application Level
Scheduling) which provided a methodology, application software, and software
environments for adaptively scheduling and deploying applications in dynamic,
heterogeneous, multi-user grid environments. Many experts believe that the grid
technologies will offer a second chance to fulfill the promises of the Internet.
The real and specific problem that underlies the Grid concept is coordinated resource
sharing and problem solving in dynamic, multi-institutional virtual organizations (Foster et
al., 2001). The sharing that we are concerned with is not primarily file exchange but rather
direct access to computers, software, data, and other resources. This is required by a range of
collaborative problem-solving and resource-brokering strategies emerging in industry,
science, and engineering. This sharing is controlled by the Resource Management System
(RMS), see e.g. Krauter et al. (2002) and Nabrzyski et al. (2003), with resource providers
and consumers defining what is shared, who is allowed to share, and the conditions under
which the sharing occurs.
Recently appeared Open Grid Services Architecture (Foster et al., 2002) enables the
integration of services and resources across distributed, heterogeneous, dynamic virtual
organizations, and also provides users a platform to easily request grid services. A grid
service is desired to execute a certain task under the control of the RMS. When the RMS
receives a service request from a user, the task can be divided into a set of subtasks that are
executed in parallel. The RMS assigns those subtasks to available resources for execution.
After the resources finish the assigned jobs, they return the results back to the RMS and then
the RMS integrates the received results into entire task output which is requested by the user.
4
The grid service process can be approximated by a structure with star topology, as
depicted in Fig. 1, where the center is the RMS directly connected with the resources through
respective communication channels.
Resource
Resource
Resource
RMS
Resource
Resource
Request for service
Fig. 1 Grid system with star architecture
The performance of grid computing is of great concern (Abramson et al., 2000). Usually
the measure of grid performance is the task execution time (service time). This index can be
significantly improved by using the RMS that divides a task into a set of subtasks which can
be executed in parallel by multiple online resources. Many complicated and time-consuming
tasks that could not be implemented before are currently working well under the grid
computing environment.
Some subtasks cannot be executed until they have received input data, which can be
result of other subtasks. This imposes precedence constraints on the order of subtask
execution. Such phenomenon of data dependence is very common, but often ignored by
parallel computing assumption. Since various services are provided by the grid system, one
cannot expect that all services can be completely divided into parallel subtasks without any
5
data dependence. Therefore, it is more practical to involve the precedence constraints into
the grid service models even though it may make the analysis more complicated.
The problem of evaluating the entire task execution time for given distribution of
subtasks among resources is well studied in multiprocessor systems, where algorithms for
subtask distribution minimizing the execution time are developed (Wu et al., 2004).
However in distributed grid systems where the availability of elements can be much lower
than in multiprocessor systems the service time cannot be evaluated without considering the
reliability aspect.
It is observed in many grid projects that the service time is a random variable, e.g.
Hamscher et al. (2000). However, no reasonable models were presented to analyze and
evaluate the grid service performance due to the dynamic and complex nature of the grid.
The random service time is actually affected by many factors. First, there are many resources
available online, that have different task processing speeds. Thus, the task execution time
can vary depending on which resource is assigned to execute the task/subtasks. Second,
some resources can fail when running the jobs, so the execution time is also affected by the
resource reliability. Similarly, the communication links in grid service can fail during the
data transmission. Thus, the communication reliability influences the service time as well as
data transmission speed in the communication channels. One of the ways of service
reliability improvement is assigning the same subtask to several (redundant) resources for
parallel execution.
Finding the distribution of the random service time is important for evaluating the grid
performance and improving the RMS functioning. This paper presents a novel model and
algorithm that obtains the distribution of service time in the grid with star topology taking
the
subtask
precedence
constraints
and
service
reliability
into
account.
6
Performance/reliability measures (performability and expected execution time) are derived
from this distribution.
The existing methods of grid system reliability analysis are based either on extended
binary decision diagram (BDD) technique or on sum-of-disjoint product (SDP) method. The
extended BDD (Zang et al.,2003) uses a decomposition of Boolean functions, whereas SDP
(Veeraraghavan & Trivedi, 1991,
Kumar & Agrawal 1993) uses a Boolean algebra
algorithm based on minpaths or mincuts. Both these algorithms can be applied only under
the assumption that the operational probabilities of the resources and the communication
links are constant. The universal generating function technique suggested in this paper
allows analyst to use much more realistic assumption that the failures of resources and links
follow Poisson processes and to incorporate such factors as the amount of transmitted data,
the computational complexity of the subtasks, the performance of the resources and the
communication links into the reliability model.
The paper is organized as follows. Section 2 presents the grid service performance and
reliability model considering the precedence constraints on subtask execution. Section 3
describes an algorithm that obtains the pmf of service time by using universal generating
function technique. Section 4 provides illustrative examples.
II. THE MODEL
A. Service execution by the grid system with star architecture
Different resources are distributed in the grid system. The considered service can use a
given set of resources. All the resources and communication channels from this set are
available at the time when the request for service arrives to the RMS (unavailable resources
are detected by RMS and, thus, not involved in the service). Each resource is directly
connected to the RMS by single communication channel, which forms the star topology.
7
The service task consists of subtasks that should be executed by resources of different
types. Each subtask is characterized by fixed complexity and by fixed amounts of input and
output data. The request for service (task execution) arrives to the RMS which assigns the
subtasks to different resources for processing. The resources are specialized. Each resource
can process only single subtask when it is available. On the other hand, the same subtask can
be assigned to several resources of the same type for parallel execution. If the same subtask
is processed by several resources, it is completed when first output is returned to the RMS.
The entire task is completed when all of the subtasks are completed and their results are
returned to the RMS from the resources.
Some subtasks require outputs of previous subtasks for their execution. The order of
subtasks' execution is determined by precedence constraints.
The resource is involved in data exchange process. Therefore, if resource failure or
communication channel failure occurs before the end of output data transmission from the
resource to the RMS, the subtask fails (cannot be completed).
B. Assumptions
1. When the RMS gets all the data necessary for execution of some subtasks, it sends the
data to the corresponding resources immediately.
2. Each resource starts processing of the assigned subtask immediately after it gets the
subtask input data from the RMS through the corresponding communication channel. Each
resource sends the output data to the RMS through the same communication channel
immediately after it completes the subtask.
3. Each resource has a given constant processing speed when it is available. Each
resource has a given constant failure rate.
8
4. Each communication channel has constant data transmission speed (bandwidth) when
it is available. Each communication channel has constant failure rate.
5. The subtask processing time is proportional to its computational complexity.
6. The data transmission time is proportional to the amount of data transmitted between
the RMS and a resource.
7. The failure rates of the communication channels or resources are the same when they
are idle or loaded (hot standby model). The failures at different resources and
communication channels are independent.
8. The RMS is fully reliable. The time of task processing by the RMS (subtask
assignment, sending them to the resources, receiving the results and integrating them into
entire task output) is negligible when compared with the subtasks' processing time.
C. Service time distribution
In the considered grid service model the entire task consists of m subtasks with
computational complexities cj and amount of data to be transferred between resource and the
RMS aj (1jm). The precedence constraints on task execution can be represented by mm
matrix H such that hki = 1 if subtask i needs for its execution output data from subtask k and
hki = 0 otherwise (the subtasks can always be numbered such that k<i for any hki = 1).
Therefore, if hki = 1 execution of subtask i cannot begin before completion of subtask k. For
any subtask i one can define a set i of its immediate predecessors: k  i if hki = 1. The
precedence constraints can always be presented in such a manner that the last subtask m
corresponds to final task processing by the RMS when it receives output data of all the
subtasks executed by the grid resources.
The subtask execution time is defined as time from the beginning of input data
transmission from the RMS to a resource to the end of output data transmission from the
9
resource to the RMS. Therefore, (according to assumptions 5 and 6) the random time tij of
subtask i execution by resource j can take two possible values
c
tij  tˆij  i 
xj
ai
sj
(1)
if the resource j and the communication channel j do not fail until the subtask completion
and tij =  otherwise.
Subtask i can be successfully completed by resource j if this resource and communication
link j do not fail before the end of subtask execution. For constant failure rates of resource j
and communication link j (assumptions 3 and 4 presume exponential distribution of time to
failure) one can obtain the conditional probability of subtask success given both resource and
link are available at the beginning of the subtask execution as
p j (tˆij )  e
 ( j  j )tˆij
(2)
These give the conditional distribution of the random subtask execution time tij :
Pr( tij  tˆij )  p j (tˆij ) and Pr( tij  )  1  p j (tˆij ) .
Assume that each subtask i is assigned by the RMS to resources composing set i
( i  j   ). The RMS can initiate execution of any subtask j (send the data to all the
resources from i) only after completion of every subtask k  i . Therefore the random time
of the start of subtask i execution Ti can be determined (according to assumption 1) as
~
Ti  max (Tk )
ki
(3)
~
where Tk is random completion time for subtask k. If i   , i.e. subtask i does not need
data produced by any other subtask, the subtask execution starts without delay: Ti = 0. If
i   , Ti can have different realizations T̂il (1lLi).
10
Having the time Ti when the execution of subtask i starts and the time tij of subtask i
execution by resource j one obtains (according to assumption 2) the completion time for
subtask i executed by resource j as
~
tij  Ti  tij .
(4)
In order to obtain the distribution of random time ~
tij one has to take into account that
~
probability of any realization of tij  Tˆil  tˆij is equal to product of probabilities of three
events:
-
execution of subtask i starts at time T̂il : qil=Pr( Ti = T̂il );
-
resource j does not fail before start of execution of subtask i: pj( T̂il );
-
resource j does not fail during the execution of subtask i: pj( tˆij ).
Therefore, (according to assumption 7) the conditional distribution of the random time ~
tij
given execution of subtask i starts at time T̂il ( Ti = T̂il ) takes the form
 (  )(Tˆ  tˆ )
~
Pr( tij  Tˆil  tˆij ) =pj( Tˆil )pj( tˆij ) = pj( Tˆil + tˆij ) = e j j il ij ,
Pr( ~
tij   )=1- pj( Tˆil + tˆij )=1- e
 ( j  j )(Tˆil  tˆij )
(5)
.
~
The random time of subtask i completion Ti is equal to the shortest time when one of the
resources from i completes the subtask execution:
~
Ti  min (~
tij ) .
ji
(6)
According to the definition of the last subtask m the time of its beginning Tm corresponds to
the time of service execution by the grid (according to assumption 8, the time of the task
processing by RMS is neglected). Therefore the random service time is equal to Tm. Having
11
the distribution (pmf) of the random value Tm in the form qml  Pr(Tm  Tˆml ) for 1lLm
one can evaluate the reliability and performance indices of the service.
C. Service reliability and expected performance
In order to estimate both the service reliability and its performance, different measures
can be used depending on the application. In applications where the execution time of each
task (service time) is of critical importance, the system reliability R(T*) is defined (according
to performability concept in Tai et al., 1993 and Meyer, 1980) as a probability that the
correct output is produced in time less than T*. This index can be obtained as
Lm
R(T *)   qml 1(Tˆml  T *) .
(7)
l 1
In applications where the average service performance (the number of executed tasks
over a fixed time) is of interest (Grassi et al., 1988), the service reliability is defined as the
probability that it produces correct outputs without respect to the service time. This index
can be referred to as R(). The conditional expected service time W is considered to be a
measure of its performance. This index determines the expected service time given that the
service does not fail. It can be obtained as
Lm
W   Tˆml qml / R().
(8)
l 1
The following section presents an algorithm for determining the distribution of the
service time Tm .
III. ALGORITHM FOR DETERMINING THE PMF OF THE SERVICE TIME
The procedure used in this paper for the evaluation of service time distribution is based
on the universal generating function (u-function) technique, which was introduced in
12
(Ushakov, 1987) and which proved to be very effective for the reliability evaluation of
different types of multi-state systems (Levitin et al., 1998, Lisnianski and Levitin, 2003).
The u-function representing the pmf of a discrete random variable Y is defined as a
polynomial
u( z) 
K
 k z y k ,
(9)
k 1
where the variable Y has K possible values and k is the probability that Y is equal to yk.
To obtain the u-function representing the pmf of a function of two independent random
variables (Yi, Yj), composition operators are introduced. These operators determine the ufunction for (Yi, Yj) using simple algebraic operations on the individual u-functions of the
variables. All of the composition operators take the form
U(z) = ui ( z )  u j ( z ) 

Ki
 ik z
k 1
yik
Kj
  jh z

y jh
h 1

Ki K j
  ik jh z
 ( yik , y jh )
(10)
k 1h 1
The u-function U(z) represents all of the possible mutually exclusive combinations of
realizations of the variables by relating the probabilities of each combination to the value of
function (Yi, Yj) for this combination.
In the case of grid system, the u-function uij (z) can define pmf of execution time for
subtask i assigned to resource j. This u-function takes the form
ˆ
t
uij ( z )  p j (tˆij ) z ij  (1  p j (tˆij )) z 
(11)
where tˆij and p j (tˆij ) are determined according to Eqs. (1) and (2) respectively.
The pmf of the random start time Ti for subtask i can be represented by u-function Ui(z)
taking the form
Li
ˆ
U i ( z )   qil z Til ,
l 1
(12)
13
where qil  Pr(Ti  Tˆil ) .
For any realization T̂il of Ti the conditional distribution of completion time ~
t ij for
subtask i executed by resource j given Ti  Tˆil according to (5) can be represented by the ufunction
Tˆ  tˆ
u~ij ( z, Tˆil )  p j (Tˆil  tˆij ) z il ij  (1  p j (Tˆil  tˆij )) z  .
(13)
The total completion time of subtask i assigned to a pair of resources j and d is equal to
the minimum of completion times for these resources according to Eq. (6). To obtain the ufunction representing the pmf of this time, given Ti  Tˆil , composition operator with
(Yj, Yd) = min(Yj ,Yd) should be used:
Tˆ  tˆ
u~i ( z , Tˆil )  u~ij ( z , Tˆil )  u~id ( z , Tˆil )  [ p j (Tˆil  tˆij ) z il ij  (1  p j (Tˆil  tˆij )) z  ]
min
ˆ ˆ
 [ pd (Tˆil  tˆid ) z Til  t id  (1  pd (Tˆil  tˆid )) z  ]
(14)
min
Tˆil  min( tˆij , tˆid )
Tˆil  tˆid
 p j (Tˆil  tˆij ) pd (Tˆil  tˆid ) z
 pd (Tˆil  tˆid )(1  p j (Tˆil  tˆij )) z
Tˆ  tˆ
 p j (Tˆil  tˆij )(1  pd (Tˆil  tˆid )) z il ij  (1  p j (Tˆil  tˆij ))(1  pd (Tˆil  tˆid )) z  .
~
The u-function u~i ( z,Tˆil ) representing the conditional pmf of completion time Ti for subtask i
assigned to all of the resources from set i ={j1, … ,ji} can be obtained as
u~i ( z,Tˆil )  u~ij1 ( z,Tˆil )  u~ij2 ( z,Tˆil )  ...  u~iji ( z,Tˆil ) .
min
min
min
(15)
u~i ( z,Tˆil ) can be obtained recursively:
u~i ( z , Tˆil )  u~ij1 ( z , Tˆil ),
u~i ( z,Tˆil )  u~i ( z,Tˆil )  u~ie ( z,Tˆil ) for e = j2, …, ji.
min
(16)
Having the probabilities of the mutually exclusive realizations of start time Ti,
qil  Pr(Ti  Tˆil )
and
u-functions
u~i ( z,Tˆil ) representing
corresponding
conditional
14
distributions of task i completion time, we can now obtain the u-function representing the
~
unconditional pmf of completion time Ti as
Li
~
U i ( z )   qil u~i ( z , Tˆil ) .
(17)
l 1
~
~
Having u-functions U k ( z ) representing pmf of the completion time Tk for any subtask
k i  {k1,..., ki } , one can obtain the u-functions U i (z ) representing pmf of subtask i start
time Ti according to (3) as
Li
ˆ
~
~
~
U i ( z )  U k1 ( z )  U k 2 ( z )  ...  U ki ( z )   qil z Til .
max
max
max
(18)
l 1
U i (z ) can be obtained recursively:
U i ( z)  z 0 ,
~
U i ( z )  U i ( z )  U e ( z ) for e = k1, …, ki.
max
(19)
It can be seen that if i   then U i ( z )  z 0 .
The final u-function Um(z) represents the pmf of random task completion time Tm in the
form
Lm
ˆ
U m ( z )   qml z Tm l .
(20)
l 1
Using the operators defined above one can obtain the service reliability and
performance indices by implementing the following algorithm:
1. Determine tˆij for each subtask i and resource j  i using Eq. (1);
~
Define for each subtask i (1im) U i ( z)  U i ( z) = z0.
2. For all i:
15
~
If i  0 or if for any k  i U k ( z )  z0 (u-functions representing the completion
times of all of the predecessors of subtask i are obtained)
Li
ˆ
2.1. Obtain U i ( z )   qil z Til using recursive procedure (19);
l 1
2.2. For l = 1, …, Li:
2.2.1. For each j  i obtain u~ij ( z , Tˆil ) using Eq. (13);
2.2.2. Obtain u~i ( z ) using recursive procedure (16);
~
2.3. Obtain U i ( z) using Eq. (17).
3. If U m (z ) = z0 return to step 2.
4. Obtain reliability and performance indices R(T*) and W using equations (7) and (8).
It should be noted that the presented algorithm works also when all of the subtasks are
independent. In this case i  0 and Ui(z)=z0 for 1im. However in this special case much
simpler procedure suggested in (Levitin and Dai, 2006) can be applied.
IV. ILLUSTRATIVE EXAMPLES
A. Analytical example
This example presents analytical derivation of the indices R(T*) and W for simple grid
service that uses six resources. Assume that the RMS divides the service task into three
subtasks. The first subtask is assigned to resources 1 and 2, the second subtask is assigned to
resources 3 and 4, the third subtask is assigned to resources 5 and 6:
1 = {1,2}, 2 = {3,4}, 3 = {5,6}.
The failure rates of the resources and communication channels and subtask execution
times are presented in Table 1.
16
Table 1. Parameters of grid system for analytical example
No of
subtask i
No of
resource j
j+j
(sec-1)
tˆij
(sec)
p j (tˆij )
1
0.0025
100
0.779
2
0.00018
180
0.968
3
0.0003
250
-
4
0.0008
300
-
5
0.0005
300
0.861
6
0.0002
430
0.918
1
2
3
Subtasks 1 and 3 get the input data directly from the RMS, subtask 2 needs the output of
subtask 1, the service task is completed when the RMS gets the outputs of both subtasks 2
and 3: 1  3   , 2  {1} , 4  {2,3} . These subtask precedence constraints can be
represented by the directed graph in Fig. 2.
1
2
3
4
Fig. 2. Subtask execution precedence constraints for analytical example
Since 1  3   , the only realization of start times T1 and T3 is 0 and therefore,
U1(z)=U2(z)=z0 . According to step 2 of the algorithm we can obtain the u-functions
~
~
representing pmf of completion times ~
t11 , ~
t12 , t35 and t36 . In order to determine the
subtask execution time distributions for the individual resources, define the u-functions uij(z)
according to Table 1 and Eq. (10):
u~11 ( z,0)  exp( 0.0025 100) z100  [1  exp( 0.0025 100)] z   0.779z100 + 0.221z.
17
In the similar way we obtain
u~12 ( z,0) = 0.968z180 + 0.032z;
u~35 ( z ,0) = 0.861z300 + 0.139z; u~36 ( z,0) = 0.918z430 + 0.082z.
The u-function representing the pmf of the completion time for subtask 1 executed by
both resources 1 and 2 is
~
U1 ( z )  u~1( z,0) = u~11( z,0)  u~12 ( z,0) = (0.779z100 + 0.221z)  (0.968z180 + 0.032z)
min
min
=0.779z100 +0.214z180 + 0.007z.
The u-function representing the pmf of the completion time for subtask 3 executed by
both resources 5 and 6 is
~
U 3 ( z )  u~3 ( z,0) = u~35 ( z )  u~36 ( z ) = (0.861z300 + 0.139z)  (0.918z430 + 0.082z)
min
min
=0.861z300 +0.128z430 + 0.011z.
Execution of subtask 2 begins immediately after completion of subtask 1. Therefore,
~
U2(z) = U1 ( z ) =0.779z100 +0.214z180 + 0.007z
(T2 has three realizations 100, 180 and ).
The u-functions representing the conditional pmf of the completion times for the subtask
2 executed by individual resources are obtained as follows.
u~23 ( z,100)  e 0.0003 (100  250 ) z100  250  [1  e 0.0003 (100  250 ) ]z  =0.9z350+0.1z;
u~23 ( z,180)  e 0.0003 (180  250 ) z180  250  [1  e 0.0003 (180  250 ) ]z  =0.879z430+0.121z;
u~23 ( z , )  z  ;
u~24 ( z,100)  e 0.0008 (100  300 ) z100  300  [1  e 0.0008 (100  300 ) ]z  =0.726z400+0.274z;
u~24 ( z,180)  e 0.0008 (180  300 ) z180  300  [1  e 0.0008 (180  300 ) ]z  =0.681z480+0.319z;
u~24 ( z, )  z  .
18
The u-functions representing the conditional pmf of subtask 2 completion time are:
u~2 ( z,100)  u~23 ( z,100)  u~24 ( z,100)  (0.9z350+0.1z)  (0.726z400+0.274z)
min
min
=0.9z350+0.073z400+0.027z;
u~2 ( z,180)  u~23 ( z,180)  u~24 ( z,180)  (0.879z430+0.121z)  (0.681z480+0.319z)
min
min
=0.879z430+0.082z480+0.039z;
u~2 ( z , )  u~23 ( z , )  u~24 ( z , )  z  .
min
According to Eq. (17) the unconditional pmf of subtask 2 completion time is represented by
the following u-function
~
U 2 ( z )  0.779u~2 ( z,100)  0.214u~2 ( z,180)  0.007 z 
=0.779(0.9z350+0.073z400+0.027z)+0.214(0.879z430+0.082z480+0.039z)+0.007z
=0.701z350+0.056z400+0.188z430+0.018z480+0.037z
The service task is completed when subtasks 2 and 3 return their outputs to the RMS
(which corresponds to the beginning of subtask 4). Therefore, the u-function representing the
pmf of the entire service time is obtained as
~
~
U 4 ( z)  U 2 ( z)  U 3 ( z)
max
=(0.701z350+0.056z400+0.188z430+0.018z480+0.037z)  (0.861z300 +0.128z430 +
max
0.011z)=0.603z350 +0.049z400 +0.283z430 +0.017z480 +0.048z.
The pmf of the service time is:
Pr(T4 = 350) = 0.603; Pr(T4 = 400) = 0.049;
Pr(T4 = 430) = 0.283; Pr(T4 = 480) = 0.017; Pr(T4 = ) = 0.048.
From the obtained pmf we can calculate the service reliability using Eq. (7):
R(T *)  0.603 for 350< T * 400; R(T *)  0.652 for 400< T * 430;
19
R(T *)  0.935 for 430< T * 480; R ()  0.952
and the conditional expected service time according to Eq. (8):
W = (0.603350 + 0.049400 + 0.283430 + 0.017480) / 0.952 = 378.69 sec.
B. Numerical example
This example illustrates the use of the numerical algorithm suggested in section III for
analysis of reliability and performance of a grid service that uses 15 specialized resources
distributed in a grid system. The entire service task consists of eight subtasks with
precedence constraints presented in Fig. 3 (note that subtask 9 corresponds to the final task
processing by the RMS). The subtasks can be assigned to the resources in accordance with
their specialization presented in Table 2. The subtask execution times and failure rates of
resources and corresponding communication channels are also presented in Table 2.
Subtasks 2, 4 and 7 can be executed by single specialized resources 4, 7 and 12
respectively. The rest of subtasks can be executed by several resources in parallel. Observe
that the model suggested can handle the case when some of subtasks are performed by the
RMS itself. In our example subtask 7 is performed by the fully reliable RMS (resource 12),
which corresponds to 12+12 = 0.
1
5
7
2
6
8
3
4
9
Figure 3. Subtask execution precedence constraints for numerical example
20
The service reliability and performance indices obtained by the presented algorithm for
the given set of parameters are presented in Table 3. In this table case A corresponds to the
situation when all of the resources are used for the service execution.
The suggested procedure for service performance evaluation allows analyst to easily
estimate the effect of changes in the grid on the service performance. For example we can
consider the effect of removal of some resources from the grid. Let resources 3, 11 and 14 be
considered as candidates for removal (due to contracting cost considerations). It is not
evident removal of which one of the resources causes the greatest service performance
deterioration since it depends on combination of several factors (redundancy of subtask
execution, performance and reliability of resources, precedence of subtask execution). Only
evaluating the entire system performance and reliability after removal of the elements can
help to compare the resource removal scenarios. The results obtained by the algorithm for
three different cases of single resource removal (cases B-D in Table 3) show that the removal
of resource 11 causes the greatest deterioration of service performance. The system
performance indices obtained for the case when all the three resources are removed are given
for comparison (case E in Table 3).
Table 2 Parameters of grid system
No of resource
j
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
j+j
(sec -1)
.0...0
.0...0
.0...8
.0...0
.0..0.
.0..00
.0...0
.0...8
.0...0
.0...8
.0...0
.0....
.0...0
.0...3
.0...0
tij
(sec)
08.
03.
08.
8.
0..
0.
00.
88
00.
0..
08.
0.
0.
8.
88
No of subtask that can
be executed
1
1
1
2
3
3
4
5
5
6
6
7
8
8
8
21
The reliability as a function of required service time is presented for all five considered
cases in Fig. 4.
Table 3 Service performance indices for different subtask distributions
Case
A
B
C
D
E
Removed resources
3
11
14
3, 11, 14
Tmin
330
355
330
335
355
Tmax
440
440
430
440
430
R()
0.869
0.841
0.667
0.861
0.640
W
346.385
369.850
336.888
352.901
367.400
1
R (T *)
0.8
0.6
0.4
0.2
0
320
340
360
A
B
380 T* 400
C
D
420
440
E
Fig. 4. Service reliability R(T*) for different combinations of unavailable resources
In order to verify the results obtained by the suggested algorithm, a simulation program
has been developed. To obtain a more realistic estimation of the system behavior the
constant data processing and transmission speeds presented in Table 2 were replaced by
normally distributed random variables (assumptions 3 and 4 were removed). The mean
values of these random variables coincide with the constants from Table 2. The variance for
each variable was set to 100 sec. The failure rates presented in Table 2 were used.
The simulation results for cases A, B and E were obtained by running the simulation
program 1000 times for each case. The obtained reliability and performance indices are
presented in Table 4. The functions R(T*) obtained by the simulation are presented in Fig. 5
where the stepwise curves obtained by the suggested numerical algorithm are given again for
comparison.
22
Table 4: Service performance and reliability indices obtained by the simulation
Tmin
Case
Tmax
W
R()
A
317.83
431.65
0.866
343.11
B
340.11
453.44
0.851
368.24
E
341.00
449.14
0.666
369.04
1
R (T* )
0.8
0.6
0.4
0.2
0
320
340
360
380 T* 400
Simulation Case A
Simulation Case E
Algorithm Case B
420
440
Simulation Case B
Algorithm Case A
Algorithm Case E
Fig. 5. Simulated results (1000 runs) vs. Analytical results.
From Table 4 and Fig. 5, one can see that the service reliability and performance indices
obtained by the suggested algorithm are very close to the simulation results, which witnesses
that the algorithm can be used for the grid service reliability and performance evaluation
with good precision.
V. SUMMARY AND CONCLUSIONS
Grid technology is a newly developed method for large-scale distributed system. This
technology allows effective distribution of computational tasks among different resources
presented in the grid. The resource management system can divide service task into subtasks
and send the subtasks to different specialized resources for parallel execution. In order to
23
provide desired level of service reliability the RMS can assign the same subtasks to several
independent resources of the same type.
In order to evaluate the quality of service its reliability and performance indices should
be defined. This paper considers the indices: service reliability (probability that the service
task is accomplished within a specified time) and conditional expected system time and
presents the numerical algorithm for their evaluation for arbitrary subtask distribution in a
given grid with star architecture taking into account precedence constraints on the sequence
of subtask execution.
Even though the algorithm is based on some simplifying assumptions, it obtains quite
accurate estimates of service reliability and performance indices, which is very helpful for
such practical applications as
- comparison of different resource management alternatives (subtask assignment to different
resources),
- making decisions aimed at service performance improvement based on comparison of
different grid structure alternatives,
- estimating the effect of reliability and performance variation of grid elements on service
reliability and performance.
Further research can include optimization of subtask distribution among available
resources, incorporating variable resource performance into the model, analysis of effect of
transient failures on the grid service reliability and performance.
24
REFERENCES
Abramson, D., Giddy, J., Kotler, L. (2000), High performance parametric modeling with
Nimrod/G, 14th International Parallel and Distributed Processing Symposium, pp. 520-528.
Berman, F., Wolski, R., Casanova, H., Cirne, W., Dail, H., Faerman, M., Figueira, S.,
Hayes, J., Obertelli, G., Schopf, J., Shao, G., Smallen, S., Spring, N., Su, A., Zagorodnov,
D., (2003), Adaptive computing on the Grid using AppLeS, IEEE Transactions on Parallel
and Distributed Systems, vol.14, no.14, pp.369 – 382.
Foster, I., Kesselman, C. (2003), The Grid 2: Blueprint for a New Computing Infrastructure,
Morgan-Kaufmann.
Foster, I., Kesselman, C. and Tuecke, S. (2001), The anatomy of the grid: Enabling scalable
virtual organizations, International Journal of High Performance Computing Applications,
vol. 15, pp. 200-222.
Foster, I., Kesselman, C., Nick, J.M., Tuecke, S. (2002), Grid services for distributed system
integration, Computer, vol. 35, no. 6, pp. 37-46.
Grassi, V., Donatiello L., Iazeolla G., (1988), Performability evaluation of multicomponent
fault tolerant systems, IEEE Transactions on Reliability, vol. 37, no. 2, pp. 216-222.
Hamscher, V., Schwiegelshohn, U., Streit, A., Ramin, Y. (2000), Evaluation of jobscheduling strategies for grid computing, Proceedings of the First IEEE/ACM International
Workshop on Grid Computing, pp. 191-202.
Krauter, K., Buyya, R., Maheswaran, M. (2002), A taxonomy and survey of grid resource
management systems for distributed computing, Software - Practice and Experience, vol. 32,
no. 2, pp. 135-164.
Kumar, A. and Agrawal, D.P. (1993), A generalized algorithm for evaluating distributedprogram reliability, IEEE Transactions on Reliability, 42, pp. 416-424.
Kumar, A. (2000) An efficient SuperGrid protocol for high availability and load balancing,
IEEE Transactions on Computers, vol. 49, no. 10, pp. 1126-1133.
Levitin, G., Lisnianski A., Beh-Haim H., Elmakis D. (1998), Redundancy optimization for
series-parallel multi-state systems, IEEE Transactions on Reliability, vol. 47, pp. 165-172.
Levitin, G., Dai Y. (2006), Service reliability and performance in grid system with star
topology, to appear in Reliability Engineering and System Safety.
Lisnianski A., Levitin G. Multi-state System Reliability, World Scientific, Singapore, 2003.
Meyer, J. (1980), On evaluating the performability of degradable computing systems, IEEE
Transactions on Computers, vol. 29, pp. 720-731.
Nabrzyski, J., Schopf, J.M., Weglarz, J. (2003), Grid Resource Management, Kluwer
Publishing.
Tai, A., Meyer J., Avizienis A. (1993), Performability enhancement of fault-tolerant
software, IEEE Transactions on Reliability, vol. 42, no. 2, pp. 227-237.
Ushakov, I. (1987), Optimal standby problems and a universal generating function, Soviet
Journal of Computer Systems Science, vol. 25, pp. 79-82.
25
Veeraraghavan, M., Trivedi, K.S. (1991), An improved algorithm for symbolic reliability
analysis, IEEE Transactions on Reliability, 40(3), pp. 347-358.
Wu A., Yu H., Jin S., Lin K.-C., and Schiavone G. (2004), An incremental genetic algorithm
approach to multiprocessor scheduling. IEEE Trans. on Parallel andDistributed Systems,
15(9), pp. 824-834.
Zang, X., Wang, D., Sun H., Trivedi, K.S. (2003), A BDD-based algorithm for analysis of
multistate systems with multistate components. IEEE Trans. On Computer, 52(12), pp.
1608-1618.
Download