1 Reliability and Performance of Star Topology Grid Service with Precedence Constraints on Subtask Execution Gregory Levitin a, Yuan-Shun Dai b, Hanoch Ben-Haim a a The Israel Electric Corporation Ltd., Reliability Department, P. O. Box 10, Haifa 31000, Israel b Department of Computer & Information Science, Purdue University School of Science, Indiana University, Purdue University, Indianapolis, 46202, USA Abstract - The paper considers grid computing systems with star architecture in which the resource management system (RMS) divides service task into subtasks and sends the subtasks to different specialized resources for execution. In order to provide desired level of service reliability the RMS can assign the same subtasks to several independent resources for parallel execution. Some subtasks cannot be executed until they have received input data, which can be the result of other subtasks. This imposes precedence constraints on the order of subtask execution. The service reliability and performance indices are introduced and a fast numerical algorithm for their evaluation given any subtask distribution is suggested. Illustrative examples are presented. Index Terms - Grid system, service time, service reliability, subtask distribution, precedence constraints, universal generating function. ACRONYMS RMS pmf u-function resource management system probability mass function universal generating function 2 T* R(T*) W tij NOTATION unity function: 1(TRUE) = 1, 1(FALSE) = 0 probability of event e number of subtasks computational complexity of subtask j probability that resource j does not fail during time t processing speed of resource j failure rate of resource j data transmission speed (bandwidth) of communication channel j amount of data transmitted between the RMS and the resource processing subtask i failure rate of communication channel j maximum allowed service time probability that service time is less than T* conditional expected system execution time random time of subtask i execution by resource j tˆij ~ tij realization of tij when subtask i is successfully completed by resource j random completion time for subtask i executed by resource j Ti random time of the beginning of subtask i execution T̂il ~ Ti l-th realization of Ti 1(x) Pr(e) m cj pj(t) xj j sj ai j qil uij (z) u~ij ( z, t s ) U i (z ) u~i ( z, t s ) ~ Ui ( z) i i random completion time for subtask i Pr(T Tˆ ) i il u-function representing pmf of tij u-function representing conditional pmf of ~ tij given the execution of subtask i starts at time ts u-function representing pmf of Ti ~ u-function representing conditional pmf of Ti given the execution of subtask i starts at time ts ~ u-function representing pmf of Ti set of resources processing subtask i set of immediate predecessors of subtask i I. INTRODUCTION Grid computing (Foster & Kesselman, 2003) is a newly developed technology for complex systems with large-scale resource sharing, wide-area communication, and multiinstitutional collaboration. This technology attracts much attention last years. Foster et al. (2001) anatomized the grid computing and presented the concept of virtualization in the grid that masks its heterogeneous nature; Kumar (2000) presented a protocol based on the 3 SuperGrid that can reach not only high performance but also the high availability; and Berman et al. (2003) presented a practical grid project, called AppLeS (Application Level Scheduling) which provided a methodology, application software, and software environments for adaptively scheduling and deploying applications in dynamic, heterogeneous, multi-user grid environments. Many experts believe that the grid technologies will offer a second chance to fulfill the promises of the Internet. The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (Foster et al., 2001). The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources. This is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is controlled by the Resource Management System (RMS), see e.g. Krauter et al. (2002) and Nabrzyski et al. (2003), with resource providers and consumers defining what is shared, who is allowed to share, and the conditions under which the sharing occurs. Recently appeared Open Grid Services Architecture (Foster et al., 2002) enables the integration of services and resources across distributed, heterogeneous, dynamic virtual organizations, and also provides users a platform to easily request grid services. A grid service is desired to execute a certain task under the control of the RMS. When the RMS receives a service request from a user, the task can be divided into a set of subtasks that are executed in parallel. The RMS assigns those subtasks to available resources for execution. After the resources finish the assigned jobs, they return the results back to the RMS and then the RMS integrates the received results into entire task output which is requested by the user. 4 The grid service process can be approximated by a structure with star topology, as depicted in Fig. 1, where the center is the RMS directly connected with the resources through respective communication channels. Resource Resource Resource RMS Resource Resource Request for service Fig. 1 Grid system with star architecture The performance of grid computing is of great concern (Abramson et al., 2000). Usually the measure of grid performance is the task execution time (service time). This index can be significantly improved by using the RMS that divides a task into a set of subtasks which can be executed in parallel by multiple online resources. Many complicated and time-consuming tasks that could not be implemented before are currently working well under the grid computing environment. Some subtasks cannot be executed until they have received input data, which can be result of other subtasks. This imposes precedence constraints on the order of subtask execution. Such phenomenon of data dependence is very common, but often ignored by parallel computing assumption. Since various services are provided by the grid system, one cannot expect that all services can be completely divided into parallel subtasks without any 5 data dependence. Therefore, it is more practical to involve the precedence constraints into the grid service models even though it may make the analysis more complicated. The problem of evaluating the entire task execution time for given distribution of subtasks among resources is well studied in multiprocessor systems, where algorithms for subtask distribution minimizing the execution time are developed (Wu et al., 2004). However in distributed grid systems where the availability of elements can be much lower than in multiprocessor systems the service time cannot be evaluated without considering the reliability aspect. It is observed in many grid projects that the service time is a random variable, e.g. Hamscher et al. (2000). However, no reasonable models were presented to analyze and evaluate the grid service performance due to the dynamic and complex nature of the grid. The random service time is actually affected by many factors. First, there are many resources available online, that have different task processing speeds. Thus, the task execution time can vary depending on which resource is assigned to execute the task/subtasks. Second, some resources can fail when running the jobs, so the execution time is also affected by the resource reliability. Similarly, the communication links in grid service can fail during the data transmission. Thus, the communication reliability influences the service time as well as data transmission speed in the communication channels. One of the ways of service reliability improvement is assigning the same subtask to several (redundant) resources for parallel execution. Finding the distribution of the random service time is important for evaluating the grid performance and improving the RMS functioning. This paper presents a novel model and algorithm that obtains the distribution of service time in the grid with star topology taking the subtask precedence constraints and service reliability into account. 6 Performance/reliability measures (performability and expected execution time) are derived from this distribution. The existing methods of grid system reliability analysis are based either on extended binary decision diagram (BDD) technique or on sum-of-disjoint product (SDP) method. The extended BDD (Zang et al.,2003) uses a decomposition of Boolean functions, whereas SDP (Veeraraghavan & Trivedi, 1991, Kumar & Agrawal 1993) uses a Boolean algebra algorithm based on minpaths or mincuts. Both these algorithms can be applied only under the assumption that the operational probabilities of the resources and the communication links are constant. The universal generating function technique suggested in this paper allows analyst to use much more realistic assumption that the failures of resources and links follow Poisson processes and to incorporate such factors as the amount of transmitted data, the computational complexity of the subtasks, the performance of the resources and the communication links into the reliability model. The paper is organized as follows. Section 2 presents the grid service performance and reliability model considering the precedence constraints on subtask execution. Section 3 describes an algorithm that obtains the pmf of service time by using universal generating function technique. Section 4 provides illustrative examples. II. THE MODEL A. Service execution by the grid system with star architecture Different resources are distributed in the grid system. The considered service can use a given set of resources. All the resources and communication channels from this set are available at the time when the request for service arrives to the RMS (unavailable resources are detected by RMS and, thus, not involved in the service). Each resource is directly connected to the RMS by single communication channel, which forms the star topology. 7 The service task consists of subtasks that should be executed by resources of different types. Each subtask is characterized by fixed complexity and by fixed amounts of input and output data. The request for service (task execution) arrives to the RMS which assigns the subtasks to different resources for processing. The resources are specialized. Each resource can process only single subtask when it is available. On the other hand, the same subtask can be assigned to several resources of the same type for parallel execution. If the same subtask is processed by several resources, it is completed when first output is returned to the RMS. The entire task is completed when all of the subtasks are completed and their results are returned to the RMS from the resources. Some subtasks require outputs of previous subtasks for their execution. The order of subtasks' execution is determined by precedence constraints. The resource is involved in data exchange process. Therefore, if resource failure or communication channel failure occurs before the end of output data transmission from the resource to the RMS, the subtask fails (cannot be completed). B. Assumptions 1. When the RMS gets all the data necessary for execution of some subtasks, it sends the data to the corresponding resources immediately. 2. Each resource starts processing of the assigned subtask immediately after it gets the subtask input data from the RMS through the corresponding communication channel. Each resource sends the output data to the RMS through the same communication channel immediately after it completes the subtask. 3. Each resource has a given constant processing speed when it is available. Each resource has a given constant failure rate. 8 4. Each communication channel has constant data transmission speed (bandwidth) when it is available. Each communication channel has constant failure rate. 5. The subtask processing time is proportional to its computational complexity. 6. The data transmission time is proportional to the amount of data transmitted between the RMS and a resource. 7. The failure rates of the communication channels or resources are the same when they are idle or loaded (hot standby model). The failures at different resources and communication channels are independent. 8. The RMS is fully reliable. The time of task processing by the RMS (subtask assignment, sending them to the resources, receiving the results and integrating them into entire task output) is negligible when compared with the subtasks' processing time. C. Service time distribution In the considered grid service model the entire task consists of m subtasks with computational complexities cj and amount of data to be transferred between resource and the RMS aj (1jm). The precedence constraints on task execution can be represented by mm matrix H such that hki = 1 if subtask i needs for its execution output data from subtask k and hki = 0 otherwise (the subtasks can always be numbered such that k<i for any hki = 1). Therefore, if hki = 1 execution of subtask i cannot begin before completion of subtask k. For any subtask i one can define a set i of its immediate predecessors: k i if hki = 1. The precedence constraints can always be presented in such a manner that the last subtask m corresponds to final task processing by the RMS when it receives output data of all the subtasks executed by the grid resources. The subtask execution time is defined as time from the beginning of input data transmission from the RMS to a resource to the end of output data transmission from the 9 resource to the RMS. Therefore, (according to assumptions 5 and 6) the random time tij of subtask i execution by resource j can take two possible values c tij tˆij i xj ai sj (1) if the resource j and the communication channel j do not fail until the subtask completion and tij = otherwise. Subtask i can be successfully completed by resource j if this resource and communication link j do not fail before the end of subtask execution. For constant failure rates of resource j and communication link j (assumptions 3 and 4 presume exponential distribution of time to failure) one can obtain the conditional probability of subtask success given both resource and link are available at the beginning of the subtask execution as p j (tˆij ) e ( j j )tˆij (2) These give the conditional distribution of the random subtask execution time tij : Pr( tij tˆij ) p j (tˆij ) and Pr( tij ) 1 p j (tˆij ) . Assume that each subtask i is assigned by the RMS to resources composing set i ( i j ). The RMS can initiate execution of any subtask j (send the data to all the resources from i) only after completion of every subtask k i . Therefore the random time of the start of subtask i execution Ti can be determined (according to assumption 1) as ~ Ti max (Tk ) ki (3) ~ where Tk is random completion time for subtask k. If i , i.e. subtask i does not need data produced by any other subtask, the subtask execution starts without delay: Ti = 0. If i , Ti can have different realizations T̂il (1lLi). 10 Having the time Ti when the execution of subtask i starts and the time tij of subtask i execution by resource j one obtains (according to assumption 2) the completion time for subtask i executed by resource j as ~ tij Ti tij . (4) In order to obtain the distribution of random time ~ tij one has to take into account that ~ probability of any realization of tij Tˆil tˆij is equal to product of probabilities of three events: - execution of subtask i starts at time T̂il : qil=Pr( Ti = T̂il ); - resource j does not fail before start of execution of subtask i: pj( T̂il ); - resource j does not fail during the execution of subtask i: pj( tˆij ). Therefore, (according to assumption 7) the conditional distribution of the random time ~ tij given execution of subtask i starts at time T̂il ( Ti = T̂il ) takes the form ( )(Tˆ tˆ ) ~ Pr( tij Tˆil tˆij ) =pj( Tˆil )pj( tˆij ) = pj( Tˆil + tˆij ) = e j j il ij , Pr( ~ tij )=1- pj( Tˆil + tˆij )=1- e ( j j )(Tˆil tˆij ) (5) . ~ The random time of subtask i completion Ti is equal to the shortest time when one of the resources from i completes the subtask execution: ~ Ti min (~ tij ) . ji (6) According to the definition of the last subtask m the time of its beginning Tm corresponds to the time of service execution by the grid (according to assumption 8, the time of the task processing by RMS is neglected). Therefore the random service time is equal to Tm. Having 11 the distribution (pmf) of the random value Tm in the form qml Pr(Tm Tˆml ) for 1lLm one can evaluate the reliability and performance indices of the service. C. Service reliability and expected performance In order to estimate both the service reliability and its performance, different measures can be used depending on the application. In applications where the execution time of each task (service time) is of critical importance, the system reliability R(T*) is defined (according to performability concept in Tai et al., 1993 and Meyer, 1980) as a probability that the correct output is produced in time less than T*. This index can be obtained as Lm R(T *) qml 1(Tˆml T *) . (7) l 1 In applications where the average service performance (the number of executed tasks over a fixed time) is of interest (Grassi et al., 1988), the service reliability is defined as the probability that it produces correct outputs without respect to the service time. This index can be referred to as R(). The conditional expected service time W is considered to be a measure of its performance. This index determines the expected service time given that the service does not fail. It can be obtained as Lm W Tˆml qml / R(). (8) l 1 The following section presents an algorithm for determining the distribution of the service time Tm . III. ALGORITHM FOR DETERMINING THE PMF OF THE SERVICE TIME The procedure used in this paper for the evaluation of service time distribution is based on the universal generating function (u-function) technique, which was introduced in 12 (Ushakov, 1987) and which proved to be very effective for the reliability evaluation of different types of multi-state systems (Levitin et al., 1998, Lisnianski and Levitin, 2003). The u-function representing the pmf of a discrete random variable Y is defined as a polynomial u( z) K k z y k , (9) k 1 where the variable Y has K possible values and k is the probability that Y is equal to yk. To obtain the u-function representing the pmf of a function of two independent random variables (Yi, Yj), composition operators are introduced. These operators determine the ufunction for (Yi, Yj) using simple algebraic operations on the individual u-functions of the variables. All of the composition operators take the form U(z) = ui ( z ) u j ( z ) Ki ik z k 1 yik Kj jh z y jh h 1 Ki K j ik jh z ( yik , y jh ) (10) k 1h 1 The u-function U(z) represents all of the possible mutually exclusive combinations of realizations of the variables by relating the probabilities of each combination to the value of function (Yi, Yj) for this combination. In the case of grid system, the u-function uij (z) can define pmf of execution time for subtask i assigned to resource j. This u-function takes the form ˆ t uij ( z ) p j (tˆij ) z ij (1 p j (tˆij )) z (11) where tˆij and p j (tˆij ) are determined according to Eqs. (1) and (2) respectively. The pmf of the random start time Ti for subtask i can be represented by u-function Ui(z) taking the form Li ˆ U i ( z ) qil z Til , l 1 (12) 13 where qil Pr(Ti Tˆil ) . For any realization T̂il of Ti the conditional distribution of completion time ~ t ij for subtask i executed by resource j given Ti Tˆil according to (5) can be represented by the ufunction Tˆ tˆ u~ij ( z, Tˆil ) p j (Tˆil tˆij ) z il ij (1 p j (Tˆil tˆij )) z . (13) The total completion time of subtask i assigned to a pair of resources j and d is equal to the minimum of completion times for these resources according to Eq. (6). To obtain the ufunction representing the pmf of this time, given Ti Tˆil , composition operator with (Yj, Yd) = min(Yj ,Yd) should be used: Tˆ tˆ u~i ( z , Tˆil ) u~ij ( z , Tˆil ) u~id ( z , Tˆil ) [ p j (Tˆil tˆij ) z il ij (1 p j (Tˆil tˆij )) z ] min ˆ ˆ [ pd (Tˆil tˆid ) z Til t id (1 pd (Tˆil tˆid )) z ] (14) min Tˆil min( tˆij , tˆid ) Tˆil tˆid p j (Tˆil tˆij ) pd (Tˆil tˆid ) z pd (Tˆil tˆid )(1 p j (Tˆil tˆij )) z Tˆ tˆ p j (Tˆil tˆij )(1 pd (Tˆil tˆid )) z il ij (1 p j (Tˆil tˆij ))(1 pd (Tˆil tˆid )) z . ~ The u-function u~i ( z,Tˆil ) representing the conditional pmf of completion time Ti for subtask i assigned to all of the resources from set i ={j1, … ,ji} can be obtained as u~i ( z,Tˆil ) u~ij1 ( z,Tˆil ) u~ij2 ( z,Tˆil ) ... u~iji ( z,Tˆil ) . min min min (15) u~i ( z,Tˆil ) can be obtained recursively: u~i ( z , Tˆil ) u~ij1 ( z , Tˆil ), u~i ( z,Tˆil ) u~i ( z,Tˆil ) u~ie ( z,Tˆil ) for e = j2, …, ji. min (16) Having the probabilities of the mutually exclusive realizations of start time Ti, qil Pr(Ti Tˆil ) and u-functions u~i ( z,Tˆil ) representing corresponding conditional 14 distributions of task i completion time, we can now obtain the u-function representing the ~ unconditional pmf of completion time Ti as Li ~ U i ( z ) qil u~i ( z , Tˆil ) . (17) l 1 ~ ~ Having u-functions U k ( z ) representing pmf of the completion time Tk for any subtask k i {k1,..., ki } , one can obtain the u-functions U i (z ) representing pmf of subtask i start time Ti according to (3) as Li ˆ ~ ~ ~ U i ( z ) U k1 ( z ) U k 2 ( z ) ... U ki ( z ) qil z Til . max max max (18) l 1 U i (z ) can be obtained recursively: U i ( z) z 0 , ~ U i ( z ) U i ( z ) U e ( z ) for e = k1, …, ki. max (19) It can be seen that if i then U i ( z ) z 0 . The final u-function Um(z) represents the pmf of random task completion time Tm in the form Lm ˆ U m ( z ) qml z Tm l . (20) l 1 Using the operators defined above one can obtain the service reliability and performance indices by implementing the following algorithm: 1. Determine tˆij for each subtask i and resource j i using Eq. (1); ~ Define for each subtask i (1im) U i ( z) U i ( z) = z0. 2. For all i: 15 ~ If i 0 or if for any k i U k ( z ) z0 (u-functions representing the completion times of all of the predecessors of subtask i are obtained) Li ˆ 2.1. Obtain U i ( z ) qil z Til using recursive procedure (19); l 1 2.2. For l = 1, …, Li: 2.2.1. For each j i obtain u~ij ( z , Tˆil ) using Eq. (13); 2.2.2. Obtain u~i ( z ) using recursive procedure (16); ~ 2.3. Obtain U i ( z) using Eq. (17). 3. If U m (z ) = z0 return to step 2. 4. Obtain reliability and performance indices R(T*) and W using equations (7) and (8). It should be noted that the presented algorithm works also when all of the subtasks are independent. In this case i 0 and Ui(z)=z0 for 1im. However in this special case much simpler procedure suggested in (Levitin and Dai, 2006) can be applied. IV. ILLUSTRATIVE EXAMPLES A. Analytical example This example presents analytical derivation of the indices R(T*) and W for simple grid service that uses six resources. Assume that the RMS divides the service task into three subtasks. The first subtask is assigned to resources 1 and 2, the second subtask is assigned to resources 3 and 4, the third subtask is assigned to resources 5 and 6: 1 = {1,2}, 2 = {3,4}, 3 = {5,6}. The failure rates of the resources and communication channels and subtask execution times are presented in Table 1. 16 Table 1. Parameters of grid system for analytical example No of subtask i No of resource j j+j (sec-1) tˆij (sec) p j (tˆij ) 1 0.0025 100 0.779 2 0.00018 180 0.968 3 0.0003 250 - 4 0.0008 300 - 5 0.0005 300 0.861 6 0.0002 430 0.918 1 2 3 Subtasks 1 and 3 get the input data directly from the RMS, subtask 2 needs the output of subtask 1, the service task is completed when the RMS gets the outputs of both subtasks 2 and 3: 1 3 , 2 {1} , 4 {2,3} . These subtask precedence constraints can be represented by the directed graph in Fig. 2. 1 2 3 4 Fig. 2. Subtask execution precedence constraints for analytical example Since 1 3 , the only realization of start times T1 and T3 is 0 and therefore, U1(z)=U2(z)=z0 . According to step 2 of the algorithm we can obtain the u-functions ~ ~ representing pmf of completion times ~ t11 , ~ t12 , t35 and t36 . In order to determine the subtask execution time distributions for the individual resources, define the u-functions uij(z) according to Table 1 and Eq. (10): u~11 ( z,0) exp( 0.0025 100) z100 [1 exp( 0.0025 100)] z 0.779z100 + 0.221z. 17 In the similar way we obtain u~12 ( z,0) = 0.968z180 + 0.032z; u~35 ( z ,0) = 0.861z300 + 0.139z; u~36 ( z,0) = 0.918z430 + 0.082z. The u-function representing the pmf of the completion time for subtask 1 executed by both resources 1 and 2 is ~ U1 ( z ) u~1( z,0) = u~11( z,0) u~12 ( z,0) = (0.779z100 + 0.221z) (0.968z180 + 0.032z) min min =0.779z100 +0.214z180 + 0.007z. The u-function representing the pmf of the completion time for subtask 3 executed by both resources 5 and 6 is ~ U 3 ( z ) u~3 ( z,0) = u~35 ( z ) u~36 ( z ) = (0.861z300 + 0.139z) (0.918z430 + 0.082z) min min =0.861z300 +0.128z430 + 0.011z. Execution of subtask 2 begins immediately after completion of subtask 1. Therefore, ~ U2(z) = U1 ( z ) =0.779z100 +0.214z180 + 0.007z (T2 has three realizations 100, 180 and ). The u-functions representing the conditional pmf of the completion times for the subtask 2 executed by individual resources are obtained as follows. u~23 ( z,100) e 0.0003 (100 250 ) z100 250 [1 e 0.0003 (100 250 ) ]z =0.9z350+0.1z; u~23 ( z,180) e 0.0003 (180 250 ) z180 250 [1 e 0.0003 (180 250 ) ]z =0.879z430+0.121z; u~23 ( z , ) z ; u~24 ( z,100) e 0.0008 (100 300 ) z100 300 [1 e 0.0008 (100 300 ) ]z =0.726z400+0.274z; u~24 ( z,180) e 0.0008 (180 300 ) z180 300 [1 e 0.0008 (180 300 ) ]z =0.681z480+0.319z; u~24 ( z, ) z . 18 The u-functions representing the conditional pmf of subtask 2 completion time are: u~2 ( z,100) u~23 ( z,100) u~24 ( z,100) (0.9z350+0.1z) (0.726z400+0.274z) min min =0.9z350+0.073z400+0.027z; u~2 ( z,180) u~23 ( z,180) u~24 ( z,180) (0.879z430+0.121z) (0.681z480+0.319z) min min =0.879z430+0.082z480+0.039z; u~2 ( z , ) u~23 ( z , ) u~24 ( z , ) z . min According to Eq. (17) the unconditional pmf of subtask 2 completion time is represented by the following u-function ~ U 2 ( z ) 0.779u~2 ( z,100) 0.214u~2 ( z,180) 0.007 z =0.779(0.9z350+0.073z400+0.027z)+0.214(0.879z430+0.082z480+0.039z)+0.007z =0.701z350+0.056z400+0.188z430+0.018z480+0.037z The service task is completed when subtasks 2 and 3 return their outputs to the RMS (which corresponds to the beginning of subtask 4). Therefore, the u-function representing the pmf of the entire service time is obtained as ~ ~ U 4 ( z) U 2 ( z) U 3 ( z) max =(0.701z350+0.056z400+0.188z430+0.018z480+0.037z) (0.861z300 +0.128z430 + max 0.011z)=0.603z350 +0.049z400 +0.283z430 +0.017z480 +0.048z. The pmf of the service time is: Pr(T4 = 350) = 0.603; Pr(T4 = 400) = 0.049; Pr(T4 = 430) = 0.283; Pr(T4 = 480) = 0.017; Pr(T4 = ) = 0.048. From the obtained pmf we can calculate the service reliability using Eq. (7): R(T *) 0.603 for 350< T * 400; R(T *) 0.652 for 400< T * 430; 19 R(T *) 0.935 for 430< T * 480; R () 0.952 and the conditional expected service time according to Eq. (8): W = (0.603350 + 0.049400 + 0.283430 + 0.017480) / 0.952 = 378.69 sec. B. Numerical example This example illustrates the use of the numerical algorithm suggested in section III for analysis of reliability and performance of a grid service that uses 15 specialized resources distributed in a grid system. The entire service task consists of eight subtasks with precedence constraints presented in Fig. 3 (note that subtask 9 corresponds to the final task processing by the RMS). The subtasks can be assigned to the resources in accordance with their specialization presented in Table 2. The subtask execution times and failure rates of resources and corresponding communication channels are also presented in Table 2. Subtasks 2, 4 and 7 can be executed by single specialized resources 4, 7 and 12 respectively. The rest of subtasks can be executed by several resources in parallel. Observe that the model suggested can handle the case when some of subtasks are performed by the RMS itself. In our example subtask 7 is performed by the fully reliable RMS (resource 12), which corresponds to 12+12 = 0. 1 5 7 2 6 8 3 4 9 Figure 3. Subtask execution precedence constraints for numerical example 20 The service reliability and performance indices obtained by the presented algorithm for the given set of parameters are presented in Table 3. In this table case A corresponds to the situation when all of the resources are used for the service execution. The suggested procedure for service performance evaluation allows analyst to easily estimate the effect of changes in the grid on the service performance. For example we can consider the effect of removal of some resources from the grid. Let resources 3, 11 and 14 be considered as candidates for removal (due to contracting cost considerations). It is not evident removal of which one of the resources causes the greatest service performance deterioration since it depends on combination of several factors (redundancy of subtask execution, performance and reliability of resources, precedence of subtask execution). Only evaluating the entire system performance and reliability after removal of the elements can help to compare the resource removal scenarios. The results obtained by the algorithm for three different cases of single resource removal (cases B-D in Table 3) show that the removal of resource 11 causes the greatest deterioration of service performance. The system performance indices obtained for the case when all the three resources are removed are given for comparison (case E in Table 3). Table 2 Parameters of grid system No of resource j 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 j+j (sec -1) .0...0 .0...0 .0...8 .0...0 .0..0. .0..00 .0...0 .0...8 .0...0 .0...8 .0...0 .0.... .0...0 .0...3 .0...0 tij (sec) 08. 03. 08. 8. 0.. 0. 00. 88 00. 0.. 08. 0. 0. 8. 88 No of subtask that can be executed 1 1 1 2 3 3 4 5 5 6 6 7 8 8 8 21 The reliability as a function of required service time is presented for all five considered cases in Fig. 4. Table 3 Service performance indices for different subtask distributions Case A B C D E Removed resources 3 11 14 3, 11, 14 Tmin 330 355 330 335 355 Tmax 440 440 430 440 430 R() 0.869 0.841 0.667 0.861 0.640 W 346.385 369.850 336.888 352.901 367.400 1 R (T *) 0.8 0.6 0.4 0.2 0 320 340 360 A B 380 T* 400 C D 420 440 E Fig. 4. Service reliability R(T*) for different combinations of unavailable resources In order to verify the results obtained by the suggested algorithm, a simulation program has been developed. To obtain a more realistic estimation of the system behavior the constant data processing and transmission speeds presented in Table 2 were replaced by normally distributed random variables (assumptions 3 and 4 were removed). The mean values of these random variables coincide with the constants from Table 2. The variance for each variable was set to 100 sec. The failure rates presented in Table 2 were used. The simulation results for cases A, B and E were obtained by running the simulation program 1000 times for each case. The obtained reliability and performance indices are presented in Table 4. The functions R(T*) obtained by the simulation are presented in Fig. 5 where the stepwise curves obtained by the suggested numerical algorithm are given again for comparison. 22 Table 4: Service performance and reliability indices obtained by the simulation Tmin Case Tmax W R() A 317.83 431.65 0.866 343.11 B 340.11 453.44 0.851 368.24 E 341.00 449.14 0.666 369.04 1 R (T* ) 0.8 0.6 0.4 0.2 0 320 340 360 380 T* 400 Simulation Case A Simulation Case E Algorithm Case B 420 440 Simulation Case B Algorithm Case A Algorithm Case E Fig. 5. Simulated results (1000 runs) vs. Analytical results. From Table 4 and Fig. 5, one can see that the service reliability and performance indices obtained by the suggested algorithm are very close to the simulation results, which witnesses that the algorithm can be used for the grid service reliability and performance evaluation with good precision. V. SUMMARY AND CONCLUSIONS Grid technology is a newly developed method for large-scale distributed system. This technology allows effective distribution of computational tasks among different resources presented in the grid. The resource management system can divide service task into subtasks and send the subtasks to different specialized resources for parallel execution. In order to 23 provide desired level of service reliability the RMS can assign the same subtasks to several independent resources of the same type. In order to evaluate the quality of service its reliability and performance indices should be defined. This paper considers the indices: service reliability (probability that the service task is accomplished within a specified time) and conditional expected system time and presents the numerical algorithm for their evaluation for arbitrary subtask distribution in a given grid with star architecture taking into account precedence constraints on the sequence of subtask execution. Even though the algorithm is based on some simplifying assumptions, it obtains quite accurate estimates of service reliability and performance indices, which is very helpful for such practical applications as - comparison of different resource management alternatives (subtask assignment to different resources), - making decisions aimed at service performance improvement based on comparison of different grid structure alternatives, - estimating the effect of reliability and performance variation of grid elements on service reliability and performance. Further research can include optimization of subtask distribution among available resources, incorporating variable resource performance into the model, analysis of effect of transient failures on the grid service reliability and performance. 24 REFERENCES Abramson, D., Giddy, J., Kotler, L. (2000), High performance parametric modeling with Nimrod/G, 14th International Parallel and Distributed Processing Symposium, pp. 520-528. Berman, F., Wolski, R., Casanova, H., Cirne, W., Dail, H., Faerman, M., Figueira, S., Hayes, J., Obertelli, G., Schopf, J., Shao, G., Smallen, S., Spring, N., Su, A., Zagorodnov, D., (2003), Adaptive computing on the Grid using AppLeS, IEEE Transactions on Parallel and Distributed Systems, vol.14, no.14, pp.369 – 382. Foster, I., Kesselman, C. (2003), The Grid 2: Blueprint for a New Computing Infrastructure, Morgan-Kaufmann. Foster, I., Kesselman, C. and Tuecke, S. (2001), The anatomy of the grid: Enabling scalable virtual organizations, International Journal of High Performance Computing Applications, vol. 15, pp. 200-222. Foster, I., Kesselman, C., Nick, J.M., Tuecke, S. (2002), Grid services for distributed system integration, Computer, vol. 35, no. 6, pp. 37-46. Grassi, V., Donatiello L., Iazeolla G., (1988), Performability evaluation of multicomponent fault tolerant systems, IEEE Transactions on Reliability, vol. 37, no. 2, pp. 216-222. Hamscher, V., Schwiegelshohn, U., Streit, A., Ramin, Y. (2000), Evaluation of jobscheduling strategies for grid computing, Proceedings of the First IEEE/ACM International Workshop on Grid Computing, pp. 191-202. Krauter, K., Buyya, R., Maheswaran, M. (2002), A taxonomy and survey of grid resource management systems for distributed computing, Software - Practice and Experience, vol. 32, no. 2, pp. 135-164. Kumar, A. and Agrawal, D.P. (1993), A generalized algorithm for evaluating distributedprogram reliability, IEEE Transactions on Reliability, 42, pp. 416-424. Kumar, A. (2000) An efficient SuperGrid protocol for high availability and load balancing, IEEE Transactions on Computers, vol. 49, no. 10, pp. 1126-1133. Levitin, G., Lisnianski A., Beh-Haim H., Elmakis D. (1998), Redundancy optimization for series-parallel multi-state systems, IEEE Transactions on Reliability, vol. 47, pp. 165-172. Levitin, G., Dai Y. (2006), Service reliability and performance in grid system with star topology, to appear in Reliability Engineering and System Safety. Lisnianski A., Levitin G. Multi-state System Reliability, World Scientific, Singapore, 2003. Meyer, J. (1980), On evaluating the performability of degradable computing systems, IEEE Transactions on Computers, vol. 29, pp. 720-731. Nabrzyski, J., Schopf, J.M., Weglarz, J. (2003), Grid Resource Management, Kluwer Publishing. Tai, A., Meyer J., Avizienis A. (1993), Performability enhancement of fault-tolerant software, IEEE Transactions on Reliability, vol. 42, no. 2, pp. 227-237. Ushakov, I. (1987), Optimal standby problems and a universal generating function, Soviet Journal of Computer Systems Science, vol. 25, pp. 79-82. 25 Veeraraghavan, M., Trivedi, K.S. (1991), An improved algorithm for symbolic reliability analysis, IEEE Transactions on Reliability, 40(3), pp. 347-358. Wu A., Yu H., Jin S., Lin K.-C., and Schiavone G. (2004), An incremental genetic algorithm approach to multiprocessor scheduling. IEEE Trans. on Parallel andDistributed Systems, 15(9), pp. 824-834. Zang, X., Wang, D., Sun H., Trivedi, K.S. (2003), A BDD-based algorithm for analysis of multistate systems with multistate components. IEEE Trans. On Computer, 52(12), pp. 1608-1618.