UML based OSCAR Cluster Availability Modeling

Hertong Song
Computer Science, College of Engineering & Science
Louisiana Tech University
Ruston, LA 71270, USA
Email: hso001@latech.edu

Abstract: Computer system availability is crucial as computers become part of more and more aspects of our daily life, so it is necessary to evaluate a system's availability at its design stage. The most widely used analytic availability modeling techniques are fault trees, Markov chains, stochastic Petri nets, and related methods. However, system designers may not be familiar with these techniques; instead, they normally use UML to describe system behavior and to communicate with each other. This paper proposes a modeling framework that uses UML to model a system's availability and helps system designers evaluate it with ease. The paper then presents a way to model the OSCAR cluster system's availability using UML notation, maps the UML model to a Continuous Time Markov Chain (CTMC), and finally obtains the availability by solving the underlying CTMC model.

Keywords: UML, High Availability, OSCAR, Continuous Time Markov Chain

1. Introduction

Computer systems are increasingly woven into society and daily life, appearing in telecommunications, banking systems, automobiles, spacecraft, and even microwave ovens. A computer cluster is a high-performance computing (HPC) system that can also improve availability: when one component is down, the other components can take over its workload and keep the system functioning. System availability therefore plays an important role in computer systems. Unexpected system failures can throw society into chaos, causing power outages, financial loss, and even loss of human life. It is thus extremely important to evaluate a system's availability at the initial design stage.
Analytical evaluation of a system's availability relies on statistical methods such as reliability block diagrams, fault trees, Markov chains, and stochastic Petri nets. Meanwhile, UML is becoming the de facto standard modeling language for describing a system's behavior and other aspects. Many industrial software architects, product managers, and system designers use UML to specify systems, but they may not be familiar with the statistical methods used to evaluate a system's availability. Some researchers have used UML to model the reliability aspects of specific software [1]. Our previous study [5] addressed only two-tier systems. We have since enhanced our modeling technique to handle multi-tier systems, whose components may have the same functionality but different failure rates.

In this paper, we propose a mechanism that uses UML to model a multi-tier cluster computing system. The UML model is then mapped to a Continuous Time Markov Chain (CTMC) model. Finally, the system's availability is calculated with the uniformization algorithm [9]. The framework of UML-based availability modeling, illustrated in Figure 1, consists of four steps:
Step (1) UML notation articulates a system design and describes component-wise reliability information.
Step (2) An equivalent XMI representation is obtained from the UML model.
Step (3) The model is mapped to a CTMC model.
Step (4) The system's availability is calculated from the CTMC model.

Figure 1. Framework of the UML modeling tool (UML representation, XMI representation, availability model, results)

The remainder of this paper is structured as follows: Section 2 describes the OSCAR cluster system architecture, Section 3 shows how to use UML to model a system's reliability, Section 4 describes the mapping to a CTMC, Section 5 addresses the solution of the CTMC, and Section 6 concludes the paper.

2. System Architecture

OSCAR (Open Source Cluster Application Resources) [2] is a popular software package for building High Performance Computing (HPC) systems. It consists of a head node, referred to as the server, and a set of working nodes, herein referred to as the clients. The server takes job requests and dispatches the jobs to the clients, which actually handle the requests. The server communicates with its clients via a local network connection. From a high availability point of view, the current OSCAR suffers from several single points of failure; for example, if the server is down, the whole system is down. For this reason, High Availability OSCAR (HA-OSCAR) is being developed at Louisiana Tech University [3]. HA-OSCAR has two servers, two switches, and dual network connections. Figure 2 shows the system architecture of HA-OSCAR.

Figure 2. The architecture of the OSCAR cluster system (Server 1 and Server 2, switch 1 and switch 2, node 1 to node 4)

3. The UML Model of the Cluster System

In our initial work, a UML deployment diagram was used to model the availability of cluster systems [5]. At the current stage, we assume that (1) each component in the system works independently, so a failure of one component does not cause the failure of another; (2) only hardware failures of each node are considered; (3) once the system is down, no further failures occur; and (4) the time to failure of each component is exponentially distributed. Figure 3 shows the deployment diagram that represents a cluster system.

Figure 3. The deployment diagram of a cluster system

Currently, Gentleware's Poseidon for UML Community Edition [10] is used to create the UML model. By specifying the availability information for each node as UML tagged values, we can create the underlying Markov chain model for the cluster system, and the system's availability can then be calculated from that Markov chain model. Table 1 lists the UML tag names and what they represent.
UML tag         Represents
name            Name of the component; used for grouping
failure rate    The failure rate of the component
repair rate     The repair rate of the component
min             The minimum number of components required to keep the system functioning
multiplicity    The number of duplicates of the component

Table 1. UML tags used for availability modeling

Consider a cluster system with two identical servers, two identical switches, and two identical clients. The system is functioning if at least one server, one switch, and one client are functioning. To model this system, we use server, switch, and node as the component names and specify, for each component, its failure rate, its repair rate, the minimum number required to keep the system functioning, and its number of duplicates. Figure 4 shows the UML tags for this example.

name            server   switch   node
failure rate    0.001    0.002    0.001
repair rate     0.05     0.01     0.02
min             1        1        1
multiplicity    2        2        2

Figure 4. An example of using UML tags

4. Construction of the Markov Chain

Based on the given UML notation, we can partition the components into distinct groups. Each element of a state tuple represents a group, and its value is the number of working components in that group. For example, the notation $(n_1, n_2, \ldots, n_s)$ means that there are $s$ groups, with $n_1$ identical components in group 1, $n_2$ identical components in group 2, and so on. The resulting Markov chain has a tree-like structure. The top level (root) of the tree is the initial state of the Markov chain, denoted by $(n_1, n_2, \ldots, n_s)$. Each state has $s$ possible branches, since a component in any of the $s$ groups may fail. At the second level, for example, the states are $(n_1 - 1, n_2, \ldots, n_s)$, $(n_1, n_2 - 1, \ldots, n_s)$, ..., $(n_1, n_2, \ldots, n_s - 1)$.
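The construction in this section (the root, $s$ branches per state, and the leaf, rate, duplicate-merging, and numbering rules that follow), together with the uniformization step of Section 5, can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' tool: `GROUPS` is a hypothetical encoding of the Figure 4 tagged values as (failure rate, repair rate, min, multiplicity); `build_ctmc` and `transient_availability` are hypothetical names; a leaf is assumed to be the first state in which some group drops below its min; and repairs are assumed to follow only the tree branches back to the parent state.

```python
import math
from collections import deque

# Hypothetical encoding of the Figure 4 tagged values:
# name -> (failure rate, repair rate, min, multiplicity)
GROUPS = {
    "server": (0.001, 0.05, 1, 2),
    "switch": (0.002, 0.01, 1, 2),
    "node": (0.001, 0.02, 1, 2),
}


def build_ctmc(groups):
    """Build the tree-like CTMC; return (states, rates, up).

    states: state tuples numbered breadth-first (root = state 0)
    rates:  {(i, j): rate} for every transition i -> j
    up:     indices of operational states (every group at or above its min)
    """
    params = list(groups.values())
    root = tuple(mult for (_, _, _, mult) in params)
    index = {root: 0}
    states = [root]
    rates = {}
    queue = deque([root])
    while queue:
        state = queue.popleft()
        i = index[state]
        # Assumed leaf rule: some group is below its min, so the system is down
        # and, by assumption (3), no further failures occur from this state.
        if any(n < k for n, (_, _, k, _) in zip(state, params)):
            continue
        for g, (lam, mu, _, _) in enumerate(params):
            if state[g] == 0:
                continue  # no working component left to fail in this group
            child = state[:g] + (state[g] - 1,) + state[g + 1:]
            if child not in index:  # a duplicate reuses the first (leftmost) copy
                index[child] = len(states)
                states.append(child)
                queue.append(child)
            j = index[child]
            rates[(i, j)] = state[g] * lam  # failure rate (n_i - d_i) * lambda_i
            rates[(j, i)] = mu  # repair rate mu_i back along the branch
    up = {index[s] for s in states
          if all(n >= k for n, (_, _, k, _) in zip(s, params))}
    return states, rates, up


def transient_availability(states, rates, up, t, eps=1e-10, max_terms=100000):
    """A(t) = sum of pi_i(t) over up states, via truncated uniformization (l = 0)."""
    n = len(states)
    out = [0.0] * n
    for (i, _), r in rates.items():
        out[i] += r  # out[i] = -q_ii, the total outflow rate of state i
    q = max(out) or 1.0  # uniformization rate q = max_i |q_ii|
    # Sparse rows of Q* = I + Q/q (a stochastic matrix).
    pstar = [{i: 1.0 - out[i] / q} for i in range(n)]
    for (i, j), r in rates.items():
        pstar[i][j] = pstar[i].get(j, 0.0) + r / q
    weight = math.exp(-q * t)  # Poisson weight for k = 0
    if weight == 0.0:
        raise ValueError("q*t too large for this naive sketch (exp underflow)")
    vec = [0.0] * n
    vec[0] = 1.0  # pi(0) = (1, 0, ..., 0)
    acc = [0.0] * n
    total, k = 0.0, 0
    while total < 1.0 - eps and k < max_terms:
        for s in range(n):
            acc[s] += weight * vec[s]
        total += weight
        k += 1
        nxt = [0.0] * n  # vec <- vec Q*
        for i, row in enumerate(pstar):
            if vec[i]:
                for j, p in row.items():
                    nxt[j] += vec[i] * p
        vec = nxt
        weight *= q * t / k  # now e^{-qt} (qt)^k / k!
    return sum(acc[s] for s in up)
```

For the Figure 4 example, `build_ctmc(GROUPS)` produces 20 states: 8 operational states (each group has one or two working components) and 12 leaves. The loop starts the series at $l = 0$ and stops once the accumulated Poisson weight exceeds $1 - \varepsilon$, which is adequate only for moderate $qt$; for large $qt$, $e^{-qt}$ underflows and a proper left truncation point $l$, as in [9], would be needed.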
The tree keeps branching, with $s$ children per state, until it reaches a leaf, that is, a state in which some group $i$ has fallen below the minimum number of components $k_i$ required to keep the system functioning. A leaf therefore has the form $(n_1 - d_1, n_2 - d_2, \ldots, k_i - 1, \ldots, n_s - d_s)$, where $d_j$ is the number of failed components in group $j$ and $n_j - d_j \ge k_j$ for every other group $j \ne i$. By assumption (3), no further failures occur from a leaf. Each branch for group $i$ has a failure rate of $(n_i - d_i)\lambda_i$ and a repair rate of $\mu_i$, where $\lambda_i$ and $\mu_i$ are the failure and repair rates of a single component in group $i$. Once the tree is constructed, each duplicated state is removed by linking its parent to the identical state that appears further left in the tree, keeping the failure rate of the removed branch. The states are then numbered with single integers in breadth-first order: the root is state 0; the second level, from left to right, contains states $1, 2, \ldots, s$; the third level follows, and so on. The initial probability vector of the Markov chain is $(1, 0, \ldots, 0)$. Figure 5 shows the resulting Markov chain for the previous example.

Figure 5. A Markov chain example

Same-level components with different failure or repair rates. One way to handle such components is to add another element to the state tuple. For example, suppose there are $s$ groups, denoted by $(n_1, n_2, \ldots, n_s)$. If group 2 consists of components with two different failure rates, we can use $(n_1, n_2, n_3, \ldots, n_{s+1})$, where $n_2$ and $n_3$ denote the two subgroups of group 2. The system is functioning if $k_1 + k_2 \ge k$, where $k_1$ and $k_2$ are the numbers of working components in the two subgroups and $k$ is the minimum required for the original group. In the UML representation, we still use the same name (or level) to denote the two subgroups. Figure 6 shows the Markov chain for the case in which the two servers in the above example have different failure rates.
Repair rates and duplicated states are omitted from Figure 6 for simplicity.

Figure 6. A Markov chain model with two different servers

5. Calculation of the Markov Chain

Markov chain models are popular for system availability modeling, and their theory is well developed [7]. The transient state probabilities of the corresponding continuous time Markov chain satisfy the Kolmogorov forward differential equation

$$\frac{d\pi(t)}{dt} = \pi(t)\, Q,$$

where $\pi(t)$ is the state probability vector at time $t$ and $Q$ is the generator matrix. Let $\Omega$ denote the state space of the system described by the CTMC, $\Omega_0$ the set of operational system states, and $\Omega_f$ the set of system down states. Then the system's availability at time $t$ is given by

$$A(t) = \sum_{i \in \Omega_0} \pi_i(t).$$

$\Omega_f$ is the set of all leaves of the tree, as shown in Figures 5 and 6, and $\Omega_0 = \Omega \setminus \Omega_f$. For the example in Figure 5, the initial probability vector is $\pi(0) = (1, 0, 0, \ldots, 0)$, and the generator matrix, with states numbered in breadth-first order, has the form

$$Q = \begin{pmatrix}
-(2\lambda_1 + 2\lambda_2 + 2\lambda_3) & 2\lambda_1 & 2\lambda_2 & 2\lambda_3 & 0 & \cdots \\
\mu_1 & -(\mu_1 + \lambda_1 + 2\lambda_2 + 2\lambda_3) & 0 & 0 & \lambda_1 & \cdots \\
\mu_2 & 0 & -(\mu_2 + 2\lambda_1 + \lambda_2 + 2\lambda_3) & 0 & 0 & \cdots \\
\mu_3 & 0 & 0 & -(\mu_3 + 2\lambda_1 + 2\lambda_2 + \lambda_3) & 0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix},$$

where $\lambda_1, \lambda_2, \lambda_3$ and $\mu_1, \mu_2, \mu_3$ are the failure and repair rates of the servers, switches, and nodes, respectively.

The general solution of the above equation is $\pi(t) = \pi(0)\, e^{Qt}$. However, a closed-form solution of Markov models is generally impractical to obtain, except for models of certain special forms [7]. For this reason, we adopt a numerical method referred to as uniformization [9]. With uniformization, the solution of the Kolmogorov differential equation can be written as

$$\pi(t) = \pi(0) \sum_{k=0}^{\infty} e^{-qt} \frac{(qt)^k}{k!} (Q^*)^k,$$

where $Q^* = I + Q/q$ and $q = \max_i |q_{ii}|$. The infinite series can be truncated by choosing a truncation error $\varepsilon$, so that the solution is approximately

$$\pi(t) \approx \pi(0) \sum_{k=l}^{r} e^{-qt} \frac{(qt)^k}{k!} (Q^*)^k,$$
where $l$ and $r$ are the left and right truncation points of the series. The detailed computation algorithm is given in the appendix of [9].

6. Conclusion

UML is becoming a standard for modeling software systems, and it has recently been studied as a means of modeling system availability. We have enhanced our UML-based availability modeling technique to handle multi-tier cluster systems by specifying tagged values for each system component in the UML model. The UML model is then mapped into a continuous time Markov chain model, and the system's availability is finally calculated with the uniformization method. We envision that our framework may extend the semantics of UML and enable system architects and software designers to use UML to model system reliability with ease. In future work, we will extend our technique to model the availability of systems with dependencies among components and with dynamic aspects.

References
[1] Zarras and V. Issarny, "UML-Based Modeling of Software Reliability," Proc. of the 1st ICSE Workshop on Describing Software Architecture with UML, Toronto, Canada, May 2001, pp. 36-40.
[2] M. J. Brim, T. G. Mattson, and S. L. Scott, "OSCAR: Open Source Cluster Application Resources," Ottawa Linux Symposium 2001, Ottawa, Canada, 2001.
[3] C. Leangsuksun, L. Shen, H. Song, S. L. Scott, and I. Haddad, "The Modeling and Dependability Analysis of High Availability OSCAR Cluster System," 17th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2003), Sherbrooke, Canada, May 2003.
[4] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. L. Scott, "Dependability Prediction of High Availability OSCAR Cluster Server," The 2003 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'03), Las Vegas, Nevada, USA, June 23-26, 2003.
[5] C. Leangsuksun, H. Song, and L. Shen, "Reliability Modeling Using UML," The 2003 International Conference on (SePDPTS'03), Las Vegas, Nevada, USA, June 23-26, 2003.
[6] C. Leangsuksun, L. Shen, H. Song, and S. L. Scott, "Availability Prediction and Modeling of High Availability OSCAR Cluster," Submitted to ………………
[7] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.
[8] K. S. Trivedi and R. Sahner, "Reliability Modeling Using SHARPE," IEEE Transactions on Reliability, vol. R-36, no. 2, June 1987, pp. 186-193.
[9] A. Reibman and K. S. Trivedi, "Numerical Transient Analysis of Markov Models," Computers and Operations Research, vol. 15, no. 1, 1988, pp. 19-36.
[10] Gentleware's Poseidon for UML Community Edition. http://www.gentleware.com