A Software Architecture for Monitoring the Reliability in Distributed Systems

Florin POPENTIU VLADICESCU, City University London, DEEIE, Northampton Square, London EC1V 0HB, e-mail: Fl.Popentiu@city.ac.uk
Pierre SENS, Pierre & Marie Curie University, Paris, 4 place Jussieu, 75252 Paris Cedex 05, France, e-mail: Pierre.Sens@Lip6.fr

Abstract: This paper addresses the problem of designing a software architecture according to reliability forecasting. In a distributed computing environment, a statistical model is used to determine the suitable fault tolerance algorithms with respect to the performance requirements of each specific application. The proposed technique is based on a chain of automatic data collection, which makes it possible to adjust the strategy of fault management during execution. We propose a flexible monitoring scheme in which a specific library for file accesses and communication keeps track of file and communication usage. The pattern incorporates a self-contained architecture for distributed systems, which allows us to monitor the collection of statistical data and to support the execution of applications in a non-reliable operational environment. The main advantage of the proposed technique is that it allows the software architecture to be designed according to the most recent requirements.

1. Introduction

At midnight on December 31, 1999 we will enter a new millennium. But without appropriate action now, some of our computers will generate many vital errors. Our proposed research project addresses the problem of software architecture according to software reliability forecasting. Software reliability forecasting is a problem of increasing importance for many critical applications. Statistical methods are important to improve the ability of current software reliability growth models to give measures and predictions of reliability that can be trusted when adapting fault tolerance algorithms.

In order to retain the fault tolerance of large-scale systems, not only the reliability of each component but also the design of the whole system must be taken into consideration [Verma 97]. Therefore, it is useful to ask: how should the subsystems be interconnected to one another to attain fault tolerance as a whole? A substantial body of work has been published regarding fault tolerance, and we should find the best algorithm related to the dependability and performance requirements of each specific application. The reliability of such applications may be compromised by external transient faults, such as power failures, disk failures, processor failures and other system malfunctions, or by permanent faults, such as those internal to the program itself [Nicola 95].

Fault tolerance techniques must be chosen for each application, together with a management adapted to the dependability level needed. In general this choice is the user's decision at the launching of the application, possibly assisted by an expert in dependability [Daniels 97]. However, in all cases current techniques do not allow the user to adapt the management during execution time. This is due to the uncertainty of the operational environment and the uncertainty of the rate of occurrence of failures [Fenton 97].

2. Environments

This section presents our target environment and applications. We consider an application program as a dynamic set of communicating processes. A process may use any resources of the network (CPU and files). The only way to exchange information between processes is through message passing. Applications concerned with fault-tolerant management are long-running ones, such as factoring of large numbers, VLSI applications, image processing or Gaussian elimination. Such applications may execute for hours, days or weeks. In that case the failure probability becomes significant, and the need for reliability is an important concern. We consider two kinds of environment:

* Local area networks (LAN): applications are executed on a set of workstations linked by a local area network (Ethernet). In such an environment the failure probability is low. Clark and McMillin measured the average crash time on a local area network to be once every 2.7 days [Clark 92].

* Multi-LAN: considering that the number of hosts in a LAN is relatively low and the ever-growing improvement in networking technology, the idea of spanning an application over an interconnection of multiple LANs can be worthwhile. Such a platform is extremely interesting as it provides more computing power and the possibility of profiting from the idleness of underutilized remote hosts.
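To give an idea of why such long-running applications need fault tolerance, the following sketch estimates the probability that a run is hit by at least one workstation crash, using the 2.7-day average crash interval quoted from [Clark 92]. The assumption of independent, exponentially distributed crashes and the function name are our own simplifications for illustration, not claims made above.

    # Illustration only: probability that a long-running application sees at
    # least one workstation crash, assuming crashes are independent and
    # exponentially distributed with a mean time between crashes of 2.7 days
    # (the figure quoted above from [Clark 92]).
    import math

    def crash_probability(run_hours, n_hosts, mtbf_days=2.7):
        """P(at least one crash among n_hosts during a run of run_hours)."""
        rate_per_hour = 1.0 / (mtbf_days * 24.0)        # per-host crash rate
        expected_crashes = rate_per_hour * run_hours * n_hosts
        return 1.0 - math.exp(-expected_crashes)

    # A two-day run on 10 workstations fails with probability above 0.99
    # under these assumptions, which motivates the techniques of Section 5.
    print(round(crash_probability(48, 10), 4))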
3. A Prediction Tool

The main objective of our work is to improve the total quality of the software architecture for monitoring fault-tolerant design in distributed systems. In a distributed computing environment, a statistical model is used to determine the suitable algorithms with respect to the performance requirements of each specific application. The proposed technique is based on a chain of automatic data collection, which allows us to adjust the strategy of fault management during execution (Figure 1).

Figure 1: Forecasting Chain

The chain consists of the following modules and sub-modules:

1. The module "monitor of the application" collects information on the evolution of the applications and the environment.
2. The module "data collection" is concerned with a refinement process, extracting essential data elements from the raw data.
3. The module "measurement and prediction" chooses an appropriate reliability model by evaluating the accuracy of the models: the sub-module "prediction" reveals the great variations in the results of applying N software reliability growth models to the data; the sub-module "U-plot" detects the bias of the medians; the sub-module "recalibration" improves the predictions; the sub-module "PLR test" compares N1 competing sets of predictions and selects the one that produces the "globally most accurate" prediction of the future failure behavior of the application and the operational environment.
4. The module "adapted algorithms" uses the information of the prediction module to select the most appropriate algorithm for the adaptation of fault management.

4. Reliability Modeling

4.1 The failure chain

To model software reliability one must first consider the principal factors that affect it: fault introduction, fault removal and use [Musa 98]. The LAN and Multi-LAN environments automatically and continuously process and report information such as the date and time of occurrence of a failure, the duration of downtime and the cause of the outage. This information is used by system engineers to detect congestion, balance loads, improve the performance of the system and so on. The accuracy of the data collection plays a central role in the measurement and prediction of a system's reliability.
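To make the refinement step concrete, the sketch below shows how raw outage records of the kind just described (date and time of occurrence, downtime duration, cause) can be reduced to the series of interfailure times that the reliability growth models of Section 4.2 consume. The record layout and field names are our illustrative assumptions, not the actual format used by the monitoring chain.

    # Sketch of the "data collection" refinement step: turning raw outage
    # records into the sequence of interfailure times analysed in Section 4.
    # The record format and field names are illustrative assumptions.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import List

    @dataclass
    class OutageRecord:
        occurred_at: datetime     # date and time of occurrence
        downtime: timedelta       # duration of downtime
        cause: str                # COMM, HWR, SWR or HUM
        complete: bool            # True if the failure affected all users

    def interfailure_times(records: List[OutageRecord]) -> List[float]:
        """Interfailure times (hours) between successive complete failures."""
        complete = sorted((r for r in records if r.complete),
                          key=lambda r: r.occurred_at)
        return [(b.occurred_at - a.occurred_at).total_seconds() / 3600.0
                for a, b in zip(complete, complete[1:])]

Restricting the computation to complete failures mirrors the modeling choice stated below.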
A failure does not necessarily impact all users logged into the system. Partial failures affect a certain fraction of users, while complete failures affect all users. In practice, however, we restrict reliability modeling to complete failures only. A failure-event chain includes different classes of events, such as network communications (COMM), hardware problems (HWR), system software (SWR) and human errors (HUM). Figure 2 shows the distribution of complete failures over these four categories of events for the LAN and Multi-LAN environments in the period 1994-1998.

Figure 2: Complete Failure Type (percentage of total failures for COMM, HWR, SWR and HUM)

From Figure 2 we can see that software-related problems (SWR) decrease as the transition is made from LAN to Multi-LAN environments, in line with historical trends in software reliability. A software reliability model specifies the general form of the dependence of the failure process on the factors mentioned above.

4.2 R-Vision tool

R-Vision is a software reliability estimation tool that runs under UNIX environments. The tool is relatively easy to learn and use because it is driven by menus and dialog boxes. R-Vision was designed for X-Windows, and its File, Options and Help menus are user-friendly. In Figure 3 we present the distribution of the interfailure times for a set of 288 failures recorded in a Multi-LAN environment. The 288 successive failures are analysed using 8 reliability growth models; the current median time to next failure is plotted in Figure 4. To detect bias in the medians plotted in Figure 4 we can use the U-plot [Fenton 97] shown in Figure 5.

Figure 3: Plot of failures in order of occurrence

Figure 4: Predicted medians

Figure 5: U-plot test

The noise and the bias of the median predictions can be detected using the prequential likelihood ratio (PLR). An example is plotted in Figure 6.

Figure 6: PLR method

4.3 Some results

The techniques implemented in the R-Vision tool represent a theory of design reliability, applicable equally to software and hardware. The software whose reliability we wish to measure operates in a real environment and, when failures occur, attempts are made to find and fix the faults [Fenton 97]. In the long run, we expect to see the reliability improve. Figure 4 illustrates the raw predictive median time to failure for 8 reliability growth models [Duane (D), Jelinski-Moranda (JM), Keiller-Littlewood (KL), Littlewood (LM), Littlewood non-homogeneous Poisson (LNHPP), Littlewood-Verrall (LV) and Musa-Okumoto (MO)] using the data presented in Figure 3. As we can see, some predictions are very optimistic, such as JM and D, and some are pessimistic, such as LV and KL. After the U-plot tests illustrated in Figure 5 we decided that the recalibrated LM and MO models are the best ones. This decision is taken after measuring the degree of deviation using the Kolmogorov distance. We see the effects of recalibration in Figure 5, which shows the evolution of the maximum vertical distance. The difference in terms of noise and bias between these two models can be seen using the prequential likelihood function. From the decreasing shape presented in Figure 6 we can see that MO discredits LM, and this PLR test gives us the opportunity to select the best model.
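The two selection criteria used above can be summarised in a few lines of code. The sketch below is our own illustration, not the interface of R-Vision: it assumes that, for each observed interfailure time, the predictive distribution of a model has already been evaluated at that observation.

    # Sketch of the two criteria discussed above: the u-plot Kolmogorov
    # distance (bias of the predictions) and the running prequential
    # likelihood ratio of two competing sets of predictions.
    from typing import List, Sequence
    import math

    def u_plot_distance(u: Sequence[float]) -> float:
        """Kolmogorov distance between the empirical CDF of the u_i = F_i(t_i)
        and the uniform CDF (the line of unit slope on the u-plot)."""
        us = sorted(u)
        n = len(us)
        d = 0.0
        for i, ui in enumerate(us):
            # maximum vertical deviation just before and just after each step
            d = max(d, abs((i + 1) / n - ui), abs(i / n - ui))
        return d

    def log_plr(pdf_a: Sequence[float], pdf_b: Sequence[float]) -> List[float]:
        """Running log prequential likelihood ratio of model A against model B.
        A steadily increasing curve favours A; a decreasing one favours B."""
        running, out = 0.0, []
        for fa, fb in zip(pdf_a, pdf_b):
            running += math.log(fa) - math.log(fb)
            out.append(running)
        return out

With LM as model A and MO as model B, a steadily decreasing curve such as the one in Figure 6 is the signature of MO discrediting LM.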
5. Monitoring

Our aim is to adapt the fault management automatically according to the reliability forecasting.

5.1 Fault tolerance in cluster computing

Checkpointing and rollback recovery are well-known techniques for providing fault tolerance in distributed systems. In this approach, each process saves its state on stable storage and, when an error is detected, the execution is rolled back and resumed from earlier checkpoints. To recover from a fault, the execution must be rolled back to a consistent state, but rolling back one process could result in an avalanche of rollbacks of other processes before a consistent state is found (an effect known as the domino effect). For example, when the sender of a message is rolled back, the corresponding receiver must also be rolled back.

Checkpointing techniques can be classified into two categories: co-ordinated and independent checkpointing. With co-ordinated checkpointing, processes co-ordinate their checkpointing actions so that the collection of checkpoints represents a consistent state of the whole system. When a failure occurs, the system restarts from these checkpoints. The main drawback of this approach is that the messages used for synchronizing a checkpoint are an important source of overhead. Moreover, after a failure, surviving processes may have to roll back to their latest checkpoint in order to remain consistent with recovering processes. In independent checkpointing, each process saves its state independently, without any synchronisation with the others. Message logging was introduced to avoid the domino effect [Bhargava 90, Elnozahy 94].

Logging methods fall into two classes: pessimistic and optimistic. In pessimistic message logging, the system writes incoming messages to stable storage before delivering them to the application [Borg 89, Sens 98]. To restart, the failing process is rolled back to the last checkpoint and messages are replayed from the log. The receiver is blocked until the message is logged on stable storage. Alternatively, the system delivers the message to the application immediately but disallows any further messages until the message has been logged. In optimistic message logging, messages are tagged with information that allows the system to keep track of inter-process dependencies [Elnozahy 94, Johnson 90, Smith 95]. This dependency information is used to determine which processes need to be rolled back on restart. This approach reduces the logging overhead, but processes that survive a failure may be rolled back. Alternatively, messages are gathered in the main memory of the sending host and asynchronously saved to stable storage.

Environmental parameters directly influence the choice of the checkpoint policy. If the failure rate is low, the recovery delay is less important and we can reasonably choose co-ordinated checkpointing. We can also adapt the checkpoint frequency according to the estimated time between failures. On the other hand, the behavior of the application influences the efficiency of fault-tolerant algorithms. For instance, the communication rate has a direct impact on the choice of the checkpoint algorithm: if this rate becomes too high, message logging approaches are not suitable and we must switch to co-ordinated checkpointing.
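The following sketch illustrates this kind of adaptation rule. The threshold values, the function names and the use of Young's first-order approximation for the checkpoint interval are our assumptions for the purpose of illustration; the text above only states that the strategy and the checkpoint frequency are adapted to the predictions.

    # Sketch of an adaptation rule driven by the prediction module: the fault
    # management strategy is chosen from the predicted time between failures
    # and the observed communication rate. Thresholds are illustrative only.
    import math

    def choose_strategy(predicted_mtbf_hours, messages_per_second,
                        high_comm_rate=50.0, long_mtbf_hours=24.0 * 7):
        """Checkpointing strategy suggested by the current predictions."""
        if messages_per_second > high_comm_rate:
            # Heavy communication: logging every message is too expensive,
            # so fall back to co-ordinated checkpointing.
            return "coordinated"
        if predicted_mtbf_hours > long_mtbf_hours:
            # Failures are rare: a longer recovery delay is acceptable and the
            # synchronisation cost of co-ordinated checkpoints is paid rarely.
            return "coordinated"
        # Frequent failures and moderate communication: independent
        # checkpointing with pessimistic message logging keeps recovery short.
        return "independent+pessimistic-logging"

    def checkpoint_interval_seconds(predicted_mtbf_hours, checkpoint_cost_seconds):
        """Checkpoint interval adapted to the estimated time between failures,
        using Young's approximation sqrt(2 * C * MTBF) (an assumption of this
        sketch, not a formula taken from the paper)."""
        return math.sqrt(2.0 * checkpoint_cost_seconds
                         * predicted_mtbf_hours * 3600.0)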
5.2 Performance issues in checkpointing

To reduce the cost of checkpointing, two complementary techniques are usually applied: incremental and non-blocking checkpointing. Incremental methods reduce the amount of data that must be written: only those pages of the address space modified since the last checkpoint are written to stable storage. Non-blocking methods allow the process to continue executing while its checkpoint is written to stable storage.

To give an estimation of the checkpointing cost, Figure 7 shows the costs associated with optimized checkpointing (non-blocking and incremental) in a networked environment [Sens 98]. The cost of checkpointing has been measured with different replication degrees of the stable storage (from 1 to 4) and with different process sizes (from 100 to 1150 kilobytes). A replication degree of 4 means that each checkpoint is saved on 4 disks on 4 hosts. The programs run with a rather short checkpointing interval of 20 seconds; in practice, longer intervals would be used. In that sense, we overestimate the cost of checkpointing in order to evaluate the checkpoint mechanism in a loaded situation.

Figure 7: Checkpoint evaluation (checkpoint time in seconds versus process size in Kbytes, for replication degrees 1 to 4)

5.3 Monitoring architecture

The monitoring module collects general information about resource usage and the evolution of the environment. Programs naturally exhibit locality of resource usage in terms of memory, files and communication. Consequently, observation of the past gives a good estimate of the future. We define in this section the distributed monitoring architecture. A specific host, the collector, collects information from the remote hosts involved. The collector subscribes to specific information chosen by a "reliability administrator". On each host, a server called the monitor agent collects information concerning local processes and sends it back to the collector. Figure 8 illustrates the monitoring architecture. The servers rely on a high-level communication layer, the Network Message Server (NMS), developed at LIP6.

Figure 8: The monitoring architecture (monitor agents and the collector communicating through the NMS on each host)

We define a library for file accesses and communication in order to keep track of file and communication usage; observed programs have to use this library for all their file accesses and communication (a sketch of such a wrapper is given at the end of this section). We also periodically observe the memory usage of each program. All this information is saved in a remote file, which maintains statistics for each application program and host. To ensure the reliability of this file, it is replicated on several hosts.

This architecture is suitable for observing a relatively small number of hosts (typically 50 workstations in a local network). To observe applications distributed over a large number of hosts we must adopt a decentralized scheme. We plan to use the Territory model defined by Raverdy [Raverdy 95]. In this model, a specific manager is defined for each application; this manager controls resource allocation only for its own application. Based on the results presented in [Raverdy 95], such an architecture is very efficient on large-scale networks, and we plan to use it in the next version of our system.
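As a concrete illustration of the wrapper library referred to above, the sketch below shows one way observed programs could route their file accesses and message sends through an instrumented layer that periodically reports usage to the collector. All names (MonitorAgent, traced_open, send_message) and the reporting callback are our assumptions; the real library is built on top of the Network Message Server.

    # Sketch of the instrumentation library: observed programs go through
    # wrappers for file access and message passing, so that the local monitor
    # agent can periodically ship usage statistics to the collector.
    import time
    from collections import defaultdict

    class MonitorAgent:
        """Accumulates per-process resource usage and flushes it periodically."""
        def __init__(self, report, period_s=60.0):
            self.report = report              # callback forwarding to the collector
            self.period_s = period_s
            self.counters = defaultdict(int)  # e.g. opens, messages, bytes sent
            self.last_flush = time.time()

        def record(self, event, amount=1):
            self.counters[event] += amount
            if time.time() - self.last_flush >= self.period_s:
                self.report(dict(self.counters))   # ship statistics
                self.counters.clear()
                self.last_flush = time.time()

    agent = MonitorAgent(report=print)            # stand-in for the NMS transport

    def traced_open(path, mode="r"):
        """File-access wrapper used instead of open() by observed programs."""
        agent.record("file_open")
        return open(path, mode)

    def send_message(channel, payload):
        """Communication wrapper; channel.send stands in for the real NMS call."""
        agent.record("messages_sent")
        agent.record("bytes_sent", len(payload))
        channel.send(payload)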
6. Conclusions and future work

There are many prediction systems, and also many algorithms that allow us to run applications in a non-reliable operational environment. We have presented an architecture for observing parallel applications in distributed environments. The collected information is used to choose the suitable fault-tolerant strategy for each application. The originality consists of linking up the two approaches: the information supplied by the prediction module will be used to adapt the appropriate fault tolerance strategy. A comparative analysis of the predictions allows us to choose between the pessimistic approach, which favors the recovery delay, and the optimistic one, which reduces the cost of fault management during failure-free execution.

In order to avoid the domino effect we propose a flexible monitoring scheme. In this monitoring scheme we define a specific library for file accesses and communication in order to keep track of file and communication usage. During the monitoring we also periodically observe the memory usage of each application. It is interesting to note that the originality of this approach is twofold: the project proposes (1) an architecture for distributed systems that allows us to monitor the collection of statistical data, and (2) support for the execution of applications in a non-reliable operational environment. In this new approach the observation of the past will improve future predictions in distributed systems.

In order to validate the resulting models, SURFnet will be used as a case study and real data sets will be analysed. The SURFnet network consists of a structure of cables linked together by various switches, on which several protocols, such as TCP/IP, run to enable communication over the network. The current SURFnet backbone is based on the so-called 'Cluster Leader' principle. Four institutions, SARA in Amsterdam and the universities of Delft, Enschede and Eindhoven, are designated as Cluster Leaders (central junctions in the network). Since the beginning of 1999, SURFnet has had an international 200 Mbit/s connection to accommodate the growing transatlantic communication.

References

[Bhargava 90] B. Bhargava, S-R. Lian, and P-J. Leu. Experimental Evaluation of Concurrent Checkpointing and Rollback-Recovery Algorithms. In Proc. of the International Conference on Data Engineering, pp. 182-189, March 1990.

[Borg 89] A. Borg, W. Blau, W. Craetsch, F. Herrmann, and W. Oberle. Fault Tolerance under UNIX. ACM Transactions on Computer Systems, 7(1):1-24, February 1989.

[Burtschy 97] B. Burtschy, G. Albeanu, D.N. Boros, and Fl. Popentiu. Improving Software Reliability Forecasting. Microelectronics and Reliability, 37(6):901-907, Pergamon Press, Oxford, 1997.

[Clark 92] H. Clark and B. McMillin. DAWGS - a Distributed Compute Server Utilizing Idle Workstations. Journal of Parallel and Distributed Computing, 14:175-186, February 1992.

[Daniels 97] F. Daniels, K. Kim, and M.A. Vouk. The Reliable Hybrid Pattern - A Generalized Software Fault Tolerance Design Pattern. PLoP Conference, 1997.

[Elnozahy 94] E.N. Elnozahy and W. Zwaenepoel. On the Use and Implementation of Message Logging. In Proc. of the 24th International Symposium on Fault-Tolerant Computing, pp. 298-307, Austin, Texas (USA), June 1994.

[Fenton 97] N.E. Fenton and S.L. Pfleeger. Software Metrics: A Rigorous and Practical Approach. International Thomson Computer Press, second edition, 1997, 641 pp.

[Johnson 90] D.B. Johnson and W. Zwaenepoel. Recovery in Distributed Systems using Optimistic Message Logging and Checkpointing. Journal of Algorithms, 11(3):462-491, September 1990.

[Koo 87] R. Koo and S. Toueg. Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, SE-13(1):23-31, January 1987.

[Musa 98] J.D. Musa. Software Reliability Engineering: More Reliable Software, Faster Development and Testing. McGraw-Hill, New York, 1999, 380 pp.
[Nicola 95] V. Nicola. Checkpointing and the Modeling of Program Execution Time. Chapter 7 in Software Fault Tolerance, M.R. Lyu (ed.), John Wiley & Sons, Chichester, England, 1995, pp. 167-188.

[Raverdy 95] P-G. Raverdy and B. Folliot. Presentation of the Execution Territory: a Two-Level Load Balancing Mechanism. In Proc. of the European Research Seminar on Advances in Distributed Systems, l'Alpe d'Huez, pp. 99-104, April 1995.

[Sens 97] P. Sens and B. Folliot. Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments. In Proc. of the 26th International Conference on Parallel Processing (ICPP-97), Bloomingdale, Illinois (USA), August 1997.

[Sens 98] P. Sens, Fl. Popentiu, and A. Mantoiu. Software Reliability Forecasting for Adapted Fault Tolerance Algorithms. In Proc. of ESREL'98, Trondheim, 16-19 June 1998, pp. 1069-1077.

[Smith 95] S.W. Smith, D.B. Johnson, and J.D. Tygar. Completely Asynchronous Optimistic Recovery with Minimal Rollbacks. In Proc. of the 25th Annual International Symposium on Fault-Tolerant Computing, Pasadena, CA (USA), June 1995.

[Verma 97] A.K. Verma and M.T. Tamhankar. Reliability-Based Optimal Task Allocation in Distributed-Database Management Systems. IEEE Transactions on Reliability, 46(4), December 1997.