A Software Architecture for Monitoring the Reliability in
Distributed Systems
Florin POPENTIU VLADICESCU, City University London, DEEIE, Northampton
Square, London EC1V 0HB, e-mail: Fl.Popentiu@city.ac.uk
Pierre SENS, Pierre & Marie Curie University, 4 place Jussieu, 75252 Paris
Cedex 05, France, e-mail: Pierre.Sens@Lip6.fr
Abstract: This paper addresses the problem of designing the software architecture according to
reliability forecasting. In a distributed computing environment, a statistical model is used
for determining the suitable algorithms related to the performance requirements of each
specific application. The proposed technique is based on a chain of automatic data
collection, which allows us to adjust the fault-management strategy during execution. We
propose a flexible monitoring scheme in which we define a specific library for file accesses
and communication, in order to keep track of file and communication usage. The approach
incorporates a self-contained architecture for distributed systems, which allows us to monitor
the collection of the statistical data and to support the execution of the applications in a
non-reliable operational environment. The main advantage of the proposed technique is that it
gives us the possibility to design the software architecture according to the most recent
requirements.
1. Introduction
At midnight on December 31, 1999 we will enter a new millennium. But without
appropriate action now, some of our computers will generate many critical errors. Our
proposed research project addresses the problem of software architecture according to
reliability forecasting. Software reliability forecasting is therefore a problem of increasing
importance for many critical applications. Statistical methods are important for improving
the ability of current software reliability growth models to give measures and predictions
of reliability that can be trusted when adapting fault-tolerance algorithms. In order to
retain the fault tolerance of large-scale systems, not only the reliability of each
component, but also the design of the whole system must be taken into consideration
[Verma 97]. Therefore, it is useful to ask "how should the subsystems be
interconnected to one another to attain fault tolerance as a whole?" A substantial body of
work has been published regarding fault tolerance, and we should find the best algorithm
related to the dependability and performance requirements for each specific application.
The reliability of such applications may be compromised due to external transient faults,
such as power failures, disk failures, processor failures, and other types of system
malfunction, or due to permanent faults, such as those internal to the program itself
[Nicola 95].
Suitable fault-tolerance techniques must be chosen for each application, and their
management adapted to the dependability levels needed. In general this choice is made by
the user when the application is launched, possibly with the assistance of a dependability
expert [Daniels 97]. However, in all cases current techniques do not allow the user to
adapt the management during execution. This is due to the uncertainty of the
operational environment and the uncertainty of the rate of occurrence of failures [Fenton
97].
2. Environments
This section presents our target environment and applications.
We consider an application program as a dynamic set of communicating processes. A
process may use any resources of the network (CPU and files). The only way to exchange
information between processes is through message passing. Applications concerned with
fault-tolerance management are long-running ones, such as large-number factoring, VLSI
applications, image processing, or Gaussian elimination. Such applications may execute
for hours, days or weeks. In that case, the failure probability becomes
significant, and the need for reliability is an important concern.
We consider two kinds of environment:
* Local area networks (LAN): applications are executed on a set of workstations linked
by a local area network (Ethernet). In such an environment the failure probability is low.
Clark and McMillin measured crashes on a local area network to occur, on average, once
every 2.7 days [Clark 92] (see the illustration after this list).
* Multi-LAN: considering that the number of hosts in a LAN is relatively low and that
networking technology keeps improving, the idea of spanning an application over an
interconnection of multiple LANs can be worthwhile. Such a platform is extremely
interesting as it provides more computing power and the possibility of profiting from the
idleness of underutilized remote hosts.
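As a rough, back-of-the-envelope illustration of why reliability matters here (our own
calculation, assuming independent, exponentially distributed host failures with the 2.7-day
mean time between crashes quoted above): a one-day run spread over ten workstations fails
with probability

    P(at least one crash) = 1 - exp(-(number of hosts) x (run time) / (mean time between crashes))
                          = 1 - exp(-10 x 1 / 2.7) ~ 0.98

so such an execution is almost certain to experience a failure without fault-tolerance support.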
3. A Prediction Tool
Figure 1: Forecasting Chain (raw data -> data collection -> measurement and prediction:
prediction over N models, U-plot test, PLR test over N1 candidates, recalibration, best
model -> adapted algorithm)
The main objective of our work is to improve the total quality of software architecture for
monitoring fault-tolerant design in distributed systems. In a distributed computing
environment a statistical model is used for determining the suitable algorithms related to
the performance requirements for each specific application. The proposed technique is
based on a chain of automatic data collection, which allows us to adjust the fault-management
strategy during execution (Figure 1).
The chain consists of the following modules and sub-modules:
1. The module "monitor of the application" collects information about the evolution of the
applications and of the environment.
2. The module "data collection" is concerned with a refinement process, extracting
essential data elements from raw data.
3. The module "measurement and prediction" chooses an appropriate reliability
model by evaluating the accuracy of the models; the sub-module "prediction" reveals the great
variations in the results of applying N software reliability growth models to the data; the
sub-module "U-plot" detects the bias of the medians; the sub-module "recalibration"
improves the predictions; the sub-module "PLR test" compares N1 competing sets of
predictions and selects one that produces the "global most accurate" prediction about the
future failure behavior of the application and the operational environment.
4. The module "adapted algorithms" uses the information from the prediction module to
select the most appropriate algorithm for adapting the fault management. A minimal sketch of
the whole chain is given after this list.
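As an illustration only (our own sketch; the stage functions are hypothetical placeholders
supplied by the caller, not part of the tool described here), the chain of Figure 1 can be
wired together as follows:

    # Hypothetical skeleton of the forecasting chain of Figure 1.
    # The statistical stages are plain callables, so the chain itself stays
    # independent of any particular reliability growth model.
    from typing import Callable, Dict, List

    def forecasting_chain(
        raw_events: List[dict],
        refine: Callable[[List[dict]], List[float]],                  # module 2: data collection
        models: Dict[str, Callable[[List[float]], List[float]]],      # sub-module "prediction" (N models)
        u_plot_ok: Callable[[List[float], List[float]], bool],        # sub-module "U-plot" test
        recalibrate: Callable[[List[float], List[float]], List[float]],
        plr_score: Callable[[List[float], List[float]], float],       # sub-module "PLR test"
        choose_algorithm: Callable[[List[float]], str],               # module 4: adapted algorithm
    ) -> str:
        data = refine(raw_events)
        predictions = {name: fit(data) for name, fit in models.items()}
        candidates = {n: p for n, p in predictions.items() if u_plot_ok(p, data)}   # keep the N1 unbiased models
        recalibrated = {n: recalibrate(p, data) for n, p in candidates.items()}
        best = max(recalibrated, key=lambda n: plr_score(recalibrated[n], data))    # globally most accurate
        return choose_algorithm(recalibrated[best])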
Figure 2: Complete Failure Type (percentage of total failures for the categories COMM,
HWR, SWR and HUM)
4. Reliability Modeling
4.1 The failure chain
To model software reliability one must first consider the principal factors that affect it:
fault introduction, fault removal and use [Musa 98]. The LAN and Multi-LAN
environments automatically and continuously record and report information such as the
date and time of occurrence, duration of downtime, and the cause of the outage. The
information is used by system engineers to detect congestion, balance loads, improve
performance of the system and so on. Accuracy of the data collection plays a central role
in the measurement and prediction of a system’s reliability. A failure does not necessarily
impact all users logged into the system. Partial failures affect a certain fraction of users,
while complete failures affect all users. In practice, however, we restrict reliability
modeling to complete failures only. A failure-event chain includes different events such
as network communications (COMM), hardware problems (HWR), system software
(SWR) and human errors (HUM). Figure 2 shows the distribution of complete failures
across these four categories of events for the LAN and Multi-LAN environments in the
period 1994-1998. From Figure 2 we can see that software-related problems (SWR)
decrease as we make the transition from LAN to Multi-LAN environments, in line with
historical trends in software reliability.
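As an illustration of the data reduction involved (our own sketch; the record fields are
hypothetical), the distribution of Figure 2 can be obtained by a simple aggregation over the
outage log:

    # Hypothetical outage records: date/time, downtime, cause and scope,
    # as collected automatically by the LAN / Multi-LAN environments.
    from collections import Counter

    CATEGORIES = ("COMM", "HWR", "SWR", "HUM")

    def complete_failure_distribution(records):
        """Return the percentage of complete failures per cause category."""
        complete = [r for r in records if r["scope"] == "complete"]    # ignore partial failures
        counts = Counter(r["cause"] for r in complete if r["cause"] in CATEGORIES)
        total = sum(counts.values()) or 1
        return {c: 100.0 * counts[c] / total for c in CATEGORIES}

    # Example record:
    # {"date": "1995-03-02", "downtime_min": 42, "cause": "COMM", "scope": "complete"}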
A software reliability model specifies the general form of the dependence of the failure
process on the factors mentioned.
4.2 R-Vision tool
R-Vision is a software reliability estimation tool that runs under UNIX environments. The
tool is relatively easy to learn and use because it provides menus and dialog boxes.
R-Vision was designed for X-Windows, and its File, Options and Help menus are
user-friendly. In Figure 3 we present the distribution of the interfailure times for a set of 288
failures recorded in a Multi-LAN environment.
The 288 successive failures are analysed using 8 reliability growth models; the current
median time to next failure is plotted in Figure 4.
To detect bias in the medians plotted in Figure 4 we can use the U-plot [Fenton 97],
shown in Figure 5.
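Concretely, the u-plot is built from one-step-ahead predictions: u_i = F_i(t_i), where F_i is
the predicted distribution of the i-th interfailure time and t_i the value actually observed;
for good predictions the u_i behave like a uniform sample. The sketch below (our own
illustration following the standard construction in [Fenton 97], not R-Vision code) computes
the u-values and the Kolmogorov distance used later to compare models:

    # Sketch of the u-plot construction (standard technique, not R-Vision internals).
    # predicted_cdfs[i] is the one-step-ahead predictive cdf for interfailure time i,
    # observed[i] is the interfailure time actually observed.

    def u_values(predicted_cdfs, observed):
        """u_i = F_i(t_i); for perfect predictions the u_i are i.i.d. uniform on (0, 1)."""
        return [F(t) for F, t in zip(predicted_cdfs, observed)]

    def kolmogorov_distance(us):
        """Maximum vertical distance between the u-plot (empirical cdf of the u_i)
        and the line of unit slope; large values indicate biased predictions."""
        us = sorted(us)
        n = len(us)
        return max(max(abs((i + 1) / n - u), abs(u - i / n)) for i, u in enumerate(us))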
Figure 3: Plot of failures in order of occurrence
Figure 4: Predicted medians
Figure 5: U-plot test
Figure 6: PLR method
The noise and the bias of the median predictions can be detected using the prequential
likelihood ratio (PLR). An example is plotted in Figure 6.
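For two competing sets of predictions, the PLR is the running ratio of their one-step-ahead
predictive likelihoods evaluated at the observed interfailure times; a steadily decreasing
ratio indicates that the model in the denominator is to be preferred. A minimal sketch (our
own illustration following [Fenton 97]; the function names are illustrative):

    import math

    def prequential_log_likelihood_ratio(pdfs_a, pdfs_b, observed):
        """Running log-PLR of model A against model B.
        pdfs_a[i] and pdfs_b[i] are the one-step-ahead predictive densities
        for interfailure time i; observed[i] is the value actually seen."""
        log_plr, trace = 0.0, []
        for fa, fb, t in zip(pdfs_a, pdfs_b, observed):
            log_plr += math.log(fa(t)) - math.log(fb(t))
            trace.append(log_plr)      # plotted against i, as in Figure 6
        return trace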
4.3 Some results
The techniques implemented in the R-Vision tool represent a theory of design reliability,
applicable equally to software and hardware. The software whose reliability we wish to
measure is operating in a real environment, and when failures occur, attempts are made to
find and fix the faults [Fenton 97]. In the long run, we expect to see reliability
improvement. Figure 4 illustrates the raw predictive median time to failure for 8
reliability growth models, including Duane (D), Jelinski-Moranda (JM), Keiller-Littlewood
(KL), Littlewood (LM), Littlewood non-homogeneous Poisson (LNHPP), Littlewood-Verrall (LV)
and Musa-Okumoto (MO), using the data presented in Figure 3. As we can see, some
predictions are very optimistic, such as JM and D, and some pessimistic, such as LV and KL.
After the U-plot tests illustrated in Figure 5 we decided that the recalibrated LM and MO
models are the best ones. This decision was taken after measuring the degree of deviation
using the Kolmogorov distance. We see the effects of recalibration in Figure 5, which
shows the evolution of the maximum vertical distance. The difference in terms of noise
and bias between these two models can be seen using the prequential likelihood ratio.
From the decreasing shape presented in Figure 6 we can see that MO discredits LM, and
this PLR test gives us the opportunity to select the best model.
5. Monitoring
Our aim is to adapt automatically the fault management according to the reliability
forecasting.
5.1 Fault tolerance in cluster computing
Checkpointing and rollback recovery are well known to provide fault tolerance in
distributed systems. In this approach, each process saves its state on stable storage and,
when an error is detected, the execution is rolled back and resumed from earlier
checkpoints.
To recover from a fault, the execution must be rolled back to a consistent state, but rolling
back one process could result in an avalanche of rollbacks of other processes before a
consistent state is found (such an effect is well-known as the domino effect). For
example, when the sender of a message is rolled back, the corresponding receiver is also
rolled back.
Checkpointing techniques can be classified into two categories: co-ordinated and
independent checkpointing.
With co-ordinated checkpointing, processes co-ordinate their checkpointing actions such
that the collection of checkpoints represents a consistent state of the whole system. When
a failure occurs, the system restarts from these checkpoints. The main drawback of this
approach is that the messages used for synchronizing a checkpoint are an important
source of overhead. Moreover, after a failure, surviving processes may have to roll back
to their latest checkpoint in order to remain consistent with recovering processes. In
independent checkpointing, each process independently saves its state without any
synchronisation with the others. Message logging was introduced to avoid the domino
effect [Bhargava 90, Elnozahy 94].
Logging methods fall into two classes: pessimistic and optimistic. In pessimistic message
logging the system writes incoming messages to stable storage before delivering them to
the application [Borg 89, Sens 98]. To restart, the failing process is rolled back to the last
checkpoint and messages are replayed from the log. The receiver is blocked until the
message is logged on stable storage. Alternatively, the system delivers the message to the
application immediately but disallows any further messages until the message has been
logged. In optimistic message logging, messages are tagged with information that allows
the system to keep track of inter-process dependencies [Elnozahy 94, Johnson 90, and
Smith 95]. Inter-process dependency information is used to determine which processes
need to be rolled back on restart. This approach reduces the logging overhead but
processes that survive a failure may be rolled back. Alternatively, messages are gathered
in the main memory of the sending host and are asynchronously saved to stable storage.
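As an illustration of the pessimistic variant (a simplified, self-contained sketch of the
general technique, not the implementation of [Borg 89] or [Sens 98]; the log path is supplied
by the caller):

    import os
    import pickle

    class PessimisticLogger:
        """Sketch of pessimistic message logging: every incoming message is forced
        to stable storage before it is delivered to the application, so that a
        restarted process can deterministically replay its input from the log."""

        def __init__(self, log_path):
            self.log = open(log_path, "ab")

        def deliver(self, message, handler):
            pickle.dump(message, self.log)
            self.log.flush()
            os.fsync(self.log.fileno())    # receiver blocks here until the message is logged
            handler(message)               # only then is the message handed to the application

        def replay(self, log_path, handler):
            # After rolling back to the last checkpoint, re-deliver logged messages in order.
            with open(log_path, "rb") as f:
                while True:
                    try:
                        handler(pickle.load(f))
                    except EOFError:
                        break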
Environmental parameters directly influence the choice of the checkpoint policy. Thus, if
the failure rate is low, the recovery delay matters less and we can reasonably choose
co-ordinated checkpointing. We can also adapt the checkpoint frequency according to the
estimated time between failures, as sketched below.
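One common heuristic for setting this frequency, which the paper does not prescribe but which
fits naturally here, is Young's first-order approximation of the optimal checkpoint interval,
where C is the cost of writing one checkpoint and MTBF the estimated mean time between
failures:

    T_opt ~ sqrt(2 x C x MTBF)

For example, with a checkpoint cost of the order reported in Figure 7, say C = 2 s, and a
predicted MTBF of 2.7 days (about 233,000 s), T_opt ~ sqrt(2 x 2 x 233,000) ~ 965 s, i.e. a
checkpoint roughly every 16 minutes.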
On the other hand, application behavior influences the efficiency of fault tolerant
algorithms. For instance, the communication rate has a direct impact on the choice of the
checkpoint algorithm. If this rate becomes too high, message logging approaches are not
suitable and we must switch to co-ordinated checkpointing.
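Putting the two previous paragraphs together, the adaptation can be expressed as a simple
decision rule (our own sketch; the threshold values, names and units are hypothetical, not
values proposed by the authors):

    import math

    def choose_fault_tolerance(predicted_mtbf_s, msg_rate_per_s, checkpoint_cost_s,
                               msg_rate_threshold=100.0):
        """Pick a checkpointing strategy and interval from the forecast.

        predicted_mtbf_s   -- median time to next failure from the prediction module (s)
        msg_rate_per_s     -- communication rate observed by the monitoring module
        checkpoint_cost_s  -- measured cost of writing one checkpoint (s)
        """
        if msg_rate_per_s > msg_rate_threshold:
            # High communication rate: message logging overhead dominates.
            strategy = "coordinated"
        elif predicted_mtbf_s > 7 * 24 * 3600:
            # Low failure rate: recovery delay matters less, co-ordination is acceptable.
            strategy = "coordinated"
        else:
            strategy = "independent+pessimistic-logging"
        # Young's approximation for the checkpoint interval (see above).
        interval = math.sqrt(2 * checkpoint_cost_s * predicted_mtbf_s)
        return strategy, interval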
5.2 Performance issues in checkpointing
To reduce the cost of checkpointing, two complementary techniques are usually applied:
incremental and non-blocking checkpointing.
Incremental methods reduce the amount of data that must be written. Only those pages of
the address space modified since the last checkpoint are written to stable storage.
Non-blocking methods allow the process to continue executing while its checkpoint is written
to stable storage.
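The two optimisations can be sketched at user level as follows (our own simplified
illustration, not the mechanism measured in [Sens 98]: the incremental part detects changed
chunks with digests, and the non-blocking part delegates the write to a forked copy-on-write
child process):

    import hashlib
    import os
    import pickle

    class IncrementalCheckpointer:
        """Write only the chunks of the process state that changed since the last
        checkpoint (incremental), in a forked child so the parent keeps executing
        (non-blocking, relying on the operating system's copy-on-write fork)."""

        def __init__(self, chunk_size=4096):
            self.chunk_size = chunk_size
            self.digests = {}     # chunk index -> digest of the last saved version

        def checkpoint(self, state_bytes, path):
            # 1. Incremental part (in the parent): find chunks changed since last checkpoint.
            dirty = {}
            for i in range(0, len(state_bytes), self.chunk_size):
                chunk = state_bytes[i:i + self.chunk_size]
                digest = hashlib.sha1(chunk).hexdigest()
                if self.digests.get(i) != digest:
                    dirty[i] = chunk
                    self.digests[i] = digest
            # 2. Non-blocking part: a forked child writes to stable storage while the
            #    parent immediately resumes execution (a real system would also reap the child).
            if os.fork() == 0:
                with open(path, "wb") as f:
                    pickle.dump(dirty, f)
                    f.flush()
                    os.fsync(f.fileno())
                os._exit(0)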
To give an idea of the checkpointing overhead, Figure 7 shows the costs associated with
optimized checkpointing (non-blocking and incremental) on a networked environment
[Sens 98]. The cost of checkpointing has been measured with different replication
degrees of the stable storage (from 1 to 4) and with different process sizes (from 100 to
1150 kilobytes). A replication degree of 4 means that each checkpoint is saved on 4 disks
on 4 hosts. The programs run with a rather short 20 second checkpointing interval. In
practice, longer intervals should be used. In that sense, we overestimate the cost of
checkpointing to evaluate the checkpoint mechanism in a loaded situation.
Figure 7: Checkpoint evaluation (checkpoint time in seconds, from 0 to 4, versus process
size in Kbytes, from 100 to 1100, for stable-storage replication degrees 1 to 4)
5.3 Monitoring architecture
The monitoring module collects general information about resource usage and the evolution
of the environment. Programs naturally exhibit locality of resource usage in terms of
memory, files, and communication. Consequently, observation of the past gives a good
estimate of the future. We define in this section the distributed monitoring architecture.
A specific host, the collector, collects information from remote involved hosts. The
collector is subscribed to specific information chosen by a "reliability administrator". On
each host a server called the monitor agent collects information concerning local
processes and sends it back to the collector. Figure 8 illustrates the monitoring
architecture. Servers rely on a high-level communication layer, the Network Message
Server, developed at the LIP6.
Figure 8: The monitoring architecture (on each host a monitor agent observes local
processes such as P1 and P2 and communicates, through the NMS layer on top of the
operating system, with the collector)
We define a library for file accesses and communication in order to keep track of file and
communication usage. Observed programs have to use this library for all their file
accesses and communication. We also periodically observe the memory usage of each
program.
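A possible shape for such a wrapper library (our own sketch; the agent address, counter names
and reporting period are hypothetical and do not describe the actual NMS interface):

    import socket
    import time

    class MonitoredIO:
        """Wrapper used by observed programs for all file accesses and communication.
        It counts operations and bytes and periodically reports them to the local
        monitor agent, which forwards them to the collector."""

        def __init__(self, agent_host="localhost", agent_port=7000, report_period=30.0):
            self.counters = {"file_bytes": 0, "msg_bytes": 0, "files_opened": 0, "msgs_sent": 0}
            self.agent_addr = (agent_host, agent_port)
            self.report_period = report_period
            self.last_report = time.time()

        def open(self, path, mode="r"):
            self.counters["files_opened"] += 1
            self._maybe_report()
            return open(path, mode)

        def send(self, sock, data):
            self.counters["msgs_sent"] += 1
            self.counters["msg_bytes"] += len(data)
            self._maybe_report()
            return sock.send(data)

        def _maybe_report(self):
            # Periodically push the counters to the local monitor agent (one-way report).
            if time.time() - self.last_report >= self.report_period:
                payload = repr(self.counters).encode()
                with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
                    s.sendto(payload, self.agent_addr)
                self.last_report = time.time()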
All this information is saved on a remote file, which maintains statistics for each
application program and host. To ensure the reliability of this file, it is replicated on
several hosts.
This architecture is suitable for observing a relatively small number of hosts (typically 50
workstations in a local network). To observe applications distributed over a large number of
hosts we must adopt a decentralized scheme. We plan to use the Territory model defined
by Raverdy [Raverdy 95]. In this model, a specific manager is defined for each
application. This manager controls resource allocation only for its own application. Based
on results presented in [Raverdy 95], we can claim that such an architecture is very efficient
on large-scale networks, and we plan to use it in the next version of our system.
6. Conclusions and future work
There are many prediction systems, and also many algorithms that allow us to run
applications in a non-reliable operational environment.
We have presented an architecture to observe parallel applications in distributed
environments. Collected information is used to choose the suitable fault tolerant strategy
for each application. The originality consists of linking up the two approaches. The
information supplied by the prediction module will be used to adapt the appropriate fault
tolerance strategy. A comparative analysis of the predictions allows us to choose between
the pessimistic approach to favor the recovery delay and the optimistic one to reduce the
cost of fault management under the failure - free execution. In order to avoid the domino
effect we propose a flexible-monitoring scheme. In this monitoring scheme we define a
specific library for accessing files & dialogue in order to keep track of files and
communication usage. During the monitoring we also observe periodically the memory
usage for each application. It is interesting to note that the originality of this approach is
double. This project proposes (1) the architecture for distributed systems, which allows us
11
to monitor the collection of the statistical data and (2) the execution of the applications in
a non-reliable operational environment. In this new approach the observation of the past
will improve the future predictions in distributed systems. In order to validate the
resulting models, the SURFnet will be used as case studies, and real data sets will be
analysed.
The SURFnet network consists of a structure of cables linked together by various
switchboards, on which several protocols, such as TCP/IP, run to enable communication
over the network. The current SURFnet backbone is based on the so-called 'Cluster
Leader' principle. Four institutions, SARA in Amsterdam and the universities of Delft,
Enschede and Eindhoven, are designated as Cluster Leaders (central junctions in the
network). Since the beginning of 1999 SURFnet has had an international 200 Mbit/s
connection to accommodate the growing transatlantic communication.
References
[Bhargava 90] B. Bhargava, S-R. Lian, and P-J. Leu. Experimental Evaluation of
Concurrent Checkpointing and Rollback-recovery Algorithms. In Proc. of the
International Conference on Data Engineering, pp. 182-189, March 1990.
[Borg 89] A. Borg, W. Blau, W. Craetsch, F. Herrmann, and W. Oberle. Fault Tolerance
under UNIX. ACM Transactions on Computer Systems, 7(1):1-24, February 1989.
[Burtschy 97] B. Burtschy, G. Albeanu, D. N. Boros, Fl. Popentiu, Improving Software
   Reliability Forecasting, Microelectronics and Reliability, Pergamon Press, Oxford,
   1997, Vol. 37, No. 6, pp. 901-907.
[Clark 92] H. Clark, B. McMillin. DAWGS - a Distributed Compute Server Utilizing
Idle Workstations. Journal of Parallel and Distributed Computing, 14:175-186,
February 1992.
[Daniels 97] F. Daniels, K. Kim, and M.A. Vouk, "The Reliable Hybrid Pattern - A
   Generalized Software Fault Tolerance Design Pattern," PLoP Conference, 1997.
[Elnozahy 94] E.N. Elnozahy, W. Zwaenepoel. On the Use and Implementation of
Message Logging. In Proc. of the 24th International Symposium on Fault-Tolerant
Computing Systems, pp. 298-307, Austin, Texas (USA), June 1994.
[Fenton 97] N.E. Fenton and S.L. Pfleeger, Software Metrics: A Rigorous and Practical
   Approach, International Thomson Computer Press, 1997 (second edition), 641 pp.
[Johnson 90] D.B. Johnson and W. Zwaenepoel. Recovery in Distributed Systems using
   Optimistic Message Logging and Checkpointing. Journal of Algorithms, 11(3):462-491,
   September 1990.
[Koo 87] R. Koo and S. Toueg. Checkpointing and Rollback-Recovery for Distributed
   Systems. IEEE Transactions on Software Engineering, SE-13(1):23-31, January 1987.
[Musa 98] J.D. Musa, Software Reliability Engineering: More Reliable Software, Faster
   Development and Testing, McGraw-Hill, New York, 1999, 380 pp.
[Nicola 95] V. Nicola, Checkpointing and the Modeling of Program Execution Time,
   Chapter 7 in Software Fault Tolerance, M.R. Lyu (ed.), John Wiley & Sons,
   Chichester, England, 1995, pp. 167-188.
[Raverdy 95] P-G. Raverdy, B. Folliot. Presentation of the Execution Territory: a Two
   Level Load Balancing Mechanism. Proc. of the European Research Seminar on Advances
   in Distributed Systems, Alpe d'Huez, France, pp. 99-104, April 1995.
[Sens 97] P. Sens and B. Folliot. Performance Evaluation of Fault Tolerance for
Parallel Applications in Networked Environments. Proc. of 26th International
Conference on Parallel Processing (ICPP-97), Bloomingdale, Illinois (USA), August
1997.
[Sens 98] P. Sens, Fl. Popentiu, A. Mantoiu, Software Reliability Forecasting for Adapted
   Fault Tolerance Algorithms, Proc. of ESREL'98, Trondheim, 16-19 June 1998, pp. 1069-1077.
[Smith 95] S.W. Smith, D.B. Johnson, J.D. Tygar. Completely Asynchronous Optimistic
Recovery with Minimal Rollbacks. In Proc. of the 25th Annual International
   Symposium on Fault-Tolerant Computing, Pasadena, CA (USA), June 1995.
[Verma 97] A. K. Verma, M.T. Tamhankar, Reliability-Based Optimal Task Allocation in
   Distributed-Database Management Systems, IEEE Transactions on Reliability, Vol. 46,
   No. 4, Dec. 1997.