Improving Dependability in Service Oriented Architectures using Ontologies and Fault Injection

Binka Gwynne, Jie Xu
School of Computing, University of Leeds, Leeds, LS2 9JT, UK
{binka, jxu}@comp.leeds.ac.uk

Abstract

Large distributed systems and computer grids are increasingly being used in science and in business, with Service Oriented Architectures combined with Web services being the currently favoured solution for accessing these distributed, heterogeneous resources. However, service-based systems rely heavily on middleware that is continuously evolving. This requires novel methods of testing and evaluation to provide high dependability. Our objective is to introduce innovative methods of testing SOA middleware using experimental provenance and ontologically supported software fault injection, and to gain a better understanding of the SOA dependability and fault domains.

1. Introduction

Service Oriented Architectures in combination with Web services are the currently favoured solution for accessing distributed, heterogeneous resources, yet service-based systems rely heavily on continuously evolving middleware. This middleware requires high levels of dependability if it is to become as widely adopted, ubiquitous and successful as the established Internet technologies. The Internet is predominantly used for information retrieval displayed in a human readable form; however, the emergence of P2P and grid technologies is driving a revolution in established distributed technologies, due to the high demands of increasingly large scale information processing, loosely coupled services, and machine comprehension.

The objective of this paper is to introduce innovative methods of testing SOA middleware through the use of ontologically supported software fault injection; this will allow us to improve the evaluation of the dependability and fault domains of SOA middleware, and to cyclically enhance the quality of our testing methods.

2. Background

2.1 Service Oriented Architecture

A Service Oriented Architecture (SOA) is an architectural model that emphasises the properties of interoperability and location transparency. It is essentially a collection of services, where each service can be considered a resource that is either provided or consumed. In an ideal world the human users of Service Oriented Systems would be unaware of the transparent, but in all probability complex, middleware that provides their trustworthy access to resources.

Although Web services are the current de facto standard middleware for SOAs, they are not the only available middleware: an SOA is a high level architectural model and Web services are only one means of its implementation. In addition, Web services are still developing, driven by ever increasing requirements. For example, SOAs and Web services do not yet have full transactional reliability, although implementations and protocols are maturing to meet this requirement [1].

2.2 Dependability

Dependability is a collective term that encompasses reliability, performance, maintainability, security etc. in computing systems: it includes everything that is needed to make a computer system dependable. Dependability is inextricably linked with the notion of trust: “The ability to deliver service that can justifiably be trusted” [2].

Reliability is the part of dependability concerned with the probability that a given system will behave according to its requirements [3]. Reliability was chosen for our test scenario as it has a distinct numeric value between 0 and 1 for a given time period. In consequence, reliability can provide empirical results and an opportunity to generate meaningful provenance data. The concept of provenance is well established in the real world: it is a trustworthy, documented history, including information on origin and fabrication processes.
Provenance has already been applied in the virtual world, for example in the verification of in silico experiments, so that they are fully documented and replicable [4] [5].

To improve reliability, a system must decrease the number of its failures by removing errors or surviving their occurrence. A system failure is an event that occurs when a delivered service deviates from its correct service [6]. Errors are inextricably linked to failures, as an error is defined as present in a system when that system’s state deviates in some way from its expected state. In consequence, violations of timing constraints can also be viewed as reliability failures [7], as could a task which completes more efficiently than planned. According to Randell [8]: “There will often be considerable subjective judgement involved in identifying errors, particularly errors due to design faults in complex software.” A paradox exists here: the concepts of the fault and dependability domains are imprecise, yet we are trying to give machines, which require explicitness, an understanding of these domains.

2.3 Ontologies

An ontology is an abstract, formal and explicit description of a specified part of the world or domain. Computing ontologies must follow rules, have structure and lack ambiguities, so that machines are able to process them and glean the information they require from them. In consequence, computing ontologies must be formal and explicit: formal, in following or being in accord with accepted forms or regulations; explicit, in being fully and clearly expressed and leaving nothing implied. Although machines require their information to be both formal and explicit, it is difficult to be explicit when describing the nature of something, and at this time machines remain less successful semantic processors than their human counterparts.
Computing ontologies are used to glean intelligence from data through the descriptions and relations of element attributes within domains; indeed they are central to the concept of the Semantic Web [9]. Ontologies can vary in scale from flat lexicons with few relationships to very large, expressive ontologies attempting to capture every possible aspect of a domain [10]. Detailed ontologies can be highly complex, although languages and tools, such as Protégé [11], are available to assist in their development.

Our ontologies are to be developed with information from experimental testing and provenance data; this makes testing methods replicable and should make them available to heterogeneous systems. The ontologies attempt to classify and describe part of the fault and dependability domains, and to provide information that assists in the design of better tests and evaluation methods, including the conceptualisation of domains.

3. Experimental Methods

3.1 Testing Using Software Fault Injection

Middleware testing is being carried out using software fault injection on Web services. Software fault injection is the deliberate insertion of faults into a computer system so that the system exhibits its behaviour in the presence of faults more quickly: a faster and more analytical method of testing than simple observation methods such as logging [12]. Fault injection is also a well proven method of demonstrating dependability [13] [14]. As with all other testing methods, fault injection increases the chance of finding faults within a system so that those faults can be removed or tolerated, but it cannot guarantee that all the faults in that system’s fault domain have been discovered, and can therefore never give 100% assurance against all future failures.
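As a minimal sketch of the idea, the Python fragment below wraps a service call and, with a configurable probability, delays it or replaces its first parameter before the call reaches the service. The names `inject_faults`, `lookup_part` and `faulty_lookup`, the catalogue contents, and the fault labels are hypothetical and are not taken from the experiments described here; a Web services experiment would more plausibly intercept SOAP messages in the middleware than wrap a local function.

```python
import random
import time
from functools import wraps


def inject_faults(latency_s=0.0, corrupt=None, probability=1.0, rng=None):
    """Decorator that injects latency and/or a replaced first argument.

    Illustrative assumption only, not the tool used in the paper.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            injected = []
            if rng.random() < probability:
                if latency_s > 0:
                    time.sleep(latency_s)  # latency fault
                    injected.append("latency")
                if corrupt is not None and args:
                    # parameter replacement fault on the first argument
                    args = (corrupt(args[0]),) + args[1:]
                    injected.append("parameter")
            # Record what was injected so the run can be traced afterwards.
            wrapper.injected_log.append(injected)
            return func(*args, **kwargs)

        wrapper.injected_log = []
        return wrapper

    return decorator


def lookup_part(part_id):
    """Hypothetical service operation that rejects unknown identifiers."""
    catalogue = {"A1": "bolt", "B2": "washer"}
    if part_id not in catalogue:
        raise KeyError(f"unknown part: {part_id}")
    return catalogue[part_id]


# Always reverse the identifier, so a valid request becomes an invalid one.
faulty_lookup = inject_faults(corrupt=lambda s: s[::-1])(lookup_part)
```

Calling `faulty_lookup("A1")` sends `"1A"` to the service, which raises `KeyError`; the entry appended to `faulty_lookup.injected_log` records which faults were active for that call, which is the kind of detail a provenance log would need.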
There are two main concerns with fault injection: the plausibility of the fault model as a representation of actual faults (fault representativeness) [15]; and the possibility that the fault injection itself will change conditions within the system, which could lead to the phenomenon of spawned faults.

3.2 Experimental Procedures

Experiments using fault injection techniques on Web services protocols aim to generate data for ontology development. Our experimental requirements include trustworthy, proven histories of our testing and evaluation methods, and these are to be provided by recording experimental provenance.

Attempts to model large domains accurately and completely can cause ontologies to grow very large, which may lead to unnecessary problems where only a subset of the information is required. The solution is the use of sub-ontologies, which can also improve efficiency [16]. As the dependability domain is large and has unclear boundaries [17], modeling it will involve sub-ontologies that describe interrelated sub-domains, and in consequence we have focused our initial area of interest on the reliability sub-domain. Experiments explore replacement of parameters and latency issues; these are still under development.

3.3 The Software Fault Domain Model

Models are abstractions able to portray the essentials of complex problems and provide a useful way of understanding those problems relative to the real world. The software fault domain cannot be described as a simple hierarchy in which each element occurs only once, and in consequence an open world assumption is required. For example, the concept of time related faults occurs in numerous sub-domains of the software fault domain: latency may occur due to resource management issues, or during bit transmission within a communication sub-domain, or during software aging.
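The point that one concept can recur across sub-domains can be illustrated with a small Python sketch. The sub-domain names loosely follow the high level fault model (Figure 1); the mapping itself, the `message_corruption` entry, and the placement of software aging under the life cycle sub-domain are illustrative assumptions, not part of the model described here.

```python
from collections import defaultdict

# Fault concept -> sub-domains in which it can occur (illustrative only).
OCCURS_IN = {
    "latency": {"resource_management", "communication_transmission", "life_cycle"},
    "software_aging": {"life_cycle"},
    "message_corruption": {"communication_transmission"},
}


def concepts_by_subdomain(occurs_in):
    """Invert the mapping so each sub-domain lists the concepts seen in it."""
    index = defaultdict(set)
    for concept, subdomains in occurs_in.items():
        for subdomain in subdomains:
            index[subdomain].add(concept)
    return dict(index)


def recurring_concepts(occurs_in):
    """Concepts found in more than one sub-domain: a single-parent tree cannot hold them."""
    return sorted(c for c, subs in occurs_in.items() if len(subs) > 1)
```

Here `recurring_concepts(OCCURS_IN)` returns `["latency"]`, the one concept that a strict hierarchy would have to duplicate or misplace.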
The phenomenon of software aging is an accumulation of errors during execution that over time eventually results in software failure [18]. Software aging is sometimes evident only from a gradual increase in resource requirements, and may occur with no observable effects at all, so it is a logical case for fault injection experiments.

• Inherent design
• Latent design
• Resource management
• Communication-heterogeneity
• Communication-transmission
• Life Cycle
• Security

Figure 1: A High Level Model of Faults in the Software Fault Domain

Provenance of testing includes details of the type of test, system under test, timestamp, and results log, so that revealed errors can be used to describe their corresponding faults within the system.

    <test>
      <testID>testUniqueID</testID>
      <testerID>testerUniqueID</testerID>
      <testType>type</testType>
      <testSubject>subject</testSubject>
      <testDate>date</testDate>
      <startTestTimestamp>stTimestamp</startTestTimestamp>
      <provenanceLogID>logNameUniqueID</provenanceLogID>
      <testResults>
        <noError></noError>
        <error>
          <errorType>type</errorType>
          <errorTimestamp>Timestamp</errorTimestamp>
          <errorLocation>location</errorLocation>
          <errorMessage>message</errorMessage>
        </error>
      </testResults>
      <endTestTimestamp>endTimestamp</endTestTimestamp>
    </test>

Figure 2: Example Test Data as Schema

XML is a simple, well documented, straightforward data format that provides the explicitness required by machines [19]. XML can be used to describe the results of tests and to build a model of the fault domain of a system. Web services, the focus of these tests, are also based on XML, as SOAP messages are well-formed XML documents. In this way the same techniques can be applied to SOAP messages.

In further developments we intend to use OWL [20], which is a descriptive language built on top of the more basic language RDF [21]. RDF is a data model, commonly serialised as XML, in which statements are represented as triples: subject, predicate, object.
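A provenance record of the kind shown in Figure 2 could be produced and read back with Python's standard `xml.etree.ElementTree` module, as in the sketch below. Element names follow the test record in Figure 2; the function names `build_test_record` and `error_types`, and the keys of the `errors` dictionaries, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_test_record(test_id, tester_id, test_type, subject, date,
                      start_ts, end_ts, log_id, errors=()):
    """Build a <test> element laid out like the Figure 2 record.

    `errors` is a sequence of dicts with the hypothetical keys
    'type', 'timestamp', 'location' and 'message'.
    """
    test = ET.Element("test")
    for tag, value in [
        ("testID", test_id), ("testerID", tester_id),
        ("testType", test_type), ("testSubject", subject),
        ("testDate", date), ("startTestTimestamp", start_ts),
        ("provenanceLogID", log_id),
    ]:
        ET.SubElement(test, tag).text = value

    results = ET.SubElement(test, "testResults")
    if not errors:
        ET.SubElement(results, "noError")  # empty marker, as in Figure 2
    for err in errors:
        node = ET.SubElement(results, "error")
        ET.SubElement(node, "errorType").text = err["type"]
        ET.SubElement(node, "errorTimestamp").text = err["timestamp"]
        ET.SubElement(node, "errorLocation").text = err["location"]
        ET.SubElement(node, "errorMessage").text = err["message"]

    ET.SubElement(test, "endTestTimestamp").text = end_ts
    return test


def error_types(xml_text):
    """Parse a serialised record and list the error types it reports."""
    root = ET.fromstring(xml_text)
    return [e.findtext("errorType") for e in root.iter("error")]
```

Building the record through an XML library rather than by string concatenation guarantees that opening and closing tags match, which matters when the records are later processed by machine.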
For example, a triple might read: <Program> <hasError> <arrayIndexOutOfBounds>

OWL consists of Individuals, Properties and Classes: OWL Classes can be combined by intersection (AND), union (OR), and complement (NOT). OWL’s restrictions and properties can facilitate automated reasoning.

[Figure 3: The Ontology Engine. Components: Provenance Logger, Provenance Analyser, Ontology Builder, Test Designer.]

4. Conclusion and Future Work

Our early results are encouraging, and preliminary modeling of the fault and reliability domains shows that they can prove useful in assisting in the design of further tests and evaluation methods. Provenance of all fault injection testing and ontology creation will be maintained, so that dependability levels of SOA systems may be established through proven records. Future work aims to produce fault models that are generic enough to match many eventualities.

5. Acknowledgements

This research is supported by the Engineering and Physical Sciences Research Council (EPSRC) and the Distributed Aircraft Maintenance Environment (DAME).

6. References

[1] B. Sleeper (2004) “Piecing the SOA Puzzle Together” InfoWorld, Issue 37, September 2004
[2] A. Avizienis, J. C. Laprie, B. Randell and C. Landwehr (2004) “Basic Concepts and Taxonomy of Dependable and Secure Computing” IEEE Distributed Systems Online (5) 2
[3] M. R. Lyu, Editor in Chief (1995) “Handbook of Software Reliability Engineering” McGraw-Hill: Los Alamitos, USA
[4] M. Greenwood, C. Goble, R. Stevens, J. Zhao, M. Addis, D. Marvin, L. Moreau, and T. Oinn (2003) “Provenance of e-Science Experiments – Experience from Bioinformatics” UK e-Science All-Hands Meeting 2003, Nottingham, UK
[5] P. Groth, M. Luck, and L. Moreau (2004) “Formalising a Protocol for Recording Provenance in Grids” UK e-Science All-Hands Meeting 2004, Nottingham, UK
[6] B. Randell (1998) “Dependability – a Unifying Concept” Technical Report, School of Computing Science, University of Newcastle, UK
[7] J. Xu and B. Randell (1996) “Roll-Forward Error Recovery in Embedded Real-Time Systems” IEEE International Conference on Parallel and Distributed Systems
[8] B. Randell (2003) “On Failures and Faults” Technical Report, School of Computing Science, University of Newcastle, UK
[9] T. Berners-Lee (2001) “The Semantic Web” Scientific American, May 2001
[10] J. de Bruijn (2003) “Using Ontologies, Enabling Knowledge Sharing and Reuse on the Semantic Web” Technical Report, Digital Enterprise Research Institute, Galway, Ireland
[11] Protégé: http://protege.stanford.edu/
[12] E. Marsden, J. C. Fabre, and J. Arlat (2002) “Dependability of CORBA Systems: Service and Characterisation by Fault Injection” 21st IEEE Symposium on Reliable Distributed Systems 2002
[13] K. R. Joshi, R. M. Lever, W. H. Sanders, and M. Cukier (2004) “Achieving Practical Global-State-Based Fault Injection: Experiences and Techniques” Technical Report, University of Illinois, USA
[14] N. Looker and J. Xu (2003) “Dependability Assessment of an OGSA Compliant Middleware Implementation by Fault Injection” UK e-Science All-Hands Meeting 2003, Nottingham, UK
[15] J. Arlat, Y. Crouzet, J. Karlsson, P. Folkesson, E. Fuchs, and G. H. Leber (2003) “Comparison of Physical and Software-Implemented Fault Injection Techniques” IEEE Transactions on Computers, 52(9):1115-1133
[16] N. Looker, B. Gwynne, J. Xu, and M. Munro (2005) “An Ontology-Based Approach for Determining the Dependability of Service-Oriented Architectures” 10th IEEE International Workshop on Object-oriented Real-time Dependable Systems 2005, Sedona, USA
[17] B. Randell (2004) “Dependability and Security” IEE UK http://www.iee.org/Policy/Areas/it/framework/BrianRandallIEEDependability.pdf
[18] S. Garg, A. van Moorsel, K. Vaidyanathan, and K. S. Trivedi (1998) “A Methodology for Detection and Estimation of Software Aging” 9th International Symposium on Software Reliability Engineering
[19] E. R. Harold and W. S. Means (2004) “XML in a Nutshell” 3rd Edition, O’Reilly: Sebastopol, USA
[20] OWL: http://www.w3.org/TR/owl-features/
[21] RDF: http://www.w3.org/RDF/