LCCI (Large-scale Complex Critical Infrastructures)

- LCCIs are Internet-scale constellations of heterogeneous systems glued together into a federated, open system by a data distribution middleware.
- The shift towards the Internet is considered a necessary step to overcome the limitations of the monolithic, closed architectures traditionally used to build critical systems (e.g., SCADA architectures).
- A real-world example is the novel framework for Air Traffic Management (ATM) that EUROCONTROL is developing within the SESAR EU Joint Undertaking.

LCCI (Large-scale Complex Critical Infrastructures)

- New challenges arise from LCCIs that push the frontiers of current technologies.
- The data distribution task becomes crucial and has to provide:
  - Reliability: deliveries have to be guaranteed despite the failures that may happen;
  - Timeliness: messages must reach their destinations at the right time, without breaking temporal constraints;
  - Scalability: performance is affected neither by time nor by the LCCI size.
- The challenge is to find the best data distribution paradigm able to meet the aforementioned requirements.

Outline of the SWIM concept

- SWIM (System Wide Information Management) aims to establish seamless interoperability among heterogeneous ATM stakeholders:
  - common data representation;
  - coherent view of current ATM information (e.g., flight data, aeronautical data, weather).
- It may be seen as a common data/service bus to which the systems that have to interoperate are "connected".
- Close in spirit to a middleware solution for LCCIs.

SWIM prototype

- The prototype (named "SWIM-BOX") has been conceived as a sort of gateway/mediator across legacy applications:
  - completely distributed architecture;
  - designed using a domain-based approach (Flight, Surveillance, etc.);
  - implemented using a standards-based approach:
    - well-known data and information models (e.g., ICOG2);
    - standard technologies (Web Services, EJB, DDS);
    - DDS-compliant middleware for sharing data.
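As a rough illustration of the gateway/mediator idea, the sketch below shows a legacy record being translated into a common representation before it is handed to a shared bus. All names here (DataBus, InMemoryBus, FlightDataAdapter) are invented for illustration and are not the actual SWIM-BOX API:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the DDS-based data/service bus.
interface DataBus {
    void publish(String topic, String payload);
}

// Trivial bus implementation, just enough for the sketch.
class InMemoryBus implements DataBus {
    final List<String> delivered = new ArrayList<>();
    public void publish(String topic, String payload) {
        delivered.add(topic + ":" + payload);
    }
}

// Domain-based adapter (Flight domain): translates a legacy flight
// record into the common representation and publishes it on the bus.
class FlightDataAdapter {
    private final DataBus bus;
    FlightDataAdapter(DataBus bus) { this.bus = bus; }

    void onLegacyRecord(String callsign, int flightLevel) {
        bus.publish("FlightData", callsign + "/FL" + flightLevel);
    }
}

public class GatewaySketch {
    public static void main(String[] args) {
        InMemoryBus bus = new InMemoryBus();
        new FlightDataAdapter(bus).onLegacyRecord("AZ123", 350);
        System.out.println(bus.delivered.get(0)); // FlightData:AZ123/FL350
    }
}
```

The point of the sketch is the decoupling: the legacy side never talks to another legacy system directly, only to its adapter and the common bus.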
[Figure: two legacy sites, each connecting a legacy system (Legacy A, Legacy B) through its own adapter and SWIM-BOX to the common SWIM network infrastructure.]

Some challenges

- How do the subsystems (e.g., COTS components) involved in an LCCI impact its dependability?
- What are the effects on the LCCI if the DDS-compliant middleware is invoked with erroneous inputs?
- Robustness testing provides answers to these questions:
  - it helps vendors evaluate their implementations;
  - it helps clients choose among several solutions.
- Reducing test costs → automating the test procedure.
- Automating the classification of test results.

Our goal

- Assessing the robustness of DDS-compliant middleware.
- What does robustness mean?
  - "The degree to which a system operates correctly in the presence of exceptional inputs or stressful environmental conditions" [IEEE Std 610.12-1990].
  - "Dependability with respect to external faults, which characterizes a system reaction to a specific class of faults" [Avizienis 04].
- Robustness testing features:
  - only the system interface has to be known;
  - source code is not needed (black-box approach);
  - exceptional inputs are injected through the API;
  - internal data and structures are not altered;
  - inputs and stressful conditions are carefully selected so that they activate faults representative of actual situations.

Robustness Testing Approaches

- Robustness testing: stressing the public interface of the application/system/API with invalid and exceptional values:
  - from the application to the system under test (top-down);
  - from the OS to the system under test (bottom-up).

[Figure: in the top-down approach the application calls the DDS middleware API with exceptional values; in the bottom-up approach the operating system returns exceptional values to the middleware's syscalls.]
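The top-down approach can be sketched as driving one operation of the target with exceptional values and recording its immediate reaction. The target method below is a deliberately trivial stand-in, not a real DDS call:

```java
// Minimal sketch of top-down robustness probing: call the public API
// with exceptional values and record whether it returns normally or
// throws. Real harnesses also watch for hangs and crashes.
public class TopDownSketch {

    // Stand-in for an API operation of the system under test.
    static int parsePriority(String s) {
        return Integer.parseInt(s);   // throws on exceptional input
    }

    // Drive one call and classify the immediate reaction.
    static String probe(String input) {
        try {
            parsePriority(input);
            return "RETURNED";        // call completed normally
        } catch (RuntimeException e) {
            return "EXCEPTION:" + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("7"));    // valid workload value
        System.out.println(probe(null));   // exceptional value
        System.out.println(probe("2^31")); // exceptional value
    }
}
```

A bottom-up harness would instead sit below the middleware and return exceptional values from intercepted system calls.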
Robustness Testing Approaches

- Workload: a set of valid calls, needed to stress each operation of the device under test.
- Fault model: a set of rules applied at the API to expose robustness problems.
- Failure mode classification: characterizes the behavior of the system under test while it executes the workload in the presence of the fault model.

Fault Injection: the WWW (What/Where/When) dilemma

- What to inject? The fault model yields a fault list.
- Where to inject? At the API interface level, in the methods with the highest occurrences (method list).
- When to inject? At only one invocation of each method (trigger list).
- The fault, method, and trigger lists define our injection library.

Fault list

- The list of rules applied during API invocation:
  - each method input is tested with all the robustness values, one at a time;
  - e.g., void replace(int a, String b).

Method list

- Different applications using DDS-compliant middleware products have been profiled:
  - a ping-pong application;
  - Touchstone: a benchmarking framework for evaluating the performance of OMG DDS-compliant implementations;
  - the SWIM-BOX.
- The method occurrences have been measured for each application:
  - only a limited core set of all the available methods is invoked;
  - the same occurrence distribution is observed across all applications.
- The method list comprises the methods with the highest occurrences.

Failure mode classification

- The CRASH scale has been used to classify the robustness problems:
  - Catastrophic: the node crashes or the OS hangs; the DDS provider does not deliver messages correctly.
  - Restart: the DDS provider becomes unresponsive and must be terminated by force.
  - Abort: abnormal termination when invoking the API.
  - Silent: the faulty submitted value does not raise an exception, whether or not the message is transmitted.
  - Hindering: the returned error code is incorrect.
- Further suitable levels have been added:
  - Non-conformity: the fault is not signaled as it should be.
- An analysis of the DDS API has been performed to classify the results.
- A golden run has been executed for each injected value to understand the system behavior.

Test automation: JFault Injection Tool (JFIT)

- Pros:
  - Java-based implementation;
  - no knowledge about the SUT is required;
  - run-time method mutation: interception and value injection;
  - exploits Java reflection;
  - monitors the status and output of the SUT.
- Cons:
  - only methods with primitive-like types (e.g., String, int, ...) are taken into account;
  - offline, manual classification of results.

High-level architecture of JFIT

- All robustness tests are carried out according to the injection library;
- the Controller is in charge of test management and runs the tests through the Activator;
- the Interceptor catches the method invocations to the SUT and, via the Injector, injects the faults one at a time;
- the Monitor records the output at both the publisher and the subscriber side.

[Figure: the Controller drives the Activator; the Interceptor and the Injector sit between the test driver and the System Under Test; the Monitor observes the SUT.]

Test execution stages

- Preliminary execution of the workload without faults (golden run, no faults injected) to understand the normal behavior.
- Then robustness testing starts: DDS initialization → workload execution → injection phase (one fault at a time) → monitoring & logging.

Tests Results

- DDS middleware: the OpenSplice® implementation;
- no QoS features have been defined (best effort);
- according to the failure mode classification, the results are as follows:
  - no Catastrophic, Abort, or Hindering problems have been evidenced:
    - neither node crashes nor OS hangs;
    - no abnormal termination when invoking the API;
    - no erroneous returned error codes.
- 13% of the robustness tests have shown Restart problems: the experiment does not respond and must be terminated by force.
- 45% of the robustness tests have raised Silent problems: no exception has been thrown by the DDS middleware.

Tests Results

- Fault distribution between Silent and Restart outcomes. [Figure: breakdown by int fault types and String fault types.]

Conclusions

- Our approach can automatically test the core set of DDS methods.
- A significant fraction of the tests shows robustness issues raised when exceptional values are submitted to the OpenSplice® APIs (e.g., large strings or big integers).
- The ability to reach a consistent system state before performing fault injection makes us confident in the results.

Ongoing activities

- Testing all parameter types, not only primitive types;
- automating the classification of results;
- running tests in the presence of quality-of-service mechanisms;
- carrying out the same tests on other DDS-compliant middleware.

References

[Avizienis 04] A. Avizienis, J.C. Laprie, B. Randell, C. Landwehr. "Basic Concepts and Taxonomy of Dependable and Secure Computing". IEEE Transactions on Dependable and Secure Computing, 2004.
[Koopman 02] P. Koopman. "What's Wrong With Fault Injection As A Benchmarking Tool?". In Proc. DSN 2002 Workshop on Dependability Benchmarking, pp. F-31–36, Washington, D.C., USA, 2002.
[Koopman 99] P. Koopman, J. DeVale. "Comparing the Robustness of POSIX Operating Systems". In Proc. 29th Annual International Symposium on Fault-Tolerant Computing, 1999.
[Johansson 07] A. Johansson, N. Suri, B. Murphy.
"On the Selection of Error Models for OS Robustness Evaluation". In Proc. 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007.
[Miller 95] B.P. Miller et al. "Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities". Technical report, 1995.

Test Scenario: further details

[Figure: JFIT monitors both endpoints; the API interceptor and the API injector sit between the test application and the middleware.]

- The transmitter sends bursts of messages for a while, then terminates;
- a receiver is waiting for messages;
- DDS middleware: the OpenSplice® implementation;
- no QoS features have been defined (best effort).

Pub/Sub paradigm

- Pub/Sub proves effective for federating heterogeneous systems:
  - space, time, and synchronization decoupling enforce scalability;
  - asynchronous multi-point communication is well suited to building cooperating systems.
- Many Pub/Sub alternatives exist: SIENA, GREEN, HERALD, CORBA NS, DREAM, JEDI, JMS, HERMES.
- Among this plethora of alternatives, DDS exhibits better performance, higher scalability, and a larger set of offered QoS policies.
- It is widely used in large-scope initiatives addressing wide-area scenarios:
  - e.g., it has been investigated as the data distribution system of the SESAR project through the SWIM middleware infrastructure.
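The test scenario above (a transmitter publishing a burst and a decoupled receiver draining it) can be sketched with a blocking queue standing in for the best-effort DDS topic; this is an illustrative model, not OpenSplice code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the workload scenario: the transmitter sends a burst of
// messages and terminates; the receiver, decoupled in time and space,
// drains the "topic" and counts what it observed.
public class ScenarioSketch {

    static int runBurst(int burst) {
        BlockingQueue<String> topic = new LinkedBlockingQueue<>();
        Thread transmitter = new Thread(() -> {
            for (int i = 0; i < burst; i++) topic.add("msg-" + i);
        });
        transmitter.start();
        try {
            transmitter.join();
            int received = 0;
            // Poll until the topic stays empty for 50 ms.
            while (topic.poll(50, TimeUnit.MILLISECONDS) != null) received++;
            return received;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(runBurst(100)); // 100
    }
}
```

Under a real best-effort QoS, the received count may be lower than the burst size; comparing the two is exactly what the monitor logs for later classification.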
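The fault-list rule described earlier ("each method input is tested with all the robustness values, one at a time") can be sketched for the example signature void replace(int a, String b). The specific robustness values chosen below are illustrative, not the deck's actual fault model:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of fault-list generation: each parameter is substituted, one
// at a time, with every robustness value for its type while the other
// parameters keep valid values.
public class FaultListSketch {
    // Illustrative robustness values per type (assumed, not from the deck).
    static final Object[] INT_FAULTS = {Integer.MIN_VALUE, Integer.MAX_VALUE, -1, 0};
    static final Object[] STRING_FAULTS = {null, "", "\u0000", bigString()};

    static String bigString() {
        char[] c = new char[65536];
        Arrays.fill(c, 'x');
        return new String(c);
    }

    // Build the test cases: exactly one injected parameter per case.
    static List<Object[]> faultList(Object[] validArgs, Class<?>[] types) {
        List<Object[]> cases = new ArrayList<>();
        for (int i = 0; i < types.length; i++) {
            Object[] faults = (types[i] == int.class) ? INT_FAULTS : STRING_FAULTS;
            for (Object f : faults) {
                Object[] args = validArgs.clone();
                args[i] = f;              // inject one fault at a time
                cases.add(args);
            }
        }
        return cases;
    }

    public static void main(String[] args) {
        // For replace(int a, String b): 4 int faults + 4 String faults.
        List<Object[]> cases = faultList(
            new Object[]{42, "valid"},
            new Class<?>[]{int.class, String.class});
        System.out.println(cases.size()); // 8
    }
}
```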
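Finally, JFIT's run-time interception via Java reflection can be sketched with a dynamic proxy: calls to the SUT's interface are caught, one argument is overwritten with a fault value, and the call is forwarded. The interface and fault value here are invented for illustration and are not the real JFIT internals:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Sketch of reflection-based interception: a proxy wraps the target,
// replaces one parameter with a fault value, and forwards the call.
public class InterceptSketch {

    // Stand-in for a SUT operation (e.g., a DDS write call).
    interface Writer { String write(String topic, String data); }

    static class RealWriter implements Writer {
        public String write(String topic, String data) {
            return topic + "<-" + data;
        }
    }

    // Wrap the target so that parameter `argIndex` is replaced by `fault`.
    static Writer inject(Writer target, int argIndex, Object fault) {
        InvocationHandler h = (proxy, method, args) -> {
            args[argIndex] = fault;       // inject one fault at a time
            return method.invoke(target, args);
        };
        return (Writer) Proxy.newProxyInstance(
            Writer.class.getClassLoader(), new Class<?>[]{Writer.class}, h);
    }

    public static void main(String[] args) {
        Writer w = inject(new RealWriter(), 1, null);
        System.out.println(w.write("FlightData", "valid")); // FlightData<-null
    }
}
```

Because only the interface is needed to build the proxy, this matches the black-box constraint: no SUT source code is required.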