Failure Spread in Redundant UMTS Core Network

Author: Tuomas Erke, Helsinki University of Technology
Supervisor: Timo Korhonen, Professor of Telecommunication Systems (S72)
Tuomas.Erke@hut.fi
30.9.2003

Table of Contents

- Background
- Terminology
- Problem Setting
- Used Methodology
- Results of the Study
- Conclusions
- Future Work

Background

- Fixed networks have been built to be reliable, but the reliability of mobile networks has been given less attention
- Enhanced services (e.g. telemedicine applications) and escalating competition between the main market players over new subscribers are about to change this in the near future
- In fixed networks, outages involving a large number of people must be reported. This requirement may be extended to mobile networks in the future

Terminology (1/4)

- Availability: The probability that the system will be functioning correctly at any given time
- Failure: The impact of faults and errors as seen by the user (SW program crash)
- Fault Tree Analysis (FTA): A top-down method of analyzing system design and performance.
  It specifies a top event and then identifies all of the associated elements in the system that can cause the top event to occur
- Failure Spread: A failure occurs in some part(s) of the system and propagates to other part(s) of the system
- Fault Tolerance: The capability of the system to withstand and handle faults
- Media Gateway (MGW): A network node in the UMTS core network that is used to interconnect networks
- Redundancy: Availability of unit(s) and mechanisms for taking over from failed unit(s)
- Reliability: The capability of an item to carry out a certain function for a certain period of time under certain conditions, or the probability that it will

Terminology (2/4)

- A fault is connected to the physical world (electronic components develop faults after a period of time), but it also includes mistakes made in the design (incomplete system architecture) or implementation (programming mistakes) of the system
- An error has an impact on information (error in data processing)
- A failure is the impact of faults and errors as seen by the user (SW program crash)

Terminology (3/4)

There are different types of failures:
- Sudden: the failure cannot be predicted (nondeterministic software-based problems)
- Gradual: the failure can be predicted with prior examination (hardware wear-out increases the probability of a failure over a period of time)
- Partial: the failure affects only some parts of the system (only one network node)
- Complete: the failure has an impact on the whole system (complete network)
- Catastrophic: the failure is both sudden and complete (power-system failure)
- Degradation: the failure is both gradual and partial (HW component wear-out over time)

Terminology (4/4)

- Standby redundancy (triggered only when the other unit fails)
- Parallel redundancy (frequently used in telecom networks)

Problem Setting (1/2)

- The thesis studies the effect of failures in the UMTS CN, and how redundancy mechanisms may be used in the network to increase availability and decrease the effect of failures
- The area has not been widely
  studied before, but some work related to node failures exists
- Network reliability has been studied, and this thesis compares results from the other studies and proposes solution alternatives

Problem Setting (2/2)

Failures occur in various parts of the system:
- Node failures (an HLR database failure results in unavailability of permanent user data if no redundant component and mechanism is available)
- Protocol failures (wrong implementation or design), which result in overload of network elements or signaling links, or in faulty interaction between network nodes. For instance, wrong use of broadcast messages leads to overload at the receiving side
- HW failures (bus or circuitry failures, memory corruption), which result in HW either malfunctioning or failing
- Recovery triggering mechanisms (changeover procedure failures, DSP device manager failures or other triggering failures)
- Load sharing algorithms (ineffective use of resources, exceeding the capacity of the system before taking action, or wrong resource sharing on the right network nodes)
- HW/SW update procedure failures, which lead to faulty configurations and faulty interworking of network elements
- Wrong network configuration (often because of complex network design)

Used Methodology (1/3)

Failures are considered to occur on different levels:

Used Methodology (2/3)

- A partly redundant example network is studied in the thesis
- A tree-format FTA (Fault Tree Analysis) is used for analyzing the causes of failures.
- FTA is mainly applied in the SW reliability area
- A literature study is performed to find mechanisms for achieving a higher level of system reliability

Used Methodology (3/3)

Example network configuration

Results of the Study (1/3)

- The chain of fault detection, localization, analysis and recovery must be unbroken; otherwise failures cannot be recovered from completely
- Redundancy must be applied at different levels of the system to achieve a high level of fault tolerance (the system is only as strong as its weakest component)
- SW fault tolerance is increased by building distributed, reliable and scalable SW
- The critical network nodes have to be duplicated, and restoration algorithms must be available

Results of the Study (2/3)

- Emphasis on thorough system testing (especially testing of fault recovery mechanisms, load control and different failure scenarios)
- SW-based mechanisms include:
  - Distributed SW architecture (a failure of one component involves a smaller fraction of the system, so the loss of data and resources can be recovered from more easily). A fault can be isolated to a smaller area when the architecture is distributed
  - Multithreaded protocol stacks, so that a failure of a process involves only part of the module capabilities, and SW modules that use dynamic checkpointing protocols for recovering from failures of peer entities
  - Minimizing the recovery time (only the necessary part of the system is restarted: processor, board or node. This reduces the outage time)
  - A black box can be implemented inside SW components to record data for later SW failure analysis

Results of the Study (3/3)

- Dynamic routing and a meshed network architecture are a recommendable solution (Advantage: high tolerance for network failures and adaptability to different network configurations.
  Disadvantage: complicated design and maintenance of the network)
- Reliability of the network is an optimization problem, but redundant HW invested in now can be used in the future to increase the capacity of the system if needed
- Multifunction devices, MSC in Pool, multihoming and special algorithms may be utilized to increase the reliability of the system

Conclusions

- Different redundancy mechanisms were discussed in this thesis and existing algorithms were compared
- Network design appears to be trending towards smaller, adaptable network nodes and architectures
- The reliability of the system is best achieved by using multiple levels of redundancy
- Failure spread depends on the availability and workability of the methods for ensuring the reliability of the system

Future Work

- OPEX (OPerating EXpenses) and CAPEX (CApital EXpenses) calculations for network architecture solutions
- Testing of failure recovery mechanisms and the effect of failures using a real network or a simulated environment
- Multiple simultaneous failures have been handled only partly in this thesis. More research needs to be performed on the subject (e.g. on tolerating a geographical catastrophe involving a large number of network nodes)
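
Illustration: evaluating a small fault tree

The fault tree analysis and parallel redundancy described in the terminology and methodology slides can be made concrete with a small numeric sketch. It is not taken from the thesis. The node names, the tree shape and every probability below are hypothetical, and basic events are assumed to fail independently, which is exactly the assumption that failure spread breaks. An AND gate models parallel redundancy (all redundant units must be down), while an OR gate models a chain in which any single failure causes the top event.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence, Union


@dataclass
class Basic:
    """Basic event: one unit is down with probability p_fail (hypothetical value)."""
    name: str
    p_fail: float


@dataclass
class Gate:
    """AND: every input must occur. OR: any single input is enough."""
    kind: str  # "AND" or "OR"
    inputs: Sequence[Union["Gate", Basic]]


def p_top(node: Union[Gate, Basic]) -> float:
    """Probability of the event at `node`, assuming independent basic events."""
    if isinstance(node, Basic):
        return node.p_fail
    probs = [p_top(child) for child in node.inputs]
    if node.kind == "AND":
        # All redundant units must be down at the same time.
        return prod(probs)
    if node.kind == "OR":
        # The event occurs unless every input is absent.
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical example: a duplicated HLR and a single, non-redundant MGW.
hlr_a = Basic("HLR-A down", 0.01)
hlr_b = Basic("HLR-B down", 0.01)
mgw = Basic("MGW down", 0.02)

# Top event "service unavailable": both HLRs down, OR the MGW down.
top = Gate("OR", [Gate("AND", [hlr_a, hlr_b]), mgw])

unavailability = p_top(top)
availability = 1.0 - unavailability

print(f"P(top event) = {unavailability:.6f}")
print(f"Availability = {availability:.6f}")
```

With these made-up numbers the duplicated HLR pair contributes far less to the top event than the single MGW, which is consistent with the observation that the system is only as strong as its weakest component.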