An Alarm Filtering Algorithm For Optical Communication Networks C. Mas, J.-Y. Le Boudec Laboratoire de Reseaux de Communication Swiss Federal Institute of Technology CH-1015 Lausanne, Switzerland E-mail: fmas, leboudecg@lrc.ep.ch Abstract A single failure in a communication network can trigger many alarms. We propose an alarm ltering algorithm for the management of an optical network using Wavelength Division Multiplexing (WDM). The algorithm supports: (i) multiple failures and (ii) passive network elements that do not generate alarms but may fail. This algorithm will be applied to the network of the ACTS COBNET project. 1 Introduction In the last few years, there has been a rapid development of communication networks. Some time ago, each communication link was transmitting a single channel. By using optical ber, several information channels can be multiplexed in one link. As a consequence, the transmitted information as well as the data rate have increased. The drawback of this evolution is that the information lost when a failure occurs has also increased. Fault Recognition identies the failure from the occasioned alarms and its speed is critical for ensuring availability of the network. Fault Recognition is composed of Alarm Filtering and Testing. The former gives a list of elements that may have failed and triggered all the alarms that the manager has received. Testing conrms the failure by checking each element of the Alarm Filtering result. The main part of Fault Recognition is Alarm Filtering which is essential for speeding up the testing process. Consider an optical network using Wavelength Division Multiplexing (WDM). When a network element fails, all the channels that pass through this element are interrupted. Consequently, all the aected elements that were participating in these interrupted channels and are able to send alarms to the manager, will report an alarm. This eect is called Alarm Storm Eect. The manager receives all these alarms and has to nd out which is the element that might have failed. In some cases, the faulty element will be able to inform the manager that it is not working properly but in other cases the element will stop working without Corresponding author 1 sending any information to the manager. This is particularly true for passive elements such as optical bers. The algorithm supports: (i) multiple failures and (ii) passive network elements that do not generate alarms but may fail. In the literature there are algorithms for the localization of failures. Some of them work with probabilities of failure [1]. The problem of working with probabilities is that they are changing continuously depending on external parameters (for example age, temperature and pressure). Other algorithms use Management Information Bases to look for information related with the alarms [2]. The proposed algorithm tries to get closer to real communication networks and works with the basic data that the manager is able to get without accessing any further information. This article is structured as follows. Section 2 introduces the network model used for the algorithm. In Section 3 the Alarm Algorithm is presented. Section 4 shows an application of this algorithm in the COBNET case. COBNET is an ACTS Project aimed at providing connections between separate Corporate Public Networks (CPNs) through the Public Network independently of the communication protocol used by the ends. 2 Model Denition The model regarded in this article is dened as relations between network elements. A network element is a part of the network that can fail. Each network nlement is identied by two integers = ( ). The former gives the real node where it belongs and the latter gives the component of this node. For example, the node 2 can have as a component a transceiver card that can be identied as (2.1). There are two types of relation between network elements: interconnection that gives the topology of the network (subsection 2.3) and association that gives the established channels (subsection 2.4). In this model we consider a channel to be the establishment of a communication between two nodes. e i; j 2.1 Network Elements This problem considers two kinds of network elements: active and passive elements. The active elements are able to send alarms and information to the manager (one example is a transceiver). As mentioned above, we model active elements by a couple of integers. On the drawing, they are represented by a square. On the contrary, the passive elements do not give any information to the manager. One example is an optical ber. The passive elements are modeled by a couple of integers where the rst integer is 0 (to distinguish them from the active elements). On the drawing, they are represented by a circle. The identier of this element has only one modiable eld because it is the only information that the manager has about it. For example, in gure 1, (1,2) is the second component of the node 1 and (0,1) is a passive element. We dene a as the number of active elements and n as the total number of elements in the network (active and passive). 2 (1.1) (0.3) (1.2) (0.1) (1.3) (0.2) (2.1) Passive Element Active Element (2.2) Figure 1: Interconnection Relations 2.2 Alarms As mentioned before, active elements can send alarms to the manager. We dene a primary alarm as the alarm resulting from the network element that has a failure and a secondary alarm as the alarm that is a consequence of a failure in other element. The information given by an alarm depends on the element but typically contains [3]: (i) which element sent the alarm, (ii) when the alarm was sent, (iii) the nature of alarm and (iv) what is the reason for sending this alarm. In our analysis the information associated with an alarm contains only two pieces of data: which element has sent the alarm and what type of alarm it is because too much information can make the diagnosis of the alarms much more dicult. In particular, considering our model of an optical system, the type of alarm can be either \not-receiving" or \not-sending", referred to by the alarm identier 1 and 2 respectively. Note that the algorithm does not consider \wrong-functioning" alarms because when an element sends this type of alarm, the element is considered as a very probable candidate and it is checked directly. This case does not require the application of this algorithm. The alarm structure has two elds: one with the active element identication that generated the alarm and another eld with the type of alarm (1 or 2 as it has been said before). These alarms are stored by the manager in an adimensional vector A. An important fact that has to be considered is that the alarms do not arrive in chronological order. This is the reason why the manager accepts alarms for a certain period of time. 2.3 Interconnection Relation We dene a relation between network elements called interconnection. A network element with identier 1 is interconnected with an element 2 if they are physically connected within the network. For example in gure 1, (0,1) is interconnected to (1.1) and (0.3) is not interconnected to (1.2). This relation is symmetric and can be represented by a nxn boolean array T. This array represents the network topology. T[ 1 , 2 ]=1 means that 1 and 2 are interconnected and T[ 1 , 2 ]=0 means that they are not. e e e e e 3 e e e (1.1) (0.3) (1.2) (0.1) (1.3) (0.2) (2.1) Channel 1 Channel 2 (2.2) Active Element Passive Element Figure 2: Association Relation with Channels 2.4 Association Relation As in all communication networks, channels are established between nodes that want to communicate. In our model, we dene a association as a relation between a network element and a channel. Each channel is identied by an integer. An element 1 is associated with a channel 1 if 1 belongs to the channel 1 . For example in gure 2 the passive element (0.1) is associated with channel 1 and the active element (1.2) is associated with channel 2. The properties of the communication relation are: Two or more elements can be associated with the same identier; A network element can be related to 0, 1 or more channels. This relation can also be represented in a cxn boolean array C where c is the number of established channels. C[ 1 , 1 ]=1 means that 1 passes through 1 and C[ 1 , 1 ]=0 means that it does not. e c e c c c e c e e 3 Alarm Filtering Algorithm Assume that at certain time, the manager receives a number of alarms from the network. The goal of the Alarm Filtering Algorithm is to nd the element or elements whose secondary alarms are the ones that the manager has received. The inputs to the algorithm are: Topology of the network T Established channels C Set of alarms received by the manager A The output of the algorithm is a set of the most probable candidates to be faulty O. The algorithm is composed of three parts: Initialization: Element Domain Phase (section 3.1) Core: { Alarm Discarding Phase (section 3.2.1) 4 { Candidate Generator Phase (section 3.2.2) Result: Comparison Phase (section 3.3) The initialization phase has to be run every time there is a change either in the topology of the network or in the established channels. The other phases have to be applied at any change in the set of received alarms. When the manager receives a set of alarms, he applies the Alarm Filtering Algorithm to nd the best candidate(s) which with its(their) failure has(have) generated all the alarms that the manager has received. With this result, the manager testes them. If the elements turn out to be faulty, the manager acts in consequence, for example, repairing the failure. 3.1 Initialization: Element Domain Phase This phase nds, for every network element e, the set of secondary alarms that would be generated in case it fails. In other words, if the element e fails, which alarms would the manager receive. Inputs: Topology of the network T Established channels C Output: Secondary alarms S The nxa output array S gives, for every element, which active elements would send an alarm and which kind of alarm. As mentioned before, the considered alarms are: \not-receiving" (alarm identier 1) and \not-sending" (alarm identier 2). It is important to note that this phase is independent of the alarms that are received by the manager. Once this phase has been executed, S can be kept and reused as long as there is no change either in the topology or in the established channels. The result of this algorithm is S[ 1 ,a] where a are the alarms that 1 would generate in case it fails. The code has the following structure: procedure Alarm-Domain (integer nxn array T, cxn array C) begin For each established channel i do vector1[n]:=C[i] For each element of vector1 j do if vector1[j]=1 then if j-th element belongs to the i-th channel call Generated-by (j) Find generated alarms when j-th element fails j:=j+1 i:=i+1 Return S[n,a] end e e 5 procedure Generated-by (integer j) /comment: This procedure nds which elements would send an alarm and which kind of alarm would be in case the element j fails/ begin vector[2]:=vector1[n] Modiable copy of the elements that belong to the same channel than the j-th element (a,b):=Find next active neighbors (j, vector2[n]) call evaluate A (j,a) Gives the kind of alarm from a if 0 then evaluate A (j,b) If there is a second neighbor, the kind of alarm is found from it vector2[j]:=0 The studied element is eliminated in order to no come back to it call Generated-by (a) Find the generated alarms when aelement fails if 0 then Generated-by (b) If there is a second neighbor, nd the generated alarms when it fails end b <> b <> procedure Find next active neighbors (integer j, n-dimensional vector vector) /comment: This procedure nds the 2 neighboring active elements of the jth-element that belong to the channel given by vector[n]/ begin integer found-neighbors:=0 while (the 2 next active neighbors have not been found) do x:=T[j,k] x is 1 if j and k are connected if (x=1 AND vector[k]=1) then If j and k are connected and k belong to the same channel then if found-neighbors=0 then a:=k k is the rst next active element found-neighbors:=1 else b:=k k is the second active element found-neighbors:=2 k:=k+1 Return (a,b) end procedure evaluate A (integer j,k) /comment: This procedure gives the kind of alarm from the k-th element. If the j-th and the k-th element are close neighbors then the alarm is a not-sending alarm. If these elements have a passive element between them, the alarm is a not-receiving alarm./ begin if T[j,k]=1 then S[j,k]:=2 else S[j,k]:=1 end 6 3.2 Core of the Algorithm In this section the main part of the algorithm is presented. It is composed of two parts. The former is called Alarm Discarding Phase because it eliminates obsolete alarms that do not give useful information to locate the failure. The latter is called Candidate Generator and nds the candidates for the subset of alarms resulting from the previous phase. 3.2.1 Alarm Discarding Phase Inputs: Topology of the network T Established channels C Set of received alarms A Output: Subset of received alarms A' The main goal of this phase is to recognize and discard obsolete alarms. In fact, many of the alarms do not give any useful information for the localization of the failure. These alarms are called obsolete alarms because they are clear consequences of a failure and they can be related to other alarms. The nonredundant alarms are stored in a a-dimensional vector A' and the redundant alarms are stored in a second a-dimensional vector Elim. Not-sending Not-receiving (0.1) (1.1) (1.2) Not-sending Not-receiving (0.2) (2.1) (2.2) (0.3) Established Channel Active element Passive element Node Figure 3: Alarm Discarding example For example, in gure 3, the active elements (1.1), (1.2), (2.1) and (2.2) belong to an established bidirectional channel. (1.1) and (1.2) belong to network node 1 and (2.1) and (2.2) to network node 2. Due to a failure in the network, the element (1.1) which could be a receiver card, does not receive any signal from (0.1). As a consequence, the rst node is not receiving any signal. The element (1.1) warns the manager that is not receiving and element (1.2) warns the manager that it is not sending any signal to the next element. The passive element (0.2) can not send any warning to the manager. The second node reacts 7 in the same way than the rst one. The alarms from (1.2), (2.1) and (2.2) are obsolete because they are a clear consequence of the alarm in (1.1). Therefore, the only alarm that the algorithm will keep in A' is the alarm from element (1.1) because this is not a consequence of any other alarm (in the situation described in the gure). 3.2.2 Candidate Generator Phase Inputs: Topology of the network T Established channels C Subset of received alarms A Output: Associated elements E' This phase nds the elements that could have caused the alarms received by the manager. These elements will be stored in a n-dimensional vector E'. The procedure consists of three steps: - The rst step considers the elements that have sent the alarms. All the elements of A' are included in E'. - The second step adds to E' the neighboring active elements of the A' elements that have not been discarded in the Alarm Discarding Phase (not included in Elim). This step is necessary to cover the case when a faulty element cannot send any alarm when it fails. - The third step nds the passive elements, if there are any, between each pair of active elements and adds them into E'. At the end of this phase E' contains all the candidates which could have failed. procedure Candidate-Generator (integer a-dimensional array A') begin For all the active elements of A'[a] i do E'[i]:=1 Add the i-th element in E'[n] (a,b):=Find next active neighbors (j, vector2[n]) If a does not belong to Elim[a] then E'[a]:=1 Add the a-element in E'[n] If T[i,a]=0 then If there is any passive element between i and a-element b:=Passive(i,a) b is the passive element between i and a E'[b]:=1 b is added in E'[n] Return E'[n] end procedure Passive (integer a,b) /comment: This algorithm nd the passive element between the two given active elements a and b./ begin x:=false p:=(n-a)+1 p is initialized to the rst passive element 8 Repeat If (T[a,p]=1 AND T[b,p]=1) then If p is connected to both passive elements x:=true Finish else p:=p+1 Otherwise continue with the next passive element Until is x=true Return p end 3.3 Result: Comparison Phase Inputs: Associated elements E' Secondary alarms S Set of received alarms A Output: Probable faulty elements O The elements belonging to E' are now analyzed. The analysis consists of choosing an element or several elements from E', nding the secondary alarms corresponding to them in S and comparing them with the set of received alarms A. The algorithm considers the possibility of multiple failures. Since one failure is more probable than several simultaneous failures, the selection procedure rst considers single elements failing before going on considering multiple failures. With the chosen element we check its secondary alarms (given by S[e,a]) and compare them with the alarms received by the manager. In case several candidates are chosen as a set, the union of their secondary alarms are compared with the received alarms. Once both sets are equal, the n-dimensional vector O is dated up to the last set of tested elements. procedure Comparison (integer nxa array S, n-dimensional vector E') begin x:=true x gives when the most probable faulty element has been found Repeat i:=Choose-Candidate(E'[n]) j:=0; t:=true while ( AND x=false AND 0) do If S[i,j] A[j] then t:=false j:=j+1 if j=n then x:=true Until x=true Return O[n] end j < n i <> <> 9 PIU LEN LEN WDM Ring LEN HEN LEN WDM Ring LEN LEN LC PIU Public Network Interface HEN High End Node LEN Local End Node LC Local Connections Figure 4: COBNET Network 4 Application to the COBNET case This Alarm Filtering Algorithm is applied to Corporate Public Networks (CPNs) of the COBNET Project. CPNs are composed by a central switch located in the High End Node (HEN) that interconnects the ports from one or more rings with local ports and ports to the Public Network. Figure 4 shows a concrete case of a CPN with two rings. The CPN manager, which resides in the Telecommunication Management Network (TMN), receives the alarms from all the elements of its domain. Suppose there are three channels established in the network as illustrated in Figure 5. The relation association is described in the table of gure 6. The rst channel C1 goes from the Public Network to the LEN 5, the second channel C2 connects the LEN 4 with LEN 8 (local channel) and the third channel C3 connects the Public Network to LEN 9. We modeled the COBNET network as shown in gure 5. All the passive elements are also represented and nodes are divided into separate parts that are active and independent. Suppose a set of alarms is received. First of all the manager will check if the fault is in its domain and if any alarm is a \wrong-functioning" alarm. If the failure is in the CPN and no element sends an alarm of type 3, the manager checks if there has been any change in the topology or in the established channels. If there have been changes, the initialization Element Domain phase has to be run again. Otherwise, the manager executes the main part of the algorithm. The result of the element domain phase is shown in gure 7. Assume that during a period of time, the manager has received the set of alarms described in gure 8. The manager applies the main part of the algorithm: rst discards the obsolete alarms and then nds the likely candidates: Alarm Discarding Phase: The alarms from elements 4.1, 2.2 and 8.1 can be considered as obsolete alarms and they are discarded. The resulting 10 1 1 2 7 4 2 1 2 (0.1) 1 (0.7) (0.3) (0.4) (0.8) 2 1 1 2 2 5 1 2 4 8 3 (0.5) (0.9) (0.10) (0.6) (0.2) 1 2 1 2 6 9 1 3 2 Active element C1 C2 C3 Passive element Figure 5: COBNET Network model C 1.1 1.2 2.1 2.2 2.3 2.4 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2 8.1 8.2 9.1 9.2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.10 1 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 2 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0 3 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 Figure 6: Channel array C vector A' is in gure 9. Candidate Generator Phase: The element that may have caused these alarms are in shown gure 10. The algorithm has been able to reduce the number of candidates from 11 to 4. Comparison Phase: In this phase, one or several candidates are analyzed by comparing their secondary alarms with the alarm received by the manager. In this example, the element 7.1 is most probably the one that has failed because its secondary alarms are the alarms that the manager has received. 5 Conclusion The manager receives many alarms every time a failure occurs. In order to nd the failure, Fault Recognition with an eective Alarm Filtering Algorithm is needed. The network model used in this article is a simple and general one that is based on the interconnection of elements. It distinguishes between active and 11 1.1 1.2 2.1 2.2 2.3 2.4 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2 8.1 8.2 9.1 9.2 P.1 P.2 P.3 P.4 P.5 P.6 P.7 P.8 P.9 P.10 1.1 1.2 2.1 2.2 2.3 2.4 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2 8.1 8.2 9.1 9.2 x 2 1 2 0 2 0 0 1 2 1 0 0 0 0 0 0 0 0 1 2 x 2 2 0 2 0 0 1 2 1 0 0 0 0 0 0 0 0 1 2 1 x 2 0 2 0 0 1 2 1 0 0 0 0 0 0 0 0 1 2 1 2 x 0 2 0 0 1 2 1 0 0 0 1 2 1 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 2 0 x 0 0 1 0 0 0 0 0 1 2 1 0 0 1 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 1 0 2 0 0 x 2 1 0 0 0 1 2 1 0 0 0 2 1 2 1 0 0 0 0 2 x 1 0 0 0 0 0 0 0 0 0 2 1 2 1 0 0 0 0 2 1 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 2 0 1 0 0 1 0 0 0 0 0 x 2 1 0 0 0 0 0 0 2 0 1 0 0 1 0 0 0 0 0 2 x 1 0 0 0 0 0 0 2 0 1 0 0 1 0 0 0 0 0 2 1 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 x 0 2 1 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 x 2 1 1 2 0 2 0 0 1 2 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 1 0 2 0 0 1 2 1 0 0 0 1 2 1 0 0 0 2 1 2 1 0 0 0 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 1 0 0 0 0 0 1 2 1 0 0 0 0 0 0 2 0 1 0 0 1 0 0 0 0 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Figure 7: Secondary alarms S 1.1 1.2 2.1 2.2 2.3 2.4 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2 8.1 8.2 9.1 9.2 0 0 0 2 0 1 0 0 1 0 0 0 0 0 0 2 1 0 0 0 Figure 8: Received alarms A passive elements. The Alarm Filtering Algorithm locates failures occurring in both active and passive elements giving a more precise localization. One main aspect of the algorithm is the minimal information that it needs from the alarms i.e., only the elements that originated the alarm and what type of alarm it is (2 possibilities). We show that this algorithm is able to reduce by more than 60% the number of likely candidates that have failed. From this subset, the algorithm gives the most probable faulty element(s). Thus the time required to isolate the failure considerably and data loss in minimized. Finally, we considered a real application in the COBNET ACTS Project. References [1] I. Katzela and M. Schwartz, \Schemes for Fault Identication in Communication Networks," IEEE/ACM Transactions on Networking, vol. 3, December 1995. [2] J. F. Jordaan and M. E. Paterok, \Event Correlation in Heterogeneus Networks using OSI Management Framework," Integrated Network Management, vol. III, 1993. [3] A. T. Bouloutas and A. Finkel, \Alarm Correlation and Fault Ientication in Communication Networks," IEEE Transactions on Communications, vol. 42, Feb/March/April 1994. 12 1.1 1.2 2.1 2.2 2.3 2.4 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2 8.1 8.2 9.1 9.2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 Figure 9: Resulting alarms A' 1.1 1.2 2.1 2.2 2.3 2.4 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2 8.1 8.2 9.1 9.2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 010 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 Figure 10: Candidates E' 13 0 0 0 0 0 0 1 0 0 0