From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved. Proactive Network.Maintenance Using Machine Learning R Sasisekharan V Seshadri AT&TBell Laboratories, S MWeiss HR2K033 480 Red Hill Road Middletown, N J 07748 sesh@eagle.hr.att.com ¯ - ABSTRACT A new approach to proactively maintain a massively interconnected communicationsnetworkis described. Tl’fis approachhas beenapplied to the detection and prediction of chronic transmission faults in AT&T’s digital communicationsnetwork. A windowingtechnique was applied to large volumesof diagnostic data, and these data were analyzed by machinelearning methods. A set of conditionshas beenfoundthat is highly predictive of chronic circuit problems,that is, problemsthat are likely to continue in the immediatefuture withoutdiagnosis and repair. In addition, a few conditions have been found that are predictive Of problemsthat affect multiple circuits. Such analyses over the completenetworkare helpful in proactively maintainingthe networkand in spotring trends for circuit problems.Proacfive maintenanceof the networkcan help in greatly improving the quality and reliability of a networkby identifying potentially serious problemsbefore they OCCur. KEYWORDS communications network,rule induction, proactive maintenance,machinelearning, databasemining KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 453 INTRODUCTION ¯ With the increasing complexity of moderncommunicationsnetworks, there is a commensurate need for intelligent systemsto help manageand maintainthem.¯Mostdesirable are systemsthat can analyze and resolve problemsin the networkautomatically, thus greatly improvingthe reliability and quality of the network.Artificial intelligence(M)techniques haveprovenuseful in building networkoperation systems[l]. Anarea in whichsuch techniques can have significant impact is in the proaetive maintenanceof networks. Proactive maintenanceof the networkcan help in greatly improvingthe quality and reliability of a networkby identifying potentially serious problemsbefore they degrade. This can be accomplished bymonitoring network performance over time and spotting trends in problems. In addition, monitoringnetworkperformancecan help in prioritizing networkproblems.The abilityto prioritize network¯problemscan significantly improvethe quality of the networksince those problemsthat have the most significant impacton the operations of the networkcan be addressed first. Monitoring network performanceinvolves analyzing extremely large amountsof diagnostic data that varies with time. Lookingfor pattems of behavior in such large volumesof data can only be accomplishedby computeranalysis. In this paper, wepresent an approachto do such an analysis using machinelearning techniques. Weillustrate the approachby specifically looking at detection and prediction of chronic transmission faults in AT&T’s world-widedigital communicationsnetwork. Even though this paper focuses on telecommunications networks where manyhomogeneous ¯ networkcomponents exist, our workis applicable to other types of networksas well. In the following sections, Wedescribe the application domain,explain the approachthat weused to enable the ¯ machineto learn from time-dependentProblemsin the network, apply alternative methodsof machine learning to our problem,and report the results we have obtained. PROBLEM DESCRIPTION Networkoperation systems (NOS)exist in the network to support provisioning, maintenance, operation, administration and management functions for the networkand for individual network components.A Circuit can be considered-as a path in the network, which contains network componentsand links. Transmissionproblemson a circuit are seen by several of the networkcomponents through which :the circuit connects. In a large network, such-as AT&T’s communications network, the ratio, of diagnostic data generated by various networkcomponentsto the root problem that is responsiblefor them,is very large. Thedifferent types .of problemsin the networkcan be broadly categorized into two classes, transient and non-transient. Transient transmission problemsare very common in the network, yet their behavior and causes are not completelyunderstood. Part of the difficulty in understanding themis related to separatingthe wheatfromthe chaff, that is, in learning to ignoreglitches that will not be repeated and focussing insteadon those transient problemsthat will recur (chronics). Chronits not only affect the quality of communications while they recur but also indicate degradationand potential-future failures in the network.Thusit is an importantand challengingproblemto identify these chronics and isolate their causes. Diagnostic procedures that attempt to resolve transient problemsmust rely on large volumes of historical information and a morecomplexanalysis of patterns of behavior. Onenovel approach to diagnosing transient faults is found in an AT&T system knownas SCOUT[2].Using historical and topological information, SCOUT finds specific related circuits that share common patterns of faulty behavior. Typically, these are difficult transient problems;multiple circuit problems,or even forms of chronic faulty behavior. In this paper, we consider a related form of analysis of chronic Page 454 AAAI-94 Workshop on Knowledge Discovery in Databases KDD-94 behavior. Weconsider the performanceof the complete AT&T network over time. Our objective is to determinewhetherthere are patterns of behavior over the networksuch that the following predictions can be made: ¯ Thefaulty behavior will continue in the immediatefuture. ¯ Thefaulty behavior involves multiple circuits. Wedo not attempt to solve specific problems.Instead, our objective is to determinewhether there are any signature characteristics for those networkproblemsthat do not get fixed. Wemay not have enoughdatato determine the exact nature of the problem, but wemaybe able to predict accurately that the problemwill not be fixedquicklyduring the normalcourse of operations. In order to achieve maximum reliability of the network, problemswith these signatures must be considered, and those problemsthat are deemedcritical must be addressed quickly. Froma computerscience perspective, interest in this area centers around a numberof issues related to mappingtime-dependentevents into a standard classification framework.These include the following: ¯ Defining classes to conveythe concept of "chronic." ¯ Definingfeatures that summarizehistorical events. In addition, the samplesize in this analysis :numbersin the tens of thousandsof cases. Looking for patterns of behavior in such large volumesof data can only be accomplishedby computer analysis using machinelearning techniques, possibly resulting in newinformation that cannot be obtained by typical humanexperience. THE MACHINE LEARNING APPROACH In order to identify chronics that have the potential to degrade the performanceof the communicationsnetwork, we adopted a computer-basedapproachof learning from historical data. The macb_inelearning methodologyis described in this section. Describing the Goals and Measurements Thecircuit,related questions that wehave outlined in the previous sections need to be posed in a standard classification format, so that a numberof interesting analytical techniques that are available can be applied. PredictiOn modelsthat can be applied tea standard classification problem include decision trees, decision rules, statistical linear diseriminants, neural nets, and nearest neighbor methods.In the standard classification format, samplesof cases are obtained. For each case, identical measurements are taken, and at least one of these measuresis the class label. Methods are applied whichattempt to find patterns for one class that differ fromother classes. For our first problem,the class label is chronic failure on a circuit, a conceptthat has been defined in previous sections. Thegoal is to predict that current failures will continue to occur. We mustalso take into accountthat these failures are often transient, and that failures will likely not occur continuouslyin the future. Instead,-a failure mayoccur in the future, but the occurrencemay also be transient. Periods of time that are reasonablyclose to the current period are of interest. The measurementsthat are used for prediction must summarizehistorical information. These measurementsare recorded each time a fault occurs. Not all measurementsare recorded for every fault, only those that directly measurethe fault process. Faults are often transient, so the trends for a period of time must be measured.It is quite possible that manyfaults will occur for a short time, but these faults are not necessarily chronic, Theycan be fixed and do not reappear in the immediate future. Measurements mustbe specified that are useful in pr~icting the target concept, that is, future failures on the circuit. KDD-94 AAAI.94 Workshop on Knowledge Discovery in Databases Page 455 This time-dependentproblemwas mapped¯into a standard classification format by the use of fixed time windows.Historical information for circuits was examinedover a consecutive period of time, and this time period wasdivided into two windows,W,~and Wb.The objective was to use the ¯ measurementsmadein Wato predict that problems will occur in Wb. Weconsidered both our knowledgeof the application and experimentaldata to arrive at reasonable sizes for each of the two windows.The windowswere also divided into sub,units based on time, which we will refer to as a time unit. Wewill refer to the size of W,as T, time units, and the size of W b as Tb time units. There are manyreasonable measurementsthat can be taken over time. Assuminga fault oc: curs, an alarm or exception is noted. Included in the possibilities of measurements are the number of times such an event occurs, the average numberof times an event occurs during a time unit, or the numberof time units during which the event occurs. Wedefined around 30 performancefeatures for this problem,based on the variation of diagnostic data over time and variation over space. In addition to the timeperiodsfor the windows,another factor must be consideredin defining the conditions for each window,This is the degree of chronic failure. For the two windows,W,and Wb;the conditions that were chosen were different, as follows: ¯ W,: any fault during the time period T,,. ¯ Wb:faults during at least half the time units of the followingtime period"lb. Therationale for these periods is as follows. Wemust consider a prior period sufficiently long that prediction of continuing chronic behavioris feasible. Weconsideredall circuits wherea ¯ fau!t has occurredduringWa.Becausefaults are often transient, wemust specify a reasonablylong period for Wb.Afault that recurs over at least half the time units during Wbwouldbe of interest becauseit indicates a clearly continuing and unresolvedproblem. In our learning experiments, weexaminedall circuits with problemsduring predefined intervals. It must be emphasizedthat the predictions to be madeare those for continuing chronic failures. Theseare not necessarily the most obviousor acute problems, whichare often diagnosedand fixed quickly. The second question that was examinedis whether any patterns of.measurementsare indicative of multiple circuit involvement.Werefer to failures, occurring on multiple circuits due to a commoncause, as. multiple circuit problems. Werely on SCOUT’sanalysis to determine if the problemon a circuit is a multiple circuit problem.The question that wehopedto answeris whether there are certain types of faults within the completenetworkthat consistently suggest multiple circuit problems. Learning from Data Oncesample data are obtained, several computer-basedtechniques can be used to makepredictions. Amongthe more prominent techniques are: ¯ ¯ ¯ ¯ statistical linear discriminants[5], neural nets[6], decision rules or trees[7][8], and nearest neighbor methods[9]. Both neural nets and linear diseriminants makepredictions based on weighted functions.The linear discriminant uses a simple additive scoring function, for example If a*X + b*Y > 3, choose Class 1. (1) The neural net, on the other hand, can model morecomplexdecision functions, typically Page 456 AAAI-94 Workshop on Knowledge Discovery in Databases KDD-94 non-linear functions. Decisiontrees and decisionrules posesolutions in the formofmaeor false conditions.Forexample, If A>10ANDB<20, choose Class 1. (2) Adecision tree is a stylized modelof decision rules, where every decision flows from the root node of the tree, and each path is mutuallyexclusive. Moreinformationabout these techniques can be found in Weiss and Kulikowski[4]. For our application, wehad a large numberof samples obtained from the operating commu¯ nications network,It wasnot feasible to try every variation of these methodson the entire set of samples, Instead, we performedsomesmaller experiments to see whether one approach offered an advantageover the others. Overall, for this application, nearest neighbormethodsand statistical linear discriminants performedpoorly. Neural nets and decision rules or trees were competitive, with a small edge for decision rules. In experimentson the completedata sets, representing all circuit problemsin the network over a fixed period, werelied mostlyon rule induction. Rule and tree induction methodshave been extensively described in published works[4]~:in our Study, we emphasizeda rule induction technique called SWAP-113]. Rule induction methodsattempt to find a compactcovering rule set that completelyseparates the classes. Thecovering set is found by heuristically searching for a single best rule that covers casesfor only one class. Havingfounda best conjunctive rule for a class C, the rule is addedto the rule set and the cases satisfying it are removedfromfurther consideration. Tkis process is repeated until no cases remain tobe covered. Unlike decision tree induction programsand other rule induction methods, SWAP-1 has the advantagethat it uses optimization techrtiques to revise and improveits decisions. Oncea coveringset is foundthat separates the classes, the inducedset of rules is further refined by either pruningor statistical techniques.Usingtrain and test evaluationmethods,the initial coveringrule set is then scaled backto the moststatistically accurate subset of rules. I~is quite possible that tuning manyof the alternative methodscould result in somewhatimproved results for each method. However,based on our knowledgeof the application, there are a numberof reasons whythe rule induction methodappears most appropriate: ¯ Theobjective is to extract newinformation from the data. Thehope is that we can gain insight into the performanceof the network. Decisionrules have the strongest explanatory capabilities of the cited models. ¯ Weknowin advancethat this is a noisy environment.Perfect classification can be achievedon all samplesonly whenchronic behavior is entirely consistent. This is not likely with all the efforts towardhigh reliability and the transient nature of manyproblems. Thusthe expectation is to find a subset of conditions that are highly reliable predictors of chronicfailure. ¯ Mostof the measurementsare counts of the numberof time units during which an event occurred over a predefined time period. Thus these measurementsare ordered discrete variables. Theyare not continuous. Patterns of these types of measurementsare usually effectively described in terms of the greater than or less than operators whichare used by the rule induction model. KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 457 ¯ The minimum error solution is not necessarily the best solution. Becauseweare trying to extract newinformation, the preferred solution consists of the highest predictive rules even if they cover fewer cases. Thepreference is also for simple rules that enhanceour understanding of network performance. Giventhat the expectation is for predictions that coveronly a partial numberof chronic problems, decision rules most naturally modelthe partitioning of data. Theefficacy of the individual rules can then be tested on independenttest data. Testing the Decision Model Thecentral methodfor building a predictive modelis to learn from samplesand test on independentdata. In manyapplications there is a relative shortageof data. In these situations, a compromiseis madeby randomlypartitioningthe data into training and test sets. In our application, we had a large numberof samples. Training was performedon a randomsubset of data for a time period, and somepreliminary testing was done on the remaining data. Oncea solution was found, : further rigorous testing wasperformedby testing the solution.on additional data from subsequent time periods. Whileit is traditional to train on two-thirdsof the data and test on the remainingdata, the goal of finding simpler, morepredictive rules maySometimesbe achievedby training with fewer cases. Theseare rules that coverfewer cases, but are morepredictive for the cases that they cover. RESULTS In this section, wedescribe the results that wereobtained, both in identifying chronic problems and in Classifying problemsassociated with multiple circuits. Wehave also provided the resuits of our comparisonof various learning methodsas applied in this case. Alternative Methodsfor Learning Althoughweconcludedthat rule induction wasthe preferred learning methodfor this application;we performedseveral experimentsto evaluate its competitivenesswith alternative learning models.Table 1 summarizesthe results. Onethird of the cases were randomlyselected for testing, and the error rates in the table are based on the test case performance. Simplychoosingthe largest class gives an error rate of 6.4%.Thelinear discriminant (Fisher’s) that weused was the standard parametric discriminant found in statistical packages. Weused it with feature selection. It is not unusualthat this methoddoesnot performwell in a lowprevalence situation. Nearest neighbormethodsare greatly affected by noisy variables. This result is for k=l and euclidean distance. Thethree remainingtechniques were relatively close for this experiment. Thedecision tree was induced by CART,and the decision rules by Swap-1.The neural net was a standard backpropagation network, with a single hidden layer. Configuralions were considered with from 0 to 6 hidden units, with the best test error rate for 0 hiddenunits. Page 458 AAAI.94 Workshop on Knowledge Discovery in Databases KDD-94 Method Error rate (%) Prior 6.4 Linear discriminant 6.4 Nearest n~ghbor 5.6 Neural net 4.7 Decision tree 4.5 Decision rules 4.4 Table I: ComparativeResults For Alternative Learning Methods Sample Size Weexaminedthe historical records for several months during late 1992 and 1993. These samples were taken from the complete AT&T network, and covered a significant numberof all transmission problems encountered. Whencomparedto the billions of transmissions during a monthand the size of the network, the numberof problemsis quite small. However,from a sampiing perspective, wehad a large sample,consisting of tens of thousands.Of these circuits, between 5 and 10%fulfilled our definition of Chronic,that is, they had faults during at least half the time units during Wb. Predicting Chronic Circuit Problems Wenowreturn to the central task, namely,can wepredict chronic problems,that is, problems that wit1 continue in the immediatefuture? Wewereable to generate rule sets that werepredictive. Therule set reducesto a set of conditions. If any of the conditions hold, the problemis very likely to be chronic. The conditions were of the form ¯ X_Featurei> nl where X_Featurei is a performancefeature based on the numberof time units during which an event of a type i occurs and r~ is an integer. Wewill refer to a set of five conditions, of the type described abovethat were generated, as RulesetO.Anothercondition of a different type wasalso generated. It was of the form ¯ Y_Feature, > m, & Y_Featureb ¯ mb & Y_Featurec ¯ mc WhereY_Featureis a performancefeature that is not based on the count of time units and m is an integer. This condition is weakerthan the other conditions. Wewill refer to the rule set that includes all the conditions in RulesetOand this last condition as Rulesetl. While the predictive value of RulesetOis better than Rulesetl, Rulesetl covers a larger portion of the chronic problem. This is illustrated in Figures 2 and 3. Figure 2 plots the performanceof these two rule sets over the course of several time periods. The rule set was induced during a time period whenthe prevalence of chronic problemswas somewhat lower than during other time periods. Thus any solution induced for that time period necessarily wouldbe highly predictive to overcomethe odds of the larger class. Figure 3 plots the per- KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 459 centage of each time period’s chronic problemscovered by the rule sets. Figure 4 plots the changein predictive value versus the numberof time units with faults for a representative time period. Specifically,, as the definition of chronic problemsthat are to be detected is mademoreselective, theperformanceimprovesin that the numberof false alarms is reduced. At the sametime, of course, coveragedecreases, meaningthat the numberof transient problems, that are no longer classified as chronic but havethe potential to result in performancedegradation, increases. Multiple Circuit Detection Using the same measurements, we consider the detection of multiple circuit problems. Whethera problemon a circuit is a multiple circuit problemor not is determinedby meansof data obtained from SCOUT. For this application, we trained on data from one time period and tested on data from another time period. The classes are close to equal in size with a 45%- 55%split. Two conditions emergedas particular strong predictors of multiple circuit involvement.Theywere of the type: ¯ X_Featurei >ni Althoughthey cover a relatively small percentage of all circuits with problems,wheneither of these conditions occur the likelihood that multiple circuits are involvedis estimated at 93%. Wehave identified certain conditions that are common to both chronic problemdetection and multiple circuit detection. Thus, we have comeup with a rule that predicts whena problemwith a circuit is likely to be chronicas well as affect multiplecircuits. CONCLUSIONS If all problemsin a communicationnetwork were either transient or quickly repaired, it wouldnot be necessary to detect chronic problems. However,chronic problemsdo occur. Identifying patterns of these problemsshould be helpful in characterizing problemsthat are not detected and re/)aired quickly. In our.analysis, wefound that the numberof time units over which events occurred wascritical in determiningthe likelihood that the problemwouldcontinue in the future. Theserules suggest a form of momentum or inertia for chronic fault problems. There are a number of rationales for the validity of this formof analysis: ¯ Not all faults have this momentum to the samedegree. Wehave identified those measurementsthat are predictive along with the correspondingthresholds, that is, the numberof time units of faults during a windowfor whichthey are predictive. ¯ Of particular importance,any circuit that exhibits this behaviorwill likely continue this behavior.Thus, if the goal is to maximize reliability, circuits exhibiting these characteristics shouldbe given priority in diagnosis and repair. Themostimmediateapplication of these results is their use in reordering trouble tickets. Beyond that, though, we have provided a methodologyto proactively maintain and monitor the performanceof a network. The determination of trends can demonstrate whether progress is madein reducing chronic problems or whether chronic problems are increasing in the network. The best results for network performanceoccur whenno patterns emergeor whenthey cover a smaller percentage of the problems. Clearly, weare limited by the types of recorded measurements.If the causes of circuit problems were eventually determinedandrecorded, it might be possible to explore hypotheses directly related to the cause and repair of a problem.Suchinformationis not currently available, and because of the complexity of a network with transient problems, such records maynever be fully Page 460 AAAI-94 Workshop on Knowledge Discovery in Databases KDD-94 available. The second issue that wehave addressed is the prediction of multiple circuit involvement. Wefound that certain types of faults are goodindicators of multiple circuit problems. Resolving multiple circuit problemsis particularly useful in reducing the overall numberof problemsin the network. Wehave considered a highly complex communicationsnetwork, and analyzed its behavior over time. To facilitate proaetive maintenance, wehave developed measurementsthat are sampled for the completenetworkduring regular time periods. Whileour results are boundedby the predictive capabilities of these measurements,this form of analysis did producereasonable predictors. Theanalysis’ involves intensive computerprocessing of very large volumesof data. In both objectivity and pattern matchingcapability, such efforts are beyondthe capabilities of humanprocessing and experience. REFERENCES [1] P. A. Corn, R. Dube, A. F. McMichaeland J. L. Tsay, "An AutonomousDistributed Expert System For Switched Network Maintenance", Proc. of IEEE GLOBECOM ’88, pages 1530-1537, 1988. [2] R. Sasisekharan, Y-K. Hsu, and D.Simen, "SCOUT:An Approach To Automate Diagnoses Of Faults In Large Scale Networks", to appear in Proc. oflEEE GLOBECOM’93, Houston, 1993. [3] S. Weiss and N. Indurkhya, Reduced Complexity Rule Induction, IJCAI-91, pages 678-684, Sydney, 1991. Proceedings of [4] S. Weiss and C. Kulikowski, ComputerSystems That Learn: Classification And Prediction Methods FromStatistics, Neural Nets, Machine Learning, And Expert Systems, Morgan Kaufmann,1991. [-5] M. James, Classification Algorithms, Wiley, 1985 [6] J. McClellandand D. Rumelhart, Explorations in Parallel Distributed Processing, MIT Press, 1988 [7] L. Breiman, J. Friedman,R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth, 1984. [8] J. Quinlan, "Induction of Decision Trees", MachineLearning, vol. 1, pages 81-106, 1986. [9] D. Ahaand D. Kibler, "Noise Tolerant Instance-Based Learning Algorithms", Proc. of IJCAI-89, pages 794-799, Detroit, 1989. KDD-94 AAA1-94 Workshop on Knowledge Discovery in Databases Page 461 IO0 ,,, RuJumO 8060- Predictive Performance 40200 I I I I Time Figure2:. Predictiveperformance of rule sets. 100 80Percentage 60of cases Covered 40-- "’’’’°’°’’’’’°’’’’’ 200 I I I I Tim~ Figure3: Percentageof Chronicproblemscoveredby each rule set. 100, 80Pn~dic~ve Performance60and Percentage40Coverage ¯ .. ’’’-........, , ".... covemge ’’°’’’°°.. 200 ° "’°-° .° °.. l l l I I I I l "°.. l Number of time units with faults Figure4: Performance of Ruleset0. Page 462 AAA/-94 Workshop on Knowledge Discovery in Databases KDD-94