From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Proactive Network.Maintenance Using Machine Learning
R Sasisekharan
V Seshadri
AT&TBell Laboratories,
S MWeiss
HR2K033
480 Red Hill Road
Middletown, N J 07748
sesh@eagle.hr.att.com
¯ - ABSTRACT
A new approach to proactively maintain a massively interconnected communicationsnetworkis described. Tl’fis approachhas beenapplied to the detection and prediction of chronic transmission faults in AT&T’s
digital communicationsnetwork. A windowingtechnique was applied to
large volumesof diagnostic data, and these data were analyzed by machinelearning methods. A
set of conditionshas beenfoundthat is highly predictive of chronic circuit problems,that is, problemsthat are likely to continue in the immediatefuture withoutdiagnosis and repair. In addition, a
few conditions have been found that are predictive Of problemsthat affect multiple circuits. Such
analyses over the completenetworkare helpful in proactively maintainingthe networkand in spotring trends for circuit problems.Proacfive maintenanceof the networkcan help in greatly improving the quality and reliability of a networkby identifying potentially serious problemsbefore they
OCCur.
KEYWORDS
communications
network,rule induction, proactive maintenance,machinelearning, databasemining
KDD-94
AAAI-94 Workshop on Knowledge Discovery in Databases
Page 453
INTRODUCTION
¯ With the increasing complexity of moderncommunicationsnetworks, there is a commensurate need for intelligent systemsto help manageand maintainthem.¯Mostdesirable are systemsthat
can analyze and resolve problemsin the networkautomatically, thus greatly improvingthe reliability and quality of the network.Artificial intelligence(M)techniques haveprovenuseful in building
networkoperation systems[l]. Anarea in whichsuch techniques can have significant impact is in
the proaetive maintenanceof networks. Proactive maintenanceof the networkcan help in greatly
improvingthe quality and reliability of a networkby identifying potentially serious problemsbefore they degrade. This can be accomplished bymonitoring network performance over time and
spotting trends in problems. In addition, monitoringnetworkperformancecan help in prioritizing
networkproblems.The abilityto prioritize network¯problemscan significantly improvethe quality
of the networksince those problemsthat have the most significant impacton the operations of the
networkcan be addressed first.
Monitoring network performanceinvolves analyzing extremely large amountsof diagnostic
data that varies with time. Lookingfor pattems of behavior in such large volumesof data can only
be accomplishedby computeranalysis. In this paper, wepresent an approachto do such an analysis
using machinelearning techniques. Weillustrate the approachby specifically looking at detection
and prediction of chronic transmission faults in AT&T’s
world-widedigital communicationsnetwork.
Even though this paper focuses on telecommunications networks where manyhomogeneous
¯ networkcomponents
exist, our workis applicable to other types of networksas well. In the following sections, Wedescribe the application domain,explain the approachthat weused to enable the
¯ machineto learn from time-dependentProblemsin the network, apply alternative methodsof machine learning to our problem,and report the results we have obtained.
PROBLEM
DESCRIPTION
Networkoperation systems (NOS)exist in the network to support provisioning, maintenance, operation, administration and management
functions for the networkand for individual network components.A Circuit can be considered-as a path in the network, which contains network
componentsand links. Transmissionproblemson a circuit are seen by several of the networkcomponents through which :the circuit connects. In a large network, such-as AT&T’s
communications
network, the ratio, of diagnostic data generated by various networkcomponentsto the root problem
that is responsiblefor them,is very large.
Thedifferent types .of problemsin the networkcan be broadly categorized into two classes,
transient and non-transient. Transient transmission problemsare very common
in the network, yet
their behavior and causes are not completelyunderstood. Part of the difficulty in understanding
themis related to separatingthe wheatfromthe chaff, that is, in learning to ignoreglitches that will
not be repeated and focussing insteadon those transient problemsthat will recur (chronics). Chronits not only affect the quality of communications
while they recur but also indicate degradationand
potential-future failures in the network.Thusit is an importantand challengingproblemto identify
these chronics and isolate their causes.
Diagnostic procedures that attempt to resolve transient problemsmust rely on large volumes
of historical information and a morecomplexanalysis of patterns of behavior. Onenovel approach
to diagnosing transient faults is found in an AT&T
system knownas SCOUT[2].Using historical
and topological information, SCOUT
finds specific related circuits that share common
patterns of
faulty behavior. Typically, these are difficult transient problems;multiple circuit problems,or even
forms of chronic faulty behavior. In this paper, we consider a related form of analysis of chronic
Page 454
AAAI-94 Workshop on Knowledge Discovery in Databases
KDD-94
behavior. Weconsider the performanceof the complete AT&T
network over time. Our objective is
to determinewhetherthere are patterns of behavior over the networksuch that the following predictions can be made:
¯ Thefaulty behavior will continue in the immediatefuture.
¯ Thefaulty behavior involves multiple circuits.
Wedo not attempt to solve specific problems.Instead, our objective is to determinewhether
there are any signature characteristics for those networkproblemsthat do not get fixed. Wemay
not have enoughdatato determine the exact nature of the problem, but wemaybe able to predict
accurately that the problemwill not be fixedquicklyduring the normalcourse of operations. In order to achieve maximum
reliability of the network, problemswith these signatures must be considered, and those problemsthat are deemedcritical must be addressed quickly.
Froma computerscience perspective, interest in this area centers around a numberof issues
related to mappingtime-dependentevents into a standard classification framework.These include
the following:
¯ Defining classes to conveythe concept of "chronic."
¯ Definingfeatures that summarizehistorical events.
In addition, the samplesize in this analysis :numbersin the tens of thousandsof cases. Looking for patterns of behavior in such large volumesof data can only be accomplishedby computer
analysis using machinelearning techniques, possibly resulting in newinformation that cannot be
obtained by typical humanexperience.
THE
MACHINE
LEARNING
APPROACH
In order to identify chronics that have the potential to degrade the performanceof the communicationsnetwork, we adopted a computer-basedapproachof learning from historical data. The
macb_inelearning methodologyis described in this section.
Describing the Goals and Measurements
Thecircuit,related questions that wehave outlined in the previous sections need to be posed
in a standard classification format, so that a numberof interesting analytical techniques that are
available can be applied. PredictiOn modelsthat can be applied tea standard classification problem
include decision trees, decision rules, statistical linear diseriminants, neural nets, and nearest
neighbor methods.In the standard classification format, samplesof cases are obtained. For each
case, identical measurements
are taken, and at least one of these measuresis the class label. Methods are applied whichattempt to find patterns for one class that differ fromother classes.
For our first problem,the class label is chronic failure on a circuit, a conceptthat has been
defined in previous sections. Thegoal is to predict that current failures will continue to occur. We
mustalso take into accountthat these failures are often transient, and that failures will likely not
occur continuouslyin the future. Instead,-a failure mayoccur in the future, but the occurrencemay
also be transient. Periods of time that are reasonablyclose to the current period are of interest.
The measurementsthat are used for prediction must summarizehistorical information. These
measurementsare recorded each time a fault occurs. Not all measurementsare recorded for every
fault, only those that directly measurethe fault process. Faults are often transient, so the trends for
a period of time must be measured.It is quite possible that manyfaults will occur for a short time,
but these faults are not necessarily chronic, Theycan be fixed and do not reappear in the immediate
future. Measurements
mustbe specified that are useful in pr~icting the target concept, that is, future failures on the circuit.
KDD-94
AAAI.94 Workshop on Knowledge Discovery in Databases
Page 455
This time-dependentproblemwas mapped¯into a standard classification format by the use of
fixed time windows.Historical information for circuits was examinedover a consecutive period of
time, and this time period wasdivided into two windows,W,~and Wb.The objective was to use the
¯ measurementsmadein Wato predict that problems will occur in Wb. Weconsidered both our
knowledgeof the application and experimentaldata to arrive at reasonable sizes for each of the two
windows.The windowswere also divided into sub,units based on time, which we will refer to as
a time unit. Wewill refer to the size of W,as T, time units, and the size of W
b as Tb time units.
There are manyreasonable measurementsthat can be taken over time. Assuminga fault oc:
curs, an alarm or exception is noted. Included in the possibilities of measurements
are the number
of times such an event occurs, the average numberof times an event occurs during a time unit, or
the numberof time units during which the event occurs. Wedefined around 30 performancefeatures for this problem,based on the variation of diagnostic data over time and variation over space.
In addition to the timeperiodsfor the windows,another factor must be consideredin defining
the conditions for each window,This is the degree of chronic failure. For the two windows,W,and
Wb;the conditions that were chosen were different, as follows:
¯ W,: any fault during the time period T,,.
¯ Wb:faults during at least half the time units of the followingtime period"lb.
Therationale for these periods is as follows. Wemust consider a prior period sufficiently
long that prediction of continuing chronic behavioris feasible. Weconsideredall circuits wherea
¯ fau!t has occurredduringWa.Becausefaults are often transient, wemust specify a reasonablylong
period for Wb.Afault that recurs over at least half the time units during Wbwouldbe of interest
becauseit indicates a clearly continuing and unresolvedproblem.
In our learning experiments, weexaminedall circuits with problemsduring predefined intervals. It must be emphasizedthat the predictions to be madeare those for continuing chronic failures. Theseare not necessarily the most obviousor acute problems, whichare often diagnosedand
fixed quickly.
The second question that was examinedis whether any patterns of.measurementsare indicative of multiple circuit involvement.Werefer to failures, occurring on multiple circuits due to a
commoncause, as. multiple circuit problems. Werely on SCOUT’sanalysis to determine if the
problemon a circuit is a multiple circuit problem.The question that wehopedto answeris whether
there are certain types of faults within the completenetworkthat consistently suggest multiple circuit problems.
Learning from Data
Oncesample data are obtained, several computer-basedtechniques can be used to makepredictions. Amongthe more prominent techniques are:
¯
¯
¯
¯
statistical linear discriminants[5],
neural nets[6],
decision rules or trees[7][8], and
nearest neighbor methods[9].
Both neural nets and linear diseriminants makepredictions based on weighted functions.The
linear discriminant uses a simple additive scoring function, for example
If a*X + b*Y > 3, choose Class 1.
(1)
The neural net, on the other hand, can model morecomplexdecision functions, typically
Page 456
AAAI-94 Workshop on Knowledge Discovery in Databases
KDD-94
non-linear functions.
Decisiontrees and decisionrules posesolutions in the formofmaeor false conditions.Forexample,
If A>10ANDB<20, choose Class 1.
(2)
Adecision tree is a stylized modelof decision rules, where every decision flows from the
root node of the tree, and each path is mutuallyexclusive. Moreinformationabout these techniques
can be found in Weiss and Kulikowski[4].
For our application, wehad a large numberof samples obtained from the operating commu¯ nications network,It wasnot feasible to try every variation of these methodson the entire set of
samples, Instead, we performedsomesmaller experiments to see whether one approach offered an
advantageover the others. Overall, for this application, nearest neighbormethodsand statistical
linear discriminants performedpoorly. Neural nets and decision rules or trees were competitive,
with a small edge for decision rules.
In experimentson the completedata sets, representing all circuit problemsin the network
over a fixed period, werelied mostlyon rule induction. Rule and tree induction methodshave been
extensively described in published works[4]~:in our Study, we emphasizeda rule induction technique called SWAP-113].
Rule induction methodsattempt to find a compactcovering rule set that
completelyseparates the classes. Thecovering set is found by heuristically searching for a single
best rule that covers casesfor only one class. Havingfounda best conjunctive rule for a class C,
the rule is addedto the rule set and the cases satisfying it are removedfromfurther consideration.
Tkis process is repeated until no cases remain tobe covered. Unlike decision tree induction programsand other rule induction methods, SWAP-1
has the advantagethat it uses optimization techrtiques to revise and improveits decisions. Oncea coveringset is foundthat separates the classes,
the inducedset of rules is further refined by either pruningor statistical techniques.Usingtrain and
test evaluationmethods,the initial coveringrule set is then scaled backto the moststatistically accurate subset of rules.
I~is quite possible that tuning manyof the alternative methodscould result in somewhatimproved results for each method. However,based on our knowledgeof the application, there are a
numberof reasons whythe rule induction methodappears most appropriate:
¯ Theobjective is to extract newinformation from the data. Thehope is that we can gain
insight into the performanceof the network. Decisionrules have the strongest explanatory
capabilities of the cited models.
¯ Weknowin advancethat this is a noisy environment.Perfect classification can be
achievedon all samplesonly whenchronic behavior is entirely consistent. This is not
likely with all the efforts towardhigh reliability and the transient nature of manyproblems. Thusthe expectation is to find a subset of conditions that are highly reliable predictors of chronicfailure.
¯ Mostof the measurementsare counts of the numberof time units during which an event
occurred over a predefined time period. Thus these measurementsare ordered discrete
variables. Theyare not continuous. Patterns of these types of measurementsare usually
effectively described in terms of the greater than or less than operators whichare used by
the rule induction model.
KDD-94
AAAI-94 Workshop on Knowledge Discovery in Databases
Page 457
¯
The minimum
error solution is not necessarily the best solution. Becauseweare trying to
extract newinformation, the preferred solution consists of the highest predictive rules
even if they cover fewer cases. Thepreference is also for simple rules that enhanceour
understanding of network performance.
Giventhat the expectation is for predictions that coveronly a partial numberof chronic problems, decision rules most naturally modelthe partitioning of data. Theefficacy of the individual
rules can then be tested on independenttest data.
Testing the Decision Model
Thecentral methodfor building a predictive modelis to learn from samplesand test on independentdata. In manyapplications there is a relative shortageof data. In these situations, a compromiseis madeby randomlypartitioningthe data into training and test sets. In our application, we
had a large numberof samples. Training was performedon a randomsubset of data for a time period, and somepreliminary testing was done on the remaining data. Oncea solution was found,
: further rigorous testing wasperformedby testing the solution.on additional data from subsequent
time periods. Whileit is traditional to train on two-thirdsof the data and test on the remainingdata,
the goal of finding simpler, morepredictive rules maySometimesbe achievedby training with fewer cases. Theseare rules that coverfewer cases, but are morepredictive for the cases that they cover.
RESULTS
In this section, wedescribe the results that wereobtained, both in identifying chronic problems and in Classifying problemsassociated with multiple circuits. Wehave also provided the resuits of our comparisonof various learning methodsas applied in this case.
Alternative Methodsfor Learning
Althoughweconcludedthat rule induction wasthe preferred learning methodfor this application;we performedseveral experimentsto evaluate its competitivenesswith alternative learning
models.Table 1 summarizesthe results. Onethird of the cases were randomlyselected for testing,
and the error rates in the table are based on the test case performance.
Simplychoosingthe largest class gives an error rate of 6.4%.Thelinear discriminant (Fisher’s) that weused was the standard parametric discriminant found in statistical packages. Weused
it with feature selection. It is not unusualthat this methoddoesnot performwell in a lowprevalence
situation.
Nearest neighbormethodsare greatly affected by noisy variables. This result is for k=l and
euclidean distance.
Thethree remainingtechniques were relatively close for this experiment. Thedecision tree
was induced by CART,and the decision rules by Swap-1.The neural net was a standard backpropagation network, with a single hidden layer. Configuralions were considered with from 0 to 6 hidden units, with the best test error rate for 0 hiddenunits.
Page 458
AAAI.94 Workshop on Knowledge Discovery in Databases
KDD-94
Method
Error rate (%)
Prior
6.4
Linear discriminant
6.4
Nearest n~ghbor
5.6
Neural net
4.7
Decision tree
4.5
Decision rules
4.4
Table I: ComparativeResults For Alternative Learning Methods
Sample Size
Weexaminedthe historical records for several months during late 1992 and 1993. These
samples were taken from the complete AT&T
network, and covered a significant numberof all
transmission problems encountered. Whencomparedto the billions of transmissions during a
monthand the size of the network, the numberof problemsis quite small. However,from a sampiing perspective, wehad a large sample,consisting of tens of thousands.Of these circuits, between
5 and 10%fulfilled our definition of Chronic,that is, they had faults during at least half the time
units during Wb.
Predicting Chronic Circuit Problems
Wenowreturn to the central task, namely,can wepredict chronic problems,that is, problems
that wit1 continue in the immediatefuture? Wewereable to generate rule sets that werepredictive.
Therule set reducesto a set of conditions. If any of the conditions hold, the problemis very likely
to be chronic. The conditions were of the form
¯ X_Featurei> nl
where X_Featurei is a performancefeature based on the numberof time units during which
an event of a type i occurs and r~ is an integer.
Wewill refer to a set of five conditions, of the type described abovethat were generated, as
RulesetO.Anothercondition of a different type wasalso generated. It was of the form
¯ Y_Feature, > m, & Y_Featureb ¯ mb & Y_Featurec ¯ mc
WhereY_Featureis a performancefeature that is not based on the count of time units and m
is an integer.
This condition is weakerthan the other conditions. Wewill refer to the rule set that includes
all the conditions in RulesetOand this last condition as Rulesetl. While the predictive value of
RulesetOis better than Rulesetl, Rulesetl covers a larger portion of the chronic problem. This is
illustrated in Figures 2 and 3.
Figure 2 plots the performanceof these two rule sets over the course of several time periods.
The rule set was induced during a time period whenthe prevalence of chronic problemswas somewhat lower than during other time periods. Thus any solution induced for that time period necessarily wouldbe highly predictive to overcomethe odds of the larger class. Figure 3 plots the per-
KDD-94
AAAI-94 Workshop on Knowledge Discovery in Databases
Page 459
centage of each time period’s chronic problemscovered by the rule sets.
Figure 4 plots the changein predictive value versus the numberof time units with faults for
a representative time period. Specifically,, as the definition of chronic problemsthat are to be detected is mademoreselective, theperformanceimprovesin that the numberof false alarms is reduced. At the sametime, of course, coveragedecreases, meaningthat the numberof transient problems, that are no longer classified as chronic but havethe potential to result in performancedegradation, increases.
Multiple Circuit Detection
Using the same measurements, we consider the detection of multiple circuit problems.
Whethera problemon a circuit is a multiple circuit problemor not is determinedby meansof data
obtained from SCOUT.
For this application, we trained on data from one time period and tested on
data from another time period. The classes are close to equal in size with a 45%- 55%split. Two
conditions emergedas particular strong predictors of multiple circuit involvement.Theywere of
the type:
¯ X_Featurei >ni
Althoughthey cover a relatively small percentage of all circuits with problems,wheneither
of these conditions occur the likelihood that multiple circuits are involvedis estimated at 93%.
Wehave identified certain conditions that are common
to both chronic problemdetection and
multiple circuit detection. Thus, we have comeup with a rule that predicts whena problemwith a
circuit is likely to be chronicas well as affect multiplecircuits.
CONCLUSIONS
If all problemsin a communicationnetwork were either transient or quickly repaired, it
wouldnot be necessary to detect chronic problems. However,chronic problemsdo occur. Identifying patterns of these problemsshould be helpful in characterizing problemsthat are not detected
and re/)aired quickly. In our.analysis, wefound that the numberof time units over which events
occurred wascritical in determiningthe likelihood that the problemwouldcontinue in the future.
Theserules suggest a form of momentum
or inertia for chronic fault problems. There are a number
of rationales for the validity of this formof analysis:
¯ Not all faults have this momentum
to the samedegree. Wehave identified those measurementsthat are predictive along with the correspondingthresholds, that is, the numberof
time units of faults during a windowfor whichthey are predictive.
¯ Of particular importance,any circuit that exhibits this behaviorwill likely continue this
behavior.Thus, if the goal is to maximize
reliability, circuits exhibiting these characteristics shouldbe given priority in diagnosis and repair.
Themostimmediateapplication of these results is their use in reordering trouble tickets. Beyond that, though, we have provided a methodologyto proactively maintain and monitor the performanceof a network. The determination of trends can demonstrate whether progress is madein
reducing chronic problems or whether chronic problems are increasing in the network. The best
results for network performanceoccur whenno patterns emergeor whenthey cover a smaller percentage of the problems.
Clearly, weare limited by the types of recorded measurements.If the causes of circuit problems were eventually determinedandrecorded, it might be possible to explore hypotheses directly
related to the cause and repair of a problem.Suchinformationis not currently available, and because of the complexity of a network with transient problems, such records maynever be fully
Page 460
AAAI-94 Workshop on Knowledge Discovery in Databases
KDD-94
available.
The second issue that wehave addressed is the prediction of multiple circuit involvement.
Wefound that certain types of faults are goodindicators of multiple circuit problems. Resolving
multiple circuit problemsis particularly useful in reducing the overall numberof problemsin the
network.
Wehave considered a highly complex communicationsnetwork, and analyzed its behavior
over time. To facilitate proaetive maintenance, wehave developed measurementsthat are sampled
for the completenetworkduring regular time periods. Whileour results are boundedby the predictive capabilities of these measurements,this form of analysis did producereasonable predictors.
Theanalysis’ involves intensive computerprocessing of very large volumesof data. In both objectivity and pattern matchingcapability, such efforts are beyondthe capabilities of humanprocessing
and experience.
REFERENCES
[1] P. A. Corn, R. Dube, A. F. McMichaeland J. L. Tsay, "An AutonomousDistributed
Expert System For Switched Network Maintenance", Proc. of IEEE GLOBECOM
’88, pages
1530-1537, 1988.
[2] R. Sasisekharan, Y-K. Hsu, and D.Simen, "SCOUT:An Approach To Automate Diagnoses Of Faults In Large Scale Networks", to appear in Proc. oflEEE GLOBECOM’93,
Houston,
1993.
[3] S. Weiss and N. Indurkhya, Reduced Complexity Rule Induction,
IJCAI-91, pages 678-684, Sydney, 1991.
Proceedings of
[4] S. Weiss and C. Kulikowski, ComputerSystems That Learn: Classification And Prediction Methods FromStatistics,
Neural Nets, Machine Learning, And Expert Systems, Morgan
Kaufmann,1991.
[-5] M. James, Classification Algorithms, Wiley, 1985
[6] J. McClellandand D. Rumelhart, Explorations in Parallel Distributed Processing, MIT
Press, 1988
[7] L. Breiman, J. Friedman,R. Olshen and C. Stone, Classification and Regression Trees,
Wadsworth, 1984.
[8] J. Quinlan, "Induction of Decision Trees", MachineLearning, vol. 1, pages 81-106,
1986.
[9] D. Ahaand D. Kibler, "Noise Tolerant Instance-Based Learning Algorithms", Proc. of
IJCAI-89, pages 794-799, Detroit, 1989.
KDD-94
AAA1-94 Workshop on Knowledge Discovery in Databases
Page 461
IO0 ,,,
RuJumO
8060-
Predictive
Performance
40200
I
I
I
I
Time
Figure2:. Predictiveperformance
of rule sets.
100
80Percentage 60of cases
Covered 40--
"’’’’°’°’’’’’°’’’’’
200
I
I
I
I
Tim~
Figure3: Percentageof Chronicproblemscoveredby each rule set.
100,
80Pn~dic~ve
Performance60and
Percentage40Coverage
¯ ..
’’’-........,
,
"....
covemge
’’°’’’°°..
200
°
"’°-°
.°
°..
l
l
l
I
I
I
I
l
"°..
l
Number
of time units with faults
Figure4: Performance
of Ruleset0.
Page 462
AAA/-94 Workshop on Knowledge Discovery
in Databases
KDD-94