From: AAAI Technical Report FS-94-01. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Learning About Software Errors Via Systematic Experimentation
Terrance Goan, Oren Etzioni
Department of Computer Science and Engineering, University of Washington
Seattle, WA 98195
{goan, etzioni}@cs.washington.edu
Abstract
Critical to the success of any real-world agent is the ability to detect and recover
from unsuccessful actions. Failing to detect these errors may cause the execution of the
remainder of a plan to have unexpected and dangerous effects. In this paper we present
ED (the Error Detective), which systematically executes lesioned operators in order to
generate a table of errors and associated causes for a software agent. We describe
features of software environments that allow us to efficiently build this table without
searching for simultaneous errors or making the single fault assumption. We then report
on experiments comparing several methods for utilizing this collected data to build an
error diagnosis function.

Introduction
Classical planners (e.g., (Chapman 1987)) assume that their internal model is both
correct and complete. The dynamic nature of real-world domains (e.g., multi-user
software environments) makes these assumptions untenable. Several new planners (e.g.,
XII (Golden et al. 1994)) have been designed to work with incomplete information, and
strides have been made in planning with potentially incorrect information. But,
efficient operation in the presence of incorrect information is highly dependent on a
planner's ability to detect errors. Failing to recognize errors can result in unexpected
and potentially destructive effects, as well as further corruption of the world model.
This paper describes ED (the Error Detective), which automatically generates error
diagnosis functions for a software robot (softbot) (Etzioni et al. 1993).

Real-world software environments (e.g., databases, computer networks, operating systems)
are the subject of intense study in computer science, and software agents are gaining
prominence as a research area. These software environments are not idealizations of
physical environments, but have their own unique features and challenges. In designing
ED we utilized three key insights into software environments:

• Software error messages generally signal only one error. For example, in UNIX, if you
  execute the diff command on two files x and y, where both files are not readable, the
  error message would be "diff: x: Permission denied." It provides no information about
  the status of file y (i.e., the above error message masks the corresponding error
  message for file y). We call this the masking effect.
• Since software errors do not interact, errors can be fixed incrementally (the
  decomposable fault assumption). This means we need not assume just a single fault has
  occurred, rather we assume that if multiple errors occur simultaneously, we can fix
  the error signaled by the given message, re-execute the operator, then handle the next
  error (if there is one).
• Error messages caused by the same fault are often very similar across operators,
  making classification tools such as decision trees potentially useful in forming
  generalizations that may cover unseen variations.
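The first two insights can be illustrated with a small simulation. The sketch below is not ED's code: `run_operator`, `repair`, and the precondition names are hypothetical stand-ins, and the simulated command is assumed to report only the first unsatisfied precondition it encounters.

```python
def run_operator(preconditions, world):
    """Simulated command: reports only the FIRST unsatisfied precondition.

    Returns None on success, otherwise a message naming one cause.
    Any later unsatisfied preconditions are masked by the first one.
    """
    for name in preconditions:
        if not world.get(name, False):
            return f"error: {name} not satisfied"
    return None


def repair(world, message):
    """Hypothetical recovery step: satisfy the precondition named in the message."""
    cause = message[len("error: "):-len(" not satisfied")]
    world[cause] = True
    return cause


def execute_with_incremental_repair(preconditions, world):
    """Fix the signaled error, re-execute, and repeat until the operator succeeds."""
    fixed = []
    message = run_operator(preconditions, world)
    while message is not None:
        fixed.append(repair(world, message))
        message = run_operator(preconditions, world)
    return fixed


# Two simultaneous faults (as in the diff example), but one message at a time.
preconditions = ["x readable", "y readable"]
world = {"x readable": False, "y readable": False}
first_message = run_operator(preconditions, dict(world))
fixed_order = execute_with_incremental_repair(preconditions, world)
```

Here the first message mentions only x, and the fault on y surfaces once x has been repaired, so the number of re-executions grows linearly with the number of faults.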
The remainder of this paper is organized as follows. In the next section we will
further discuss the motivations for this work. We then present our method for collecting
training data to be used in building our error diagnosis functions. Next, we compare the
effectiveness of a variety of methods for classifying error messages. We conclude with a
discussion of the merits (and flaws) of these methods and some suggestions for future
work.
Motivation
In order to maintain an accurate internal world model it is essential to know when
commands have been successfully executed and when they have not. There are many
different ways in which this can be achieved.

If we assume that our operators are complete and correct, and that our world model is
also correct, then we can be assured that every operator execution will succeed. But,
these assumptions are clearly not realistic in a multi-user software environment, due to
exogenous events. In addition, fully specified operators still face the qualification
problem (McCarthy 1980). That is, complete operators may include so many preconditions
(qualifications) that they cannot be utilized efficiently (e.g., the set of
preconditions for pwd would include the requirement that all ancestor directories are
readable, which is impractical to verify).

Given that errors may occur, one way to check for them is to verify the consistency of
the operator's postconditions (against the real world) following every operator
execution (i.e., make sure the operator had the desired effect). While less susceptible
to exogenous events, this approach may be very time consuming. In addition, this does
not solve the problem of error detection for sensory operators since these do not result
in a change in the external world.

The problems cited above suggest that a function that analyzes the output of a software
command may be a much better choice. As a bonus, software error messages often contain
sufficient information to determine the cause of the error. Accurate error diagnosis can
greatly decrease the cost of recovering from an error by restricting the search for
invalid beliefs in the world model. But, the space of (error message, cause) pairs is
prohibitively large to generate by hand, so we propose an automated approach.

The Problem
The problem of generating error diagnosis functions can be summarized as:

Given: A set of complete and correct operators and a set of randomly generated software
environments (playgrounds). A software playground is defined by such things as its
directory structure, file names, file protections, etc.

Generate: A function f = {(a, b) : a = command output and b = ("no error" or (the set of
possible causes of a))}.

More formally, the set of possible causes is a subset of the operator's preconditions
that may not be satisfied. We assume that command output is both accessible and
distinguishable from other text. This assumption generally holds in the UNIX operating
system in which we test our approach. Further, we protect our software playgrounds from
exogenous events.

Systematic Data Collection
Our approach is to execute operators with a subset of their preconditions unsatisfied
(lesioned operators), then associate this subset (the cause) with the corresponding
command output. This works because the set of preconditions of an operator (if correct
and complete) specify exactly the conditions that guarantee successful execution
(assuming that the world model is also correct). Being exponential in the number of
preconditions, this algorithm is much too expensive in practice, but with the addition
of two insights we can do much better.

• When multiple errors occur simultaneously only one is signaled by the error message
  (i.e., the others are masked). More formally: error-msg(¬p1 and ¬p2) =
  (error-msg(¬p1) or error-msg(¬p2)), where ¬p1 and ¬p2 represent unsatisfied
  preconditions.
• Errors can be fixed incrementally (the decomposable fault assumption). That is, we
  assume that if error-msg(¬p1 and ¬p2) = error-msg(¬p2) and we re-execute the operator
  (after achieving p2), we will now get error-msg(¬p1), which can be handled in turn.

Clearly a much better approach would be to execute the operators with a single
precondition negated at a time (see Figure 1). This results in an exponential speedup in
data collection time. Additionally, at execution time the incremental handling of errors
results in a complexity that is linear in the number of errors. Table 1 shows a sample
of the data that was collected. Note that different error messages are possible for a
given lesioned operator depending on how the negated precondition is satisfied. We
achieve the negated preconditions last in order to reduce the search for bindings.
Bindings for any remaining unbound variables are chosen at random from our world model.
We attempt to collect a variety of these possible error messages by using randomized
test worlds. If the set of preconditions of the lesioned operator cannot be satisfied,
we simply move on to the next lesioned operator.

The "Single Lesion" method
  For each operator (OPx)
    For each precondition (Py) of OPx
      1. Execute OPx with precondition Py negated
      2. Associate the error message returned by the software command with its cause Py
Figure 1: By recognizing the masking effect and making the decomposable fault
assumption, we achieve an algorithm for building our error diagnosis table that is
linear in the number of preconditions.

Precondition/Cause                        Error Message(s)
¬(protection file-x readable)             "file-x: Permission denied"
  file-x is not readable
¬(current.directory softbot ?dir)         "dir2: No such file or directory"
  softbot is in the wrong directory       "dir3 not found"
                                          "wc: file4: No such file or directory"
                                          "lpr: cannot access file1"
                                          "mv: cannot access file7"
Table 1: Some Examples of Error-to-Cause Mappings
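The single-lesion loop of Figure 1 can be sketched as follows. This is an illustration under stated assumptions rather than ED itself: the operators, their preconditions, the message templates, and the `execute` simulator are all hypothetical, and a real run would issue commands inside a software playground instead.

```python
from collections import defaultdict

# Hypothetical operators: name -> ordered list of preconditions.
OPERATORS = {
    "wc": ["in correct directory", "file readable"],
    "mv": ["in correct directory", "file readable", "target directory writable"],
}

# Hypothetical message templates, keyed by the violated precondition.
MESSAGES = {
    "in correct directory": "{op}: file: No such file or directory",
    "file readable": "{op}: file: Permission denied",
    "target directory writable": "{op}: cannot create target",
}


def execute(op, satisfied):
    """Simulated command: report the first violated precondition, or succeed."""
    for pre in OPERATORS[op]:
        if pre not in satisfied:
            return MESSAGES[pre].format(op=op)
    return ""  # empty output stands for "no error" in this simulation


def single_lesion_table(operators):
    """Build {command output: set of possible causes}, one run per precondition."""
    table = defaultdict(set)
    runs = 0
    for op, preconditions in operators.items():
        for lesioned in preconditions:
            satisfied = set(preconditions) - {lesioned}  # negate exactly one
            table[execute(op, satisfied)].add(lesioned)
            runs += 1
    return dict(table), runs


table, runs = single_lesion_table(OPERATORS)
# Lesioning every non-empty subset instead would need this many runs.
exhaustive_runs = sum(2 ** len(p) - 1 for p in OPERATORS.values())
```

With two and three preconditions, the single-lesion loop needs 5 executions where lesioning every non-empty subset would need 10; the gap widens exponentially as operators gain preconditions.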
Error Diagnosis FunctionGeneration
We chose to use decision tree classifiers to learn our error detection functions because
they have been shown to work well in a wide variety of applications (e.g., Quinlan 1986;
Mitchie 1987; Carter & Catlett 1987). For the task of software error diagnosis the class
of an example is either "no error" or a set of possible causes for the corresponding
error message. And, we decided on an attribute set consisting of error message length
(number of tokens), and a binary (present or not present) attribute for every token seen
in the collected error messages. We felt that using attributes such as "all possible
sub-phrases" may have been effective, but would have increased the number of attributes
(above the current 427) to an unmanageable level.
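The attribute set described above can be sketched in a few lines. The tokenizer is an assumption (plain whitespace splitting; the tokenization actually used is not specified), and the function names are hypothetical.

```python
def tokenize(message):
    """Whitespace tokenization; an assumption, since the tokenizer is unspecified."""
    return message.split()


def build_vocabulary(messages):
    """Every distinct token seen in the collected messages, in a stable order."""
    return sorted({token for message in messages for token in tokenize(message)})


def to_attributes(message, vocabulary):
    """Attribute vector: token count, then one 0/1 presence flag per vocabulary token."""
    tokens = tokenize(message)
    present = set(tokens)
    return [len(tokens)] + [1 if token in present else 0 for token in vocabulary]


collected = [
    "file-x: Permission denied",
    "dir2: No such file or directory",
    "dir3 not found",
]
vocabulary = build_vocabulary(collected)
vector = to_attributes("dir3 not found", vocabulary)
```

Vectors of this shape, paired with their class labels, can be handed to any decision tree learner; one attribute per token keeps the representation far smaller than enumerating all sub-phrases.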
In addition to the decision trees, we felt that a more straightforward approach might do
well. For comparison, we devised a simple string match classifier. This approach
consisted of building up a table of error message strings (minus command arguments) and
their corresponding causes, then classifying a test example by table lookup.
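A minimal sketch of such a table-lookup classifier follows. How command arguments were removed is not described, so `strip_arguments` (replacing each argument that appears in the output with a placeholder) is an assumption, as are the class name and the example cause labels.

```python
def strip_arguments(message, arguments):
    """Replace each command argument appearing in the message with a placeholder."""
    # Longest arguments first, so "file10" is not partially replaced via "file1".
    for argument in sorted(arguments, key=len, reverse=True):
        message = message.replace(argument, "<arg>")
    return message


class StringMatchClassifier:
    """Table of argument-free error strings mapped to their possible causes."""

    def __init__(self):
        self.table = {}

    def train(self, examples):
        for message, arguments, causes in examples:
            key = strip_arguments(message, arguments)
            self.table.setdefault(key, set()).update(causes)

    def classify(self, message, arguments):
        """Return the stored causes, or None when the string was never seen."""
        return self.table.get(strip_arguments(message, arguments))


classifier = StringMatchClassifier()
classifier.train([
    ("wc: file4: No such file or directory", ["file4"], {"wrong directory"}),
    ("file-x: Permission denied", ["file-x"], {"file not readable"}),
])
seen = classifier.classify("wc: file9: No such file or directory", ["file9"])
unseen = classifier.classify("dir3 not found", ["dir3"])
```

Exact lookup generalizes over arguments but not over wording, so a message phrased differently from anything in the table gets no diagnosis at all.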
Results
We compare four different approaches as to their ability to diagnose errors as well as
their ability to reduce the cost of recovering from the error. The first method is the
hand-crafted error detector that our softbot used initially. This method relies heavily
on the presence of colons in error messages and makes no attempt to diagnose the cause
of the error (we perceived this "colon hack" to work well in early development). The
second approach is a simple string match. Our last two methods are the decision trees
with and without operator names.
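To see why a colon-based detector is fragile, consider the deliberately simplified stand-in below. The softbot's actual hand-crafted detector is not given, so `colon_hack` is only an assumption about its general shape, and the directory listing line is a made-up example of normal output.

```python
def colon_hack(output):
    """Flag a command's output as an error whenever it contains a colon."""
    return ":" in output


# Output text -> whether the execution really was an error.
samples = {
    "wc: file4: No such file or directory": True,          # error, has a colon
    "dir3 not found": True,                                # error, no colon
    "-rw-r--r-- 1 softbot 120 Nov 3 10:32 file1": False,   # normal listing with a time
}
misclassified = [text for text, is_error in samples.items()
                 if colon_hack(text) != is_error]
```

The heuristic misses the colon-free error and raises a false alarm on ordinary output that happens to contain a timestamp.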
The data used in the experiments was collected from four separate runs of ED. Attempts
were made to make the environments for each run (directory structure, files, machine
type etc.) as unique as possible in the hopes of generating a good variety of error
messages. Table 2 shows the results of the first experiment which consisted of training
on three of the four executions of each lesioned operator (351 examples), and then
testing the learned classifiers with the fourth execution (117 examples). The goal of
this test was to discover the accuracy of the four methods when all the operators are
fully specified.

Classifier Type              Correct Detection Rate   Correct Diagnosis Rate   Recovery Savings
Current Method               57.0%                    N/A                      0.0%
Simple String Match          89.1%                    71.6%                    50.5%
Decision Tree (with name)    93.0%                    90.4%                    63.6%
Decision Tree (no name)      86.3%                    83.6%                    56.1%
Table 2: Error detection and diagnosis. The "Correct Detection Rate" is the percentage
of the test cases that were correctly classified as to the success of the operator
execution (error or successful execution). The column labeled "Correct Diagnosis Rate"
contains the percentage of the test cases whose errors were correctly diagnosed.
"Recovery Savings" numbers represent the average percentage of an operator's
preconditions that can be rejected as a cause of the corresponding error message.
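The "Recovery Savings" measure can be made concrete with a small calculation. The sketch below is one plausible reading of the definition, not the evaluation code used for Table 2: it assumes that a case with no diagnosis rejects no preconditions (consistent with the 0.0% reported for the method that makes no diagnosis), and the example cases are invented.

```python
def recovery_savings(cases):
    """Average fraction of an operator's preconditions ruled out per test case.

    Each case is (n_preconditions, suspected_causes). suspected_causes is None
    when the classifier offers no diagnosis, in which case nothing is rejected.
    """
    fractions = []
    for n_preconditions, suspected in cases:
        if suspected is None:
            fractions.append(0.0)
        else:
            rejected = n_preconditions - len(suspected)
            fractions.append(rejected / n_preconditions)
    return sum(fractions) / len(fractions)


cases = [
    (5, {"file readable"}),          # 4 of 5 preconditions rejected
    (4, {"file readable", "cwd"}),   # 2 of 4 rejected
    (3, None),                       # no diagnosis, nothing rejected
]
savings = recovery_savings(cases)
```

Under this reading, a narrower set of suspected causes translates directly into fewer beliefs that must be re-checked during recovery.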
The second test of these four methods was to recognize errors due to preconditions that
were missing from a particular operator's specification during training (see Table 3).
The training data consisted of all the collected data minus the group of four examples
corresponding to a particular lesioned operator, and the test set was exactly those four
removed examples. The hope is that the test cases will be covered by the data obtained
from other operators that contain this precondition.
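The hold-out protocol of this second test can be sketched as below. The record layout (dictionaries with `operator`, `lesion`, and `run` keys) and the three example lesioned operators are assumptions made for illustration.

```python
def leave_one_lesion_out(examples):
    """Yield (held_out, train, test) for each (operator, lesioned precondition) pair."""
    keys = sorted({(e["operator"], e["lesion"]) for e in examples})
    for key in keys:
        test = [e for e in examples if (e["operator"], e["lesion"]) == key]
        train = [e for e in examples if (e["operator"], e["lesion"]) != key]
        yield key, train, test


# Three hypothetical lesioned operators, each executed in four runs.
lesions = [
    ("wc", "file readable"),
    ("mv", "file readable"),
    ("mv", "target directory writable"),
]
examples = [{"operator": op, "lesion": lesion, "run": run}
            for op, lesion in lesions for run in range(4)]

splits = list(leave_one_lesion_out(examples))
# A held-out lesion is only learnable if another operator shares the precondition.
covered = {key: any(e["lesion"] == key[1] for e in train)
           for key, train, _ in splits}
```

In this toy data the precondition that belongs to a single operator has no counterpart in the training set, so no classifier could be expected to diagnose it.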
Classifier Type              Correct Diagnosis Rate
Current Method               N/A
Simple String Match          31.4%
Decision Tree (with name)    24.2%
Decision Tree (no name)      63.3%
Table 3: Error diagnosis rates when the errors are caused by preconditions that did not
appear in the original operator. In this case successful diagnosis is dependent on
generalization from training on operators other than the one being tested.

Discussion
It was interesting to see that the method used by the current softbot has such a high
misclassification rate. This result can be linked to two main causes. First, for many
operators, we simply assume that errors will not occur. And second, the assumption that
colons will appear in error messages is not universal. This highlights the need for
systematic experimentation in a variety of environments. The reason that the poor
performance of this classifier went more or less unnoticed is presumably due to the
general infrequency of errors and the fact that the softbot is still in development so
errors are difficult to track down.

Another significant cause of classification error is "mistakes", which occur when a
command successfully executes, but with the wrong effect. For example, if the softbot
believes that it is in directory dir1, but is actually in dir2, and executes ls, the
command will succeed but the softbot will misperceive the output.

Very similar in effect to mistakes are commands that never signal errors (even when
misused). This results in the better performance of decision trees that utilize the
operator name as an attribute (as shown in Table 2). This performance difference is due
to the fact that the operator name attribute acts to limit the classification mistakes
due to unreported errors instead of allowing them to generalize (i.e., spread) to other
test cases.

The poorer results shown in Table 3's "Correct Diagnosis Rate" column compared to
Table 2's demonstrate that the second task is relatively more difficult than the first.
The main cause is the presence of a good number of preconditions that appear in only a
single operator, thus causing unavoidable misclassifications. Another interesting
feature of Table 3 is that decision trees without operator names did better than those
with them. This is due to a lack of generality in the trees that included operator
names, as is the case with the string match method. This result taken with the result
from Table 2 suggests that building a decision tree without operator names, except in
the case of operators that do not signal errors, may be a better approach.

Related Work
This work resembles other work in diagnosis (Shortliffe 1974; Davis 1984), but software
error diagnosis has some unique features. The masking of errors in this domain is not
unlike the masking of symptoms one might find in medical diagnosis, but the decomposable
fault assumption makes the handling of masked errors much different. Where we can peel
away error after error, this would not be possible in most medical cases. In addition,
in the medical domain the masking might not be complete; symptoms can interact, adding a
great deal of complexity to the problem of diagnosis.

Dietterich (Dietterich 1984) used UNIX as a testbed for his work in data interpretation.
Using constraint propagation within hand-crafted UNIX command models, Dietterich
attempts to determine the result of a command execution given the command arguments and
the command's output. Error messages are included in the command models, but this is
done by hand.

ED is in many ways complementary to "The Operator Refinement Method" presented in
(Carbonell & Gil 1990). Where we assume correct operator models while developing error
detectors, Carbonell and Gil assume accurate error detection while augmenting operator
models. Actually, we can utilize an accurate error diagnosis table (generated by ED) to
suggest preconditions to add to an operator. When an error occurs whose reported cause
is not currently a precondition, we can add the suspected cause (i.e., unsatisfied
precondition) to the operator and conduct further testing to check for appropriateness.
This would allow for preconditions of one operator to propagate to others as necessary
(to avoid errors).
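The suggestion step described in the last paragraph amounts to a set difference. In the sketch below, the table contents, operator models, and the `suggest_preconditions` helper are hypothetical; any candidate it returns would still need the further testing mentioned above before being adopted.

```python
def suggest_preconditions(operator_preconditions, diagnosis_table, operator, message):
    """Return causes reported for `message` that `operator` does not yet list."""
    suspected = diagnosis_table.get(message, set())
    return suspected - operator_preconditions[operator]


# Hypothetical diagnosis table (argument-free message -> possible causes).
diagnosis_table = {"<arg>: Permission denied": {"file readable"}}

# Hypothetical operator models; lpr is missing the readability precondition.
operator_preconditions = {
    "wc": {"in correct directory", "file readable"},
    "lpr": {"in correct directory"},
}

candidates = suggest_preconditions(
    operator_preconditions, diagnosis_table, "lpr", "<arg>: Permission denied")
already_known = suggest_preconditions(
    operator_preconditions, diagnosis_table, "wc", "<arg>: Permission denied")
```

A cause learned from one operator thus becomes a candidate precondition for another operator that produces the same message.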
Future Work
An accurate error detection function can improve the efficiency of a software agent in a
number of ways. First, by simply detecting errors accurately we reduce the number of
future errors due to inaccuracies in the world model. Second, by narrowing the field of
possible causes of an error we can save a great amount of time in error recovery.

Finally, an accurate (error, cause) table will allow yet another set of potentially
fruitful experiments: finding the "optimal" set of operator preconditions. By "optimal"
we mean the set of preconditions that best balances the cost of executing the operator
with the cost of error recovery. For example, it seems reasonable to execute pwd without
worrying about whether all the parent directories of your current directory are
readable. Simply execute the operator and handle the errors that will occasionally
occur.

Additionally, operator failures and the accurate diagnosis of their cause can be viewed
as exploration. For example, one way to determine if a file is readable is to simply try
reading it. If this action results in an error that is deemed protection-related, we
have learned that indeed this file is not readable. The key here is that both the
success and failure of operators can provide information that we can utilize. We plan to
extend our softbot to incorporate this type of information gathering.
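As a small illustration of failure-as-exploration, the sketch below probes a file by attempting to read it and records what the outcome reveals. It is not softbot code: the `probe_readable` helper and the belief labels are hypothetical, and it relies only on Python's standard exception types for file access.

```python
import os
import tempfile


def probe_readable(path, beliefs):
    """Try to read `path` and record what the outcome reveals in `beliefs`."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        beliefs[path] = "readable"
    except PermissionError:
        beliefs[path] = "not readable"    # protection-related failure
    except FileNotFoundError:
        beliefs[path] = "does not exist"
    except OSError:
        beliefs[path] = "unknown"         # some other failure; no conclusion drawn
    return beliefs[path]


beliefs = {}
with tempfile.TemporaryDirectory() as playground:
    present = os.path.join(playground, "file1")
    with open(present, "w") as handle:
        handle.write("hello")
    missing = os.path.join(playground, "file2")
    outcome_present = probe_readable(present, beliefs)
    outcome_missing = probe_readable(missing, beliefs)
```

Both the successful read and the failed one update the belief store, which is the point of treating diagnosed failures as a source of information.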
References
Carbonell and Gil 1990. Learning by Experimentation: The Operator Refinement Method.
Machine Learning: An Artificial Intelligence Approach vol. III. 1990.
Carter and Catlett 1987. Assessing Credit Card Applications Using Machine Learning.
IEEE Expert, Fall.
Chapman 1987. Planning for Conjunctive Goals. Artificial Intelligence 32(3).
North-Holland Publishing.
Davis 1984. Diagnostic Reasoning Based on Structure and Behavior. Artificial
Intelligence 24. North-Holland Publishing.
Dietterich 1984. Constraint Propagation Techniques for Theory-Driven Data
Interpretation. Doctoral dissertation Stanford University, Dec. 1984. Technical Report
84-46, Stanford Heuristic Programming Project, Stanford CA.
Etzioni, Lesh and Segal. 1993. Building Softbots for UNIX (preliminary report).
Technical Report 93-09-01, University of Washington, 1993. Available via anonymous FTP
from ftp/pub/ai/etzioni/softbots at cs.washington.edu.
Golden, Etzioni, and Weld 1994. Omnipotence without Omniscience. AAAI 1994.
McCarthy 1980. Circumscription--A Form of Non-Monotonic Reasoning. Artificial
Intelligence 13(1). North-Holland Publishing.
Mitchie 1987. Current Developments in Expert Systems. In Quinlan (Ed.), Applications of
Expert Systems, Maidenhead: Addison Wesley Publishers.
Quinlan 1986. Induction of Decision Trees, Machine Learning, (1) 1, Kluwer Academic
Publishers.
Shortliffe 1974. MYCIN: A Rule-Based Computer Program To Advise Physicians Regarding
Antimicrobial Therapy Selection. Doctoral dissertation Stanford University, Oct. 1974.
Technical Report AIM-251, Stanford Artificial Intelligence Laboratory, Stanford, CA.