From: AAAI Technical Report FS-94-01. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Learning About Software Errors Via Systematic Experimentation

Terrance Goan, Oren Etzioni
Department of Computer Science and Engineering, University of Washington
Seattle, WA 98195
{goan, etzioni}@cs.washington.edu

Abstract

Critical to the success of any real-world agent is the ability to detect and recover from unsuccessful actions. Failing to detect these errors may cause the execution of the remainder of a plan to have unexpected and dangerous effects. In this paper we present ED (the Error Detective), which systematically executes lesioned operators in order to generate a table of errors and associated causes for a software agent. We describe features of software environments that allow us to build this table efficiently without searching for simultaneous errors or making the single fault assumption. We then report on experiments comparing several methods for using the collected data to build an error diagnosis function.

Introduction

Classical planners (e.g., (Chapman 1987)) assume that their internal model is both correct and complete. The dynamic nature of real-world domains (e.g., multi-user software environments) makes these assumptions untenable. Several new planners (e.g., XII (Golden et al. 1994)) have been designed to work with incomplete information, and strides have been made in planning with potentially incorrect information. But efficient operation in the presence of incorrect information is highly dependent on a planner's ability to detect errors. Failing to recognize errors can result in unexpected and potentially destructive effects, as well as further corruption of the world model.

This paper describes ED (the Error Detective), which automatically generates error diagnosis functions for a software robot (softbot) (Etzioni et al. 1993). Real-world software environments (e.g., databases, computer networks, operating systems) are the subject of intense study in computer science, and software agents are gaining prominence as a research area. These software environments are not idealizations of physical environments, but have their own unique features and challenges. In designing ED we utilized three key insights into software environments:

• Software error messages generally signal only one error. For example, in UNIX, if you execute the diff command on two files x and y, neither of which is readable, the error message would be "diff: x: Permission denied." It provides no information about the status of file y (i.e., the above error message masks the corresponding error message for file y). We call this the masking effect.

• Since software errors do not interact, errors can be fixed incrementally (the decomposable fault assumption). This means we need not assume that just a single fault has occurred; rather, we assume that if multiple errors occur simultaneously, we can fix the error signaled by the given message, re-execute the operator, and then handle the next error (if there is one).

• Error messages caused by the same fault are often very similar across operators, making classification tools such as decision trees potentially useful in forming generalizations that may cover unseen variations.

The remainder of this paper is organized as follows. In the next section we further discuss the motivations for this work. We then present our method for collecting training data to be used in building our error diagnosis functions. Next, we compare the effectiveness of a variety of methods for classifying error messages. We conclude with a discussion of the merits (and flaws) of these methods and some suggestions for future work.
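The decomposable fault assumption described in this introduction amounts to a simple repair loop: execute the operator, diagnose the one error that is signaled, fix it, and re-execute. The following Python sketch is only an illustration under stated assumptions, not ED's implementation. The callables `execute`, `diagnose`, and `achieve`, the `max_rounds` limit, and the toy `fake_diff` command are hypothetical names introduced here.

```python
from typing import Callable, List, Optional, Set


def run_with_incremental_repair(
    execute: Callable[[], Optional[str]],
    diagnose: Callable[[str], str],
    achieve: Callable[[str], None],
    max_rounds: int = 10,
) -> List[str]:
    """Re-execute an operator, fixing one signaled error per round.

    execute  -- runs the command; returns an error message, or None on success
    diagnose -- maps an error message to the unsatisfied precondition (cause)
    achieve  -- makes the named precondition true in the environment
    Returns the causes that were repaired, in the order they surfaced.
    """
    repaired: List[str] = []
    for _ in range(max_rounds):
        message = execute()
        if message is None:        # operator succeeded, nothing left to fix
            return repaired
        cause = diagnose(message)  # only one error is signaled (masking effect)
        achieve(cause)
        repaired.append(cause)
    raise RuntimeError("operator still failing after max_rounds repairs")


# Toy environment (hypothetical): stands in for `diff x y` with both files unreadable.
unreadable: Set[str] = {"x", "y"}


def fake_diff() -> Optional[str]:
    for name in ("x", "y"):        # the first unreadable file masks the second
        if name in unreadable:
            return f"diff: {name}: Permission denied"
    return None


def diagnose(message: str) -> str:
    file_name = message.split(": ")[1]
    return f"(protection {file_name} readable)"


def achieve(cause: str) -> None:
    file_name = cause.split()[1]
    unreadable.discard(file_name)


repairs = run_with_incremental_repair(fake_diff, diagnose, achieve)
print(repairs)
```

Each round handles exactly one message, so the number of re-executions grows with the number of errors actually present rather than with the number of possible error combinations.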
Motivation

In order to maintain an accurate internal world model it is essential to know when commands have been successfully executed and when they have not. There are many different ways in which this can be achieved. If we assume that our operators are complete and correct, and that our world model is also correct, then we can be assured that every operator execution will succeed. But these assumptions are clearly not realistic in a multi-user software environment, due to exogenous events. In addition, fully specified operators still face the qualification problem (McCarthy 1980). That is, complete operators may include so many preconditions (qualifications) that they cannot be utilized efficiently (e.g., the set of preconditions for pwd would include the requirement that all ancestor directories are readable, which is impractical to verify).

Given that errors may occur, one way to check for them is to verify the consistency of the operator's postconditions (against the real world) following every operator execution (i.e., make sure the operator had the desired effect). While less susceptible to exogenous events, this approach may be very time consuming. In addition, it does not solve the problem of error detection for sensory operators, since these do not result in a change in the external world.

The problems cited above suggest that a function that analyzes the output of a software command may be a much better choice. As a bonus, software error messages often contain sufficient information to determine the cause of the error. Accurate error diagnosis can greatly decrease the cost of recovering from an error by restricting the search for invalid beliefs in the world model. But the space of (error message, cause) pairs is prohibitively large to generate by hand, so we propose an automated approach.

The Problem

The problem of generating error diagnosis functions can be summarized as:

Given: A set of complete and correct operators and a set of randomly generated software environments (playgrounds). A software playground is defined by such things as its directory structure, file names, file protections, etc.

Generate: A function f = {(a, b) : a = command output and b = ("no error" or (the set of possible causes of a))}.

Systematic Data Collection

Our approach is to execute operators with a subset of their preconditions unsatisfied (lesioned operators), then associate this subset (the cause) with the corresponding command output. This works because the set of preconditions of an operator (if correct and complete) specifies exactly the conditions that guarantee successful execution (assuming that the world model is also correct). Being exponential in the number of preconditions, this algorithm is much too expensive in practice, but with the addition of two insights we can do much better.

• When multiple errors occur simultaneously only one is signaled by the error message (i.e., the others are masked). More formally: error-msg(¬p1 and ¬p2) = (error-msg(¬p1) or error-msg(¬p2)), where ¬p1 and ¬p2 represent unsatisfied preconditions.

• Errors can be fixed incrementally (the decomposable fault assumption). That is, we assume that if error-msg(¬p1 and ¬p2) = error-msg(¬p2) and we re-execute the operator (after achieving p2), we will now get error-msg(¬p1), which can be handled in turn.

The "Single Lesion" method

For each operator (OPx)
  For each precondition (Py) of OPx
    1. Execute OPx with precondition Py negated
    2. Associate the error message returned by the software command with its cause Py

Figure 1: By recognizing the masking effect and making the decomposable fault assumption, we achieve an algorithm for building our error diagnosis table that is linear in the number of preconditions.

Clearly a much better approach would be to execute the operators with a single precondition negated at a time (see Figure 1). This results in an exponential speedup in data collection time. Additionally, at execution time the incremental handling of errors results in a complexity that is linear in the number of errors.
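The single-lesion loop of Figure 1 can be written as a short program. The Python sketch below is a minimal illustration under stated assumptions rather than ED's actual code. `Operator`, `collect_single_lesion`, and the `run_lesioned` callback are hypothetical names; the callback is assumed to prepare a playground in which every precondition except the lesioned one holds, run the command, and return its output. The canned outputs in `fake_run` are toy values.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: Tuple[str, ...]


def collect_single_lesion(
    operators: List[Operator],
    run_lesioned: Callable[[Operator, str], str],
) -> Dict[str, Set[str]]:
    """Build an error table mapping command output -> set of possible causes.

    One execution per (operator, precondition) pair, so the number of runs is
    linear in the number of preconditions rather than exponential.
    """
    table: Dict[str, Set[str]] = defaultdict(set)
    for op in operators:
        for lesioned in op.preconditions:
            output = run_lesioned(op, lesioned)  # step 1: execute with Py negated
            table[output].add(lesioned)          # step 2: associate output with Py
    return dict(table)


# Toy stand-in for a real playground (assumption): canned outputs keyed by the lesion.
def fake_run(op: Operator, lesioned: str) -> str:
    if lesioned.startswith("(protection"):
        return "file-x: Permission denied"
    return f"{op.name}: file-x: No such file or directory"


ops = [
    Operator("wc", ("(protection file-x readable)", "(current.directory softbot ?dir)")),
    Operator("cat", ("(protection file-x readable)",)),
]
table = collect_single_lesion(ops, fake_run)
runs = sum(len(op.preconditions) for op in ops)
print(runs, table)
```

With n preconditions per operator this performs n runs, whereas lesioning every non-empty subset would require 2^n - 1.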
Table 1 shows a sample of the data that was collected. Note that different error messages are possible for a given lesioned operator depending on how the negated precondition is satisfied. We achieve the negated preconditions last in order to reduce the search for bindings. Bindings for any remaining unbound variables are chosen at random from our world model. We attempt to collect a variety of these possible error messages by using randomized test worlds. If the set of preconditions of the lesioned operator cannot be satisfied, we simply move on to the next lesioned operator.

Precondition/Cause                       Error Message(s)
¬(protection file-x readable)            "file-x: Permission denied"
  file-x is not readable
¬(current.directory softbot ?dir)        "dir2: No such file or directory"
  softbot is in the wrong directory      "dir3 not found"
                                         "wc: file4: No such file or directory"
                                         "lpr: cannot access file1"
                                         "mv: cannot access file7"

Table 1: Some Examples of Error-to-Cause Mappings

More formally, the set of possible causes is a subset of the operator's preconditions that may not be satisfied. We assume that command output is both accessible and distinguishable from other text. This assumption generally holds in the UNIX operating system in which we test our approach. Further, we protect our software playgrounds from exogenous events.

Error Diagnosis Function Generation

We chose to use decision tree classifiers to learn our error detection functions because they have been shown to work well in a wide variety of applications (e.g., Quinlan 1986; Mitchie 1987; Carter & Catlett 1987).
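Whichever classifier is used, its training examples are the (command output, possible causes) pairs gathered during data collection. As a minimal sketch of that data, using only a few rows of Table 1, the pairs might be held as shown below. The names `TRAINING_PAIRS` and `messages_by_cause` are hypothetical and are not part of ED.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

UNREADABLE = "(protection file-x readable)"
WRONG_DIR = "(current.directory softbot ?dir)"

# (command output, set of possible causes) pairs, copied from rows of Table 1.
TRAINING_PAIRS: List[Tuple[str, FrozenSet[str]]] = [
    ("file-x: Permission denied", frozenset({UNREADABLE})),
    ("dir2: No such file or directory", frozenset({WRONG_DIR})),
    ("dir3 not found", frozenset({WRONG_DIR})),
    ("wc: file4: No such file or directory", frozenset({WRONG_DIR})),
]


def messages_by_cause(
    pairs: List[Tuple[str, FrozenSet[str]]],
) -> Dict[str, List[str]]:
    """Invert the pairs: a single cause can surface as several different messages."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for message, causes in pairs:
        for cause in causes:
            grouped[cause].append(message)
    return dict(grouped)


grouped = messages_by_cause(TRAINING_PAIRS)
for cause, messages in grouped.items():
    print(cause, len(messages))
```

Grouping by cause makes the variation visible: the same negated precondition produces messages with different wording and different command prefixes, which is the variation a learned classifier has to generalize over.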
For the task of software error diagnosis the class of an example is either "no error" or a set of possible causes for the corresponding error message. We decided on an attribute set consisting of error message length (number of tokens) and a binary (present or not present) attribute for every token seen in the collected error messages. We felt that using attributes such as "all possible sub-phrases" may have been effective, but would have increased the number of attributes (above the current 427) to an unmanageable level.

In addition to the decision trees, we felt that a more straightforward approach might do well. For comparison, we devised a simple string match classifier. This approach consisted of building up a table of error message strings (minus command arguments) and their corresponding causes, then classifying a test example by table lookup.

Results

We compare four different approaches as to their ability to diagnose errors as well as their ability to reduce the cost of recovering from the error. The first method is the hand-crafted error detector that our softbot used initially. This method relies heavily on the presence of colons in error messages and makes no attempt to diagnose the cause of the error (we perceived this "colon hack" to work well in early development). The second approach is a simple string match. Our last two methods are the decision trees with and without operator names.

The data used in the experiments was collected from four separate runs of ED. Attempts were made to make the environments for each run (directory structure, files, machine type, etc.) as unique as possible in the hopes of generating a good variety of error messages. Table 2 shows the results of the first experiment, which consisted of training on three of the four executions of each lesioned operator (351 examples), and then testing the learned classifiers with the fourth execution (117 examples). The goal of this test was to discover the accuracy of the four methods when all the operators are fully specified.

Classifier Type              Correct Detection Rate   Correct Diagnosis Rate   Recovery Savings
Current Method               57.0%                    N/A                      0.0%
Simple String Match          89.1%                    71.6%                    50.5%
Decision Tree (with name)    93.0%                    90.4%                    63.6%
Decision Tree (no name)      86.3%                    83.6%                    56.1%

Table 2: Error detection and diagnosis. The "Correct Detection Rate" is the percentage of the test cases that were correctly classified as to the success of the operator execution (error or successful execution). The column labeled "Correct Diagnosis Rate" contains the percentage of the test cases whose errors were correctly diagnosed. "Recovery Savings" numbers represent the average percentage of an operator's preconditions that can be rejected as a cause of the corresponding error message.

The second test of these four methods was to recognize errors due to preconditions that were missing from a particular operator's specification during training (see Table 3). The training data consisted of all the collected data minus the group of four examples corresponding to a particular lesioned operator, and the test set was exactly those four removed examples. The hope is that the test cases will be covered by the data obtained from other operators that contain this precondition.

Classifier Type              Correct Diagnosis Rate
Current Method               N/A
Simple String Match          31.4%
Decision Tree (with name)    24.2%
Decision Tree (no name)      63.3%

Table 3: Error diagnosis rates when the errors are caused by preconditions that did not appear in the original operator. In this case successful diagnosis is dependent on generalization from training on operators other than the one being tested.

Discussion

It was interesting to see that the method used by the current softbot has such a high misclassification rate. This result can be linked to two main causes. First, for many operators, we simply assume that errors will not occur. Second, the assumption that colons will appear in error messages does not hold universally. This highlights the need for systematic experimentation in a variety of environments. The reason that the poor performance of this classifier went more or less unnoticed is presumably the general infrequency of errors and the fact that the softbot is still in development, so errors are difficult to track down.

Another significant cause of classification error is "mistakes", which occur when a command successfully executes, but with the wrong effect. For example, if the softbot believes that it is in directory dir1, but is actually in dir2, and executes ls, the command will succeed but the softbot will misperceive the output.

Very similar in effect to mistakes are commands that never signal errors (even when misused). This results in the better performance of decision trees that utilize the operator name as an attribute (as shown in Table 2). This performance difference is due to the fact that the operator name attribute acts to limit the classification mistakes due to unreported errors instead of allowing them to generalize (i.e., spread) to other test cases.

The poorer results shown in Table 3's "Correct Diagnosis Rate" column compared to Table 2's demonstrate that the second task is relatively more difficult than the first. The main cause is the presence of a good number of preconditions that appear in only a single operator, thus causing unavoidable misclassifications. Another interesting feature of Table 3 is that decision trees without operator names did better than those with them. This is due to a lack of generality in the trees that included operator names, as is the case with the string match method. This result, taken with the result from Table 2, suggests that building a decision tree without operator names, except in the case of operators that do not signal errors, may be a better approach.

Related Work

This work resembles other work in diagnosis (Shortliffe 1974; Davis 1984), but software error diagnosis has some unique features. The masking of errors in this domain is not unlike the masking of symptoms one might find in medical diagnosis, but the decomposable fault assumption makes the handling of masked errors much different. Where we can peel away error after error, this would not be possible in most medical cases. In addition, in the medical domain the masking might not be complete; symptoms can interact, adding a great deal of complexity to the problem of diagnosis.

Dietterich (Dietterich 1984) used UNIX as a testbed for his work in data interpretation. Using constraint propagation within hand-crafted UNIX command models, Dietterich attempts to determine the result of a command execution given the command arguments and the command's output. Error messages are included in the command models, but this is done by hand.

ED is in many ways complementary to "The Operator Refinement Method" presented in (Carbonell & Gil 1990). Where we assume correct operator models while developing error detectors, Carbonell and Gil assume accurate error detection while augmenting operator models. In fact, we can utilize an accurate error diagnosis table (generated by ED) to suggest preconditions to add to an operator. When an error occurs whose reported cause is not currently a precondition, we can add the suspected cause (i.e., unsatisfied precondition) to the operator and conduct further testing to check for appropriateness. This would allow for preconditions of one operator to propagate to others as necessary (to avoid errors).

Future Work

An accurate error detection function can improve the efficiency of a software agent in a number of ways. First, by simply detecting errors accurately we reduce the number of future errors due to inaccuracies in the world model.
Second, by narrowing the field of possible causes of an error we can save a great amount of time in error recovery. Finally, an accurate (error, cause) table will allow yet another set of potentially fruitful experiments: finding the "optimal" set of operator preconditions. By "optimal" we mean the set of preconditions that best balances the cost of executing the operator with the cost of error recovery. For example, it seems reasonable to execute pwd without worrying about whether all the parent directories of your current directory are readable. Simply execute the operator and handle the errors that will occasionally occur.

Additionally, operator failures and the accurate diagnosis of their cause can be viewed as exploration. For example, one way to determine if a file is readable is to simply try reading it. If this action results in an error that is deemed protection-related, we have learned that indeed this file is not readable. The key here is that both the success and failure of operators can provide information that we can utilize. We plan to extend our softbot to incorporate this type of information gathering.

References

Carbonell and Gil 1990. Learning by Experimentation: The Operator Refinement Method. Machine Learning: An Artificial Intelligence Approach vol. III. 1990.

Carter and Catlett 1987. Assessing Credit Card Applications Using Machine Learning. IEEE Expert, Fall.

Chapman 1987. Planning for Conjunctive Goals. Artificial Intelligence 32(3). North-Holland Publishing.

Davis 1984. Diagnostic Reasoning Based on Structure and Behavior. Artificial Intelligence 24. North-Holland Publishing.

Dietterich 1984. Constraint Propagation Techniques for Theory-Driven Data Interpretation. Doctoral dissertation, Stanford University, Dec. 1984. Technical Report 84-46, Stanford Heuristic Programming Project, Stanford, CA.

Etzioni, Lesh and Segal 1993. Building Softbots for UNIX (preliminary report). Technical Report 93-09-01, University of Washington, 1993. Available via anonymous FTP from ftp/pub/ai/etzioni/softbots at cs.washington.edu.
Golden, Etzioni, and Weld 1994. Omnipotence without Omniscience. AAAI 1994.

McCarthy 1980. Circumscription--A Form of Non-Monotonic Reasoning. Artificial Intelligence 13(1). North-Holland Publishing.

Mitchie 1987. Current Developments in Expert Systems. In Quinlan (Ed.), Applications of Expert Systems, Maidenhead: Addison Wesley Publishers.

Quinlan 1986. Induction of Decision Trees. Machine Learning 1(1). Kluwer Academic Publishers.

Shortliffe 1974. MYCIN: A Rule-Based Computer Program To Advise Physicians Regarding Antimicrobial Therapy Selection. Doctoral dissertation, Stanford University, Oct. 1974. Technical Report AIM-251, Stanford Artificial Intelligence Laboratory, Stanford, CA.