From: AAAI Technical Report WS-99-09. Compilation copyright © 1999, AAAI (www.aaai.org). All rights reserved.

Intelligent Testing can be Very Lazy

Tim Menzies (1), Bojan Cukic (2)
(1) NASA/WVU Software Research Lab, 100 University Drive, Fairmont, WV USA
(2) Department of Computing Science and Electrical Engineering, West Virginia University, Morgantown, WV USA
<tim@menzies.com, cukic@csee.wvu.edu>

Abstract

Testing is a search process and a test suite is complete when the search has examined all the corners of the program. Standard models of test suite sizes are gross over-estimates since they are unaware of the nature of that search space. For example, only a small part of the possible search space is ever exercised in practice. Further, a repeated result is that a few random searches often yield as much information as more thorough search strategies. Hence, only a few tests are needed to sample the range of behaviours of a program.

  Text                                      # of tests
  (Harmon & King 1983)                      4..5
  (Buchanan et al. 1983)                    ~6
  (Bobrow, Mittal, & Stefik 1986)           5..10
  (Davies 1994)                             8..10
  (Yu et al. 1979)                          10
  (Caraca-Valente et al. 2000)              < 13
  (Menzies 1998)                            40
  (Ramsey & Basili 1989)                    50
  (Betta, D'Apuzzo, & Pietrosanto 1995)     200

  Table 1: Number of tests proposed by different authors. Extended from a survey by (Caraca-Valente et al. 2000).

Introduction

Due to recent cutbacks, ISE Ltd. has to fire one of its test engineers. Who would you pick? Lazy Susan never tests a program rigorously. Instead, she just has a quick glance at some of the possible consequences of known inputs. Lazy Susan only ever proposes small test suites based just on small variants to the inputs. Lazy Susan never works overtime and her tests are always finished on the due date. On the other hand, Eager Earnest is more thorough and reflects on how all the different KB pathways can interact and interfere with each other. Earnest often proposes larger test suites based on the inputs, and all the interaction points downstream of the inputs.
Earnest always works overtime and never finishes his testing by the due date.

Who gets sacked? Earnest thinks it should be Susan, saying that even a cursory study of the mathematics shows that testing is fundamentally a slow process. For example, a simple attribute model would declare that a system containing N variables with S assignments (on average) requires $S^N$ tests. Further, a widely-used statistical model declares that over 4,500 tests are required to find moderately infrequent bugs; see Figure 1.

Not surprisingly, Susan disagrees with Earnest and thinks he should be the one to go. She defends her test strategies, noting that most of the expert systems literature proposes evaluations based on very few tests [1]; see Table 1. For example, Menzies needed a mere 40 tests to check an expert system that controlled a complex chemical plant (125 kilometers of highly inter-connected piping) (Menzies 1998). In that design, the expert system and the human operators took turns to run the plant. At the end of a statistically significant number of trials, the mean performances were compared using a t-test [2]. Further, she rejects Earnest's $S^N$ model as obviously ridiculous as follows: In a sample of fielded expert systems, knowledge bases contained between 55 and 510 literals (Preece & Shinghal 1992). Literals offer two assignments for each proposition: true or false (i.e. S = 2 and N is half the number of literals). Assuming (i) it takes one minute to consider each test result (which is a gross under-estimate) and (ii) that the effective working year is 225 six hour days, then a test of those sampled systems would take between 29 years and $10^{70}$ years: a time longer than the age of this universe.

[1] Exception: (Bahill, Bharathan, & Curlee 1995) propose at least one test for every five rules and add that "having more test cases than rules would be best".

[2] Let m and n be the number of trials of the human experts and the expert system respectively. Each trial generates a performance score (time till unusual operations): $X_1 \ldots X_m$ with mean $\mu_x$ for the humans; and performance scores $Y_1 \ldots Y_n$ with mean $\mu_y$ for the expert system. We need to find a Z value as follows: $Z = \frac{\mu_x - \mu_y}{\sqrt{S_x^2/m + S_y^2/n}}$ where $S_x^2 = \frac{\sum_i (x_i - \mu_x)^2}{m-1}$ and $S_y^2 = \frac{\sum_i (y_i - \mu_y)^2}{n-1}$. Let a be the degrees of freedom. If n = m = 20, then a = n + m - 2 = 38. We reject the hypothesis that the expert system is worse than the human (i.e. $\mu_y < \mu_x$) with 95% confidence if Z is less than $-t_{38,0.95} = -1.645$.

[Figure 1 plot: chance of finding an error (y-axis, 0 to 1) against # of tests (x-axis, log scale, 1 to 1e+08), one curve per failure rate: 1 failure in 100; 1,000; 10,000; 100,000; 1,000,000. The 4603-test point is marked.]

Figure 1: Chance of finding an error = $1 - (1 - \mathit{failure\ rate})^{\mathit{tests}}$. Theoretically, 4603 tests are required to achieve a 99% chance of detecting moderately infrequent bugs; i.e. those which occur at a frequency of 1 in a thousand cases (Hamlet & Taylor 1990).

The research division of ISE Inc. has been hired to resolve this dispute. Their core insight is that testing samples a search space. Some parts of the space contain desired goals and some parts contain undesirable behaviour. Regardless of how users interpret different parts of the space, the testing-as-search process is the same: a test suite is complete when it has looked in all the corners of that search space. This notion of testing=search covers numerous testing schemes:

• Oracle testing uses some external source of information to specify distributions for the inputs and outputs.

• Model-checking for temporal properties (Clarke, Emerson, & Sistla 1986) can be reduced to search as follows. First, the model is expressed as the dependency graph between literals. Secondly, the constraints are grounded and negated. The generation of counter-examples by model checkers then becomes a search for pathways from inputs to undesirable outputs.
• A test suite that satisfies the "all-uses" data-flow coverage criteria can generate proof trees for all portions of the search space that connect some literal from where it is assigned to wherever it is used (Frankl & Weiss 1993).

• Partition testing (Hamlet & Taylor 1990) is the generation of inputs via a reflection over the search space looking for key literals that fork the program's conclusions into N different partitions.

• The multiple-worlds reasoners (described below) divide the accessible areas of a search space into their consistent subsets.

Based on this insight of testing=search, ISE Research made the following conclusion. Table 1 proposes far fewer tests than Figure 1 since Figure 1 has limited insight into the devices being tested. Figure 1's estimates are grossly over-inflated since it is unaware of certain regularities within the software it is testing. ISE Research bases this conclusion on a literature review which shows:

• While a static analysis of a program suggests $S^N$ states, in practice only a small percentage of those are exercised by the system's inputs.

• Further, a review of inference engines shows that a few random stabs into the search space often yield as much information as more thorough search strategies.

The literature review is presented below. Note that this article is framed in terms of knowledge engineering. However, there is nothing stopping these results from being applied to software engineering (Menzies & Cukic 1999).

Studies with Data Sets

This section discusses two studies with data sets showing that the attribute space size of the data presented to an expert system is far less than the potential upper bound of the attribute space size of the KB ($S^N$). This is consistent with the used portions of search spaces being non-complex. Test suites grown beyond some small size would yield no new information since the smaller test suite would have already explored the interesting parts of the KB's search space.
Avritzer et al. studied the inputs given to an expert system. Each row of a state matrix stored one unique case presented to the expert system (and each literal in the inputs generated one column in the matrix). After examining 355 days of input data, they could only find 857 different inputs. There was massive overlap between those input sets. On average, the overlap between two randomly selected inputs was 52.9%. Further, a simple algorithm found that 26 carefully selected inputs covered 99% of the other inputs while 53 carefully selected inputs covered 99.9% of the other inputs (Avritzer, Ros, & Weyuker 1996). If most naturally occurring test suites have such an overlap, then only a small number of tests will be required.

  isamp(theory)
    for i := 1 to MAX-TRIES {
      set all variables to unassigned;
      loop {
        if all variables are valued
          return(current assignment);
        v := random unvalued variable;
        assign v a randomly chosen value;
        unit_propagate();
        if contradiction exit loop;
      }
    }
    return failure

  Figure 2: The ISAMP randomised-search theorem prover (Crawford & Baker 1994).

Avritzer et al. looked at the inputs to a system. Colomb compared the inputs presented to an expert system with its internal structure. Recalling the introduction, the internals of a KB can be very large: S states per N variables implies $S^N$ combinations. Each such combination could be represented as one row in a state matrix. Colomb argues that the estimate of $S^N$ is a gross over-inflation of effective KB size. He argues that KBs record the very small regions of experience of human experts. For example, one medical expert system studied by Colomb had $S^N = 10^{14}$. However, after one year's operation, the inputs to that expert system could be represented in a state matrix with only 4000 rows. That is, the region of experience exercised after one year within the expert system was only a tiny fraction of $S^N$ ($4000 \ll 10^{14}$) (Colomb 1999).
If most systems have this feature, then only a few tests will be required to exercise a KB's region of expertise.

Studies with Inference Engines

This section reviews two studies suggesting the extra effort of the thorough engineer would be wasted since the larger thorough test suite would yield little more information than the smaller lazy test suite. Recall the behaviour of our two test engineers:

• Lazy Susan explored a couple of randomly chosen paths while Earnest considered all inputs and their downstream interactions.

• Earnest is acting like an ATMS device (DeKleer 1986). The ATMS takes the justification for every conclusion and weaves it into an assumption network. At any time, the ATMS can report the key assumptions that drive the KB into very different conclusions. Such a search, it was thought, was a useful way to explore all the competing options within a search space.

• Lazy Susan is acting like a locally-guided best-first search across the KB. When a contradiction is detected, Susan (or the search engine) could look at the contradiction for hints on how to resolve that problem. Susan (or the search engine) could then instantiate those hints and search on.

Note that, as a result, the earnest ATMS would find many options and the lazy searcher would only ever find one. Williams and Nayak compared the results of their locally-guided search to the ATMS and found that this locally-guided contradiction resolution mechanism was comparable to the very best ATMS implementations. That is, the information gained from an exploration of all options was nearly equivalent to the information gained from the exploration of a single option (Williams & Nayak 1996).

In other work, Crawford and Baker compared TABLEAU, a depth-first search backtracking algorithm, to ISAMP, a randomised-search theorem prover (see Figure 2).
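The pseudocode of Figure 2 can be made concrete in a few lines of Python. This is a minimal sketch under stated assumptions, not Crawford and Baker's implementation: the theory is assumed to be in clausal form (each clause a list of signed integers, a DIMACS-like convention chosen here for convenience), unit propagation is taken to mean "when every literal of a clause but one is false, the remaining literal is forced", and the names `isamp`, `unit_propagate` and the `rng` parameter are hypothetical.

```python
import random


def unit_propagate(clauses, assignment):
    """Apply the unit rule until nothing changes.

    Mutates `assignment` (variable -> bool) in place.
    Returns False if some clause has every literal false (a contradiction).
    """
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return False  # every literal is false
            if len(unassigned) == 1:
                forced = unassigned[0]
                assignment[abs(forced)] = forced > 0
                changed = True
    return True


def isamp(clauses, variables, max_tries, rng=None):
    """Randomised search in the style of Figure 2.

    Returns a complete assignment, or None after `max_tries` failed tries.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        assignment = {}  # set all variables to unassigned
        while True:
            unvalued = [v for v in variables if v not in assignment]
            if not unvalued:
                return assignment
            v = rng.choice(unvalued)
            assignment[v] = rng.choice([True, False])
            if not unit_propagate(clauses, assignment):
                break  # contradiction: abandon this try and start again
    return None


if __name__ == "__main__":
    # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    theory = [[1, 2], [-1, 3], [-2, -3]]
    print(isamp(theory, [1, 2, 3], max_tries=50, rng=random.Random(0)))
```

Note that the sketch never backtracks: a contradiction throws away the whole try, which is exactly what makes the procedure "lazy" compared with a systematic search.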
ISAMP randomly assigns a value to one variable, then infers some consequences using unit propagation. Unit propagation is not a very thorough inference procedure: it only infers those conclusions which can be found using a special linear-time case of resolution; i.e.

$(x) \wedge (\neg x \vee y_1 \vee \ldots \vee y_n) \vdash (y_1 \vee \ldots \vee y_n)$
$(\neg x) \wedge (x \vee y_1 \vee \ldots \vee y_n) \vdash (y_1 \vee \ldots \vee y_n)$

After unit propagation, if a contradiction was detected, ISAMP re-assigns all the variables and tries again (giving up after MAX-TRIES number of times). Otherwise, ISAMP continues looping till all variables are assigned. Lazy Susan approves of ISAMP while Earnest likes TABLEAU since it explores more options more systematically.

Table 2 shows the relative performance of the two algorithms on a suite of scheduling problems based on real-world parameters. Surprisingly, ISAMP took less time than TABLEAU to reach more scheduling solutions using, usually, just a small number of TRIES. That is, a couple of random explorations of the easy parts of a search space yielded the best results. Crawford and Baker offer a speculation why ISAMP was so successful: their systems contained mostly "dependent" variables which are set by a small number of "control" variables (Crawford & Baker 1994). If most systems have this feature, then large test suites are not required since a few key tests are sufficient to set the control variables.

            TABLEAU: full search      ISAMP: partial, random search
  Problem   % Success   Time (sec)    % Success   Time (sec)   Tries
  A         90          255.4         100         10           7
  B         100         104.8         100         13           15
  C         70          79.2          100         11           13
  D         100         90.6          100         21           45
  E         80          66.3          100         19           52
  F         100         81.7          100         68           252

  Table 2: Average performance of TABLEAU vs ISAMP on 6 scheduling problems (A..F) with different levels of constraints and bottlenecks. From (Crawford & Baker 1994).

Experiments with HTx

The results in the previous section support the argument that search spaces are less complex than Earnest thinks. Hence, the test suites required to sample that search space may be very small.
However, the sample size of the above studies is very small. This section addresses the sample size issue. Based on the HTx abductive model of testing (Menzies 1995; Menzies & Compton 1997), a suite of mutators can generate any number of sample testing problems [3]. Three HTx algorithms are HT4 (Menzies 1995), HT4-dumb (Menzies & Waugh 1998), and HT0 (Menzies & Michael 1999).

[3] Informally, abduction is inference to the best explanation. More precisely, abduction makes whatever assumptions A are required to reach output goals Out across a theory T, $T \cup A \vdash Out$, without causing contradiction, $T \cup A \nvdash \bot$. Each consistent set of assumptions represents an explanation. If more than one explanation is found, some assessment criteria is applied to select the preferred explanation(s).

HTx algorithms were designed as automatic hypothesis testers for under-specified theories and are a generalization and optimization of QMOD, a validation tool for neuroendocrinological theories (Feldman, Compton, & Smythe 1989). HTx and QMOD assume that the definitive test for a theory is that it can reproduce (or cover) known behaviour of the entity being modeled. Theory T1 is a better theory than theory T2 iff

cover(T1) >> cover(T2)

Below, we will show experiments where HTx is run millions of times over tens of thousands of models. The results endorse the widespread existence of simple search spaces. A small number of random searches (by HT0) covered nearly as much output as extensive searches by HT4. Further, the sample size is much larger than that offered by Williams, Nayak, Crawford, and Baker.

An Informal Model of Testing

This section gives a quick overview of the intuitions of HTx. The next section details these intuitions.

At runtime, an inference engine explores the search space of a program. A given test applies some inputs (In) to the program to reach some outputs (Out). The inference engine may be indeterminate or chosen randomly (e.g. by ISAMP) and so the pathway taken from inputs to outputs may vary. Hence, we call the intermediaries assumptions (A). That is:

$\langle Out, A \rangle = f(In, KB)$

where f is the inference engine and KB are the literals {a..z} which we can divide as follows:

  In              A               Out
  a, b, c, ...    ..., l, m, n, ...    ..., x, y, z

That is, from the inputs a, b, c, ... we can reach the outputs ..., x, y, z via the assumptions ..., l, m, n, .... Some of a..z are positive goals that we are trying to achieve while some are negative goals reflecting situations we are trying to avoid. In the general case, only a subset of Out will be reachable using some subset of the In, A since:

• The search space may not approve of connections between all of In and Out.

• Some of the assumptions may be contradictory. Multiple worlds of beliefs may be generated when contradictory assumptions are sorted into their maximal consistent subsets.

• In the case of randomized search, only parts of the search space may be explored. For example, we could run ISAMP for a finite number of TRIES and return the TRY with the maximum number of assignments to Out.

Our testing intuition is that a test case beams a searchlight from In across A to Out. Sometimes, the light reveals something interesting. We can stop testing when our searchlight stops finding anything new. What will be shown below is that a few quick flashes show as much as lots of poking around in corners with the flashlight.

HTx: A More Formal Model of Testing

This section describes the details of HTx. The next section describes an example.

We represent a theory as a directed-cyclic graph T = G containing vertices $\{V_i, V_j, \ldots\}$ with roots roots(G) and leaves leaves(G). A vertex is one of two types:

• Ands can be believed if all their parents are believed.

• Ors can be believed if any of their parents are believed or it has been labeled an In vertex (see below).

Each or-node contradicts 0 or more other or-nodes, denoted $no(V_i) = \{V_j, V_k, \ldots\}$. The average size of the no sets is called constraints(G). For propositional systems, constraints(G) = 1; e.g. $no(a) = \{\neg a\}$. In and Out are sets of vertices from ors(G). A proof $P_x \subseteq G$ is a tree containing the vertices $uses(P_x) = \{V_i, V_j, \ldots\}$. The proof tree has:

• Exactly one leaf which is an output; i.e. $|leaves(uses(P_x))| = 1 \wedge V_i \in leaves(uses(P_x)) \wedge V_i \in Out$

• 1 or more roots; i.e. $roots(uses(P_x)) \subseteq In$

No two proof vertices can contradict each other; i.e. $\langle V_i, V_j \rangle \in uses(P_x) \wedge V_j \notin no(V_i)$.

A proof's assumptions are the vertices that are not inputs or outputs; i.e.

$assumes(P_x) = uses(P_x) - roots(uses(P_x)) - leaves(uses(P_x))$

A test suite is N pairs $\{\langle In_1, Out_1 \rangle, \ldots, \langle In_N, Out_N \rangle\}$. We assume that each $In_i$, $Out_i$ contains only or-vertices. Each output $Out_i$ can generate 0 or more proofs $\{P_x, P_y, \ldots\}$. A world is a maximal consistent subset of the proofs (maximal w.r.t. size and consistent w.r.t. the no sets) denoted $proofs(W_i) = \{P_x, P_y, \ldots\}$. Worlds to proofs is many-to-many. The cover of a world is how many outputs it contains; i.e.

$cover(W_i) = \frac{|\{V_j \mid P_x \in proofs(W_i) \wedge V_j \in leaves(P_x)\}|}{|Out|}$

We make no other comment on the nature of $Out_i$. It may be some undesirable state or some desired goal. In either case, the aim of our testing is to find the worlds that cover the largest percentage of Out.

[Figure 3 diagram: three panels. Left: a hypothetical economics theory over the variables foreign sales, domestic sales, public confidence, company profits, inflation, wages restraint, corporate spending, trade deficit, account balance and investor confidence. Middle: world W1. Right: world W2.]

Figure 3: Two worlds for outputs (ellipses) from inputs (squares). Assumptions are world vertices that are not inputs or outputs. Note that contradictory assumptions are managed in separate worlds.
  P1  foreignSales=up, companyProfits=up†, corporateSpending=up†, investorConfidence=up
  P2  domesticSales=down, companyProfits=down‡, corporateSpending=down‡, wageRestraint=up
  P3  domesticSales=down, companyProfits=down‡, inflation=down
  P4  domesticSales=down, companyProfits=down‡, inflation=down, wageRestraint=up
  P5  foreignSales=up, publicConfidence=up, inflation=down
  P6  foreignSales=up, publicConfidence=up, inflation=down, wageRestraint=up

  Table 3: Proofs connecting inputs to outputs. X† and X‡ denote the assumptions and the controversial assumptions respectively.

An Example

Figure 3 (left) shows a hypothetical economics theory written in the QCM language (Menzies & Compton 1997). All theory variables have three mutually exclusive states: up, down or steady; i.e. $no(V_a{=}up) = \{V_a{=}down, V_a{=}steady\}$ and constraints(G) = 2. These values model the sign of the first derivative of these variables (i.e. the rate of change in each value). In QCM, $x \xrightarrow{++} y$ denotes that y being up or down could be explained by x being up or down respectively. Also, $x \xrightarrow{--} y$ denotes that y being up or down could be explained by x being down or up respectively. Some links are conditional on other factors; for example, if transport then education → literacy. To process such a link, QCM adds a conjunction between education and literacy. The preconditions of that connection are now education and transport. QCM also adds conjunctions to offer explanations of steadies: the conjunction of competing upstream influences can explain a steady.

Consider the case where the inputs In are (foreignSales=up, domesticSales=down) and the goals Out are (investorConfidence=up, inflation=down, wageRestraint=up). There are six proofs that connect In to Out (see Table 3). These proofs comprise only or-nodes. These proofs contain assumptions (variable assignments not found in $In \cup Out$). Some of these proofs make contradictory assumptions A; e.g. corporateSpending=up in P1 and corporateSpending=down in P2. That is, we cannot believe in P1 and P2 at the same time.
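The sorting of proofs into worlds can be done mechanically. The sketch below is a brute-force illustration using the six proofs of Table 3, not the HT4 implementation: it assumes that two proofs contradict when they give the same variable different values (which matches the no sets of QCM), enumerates every subset of proofs, and keeps the maximal consistent ones. The names `PROOFS`, `OUT`, `consistent`, `worlds` and `cover` are hypothetical.

```python
from itertools import combinations

# The six proofs of Table 3, as variable -> value maps.
PROOFS = {
    "P1": {"foreignSales": "up", "companyProfits": "up",
           "corporateSpending": "up", "investorConfidence": "up"},
    "P2": {"domesticSales": "down", "companyProfits": "down",
           "corporateSpending": "down", "wageRestraint": "up"},
    "P3": {"domesticSales": "down", "companyProfits": "down",
           "inflation": "down"},
    "P4": {"domesticSales": "down", "companyProfits": "down",
           "inflation": "down", "wageRestraint": "up"},
    "P5": {"foreignSales": "up", "publicConfidence": "up",
           "inflation": "down"},
    "P6": {"foreignSales": "up", "publicConfidence": "up",
           "inflation": "down", "wageRestraint": "up"},
}

# The goals Out of the example.
OUT = {("investorConfidence", "up"), ("inflation", "down"),
       ("wageRestraint", "up")}


def consistent(names, proofs):
    """True if no variable receives two different values across the proofs."""
    seen = {}
    for name in names:
        for var, val in proofs[name].items():
            if seen.setdefault(var, val) != val:
                return False
    return True


def worlds(proofs):
    """All maximal consistent subsets of the proofs (brute force)."""
    names = sorted(proofs)
    candidates = [frozenset(c)
                  for r in range(1, len(names) + 1)
                  for c in combinations(names, r)
                  if consistent(c, proofs)]
    # Keep only subsets that are not strictly contained in another candidate.
    return [w for w in candidates if not any(w < other for other in candidates)]


def cover(world, proofs, out):
    """Fraction of the goals in `out` reached by the proofs of `world`."""
    reached = {item for name in world for item in proofs[name].items()} & out
    return len(reached) / len(out)


if __name__ == "__main__":
    for w in worlds(PROOFS):
        print(sorted(w), round(cover(w, PROOFS, OUT), 2))
```

Run on Table 3, the sketch finds two worlds, one covering all three goals and one covering two of them. Because P5 contradicts nothing in P2..P4, the strictly maximal second world in this sketch also contains P5; its cover is unaffected. Brute-force enumeration is exponential in the number of proofs, so this only suits toy examples.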
If we sort these proofs into the subsets which we can believe at one time, we get worlds W1 (Figure 3, middle) and W2 (Figure 3, right). W1 is a maximal consistent subset of pathways that can be believed at the same time; i.e. {P1, P5, P6}. W2 is another maximal consistent subset: {P2, P3, P4, P6}. The cover of W1 is 100% while the cover of W2 is 67%.

Recall that HTx scores a theory by the maximum cover of the worlds it generates. Hence, our economics theory gets full marks: 100%. HT4 (and QMOD before it) found errors in a published theory of neuroendocrinology (Smythe 1989) when its maximum cover was found to be 42%. That is, after making every assumption possible to explain as many of the goals as possible, only 42% of certain published observations of human glucose regulation could be explained. Interestingly, these faults were found using the data published to support those theories. This suggests that HT4-style abductive validation is practical for assessing existing scientific publications. HTx is practical for many other real-world theories. For example, one implementation of HT4 was fast enough to process at least one published sample of real-world expert systems (Menzies 1996).

HT4, HT4-dumb, HT0

The difference between the HTx algorithms is how they searched for their worlds:

• HT4 found all proofs for all outputs, then generated all the worlds from those proofs. Next, the world(s) with largest cover was then returned.

• HT4-dumb was a crippled version of HT4 that returned any world at all, chosen at random. It was meant to be a straw-man system but its results were so promising (see below) that it led to the development of HT0.

• HT0 is a randomised search engine that, MAX-TRIES times, finds 1 proof for each member of Out. Out is explored in a random order and when the proof is being generated, if more than one option is found, one is picked at random. If the proof for $Out_j$ is consistent with the proof found previously for $Out_i$ (i < j), it is kept. Otherwise, HT0 moves on to $Out_k$ (j < k) and declares $Out_j$ unsolvable. After each TRY, HT0 compares the "best" world found in previous TRIES to the world found in this TRY and discards the one with the lowest cover.

Figure 4: An expansion of the left-hand-side QCM theory into IEDGE and XNODE. Note that for the XNODE expansion, the dx variable was A.

  1. Added influences between variables.
  2. Corrupted the influence between two variables; e.g. proportionality ++ was flipped to inverse proportionality -- or vice versa.
  3. Experimented with different meanings of time within the system. In the explicit node (XNODE) interpretation of time, all dx variables implied a link ($x$ at $Time_i$) → ($x$ at $Time_{i+1}$). In the implicit edge (IEDGE) interpretation of time, all edges $x \xrightarrow{\alpha} y$ also implied a link ($x$ at $Time_i$) $\xrightarrow{\alpha}$ ($y$ at $Time_{i+1}$), where $\alpha \in \{++, --\}$. XNODE and IEDGE are contrasted in Figure 4. Why these two interpretations of time? These were randomly selected from a much larger set of time interpretations studied by (Waugh, Menzies, & Goss 1997).
  4. Forced the algorithm to generate different numbers of worlds. Experiments showed that the maximum number of worlds were generated when between a fifth and three-fifths of the variables in the theory were unmeasured; i.e. U=20..60 where $U = 100 - \frac{(|In| + |Out|) \cdot 100}{|V|}$ and $|V|$ is the number of variables in the theory. The mutators built In and Out sets with different U settings. Values for In and Out were collected using a mathematical simulation of the fisheries.

  Table 4: Problem mutators used in the HT4 vs HT4-dumb study.

Susan approves of HT0 since, like her, it uses a lazy method to study a program. Earnest likes HT4 since it explores more of the program.
For example, says Earnest, if Lazy Susan had stumbled randomly on W2 of Figure 3, and did not look any further to find W1, then she would have inappropriately declared that the economics model could only explain 67% of the outputs. Susan replies that, on average, the cost of such extra searching is not justified by its benefits. The following studies comparing HTx algorithms support Susan's case.

HT4 vs HT4-dumb

Menzies and Waugh compared HT4 and HT4-dumb using tens of thousands of theories (Menzies & Waugh 1998). Starting with a seed theory (fish growing in a fishery, see Figure 5), automatic mutators were built to generate a wide range of problems (see Table 4). When these problems were run with HT4 and HT4-dumb, the maximum difference in the cover was 5.6%; i.e. very similar, see Figure 6. That is, in a large number (1,512,000) of world generation experiments, (1) many different searches contain the same goals, (2) there was little observed utility in using more than one world.

[Figure 5 diagram: a QCM graph with edges labelled ++ or -- over the variables fish growth rate, change in fish population, fish density, fish catch, catch proceeds, net income, investment fraction, boat purchases, change in boat numbers, boat maintenance, boat decommissions and catch potential.]

Figure 5: The QCM fisheries model contains two dx variables: change in fish population and change in boat numbers.

[Figure 6 plots: two panels, "XNODE, U% unmeasured" and "IEDGE, U% unmeasured". X-axis: number of corrupted edges (max = 17). Y-axis: percentage cover. Series: "many" and "one" for U = 0, 20, 40, 60.]

Figure 6: Comparing coverage of Out seen in 1,512,000 runs of the HT4 (solid line) or HT4-dumb (dashed line) algorithms with different percentages of unmeasured U variables. X-axis refers to how many edges of Figure 5 were corrupted (++, -- flipped to --, ++ respectively).

HT4 vs HT0

HT4-dumb was implemented as a back-end to HT4. HT4 would propose lots of worlds and HT4-dumb would pick one at random. That is, HT4-dumb was at least as slow as HT4. However, it worked so astonishingly well that HT0 was developed.

Real world and artificially generated theories were used to test HT0. A real-world theory of neuroendocrinology with 558 clauses containing 91 variables with 3 values each (273 literals) was copied X times. Next, Y% of the variables in one copy were connected at random to variables in other copies. In this way, theories between 30 and 20000 Prolog clauses were built using Y=40, average sub-goals/clause ≈ 1.7, clauses/literals = 1..5.5. When executed with MAX-TRIES=50 the $O(N^2)$ curve of Figure 7 was generated. HT4 did not terminate for theories with more than 1000 clauses. Hence, we can only compare the cover of HT0 to HT4 for part of this experiment. In the comparable region, HT0 found 98% of the outputs found by HT4.

[Figure 7 plot: runtime (y-axis, log scale) against X = number of clauses (0 to 18000). Series: HT4, HT0, and the fitted curve y = O(X^2).]

Figure 7: Runtimes HT4 (top left) vs HT0 (bottom). HT0 was observed to be $O(N^2)$ ($R^2 = 0.98$).

In other experiments, the maximum cover found for different values of MAX-TRIES was explored. Surprisingly, the maximum cover found for MAX-TRIES=50 was found with MAX-TRIES as low as 6.

In summary, a small number of quick random searches (HT0) found nearly as much of the interesting portions of the KB as a careful, slower, larger number of searches (HT4).

Conclusion

On average, exploring a theory is not as complex as one might think. Testing is a search process and a test suite is complete when the search has examined all the corners of the program. Often a few lazy explorations of a search space yield as much information as more thorough searches. Philosophically, this must be true. Our impressions are that human resources are limited and most explorations of theories are cost-bounded.
Hence, many theories are not explored rigorously. Yet, to a useful degree, this limited reasoning about our ideas works. This can only be true if our theories yield their key insights early to our limited tests. Otherwise, we doubt that the human intellectual process could have achieved so much.

Earnest strongly disagrees with the last paragraph, saying that such an argument represents the height of human arrogance; i.e. once again humanity is claiming to have some special authority over the random universe around us. Lazy Susan does not have the heart to disagree. She'd just been promoted while Earnest had been sacked. Her view is that this arrogance seems justified, at least based on the above evidence.

She had lunch with Earnest on his last day. As he cried into his soup, he warned that the above analysis is overly optimistic. The results of this article refer to the average case behaviour of a test rig. In safety-critical situations, such an average case analysis is inappropriate due to the disastrous implications of the non-average case. Secondly, these results only refer to the number of tests that can detect anomalous behaviour. Once an error is found, it must be localized and removed. Strategies for fault localization and repair are explored elsewhere in the literature (Shapiro 1983; Hamscher, Console, & DeKleer 1992).

Susan paid for that last lunch using her pay rise: it seemed the decent thing to do.

Acknowledgements

This work was partially supported by NASA through cooperative agreement #NCC2-979.

References

Avritzer, A.; Ros, J.; and Weyuker, E. 1996. Reliability of rule-based systems. IEEE Software 76-82.

Bahill, A.; Bharathan, K.; and Curlee, R. 1995. How the testing techniques for a decision support systems changed over nine years. IEEE Transactions on Systems, Man, and Cybernetics 25(12):1535-1542.

Betta, G.; D'Apuzzo, M.; and Pietrosanto, A. 1995. A knowledge-based approach to instrument fault detection and isolation. IEEE Transactions of Instrumentation and Measurement 44(6):1109-1016.

Bobrow, D.; Mittal, S.; and Stefik, M. 1986. Expert systems: Perils and promise. Communications of the ACM 29:880-894.

Buchanan, B.; Barstow, D.; Bechtel, R.; Bennet, J.; Clancey, W.; Kulikowski, C.; Mitchell, T.; and Waterman, D. 1983. Building Expert Systems, F. Hayes-Roth and D. Waterman and D. Lenat (eds). Addison-Wesley. Chapter Constructing an Expert System, 127-168.

Caraca-Valente, J.; Gonzalez, L.; Morant, J.; and Pozas, J. 2000. Knowledge-based systems validation: When to stop running test cases. International Journal of Human-Computer Studies. To appear.

Clarke, E.; Emerson, E.; and Sistla, A. 1986. Automatic verification of finite-state concurrent systems using temporal logic specifications. ACM Transactions on Programming Languages and Systems 8(2):244-263.

Colomb, R. 1999. Representation of propositional expert systems as partial functions. Artificial Intelligence (to appear). Available from http://www.csee.uq.edu.au/~colomb/PartialFunctions.html.

Crawford, J., and Baker, A. 1994. Experimental results on the application of satisfiability algorithms to scheduling problems. In AAAI '94.

Davies, P. 1994. Planning and expert systems. In ECAI '94.

DeKleer, J. 1986. An Assumption-Based TMS. Artificial Intelligence 28:163-196.

Feldman, B.; Compton, P.; and Smythe, G. 1989. Hypothesis Testing: an Appropriate Task for Knowledge-Based Systems. In 4th AAAI-Sponsored Knowledge Acquisition for Knowledge-based Systems Workshop, Banff, Canada.

Frankl, P., and Weiss, S. 1993. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Transactions on Software Engineering 19(8):774-787.

Hamlet, D., and Taylor, R. 1990. Partition testing does not inspire confidence. IEEE Transactions on Software Engineering 16(12):1402-1411.

Hamscher, W.; Console, L.; and DeKleer, J. 1992. Readings in Model-Based Diagnosis. Morgan Kaufmann.

Harmon, D., and King, D. 1983. Expert Systems: Artificial Intelligence in Business. John Wiley & Sons.

Menzies, T., and Compton, P. 1997. Applications of abduction: Hypothesis testing of neuroendocrinological qualitative compartmental models. Artificial Intelligence in Medicine 10:145-175. Available from http://www.cse.unsw.edu.au/~timm/pub/docs/96aim.ps.gz.

Menzies, T., and Cukic, B. 1999. When you don't need to re-test the system. Submitted to ISRE-99. In preparation.

Menzies, T., and Michael, C. 1999. Fewer slices of pie: Optimising mutation testing via abduction. In SEKE '99, June 17-19, Kaiserslautern, Germany. Available from http://research.ivv.nasa.gov/docs/techreports/1999/NASA-IVV-99-007.pdf.

Menzies, T., and Waugh, S. 1998. On the practicality of viewpoint-based requirements engineering. In Proceedings, Pacific Rim Conference on Artificial Intelligence, Singapore. Springer-Verlag.

Menzies, T. 1995. Principles for Generalised Testing of Knowledge Bases. Ph.D. Dissertation, University of New South Wales. Available from http://www.cse.unsw.edu.au/~timm/pub/docs/95thesis.ps.gz.

Menzies, T. 1996. On the practicality of abductive validation. In ECAI '96. Available from http://www.cse.unsw.edu.au/~timm/pub/docs/96abvalid.ps.gz.

Menzies, T. 1998. Evaluation issues with critical success metrics. In Banff KA '98 workshop. Available from http://www.cse.unsw.EDU.AU/~timm/pub/docs/97evalcsm.

Preece, A., and Shinghal, R. 1992. Verifying knowledge bases by anomaly detection: An experience report. In ECAI '92.

Ramsey, C. L., and Basili, V. 1989. An evaluation for expert systems for software engineering management. IEEE Transactions on Software Engineering 15:747-759.

Shapiro, E. Y. 1983. Algorithmic program debugging. Cambridge, Massachusetts: MIT Press.

Smythe, G. 1989. Brain-hypothalmus, Pituitary and the Endocrine Pancreas. The Endocrine Pancreas.

Waugh, S.; Menzies, T.; and Goss, S. 1997. Evaluating a qualitative reasoner. In Sattar, A., ed., Advanced Topics in Artificial Intelligence: 10th Australian Joint Conference on AI. Springer-Verlag.

Williams, B., and Nayak, P. 1996. A model-based approach to reactive self-configuring systems. In Proceedings, AAAI '96, 971-978.

Yu, V.; Fagan, L.; Wraith, S.; Clancey, W.; Scott, A.; Hanigan, J.; Blum, R.; Buchanan, B.; and Cohen, S. 1979. Antimicrobial Selection by a Computer: a Blinded Evaluation by Infectious Disease Experts. Journal of American Medical Association 242:1279-1282.