Verification and Validation: A Case Study of a Decision Support System A. Terry Bahai, K. Bharathan, and Richard F. Curlee, Systems and Industrial Engineering University of Arizona Tucson, AZ 85721-0020 teny@sie.arizona.edu From: Proceedings of the AI and Manufacturing Research Planning Workshop. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. In 1985we started building a decision support sS~tem to help speech clinicians diagnose small children who have begun to stutter. This paper describes the verification and validation of that system: (1) having an expert use and evaluate it, (2) runningtest cases, (3) developinga programto detect redundantrules, (4) using the Analytic Hierarchy Process, (5) running a program that checks a knowledgebase for consistency and completeness, (6) having five experts independently critique the system, (7) obtaining diagnoses of stuttering from these five experts derived from repom of children who had been evaluated for possible stuttering problems, (8) using the system to expose missing and ambiguousinformation in 30 clinical reports, and (9) analyzing the di~on and bias of six experts and the decision support system in diagnosing stuttering. Whenusing the final system, three clinicians with widely differing backgrounds woduceddiagnostic opinions that evidence little variability and were indistinguishable from those of a panel of five experiencedclinicians. L Introduction The development of a decision support system to help speech clinicians diagnose and managepreschool children at risk for a stuttering disability beganin 1985. Several yeats of testing and modification . took place before Childhood Stuttering: A Second Opinion TM was co~a~ercially released in 1993. Speech-language c’~ vary widely in their diagnoses of stuttering in youngchildren, but discussing such cases with exI~enout clinicians greatly reduces this variability. It was found that using ChildhoodStuttering: A Second Opinion, led clinicians with different training and experiences to diagnostic opinions indistinguishable f~omthose of a panel of five experienced clinicians. In effect, using Second Opinion allows inexperienced 2 AI & Manufacturing Workshop clinicians m"discuss" cases of incipient stuttering with a panel of experts, a process that should increase the reliability of their diagnosesin the real world. In the course of its development, the product went through several versions, each of which was tested at a differing level of rigor. 11. Testing the First System The first system was tested m1985 in a simple manner by havingthe expert use it and tell us howhe felt about it, what seemed right or wrong, This method was unsatisfactory largely in terms of the time required of the expert. Each time a minor adjustment was made to the knowledgebase, the expert had no choice but to run the system in its entirety. The need for a testing technique that used test cases was realized and implementedwith the next version. HI. Testing the Second System A second system namedStutter was developed in 1987. A series of test cases were used to assess Stutter. The expert ran the system and answered all the questions as they applied to a child he had evaluated for a suspected stuttering problem. After Stutter presented its diagnostic conclusions and advice, the intermediate knowledgewas saved into a disk file, and a text editor used to remove values derived by the inference engine, leaving only the user’s responsesto the system’squestions. A range of files were similarly generated and the sets of answers were saved under different names. Stutter was constructed to get its inputs from such files, and each file thus becamea test case. The knowledgeengineer did not have to always depend on the expert’s time to check each change in the knowledgebase. Manytest case were accumulated and used for testing over the life of the project. Whenever a rule was addedor modified, each test case was loaded into the system and used to look for run-time errors. Whenthe medification consisted of an additional question to be askedof the user, symptoms of stuttering. To increase the numberof test the test files hadto be maintainedby the expert havingto From: Proceedings of the AI and Manufacturing Research Planning Workshop. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. provide the answersto every case wherethe question was cases, he took a number of existing test cases and fired. Whitethe expert’s knowledgecouAdat times be systematically changedthose answers that should not obtained by a phonecall, rids was not alwaysfeasfole. changethe system’soutput and lookedto see ff they did. Whileon the surface the problemdid not appear to be Thus these types of test cases reflect an expert’s coaceptmlizationof a domainrather than real cases from eliminated, in actual fact, the ~ continuedto lDaintain close contact with the evolutionof the system, while his the domain. time wasless taxed. Otherthan the domainexpert, knowledgeengineers and IV. Testing the Third System end users are also good sources of test cases. Those ~eated by a knowledge engineer are often more The third expert advisory system, which was named parsimonious because more rules can be exercised by ExpertStuttering Program(ESP),wasreadyfor testing fewer test cases. Typically, an ~ a knowledge 1989.It wasthe first to be tested with a special program, engineer and an end user are like three blind men Validator, that we wrote to help find mistakes in describingan elephant: each focuseson different aspects knowledgebases (Jafar & BahiIl, 1991; 1993). By this of the system. The test cases of each will time, of course, there weremanypublications about diff~ but overlapping Ix~ensof the knowledge base; verification and validation of panic-I~rknowledge-based therefore it maybe wise to alwaysgather test cases from systems,(referencedin Jaf~r &Bahill 1991). all threesources. Validotorassesses the consistencyand completenessof Next a computerprogramwas created that used these a knowledsebase. It checksfor syntactic errors, unused test cases to test the expert advisory system more rules, unusedfacts, unusedquestions, incorrectly used systematically(Kang&Bahill, 1990). Thefiring of each legal values, redundantconstructs, rules that use illegal tale wasrecorded as each test case was rim. A role was values, and multiple methodsfor obtaining values for said to succeedif all of its premt~sevaluatedtrue andthe eXla’essionsandsystematicallyindicatespote~ianerrors to action specified in its conclusion wasperformed.Rules the knowledgeengineer. that werenever found to succeedfor any test case were flagged as probable mi.qakes and were bqroughtto the A. Verification with Validator attention of the human expert. Of course, a rule that never Verification,or buildingthe systemright, ensuresthat a succeedsduring such testing maynot be an error. Some system correctly implementsspecifications and determines rules maybe intended for unusual or infrequently how well each prototypeconformsto design requirements. occurringcases that werenot exercised by the test cases It gxmnmtees lmXluctconsistencyat the end of each phase that wereused. Conversely,rules that succeedfor every (with itself andwithpreviousprototypes) andensures a test case mayalso he mistakes.If a rule is alwaystrue it smooth transition from (me lm)totype to another. maybe better to replace it with a fact; however,the~ are Validator checks both the syntax and semantics of a control rules that mustalwayssucceed.Consequently, thi.~ knowledge base and brings po~n~alerrors to the attention technique is useful in screening a system’s rules for of the knowledge engineer, whohas the task of fixing such potential error; but a humanmustdecidewhetherthe rule errors. Validator has four verification modules: a in questionis appropriateor not. ~, a syntax analyzer, a symactic error check~, Thebest wayto test an expert advisorysystem, in our and a debugger. opinion, is to havethe domainexpert run the systemand ThePreprocessor: create a representative .~_mple of testcases, Next, Syntacticerrors can causea productionlanguage(which determinewhichrules never succeed___ andwhichalwaysdo. is used here to refer to AI languagesas well as expertOur ~ indicates that a test case for every five system shells) to misintmpret a Imowled~base and rules in the knawledge base maybe sutficient (i.e. 20 test consequm~y to alter its syntactic stmctm¢,leading to cases for a 100ride system)eventhoughclearly moretest semantic errors. Validatot~s ~ performs a lowcases are better and having manymore test cases than level syntax check. Detecting such errors saves rules wouldbe best. For those rules that do not succeed, knowledgeengineers and experts muchtime, fxe~an, the expert shouldprovide test cases that he thinks will and grief. This type of checking is dependent on a makethemmcceed untilmostdo succeed, Then,the system’s production language and amelimates the knowledgeengineer should try to determine the reason shortcomingsof the language’s compiler.Va!i,~__~bailds that the remainingrules are not succeeding,It shouldalso be notedthat sometest cases are not real. Theexpert for inten~ relxes~t~ion ~ of tl~ knowled~ base, which its other three modulesanalyze for syntactic ore" systemcharacterized sometest cases as canomsthat compliance. had idealized, exaggerated or stereotyped signs and Bah;ll 3 TheSyntax Analyzer: Manysyntactic errors are caused by misspellings, typographical errors, or ill-formed knowledge-hase structures. Therefore,Validator’ssyntaxanalyzercreates alphabeticallistings of the c:~pressionsin the knowledge base accordingto their categories: goals, rule premises, rule conclusions,questions(with legal values), andfacts (with values). This list comprises a knowledgebase dictionary that makesit easier for a knowledge engineerto find mistakes. For example,a knowledge engineer, forced to read an entire knowledge base, maybe hard pressed to remember if a hyphenor an underscoreis to be usedto tie twowordstogether. But if the two constructs are adjacent in a list it is easyto seean inconsistency. TheSyntactic Error Checker: Peoplecan easily detect inconsistentexpressionsff they are presentedin pairs rather than in large collections, but they need to rely on machinesfor detecting global and indirect inconsistencies.Thesyntactic error checkerlooks for syntaxthat, whilelegal, pro~cesunspecifiedbehavior by the productionlanguage’scompiler.For example,"outof-range values" (such as incorrect usage of reserved words)us,ally escape detection by the complierand the knowledge engineer. Validatordetects these errors in the contextsin whichthey occur. Expressions get their values from facts, from user responsesto questions, or fromthe conclusionsof rules. Thereare four basic types of values. (I) Legalvaluesare acceptableanswersto questions. (2) Utilized values appear in rule premisesand allow the rule premisein whichthey appearto be evaluatedto true. (3) Concludedvalues appear in rule conclusionsand are set for an expressionwhena rule using that expression in knowledge basesandoften result in the failure of rules using suchvalues. Co) UnusedLegal Values: The syntactic error checker searches the premises of rules, lookingfor declaredbut unusedlegal values, which it flags as potential errors. It also lists all unasked questions. Unusedlegal values are common in knowledge bases. Manyresult fromerrors, others are remnantsof old constructs that were put into the knowledgebase by mistake or were incompletely removed.Deleting such constructs reduces the size of the knowledgebase and speedsinferencesduringthe use of a system.If Validator searchedthe followingknowledge base, it wealdnote that the legal values for coat of animalare {hair, feathers, beads}but the utilized valuesare {hair, feathers, scales}. Validatorwouldpoint out that it is not possible to get a value of beadsfor coat of animaland wouldindicate that this is likely a mistake.It wouldalso point out that one legal value, scales, has never been used, and would suggestthat this mayalso be a mistake. (4) Assignedvalues are assignedto expressionswith facts or certain commands spe~fic to the productionlanguage. (a) Utilized VexsusLegalValues: Legal values guard against typngraphical errors and help in abbreviatinglong answers.If no legal values are provided,a systemacceptsany responseas a valid answer, so typographical errors and wrongresponsesmayescape detection. Evenif errors are detected, effective errorrecovery proceduresare time-consuming and increase the size of the knowledgebase. It should be noted that providing legal values will not prevent a knowledge engineer from using out-of-range values in a rule. For example, Myeln-derived production languages do not check a knowledgebase to ensure that only legal values havebeenused; these languagesonly checkuser responses to questions to see if they matchthe legal values ~ed by the knowledgeengineer. Legal values are related to questions, not to rules or facts. Illegal valuesare common question(coat of animal)= ’What is the coat of animal?’ From: Proceedings of the AI and Manufacturing Research Planning Workshop. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. 4 AI & Manufacturing Workshop goal = identity of animal. if coat of animal = scales then type of animal = fish. if coat of animal = hair then type of animal = manunal. if coat of animal = feathers then type of animal = bird. if type of animal = lizard then identity of animal = Gila monster. legalvalues (coat of animal) [hair, feathers, beads]. (c) Utilized VersusConclnded Values: ff the valuesused in the premisesof rules do not match the valuesusedin the conclusionsof those rules, the rules will fail. In the aboveknowledge base, the utilized values for type of animal are {fish, mammal,bird, lizard}, whereasthe concludedvalues are {fish, mammal, bird}, whichindicates that it wouldbe impossiblefor this system to concludethat a JiT~ardis a ~ of animal. The Debugger: Debugging is a tedious, difficult, time-consuming, and costly process of finding and correcting errors in the knowledgebase. On manyoccasions, the presence of errors is discovered during developmentaltesting but finding them tlepencL~ on the knowledgeengineer’s validation part of Validator has two modules:a chaining From: Proceedings of the AI and Manufacturing Research Planning Workshop. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. intuition, experience, and commonsense. However, thread tracer and a knowledgebase completenessmodule. computer-aideddebuggingtools are mo~reliable because Thechainingthreadtracer deterlnines ifrules call file by tracing their connectivity back to the goal. Rules that they reduce the Ix~sibility of humanerrors. Validator’s debuggerchecksfor rules that use variables, negations, cannotfire are flaggedas deadrules andare broughtto the attention of the knowledgeengineer. A rule is also and unknowns.It regards the knowledgebase as a closed world, and assumesthat all the information about the flagged as deadif it is the root of a deadtree. Thus, domainis captured in the knowledgebase. Thus, all flagging a dead rule mayuncover a whole set of dead possible rules and axiomsshould be either implied or rules. implicitly modeledin the knowledge base. Oneaspect of validation is checicingthe knowledge base for completeness, that is, attempting to determine if Variable-unsafeRules: variablescan be usedin both the something is missing from the knowledgebase. This premises and the conclusions of rules. They act as checkingusuallyproceedsin two ways.Thefirst is direct symbolic place holders. Each construct that uses and involves a knowledgeengineer and an expert working variables is logically equivalent to the large set of together reviewing,refining and analyzingeach item in constructs that could be obtained by replacing those the knowledgebase. They check the completeness of variables with suitable terms. A backward-chaining rule every module. They also check the consistency, that has variablesis variable-safe(closed)if: effectivenessandefficiency of everyknowledge base item. {1) Its conclusion is homomorphicto a fact or Thenthey revieweach rule anddecidewhetherto split it, consequencein the knowledgebase, modifyit, or delete it fromthe knowledge base. (2) Variablesthat appearas attributes of an expression a conclusionalso appearas attributes of an expressionin The secondapproachrequires a knowledgeengineer to the rule’s premise; (3) Vaxiablesthat appearas attributes of an expressionin workon the structure and representationof the knowledge base. Knowledge base items are analyzed, comparedfor premiseaLsoappearas utilized values in ia~ouspremises redundancy,completeness,and correctness of usage. This of the samerule or as attributes of an expressionin the approachis structured, algOrithmic, and moreexhaustive conclusion; tha~ the direct approach.It also uses heuristics that can (4) Variablesthat appearas concluded valuesof a rule also appearas utilized valuesin the rule’s premise. be automatedto produce fast and effective results. Validator allowedus to use this secondapproach.After A knowledgebase is variable-safe if its set of rules all potential errors flaggedby Validatorhadbeenresolved, evaluate to true, no matter whatvaluesare substitcted for its variables. Variable-tmsafe rules u,gmilylead to infinite wecontinuedtesting this expert advisorysystemusingtest loopsthat violate the closed-worldassumptions. cases and additional domain ~. Illegal Useof Negations:negationscan be used only in the premisesof rules. A rule that has a negationin its V. Testing the Fourth System conclusion violates clused-worldassumptions,becausea false fact is not explicitly declaredin the knowledge base, Thefourth version of this evolving decision support system, SecondOpinion~,wasready for field testing in but is inferred fromnot being able to concludea given valuefor an expression. 1991. It had already undergonetesting with all the techniques mentionedabovewhenour knowledgeengineer Illegal Use of Unknowns:the most common origin of traveled to the tmiversfliesof four other expertsandspent unknown is the result of a user’s responseto a question. a day with each rnnningthe systemon real cases.The Unknowncan also be concluded as a result of the cxl~s wc~ ~ pl~d with tha porforn,~nce of the unsuccessful firing of all the rules that concern an system, but each offered suggestionsfor changes,which expression. However,like negations, unknowncan be were implemented. Next, the clinical records of 30 used only in the premisesof rules. Arule that concludes anonymous child~ea whoseparents suspected them of nnknownfor an expression violates the clesed-world stuttering were colle~ed and rewritten after deleting any assumIXion,because an unknownfact is not explicitly information that might identify the child or his family. declaredin the knowledge base; rather, it is inferred from Each expert, at a workshop in Tucson, provided a not beingable to concludea valuefor an expression. diagnosis for each of the 30 children al~rreading the rewrittan clinical reports. Next, they discussed these B. Validation with Validator rewrittenreports andtheir diagnosis.Thelatter diagnoses Validation, building the right system, ensures the were used to derive a consensusdiagnosis for each child consistency and completenessof the wholesystem. The and SecondOpinionwas changedto matchthe consensus Bahill s opinions. Unfortunately, the performanceof the system Table I Heuristic Data Set From: Proceedings of the AI and Manufacturing Planning Workshop. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. was not evaluated quantitatively beforeResearch its modification. If there is a collectionof input-outputdata that is known to be correct (sometimescalled a gold standard), then variety of quantitative teclmiqnes can be used to help validate a system, manyof these are described by (Adelman,1992). In effect, the expert panel’s diagnoses reflected our effort to developsucha goldstandard. VI. Testingthe Fifth System Whiletesting a fifth version of this decision support system, ChildhoodStuttering: A Second OpinionTM in 1992, we discovered that its outputs differed when different expertsused it. Weaskedtwocxperts to use the system based on information each obtained from reading the thirty clinical reports. Theoutput of the systemwas the samefor ten of these reports even thoughthere were different answers to, on average, one-third of the questions. This suggests that the D’stemis robust to diff~ attributable to users. Theoutput of the system was highly similar for another ten reports but diffcred significantly on the remaining ten. These findings suggestedthat somedifferences mightbe attributable to the clinical reports. For example,onereport stated "Jose felt frusuate~"Someexperts understoodthis statementto meanthat the child was showingfrustration about his disfluency, while others understoodit to reflect his frastmtion with English and Spanish word-finding difficulties. It wasdecided, therefore, to resolve such amhignitiesin these clinical reports before testing the decisionsupportsystemfurther. First, fifteen of the thirty clinical reportswerecarefully edited so that any lacunae, ambiguities, and conflicting statememswere removed.Next, ten additional clinical reports were obtained and edited in a similar manner. Thesetwentyfive edited reports wexethen usedto test the latest version of the decision support system. It would havebeenbetter, of course, to use a few hundredvali~t,d clinical report&but developingthese clinical reports was expensive. In 1992 and 1993 more time was spent acquiring and preparingthese reports than in developing andreflnin$ the rules in the system’sknowledge base. A. NewTools In Table I, diagnoses of four apocryphalexperts and that of a decisionsupportsystemfor four clinical reports are displayed.Thesedata are not real but werecreated to illustrate the use of severalstatistical toob. Diagnoses of the expertsare denotedwith lowercaseletters a, b, c, and d, andthat_ of the decision supportsystemwith dss. The four clinical re~orts ale dl~gilated by the upper case letters A, B, C, andD 6 AI & Manufacturing Workshop Diagnostic Name of" Clinical Report A B C D Score 5 b c, d dss 4 d 3 b, d c, dss a dss b 2 b, d, a dss 1 a,c a,c Thenumbersin the Diagnostic Score columnrepresent five diagnostic opinions, which correspond to the followingdescriptions: !. Little causefor concernaboutstutterin~ Thereis little reason to suspect that a stuttering problem maybe emerging in this child at this time. The types and frequencies of disfnencies that were observed and described are characteristic of those of nonstuttering children. 2. Somecause for concernabout stuttering. This child is evidencingsomedangersigns of incipient stuttering as well as somethat are characteristic of nonstuttermg children. This pattern of equivocalfindingssuggeststhat the child maybe at risk for an emerging stuttering problem. 3. Mildconcernaboutstuttering. This child is evidencing relativelyconsistentsignsof early stuttering. 4. Moderateconcernabout stuttering This child presents speechandbehavioralsigns whichsuggestthat stuttering maynot be a mmsientproblem. 5. Severeconcernabout stuttering. This child evidences speechand behavioralsigns that maysignal the evolution of a severestuttering problem. Twoquantitative measuresof the agreementbetween the experts and the decision support system, whichwere originally suggestedby LucienDuckstein,can be used to characterize the dispersion and bias of these diagnostic ~!~, Eachdiagnosis is denotedwith a capital R with a subscript identifying the individual expert. The term dispersion showshowmuchthe diagnosis of each expert varied fromthose of every other expert. It is computed with wherei runs over the set {a. b, c. d, d~v},k runs over the set {a, b, c, d. dss},j runsover the set {A,B, C, D},1=5 (the numberof evaluators) and J - 4 (the number clinicalreports). Forthe data in TableI the dispersionsare TaMe II AAAI (www.aaai.org). All rights reserved. From: Proceedings of the AI and Manufacturing Research Planning Workshop. Copyright © 1996, "Expert" Dispersion 1.47 a b 1.68 c 1.37 d 1.35 dss 1.06 Clinical Results Expert Bias Dispersion a -0. II 0.76 b 0.17 0.79 -0.17 c 0.71 0.50 d 0.88 0.14 0.86 e Theterm biasshows inclinations oftheindividuals. It -0.42 0.85 ec is computed with dss -0.10 O. 80 1 Df 1 ~u J Forthedata inTable IIthebiases are "Expert" a b c d dss Bias -1.19 0.38 -0.25 1.00 0.06 Thesemeasurescan be used to help validate a decision supportsystem. For the data in TableI all the "experts" performedless consistently than did the decision snppon system,becausefour experts haddispersionscores greater than that of the decision supportsystem+Theopinionsof expert too’, for example,were the least consistent. The data also illustrate that exl~rts maybe biased. The diagnosisof expert "a" suggestthat he viewedeachchild’s flneuc7 problemas less severe than the other experts, while those of expert "d" wereatthe other extreme. The decision suplxmsystem, on the other hand, showedlitOe bias. Thesedata werecontrived to illustrate such ideal behaviorby a decision SUl~Ort system. B. The Final System Test The25 newclinical reports werenmi,edto five highly ex~+~lKX~d Sta~ 8 Gliniclsns who wl~ asked to evaluate the likelihood that each child wasstuttering TableII summar/z~ the statistical results of the d/agnostic scores of these five experiem~ clinicians, of the examiningclinicians whoprepared the original t~orts, and the of derision support system for each of the 25 reports, The five ex~’rleuced din/clans are ~ by the letters It, b, C, d, al~ e. Thee~mmining clinicians Who evaluatedeachchild are designatedby ec, andthe decision mmplmrt systemby dins. Basedon the data in Table II, the performanceof the decision support systemis indistinguishablefromthat of experienced dbfician~ Itsdispersion ishigher thansome clinicians but lowerthan others. Its bias is very small. In contrast, the ¢xaminillgclinicians’ diagnosticscores do stand out fromthe panel of clinicians, as is indicated by the large negative bias score. This seemsreasonable, because the clinicians whoconductedthese evainmions hadaccess to data that the other clinicians lacked: they examinedthe actual children. Furthermore,they likely reviewedvideo tapes of their evaluation and discussed di~cult cases with other clinicians before arriving at a diagnosis and writing their clinical reports. Therefore their diagnostic opinions maybe the most valkt The decision support system’sdiagnosesweredesignedto fall betweenthose of the examiningclinicians andthe meanof the panel of e, xI~eucedclinicians. The low bias and dispersion of the decision support systemindicates its tohlg~ss. Whenexperienced clinicians have aa opportunity to conferanddiscusstheir diagnosticopinions,differencesin opinion that they mayhaveusually decrease. At the time of their discussions, panel men.tamswereshownboth the meanand range of diagnostic scores for each case. The intendedpuzpme of these discussionswasto clarify informationabout a case, not to developa consenm~ bet the range of diagnostic scores was smaller after the discussions. This redaction in range maybe dee to panel members consideringadd/lional diagnostic factors during their discussionsthan wereconsideredwhentheir initial diagnosesweregiven. It mayalso reflect, in part, some panelmembers’ decisionsto modifytheir initial diagnostic scores to moreclosely approximate the panel’s meanscore. One purpose of Second Ol~nion is to allow an inexperienced¢lini<~3n to "discuss" findings obtained froma diagnosticevaluationwith a panelof expertsandto learn thediagnosticopinionsof the panelof experts. Such "discussions"are intendedto increase the reliability of in~enced dinieians’ diagnoses. Theoetput of SecondOpinionis robust. Three people usedit with the sametwentyfive clinical reports descr/bed Bahill 7 above.Onewasthe chief scientist for our decision sopport systems,whohas had over twentyfive years of experience evaluating children with fluency problems. The second g~msa doctoral candidate in Communication Sciences and Disorders at Syracuse University whohad acquired substantial previousexperiencewith the decision support system. Thethird wasa master’s degree student at the Universityof Arizonawhohad little experiencee~aluating young stutterers and no previous ~ence with any decision support system. Their diagnostic opinions evidencelittle difference. Thecase aboutwhichthere was the most disagreement,Couzens,led to further reviewof that clinical report and the identification of ambiguities that mayhave causedthis discrelmncy.Useof Childhood Stuttering: A Second Opinion, appears to promote diagnosticopinionsthat are indistinguishablefromthose of a panel of five experiencedclinicians regardless of a clinician’s training and background.One significant finding is that the diagnosis of three peoplewith widely differing backgroundsevidenced less variability when using the decision support systemthan did those of five experiencedclinicians rating the sameclinical reports withoutthe decisionsupportsystem. As a penultimate test, the system was given to 30 students in an Expert Systemsclass at the University of Arizona.It is nowundergoingits ultimate test as it is used by customers.Mostof the concernsof both students and customersinvolve the installation procedure.Noone yet has pointedout errors or sugo~’tcdimprovements in its knowledgebase. This decision support systemis meantto assist, not replace, inexperiencedspeech-languageclinicians. One thing a humandoes that a comimtersystemdoes not do is track a child’s behavior over time, looking for intprovemcntor a worseningof signs and symptoms.In the future weintend to build a versionof this systemfor teaching that will include explanations and pertinent ref~ea-tces for each question asked. Suchsystemswould to be usefulin training clinical skills as well as in sopporting the skill,q of inexilerienced clinicians. testing procedures, and, wethink, credibility to the decision support systems’ performance. Finally wc comparedthe output of the systemto the evaluations of humanexperts. Whenusing Childhood Stuttering: A SecondOpinion, three clinicians with widelydiffering backgroundsin stuttering produceddiagnostic opinions that evidencelittle variability andwereindistinguishable fromthose of a panelof five experienced clinicians. In the software industry, the mantffacturingprocess itself is subsumedin significance by the development process. While the former requires physical quality control such as uncontaminated storage mediaand so on, rigorous validation and verification at the development stage is whatensuresthe quality of the final product.In the case discussedin this paper, the needfor the domain experts’ knowledge to control development while simultaneouslyensuring the technical soundnessof the knowledgebase was of paramountconcern To this end, several innovations, reported above, had to be made. Whilethe gcneralizabilityof suchinnovationsneedfurther examination,the point weseek to makeis simply this: within the overall need for rigorous verification and validation, the specific methed(s)to be used are often dictated by the specifics of the product and the circumstances in whichit is built. From: Proceedings of the AI and Manufacturing Research Planning Workshop. Copyright © 1996, AAAI (www.aaai.org). All rights reserved. VII. Conclusions R is difficult to detect errors in a decision support system. Therefore, mucheffort was devotedduring the knowledge extraction processto insure that errors did not creep into the knowledge base. Because all such knowledge came from and was captured by fallible lnnnans, it is umeasonable to expect a knowledgebase to be free of errors. Therefore,several tools andprocedures weredevelopedto help detect errors in a decision support system’s knowledgebese. Eachwas designedto workwith existing tools, and each addedadditional complexityto 8 M& Manufacturing Workshop Acknowledgments This workwas supported by the National Institute of Child Health and HumanDevelopment Grant R44 HD26209 and by Bahill Intelligent CornlmterSystems. Thecontentsof this paperare solely the responsibility of the authorsand do not necessarilyrepresent the official viewsof the NationalInstitute of Child Healthand Human Development. REFERENCES L. Adelman,1992. Evaluating Decision Support and Expert Systems. NewYork: John Wiley& Sons. M. andBahill, A. T. 1991.Interactive verification andvalidation with Validator. Chapter4 in Bahill, A. T. e~ Verifying and Vafidaang Personal Computer-Based ExpertSystems.Englewond Cliff.a: PrenticeHall. Jafar, Jafar, M.andBahili, A.T. 1993.Interactiveverificationof knowledge-based systems. IEEEExpert8(1 ): 25-32. Kang,Y. and Bahill, A. T., 1990. A tool for detecting expert systemerrors, AI Expert5(2): 46-51.