
Verification and Validation: A Case Study of a Decision
Support System
A. Terry Bahill, K. Bharathan, and Richard F. Curlee
Systems and Industrial Engineering
University of Arizona
Tucson, AZ 85721-0020
terry@sie.arizona.edu
From: Proceedings of the AI and Manufacturing Research Planning Workshop. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.
In 1985 we started building a decision support system to help speech clinicians diagnose small children who have begun to stutter. This paper describes the verification and validation of that system:
(1) having an expert use and evaluate it,
(2) running test cases,
(3) developing a program to detect redundant rules,
(4) using the Analytic Hierarchy Process,
(5) running a program that checks a knowledge base for consistency and completeness,
(6) having five experts independently critique the system,
(7) obtaining diagnoses of stuttering from these five experts derived from reports of children who had been evaluated for possible stuttering problems,
(8) using the system to expose missing and ambiguous information in 30 clinical reports, and
(9) analyzing the dispersion and bias of six experts and the decision support system in diagnosing stuttering.
When using the final system, three clinicians with widely differing backgrounds produced diagnostic opinions that evidenced little variability and were indistinguishable from those of a panel of five experienced clinicians.
I. Introduction
The development of a decision support system to help speech clinicians diagnose and manage preschool children at risk for a stuttering disability began in 1985. Several years of testing and modification took place before Childhood Stuttering: A Second Opinion™ was commercially released in 1993.
Speech-language clinicians vary widely in their diagnoses of stuttering in young children, but discussing such cases with experienced clinicians greatly reduces this variability. It was found that using Childhood Stuttering: A Second Opinion led clinicians with different training and experiences to diagnostic opinions indistinguishable from those of a panel of five experienced clinicians. In effect, using Second Opinion allows inexperienced clinicians to "discuss" cases of incipient stuttering with a panel of experts, a process that should increase the reliability of their diagnoses in the real world.
In the course of its development, the product went
through several versions, each of which was tested at a
differing level of rigor.
II. Testing the First System
The first system was tested in 1985 in a simple manner by having the expert use it and tell us how he felt about it, what seemed right or wrong. This method was unsatisfactory largely in terms of the time required of the expert. Each time a minor adjustment was made to the knowledge base, the expert had no choice but to run the system in its entirety. The need for a testing technique that used test cases was realized and implemented with the next version.
III. Testing the Second System
A second system named Stutter was developed in 1987. A series of test cases was used to assess Stutter. The expert ran the system and answered all the questions as they applied to a child he had evaluated for a suspected stuttering problem. After Stutter presented its diagnostic conclusions and advice, the intermediate knowledge was saved into a disk file, and a text editor was used to remove values derived by the inference engine, leaving only the user's responses to the system's questions. A range of files was similarly generated, and the sets of answers were saved under different names. Stutter was constructed to get its inputs from such files, and each file thus became a test case. The knowledge engineer did not always have to depend on the expert's time to check each change in the knowledge base.
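The replay mechanism can be sketched in a few lines. The Python fragment below is our illustration of the idea, not Stutter's actual code; run_system, all_questions, and the JSON file format are assumptions made for the sketch.

import json

def record_session(ask, questions, path):
    # Ask the expert each question once and save the answers as a test case.
    answers = {q: ask(q) for q in questions}
    with open(path, "w") as f:
        json.dump(answers, f, indent=2)

def replay_session(path):
    # Return an "ask" function that answers from the saved file
    # instead of prompting the expert.
    with open(path) as f:
        answers = json.load(f)
    return lambda question: answers[question]

# After one live session, every regression run is non-interactive:
#   record_session(input, all_questions, "case01.json")
#   run_system(replay_session("case01.json"))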
Many test cases were accumulated and used for testing over the life of the project. Whenever a rule was added or modified, each test case was loaded into the system and used to look for run-time errors. When the modification consisted of an additional question to be asked of the user, the test files had to be maintained by the expert, who had to provide the answers to every case where the question was fired. While the expert's knowledge could at times be obtained by a phone call, this was not always feasible. While on the surface the problem did not appear to be eliminated, in actual fact the expert continued to maintain close contact with the evolution of the system, while his time was less taxed.
Other than the domain expert, knowledge engineers and end users are also good sources of test cases. Those created by a knowledge engineer are often more parsimonious because more rules can be exercised by fewer test cases. Typically, an expert, a knowledge engineer and an end user are like three blind men describing an elephant: each focuses on different aspects of the system. The test cases of each will cover different but overlapping portions of the knowledge base; therefore it may be wise to always gather test cases from all three sources.
Next a computer program was created that used these test cases to test the expert advisory system more systematically (Kang & Bahill, 1990). The firing of each rule was recorded as each test case was run. A rule was said to succeed if all of its premises evaluated true and the action specified in its conclusion was performed. Rules that were never found to succeed for any test case were flagged as probable mistakes and were brought to the attention of the human expert. Of course, a rule that never succeeds during such testing may not be an error. Some rules may be intended for unusual or infrequently occurring cases that were not exercised by the test cases that were used. Conversely, rules that succeed for every test case may also be mistakes. If a rule is always true it may be better to replace it with a fact; however, there are control rules that must always succeed. Consequently, this technique is useful in screening a system's rules for potential error, but a human must decide whether the rule in question is appropriate or not.
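This bookkeeping reduces to counting rule successes per test case. A minimal sketch in Python rather than the original production language; run_case, which fires the engine on one case and reports the rules that succeeded, is assumed:

def screen_rules(rule_names, test_cases, run_case):
    # run_case(case) is assumed to return the set of rule names that
    # succeeded (all premises true, conclusion performed) on that case.
    success_counts = {name: 0 for name in rule_names}
    for case in test_cases:
        for name in run_case(case):
            success_counts[name] += 1
    never = [n for n, c in success_counts.items() if c == 0]
    always = [n for n, c in success_counts.items() if c == len(test_cases)]
    # Both lists are screening aids only: a human must decide whether a
    # flagged rule is a mistake, a rule for a rare case, or a control
    # rule that is supposed to succeed every time.
    return never, always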
The best way to test an expert advisory system, in our opinion, is to have the domain expert run the system and create a representative sample of test cases. Next, determine which rules never succeed and which always do. Our experience indicates that a test case for every five rules in the knowledge base may be sufficient (i.e., 20 test cases for a 100-rule system), even though clearly more test cases are better, and having many more test cases than rules would be best. For those rules that do not succeed, the expert should provide test cases that he thinks will make them succeed, until most do succeed. Then, the knowledge engineer should try to determine the reason that the remaining rules are not succeeding. It should also be noted that some test cases are not real. The expert for our system characterized some test cases as canons that had idealized, exaggerated or stereotyped signs and symptoms of stuttering. To increase the number of test cases, he took a number of existing test cases and systematically changed those answers that should not change the system's output and looked to see if they did. Thus these types of test cases reflect an expert's conceptualization of a domain rather than real cases from the domain.
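The systematic-perturbation step is easy to automate, and resembles what is now called metamorphic testing. The following sketch assumes each test case is a dictionary of answers and reuses the hypothetical run_system from the earlier fragment:

def perturb_and_check(base_case, neutral_variants, run_system):
    # neutral_variants maps a question to alternative answers that, in
    # the expert's judgment, should NOT change the diagnosis.
    expected = run_system(base_case)
    failures = []
    for question, alternatives in neutral_variants.items():
        for alt in alternatives:
            variant = dict(base_case)
            variant[question] = alt
            if run_system(variant) != expected:
                failures.append((question, alt))
    return failures  # supposedly neutral changes that altered the output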
IV. Testing the Third System

The third expert advisory system, which was named Expert Stuttering Program (ESP), was ready for testing in 1989. It was the first to be tested with a special program, Validator, that we wrote to help find mistakes in knowledge bases (Jafar & Bahill, 1991; 1993). By this time, of course, there were many publications about verification and validation of particular knowledge-based systems (referenced in Jafar & Bahill 1991).
Validator assesses the consistency and completeness of a knowledge base. It checks for syntactic errors, unused rules, unused facts, unused questions, incorrectly used legal values, redundant constructs, rules that use illegal values, and multiple methods for obtaining values for expressions, and it systematically indicates potential errors to the knowledge engineer.

A. Verification with Validator

Verification, or building the system right, ensures that a system correctly implements specifications and determines how well each prototype conforms to design requirements. It guarantees product consistency at the end of each phase (with itself and with previous prototypes) and ensures a smooth transition from one prototype to another. Validator checks both the syntax and semantics of a knowledge base and brings potential errors to the attention of the knowledge engineer, who has the task of fixing such errors. Validator has four verification modules: a preprocessor, a syntax analyzer, a syntactic error checker, and a debugger.

The Preprocessor:
Syntactic errors can cause a production language (which is used here to refer to AI languages as well as expert system shells) to misinterpret a knowledge base and consequently to alter its syntactic structure, leading to semantic errors. Validator's preprocessor performs a low-level syntax check. Detecting such errors saves knowledge engineers and experts much time, frustration, and grief. This type of checking is dependent on a system's production language and ameliorates the shortcomings of the language's compiler. Validator builds an internal representation of the knowledge base, which its other three modules analyze for syntactic compliance.
The Syntax Analyzer:
Many syntactic errors are caused by misspellings, typographical errors, or ill-formed knowledge-base structures. Therefore, Validator's syntax analyzer creates alphabetical listings of the expressions in the knowledge base according to their categories: goals, rule premises, rule conclusions, questions (with legal values), and facts (with values). This list comprises a knowledge base dictionary that makes it easier for a knowledge engineer to find mistakes. For example, a knowledge engineer, forced to read an entire knowledge base, may be hard pressed to remember if a hyphen or an underscore is to be used to tie two words together. But if the two constructs are adjacent in a list, it is easy to see an inconsistency.
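The dictionary itself is straightforward to build. A sketch under the assumption that a categorize function yields (category, expression) pairs for each knowledge base item (Validator's real data structures are not described in this paper):

from collections import defaultdict

def build_dictionary(knowledge_base, categorize):
    # categorize(item) is assumed to yield (category, expression) pairs,
    # with category in {"goal", "premise", "conclusion", "question", "fact"}.
    dictionary = defaultdict(list)
    for item in knowledge_base:
        for category, expression in categorize(item):
            dictionary[category].append(expression)
    for expressions in dictionary.values():
        expressions.sort()  # adjacent entries expose near-duplicates
    return dictionary

# In a sorted listing, "coat-of-animal" lands next to "coat_of_animal",
# so a hyphen/underscore inconsistency is spotted at a glance.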
The Syntactic Error Checker:
People can easily detect inconsistent expressions if they are presented in pairs rather than in large collections, but they need to rely on machines for detecting global and indirect inconsistencies. The syntactic error checker looks for syntax that, while legal, produces unspecified behavior by the production language's compiler. For example, out-of-range values (such as incorrect usage of reserved words) usually escape detection by the compiler and the knowledge engineer. Validator detects these errors in the contexts in which they occur.
Expressions get their values from facts, from user responses to questions, or from the conclusions of rules. There are four basic types of values.
(1) Legal values are acceptable answers to questions.
(2) Utilized values appear in rule premises and allow the rule premise in which they appear to be evaluated to true.
(3) Concluded values appear in rule conclusions and are set for an expression when a rule using that expression succeeds.
(4) Assigned values are assigned to expressions with facts or certain commands specific to the production language.

(a) Utilized Versus Legal Values:
Legal values guard against typographical errors and help in abbreviating long answers. If no legal values are provided, a system accepts any response as a valid answer, so typographical errors and wrong responses may escape detection. Even if errors are detected, effective error-recovery procedures are time-consuming and increase the size of the knowledge base. It should be noted that providing legal values will not prevent a knowledge engineer from using out-of-range values in a rule. For example, Mycin-derived production languages do not check a knowledge base to ensure that only legal values have been used; these languages only check user responses to questions to see if they match the legal values specified by the knowledge engineer. Legal values are related to questions, not to rules or facts. Illegal values are common in knowledge bases and often result in the failure of rules using such values.

(b) Unused Legal Values:
The syntactic error checker searches the premises of rules, looking for declared but unused legal values, which it flags as potential errors. It also lists all unasked questions. Unused legal values are common in knowledge bases. Many result from errors; others are remnants of old constructs that were put into the knowledge base by mistake or were incompletely removed. Deleting such constructs reduces the size of the knowledge base and speeds inferences during the use of a system. If Validator searched the following knowledge base, it would note that the legal values for coat of animal are {hair, feathers, beads} but the utilized values are {hair, feathers, scales}. Validator would point out that it is not possible to get a value of scales for coat of animal and would indicate that this is likely a mistake. It would also point out that one legal value, beads, has never been used, and would suggest that this may also be a mistake.
question(coat of animal) =
'What is the coat of animal?'
goal = identity of animal.
if coat of animal = scales
then type of animal = fish.
if coat of animal = hair
then type of animal = mammal.
if coat of animal = feathers
then type of animal = bird.
if type of animal = lizard
then identity of animal = Gila monster.
legalvalues(coat of animal) =
[hair, feathers, beads].
(c) Utilized Versus Concluded Values:
If the values used in the premises of rules do not match the values used in the conclusions of those rules, the rules will fail. In the above knowledge base, the utilized values for type of animal are {fish, mammal, bird, lizard}, whereas the concluded values are {fish, mammal, bird}, which indicates that it would be impossible for this system to conclude that a lizard is a type of animal.
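Both of these checks reduce to set comparisons. The sketch below is ours, not Validator's code; applied to the animal knowledge base above, it reproduces both findings:

def unused_and_illegal(legal, utilized):
    # Checks (a) and (b): compare declared legal values with the values
    # actually used in rule premises.
    return {"used in rules but illegal": sorted(utilized - legal),
            "legal but never used": sorted(legal - utilized)}

def unconcludable(utilized, concluded):
    # Check (c): premise values that no rule can ever establish.
    return sorted(utilized - concluded)

# The animal knowledge base above:
print(unused_and_illegal(legal={"hair", "feathers", "beads"},
                         utilized={"hair", "feathers", "scales"}))
# -> scales is illegal; beads is never used
print(unconcludable(utilized={"fish", "mammal", "bird", "lizard"},
                    concluded={"fish", "mammal", "bird"}))
# -> lizard can never be concluded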
The Debugger:
Debugging is a tedious, difficult, time-consuming, and costly process of finding and correcting errors in the knowledge base. On many occasions, the presence of errors is discovered during developmental testing, but finding them depends on the knowledge engineer's intuition, experience, and common sense. However, computer-aided debugging tools are more reliable because they reduce the possibility of human errors. Validator's debugger checks for rules that use variables, negations, and unknowns. It regards the knowledge base as a closed world, and assumes that all the information about the domain is captured in the knowledge base. Thus, all possible rules and axioms should be either implied or implicitly modeled in the knowledge base.

Variable-unsafe Rules:
Variables can be used in both the premises and the conclusions of rules. They act as symbolic place holders. Each construct that uses variables is logically equivalent to the large set of constructs that could be obtained by replacing those variables with suitable terms. A backward-chaining rule that has variables is variable-safe (closed) if:
(1) Its conclusion is homomorphic to a fact or consequence in the knowledge base;
(2) Variables that appear as attributes of an expression in a conclusion also appear as attributes of an expression in the rule's premise;
(3) Variables that appear as attributes of an expression in a premise also appear as utilized values in previous premises of the same rule or as attributes of an expression in the conclusion;
(4) Variables that appear as concluded values of a rule also appear as utilized values in the rule's premise.
A knowledge base is variable-safe if its set of rules evaluate to true, no matter what values are substituted for its variables. Variable-unsafe rules usually lead to infinite loops that violate the closed-world assumptions.
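Parts of this definition can be checked mechanically. The sketch below tests only one facet, roughly conditions (2) and (4): every variable appearing in a conclusion must be grounded somewhere in the premises. It assumes rules are dictionaries of strings and that variables are written as uppercase identifiers; the full four-condition test is more involved:

def conclusion_variables_covered(rule):
    # Roughly conditions (2) and (4): every variable appearing in the
    # conclusion must also occur in some premise, otherwise no premise
    # binding can ground the conclusion.
    def variables(term):
        # Variables are assumed to be written as uppercase identifiers.
        return {tok for tok in term.split() if tok.isupper()}
    premise_vars = set().union(*(variables(p) for p in rule["premises"]))
    return variables(rule["conclusion"]) <= premise_vars

# {"premises": ["coat of X = hair"], "conclusion": "type of Y = mammal"}
# fails the check: Y is never bound, so the rule is variable-unsafe.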
Illegal Use of Negations:
Negations can be used only in the premises of rules. A rule that has a negation in its conclusion violates closed-world assumptions, because a false fact is not explicitly declared in the knowledge base, but is inferred from not being able to conclude a given value for an expression.

Illegal Use of Unknowns:
The most common origin of unknown is the result of a user's response to a question. Unknown can also be concluded as a result of the unsuccessful firing of all the rules that concern an expression. However, like negations, unknown can be used only in the premises of rules. A rule that concludes unknown for an expression violates the closed-world assumption, because an unknown fact is not explicitly declared in the knowledge base; rather, it is inferred from not being able to conclude a value for an expression.
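Both restrictions are easy to police. A crude textual sketch, assuming rules are stored with name and conclusion strings (a real checker such as Validator would parse the rule rather than match substrings):

def illegal_constructs(rules):
    # Flag rules whose conclusions mention negation or unknown; under
    # the closed-world assumption both may appear only in premises.
    flagged = []
    for rule in rules:
        conclusion = rule["conclusion"].lower()
        if " not " in f" {conclusion}" or "unknown" in conclusion:
            flagged.append(rule["name"])
    return flagged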
B. Validation with Validator

Validation, or building the right system, ensures the consistency and completeness of the whole system. The validation part of Validator has two modules: a chaining thread tracer and a knowledge base completeness module. The chaining thread tracer determines if rules can fire by tracing their connectivity back to the goal. Rules that cannot fire are flagged as dead rules and are brought to the attention of the knowledge engineer. A rule is also flagged as dead if it is the root of a dead tree. Thus, flagging a dead rule may uncover a whole set of dead rules.
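Viewed as a graph problem, the tracer is a reachability computation: a rule is live only if its conclusion feeds the goal directly or through other live rules. A minimal sketch under that reading; Validator's actual algorithm may differ:

def dead_rules(rules, goal):
    # rules: dicts with "name", "premises" (a set of expressions), and
    # "conclusion" (a single expression).
    needed, live = {goal}, set()
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule["name"] not in live and rule["conclusion"] in needed:
                live.add(rule["name"])
                needed |= rule["premises"]  # sub-goals become needed too
                changed = True
    return [r["name"] for r in rules if r["name"] not in live]

# Every returned rule is dead; a dead rule that was the only route to
# some expression drags a whole subtree of rules down with it.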
One aspect of validation is checking the knowledge base for completeness, that is, attempting to determine if something is missing from the knowledge base. This checking usually proceeds in two ways. The first is direct and involves a knowledge engineer and an expert working together, reviewing, refining and analyzing each item in the knowledge base. They check the completeness of every module. They also check the consistency, effectiveness and efficiency of every knowledge base item. Then they review each rule and decide whether to split it, modify it, or delete it from the knowledge base.
The second approach requires a knowledge engineer to work on the structure and representation of the knowledge base. Knowledge base items are analyzed and compared for redundancy, completeness, and correctness of usage. This approach is structured, algorithmic, and more exhaustive than the direct approach. It also uses heuristics that can be automated to produce fast and effective results. Validator allowed us to use this second approach. After all potential errors flagged by Validator had been resolved, we continued testing this expert advisory system using test cases and additional domain experts.
V. Testing the Fourth System

The fourth version of this evolving decision support system, Second Opinion™, was ready for field testing in 1991. It had already undergone testing with all the techniques mentioned above when our knowledge engineer traveled to the universities of four other experts and spent a day with each running the system on real cases. The experts were pleased with the performance of the system, but each offered suggestions for changes, which were implemented. Next, the clinical records of 30 anonymous children whose parents suspected them of stuttering were collected and rewritten after deleting any information that might identify the child or his family. Each expert, at a workshop in Tucson, provided a diagnosis for each of the 30 children after reading the rewritten clinical reports. Next, they discussed these rewritten reports and their diagnoses. The latter diagnoses were used to derive a consensus diagnosis for each child, and Second Opinion was changed to match the consensus opinions. Unfortunately, the performance of the system was not evaluated quantitatively before its modification.
If there is a collection of input-output data that is known to be correct (sometimes called a gold standard), then a variety of quantitative techniques can be used to help validate a system; many of these are described by Adelman (1992). In effect, the expert panel's diagnoses reflected our effort to develop such a gold standard.
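Once a gold standard exists, the comparison itself is mechanical. A sketch of the kind of quantitative check that could have been run, assuming consensus scores on the five-point diagnostic scale described later in this paper:

def validate_against_gold(system_scores, gold_scores):
    # Mean absolute difference between the system's diagnostic scores and
    # the consensus (gold standard) scores, plus the exact-match rate.
    diffs = [abs(system_scores[case] - gold_scores[case])
             for case in gold_scores]
    return {"mean_abs_error": sum(diffs) / len(diffs),
            "exact_agreement": sum(d == 0 for d in diffs) / len(diffs)}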
VI. Testing the Fifth System

While testing a fifth version of this decision support system, Childhood Stuttering: A Second Opinion™, in 1992, we discovered that its outputs differed when different experts used it. We asked two experts to use the system based on information each obtained from reading the thirty clinical reports. The output of the system was the same for ten of these reports even though there were different answers to, on average, one-third of the questions. This suggests that the system is robust to differences attributable to users. The output of the system was highly similar for another ten reports but differed significantly on the remaining ten. These findings suggested that some differences might be attributable to the clinical reports. For example, one report stated "Jose felt frustrated." Some experts understood this statement to mean that the child was showing frustration about his disfluency, while others understood it to reflect his frustration with English and Spanish word-finding difficulties. It was decided, therefore, to resolve such ambiguities in these clinical reports before testing the decision support system further.
First, fifteen of the thirty clinical reports were carefully edited so that any lacunae, ambiguities, and conflicting statements were removed. Next, ten additional clinical reports were obtained and edited in a similar manner. These twenty-five edited reports were then used to test the latest version of the decision support system. It would have been better, of course, to use a few hundred validated clinical reports, but developing these clinical reports was expensive. In 1992 and 1993 more time was spent acquiring and preparing these reports than in developing and refining the rules in the system's knowledge base.
A. New Tools

In Table I, the diagnoses of four apocryphal experts and that of a decision support system for four clinical reports are displayed. These data are not real but were created to illustrate the use of several statistical tools. Diagnoses of the experts are denoted with the lowercase letters a, b, c, and d, and that of the decision support system with dss. The four clinical reports are designated by the uppercase letters A, B, C, and D.
Table I. Heuristic Data Set

Diagnostic            Name of Clinical Report
Score            A          B          C          D
  5              b          c, d       dss
  4              d
  3              b, d       c, dss     a          dss, b
  2              b, d       a          dss
  1              a, c       a, c
The numbers in the Diagnostic Score column represent five diagnostic opinions, which correspond to the following descriptions:
1. Little cause for concern about stuttering. There is little reason to suspect that a stuttering problem may be emerging in this child at this time. The types and frequencies of disfluencies that were observed and described are characteristic of those of nonstuttering children.
2. Some cause for concern about stuttering. This child is evidencing some danger signs of incipient stuttering as well as some that are characteristic of nonstuttering children. This pattern of equivocal findings suggests that the child may be at risk for an emerging stuttering problem.
3. Mild concern about stuttering. This child is evidencing relatively consistent signs of early stuttering.
4. Moderate concern about stuttering. This child presents speech and behavioral signs which suggest that stuttering may not be a transient problem.
5. Severe concern about stuttering. This child evidences speech and behavioral signs that may signal the evolution of a severe stuttering problem.
Two quantitative measures of the agreement between the experts and the decision support system, which were originally suggested by Lucien Duckstein, can be used to characterize the dispersion and bias of these diagnostic scores. Each diagnosis is denoted with a capital R with a subscript identifying the individual expert. The term dispersion shows how much the diagnosis of each expert varied from those of every other expert. It is computed with

D_i = \sqrt{ \frac{1}{J(I-1)} \sum_{j} \sum_{k} (R_{ij} - R_{kj})^2 }
where i runs over the set {a, b, c, d, dss}, k runs over the set {a, b, c, d, dss}, j runs over the set {A, B, C, D}, I = 5 (the number of evaluators) and J = 4 (the number of clinical reports).
For the data in Table I the dispersions are:
"Expert" Dispersion
1.47
a
b
1.68
c
1.37
d
1.35
dss
1.06
The term bias shows the inclinations of the individuals. It is computed with

B_i = \frac{1}{J} \sum_{j} (R_{ij} - \bar{R}_j), \quad \bar{R}_j = \frac{1}{I} \sum_{k} R_{kj}

For the data in Table I the biases are:

"Expert"     Bias
a            -1.19
b            0.38
c            -0.25
d            1.00
dss          0.06
These measures can be used to help validate a decision support system. For the data in Table I all the "experts" performed less consistently than did the decision support system, because the four experts had dispersion scores greater than that of the decision support system. The opinions of expert "b", for example, were the least consistent. The data also illustrate that experts may be biased. The diagnoses of expert "a" suggest that he viewed each child's fluency problem as less severe than did the other experts, while those of expert "d" were at the other extreme. The decision support system, on the other hand, showed little bias. These data were contrived to illustrate such ideal behavior by a decision support system.
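Both measures are simple to compute. The sketch below implements the formulas as reconstructed above; since the published equations did not survive reproduction, the normalization constants are our reading:

import math

def dispersion_and_bias(scores):
    # scores[i][j]: diagnostic score of evaluator i on clinical report j.
    evaluators = list(scores)
    reports = list(next(iter(scores.values())))
    I, J = len(evaluators), len(reports)
    report_mean = {j: sum(scores[i][j] for i in evaluators) / I
                   for j in reports}
    dispersion = {i: math.sqrt(sum((scores[i][j] - scores[k][j]) ** 2
                                   for j in reports for k in evaluators)
                               / (J * (I - 1)))
                  for i in evaluators}
    bias = {i: sum(scores[i][j] - report_mean[j] for j in reports) / J
            for i in evaluators}
    return dispersion, bias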
B. The Final System Test
The 25 new clinical reports were mailed to five highly experienced clinicians who were asked to evaluate the likelihood that each child was stuttering. Table II summarizes the statistical results of the diagnostic scores of these five experienced clinicians, of the examining clinicians who prepared the original reports, and of the decision support system for each of the 25 reports. The five experienced clinicians are designated by the letters a, b, c, d, and e. The examining clinicians who evaluated each child are designated by ec, and the decision support system by dss.

Table II. Clinical Results

Expert     Bias      Dispersion
a          -0.11     0.76
b           0.17     0.79
c          -0.17     0.71
d           0.50     0.88
e           0.14     0.86
ec         -0.42     0.85
dss        -0.10     0.80
Based on the data in Table II, the performance of the decision support system is indistinguishable from that of experienced clinicians. Its dispersion is higher than some clinicians' but lower than others'. Its bias is very small. In contrast, the examining clinicians' diagnostic scores do stand out from those of the panel of clinicians, as is indicated by the large negative bias score. This seems reasonable, because the clinicians who conducted these evaluations had access to data that the other clinicians lacked: they examined the actual children. Furthermore, they likely reviewed videotapes of their evaluations and discussed difficult cases with other clinicians before arriving at a diagnosis and writing their clinical reports. Therefore their diagnostic opinions may be the most valid. The decision support system's diagnoses were designed to fall between those of the examining clinicians and the mean of the panel of experienced clinicians. The low bias and dispersion of the decision support system indicate its robustness.
When experienced clinicians have an opportunity to confer and discuss their diagnostic opinions, differences in opinion that they may have usually decrease. At the time of their discussions, panel members were shown both the mean and range of diagnostic scores for each case. The intended purpose of these discussions was to clarify information about a case, not to develop a consensus, but the range of diagnostic scores was smaller after the discussions. This reduction in range may be due to panel members considering additional diagnostic factors during their discussions beyond those considered when their initial diagnoses were given. It may also reflect, in part, some panel members' decisions to modify their initial diagnostic scores to more closely approximate the panel's mean score.
One purpose of Second Opinion is to allow an inexperienced clinician to "discuss" findings obtained from a diagnostic evaluation with a panel of experts and to learn the diagnostic opinions of the panel of experts. Such "discussions" are intended to increase the reliability of inexperienced clinicians' diagnoses.
The output of Second Opinion is robust. Three people used it with the same twenty-five clinical reports described above. One was the chief scientist for our decision support systems, who has had over twenty-five years of experience evaluating children with fluency problems. The second was a doctoral candidate in Communication Sciences and Disorders at Syracuse University who had acquired substantial previous experience with the decision support system. The third was a master's degree student at the University of Arizona who had little experience evaluating young stutterers and no previous experience with any decision support system. Their diagnostic opinions evidenced little difference. The case about which there was the most disagreement, Couzens, led to further review of that clinical report and the identification of ambiguities that may have caused this discrepancy. Use of Childhood Stuttering: A Second Opinion appears to promote diagnostic opinions that are indistinguishable from those of a panel of five experienced clinicians regardless of a clinician's training and background. One significant finding is that the diagnoses of three people with widely differing backgrounds evidenced less variability when using the decision support system than did those of five experienced clinicians rating the same clinical reports without the decision support system.
As a penultimate test, the system was given to 30 students in an Expert Systems class at the University of Arizona. It is now undergoing its ultimate test as it is used by customers. Most of the concerns of both students and customers involve the installation procedure. No one yet has pointed out errors or suggested improvements in its knowledge base.
This decision support system is meant to assist, not replace, inexperienced speech-language clinicians. One thing a human does that a computer system does not do is track a child's behavior over time, looking for improvement or a worsening of signs and symptoms. In the future we intend to build a version of this system for teaching that will include explanations and pertinent references for each question asked. Such systems should prove to be useful in training clinical skills as well as in supporting the skills of inexperienced clinicians.
VII. Conclusions
It is difficult to detect errors in a decision support system. Therefore, much effort was devoted during the knowledge extraction process to ensure that errors did not creep into the knowledge base. Because all such knowledge came from and was captured by fallible humans, it is unreasonable to expect a knowledge base to be free of errors. Therefore, several tools and procedures were developed to help detect errors in a decision support system's knowledge base. Each was designed to work with existing tools, and each added additional complexity to the testing procedures, and, we think, credibility to the decision support system's performance. Finally we compared the output of the system to the evaluations of human experts. When using Childhood Stuttering: A Second Opinion, three clinicians with widely differing backgrounds in stuttering produced diagnostic opinions that evidenced little variability and were indistinguishable from those of a panel of five experienced clinicians.
In the software industry, the manufacturing process itself is subsumed in significance by the development process. While the former requires physical quality control, such as uncontaminated storage media and so on, rigorous validation and verification at the development stage is what ensures the quality of the final product. In the case discussed in this paper, the need for the domain experts' knowledge to control development while simultaneously ensuring the technical soundness of the knowledge base was of paramount concern. To this end, several innovations, reported above, had to be made. While the generalizability of such innovations needs further examination, the point we seek to make is simply this: within the overall need for rigorous verification and validation, the specific method(s) to be used are often dictated by the specifics of the product and the circumstances in which it is built.
Acknowledgments
This work was supported by the National Institute of Child Health and Human Development Grant R44 HD26209 and by Bahill Intelligent Computer Systems. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the National Institute of Child Health and Human Development.
REFERENCES
Adelman, L. 1992. Evaluating Decision Support and Expert Systems. New York: John Wiley & Sons.

Jafar, M. and Bahill, A. T. 1991. Interactive verification and validation with Validator. Chapter 4 in Bahill, A. T. ed., Verifying and Validating Personal Computer-Based Expert Systems. Englewood Cliffs: Prentice Hall.

Jafar, M. and Bahill, A. T. 1993. Interactive verification of knowledge-based systems. IEEE Expert 8(1): 25-32.

Kang, Y. and Bahill, A. T. 1990. A tool for detecting expert system errors. AI Expert 5(2): 46-51.