Methods for Evaluating the Performance of Diagnostic
Tests in the Absence of a “Gold Standard”:
A Latent Class Model Approach
Elizabeth S. Garrett
Division of Biostatistics
Johns Hopkins University
December 9, 2002
Evaluating Diagnostic Criteria
• Relatively few areas of medicine have true “gold standard”
tests, where the test is perfectly accurate.
– “Pathognomonic indicators”
– When indicator is present, disease is present
– When indicator is absent, disease is absent
• Other situations:
– Combination of signs and symptoms provides very
accurate diagnosis.
– Disease process is not well understood: controversy
exists about how to define diagnosis.
– Disease process is well understood but measuring
disease via signs and symptoms is difficult.
Diagnostic Criteria in Psychiatry
• Currently, the DSM (Diagnostic and Statistical Manual of
Mental Disorders) is the standard for defining mental
disorders.
• Diagnostic algorithms are provided with which a
determination of disorder absence or presence can be made
• Examples: major depression, schizophrenia, autism,
alcoholism, generalized anxiety disorder.
Major Depressive Episode, as diagnosed by the DSM-IV (APA, 1994)
A. A person who suffers from major depressive disorder must either have depressed mood or
a loss of interest or pleasure in daily activities for at least a 2 week period.
B. The disorder is characterized by the presence of five or more of the following nine
symptoms:
1. depressed mood most of the day, nearly every day, as indicated by either subjective report
or observation made by others.
2. markedly diminished interest or pleasure in all, or almost all, activities most of the day,
nearly every day.
3. significant weight loss when not dieting or weight gain, or decrease or increase in appetite
nearly every day.
4. insomnia or hypersomnia nearly every day.
5. psychomotor agitation or retardation nearly every day.
6. fatigue or loss of energy nearly every day.
7. feelings of worthlessness or excessive inappropriate guilt nearly every day.
8. diminished ability to think or concentrate, or indecisiveness, nearly every day.
9. recurrent thoughts of death, recurrent suicidal ideation without a specific plan, or a suicide
attempt or a specific plan for committing suicide.
Symptoms are not better accounted for by bereavement; that is, the symptoms persist for longer than
2 months or are characterized by marked functional impairment, morbid preoccupation with
worthlessness, suicidal ideation, psychotic symptoms, or psychomotor retardation.
How do we validate the DSM criteria?
• How can we be sure that these definitions are valid
measures?
• How can we determine the sensitivity and specificity of
these measures?
• Is there a gold standard?
• Is a psychiatrist’s diagnosis a gold standard?
• What types of individuals are the diagnostic criteria
diagnosing as depressed?
• How often are individuals misdiagnosed?
• What are the implications of a positive or negative
diagnosis?
Example: Major Depression
• Epidemiologic Catchment Area Study (ECA): Collected mental
health data on individuals in 5 cities, beginning in 1981.
• Our sample: epidemiologic sample of 1322 individuals in the East
Baltimore area collected in 1993 (wave 3).
• Depression questions are from Diagnostic Interview Schedule,
which has been shown to be valid and reliable (Robins et al., 1981)
• Questions about symptom presence were asked for the DSM
major depression symptoms and coded as “present” if the
symptoms occurred in the same two week period.
• Symptom groups: some questions ask about the same type of
symptom:
– Have you had trouble sleeping?
– Have you had trouble waking?
– Do you sleep too much?
• Related symptoms are categorized into the same symptom group.
Distribution of Symptoms
ECA Wave 3, N = 1322

Group  Symptoms                                                              Prevalence
1      Depressed mood                                                        0.06
2      Disinterest in sex; less fun; loss of enjoyment                       0.08
3      Reduced energy/fatigued                                               0.05
4      Reduced concentration; slow thoughts; indecisive                      0.04
5*     Feel inferior; lacking self-confidence
6      Guilty/sinful
7      Ideas of self-harm; suicide attempts; suicidal thoughts; want to die
8      Trouble falling asleep; sleep too much; trouble waking
9      Loss of appetite; increased appetite; weight gain; weight loss
10     Slow movement; fast movement; fidgety

Remaining prevalences for groups 5 to 10, in source order (row assignment
not recoverable): 0.08, 0.03, 0.02, 0.09, 0.05, 0.04
Evaluating the DSM Criteria
• Without an available gold standard, we resort to other
methods
• Suppose that the proposed symptom (groups) define
depression.
• Without relying on the DSM definition of depression but
imposing model assumptions, what types of symptom
patterns are observed in the data?
• Do individuals tend to “cluster” into categories based on
symptom response patterns?
• We can evaluate this using a ‘Latent Class Model.’
• Categorical analog of factor analysis.
The Latent Class Model
• Assumes that each individual in the population is a member of
one of M latent classes.
• Each of the classes is defined by a vector of symptom
prevalences, pm = (p1m, p2m, …pKm) where there are K symptoms,
m = 1,…,M.
• The vector yi = (yi1, yi2, …, yiK) is individual i’s binary vector of
symptom responses, i = 1,…,N.
• The proportion of individuals in class m is denoted by πm.
• The true, yet unobserved, latent class of individual i is denoted
by ηi, where ηi ∈ {1,2,…,M}.
• The symptoms “define” the latent variable of interest.
• M is fixed.
• Conditional Independence: Given class membership, symptoms
are independent.
Graphical Depiction of the Latent Class Model
class 1 (η = 1): prevalences p11, p21, …, pK1 → responses yi1, yi2, …, yiK
class 2 (η = 2): prevalences p12, p22, …, pK2 → responses yi’1, yi’2, …, yi’K
class 3 (η = 3): prevalences p13, p23, …, pK3 → responses yi’’1, yi’’2, …, yi’’K
Statistical Details
Probability distribution of Yi:
$$
P(Y_i = y_i) = P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2}, \ldots, Y_{iK} = y_{iK})
             = \sum_{m=1}^{M} \pi_m \prod_{k=1}^{K} p_{km}^{y_{ik}} (1 - p_{km})^{(1 - y_{ik})}
$$
Likelihood function:
$$
L(\pi, p \mid Y) = \prod_{i=1}^{N} \sum_{m=1}^{M} \pi_m \prod_{k=1}^{K} p_{km}^{y_{ik}} (1 - p_{km})^{(1 - y_{ik})}
$$
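As a minimal sketch, the mixture probability and the log of the likelihood can be evaluated directly from the two formulas. The function names and the parameter values below are hypothetical and chosen only for illustration; they are not estimates from the ECA data.

```python
import itertools
import math

def pattern_probability(y, pi, p):
    """P(Y_i = y): sum over classes of pi_m * prod_k p_km^y_k * (1 - p_km)^(1 - y_k).

    y  : sequence of K binary responses
    pi : sequence of M class proportions
    p  : p[k][m] is the prevalence of symptom k in class m
    """
    total = 0.0
    for m, pi_m in enumerate(pi):
        term = pi_m
        for k, y_k in enumerate(y):
            term *= p[k][m] if y_k == 1 else 1.0 - p[k][m]
        total += term
    return total

def log_likelihood(Y, pi, p):
    """log L(pi, p | Y): sum over individuals of log P(Y_i = y_i)."""
    return sum(math.log(pattern_probability(y, pi, p)) for y in Y)

# Hypothetical parameters: two classes, three symptoms (illustration only).
pi = [0.9, 0.1]
p = [[0.02, 0.60], [0.05, 0.70], [0.01, 0.50]]
Y = [(0, 0, 0), (1, 1, 0), (0, 0, 0)]

print(log_likelihood(Y, pi, p))
# Sanity check: the probabilities of all 2^K response patterns should sum to 1.
print(sum(pattern_probability(y, pi, p) for y in itertools.product([0, 1], repeat=3)))
```

Working on the log scale avoids numerical underflow when N is large, which is why the sketch sums logs rather than multiplying probabilities.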
Interpretation
• Two class model:
– A non-depressed class which reports on average no
symptoms (93% of sample)
– A depressed class which reports on average 4 to 5 of the
10 symptoms
• Three class model:
– A non-depressed class which reports on average no
symptoms (88% of sample)
– A mildly depressed class which reports on average 2 to
3 of the 10 symptoms (9% of sample)
– A severely depressed class which reports on average 6
to 7 of the 10 symptoms (3% of sample)
• The three class model is deemed more appropriate from a
statistical standpoint (model fit, adherence to model
assumptions)
Results of Estimation
• p matrix
• π vector
⇒ Posterior probability of class membership:
– Tells us probability that individual i is in one of the
classes, given his response pattern.
$$
P(\eta_i = r \mid y_i) = \frac{P(y_i \mid \eta_i = r)\, P(\eta_i = r)}{P(y_i)}
                       = \frac{P(y_i \mid \eta_i = r)\, P(\eta_i = r)}{\sum_{m=1}^{M} P(y_i \mid \eta_i = m)\, P(\eta_i = m)}
$$
Examples: Assume M = 2
• Individual reports absence of all symptoms:
$$
y^* = (0,0,0,0,0,0,0,0,0,0,0)
$$
$$
P(\eta_i = 1 \mid Y_i = y^*)
  = \frac{P(Y_i = y^* \mid \eta_i = 1)\, P(\eta_i = 1)}
         {P(Y_i = y^* \mid \eta_i = 1)\, P(\eta_i = 1) + P(Y_i = y^* \mid \eta_i = 2)\, P(\eta_i = 2)}
$$
$$
  = \frac{\left[\prod_{k=1}^{K} p_{k1}^{y_k^*} (1 - p_{k1})^{(1 - y_k^*)}\right] \pi_1}
         {\left[\prod_{k=1}^{K} p_{k1}^{y_k^*} (1 - p_{k1})^{(1 - y_k^*)}\right] \pi_1
          + \left[\prod_{k=1}^{K} p_{k2}^{y_k^*} (1 - p_{k2})^{(1 - y_k^*)}\right] \pi_2}
  = 0.999
$$
Examples: Assume M = 2
• Individual reports only fatigue and sleep problems:
$$
y^* = (0,0,1,0,0,0,0,0,1,0,0)
$$
$$
P(\eta_i = 1 \mid Y_i = y^*) = 0.87, \qquad P(\eta_i = 2 \mid Y_i = y^*) = 0.13
$$
• Individual reports all symptoms except self-esteem and guilt:
$$
y^* = (1,1,1,1,0,0,1,1,1,1,1)
$$
$$
P(\eta_i = 1 \mid Y_i = y^*) = 0.001, \qquad P(\eta_i = 2 \mid Y_i = y^*) = 0.999
$$
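Posterior class probabilities like those in the examples follow from Bayes' rule once p and π are known. The sketch below shows the calculation for a small hypothetical model; the class proportions echo the 93% figure for the non-depressed class, but the symptom prevalences are invented, so the printed values will not match the slide.

```python
def class_posterior(y, pi, p):
    """Return [P(eta_i = m | Y_i = y) for each class m] via Bayes' rule.

    p[k][m] is the prevalence of symptom k in class m; pi[m] is the class proportion.
    """
    joint = []
    for m, pi_m in enumerate(pi):
        lik = 1.0
        for k, y_k in enumerate(y):
            lik *= p[k][m] if y_k == 1 else 1.0 - p[k][m]
        joint.append(lik * pi_m)      # P(y | eta = m) * P(eta = m)
    marginal = sum(joint)             # P(y), the denominator
    return [j / marginal for j in joint]

# Hypothetical two-class, three-symptom parameters (illustration only).
pi = [0.93, 0.07]
p = [[0.02, 0.60], [0.03, 0.70], [0.02, 0.50]]

for y in [(0, 0, 0), (0, 0, 1), (1, 1, 1)]:
    print(y, [round(v, 3) for v in class_posterior(y, pi, p)])
```

Patterns with no symptoms are pulled strongly toward the first class and patterns with all symptoms toward the second, while intermediate patterns receive genuinely split probabilities.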
Estimation Options
• Maximum Likelihood Approach
– Widely available
– Accepted approach
• Bayesian Approach
– Markov Chain Monte Carlo estimation
– Easily implemented in “WinBUGS” (Imperial College of Science,
Technology and Medicine: http://www.mrc-bsu.cam.ac.uk/bugs/)
– Benefits:
• Model checking methods
• ‘Identifiability’ can be assessed (Garrett and Zeger, 2000)
⇒ MCMC approach allows estimation of ANY
function of parameters and standard errors.
Bayesian Estimation Approach
The Gibbs sampler is an iterative process used to estimate posterior
distributions of parameters.
– We sample parameters from conditional distributions,
e.g. P(π1 | Y, p, η, π2, π3)
– At each iteration, we get ‘sampled’ values of p, π, and η.
– We use the samples from the iterations to estimate posterior distributions
by averaging over other parameter values.
[Figure: estimated posterior density P(π1 | y); horizontal axis π1, ticks from 0.10 to 0.18]
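The Gibbs steps can be sketched in plain Python. This is an illustrative sampler, not the WinBUGS model used for the results here: it assumes a Dirichlet(1,…,1) prior on π and Beta(1, 1) priors on each prevalence (the slides do not state the priors), and it makes no attempt to handle label switching. The function name `gibbs_lcm` and the toy data are hypothetical.

```python
import random

def gibbs_lcm(Y, M, n_iter=200, seed=0):
    """Gibbs sampler for a latent class model with binary items.

    Assumed priors (not from the slides): Dirichlet(1,...,1) on pi, Beta(1, 1) on p[k][m].
    Returns one (pi, p, eta) draw per iteration.
    """
    rng = random.Random(seed)
    N, K = len(Y), len(Y[0])
    pi = [1.0 / M] * M
    p = [[rng.uniform(0.2, 0.8) for _ in range(M)] for _ in range(K)]
    draws = []
    for _ in range(n_iter):
        # Step 1: eta_i | Y, pi, p is categorical with weights pi_m * P(y_i | class m).
        eta = []
        for y in Y:
            weights = []
            for m in range(M):
                w = pi[m]
                for k in range(K):
                    w *= p[k][m] if y[k] == 1 else 1.0 - p[k][m]
                weights.append(w)
            eta.append(rng.choices(range(M), weights=weights)[0])
        # Step 2: pi | eta is Dirichlet(1 + n_1, ..., 1 + n_M), drawn via normalized gammas.
        counts = [eta.count(m) for m in range(M)]
        g = [rng.gammavariate(1 + c, 1.0) for c in counts]
        g_sum = sum(g)
        pi = [x / g_sum for x in g]
        # Step 3: p_km | Y, eta is Beta(1 + #present, 1 + #absent) among members of class m.
        for k in range(K):
            for m in range(M):
                present = sum(Y[i][k] for i in range(N) if eta[i] == m)
                p[k][m] = rng.betavariate(1 + present, 1 + counts[m] - present)
        draws.append((pi[:], [row[:] for row in p], eta[:]))
    return draws

# Hypothetical toy data: 40 symptom-free individuals and 10 reporting all three symptoms.
Y = [(0, 0, 0)] * 40 + [(1, 1, 1)] * 10
draws = gibbs_lcm(Y, M=2, n_iter=200, seed=1)
kept = draws[50:]  # discard an arbitrary burn-in period
# Posterior mean of the smaller class proportion (sidesteps label switching in this toy case).
print(sum(min(pi_draw) for pi_draw, _, _ in kept) / len(kept))
```

Because every iteration yields a full set of p, π, and η, any function of those quantities can be computed per iteration and summarized afterwards.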
Evaluating Depression Diagnosis
• Assumption: Treat the latent class model as our “gold standard”
definition of depression.
• We can use the symptom responses to evaluate the DSM-IV diagnosis
of depression
• Compare the DSM diagnosis to the latent class diagnosis using standard
definitions:
• Assume two classes of depression
– Class 1 is non-depressed class
– Class 2 is depressed class
$$
\text{Sensitivity} = P(\text{test}+ \mid \text{truly depressed}) = P(\text{DSM-IV diagnosis} \mid \text{class} = 2)
$$
$$
\text{Specificity} = P(\text{test}- \mid \text{truly not depressed}) = P(\text{no DSM-IV diagnosis} \mid \text{class} = 1)
$$
More specifically…
$$
SE(2) = P(\text{diagnosed as depressed} \mid \eta_i = 2)
$$
$$
SP(1) = P(\text{diagnosed as not depressed} \mid \eta_i = 1)
$$
$$
SE(j) = \sum_{r \in R} P(y_i = y_r \mid \eta_i = j), \qquad
SP(j) = \sum_{r' \in R^c} P(y_i = y_{r'} \mid \eta_i = j)
$$
where {y_r : r ∈ R} is the set of symptom patterns that are
classified as a diagnosis by the DSM-IV.
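The sums over R and its complement can be computed by enumerating all 2^K symptom patterns. In the sketch below, the rule "at least 2 of 3 symptoms" is a hypothetical stand-in for a diagnostic algorithm such as the DSM-IV, and the prevalences are invented for illustration.

```python
import itertools

def pattern_given_class(y, p, m):
    """P(y | eta = m) under conditional independence; p[k][m] = symptom prevalence."""
    prob = 1.0
    for k, y_k in enumerate(y):
        prob *= p[k][m] if y_k == 1 else 1.0 - p[k][m]
    return prob

def sensitivity(p, m, is_diagnosed):
    """SE(m): probability, within class m, of the patterns in the diagnosed set R."""
    K = len(p)
    return sum(pattern_given_class(y, p, m)
               for y in itertools.product([0, 1], repeat=K) if is_diagnosed(y))

def specificity(p, m, is_diagnosed):
    """SP(m): probability, within class m, of the patterns in the complement of R."""
    K = len(p)
    return sum(pattern_given_class(y, p, m)
               for y in itertools.product([0, 1], repeat=K) if not is_diagnosed(y))

def at_least_two(y):
    """Hypothetical diagnostic rule: positive when 2 or more symptoms are present."""
    return sum(y) >= 2

# Invented prevalences: column 0 = non-depressed class, column 1 = depressed class.
p = [[0.02, 0.60], [0.05, 0.70], [0.01, 0.50]]
print(sensitivity(p, 1, at_least_two))  # SE for the depressed class
print(specificity(p, 0, at_least_two))  # SP for the non-depressed class
```

With K = 10 symptom groups there are only 1024 patterns, so full enumeration remains cheap.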
Predictive Values
• Positive and Negative Predictive Values are simply
transformations of SE and SP:
$$
PPV(j) = P(\eta_i = j \mid \text{diagnosed as depressed})
       = \frac{SE(j)\, \pi_j}{P(\text{diagnosed as depressed})}
$$
$$
NPV(j) = P(\eta_i = j \mid \text{diagnosed as not depressed})
       = \frac{SP(j)\, \pi_j}{P(\text{diagnosed as not depressed})}
$$
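A short sketch of the transformation: the denominators follow from the law of total probability, P(diagnosed as depressed) = Σ_j SE(j) π_j, and SP(j) = 1 − SE(j) because every pattern is either diagnosed or not. The class proportions are the three-class values quoted earlier (88%, 9%, 3%); the SE values are hypothetical.

```python
def predictive_values(se, pi):
    """Return (ppv, npv) lists indexed by class.

    se[j] = P(diagnosed as depressed | class j); pi[j] = proportion in class j.
    ppv[j] = P(class j | diagnosed as depressed)
    npv[j] = P(class j | diagnosed as not depressed)
    """
    sp = [1.0 - s for s in se]                     # SP(j) = 1 - SE(j)
    p_pos = sum(s * q for s, q in zip(se, pi))     # P(diagnosed as depressed)
    p_neg = sum(s * q for s, q in zip(sp, pi))     # P(diagnosed as not depressed)
    ppv = [s * q / p_pos for s, q in zip(se, pi)]
    npv = [s * q / p_neg for s, q in zip(sp, pi)]
    return ppv, npv

pi = [0.88, 0.09, 0.03]    # three-class proportions quoted in the slides
se = [0.001, 0.05, 0.90]   # hypothetical P(diagnosed | class), for illustration only
ppv, npv = predictive_values(se, pi)
print([round(v, 3) for v in ppv])
print([round(v, 3) for v in npv])
```

Under this definition the PPV values sum to 1 across classes, as do the NPV values, since each is a distribution over classes given the diagnosis.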
Class assignment?
• Complication: latent class model provides us with
“posterior probabilities” of class membership. We don’t
know the true latent classes, η, for individuals in the dataset.
• Example: M =3
– Posterior probabilities of class membership for a
particular symptom pattern are 0.48, 0.48, 0.04.
– To which class should this individual be assigned?
– How do we account for the uncertainty in the
assignment?
One Approach to Class Assignment
• “Pseudo-classes” (Maximum Likelihood)
– assign individuals to “pseudo-classes” based on posterior probability of class
membership (Bandeen-Roche et al., 1997)
– recall that posterior probability is based on observed pattern
– e.g. individual with 0.20, 0.05, 0.75
• better chance of being in class 3
• not necessarily in class 3
• Using class assignment, we can calculate sensitivity and specificity.
• We can repeat the assignment procedure T times, where T is large.
• On average, the sensitivity and specificity estimates will be correct.
• Drawback: we don’t get precision associated with estimates.
⇒ Standard deviation of repeated estimates does not account for
imprecision in estimates of p and π.
⇒ Confidence intervals based on the T repeats will be too narrow.
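The pseudo-class procedure can be sketched as repeated random assignment followed by averaging. The helper names, the posterior probabilities (apart from the 0.999/0.001 and 0.87/0.13 pairs quoted in the examples), and the diagnosis indicators are hypothetical.

```python
import random

def draw_pseudo_classes(posteriors, rng):
    """Draw one class per individual from that individual's posterior probabilities."""
    return [rng.choices(range(len(post)), weights=post)[0] for post in posteriors]

def sens_spec(diagnosed, classes, depressed_class):
    """Sensitivity and specificity of `diagnosed` (0/1) against one set of assigned classes."""
    dep = [d for d, c in zip(diagnosed, classes) if c == depressed_class]
    non = [d for d, c in zip(diagnosed, classes) if c != depressed_class]
    se = sum(dep) / len(dep) if dep else float("nan")
    sp = sum(1 - d for d in non) / len(non) if non else float("nan")
    return se, sp

def averaged_sens_spec(diagnosed, posteriors, depressed_class, T=1000, seed=0):
    """Average sensitivity and specificity over T pseudo-class assignments."""
    rng = random.Random(seed)
    results = [sens_spec(diagnosed, draw_pseudo_classes(posteriors, rng), depressed_class)
               for _ in range(T)]
    # Drop repeats in which one group happened to be empty (NaN != NaN).
    valid = [(se, sp) for se, sp in results if se == se and sp == sp]
    return (sum(se for se, _ in valid) / len(valid),
            sum(sp for _, sp in valid) / len(valid))

# Hypothetical two-class posteriors for five individuals and their diagnosis indicators.
posteriors = [(0.999, 0.001), (0.87, 0.13), (0.001, 0.999), (0.40, 0.60), (0.95, 0.05)]
diagnosed = [0, 0, 1, 0, 0]
print(averaged_sens_spec(diagnosed, posteriors, depressed_class=1, T=1000))
```

Note that the posteriors are held fixed across the T repeats, which is exactly why the spread of the repeated estimates ignores uncertainty in p and π.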
MCMC Approach to Class Assignment
• η is a vector of parameters
• At each iteration in the Gibbs sampler, each parameter is
drawn from its conditional distribution.
⇒ At each iteration in the Gibbs sampler, individuals are
automatically assigned to classes: no need to
“manually” assign.
• For each of the W iterations of the chain, we can calculate
sensitivity and specificity.
• Sensitivity and specificity are simply additional parameters.
⇒ Due to the nature of the MCMC approach, the standard
deviation of the posterior distribution of sensitivity
represents its standard error.
⇒ Precision estimates for sensitivity and specificity are
valid.
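Treating sensitivity as an additional parameter amounts to computing it once per iteration from that iteration's sampled η and then summarizing the W values. The sketch below assumes the η draws are already available from a sampler; the draws and diagnosis indicators shown are hypothetical, and the percentile interval is a deliberately crude one.

```python
import statistics

def sensitivity_draws(eta_draws, diagnosed, depressed_class):
    """One sensitivity value per MCMC iteration, using that iteration's sampled classes."""
    values = []
    for eta in eta_draws:
        members = [d for d, c in zip(diagnosed, eta) if c == depressed_class]
        if members:  # skip iterations in which the class happens to be empty
            values.append(sum(members) / len(members))
    return values

def posterior_summary(values, level=0.95):
    """Posterior mean, standard deviation, and a simple percentile interval."""
    s = sorted(values)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return statistics.mean(s), statistics.stdev(s), (lo, hi)

# Hypothetical inputs: diagnosis indicators for six individuals, four iterations of eta.
diagnosed = [1, 1, 0, 0, 0, 0]
eta_draws = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
]
values = sensitivity_draws(eta_draws, diagnosed, depressed_class=1)
mean, sd, interval = posterior_summary(values)
print(mean, sd, interval)
```

Because η, p, and π all vary from iteration to iteration, the spread of these per-iteration values reflects parameter uncertainty as well as assignment uncertainty.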
Operating Characteristics of Depression Diagnoses
• Several definitions of depression:
– DSM-III
– DSM-IV
– ICD-10a (mild)
– ICD-10b (moderate)
– ICD-10c (severe)
• We calculate sensitivity and specificity for each of five
diagnoses (above) for models with M = 2 and M = 3.
• We do the same for PPV and NPV.
• Vertical lines represent 95% posterior intervals.
Interpreting results from three class model
• Diagnoses only have two possibilities: depressed or not
depressed
• Two class model also has two possibilities.
⇒ Three class model has a non-depressed class and two
depression classes (mild and severe).
• Should we think of BOTH or just SEVERE as the
“treatment class”?
• Why does it matter?
– Clinical decision making
– “Pre-clinical” depression?
• Which is better?
Misclassification probabilities for identifying “severe depression”
using the DSM-IV criteria

                        Two-class model   Three-class model
P(false positive)       < 0.001           0.004
P(false negative)       0.035             0.002
P(misclassification)    0.035             0.006
Misclassification probabilities for identifying “any depression”
using the DSM-IV criteria

                        Two-class model   Three-class model
P(false positive)       < 0.001           < 0.001
P(false negative)       0.035             0.078
P(misclassification)    0.035             0.078
Revisiting questions….
• Recall that the three class model was chosen
over the two class model as more
appropriate.
• We answer questions posed earlier by
examining agreement of DSM-IV and the
three class model.
What types of individuals are the diagnostic
criteria diagnosing as depressed?
• DSM-IV tends to diagnose individuals who are in ‘class 3’
of the three class model (i.e. our severe depression class)
• The mildly depressed class tends to be ignored.
• Not necessarily a bad thing:
– DSM criteria are developed for deciding treatment.
– If mild depression does not require any treatment, then
diagnosis of DSM-IV is adequate.
• But what if:
– Class 2 individuals (i.e., mildly depressed) would benefit
from treatment.
– Class 2 is a “pre-clinical” class: intervention could
prevent transition to severe depression
How often are individuals misdiagnosed?
• Assuming that diagnosis of severely depressed individuals
is the intent of DSM-IV, there is LOW probability of
misclassification:
P(misclassification) = 0.006
• If intent is to diagnose ANY depression (i.e., mild
or severe), then there is much higher probability of
misclassification:
P(misclassification) = 0.078
(Note that of these 8%, almost all are false negatives)
What are the implications of a positive or
negative diagnosis?
• The DSM-IV has high PPV for severe depression:
PPV(3) ≈ 0.90
• High NPV for no depression:
NPV(1) ≈ 0.90
• Essentially no information is provided as to an individual’s
likelihood of mild depression given either a negative or a
positive diagnosis:
PPV(2) ≈ 0.10
NPV(2) ≈ 0.10
Issues and Concerns
• Operating characteristics assume that the two types of
diagnosis being compared are determined independently.
– Methods of assessment are different
– But, large overlap of symptoms
– Possibly/probably not truly independent
• Conditional independence of tests given simply presence
or absence of disease is a common problem.
– Tests may be independent given “continuum” level of disease, but
not when disease status is simply categorized.
– However, the latent class model does not definitively assign
individuals to classes. Instead, posterior probabilities are estimated.
– Because individuals are assigned posterior probabilities, we can
more easily think of a “continuum” of disease.
– This is true even in the case of classes which are not ordinal in
nature, because the posterior probabilities for each class will be
continuous.
Conclusions
• DSM-IV appears to be a valid approach for diagnoses of
“severe” depression.
• There appears to be another class of milder depression that
is not identified by any of the depression definitions.
• By using an MCMC approach to latent class model
estimation, we can estimate operating characteristics of
tests and their standard errors in a straightforward way.
• This approach can be used quite generally for other
medical diagnoses
– Psychiatric diagnoses
– Arthritis