Diagnostic Testing & Predictive
Models
John Kwagyan, PhD
Howard University College of Medicine
Design, Biostatistics & Population Studies
GHUCCTS
1
'Physicians must be content to end not in certainties,
but rather in statistical probabilities.
The physician thus has a right to feel certain, within
statistical constraints, but never cocksure.
Absolute certainty remains for some theologians - and
like-minded physicians.'
2
Am J Cardiol 1975;36:592-6
Objective
To understand the usage of diagnostic measures and
screening tools
3
Outline
• Examples
• Why/What Diagnostic Testing
• Measures of Diagnostic Accuracy
• ROC Curves
• Adaptation of Diagnostic/Screening Tools
• Predictive Models
4
EXAMPLES
5
4P’s Plus Screening Instrument
-Substance Abuse in Pregnant Women
What is a positive assessment??
6
J. Perinatology 2005
Index to Predict Relapse in Asthma
Factor                 Score 0              Score 1
Pulse                  <120                 >=120
Respiration            <30                  >=30
Pulsus Paradoxus       <18                  >=18
Peak Flow Rate         >120                 <=120
Dyspnea                Absent or mild       Moderate or severe
Accessory muscle use   Absent or mild       Moderate or severe
Wheezing               Moderate or severe   Absent or mild
Positive Test => Score 4 or more
7
Fischl et al, NEJM 1981
Validation of the 4P's Plus© Screen for Substance Use in
Pregnancy - J. Perinatology (2007)
Study Design:
A total of 228 pregnant women underwent screening….
Reliability, sensitivity, specificity, and positive and negative predictive
validity were assessed.
Result:
Overall reliability for the five-item measure was 0.62.
Seventy-four (32.5%) of the women had a positive screen.
Sensitivity and specificity were very good, at 87 and 76%, respectively.
Positive predictive validity was low (36%),
Negative predictive validity was quite high (97%).
Conclusion:
The 4P's Plus reliably and effectively screens pregnant women for risk
of substance use, including those women typically missed by other
perinatal screening methodologies.
8
Maternal Biochemical Serum Screening for Down
Syndrome in Pregnancy With HIV Infection
To estimate the influence of HIV infection and
antiretroviral therapy on maternal serum markers
levels and the false-positive rate with
biochemical maternal serum screening for Down
syndrome.
Obstetrics & Gynecology
9
Inability to Predict Relapse in
Acute Asthma NEJM 1984;310(9)
• Fischl & Co. developed an index to predict relapse in patients with
acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in
Richmond, VA?????
10
Other Examples
• SSAGA for alcohol dependence
• Genetic Screening for hereditary disease
• Etc
11
Why Diagnostic Testing
• Accurate screening/diagnosis of a health condition is
often a first step towards its prevention or control
• Need for fast, inexpensive and RELIABLE tools
12
Purpose of Diagnostic Testing
• A (binary) diagnostic test is designed to determine whether
a target condition is present or absent in a subject from
the intended use population.
• the target condition can refer to
- a particular disease
- a disease stage
- health status or condition
that should prompt clinical action, such as the initiation,
modification or termination of treatment, counseling, etc
13
Test Scale
• Binary
- presence or absence of a disease
(the underlying measure is usually continuous)
• Continuous (quantitative)
- biomarkers for cancer, e.g., PSA measured as serum concentration
- creatinine for kidney malfunction
- blood sugar for diabetes
- cholesterol for dyslipidemia
• Ordinal
- clinical symptoms
moderate, severe, highly severe
- Index Score : 0, 1, 2, 3, 4, 5
14
Other Test Scales
• Likert-type rating
- highly disagree, disagree, neutral, agree, highly agree
• Nominal **
- genotype groups
- ApoE Genotypes:
E2/E2, E2/E3, E2/E4, E3/E3, E3/E4, E4/E4
15
What is Diagnostic Testing?
• Evaluation of a (new) test to determine whether a target
condition is present by comparison with a benchmark!
• Evaluation of the ability of a test to classify subjects as
diseased or disease-free
• For non-binary scales, a classification rule is set by a threshold
- PSA > 4.0ng/ml
- Blood glucose > 126 mg/dL
- BP >140/90 mmHg
-used to be 160/95 mmHg
- contemplating 130/85 mmHg???
16
'That we are in the midst of crisis is now well understood.
Our nation is at war…
Our economy is badly weakened…
Homes have been lost; jobs shed; businesses shuttered.
Our health care is too costly; our schools fail too many…
These are the indicators of crisis, subject to data
and statistics.'
Pres. Barack Obama (Inaugural speech)
Benchmarks for Comparison
1. comparison to a reference (Gold) standard
- considered to be the best available method for establishing the presence or
absence of the target condition
- can be a single method, or a combination of methods, including clinical
follow-up.
2. comparison to a non-reference standard
- method other than a reference (Gold) standard.
Note!!!: The choice of comparative method will determine
which performance measures may be reported
18
Some Conventional Tests
• Bacterial cultures
- strep throat, urinary tract infection, meningitis, etc
• Imaging technology
– X-ray for bone fracture,
- CT scans for brain injury
- MRI for brain injury
• Biochemical markers
- serum creatine for kidney dysfunction
- serum bilirubin for liver dysfunction
- blood glucose for diabetes
- blood for HIV
19
Other Conventional Tests
• Expert judgment
presence or absence of a heart murmur
• Response to Questionnaire !!!!
substance abuse
• Expert Interview or Observation
schizophrenia, bipolar disorder, major depression
• Radiologists' scoring of mammograms
no cancer, benign, possible malignancy, malignancy
20
Measures of Accuracy
Validation
21
Validation
• it is the evaluation of the accuracy of the test
• can only be established by comparing with the Gold Standard
• validity is measured by sensitivity and specificity
The extent to which a test measures what it is
supposed to measure!!!
22
Measures for Accuracy
• Sensitivity => True Positive Rate
• Specificity => True Negative Rate
• False Negative Rate (FNR) = 1- sensitivity
• False Positive Rate (FPR) = 1- specificity
• Predictive values
positive predictive value
negative predictive value
• Diagnostic Likelihood Ratios
LR+ = TPR/FPR = SenS/(1-SpeC)
LR- = FNR /TNR = (1-SenS)/SpeC
• ROC Curves
23
Sensitivity & Specificity
• Sensitivity is the ability of a test to correctly classify an
individual as ‘diseased’.
Estimated as the proportion of subjects with the target
condition in whom the test is positive
• Specificity is the ability of a test to correctly classify an
individual as ‘disease-free’.
Estimated as the proportion of subjects without the target
condition in whom the test is negative
Best illustrated using a 2 X 2 table!!!!!
24
Diagnostic Testing
                    True Disease Status
                    Diseased              Disease-free
Test Result
  Positive          No Error              Error I
  Negative          Error II              No Error
                    N1 = total diseased   N2 = total disease-free
25
Sensitivity & Specificity
                    True Disease Status
                    Diseased          Disease-free
Test Result
  Positive          True-positive     False-positive
  Negative          False-negative    True-negative

• Sensitivity ~ ability of a test to detect a disease when present
=> True-positive fraction
• Specificity ~ ability to indicate disease-free when absent
=> True-negative fraction
26
Consequence of Diagnostic Errors
• False negative errors, i.e., missing disease that is present.
-can result in people foregoing needed treatment for the disease
- the consequence can be as serious as death
• False positive errors, i.e., falsely indicating disease
- disease-free are subjected to unnecessary work-up
procedures or even treatment.
- negative impact include personal inconvenience and/or
unnecessary stress, anxiety, etc .
27
Estimating Sensitivity & Specificity
                  True Disease Status
Test              Diseased   Disease-free   Total
  Positive        TP         FP             T+
  Negative        FN         TN             T-
  Total           T_D        T_Df           T_N

(Estimate) SenS = TP/T_D
(Estimate) SpeC = TN/T_Df
False Negative Rate = FN/T_D = 1 - SenS
False Positive Rate = FP/T_Df = 1 - SpeC
28
Example
Coronary Artery Surgery Study (CAD)
Gold Standard: Arteriography

EST               Diseased   Disease-free   Total
  Positive        815        115            930
  Negative        208        327            535
  Total           1023       442            1465

SenS = 815/1023 = 0.79   CI = [0.77, 0.82]
FNF  = 208/1023 = 0.20   CI = [0.18, 0.23]
SpeC = 327/442  = 0.74   CI = [0.70, 0.78]
FPF  = 115/442  = 0.26   CI = [0.22, 0.30]
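These proportions and their confidence intervals can be recomputed in a few lines. A minimal Python sketch — the Wald normal-approximation interval is an assumption here, since the slide does not state which CI method was used:

```python
from math import sqrt

def rate_with_ci(k, n, z=1.96):
    """Proportion k/n with a normal-approximation (Wald) 95% CI."""
    p = k / n
    se = sqrt(p * (1 - p) / n)
    return p, (p - z * se, p + z * se)

# Coronary Artery Surgery Study: EST vs. arteriography (gold standard)
sens, sens_ci = rate_with_ci(815, 1023)  # true positives / total diseased
spec, spec_ci = rate_with_ci(327, 442)   # true negatives / total disease-free
print(round(sens, 2), [round(c, 2) for c in sens_ci])  # 0.8 [0.77, 0.82]
print(round(spec, 2), [round(c, 2) for c in spec_ci])  # 0.74 [0.7, 0.78]

# diagnostic likelihood ratios from the same estimates
lr_pos = sens / (1 - spec)  # TPR/FPR, about 3.1
lr_neg = (1 - sens) / spec  # FNR/TNR, about 0.27
```

A positive EST therefore raises the odds of CAD roughly threefold, matching the LR+ = SenS/(1 - SpeC) definition from the measures slide.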
29
Absolute certainty remains for
some theologians - and like-minded physicians!!!
30
If Ever There is a Perfect Test!
Ideal Case
                  True Disease Status
Test              Diseased      Disease-free
  Positive        TP            FP = 0
  Negative        FN = 0        TN
  Total           T_D (=TP)     T_Df (=TN)

SenS = TP/T_D  = 100%
SpeC = TN/T_Df = 100%
FNF = FN/T_D  = 1 - SenS = 0%
FPF = FP/T_Df = 1 - SpeC = 0%
31
Uninformative (Useless) Tests
• Test is uninformative, if test result is unrelated to disease status
• the probability distributions of the measure are the same in both
diseased and disease-free populations
• for uninformative tests,
SenS = 1- SpeC
TPF = FPF
Ex: Exercise Stress Test to determine Diabetes, HIV, etc
• Test is informative if:
SenS + SpeC > 1
32
Clinical Application
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy
Test                    SenS (%)   SpeC (%)
Intraocular Pressure    47         92
Torch Light Test        80         70
van Herick Test         61.9       89.3
Which test should we use to screen for PACG??
33
Indian J Ophthalmology 2008;56:45-50
Sensitivity vs. Specificity
Rule Out & Rule In
• Tests with a high degree of sensitivity have a low FNR,
- they ensure that not many true cases of the disease are
missed.
• A screening test which is used to "rule out" a diagnosis
should have a high degree of sensitivity.
• Tests with a high degree of specificity have a low FPR,
- they ensure that not many patients are misdiagnosed.
• A confirmatory test which is used to "rule in" a diagnosis
should have a high degree of specificity.
34
Clinical Application
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy
Test                    SenS (%)   SpeC (%)
Intraocular Pressure    47         92
Torch Light Test        80         70
van Herick Test         61.9       89.3
Which test should we use to screen for PACG??
How about Combining the tests !!!!
35
Combining Tests!
• Two ways of performing combination tests
- in parallel
- in series
• Two rules for combination
- "the OR rule"
- "the AND rule"
36
Combining 2 Tests
• “OR rule”
Test is positive, if either test is positive
Test is negative, if both are negative
SenS (Combo Test) = SenS1 + SenS2- SenS1*SenS2
SpeC (Combo Test) = SpeC1*SpeC2
• “AND rule”
Test is positive only if both A and B are positive
Test is negative if either A or B is negative
SenS (Combo Test) = SenS1*SenS2
SpeC (Combo Test) = SpeC1 + SpeC2 - SpeC1*SpeC2
37
Clinical Application
Combining Test
Detection of Primary Angle Closure Glaucoma
Gold Standard = Gonioscopy
Test                        SenS (%)   SpeC (%)
Torch Light Test            80         70
van Herick Test             62         89.3
Combined Test (OR rule)     92.4       62.3
Combined Test (AND rule)    49.6       97
38
Combining Tests
• "the OR rule" increases SenS
~ useful for screening tests to “rule out” diagnosis
• “the AND rule” increases SpeC
~ useful for confirmatory tests to “rule in” diagnosis
39
Predictive Values
Daring Clinical Questions
How likely is the disease, given the test result??
- what is the likelihood of disease when the test is positive?
- what is the likelihood of non-disease when the test is negative?
Answers to these questions are known as the predictive
values
40
Predictive Values
• +PV – fraction of test positives who are diseased
PPV = probability ( diseased | positive test)
• - PV – fraction of test negatives who are non-diseased
NPV = probability (disease-free | negative test)
41
Predictive Values
                  True Disease Status
Test              Diseased   Disease-free   Total
  Positive        TP         FP             T+
  Negative        FN         TN             T-
  Total           T_D        T_Df           T_N

Positive Predictive Value = TP/T+
Negative Predictive Value = TN/T-
42
Predictive Values
Example
                  True Disease Status
Test              Diseased   Disease-free   Total
  Positive        815        115            930
  Negative        208        327            535
  Total           1023       442            1465

PPV = 815/930 = 88%
NPV = 327/535 = 61%
43
Predictive Values
• A perfect test will predict perfectly, i.e.,
PPV = 1, NPV=1
• Predictive values depend on the prevalence of the disease
PPV decreases with decreasing prevalence
- low PPV may simply be a result of low prevalence
NPV decreases with increasing prevalence
• Useless Test if:
PPV = prevalence and NPV = 1- prevalence
• PV are not used to quantify the inherent accuracy of the test
44
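The dependence of predictive values on prevalence follows from Bayes' theorem. A sketch with a hypothetical test of 95% sensitivity and 95% specificity:

```python
def ppv(sens, spec, prev):
    # Bayes: P(D | T+) = sens*prev / [sens*prev + (1-spec)*(1-prev)]
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    # Bayes: P(no D | T-) = spec*(1-prev) / [spec*(1-prev) + (1-sens)*prev]
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

for prev in (0.50, 0.10, 0.01):
    print(prev, round(ppv(0.95, 0.95, prev), 3))
# 0.5 -> 0.95,  0.1 -> 0.679,  0.01 -> 0.161: same test, collapsing PPV
```

The same test classifies equally well at every prevalence, yet its PPV collapses as the condition becomes rare — so a low PPV may simply reflect low prevalence, exactly as the slide warns.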
Attributes of Measures
                     Classification Probabilities      Predictive Values
Perfect Test         SenS = 1, SpeC = 1                PPV = 1, NPV = 1
Useless Test         SenS = 1 - SpeC                   PPV = ρ, NPV = 1 - ρ
Context              Accuracy                          Clinical prediction
Question addressed   To what degree does the test      How likely is the disease
                     reflect the true disease state?   given the test result?
Affected by
disease prevalence?  No                                Yes
45
Validation of the 4P's Plus© screen for substance use
Study Design: A total of 228 pregnant
Result:
Seventy-four (32.5%) had a positive screen = prevalence!!!!!
Sensitivity = 87% => missed 13% of the diseased
Specificity = 76% => incorrectly classified 24% of the disease-free as diseased
Positive predictive validity = 36% => fraction of positive screens with the condition
Negative predictive validity = 97% => fraction of negative screens without the condition
Conclusion:
The 4P's Plus reliably and effectively screens pregnant women for risk
of substance use, including those women typically missed by other
perinatal screening methodologies.
46
ROC Curves
Non-Binary Scales
47
ROC Curves
• ~ for evaluating tests that yield results on a non-binary scale,
with classification set by a threshold.
BP > 140/90 mmHg for Hypertension
~ used to be 160/95 mmHg
~ contemplating 130/85 mmHg
• BP values can fluctuate in any individual, healthy or diseased, so
there will be some overlap of values between the diseased and
disease-free populations.
• The choice of a threshold depends on the trade-off that is
acceptable between failing to detect disease and falsely
identifying disease.
• The ROC curve is a device that describes the range of trade-offs
that can be achieved.
48
ROC Curves
• Plot of SenS against 1-SpeC for all possible thresholds
• It is a visual representation of the global performance of the test
• ROC plot shows the trade-off of sensitivity and specificity
• Test is Useless (Uninformative Test)
if for every threshold c,
SenS = 1 - SpeC
49
Construction of ROC Curves
Calculate SenS and (1 - SpeC) for all possible cutoff points

Cutoff   SenS   SpeC   1-SpeC   Comments
0        100    0      100      Ideal case
110      98     20     80
120      95     40     60
130      92     60     40
140      78     80     20
150      55     90     10
160      40     92     8
…        …      …      …
500      0      100    0
50
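The (1 - SpeC, SenS) pairs from the table above are the ROC points, and the area under the curve can be approximated with the trapezoidal rule. A sketch using the illustrative cutoffs from the table (the data are didactic, not from a real study):

```python
# (cutoff, SenS %, SpeC %) rows from the illustrative table
rows = [(0, 100, 0), (110, 98, 20), (120, 95, 40), (130, 92, 60),
        (140, 78, 80), (150, 55, 90), (160, 40, 92), (500, 0, 100)]

points = sorted(((100 - spec) / 100, sens / 100) for _, sens, spec in rows)
# trapezoidal area under the ROC points, swept by increasing FPR
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))  # 0.84: well above the 0.5 of an uninformative test
```

An uninformative test traces the diagonal SenS = 1 - SpeC (AUC = 0.5); a perfect test hugs the upper-left corner (AUC = 1).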
Attributes of ROC curves
• Provides complete description of potential
performance of a test
• Facilitates comparing and combining information
across studies of the same test
• Guides the choice of threshold in application
• Provides mechanism for relevant comparison
between different tests
53
Reporting of Estimates
Variability
• Point estimates of sensitivity, specificity, predictive
values, are not sufficient.
• Confidence intervals reflect the uncertainty of the
estimates, and should be reported.
• Focus is on confidence intervals to characterize the
performance of the test and not on hypothesis testing
54
Sources of Bias
Diagnostic tests are subject to an array of biases:
• Verification bias
- the study selectively includes diseased subjects for verification
• Imperfect reference standard
- error in the reference standard!
• Spectrum bias
- subjects may not be fully representative of the population,
i.e., important subgroups are missing
55
Measures of Agreement
Reliability
56
Reliability vs. Validity
• Sometimes the goal is to estimate the validity (accuracy) of ratings in
the absence of a "gold standard."
• Other times one merely wants to know the consistency of ratings made
by different raters.
• In some cases, the issue of accuracy may even have no meaning; for
example, ratings may concern opinions, attitudes, or values.
57
Measures of Agreement
• Reliability Coefficients
- positive percent agreement
- negative percent agreement
• Overall agreement
- overall % agreement
- Kappa: how much is due to chance???
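Kappa answers that question by correcting the overall percent agreement for the agreement expected by chance alone, computed from the table margins. A sketch on a hypothetical 2 x 2 rating table (the cell counts are made up for illustration):

```python
def cohens_kappa(a, b, c, d):
    # 2x2 agreement table: a = both positive, b = only rater 1 positive,
    # c = only rater 2 positive, d = both negative
    n = a + b + c + d
    p_obs = (a + d) / n  # observed overall agreement
    # chance agreement expected from the row and column margins
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

print(round(cohens_kappa(40, 10, 10, 40), 3))  # 0.6: good, but not perfect
```

Here the raw overall agreement is 80%, yet half of that is expected by chance from the balanced margins, so kappa drops to 0.6.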
58
Other Measures of Agreement
• B-Statistics
• McNemar Test
• Latent class models
• Bayesian methods
59
Agreement
                 Test 2
Test 1           Positive        Negative
  Positive       AGREEMENT       DISAGREEMENT
  Negative       DISAGREEMENT    AGREEMENT
60
Raw Agreement
                 Test 2
Test 1           Positive   Negative   Total
  Positive       40         1          41
  Negative       19         512        531
  Total          59         513        572

Positive % agreement = 40/41 = 97.6%
Negative % agreement = 512/531 = 96.4%
Overall % agreement = (40+512)/572 = 96.5%
61
Limitation of Agreement Measures
• Reliability is not proof of validity;
- two tests can report the same readings, but be wrong
• It does not tell how the disagreements occurred:
- whether the positive and negative results are evenly
distributed
• It does not tell the extent to which the agreement occurs by
chance
62
Generalization
Cultural Adaptation
63
Generalization
• Ultimate question(s)!!!!!
- Does the test perform well for new, unseen patients?
- Does the test perform well in other populations?
64
Inability to Predict Relapse in
Acute Asthma NEJM 1984;310(9)
• Fischl & Co. developed an index to predict relapse in patients with
acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in
Richmond, VA?????
65
Inability to Predict Relapse in Acute
Asthma
• Index to predict relapse in patients with acute asthma.
• Derivation: 205 ER patients seen in Miami, FL
- 95% sensitivity
- 97% specificity
• Validation: 114 ER patients seen at Richmond, VA
- 18.1% sensitivity???
- 82.4% specificity
66
Centor RM et al. NEJM 1984;310(9):577-580.
[Figures: relapse index performance in the FL and VA cohorts]
67
Centor RM et al. NEJM 1984;310(9):577-580.
Generalization-Validation
• Internal validation - restricted to a single data set
- data splitting (or cross-validation)
• Temporal validation
- evaluation on a second data set from the same population
• External validation
- evaluation on data from other populations, perhaps by
different investigators
68
Predictive Models
69
Predictive Models
• A predictive model is a model for making predictions about
future events
• It is usually built from a number of predictors and a
response (or outcome) variable
70
When Does a Diagnostic Test Work??
Does the diagnostic test add anything to what is
already known?
• Example: A diagnostic test for macular
degeneration would need to show that it is better
than just using a person’s age.
71
Covariate Modeling:
Age in Home Macular Perimeter
Problem: If you sample from subjects with MD and those
without, there is likely to be an age difference that could
confound the assessment of HMP. This is a bias!!!!.
•Question: Is HMP just a surrogate for age?
• Solution: Build a predictive model -logistic model- using
age and HMP -visual field functional defects - to predict risk
of MD and see if HMP adds anything
72
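That "does HMP add anything beyond age?" check can be mimicked end-to-end on synthetic data: fit a logistic model with age alone, then with age plus an HMP score, and compare the fitted log-likelihoods. Everything below — the coefficients, the simulated cohort, the helper names — is hypothetical; a real analysis would run a likelihood-ratio test on the actual study data:

```python
import math
import random

random.seed(1)

def simulate(n=300):
    # hypothetical cohort: MD risk truly depends on (standardized) age AND HMP
    rows = []
    for _ in range(n):
        age, hmp = random.gauss(0, 1), random.gauss(0, 1)
        p = 1 / (1 + math.exp(-(-0.5 + 1.2 * age + 1.0 * hmp)))
        rows.append((age, hmp, 1 if random.random() < p else 0))
    return rows

def log_lik(w, rows, use_hmp):
    ll = 0.0
    for age, hmp, y in rows:
        x = (1.0, age, hmp if use_hmp else 0.0)
        p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

def fit_logistic(rows, use_hmp, iters=1500, lr=0.3):
    # plain gradient ascent on the logistic log-likelihood
    w = [0.0, 0.0, 0.0]  # intercept, age, hmp
    for _ in range(iters):
        g = [0.0, 0.0, 0.0]
        for age, hmp, y in rows:
            x = (1.0, age, hmp if use_hmp else 0.0)
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j in range(3):
                g[j] += (y - p) * x[j]
        w = [wi + lr * gi / len(rows) for wi, gi in zip(w, g)]
    return w, log_lik(w, rows, use_hmp)

rows = simulate()
_, ll_age = fit_logistic(rows, use_hmp=False)
_, ll_both = fit_logistic(rows, use_hmp=True)
print(ll_both > ll_age)  # the richer model fits better when HMP carries signal
```

With real data one would compare 2*(ll_both - ll_age) against a chi-squared distribution with 1 degree of freedom — the likelihood-ratio test for whether HMP adds anything beyond age.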
Summary
Medical applications
• Screening
– triage => prioritization
• Diagnosis
– triage
– management and decision making
– test selection
• Prognosis => Prediction
– management and decision making
– informing patients and their families
– risk adjustment
– eligibility in clinical trials
73
Thank You
????????
74
References
1. Weinstein S, et al. Clinical Evaluation of Diagnostic Tests. AJR 2005.
2. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and
Prediction. New York: Oxford University Press, 2003.
75