Diagnostic Testing & Predictive Models
John Kwagyan, PhD
Howard University College of Medicine
Design, Biostatistics & Population Studies, GHUCCTS
1

"Physicians must be content to end not in certainties, but rather in statistical probabilities. The physician thus has a right to feel certain, within statistical constraints, but never cocksure. Absolute certainty remains for some theologians - and like-minded physicians."
Am J Cardiol 1975;36:592-6
2

Objective
To understand the use of diagnostic measures and screening tools
3

Outline
• Examples
• Why/What is Diagnostic Testing
• Measures of Diagnostic Accuracy
• ROC Curves
• Adaptation of Diagnostic/Screening Tools
• Predictive Models
4

EXAMPLES
5

4P's Plus Screening Instrument - Substance Abuse in Pregnant Women
What is a positive assessment?
J. Perinatology 2005
6

Index to Predict Relapse in Asthma
Factor                  Score 0             Score 1
Pulse                   <120                >=120
Respiration             <30                 >=30
Pulsus paradoxus        <18                 >=18
Peak flow rate          >120                <=120
Dyspnea                 Absent or mild      Moderate or severe
Accessory muscle use    Absent or mild      Moderate or severe
Wheezing                Moderate or severe  Absent or mild
Positive test => score of 4 or more
Fischl et al, NEJM 1981
7

Validation of the 4P's Plus© Screen for Substance Use in Pregnancy
J. Perinatology (2007)
Study design: A total of 228 pregnant women underwent screening. Reliability, sensitivity, specificity, and positive and negative predictive validity were assessed.
Results: Overall reliability for the five-item measure was 0.62. Seventy-four (32.5%) of the women had a positive screen. Sensitivity and specificity were very good, at 87% and 76%, respectively. Positive predictive validity was low (36%); negative predictive validity was quite high (97%).
Conclusion: The 4P's Plus reliably and effectively screens pregnant women for risk of substance use, including those women typically missed by other perinatal screening methodologies.
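The abstract above reports counts and predictive validities but not the underlying prevalence of use. As a quick sanity check (my arithmetic, not stated in the abstract), the implied prevalence can be recovered from the screen-positive count and the predictive values:

```python
# Published summary numbers from the 4P's Plus validation (n = 228)
n, n_pos = 228, 74
ppv, npv = 0.36, 0.97

users_pos = ppv * n_pos              # screen-positive women who actually used
users_neg = (1 - npv) * (n - n_pos)  # users missed by the screen
prevalence = (users_pos + users_neg) / n
print(f"implied prevalence ~ {prevalence:.1%}")  # about 13.7%
```

This back-of-the-envelope figure helps explain the low PPV later in the deck: even a sensitive, specific screen has modest PPV when the condition is uncommon.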
8

Maternal Biochemical Serum Screening for Down Syndrome in Pregnancy With HIV Infection
To estimate the influence of HIV infection and antiretroviral therapy on maternal serum marker levels and the false-positive rate of biochemical maternal serum screening for Down syndrome.
Obstetrics & Gynecology
9

Inability to Predict Relapse in Acute Asthma
NEJM 1984;310(9)
• Fischl & co. developed an index to predict relapse in patients with acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in Richmond, VA??
10

Other Examples
• SSAGA for alcohol dependence
• Genetic screening for hereditary disease
• Etc.
11

Why Diagnostic Testing
• Accurate screening/diagnosis of a health condition is often a first step towards its prevention or control
• Need for fast, inexpensive and RELIABLE tools
12

Purpose of Diagnostic Testing
• A (binary) diagnostic test is designed to determine whether a target condition is present or absent in a subject from the intended-use population.
• The target condition can refer to:
  - a particular disease
  - a disease stage
  - a health status or condition that should prompt clinical action, such as the initiation, modification or termination of treatment, counseling, etc.
13

Test Scale
• Binary - presence or absence of disease; the underlying measure is usually continuous
• Continuous (quantitative)
  - biomarkers for cancer, e.g., PSA measured as serum concentration
  - creatinine for kidney malfunction
  - blood sugar for diabetes
  - cholesterol for dyslipidemia
• Ordinal
  - clinical symptoms: moderate, severe, highly severe
  - index score: 0, 1, 2, 3, 4, 5
14

Other Test Scales
• Likert-type rating - highly disagree, disagree, neutral, agree, highly agree
• Nominal **
  - genotype groups
  - ApoE genotypes: E2/E2, E2/E3, E2/E4, E3/E3, E3/E4, E4/E4
15

What is Diagnostic Testing?
• Evaluation of a (new) test to determine whether a target condition is present, by comparison with a benchmark
• Evaluation of the ability of a test to classify subjects as diseased or disease-free
• For non-binary scales, a classification rule is set by a threshold
  - PSA > 4.0 ng/ml
  - Blood glucose > 126 mg/dL
  - BP > 140/90 mmHg (used to be 160/95 mmHg; contemplating 130/85 mmHg?)
16

"That we are in the midst of crisis is now well understood. Our nation is at war... Our economy is badly weakened... Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many... These are the indicators of crisis, subject to data and statistics."
Pres. Barack Obama (inaugural speech)
17

Benchmarks for Comparison
1. Comparison to a reference (gold) standard
  - considered to be the best available method for establishing the presence or absence of the target condition
  - can be a single method, or a combination of methods, including clinical follow-up
2. Comparison to a non-reference standard
  - a method other than a reference (gold) standard
Note: the choice of comparative method determines which performance measures may be reported.
18

Some Conventional Tests
• Bacterial cultures - strep throat, urinary tract infection, meningitis, etc.
• Imaging technology
  - X-ray for bone fracture
  - CT scans for brain injury
  - MRI for brain injury
• Biochemical markers
  - serum creatinine for kidney dysfunction
  - serum bilirubin for liver dysfunction
  - blood glucose for diabetes
  - blood test for HIV
19

Other Conventional Tests
• Expert judgment - presence or absence of heart murmur
• Response to questionnaire
  - substance abuse
• Expert interview or observation - schizophrenia, bipolar disorder, major depression
• Radiologist's score of mammograms - no cancer, benign, possible malignancy, malignancy
20

Measures of Accuracy
Validation
21

Validation
• The evaluation of the accuracy of the test
• Can only be established by comparison with the gold standard
• Validity is measured by sensitivity and specificity
The extent to which a test measures what it is supposed to measure!
22

Measures of Accuracy
• Sensitivity => true positive rate (TPR)
• Specificity => true negative rate (TNR)
• False negative rate (FNR) = 1 - sensitivity
• False positive rate (FPR) = 1 - specificity
• Predictive values
  - positive predictive value
  - negative predictive value
• Diagnostic likelihood ratios
  LR+ = TPR/FPR = SenS/(1-SpeC)
  LR- = FNR/TNR = (1-SenS)/SpeC
• ROC curves
23

Sensitivity & Specificity
• Sensitivity is the ability of a test to correctly classify an individual as 'diseased'. Estimated as the proportion of subjects with the target condition in whom the test is positive.
• Specificity is the ability of a test to correctly classify an individual as 'disease-free'. Estimated as the proportion of subjects without the target condition in whom the test is negative.
Best illustrated using a 2 x 2 table!
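Once the 2 x 2 table is filled in, all of these measures are one-line calculations. A minimal Python sketch (the function name and the Wald normal-approximation intervals are my choices, not from the slides), illustrated with the counts from the Coronary Artery Surgery Study example that appears a few slides later:

```python
import math

def diagnostic_accuracy(tp, fp, fn, tn, z=1.96):
    """Sensitivity, specificity and likelihood ratios from a 2x2 table,
    with Wald (normal-approximation) 95% confidence intervals."""
    def prop_ci(x, n):
        p = x / n
        se = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
        return p, (p - z * se, p + z * se)
    sens, sens_ci = prop_ci(tp, tp + fn)  # TP / total diseased
    spec, spec_ci = prop_ci(tn, tn + fp)  # TN / total disease-free
    return {"sens": sens, "sens_ci": sens_ci,
            "spec": spec, "spec_ci": spec_ci,
            "lr_pos": sens / (1 - spec),  # LR+ = SenS / (1 - SpeC)
            "lr_neg": (1 - sens) / spec}  # LR- = (1 - SenS) / SpeC

# Coronary Artery Surgery Study table: 815 TP, 115 FP, 208 FN, 327 TN
res = diagnostic_accuracy(tp=815, fp=115, fn=208, tn=327)
print(f"SenS = {res['sens']:.2f}, CI = [{res['sens_ci'][0]:.2f}, {res['sens_ci'][1]:.2f}]")
print(f"SpeC = {res['spec']:.2f}, CI = [{res['spec_ci'][0]:.2f}, {res['spec_ci'][1]:.2f}]")
# matches the slide: SenS 0.80 [0.77, 0.82]; SpeC 0.74 [0.70, 0.78]
```

The Wald interval is the simplest choice; for proportions near 0 or 1, a Wilson or exact interval behaves better.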
24

Diagnostic Testing
                   True Disease Status
Test Result        Diseased               Disease-free
Positive           No error               Error I
Negative           Error II               No error
Total              N1 = total diseased    N2 = total disease-free
25

Sensitivity & Specificity
                   True Disease Status
Test Result        Diseased               Disease-free
Positive           True-positive          False-positive
Negative           False-negative         True-negative
• Sensitivity ~ ability of a test to detect disease when present => true-positive fraction
• Specificity ~ ability to indicate disease-free when absent => true-negative fraction
26

Consequences of Diagnostic Errors
• False negative errors, i.e., missing disease that is present:
  - can result in people foregoing needed treatment for the disease
  - the consequence can be as serious as death
• False positive errors, i.e., falsely indicating disease:
  - disease-free subjects are put through unnecessary work-up procedures or even treatment
  - negative impacts include personal inconvenience and unnecessary stress, anxiety, etc.
27

Estimating Sensitivity & Specificity
                   True Disease Status
Test               Diseased    Disease-free    Total
Positive           TP          FP              T+
Negative           FN          TN              T-
Total              T_D         T_Df            T_N

SenS (estimate) = TP/T_D
SpeC (estimate) = TN/T_Df
False negative rate = FN/T_D = 1 - SenS
False positive rate = FP/T_Df = 1 - SpeC
28

Example: Coronary Artery Surgery Study (CAD)
Gold standard: arteriography
EST                Diseased    Disease-free    Total
Positive           815         115             930
Negative           208         327             535
Total              1023        442             1465

SenS = 815/1023 = 0.80    CI = [0.77, 0.82]
FNF  = 208/1023 = 0.20    CI = [0.18, 0.23]
SpeC = 327/442  = 0.74    CI = [0.70, 0.78]
FPF  = 115/442  = 0.26    CI = [0.22, 0.30]
29

Absolute certainty remains for some theologians - and like-minded physicians!
30

If Ever There Is a Perfect Test!
Ideal Case
Test               Diseased      Disease-free    Total
Positive           TP            FP = 0          T+
Negative           FN = 0        TN              T-
Total              T_D (=TP)     T_Df (=TN)      T_N

SenS = TP/T_D  = 100%    FNF = FN/T_D  = 1 - SenS = 0%
SpeC = TN/T_Df = 100%    FPF = FP/T_Df = 1 - SpeC = 0%
31

Uninformative (Useless) Tests
• A test is uninformative if the test result is unrelated to disease status
• The probability distributions of the measure are the same in the diseased and disease-free populations
• For uninformative tests: SenS = 1 - SpeC, i.e., TPF = FPF
  Ex: an exercise stress test to determine diabetes, HIV, etc.
• A test is informative if: SenS + SpeC > 1
32

Clinical Application: Detection of Primary Angle Closure Glaucoma (PACG)
Gold standard: gonioscopy
Test                     SenS (%)    SpeC (%)
Intraocular pressure     47          92
Torch light test         80          70
van Herick test          61.9        89.3
Which test should we use to screen for PACG?
Indian J Ophthalmology 2008;56:45-50
33

Sensitivity vs. Specificity: Rule Out & Rule In
• Tests with a high degree of sensitivity have a low FNR
  - they ensure that not many true cases of the disease are missed
• A screening test used to "rule out" a diagnosis should have a high degree of sensitivity
• Tests with a high degree of specificity have a low FPR
  - they ensure that not many patients are misdiagnosed
• A confirmatory test used to "rule in" a diagnosis should have a high degree of specificity
34

Clinical Application: Detection of Primary Angle Closure Glaucoma
Gold standard: gonioscopy
Test                     SenS (%)    SpeC (%)
Intraocular pressure     47          92
Torch light test         80          70
van Herick test          61.9        89.3
Which test should we use to screen for PACG?
How about combining the tests!
35

Combining Tests!
• 2 ways of performing combination:
  - tests in parallel
  - tests in series
• 2 rules for combination:
  - "the OR rule"
  - "the AND rule"
36

Combining 2 Tests
• "OR rule"
  - Test is positive if either test is positive; negative if both are negative
  - SenS(combo) = SenS1 + SenS2 - SenS1*SenS2
  - SpeC(combo) = SpeC1*SpeC2
• "AND rule"
  - Test is positive only if both A and B are positive; negative if either or both are negative
  - SenS(combo) = SenS1*SenS2
  - SpeC(combo) = SpeC1 + SpeC2 - SpeC1*SpeC2
37

Clinical Application: Combining Tests
Detection of Primary Angle Closure Glaucoma
Gold standard: gonioscopy
Test                        SenS (%)    SpeC (%)
Torch light test            80          70
van Herick test             62          89.3
Combined test (OR rule)     92.4        62.3
Combined test (AND rule)    49.6        97
38

Combining Tests
• "The OR rule" increases SenS ~ useful for screening tests to "rule out" a diagnosis
• "The AND rule" increases SpeC ~ useful for confirmatory tests to "rule in" a diagnosis
39

Predictive Values
Daring Clinical Questions
How likely is the disease, given the test result?
- What is the likelihood of disease when the test is positive?
- What is the likelihood of non-disease when the test is negative?
Answers to these questions are known as the predictive values.
40

Predictive Values
• PPV - fraction of test positives who are diseased
  PPV = probability(diseased | positive test)
• NPV - fraction of test negatives who are disease-free
  NPV = probability(disease-free | negative test)
41

Predictive Values
                   True Disease Status
Test               Diseased    Disease-free    Total
Positive           TP          FP              T+
Negative           FN          TN              T-
Total              T_D         T_Df            T_N

Positive predictive value = TP/T+
Negative predictive value = TN/T-
42

Predictive Values: Example
                   Diseased    Disease-free    Total
Positive           815         115             930
Negative           208         327             535
Total              1023        442             1465

PPV = 815/930 = 87%
NPV = 327/535 = 61%
43

Predictive Values
• A perfect test will predict perfectly, i.e., PPV = 1, NPV = 1
• Predictive values depend on the prevalence of the disease:
  - PPV decreases with decreasing prevalence; a low PPV may simply be a result of low prevalence
  - NPV decreases with increasing prevalence
• Useless test if: PPV = prevalence and NPV = 1 - prevalence
• Predictive values are not used to quantify the inherent accuracy of the test
44

Attributes of Measures
                      Classification probabilities      Predictive values
Perfect test          SenS = 1, SpeC = 1                PPV = 1, NPV = 1
Useless test          SenS = 1 - SpeC                   PPV = ρ, NPV = 1 - ρ
Context               Accuracy                          Clinical prediction
Question addressed    To what degree does the test      How likely is the disease,
                      reflect the true disease state?   given the test result?
Affected by prevalence?   No                            Yes
45

Validation of the 4P's Plus© Screen for Substance Use
Study design: a total of 228 pregnant women.
Results: Seventy-four (32.5%) had a positive screen.
Sensitivity = 87% => missed 13% of true users
Specificity = 76% => incorrectly classified 24% of non-users as positive
Positive predictive validity = 36% => fraction of screen-positives who were users
Negative predictive validity = 97% => fraction of screen-negatives who were non-users
Conclusion: The 4P's Plus reliably and effectively screens pregnant women for risk of substance use, including those women typically missed by other perinatal screening methodologies.
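The prevalence dependence described above can be made concrete with Bayes' rule. The sketch below uses the 4P's Plus sensitivity and specificity with a prevalence of about 13.4%, a value I back-solved so the PPV matches the reported 36% (the abstract itself does not state a prevalence), then repeats the calculation at a much lower prevalence to show how PPV collapses:

```python
def predictive_values(sens, spec, prev):
    """PPV and NPV from sensitivity, specificity and prevalence via Bayes' rule."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# 4P's Plus: SenS = 0.87, SpeC = 0.76; assumed prevalence ~13.4%
ppv, npv = predictive_values(0.87, 0.76, 0.134)
print(f"prev 13.4%: PPV = {ppv:.0%}, NPV = {npv:.0%}")  # close to the reported 36% / 97%

# Same test applied in a low-prevalence population: PPV drops sharply
ppv_low, _ = predictive_values(0.87, 0.76, 0.01)
print(f"prev  1.0%: PPV = {ppv_low:.0%}")
```

Nothing about the test changed between the two calls; only the population did. This is why predictive values describe clinical prediction in a given setting, not the inherent accuracy of the test.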
46

ROC Curves
Non-Binary Scales
47

ROC Curves
• For evaluating tests that yield results on a non-binary scale, where a classification rule is set by a threshold
  - BP > 140/90 mmHg for hypertension ~ used to be 160/95 mmHg ~ contemplating 130/85 mmHg
• BP values can fluctuate in any individual, healthy or diseased, so there will be some overlap of values between the diseased and disease-free populations
• The choice of a threshold depends on the trade-off that is acceptable between failing to detect disease and falsely identifying disease
• The ROC curve is a device that describes the range of trade-offs that can be achieved
48

ROC Curves
• Plot of SenS against 1-SpeC for all possible thresholds
• A visual representation of the global performance of the test
• The ROC plot shows the trade-off between sensitivity and specificity
• A test is useless (uninformative) if, for every threshold c, SenS = 1 - SpeC
49

Construction of ROC Curves
Calculate SenS and (1-SpeC) at every possible cutoff point:
Cutoff    SenS    SpeC    1-SpeC    Comments
0         100     0       100       Ideal case
110       98      20      80
120       95      40      60
130       92      60      40
140       78      80      20
150       55      90      10
160       40      92      8
...       ...     ...     ...
500       0       100     0
50

[ROC curve plots]
51 52

Attributes of ROC Curves
• Provide a complete description of the potential performance of a test
• Facilitate comparing and combining information across studies of the same test
• Guide the choice of threshold in application
• Provide a mechanism for relevant comparison between different tests
53

Reporting of Estimates: Variability
• Point estimates of sensitivity, specificity, and predictive values are not sufficient.
• Confidence intervals reflect the uncertainty of the estimates, and should be reported.
• The focus is on confidence intervals to characterize the performance of the test, not on hypothesis testing.
54

Sources of Bias
Diagnostic testing is subject to an array of biases:
• Verification bias - the study selectively includes diseased subjects for verification
• Imperfect reference standard - error in the reference standard
• Spectrum bias - subjects may not be fully representative of the population, i.e., important subgroups are missing
55

Measures of Agreement
Reliability
56

Reliability vs. Validity
• Sometimes the goal is to estimate the validity (accuracy) of ratings in the absence of a "gold standard."
• Other times one merely wants to know the consistency of ratings made by different raters.
• In some cases, the issue of accuracy may have no meaning - for example, ratings may concern opinions, attitudes, or values.
57

Measures of Agreement
• Reliability coefficients
  - positive percent agreement
  - negative percent agreement
• Overall agreement
  - overall % agreement
• Kappa - how much agreement is due to chance?
58

Other Measures of Agreement
• B-statistic
• McNemar test
• Latent class models
• Bayesian methods
59

Agreement
                   Test 1
Test 2             Positive        Negative
Positive           AGREEMENT       DISAGREEMENT
Negative           DISAGREEMENT    AGREEMENT
60

Raw Agreement
                   Test 1
Test 2             Positive    Negative    Total
Positive           40          19          59
Negative           1           512         513
Total              41          531         572

Positive % agreement = 40/41 = 97.6%
Negative % agreement = 512/531 = 96.4%
Overall % agreement = (40+512)/572 = 96.5%
61

Limitations of Agreement Measures
• Reliability is not proof of validity - two tests can report the same readings, but both be wrong
• It does not tell how the disagreements occurred - whether the positive and negative results are evenly distributed
• It does not tell the extent to which the agreement occurs by chance
62

Generalization
Cultural Adaptation
63

Generalization
• Ultimate questions:
  - Does the test perform well for new, unseen patients?
  - Does the test perform well in other populations?
64

Inability to Predict Relapse in Acute Asthma
NEJM 1984;310(9)
• Fischl & co. developed an index to predict relapse in patients with acute asthma.
• Based on data from ER patients in Miami, FL
• Reported 95% sensitivity and 97% specificity
• Dramatic drop in accuracy when externally validated on patients in Richmond, VA??
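The "how much is due to chance" question on the agreement slides is usually answered with Cohen's kappa, which corrects the observed agreement for the agreement expected from the marginal totals alone. A sketch applying it to the raw-agreement table (the off-diagonal cell of 19 is inferred from the slide's marginal totals, and the function name is mine):

```python
def kappa_and_agreement(a, b, c, d):
    """Overall agreement and Cohen's kappa for two binary tests.
    Layout: a = both positive, b = test2+/test1-, c = test2-/test1+, d = both negative."""
    n = a + b + c + d
    p_obs = (a + d) / n                                     # observed agreement
    # agreement expected by chance, from the marginal totals
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)

# Raw-agreement table: 40 both-positive, 512 both-negative, n = 572
p_obs, kappa = kappa_and_agreement(a=40, b=19, c=1, d=512)
print(f"overall agreement = {p_obs:.1%}, kappa = {kappa:.2f}")
```

Note how kappa (about 0.78 here) is well below the 96.5% raw agreement: with 93% of subjects negative on both tests, much of the raw agreement would occur by chance alone.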
65

Inability to Predict Relapse in Acute Asthma
• Index to predict relapse in patients with acute asthma
• Development: 205 ER patients in Miami, FL
  - 95% sensitivity, 97% specificity
• External validation: 114 ER patients in Richmond, VA
  - 18.1% sensitivity, 82.4% specificity
66

Centor RM et al. NEJM 1984;310(9):577-580.
[Figures: Florida vs. Virginia results]
67

Generalization - Validation
• Internal validation
  - restricted to a single data set
  - data splitting (or cross-validation)
• Temporal validation
  - evaluation on a second data set from the same population
• External validation
  - evaluation on data from other populations, perhaps by different investigators
68

Predictive Models
69

Predictive Models
• A predictive model is a model for making predictions about future events
• It is usually built from a number of predictors and a response (or outcome) variable
70

When Does a Diagnostic Test Work?
Does the diagnostic test add anything to what is already known?
• Example: a diagnostic test for macular degeneration would need to show that it is better than just using a person's age.
71

Covariate Modeling: Age in Home Macular Perimeter (HMP)
• Problem: if you sample from subjects with MD and those without, there is likely to be an age difference that could confound the assessment of HMP. This is a bias!
• Question: is HMP just a surrogate for age?
• Solution: build a predictive model (a logistic model) using age and HMP (visual field functional defects) to predict risk of MD, and see if HMP adds anything.
72

Summary: Medical Applications
• Screening
  - triage => prioritization
• Diagnosis
  - triage
  - management and decision making
  - test selection
• Prognosis => prediction
  - management and decision making
  - informing patients and their families
  - risk adjustment
  - eligibility in clinical trials
73

Thank You!
74

References
1. Weinstein et al. Clinical Evaluation of Diagnostic Tests. AJR 2005.
2. Pepe, M. S. (2003).
The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.
75