Multiple-Reader, Multiple-Case (MRMC) ROC Analysis in Diagnostic Imaging, Computer-Aided Diagnosis, and Statistical Pattern Recognition

R.F. Wagner, S.V. Beiden
From papers by S.V. Beiden, M. Maloof, R.F. Wagner, G. Campbell, C.E. Metz, Y. Jiang, H.P. Chan, et al.

. . . or . . . why ROC analysis is fundamentally and practically a multivariate task

In Three Parts:
I)   Analyzing medical imaging with only human readers
II)  Analyzing human readers assisted by computer
III) Analyzing computer classifiers without humans

Part I -- Medical Imaging: unaided human readers

The assessment paradigm in medical imaging involves three ingredients: patient cases, readers, and competing modalities.

Patient Cases

[Figure: Decision paradigm, distribution of patient cases, and ROC curve -- distributions of the test-result value (or subjective judgment of the likelihood that a case is positive) for actually negative and actually positive cases, with one possible decision threshold on the decision axis defining the TPF, FPF, TNF, and FNF; the ROC curve plots true positive fraction vs. false positive fraction as the threshold moves from strict through moderate to lax.]

Readers

[Figure: Scatterplot of sensitivity vs. specificity among the 108 US radiologists in the study of Beam et al.]

Comparing Two Modalities

Two (possibly correlated) modalities. The variables we now have:
- Degree of reader aggressiveness -- controlled for by ROC (and related paradigms)
- Range of difficulty of patient cases
- Range of reader skill
- Correlation of patients across modalities
- Correlation of readers across modalities
- Reader "jitter" -- inconsistency within a reader, etc.

Full analysis requires Multiple-Reader, Multiple-Case (MRMC) ROC analysis -- where every reader reads every case.

MRMC ROC analysis: the uncertainty of a difference in average ROC parameters between two modalities contains only the variability that is uncorrelated across modalities:

  Var(difference) = (uncorrelated portion of case variance) / (# cases)
                  + (uncorrelated portion of reader variance) / (# readers)
                  + (within-reader jitter & the MRC, i.e. modality-by-reader-by-case, term) / (# cases x # readers)

Now-classic example: Digital Mammography
- 44 cancers in a total of 625 women (Hendrick, Lewin et al. -- & GEMS)
- 5 readers -- balanced reading of the analog (A) and digital (D) modalities

MRMC ROC results:
- Mean (over 5 readers) ROC area, screen-film mammography (SFM) = 0.77
- Mean (over 5 readers) ROC area, full-field digital mammography (FFDM) = 0.76
- 95% C.I. about the difference of 0.01: +/- 0.064

Implications of the variance structure (from MRMC ROC analysis via Beiden, Wagner & Campbell) for sizing a trial with tighter error bars . . . with, say, a goal of 95% C.I. ~ +/- 0.03:
- 78 cancers, 100 readers
- 100 cancers, 20 readers
- etc.

This is the general idea . . . to which we'll return . . .

Part II -- Human readers assisted by computer

(In this case CADx -- e.g., the task might be discrimination of microcalcification clusters: cancer (CA) vs. benign.)

Note: Part I was a pedestrian overview; we now provide a more formal analysis.

In the more formal components-of-variance analysis the first three components are written:
- cases (c)              (includes range of case difficulty & finite sample size)
- readers (r)            (range of reader skill)
- reader-by-case (r x c) (whether the range of case difficulty depends on the reader, or vice versa)

Jiang et al.: CADx classification of microcalcification clusters. "Separator" analysis (note that the case and reader components are comparable; the 3 interaction components are not shown) . . . i.e., the contribution to variance from the case set is comparable to the contribution to variance from a single reader (a commonly encountered result) . . . with error bars due to estimation from finite samples (a resampling sketch of this kind of decomposition follows below).
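As a purely illustrative companion to the components-of-variance idea, the sketch below shows one way to probe the case and reader contributions to the uncertainty of a reader-averaged ROC area by bootstrap resampling of a readers-by-cases rating matrix. It is a minimal sketch, not the BWC or DBM software: the data layout, all names (`ratings`, `truth`, `empirical_auc`), and the simulation parameters are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_auc(scores, truth):
    """Empirical ROC area (Mann-Whitney statistic), counting ties as 1/2."""
    pos, neg = scores[truth == 1], scores[truth == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def reader_averaged_auc(ratings, truth):
    """Average empirical AUC over the rows (readers) of a readers-by-cases matrix."""
    return np.mean([empirical_auc(r, truth) for r in ratings])

def bootstrap_variance(ratings, truth, resample_cases, resample_readers, n_boot=1000):
    """Variance of the reader-averaged AUC under bootstrap resampling of cases,
    readers, or both -- a crude look at the corresponding variance contributions."""
    n_readers, n_cases = ratings.shape
    aucs = []
    for _ in range(n_boot):
        r_idx = rng.integers(0, n_readers, n_readers) if resample_readers else np.arange(n_readers)
        c_idx = rng.integers(0, n_cases, n_cases) if resample_cases else np.arange(n_cases)
        aucs.append(reader_averaged_auc(ratings[np.ix_(r_idx, c_idx)], truth[c_idx]))
    return np.var(aucs)

# Hypothetical example: 5 readers, 100 cases (60 actually negative, 40 actually positive).
truth = np.r_[np.zeros(60), np.ones(40)].astype(int)
case_difficulty = rng.normal(0, 1, 100)            # case-to-case variation
reader_skill = 1.5 + rng.normal(0, 0.5, 5)         # reader-specific signal strength
jitter = rng.normal(0, 0.5, (5, 100))              # within-reader inconsistency
ratings = case_difficulty + reader_skill[:, None] * truth + jitter

print("resample cases only  :", bootstrap_variance(ratings, truth, True,  False))
print("resample readers only:", bootstrap_variance(ratings, truth, False, True))
print("resample both        :", bootstrap_variance(ratings, truth, True,  True))
```

In this toy setting the case-only and reader-only resampling variances are typically of the same order, echoing the "commonly encountered result" above; the exact balance depends on the assumed spreads of case difficulty, reader skill, and jitter.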
Returning to the Jiang et al. study:

[Figure: components of variance without (upper) vs. with (lower) computer-assisted reading.]

When we account for the difference in variance structure without vs. with computer-assisted reading, computer assist has greatly reduced both the range of variation of "reader skill" and the dependence of this range on the cases.

Part III -- Computers alone

The automatic discriminant/classifier problem . . . i.e., the classical field of statistical pattern recognition.

Conventional Wisdom in Statistical Pattern Recognition:
(1) Bias of classifier performance is due only to the finite training set.

This is well established -- some examples follow (Chan, Sahiner, Wagner, Petrick: Med Phys 26, 1999).

[Figures: Az vs. 1/(number of training samples per class) for an artificial neural network (ANN, 400 iterations) and for a quadratic classifier, at feature dimensionalities 3, 6, 9, 12, 15 with equal covariance matrices; resubstitution and hold-out estimates are shown for each.]

Conventional Wisdom in Statistical Pattern Recognition:
(2) "Sampling variance of accuracy measures (i.e., C.I.s or error bars on Az, etc.) comes mainly from the finite number of Testers."

Usual practice in the CAD algorithm research community: error bars on accuracy assessment (e.g., ROC estimates from available software) include only the effect of the finite number of test cases ("testers"). Thus the "generalizability" is limited to expressing the uncertainty from the finite test set -- it provides no idea of the uncertainty from the finite training set.

Is Conventional Wisdom (2) true? Our recent work demonstrates that part (2) of the conventional wisdom is not only untrue . . . it is unwise . . . especially if one is comparing algorithms . . . the finite training set contributes at least as much uncertainty as the finite test set (a toy Monte Carlo illustrating this follows below).
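Before the formal treatment, the claim can be illustrated with a toy Monte Carlo, sketched below under assumed conditions (two Gaussian classes with equal covariances, a plug-in linear discriminant, and hypothetical sizes `DIM`, `N_TRAIN`, `N_TEST`): compare the spread of the test-set AUC across independent finite training sets with its spread across independent finite test sets. This is an illustrative sketch, not the 10-institution simulation described next; in this toy setting the two spreads come out comparable, though the exact balance depends on dimensionality and sample sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

DIM, N_TRAIN, N_TEST, N_MC = 9, 50, 50, 300   # per-class sample sizes; 9 features
MU = np.full(DIM, 0.5)                        # class-mean separation (equal covariances)

def draw(n, positive):
    """Draw n feature vectors from one of the two Gaussian classes."""
    return rng.normal(0, 1, (n, DIM)) + (MU if positive else 0.0)

def train_weights(neg, pos):
    """Plug-in linear discriminant: w = pooled-covariance^-1 (mean_pos - mean_neg)."""
    pooled = 0.5 * (np.cov(neg.T) + np.cov(pos.T))
    return np.linalg.solve(pooled, pos.mean(0) - neg.mean(0))

def auc(w, neg, pos):
    """Empirical ROC area of the discriminant scores on a test set."""
    s_neg, s_pos = neg @ w, pos @ w
    diff = s_pos[:, None] - s_neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Spread of AUC over training sets, with one fixed test set:
test_neg, test_pos = draw(N_TEST, False), draw(N_TEST, True)
auc_over_train = [auc(train_weights(draw(N_TRAIN, False), draw(N_TRAIN, True)),
                      test_neg, test_pos) for _ in range(N_MC)]

# Spread of AUC over test sets, with one fixed trained classifier:
w = train_weights(draw(N_TRAIN, False), draw(N_TRAIN, True))
auc_over_test = [auc(w, draw(N_TEST, False), draw(N_TEST, True)) for _ in range(N_MC)]

print("SD of AUC over training sets:", np.std(auc_over_train))
print("SD of AUC over test sets    :", np.std(auc_over_test))
```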
We already have all the machinery we need for this, since there is an isomorphism between the MRMC problem and statistical pattern recognition:

  MRMC                              Pattern recognition
  sample of patients   =>  C  <=    test sample
  sample of readers    =>  R  <=    multiple training sets
  imaging modality     =>  M  <=    CADx algorithm

  C, R, R x C            -- correlated across modalities/algorithms
  M x C, M x R, M x R x C -- uncorrelated across modalities/algorithms

We simulated a project involving 10 institutions, each with its own set of training patients (analogous to 10 radiologists). Each institution designs 2 competing classifiers (analogous to modalities), each classifier having the same architecture across institutions, using its own training patients. We included a single independent set of testing patients to be classified by each institution. We then studied the 6 components of variance (as above) over 300 Monte Carlo trials.

Representative results:
- When analyzing one algorithm at a time, all six components contribute variance.
- When comparing two competing algorithms, only the last three (those uncorrelated across algorithms) contribute.
- In either case, uncertainty (variance) is clearly seen to come from the finite training set (terms with R) as well as the finite test set (terms with C).

Thus the contribution to uncertainty (error bars) from the finite training set* can be equal to or greater than that from the finite test set. This argues for some form of bootstrapping (resampling procedure) when designing and assessing a classifier. (* ignored by conventional assessment software)

Contemporary best-candidate approach: the 0.632 bootstrap (Efron, JASA 1983) and its successor, the 0.632+ bootstrap (Efron & Tibshirani, JASA 1997). The latter combines three contributions: the resubstitution assessment, the leave-one-out bootstrap, and the "no-information" performance (an adjustment for overfitting); a sketch of the 0.632+ combination is appended below, after the closing remarks. Research continues in this field.

Multivariate ROC analysis is a very rich field, and applies to medical imaging, computer-assisted reading, and statistical pattern recognition (machine classifiers). We have presented a unified approach to the above problems based on the work of S. Beiden, R.F. Wagner, G. Campbell (BWC); D. Dorfman, K. Berbaum, C.E. Metz (DBM); and M. Maloof.

If time permits . . . we will include the implications of the above analysis for so-called "expected benefit" analysis . . . i.e., averaging the benefits (+ and -) of a modality over society . . . given the wide range of reader performance.
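Appendix: for concreteness, here is a minimal sketch of the 0.632+ combination, shown for a toy nearest-class-mean classifier and 0-1 error (rather than the Az measures discussed above). The classifier, data, and parameter values are hypothetical, and the handling of the relative overfitting rate follows the usual clipping convention rather than reproducing every refinement of Efron & Tibshirani (1997).

```python
import numpy as np

rng = np.random.default_rng(2)

def nearest_mean_fit(X, y):
    """Toy classifier: per-class mean vectors (stands in for any CAD algorithm)."""
    return {k: X[y == k].mean(0) for k in (0, 1)}

def nearest_mean_predict(model, X):
    """Assign each row to the nearer class mean."""
    d0 = np.linalg.norm(X - model[0], axis=1)
    d1 = np.linalg.norm(X - model[1], axis=1)
    return (d1 < d0).astype(int)

def err632plus(X, y, n_boot=200):
    """0.632+ bootstrap estimate of the 0-1 error rate (after Efron & Tibshirani, 1997)."""
    n = len(y)
    # Apparent (resubstitution) error on the full sample.
    resub_pred = nearest_mean_predict(nearest_mean_fit(X, y), X)
    err_app = np.mean(resub_pred != y)
    # Leave-one-out bootstrap error: each case is scored only by bootstrap
    # classifiers whose training sample did not contain it.
    errs, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        out = np.setdiff1d(np.arange(n), idx)
        if out.size == 0 or len(np.unique(y[idx])) < 2:
            continue
        pred = nearest_mean_predict(nearest_mean_fit(X[idx], y[idx]), X[out])
        errs[out] += (pred != y[out])
        counts[out] += 1
    err_loo = np.mean(errs[counts > 0] / counts[counts > 0])
    # No-information error rate gamma and relative overfitting rate R.
    p1 = np.mean(y == 1)
    q1 = np.mean(resub_pred == 1)
    gamma = p1 * (1 - q1) + (1 - p1) * q1
    err_loo = min(err_loo, gamma)
    R = (err_loo - err_app) / (gamma - err_app) if gamma > err_app else 0.0
    R = float(np.clip(R, 0.0, 1.0))
    w = 0.632 / (1 - 0.368 * R)
    return (1 - w) * err_app + w * err_loo

# Hypothetical data: two overlapping Gaussian classes in 5 features.
X = np.vstack([rng.normal(0.0, 1, (40, 5)), rng.normal(0.8, 1, (40, 5))])
y = np.r_[np.zeros(40), np.ones(40)].astype(int)
print("0.632+ error estimate:", err632plus(X, y))
```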