Multiple-Reader, Multiple-Case (MRMC)
ROC Analysis
in Diagnostic Imaging,
Computer-aided Diagnosis,
and Statistical Pattern Recognition
R.F. Wagner, S.V. Beiden
From papers by
S.V. Beiden, M. Maloof, R.F. Wagner,
G. Campbell, C.E. Metz, Y. Jiang, H.P. Chan, et al.
. . . or . . .
why ROC analysis is fundamentally and practically
a multivariate task
In Three Parts:
I)   Analyzing medical imaging with only human readers
II)  Analyzing human readers assisted by computer
III) Analyzing computer classifiers without humans
Part I
Medical Imaging: unaided human readers
The Assessment Paradigm in Medical Imaging:
Patient Cases
Readers
Competing Modalities
Patient Cases

[Figure: Decision paradigm, distribution of patient cases, and ROC curve.
Left: distributions of the test result value (or subjective judgment of the
likelihood that a case is positive) for actually negative and actually
positive cases along the decision axis; one possible decision threshold
splits the axis into TNF/FPF (negatives) and FNF/TPF (positives).
Right: the resulting ROC curve, True Positive Fraction vs False Positive
Fraction, with lax, moderate, and strict thresholds marked along the curve.]
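(Aside, not from the original talk: a minimal Python sketch of this decision paradigm. Two Gaussian score distributions stand in for the actually negative and actually positive cases; sweeping the threshold traces out the (FPF, TPF) pairs of the ROC curve. All parameters are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 1000)   # actually negative cases
pos = rng.normal(1.5, 1.0, 1000)   # actually positive cases

thresholds = np.linspace(-4.0, 6.0, 101)
fpf = np.array([(neg >= t).mean() for t in thresholds])  # false positive fraction
tpf = np.array([(pos >= t).mean() for t in thresholds])  # true positive fraction

# A lax threshold sits upper-right on the curve, a strict one lower-left;
# the ROC area follows from the trapezoidal rule.
az = np.trapz(tpf[::-1], fpf[::-1])
print(f"Empirical ROC area (Az) ~ {az:.3f}")
```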
Readers

[Figure: Scatterplot of sensitivity (%) vs specificity (%) among the
108 US radiologists in the study of Beam et al.]
Comparing Two Modalities
Two (possibly correlated) modalities
The variables we now have:
- Degree of reader aggressiveness
  (controlled for by ROC and related paradigms)
- Range of difficulty of patient cases
- Range of reader skill
- Correlation of patients across modalities
- Correlation of readers across modalities
- Reader "jitter": inconsistency within a reader, etc.
Full analysis requires Multiple-Reader, Multiple-Case
(MRMC) ROC analysis, where every reader reads every case
MRMC ROC analysis:
The uncertainty of a difference in avg ROC parameters
between two modalities
contains only the variability that is
uncorrelated across modalities:

Var(difference) =
    (uncorrelated portion of Case variance)   / (# Cases)
  + (uncorrelated portion of Reader variance) / (# Readers)
  + (within-reader jitter & MRC)              / (# Cases x # Readers)
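(Aside: the expression above written as a small Python helper. The component values plugged in are made up, purely to show how the pieces combine.)

```python
def var_of_difference(case_var_uncorr, reader_var_uncorr,
                      jitter_plus_mrc, n_cases, n_readers):
    # Only variability uncorrelated across the two modalities survives
    # in the difference of average ROC parameters.
    return (case_var_uncorr / n_cases
            + reader_var_uncorr / n_readers
            + jitter_plus_mrc / (n_cases * n_readers))

# Hypothetical component values:
v = var_of_difference(case_var_uncorr=0.05, reader_var_uncorr=0.01,
                      jitter_plus_mrc=0.2, n_cases=100, n_readers=5)
half_width_95 = 1.96 * v ** 0.5    # normal-approximation 95% C.I.
print(f"95% C.I. half-width ~ +/- {half_width_95:.3f}")
```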
The now-classic example of digital mammography:
44 cancers in a total of 625 women
(Hendrick, Lewin et al., & GEMS)
5 readers, balanced reading of A (analog, SFM) & D (digital, FFDM)
MRMC ROC Results:
Mean (over 5 readers) ROC Area SFM = 0.77
Mean (over 5 readers) ROC Area FFDM = 0.76
95% C.I. about the difference of 0.01: +/- 0.064
Implications of the Variance Structure
(from MRMC ROC analysis via B.W.C.)
for sizing a trial with tighter error bars,
with, say, a goal of 95% C.I. ~ +/- 0.03:
78 cancers, 100 readers
100 cancers, 20 readers
etc.
This is the general idea
. . . to which we’ll return . . .
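(Aside: a sketch of that sizing logic, searching for (cases, readers) combinations that meet a target C.I. half-width. The variance components used are hypothetical placeholders, not the mammography study's.)

```python
# Hypothetical per-case / per-reader variance components:
CASE_VAR, READER_VAR, JITTER_MRC = 0.02, 0.002, 0.01

def half_width(n_cases, n_readers):
    var = (CASE_VAR / n_cases + READER_VAR / n_readers
           + JITTER_MRC / (n_cases * n_readers))
    return 1.96 * var ** 0.5       # 95% C.I. half-width (normal approx.)

TARGET = 0.03
for n_readers in (5, 20, 100):
    # Smallest case count meeting the target for this reader count.
    n_cases = next((c for c in range(10, 5001)
                    if half_width(c, n_readers) <= TARGET), None)
    print(f"{n_readers:3d} readers -> {n_cases} cases")
```

With these made-up components, the 5-reader design can never reach the target (prints None): the reader term alone exceeds the variance budget. That floor effect is why sizing must trade off both counts, as in the reader/case pairs above.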
Part II:
Human readers assisted by computer
(in this case CADx; e.g., the task might be one of discrimination
of microcalcification (µcalc) clusters: CA vs benign)
Note: Part I was a pedestrian overview
(Later we will provide a more formal analysis)
In the more formal components-of-variance analysis
the first three components are written:
cases (c)
(includes range of case difficulty & finite sample size)
readers (r)
(range of reader skill)
reader-by-case (r x c)
(whether the range of case difficulty depends on the reader, or vice versa)
A numerical sketch of this decomposition follows below.
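(Aside: a minimal numerical sketch of the decomposition just listed, assuming a fully crossed reader-by-case score matrix with one reading per cell and the classical random-effects ANOVA estimators. Data are simulated.)

```python
import numpy as np

rng = np.random.default_rng(1)
n_r, n_c = 5, 100                                  # readers x cases
reader_eff = rng.normal(0, 0.05, size=(n_r, 1))    # reader skill (r)
case_eff = rng.normal(0, 0.15, size=(1, n_c))      # case difficulty (c)
Y = 0.8 + reader_eff + case_eff + rng.normal(0, 0.10, size=(n_r, n_c))

grand = Y.mean()
ms_r = n_c * ((Y.mean(axis=1) - grand) ** 2).sum() / (n_r - 1)
ms_c = n_r * ((Y.mean(axis=0) - grand) ** 2).sum() / (n_c - 1)
resid = (Y - Y.mean(axis=1, keepdims=True)
           - Y.mean(axis=0, keepdims=True) + grand)
ms_rc = (resid ** 2).sum() / ((n_r - 1) * (n_c - 1))

# Method-of-moments component estimates (negative estimates floored at 0).
print("reader component (r)            :", max((ms_r - ms_rc) / n_c, 0.0))
print("case component (c)              :", max((ms_c - ms_rc) / n_r, 0.0))
print("reader-by-case component (r x c):", ms_rc)
```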
Jiang et al.: CADx classification of µcalc clusters,
"Separator" analysis
(note: case ~ reader components; also, 3 interactions, not shown)
. . . i.e., the contribution to variance from the case set
is comparable to the contribution to variance from a single reader
(a commonly encountered result)
. . . with error bars due to estimation from finite samples
. . . & when we account for the difference in variance structure,
without (upper) vs with computer-assisted reading (lower):
i.e., computer assist has greatly reduced:
the range of variation of "reader skill"
& the dependence of this range on the cases
Part III:
Computers alone
The Automatic Discriminant/Classifier Problem
. . . i.e., the classical field of
Statistical Pattern Recognition
Conventional Wisdom in Statistical Pattern Recognition:
(1)
Bias of Classifier Performance
is due only to the Finite Training Set
This is well-established
- Some examples follow
(Chan, Sahiner, Wagner, Petrick: Med Phys 26, 1999)
[Figure: ARTIFICIAL NEURAL NETWORK (ANN k21; 400 training iterations;
equal covariance matrices). Az vs 1/(no. of training samples per class)
for feature dimensionalities f3, f6, f9, f12, f15. Resubstitution curves
(upper) and hold-out curves (lower) approach each other as the training
sample grows.]
[Figure: QUADRATIC CLASSIFIER (equal covariance matrices). Az vs
1/(no. of training samples per class) for feature dimensionalities
f3, f6, f9, f12, f15. Resubstitution curves (upper) vs hold-out curves
(lower), with the gap growing with dimensionality.]
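(Aside: a small simulation in the spirit of the plots above, assuming Gaussian classes with equal covariance matrices and a simple Fisher linear discriminant in place of the paper's ANN and quadratic classifiers. It reproduces the optimistic-resubstitution vs pessimistic-hold-out pattern that shrinks with training-set size.)

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 9                          # e.g. a 9-feature ("f9") classifier
delta = np.full(dim, 0.5)        # class-mean separation (illustrative)

def fisher_auc(n_train, n_test=2000):
    # Draw training data for two Gaussian classes with equal covariance.
    tr0 = rng.normal(0, 1, (n_train, dim))
    tr1 = rng.normal(0, 1, (n_train, dim)) + delta
    # Pooled covariance (both classes centered), then Fisher weights.
    pooled = np.cov(np.vstack([tr0, tr1 - delta]).T)
    w = np.linalg.solve(pooled, tr1.mean(0) - tr0.mean(0))

    def auc(x0, x1):
        s0, s1 = x0 @ w, x1 @ w
        return (s1[:, None] > s0[None, :]).mean()  # Mann-Whitney estimate

    te0 = rng.normal(0, 1, (n_test, dim))
    te1 = rng.normal(0, 1, (n_test, dim)) + delta
    return auc(tr0, tr1), auc(te0, te1)  # (resubstitution, hold-out)

for n in (20, 50, 100, 500):
    resub, hold = fisher_auc(n)
    print(f"N={n:4d}  resub Az={resub:.3f}  hold-out Az={hold:.3f}")
```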
Conventional Wisdom in Statistical Pattern Recognition:
(2)
“Sampling Variance of Accuracy measures
(i.e., C.I.s or error bars on Az, etc.)
comes mainly from the finite number of Testers”
Usual practice in the CAD algorithm research community:
error bars on accuracy assessment
(e.g., ROC estimates from available software)
include only the effect of the finite number of testers.
Thus the "generalizability" is limited
to expressing the uncertainty from the finite testers,
and provides no idea of the uncertainty from the finite trainers.
Is Conventional Wisdom (2) True?
Our recent work demonstrates
that part (2) of the
Conventional Wisdom
is not only untrue . . . it is unwise
. . . especially if one is comparing algorithms . . .
the finite training set
contributes at least as much uncertainty
as the finite test set
We already have all the machinery we need for this . . .
. . . since there is an isomorphism between
the MRMC problem and Statistical Pattern Recognition:
MRMC                          Pattern Recognition
sample of patients   => C <=  test sample
sample of readers    => R <=  multiple training sets
imaging modality     => M <=  CADx algorithm

Components C, R, R x C:        correlated across modalities/algorithms
Components M x C, M x R, MRC:  uncorrelated across modalities/algorithms
We simulated a project involving 10 institutions,
each with its own set of training patients
(analogous to 10 radiologists).
Each institution designs 2 competing classifiers (analogous to modalities),
each classifier having the same architecture across institutions,
using its own training patients.
We included a single independent set of testing patients
to be classified by each institution.
Then we studied the 6 components of variance (as above)
over 300 Monte Carlo trials.
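(Aside: a toy version of this simulation, assuming a mean-difference linear classifier and Gaussian classes; it isolates the spread in Az due to a finite training set from the spread due to a finite test set. Sizes and distributions are illustrative, not the study's settings.)

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_train, n_test, n_mc = 5, 50, 200, 300   # 300 MC trials, as above
delta = np.full(dim, 0.6)                      # class separation (toy)

def draw(n, positive):
    return rng.normal(0, 1, (n, dim)) + (delta if positive else 0)

def train(x0, x1):
    return x1.mean(0) - x0.mean(0)   # mean-difference linear classifier

def auc(w, x0, x1):
    s0, s1 = x0 @ w, x1 @ w
    return (s1[:, None] > s0[None, :]).mean()

# Arm 1: training set varies, test set held fixed (training-set spread).
te0, te1 = draw(n_test, False), draw(n_test, True)
a_train = [auc(train(draw(n_train, False), draw(n_train, True)), te0, te1)
           for _ in range(n_mc)]

# Arm 2: classifier held fixed, test set varies (test-set spread).
w = train(draw(n_train, False), draw(n_train, True))
a_test = [auc(w, draw(n_test, False), draw(n_test, True))
          for _ in range(n_mc)]

print("SD of Az from finite training sets:", np.std(a_train).round(4))
print("SD of Az from finite test sets    :", np.std(a_test).round(4))
```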
Representative Results
When analyzing one algorithm at a time
all six components contribute variance
When comparing two competing algorithms
only the last three (uncorrelated across algorithms) contribute.
In either case,
uncertainty (variance) is clearly seen to come
from the finite training set (terms with R)
as well as the finite test set (terms with C).
Thus –
The contribution to uncertainty (error bars) from
finite training set*
can be equal to or greater than that from
finite test set
=> This argues for some form of bootstrapping
(resampling procedure)
when designing and assessing a classifier
(* ignored by conventional assessment software)
Contemporary Best Candidate Approach:
0.632 Bootstrap (Efron JASA ‘83) =>
0.632+ Bootstrap (Efron & Tibshirani JASA ‘97)
The latter has three contributions:
resubstitution assessment
leave-one-out bootstrap
the “no information” performance
(an adjustment for “overfitting”)
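(Aside: a compact sketch of the 0.632+ estimator, assuming a simple nearest-class-mean classifier and simulated data. It combines the three ingredients just listed, following the published construction: resubstitution error, leave-one-out bootstrap error, and the "no-information" rate.)

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 2)),      # class 0
               rng.normal(1, 1, (40, 2))])     # class 1
y = np.array([0] * 40 + [1] * 40)

def fit_predict(Xtr, ytr, Xte):
    # Toy stand-in classifier: assign each test point to the
    # nearer class mean of the training data.
    m0, m1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    d0 = ((Xte - m0) ** 2).sum(1)
    d1 = ((Xte - m1) ** 2).sum(1)
    return (d1 < d0).astype(int)

err_resub = (fit_predict(X, y, X) != y).mean()   # resubstitution error

# Leave-one-out bootstrap: each point is tested only by classifiers
# trained on bootstrap samples that did not contain it.
B, n = 200, len(y)
errs, counts = np.zeros(n), np.zeros(n)
for _ in range(B):
    idx = rng.integers(0, n, n)
    out = np.setdiff1d(np.arange(n), idx)
    if out.size == 0:
        continue
    errs[out] += fit_predict(X[idx], y[idx], X[out]) != y[out]
    counts[out] += 1
err_boot1 = (errs[counts > 0] / counts[counts > 0]).mean()

# No-information rate gamma and the .632+ weighting.
p1, q1 = y.mean(), fit_predict(X, y, X).mean()
gamma = p1 * (1 - q1) + (1 - p1) * q1
err1 = min(err_boot1, gamma)
R = ((err1 - err_resub) / (gamma - err_resub)
     if gamma > err_resub else 0.0)
R = min(max(R, 0.0), 1.0)                 # relative overfitting rate
w = 0.632 / (1 - 0.368 * R)
err_632plus = (1 - w) * err_resub + w * err1
print(f"resub={err_resub:.3f}  boot1={err_boot1:.3f}  .632+={err_632plus:.3f}")
```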
Research continues in this field
Multivariate ROC Analysis
is a very rich field, and applies to
medical imaging,
computer-assisted reading,
and statistical pattern recognition
(machine classifiers)
We have presented a unified approach
to the above problems based on work of
S. Beiden, R.F. Wagner, G. Campbell (BWC)
D. Dorfman, K. Berbaum, C.E. Metz (DBM)
& M. Maloof
. . . If time permits . . .
we will include the implications of the above analysis for
so-called “expected benefit” analysis
. . . i.e., averaging benefits (+ and - ) of a modality
over society . . .
. . . given the wide range of reader performance.