Comprehensive Introduction to the Evaluation of Neural
Networks and other Computational Intelligence Decision
Functions: Receiver Operating Characteristic, Jackknife,
Bootstrap and other Statistical Methodologies
David G. Brown and Frank Samuelson
Center for Devices and Radiological Health, FDA
6 July 2014
Course Outline
I. Performance measures for Computational Intelligence (CI) observers
   1. Accuracy
   2. Prevalence dependent measures
   3. Prevalence independent measures
II. Maximization of performance: Utility analysis/Cost functions
III. Receiver Operating Characteristic (ROC) analysis
   1. Sensitivity and specificity
   2. Construction of the ROC curve
   3. Area under the ROC curve (AUC)
IV. Error analysis for CI observers
   1. Sources of error
   2. Parametric methods
   3. Nonparametric methods
   4. Standard deviations and confidence intervals
V. Bootstrap methods
   1. Theoretical foundation
   2. Practical use
References
• Emphasis on algorithm innovation to the exclusion of performance assessment
• Use of subjective measures of performance –
“beauty contest”
• Use of “accuracy” as a measure of success
• Lack of error bars—My CIO is .01 better than yours (+/- ?)
• Flawed methodology—training and testing on same data
• Lack of appreciation for the many different sources of error that can be taken into account
Figures: "Original image" (Lena) and "CI improved image" (Baboon), courtesy of the Signal and Image Processing Institute at the University of Southern California; "Panel of experts" (funnymonkeysite.com).
I. Performance measures for computational intelligence (CI) observers
• Task based: (binary) discrimination task
– Two populations involved: “normal” and “abnormal”
• Accuracy – Intuitive but incomplete
– Different consequences for success or failure for each population
• Some measures depend on the prevalence (Pr) and some do not, where
  Pr = (number of abnormal subjects in the population) / (total number of subjects in the population)
– Prevalence dependent: accuracy, positive predictive value, negative predictive value
– Prevalence independent: sensitivity, specificity, ROC, AUC
• True optimization of performance requires knowledge of cost functions or utilities for successes and failures in both populations
How to make a CIO with >99% accuracy
• Medical problem: Screening mammography
(“screening” means testing in an asymptomatic population)
• Prevalence of breast cancer in the screening population Pr = 0.5 %
• My CIO always says “normal”
• Accuracy (Acc) is 99.5% (the accuracy of accepted present-day systems is ~75%)
• In a diagnostic setting (Pr ~ 20%) the accuracy is 80%; for this CIO, Acc = 1 - Pr
CIO operates on two different populations
Figure: distributions p(t|0) for normal cases and p(t|1) for abnormal cases along the t-axis, with the decision threshold at t = T.
Must consider effects on normal and abnormal populations separately
• CIO output t
• p(t|0) probability distribution of t for the population of normals
• p(t|1) probability distribution of t for the population of abnormals
• Threshold T. Everything to the right of T called abnormal, and everything to the left of T called normal
• Area of p(t|0) to left of T is the true negative fraction (TNF = specificity) and to the right the false positive fraction (FPF = type 1 error).
TNF + FPF = 1
• Area of p(t|1) to left of T is the false negative fraction (FNF = type 2 error) and to the right is the true positive fraction (TPF = sensitivity)
FNF + TPF = 1
• TNF, FPF, FNF, TPF all are prevalence independent, since each is some fraction of one of our two probability distributions
• Accuracy = Pr x TPF + (1-Pr) x TNF
Figure: the normal and abnormal case distributions with threshold T placed so that TNF = .5 and FPF = .5 for the normal cases, and FNF = .05 and TPF = .95 for the abnormal cases.
• Accuracy (Acc)
Acc = Pr x TPF + (1-Pr) x TNF
• Positive predictive value (PPV): fraction of positives that are true positives
PPV = TPF x Pr / (TPF x Pr + FPF x (1-Pr))
• Negative predictive value (NPV): fraction of negatives that are true negatives
NPV = TNF x (1-Pr) / (TNF x (1-Pr) + FNF x Pr)
• Using Pr = .05 and the TPF, TNF, FNF, FPF values above: TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5
Acc = .05x.95+.95x.5 = .52
PPV = .95x.05/(.95x.05+.5x.95) = .09
NPV = .5x.95/(.5x.95+.05x.05) = .995
• Using the mammography screening Pr and previous
TPF, TNF, FNF, FPF values: Pr = .005, TPF = .95,
TNF = 0.5, FNF=.05, FPF=0.5
Acc = .005x.95+.995x.5 = .50
PPV = .95x.005/(.95x.005+.5x.995) = .01
NPV = .5x.995/(.5x.995+.05x.005) = .9995
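As a concrete check of these formulas, here is a minimal Python sketch (function and variable names are mine, not from the course) that reproduces the two worked examples to rounding:

```python
# Prevalence-dependent measures from the prevalence-independent fractions.
# Uses TPF + FNF = 1 and TNF + FPF = 1, as defined earlier.

def prevalence_dependent_measures(tpf, tnf, prevalence):
    """Return accuracy, PPV, and NPV at a given prevalence."""
    fpf = 1.0 - tnf
    fnf = 1.0 - tpf
    acc = prevalence * tpf + (1.0 - prevalence) * tnf
    ppv = tpf * prevalence / (tpf * prevalence + fpf * (1.0 - prevalence))
    npv = tnf * (1.0 - prevalence) / (tnf * (1.0 - prevalence) + fnf * prevalence)
    return acc, ppv, npv

for pr in (0.05, 0.005):  # the two prevalences used above
    acc, ppv, npv = prevalence_dependent_measures(tpf=0.95, tnf=0.50, prevalence=pr)
    print(f"Pr={pr}: Acc={acc:.3f}, PPV={ppv:.3f}, NPV={npv:.4f}")
```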
Figure: Acc, PPV, and NPV as functions of prevalence for the screening mammography example (TPF = .95, FNF = .05, TNF = 0.5, FPF = 0.5).
Figure: Acc = NPV as a function of prevalence for the forced "normal" response CIO.
• Sensitivity = TPF
• Specificity = TNF = 1 - FPF
• Receiver Operating Characteristic (ROC) = TPF as a function of FPF (sensitivity as a function of 1 - specificity)
• Area under the ROC curve (AUC) = sensitivity averaged over all values of specificity
Figure: the ROC curve traced out as the decision threshold is swept over the normal / Class 0 and abnormal / Class 1 distributions, plotted as TPF versus FPF (1 - specificity); the slope of the ROC curve at an operating point and the entire ROC curve are indicated. Axis labels include False Positive Fraction and True Negative Fraction. (After Craig Beam et al.)
• Need to know the utilities or costs of each type of decision outcome, but these are very hard to estimate accurately; one does not simply maximize accuracy.
• Need prevalence
• For mammography example
– TPF: prolongation of life minus treatment cost
– FPF: diagnostic work-up cost, anxiety
– TNF: peace of mind
– FNF: delay in treatment => shortened life
• Hypothetical assignment of utilities for some decision threshold T:
– Utility_T = U(TPF) x TPF x Pr + U(FPF) x FPF x (1-Pr) + U(TNF) x TNF x (1-Pr) + U(FNF) x FNF x Pr
– U(TPF) = 100, U(FPF) = -10, U(TNF) = 4, U(FNF) = -20
– Utility_T = 100 x .95 x .05 - 10 x .50 x .95 + 4 x .50 x .95 - 20 x .05 x .05 = 1.85
• Now if we only knew how to trade off TPF versus FPF, we could optimize (?) medical performance.
Choice of ROC operating point through utility analysis —screening mammography
u = (U_TPF x TPF + U_FNF x FNF) Pr + (U_TNF x TNF + U_FPF x FPF)(1 - Pr)
  = (U_TPF x TPF + U_FNF x (1 - TPF)) Pr + (U_TNF x (1 - FPF) + U_FPF x FPF)(1 - Pr)

du/dFPF = (U_FPF - U_TNF)(1 - Pr) + (U_TPF - U_FNF) Pr x dTPF/dFPF = 0

dTPF/dFPF = (U_TNF - U_FPF)(1 - Pr) / [(U_TPF - U_FNF) Pr]

Pr = .005: dTPF/dFPF = 23
Pr = .05: dTPF/dFPF = 2.2
(U_TPF = 100, U_FNF = -20, U_TNF = 4, U_FPF = -10)
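A small sketch of the same calculation (names are mine): the expected utility at an operating point and the utility-optimal ROC slope for the two prevalences.

```python
# Expected utility of an operating point and the utility-optimal ROC slope,
# per u = (U_TPF*TPF + U_FNF*FNF)*Pr + (U_TNF*TNF + U_FPF*FPF)*(1 - Pr).

U_TPF, U_FNF, U_TNF, U_FPF = 100.0, -20.0, 4.0, -10.0

def expected_utility(tpf, fpf, prevalence):
    fnf, tnf = 1.0 - tpf, 1.0 - fpf
    return ((U_TPF * tpf + U_FNF * fnf) * prevalence
            + (U_TNF * tnf + U_FPF * fpf) * (1.0 - prevalence))

def optimal_roc_slope(prevalence):
    """dTPF/dFPF at which du/dFPF = 0."""
    return (U_TNF - U_FPF) * (1.0 - prevalence) / ((U_TPF - U_FNF) * prevalence)

print(expected_utility(tpf=0.95, fpf=0.50, prevalence=0.05))  # 1.85
print(optimal_roc_slope(0.005), optimal_roc_slope(0.05))      # ~23 and ~2.2
```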
Figure: the normal and abnormal case distributions with the decision threshold, and the corresponding ROC curve (TPF versus FPF, 1 - specificity) with the ROC slope at that operating point.
• TPF, FPF, TNF, FNF, Accuracy, the ROC curve, and AUC are all fractions or probabilities.
• Normally we have a finite sample of subjects on which to test our CIO. From this finite sample we try to estimate the above fractions
– These estimates will vary depending upon the sample selected (statistical variation).
– Estimates can be nonparametric or parametric
• True value: TPF = (number of abnormals in the population that would be selected by the CIO) / (number of abnormals in the population)
• Estimate: TPF = (number of abnormals in the sample that were selected by the CIO) / (number of abnormals in the sample)
• Number in sample << number in population (at least in theory)
• Receiver Operating Characteristic
• Binary Classification
• Test result is compared to a threshold
Figure: computational intelligence observer output t along the t-axis, with the distribution of output for normal / Class 0 subjects, p(t|0), the distribution of output for abnormal / Class 1 subjects, p(t|1), and the decision threshold.
• Specificity = True Negative Fraction = TNF
• Sensitivity = True Positive Fraction = TPF
Figure: the normal / Class 0 and abnormal / Class 1 distributions with the threshold; for this threshold, decisions D0 and D1 give TNF = 0.50 and TPF = 0.95.
• 1 - Specificity = False Positive Fraction = FPF
• 1 - Sensitivity = False Negative Fraction = FNF
Figure: the same distributions and threshold; decisions D0 and D1 give TNF = 0.50, FPF = 0.50, FNF = 0.05, TPF = 0.95.
Figures: sweeping the threshold across the normal / Class 0 and abnormal / Class 1 distributions generates different operating points on the ROC curve (TPF versus FPF, 1 - specificity): a high-sensitivity threshold, a threshold with sensitivity = specificity, and a high-specificity threshold.
Which CIO is best?
          TPF    FPF
  CIO #1  0.50   0.07
  CIO #2  0.78   0.22
  CIO #3  0.93   0.50
Do not compare rates of one class (e.g., TPF) at different rates of the other class (FPF).
          TPF    FPF
  CIO #1  0.50   0.07
  CIO #2  0.78   0.23
  CIO #3  0.93   0.50
Figure: entire ROC curves (TPF versus FPF, 1 - specificity) for three CIOs with AUC = 0.98, 0.85, and 0.5; the area under the ROC curve summarizes discriminability, i.e., CIO performance, over all operating points.
• AUC is a separation probability
• AUC = probability that
– CIO output for abnormal > CIO output for normal
– CIO correctly tells which of 2 subjects is normal
• Estimating AUC from a finite sample
– Select an abnormal subject score x_i
– Select a normal subject score y_k
– Is x_i > y_k?
– Average over all pairs (x_i, y_k):
  AUĈ = (1/(n_1 n_0)) Σ_{i=1}^{n_1} Σ_{k=1}^{n_0} I(x_i > y_k)
• ROC plots in probability space
• ROC plots in quantile space
• When the input features of the data are distributed as Gaussians with equal variance,
– the optimal discriminant, the log-likelihood ratio, is a linear function,
– that linear discriminant is also distributed as a Gaussian,
– the signal-to-noise ratio (SNR) is easily calculated from the input data distributions and is a monotonic function of AUC.
• Can serve as a benchmark against which to measure CIO performance
• p(x|0): probability distribution of the data x for the population of normals; p(x|1): probability distribution of x for the population of abnormals. The components x_i are independent Gaussian distributed, with means 0 and m_i respectively and identical variances s_i^2:

  p(x|0) = Π_{i=1}^{D} (2π s_i^2)^{-1/2} exp(-x_i^2 / (2 s_i^2))
  p(x|1) = Π_{i=1}^{D} (2π s_i^2)^{-1/2} exp(-(x_i - m_i)^2 / (2 s_i^2))

• The likelihood ratio is
  L = p(1|x) / p(0|x) = p(x|1) p(1) / [p(x|0) p(0)] = k p(x|1) / p(x|0)
  L = k exp(Σ_{i=1}^{D} m_i x_i / s_i^2) exp(-Σ_{i=1}^{D} m_i^2 / (2 s_i^2))

• L is monotonic in t_1 = exp(Σ_{i=1}^{D} m_i x_i / s_i^2), so an equivalent decision variable is the linear discriminant
  t = ln(t_1) = Σ_{i=1}^{D} m_i x_i / s_i^2

• With Δ^2 ≡ Σ_{i=1}^{D} m_i^2 / s_i^2, the decision variable is Gaussian in each class:
  p(t|0) = Gauss(0, Δ^2)
  p(t|1) = Gauss(Δ^2, Δ^2)

• SNR = (⟨t⟩_1 - ⟨t⟩_0) / s(t), with s(t)^2 = Var(t|0) + Var(t|1) = 2Δ^2, so
  SNR = Δ^2 / (2Δ^2)^{1/2} = (Δ^2 / 2)^{1/2}

• AUC = Φ(SNR), where Φ is the cumulative Gaussian distribution
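As a check on this derivation, here is a small sketch of my own, under the stated independent-Gaussian assumptions, that forms the linear discriminant t = Σ m_i x_i / s_i^2, computes SNR and AUC = Φ(SNR), and compares with a Monte Carlo estimate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
D = 5
m = rng.uniform(0.2, 1.0, size=D)    # class-1 means (class 0 has zero means)
s2 = rng.uniform(0.5, 2.0, size=D)   # shared per-feature variances s_i^2

delta2 = np.sum(m**2 / s2)           # Delta^2 = sum_i m_i^2 / s_i^2
snr = np.sqrt(delta2 / 2.0)
auc_theory = norm.cdf(snr)           # AUC = Phi(SNR)

# Monte Carlo check: apply t(x) = sum_i m_i x_i / s_i^2 to sampled cases.
n = 5000
x0 = rng.normal(0.0, np.sqrt(s2), size=(n, D))
x1 = rng.normal(m, np.sqrt(s2), size=(n, D))
t0, t1 = x0 @ (m / s2), x1 @ (m / s2)
auc_mc = (t1[:, None] > t0[None, :]).mean()
print(auc_theory, auc_mc)            # should agree to within about 0.01
```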
• The likelihood ratio of the decision variable t is the slope of the ROC curve:
  ROC: TPF as a function of FPF, with TPF = 1 - P(t|1) and FPF = 1 - P(t|0), where P denotes the cumulative distribution of t
  slope = dTPF/dFPF = dP(t|1)/dP(t|0) = p(t|1)/p(t|0) = L(t)
• Sources of error
• Parametric methods
• Nonparametric methods
• Standard deviations and confidence intervals
• Hazards
• Test error—limited number of samples in the test set
• Training error—limited number of samples in the training set
– Incorrect parameters
– Incorrect feature selection, etc.
• Human observer error (when applicable)
– Intraobserver
– Interobserver
• Use known underlying probability distribution – may be exact for simulated data
• Assume Gaussian distribution
• Other parameterizations – e.g., binomial, or ROC linearity in z-transformation coordinates (Φ^{-1}(TPF) versus Φ^{-1}(FPF), where Φ is the cumulative Gaussian distribution)
• For single population measures, f= TPF, FPF, FNF, TNF
• Var(f) = f (1-f) / N
• For AUC (back of the envelope calculation):
  Var(AUC) ≈ AUC (1 - AUC) / N
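A two-line sketch of these binomial ("back of the envelope") error bars; the example numbers are mine:

```python
import math

def binomial_se(fraction, n):
    """Standard error of a binomial fraction: sqrt(f * (1 - f) / N)."""
    return math.sqrt(fraction * (1.0 - fraction) / n)

print(binomial_se(0.95, 100))   # SE of a TPF of .95 estimated from 100 abnormal cases
print(binomial_se(0.936, 40))   # rough SE of an AUC of .936 with 40 cases, ~0.039
```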
• Repeat the experiment M times
• Estimate the distribution parameters – e.g., for a Gaussian distributed performance measure f, G(μ, σ^2):
  f̂ = μ̂_f = (1/M) Σ_{i=1}^{M} f_i
  σ̂_f^2 = (1/(M-1)) Σ_{i=1}^{M} (f_i - μ̂_f)^2
  σ̂_f̂^2 = σ̂_f^2 / M
• Find error bars or confidence limits:
  f̂ ± σ̂_f̂   or   f̂ ± k σ̂_f̂   (k_95% = 1.96)
• Mean AUC
• “Distribution” variance
• Variance of mean
• Error bars, confidence interval
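A minimal Monte Carlo sketch of these quantities, using simulated AUC values as a stand-in for M repeated experiments (the numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 100
auc_values = rng.normal(0.85, 0.03, size=M)  # stand-in for M repeated AUC estimates

mean_auc = auc_values.mean()                 # f-hat = mu-hat_f
var_dist = auc_values.var(ddof=1)            # sigma-hat_f^2 ("distribution" variance)
var_mean = var_dist / M                      # variance of the mean
half_width = 1.96 * np.sqrt(var_mean)        # k_95% = 1.96
print(f"AUC = {mean_auc:.3f} +/- {half_width:.3f} (95% confidence interval)")
```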
• Reuse the data you have:
Resubstitution, Resampling
• Two common approaches:
– Jackknife
– Bootstrap
• Have N observations
• Leave out m of them; there are M = C(N, m) = N! / (m! (N-m)!) distinct subsets from which to calculate μ̂ and σ̂^2
• N = 10, m = 5: M = 252; N = 10, m = 1: M = 10
• Given N cases, let AUC_N be the estimate from all N cases and AUC_{N-1} the average of the N leave-one-out estimates AUC_{N-1}(i)
• If the finite-sample bias falls off as 1/N with unknown coefficient k:
  AUC_N = AUC_∞ - k/N
  AUC_{N-1} = AUC_∞ - k/(N-1)
  AUC_N - AUC_{N-1} = k / (N(N-1))
• Eliminating the unknown k between the two sample sizes gives the bias-corrected jackknife estimate:
  AUĈ_J = N x AUC_N - (N-1) x AUC_{N-1}
• Jackknife variance estimate:
  σ̂_J^2 = ((N-1)/N) Σ_{i=1}^{N} (AUC_{N-1}(i) - AUC_{N-1})^2
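A sketch of the leave-one-out jackknife for AUC based on the formulas above (the helper names and the toy Gaussian data are mine):

```python
import numpy as np

def auc_np(x, y):
    """Nonparametric AUC: P(abnormal score > normal score), ties scored as 1/2."""
    x, y = np.asarray(x, float)[:, None], np.asarray(y, float)[None, :]
    return (x > y).mean() + 0.5 * (x == y).mean()

def jackknife_auc(abnormal_scores, normal_scores):
    """Leave-one-out jackknife: bias-corrected AUC and its variance estimate."""
    x, y = list(abnormal_scores), list(normal_scores)
    n = len(x) + len(y)
    auc_all = auc_np(x, y)
    loo = [auc_np(x[:i] + x[i + 1:], y) for i in range(len(x))]   # drop one abnormal
    loo += [auc_np(x, y[:k] + y[k + 1:]) for k in range(len(y))]  # drop one normal
    loo = np.asarray(loo)
    auc_jack = n * auc_all - (n - 1) * loo.mean()                 # N*AUC_N - (N-1)*AUC_{N-1}
    var_jack = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
    return auc_jack, var_jack

rng = np.random.default_rng(3)
aucj, varj = jackknife_auc(rng.normal(1.5, 1, 20), rng.normal(0, 1, 20))
print(f"jackknife AUC = {aucj:.3f}, s.d. = {np.sqrt(varj):.4f}")
```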
• Divide both the normal and abnormal classes in half, yielding 4 possible pairings
• With AUC_N = AUC_∞ - k/N as above and AUC_{N/2} = AUC_∞ - 2k/N the average AUC over the 4 half-size pairings, the same elimination of k gives the Fukunaga-Hayes estimate:
  AUĈ_FH = 2 x AUC_N - AUC_{N/2}
Figure: AUC estimates as a function of the number of cases N. The solid line is the multilayer perceptron result, open circles the jackknife estimates, closed circles the Fukunaga-Hayes estimates; the horizontal dotted line is the asymptotic ideal result.
• Theoretical foundation
• Practical use
• What you have is what you’ve got—the data is your best estimate of the probability distribution:
– Sampling with replacement, M times
– Adequate number of samples M>N
Simple bootstrap:
  σ̂_B^2 = (1/(M-1)) Σ_{i=1}^{M} (AUC_B(i) - ⟨AUC_B⟩)^2
where ⟨AUC_B⟩ is the mean AUC over the M bootstrap samples.
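A sketch of the simple bootstrap estimate of the AUC standard deviation (resampling each class with replacement is my choice of scheme; helper names and the toy data are mine):

```python
import numpy as np

def auc_np(x, y):
    """Nonparametric AUC, ties scored as 1/2."""
    x, y = np.asarray(x, float)[:, None], np.asarray(y, float)[None, :]
    return (x > y).mean() + 0.5 * (x == y).mean()

def bootstrap_auc_sd(abnormal_scores, normal_scores, M=2000, seed=0):
    """sigma-hat_B: standard deviation of AUC over M bootstrap resamples."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(abnormal_scores), np.asarray(normal_scores)
    aucs = np.empty(M)
    for i in range(M):
        xb = rng.choice(x, size=x.size, replace=True)  # resample abnormals with replacement
        yb = rng.choice(y, size=y.size, replace=True)  # resample normals with replacement
        aucs[i] = auc_np(xb, yb)
    return aucs.std(ddof=1)

rng = np.random.default_rng(4)
sd = bootstrap_auc_sd(rng.normal(2.1, 1, 20), rng.normal(0, 1, 20))
print(f"bootstrap s.d. of AUC: {sd:.4f}")  # a few percent, cf. the table below
```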
Bootstrap and jackknife error estimates
• Standard deviation of AUC: solid line simulation results, open circles jackknife estimates, closed circles bootstrap estimates. Note how much larger the jackknife error bars are than those provided by the bootstrap method.
Two Gaussian distributions, 20 normal and 20 abnormal cases, population AUC = .936:
  Actual s.d.                .0380
  Binomial approx.           .0380
  Bootstrap                  .0388
  Jackknife                  .0396
  Mean bootstrap AUC est.    .936
  Mean jackknife bias est.   2x10^-17
• Have N cases, draw M samples of size N with replacement for training (Have on average .632 x N unique cases in each sample of size N)
• Test on the unused (~.368 x N) cases for each sample
• Probability that case i is NOT in a given bootstrap sample of size N:
  p(case i left out) = (1 - 1/N)^N ≈ e^{-1} = .368...
• Probability that case i IS in the sample:
  p(case i in training) = 1 - (1 - 1/N)^N ≈ 1 - e^{-1} = .632...
  so on average .632 x N distinct cases are used for training and .368 x N for testing

  N          (1 - 1/N)^N    1 - (1 - 1/N)^N
  5          .328           .672
  10         .349           .651
  20         .358           .642
  100        .366           .634
  Infinity   .368           .632
• Have N cases, draw M samples of size N with replacement for training (Have on average .632 x N unique cases in each sample of size N)
• Test on the unused (~.368 x N) cases for each sample
• Get the bootstrap average (out-of-sample) result AUC_B
• Get the resubstitution result (testing on the training set) AUC_R
• AUC_.632 = .632 x AUC_B + .368 x AUC_R
• As the variance, take the AUC_B variance
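A sketch of the .632 bootstrap with a toy nearest-class-mean classifier standing in for the CIO (the classifier, data, and names are mine; only the .632/.368 combination follows the slide):

```python
import numpy as np

def nearest_mean_score(train_X, train_y, test_X):
    """Toy stand-in for a CIO: distance to class-0 mean minus distance to class-1 mean."""
    mu0 = train_X[train_y == 0].mean(axis=0)
    mu1 = train_X[train_y == 1].mean(axis=0)
    return np.linalg.norm(test_X - mu0, axis=1) - np.linalg.norm(test_X - mu1, axis=1)

def auc_np(scores, labels):
    a = scores[labels == 1][:, None]
    b = scores[labels == 0][None, :]
    return (a > b).mean() + 0.5 * (a == b).mean()

rng = np.random.default_rng(5)
N, d = 40, 5
X = np.vstack([rng.normal(0.0, 1, (N // 2, d)), rng.normal(0.8, 1, (N // 2, d))])
y = np.repeat([0, 1], N // 2)

oob_aucs = []
for _ in range(200):                              # M bootstrap samples
    idx = rng.integers(0, N, size=N)              # draw N cases with replacement
    oob = np.setdiff1d(np.arange(N), idx)         # the ~.368*N unused cases
    if len(np.unique(y[idx])) < 2 or len(np.unique(y[oob])) < 2:
        continue                                  # need both classes on each side
    s = nearest_mean_score(X[idx], y[idx], X[oob])
    oob_aucs.append(auc_np(s, y[oob]))

auc_B = np.mean(oob_aucs)                         # bootstrap (out-of-sample) result
auc_R = auc_np(nearest_mean_score(X, y, X), y)    # resubstitution result
auc_632 = 0.632 * auc_B + 0.368 * auc_R           # the .632 combination
print(auc_B, auc_R, auc_632)                      # AUC_.632 lies between the two
```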
• Cover's theorem: the number of dichotomies of N points in general position in d-dimensional space that a hyperplane can separate is
  C(N, d) = 2 Σ_{k=0}^{d} C(N-1, k)
• For N < 2(d+1) a hyperplane exists that will perfectly separate almost all possible dichotomies of N points in d-space
Figure: the separable fraction f_d(N) = C(N, d)/2^N for d = 1, 5, 25, 125, and the limit of large d. The abscissa x = N/(2(d+1)) is scaled so that the values f_d(N) = 0.5 lie superposed at x = 1 for all d.
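A quick sketch of Cover's count and the separable fraction f_d(N) used in the figure (function names are mine):

```python
from math import comb

def separable_fraction(N, d):
    """f_d(N) = C(N, d) / 2**N, with C(N, d) = 2 * sum_{k=0}^{d} C(N-1, k)."""
    c = 2 * sum(comb(N - 1, k) for k in range(d + 1))
    return c / 2**N

for d in (1, 5, 25):
    # fraction is 1 at N = d + 1, exactly 0.5 at the capacity N = 2*(d + 1), small beyond
    print(d, [round(separable_fraction(N, d), 3) for N in (d + 1, 2 * (d + 1), 4 * (d + 1))])
```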
• Reporting on training data results/ testing on training data
• Carrying out any part of the training process on data later used for testing
– e.g., using all of the data to select a manageable feature set from among a large number of features —and then dividing the data into training and test sets.
Figure: distributions of AUC values in 900 simulation experiments (left) and the mean ROC curves (right) for four validation methods: Method 1, AUC = 0.52; Method 2, AUC = 0.62; Method 3, AUC = 0.82; Method 4, AUC = 0.91.
Method 1 – Feature selection and classifier training on one dataset and classifier testing on another independent dataset; Method 2 – Given perfect feature selection, classifier training on one dataset and classifier testing on another independent dataset; Method 3 – Feature selection using the entire dataset, which is then partitioned into two parts, one for training and one for testing the classifier; Method 4 – Feature selection, classifier training, and testing all on the same dataset.
Figure: feature selection performance in Method 1. Left: the number of experiments (out of 900) in which each feature was selected, versus feature index; by design of the simulation population, the first 30 features are useful for classification and the remainder are useless. Right: the distribution of the number of useful features (out of 30) selected in the 900 experiments.
• Accuracy and other prevalence dependent measures are inadequate
• ROC/AUC provide good measures of performance
• Uncertainty must be quantified
• Bootstrap and jackknife techniques are useful methods
[1] K. Fukunaga, Statistical Pattern Recognition, 2nd ed. Boston: Harcourt Brace Jovanovich, 1990.
[2] K. Fukunaga and R. R. Hayes, "Effects of sample size in classifier design," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-11, pp. 873–885, 1989.
[3] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics. New York: John Wiley & Sons, 1966.
[4] J. P. Egan, Signal Detection Theory and ROC Analysis. New York: Academic Press, 1975.
[5] C. E. Metz, "Basic principles of ROC analysis," Seminars in Nuclear Medicine, vol. VIII, no. 4, 1978.
[6] H. H. Barrett and K. J. Myers, Foundations of Image Science. Hoboken: John Wiley & Sons, 2004, ch. 13, Statistical Decision Theory.
[7] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Boca Raton: Chapman & Hall/CRC, 1993.
[8] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
[9] A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997.
[10] B. Efron, "Estimating the error rate of a prediction rule: Some improvements on cross-validation," Journal of the American Statistical Association, vol. 78, pp. 316–331, 1983.
[11] B. Efron and R. J. Tibshirani, "Improvements on cross-validation: The .632+ bootstrap method," Journal of the American Statistical Association, vol. 92, no. 438, pp. 548–560, 1997.
[12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. New York: Springer, 2009.
[13] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[14] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
[15] R. F. Wagner, D. G. Brown, J. P. Guedon, K. J. Myers, and K. A. Wear, "Multivariate Gaussian pattern classification: effects of finite sample size and the addition of correlated or noisy features on summary measures of goodness," in Information Processing in Medical Imaging, Proceedings of IPMI '93, 1993, pp. 507–524.
[16] R. F. Wagner, D. G. Brown, J. P. Guedon, K. J. Myers, and K. A. Wear, "On combining a few diagnostic tests or features," in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
[17] D. G. Brown, A. C. Schneider, M. P. Anderson, and R. F. Wagner, "Effects of finite sample size and correlated noisy input features on neural network pattern classification," in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
[18] C. A. Beam, "Analysis of clustered data in receiver operating characteristic studies," Statistical Methods in Medical Research, vol. 7, pp. 324–336, 1998.
[19] W. A. Yousef et al., "Assessing classifiers from two independent data sets using ROC analysis: A nonparametric approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1809–1817, 2006.
[20] F. W. Samuelson and D. G. Brown, "Application of Cover's theorem to the evaluation of the performance of CI observers," in Proceedings of the IJCNN 2011, 2011.
[21] W. Chen and D. G. Brown, "Optimistic bias in the assessment of high dimensional classifiers with a limited dataset," in Proceedings of the IJCNN 2011, 2011.