Lecture slides

Supervised classification performance
(prediction) assessment
Dr. Huiru Zheng, Dr. Francisco Azuaje
School of Computing and Mathematics
Faculty of Engineering
University of Ulster, N. Ireland, UK
Building prediction models
• Different models, tools and applications.
• The problem of prediction (classification).
[Diagram: data (properties, values) describing a process, category, response, event, condition or action are fed to a prediction model (P), which produces predictions.]
Building prediction models
Supervised learning methods
Training phase:
A set of cases and their respective labels
are used to build a classification model.
A, C → Prediction model → C'
Test phase:
the trained classifier is used to predict
new cases.
A, (C) → Prediction model → C*
Prediction models, such as ANNs, aim to achieve the ability to generalise: the capacity to correctly classify cases or problems unseen during training.
Quality indicator: accuracy during the test phase.
Building prediction models – Assessing their
quality
A classifier will be able to generalise if:
a) its architecture and learning parameters have been
properly defined, and
b) enough training data are available.
• The second condition is difficult to achieve due to
resource and time constraints.
• Key limitations appear when dealing with small data samples, which is a common feature observed in many studies.
• A small test data set may contribute to an
inaccurate performance assessment.
Key questions
• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects of small vs. large data sets?
The problem of Data Sampling
What is Accuracy?

Accuracy = (No. of correct predictions) / (No. of predictions) = (TP + TN) / (TP + TN + FP + FN)
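A minimal Python sketch of this formula (the function name is illustrative; the counts match classifier A from Examples (1) below):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Classifier A from Examples (1): TP=25, TN=25, FP=25, FN=25
print(accuracy(25, 25, 25, 25))  # 0.5, i.e. 50%
```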
Examples (1)

classifier   TP   TN   FP   FN   Accuracy
A            25   25   25   25   50%
B            50   25   25    0   75%
C            25   50    0   25   75%
D            37   37   13   13   74%

Clearly, B, C and D are all better than A.
Is B better than C, D? Is C better than B, D? Is D better than B, C?
Accuracy may not tell the whole story.
Examples (2)

classifier   TP   TN   FP   FN   Accuracy
A            25   75   75   25   50%
B             0  150    0   50   75%
C            50    0  150    0   25%
D            30  100   50   20   65%

• Clearly, D is better than A.
• Is B better than A, C, D?
What is Sensitivity (recall)?

Sensitivity (true positive rate) = (No. of correct positive predictions) / (No. of positives) = TP / (TP + FN)

The true negative rate, TN / (TN + FP), is termed specificity.
What is Precision?

Precision (wrt positives) = (No. of correct positive predictions) / (No. of positive predictions) = TP / (TP + FP)
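A small Python sketch of these quantities, checked against classifier D from Examples (2) (the function names are illustrative):

```python
def sensitivity(tp, fn):
    """Recall / true positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def precision(tp, fp):
    """Correct positive predictions over all positive predictions: TP / (TP + FP)."""
    return tp / (tp + fp)

# Classifier D from Examples (2): TP=30, TN=100, FP=50, FN=20
print(sensitivity(30, 20))   # 0.60
print(specificity(100, 50))  # 0.666...
print(precision(30, 50))     # 0.375
```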
Precision-Recall Trade-off
• A predicts better than B if A has better recall and precision than B.
• There is a trade-off between recall and precision.
• In some applications, once you reach a satisfactory precision, you optimize for recall.
• In some applications, once you reach a satisfactory recall, you optimize for precision.
[Figure: precision plotted against recall, illustrating the trade-off.]
Comparing Prediction Performance
• Accuracy is the obvious measure
– But it conveys the right intuition only when the
positive and negative populations are roughly
equal in size
• Recall and precision together form a better
measure
– But what do you do when A has better recall
than B and B has better precision than A?
Some Alternate measures
• F-Measure: the harmonic mean of recall and precision (wrt positives):
  F = (2 × recall × precision) / (recall + precision)
  (a small code sketch follows below)
• Adjusted Accuracy: weight sensitivity and specificity by the importance of the classes.
• ROC curve: Receiver Operating Characteristic analysis.
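A small sketch of the F-measure, reusing the counts of classifier D from Examples (2) (illustrative values only):

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of recall and precision (wrt positives)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

# Classifier D from Examples (2): TP=30, FP=50, FN=20
print(f_measure(30, 50, 20))  # recall=0.60, precision=0.375 -> F ~ 0.46
```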
Adjusted Accuracy
• Weigh by the importance of the classes:

Adjusted accuracy = α × Sensitivity + β × Specificity, where α + β = 1 (typically, α = β = 0.5)

classifier   TP   TN   FP   FN   Accuracy   Adj. Accuracy
A            25   75   75   25   50%        50%
B             0  150    0   50   75%        50%
C            50    0  150    0   25%        50%
D            30  100   50   20   65%        63%

But what values for α and β?
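A small sketch that reproduces the adjusted-accuracy column above with α = β = 0.5 (the dictionary of counts is simply the table restated):

```python
def adjusted_accuracy(tp, tn, fp, fn, alpha=0.5, beta=0.5):
    """alpha * sensitivity + beta * specificity, with alpha + beta = 1."""
    return alpha * tp / (tp + fn) + beta * tn / (tn + fp)

# (TP, TN, FP, FN) for classifiers A-D from the table above
counts = {"A": (25, 75, 75, 25), "B": (0, 150, 0, 50),
          "C": (50, 0, 150, 0), "D": (30, 100, 50, 20)}
for name, (tp, tn, fp, fn) in counts.items():
    print(name, round(adjusted_accuracy(tp, tn, fp, fn), 2))
# A 0.5, B 0.5, C 0.5, D 0.63
```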
ROC Curves
• By changing the decision threshold t, we get a range of sensitivities and specificities for a classifier.
• A predicts better than B if A has better sensitivities than B at most specificities.
• This leads to the ROC curve, which plots sensitivity vs. (1 – specificity).
• The larger the area under the ROC curve, the better.
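A minimal numpy sketch of building a ROC curve by sweeping the threshold t over a set of made-up classifier scores (the scores and labels are purely illustrative):

```python
import numpy as np

# Made-up classifier scores and true labels (1 = positive, 0 = negative).
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   1,   0,   0])

thresholds = np.sort(np.unique(scores))[::-1]   # sweep t from high to low
tpr, fpr = [], []                               # sensitivity and 1 - specificity
for t in thresholds:
    pred = scores >= t
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    tpr.append(tp / np.sum(labels == 1))
    fpr.append(fp / np.sum(labels == 0))

auc = np.trapz(tpr, fpr)   # area under the ROC curve (larger is better)
print(round(auc, 3))       # 0.8 for these made-up values
```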
Key questions
• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects of small vs. large data sets?
The problem of Data Sampling
Data sampling techniques
Main goals:
• Reduction of the estimation bias (estimates that are neither too optimistic nor too conservative).
• Reduction of the variance introduced by a small data set.
Other important goals
a) to establish differences between data
sampling techniques when applied to small
and larger datasets,
b) to study the response of these methods
to the size and number of train-test sets,
and
c) to discuss criteria for the selection of
sampling techniques.
Three Data Sampling Techniques
• cross-validation
• leave-one-out
• bootstrap.
k-fold cross validation
Randomly divides the data into the training and test sets.
This process is repeated k times and the classification
performance is the average of the individual test estimates.
N samples, p training samples, q test samples (q = N – p).
[Diagram: in each of k experiments (Experiment 1, 2, …, k), the N cases are randomly split into a training set and a test set.]
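A minimal sketch of this repeated random-splitting procedure (as described, it resembles repeated random subsampling; the synthetic data and the k-NN classifier are stand-ins, not the method used in the study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: N cases, each with a few features and a class label.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

def random_split_accuracy(X, y, test_fraction=0.5, k=50, seed=0):
    """Repeat a random train/test split k times and average test accuracy."""
    rng = np.random.default_rng(seed)
    n = len(y)
    q = int(n * test_fraction)          # q test samples, p = n - q training samples
    scores = []
    for _ in range(k):
        idx = rng.permutation(n)
        test_idx, train_idx = idx[:q], idx[q:]
        clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(scores)

print(random_split_accuracy(X, y, test_fraction=0.5, k=50))
```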
k-fold cross validation
The classifier may not be able to accurately predict
new cases if the amount of data used for training is
too small. At the same time, the quality
assessment may not be accurate if the portion of
data used for testing is too small.
[Diagram: the splitting procedure; what proportion p% for training and q% for testing?]
The Leave-One-Out Method
• Given N cases available in a dataset, a classifier is
trained on (N-1) cases, and then is tested on the case
that was left out.
• This is repeated N times until every case in the
dataset has been included once as a cross-validation
instance.
• The results are averaged across the N test cases to
estimate the classifier’s prediction performance.
[Diagram: in each of N experiments (Experiment 1, 2, …, N), one case is held out for testing and the remaining N–1 cases are used for training.]
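A minimal sketch of the leave-one-out procedure under the same stand-in assumptions (synthetic data, k-NN classifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=50, n_features=10, random_state=0)

def leave_one_out_accuracy(X, y):
    """Train on N-1 cases, test on the left-out case; average over all N cases."""
    n = len(y)
    hits = 0
    for i in range(n):
        train = np.arange(n) != i          # boolean mask: everything except case i
        clf = KNeighborsClassifier().fit(X[train], y[train])
        hits += int(clf.predict(X[i:i+1])[0] == y[i])
    return hits / n

print(leave_one_out_accuracy(X, y))
```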
The Bootstrap Method
• A training dataset is generated by sampling with
replacement N times from the available N cases.
• The classifier is trained on this set and then tested on the
original dataset.
• This process is repeated several times, and the classifier’s
accuracy estimate is the average of these individual
estimates.
[Diagram: a bootstrap training set is formed by sampling the N cases with replacement (some cases appear more than once, others not at all); the classifier is then tested on the original dataset.]
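A minimal sketch of the bootstrap procedure as described above (training on a resample drawn with replacement, testing on the original dataset); the data and classifier are again stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=10, random_state=0)

def bootstrap_accuracy(X, y, n_runs=200, seed=0):
    """Train on a bootstrap sample (drawn with replacement) and, as described
    above, test on the original dataset; average over n_runs repetitions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_runs):
        boot = rng.integers(0, n, size=n)    # N draws with replacement
        clf = KNeighborsClassifier().fit(X[boot], y[boot])
        scores.append(clf.score(X, y))       # evaluate on all original cases
    return np.mean(scores)

print(bootstrap_accuracy(X, y))
```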
An example
• 88 cases categorised into four classes: Ewing family of tumors (EWS, 30), rhabdomyosarcoma (RMS, 11), Burkitt lymphomas (BL, 19) and neuroblastomas (NB, 28).
• Cases are represented by the expression values of 2308 genes with suspected roles in processes relevant to these tumors.
• PCA was applied to reduce the dimensionality of the cases; the 10 dominant components per case were used to train the networks.
• All of the classifiers (BP-ANN) were trained using the
same learning parameters.
• The BP-ANN architectures comprised 10 input nodes, 8
hidden nodes and 4 output nodes.
• Each output node encodes one of the tumor classes.
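The original BP-ANN implementation is not given here; as a rough, hypothetical sketch of this kind of pipeline (PCA to 10 components followed by a 10-8-4 feed-forward network), using scikit-learn and placeholder arrays standing in for the 88 × 2308 expression matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the 88 x 2308 gene-expression matrix
# and the four tumour labels (0=EWS, 1=RMS, 2=BL, 3=NB).
rng = np.random.default_rng(0)
X = rng.normal(size=(88, 2308))
y = rng.integers(0, 4, size=88)

# 10 dominant principal components per case.
X10 = PCA(n_components=10).fit_transform(X)

# Back-propagation network: 10 inputs, 8 hidden nodes, 4 outputs (one per class).
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X10, y)
print(net.score(X10, y))   # training accuracy only; test accuracy needs held-out cases
```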
Analysing the k-fold cross validation
The cross-validation results were analysed for
three different data splitting methods:
a) 50% of the available cases were used for training the classifiers and the remaining 50% for testing,
b) 75% for training and 25% for testing,
c) 95% for training and 5% for testing.
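A hypothetical sketch of comparing the three splitting schemes over repeated train-test runs (synthetic data and a k-NN classifier stand in for the real dataset and the BP-ANNs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=88, n_features=10, n_classes=4,
                           n_informative=6, random_state=0)

# Compare the three splitting schemes over repeated train-test runs.
for test_fraction in (0.50, 0.25, 0.05):
    scores = []
    for run in range(100):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_fraction,
                                                  random_state=run)
        clf = KNeighborsClassifier().fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    print(f"test fraction {test_fraction:.2f}: mean accuracy {np.mean(scores):.3f}")
```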
Tumour classification
Cross-validation method based on a 50%-50% splitting.
[Figure: classification accuracy (approx. 0.71 to 0.81) plotted against the number of train-test runs A to J.]
A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100 train-test
runs, E: 500 train-test runs (interval size equal to 0.01), F: 1000 train-test runs, G:
2000 train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000 train-test
runs.
Tumour classification
Cross-validation method based on a 75%-25% splitting.
[Figure: classification accuracy (approx. 0.71 to 0.80) plotted against the number of train-test runs A to J.]
A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100
train-test runs, E: 500 train-test runs, F: 1000 train-test runs (interval size
equal to 0.01), G: 2000 train-test runs, H: 3000 train-test runs, I: 4000
train-test runs, J: 5000 train-test runs.
Tumour classification
Cross-validation method based on a 95%-5% splitting.
[Figure: classification accuracy (approx. 0.55 to 0.95) plotted against the number of train-test runs A to J.]
A: 10 train-test runs, B: 25 train-test runs, C: 50 train-test runs, D: 100
train-test runs, E: 500 train-test runs, F: 1000 train-test runs, G: 2000
train-test runs, H: 3000 train-test runs, I: 4000 train-test runs, J: 5000
train-test runs (interval size equal to 0.01) .
Tumour classification
• The 50%-50% cross-validation produced
the most conservative accuracy estimates.
• The 95%-5% cross-validation method produced the most optimistic cross-validation accuracy estimates.
• The leave-one-out method produced the
highest accuracy estimate for this dataset
(0.79).
• The estimation of higher accuracy values may be linked to an increase in the size of the training dataset.
Tumour classification
Bootstrap method
[Figure: classification accuracy (approx. 0.725 to 0.770) plotted against the number of train-test runs A to J.]
A: 100 train-test runs, B: 200 train-test runs, C: 300 train-test runs, D: 400 train-test runs, E: 500 train-test runs, F: 600 train-test runs, G: 700 train-test runs, H: 800 train-test runs, I: 900 train-test runs (interval size equal to 0.01), J: 1000 train-test runs.
Final remarks
• The problem of estimating prediction quality should be carefully addressed and deserves further investigation.
• Sampling techniques can be implemented to assess classification quality factors (such as accuracy) of classifiers (such as ANNs).
• In general there is variability among the
three techniques.
• These experiments suggest that it is
possible to achieve lower variance estimates
for different numbers of train-test runs.
Final remarks (II)
• Furthermore, one may identify conservative and
optimistic accuracy predictors, whose overall
estimates may be significantly different.
• This effect is more distinguishable in small-sample
applications.
• The predicted accuracy of a classifier is generally
proportional to the size of the training dataset.
• The bootstrap method may be applied to generate
conservative and robust accuracy estimates, based
on a relatively small number of train-test
experiments.
Final remarks (III)
• This presentation highlights the importance of applying more rigorous procedures for data selection and classification quality assessment.
• In general the application of more than one
sampling technique may provide the basis for accurate
and reliable predictions.