Supervised classification performance (prediction) assessment
Dr. Huiru Zheng, Dr. Francisco Azuaje
School of Computing and Mathematics, Faculty of Engineering
University of Ulster, N. Ireland, UK

Building prediction models
• Different models, tools and applications.
• The problem of prediction (classification): data describing a process, category, event or condition (its properties and values) are fed to a prediction model, which outputs actions or predictions.
[Diagram: data (properties, values) → prediction model P → predictions (actions, values).]

Building prediction models – Supervised learning methods
• Training phase: a set of cases (A) and their respective labels (C) are used to build a classification model, producing predicted labels C' and a trained model C*.
• Test phase: the trained classifier is used to predict new cases A, whose labels (C) are withheld.
[Diagram: training, (A, C) → prediction model → C'/C*; testing, (A, (C)) → trained model → C*.]
• Prediction models, such as ANNs, aim to achieve an ability to generalise: the capacity to correctly classify cases or problems unseen during training.
• Quality indicator: accuracy during the test phase.

Building prediction models – Assessing their quality
A classifier will be able to generalise if:
a) its architecture and learning parameters have been properly defined, and
b) enough training data are available.
• The second condition is difficult to achieve due to resource and time constraints.
• Key limitations appear when dealing with small data samples, which is a common feature of many studies.
• A small test data set may contribute to an inaccurate performance assessment.

Key questions
• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects on small and large data sets?
The problem of data sampling

What is Accuracy?
Accuracy = No. of correct predictions / No. of predictions
         = (TP + TN) / (TP + TN + FP + FN)

Examples (1)
classifier   TP   TN   FP   FN   Accuracy
A            25   25   25   25   50%
B            50   25   25    0   75%
C            25   50    0   25   75%
D            37   37   13   13   74%
• Clearly, B, C and D are all better than A.
• Is B better than C and D? Is C better than B and D? Is D better than B and C?
• Accuracy may not tell the whole story.

Examples (2)
classifier   TP   TN   FP   FN   Accuracy
A            25   75   75   25   50%
B             0  150    0   50   75%
C            50    0  150    0   25%
D            30  100   50   20   65%
• Clearly, D is better than A.
• Is B better than A, C and D?

What is Sensitivity (recall)?
Sensitivity (true positive rate) = No. of correct positive predictions / No. of positives
                                 = TP / (TP + FN)
The true negative rate, TN / (TN + FP), is termed specificity.

What is Precision?
Precision (wrt positives) = No. of correct positive predictions / No. of positive predictions
                          = TP / (TP + FP)

Precision-Recall Trade-off
• A predicts better than B if A has better recall and precision than B.
• There is a trade-off between recall and precision.
• In some applications, once you reach a satisfactory precision, you optimize for recall.
• In some applications, once you reach a satisfactory recall, you optimize for precision.

Comparing Prediction Performance
• Accuracy is the obvious measure, but it conveys the right intuition only when the positive and negative populations are roughly equal in size.
• Recall and precision together form a better measure, but what do you do when A has better recall than B and B has better precision than A?
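To make these definitions concrete, the short Python sketch below computes accuracy, sensitivity (recall), specificity and precision directly from the TP/TN/FP/FN counts of the classifiers in Examples (2). The function and variable names are illustrative and not part of the original slides.

# Minimal sketch: classification metrics from confusion counts (Examples (2)).

def metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity (recall), specificity and precision."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall / true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # wrt positive predictions
    return accuracy, sensitivity, specificity, precision

# TP, TN, FP, FN counts of classifiers A-D from Examples (2)
classifiers = {"A": (25, 75, 75, 25), "B": (0, 150, 0, 50),
               "C": (50, 0, 150, 0), "D": (30, 100, 50, 20)}

for name, counts in classifiers.items():
    acc, sens, spec, prec = metrics(*counts)
    print(f"{name}: accuracy={acc:.2f} sensitivity={sens:.2f} "
          f"specificity={spec:.2f} precision={prec:.2f}")

Note how classifier B reaches the highest accuracy while never predicting a positive case correctly (sensitivity 0), which is exactly why accuracy alone may not tell the whole story.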
Some Alternate Measures
• F-Measure: the harmonic mean of recall and precision (wrt positives):
  F = 2 * recall * precision / (recall + precision)
• Adjusted accuracy: a class-weighted accuracy.
• ROC curve: Receiver Operating Characteristic analysis.

Adjusted Accuracy
• Weigh by the importance of the classes:
  Adjusted accuracy = α * Sensitivity + β * Specificity, where α + β = 1 (typically α = β = 0.5).
classifier   TP   TN   FP   FN   Accuracy   Adj. Accuracy
A            25   75   75   25   50%        50%
B             0  150    0   50   75%        50%
C            50    0  150    0   25%        50%
D            30  100   50   20   65%        63%
• But how should the values of α and β be chosen?

ROC Curves
• By changing the decision threshold t, we get a range of sensitivities and specificities for a classifier.
• A predicts better than B if A has better sensitivities than B at most specificities.
• This leads to the ROC curve, which plots sensitivity vs. (1 – specificity).
• The larger the area under the ROC curve, the better.
[Figure: ROC curve, sensitivity plotted against 1 – specificity.]

Key questions
• How to measure classification quality?
• How can I select training and test cases?
• How many experiments?
• How to estimate prediction accuracy?
• Effects on small and large data sets?
The problem of data sampling

Data sampling techniques
Main goals:
• Reduction of the estimation bias (neither too optimistic nor too conservative).
• Reduction of the variance introduced by a small data set.
Other important goals:
a) to establish differences between data sampling techniques when applied to small and larger datasets,
b) to study the response of these methods to the size and number of train-test sets, and
c) to discuss criteria for the selection of sampling techniques.

Three Data Sampling Techniques
• cross-validation
• leave-one-out
• bootstrap

k-fold cross-validation
• Randomly divides the data into training and test sets: from N samples, p are used for training and q for testing (q = N – p).
• This process is repeated k times, and the classification performance is the average of the individual test estimates.
[Diagram: the N cases are repeatedly split into training and test portions over experiments 1 to k.]

k-fold cross-validation
• The classifier may not be able to accurately predict new cases if the amount of data used for training (p%) is too small.
• At the same time, the quality assessment may not be accurate if the portion of data used for testing (q%) is too small.
• How should the splitting procedure be chosen?

The Leave-One-Out Method
• Given N cases available in a dataset, a classifier is trained on (N – 1) cases and then tested on the case that was left out.
• This is repeated N times, until every case in the dataset has been used once as the test instance.
• The results are averaged across the N test cases to estimate the classifier's prediction performance.
[Diagram: N cases; in each of experiments 1 to N a different single case is held out for testing.]

The Bootstrap Method
• A training dataset is generated by sampling with replacement N times from the available N cases.
• The classifier is trained on this set and then tested on the original dataset.
• This process is repeated several times, and the classifier's accuracy estimate is the average of these individual estimates.
[Diagram: a bootstrap training set drawn with replacement from cases 1-5, with some cases repeated and others omitted; the original cases form the test set.]

An example
• 88 cases categorised into four classes: Ewing family of tumors (EWS, 30), rhabdomyosarcoma (RMS, 11), Burkitt lymphomas (BL, 19) and neuroblastomas (NB, 28).
• Cases are represented by the expression values of 2308 genes with suspected roles in processes relevant to these tumors.
• PCA was applied to reduce the dimensionality of the cases; the 10 dominant components per case were used to train the networks.
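The sketch below shows one way the preprocessing and the repeated train/test splitting described above might be set up. It is only an assumption-laden illustration: the random arrays stand in for the 88-case expression matrix, and scikit-learn's MLPClassifier (a single hidden layer of 8 units) stands in for the BP-ANN used in the original study.

# Minimal sketch: PCA reduction plus repeated train/test splitting, assuming
# placeholder data in place of the 88 x 2308 expression matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(88, 2308))          # placeholder for the gene expression data
y = rng.integers(0, 4, size=88)          # placeholder labels (EWS, RMS, BL, NB)

# Reduce each case to its 10 dominant principal components.
X10 = PCA(n_components=10).fit_transform(X)

def repeated_split_accuracy(X, y, test_size, n_runs):
    """Average test accuracy over n_runs random train/test splits."""
    scores = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=run)
        clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=run)
        clf.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return np.mean(scores), np.std(scores)

mean_acc, std_acc = repeated_split_accuracy(X10, y, test_size=0.25, n_runs=25)
print(f"75%-25% split, 25 runs: accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")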
• All of the classifiers (BP-ANNs) were trained using the same learning parameters.
• The BP-ANN architectures comprised 10 input nodes, 8 hidden nodes and 4 output nodes.
• Each output node encodes one of the tumour classes.

Analysing k-fold cross-validation
The cross-validation results were analysed for three different data splitting methods:
a) 50% of the available cases were used for training the classifiers and the remaining 50% for testing,
b) 75% for training and 25% for testing,
c) 95% for training and 5% for testing.

Tumour classification – cross-validation based on a 50%-50% split
[Figure: classification accuracy (roughly 0.71 to 0.81) plotted against the number of train-test runs. A: 10, B: 25, C: 50, D: 100, E: 500 (interval size equal to 0.01), F: 1000, G: 2000, H: 3000, I: 4000, J: 5000 train-test runs.]

Tumour classification – cross-validation based on a 75%-25% split
[Figure: classification accuracy (roughly 0.71 to 0.80) plotted against the number of train-test runs. A: 10, B: 25, C: 50, D: 100, E: 500, F: 1000 (interval size equal to 0.01), G: 2000, H: 3000, I: 4000, J: 5000 train-test runs.]

Tumour classification – cross-validation based on a 95%-5% split
[Figure: classification accuracy (roughly 0.55 to 0.95) plotted against the number of train-test runs. A: 10, B: 25, C: 50, D: 100, E: 500, F: 1000, G: 2000, H: 3000, I: 4000, J: 5000 train-test runs (interval size equal to 0.01).]

Tumour classification
• The 50%-50% cross-validation produced the most conservative accuracy estimates.
• The 95%-5% cross-validation produced the most optimistic accuracy estimates.
• The leave-one-out method produced the highest accuracy estimate for this dataset (0.79).
• The estimation of high accuracy values may be linked to an increase in the size of the training datasets.

Tumour classification – bootstrap method
[Figure: classification accuracy (roughly 0.725 to 0.770) plotted against the number of train-test runs. A: 100, B: 200, C: 300, D: 400, E: 500, F: 600, G: 700, H: 800, I: 900 (interval size equal to 0.01), J: 1000 train-test runs.]

Final remarks
• The problem of estimating prediction quality should be carefully addressed and deserves further investigation.
• Sampling techniques can be implemented to assess the classification quality factors (such as accuracy) of classifiers (such as ANNs).
• In general, there is variability among the three techniques.
• These experiments suggest that it is possible to achieve lower-variance estimates for different numbers of train-test runs.

Final remarks (II)
• Furthermore, one may identify conservative and optimistic accuracy predictors, whose overall estimates may be significantly different.
• This effect is more distinguishable in small-sample applications.
• The predicted accuracy of a classifier is generally proportional to the size of the training dataset.
• The bootstrap method may be applied to generate conservative and robust accuracy estimates, based on a relatively small number of train-test experiments.
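As a rough illustration of the bootstrap estimate summarised in the figure above (train on N cases drawn with replacement, test on the original dataset, average over repetitions), the sketch below again uses placeholder data and an MLPClassifier as a stand-in for the BP-ANN; all names and parameter values are assumptions for illustration, not the original implementation.

# Minimal sketch of the bootstrap accuracy estimate, assuming placeholder data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X10 = rng.normal(size=(88, 10))      # placeholder: 10 principal components per case
y = rng.integers(0, 4, size=88)      # placeholder tumour labels (EWS, RMS, BL, NB)

def bootstrap_accuracy(X, y, n_runs=100, seed=0):
    """Average accuracy over n_runs bootstrap repetitions: train on N cases
    drawn with replacement, then test on the original dataset."""
    sampler = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for run in range(n_runs):
        idx = sampler.integers(0, n, size=n)   # sample N cases with replacement
        clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=run)
        clf.fit(X[idx], y[idx])
        scores.append(accuracy_score(y, clf.predict(X)))  # test on the original cases
    return float(np.mean(scores)), float(np.std(scores))

mean_acc, std_acc = bootstrap_accuracy(X10, y, n_runs=100)
print(f"Bootstrap accuracy estimate over 100 runs: {mean_acc:.3f} +/- {std_acc:.3f}")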
Final remarks (III)
• This presentation highlights the importance of more rigorous procedures for the selection of data and for classification quality assessment.
• In general, the application of more than one sampling technique may provide the basis for accurate and reliable predictions.
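To complete the set of sampling techniques so that more than one estimate can be compared, a leave-one-out sketch follows, under the same placeholder-data and MLPClassifier assumptions as the earlier sketches.

# Minimal sketch of the leave-one-out accuracy estimate, assuming placeholder data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X10 = rng.normal(size=(88, 10))      # placeholder: 10 principal components per case
y = rng.integers(0, 4, size=88)      # placeholder tumour labels

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
loo_scores = cross_val_score(clf, X10, y, cv=LeaveOneOut())   # one case held out per fold
print(f"Leave-one-out accuracy estimate: {loo_scores.mean():.3f}")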