test set - LIACS Data Mining Group

Learning Algorithm Evaluation

Algorithm evaluation: Outline
• Why? Overfitting
• How? Train/Test vs. Cross-validation
• What? Evaluation measures
• Who wins? Statistical significance

Introduction
• A model should perform well on unseen data drawn from the same distribution
• Classification accuracy: performance measure
  • Success: instance's class is predicted correctly
  • Error: instance's class is predicted incorrectly
  • Error rate: #errors/#instances
  • Accuracy: #successes/#instances

Quiz
• 50 examples, 10 classified incorrectly
• Accuracy? Error rate?

Evaluation
• Rule #1: Never evaluate on training data!

Train and Test
• Step 1: Randomly split data into a training set and a test set (e.g. 2/3-1/3); the test set is a.k.a. the holdout set
• Step 2: Train model on training data
• Step 3: Evaluate model on test data
• Quiz: Can I retry with other parameter settings?

Evaluation
• Rule #1: Never evaluate on training data!
• Rule #2: Never train on test data! (that includes parameter setting or feature selection)

Train and Test
• Step 4: Optimize parameters on a separate validation set
[Figure: data split with validation and testing parts]

Test data leakage
• Never use test data to create the classifier
• Can be tricky: e.g.
  social network

Proper procedure uses three sets
• Training set: train models
• Validation set: optimize algorithm parameters
• Test set: evaluate final model

Making the most of the data
• Once evaluation is complete, all the data can be used to build the final classifier
• Trade-off: performance vs. evaluation accuracy
  • More training data, better model (but returns diminish)
  • More test data, more accurate error estimate

Train and Test
• Step 5: Build final model on ALL data (more data, better model)

Cross-Validation

k-fold Cross-validation
• Split data (stratified) into k folds
• Use (k-1) folds for training, 1 for testing
• Repeat k times
• Average results
[Figure: original data split into Fold 1, Fold 2, Fold 3, each with a train part and a test part]

Cross-validation
• Standard method: stratified ten-fold cross-validation
• Why 10? Enough to reduce sampling bias
  • Experimentally determined

Leave-One-Out Cross-validation
• A particular form of cross-validation:
  • #folds = #instances
  • n instances, build classifier n times
• Makes best use of the data, no sampling bias
• Computationally expensive
[Figure: original data of 100 instances split into Fold 1 ... Fold 100]

ROC Analysis
• Stands for "Receiver Operating Characteristic"
• From signal processing: trade-off between hit rate and false alarm rate over a noisy channel
• Compute FPR, TPR and plot them in ROC space
• Every classifier is a point in ROC space
• For probabilistic algorithms:
  • Collect many points by varying the prediction threshold
  • Or, make cost sensitive and vary costs (see below)

Confusion Matrix
            actual +               actual -
pred +      TP (true positive)     FP (false positive)
pred -      FN (false negative)    TN (true negative)
total       TP+FN                  FP+TN

• TP rate (sensitivity) = TP / (TP+FN)
• FP rate (fall-out) = FP / (FP+TN)

ROC space
[Figure: classifiers plotted as points in ROC space: J48, J48 with parameters fitted, OneR]

ROC curves
• Change prediction threshold:
  • Threshold t: predict + if P(+) > t
• Area Under Curve (AUC) = 0.75

ROC curves
• Alternative method (easier, but less intuitive)
  • Rank probabilities
  • Start curve in (0,0), move down probability list
  • If positive, move up.
    If negative, move right
• Jagged curve: one set of test data
• Smooth curve: use cross-validation

ROC curves: Method selection
• Overall: use method with largest Area Under ROC curve (AUROC)
• If you aim to cover just 40% of true positives in a sample: use method A
• Large sample: use method B
• In between: choose between A and B with appropriate probabilities

ROC Space and Costs
[Figure: ROC space under equal costs and under skewed costs]

Different Costs
• In practice, FP and FN errors incur different costs
• Examples:
  • Medical diagnostic tests: does X have leukemia?
  • Loan decisions: approve mortgage for X?
  • Promotional mailing: will X buy the product?
• Add cost matrix to evaluation that weighs TP, FP, ...

            pred +       pred -
actual +    cTP = 0      cFN = 1
actual -    cFP = 1      cTN = 0

Statistical Significance

Comparing data mining schemes
• Which of two learning algorithms performs better?
  • Note: this is domain dependent!
• Obvious way: compare 10-fold CV estimates
• Problem: variance in estimate
  • Variance can be reduced using repeated CV
  • However, we still don't know whether results are reliable

Significance tests
• Significance tests tell us how confident we can be that there really is a difference
  • Null hypothesis: there is no "real" difference
  • Alternative hypothesis: there is a difference
• A significance test measures how much evidence there is in favor of rejecting the null hypothesis
• E.g. 10 cross-validation scores: B better than A?
[Figure: distributions P(perf) over perf for Algorithm A and Algorithm B, with individual scores marked as x]

Paired t-test
• Student's t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
• Use a paired t-test when individual samples are paired
  • i.e., they use the same randomization
  • Same CV folds are used for both algorithms
[Figure: distributions P(perf) over perf for Algorithm A and Algorithm B]

William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Worked as chemist in the Guinness brewery in Dublin in 1899.
Invented the t-test to handle small samples for quality control in brewing.
Wrote under the name "Student".

Performing the test
1. Fix a significance level α
   • A significant difference at the α% level means that, if there were no real difference, a result this extreme would occur at most α% of the time (loosely read as (100-α)% confidence that there really is a difference)
   • Scientific work: 5% or smaller (>95% certainty)
2. Divide α by two (two-tailed test)
3. Look up the z-value corresponding to α/2 (the values below match Student's t distribution with 9 degrees of freedom, i.e. 10 paired scores):

   α        z
   0.1%     4.3
   0.5%     3.25
   1%       2.82
   5%       1.83
   10%      1.38
   20%      0.88

4. If t ≤ -z or t ≥ z: difference is significant
   • Null hypothesis can be rejected
[Figure: distributions P(perf) over perf for Algorithm A and Algorithm B]
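
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (paired_t_statistic, is_significant) and the two score lists are hypothetical, and the critical values are simply copied from the table, so the lookup only applies to 10 paired scores.

```python
import math

# Critical values copied from the table above, keyed by the one-tail
# probability alpha/2 (as a fraction). They apply to 10 paired scores.
CRITICAL_VALUES = {0.001: 4.3, 0.005: 3.25, 0.01: 2.82,
                   0.05: 1.83, 0.10: 1.38, 0.20: 0.88}


def paired_t_statistic(scores_b, scores_a):
    """t statistic of the per-fold differences b - a (same folds for both)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean = sum(diffs) / k
    variance = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(variance / k)


def is_significant(t, alpha):
    """Two-tailed test at level alpha (e.g. 0.01 for 1%)."""
    tail = alpha / 2  # step 2: divide alpha by two
    for tabulated_tail, z in CRITICAL_VALUES.items():  # step 3: look up z
        if math.isclose(tabulated_tail, tail):
            return t <= -z or t >= z  # step 4: compare t against -z and z
    raise KeyError(f"alpha/2 = {tail} is not in the table")


# Hypothetical accuracies (in %) of two algorithms on the same 10 CV folds.
acc_a = [80, 82, 79, 85, 81, 83, 78, 84, 80, 82]
acc_b = [83, 84, 80, 88, 83, 86, 80, 85, 84, 85]

t = paired_t_statistic(acc_b, acc_a)
print(f"t = {t:.2f}")                 # t = 7.86
print(is_significant(t, alpha=0.01))  # True
```

Because the table is indexed by α/2, a two-tailed test at the 5% level would need the 2.5% entry, which the table does not list; the sketch raises a KeyError in that case rather than guessing a value.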

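The paired test relies on both algorithms being scored on the same folds. As a rough sketch of the stratified k-fold procedure from the Cross-Validation slides (split, train on k-1 folds, test on 1, repeat k times, average), the following uses only the Python standard library. The names stratified_k_fold, cross_validate and majority_baseline are hypothetical, and the "learner" is a trivial majority-class predictor chosen only to keep the example self-contained.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Split instance indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:  # deal each class round-robin over the folds
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validate(labels, k, train_and_score, seed=0):
    """Use k-1 folds for training and 1 for testing, k times; average results."""
    folds = stratified_k_fold(labels, k, seed)
    scores = []
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        scores.append(train_and_score(train, test_fold))
    return sum(scores) / k, scores


# Hypothetical data set: 30 positive and 20 negative instances.
labels = ["+"] * 30 + ["-"] * 20


def majority_baseline(train, test):
    """Predict the most frequent training class; return accuracy on the test fold."""
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test) / len(test)


mean_accuracy, fold_scores = cross_validate(labels, 10, majority_baseline)
print(round(mean_accuracy, 2))  # 0.6
```

Reusing the same seed for every algorithm under comparison keeps the folds identical, which is what makes the per-fold scores paired.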