Data Mining
CSCI 307, Spring 2019
Lecture 31: Comparing Data Mining Schemes

5.6 Comparing Data Mining Schemes
Question: Suppose we have two classifiers, M1 and M2. Which one is better?
• Obvious way: use 10-fold cross-validation to obtain the mean error rates $\overline{err}(M_1)$ and $\overline{err}(M_2)$
• These mean error rates are just estimates of the error on the true population of future data cases

Comparing Schemes continued
• Want to show that scheme M1 is better than scheme M2 in a particular domain
  – For a given amount of training data
  – On average, across all possible training sets
• Assume we have an infinite amount of data from the domain:
  – Obtain a cross-validation estimate on each dataset for each scheme
  – Check whether the mean accuracy for scheme M1 is better than the mean accuracy for scheme M2
• We probably don't have an infinite amount of data

What about ML research?
What if the difference between the two error rates is just attributable to chance?
• Use a test of statistical significance
• Obtain confidence limits for our error estimates

Overview
Estimating Confidence Intervals: Null Hypothesis
• Perform 10-fold cross-validation
• Assume the samples follow a t distribution with k-1 degrees of freedom (here, k = 10)
• Use the t-test (or Student's t-test)
• If we fail to reject the null hypothesis: M1 & M2 are the "same"
• If we can reject the null hypothesis, then
  – Conclude that the difference between M1 & M2 is statistically significant
  – Choose the model with the lower error rate

Sidebar: the Null Hypothesis
• Computer scientists like to prove a hypothesis true or false, but we don't do that here.
• We might be less sure and say we accept the hypothesis or reject it, but we shouldn't do that either.
• Statisticians either reject the hypothesis or fail to reject it.
  – If we can reject the null hypothesis, we conclude that the difference between the two machine learning methods is statistically significant.
  – If we fail to reject the null hypothesis, we conclude that the difference between the two machine learning methods could be just chance.
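The comparison above only makes sense if both schemes are evaluated on identical partitions. The following is a minimal sketch of that idea in plain Python, not the course's own code. The helper names (`k_fold_indices`, `paired_cv_errors`) and the two toy schemes (`train_majority`, `train_threshold`) are hypothetical stand-ins chosen for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 once and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def paired_cv_errors(data, labels, train_m1, train_m2, k=10, seed=0):
    """Return per-fold error rates for two schemes on the SAME partitions."""
    err_m1, err_m2 = [], []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        train_x = [data[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        # Both schemes see exactly the same training and test instances.
        for train, errs in ((train_m1, err_m1), (train_m2, err_m2)):
            predict = train(train_x, train_y)
            wrong = sum(predict(data[i]) != labels[i] for i in test_idx)
            errs.append(wrong / len(test_idx))
    return err_m1, err_m2

def train_majority(xs, ys):
    """Hypothetical scheme M1: always predict the most common training label."""
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

def train_threshold(xs, ys):
    """Hypothetical scheme M2: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if x >= cut else 0

# Toy 1-D data: label is 1 when the value is at least 0.5.
data = [i / 100 for i in range(100)]
labels = [1 if x >= 0.5 else 0 for x in data]

err_m1, err_m2 = paired_cv_errors(data, labels, train_majority, train_threshold)
mean_m1 = sum(err_m1) / len(err_m1)
mean_m2 = sum(err_m2) / len(err_m2)
print(f"mean err(M1) = {mean_m1:.3f}, mean err(M2) = {mean_m2:.3f}")
```

The two returned lists are paired fold by fold, which is what a paired significance test on the per-fold differences requires.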
Paired t-test
• In practice we have limited data and a limited number of estimates for computing the mean
• Student's t-test tells whether the means of two samples are significantly different
• In our case, the samples are cross-validation estimates for datasets from the domain
• Use a paired t-test because the individual samples are paired
  – The same cross-validation is applied twice

William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student."

Estimating Confidence Intervals: t-test
If only 1 test set available: pairwise comparison
  – For the ith round of 10-fold cross-validation, the same cross partitioning is used to obtain $err(M_1)_i$ and $err(M_2)_i$
  – Average over 10 rounds to get $\overline{err}(M_1)$ and $\overline{err}(M_2)$
  – The t-test computes the t-statistic with k-1 degrees of freedom:

$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$$

where $var(M_1 - M_2)$ is the variance of the per-round differences $err(M_1)_i - err(M_2)_i$.

Table for t-distribution
• Significance level, e.g., sig = 0.05 or 5%, means the rejection region has total probability 0.05, split between the two tails
• Confidence limit, z = value(sig/2)
• Symmetric, so the lower limit is -z = -value(sig/2)

Statistical Significance
Are M1 & M2 significantly different?
• Compute t. Select a significance level (e.g., sig = 5%)
• Consult the table for the t-distribution: find the t value corresponding to k-1 degrees of freedom (here, 9)
• The t-distribution is symmetric: typically only the upper percentage points of the distribution are shown, so look up the confidence limit z = value(sig/2) (here, sig/2 = 0.025)
• If t > z or t < -z, then the t value lies in the rejection region:
  – Reject the null hypothesis that the mean error rates of M1 & M2 are the same
  – Conclude: statistically significant difference between M1 & M2
• Otherwise, if -z ≤ t ≤ z, then
  – Fail to reject the null hypothesis that the mean error rates of M1 & M2 are the same
  – Conclude: any difference could be due to chance

Recap: Performing the Test
• Fix a significance level
  – If a difference is significant at the a% level, a difference this large would occur by chance less than a% of the time if the true means were equal
• Divide the significance level by two because the test is two-tailed
• Look up the value for z that corresponds to a/2
• If t ≤ -z or t ≥ z then the difference is significant
  – i.e., the null hypothesis (that the difference is zero) can be rejected

EXAMPLE
We have two prediction models, M1 and M2. We have performed 10 rounds of 10-fold cross-validation on each model, where the same data partitioning in round i is used for both M1 and M2.
The error rates obtained for M1 are 30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0.
The error rates for M2 are 22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0.
Is one model significantly better than the other at a significance level of 1%?

EXAMPLE continued
We use a hypothesis test to determine whether there is a significant difference in average error. We used the same test data for each observation, so we use the "paired observation" hypothesis test to compare two means:
H0: $\bar{x}_1 - \bar{x}_2 = 0$ (null hypothesis: the difference is due to chance)
H1: $\bar{x}_1 - \bar{x}_2 \neq 0$ (statistically significant difference in the model errors)
where $\bar{x}_1$ is the mean error of model M1 and $\bar{x}_2$ is the mean error of model M2.
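The two-tailed decision rule from the recap (reject H0 when t falls outside [-z, z]) can be sketched as a small lookup plus a comparison. This is an illustrative sketch only: the function name `decide` and the dictionary `T_CRITICAL_DF9` are hypothetical, and the critical values are the usual t-table entries for 9 degrees of freedom, keyed by the overall significance level sig (the table column is sig/2).

```python
# Two-tailed critical values z for 9 degrees of freedom, as found in a
# standard t-table. Keys are the overall significance level sig; each
# value is the upper sig/2 percentage point of the t-distribution.
T_CRITICAL_DF9 = {0.10: 1.833, 0.05: 2.262, 0.01: 3.250}

def decide(t, sig=0.05, table=T_CRITICAL_DF9):
    """Return the two-tailed decision for t-statistic t at level sig."""
    z = table[sig]
    if t > z or t < -z:
        # t lies in the rejection region (either tail).
        return "reject H0"
    return "fail to reject H0"

# An illustrative t value can be significant at 5% but not at 1%.
print(decide(2.5, sig=0.05))   # reject H0
print(decide(2.5, sig=0.01))   # fail to reject H0
```

The choice of significance level matters: the same t value can fall inside the rejection region at one level and outside it at a stricter one.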
Compute the test statistic t using the formula:

$$t = \frac{\text{mean of the differences in error}}{(\text{std dev of the differences in error}) / \sqrt{\text{number of observations}}}$$

EXAMPLE (the Calculations)

Round   M1     M2     d = M1 - M2   (d - 6.45)^2
1       30.5   22.4     8.1           2.7225
2       32.2   14.5    17.7         126.5625
3       20.7   22.4    -1.7          66.4225
4       20.6   19.6     1.0          29.7025
5       31.0   20.7    10.3          14.8225
6       41.0   20.4    20.6         200.2225
7       27.7   22.1     5.6           0.7225
8       26.0   19.4     6.6           0.0225
9       21.5   16.2     5.3           1.3225
10      26.0   35.0    -9.0         238.7025

Average of the differences = 6.45
Average the squared deviations (681.225 / 10 = 68.1225) and take the square root to get Std Dev = 8.25
t = 6.45 / (8.25 / sqrt(10)) = 2.47

Example: Table Lookup
• Significance level 1% (0.01), so look up the t value for probability sig/2 = 0.005
• 9 degrees of freedom: z = 3.25
• -z ≤ t ≤ z, i.e., -3.25 ≤ 2.47 ≤ 3.25, so we fail to reject the null hypothesis: the two models are not significantly different at a significance level of 0.01

Estimating Confidence Intervals: t-test
RECALL: If only 1 test set available: pairwise comparison
The t-test computes the t-statistic with k-1 degrees of freedom:

$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$$

where

$$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2$$

If two test sets available: use the non-paired t-test, where the denominator of t becomes

$$\sqrt{\frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}}$$

where k1 & k2 are the numbers of cross-validation samples used for M1 & M2, respectively.

In Other Words: Unpaired Observations
• If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme and j estimates for the other)
• Then we have to use an unpaired t-test with min(k, j) - 1 degrees of freedom
• The estimate of the variance of the difference of the means becomes:

$$\frac{var(M_1)}{k} + \frac{var(M_2)}{j}$$

Dependent Estimates
• We assumed that we have enough data to create several datasets of the desired size
• Need to re-use data if that's not the case
  – e.g., running cross-validations with different randomizations on the same data
• Samples become dependent, and then insignificant differences can appear significant
• A heuristic test is the corrected resampled t-test:
  – Assume we use the repeated hold-out method, with n1 instances for training and n2 for testing
  – New test statistic is

$$t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\sigma_d^2}}$$

where $\bar{d}$ is the mean of the k paired differences and $\sigma_d^2$ is their variance.