Data Mining
CSCI 307, Spring 2019
Lecture 31
Comparing Data Mining Schemes
5.6 Comparing Data Mining Schemes
Question: Suppose we have 2 classifiers, M1 and M2. Which one is better?
• Obvious way: Use 10-fold cross-validation to obtain the mean error rates $\overline{err}(M_1)$ and $\overline{err}(M_2)$
• These mean error rates are just estimates of
error on the true population of future data cases
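A minimal sketch of this step, assuming a hypothetical helper `kfold_indices` and two toy classifiers (a majority-class predictor and a 1-nearest-neighbour rule on one numeric attribute) standing in for M1 and M2. None of these names or the dataset come from the lecture; the point is only that both schemes are scored on the same folds.

```python
import random
from collections import Counter


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_classifier(train_x, train_y):
    """Toy stand-in for M1: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour_classifier(train_x, train_y):
    """Toy stand-in for M2: predict the label of the closest training value."""
    def predict(x):
        j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[j]
    return predict


def cv_error_rates(build, xs, ys, folds):
    """Return one error rate per fold for the classifier built by `build`."""
    rates = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        predict = build(train_x, train_y)
        wrong = sum(predict(xs[i]) != ys[i] for i in test_idx)
        rates.append(wrong / len(test_idx))
    return rates


# Hypothetical one-attribute dataset: the label is 1 when the value exceeds 50.
xs = list(range(100))
ys = [int(x > 50) for x in xs]

folds = kfold_indices(len(xs), k=10)  # the same folds are reused for both schemes
err_m1 = cv_error_rates(majority_classifier, xs, ys, folds)
err_m2 = cv_error_rates(nearest_neighbour_classifier, xs, ys, folds)

mean_err_m1 = sum(err_m1) / len(err_m1)
mean_err_m2 = sum(err_m2) / len(err_m2)
print(mean_err_m1, mean_err_m2)
```

The two printed means are only estimates from one partitioning of one dataset, so a gap between them is not by itself evidence that one scheme is better.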
Comparing Schemes continued
• Want to show that scheme M1 is better than
scheme M2 in a particular domain
– For a given amount of training data
– On average, across all possible training sets
• Assume we have an infinite amount of data from
the domain:
– Sample many datasets of the given size and obtain a cross-validation estimate on each dataset for each scheme
– Check if mean accuracy for scheme M1 is better than
mean accuracy for scheme M2
• We probably don't have an infinite amount of data
What about ML research?
What if the difference between the 2 error rates is
just attributable to chance?
• Use a test of statistical significance
• Obtain confidence limits for our error estimates
Overview
Estimating Confidence Intervals: Null Hypothesis
• Perform 10-fold cross-validation
• Assume samples follow a t distribution with k–1
degrees of freedom (here, k=10)
• Use t-test (or Student’s t-test)
• If we fail to reject the null hypothesis: M1 & M2 are the "same"
• If we can reject the null hypothesis, then
– Conclude that the difference between M1 & M2 is
statistically significant
– Choose model with lower error rate
Sidebar: the Null Hypothesis
• Computer Scientists like to prove a hypothesis true, or false,
but we don't do that here.
• We might be less sure and say we accept the hypothesis or
reject it, but we shouldn't do that either.
• Statisticians fail to reject the hypothesis or reject the
hypothesis.
– If we can reject the null hypothesis, then we conclude that the
difference between the two machine learning methods is
statistically significant.
– If we fail to reject the null hypothesis, then we conclude that
the difference between the two machine learning methods
could be just chance.
Paired t-test
• In practice we have limited data and a limited
number of estimates for computing the mean
• Student’s t-test tells whether the means of two
samples are significantly different
• In our case, the samples are cross-validation
estimates for datasets from the domain
• Use a paired t-test because the individual
samples are paired
– The same cross-validation partitioning is applied to both schemes
William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin
in 1899. Invented the t-test to handle small samples for
quality control in brewing. Wrote under the name "Student."
Estimating Confidence Intervals: t-test
If only 1 test set available: pairwise comparison
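– For the ith round of 10-fold cross-validation, the
same partitioning is used to obtain $err(M_1)_i$
and $err(M_2)_i$
– Average over 10 rounds to get $\overline{err}(M_1)$ and $\overline{err}(M_2)$
– t-test computes t-statistic with k-1 degrees of
freedom:
$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}, \qquad var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2$$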
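A small sketch of the paired t-statistic, assuming the divide-by-k variance that the worked example later in these slides uses. The function name `paired_t_statistic` and the input numbers are hypothetical, chosen only for illustration.

```python
import math


def paired_t_statistic(err_m1, err_m2):
    """Return (t, degrees of freedom) for paired error estimates.

    Assumes var = (1/k) * sum((d_i - d_bar)^2), the divide-by-k convention
    followed by the worked example in these slides. Many textbooks and
    libraries divide by k - 1 instead, which gives a slightly smaller t.
    """
    if len(err_m1) != len(err_m2):
        raise ValueError("paired samples must have the same length")
    k = len(err_m1)
    diffs = [a - b for a, b in zip(err_m1, err_m2)]
    d_bar = sum(diffs) / k
    var = sum((d - d_bar) ** 2 for d in diffs) / k
    # Note: var is zero when all differences are identical, so t is undefined.
    return d_bar / math.sqrt(var / k), k - 1


# Hypothetical per-fold error rates (percent), for illustration only.
t, dof = paired_t_statistic([12.0, 15.0, 11.0, 14.0], [10.0, 14.0, 10.0, 11.0])
print(t, dof)
```

Because the pairing uses the difference within each fold, variation that affects both schemes equally (an easy or hard fold) cancels out before the variance is computed.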
Table for t-distribution
[Figure: t-distribution curve with the two-tailed rejection region marked]
• Significance level, e.g., sig = 0.05 or 5%
• Confidence limit: z = table value at sig/2
• The distribution is symmetric, so the lower limit is -z
Statistical Significance
Are M1 & M2 significantly different?
• Compute t. Select significance level (e.g. sig = 5%)
• Consult table for t-distribution: Find t value corresponding
to k-1 degrees of freedom (here, 9)
• t-distribution is symmetric: typically only the upper percentage points of the
distribution are shown, so look up the confidence limit z as the table value at sig/2
(here, 0.025)
• If t > z or t < -z, then t value lies in rejection region:
– Reject null hypothesis that mean error rates of M1 & M2 are the same
– Conclude: statistically significant difference between M1 & M2
• Otherwise if –z ≤ t ≤ z, then
– fail to reject null hypothesis that mean error rates of M1 & M2 are
same
– Conclude: any difference could be due to chance
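The decision rule above can be sketched as follows. The dictionary `CRITICAL_T_DF9` and the function `significantly_different` are hypothetical names; the three critical values are the usual t-table entries for 9 degrees of freedom, and a fuller version would look them up for any number of degrees of freedom.

```python
# Two-tailed critical values of the t-distribution for 9 degrees of freedom,
# copied from a standard t-table: key = significance level, value = z.
CRITICAL_T_DF9 = {0.10: 1.833, 0.05: 2.262, 0.01: 3.250}


def significantly_different(t, sig=0.05, table=CRITICAL_T_DF9):
    """Return True when t falls in the two-tailed rejection region."""
    z = table[sig]  # table value at upper-tail probability sig / 2
    return t > z or t < -z


print(significantly_different(2.5))        # tested at sig = 0.05
print(significantly_different(2.5, 0.01))  # tested at sig = 0.01
```

The same t value can be significant at one level and not at another, which is why the significance level should be fixed before the test is run.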
Recap: Performing the Test
• Fix a significance level
– If a difference is significant at the a% level, a difference this large
would occur less than a% of the time if the true means were equal
• Divide the significance level by two because the
test is two-tailed
• Look up the value for z that corresponds to a/2
• If t < –z or t > z then the difference is significant
– i.e. the null hypothesis (that the difference is zero)
can be rejected
EXAMPLE
Have two prediction models, M1 and M2. We have
performed 10 rounds of 10-fold cross validation on each
model, where the same data partitioning in round i is used
for both M1 and M2.
The error rates obtained for M1 are 30.5, 32.2, 20.7, 20.6,
31.0, 41.0, 27.7, 26.0, 21.5, 26.0.
The error rates for M2 are 22.4, 14.5, 22.4, 19.6, 20.7,
20.4, 22.1, 19.4, 16.2, 35.0.
Is one model significantly better than the other
at a significance level of 1%?
EXAMPLE continued
We use a hypothesis test to determine if there is a significant
difference in average error. We used the same test data for
each pair of observations, so we use the "paired observation"
hypothesis test to compare two means:
H0: $\bar{x}_1 - \bar{x}_2 = 0$ (Null hypothesis, difference is chance)
H1: $\bar{x}_1 - \bar{x}_2 \neq 0$ (Statistical difference in the model errors)
where $\bar{x}_1$ is the mean error of model M1, and $\bar{x}_2$ is the mean error of
model M2.
Compute the test statistic t using the formula:
$$t = \frac{\text{mean of the differences in error}}{(\text{std dev of the differences in error}) / \sqrt{\text{number of observations}}}$$
EXAMPLE (the Calculations)
$$t = \frac{\text{mean of the differences in error}}{(\text{std dev of the differences in error}) / \sqrt{\text{number of observations}}}$$

M1      M2      d = M1 - M2     (d - 6.45)^2
30.5    22.4       8.1             2.7225
32.2    14.5      17.7           126.5625
20.7    22.4      -1.7            66.4225
20.6    19.6       1.0            29.7025
31.0    20.7      10.3            14.8225
41.0    20.4      20.6           200.2225
27.7    22.1       5.6             0.7225
26.0    19.4       6.6             0.0225
21.5    16.2       5.3             1.3225
26.0    35.0      -9.0           238.7025

Average of the differences = 6.45
Average the squared deviations and take the square root to get
Std Dev = 8.25
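The calculation above can be checked with Python's standard `statistics` module. This is a sketch rather than part of the lecture; it uses `pstdev`, which divides by n, because the slide averages the squared deviations.

```python
import math
import statistics

# Error rates (percent) for M1 and M2 from the example.
m1 = [30.5, 32.2, 20.7, 20.6, 31.0, 41.0, 27.7, 26.0, 21.5, 26.0]
m2 = [22.4, 14.5, 22.4, 19.6, 20.7, 20.4, 22.1, 19.4, 16.2, 35.0]

diffs = [a - b for a, b in zip(m1, m2)]
mean_diff = statistics.fmean(diffs)
# pstdev divides by n, matching "average and take the square root".
std_diff = statistics.pstdev(diffs)
t = mean_diff / (std_diff / math.sqrt(len(diffs)))

print(round(mean_diff, 2), round(std_diff, 2), round(t, 2))  # prints 6.45 8.25 2.47
```

Dividing by n - 1 instead (`statistics.stdev`) would give a standard deviation of about 8.70 and t of about 2.34; the conclusion at the 1% level is the same either way.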
Example: Table Lookup
Significance level 1% (0.01), so look up the $t_{sig/2}$ value for probability 0.005
with 9 degrees of freedom: z = 3.25
From the calculations, t = 6.45 / (8.25 / sqrt(10)) ≈ 2.47
Since –z ≤ t ≤ z, i.e. –3.25 ≤ 2.47 ≤ 3.25,
we fail to reject the null hypothesis, i.e., the two
models are not significantly different at a significance level of 0.01
Estimating Confidence Intervals: t-test
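RECALL: If only 1 test set available: pairwise comparison
t-test computes t-statistic with k-1 degrees of freedom:
$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}}$$
where
$$var(M_1 - M_2) = \frac{1}{k}\sum_{i=1}^{k}\left[err(M_1)_i - err(M_2)_i - \left(\overline{err}(M_1) - \overline{err}(M_2)\right)\right]^2$$
If two test sets available: use non-paired t-test
$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{denominator}$$
where
$$denominator = \sqrt{\frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2}}$$
where k1 & k2 are # of cross-validation samples used for M1 & M2, respectively.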
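A sketch of the non-paired statistic, assuming the denominator is the square root of var(M1)/k1 + var(M2)/k2 and that each variance is the divide-by-n variance used in the paired example. The function names and the input numbers are hypothetical.

```python
import math


def variance(values):
    """Divide-by-n variance, the convention used in the paired example."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def unpaired_t_statistic(err_m1, err_m2):
    """t = (mean1 - mean2) / sqrt(var1/k1 + var2/k2) for independent samples."""
    k1, k2 = len(err_m1), len(err_m2)
    mean1 = sum(err_m1) / k1
    mean2 = sum(err_m2) / k2
    denominator = math.sqrt(variance(err_m1) / k1 + variance(err_m2) / k2)
    return (mean1 - mean2) / denominator


# Hypothetical error rates (percent) from two different sets of test samples.
t = unpaired_t_statistic([20.0, 22.0, 24.0, 22.0], [18.0, 19.0, 20.0])
print(t)
```

Unlike the paired version, the two lists may have different lengths, since no estimate in one list corresponds to a particular estimate in the other.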
In Other Words: Unpaired Observations
• If the CV estimates are from different
datasets, they are no longer paired (or maybe
we have k estimates for one scheme, and j
estimates for the other one)
• Then we have to use an unpaired t-test with
min(k , j) – 1 degrees of freedom
• The estimate of the variance of the difference
of the means becomes (taking k estimates for M1 and j for M2):
$$\frac{var(M_1)}{k} + \frac{var(M_2)}{j}$$
Dependent Estimates
• We assumed that we have enough data to create
several datasets of the desired size
• Need to re-use data if that's not the case
– e.g. running cross-validations with different
randomizations on the same data
• Samples become dependent, and then insignificant
differences can appear significant
• A heuristic test is the corrected resampled t-test:
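– Assume we use the repeated hold-out method k times, with n1
instances for training and n2 for testing
– New test statistic is
$$t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\sigma_d^2}}$$
where $\bar{d}$ is the mean and $\sigma_d^2$ the variance of the k differences in error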
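A sketch of the corrected resampled t-statistic in its usual form, t = mean(d) / sqrt((1/k + n2/n1) * var(d)), where d holds the k per-run differences in error. The function name and the numbers are hypothetical, and the k - 1 divisor for the variance is an assumption, since the slides do not say which variance estimate is intended.

```python
import math


def corrected_resampled_t(diffs, n_train, n_test):
    """t = d_bar / sqrt((1/k + n_test/n_train) * var_d) over k hold-out runs.

    var_d uses the k - 1 divisor here, which is an assumption; the slides
    do not state which variance estimate is intended.
    """
    k = len(diffs)
    d_bar = sum(diffs) / k
    var_d = sum((d - d_bar) ** 2 for d in diffs) / (k - 1)
    return d_bar / math.sqrt((1 / k + n_test / n_train) * var_d)


# Hypothetical differences in error (percent) from 4 repeated hold-out runs,
# each with 90 training and 10 test instances.
t = corrected_resampled_t([2.0, 1.0, 1.0, 3.0], n_train=90, n_test=10)
print(t)
```

The extra n2/n1 term enlarges the variance estimate to account for the overlap between runs, so the corrected t is smaller in magnitude than the uncorrected one on the same differences.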