On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

Paper by

Steven L. Salzberg

Presented by

Prakash Tilwani

MACS 598

April 25th, 2001

Agenda

Introduction

Classification basics

Definitions

Statistical validity

Bonferroni Adjustment

Statistical Accidents

Repeated Tuning

A Recommended Approach

Conclusion

Introduction

Comparative studies – do they follow proper methodology?

Public databases – have researchers relied on them too heavily?

Comparison results – are they really correct, or just statistical accidents?

Definitions

T-test

F-test

P-value

Null hypothesis

T-test

The t-test assesses whether the means of two groups are statistically different from each other.

It is the ratio of the difference in the group means to the variability of the groups.
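
A minimal sketch of the idea, assuming hypothetical per-dataset accuracies (not figures from the paper) and using SciPy's Welch t-test; the manual ratio mirrors the "difference in means over variability" description above:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies of two classifiers across five datasets (illustrative only).
acc_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83])
acc_b = np.array([0.78, 0.80, 0.79, 0.77, 0.81])

# t statistic: difference in means divided by the variability of the groups.
diff = acc_a.mean() - acc_b.mean()
se = np.sqrt(acc_a.var(ddof=1) / len(acc_a) + acc_b.var(ddof=1) / len(acc_b))
t_manual = diff / se

# The same test via scipy (Welch's variant, which does not assume equal variances).
t_scipy, p_value = stats.ttest_ind(acc_a, acc_b, equal_var=False)
print(f"t (manual) = {t_manual:.3f}, t (scipy) = {t_scipy:.3f}, p = {p_value:.3f}")
```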

F-test

It determines whether the variances of two samples are significantly different.

It is the ratio of the variances of the two samples.

It is the basis for “Analysis of Variance” (ANOVA).
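
A minimal sketch, assuming three hypothetical groups of accuracy scores; it shows both the plain variance ratio for two samples and SciPy's one-way ANOVA F-test:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores for three algorithms (illustrative only).
g1 = np.array([0.80, 0.82, 0.79, 0.81])
g2 = np.array([0.76, 0.78, 0.75, 0.77])
g3 = np.array([0.80, 0.79, 0.81, 0.78])

# F as a ratio of the variances of two samples.
f_ratio = g1.var(ddof=1) / g2.var(ddof=1)

# One-way ANOVA (the F-test behind "Analysis of Variance") for two or more groups.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"variance ratio = {f_ratio:.2f}, ANOVA F = {f_stat:.2f}, p = {p_value:.3f}")
```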

p-value

It represents the probability of concluding (incorrectly) that there is a difference between samples when no true difference exists.

It depends on the statistical test being performed.

P = 0.05 means there is a 5% chance of being wrong if you conclude that the populations are different.

Null Hypothesis

The assumption that there is no difference between two or more populations.

Any observed difference between samples is due to chance or sampling error.

Statistical Validity Tests

Statistics offers many tests designed to measure the significance of any difference.

Adapting them to classifier comparison must be done carefully.

Bonferroni Adjustment – an example

Comparison of classifier algorithms

154 datasets

The null hypothesis is rejected if the p-value is < 0.05 (not a very stringent threshold).

Differences were reported as significant if a t-test produced a p-value < 0.05.

Example (cont.)

This is not a correct use of the p-value significance test.

There were 154 experiments, and therefore 154 chances to obtain a “significant” result.

At the 0.05 level, roughly 154 * 0.05 = 7.7 of those results are expected to look significant by chance alone.

Example (cont.)

Let the significance level for each test be α.

The chance of drawing the correct conclusion in one experiment is 1 − α.

Assuming the experiments are independent of one another, the chance of drawing the correct conclusion in all n experiments is (1 − α)^n.

The chance of at least one incorrect conclusion is therefore 1 − (1 − α)^n.

Example (cont.)

Substituting α = 0.05 and n = 154:

The chance of at least one incorrect conclusion is 1 − (0.95)^154 ≈ 0.9996.

To obtain results significant at the 0.05 level across 154 tests, we need

1 − (1 − α)^154 < 0.05, i.e. α must be below roughly 0.0003.

Example - conclusion

These are rough calculations, but they provide insight into the problem.

Using the wrong p-value threshold leads to incorrect conclusions.

The t-test is the wrong test here in any case, because the training and test sets are not independent.

Simple Recommended Statistical Test

When a common test set is used to compare two algorithms A and B, the comparison must consider four numbers:

The number of test examples where A is correct and B is wrong (A > B)

The number where B is correct and A is wrong (A < B)

The number where both are correct (A = B)

The number where both are wrong (~A = ~B)

Simple Recommended Statistical Test (cont.)

If only two algorithms are compared:

Throw out the ties.

Compare the A > B count against the A < B count (see the sketch below).

If more than two algorithms are compared:

Use “Analysis of Variance” (ANOVA).

A Bonferroni adjustment for the multiple tests should be applied.
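
A minimal sketch of the two-algorithm case, assuming hypothetical counts for the four outcomes above and using scipy.stats.binomtest (available in recent SciPy versions); ties are discarded and A's wins are tested against what a fair coin would predict:

```python
from scipy.stats import binomtest

# Hypothetical counts on a shared test set (illustrative only).
n_a_only  = 30    # A correct, B wrong   (A > B)
n_b_only  = 18    # B correct, A wrong   (A < B)
n_both    = 900   # both correct         (A = B)    -> tie, discarded
n_neither = 52    # both wrong           (~A = ~B)  -> tie, discarded

# Throw out the ties and test whether A's wins exceed what chance (p = 0.5) predicts.
result = binomtest(n_a_only, n_a_only + n_b_only, p=0.5)
print(f"p-value = {result.pvalue:.3f}")
```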

Statistical Accidents

Suppose 100 people are studying the effect of algorithms A and B, and there is no real difference between them.

On average, about 5 of them will obtain results statistically significant at p <= 0.05 (assuming independent experiments).

Those results are due to nothing but chance.
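
A small simulation of this point, assuming both “algorithms” draw their accuracies from the same distribution (i.e. the null hypothesis is true); the exact count varies with the random seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
significant = 0
for _ in range(100):                      # 100 independent "research groups"
    a = rng.normal(0.80, 0.02, size=30)   # accuracies of algorithm A
    b = rng.normal(0.80, 0.02, size=30)   # accuracies of algorithm B (same distribution)
    _, p = stats.ttest_ind(a, b)
    if p <= 0.05:
        significant += 1

print(f"{significant} of 100 studies were 'significant' at p <= 0.05 by chance alone")
```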

Repeated Tuning

Algorithms are “tuned” repeatedly on the same datasets.

Every “tuning” attempt should be counted as a separate experiment.

For example, if 10 tuning experiments were attempted, then the significance threshold should be 0.005 (= 0.05 / 10) instead of 0.05.

Repeated Tuning (cont.)

The tuning experiments are not independent of one another, so even the Bonferroni adjustment is not very accurate.

A greater problem arises when using an algorithm that has been applied to a dataset before: you may not know how it was originally tuned (one disadvantage of relying on public databases).

Repeated Tuning – Recommended Approach

Break the dataset into k disjoint subsets of approximately equal size.

Perform k experiments.

In each experiment, one subset is set aside and the system is trained on the remaining k − 1 subsets.

The trained system is then tested on the held-out subset.

Repeated Tuning – Recommended Approach (cont.)

At the end of the k-fold experiment, every sample has been used in a test set exactly once.

Advantage: the test sets are independent.

Disadvantage: the training sets are clearly not independent (a sketch of the procedure follows below).
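
A minimal sketch of plain k-fold cross-validation, assuming a scikit-learn style workflow; the dataset (iris) and classifier (a decision tree) are placeholder choices, not the paper's:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

accuracies = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                     # train on k - 1 subsets
    accuracies.append(clf.score(X[test_idx], y[test_idx]))  # test on the held-out subset

print(f"mean accuracy over {kf.get_n_splits()} folds: {np.mean(accuracies):.3f}")
```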

A Recommended Approach

Choose the other algorithms to include in the comparison; try to include those most similar to the new algorithm.

Choose the datasets.

Divide each dataset into k subsets for cross-validation.

Typically k = 10.

For a small dataset, choose a larger k, since this leaves more examples in the training set.

A Recommended Approach (cont.)

Run a cross-validation:

For each of the k subsets of the dataset D, create a training set T = D − k (D with that subset held out).

Divide T into T1 (training) and T2 (tuning) subsets.

Once tuning is done, rerun training on all of T.

Finally, measure accuracy on the held-out subset k.

Overall accuracy is averaged across all k partitions (see the sketch below).
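
One possible realization of the per-fold procedure above, assuming the “tuning” step is a simple search over a decision tree's maximum depth; the model, parameter grid, and dataset are illustrative assumptions, not part of the paper:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
outer = KFold(n_splits=10, shuffle=True, random_state=0)
fold_accuracies = []

for train_idx, test_idx in outer.split(X):
    X_T, y_T = X[train_idx], y[train_idx]        # T = D minus the held-out subset

    # Divide T into T1 (training) and T2 (tuning).
    X_T1, X_T2, y_T1, y_T2 = train_test_split(X_T, y_T, test_size=0.25, random_state=0)

    # Tune on T2 only: pick the depth with the best tuning-set accuracy.
    best_depth, best_score = None, -1.0
    for depth in (1, 2, 3, 5, None):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_T1, y_T1)
        score = clf.score(X_T2, y_T2)
        if score > best_score:
            best_depth, best_score = depth, score

    # Once tuning is done, rerun training on all of T, then test on the held-out subset.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_T, y_T)
    fold_accuracies.append(final.score(X[test_idx], y[test_idx]))

print(f"overall accuracy averaged across all folds: {np.mean(fold_accuracies):.3f}")
```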

A Recommended Approach (cont.)

Finally, compare the algorithms.

When multiple datasets are used, a Bonferroni adjustment should be applied.

Conclusion

We don’t mean to discourage empirical comparisons, but rather to provide suggestions for avoiding the pitfalls.

Statistical tools should be used carefully.

Every detail of the experiment should be reported.
