On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

Published by

Steven L. Salzberg

Presented by

Prakash Tilwani

MACS 598

April 25th, 2001













Agenda

Introduction

Classification basics

Definitions

Statistical validity

• Bonferroni Adjustment

• Statistical Accidents

• Repeated Tuning

A Recommended Approach

Conclusion

Introduction







Comparative studies – proper methodology?

Public databases – relied too heavily on them?

Comparison results – are they really correct or just statistical accidents?









Definitions

T-test

F-test

P-value

Null hypothesis

T-test





The t-test assesses whether the means of two groups are statistically different from each other.

Ratio of difference in means to variability of groups.
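As a rough illustration of the ratio above, the sketch below computes the t statistic for two lists of accuracy scores. The function name `t_statistic` and the sample values are hypothetical, and Welch's form of the statistic is an assumption, since the slide does not say which variant of the t-test is meant.

```python
import math
import statistics

def t_statistic(group_a, group_b):
    """Welch's t: difference in means divided by its standard error."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    std_err = math.sqrt(statistics.variance(group_a) / len(group_a)
                        + statistics.variance(group_b) / len(group_b))
    return mean_diff / std_err

# Hypothetical accuracies from repeated runs of two classifiers.
acc_a = [0.81, 0.79, 0.84, 0.80, 0.82]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.76]
t = t_statistic(acc_a, acc_b)
```

A large absolute value of `t` means the difference in means is large relative to the variability within the groups.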

F-test







It determines whether the variances of two samples are significantly different.

Ratio of variance of two datasets

Basis for “Analysis of Variance” (ANOVA)
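The variance ratio can be sketched in a few lines. The helper name `f_ratio` is hypothetical, and putting the larger variance in the numerator is a common convention assumed here, not something the slide specifies.

```python
import statistics

def f_ratio(sample_a, sample_b):
    """Ratio of the two sample variances, larger variance on top."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    # Fails with ZeroDivisionError if the smaller variance is exactly zero.
    return max(var_a, var_b) / min(var_a, var_b)

# The second sample is the first scaled by 2, so its variance is 4x larger.
ratio = f_ratio([1, 2, 3], [2, 4, 6])
```

A ratio close to 1 suggests similar variances; how far from 1 counts as significant depends on the sample sizes.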

p-value







The p-value is the probability of observing a difference at least as large as the one in the samples when no true difference exists.

It depends on the statistical test being performed.

Declaring a difference whenever p < 0.05 means accepting a 5% chance of concluding (incorrectly) that the populations differ when they do not.

NULL Hypothesis



Assumption that there is no difference in two or more populations.



Any observed difference in samples is due to chance or sampling error.

Statistical Validity Tests





Statistics offers many tests that are designed to measure the significance of any difference.

Adaptation to classifier comparison should be done carefully.

Bonferroni Adjustment – an example









Comparison of classifier algorithms

154 datasets

The null hypothesis is rejected if the p-value is < 0.05 (not very stringent)

Differences were reported significant if a t-test produced p-value < 0.05.

Example (cont.)







This is not correct usage of p-value significance test.

There were 154 experiments. Therefore,

154 chances to be significant.

At the 0.05 level, 154 × 0.05 (= 7.7) of the results are expected to appear significant by chance alone.

Example (cont.)









Let the significance level for each experiment be $\alpha$.

The chance of making the right conclusion for one experiment is $1 - \alpha$.

Assuming the experiments are independent of one another, the chance of getting all $n$ experiments correct is $(1 - \alpha)^n$.

The chance of not making the correct conclusion is $1 - (1 - \alpha)^n$.

Example (cont.)







Substituting $\alpha = 0.05$ and $n = 154$:

The chance of making an incorrect conclusion is $1 - (1 - 0.05)^{154} \approx 0.9996$.

To obtain results significant at the 0.05 level with 154 tests:

$1 - (1 - \alpha)^{154} < 0.05$, or $\alpha < 0.0003$
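The arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and the tests are assumed to be independent, as in the slides. The simple Bonferroni division (0.05 / 154) is shown next to the exact solution for comparison.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive in n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def per_test_alpha(target, n_tests):
    """Largest per-test alpha that keeps the overall error below target."""
    return 1 - (1 - target) ** (1 / n_tests)

fwer = familywise_error(0.05, 154)       # overall error at alpha = 0.05
alpha_exact = per_test_alpha(0.05, 154)  # solves 1 - (1 - a)^154 = 0.05
alpha_bonferroni = 0.05 / 154            # simpler, slightly stricter bound
```

The Bonferroni value is always a little smaller than the exact one, so it errs on the conservative side.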

Example - conclusion







The calculation is rough, but it provides insight into the problem.

Using the wrong p-value threshold leads to incorrect conclusions.

The t-test is the wrong test overall, because the test sets used for the algorithms are not independent.

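Simple Recommended Statistical Test

When a common test set is used to compare two algorithms (A and B), the comparison must consider four numbers:

A > B: examples that A classified correctly and B did not

A < B: examples that B classified correctly and A did not

A = B: examples that both classified correctly

~A = ~B: examples that both classified incorrectly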

Simple Recommended Statistical Test (cont.)





If only two algorithms are compared:

Throw out ties.

Compare A > B vs. A < B.

If more than two algorithms are compared:

Use “Analysis of Variance” (ANOVA).

A Bonferroni adjustment for multiple tests should be applied.
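For the two-algorithm case, one common way to compare the A > B and A < B counts is an exact two-sided binomial (sign) test. The slides do not name the test, so treat this choice as an assumption; the function name `sign_test_p` and the counts are hypothetical.

```python
from math import comb

def sign_test_p(a_wins, b_wins):
    """Two-sided exact binomial test of a_wins vs. b_wins under p = 0.5."""
    n = a_wins + b_wins  # ties have already been thrown out
    if n == 0:
        return 1.0
    k = min(a_wins, b_wins)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test; cap at 1 for near-even splits.
    return min(1.0, 2 * tail)

# Hypothetical counts: A alone was right on 15 examples, B alone on 5.
p_value = sign_test_p(15, 5)
```

Only the examples on which the two algorithms disagree carry information, which is why the ties are dropped before the test.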

Statistical Accidents







Suppose 100 people are studying the effect of algorithms A and B.

Roughly 5 will get results statistically significant at p <= 0.05 (assuming independent experiments), even if A and B do not differ.

These results are nothing but due to chance.
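A small simulation makes the point concrete. It is only a sketch: it assumes that, when A and B truly do not differ, each study's p-value is uniformly distributed between 0 and 1, and the function name `count_false_positives` is hypothetical.

```python
import random

def count_false_positives(n_studies=100, alpha=0.05, rng=None):
    """Count studies that reach p <= alpha when A and B do not differ."""
    if rng is None:
        rng = random.Random()
    # Assumption: under the null hypothesis each p-value is uniform on [0, 1).
    return sum(rng.random() <= alpha for _ in range(n_studies))

rng = random.Random(0)  # fixed seed so the run is repeatable
# Repeat the "100 people" scenario many times and average the counts.
counts = [count_false_positives(rng=rng) for _ in range(2000)]
average = sum(counts) / len(counts)
```

Individual runs vary, but the average number of "significant" studies settles near 100 × 0.05 = 5.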

Repeated Tuning







Algorithms are “tuned” repeatedly on same datasets.

Every “tuning” attempt should be considered as a separate experiment.

For example, if 10 tuning experiments were attempted, the significance threshold should be 0.005 (0.05 / 10) instead of 0.05.

Repeated Tuning (cont.)





Datasets are not independent, therefore even Bonferroni adjustment is not very accurate.

A greater problem occurs when using an algorithm that has been used before: you may not know how it was tuned (one disadvantage of using public databases).

Repeated Tuning – Recommended approach









Break dataset into k disjoint subsets of approximately equal size.

k experiments are performed.

In each experiment, one subset is held out and the system is trained on the rest.

The trained system is tested on the held-out subset.

Repeated Tuning – Recommended approach (cont.)







At the end of k-fold experiment, every sample has been used in test set exactly once.

Advantage: test sets are independent.

Disadvantage: training sets are clearly not independent.
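The k-fold partition described above can be sketched as follows. The helper name `k_fold_indices` is hypothetical, and the samples are assumed to be in random order already; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k disjoint, near-equal folds."""
    # Fold i takes samples i, i + k, i + 2k, ... so sizes differ by at most 1.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        # The training set is every fold except the held-out one.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=23, k=5))
```

Each index appears in exactly one test list, while any two training lists share most of their samples, which is the dependence the slide warns about.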








A Recommended Approach

Choose other algorithms to include in the comparison. Try to include those most similar to the new algorithm.

Choose datasets.





Divide the data set into k subsets for cross validation.

Typically k=10.

For a small data set, choose larger k, since this leaves more examples in the training set.




A Recommended Approach (cont.)











Run a cross-validation

For each of the k subsets of the data set D, create a training set T = D − k.

Divide T into T1 (training) and T2 (tuning) subsets.

Once tuning is done, rerun training on T

Finally measure accuracy on k

Overall accuracy is averaged across all k partitions.
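The cross-validation steps above can be sketched end to end. Everything specific here is an assumption made for illustration: the toy 1-D nearest-neighbour classifier, the tuned parameter (number of neighbours), the 2:1 split of T into T1 and T2, and the synthetic data.

```python
import random
from collections import Counter

def knn_predict(train, x, n_neighbors):
    """Majority label among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, n_neighbors):
    """Fraction of test points whose predicted label is correct."""
    hits = sum(knn_predict(train, x, n_neighbors) == y for x, y in test)
    return hits / len(test)

def cross_validate(data, k=10, candidates=(1, 3, 5)):
    """k-fold cross-validation in which tuning never sees the held-out fold."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # T = D minus the held-out subset.
        t = [p for f, fold in enumerate(folds) if f != i for p in fold]
        # Divide T into T1 (training) and T2 (tuning).
        cut = (2 * len(t)) // 3
        t1, t2 = t[:cut], t[cut:]
        best = max(candidates, key=lambda c: accuracy(t1, t2, c))
        # "Retrain" on all of T (k-NN simply stores T), then test on the
        # held-out subset using the tuned setting.
        scores.append(accuracy(t, held_out, best))
    # Overall accuracy is the average across the k partitions.
    return sum(scores) / len(scores)

rng = random.Random(0)
# Hypothetical data: class 0 centred at 0, class 1 centred at 3.
data = [(rng.gauss(3 * y, 1), y) for y in [0, 1] * 50]
rng.shuffle(data)
mean_accuracy = cross_validate(data, k=10)
```

The key design point is that the held-out subset plays no part in choosing the parameter, so each test score is not inflated by tuning.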


A Recommended Approach (cont.)



Finally, compare algorithms



In case of multiple data sets, Bonferroni adjustment should be applied

Conclusion







We don’t mean to discourage empirical comparisons but to provide suggestions to avoid pitfalls.

Statistical tools should be used carefully.

Every detail of the experiment should be reported.
