BCB 444/544
Lab 10 (11/8/07)
Machine Learning
Due Monday 11/12/07 by 5 pm – email to terrible@iastate.edu
1. What is K-fold cross validation? What data is used for training? What data is
used for testing?
K-fold cross validation is a model evaluation method that divides the data set into
K groups (folds). In each round of cross-validation, all but one of the groups (K-1)
are used as the training set and the reserved group is used as the validation
(testing) set. This is repeated until each of the K groups has been used as the
testing set exactly once.
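The procedure can be sketched in a few lines of Python; the toy data and the
scikit-learn GaussianNB classifier below are illustrative assumptions, not part
of the lab:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import GaussianNB

    # Toy data: 100 examples with 4 attributes and a binary class label.
    X = np.random.rand(100, 4)
    y = np.random.randint(0, 2, size=100)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        clf = GaussianNB()
        clf.fit(X[train_idx], y[train_idx])                 # train on K-1 folds
        scores.append(clf.score(X[test_idx], y[test_idx]))  # test on the held-out fold
    print("Mean accuracy over 5 folds:", np.mean(scores))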
2. What assumption is used in the Naïve Bayes classifier?
It assumes that the attributes are conditionally independent of one another,
given the class label.
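In symbols, the classifier scores a class c for an example x = (x1, ..., xn) as
P(c) * P(x1|c) * ... * P(xn|c) instead of modeling the joint distribution of the
attributes. A minimal sketch, with hypothetical lookup tables standing in for
probabilities estimated from training data:

    def naive_bayes_score(x, c, prior, likelihood):
        """Score class c for example x under the independence assumption:
        P(c) times the product over attributes i of P(x_i | c)."""
        score = prior[c]
        for i, value in enumerate(x):
            score *= likelihood[(i, value, c)]  # each attribute contributes independently
        return score

    # Hypothetical estimates for two binary attributes and class "pos".
    prior = {"pos": 0.5}
    likelihood = {(0, 1, "pos"): 0.8, (1, 0, "pos"): 0.3}
    print(naive_bayes_score((1, 0), "pos", prior, likelihood))  # 0.5 * 0.8 * 0.3 = 0.12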
3. What criterion does the decision tree classifier use to decide which attribute to
put first in the decision tree?
It uses an information-based criterion: the attribute that provides the most
information about the class label (the highest information gain) is placed first,
at the root of the tree.
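(Note that J48, the Weka implementation of C4.5 used below, actually ranks
attributes by gain ratio, a normalized variant.) A sketch of the basic
information-gain criterion for a categorical attribute:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels, in bits."""
        n = len(labels)
        return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

    def information_gain(attr_values, labels):
        """Reduction in class entropy from splitting the examples on one attribute."""
        n = len(labels)
        after = 0.0
        for v in set(attr_values):
            subset = [lab for a, lab in zip(attr_values, labels) if a == v]
            after += (len(subset) / n) * entropy(subset)
        return entropy(labels) - after

    # An attribute that separates the classes perfectly gains a full bit.
    print(information_gain(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0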
4. What is the purpose of the kernel function in a SVM classifier?
A kernel function implicitly maps the data points into a higher-dimensional
feature space, where the data can (hopefully) be separated by a hyperplane.
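For example, the RBF kernel that appears in the tables below computes a
similarity equivalent to an inner product in an implicit feature space, without
ever constructing that space explicitly (the "kernel trick"). A sketch, with
the width parameter gamma chosen arbitrarily:

    import numpy as np

    def rbf_kernel(x, z, gamma=0.5):
        """K(x, z) = exp(-gamma * ||x - z||^2): an inner product between the
        images of x and z in an implicit high-dimensional feature space."""
        d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
        return np.exp(-gamma * np.dot(d, d))

    print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # exp(-0.5 * 2) ≈ 0.368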
5. Based on what you read, which method(s) can a human interpret? What
method(s) can a human not interpret, i.e., “black box” method(s)?
Humans can most easily interpret decision trees, followed by Naïve Bayes; an SVM
is much more difficult for humans to interpret and is effectively a "black box"
method.
6. According to this web page, which algorithm tends to have the highest
classification accuracy?
Support Vector Machines
5-Fold Cross Validation Results
Algorithm   Accuracy   TPRate   FPRate   Precision   Recall   TP    FP    FN    TN
NB          62.25%     0.628    0.363    0.634       0.628    628   363   372   637
J48         58.75%     0.511    0.336    0.603       0.511    511   336   489   664
SVM         61.7%      0.595    0.361    0.622       0.595    595   361   405   639
SVM (RBF)   62%        0.541    0.301    0.643       0.541    541   301   459   699

Test Case Results
Algorithm   Accuracy   TPRate   FPRate   Precision   Recall   TP   FP   FN   TN
NB          73.5%      0.828    0.377    0.726       0.828    53   20   11   33
J48         71.7%      0.641    0.189    0.804       0.641    41   10   23   43
SVM         74.4%      0.828    0.358    0.736       0.828    53   19   11   34
SVM (RBF)   76.07%     0.781    0.264    0.781       0.781    50   14   14   39
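The rate columns in both tables follow directly from the TP/FP/FN/TN counts; a
quick check in Python, using the Naïve Bayes test-case row as the example:

    tp, fp, fn, tn = 53, 20, 11, 33               # Naive Bayes, test case

    accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 86/117 ≈ 0.735
    tp_rate   = tp / (tp + fn)                    # = recall: 53/64 ≈ 0.828
    fp_rate   = fp / (fp + tn)                    # 20/53 ≈ 0.377
    precision = tp / (tp + fp)                    # 53/73 ≈ 0.726
    print(accuracy, tp_rate, fp_rate, precision)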
7. What algorithm did the best and under what conditions?
Answers to this question can vary, depending on what you determined to be most
important. For example, if you thought performance in the cross-validation
experiments was most important, and were only concerned with accuracy, you
might conclude that Naïve Bayes is the best, closely followed by the two SVMs.
However, if you were concerned with precision, you might have chosen the SVM
with the RBF kernel. You may have considered the performance on the test case
more important, and chosen the J48 decision tree because it had the highest
precision. The point is that evaluating classifiers is not necessarily
straightforward, and different people will have different goals and priorities
for their prediction tasks.
8. Did the cross validation results indicate accurately what performance on the test
case would be?
The cross validation results did not accurately predict the performance on the
test case. The cross-validation accuracies were consistently ~10% lower than the
test-case accuracies, and the relative performance among the algorithms changed.
Basically, the test case was "easier" than the training data. The point here is
that cross validation experiments on any dataset can only tell you average
performance over a range of examples. Your protein of interest may be a
particularly difficult (or easy) case, and prediction performance can vary
widely.
9. Briefly describe what the algorithm you chose does.
Answers here varied depending on which algorithm you chose to use.