BCB 444/544 Lab 10 (11/8/07) Machine Learning
Due Monday 11/12/07 by 5 pm – email to terrible@iastate.edu

1. What is K-fold cross validation? What data is used for training? What data is used for testing?

K-fold cross-validation is a training/validation method that divides the data set into K groups (folds). In each round of cross-validation, K-1 of the groups are used as the training set and the one remaining group is used as the validation (testing) set. This is repeated until each of the K groups has served as the testing set exactly once. (A minimal code sketch of this procedure appears at the end of this key.)

2. What assumption is used in the Naïve Bayes classifier?

It assumes that the attributes are conditionally independent given the class label, so the class-conditional probability factorizes: P(x1, ..., xn | C) = P(x1 | C) × ... × P(xn | C). (See the second sketch at the end of this key.)

3. What criterion does the decision tree classifier use to decide which attribute to put first in the decision tree?

Information gain: the attribute whose values tell you the most about the class label, i.e., the one that most reduces the entropy of the class distribution, is used first. (See the third sketch at the end of this key.)

4. What is the purpose of the kernel function in a SVM classifier?

A kernel function implicitly maps the data points into a higher-dimensional feature space in which the classes can (hopefully) be separated by a linear boundary. (See the final sketch at the end of this key.)

5. Based on what you read, which method(s) can a human interpret? What method(s) can a human not interpret, i.e., "black box" method(s)?

Humans can most easily interpret decision trees, followed by Naïve Bayes; the SVM is much more difficult for humans to interpret and is usually treated as a "black box."

6. According to this web page, which algorithm tends to have the highest classification accuracy?

Support Vector Machines.

5-Fold Cross-Validation Results

Algorithm   Accuracy  TP Rate  FP Rate  Precision  Recall  TP   FP   FN   TN
NB          62.25%    0.628    0.363    0.634      0.628   628  363  372  637
J48         58.75%    0.511    0.336    0.603      0.511   511  336  489  664
SVM         61.7%     0.595    0.361    0.622      0.595   595  361  405  639
SVM (RBF)   62%       0.541    0.301    0.643      0.541   541  301  459  699

Test Case Results

Algorithm   Accuracy  TP Rate  FP Rate  Precision  Recall  TP  FP  FN  TN
NB          73.5%     0.828    0.377    0.726      0.828   53  20  11  33
J48         71.7%     0.641    0.189    0.804      0.641   41  10  23  43
SVM         74.4%     0.828    0.358    0.736      0.828   53  19  11  34
SVM (RBF)   76.07%    0.781    0.264    0.781      0.781   50  14  14  39

7. What algorithm did the best and under what conditions?

Answers to this question can vary; it depends on what you determined to be most important. For example, if you thought performance in the cross-validation experiments was most important and were only concerned with accuracy, you might conclude that Naïve Bayes is the best, closely followed by the two SVMs. However, if you were concerned with precision, you might have chosen the SVM with the RBF kernel. You may have considered performance on the test case more important and chosen the J48 decision tree because it had the highest precision there. The point is that evaluating classifiers is not necessarily straightforward, and different people will have different goals and priorities for their prediction tasks.

8. Did the cross validation results indicate accurately what performance on the test case would be?

No. The cross-validation accuracies were consistently about 10% lower than the test-case accuracies, and the relative performance among the algorithms changed. Essentially, the test case was "easier" than the training data. The point here is that cross-validation experiments on any dataset can only tell you average performance over a range of examples; your protein of interest may be a particularly difficult (or easy) case, and prediction performance can vary widely.

9. Briefly describe what the algorithm you chose does.

Answers here varied depending on which algorithm you chose to use.
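
To make the procedure in question 1 concrete, here is a minimal sketch of K-fold cross-validation in plain Python. The train_fn and accuracy_fn callables are hypothetical placeholders for whatever classifier and scoring function are being evaluated; they are not part of the lab itself.

```python
# Minimal sketch of K-fold cross-validation (question 1).

def k_fold_indices(n_examples, k):
    """Split indices 0..n_examples-1 into k roughly equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_examples // k + (1 if i < n_examples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, labels, k, train_fn, accuracy_fn):
    """Hold each fold out once for testing; train on the other k-1."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Training set: every fold except fold i.
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        model = train_fn([data[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        # Test set: the one held-out fold.
        scores.append(accuracy_fn(model,
                                  [data[j] for j in test_idx],
                                  [labels[j] for j in test_idx]))
    return sum(scores) / k  # average accuracy over the k rounds
```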
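The independence assumption in question 2 is easiest to see in code: the class-conditional likelihood is just a product of per-attribute probabilities (a sum in log space). The probability tables below are hypothetical placeholders for values estimated from training counts.

```python
# Sketch of the Naive Bayes scoring rule (question 2).
import math

def naive_bayes_score(x, class_prior, cond_prob):
    """Log-posterior score of one class for example x:
        log P(C) + sum_i log P(x_i | C)
    cond_prob[i][v] is P(attribute i takes value v | C); the class
    with the highest score is predicted. (In practice the counts are
    smoothed so no probability is exactly zero.)"""
    score = math.log(class_prior)
    for i, v in enumerate(x):
        score += math.log(cond_prob[i][v])
    return score

# Example with made-up numbers: two binary attributes, one class.
# prior = 0.5; cond = [{0: 0.9, 1: 0.1}, {0: 0.3, 1: 0.7}]
# naive_bayes_score([0, 1], prior, cond) == log(0.5) + log(0.9) + log(0.7)
```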
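For question 3, the usual formalization of "the attribute with the most information" is information gain: the drop in class entropy after partitioning the examples on that attribute. A small sketch, assuming categorical attribute values:

```python
# Sketch of the information-gain criterion (question 3).
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy that remains
    after partitioning the examples by one attribute's values."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# The root attribute is the one maximizing information_gain over all
# candidates (attribute_columns is a placeholder name):
# best = max(attribute_columns, key=lambda a: information_gain(a, labels))
```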
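For question 4, here is the RBF (radial basis function) kernel corresponding to the "SVM (RBF)" rows in the tables above. It computes a similarity that equals an inner product in an implicit higher-dimensional feature space, so the SVM never constructs that space explicitly. The gamma value is an arbitrary illustrative choice, not one taken from the lab.

```python
# Sketch of a kernel function (question 4): the RBF kernel.
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma controls the width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# A linear SVM's decision function depends only on dot products x.y;
# replacing each dot product with rbf_kernel(x, y) yields a non-linear
# decision boundary in the original input space.
```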