1 - Exploring Weka's interfaces, and working with big data

Lesson 1.2 Activity: Percentage split vs cross-validation estimates

1. In the lesson video, Ian obtained the following results from J48 on the segment-challenge dataset, using 10 repetitions in both cases:

                                  mean      std dev
   10-fold cross-validation      95.71%      1.85%
   90% split                     95.22%      1.20%

   (Note: 10 repetitions of 10-fold cross-validation involves executing J48 100 times.)

   Which do you think is a better estimate of the mean?
   95.71% / 95.22% / Don't know

2a. In the above table it's curious that the standard deviation estimate for percentage split is smaller than that for cross-validation. Re-run the experiment and determine the standard deviation estimate for a 95% split, repeated 10 times.
    Answer:

2b. What is the standard deviation estimate for a 20-fold cross-validation, repeated 10 times?
    Answer:

2c. What is the standard deviation estimate for an 80% split, repeated 10 times?
    Answer:

2d. What is the standard deviation estimate for a 5-fold cross-validation, repeated 10 times?
    Answer:

3. Can you think of a reason why the standard deviation estimate tends to be smaller for percentage split than for the corresponding cross-validation?
   - The estimate is made using a different number of samples in each case.
   - There's some overlap in the test sets for percentage split, but none for cross-validation.
   - There's no reason; it's just a coincidence.
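If you would like to reproduce estimates like those in the table above without clicking through the interface ten times, here is a minimal sketch using Weka's Java API: it repeats 10-fold cross-validation and a 90% split ten times each with different random seeds and reports the mean and sample standard deviation of the accuracy. The file location, the choice of seeds, and the use of a simple holdout for the split are assumptions, so the numbers will be close to, but not necessarily identical to, the ones you get in Weka's GUI.

   import java.util.Random;
   import weka.classifiers.Evaluation;
   import weka.classifiers.trees.J48;
   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;

   public class RepeatedEstimates {
       public static void main(String[] args) throws Exception {
           // Path is an assumption -- point it at your copy of the dataset
           Instances data = DataSource.read("segment-challenge.arff");
           data.setClassIndex(data.numAttributes() - 1);   // class assumed to be last

           double[] cv = new double[10];
           double[] split = new double[10];
           for (int run = 0; run < 10; run++) {
               // 10-fold cross-validation with a fresh seed each run
               Evaluation evalCV = new Evaluation(data);
               evalCV.crossValidateModel(new J48(), data, 10, new Random(run + 1));
               cv[run] = evalCV.pctCorrect();

               // 90% / 10% holdout with the same seed
               Instances shuffled = new Instances(data);
               shuffled.randomize(new Random(run + 1));
               int trainSize = (int) Math.round(shuffled.numInstances() * 0.9);
               Instances train = new Instances(shuffled, 0, trainSize);
               Instances test = new Instances(shuffled, trainSize,
                                              shuffled.numInstances() - trainSize);
               J48 tree = new J48();
               tree.buildClassifier(train);
               Evaluation evalSplit = new Evaluation(train);
               evalSplit.evaluateModel(tree, test);
               split[run] = evalSplit.pctCorrect();
           }
           System.out.printf("10-fold CV: mean=%.2f sd=%.2f%n", mean(cv), sd(cv));
           System.out.printf("90%% split:  mean=%.2f sd=%.2f%n", mean(split), sd(split));
       }

       static double mean(double[] a) {
           double s = 0; for (double v : a) s += v; return s / a.length;
       }

       static double sd(double[] a) {
           double m = mean(a), s = 0;
           for (double v : a) s += (v - m) * (v - m);
           return Math.sqrt(s / (a.length - 1));   // sample standard deviation
       }
   }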
Lesson 1.3 Activity: Comparing classifiers

In the Experimenter, use J48, OneR and ZeroR (with default parameters) on the iris, breast-cancer, credit-g, diabetes, glass, ionosphere, and segment-challenge datasets, producing a results table like the one shown in the lesson. (Ian does exactly this in the lesson video, and the procedure is documented on the slides. Don't worry about the order of the rows, or the order of the columns -- so long as J48 comes first.)

Now delete the segment-challenge dataset (which will make things much faster), add the NaiveBayes, IBk, Logistic, SMO, and AdaBoostM1 classifiers (all with default parameters), and perform the same test. From the output obtained, answer the following questions.

1. For one (and only one) of the datasets, some schemes significantly outperform J48 (at the 5% level). Which dataset?
   ionosphere / credit-g / iris / breast-cancer / glass / diabetes

2. One of the classifiers is not significantly different from J48 on any dataset. Which classifier?
   AdaBoostM1 / NaiveBayes / Logistic / SMO / IBk

3. One of the classifiers is significantly better than J48 on exactly one dataset, and significantly worse than J48 on exactly one other dataset. Which classifier?
   AdaBoostM1 / NaiveBayes / Logistic / SMO / IBk

4. Change the analysis parameters to compare the various classifiers with OneR instead of with J48. There's one dataset on which all classifiers significantly outperform OneR. Which dataset?
   ionosphere / glass / diabetes / breast-cancer / credit-g / iris

5. Which classifier significantly outperforms OneR on the largest number of datasets?
   AdaBoostM1 / NaiveBayes / Logistic / SMO / IBk

6. Change the analysis parameters to compare the various classifiers with SMO instead of with OneR. Which classifier significantly outperforms SMO on the largest number of datasets?
   AdaBoostM1 / NaiveBayes / Logistic / J48 / IBk

7. Which other classifier significantly outperforms SMO on at least one dataset?
   IBk / NaiveBayes / Logistic / SMO / AdaBoostM1

8. Ignoring whether or not the differences are statistically significant, which classifier performs best of all on each of the following datasets? (In each case, choose from J48 / IBk / NaiveBayes / Logistic / SMO / AdaBoostM1.)
   a) Best performing classifier for iris:
   b) Best performing classifier for breast-cancer:
   c) Best performing classifier for credit-g:
   d) Best performing classifier for diabetes:
   e) Best performing classifier for glass:
   f) Best performing classifier for ionosphere:

Lesson 1.4 Activity: Looking inside cross-validation

Let's look at the models produced during cross-validation. To begin, open the Knowledge Flow interface and recreate the first example given in the lesson video. Use an ArffLoader; right-click it and select Configure to set the file to iris.arff. Connect the ArffLoader to a ClassAssigner, then make a dataSet connection to a CrossValidationFoldMaker, then make a trainingSet and a testSet connection to J48, then make a batchClassifier connection to a ClassifierPerformanceEvaluator, then make a text connection to a TextViewer. Add a GraphViewer and connect the graph output of J48 to it.

Run the system (Start loading on the ArffLoader right-click menu) and examine the results of the GraphViewer. This shows the 10 models that are produced inside the cross-validation: click to see each one. It's easier to compare them using a TextViewer rather than a GraphViewer, so connect a TextViewer up to J48 and use it to look at the models (you'll need to run the system again).

1. Among the 10 models there are two basically different structures. How many leaves do they have respectively? (Type your answer as two integers separated by a space.)
   Answer:

2. How many of the 10 models have the larger number of leaves?
   Answer:

3. The first of the numbers that appear in brackets after each leaf is the number of instances that reach that leaf. The total number of instances that reach a leaf is the same for each of the 10 trees. What is that number of instances?
   130 / 135 / 122 / 150

4. When two numbers appear in brackets after a leaf, the second number shows how many incorrectly classified instances reach that leaf. One of the 7 models with 5 leaves makes fewer misclassifications than the other 6 models. How many does it make?
   Answer:

5. The 3 models with 4 leaves all differ slightly from each other. How?
   - They branch on different attributes.
   - They branch on the same attributes but the cutoff values differ.
   - They have a smaller number of class values.
   - A different total number of instances reach the leaves.
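If you want a second way of looking at the fold-by-fold models, the sketch below uses Weka's Java API to approximate what the CrossValidationFoldMaker and J48 components do: it randomizes and stratifies iris, builds J48 on each of the 10 training folds, and prints each tree in text form, much as the TextViewer does. The file path, seed, and fold assignment are assumptions, so the individual trees may not match your Knowledge Flow run exactly, although the same two basic structures should appear.

   import java.util.Random;
   import weka.classifiers.trees.J48;
   import weka.core.Instances;
   import weka.core.converters.ConverterUtils.DataSource;

   public class FoldModels {
       public static void main(String[] args) throws Exception {
           // Load iris; the class is assumed to be the last attribute
           Instances data = DataSource.read("iris.arff");
           data.setClassIndex(data.numAttributes() - 1);

           // Roughly what the fold maker does: randomize, stratify, split into 10 folds
           Instances randomized = new Instances(data);
           randomized.randomize(new Random(1));
           randomized.stratify(10);

           for (int fold = 0; fold < 10; fold++) {
               Instances train = randomized.trainCV(10, fold);
               J48 tree = new J48();          // default parameters, as in the exercise
               tree.buildClassifier(train);   // the model built inside this fold
               System.out.println("=== Fold " + (fold + 1) + " ===");
               System.out.println(tree);      // text form of the tree, like the TextViewer output
           }
       }
   }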
Lesson 1.5 Activity: Using Javadoc and the Simple CLI

1. Use Weka's Javadoc to find out what the "-I" (capital i, not lower-case l) option of the IBk classifier does. What does the documentation say about this?
   - Weight neighbors by 1 - their distance.
   - The nearest neighbor search algorithm to use.
   - Weight neighbors by the inverse of their distance.
   - Number of nearest neighbors (k) used in classification.

2. In the Explorer, configure IBk to use this option (by setting distanceWeighting to Weight by 1/distance), set the KNN parameter to 5, click OK, right-click the configuration (that is, the large field that appears to the right of the Choose button), and copy it to the clipboard. Paste this command into the Simple CLI interface, preceded by "java" and followed by "-t iris.arff" (i.e., specify iris as a training file; you will have to precede it with an appropriate directory specification). Then press Enter to run the command. What is the percentage of correctly classified instances? (Here and elsewhere in this course, round percentage accuracies to the nearest integer.)
   Answer:

3. In preparation for Lesson 1.6, use Weka's Javadoc to find out which classifiers are "updateable", i.e., implement the UpdateableClassifier interface. (Hint: IBk is one of them, and near the top of its "Class" page are links to all the interfaces it implements. Just click the appropriate one.) How many updateable classifiers are there?
   Answer:

Lesson 1.6 Activity: Experience big data

Warm-up exercise. Reproduce what Ian was doing in the Command Line interface with the LED24 data in the lesson:

- make a test file with 100,000 instances using
  java weka.datagenerators.classifiers.classification.LED24 -n 100000 -o <test file>
- make a training file with 10,000,000 instances using
  java weka.datagenerators.classifiers.classification.LED24 -n 10000000 -o <training file>
- apply NaiveBayesUpdateable (this may take a few minutes):
  java weka.classifiers.bayes.NaiveBayesUpdateable -t <training file> -T <test file>
- verify that Weka runs out of memory if cross-validation is attempted:
  java weka.classifiers.bayes.NaiveBayesUpdateable -t <training file>
- if you feel particularly brave, repeat the exercise with a 100,000,000-instance training file

Actual exercise. The covertype data has 581,012 instances, 55 attributes, and 7 class values. Working with big data is a pain, so we've created a small 10,000-instance subset, covtype.small, for you to play with.

1. Using the Explorer, determine the percentage accuracy of NaiveBayes, evaluated on the training set.
   66% / 65% / 68% / 67%

2. Using the Explorer, determine the percentage accuracy of NaiveBayes, evaluated using cross-validation.
   67% / 68% / 65% / 66%

3. Using the Simple CLI, determine the percentage accuracy of NaiveBayesUpdateable, evaluated on the training set.
   68% / 66% / 67% / 65%

4. Using the Simple CLI, determine the percentage accuracy of NaiveBayesUpdateable, evaluated using cross-validation.
   67% / 65% / 68% / 66%
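What makes NaiveBayesUpdateable suitable for big data is the UpdateableClassifier interface you looked up in Lesson 1.5: the model can be trained one instance at a time, without holding the whole training file in memory. The sketch below shows that incremental pattern using Weka's Java API; the file name covtype.small.arff and the assumption that the class is the last attribute are mine, so adjust them to your setup.

   import java.io.File;
   import weka.classifiers.bayes.NaiveBayesUpdateable;
   import weka.core.Instance;
   import weka.core.Instances;
   import weka.core.converters.ArffLoader;

   public class IncrementalNaiveBayes {
       public static void main(String[] args) throws Exception {
           // Stream the ARFF file instead of loading it all at once
           ArffLoader loader = new ArffLoader();
           loader.setFile(new File("covtype.small.arff"));   // file name is an assumption
           Instances structure = loader.getStructure();
           structure.setClassIndex(structure.numAttributes() - 1);   // class assumed last

           NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
           nb.buildClassifier(structure);        // initialise the model from the header alone

           Instance current;
           while ((current = loader.getNextInstance(structure)) != null) {
               nb.updateClassifier(current);     // incremental update, one instance at a time
           }
           System.out.println(nb);
       }
   }

The same idea is what the command-line runs in the warm-up exercise rely on: an updateable classifier can digest a 10,000,000-instance training file instance by instance, whereas cross-validation needs the whole dataset in memory and runs out of space.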