2. Discretization and text classification

Lesson 2.1 Activity: Unsupervised discretization

When analyzing the ionosphere data with J48 in the lesson, Ian performed just one cross-validation run for each experimental condition, which makes comparing results from the various discretization methods rather unreliable.

1. In general, would you expect equal-frequency binning to outperform equal-width binning?
Yes / No

2. In general, would you expect the binary-attribute version (with the makeBinary option) to improve results in each case?
Yes / No

Let's check it out. Create four new versions of the ionosphere data, all of whose attributes are discrete, by applying the unsupervised discretization filter with the four option combinations and the default number of bins (10), and write out each resulting dataset. (A scripted version of this step is sketched after question 10.) Then use the Experimenter (not the Explorer) to evaluate classification accuracy using ten repetitions of 10-fold cross-validation (the default), with J48 and the five datasets (including the original ionosphere.arff).

3. What percentage accuracy do you get using J48 on the original ionosphere dataset? (Here and elsewhere in this course, round percentage accuracies to the nearest integer.)
Answer:

4. What percentage accuracy do you get using J48 on the equal-width-binned version?
Answer:

5. What percentage accuracy do you get using J48 on the equal-frequency-binned version?
Answer:

6. What percentage accuracy do you get using J48 on the equal-width-binned version, with binary attributes?
Answer:

7. What percentage accuracy do you get using J48 on the equal-frequency-binned version, with binary attributes?
Answer:

8. These results contain some small surprises. What is the most striking surprise?
The equal-frequency-binned version outperforms the unfiltered version.
With binary attributes, equal-width bins outperform equal-frequency bins.

9. Using the Experimenter, compare the binary-attribute and non-binary-attribute versions of equal-width binning at the 5% level (Weka's default). (Note: you will have to re-select the row and column in the Analyse panel, and then re-select the test base.) For equal-width binning, is the binary-attribute version significantly better?
Yes / No

10. Similarly, compare the binary-attribute and non-binary-attribute versions of equal-frequency binning at the 5% level. (Note: you will have to re-select the test base.) For equal-frequency binning, is the binary-attribute version significantly better?
Yes / No
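If you prefer to script the creation of the four discretized datasets rather than click through the Explorer's Preprocess panel, here is a minimal sketch using Weka's Java API. It assumes weka.jar is on the classpath and ionosphere.arff is in the working directory; the class name and output file names are just illustrative.

```java
// Sketch only: write out the four unsupervised-discretized copies of ionosphere.arff.
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class MakeDiscretizedVersions {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("ionosphere.arff");

    boolean[] flags = {false, true};
    for (boolean equalFrequency : flags) {
      for (boolean makeBinary : flags) {
        Discretize filter = new Discretize();          // unsupervised version
        filter.setBins(10);                            // default number of bins
        filter.setUseEqualFrequency(equalFrequency);   // equal-width vs. equal-frequency
        filter.setMakeBinary(makeBinary);              // one binary attribute per bin
        filter.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, filter);

        ArffSaver saver = new ArffSaver();
        saver.setInstances(filtered);
        saver.setFile(new File("ionosphere-"
            + (equalFrequency ? "eqfreq" : "eqwidth")
            + (makeBinary ? "-binary" : "") + ".arff"));
        saver.writeBatch();
      }
    }
  }
}
```

Loading the four resulting ARFF files, plus the original ionosphere.arff, into the Experimenter then gives the comparison asked for above.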
Lesson 2.2 Activity: Examining the benefits of cheating

As you know, with supervised discretization you mustn't use the test set to help set discretization boundaries -- that's cheating! Let's compare cross-validation using a pre-discretized dataset (which is cheating, because we'll use supervised discretization) with cross-validation using the FilteredClassifier, which for each fold applies the discretization operation to the training set alone (and thus is not cheating). The effect of cheating is rather small: with 10-fold cross-validation, it's the difference between applying discretization to the 90% training data and applying it to the entire dataset. To see whether a significant effect can be discerned, we'll use a very simple classification method, OneR.

First, discretize the ionosphere dataset using supervised discretization with default parameters. Then set up the Experimenter with two datasets, the original ionosphere dataset and this discretized version, and two classifiers, OneR and the FilteredClassifier configured to use the supervised discretization filter and the OneR classifier. (Use default parameters throughout.) A scripted sketch of this setup appears after question 5.

1. Using OneR on the pre-discretized dataset is cheating. What classification accuracy is obtained?
Answer:

2. Using the FilteredClassifier on the original dataset is not cheating. What classification accuracy is obtained?
Answer:

3. Is the difference significant at the 5% level? (The Experimenter doesn't compare one method on one dataset with the other method on the other dataset. However, in this case, both classifiers will necessarily produce identical results on one of the datasets. Think about it.)
Yes / No

4. Would you expect OneR's performance to improve if you used the binary-attribute version of discretization?
Yes / No

5. Replace OneR with J48. How does the result of "cheating" compare with not cheating?
Cheating is significantly better than not cheating.
Cheating is somewhat better than not cheating.
They are the same.
Cheating is somewhat worse than not cheating.
Cheating is significantly worse than not cheating.
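For reference, here is a rough Java-API equivalent of the two configurations, run as a single 10-fold cross-validation rather than the Experimenter's ten repetitions; treat it as a sketch of the setup, not a substitute for the Experimenter results the questions above ask for. File and class names are illustrative.

```java
// Sketch only: "cheating" (pre-discretized data) vs. "not cheating" (FilteredClassifier).
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class CheatingComparison {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("ionosphere.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // "Cheating": discretize the full dataset first, then cross-validate OneR on it.
    Discretize preDiscretize = new Discretize();   // supervised discretization
    preDiscretize.setInputFormat(data);
    Instances discretized = Filter.useFilter(data, preDiscretize);
    Evaluation cheat = new Evaluation(discretized);
    cheat.crossValidateModel(new OneR(), discretized, 10, new Random(1));

    // "Not cheating": the FilteredClassifier re-discretizes the training folds only.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new Discretize());
    fc.setClassifier(new OneR());
    Evaluation fair = new Evaluation(data);
    fair.crossValidateModel(fc, data, 10, new Random(1));

    System.out.printf("Cheating:     %.1f%%%n", cheat.pctCorrect());
    System.out.printf("Not cheating: %.1f%%%n", fair.pctCorrect());
  }
}
```

The same FilteredClassifier pattern, with setMakeBinary(true) called on the supervised Discretize filter and a different base classifier plugged in, is what the comparisons in the next activity use.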
Lesson 2.3 Activity: Pre-discretization vs. built-in discretization

How good is J48's built-in discretization compared with Weka's supervised Discretize filter with the makeBinary option set (probably the best configuration of all the discretization filters)? Use the Experimenter to compare the two on these datasets: diabetes, glass, ionosphere, iris, and schizo (use default settings throughout).

1. For how many datasets does J48's built-in discretization give better results than the FilteredClassifier?
0 / 2 / 3 / 5

2. For how many datasets does J48's built-in discretization give significantly better results (at the 5% level)?
0 / 1 / 2 / 3

3. For how many datasets does J48's built-in discretization give significantly worse results?
0 / 1 / 2 / 3

Repeat with the classifiers JRip and PART (use default settings throughout), comparing them on the above datasets with and without Weka's supervised Discretize filter with the makeBinary option set. This time, count just the results that are statistically significant (at the 5% level).

4. For how many datasets does JRip's built-in discretization give significantly better results than the FilteredClassifier?
0 / 1 / 2 / 3

5. For how many datasets does JRip's built-in discretization give significantly worse results than the FilteredClassifier?
0 / 1 / 2 / 3

6. For how many datasets does PART's built-in discretization give significantly better results than the FilteredClassifier?
0 / 1 / 2 / 3

7. For how many datasets does PART's built-in discretization give significantly worse results than the FilteredClassifier?
0 / 1 / 2 / 3

The classifiers SMO and SimpleLogistic implement linear decision boundaries in instance space.

8. How would you expect pre-discretization (with makeBinary enabled) to affect their performance?
Make it worse than without discretization.
Make it better than without discretization.

Confirm your intuition (using default settings throughout) by testing the above datasets.

9. For how many datasets does pre-discretization significantly improve SMO's performance?
0 / 1 / 2 / 5

10. For how many datasets does pre-discretization make SMO's performance significantly worse?
0 / 1 / 2 / 3

11. For how many datasets does pre-discretization significantly improve SimpleLogistic's performance?
0 / 2 / 4 / 5

12. For how many datasets does pre-discretization make SimpleLogistic's performance significantly worse?
0 / 1 / 2 / 3

13. How would you expect pre-discretization to affect IBk's performance?
Pre-discretization would improve its performance.
Pre-discretization would make its performance worse.
Its performance would not change significantly.

Confirm your intuition (using default settings throughout) by testing the above datasets.

14. For how many datasets does pre-discretization significantly improve IBk's performance?
0 / 1 / 2 / 5

15. For how many datasets does pre-discretization make IBk's performance significantly worse?
0 / 1 / 3 / 5

Lesson 2.4 Activity: Document classification with Naive Bayes

In the lesson video, Ian ran J48, training on the ReutersCorn-train.arff dataset and testing on ReutersCorn-test.arff, and obtained an overall classification accuracy of 97%: 62% on the 24 corn-related documents and 99% on the 580 non-corn-related ones. Unfortunately, Ian set a very bad example.

1. What's the first thing you should do before starting to evaluate classifiers on a new dataset?
Apply a filter to the data.
Try a simple baseline classifier.
Check the source of the data for missing values.

Using the new datasets ReutersGrain-train and ReutersGrain-test, evaluate the FilteredClassifier with the StringToWordVector filter and the J48 classifier (default parameters throughout).

2. What is the overall classification accuracy on the test set? (As usual, round to the nearest integer.)
Answer:

3. What is the classification accuracy on the 57 grain-related documents in the test set?
Answer:

4. What is the classification accuracy on the 547 non-grain-related documents in the test set?
Answer:

Now repeat this exercise using Naive Bayes as the classifier instead of J48.

5. What is Naive Bayes's overall classification accuracy on the test set?
Answer:

6. What is Naive Bayes's classification accuracy on the 57 grain-related documents in the test set?
Answer:

7. What is Naive Bayes's classification accuracy on the 547 non-grain-related documents in the test set?
Answer:

If you apply the StringToWordVector filter in the Preprocess panel, you will notice that although the attributes all have the values 0 and 1, they are nevertheless defined as "numeric". This causes NaiveBayes to treat them completely differently from nominal attributes (technically, it assumes they are distributed according to a Gaussian distribution). So let's apply the NumericToNominal filter to convert the attributes to nominal, and re-evaluate NaiveBayes. But how can we use the FilteredClassifier with multiple filters? The answer lies in the MultiFilter, which applies several filters successively. Figure out how to do this (the interface is a little weird), checking that you get the same results as before if you configure MultiFilter to use the single StringToWordVector filter. Then add NumericToNominal to convert the attributes to nominal, and re-evaluate the classification accuracies for NaiveBayes. (A Java sketch of the MultiFilter setup appears after question 10.)

8. What is the overall classification accuracy on the test set?
Answer:

9. What is the classification accuracy on the 57 grain-related documents in the test set?
Answer:

10. What is the classification accuracy on the 547 non-grain-related documents in the test set?
Answer:
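If the MultiFilter configuration in the Explorer proves fiddly, the following sketch shows the same chain through the Weka Java API; the equivalent GUI setup is a FilteredClassifier whose filter is a MultiFilter containing StringToWordVector followed by NumericToNominal. File and class names are illustrative.

```java
// Sketch only: MultiFilter chaining StringToWordVector and NumericToNominal
// inside a FilteredClassifier, evaluated on the supplied train/test split.
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class GrainNaiveBayes {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("ReutersGrain-train.arff");
    Instances test  = DataSource.read("ReutersGrain-test.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    // Chain the two filters; MultiFilter applies them in the order given.
    MultiFilter multi = new MultiFilter();
    multi.setFilters(new Filter[] {
        new StringToWordVector(),   // text -> numeric word-presence attributes
        new NumericToNominal()      // numeric 0/1 attributes -> nominal
    });

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(multi);
    fc.setClassifier(new NaiveBayes());
    fc.buildClassifier(train);

    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(fc, test);
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toClassDetailsString());  // per-class accuracy and ROC Area
  }
}
```

Dropping NumericToNominal from the filter array should reproduce the plain StringToWordVector results from questions 5-7, which is a useful check that the chain is set up correctly.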
Lesson 2.5 Activity: Comparing AUCs

You get an ROC curve by plotting the classification accuracy for the first class against (1 – classification accuracy for the second class). Switching the classes effectively reflects the curve about the bottom-left-to-top-right diagonal. In either case the area under the curve should remain the same, which is why the ROC Area is often called simply AUC, for "area under the curve". However, in Weka the diagonally reflected version sometimes produces a slightly different area, because the number of ties differs: very small probabilities ε and probabilities very close to one, 1 – ε, have different floating-point resolution. Weka prints a weighted average of the two ROC Area values in the output panel.

Use the ReutersCorn-train dataset for training, ReutersCorn-test for testing, and the StringToWordVector filter with default parameter settings to determine the following.

1. What is the weighted-average ROC Area for J48?
Answer:

2. What is the weighted-average ROC Area for Naive Bayes (default configuration)?
Answer:

3. What is the weighted-average ROC Area for NaiveBayes with nominal attributes? (Hint: use the NumericToNominal filter as well as the StringToWordVector filter, and combine them using the MultiFilter.)
Answer:

4. Judging by the ROC Area, which of these three methods performs best?
J48
NaiveBayes with nominal attributes
NaiveBayes with numeric attributes

Examine the ROC curve for J48, which you obtain by right-clicking its line in the Result list and selecting "Visualize threshold curve" for class 0.

5. Which part of the graph corresponds to >75% accuracy for the second class?
The leftmost 25% of the horizontal axis.
The rightmost 25% of the horizontal axis.
The upper 25% of the vertical axis.

6. Which of these statements regarding this ROC curve do you agree with?
An accuracy of 75% for the second class can be combined with an accuracy of 75% for the first class.
An accuracy of >75% for the second class can only be achieved with an accuracy of <5% for the first class.
An accuracy of 75% for the second class can be combined with an accuracy of 25% for the first class.

Examine the corresponding ROC curve for NaiveBayes with numeric attributes.

7. Which part of the graph corresponds to >80% accuracy for the first class?
The rightmost 20% of the horizontal axis.
The lower 20% of the vertical axis.
The upper 20% of the vertical axis.

8. Which of these statements regarding this ROC curve is most correct?
An accuracy of 90% for the first class can be combined with an accuracy of 90% for the second class.
An accuracy of 80% for the first class can be combined with an accuracy of 90% for the second class.
An accuracy of 90% for the first class can be combined with an accuracy of 80% for the second class.

Lesson 2.6 Activity: Document classification with Multinomial Naive Bayes

Use the ReutersCorn-train dataset for training, ReutersCorn-test for testing, and the StringToWordVector filter with default parameter settings.

1. What is the weighted-average ROC Area for NaiveBayesMultinomial?
Answer:

In the StringToWordVector filter, set outputWordCounts and lowerCaseTokens to true, and set minTermFreq to 5.

2. What is NaiveBayesMultinomial's weighted-average ROC Area now?
Answer:

It might help to stem words, that is, remove word endings like -s and -ing. Snowball is a good stemming algorithm, so set SnowballStemmer as the stemmer in the StringToWordVector filter. Also, it might help to reduce the number of words kept per class, so change wordsToKeep from 1000 to 800. (A Java sketch of this configuration appears after question 3.)

3. What is NaiveBayesMultinomial's weighted-average ROC Area now?
Answer:
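Here is a rough Java-API sketch of this last configuration (configuration (c) in the Experimenter comparison below), wrapped in a FilteredClassifier and scored by the weighted-average ROC Area. File and class names are illustrative, and depending on your Weka version the SnowballStemmer may need the Snowball stemmers package or jar to be installed before it actually stems anything.

```java
// Sketch only: word counts, lower-casing, minTermFreq 5, Snowball stemming and 800 words,
// with NaiveBayesMultinomial, scored by weighted-average ROC Area.
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.SnowballStemmer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CornMultinomialNB {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("ReutersCorn-train.arff");
    Instances test  = DataSource.read("ReutersCorn-test.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    StringToWordVector stwv = new StringToWordVector();
    stwv.setOutputWordCounts(true);        // counts rather than 0/1 presence
    stwv.setLowerCaseTokens(true);
    stwv.setMinTermFreq(5);
    stwv.setStemmer(new SnowballStemmer()); // may require the Snowball package
    stwv.setWordsToKeep(800);               // down from the default of 1000

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(stwv);
    fc.setClassifier(new NaiveBayesMultinomial());
    fc.buildClassifier(train);

    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(fc, test);
    System.out.printf("Weighted-average ROC Area: %.3f%n", eval.weightedAreaUnderROC());
  }
}
```

Removing the stemmer and wordsToKeep lines gives configuration (b), and a plain new StringToWordVector() gives configuration (a).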
Of course, tiny changes in the ROC Area are probably insignificant, so let's do a proper comparison using the Experimenter. Here, you can't specify training and test sets separately, so we'll just use the ReutersCorn-train and ReutersGrain-train training sets, with a 66% percentage split (which will be a lot quicker than cross-validation). Set up the Experimenter to use these two files with four classifiers: NaiveBayes with default parameters for StringToWordVector, and three instances of NaiveBayesMultinomial with the three parameter settings for StringToWordVector that you tested above: (a) default parameters; (b) outputWordCounts, lowerCaseTokens, and minTermFreq = 5; and (c) these settings plus SnowballStemmer and wordsToKeep = 800. (A sketch of the percentage-split evaluation appears after question 6.)

4. How do these NaiveBayesMultinomial configurations compare with NaiveBayes?
NaiveBayes performs better than all the NaiveBayesMultinomial methods.
Some NaiveBayesMultinomial configurations perform better than NaiveBayes.
All NaiveBayesMultinomial configurations perform significantly better than NaiveBayes.

5. Of all the NaiveBayesMultinomial configurations you have tested, which performed the best?
StringToWordVector with outputWordCounts and lowerCaseTokens set to true, and using the SnowballStemmer.
StringToWordVector with default parameters.
StringToWordVector with outputWordCounts and lowerCaseTokens set to true.

6. Change the Comparison field to Area_under_ROC. What are the significant differences now?
There are no significant differences.
For the Grain dataset, NaiveBayesMultinomial configurations are significantly better than NaiveBayes.
For the Corn dataset, NaiveBayesMultinomial configurations are significantly better than NaiveBayes.
For both datasets, NaiveBayesMultinomial configurations are significantly better than NaiveBayes.
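If you want to check your Experimenter setup against a quick hand-rolled run, the sketch below performs a single 66% percentage split via the Weka Java API (the Experimenter repeats this for each of its runs with different random orderings, so the numbers will not match exactly). The classifier passed in would be any of the four FilteredClassifier configurations described above; file and class names are illustrative.

```java
// Sketch only: one 66% percentage-split evaluation, roughly what the Experimenter
// does on each of its runs.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PercentageSplitSketch {

  // Shuffle, train on the first 66%, evaluate on the remaining 34%.
  static Evaluation splitEvaluate(Classifier clf, Instances data) throws Exception {
    Instances copy = new Instances(data);
    copy.randomize(new Random(1));
    int trainSize = (int) Math.round(copy.numInstances() * 0.66);
    Instances train = new Instances(copy, 0, trainSize);
    Instances test = new Instances(copy, trainSize, copy.numInstances() - trainSize);
    clf.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(clf, test);
    return eval;
  }

  public static void main(String[] args) throws Exception {
    Instances corn = DataSource.read("ReutersCorn-train.arff");
    corn.setClassIndex(corn.numAttributes() - 1);

    // Baseline configuration: NaiveBayes behind a default StringToWordVector.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(new NaiveBayes());

    Evaluation eval = splitEvaluate(fc, corn);
    System.out.printf("Accuracy: %.1f%%  Weighted ROC Area: %.3f%n",
        eval.pctCorrect(), eval.weightedAreaUnderROC());
  }
}
```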