2. Discretization and text classification

Lesson 2.1 Activity: Unsupervised discretization
When analyzing the ionosphere data with J48 in the lesson, Ian
performed just one cross-validation run for each experimental condition,
which makes comparing results from the various discretization methods
rather unreliable.
1. In general, would you expect equal-frequency binning to outperform
equal-width binning?
Yes
No
2. In general, would you expect the binary-attribute version (with the
makeBinary option) to improve results in each case?
Yes
No
Let's check it out. Create four new versions of the ionosphere data, all of
whose attributes are discrete, by applying the unsupervised
discretization filter with the four option combinations and default number
of bins (10), and write out each resulting dataset.
Then use the Experimenter (not the Explorer) to evaluate the
classification accuracy using ten repetitions of 10-fold cross-validation
(the default), with J48 and the five datasets (including the
original ionosphere.arff).
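If you would rather script the dataset preparation than click through the Explorer, the following Java sketch (an illustration only, assuming weka.jar is on the classpath and that the setter names follow Weka's usual option-to-setter convention) writes the four discretized versions; the Experimenter comparison itself is still easiest to set up in the GUI.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class MakeDiscretizedIonosphere {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);
        boolean[] flags = {false, true};
        for (boolean equalFrequency : flags) {
            for (boolean makeBinary : flags) {
                Discretize disc = new Discretize();   // unsupervised filter
                disc.setBins(10);                     // default number of bins
                disc.setUseEqualFrequency(equalFrequency);
                disc.setMakeBinary(makeBinary);       // assumed setter for the makeBinary option
                disc.setInputFormat(data);
                Instances out = Filter.useFilter(data, disc);
                DataSaver.write("ionosphere-"
                        + (equalFrequency ? "eqfreq" : "eqwidth")
                        + (makeBinary ? "-binary" : "") + ".arff", out);
            }
        }
    }
}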
3. What percentage accuracy do you get using J48 on the original
ionosphere dataset? (Here and elsewhere in this course, round
percentage accuracies to the nearest integer.)
Answer:
4. What percentage accuracy do you get using J48 on the equal-width-binned version?
Answer:
5. What percentage accuracy do you get using J48 on the equal-frequency-binned version?
Answer:
6. What percentage accuracy do you get using J48 on the equal-width-binned version, with binary attributes?
Answer:
7. What percentage accuracy do you get using J48 on the equal-frequency-binned version, with binary attributes?
Answer:
8. These results contain some small surprises. What is the most striking
surprise?
The equal-frequency-binned version outperforms the unfiltered
version.
With binary attributes, equal-width bins outperform equal-frequency
bins.
9. Using the Experimenter, compare the binary-attribute and non-binary-attribute versions of equal-width binning at the 5% level (Weka's default).
(Note: you will have to re-select the row and column in the Analyse
panel, and then re-select the test base.)
For equal-width binning, is the binary-attribute version significantly
better?
Yes
No
10. Similarly, compare the binary-attribute and non-binary-attribute
versions of equal-frequency binning at the 5% level. (Note: you will have
to re-select the test base.)
For equal-frequency binning, is the binary-attribute version significantly
better?
Yes
No
Lesson 2.2 Activity: Examining the benefits of cheating
As you know, with supervised discretization you mustn't use the test set
to help set discretization boundaries -- that's cheating! Let's compare
cross-validation using a pre-discretized dataset (which is cheating,
because we'll use supervised discretization) with cross-validation using
the FilteredClassifier, which for each fold applies the discretization
operation to the training set alone (and thus is not cheating). The effect
of cheating is rather small -- if we use 10-fold cross-validation, it's the
difference between applying discretization to the 90% training data and
applying it to the entire dataset. To see whether a significant effect can
be discerned, we'll use a very simple classification method -- OneR.
First, discretize the ionosphere dataset using supervised discretization
with default parameters. Then set up the Experimenter with two datasets,
the original ionosphere dataset and this discretized version, and two
classifiers, OneR and the FilteredClassifier configured to use the
supervised discretization filter and the OneR classifier. (Use default
parameters throughout.)
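For reference, a rough single-run equivalent in Java looks like this (a sketch only, assuming weka.jar on the classpath; it does one 10-fold cross-validation rather than the Experimenter's ten repetitions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class CheatingComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // "Cheating": supervised discretization applied to the whole dataset
        // before cross-validation.
        Discretize disc = new Discretize();
        disc.setInputFormat(data);
        Instances preDiscretized = Filter.useFilter(data, disc);
        Evaluation cheating = new Evaluation(preDiscretized);
        cheating.crossValidateModel(new OneR(), preDiscretized, 10, new Random(1));

        // Not cheating: the FilteredClassifier re-discretizes inside each fold,
        // using only that fold's training data.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new Discretize());
        fc.setClassifier(new OneR());
        Evaluation fair = new Evaluation(data);
        fair.crossValidateModel(fc, data, 10, new Random(1));

        System.out.printf("cheating: %.1f%%   not cheating: %.1f%%%n",
                cheating.pctCorrect(), fair.pctCorrect());
    }
}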
1. Using OneR on the pre-discretized dataset is cheating. What
classification accuracy is obtained?
Answer:
2. Using the FilteredClassifier on the original dataset is not cheating.
What classification accuracy is obtained?
Answer:
3. Is the difference significant at the 5% level? (The Experimenter
doesn't compare one method on one dataset with the other method on
the other dataset. However, in this case, both classifiers will necessarily
produce identical results on one of the datasets. Think about it.)
Yes
No
4. Would you expect OneR's performance to improve if you used the
binary-attribute version of discretization?
Yes
No
5. Replace OneR with J48. How does the result of "cheating" compare to
not cheating?
Cheating is significantly better than not cheating.
Cheating is somewhat better than not cheating.
They are the same.
Cheating is somewhat worse than not cheating.
Cheating is significantly worse than not cheating.
Lesson 2.3 Activity: Pre-discretization vs. built-in
discretization
How good is J48's built-in discretization compared with Weka's
supervised Discretize filter with the makeBinary option set (probably the
best configuration of all the discretization filters)? Use the Experimenter
to compare the two on these datasets: diabetes, glass, ionosphere, iris,
and schizo (use default settings throughout).
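If it helps to see the configuration being compared, here is a minimal sketch of the FilteredClassifier side of the experiment (an illustration, with setMakeBinary assumed to be the setter behind the makeBinary option):

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.supervised.attribute.Discretize;

public class PrediscretizedJ48 {
    // Supervised discretization with binary attributes, wrapped around J48.
    // Plain J48 (the other column of the experiment) relies on its built-in
    // handling of numeric attributes instead.
    public static FilteredClassifier build() {
        Discretize disc = new Discretize();
        disc.setMakeBinary(true);
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(disc);
        fc.setClassifier(new J48());   // swap in JRip or PART for the later questions
        return fc;
    }
}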
1. For how many datasets does J48's built-in discretization give better
results than the FilteredClassifier?
0
2
3
5
2. For how many datasets does J48's built-in discretization
give significantly better results (at the 5% level)?
0
1
2
3
3. For how many datasets does J48's built-in discretization
give significantly worse results?
0
1
2
3
Repeat with the classifiers JRip and PART (use default settings
throughout), comparing them on the above datasets with and without
Weka's supervised Discretize filter with the makeBinary option set. This
time, count just the results that are statistically significant (at the 5%
level).
4. For how many datasets does JRip's built-in discretization
give significantly better results than the FilteredClassifier?
0
1
2
3
5. For how many datasets does JRip's built-in discretization
give significantly worse results than the FilteredClassifier?
0
1
2
3
6. For how many datasets does PART's built-in discretization
give significantly better results than the FilteredClassifier?
0
1
2
3
7. For how many datasets does PART's built-in discretization
give significantly worse results than the FilteredClassifier?
0
1
2
3
The classifiers SMO and SimpleLogistic implement linear decision
boundaries in instance space.
8. How would you expect pre-discretization (with makeBinary enabled)
to affect their performance?
Make it worse than without discretization.
Make it better than without discretization.
Confirm your intuition (using default settings throughout) by testing the
above datasets.
9. For how many datasets does pre-discretization significantly improve
SMO's performance?
0
1
2
5
10. For how many datasets does pre-discretization make SMO's
performance significantly worse?
0
1
2
3
11. For how many datasets does pre-discretization significantly improve
SimpleLogistic's performance?
0
2
4
5
12. For how many datasets does pre-discretization make
SimpleLogistic's performance significantly worse?
0
1
2
3
13. How would you expect pre-discretization to affect IBk's
performance?
Pre-discretization would improve its performance.
Pre-discretization would make its performance worse.
Pre-discretization would not change its performance significantly.
Confirm your intuition (using default settings throughout) by testing the
above datasets.
14. For how many datasets does pre-discretization significantly improve
IBk's performance?
0
1
2
5
15. For how many datasets does pre-discretization make IBk's
performance significantly worse?
0
1
3
5
Lesson 2.4 Activity: Document classification with Naive
Bayes
In the lesson video, Ian ran J48, training on the ReutersCorn-train.arff
dataset and testing on ReutersCorn-test.arff, and obtained an overall
classification accuracy of 97%: 62% on 24 corn-related documents and
99% on 580 non-corn-related ones.
Unfortunately, Ian set a very bad example.
1. What's the first thing you should do before starting to evaluate
classifiers on a new dataset?
Apply a filter to the data.
Try a simple baseline classifier.
Check the source of the data for missing values.
Using the new datasets ReutersGrain-train and ReutersGrain-test,
evaluate the FilteredClassifier with the StringToWordVector filter and the
J48 classifier (default parameters throughout).
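If you prefer to run this outside the Explorer, a minimal Java sketch looks like the following (file names as in the activity, with an assumed .arff extension; the per-class accuracies can be read off the confusion matrix):

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ReutersGrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("ReutersGrain-train.arff");
        Instances test  = DataSource.read("ReutersGrain-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());   // default parameters
        fc.setClassifier(new J48());              // replace with NaiveBayes below
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // per-class counts
    }
}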
2. What is the overall classification accuracy on the test set? (As usual,
round to the nearest integer.)
Answer:
3. What is the classification accuracy on the 57 grain-related documents
in the test set?
Answer:
4. What is the classification accuracy on the 547 non-grain-related
documents in the test set?
Answer:
Now repeat this exercise using Naive Bayes as the classifier instead of
J48.
5. What is Naive Bayes's overall classification accuracy on the test set?
Answer:
6. What is Naive Bayes's classification accuracy on the 57 grain-related
documents in the test set?
Answer:
7. What is Naive Bayes's classification accuracy on the 547 non-grain-related documents in the test set?
Answer:
If you apply the StringToWordVector filter in the Preprocess panel you
will notice that although the attributes all have values 0 and 1, they are
nevertheless defined as "numeric". In fact, this causes NaiveBayes to
treat them completely differently from nominal attributes (technically, it
assumes they follow a Gaussian distribution).
So let's apply the NumericToNominal filter to convert the attributes to
nominal, and re-evaluate NaiveBayes.
But how can we use the FilteredClassifier with multiple filters? The
answer lies in the MultiFilter, which applies several filters successively.
Figure out how to do this (the interface is a little weird), checking that
you get the same results as before if you configure MultiFilter to use the
single StringToWordVector filter. Now add NumericToNominal to convert
the attributes to nominal.
Re-evaluate the classification accuracies for NaiveBayes.
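As a cross-check, the equivalent configuration in code is just a MultiFilter whose filters are applied in the order listed (a sketch, assuming the standard Weka class names):

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class MultiFilterNaiveBayes {
    public static FilteredClassifier build() {
        MultiFilter mf = new MultiFilter();
        mf.setFilters(new Filter[] {
                new StringToWordVector(),   // applied first: words -> numeric attributes
                new NumericToNominal()      // then: numeric word attributes -> nominal
        });
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(mf);
        fc.setClassifier(new NaiveBayes());
        return fc;
    }
}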
8. What is the overall classification accuracy on the test set?
Answer:
9. What is the classification accuracy on the 57 grain-related documents
in the test set?
Answer:
10. What is the classification accuracy on the 547 non-grain-related
documents in the test set?
Answer:
Lesson 2.5 Activity: Comparing AUCs
You get an ROC curve by plotting the classification accuracy for the first
class against (1 – classification accuracy for the second class).
Switching the classes effectively reflects the curve with respect to a
bottom-left-to-top-right diagonal line. In either case the area under the
curve should remain the same, and thus the ROC Area is often called
simply AUC, for "area under curve". However, in Weka the diagonally
reflected version sometimes produces a slightly different area, owing to a
differing number of ties: floating-point arithmetic resolves very small
probabilities (ε) more finely than probabilities very close to one (1 − ε).
Weka prints in the output panel a weighted average of the two ROC
Area values.
Use the ReutersCorn-train dataset for training, ReutersCorn-test for
testing, and the StringToWordVector filter with default parameter
settings to determine the following:
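If you script the evaluation rather than reading the Explorer output, Weka's Evaluation class exposes both the per-class and the weighted-average ROC Area (a sketch; eval is a completed train/test Evaluation such as the one built in the previous activity):

import weka.classifiers.Evaluation;

public class RocAreaReport {
    public static void report(Evaluation eval) {
        System.out.printf("ROC Area, first class:  %.3f%n", eval.areaUnderROC(0));
        System.out.printf("ROC Area, second class: %.3f%n", eval.areaUnderROC(1));
        System.out.printf("weighted average:       %.3f%n", eval.weightedAreaUnderROC());
    }
}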
1. What is the weighted-average ROC Area for J48?
Answer:
2. What is the weighted-average ROC Area for Naive Bayes (default
configuration)?
Answer:
3. What is the weighted-average ROC Area for NaiveBayes with nominal
attributes? (Hint: use the NumericToNominal filter as well as the
StringToWordVector filter, and combine them using the MultiFilter.)
Answer:
4. Judging by the ROC Area, which of these three methods performs
best?
J48
NaiveBayes with nominal attributes
NaiveBayes with numeric attributes
Examine the ROC curve for J48, which you obtain by right-clicking its
line in the Result list and selecting "Visualize threshold curve" for class 0.
5. Which part of the graph corresponds to >75% accuracy for the second
class?
The leftmost 25% of the horizontal axis.
The rightmost 25% of the horizontal axis.
The upper 25% of the vertical axis.
6. Which of these statements regarding this ROC curve do you agree
with?
An accuracy of 75% for the second class can be combined with an
accuracy of 75% for the first class.
An accuracy of >75% for the second class can only be achieved with
an accuracy of <5% for the first class.
An accuracy of 75% for the second class can be combined with an
accuracy of 25% for the first class.
Examine the corresponding ROC curve for NaiveBayes with numeric
attributes.
7. Which part of the graph corresponds to a >80% accuracy for the first
class?
The rightmost 20% of the horizontal axis.
The lower 20% of the vertical axis.
The upper 20% of the vertical axis.
8. Which of these statements regarding this ROC curve is most correct?
An accuracy of 90% for the first class can be combined with an
accuracy of 90% on the second class.
An accuracy of 80% for the first class can be combined with an
accuracy of 90% for the second class.
An accuracy of 90% for the first class can be combined with an
accuracy of 80% for the second class.
Lesson 2.6 Activity: Document classification with
Multinomial Naive Bayes
Use the ReutersCorn-train dataset for training, ReutersCorn-test for
testing, and the StringToWordVector filter with default parameter
settings.
1. What is the weighted-average ROC Area for NaiveBayesMultinomial?
Answer:
In the StringToWordVector filter, set outputWordCounts and
lowerCaseTokens to true; set minTermFreq to 5.
2. What is NaiveBayesMultinomial's weighted-average ROC Area now?
Answer:
It might help to stem words, that is, remove word endings like -s and -ing.
Snowball is a good stemming algorithm, so set this as the stemmer in
the StringToWordVector filter. Also, it might help to reduce the number
of words kept per class, so change wordsToKeep from 1000 to 800.
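Pulled together, the configuration described in this activity looks roughly like this in code (a sketch assuming the standard setter names; the SnowballStemmer requires the Snowball stemmer library to be available to Weka):

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.stemmers.SnowballStemmer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TunedMultinomialNB {
    public static FilteredClassifier build() {
        StringToWordVector stwv = new StringToWordVector();
        stwv.setOutputWordCounts(true);          // word counts instead of 0/1 presence
        stwv.setLowerCaseTokens(true);
        stwv.setMinTermFreq(5);
        stwv.setStemmer(new SnowballStemmer());  // strip endings like -s and -ing
        stwv.setWordsToKeep(800);                // down from the default 1000

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(stwv);
        fc.setClassifier(new NaiveBayesMultinomial());
        return fc;
    }
}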
3. What is NaiveBayesMultinomial's weighted-average ROC Area now?
Answer:
Of course, tiny changes in the ROC Area are probably insignificant, so
let's do a proper comparison using the Experimenter. Here, you can't
specify training and test sets separately, so we'll just use
the ReutersCorn-train and ReutersGrain-train training sets, with a 66%
percentage split (which will be a lot quicker than cross-validation). Set up
the Experimenter to use these two files with four classifiers: NaiveBayes
with default parameters for StringToWordVector, and three instances of
NaiveBayesMultinomial with the three parameter settings for
StringToWordVector that you tested above: (a) default parameters; (b)
outputWordCounts, lowerCaseTokens and minTermFreq = 5; and (c)
these settings plus SnowballStemmer and wordsToKeep = 800.
4. How do these NaiveBayesMultinomial configurations compare with
NaiveBayes?
NaiveBayes performs better than all the NaiveBayesMultinomial
methods.
Some NaiveBayesMultinomial configurations perform better than
NaiveBayes.
All NaiveBayesMultinomial configurations perform significantly better
than NaiveBayes.
5. Of all the NaiveBayesMultinomial configurations you have tested,
which performed the best?
StringToWordVector with outputWordCounts and lowerCaseTokens
set to true, and using the SnowballStemmer.
StringToWordVector with default parameters.
StringToWordVector with outputWordCounts and lowerCaseTokens set to
true.
6. Change the Comparison field to Area_under_ROC. What are the
significant differences now?
There are no significant differences.
For the Grain dataset, NaiveBayesMultinomial configurations are
significantly better than NaiveBayes.
For the Corn dataset, NaiveBayesMultinomial configurations are
significantly better than NaiveBayes.
For both datasets, NaiveBayesMultinomial configurations are
significantly better than NaiveBayes.