Machine Learning for Language Technology (2015) – DRAFT July 2015
Lecture 04: LAB Assignment
Weka: Decision Trees (2): Feature
Selection and Reduction
Acknowledgements: This lab assignment is based on the content of the Weka book.
Some tasks have been borrowed and adapted from Martin D. Sykora's tutorials
(<http://homepages.lboro.ac.uk/~comds2/coc131/>).
Required Reading for this Lab Assignment
- Daumé III (2014): 16-23; 51-58.
- Witten et al. (2011): 487-494; 567; 575-577.
NB: the datasets can be downloaded from here:
<http://stp.lingfil.uu.se/~santinim/ml/2015/datasets/>
Free material:
Free Weka book (2005), 2nd edition:
<http://home.etf.rs/~vm/os/dmsw/Morgan.Kaufman.Publishers.Weka.2nd.Edition.2005.Elsevier.pdf>
Optional tutorials
- Interactive Decision Tree Construction (Witten et al. (2011): 569-571)
- Visualizing Decision Trees and Rule Sets (Witten et al. (2011): 573-574)
- Document Classification with J48 (Witten et al. (2011): 578-582)
Learning objectives
In this lab assignment you are going to:
- explore if and how feature selection can improve performance;
- apply feature selection and analyze its effects;
- use a decision tree classifier on two different datasets.
Preliminaries
In this lab assignment you are going to once more use a decision tree classifier, as
implemented in J48, and explore whether and how feature selection can improve its
performance. The purpose of feature selection is to select a subset of the most relevant
features for building robust classifiers. This is usually done by keeping the features that
discriminate best between the classes in the dataset, while at the same time removing
features that are redundant (the Minimum-Redundancy-Maximum-Relevance principle).
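The principle can be sketched in a few lines of Python. This is an illustrative toy, not the algorithm Weka's AttributeSelection filter actually uses; the function names, the use of Pearson correlation as the relevance/redundancy measure, and the 0.9 redundancy threshold are our own choices:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(features, labels, redundancy_threshold=0.9):
    """Greedy max-relevance / min-redundancy selection (toy sketch).

    features: dict mapping feature name -> list of numeric values
    labels:   list of numeric class codes, same length as each feature
    """
    # Relevance: rank features by |correlation with the class|.
    ranked = sorted(features,
                    key=lambda f: abs(pearson(features[f], labels)),
                    reverse=True)
    selected = []
    for f in ranked:
        # Redundancy: skip a feature that is nearly a copy of one already chosen.
        if all(abs(pearson(features[f], features[g])) < redundancy_threshold
               for g in selected):
            selected.append(f)
    return selected

# Toy data: f2 is roughly 2 * f1 (redundant); f3 is uninformative noise.
feats = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [2.1, 4.0, 6.2, 8.1, 10.0, 12.2],
    "f3": [0.3, -0.2, 0.1, 0.0, -0.1, 0.2],
}
labels = [0, 0, 0, 1, 1, 1]
print(select_features(feats, labels))   # → ['f1', 'f3']
```

Note that the redundant copy f2 is dropped even though it is highly relevant on its own, which is exactly the effect you should look for in the scatter plots below.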
Tasks
G tasks: please provide comprehensive answers to all the questions below:
(1) Start Weka, launch the Explorer window and select the "Preprocess" tab. Open
the iris dataset. Under Filter, choose the AttributeSelection filter. What does it
do? Are the attributes it selects the same as the ones you would manually choose
as most discriminatory?
(2) Select the Visualize tab. This shows you 2D scatter plots of each attribute against
each other attribute. Make sure the drop-down box at the bottom says Color:
class (Nom). Pay close attention to the plots between attributes you think
discriminate best between classes, and the plots between attributes selected by
the AttributeSelection filter. Can you verify from these plots whether your
thoughts and the AttributeSelection filter are correct? Which attributes are
correlated? Generally speaking, redundant attributes can be identified when some
features are highly correlated with each other. In our case, which attribute(s) do you think can be
removed without harming the classification?
(3) Select the Classify tab to get into the Classification perspective of Weka. Click on
Choose and select J48. In Test options, select 10-fold cross-validation and hit
Start. Report the classification accuracy (Correctly Classified Instances) for: the
full dataset and a version of the dataset containing only the features that you
think are most discriminative according to your exploration above. What are
your conclusions?
(4) Open the past tense dataset, perform exactly the same tasks, and answer all
the questions listed above for the iris dataset. Compare and
discuss the behaviors on the two different datasets.
(5) In the past tense dataset, analyze the informativeness of the different features
using information gain and gain ratio. To do this, choose Select attributes and
then choose InfoGainAttributeEval as the Attribute Evaluator (which also
requires you to choose Ranker as the Search Method), and likewise
GainRatioAttributeEval. This will give you a ranking of the features in terms of
information gain and gain ratio, respectively. Select the
Classify tab to get into the Classification perspective of Weka. Click on Choose
and select J48. In Test options, select 10-fold cross-validation and hit Start.
Report the classification accuracy (Correctly Classified Instances) for: (a) the
full dataset; (b) a version of the dataset containing only the features selected by
information gain (c) a version of the dataset containing the features selected by
gain ratio. Compare and briefly discuss your results.
(6) Theoretical question: what is the difference between information gain and gain
ratio? Give a reasoned example.
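As a concrete illustration of the quantities involved in question (6), here is a toy Python sketch. The helper names are ours, and J48's actual computation includes refinements (e.g. handling missing values and numeric splits) that this sketch omits:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy reduction obtained by splitting the data on an attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Info gain normalised by the split info (entropy of the split itself)."""
    split_info = entropy(values)   # how finely the attribute fragments the data
    return info_gain(values, labels) / split_info if split_info else 0.0

labels  = ["yes", "yes", "no", "no"]
useful  = ["a", "a", "b", "b"]    # two values, perfectly predictive
id_like = ["1", "2", "3", "4"]    # unique per instance, also "perfectly" predictive

print(info_gain(useful, labels), gain_ratio(useful, labels))    # 1.0 1.0
print(info_gain(id_like, labels), gain_ratio(id_like, labels))  # 1.0 0.5
```

Both attributes have maximal information gain, but the id-like attribute achieves it only by fragmenting the data into singleton branches; its large split info halves its gain ratio. This is the bias that gain ratio is designed to correct.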
VG tasks: please provide comprehensive answers to all the questions below:
(7) In the past tense dataset, compare the training error (Test Option: Training set)
to test error (Test option: Cross-validation), and see whether there are signs of
overfitting. Also compare tree induction with and without pruning (click on the
options next to the Classifier choice to switch the parameter unpruned from False
to True) and see how this affects the size of the tree as well as the relation
between training and test error. Describe and comment on your results.
(8) Theoretical question: define in your own words the concept of “inductive bias”
and provide a reasoned example.
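The train-versus-test gap of question (7) can also be seen outside Weka with a toy Python sketch. Here a nearest-neighbour memoriser stands in for a fully unpruned tree and a fixed threshold rule for a heavily pruned one; the names, the 20% noise rate, and the data generator are our own choices, not anything from the lab datasets:

```python
import random

random.seed(0)

def make_data(n):
    """One numeric feature; true rule: label = (x > 0.5), with 20% label noise."""
    data = []
    for _ in range(n):
        x = random.random()
        y = (x > 0.5) != (random.random() < 0.2)   # flip 20% of the labels
        data.append((x, y))
    return data

train, test = make_data(200), make_data(200)

def nn_predict(x):
    """'Unpruned' extreme: memorise the training set, copy the nearest label."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def rule_predict(x):
    """'Pruned' extreme: the simple threshold rule, ignoring the noise."""
    return x > 0.5

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

for name, predict in [("memoriser", nn_predict), ("simple rule", rule_predict)]:
    print(name, "train:", accuracy(predict, train),
                "test:",  accuracy(predict, test))
```

The memoriser scores perfectly on its own training data (each point is its own nearest neighbour) but drops noticeably on fresh test data, while the simple rule scores about the same on both: the signature of overfitting you should look for when comparing unpruned and pruned J48 trees.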
To be submitted
A one-page written report containing the reasoned answers to the questions above and
a short section where you summarize your reflections and experience. Submit the report
to santinim@stp.lingfil.uu.se no later than 22 Nov 2015.