Machine Learning for Language Technology (2015) – DRAFT, July 2015
Lecture 03: Lab Assignment
Weka: Decision Trees (1): Reading the Output
Acknowledgements: This lab assignment is based on the content of the Weka book.
Tasks have been borrowed from Martin D. Sykora's tutorials
(<http://homepages.lboro.ac.uk/~comds2/COC131/>).
Required Reading for this Lab Assignment
- Daume III (2014): 10-16
- Witten et al. (2011): Ch 17: 562-565
NB: the datasets can be downloaded from here:
<http://stp.lingfil.uu.se/~santinim/ml/2015/datasets/>
Free material:
Free Weka book (2005), 2nd edition:
<http://home.etf.rs/~vm/os/dmsw/Morgan.Kaufman.Publishers.Weka.2nd.Edition.2005.Elsevier.pdf>
Additional reading (optional):
Witten et al. (2011):
- Section 4.3: Divide and Conquer: Constructing Decision Trees;
- Section 4.4: Covering Algorithms: Constructing Rules.
Learning objectives
In this lab assignment you are going to:
- experience supervised machine learning classification;
- use a Weka implementation of the decision tree classifier called J48;
- use a decision tree classifier on two different datasets;
- familiarize yourself with the presentation of the results in Weka.
Tasks
G tasks: please provide comprehensive answers to all the questions below:
(1) Start Weka, launch the Explorer window and select the "Preprocess" tab. Open the
iris dataset. Select the "Classify" tab. Under "Classifier", select J48. What main
parameters can be specified for this classifier?
(2) Under "Test options", select "Cross-validation" and, under "More options", check
"Output predictions". Click "Start" to train the model. You should see a stream of output
appear in the window named Classifier output. What do each of the following sections
tell you about the model?
(a) "Predictions on ..."
(b) "Summary"
(c) "Detailed accuracy by class"
(d) "Confusion matrix"
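As background for the cross-validation option selected above: Weka repeatedly trains on k-1 folds and evaluates on the held-out fold, then averages the results. A minimal pure-Python sketch of the idea (unstratified, unlike Weka's default, and with illustrative function names that are not part of Weka's API):

```python
# Sketch of k-fold cross-validation (plain, not stratified as in Weka).
# train_fn and predict_fn are placeholders for any classifier.

def k_fold_indices(n_instances, k=10):
    """Split instance indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_instances):
        folds[i % k].append(i)
    return folds

def cross_validate(instances, labels, train_fn, predict_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, average accuracy."""
    folds = k_fold_indices(len(instances), k)
    accuracies = []
    for test_fold in folds:
        test_set = set(test_fold)
        train_idx = [i for i in range(len(instances)) if i not in test_set]
        model = train_fn([instances[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(
            predict_fn(model, instances[i]) == labels[i] for i in test_fold
        )
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k
```

Every instance is used for testing exactly once, which is why the "Predictions on ..." section lists all instances.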
(3) Go to the graphical representation of the decision tree; it can be displayed in a
pop-up tree visualizer. What is the feature under the root node, that is, the most
discriminative feature?
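To see why the root node holds the most discriminative feature: J48 implements C4.5, which actually selects splits by gain ratio and handles numeric attributes with binary splits. As a simplified illustration only, here is a sketch of plain information gain for nominal features (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a class distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting on a nominal feature.

    The feature with the highest gain becomes the root of the tree.
    """
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

A feature whose values separate the classes perfectly has gain equal to the full entropy of the labels; a feature unrelated to the classes has gain close to zero.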
(4) Once you have finished with the iris dataset, repeat the same steps for the English
past tense dataset. What is the performance (accuracy, P/R, and F-measure) of the
decision tree classifier on this dataset? Try to explain why you get this performance
on the past tense dataset (suggestion: look at the distribution of the classes and analyse
the confusion matrix).
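All the figures asked for in (4) can be derived from the confusion matrix. A small sketch of the standard formulas, assuming Weka's convention that rows are true classes and columns are predicted classes (function names are illustrative):

```python
def per_class_metrics(confusion, class_index):
    """Precision, recall, F-measure for one class from a confusion matrix.

    confusion[i][j] = number of instances of true class i predicted as class j.
    """
    tp = confusion[class_index][class_index]
    predicted = sum(row[class_index] for row in confusion)   # column sum
    actual = sum(confusion[class_index])                     # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def accuracy(confusion):
    """Fraction of correctly classified instances (diagonal over total)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total
```

Checking a few cells by hand against Weka's "Detailed accuracy by class" section is a good way to verify you are reading the matrix correctly.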
(5) Theoretical question: what is a loss function? Give an informal definition and
example(s).
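As a starting point for (5): informally, a loss function assigns a cost to a prediction given the true answer, and learning amounts to minimizing the average loss over the data. A minimal sketch with two common examples (illustrative, not tied to Weka):

```python
def zero_one_loss(y_true, y_pred):
    """Classification: 0 if the prediction is right, 1 if wrong."""
    return 0 if y_true == y_pred else 1

def squared_loss(y_true, y_pred):
    """Regression: penalizes large errors quadratically."""
    return (y_true - y_pred) ** 2

def average_loss(loss_fn, truths, predictions):
    """Empirical risk: the mean loss over a dataset."""
    return sum(loss_fn(t, p) for t, p in zip(truths, predictions)) / len(truths)
```

Note that accuracy is simply one minus the average zero-one loss.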
VG tasks: please provide comprehensive answers to all the questions below:
(6) Under "Result list" you should see the model that is created at each run. Right-click
on the model created for the iris dataset and select "Visualize classifier errors". Points marked
with a square are errors, i.e. incorrectly classified instances. How do you think the
classifier performed? Once you have finished with the iris dataset, repeat the same
action with the English past tense dataset. How do you think the classifier performed
on this larger dataset?
(7) Analyse the graphical representation of the decision trees of both the iris dataset
and the English past tense dataset. What can you notice? Describe what you see and
interpret the trees.
To be submitted
A one-page written report containing the reasoned answers to the questions above and
a short section where you summarize your reflections and experience. Submit the report
to santinim@stp.lingfil.uu.se no later than 22 Nov 2015.