Machine Learning for Language Technology (2015) – DRAFT July 2015

Lecture 04: Lab Assignment
Weka: Decision Trees (2): Feature Selection and Reduction

Acknowledgements: this lab assignment is based on the content of the Weka book. Some tasks have been borrowed and adapted from Martin D. Sykora's tutorials (<http://homepages.lboro.ac.uk/~comds2/COC131/>).

Required Reading for this Lab Assignment
Daume III (2014): 16-23; 51-58.
Witten et al. (2011): 487-494; 567; 575-577.

NB: the datasets can be downloaded from here:
<http://stp.lingfil.uu.se/~santinim/ml/2015/datasets/>

Free material
Free Weka book (2005), 2nd edition:
<http://home.etf.rs/~vm/os/dmsw/Morgan.Kaufman.Publishers.Weka.2nd.Edition.2005.Elsevier.pdf>

Optional tutorials
- Interactive Decision Tree Construction (Witten et al. (2011): 569-571)
- Visualizing Decision Trees and Rule Sets (Witten et al. (2011): 573-574)
- Document Classification with J48 (Witten et al. (2011): 578-582)

Learning objectives
In this lab assignment you are going to:
- explore if and how feature selection can improve classification performance;
- apply feature selection and analyze its effects;
- use a decision tree classifier on two different datasets.

Preliminaries
In this lab assignment you are once more going to use a decision tree classifier, as implemented in J48, and explore if and how feature selection can improve its performance. The purpose of feature selection is to select a subset of the most relevant features for building robust classifiers. This is usually done by keeping the features that discriminate best between the classes in the dataset while removing features that are redundant (the Minimum-Redundancy-Maximum-Relevance principle).
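Side note (not required for the assignment): all the tasks below are carried out in the Explorer GUI, but if you want to double-check your numbers you can script the same steps against Weka's Java API. The sketch below is only illustrative; the file name iris.arff is a placeholder for wherever you saved the dataset, and the class attribute is assumed to be the last one.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class FeatureSelectionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at the iris (or past tense) ARFF file.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1); // class is assumed to be the last attribute

        // Rank attributes by information gain, as in the Explorer's
        // "Select attributes" tab with InfoGainAttributeEval + Ranker.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());

        // Evaluate J48 with 10-fold cross-validation on the full dataset.
        J48 tree = new J48();
        // tree.setUnpruned(true); // uncomment to grow an unpruned tree (cf. the VG task on pruning)
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Running the sketch prints the information-gain ranking and an evaluation summary that includes the same "Correctly Classified Instances" figure reported by the Explorer.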
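Background for the tasks on information gain and gain ratio below: assuming the standard C4.5-style definitions (which is what Weka's InfoGainAttributeEval and GainRatioAttributeEval compute), for a dataset S with class proportions p_c and an attribute A that splits S into subsets S_v:

H(S) = -\sum_{c} p_c \log_2 p_c
IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}

These formulas are only a starting point: the theoretical question asks you to reason in your own words about what dividing by SplitInfo changes.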
Tasks

G tasks: please provide comprehensive answers to all the questions below.

(1) Start Weka, launch the Explorer window and select the Preprocess tab. Open the iris dataset. Under Filter, choose the AttributeSelection filter. What does it do? Are the attributes it selects the same as the ones you would manually choose as the most discriminative?

(2) Select the Visualize tab. This shows you 2D scatter plots of each attribute against each other attribute. Make sure the drop-down box at the bottom says Color: class (Nom). Pay close attention to the plots between the attributes you think discriminate best between the classes, and to the plots between the attributes selected by the AttributeSelection filter. Can you verify from these plots whether your intuitions and the AttributeSelection filter are correct? Which attributes are correlated? Generally speaking, we can identify redundant attributes when features are highly correlated. In our case, which attribute(s) do you think can be removed without harming the classification?

(3) Select the Classify tab to get into the Classification perspective of Weka. Click on Choose and select J48. In Test options, select 10-fold cross-validation and hit Start. Report the classification accuracy (Correctly Classified Instances) for the full dataset and for a version of the dataset containing only the features that you think are most discriminative according to your exploration above. What are your conclusions?

(4) Open the past tense dataset, perform exactly the same tasks, and answer all the questions listed above for the iris dataset. Compare and discuss the classifier's behavior on the two different datasets.

(5) In the past tense dataset, analyze the informativeness of the different features using information gain and gain ratio. To do this, choose Select attributes and then choose InfoGainAttributeEval as the Attribute Evaluator (which also requires you to choose Ranker as the Search Method); repeat with GainRatioAttributeEval. This will give you a ranking of the features in terms of information gain and gain ratio, respectively. Then select the Classify tab to get into the Classification perspective of Weka. Click on Choose and select J48. In Test options, select 10-fold cross-validation and hit Start. Report the classification accuracy (Correctly Classified Instances) for: (a) the full dataset; (b) a version of the dataset containing only the features selected by information gain; (c) a version of the dataset containing only the features selected by gain ratio. Compare and briefly discuss your results.

(6) Theoretical question: what is the difference between information gain and gain ratio? Give a reasoned example.

VG tasks: please provide comprehensive answers to all the questions below.

(7) In the past tense dataset, compare the training error (Test options: Use training set) to the test error (Test options: Cross-validation), and see whether there are signs of overfitting. Also compare tree induction with and without pruning (click on the classifier's name next to the Choose button to open its options and switch the parameter unpruned from False to True) and see how this affects the size of the tree as well as the relation between training and test error. Describe and comment on your results.

(8) Theoretical question: define in your own words the concept of "inductive bias" and provide a reasoned example.

To be submitted
A one-page written report containing the reasoned answers to the questions above and a short section where you summarize your reflections and experience. Submit the report to santinim@stp.lingfil.uu.se no later than 22 Nov 2015.