CS412 - CS512 Machine Learning Fall 2015-2016
Homework 1
50pts - Due Oct 16, 2015 @ 10pm (Late accepted until Oct 18, 2015 @ 10pm with 5pts off each day)

Task: For this homework, you will implement an isolated leaf recognition system using WEKA as a real-world problem. You are provided with features extracted from five different types of leaf images: Oak (Quercus), Gingko, Service tree (Sorbus), Elm (Ulmus), and Hedera. You are supposed to train two classifiers (see below), classify the test data into one of the five classes, and interpret the results.

Note: If you have installed WEKA, this homework can be done in 30-40 minutes. If not, please follow the wekatutorial slides under Lectures.

Left-to-right: Oak (Quercus), Gingko, Service tree (Sorbus), Elm (Ulmus) and Hedera leaves

Background: Plant taxonomy is a highly laborious task consisting of the scientific classification of our planet’s flora. Botanists exploit all available plant characteristics, such as flowers, seeds, and leaves, to identify a plant. However, the vast majority of the approaches proposed so far have focused exclusively on leaf-based plant identification, both to limit the problem’s complexity and to increase discriminability through the leaves' color, shape, and texture features. On the other hand, the recognition process can be tiresome and take a long time, especially for a brand new plant. Consequently, there is a significant need for an automated plant identification system that, when provided with raw visual plant data, will extract a number of descriptive features and use them to determine and output the corresponding plant species.

Data: The data for this homework consists of two separate arff files, a train set and a test set, containing 852 and 210 samples respectively, one sample per row. Pre-computed attributes are extracted from scanned leaf images using deep convolutional neural networks. These high-level features have a dimension of 1024, placed column-wise.
Therefore, each row represents a new sample point while each column shows the values of a specific feature. Also, the last column in the train set indicates the labels for the leaf species. In summary, the arff train file follows the pattern below, whereas the test file lacks the Label column: 1024D features, followed by the label. In other words:

F1 F2 … F1024 label

Classifiers to try:

1. 15pts - First choose ZeroR as the classifier and use the supplied test set for testing. Explain the resulting accuracy and confusion matrix in 1-2 lines each, using correct/precise terminology. E.g. we don’t just say “performance is good”, we talk about error or accuracy as … and to make sure we are understood clearly, in scientific writing, when something may be misunderstood, we add a redundant explanation like: “in other words, …”

5pts - Accuracy: …………………………………………………………………………………………………………………………………………………..
……………………………………………………………………………………………………………………………………………………………………..

10pts - Confusion matrix: ……………………. …………... is classified as …………………..; in other words, …………………… plant.

Note: This is basically the simplest possible classifier: it returns the majority class as the answer to everything.

2. 35pts - Try the decision tree classifier J48 under trees, with the default parameters.

5pts - What is the accuracy with 10-fold cross validation? …………….……%

5pts - What is the accuracy with the supplied test set? ………………………..……%

5pts - What errors occurred on the test set, and how many?
……………………………………………………………………………………………………………………………………………………………………..

10pts - Which accuracy (test or cross-val.) do you expect to reflect the generalization accuracy better? Explain in one line.
……………………………………………………………………………………………………………………………………………………………………..

10pts - Feature 890 is used at the root of the tree. Include a picture (a cropped screenshot) showing how the classes are distributed with respect to this feature, and comment on it (which class is well split by this feature, and at what threshold).
……………………………………………………………………………………………………………………………………………………………..
………………………………………………………………………………………………………………………………………………………………

Hint: Choose that feature under Attributes (bottom-left window) and see the resulting figure on the bottom-right.
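
Example: The majority-class behaviour described in the note on ZeroR can be sketched in a few lines of Python. This is only a minimal sketch and not WEKA's implementation; the class names A, B, C, the toy label lists, and the helper names fit_majority, confusion_matrix, and accuracy are all made up for illustration and have nothing to do with the homework data.

```python
from collections import Counter


def fit_majority(train_labels):
    """Return the most frequent label in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def confusion_matrix(actual, predicted, classes):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(actual, predicted):
    """Fraction of samples whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical toy labels (assumption: NOT the leaf data set).
classes = ["A", "B", "C"]
train_labels = ["A", "A", "A", "B", "B", "C"]
test_labels = ["A", "B", "B", "C"]

# A majority-class baseline ignores the features entirely.
majority = fit_majority(train_labels)
predictions = [majority] * len(test_labels)

matrix = confusion_matrix(test_labels, predictions, classes)
acc = accuracy(test_labels, predictions)

print("majority class:", majority)
for name, row in zip(classes, matrix):
    print(name, row)
print("accuracy:", acc)
```

In this toy run all four test labels are predicted as A, so only the first column of the matrix is non-zero and one of the four predictions is correct.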