Summary of HW Assigned

    Assignment                                         Assigned             Due
    HW0: Term paper proposal, first draft              4 Sep 2013           13 Sep 2013
    HW1: WEKA features                                 6 Sep 2013, Friday   13 Sep 2013
    HW2: Decision Tree: Play Tennis, Paper & Pencil    9 Sep 2013, Monday   16 Sep 2013
    HW3: MATLAB Program on IRIS Data, Decision Trees   11 Sep 2013          23 Sep 2013

Classification: Example
• IRIS DATA
• Classification using Decision Trees
• MATLAB Statistical Package

Where to Get Data?
• The UC Irvine data sets: http://archive.ics.uci.edu/ml/datasets/Iris
• This is perhaps the best-known database in the pattern recognition literature.
• Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.)
• The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
• One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Fisher's IRIS Data Set
• 150 data items that contain measurements of iris flowers from 3 species: Setosa (50), Versicolor (50), and Virginica (50).
• The data include information about four features of the flowers: sepal length, sepal width, petal length, petal width.

Attribute Information
Col. 1. sepal length in cm
     2. sepal width in cm
     3. petal length in cm
     4. petal width in cm
     5. class: Iris Setosa (Class 1), Iris Versicolour (Class 2), Iris Virginica (Class 3)

Flower Parts
[Figure: iris flower parts (sepals and petals)]

Load Iris Data into MATLAB
• Visit the UCI ML web site.
• Download the iris data into a file and name it "fisheriris".
• Study the data.
• See how the sepal measurements differ between species. You can use the two columns containing sepal measurements.

MATLAB Instruction: gscatter
• Scatter plot by group: h = gscatter(...)
• gscatter(x,y,group) creates a scatter plot of x and y, grouped by group. x and y are vectors of the same size; group is a grouping variable in the form of a categorical variable, vector, string array, or cell array of strings.
• gscatter(x,y,group,clr,sym,siz) specifies the color, marker type, and size for each group. clr is a string array of colors recognized by the plot function; the default for clr is 'bgrcmyk'. sym is a string array of symbols recognized by the plot command, with the default value '.'. siz is a vector of sizes, with the default determined by the 'DefaultLineMarkerSize' property. If you do not specify enough values for all groups, gscatter cycles through the specified values as needed.
• gscatter(x,y,group,clr,sym,siz,doleg) controls whether a legend is displayed on the graph (doleg is 'on', the default) or not (doleg is 'off').

MATLAB: Loading Data & Scatter Plotting

    load fisheriris
    gscatter(meas(:,1), meas(:,2), species, 'rgb', 'osd');
    xlabel('Sepal length');
    ylabel('Sepal width');
    N = size(meas,1);

NOTE: 'rgb' = red, green, blue; 'osd' = circle, square, diamond.

Resulting Scatter Plot
[Figure: scatter plot of sepal width vs. sepal length, grouped by species]

For Home Work #3
For training data, use Fisher's sepal measurements for iris versicolor and virginica (try this at home):

    load fisheriris
    SL = meas(51:end,1);
    SW = meas(51:end,2);
    group = species(51:end);
    h1 = gscatter(SL, SW, group, 'rb', 'v^', [], 'off');
    set(h1, 'LineWidth', 2)
    legend('Fisher versicolor', 'Fisher virginica', ...
           'Location', 'NW')

What is the Problem to be Solved?
• Suppose you measure a sepal and petal from an iris, and you need to determine its species on the basis of those measurements.
• This classification can be performed using different types of discriminant analysis. (Later, we will use the classify function, which offers Linear and Quadratic Discriminant Analysis, LDA & QDA.)
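Since the slides name the classify function but defer it to later, here is a minimal preview sketch. It assumes the classic Statistics Toolbox classify function (which defaults to linear discriminant analysis); the two sample rows are made-up measurements, not data from the slides.

    % Minimal sketch: predict species from sepal measurements with LDA.
    % Assumption: classic Statistics Toolbox classify (default 'linear').
    load fisheriris
    training = meas(:,1:2);            % sepal length and sepal width
    sample   = [5.0 3.5; 6.5 3.0];     % two hypothetical new flowers (cm)
    predicted = classify(sample, training, species)   % cell array of names

Because species is a cell array of strings, classify returns the predicted labels in the same form, one per row of sample.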
• Now let us use the Decision Tree classifier.

Decision Trees
• A decision tree is a classification algorithm.
• A decision tree is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa."
• Decision trees are also nonparametric, because they do not require any assumptions about the distribution of the variables in each class.
• The classregtree class creates a decision tree.
• Create a decision tree for the iris data and see how well it classifies the irises into species.

Visualizing the Decision Boundaries
• It is interesting to see how the decision tree method divides the plane.
• Use the same technique as above to visualize the regions assigned to each species.

One Method
[Code slide shown as an image; see the reconstruction sketch before the "Pruning Code" slide below.]

Decision Tree
[Figure: the fitted decision tree]

Reading the Tree
• This cluttered-looking tree uses a series of rules of the form "SL < 5.45" to classify each specimen into one of 19 terminal nodes.
• To determine the species assignment for an observation, start at the top node and apply the rule. If the point satisfies the rule, take the left path; if not, take the right path.
• Ultimately you reach a terminal node that assigns the observation to one of the three species.

Home Work #3
• Your job is to reproduce the results shown so far (and no more) using MATLAB.
• You can use the code I suggested, or you can modify it.
• I cannot guarantee that the code I suggested works on your system.
• You can take two weeks to do this.

Rest of the Slides
• I will stop the Decision Tree discussion at this point; we will pick it up later.
• The next topic is Neural Nets.

Re-Substitution & CV Errors

    % t is the decision tree built earlier (see the sketch below);
    % N = size(meas,1) was set when the data were plotted; cp is a
    % cvpartition object assumed to have been created earlier
    % (e.g., cp = cvpartition(species,'k',10)).
    dtclass = t.eval(meas(:,1:2));        % re-classify the training data
    bad = ~strcmp(dtclass, species);      % misclassified observations
    dtResubErr = sum(bad) / N             % re-substitution error rate
    dtClassFun = @(xtrain,ytrain,xtest) ...
        (eval(classregtree(xtrain,ytrain), xtest));
    dtCVErr = crossval('mcr', meas(:,1:2), species, ...
                       'predfun', dtClassFun, 'partition', cp)

Comment
• For the decision tree algorithm, the cross-validation error estimate is significantly larger than the re-substitution error.
• This shows that the generated tree overfits the training set. In other words, this is a tree that classifies the original training set well, but the structure of the tree is sensitive to this particular training set, so its performance on new data is likely to degrade.
• It is often possible to find a simpler tree that performs better than a more complex tree on new data.

Decision Tree Pruning
• Try pruning the tree.
• First compute the re-substitution error for various subsets of the original tree.
• Then compute the cross-validation error for these sub-trees.
• A graph shows that the re-substitution error is overly optimistic: it always decreases as the tree size grows, but beyond a certain point, increasing the tree size increases the cross-validation error rate.
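The "One Method" and "Pruning Code" slides were shown as images, so their code did not survive extraction. The following is a plausible reconstruction, assuming the standard classregtree workflow from the MATLAB Statistics Toolbox; the predictor names SL/SW match the rules quoted in "Reading the Tree", but details may differ from the instructor's exact code.

    % Sketch (assumed, not the instructor's verbatim code): build the
    % tree from the two sepal measurements, then prune it.
    load fisheriris
    t = classregtree(meas(:,1:2), species, 'names', {'SL' 'SW'});
    view(t)                                  % display the unpruned tree

    % Re-substitution cost per pruning level, and cross-validated cost
    % with the best pruning level.
    resubcost = test(t, 'resub');
    [cost, secost, ntermnodes, bestlevel] = ...
        test(t, 'cross', meas(:,1:2), species);

    % Re-substitution error keeps falling with tree size, while the
    % cross-validation error bottoms out and then rises.
    plot(ntermnodes, cost, 'b-', ntermnodes, resubcost, 'r--')
    xlabel('Number of terminal nodes');
    ylabel('Cost (misclassification error)');
    legend('Cross-validation', 'Re-substitution', 'Location', 'NE')

    % Prune to the best level, as on the "Shape of Pruned Tree" slide
    % (newer releases may require prune(t,'level',bestlevel)).
    pt = prune(t, bestlevel);
    view(pt)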
Pruning Code
[Code slide shown as an image; see the reconstruction sketch above.]

Performance of Pruned Tree
[Figure: performance of the pruned tree]

Shape of Pruned Tree

    pt = prune(t, bestlevel);
    view(pt)

Additional Analysis on the Iris Data Set

Ranges of the Measurements of Iris Flowers (in centimeters)

    Sepal length   Sepal width   Petal length   Petal width
    4.3 – 7.9      2.0 – 4.4     1.0 – 6.9      0.1 – 2.5

Ranges Within the Classes

                   Setosa      Versicolor   Virginica
    Sepal length   4.3 – 5.8   4.9 – 7.0    4.9 – 7.9
    Sepal width    2.3 – 4.4   2.0 – 3.4    2.2 – 3.8
    Petal length   1.0 – 1.9   3.0 – 5.1    4.5 – 6.9
    Petal width    0.1 – 0.6   1.0 – 1.8    1.4 – 2.5

Granulated Ranges (of sepal length)

    4.3 – 4.9   Setosa
    4.9 – 5.8   Setosa, Versicolor, Virginica
    5.8 – 7.0   Versicolor, Virginica
    7.0 – 7.9   Virginica

Granulated Ranges (of sepal width)

    2.0 – 2.2   Versicolor
    2.2 – 2.3   Versicolor, Virginica
    2.3 – 3.4   Setosa, Versicolor, Virginica
    3.4 – 3.8   Setosa, Virginica
    3.8 – 4.4   Setosa

Granulated Ranges (of petal length)

    1.0 – 1.9   Setosa
    3.0 – 4.5   Versicolor
    4.5 – 5.1   Versicolor, Virginica
    5.1 – 6.9   Virginica

Granulated Ranges (of petal width)

    0.1 – 0.6   Setosa
    1.0 – 1.4   Versicolor
    1.4 – 1.8   Versicolor, Virginica
    1.8 – 2.5   Virginica

Fuzzy Models: Linguistic Labels (for sepal length)

    4.3 – 4.9   short sepal         A11
    4.9 – 5.8   medium long sepal   A12
    5.8 – 7.0   long sepal          A13
    7.0 – 7.9   very long sepal     A14

Fuzzy Models: Linguistic Labels (for sepal width)

    2.0 – 2.2   very narrow sepal   A21
    2.2 – 2.3   narrow sepal        A22
    2.3 – 3.4   medium wide sepal   A23
    3.4 – 3.8   wide sepal          A24
    3.8 – 4.4   very wide sepal     A25

Fuzzy Models: Linguistic Labels (for petal length)

    1.0 – 1.9   very short petal    A31
    3.0 – 4.5   medium long petal   A32
    4.5 – 5.1   long petal          A33
    5.1 – 6.9   very long petal     A34

Fuzzy Models: Linguistic Labels (for petal width)

    0.1 – 0.6   very narrow petal   A41
    1.0 – 1.4   medium wide petal   A42
    1.4 – 1.8   wide petal          A43
    1.8 – 2.5   very wide petal     A44
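To make the granulation concrete, here is a minimal sketch that maps a sepal-length measurement to its linguistic label A11 to A14. The interval boundaries come from the table above; the lookup logic and variable names are a hypothetical illustration, not code from the slides.

    % Minimal sketch: crisp lookup of the sepal-length linguistic label.
    % edges are the granulated boundaries from the slide; the helper
    % logic is illustrative only.
    edges  = [4.3 4.9 5.8 7.0 7.9];
    labels = {'short sepal', 'medium long sepal', ...
              'long sepal', 'very long sepal'};     % A11 .. A14

    sl = 6.1;                                       % example measurement (cm)
    j  = find(sl >= edges(1:end-1) & sl <= edges(2:end), 1);
    fprintf('Sepal length %.1f cm -> %s (A1%d)\n', sl, labels{j}, j);

In a true fuzzy model, these crisp intervals become overlapping membership functions, so a measurement near a boundary (say 5.8 cm) would belong partly to A12 and partly to A13 rather than to exactly one label.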