TUTORIAL #3: CLASSIFICATION

A tree classification algorithm is used to compute a decision tree. Decision trees are easy to understand and modify, and the resulting model can be expressed as a set of decision rules. Training on larger data sets generally improves the accuracy of the classification model.

In classification, the input is a set of example records, called a training set, where each record consists of several fields or attributes. Attributes are either numerical (coming from an ordered domain) or categorical (coming from an unordered domain). One of the attributes, called the class label field (target field), indicates the class to which each example belongs.

A decision tree model contains rules to predict the target variable. The tree classification algorithm used in this tutorial is ID3.

ID3 ALGORITHM

First: Calculate the entropy of the whole data set S, where p and n are the numbers of records in the two classes:

$$\mathrm{Entropy}(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Second: Try every attribute and calculate the gain of each one. Each child set's entropy is weighted by the fraction of records that fall into it:

$$\mathrm{Gain}(A) = E(\text{current set}) - \sum_{c \,\in\, \text{child sets}} \frac{|S_c|}{|S|}\, E(S_c)$$

Third: Build the tree, splitting first on the attribute with the maximum gain.

EXAMPLE

Person   Hair Length   Weight   Age   Class
Homer    0"            250      36    M
Marge    10"           150      34    F
Bart     2"            90       10    M
Lisa     6"            78       8     F
Maggie   4"            20       1     F
Abe      1"            170      70    M
Selma    8"            160      41    F
Otto     10"           180      38    M
Krusty   6"            200      45    M

The same records, sorted by each attribute, with the class shown above each value:

Class         M   M   M   F   M   F   F   M   F
Hair Length   0   1   2   4   6   6   8   10  10

Class         F   F   M   F   F   M   M   M   M
Weight        20  78  90  150 160 170 180 200 250

Class         F   F   M   F   M   M   F   M   M
Age           1   8   10  34  36  38  41  45  70

Entropy of the full set (9 persons: 4 Females, 5 Males):

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Let us try splitting on Hair Length.

Hair Length < 4?
  yes: 3 Males (entropy 0)
  no:  4 Females, 2 Males (entropy 0.9183)

Gain(Hair Length < 4) = 0.9911 - (3/9 * 0 + 6/9 * 0.9183) = 0.3789

Let us try splitting on Weight.

Weight < 170?
  yes: 4 Females, 1 Male (entropy 0.7219)
  no:  4 Males (entropy 0)

Gain(Weight < 170) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900

Let us try splitting on Age.

Age <= 40?
  yes: 3 Females, 3 Males (entropy 1)
  no:  1 Female, 2 Males (entropy 0.9183)

Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183

Weight < 170 has the maximum gain, so it becomes the root of the tree. Its "no" branch contains only Males and needs no further split. Its "yes" branch (4 Females, 1 Male) is split again, and Hair Length < 4 separates it perfectly.

Decision Tree:

Weight < 170?   (9 persons)
  no:  Male (4 Males)
  yes: Hair Length < 4?   (4 Females, 1 Male)
         yes: Male (1 Male)
         no:  Female (4 Females)

Convert Decision Trees to Rules

Rules to classify Males/Females:

If Weight is greater than or equal to 170, classify as Male
Else if Hair Length is less than 4, classify as Male
Else classify as Female

TRY THE WEKA PROGRAM

Load the same example data (in the file test.csv) into Weka and check that it produces the same tree.

REFERENCES

Quinlan, J.R. 1986, Machine Learning, 1, 81
http://dms.irb.hr/tutorial/tut_dtrees.php
http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/4_dtrees2.html
Professor Sin-Min Lee, SJSU. http://cs.sjsu.edu/~lee/cs157b/cs157b.html
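
APPENDIX: CHECKING THE EXAMPLE IN PYTHON

The entropy and gain values in the example can be recomputed with a short Python sketch. It is not part of the original tutorial or of Weka. The names PEOPLE, entropy, gain and classify are hypothetical helpers introduced here for illustration, and the three thresholds are the ones tried in the example rather than thresholds found by a search.

```python
from math import log2

# Example records from the tutorial: (name, hair length, weight, age, class).
PEOPLE = [
    ("Homer", 0, 250, 36, "M"),
    ("Marge", 10, 150, 34, "F"),
    ("Bart", 2, 90, 10, "M"),
    ("Lisa", 6, 78, 8, "F"),
    ("Maggie", 4, 20, 1, "F"),
    ("Abe", 1, 170, 70, "M"),
    ("Selma", 8, 160, 41, "F"),
    ("Otto", 10, 180, 38, "M"),
    ("Krusty", 6, 200, 45, "M"),
]


def entropy(labels):
    """Entropy of a list of class labels, in bits (0 for an empty list)."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        frac = labels.count(cls) / total
        result -= frac * log2(frac)
    return result


def gain(records, test):
    """Information gain of a yes/no test applied to the records."""
    labels = [r[-1] for r in records]
    yes = [r[-1] for r in records if test(r)]
    no = [r[-1] for r in records if not test(r)]
    # Weight each child's entropy by the fraction of records it receives.
    weighted = sum(len(part) / len(labels) * entropy(part) for part in (yes, no))
    return entropy(labels) - weighted


def classify(hair_length, weight):
    """The rules read off the final decision tree."""
    if weight >= 170:
        return "M"
    if hair_length < 4:
        return "M"
    return "F"


if __name__ == "__main__":
    splits = {
        "Hair Length < 4": lambda r: r[1] < 4,
        "Weight < 170": lambda r: r[2] < 170,
        "Age <= 40": lambda r: r[3] <= 40,
    }
    for name, test in splits.items():
        print(f"Gain({name}) = {gain(PEOPLE, test):.4f}")
```

Run as a script, this prints one gain per candidate split, which can be compared with the hand-computed values 0.3789, 0.5900 and 0.0183. The classify function applies the three rules and can be checked against the Class column of the table.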