TUTORIAL #3
CLASSIFICATION

A Tree Classification algorithm is used to
compute a decision tree. Decision trees are easy
to understand and modify, and the model
developed can be expressed as a set of decision
rules.

Training the model on larger data sets improves the accuracy of the classification model. In classification, the input is a set of example records, called a training set, where each record consists of several fields or attributes. Attributes are either numerical (drawn from an ordered domain) or categorical (drawn from an unordered domain). One of the attributes, called the class label field (target field), indicates the class to which each example belongs.
A decision tree model contains rules that predict the target variable. The tree classification algorithm used in this tutorial is ID3.

ID3 ALGORITHM

First: Calculate Entropy(S) for all the data, where p and n are the counts of positive and negative examples in S:

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))


Second: Try each attribute and calculate the Gain of splitting on it (see the sketch after these steps):

Gain(A) = E(current set) - Σ (|child| / |current set|) × E(child), summed over all child sets produced by the split. Each child's entropy is weighted by its share of the records.

Third: Build the tree top-down, at each step splitting on the attribute with the maximum Gain.
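A minimal Python sketch of the two formulas above; the function names and the (p, n) count convention are ours, not part of the tutorial:

import math

def entropy(p, n):
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    return -sum((c / total) * math.log2(c / total)
                for c in (p, n) if c > 0)  # treat 0 * log2(0) as 0

def gain(parent, children):
    """Information gain of a split: parent entropy minus the
    size-weighted entropy of the child sets; all arguments are
    (p, n) count pairs."""
    size = sum(p + n for p, n in children)
    weighted = sum(((p + n) / size) * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted

print(round(entropy(4, 5), 4))                   # 0.9911 (the 4F/5M set below)
print(round(gain((4, 5), [(0, 3), (4, 2)]), 4))  # 0.3789 (Hair Length < 4)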
EXAMPLE

Person   Hair Length   Weight   Age   Class
Homer        0"          250     36     M
Marge       10"          150     34     F
Bart         2"           90     10     M
Lisa         6"           78      8     F
Maggie       4"           20      1     F
Abe          1"          170     70     M
Selma        8"          160     41     F
Otto        10"          180     38     M
Krusty       6"          200     45     M
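For the sketches in this tutorial it is handy to have the table in code. A minimal transcription (the DATA name and tuple layout are our own; hair length is kept as a plain number of inches):

# (person, hair_length, weight, age, cls) per row of the table above
DATA = [
    ("Homer",   0, 250, 36, "M"),
    ("Marge",  10, 150, 34, "F"),
    ("Bart",    2,  90, 10, "M"),
    ("Lisa",    6,  78,  8, "F"),
    ("Maggie",  4,  20,  1, "F"),
    ("Abe",     1, 170, 70, "M"),
    ("Selma",   8, 160, 41, "F"),
    ("Otto",   10, 180, 38, "M"),
    ("Krusty",  6, 200, 45, "M"),
]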
Sorting the records on each attribute shows how well a single threshold can separate the classes (values with class labels in parentheses):

Hair length: 0(M), 1(M), 2(M), 4(F), 6(M), 6(F), 8(F), 10(M), 10(F)
Weight:      20(F), 78(F), 90(M), 150(F), 160(F), 170(M), 180(M), 200(M), 250(M)
Age:         1(F), 8(F), 10(M), 34(F), 36(M), 38(M), 41(F), 45(M), 70(M)
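These orderings can be reproduced with a few lines of Python (a sketch using the DATA transcription from above; tie order may differ):

DATA = [
    ("Homer", 0, 250, 36, "M"), ("Marge", 10, 150, 34, "F"),
    ("Bart", 2, 90, 10, "M"), ("Lisa", 6, 78, 8, "F"),
    ("Maggie", 4, 20, 1, "F"), ("Abe", 1, 170, 70, "M"),
    ("Selma", 8, 160, 41, "F"), ("Otto", 10, 180, 38, "M"),
    ("Krusty", 6, 200, 45, "M"),
]

# Sort the records on each attribute and print value(class) pairs.
for name, idx in [("Hair length", 1), ("Weight", 2), ("Age", 3)]:
    ordered = sorted(DATA, key=lambda r: r[idx])
    print(name + ":", ", ".join(f"{r[idx]}({r[-1]})" for r in ordered))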
Let us try splitting on Hair Length.

All 9 persons: Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Hair Length < 4?
  yes: 3 Males             (Entropy = 0)
  no:  4 Females, 2 Males  (Entropy = 0.9183)

Gain(Hair Length < 4) = 0.9911 - (3/9 × 0 + 6/9 × 0.9183) = 0.3789
Let us try splitting on Weight.

All 9 persons: Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Weight < 170?
  yes: 4 Females, 1 Male   (Entropy = 0.7219)
  no:  4 Males             (Entropy = 0)

Gain(Weight < 170) = 0.9911 - (5/9 × 0.7219 + 4/9 × 0) = 0.5900
Let us try splitting on Age.

All 9 persons: Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Age <= 40?
  yes: 3 Females, 3 Males  (Entropy = 1.0)
  no:  1 Female, 2 Males   (Entropy = 0.9183)

Gain(Age <= 40) = 0.9911 - (6/9 × 1.0 + 3/9 × 0.9183) = 0.0183
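Weight < 170 gives the largest Gain, so it becomes the root split. A sketch that recomputes all three gains directly from the table (the compact DATA layout and helper names are ours):

import math

# (hair_length, weight, age, cls) for the 9 persons in the table.
DATA = [
    (0, 250, 36, "M"), (10, 150, 34, "F"), (2,  90, 10, "M"),
    (6,  78,  8, "F"), (4,  20,  1, "F"),  (1, 170, 70, "M"),
    (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M"),
]

def entropy(labels):
    if not labels:
        return 0.0
    return -sum((labels.count(c) / len(labels))
                * math.log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(rows, test):
    """Gain of a boolean test: whole-set entropy minus the
    size-weighted entropy of the yes/no child sets."""
    yes = [r[-1] for r in rows if test(r)]
    no  = [r[-1] for r in rows if not test(r)]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(rows)
    return entropy([r[-1] for r in rows]) - weighted

print(round(gain(DATA, lambda r: r[0] < 4), 4))    # Hair Length < 4 -> 0.3789
print(round(gain(DATA, lambda r: r[1] < 170), 4))  # Weight < 170    -> 0.59
print(round(gain(DATA, lambda r: r[2] <= 40), 4))  # Age <= 40       -> 0.0183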
Decision Tree:

9 Persons: Weight < 170?
  no  (weight >= 170): 4 Males -> Male
  yes (weight < 170):  Hair Length < 4?
    yes: 1 Male    -> Male
    no:  4 Females -> Female
Convert decision trees to rules…

Rules to Classify Males/Females:
If Weight is greater than or equal to 170, classify as Male
Else if Hair Length is less than 4, classify as Male
Else classify as Female
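A direct Python transcription of the three rules (the thresholds come from the tree; the function name is ours), checked against all nine training records:

def classify(hair_length, weight):
    """Apply the learned rules in order."""
    if weight >= 170:
        return "M"
    if hair_length < 4:
        return "M"
    return "F"

# The rule set reproduces the class of every training record.
records = [
    ("Homer", 0, 250, "M"), ("Marge", 10, 150, "F"), ("Bart", 2, 90, "M"),
    ("Lisa", 6, 78, "F"), ("Maggie", 4, 20, "F"), ("Abe", 1, 170, "M"),
    ("Selma", 8, 160, "F"), ("Otto", 10, 180, "M"), ("Krusty", 6, 200, "M"),
]
for name, hair, weight, cls in records:
    assert classify(hair, weight) == cls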
TRY WEKA PROGRAM

Insert the same data from the example (in file test.csv) into Weka and show that it produces the same tree.
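One plausible layout for test.csv (the column names are our choice; the Person column is left out so Weka does not treat names as a predictive attribute):

Hair_Length,Weight,Age,Class
0,250,36,M
10,150,34,F
2,90,10,M
6,78,8,F
4,20,1,F
1,170,70,M
8,160,41,F
10,180,38,M
6,200,45,M

In the Weka Explorer, load the file, switch to the Classify tab, and run a decision-tree learner such as trees > J48 (Weka's C4.5 implementation); it should produce a comparable tree, though pruning settings on such a small data set may affect the result.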
REFERENCES:

Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1, 81-106.
http://dms.irb.hr/tutorial/tut_dtrees.php
http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/4_dtrees2.html
Professor Sin-Min Lee, SJSU. http://cs.sjsu.edu/~lee/cs157b/cs157b.html