SEEM4630 2012-2013
Tutorial 2 – Classification
Decision tree, Naïve Bayes & k-NN
WANG Jing
Classification: Definition

- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- Methods covered in this tutorial: decision tree, Naïve Bayes & k-NN.
Decision Tree

- Goal
  - Construct a tree so that instances belonging to different classes are separated.
- Basic algorithm (a greedy algorithm; see the sketch below)
  - The tree is constructed in a top-down, recursive manner.
  - At the start, all the training examples are at the root.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
  - Examples are partitioned recursively based on the selected attributes.
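As a rough sketch of this greedy, top-down procedure (not code from the tutorial), the Python snippet below grows a tree from (attribute-dict, label) pairs; the helper names entropy, best_attribute and build_tree, and the choice of information gain as the measure, are my own assumptions.

```python
# Minimal sketch of top-down greedy decision tree induction (illustrative names).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_attribute(records, attributes):
    # Pick the attribute whose split yields the highest information gain.
    base = entropy([y for _, y in records])
    def gain(a):
        groups = Counter(x[a] for x, _ in records)
        remainder = sum((n / len(records)) * entropy([y for x, y in records if x[a] == v])
                        for v, n in groups.items())
        return base - remainder
    return max(attributes, key=gain)

def build_tree(records, attributes):
    labels = [y for _, y in records]
    if len(set(labels)) == 1 or not attributes:      # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    a = best_attribute(records, attributes)          # greedy choice at this node
    children = {}
    for v in set(x[a] for x, _ in records):          # partition on the chosen attribute
        subset = [(x, y) for x, y in records if x[a] == v]
        children[v] = build_tree(subset, [b for b in attributes if b != a])
    return (a, children)                             # internal node: attribute + branches
```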
Attribute Selection Measure 1: Information Gain

- Let p_i be the probability that a tuple belongs to class Ci, estimated by |Ci,D| / |D|.
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - Σ_{i=1..m} p_i log2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) · Info(Dj)
- Information gained by branching on attribute A (see the sketch below):
  Gain(A) = Info(D) - Info_A(D)
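A minimal numeric sketch of these formulas, assuming class counts are passed in as plain lists (the function names info, info_after_split and gain are illustrative, not from the slides); it reproduces Info(D) = 0.94 for a set with 9 positive and 5 negative tuples.

```python
from math import log2

def info(counts):
    """Entropy Info(D) from class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    """Info_A(D): weighted entropy of the partitions produced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(counts, partitions):
    return info(counts) - info_after_split(partitions)

print(round(info([9, 5]), 2))                              # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 2))    # 0.25 (the Outlook split below)
```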
Attribute Selection Measure 2: Gain Ratio

- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = - Σ_{j=1..v} (|Dj| / |D|) log2(|Dj| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
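A corresponding sketch for SplitInfo and GainRatio under the same conventions (function names are illustrative); the partition sizes 5/4/5 and the gain value 0.25 are taken from the Outlook split in the worked example later in this tutorial.

```python
from math import log2

def split_info(sizes):
    """SplitInfo_A(D) from partition sizes |Dj|, e.g. [5, 4, 5]."""
    total = sum(sizes)
    return -sum((s / total) * log2(s / total) for s in sizes if s > 0)

def gain_ratio(gain_a, sizes):
    return gain_a / split_info(sizes)

# Outlook splits the 14 tuples into partitions of size 5, 4 and 5; Gain(Outlook) = 0.25.
print(round(split_info([5, 4, 5]), 2))        # ~1.58
print(round(gain_ratio(0.25, [5, 4, 5]), 2))  # ~0.16
```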
Attribute Selection Measure 3: Gini Index

- If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 - Σ_{j=1..n} p_j²
  where p_j is the relative frequency of class j in D.
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
  gini_A(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)
- Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)
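A small sketch of the Gini formulas (function names are illustrative); as an example it uses the Humidity partition [6+,1-] / [3+,4-] of the 9+/5- data set from the example that follows.

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, computed from class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(d1, d2):
    """gini_A(D) for a binary split of D into D1 and D2."""
    n1, n2 = sum(d1), sum(d2)
    total = n1 + n2
    return (n1 / total) * gini(d1) + (n2 / total) * gini(d2)

d, d1, d2 = [9, 5], [6, 1], [3, 4]          # D and the Humidity split Normal/High
print(round(gini(d), 3))                     # gini(D)
print(round(gini_split(d1, d2), 3))          # gini_A(D)
print(round(gini(d) - gini_split(d1, d2), 3))  # reduction in impurity
```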
Example

Outlook    Temperature  Humidity  Wind    Play Tennis
Sunny      >25          High      Weak    No
Sunny      >25          High      Strong  No
Overcast   >25          High      Weak    Yes
Rain       15-25        High      Weak    Yes
Rain       <15          Normal    Weak    Yes
Rain       <15          Normal    Strong  No
Overcast   <15          Normal    Strong  Yes
Sunny      15-25        High      Weak    No
Sunny      <15          Normal    Weak    Yes
Rain       15-25        Normal    Weak    Yes
Sunny      15-25        Normal    Strong  Yes
Overcast   15-25        High      Strong  Yes
Overcast   >25          Normal    Weak    Yes
Rain       15-25        High      Strong  No
Tree induction example

S = [9+, 5-]
Info(S) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

Outlook: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
Gain(Outlook) = 0.94 - 5/14[-2/5 log2(2/5) - 3/5 log2(3/5)]
                     - 4/14[-4/4 log2(4/4) - 0/4 log2(0/4)]
                     - 5/14[-3/5 log2(3/5) - 2/5 log2(2/5)]
              = 0.94 - 0.69 = 0.25

Temperature: <15 [3+,1-], 15-25 [4+,2-], >25 [2+,2-]
Gain(Temperature) = 0.94 - 4/14[-3/4 log2(3/4) - 1/4 log2(1/4)]
                         - 6/14[-4/6 log2(4/6) - 2/6 log2(2/6)]
                         - 4/14[-2/4 log2(2/4) - 2/4 log2(2/4)]
                  = 0.94 - 0.91 = 0.03

Humidity: High [3+,4-], Normal [6+,1-]
Gain(Humidity) = 0.94 - 7/14[-3/7 log2(3/7) - 4/7 log2(4/7)]
                      - 7/14[-6/7 log2(6/7) - 1/7 log2(1/7)]
               = 0.94 - 0.79 = 0.15

Wind: Weak [6+,2-], Strong [3+,3-]
Gain(Wind) = 0.94 - 8/14[-6/8 log2(6/8) - 2/8 log2(2/8)]
                  - 6/14[-3/6 log2(3/6) - 3/6 log2(3/6)]
           = 0.94 - 0.89 = 0.05
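To double-check the arithmetic above, this sketch (helper names are mine) recomputes the four gains directly from the split counts.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent, partitions):
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

s = [9, 5]
print(round(gain(s, [[2, 3], [4, 0], [3, 2]]), 2))  # Outlook     -> 0.25
print(round(gain(s, [[3, 1], [4, 2], [2, 2]]), 2))  # Temperature -> 0.03
print(round(gain(s, [[3, 4], [6, 1]]), 2))          # Humidity    -> 0.15
print(round(gain(s, [[6, 2], [3, 3]]), 2))          # Wind        -> 0.05
```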
Comparing the gains: Gain(Outlook) = 0.25, Gain(Temperature) = 0.03, Gain(Humidity) = 0.15, Gain(Wind) = 0.05. Outlook has the largest gain, so it is chosen as the root:

Outlook = Sunny → ??
Outlook = Overcast → Yes
Outlook = Rain → ??
Splitting the Sunny branch

Sunny = [2+, 3-]
Info(Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

Temperature: <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]
Gain(Temperature) = 0.97 - 1/5[-1/1 log2(1/1) - 0/1 log2(0/1)]
                         - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                         - 2/5[-0/2 log2(0/2) - 2/2 log2(2/2)]
                  = 0.97 - 0.4 = 0.57

Humidity: High [0+,3-], Normal [2+,0-]
Gain(Humidity) = 0.97 - 3/5[-0/3 log2(0/3) - 3/3 log2(3/3)]
                      - 2/5[-2/2 log2(2/2) - 0/2 log2(0/2)]
               = 0.97 - 0 = 0.97

Wind: Weak [1+,2-], Strong [1+,1-]
Gain(Wind) = 0.97 - 3/5[-1/3 log2(1/3) - 2/3 log2(2/3)]
                  - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
           = 0.97 - 0.95 = 0.02
Humidity has the largest gain (0.97), so the Sunny branch is split on Humidity:

Outlook = Sunny, Humidity = High → No
Outlook = Sunny, Humidity = Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain → ??
Splitting the Rain branch

Rain = [3+, 2-]
Info(Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

Temperature: <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-]
Gain(Temperature) = 0.97 - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                         - 3/5[-2/3 log2(2/3) - 1/3 log2(1/3)]
                  = 0.97 - 0.95 = 0.02

Humidity: High [1+,1-], Normal [2+,1-]
Gain(Humidity) = 0.97 - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                      - 3/5[-2/3 log2(2/3) - 1/3 log2(1/3)]
               = 0.97 - 0.95 = 0.02

Wind: Weak [3+,0-], Strong [0+,2-]
Gain(Wind) = 0.97 - 3/5[-3/3 log2(3/3) - 0/3 log2(0/3)]
                  - 2/5[-0/2 log2(0/2) - 2/2 log2(2/2)]
           = 0.97 - 0 = 0.97
Wind has the largest gain (0.97), so the Rain branch is split on Wind. The final tree:

Outlook = Sunny, Humidity = High → No
Outlook = Sunny, Humidity = Normal → Yes
Outlook = Overcast → Yes
Outlook = Rain, Wind = Weak → Yes
Outlook = Rain, Wind = Strong → No
Bayesian Classification

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities P(Ci | x1, x2, ..., xn), where xj is the value of attribute Aj.
- Choose the class label that has the highest probability.
- Foundation: based on Bayes' theorem:
  P(Ci | x1, x2, ..., xn) = P(x1, x2, ..., xn | Ci) · P(Ci) / P(x1, x2, ..., xn)
  - P(Ci | x1, x2, ..., xn): posterior probability
  - P(Ci): prior probability
  - P(x1, x2, ..., xn | Ci): likelihood
  - Model: compute these probabilities from the data.
Naïve Bayes Classifier

- Problem: the joint probability P(x1, x2, ..., xn | Ci) is difficult to estimate.
- Naïve Bayes classifier
  - Assumption: attributes are conditionally independent given the class:
    P(x1, x2, ..., xn | Ci) = P(x1 | Ci) · ... · P(xn | Ci)
  - Therefore:
    P(Ci | x1, x2, ..., xn) = [ Π_{j=1..n} P(xj | Ci) ] · P(Ci) / P(x1, x2, ..., xn)
Naïve Bayes Classifier

A   B   C
m   b   t
m   s   t
g   q   t
h   s   t
g   q   t
g   q   f
g   s   f
h   b   f
h   q   f
m   b   f

P(C=t) = 1/2        P(C=f) = 1/2
P(A=m|C=t) = 2/5    P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5    P(B=q|C=f) = 2/5

Test record: A=m, B=q, C=?
Naïve Bayes Classifier

- For C = t:
  P(A=m|C=t) · P(B=q|C=t) · P(C=t) = 2/5 · 2/5 · 1/2 = 2/25
  P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)   ← higher!
- For C = f:
  P(A=m|C=f) · P(B=q|C=f) · P(C=f) = 1/5 · 2/5 · 1/2 = 1/25
  P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)
- Conclusion: for A=m, B=q, predict C=t (see the sketch below).
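The same calculation can be reproduced by simple counting; the snippet below is an illustrative sketch (variable and function names are my own) over the ten records of the table above.

```python
# The ten training records (A, B, C) from the table above.
data = [("m", "b", "t"), ("m", "s", "t"), ("g", "q", "t"), ("h", "s", "t"), ("g", "q", "t"),
        ("g", "q", "f"), ("g", "s", "f"), ("h", "b", "f"), ("h", "q", "f"), ("m", "b", "f")]

def score(c, a, b):
    """Unnormalised posterior P(A=a|C=c) * P(B=b|C=c) * P(C=c)."""
    rows = [r for r in data if r[2] == c]
    p_a = sum(r[0] == a for r in rows) / len(rows)
    p_b = sum(r[1] == b for r in rows) / len(rows)
    p_c = len(rows) / len(data)
    return p_a * p_b * p_c

print(score("t", "m", "q"))   # 2/5 * 2/5 * 1/2 = 0.08  (= 2/25)
print(score("f", "m", "q"))   # 1/5 * 2/5 * 1/2 = 0.04  (= 1/25)
# The larger score is for C=t, so the record (A=m, B=q) is classified as C=t.
```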
Nearest Neighbor Classification

- Input
  - A set of stored records
  - k: number of nearest neighbors
- Output (see the sketch below)
  - Compute the distance to every stored record: d(p, q) = sqrt( Σ_i (p_i - q_i)² )
  - Identify the k nearest neighbors
  - Determine the class label of the unknown record based on the class labels of its nearest neighbors (i.e., by taking a majority vote)
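A compact sketch of this procedure, combining Euclidean distance with a majority vote; the function name knn_classify is illustrative.

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(training, query, k):
    """training: list of (point, label) pairs; query: a point.
    Returns the majority label among the k nearest neighbors of query."""
    neighbors = sorted(training, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```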
Nearest Neighbor Classification: A Discrete Example

- Input: 8 training instances
  P1 (4, 2)     → Orange
  P2 (0.5, 2.5) → Orange
  P3 (2.5, 2.5) → Orange
  P4 (3, 3.5)   → Orange
  P5 (5.5, 3.5) → Orange
  P6 (2, 4)     → Black
  P7 (4, 5)     → Black
  P8 (2.5, 5.5) → Black
- New instance: Pn (4, 4) → ??? (classify with k = 1 and k = 3)
- Calculate the distances:
  d(P1, Pn) = sqrt((4-4)² + (2-4)²) = 2
  d(P2, Pn) = 3.80
  d(P3, Pn) = 2.12
  d(P4, Pn) = 1.12
  d(P5, Pn) = 1.58
  d(P6, Pn) = 2
  d(P7, Pn) = 1
  d(P8, Pn) = 2.12
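The distances and the k = 1 / k = 3 votes can be checked with a few lines (an illustrative sketch using the coordinates listed above).

```python
from collections import Counter
from math import dist

points = {"P1": ((4, 2), "Orange"), "P2": ((0.5, 2.5), "Orange"), "P3": ((2.5, 2.5), "Orange"),
          "P4": ((3, 3.5), "Orange"), "P5": ((5.5, 3.5), "Orange"), "P6": ((2, 4), "Black"),
          "P7": ((4, 5), "Black"), "P8": ((2.5, 5.5), "Black")}
pn = (4, 4)

ranked = sorted(points.items(), key=lambda item: dist(item[1][0], pn))
for name, (xy, label) in ranked:
    print(name, label, round(dist(xy, pn), 2))      # P7 (distance 1.0) is the nearest point

for k in (1, 3):
    votes = Counter(label for _, (_, label) in ranked[:k])
    print(k, votes.most_common(1)[0][0])            # k=1 -> Black, k=3 -> Orange
```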
Nearest Neighbor Classification

[Figure: the eight training points and Pn plotted twice, with the k = 1 and the k = 3 nearest neighbors of Pn highlighted.]
Nearest Neighbor Classification…

- Scaling issues
  - Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
  - Each attribute should fall within the same range.
  - Min-Max normalization
- Example (see the sketch below):
  - Two data records: a = (1, 1000), b = (0.5, 1)
  - dis(a, b) = ?
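As a sketch of why scaling matters here, the snippet below compares the raw Euclidean distance with the distance after Min-Max normalization; the attribute ranges [0, 1] and [0, 1000] are assumed for illustration, since the slide does not state them.

```python
from math import dist

def min_max(value, lo, hi):
    """Min-Max normalization to [0, 1]."""
    return (value - lo) / (hi - lo)

a, b = (1, 1000), (0.5, 1)
print(round(dist(a, b), 2))   # ~999.0: dominated entirely by the second attribute

# Assumed ranges: attribute 1 in [0, 1], attribute 2 in [0, 1000] (illustrative only).
ranges = [(0, 1), (0, 1000)]
a_n = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(a, ranges))
b_n = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(b, ranges))
print(round(dist(a_n, b_n), 2))   # both attributes now contribute on a comparable scale
```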
Lazy & Eager Learning

- Two types of learning methodologies
  - Lazy learning
    - Instance-based learning (e.g., k-NN)
  - Eager learning
    - Decision tree and Bayesian classification
    - ANN & SVM
Lazy & Eager Learning

- Key differences
  - Lazy learning
    - Does not require model building
    - Less time training but more time predicting
    - Effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - Eager learning
    - Requires model building
    - More time training but less time predicting
    - Must commit to a single hypothesis that covers the entire instance space