Classification Using Decision Trees

Data Mining
CS 341, Spring 2007
Lecture 8: Decision tree algorithms

Jackknife Estimator: Example 1
Estimate of the mean for X = {x1, x2, x3}, n = 3, g = 3, m = 1, θ = µ = (x1 + x2 + x3)/3
θ1 = (x2 + x3)/2, θ2 = (x1 + x3)/2, θ3 = (x1 + x2)/2
θ_ = (θ1 + θ2 + θ3)/3
θQ = gθ - (g - 1)θ_ = 3θ - 2θ_ = (x1 + x2 + x3)/3
In this case, the Jackknife Estimator is the same as the usual estimator.
Jackknife Estimator: Example 2
Estimate of the variance for X = {1, 4, 4}, n = 3, g = 3, m = 1, θ = σ2
σ2 = ((1 - 3)2 + (4 - 3)2 + (4 - 3)2)/3 = 2
θ1 = ((4 - 4)2 + (4 - 4)2)/2 = 0, θ2 = 2.25, θ3 = 2.25
θ_ = (θ1 + θ2 + θ3)/3 = 1.5
θQ = gθ - (g - 1)θ_ = 3θ - 2θ_ = 3(2) - 2(1.5) = 3
In this case, the Jackknife Estimator is different from the usual estimator.

Jackknife Estimator: Example 2 (cont’d)
In general, apply the Jackknife technique to the biased estimator
σ2 = Σ (xi - x̄)2 / n
Then the jackknife estimator is
s2 = Σ (xi - x̄)2 / (n - 1),
which is known to be unbiased for σ2.
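The two examples follow the same delete-one recipe, so they can be checked with a short sketch. This is a minimal illustration, not from the slides; `jackknife`, `mean`, and `biased_var` are hypothetical helper names.

```python
# Minimal sketch of the delete-one jackknife used above: estimate theta on the
# full sample and on each leave-one-out subsample, then combine them as
# thetaQ = g*theta - (g - 1)*theta_bar.

def jackknife(data, estimator):
    g = len(data)                                    # g = n groups, m = 1 left out each time
    theta = estimator(data)                          # usual estimate on all the data
    partials = [estimator(data[:j] + data[j + 1:])   # theta_j with the j-th value removed
                for j in range(g)]
    theta_bar = sum(partials) / g                    # average of the partial estimates
    return g * theta - (g - 1) * theta_bar           # thetaQ

mean = lambda xs: sum(xs) / len(xs)
biased_var = lambda xs: sum((x - mean(xs)) ** 2 for x in xs) / len(xs)

print(jackknife([1, 4, 4], mean))        # 3.0 -- same as the usual mean, as in Example 1
print(jackknife([1, 4, 4], biased_var))  # 3.0 -- the unbiased variance, not the biased 2 (Example 2)
```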
Review: Distance-based Algorithms
• Place items in class to which they are “closest”.
• Similarity measures or distance measures
• Simple approach
• K Nearest Neighbors
• Decision Tree issues, pros and cons
Classification Using Decision Trees
• Partitioning based: Divide search space into rectangular regions.
• Tuple placed into class based on the region within which it falls.
• DT approaches differ in how the tree is built: DT Induction
• Internal nodes associated with attribute and arcs with values for that attribute.
• Algorithms: ID3, C4.5, CART
Decision Tree
Given:
  – D = {t1, …, tn} where ti = <ti1, …, tih>
  – Database schema contains {A1, A2, …, Ah}
  – Classes C = {C1, …, Cm}
A Decision or Classification Tree is a tree associated with D such that:
  – Each internal node is labeled with an attribute, Ai
  – Each arc is labeled with a predicate which can be applied to the attribute at its parent
  – Each leaf node is labeled with a class, Cj
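A small sketch (not from the slides) of this structure: internal nodes carry an attribute, arcs carry predicates on that attribute, and leaves carry a class. The `Node`/`Leaf` classes and the height thresholds are illustrative assumptions, not the tree any particular algorithm would build.

```python
# Sketch of a decision/classification tree as defined above.

class Leaf:
    def __init__(self, label):
        self.label = label                # class Cj

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute        # attribute Ai tested at this node
        self.branches = branches          # list of (predicate, subtree) arcs

def classify(tree, tuple_):
    """Follow, at each internal node, the arc whose predicate holds for the tuple."""
    while isinstance(tree, Node):
        value = tuple_[tree.attribute]
        tree = next(sub for pred, sub in tree.branches if pred(value))
    return tree.label

# Illustrative tree over the height data used later in the lecture.
tree = Node("height", [
    (lambda h: h <= 1.7,        Leaf("Short")),
    (lambda h: 1.7 < h <= 1.95, Leaf("Medium")),
    (lambda h: h > 1.95,        Leaf("Tall")),
])

print(classify(tree, {"name": "Kristina", "gender": "F", "height": 1.6}))  # Short
```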
DT Induction
Information
• Decision Tree Induction is often based on Information Theory.
DT Induction
• When all the marbles in the bowl are mixed up, little information is given.
• When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given.

Information/Entropy
• Given probabilities p1, p2, …, ps whose sum is 1, Entropy is defined as:
  H(p1, p2, …, ps) = Σ pi log(1/pi)
• Entropy measures the amount of randomness or surprise or uncertainty.
  – Its value is between 0 and 1.
  – It reaches the maximum when all the probabilities are the same.
• Goal in classification:
  – no surprise
  – entropy = 0
• Use this approach with DT Induction!
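A minimal sketch of the definition (not from the slides). The worked examples later in the lecture use base-10 logarithms, so this sketch does too; `entropy` is an illustrative name.

```python
# Entropy H(p1, ..., ps) = sum of p_i * log(1/p_i); terms with p = 0 contribute nothing.
from math import log10

def entropy(probs):
    return sum(p * log10(1 / p) for p in probs if p > 0)

print(entropy([1/3, 1/3, 1/3]))  # equal probabilities -> maximum for 3 classes (log10(3) ~ 0.477)
print(entropy([0.5, 0.5]))       # two equally likely classes -> log10(2) ~ 0.301
print(entropy([1.0, 0.0, 0.0]))  # one class only -> no surprise, entropy 0.0
```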
Entropy
[Figure: plot of H(p, 1 - p) and log(1/p) as functions of p]

ID3
• Creates a tree using information theory concepts and tries to reduce the expected number of comparisons.
• ID3 chooses the split attribute with the highest information gain:
  – Information gain: the difference between how much information is needed to make a correct classification before the split versus how much information is needed after the split.
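A sketch of the information-gain criterion described above (not from the slides): the gain of a split is the entropy before the split minus the size-weighted entropy of the partitions it produces. Base-10 logs are used to match the worked numbers on the following slides; the function names are illustrative.

```python
from math import log10

def entropy(labels):
    """Entropy of a list of class labels, using base-10 logs as in the slides."""
    n = len(labels)
    return sum((labels.count(c) / n) * log10(n / labels.count(c)) for c in set(labels))

def information_gain(labels, partitions):
    """Gain of a split: entropy before minus the size-weighted entropy after.
    `partitions` is the list of label lists the split produces."""
    n = len(labels)
    after = sum(len(part) / n * entropy(part) for part in partitions if part)
    return entropy(labels) - after

# Class distribution of the height data (4 Short, 8 Medium, 3 Tall):
labels = ["Short"] * 4 + ["Medium"] * 8 + ["Tall"] * 3
print(entropy(labels))   # ~0.438, the "before the split" entropy used in the next example
print(information_gain(labels, [["Short"] * 4, ["Medium"] * 8, ["Tall"] * 3]))
# ~0.438: a split into pure partitions removes all uncertainty, so the gain equals the starting entropy
```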
Height Example Data

Name        Gender   Height    Output1   Output2
Kristina    F        1.6 m     Short     Medium
Jim         M        2 m       Tall      Medium
Maggie      F        1.9 m     Medium    Tall
Martha      F        1.88 m    Medium    Tall
Stephanie   F        1.7 m     Short     Medium
Bob         M        1.85 m    Medium    Medium
Kathy       F        1.6 m     Short     Medium
Dave        M        1.7 m     Short     Medium
Worth       M        2.2 m     Tall      Tall
Steven      M        2.1 m     Tall      Tall
Debbie      F        1.8 m     Medium    Medium
Todd        M        1.95 m    Medium    Medium
Kim         F        1.9 m     Medium    Tall
Amy         F        1.8 m     Medium    Medium
Wynette     F        1.75 m    Medium    Medium
Information Gain
• Choose gender as the split attribute
  – H(D): entropy before the split
  – E(H(D)): expected entropy after the split
  – Information gain = H(D) - E(H(D))
• Choose height as the split attribute
  – H(D): entropy before the split
  – E(H(D)): expected entropy after the split
  – Information gain = H(D) - E(H(D))
ID3 Example (Output1)
• Starting state entropy:
  4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
• Gain using gender:
  – Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  – Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
  – Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
  – Gain: 0.4384 – 0.34152 = 0.09688
• Gain using height: 0.4384 – (2/15)(0.301) = 0.3983
• Choose height as the first splitting attribute.
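A sketch that re-derives these figures from the table (not from the slides). It assumes base-10 logs and that height is discretized into the six ranges bounded by the split points 1.6, 1.7, 1.8, 1.9, 2.0, with a value equal to a bound falling in the lower range; small differences from the slide come from its rounding of intermediate terms.

```python
from math import log10

def entropy(labels):
    n = len(labels)
    return sum((labels.count(c) / n) * log10(n / labels.count(c)) for c in set(labels))

# (gender, height, Output1) rows of the Height Example Data table
data = [("F", 1.60, "Short"),  ("M", 2.00, "Tall"),   ("F", 1.90, "Medium"),
        ("F", 1.88, "Medium"), ("F", 1.70, "Short"),  ("M", 1.85, "Medium"),
        ("F", 1.60, "Short"),  ("M", 1.70, "Short"),  ("M", 2.20, "Tall"),
        ("M", 2.10, "Tall"),   ("F", 1.80, "Medium"), ("M", 1.95, "Medium"),
        ("F", 1.90, "Medium"), ("F", 1.80, "Medium"), ("F", 1.75, "Medium")]
labels = [out for _, _, out in data]
n = len(data)

start = entropy(labels)
print(start)                      # ~0.43847 (0.4384 on the slide)

# Gain using gender
by_gender = [[out for g, _, out in data if g == sex] for sex in ("F", "M")]
print(start - sum(len(p) / n * entropy(p) for p in by_gender))      # ~0.0969 (0.09688 on the slide)

# Gain using height, discretized into (0, 1.6], (1.6, 1.7], ..., (2.0, inf)
bounds = [1.6, 1.7, 1.8, 1.9, 2.0, float("inf")]
bucket = lambda h: next(i for i, b in enumerate(bounds) if h <= b)
by_height = [[out for _, h, out in data if bucket(h) == i] for i in range(len(bounds))]
print(start - sum(len(p) / n * entropy(p) for p in by_height if p))  # ~0.3983
```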
C4.5
• ID3 favors attributes with a large number of divisions.
• Improved version of ID3:
  – Missing Data
  – Continuous Data
  – Pruning
  – Rules
  – GainRatio: GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)

C4.5: Example
• Calculate the GainRatio for the gender split:
  – Entropy associated with the split, ignoring classes:
    H(9/15, 6/15) = 0.292
  – The GainRatio value for the gender attribute:
    0.09688/0.292 = 0.332
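A quick check of these two numbers (not from the slides), mirroring the slide's use of the rounded 0.292 in the ratio; base-10 logs.

```python
from math import log10

def entropy(probs):
    return sum(p * log10(1 / p) for p in probs if p > 0)

gain_gender = 0.09688                          # information gain for gender, from the ID3 example
split_info = round(entropy([9/15, 6/15]), 3)   # entropy of the split itself, ignoring classes
print(split_info)                              # 0.292
print(round(gain_gender / split_info, 3))      # 0.332
```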
C5.0
• A commercial version of C4.5, widely used in many data mining packages.
• Targeted toward use with large datasets.
  – Produces more accurate rules.
  – Improves on memory usage by 90%.
  – Runs much faster than C4.5.
CART
• Creates a binary tree.
• Uses entropy to choose the best splitting attribute (as with ID3).
• Formula to choose split point, s, for node t:
  ϕ(s/t) = 2 PL PR Σj |P(Cj|tL) - P(Cj|tR)|
  – PL, PR: probability that a tuple in the training set will be on the left or right side of the tree.
  – P(Cj|tL), P(Cj|tR): probability that a tuple is in class Cj and in the left (or right) subtree.

CART Example
• At the start, there are six choices for split point (right branch on equality):
  ϕ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.288
  ϕ(1.6) = 0
  ϕ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
  ϕ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
  ϕ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.288
  ϕ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Best split at 1.8.
• What is next?
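A sketch of the ϕ computation (not from the slides). It reads P(Cj|tL) and P(Cj|tR) as fractions of the whole training set, which is how terms such as 4/15 and 6/15 above are formed; that reading is an assumption, but it reproduces the values listed for the height splits.

```python
# phi(s/t) = 2 * PL * PR * sum over classes of |P(Cj, left) - P(Cj, right)|,
# with the right branch taken on equality, as stated above.
classes = ["Short", "Medium", "Tall"]

# (height, Output1) pairs from the Height Example Data table
rows = [(1.60, "Short"), (2.00, "Tall"), (1.90, "Medium"), (1.88, "Medium"),
        (1.70, "Short"), (1.85, "Medium"), (1.60, "Short"), (1.70, "Short"),
        (2.20, "Tall"), (2.10, "Tall"), (1.80, "Medium"), (1.95, "Medium"),
        (1.90, "Medium"), (1.80, "Medium"), (1.75, "Medium")]
n = len(rows)

def phi(split):
    left = [c for h, c in rows if h < split]    # right branch on equality
    right = [c for h, c in rows if h >= split]
    diff = sum(abs(left.count(c) - right.count(c)) / n for c in classes)
    return 2 * (len(left) / n) * (len(right) / n) * diff

for s in (1.6, 1.7, 1.8, 1.9, 2.0):
    print(s, round(phi(s), 3))   # 0.0, 0.169, 0.385, 0.288, 0.32 -- the best split is at 1.8
```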
Scalable DT Techniques
• SPRINT
  – Creation of DTs for large datasets.
  – Based on CART techniques.
Next Lecture:
• Rule-based algorithms
• Combining techniques