
Classification Trees
Stat 557
Heike Hofmann
Outline
• Growing Trees
• Pruning
• Predictions & Assessment
• Next: Random Forests
Trees
• Breiman, Friedman, Olshen & Stone (1984)
• Situation:
categorical response Y
set of explanatory variables X1, ..., Xp
• Goal: find a partition of the data space spanned by X1, ..., Xp that has homogeneous response
Example
http://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.names
Attribute Information:
1. lettr   capital letter (26 values from A to Z)
2. x-box   horizontal position of box (integer)
3. y-box   vertical position of box (integer)
4. width   width of box (integer)
5. high    height of box (integer)
6. onpix   total # on pixels (integer)
7. x-bar   mean x of on pixels in box (integer)
8. y-bar   mean y of on pixels in box (integer)
9. x2bar   mean x variance (integer)
10. y2bar  mean y variance (integer)
11. xybar  mean x y correlation (integer)
12. x2ybr  mean of x * x * y (integer)
13. xy2br  mean of x * y * y (integer)
14. x-ege  mean edge count left to right (integer)
15. xegvy  correlation of x-ege with y (integer)
16. y-ege  mean edge count bottom to top (integer)
17. yegvx  correlation of y-ege with x (integer)
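A minimal sketch of loading these data in R, assuming the data file letter-recognition.data sits next to the .names file at the URL above; the columns are labelled V1-V17 here so that they match the variable names used in the tree plots on the following slides (letters.df is a name chosen for illustration):

letters.df <- read.csv(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data",
  header = FALSE)
names(letters.df) <- paste0("V", 1:17)   # V1 = lettr (the class), V2-V17 = the 16 features
letters.df$V1 <- factor(letters.df$V1)   # make the response a factor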
Example
[Figure: classification tree for the letter data; internal nodes split on V7-V17 (root split V12 < 2.5), leaves are labelled with predicted letters]
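A hedged sketch of how a tree like the one in the figure can be grown and drawn with rpart (default settings; the exact tuning behind the figure is not shown on the slide):

library(rpart)
letter.rp <- rpart(V1 ~ ., data = letters.df, method = "class")  # classification tree for the letter
plot(letter.rp)   # tree skeleton
text(letter.rp)   # add split rules and leaf labels, as in the figure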
Questions
• Construction?
• Model properties:
• goodness of fit?
• parameter estimates?
Split #1
[Figure: distribution of V12 coloured by letter (V1), shown both as counts and as proportions]
Split #2
[Figure: distribution of V10 coloured by letter (V1), shown both as counts and as proportions]
Split Evaluation
• Criterion for each split:
compare homogeneity after splitting with
homogeneity before splitting
• Homogeneity measures:
Gini index (default in rpart)
information measure
...
Gini Index
• probabilistic view:
for each node i we have class probabilities p_ik (node sample size n_i)
• Definition:
G(i) = Σ_{k=1}^{K} p̂_ik (1 − p̂_ik)
with p̂_ik = (1/n_i) Σ_{j=1}^{n_i} I(y_j = k), the sum running over the n_i observations in node i
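As a sketch, the Gini index of a single node can be computed directly from the response values y that fall into it (helper name gini chosen here for illustration):

gini <- function(y) {
  p <- prop.table(table(y))   # class proportions p-hat_ik within the node
  sum(p * (1 - p))            # G(i)
}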
Deviance/Entropy/Information Criterion
For node i:
• Entropy: E(i) = − Σ_{k=1}^{K} p̂_ik log p̂_ik
• Deviance: D(i) = −2 · Σ_{k=1}^{K} n_ik log p̂_ik
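A matching sketch for the entropy and deviance of a node (function names chosen here; the convention 0 · log 0 = 0 is applied explicitly):

entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(ifelse(p > 0, p * log(p), 0))        # E(i)
}
node.deviance <- function(y) {
  n <- table(y)                             # class counts n_ik
  p <- prop.table(n)
  -2 * sum(ifelse(n > 0, n * log(p), 0))    # D(i) = -2 * sum n_ik log p-hat_ik
}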
Construction
• Starting at the root, find the best split at each node by exhaustive search:
for each variable Xi, generate all possible splits, compute the homogeneity of
the resulting children, and keep the best split over all variables
(a minimal sketch of this search follows the figure below)
[Figure: distribution of V11 coloured by letter, and the combined (left + right) homogeneity as a function of the candidate split point x]
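A minimal sketch of that exhaustive search for one numeric variable, scoring each candidate cut by the weighted Gini impurity of the two children (this reuses the gini() helper sketched earlier; rpart's internal implementation differs):

best.split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints between observed values
  imp <- sapply(cuts, function(cut) {
    left <- x < cut
    (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
  })
  list(cut = cuts[which.min(imp)], impurity = min(imp))
}
# e.g. best.split(letters.df$V12, letters.df$V1) scores every cut on V12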
Some Stopping Rules
• Nodes are homogeneous "enough"
• Nodes are small (n_i < 20 in rpart, n_i < 10 in tree)
• "Elbow" criterion: the gain in homogeneity levels out
• Minimize error (e.g. cross-validation)
• Minimize the cost-complexity measure R_a = R + a * size
(R: homogeneity measure evaluated at the leaves; a > 0: penalty term for complexity)
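In rpart these stopping rules correspond to arguments of rpart.control(); a hedged sketch (the values shown are the documented package defaults):

ctrl <- rpart.control(minsplit = 20,  # do not try to split nodes with fewer than 20 cases
                      cp = 0.01,      # required complexity-adjusted gain per split
                      xval = 10)      # cross-validation folds, used later for pruning
letter.rp <- rpart(V1 ~ ., data = letters.df, method = "class", control = ctrl)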
Pruning
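A short sketch of cost-complexity pruning with rpart: inspect the cross-validated error across cp values, then cut the tree back at a chosen cp (0.02 is an arbitrary illustration value):

printcp(letter.rp)    # table of cp values with cross-validated error
plotcp(letter.rp)     # "elbow" plot of error against cp
letter.pruned <- prune(letter.rp, cp = 0.02)   # prune back to the chosen cp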
Prediction
• prediction based on the tree object: all members of a leaf are assigned the leaf's modal class (predict, type="class")
• alternative: matrix of class probabilities (predict, type="prob")
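Sketch of both prediction types from a fitted rpart object (predicting back onto the training data here purely for illustration):

pred.class <- predict(letter.rp, newdata = letters.df, type = "class")  # modal class of each leaf
pred.prob  <- predict(letter.rp, newdata = letters.df, type = "prob")   # matrix of class probabilities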
Letter Recognition
Predicted values
[Figure: counts of predicted letter classes from the fitted tree]
not all letters are recognized
Misclassification matrix
y: observations
x: predictions
[Figure: heatmap of observed letters (y) against predicted letters (x)]
Ks are classified as Xs
Ys are classified as Ds, Ts or Vs
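The misclassification matrix is a cross-tabulation of observed against predicted letters; a sketch using the predictions from above:

conf <- table(observed = letters.df$V1, predicted = pred.class)
round(prop.table(conf, margin = 1), 2)   # row-wise rates: where each observed letter ends up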
Re-run tree with smaller cp rate
[Figure: deeper classification tree grown with a smaller cp value; more letters now appear as leaves]
Loss matrix
• by default: 1s off the diagonal, 0s on the diagonal
• adjust according to risk:
- are all types of misclassification equally serious?
- in the binary situation: is a false positive as bad as a false negative?
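In rpart the loss matrix is passed through parms = list(loss = ...), with zeros on the diagonal; a sketch in which the G/O penalty of 5 is an arbitrary illustration value:

lev <- levels(letters.df$V1)
L <- matrix(1, nrow = length(lev), ncol = length(lev), dimnames = list(lev, lev))
diag(L) <- 0
L["G", "O"] <- L["O", "G"] <- 5   # make G/O confusions five times as costly
letter.loss <- rpart(V1 ~ ., data = letters.df, method = "class",
                     parms = list(loss = L))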
Gs vs Os
adjust the loss matrix so that mistaking Gs for Os (and vice versa) is penalized more heavily
[Figure: classification tree fitted with the adjusted loss matrix, and the resulting misclassification matrix of observed (y) vs predicted (x) letters]
Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• Tree ensemble built by randomly sampling cases and variables
• Each case is classified once by each tree in the ensemble
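A hedged sketch with the randomForest package (not shown on the slides): each tree is grown on a bootstrap sample of cases, and a random subset of variables is tried at each split.

library(randomForest)
letter.rf <- randomForest(V1 ~ ., data = letters.df, ntree = 500)
letter.rf$confusion   # out-of-bag confusion matrix by letter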
Fitting Strategies
• Eliminate variables that are independent of
the response
Classifly
• Install ggobi (probably not necessary) and rggobi, both available at www.ggobi.org
• in R, install the package classifly
• run the demo: example(classifly)
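The same steps as plain R code (package names as distributed via CRAN / www.ggobi.org; availability may have changed):

install.packages(c("rggobi", "classifly"))
library(classifly)
example(classifly)    # runs the examples from the classifly help page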