Decision Tree Pruning
Problem Statement
• We would like to output a small decision tree
Model Selection
• Tree building continues until the training error is zero
• Option 1: Stop early
  Stop when the decrease in the index function is small
  Cons: may miss structure
• Option 2: Prune after building
Pruning
• Input: tree T
• Sample: S
• Output: tree T'
• Basic pruning: T' is a sub-tree of T
  Can only replace inner nodes by leaves
• More advanced:
Replace an inner node by one of its children
Reduced Error Pruning
• Split the sample into two parts S1 and S2
• Use S1 to build a tree
• Use S2 to decide whether to prune
• Process every inner node v
  After all its children have been processed
  Compute the observed error of Tv and of leaf(v)
  If leaf(v) has fewer errors, replace Tv by leaf(v)
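The procedure above can be sketched in Python (the tuple encoding of trees and all helper names are illustrative, not from the slides):

```python
# A tree is either a leaf ('leaf', label) or an inner node
# ('node', feature, left_child, right_child, majority_label).

def predict(tree, x):
    if tree[0] == 'leaf':
        return tree[1]
    _, feat, left, right, _ = tree
    return predict(left if x[feat] == 0 else right, x)

def errors(tree, samples):
    return sum(1 for x, y in samples if predict(tree, x) != y)

def prune(tree, samples):
    """Bottom-up pass: process a node after its children; replace T_v by
    leaf(v) when the leaf makes no more errors on the pruning sample."""
    if tree[0] == 'leaf':
        return tree
    _, feat, left, right, label = tree
    left_s = [(x, y) for x, y in samples if x[feat] == 0]
    right_s = [(x, y) for x, y in samples if x[feat] == 1]
    node = ('node', feat, prune(left, left_s), prune(right, right_s), label)
    leaf = ('leaf', label)
    return leaf if errors(leaf, samples) <= errors(node, samples) else node
```

Here `samples` plays the role of S2: the tree was built on S1, and pruning decisions use only the held-out part.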
Reduced Error Pruning: Example
Pruning: CV & SRM
• For each pruning size, compute the minimal-error pruning
  There are at most m different sub-trees
• Select between the prunings using:
  Cross validation
  Structural risk minimization (SRM)
  Any other index method
Finding the minimum pruning
• Procedure Compute
• Inputs:
  k : number of errors
  T : tree
  S : sample
• Output:
  P : pruned tree
  size : size of P
Procedure compute
• IF IsLeaf(T)
  IF Errors(T) ≤ k
  • THEN size = 1
  • ELSE size = ∞
  P = T; return
• IF Errors(root(T)) ≤ k
  size = 1; P = root(T); return
Procedure compute
• For i = 0 to k DO
  Call Compute(i, T[0], S0, size_{i,0}, P_{i,0})
  Call Compute(k-i, T[1], S1, size_{i,1}, P_{i,1})
• size = min_i { size_{i,0} + size_{i,1} + 1 }
• I = argmin_i { size_{i,0} + size_{i,1} + 1 }
• P = MakeTree(root(T), P_{I,0}, P_{I,1})
What is the time complexity?
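A runnable sketch of Procedure Compute (the dict encoding of trees and the helper `leaf_errors` are illustrative): given an error budget k, it returns the smallest pruning of T with at most k errors on S, splitting the budget between the two children.

```python
import math

def leaf_errors(label, S):
    return sum(1 for x, y in S if y != label)

def compute(k, T, S):
    """Return (size, P): the smallest pruning of T with at most k errors
    on S, or size = infinity if no such pruning exists."""
    label = T['label']                    # majority label at the root of T
    if 'children' not in T:               # IsLeaf(T)
        if leaf_errors(label, S) <= k:
            return 1, T
        return math.inf, T
    if leaf_errors(label, S) <= k:        # a single leaf already suffices
        return 1, {'label': label}
    feat = T['feature']
    S0 = [(x, y) for x, y in S if x[feat] == 0]
    S1 = [(x, y) for x, y in S if x[feat] == 1]
    best_size, best_P = math.inf, None
    for i in range(k + 1):                # split the error budget i / (k - i)
        size0, P0 = compute(i, T['children'][0], S0)
        size1, P1 = compute(k - i, T['children'][1], S1)
        if size0 + size1 + 1 < best_size:
            best_size = size0 + size1 + 1
            best_P = {'label': label, 'feature': feat, 'children': [P0, P1]}
    return best_size, best_P
```

The loop over the budget split at every node is what drives the quadratic running time discussed below.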
Cross Validation
• Split the sample into S1 and S2
• Build a tree using S1
• Compute the candidate prunings
• Select using S2
• Output the pruning with the smallest error on S2
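The selection step is just a minimum over observed errors; a minimal sketch (each candidate pruning is represented as a predictor function, an illustrative choice):

```python
def cv_select(candidates, S2):
    """Return the candidate pruning with the smallest observed error on S2."""
    return min(candidates, key=lambda h: sum(1 for x, y in S2 if h(x) != y))
```

In the full procedure, `candidates` would be the minimal-error prunings computed for each size on S1.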
SRM
• Build a tree T using S
• Compute the candidate prunings
  k_d : the size of the pruning with d errors
• Select using the SRM formula:
  min_d { obs(T_d) + sqrt( k_d / m ) }
Drawbacks
• Running time
  Since |T| = O(m), the running time is O(m^2)
  Many passes over the data
• Significant drawback for large data sets
Linear Time Pruning
• Single bottom-up pass
  Linear time
• Uses an SRM-like formula
  Local soundness
• Competitive with any pruning
Algorithm
• Process a node after processing its children
• Local parameters:
  Tv : current sub-tree at v, of size sizev
  Sv : sample reaching v, of size mv
  lv : length of the path leading to v
• Local test (prune v to a leaf when):
  obs(Tv, Sv) + a(mv, sizev, lv, δ) > obs(root(Tv), Sv)
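The whole algorithm can be sketched in one recursive pass (the dict tree encoding is illustrative, and a() uses the finite-class formula from the following slides with an assumed constant c = 1):

```python
import math

def a(m, size, l, delta, h_size, c=1.0):
    # Finite-class penalty: c * sqrt(((l + size) log|H| + log(m/delta)) / m)
    return c * math.sqrt(((l + size) * math.log(h_size)
                          + math.log(m / delta)) / m)

def predict(T, x):
    while 'children' in T:
        T = T['children'][x[T['feature']]]
    return T['label']

def lt_prune(T, S, l, delta, h_size):
    """Return (pruned tree, its size). Processes each node after its
    children; replaces T_v by leaf(v) when
    obs(T_v, S_v) + a(m_v, size_v, l_v, delta) > obs(root(T_v), S_v)."""
    if 'children' not in T or not S:
        return ({'label': T['label']}, 1)
    feat = T['feature']
    S0 = [(x, y) for x, y in S if x[feat] == 0]
    S1 = [(x, y) for x, y in S if x[feat] == 1]
    c0, s0 = lt_prune(T['children'][0], S0, l + 1, delta, h_size)
    c1, s1 = lt_prune(T['children'][1], S1, l + 1, delta, h_size)
    Tv = {'label': T['label'], 'feature': feat, 'children': [c0, c1]}
    size_v, m_v = s0 + s1 + 1, len(S)
    obs_tree = sum(predict(Tv, x) != y for x, y in S) / m_v
    obs_leaf = sum(T['label'] != y for _, y in S) / m_v
    if obs_tree + a(m_v, size_v, l, delta, h_size) > obs_leaf:
        return ({'label': T['label']}, 1)
    return (Tv, size_v)
```

Each sample point is touched once per level on its path, so the pass is a single sweep over the data rather than one pass per candidate pruning.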
The function a()
• Parameters:
  paths(H, l) : set of paths of length l over H
  trees(H, s) : set of trees of size s over H
• Formula:
  a(m, size, l, δ) = c · sqrt( ( log|paths(H,l)| + log|trees(H,size)| + log(m/δ) ) / m )
The function a()
• Finite class H:
  |paths(H,l)| ≤ |H|^l
  |trees(H,s)| ≤ (4|H|)^s
• Formula:
  a(m, size, l, δ) = c · sqrt( ( (l + size) log|H| + log(m/δ) ) / m )
• Infinite classes: use the VC dimension
Example
[Figure: an example tree marking a node v with lv = 3, its sample size mv out of m, its sub-tree size sizev, and the resulting penalty a(mv, sizev, lv, δ)]
Local uniform convergence
• Sample S
  Sc = { x ∈ S | c(x) = 1 }, mc = |Sc|
• Finite classes C and H
  e(h|c) = Pr[ h(x) ≠ f(x) | c(x) = 1 ]
  obs(h|c) : the observed error on Sc
• Lemma: with probability 1-δ, for every c ∈ C and h ∈ H
  | e(h|c) - obs(h|c) | ≤ O( sqrt( ( log|C| + log|H| + log(1/δ) ) / mc ) )
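The lemma is the usual combination of a Chernoff–Hoeffding bound for a single pair (c, h) with a union bound over all |C|·|H| pairs; a sketch with constants omitted:

```latex
% Hoeffding on the m_c samples of S_c, for one fixed pair (c,h),
% with failure probability \delta':
%   |e(h|c) - \mathrm{obs}(h|c)| \le \sqrt{\ln(2/\delta') / (2 m_c)}.
% Setting \delta' = \delta / (|C|\,|H|) and taking the union bound:
\Pr\Big[\exists\, c,h:\ |e(h|c)-\mathrm{obs}(h|c)| > \epsilon\Big]
  \le 2\,|C|\,|H|\,e^{-2\epsilon^2 m_c}
\quad\Longrightarrow\quad
\epsilon = O\!\left(\sqrt{\tfrac{\log|C|+\log|H|+\log(1/\delta)}{m_c}}\right).
```

In the pruning analysis, C is instantiated with paths (which determine the conditioning event c) and H with sub-trees.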
Global Analysis
• Notation:
  T : original tree (depends on S)
  T* : pruned tree
  Topt : optimal tree
  rv = (lv + sizev) log|H| + log(mv/δ)
  a(mv, sizev, lv, δ) = O( sqrt( rv / mv ) )
Sub-Tree Property
• Lemma: with probability 1-δ,
  T* is a sub-tree of Topt
• Proof:
  Assume all the local lemmas hold.
  Each pruning step reduces the error.
  Assume T* has a sub-tree outside Topt:
  adding that sub-tree to Topt would improve it!
Comparing T* and Topt
• Additional pruned nodes: V = { v1, ..., vt }
• Additional error:
  e(T*) - e(Topt) = Σ_i ( e(vi) - e(T*_{vi}) ) Pr[vi]
• Claim: with high probability
  e(T*) - e(Topt) ≤ 4 Σ_{i=1}^{t} Pr[vi] · sqrt( r_{vi} / m_{vi} )
Analysis
• Lemma: with probability 1-δ,
  if Pr[vi] > 12 ( lopt log|H| + log(1/δ) ) / m = b
  then Pr[vi] < 2 obs(vi)
• Proof:
  Relative Chernoff bound.
  Union bound over the |H|^l paths.
• V' = { vi ∈ V | Pr[vi] > b }
Analysis of Σ
• Split the sum over V into V' and V - V':
  Σ_{vi ∈ V} Pr[vi] sqrt( r_{vi} / m_{vi} )
    = Σ_{vi ∈ V'} Pr[vi] sqrt( r_{vi} / m_{vi} ) + Σ_{vi ∈ V - V'} Pr[vi] sqrt( r_{vi} / m_{vi} )
• The sum over V - V' is bounded by sopt · b
Analysis of Σ over V'
• Using Pr[vi] ≈ m_{vi} / m and Cauchy-Schwarz:
  Σ_{vi ∈ V'} Pr[vi] sqrt( r_{vi} / m_{vi} ) ≤ (1/m) sqrt( ( Σ_{vi ∈ V'} r_{vi} ) ( Σ_{vi ∈ V'} m_{vi} ) )
• Sum of m_v ≤ lopt · m
• Sum of r_v ≤ sizeopt ( lopt log|H| + log(m/δ) )
• Putting it all together yields the claimed bound