Decision Tree Pruning

Problem Statement
• We would like to output a small decision tree.

Model Selection
• Tree building continues until zero training error.
• Option 1: Stop early, when the index function shows only a small decrease.
  Con: may miss structure.
• Option 2: Prune after building.

Pruning
• Input: a tree T
• Sample: S
• Output: a tree T'
• Basic pruning: T' is a sub-tree of T; we may only replace inner nodes by leaves.
• More advanced: replace an inner node by one of its children.

Reduced Error Pruning
• Split the sample into two parts, S1 and S2.
• Use S1 to build a tree.
• Use S2 to decide whether to prune.
• Process every inner node v after all its children have been processed:
  compute the observed error of T_v and of leaf(v);
  if leaf(v) makes no more errors, replace T_v by leaf(v).
  (A code sketch follows these slides.)

Reduced Error Pruning: Example
• [Figure: a worked pruning example.]

Pruning: CV & SRM
• Generate: for each pruning size, compute the minimal-error pruning.
  There are at most m different sub-trees.
• Select between the prunings using:
  cross validation,
  structural risk minimization,
  or any other index method.

Finding the Minimum Pruning
• Procedure Compute
• Inputs:
  k: number of errors
  T: tree
  S: sample
• Output:
  P: pruned tree
  size: size of P

Procedure Compute
• IF IsLeaf(T):
  IF Errors(T) ≤ k THEN size = 1 ELSE size = ∞;
  P = T; return.
• IF Errors(root(T)) ≤ k (pruning T to a single leaf):
  size = 1; P = root(T); return.

Procedure Compute (cont.)
• FOR i = 0 TO k DO:
  Call Compute(i, T[0], S_0, size_{i,0}, P_{i,0})
  Call Compute(k - i, T[1], S_1, size_{i,1}, P_{i,1})
• size = min_i { size_{i,0} + size_{i,1} + 1 }
• I = argmin_i { size_{i,0} + size_{i,1} + 1 }
• P = MakeTree(root(T), P_{I,0}, P_{I,1})
• What is the time complexity? (See the sketch after these slides.)

Cross Validation
• Split the sample into S1 and S2.
• Build a tree using S1.
• Compute the candidate prunings.
• Select using S2.
• Output the tree with the smallest error on S2.

SRM
• Build a tree T using S.
• Compute the candidate prunings; let k_d be the size of the pruning with d errors.
• Select using the SRM formula:
  $\min_d \left\{ obs(T_d) + \sqrt{\frac{k_d}{m}} \right\}$

Drawbacks
• Running time: since |T| = O(m), the running time is O(m^2).
• Many passes over the data: a significant drawback for large data sets.

Linear Time Pruning
• A single bottom-up pass: linear time.
• Uses an SRM-like formula.
• Local soundness.
• Competitive with any pruning.

Algorithm
• Process a node after processing its children.
• Local parameters:
  T_v: the current sub-tree at v, of size size_v
  S_v: the sample reaching v, of size m_v
  l_v: the length of the path leading to v
• Local test: replace T_v by leaf(v) when
  $obs(T_v, S_v) + a(m_v, size_v, l_v, \delta) > obs(root(T_v), S_v)$
  (a sketch follows these slides).

The Function a()
• Parameters:
  paths(H, l): the set of paths of length l over H
  trees(H, s): the set of trees of size s over H
• Formula:
  $a(m, size, l, \delta) = c\sqrt{\frac{\log|paths(H, l)| + \log|trees(H, size)| + \log(m/\delta)}{m}}$

The Function a(): Finite Classes
• For a finite class H: $|paths(H, l)| \le |H|^{l}$ and $|trees(H, s)| \le (4|H|)^{s}$.
• Formula:
  $a(m, size, l, \delta) = c\sqrt{\frac{(l + size)\log|H| + \log(m/\delta)}{m}}$
• Infinite classes: use the VC dimension.

Example
• [Figure: a tree annotated with the local parameters l_v = 3, m_v, and size_v, and the resulting penalty a(m_v, size_v, l_v, δ).]
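To make reduced error pruning concrete, here is a minimal Python sketch of the procedure described above. The Node class, the predict/errors helpers, and the majority-label field are illustrative assumptions rather than anything specified on the slides.

```python
class Node:
    """A hypothetical decision-tree node: inner nodes test a feature,
    leaves predict the majority label of the training data reaching them."""
    def __init__(self, feature=None, children=None, label=None):
        self.feature = feature            # feature tested at an inner node
        self.children = children or {}    # feature value -> child Node
        self.label = label                # majority label at this node

    def is_leaf(self):
        return not self.children

def predict(node, x):
    while not node.is_leaf():
        child = node.children.get(x[node.feature])
        if child is None:                 # unseen value: fall back to majority
            break
        node = child
    return node.label

def errors(node, sample):
    """Observed number of errors of the subtree rooted at node."""
    return sum(1 for x, y in sample if predict(node, x) != y)

def reduced_error_prune(node, s2):
    """Process every inner node after all its children: replace T_v by
    leaf(v) if the leaf makes no more errors on the part of S2 reaching v."""
    if node.is_leaf():
        return node
    for value, child in node.children.items():
        node.children[value] = reduced_error_prune(
            child, [(x, y) for x, y in s2 if x[node.feature] == value])
    leaf_errors = sum(1 for _, y in s2 if y != node.label)
    if leaf_errors <= errors(node, s2):
        return Node(label=node.label)     # prune: replace the subtree by a leaf
    return node
```

Ties go to the leaf (the <= comparison), which prefers the smaller tree when the observed errors on S2 are equal.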
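Procedure Compute can likewise be sketched as a recursion on a binary tree: given an error budget k, return a minimum-size pruning of T making at most k errors on S. This reuses the Node and errors helpers from the previous sketch; the split of the budget between the two subtrees mirrors the FOR loop on the slide.

```python
import math

def compute(k, tree, sample):
    """Return (size, P): a minimum-size pruning of tree with at most k
    errors on sample, or size = infinity if no such pruning exists."""
    if tree.is_leaf():
        return (1, tree) if errors(tree, sample) <= k else (math.inf, tree)
    # pruning T to the single leaf root(T) is optimal whenever it fits the budget
    root_leaf = Node(label=tree.label)
    if errors(root_leaf, sample) <= k:
        return 1, root_leaf
    s0 = [(x, y) for x, y in sample if x[tree.feature] == 0]
    s1 = [(x, y) for x, y in sample if x[tree.feature] == 1]
    best_size, best_children = math.inf, None
    for i in range(k + 1):                # split the error budget: i and k - i
        size0, p0 = compute(i, tree.children[0], s0)
        size1, p1 = compute(k - i, tree.children[1], s1)
        if size0 + size1 + 1 < best_size:
            best_size = size0 + size1 + 1
            best_children = {0: p0, 1: p1}
    if best_children is None:             # no pruning stays within the budget
        return math.inf, tree
    return best_size, Node(feature=tree.feature,
                           children=best_children, label=tree.label)
```

On the slide's complexity question: the naive recursion revisits the same (node, budget) pairs; memoizing them gives a table of O(|T| · m) entries, and with |T| = O(m) this is the O(m^2) running time the Drawbacks slide points to.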
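Finally, the linear-time pruning pass and its local test can be sketched directly from the finite-class formula for a(). The constant c = 1, the explicit sample filtering, and the depth bookkeeping are assumptions added to keep the sketch self-contained; Node and errors are reused from the first sketch.

```python
import math

def a(m_v, size_v, l_v, delta, log_H, c=1.0):
    """Finite-class penalty: c * sqrt(((l + size) log|H| + log(m/delta)) / m)."""
    return c * math.sqrt(((l_v + size_v) * log_H + math.log(m_v / delta)) / m_v)

def obs(node, sample):
    """Observed error rate of the subtree rooted at node on sample."""
    return errors(node, sample) / len(sample) if sample else 0.0

def linear_time_prune(node, sample, depth, delta, log_H):
    """Single bottom-up pass; returns (possibly pruned subtree, its size).
    A node is processed only after its children have been processed."""
    if node.is_leaf() or not sample:
        return node, 1
    size_v = 1
    for value, child in node.children.items():
        s_child = [(x, y) for x, y in sample if x[node.feature] == value]
        node.children[value], child_size = linear_time_prune(
            child, s_child, depth + 1, delta, log_H)
        size_v += child_size
    leaf = Node(label=node.label)         # leaf(v): majority label at v
    # local test: obs(T_v, S_v) + a(m_v, size_v, l_v, delta) > obs(leaf(v), S_v)
    if obs(node, sample) + a(len(sample), size_v, depth, delta, log_H) > obs(leaf, sample):
        return leaf, 1
    return node, size_v
```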
Local Uniform Convergence
• Sample S; $S_c = \{x \in S \mid c(x) = 1\}$, $m_c = |S_c|$.
• Finite classes C and H.
  $e(h \mid c) = \Pr[h(x) \ne f(x) \mid c(x) = 1]$, and $obs(h \mid c)$ is its observed value on $S_c$.
• Lemma: with probability $1 - \delta$,
  $|e(h \mid c) - obs(h \mid c)| \le \sqrt{\frac{\log|C| + \log|H| + \log(1/\delta)}{m_c}}$

Global Analysis
• Notation:
  T: the original tree (depends on S)
  T*: the pruned tree
  T_opt: the optimal pruned tree
• $r_v = (l_v + size_v)\log|H| + \log(m_v/\delta)$
• $a(m_v, size_v, l_v, \delta) = O\!\left(\sqrt{r_v/m_v}\right)$

Sub-Tree Property
• Lemma: with probability $1 - \delta$, T* is a sub-tree of T_opt.
• Proof: assume all the local lemmas hold. Each pruning step reduces the error. Suppose T* kept a subtree that T_opt pruned; adding that subtree to T_opt would improve it, contradicting optimality.

Comparing T* and T_opt
• Additional pruned nodes (inner nodes of T_opt that become leaves in T*): $V = \{v_1, \ldots, v_t\}$.
• Additional error:
  $e(T^*) - e(T_{opt}) = \sum_{i=1}^{t}\big(e(v_i) - e(T_{v_i})\big)\Pr[v_i]$,
  where $e(v_i)$ is the error of the leaf at $v_i$ and $T_{v_i}$ is the subtree kept by T_opt.
• Claim: with high probability,
  $e(T^*) - e(T_{opt}) \le 4\sum_{i=1}^{t}\sqrt{\frac{r_{v_i}}{m_{v_i}}}\,\Pr[v_i]$

Analysis
• Lemma: with probability $1 - \delta$, if $\Pr[v_i] > 12(l_{opt}\log|H| + \log(1/\delta))/m = b$, then $\Pr[v_i] \le 2\,obs(v_i)$, where $obs(v_i) = m_{v_i}/m$ is the observed probability of reaching $v_i$.
• Proof: relative Chernoff bound, with a union bound over the $|H|^{l}$ paths.
• Let $V' = \{v_i \in V \mid \Pr[v_i] > b\}$.

Analysis of the Sum
• Split the sum:
  $\sum_{v_i \in V'}\sqrt{\frac{r_{v_i}}{m_{v_i}}}\Pr[v_i] + \sum_{v_i \in V \setminus V'}\sqrt{\frac{r_{v_i}}{m_{v_i}}}\Pr[v_i]$
• The sum over $V \setminus V'$ is bounded by $size_{opt} \cdot b$: each term has $\Pr[v_i] \le b$ and $\sqrt{r_{v_i}/m_{v_i}} \le 1$, and there are at most $size_{opt}$ terms.

Analysis of the Sum over V'
• Using $\Pr[v_i] \le 2 m_{v_i}/m$ and the Cauchy-Schwarz inequality:
  $\sum_{v_i \in V'}\sqrt{\frac{r_{v_i}}{m_{v_i}}}\Pr[v_i] \le \frac{2}{m}\sum_{v_i \in V'}\sqrt{r_{v_i} m_{v_i}} \le \frac{2}{m}\sqrt{\Big(\sum_{v_i \in V'} r_{v_i}\Big)\Big(\sum_{v_i \in V'} m_{v_i}\Big)}$
• $\sum_v m_v \le l_{opt} \cdot m$, since each sample point reaches at most $l_{opt}$ of the nodes.
• $\sum_v r_v \le size_{opt}\big(l_{opt}\log|H| + \log(m/\delta)\big)$.
• Putting it all together (see the derivation below):
  $e(T^*) - e(T_{opt}) = O\!\left(\frac{size_{opt}\big(l_{opt}\log|H| + \log(1/\delta)\big)}{m} + \sqrt{\frac{l_{opt}\, size_{opt}\big(l_{opt}\log|H| + \log(m/\delta)\big)}{m}}\right)$
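For completeness, the last three bullets assemble into the stated bound as follows; this derivation is reconstructed from the inequalities above, with constants left loose.

```latex
\begin{align*}
e(T^*) - e(T_{opt})
  &\le 4\sum_{v \in V \setminus V'} \sqrt{\tfrac{r_v}{m_v}}\,\Pr[v]
     + 4\sum_{v \in V'} \sqrt{\tfrac{r_v}{m_v}}\,\Pr[v] \\
  &\le 4\, size_{opt}\, b
     + \frac{8}{m}\sqrt{\Big(\sum_{v \in V'} r_v\Big)\Big(\sum_{v \in V'} m_v\Big)} \\
  &\le 4\, size_{opt}\, b
     + \frac{8}{m}\sqrt{size_{opt}\Big(l_{opt}\log|H| + \log\frac{m}{\delta}\Big) \cdot l_{opt}\, m} \\
  &= O\left(\frac{size_{opt}\big(l_{opt}\log|H| + \log\frac{1}{\delta}\big)}{m}
     + \sqrt{\frac{l_{opt}\, size_{opt}\big(l_{opt}\log|H| + \log\frac{m}{\delta}\big)}{m}}\right)
\end{align*}
```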