Classification Trees
Stat 557
Heike Hofmann

Outline
• Growing Trees
• Pruning
• Predictions & Assessment
• Next: Random Forests

Trees
• Breiman & Olshen (1984)
• Situation: categorical response Y, set of explanatory variables X1, ..., Xp
• Goal: find a partition of the data space X1, ..., Xp that has a homogeneous response

Example
http://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.names
Attribute Information:
1. lettr: capital letter (26 values from A to Z)
2. x-box: horizontal position of box (integer)
3. y-box: vertical position of box (integer)
4. width: width of box (integer)
5. high: height of box (integer)
6. onpix: total # of on pixels (integer)
7. x-bar: mean x of on pixels in box (integer)
8. y-bar: mean y of on pixels in box (integer)
9. x2bar: mean x variance (integer)
10. y2bar: mean y variance (integer)
11. xybar: mean x y correlation (integer)
12. x2ybr: mean of x * x * y (integer)
13. xy2br: mean of x * y * y (integer)
14. x-ege: mean edge count left to right (integer)
15. xegvy: correlation of x-ege with y (integer)
16. y-ege: mean edge count bottom to top (integer)
17. yegvx: correlation of y-ege with x (integer)

Example
[Figure: fitted classification tree for the letter data; internal nodes show splits such as V12 < 2.5, V10 < 3.5, V8 < 9.5, and each leaf is labelled with the predicted letter.]

Questions
• Construction?
• Model properties:
  • goodness of fit?
  • parameter estimates?

Split #1
[Figure: stacked barchart (counts) and spineplot (proportions) of V12, coloured by letter (V1); the first split is at V12 < 2.5.]

Split #2
[Figure: stacked barchart and spineplot of V10, coloured by letter (V1), for the cases after the first split; the next split is at V10 < 3.5.]

Split Evaluation
• Criteria for each split: compare homogeneity after splitting with homogeneity before splitting
• Homogeneity measures:
  • Gini index (default in rpart)
  • information measure
  • ...

Gini Index
• Probabilistic view: for each node i we have class probabilities p_ik (with sample size n_i)
• Definition:
  G(i) = \sum_{k=1}^{K} \hat{p}_{ik} (1 - \hat{p}_{ik}),
  with \hat{p}_{ik} = \frac{1}{n_i} \sum_{j=1}^{n_i} I(Y_j = k)

Deviance / Entropy / Information Criterion
For node i:
• Entropy: E(i) = - \sum_{k=1}^{K} \hat{p}_{ik} \log \hat{p}_{ik}
• Deviance: D(i) = -2 \sum_{k=1}^{K} n_{ik} \log \hat{p}_{ik}

Construction
• Starting with the root, find the best split at each node using an exhaustive search, i.e. for each variable Xi generate all possible splits and compute the homogeneity, then select the best split for the best variable (an R sketch with rpart follows the stopping rules below)
[Figure: stacked barchart of V11 coloured by letter, and the combined homogeneity of the left + right child nodes plotted against the candidate split point x.]

Some Stopping Rules
• Nodes are homogeneous "enough"
• Nodes are small (n_i < 20 in rpart, n_i < 10 in tree)
• Minimize error (e.g. cross-validation)
• "Elbow" criterion: the gain in homogeneity levels out
• Minimize the cost-complexity measure R_a = R + a * size, where R is the homogeneity measure evaluated at the leaves and a > 0 is a penalty term for complexity
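The construction and stopping rules above can be tried directly in R with rpart. This is a minimal sketch, not the code behind the slides: it assumes the UCI file letter-recognition.data has been downloaded into the working directory, and that reading it without a header gives the default column names V1 (the letter) through V17, which match the split labels (V12, V10, ...) in the example tree.

```r
library(rpart)

# Assumed file name; the data have no header row, so columns become V1..V17
letters.df <- read.csv("letter-recognition.data", header = FALSE)
letters.df$V1 <- factor(letters.df$V1)   # response: letters A to Z

# Grow a classification tree; the Gini index is the default split criterion
fit.gini <- rpart(V1 ~ ., data = letters.df, method = "class")

# Same model grown with the information (entropy) criterion instead
fit.info <- rpart(V1 ~ ., data = letters.df, method = "class",
                  parms = list(split = "information"))

# Stopping rules are set through rpart.control:
#   minsplit: smallest node rpart will try to split (default 20)
#   cp:       complexity penalty, playing the role of a in R_a = R + a * size
fit.big <- rpart(V1 ~ ., data = letters.df, method = "class",
                 control = rpart.control(minsplit = 20, cp = 0.001))

plot(fit.gini); text(fit.gini)   # tree display as in the slides
printcp(fit.big)                 # cross-validated error for each cp value
```

A smaller cp lets the tree grow deeper, which is the "re-run with smaller cp" step shown later in the slides.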
Pruning

Prediction
• Prediction based on the tree object: all members of a leaf are assigned the modal class of that leaf (predict, type = 'class')
• Alternative: matrix of class probabilities (predict, type = 'prob')
• (A combined R sketch of prediction, the loss matrix, and random forests is given at the end of these notes.)

Letter Recognition
[Figure: barchart of the predicted values (counts by letter A to Z); not all letters are recognized.]

Misclassification matrix
[Figure: misclassification matrix, y: observations, x: predictions. Ks are classified as Xs; Ys are classified as Ds, Ts or Vs.]

Re-run tree with smaller cp rate
[Figure: tree refit with a smaller cp value; the first splits are again V12 < 2.5 and V10 < 3.5, but the tree is much deeper with many more leaves.]

Loss matrix
• By default: 1s off the diagonal, 0s on the diagonal
• Adjust corresponding to risk:
  - Are all types of misclassifications similar?
  - In a binary situation: is a false positive as bad as a false negative?

Gs vs Os
• Adjust the loss matrix such that mistaken Gs and Os are penalized
[Figure: tree and misclassification matrix after refitting with the adjusted loss matrix.]

Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• Tree ensemble built by randomly sampling cases and variables
• Each case is classified once by each tree in the ensemble

Fitting Strategies
• Eliminate variables that are independent of the response

Classifly
• Install ggobi (probably not necessary) and rggobi, both available at www.ggobi.org
• In R, install the package classifly
• Run the demo: example(classifly)
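To make the prediction, loss-matrix, and random-forest steps concrete, here is a minimal R sketch continuing from the rpart code block earlier in these notes. It is a sketch under assumptions, not the code behind the slides: the data frame letters.df and the fit fit.gini come from that block, and the penalty of 5 for G/O confusions is an illustrative value only.

```r
library(rpart)
library(randomForest)

# Prediction: each observation gets the modal class of its leaf ...
pred.class <- predict(fit.gini, letters.df, type = "class")
# ... or the full matrix of estimated class probabilities
pred.prob  <- predict(fit.gini, letters.df, type = "prob")

# Misclassification (confusion) matrix: rows = observations, columns = predictions
conf <- table(observed = letters.df$V1, predicted = pred.class)

# Loss matrix: by default 0 on the diagonal and 1 off the diagonal.
# Here confusions between G and O are made 5 times as costly (illustrative value).
K <- nlevels(letters.df$V1)
loss <- matrix(1, K, K)
diag(loss) <- 0
g <- which(levels(letters.df$V1) == "G")
o <- which(levels(letters.df$V1) == "O")
loss[g, o] <- loss[o, g] <- 5
fit.loss <- rpart(V1 ~ ., data = letters.df, method = "class",
                  parms = list(loss = loss))

# Random forest: an ensemble of trees grown on bootstrap samples of the cases,
# with each split chosen from a random subset of the variables
fit.rf <- randomForest(V1 ~ ., data = letters.df, ntree = 500)
fit.rf$confusion   # out-of-bag misclassification matrix
```

Comparing conf with the confusion matrix from fit.loss shows how the adjusted loss matrix shifts errors away from the penalized G/O cells.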