Decision trees

Characteristics
• supervised learning method
• rule-based classifier
• both regression and classification problems
  – categorical variable decision tree
  – continuous variable decision tree

Example: Playing tennis
We will go to the tennis court if it is not raining and the temperature is over 15 °C.
As decision rules (a code sketch of these rules follows at the end of this section):
If Outlook = "No rain" and Temperature >= 15 then Play_tennis = "Yes"
If Outlook = "Rain" and Temperature < 15 then Play_tennis = "No"

Basic terminology
• Root Node
• Splitting
• Decision Node
• Leaf/Terminal Node
• Pruning
• Branch/Sub-tree
• Parent and Child Node

Tree building process
• Recursive partitioning mechanism (see the sketch at the end of this section)

Process
1) The DT starts as a single root node.
2) For the root node, a variable and a partitioning of its values are determined (the variable with the best partitioning ability).
3) The data are split into distinct subgroups based on the chosen splitting criterion.
4) The new data subsets become the nodes of the first level of the DT.
5) Each of these new nodes again identifies the best variable for partitioning its subset and splits it according to the splitting criterion.
6) The node partitioning process is repeated until a stopping criterion is fulfilled for the particular node:
   – all or almost all data in the node belong to the same target category
   – the tree has reached its predefined maximum depth
   – the node has reached the minimum occupation size

Example: Playing tennis
[decision tree diagram]

Characteristics
• Pros:
  – not sensitive to multicollinearity
  – not sensitive to outliers
  – easy interpretation
  – handles both quantitative and categorical variables
• Cons:
  – small changes in the data can cause large changes in the results
  – the computation can become far more complex than for other algorithms
  – no regression – cannot be used for prediction of continuous variables
  – overfitting risk

CART
• binary splitting tree
• Gini coefficient (a worked numeric example follows at the end of this section)
  – for each node t: GI_t = 1 - \sum_{c=1}^{J} p_{tc}^2
  – for all K nodes of a split: GI_{Total} = \sum_{i=1}^{K} \frac{N_i}{N_T} GI_i
• Improvement (Gini Gain) = difference between the Gini in the parent node and the weighted Gini in the child nodes

CHAID
• general (multiway) tree
• chi-square statistic (test of independence; a worked numeric example follows at the end of this section)
  – target categories (Parent) vs. predictor categories (Child)
  – \chi^2 = \sum_{r=1}^{I} \sum_{c=1}^{J} \frac{(n_{rc} - e_{rc})^2}{e_{rc}}
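Code sketch: the Playing-tennis rules
A minimal Python sketch of the two example rules from the "Example: Playing tennis" slide. The function name play_tennis and the "Unknown" fallback for cases the two rules do not cover are illustrative assumptions, not part of the slides.

def play_tennis(outlook: str, temperature: float) -> str:
    # Rule 1: If Outlook = "No rain" and Temperature >= 15 then "Yes"
    if outlook == "No rain" and temperature >= 15:
        return "Yes"
    # Rule 2: If Outlook = "Rain" and Temperature < 15 then "No"
    if outlook == "Rain" and temperature < 15:
        return "No"
    # Cases not covered by the two example rules (illustrative fallback)
    return "Unknown"

print(play_tennis("No rain", 20))   # -> "Yes"
print(play_tennis("Rain", 10))      # -> "No"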
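Code sketch: recursive partitioning
A minimal sketch of the tree-building process above, assuming binary splits on numeric features and Gini impurity as the splitting criterion (the CART choices); the function names, the max_depth and min_size defaults, and the dict-based tree representation are all illustrative assumptions.

from collections import Counter

def gini(labels):
    # Gini impurity of one node: GI = 1 - sum_c p_c^2
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def best_split(X, y):
    # Steps 2/5: find the variable and threshold with the best partitioning
    # ability, i.e. the lowest weighted Gini of the resulting child nodes.
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            w = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or w < best[0]:
                best = (w, f, t, left, right)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_size=2):
    # Step 6: stopping criteria - pure node, min occupation size, max depth.
    if len(set(y)) == 1 or len(y) < min_size or depth >= max_depth:
        return Counter(y).most_common(1)[0][0]   # leaf: majority category
    split = best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = split
    # Steps 3-5: split the data and recurse into the new child nodes.
    return {"feature": f, "threshold": t,
            "left": build_tree([X[i] for i in left], [y[i] for i in left],
                               depth + 1, max_depth, min_size),
            "right": build_tree([X[i] for i in right], [y[i] for i in right],
                                depth + 1, max_depth, min_size)}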
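Worked example: Gini coefficient and Gini Gain (CART)
A numeric illustration of the CART formulas GI_t = 1 - \sum_c p_{tc}^2 and GI_{Total} = \sum_i (N_i/N_T) GI_i; the parent node and the split below are made up for the example.

# Hypothetical parent node with 10 records (6 "Yes", 4 "No"), split into
# two child nodes; all counts are invented for illustration.
parent = ["Yes"] * 6 + ["No"] * 4
left   = ["Yes"] * 5 + ["No"] * 1    # N_1 = 6
right  = ["Yes"] * 1 + ["No"] * 3    # N_2 = 4

def gini(labels):
    # GI = 1 - sum over target categories c of p_c^2
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

gi_parent = gini(parent)                                  # 0.48
gi_total = (len(left) / len(parent)) * gini(left) \
         + (len(right) / len(parent)) * gini(right)       # GI_Total ~ 0.317
gini_gain = gi_parent - gi_total                          # ~ 0.163
print(gi_parent, gi_total, gini_gain)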
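Worked example: chi-square statistic (CHAID)
A numeric illustration of the chi-square test of independence used by CHAID, on a hypothetical 2x2 contingency table of target vs. predictor categories; all counts are made up. (A library version exists as scipy.stats.chi2_contingency, though it applies a continuity correction to 2x2 tables by default, so its value would differ slightly.)

# Rows: target categories (Parent); columns: predictor categories (Child).
table = [[30, 10],    # Play_tennis = "Yes"
         [ 5, 25]]    # Play_tennis = "No"

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# chi^2 = sum_r sum_c (n_rc - e_rc)^2 / e_rc, with expected counts
# e_rc = row_tot[r] * col_tot[c] / n under independence.
chi2 = sum((table[r][c] - row_tot[r] * col_tot[c] / n) ** 2
           / (row_tot[r] * col_tot[c] / n)
           for r in range(len(table)) for c in range(len(table[0])))
print(chi2)   # ~ 23.3: a large value -> variables dependent -> good split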