Decision trees
Characteristics
• supervised learning method
• rule-based classifier
• regression and classification problems
– categorical variable decision tree
– continuous variable decision tree
Example: Playing tennis
We will go to the tennis court if it is not raining and
the temperature is over 15 °C.
Example: Playing tennis
If Outlook = "No rain" and Temperature >= 15 then
Play_tennis = "Yes"
If Outlook = "Rain" and Temperature < 15 then
Play_tennis = "No"
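The two rules above can be sketched as a tiny rule-based classifier. This is a minimal illustration; the function name and the default for cases the rules do not cover are assumptions, not part of the slides:

```python
def play_tennis(outlook, temperature):
    # Rule 1: no rain and warm enough -> play
    if outlook == "No rain" and temperature >= 15:
        return "Yes"
    # Rule 2: rain and cold -> do not play
    if outlook == "Rain" and temperature < 15:
        return "No"
    # The two rules do not cover every combination (e.g. rain but warm);
    # default to "No" for anything unmatched (an assumption)
    return "No"
```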
Basic terminology
• Root Node
• Splitting
• Decision Node
• Leaf/Terminal Node
• Pruning
• Branch/Sub-tree
• Parent and Child Node
Tree building process
• Recursive partitioning mechanism
Process
• 1) the DT consists of a single root node
• 2) for the root node, a variable and partitioning of its values are
determined (best partitioning ability, best variable)
• 3) data is split into distinct subgroups, based on the previously chosen
splitting criteria
• 4) the new data subsets are the basis for the nodes in the first level of the
DT
• 5) each of these new nodes again identifies the best variable for
partitioning the subset and splits it according to splitting criteria
• 6) the node partitioning process is repeated until a stopping criterion is
fulfilled for the particular node
– all or almost all data in the node are of the same target category
– the tree has reached its predefined maximum level
– the node has reached the minimum occupation size
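Steps 1–6 can be sketched as a compact recursive partitioner. This is a minimal sketch, assuming Gini impurity as the splitting criterion (introduced under CART below) and an exhaustive search over (feature, threshold) pairs; all function names, the dict-based tree representation, and the stopping-criterion defaults are illustrative choices, not from the slides:

```python
from collections import Counter

def gini(labels):
    # impurity of a node from its class labels
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    # exhaustively search every (feature, threshold) pair for the
    # lowest weighted child impurity; real libraries are far more efficient
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in {r[f] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[f] < t]
            right = [l for r, l in zip(rows, labels) if r[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score, best = score, (f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_size=2):
    # stopping criteria: pure node, predefined max level, or min occupation size
    if len(set(labels)) == 1 or depth >= max_depth or len(labels) <= min_size:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    split = best_split(rows, labels)
    if split is None:  # no split improves on the parent node
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] < t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] >= t]
    return {
        "feature": f, "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left],
                           depth + 1, max_depth, min_size),
        "right": build_tree([r for r, _ in right], [l for _, l in right],
                            depth + 1, max_depth, min_size),
    }
```

On a toy temperature dataset the root node finds the split that separates the two classes, and each child becomes a pure leaf.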
Example: Playing tennis
Characteristics
• Pros:
– not sensitive to multicollinearity
– not sensitive to outliers
– easy interpretation
– handles both quantitative and categorical variables
• Cons:
– small changes in data can cause large changes in
results
– calculations can become far more complex
compared to the other algorithms
– a classification tree cannot be used for prediction of
continuous variables
– overfitting risk
CART
• Binary splitting tree
• Gini coefficient
– For each split
– For all nodes
GI_t = 1 − Σ_{c=1}^{J} p_{tc}²

GI_Total = Σ_{i=1}^{K} (N_i / N_T) · GI_i
• Improvement (Gini Gain) = difference between the Gini in the parent
node and the weighted Gini of the child nodes
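The Gini computation and the gain of a candidate binary split can be worked through numerically. A minimal sketch; the class counts are an invented example:

```python
def gini(counts):
    # GI = 1 - sum_c p_c^2, computed from the class counts in a node
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# parent node: 5 "Yes" and 5 "No" -> maximally impure for two classes
parent = gini([5, 5])                       # 0.5
# candidate binary split: left child [4 Yes, 1 No], right child [1 Yes, 4 No]
left, right = gini([4, 1]), gini([1, 4])    # 0.32 each
# total Gini: GI_Total = sum_i (N_i / N_T) * GI_i, weighted by node sizes
total = (5 / 10) * left + (5 / 10) * right
gain = parent - total                       # Gini gain of this split: 0.18
```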
CHAID
• general (multiway) tree
• Chi-square statistic (test of independence)
– Target categories (Parent) vs. Predictor categories (Child)
χ² = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_ij − e_ij)² / e_ij
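The statistic can be computed from the contingency table of predictor categories (rows, the candidate child nodes) against target categories (columns). A minimal sketch; the helper name and the example counts are illustrative:

```python
def chi_square(observed):
    # observed[i][j]: count of target category j in predictor category i
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, n_ij in enumerate(row):
            # expected count e_ij under independence of rows and columns
            e_ij = row_tot[i] * col_tot[j] / n
            chi2 += (n_ij - e_ij) ** 2 / e_ij
    return chi2
```

A table where the target distribution is identical in every row gives χ² = 0 (no evidence the split matters); the further the observed counts are from the expected ones, the larger the statistic.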