A Dynamic Discretization Approach for Constructing Decision Trees with a Continuous Label
Hsiao-Wei Hu, Yen-Liang Chen, and Kwei Tang
IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1505-1514, November 2009
Adviser: Yu-Chiang Li
Speaker: Yi-Chun Pan
Date: 2009/12/03

Outline
- Introduction
- Related work
- The proposed algorithm
- Experiments
- Conclusion

Introduction 1/2
Main classification methods:
- Decision trees (DT)
- Neural networks
- Logistic regression
- Nearest neighbors

Introduction 2/2
Decision trees are:
- easily interpretable
- well organized in their results
- computationally efficient
- capable of dealing with noisy data
Example applications: credit scoring, customer relationship management, direct marketing

Related work 1/3
Two main types of DT algorithms:
- Data discretization methods; drawback: may not provide a good fit for the data
- Regression tree algorithms; drawback: the size of a regression tree is usually large, and the results are often not accurate

Related work 2/3
Data discretization methods (e.g., for C4.5):
- equal-width method
- equal-depth method
- clustering method
- Monothetic Contrast Criterions (MCCs)
- 3-4-5 partition method

Related work 3/3
Regression tree algorithms:
- Classification and Regression Trees (CART)

The proposed algorithm
Continuous Label Classifier (CLC):
- dynamically performs discretization based on the data associated with each node in the process of constructing the tree
- produces the mean, median, and other statistics for each leaf node as part of its output

The proposed algorithm 1/11
The main steps of the algorithm (figure)

The proposed algorithm 2/11
(figure)

The proposed algorithm 3/11
Steps 6 and 7 rewritten (figure)

The proposed algorithm 4/11
Explanation of steps 6a and 6b, and of steps 4 and 5 (figure)

The proposed algorithm 5/11
Determining nonoverlapping intervals (see the sketch at the end of this summary)
Worked example: set each interval to Ci ± 16:
- C5: 40 - 16 = 24 and 40 + 16 = 56, giving the interval 24~56
- C8: 65 - 16 = 49 and 65 + 16 = 81, giving the interval 49~81
Neighboring ranges: C1: 33, C2: 28, C3: 27, C4: 28, C5: 24~56 => 10, C6: 11, C7: 24, C8: 49~81 => 29, C9: 35, C10: 27, C11: 28

The proposed algorithm 6/11
If the distance between neighboring clusters is large
Example: C4 (label = 25) and C7 (label = 60)

The proposed algorithm 7/11
If necessary, splitting points can be set.

The proposed algorithm 8/11
Computing the goodness value (figure)

The proposed algorithm 9/11
(figure)

The proposed algorithm 10/11
Stopping tree growing: if all conditions are met, stop splitting.
R: the range of the label over the entire data set D

The proposed algorithm 11/11
Performance evaluation: using X-fold cross-validation (see the sketch at the end of this summary)

Experiments 1/7
Comparing CLC and C4.5 with four discretization methods:
- EW-T: equal width
- ED-T: equal depth
- KM-T: k-means clustering
- MCC-T: MCC method

Experiments 2/7
(figure)

Experiments 3/7
A comparison between CLC and ED-T (figure)

Experiments 4/7
A comparison between CLC and KM-T (figure)

Experiments 5/7
A comparison between CLC and MCC-T (figure)

Experiments 6/7
CLC and regression trees (CART)
- MAD: mean absolute deviation
- w-STDEV: weighted standard deviation

Experiments 7/7
Supplementary comparisons:
- running time
- memory requirement

Conclusion
1. The paper proposes a decision tree algorithm that allows the data in each node to be discretized dynamically during the tree induction process.
2. Future work may add this constraint, or even a set of user-specified intervals, to the problem considered in this paper.
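
Sketch: the interval-setting example (slide 5/11)
The worked example on slide 5/11 can be reproduced with a short sketch. This is a minimal illustration under the assumption that each cluster center is expanded by a fixed half-width (16 in the slide's numbers) to form a candidate label interval; the names HALF_WIDTH, make_interval, and overlaps are illustrative only, not from the paper.

# Minimal sketch of the interval-setting step on slide 5/11.
# Assumption: each cluster center c is expanded by a fixed half-width
# (16 in the slide's example) into a candidate interval [c - w, c + w].
# All names here are illustrative, not the paper's API.

HALF_WIDTH = 16

def make_interval(center):
    """Expand a cluster center c into the closed interval [c - w, c + w]."""
    return (center - HALF_WIDTH, center + HALF_WIDTH)

def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Slide example: cluster C5 has center 40 and C8 has center 65.
c5 = make_interval(40)   # (24, 56): 40 - 16 and 40 + 16
c8 = make_interval(65)   # (49, 81): 65 - 16 and 65 + 16

print(c5, c8)            # (24, 56) (49, 81)
print(overlaps(c5, c8))  # True: candidate intervals can overlap, so the
                         # algorithm must resolve overlaps before fixing
                         # the final nonoverlapping intervals

Running this reproduces the slide's endpoints 24~56 for C5 and 49~81 for C8, and shows why an overlap-resolution step is needed before the intervals become nonoverlapping.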
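
Sketch: X-fold cross-validation (slide 11/11)
Slide 11/11 evaluates the tree with X-fold cross-validation, and slide Experiments 6/7 uses MAD as an error measure. The sketch below shows that generic protocol only; the CLC learner itself is not shown in the slides, so train_fn and mad_fn are assumed callables standing in for the learner and the error measure.

# Generic X-fold cross-validation sketch for slide 11/11.
# Assumptions: train_fn(rows) builds a model from training rows, and
# mad_fn(model, rows) returns the model's mean absolute deviation on
# test rows. Neither is the paper's actual interface.

def cross_validate(rows, x, train_fn, mad_fn):
    """Average MAD over x folds: train on x-1 folds, test on the held-out one."""
    folds = [rows[i::x] for i in range(x)]        # simple round-robin split
    scores = []
    for i in range(x):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train_fn(train)
        scores.append(mad_fn(model, test))
    return sum(scores) / x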