A Dynamic Discretization Approach for Constructing Decision Trees with a Continuous Label

Hsiao-Wei Hu, Yen-Liang Chen, and Kwei Tang
IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 11, pp. 1505-1514, November 2009
Adviser: Yu-Chiang Li
Speaker: Yi-Chun Pan
Date: 2009/12/03
Outline
 Introduction
 Related work
 The proposed algorithm
 Experiments
 Conclusion
Introduction 1/2
 Main classification methods
 Decision trees (DT)
 Neural networks
 Logistic regression
 Nearest neighbors
Introduction 2/2
 Decision trees
 Easily interpretable
 Well-organized results
 Computationally efficient
 Capable of dealing with noisy data
 Examples:
 Credit scoring
 Customer relationship management
 Direct marketing
Related work 1/3
 Main types of DT algorithms
 Data discretization methods
 Drawback: may not provide a good fit for the data
 Regression tree algorithms
 Drawback: the tree is usually large, and its results are often inaccurate
Related work 2/3
 Data discretization methods (C4.5); the first two are sketched below
 Equal width method
 Equal depth method
 Clustering method
 Monothetic Contrast Criterions (MCCs)
 3-4-5 partition method
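A minimal sketch of the first two schemes in Python (function names and sample data are ours, purely illustrative):

```python
def equal_width(values, k):
    """Cut the value range into k bins of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]   # k-1 cut points

def equal_depth(values, k):
    """Cut sorted values into k bins holding roughly equal counts."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k] for i in range(1, k)]   # k-1 cut points

vals = [1, 2, 3, 4, 5, 40, 41, 42, 43, 100]
print(equal_width(vals, 3))   # [34.0, 67.0] -- bins skewed by the outlier
print(equal_depth(vals, 3))   # [4, 41]      -- roughly equal-frequency cuts
```

Equal width is sensitive to outliers in the label range, which illustrates the drawback above: a fixed, up-front discretization can fit the data poorly.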
Related work 3/3
 Regression tree algorithm
 Classification and Regression Trees (CART)
The proposed algorithm (overview)
 Continuous Label Classifier (CLC)
 Dynamically performs discretization based on the data associated with each node while the tree is being constructed (see the sketch below)
 Produces the mean, median, and other statistics for each leaf node as part of its output
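A minimal sketch of that per-node flow in Python. The helpers below are simplified stand-ins (a gap-based label discretizer and a first-attribute split), not the paper's exact procedures:

```python
import statistics

def discretize_labels(labels, gap=10.0):
    """Stand-in for CLC's dynamic discretization: sort the node's labels
    and open a new interval wherever consecutive labels jump by > gap."""
    s = sorted(labels)
    intervals, start = [], s[0]
    for a, b in zip(s, s[1:]):
        if b - a > gap:
            intervals.append((start, a))
            start = b
    intervals.append((start, s[-1]))
    return intervals

def build_clc_tree(data, attrs, min_size=5):
    """data: list of (attribute-dict, continuous-label) pairs.
    The labels are re-discretized at every node before splitting."""
    labels = [y for _, y in data]
    if len(data) < min_size or not attrs:
        # Leaf: report summary statistics of the continuous label.
        return {"mean": statistics.mean(labels),
                "median": statistics.median(labels),
                "n": len(data)}
    intervals = discretize_labels(labels)  # done per node, not up front
    # Placeholder choice; the paper picks the attribute whose split
    # maximizes a goodness value computed over `intervals`.
    attr = attrs[0]
    branches = {}
    for x, y in data:
        branches.setdefault(x[attr], []).append((x, y))
    return {attr: {v: build_clc_tree(rows, attrs[1:], min_size)
                   for v, rows in branches.items()}}
```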
The proposed algorithm 1/11
 The main steps of the algorithm
The proposed algorithm 2/11
The proposed algorithm 3/11
 Steps 6 and 7 rewritten in more detail
The proposed algorithm 4/11
 Explanation of substeps 6a and 6b, and of steps 4 and 5
The proposed algorithm 5/11
 Determining Nonoverlapping Intervals
Set each interval to Ci ± 16:
C5: 40 − 16 = 24 and 40 + 16 = 56, giving 24~56
C8: 65 − 16 = 49 and 65 + 16 = 81, giving 49~81
Neighboring ranges:
C1: 33, C2: 28, C3: 27, C4: 28, C5: 24~56 => 10, C6: 11, C7: 24, C8: 49~81 => 29, C9: 35, C10: 27, C11: 28
(see the sketch below)
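A sketch of this step using the slide's numbers; truncating overlapping neighbors at their midpoint is our assumption, not necessarily the paper's exact rule:

```python
def intervals_from_centers(centers, w=16):
    """Give each cluster center c a tentative interval [c - w, c + w],
    then truncate neighbors at the midpoint wherever two intervals
    overlap, so the result is nonoverlapping."""
    cs = sorted(centers)
    bounds = [[c - w, c + w] for c in cs]
    for left, right in zip(bounds, bounds[1:]):
        if left[1] > right[0]:                  # overlap detected
            mid = (left[1] + right[0]) / 2
            left[1] = right[0] = mid
    return bounds

# The slide's example: C5 at 40 and C8 at 65 with w = 16 would overlap
# (24~56 vs. 49~81), so the shared boundary is pulled back to 52.5.
print(intervals_from_centers([40, 65]))  # [[24, 52.5], [52.5, 81]]
```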
The proposed algorithm 6/11
 If the distance between neighboring clusters is large
Example: C4 (label = 25) and C7 (label = 60)
The proposed algorithm 7/11
 If necessary, splitting points can be set (sketch below)
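A small sketch of this idea; the gap threshold and the midpoint rule are assumptions for illustration:

```python
def gap_split_points(centers, max_gap=20):
    """Where neighboring cluster labels are far apart, e.g. C4 at 25 and
    C7 at 60, place an extra splitting point in the middle of the gap."""
    cs = sorted(centers)
    return [(a + b) / 2 for a, b in zip(cs, cs[1:]) if b - a > max_gap]

print(gap_split_points([25, 60]))  # [42.5]
```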
The proposed algorithm 8/11
 Computing the Goodness Value
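As an illustration only, assuming an information-gain-style measure computed over the label intervals (the paper defines its own goodness formula, which may differ):

```python
import math

def entropy(counts):
    """Shannon entropy of a histogram of interval memberships."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def goodness(parent_counts, children_counts):
    """Gain-style goodness: parent interval entropy minus the
    size-weighted entropy of the children's interval histograms."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# 10 records over two label intervals, split into two child nodes:
print(goodness([5, 5], [[5, 1], [0, 4]]))  # ~0.61
```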
The proposed algorithm 9/11
The proposed algorithm 10/11
 Stopping Tree Growing
If all conditions are met, stop splitting (sketch below)
R: range of the label over the entire data set D
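A sketch of such a stopping test; the thresholds are illustrative assumptions, not the paper's values:

```python
def stop_growing(labels, R, min_size=30, ratio=0.1):
    """Stop splitting when the node is small or when its label range is
    already a small fraction of R, the label range of the whole set D."""
    node_range = max(labels) - min(labels)
    return len(labels) < min_size or node_range <= ratio * R
```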
The proposed algorithm 11/11
 Performance evaluation
 Using X-fold cross-validation (sketch below)
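A generic sketch of the procedure (the fold count and error measure are supplied by the caller; nothing here is specific to the paper's setup):

```python
import random

def x_fold_cv(data, build, error, x=10, seed=0):
    """Train on x-1 folds, evaluate on the held-out fold, average over
    all x rounds. `build(train)` returns a model; `error(model, test)`
    returns a score such as the mean absolute deviation."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::x] for i in range(x)]
    scores = []
    for i in range(x):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(error(build(train), test))
    return sum(scores) / x
```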
Experiments 1/7
 Comparing CLC and C4.5
 Discretization methods:
 EW-T: equal width
 ED-T: equal depth
 KM-T: k-means clustering
 MCC-T: MCC method
Experiments 2/7
Experiments 3/7
A comparison between CLC and ED-T.
Experiments 4/7
A comparison between CLC and KM-T.
Experiments 5/7
A comparison between CLC and MCC-T.
Experiments 6/7
 CLC and Regression Trees (CART)
MAD: mean absolute deviation
w-STDEV: weighted standard deviation
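Minimal sketches of the two measures; the exact weighting used for w-STDEV is our assumption (each leaf weighted by its share of the records):

```python
import statistics

def mad(predicted, actual):
    """Mean absolute deviation between predicted and actual labels."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def w_stdev(leaf_labels):
    """Weighted standard deviation: each leaf's label stdev, weighted
    by the fraction of all records that fall in that leaf."""
    n = sum(len(ls) for ls in leaf_labels)
    return sum(len(ls) / n * statistics.pstdev(ls) for ls in leaf_labels)

print(mad([10, 20], [12, 17]))          # 2.5
print(w_stdev([[1, 2, 3], [10, 10]]))   # ~0.49 (leaf 1's stdev weighted 3/5)
```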
Experiments 7/7
 Supplementary Comparisons
 Running time
 Memory requirement
Conclusion
1. Proposes a decision tree algorithm that allows the data in each node to be discretized dynamically during the tree induction process.
2. Future work may add this constraint, or even a set of user-specified intervals, to the problem considered in this paper.