On the Optimality of Probability
Estimation by Random Decision Trees
Wei Fan
IBM T.J. Watson
Some important facts about inductive learning
 Given a set of labeled data items, e.g., (amt, merchant category, outstanding balance, date/time, ……), where the label indicates whether the transaction is a fraud or a non-fraud.
 Inductive model: predict if a transaction is a fraud or non-fraud.
 Perfect model: never makes mistakes.
 Not always possible due to:
 Stochastic nature of the problem
 Noise in the training data
 Insufficient data
Optimal Model
 Loss function L(t, y) to evaluate performance.
 t is the true label and y is the prediction.
 Optimal decision y* is the label that minimizes the expected loss when x is sampled many times:
 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
 cost-sensitive loss: y* is the label that minimizes the "empirical risk":
if P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud (see the sketch below).
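As a concrete illustration of the two decision rules above, here is a minimal Python sketch; the $1000 gain and $90 overhead figures come from the fraud example, and the function names are illustrative, not part of the original work:

    def predict_01_loss(p_fraud):
        # 0-1 loss: predict the most likely label
        return "fraud" if p_fraud > 0.5 else "non-fraud"

    def predict_cost_sensitive(p_fraud, gain=1000.0, overhead=90.0):
        # cost-sensitive loss: investigate only when the expected gain
        # p_fraud * gain exceeds the investigation overhead,
        # i.e., p_fraud > overhead / gain = 0.09
        return "fraud" if p_fraud * gain > overhead else "non-fraud"

    # P(fraud|x) = 0.2: not the majority label, but still worth investigating
    print(predict_01_loss(0.2))          # non-fraud
    print(predict_cost_sensitive(0.2))   # fraud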
How do we look for optimal models?
 NP-hard for most "model representations".
 We assume that the simplest hypothesis that fits the data is the best.
 We employ all kinds of heuristics to look for it:
 info gain, gini index, etc.
 pruning: MDL pruning, reduced-error pruning, cost-based pruning.
 Reality: tractable, but still very expensive.
How many optimal models are out there?
 0-1 loss, binary problem:
 Truth: if P(positive|x) > 0.5, we predict x to be positive.
 P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference in the final prediction!
 Cost-sensitive problems:
 Truth: if P(fraud|x) * $1000 > $90, we predict x to be fraud.
 Re-write it as P(fraud|x) > 0.09.
 P(fraud|x) = 1.0 and P(fraud|x) = 0.091 make no difference.
 There are really many, many optimal models out there.
Random Decision Tree: Outline
 Train multiple trees. Details to follow.
 Each tree outputs a posterior probability when classifying an example x.
 The probability outputs of many trees are averaged as the final probability estimate.
 The loss function and the probability are used to make the best prediction.
Training
At each node, an unused feature is chosen randomly.
 A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
 A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
Example
[Tree diagram: the root node splits on Gender (M / F); one path tests Age > 30, with the "y" branch leading to a node holding P: 1, N: 9; another path tests Age > 25, with the "n" branch leading to a node holding P: 100, N: 150.]
Training: Continued
 We stop when one of the following happens:
 A node becomes empty.
 Or the total height of the tree exceeds a threshold, currently set to the total number of features.
 Each node of the tree keeps the number of examples belonging to each class (a sketch of the procedure follows below).
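A minimal Python sketch of the training procedure described on the two slides above, assuming examples are (feature-dict, label) pairs and each feature is described by a small record; all names are illustrative and not taken from the original implementation:

    import random
    from collections import namedtuple

    # illustrative feature description: discrete features list all their values,
    # continuous features give a [min_value, max_value] range for thresholds
    Feature = namedtuple("Feature", "name is_continuous values min_value max_value")

    class Node:
        def __init__(self):
            self.feature = None     # name of the feature tested at this node
            self.threshold = None   # threshold, for continuous features only
            self.children = {}      # branch value -> child Node
            self.counts = {}        # class label -> number of training examples

    def build_tree(examples, features, used_discrete=frozenset(),
                   depth=0, max_depth=None):
        if max_depth is None:
            max_depth = len(features)   # height limit = total number of features
        node = Node()
        for _, label in examples:       # each node keeps per-class counts
            node.counts[label] = node.counts.get(label, 0) + 1
        if not examples or depth >= max_depth:
            return node                 # stop: empty node or height limit reached
        # choose a random feature still usable on this path: discrete features
        # may appear only once per path, continuous ones may repeat
        # (with a different random threshold each time)
        candidates = [f for f in features
                      if f.is_continuous or f.name not in used_discrete]
        if not candidates:
            return node
        f = random.choice(candidates)
        node.feature = f.name
        if f.is_continuous:
            node.threshold = random.uniform(f.min_value, f.max_value)
            splits = {"<=": [], ">": []}
            for x, y in examples:
                splits["<=" if x[f.name] <= node.threshold else ">"].append((x, y))
            next_used = used_discrete
        else:
            splits = {v: [] for v in f.values}   # assumes f.values covers all observed values
            for x, y in examples:
                splits[x[f.name]].append((x, y))
            next_used = used_discrete | {f.name}
        for branch, subset in splits.items():
            node.children[branch] = build_tree(subset, features, next_used,
                                               depth + 1, max_depth)
        return node

    def train_random_trees(examples, features, n_trees=10):
        # train multiple random trees; no purity function is ever evaluated
        return [build_tree(examples, features) for _ in range(n_trees)]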
Classification
 Each tree outputs a membership probability:
 p(fraud|x) = n_fraud / (n_fraud + n_normal)
 If a leaf node is empty (very likely when a discrete feature is tested at the end of the path):
use the parent node's probability estimate, but do not output 0 or NaN.
 The membership probabilities from multiple random trees are averaged as the final output.
 A loss function is required to make a decision (see the sketch below):
 0-1 loss: if p(fraud|x) > 0.5, predict fraud.
 cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
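Continuing the training sketch above, classification walks each tree, falls back to the parent's class counts when a node is empty, averages the per-tree probabilities, and then applies the loss function; again an illustrative sketch, not the original code:

    def tree_probability(node, x, target_label, parent_counts=None):
        # if this node saw no training examples, use the parent's counts
        # instead of outputting 0 or NaN
        counts = node.counts if sum(node.counts.values()) > 0 else parent_counts
        if node.feature is None:                      # leaf: output n_target / n_total
            return counts.get(target_label, 0) / sum(counts.values())
        if node.threshold is not None:                # continuous test
            branch = "<=" if x[node.feature] <= node.threshold else ">"
        else:                                         # discrete test
            branch = x[node.feature]
        child = node.children.get(branch)
        if child is None:                             # branch value never built
            return counts.get(target_label, 0) / sum(counts.values())
        return tree_probability(child, x, target_label, counts)

    def ensemble_probability(trees, x, target_label="fraud"):
        # average the posterior estimates of all random trees
        return sum(tree_probability(t, x, target_label) for t in trees) / len(trees)

    def predict(trees, x, gain=1000.0, overhead=90.0):
        # apply the cost-sensitive loss to the averaged probability
        p = ensemble_probability(trees, x, "fraud")
        return "fraud" if p * gain > overhead else "non-fraud"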
Credit Card Fraud
 Detect if a transaction is a fraud.
 There is an overhead to detect a fraud, chosen from {$60, $70, $80, $90}.
 Loss Function
Result
Donation Dataset
 Decide to whom to send a charity solicitation letter. About 5% of the examples are positive.
 It costs $0.68 to send a letter.
 Loss function
Result
Independent study and implementation
of random decision tree
 Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.
 Edward Greengrass from DOD on their data sets:
 100 to 300 features.
 Both categorical and continuous features.
 Some features have a lot of values.
 2000 to 3000 examples.
 Both binary and multiple-class problems (16 and 25 classes).
Why does random decision tree work?
 Original explanation:
 Error tolerance property.
 Truth: if P(positive|x) > 0.5, we predict x to be positive.
 P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference in the final prediction!
 New discovery:
 The averaged posterior probability estimate, such as P(positive|x), is a better estimate of the true probability than that of the single best tree (a toy illustration follows below).
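A toy illustration of the "new discovery" point: if each tree's posterior estimate is modeled as the true probability plus independent noise, the averaged estimate is typically much closer to the truth than any single estimate. This is only a variance-reduction sketch under that modeling assumption, not the paper's analysis:

    import random

    random.seed(0)
    true_p = 0.7                       # assumed true P(positive|x)
    # model each tree's estimate as true_p plus independent noise, clipped to [0, 1]
    estimates = [min(max(true_p + random.gauss(0, 0.15), 0.0), 1.0)
                 for _ in range(50)]
    single = estimates[0]
    averaged = sum(estimates) / len(estimates)
    print("single tree error:   ", abs(single - true_p))
    print("averaged trees error:", abs(averaged - true_p))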
[Result charts omitted: Credit Card Fraud, Adult Dataset, Donation, Overfitting, Non-overfitting, Selectivity, Tolerance to data insufficiency.]
Other related applications of random
decision tree
 n-fold cross-validation
 Stream mining
 Multiple-class probability estimation
Implementation issues
 When there is not an astronomical number of features and feature values, we can build the empty tree structures first and feed the data in one simple scan to finalize the construction (see the sketch below).
 Otherwise, build the tree iteratively, just like traditional tree construction, but WITHOUT any expensive purity function check.
 Both ways are very efficient, since we do not evaluate any expensive purity function.
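A minimal sketch of the one-scan idea, reusing the Node structure from the earlier training sketch; it assumes the random tree structure has already been generated without looking at the data (illustrative only):

    def fill_counts(root, examples):
        # single pass over the data: route each example down its path and
        # increment the class counts of every node it passes through
        for x, label in examples:
            node = root
            while node is not None:
                node.counts[label] = node.counts.get(label, 0) + 1
                if node.feature is None:          # reached a leaf
                    break
                if node.threshold is not None:    # continuous test
                    branch = "<=" if x[node.feature] <= node.threshold else ">"
                else:                             # discrete test
                    branch = x[node.feature]
                node = node.children.get(branch)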
On the other hand
 Occam's Razor's interpretation: of two hypotheses with the same loss, we should prefer the simpler one.
 Yet very complicated hypotheses can be highly accurate:
 Meta-learning
 Boosting (weighted voting)
 Bagging (bootstrap sampling with replacement)
 None of the purity functions really obeys Occam's razor.
 Their philosophy is: simpler is better, in the hope that simpler also brings high accuracy. That is not necessarily true!
Download