Detection and reconstruction of concept change in decision trees

On the Optimality of Probability
Estimation by Random Decision Trees
Wei Fan
IBM T.J. Watson
Some important facts about inductive learning
 Given a set of labeled data items, such as (amt,
merchant category, outstanding balance, date/time,
……), where the label is whether the transaction is a fraud or nonfraud.
 Inductive model: predict if a transaction is a fraud or nonfraud.
 Perfect model: never makes mistakes.
 Not always possible due to:
 Stochastic nature of the problem
 Noise in training data
 Data is insufficient
Optimal Model
 Loss function L(t,y) to evaluate
 t is true label and y is prediction
 Optimal decision y* is the label that
minimizes the expected loss when x is
sampled many times:
 0-1 loss: y* is the label that appears the most
often, i.e., if P(fraud|x) > 0.5, predict fraud
 cost-sensitive loss: the label that minimizes the
“empirical risk”.
If P(fraud|x) * $1000 > $90, i.e.,
P(fraud|x) > 0.09, predict fraud
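The two decision rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and reading $1000 as the transaction amount and $90 as the overhead of investigating is an assumption based on the slide's numbers.

```python
def predict_zero_one(p_fraud: float) -> str:
    """0-1 loss: predict the label that is more probable."""
    return "fraud" if p_fraud > 0.5 else "nonfraud"


def predict_cost_sensitive(p_fraud: float, amount: float = 1000.0,
                           overhead: float = 90.0) -> str:
    """Cost-sensitive loss: predict fraud only when the expected amount
    at stake, p_fraud * amount, exceeds the overhead (assumed meaning of
    the $1000 and $90 figures)."""
    return "fraud" if p_fraud * amount > overhead else "nonfraud"


# The same probability can lead to different optimal decisions
# under the two loss functions.
print(predict_zero_one(0.2), predict_cost_sensitive(0.2))
```

With P(fraud|x) = 0.2, the 0-1 rule predicts nonfraud while the cost-sensitive rule predicts fraud, since 0.2 * $1000 = $200 > $90.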
How do we look for optimal models?
 NP-hard for most “model representations”
 We think that the simplest hypothesis that fits
the data is the best.
 We employ all kinds of heuristics to look for it:
 info gain, gini index, etc
 pruning: MDL pruning, reduced-error pruning, cost-based pruning.
 Reality: tractable, but still very expensive
How many optimal models are out there?
 0-1 loss binary problem:
 Truth: P(positive|x) > 0.5, we predict x to be positive.
 P(positive|x) = 0.6 and P(positive|x) = 0.9 make no
difference in final prediction!
 Cost-sensitive problems:
 Truth: P(fraud|x) * $1000 > $90, we predict x to be fraud.
 Rewrite it as P(fraud|x) > 0.09
 P(fraud|x) = 1.0 and P(fraud|x) = 0.091 make no difference in final prediction!
 There are really many, many optimal models out there.
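The point that many different probability estimates are equally optimal can be checked directly. The helper below is a hypothetical illustration, not part of the original method: an estimate counts as "optimal" here if it falls on the same side of the decision threshold as the true probability.

```python
def same_decision(p_true: float, p_estimate: float, threshold: float) -> bool:
    """True when the estimate leads to the same prediction as the truth."""
    return (p_true > threshold) == (p_estimate > threshold)


# 0-1 loss, threshold 0.5: 0.6 and 0.9 give the same prediction.
print(same_decision(0.6, 0.9, threshold=0.5))
# Cost-sensitive example, threshold 0.09: 1.0 and 0.091 also agree.
print(same_decision(1.0, 0.091, threshold=0.09))
```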
Random Decision Tree: Outline
 Train multiple trees. Details to follow.
 Each tree outputs posterior
probability when classifying an
example x.
 The probability outputs of many trees
are averaged as the final probability
 Loss function and probability are used
to make the best prediction.
At each node, an un-used feature is
chosen randomly
 A discrete feature is un-used if it has
never been chosen previously on a given
decision path starting from the root to the
current node.
 A continuous feature can be chosen
multiple times on the same decision path,
but each time a different threshold value
is chosen
[Figure: example tree node testing Age > 25, with per-node class counts such as P: 100, N: 150 and P: 1, N: 9]
Training: Continued
 We stop when one of the following happens:
 A node becomes empty.
 Or the total height of the tree exceeds a
threshold, currently set as the total number of
 Each node of the tree keeps the number of
examples belonging to each class.
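The training procedure described so far can be sketched as a short recursive function. This is a minimal sketch under stated assumptions, not the author's implementation: the node layout (plain dicts), the `schema` format (a list of values for a discrete feature, a `(low, high)` tuple for a continuous one), drawing thresholds uniformly from that range, and the `max_depth` parameter are all choices made here for illustration.

```python
import random
from collections import Counter


def build_random_tree(rows, labels, schema, max_depth, rng,
                      used=frozenset(), depth=0):
    # Every node keeps the number of examples belonging to each class.
    node = {"counts": Counter(labels)}
    # Stop when the node is empty or the depth limit is reached.
    if not rows or depth >= max_depth:
        return node
    candidates = [name for name in schema if name not in used]
    if not candidates:
        return node
    feature = rng.choice(candidates)  # random choice, no purity function
    spec = schema[feature]
    node["feature"] = feature
    if isinstance(spec, tuple):
        # Continuous feature: draw a fresh random threshold; the feature
        # remains available further down the same path.
        node["threshold"] = rng.uniform(*spec)
        branches = {True: [], False: []}
        for row, label in zip(rows, labels):
            branches[row[feature] <= node["threshold"]].append((row, label))
        next_used = used
    else:
        # Discrete feature: one branch per value, used once per path.
        branches = {value: [] for value in spec}
        for row, label in zip(rows, labels):
            branches[row[feature]].append((row, label))
        next_used = used | {feature}
    node["children"] = {
        key: build_random_tree([r for r, _ in items], [l for _, l in items],
                               schema, max_depth, rng, next_used, depth + 1)
        for key, items in branches.items()
    }
    return node


# Toy data (made up for illustration only).
schema = {"merchant": ["grocery", "travel"], "amount": (0.0, 1000.0)}
rows = [
    {"merchant": "grocery", "amount": 40.0},
    {"merchant": "travel", "amount": 900.0},
    {"merchant": "grocery", "amount": 15.0},
    {"merchant": "travel", "amount": 120.0},
]
labels = ["normal", "fraud", "normal", "normal"]
tree = build_random_tree(rows, labels, schema, max_depth=2,
                         rng=random.Random(0))
```

Training several such trees with different random seeds gives the ensemble; the labels are never consulted when choosing a feature or threshold, only when counting.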
 Each tree outputs membership probability
 p(fraud|x) = n_fraud/(n_fraud + n_normal)
 If a leaf node is empty (very likely when a discrete
feature is tested at the end):
Use the parent node’s probability estimate; do not
output 0 or NaN
 The membership probabilities from multiple random
trees are averaged as the final probability estimate
 Loss function is required to make a decision
 0-1 loss: p(fraud|x) > 0.5, predict fraud
 cost-sensitive loss: p(fraud|x) * $1000 > $90, predict fraud
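The classification step can be illustrated on two small hand-built trees. The sketch assumes a hypothetical node layout: each node is a dict with per-class `counts`, and internal nodes also carry a `feature`, an optional `threshold`, and `children`. It also assumes the root of every tree is non-empty.

```python
def tree_probability(node, row, positive="fraud"):
    """Walk one tree; fall back to the parent's estimate at an empty node."""
    estimate = None
    while True:
        total = sum(node["counts"].values())
        if total == 0:
            break  # empty node: keep the parent's estimate, never 0 or NaN
        estimate = node["counts"].get(positive, 0) / total
        if "children" not in node:
            break  # reached a non-empty leaf
        value = row[node["feature"]]
        key = value <= node["threshold"] if "threshold" in node else value
        node = node["children"][key]
    return estimate


def forest_probability(trees, row, positive="fraud"):
    """Average the membership probabilities of all trees."""
    return sum(tree_probability(t, row, positive) for t in trees) / len(trees)


# Two toy trees (counts made up for illustration).
tree_a = {
    "counts": {"fraud": 2, "normal": 8},
    "feature": "age",
    "threshold": 25,
    "children": {
        True: {"counts": {"fraud": 1, "normal": 1}},
        False: {"counts": {}},  # empty leaf
    },
}
tree_b = {"counts": {"fraud": 4, "normal": 6}}

p = forest_probability([tree_a, tree_b], {"age": 30})
print(round(p, 3), "fraud" if p * 1000 > 90 else "nonfraud")
```

For the example with age 30, the first tree reaches an empty leaf and reports its parent's 2/10, the second reports 4/10, and the averaged estimate is about 0.3, which exceeds 0.09 under the cost-sensitive rule.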
Credit Card Fraud
 Detect if a transaction is a fraud
 There is an overhead to detect a
fraud, {$60, $70, $80, $90}
 Loss Function
Donation Dataset
 Decide to whom to send a charity
solicitation letter. About 5% positive.
 It costs $0.68 to send a letter.
 Loss function
Independent study and implementation
of random decision tree
 Kai Ming Ting and Tony Liu from Monash
University, Australia, on UCI datasets
 Edward Greengrass from DOD on their data
100 to 300 features.
Both categorical and continuous features.
Some features have a lot of values.
2000 to 3000 examples.
Both binary and multiple class problem (16 and
Why does the random decision tree work?
 Original explanation:
 Error tolerance property.
 Truth: P(positive|x) > 0.5, we predict x to
be positive.
 P(positive|x) = 0.6 and P(positive|x) = 0.9
make no difference in final prediction!
 New discovery:
 The posterior probability, such as
P(positive|x), is estimated better than by
the single best tree.
Credit Card Fraud
Adult Dataset
Tolerance to data insufficiency
Other related applications of random
decision tree
 n-fold cross-validation
 Stream Mining.
 Multiple class probability estimation
Implementation issues
 When there is not an astronomical number
of features and feature values, we can build
some empty tree structures and feed the
data in one simple scan to finalize the
 Otherwise, build the tree iteratively just like
traditional tree construction WITHOUT any
expensive purity function check.
 Both ways are very efficient since we do not
check any expensive purity function
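The first option, building empty tree structures and then filling them in one scan, might look like the sketch below. It is an illustration under assumptions rather than the original implementation: the dict-based node layout, the `schema` format (value list for discrete features, `(low, high)` tuple for continuous ones), uniform threshold sampling, and the `max_depth` parameter are all hypothetical.

```python
import random


def empty_tree(schema, max_depth, rng, used=frozenset(), depth=0):
    """Build the structure from the feature schema alone; no data is read."""
    node = {"counts": {}}
    candidates = [name for name in schema if name not in used]
    if depth >= max_depth or not candidates:
        return node
    feature = rng.choice(candidates)
    spec = schema[feature]
    node["feature"] = feature
    if isinstance(spec, tuple):  # continuous: random threshold, reusable
        node["threshold"] = rng.uniform(*spec)
        keys, next_used = (True, False), used
    else:  # discrete: one branch per value, used once per path
        keys, next_used = tuple(spec), used | {feature}
    node["children"] = {
        key: empty_tree(schema, max_depth, rng, next_used, depth + 1)
        for key in keys
    }
    return node


def feed(tree, rows, labels):
    """One scan over the data: update class counts along each example's path."""
    for row, label in zip(rows, labels):
        node = tree
        while True:
            node["counts"][label] = node["counts"].get(label, 0) + 1
            if "children" not in node:
                break
            value = row[node["feature"]]
            key = value <= node["threshold"] if "threshold" in node else value
            node = node["children"][key]


# Toy data (made up for illustration only).
schema = {"merchant": ["grocery", "travel"], "amount": (0.0, 1000.0)}
rows = [
    {"merchant": "grocery", "amount": 40.0},
    {"merchant": "travel", "amount": 900.0},
    {"merchant": "grocery", "amount": 15.0},
    {"merchant": "travel", "amount": 120.0},
]
labels = ["normal", "fraud", "normal", "normal"]
tree = empty_tree(schema, max_depth=2, rng=random.Random(0))
feed(tree, rows, labels)
```

Because the structure does not depend on the data, the number of nodes grows with the number of features and feature values, which is why this option only suits schemas that are not too large.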
On the other hand
 Occam’s Razor’s interpretation: given two hypotheses with
the same loss, we should prefer the simpler one.
 Very complicated hypotheses that are highly accurate:
 Meta-learning
 Boosting (weighted voting)
 Bagging (sampling with replacement)
 None of the purity functions really obeys Occam’s razor
 Their philosophy is: simpler is better, in the hope that simpler
brings high accuracy. That is not true!