DECISION TREES AND CLASSIFICATION
MIS2502 Data Analytics
Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
http://www-users.cs.umn.edu/~kumar/dmbook/
What is classification?
• Determining to what group a data element belongs, based on the data contained within that element
  • That is, the “attributes” of that “entity”
• Examples
  • Determining whether a customer should be given a loan
  • Flagging a credit card transaction as a fraudulent charge
  • Categorizing a news story as finance, entertainment, or sports
How classification works
1. Choose a set of records for the training set
2. Choose a set of records for the test set
3. Choose a class attribute (classifier, fact of interest, dependent variable, outcome variable)
4. Find a model that predicts the outcome variable as a function of the other attributes
5. Apply that model to the test set to check accuracy
6. Apply the final model to future records to classify them
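The six steps above can be sketched in plain Python. This is only an illustration: the records and the hand-written rule below are hypothetical (loosely modeled on the fraud example that follows), standing in for what real classification software would learn from the training set.

```python
# Steps 1-2: choose training and test records (each with a known outcome)
training_set = [
    {"charge": 800, "avg_6mo": 100, "same_state": False, "outcome": "Fraudulent"},
    {"charge": 60,  "avg_6mo": 100, "same_state": True,  "outcome": "Legitimate"},
]
test_set = [
    {"charge": 900, "avg_6mo": 150, "same_state": False, "outcome": "Fraudulent"},
    {"charge": 40,  "avg_6mo": 50,  "same_state": True,  "outcome": "Legitimate"},
]

# Step 3: the class attribute is "outcome".
# Step 4: "find" a model -- here a hypothetical hand-written rule; a real
# algorithm would instead derive this from the training set.
def model(record):
    if record["charge"] > 5 * record["avg_6mo"] and not record["same_state"]:
        return "Fraudulent"
    return "Legitimate"

# Step 5: apply the model to the test set to check accuracy
correct = sum(model(r) == r["outcome"] for r in test_set)
accuracy = correct / len(test_set)
print(f"Test accuracy: {accuracy:.0%}")  # both test records classified correctly

# Step 6: apply the final model to a future record with an unknown outcome
print(model({"charge": 700, "avg_6mo": 90, "same_state": False}))
```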
Decision Tree Learning

Training Set

Trans. ID | Charge Amount | Avg. Charge (6 months) | Item        | Same state as billing | Classification
1         | $800          | $100                   | Electronics | No                    | Fraudulent
2         | $60           | $100                   | Gas         | Yes                   | Legitimate
3         | $1            | $50                    | Gas         | No                    | Fraudulent
4         | $200          | $100                   | Restaurant  | Yes                   | Legitimate
5         | $50           | $40                    | Gas         | No                    | Legitimate
6         | $80           | $80                    | Groceries   | Yes                   | Legitimate
7         | $140          | $100                   | Retail      | No                    | Legitimate
8         | $140          | $100                   | Retail      | No                    | Fraudulent

Classification software derives the model from the training set.
Test Set

Trans. ID | Charge Amount | Avg. Charge (6 months) | Item        | Same state as billing | Classification
101       | $100          | $200                   | Electronics | Yes                   | ?
102       | $200          | $100                   | Groceries   | No                    | ?
103       | $1            | $100                   | Gas         | Yes                   | ?
104       | $30           | $25                    | Restaurant  | Yes                   | ?

The model is then applied to the test set.
Goals
• The model developed using the training set should
assign test set records to the right category
• It won’t be 100% accurate, but should be as close as
possible
• Once this is done, the model’s rules can be applied to
new records as they come along
• We want an automated, reliable way to predict the
outcome
Classification Method: The Decision Tree
• A model to predict membership of cases or values of a
dependent variable based on one or more predictor
variables (Tan, Steinbach, and Kumar 2004)
• Presents decision rules in plain English
Some Terms
• What is a “variable”?
• Outcome variable (fact of interest, dependent variable, classification, classifier)
  • Default/no default
  • Sales performance
  • Credit decision
• Predictor variable (factor, independent variable)
  • Age
  • Income
  • Marital status
Example: Credit Card Default

Training Data

TID | Income | Debt | Owns/Rents | Outcome
1   | 25k    | 35%  | Owns       | Default
2   | 35k    | 40%  | Rents      | Default
3   | 33k    | 15%  | Owns       | No default
4   | 28k    | 19%  | Rents      | Default
5   | 55k    | 30%  | Owns       | No default
6   | 48k    | 35%  | Rents      | Default
7   | 65k    | 17%  | Owns       | No default
8   | 85k    | 10%  | Rents      | No default
…

We create the tree from a set of training data. Each unique combination of predictors is associated with an outcome. This set was “rigged” so that every combination is accounted for and has an outcome.
Example: Credit Card Default

[The training data table above is repeated alongside the tree on this slide.]

The derived tree, with the root node at the top, child nodes at each split, and leaf nodes holding the outcomes:

Credit Approval (root node)
├─ Income < 40k (child node)
│  ├─ Debt > 20%
│  │  ├─ Owns house → Default (leaf node)
│  │  └─ Rents → Default
│  └─ Debt < 20%
│     ├─ Owns house → No Default
│     └─ Rents → Default
└─ Income > 40k
   ├─ Debt > 20%
   │  ├─ Owns house → No Default
   │  └─ Rents → Default
   └─ Debt < 20%
      ├─ Owns house → No Default
      └─ Rents → No Default
Same Data, Different Tree

TID | Income | Debt | Owns/Rents | Decision
1   | 25k    | 35%  | Owns       | Default
2   | 35k    | 40%  | Rents      | Default
3   | 33k    | 15%  | Owns       | No default
4   | 28k    | 19%  | Rents      | Default
5   | 55k    | 30%  | Owns       | No default
6   | 48k    | 35%  | Rents      | Default
7   | 65k    | 17%  | Owns       | No default
8   | 85k    | 10%  | Rents      | No default
…

We just changed the order of the predictors (concept of conditional probability!):

Credit Approval
├─ Owns
│  ├─ Income < 40k
│  │  ├─ Debt > 20% → Default
│  │  └─ Debt < 20% → No Default
│  └─ Income > 40k
│     ├─ Debt > 20% → No Default
│     └─ Debt < 20% → No Default
└─ Rents
   ├─ Income < 40k
   │  ├─ Debt > 20% → Default
   │  └─ Debt < 20% → Default
   └─ Income > 40k
      ├─ Debt > 20% → Default
      └─ Debt < 20% → No Default
Apply the model to new (test) data

[The first tree (split on income, then debt, then owns/rents) is applied to the test records.]

Test Data

TID | Income | Debt | Owns/Rents | Decision (Predicted) | Decision (Actual)
1   | 80k    | 35%  | Rents      | Default              | Default
2   | 20k    | 40%  | Owns       | Default              | Default
3   | 15k    | 15%  | Owns       | No Default           | No Default
4   | 50k    | 19%  | Rents      | No Default           | Default
5   | 35k    | 30%  | Owns       | Default              | Default

How well did the decision tree do in predicting the outcome?
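The tree's rules can be written out as an ordinary function and applied to the five test records to answer that question. This is a sketch of what the slide does by hand, not production code:

```python
def predict(income_k, debt_pct, owns):
    """Decision rules from the first tree (income, then debt, then owns/rents)."""
    if income_k < 40:
        if debt_pct > 20:
            return "Default"                        # owns or rents: Default
        return "No Default" if owns else "Default"
    else:
        if debt_pct > 20:
            return "No Default" if owns else "Default"
        return "No Default"                         # owns or rents: No Default

# Test records: (income in $k, debt %, owns?, actual outcome)
test_data = [
    (80, 35, False, "Default"),
    (20, 40, True,  "Default"),
    (15, 15, True,  "No Default"),
    (50, 19, False, "Default"),    # the tree predicts "No Default": a miss
    (35, 30, True,  "Default"),
]

correct = sum(predict(i, d, o) == actual for i, d, o, actual in test_data)
print(f"{correct} of {len(test_data)} correct")    # 4 of 5 correct (80%)
```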
This assumes some things
• For real datasets with a lot of records, the tree is built using software
• The software has to deal with several issues:
  • When the same set of predictors results in different outcomes (probability!)
  • When not every combination of predictors is in the training set
  • When multiple paths result in the same outcome
Tree Induction Algorithms
• Tree induction algorithms take large sets of data and compute the tree
• Similar cases may have different outcomes, so the probability of an outcome is computed
• For instance, you may find: when income > 40k, debt < 20%, and the customer rents, no default occurs 80% of the time (probability 0.8)

[The first tree is shown again, with the probability 0.8 attached to the (Income > 40k, Debt < 20%, Rents → No Default) leaf.]
How the induction algorithm works
1. Start with a single node containing all the training data
2. Are the samples all of the same classification? If yes, this node is done
3. If no: are there predictor(s) that will split the data? If no, this node is done
4. If yes: partition the node into child nodes according to the predictor(s)
5. Are there more nodes (i.e., new child nodes)? If yes, go to the next node and repeat from step 2
6. If no: DONE!
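The loop above can be sketched as a short recursive function. One simplification to note: this sketch takes predictors in a fixed order, whereas real software would pick the best split at each node (using chi-squared or Gini, as discussed later).

```python
def build_tree(rows, predictors):
    outcomes = {r["outcome"] for r in rows}
    if len(outcomes) == 1:                  # all samples same class: leaf node
        return outcomes.pop()
    if not predictors:                      # out of predictors: majority vote
        vals = [r["outcome"] for r in rows]
        return max(set(vals), key=vals.count)
    attr, rest = predictors[0], predictors[1:]
    tree = {}
    for value in {r[attr] for r in rows}:   # partition into child nodes
        subset = [r for r in rows if r[attr] == value]
        tree[(attr, value)] = build_tree(subset, rest)
    return tree

# The eight credit-card training records from the slides, with income and
# debt already bucketed the way the example tree splits them.
rows = [
    {"income": "<40k", "debt": ">20%", "or": "Owns",  "outcome": "Default"},
    {"income": "<40k", "debt": ">20%", "or": "Rents", "outcome": "Default"},
    {"income": "<40k", "debt": "<20%", "or": "Owns",  "outcome": "No default"},
    {"income": "<40k", "debt": "<20%", "or": "Rents", "outcome": "Default"},
    {"income": ">40k", "debt": ">20%", "or": "Owns",  "outcome": "No default"},
    {"income": ">40k", "debt": ">20%", "or": "Rents", "outcome": "Default"},
    {"income": ">40k", "debt": "<20%", "or": "Owns",  "outcome": "No default"},
    {"income": ">40k", "debt": "<20%", "or": "Rents", "outcome": "No default"},
]
tree = build_tree(rows, ["income", "debt", "or"])
print(tree[("income", "<40k")][("debt", "<20%")][("or", "Rents")])  # Default
```

Note how the (Income < 40k, Debt > 20%) branch collapses to a single "Default" leaf without using Owns/Rents, matching step 2 of the loop.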
Start with root node

Credit Approval (a single root node containing all the training data)

• There are both “defaults” and “no defaults” in the set
• So we need to look for predictors to split the data
Split on income

Credit Approval
├─ Income < 40k
└─ Income > 40k

• Income is a factor:
  (Income < 40, Debt > 20, Owns = Default)
  but
  (Income > 40, Debt > 20, Owns = No default)
• But there is also a mix of defaults and no defaults within each income group
• So look for another split
Split on debt

Credit Approval
├─ Income < 40k
│  ├─ Debt > 20%
│  └─ Debt < 20%
└─ Income > 40k
   ├─ Debt > 20%
   └─ Debt < 20%

• Debt is a factor:
  (Income < 40, Debt < 20, Owns = No default)
  but
  (Income < 40, Debt < 20, Rents = Default)
• But there is also a mix of defaults and no defaults within some debt groups
• So look for another split
Split on Owns/Rents

Credit Approval
├─ Income < 40k
│  ├─ Debt > 20%
│  │  ├─ Owns house → Default
│  │  └─ Rents → Default
│  └─ Debt < 20%
│     ├─ Owns house → No Default
│     └─ Rents → Default
└─ Income > 40k
   ├─ Debt > 20%
   │  ├─ Owns house → No Default
   │  └─ Rents → Default
   └─ Debt < 20%
      ├─ Owns house → No Default
      └─ Rents → No Default

• Owns/Rents is a factor
• For some cases it doesn’t matter, but for some it does
• So you group similar branches
• …And we stop because we’re out of predictors!
How does it know when and how to split?
• Our example was simple because we had one case per branch
  • So there was 100% agreement within every branch
• When you have a lot of data, you’ll get different outcomes with the same predictors
• So you need criteria to decide when to split a branch using a predictor
• And you need a way of figuring out whether your leaf nodes
  • Describe primarily one type of outcome
  • Are different from one another
Hypothesis Testing
• Suppose we want to test the following statement (hypothesis):
  • The split between “owns” and “rents” will make no difference to the chance of default (a 50-50 chance of default for both groups)
• How do we test it?
• Mathematically:
  • Default(owns) = No Default(owns)
  • Default(rents) = No Default(rents)
• We need a joint test statistic to test these two equations: we want to have confidence to reject at least one of them!
Chi-squared test
• Is the proportion of the outcome class the same in each child node?
• It shouldn’t be, or the classification isn’t very helpful

χ² = Σ (O_ij − E_ij)² / E_ij

Root (n=1500):  Default = 750, No Default = 750
Owns (n=850):   Default = 300, No Default = 550
Rents (n=650):  Default = 450, No Default = 200

Observed
           | Owns | Rents |
Default    |  300 |   450 |  750
No Default |  550 |   200 |  750
           |  850 |   650 | 1500

Hypothesized (Expected)
           | Owns | Rents |
Default    |  425 |   325 |  750
No Default |  425 |   325 |  750
           |  850 |   650 | 1500
Chi-squared test

Using the observed and expected tables above:

χ² = (300 − 425)²/425 + (550 − 425)²/425 + (450 − 325)²/325 + (200 − 325)²/325
   = 36.7 + 36.7 + 48.0 + 48.0 = 169.4

p < 0.0001

• If the groups were the same, you’d expect an even split (Hypothesized/Expected)
• But we can see they aren’t distributed evenly (Observed)
• But is the difference enough (i.e., statistically significant)?
• Small p-values (i.e., less than 0.05) mean it’s very unlikely the groups are the same
• So Owns/Rents is a predictor that creates two different groups
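The slide's computation can be reproduced in plain Python: expected counts come from the row and column totals, and the statistic sums (observed − expected)²/expected over the four cells. (Carrying full precision gives about 169.7; the slide's 169.4 comes from rounding each term first. Either way the statistic is far above 3.84, the critical value for 1 degree of freedom at the 0.05 level.)

```python
# Observed counts keyed by (outcome, group)
observed = {("Default", "Owns"): 300, ("Default", "Rents"): 450,
            ("No Default", "Owns"): 550, ("No Default", "Rents"): 200}

n = sum(observed.values())  # 1500
row_tot = {r: sum(v for (rr, c), v in observed.items() if rr == r)
           for r in ("Default", "No Default")}
col_tot = {c: sum(v for (r, cc), v in observed.items() if cc == c)
           for c in ("Owns", "Rents")}

# Expected count for each cell = row total * column total / grand total
chi2 = sum((obs - row_tot[r] * col_tot[c] / n) ** 2
           / (row_tot[r] * col_tot[c] / n)
           for (r, c), obs in observed.items())
print(f"chi-squared = {chi2:.1f}")  # 169.7 at full precision
```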
Gini index
• Also called the “Simpson Diversity Index”
• How much did the split lower diversity amongst the groups?
• A node in which all cases have the same outcome (e.g., “No default”) has a Gini index of 0
• The more mixed a node is, the closer the index gets to its maximum (0.5 for a two-class outcome)
Computing the Gini index

Gini = 1 − Σ_j [p(j|t)]²

This means: add up the squared proportions for each possible value, then subtract that sum from 1.

Observed
           | Owns | Rents |
Default    |  300 |   450 |  750
No Default |  550 |   200 |  750
           |  850 |   650 | 1500

Gini(owns)  = 1 − [(300/850)² + (550/850)²] = 1 − 0.124 − 0.418 = 0.458
Gini(rents) = 1 − [(450/650)² + (200/650)²] = 1 − 0.479 − 0.094 = 0.427

Neither group is very “pure.” Both have a lot of both classifications in them.
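The formula is small enough to write as a one-line function and check against the slide's numbers. (Full precision gives 0.457 and 0.426; the slide's 0.458 and 0.427 come from rounding each squared proportion first.)

```python
def gini(counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([300, 550]), 3))  # Owns node:  0.457
print(round(gini([450, 200]), 3))  # Rents node: 0.426
print(gini([850, 0]))              # a perfectly "pure" node: 0.0
```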
A “pure” example

Gini = 1 − Σ_j [p(j|t)]²

This means: add up the squared proportions for each possible value, then subtract that sum from 1.

Observed
           | Owns | Rents |
Default    |  850 |     0 |  850
No Default |    0 |   650 |  650
           |  850 |   650 | 1500

Gini(owns)  = 1 − [(850/850)² + (0/850)²] = 1 − 1 − 0 = 0
Gini(rents) = 1 − [(0/650)² + (650/650)²] = 1 − 0 − 1 = 0

This means both groups are perfectly “pure.” Each group only has one type of classification.
Bottom line: Interpreting the model tests
• A high result from the chi-squared test (a low p-value) means the groups are different
• A low result from the Gini test means the groups have mostly one classification in them
  • And you want the Gini statistic for the leaf nodes to get smaller as you add branches
• Data mining software computes these metrics for you, but it’s good to know where they come from
  • And you definitely need to understand how to interpret them!
One more thing…
• Outcome cases are not 100% certain
• There are probabilities attached to each outcome in a node
• So let’s code “Default” as 1 and “No Default” as 0

[The first tree is shown with a probability distribution at each leaf. The recoverable leaf annotations are: 0: 15% / 1: 85%; Owns house 0: 80% / 1: 20%; Rents 0: 70% / 1: 30%; Owns house 0: 78% / 1: 22%; Rents 0: 40% / 1: 60%; and 0: 74% / 1: 26%.]

So what is the chance that:
• A renter making more than $40,000 with debt more than 20% of income will default?
• A home owner making less than $40,000 with debt more than 20% of income will default?
Is this the only way to analyze the data?
• No. Alternatives include conditional probability/expectations and regressions:
  • Likelihood of default = a0 + a1*own (or rent) + a2*income + a3*debt + error
  • Fit historical data to this equation with a regression technique: either the least squares method or the maximum likelihood method
Simple linear regression

The linear regression model:

Love of Math = 5 + .01 * math SAT score

Here 5 is the intercept and .01 is the slope (p = .22; not significant).
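The least-squares fit behind a model like "Love of Math = 5 + .01 * SAT" can be sketched with the textbook slope and intercept formulas. The data points here are fabricated to lie exactly on that line, so ordinary least squares recovers intercept 5 and slope .01 exactly; real data would scatter around the line.

```python
xs = [400, 500, 600, 700, 800]       # math SAT scores (made-up sample)
ys = [5 + 0.01 * x for x in xs]      # love-of-math scores, exactly on the line

# Least squares: slope = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),
# and the fitted line always passes through (x_bar, y_bar).
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
print(f"Love of Math = {intercept:.2f} + {slope:.4f} * SAT")  # 5.00 + 0.0100 * SAT
```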