
CSE5230/DMS/2004/6
Data Mining - CSE5230
Classifiers 2
Decision Trees
Lecture Outline
 Why use Decision Trees?
 What is a Decision Tree?
 Examples
 Use as a data mining technique
 Popular Models
   CART
   CHAID
   ID3 & C4.5
Lecture Objectives
 By the end of this lecture you should be able to:
   Describe what a Decision Tree is and how it is used to do classification
   Explain how and why the explainability of Decision Tree classifications is important in some Data Mining tasks
   Describe the top-down construction of a Decision Tree by choosing splitting criteria
Why use Decision Trees? - 1
 Last week we talked about classifiers in general, and Bayesian classifiers in particular. Recapping:
   A classifier assigns items to classes
 There are many different classification models and algorithms used in Data Mining
   Naïve Bayes, Decision Trees, Feedforward Neural Networks, etc.
 Classification models differ in two important ways:
   the complexity of the classifications they can learn
   the ease with which humans can understand the reasons why certain items are classified as they are
Why use Decision Trees? - 2
 Whereas Bayesian classifiers and neural networks compute a mathematical function of their inputs to generate their outputs, decision trees use logical rules, e.g.:

[Figure: a decision tree classifying Iris flowers using tests on Petal-length, Petal-width and Sepal-length, with leaves labelled Iris setosa, Iris versicolor and Iris virginica. Adapted from [SGI2001]]

    IF   Petal-length > 2.6 AND
         Petal-width <= 1.65 AND
         Petal-length > 5 AND
         Sepal-length > 6.05
    THEN the flower is Iris virginica

NB. This is not the only rule for this species. What is the other?
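Read as code, the rule above is just a nested conditional. A minimal sketch in Python; the function and argument names are illustrative, not from [SGI2001], and the thresholds are those shown in the figure:

    def is_iris_virginica(petal_length, petal_width, sepal_length):
        # The rule read off the highlighted path of the tree above
        return (petal_length > 2.6 and
                petal_width <= 1.65 and
                petal_length > 5 and
                sepal_length > 6.05)

    print(is_iris_virginica(petal_length=5.8, petal_width=1.5, sepal_length=6.7))   # True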
Why use Decision Trees? - 3
 For some applications, accuracy of classification or prediction is sufficient, e.g.:
   A direct mail firm needing to find a model for identifying customers who will respond to mail
   Predicting the stock market using past data
 In other applications it is better (sometimes essential) that the decision be explainable, e.g.:
   Rejection of a credit card application
   Medical diagnosis
 Humans generally require explanations for most decisions
   Sometimes there are legal and ethical requirements for this
Why use Decision Trees? - 4
 Example: When a bank rejects a credit card application, it is better to explain to the customer that it was due to the fact that:
   He/she is not a permanent resident of Australia AND
   He/she has been residing in Australia for < 6 months AND
   He/she does not have a permanent job.
 This is better than saying: “We are very sorry, but our neural network thinks that you are not a credit-worthy customer.” (In which case the customer might become angry and move to another bank)
What is a Decision Tree?
 Built from root node (top) to leaf nodes (bottom)
 A record first enters the root node
   A test is applied to determine to which child node it should go next
   A variety of algorithms for choosing the initial test exists. The aim is to discriminate best between the target classes
 The process is repeated until a record arrives at a leaf node
 The path from the root to a leaf node provides an expression of a rule

[Figure: the Iris decision tree from the previous slide, with annotations marking the root node, a test, a child node, a path, and the leaf nodes]
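A minimal sketch of the traversal just described: a record enters the root, a test routes it to a child, and this repeats until a leaf is reached. The Node representation and field names below are assumptions for illustration, not part of any particular algorithm:

    class Node:
        def __init__(self, attribute=None, threshold=None, left=None, right=None, label=None):
            self.attribute = attribute      # attribute tested at this node (None for a leaf)
            self.threshold = threshold      # go left if record[attribute] <= threshold
            self.left, self.right = left, right
            self.label = label              # class label stored at a leaf node

    def classify(node, record):
        # Follow the path from the root to a leaf and return the leaf's class label
        while node.label is None:
            node = node.left if record[node.attribute] <= node.threshold else node.right
        return node.label

    # Top of the Iris tree from the figure: Petal-length <= 2.6 -> Iris setosa
    root = Node(attribute="petal_length", threshold=2.6,
                left=Node(label="Iris setosa"),
                right=Node(label="not setosa (further tests in the full tree)"))
    print(classify(root, {"petal_length": 1.4}))    # Iris setosa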
Building a Decision Tree - 1
 Algorithms for building decision trees (DTs) begin by trying to find the test which does the “best job” of splitting the data into the desired classes
   The desired classes have to be identified at the start
 Example: we need to describe the profiles of customers of a telephone company who “churn” (do not renew their contracts). The DT building algorithm examines the customer database to find the best splitting criterion among:
   Phone technology
   Age of customer
   Time has been a customer
   Gender
 The DT algorithm may discover that the “Phone technology” variable is best for separating churners from non-churners
Building a Decision Tree - 2
 The process is repeated to discover the best splitting criterion for the records assigned to each node (a minimal runnable sketch of this loop follows below)

[Figure: the first levels of the churn tree: a split on Phone technology (old/new), followed by a split on Time has been a customer at 2.3 years, with a leaf labelled Churners]

 Once built, the effectiveness of a decision tree can be measured by applying it to a collection of previously unseen records and observing the percentage of correctly classified records
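The loop described above (score every candidate split of the records at a node, keep the best one, recurse on the children) can be written down compactly. A minimal runnable sketch, assuming an entropy-style disorder score and a tiny made-up dataset; none of the names below come from the lecture:

    import math

    def disorder(records):
        # Entropy-style measure of how mixed the class labels in a node are
        n = len(records)
        result = 0.0
        for c in set(r["class"] for r in records):
            p = sum(r["class"] == c for r in records) / n
            result -= p * math.log2(p)
        return result

    def majority_class(records):
        classes = [r["class"] for r in records]
        return max(set(classes), key=classes.count)

    def build_tree(records, attributes):
        if disorder(records) == 0:                  # pure node: stop and make a leaf
            return {"leaf": records[0]["class"]}
        best = None
        for a in attributes:                        # try every attribute/threshold pair
            for t in sorted(set(r[a] for r in records))[:-1]:
                left = [r for r in records if r[a] <= t]
                right = [r for r in records if r[a] > t]
                score = (len(left) * disorder(left) + len(right) * disorder(right)) / len(records)
                if best is None or score < best[0]:
                    best = (score, a, t, left, right)
        if best is None:                            # no usable split left: majority-class leaf
            return {"leaf": majority_class(records)}
        _, a, t, left, right = best
        return {"test": "%s <= %s" % (a, t),
                "left": build_tree(left, attributes),
                "right": build_tree(right, attributes)}

    data = [{"age": 25, "class": "churn"}, {"age": 30, "class": "churn"},
            {"age": 45, "class": "stay"},  {"age": 50, "class": "stay"}]
    print(build_tree(data, ["age"]))                # splits on age <= 30, giving two pure leaves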
Example - 1
(adapted from [BeS1997])

 Requirement: classify customers who churn, i.e. do not renew their phone contracts.

The tree built for this data (shown here as an indented outline):

    Root: 50 Churners, 50 Non-churners
      Split on Phone Technology:
        old -> 20 Churners, 0 Non-churners
        new -> 30 Churners, 50 Non-churners
          Split on Time has been a Customer:
            <= 2.3 years -> 5 Churners, 40 Non-churners
            > 2.3 years  -> 25 Churners, 10 Non-churners
              Split on Age:
                <= 35 -> 20 Churners, 0 Non-churners
                > 35  -> 5 Churners, 10 Non-churners
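Read as code, this tree is again a short chain of conditionals. A minimal sketch, with field names of my own choosing and the leaf counts copied from the tree above:

    def churn_segment(phone_technology, years_as_customer, age):
        # Return the (churners, non-churners) counts of the leaf a customer falls into
        if phone_technology == "old":
            return (20, 0)
        if years_as_customer <= 2.3:
            return (5, 40)
        if age <= 35:
            return (20, 0)
        return (5, 10)

    churners, non_churners = churn_segment("new", 3.0, 42)
    print(churners / (churners + non_churners))     # 0.33: about one in three such customers churned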
Example - 2
 The number of records in a given parent node equals the sum of the records contained in the child nodes
 Quite easy to understand how the model is being built (unlike NNs, as we will see next week)
 Easy to use the model
   e.g. for a targeted marketing campaign aimed at customers likely to churn
 Provides intuitive ideas about the customer base
   e.g.: “Customers who have been with the company for a couple of years and have new phones are pretty loyal”
Use as a data mining technique - 1
 Exploration
   Analyzing the predictors and splitting criteria selected by the algorithm may provide interesting insights which can be acted upon
   e.g. if the following rule was identified:

       IF   time a customer < 1.1 years AND
            sales channel = telesales
       THEN chance of churn is 65%

   it might be worthwhile conducting a study on the way the telesales operators are making their calls
Use as a data mining technique - 2
 Exploration (continued)
   Gleaning information from rules that fail
   e.g. from the phone example we obtained the rule:

       IF   Phone technology = new AND
            Time has been a customer > 2.3 years AND
            Age > 35
       THEN there are only 15 customers (15% of the total)

   Can this rule be useful?
     » Perhaps we can attempt to build up this small market segment. If this is possible then we have the edge over competitors, since we have a head start in this knowledge
     » We can remove these customers from our direct marketing campaign, since there are so few of them
Use as a data mining technique - 3
 Exploration (continued)
   Again from the phone company example, we noticed that:
     » there was no combination of rules to reliably discriminate between churners and non-churners for the small market segment mentioned on the previous slide (5 churners, 10 non-churners)
   Do we consider this an occasion where it was not possible to achieve our objective?
   From this failure we have learnt that age is not all that important for this category of customers (unlike for those aged 35 or under)
   Perhaps we were asking the wrong questions all along; this warrants further analysis
Use as a data mining technique - 4
 Data Pre-processing
   Decision trees are very robust at handling different predictor types (numeric/categorical), and run quickly. Therefore they can be good for a first pass over the data in a data mining operation
   This will create a subset of the possibly useful predictors, which can then be fed into another model, say a neural network
 Prediction
   Once the decision tree is built, it can then be used as a prediction tool, by applying it to a new set of data
Popular Decision Tree Models:
CART
 CART: Classification And Regression Trees, developed in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone
 Used in the DM software Darwin, from Thinking Machines Corporation (recently bought by Oracle)
   http://www.oracle.com/technology/documentation/darwin.html
 Also available in the SPSS Classification Trees add-on module
   http://www.spss.com/classification_trees/
 Often uses an entropy measure to determine the split point (Shannon's Information theory):

       measure of disorder (MOD) = -Σ p log2(p)

   where the sum runs over the prediction values (classes) in a node and p is the probability of each value occurring in that node. Other measures used include Gini and twoing.
 CART produces a binary tree
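The MOD formula transcribes directly. A minimal sketch (the function name is mine, not CART's):

    import math

    def mod(probabilities):
        # measure of disorder = -sum(p * log2(p)) over the class probabilities in a node
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(mod([0.5, 0.5]))   # 1.0: a perfectly mixed two-class node
    print(mod([1.0]))        # 0.0: a pure node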
CART - 2
 Consider the “Churn” problem from the Example slides earlier
   At the first node there are 100 customers to split: 50 who churn and 50 who don't churn
   The MOD of this node is:

       MOD = -0.5*log2(0.5) - 0.5*log2(0.5) = 1.00

   The algorithm will try each predictor variable
   For each predictor, the algorithm will calculate the MOD of the splits produced by several candidate values to identify the optimum
   Splitting on “Phone technology” produces two nodes, one with 30 churners and 50 non-churners, the other with 20 churners and 0 non-churners. The first of these has:

       MOD = -3/8*log2(3/8) - 5/8*log2(5/8) = 0.95

   and the second has a MOD of 0
   CART will select the predictor producing the nodes with the lowest MOD (weighted by the number of records in each node) as the split point
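The arithmetic above can be reproduced directly; weighting each child's MOD by its share of the records is the usual way such candidate splits are compared. A sketch using the counts from the example tree:

    import math

    def mod(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    root      = mod([50, 50])   # 1.00 before any split
    new_phone = mod([30, 50])   # ~0.95: 30 churners, 50 non-churners
    old_phone = mod([20, 0])    # 0.00: a pure node
    weighted  = (80 / 100) * new_phone + (20 / 100) * old_phone
    print(round(root, 2), round(new_phone, 2), round(weighted, 2))   # 1.0 0.95 0.76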
Node splitting
An ideally good split

    Node 1                 Node 2
    Name    Churned?       Name    Churned?
    Jim     Yes            Bob     No
    Sally   Yes            Betty   No
    Steve   Yes            Sue     No
    Joe     Yes            Alex    No

An ideally bad split

    Node 1                 Node 2
    Name    Churned?       Name    Churned?
    Jim     Yes            Bob     No
    Sally   Yes            Betty   No
    Steve   No             Sue     Yes
    Joe     No             Alex    Yes
Popular Decision Tree Models:
CHAID
 CHAID: Chi-squared Automatic Interaction Detector, developed by J. A. Hartigan in 1975
 Widely used since it is distributed as part of the popular statistical packages SAS and SPSS
 Differs from CART in the way it identifies the split points: instead of the information measure, it uses the chi-squared test (a statistical test of whether two variables are independent) to identify the split points
 All predictors must be categorical, or put into categorical form by binning
 The accuracy of the two methods, CHAID and CART, has been found to be similar
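As an illustration of the chi-squared comparison just described, the sketch below tests whether churn is independent of phone technology, using the counts from the churn example as a contingency table. It assumes scipy is available; the choice of table is mine, not the output of any particular CHAID implementation:

    from scipy.stats import chi2_contingency

    # rows: phone technology (new, old); columns: (churners, non-churners)
    table = [[30, 50],
             [20, 0]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print("chi2 = %.1f, p = %.5f" % (chi2, p_value))
    # A very small p-value means churn is not independent of phone technology,
    # so this predictor is a strong candidate for the split.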
Popular Decision Tree Models:
ID3 & C4.5
 ID3: Iterative Dichotomiser, developed by the Australian researcher Ross Quinlan in 1979
   Used in the data mining software Clementine of Integral Solutions Ltd. (taken over by SPSS)
 ID3 picks predictors and their splitting values on the basis of the information gain provided
   Gain is the difference between the amount of information that is needed to make a correct prediction before a split has been made, and that required after the split
   If the amount of information required is lower after the split is made, then the split is said to have decreased the disorder of the original data
ID3 & C4.5 - 2
 Example: choosing between two candidate splits, A and B, of ten records (five positive “+” and five negative “-”), using the entropy measure -Σ p log(p)

    Split   Left branch    Right branch   Left entropy                        Right entropy
    A       ++++-          +----          -4/5 log(4/5) - 1/5 log(1/5) = .72  -4/5 log(4/5) - 1/5 log(1/5) = .72
    B       +++++----      -              -5/9 log(5/9) - 4/9 log(4/9) = .99  -1/1 log(1/1) = 0

    Start entropy (before either split): -5/10 log(5/10) - 5/10 log(5/10) = 1
ID3 & C4.5 - 3
    Split   Weighted entropy                     Gain
    A       (5/10)*0.72 + (5/10)*0.72 = 0.72     0.28
    B       (9/10)*0.99 + (1/10)*0    = 0.89     0.11

 A will be selected
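The table's numbers can be reproduced directly. A minimal sketch (base-2 logarithms assumed; the function names are mine):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    start = entropy([5, 5])                          # 1.0 for the unsplit +++++----- data
    splits = {"A": [[4, 1], [1, 4]],                 # ++++-     | +----
              "B": [[5, 4], [0, 1]]}                 # +++++---- | -
    for name, children in splits.items():
        n = sum(sum(c) for c in children)
        weighted = sum(sum(c) / n * entropy(c) for c in children)
        print(name, round(weighted, 2), "gain =", round(start - weighted, 2))
    # A 0.72 gain = 0.28
    # B 0.89 gain = 0.11   -> A is selected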
ID3 & C4.5 - 4
 C4.5 introduces a number of extensions to ID3:
   Handles unknown field values in training set
   Tree pruning method
     » pruning produces a smaller tree (by “pruning” some of the leaves and branches) that performs better on test data; the aim is to avoid over-fitting the training data
   Automated rule generation
     » extracts human-readable logical rules from the tree
 Quinlan now has a commercial version with further improvements, See5/C5.0
   http://www.rulequest.com/see5-info.html
Case Study
 Now we will look at a case study [KPR2002], available on the web at
   http://web.maths.unsw.edu.au/~inge/statlearn/kolyshkina.pdf
Strengths and Weaknesses
 Strengths of decision trees
   Able to generate understandable rules
   Classify with very little computation
   Some decision tree induction algorithms are able to handle both continuous and categorical data
   Provide a clear indication of which variables are most important for prediction or classification
 Weaknesses
   Most are not appropriate for estimation or prediction tasks (income, interest rates, etc.)
     » the exception is regression trees
   Problematic with time series data (much pre-processing required); can be computationally expensive
References
 [BeL1997] J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons Inc., 1997
 [BeS1997] A. Berson and S. J. Smith, Data Warehousing, Data Mining and OLAP, McGraw Hill, 1997
 [KPR2002] I. Kolyshkina, P. Petocz and I. Rylande, Modeling insurance risk: A comparison of data mining and logistic regression approaches, in Proceedings of the 16th Australian Statistical Conference, Canberra, ACT, Australia, July 7-11, 2002.
   Related presentation: http://web.maths.unsw.edu.au/~inge/statlearn/kolyshkina.pdf
 [SGI2001] Silicon Graphics Inc., MLC++ Utilities Manual, 2001. http://www.sgi.com/tech/mlc/utils.html