Chapter 6
Decision Trees
An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances contained in T.
3. Create a tree node whose value is the chosen attribute. Create child links from this node, where each link represents a unique value of the chosen attribute. Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   a. If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path of the tree is null, specify the classification for new instances following this decision path.
   b. If the subclass does not satisfy the predefined criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
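The steps above can be written as a short recursive procedure. The following Python sketch is illustrative rather than a prescribed implementation: `choose_attribute` stands for any attribute-selection measure (for example, the information gain introduced later in the chapter), and `satisfies_criteria` stands for the predefined stopping criteria of step 4a.

```python
from collections import Counter

def majority_class(instances, target):
    """Most common value of the target attribute among the instances."""
    return Counter(inst[target] for inst in instances).most_common(1)[0][0]

def build_tree(instances, attributes, target, choose_attribute, satisfies_criteria):
    """Recursive sketch of steps 1-4.  Each instance is a dict of attribute -> value.

    Returns either a class label (leaf) or a nested dict of the form
    {attribute: {value: subtree, ...}}.
    """
    # Step 4a: stop if the predefined criteria hold or no attributes remain.
    if satisfies_criteria(instances, target) or not attributes:
        return majority_class(instances, target)

    # Step 2: choose the attribute that best differentiates the instances in T.
    best = choose_attribute(instances, attributes, target)

    # Step 3: create a node for the chosen attribute with one child link per value.
    node = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {inst[best] for inst in instances}:
        subset = [inst for inst in instances if inst[best] == value]
        # Step 4b: recurse on each subclass with the reduced attribute list.
        node[best][value] = build_tree(subset, remaining, target,
                                       choose_attribute, satisfies_criteria)
    return node
```

A typical `satisfies_criteria` is "all instances in the subclass carry the same class label".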
Decision Trees for the Credit Card Promotion Database

[Figure slides: candidate decision trees built from the credit card promotion database; the tree diagrams are not reproduced in this text version.]
Decision Tree Rules
IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
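Because each rule is just a conjunction of attribute tests, it translates directly into code. A minimal sketch follows; the attribute and value names mirror the rules above, while the default return value for records not covered by either rule is an assumption added for illustration.

```python
def life_insurance_promotion(age, sex, credit_card_insurance):
    """Apply the two decision-tree rules above to one customer record."""
    # IF Age <= 43 & Sex = Male & Credit Card Insurance = No
    # THEN Life Insurance Promotion = No
    if age <= 43 and sex == "Male" and credit_card_insurance == "No":
        return "No"
    # More general rule: IF Sex = Male & Credit Card Insurance = No
    # THEN Life Insurance Promotion = No
    if sex == "Male" and credit_card_insurance == "No":
        return "No"
    # Records not covered by either rule (assumed default, for illustration only).
    return "Yes"

print(life_insurance_promotion(35, "Male", "No"))   # -> No
```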
General Considerations
Here are a few of the many advantages decision trees have to offer:
• Decision trees are easy to understand and map nicely to a set of production rules.
• Decision trees have been successfully applied to real problems.
• Decision trees make no prior assumptions about the nature of the data.
• Decision trees can build models from datasets containing numerical as well as categorical data.
As with all data mining algorithms, there are several issues surrounding decision tree usage. Specifically:
• Output attributes must be categorical, and multiple output attributes are not allowed.
• Decision tree algorithms are unstable, in that slight variations in the training data can result in different attribute selections at each choice point within the tree. The effect can be significant, since attribute choices affect all descendant subtrees.
• Trees created from numeric datasets can be quite complex, as attribute splits for numeric data are typically binary.
Classification by Decision Tree Induction
• Decision tree
  – A flow-chart-like tree structure
  – Internal node denotes a test on an attribute
  – Branch represents an outcome of the test
  – Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
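Classifying an unknown sample is a walk from the root to a leaf, testing one attribute at each internal node. A small sketch follows; the nested-dict representation is an illustrative choice (not prescribed by the slides), and the tree literal is the "buys_computer" tree shown later in the chapter.

```python
def classify(tree, sample):
    """Walk the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))      # the attribute tested at this node
        value = sample[attribute]         # outcome of the test for this sample
        tree = tree[attribute][value]     # follow the matching branch
    return tree                           # leaf: the class label

# The "buys_computer" tree from the example later in the chapter.
buys_computer_tree = {
    "age": {
        "<=30":  {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40":   {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

sample = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(buys_computer_tree, sample))   # -> yes
```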
Training Dataset
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no

This dataset follows an example from Quinlan's ID3.
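For the worked information-gain example later in the chapter, it helps to have this table in machine-readable form. A minimal transcription as a list of (age, income, student, credit_rating, buys_computer) tuples, together with the class counts it implies (the variable names are illustrative):

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer) -- the 14 rows above
training_data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

print(Counter(row[-1] for row in training_data))
# Counter({'yes': 9, 'no': 5}) -- 9 positive and 5 negative examples
```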
Output: A Decision Tree for “buys_computer”
age?
  <=30  → student?
            no  → buys_computer = no
            yes → buys_computer = yes
  31…40 → buys_computer = yes
  >40   → credit_rating?
            excellent → buys_computer = no
            fair      → buys_computer = yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
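The stopping conditions map to a few lines of code. A tiny sketch, operating on the list of class labels at a node (function names are illustrative): when every sample at the node shares one class the node becomes a pure leaf, and when no attributes (or no samples) remain the leaf takes the majority class.

```python
from collections import Counter

def all_same_class(class_values):
    """Stopping condition 1: all samples at the node belong to the same class."""
    return len(set(class_values)) <= 1

def majority_vote(class_values):
    """Used when no attributes remain: label the leaf with the most
    frequent class among the samples that reached it."""
    return Counter(class_values).most_common(1)[0][0]

labels = ["yes", "yes", "no", "yes"]
print(all_same_class(labels), majority_vote(labels))   # -> False yes
```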
[Example figure: a small decision tree for deciding between "BBQ" and "Eat in". The root tests weather (sunny, cloudy, rainy); the sunny branch further tests Temp > 75, the cloudy branch tests windy (no → BBQ, yes → Eat in), and the rainy branch leads directly to Eat in.]
Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each
attribute
– May need other tools, such as clustering, to get the
possible split values
– Can be modified for categorical attributes
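For comparison with information gain, the Gini index measures impurity as 1 minus the sum of squared class proportions, and a binary split is scored by the size-weighted Gini of the two partitions. A small sketch of the standard formula (function names are illustrative; the two-way split on credit_rating simply reuses the counts from the buys_computer data so the numbers are easy to check):

```python
def gini(class_counts):
    """Gini index: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(left_counts, right_counts):
    """Size-weighted Gini index of a binary split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# 9 "yes" / 5 "no" before splitting, as in the buys_computer data:
print(round(gini([9, 5]), 3))                  # 0.459
# fair = (6 yes, 2 no), excellent = (3 yes, 3 no):
print(round(gini_split([6, 2], [3, 3]), 3))    # 0.429
```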
Information Gain (ID3/C4.5)
• Select the attribute with the highest information
gain
• Assume there are two classes, P and N
  – Let the set of examples S contain
    • p elements of class P
    • n elements of class N
  – The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

    I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}
Information Gain in Decision Tree Induction
• Assume that, using attribute A, a set S will be partitioned into sets {S1, S2, …, Sv}
  – If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

    E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

• The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)
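Expressed over a partition of S by attribute A, E(A) is the size-weighted sum of the subset entropies, and the gain is the resulting drop in entropy. A sketch (the entropy helper from the previous snippet is repeated so this block runs on its own; names are illustrative):

```python
from math import log2

def info(p, n):
    """I(p, n): entropy of a node with p positive and n negative examples."""
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c)

def expected_info(partition):
    """E(A): partition is a list of (p_i, n_i) pairs, one per value of A."""
    total = sum(p + n for p, n in partition)
    return sum(((p + n) / total) * info(p, n) for p, n in partition)

def gain(partition):
    """Gain(A) = I(p, n) - E(A), with p and n summed over the whole partition."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return info(p, n) - expected_info(partition)

# Example: the three age subsets of the buys_computer data, (p_i, n_i) per value.
print(f"{gain([(2, 3), (4, 0), (3, 2)]):.3f}")
# 0.247 -- rounding the intermediates first (0.940 - 0.694) gives the 0.246 quoted below
```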
Attribute Selection by Information Gain Computation
Using the training data above:

• Class P: buys_computer = "yes"  (p = 9)
• Class N: buys_computer = "no"  (n = 5)

  I(p, n) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
• Compute the entropy for age. The class counts and entropy of each age subset are:

  age      pi   ni   I(pi, ni)
  <=30     2    3    0.971
  31…40    4    0    0
  >40      3    2    0.971
• E(age) = \frac{5}{14}\,I(2,3) + \frac{4}{14}\,I(4,0) + \frac{5}{14}\,I(3,2) = 0.694

• Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246

• Similarly:

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

• Age yields the highest information gain and is therefore selected as the splitting attribute at the root.
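The numbers above can be checked with a few lines of Python. The per-value class counts below are read off the training table (p = buys_computer "yes", n = "no"); the helper functions repeat the earlier sketch so this block stands alone.

```python
from math import log2

def info(p, n):
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c)

def gain(partition):
    """Information gain for a partition given as (p_i, n_i) pairs."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    expected = sum(((pi + ni) / (p + n)) * info(pi, ni) for pi, ni in partition)
    return info(p, n) - expected

# (p_i, n_i) counts per attribute value, taken from the 14 training examples.
partitions = {
    "age":           [(2, 3), (4, 0), (3, 2)],   # <=30, 31…40, >40
    "income":        [(2, 2), (4, 2), (3, 1)],   # high, medium, low
    "student":       [(6, 1), (3, 4)],           # yes, no
    "credit_rating": [(6, 2), (3, 3)],           # fair, excellent
}

for attribute, partition in partitions.items():
    print(f"Gain({attribute}) = {gain(partition):.3f}")
# Gain(age) = 0.247, Gain(income) = 0.029, Gain(student) = 0.152,
# Gain(credit_rating) = 0.048 -- matching the values above up to rounding
# of the intermediate quantities.
```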
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN
rules
• One rule is created for each path from the root
to a leaf
• Each attribute-value pair along a path forms a
conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
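Rule extraction is a depth-first traversal that collects the attribute-value tests along each root-to-leaf path. A sketch using the nested-dict tree representation from the earlier classification example (an illustrative representation, not a prescribed API):

```python
def extract_rules(tree, conditions=()):
    """Yield one (conditions, class_label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):                 # leaf: the class prediction
        yield list(conditions), tree
        return
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        # Each attribute-value pair along the path becomes one conjunct.
        yield from extract_rules(subtree, conditions + ((attribute, value),))

buys_computer_tree = {
    "age": {
        "<=30":  {"student": {"no": "no", "yes": "yes"}},
        "31…40": "yes",
        ">40":   {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

for conditions, label in extract_rules(buys_computer_tree):
    tests = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
    print(f'IF {tests} THEN buys_computer = "{label}"')
```

Running this prints exactly the rule set listed below.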
Applied to the decision tree for "buys_computer" shown earlier, rule extraction yields:

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
Avoid Overfitting in Classification
• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"
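As a concrete illustration, using scikit-learn and a synthetic dataset (both are assumptions; the slides do not prescribe a library or data), prepruning can be approximated with a minimum-impurity-decrease threshold and postpruning with cost-complexity pruning, with a held-out set deciding which pruned tree generalizes best:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, split into training data and a separate validation/pruning set.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33, random_state=0)

# Prepruning: refuse a split unless the impurity (goodness) improvement exceeds
# a threshold.  Choosing the threshold is the hard part.
prepruned = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0)
prepruned.fit(X_train, y_train)

# Postpruning: grow the tree, then consider progressively pruned versions (here
# via cost-complexity pruning) and keep the one that scores best on held-out data.
best_score, best_tree = 0.0, None
for alpha in (0.0, 0.005, 0.01, 0.02, 0.05):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)
    if score > best_score:
        best_score, best_tree = score, tree

print("prepruned accuracy:", prepruned.score(X_valid, y_valid))
print("best postpruned accuracy:", best_score)
```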
Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate whether
expanding or pruning a node may improve the entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is minimized
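One way to realize the first two approaches, again sketched with scikit-learn and synthetic data (assumed choices, not specified by the slides): compare candidate tree sizes with 10-fold cross-validation and keep the size that scores best.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)

# 10-fold cross-validation over candidate tree sizes (here controlled by max_depth).
for depth in (2, 4, 6, 8, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=10)
    print(f"max_depth={depth}: mean accuracy {scores.mean():.3f}")
```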
Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that partition
the continuous attribute value into a discrete set of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are sparsely
represented
– This reduces fragmentation, repetition, and replication
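For the first enhancement, a continuous attribute is commonly handled by trying candidate thresholds (midpoints between adjacent sorted values) and keeping the one with the best split score. A sketch using information gain; the small age/label arrays at the end are made up purely for illustration, and the function names are illustrative.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result

def best_threshold(values, labels):
    """Pick the binary split value <= t / value > t with the highest information gain."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2     # midpoint candidate threshold
        left = [lbl for v, lbl in pairs if v <= t]
        right = [lbl for v, lbl in pairs if v > t]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - expected > best[0]:
            best = (base - expected, t)
    return best                                     # (information gain, threshold)

ages = [23, 25, 31, 35, 40, 42, 46, 51]             # hypothetical example data
buys = ["no", "no", "yes", "yes", "yes", "yes", "no", "no"]
print(best_threshold(ages, buys))                    # -> (≈0.311, 28.0)
```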
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
  – Relatively faster learning speed (than other classification methods)
  – Convertible to simple and easy-to-understand classification rules
  – Can use SQL queries for accessing databases
  – Comparable classification accuracy with other methods
Chapter Summary
• Decision trees are probably the most popular structure for supervised data mining.
• A common algorithm for building a decision tree selects a subset of instances from the training data to construct an initial tree.
• The remaining training instances are then used to test the accuracy of the tree.
• If any instance is incorrectly classified, the instance is added to the current set of training data and the process is repeated.
• A main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization.
• Decision trees have been successfully applied to real problems, are easy to understand, and map nicely to a set of production rules.
Reference
Data Mining: Concepts and Techniques (Chapter 7 slides for the textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada.