Hunt`s Algorithm - The Institute of Finance Management

advertisement
Hunt’s Algorithm
CIT365: Data Mining & Data Warehousing
Bajuna Salehe
The Institute of Finance Management:
Computing and IT Dept.
Decision Tree Induction Algorithms
Number of Algorithms:
• Hunt’s
– Hunt's Algorithm (1966)
• Quinlan's
– Iterative Dichotomizer3 (1975) uses Entropy
– C4.5 / 4.8 / 5.0 (1993) uses Entropy
• Brieman's
– Classification And Regression Trees (1984) uses Gini
• Kass's
– CHi-squared Automatic Interaction Detector (1980) uses ____
• IBM:
Mehta
– Supervised Learning In Quest (1996) uses Gini
Shafer
– Scalable PaRallelizable INduction of decision Trees (1996) uses Gini
Hunt’s Algorithm
• In the Hunt’s algorithm, a decision tree is
grown in a recursive fashion by partitioning
the training records successively into purer
subsets
Hunt’s Algorithm
• Let Dt be the set of training records that are
associated with node t and y = {y1, y2, · · · , yc}
be the class labels. The following is a recursive
definition of Hunt’s algorithm.
• Step 1: If all the records in Dt belong to the
same class yt, then t is a leaf node labeled as
yt.
Hunt’s Algorithm
• Step 2: If Dt contains records that belong to
more than one class, an attribute test
condition is used to partition the records into
smaller subsets. A child node is then created
for each outcome of the test condition. The
records in Dt are distributed to the children
based upon their outcomes. This procedure is
repeated for each child node.
Hunt’s Algorithm
Dt = {training records @ node t}
• If Dt = {records from different classes}
– Split Dt into smaller subsets via attribute test
– Traverse each subset with same rules
• If Dt = {records from single class yt}
– Set Node t = leaf node with class label yt
• If Dt = {} (empty)
– Set Node t = leaf node with default class label
yd
• Recursively apply above criterion until ...
– No more training records left
Example
• Consider the problem of predicting whether a
loan applicant will succeed in repaying her
loan obligations or become delinquent, and
subsequently, default on her loan.
• The training set used for predicting borrowers
who will default on their loan payments will
be as follows.
Example.
Figure1
Example
• A training set for this problem can be
constructed by examining the historical
records of previous loan borrowers.
• In the training set shown in Figure 1, each
record contains the personal information of a
borrower along with a class label indicating
whether the borrower has defaulted on her
loan payments.
Example
• The initial tree for the classification problem
contains a single node with class label
Defaulted = No as illustrated below:
Figure 1a: Step 1
• This means that most of the borrowers had
successfully repayed their loans.
• However, the tree needs to be refined since
the root node contains records from both
classes.
Example
• The records are subsequently divided into smaller
subsets based on the outcomes of the Home
Owner test condition, as shown in Figure below:
Figure 1b: Step 2
• The reason for choosing this attribute test
condition instead of others is an implementation
issue that will be discussed later.
Example
• Now we can assume that this is the best
criterion for splitting the data at this point.
• The Hunt’s algorithm is then applied
recursively to each child of the root node.
• From the training set given in Figure 1, notice
that all borrowers who are home owners had
successfully repayed their loan.
Example
• As a result, the left child of the root is a leaf
node labeled as Defaulted = No as shown in
figure 1b
• For the right child of the root node, we need
to continue applying the recursive step of
Hunt’s algorithm until all the records belong
to the same class.
Example
• This recursive step is shown in Figures 1c and
d below:
Figure1c: Step 3
Figure 1d: step 4
Example
• Generally the whole diagram will be as follows
Design Issues of Decision Tree
Induction
• How to split the training records? - Each
recursive step of the tree growing process
requires an attribute test condition to divide
the records into smaller subsets.
• To implement this step, the algorithm must
provide a method for specifying the test
condition for different attribute types as well
as an objective measure for evaluating the
goodness of each test condition.
Design Issues of Decision Tree
Induction
• When to stop splitting? A stopping condition is
needed to terminate the tree growing process.
• A possible strategy is to continue expanding a
node until all the records belong to the same
class or if all the records have identical
attribute values.
How to Split an Attribute
• Before automatically creating a decision tree,
you can choose from several splitting
functions that are used to determine which
attribute to split on. The following splitting
functions are available:
– Random - The attribute to split on is chosen
randomly.
– Information Gain - The attribute to split on is the
one that has the maximum information gain.
How to Split an Attribute
– Gain Ratio - Selects the attribute with the highest
information gain to number of input values ratio.
The number of input values is the number of
distinct values of an attribute occurring in the
training set.
– GINI - The attribute with the highest GINI index is
chosen. The GINI index is a measure of impurity of
the examples.
Training Dataset
Age
Income
Student
CreditRating
BuysComputer
<=30
high
no
fair
no
<=30
high
no
excellent
no
31 - 40
high
no
fair
yes
>40
medium
no
fair
yes
>40
low
yes
fair
yes
>40
low
yes
excellent
no
31 - 40
low
yes
excellent
yes
<=30
medium
no
fair
no
<=30
low
yes
fair
yes
>40
medium
yes
fair
yes
<=30
medium
yes
excellent
yes
31 - 40
medium
no
excellent
yes
31 - 40
high
yes
fair
yes
>40
medium
no
excellent
no
Resultant Decision Tree
Attribute Selection Measure:
Information Gain (ID3/C4.5)
• The attribute selection mechanism used in ID3
and based on work on information theory by
Claude Shannon
• If our data is split into classes according to
fractions {p1,p2…, pm} then the entropy is
measured as the info required to classify any
arbitrary tuple as follows:
m
E ( p1 ,p2 ,...,pm )   pi log2  pi 
i 1
Attribute Selection Measure:
Information Gain (ID3/C4.5) (cont…)
• The information measure is essentially the
same as entropy
• At the root node the information is as follows:
9 5
info[9,5]  E  , 
 14 14 
9
9 5
5
  log2    log2  
14
 14  14
 14 
 0.94
Attribute Selection Measure:
Information Gain (ID3/C4.5) (cont…)
• To measure the information at a particular
attribute we measure info for the various
splits of that attribute
• For instance with age attribute look at the
distribution of ‘Yes’ and ‘No’ samples for each
value of age. Compute the expected
information for each of these distribution.
• For age “<=30”
Attribute Selection Measure:
Information Gain (ID3/C4.5) (cont…)
• At the age attribute the information is as
follows:
5
4
5
info[2,3],[4,0],[3,2]  info2,3  info4,0  info3,2
14
14
14
5 2
2 3
 3 
   log2    log2   
14  5
5 5
 5 
4 4
4 0
 0 
   log2    log2   
14  4
4 4
 4 
5 3
3 2
 2 
   log2    log2   
14  5
5 5
 5 
 0.694
Attribute Selection Measure:
Information Gain (ID3/C4.5) (cont…)
• In order to determine which attributes we
should use at each node we measure the
information gained in moving from one node
to another and choose the one that gives us
the most information
Attribute Selection By Information
Gain Example
• Class P: BuysComputer = “yes”
• Class N: BuysComputer = “no”
– I(p, n) = I(9, 5) =0.940
• Compute the entropy for age:
Age
<=30
<=30
31 - 40
>40
>40
>40
31 - 40
<=30
<=30
>40
<=30
31 - 40
31 - 40
Income
high
high
high
medium
low
low
low
medium
low
medium
medium
medium
high
Student
no
no
no
no
yes
yes
yes
no
yes
yes
yes
no
yes
CreditRating
fair
excellent
fair
fair
fair
excellent
excellent
fair
fair
fair
excellent
excellent
fair
BuysComputer
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
Age
pi
ni
I(pi, ni)
>=30
2
3
0.971
30 – 40
4
0
0
>40
3
2
0.971
Attribute Selection By Information
Gain Computation
5
4
5
E (age) 
I (2,3)  I (4,0)  I (3,2)
14
14
14
 0.694
•
means “age <=30” has 5 out of 14 samples,
with 2 yes and 3 no. Hence:
Similarly:
Gain(incom e)  0.029
Gain( student)  0.151
Gain(credit _ rating)  0.048
Download