Decision Trees and Tables

Knowledge Acquisition and Modelling
Decision Tables and Decision Trees
Decision Trees





A map of a reasoning process
Describes decisions in a tree-like structure
A graphical representation of a decision situation
Decision points are connected together by arcs and terminate in ovals
Main components:
Decision points, represented by nodes
Actions
Particular choices from a decision point, represented by arcs (or straight lines)
Can be drawn left to right or top down
Decision Trees

Essentially flowcharts
A natural order of ‘micro decisions’ (Boolean, yes/no decisions) to reach a conclusion
In its simplest form, all you need is:
A start
A cascade of Boolean decisions (each with exactly two outbound branches)
A set of terminal nodes representing all the ‘leaves’ of the decision tree
An Example



Bank - loan application
Classify application: approved class, denied class
Criteria - target class is approved if 3 binary attributes have certain values:
(a) borrower has good credit history (credit rating in excess of some threshold)
(b) loan amount less than some percentage of collateral value (e.g., 80% of home value)
(c) borrower has income to make payments on loan
Possible scenarios = 2^3 = 8
If the parameters for splitting the nodes can be adjusted, the number of scenarios grows exponentially.
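
To make the count concrete, here is a minimal Python sketch that enumerates the 2^3 = 8 scenarios for the three binary attributes; the attribute names and the "approve only if all three hold" policy are illustrative stand-ins for the criteria above.

from itertools import product

attributes = ["good_credit_history", "loan_below_80pct_of_collateral", "sufficient_income"]

for values in product([True, False], repeat=len(attributes)):
    scenario = dict(zip(attributes, values))
    decision = "approved" if all(values) else "denied"
    print(scenario, "->", decision)
# Prints 8 lines, one per scenario; only the all-True case is approved.
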
How They Work





Decision rules partition the sample of data
A terminal node (leaf) indicates the class assignment
The tree partitions samples into mutually exclusive groups, one group for each terminal node
All paths:
start at the root node
end at a leaf (a terminal node, where the decision is made)
How They Work

Each path represents a decision rule:
a joining (AND) of all the tests along that path
separate paths that result in the same class are disjunctions (ORs)
All paths are mutually exclusive:
for any one case, only one path will be followed
false decisions follow the left branch
true decisions follow the right branch
Disjunctive Normal Form

At a non-terminal node, the model identifies an attribute to be tested:
the test splits the attribute's values into mutually exclusive, disjoint sets
splitting continues until a node holds a single class (a terminal node or leaf)
The structure is disjunctive normal form:
it limits the form of a rule to conjunctions (AND-ing) of terms
and allows disjunction (OR-ing) over a set of rules
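
As a rough Python sketch of disjunctive normal form, each path can be stored as a dictionary of tests (a conjunction), and a class is predicted when any one of its paths matches (a disjunction). The rules below are the "long commute" paths of the commute-time tree discussed next; the helper names are illustrative.

long_rules = [
    {"hour": "8am"},                        # path: hour == 8am
    {"hour": "9am", "accident": "yes"},     # path: hour == 9am AND accident == yes
    {"hour": "10am", "stall": "yes"},       # path: hour == 10am AND stall == yes
]

def matches(case, rule):
    # Conjunction: every test on the path must be satisfied.
    return all(case.get(attr) == value for attr, value in rule.items())

def is_long(case):
    # Disjunction: any one matching path is enough.
    return any(matches(case, rule) for rule in long_rules)

print(is_long({"hour": "9am", "accident": "yes", "stall": "no"}))   # True
print(is_long({"hour": "10am", "accident": "no", "stall": "no"}))   # False
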
Predicting Commute Time
[Decision tree figure: root node 'Leave At' with branches 8 AM, 9 AM and 10 AM. 8 AM → Long; 9 AM → 'Accident?' (No → Medium, Yes → Long); 10 AM → 'Stall?' (No → Short, Yes → Long).]

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Inductive Learning

Making a series of Boolean decisions and following the relevant branch:
Did we leave at 10 AM?
Did a car stall on the road?
Is there an accident on the road?
Answering each yes/no question allows traversal of the tree to reach a conclusion
Decision Tree as a Rule Set







IF hour == 8am THEN commute time = long
IF hour == 9am AND accident == yes THEN commute time = long
IF hour == 9am AND accident == no THEN commute time = medium
IF hour == 10am AND stall == yes THEN commute time = long
IF hour == 10am AND stall == no THEN commute time = short
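
The same rule set translates almost directly into code. A minimal sketch, using the attribute and value names from the rules above:

def commute_time(hour, accident, stall):
    # Direct translation of the rule set above; exactly one rule fires per case.
    if hour == "8am":
        return "long"
    elif hour == "9am" and accident == "yes":
        return "long"
    elif hour == "9am" and accident == "no":
        return "medium"
    elif hour == "10am" and stall == "yes":
        return "long"
    elif hour == "10am" and stall == "no":
        return "short"

print(commute_time("10am", accident="no", stall="no"))  # short
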
How to Create a Decision Tree

Make a list of attributes that we can measure
These attributes (for now) must be discrete
We then choose a target attribute that we want to predict
Then create an experience table that lists what we have seen in the past
Sample Experience Table
Example  Hour   Weather  Accident  Stall  Commute
D1       8 AM   Sunny    No        No     Long
D2       8 AM   Cloudy   No        Yes    Long
D3       10 AM  Sunny    No        No     Short
D4       9 AM   Rainy    Yes       No     Long
D5       9 AM   Sunny    Yes       Yes    Long
D6       10 AM  Sunny    No        No     Short
D7       10 AM  Cloudy   No        No     Short
D8       9 AM   Rainy    No        No     Medium
D9       9 AM   Sunny    Yes       No     Long
D10      10 AM  Cloudy   Yes       Yes    Long
D11      10 AM  Rainy    No        No     Short
D12      8 AM   Cloudy   Yes       No     Long
D13      9 AM   Sunny    No        No     Medium

(Hour, Weather, Accident and Stall are the attributes; Commute is the target.)
Choosing Attributes



The previous experience table showed 4 attributes: hour, weather, accident and stall
But the decision tree only showed 3 attributes: hour, accident and stall
Why is that?
Choosing Attributes


Methods for selecting attributes show that weather is not a discriminating attribute
We use the principle of Occam's Razor: given a number of competing hypotheses, the simplest one is preferable
Decision Tree Algorithms

The basic idea behind any decision tree algorithm is as follows:
Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
Repeat this process recursively for each child
Stop when:
All the instances have the same target attribute value
There are no more attributes
There are no more instances
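
A rough Python skeleton of this generic algorithm is sketched below; the heuristic for choosing the best attribute is passed in as a function (for example, ID3's expected-entropy measure described later), and all names are illustrative.

from collections import Counter

def build_tree(instances, attributes, target, choose_best_attribute):
    # instances: list of dicts mapping attribute name -> value
    # choose_best_attribute: the splitting heuristic, e.g. ID3's expected entropy
    if not instances:
        return None   # in practice, return the parent's majority class
    classes = [inst[target] for inst in instances]
    # Stop: all instances have the same target attribute value.
    if len(set(classes)) == 1:
        return classes[0]
    # Stop: no attributes left -> fall back to the majority class.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    best = choose_best_attribute(instances, attributes, target)
    node = {"split_on": best, "branches": {}}
    for value in set(inst[best] for inst in instances):
        subset = [inst for inst in instances if inst[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, target,
                                             choose_best_attribute)
    return node
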
Identifying the Best Attributes

Refer back to our original decision tree.

[Decision tree figure: 'Leave At' splits into 8 AM → Long; 9 AM → 'Accident?' (Yes → Long, No → Medium); 10 AM → 'Stall?' (Yes → Long, No → Short).]

How did we know to split on 'leave at' first, and then on stall and accident, and not on weather?
ID3 Heuristic




To determine the best attribute, we look at the ID3 heuristic
ID3 splits attributes based on their entropy
Entropy is a measure of disorder (uncertainty) in the data
Entropy is minimized when all values of the target attribute are the same
If we know that commute time will always be short, then entropy = 0
Entropy is maximized when there is an equal chance of all values for the target attribute (i.e. the result is random)
If commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized
Entropy

Calculation of entropy:

Entropy(S) = ∑(i = 1 to l) −(|Si|/|S|) * log2(|Si|/|S|)

S = the set of examples
Si = the subset of S with value vi under the target attribute
l = the number of distinct values (the size of the range) of the target attribute
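
A minimal Python sketch of this calculation (function and field names are illustrative):

import math
from collections import Counter

def entropy(examples, target):
    # Entropy(S) = sum over target values of -(|Si|/|S|) * log2(|Si|/|S|)
    total = len(examples)
    counts = Counter(ex[target] for ex in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# All target values identical -> entropy 0.
same = [{"commute": "short"}] * 5
print(entropy(same, "commute"))          # 0.0

# Three equally likely values -> entropy is maximal (log2(3) ~ 1.585).
mixed = ([{"commute": "short"}] * 3 + [{"commute": "medium"}] * 3
         + [{"commute": "long"}] * 3)
print(entropy(mixed, "commute"))         # ~1.585
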
ID3


ID3 splits on the attribute with the lowest expected entropy
We calculate the expected entropy of an attribute as the weighted sum of subset entropies, as follows:

∑(i = 1 to k) (|Si|/|S|) * Entropy(Si), where k is the number of values of the attribute we are testing

We can also measure information gain, the reduction in entropy produced by the split, as follows:

Gain = Entropy(S) − ∑(i = 1 to k) (|Si|/|S|) * Entropy(Si)
ID3

Given our commute time sample set, we can calculate the entropy of each attribute at the root node:

Attribute  Expected Entropy  Information Gain
Hour       0.6511            0.768449
Weather    1.28884           0.130719
Accident   0.92307           0.496479
Stall      1.17071           0.248842
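
The following standalone Python sketch recomputes these figures from the sample experience table; the printed expected entropies and gains should match the values above (up to rounding). All names are illustrative.

import math
from collections import Counter

# Experience table from the earlier slide: (Hour, Weather, Accident, Stall, Commute)
DATA = [
    ("8 AM",  "Sunny",  "No",  "No",  "Long"),
    ("8 AM",  "Cloudy", "No",  "Yes", "Long"),
    ("10 AM", "Sunny",  "No",  "No",  "Short"),
    ("9 AM",  "Rainy",  "Yes", "No",  "Long"),
    ("9 AM",  "Sunny",  "Yes", "Yes", "Long"),
    ("10 AM", "Sunny",  "No",  "No",  "Short"),
    ("10 AM", "Cloudy", "No",  "No",  "Short"),
    ("9 AM",  "Rainy",  "No",  "No",  "Medium"),
    ("9 AM",  "Sunny",  "Yes", "No",  "Long"),
    ("10 AM", "Cloudy", "Yes", "Yes", "Long"),
    ("10 AM", "Rainy",  "No",  "No",  "Short"),
    ("8 AM",  "Cloudy", "Yes", "No",  "Long"),
    ("9 AM",  "Sunny",  "No",  "No",  "Medium"),
]
COLUMNS = ["Hour", "Weather", "Accident", "Stall", "Commute"]
TARGET = "Commute"
rows = [dict(zip(COLUMNS, r)) for r in DATA]

def entropy(examples):
    total = len(examples)
    counts = Counter(ex[TARGET] for ex in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def expected_entropy(examples, attribute):
    # Weighted sum of subset entropies over the attribute's values.
    total = len(examples)
    result = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == value]
        result += len(subset) / total * entropy(subset)
    return result

base = entropy(rows)
for attribute in ["Hour", "Weather", "Accident", "Stall"]:
    exp = expected_entropy(rows, attribute)
    print(f"{attribute:8s}  expected entropy = {exp:.5f}  gain = {base - exp:.6f}")
# Hour has the lowest expected entropy / highest gain, so ID3 splits on it first.
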
Problems with ID3

ID3 is not optimal
Uses expected entropy reduction, not actual reduction
Must use discrete (or discretized) attributes
What if we left for work at 9:30 AM?
We could break down the attributes into smaller values…
Problems with ID3

If we broke down leave time to the minute, we might get something like this:

[Tree figure: the root splits on minute-level leave times (8:02 AM, 8:03 AM, 9:05 AM, 9:07 AM, 9:09 AM, 10:02 AM, …), each branch ending in its own single-example leaf (Long, Medium, Short, …).]

Since entropy is very low for each branch, we have n branches with n leaves. This would not be helpful for predictive modeling.
Problems with ID3

We can use a technique known as discretization:
choose cut points, such as 9 AM, for splitting continuous attributes
cut points generally lie in a subset of boundary points, where a boundary point is a point at which two adjacent instances in a sorted list have different target attribute values
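
A minimal sketch of finding candidate cut points at boundary points, assuming the attribute values are sorted and using made-up leave times (minutes after midnight):

# (leave_time_in_minutes_after_midnight, commute) -- illustrative data
samples = [(482, "Long"), (483, "Medium"), (545, "Short"),
           (547, "Long"), (602, "Short")]

samples.sort(key=lambda s: s[0])
cut_points = []
for (x1, y1), (x2, y2) in zip(samples, samples[1:]):
    if y1 != y2:                          # target value changes between neighbours
        cut_points.append((x1 + x2) / 2)  # midpoint as candidate cut point

print(cut_points)   # [482.5, 514.0, 546.0, 574.5]
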
Pruning (another technique for attribute selection)

Pre-Pruning
Decide during the building process when to stop adding attributes (possibly based on their information gain)
May be problematic:
Individually, attributes may not contribute much to a decision
But in combination with other attributes they may have a significant impact
Post-Pruning

Waits until the full decision tree has been built and then prunes attributes, using either:
Subtree Replacement
Subtree Raising
Subtree Replacement

An entire subtree is replaced by a single leaf node

[Figure: tree with internal nodes A, B and C and leaves 1-5, before replacement.]
Subtree Replacement


Node 6 replaced the subtree
Generalizes the tree a little more, which may increase accuracy on unseen data

[Figure: tree after replacement, with nodes A and B, new leaf 6, and leaves 4 and 5.]
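
A rough sketch of subtree replacement in this spirit, assuming a simple dictionary-based tree and a held-out validation set: a subtree is collapsed into a leaf holding the majority class whenever the leaf makes no more validation errors than the subtree (a reduced-error-style criterion; all names are illustrative, and every attribute value seen in validation is assumed to have a branch).

from collections import Counter

def classify(tree, case):
    # A tree is either a class label (leaf) or {"split_on": attr, "branches": {...}}.
    while isinstance(tree, dict):
        tree = tree["branches"][case[tree["split_on"]]]
    return tree

def errors(tree, cases, target):
    return sum(classify(tree, case) != case[target] for case in cases)

def majority_class(cases, target):
    return Counter(case[target] for case in cases).most_common(1)[0][0]

def subtree_replacement(tree, validation, target):
    # Bottom-up: prune each branch using the validation cases routed to it,
    # then try replacing this whole subtree with a single majority-class leaf.
    if not isinstance(tree, dict) or not validation:
        return tree
    for value, child in tree["branches"].items():
        routed = [c for c in validation if c[tree["split_on"]] == value]
        tree["branches"][value] = subtree_replacement(child, routed, target)
    leaf = majority_class(validation, target)
    if errors(leaf, validation, target) <= errors(tree, validation, target):
        return leaf   # the leaf does at least as well, so replace the subtree
    return tree
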
Subtree Raising

An entire subtree is raised onto another node

[Figure: tree with nodes A, B and C and leaves 1-5, before raising.]
Subtree Raising


An entire subtree is raised onto another node
This was not discussed in detail, as it is not clear whether it is really worthwhile (it is very time consuming)

[Figure: tree after raising, with node A, node C, and leaves 1-3.]
Problems with Decision Trees

While decision trees classify quickly, the time for building a tree may be higher than for other types of classifiers
Decision trees suffer from a problem of errors propagating throughout the tree
This becomes a very serious problem as the number of classes increases
Error Propagation

Since decision trees work by a series of local decisions, what happens when one of these local decisions is wrong?
Every decision from that point on may be wrong
We may never return to the correct path of the tree
[Figure: decision tree representation of a salary decision, from Modern Systems Analysis and Design, Fourth Edition, Jeffrey A. Hoffer, Joey F. George, Joseph S. Valacich.]
Decision Tables



Used to lay out in tabular form all possible situations which a decision may encounter, and to specify which action to take in each of these situations
A matrix representation of the logic of a decision
Specifies the possible conditions and the resulting actions
Terminology

Decision Table
A decision table is a tabular form that presents a set of conditions and their corresponding actions.
Condition Stubs
Condition stubs describe the conditions or factors that will affect the decision or policy.
They are listed in the upper section of the decision table.
Action Stubs
Action stubs describe, in the form of statements, the possible policy actions or decisions.
They are listed in the lower section of the decision table.
Rules
Rules describe which actions are to be taken under a specific combination of conditions.
They are specified by first inserting different combinations of condition attribute values and then putting X's in the appropriate columns of the action section of the table.
Example
[Figure: example decision table, from Modern Systems Analysis and Design, Fourth Edition, Jeffrey A. Hoffer, Joey F. George, Joseph S. Valacich.]
Decision Table Methodology

1. Identify Conditions & Values
Find the data attribute each condition tests and all of the attribute's values.
2. Compute Max Number of Rules
Multiply the number of values for each condition data attribute by each other.
3. Identify Possible Actions
Determine each independent action to be taken for the decision or policy.
4. Enter All Possible Rules
Fill in the values of the condition data attributes in each numbered rule column.
5. Define Actions for each Rule
For each rule, mark the appropriate actions with an X in the decision table.
6. Verify the Policy
Review the completed decision table with end-users.
7. Simplify the Table
Eliminate and/or consolidate rules to reduce the number of columns.
A Simple Example

Scenario: A marketing company wishes to construct a decision table to decide how to treat clients according to three characteristics:
Gender,
City Dweller, and
Age group: A (under 30), B (between 30 and 60), C (over 60).
The company has four products (W, X, Y and Z) to test market.
Product W will appeal to male city dwellers.
Product X will appeal to young males.
Product Y will appeal to female middle-aged shoppers who do not live in cities.
Product Z will appeal to all but older males.
1. Identify Conditions & Values

The three data attributes tested by the conditions in this problem are:
gender, with values M and F;
city dweller, with values Y and N; and
age group, with values A, B, and C,
as stated in the problem.
2. Compute Maximum Number of Rules

The maximum number of rules is 2 x 2 x 3 = 12
3. Identify Possible Actions

The four actions are:
market product W,
market product X,
market product Y,
market product Z.
4. Enter All Possible Conditions
RULES
       1  2  3  4  5  6  7  8  9  10 11 12
Sex    m  f  m  f  m  f  m  f  m  f  m  f
City   y  y  n  n  y  y  n  n  y  y  n  n
Age    a  a  a  a  b  b  b  b  c  c  c  c
5. Define Actions for each Rule
Actions (Market)
       1  2  3  4  5  6  7  8  9  10 11 12
W      X           X           X
X      X     X
Y                           X
Z      X  X  X  X  X  X  X  X     X     X
Full table
RULES
       1  2  3  4  5  6  7  8  9  10 11 12
Sex    m  f  m  f  m  f  m  f  m  f  m  f
City   y  y  n  n  y  y  n  n  y  y  n  n
Age    a  a  a  a  b  b  b  b  c  c  c  c

Actions (Market)
W      X           X           X
X      X     X
Y                           X
Z      X  X  X  X  X  X  X  X     X     X
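
The completed table can also be read as a small piece of code. A minimal sketch, encoding the four product rules from the problem statement:

def products_to_market(sex, city_dweller, age_group):
    # sex: "m" or "f"; city_dweller: "y" or "n"; age_group: "a", "b" or "c"
    products = set()
    if sex == "m" and city_dweller == "y":
        products.add("W")                      # male city dwellers
    if sex == "m" and age_group == "a":
        products.add("X")                      # young males
    if sex == "f" and city_dweller == "n" and age_group == "b":
        products.add("Y")                      # middle-aged women outside cities
    if not (sex == "m" and age_group == "c"):
        products.add("Z")                      # everyone except older males
    return products

print(products_to_market("m", "y", "a"))   # rule 1: {'W', 'X', 'Z'}
print(products_to_market("m", "n", "c"))   # rule 11: set() -- no product
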
6. Verify the Policy

Let us assume that the client agreed with our decision table.
7. Simplify the Table



There appear to be no impossible rules.
Note that rules 2, 4, 6, 7, 10 and 12 have the same action pattern.
Rules 2, 6 and 10 have:
two of the three condition values (gender and city dweller) identical, and
all three values of the non-identical condition (age) covered,
so they can be condensed into a single column 2.
Rules 4 and 12 have identical action patterns, but they cannot be combined because the indifferent attribute "Age" does not have all of its values covered in these two columns; age group B is missing.
The revised table is as follows:
RULES
       1  2  3  4  5  6  7  8  9  10
Sex    M  F  M  F  M  M  F  M  M  F
City   Y  Y  N  N  Y  N  N  Y  N  N
Age    A  -  A  A  B  B  B  C  C  C

Actions (Market)
W      X           X        X
X      X     X
Y                        X
Z      X  X  X  X  X  X  X        X
Step 1. Identify Conditions & Values


We first examine the problem and identify the data
attributes upon which the decision or policy depends.
We then list the possible values of each data attribute.

Often, answering the question: "What do I need to know in
order to take action in this situation?" will help identify the
appropriate condition attributes.
Step 2. Compute Maximum Number of Rules

A rule is determined by a different combination of the condition attribute values.
Since we have listed these values in the previous step, the multiplication rule of counting tells us that there will be no more columns than the product of the number of values for each of the condition attributes.
This can be easily verified by constructing a tree diagram listing all possible values of each attribute for each branch of the preceding attribute.
The number of leaves of the tree will be the product described above.
Since some combinations of attribute values may be impossible, the actual number of rules may be less than the maximum.
Step 3. Identify Possible Actions

The actions describe the decisions to be made or the policy rules to be followed.
Asking the question "What are the different options for implementing the decision or policy?" should help identify the possible actions.
Step 4. Enter All Possible Rules

We now begin to build the decision table by listing:
the condition descriptions in the left margin of the upper part of the table, and
the action descriptions in the left margin of the lower part.
Then we write consecutive numbers from 1 to the maximum number of rules across the top.
In the rule columns and the condition rows, we list all possible combinations of condition attribute values.
A rule of thumb for arranging the rule combinations is to alternate the possible values for the first condition, then repeat each value of the second condition as many times as there are values in the first condition, then repeat each value of the third condition as many times as needed to cover one iteration of the second condition values, and so on.
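
A short sketch of this enumeration for the marketing example, using itertools.product; listing the conditions from slowest-varying (Age) to fastest-varying (Sex) reproduces the rule-of-thumb ordering above.

from itertools import product

# Condition attributes and their values, from the marketing example.
sex_values = ["m", "f"]
city_values = ["y", "n"]
age_values = ["a", "b", "c"]

# itertools.product varies its last argument fastest, so Sex alternates,
# City repeats in pairs, and Age changes slowest -- matching the rule of thumb.
for rule_no, (age, city, sex) in enumerate(product(age_values, city_values, sex_values), 1):
    print(f"Rule {rule_no:2d}:  Sex={sex}  City={city}  Age={age}")
# 2 * 2 * 3 = 12 rule columns in total.
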
Step 5. Define Actions for each Rule



In this step we decide which actions are appropriate for each combination of condition attribute values and mark an X in that column of the action row.
This should be fairly straightforward if the decision-making procedure is well defined.
If it is not well defined, then the organization of the decision table makes it easier to get the end-user to specify the action(s) for each rule.
Step 6. Verify the Policy

Review the completed decision table with the end-users.
Resolve any rules for which the actions are not specific.
Verify that rules you think are impossible cannot in actuality occur.
Resolve apparent contradictions, such as one rule with two contradictory actions.
Finally, verify that each rule's actions are correct.
Step 7. Simplify the Decision Table


In this step we look for and eliminate impossible rules, and also combine rules with indifferent conditions. An indifferent condition is one whose values do not affect the decision and always result in the same action.
Impossible rules are those in which the given combination of condition attribute values cannot occur according to the specifications of the problem (e.g., if we assumed for marketing purposes that all middle-aged men lived in the city).
To determine indifferent conditions, first look for rules with exactly the same actions.
From these, find those whose condition values are the same except for one and only one condition (called the indifferent condition).
This latter set of rules has the potential for being collapsed into a single rule, with the indifferent condition value replaced with a dash.
Note that all possible values of the indifferent condition must be present among the rules to be combined before they can be collapsed.
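
A minimal sketch of this collapsing test: a group of rules can be merged over one condition only if they share the same actions, agree on every other condition, and between them cover all of that condition's values (the rule and value names below are taken from the marketing example).

def can_collapse(rules, condition, all_values):
    # rules: list of (conditions_dict, actions_frozenset)
    # condition: the candidate indifferent condition
    # all_values: every possible value of that condition
    actions = {acts for _, acts in rules}
    if len(actions) != 1:                     # actions must be identical
        return False
    others = [{k: v for k, v in conds.items() if k != condition}
              for conds, _ in rules]
    if any(o != others[0] for o in others):   # all other conditions must agree
        return False
    covered = {conds[condition] for conds, _ in rules}
    return covered == set(all_values)         # every value must be covered

# Rules 2, 6 and 10 of the marketing table: female city dwellers, all ages.
group = [({"Sex": "f", "City": "y", "Age": "a"}, frozenset({"Z"})),
         ({"Sex": "f", "City": "y", "Age": "b"}, frozenset({"Z"})),
         ({"Sex": "f", "City": "y", "Age": "c"}, frozenset({"Z"}))]
print(can_collapse(group, "Age", ["a", "b", "c"]))   # True -> Age becomes "-"
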
Exercise

A student may receive a final course grade of A, B, C, D, or F. In deriving the student's final course grade, the instructor first determines an initial or tentative grade for the student, which is determined in the following manner for a student who has:

received a total of no lower than 90 percent on the first 3 assignments and received a score no lower than 70 percent on the 4th assignment will receive an initial grade of A for the course.

scored a total lower than 90 percent but no lower than 80 percent on the first 3 assignments and received a score no lower than 70 percent on the 4th assignment will receive an initial grade of B for the course.

received a total lower than 80 percent but no lower than 70 percent on the first 3 assignments and received a score no lower than 70 percent on the 4th assignment will receive an initial grade of C for the course.

scored a total lower than 70 percent but no lower than 60 percent on the first 3 assignments and received a score no lower than 70 percent on the 4th assignment will receive an initial grade of D for the course.

scored a total lower than 60 percent on the first 3 assignments, or received a score lower than 70 percent on the 4th assignment, will receive an initial grade of F for the course.

Once the instructor has determined the initial course grade for the student, the final course grade will be determined.

The student's final course grade will be the same as his or her initial course grade if no more than 3 class periods during the semester were missed.

Otherwise, the student's final course grade will be one letter grade lower than his or her initial course grade (for example, an A will become a B).
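
One possible straight-line reading of this grading policy, useful for checking a decision table against it (the treatment of an initial F, which cannot be lowered further, is an assumption):

def final_grade(first_three_total, fourth_score, classes_missed):
    # first_three_total: combined percentage on the first 3 assignments
    # fourth_score: percentage on the 4th assignment
    # classes_missed: class periods missed during the semester

    # Initial (tentative) grade.
    if first_three_total < 60 or fourth_score < 70:
        initial = "F"
    elif first_three_total >= 90:
        initial = "A"
    elif first_three_total >= 80:
        initial = "B"
    elif first_three_total >= 70:
        initial = "C"
    else:
        initial = "D"

    # Final grade: drop one letter if more than 3 class periods were missed.
    if classes_missed <= 3 or initial == "F":
        return initial
    return {"A": "B", "B": "C", "C": "D", "D": "F"}[initial]

print(final_grade(92, 75, 2))   # A
print(final_grade(92, 75, 5))   # B
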