Decision Tree Modelling - cmaste

Alberta Ingenuity & CMASTE
Lesson 3: Probability Using Decision Trees
Purpose: To take a complex problem with several variables and map the observations in
a tree model to make conclusions and decisions easier. The resulting outcome can be a
class to which the data belongs or a real number such as the price of a house or a patient’s
length of stay in a hospital.
Problem: Staffing for a golf course on a day-to-day basis is difficult to plan because the
number of people that golf on a daily basis is variable and hard to predict. This leads to
problems trying to staff appropriately. Can we find a way to make the staffing decisions
easier?
Hypothesis: A decision tree can organize multi-variable data so that reliable inferences
can be made.
Prediction: We can significantly improve decision-making in complex situations with
the help of the decision tree strategy.
Design: Here is a model of how a decision tree can be used to solve a multi-variable
problem:
The manager of a golf course is having some trouble understanding his customer
attendance. There are days when everyone wants to play golf and the staff is overworked.
On other days, no one plays golf and staff have too much slack time. The manager’s
objective is to optimize staff availability by trying to predict when people will play golf.
To accomplish that, he needs to understand why people decide to play and whether
anything explains their decision. He assumes that weather must be an important underlying
factor, so he decides to use the weather forecast and record whether or not people play
golf under certain weather conditions. Over the next two weeks he keeps records of:
The outlook: whether it was sunny, overcast or raining.
The temperature (°C).
The relative humidity (%).
Whether it was windy or not.
Whether or not people played golf on that day.
The decision to golf or not will be considered the dependent variable and all the other
factors will be independent variables. The manager compiled this dataset into a table
containing 14 rows and 5 columns as shown below.
AICML3DecisionTrees
Centre for Machine Learning
Independent Variables: Outlook, Temperature, Humidity, Windy
Dependent Variable: Play?

Outlook    Temperature (°C)   Humidity (%)   Windy   Play?
Sunny      28                 75             No      Play
Sunny      24                 95             Yes     Don’t Play
Overcast   22                 80             No      Play
Rain       20                 95             No      Don’t Play
Rain       17                 80             No      Play
Rain       15                 85             Yes     Don’t Play
Overcast   18                 55             Yes     Play
Sunny      20                 90             No      Don’t Play
Sunny      21                 70             No      Play
Overcast   23                 85             Yes     Play
Rain       25                 80             No      Play
Sunny      21                 65             Yes     Play
Overcast   27                 65             No      Play
Rain       22                 75             Yes     Don’t Play
There are a lot of variables to deal with at one time, making it difficult to find any
patterns. This is where the Machine Learning strategy of a decision tree model can help
to solve the manager’s problem. We want to make decisions about the dependent
variable based on data concerning the independent variables. In a decision tree, the top
node (the root) represents all the data. In deciding on the best first split, the most
important independent variable seems to be Outlook, as it is the main weather feature.
Each level below splits the data into subsets (new nodes, or leaves where the tree ends),
connected by branches, and each subset must be considered separately. A branch stops
when its subset cannot usefully be broken down any further.
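The first split can be checked by tallying the Play? column within each Outlook value. The following is a minimal Python sketch, assuming the 14 days from the table are retyped as (outlook, play) pairs; the names `records` and `tally_by_outlook` are illustrative and not part of the lesson.

```python
from collections import Counter

# (outlook, play) pairs retyped from the table, in row order; True means "Play".
records = [
    ("sunny", True), ("sunny", False), ("overcast", True), ("rain", False),
    ("rain", True), ("rain", False), ("overcast", True), ("sunny", False),
    ("sunny", True), ("overcast", True), ("rain", True), ("sunny", True),
    ("overcast", True), ("rain", False),
]


def tally_by_outlook(rows):
    """Return {outlook: (play_count, dont_play_count)} for the given rows."""
    counts = Counter(rows)  # counts each distinct (outlook, play) pair
    outlooks = sorted({outlook for outlook, _ in rows})
    return {o: (counts[(o, True)], counts[(o, False)]) for o in outlooks}


if __name__ == "__main__":
    for outlook, (play, dont) in tally_by_outlook(records).items():
        print(f"{outlook:9s} Play {play}  Don't Play {dont}")
```

Only the overcast group comes out with a single outcome, which is why that branch needs no further splitting.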
Materials and Procedure: Decision tree
Dependent Variable: Play?
All days: Play 9, Don’t Play 5
Outlook
    overcast: Play 4, Don’t Play 0
    sunny: Play 3, Don’t Play 2
        Humidity
            ≤ 70%: Play 2, Don’t Play 0
            > 70%: Play 1, Don’t Play 2
    rain: Play 2, Don’t Play 3
        Windy
            Yes: Play 0, Don’t Play 2
            No: Play 2, Don’t Play 1
Evidence: When the manager observed the data at the first level or node, based on the
independent variable Outlook, he saw three main groups of golfers:
1) One group that plays golf when the weather is sunny
2) One group that plays when the weather is overcast
3) One group that plays when it's raining.
Analysis: The manager’s first conclusion is that when the outlook is overcast, people
always play golf, so that branch terminates. He also sees that some fanatics play golf
even in the rain, so he divides the rain category in two based on wind. People do not
always golf when it is sunny, so he divides the sunny group in two based on humidity,
with an arbitrary cutoff of 70%.
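The second-level splits can be tallied the same way. This is a small Python sketch under the assumption that the table rows are retyped as (outlook, humidity, windy, play) tuples; `rows`, `count`, `leaves` and `HUMIDITY_CUTOFF` are names chosen here for illustration.

```python
# (outlook, humidity %, windy, play) retyped from the table; True means "Play".
rows = [
    ("sunny", 75, False, True), ("sunny", 95, True, False),
    ("overcast", 80, False, True), ("rain", 95, False, False),
    ("rain", 80, False, True), ("rain", 85, True, False),
    ("overcast", 55, True, True), ("sunny", 90, False, False),
    ("sunny", 70, False, True), ("overcast", 85, True, True),
    ("rain", 80, False, True), ("sunny", 65, True, True),
    ("overcast", 65, False, True), ("rain", 75, True, False),
]

HUMIDITY_CUTOFF = 70  # the manager's arbitrary cutoff, in percent


def count(subset):
    """Return (play_count, dont_play_count) for a list of rows."""
    play = sum(1 for row in subset if row[3])
    return play, len(subset) - play


sunny = [r for r in rows if r[0] == "sunny"]
rain = [r for r in rows if r[0] == "rain"]

# One entry per leaf below the sunny and rain branches.
leaves = {
    "sunny, humidity <= 70%": count([r for r in sunny if r[1] <= HUMIDITY_CUTOFF]),
    "sunny, humidity > 70%": count([r for r in sunny if r[1] > HUMIDITY_CUTOFF]),
    "rain, windy": count([r for r in rain if r[2]]),
    "rain, not windy": count([r for r in rain if not r[2]]),
}

for name, (play, dont) in leaves.items():
    print(f"{name:24s} Play {play}  Don't Play {dont}")
```

Changing `HUMIDITY_CUTOFF` shows how sensitive the sunny leaves are to the manager's choice of 70%.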
Evaluation: As a result, the manager feels he has enough accurate information from the
decision tree in order to solve the staffing problem. The numbers at the bottom level of
the decision tree lead him to these inferences: He dismisses most of the staff on days that
are sunny and humid or on days that are rainy and windy, because almost no one is going
to play golf on those days. On days when a lot of people will play golf, he hires extra
staff. The conclusion is that the decision tree helped the manager turn a complex data
representation into a much easier structure which made the probabilities involved easier
to understand. As well, we expect that this technique can be applied to even more
complex and important problems with great volumes of data.
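To make the probabilities explicit, each leaf's counts can be turned into an estimated chance that people play. The sketch below hard-codes the leaf counts from the manager's tree; the function name `play_probability` is hypothetical, and with only 14 days of records the estimates are rough.

```python
def play_probability(outlook, humidity, windy):
    """Estimate P(Play) by walking the manager's tree down to a leaf.

    outlook is "overcast", "sunny" or "rain"; humidity is in percent;
    windy is True or False. Counts are (play, dont_play) at each leaf.
    """
    if outlook == "overcast":
        play, dont = 4, 0
    elif outlook == "sunny":
        # Sunny days are split on humidity at the 70% cutoff.
        play, dont = (2, 0) if humidity <= 70 else (1, 2)
    elif outlook == "rain":
        # Rainy days are split on wind.
        play, dont = (0, 2) if windy else (2, 1)
    else:
        raise ValueError(f"unknown outlook: {outlook!r}")
    return play / (play + dont)


if __name__ == "__main__":
    forecast = ("sunny", 90, False)  # an example forecast, not from the data
    print(f"Estimated chance of play: {play_probability(*forecast):.2f}")
```

A low estimate, as for sunny and humid or rainy and windy days, corresponds to the days on which the manager schedules fewer staff.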
Exercise:
Our task is to create a decision tree in the same way as the golf course manager did, for a
different data set and then to use the information to make appropriate decisions to solve a
problem.
A company is sending out a promotion to households as part of a marketing strategy. For
each mail out, the company records information about these independent variables: type
of district (urban, suburban, or rural), house type (detached, semi-detached, apartment),
income (high or low), whether they are a previous customer (yes or no), and finally the
dependent variable: outcome (responded or no response).
The problem for the company is to determine which factors most strongly affect the
chance of a response to a promotion. Knowing this might help the company be more
selective in another promotion.
Even though the mail out would probably go to thousands of homes, for our purposes we
will limit our data to the following 14 households.
Independent Variables: District, House Type, Income, Previous Customer
Dependent Variable: Outcome

District   House Type      Income   Previous Customer   Outcome
Suburban   Detached        High     No                  No Response
Suburban   Detached        High     Yes                 No Response
Rural      Detached        High     No                  Responded
Urban      Semi-detached   High     No                  Responded
Urban      Semi-detached   Low      No                  Responded
Urban      Semi-detached   Low      Yes                 No Response
Rural      Semi-detached   Low      Yes                 Responded
Suburban   Apartment       High     No                  Responded
Suburban   Semi-detached   Low      No                  Responded
Urban      Apartment       Low      No                  Responded
Suburban   Apartment       Low      Yes                 Responded
Rural      Apartment       High     Yes                 Responded
Rural      Detached        Low      No                  Responded
Urban      Apartment       High     Yes                 No Response
1) Begin your decision tree (on the next page) by totaling the number of Responded
versus No Response as the root or first node. Which of the independent variables
do you think is the most important for the first level of branching?
__________________
2) Recalculate the number of Responded versus No Response for each new branch,
according to your choice in 1).
3) Based on the results in 2) how would you use the remaining independent variables
to branch to the next level?
4) Recalculate the number of Responded versus No Response for each new branch,
according to your choice in 3).
5) Based on the results in 4), what inferences can you make with respect to the
original question “What factors most strongly affect the chance that a household
will respond to the promotion?”
6) How might someone else in your class develop a decision tree that is different
from yours?
7) How would you change, if at all, your next promotional mail out in order to
increase the probability that a particular household would respond?
8) How would this problem be handled differently if the mail out involved thousands
of households?
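As one possible angle on question 8: with thousands of households, the tallies in steps 1) to 4) would normally be computed rather than counted by hand. The following is a minimal Python sketch, assuming the table is retyped as tuples; `COLUMNS`, `HOUSEHOLDS` and `tally` are hypothetical names used only for illustration.

```python
from collections import Counter, defaultdict

# Column names in table order; the last column is the dependent variable.
COLUMNS = ("district", "house_type", "income", "previous_customer", "outcome")

# The 14 households from the exercise table, in row order.
HOUSEHOLDS = [
    ("Suburban", "Detached", "High", "No", "No Response"),
    ("Suburban", "Detached", "High", "Yes", "No Response"),
    ("Rural", "Detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "Yes", "No Response"),
    ("Rural", "Semi-detached", "Low", "Yes", "Responded"),
    ("Suburban", "Apartment", "High", "No", "Responded"),
    ("Suburban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Apartment", "Low", "No", "Responded"),
    ("Suburban", "Apartment", "Low", "Yes", "Responded"),
    ("Rural", "Apartment", "High", "Yes", "Responded"),
    ("Rural", "Detached", "Low", "No", "Responded"),
    ("Urban", "Apartment", "High", "Yes", "No Response"),
]


def tally(rows, by=None):
    """Count outcomes overall, or per value of the column named by `by`."""
    if by is None:
        return Counter(row[-1] for row in rows)
    index = COLUMNS.index(by)
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[index]][row[-1]] += 1
    return dict(groups)


root = tally(HOUSEHOLDS)  # totals for the root node (step 1)
print(root)
```

Calling `tally(HOUSEHOLDS, "district")`, or any other column name from `COLUMNS`, gives the per-branch counts for step 2); the same function works unchanged on a much longer list of rows.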
Decision Tree
Synthesis:
Decision-tree modeling is applicable to a wide variety of situations requiring analysis
of multi-variate data. A decision tree uses a “divide and conquer” approach to the
variables in a large problem.
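The “divide and conquer” idea can be sketched as a short recursive procedure: pick a variable, split the data on it, and repeat on each subset. The lesson chooses splits by judgement; the sketch below instead assumes information gain (entropy reduction, the criterion used by ID3-style learners) as the selection rule, which is a swapped-in technique and not the lesson's method. It uses the golf data with humidity reduced to the 70% cutoff, and all names are illustrative.

```python
import math
from collections import Counter

# (outlook, humidity %, windy, play) retyped from the golf table.
RAW = [
    ("sunny", 75, False, True), ("sunny", 95, True, False),
    ("overcast", 80, False, True), ("rain", 95, False, False),
    ("rain", 80, False, True), ("rain", 85, True, False),
    ("overcast", 55, True, True), ("sunny", 90, False, False),
    ("sunny", 70, False, True), ("overcast", 85, True, True),
    ("rain", 80, False, True), ("sunny", 65, True, True),
    ("overcast", 65, False, True), ("rain", 75, True, False),
]
FEATURES = ("outlook", "humid", "windy")
# "humid" is an assumed yes/no feature: humidity above the 70% cutoff.
DATA = [
    {"outlook": o, "humid": h > 70, "windy": w, "play": p} for o, h, w, p in RAW
]


def entropy(rows):
    """Shannon entropy (bits) of the play / don't-play mix in rows."""
    total = len(rows)
    counts = Counter(r["play"] for r in rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def split(rows, feature):
    """Divide: group rows by the value they take for feature."""
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r)
    return groups


def gain(rows, feature):
    """Entropy before the split minus the weighted entropy after it."""
    groups = split(rows, feature).values()
    after = sum(len(g) / len(rows) * entropy(g) for g in groups)
    return entropy(rows) - after


def build(rows, features):
    """Conquer: recurse until a node is pure or no features remain."""
    plays = sum(r["play"] for r in rows)
    if plays in (0, len(rows)) or not features:
        return (plays, len(rows) - plays)  # leaf: (play, dont_play)
    best = max(features, key=lambda f: gain(rows, f))
    rest = [f for f in features if f != best]
    return {best: {v: build(g, rest) for v, g in split(rows, best).items()}}


tree = build(DATA, list(FEATURES))
print(tree)
```

Under these assumptions the procedure also selects Outlook for the first split, but it keeps splitting until every leaf is pure or the features run out, so it can go deeper than the manager's hand-drawn tree.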
To see the decision-tree process done automatically in a Java applet, go to the
University of Alberta Computing Science AIxploratorium site at
www.cs.ualberta.ca/~aiexplore/learning/DecisionTrees/index.html , then choose
Decision trees and follow the instructions through the documentation for the applet
before running the applet itself. It is interesting to run the applet with different
features and see the different decision trees that result.