Alberta Ingenuity & CMASTE
Lesson 3: Probability Using Decision Trees (Teachers' Resource)

Purpose: To take a complex problem with several variables and map the observations in a tree model so that conclusions and decisions become easier to reach. The resulting outcome can be a class to which the data belongs, or a real number such as the price of a house or a patient's length of stay in a hospital.

Problem: Staffing a golf course on a day-to-day basis is difficult to plan because the number of people who golf each day is variable and hard to predict, which makes it hard to schedule staff appropriately. Can we find a way to make the staffing decisions easier?

Hypothesis: A decision tree can organize multi-variable data so that reliable inferences can be made.

Prediction: We can significantly improve decision-making in complex situations with the help of the decision tree strategy.

Design: Here is a model of how a decision tree can be used to solve a multi-variable problem. The manager of a golf course is having trouble understanding his customer attendance. On some days everyone wants to play golf and the staff is overworked; on other days no one plays and the staff has too much slack time. The manager's objective is to optimize staff availability by predicting when people will play golf. To accomplish that, he needs to understand why people decide to play. He assumes that weather must be an important underlying factor, so he decides to use the weather forecast and record whether or not people play golf under various weather conditions. Over the next two weeks he keeps records of:
- the outlook: whether it was sunny, overcast or raining
- the temperature (°C)
- the relative humidity (%)
- whether or not it was windy
- whether or not people played golf that day

The decision to golf or not is the dependent variable; all the other factors are independent variables. The manager compiled this dataset into a table of 14 rows and 5 columns, as shown below.

Independent Variables                                  Dependent Variable
Outlook     Temperature (°C)   Humidity (%)   Windy    Play?
Sunny       28                 75             No       Play
Sunny       24                 95             Yes      Don't Play
Overcast    22                 80             No       Play
Rain        20                 95             No       Don't Play
Rain        17                 80             No       Play
Rain        15                 85             Yes      Don't Play
Overcast    18                 55             Yes      Play
Sunny       20                 90             No       Don't Play
Sunny       21                 70             No       Play
Overcast    23                 85             Yes      Play
Rain        25                 80             No       Play
Sunny       21                 65             Yes      Play
Overcast    27                 65             No       Play
Rain        22                 75             Yes      Don't Play

There are many variables to deal with at one time, which makes it difficult to find any patterns. This is where the Machine Learning strategy of a decision tree model can help solve the manager's problem: we want to make decisions about the dependent variable based on data about the independent variables. In a decision tree, the top node represents all the data. In choosing the first split, the most important independent variable appears to be Outlook, since it is the main weather feature. Each level below consists of subsets of the data (new nodes, or leaves where a branch terminates), connected by branches. The tree stops growing when a subset cannot usefully be broken down any further.
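Where a computer is available, the counts in the tree that follows can be checked mechanically. Here is a minimal Python sketch, assuming only that the dataset is transcribed exactly from the table above; the variable names are illustrative, and the output comments show the expected tallies.

```python
from collections import Counter

# Golf records transcribed from the table above:
# (outlook, temperature_C, humidity_pct, windy, play)
data = [
    ("Sunny",    28, 75, "No",  "Play"),
    ("Sunny",    24, 95, "Yes", "Don't Play"),
    ("Overcast", 22, 80, "No",  "Play"),
    ("Rain",     20, 95, "No",  "Don't Play"),
    ("Rain",     17, 80, "No",  "Play"),
    ("Rain",     15, 85, "Yes", "Don't Play"),
    ("Overcast", 18, 55, "Yes", "Play"),
    ("Sunny",    20, 90, "No",  "Don't Play"),
    ("Sunny",    21, 70, "No",  "Play"),
    ("Overcast", 23, 85, "Yes", "Play"),
    ("Rain",     25, 80, "No",  "Play"),
    ("Sunny",    21, 65, "Yes", "Play"),
    ("Overcast", 27, 65, "No",  "Play"),
    ("Rain",     22, 75, "Yes", "Don't Play"),
]

# Root node: tally the dependent variable over all 14 days.
print("Root:", dict(Counter(row[4] for row in data)))
# -> Root: {'Play': 9, "Don't Play": 5}

# First split on Outlook: tally Play vs Don't Play within each subset.
for outlook in ("Sunny", "Overcast", "Rain"):
    tally = Counter(row[4] for row in data if row[0] == outlook)
    print(outlook, dict(tally))
# -> Sunny {'Play': 3, "Don't Play": 2}
#    Overcast {'Play': 4}
#    Rain {"Don't Play": 3, 'Play': 2}

# Second-level splits used in the tree:
# the Sunny branch by Humidity (cutoff 70%), the Rain branch by Windy.
sunny = [r for r in data if r[0] == "Sunny"]
print("Humidity <= 70:", dict(Counter(r[4] for r in sunny if r[2] <= 70)))
print("Humidity  > 70:", dict(Counter(r[4] for r in sunny if r[2] > 70)))
rain = [r for r in data if r[0] == "Rain"]
print("Windy = Yes:", dict(Counter(r[4] for r in rain if r[3] == "Yes")))
print("Windy = No: ", dict(Counter(r[4] for r in rain if r[3] == "No")))
```

Running the sketch reproduces every node count in the decision tree below.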
Materials and Procedure:

Decision tree (dependent variable: Play?):

All 14 days: Play 9, Don't Play 5
  Split on Outlook
    Sunny: Play 3, Don't Play 2
      Split on Humidity
        <= 70%: Play 2, Don't Play 0
        > 70%:  Play 1, Don't Play 2
    Overcast: Play 4, Don't Play 0 (branch terminates)
    Rain: Play 2, Don't Play 3
      Split on Windy
        Yes: Play 0, Don't Play 2
        No:  Play 2, Don't Play 1

Evidence: When the manager examined the data at the first level or node, split on the independent variable Outlook, he saw three main groups of golfers:
1) one group that plays golf when the weather is sunny,
2) one group that plays when the weather is overcast, and
3) one group that plays when it is raining.

Analysis: The manager's first conclusion is that if the outlook is overcast, people always play golf, so that branch terminates; he also concludes that there are some fanatics who play golf even in the rain. People did not always golf when it was sunny, so he divided the sunny group in two based on humidity, with an arbitrary cutoff of 70%. He also divided the rain category in two, based on wind.

Evaluation: As a result, the manager feels he has enough accurate information from the decision tree to solve the staffing problem. The numbers at the bottom level of the tree lead him to these inferences: he dismisses most of the staff on days that are sunny and humid, or rainy and windy, because almost no one plays golf on those days; on days when many people are likely to play golf, he hires extra staff. The conclusion is that the decision tree helped the manager turn a complex data representation into a much simpler structure that made the probabilities involved easier to understand. We also expect that this technique can be applied to even more complex and important problems with great volumes of data.

Exercise: Our task is to create a decision tree, in the same way the golf course manager did, for a different data set, and then to use the information to make appropriate decisions to solve a problem. A company is sending out a promotion to households as part of a marketing strategy. For each mail-out, the company records these independent variables: type of district (urban, suburban or rural), house type (detached, semi-detached or apartment), income (high or low), and whether the household is a previous customer (yes or no); the dependent variable is the outcome (responded or no response). The problem for the company is to determine which factors most strongly affect the chance of a response to a promotion; knowing this might help the company be more selective in its next promotion. Even though the mail-out would probably go to thousands of homes, for our purposes we will limit our data to the following 14 households. (A short sketch after the table shows how the tallies asked for in the questions can be checked.)

Independent Variables                                       Dependent Variable
District    House Type      Income   Previous Customer   Outcome
Suburban    Detached        High     No                  No Response
Suburban    Detached        High     Yes                 No Response
Rural       Detached        High     No                  Responded
Urban       Semi-detached   High     No                  Responded
Urban       Semi-detached   Low      No                  Responded
Urban       Semi-detached   Low      Yes                 No Response
Rural       Semi-detached   Low      Yes                 Responded
Suburban    Apartment       High     No                  Responded
Suburban    Semi-detached   Low      No                  Responded
Urban       Apartment       Low      No                  Responded
Suburban    Apartment       Low      Yes                 Responded
Rural       Apartment       High     Yes                 Responded
Rural       Detached        Low      No                  Responded
Urban       Apartment       High     Yes                 No Response
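As with the golf data, the hand tallies for this exercise can be verified with a short program. The following Python sketch is an illustration only, with the dataset transcribed from the table above; the helper name split_counts is our own, not part of any library.

```python
from collections import Counter

# Mail-out records transcribed from the table above:
# (district, house_type, income, previous_customer, outcome)
data = [
    ("Suburban", "Detached",      "High", "No",  "No Response"),
    ("Suburban", "Detached",      "High", "Yes", "No Response"),
    ("Rural",    "Detached",      "High", "No",  "Responded"),
    ("Urban",    "Semi-detached", "High", "No",  "Responded"),
    ("Urban",    "Semi-detached", "Low",  "No",  "Responded"),
    ("Urban",    "Semi-detached", "Low",  "Yes", "No Response"),
    ("Rural",    "Semi-detached", "Low",  "Yes", "Responded"),
    ("Suburban", "Apartment",     "High", "No",  "Responded"),
    ("Suburban", "Semi-detached", "Low",  "No",  "Responded"),
    ("Urban",    "Apartment",     "Low",  "No",  "Responded"),
    ("Suburban", "Apartment",     "Low",  "Yes", "Responded"),
    ("Rural",    "Apartment",     "High", "Yes", "Responded"),
    ("Rural",    "Detached",      "Low",  "No",  "Responded"),
    ("Urban",    "Apartment",     "High", "Yes", "No Response"),
]

# Question 1: root node counts over all 14 households.
print("Root:", dict(Counter(row[4] for row in data)))
# -> Root: {'No Response': 4, 'Responded': 10}

def split_counts(rows, column):
    """Tally the outcome within each value of the chosen column."""
    values = sorted({row[column] for row in rows})
    return {v: dict(Counter(r[4] for r in rows if r[column] == v))
            for v in values}

# Questions 2 and 4: counts under any candidate split, for example
# District first, then Income (suburban) and Previous Customer (urban).
print("District:", split_counts(data, 0))
suburban = [r for r in data if r[0] == "Suburban"]
urban = [r for r in data if r[0] == "Urban"]
print("Suburban by Income:", split_counts(suburban, 2))
print("Urban by Previous Customer:", split_counts(urban, 3))
```

Changing the column indices lets students test other first splits (House Type is column 1, Income column 2, Previous Customer column 3) and compare the resulting trees.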
1) Begin your decision tree by totaling the number of Responded versus No Response households as the root or first node. Which of the independent variables do you think is the most important for the first level of branching? __________________
Answer: It could actually be any of the four, but District is the choice for the decision tree below.

2) Recalculate the number of Responded versus No Response for each new branch, according to your choice in 1).
Answer: See the decision tree below.

3) Based on the results in 2), how would you use the remaining independent variables to branch to the next level?
Answer: See the decision tree below.

4) Recalculate the number of Responded versus No Response for each new branch, according to your choice in 3).
Answer: See the decision tree below.

5) Based on the results in 4), what inferences can you make with respect to the original question, "What factors most strongly affect the chance that a household will respond to the promotion?"
Answer: Rural households responded to the mail-out regardless of other variables. Suburban households were much more likely to respond if income was low, and urban households were much more likely to respond if they were not previous customers. House type did not appear to be a significant factor in deciding whether households responded.

6) How might someone else in your class develop a decision tree that is different from yours?
Answer: Many other decision trees are possible; they may have chosen a different independent variable, say Income or Previous Customer, for their first split, or different variables at their second split. It is hoped that they would still arrive at the same inferences.

7) How would you change, if at all, your next promotional mail-out in order to increase the probability that a particular household would respond?
Answer: The next mail-out from this company should target rural households, urban households that were not previous customers, and suburban households with low income.

8) How would this problem be handled differently if the mail-out involved thousands of households?
Answer: The analysis could not be done manually. A data-mining program with many queries would have to be used to sort through the large volume of data to find the patterns.

Decision tree (dependent variable: Outcome):

All 14 households: Responded (R) 10, No Response (NR) 4
  Split on District
    Suburban: R 3, NR 2
      Split on Income
        High: R 1, NR 2
        Low:  R 2, NR 0
    Urban: R 3, NR 2
      Split on Previous Customer
        Yes: R 0, NR 2
        No:  R 3, NR 0
    Rural: R 4, NR 0 (branch terminates)

Synthesis: Decision-tree modeling is applicable to a wide variety of situations requiring analysis of multivariate data. A decision tree uses a "divide and conquer" approach to the variables in a large problem. To see the decision-tree process carried out automatically in a Java applet, go to the University of Alberta Computing Science AIxploratorium site at www.cs.ualberta.ca/~aiexplore/learning/DecisionTrees/index.html, choose Decision Trees, work through the documentation for the applet, and then run the applet itself. It is interesting to run the applet with different features and see the different decision trees that result. A minimal sketch of the divide-and-conquer recursion follows.
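To make the "divide and conquer" idea concrete, here is a short Python sketch of a recursive tree builder. It is illustrative rather than a standard algorithm: it is not ID3, since it scores candidate splits by simple majority-vote errors rather than information gain, and all function and variable names are our own.

```python
from collections import Counter

# Mail-out records from the exercise table:
# (district, house_type, income, previous_customer, outcome)
data = [
    ("Suburban", "Detached",      "High", "No",  "No Response"),
    ("Suburban", "Detached",      "High", "Yes", "No Response"),
    ("Rural",    "Detached",      "High", "No",  "Responded"),
    ("Urban",    "Semi-detached", "High", "No",  "Responded"),
    ("Urban",    "Semi-detached", "Low",  "No",  "Responded"),
    ("Urban",    "Semi-detached", "Low",  "Yes", "No Response"),
    ("Rural",    "Semi-detached", "Low",  "Yes", "Responded"),
    ("Suburban", "Apartment",     "High", "No",  "Responded"),
    ("Suburban", "Semi-detached", "Low",  "No",  "Responded"),
    ("Urban",    "Apartment",     "Low",  "No",  "Responded"),
    ("Suburban", "Apartment",     "Low",  "Yes", "Responded"),
    ("Rural",    "Apartment",     "High", "Yes", "Responded"),
    ("Rural",    "Detached",      "Low",  "No",  "Responded"),
    ("Urban",    "Apartment",     "High", "Yes", "No Response"),
]

LABEL = 4  # index of the dependent variable (Outcome)

def majority_errors(rows, attr):
    """Rows a majority-vote rule would misclassify after splitting on attr."""
    total = 0
    for value in {row[attr] for row in rows}:
        tally = Counter(r[LABEL] for r in rows if r[attr] == value)
        total += sum(tally.values()) - max(tally.values())
    return total

def build_tree(rows, attributes, depth=0):
    """Divide and conquer: stop when a node is pure or no attributes
    remain; otherwise split on the attribute with the fewest errors
    (ties go to the attribute listed first)."""
    counts = Counter(row[LABEL] for row in rows)
    pad = "  " * depth
    if len(counts) == 1 or not attributes:
        print(f"{pad}leaf: {dict(counts)}")
        return
    name, idx = min(attributes.items(),
                    key=lambda kv: majority_errors(rows, kv[1]))
    print(f"{pad}split on {name}  {dict(counts)}")
    rest = {k: v for k, v in attributes.items() if k != name}
    for value in sorted({row[idx] for row in rows}):
        print(f"{pad}  {name} = {value}:")
        build_tree([r for r in rows if r[idx] == value], rest, depth + 2)

build_tree(data, {"District": 0, "House Type": 1,
                  "Income": 2, "Previous Customer": 3})
```

On a sample this small, several attributes tie on the error score, so the sketch may branch differently from the hand-built tree above (for instance, it happens to separate the suburban subset perfectly on House Type); as question 6 notes, many different trees are possible from the same data.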