Lecture Notes 2: Descriptive, Predictive, and Explanatory Analysis
Zhangxi Lin, ISQS 7339, Texas Tech University
ISQS 7342-001, Business Analytics

Outline
- Context-Based Analysis
- Decision Tree Algorithms

Context-Based Analysis

From Descriptive to Explanatory Use of Data
Three types of data analysis with decision trees:
- Descriptive analysis describes data or a relationship among various data elements in the data set.
- Predictive use of data, in addition, asserts that the above relationship will hold over time and will be the same with new data.
- Explanatory use of data describes a relationship and attempts to show, by reference to the data, the effect and interpretation of the relationship.
The rigor of the data work and task organization steps up as one moves from descriptive to explanatory use.

Showing Context
- Decision trees can display contextual effects – hot spots and soft spots in the relationships that characterize the data.
- One intuitively knows the importance of these contextual effects, but finds it difficult to understand the context because of the inherent difficulty of capturing and describing the complexity of the factors.
Terms:
- Antecedents (shown in the first level of the split): factors or effects that are at the base of a chain of events or relationships.
- Intervening factors (shown in the second or lower levels of the split): factors coming between the ordering established by the other factors and the outcome. Intervening factors can interact with antecedents or with other intervening factors to produce an interactive effect. Interactive effects are an important dimension of discussions about decision trees and are explained more fully later.

Simpson's Paradox
Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the successes of groups seem reversed when the groups are combined. This result is often encountered in social and medical science statistics, and occurs when frequency data are hastily given a causal interpretation; the paradox disappears when causal relations are derived systematically, through formal analysis. Edward H. Simpson described the phenomenon in 1951; similar effects had been noted earlier by Karl Pearson et al. and by Udny Yule in 1903. The name Simpson's paradox was coined by Colin R. Blyth in 1972. Since Simpson did not discover this statistical paradox, some authors have instead used the impersonal names reversal paradox and amalgamation paradox to refer to what is now called Simpson's paradox or the Yule-Simpson effect.
- Source: http://en.wikipedia.org/wiki/Simpson's_paradox
- Reference: Colin R. Blyth, "On Simpson's Paradox and the Sure-Thing Principle," Journal of the American Statistical Association, Vol. 67, No. 338 (Jun. 1972), pp. 364-366.

Simpson's Paradox - Example
Lisa and Bart each edit Wikipedia articles for two weeks. In the first week, Lisa improves 60% of the 100 articles she edits, and Bart improves 90% of the 10 articles he edits. In the second week, Lisa improves just 10% of the 10 articles she edits, while Bart improves 30% of the 100 articles he edits. In each week Bart improves a higher percentage of the articles he edits than Lisa does; yet when the two weeks are combined, Lisa has improved a much higher percentage than Bart (61/110 versus 39/110), because most of Lisa's edits fall in the week where her success rate is high, while most of Bart's fall in the week where his rate is low.
- Source: wikipedia.org

           Week 1    Week 2    Total
  Lisa     60/100    1/10      61/110
  Bart     9/10      30/100    39/110
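The reversal can be checked with a few lines of arithmetic. The Python sketch below simply recomputes the weekly and combined improvement rates from the counts in the table above; the dictionary layout and names are illustrative only.

# Simpson's paradox check: within-week rates versus combined rates.
# Counts come from the Lisa/Bart table above; the layout is illustrative.
results = {
    "Lisa": {"week 1": (60, 100), "week 2": (1, 10)},
    "Bart": {"week 1": (9, 10),   "week 2": (30, 100)},
}

for editor, weeks in results.items():
    improved = sum(s for s, _ in weeks.values())
    edited = sum(n for _, n in weeks.values())
    per_week = ", ".join(f"{w}: {s}/{n} = {s / n:.0%}" for w, (s, n) in weeks.items())
    print(f"{editor}: {per_week}, combined: {improved}/{edited} = {improved / edited:.0%}")

# Bart wins each week (90% vs. 60%, 30% vs. 10%), yet Lisa wins overall
# (55% vs. 35%): the weights -- articles edited per week -- differ sharply.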
The Effect of Context
- Segmentation makes a difference.
- Demonstration: http://zlin.ba.ttu.edu/pwGrad/7342/AID-Example.xlsx

Decision Tree Algorithms

Decision Tree Algorithms
- AID (Automatic Interaction Detection), 1969, by Morgan and Sonquist
- CHAID (CHi-squared AID), 1975, by Gordon V. Kass
- XAID, 1982, by Gordon V. Kass
- CRT (or CART, Classification and Regression Trees), 1984, by Breiman et al.
- QUEST (Quick, Unbiased and Efficient Statistical Tree), 1997, by Wei-Yin Loh and Yu-Shan Shih
- CLS, 1966, by Hunt et al.
- ID3 (Iterative Dichotomizer 3), 1983, by Ross Quinlan
- C4.5, C5.0, by Ross Quinlan
- SLIQ

AID
AID stands for Automatic Interaction Detector. It:
- is a statistical technique for multivariate analysis;
- can be used to determine the characteristics that differentiate buyers from nonbuyers;
- involves a successive series of analytical steps that gradually focus on the critical determinants of behavior, creating clusters of people with similar demographic characteristics and buying behavior.
The technique is explained in John A. Sonquist and James N. Morgan, The Detection of Interaction Effects, University of Michigan, Monograph No. 35, 1969.

AID – Interaction with Multicollinearity
(Figure: scatterplot of saving versus income, split by gender.)

AID – Multicollinearity without Interaction
(Figure: scatterplot of saving versus income, split by gender.)

Simpson's Paradox
(Figure) Simpson's paradox for continuous data: a positive trend appears for each of two separate groups (blue and red), while a negative trend (black, dashed) appears when the data are combined.
- Source: Wikipedia.org

AID
- Uses the decision tree to search through the many factors that may influence a relationship, to ensure that the final results presented are accurate:
  - Partition the dataset in terms of selected attributes.
  - Run a regression on each subset of the data (see the sketch below).
Features:
- Capable of dealing with multicollinearity (with or without interaction).
- Addresses the problem of hidden relationships.
Morgan and Sonquist's notes:
- An intervening effect is due to an interaction, e.g., between customer segment and the effect of the promotional program on retention. It can obscure the relationship.
- In some relationships the decision tree accounts for two thirds of the variability, while regression accounts for one third.
- Decision trees perform well with strong categorical, non-linear effects, and are inefficient at packaging the predictive effects of generally linear relationships.
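To make the partition-then-regress idea concrete, here is a minimal Python sketch on synthetic data: it fits a separate regression of saving on income within each gender segment and compares the segment slopes with the pooled slope, reproducing the continuous form of Simpson's paradox shown in the figure above. The data, variable names, and coefficients are all invented for illustration.

# Sketch: AID-style analysis -- partition on a categorical input (gender), then
# fit a separate regression of saving on income within each segment and compare
# the segment slopes with the pooled slope. Synthetic, invented data throughout.
import numpy as np

rng = np.random.default_rng(0)

def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

# Two segments with the same positive within-segment slope but different levels,
# arranged so that the pooled slope turns negative (as in the continuous-data
# Simpson's paradox figure above).
income_f = rng.uniform(20, 50, 200)
saving_f = 5.0 + 0.30 * income_f + rng.normal(0, 1, 200)
income_m = rng.uniform(50, 80, 200)
saving_m = -10.0 + 0.30 * income_m + rng.normal(0, 1, 200)

print("segment slope (female):", round(ols_slope(income_f, saving_f), 2))
print("segment slope (male):  ", round(ols_slope(income_m, saving_m), 2))
print("pooled slope:          ",
      round(ols_slope(np.concatenate([income_f, income_m]),
                      np.concatenate([saving_f, saving_m])), 2))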
Imperfections of AID
- Spurious (untrue) relationships - Because the algorithm looks through so many potential groupings of values, it is more likely to find groupings that are actually anomalies in the data.
- Biased selection of inputs or predictors - The successive partitioning of the data set into bins quickly exhausts the number of observations available in the lower levels of the decision tree.
- Potential overfitting - AID does not know when to stop growing branches, and it forms splits at the lower extremities of the decision tree where few data records are actually available.

Remedies
- Use statistical tests to test the efficacy of a branch that is grown: CHAID and XAID, by Kass (1975, 1982).
- Use validation data to test any branch that is formed for reproducibility: CRT/CART, by Breiman et al. (1984).

CHAID Algorithm
- CHAID stands for CHi-squared Automatic Interaction Detector. It is a type of decision tree technique, published in 1980 by Gordon V. Kass.
- In practice it is often used in the context of direct marketing to select groups of consumers and predict how their responses to some variables affect other variables.
- Advantage: its output is highly visual and easy to interpret. Because it uses multiway splits by default, it needs rather large sample sizes to work effectively; with small sample sizes the respondent groups can quickly become too small for reliable analysis.
- CHAID detects interaction between variables in the data set. Using this technique we can establish relationships between a dependent variable and explanatory variables such as price, size, supplements, etc.
- CHAID is often used as an exploratory technique and is an alternative to multiple regression, especially when the data set is not well-suited to regression analysis.

Terms
Bonferroni correction
- The Bonferroni correction states that if an experimenter is testing n dependent or independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested. "Statistically significant" simply means that a given result is unlikely to have occurred by chance. The correction was developed by the Italian mathematician Carlo Emilio Bonferroni.
Kass adjustment
- A p-value adjustment that multiplies the p-value by a Bonferroni factor that depends on the number of branches and chi-square target values, and sometimes on the number of distinct input values. The Kass adjustment is used in the Tree node.

CHAID
Statistical tests (a sketch of the idea follows below):
- A test of similarity is used to determine whether individual values of an input should be combined.
- After similar values of an input have been combined according to the previous rule, tests of significance are used to determine whether inputs are significant descriptors of the target values and, if so, what their strengths are relative to other inputs.
CHAID addresses all the problems of the AID approach:
- A statistical test is used to ensure that only relationships that are significantly different from random effects are identified.
- Statistical adjustments address the biased selection of variables as candidates for the branch partitions.
- Tree growth is terminated when the branch that is produced fails the test of significance.
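As a rough illustration of the statistical machinery described above, the sketch below runs a chi-square test on an invented 2 x 2 table of input category by target class and then inflates the p-value by a simple Bonferroni-style multiplier. This is only a stand-in for the full Kass adjustment used by CHAID implementations; the counts, the assumed number of original categories, and the choice of multiplier are all hypothetical.

# Sketch: a CHAID-like significance test for one candidate split, with a
# Bonferroni-style p-value adjustment standing in for the Kass adjustment.
# All counts and the assumed number of categories are invented.
from math import comb
from scipy.stats import chi2_contingency

# Responders / non-responders for two (already merged) groups of an input.
table = [[40, 160],   # group A: 40 responders, 160 non-responders
         [55, 145]]   # group B: 55 responders, 145 non-responders

chi2, p, dof, _ = chi2_contingency(table)

# If the two groups were formed from c original categories, the p-value is
# multiplied by a Bonferroni factor reflecting how many groupings were tried.
# Here the number of pairwise comparisons serves as a simple stand-in factor.
c = 4                          # assumed number of original categories
bonferroni_factor = comb(c, 2)
adjusted_p = min(1.0, p * bonferroni_factor)

print(f"chi-square = {chi2:.2f}, raw p = {p:.4f}, adjusted p = {adjusted_p:.4f}")
# A small adjusted p keeps the split; a large one suggests no real difference,
# so the groups would be merged (or the branch not grown).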
CRT (or CART) Algorithm
- Closely follows the original AID goal, but with improvements through the application of validation and cross-validation.
- It can identify the overfitting problem and verify the reproducibility of the decision tree structure using hold-out or validation data.
- Breiman et al. found that it is not necessary to have hold-out or validation data to implement this grow-and-compare method: a cross-validation method can be used instead, by resampling the training data that is used to grow the decision tree (see the sketch below).
Advantages:
- Prevents growing a tree any larger than the one that passes the validation tests.
- Prior probabilities can be used.
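A minimal sketch of the grow-and-prune-by-cross-validation idea, using scikit-learn's CART-style DecisionTreeClassifier (not the original CART software) and a bundled demo dataset; the dataset and parameter choices are illustrative only.

# Sketch: CART-style growing followed by cost-complexity pruning, where the
# pruning level is chosen by cross-validation on the training data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate cost-complexity pruning levels from the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = [max(a, 0.0) for a in path.ccp_alphas]   # guard against tiny negatives

best_alpha, best_score = 0.0, -1.0
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, y, cv=5).mean()  # resample the training data
    if score > best_score:
        best_alpha, best_score = alpha, score

final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(f"chosen alpha = {best_alpha:.4f}, cv accuracy = {best_score:.3f}, "
      f"leaves = {final_tree.get_n_leaves()}")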
QUEST
QUEST (Quick, Unbiased and Efficient Statistical Tree) is a binary-split decision tree algorithm for classification and data mining developed by Wei-Yin Loh (University of Wisconsin-Madison) and Yu-Shan Shih (National Chung Cheng University, Taiwan). The objective of QUEST is similar to that of CRT by Breiman, Friedman, Olshen and Stone (1984). The major differences are:
- QUEST uses an unbiased variable selection technique by default.
- QUEST uses imputation instead of surrogate splits to deal with missing values.
- QUEST can easily handle categorical predictor variables with many categories.
- If there are no missing values in the data, QUEST can optionally use the CART algorithm to produce a tree with univariate splits.

CHAID, CRT, and QUEST
- For classification-type problems (categorical dependent variable), all three algorithms can be used to build a tree for prediction. QUEST is generally faster than the other two algorithms; however, for very large datasets its memory requirements are usually larger.
- For regression-type problems (continuous dependent variable), the QUEST algorithm is not applicable, so only CHAID and CRT can be used.
- CHAID builds non-binary trees that tend to be "wider". This has made the CHAID method particularly popular in market research applications. CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable or dimension of the table. This type of display matches well the requirements for research on market segmentation.
- CRT always yields binary trees, which sometimes cannot be summarized as efficiently for interpretation and/or presentation.

Machine Learning
- A general way of describing computer-mediated methods of learning or developing knowledge.
- Began as an academic discipline.
- Often associated with using computers to simulate or reproduce intelligent behavior.
- Machine learning and business analytics share a common goal: in order to behave with intelligence, it is necessary to acquire intelligence and to refine it over time.
- The development of decision trees to form rules is called rule induction in the machine learning literature. Induction is the process of developing general laws on the basis of an examination of particular cases.

ID3 Algorithm
ID3 (Iterative Dichotomizer 3), invented by Ross Quinlan in 1983, is an algorithm used to generate a decision tree. The algorithm is based on Occam's razor: it prefers smaller decision trees (simpler theories) over larger ones. However, it does not always produce the smallest tree, and is therefore a heuristic. Occam's razor is formalized using the concept of information entropy. The ID3 algorithm can be summarized as follows:
- Take all unused attributes and compute their entropy with respect to the training examples.
- Choose the attribute for which entropy is minimum (equivalently, information gain is maximum).
- Make a node containing that attribute.
Reference: http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html

Occam's Razor
"One should not increase, beyond what is necessary, the number of entities required to explain anything."
- Occam's razor is a logical principle attributed to the mediaeval philosopher William of Occam. The principle states that one should not make more assumptions than the minimum needed. This principle is often called the principle of parsimony. It underlies all scientific modeling and theory building.
- It admonishes us to choose, from a set of otherwise equivalent models of a given phenomenon, the simplest one. In any given model, Occam's razor helps us to "shave off" those concepts, variables, or constructs that are not really needed to explain the phenomenon. By doing that, developing the model becomes much easier, and there is less chance of introducing inconsistencies, ambiguities, and redundancies.

ID3 Algorithm
ID3(Examples, Target_Attribute, Attributes)
  Create a root node Root for the tree.
  If all examples are positive, return the single-node tree Root with label = +.
  If all examples are negative, return the single-node tree Root with label = -.
  If the set of predicting attributes is empty, return the single-node tree Root
    with label = most common value of the target attribute in the examples.
  Otherwise begin:
    A = the attribute that best classifies the examples.
    Decision tree attribute for Root = A.
    For each possible value vi of A:
      Add a new tree branch below Root, corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have the value vi for A.
      If Examples(vi) is empty:
        Below this new branch, add a leaf node with label = most common target value in the examples.
      Else:
        Below this new branch, add the subtree ID3(Examples(vi), Target_Attribute, Attributes - {A}).
  End
  Return Root.
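The attribute selection step in the pseudocode above ("the attribute that best classifies the examples") is usually implemented with entropy and information gain. Below is a minimal Python sketch on an invented example set; the attribute and class names are hypothetical.

# Sketch: entropy and information gain, the criterion ID3 uses to pick the
# attribute that "best classifies" the examples. Data are invented.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Entropy reduction from splitting the examples on one attribute."""
    total = entropy([e[target] for e in examples])
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return total - remainder

# Tiny invented example set: does a customer respond to a promotion?
examples = [
    {"segment": "new",      "channel": "email", "respond": "yes"},
    {"segment": "new",      "channel": "mail",  "respond": "yes"},
    {"segment": "new",      "channel": "email", "respond": "yes"},
    {"segment": "existing", "channel": "mail",  "respond": "no"},
    {"segment": "existing", "channel": "email", "respond": "no"},
    {"segment": "existing", "channel": "mail",  "respond": "yes"},
]

for attr in ("segment", "channel"):
    print(attr, round(information_gain(examples, attr, "respond"), 3))
# ID3 would split on the attribute with the larger gain, then recurse on each branch.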
C4.5
Features:
- Builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy.
- Examines the normalized information gain (difference in entropy) that results from choosing an attribute to split the data.
- Can convert a decision tree into a rule set. An optimizer goes through the rule set to reduce the redundancy of the rules.
- Can create fuzzy splits on interval inputs.
A few base cases:
- All the samples in the list belong to the same class. When this happens, the algorithm simply creates a single leaf node for the decision tree.
- None of the features gives any information gain. In this case C4.5 creates a decision node higher up the tree using the expected value of the class.
- No instances of a class are present. Again, C4.5 creates a decision node higher up the tree using the expected value.

C4.5
- Simple depth-first construction.
- Uses information gain.
- Sorts continuous attributes at each node.
- Needs the entire data set to fit in memory, so it is unsuitable for large datasets; those would need out-of-core sorting.
- Tutorial: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
- You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
- YouTube: http://www.youtube.com/watch?v=8-vHunc4k8s

C4.5 vs. ID3
C4.5 made a number of improvements to ID3:
- Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. [Quinlan, 96]
- Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
- Handling attributes with differing costs.
- Pruning trees after creation - C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.

C5.0
Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially. C5.0 offers a number of improvements over C4.5:
- Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude).
- Memory usage - C5.0 is more memory-efficient than C4.5.
- Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
- Support for boosting - boosting improves the trees and gives them more accuracy.
- Weighting - allows you to weight different attributes and misclassification types.
- Winnowing - automatically winnows the data to help reduce noise.
C5.0/See5 is a commercial and closed-source product, although free source code is available for interpreting and using the decision trees and rule sets it outputs.
Is See5/C5.0 Better Than C4.5? http://www.rulequest.com/see5-comparison.html

SLIQ
- A decision tree classifier that can handle both numerical and categorical attributes.
- It builds compact and accurate trees.
- It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm.
- It is suitable for classification of large disk-resident datasets, independently of the number of classes, attributes, and records.
- The Gini index is used to evaluate the "goodness" of the alternative splits for an attribute (see the sketch below).

The Evolution of DT Algorithms
(Diagram, summarized.) Two lineages of decision tree algorithms:
- Statistical decision trees: AID (Morgan & Sonquist, 1969) -> statistical tests -> CHAID (Kass, 1975; chi-square) and XAID (Kass, 1982) -> validation -> CRT (Breiman et al., 1984; Gini) and QUEST (Loh & Shih, 1997).
- Rule induction / machine learning: CLS (Hunt et al., 1966) -> ID3 (Quinlan, 1983; entropy) -> C4.5 (Quinlan, 1993; entropy) -> C5.0 (Quinlan; commercial version).
http://www.gavilan.edu/research/reports/DMALGS.PDF

Features of the ARBORETUM Procedure
Tree branching criteria:
- Variance reduction for interval targets
- F-test for interval targets
- Gini or entropy reduction for categorical targets
- Chi-square for nominal targets
Missing value handling:
- Use missing values as a separate, but legitimate, code in the split search
- Assign missing values to the leaf that they most closely resemble
- Distribute missing observations across all branches
- Use surrogate, non-missing inputs to impute the distribution of missing values in the branch
Methods:
- Cost-complexity pruning and reduced-error pruning
- Prior probabilities can be used in training or assessment
- Misclassification costs can be used to influence decisions and branch construction
- Interactive training mode can be used to produce branches and prune branches
Others:
- SAS code generation
- PMML code generation
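Since both SLIQ and the ARBORETUM tree node use the Gini index (Gini reduction) as a branching criterion for categorical targets, here is a minimal sketch of how the "goodness" of two candidate binary splits can be compared; the class counts are invented, and a real implementation would search over all candidate splits.

# Sketch: Gini index as a split criterion (used by SLIQ and by CART-family
# procedures such as the ARBORETUM tree node for categorical targets).
def gini(counts):
    """Gini impurity of a node from its per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_reduction(parent, children):
    """Impurity of the parent minus the weighted impurity of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

parent = [50, 50]                        # 50 responders, 50 non-responders
split_a = [[40, 10], [10, 40]]           # candidate split A: child node counts
split_b = [[30, 20], [20, 30]]           # candidate split B: child node counts

for name, split in [("A", split_a), ("B", split_b)]:
    print(f"split {name}: Gini reduction = {gini_reduction(parent, split):.3f}")
# The split with the larger Gini reduction is the better binary split.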
Questions
- What are the differences between predictive and explanatory analysis?
- Why are input variables distinguished as antecedents and intervening factors?
- What is the implication of Simpson's paradox for decision tree construction and explanation?
- What does interaction mean in the context of decision tree construction?
- What is the primary purpose of the AID algorithm?
- What are the weaknesses of the AID algorithm? How are these problems addressed by other algorithms?
- Summarize the main principles in decision tree algorithm design and implementation.