Lecture Notes 2
Descriptive, Predictive, and
Explanatory Analysis
1
ZHANGXI LIN
ISQS 7339
TEXAS TECH UNIVERSITY
ISQS 7342-001, Business Analytics
Outline
2
 Context-Based Analysis
 Decision Tree Algorithms
Context-Based Analysis
3
From Descriptive to Explanatory Use of Data
4
 Three types of data analysis with decision trees
 Descriptive analysis describes the data or a relationship among various data elements in the data set.
 Predictive use of data, in addition, asserts that this relationship will hold over time and apply to new data.
 Explanatory use of data describes a relationship and attempts to show, by reference to the data, the effect and interpretation of the relationship.
 The rigor of the data work and task organization steps up as one moves from descriptive to predictive to explanatory use.
Showing Context
5
 Decision trees can display contextual effects – hot spots and
soft spots in the relationships that characterize the data.
 One intuitively knows the importance of these contextual
effects, but finds it difficult to understand the context because
of the inherent difficulty of capturing and describing the
complexity of the factors.
 Terms
 Antecedents (shown at the first level of the split): factors or effects that are at the base of a chain of events or relationships.
 Intervening factors (shown at the second or lower levels of the split): factors that come between the ordering established by the antecedents and the outcome. Intervening factors can interact with antecedents or with other intervening factors to produce an interactive effect.
 Interactive effects are an important dimension of discussions about decision trees and are explained more fully later.
Simpson’s Paradox
6
 Simpson's paradox (or the Yule-Simpson effect) is a statistical
paradox wherein the successes of groups seem reversed when the groups
are combined.

This result is often encountered in social and medical science statistics, and occurs when
frequency data are hastily given causal interpretation; the paradox disappears when causal
relations are derived systematically, through formal analysis.
 Edward H. Simpson described the phenomenon in 1951, although Karl Pearson et al. (1899) and Udny Yule (1903) had noted similar effects earlier. The name Simpson's paradox was coined by Colin R. Blyth in 1972. Since Simpson did not discover this statistical paradox, some authors have instead used the impersonal names reversal paradox and amalgamation paradox to refer to what is now called Simpson's paradox or the Yule-Simpson effect.
- Source: http://en.wikipedia.org/wiki/Simpson's_paradox
 Reference: “On Simpson's Paradox and the Sure-Thing Principle,”
Colin R. Blyth, Journal of the American Statistical Association, Vol. 67, No.
338 (Jun., 1972), pp. 364-366
Simpson’s Paradox - Example
7
Lisa and Bart each edit Wikipedia articles for two weeks. In the first week, Lisa improves 60% of the 100 articles she edits, while Bart improves 90% of the 10 articles he edits. In the second week, Lisa improves only 10% of the 10 articles she edits, while Bart improves 30% of the 100 articles he edits.
 In each week, Bart improved a higher percentage of the articles he edited than Lisa did.
 Yet when the two weeks are combined using a weighted average, Lisa's overall improvement rate (61/110, about 55%) is much higher than Bart's (39/110, about 35%), because most of Lisa's edits fell in her high-success week while most of Bart's fell in his low-success week.
- Source: wikipedia.org
Articles improved / articles edited:

        Week 1    Week 2    Total
Lisa    60/100    1/10      61/110
Bart    9/10      30/100    39/110
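The arithmetic behind the table can be checked directly. The following is a minimal sketch in plain Python (not part of the original slides) that computes each editor's weekly and combined improvement rates and reproduces the reversal:

```python
# Simpson's paradox: Bart wins each week, Lisa wins overall.
results = {
    "Lisa": [(60, 100), (1, 10)],   # (articles improved, articles edited) per week
    "Bart": [(9, 10), (30, 100)],
}

for editor, weeks in results.items():
    weekly = [improved / edited for improved, edited in weeks]
    improved_total = sum(improved for improved, _ in weeks)
    edited_total = sum(edited for _, edited in weeks)
    print(editor,
          [f"{rate:.0%}" for rate in weekly],
          f"combined {improved_total / edited_total:.1%}")

# Prints roughly:
#   Lisa ['60%', '10%'] combined 55.5%
#   Bart ['90%', '30%'] combined 35.5%
```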
The Effect of Context
8
 Segmentation makes a difference
 Demonstration:
http://zlin.ba.ttu.edu/pwGrad/7342/AID-Example.xlsx
Decision Tree Algorithms
9
Decision Tree Algorithms
10
 AID (Automatic Interaction Detection), 1969, by Morgan and Sonquist
 CHAID (CHi-squared AID), 1975, by Gordon V. Kass
 XAID, 1982, by Gordon V. Kass
 CRT (or CART, Classification and Regression Trees), 1984, by Breiman et al.
 QUEST (Quick, Unbiased and Efficient Statistical Tree), 1997, by Wei-Yin Loh and Yu-Shan Shih
 CLS, 1966, by Hunt et al.
 ID3 (Iterative Dichotomizer 3), 1983, by Ross Quinlan
 C4.5 and C5.0, by Ross Quinlan
 SLIQ
AID
11
 AID stands for Automatic Interaction Detector.
 It is a statistical technique for multivariate analysis.
 It can be used to determine the characteristics that differentiate buyers from nonbuyers.
 It involves a successive series of analytical steps that gradually focus on the critical determinants of behavior, creating clusters of people with similar demographic characteristics and buying behavior.
 This technique is explained in John A. Sonquist and James
N. Morgan, The Detection Of Interaction Effects, University
of Michigan, Monograph No. 35, 1969.
AID – Interaction with Multicollinearity
12
[Figure: scatter plot of Saving versus Income, with the observations split by gender, illustrating multicollinearity combined with an interaction effect]
AID – Multicollinearity without Interaction
13
[Figure: scatter plot of Saving versus Income, with the observations split by gender, illustrating multicollinearity without an interaction effect]
Simpson's Paradox
14
 Simpson's paradox for continuous data: a positive trend appears within each of two separate groups (blue and red), while a negative trend (black, dashed) appears when the data are combined.
- Source: Wikipedia.org
AID
15
 AID uses a decision tree to search through the many factors that may influence a relationship, so that the final results presented are accurate:
 Partition the dataset in terms of selected attributes
 Run a regression on each subset of the data (see the sketch after this slide)
 Features
 Capability of dealing with multicollinearity (with or without interaction)
 Addresses the problem of hidden relationships
 Morgan and Sonquist’s notes:
 An intervening effect is due to an interaction, e.g., between the customer segment and the effect of the promotional program on retention; it can obscure the relationship.
 In their study, the decision tree accounted for about two-thirds of the variability in some relationships, while regression accounted for only one-third.
 Decision trees perform well with strong categorical, non-linear effects, but are inefficient at packaging the predictive effects of generally linear relationships.
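To make the "partition, then regress" idea concrete, here is a minimal sketch in Python using numpy and synthetic, hypothetical data; the names gender, income, and saving mirror the figures on the earlier slides. It illustrates the segmentation step only, not the actual AID split-search algorithm:

```python
import numpy as np

# Hypothetical data: saving depends on income, with a different slope per gender
# (synthetic numbers, for illustration only).
rng = np.random.default_rng(0)
n = 200
income = rng.uniform(20, 100, n)              # e.g., thousands of dollars
gender = rng.integers(0, 2, n)                # 0 / 1 group indicator
slope = np.where(gender == 0, 0.30, 0.05)     # the two groups save at different rates
saving = slope * income + rng.normal(0, 2, n)

# A pooled regression blurs the two relationships together ...
pooled_slope, pooled_intercept = np.polyfit(income, saving, 1)
print(f"pooled slope: {pooled_slope:.2f}")

# ... while the AID-style step -- partition on the attribute, then regress
# within each subset -- recovers the separate relationships.
for g in (0, 1):
    mask = gender == g
    slope_g, intercept_g = np.polyfit(income[mask], saving[mask], 1)
    print(f"gender={g} slope: {slope_g:.2f} (n={mask.sum()})")
```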
Imperfections of AID
16
 Untrue relationship - Because the algorithm looks
through so many potential groupings of values, it is more
likely to find groupings that are actually anomalies in the
data.
 Biased selection of inputs or predictors - The
successive partitioning of the data set into bins quickly
exhausts the number of observations that are in lower
levels of decision trees.
 Potential overfitting - AID does not know when to stop growing branches, and it forms splits at the lower extremities of the decision tree, where few data records are actually available.
Remedies
17
 Use statistical tests to check the efficacy of a branch that is grown.
 CHAID and XAID by Kass (1975, 1982)
 Use validation data to test any branch that is formed for reproducibility.
 CRT/CART by Breiman et al. (1984)
CHAID Algorithm
18
 CHAID (CHi-squared Automatic Interaction Detector) is a type of decision tree technique, published in 1980 by Gordon V. Kass.
 In practice, it is often used in the context of direct marketing to select groups of consumers and to predict how their responses to some variables affect other variables.
 Advantage: its output is highly visual and easy to interpret.
 Because it uses multiway splits by default, it needs rather large sample sizes to work effectively; with small sample sizes the respondent groups can quickly become too small for reliable analysis.
 CHAID detects interaction between variables in the data set. Using this technique we can establish relationships between a dependent variable and other explanatory variables such as price, size, supplements, etc.
 CHAID is often used as an exploratory technique and is an alternative to multiple regression, especially when the data set is not well-suited to regression analysis.
Terms
19
 Bonferroni correction
 The Bonferroni correction states that if an experimenter is testing n dependent or independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested (a minimal numeric sketch follows this slide). Statistically significant simply means that a given result is unlikely to have occurred by chance.
 It was developed by the Italian mathematician Carlo Emilio Bonferroni.
 Kass Adjustment
 A p-value adjustment that multiplies the p-value by a Bonferroni
factor that depends on the number of branches and chi-square target
values, and sometimes on the number of distinct input values. The
Kass adjustment is used in the Tree node.
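The following is a minimal sketch in plain Python (not from the original slides; the p-values are hypothetical) showing the Bonferroni idea described above: with n tests, either divide the significance level by n or, equivalently, multiply each raw p-value by n, which is essentially the kind of factor the Kass adjustment applies:

```python
# Bonferroni correction for n simultaneous hypothesis tests.
alpha = 0.05
p_values = [0.001, 0.012, 0.049]        # hypothetical raw p-values from n = 3 tests
n = len(p_values)

per_test_alpha = alpha / n              # each test is run at 0.05 / 3 ~= 0.0167
adjusted = [min(1.0, p * n) for p in p_values]   # equivalent adjusted p-values

for raw, adj in zip(p_values, adjusted):
    print(f"raw p = {raw:.3f}   adjusted p = {adj:.3f}   "
          f"significant at alpha = {alpha}? {adj < alpha}")
```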
CHAID
20
 Statistical tests
 CHAID uses a test of similarity to determine whether individual values of an input should be combined (see the merge-step sketch after this slide).
 After similar values of an input have been combined according to the previous rule, tests of significance are used to select the inputs that are significant descriptors of the target values and, if so, to gauge their strength relative to other inputs.
 CHAID addresses all the problems in the AID approach:
 A statistical test is used to ensure that only relationships that are significantly different from random effects are identified.
 Statistical adjustments address the biased selection of variables as candidates for the branch partitions.
 Tree growth is terminated when the branch that is produced fails the test of significance.
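A minimal sketch in Python of the merge step referred to above, using scipy's chi-square test of independence on a hypothetical contingency table (predictor category by target outcome). It shows one simplified pass only; the real CHAID procedure iterates the merging and applies the Bonferroni/Kass adjustment:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are categories of one input
# (e.g., three price bands), columns are target counts (respond, not respond).
table = {
    "low":    np.array([30, 70]),
    "medium": np.array([32, 68]),
    "high":   np.array([60, 40]),
}

# Find the pair of categories whose target distributions are most alike
# (largest p-value); merge them if the p-value exceeds the merge threshold.
merge_threshold = 0.05
categories = list(table)
best_pair, best_p = None, -1.0
for i in range(len(categories)):
    for j in range(i + 1, len(categories)):
        a, b = categories[i], categories[j]
        _, p, _, _ = chi2_contingency(np.vstack([table[a], table[b]]))
        if p > best_p:
            best_pair, best_p = (a, b), p

print(f"most similar pair: {best_pair}, p = {best_p:.3f}")
if best_p > merge_threshold:
    print(f"merge {best_pair[0]!r} and {best_pair[1]!r} into a single branch")
```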
CRT (or CART) Algorithm
21
 Closely follows the original AID goal but with improvement
through the application of validation and cross-validation.
 It can identify the overfitting problem and verify the
reproducibility of the decision tree structure using hold-out or
validation data
 Breiman et al. found that it was not necessary to have hold-out or validation data to implement this grow-and-compare method: a cross-validation method can be used by resampling the training data that is used to grow the decision tree (see the pruning sketch after this slide).
 Advantages
 Prevents growing a tree any bigger than one that passes all the validation tests
 Prior probabilities can be used
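A minimal sketch of the grow-then-validate idea, using scikit-learn's DecisionTreeClassifier (a CART-style implementation) and one of its bundled toy datasets. It is an illustration under those assumptions, not the original CRT software: a large tree is grown, and cross-validation, rather than a hold-out sample, chooses the cost-complexity pruning level:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a full tree and obtain the candidate cost-complexity pruning levels (alphas).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Pick the alpha whose pruned tree generalizes best under 5-fold cross-validation.
best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    score = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(f"chosen ccp_alpha = {best_alpha:.4f}, cv accuracy = {best_score:.3f}, "
      f"leaves = {pruned.get_n_leaves()}")
```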
QUEST
22
 QUEST (Quick, Unbiased and Efficient Statistical Tree) is a binary-split decision tree algorithm for classification and data mining, developed by Wei-Yin Loh (University of Wisconsin-Madison) and Yu-Shan Shih (National Chung Cheng University, Taiwan).
 The objective of QUEST is similar to that of CRT, by Breiman, Friedman, Olshen and Stone (1984). The major differences are:
 QUEST uses an unbiased variable selection technique by default
 QUEST uses imputation instead of surrogate splits to deal with missing values
 QUEST can easily handle categorical predictor variables with many categories
 If there are no missing values in the data, QUEST can optionally use the CART algorithm to produce a tree with univariate splits
CHAID, CRT, and QUEST
23
 For classification-type problems (categorical dependent variable), all three algorithms can be used to build a tree for prediction. QUEST is generally faster than the other two algorithms; however, for very large datasets, its memory requirements are usually larger.
 For regression-type problems (continuous dependent variable), the
QUEST algorithm is not applicable, so only CHAID and CRT can be
used.
 CHAID will build non-binary trees that tend to be "wider". This has
made the CHAID method particularly popular in market research
applications.
 CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable or dimension of the table. This type of display matches well the requirements for research on market segmentation.
 CRT will always yield binary trees, which can sometimes not be
summarized as efficiently for interpretation and/or presentation.
Machine Learning
24
 A general way of describing computer-mediated methods of
learning or developing knowledge
 Began as an academic discipline
 Often associated with using computers to simulate or
reproduce intelligent behavior.
 Machine learning and business analytics share a common goal: in order to behave with intelligence, it is necessary to acquire intelligence and to refine it over time.
 The development of decision trees to form rules is called rule
induction in machine learning literature

Induction is the process of developing general laws on the basis of an
examination of particular cases.
ID3 Algorithm
25
 ID3 (Iterative Dichotomizer 3), invented by Ross Quinlan
in 1983, is an algorithm used to generate a decision tree.
 The algorithm is based on Occam's razor: it prefers smaller
decision trees (simpler theories) over larger ones. However,
it does not always produce the smallest tree, and is
therefore a heuristic. Occam's razor is formalized using the
concept of information entropy.
 The ID3 algorithm can be summarized as follows (see the sketch after this slide):
 Take all unused attributes and compute their entropy with respect to the training samples
 Choose the attribute for which the resulting entropy is minimum (equivalently, the information gain is maximum)
 Make a node containing that attribute
 Reference: http://www.cis.temple.edu/~ingargio/cis587/readings/id3-c45.html
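A minimal sketch in Python of the entropy and information-gain computation that drives this attribute choice; the tiny "play tennis"-style records below are hypothetical and only for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Entropy reduction from splitting `examples` (a list of dicts) on `attribute`."""
    base = entropy([e[target] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

# Tiny hypothetical training set.
data = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
]

# ID3 splits on the attribute with the largest information gain.
for attr in ("outlook", "windy"):
    print(attr, round(information_gain(data, attr, "play"), 3))
```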
Occam's Razor
26
 “One should not increase, beyond what is necessary, the
number of entities required to explain anything”
 Occam's razor is a logical principle attributed to the mediaeval
philosopher William of Occam. The principle states that one
should not make more assumptions than the minimum
needed.
 This principle is often called the principle of parsimony. It
underlies all scientific modeling and theory building.

It admonishes us to choose from a set of otherwise equivalent models of a
given phenomenon the simplest one. In any given model, Occam's razor
helps us to "shave off" those concepts, variables or constructs that are not
really needed to explain the phenomenon. By doing that, developing the
model will become much easier, and there is less chance of introducing
inconsistencies, ambiguities and redundancies.
ID3 Algorithm
27
ID3 (Examples, Target_Attribute, Attributes)
 Create a root node for the tree.
 If all examples are positive, return the single-node tree Root, with label = +.
 If all examples are negative, return the single-node tree Root, with label = -.
 If the number of predicting attributes is empty, then return the single-node tree Root, with label = the most common value of the target attribute in the examples.
 Otherwise begin:
 A = the attribute that best classifies the examples.
 Decision tree attribute for Root = A.
 For each possible value vi of A:
 Add a new tree branch below Root, corresponding to the test A = vi.
 Let Examples(vi) be the subset of examples that have the value vi for A.
 If Examples(vi) is empty, then below this new branch add a leaf node with label = the most common target value in the examples.
 Else, below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes – {A}).
 End
 Return Root
C4.5
28
 Features
 Builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy.
 Examines the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data.
 Can convert a decision tree into a rule set. An optimizer goes through the rule set to reduce the redundancy of the rules.
 Can create fuzzy splits on interval inputs.
 A few base cases:
 All the samples in the list belong to the same class. Once this happens, the algorithm simply creates a single leaf node for the decision tree.
 None of the features provides any information gain; in this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
 An instance of a previously unseen class is encountered; again, C4.5 creates a decision node higher up the tree using the expected value.
C4.5
29
 Simple depth-first construction.
 Uses Information Gain
 Sorts Continuous Attributes at each node.
 Needs the entire dataset to fit in memory.
 Unsuitable for large datasets.
 Needs out-of-core sorting.
 Tutorial:
http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
 You can download the software from:
http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
 YouTube: http://www.youtube.com/watch?v=8-vHunc4k8s
C4.5 vs. ID3
30
C4.5 made a number of improvements to ID3:
 Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those records whose attribute value is above the threshold and those whose value is less than or equal to it [Quinlan, 96] (see the threshold-search sketch after this list).
 Handling training data with missing attribute values - C4.5
allows attribute values to be marked as ? for missing. Missing
attribute values are simply not used in gain and entropy
calculations.
 Handling attributes with differing costs.
 Pruning trees after creation - C4.5 goes back through the tree
once it's been created and attempts to remove branches that
do not help by replacing them with leaf nodes.
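A minimal sketch in Python of the threshold search mentioned in the first improvement above: candidate thresholds are taken midway between consecutive sorted values, and the one with the largest entropy reduction is kept. For simplicity it uses plain information gain rather than C4.5's gain ratio, and the income/target data are hypothetical:

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values and return the
    threshold (and its information gain) that best separates the classes."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical continuous attribute (e.g., income) with a binary target.
income = [21, 25, 33, 47, 52, 60, 75, 80]
target = ["no", "no", "no", "yes", "no", "yes", "yes", "yes"]
print(best_threshold(income, target))     # roughly (40.0, 0.55) for this toy data
```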
C5.0
31
 Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for
Windows) which he markets commercially.
 C5.0 offers a number of improvements on C4.5:
 Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude)
 Memory usage - C5.0 is more memory efficient than C4.5
 Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees
 Support for boosting - boosting improves the trees and gives them more accuracy
 Weighting - allows you to weight different attributes and misclassification types
 Winnowing - automatically winnows the data to help reduce noise
 C5.0/See5 is a commercial and closed-source product, although free source code is available for interpreting and using the decision trees and rule sets it outputs.
 Is See5/C5.0 Better Than C4.5?

http://www.rulequest.com/see5-comparison.html
SLIQ
32
 A decision tree classifier that can handle both numerical and categorical attributes
 It builds compact and accurate trees
 It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm
 It is suitable for classification of large disk-resident datasets, independently of the number of classes, attributes, and records
 The Gini index is used to evaluate the “goodness” of the alternative splits for an attribute (see the sketch after this slide)
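A minimal sketch in Python of the Gini index used to score candidate splits (the same impurity measure CART uses). The two label lists below are hypothetical, standing for the records falling into the two branches of one candidate split:

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini of the two branches of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical target labels of the records falling into each branch of a split.
left = ["yes", "yes", "yes", "no"]
right = ["no", "no", "no", "yes", "no"]
print(f"parent Gini: {gini(left + right):.3f}")      # ~0.494
print(f"split Gini:  {split_gini(left, right):.3f}") # ~0.344, an improvement
```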
The Evolution of DT Algorithms
33
Statistical decision trees:
  AID (Morgan & Sonquist, 1969)
  -> statistical tests: CHAID (Kass, 1975; chi-square) and XAID (Kass, 1982)
  -> validation: CRT (Breiman et al., 1984; Gini index)
  -> QUEST (Loh & Shih, 1997)
Rule induction / machine learning:
  CLS (Hunt et al., 1966)
  -> entropy: ID3 (Quinlan, 1983)
  -> entropy: C4.5 (Quinlan, 1993)
  -> C5.0 (Quinlan; commercial version)
Source: http://www.gavilan.edu/research/reports/DMALGS.PDF
Features of the ARBORETUM Procedure
34
 Tree branching criteria
 Variance reduction for interval targets
 F-test for interval targets
 Gini or entropy reduction for categorical targets
 Chi-square for nominal targets
 Missing value handling
 Use missing values as a separate, but legitimate, code in the split search
 Assign missing values to the leaf that they most closely resemble
 Distribute missing observations across all branches
 Use surrogate, non-missing inputs to impute the distribution of missing values in the branch
 Methods
 Cost-complexity pruning and reduced-error pruning
 Prior probabilities can be used in training or assessment
 Misclassification costs can be used to influence decisions and branch construction
 Interactive training mode can be used to produce branches and prune branches
 Others
 SAS code generation
 PMML code generation
Questions
35
 What are the differences between predictive and explanatory analysis?
 Why are input variables distinguished as antecedents and intervening factors?
 What is the implication of Simpson’s paradox for decision tree construction and explanation?
 What does interaction mean in the context of decision tree construction?
 What is the primary purpose of the AID algorithm?
 What are the weaknesses of the AID algorithm? How are these problems addressed by other algorithms?
 Summarize the main principles in decision tree algorithm design and implementation.