
Student Week 2 - Logistic Regression and Decision Trees

BUSI4359
Accounting Analytics
Instructor: Dr. Qingxin Meng
Week 2: Logistic Regression and Decision Trees (and Node Purity)
7. Structure of an Analytics Problem

A Fixed Structure + Learnt Parameters + Training Data → A Final Model

Model Name          | Inputs/Outputs            | A Fixed Structure | Learnt Parameters      | Training Data
LOGISTIC REGRESSION | Vitamin Sales → Pregnancy | Hyperplane        | Placement of the plane | Retail Transaction Data
LINEAR REGRESSION   | House Size → House Price  | Straight Line     | 1. Slope, 2. Intercept | Boston Housing Data
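As a concrete sketch of the "fixed structure + learnt parameters + training data" row for linear regression: the snippet below fits a straight line to a few invented house-size/price pairs (not the Boston Housing data) and reads off the learnt slope and intercept.

```python
# A minimal sketch of "fixed structure + learnt parameters". The numbers below
# are made-up illustrative data, not the Boston Housing set named on the slide.
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: house size (m^2) -> house price (arbitrary units)
X = np.array([[50], [70], [90], [110], [130]])   # input feature
y = np.array([150, 200, 260, 310, 360])          # output variable

# The fixed structure is a straight line; fitting learns its parameters.
model = LinearRegression().fit(X, y)

print("Learnt slope:", model.coef_[0])        # parameter 1
print("Learnt intercept:", model.intercept_)  # parameter 2
print("Prediction for 100 m^2:", model.predict(np.array([[100]]))[0])
```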
Model structure: Underfitting and overfitting
1. Example of incorrect structure
→ When we are doing a regression, for example, there is no reason to think the data will follow a straight line…
→ Often a "linear" model's structure is too tight.
2. Looser Structures?
A CURVED LINE for regression…
MULTIPLE HYPERPLANES for classification...
2. Deciding on Model Structure
→ TOO MUCH STRUCTURE = overfitting
→ TOO LITTLE STRUCTURE = poor fit
This will generate BAD BUSINESS PERFORMANCE!
• Consider setting a price point for a new product.
• As an analytics problem this requires looking at previous,
comparable products, and modelling the relationship between
price and revenue.
• This is a non-linear relationship, so you instinctively would not
favour a model like linear regression.
3. Just enough structure to model the data
• All our data points here are previous products from similar categories.
• Discount the price point too low, and no amount of sales will save overall revenue.
• And once we set the price too high, sales drop off too rapidly to balance the per-item sales gain.

[Figure: Revenue vs. Item Price Point. The blue curve denotes the true relationship; this can be modelled with a polynomial regression.]
3. Overfitting produces poor performance
• However, consider if we'd allowed for a super-curvy line that went through almost every point in our data…
• Despite being a great fit, that line's predictive performance would be poor. The simpler blue line was a far better model to use for new products.

[Figure: Revenue vs. Item Sales Price. In polynomial regression we can add too many degrees of freedom.]
3. Underfitting can be even worse
• Overfitting is an important issue (and one that you will see come up frequently in business analytics problems).
• However, underfitting is potentially an even worse problem, which occurs when your model is just too simple.

[Figure: Revenue vs. Item Price Point.]
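A minimal sketch of both failure modes, assuming synthetic (invented) price and revenue numbers rather than real product data: a degree-1 polynomial underfits the hump-shaped relationship, degree 2 captures it, and degree 10 has too many degrees of freedom and chases the noise.

```python
# Hedged sketch: synthetic price/revenue data with a hump-shaped true relationship.
import numpy as np

rng = np.random.default_rng(0)
price = np.linspace(1, 10, 40)                                        # candidate price points
revenue = -(price - 6) ** 2 + 36 + rng.normal(0, 2, size=price.size)  # noisy hump

for degree in (1, 2, 10):                        # too simple, about right, too curvy
    poly = np.polynomial.Polynomial.fit(price, revenue, degree)
    train_mse = np.mean((revenue - poly(price)) ** 2)
    print(f"degree {degree:2d}: training mean squared error = {train_mse:.2f}")

# The degree-10 fit hugs the training points (lowest training error), but it is
# the degree-2 curve that tracks the true price-revenue relationship for new products.
```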
3. Underfitting and Overfitting
• It is very hard to know how complex or simple a parametric model needs to be without experience of the problem domain. As a result, business analysts will tend to try lots of models and assess which is best.
• Bad analysts will assess this by fit (and overfit!)
• Good analysts will see how well the candidate models generalize to new data. But they will also use "non-parametric models"…
Training and Testing
Testing the performance of a model on the training data → evaluates Training Accuracy.
Testing the performance of a model on a NEVER SEEN BEFORE test set → evaluates Testing Accuracy.

Which one (training accuracy or testing accuracy) is a better measurement of model performance?
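A minimal sketch of why the answer is testing accuracy, on synthetic data (the features, labels and split ratio below are invented for illustration): an unrestricted decision tree memorises its training set, so only the held-out test set reveals its true performance.

```python
# Hedged sketch: training accuracy vs testing accuracy on made-up data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))                              # two invented input features
y = (X[:, 0] + rng.normal(0, 1.0, 300) > 0).astype(int)    # noisy binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # no depth limit
print("Training accuracy:", model.score(X_train, y_train))  # typically ~1.0
print("Testing accuracy: ", model.score(X_test, y_test))    # noticeably lower
```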
• Overfitting and Underfitting are two fundamental concepts in Machine Learning. From the following statements, which ones are correct representations of Overfitting and Underfitting?
4. Underfitting (in Classifiers)
• Both underfitting and overfitting are serious challenges to both classifiers and regressors.
• Here you can see that logistic regression can never be a solution to the business task.
• Why?
4. Underfitting (in Classifiers)
• The reason is that logistic regression can only construct a single partitioning hyperplane.
• And in this example no single line can separate the classes our business cares about…
• …but two lines could!
Model 4: DECISION TREES
Finally, more than one line…
5. Decision Trees
For classification tasks (e.g. will a customer leave within 3 months) I want you to do the following:
• Create a Logistic Regression model.
• Examine how accurate your logistic regression model is for use in the real-world business case.
• Try and beat it using a Decision Tree (see the sketch below).
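A hedged sketch of that exercise on synthetic data (the feature values, the "middle band" class structure and the depth limit are all invented for illustration): no single hyperplane can separate a class that sits in a band, so logistic regression stalls near the base rate, while a two-split decision tree recovers it.

```python
# Hedged sketch: logistic regression vs a small decision tree on made-up data
# built so that no single hyperplane separates the classes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 2))
y = ((X[:, 0] > -1) & (X[:, 0] < 1)).astype(int)   # positive class sits in a middle band

logit = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print("Logistic regression accuracy:", logit.score(X, y))  # stuck near the base rate
print("Decision tree accuracy:      ", tree.score(X, y))   # two splits recover the band
```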
5. Decision Tree Example
A decision tree (of which there are many varieties, but which are predominantly used for classification) functions as follows:
• Find the feature, and the single split point on it, that linearly discriminates between the different classes the most.
• Take each subspace created and repeat this process.
• Keep iterating until an "end condition" is reached (a from-scratch sketch of this procedure follows below).
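A from-scratch sketch of that procedure, under stated assumptions: the function names are my own, the "discriminates the most" criterion used here is misclassification error (one of the node-purity measures discussed later in this deck), and the end condition is a simple depth limit.

```python
# Illustrative sketch only. Labels are assumed to be integers such as 0/1.
import numpy as np

def weighted_error(y_left, y_right):
    """Error if each side predicts its own majority class, weighted by group size."""
    def err(y):
        return 0.0 if len(y) == 0 else 1.0 - np.max(np.bincount(y)) / len(y)
    n = len(y_left) + len(y_right)
    return len(y_left) / n * err(y_left) + len(y_right) / n * err(y_right)

def best_split(X, y):
    """Search every feature, and every midpoint between its sorted values."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= t
            score = weighted_error(y[mask], y[~mask])
            if best is None or score < best[0]:
                best = (score, j, t)
    return best                                   # (error, feature index, threshold)

def grow_tree(X, y, depth=0, max_depth=3):
    """Recursively split each sub-space until an end condition is reached."""
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"predict": int(np.argmax(np.bincount(y)))}   # leaf: majority class
    split = best_split(X, y)
    if split is None:                                        # nothing left to split on
        return {"predict": int(np.argmax(np.bincount(y)))}
    _, j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left":  grow_tree(X[mask],  y[mask],  depth + 1, max_depth),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth)}

def predict_one(node, x):
    """Walk from the root to a leaf and return its majority-class prediction."""
    while "predict" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["predict"]
```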
5. Decision Tree Example
• Let’s consider an example business task that a lot
of retailing companies would be interested in.
• How does the psychology of a customer affect the
impact of advertising on them?
• Given how “agreeable” a customer is, and how
often they already shop at a retailing store, can
we predict if an advertising campaign will
increase their visits?
5. Decision Tree Example
[Figure: a decision tree fitted to this task. The start node splits on agreeableness at -2.24; deeper nodes split again on agreeableness (at -1.79 and 1.15) and on total spend (at -0.98 and 0.78); each leaf predicts either "visited more" or "no change".]
5. Decision Tree Example
[Figure: the same decision tree, but with each leaf now labelled with a business action, "advertise to" or "ignore", instead of "visited more" / "no change".]
5. Decision Tree Example
• The original algorithm for decision trees, called
ID3, only used categorical features (Categorical
features have a smaller search space, so are
easier to analyse).
• This algorithm was extended to continuous
variables, and called C4.5. This algorithm allowed
use of input features which were continuous
(searching across their values for optimal
placement of hyperplanes).
• Note these techniques are “non-parametric”.
6. Decision Tree example – Customer Churn!
• Let's work through a canonical example of how a decision tree algorithm works.
• All types of decision tree split using a notion of "purity".
6. Decision Tree example – Customer Churn!
[Figure: twelve customers (numbered 1-12), each drawn with a particular head shape, body shape, shirt colour and wellbeing, and each labelled Yes or No for whether they Exited.]
1. What style of problem is this?
2. What is the output feature/variable?
3. How many classes are we trying to predict?
4. What are the input variables and their values?
5. What model would be good to use?
Task I. Decide which feature to split on

To work this out, create a table and analyse each input (independent) feature: if you split on it, how "pure" would the resulting groups be in the output (Exited)?

The answer: BODY SHAPE!
If we split on the head shape category, these will be the sub-groups that result.
If someone fell into the "round" group we would predict the mode: "YES".
That notion allows us to calculate a measure that considers how often the model could guess right if we split on this category.
7. Classification Accuracy
• Classification accuracy is one way of measuring the concept of node purity (a measure which allows us to pick between different features, and values, to split our decision tree on).
• In each node we would predict the most probable class, so our error rate for a node will be:

Classification Error = 1 − max(p_i)
Remember that we first consider each potential generated node's error as:

Classification Error = 1 − max(p_i)

• We assess a split by looking at the % of items flowing into each node it produces, weighting each node's classification error by that percentage, and summing the results.
• Thus, if our decision tree used classification error as a measure of node impurity, it would indeed pick body shape to split on: it has the minimum expected error.
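A small check of that claim in code, using the Yes/No counts that appear in this deck's figures and Gini tables (the dictionary below simply transcribes those counts; the helper names are my own).

```python
# Expected classification error for each candidate split of the churn example.
def classification_error(yes, no):
    total = yes + no
    return 1 - max(yes, no) / total          # 1 - max(p_i) for that node

def expected_error(groups):
    n = sum(yes + no for yes, no in groups)
    return sum((yes + no) / n * classification_error(yes, no) for yes, no in groups)

splits = {
    "Head Shape":   [(4, 5), (2, 1)],              # square, round
    "Body Shape":   [(2, 4), (5, 1)],              # oval, square
    "Shirt Colour": [(6, 4), (1, 1)],              # black, white
    "Wellbeing":    [(2, 3), (3, 2), (2, 0)],      # sad, happy, nonplussed
}
for feature, groups in splits.items():
    print(f"{feature:12s} expected classification error = {expected_error(groups):.3f}")
# Body Shape gives the minimum expected error, so it is picked for the first split.
```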
Now we have our first split; take the sub-groups it produces and do it all again!

[Figure: the first split. NODE A (all the people) splits on body shape into NODE B (square bodies: 5 YES, 1 NO) and NODE C (oval bodies: 2 YES, 4 NO).]
[Figure: NODE B (square bodies) split on shirt colour, giving NODE B1 (black shirts: 5 YES, 0 NO) and NODE B2 (white shirts: 0 YES, 1 NO).]
[Figure: NODE C (oval bodies) split on head shape, giving NODE C1 (square heads: 0 YES, 4 NO) and NODE C2 (round heads: 2 YES, 0 NO).]
[Figure: the finished tree.
Body shape
  NODE B: square bodies (5 YES, 1 NO) → split on Shirt Colour
    NODE B1: black shirts (5 YES, 0 NO)
    NODE B2: white shirts (0 YES, 1 NO)
  NODE C: oval bodies (2 YES, 4 NO) → split on Head Shape
    NODE C1: square heads (0 YES, 4 NO)
    NODE C2: round heads (2 YES, 0 NO)]
UH OH! New data…
[Figure: the twelve customers (1-12) again, with their Exited labels (Yes/No); the Wellbeing column has now changed.]
Task II. Now which feature to split on?

Things have changed in the Wellbeing column… now which feature choice would create the most "pure" groups based on classification error?
With that small change in the data, we end up with a completely different tree… splitting on Wellbeing first.

[Figure: the new first split. NODE A (all the people) splits on wellbeing into Node B: sad (0 YES, 4 NO), Node C: neither (2 YES, 0 NO) and Node D: happy (5 YES, 1 NO).]
With that small change in the data, we ultimately ended up with a completely different tree.

[Figure: the full new tree. The root splits on Wellbeing; the "happy" branch splits on Head shape into Node D1: round heads (2 YES, 0 NO) and Node D2: square heads (3 YES, 1 NO); Node D2 then splits on Body shape into Node D2.1 (0 YES, 1 NO) and Node D2.2 (3 YES, 0 NO).]
4. Conclusion
• Decision Trees are very robust analytics models. Their principles underpin many real-world analytics solutions, from credit default prediction to customer churn.
• However, they are potentially volatile, and require a somewhat arbitrary selection of a "node purity" measure.
• They make it hard to underfit your data…
• But they have to be told when to stop splitting nodes, or they will ultimately completely overfit the data (see the sketch below).
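One common way of "telling the tree when to stop", sketched with scikit-learn's pre-pruning hyperparameters (the particular values below are arbitrary placeholders and would normally be tuned against a held-out test set).

```python
# A sketch of pre-pruning a decision tree so it cannot split itself into overfitting.
from sklearn.tree import DecisionTreeClassifier

shallow_tree = DecisionTreeClassifier(
    criterion="gini",        # the node-purity measure (could also be "entropy")
    max_depth=3,             # stop after three levels of splits
    min_samples_leaf=20,     # never create a leaf with fewer than 20 records
    random_state=0,
)
# shallow_tree.fit(X_train, y_train) would then grow a deliberately limited tree,
# trading a little training accuracy for better generalisation to new customers.
```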
viii. A note on terminology
                     | Predicting from 2 classes    | Predicting from several classes
One Input Variable   | Binary Logistic Regression   | Multinomial Logistic Regression
Many Input Variables | Multiple Logistic Regression | Multiple Multinomial Logistic Regression
ii. Decision Trees + continuous variables
[Figure: the advertising-campaign decision tree from earlier, showing splits on the continuous variables agreeableness (at -2.24, -1.79 and 1.15) and total spend (at -0.98 and 0.78), with leaves labelled "advertise to" or "ignore".]
iii. Decision Trees + discrete variables
[Figure: the customer-churn tree from earlier, showing a split on the discrete variable Body shape into NODE B (square bodies) and NODE C (oval bodies), each with its YES/NO counts.]
1. Gini Impurity / Index
• Gini Impurity is the likelihood of an incorrect classification of
a new instance of a random variable, if that new instance
were randomly classified according to the distribution of
class labels from the data set.
• Gini Impurity of each node is calculated as:

Gini Index = 1 − Σ_j (p_j)²

• So you simply calculate the probability (relative frequency) of each class label occurring in a node, sum the squares, and subtract the sum from one.
• A split, as before, is a good one if it has the lowest “expected”
gini impurity.
1. Gini Impurity example
Consider Head Shape:
• For SQUARE we have: 1 − (4/9)² − (5/9)² = 0.494
• For ROUND we have: 1 − (2/3)² − (1/3)² = 0.444
So our expected value is:
• (9/12 × 0.494) (SQUARE) + (3/12 × 0.444) (ROUND) = 0.481
Note again that Body Shape has a lower Gini Impurity!
Head Shape   | Yes | No | Total | Gini Index | Weighted
SQUARE       |  4  |  5 |   9   |   0.494    |  0.370
ROUND        |  2  |  1 |   3   |   0.444    |  0.111
Expected Gini: 0.481

Body Shape   | Yes | No | Total | Gini Index | Weighted
OVAL         |  2  |  4 |   6   |   0.444    |  0.222
SQUARE       |  5  |  1 |   6   |   0.278    |  0.139
Expected Gini: 0.361

Shirt Colour | Yes | No | Total | Gini Index | Weighted
BLACK        |  6  |  4 |  10   |   0.480    |  0.400
WHITE        |  1  |  1 |   2   |   0.500    |  0.083
Expected Gini: 0.483

Wellbeing    | Yes | No | Total | Gini Index | Weighted
SAD          |  2  |  3 |   5   |   0.480    |  0.200
HAPPY        |  3  |  2 |   5   |   0.480    |  0.200
NONPLUSSED   |  2  |  0 |   2   |   0.000    |  0.000
Expected Gini: 0.400
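A short sketch that re-computes the tables above from their Yes/No counts (the counts are transcribed straight from the tables; the helper names are my own).

```python
# Re-computing the expected Gini impurity for each candidate feature.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)      # 1 - sum_j (p_j)^2

def expected_gini(groups, n=12):
    return sum(sum(counts) / n * gini(counts) for counts in groups)

features = {
    "Head Shape":   [(4, 5), (2, 1)],            # square, round            -> 0.481
    "Body Shape":   [(2, 4), (5, 1)],            # oval, square             -> 0.361
    "Shirt Colour": [(6, 4), (1, 1)],            # black, white             -> 0.483
    "Wellbeing":    [(2, 3), (3, 2), (2, 0)],    # sad, happy, nonplussed   -> 0.400
}
for name, groups in features.items():
    print(f"{name:12s} expected Gini = {expected_gini(groups):.3f}")
# Body Shape again has the lowest expected Gini impurity.
```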
1. Gini Impurity / Index
• The CART algorithm uses Gini Impurity, and therefore produces a different tree than if it had used classification error.
• Often this will produce a better tree, because its predictions will overfit less: it will be more generalizable to unseen real-world data.
• Remember this, as it is crucial: we care not about "fitting" but about new predictions.
• However, we potentially have an even better option… entropy.
2. Entropy
• Entropy as a concept was invented by Claude Shannon.
• It underpins a huge mathematical field called "Information Theory".
• Information theory is highly related to prediction, and hence of immense importance to analytics.
• It can be slightly dry.
2. Entropy
This concept can be considered in several ways:
• Entropy as a measure of "node impurity".
• Entropy as a measure of randomness.
• Entropy as a measure of predictability.

An entropy score of zero means we have complete uniformity: the unknown is perfectly predictable.

As entropy increases, the probabilities of the potential outcomes all start to match each other; they become equally likely and impossible to predict better than chance.
3. Entropy in Decision Trees
Consider that the entropy of this whole dataset is:

Entropy = −Σ_x P(x) log₂ P(x) = −0.7 log₂ 0.7 − 0.3 log₂ 0.3 = 0.88132
→ What is the best feature to split on if we use entropy as our measure of node impurity?

Note: in entropy calculations, 0 log₂ 0 = 0.
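A minimal sketch of entropy as a node-impurity measure, reproducing the whole-dataset figure above for class probabilities 0.7 and 0.3 (the helper name is my own).

```python
# Entropy of a class distribution, with the 0*log2(0) = 0 convention.
import math

def entropy(probabilities):
    """H = -sum_x P(x) * log2 P(x)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.7, 0.3]))   # ~0.881, as on the slide
print(entropy([0.5, 0.5]))   # 1.0 -> maximally unpredictable
print(entropy([1.0, 0.0]))   # 0.0 -> perfectly pure node
```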