Linear and Logistic Regression using SAS Enterprise Miner

Regression for Data Mining
Mgt. 2206 – Introduction to Analytics
Matthew Liberatore
Thomas Coghlan
Learning Objectives

To understand the application of regression
analysis in data mining




Linear/nonlinear
Logistic (Logit)
To understand the key statistical measures of
fit
To learn how to run and interpret regression
analyses using SAS Enterprise Miner
software
Analysis of Association
In business problems, interest often goes beyond the
statistical testing of differences (e.g., female versus
male preferences).
We are often interested in the degree of association between
variables.
Regression is one of the techniques that helps
uncover those relations.
Linear Regression Analysis


Analysis of the strength of the linear
relationship between predictor
(independent) variables and outcome
(dependent/criterion) variables.
In two dimensions (one predictor, one
outcome variable) data can be plotted on a
scatter diagram.
E(y) = b0 + b1 x
where E(y) is the expected value of y (the outcome), b0 is the
intercept term, b1 is the coefficient, and x is the predictor
variable.
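Estimation Process
Regression Model: $y = \beta_0 + \beta_1 x + e$
Regression Equation: $E(y) = \beta_0 + \beta_1 x$
Unknown Parameters: $\beta_0$, $\beta_1$
Sample Data: n pairs (x1, y1), ..., (xn, yn)
Estimated Regression Equation: $\hat{y} = b_0 + b_1 x$
Sample Statistics: $b_0$, $b_1$
The sample statistics $b_0$ and $b_1$ provide estimates of the
unknown parameters $\beta_0$ and $\beta_1$.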
Simple Linear Regression Equation:
Positive Linear Relationship
[Figure: regression line of E(y) (outcome) against x (predictor), with intercept b0 and positive slope b1]
Simple Linear Regression Equation:
Negative Linear Relationship
[Figure: regression line of E(y) (outcome) against x (predictor), with intercept b0 and negative slope b1]
Simple Linear Regression Equation:
No Relationship
[Figure: scatter of E(y) (outcome) against x (predictor) with no visible pattern; the regression line is horizontal, with intercept b0 and slope b1 equal to 0]
Simple Linear Regression Equation:
Parabolic Relationship
[Figure: scatter of E(y) (outcome) against x (predictor) following a parabolic curve, with intercept b0]
Example



List Variables we have
Determine a DV of interest
Is there a way to predict DV?
Least Squares Method

Least Squares Criterion: minimize error
(distance between actual data & estimated
line)
$\min \sum_i (y_i - \hat{y}_i)^2$
where:
$y_i$ = observed value of the dependent variable
for the ith observation
$\hat{y}_i$ = estimated value of the dependent variable
for the ith observation
Least Squares Method

Slope for the Estimated Regression
Equation
$b_1 = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$
Least Squares Method

y-Intercept for the Estimated Regression Equation
$b_0 = \bar{y} - b_1 \bar{x}$
where:
$x_i$ = value of independent variable for ith observation
$y_i$ = value of dependent variable for ith observation
$\bar{x}$ = mean value for independent variable
$\bar{y}$ = mean value for dependent variable
n = total number of observations
Least Squares Estimation Procedure

Least Squares Criterion:
The sum of the squared vertical deviations (along the y axis)
of the points from the line is minimal.
[Figure: actual data points and the predicted line]
Example: Kwatts vs. Temp
Temp    Kwatts
59.2     9,730
61.9     9,750
55.1    10,180
66.2    10,230
52.1    10,800
69.9    11,160
46.8    12,530
76.8    13,910
79.7    15,110
79.3    15,690
80.2    17,020
83.3    17,880
Is the Relationship Linear?
[Figure: KWatts vs. Temp, a scatter plot of KWatts (vertical axis, 0 to 20,000) against Temp (horizontal axis, 40 to 90)]
Example Results
Let X = Temp, Y = Kwatts
Y = 319.04 + 185.27 X
[Figure: KWatts vs. Temp, showing the KWatts data, the forecast line, and the average, with KWatts on the vertical axis (0 to 20,000) and Temp on the horizontal axis (40 to 90)]
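The least squares formulas for the slope and intercept can be checked against the Kwatts vs. Temp data. The following is a minimal sketch in plain Python, not part of the Excel or Enterprise Miner workflow; the function name `least_squares` is only illustrative.

```python
# Temp/Kwatts pairs from the example table, in the order listed.
temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]


def least_squares(x, y):
    """Return (b0, b1) for the estimated regression equation y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(temp, kwatts)
print(f"Y = {b0:.2f} + {b1:.2f} X")
```

The printed equation should agree with the Example Results slide up to rounding.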
Coefficient of Determination


How “strong” is relationship between predictor &
outcome? (Fraction of observed variance of
outcome variable explained by the predictor
variables).
Relationship Among SST, SSR, SSE
SST = SSR + SSE
$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination (r2)
r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares
Kwatts vs. Temp Example
             df   SS
Regression    1   58784708.31
Residual     10   38696916.69
Total        11   97481625
r2 = 0.603033734
Does the linear regression provide a good fit?
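As a check on the table above, the three sums of squares and r2 can be recomputed directly from the data. This is a minimal sketch in plain Python; the variable names are illustrative only.

```python
temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]

n = len(temp)
x_bar = sum(temp) / n
y_bar = sum(kwatts) / n

# Least squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, kwatts))
      / sum((x - x_bar) ** 2 for x in temp))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in temp]

sst = sum((y - y_bar) ** 2 for y in kwatts)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)              # due to regression
sse = sum((y - yh) ** 2 for y, yh in zip(kwatts, y_hat))  # due to error
r2 = ssr / sst

print(f"SST={sst:.2f} SSR={ssr:.2f} SSE={sse:.2f} r2={r2:.4f}")
```

An r2 of about 0.60 means the linear model explains roughly 60% of the variation in Kwatts, which leaves a sizable share unexplained.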
Assumptions About the Error
Term e
1. The error e is a random variable with mean of zero.
2. The variance of e, denoted by $\sigma^2$, is the same for
all values of the independent variable.
3. The values of e are independent.
4. The error e is a normally distributed random
variable.
Significance Test for Regression
Is the value of b1 zero?
Two tests are commonly used:
t Test
and
F Test
Both the t test and F test require an estimate of the
variance ($\sigma^2$) of the error (e).
As in most of our statistical work, we are working with
a sample, not the population, so we use
mean square error ($s^2$).
Testing for Significance

An Estimate of $\sigma^2$
$s^2 = \mathrm{MSE} = \mathrm{SSE}/(n - 2)$
where:
$\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - b_0 - b_1 x_i)^2$
Testing for Significance

An Estimate of $\sigma$
• To estimate $\sigma$ we take the square root of $s^2$.
• The resulting s is called the standard error of
the estimate.
$s = \sqrt{\mathrm{MSE}} = \sqrt{\mathrm{SSE}/(n - 2)}$
Testing for Significance:
t Test
Hypotheses: Coefficient (b1) is 0
(no relationship between predictor &
outcome)
Calculating t Statistic:
$t = b_1 / s_{b_1}$
Testing for Significance: t Test
1. Determine the hypotheses. $H_0: \beta_1 = 0$ (the population slope is zero)
2. Specify the level of significance. $\alpha = .05$
3. Select the test statistic. $t = b_1 / s_{b_1}$
4. State the rejection rule. Reject $H_0: \beta_1 = 0$ if
p-value < .05 or |t| > 3.182
(with 3 degrees of freedom)
Alternative Test: F Test
Same Hypotheses: $H_0: \beta_1 = 0$ (the population slope is zero)
Different Test Statistic: F = MSR/MSE
Testing for Significance: F Test

Reject if: p-value < $\alpha$ or F > $F_\alpha$
F = MSR/MSE
where:
$F_\alpha$ is based on an F distribution with
1 degree of freedom in the numerator and
n - 2 degrees of freedom in the denominator
Testing for Significance: F Test
1. Determine the hypotheses. $H_0: \beta_1 = 0$ (the population slope is zero)
2. Specify the level of significance. $\alpha = .05$
3. Select the test statistic. F = MSR/MSE
4. State the rejection rule. Reject $H_0: \beta_1 = 0$ if
p-value < .05 or F > 10.13
(with 1 d.f. in numerator and
3 d.f. in denominator)
Standard Error of the Estimate

Standard Error of Estimate has properties
analogous to those of standard deviation.
How “good” is our “fit”?

Interpretation is similar:



~68% of outcomes/predictions fall within one standard error of the estimate (s).
~95% of outcomes/predictions fall within two standard errors of the estimate.
Kwatts vs. Temp Example
ANOVA
             df   SS            MS            F       Significance F
Regression    1   58784708.31   58784708.31   15.19   0.002972726
Residual     10   38696916.69   3869691.669
Total        11   97481625

            Coefficients   Standard Error   t Stat        P-value
Intercept   319.0414124    3260.412811      0.097853073   0.923982528
Temp        185.2702073    47.53479059      3.897570706   0.002972726
Is the regression model statistically significant? Is the coefficient of Temp
significant?
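The t and F statistics in this output can be reproduced from the data. The sketch below is plain Python and uses the standard formula for the standard error of the slope, s divided by the square root of the sum of squared x deviations, which the slides do not spell out. It stops at the test statistics; p-values need a t or F distribution function and are left to Excel or Enterprise Miner.

```python
import math

temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]

n = len(temp)
x_bar = sum(temp) / n
y_bar = sum(kwatts) / n
sxx = sum((x - x_bar) ** 2 for x in temp)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, kwatts)) / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in temp]

sse = sum((y - yh) ** 2 for y, yh in zip(kwatts, y_hat))
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)

mse = sse / (n - 2)          # s^2, the estimate of the error variance
msr = ssr / 1                # one predictor, so 1 degree of freedom
s = math.sqrt(mse)           # standard error of the estimate
s_b1 = s / math.sqrt(sxx)    # standard error of the slope (standard formula)

t_stat = b1 / s_b1
f_stat = msr / mse
print(f"s_b1={s_b1:.4f} t={t_stat:.4f} F={f_stat:.2f}")
```

With a single predictor the two tests are equivalent: F equals the square of t, which is why the slope's P-value matches Significance F in the output.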
Cautions about
Interpreting Significance Tests


Statistical significance does not mean linear
relationship between x and y.
Relationship between x and y does not
mean a cause-and-effect relationship is
present between x and y.
SAS Enterprise Miner


These results can be obtained using Excel or using
a data mining package such as SAS Enterprise
Miner 5.3
Using SAS Enterprise Miner requires the following
steps:



Convert your data (usually in an Excel file) into a SAS data
file using SAS 9.1
Create a project in Enterprise Miner
Within the project:
  Create a data source using your SAS data file
  Create a diagram that includes a data node and a
  regression node and a multiplot node for graphs
  Run the model in the diagram and review the results
Creating a SAS data file from an Excel file: open SAS 9.1.
Select File then Import Data
This opens the import wizard. Since the source file is from
Excel, click Next. Then click Browse to find the
TempKWatts.xls file
Since the data are on sheet1$, click Next. Then enter SASUSER
as the Library and TEMPKILOWATTL as the Member. Then click
Next
Now click Finish to create your file
Open SAS Enterprise Miner 5.3. Enter the user name
and password provided
The Enterprise Window below opens. Select New Project
The Create New Project dialog box appears. Select the
General tab, then type the short name of the project, e.g.,
KWattTemp0. Keep the default path.
In the Startup code tab, enter:
libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects";
This code will be run each time you open the project
The Enterprise Miner application window opens
Right-click on Data Source, opening the wizard. Source is
SAS table, so click Next
Browse the SAS libraries to find the SAS table Tempkilowattl
found in the SASuser Library (previously created)
Click Next twice. Note that the Table properties shows that
we have two variables with 12 observations
The next step controls how Enterprise Miner organizes
metadata for the variables in your data. Select advanced,
then click next
(you can view/change the settings if you click Customize
before clicking Next)
Change Role of KWatts to target (outcome variable); change
Level of both KWatts and Temp to interval (continuous
values); then click Next (Other levels are possible, such as
binary). You can click on Explore if you wish to look at some
basic stats – we will do this later
Here Role relates to the role of the data set (raw, train,
validate, score); raw is fine for our analysis of data, so click
Finish
Tempkilowattl now appears under Data Sources in the top
left panel called the Project Panel
We need to create a Diagram for our model. Right-click on
Diagrams, then enter TempKwatts0 in the dialog box. Now the
left panel shows TempKwatts0 as a Diagram, and the right-hand
panel is called the Diagram Workspace. Icons can be
dragged and dropped onto the Diagram Workspace.
Now add an Input Data Node to the Diagram. From the Data
Sources list in the Project Panel drag and drop the Data
Source TempKwatts0 onto the Diagram Workspace. Note
that when input data node is highlighted, various properties
are displayed on the left-hand panel.
If you wish to see the properties of any or all of the
variables, highlight the input data node; then on the left
hand Properties Panel under Train, click on the box to the
right of Variables; in the screen that opens control-click on
KWatts and Temp; then click on Explore in the lower right
Frequency distributions for the variables and the raw data
are provided. Right-clicking on observations in the lower-left
panel will show where they appear in the bar charts. Cancel
when finished.
Click on the Explore tab found over the Diagram Workspace,
and then drag and drop the Multiplot icon onto the field.
Using your cursor, draw a directed arrow from the
TempKwattsl icon to the Multiplot icon. With the Multiplot
icon highlighted, its properties are found in the left-hand
Properties Panel.
Right-click on the Multiplot icon and select Run. After the
run is completed select Results from the Run Status
window.
Various charts are available as shown below. Descriptive
statistics for each variable are given in the lower pane.
Click on the Model tab and drag the Regression icon onto
the Model field. Connect the Tempkwattsl icon to the
Regression icon. Highlight the Regression icon and on the
Property Panel change Regression Type to linear
regression.
Run the Regression and select Results. Starting from the
upper left and going clockwise, these windows show the fit
between target and predicted in percentile terms, the various
fit statistics, model output (estimates, F and t stats,
R-square), and the two effects (intercept and slope – bars
represent size and color represents direction)
For a given percentile, the Target Mean is the actual (or
estimated value based on actuals), or what you are trying to
predict; the Mean for Predicted is the forecasted values, or
the predictions (or estimated values based on forecasts). The
results are shown from highest to lowest forecasted values.
The distances between the curves show how well the model
predicts the actual data.
A variety of fit statistics are provided. These include SSE,
MSE=SSE/(n-2), ASE=SSE/n, RMSE=SQRT(MSE),
RASE=SQRT(ASE), FPE = MSE (n+p+1)/n, and MAX = largest error in
terms of absolute value, where n = no. of observations and p = no. of
variables in model (one in our case).
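To make these definitions concrete, the sketch below computes each statistic for the Kwatts vs. Temp linear model in plain Python. It follows the formulas as stated on this slide, taking RASE as the square root of ASE. Enterprise Miner's own output may differ in rounding or in conventions not described here.

```python
import math

temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]

n = len(temp)   # number of observations
p = 1           # number of variables in the model

x_bar = sum(temp) / n
y_bar = sum(kwatts) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temp, kwatts))
      / sum((x - x_bar) ** 2 for x in temp))
b0 = y_bar - b1 * x_bar

# Errors (residuals) of the fitted line.
errors = [y - (b0 + b1 * x) for x, y in zip(temp, kwatts)]

sse = sum(e * e for e in errors)
mse = sse / (n - 2)
ase = sse / n
rmse = math.sqrt(mse)
rase = math.sqrt(ase)
fpe = mse * (n + p + 1) / n
max_abs_error = max(abs(e) for e in errors)

for name, value in [("SSE", sse), ("MSE", mse), ("ASE", ase), ("RMSE", rmse),
                    ("RASE", rase), ("FPE", fpe), ("MAX", max_abs_error)]:
    print(f"{name:5s} {value:16.3f}")
```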
Schwarz's Bayesian Criterion and Akaike's Information Criterion
are used for model selection (comparing one model to another).
Schwarz's adjusts the residual squared error for the number of
parameters estimated, while Akaike's is a relative measure of
information lost from fitting the model.
Kwatts vs. Temp Example 2


Another approach to modeling the
relationship between Kwatts and Temp is to
use a nonlinear regression
This is easily accomplished in Enterprise
Miner – highlight the regression node, then in
the left hand panel select yes for polynomial
terms


We use the default of two terms
Is the fit any better???
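One way to explore the question outside Enterprise Miner is to fit the linear and quadratic models by least squares and compare R-square. The sketch below is plain Python and assumes that "two terms" corresponds to adding a squared Temp term. The helper names `solve` and `fit` are illustrative, and Temp is centered on its mean only to keep the arithmetic well conditioned (this does not change the fitted values or R-square).

```python
temp = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160,
          12530, 13910, 15110, 15690, 17020, 17880]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def fit(columns, y):
    """Least squares fit of y on the given columns plus an intercept; returns (coef, r2)."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    y_hat = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    y_bar = sum(y) / len(y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return coef, 1 - sse / sst


x_bar = sum(temp) / len(temp)
centered = [t - x_bar for t in temp]
squared = [c * c for c in centered]

_, r2_linear = fit([centered], kwatts)
_, r2_quadratic = fit([centered, squared], kwatts)
print(f"linear r2 = {r2_linear:.4f}, quadratic r2 = {r2_quadratic:.4f}")
```

Adding a term can never lower R-square on the training data, so the size of the improvement, not its sign, is what matters when judging the polynomial model.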
Multiple Regression
Consider the following data relating family size and income to food expenditures:
family   food $   income $   family size
1         5.2      28        3
2         5.1      26        3
3         5.6      32        2
4         4.6      24        1
5        11.3      54        4
6         8.1      59        2
7         7.8      44        3
8         5.8      30        2
9         5.1      40        1
10       18        82        6
11        4.9      42        3
12       11.8      58        4
13        5.2      28        1
14        4.8      20        5
15        7.9      42        3
16        6.4      47        1
17       20       112        6
18       13.7      85        5
19        5.1      31        2
20        2.9      26        2
Multiple Regression




We can run this problem in Enterprise Miner using the same
approach followed with the previous example
On our model field we have placed the data source called
foodexpenditures, and also both Multiplot and StatExplore found
under the Explore tab above the model field
Highlight foodexpenditures, then in the left-hand panel under
Training, find variables and click on the box to the right to open
up the variables
Change the role of family to rejected (it is just the number of the
observation) and the role of food_ to target; set the level of
income_, food_, and fam_size to interval, then click OK
Foodexpenditures Model
Highlight the StatExplore node, right-click to Run, then
select Results. Correlations between the input variables
and the target are provided, along with basic statistics. The
input variables are ordered by the size of the correlations.
Now close out the results window and run the regression
node and obtain results
Starting from the upper left and going clockwise, these
windows show the fit between target and predicted in
percentile terms, the various fit statistics, model output
(estimates, F and t stats, R-square), and the three effects
(intercept and slopes for the two input variables; bars
represent size and color represents direction). The model
is significant and is a good fit with the data.
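The same least squares idea extends to two predictors by solving the normal equations. The sketch below is plain Python and is an illustration, not the Enterprise Miner procedure. It fits food expenditures on income and family size and reports R-square; the helper name `solve` is illustrative.

```python
food = [5.2, 5.1, 5.6, 4.6, 11.3, 8.1, 7.8, 5.8, 5.1, 18,
        4.9, 11.8, 5.2, 4.8, 7.9, 6.4, 20, 13.7, 5.1, 2.9]
income = [28, 26, 32, 24, 54, 59, 44, 30, 40, 82,
          42, 58, 28, 20, 42, 47, 112, 85, 31, 26]
fam_size = [3, 3, 2, 1, 4, 2, 3, 2, 1, 6,
            3, 4, 1, 5, 3, 1, 6, 5, 2, 2]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


# Design matrix rows: intercept, income, family size.
rows = [[1.0, inc, size] for inc, size in zip(income, fam_size)]
k = 3
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * y for r, y in zip(rows, food)) for i in range(k)]
b0, b_income, b_size = solve(xtx, xty)

y_hat = [b0 + b_income * inc + b_size * size for inc, size in zip(income, fam_size)]
residuals = [y - yh for y, yh in zip(food, y_hat)]
y_bar = sum(food) / len(food)
sse = sum(e * e for e in residuals)
sst = sum((y - y_bar) ** 2 for y in food)
r2 = 1 - sse / sst

print(f"food = {b0:.3f} + {b_income:.3f} income + {b_size:.3f} fam_size, r2 = {r2:.3f}")
```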
What happens in regression analysis when the
target variable is binary?

There are many situations when the target
variable is binary – some examples:




Whether a customer will or will not receive credit
Whether a customer will or will not respond to a
promotion
Whether a firm will go bankrupt in a year
Whether a student will pass an exam!!!
Passing an Exam Data
Student id   Outcome   Study Hours
1            0          3
2            1         34
3            0         17
4            0          6
5            0         12
6            1         15
7            1         26
8            1         29
9            0         14
10           1         58
11           0          2
12           1         31
13           1         26
14           0         11
Running a linear regression to predict pass/don’t pass as a
function of hours of study provides a model that doesn’t
correctly model the data. The data are given in
exampassing.xls
Passing an Exam
[Figure: actual and predicted pass or don't pass (vertical axis, 0 to 1.6) against hours of study (horizontal axis, 0 to 70)]
The Enterprise Miner results show a poor fit on a percentile
basis between predicted and target – another modeling
approach is needed.
Logistic Regression

Similar to linear regression, two main
differences

Y (outcome or response) is categorical




Yes/No
Approve/Reject
Responded/Did not respond
Result is expressed as a probability of being in
either group.
Comparing the Logistic &
Linear Regression Models
Logistic regression
p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)]
1-p =1/[1+exp(a+bx)]
ln [p/(1-p)] = a + bx
where:
exp or e is the exponential function (e=2.71828…)
ln is the natural logarithm (ln(e) = 1)
p is probability that the event y occurs given x, and can range
between 0 and 1
p/(1-p) is the odds of the event (called the "odds ratio" in these slides)
ln[p/(1-p)] is the log of the odds, or "logit"
all other components of the regression model are the same
Odds Ratio


Frequently used
Related to probability of an event as follows:
Odds Ratio = p/(1-p)

Example:




Probability of firm going bankrupt = .25
Odds firm will go bankrupt = .25/(1-.25) = 1/3, i.e., 1 to 3 (3 to 1 against)
This is how sports books calculate odds
(e.g., if the odds against VU winning a championship are 2:1, the
probability of winning is 1/3)
ln [p/(1-p)] = a + bx means that as x increases by 1, the
natural log of the odds ratio increases by b, or the odds
ratio increases by a factor of exp(b)
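The conversions between probability, odds, and the logit are easy to verify numerically. The sketch below is plain Python; the function names and the coefficient values `a` and `b` are hypothetical and used only to show that a one-unit increase in x multiplies p/(1-p) by exp(b).

```python
import math


def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))


def prob_from_odds(o):
    """Probability implied by odds in favor o."""
    return o / (1 + o)


def prob_from_odds_against(against, in_favor=1):
    """Probability implied by sports-book style odds, e.g. 2:1 against."""
    return in_favor / (against + in_favor)


def logistic(a, b, x):
    """p = exp(a + bx) / (1 + exp(a + bx))."""
    return 1 / (1 + math.exp(-(a + b * x)))


print(odds(0.25))                 # bankruptcy example: odds of 1/3
print(prob_from_odds_against(2))  # 2:1 against corresponds to probability 1/3

# Hypothetical coefficients, chosen only for illustration.
a, b = -1.0, 0.3
ratio = odds(logistic(a, b, 5)) / odds(logistic(a, b, 4))
print(ratio, math.exp(b))         # the two values should agree
```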
Probability, Odds Ratio, LN of Odds Ratio
[Figure: odds and ln(odds) plotted against probability, for probabilities from 0 to 0.95]
Running the exam data: Change regression type from linear
regression to logistic regression
Highlight the data node; on left-hand panel under Train open
variables and change the level of outcome to binary
Results show a much better fit (upper left) and only one
misclassification (lower right – a false negative).
The results show that the odds ratio = p/(1-p) =
exp(-8.4962 + 0.4949x). For every additional hour of study the
odds ratio increases by a factor of exp(0.4949) = 1.640
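The fitted equation can be applied to the exam data to see where the single misclassification comes from. The sketch below is plain Python and assumes the intercept is -8.4962 (negative); with a positive intercept every student would be predicted to pass, which would not match the reported results. The 0.5 cutoff for classifying a pass is also an assumption.

```python
import math

outcome = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]
hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]

# Coefficients reported in the results; the negative intercept is an assumption.
a, b = -8.4962, 0.4949


def p_pass(x):
    """Predicted probability of passing after x hours of study."""
    return 1 / (1 + math.exp(-(a + b * x)))


# Classify as a pass when the predicted probability is at least 0.5 (assumed cutoff).
predicted = [1 if p_pass(x) >= 0.5 else 0 for x in hours]

misclassified = [
    (student, actual, guess)
    for student, (actual, guess) in enumerate(zip(outcome, predicted), start=1)
    if actual != guess
]

for student, actual, guess in misclassified:
    print(f"student {student}: actual={actual}, predicted={guess}")
print(f"odds multiplier per extra hour: {math.exp(b):.3f}")
```

Under these assumptions the predicted probability crosses 0.5 at a little over 17 hours of study, so a student who passed with fewer hours shows up as a false negative.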
Understanding Response Rate and
Lift
To better understand the top left chart, change cumulative lift to
cumulative % response. The observations are ranked by the
predicted probability of response (highest to lowest) for each
observation (from the fitted model).
Understanding Response Rate and Lift




Since the first 6 passes were correctly classified, the cumulative
% response is 100% through the 40th percentile.
At the 50th percentile the next observation with the highest
predicted probability is a non-response, so the cumulative
response drops to 6/7 or 85.7%.
The 8th ranked observation, between the 55th and 60th percentile,
is a positive response, so the cumulative % response is 7/8
or 87.5%.
Since there are no more positive responses after the 60th
percentile, the cumulative response rate will drop to 50%.
The chart compares how well the cumulative ranked predictions
lead to a match between actual and predicted responses
Understanding Response Rate and
Lift


Lift calculates the ratio of the actual response rate (passing) of the
top n% of the ranked observations to the overall response rate.
Cumulative lift is likewise defined.
At the 50th percentile, the cumulative % response is 85.7% (6/7), the
cumulative base response is 50%, for a lift of 1.7142.
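The cumulative % response and cumulative lift figures can be traced by hand. The sketch below is plain Python and ranks the students by study hours, which gives the same ordering as ranking by predicted probability as long as the fitted slope is positive (as it is here). The percentile labels in Enterprise Miner's chart are binned, so only the counts are reproduced.

```python
outcome = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]
hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]

# Rank observations from highest to lowest predicted probability of passing.
ranked = sorted(zip(hours, outcome), key=lambda pair: pair[0], reverse=True)

base_rate = sum(outcome) / len(outcome)   # overall response rate

cum_response = []
passes_so_far = 0
for depth, (_, passed) in enumerate(ranked, start=1):
    passes_so_far += passed
    cum_response.append(passes_so_far / depth)

cum_lift = [rate / base_rate for rate in cum_response]

for depth, (rate, lift) in enumerate(zip(cum_response, cum_lift), start=1):
    pct = 100 * depth / len(ranked)
    print(f"top {depth:2d} ({pct:5.1f}%): cum response {rate:6.1%}, cum lift {lift:.4f}")
```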
On the Properties Panel, click on Exported Data to see the
predicted probabilities and response for each observation
and compare to the actual response.
Logistic regression uses maximum likelihood (and not sum
of squared errors) to estimate the model parameters. The
results below show that the model is highly significant
based on a chi-square test. The Wald chi-square statistic
tests whether an effect is significant or not.
Bankruptcy Prediction

To predict bankruptcy a year in advance, you
might collect:





working capital/total assets (WC/TA)
retained earnings/total assets (RE/TA)
earnings before interest and taxes/total assets
(EBIT/TA)
market value of equity/total debt (MVE/TD)
sales/total assets (S/TA)
Bankruptcy Training Data
Firm   WC/TA    RE/TA     EBIT/TA   MVE/TD   S/TA     BR/NB
1      0.0165    0.1192    0.2035   0.813    1.6702   1
2      0.1415    0.3868    0.0681   0.5755   1.0579   1
3      0.5804    0.3331    0.081    0.5755   1.0579   1
4      0.2304    0.296     0.1225   0.4102   3.0809   1
5      0.3684    0.3913    0.0524   0.1658   1.1533   1
6      0.1527    0.3344    0.0783   0.7736   1.5046   1
7      0.1126    0.3071    0.0839   1.3429   1.5736   1
8      0.0141    0.2366    0.0905   0.5863   1.4651   1
9      0.222     0.1797    0.1526   0.3459   1.7237   1
10     0.2776    0.2567    0.1642   0.2968   1.8904   1
11     0.2689    0.1729    0.0287   0.1224   0.9277   0
12     0.2039   -0.0476    0.1263   0.8965   1.0457   0
13     0.5056   -0.1951    0.2026   0.538    1.9514   0
14     0.1759    0.1343    0.0946   0.1955   1.9218   0
15     0.3579    0.1515    0.0812   0.1991   1.4582   0
16     0.2845    0.2038    0.0171   0.3357   1.3258   0
17     0.1209    0.2823   -0.0113   0.3157   2.3219   0
18     0.1254    0.1956    0.0079   0.2073   1.489    0
19     0.1777    0.0891    0.0695   0.1924   1.6871   0
20     0.2409    0.166     0.0746   0.2516   1.8524   0
Bankruptcy Example

Using the BankruptTrain.xls data create a
SAS data file called bankrupt



BR_NB: role is target and level is binary
Firm: role is rejected and level is nominal (it is
simply the firm number)
Remaining five financial ratio variables: role is input
and level is interval
Create a diagram named bankrupt1. Drag and drop the data
node onto the model. Highlight the data node and on the left
hand panel under variables click on the box to its right to
see the variables data
From the Explore tab drag and drop the StatExplore node
onto the diagram and link it to the bankrupt node. Highlight
the StatExplore node, right-click and run it, and obtain
results. On top, correlations between the five input variables
and the target are shown via bars ordered from largest to
smallest. Below, the mean variable score for bankrupt vs.
non-bankrupt observations is shown.
From the Model tab drag and drop the regression node onto
the diagram and connect it to the bankrupt node. Highlight
the regression node and run, and obtain the results
The results show that the model fits the data very well, with a
highly significant overall chi-square statistic, low error
values, and 0 misclassifications. Cumulative lift is 2 through
the top 50% of ranked observations: firms in that group are
twice as likely to be bankrupt as a firm drawn at random from
the full data set.
Scoring



Once you have specified a model you might
wish to apply it to new data whose outcome
is unknown -- make predictions
This can be easily accomplished in
Enterprise Miner using scoring
Convert the data set BankruptScore.xls to a
SAS file called bankruptscore. The role of this
data is score.
Bankruptcy Scoring Data
Firm   WC/TA    RE/TA    EBIT/TA   MVE/TD   S/TA
A      0.1759   0.1343    0.0956   0.1955   1.9218
B      0.3732   0.3483   -0.0013   0.3483   1.8223
C      0.1725   0.3238    0.104    0.8847   0.5576
D      0.163    0.3555    0.011    0.373    2.8307
E      0.1904   0.2011    0.1329   0.558    1.6623
F      0.1123   0.2288    0.01     0.1884   2.7186
G      0.0732   0.3526    0.0587   0.2349   1.7432
H      0.2653   0.2683    0.0235   0.5118   1.835
I      0.107    0.0787    0.0433   0.1083   1.2051
J      0.2921   0.239     0.9673   0.3402   0.9277
Drag and drop the bankruptscore data node to the
bankrupt1 diagram. From the Assess tab, drag and drop the
Score node into the diagram. Link the regression and
bankruptscore nodes together and connect them to the
Score node.
Run the Score node and obtain the Results. Of the 10 firms,
6 are predicted to become bankrupt.
For details about the individual predictions, highlight the
Score node and on the left-hand panel click on the square to
the right of Exported Data. Then in the box that appears click
on the row whose Port entry is Score. Then click on Explore.
The lower portion of the output is shown below. The
predictions are given, along with the probabilities of the firm
becoming bankrupt or not.
Regression Using Selection
Models


When there are a number of possible input
variables, procedures are available to sort
through them and include those that have a
certain level of statistical significance
SAS Enterprise Miner 5.3 offers three
selection methods:



Backward
Forward
Stepwise
Regression Using Selection
Models
Backward: training begins with all candidate effects
in the model and removes effects until the stay
significance level or the stop criterion is met.
Forward: training begins with no candidate effects in
the model and adds effects until the entry
significance level or the stop criterion is met.
Stepwise: training begins as in the forward model but
may remove effects already in the model. This
continues until the stay significance level or the stop
criterion is met.
Note that the default significance levels (p values)
are 0.05 and no stop criteria (such as
maximum number of steps in the regression) are set.

Regression Using Selection
Models – Bankruptcy Model
To select stepwise regression for the bankruptcy model,
highlight the regression node and in the properties panel
under Selection Model choose Stepwise. The default
significance level of 0.05 is used.
Regression Using Selection
Models – Bankruptcy Model

Interestingly, the Training Model only uses RE/TA as
a predictor



There are 3 misclassifications (.15 rate) in this set vs. 0 in
the original model
The results are very different: the original model with
all 5 input variables predicted bankruptcy for G, E,
C, and J, while the stepwise model predicted B, C,
D, F, G, H, and J would become bankrupt.
Changing the significance levels to 0.1 (to make it
easier for input variables to enter/leave the stepwise
model) produces the same results