Lecture 5 Slides

Chapter 6: Multiple Linear Regression
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
Topics
 Explanatory vs. predictive modeling with regression
 Example: prices of Toyota Corollas
 Fitting a predictive model
 Assessing predictive accuracy
 Selecting a subset of predictors
Explanatory Modeling
Goal: Explain relationship between predictors
(explanatory variables) and target
 Familiar use of regression in data analysis
 Model Goal: Fit the data well and understand the
contribution of explanatory variables to the model
 “goodness-of-fit”: R2, residual analysis, p-values
Predictive Modeling
Goal: predict target values in other data where we
have predictor values, but not target values
 Classic data mining context
 Model Goal: Optimize predictive accuracy
 Train model on training data
 Assess performance on validation (hold-out) data
 Explaining role of predictors is not primary purpose
(but useful)
Example: Prices of Toyota Corolla
ToyotaCorolla.xls
Goal: predict prices of used Toyota Corollas
based on their specification
Data: Prices of 1442 used Toyota Corollas,
with their specification information
Data Sample
(showing only the variables to be used in analysis)
Price  Age  KM     Fuel_Type  HP   Metallic  Automatic  cc    Doors  Quarterly_Tax  Weight
13500  23   46986  Diesel     90   1         0          2000  3      210            1165
13750  23   72937  Diesel     90   1         0          2000  3      210            1165
13950  24   41711  Diesel     90   1         0          2000  3      210            1165
14950  26   48000  Diesel     90   0         0          2000  3      210            1165
13750  30   38500  Diesel     90   0         0          2000  3      210            1170
12950  32   61000  Diesel     90   0         0          2000  3      210            1170
16900  27   94612  Diesel     90   1         0          2000  3      210            1245
18600  30   75889  Diesel     90   1         0          2000  3      210            1245
21500  27   19700  Petrol     192  0         0          1800  3      100            1185
12950  23   71138  Diesel     69   0         0          1900  3      185            1105
20950  25   31461  Petrol     192  0         0          1800  3      100            1185
Variables Used
Price in Euros
Age in months as of 8/04
KM (kilometers)
Fuel Type (diesel, petrol, CNG)
HP (horsepower)
Metallic color (1=yes, 0=no)
Automatic transmission (1=yes, 0=no)
CC (cylinder volume)
Doors
Quarterly_Tax (road tax)
Weight (in kg)
Preprocessing
Fuel type is categorical, so it must be transformed into binary (dummy) variables:
Diesel (1=yes, 0=no)
CNG (1=yes, 0=no)
None needed for “Petrol” (reference category)
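The slides use XLMiner for this step; as a rough Python/pandas equivalent (a sketch only, assuming the file and column names in ToyotaCorolla.xls):

```python
import pandas as pd

# Load the data and create one binary (dummy) column per fuel type.
car_df = pd.read_excel('ToyotaCorolla.xls')
car_df = pd.get_dummies(car_df, columns=['Fuel_Type'])
# This yields Fuel_Type_CNG, Fuel_Type_Diesel and Fuel_Type_Petrol; only two
# of the three are used as predictors, the omitted one being the reference
# category.
```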
Subset of the records selected for the training partition (limited # of variables shown)

Id  Model                                            Price  Age_08_04  Mfg_Month  Mfg_Year  KM     Fuel_Type_Diesel  Fuel_Type_Petrol
1   Corolla 2.0 D4D HATCHB TERRA 2/3-Doors           13500  23         10         2002      46986  1                 0
4   Corolla 2.0 D4D HATCHB TERRA 2/3-Doors           14950  26         7          2002      48000  1                 0
5   Corolla 2.0 D4D HATCHB SOL 2/3-Doors             13750  30         3          2002      38500  1                 0
6   Corolla 2.0 D4D HATCHB SOL 2/3-Doors             12950  32         1          2002      61000  1                 0
9   Corolla 1800 T SPORT VVT I 2/3-Doors             21500  27         6          2002      19700  0                 1
10  Corolla 1.9 D HATCHB TERRA 2/3-Doors             12950  23         10         2002      71138  1                 0
12  Corolla 1.8 16V VVTLI 3DR T SPORT BNS 2/3-Doors  19950  22         11         2002      43610  0                 1
17  Corolla 1.8 16V VVTLI 3DR T SPORT 2/3-Doors      22750  30         3          2002      34000  0                 1
60% training data / 40% validation data
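A scikit-learn sketch of the 60%/40% partition, reusing car_df from the preprocessing sketch (predictor names follow the regression output on the next slide; the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split

predictors = ['Age_08_04', 'KM', 'Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'HP',
              'Met_Color', 'Automatic', 'cc', 'Doors', 'Quarterly_Tax', 'Weight']
X = car_df[predictors]
y = car_df['Price']

# 60% training / 40% validation partition
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4,
                                                      random_state=1)
```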
The Fitted Regression Model
Input variables    Coefficient    Std. Error   p-value     SS
Constant term      -3608.418457   1458.620728  0.0137      97276410000
Age_08_04          -123.8319168   3.367589     0           8033339000
KM                 -0.017482      0.00175105   0           251574500
Fuel_Type_Diesel   210.9862518    474.9978333  0.6571036   6212673
Fuel_Type_Petrol   2522.066895    463.6594238  0.00000008  4594.9375
HP                 20.71352959    4.67398977   0.00001152  330138600
Met_Color          -50.48505402   97.85591125  0.60614568  596053.75
Automatic          178.1519013    212.0528565  0.40124047  19223190
cc                 0.01385481     0.09319961   0.88188446  1272449
Doors              20.02487946    51.0899086   0.69526076  39265060
Quarterly_Tax      16.7742424     2.09381151   0           160667200
Weight             15.41666317    1.40446579   0           214696000
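The table above is XLMiner output. A statsmodels sketch that produces the same kind of report (coefficients, standard errors, p-values) on the training partition; exact numbers depend on the random partition:

```python
import statsmodels.api as sm

# Fit the 11-predictor linear regression on the training data and print
# coefficients, standard errors and p-values.
X_train_c = sm.add_constant(X_train)   # adds the constant (intercept) term
ols_model = sm.OLS(y_train, X_train_c).fit()
print(ols_model.summary())
```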
Error reports
Training Data scoring - Summary Report
Total sum of squared errors   RMS Error     Average Error
1514553377                    1325.527246   -0.000426154

Validation Data scoring - Summary Report
Total sum of squared errors   RMS Error     Average Error
1021587500                    1334.079894   116.3728779
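A sketch of how these error measures are computed (total SSE, RMS error, average error), continuing from the partition above; `reg` is a scikit-learn fit of the same 11 predictors introduced here for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X_train, y_train)

def error_report(label, X, y):
    resid = y - reg.predict(X)                  # residual = actual - predicted
    print(label,
          'total SSE:', (resid ** 2).sum(),
          'RMS error:', np.sqrt((resid ** 2).mean()),
          'average error:', resid.mean())

error_report('Training:  ', X_train, y_train)
error_report('Validation:', X_valid, y_valid)
```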
Predicted Values
Predicted price computed using regression coefficients.
Residuals = difference between actual and predicted prices.

Predicted Value  Actual Value  Residual
15863.86944      13750         -2113.869439
16285.93045      13950         -2335.930454
16222.95248      16900         677.047525
16178.77221      18600         2421.227789
19276.03039      20950         1673.969611
19263.30349      19600         336.6965066
18630.46904      21500         2869.530964
18312.04498      22500         4187.955022
19126.94064      22000         2873.059357
16808.77828      16950         141.2217206
15885.80362      16950         1064.196384
15873.97887      16250         376.0211263
15601.22471      15750         148.7752903
15476.63164      15950         473.3683568
15544.83584      14950         -594.835836
15562.25552      14750         -812.2555172
15222.12869      16750         1527.871313
17782.33234      19000         1217.667664
Distribution of Residuals
Symmetric distribution
Some outliers
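A matplotlib sketch of the residual computation and histogram behind this slide, reusing `reg` and the validation partition from the earlier sketches:

```python
import matplotlib.pyplot as plt

residuals = y_valid - reg.predict(X_valid)      # actual minus predicted price
plt.hist(residuals, bins=25)
plt.xlabel('Residual')
plt.ylabel('Count')
plt.title('Distribution of validation residuals')
plt.show()
```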
Selecting Subsets of Predictors
Goal: Find parsimonious model (the simplest model
that performs sufficiently well)
 More robust
 Higher predictive accuracy
Exhaustive Search
Partial Search Algorithms
 Forward
 Backward
 Stepwise
Exhaustive Search
 All possible subsets of predictors assessed (single,
pairs, triplets, etc.)
 Computationally intensive
 Judge by “adjusted R2”
R²_adj = 1 − (1 − R²) · (n − 1) / (n − p − 1)

The factor (n − 1)/(n − p − 1) is a penalty for the number of predictors p (n = number of records).
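A sketch of exhaustive search in Python, ranking every subset of the 11 candidate predictors by adjusted R² on the training data (feasible here: 2^11 − 1 subsets). Variable names carry over from the earlier sketches:

```python
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    # 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p predictors and n records
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

results = []
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        cols = list(subset)
        m = LinearRegression().fit(X_train[cols], y_train)
        r2 = r2_score(y_train, m.predict(X_train[cols]))
        results.append((adjusted_r2(r2, len(X_train), k), subset))

print(max(results))      # subset with the highest adjusted R-squared
```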
Forward Selection
 Start with no predictors
 Add them one by one (add the one with largest
contribution)
 Stop when the addition is not statistically significant
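A sketch of forward selection, reusing adjusted_r2 and the training data from the exhaustive-search sketch. The stopping rule here is "no further gain in adjusted R²" rather than the statistical-significance rule on the slide; that substitution is an assumption of the sketch:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

selected, remaining = [], list(predictors)
best_so_far = -float('inf')
while remaining:
    # score every candidate addition by adjusted R2 on the training data
    scores = []
    for p in remaining:
        cols = selected + [p]
        m = LinearRegression().fit(X_train[cols], y_train)
        r2 = r2_score(y_train, m.predict(X_train[cols]))
        scores.append((adjusted_r2(r2, len(X_train), len(cols)), p))
    score, best = max(scores)
    if score <= best_so_far:        # no addition improves the model: stop
        break
    selected.append(best)
    remaining.remove(best)
    best_so_far = score
print(selected)
```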
Backward Elimination
 Start with all predictors
 Successively eliminate least useful predictors one by
one
 Stop when all remaining predictors have statistically
significant contribution
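A statsmodels sketch of backward elimination: start from all predictors and drop the one with the highest p-value until all remaining p-values fall below a chosen level (0.05 here, an arbitrary choice):

```python
import statsmodels.api as sm

cols = list(predictors)
while True:
    fit = sm.OLS(y_train, sm.add_constant(X_train[cols])).fit()
    p_values = fit.pvalues.drop('const')        # p-values of the predictors
    worst = p_values.idxmax()                   # least useful remaining predictor
    if p_values[worst] < 0.05:                  # everything left is significant
        break
    cols.remove(worst)
print(cols)
```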
Stepwise
 Like Forward Selection
 Except at each step, also consider dropping non-
significant predictors
Backward elimination (showing last 7 models)

Constant  Age_08_04
Constant  Age_08_04  Weight
Constant  Age_08_04  KM  Weight
Constant  Age_08_04  KM  Fuel_Type_Petrol  Weight
Constant  Age_08_04  KM  Fuel_Type_Petrol  Quarterly_Tax  Weight
Constant  Age_08_04  KM  Fuel_Type_Petrol  HP  Quarterly_Tax  Weight
Constant  Age_08_04  KM  Fuel_Type_Petrol  HP  Automatic  Quarterly_Tax  Weight

Top model has a single predictor (Age_08_04), the second model has two predictors, etc.
All 12 Models
Diagnostics for the 12 models
A good model has: high adjusted R², and Cp close to the number of predictors
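The slides report Cp but not its formula; the standard Mallows' Cp, shown below as a sketch, is Cp = SSE_p / MSE_full − n + 2(p + 1), and a good subset has Cp close to its number of coefficients:

```python
def mallows_cp(sse_subset, mse_full, n, p):
    # Cp = SSE_p / MSE_full - n + 2(p + 1), where p is the number of predictors
    # in the subset and MSE_full comes from the full (all-predictor) model.
    return sse_subset / mse_full - n + 2 * (p + 1)
```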
Next step
 Subset selection methods give candidate models that
might be “good models”
 Do not guarantee that “best” model is indeed best
 Also, “best” model can still have insufficient
predictive accuracy
 Must run the candidates and assess predictive
accuracy (click “choose subset”)
Model with only 6 predictors
The Regression Model
Input variables   Coefficient    Std. Error   p-value     SS
Constant term     -3874.492188   1415.003052  0.00640071  97276411904
Age_08_04         -123.4366303   3.33806777   0           8033339392
KM                -0.01749926    0.00173714   0           251574528
Fuel_Type_Petrol  2409.154297    319.5795288  0           5049567
HP                19.70204735    4.22180223   0.00000394  291336576
Quarterly_Tax     16.88731384    2.08484554   0           192390864
Weight            15.91809368    1.26474357   0           281026176

Training Data scoring - Summary Report (model fit)
Total sum of squared errors   RMS Error     Average Error
1516825972                    1326.521353   -0.000143957

Validation Data scoring - Summary Report (predictive performance; compare to the 12-predictor model!)
Total sum of squared errors   RMS Error     Average Error
1021510219                    1334.029433   118.4483556
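A sketch of this comparison in Python, refitting only the six retained predictors and scoring the validation partition (names and data carry over from the earlier sketches):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

subset6 = ['Age_08_04', 'KM', 'Fuel_Type_Petrol', 'HP', 'Quarterly_Tax', 'Weight']
reg6 = LinearRegression().fit(X_train[subset6], y_train)
resid6 = y_valid - reg6.predict(X_valid[subset6])
print('6-predictor validation RMS error:', np.sqrt((resid6 ** 2).mean()))
```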
Summary
Linear regression models are very popular tools, not only for explanatory modeling, but also for prediction
A good predictive model has high predictive accuracy (to a useful practical level)
Predictive models are built using a training data set, and evaluated on a separate validation data set
Removing redundant predictors is key to achieving predictive accuracy and robustness
Subset selection methods help find "good" candidate models. These should then be run and assessed.