Multiple Linear Regression - Television Station Prices (Model Building Diagnostics)

advertisement
Case Study – Regression Model Building/Diagnostics
Television Station Prices
Data: 31 stations sold during the late 1950s and early 1960s
Dependent Variable: PRICE (less replacement costs in $1000s)
Candidate Independent Variables:








Market Retail Sales ($1M) – RETAIL
Market Buying Income ($1M) – INCOME
TV Homes per Station in Market – HOMES
Network Hourly Advertising Rate ($) – NETRATE
National Spot Rate ($) – SPOTRATE
Age of Station (1=started after 1/1/1952, 0=before) – AGENEW
Number of Ties with Major Networks – NETWORKS
Percent of Market Population in Urban Areas – PCTURBAN
Summary Statistics:
De scri ptive Statistics
N
PRICE
RETAIL
INCOME
HOMES
NETRATE
SPOTRATE
AGENEW
NETW ORKS
PCTURBAN
Valid N (lis twis e)
31
31
31
31
31
31
31
31
31
31
Minimum
13.00
85.00
136.00
15.00
60.00
25.00
0
0
23.10
Maximum
Mean
12634. 00 1981.7742
8472.00 1873.5161
13733. 00 2473.9032
1553.00
206.1290
3500.00
684.0323
750.00
143.7742
1
.77
2
.90
88.30
52.9742
St d. Deviat ion
2530.73471
2037.01059
2729.92404
288.62025
670.02415
139.22014
.425
.831
17.52246
Source: H. Levin (1964). “Economic Effects of Broadcast Licensing”, Journal of
Political Economy, Vol.72, #4, pp152-162.
Pairwise Correlations:
Correlations
PRICE
PRICE
RETAIL
INCOME
HOMES
NETRATE
SPOTRATE
AGENEW
NETWORKS
PCTURBAN
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
Pearson Correlation
Sig. (2-tailed)
N
1
.
31
.681**
.000
31
.838**
.000
31
.475**
.007
31
.892**
.000
31
.903**
.000
31
-.390*
.030
31
-.167
.369
31
.412*
.021
31
RETAIL
INCOME
.681**
.838**
.000
.000
31
31
1
.933**
.
.000
31
31
.933**
1
.000
.
31
31
.399*
.454*
.026
.010
31
31
.719**
.877**
.000
.000
31
31
.675**
.835**
.000
.000
31
31
-.132
-.242
.480
.190
31
31
-.003
-.038
.988
.839
31
31
.405*
.441*
.024
.013
31
31
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
HOMES
NETRATE SPOTRATE
AGENEW NETWORKS
.475**
.892**
.903**
-.390*
-.167
.007
.000
.000
.030
.369
31
31
31
31
31
.399*
.719**
.675**
-.132
-.003
.026
.000
.000
.480
.988
31
31
31
31
31
.454*
.877**
.835**
-.242
-.038
.010
.000
.000
.190
.839
31
31
31
31
31
1
.471**
.449*
-.419*
.243
.
.008
.011
.019
.187
31
31
31
31
31
.471**
1
.979**
-.440*
-.032
.008
.
.000
.013
.865
31
31
31
31
31
.449*
.979**
1
-.475**
-.068
.011
.000
.
.007
.717
31
31
31
31
31
-.419*
-.440*
-.475**
1
.030
.019
.013
.007
.
.871
31
31
31
31
31
.243
-.032
-.068
.030
1
.187
.865
.717
.871
.
31
31
31
31
31
.305
.402*
.357*
-.511**
-.255
.095
.025
.049
.003
.167
31
31
31
31
31
PCTURBAN
.412*
.021
31
.405*
.024
31
.441*
.013
31
.305
.095
31
.402*
.025
31
.357*
.049
31
-.511**
.003
31
-.255
.167
31
1
.
31
Scatterplot Matrix:
PRICE
RETAIL
INCOME
HOMES
NETRATE
SPOTRATE
AGENEW
NETWORKS
PCTURBAN
Results of Stepwise Regression with SLE=0.2000 and SLS=0.2001:
Coefficientsa
Model
1
2
3
(Constant)
SPOTRATE
(Constant)
SPOTRATE
INCOME
(Constant)
SPOTRATE
INCOME
NETWORKS
Unstandardized
Coefficients
B
Std. Error
-378.467
287.848
16.416
1.450
-408.807
274.561
12.234
2.512
.255
.128
-83.318
345.897
11.992
2.463
.262
.125
-339.312
227.060
Standardized
Coefficients
Beta
.903
.673
.275
.660
.282
-.111
t
-1.315
11.324
-1.489
4.871
1.993
-.241
4.868
2.086
-1.494
Sig.
.199
.000
.148
.000
.056
.811
.000
.047
.147
a. Dependent Variable: PRICE
Fitted Model:
Price = -83.32 + 12.0*SPOTRATE + 0.26*INCOME – 339.3*NETWORKS
Observed Prices, Predicted Prices, and Residuals (Observed-Predicted):
Y
Y-hat
e=Y-Yhat
3620.00
257.00
900.00
2246.00
449.00
145.00
13.00
630.00
1854.00
898.00
765.00
5156.00
3375.00
2681.00
1258.00
233.00
38.00
547.00
70.00
264.00
211.00
330.00
1968.00
1785.00
1705.00
4818.00
513.00
4374.00
12634.00
3485.00
3270.90
995.46
1971.65
1746.45
70.86
692.15
70.11
1032.85
3851.61
1745.44
1406.31
3312.51
3444.30
3829.83
1170.67
596.63
61.74
-184.48
11.98
312.04
894.54
669.09
2706.16
1287.93
2354.04
4073.59
1052.97
3050.49
12504.64
2341.76
349.10
-738.46
-1071.65
499.55
378.14
-547.15
-57.11
-402.85
-1997.61
-847.44
-641.31
1843.49
-69.30
-1148.83
87.33
-363.63
-23.74
731.48
58.02
-48.04
-683.54
-339.09
-738.16
497.07
-649.04
744.41
-539.97
1323.51
129.36
1143.24
4213.00
1090.79
3122.21
All Possible Regressions: Cp (First Page of output based on all 255 models)
C(p) Selection Method
Number in
Model
3
2
3
4
4
5
4
3
3
3
3
4
4
4
5
4
4
3
1
2
5
5
6
3
5
4
5
4
5
2
2
5
5
4
6
4
3
6
4
2
4
3
C(p) R-Square Variables in Model
3.0549 0.8508 income spotrate networks
(3.0549 < 3+1 =4)  Best model based on Cp
3.2099 0.8385 income spotrate
3.4697 0.8485 retail income spotrate
3.5371 0.8595 income homes spotrate networks
3.5540 0.8594 retail income spotrate networks
4.1051 0.8677 retail income homes spotrate networks
4.3810 0.8547 retail income netrate spotrate
4.6526 0.8417 income spotrate pcturban
4.6860 0.8415 income homes spotrate
4.7706 0.8410 income netrate spotrate
4.8850 0.8403 homes spotrate networks
4.8906 0.8518 income netrate spotrate networks
4.9200 0.8516 income spotrate networks pcturban
4.9498 0.8514 retail income homes spotrate
4.9509 0.8629 retail income netrate spotrate networks
4.9837 0.8512 retail income spotrate pcturban
5.0501 0.8509 income spotrate agenew networks
5.2075 0.8385 income spotrate agenew
5.2127 0.8156 spotrate
5.2557 0.8268 spotrate networks
5.3156 0.8608 income homes netrate spotrate networks
5.3910 0.8604 income homes spotrate agenew networks
5.4074 0.8717 retail income homes netrate spotrate networks
5.4221 0.8373 retail spotrate networks
5.4374 0.8601 retail income spotrate networks pcturban
5.4674 0.8485 retail income spotrate agenew
5.5313 0.8595 income homes spotrate networks pcturban
5.5428 0.8480 retail homes spotrate networks
5.5494 0.8594 retail income spotrate agenew networks
5.5929 0.8248 spotrate pcturban
5.5996 0.8248 retail spotrate
5.6263 0.8590 retail income netrate spotrate pcturban
5.6786 0.8587 retail income homes netrate spotrate
5.7739 0.8467 homes spotrate agenew networks
5.9655 0.8685 retail income homes spotrate agenew networks
6.0261 0.8453 income netrate spotrate pcturban
6.0712 0.8336 spotrate agenew pcturban
6.1019 0.8677 retail income homes spotrate networks pcturban
6.1350 0.8446 income homes netrate spotrate
6.1408 0.8217 homes spotrate
6.2585 0.8439 income homes spotrate pcturban
6.3410 0.8320 spotrate networks pcturban
Model Assumptions:
1) Normality of Errors: Histogram of Residuals
12
10
8
6
4
2
Std. Dev = 977.45
Mean = 0.0
N = 31.00
0
-2000.0
-1000.0
-1500.0
0.0
-500.0
1000.0
500.0
2000.0
1500.0
3000.0
2500.0
Unstandardized Residual
Comment: Appears to be one outlier…a few too many in first negative cell below 0.
2) Constant Error Variance: Plot of Residuals versus Predicted Values
4000
3000
2000
1000
0
-1000
-2000
-3000
-2000
0
2000
4000
6000
8000
10000
12000
14000
Unstandardized Predicted V alue
Comment: With exception of one very extreme observation, residuals appear to be
increasing (in absolute value) with the mean.
3) Plots of Residuals versus Independent Variables:
4000
3000
2000
Unstandardized Residual
1000
0
-1000
-2000
-3000
0
2000
4000
6000
8000
10000
12000
14000
INCOME
4000
3000
2000
Unstandardized Residual
1000
0
-1000
-2000
-3000
0
200
400
600
800
SPOTRA TE
4000
3000
2000
Unstandardized Residual
1000
0
-1000
-2000
-3000
-.5
0.0
NETWORKS
.5
1.0
1.5
2.0
2.5
Comment: No evidence of nonlinear relationships.
Observation (Station) Specific Diagnostic Measures:
1) Studentized Residuals (Residual divided by its estimated standard
error):
ri 
ei
^
MSE (1  vii )
ei  Yi  Y i
vii  i th diagonal element of X ( X ' X ) 1 X matrix
Values above t(.05/2n) , n-k-1 (in absolute value) are considered to be outliers (some
practicioners use 3 as a critical value.
2) Leverage Values – Measures how far an observation is from the others with
respect to the independent variables. Large values have the potential to highly
effect the least squares estimates
vii  i th diagonal element of X ( X ' X ) 1 X matrix
Values above 2(k+1)/n are considered to be potentially highly influential (where k=# of predictor
variables)
3) DFFITS – Measures how much an observation has influenced its own fitted
value.
^
DFFITS i 
^
Y i  Y i (i )
s (i ) vii
^
wh ere Y i ( i ) is the fitted value for the i th observatio n when it was not included in
the model and s (i ) is the square root of MSE when the observatio n was not included in model.
DFFITS represents the number of standard errors that the fitted value for a data point has
shifted when it was not used in the regression fit. Values above 2 (k  1) / n in absolute value are
considered highly influential.
4) DFBETAS – Measure of how much an observation has effected the
estimate of a regression coefficient (there is one DFBETA for each
regression coefficient, including the intercept). Values larger than
2
in absolute value are considered highly influential.
n
^
DFBETAS j (i ) 
^
 j   j (i )
c jj  ( j  1) st diagonal element of ( X ' X ) 1
s(i ) c jj
Note that DFBETASj(i) measures the number of standard errors that the regression
coefficient for the jth predictor variable has shifted when the ith case was not used in the
regression fit.
5) Cook’s D – Measure of aggregate impact of each observation on the
group of regression coefficients, as well as the group of fitted values.
Values larger than F.50,k 1,n  k 1, are considered highly influential.
'
^
Di 
^
^
^
(  (i )   )' ( X ' X )(  (i )   )
(k  1) s 2

^
^
^
^

 Y (i )  Y   Y (i )  Y 
2
ri  vii 



  
2
(k  1)  1  vii 
(k  1) s
Note that Di is like an F-statistic used for testing K '   m where K’=I. It can also
be thought of as the shift in a 100(1-)100% confidence ellipse when Di  Fa , p ',n  p '
when the ith case is not used to fit the regression model.
6) COVRATIO – Measure of the impact of each observation on the
variances (and standard errors) of the regression coefficients and
their covariances. Values outside the interval 1 
considered highly influential.
COVRATIO i 
det( s(2i ) ( X (' i ) X (i ) ) 1 )
det( s 2 ( X ' X ) 1 )
3( k  1)
are
n
7) Variance Inflation Factor (VIF) – Measure of how highly correlated each
independent variable is with the other predictors in the model. Values
larger than 10 for a predictor imply large inflation of standard errors of
regression coefficients due to this variable being in model.
VIFk 
1
where R 2j is the coefficient of multiple determination when Xj is
1  R 2j
regressed on the k-1 remaining independent variables.
Obtaining Influence Statistics and Studentized Residuals in SPSS
A. Choose ANALYZE, REGRESSION, LINEAR, and input the Dependent variable
and set of Independent variables from your model of interest (possibly having been
chosen via an automated model selection method).
B. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics and
All Cases and CONTINUE
C. Under PLOTS, select Y:*SRESID and X:*ZPRED. Also choose HISTOGRAM.
These give a plot of studentized residuals versus standardized predicted values, and a
histogram of standardized residuals (residual/sqrt(MSE)). Select CONTINUE.
D. Under SAVE, select Studentized Residuals, Cook’s, Leverage Values, Covariance
Ratio, Standardized DFBETAS, Standardized DFFITS. Select CONTINUE. The
results will be added to your original data worksheet.
Critical Values for Television station data (n=31, k=3):
Studentized Residuals: t.05/(2(31)) , 31-4 = t.0008,27 = 4.04
Leverge Values: 2(3+1)/31 = 0.258
DFFITS: 2 (3  1) / 31 = 0.718
DFBETAS: 2 / 31 = 0.359
Cook’s D: F.50,3+1,31-3-1 = F.50,3,27 = 0.861
COVRATIO: 1 
3(3  1)
31
 1  0.387  outside (0.613 , 1.387)
SPSS Output:
.
stresid
Cooks D
Leverage COVRAT DFFITS
DFB0
DFB_INC DFB_SPT DFB_NET
1.5660
0.3978
0.0150
0.2424
0.2409
0.0272
0.2259
-0.1829
0.0028
-0.7528
0.0146
0.0612
1.1785
-0.2397
0.0176
-0.0296
0.0424
-0.1852
-1.0707
0.0171
0.0241
1.0361
-0.2624
-0.0764
0.1669
-0.1572
-0.0313
0.5049
0.0054
0.0456
1.2142
0.1447
0.1254
-0.0338
0.0155
-0.1030
0.3775
0.0021
0.0225
1.2046
0.0894
0.0661
-0.0048
-0.0271
0.0044
-0.5570
0.0078
0.0587
1.2216
-0.1739
-0.1710
0.0146
0.0312
0.1190
-0.0570
0.0001
0.0231
1.2305
-0.0135
-0.0098
0.0029
0.0020
-0.0008
-0.4115
0.0046
0.0650
1.2563
-0.1330
-0.1156
0.0563
-0.0257
0.0851
-0.9992
0.4397
0.3699
-0.6119
-0.6472
-2.1175
0.2162
0.1294
0.6709
-0.8695
0.0222
0.0729
1.1600
-0.2966
0.0475
-0.1111
0.0903
-0.2164
-0.6485
0.0090
0.0465
1.1855
-0.1875
-0.1765
-0.0026
0.0331
0.1362
1.0263
-0.7552
0.7937
-0.4190
1.9892
0.2334
0.1587
0.7626
0.3390
-0.0701
0.0001
0.0460
1.2607
-0.0200
-0.0120
-0.0020
-0.0013
0.0139
0.2755
-0.9075
0.6318
-0.7627
-0.4437
-1.3401
0.1996
1.2756
0.3200
0.0889
0.0002
0.0594
1.2787
0.0277
-0.0034
-0.0030
0.0017
0.0220
-0.3710
0.0036
0.0630
1.2593
-0.1184
-0.1141
0.0268
0.0065
0.0786
-0.0237
0.0000
0.0235
1.2315
-0.0057
-0.0041
0.0013
0.0008
-0.0003
0.7539
0.0181
0.0809
1.2043
0.2670
0.0291
0.0452
-0.0988
0.1817
0.0580
0.0001
0.0240
1.2316
0.0139
0.0102
-0.0023
-0.0028
0.0007
-0.0492
0.0001
0.0685
1.2929
-0.0162
-0.0159
0.0026
0.0024
0.0106
-0.6945
0.0115
0.0551
1.1856
-0.2128
-0.2051
0.0322
0.0194
0.1471
-0.3365
0.0013
0.0110
1.1952
-0.0703
-0.0480
0.0126
0.0073
-0.0054
0.2840
-0.5544
0.4802
-0.8664
0.0868
1.5194
-0.5864
-0.0939
-0.0011
0.4915
0.0023
0.0042
1.1643
0.0942
0.0577
0.0116
-0.0249
0.0082
-0.6538
0.0083
0.0395
1.1754
-0.1799
-0.1456
0.0066
-0.0035
0.1329
0.7744
0.0223
0.0974
1.2213
0.2966
0.1255
0.1828
-0.1105
-0.1641
-0.5564
0.0099
0.0807
1.2519
-0.1960
0.0097
-0.0818
0.0857
-0.1352
1.3215
0.0255
0.0229
0.9420
0.3240
0.0283
-0.1430
0.2015
0.0443
0.6808
4.0201
0.2344
0.0341
0.3630
-0.1129
0.0717
0.1253
-0.0647
1.1744
0.0414
0.0750
1.0562
0.4101
-0.1238
-0.1172
0.1601
0.3126
0.1962
1.2489
0.9852
3.1787
0.2534
0.0589
-0.1226
0.0259
-0.0853
Station
WNHC
WHTN
WHAM
WTVT
WSVA
KOSA
WDAM
KOB
WJZ
KOVR
WKJG
WBRC
WDAF
KMOX
KOCO
KNAC
WJDM
WTVC
KCTV
WAGM
WLBZ
WWTV
WTRF
WKTV
KFRE
WPRO
WLOS
WSOC
WCAU
WVUE
WITI
Highlighted measures are considered influential. Six stations appear to be quite different from
the rest.
Coefficientsa
Model
1
(Constant)
INCOME
SPOTRATE
NETWORKS
Unstandardized
Coefficients
B
Std. Error
-83.318
345.897
.262
.125
11.992
2.463
-339.312
227.060
a. Dependent Variable: PRICE
Standardized
Coefficients
Beta
.282
.660
-.111
t
-.241
2.086
4.868
-1.494
Sig.
.811
.047
.000
.147
Collinearity Statistics
Tolerance
VIF
.302
.301
.994
3.313
3.324
1.006
All VIF’s are well below 10, Multicollinearity is not a problem in this model.
Download