OPRE504
Chapter Study Guide
Chapter 15
Multiple Regressions
The Multiple Regression Model:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
where $b_0$ is the intercept and each $b_i$ ($i = 1, \dots, k$) is the estimated coefficient (slope) of its corresponding predictor $x_i$.
Residual: $e = y - \hat{y}$
Degrees of Freedom: $df = n - k - 1$ ($n$ is the number of observations, $k$ is the number of predictors)
Standard Deviation of Residuals: $s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - k - 1}}$
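The residual quantities above can be sketched in a few lines of Python. This is a minimal illustration that assumes the fitted values are already available; the helper name `residual_std` and the toy numbers are hypothetical and not taken from the handout.

```python
import math

def residual_std(y, y_hat, k):
    """Return (residuals, df, s_e) for a regression with k predictors."""
    n = len(y)
    residuals = [yi - fi for yi, fi in zip(y, y_hat)]  # e = y - y_hat
    df = n - k - 1                                      # degrees of freedom
    sse = sum(e ** 2 for e in residuals)                # sum of squared residuals
    return residuals, df, math.sqrt(sse / df)

# Toy data (hypothetical): four observations, one predictor.
y = [3.0, 5.0, 7.0, 10.0]
y_hat = [2.5, 5.5, 7.5, 9.5]
residuals, df, s_e = residual_std(y, y_hat, k=1)
print(residuals, df, round(s_e, 4))
```

Applied to the Wal-Mart regression later in this handout (SSE = 189.474058, n = 39, k = 3), the same formula gives the reported Root MSE of about 2.3267.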
Interpretation of b1 (b2 or bk):
When the values of all other predictors are held constant, a one-unit change in x1 is associated with
a change of b1 units in the predicted y.
I. Assumptions and Conditions:
1. Check Linearity Conditions:
• Prior model check: scatterplots of y against each of the predictors are reasonably straight (no bend observed).
• Post model check: a scatterplot of residuals against the predicted values should show no obvious pattern, such as a bend.
• Violations of Linearity Conditions:
2. Check Independence Conditions:
Chaodong Han
OPRE504 Data Analysis and Decisions
ClassHandout
Page 1 of 9
• Error terms associated with individual observations should be independent of each other. Rule of thumb: random samples ensure independence. Without randomization, the generalization of regression models is limited to the data under analysis.
• Check 1: a scatterplot of residuals against predicted values should show no trends or clumping.
• Check 2: individual plots of residuals against each predictor should show no trends or clumping; pay special attention to time series data for serial correlation.
• Violations of independence assumptions:
3. Check Equal Variance Assumption (Homoscedasticity):
• Variability of error terms should be the same (constant) for all values of each predictor.
• Check 1: a scatterplot of residuals against the predicted values should show consistent spread.
• Check 2: boxplots of y against each predictor x should show consistent spread.
• Homoscedasticity vs. Heteroscedasticity:
4. Check Normality Assumption:
Error terms around the regression model at any specific values of the x-predictors should
follow a Normal or nearly Normal distribution. Check the normality of the residuals and of the
individual variables, and identify outliers, using a normal probability plot (e.g., in DDXL).
Visit the Following Links for More Details:
Testing The Assumptions of Linear Regression
http://www.duke.edu/~rnau/testing.htm
Testing Assumptions of Linear Regression Using SPSS
http://www.utexas.edu/courses/schwab/sw388r6_fall_2006/SolvingProblems/Homework%20Problems%20-Simple%20Linear%20Regression%20-%20Testing%20%20Assumptions.ppt
Regression Diagnostics Using SPSS
http://www.ats.ucla.edu/stat/spss/webbooks/reg/chapter2/spssreg2.htm
Testing Linear Assumptions Using SAS
http://www2.sas.com/proceedings/sugi22/STATS/PAPER267.PDF
Check Linear Regression Assumptions for Ph.D. Studies
http://courses.unt.edu/yeatts/6200-Multivariate%20Stats/Lectures-Tests/Test%202/Week-11diagnostics-solutions.pdf
Assumptions of Linear Regressions:
http://www.statisticssolutions.com/methods-chapter/statistical-tests/assumptions-of-linearregression/
II. Hypothesis Tests and Interpretations in Multiple Regressions
Given the underlying population model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
Model Hypotheses:
H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$ (the model predicts no better than using the grand mean)
Ha: at least one $\beta_i$ is not 0.
F-test for the Model:
The F-test has numerator degrees of freedom k and denominator degrees of freedom n-k-1 (k =
number of predictors; the 1 accounts for the intercept).
$F(k, n-k-1) = \frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE}$
Since $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$, dividing the numerator and denominator by SST gives:
$F(k, n-k-1) = \frac{SSR/k}{SSE/(n-k-1)} = \frac{SSR/(SST \cdot k)}{SSE/(SST \cdot (n-k-1))} = \frac{\frac{SSR}{SST} \cdot \frac{1}{k}}{\frac{SSE}{SST} \cdot \frac{1}{n-k-1}} = \frac{R^2 / k}{(1-R^2)/(n-k-1)}$
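As a quick numerical check of this derivation, the two forms of the F statistic can be compared in Python. The sketch below uses the sums of squares from the Wal-Mart regression output later in this handout; the function names `f_from_sums` and `f_from_r2` are hypothetical.

```python
def f_from_sums(ssr, sse, n, k):
    """F statistic from sums of squares: (SSR/k) / (SSE/(n-k-1))."""
    return (ssr / k) / (sse / (n - k - 1))

def f_from_r2(r2, n, k):
    """Equivalent F statistic written in terms of R-squared."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Sums of squares from the regression output in part g) of the example.
ssr, sse, n, k = 378.748744, 189.474058, 39, 3
r2 = ssr / (ssr + sse)          # SST = SSR + SSE

f_sums = f_from_sums(ssr, sse, n, k)
f_r2 = f_from_r2(r2, n, k)
print(round(r2, 4), round(f_sums, 2), round(f_r2, 2))
```

Both forms should agree with each other and with the reported F(3, 35) of 23.32.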
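T-tests for Individual Coefficients at Desired Alpha Levels:
$t = \frac{b_i}{SE(b_i)}$, with $n-k-1$ degrees of freedom, compared against the critical value $t^*_{n-k-1}$ for the chosen alpha level.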
If a predictor is not significant, it does not necessarily mean that the predictor has no linear
relationship to the dependent variable y; rather, it means that this particular predictor does not
contribute significantly to the explanation of y after controlling for all other predictors.
Confidence Intervals for Each Slope (Coefficient):
$b_i \pm t^*_{n-k-1} \times SE(b_i)$, obtained using statistical software.
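The t-ratio and confidence interval can be reproduced by hand once a critical value is known. The sketch below is an illustration only: the critical value 2.0301 is an assumption (the two-sided 95% value for 35 degrees of freedom read from a t-table), the function names are hypothetical, and the coefficient and standard error are the CPI values from the regression output later in this handout.

```python
def t_ratio(b, se):
    """t statistic for testing H0: beta_i = 0."""
    return b / se

def conf_interval(b, se, t_crit):
    """Confidence interval b +/- t* x SE(b)."""
    margin = t_crit * se
    return b - margin, b + margin

T_CRIT_95_DF35 = 2.0301          # assumed t-table value, 35 df, two-sided 95%
b_cpi, se_cpi = -0.3447946, 0.120335

t_cpi = t_ratio(b_cpi, se_cpi)
lower, upper = conf_interval(b_cpi, se_cpi, T_CRIT_95_DF35)
significant = abs(t_cpi) > T_CRIT_95_DF35
print(round(t_cpi, 2), round(lower, 4), round(upper, 4), significant)
```

The result should be close to the reported t of -2.87 and the reported interval from -.5890876 to -.1005017; small differences come from rounding in the printed inputs.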
$R^2$ and Adjusted $R^2$:
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1} = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$
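A short Python check shows that the two expressions for adjusted R-squared give the same number. This is a sketch with hypothetical function names, using the SSE and SST values from the Wal-Mart regression output later in this handout.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared from R-squared: 1 - (1 - R2)(n-1)/(n-k-1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def adjusted_r2_from_ss(sse, sst, n, k):
    """Adjusted R-squared from sums of squares: 1 - [SSE/(n-k-1)] / [SST/(n-1)]."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

sse, sst, n, k = 189.474058, 568.222802, 39, 3
r2 = 1 - sse / sst

adj_a = adjusted_r2(r2, n, k)
adj_b = adjusted_r2_from_ss(sse, sst, n, k)
print(round(r2, 4), round(adj_a, 4), round(adj_b, 4))
```

Both values should match the reported Adj R-squared of 0.6380 up to rounding.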
Multicollinearity Issue:
When two or more independent variables are highly correlated, multicollinearity is a serious
concern. It can be assessed with the Variance Inflation Factor (VIF) test:
$VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ obtained by regressing the jth predictor (as the dependent variable) on all other predictors (as independent variables).
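The VIF formula is simple enough to sketch directly. The helper name `vif` is hypothetical. The illustration assumes only two predictors, in which case $R_j^2$ equals the squared pairwise correlation; with additional predictors $R_j^2$ can only be larger, so the value printed here is a lower bound rather than the VIF of the full model. The correlation 0.98 is the CPI and Personal Consumption correlation reported in the example later in this handout.

```python
def vif(r2_j):
    """Variance inflation factor for predictor j, given R_j^2 from the
    regression of predictor j on all other predictors."""
    if not 0 <= r2_j < 1:
        raise ValueError("R_j^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)

# Two-predictor illustration: R_j^2 is the squared pairwise correlation.
r_pairwise = 0.98
vif_lower_bound = vif(r_pairwise ** 2)
print(round(vif_lower_bound, 2))

# An uncorrelated predictor has VIF = 1 (no inflation).
print(vif(0.0))
```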
[Chapter 15, Exercise 22, 25 and 26, Sharpe 2011, pp.510-511] Here is a dataset containing
monthly revenue of Wal-Mart Corp., relating that revenue to the Total U.S. Retail Sales, the
Personal Consumption Index, and the Consumer Price Index.
Date         Wal-Mart Revenue   CPI       Personal Consumption   Retail Sales Index   December
11/28/2003   14.764             552.7     7868495                301337               0
12/30/2003   23.106             552.1     7885264                357704               1
1/30/2004    12.131             554.9     7977730                281463               0
2/27/2004    13.628             557.9     8005878                282445               0
3/31/2004    16.722             561.5     8070480                319107               0
4/29/2004    13.98              563.2     8086579                315278               0
5/28/2004    14.388             566.4     8196516                328499               0
6/30/2004    18.111             568.2     8161271                321151               0
7/27/2004    13.764             567.5     8235349                328025               0
8/27/2004    14.296             567.6     8246121                326280               0
9/30/2004    17.169             568.7     8313670                313444               0
10/29/2004   13.915             571.9     8371605                319639               0
11/29/2004   15.739             572.2     8410820                324067               0
12/31/2004   26.177             570.1     8462026                386918               1
1/21/2005    13.17              571.2     8469443                293027               0
2/24/2005    15.139             574.5     8520687                294892               0
3/30/2005    18.683             579       8568959                338969               0
4/29/2005    14.829             582.9     8654352                335626               0
5/25/2005    15.697             582.4     8644646                345400               0
6/28/2005    20.23              582.6     8724753                351068               0
7/28/2005    15.26              585.2     8833907                351887               0
8/26/2005    15.709             588.2     8825450                355897               0
9/30/2005    18.618             595.4     8882536                333652               0
10/31/2005   15.397             596.7     8911627                336662               0
11/28/2005   17.384             592       8916377                344441               0
12/30/2005   27.92              589.4     8955472                406510               1
1/27/2006    14.555             593.9     9034368                322222               0
2/23/2006    18.684             595.2     9079246                318184               0
3/31/2006    16.639             598.6     9123848                366989               0
4/28/2006    20.17              603.5     9175181                357334               0
5/25/2006    16.901             606.5     9238576                380085               0
6/30/2006    21.47              607.8     9270505                373279               0
7/28/2006    16.542             609.6     9338876                368611               0
8/29/2006    16.98              610.9     9352650                382600               0
9/28/2006    20.091             607.9     9348494                352686               0
10/20/2006   16.583             604.6     9376027                354740               0
11/24/2006   18.761             603.6     9410758                363468               0
12/29/2006   28.795             604.5     9478531                424946               1
1/26/2007    20.473             606.348   9540335                332797               0
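The December column appears to be an indicator (dummy) variable that equals 1 for December observations and 0 otherwise. A minimal sketch of how such a column could be derived from the Date column follows; the function name `december_dummy` is hypothetical, and the m/d/yyyy date format is assumed from the values shown in the dataset.

```python
from datetime import datetime

def december_dummy(date_str):
    """Return 1 if an m/d/yyyy date falls in December, otherwise 0."""
    month = datetime.strptime(date_str, "%m/%d/%Y").month
    return 1 if month == 12 else 0

# A few dates taken from the dataset.
dates = ["11/28/2003", "12/30/2003", "1/30/2004", "12/31/2004"]
dummies = [december_dummy(d) for d in dates]
print(dummies)
```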
a) Research Question
Is Wal-Mart revenue closely associated with the general state of the U.S. economy?
b) State Hypotheses
Based on economic theories and reasoning, we could formulate the following hypotheses:
H1: Wal-Mart revenue is positively associated with the Total U.S. Retail Sales
H2: Wal-Mart revenue is positively associated with Personal Consumption Index
H3: Wal-Mart revenue is negatively associated with Consumer Price Index
c) The Regression Model
Revenue = β0 + β1 Total Retail Sales + β2 Personal Consumption Index + β3 Consumer Price Index
d) Descriptive Statistics
In DDXL, Charts and Plots – Normal Probability Plot:
[Normal probability plots: Wal-Mart Revenue, Retail Sales Index, Personal Consumption, CPI]
Summary: There appear to be four outlier values for Wal-Mart Revenue (23.106 in 12/2003;
26.177 in 12/2004; 27.92 in 12/2005; 28.795 in 12/2006). Other variables appear to be
approximately normal.
e) Correlation Table (Only include independent variables used in the final regression)
Excel – Data – Data Analysis – Correlation (highlight all independent variables)
                       CPI    Personal Consumption   Retail Sales Index
CPI                    1.00
Personal Consumption   0.98   1.00
Retail Sales Index     0.63   0.64                   1.00
Summary: CPI and Personal Consumption are highly correlated, which could cause a
multicollinearity problem in the multiple regression. The regression results should therefore be
interpreted with caution. Many methods can be used to address multicollinearity issues.
f) Check Regression Assumptions
(1) Linearity (check a scatterplot of y vs. x using DDXL)
[Scatterplots of Wal-Mart Revenue against: Retail Sales, Personal Consumption, CPI]
Summary: the independent variables show linear relationships with the dependent variable (Wal-Mart revenue).
(2) Homoscedasticity (check scatterplot of residuals vs. predicted/fitted values)
[Figure: residuals plotted against fitted values]
Summary: no particular pattern (bend) is observed. (DDXL)
(3) Independence (whether there is a serial correlation)
[Figure: residuals plotted against time]
Summary: the residuals show a bumping pattern, indicating serial correlation over time; the independence
assumption may be violated, and a model other than ordinary linear regression may be more appropriate.
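A numerical complement to the visual check is the Durbin-Watson statistic, a standard measure of first-order serial correlation in residuals. It is not used in this handout; the sketch below is only an illustration, with a hypothetical function name and made-up residual sequences. Values near 2 suggest no first-order serial correlation, values near 0 suggest positive correlation, and values near 4 suggest negative correlation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residual sequences, ordered in time.
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every period
persistent = [1.0, 1.0, 1.0, 1.0]      # same sign and size every period

dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)
print(dw_alternating, dw_persistent)
```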
(4) Normality (check the normal probability plot of the residuals)
Summary: the plot shows a largely straight line, indicating that the normality assumption is met.
Data Analysis ToolPak: Regression – Standardized Residuals – Residual Plot
g) Regression Results

Source     SS           df   MS
Model      378.748744    3   126.249581
Residual   189.474058   35   5.4135445
Total      568.222802   38   14.9532316

Number of obs   = 39
F(3, 35)        = 23.32
Prob > F        = 0.0000
R-squared       = 0.6665
Adj R-squared   = 0.6380
Root MSE        = 2.3267

walmartrev~e   Coef.       Std. Err.   t       P>|t|   [95% Conf. Interval]
retailsale~x   .0001032    .0000155    6.67    0.000   .0000718    .0001345
personalco~n   .0000111    4.40e-06    2.52    0.017   2.15e-06    .00002
cpi            -.3447946   .120335     -2.87   0.007   -.5890876   -.1005017
_cons          87.00878    33.59896    2.59    0.014   18.79926    155.2183

h) Interpret the Coefficients and Test Hypotheses
The coefficient for Retail Sales Index is 0.0001032 and highly significant (p < 0.001),
suggesting that Retail Sales Index is positively associated with Wal-Mart revenue. H1 is supported.
The coefficient for Personal Consumption is 0.0000111 and significant (p = 0.017 < 0.05),
suggesting that Personal Consumption is positively associated with Wal-Mart revenue. H2 is
supported.
The coefficient for CPI is -0.3447946 and highly significant (p = 0.007 < 0.01), suggesting that CPI is
negatively associated with Wal-Mart revenue. H3 is supported.
i) Conclusion
Wal-Mart revenue is closely related to the general state of the U.S. economy.
Outlier:
Regression Results without Outlier Values of Revenue:
Regression Statistics
Multiple R          0.649807474
R Square            0.422249753
Adjusted R Square   0.366338439
Standard Error      1.87418242
Observations        35

ANOVA
             df   SS            MS            F            Significance F
Regression    3   79.58196867   26.52732289   7.55213429   0.0006233
Residual     31   108.8893521   3.512559744
Total        34   188.4713207

                       Coefficients   Standard Error   t Stat      P-value    Lower 95%    Upper 95%
Intercept              -22.641017     35.989036        -0.629109   0.533887   -96.041140   50.759106
CPI                    0.047569       0.129729         0.366678    0.716350   -0.217015    0.312152
Personal Consumption   0.000001       0.000004         0.183319    0.855741   -0.000008    0.000010
Retail Sales Index     0.000013       0.000023         0.591614    0.558399   -0.000033    0.000060
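Conclusion:
When the outliers (December holiday sales) are excluded, none of the individual predictors is
statistically significant (all p-values exceed 0.5), so the data no longer show a relationship between
Wal-Mart revenue and the individual indicators of the general state of the U.S. economy. The overall
F-test remains significant (p = 0.0006), which may reflect the multicollinearity noted in the correlation table.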