Chapter 7: Multiple Regression Analysis

Multiple regression involves the use of more than one independent (predictor) variable
to predict a dependent variable.
Good predictor variables:
(i) are correlated with the response variable (Y), and
(ii) are not correlated with other predictor variables (X’s).
If two predictor variables are highly correlated with each other, they will explain the
same variation in Y, and adding the second variable will not improve the forecast. This
condition is called “multicollinearity”.
Estimating Correlation
In Excel, use the Data Analysis / Correlation tool to obtain the correlation matrix.
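As an illustration outside Excel, here is a minimal sketch in Python (numpy) that computes the same kind of correlation matrix; the variable names and data values below are hypothetical, not taken from the textbook.

```python
import numpy as np

# Hypothetical data: two candidate predictors and the response
x1 = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0, 6.0, 7.0])
y  = np.array([5.0, 9.0, 10.0, 15.0, 17.0, 20.0])

# Correlation matrix (rows/columns in the order x1, x2, y),
# analogous to the Data Analysis / Correlation output
corr = np.corrcoef(np.vstack([x1, x2, y]))
print(np.round(corr, 3))
```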
Multiple Regression Model
Statistical model: Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε
Mean response = μY = β0 + β1X1 + β2X2 + . . . + βkXk
i.e. Y = μY + ε
The ε’s are error components. The errors are assumed to be (i) independent, (ii)
normally distributed, and (iii) with mean zero and unknown standard deviation σ.
Estimated (Sample) Regression Equation: Y = b0 + b1X1 + b2X2 + . . . + bkXk + e
Prediction equation for Y:
Ŷ = b0 + b1X1 + b2X2 + . . . + bkXk,
where, b0 = the y-intercept, bi = the slope of the regression, i = 1, 2, …, k
The slope coefficients b1, b2, .. , bk are referred to as partial or net regression coefficients.
Estimation (Least Squares Method)
Y = Observation, Ŷ = Fit
Error (residual) = e = Y – Ŷ
SSE = Sum of squared errors = Σ(Y – Ŷ)² = Σ[Y – (b0 + b1X1 + b2X2 + . . . + bkXk)]²
The least squares method chooses the values of b0, b1, …, bk to minimize the sum of
squared errors (SSE). In Excel, use the LINEST array function to determine these
estimates.
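For readers not using Excel, a minimal Python (numpy) sketch of the same least squares estimation that LINEST performs; the data are the same hypothetical values used in the correlation sketch above.

```python
import numpy as np

# Hypothetical data: n = 6 observations, k = 2 predictors
X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
y = np.array([5.0, 9.0, 10.0, 15.0, 17.0, 20.0])

# Prepend a column of ones so the intercept b0 is estimated along with the slopes
Xd = np.column_stack([np.ones(len(y)), X])

# Least squares: choose b = (b0, b1, b2) to minimize SSE = sum of (Y - Yhat)^2
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print("b0, b1, b2 =", np.round(b, 4))
```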
Decomposition of variance
Y = Fit + Residual = Ŷ + (Y – Ŷ)
Subtracting Ȳ from both sides: (Y – Ȳ) = (Ŷ – Ȳ) + (Y – Ŷ)
Squaring and summing over all observations (the cross-product term sums to zero),
Σ(Y – Ȳ)² = Σ(Ŷ – Ȳ)² + Σ(Y – Ŷ)²
i.e. SST = SSR + SSE
SST = Sum of Squares Total = Σ(Y – Ȳ)²
SSR = Sum of Squares Regression = Σ(Ŷ – Ȳ)²
SSE = Sum of Squares due to Error = Σ(Y – Ŷ)² = SST – SSR
ANOVA Table

Source        Sum of squares    df        Mean Square           F-test
Regression    SSR               k         MSR = SSR/k           F = MSR/MSE
Error         SSE               n-k-1     MSE = SSE/(n-k-1)
Total         SST               n-1

Standard Error of estimate = Sy.x = √MSE
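The decomposition and the ANOVA quantities above can be verified numerically. A short sketch, continuing the hypothetical data used in the earlier examples:

```python
import numpy as np

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
y = np.array([5.0, 9.0, 10.0, 15.0, 17.0, 20.0])
n, k = X.shape

Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ b

SST = np.sum((y - y.mean()) ** 2)          # total sum of squares
SSE = np.sum((y - y_hat) ** 2)             # error sum of squares
SSR = SST - SSE                            # = sum((y_hat - y.mean())**2)

MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE
S_yx = np.sqrt(MSE)                        # standard error of estimate

print(f"SST={SST:.3f}  SSR={SSR:.3f}  SSE={SSE:.3f}")
print(f"MSR={MSR:.3f}  MSE={MSE:.3f}  F={F:.3f}  S_yx={S_yx:.3f}")
```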
Coefficient of Determination
R² = SSR/SST = Proportion of variation of Y explained by the regression
Multiple correlation coefficient
R = √R² = Correlation between Y and Ŷ, (0 ≤ R ≤ 1)
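Continuing the same hypothetical data, R² and R can be computed directly, and R can be checked against the correlation between Y and Ŷ:

```python
import numpy as np

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
y = np.array([5.0, 9.0, 10.0, 15.0, 17.0, 20.0])
Xd = np.column_stack([np.ones(len(y)), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ b

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
R2 = 1 - SSE / SST                         # = SSR/SST
R = np.sqrt(R2)

print(f"R^2 = {R2:.4f}, R = {R:.4f}")
print("corr(Y, Y-hat) =", round(np.corrcoef(y, y_hat)[0, 1], 4))   # equals R
```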
Significance of Regression
The regression is said to be significant if at least one slope coefficient is not equal to
zero. Therefore, the hypotheses to be tested are:
H0: β1 = β2 = … = βk = 0   (Regression is NOT significant)
Ha: At least one βi ≠ 0   (Regression IS significant)
Assume:
1. Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε
2. The errors ε are independent
3. The errors ε have constant variance σ²
4. The errors ε are normally distributed
F test for overall significance of the regression: F = MSR/MSE
t test for significance of individual predictor variables: t = bi / sbi,
where sbi = estimated standard deviation (standard error) of bi
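A sketch of both tests in Python (numpy/scipy), again on the hypothetical data used above. The coefficient standard errors are computed from sbi = √(MSE · [(X′X)⁻¹]ii), the usual least squares expression (stated here as an assumption, since the notes do not spell it out):

```python
import numpy as np
from scipy import stats

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
y = np.array([5.0, 9.0, 10.0, 15.0, 17.0, 20.0])
n, k = X.shape
Xd = np.column_stack([np.ones(n), X])

b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ b
SSE = np.sum(resid ** 2)
SST = np.sum((y - y.mean()) ** 2)
MSE = SSE / (n - k - 1)

# Overall F test with (k, n-k-1) degrees of freedom
F = ((SST - SSE) / k) / MSE
p_F = stats.f.sf(F, k, n - k - 1)

# t tests for individual coefficients: t_i = b_i / s_bi
s_b = np.sqrt(MSE * np.diag(np.linalg.inv(Xd.T @ Xd)))
t = b / s_b
p_t = 2 * stats.t.sf(np.abs(t), n - k - 1)

print(f"F = {F:.3f}, p = {p_F:.4f}")
for i in range(k + 1):
    print(f"b{i} = {b[i]:.4f}, s_b = {s_b[i]:.4f}, t = {t[i]:.3f}, p = {p_t[i]:.4f}")
```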
Forecasting Y
Point prediction Ŷ = b0 + b1X1 + b2X2 + . . . + bkXk,
Interval prediction = Ŷ ± t Sf, where Sf = Standard error of the forecast with df = n-k-1
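The notes do not give a formula for Sf. A common expression, used below as an assumption, is Sf = Sy.x · √(1 + x0′(X′X)⁻¹x0) for a new case x0 (with a leading 1 for the intercept). A sketch with the same hypothetical data and a made-up new case:

```python
import numpy as np
from scipy import stats

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
y = np.array([5.0, 9.0, 10.0, 15.0, 17.0, 20.0])
n, k = X.shape
Xd = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
MSE = np.sum((y - Xd @ b) ** 2) / (n - k - 1)

x0 = np.array([1.0, 6.0, 4.0])                 # [1, X1, X2] for the new case (hypothetical)
y_point = x0 @ b                               # point prediction Y-hat

# Assumed forecast standard error: Sf = S_yx * sqrt(1 + x0'(X'X)^-1 x0)
Sf = np.sqrt(MSE * (1 + x0 @ np.linalg.inv(Xd.T @ Xd) @ x0))
t_crit = stats.t.ppf(0.975, n - k - 1)         # 95% interval, df = n-k-1

print(f"point = {y_point:.3f}")
print(f"95% interval = ({y_point - t_crit * Sf:.3f}, {y_point + t_crit * Sf:.3f})")
```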
Dummy variables
A dummy (indicator) variable is a column of zeros and ones, with ones representing a condition
or category and zeros representing its opposite. Dummy (indicator) variables are used to
determine the relationship between qualitative predictor (independent) variables and a dependent
(response) variable.
Example 7.5 (page 293)
Model: Y = β0 + β1X1 + β2X2 + ε
where, Y = Job performance rating,
X1 = Aptitude test score, and X2 = Gender (0 = Female, 1 = Male)
Prediction equation: Ŷ = b0 + b1X1 + b2X2
This single prediction equation is equivalent to the following two equations:
1. For X2 = 1 (Males): Ŷ = (b0 + b2) + b1X1, and
2. For X2 = 0 (Females): Ŷ = b0 + b1X1
Interpretation for b2: The estimated difference in job performance rating between males
and females.
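A sketch of the dummy variable idea with made-up numbers (the values below are hypothetical and are not the data of Example 7.5):

```python
import numpy as np

# Hypothetical data: X1 = aptitude test score,
# X2 = gender dummy (0 = Female, 1 = Male), Y = job performance rating
X1 = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0])
X2 = np.array([0.0,  1.0,  0.0,  1.0,  0.0,  1.0,  0.0,  1.0])
Y  = np.array([5.0,  6.5,  6.0,  7.5,  7.0,  8.5,  8.0,  9.5])

Xd = np.column_stack([np.ones(len(Y)), X1, X2])
b0, b1, b2 = np.linalg.lstsq(Xd, Y, rcond=None)[0]

# The single fitted equation splits into one line per category
print(f"Males   (X2 = 1): Y-hat = {b0 + b2:.3f} + {b1:.3f} X1")
print(f"Females (X2 = 0): Y-hat = {b0:.3f} + {b1:.3f} X1")
print(f"b2 = {b2:.3f}  (estimated male-female difference at any given X1)")
```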
Multicollinearity
A linear relationship between two or more independent variables (X’s) is called
multicollinearity. The strength of multicollinearity is measured by the variance inflation
factor (VIF).
Variance Inflation Factor = VIFj = 1 / (1 – Rj²), j = 1, 2, …, k
where Rj² = coefficient of determination from a regression with Xj as the dependent
variable and the remaining k–1 X’s as the independent variables.
If a given X variable is not correlated with the X’s already in the regression, the
corresponding Rj² value will be small and the VIFj value will be close to 1, indicating
that multicollinearity is not a problem. On the other hand, if a given X variable is
highly correlated with the X’s already in the regression, the corresponding Rj² value
will be high (close to 1) and the VIFj value will be large, indicating the presence of a
multicollinearity problem. When a multicollinearity problem exists, the b estimates and
the corresponding standard errors will change considerably when the given X is
included in the regression.
Use the user-defined VIF array-function for determining the VIF values.
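For readers without the course's user-defined Excel function, a small Python sketch that computes VIFj directly from its definition (the predictor values are hypothetical):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    of X on an intercept plus all remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y_j = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, y_j, rcond=None)
        r2_j = 1 - np.sum((y_j - others @ b) ** 2) / np.sum((y_j - y_j.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2_j)
    return out

# Hypothetical predictors; x1 and x2 are deliberately correlated
X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
print(np.round(vif(X), 2))
```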
Example 7.7 (Page 298)
Selecting the “best” regression equation
Selecting the best regression equation involves a trade-off between (1) including as
many useful predictor variables (X’s) as possible and (2) keeping the number of
predictors as small as possible in order to minimize cost.
1. Select a complete list of candidate predictors (X’s).
2. Screen out the predictor variables that do not seem appropriate. Four reasons why
a predictor may not be appropriate: (i) the predictor is not directly related to the response
variable, (ii) it may be subject to measurement errors, (iii) it may have multicollinearity with
another predictor variable, and (iv) it may be difficult to measure accurately.
3. Shorten the list of predictors so as to obtain the “best” regression equation.
Model selection methodologies
(a) All possible regressions
Goal: Develop the best possible regression for each number of predictor (X) variables,
namely with one, two, three, etc. predictor (X) variables.
SPSS command: Under Options, use as small a value as possible for the “entry” and
“removal” F (e.g. .0002 and .0001, respectively).
(b) Stepwise regression
Goal: Develop the best possible regression with as few predictor (X) variables as possible.
Procedure (a simplified code sketch follows the SPSS note below):
1. The predictor variable with the largest correlation with Y is entered into the
regression first.
2. Of the remaining X variables, the one that would increase the F statistic the
most, provided the increase is at least a specified minimum amount, is added to the
regression equation.
3. Once an additional X variable is included in the regression equation, the
individual contributions of the X variables already in the equation are checked for
significance using F. If any such F is smaller than a threshold minimum value, the
corresponding X variable is removed from the regression equation.
4. Repeat steps 2 and 3 until no more X variables can be added or removed.
SPSS command: Under Method, select Stepwise. Under Options, enter the appropriate
entry and removal criteria (e.g. probability-of-F values of .05 and .10, respectively).
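The sketch below implements a simplified, forward-only version of this procedure in Python; it uses the partial F statistic to decide entry and omits the removal step (step 3) for brevity. The entry threshold and the simulated data are assumptions for illustration only.

```python
import numpy as np

def sse_of_fit(cols, y):
    """SSE from an OLS fit of y on an intercept plus the given columns."""
    Xd = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ b) ** 2)), Xd.shape[1]

def forward_stepwise(X, y, f_to_enter=4.0):
    """Greedy forward selection: at each step, add the X that gives the
    largest partial F, provided it exceeds f_to_enter."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    sse_cur, _ = sse_of_fit([], y)                 # intercept-only model
    while remaining:
        best = None
        for j in remaining:
            sse_new, p_new = sse_of_fit([X[:, i] for i in selected + [j]], y)
            if n - p_new <= 0:
                continue
            F = (sse_cur - sse_new) / (sse_new / (n - p_new))   # partial F, 1 df
            if best is None or F > best[1]:
                best = (j, F, sse_new)
        if best is None or best[1] < f_to_enter:
            break
        j, F, sse_cur = best
        selected.append(j)
        remaining.remove(j)
        print(f"enter X{j + 1}: partial F = {F:.2f}")
    return [f"X{i + 1}" for i in selected]

# Hypothetical data: three candidate predictors, only the first two matter
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 2.0 + 1.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=30)
print("selected:", forward_stepwise(X, y))
```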
Example 7.8 (page 288)
Y = One month’s sales
X1 = selling aptitude test score
X2 = age, in years
X3 = anxiety test score
X4 = experience, in years
X5 = high school GPA
Regression diagnostics and residual analysis
Histogram of residuals: checks the normality assumption; moderate deviation from a
bell shape is permitted.
Residuals (on the y-axis) vs. fitted values Ŷ (on the x-axis): checks the linearity
assumption; if the plot is not completely random, a transformation may be considered.
Residuals vs. explanatory variables (X): also checks the linear model and the constant
variance assumption.
Residuals over time (for time series): used for time-series data; checks all assumptions.
Autocorrelation of residuals: checks the independence of residuals.
Leverage of the ith data point (hii)
0 < hii < 1
Leverage depends only on predictors, not Y
Rule of thumb: hii > 3(k+1)/n is considered high
High leverage point is an outlier among X’s
High leverage points will unduly influence the estimated parameter values
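The notes do not define hii explicitly; the usual least squares expression (assumed here) is that the hii are the diagonal of the hat matrix H = X(X′X)⁻¹X′, with the intercept column included in X. A short sketch on the hypothetical predictors used earlier:

```python
import numpy as np

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
n, k = X.shape
Xd = np.column_stack([np.ones(n), X])

# h_ii = diagonal of the hat matrix; depends only on the X's, not on Y
h = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)

cutoff = 3 * (k + 1) / n                     # rule of thumb for "high" leverage
for i, hii in enumerate(h):
    flag = "  <-- high leverage" if hii > cutoff else ""
    print(f"case {i + 1}: h = {hii:.3f}{flag}")
```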
Outlying or extreme Y value
Residual = ei = Y – Ŷ
Standardized residual = ei / Sei, where Sei = Sy.x √(1 – hii)
If |ei / Sei| > 2, the residual is considered extreme.
SPSS option: Under Residuals, select Case-wise diagnostics and specify “All cases” or the
number of standard deviations
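Putting the pieces together, a sketch that computes the standardized residuals and flags extreme cases (same hypothetical data as before):

```python
import numpy as np

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 2.0],
              [7.0, 5.0], [8.0, 6.0], [10.0, 7.0]])
y = np.array([5.0, 9.0, 10.0, 15.0, 17.0, 20.0])
n, k = X.shape
Xd = np.column_stack([np.ones(n), X])

b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
e = y - Xd @ b                                          # residuals e_i = Y - Y-hat
S_yx = np.sqrt(np.sum(e ** 2) / (n - k - 1))            # standard error of estimate
h = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)       # leverages h_ii

std_resid = e / (S_yx * np.sqrt(1 - h))                 # e_i / S_ei
for i, r in enumerate(std_resid):
    flag = "  <-- extreme" if abs(r) > 2 else ""
    print(f"case {i + 1}: standardized residual = {r:.2f}{flag}")
```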
Forecasting Caveats
Over-fitting
Too many X’s in the model.
Rule of thumb: have at least 10 observations for each X variable in the model.
Useful Regression
The F ratio should be at least four times the corresponding critical value for the model
to yield useful results.