Applied Statistics in Business & Economics
Chapter 13: Multiple Regression
Chapter Contents
• Bivariate or Multivariate?
• Multiple Regression
• Assessing Overall Fit
• Predictor Significance
• Confidence Intervals for Y
• Binary Predictors
• Tests for Nonlinearity and Interaction
• Multicollinearity
• Violations of Assumptions
• Other Regression Topics
Bivariate or Multivariate?
• Multiple regression is an extension of bivariate regression to include more than one independent variable.
• Limitations of bivariate regression:
  - often simplistic
  - biased estimates if relevant predictors are omitted
  - lack of fit does not show that X is unrelated to Y

Regression Terminology
• Y is the response variable and is assumed to be related to the k predictors (X1, X2, …, Xk) by a linear equation called the population regression model:
  Y = β0 + β1X1 + β2X2 + … + βkXk + ε
• The fitted regression equation is:
  ŷ = b0 + b1X1 + b2X2 + … + bkXk
Data Format
• n observed values of the response variable Y and its proposed predictors X1, X2, …, Xk are presented in the form of an n x k matrix.
Illustration: Home Prices
• Consider the following data on the selling price of a home (Y, the response variable) and three potential explanatory variables:
  X1 = SqFt
  X2 = LotSize
  X3 = Baths
Logic of Variable Selection
• Intuitively, the regression model relates the selling price to the proposed predictors, e.g., Price = f(SqFt, LotSize, Baths).
• State the hypotheses about the sign of the coefficients in the model.

Fitted Regressions
• Use Excel, MegaStat, MINITAB, or any other statistical package.
• For n = 30 home sales, the package reports the fitted regression and its statistics of fit.
• R2 is the coefficient of determination and SE is the standard error of the regression.
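• A minimal sketch of this step in Python (statsmodels on hypothetical data, not the textbook's 30 home sales) fits Price on SqFt, LotSize, and Baths and reads off the coefficients, R2, and SE:

```python
# Sketch: fit Price = b0 + b1*SqFt + b2*LotSize + b3*Baths by least squares
# and report the coefficients, R-squared, and SE (hypothetical data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 30
sqft = rng.uniform(1500, 3500, n)
lotsize = rng.uniform(5, 20, n)          # hypothetical lot sizes
baths = rng.integers(1, 4, n)
price = 50 + 0.08 * sqft + 2.0 * lotsize + 15 * baths + rng.normal(0, 20, n)

X = sm.add_constant(np.column_stack([sqft, lotsize, baths]))  # n x (k+1) matrix
model = sm.OLS(price, X).fit()

print(model.params)                       # b0, b1, b2, b3
print("R-squared:", model.rsquared)
print("SE of regression:", np.sqrt(model.mse_resid))
```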
Common Misconceptions about Fit
• A common mistake is to assume that the model with the best fit is preferred.
• Principle of Occam's Razor: When two explanations are otherwise equivalent, we prefer the simpler, more parsimonious one.

Regression Modeling
• Four Criteria for Regression Assessment
  Logic: Is there an a priori reason to expect a causal relationship between the predictors and the response variable?
  Fit: Does the overall regression show a significant relationship between the predictors and the response variable?
Regression Modeling
• Four Criteria for Regression Assessment (continued)
  Parsimony: Does each predictor contribute significantly to the explanation? Are some predictors not worth the trouble?
  Stability: Are the predictors related to one another so strongly that regression estimates become erratic?

Assessing Overall Fit
F Test for Significance
• For a regression with k predictors, the hypotheses to be tested are
  H0: All the true coefficients are zero
  H1: At least one of the coefficients is nonzero
• In other words,
  H0: β1 = β2 = … = βk = 0
  H1: At least one of the coefficients is nonzero
Assessing Overall Fit
F Test for Significance
• The ANOVA table decomposes the variation of the response variable around its mean into explained variation (SSR) and unexplained variation (SSE).
• The ANOVA calculations for a k-predictor model can be summarized by the test statistic
  F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))
• The ANOVA calculations for the home price data are taken directly from the regression output.

Assessing Overall Fit
Coefficient of Determination (R2)
• R2, the coefficient of determination, is a common measure of overall fit.
• It can be calculated one of two ways:
  R2 = SSR / SST = 1 − SSE / SST
• For example, for the home price data, R2 is computed from the sums of squares in the ANOVA table.
Assessing Overall Fit
Adjusted R2
• It is generally possible to raise the coefficient of determination R2 by including additional predictors.
• The adjusted coefficient of determination penalizes the inclusion of useless predictors.
• For n observations and k predictors,
  R2adj = 1 − (1 − R2)(n − 1) / (n − k − 1)
• For the home price data, the adjusted R2 is computed from R2 with n = 30 and k = 3.
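• The arithmetic behind the ANOVA table, F statistic, R2, and adjusted R2 can be sketched in a few lines of Python; the data below are hypothetical and the calculations simply follow the formulas above:

```python
# Sketch of the ANOVA arithmetic: SST, SSE, SSR, F, R^2, adjusted R^2.
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = rng.normal(size=(n, k))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, n)

Xc = np.column_stack([np.ones(n), X])            # add intercept column
b, *_ = np.linalg.lstsq(Xc, y, rcond=None)       # fitted coefficients
yhat = Xc @ b

sst = np.sum((y - y.mean()) ** 2)                # total variation
sse = np.sum((y - yhat) ** 2)                    # unexplained (error) variation
ssr = sst - sse                                  # explained variation

f_stat = (ssr / k) / (sse / (n - k - 1))         # F = MSR / MSE
r2 = ssr / sst                                   # or 1 - sse/sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"F = {f_stat:.2f}, R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```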
Assessing Overall Fit
How Many Predictors?
• Limit the number of predictors based on the sample size.
• When n/k is small, R2 no longer gives a reliable indication of fit.
• Suggested rules are:
  Evans' Rule (conservative): n/k ≥ 10 (at least 10 observations per predictor)
  Doane's Rule (relaxed): n/k ≥ 5 (at least 5 observations per predictor)
Predictor Significance
• Test each fitted coefficient to see whether it is significantly different from zero.
• The hypothesis tests for predictor Xj are
  H0: βj = 0
  H1: βj ≠ 0
• If we cannot reject the hypothesis that a coefficient is zero, then the corresponding predictor does not contribute to the prediction of Y.

Predictor Significance
Test Statistic
• The test statistic for the coefficient of predictor Xj is
  tj = bj / sbj   (with n − k − 1 degrees of freedom)
• Find the critical value tα for a chosen level of significance α from Appendix D.
• Reject H0 if |tj| > tα or if the p-value < α.
• The 95% confidence interval for coefficient βj is
  bj ± t.025 sbj
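• A short sketch (hypothetical data) of reading the t statistics, p-values, and 95% confidence intervals for the coefficients from a fitted model:

```python
# Sketch: t tests and 95% confidence intervals for each coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 30
X = rng.normal(size=(n, 3))
y = 10 + X @ np.array([1.5, 0.0, -2.0]) + rng.normal(0, 1, n)   # X2 is useless

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.tvalues)                # t_j = b_j / s_bj for each coefficient
print(results.pvalues)                # reject H0: beta_j = 0 when p-value < alpha
print(results.conf_int(alpha=0.05))   # 95% intervals b_j +/- t.025 * s_bj
```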
Confidence Intervals for Y
Standard Error
• The standard error of the regression (SE) is another important measure of fit.
• For n observations and k predictors,
  SE = sqrt(SSE / (n − k − 1))
• If all predictions were perfect, the SE would be 0.

Confidence Intervals for Y
Quick Prediction Interval for Y
• The t-values for 95% confidence are typically near 2 (as long as n is not too small), and so …
• An approximate 95% confidence interval for the conditional mean of Y is:
  ŷ ± 2 SE / √n
• An approximate 95% prediction interval for an individual Y value is:
  ŷ ± 2 SE
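• A sketch of the quick intervals, assuming a fitted model and a hypothetical new observation x_new; the interval widths come directly from SE and n:

```python
# Sketch: quick 95% intervals, yhat +/- 2*SE (individual Y) and
# yhat +/- 2*SE/sqrt(n) (conditional mean of Y), on hypothetical data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 30
X = rng.normal(size=(n, 2))
y = 4 + X @ np.array([3.0, -1.0]) + rng.normal(0, 2, n)

results = sm.OLS(y, sm.add_constant(X)).fit()
se = np.sqrt(results.mse_resid)              # standard error of the regression

x_new = np.array([1.0, 0.5, -0.2])           # constant plus two predictor values
y_hat = float(x_new @ results.params)

print("individual Y:", (y_hat - 2 * se, y_hat + 2 * se))
print("mean of Y:   ", (y_hat - 2 * se / np.sqrt(n), y_hat + 2 * se / np.sqrt(n)))
```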
Binary Predictors
What Is a Binary Predictor?
• A binary predictor has two values (usually 0 and 1) to denote the presence or absence of a condition.
• For example, for n graduates from an MBA program:
  Employed = 1
  Unemployed = 0
• These variables are also called dummy or indicator variables.
• For easy interpretability, name the binary variable for the characteristic that corresponds to a value of 1.
Binary Predictors
Effects of a Binary Predictor
• A binary predictor is sometimes called a shift variable because it shifts the regression plane up or down.
• Suppose X1 is a binary predictor that can take on only the values 0 or 1.
• Its contribution to the regression is either b1 or nothing, resulting in an intercept of either b0 (when X1 = 0) or b0 + b1 (when X1 = 1).
• The slope does not change; only the intercept is shifted (see Figure 13.8).
Binary Predictors
Testing a Binary for Significance
• In multiple regression, binary predictors require no special treatment. They are tested like any other predictor, using a t test.

Binary Predictors
More Than One Binary
• More than one binary occurs when the number of categories to be coded exceeds two.
• For example, for the variable GPA by class level, each category is a binary variable:
  Freshman = 1 if a freshman, 0 otherwise
  Sophomore = 1 if a sophomore, 0 otherwise
  Junior = 1 if a junior, 0 otherwise
  Senior = 1 if a senior, 0 otherwise
  Masters = 1 if a master's candidate, 0 otherwise
  Doctoral = 1 if a PhD candidate, 0 otherwise
Binary Predictors
More Than One Binary
• If there are c mutually exclusive and collectively exhaustive categories, then only c − 1 binaries are needed to code each observation.
• Any one of the categories can be omitted because the remaining c − 1 binary values uniquely determine the omitted binary (Table 13.6).
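• A sketch of coding c categories with c − 1 binaries in Python; the DataFrame and column names are hypothetical, and pandas drops the reference category when drop_first=True:

```python
# Sketch: code c categories with c - 1 dummy (binary) columns.
import pandas as pd

df = pd.DataFrame({
    "GPA": [3.1, 3.6, 2.9, 3.8],
    "Class": ["Freshman", "Sophomore", "Junior", "Freshman"],
})

# One binary per category except the omitted (reference) category.
coded = pd.get_dummies(df, columns=["Class"], drop_first=True)
print(coded)
```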
Binary Predictors
What if I Forget to Exclude One Binary?
• Including all c binaries for c categories would introduce a serious problem for the regression estimation.
• One column in the X data matrix would be a perfect linear combination of the other column(s).
• The least squares estimation would fail because the data matrix would be singular (i.e., it would have no inverse).

Binary Predictors
Regional Binaries
• Binaries are commonly used to code regions. For example,
  Midwest = 1 if in the Midwest, 0 otherwise
  Neast = 1 if in the Northeast, 0 otherwise
  Seast = 1 if in the Southeast, 0 otherwise
  West = 1 if in the West, 0 otherwise

Tests for Nonlinearity and Interaction
Tests for Nonlinearity
• Sometimes the effect of a predictor is nonlinear.
• To test for nonlinearity of any predictor, include its square in the regression. For example,
  Y = β0 + β1X1 + β2X1² + β3X2 + β4X2² + ε
• If the linear model is the correct one, the coefficients of the squared predictors β2 and β4 would not differ significantly from zero.
• Otherwise a quadratic relationship would exist between Y and the respective predictor variable (Figure 13.11).
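• A sketch of the nonlinearity test on hypothetical data: the squares of the predictors are added as columns, and the p-values on the squared terms are examined:

```python
# Sketch: add squared predictors and check whether their coefficients
# differ significantly from zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 60
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 2 + 1.5 * x1 + 0.4 * x1**2 - 3 * x2 + rng.normal(0, 2, n)   # curvature in x1

X = sm.add_constant(np.column_stack([x1, x1**2, x2, x2**2]))
results = sm.OLS(y, X).fit()
print(results.pvalues)   # a small p-value on a squared term suggests nonlinearity
```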
Tests for Nonlinearity and Interaction
Tests for Interaction
• Test for interaction between two predictors by including their product in the regression:
  Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
• If we reject the hypothesis H0: β3 = 0, then we conclude that there is a significant interaction between X1 and X2.
• Interaction effects require careful interpretation and cost 1 degree of freedom per interaction.
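• A sketch of the interaction test on hypothetical data: the product X1·X2 is added as a column and its coefficient is tested with the usual t test:

```python
# Sketch: include the product X1*X2 and test H0: beta3 = 0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 - x2 + 1.5 * x1 * x2 + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
results = sm.OLS(y, X).fit()
print(results.tvalues[3], results.pvalues[3])   # t and p-value for the X1*X2 term
```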
Multicollinearity
What Is Multicollinearity?
• Multicollinearity occurs when the independent variables X1, X2, …, Xm are intercorrelated instead of being independent.
• Collinearity occurs if only two predictors are correlated.
• The degree of multicollinearity is the real concern.
Multicollinearity
Variance Inflation
• Multicollinearity induces variance inflation when predictors are strongly intercorrelated.
• This results in wider confidence intervals for the true coefficients β1, β2, …, βm and makes the t statistic less reliable.
• The separate contribution of each predictor in "explaining" the response variable is difficult to identify.

Multicollinearity
Correlation Matrix
• To check whether two predictors are correlated (collinearity), inspect the correlation matrix using Excel, MegaStat, or MINITAB (Table 13.10).
Multicollinearity
Correlation Matrix
• A quick rule: a sample correlation whose absolute value exceeds 2/√n probably differs significantly from zero in a two-tailed test at α = .05.
• This applies to samples that are not too small (say, 20 or more).

Multicollinearity
Variance Inflation Factor (VIF)
• The matrix scatter plots and correlation matrix only show correlations between any two predictors.
• The variance inflation factor (VIF) is a more comprehensive test for multicollinearity.
• For a given predictor j, the VIF is defined as
  VIFj = 1 / (1 − Rj²)
  where Rj² is the coefficient of determination when predictor j is regressed against all other predictors.
Multicollinearity
Rules of Thumb
• There is no limit on the magnitude of the VIF.
• A VIF of 10 says that the other predictors "explain" 90% of the variation in predictor j.
• This indicates that predictor j is strongly related to the other predictors. However, it is not necessarily indicative of instability in the least squares estimates.
• A large VIF is a warning to consider whether predictor j really belongs in the model.
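• A sketch of computing VIFj = 1/(1 − Rj²) for each predictor with statsmodels; the data are hypothetical, with X2 deliberately constructed to be strongly related to X1:

```python
# Sketch: VIF for each predictor using statsmodels' variance_inflation_factor.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(9)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)     # strongly related to x1
x3 = rng.normal(size=n)

exog = sm.add_constant(np.column_stack([x1, x2, x3]))
for j in range(1, exog.shape[1]):            # skip the constant column
    print(f"VIF for predictor {j}: {variance_inflation_factor(exog, j):.1f}")
```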
Multicollinearity
Are Coefficients Stable?
• Evidence of instability is
  - when X1 and X2 have a high pairwise correlation with Y, yet one or both predictors have insignificant t statistics in the fitted multiple regression, and/or
  - when X1 and X2 are positively correlated with Y, yet one has a negative slope in the multiple regression.
• As a test, try dropping a collinear predictor from the regression and see what happens to the fitted coefficients in the re-estimated model.
• If they don't change much, then multicollinearity is not a concern.
• If dropping the predictor causes sharp changes in one or more of the remaining coefficients, then multicollinearity may be causing instability.
Violations of Assumptions
• The least squares method makes several assumptions about the (unobservable) random errors εi. Clues about these errors may be found in the residuals ei.
• Assumption 1: The errors are normally distributed.
• Assumption 2: The errors have constant variance (i.e., they are homoscedastic).
• Assumption 3: The errors are independent (i.e., they are nonautocorrelated).
Violations of Assumptions
Non-Normal Errors
• Except when there are major outliers, non-normal residuals are usually considered a mild violation.
• Regression coefficients and their variances remain unbiased and consistent.
• Confidence intervals for the parameters may be unreliable because they are based on the normality assumption.
• The confidence intervals are generally OK with a large sample size (e.g., n > 30) and no outliers.
Violations of Assumptions
Non-Normal Errors
• Test:
  H0: Errors are normally distributed
  H1: Errors are not normally distributed
• Create a histogram of residuals (plain or standardized) to visually reveal any outliers or serious asymmetry.
• The normal probability plot will also visually test for normality.

Violations of Assumptions
Nonconstant Variance (Heteroscedasticity)
• If the error variance is constant, the errors are homoscedastic. If the error variance is nonconstant, the errors are heteroscedastic.
• This violation is potentially serious.
• The least squares regression parameter estimates remain unbiased and consistent.
• However, the estimated variances are biased (understated) and not efficient, resulting in overstated t statistics and narrow confidence intervals.
Violations of Assumptions
Nonconstant Variance (Heteroscedasticity)
• The hypotheses are:
  H0: Errors have constant variance (homoscedastic)
  H1: Errors have nonconstant variance (heteroscedastic)
• Constant variance can be visually tested by examining scatter plots of the residuals against each predictor (Figure 13.19).
• Ideally there will be no pattern.
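• A sketch of the visual check on hypothetical data: residuals are plotted against each predictor, and a fan or funnel shape would suggest heteroscedasticity:

```python
# Sketch: residual-versus-predictor plots as a visual heteroscedasticity check.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 80
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
y = 3 + 2 * x1 + x2 + rng.normal(0, x1, n)     # error spread grows with x1

results = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, x, name in zip(axes, (x1, x2), ("X1", "X2")):
    ax.scatter(x, results.resid)
    ax.axhline(0, color="gray")
    ax.set_xlabel(name)
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```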
Violations of Assumptions
Autocorrelation
• Autocorrelation is a pattern of nonindependent errors that violates the assumption that each error is independent of its predecessor.
• This is a problem with time series data.
• Autocorrelated errors result in biased estimated variances, which lead to narrow confidence intervals and large t statistics.
• The model's fit may be overstated.
Violations of Assumptions
Autocorrelation
• Test the hypotheses:
  H0: Errors are nonautocorrelated
  H1: Errors are autocorrelated
• We use the observable residuals e1, e2, …, en for evidence of autocorrelation and the Durbin-Watson test statistic:
  DW = [Σt=2..n (et − et−1)²] / [Σt=1..n et²]
• The DW statistic lies between 0 and 4.
• When H0 is true (no autocorrelation), the DW statistic will be near 2.
• A DW < 2 suggests positive autocorrelation.
• A DW > 2 suggests negative autocorrelation.
• Ignore the DW statistic for cross-sectional data.
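• A sketch of computing the DW statistic on hypothetical time-series data with positively autocorrelated (AR(1)) errors:

```python
# Sketch: Durbin-Watson statistic on the residuals of a time-series regression.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(17)
n = 100
t = np.arange(n)
errors = np.zeros(n)
for i in range(1, n):                        # AR(1) errors: e_t = 0.7 e_{t-1} + u_t
    errors[i] = 0.7 * errors[i - 1] + rng.normal(0, 1)
y = 10 + 0.5 * t + errors

results = sm.OLS(y, sm.add_constant(t.astype(float))).fit()
dw = durbin_watson(results.resid)
print(f"DW = {dw:.2f}  (near 2: no autocorrelation; < 2 suggests positive)")
```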
Violations of Assumptions
Unusual Observations
• An observation may be unusual
  1. because the fitted model's prediction is poor (unusual residuals), or
  2. because one or more predictors may be having a large influence on the regression estimates (unusual leverage).
Violations of Assumptions
Unusual Observations
• To check for unusual residuals, simply inspect the residuals to find instances where the model does not predict well.
• To check for unusual leverage, look at the leverage statistic (how far each observation is from the mean(s) of the predictors) for each observation.
• For n observations and k predictors, look for observations whose leverage exceeds 2(k + 1)/n.
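• A sketch of the leverage check on hypothetical data: hat values are compared against the cutoff 2(k + 1)/n:

```python
# Sketch: flag observations whose leverage (hat value) exceeds 2(k + 1)/n.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(21)
n, k = 40, 2
X = rng.normal(size=(n, k))
X[0] = [6.0, -6.0]                           # one observation far from the X means
y = 1 + X @ np.array([2.0, 1.0]) + rng.normal(0, 1, n)

results = sm.OLS(y, sm.add_constant(X)).fit()
leverage = results.get_influence().hat_matrix_diag
cutoff = 2 * (k + 1) / n
print("high-leverage observations:", np.where(leverage > cutoff)[0])
```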
Other Regression Topics
Outliers: Causes and Cures
• An outlier may be due to an error in recording the data; if so, the observation should be deleted.
• It is reasonable to discard an observation on the grounds that it represents a different population than the other observations.
• An outlier may also be an observation that has been influenced by an unspecified "lurking" variable that should have been controlled but wasn't.
• Try to identify the lurking variable and formulate a multiple regression model that includes both predictors.

Other Regression Topics
Missing Predictors
• Unspecified "lurking" variables cause inaccurate predictions from the fitted regression.

Other Regression Topics
Ill-Conditioned Data
• All variables in the regression should be of the same general order of magnitude.
• Do not mix very large data values with very small data values.
• To avoid mixing magnitudes, adjust the decimal point in both variables.
• Be consistent throughout the data column.
• The decimal adjustments for each data column need not be the same.
Other Regression Topics
Significance in Large Samples
• Statistical significance may not imply practical importance.
• Anything can be made significant if you get a large enough sample.
Other Regression Topics
Model Specification Errors
• A misspecified model occurs when you estimate a linear model when a nonlinear model is actually required, or when a relevant predictor is omitted.
• To detect misspecification:
  - Plot the residuals against estimated Y (there should be no discernible pattern).
  - Plot the residuals against actual Y (there should be no discernible pattern).
  - Plot the fitted Y against the actual Y (the points should follow a 45° line).
Other Regression Topics
Missing Data
• Discard a variable if many data values are missing.
• If a Y value is missing, discard the observation to be conservative.
• Other options are to use the mean of the X data column for the missing values or to use a regression procedure to "fit" the missing X value from the complete observations.

Other Regression Topics
Binary Dependent Variable
• When the response variable Y is binary (0, 1), the least squares estimation method is no longer appropriate.
• Use logit and probit regression methods instead.
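• A sketch of a logit fit on hypothetical data (statsmodels also provides Probit with the same interface):

```python
# Sketch: logit model for a binary 0/1 response.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(23)
n = 200
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = rng.binomial(1, p)                       # binary 0/1 response

logit_results = sm.Logit(y, sm.add_constant(X)).fit()
print(logit_results.params)                  # coefficients on the log-odds scale
```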
Other Regression Topics
Stepwise and Best Subsets Regression
• The stepwise regression procedure finds the best-fitting model using 1, 2, 3, …, k predictors.
• This procedure is appropriate only when there is no theoretical model that specifies which predictors should be used.
• Best subsets regression fits all possible combinations of predictors.
End of Chapter 13
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc.