Multiple Regression

STAT 3130
Statistical Methods II
Session 5
Multiple Regression
From last semester, we learned that a simple linear equation of a line
takes the general form y = mx + b, where:
• y is the dependent variable
• m is the slope of the line
• x is the independent variable, or predictor
• b is the y-intercept.
When we discuss regression models, we rewrite this equation as:
Y = b0 + b1x1
where b0 is the y-intercept and b1 is the slope of the line. The “slope” is
also the effect of a one-unit change in x on y.
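A minimal sketch of fitting such a model in Python with statsmodels (the data here are made up for illustration, not from the course):

# Sketch: fitting y = b0 + b1*x1 by ordinary least squares.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x1 + rng.normal(0, 1, size=50)  # true b0 = 2, b1 = 0.5

X = sm.add_constant(x1)    # adds the intercept (b0) column
model = sm.OLS(y, X).fit()
print(model.params)        # estimated b0 and b1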
Multiple Regression
This was fine…but typically we don’t have just one predictor – we have
lots of predictors.
When we discuss multiple regression models, the general form of the
equation is:
Y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn
where b0 is still the y-intercept and bi is the effect of a one-unit change
in each individual predictor on the y (dependent) variable.
Let’s discuss the general form of different hypothetical multiple regression
models…
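As a hedged sketch, the same statsmodels pattern extends to several predictors; the DataFrame and the column names below are hypothetical:

# Sketch: fitting Y = b0 + b1*x1 + b2*x2 + b3*x3.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [5, 3, 6, 2, 7, 4],
    "y":  [10, 9, 17, 13, 24, 18],
})

model = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
print(model.summary())  # b0 (Intercept) plus one coefficient per predictor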
Multiple Regression
The requirements for Multiple Regression are generally the same as they
were for Linear Regression:
1. The relationship between the dependent and the independent
variable(s) is assumed to be linear.
2. The dependent and the independent variable(s) will (hopefully) have
some significant correlation.
3. There should be no extreme values that influence (usually
negatively) the results.
4. Results are homoscedastic.
5. All observations are independent.
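A residuals-versus-fitted plot is a common first screen for requirements 1, 3, and 4; a sketch, continuing the hypothetical model fit above:

# Sketch: residuals vs. fitted values. Roughly constant vertical spread
# suggests homoscedasticity; a curved pattern suggests non-linearity;
# isolated far-away points flag influential extreme values.
import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()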
Multiple Regression
But…there are some issues in Multiple Regression which are not present
in Simple Linear Regression:
1. Multicollinearity amongst predictors
2. “Ingredient” variables
3. Selection Methods/Model Parsimony
Let’s explore each of these in turn…
Multiple Regression
Multicollinearity – what is it and what’s the big deal?
Let’s look at the temperature data and predict the temp in August…
• Correlation Matrix – how is each potential predictor correlated
individually with August Temperature?
• Now…let’s build the regression model…pay particular attention to the
beta coefficients and the p-values…
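A sketch of that workflow in Python; the file temps.csv and its column names (june, july, august) are hypothetical stand-ins for the class data:

# Sketch: correlation screen, then the regression.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("temps.csv")

# How is each potential predictor correlated with August temperature?
print(df.corr()["august"])

# Fit the model and inspect coefficients and p-values. With highly
# correlated predictors, expect unstable coefficients and large p-values
# even when the overall model predicts well.
model = smf.ols("august ~ june + july", data=df).fit()
print(model.params)
print(model.pvalues)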
Multiple Regression
Multicollinearity – what is it and what’s the big deal?
The moral of the story is that caution must be employed in
interpreting the individual regression coefficients in a multiple
regression analysis. The regression can be used to determine a
predicted value of August temperature, even when it is
difficult to interpret the sample regression coefficients.
If the individual coefficients are important, the multicollinearity
must be removed…
Consider the VIF (Variance Inflation Factor):
VIF = 1 / (1 − R²)
…where the R² here is the one obtained when the predictor in question
is regressed on the other predictors.
Multiple Regression
For example, if the VIF = 10, then the respective R² would be 90%.
This would mean that 90% of the variance in the predictor in
question can be explained by the other independent variables.
Because so much of the variance is captured elsewhere, removing
the predictor in question should not cause a substantive decrease
in overall R².
The rule of thumb is to remove variables with VIF scores greater
than 10.
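statsmodels provides a helper for this calculation; a sketch, reusing the hypothetical temperature data from above:

# Sketch: VIF for each predictor. The intercept column must be included
# for the usual VIF definition to hold.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["june", "july"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
# Rule of thumb: drop (or combine) predictors with VIF > 10.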
Multiple Regression
What is an “ingredient” variable?
If the dependent variable is composed, in part, of one of the predictor
variables (or vice versa), the results are not reliable.
One or both of the following will happen:
• You will generate an incredibly high R² value
• The predictor in question will have a DOMINATING t-statistic
Let’s look at the credit data…
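A toy demonstration of the symptom (made-up numbers, not the credit data from class), where total spending is predicted from one of its own components:

# Sketch: an "ingredient" variable. total = food + other, and food is
# then used as a predictor of total, so R-squared is inflated and
# food's t-statistic dominates. All data are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
food = rng.uniform(100, 500, size=100)
other = rng.uniform(0, 50, size=100)    # small non-food component
df = pd.DataFrame({"food": food,
                   "income": rng.uniform(2000, 8000, size=100),
                   "total": food + other})

model = smf.ols("total ~ food + income", data=df).fit()
print(model.rsquared)   # incredibly high
print(model.tvalues)    # food dominates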
Multiple Regression
What are the different selection methods and what are the
differences?
• “All In” – a manual process
• “Forward”, “Backward”, and “Stepwise” – these work from the
correlation matrix
Model Parsimony = less is more. You are better off with an R² of .75 and
3 predictors than with an R² of .80 and 10 predictors.
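None of these is a single built-in statsmodels call; a minimal forward-selection sketch (p-value based, with hypothetical column names) could look like this:

# Sketch: forward selection. At each step, add the candidate predictor
# with the smallest coefficient p-value, as long as it is below the
# entry threshold; stop when no candidate qualifies.
import statsmodels.formula.api as smf

def forward_select(df, response, candidates, enter=0.05):
    chosen = []
    remaining = list(candidates)
    while remaining:
        pvals = {}
        for c in remaining:
            formula = f"{response} ~ {' + '.join(chosen + [c])}"
            pvals[c] = smf.ols(formula, data=df).fit().pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= enter:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# e.g. forward_select(df, "y", ["x1", "x2", "x3"])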
Multiple Regression
Additional Topics
1. Transformations
• Logs
• Discretization
• Square/Square Root
2. Dummy Coding
• K−1 new values (for a categorical variable with K levels); see the
sketch after this list
3. Additional Plots
• Predicteds versus Actuals
• Residuals versus Actuals
• Residuals versus Predicteds
• Standardized Residuals
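As referenced under Dummy Coding, a sketch of creating K−1 indicators with pandas (the column name is hypothetical):

# Sketch: dummy coding a K-level categorical column into K-1 indicators.
# drop_first=True drops one level to act as the reference category,
# avoiding perfect multicollinearity with the intercept.
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east", "west", "south"]})
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)   # K = 4 levels -> 3 indicator columns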