PNAIRP WORKSHOP ON MULTIVARIATE LINEAR REGRESSION
Portland, OR
November 2012
Mike Tamada
Director of Institutional Research
Reed College
1. PRELIMINARIES
Goals of this workshop: workshop attendees will learn the basics of multivariate linear
regression, including the purposes and uses, advantages over univariate analyses,
common pitfalls, and hints about how to set up, run and interpret regression analysis.
Prerequisites: flexible. Ideally you should have decent familiarity with these concepts,
which are taught in most introductory statistics classes: scatterplots, correlation,
hypothesis testing, the null hypothesis, Type I and II errors, standard errors, t-tests.
If you know about ANOVA (analysis of variance) that’s a bonus. Even if you already
know a little bit about regression this workshop will give you some pointers about how to
run better regressions.
2. INTRODUCTION TO REGRESSION AND OUR SAMPLE DATA
What is regression?
Illustration: fitting a least-squares line (the line that minimizes the squared errors).
Pitfalls of univariate regression and univariate analysis.
The software we’ll see today is SPSS. Almost every statistical package can do
multivariate linear regression: SAS, Stata, R.
Excel can do regressions, albeit a bit clumsily, using the Analysis ToolPak (the "Data
Analysis" add-in which comes with Excel). With some versions of Excel, you must
instead download StatPlus, a separate statistical application which works with Excel:
http://www.analystsoft.com/en/products/statplus/
Our example data file has student data (from some different colleges and years, mixed
together and with some perturbing of the data). What types of students get higher or
lower GPAs in our sample?
Sample SPSS syntax:
* Frequencies of the categorical variables.
freq sex underrep finaid firstgen returned grad4.
* Descriptive statistics for the numeric variables.
desc satv to hsgpa froshgpa cumgpa.
* Mean froshgpa broken out by each grouping variable, with ANOVA tests.
means froshgpa by underrep sex finaid firstgen /stat anova.
Why multivariate regression (estimated by ordinary least squares, or OLS) is better: it
looks at the effects of all variables simultaneously, disentangling their effects, and it can
help determine which variables are causal (though OLS by itself cannot truly determine
causality).
3. THE ONLY MATH-Y AND THEORETICAL PART OF THIS WORKSHOP
Regression equation: y = b0 + b1*x1 + b2*x2 + b3*x3 + … + e
y is the dependent variable
x1, x2, x3 … are the explanatory variables
b1, b2, b3 … are the coefficients or slope coefficients on the explanatory variables
b0 is the constant term
“e” is the error or disturbance term
We often use “N” to mean the number of observations, and “K” to mean the number of
explanatory variables (typically including the constant term).
Gauss-Markov Theorem: if the variables have the relationship described by the
regression equation, and the disturbances "e" have mean zero and constant variance and
are uncorrelated with each other, then OLS regression is "BLUE" (the Best Linear
Unbiased Estimator of the coefficients b0, b1, b2, b3 …). Normality of the disturbances
is not required for this theorem, but it is what justifies the usual t-tests and F-tests.
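For reference (this is the math-y part): stacking the N observations, OLS has a closed-form solution. With X the N-by-K matrix of explanatory variables (including a column of 1s for the constant term) and y the N-by-1 vector of the dependent variable, the estimated coefficients are b = (X'X)^(-1) X'y, where X' is the transpose of X. The software solves this for us.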
Some key assumptions: the errors are independent (uncorrelated errors) and identically
distributed (homoscedasticity: the same variance everywhere), and Normally distributed
if we want to run hypothesis tests. Even more key: the regression equation must
correctly describe the relationship between the variables (correct specification: the
correct variables have a linear relationship), and the explanatory variables must be
non-random (they must be exogenous; the error terms cannot be affected by the values of
x1, x2, x3 …; there must be no simultaneity bias). We will return to these concepts later.
4. FINALLY, A REGRESSION
A univariate regression: froshgpa depends on satv.
SPSS syntax:
Regress /variables froshgpa satv
/dependent froshgpa
/enter.
Interpreting the results.
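To make the reading concrete (with made-up numbers, purely for illustration, not output from our data file): if the constant were B = 1.90 and the satv coefficient were B = 0.0020, the fitted equation would be froshgpa = 1.90 + 0.0020*satv, so two students 100 satv points apart are predicted to differ by 0.0020*100 = 0.20 grade points.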
Another univariate regression: dummy (binary) variables.
SPSS syntax:
Regress /var froshgpa firstgen /dep froshgpa /enter.
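With a single dummy variable, the regression reproduces a comparison of means: the fitted equation is froshgpa = b0 + b1*firstgen, so b0 is the mean freshman GPA of the non-first-generation students, b0 + b1 is the mean for the first-generation students, and b1 is exactly the difference between the two means (its t-test is equivalent to a two-group t-test).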
5. DATA PREPARATION: STRING VARIABLES
But: is it really this easy? No: missing values, and string variables.
Preparing the data.
String variables: have to convert to numeric.
Dummy variables (binary variables)
SPSS syntax:
compute female = 0.
if sex eq 'F' female = 1.
compute underrep_minority = 0.
if underrep eq 'underrep' underrep_minority = 1.
Important: one category must be the excluded value (the reference category).
E.g. if you have a variable ethnicity with values WHI, BLK, ASN, HIS, and OTH/UNK,
then you can create dummy variables with names such as Black, Asian, Hispanic, and
OthrUnk.
SPSS syntax:
compute black = 0.
if ethnicity eq 'BLK' black = 1.
compute asian = 0.
if ethnicity eq 'ASN' asian = 1.
* ...and likewise for hispanic and othrunk.
DO NOT CREATE A VARIABLE “WHITE” AND INCLUDE IT IN THE
REGRESSION. ONE CATEGORY MUST BE EXCLUDED FROM THE
REGRESSION.
What happens if we fail to exclude a category, e.g. we include dummy variables for both
females and males? The regression cannot run (perfect collinearity); SPSS will
gracefully drop one of the variables for you!
SPSS syntax:
compute male = 0.
if sex eq 'M' male = 1.
Reg /var froshgpa female male /dep froshgpa /enter.
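Why the collinearity is perfect: female + male = 1 for every student, so female and male together exactly duplicate the constant term, and the coefficients cannot be uniquely determined.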
Note: the choice of which value to exclude is largely arbitrary – the regression is
essentially the same regardless. Choose based on which contrasts, i.e. which slope
coefficients, you are most interested in.
6. DATA PREPARATION: MISSING VALUES
Instead of losing many observations due to missing values, set the missing values to zero
and create a dummy variable flagging the observations where that zero means "missing"
(and is not a true zero value).
SPSS syntax:
compute satvnew = satv.
compute satvmiss = 0.
do if missing(satv).
compute satvnew = 0.
compute satvmiss = 1.
end if.
* Check the recode: satvnew should average 0 in the satvmiss = 1 group.
means satvnew by satvmiss.
compute satmnew = satm.
compute satmmiss = 0.
do if missing(satm).
compute satmnew = 0.
compute satmmiss = 1.
end if.
compute sat2new = sat2.
compute sat2miss = 0.
do if missing(sat2).
compute sat2new = 0.
compute sat2miss = 1.
end if.
compute actnew = act.
compute actmiss = 0.
do if missing(act).
compute actnew = 0.
compute actmiss = 1.
end if.
compute hspercnew = hsperctl.
compute hsperctlmiss = 0.
do if missing(hsperctl).
compute hspercnew = 0.
compute hsperctlmiss = 1.
end if.
compute hsgpanew = hsgpa.
compute hsgpamiss = 0.
do if missing(hsgpa).
compute hsgpanew = 0.
compute hsgpamiss = 1.
end if.
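The six blocks above are identical except for the variable names, so for reference they can be collapsed using SPSS's DO REPEAT (a compact equivalent, producing the same variables as above):
do repeat rawvar = satv satm sat2 act hsperctl hsgpa
 /newvar = satvnew satmnew sat2new actnew hspercnew hsgpanew
 /missvar = satvmiss satmmiss sat2miss actmiss hsperctlmiss hsgpamiss.
compute newvar = rawvar.
compute missvar = 0.
do if missing(rawvar).
compute newvar = 0.
compute missvar = 1.
end if.
end repeat.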
7. FINALLY, TRUE MULTIVARIATE REGRESSION
Let’s look at satv and satm.
SPSS syntax:
reg /var froshgpa satvnew satmnew satvmiss satmmiss /dep froshgpa /enter.
Better syntax: nearly every student who is missing satv is also missing satm, so satvmiss
and satmmiss are almost perfectly collinear, and we include only one of them:
reg /var froshgpa satvnew satmnew satvmiss /dep froshgpa /enter.
8. INTERPRETING THE RESULTS
N. F-test. R, R-squared, and adjusted R-squared. T-stats. Std. Error of the Regression, or
RMSE. And most importantly, the estimates of the slope coefficients (B) and the
standard errors of these estimates. Standardized coefficients.
What does the slope mean, in a multivariate context? "Ceteris paribus": the effect of a
one-unit change in that variable, holding the other variables constant.
What about b0, the constant term?
9. WHAT VARIABLES SHOULD WE USE IN THE REGRESSION?
It is tempting to play around, e.g. let’s add hsgpa.
SPSS syntax:
reg /var froshgpa satvnew satmnew hsgpanew satvmiss hsgpamiss /dep froshgpa
/enter.
Hmm, what about hsgpa alone?
SPSS syntax:
reg /var froshgpa hsgpanew hsgpamiss /dep froshgpa /enter.
Model specification: in this context, which variables are in the equation.
Don’t play! Type I errors.
SPSS lets you do stepwise regression, but this is also not a good idea.
Data fishing. “If you torture the data enough, it will confess.”
The better procedure: research the literature (plus use common sense) to decide which
variables are of interest, state your hypothesis, choose the variables accordingly, and run
the regression – once (SPSS’s “enter” sub-command forces SPSS to run your exact
model). Your t-statistics (and F-statistic) will be valid.
But then by all means do a sensitivity analysis: run other models with other variables; do
your conclusions hold up?
Regression hint: always include a constant term in your regression.
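In SPSS this takes care of itself: REGRESSION includes the constant term by default, and only the /ORIGIN subcommand suppresses it (forcing the regression through the origin), which you should almost never use.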
10. PITFALLS IN MODEL SPECIFICATION: COLLINEARITY
It’s not always easy to follow the ideal procedure. Collinearity (multicollinearity).
Perfect collinearity: failing to exclude a category when creating dummy variables.
Near-perfect collinearity: satvmiss and satmmiss (students missing one SAT score are
almost always missing the other).
How to detect collinearity. Looking at the correlations between the variables is simple
and easy – but not thorough.
SPSS syntax:
corr satvnew satmnew satvmiss satmmiss actnew hsgpanew hspercnew
female underrep_minority firstgen finaid.
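A more thorough check is to ask the regression itself for collinearity diagnostics: adding TOL to REGRESSION's /STATISTICS subcommand appends tolerance and VIF (variance inflation factor) columns to the coefficients table. A sketch, reusing the variables from above (note that /statistics must come before /dependent):
reg /var froshgpa satvnew satmnew hsgpanew satvmiss hsgpamiss
 /statistics defaults tol
 /dep froshgpa /enter.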
But does collinearity mean that the model is badly specified? If the model is properly
specified, then the proper solution is to get a larger sample.
11. PITFALLS IN MODEL SPECIFICATION: TOO MANY VARIABLES
Overfitting. N=K. R2 = 1.00!
SPSS syntax:
* TEMPORARY makes the following SELECT IF apply only to the next procedure,
* shrinking the sample so that N is barely larger than K.
temp.
select if pidm ge 50220000 and pidm le 50234311.
reg /var froshgpa satvnew satvmiss female firstgen /dep froshgpa /enter.
How many variables is too many? Look at N-K, and look at N/K.
12. SOME HINTS ON SPECIFYING MODELS: SCALE
Re-scaling variables: divide or multiply by 10 or 100 to put the variables on a similar
scale. This will increase the accuracy of the software’s calculations, and make the
coefficients easier to compare and interpret.
SPSS syntax:
compute satv100 = satvnew/100.
compute satm100 = satmnew/100.
reg /var froshgpa satv100 satm100 satvmiss hsgpa /dep froshgpa /enter.
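Re-scaling changes only the scale of the coefficient, not the substance of the regression: dividing satv by 100 multiplies its slope coefficient (and its standard error) by exactly 100, while the t-statistics, R-squared, and fitted values are unchanged.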
13. SOME HINTS ON SPECIFYING MODELS: NON-LINEARITY
Non-linear functional form.
Logarithms and square roots. These can be especially useful if the variable (y or x) is
something like income or price. Exponents and squared values can be used too, but are
less frequently useful.
Squared terms. If we’re uncertain if the relationship is linear, we can add a squared term
as well as the linear term.
Illustration: squared terms permit us to fit non-linear relationships
SPSS syntax:
compute hsgpanew2 = hsgpanew*hsgpanew.
reg /var froshgpa satv100 satm100 hsgpanew2 hsgpanew satvmiss hsgpamiss
/dep froshgpa /enter.
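To read the output: writing b_lin for the coefficient on hsgpanew and b_sq for the coefficient on hsgpanew2, the fitted contribution of HS GPA is b_lin*hsgpanew + b_sq*hsgpanew^2, so the slope at any given HS GPA is b_lin + 2*b_sq*hsgpanew. The relationship steepens or flattens as HS GPA rises, depending on the sign of b_sq.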
Slope dummies. We can flexibly fit a piecewise linear relationship.
SPSS syntax:
compute hsgpanew3.8 = 0.
if hsgpa ge 3.8 hsgpanew3.8 = hsgpanew - 3.8.
reg /var froshgpa satv100 satm100 hsgpanew3.8 hsgpanew satvmiss hsgpamiss
/dep froshgpa /enter.
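To read the output: for students with HS GPA below 3.8, the slope of freshman GPA with respect to HS GPA is the coefficient on hsgpanew; for students at 3.8 or above, it is the coefficient on hsgpanew plus the coefficient on hsgpanew3.8. Because hsgpanew3.8 equals zero right at 3.8, the two line segments join at the kink.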
Illustration: the resulting regression output yields this relationship between freshman
GPA and HS GPA (ceteris paribus, i.e. holding constant all other variables).
14. INTERACTIVE EFFECTS
Interaction terms (products of two explanatory variables) are often not worth adding to
the model, because they add a lot of complexity to the interpretation and increase
collinearity.
SPSS syntax:
reg /var froshgpa satv100 satm100 female satvmiss satmmiss /dep froshgpa /enter.
compute satvfem = satv100 * female.
compute satmfem = satm100 * female.
reg /var froshgpa satv100 satm100 female satvfem satmfem satvmiss satmmiss
/dep froshgpa /enter.
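To read the output of the second regression: the satv100 slope for men (female = 0) is the coefficient on satv100; for women it is the coefficient on satv100 plus the coefficient on satvfem. The t-tests on satvfem and satmfem therefore test whether the SAT slopes differ by sex.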
As with squared terms, interactive terms permit the functional form to be highly flexible
(yet still linear) – but at the cost of complicating the interpretation of the results as well as
possible collinearity.
15. A MODEL WHICH LOOKS AT SEVERAL VARIABLES
SPSS syntax:
compute sat2.100 = sat2new/100.
reg /var froshgpa satv100 satm100 sat2.100 hsgpanew female underrep_minority finaid
firstgen
satvmiss sat2miss hsgpamiss /dep froshgpa /enter.
16. EQUIVALENT MODELS
Demonstration that the coding of the dummy variable is arbitrary: male instead of
female; 0 vs. 100 instead of 0 vs. 1; 1 vs. 3 instead of 0 vs. 1.
SPSS syntax:
reg /var froshgpa satv100 satm100 female satvmiss /dep froshgpa /enter.
reg /var froshgpa satv100 satm100 male satvmiss /dep froshgpa /enter.
compute female = 100* female.
reg /var froshgpa satv100 satm100 female satvmiss /dep froshgpa /enter.
compute female = 1.
if sex eq 'F' female = 3.
reg /var froshgpa satv100 satm100 female satvmiss /dep froshgpa /enter.
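All four regressions are equivalent: a linear recoding of the dummy changes only its own coefficient and the constant term. For example, multiplying female by 100 divides its slope coefficient by 100, and recoding female from 0/1 to 1/3 (a stretch by a factor of 3 - 1 = 2, plus a shift) halves the slope coefficient while the constant term absorbs the shift. The R-squared, t-statistics, and fitted values are identical in every run.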