PNAIRP WORKSHOP ON MULTIVARIATE LINEAR REGRESSION
Portland, OR, November 2012
Mike Tamada, Director of Institutional Research, Reed College

1. PRELIMINARIES

Goals of this workshop: attendees will learn the basics of multivariate linear regression, including its purposes and uses, its advantages over univariate analyses, common pitfalls, and hints on how to set up, run, and interpret a regression analysis.

Prerequisites: flexible. Ideally you should have decent familiarity with these concepts, which are taught in most introductory statistics classes: scatterplots, correlation, hypothesis testing, the null hypothesis, Type I and II errors, standard errors, and t-tests. If you know about ANOVA (analysis of variance), that's a bonus. Even if you already know a little bit about regression, this workshop will give you some pointers on how to run better regressions.

2. INTRODUCTION TO REGRESSION AND OUR SAMPLE DATA

What is regression? Illustration: fitting a least-squared-error line. Pitfalls of univariate regression and univariate analysis.

The software we'll see today is SPSS, but almost every statistical package can do multivariate linear regression: SAS, Stata, R. Excel can do regressions, albeit a bit clumsily, using the Analysis ToolPak (an add-in which comes with Excel). With some versions of Excel you must instead download StatPlus, a separate statistical application which works with Excel: http://www.analystsoft.com/en/products/statplus/

Our example data file has student data (from several different colleges and years, mixed together and with some perturbing of the data). What types of students get higher or lower GPAs in our sample? Sample SPSS syntax:
   freq sex underrep finaid firstgen returned grad4.
   desc satv to hsgpa froshgpa cumgpa.
   means froshgpa by underrep sex finaid firstgen /stat anova.
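To make the least-squared-error idea concrete before we turn to SPSS, here is a small Python sketch (the variable names satv and froshgpa echo the sample data, but the numbers are invented for illustration):

```python
def fit_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
            / sum((xi - mean_x) ** 2 for xi in x)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Invented toy data: SAT verbal scores and freshman GPAs.
satv = [500, 550, 600, 650, 700, 750]
froshgpa = [2.6, 2.9, 3.0, 3.2, 3.3, 3.6]
b0, b1 = fit_line(satv, froshgpa)
print(round(b0, 3), round(b1, 4))
```

Every regression package, SPSS included, is doing this same minimization (with more variables and more output).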
Why multivariate regression (estimated by ordinary least squares, or OLS) is better: it looks at the effects of all the variables simultaneously, disentangling their effects, and it can help suggest which variables are causal (but OLS cannot truly determine causality).

3. THE ONLY MATH-Y AND THEORETICAL PART OF THIS WORKSHOP

The regression equation:

   y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + e

   y is the dependent variable
   x1, x2, x3, ... are the explanatory variables
   b1, b2, b3, ... are the coefficients (slope coefficients) on the explanatory variables
   b0 is the constant term
   e is the error or disturbance term

We often use N for the number of observations and K for the number of explanatory variables (typically including the constant term).

Gauss-Markov Theorem: if the variables have the relationship described by the regression equation, and the disturbances e have mean zero and constant variance and are uncorrelated with each other, then OLS regression is "BLUE" (the Best Linear Unbiased Estimator of the coefficients b0, b1, b2, b3, ...). Strictly speaking, normality is not needed for BLUE; but if the disturbances are in addition independent, identically distributed Normal (Gaussian) random variables, then the usual t-tests and F-tests are exact.

Some key assumptions: the errors are independent (uncorrelated) and identically distributed with constant variance (homoscedasticity). Even more key: the regression equation must correctly describe the relationship between the variables (correct specification: the correct variables, in a linear relationship), and the explanatory variables must be exogenous (the error terms cannot be affected by the values of x1, x2, x3, ...; there must be no simultaneity bias). We will return to these concepts later.

4. FINALLY, A REGRESSION

A univariate regression: froshgpa depends on satv. SPSS syntax:
   regress /variables froshgpa satv /dependent froshgpa /enter.

Interpreting the results.

Another univariate regression: dummy (binary) variables. SPSS syntax:
   regress /var froshgpa firstgen /dep froshgpa /enter.

5. DATA PREPARATION: STRING VARIABLES

But: is it really this easy? No: missing values, and string variables. Preparing the data.
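Looking back at the dummy-variable regression in section 4, a hedged Python illustration (invented numbers) of what its coefficients mean: the slope on a 0/1 dummy such as firstgen is just the difference between the two group means, and the constant is the mean of the excluded (0) group.

```python
def ols_dummy(d, y):
    """Univariate OLS of y on a 0/1 dummy d; returns (constant, slope)."""
    n = len(d)
    md, my = sum(d) / n, sum(y) / n
    b1 = sum((di - md) * (yi - my) for di, yi in zip(d, y)) \
         / sum((di - md) ** 2 for di in d)
    b0 = my - b1 * md
    return b0, b1

firstgen = [0, 0, 0, 1, 1, 1]             # invented toy data
froshgpa = [3.2, 3.4, 3.0, 2.8, 3.0, 2.9]
b0, b1 = ols_dummy(firstgen, froshgpa)
mean0 = (3.2 + 3.4 + 3.0) / 3             # group mean, firstgen = 0
mean1 = (2.8 + 3.0 + 2.9) / 3             # group mean, firstgen = 1
print(round(b0, 3), round(b1, 3))
```

So a univariate dummy regression reproduces a simple comparison of group means; the payoff of multivariate regression is that the same slope, with other variables in the model, becomes an adjusted group difference.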
String variables: these have to be converted to numeric. Dummy variables (binary variables). SPSS syntax:
   compute female = 0.
   if sex eq 'F' female = 1.
   compute underrep_minority = 0.
   if underrep eq 'underrep' underrep_minority = 1.

Important: the excluded value. E.g. if you have a variable ethnicity with values WHI, BLK, ASN, HIS, and OTH/UNK, then you can create dummy variables with names such as black, asian, hispanic, and othrunk. SPSS syntax:
   compute black = 0.
   if ethnicity eq 'BLK' black = 1.
   compute asian = 0.
   (etc.)

DO NOT CREATE A VARIABLE "WHITE" AND INCLUDE IT IN THE REGRESSION. ONE CATEGORY MUST BE EXCLUDED FROM THE REGRESSION.

What happens if we fail to exclude a category, e.g. we include dummy variables for both females and males? The regression cannot run (perfect collinearity); SPSS will gracefully drop one of the variables for you! SPSS syntax:
   compute male = 0.
   if sex eq 'M' male = 1.
   reg /var froshgpa female male /dep froshgpa /enter.

Note: the choice of which value to exclude is largely arbitrary – the regression is essentially the same regardless. Which contrasts, i.e. which slope coefficients, are you most interested in?

6. DATA PREPARATION: MISSING VALUES

Instead of losing many observations due to missing values, set the missing values to zero and create a dummy variable to indicate when that zero denotes a missing value (and not a true zero value). SPSS syntax:
   compute satvnew = satv.
   compute satvmiss = 0.
   do if missing(satv).
   compute satvnew = 0.
   compute satvmiss = 1.
   end if.
   means satvnew by satvmiss.
   compute satmnew = satm.
   compute satmmiss = 0.
   do if missing(satm).
   compute satmnew = 0.
   compute satmmiss = 1.
   end if.
   compute sat2new = sat2.
   compute sat2miss = 0.
   do if missing(sat2).
   compute sat2new = 0.
   compute sat2miss = 1.
   end if.
   compute actnew = act.
   compute actmiss = 0.
   do if missing(act).
   compute actnew = 0.
   compute actmiss = 1.
   end if.
   compute hspercnew = hsperctl.
   compute hsperctlmiss = 0.
   do if missing(hsperctl).
   compute hspercnew = 0.
   compute hsperctlmiss = 1.
   end if.
   compute hsgpanew = hsgpa.
   compute hsgpamiss = 0.
   do if missing(hsgpa).
   compute hsgpanew = 0.
   compute hsgpamiss = 1.
   end if.

7. FINALLY, TRUE MULTIVARIATE REGRESSION

Let's look at satv and satm. SPSS syntax:
   reg /var froshgpa satvnew satmnew satvmiss satmmiss /dep froshgpa /enter.

Better syntax:
   reg /var froshgpa satvnew satmnew satvmiss /dep froshgpa /enter.

8. INTERPRETING THE RESULTS

N. The F-test. R, R-squared, and adjusted R-squared. T-statistics. The standard error of the regression, or RMSE. And most importantly: the estimates of the slope coefficients (B) and the standard errors of those estimates. Standardized coefficients. What does a slope mean in a multivariate context? "Ceteris paribus." What about b0, the constant term?

9. WHAT VARIABLES SHOULD WE USE IN THE REGRESSION?

It is tempting to play around, e.g. let's add hsgpa. SPSS syntax:
   reg /var froshgpa satvnew satmnew hsgpanew satvmiss hsgpamiss /dep froshgpa /enter.

Hmm, what about hsgpa alone? SPSS syntax:
   reg /var froshgpa hsgpanew hsgpamiss /dep froshgpa /enter.

Model specification: in this context, which variables are in the equation. Don't play! Type I errors. SPSS lets you do stepwise regression, but that is also not a good idea. Data fishing: "If you torture the data enough, it will confess."

The better procedure: research the literature (plus use common sense) to decide which variables are of interest, state your hypothesis, choose the variables accordingly, and run the regression – once (SPSS's "enter" subcommand forces SPSS to run your exact model). Your t-statistics (and F-statistic) will then be valid. But afterward, by all means do a sensitivity analysis: run other models with other variables; do your conclusions hold up?

Regression hint: always include a constant term in your regression.

10. PITFALLS IN MODEL SPECIFICATION: COLLINEARITY

It's not always easy to follow the ideal procedure. Collinearity (multicollinearity).
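Returning to the statistics listed in section 8: a hedged Python sketch (invented numbers) of how R-squared and adjusted R-squared are computed from a regression's fitted values. Following the handout's convention, K counts the explanatory variables including the constant.

```python
def r_squared(y, fitted, k):
    """R-squared and adjusted R-squared from observed and fitted values."""
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)             # total variation
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # unexplained
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes for the number of estimated coefficients.
    adj_r2 = 1 - (ss_res / (n - k)) / (ss_tot / (n - 1))
    return r2, adj_r2

y      = [2.6, 2.9, 3.0, 3.2, 3.3, 3.6]   # invented observed GPAs
fitted = [2.7, 2.8, 3.1, 3.1, 3.4, 3.5]   # pretend regression output
r2, adj = r_squared(y, fitted, k=2)
print(round(r2, 3), round(adj, 3))
```

Note that adjusted R-squared is always below R-squared, and the gap widens as K grows relative to N; this is the arithmetic behind the overfitting warning in section 11.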
Perfect collinearity: failing to exclude a category when creating dummy variables. Satvmiss and satmmiss.

How to detect collinearity: looking at the correlations between the variables is simple and easy – but not thorough. SPSS syntax:
   corr satvnew satmnew satvmiss satmmiss actnew hsgpanew hspercnew female underrep_minority firstgen finaid.

But does collinearity mean that the model is badly specified? If the model is properly specified, then the proper solution is to get a larger sample.

11. PITFALLS IN MODEL SPECIFICATION: TOO MANY VARIABLES

Overfitting. When N = K, R-squared = 1.00! SPSS syntax:
   temp.
   select if pidm ge 50220000 and pidm le 50234311.
   reg /var froshgpa satvnew satvmiss female firstgen /dep froshgpa /enter.

How many variables is too many? Look at N-K, and look at N/K.

12. SOME HINTS ON SPECIFYING MODELS: SCALE

Re-scaling variables: divide or multiply by 10 or 100 to put the variables on similar scales. This increases the accuracy of the software's calculations and makes the coefficients easier to compare and interpret. SPSS syntax:
   compute satv100 = satvnew/100.
   compute satm100 = satmnew/100.
   reg /var froshgpa satv100 satm100 satvmiss hsgpa /dep froshgpa /enter.

13. SOME HINTS ON SPECIFYING MODELS: NON-LINEARITY

Non-linear functional forms: logarithms and square roots. These can be especially useful if the variable (y or x) is something like income or price. Exponents and squared values can be used too, but are less frequently useful.

Squared terms: if we're uncertain whether the relationship is linear, we can add a squared term as well as the linear term. Illustration: squared terms permit us to fit non-linear relationships. SPSS syntax:
   compute hsgpanew2 = hsgpanew*hsgpanew.
   reg /var froshgpa satv100 satm100 hsgpanew2 hsgpanew satvmiss hsgpamiss /dep froshgpa /enter.

Slope dummies: we can flexibly fit a piecewise linear relationship. SPSS syntax:
   compute hsgpanew3.8 = 0.
   if hsgpa ge 3.8 hsgpanew3.8 = hsgpanew - 3.8.
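The slope-dummy transform just defined can be sketched in Python (invented data): the new variable is zero below the knot at 3.8 and grows one-for-one with hsgpa above it, which is what lets the fitted line change slope at the knot.

```python
def slope_dummy(x, knot=3.8):
    """Piecewise term: 0 below the knot, (x - knot) above it."""
    return max(0.0, x - knot)

hsgpa = [3.2, 3.6, 3.8, 3.9, 4.0]   # invented toy data
hsgpanew38 = [slope_dummy(g) for g in hsgpa]
print([round(v, 2) for v in hsgpanew38])
```

When both hsgpanew and this term enter the regression, the coefficient on the slope dummy is the *change* in slope above 3.8, not the slope itself.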
   reg /var froshgpa satv100 satm100 hsgpanew3.8 hsgpanew satvmiss hsgpamiss /dep froshgpa /enter.

Illustration: the resulting regression output yields this relationship between freshman GPA and HS GPA (ceteris paribus, i.e. holding constant all other variables).

14. INTERACTIVE EFFECTS

Often they are not worth adding to the model, because they add a lot of complexity to the interpretation and increase collinearity. SPSS syntax:
   reg /var froshgpa satv100 satm100 female satvmiss satmmiss /dep froshgpa /enter.
   compute satvfem = satv100 * female.
   compute satmfem = satm100 * female.
   reg /var froshgpa satv100 satm100 female satvfem satmfem satvmiss satmmiss /dep froshgpa /enter.

As with squared terms, interaction terms permit the functional form to be highly flexible (yet still linear) – but at the cost of complicating the interpretation of the results, as well as possible collinearity.

15. A MODEL WHICH LOOKS AT SEVERAL VARIABLES

SPSS syntax:
   compute sat2.100 = sat2new/100.
   reg /var froshgpa satv100 satm100 sat2.100 hsgpanew female underrep_minority finaid firstgen satvmiss sat2miss hsgpamiss /dep froshgpa /enter.

16. EQUIVALENT MODELS

Demonstration that the dummy coding is arbitrary: male instead of female; 0 vs. 100; 1 vs. 3. SPSS syntax:
   reg /var froshgpa satv100 satm100 female satvmiss /dep froshgpa /enter.
   reg /var froshgpa satv100 satm100 male satvmiss /dep froshgpa /enter.
   compute female = 100 * female.
   reg /var froshgpa satv100 satm100 female satvmiss /dep froshgpa /enter.
   compute female = 1.
   if sex eq 'F' female = 3.
   reg /var froshgpa satv100 satm100 female satvmiss /dep froshgpa /enter.
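A hedged numerical check of section 16's claim, in Python with invented data: recoding the dummy from 0/1 to 1/3 halves the slope (because the coding's spread doubles) and shifts the constant, but the fitted values – and hence the model – are unchanged.

```python
def ols(x, y):
    """Univariate OLS; returns (constant, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
         / sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

y = [3.2, 3.4, 3.0, 2.8, 3.0, 2.9]   # invented data
female01 = [0, 0, 0, 1, 1, 1]        # 0/1 coding of the dummy
female13 = [1, 1, 1, 3, 3, 3]        # 1/3 coding of the same dummy
a0, a1 = ols(female01, y)
c0, c1 = ols(female13, y)
fitted_a = [a0 + a1 * x for x in female01]
fitted_b = [c0 + c1 * x for x in female13]
print(round(a1, 3), round(c1, 3))
print([round(f, 3) for f in fitted_a] == [round(f, 3) for f in fitted_b])
```

The same logic applies to the 0-vs-100 recode and to swapping male for female: the coefficients re-scale or flip sign, but R-squared, the F-statistic, and the predictions are identical.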