Multiple Regression Diagnostics 2
Sociology 8811
Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission

Announcements
• None

Multiple Regression Assumptions
• 3. d. Predictors (Xis) are uncorrelated with error
– This most often happens when we leave out an important variable that is correlated with another Xi
– Example: Predicting job prestige with family wealth, but not including education
– Omission of education will affect the error term. Those with lots of education will have large positive errors.
• Since wealth is correlated with education, it will be correlated with that error!
– Result: the coefficient for family wealth will be biased.

Multiple Regression Assumptions
• 4. In systems of equations, error terms of equations are uncorrelated
• Knoke, p. 256
– This is not a concern for us in this class
• Worry about that later!

Multiple Regression Assumptions
• 5. Sample is independent, errors are random
– Not only should errors not increase with X (heteroskedasticity), there should be no pattern at all!
• Cases that are non-independent often have correlated error
• Things that cause patterns in error (autocorrelation):
– Measuring data over long periods of time (e.g., every year). Error from nearby years may be correlated.
• Called "serial correlation".

Multiple Regression Assumptions
• More things that cause patterns in error (autocorrelation):
– Measuring data on families. All members are similar and will have correlated error
– Measuring data in geographic space
• Example: data on 50 US states. States in the same region have correlated error
• Called "spatial autocorrelation"
• There are variations of regression models to address each kind of correlated error.

Regression: Outliers
• Note: Even if regression assumptions are met, slope estimates can have problems
• Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample
• More formally: "influential cases"
• Outliers can result from:
• Errors in coding or data entry
• Highly unusual cases
• Or, sometimes they reflect important "real" variation
• Even a few outliers can dramatically change estimates of the slope, especially if N is small.

Regression: Outliers
• Outlier Example:
[Figure: scatterplot in which one extreme case pulls the regression line upward, shown against the regression line with the extreme case removed from the sample]

Regression: Outliers
• Strategy for identifying outliers:
• 1. Look at regression partial plots (avplots) for extreme values
• 2. Compute outlier diagnostic statistics
– High values indicate potential outliers
• "Leverage"
• Cook's D
• DFFIT
• DFBETA
• Residuals, standardized residuals, studentized residuals

Scatterplots
• Example: Study time and student achievement
– X variable: Average # hours spent studying per day
– Y variable: Score on reading test

Case    X      Y
1       2.60   28
2       1.40   13
3        .65   17
4       4.10   31
5        .25    8
6       1.90   16
7       3.50    6

[Figure: scatterplot of test score (Y axis) against hours studied (X axis)]

Outliers
• Results with outlier:

Model Summary (Dependent Variable: TESTSCOR; Predictors: (Constant), HRSTUDY)
R = .466   R Square = .217   Adjusted R Square = .060   Std. Error of the Estimate = 9.1618

Coefficients (Dependent Variable: TESTSCOR)
             Unstandardized B   Std. Error   Standardized Beta      t     Sig.
(Constant)       10.662            6.402                          1.665   .157
HRSTUDY           3.081            2.617          .466            1.177   .292
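• A minimal Stata sketch of this example (not from the original slides): the seven cases are entered by hand, and the variable names hrstudy and testscor simply mirror the SPSS output above.

    * Enter the seven-case study-time example by hand
    clear
    input hrstudy testscor
    2.60 28
    1.40 13
    0.65 17
    4.10 31
    0.25  8
    1.90 16
    3.50  6
    end

    * Bivariate regression: the "results with outlier" model
    regress testscor hrstudy

    * Scatterplot with the fitted line, to eyeball extreme cases
    twoway (scatter testscor hrstudy) (lfit testscor hrstudy)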
Outlier Diagnostics
• Residuals: The numerical value of the error
• Error = the distance that a point falls from the line
• Cases with unusually large error may be outliers
• Note: residuals have many other uses!
• Standardized residuals
– Z-score of residuals… converts to a neutral unit
– Often, standardized residuals larger than 3 are considered worthy of scrutiny
• But, it isn't the best outlier diagnostic
– Studentized residuals: recalculate the standardized residual after removing the case from the analysis.

Outlier Diagnostics
• Cook's D: Identifies cases that are strongly influencing the regression line
– SPSS calculates a value for each case
• Go to the "Save" menu, click on Cook's D
• (A Stata sketch at the end of this outlier discussion shows one way to compute these diagnostics)
• How large of a Cook's D is a problem?
– Rule of thumb: values greater than 4 / (n - k - 1)
– Example: N = 7, k = 1: cut-off = 4/5 = .80
– Cases with higher values should be examined.

Outlier Diagnostics
• Example: Outlier/Influential Case Statistics

Hours   Score   Resid    Std Resid   Cook's D
2.60    28       9.32      1.01       .124
1.40    13      -1.97      -.215      .006
 .65    17       4.33       .473      .070
4.10    31       7.70       .841      .640
 .25     8      -3.43      -.374      .082
1.90    16       -.515     -.056      .0003
3.50     6     -15.4      -1.68       .941

Outliers
• Results with outlier removed:

Model Summary (Dependent Variable: TESTSCOR; Predictors: (Constant), HRSTUDY)
R = .903   R Square = .816   Adjusted R Square = .770   Std. Error of the Estimate = 4.2587

Coefficients (Dependent Variable: TESTSCOR)
             Unstandardized B   Std. Error   Standardized Beta      t     Sig.
(Constant)        8.428            3.019                          2.791   .049
HRSTUDY           5.728            1.359          .903            4.215   .014

Regression: Outliers
• Question: What should you do if you find outliers? Drop outlier cases from the analysis? Or leave them in?
– Obviously, you should drop cases that are incorrectly coded or erroneous
– But, generally speaking, you should be cautious about throwing out cases
• If you throw out enough cases, you can produce any result that you want! So, be judicious when destroying data.

Regression: Outliers
• Circumstances where it can be good to drop outlier cases:
• 1. Coding errors
• 2. Single extreme outliers that radically change results
– Your results should reflect the dataset, not one case!
• 3. If there is a theoretical reason to drop cases
– Example: In an analysis of economic activity, communist countries may be outliers
• If the study is about "capitalism", they should be dropped.

Regression: Outliers
• Circumstances when it is good to keep outliers
• 1. If they form a meaningful cluster
– Often suggests an important subgroup in your data
• Example: Asian-Americans in a dataset on education
• In such a case, consider adding a dummy variable for them
– Unless, of course, the research design is not interested in that sub-group… then drop them!
• 2. If there are many
– Maybe they reflect a "real" pattern in your data.

Regression: Outliers
• When in doubt: Present results both with and without outliers
– Or present one set of results, but mention how results differ depending on how outliers were handled
• For final projects: Check for outliers!
• At least with scatterplots
– In the text: Mention whether there were outliers, how you handled them, and the effect it had on results.
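• A hedged Stata sketch of the outlier diagnostics discussed above, continuing the hypothetical hrstudy/testscor data entered earlier (the names of the saved variables are my own):

    * Refit the bivariate model, then save case-level diagnostics
    regress testscor hrstudy
    predict resid, residuals      // raw residuals
    predict stdres, rstandard     // standardized residuals
    predict studres, rstudent     // studentized residuals
    predict cooksd, cooksd        // Cook's D for each case

    * Flag cases above the 4/(n-k-1) rule of thumb (here 4/5 = .80)
    list hrstudy testscor resid stdres studres cooksd if cooksd > 4/(e(N) - e(df_m) - 1)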
Multicollinearity
• Another common regression problem: Multicollinearity
• Definition: collinear = highly correlated
• Multicollinearity = inclusion of highly correlated independent variables in a single regression model
• Recall: High correlation of X variables causes problems for estimation of slopes (b's)
• Recall: the denominators in the slope formulas approach zero, so coefficients may be wrong or implausibly large.

Multicollinearity
• Multicollinearity symptoms:
• Unusually large standard errors and betas
• Compared to models in which the collinear variables are not included together
• Betas often exceed 1.0
• Two variables have the same large effect when included separately… but…
– When put together, the effects of both variables shrink
– Or, one remains positive and the other flips sign
• Note: Not all "sign flips" are due to multicollinearity!

Multicollinearity
• What does multicollinearity do to models?
• Note: It does not violate regression assumptions
• But, it can mess things up anyway
• 1. Multicollinearity can inflate standard error estimates
• Large standard errors = small t-values = no rejected null hypotheses
• Note: Only the collinear variables are affected. The rest of the model results are OK.

Multicollinearity
• What does multicollinearity do?
• 2. It leads to instability of coefficient estimates
• Variable coefficients may fluctuate wildly when a collinear variable is added
• These fluctuations may not be "real", but may just reflect amplification of "noise" and "error"
– One variable may be only slightly better at predicting Y… but SPSS will give it a MUCH higher coefficient.

Multicollinearity
• Diagnosing multicollinearity:
• 1. Look at correlations of all independent vars
• Correlation > .8 is a concern
• But, problems aren't always bivariate… and don't always show up in bivariate correlations
– Ex: If you forget to omit a dummy variable
• 2. Watch out for the "symptoms"
• 3. Compute diagnostic statistics
• Tolerance, VIF (Variance Inflation Factor). (A Stata sketch at the end of this section shows one way to compute them.)

Multicollinearity
• Multicollinearity diagnostic statistics:
• "Tolerance": Easily computed in SPSS
• Low values indicate possible multicollinearity
– Start to pay attention at .3; below .1 is very likely to be a problem
• Tolerance is computed for each independent variable by regressing it on the other independent variables
– VIF = inverse of tolerance

Multicollinearity
• If you have 3 independent variables: X1, X2, X3…
– Tolerance is based on doing a regression: X1 is dependent; X2 and X3 are independent.
• Tolerance for X1 is simply 1 minus the regression R-square.
• If a variable (X1) is highly correlated with all the others (X2, X3), then they will do a good job of predicting it in a regression
• Result: The regression R-square will be high… 1 minus R-square will be low… indicating a problem.

Multicollinearity
• Variance Inflation Factor (VIF) is the reciprocal of tolerance: 1/tolerance
• High VIF indicates multicollinearity
– Gives an indication of how much the variance (and thus the standard error) of a coefficient grows due to the presence of the other variables.

Multicollinearity
• Solutions to multicollinearity
– It can be difficult if a fully specified model requires several collinear variables
• 1. Drop unnecessary variables
• 2. If two collinear variables are really measuring the same thing, drop one or make an index
– Example: Attitudes toward recycling; attitudes toward pollution. Perhaps they both reflect "environmental views"
• 3. Advanced techniques: e.g., Ridge regression
• Uses a more efficient estimator (but not BLUE: it may introduce bias).
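• A minimal Stata sketch of the tolerance/VIF diagnostics, assuming placeholder variable names (y, x1, x2, x3) rather than any actual dataset:

    * Fit the full model, then request variance inflation factors
    regress y x1 x2 x3
    estat vif                     // reports VIF and 1/VIF (= tolerance) for each predictor

    * Tolerance for x1 "by hand": regress it on the other predictors
    regress x1 x2 x3
    display "Tolerance for x1 = " 1 - e(r2)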
"Robust" Standard Errors
• Robust / Huber / White / sandwich standard errors
– An alternative method of estimating regression SEs
• More accurate under conditions of heteroskedasticity
• Potentially more accurate under conditions of non-independence (clustered data)
• Potentially more accurate when the model is underspecified
• Stata: regress y x1 x2, robust
– Increasingly common… Some people now use them routinely…
• But, Freedman (2006) criticizes their use for underspecification:
• What use are SEs if the model is underspecified and therefore the slopes are biased?

Models and "Causality"
• Issue: People often use statistics to support theories or claims regarding causality
– They hope to "explain" some phenomena
• What factors make kids drop out of school
• Whether or not discrimination leads to wage differences
• What factors make corporations earn higher profits
• Statistics provide information about association
• Always remember: Association (e.g., correlation) is not causation!
• The old aphorism is absolutely right
• Association can always be spurious

Models and "Causality"
• How do we determine causality?
• The randomized experiment is held up as the ideal way to determine causality
• Example: Does drug X cure cancer?
• We could look for an association between receiving drug X and cancer survival in a sample of people
• But: Association does not demonstrate causation; the effect could be spurious
• Example: Perhaps rich people have better access to drug X; and rich people have more skilled doctors!
• Can you think of other possible spurious processes?

Models and "Causality"
• In a randomized experiment, people are assigned randomly to take drug X (or not)
• Thus, taking drug X is totally random and totally uncorrelated with any other factor (such as wealth, gender, access to high-quality doctors, etc.)
• As a result, the association between drug X and cancer survival cannot be affected by any spurious factor
• Nor can "reverse causality" be a problem
• SO: We can make strong inferences about causality!

Models and "Causality"
• Unfortunately, randomized experiments are impractical (or unethical) in many cases
• Example: Consequences of high-school dropout, national democracy, or the impact of homelessness
• Plan B: Try to "control" for spurious effects:
• Option 1: Create homogeneous sub-groups
– Effects of Drug X: If there is a spurious relationship with wealth, compare people with comparable wealth
• Ex: Look at the effect of drug X on cancer survival among people of roughly constant wealth… eliminating the spurious effect.

Models and "Causality"
• Option 2: Use a multivariate model to "control" for spurious effects
• Examine the effect of the key variable "net" of other relationships
– Ex: Look at the effect of Drug X, while also including a variable for wealth (see the Stata sketch after this discussion)
• Result: The coefficient for Drug X represents its effect net of wealth, avoiding spuriousness.

Models and "Causality"
• Limitations of "controls" to address spuriousness
• 1. The "homogeneous sub-groups" approach reduces N
• To control for many possible spurious effects, you'll throw away lots of data
• 2. You have to control for all possible spurious effects
• If you overlook any important variable, your results could be biased… leading to incorrect conclusions about causality
• First: It is hard to measure and control for everything
• Second: Someone can always think up another thing you should have controlled for, undermining your causal claims.
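• A hedged Stata sketch of "Option 2", using made-up variable names (survival, drugx, wealth) purely for illustration; the robust option from the earlier slide is shown for completeness:

    * Effect of drug X on a survival outcome, "net" of wealth
    * (wealth is included to block the possible spurious path)
    regress survival drugx wealth

    * Same model with robust (Huber/White/sandwich) standard errors
    regress survival drugx wealth, vce(robust)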
Models and "Causality"
• Under what conditions can a multivariate model support statements about causality?
• In theory: A multivariate model can support claims about causality… IF:
• The sample is unbiased
• The measurement is accurate
• The model includes controls for every major possible spurious effect
• The possibility of reverse causality can be ruled out
• And, the model is executed well: assumptions, outliers, multicollinearity, etc. are all OK.

Models and "Causality"
• In practice: Scholars commonly make tentative assertions about causality… IF:
• The data set is of high quality; the sample is either random or arguably not seriously biased
• Measures are of high quality by the standards of the literature
• The model includes controls for the major possible spurious effects discussed in the prior literature
• The possibility of reverse causality is arguably unlikely
• And, the model is executed well: assumptions, outliers, multicollinearity, etc. are all acceptable… (OR, the author uses the variants of regression necessary to address problems).

Models and "Causality"
• In sum: Multivariate analysis is not the ideal tool to determine causality
• If you can run an experiment, do it
• But: Multivariate models are usually the best tool that we have!
• Advice: Multivariate models are a terrific way to explore your data
• Don't forget: "correlation is not causation"
• The models aren't magic; they simply sort out correlations
• But, if used thoughtfully, they can provide hints about likely causal processes!