PADM 692 | Data Analysis II
Session II: Linear Regression Diagnostics
March 17, 2012
University of La Verne
Soomi Lee, PhD
Copyright © by Soomi Lee. Do not copy or distribute without permission.

Overview
1. Recap: multiple regression
2. Assumptions of the Classical Linear Regression Model (CLRM)
3. Most common problems in regression analysis:
   1. Multicollinearity
   2. Omitted variable bias
   3. Heteroskedasticity
   4. Outliers

Recap: Multiple Regression
• Summary statistics
• Eyeball the relationship between your main IV(s) and the DV
  – You cannot plot two IVs at once; show one IV against the DV at a time.
• Interpretation
  1. Individual coefficients (significance, magnitude)
     – Statistically significant?
     – How big is the effect?
     – Do not forget: "holding other variables constant"
  2. Overall model performance
     – Adjusted R²: how much of the variation in Y does your model explain?
     – F statistic (statistical significance of the model as a whole): are your independent variables jointly important?

Presidential Election Example
DV: incumbent president's vote share (%)
  GROWTH        .705***  (.217)
  INFLATION    -.478     (.339)
  Constant    53.365***  (1.919)
  N            23
  Adjusted R²  .527
  F            13.25***
Note: *** p<0.01; standard errors are in parentheses.
1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?

Abortion Rates Example
DV: abortion rate (per 1,000 women aged 15-44)
  Religion      .0004    (.083)
  Price        -.045**   (.022)
  Income        .002***  (.000)
  Picket       -.109***  (.040)
  Constant    -5.869     (9.182)
  N            50
  Adjusted R²  .498
  F            13.162***
Note: ** p<0.05; *** p<0.01; standard errors are in parentheses.
1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?

British Crime Rates Example
DV: crime rate (per 1,000 people)
  Unemployment       5.352*    (2.703)
  Cars               -.052     (.036)
  Police            -4.204     (7.546)
  Young population   7.941***  (2.176)
  Constant            .309     (36.312)
  N                  42
  Adjusted R²        .589
  F                  15.705***
Note: * p<0.1; ** p<0.05; *** p<0.01; standard errors are in parentheses.
1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?

Concerns
• No mindless data crunching: theory should guide you.
• Limitations of our dataset:
  – The number of observations must exceed the number of IVs.
  – External validity: can we generalize beyond the sample?
  – Internal validity: are you measuring what you want to measure?

Break

Assumptions of the Classical Linear Regression Model (CLRM)
1. Zero average of the population errors
2. Equal variance (homoskedasticity)
3. No autocorrelation
4. No correlation between X and the errors
5. No measurement errors
6. No model specification error
7. Normal distribution of the population error term

Break

Collinearity
• Collinearity: two independent variables are linearly related.
• Multicollinearity: more than two variables are a linear combination of one another.

High Multicollinearity
• High multicollinearity: one of the IVs is highly correlated with one or more of the other IVs.
• Why is it a problem?
  – Redundant information: we cannot separate the effects of X1 and X2 on the DV, so we are unable to estimate their marginal effects.
  – Why? It inflates the standard errors.
  – It is not a signal of a problem in the model itself: OLS is still BLUE in the presence of multicollinearity.
• How to detect it:
  1. Look at the significance of the individual independent variables together with overall model performance: adjusted R² is high, but the individual variables are not statistically significant.
  2. Examine the correlations among the X variables (create a correlation matrix).
• Solutions:
  1. Increase the sample size (not always feasible).
  2. Drop one of the variables.

High Multicollinearity: Example
• British Crime Rates model (estimates as above):
    Unemployment       5.352*    (2.703)
    CARS               -.052     (.036)
    Police            -4.204     (7.546)
    Young population   7.941***  (2.176)
    Constant            .309     (36.312)
    N 42; Adjusted R² .589; F 15.705***
  Note: * p<0.1; *** p<0.01; standard errors are in parentheses.
• CARS is not statistically significant. Possible multicollinearity?
• CARS may be highly correlated with other variables.
• What to do: create a correlation matrix.

Detecting High Multicollinearity
Correlation matrix of the IVs:
                    CARS    Police   Young population   Unemployment
  CARS                1
  Police           -0.639     1
  Young population -0.519   0.620          1
  Unemployment     -0.575   0.810        0.598               1
• Cut-off (rule of thumb): 0.8.
• CARS does not exceed a correlation coefficient of 0.8 with any other IV, but it may still be correlated with a combination of variables.
• We also have a high correlation coefficient (0.810) between Police and Unemployment.

Multicollinearity Diagnostics in SPSS
• Request the display of collinearity statistics:
  Analyze → Regression → Linear → Statistics → Collinearity diagnostics
• SPSS will show "Tolerance" and "VIF" along with the regression results.
• Tolerance (between 0 and 1)
  – The proportion of variance in a predictor that cannot be accounted for by the other predictors: 1 − Ri², where Ri² comes from regressing predictor i on the other predictors.
  – Low tolerance → high multicollinearity.
  – Around 0.3 → pay attention; less than 0.1 → an indication of multicollinearity that is likely to be a problem.
• VIF (Variance Inflation Factor) = 1/tolerance
  – High VIF → high multicollinearity.
  – Greater than 10 → pay attention.
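The menu steps above can also be run as pasted syntax. Below is a minimal sketch for the British crime model; the variable names (crime, unemployment, cars, police, young) are hypothetical, so substitute the names actually used in your data file:

```
* Correlation matrix among the IVs.
CORRELATIONS
  /VARIABLES=unemployment cars police young.

* Regression with collinearity diagnostics; TOL prints
* tolerance and VIF for each predictor.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /DEPENDENT crime
  /METHOD=ENTER unemployment cars police young.
```

Read the Tolerance and VIF columns of the Coefficients table against the rules of thumb above.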
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending. H0: b1 = 0; H1: b1 > 0.
1. Estimate the following model:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
2. Examine whether there is a problem of multicollinearity:
   1) Look at the overall model performance and the individual coefficients.
   2) Create a correlation matrix. Any high correlations?
   3) Compute tolerance and VIF. What do these values tell us about multicollinearity in this model? What do we do about it?

Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following model:
   Wage = a + b1(edu) + b2(experience) + e
2. Examine whether there is a problem of multicollinearity:
   1) Look at the overall model performance and the individual coefficients.
   2) Create a correlation matrix. Any high correlations?
   3) Compute tolerance and VIF. What do these values tell us about multicollinearity in this model? What do we do about it?

Break

Omitted Variable Bias
• Omitted variable bias: the exclusion of a relevant variable (or the inclusion of an irrelevant one).
• This is a model specification error.
• If you have omitted variable bias, you violate the assumption that the regressors are uncorrelated with the errors.

• Suppose the true model is:
  Yi = α + β1X1i + β2X2i + ui
• Our misspecified model excludes X2 (due to ignorance, or because X2 is unavailable):
  Yi = c + d1X1i + vi
• The error term of the misspecified model now contains the effect of X2:
  vi = β2X2i + ui
• So whenever X1 and X2 are correlated, vi is correlated with X1, and our estimate of d1 is biased!

Solution?
• Think about possible measurement errors.
• Re-specify your model: add one or more variables, or exclude possibly irrelevant variables.
• See whether an additional variable changes anything (the significance and size of the predictors, R²).
• Go back to theory.
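A quick way to see this in practice, as the next exercise asks, is to fit the full and the restricted model and compare them side by side. A sketch in pasted syntax, assuming the Medicaid file uses the (hypothetical) variable names medicaid, poverty, age65, and income:

```
* Full model.
REGRESSION
  /DEPENDENT medicaid
  /METHOD=ENTER poverty age65 income.

* Restricted model omitting poverty; compare the coefficients,
* adjusted R-square, and F statistic with the full model.
REGRESSION
  /DEPENDENT medicaid
  /METHOD=ENTER age65 income.
```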
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending. H0: b1 = 0; H1: b1 > 0.
1. Estimate the following models:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
   Medicaid = a + c2(age65) + c3(income) + v
2. Compare the two models:
   1) Compare the adjusted R² and the F statistic.
   2) See whether there are any changes in the statistical significance or magnitude of each regression coefficient.
   3) Do you consider the second model misspecified?

Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following models:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
   Wage = a + c2(experience) + c3(female) + v
2. Compare the two models:
   1) Compare the adjusted R² and the F statistic.
   2) See whether there are any changes in the statistical significance or magnitude of each regression coefficient.
   3) Do you consider the second model misspecified?

Break

Heteroskedasticity
• If the error variance is constant throughout the regression line, the errors are homoskedastic (equal variance).
• If the error variance is not constant throughout the regression line, the assumption of equal variance is violated: the errors are heteroskedastic.
[Figures: one homoskedastic scatter and three heteroskedastic scatters]

Heteroskedasticity: Causes
• Measurement errors
• Omitted variables
  – Suppose the true model is: y = a + b1x1 + b2x2 + u
  – The model we estimate fails to include x2: y = a + c1x1 + v
  – The error term will then capture the effect of x2, so it will be correlated with x2.
• Non-linearity
  – True model: y = a + b1x1² + u
  – Our model: y = a + c1x1 + v
  – The residual will capture the non-linearity and affect the variance accordingly.

Heteroskedasticity: Consequences
• Heteroskedasticity by itself does not cause OLS to be biased or inconsistent.
• OLS is still unbiased, but no longer the best: you no longer have minimum variance → you lose accuracy.
• Heteroskedasticity is also a symptom of omitted variables and measurement errors, and OLS estimators will be biased and inconsistent if you have omitted variables or measurement errors.

Heteroskedasticity: Detection
• Graphical method: plot the standardized residuals (on the Y axis) against the standardized predicted values (on the X axis).
  – Analyze → Regression → Linear → Plots → select *ZRESID (standardized residuals) as the Y variable and *ZPRED (standardized predicted values) as the X variable → Continue.
• If you see a pattern (a funnel shape or a curve), this indicates heteroskedasticity.
[Figures: a satisfactory residual plot vs. a plot showing non-constant variance]

Heteroskedasticity: Solution
• Re-specify your model: you may have omitted one or more important variables.
• Consider using different measurements.

Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending. H0: b1 = 0; H1: b1 > 0.
1. Estimate the following models:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
   Medicaid = a + c2(age65) + c3(income) + v
2. Create and compare residual plots for both models (a syntax sketch follows below).

Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following models:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
   Wage = a + c2(experience) + c3(female) + v
2. Create and compare residual plots for both models.
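A sketch of the pasted syntax behind the menu path above, shown for the full Medicaid model (hypothetical variable names as before); the SCATTERPLOT subcommand plots standardized residuals against standardized predicted values:

```
* Residual-vs-fitted plot for eyeballing heteroskedasticity;
* rerun with /METHOD=ENTER age65 income for the restricted model.
REGRESSION
  /DEPENDENT medicaid
  /METHOD=ENTER poverty age65 income
  /SCATTERPLOT=(*ZRESID,*ZPRED).
```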
Break

Outliers
• Outliers: cases with extreme values.
  – They are influential: removing them substantially changes the estimated coefficients.
• Consequence: estimates are biased, especially when the sample size is small.
• Causes of outliers:
  – Errors in coding or data entry
  – Highly unusual cases
  – Important real variation
[Figure: an extreme case pulls the regression line up; a second line shows the regression with the extreme case removed from the sample]

Detecting Outliers
1. Scatter plots
   – Detect (by eyeballing) whether there are any outliers.
2. Compute Cook's D
   – Identifies cases that strongly influence the regression line.
   – Higher value → potential outlier.
   – Rule of thumb: cut-off = 4/N; if Cook's D > 4/N → pay attention.
   – Example: N = 50 → cut-off: 4/50 = 0.08.
3. Compute DfBeta
   – DfBeta: the change in a regression coefficient that results from the deletion of the i-th case.
   – A DfBeta value is calculated for each case, for each regression coefficient.
   – Higher value → potential outlier.
   – Rule of thumb: pay attention if DfBeta > cut-off = 2/sqrt(N).
   – Example: N = 50 → cut-off = 2/sqrt(50) ≈ 0.28.
4. Compute DfFit
   – DfFit: the change in the predicted value when the i-th case is deleted.
   – Higher value → potential outlier.
   – Rule of thumb: pay attention if DfFit > cut-off = 2×sqrt(K/N), where K is the number of independent variables.
   – Example: N = 50, K = 5 → cut-off = 2×sqrt(5/50) ≈ 0.63.

Example
1. Scatter plots:
   Analyze → Regression → Linear → Plots → Produce all partial plots
2. Compute Cook's D, DfBeta, DfFit:
   Analyze → Regression → Linear → Save → Cook's, DfBeta(s), DfFit
   (Cook's D, DfBeta, and DfFit will be saved as new variables in your data file. Go back to Data View and see.)

Solution?
In the presence of outliers:
• Fit the model with and without the outliers.
• Remove influential observations from the regression analysis. Recall: "They are influential: removing them substantially changes the estimated coefficients."
• You must justify why. Do not destroy your data without justification.

Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending. H0: b1 = 0; H1: b1 > 0.
1. Estimate the following model:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
2. Examine outliers (a syntax sketch follows at the end of these slides):
   1) Create partial scatter plots and eyeball each one for outliers.
   2) Compute Cook's D, DfBeta, and DfFit, and calculate the cut-offs for each. Do we have any outliers?

Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1 = 0; H1: b1 > 0
Hypothesis 2: H0: b2 = 0; H1: b2 > 0
1. Estimate the following model:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
2. Examine outliers:
   1) Create partial scatter plots and eyeball each one for outliers.
   2) Compute Cook's D, DfBeta, and DfFit, and calculate the cut-offs for each. Do we have any outliers?

Go Home.
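For reference before you go: a sketch of the pasted syntax behind the outlier checks in the Medicaid exercise above (hypothetical variable names as before). PARTIALPLOT produces the partial scatter plots, and SAVE writes Cook's D, the DfBetas, and DfFit into new columns of the data file:

```
* Partial regression plots plus saved influence diagnostics;
* compare the saved values with the 4/N, 2/sqrt(N), and
* 2*sqrt(K/N) cut-offs computed by hand.
REGRESSION
  /DEPENDENT medicaid
  /METHOD=ENTER poverty age65 income
  /PARTIALPLOT ALL
  /SAVE COOK DFBETA DFFIT.
```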