POLS 7012 REGRESSION DIAGNOSTICS

Topic: Model fit, regression diagnostics.
STATA commands and features: hettest, dfbeta, lvr2plot, rvfplot, vif
Data sets: demoplus.dta, which includes data on 76 countries
Readings: Alan Agresti and Barbara Finlay (1997). Statistical Methods for the Social Sciences, 3rd ed. Upper Saddle River, NJ: Prentice Hall. [CHAPTER 14]

1. INTRODUCTION

This week we will be using an expanded version of the dataset from last week, concentrating on how well our models fit the data and how robust they are. Substantively, we will again look at the predictors of demonstration activity, but we will also see how well we can explain electoral turnout cross-nationally in the same countries.

2. MODEL FIT

When deciding which variables to include in a model, we usually want to fit a model that includes the variables we are interested in but leaves out variables that have no effect. Parsimony is a virtue in model selection: we choose the model that gives us the most information for the fewest predictor variables. This might mean choosing the simplest model, including only significant terms. If a term is not significant (interpreted rather loosely, i.e. up to a p-value of around 0.10), or does not improve the model, it is normally removed.

For a single model, we use the adjusted R2 statistic to tell us how much of the variance in the dependent variable is ‘explained’ by the independent variables (i.e. how well our model fits). We can also use adjusted R2 to compare nested models. In a set of nested models, each earlier model uses a subset of the variables in the later models. For example:

Model 1: demo = unemploy
Model 2: demo = unemploy + press1
Model 3: demo = unemploy + press1 + conflict
Model 4: demo = unemploy + press1 + conflict + corrupt

EXERCISE 1

Recode press and war into 0/1 dummy variables press1 and conflict as we have done previously (remember to label fully). Run the four models above, noting down the adjusted R2 for each. Which terms lead to a big improvement in model fit, and how would you interpret this?

3. REGRESSION DIAGNOSTICS – NON-LINEARITY

To keep things simple, we will start with a model containing only main effects (though we know from last week that there are probably interactions between some of the variables):

. regress demo unemploy press1 conflict corrupt

In many ways this is a nice model, since almost all the terms are statistically significant and there is a plausible story to explain the coefficients. Whenever we use regression analysis we must check that the model does not violate any of the assumptions of regression analysis; if it does, we may get misleading estimates of the coefficients, the standard errors, or both. Conducting regression diagnostics can be a complicated business, but there are a few checks that are important and relatively easy to perform. Most of them involve looking at or testing different aspects of the residuals (the ε's in the regression equation). This is especially useful when we have a small number of cases, as we do in this dataset; with datasets of several thousand observations (e.g. one based on a sample survey of individuals), some of these checks may be less useful.

To get STATA to produce the residuals from our chosen model (i.e. the last model that we ran), type:

. predict res, resid

This command creates a new variable called res that takes the value of the residual from the model for each case.
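As a quick sanity check, you can summarize the new variable and list the most extreme cases before plotting anything. This is a minimal sketch using variables already in the dataset (res is the residual we just created, nation holds the country names, and the cut-off of 2 is arbitrary):

. summarize res
. list nation res if abs(res) > 2 & !missing(res)

The mean of res should be essentially zero, and the list command shows which countries the model predicts worst.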
Residuals should be randomly distributed around zero (they represent the inherent variation that we cannot predict), so what we want to know is whether there is any pattern to how they are distributed. If there is a pattern, it suggests that our model as it stands may be problematic in some way. A good first step is to plot the residuals against each of the independent variables in the model.

. twoway scatter res unemploy
. twoway scatter res corrupt

Note that there is no point doing this for our two binary variables (conflict and press1). For unemployment, there appears to be no pattern to how the residuals are distributed. However, if we look at the corruption scatterplot, we can see that the residuals for low corruption scores tend to be just above zero, those for medium scores tend to be just below zero, and those for high scores tend to be well above zero.

[Scatterplot: residuals (y-axis) against corruption score (x-axis)]

What this should suggest to us is that corruption may have a non-linear effect on demonstration activity; that is, for different values of corruption the slope of the line is different. The most commonly used way to take account of non-linearity is to use a polynomial regression function:

Y = α + β1X + β2X² + ε

In practice, all we need to do is generate a new variable that is the square (or cube, etc.) of the independent variable that we believe to have a non-linear relationship with the dependent variable.

. generate corrupt2=corrupt*corrupt
. label variable corrupt2 "Corruption squared"

We then simply include this new variable in our model, making sure that we also include the original (non-squared) variable, in our case corrupt.

. regress demo unemploy press1 conflict corrupt corrupt2

      Source |       SS       df       MS              Number of obs =      76
-------------+------------------------------           F(  5,    70) =   13.51
       Model |  84.6685635     5  16.9337127           Prob > F      =  0.0000
    Residual |  87.7493293    70  1.25356185           R-squared     =  0.4911
-------------+------------------------------           Adj R-squared =  0.4547
       Total |  172.417893    75  2.29890524           Root MSE      =  1.1196

------------------------------------------------------------------------------
        demo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    unemploy |   .0331588   .0132818     2.50   0.015     .0066691    .0596485
      press1 |  -.6483971   .3427103    -1.89   0.063    -1.331911    .0351171
    conflict |    1.33142   .4305441     3.09   0.003      .472727    2.190113
     corrupt |  -.4085027   .2179467    -1.87   0.065    -.8431837    .0261783
    corrupt2 |   .0643187   .0251509     2.56   0.013     .0141569    .1144805
       _cons |   1.557482   .5326082     2.92   0.005     .4952281    2.619735
------------------------------------------------------------------------------

The results show that our new corrupt2 variable is statistically significant (p = 0.013), confirming that there is a non-linear relationship between corruption and demonstration activity. The size of the coefficient can be difficult to interpret on its own: what we need to do is calculate, for each level of corruption, the combined impact of corrupt and corrupt2.
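STATA can do this arithmetic for us using the coefficients stored after regress (they are available as _b[varname]). This is a minimal sketch for a corruption level of 3, with the other predictors held at zero (the same case worked through by hand below):

. display _b[corrupt]*3 + _b[corrupt2]*3^2 + _b[_cons]

This line must be typed immediately after the regression, while the estimates are still in memory.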
For example, with a 0 level of unemployment, a controlled press and no armed conflict, our equation would be:

demo = 0.03*unemploy + -0.65*press1 + -0.41*corrupt + 1.33*conflict + 0.06*corrupt2 + 1.56
     = 0.03*0 + -0.65*0 + -0.41*corrupt + 1.33*0 + 0.06*(corrupt*corrupt) + 1.56
     = -0.41*corrupt + 0.06*(corrupt*corrupt) + 1.56

Thus, for a corruption level of 3, we would predict:

demo = -0.41*3 + 0.06*(3*3) + 1.56 = 0.87 (about 0.91 using the unrounded coefficients)

Often the best thing to do is to use the regression equation to produce a graph of predicted demonstration activity for various levels of corruption (this is probably easiest to do in Excel).

[Figure: Predicted demonstration activity by corruption level (1-10)]

EXERCISE 2

For this exercise we would like you to produce a model to predict turnout in national elections in the countries in our dataset. The dependent variable is turnout, and you should include corrupt, parcomp, enpp, elecsys, distsize and timeelap as independent variables in your model. First examine these variables in the codebook and look at basic univariate statistics, noting whether the variables are continuous, binary or categorical. Then run a series of multiple regressions, finally choosing a simple and parsimonious model that fits the data well. Now check for non-linearity using the residuals. Create squared terms for the continuous variables and add them to the model. Do the additional terms improve the model?

4. REGRESSION DIAGNOSTICS – OUTLIERS

The first check we should run is to ensure that the residuals are roughly normally distributed. If the errors are not normally distributed, much of the theory behind the t-tests no longer applies. First, though, you need to generate the residuals from the model. We will use the demonstration activity model that we used previously. The STATA command for this is predict, which can generate residuals from whichever regression STATA last estimated. However, we need the variances of the residuals to be standardized, so we will generate what are called "studentized" residuals with the following commands:

. regress demo unemploy press1 conflict corrupt corrupt2
. predict rstu, rstudent

Studentized residuals are just the actual values of the residuals divided by the standard deviation we would expect from normal sampling variability. This is like a z-statistic, so about 5% of values should be above 1.96 or below -1.96. This gives us an idea of how outlying the outliers are. A good way of looking at the distribution of residuals is to produce a histogram:

. histogram rstu

[Histogram: density of studentized residuals]

This graph tells us that, as we would expect, most values fall between -2 and +2, but we do have one real outlier with a studentized residual of around 5.
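To see which country that outlier is, we can list the cases with the most extreme studentized residuals. A minimal sketch, using a cut-off of roughly 2 (the 1.96 rule of thumb mentioned above):

. list nation rstu if abs(rstu) > 2 & !missing(rstu)

The same case will stand out again in the leverage and DFbeta plots below.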
Why care about outliers? A big residual is not necessarily a problem in itself, as the outlying case may have little effect on the overall model. What we really want to know is how influential a particular case is for the model coefficients, as well as whether it is an outlying observation. We should be worried if an outlier has a big effect on our model; sometimes just one case can determine the coefficients, if it has sufficient leverage (the amount of influence an observation has on the regression line). We can plot the leverage of each observation against its normalized squared residual (the squaring is necessary because the residuals are both positive and negative and, by definition, sum to zero). STATA has a command for this that can be run right after a regression:

. lvr2plot, mlabel(nation)

We label each case so that we can identify any that are problematic. Problematic outliers are the ones in the top right of the plot: these have both a large residual and a high amount of leverage. Bolivia, for example, has a large residual but little leverage, unlike Senegal, which has high leverage but is not an outlying observation. The most worrying observation in this plot is Israel.

How much effect does Israel have on the coefficients? We can check which coefficients Israel might be affecting. The DFbeta value shows the degree to which a coefficient would change (in standard-error units) if that observation were omitted. We use the dfbeta command to generate DFbeta values for each variable and observation:

. dfbeta

Next we plot the studentized residuals against the DFbeta values for each variable:

. twoway (scatter rstu DFunemploy, mlabel(nation))
. twoway (scatter rstu DFpress1, mlabel(nation))
. twoway (scatter rstu DFcorrupt, mlabel(nation))
. twoway (scatter rstu DFconflict, mlabel(nation))

If we look at these plots we can see that, whilst Bolivia generally has a high residual, its DFbeta scores are around zero, which indicates that removing Bolivia from the dataset would have little effect on the size of the coefficients. But consider the last scatterplot, for DFconflict:

[Scatterplot: studentized residuals (y-axis) against DFbeta for conflict (x-axis), with country labels]

Here we see that Israel has a high DFbeta and sits far to the right of the plot, which implies that removing it from the dataset would decrease the size of the coefficient for the armed conflict variable. That is, the effect of armed conflict on demonstration activity is larger with Israel included than without it. In fact, the coefficient would be about 0.9 standard errors smaller, or 0.9 * 0.43 = 0.39, so the coefficient for conflict would shrink from 1.33 to roughly 0.94 if this observation were omitted. As a rule of thumb, DFbeta values of more than 1 are worthy of attention.

What should we do with cases that we believe to be outliers and that influence the model? On the one hand, we could simply remove these cases from the dataset. But while certain observations may be outliers that cause problems for linear regression, they are still real data points that we should take into account. It is clearly important to know whether your model is driven by only a handful of outlying cases, but when it comes to deciding what to do about this there is no simple solution, other than to keep it in mind when drawing your conclusions. Another reason for being concerned about outliers is that several outlying cases can often indicate that your model or dataset is missing important variables that could help to explain why those cases are outliers.
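If you want to see directly how sensitive the model is to a single influential case, one option is simply to re-fit it without that case and compare the coefficients. A minimal sketch, assuming nation is stored as a string variable (if it is a labelled numeric variable, the if condition would need to use the underlying numeric code instead):

. regress demo unemploy press1 conflict corrupt corrupt2 if nation != "Israel"

Comparing this output with the full-sample results gives a direct check on the DFbeta-based calculation above.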
EXERCISE 3

Investigate whether there are any outlying cases (and what impact they have on the model) in your previous parsimonious model of turnout. Could you add other variables from our dataset that might explain why these cases are outliers?

5. HETEROSKEDASTICITY AND MULTICOLLINEARITY

Outliers are not, of course, the only problem we may encounter. Another problem is heteroskedasticity, where the values of Y are more variable at some levels of X than at others. If these differences are big, they violate one of the assumptions behind OLS regression. A good way to start detecting heteroskedasticity is to look at a plot of the residuals versus the fitted values. Run our basic model predicting demonstration activity again, and then use the command rvfplot:

. regress demo unemploy press1 conflict corrupt corrupt2
. rvfplot, yline(0)

The option yline(a) simply draws the line y = a. The residuals should be randomly distributed above and below the line, with no pattern in them. Changes in the variance of the residuals, otherwise known as heteroskedasticity, imply that the standard errors (and therefore the p-values) could be incorrect. It is not always easy to tell from a plot whether there is non-constant variance, so it is also wise to run a formal test for heteroskedasticity:

. hettest

The p-value tells us whether we can reject the hypothesis of no heteroskedasticity (i.e. of constant error variance), so when we have a statistically significant result (normally a p-value lower than 0.05) we conclude that there is heteroskedasticity. In our case there is, as there often is in small datasets. While there are a variety of solutions to heteroskedasticity, the best is to use robust standard errors, which correct for the unequal variances in the calculation of the standard errors.

. regress demo unemploy press1 conflict corrupt corrupt2, robust

The robust standard errors do not make a spectacular difference to the standard errors and p-values, but there are changes.

Yet another problem we may encounter is multicollinearity. This means that the independent variables we are interested in are closely related, so that when one of them increases the others increase as well, making it difficult to work out the separate effect of each predictor. This can be a problem when we use interactions, since an interaction is likely to be highly related to its main effects. We can check for multicollinearity with the variance inflation factor (VIF). Like most regression diagnostic commands, it needs to be run immediately after the regression command:

. regress demo unemploy press1 conflict corrupt corrupt2, robust
. vif

Generally, if the VIF is above 15 we have some reason to be concerned. We see here that our quadratic term for corruption is highly collinear with the corruption term itself. This is not surprising, as the quadratic term is a simple transformation of the original variable, so we have no need to worry about it. There is no evidence of multicollinearity for the other variables in the model. With multicollinearity there is little we can do other than gather more data, which is rarely an option. The only real alternative is to combine highly correlated variables by creating scales (see the sketch after Exercise 4). Reducing multicollinearity should shrink the standard errors.

EXERCISE 4

Check your final model of turnout for heteroskedasticity and multicollinearity. Should we be worried about anything, and if so, what measures could we take to allay those worries?
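If multicollinearity does turn out to be a problem in Exercise 4, here is a minimal sketch of the scale-building remedy mentioned above. The variable names x1 and x2 are purely hypothetical stand-ins for two highly correlated predictors:

. * standardize the two correlated predictors and average them into a single scale
. egen z1 = std(x1)
. egen z2 = std(x2)
. generate scale12 = (z1 + z2)/2
. label variable scale12 "Combined x1/x2 scale"

The combined scale then replaces the two original variables in the regression, which should bring the VIFs down and shrink the standard errors.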