
POLS 7012
REGRESSION DIAGNOSTICS
Topic: Model fit, regression diagnostics.
STATA commands and features: hettest, dfbeta, lvr2plot, rvfplot, vif
Data sets: demoplus.dta, includes data on 76 countries
Readings: Alan Agresti and Barbara Finlay (1997). Statistical Methods for the Social
Sciences, 3rd ed. Upper Saddle River, NJ: Prentice Hall. [CHAPTER 14]
1. INTRODUCTION
This week we will be using an expanded version of the dataset that we used last week;
we will be concentrating on testing how well the models fit the data and how robust
those models are. Substantively, we will again be looking at the predictors of
demonstration activity, but we will also see how well we can explain electoral turnout
cross-nationally in those same countries.
2. MODEL FIT
When it comes to deciding which variables to include in our model, we usually want
to fit a model which includes the variables we are interested in, but does not include
variables that have no effect. Parsimony is a virtue in model selection: we choose the
model which gives us the most information for the fewest number of predictor
variables. This might mean that we choose the simplest model, only including
significant terms. If a term is not significant (interpreted rather loosely, i.e., up to a p-value of around 0.10), or does not improve the model, it is normally removed.
For a single model, we would use the adjusted R2 statistic to tell us how much of the
variance in the dependent variable is ‘explained’ by the independent variables (i.e., how well our model fits). We can also use adjusted R2 to compare nested models. In a set of
nested models, each earlier model uses a subset of the variables in the later models.
For example:
Model 1: demo = unemploy
Model 2: demo = unemploy + press1
Model 3: demo = unemploy + press1 + conflict
Model 4: demo = unemploy + press1 + conflict + corrupt
EXERCISE 1
Recode press and war into 0/1 dummy variables press1 and conflict as we have
previously (remember to label fully). Run the four models above, noting down the
adjusted R2 for each of them. Which terms lead to a big improvement in model fit, and
how would you interpret this?
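One possible way to organise the model-comparison part of this exercise is sketched below; it assumes press1 and conflict have already been recoded and labelled, and uses e(r2_a), where Stata stores the adjusted R2 after regress:
. regress demo unemploy
. display e(r2_a)
. regress demo unemploy press1
. display e(r2_a)
. regress demo unemploy press1 conflict
. display e(r2_a)
. regress demo unemploy press1 conflict corrupt
. display e(r2_a)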
3. REGRESSION DIAGNOSTICS – NON-LINEARITY
To keep things simple, we will start with a model containing only main effects
(though we know from last week that there are probably interactions between some of
the variables):
. regress demo unemploy press1 conflict corrupt
In many ways this is a nice model since almost all the terms are statistically
significant, and there is a plausible story to explain the coefficients.
When we use regression analysis we must check that the model does not violate any
of the assumptions of regression analysis; if these are flouted we may get misleading
estimates of the coefficients, standard errors or both. Conducting regression
diagnostics can be a complicated business, but there are a few checks that are
important and relatively easy to perform. Most of these require either looking at or
testing different aspects of the residuals (the ε's in the regression equation). This is
especially useful when we have a small number of cases, as we do in this dataset.
With datasets of several thousand observations (e.g. one based on a sample survey of
individuals) some of these checks may be less useful however.
To get STATA to produce the residuals from our chosen model (i.e. the last model that we ran), type:
. predict res, resid
This command creates a new variable called res that takes the values of the residual
from the model for each case. Residuals should be randomly distributed around zero (they represent the inherent variation that we cannot predict), so what we want to know is whether there is
any pattern to how the residuals are distributed. If there is a pattern it suggests that our
model as it stands could be problematic in some way.
A good first step is to plot the residuals against each of the independent variables in
the model.
. twoway scatter res unemploy
. twoway scatter res corrupt
Note that there’s no point doing this for our two binary variables (conflict and press1).
For unemployment, there appears to be no pattern to how the residuals are distributed.
However, if we look at the corruption scatterplot, we can see that the residuals for low
corruption scores tend to be just higher than zero, those for medium scores tend to be
just lower than zero, and those for high scores tend to be much higher than zero.
[Scatterplot: residuals plotted against corruption score]
What this should suggest to us is that corruption may have a non-linear effect on
demonstration activity; that is, for different values of corruption the slope of the line
is different. The most commonly used way to take account of non-linearity is to use
polynomial regression functions.
Y = α + β₁X + β₂X² + ε
In practice, all we need to do is to generate a new variable that is the square (or cube
etc.) of the independent variable that we believe to have a non-linear relationship with
the dependent variable.
. generate corrupt2=corrupt*corrupt
. label variable corrupt2 "Corruption squared"
And then simply include this new variable in our model, making sure that we also
include the original (non-squared) variable, in our case, corrupt.
. regress demo unemploy press1 conflict corrupt corrupt2
      Source |       SS       df       MS              Number of obs =      76
-------------+------------------------------           F(  5,    70) =   13.51
       Model |  84.6685635     5  16.9337127           Prob > F      =  0.0000
    Residual |  87.7493293    70  1.25356185           R-squared     =  0.4911
-------------+------------------------------           Adj R-squared =  0.4547
       Total |  172.417893    75  2.29890524           Root MSE      =  1.1196

------------------------------------------------------------------------------
        demo |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    unemploy |   .0331588   .0132818     2.50   0.015     .0066691    .0596485
      press1 |  -.6483971   .3427103    -1.89   0.063    -1.331911    .0351171
    conflict |    1.33142   .4305441     3.09   0.003      .472727    2.190113
     corrupt |  -.4085027   .2179467    -1.87   0.065    -.8431837    .0261783
    corrupt2 |   .0643187   .0251509     2.56   0.013     .0141569    .1144805
       _cons |   1.557482   .5326082     2.92   0.005     .4952281    2.619735
------------------------------------------------------------------------------
The results show that our new corrupt2 variable is statistically significant (p = 0.013), confirming that there is a non-linear relationship between corruption and
demonstration activity. It can be difficult to interpret the size of the coefficient. What
we need to do is, for each level of corruption, calculate the combined impact of
corrupt and corrupt2. For example, with a 0 level of unemployment, a controlled
press and no armed conflict, our equation would be:
demo = 0.03*unemploy - 0.65*press1 - 0.41*corrupt + 1.33*conflict + 0.06*corrupt2 + 1.56
     = 0.03*0 - 0.65*0 - 0.41*corrupt + 1.33*0 + 0.06*(corrupt*corrupt) + 1.56
     = -0.41*corrupt + 0.06*(corrupt*corrupt) + 1.56
Thus, for a corruption level of 3, our equation would be:
demo = -0.41*3 + 0.06*(3*3) + 1.56
     = 0.87
(or about 0.91 if we use the unrounded coefficients from the regression output).
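You can also let Stata do this arithmetic for you, using the coefficients stored after the last regress (these are unrounded, so the answer differs slightly from the hand calculation above):
. display _b[_cons] + _b[corrupt]*3 + _b[corrupt2]*3^2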
Often the best thing to do is to use the regression equation to produce a graph of
predicted demonstration activity for various levels of corruption (this is probably
easiest to do in Excel).
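If you would rather stay in Stata, one option is twoway function, plugging in the rounded coefficients from the worked example above (a sketch that holds unemployment, press1 and conflict at 0):
. twoway function y = 1.56 - 0.41*x + 0.06*x^2, range(1 10) xtitle("Corruption (1-10)") ytitle("Predicted demonstration activity")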
[Line graph: predicted demonstration activity by corruption score (1-10)]
EXERCISE 2
For this exercise we would like you to produce a model to predict turnout in national
elections in the countries in our dataset. The dependent variable is turnout, and you
should include corrupt, parcomp, enpp, elecsys, distsize and timeelap as independent
variables in your model. First examine these variables in the codebook and look at
basic univariate statistics, noting whether the variables are continuous, binary or
categorical. Then run a series of multiple regressions, finally choosing a simple and
parsimonious model that fits the data well. Now check for non-linearity using the
residuals. Create squared terms for the continuous variables and add them to the
model. Do the additional terms improve the model?
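One possible opening sequence for this exercise is sketched below (res2 is just an illustrative name for the new residual variable):
. codebook turnout corrupt parcomp enpp elecsys distsize timeelap, compact
. summarize turnout corrupt parcomp enpp elecsys distsize timeelap
. regress turnout corrupt parcomp enpp elecsys distsize timeelap
. predict res2, resid
. twoway scatter res2 corrupt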
4. REGRESSION DIAGNOSTICS – OUTLIERS
The first test one should run is to ensure that the residuals are roughly normally distributed. If the errors are not normally distributed, much of the theory behind the t-tests will no longer apply. First, though, you need to generate the residuals from the model. We will use the demonstration activity model that we used previously. The command for that in STATA is predict, which can be used to generate residuals from whichever regression STATA last estimated. However, we need the variances of the residuals to be standardized, so we will generate what are called “studentized” residuals with the following command:
. regress demo unemploy press1 conflict corrupt corrupt2
. predict rstu, rstudent
Studentized residuals are just the actual value of the residuals divided by the standard
deviation we would expect from normal sampling variability. This is like a z-statistic,
so about 5% of values should be above 1.96 or below -1.96. This gives us an idea of
how outlying the outliers are. A good way of looking at the distribution of residuals is
to produce a histogram:
. histogram rstu
[Histogram: density of the studentized residuals, which range from about -2 to 6]
This graph tells us that, as we would expect, most values fall between -2 and +2, but
we do have one real outlier with a residual score of around 5.
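To see which country this is, one option is simply to list the cases with large studentized residuals:
. list nation rstu if abs(rstu) > 2 & !missing(rstu)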
Why care about outliers? Big residual scores are not necessarily a problem in
themselves, as that outlying case may have little effect on the overall model. What we
really want to know is how influential any particular case is for the model coefficients
as well as whether it is an outlying observation. We should be worried if an outlier
has a big effect on our model; sometimes just one case can determine the coefficients,
if it has sufficient leverage (the amount of influence an observation has on the
regression line). We can plot each observation's leverage against its normalized squared residual (the squaring is necessary because the residuals are both positive and negative and, by definition, sum to zero). STATA has a
command for this that can be run right after a regression.
. lvr2plot, mlabel(nation)
We label each case so we can identify any that are problematic. Problematic outliers
are the ones that are in the top right of the plot. These have both a large residual as
well as high amounts of leverage. Bolivia, for example, has a large residual but little
leverage, unlike Senegal which has high leverage but is not an outlying observation.
The most worrying observation in this graph is Israel. How much effect does Israel
have on the coefficients? We can check which coefficients Israel might be affecting.
The DFbeta value shows the degree to which the coefficient will change when that
observation is omitted. We use the dfbeta command to generate DFbeta values for
each variable and observation. Type as follows:
. dfbeta
Next we plot the residuals by the DF values for each variable.
. twoway (scatter rstu DFunemploy, mlabel(nation))
. twoway (scatter rstu DFpress1, mlabel(nation))
. twoway (scatter rstu DFcorrupt, mlabel(nation))
. twoway (scatter rstu DFconflict, mlabel(nation))
If we look at these plots we can see that whilst Bolivia generally has a high residual,
its DFbeta score is around zero, which indicates that if we were to remove Bolivia
from the dataset this would have little effect on the size of the coefficients. But
consider the last scatterplot of DFconflict:
[Scatterplot: studentized residuals plotted against DFbeta for conflict, labelled by nation; Israel sits far to the right of the plot]
Here, we see that Israel has a high DFbeta and sits far to the right of the plot, which implies that removing it from the dataset would decrease the size of the coefficient for the armed conflict variable. That is, the effect of armed conflict on demonstration activity is larger with Israel included than without it. In fact, the coefficient would be about 0.9 standard errors smaller, or 0.9 * 0.43 = 0.39. So the coefficient for conflict would shrink from 1.33 to about 0.94 if this observation were omitted. As a rule of thumb, DFbeta values of more than 1 are worthy of attention.
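One direct way to check this is to re-run the model without Israel and compare the conflict coefficient (a sketch; it assumes nation is a string variable, so adjust the if condition if it is a labelled numeric variable):
. regress demo unemploy press1 conflict corrupt corrupt2 if nation != "Israel"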
What should we do with cases that we believe to be outliers and to influence the model?
On the one hand, we could simply remove these cases from the dataset. But while
certain observations may be outliers that cause problems for linear regression, they
are still real data points that we should consider. It is clearly important to know if
your model is driven by only a handful of outlying cases, but when it comes to
deciding what to do about this, there is no simple solution, other than to keep this in
mind when drawing your conclusions.
Another reason for being concerned about outliers is that several outlying cases can
often indicate that your model/dataset is missing important variables that could help to
explain why those outliers are outlying.
EXERCISE 3
Investigate whether there are any outlying cases (and their impact on the model) in
your previous parsimonious model of turnout. Could you add other variables from our
dataset which might explain why these cases are outliers?
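A possible starting point, run immediately after your final turnout regression (rstu_t is just an illustrative name):
. predict rstu_t, rstudent
. histogram rstu_t
. lvr2plot, mlabel(nation)
. dfbeta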
5. HETEROSKEDASTICITY AND MULTICOLLINEARITY
Outliers are not of course the only problem that we may encounter. Another problem
is heteroskedasticity, where the values of Y are more variable at some levels of X
compared to others. If these differences are big, they violate one of the
assumptions behind OLS regression. A good way to start detecting heteroskedasticity
is by looking at a plot of the residuals versus the fitted values. Run our basic model
again predicting demonstration activity, and then use the command rvfplot:
. regress demo unemploy press1 conflict corrupt corrupt2
. rvfplot, yline(0)
The option yline(a) just shows the line y = a. The residuals should be randomly
distributed above and below the line and there should be no pattern in them. Changes
in the variance of the residuals, otherwise known as heteroskedasticity, imply that the standard errors (and therefore the p-values) could be incorrect. It is not always easy to tell from a plot whether there is non-constant variance, so it is also wise to run a formal test for heteroskedasticity.
. hettest
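In more recent versions of Stata the same Breusch-Pagan/Cook-Weisberg test is available as a postestimation command, which can be used in place of hettest here:
. estat hettest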
The p-value tells us whether we can reject the hypothesis of no heteroskedasticity, so
that when we have a statistically significant result (normally lower than 0.05), we
conclude that there is heteroskedasticity. In our case, there is, as there often is in small
datasets. While there are a variety of solutions to heteroskedasticity, the best is the use
of robust standard errors, which correct for the unequal variances in the calculation of
the standard errors.
. regress demo unemploy press1 conflict corrupt corrupt2, robust
The robust standard errors do not make a spectacular difference to the standard errors and p-values, but there are changes.
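If you want to see the two sets of standard errors side by side, one option is to store both sets of estimates and tabulate them (ols and rob are just illustrative names; quietly suppresses the regression output):
. quietly regress demo unemploy press1 conflict corrupt corrupt2
. estimates store ols
. quietly regress demo unemploy press1 conflict corrupt corrupt2, robust
. estimates store rob
. estimates table ols rob, b(%9.3f) se(%9.3f)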
Yet another problem we may encounter is that of multicollinearity. This means that
the independent variables that we are interested in are closely related, so that when
one of them increases the others increase as well, making it difficult to work out the
separate effect of each predictor. This can be a problem when we use interactions,
since the interaction is likely to be highly related to the main effect. We can check for
multicollinearity with the variance inflation factor. Like most regression diagnostic
commands, it needs to be run immediately after the regression command:
. regress demo unemploy press1 conflict corrupt corrupt2, robust
. vif
Generally, if the VIF is above 15 we have some reason to be concerned. We see here
that our quadratic term for corruption is highly collinear with the corruption term.
This is of course not surprising, as the quadratic term is a simple transformation of the
term itself. We therefore have no need to worry about this. There is no evidence of
multicollinearity for the other variables in the model.
With multicollinearity, there is little we can do other than gather more data, which is
rarely an option. The only real possibility is to combine highly correlated variables by
creating scales. Reducing multicollinearity should shrink the standard errors.
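For the specific case of a variable and its own square (such as corrupt and corrupt2), one standard trick is to centre the variable before squaring it, which usually reduces this kind of collinearity without changing the overall fit of the model; a sketch (corrupt_c and corrupt_c2 are illustrative names):
. summarize corrupt
. generate corrupt_c = corrupt - r(mean)
. generate corrupt_c2 = corrupt_c^2
. regress demo unemploy press1 conflict corrupt_c corrupt_c2
. vif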
EXERCISE 4
Check your final model of turnout for heteroskedasticity and multicollinearity. Should
we be worried about anything, and if so what measures could we take to allay those
worries?