Homework 14

UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Homework 14 (Due Tuesday, Nov. 17th)
Identifying Outliers, Correcting for Heteroskedasticity, and Checking Normality of the Residuals
In Homework 14, you will check the regression model that you developed in Homework 13 for outliers, heteroskedasticity
and normality of the residuals, and you will correct the model for the effects of heteroskedasticity. (We’ll work with
autocorrelation in the next homework.) Recall that in Homework 13, you developed the following OLS regression model:
EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + β4·PavedMilesArea + β5·PavedMilesAreaSq +
β6·LNMuniH2OArea + β7·LNDocsPer1000 + β8·MtnRegion + β9·CstRegion + e
First, we will check the regression for outliers. Second, because the regression above is based on a cross-section dataset,
we should check for heteroskedasticity, which occurs often in cross-section datasets. Third, we’ll check the residuals from
the final, heteroskedasticity-corrected regression to determine whether they meet the normality assumption of OLS
regression. Based on the model above, run the following regression again in SAS (you first ran this regression in Homework
13), but this time use the “output” statement to save: (1) the predicted values, name them "yhat", (2) the residuals, name them “ehat”, (3) the leverage values, name them "lev", and (4) the studentized residuals, name them "student_resid". You can simply add the SAS commands below to the SAS program that you developed in Homework 13 (assuming that the last dataset you were working with in Homework 13 was named “dataset02”; otherwise, you may need to update the dataset name in the SAS commands below from dataset02 to dataset03, or whatever your last dataset was named):
proc reg data=dataset02;
model EMPmanf2000 = LandArea SchoolSpendPP PropTaxRate
PavedMilesArea PavedMilesAreaSq LNMuniH2OArea LNDocsPer1000
MtnRegion CstRegion;
output out=dataset03 p=yhat r=ehat h=lev rstudent=student_resid;
run;
Checking for Outliers
Outliers are data points that have unusually large or small values of X or Y relative to other data points in a dataset. Outliers
are of concern because they can affect the β's, and the s.e.’s of the β's, of regression models. We can identify outlier data
points by using PROC PLOT to make several graphs, as discussed in the “Outliers” handout. (In the first plot, I chose to
graph the Y variable EMPmanf2000 against the X variable LandArea to look for any patterns. We could graph
EMPmanf2000 against other X variables to look for patterns, I just chose LandArea as an example.)
In the “plot” commands below, we’re using ='o' to tell SAS which symbol to use for the data points when they are printed out
on a graph, and we’re using “$ CntyName” to tell SAS to label each data point with the county name for that data point. By
using the variable CntyName to label the data points, you can tell which counties are the outliers.
proc plot data=dataset03;
plot EMPmanf2000*LandArea='o' $ CntyName ;
plot lev*yhat='o' $ CntyName;
plot student_resid*yhat='o' $ CntyName;
plot lev*student_resid='o' $ CntyName;
run;
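As a rough guide (these are standard rules of thumb, not something specific to this handout): a data point whose leverage is greater than about 2(k+1)/n, where k is the number of X variables and n is the number of data points, has unusual X values, and a data point whose studentized residual is bigger than about 2 in absolute value is one that the model predicts poorly. If you would like SAS to list the counties that exceed these cutoffs, a minimal sketch is below; the leverage cutoff of 0.20 assumes k = 9 X variables and n = 100 counties, so adjust it to match your own k and n.
proc print data=dataset03;
/* rule-of-thumb cutoffs: leverage > 2*(k+1)/n and |studentized residual| > 2 */
/* 2*(9+1)/100 = 0.20 assumes k = 9 and n = 100; change these to match your data */
where lev > 0.20 or abs(student_resid) > 2;
var CntyName lev student_resid;
run;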
From the plots, it looks like the points for Mecklenburg and Guilford counties (and maybe Orange, Catawba and Durham
counties) are outliers. For these counties, the value of Y (EmpManf2000) predicted by the model will likely be far from the
actual value of Y in the dataset; that is, the model likely won’t predict well for these counties. At this point, we could check
the dataset to make sure that there are no errors on the data rows corresponding to Mecklenburg and Guilford counties. If we
find no errors, then we can think about whether either of these two counties would have unusually large or small values of
one or more of the X variables. If so, we could transform those particular X variables in an attempt to bring the data points
for these two counties closer to the other data points. Alternatively, we could just decide that there is something
“special/weird” about these two counties and create a dummy variable to represent the “special effect” of these two counties.
The coefficient (β value) for the dummy variable would measure the “Mecklenburg/Guilford Effect”—the change in the Y
variable due to the special/weird thing about the two counties. The dummy variable would help bring the two outlier data
points closer to the other data points. A final alternative is that we could simply drop these two data points from the analysis,
but this seems like a bad choice in this case, because these two counties are important counties in North Carolina--they are
the counties containing the large cities of Charlotte and Greensboro.
For this homework, let’s create one dummy variable to represent the “Large City Effect” that makes these counties outliers.
In the data step of the SAS program, go back and add the following commands to create the dummy variable, called
DumMeckGuil:
if CntyName='Mecklenburg' or CntyName='Guilford' then DumMeckGuil=1;
else DumMeckGuil=0;
We will include the DumMeckGuil dummy variable in the next regression below.
In your homework, say that you "checked for outliers by examining plots of the data points, leverage values of the data
points, and Studentized residuals of the data points."
In your homework, say that Mecklenburg and Guilford counties appear to be outliers, and that you created a dummy
variable (DumMeckGuil) to represent the special effect of these counties on the dependent variable of the regression.
Tip: Identifying outliers gives you something to speculate about in the Conclusions section of a report. In your conclusions,
mention which data points (counties) are outliers, and give some suggestions as to why they might be outliers. You can then
say that investigating these outliers would be a good thing to pursue in future research.
Checking for Heteroskedasticity
Again, because the regression is based on a cross-section dataset, we should check for heteroskedasticity. To check for
heteroskedasticity, use PROC PLOT to make plots of the ehat’s against each X variable (you don't need to check
PavedMilesAreaSq, because you are already checking PavedMilesArea):
proc plot data=dataset03;
plot ehat*LandArea ;
plot ehat*SchoolSpendPP ;
plot ehat*PropTaxRate ;
plot ehat*PavedMilesArea ;
plot ehat*LNMuniH2OArea ;
plot ehat*LNDocsPer1000 ;
plot ehat*MtnRegion ;
plot ehat*CstRegion ;
plot ehat*DumMeckGuil ;
run;
Now examine each of the plots and look for patterns. There appears to be some degree of sideways cone-shaped pattern (a
sign of heteroskedasticity) in the plots of ehat against LNMuniH2OArea, SchoolSpendPP, and PavedMilesArea. Also, the
variation in the ehats appears to be different for the different values of the dummy variable in both the MtnRegion and
CstRegion plots (and DumMeckGuil). So, there does appear to be heteroskedasticity.
In your homework, say that you tested the regression for heteroskedasticity because you have cross-section data, and
describe briefly in words which plots you created, and what patterns you observed for which variables.
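If you would also like a formal statistical test to back up what you see in the plots (optional; it is not required for this homework), PROC REG can print a version of White’s test for heteroskedasticity when the SPEC option is added to the model statement. A minimal sketch, using the same variables as the regression above:
proc reg data=dataset03;
model EMPmanf2000 = LandArea SchoolSpendPP PropTaxRate
PavedMilesArea PavedMilesAreaSq LNMuniH2OArea LNDocsPer1000
MtnRegion CstRegion DumMeckGuil / spec;
run;
A small p-value for this test is evidence of heteroskedasticity, which would agree with the patterns seen in the plots.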
Correcting Heteroskedasticity
Let's do a weighted least squares (WLS) regression to correct for the heteroskedasticity. To do a WLS regression, we need
to create a weight variable, w. Let's use PavedMilesArea as the weight variable. (We will hope that correcting
heteroskedasticity due to PavedMilesArea will also correct most of the heteroskedasticity due to the other variables; this hope
will likely be fulfilled if the X variables move together somewhat, which is likely true here—for example, if PavedMilesArea
is large, then SchoolSpendPP and LNMuniH2OArea will also likely be large. In any case, in more advanced Econometrics
courses you will learn how to correct for heteroskedasticity from multiple sources.)
Looking at the plot of ehat against PavedMilesArea, it appears as if the variation in the ehats is proportional to the square of
PavedMilesArea, so the weight variable needed to do the WLS regression is:
w = 1 / SQRT(PavedMilesArea**2).
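To see why the Data Step commands below multiply every variable by w, and why w itself will appear as an X variable, it helps to write out the transformed (weighted) model; this is just the algebra behind WLS, with β10 denoting the coefficient on the new DumMeckGuil dummy. Because PavedMilesArea is positive, w = 1/PavedMilesArea, and multiplying both sides of the regression equation by w gives:
w·EMPmanf2000 = β0·w + β1·(w·LandArea) + β2·(w·SchoolSpendPP) + … + β9·(w·CstRegion) + β10·(w·DumMeckGuil) + w·e
The intercept β0 now multiplies w rather than the constant 1, which is why w must be included as an X variable and why the regression is run with the “noint” option. Also, the new error term is w·e, which has constant variance if the variance of e is proportional to the square of PavedMilesArea, exactly the pattern we assumed above.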
In SAS, we go back and add commands to the Data Step to create the weight variable w and the weighted Y and X variables
that will be used in the WLS regression equation:
w = 1/SQRT(PavedMilesArea**2);
EMPmanfNEW = w*EMPmanf2000;
LandAreaNEW = w*LandArea;
SchoolSpendNEW = w*SchoolSpendPP;
PropTaxRateNEW = w*PropTaxRate;
PavedMilesNEW = w*PavedMilesArea;
PavedMilesSqNEW = w*PavedMilesAreaSq;
LNMuniH2ONEW = w*LNMuniH2OArea;
LNDocsNEW = w*LNDocsPer1000;
MtnRegionNEW = w*MtnRegion;
CstRegionNEW = w*CstRegion;
DumMeckGuilNEW = w*DumMeckGuil;
Next, run the WLS regression using the w variable and the new, weighted Y and X variables. As described in the handout
about Heteroskedasticity, don’t forget to include the w variable itself as an X variable in the regression, and don’t forget to
include the “noint” option at the end of the model command line.
proc reg data=dataset03;
model EMPmanfNEW = w LandAreaNEW SchoolSpendNEW PropTaxRateNEW
PavedMilesNEW PavedMilesSqNEW LNMuniH2ONEW LNDocsNEW
MtnRegionNEW CstRegionNEW DumMeckGuilNEW / noint;
output out=dataset04 p=yhatNEW r=ehatNEW h=levNEW rstudent=student_residNEW;
run;
In your homework, say that you corrected for heteroskedasticity by running a weighted least squares regression (WLS), give
the formula for w, and include the name of the X variable that you used in the formula.
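(Aside, not required for this homework: PROC REG also has a WEIGHT statement that fits the same weighted least squares model without creating the transformed “NEW” variables by hand; the weight is the reciprocal of the assumed error variance, 1/PavedMilesArea**2. A minimal sketch is below, where the weight variable wt and the dataset name dataset03b are just illustrative names, not part of the assignment. The coefficient estimates should match those from the transformed-variable regression above, although some summary statistics, such as R-square, are computed differently.)
data dataset03b;
set dataset03;
wt = 1/(PavedMilesArea**2); /* weight = 1 / (assumed error variance) */
run;
proc reg data=dataset03b;
weight wt;
model EMPmanf2000 = LandArea SchoolSpendPP PropTaxRate
PavedMilesArea PavedMilesAreaSq LNMuniH2OArea LNDocsPer1000
MtnRegion CstRegion DumMeckGuil;
run;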
Interpreting the WLS Regression Results
Compare the standard error of the regression (SER) for the two regressions. (Note: SAS calls the SER the "Root MSE," that
is, SER = Root MSE.) In the first, OLS regression, SER = 6483, while in the second, WLS regression, the SER is much
lower; that is, correcting for heteroskedasticity reduced the average distance (error) between the data points and the
regression line by quite a bit!
Also, make a plot of the residuals from the WLS regression (the “ehatNEW” values) against PavedMilesArea and compare it to the plot of ehat against PavedMilesArea.
proc plot data=dataset04;
plot ehatNEW*PavedMilesArea='o' $ CntyName;
run;
The WLS regression should have reduced any pattern in the ehats, so there should be less evidence of a pattern in the graph
of ehatNEW against PavedMilesArea compared to the graph of ehat against PavedMilesArea.
In your homework, briefly describe whether the “outlier” counties still appear to be causing problems.
Put the heteroskedasticity-corrected WLS regression output from SAS in a table in your homework using a table
format like that described in Homework 13. Then, interpret the regression results:
• What is the final estimated WLS equation? (This is the regression equation with the β values filled-in.)
• What is the F-value (and its p-value) and what does it mean?
• What is the Adj R-square value and what does it mean?
• What are the parameter estimates and what do they mean?
• What are the t-values and what do they mean?
• Which of the X variables have a statistically significant effect on Y at the α = 0.05 significance level?
• Report the OLS SER, the WLS SER, and the reduction in SER resulting from the heteroskedasticity correction. By what percentage did SER decrease?
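For the last question above, the percentage decrease in SER can be computed as 100 × (OLS SER − WLS SER) / (OLS SER), using the Root MSE values that SAS reports for the two regressions (the OLS SER of 6483 was given earlier).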
Checking the Normality Assumption
Let’s check the residuals (that is, the ehatNEW’s) from the final, heteroskedasticity-corrected WLS regression to determine
whether the residuals meet the normality assumption of OLS regression.
In SAS, following the WLS regression commands, use PROC GCHART to create a histogram of the residuals (the
ehatNEW’s):
proc gchart data=dataset04;
vbar ehatNEW / levels=13;
run;
Check the histogram. Is it approximately bell-shaped? Is it approximately centered on zero (on the x axis)?
Use PROC MEANS to check the mean, median, skewness and kurtosis of the residuals.
proc means data=dataset04 vardef=df maxdec=3 n mean median skew kurt;
var ehatNEW;
run;
If the distribution of the residuals is normal (bell-shaped and centered on zero), then (1) the mean will be approximately equal to the median, (2) the skewness will be approximately zero, and (3) the kurtosis will be approximately 3. (Note: the kurtosis statistic printed by SAS’s PROC MEANS is the excess kurtosis, that is, kurtosis minus 3, so the value SAS prints should be approximately zero for normally distributed residuals.) Based on these numbers, does the distribution of the residuals appear to be normal?
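(Optional cross-check, not required for this homework: PROC UNIVARIATE with the NORMAL option prints formal normality tests for the residuals, such as the Shapiro-Wilk test, and its HISTOGRAM statement can overlay a fitted normal curve. A minimal sketch:
proc univariate data=dataset04 normal;
var ehatNEW;
histogram ehatNEW / normal;
run;
Small p-values for these tests are evidence against normality of the residuals.)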
Finally, use the output from PROC MEANS to conduct a Jarque-Bera (JB) test of normality of the residuals. Do this test on
paper, not in SAS. What is the result of the JB test?
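As a reminder, the standard Jarque-Bera statistic (stated here in case you need the formula; use the version from class or the handout if it differs) is
JB = (n/6)·(S² + (K − 3)²/4),
where n is the number of residuals, S is the skewness, and (K − 3) is the excess kurtosis. If you use the kurtosis value printed by SAS, it is already the excess kurtosis (as noted above), so use it directly in place of (K − 3). Under the null hypothesis that the residuals are normally distributed, JB is approximately chi-square with 2 degrees of freedom, so at the α = 0.05 level you would reject normality if JB is greater than about 5.99.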
In your homework, say that you tested the residuals from the WLS regression for normality by (1) examining a histogram of
the residuals, (2) comparing the mean of the residuals to the median, (3) checking the skewness and kurtosis of the residuals,
and (4) conducting a Jarque-Bera test.
In your homework, describe the results of the normality tests. Say whether the residuals from the WLS regression appear to be normally distributed, or not. If they do appear to be normally distributed, then don’t write anything else. If they do not appear to be normally distributed, then say that future research will consider this issue in greater detail. (In the real world, if the residuals do not appear to be normally distributed, then we need to investigate further and perhaps change the functional form of the variables in the model, etc., until we get residuals that are normally distributed.)
Save/Print Your Program and Write up Your Homework
After you run your SAS program and verify that it is working correctly, save the SAS program as HW14.sas. Print out your
program (you can copy it from the Editor window of SAS, paste it into Word, and print it), and turn it in with your
homework. Also, when this homework asks you to answer specific questions about the results, you need to answer in
complete sentences, in addition to giving the appropriate numbers. Be sure to put your name, ECN377, your section, and
“Homework 14” at the top of your homework.