UNC-Wilmington Department of Economics and Finance
ECN 377, Dr. Chris Dumas

Homework 14 (Due Tuesday, Nov. 17th)
Identifying Outliers, Correcting for Heteroskedasticity, and Checking Normality of the Residuals

In Homework 14, you will check the regression model that you developed in Homework 13 for outliers, heteroskedasticity, and normality of the residuals, and you will correct the model for the effects of heteroskedasticity. (We'll work with autocorrelation in the next homework.) Recall that in Homework 13, you developed the following OLS regression model:

EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + β4·PavedMilesArea + β5·PavedMilesAreaSq + β6·LNMuniH2OArea + β7·LNDocsPer1000 + β8·MtnRegion + β9·CstRegion + e

First, we will check the regression for outliers. Second, because the regression above is based on a cross-section dataset, we should check for heteroskedasticity, which occurs often in cross-section datasets. Third, we'll check the residuals from the final, heteroskedasticity-corrected regression to determine whether they meet the normality assumption of OLS regression.

Based on the model above, run the following regression again in SAS (you first ran this regression in Homework 13), but this time use the "output" statement to save: (1) the predicted values, named "yhat"; (2) the residuals, named "ehat"; (3) the leverage values, named "lev"; and (4) the studentized residuals, named "student_resid". You can simply add the SAS commands below to the SAS program that you developed in Homework 13 (assuming that the last dataset you were working with in Homework 13 was named "dataset02"; otherwise, update the dataset name in the commands below from dataset02 to dataset03, or whatever it was):

proc reg data=dataset02;
model EMPmanf2000 = LandArea SchoolSpendPP PropTaxRate PavedMilesArea PavedMilesAreaSq LNMuniH2OArea LNDocsPer1000 MtnRegion CstRegion;
output out=dataset03 p=yhat r=ehat h=lev rstudent=student_resid;
run;

Checking for Outliers

Outliers are data points that have unusually large or small values of X or Y relative to the other data points in a dataset. Outliers are of concern because they can affect the β's, and the s.e.'s of the β's, of regression models. We can identify outlier data points by using PROC PLOT to make several graphs, as discussed in the "Outliers" handout. (In the first plot, I chose to graph the Y variable EMPmanf2000 against the X variable LandArea to look for any patterns. We could graph EMPmanf2000 against other X variables as well; I just chose LandArea as an example.) In the "plot" commands below, we use ='o' to tell SAS which symbol to use for the data points when they are printed on a graph, and we use "$ CntyName" to tell SAS to label each data point with the county name for that data point. By labeling the data points with CntyName, you can tell which counties are the outliers.

proc plot data=dataset03;
plot EMPmanf2000*LandArea='o' $ CntyName;
plot lev*yhat='o' $ CntyName;
plot student_resid*yhat='o' $ CntyName;
plot lev*student_resid='o' $ CntyName;
run;

From the plots, it looks like the points for Mecklenburg and Guilford counties (and maybe Orange, Catawba, and Durham counties) are outliers. For these counties, the value of Y (EmpManf2000) predicted by the model will likely be far from the actual value of Y in the dataset; that is, the model likely won't predict well for these counties.
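Beyond eyeballing the plots, you can also screen for outliers numerically. The commands below are an optional sketch (not part of the homework); they use the common rules of thumb that a data point deserves a closer look if its leverage exceeds 2(k+1)/n or if its Studentized residual exceeds 2 in absolute value. The value n = 100 assumes the dataset contains all 100 NC counties, and k = 9 is the number of X variables in the regression above; adjust these if your data differ.

data outlier_check;
set dataset03;
n = 100;                              /* assumed number of counties; change if your dataset differs */
k = 9;                                /* number of X variables in the regression above */
lev_cutoff = 2*(k+1)/n;               /* rule-of-thumb leverage cutoff */
if lev > lev_cutoff or abs(student_resid) > 2 then outlier_flag = 1;
else outlier_flag = 0;
run;

proc print data=outlier_check;
where outlier_flag = 1;
var CntyName lev student_resid;
run;

The counties flagged by this screen should line up with the ones you spotted in the plots (Mecklenburg and Guilford, and possibly a few others).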
At this point, we could check the dataset to make sure that there are no errors on the data rows corresponding to Mecklenburg and Guilford counties. If we find no errors, then we can think about whether either of these two counties has unusually large or small values of one or more of the X variables. If so, we could transform those particular X variables in an attempt to bring the data points for these two counties closer to the other data points. Alternatively, we could decide that there is something "special/weird" about these two counties and create a dummy variable to represent their "special effect." The coefficient (β value) for the dummy variable would measure the "Mecklenburg/Guilford Effect," the change in the Y variable due to the special/weird thing about the two counties. The dummy variable would help bring the two outlier data points closer to the other data points. A final alternative is to simply drop these two data points from the analysis, but this seems like a bad choice in this case, because these two counties are important counties in North Carolina: they contain the large cities of Charlotte and Greensboro.

For this homework, let's create one dummy variable to represent the "Large City Effect" that makes these counties outliers. In the data step of the SAS program, go back and add the following commands to create the dummy variable, called DumMeckGuil:

if CntyName='Mecklenburg' or CntyName='Guilford' then DumMeckGuil=1;
else DumMeckGuil=0;

We will include the DumMeckGuil dummy variable in the next regression below.

In your homework, say that you "checked for outliers by examining plots of the data points, leverage values of the data points, and Studentized residuals of the data points." Also say that Mecklenburg and Guilford counties appear to be outliers, and that you created a dummy variable (DumMeckGuil) to represent the special effect of these counties on the dependent variable of the regression.

Tip: Identifying outliers gives you something to speculate about in the Conclusions section of a report. In your conclusions, mention which data points (counties) are outliers, and give some suggestions as to why they might be outliers. You can then say that investigating these outliers would be a good thing to pursue in future research.

Checking for Heteroskedasticity

Again, because the regression is based on a cross-section dataset, we should check for heteroskedasticity. To check for heteroskedasticity, use PROC PLOT to make plots of the ehat's against each X variable (you don't need to check PavedMilesAreaSq, because you are already checking PavedMilesArea):

proc plot data=dataset03;
plot ehat*LandArea;
plot ehat*SchoolSpendPP;
plot ehat*PropTaxRate;
plot ehat*PavedMilesArea;
plot ehat*LNMuniH2OArea;
plot ehat*LNDocsPer1000;
plot ehat*MtnRegion;
plot ehat*CstRegion;
plot ehat*DumMeckGuil;
run;

Now examine each of the plots and look for patterns. There appears to be a sideways cone-shaped pattern (a sign of heteroskedasticity) in the plots of ehat against LNMuniH2OArea, SchoolSpendPP, and PavedMilesArea. Also, the variation in the ehats appears to be different for the different values of the dummy variable in both the MtnRegion and CstRegion plots (and in the DumMeckGuil plot). So, there does appear to be heteroskedasticity. A formal test can supplement these plots; an optional sketch is shown below.
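If you would like a formal check to back up the visual inspection, one option is White's test for heteroskedasticity, which PROC REG can run through the SPEC option on the model statement. This is an optional supplement, not something the homework requires, and it is only a sketch; the regression below assumes you have already added DumMeckGuil in the data step.

/* Optional: White's test for heteroskedasticity via the SPEC option (not required by the homework) */
proc reg data=dataset02;
model EMPmanf2000 = LandArea SchoolSpendPP PropTaxRate PavedMilesArea PavedMilesAreaSq LNMuniH2OArea LNDocsPer1000 MtnRegion CstRegion DumMeckGuil / spec;
run;

A small p-value in the resulting "Test of First and Second Moment Specification" table is evidence of heteroskedasticity (or some other misspecification), which would agree with what the plots suggest.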
In your homework, say that you tested the regression for heteroskedasticity because you have cross-section data, and briefly describe in words which plots you created and what patterns you observed for which variables.

Correcting Heteroskedasticity

Let's do a weighted least squares (WLS) regression to correct for the heteroskedasticity. To do a WLS regression, we need to create a weight variable, w. Let's base the weight on PavedMilesArea. (We will hope that correcting the heteroskedasticity due to PavedMilesArea will also correct most of the heteroskedasticity due to the other variables; this hope will likely be fulfilled if the X variables move together somewhat, which is likely true here. For example, if PavedMilesArea is large, then SchoolSpendPP and LNMuniH2OArea will also likely be large. In any case, in more advanced econometrics courses you will learn how to correct for heteroskedasticity from multiple sources.)

Looking at the plot of ehat against PavedMilesArea, it appears as if the variation in the ehats is proportional to the square of PavedMilesArea, so the weight variable needed to do the WLS regression is: w = 1/SQRT(PavedMilesArea**2). (Since PavedMilesArea is positive, this is the same as 1/PavedMilesArea; the 1/SQRT(...) form just follows the general recipe of weighting by one over the square root of whatever the error variance is proportional to.) In SAS, we go back and add commands to the Data Step to create the weight variable w and the weighted Y and X variables that will be used in the WLS regression equation:

w = 1/SQRT(PavedMilesArea**2);
EMPmanfNEW = w*EMPmanf2000;
LandAreaNEW = w*LandArea;
SchoolSpendNEW = w*SchoolSpendPP;
PropTaxRateNEW = w*PropTaxRate;
PavedMilesNEW = w*PavedMilesArea;
PavedMilesSqNEW = w*PavedMilesAreaSq;
LNMuniH2ONEW = w*LNMuniH2OArea;
LNDocsNEW = w*LNDocsPer1000;
MtnRegionNEW = w*MtnRegion;
CstRegionNEW = w*CstRegion;
DumMeckGuilNEW = w*DumMeckGuil;

Next, run the WLS regression using the w variable and the new, weighted Y and X variables. As described in the handout about heteroskedasticity, don't forget to include the w variable itself as an X variable in the regression, and don't forget to include the "noint" option at the end of the model command line.

proc reg data=dataset03;
model EMPmanfNEW = w LandAreaNEW SchoolSpendNEW PropTaxRateNEW PavedMilesNEW PavedMilesSqNEW LNMuniH2ONEW LNDocsNEW MtnRegionNEW CstRegionNEW DumMeckGuilNEW / noint;
output out=dataset04 p=yhatNEW r=ehatNEW h=levNEW rstudent=student_residNEW;
run;

In your homework, say that you corrected for heteroskedasticity by running a weighted least squares (WLS) regression, give the formula for w, and include the name of the X variable that you used in the formula.

Interpreting the WLS Regression Results

Compare the standard error of the regression (SER) for the two regressions. (Note: SAS calls the SER the "Root MSE"; that is, SER = Root MSE.) In the first, OLS regression, SER = 6483, while in the second, WLS regression, the SER is much lower; that is, correcting for heteroskedasticity reduced the average distance (error) between the data points and the regression line by quite a bit!

Also, make a plot of the residuals from the WLS regression (the ehatNEW's) against PavedMilesArea and compare it to the plot of ehat against PavedMilesArea.

proc plot data=dataset04;
plot ehatNEW*PavedMilesArea='o' $ CntyName;
run;

The WLS regression should have reduced any pattern in the ehats, so there should be less evidence of a pattern in the graph of ehatNEW against PavedMilesArea compared to the graph of ehat against PavedMilesArea.
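As an optional side note (not part of the homework), PROC REG also has a WEIGHT statement that fits the same WLS regression without creating the transformed variables by hand. The sketch below assumes the weight logic above; because the WEIGHT statement multiplies each squared residual by the weight variable, the appropriate weight here is w squared, that is, 1/PavedMilesArea**2. The dataset name dataset03b is just a placeholder.

data dataset03b;                      /* placeholder name for a copy of dataset03 with the weight added */
set dataset03;
wsq = 1/(PavedMilesArea**2);          /* the WEIGHT statement weights squared residuals, so use w**2 */
run;

proc reg data=dataset03b;
model EMPmanf2000 = LandArea SchoolSpendPP PropTaxRate PavedMilesArea PavedMilesAreaSq LNMuniH2OArea LNDocsPer1000 MtnRegion CstRegion DumMeckGuil;
weight wsq;
run;

The coefficient estimates should match those from the hand-weighted regression above, with the intercept here playing the role of the coefficient on w. For this homework, though, follow the hand-weighted approach so your output matches the handout.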
In your homework, briefly describe whether the "outlier" counties still appear to be causing problems.

Put the heteroskedasticity-corrected WLS regression output from SAS in a table in your homework, using a table format like that described in Homework 13. Then, interpret the regression results:

What is the final estimated WLS equation? (This is the regression equation with the β values filled in.)
What is the F-value (and its p-value), and what does it mean?
What is the Adj R-square value, and what does it mean?
What are the parameter estimates, and what do they mean?
What are the t-values, and what do they mean?
Which of the X variables have a statistically significant effect on Y at the α = 0.05 significance level?
Report the OLS SER, the WLS SER, and the reduction in SER resulting from the heteroskedasticity correction. By what percentage did SER decrease? (Percentage decrease = 100 × (OLS SER − WLS SER) / OLS SER.)

Checking the Normality Assumption

Let's check the residuals (that is, the ehatNEW's) from the final, heteroskedasticity-corrected WLS regression to determine whether they meet the normality assumption of OLS regression. In SAS, following the WLS regression commands, use PROC GCHART to create a histogram of the residuals (the ehatNEW's):

proc gchart data=dataset04;
vbar ehatNEW / levels=13;
run;

Check the histogram. Is it approximately bell-shaped? Is it approximately centered on zero (on the x axis)? Use PROC MEANS to check the mean, median, skewness, and kurtosis of the residuals.

proc means data=dataset04 vardef=df maxdec=3 n mean median skew kurt;
var ehatNEW;
run;

If the distribution of the residuals is normal (bell-shaped and centered on zero), then (1) the mean will be approximately equal to the median, (2) the skewness will be approximately zero, and (3) the kurtosis will be approximately 3. (Note: SAS's "kurt" statistic reports excess kurtosis, which already subtracts 3, so for normal residuals the value printed by PROC MEANS will be close to zero.) Based on these numbers, does the distribution of the residuals appear to be normal?

Finally, use the output from PROC MEANS to conduct a Jarque-Bera (JB) test of normality of the residuals. Do this test on paper, not in SAS. What is the result of the JB test? (A reminder of the JB formula appears at the end of this handout.)

In your homework, say that you tested the residuals from the WLS regression for normality by (1) examining a histogram of the residuals, (2) comparing the mean of the residuals to the median, (3) checking the skewness and kurtosis of the residuals, and (4) conducting a Jarque-Bera test. Describe the results of the normality tests, and say whether the residuals from the WLS regression appear to be normally distributed or not. If they do appear to be normally distributed, then don't write anything else. If they do not appear to be normally distributed, then say that future research will consider this issue in greater detail. (In the real world, if the residuals do not appear to be normally distributed, then we would need to investigate further and perhaps change the functional form of the variables in the model, etc., until we get residuals that are normally distributed.)

Save/Print Your Program and Write up Your Homework

After you run your SAS program and verify that it is working correctly, save the SAS program as HW14.sas. Print out your program (you can copy it from the Editor window of SAS, paste it into Word, and print it), and turn it in with your homework. Also, when this homework asks you to answer specific questions about the results, answer in complete sentences, in addition to giving the appropriate numbers.
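Reminder on the Jarque-Bera calculation (for reference only; check the exact formula against your course handout). In its standard form,

JB = (n/6) · [ S² + (K − 3)²/4 ]

where n is the number of residuals, S is the skewness, and K is the kurtosis. (Some textbooks use n − k, the number of observations minus the number of estimated coefficients, in place of n; follow whichever version your handout uses. Also, because SAS's "kurt" statistic already reports K − 3, you can plug the PROC MEANS kurtosis value straight into the squared term.) Under the null hypothesis of normality, JB is approximately chi-square distributed with 2 degrees of freedom, so at α = 0.05 you reject normality if JB is greater than about 5.99. For example, with hypothetical values n = 100, skewness = 0.4, and excess kurtosis = 0.5, JB = (100/6)·(0.4² + 0.5²/4) = (100/6)·(0.2225) ≈ 3.7, which is less than 5.99, so you would not reject normality.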
Be sure to put your name, ECN377, your section, and "Homework 14" at the top of your homework.