Homework 13

UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Homework 13 (Due Tues., Nov. 10th)
Background
Policy makers in North Carolina want to attract manufacturing jobs to the state. To do so, they need to identify
the factors that are attractive to manufacturers seeking to locate new factories or relocate old ones. Suppose
policy makers have been arguing about the reasons for the large differences in manufacturing employment across
North Carolina counties in 2000. A glance at the descriptive statistics for variable EmpManf2000 from our
cleanNCcounties.xls dataset shows that manufacturing employment ranges from a minimum of 79 in one county
to a maximum of 56,229 in another, with a mean of 8,163 and a median of 4,689. Why the huge differences?
Which factors cause some counties to be very attractive to manufacturers and other counties to be less attractive?
Model Specification: Choosing Variables for the Regression Model
You decide to build a regression model to explain the variation in county Manufacturing Employment in North
Carolina in 2000 (variable EmpManf2000) using the data available in the cleanNCcounties.xls dataset.
Begin a new SAS program for this analysis. Use PROC IMPORT to bring the cleanNCcounties.xls data set
into SAS.
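A minimal sketch of the import step might look like the following. The file path and the output dataset name dataset01 are assumptions (neither appears in this assignment), and DBMS=XLS assumes your SAS installation is able to read Excel .xls files; adjust these to match your own setup.

```sas
/* Sketch only: the path and the name dataset01 are placeholders. */
proc import datafile="C:\ECN377\cleanNCcounties.xls"
            out=dataset01      /* hypothetical name for the imported data */
            dbms=xls
            replace;           /* overwrite dataset01 if it already exists */
    getnames=yes;              /* first spreadsheet row holds the variable names */
run;
```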
The first issue you need to address is Model Specification: which variables to include in your model and which
Functional Form to use for your model equation.
Obviously, EmpManf2000 will be the dependent (Y) variable, but what about the independent (X) variables?
We need to think of variables that would have a large impact on Y. Also, keep in mind that the X variables we
select should NOT be strongly related to one another (they should not be highly linearly correlated among
themselves) in order to avoid the multicollinearity problem.
Well, when it comes to choosing X variables, you have some ideas of your own, and you receive lots of advice
from others. First, you think that the size of the county might affect EmpManf2000, so you will include
LandArea as an X variable. Next, you think that population might affect EmpManf2000, but then you think:
Wait! Does the population attract the manufacturing jobs, or do the manufacturing jobs attract the population?
To sort this out, we would need to estimate a Simultaneous Equations model. We won't get into that yet, so we
will leave population out of the model.
What other factors could affect EmpManf2000? You ask county managers. They tell you that the amount of
infrastructure in the county affects the attractiveness of the county to manufacturers, things like paved roads,
water and sewer lines, etc. Okay, so you decide to include paved miles of roadway per square mile for each
county (you need to create this variable: PavedMilesArea = PavedMiles/LandArea) and gallons of water used by
city and industry per year per square mile for each county (you need to create this variable: MuniH2OArea =
MuniCommH2O*1000000/LandArea). TIP: Make sure that you use the letter "O" in H2O rather than the
number zero "0". Computers see the letter O as something different from the number 0.
In SAS, use a single Data Step to create your new variables (you might need to review the Data Step handout).
Create all of the new variables in one Data Step; don't make a separate Data Step for each variable. Name the
new dataset created by the Data Step dataset02. Be sure to use a "set" statement inside the Data Step to copy all
of the variables from the imported cleanNCcounties data into dataset02.
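As a rough sketch, the Data Step for the two infrastructure variables could start out like this. The input dataset name dataset01 is an assumption (use whatever name you gave the imported data in PROC IMPORT); the formulas are the ones given above.

```sas
data dataset02;
    set dataset01;   /* dataset01 = hypothetical name of the imported data */

    /* paved miles of roadway per square mile */
    PavedMilesArea = PavedMiles / LandArea;

    /* gallons of water used by city and industry per year per square mile
       (letter O, not the number zero, in H2O) */
    MuniH2OArea = MuniCommH2O * 1000000 / LandArea;
run;
```

The remaining new variables described in this homework would be added as further assignment statements between the set statement and run.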
You ask county Economic Development agencies, the people responsible for trying to attract manufacturers to the
county. They tell you that the quality of the local schools, the education level of the workforce, the quality of
healthcare available, and the level of fire and police protection also affect the location decisions of manufacturers.
Based on this information, you decide to include dollars spent on local schools per person for each county (you
need to create this variable: SchoolSpendPP = LocPubSchExp*1000/PopCens), college graduates per 100 people
in each county (you need to create this variable: ColGradsper100 = CoColGrads/(PopCens/100)), the number of
doctors per 1000 people in each county (you need to create this variable: DocsPer1000 =
PCPhysicians/(PopCens/1000)), and the dollars spent on public safety programs (police and fire department) per
square mile for each county (you need to create this variable: PubSafExpArea = PubSafExp/LandArea).
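The four ratios above translate directly into assignment statements. In the homework they belong in the same single Data Step as the other new variables; the sketch below re-reads and overwrites dataset02 (assumed to exist already) only so that it can run by itself.

```sas
data dataset02;
    set dataset02;   /* assumes dataset02 already exists; see note above */

    /* dollars spent on local schools per person */
    SchoolSpendPP  = LocPubSchExp * 1000 / PopCens;

    /* college graduates per 100 people */
    ColGradsper100 = CoColGrads / (PopCens / 100);

    /* doctors per 1000 people */
    DocsPer1000    = PCPhysicians / (PopCens / 1000);

    /* public safety dollars per square mile */
    PubSafExpArea  = PubSafExp / LandArea;
run;
```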
You ask the Chamber of Commerce about potential X variables, and they mention the ones already mentioned by
others plus an additional, important one: property tax rate (variable PropTaxRate).
Finally, geographic region (variable GeoRegion) might have a significant effect on EmpManf2000, so let's create
dummy variables that allow for regional effects. Looking at the data in the cleanNCcounties.xls spreadsheet, you
notice that GeoRegion is a categorical variable with 3 categories; therefore, we need to create 2 dummy variables
(3 – 1 = 2) to represent the 3 categories. In SAS, create dummy variables "MtnRegion" and "CstRegion" to
represent the mountain and coast regions, respectively:
if GeoRegion = 'mountain' then MtnRegion = 1; else MtnRegion = 0 ;
if GeoRegion = 'coast' then CstRegion = 1; else CstRegion = 0 ;
Notice that there is no dummy variable for the "plains" region. “Plains” is the “left out” region that will
serve as the baseline region. When both MtnRegion=0 and CstRegion=0, the region is "plains."
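A quick way to confirm that the dummies were coded as intended is to cross-tabulate them against GeoRegion. The sketch below assumes the dummies already exist in dataset02. Keep in mind that SAS character comparisons are case-sensitive, so if the spreadsheet happened to store a region with different capitalization than the text in your if statement, every county would receive a 0 for that dummy; the table makes that kind of mistake easy to spot.

```sas
proc freq data=dataset02;
    /* one row per combination of region and dummy values */
    tables GeoRegion*MtnRegion*CstRegion / list missing;
run;
```

Each region should appear in exactly one row, with the (MtnRegion, CstRegion) pattern you expect for it.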
So, your list of potential X variables is: LandArea, PavedMilesArea, MuniH2OArea, SchoolSpendPP,
ColGradsper100, DocsPer1000, PubSafExpArea, PropTaxRate, MtnRegion and CstRegion. This is a nice start,
don't you think?
Model Specification: Functional Form
Next, we need to decide on the Functional Form for the model. The idea here is to decide how each X variable
should enter the model equation: squared, logged, etc., or with no transformation at all. Use PROC
PLOT in your SAS program to make a plot of EmpManf2000 against each of the X variables separately, and then
look for patterns. Do not include these plots in the homework that you turn in; you're just using the plots to
decide on the best functional form for your regression model.
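A sketch of the plotting step, assuming the new variables are already in dataset02, is shown here for three of the X variables; the others follow the same pattern. In a PLOT statement the variable before the asterisk goes on the vertical axis and the one after it on the horizontal axis.

```sas
proc plot data=dataset02;
    plot EmpManf2000*LandArea;        /* vertical*horizontal */
    plot EmpManf2000*PavedMilesArea;
    plot EmpManf2000*MuniH2OArea;
run;
quit;
```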
The plots of EmpManf2000 against LandArea, SchoolSpendPP, and PropTaxRate show roughly linear patterns,
so you decide to include these variables linearly in the model, that is, without any squares, logs, etc. So, at this
point the regression model looks like this:
EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + e,
where e is an error term.
The plot of EmpManf2000 against PavedMilesArea shows a nonlinear pattern of increasing faster and faster, so
we'll include a squared term for PavedMilesArea to capture this pattern. You must create the squared term. In
SAS, create the variable PavedMilesAreaSq = PavedMilesArea**2. Create this variable in the same Data Step
that you used for the variables created earlier. In SAS, the symbol "**" means "raise to the power." So, at this
point the regression model looks like this:
EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + β4·PavedMilesArea +
β5·PavedMilesAreaSq + e
The plots of EmpManf2000 against MuniH2OArea, ColGradsper100, DocsPer1000, and PubSafExpArea show a
nonlinear pattern of increasing at a slower and slower rate, so we'll include these variables in logged form. In
SAS, create the variables LNMuniH2OArea = log(MuniH2OArea), LNColGradsper100 = log(ColGradsper100),
LNDocsPer1000 = log(DocsPer1000), and LNPubSafExpArea = log(PubSafExpArea). So, at this point the
regression model looks like this:
EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + β4·PavedMilesArea +
β5·PavedMilesAreaSq + β6·LNMuniH2OArea + β7·LNColGradsper100 + β8·LNDocsPer1000 +
β9·LNPubSafExpArea + e
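The squared and logged variables described above could be created as sketched below. In the homework these statements go in the same Data Step as the earlier variables; here the sketch re-reads and overwrites dataset02 (assumed to already hold PavedMilesArea, MuniH2OArea, ColGradsper100, DocsPer1000, and PubSafExpArea) only to stay self-contained. In SAS, log() is the natural logarithm, and it returns a missing value (with a note in the Log window) when its argument is zero or negative, so any county with such a value would be left out of a regression that uses the logged variable.

```sas
data dataset02;
    set dataset02;   /* assumes the ratio variables were created earlier */

    /* squared term: ** means "raise to the power" */
    PavedMilesAreaSq = PavedMilesArea**2;

    /* natural logs of the variables that rise at a decreasing rate */
    LNMuniH2OArea    = log(MuniH2OArea);
    LNColGradsper100 = log(ColGradsper100);
    LNDocsPer1000    = log(DocsPer1000);
    LNPubSafExpArea  = log(PubSafExpArea);
run;
```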
Finally, include the two dummy variables for the mountain region and the coast region:
EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + β4·PavedMilesArea +
β5·PavedMilesAreaSq + β6·LNMuniH2OArea + β7·LNColGradsper100 + β8·LNDocsPer1000 +
β9·LNPubSafExpArea + β10·MtnRegion + β11·CstRegion + e
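For interpreting the coefficients later, it may help to note what this functional form implies. The following is a sketch of standard calculus applied to the equation above, holding the other X variables constant; it is offered as a reading aid, not as part of the assignment.

```latex
\frac{\partial\,\mathrm{EmpManf2000}}{\partial\,\mathrm{PavedMilesArea}}
    = \beta_4 + 2\,\beta_5\,\mathrm{PavedMilesArea},
\qquad
\frac{\partial\,\mathrm{EmpManf2000}}{\partial\,\mathrm{MuniH2OArea}}
    = \frac{\beta_6}{\mathrm{MuniH2OArea}}
```

So the effect of PavedMilesArea depends on its own level, and a 1% increase in MuniH2OArea changes predicted EmpManf2000 by roughly β6/100 jobs. The coefficients on MtnRegion and CstRegion measure the difference in EmpManf2000 relative to the plains baseline.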
Checking for Multicollinearity
Recall that one of the assumptions of the OLS regression method is that the X variables are not strongly linearly
related to one another. If two or more X variables are highly linearly correlated with one another, we have the
“Multicollinearity Problem.” Recall that two X variables are strongly linearly related when their Pearson
correlation coefficient, r, is large in absolute value (say, |r| > 0.70). Before we run the regression, use PROC
CORR in your SAS program to check for multicollinearity among our continuous X variables (do include
PavedMilesArea, but you don't need to include PavedMilesAreaSq; also, don't include MtnRegion or CstRegion,
because these are dummy variables). We can drop some X variables from the model if they are highly correlated
with other X variables.
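A sketch of the correlation check is given below. The variable list is an assumption based on the model equation above: the continuous X variables, with PavedMilesArea included and PavedMilesAreaSq and the two dummies left out. By default PROC CORR reports Pearson correlation coefficients.

```sas
proc corr data=dataset02;
    var LandArea SchoolSpendPP PropTaxRate PavedMilesArea
        LNMuniH2OArea LNColGradsper100 LNDocsPer1000 LNPubSafExpArea;
run;
```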
First, we notice that LNPubSafExpArea is moderately to strongly correlated with several other variables:
LNPubSafExpArea and LNMuniH2OArea       r = 0.83
LNPubSafExpArea and PavedMilesArea      r = 0.78
LNPubSafExpArea and SchoolSpendPP       r = 0.56
LNPubSafExpArea and LNColGradsper100    r = 0.53
So, let's drop LNPubSafExpArea from the model.
We also notice that LNColGradsper100 is moderately linearly correlated with two of the remaining variables:
LNColGradsper100 and SchoolSpendPP      r = 0.63
LNColGradsper100 and LNDocsPer1000      r = 0.57
So, let's also drop LNColGradsper100 from the model.
Finally, although PavedMilesArea and LNMuniH2OArea are moderately linearly related (r = 0.66), we will leave
both in the model, because county officials (our clients) are very interested in the effects of these two variables on
manufacturing employment. However, by choosing to leave both variables in the model we may be accepting
some degree of multicollinearity, which tends to inflate the standard errors (and shrink the t-values) of the
affected coefficient estimates. Our model is now:
EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + β4·PavedMilesArea +
β5·PavedMilesAreaSq + β6·LNMuniH2OArea + β7·LNDocsPer1000 + β8·MtnRegion + β9·CstRegion + e
NOTE!! You only need to put the last model equation above in your homework; all of the
other equations are "scratch work" that we used to develop the final equation.
Initial Regression Analysis
Based on the model above, run the following regression in SAS:
proc reg data=dataset02;
    /* initial OLS regression using the final model equation above */
    model EmpManf2000 = LandArea SchoolSpendPP PropTaxRate
                        PavedMilesArea PavedMilesAreaSq LNMuniH2OArea LNDocsPer1000
                        MtnRegion CstRegion;
run;
quit;
Presenting the Regression Results
Now present the results from the regression. We will modify/correct this regression in later homeworks, but first
we need to describe the results from this initial regression. Put regression results (the output from the regression
that appears in the Output window of SAS—but only the regression output, not all of the stuff that SAS puts in
the Output window) in a table in your homework. Use a format for the table like that shown in the example
below:
OLS Regression Results
Dependent Variable: (name of Y variable goes here)

Independent Variables:        Coefficients:                  t-values
Intercept                     (value of β0 goes here)        (t-value of β0 goes here)
(name of first X variable)    (value of β1 goes here)        (t-value of β1 goes here)
(name of second X variable)   (value of β2 goes here)        (t-value of β2 goes here)
(name of third X variable)    (value of β3 goes here)        (t-value of β3 goes here)
etc.                          etc.                           etc.
etc.                          etc.                           etc.
n                             (value of n goes here)
F-value (p-value)             (value of F goes here)         (p-value of F goes here. put it in parentheses)
S.E.R.                        (value of S.E.R. goes here)
Adj-R2                        (value of Adj-R2 goes here)
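If copying numbers by hand from the Output window feels error-prone, one optional approach is to have SAS save the pieces of the regression output as datasets and print them. The sketch below assumes dataset02 contains all of the model variables; the output dataset names (regcoefs, regfit, reganova, regnobs) are made up for illustration. In the PROC REG output, the S.E.R. is labeled "Root MSE."

```sas
/* Save selected PROC REG output tables as datasets (names are hypothetical) */
ods output ParameterEstimates=regcoefs   /* coefficients and t-values */
           FitStatistics=regfit          /* Root MSE (the S.E.R.) and Adj R-Sq */
           ANOVA=reganova                /* F-value and its p-value */
           NObs=regnobs;                 /* number of observations */

proc reg data=dataset02;
    model EmpManf2000 = LandArea SchoolSpendPP PropTaxRate
                        PavedMilesArea PavedMilesAreaSq LNMuniH2OArea
                        LNDocsPer1000 MtnRegion CstRegion;
run;
quit;

/* Print each saved table so the values can be copied into the results table */
proc print data=regcoefs; run;
proc print data=regfit;   run;
proc print data=reganova; run;
proc print data=regnobs;  run;
```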
Interpreting the Initial Regression Results
Interpret/discuss the regression results in your homework by answering the following questions:
What is the F-value (and its p-value), and what does it mean?
What is the Adj R-square value, and what does it mean?
What is the SER for the regression, and what does it mean?
What are the coefficient/parameter estimates, and what do they mean?
What are the t-values, and what do they mean?
Which of the X variables appear to affect Y?
What is the estimated model equation? (This is the regression equation with the estimated beta values filled in for the betas in the equation.)
Begin your discussion of the regression results like this: “The table above presents the results of the initial
regression analysis. The F-value of the regression indicates that . . .”
Save Your Program and Write up Your Homework
After you run your SAS program and verify that it is working correctly, save the SAS program as HW13.sas.
Print out your program (you can copy it from the Editor window of SAS, paste it into Word, and print it), and turn
it in with your homework. Also, when this homework asks you to answer specific questions about the results, you
need to answer in complete sentences, in addition to giving the appropriate numbers. Be sure to put your name,
ECN377, your section, and “Homework 13” at the top of your homework.