UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas

Homework 13 (Due Tues., Nov. 10th)

Background

Policy makers in North Carolina want to attract manufacturing jobs to the state. To do so, they need to identify the factors that are attractive to manufacturers seeking to locate new factories or relocate old ones. Suppose policy makers have been arguing about the reasons for the large differences in manufacturing employment across North Carolina counties in 2000. A glance at the descriptive statistics for variable EmpManf2000 from our cleanNCcounties.xls dataset shows that manufacturing employment ranges from a minimum of 79 in one county to a maximum of 56,229 in another, with a mean of 8,163 and a median of 4,689. Why the huge differences? Which factors cause some counties to be very attractive to manufacturers and other counties to be less attractive?

Model Specification: Choosing Variables for the Regression Model

You decide to build a regression model to explain the variation in county Manufacturing Employment in North Carolina in 2000 (variable EmpManf2000) using the data available in the cleanNCcounties.xls dataset. Begin a new SAS program for this analysis. Use PROC IMPORT to bring the cleanNCcounties.xls data set into SAS.

The first issue you need to address is Model Specification: which variables to include in your model and which Functional Form to use for your model equation. Obviously, EmpManf2000 will be the dependent (Y) variable, but what about the independent (X) variables? We need to think of variables that would have a large impact on Y. Also, keep in mind that the X variables we select should NOT affect one another (they should not be linearly correlated among themselves) in order to avoid the multicollinearity problem.

Well, when it comes to choosing X variables, you have some ideas of your own, and you receive lots of advice from others.
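The import step described above might look something like the sketch below. The file path is a placeholder, the output data set name dataset01 is an assumption (the handout only names dataset02, which is created later), and DBMS=XLS assumes your SAS installation can read .xls files directly; adjust all three to match your own setup.

```sas
/* Sketch only: the path and the name dataset01 are assumptions, not from the handout. */
proc import datafile="C:\MyFolder\cleanNCcounties.xls"  /* hypothetical location of the file */
    out=dataset01        /* assumed name for the imported data set */
    dbms=xls             /* assumes .xls import is available in your SAS installation */
    replace;             /* overwrite dataset01 if it already exists */
    getnames=yes;        /* read variable names from the first row of the spreadsheet */
run;
```

After the step runs, check the SAS Log window to confirm that the number of observations read matches the number of counties in the spreadsheet.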
First, you think that the size of the county might affect EmpManf2000, so you will include LandArea as an X variable. Next, you think that population might affect EmpManf2000, but then you think: Wait! Does the population attract the manufacturing jobs, or do the manufacturing jobs attract the population? To sort this out we would need a Simultaneous Equations model. We will not get into that yet, so we leave population out of the model.

What other factors could affect EmpManf2000? You ask county managers. They tell you that the amount of infrastructure in the county affects the attractiveness of the county to manufacturers: things like paved roads, water and sewer lines, etc. Okay, so you decide to include paved miles of roadway per square mile for each county (you need to create this variable: PavedMilesArea = PavedMiles/LandArea) and gallons of water used by city and industry per year per square mile for each county (you need to create this variable: MuniH2OArea = MuniCommH2O*1000000/LandArea).

TIP: Make sure that you use the letter "O" in H2O rather than the number zero "0". Computers see the letter O as something different from the number 0.

In SAS, use a single Data Step to create your new variables (you might need to review the Data Step handout). Create all of the new variables in one Data Step; don't make a separate Data Step for each variable. Name the new dataset created by the Data Step dataset02. Be sure to use a "set" command inside the Data Step to copy all of the variables from the imported cleanNCcounties.xls data into dataset02.

You ask county Economic Development agencies, the people responsible for trying to attract manufacturers to the county. They tell you that the quality of the local schools, the education level of the workforce, the quality of healthcare available, and the level of fire and police protection also affect the location decisions of manufacturers.
Based on this information, you decide to include four more variables, each of which you need to create:

  - dollars spent on local schools per person for each county: SchoolSpendPP = LocPubSchExp*1000/PopCens
  - college graduates per 100 people in each county: ColGradsper100 = CoColGrads/(PopCens/100)
  - the number of doctors per 1000 people in each county: DocsPer1000 = PCPhysicians/(PopCens/1000)
  - dollars spent on public safety programs (police and fire department) per square mile for each county: PubSafExpArea = PubSafExp/LandArea

You ask the Chamber of Commerce about potential X variables, and they mention the ones already mentioned by others plus an additional, important one: property tax rate (variable PropTaxRate).

Finally, geographic region (variable GeoRegion) might have a significant effect on EmpManf2000, so let's create dummy variables that allow for regional effects. Looking at the data in the cleanNCcounties.xls spreadsheet, you notice that GeoRegion is a categorical variable with 3 categories; therefore, we need to create 2 dummy variables (3 - 1 = 2) to represent the 3 categories. In SAS, create dummy variables "MtnRegion" and "CstRegion" to represent the mountain and coast regions, respectively:

    if GeoRegion = 'mountain' then MtnRegion = 1; else MtnRegion = 0;
    if GeoRegion = 'coast' then CstRegion = 1; else CstRegion = 0;

Notice that there is no dummy variable for the "plains" region. "Plains" is the "left out" region that will serve as the baseline region. When both MtnRegion=0 and CstRegion=0, the region is "plains."

So, your list of potential X variables is: LandArea, PavedMilesArea, MuniH2OArea, SchoolSpendPP, ColGradsper100, DocsPer1000, PubSafExpArea, PropTaxRate, MtnRegion and CstRegion. This is a nice start, don't you think?

Model Specification: Functional Form

Next, we need to decide on the Functional Form for the model.
The idea here is to decide how each X variable should be included in the model equation: should it be squared, logged, etc., or none of the above? Use PROC PLOT in your SAS program to make a plot of EmpManf2000 against each of the X variables separately, and then look for patterns. Do not include these plots in the homework that you turn in; you're just using the plots to decide on the best functional form for your regression model.

The plots of EmpManf2000 against LandArea, SchoolSpendPP, and PropTaxRate show roughly linear patterns, so you decide to include these variables linearly in the model, that is, without any squares, logs, etc. So, at this point the regression model looks like this:

    EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate + e,

where e is an error term.

The plot of EmpManf2000 against PavedMilesArea shows a nonlinear pattern of increasing faster and faster, so we'll include a squared term for PavedMilesArea to capture this pattern. You must create the squared term. In SAS, create the variable PavedMilesAreaSq = PavedMilesArea**2. Create this variable in the same Data Step that you used for the variables created earlier. In SAS, the symbol "**" means "raise to the power." So, at this point the regression model looks like this:

    EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate
                  + β4·PavedMilesArea + β5·PavedMilesAreaSq + e

The plots of EmpManf2000 against MuniH2OArea, ColGradsper100, DocsPer1000, and PubSafExpArea show a nonlinear pattern of increasing at a slower and slower rate, so we'll include these variables in logged form. In SAS, the log() function is the natural logarithm. Create the variables LNMuniH2OArea = log(MuniH2OArea), LNColGradsper100 = log(ColGradsper100), LNDocsPer1000 = log(DocsPer1000), and LNPubSafExpArea = log(PubSafExpArea), again in the same Data Step.
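Putting the variable definitions above together, the single Data Step and the exploratory plots might be sketched as follows. The input data set name dataset01 is an assumption (use whatever name you gave the data set created by PROC IMPORT); every formula is copied from the handout text.

```sas
/* Sketch only: dataset01 is an assumed name for the imported data set. */
data dataset02;
    set dataset01;   /* copy all of the imported variables into dataset02 */

    /* Infrastructure variables, per square mile */
    PavedMilesArea = PavedMiles / LandArea;
    MuniH2OArea    = MuniCommH2O * 1000000 / LandArea;   /* letter O in H2O, not zero */

    /* School, education, healthcare, and public safety variables */
    SchoolSpendPP  = LocPubSchExp * 1000 / PopCens;
    ColGradsper100 = CoColGrads / (PopCens / 100);
    DocsPer1000    = PCPhysicians / (PopCens / 1000);
    PubSafExpArea  = PubSafExp / LandArea;

    /* Regional dummy variables; "plains" is the left-out baseline region */
    if GeoRegion = 'mountain' then MtnRegion = 1; else MtnRegion = 0;
    if GeoRegion = 'coast'    then CstRegion = 1; else CstRegion = 0;

    /* Functional-form terms: one squared term and four natural logs */
    PavedMilesAreaSq = PavedMilesArea**2;
    LNMuniH2OArea    = log(MuniH2OArea);
    LNColGradsper100 = log(ColGradsper100);
    LNDocsPer1000    = log(DocsPer1000);
    LNPubSafExpArea  = log(PubSafExpArea);
run;

/* One plot of Y against each continuous X variable (vertical*horizontal) */
proc plot data=dataset02;
    plot EmpManf2000*LandArea
         EmpManf2000*SchoolSpendPP
         EmpManf2000*PropTaxRate
         EmpManf2000*PavedMilesArea
         EmpManf2000*MuniH2OArea
         EmpManf2000*ColGradsper100
         EmpManf2000*DocsPer1000
         EmpManf2000*PubSafExpArea;
run;
```

If any county has a zero value for a variable being logged, SAS sets the logged variable to missing for that county and writes a note to the Log window, so check the Log after running the Data Step.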
So, at this point the regression model looks like this:

    EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate
                  + β4·PavedMilesArea + β5·PavedMilesAreaSq
                  + β6·LNMuniH2OArea + β7·LNColGradsper100
                  + β8·LNDocsPer1000 + β9·LNPubSafExpArea + e

Finally, include the two dummy variables for the mountain region and the coast region:

    EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate
                  + β4·PavedMilesArea + β5·PavedMilesAreaSq
                  + β6·LNMuniH2OArea + β7·LNColGradsper100
                  + β8·LNDocsPer1000 + β9·LNPubSafExpArea
                  + β10·MtnRegion + β11·CstRegion + e

Checking for Multicollinearity

Recall that one of the assumptions of the OLS regression method is that X variables are not linearly related to one another. If two or more X variables are linearly correlated with one another, we have the "Multicollinearity Problem." Recall that two X variables are strongly, linearly related when their Pearson correlation coefficient, r, is large in absolute value (say, |r| > 0.70). Before we run the regression, use PROC CORR in your SAS program to check for multicollinearity among our continuous X variables (do include PavedMilesArea, but you don't need to include PavedMilesAreaSq; also, don't include MtnRegion or CstRegion, because these are dummy variables). We can drop some X variables from the model if they are highly correlated with other X variables.

First, we notice that LNPubSafExpArea is moderately to strongly correlated with several other variables:

    LNPubSafExpArea and LNMuniH2OArea       r = 0.83
    LNPubSafExpArea and PavedMilesArea      r = 0.78
    LNPubSafExpArea and SchoolSpendPP       r = 0.56
    LNPubSafExpArea and LNColGradsper100    r = 0.53

So, let's drop LNPubSafExpArea from the model.

We also notice that LNColGradsper100 is moderately linearly correlated with two of the remaining variables:

    LNColGradsper100 and SchoolSpendPP      r = 0.63
    LNColGradsper100 and LNDocsPer1000      r = 0.57

So, let's also drop LNColGradsper100 from the model.
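The correlation check described above might be sketched like this. The VAR list is an assumption about which form of each variable to use: it takes the eight continuous X variables in the form in which they appear in the model at this stage, and leaves out PavedMilesAreaSq and the two dummy variables as instructed.

```sas
/* Sketch only: the choice of logged vs. unlogged forms in the VAR list is an assumption. */
proc corr data=dataset02;
    var LandArea SchoolSpendPP PropTaxRate PavedMilesArea
        LNMuniH2OArea LNColGradsper100 LNDocsPer1000 LNPubSafExpArea;
run;
```

By default PROC CORR prints a matrix of Pearson correlation coefficients with a p-value under each one; scan the off-diagonal entries for large values.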
Finally, although PavedMilesArea and LNMuniH2OArea are moderately linearly related (r = 0.66), we will leave both in the model, because county officials (our clients) are very interested in the effects of these two variables on manufacturing employment. However, by choosing to leave both variables in the model we are accepting some degree of multicollinearity, which can make the coefficient estimates for these two variables less precise. Our model is now:

    EmpManf2000 = β0 + β1·LandArea + β2·SchoolSpendPP + β3·PropTaxRate
                  + β4·PavedMilesArea + β5·PavedMilesAreaSq
                  + β6·LNMuniH2OArea + β7·LNDocsPer1000
                  + β8·MtnRegion + β9·CstRegion + e

NOTE!! You only need to put the last model equation above in your homework; all of the other equations are "scratch work" that we used to develop the final equation.

Initial Regression Analysis

Based on the model above, run the following regression in SAS:

    proc reg data=dataset02;
        model EmpManf2000 = LandArea SchoolSpendPP PropTaxRate
                            PavedMilesArea PavedMilesAreaSq
                            LNMuniH2OArea LNDocsPer1000
                            MtnRegion CstRegion;
    run;

Presenting the Regression Results

Now present the results from the regression. We will modify/correct this regression in later homeworks, but first we need to describe the results from this initial regression. Put the regression results in a table in your homework. Use only the regression output that appears in the Output window of SAS, not all of the other stuff that SAS puts in the Output window. Use a format for the table like that shown in the example below:

    OLS Regression Results
    Dependent Variable: (name of Y variable goes here)

    Independent Variables:         Coefficients:                  t-values
    ---------------------------    ---------------------------    ---------------------------
    Intercept                      (value of β0 goes here)        (t-value of β0 goes here)
    (name of first X variable)     (value of β1 goes here)        (t-value of β1 goes here)
    (name of second X variable)    (value of β2 goes here)        (t-value of β2 goes here)
    (name of third X variable)     (value of β3 goes here)        (t-value of β3 goes here)
    etc.                           etc.                           etc.
    etc.                           etc.                           etc.

    n                              (value of n goes here)
    F-value (p-value)              (value of F goes here)         (p-value of F goes here; put it in parentheses)
    S.E.R.                         (value of S.E.R. goes here)
    Adj-R2                         (value of Adj-R2 goes here)

Interpreting the Initial Regression Results

Interpret/discuss the regression results in your homework by answering the following questions:

  - What is the F-value (and its p-value), and what does it mean?
  - What is the Adj R-square value, and what does it mean?
  - What is the SER for the regression, and what does it mean?
  - What are the coefficient/parameter estimates, and what do they mean?
  - What are the t-values, and what do they mean?
  - Which of the X variables appear to affect Y?
  - What is the estimated model equation? (This is the regression equation with the estimated beta values filled in for the betas in the equation.)

Begin your discussion of the regression results like this: "The table above presents the results of the initial regression analysis. The F-value of the regression indicates that . . . "

Save Your Program and Write up Your Homework

After you run your SAS program and verify that it is working correctly, save the SAS program as HW13.sas. Print out your program (you can copy it from the Editor window of SAS, paste it into Word, and print it), and turn it in with your homework. Also, when this homework asks you to answer specific questions about the results, you need to answer in complete sentences, in addition to giving the appropriate numbers. Be sure to put your name, ECN377, your section, and "Homework 13" at the top of your homework.