Lab #2
POL SCI 701
Fall 2002

Let's start by taking the STATA tutorial on regression. Type tutorial regress. Once you've completed the tutorial, you may move on to the actual assignment.

Part I: Follow along with Hamilton.

This section is designed to get you to read the Hamilton chapters I gave you and to see how STATA runs certain commands. I also hope to give you experience interpreting the output obtained by running the various regression commands. Be sure to answer each question fully and to attach all relevant output. I'd prefer that your responses be typed. You can either answer each number separately or write your responses in essay form. Just be sure to answer each part of each question.

NOTE: For this section, it is sufficient to use only the tests given in the Hamilton chapters. In Part II, you will also be asked to run tests learned in class.

Using states.dta (from the Hamilton package), do the following:

1) Run a regression using mean composite SAT score as the dependent variable and per pupil expenditures as the independent variable. Interpret all relevant output: F, R-squared, adjusted R-squared, the effect of x on y, whether x is significant and at what level, and the root MSE (the standard error of the estimate we've talked about before). What does this tell us substantively? (Attach both your answers and the output generated.)

2) Control for the potential influence of other variables, including percentage of high school grads taking the test, median household income of the student's family, percentage of test takers over 25 with a high school diploma, and percentage of test takers over 25 with a bachelor's degree. What do you find? Again, interpret (and attach) all relevant output. Test the null hypothesis that the addition of these variables was not necessary. (HINT: Conduct an F test using information obtained from this regression and the regression in number 1.)

3) Which variable has the most substantive impact on SAT scores?
(Run the appropriate regression needed to answer this question and interpret all relevant output.)

* For the following questions, use the large model without standardized coefficients. You should run that model again so that it's the last regression in STATA's memory.

4) Is there any evidence of multicollinearity in your model? Use plain old correlations to test this, and look at the output of your model for clues.

5) Is there any evidence of autocorrelation in your model? Use the Durbin-Watson test. (Note: To use the t( ) option with regdw, you should first sort the data by region, then create a variable that merely numbers the cases, so that each state is now a number (use gen stcode=_n), and then use that new variable as the time variable. It should be clear to you why we would need to do this. [See your notes on autocorrelation if it is not!]) Compare your results here with an appropriate diagnostic plot. What do you find?

6) Is there evidence of omitted variable bias? Use ovtest.

7) Is there evidence of heteroskedasticity? Use hettest. Which states seem to be the problem? [HINT: You may want to run a diagnostic plot to determine this.]

8) Are there any influential cases? Use cooksd, dfits, and dfbeta to determine this.

Part II: Applying what we've learned here and in class to a real data problem.

In this section, we'll use the tests in Hamilton along with some of the other tests we learned in class to test for violations of the various OLS assumptions. Use the STATA dataset auto.dta (found on the lab computers in the STATA folder) to conduct the following analyses.

1) Create a variable called forxmpg, which is the interaction between foreign and mpg (gen forxmpg=foreign*mpg).

2) Run a regression using price as the dependent variable, and weight, mpg, forxmpg, and foreign as the independent variables. Interpret this regression. First, what seems to be the theory here? Second, is the theory supported? Third, what does the regression output tell us?
Which variable has the most impact on the dependent variable?

3) Is there evidence of heteroskedasticity here? Use STATA's hettest, a plot of the residuals, the Goldfeld-Quandt test, and White's general heteroskedasticity test. What do you conclude from each?

4) Is there evidence of multicollinearity? Examine the output of the regression for clues, check the pairwise correlations and partial correlations, run the necessary auxiliary regressions, and examine the VIF. What do you conclude?

5) Are there any influential cases? Again, use cooksd, dfits, and dfbeta to make this determination.

6) Since this isn't a time-series or really even a cross-sectional dataset, open the states.dta dataset again and test it for autocorrelation (you obtained the Durbin-Watson statistic above). Again, be sure to sort by region and then create a counting variable that can serve as a proxy for time. Now plot the residuals and run the Breusch-Godfrey and M tests. What can you conclude? Is this conclusion different from the one you reached above using Durbin's d?
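
The sort-and-number setup described in Part I, question 5 and Part II, question 6 can be sketched as follows. This is only a rough sketch under several assumptions that are not in the handout: that you are running a more recent STATA release, where tsset followed by estat dwatson and estat bgodfrey stands in for the older regdw command; that states.dta sits in the working directory; and that the Hamilton variables are named csat, expense, percent, income, high, college, and region (check with describe before relying on these names). The residual variable name e is arbitrary.

```stata
* Sketch only: assumes a recent STATA release and states.dta in the
* working directory. Variable names are assumed; verify with describe.
use states, clear

* Order the cases by region, then number them so that the case
* number can stand in for time.
sort region
generate stcode = _n
tsset stcode

* Large model from Part I, without standardized coefficients.
regress csat expense percent income high college

* Durbin-Watson d and Breusch-Godfrey test on the last regression.
estat dwatson
estat bgodfrey, lags(1)

* Residuals plotted against case order as a visual check.
predict e, residuals
scatter e stcode, yline(0)
```

Because the tests depend on the order of the cases, a different ordering within each region may change the statistics, so it is worth keeping the sort order fixed when comparing results across tests.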