Stat 401C

Lab 11
Fall 2005
Objective: Estimate regression models with both categorical and continuous variables.
Reading: Sections 16.5 – 16.7 in Howell (2002).
We can imagine situations where we might want to investigate the effects of both categorical and continuous
variables on an outcome. For example, we could ask whether some outcome y (e.g., life satisfaction or depressive
symptoms) is significantly related to marital status after controlling for age, which is measured on a continuum,
and whether age contributes significantly to the explanation of y after controlling for marital status. We might also
ask whether there is a significant interaction between these two predictor variables; that is, is the relationship
between y and age the same or different for each marital status group? To address these questions, we would estimate
the following models:
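M1: $y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 D_{3i} + \varepsilon_i$
$\varepsilon_i \sim \mathrm{NID}(0, \sigma^2)$
M2: $y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
M3: $y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 D_{3i} + \beta_4 X_i + \varepsilon_i$
M4: $y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 D_{3i} + \beta_4 X_i + \beta_5 XD_{1i} + \beta_6 XD_{2i} + \beta_7 XD_{3i} + \varepsilon_i$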
where we create the interaction terms with the following compute statements:
compute XD1=X*D1.
compute XD2=X*D2.
compute XD3=X*D3.
Now we can use partial and multiple partial F-tests to address the following questions:
Q1. Is there a significant effect on the outcome variable y due to marital status after controlling for age?
Answer: compare models M3 and M2.
SSR(D1 D2 D3 | X) = SSR(D1 D2 D3 X) - SSR(X).
Q2. Is there a significant effect due to age after controlling for marital status? Answer: compare models M3
and M1.
SSR(X | D1 D2 D3) = SSR(D1 D2 D3 X) - SSR(D1 D2 D3).
Q3. Is there evidence of a significant interaction effect between age and marital status after controlling for the
two “main effects”? Answer: compare models M4 and M3.
SSR(XD1 XD2 XD3 | D1 D2 D3 X) = SSR(M4) - SSR(M3).
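Each comparison above turns an extra sum of squares into a multiple partial F statistic: the extra SSR divided by the number of added predictors, over the mean square error of the fuller model. The sketch below is a hand check in Python rather than SPSS. It assumes the sums of squares have already been read off the ANOVA tables; the function name `partial_f` and the numbers in the example are hypothetical.

```python
def partial_f(ssr_full, ssr_reduced, sse_full, n, k_full, k_reduced):
    """Multiple partial F statistic for two nested regression models.

    ssr_full, ssr_reduced: regression sums of squares of the two models
    sse_full: residual (error) sum of squares of the fuller model
    n: number of cases; k_full, k_reduced: number of predictors in each model
    Returns (F, numerator df, denominator df).
    """
    df_num = k_full - k_reduced      # number of predictors being tested
    df_den = n - k_full - 1          # error df of the fuller model
    f_stat = ((ssr_full - ssr_reduced) / df_num) / (sse_full / df_den)
    return f_stat, df_num, df_den


# Hypothetical sums of squares (not real output): M4 versus M3, as in Q3.
f_stat, df_num, df_den = partial_f(
    ssr_full=620.0, ssr_reduced=560.0, sse_full=550.0,
    n=63, k_full=7, k_reduced=4,
)
print(f"F({df_num}, {df_den}) = {f_stat:.2f}")
```

The resulting F is then compared with the critical value for the given numerator and denominator degrees of freedom.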
Regional differences in size of governments
Urban sociologists are interested in knowing whether there are regional differences in the size of city
governments. They bring you a data set containing 63 randomly selected U.S. cities and ask you to answer
their question. Two of the variables in the data set are size of government, measured in number of employees
(GOVTEMPL), and region of the country (REGION). There are 4 regions: (1) northwest, (2) south, (3) midwest
and (4) far west. Set up a dummy coding scheme to capture the concept of “region,” using the far west as your
reference group. The data are saved as “cities.sav” and the syntax is saved as “lab11.sps” on the class website.
1. Before testing the hypothesis, inspect the data by running FREQUENCIES and by plotting size of
government against region using the scatterplot option in GRAPH (using either syntax or point and click).
2. Using model M1, test the null hypothesis that size of government is independent of region of the country,
and evaluate the hypotheses associated with each of the individual slopes. Output residuals and report any
outliers or influential data points or any remaining patterns in the residuals.
Next, examine the effects of both REGION and city population (POPULAT) on size of government
(GOVTEMPL). As is often the case, we use the natural log of population (LNPOP) rather than population
itself. Set up the compute statements you need to obtain LNPOP and the interaction terms, and estimate models
M2 to M4 as they apply to this problem. Address the following questions raised by the urban sociologists:
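To make the structure of the M4 predictors concrete, here is a minimal sketch in Python (not SPSS syntax) of the values that the compute statements should produce for a single city. It assumes REGION is coded 1 to 4 with the far west (4) as the reference group, as described above. The function name `m4_row` and the interaction names LNPOPD1 to LNPOPD3 are hypothetical choices, not names from lab11.sps.

```python
import math

REFERENCE_REGION = 4  # far west, the reference group named in the lab


def m4_row(region, populat):
    """Return [D1, D2, D3, LNPOP, LNPOPD1, LNPOPD2, LNPOPD3] for one city."""
    if region not in (1, 2, 3, 4):
        raise ValueError("region must be coded 1 to 4")
    # One 0/1 dummy for every region except the reference group.
    dummies = [1 if region == r else 0
               for r in (1, 2, 3, 4) if r != REFERENCE_REGION]
    lnpop = math.log(populat)                    # natural log of population
    interactions = [lnpop * d for d in dummies]  # LNPOP times each dummy
    return dummies + [lnpop] + interactions


# A hypothetical southern city (REGION = 2) with 100,000 residents.
print(m4_row(2, 100000))
```

A city in the reference group gets zeros for all three dummies and all three interaction terms, so only LNPOP is nonzero in its row.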
3. First, show the sociologists the relationship of government size to city population by plotting GOVTEMPL
against LNPOP.
4. Is there evidence of a significant effect due to city size? Let α = 0.05. (The parallel question, whether there is a
significant effect due to region, was answered in question 2.)
5. Is there evidence of a significant effect due to region after controlling for city size? Is there a significant
effect due to city size after controlling for region?
6. Is there any evidence of a significant interaction between region and city size; that is, is the relationship
between city size and size of city government different in different regions? (If there is a significant
difference between models M3 and M4, then M4 is the best model; if there is not a significant difference,
then M3 is more parsimonious).
7. Using the estimates from M4, draw a graph illustrating the relationship. The graph should have LNPOP
on the horizontal axis and GOVTEMPL on the vertical axis, with four separate line segments that
show how the regions differ in their slopes and intercepts. Use your dummy coding scheme to obtain
expressions for the four regions. That is, if the coding scheme for a specific region is (0 1 0), then your
estimate for that region is obtained by substituting that dummy coding scheme into the prediction equation:
$\hat{y}_i = b_0 + b_1 D_1 + b_2 D_2 + b_3 D_3 + b_4 X + b_5 XD_1 + b_6 XD_2 + b_7 XD_3$
$\hat{y}_i = b_0 + b_1(0) + b_2(1) + b_3(0) + b_4 X + b_5 X(0) + b_6 X(1) + b_7 X(0)$
$\hat{y}_i = (b_0 + b_2) + (b_4 + b_6) X$
8. Output the residuals from M4 if there is a significant interaction, or from M3 if there is not.
Is there evidence of an outlier? Of an influential data point? Is there evidence of a curvilinear
relationship? Of heterogeneity of variance?
9. Write up a paragraph describing what you have found about regional differences in the relationship between
size of government and city size.
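The substitution in question 7 can be repeated for every region in the same way. The following is a minimal sketch in Python, assuming the eight M4 estimates are supplied in the order b0 to b7 of the prediction equation. The function name `region_line` and the coefficient values are hypothetical and are not output from cities.sav.

```python
def region_line(b, dummies):
    """Intercept and slope of the fitted M4 line for one region.

    b: the eight estimates (b0, ..., b7) in the order of the prediction equation
    dummies: the region's coding (D1, D2, D3), e.g. (0, 1, 0)
    """
    d1, d2, d3 = dummies
    intercept = b[0] + b[1] * d1 + b[2] * d2 + b[3] * d3
    slope = b[4] + b[5] * d1 + b[6] * d2 + b[7] * d3
    return intercept, slope


# Hypothetical estimates for illustration only.
b = (10.0, 2.0, -3.0, 1.0, 5.0, 0.5, -1.0, 0.25)
print(region_line(b, (0, 1, 0)))  # region coded (0 1 0): (b0 + b2, b4 + b6)
print(region_line(b, (0, 0, 0)))  # reference group: (b0, b4)
```

Each returned pair gives one line segment for the graph, drawn over the range of LNPOP observed in that region.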