Stat 401C Lab 11 Fall 2005 Objective: Estimate regression models with both categorical and continuous variables. Reading: Sections 16.5 – 16.7 in Howell (2002). We can imagine situations where we might want to investigate the effects of both categorical and continuous variables on an outcome. For example, we could ask whether some outcome y (e.g., life satisfaction; depressive symptoms) is significantly related to marital status, after controlling for age, which is measured on a continuum, and whether age contributes significantly the explanation of y after controlling for marital status. We might also ask whether there is a significant interaction between these two predictor variables; that is, is the relationship between y and age the same or different for each marital status groups. To address this issue, we would estimate the following models: M1: yi 0 1D1i 2 D2i 3D3i i ε ~ NID (0, σ2) M2: yi 0 1X i i M3: y i 0 1D1i 2 D2i 3D3i 4 X i i M4: yi 0 1D1i 2 D2i 3D3i 4 X i 5XD1i 6 XD 2i 7 XD3i i where we create interactions terms with the following compute statements: compute XD1=X*D1. compute XD2=X*D2. compute XD3=X*D3. Now we can use partial and multiple partial F-tests to address the following questions: Q1. Is there a significant affect on the outcome variable y due to martial status after controlling for age? Answer: compare models M3 and M2. SSR(D1 D2 D3| X) = SSR (D1 D2 D3 X) – SSR(X). Q2. Is there a significant affect due to age after controlling for marital status? Answer: compare models M3 and M1. Q3: Is there evidence of a significant interaction effect between age and marital status, after controlling for the two “main effects.” Answer: compare models M4 and M3. SSR(XD1 XD2 XD3 | D1 D2 D3 X) = SSR(M4) – SSR(M3). Regional differences in size of governments Urban sociologists are interested in knowing whether there are regional differences in the size of city governments. They bring you a data set containing 63 randomly selected U. S. cities, and ask you to answer their question. Two of the variables in the data set are size of government, measured in number of employees (GOVTEMPL), and region of the country (REGION). There are 4 regions: (1) northwest, (2) south, (3) midwest and (4) far west. Set up a dummy coding scheme to capture the concept of “region,” using the far west as your reference group. The data is saved as “cities.sav” and the syntax is saved as “lab11.sps” on the class website. 1. Before testing the hypothesis, inspect the data by running FREQUENCIES and by plotting size of government against region using the scatterplot option in GRAPH (using either syntax or pick & click). 2. Using model M1, test the null hypothesis that size of government is independent of region of the country, and evaluate the hypotheses associated with each of the individual slopes. Output residuals and report any outliers or influential data points or any remaining patterns in the residuals. Next, examine the effects of both REGION and city population (POPULAT) on size of government (GOVTEMPL). As is often the case, we use the natural log of population (LNPOP) rather than population itself. Set up the compute statements you need to obtain LNPOP and the interaction terms, and estimate models M2 to M4 as they apply to this problem. Address the following questions raised by the urban sociologists: 3. First, show the sociologists the relationship of government size to city population by plotting GOVTEMPL against LNPOP. 4. Is there evidence of a significant effect due to city size? Let = 0.05. (The parallel question, is there a significant effect due to region, was answered in question 2). 5. Is there evidence of a significant effect due to region after controlling for city size? Is there a significant effect due to city size after controlling for region? 6. Is there any evidence of a significant interaction between region and city size; that is, is the relationship between city size and size of city government different in different regions? (If there is a significant difference between models M3 and M4, then M4 is the best model; if there is not a significant difference, then M3 is more parsimonious). 7. Using the estimates from M4, draw a graph illustrating the relationship. The graph should have the LNPOP on the horizontal axis and GOVTEMPL on the vertical, and you want to draw four separate line segments to show how the regions differ in their slopes and intercepts. Use your dummy coding scheme to obtain expressions for the four regions. That is, if the coding scheme for a specific region is (0 1 0), then your estimate for that region is obtained by inputting that dummy coding scheme into the prediction equation: ŷi = b0 + b1D1 +b2D2 +b3D3 +b4X+ b5XD1 +b6XD2 +b7XD3 ŷi = b0 + b1(0) +b2(1) +b3(0) +b4X+ b5X*(0) +b6X*(1) +b7X*(0) ŷi = (b0 + b2) + (b4 + b6)X 8. Output the residuals from model 4 if there is a significant interaction, or model 3 if there is not a significant interaction. Is there evidence of an outlier? An influential data point? Is there evidence of a curvilinear relationship? Of heterogeneity of variance? 9. Write up a paragraph describing what you have found about regional differences in the relationship between size of government and city size.