Stat 401G Lab 5: Due October 8
Solution

1. A sociologist is interested in the relationship between the homicide rate, the number of homicides per 100,000 population (Y), and a city’s population in thousands of people (X1), the percentage of families with yearly incomes less than $10,000 (X2), and the rate of unemployment (X3). Data are provided for a random sample of 20 cities.

City     1     2     3     4     5     6     7     8     9    10
Y     11.2  13.4  40.7   5.3  24.8  12.7  20.9  35.7   8.7   9.6
X1     587   643   635   692  1248   643  1964  1531   713   749
X2    16.5  20.5  26.3  16.5  19.2  16.5  20.2  21.3  17.2  14.3
X3     6.2   6.4   9.3   5.3   7.3   5.9   6.4   7.6   4.9   6.4

City    11    12    13    14    15    16    17    18    19    20
Y     14.5  26.9  15.7  36.2  18.1  28.9  14.9  25.8  21.7  25.7
X1    7895   762  2793   741   625   854   716   921   595  3353
X2    18.1  23.1  19.1  24.7  18.6  24.9  17.9  22.4  20.2  16.9
X3     6.0   7.4   5.8   8.6   6.5   8.3   6.7   8.6   8.4   6.7

The data are on the class web page http://www.public.iastate.edu/~wrstephe/stat401.html

(a) Use JMP to fit a multiple regression model with homicide rate as the response variable and population, low income and unemployment as the three explanatory variables. Turn in the JMP output with your answers to the following questions.

(b) Give the multiple regression equation.
Predicted Homicide Rate = -36.765 + 0.0007629*Population + 1.1922*Low Income % + 4.7198*Unemployment Rate

(c) Give the value of the estimate of the error standard deviation, σ.
RMSE = 4.59

(d) Give the value and an interpretation of R².
R² = 0.818. That is, 81.8% of the variation in homicide rates can be explained by the linear model that includes population, low income % and unemployment rate.

(e) What is the value of the adjusted R²?
Adjusted R² = 0.784

(f) Test the hypothesis H₀: β₁ = β₂ = β₃ = 0 versus the alternative that at least one slope parameter is not zero. What does this test tell you about the relationship between the homicide rate and the three explanatory variables?
F = 24.02, P-value < 0.0001. Because the P-value is so small, reject the null hypothesis.
This tells you that at least one of the explanatory variables is useful in predicting the homicide rate.

(g) Look at the plot of residuals versus unemployment rate. What does this plot tell you about the linear model?
There does not appear to be any discernible pattern, so adding a curved relationship with unemployment rate will not help explain more variation in the homicide rate.

(h) Describe what you see in the analysis of residuals: histogram, box plot and normal quantile plot. What does this indicate about the conditions necessary for multiple regression analysis?
The histogram is mounded around zero. Although not perfectly symmetric, it is really not too far off (move one residual from the 10 to 15 category to the 5 to 10 category and one residual from the 0 to 5 category to the -5 to 0 category and you will get a perfectly symmetric histogram). The box plot does not look symmetric but there are no outliers. The points snake around the normal model line in the normal quantile plot. Although the distribution is not as nice as we would like, the residuals could have come from a normal distribution. Note: even if you find that the residuals violate the normal distribution condition, the linear relationship is so strong that we would probably not change our minds about the statistical significance of the model.

2. Use JMP to fit the models necessary to answer the following questions. Be sure to support your answers with information from the appropriate JMP output. Turn in the JMP output with your answers.

(a) Is there a statistically significant relationship between homicide rate and the percentage of families with yearly incomes less than $10,000?
Model (Low Income %)
F = 43.064, P-value < 0.0001 or t = 6.56, P-value < 0.0001. Because the P-value is so small, there is a statistically significant linear relationship between Low Income % and Homicide Rate.

(b) Does unemployment rate add significantly to the model that already contains population?
Model (Population and Unemployment Rate)
Yes. F = 55.682, P-value < 0.0001 or t = 7.46, P-value < 0.0001. Because the P-value is so small, Unemployment Rate does add significantly to the model that already contains Population.

(c) Does population add significantly to the model that contains the two variables percentage of families with yearly incomes less than $10,000 and unemployment rate?
No. F = 1.4376, P-value = 0.2480 or t = 1.20, P-value = 0.2480. Because the P-value is not small, Population does not add significantly to the model with Low Income % and Unemployment Rate.

(d) There are three models that satisfy the first two criteria for being the “best” model, i.e. the model is useful and all variables in the model add significantly. Summarize the models that are both statistically useful and have all explanatory variables adding significantly. In your summaries give:
- The explanatory variables in the model.
- The test statistic and P-value for the test of model utility.
- The test statistic and P-value for each of the explanatory variables that add significantly to the model.
- R², adjusted R², and the Root Mean Square Error.

Model (Low Income %)
F = 43.064, P-value < 0.0001.
t = 6.56, P-value < 0.0001.
R² = 0.705, adjusted R² = 0.689, and the Root Mean Square Error = 5.512

Model (Unemployment Rate)
F = 53.415, P-value < 0.0001.
t = 7.31, P-value < 0.0001.
R² = 0.748, adjusted R² = 0.734, and the Root Mean Square Error = 5.097

Model (Low Income % and Unemployment Rate)
F = 34.428, P-value < 0.0001.
Low Income %: F = 4.62 or t = 2.15, P-value = 0.0459.
Unemployment Rate: F = 8.29 or t = 2.88, P-value = 0.0103.
R² = 0.802, adjusted R² = 0.779, and the Root Mean Square Error = 4.648

(e) Of the three models summarized in (d), which is the “best” model? Explain briefly.
The model with Low Income % and Unemployment Rate is the “best”. Of the models that are useful and have all variables adding significantly, this one has the highest R² (0.802). It also has the highest adjusted R² (0.779) and the smallest Root Mean Square Error (4.648).
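
The summaries in (d) and the comparison in (e) can also be checked outside JMP. What follows is only a minimal sketch in Python, which the lab itself does not use. The helper names `solve` and `fit_summary` are made up for this illustration, and the fit uses the ordinary least-squares normal equations on the data from question 1. Under those assumptions it should give R², adjusted R² and Root Mean Square Error values close to the JMP output quoted above.

```python
# Data transcribed from the table in question 1.
Y = [11.2, 13.4, 40.7, 5.3, 24.8, 12.7, 20.9, 35.7, 8.7, 9.6,
     14.5, 26.9, 15.7, 36.2, 18.1, 28.9, 14.9, 25.8, 21.7, 25.7]
X2 = [16.5, 20.5, 26.3, 16.5, 19.2, 16.5, 20.2, 21.3, 17.2, 14.3,
      18.1, 23.1, 19.1, 24.7, 18.6, 24.9, 17.9, 22.4, 20.2, 16.9]
X3 = [6.2, 6.4, 9.3, 5.3, 7.3, 5.9, 6.4, 7.6, 4.9, 6.4,
      6.0, 7.4, 5.8, 8.6, 6.5, 8.3, 6.7, 8.6, 8.4, 6.7]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (hypothetical helper)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_summary(y, *predictors):
    """Least-squares fit with an intercept; returns R^2, adjusted R^2, RMSE and F."""
    n, p = len(y), len(predictors)
    rows = [[1.0] + [x[i] for x in predictors] for i in range(n)]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    df_error = n - p - 1
    r2 = 1 - sse / sst
    return {
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / df_error,
        "rmse": (sse / df_error) ** 0.5,
        "f": ((sst - sse) / p) / (sse / df_error),
    }


summaries = {
    "Low Income %": fit_summary(Y, X2),
    "Unemployment Rate": fit_summary(Y, X3),
    "Low Income % and Unemployment Rate": fit_summary(Y, X2, X3),
}
for name, s in summaries.items():
    print(f"{name}: R2={s['r2']:.3f} adjR2={s['adj_r2']:.3f} "
          f"RMSE={s['rmse']:.3f} F={s['f']:.3f}")

# One possible way to pick the "best" candidate: highest adjusted R^2.
best = max(summaries, key=lambda name: summaries[name]["adj_r2"])
print("Best by adjusted R2:", best)
```

Choosing by adjusted R² here is one reasonable reading of (e). With these three models, ranking by R² or by smallest Root Mean Square Error would point to the same model.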

(f) For the “best” model look at the plot of residuals versus predicted values and the analysis of the distribution of residuals. Indicate what you see in the plots and what this tells you about the conditions necessary for conducting the statistical analysis.
The plot of residuals versus predicted values is a random scatter with about the same spread across all predicted values. The linear model is adequate and the equal standard deviation condition is satisfied. The histogram is mounded above zero with a skew to the left (lower values). The box plot does not look symmetric but there are no outliers. The points snake around the normal model line in the normal quantile plot. Although the distribution is not as nice as we would like, the residuals could have come from a normal distribution.

3. Use JMP Fit Y by X to fit a simple linear model with Homicide rate as the response and Population as the explanatory variable. Turn in the JMP output with your answers.

(a) How much of the variation in homicide rate is explained by the linear relationship with population?
R² = 0.0045, so less than 0.5% of the variation in homicide rate is explained by the linear relationship with population.

(b) Is this model useful? Support your answer with an appropriate test of hypothesis.
No. F = 0.0814 or t = -0.29, P-value = 0.7787. Such a large P-value indicates that there is not a statistically significant linear relationship.

(c) Looking at the plot of the data, and the plot of residuals versus population, there is one point that seems to influence where the regression line goes. Which point is this? Give city number, population, percentage of low income families and unemployment rate.
City number 11, with a population of 7,895,000 (X1 = 7895 thousand), 18.1% low income families and an unemployment rate of 6.0, is the influential point. The homicide rate is 14.5 per 100,000.

(d) Click on this point, go to Rows and Exclude the point.
Go to the red triangle pull down next to Bivariate Fit in the output window and choose Fit Line. You will get a second line on your graph (this one does not include the point you clicked on). How is this line different from the original line (indicate how the intercept and slope have changed)?
The most noticeable difference is in the slope. With all 20 cities, the estimated slope is negative (-0.0003892). After excluding the largest city, the estimated slope is positive (+0.0017709). The intercept has changed from 21.128 to 18.954.

(e) How much of the variation in homicide rate is explained by the linear relationship with population once the one point is excluded?
R² = 0.019, so about 2% of the variation in homicide rate is explained by the linear relationship with population.

(f) Is this model useful? Support your answer with an appropriate test of hypothesis.
No. F = 0.3351 or t = 0.58, P-value = 0.5703. Such a large P-value indicates that there is not a statistically significant linear relationship.

(g) If we fit a linear model with number of homicides, rather than the rate of homicides, as the response variable and population as the explanatory variable, would you expect to see a statistically significant linear relationship? Explain your answer. You do not need to actually fit this model to answer the question.
Yes. Rate of homicides is the number of homicides per 100,000 people. The number of homicides would be the rate*population, where the population is in units of 100,000 people. For example, city 1 has a rate of 11.2 per 100,000 and a population of 587,000, so the number of homicides for city 1 is 11.2*5.87 = 66. Because the number of homicides can be found from the rate times the population, number of homicides should be linearly related to population. The estimated slope of the regression line would give you the average number of homicides per unit of population (per thousand people, since population is recorded in thousands), so the estimated slope gives you an average rate of homicides.
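
The refit in (d) and the rate-to-count conversion in (g) can be illustrated with a short calculation. This is a minimal sketch in Python rather than JMP, and the helper name `simple_fit` is made up for this illustration. It assumes the data from question 1, with population recorded in thousands, so that dividing by 100 puts it in units of 100,000 people.

```python
# Data transcribed from the table in question 1 (X1 is population in thousands).
Y = [11.2, 13.4, 40.7, 5.3, 24.8, 12.7, 20.9, 35.7, 8.7, 9.6,
     14.5, 26.9, 15.7, 36.2, 18.1, 28.9, 14.9, 25.8, 21.7, 25.7]
X1 = [587, 643, 635, 692, 1248, 643, 1964, 1531, 713, 749,
      7895, 762, 2793, 741, 625, 854, 716, 921, 595, 3353]


def simple_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Part (d): homicide rate on population, with all 20 cities and without city 11.
b0_all, b1_all = simple_fit(X1, Y)
keep = [i for i in range(len(Y)) if i != 10]  # list index 10 is city 11
b0_excl, b1_excl = simple_fit([X1[i] for i in keep], [Y[i] for i in keep])
print(f"All 20 cities:     intercept={b0_all:.3f}, slope={b1_all:.7f}")
print(f"Excluding city 11: intercept={b0_excl:.3f}, slope={b1_excl:.7f}")

# Part (g): number of homicides = rate per 100,000 * population in units of 100,000.
counts = [rate * pop_thousands / 100 for rate, pop_thousands in zip(Y, X1)]
print(f"City 1 homicides: {counts[0]:.1f}")  # 11.2 * 5.87, about 66

# Slope of counts on population: homicides per thousand people.
c0, c1 = simple_fit(X1, counts)
print(f"Counts on population: slope={c1:.4f} homicides per thousand people")
```

Multiplying the last slope by 100 expresses it per 100,000 people, the same units as the homicide rate. The sketch does not carry out the significance test discussed in (g).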