Stat 401G Lab 5: Due October 8 Solution

1. A sociologist is interested in the relationship between the homicide rate, number of
homicides per 100,000 population (Y) and a city’s population (X1) in thousands of
people, the percentage of families with yearly incomes less than $10,000 (X2), and
the rate of unemployment (X3). Data are provided for a random sample of 20 cities.
City      Y      X1     X2     X3
   1   11.2     587   16.5    6.2
   2   13.4     643   20.5    6.4
   3   40.7     635   26.3    9.3
   4    5.3     692   16.5    5.3
   5   24.8    1248   19.2    7.3
   6   12.7     643   16.5    5.9
   7   20.9    1964   20.2    6.4
   8   35.7    1531   21.3    7.6
   9    8.7     713   17.2    4.9
  10    9.6     749   14.3    6.4
  11   14.5    7895   18.1    6.0
  12   26.9     762   23.1    7.4
  13   15.7    2793   19.1    5.8
  14   36.2     741   24.7    8.6
  15   18.1     625   18.6    6.5
  16   28.9     854   24.9    8.3
  17   14.9     716   17.9    6.7
  18   25.8     921   22.4    8.6
  19   21.7     595   20.2    8.4
  20   25.7    3353   16.9    6.7
The data are on the class web page
http://www.public.iastate.edu/~wrstephe/stat401.html
(a) Use JMP to fit a multiple regression model with homicide rate as the response
variable and population, low income and unemployment as the three explanatory
variables. Turn in the JMP output with your answers to the following questions.
(b) Give the multiple regression equation.
Predicted Homicide Rate = –36.765 + 0.0007629*Population
+ 1.1922*Low Income %
+ 4.7198*Unemployment Rate
(c) Give the value of the estimate of the error standard deviation, σ.
RMSE = 4.59
(d) Give the value and an interpretation of R2.
R2 = 0.818, 81.8% of the variation in homicide rates can be explained by the
linear model that includes population, low income % and unemployment
rate.
(e) What is the value of the adjusted R2?
adjusted R2 = 0.784
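The quantities in parts (b) through (e) can also be reproduced outside JMP. The following is a minimal sketch in Python, which is not part of the original lab. It assumes the data table above is transcribed correctly, and the helper names solve and fit_ols are illustrative. It solves the normal equations directly, so the results should be close to the JMP values quoted above but may differ in the last digit.

```python
# Least-squares fit of Y on X1, X2, X3 using only the standard library.
# Data are the 20 cities from the lab; helper names are illustrative.
from math import sqrt

Y  = [11.2, 13.4, 40.7, 5.3, 24.8, 12.7, 20.9, 35.7, 8.7, 9.6,
      14.5, 26.9, 15.7, 36.2, 18.1, 28.9, 14.9, 25.8, 21.7, 25.7]
X1 = [587, 643, 635, 692, 1248, 643, 1964, 1531, 713, 749,
      7895, 762, 2793, 741, 625, 854, 716, 921, 595, 3353]
X2 = [16.5, 20.5, 26.3, 16.5, 19.2, 16.5, 20.2, 21.3, 17.2, 14.3,
      18.1, 23.1, 19.1, 24.7, 18.6, 24.9, 17.9, 22.4, 20.2, 16.9]
X3 = [6.2, 6.4, 9.3, 5.3, 7.3, 5.9, 6.4, 7.6, 4.9, 6.4,
      6.0, 7.4, 5.8, 8.6, 6.5, 8.3, 6.7, 8.6, 8.4, 6.7]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ols(y, *xs):
    """Return coefficients, RMSE, R^2, adjusted R^2 and residuals."""
    n, p = len(y), len(xs)
    rows = [[1.0] + [x[i] for x in xs] for i in range(n)]  # design matrix
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p + 1)]
           for j in range(p + 1)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p + 1)]
    coefs = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(coefs, r))
             for r, yi in zip(rows, y)]
    sse = sum(e * e for e in resid)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = sqrt(sse / (n - p - 1))  # estimate of sigma
    return coefs, rmse, r2, adj_r2, resid

coefs, rmse, r2, adj_r2, resid = fit_ols(Y, X1, X2, X3)
print("intercept and slopes:", [round(b, 7) for b in coefs])
print("RMSE:", round(rmse, 2), " R2:", round(r2, 3), " adjusted R2:", round(adj_r2, 3))
```

The adjusted R2 line uses 1 - (1 - R2)(n - 1)/(n - p - 1) with n = 20 cities and p = 3 explanatory variables, which is why it is always a little smaller than R2.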
(f) Test the hypothesis H0: β1 = β2 = β3 = 0 versus the alternative that at least one
slope parameter is not zero. What does this test tell you about the relationship
between the homicide rate and the three explanatory variables?
F = 24.02, P-value < 0.0001. Because the P-value is so small, reject the null
hypothesis. This tells you that at least one of the explanatory variables is
useful in predicting the homicide rate.
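As a rough cross-check, the overall F statistic can be rebuilt from R2 alone. The sketch below (Python, not part of the original lab; the function name overall_f is illustrative) uses the standard identity F = (R2/p) / ((1 - R2)/(n - p - 1)) with n = 20 and p = 3, so the test has 3 and 16 degrees of freedom. Because it starts from the rounded R2 = 0.818, it only approximates the JMP value of 24.02.

```python
# Overall F statistic for model utility, rebuilt from the reported R^2.
# n = 20 cities and p = 3 explanatory variables come from the lab.
def overall_f(r2, n, p):
    """F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

f_stat = overall_f(0.818, n=20, p=3)
print("F from rounded R2:", round(f_stat, 2))
```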
(g) Look at the plot of residuals versus unemployment rate. What does this plot tell
you about the linear model?
There does not appear to be any discernable pattern so adding a curved
relationship with unemployment rate will not help explain more variation in
the homicide rate.
(h) Describe what you see in the analysis of residuals; histogram, box plot and normal
quantile plot. What does this indicate about the conditions necessary for multiple
regression analysis?
The histogram is mounded around zero. Although not perfectly symmetric,
it is really not too far off (move one residual from the 10 to 15 category to the
5 to 10 category and one residual from 0 to 5 to –5 to 0 and you will get a
perfectly symmetric histogram). The box plot does not look symmetric but
there are no outliers. The points snake around the normal model line in the
normal quantile plot. Although the distribution is not as nice as we would
like it, the residuals could have come from a normal distribution. Note: even
if you find the residuals to violate the normal distribution condition, the
linear relationship is so strong that we would probably not change our minds
about the statistical significance of the model.
2. Use JMP to fit the models necessary to answer the following questions. Be sure to
support your answers with information from the appropriate JMP output. Turn in the
JMP output with your answers.
(a) Is there a statistically significant relationship between homicide rate and the
percentage of families with yearly incomes less than $10,000?
Model (Low Income %)
F = 43.064, P-value < 0.0001 or t = 6.56, P-value < 0.0001. Because the
P-value is so small, there is a statistically significant linear relationship between
Low Income % and Homicide Rate.
(b) Does unemployment rate add significantly to the model that already contains
population?
Model (Population and Unemployment Rate)
Yes. F = 55.682, P-value < 0.0001 or t = 7.46, P-value < 0.0001. Because the
P-value is so small, Unemployment Rate does add significantly to the model
that already contains Population.
(c) Does population add significantly to the model that contains the two variables
percentage of families with yearly incomes less than $10,000 and unemployment
rate?
No. F = 1.4376, P-value = 0.2480 or t = 1.20, P-value = 0.2480. Because the
P-value is not small, Population does not add significantly to the model with
Low Income % and Unemployment Rate.
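The F statistic for adding one variable can be approximated from the R2 values of the full and reduced models. The sketch below (Python, not part of the original lab; partial_f is an illustrative name) uses R2 = 0.818 for the three-variable model from problem 1 and R2 = 0.802 for the model with Low Income % and Unemployment Rate. Both inputs are rounded, so the result only roughly matches the JMP values F = 1.4376 and t = 1.20. For a single added variable, t is the square root of F.

```python
# Partial F test for adding q variables to a reduced model,
# written in terms of the R^2 values of the full and reduced fits.
def partial_f(r2_full, r2_reduced, n, p_full, q):
    """F = ((R2_full - R2_reduced) / q) / ((1 - R2_full) / (n - p_full - 1))."""
    return ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - p_full - 1))

# Does Population add to the model with Low Income % and Unemployment Rate?
f_pop = partial_f(0.818, 0.802, n=20, p_full=3, q=1)
t_pop = f_pop ** 0.5  # magnitude of the t statistic when q = 1
print("partial F:", round(f_pop, 2), " |t|:", round(t_pop, 2))
```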
(d) There are three models that satisfy the first two criteria for being the “best”
model, i.e. the model is useful and all variables in the model add significantly.
Summarize the models that are both statistically useful and have all explanatory
variables adding significantly. In your summaries give:
• The explanatory variables in the model.
• The test statistic and P-value for the test of model utility.
• The test statistic and P-value for each of the explanatory variables that add
  significantly to the model.
• R2, adjusted R2, and the Root Mean Square Error.
Model (Low Income %)
F = 43.064, P-value < 0.0001.
t = 6.56, P-value < 0.0001.
R2 = 0.705, adjusted R2 = 0.689, and the Root Mean Square Error = 5.512
Model (Unemployment Rate)
F = 53.415, P-value < 0.0001.
t = 7.31, P-value < 0.0001.
R2 = 0.748, adjusted R2 = 0.734, and the Root Mean Square Error = 5.097
Model (Low Income % and Unemployment Rate)
F = 34.428, P-value < 0.0001.
Low Income %: F = 4.62 or t = 2.15, P-value = 0.0459.
Unemployment Rate: F = 8.29 or t = 2.88, P-value = 0.0103.
R2 = 0.802, adjusted R2 = 0.779, and the Root Mean Square Error = 4.648
(e) Of the three models summarized in d), which is the “best” model? Explain briefly.
The model with Low Income % and Unemployment Rate is the “best”. Of
the models that are useful and have all variables adding significantly, this
one has the highest R2.
(f) For the “best” model look at the plot of residuals versus predicted values and the
analysis of the distribution of residuals. Indicate what you see in the plots and
what this tells you about the conditions necessary for conducting the statistical
analysis.
The plot of residuals versus predicted values is a random scatter with about
the same spread across all predicted values. The linear model is the best we
can do and the equal standard deviation condition is satisfied.
The histogram is mounded above zero with a skew to the left (lower values).
The box plot does not look symmetric but there are no outliers. The points
snake around the normal model line in the normal quantile plot. Although
the distribution is not as nice as we would like it, the residuals could have
come from a normal distribution.
3. Use JMP Fit Y by X to fit a simple linear model with Homicide rate as the response
and Population as the explanatory variable. Turn in the JMP output with your
answers.
(a) How much of the variation in homicide rate is explained by the linear relationship
with population?
R2 = 0.0045, so less than 0.5% of the variation in homicide rate is explained
by the linear relationship with population.
(b) Is this model useful? Support your answer with an appropriate test of hypothesis.
No. F = 0.0814 or t = -0.29, P-value = 0.7787. Such a large P-value indicates
that there is not a statistically significant linear relationship.
(c) Looking at the plot of the data, and the plot of residuals versus population, there is
one point that seems to influence where the regression line goes. Which point is
this? Give city number, population, percentage of low income families and
unemployment rate.
City number 11, with a population of 7,895,000 (X1 = 7895, in thousands),
18.1% low income families, and an unemployment rate of 6.0, is the influential
point. The homicide rate is 14.5 per 100,000.
(d) Click on this point, go to Rows and Exclude the point. Go to the red triangle pull
down next to Bivariate Fit in the output window and choose Fit Line. You will get
a second line on your graph (this one does not include the point you clicked on).
How is this line different from the original line (indicate how the intercept and
slope have changed)?
The most noticeable difference is in the slope. With all 20 cities, the
estimated slope is negative (–0.0003892). After excluding the largest city, the
estimated slope is positive (+0.0017709). The intercept has changed from
21.128 to 18.954.
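The effect of the influential point can be checked with a short calculation. The sketch below (Python, not part of the original lab; fit_line is an illustrative helper) fits the simple linear model to all 20 cities and again with city 11 excluded, assuming the data table is transcribed correctly. The intercepts and slopes should agree with the JMP values quoted above to the digits shown there.

```python
# Simple linear regression of homicide rate on population (in thousands),
# fitted with and without city 11, using the usual closed-form formulas.
RATE = [11.2, 13.4, 40.7, 5.3, 24.8, 12.7, 20.9, 35.7, 8.7, 9.6,
        14.5, 26.9, 15.7, 36.2, 18.1, 28.9, 14.9, 25.8, 21.7, 25.7]
POP  = [587, 643, 635, 692, 1248, 643, 1964, 1531, 713, 749,
        7895, 762, 2793, 741, 625, 854, 716, 921, 595, 3353]

def fit_line(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope, sxy ** 2 / (sxx * syy)

b0_all, b1_all, r2_all = fit_line(POP, RATE)

keep = [i for i in range(len(POP)) if i != 10]  # index 10 is city 11
b0_excl, b1_excl, r2_excl = fit_line([POP[i] for i in keep],
                                     [RATE[i] for i in keep])

print("all 20 cities:   ", round(b0_all, 3), round(b1_all, 7), round(r2_all, 4))
print("city 11 excluded:", round(b0_excl, 3), round(b1_excl, 7), round(r2_excl, 3))
```

Removing a single city flips the sign of the slope, which is what makes city 11 influential: its population is far from the other cities, so it has high leverage on the fitted line.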
(e) How much of the variation in homicide rate is explained by the linear relationship
with population once the one point is excluded?
R2 = 0.019, so about 2% of the variation in homicide rate is explained by the
linear relationship with population.
(f) Is this model useful? Support your answer with an appropriate test of hypothesis.
No. F = 0.3351 or t = 0.58, P-value = 0.5703. Such a large P-value indicates
that there is not a statistically significant linear relationship.
(g) If we fit a linear model with number of homicides, rather than the rate of
homicides, as the response variable and population as the explanatory variable,
would you expect to see a statistically significant linear relationship? Explain
your answer. You do not need to actually fit this model to answer the question.
Yes. Rate of homicides is the number of homicides per 100,000 people. The
number of homicides would be the rate*population where the population is
in units of 100,000 people. For example, city 1 has a rate of 11.2 per 100,000
and a population of 587,000 so the number of homicides for city 1 is
11.2*5.87 = 65.7, or about 66. Because the number of homicides can be found
from the rate times the population, number of homicides should be linearly
related to population. The estimated slope of the regression line would give you
the average change in the number of homicides per unit of population (per
thousand people, since population is recorded in thousands), so the estimated
slope acts as an average rate of homicides.
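The conversion from rate to count described in this answer can be written out for every city. This is a small Python sketch, not part of the original lab, assuming the data table is transcribed correctly. Because the rate is per 100,000 people and population is recorded in thousands, the count is rate * population / 100.

```python
# Convert homicide rate (per 100,000 people) to an approximate number of
# homicides, with population recorded in thousands as in the lab data.
RATE = [11.2, 13.4, 40.7, 5.3, 24.8, 12.7, 20.9, 35.7, 8.7, 9.6,
        14.5, 26.9, 15.7, 36.2, 18.1, 28.9, 14.9, 25.8, 21.7, 25.7]
POP_THOUSANDS = [587, 643, 635, 692, 1248, 643, 1964, 1531, 713, 749,
                 7895, 762, 2793, 741, 625, 854, 716, 921, 595, 3353]

# population in units of 100,000 is (thousands / 100)
counts = [rate * pop / 100 for rate, pop in zip(RATE, POP_THOUSANDS)]

for city, count in enumerate(counts, start=1):
    print(city, round(count))
```

For city 1 this gives about 66 homicides, matching the hand calculation above.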