Statistics 215
Regression worksheet
At the scene of the crime*
Criminal investigators often need to predict unobserved characteristics of individuals from
observed characteristics. For example, if a footprint is left at the scene of a crime, how
accurately can we estimate that person’s height based on the length of the footprint?
1. Identify the observational units, explanatory variable, and response variable in this study.
The observational units are the individual students in Statistics 215. Most important in answering
this question is the CONTEXT of the problem. Investigators seek to use footprint data at the
scene of the crime to estimate crime suspects’ heights. So foot length is the explanatory variable
and height is the response variable.
Following are the self-reported heights (in inches) of the students in Statistics 215 along with
summary statistics from SPSS.
64 64 67 71 67 75 69 67 68 61 68 60 70 71 76 73 64 64 64 64 73 70 72
68 68 64 66 70.5 65 66 72 73 73 67 74 74
Height Stem-and-Leaf Plot

Frequency    Stem &  Leaf
     2.00       6 .  01
      .00       6 .
     8.00       6 .  44444455
     6.00       6 .  667777
     5.00       6 .  88889
     5.00       7 .  00011
     6.00       7 .  223333
     2.00       7 .  45
     1.00       7 .  6
Descriptives: Height

                                      Statistic   Std. Error
Mean                                    68.2714       .68026
95% Confidence Interval for Mean
    Lower Bound                         66.8890
    Upper Bound                         69.6539
5% Trimmed Mean                         68.3016
Median                                  68.0000
Variance                                 16.196
Std. Deviation                          4.02445
Minimum                                   60.00
Maximum                                   76.00
Range                                     16.00
Interquartile Range                        7.00
Skewness                                   .011         .398
Kurtosis                                  -.742         .778
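As a quick consistency check on the Descriptives output, the variance should equal the square of the standard deviation, and the standard error of the mean should equal the standard deviation divided by the square root of the sample size. The sketch below assumes n = 35 (the total of the stem-and-leaf frequencies); the variable names are ours, not SPSS's.

```python
import math

# Values copied from the SPSS Descriptives table.
sd_height = 4.02445
n = 35  # assumption: total of the stem-and-leaf frequencies

variance = sd_height ** 2              # compare with the reported Variance
std_error = sd_height / math.sqrt(n)   # compare with the reported Std. Error of the mean

print(round(variance, 3))    # 16.196
print(round(std_error, 5))   # 0.68026
```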
2. If you were trying to predict the height of a random statistics student based on these
observations, what value would you report?
Either the mean or the median with a preference for the median. So 68 inches.
3. About how accurate would you be using such a prediction method? Fill in the blanks:
The standard deviation gives a measure of the spread so
I would predict a random student’s height to be about 68 inches give or take about 4 inches.
4. When you were sleeping last night, my investigators entered your room and recorded your foot
length (in centimeters). Here is a scatterplot of height and foot length.
Describe the association in this scatterplot (remember type, direction, strength, clusters, outliers,
etc.). Without looking at the next page, guess the correlation. Without looking at the next page,
draw (lightly) on the scatterplot what looks like the best fit line for the data.
There is a moderate to strong positive linear association in the data. There are no apparent
clusters or outliers. Answers will vary on correlation guesstimates and on where to draw the line,
but reasonable guesses for the correlation should fall between 0.65 and 0.90.
5. Following is SPSS regression and correlation output. What is the correlation? Give the
regression equation and draw it as accurately as possible on the scatterplot.
The regression equation from the SPSS output Coefficients table is
Height = 41.839 + .953 * Foot.
(A mathematician, as opposed to a statistician, would write y = 41.839 + .953*x.)
The correlation is 0.841. To draw the line on the scatterplot we find two points on the line and
connect the dots. Since the foot variable varies from about 20 to 34 we’ll use those two x-values.
Since 41.839 + .953*20 = 60.899, the point (20, 60.899) is on the line. Since 41.839 + .953*34 =
74.241, the point (34, 74.241) is on the line. Joining those two points gives the regression line.
Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .841(a)   .707       .698                2.21233

a. Predictors: (Constant), Foot
b. Dependent Variable: Height
Coefficients(a)

                 Unstandardized Coefficients   Standardized Coefficients
Model            B         Std. Error          Beta                        t        Sig.
1 (Constant)     41.839    2.988                                           14.003   .000
  Foot           .953      .107                .841                        8.917    .000

a. Dependent Variable: Height
Residuals Statistics(a)

                       Minimum    Maximum   Mean      Std. Deviation   N
Predicted Value        61.8470    74.2331   68.2714   3.38316          35
Residual               -3.56365   4.48357   .00000    2.17955          35
Std. Predicted Value   -1.899     1.762     .000      1.000            35
Std. Residual          -1.611     2.027     .000      .985             35

a. Dependent Variable: Height
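To draw the fitted line by hand, it is enough to evaluate the regression equation at two foot lengths near the ends of the data. The following is a minimal sketch using the intercept and slope from the Coefficients table; the function name `predict_height` and the choice of 20 and 34 cm as endpoints are illustrative assumptions.

```python
# Coefficients copied from the SPSS Coefficients table.
INTERCEPT = 41.839  # inches
SLOPE = 0.953       # inches of height per centimeter of foot length

def predict_height(foot_cm):
    """Predicted height in inches for a foot length in centimeters."""
    return INTERCEPT + SLOPE * foot_cm

# Two x-values near the ends of the observed foot lengths (about 20 to 34 cm).
endpoints = [(x, round(predict_height(x), 3)) for x in (20, 34)]
print(endpoints)  # [(20, 60.899), (34, 74.241)]
```

Plotting these two points and joining them with a straight edge reproduces the regression line.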
6. The point (x-bar, y-bar) always lies on the regression line. Find this point and determine the
approximate mean foot length.
We know that y-bar (the mean of height) is 68.2714 inches. From the scatterplot, the mean foot
length looks to be about 27 cm. We could also solve 68.2714 = 41.839 + .953(x-bar) to get
x-bar = (68.2714 - 41.839)/.953 = 27.736. (The difference between that and the exact value of
27.7429 we attribute to rounding errors.)
7. On page 169 of the textbook are the formulas for the slope and intercept of the regression
equation. Verify these formulas using the above summary statistics. Note that the actual mean
and standard deviation of foot length are, respectively, 27.7429 cm and 3.55083 cm.
The slope is r * sy / sx = (.841) * 4.02445 / 3.55083 = 0.953, which agrees with SPSS.
The intercept term is y-bar – slope*x-bar = 68.2714 - .953*27.7429 = 41.83242, whose difference
from the stated value of 41.839 we again attribute to round-off errors.
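The verification in (7) can be sketched in a few lines. The inputs are the summary statistics quoted in this worksheet; the variable names are ours. Because the correlation is rounded to three decimals, the results agree with SPSS only approximately.

```python
# Summary statistics quoted in the worksheet.
r = 0.841                                   # correlation of height and foot length
mean_height, sd_height = 68.2714, 4.02445   # inches
mean_foot, sd_foot = 27.7429, 3.55083       # centimeters

slope = r * sd_height / sd_foot              # slope = r * sy / sx
intercept = mean_height - slope * mean_foot  # intercept = y-bar - slope * x-bar

print(round(slope, 3))      # 0.953
print(round(intercept, 2))  # 41.83, close to the 41.839 reported by SPSS

# By construction, the line passes through (x-bar, y-bar).
print(abs(intercept + slope * mean_foot - mean_height) < 1e-9)  # True
```

Note that this sketch keeps the unrounded slope when computing the intercept, so its intercept differs slightly from the hand calculation above, which uses .953.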
8. Use the regression line to predict the height of a person with a 28 cm foot length. Then repeat
for a person with a 29 cm foot length. Calculate the difference in these two height predictions.
Does this value look familiar? Explain.
For a 28cm foot length, the regression predicts a height of 41.839 + .953*28 = 68.523 inches.
For a 29cm foot length, the regression predicts a height of 41.839 + .953*29 = 69.476 inches.
The difference of these two heights is 0.953, which is the slope, as expected.
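The reason the difference equals the slope can be seen algebraically. In the short derivation below, the symbols b_0 (intercept) and b_1 (slope) are our notation, not the worksheet's.

```latex
\[
\hat{y}(x) = b_0 + b_1 x
\quad\Longrightarrow\quad
\hat{y}(x+1) - \hat{y}(x)
  = \bigl[b_0 + b_1 (x+1)\bigr] - \bigl[b_0 + b_1 x\bigr]
  = b_1 .
\]
\[
\text{Here: } \hat{y}(29) - \hat{y}(28) = 69.476 - 68.523 = 0.953 = b_1 .
\]
```

The intercept cancels, so the same difference appears for any two foot lengths that are 1 cm apart.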
9. Provide an interpretation of the slope coefficient in context.
Predicted height increases by 0.953 inches, on average, for each additional centimeter of foot length.
10. Provide an interpretation of the intercept coefficient in context. Is such a prediction
meaningful for these data? Explain.
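The intercept, 41.839 inches, is the predicted height for a foot length of 0 cm. Such a prediction
is not meaningful: no one has a foot of length 0, and 0 cm is far outside the range of the observed
foot lengths (about 20 to 34 cm).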
11. Predict the height of someone whose foot length is 44 cm. Explain why you would not be as
comfortable making this prediction as the one in (8).
The model predicts a height of 41.839 + .953*44 = 83.771 inches, which is almost 7 feet. Because
44 cm is far outside the range of observed foot lengths (about 20 to 34 cm), we have no evidence
that the linear pattern continues there, so we don’t expect the model to do well.
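One way to make the extrapolation concern concrete is to have the prediction report whether the input falls inside the observed range. This is only a sketch: the bounds 20 and 34 cm are the approximate range read off the scatterplot, and the function name `predict_with_range_check` is hypothetical.

```python
INTERCEPT = 41.839           # inches, from the SPSS Coefficients table
SLOPE = 0.953                # inches per centimeter
FOOT_MIN, FOOT_MAX = 20, 34  # assumption: approximate observed foot lengths (cm)

def predict_with_range_check(foot_cm):
    """Return (predicted height in inches, True if foot_cm is within the observed range)."""
    inside = FOOT_MIN <= foot_cm <= FOOT_MAX
    return INTERCEPT + SLOPE * foot_cm, inside

for foot in (28, 44):
    height, inside = predict_with_range_check(foot)
    note = "within the data" if inside else "extrapolation, treat with caution"
    print(f"{foot} cm -> {height:.3f} in ({note})")
```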
12. Zach, a student in statistics, has a foot length of 25 cm and is 67 inches tall. What is Zach’s
residual? In general, what is the meaning of a positive residual? Of a negative residual?
The predicted height for Zach is 41.839 + .953*25 = 65.664 inches. So the residual is equal
to observed height – predicted height = 67 – 65.664 = 1.336 inches. The model underestimated
Zach’s height by 1.336 inches. A positive residual corresponds to an under-estimation. A negative
residual corresponds to an over-estimation.
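The residual calculation is the same for any student: observed height minus predicted height. A minimal sketch, with a function name of our own choosing and the coefficients from the SPSS output as defaults:

```python
def residual(observed_height, foot_cm, intercept=41.839, slope=0.953):
    """Observed minus predicted height, in inches."""
    predicted = intercept + slope * foot_cm
    return observed_height - predicted

zach_residual = residual(67, 25)
print(round(zach_residual, 3))  # 1.336 (positive: the model under-estimated Zach's height)
```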
13. Here is a residual plot of predicted values versus residuals.
We will discuss the residuals and R-Squared. So take a break here and (if there’s time) we’ll talk
about the following in class.
14. What fraction of the variability in heights is explained by the linear regression model?
This is given by R-squared. That is, 70.7% of the variability in heights is explained by the
regression model (so 29.3% of that variability is not).
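The 70.7% / 29.3% split can be checked against the SPSS output, since for a least-squares line the variance of the heights equals the variance of the predicted values plus the variance of the residuals. The sketch below uses the standard deviations from the Descriptives and Residuals Statistics tables; the variable names are ours.

```python
sd_height = 4.02445      # Descriptives: Std. Deviation of Height
sd_predicted = 3.38316   # Residuals Statistics: Predicted Value, Std. Deviation
sd_residual = 2.17955    # Residuals Statistics: Residual, Std. Deviation

explained = sd_predicted ** 2 / sd_height ** 2    # share of variance explained by the line
unexplained = sd_residual ** 2 / sd_height ** 2   # share left in the residuals

print(round(explained, 3), round(unexplained, 3))  # 0.707 0.293
```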
15. Do you think that the linear model is appropriate for these data?
Yes. The scatterplot shows a fairly strong linear association. Also, the residual plot shows no
particular structure or pattern to the residuals. The points are somewhat uniformly distributed
around the line y = 0. Being a little more nit-picky, because of some slightly higher residuals
towards the center of the plot, there is a slight tendency of the model to under-estimate heights
when the predicted height is about 69 inches.
16. Guess-timate the standard deviation of the residuals. How much do the data points spread
about the regression line?
From the residual plot, look at the y-values. Imagine projecting the points onto the y-axis. The
typical distance from zero is about 2 inches, so a reasonable guesstimate of the standard deviation
of the residuals is about 2 inches. (SPSS reports 2.21233 as the Std. Error of the Estimate.) This
is the measure of spread about the regression line: we expect points to lie about 2 inches above or
below the line.
17. Consider how height varies about its mean. Compare your answers in (3) and (16). Is the
regression model superior to just estimating height based on its mean and standard deviation?
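In (3), we said that height was about 68 inches, give or take about 4 inches. The regression model
says that height is about 41.839 + .953*foot length, give or take about 2 inches. So yes, the
regression model is superior: knowing foot length cuts the typical prediction error roughly in half.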
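The comparison can be made with the exact SPSS figures rather than the rounded "4 inches" and "2 inches". The sketch below simply takes the ratio of the two spreads; the variable names are ours.

```python
sd_height = 4.02445     # spread of heights about their mean (Descriptives)
se_estimate = 2.21233   # spread of heights about the regression line (Model Summary)

ratio = se_estimate / sd_height
print(f"{ratio:.2f}")       # 0.55
print(f"{1 - ratio:.0%}")   # 45%, the reduction in typical prediction error
```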
*This worksheet is based, in part, on Investigation 6.3.3 in Investigating Statistical Concepts, Applications,
and Methods by Beth Chance and Allan Rossman.