AP STATISTICS Chapter 13 – Homework Simple Linear Regression

AP STATISTICS
Chapter 13 – Homework
Simple Linear Regression & Correlation Inferential Methods
Columns: HW#, Objective, Section, Reading pages, Homework Problems, Due Date

HW 1 (Section 13.1)
  Objectives: To understand the conditions (assumptions) of simple linear regression. To estimate the population regression line. To interpret the slope of the least squares regression line (LSRL). To calculate a point estimate using simple linear regression.
  Reading: pages 689-692 and 693-695; Example 13.2 on pp. 695-97
  Homework: pp. 700-02 #1 (#2, 3 in class)

HW 2 (Section 13.1)
  Objectives: To calculate the probability (proportion) of an event. To interpret se, the estimated standard deviation of the line. To estimate σ² and σ.
  Reading: pages 697-698 & Example 13.3 on pages 698-700; last paragraph on p. 698
  Homework: #5, #6, #7, #9, #11

HW 3 (Section 13.2)
  Objective: To calculate the estimated standard deviation of slope.
  Reading: pages 702-704
  Homework: p. 711 #15a

HW 4 (Section 13.2)
  Objective: To calculate a confidence interval for β (slope).
  Reading: pages 704-706
  Homework: p. 711 #15b,c, #19b

HW 5 (Section 13.2)
  Objective: To carry out a hypothesis test concerning β (slope).
  Reading: pages 706-710
  Homework: p. 712 #21, p. 713 #25, p. 712 #26 (class 19a)

HW 6 (Section 13.2)
  Objective: To read computer output.
  Reading: pages 706, 710
  Homework: p. 711 #18, p. 712 #20

HW 7 (Section 13.3)
  Objective: To understand and check the conditions (assumptions) for simple linear regression.
  Reading: pages 713-723
  Homework: p. 724 #29 as a class; #32 using Fathom

HW 8 (Section 13.6)
  Objective: Interpreting & communicating the results of statistical analyses.
  Reading: pages 737-739. Be sure to read “A Word to the Wise” on page 739.

HW 9
  Objective: Review problems (if needed)
  Homework: pp. 741-45 #58, 59, 61a, 62, 63, 65a (68 as a class)
Questions
13.1.
a) y=α + βx is the population regression line.
y = -5.0 + 0.017x, where x is the size of the house in square feet and y is the number of
natural gas therms used during a specified period.
b) Graph: Be sure you have labeled your axes and included scales.
[Scatter plot for problem 1: therms (vertical axis) versus squarefootage (horizontal axis), with the line therms = 0.0170 squarefootage - 5.0; r^2 = 1.0. Plotted points: (1000, 12) and (2000, 29).]
c) If x = 2100 square feet, then y = -5.0 + 0.017(2100) = 30.7 therms.
d) The slope is 0.017 = (0.017 therms)/(1 square foot). Thus, on average, for every increase of 1 square
foot, the number of therms used goes up by 0.017.
e) The slope is 0.017 = (0.017 therms)/(1 square foot) = (1.7 therms)/(100 square feet). Thus, on average, for every
increase of 100 square feet, the number of therms used goes up by 1.7.
f) No, since there are no small houses in the community and a 500 square foot house is
considered small, I would not use the least squares regression line.
This is extrapolation.
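The arithmetic in parts c) through e) can be checked with a short script. This is only a sketch: the function names `mean_therms` and `change_in_therms` are made up for illustration, and the intercept and slope are the values given in the problem.

```python
# Population regression line from problem 13.1: y = -5.0 + 0.017x
ALPHA = -5.0   # y-intercept, in therms
BETA = 0.017   # slope, in therms per square foot


def mean_therms(square_feet):
    """Mean gas usage (therms) for a house of the given size."""
    return ALPHA + BETA * square_feet


def change_in_therms(extra_square_feet):
    """Average change in therms for a given increase in house size."""
    return BETA * extra_square_feet


print(round(mean_therms(2100), 1))       # part c: 30.7
print(round(change_in_therms(1), 3))     # part d: 0.017
print(round(change_in_therms(100), 1))   # part e: 1.7
```

Evaluating `mean_therms(500)` would run without complaint, which is exactly why part f) matters: the code cannot tell you that 500 square feet is outside the range of the data.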
13.2
a) y=α + βx is the population regression line.
y= -0.12+ 0.095x, where x is the pressure (inches of water) and y is the flow rate.
If x = 10, y = -0.12 + 0.095(10) = 0.83.
If x = 5, y = -0.12 + 0.095(5) = 0.355.
b) The slope is 0.095 = (0.095 flow rate)/(1 inch of pressure). Thus, on average, for every increase of one inch of
water in pressure, there is an increase of 0.095 in the flow rate.
c) The slope is 0.095 = (0.095 flow rate)/(1 inch of pressure) = (-0.475 flow rate)/(-5 inches of pressure). Thus, on average,
if the pressure decreases by 5 inches, the flow rate will also decrease by 0.475.
13.3
y=α + βx is the population regression line.
y= -2 + 1.4x, where x is the intake of serum manganese (Mn) and y is Mn concentration.
σ = 1.2
a) If x = 4, y = -2 + 1.4(4) = 3.6, and if x = 4.5, y = -2 + 1.4(4.5) = 4.3.
b) If x = 4, the mean is 3.6, so P(y > 5) = P(z > (5 - 3.6)/1.2) = P(z > 1.1667) ≈ 0.1217.
c) If x = 5, y = -2 + 1.4(5) = 5, so P(y > 5) = P(z > 0) = 0.5.
If x = 5, y = -2 + 1.4(5) = 5, so P(y < 3.8) = P(z < (3.8 - 5)/1.2) = P(z < -1) ≈ 0.1587.
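The probabilities in parts b) and c) come from standardizing with the mean on the regression line and σ = 1.2. A minimal sketch of that calculation follows, using the error function from Python's standard library in place of a z-table; the helper names `normal_cdf` and `prob_above` are illustrative, not from the textbook.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_above(y_cut, x, alpha=-2.0, beta=1.4, sigma=1.2):
    """P(y > y_cut) when y is normal with mean alpha + beta*x and sd sigma."""
    mean_y = alpha + beta * x
    z = (y_cut - mean_y) / sigma
    return 1.0 - normal_cdf(z)


print(round(prob_above(5, 4), 4))             # part b, about 0.1217
print(round(prob_above(5, 5), 4))             # part c, 0.5
print(round(normal_cdf((3.8 - 5) / 1.2), 4))  # P(y < 3.8) at x = 5, about 0.1587
```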
13.5.
y=α + βx is the population regression line.
y= 23,000 + 47x, where x is the house size in square feet and y is the house price in
dollars. σ = 5000
a) The slope is 47 = (47 dollars)/(1 square foot) = (4700 dollars)/(100 square feet).
For every additional square foot added, the house price increases by $47 on average.
Similarly, on average, the house price increases $4700 for every additional 100
square feet added.
b) If x = 1800, y = 23,000 + 47(1800) = 107,600
P(y > 110,000) = P(z > .48) ≈ .3156, where $z = \frac{110{,}000 - 107{,}600}{5000} = .48$
P(y < 100,000) = P(z < -1.52) ≈ .0643, where $z = \frac{100{,}000 - 107{,}600}{5000} = -1.52$
13.6
a) y=α + βx is the least squares regression line for the whole population while y = a + bx is
the regression line for a given sample.
b) β is the slope of the regression line for the whole population. β is a parameter.
b is the slope of the regression line for a sample. b is a statistic.
c) If x* is a value of the independent variable, then α + βx* represents the average y value
(response variable) if repeated samples are taken for the given x*. Remember, one
assumption is that for each x*, the y values vary normally and µy (the average of the y’s
for a given x) lies on the population regression line. In contrast, a + bx* represents the
predicted y value using the regression line from the sample. This value is called a point
estimate or point prediction.
d) σ represents “the extent to which observed points (x,y) tend to fall close to or far away
from the population regression line.” (page 697). “The value of σ represents the
magnitude of a typical deviation of a point (x,y) in the population from the population
regression line.” (p. 698).
se is an estimate of σ using a sample. “se is the magnitude of a typical sample deviation
(residual) from the least squares line.” (p. 698). On page 740, se is also discussed.
$s_e = \sqrt{\frac{\text{SSResid}}{n-2}} = \sqrt{\frac{\sum (\text{residuals})^2}{n-2}} = \sqrt{\frac{\sum (y-\hat{y})^2}{n-2}}$ is a point estimate (statistic) of the
standard deviation σ, with n - 2 degrees of freedom.
13.7 x represents the wind speed in m/sec and y the residence half time.
a) r² represents the percent of variation in residence half time that can be attributed to the
least squares regression line. In class, we often said the coefficient of determination (r²)
is the percent of variation in y that can be explained by the linear relationship between
x and y. We would always want this in context.
r² represents the percent of variation in residence half time that can be explained by the linear
relationship between wind speed and residence half time.
$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{\sum (\text{residuals})^2}{\sum (y-\bar{y})^2} = 1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2} = 1 - \frac{27.890}{73.937} \approx 0.6228$
b) $s_e = \sqrt{\frac{\text{SSResid}}{n-2}} = \sqrt{\frac{\sum (\text{residuals})^2}{n-2}} = \sqrt{\frac{\sum (y-\hat{y})^2}{n-2}} = \sqrt{\frac{27.890}{13-2}} \approx 1.5923$
se represents the typical deviation from the least squares regression line. Thus, a typical
observation lies about 1.5923 away from the least squares regression line.
c) To estimate the mean change in residence half time associated with a 1 m/sec increase in
wind speed, we need the slope of the LSRL. The slope is given to be 3.4307.
Thus, on average, for every 1 m/sec increase in wind speed, the residence half
time increases by about 3.43 units (the units are not specified: seconds, hours, days, etc.).
d) If x = 1, then ŷ = a+bx = 0.0119+3.4307(1) ≈ 3.4426
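Parts a) and b) use only SSResid, SSTo, and n, so they are easy to verify numerically. The sketch below assumes those three summary values are already known, as they are in this problem; the function names are illustrative.

```python
import math


def r_squared(ss_resid, ss_total):
    """Coefficient of determination: 1 - SSResid/SSTo."""
    return 1.0 - ss_resid / ss_total


def s_e(ss_resid, n):
    """Estimated standard deviation about the line, with n - 2 degrees of freedom."""
    return math.sqrt(ss_resid / (n - 2))


# Summary values from problem 13.7
print(round(r_squared(27.890, 73.937), 4))  # 0.6228
print(round(s_e(27.890, 13), 4))            # 1.5923
```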
13.9
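a) $r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{\sum (\text{residuals})^2}{\sum (y-\bar{y})^2} = 1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2} = 1 - \frac{2620.57}{22398.05} \approx 0.8830$
b) $s_e = \sqrt{\frac{\text{SSResid}}{n-2}} = \sqrt{\frac{\sum (\text{residuals})^2}{n-2}} = \sqrt{\frac{\sum (y-\hat{y})^2}{n-2}} = \sqrt{\frac{2620.57}{16-2}} \approx 13.6815$ with df = 14.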
13.11
c) r² = .4356. 43.56% of the variation in the market share can be explained by the linear
relationship between the advertising share and the market share.
d) See above for calculator commands.
13.15 Let x represent the time elapsed since termination of the molding process and y represent
the hardness of the molded plastic.
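a) $s_b = \frac{s_e}{\sqrt{S_{xx}}} = \frac{\sqrt{\frac{\text{SSResid}}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{\frac{\sum (\text{residuals})^2}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{\frac{\sum (y-\hat{y})^2}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{\frac{1235.470}{15-2}}}{\sqrt{4024.20}} \approx .1537$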
b) I will assume all four conditions have been satisfied. They are the following:
1. µy for each value of x lie on a straight line.
2. For repeated x-values, the response variable varies normally.
3. For each x-value, the standard deviations of the y are the same.
4. For every x-value, repeated y-values are independent.
CL = .95, df = n - 2 = 15 - 2 = 13
C.I. = b ± t* sb
C.I. = 2.50 ± 2.16(.1537)
C.I. = 2.50 ± .3320
C.I. = (2.168, 2.832)
I am 95% confident that the true slope of the population regression line relating the time elapsed
since termination of the molding process to the hardness of the molded plastic is between 2.168 and 2.832.
c) Since the margin of error is relatively small at 0.3320, I believe the slope has been estimated
precisely. In other words, the confidence interval is not too wide.
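The standard error and interval from parts a) and b) can be reproduced as follows. This is a sketch under stated assumptions: the critical value t* = 2.16 is taken from the solution above (df = 13, 95% confidence) rather than computed, and the function names are made up for illustration.

```python
import math


def slope_std_error(ss_resid, s_xx, n):
    """s_b = s_e / sqrt(S_xx), where s_e = sqrt(SSResid / (n - 2))."""
    s_e = math.sqrt(ss_resid / (n - 2))
    return s_e / math.sqrt(s_xx)


def slope_confidence_interval(b, s_b, t_crit):
    """Interval b +/- t* s_b; t_crit must be looked up for df = n - 2."""
    margin = t_crit * s_b
    return b - margin, b + margin


s_b = slope_std_error(1235.470, 4024.20, 15)
low, high = slope_confidence_interval(2.50, s_b, 2.16)  # t* from a t-table, df = 13
print(round(s_b, 4))                  # 0.1537
print(round(low, 3), round(high, 3))  # 2.168 2.832
```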
13.19a
Let y represent the average SAT score and x the expenditure per pupil (in thousands of dollars).
n=44, b = 15 and sb=5.3.
Ho: β = 0: There is no association between the average SAT score and expenditure per pupil for
New Jersey school districts. (There is not a useful linear relationship between SAT scores and
expenditures per pupil.)
Ha: β ≠ 0: There is an association between the average SAT score and expenditure per pupil for
New Jersey school districts. (There is a useful linear relationship between SAT scores and
expenditures per pupil.)
I will assume all four conditions have been satisfied. They are the following:
1. µy for each value of x lie on a straight line.
2. For repeated x-values, the response variable varies normally.
3. For each x-value, the standard deviations of the y are the same.
4. For every x-value, repeated y-values are independent.
df = 42
$t = \frac{b - \beta}{s_b} = \frac{15 - 0}{5.3} \approx 2.8302$, P(t > 2.8302 or t < -2.8302) ≈ .0071
At α = .05, I would reject Ho. Therefore, I have evidence to believe there is an association
between the average SAT score and expenditure per pupil for New Jersey school districts.
(There is a useful linear relationship between SAT scores and expenditures per pupil.)
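The test statistic and two-sided p-value above can be checked without a calculator's regression test. The sketch below is one possible approach, not the method used in the solution: it integrates the t density numerically with Simpson's rule, using only the standard library. All function names are illustrative.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def slope_t_test(b, s_b, df, beta_0=0.0):
    """t statistic and two-sided p-value for Ho: beta = beta_0."""
    t_stat = (b - beta_0) / s_b
    return t_stat, 2 * t_upper_tail(t_stat, df)


t_stat, p_value = slope_t_test(15, 5.3, 42)
print(round(t_stat, 4))   # 2.8302
print(round(p_value, 4))  # close to the .0071 reported above
```

For a one-sided alternative, use `t_upper_tail` directly (or its complement) instead of doubling it.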
13.19b
Let y represent the average SAT score and x the expenditure per pupil (in thousands of dollars).
n=44, b = 15 and sb=5.3.
Assume conditions as stated above are true.
CL = .95, df = n - 2 = 44 - 2 = 42
C.I. = b ± t* sb
C.I. = 15 ± 2.018(5.3)
C.I. = 15 ± 10.6954
C.I. = (4.3046, 25.6954)
Since this interval does not capture zero, I do have evidence to believe there is an association
between the average SAT score and the expenditure per pupil for New Jersey school districts.
I am 95% confident that the true slope is between 4.3 and 25.7. That is, each additional one
thousand dollars spent per pupil is associated with an increase of between 4.3 and 25.7 points
in the average SAT score, on average.
21. Part a)
Let y represent the mean response time for those suffering a closed-head
injury and x represent the mean response time on the same task for
individuals with no head injury.
Part b)
Ho: β = 0: There is no linear relationship between the mean response
time for individuals with no head injury and the mean response time for
individuals with CHI.
Ha: β ≠ 0: There is a linear relationship between the mean response
time for individuals with no head injury and the mean response time for
individuals with CHI.
I will assume
1. µy for each value of x lie on a straight line.
2. It is stated in the problem that it is reasonable that the
observations are independent. In other words, for every x-value,
repeated y-values are independent.
3. For repeated x-values, the response variable varies normally; we
can check the normal probability plot of the residuals. Since
the normal probability plot is approximately linear, I believe the
y-values for each x vary normally.
4. For each x-value, the standard deviations of the y are the same.
To check this condition, one wants to look at the residual plot.
Since the residual plot is randomly scattered with no apparent
pattern, I believe the standard deviations are the same for each
x-value.
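Conditions 3 and 4 in the list above are both checked through the residuals, so it helps to see how residuals are produced. The sketch below fits a least squares line and lists the residuals that would go into a residual plot or a normal probability plot. The data are hypothetical, invented purely for illustration; they are not the response-time data from the textbook problem.

```python
def least_squares_line(xs, ys):
    """Return intercept a and slope b of the least squares line y-hat = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys):
    """Observed y minus predicted y for each data point."""
    a, b = least_squares_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
res = residuals(xs, ys)
print(round(a, 2), round(b, 2))
for x, r in zip(xs, res):
    print(x, round(r, 2))  # plot these pairs to get the residual plot
```

Residuals from a least squares fit always sum to zero (up to rounding), so the plot is judged by its pattern and spread, not by its center.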
df = 8
$t = \frac{b - \beta}{s_b} = \frac{1.5946 - 0}{.0587} \approx 27.165$, P(t > 27.165 or t < -27.165) ≈ 0
At α = .05, I would reject Ho. Therefore, I have evidence to believe
there is a linear relationship between the mean response time for
individuals with no head injury and the mean response time for
individuals with CHI.
Note: to find sb, I used the calculator regression test and solved the equation
$t = \frac{b - \beta}{s_b} = \frac{1.5946 - 0}{s_b} \approx 27.165$ for sb.
25.
Let x represent the temperature in Celsius and y represent the pH
of skim milk.
You should always visualize the data by looking at a scatterplot.
Ho: β = 0: There is no linear relationship between skim milk’s pH and the
temperature.
Ha: β < 0: There is a negative linear relationship between skim milk’s pH and the
temperature.
I will assume
1. µy for each value of x lie on a straight line.
2. For every x-value, repeated y-values are independent.
3. For repeated x-values, the response variable varies normally; we
can check the normal probability plot of the residuals. Since
the normal probability plot is approximately linear, I believe the
y-values for each x vary normally.
4. For each x-value, the standard deviations of the y are the same.
To check this condition, one wants to look at the residual plot.
Since the residual plot is randomly scattered with no apparent
pattern, I believe the standard deviations are the same for each
x-value.
df = 14
$t = \frac{b - \beta}{s_b} = \frac{-.0073 - 0}{.000415} \approx -17.5693$, P(t < -17.5693) ≈ 0
At α = .05, I would reject Ho. Therefore, I have evidence to believe
there is a negative linear relationship between skim milk’s pH and the
temperature.
Note: to find sb, I used the calculator regression test and solved the equation
$t = \frac{b - \beta}{s_b} = \frac{-.0073 - 0}{s_b} \approx -17.5693$ for sb.
Note, the value of s below is se, the standard error of the line.
13.18.
a) The p-value is approximately zero, so I would reject the null hypothesis. Thus, there is a
useful linear relationship between the average wage and quit rate.
b) I will assume all four conditions have been satisfied. They are the following:
1. µy for each value of x lie on a straight line.
2. For repeated x-values, the response variable varies normally.
3. For each x-value, the standard deviations of the y are the same.
4. For every x-value, repeated y-values are independent.
CL = .95, df = n - 2 = 15 - 2 = 13
C.I. = b ± t* sb
C.I. = .34655 ± 2.16(0.05866)
C.I. = .34655 ± .1267
C.I. = (.2198, .4733)
I am 95% confident the average change in quit rate associated with a $1 increase in average
hourly wage is between 21.98% and 47.33%. Since the margin of error is large relative to the
estimate, the slope has not been estimated very precisely. Further, a typical observation
lies about .4862 away from the least squares regression line.
13.20
The p-value is .111, so there does not appear to be a useful linear relationship between the
percent of a tooth’s root with transparent dentine and the age of the person. Further, the
coefficient of determination is .286 and the correlation coefficient is .534, indicating
only a weak positive linear association between the percent of root with transparent dentine and
the age of the individual.
26.
Let x represent the length of the lambda-opisthion chord (mm) and y represent the
capacity (cm3).
You should always visualize the data by looking at a scatterplot.
Ho: β = 20: The slope of the linear relationship between the chord
length and the capacity is 20 cm3/mm.
Ha: β < 20: The slope of the linear relationship between the chord
length and the capacity is less than 20 cm3/mm.
I will assume
1. µy for each value of x lie on a straight line.
2. For every x-value, repeated y-values are independent.
3. For repeated x-values, the response variable varies normally; we
can check the normal probability plot of the residuals. Since
the normal probability plot is approximately linear, I believe the
y-values for each x vary normally.
4. For each x-value, the standard deviations of the y are the same.
To check this condition, one wants to look at the residual plot.
Since the residual plot is randomly scattered with no apparent
pattern, I believe the standard deviations are the same for each
x-value.
df = 7 - 2 = 5
$t = \frac{b - \beta}{s_b} = \frac{22.2570 - 20}{5.002} \approx .4512$, P(t < .4512) ≈ .6646
At α = .05, I would not reject Ho. Therefore, I do not have evidence to
believe the slope of the linear relationship between the chord length and
the capacity is not 20 cm3/mm.
To calculate sb, I used the calculator commands shown below.
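$s_b = \frac{\sqrt{\frac{\sum (\text{residuals})^2}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{3088.54}}{\sqrt{123.4286}} \approx 5.002$, where 3088.54 is already SSResid/(n - 2), with n - 2 = 7 - 2 = 5.
OR one can use the linear regression test and solve for sb. Remember, the calculator assumes
β=0. You will get the same answer for sb ≈ 5.002.
$t = \frac{b - \beta}{s_b} = \frac{22.2570 - 0}{s_b} \approx 4.44935$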
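The second route to sb is a two-step calculation: recover sb from the calculator's t statistic (which tests β = 0), then reuse it with the hypothesized slope of 20. A minimal sketch, with illustrative function names:

```python
def std_error_from_calculator_t(b, t_calc):
    """Recover s_b from a calculator t statistic computed under beta = 0."""
    return b / t_calc


def t_statistic(b, s_b, beta_0):
    """Test statistic for Ho: beta = beta_0."""
    return (b - beta_0) / s_b


s_b = std_error_from_calculator_t(22.2570, 4.44935)
t_stat = t_statistic(22.2570, s_b, 20)
print(round(s_b, 3))     # 5.002
print(round(t_stat, 4))  # 0.4512
```

The same two functions reproduce the sb values solved for in problems 21 and 25, since those notes use the identical rearrangement.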