AP STATISTICS Chapter 13 – Homework
Simple Linear Regression & Correlation Inferential Methods

HW# 1 (Section 13.1)
Objectives: To understand the conditions (assumptions) of simple linear regression. To estimate the population regression line. To calculate a point estimate using simple linear regression. To calculate the probability (proportion) of an event. To interpret the slope of the least squares regression line (LSRL).
Reading: pages 689-692; 693-695; Example 13.2 on pp. 695-97.
Homework Problems: pp. 700-02 #1, #5, #6 (#2, 3 in class).

HW# 2 (Section 13.1)
Objectives: To estimate σ² and σ. To interpret se, the estimated standard deviation of the line.
Reading: pages 697-698 & Example 13.3 on pages 698-700; last paragraph on p. 698.
Homework Problems: #7, #9, #11.

HW# 3 (Section 13.2)
Objective: To calculate the estimated standard deviation of the slope.
Reading: pages 702-704.
Homework Problems: p. 711 #15a.

HW# 4 (Section 13.2)
Objective: To calculate a confidence interval for β (slope).
Reading: pages 704-706.
Homework Problems: p. 711 #15b,c, #19b.

HW# 5 (Section 13.2)
Objective: To carry out a hypothesis test concerning β (slope).
Reading: pages 706-710.
Homework Problems: p. 712 #21, p. 713 #25, p. 712 #26 (class 19a).

HW# 6 (Section 13.2)
Objective: To read computer output.
Reading: pages 706, 710.
Homework Problems: p. 711 #18, p. 712 #20.

HW# 7 (Section 13.3)
Objective: To understand and check the conditions (assumptions) for simple linear regression.
Reading: pages 713-723.
Homework Problems: p. 724 #29 as a class. #32 using Fathom.

HW# 8 (Section 13.6)
Objective: Interpreting & communicating the results of statistical analyses.
Reading: pages 737-739. Be sure to read "A Word to Wise" on page 739.

HW# 9
Objective: Review problems (if needed).
Homework Problems: pp. 741-45 #58, 59, 61a, 62, 63, 65a (68 as a class).

Questions

13.1
a) y = α + βx is the population regression line. y = -5.0 + .017x, where x is the size of the house in square feet and y is the number of natural gas therms used during a specified period.
b) Graph: Be sure you have labeled your axes and included scales.
[Scatter plot for problem 1: therms (vertical axis) versus squarefootage (horizontal axis), with the line therms = 0.0170 squarefootage - 5.0 and r^2 = 1.0. Plotted points: (1000, 12) and (2000, 29).]
c) If x = 2100 square feet, then y = -5 + (.017)(2100) = 30.7 therms.
d) The slope is $0.017 = \frac{.017 \text{ therms}}{1 \text{ square foot}}$. Thus, on average, for every increase of 1 square foot, the number of therms used goes up by 0.017.
e) The slope is $0.017 = \frac{.017 \text{ therms}}{1 \text{ square foot}} = \frac{1.7 \text{ therms}}{100 \text{ square feet}}$. Thus, on average, for every increase of 100 square feet, the number of therms used goes up by 1.7.
f) No. Since there are no small houses in the community and a 500 square foot house is considered small, I would not use the least squares regression line. This is extrapolation.

13.2
a) y = α + βx is the population regression line. y = -0.12 + 0.095x, where x is the pressure (inches of water) and y is the flow rate.
If x = 10, y = -0.12 + 0.095(10) = 0.83.
If x = 5, y = -0.12 + 0.095(5) = 0.355.
b) The slope is $0.095 = \frac{.095 \text{ flow rate}}{1 \text{ pressure (in inches)}}$. Thus, on average, for every increase of one inch of water in pressure, there is an increase of 0.095 in the flow rate.
c) The slope is $0.095 = \frac{.095 \text{ flow rate}}{1 \text{ pressure (in inches)}} = \frac{-0.475 \text{ flow rate}}{-5 \text{ pressure (in inches)}}$. Thus, on average, if the pressure decreases by 5 inches, the flow rate will decrease by 0.475.

13.3
y = α + βx is the population regression line. y = -2 + 1.4x, where x is the intake of serum manganese (Mn) and y is Mn concentration. σ = 1.2
a) If x = 4, y = -2 + 1.4(4) = 3.6, and if x = 4.5, y = -2 + 1.4(4.5) = 4.3.
b) If x = 4, the mean is 3.6, so $z = \frac{5 - 3.6}{1.2} \approx 1.1667$ and P(y > 5) = P(z > 1.1667) ≈ 0.1217.
c) If x = 5, y = -2 + 1.4(5) = 5, so P(y > 5) = P(z > 0) = 0.5.
If x = 5, y = -2 + 1.4(5) = 5, so $z = \frac{3.8 - 5}{1.2} = -1$ and P(y < 3.8) = P(z < -1) ≈ 0.1587.

13.5
y = α + βx is the population regression line. y = 23,000 + 47x, where x is the house size in square feet and y is the house price in dollars. σ = 5000
a) The slope is $47 = \frac{47 \text{ dollars}}{1 \text{ square foot}} = \frac{4700 \text{ dollars}}{100 \text{ square feet}}$. For every additional square foot, the house price increases by $47 on average. Similarly, on average, the house price increases by $4700 for every additional 100 square feet.
b) If x = 1800, y = 23,000 + 47(1800) = 107,600.
P(y > 110,000) = P(z > .48) ≈ .3156, where $z = \frac{110{,}000 - 107{,}600}{5000} = .48$.
P(y < 100,000) = P(z < -1.52) ≈ .0643, where $z = \frac{100{,}000 - 107{,}600}{5000} = -1.52$.

13.6
a) y = α + βx is the least squares regression line for the whole population, while y = a + bx is the regression line for a given sample.
b) β is the slope of the regression line for the whole population. β is a parameter. b is the slope of the regression line for a sample. b is a statistic.
c) If x* is a value of the independent variable, then α + βx* represents the average y value (response variable) if repeated samples are taken for the given x*. Remember, one assumption is that for each x*, the y values vary normally and µy (the average of the y's for a given x) lies on the population regression line. In contrast, a + bx* represents the predicted y value using the regression line from the sample. This value is called a point estimate or point prediction.
d) σ represents "the extent to which observed points (x,y) tend to fall close to or far away from the population regression line." (page 697). "The value of σ represents the magnitude of a typical deviation of a point (x,y) in the population from the population regression line." (p. 698). $s_e$ is an estimate of σ using a sample. "se is the magnitude of a typical sample deviation (residual) from the least squares line." (p. 698). On page 740, $s_e$ is also discussed.
$s_e = \sqrt{\frac{SSResid}{n-2}} = \sqrt{\frac{\sum (\text{residuals})^2}{n-2}} = \sqrt{\frac{\sum (y-\hat{y})^2}{n-2}}$ is a point estimate (statistic) of the standard deviation σ, with n-2 degrees of freedom.

13.7
x represents the wind speed in m/sec and y the residence half time.
a) $r^2$ represents the percent of variation in residence half time that can be attributed to the least squares regression line. In class, we often said the coefficient of determination ($r^2$) is the percent of variation in y that can be explained by the linear relationship between x and y. We would always want this in context.
$r^2$ represents the percent of variation in residence half time that can be explained by the linear relationship between wind speed and residence half time.
$r^2 = 1 - \frac{SSResid}{SSTo} = 1 - \frac{\sum (\text{residuals})^2}{\sum (y-\bar{y})^2} = 1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2} = 1 - \frac{27.890}{73.937} \approx 0.6228$
b) $s_e = \sqrt{\frac{SSResid}{n-2}} = \sqrt{\frac{\sum (\text{residuals})^2}{n-2}} = \sqrt{\frac{\sum (y-\hat{y})^2}{n-2}} = \sqrt{\frac{27.890}{13-2}} \approx 1.5923$
$s_e$ represents the typical deviation from the least squares regression line. Thus, the typical residual is about 1.5923 away from the least squares regression line.
c) To estimate the mean change in residence half time associated with a 1 m/sec increase in wind speed, we need the slope of the LSRL. The slope is given to be 3.4307. Thus, on average, for every increase of 1 m/sec in the wind speed, the residence half time increases by 3.43 units (not sure if it is seconds, hours, days, etc.).
d) If x = 1, then ŷ = a + bx = 0.0119 + 3.4307(1) ≈ 3.4426

13.9
a) $r^2 = 1 - \frac{SSResid}{SSTo} = 1 - \frac{\sum (\text{residuals})^2}{\sum (y-\bar{y})^2} = 1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2} = 1 - \frac{2620.57}{22398.05} \approx 0.8830$
b) $s_e = \sqrt{\frac{SSResid}{n-2}} = \sqrt{\frac{\sum (\text{residuals})^2}{n-2}} = \sqrt{\frac{\sum (y-\hat{y})^2}{n-2}} = \sqrt{\frac{2620.57}{16-2}} \approx 13.6815$ with df = 14.

13.11
c) $r^2$ = .4356. 43.56% of the variation in the market share can be explained by the linear relationship between the advertising share and the market share.
d) See above for calculator commands.

13.15
Let x represent the time elapsed since termination of the molding process and y represent the hardness of the molded plastic.
a) $s_b = \frac{s_e}{\sqrt{S_{xx}}} = \frac{\sqrt{\frac{SSResid}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{\frac{\sum (\text{residuals})^2}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{\frac{\sum (y-\hat{y})^2}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{\frac{1235.470}{15-2}}}{\sqrt{4024.20}} \approx .1537$
b) I will assume all four conditions have been satisfied. They are the following:
1. The means µy for each value of x lie on a straight line.
2. For repeated x-values, the response variable varies normally.
3. For each x-value, the standard deviations of the y-values are the same.
4. For every x-value, repeated y-values are independent.
C.I. = b ± t* $s_b$, with df = n-2 = 15-2 = 13 and CL = .95
C.I. = 2.50 ± 2.16(.1537)
C.I. = 2.50 ± .3320
C.I. = (2.168, 2.832)
I am 95% confident the true slope of the relationship between the time elapsed since termination of the molding process and the hardness of the molded plastic is between 2.168 and 2.832.
c) Since the margin of error is relatively small at 0.3320, I believe the slope has been estimated precisely. In other words, the confidence interval is not too wide.

13.19a
Let y represent the average SAT score and x the expenditure per pupil (in thousands of dollars). n = 44, b = 15 and $s_b$ = 5.3.
Ho: β = 0: There is no association between the average SAT score and expenditure per pupil for New Jersey school districts. (There is not a useful linear relationship between SAT scores and expenditures per pupil.)
Ha: β ≠ 0: There is an association between the average SAT score and expenditure per pupil for New Jersey school districts. (There is a useful linear relationship between SAT scores and expenditures per pupil.)
I will assume all four conditions have been satisfied. They are the following:
1. The means µy for each value of x lie on a straight line.
2. For repeated x-values, the response variable varies normally.
3. For each x-value, the standard deviations of the y-values are the same.
4. For every x-value, repeated y-values are independent.
df = 42
$t = \frac{b - \beta}{s_b} = \frac{15 - 0}{5.3} \approx 2.8302$
P(t > 2.8302 or t < -2.8302) ≈ .0071
At α = .05, I would reject Ho. Therefore, I have evidence to believe there is an association between the average SAT score and expenditure per pupil for New Jersey school districts. (There is a useful linear relationship between SAT scores and expenditures per pupil.)

13.19b
Let y represent the average SAT score and x the expenditure per pupil (in thousands of dollars). n = 44, b = 15 and $s_b$ = 5.3. Assume the conditions stated above are true.
C.I. = b ± t* $s_b$, with df = n-2 = 44-2 = 42 and CL = .95
C.I. = 15 ± 2.018(5.3)
C.I. = 15 ± 10.6954
C.I. = (4.3046, 25.6954)
Since this interval does not capture zero, I do have evidence to believe there is an association between the average SAT score and the expenditure per pupil for New Jersey school districts. I am 95% confident that each additional one thousand dollars spent per pupil is associated with an increase in the average SAT score of between 4.3 and 25.7 points.

21.
Part a) Let y represent the mean response time for those suffering a closed-head injury (CHI) and x represent the mean response time on the same task for individuals with no head injury.
Part b)
Ho: β = 0: There is no linear relationship between the mean response time for individuals with no head injury and the mean response time for individuals with CHI.
Ha: β ≠ 0: There is a linear relationship between the mean response time for individuals with no head injury and the mean response time for individuals with CHI.
I will assume
1. The means µy for each value of x lie on a straight line.
2. It is stated in the problem that it is reasonable to treat the observations as independent. In other words, for every x-value, repeated y-values are independent.
3. For repeated x-values, the response variable varies normally; we can check the normal probability plot of the residuals. Since the normal probability plot is approximately linear, I believe the y-values for each x vary normally.
4. For each x-value, the standard deviations of the y-values are the same. To check this condition, one wants to look at the residual plot. Since the residual plot is randomly scattered with no apparent pattern, I believe the standard deviations are the same for each x-value.
df = 8
$t = \frac{b - \beta}{s_b} = \frac{1.5946 - 0}{.0587} \approx 27.165$
P(t > 27.165 or t < -27.165) ≈ 0
At α = .05, I would reject Ho. Therefore, I have evidence to believe there is a linear relationship between the mean response time for individuals with no head injury and the mean response time for individuals with CHI.
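The confidence intervals and test statistics for the slope worked above all follow the same two formulas, so they can be checked with a short script. The following is only a sketch: the function names are hypothetical, and the critical values t* are taken as given in the solutions (2.16 for df = 13, 2.018 for df = 42) rather than computed.

```python
def slope_t_statistic(b, s_b, beta_0=0.0):
    """Test statistic t = (b - beta_0) / s_b for Ho: beta = beta_0."""
    return (b - beta_0) / s_b


def slope_confidence_interval(b, s_b, t_star):
    """Interval b ± t* · s_b; t_star must be looked up for df = n - 2."""
    margin = t_star * s_b
    return (b - margin, b + margin)


# 13.19a: b = 15, s_b = 5.3, Ho: beta = 0
t_19a = slope_t_statistic(15, 5.3)

# Problem 21: b = 1.5946, s_b = .0587, Ho: beta = 0
t_21 = slope_t_statistic(1.5946, 0.0587)

# 13.15b: 95% interval with t* = 2.16 (df = 13)
ci_15b = slope_confidence_interval(2.50, 0.1537, 2.16)

# 13.19b: 95% interval with t* = 2.018 (df = 42)
ci_19b = slope_confidence_interval(15, 5.3, 2.018)

print(round(t_19a, 4), round(t_21, 3))
print(tuple(round(v, 3) for v in ci_15b))
print(tuple(round(v, 4) for v in ci_19b))
```

The p-values are not computed here because the Python standard library has no t distribution; they would still come from a table or the calculator's regression test.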
Note, to solve for $s_b$, I used the calculator regression test and solved the equation
$t = \frac{b - \beta}{s_b} = \frac{1.5946 - 0}{s_b} \approx 27.165$

25.
Let x represent the temperature in Celsius and y represent the pH of skim milk. You should always visualize the data by looking at a scatterplot.
Ho: β = 0: There is no linear relationship between skim milk's pH and the temperature.
Ha: β < 0: There is a negative linear relationship between skim milk's pH and the temperature.
I will assume
1. The means µy for each value of x lie on a straight line.
2. For every x-value, repeated y-values are independent.
3. For repeated x-values, the response variable varies normally; we can check the normal probability plot of the residuals. Since the normal probability plot is approximately linear, I believe the y-values for each x vary normally.
4. For each x-value, the standard deviations of the y-values are the same. To check this condition, one wants to look at the residual plot. Since the residual plot is randomly scattered with no apparent pattern, I believe the standard deviations are the same for each x-value.
df = 14
$t = \frac{b - \beta}{s_b} = \frac{-.0073 - 0}{.000415} \approx -17.5693$
P(t < -17.5693) ≈ 0
At α = .05, I would reject Ho. Therefore, I have evidence to believe there is a negative linear relationship between skim milk's pH and the temperature.
Note, to solve for $s_b$, I used the calculator regression test and solved the equation
$t = \frac{b - \beta}{s_b} = \frac{-.0073 - 0}{s_b} \approx -17.5693$
Note, the value of s below is $s_e$, the standard error of the line.

13.18
a) The p-value is approximately zero, so I would reject the null hypothesis. Thus, there is a useful linear relationship between the average wage and quit rate.
b) I will assume all four conditions have been satisfied. They are the following:
1. The means µy for each value of x lie on a straight line.
2. For repeated x-values, the response variable varies normally.
3. For each x-value, the standard deviations of the y-values are the same.
4. For every x-value, repeated y-values are independent.
C.I. = b ± t* $s_b$, with df = n-2 = 15-2 = 13 and CL = .95
C.I. = .34655 ± 2.16(0.05866)
C.I. = .34655 ± .1267
C.I. = (.2198, .4733)
I am 95% confident the average change in quit rate associated with a $1 increase in average hourly wage is between 21.98% and 47.33%. Since the margin of error is large relative to the slope, the slope has not been estimated very precisely. Further, the typical residual is .4862 away from the least squares regression line.

13.20
The p-value is .111, so there does not appear to be a useful linear relationship between the percent of a tooth's root with transparent dentine and the age of the person. Further, the coefficient of determination is .286 and the correlation coefficient is .534, indicating a weak positive linear association between the percent of root with transparent dentine and the age of the individual.

26.
Let x represent the length of the lambda-opisthion chord (mm) and y represent the capacity (cm³). You should always visualize the data by looking at a scatterplot.
Ho: β = 20: The slope of the linear relationship between the chord length and the capacity is 20 cm³/mm.
Ha: β ≠ 20: The slope of the linear relationship between the chord length and the capacity is not 20 cm³/mm.
I will assume
1. The means µy for each value of x lie on a straight line.
2. For every x-value, repeated y-values are independent.
3. For repeated x-values, the response variable varies normally; we can check the normal probability plot of the residuals. Since the normal probability plot is approximately linear, I believe the y-values for each x vary normally.
4. For each x-value, the standard deviations of the y-values are the same. To check this condition, one wants to look at the residual plot. Since the residual plot is randomly scattered with no apparent pattern, I believe the standard deviations are the same for each x-value.
df = 7-2 = 5
$t = \frac{b - \beta}{s_b} = \frac{22.2570 - 20}{5.002} \approx .4512$
P(t < .4512) ≈ .6646, so for the two-sided alternative P(t > .4512 or t < -.4512) ≈ 2(1 - .6646) ≈ .6708
At α = .05, I would not reject Ho.
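Therefore, I do not have evidence to believe the slope of the linear relationship between the chord length and the capacity is not 20 cm³/mm.
To calculate $s_b$, I used the calculator commands shown below.
$s_b = \frac{\sqrt{\frac{\sum (\text{residuals})^2}{n-2}}}{\sqrt{\sum (x-\bar{x})^2}} = \frac{\sqrt{3088.54}}{\sqrt{123.4286}} \approx 5.002$
Here 3088.54 is read as the value of $\frac{\sum (\text{residuals})^2}{n-2}$ with n-2 = 7-2 = 5, since that is the reading under which the arithmetic gives 5.002.
OR one can use the linear regression test and solve for $s_b$. Remember, the calculator assumes β = 0. You will get the same answer, $s_b$ ≈ 5.002.
$t = \frac{b - \beta}{s_b} = \frac{22.2570 - 0}{s_b} \approx 4.44935$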
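Both routes to $s_b$ used in these solutions (the formula $s_b = s_e / \sqrt{S_{xx}}$, and solving the calculator's β = 0 test statistic for $s_b$) can be sketched in Python. The function names are hypothetical and chosen only for illustration. The first call uses the numbers from 13.15a, the second the slope and calculator t value from problem 26.

```python
import math


def slope_standard_error(ss_resid, n, s_xx):
    """s_b = s_e / sqrt(S_xx), where s_e = sqrt(SSResid / (n - 2))."""
    s_e = math.sqrt(ss_resid / (n - 2))
    return s_e / math.sqrt(s_xx)


def slope_standard_error_from_t(b, t_for_beta_zero):
    """Solve t = (b - 0) / s_b for s_b, given the calculator's t for Ho: beta = 0."""
    return b / t_for_beta_zero


# 13.15a: SSResid = 1235.470, n = 15, S_xx = 4024.20
sb_1315 = slope_standard_error(1235.470, 15, 4024.20)

# Problem 26: b = 22.2570 and the calculator reports t = 4.44935 for beta = 0
sb_26 = slope_standard_error_from_t(22.2570, 4.44935)

print(round(sb_1315, 4), round(sb_26, 3))
```

The second route is handy when only the regression test output is available, because that output reports b and t but not $s_b$ directly.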