Chapter 15: Inference for Regression

15.1 The correlation is r = 0.994, and the least-squares regression equation is ŷ = −3.66 + 1.1969x, where y = humerus length and x = femur length. The scatterplot with the regression line (below) shows a strong, positive, linear relationship. Yes, femur length is a very good predictor of humerus length.

[Figure: scatterplot of humerus length (cm) versus femur length (cm) with the regression line]

15.2 (a) The least-squares regression line is ŷ = 11.547 + 0.84042x, where y = height (inches) and x = arm span (inches). (b) Yes, the least-squares line is an appropriate model for the data because the residual plot shows an unstructured horizontal band of points centered at zero. Since 76 inches is within the range of arm spans examined in Mr. Shenk’s class, it is reasonable to predict the height of a student with a 76-inch arm span.

15.3 (a) The observations are independent because they come from 13 unrelated colonies. (b) The scatterplot of the residuals against the percent returning (below on the left) shows no systematic deviations from the linear pattern. (c) The spread may be slightly wider in the middle, but not markedly so. (d) The histogram (below on the right) shows no outliers or strong skewness, so there are no clear deviations from Normality.

[Figure: residual plot of residuals versus percent return (left) and histogram of the residuals (right)]

15.4 (a) The observations are independent because they come from 16 different individuals. (b) The scatterplot of the residuals against nonexercise activity (below on the left) shows no systematic deviations from the linear pattern. One residual, about 1.6, is slightly larger than the others, but this is nothing to get overly concerned about. (c) The spread is slightly greater for larger values of nonexercise activity, but not markedly so. (d) The histogram (below on the right) shows no outliers and a slight skewness to the right, but this does not suggest a lack of Normality.
[Figure: residual plot of residuals versus nonexercise activity (calories) (left) and histogram of the residuals (right)]

15.5 (a) The slope parameter β represents the change in the mean humerus length when femur length increases by 1 cm. (b) The estimate of β is b = 1.1969, and the estimate of α is a = −3.66. (c) The residuals are −0.8226, −0.3668, 3.0425, −0.9420, and −0.9110, and their sum is 0.0001 (essentially 0, apart from rounding error). The standard deviation σ is estimated by s = √(Σ resid² / (n − 2)) = √(11.79/3) = 1.982.

15.6 (a) The scatterplot (below on the left) shows a strong, positive linear relationship between x = speed (feet/second) and y = steps (per second). The correlation is r = 0.999 and the least-squares regression line is ŷ = 1.76608 + 0.080284x. (b) The residuals (rounded to 4 decimal places) are 0.0106, −0.0013, −0.0010, −0.0110, −0.0093, 0.0031, and 0.0088, and their sum is −0.0001 (essentially 0, except for rounding error). (c) The estimate of α is a = 1.76608, the estimate of β is b = 0.080284, and the estimate of σ is s = √(0.00041/5) = 0.0091.

[Figure: scatterplot of steps (per second) versus speed (ft/s)]

15.7 (a) The scatterplot below shows a strong, positive linear relationship. (b) The slope β gives this rate. The estimate of β is listed as the coefficient of “year” in the output, b = 9.31868 tenths of a millimeter per year. (c) We are not able to make an inference for the tilt rate from a simple linear regression model, because the observations are not independent.

[Figure: scatterplot of lean (coded from 2.9 meters) versus year (coded as last two digits)]

15.8 (a) The least-squares regression line is ŷ = 0.12049 + 0.008569x, where y = the proportion of perch killed and x = the number of perch. The positive slope tells us that as the number of perch increases, the proportion killed by bass also increases.
(b) The regression standard error is s = 0.1886, which estimates the standard deviation σ. (c) Who? The individuals are kelp perch. What? The response variable is the proportion of perch killed and the explanatory variable is the number of perch available (or in the pen); both variables are quantitative. Why? The researcher was interested in examining the relationship between predators and available prey. When, where, how, and by whom? Todd Anderson published the data, obtained from the ocean floor off the coast of southern California, in 2001. Graphs: The scatterplot provided clearly shows that the proportion of perch killed increases as the number of perch increases. Numerical summaries: The mean proportions of perch killed are 0.175, 0.283, 0.425, and 0.646, in order from smallest to largest number of perch available. Model: The least-squares regression model is provided in part (a). Interpretation: The data clearly support the predator-prey principle provided. (Students will soon learn how to formally test this hypothesis.) (d) Using df = 16 − 2 = 14 and t* = 2.145, a 95% confidence interval for β is 0.008569 ± 2.145 × 0.002456 = (0.0033, 0.0138). We are 95% confident that the proportion of perch killed increases on average by between 0.0033 and 0.0138 for each additional perch added to the pen.

15.9 The regression equation is ŷ = 560.65 − 3.0771x, where y = calories and x = time. The scatterplot with regression line (below) shows that the longer a child remains at the table, the fewer calories he or she will consume. The conditions for inference are satisfied. Using df = 18 and t* = 2.101, a 95% confidence interval for β is −3.0771 ± 2.101 × 0.8498 = (−4.8625, −1.2917). With 95% confidence, we estimate that for every extra minute a child sits at the table, he or she will consume on average between 1.29 and 4.86 fewer calories during lunch.
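The interval arithmetic in Exercises 15.8(d) and 15.9 follows one pattern, b ± t* × SEb. A minimal Python sketch of that calculation is given here as a check on the numbers. The function name slope_confidence_interval is a hypothetical helper, not something from the text, and the critical value t* is read from a t table and passed in rather than computed.

```python
def slope_confidence_interval(b, se_b, t_star):
    """Return (lower, upper) for the interval b ± t* × SE_b."""
    margin = t_star * se_b
    return b - margin, b + margin

# Exercise 15.8(d): df = 14, t* = 2.145 (values quoted in the solution)
perch = slope_confidence_interval(0.008569, 0.002456, 2.145)

# Exercise 15.9: df = 18, t* = 2.101 (values quoted in the solution)
calories = slope_confidence_interval(-3.0771, 0.8498, 2.101)

print([round(v, 4) for v in perch])     # [0.0033, 0.0138]
print([round(v, 4) for v in calories])  # [-4.8625, -1.2917]
```

Both results agree with the intervals stated in the solutions.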
[Figure: scatterplot of calories (average number) versus time (average number of minutes) with the regression line]

15.10 (a) Excel’s 95% confidence interval for β is (0.0033, 0.0138). This matches the confidence interval calculated in Exercise 15.8. We are 95% confident that the proportion of perch killed increases on average by between 0.0033 and 0.0138 for each additional perch added to the pen. (b) See Exercise 15.8 part (d) for a verification using the Minitab output. Using df = 16 − 2 = 14 and t* = 2.145 with the Excel output, a 95% confidence interval for β is 0.0086 ± 2.145 × 0.0025 = (0.0032, 0.0140). (c) Using df = 16 − 2 = 14 and t* = 1.761, a 90% confidence interval for β is 0.0086 ± 1.761 × 0.0025 = (0.0042, 0.0130).

15.11 (a) The least-squares regression line from the S-PLUS output is ŷ = −3.6596 + 1.1969x, where y = humerus length and x = femur length. (b) The test statistic is t = b/SEb = 1.1969/0.0751 = 15.9374. (c) The test statistic t has df = 5 − 2 = 3. The largest value in Table D is 12.92. Since 15.9374 > 12.92, we know that P-value < 0.0005. (d) There is very strong evidence that β > 0, that is, the line is useful for predicting the length of the humerus given the length of the femur. (e) Using df = 3 and t* = 5.841, a 99% confidence interval for β is 1.1969 ± 5.841 × 0.0751 = (0.7582, 1.6356). We are 99% confident that for every extra centimeter in femur length, the length of the humerus will increase on average by between 0.7582 cm and 1.6356 cm.

15.12 (a) The value of r² = 0.998 or 99.8% is very close to one (or 100%), which indicates an almost perfect linear association. (b) The slope parameter β gives this rate. Using df = 5 and t* = 4.032, a 99% confidence interval for β is 0.080284 ± 4.032 × 0.0016 = (0.0738, 0.0867). We are 99% confident that, for each 1 ft/s increase in running speed, the step rate increases on average by between 0.0738 and 0.0867 steps per second.
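The test in Exercise 15.11 can be traced step by step: compute t = b/SEb, take df = n − 2, and compare t with the largest critical value in the table. The Python sketch below does this with the numbers quoted in the solution. The helper name slope_t_statistic is illustrative only, and 12.92 is simply the Table D entry cited above, not a value computed here.

```python
def slope_t_statistic(b, se_b, n):
    """Return (t, df) for testing H0: beta = 0 in simple linear regression."""
    return b / se_b, n - 2

# Exercise 15.11: b and SE_b from the S-PLUS output, n = 5 specimens
t, df = slope_t_statistic(1.1969, 0.0751, 5)

# Largest df = 3 critical value in Table D, as cited in the solution
largest_table_value = 12.92

print(round(t, 4), df)          # 15.9374 3
print(t > largest_table_value)  # True, so the one-sided P-value is below 0.0005
```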
15.13 (a) The scatterplot (below) with the regression line shows a strong, positive linear association between the number of jet skis in use (explanatory variable) and the number of accidents (response variable). (b) We want to test H0: β = 0 (there is no association between number of jet skis in use and number of accidents) versus Ha: β > 0 (there is a positive association between number of jet skis in use and number of accidents). (c) The conditions are independence of the observations, a linear relationship between the mean number of accidents and the number of jet skis in use, the same standard deviation for each number of jet skis in use, and a Normal distribution for the number of accidents. The conditions are satisfied except for having independent observations, so we will proceed with caution. (d) LinRegTTest reports that t = 21.079 with df = 8 and a P-value of 0.000. With the earlier caveat, there is very strong evidence to reject H0 and conclude that there is a significant positive association between number of accidents and number of jet skis in use. As the number of jet skis in use increases, the number of accidents significantly increases. (e) Using df = 8 and t* = 2.896, a 98% confidence interval for β is 0.0048 ± 2.896 × 0.0002 = (0.0042, 0.0054). With 98% confidence, we estimate that for every extra thousand jet skis in use, the number of accidents increases by a mean of between 4.2 and 5.4 per year.

[Figure: scatterplot of number of accidents versus number of jet skis in use with the regression line]

15.14 (a) We want to test H0: β = 0 (there is no association between yearly consumption of wine and deaths from heart disease) versus Ha: β < 0 (there is a negative association between yearly consumption of wine and deaths from heart disease). The data are obtained from different nations, so independence seems reasonable.
The other conditions of constant variance, linear relationship, and Normality are also satisfied. The test statistic is t = −22.969/3.557 = −6.46 with df = 17 and P-value < 0.0005. Since the P-value is smaller than any reasonable significance level, say 1%, we reject H0. We have very strong evidence of a significant negative association between the consumption of wine and deaths from heart disease. (b) Using df = 17 and t* = 2.110, a 95% confidence interval for β is −22.969 ± 2.110 × 3.557 = (−30.4743, −15.4637). With 95% confidence, we estimate that the number of deaths from heart disease (per 100,000 people) decreases on average by between 15.46 and 30.47 for each additional liter of wine consumed (per person).

[Figure: scatterplot of deaths from heart disease (per 100,000 people) versus wine consumption (liters per person)]

15.15 (a) The scatterplot below shows a moderately strong, positive linear association between y = number of beetle larvae clusters and x = number of beaver-caused stumps. (b) The least-squares regression line is ŷ = −1.286 + 11.894x. r² = 83.9%, so regression on stump counts explains 83.9% of the variation in the number of beetle larvae. (c) We want to test H0: β = 0 versus Ha: β ≠ 0. The conditions for inference are met, and the test statistic is t = 10.47 with df = 21. The output shows P-value = 0.000, so we have very strong evidence that beaver stump counts help explain beetle larvae counts.

[Figure: scatterplot of number of beetle larvae clusters versus number of beaver-caused stumps]

15.16 (a) The mean of the standardized residuals is 0.00174 and the standard deviation is 1.014. Since the residuals are standardized, we expect the mean and standard deviation to be close to 0 and 1, respectively. (b) A stemplot is shown below on the left. The distribution is slightly skewed to the left, but this is not unusual for a small data set. There are no striking departures from Normality.
For a standard Normal distribution, we would expect 95% of the observations to fall between −2.0 and 2.0. Thus, −1.99 is quite reasonable. (c) The residual plot on the right below shows no obvious patterns.

Stem-and-leaf of Residuals   N = 23
Leaf Unit = 0.10

  3   -1  965
  5   -1  30
  6   -0  7
 10   -0  4422
 (4)   0  0224
  9    0  56789
  4    1  2233

[Figure: plot of standardized residuals versus number of beaver-caused stumps]

CASE CLOSED!
(1) Descriptive statistics for x = number of three-point shots taken and y = percent made are shown below. The average number of three-point shots taken per game is 15.684 and the standard deviation is 2.865. The average percent of three-point shots made per game is 35.379 and the standard deviation is 1.425. The correlation is r = −0.958 and the scatterplot below shows a negative association between these two variables. Notice that the cluster of points in the bottom right corner shows some positive association, but the overall association between x and y is clearly negative.

Variable   N   Mean    StDev  Minimum  Q1      Median  Q3      Maximum
Taken      19  15.684  2.865   9.200   13.800  17.100  17.700  18.300
Percent    19  35.379  1.425  34.100   34.400  34.600  36.200  38.400

[Figure: scatterplot of percent of 3-pointers made versus number of 3-pointers taken]

(2) The least-squares regression line is ŷ = 42.8477 − 0.4762x with r² = 0.917 or 91.7%. The linear model provides a reasonably good fit for these data. However, the residual plot shows a clear pattern with positive residuals for small and large numbers of 3-pointers taken and negative residuals in between the two extremes.
(3) The point is tagged as being influential because it may have a considerable impact on the regression line. Influential points often pull the regression line in their direction, so the residuals tend to be small for influential points.
(4) We want to test H0: β = 0 versus Ha: β ≠ 0. Independence is reasonable because the data are from different seasons.
The linear relationship condition is met, but the constant variance condition and the Normality condition are both questionable, so we will proceed with caution. A histogram of the percent made (below) shows that the distribution is skewed to the right. The test statistic is t = −13.7 with df = 17 and P-value = 0.000. We have very strong evidence of a significant association between the number of three-pointers taken and the percent made.

[Figure: histogram of percent of three-pointers made]

(5) Using df = 17 and t* = 2.110, a 95% confidence interval for β is −0.4762 ± 2.110 × 0.03475 = (−0.5495, −0.4029). With 95% confidence, we estimate that for every additional three-pointer taken, the percent made will decrease on average by between 0.40 and 0.55 percentage points.

15.17 Regression of fuel consumption on speed gives b = −0.01466, SEb = 0.02334, and t = −0.63 with df = 13 and P-value = 0.541. Thus, we have no evidence to suggest a straight-line relationship between speed and fuel use. The scatterplot below shows a strong relationship between speed and fuel use, but the relationship is not linear. See Exercise 3.9 for more details.

[Figure: scatterplot of fuel consumption versus speed (km/h)]

15.18 Repeated measurements of Sarah’s height are clearly not independent.

15.19 (a) The slope β tells us the mean change in the percent of forest lost for a 1 unit (1 cent per pound) increase in the price of coffee. The estimate of β is b = 0.05525 and the estimate of α is a = −1.0134. (b) This says that the straight-line relationship described by the least-squares line is very strong. r² = 0.907, or about 91%, indicates that about 91% of the total variation in the percent of forest lost is accounted for by the straight-line relationship with prices paid to coffee growers. (c) The P-value refers to the two-sided alternative: H0: β = 0 versus Ha: β ≠ 0.
The small P-value indicates that we have very strong evidence of a significant association between the percent of forest lost and the price paid for coffee. (d) The residuals are −0.0988, 0.3934, −0.2800, −0.2053, and 0.1907, and their sum is 0. The standard deviation σ is estimated by s = √(0.3215/3) = 0.3274. (e) A scatterplot (on the left) and a residual plot (on the right) are shown below. Even though the number of observations is small, there are no obvious problems with the linear regression model. Coffee price appears to be a very good predictor of forest lost for this range of values.

[Figure: scatterplot of forest lost (percent) versus price (cents per pound) (left) and residual plot of residuals versus price (cents per pound) (right)]

15.20 (a) The scatterplot below, with the regression line ŷ = 70.436874 + 274.7821x, shows a moderate, positive, linear association. The linear relationship explains r² = 0.493 or 49.3% of the variation in gate velocity. (b) We want to test H0: β = 0 versus Ha: β ≠ 0. The test statistic is t = 274.7821/88.17712 = 3.1163 with df = 10 and P-value = 0.011. (Table C indicates that 0.01 < P-value < 0.02.) Since the P-value < 0.05, we reject H0 and conclude that there is a significant linear relationship between thickness and gate velocity. The regression formula might be used as a rule of thumb for new workers to follow, but the wide spread in the scatterplot below suggests that there may be other factors that should be taken into account in choosing the gate velocity.

[Figure: scatterplot of gate velocity (ft/sec) versus cylinder wall thickness (inches) with the regression line]

15.21 (a) A scatterplot with the regression line is shown below. r² = 0.992 or 99.2%. (b) The estimates of α, β, and σ are a = −2.3948 cm, b = 0.1585 cm/min, and s = 0.8059 cm. (c) The least-squares regression line is ŷ = −2.3948 + 0.1585x, where y = length and x = time.
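The estimate s in Exercise 15.19(d) is the square root of the sum of squared residuals divided by n − 2. As a rough check, the Python sketch below recomputes it from the five rounded residuals listed there. The name regression_standard_error is a hypothetical helper, and because the residuals are rounded to four decimal places the result agrees with the stated 0.3274 only to about three decimal places.

```python
import math

def regression_standard_error(residuals):
    """Estimate sigma by s = sqrt(sum of squared residuals / (n - 2))."""
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))

# Rounded residuals quoted in Exercise 15.19(d)
coffee_residuals = [-0.0988, 0.3934, -0.2800, -0.2053, 0.1907]
residual_sum = sum(coffee_residuals)
s = regression_standard_error(coffee_residuals)

print(abs(residual_sum) < 1e-9)  # least-squares residuals sum to 0
print(round(s, 3))
```

The same helper applies to the residuals in Exercises 15.5(c) and 15.6(c), where n − 2 is 3 and 5 respectively.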
[Figure: scatterplot of length (cm) versus time (min) with the regression line]

15.22 (a) A scatterplot with the least-squares regression line ŷ = 3.5051 − 0.0034x is shown below. We want to test H0: β = 0 versus Ha: β < 0. The test statistic is t = −4.64 with df = 14 and P-value < 0.0005. We have very strong evidence that people with higher NEA gain less fat. (b) To find this interval, we need SEb, which is given in the Minitab output below as 0.0007414. Using df = 14 and t* = 1.761, a 90% confidence interval for β is −0.00344 ± 1.761 × 0.0007414 = (−0.0047, −0.0021).

[Figure: scatterplot of fat gain (kg) versus NEA change (cal) with the regression line]

The regression equation is
Fat gain (kg) = 3.51 - 0.00344 NEA change (cal)

Predictor         Coef        SE Coef    T      P
Constant          3.5051      0.3036     11.54  0.000
NEA change (cal)  -0.0034415  0.0007414  -4.64  0.000

S = 0.739853   R-Sq = 60.6%   R-Sq(adj) = 57.8%

15.23 (a) A scatterplot is shown below. There is a moderate, positive, linear association between investment returns in the U.S. and investments overseas. (b) The test statistic is t = b/SEb = 0.6181/0.2369 = 2.6091 with df = 25 and 0.01 < P-value < 0.02. Thus, we have fairly strong evidence that there is a significant linear relationship between the two returns. That is, the slope is nonzero. (c) r² = 0.214 or 21.4%, so only 21.4% of the variation in the overseas returns is explained by using linear regression with U.S. returns as the explanatory variable. Using this linear regression model for prediction will not be very useful in practice.

[Figure: scatterplot of overseas return (%) versus U.S. return (%)]

15.24 (a) The residual plot (below on the left) shows that the variability about the regression line increases as the U.S. return increases. (b) The histogram (below on the right) indicates that the distribution of the residuals is skewed to the right. The outlier is from 1986, when the overseas return was much higher than our regression model predicts.
[Figure: residual plot of residuals versus U.S. return (%) (left) and histogram of the residuals (right)]

15.25 (a) The scatterplot below (on the left) shows a weak, negative association between corn yield and weeds. The least-squares regression line is ŷ = 166.483 − 1.0987x, where y = corn yield (bushels per acre) and x = weeds (per meter). r² = 0.209 or 20.9%, so the linear relationship explains about 20.9% of the variation in yield. (b) The t statistic for testing H0: β = 0 versus Ha: β < 0 is t = −1.92 with df = 14 and P-value = 0.0375. Since 0.0375 < 0.05, there is sufficient evidence to conclude that more weeds reduce corn yields. (c) The small number of observations for each value of the explanatory variable (weeds/meter), the large variability in those observations, and the small value of r² will make prediction with this model imprecise. A residual plot below (on the right) also shows that the linear model is quite imprecise.

[Figure: scatterplot of corn yield (bushels per acre) versus weeds (per meter) (left) and residual plot of residuals versus weeds per meter (right)]

15.26 Using df = 21 and t* = 1.721, a 90% confidence interval for β is −9.6949 ± 1.721 × 1.8887 = (−12.9454, −6.4444). With 90% confidence, we estimate that for each one-minute increase in time (a slower, more leisurely swim) the professor’s pulse will drop on average by between 6 and 13 beats per minute. There is a negative relationship between the professor’s swimming time and heart rate. A scatterplot is shown below.

[Figure: scatterplot of pulse (beats per minute) versus time (in minutes)]
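The one-sided P-value reported in Exercise 15.25(b) comes from software output. For readers who want to see where such a number comes from, the Python sketch below approximates the tail area of a t distribution by numerical integration. This is an illustration under stated assumptions, not the method used by the text's software: the helper name t_upper_tail, the change of variable, and the use of Simpson's rule are all choices made here, and the input is the rounded statistic t = −1.92, so the result is close to, but not exactly, the reported 0.0375.

```python
import math

def t_upper_tail(t0, df, steps=2000):
    """Approximate P(T > t0) for a t distribution with df degrees of freedom (t0 >= 0).

    The substitution t = sqrt(df) * tan(theta) turns the tail area into
    a constant times the integral of cos(theta)**(df - 1) over
    [atan(t0 / sqrt(df)), pi/2], evaluated here with Simpson's rule.
    """
    const = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
    a, b = math.atan(t0 / math.sqrt(df)), math.pi / 2
    h = (b - a) / steps  # steps must be even for Simpson's rule
    total = math.cos(a) ** (df - 1) + math.cos(b) ** (df - 1)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * math.cos(a + i * h) ** (df - 1)
    return const * total * h / 3

# Exercise 15.25(b): Ha is beta < 0, so the P-value is P(T < -1.92).
# By symmetry of the t distribution this equals P(T > 1.92).
p_value = t_upper_tail(1.92, 14)
print(round(p_value, 3))

# Sanity check against the table value used in Exercise 15.8(d):
# t* = 2.145 with df = 14 should leave about 0.025 in the upper tail.
print(round(t_upper_tail(2.145, 14), 3))
```

Checking the helper against a known table entry, as in the last line, is a reasonable way to gain confidence in a hand-rolled approximation before relying on it.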