Chapter 5

Exercises 5.1 – 5.16

5.1
a. A positive correlation would be expected, since as temperature increases, cooling costs would also increase.
b. A negative correlation would be expected, since as interest rates climb, fewer people would submit applications for loans.
c. A positive correlation would be expected, since husbands and wives tend to have jobs in similar or related classifications. That is, a spouse would be reluctant to take a low-paying job if the other spouse had a high-paying job.
d. No correlation would be expected, because people with a particular I.Q. level would have heights ranging from short to tall.
e. A positive correlation would be expected, since people who are taller tend to have larger feet and people who are shorter tend to have smaller feet.
f. A weak to moderate positive correlation would be expected. There are some who do well on both, some who do poorly on both, and some who do well on one but not the other. It is perhaps the case that those who score similarly on both tests outnumber those who don't.
g. A negative correlation would be expected, since there is a fixed amount of time, and as time spent on homework increases, time spent watching television would decrease.
h. No correlation overall, because yield would be small for both small and substantially large amounts of fertilizer.

5.2 The statement is incorrect. The correlation coefficient measures the extent to which x and y are linearly related. They may have a strong nonlinear relationship and yet have a correlation of zero.

5.3

5.4 Association ≠ Causation. For example, it could be age, or the amount they entertain, or even the age of their children that has a more important effect on their drinking habits than the amount they earn.

5.5
    Sugar consumption x   Depression rate y     z_x       z_y     z_x z_y
           150                  2.3           -1.726    -1.470     2.537
           300                  3.0            -.369     -.947      .349
           350                  4.4             .083      .099      .008
           375                  5.0             .309      .548      .169
           390                  5.2             .444      .697      .309
           480                  5.7            1.259     1.071     1.348
                                                            Sum    4.720
a.
\[ r = \frac{\sum z_x z_y}{n - 1} = \frac{4.720}{5} = 0.944 \]
The correlation is strong and positive.
b. Increasing sugar consumption does not necessarily cause or lead to higher rates of depression; another factor may cause an increase in both. For instance, high sugar consumption may indicate a need for comfort food for a reason that also causes depression.
c. These countries may not be representative of other countries. It may be that only these countries have a strong positive correlation between sugar consumption and depression rate, and other countries may have a different type of relationship between these factors. It is therefore not a good idea to generalize these results to other countries.

5.6
a.
    Inpatient, x   Outpatient, y     z_x       z_y     z_x z_y
         80             62          0.258     0.246     0.064
         76             66          0.022     0.663     0.014
         75             63         -0.038     0.351    -0.013
         62             51         -0.806    -0.900     0.726
        100             54          1.441    -0.587    -0.846
        100             75          1.441     1.601     2.307
         88             65          0.731     0.559     0.409
         64             56         -0.688    -0.379     0.261
         50             45         -1.516    -1.525     2.312
         54             48         -1.279    -1.213     1.552
         83             71          0.435     1.184     0.516
                                                 Sum    7.30
\[ r = \frac{\sum z_x z_y}{n - 1} = \frac{7.30}{10} = 0.73 \]
[Scatterplot: Outpatient, y versus Inpatient, x]
There appears to be a reasonably strong positive linear relationship between the cost-to-charge ratio for inpatient and outpatient services at these Oregon hospitals.
b. There is one hospital, Harney District, that has a lower outpatient cost-to-charge ratio or higher inpatient cost-to-charge ratio than the other ten hospitals.
c. If this observation were removed, the remaining points would all be much closer to a line, and so the correlation coefficient would be greater. The relationship would be stronger.

5.7 The correlation coefficient between household debt and corporate debt would be positive, and the relationship would be strong. As the household debt increases, the corporate debt increases at a similar rate; this can clearly be seen on the graph from the constant width between the two lines.

5.8
a. r = 0.1178
b.
[Scatterplot: consumer debt versus household debt]
The scatterplot supports the correlation coefficient, which indicates a very weak or no linear relationship between consumer and household debts.

5.10
a. r = 0.335. There is a weak positive relationship between timber sales and the number of acres burned in forest fires.
b. No. Correlation does not imply a causal relationship.

5.9
a. [Scatterplot: "Comparing Heart Rate Responses to Two Exercise Tests", 300 yd run versus Shuttle]
There does not appear to be any linear relationship between peak heart rate during a shuttle run and peak heart rate during a 300 yard run.
b. n = 10, Σx = 1788, Σy = 1898, Σx² = 320400, Σy² = 360628, Σxy = 339472
\[ S_{xy} = 339472 - \frac{(1788)(1898)}{10} = 109.6 \]
\[ S_{xx} = 320400 - \frac{(1788)^2}{10} = 705.6 \]
\[ S_{yy} = 360628 - \frac{(1898)^2}{10} = 387.6 \]
\[ r = \frac{109.6}{\sqrt{705.6}\,\sqrt{387.6}} = .2096 \]
The value of .2096 suggests at best a very weak linear relationship between the two variables. The conclusion is consistent with the one of part a.
c. The value of r does not depend on which variable is labeled x, so switching the labels will not change the value of r.

5.12
a. Since the points tend to be close to the line, it appears that x and y are strongly correlated in a positive way.
b. An r value of .9366 indicates a strong positive linear relationship between x and y.
c. If x and y were perfectly correlated with r = 1, then each point would lie exactly on a line. The line would not necessarily have slope 1 and intercept 0.

5.11
a. [Scatterplot: Exam Score versus Test Anxiety]
There are several observations that have identical x-values yet different y-values. Thus, the value of y is not determined solely by x, but also by various other factors. There is one data point that is far removed from the remaining data points. The plot seems to indicate there may be a tendency for exam scores to decrease as test anxiety increases.
b. There appears to be a linear relationship between x and y.
The scatter plot shows a tendency for y to decrease as x increases. That is, as test anxiety increases, exam scores decrease. The relationship may be characterized as a moderate negative relationship.
c.
      x      x²      y      y²      xy
     23     529     43    1849     989
     14     196     59    3481     826
     14     196     48    2304     672
      0       0     77    5929       0
      7      49     50    2500     350
     20     400     52    2704    1040
     20     400     46    2116     920
     15     225     51    2601     765
     21     441     51    2601    1071
    Σx = 134, Σx² = 2436, Σy = 477, Σy² = 26085, Σxy = 6633
\[ S_{xx} = 2436 - \frac{(134)(134)}{9} = 2436 - 1995.11 = 440.89 \]
\[ S_{yy} = 26085 - \frac{(477)(477)}{9} = 26085 - 25281 = 804 \]
\[ S_{xy} = 6633 - \frac{(134)(477)}{9} = 6633 - 7102 = -469 \]
\[ r = \frac{-469}{\sqrt{440.89}\,\sqrt{804}} = \frac{-469}{(21)(28.35)} = -0.7878 \]
r = -0.7878 indicates a moderate negative linear relationship between test anxiety and exam score.
d. Correlation measures the extent of association, but association does not imply causation. It is possible that the two variables are not causally related but they may both be related to a third variable.

5.13
\[ r = \frac{27918 - \frac{(9620)(7436)}{2600}}{\sqrt{36168 - \frac{(9620)(9620)}{2600}}\,\sqrt{23145 - \frac{(7436)(7436)}{2600}}} = \frac{27918 - 27513.2}{\sqrt{574}\,\sqrt{1878.04}} = \frac{404.8}{(23.9583)(43.336)} = .3899 \]
There is a weak positive linear relationship between high school GPA and first-year college GPA.

5.14 No, because x, artist, is not a numerical variable.

5.15 No. An r value of −0.085 indicates an extremely weak relationship between support for environmental spending and degree of belief in God.

5.16 The sample correlation coefficient would be closest to −.9. This is because there is an almost perfect negative linear relationship between speed and time required to travel a fixed distance. That is, as speed increases, time required to traverse the fixed distance decreases.

Exercises 5.17 – 5.35

5.17
a. There is a weak negative association between pollution and the cost of medical care.
b.
x = Pollution and y = cost; Σx = 191.1, Σx² = 6184.05, Σxy = 177807, Σy = 5597, n = 6, $\bar{x}$ = 31.85, $\bar{y}$ = 932.833
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 177807 - \frac{(191.1)(5597)}{6} = -457.45 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 6184.05 - \frac{(191.1)^2}{6} = 97.515 \]
The slope, b = S_xy / S_xx = −457.45 / 97.515 = −4.69
The intercept, a = $\bar{y} - b\bar{x}$ = 932.833 − (−4.69)(31.85) = 1082.21
The equation: ŷ = 1082.21 - 4.69x
c. The slope is negative, consistent with the description in part a.
d. No. The negative slope means that, in this sample, areas with more pollution tended to have lower medical costs, so the scatterplot and the least squares line do not support the conclusion that elderly people who live in more polluted areas have higher medical costs. In any case, care must be taken not to state that pollution causes the medical costs, or even that the medical costs cause the pollution!

5.18
a. [Scatterplot: % who would buy lottery ticket versus Grade]
There appears to be a positive linear relationship between x, the grade level, and y, the percentage who said they were more likely to purchase a lottery ticket.
b. x = Grade and y = % bought; Σx = 36, Σx² = 344, Σxy = 2318.2, Σy = 237.4, n = 4, $\bar{x}$ = 9, $\bar{y}$ = 59.35
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 2318.2 - \frac{(36)(237.4)}{4} = 181.6 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 344 - \frac{(36)^2}{4} = 20 \]
The slope, b = S_xy / S_xx = 181.6 / 20 = 9.08
The intercept, a = $\bar{y} - b\bar{x}$ = 59.35 − (9.08)(9) = −22.37
The equation: ŷ = -22.37 + 9.08x

5.20
a. The dependent variable is the number of fruit and vegetable servings per day. The independent variable is the number of hours of TV viewed per day.
b. Negative, because as the number of hours of TV watched increases, the number of servings of fruit and vegetables decreases.

5.21
a. For lower values of the patient-to-nurse ratio, nurse job satisfaction might be low because, up to a point, the more patients a nurse has to look after, the more interesting the job would become. After a certain number, however, the job would get difficult to do well and might get frustrating. The relationship might be nonlinear.
b.
Patient satisfaction is probably related to the amount of attention received. The higher the patient-to-nurse ratio, the less personal attention would be received, so the relationship would be negative.
c. In an ideal world, there would be no relationship between patient-to-nurse ratio and quality of patient care; it should be excellent, no matter how many patients each nurse has to care for! However, quality of care probably declines as the number of patients a nurse must care for increases. The relationship would be negative.

5.19
a. [Scatterplot: "Head circumference z-score vs. volume of cerebral grey matter"; Cerebral Grey Matter (ml), 2-5 y, versus Head circumference z-scores]
b. n = 18, Σx = 24.35, Σy = 13890, Σx² = 49.6775, Σy² = 10767400, Σxy = 19501.75, $\bar{x}$ = 1.3528, $\bar{y}$ = 771.667
\[ S_{xy} = 19501.75 - \frac{(24.35)(13890)}{18} = 711.667 \]
\[ S_{xx} = 49.6775 - \frac{(24.35)^2}{18} = 16.737 \]
\[ S_{yy} = 10767400 - \frac{13890^2}{18} = 48950 \]
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} = \frac{711.667}{\sqrt{16.737}\,\sqrt{48950}} = .7863 \]
c. The slope, b = S_xy / S_xx = 711.667 / 16.737 = 42.521
The intercept, a = $\bar{y} - b\bar{x}$ = 771.667 − (42.521)(1.3528) = 714.145
The equation: ŷ = 714.145 + 42.521x
d. When head circumference z-score is 1.8, the predicted volume of grey matter is 714.145 + 42.521(1.8) = 790.68 ml.
e. The least-squares line was calculated using z-scores between -0.75 and 2.8 and is therefore only valid for values in this range. We don't know if the relationship between cerebral grey matter and head circumference z-score remains the same outside these values, so this equation cannot be used for prediction outside that range.

5.22
a. The value of the y-intercept of the line ŷ = -147 + 6.175x is –147. The value of the slope is 6.175. This means that each unit increase in x is associated with an increase of 6.175 in y, on average. Thus, each 1 cm increase in snout-vent length is associated with a 6.175 increase in clutch size, on average.
b.
This least squares line should not be used to predict clutch size of a salamander with a snout-vent length of 22 cm, because a 22 cm snout-vent length is outside the 30-70 cm snout-vent range of the data set by quite a bit.

5.23
a. There is a moderately strong positive linear relationship between the percentage of public schools that were at or above the proficient level in math in 4th and 8th grade in the 8 states.
b. x = 4th grade and y = 8th grade; Σx = 140, Σx² = 2586, Σxy = 3497, Σy = 188, n = 8, $\bar{x}$ = 17.5, $\bar{y}$ = 23.5
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 3497 - \frac{(140)(188)}{8} = 207 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 2586 - \frac{(140)^2}{8} = 136 \]
The slope, b = S_xy / S_xx = 207 / 136 = 1.522
The intercept, a = $\bar{y} - b\bar{x}$ = 23.5 − 1.522(17.5) = −3.135
The equation: ŷ = -3.135 + 1.522x
c. Predicted 8th grade = -3.135 + 1.522(4th grade percent) ⇒ -3.135 + 1.522(14) = 18 (rounded to nearest integer). This is 2 percentage points lower than the actual 8th grade value of 20 for Nevada.

5.24
a. $\bar{y}$ = 7436/2600 = 2.86 and $\bar{x}$ = 9620/2600 = 3.7
\[ b = \frac{27918 - \frac{(9620)(7436)}{2600}}{36168 - \frac{(9620)(9620)}{2600}} = \frac{404.8}{574} = 0.7052 \]
a = 2.86 – 0.7052(3.7) = 0.2508
Therefore, the equation of the least squares regression line is ŷ = 0.2508 + 0.7052x
b. The equation ŷ = 0.2508 + 0.7052x has slope b = 0.7052, so each one unit increase in x is associated with an increase of 0.7052 in y, on average. This means that, on average, each 1.0 increase in high school GPA is associated with an increase of 0.7052 in first year college GPA.
c. For x = 4.0, ŷ = 0.2508 + 0.7052(4.0) = 0.2508 + 2.8208 = 3.0716

5.25
a. There appears to be a negative linear association between carbonation depth and the strength of concrete for a sample of core specimens.
b.
Σx = 323, Σx² = 14339, Σxy = 3939.9, Σy = 130.8, n = 9, $\bar{x}$ = 35.889, $\bar{y}$ = 14.533
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 3939.9 - \frac{(323)(130.8)}{9} = -754.367 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 14339 - \frac{(323)^2}{9} = 2746.889 \]
The slope, b = S_xy / S_xx = −754.367 / 2746.889 = −0.275
The intercept, a = $\bar{y} - b\bar{x}$ = 14.533 − (−0.275)(35.889) = 24.40
The equation: ŷ = 24.4 – 0.275x
c. When depth is 25, predicted strength = 24.4 – 0.275(25) = 17.5
d. The least squares line was calculated using values of "depth" between 8 mm and 65 mm and is therefore only valid for values in this range. We don't know if the relationship between depth and strength remains the same outside these values, so this equation cannot be used there. A depth of 100 mm is clearly outside these values, and it would be unreasonable to use this equation to predict strength.

5.26 It certainly seems that the sooner the paramedics get there, the higher your chances of survival. The slope of the least squares line is –9.30, which means that for every extra minute, on average, the survival rate decreases by 9.30%.

5.27 The slope is the average increase in the y variable for an increase of one unit in the x variable. Because the home prices (y variable) dropped by an average of $4000 (-4000) for every (1) mile (x variable) from the Bay area, the slope is -4000/1 = -4000.

5.28
a. r = 0.70. There is a moderately strong positive linear relationship between sale price and property size.
b. r = -0.333. There is a very weak negative linear relationship (if any!) between sale price and land/building ratio.
c. I would use size, as its correlation coefficient (r) is much closer to 1 in absolute value.
d.
Using x = size and y = sale price: Σx = 16603, Σx² = 40097671, Σxy = 232691.5, Σy = 100.5, n = 10, $\bar{x}$ = 1660.3, $\bar{y}$ = 10.05
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 232691.5 - \frac{(16603)(100.5)}{10} = 65831.35 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 40097671 - \frac{(16603)^2}{10} = 12531710.1 \]
The slope, b = S_xy / S_xx = 65831.35 / 12531710.1 = 0.00525
The intercept, a = $\bar{y} - b\bar{x}$ = 10.05 − 0.00525(1660.3) = 1.333
The equation: ŷ = 1.333 + 0.00525x

5.29
a. Σx = 240, Σx² = 6750, Σxy = 199750, Σy = 7250, n = 11, $\bar{x}$ = 21.818, $\bar{y}$ = 659.091
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 199750 - \frac{(240)(7250)}{11} = 41568.182 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 6750 - \frac{(240)^2}{11} = 1513.636 \]
The slope, b = S_xy / S_xx = 41568.182 / 1513.636 = 27.462
The intercept, a = $\bar{y} - b\bar{x}$ = 659.091 − 27.462(21.818) = 59.925
The equation: ŷ = 59.925 + 27.462x
b. Concentration with 18% bare ground: 59.925 + 27.462(18) = 554 (to nearest integer)
c. No, because the data used to obtain the least squares equation were from steeply sloped plots, so it would not make sense to use it to predict runoff sediment from gradually sloped plots. You would need to use data from gradually sloped plots to create a least squares regression equation to predict runoff sediment from gradually sloped plots.

5.30
a. slope = 244.9, intercept = −275.1
b. 244.9
c. ŷ = −275.1 + 244.9(2) = −275.1 + 489.8 = 214.7
d. No. When shell height (x) equals 1, the equation would result in a predicted breaking strength of −275.1 + 244.9(1) = −30.2. It is impossible for breaking strength to be a negative value, so the equation results in a predicted value which is not meaningful.

5.31
a. The graph reveals a moderate linear relationship between x and y.
b.
\[ b = \frac{6933.48 - \frac{(1368.1)(80.9)}{16}}{117123.85 - \frac{(1368.1)^2}{16}} = \frac{6933.48 - 6917.456}{117123.85 - 116981.101} = \frac{16.0244}{142.7494} = 0.1123 \]
\[ a = \frac{80.9}{16} - 0.1123\left(\frac{1368.1}{16}\right) = 5.0563 - 0.1123(85.5063) = 5.0563 - 9.6024 = -4.5461 \]
c. The change in vital capacity associated with a 1 cm. increase in chest circumference is .1123.
The change in vital capacity associated with a 10 cm. increase in chest circumference is 10(.1123) = 1.123.
d. ŷ = −4.5461 + .1123(85) = 4.9994
e. No; this is shown by the fact that there are two data points in the data set whose x values are 81.8, but these data points have different y values.

5.32 It is dangerous to use the least squares line to obtain predictions for x-values outside the range of those contained in the sample, because there is no information in the sample about the relationship that exists between x and y beyond the range of the data. The relationship may be the same or it may change substantially. There is no data to support a conclusion either way.

5.33
a. ŷ = 100 + .75(s_y)(2) = 100 + 1.5(s_y). That person's annual sales would be 1.5 standard deviations above the mean sales of 100.
b.
\[ \hat{y} - \bar{y} = r\,\frac{s_y}{s_x}(x - \bar{x}), \quad \text{which implies} \quad \frac{\hat{y} - \bar{y}}{s_y} = r\,\frac{x - \bar{x}}{s_x}. \]
Hence, −1.0 = r(−1.5), which implies r = −1.0 / −1.5 = .67.

5.34 The denominators of b and of r are always positive numbers. The numerator of b and r is $\sum (x - \bar{x})(y - \bar{y})$. Since both b and r have the same numerator and positive denominators, they will always have the same sign.

5.35
a. ŷ = −424.7 + 3.891x
b. Let y' = cy. Then $\bar{y}' = c\bar{y}$.
\[ b' = \frac{\sum (x - \bar{x})(cy - c\bar{y})}{\sum (x - \bar{x})^2} = c\,\frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = cb \]
\[ a' = c\bar{y} - cb\bar{x} = c(\bar{y} - b\bar{x}) = ca \]
Both the slope and the y intercept are changed by the multiplicative factor c. Thus, the new least squares line is the original least squares line multiplied by c.

Exercises 5.36 – 5.51

5.36
a. Σx = 55, Σx² = 385, Σxy = 1086, Σy = 185.6, n = 10, $\bar{x}$ = 5.5, $\bar{y}$ = 18.56
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 1086 - \frac{(55)(185.6)}{10} = 65.2 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 385 - \frac{(55)^2}{10} = 82.5 \]
The slope, b = S_xy / S_xx = 65.2 / 82.5 = 0.7903
The intercept, a = $\bar{y} - b\bar{x}$ = 18.56 − 0.7903(5.5) = 14.213
The equation: ŷ = 14.213 + 0.7903x
The number of transplants has increased steadily over time.
b.
     x      y        ŷ        y - ŷ
     1    15      15.0036    -.0036
     2    15.7    15.7939    -.0939
     3    16.1    16.5842    -.4842
     4    17.6    17.3745     .2254
     5    18.3    18.1648     .1351
     6    19.4    18.9552     .4448
     7    20      19.7455     .2545
     8    20.3    20.5358    -.2358
     9    21.4    21.3261     .0739
    10    21.8    22.1164    -.3164
There does appear to be curvature in the residual plot, which indicates that the relationship between year and number of transplants may be better described by a curve rather than a line.

5.37
a.
b. Yes, there appear to be large residuals, those associated with the x-values of 40, 50 and 60.
c.
      x      y      ŷ      y - ŷ
     40     58    46.5     11.5
     50     34    42.0     −8.0
     60     32    37.5     −5.5
     70     30    33.0     −3.0
     80     28    28.5     −0.5
     90     27    24.0      3.0
    100     22    19.5      2.5
Yes, the residuals for small x-values and large x-values are positive, while the residuals for the middle x-values are negative.

5.38
a.
b. If the equation of the least squares line is ŷ = 1082.2 − 4.691x:
      x       y       ŷ       residual
    30.0    915    941.47     -26.47
    31.8    891    933.03     -42.03
    32.1    968    931.62      36.38
    26.8    972    956.48      15.52
    30.4    952    939.59      12.41
    40.0    899    894.56       4.44
The correlation coefficient, r = -0.581. It indicates a moderately strong negative linear relationship between pollution and medical cost.
c. It appears that the areas with high and low pollution rates have smaller residuals than the areas with pollution rates in the middle range. This might warrant further investigation.
d. The observation is influential. With this observation deleted, the equation of the regression line is ŷ = 974 – 1.35x, which is quite different from the line based on the complete data set.

5.39
a. The equation of the least-squares line is ŷ = 94.33 − 15.388x.
b.
      x       y       ŷ        residual
    .106     98    92.6989     5.30112
    .193     95    91.3601     3.63988
    .511     87    86.4667     0.53326
    .527     85    86.2205    -1.22053
    1.08     75    77.7110    -2.71096
    1.62     72    69.4014     2.59856
    1.73     64    67.7088    -3.70876
    2.36     55    58.0143    -3.01432
    2.72     44    52.4746    -8.47464
    3.12     41    46.3194    -5.31945
    3.88     37    34.6246     2.37544
    4.18     40    30.0082     9.99184
There appears to be a pattern in the plot.
It is like the graph of a quadratic equation.

5.40
a. No. The least squares line with observation 11 is ŷ = -1.1 + 1.29x; without observation 11 it is ŷ = -17.59 + 1.59x (not a lot of difference in the slope).
b. Yes. With a residual of 100 – 68.56 = 31.44, when s_e = 12.185, observation 11 can be considered an outlier.
c. No. The least squares line with observation 5 is ŷ = -1.1 + 1.29x; without observation 5 it is ŷ = 5.26 + 1.16x (not a lot of difference in the slope).
d. No. With a residual of 100 – 95.65 = 4.35, when s_e = 12.185, observation 5 cannot be considered an outlier.

5.41
a. r² = 15.4%
b. r² = 16%: No, only 16% of the variability in first-year grades can be attributed to an approximate linear relationship between first-year college grades and SAT II score, so this does not indicate a good predictor.

5.42
a. A value of r² = 0.7664 means that 76.64% of the observed variability in clutch size can be explained by an approximate linear relationship between clutch size and snout-vent length.
b. To find the value of s_e, we need the value of SSResid. We know r² = 1 − SSResid/SSTo. Solving for SSResid, we get SSResid = 10266.954.
\[ s_e = \sqrt{\frac{10266.954}{14 - 2}} = 29.25 \]
Thus, a typical amount by which an observation deviates from the least squares line is 29.25.

5.43
a. There does appear to be a positive linear relationship between x and y.
[Scatterplot: Runoff versus Rainfall]
b. Σx = 798, Σx² = 63040, Σy = 643, Σy² = 41999, Σxy = 51232
$\bar{x}$ = 798/15 = 53.2 and $\bar{y}$ = 643/15 = 42.87
\[ b = \frac{51232 - \frac{(798)(643)}{15}}{63040 - \frac{(798)(798)}{15}} = \frac{17024.4}{20586.4} = 0.827 \]
a = 42.87 – 0.827(53.2) = -1.13
ŷ = −1.13 + 0.827x
c. For x = 80, ŷ = −1.13 + 0.827(80) = 65.03
d.
\[ \text{SSResid} = \sum y^2 - a\sum y - b\sum xy = 41999 - (-1.13)(643) - 0.827(51232) = 356.726 \]
\[ s_e = \sqrt{\frac{356.726}{15 - 2}} = 5.238 \]
e.
    Rainfall   Runoff    Residual
        5         4       0.99344
       12        10       1.20463
       14        13       2.55068
       17        15       2.06976
       23        15      -2.89208
       30        25       1.31911
       40        27      -4.95062
       47        46       8.26057
       55        38      -6.35522
       67        46      -8.2789
       72        53      -5.41376
       81        70       4.14348
       96        82       3.73888
      112        99       7.50731
      127       100      -3.89728
[Residual plot: Residual versus Rainfall]
Yes, the variability of the residuals appears to be increasing with x, indicating that a linear relationship may not be appropriate.

5.44
a. n = 38, Σx = 704, Σy = 45.48, Σx² = 14752, Σy² = 55.444, Σxy = 829.48
\[ S_{xy} = 829.48 - \frac{(704)(45.48)}{38} = -13.097 \]
\[ S_{xx} = 14752 - \frac{(704)^2}{38} = 1709.474 \]
\[ S_{yy} = 55.444 - \frac{(45.48)^2}{38} = 1.012 \]
The slope, b = −13.097 / 1709.474 = −0.00766
The intercept, a = 1.1968 − (−.00766)(18.526) = 1.339
ŷ = 1.339 – 0.008x
b.
\[ r^2 = \frac{(S_{xy})^2}{(S_{xx})(S_{yy})} = \frac{(-13.097)^2}{(1709.474)(1.012)} = .0992 \]
c. With r² close to 0, the linear relationship between perceived stress and telomere length accounts for a very small proportion of variability in telomere length.

5.45
a. ŷ = 766 + .015(9900) = 914.5
residual = 893 − 914.5 = −21.5
b. The typical amount that average SAT score deviates from the least squares line is 53.7.
c. Only about 16% of the observed variation in average SAT scores can be attributed to the approximate linear relationship between average SAT scores and expenditure per pupil. The least-squares line does not effectively summarize the relationship between average SAT scores and expenditure per pupil.

5.46
a. ŷ = −89.09 + .72907(375) = 184.31
residual = 165 − 184.31 = −19.31
b. r² = .963

5.47
a. The plot suggests that the least squares line will give fairly accurate predictions. The least squares equation is ŷ = 5.20683 − .03421x.
b.
The summary statistics for the data remaining after the point (143, .3) is deleted are:
n = 9
Σx = 1060 − 143 = 917
Σy = 15.8 − .3 = 15.5
Σx² = 114514 − (143)² = 94065
Σy² = 27.82 − (.3)² = 27.73
Σxy = 1601.1 − (143)(.3) = 1558.2
\[ \sum x^2 - \frac{(\sum x)^2}{n} = 94065 - \frac{(917)^2}{9} = 94065 - 93432.1111 = 632.8889 \]
\[ \sum xy - \frac{(\sum x)(\sum y)}{n} = 1558.2 - \frac{(917)(15.5)}{9} = 1558.2 - 1579.2778 = -21.0778 \]
b = −21.0778 / 632.8889 = −.0333
a = 1.7222 − (−.0333)(101.8889) = 1.7222 + 3.3930 = 5.1151
The least squares equation with the point deleted is ŷ = 5.1151 − .0333x. The deletion of this point does not greatly affect the equation of the line.
c. For the full data set:
\[ \text{SSTo} = 27.82 - \frac{(15.8)^2}{10} = 27.82 - 24.964 = 2.856 \]
SSResid = 27.82 − 5.2068338(15.8) − (−.03421541)(1601.1) = 27.82 − 82.2680 + 54.7823 = .3343
\[ r^2 = 1 - \frac{.3343}{2.856} = 1 - .1171 = .8829 \]
For the data set with the point (143, .3) deleted:
\[ \text{SSTo} = 27.73 - \frac{(15.5)^2}{9} = 27.73 - 26.6944 = 1.0356 \]
SSResid = 27.73 − 5.1151(15.5) − (−.0333)(1558.2) = 27.73 − 79.28405 + 51.8881 = .33405
\[ r^2 = 1 - \frac{.33405}{1.0356} = 1 - .3226 = .6774 \]
The value of r² becomes smaller when the point (143, 0.3) is deleted. The reason for this is that in the equation r² = 1 – SSResid/SSTo, the value of SSTo is lowered by dropping the point (143, 0.3) but the value of SSResid remains about the same.

5.48
\[ r^2 = 1 - \frac{1235.470}{25321.368} = 0.9512 \]
The coefficient of determination reveals that 95.12% of the total variation in hardness of molded plastic can be explained by the linear relationship between hardness and the amount of time elapsed since termination of the molding process.

5.49
a. ŷ = 62.9476 − 0.54975(25) = 49.2
residual = y − ŷ = 70 − 49.2 = 20.8
b. Since the slope of the fitted line is negative, the value of r is the negative square root of r². So r = −√(r²) = −√0.57 = −0.755
c.
\[ r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}, \qquad 0.57 = 1 - \frac{\text{SSResid}}{2520} \]
Solving for SSResid gives SSResid = 1083.6
\[ s_e = \sqrt{\frac{1083.6}{8}} = 11.64 \]

5.50
a.
Whether s_e is small or not depends upon the physical setting of the problem. An s_e of 2 feet when measuring heights of people would be intolerable, while an s_e of 2 feet when measuring distances between planets would be very satisfactory. It is possible for the linear association between x and y to be such that r² is large and yet have a value of s_e that would be considered large. Consider the following two data sets:
       Set 1               Set 2
     x      y            x      y
     5     14           14      5
     6     16           16     15
     7     17           17     25
     8     18           18     35
     9     19           19     45
    10     21           21     55
For set 1, r² = .981 and s_e = .378. For set 2, r² = .981 and s_e = 2.911. Both sets have a large value for r², but s_e for data set 2 is 7.7 times larger than s_e for data set 1. Hence, it can be argued that data set 2 has a large r² and a large s_e.
b. Now consider the data set
     x       y
     5    10.004
    55    10.006
    15    10.007
    45    10.008
    25    10.009
    35    10.010
This data set has r² = .12 and s_e = .002266. So yes, it is possible for a bivariate data set to have both r² and s_e small.
c. When r² is large and s_e is small, then not only has a large proportion of the total variability in y been explained by the linear association between x and y, but the typical error of prediction is small.

5.51
a. When r = 0, then s_e = s_y. The least squares line in this case is a horizontal line with intercept of $\bar{y}$.
b. When r is close to 1 in absolute value, then s_e will be much smaller than s_y.
c. $s_e = \sqrt{1 - (.8)^2}\,(2.5) = .6(2.5) = 1.5$
d. Letting y denote height at age 6 and x height at age 18, the equation for the least squares line for predicting height at age 6 from height at age 18 is
\[ \text{(height at age 6)} = 46 + .8\left(\frac{1.7}{2.5}\right)[\text{(height at age 18)} - 70] = 7.92 + .544\,\text{(height at age 18)} \]
The value of s_e is $\sqrt{1 - (.8)^2}\,(1.7) = .6(1.7) = 1.02$.

Exercises 5.52 – 5.60

5.52
a. Because of the substantial curvature in the plot, a straight line would not provide an effective summary of the relationship.
b. The plot of the transformed variables suggests that the relationship could be modeled by a straight line.
c.
The coefficient of determination between y′ and x′ is .973. This suggests that a least-squares line might effectively summarize the relationship between x′ and y′.
d. When x = 35, x′ = 1.54407
ŷ′ = 2.01780 − 1.05171(1.54407) = 0.393886
ŷ = 10^{0.393886} = 2.47677
e. Yes, this appears to be the case. To see this, predict y using both approaches. Compare the values of Σ(y − ŷ)² for the two methods. The method of part c results in a lower value for Σ(y − ŷ)².

5.53
a. [Scatterplot: "Fatality Rate vs Age"; % of Drivers killed in Injury Crashes versus Age of Driver]
b. The table suggests moving x up or y down, so let y′ = 1/y.
c.
    Age x    Fatality Rate y    y′ = 1/y
     20          1               1
     25          0.99            1.01
     30          0.8             1.25
     35          0.8             1.25
     40          0.75            1.33
     45          0.75            1.33
     50          0.95            1.05
     55          1.05            0.95
     60          1.15            0.87
     65          1.2             0.83
     70          1.3             0.77
     75          1.65            0.61
     80          2.2             0.45
     85          3               0.33
     90          3.2             0.31
[Scatterplot: "Transformed Data vs Age"; Transformed Data versus Age]
d. The scatterplot suggests that the transformation gives a reasonably linear relationship.
e. Using transformed data: Σx = 825, Σx² = 52375, Σxy′ = 643.31, Σy′ = 13.3603, n = 15, $\bar{x}$ = 55, $\bar{y}'$ = 0.8907
\[ S_{xy'} = \sum xy' - \frac{(\sum x)(\sum y')}{n} = 643.31 - \frac{(825)(13.3603)}{15} = -91.5065 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 52375 - \frac{(825)^2}{15} = 7000 \]
The slope, b = S_xy′ / S_xx = −91.5065 / 7000 = −0.0131
The intercept, a = $\bar{y}' - b\bar{x}$ = 0.8907 − (−0.0131)(55) = 1.6112
The equation: ŷ′ = 1.6112 − 0.0131x, or 1/ŷ = 1.6112 − 0.0131x, where x = age and y = fatality rate.
When x = 78, 1/ŷ = 1.6112 − 0.0131(78) = .5894, so ŷ = 1/.5894 = 1.697.

5.54
a. The plot is curvilinear, not linear.
b. The plot looks like section 3 of figure 5.31, which suggests going down the ladder on y and/or x. Both x and y are down the ladder.
c. Yes, this plot is straighter than the plot in part a.
d. Since there are x observations whose values are 0, neither log(x) nor 1/x can be employed.
Another transformation that might be helpful in straightening the plot is cube root of x and cube root of y.

5.55
a. n = 12, Σx = 22.4, Σy = 303.1, Σx² = 88.58, Σy² = 12039.27, Σxy = 241.29
\[ r = \frac{241.29 - \frac{(22.4)(303.1)}{12}}{\sqrt{88.58 - \frac{(22.4)^2}{12}}\,\sqrt{12039.27 - \frac{(303.1)^2}{12}}} = \frac{-324.50}{\sqrt{46.767}\,\sqrt{4383.47}} = \frac{-324.5}{(6.84)(66.208)} = -0.717 \]
b. n = 12, Σx = 13.5, Σy = 55.74, Σx² = 22.441, Σy² = 303.3626, Σxy = 47.7283
\[ r = \frac{47.7283 - \frac{(13.5)(55.74)}{12}}{\sqrt{22.441 - \frac{(13.5)^2}{12}}\,\sqrt{303.3626 - \frac{(55.74)^2}{12}}} = \frac{-14.9792}{\sqrt{7.2535}\,\sqrt{44.4503}} = \frac{-14.9792}{(2.693)(6.667)} = -0.835 \]
The correlation between the transformed x and y values is −.835. Since this correlation is larger in absolute value than the correlation of part a, the transformation appears successful in straightening the plot.

5.56
a. From 1990 to 1999 the number of people waiting for a transplant has increased. Each year, the number of people added to the waiting list increases.
b. Using the transformation y′ = √y: Σx = 55, Σx² = 385, Σxy′ = 391.2, Σy′ = 64.72, n = 10, $\bar{x}$ = 5.5, $\bar{y}'$ = 6.47
\[ S_{xy'} = \sum xy' - \frac{(\sum x)(\sum y')}{n} = 391.2 - \frac{(55)(64.72)}{10} = 35.24 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 385 - \frac{(55)^2}{10} = 82.5 \]
The slope, b = S_xy′ / S_xx = 35.24 / 82.5 = 0.427
The intercept, a = $\bar{y}' - b\bar{x}$ = 6.47 − 0.427(5.5) = 4.12
The equation: ŷ′ = 4.12 + 0.427x
c. Using the transformed equation ŷ′ = 4.12 + 0.427x, the predicted value for 2000 (year 11) is ŷ′ = 4.12 + 0.427(11) = 8.817, so ŷ = (8.817)² ≈ 77.74. As y is measured in thousands, we predict about 77,740 patients awaiting transplant surgery in 2000.
d. We are assuming the relationship between year and the number awaiting transplant stays the same outside the range of x-values in the given data range. The further from the data range a prediction is going to be made, the less accurate it may be. 2010 is further away from the data used to create the least squares line and we don't know if the relationship between the two variables is still the same.
I would be less confident to make a prediction if the year was 2010.

5.57
a. The relationship appears non-linear.
b. Σx = 7.5, Σx² = 13.75, Σxy = 641.05, Σy = 370.1, n = 5, $\bar{x}$ = 1.5, $\bar{y}$ = 74.02
\[ S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 641.05 - \frac{(7.5)(370.1)}{5} = 85.9 \]
\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 13.75 - \frac{(7.5)^2}{5} = 2.5 \]
The slope, b = S_xy / S_xx = 85.9 / 2.5 = 34.36
The intercept, a = $\bar{y} - b\bar{x}$ = 74.02 − 34.36(1.5) = 22.48
The equation: ŷ = 22.48 + 34.36x
There is a definite curvature in the residual plot, confirming the conclusion in part a.
c. The value of r² is higher and the residuals are smaller for the log transformation.
d. ŷ = a + b(x′), where x′ = log10(x). The values of x′ are: -0.30103, 0, 0.17609, 0.30103, 0.39794
Σx′ = .5740, Σ(x′)² = .3706, Σx′y = 73.2836, Σy = 370.1, n = 5, $\bar{x}'$ = .1148, $\bar{y}$ = 74.02
\[ S_{x'y} = \sum x'y - \frac{(\sum x')(\sum y)}{n} = 73.2836 - \frac{(.5740)(370.1)}{5} = 30.796 \]
\[ S_{x'x'} = \sum (x')^2 - \frac{(\sum x')^2}{n} = .3706 - \frac{(.574)^2}{5} = 0.3047 \]
The slope, b = 30.796 / 0.3047 = 101.07
The intercept, a = $\bar{y} - b\bar{x}'$ = 74.02 − 101.07(.1148) = 62.417
The equation: ŷ = 62.417 + 101.07x′ ⇒ ŷ = 62.417 + 101.07 log(x)
e. When energy of shock (x) = 1.75, the predicted success percent is 62.417 + 101.07(log 1.75) = 87.0%. When energy of shock is 0.8, the predicted success would be 62.417 + 101.07(log 0.8) = 52.6%.

5.58
a.
b.
c.
d.
e. I would recommend using either the transformation of part d or part e.

5.59
a. The plot does appear to have a positive slope, so the scatter plot is compatible with the "positive association" statement made in the paper.
b. This transformation does straighten the plot, but it also appears that the variability of y increases as x increases.
c. The plot appears to be as straight as the plot in b, and has the desirable property that the variability in y appears to be constant regardless of the value of x.
d. This plot has curvature opposite of the plot in part a, suggesting that this transformation has taken us too far along the ladder.
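The shortcut formulas used throughout this chapter (S_xy, S_xx, then b and a) can be checked numerically. The following is a minimal sketch in Python, not part of the original solutions; the function names `least_squares_from_sums` and `predict_success` are illustrative assumptions. It uses only the summary sums quoted in 5.57(d), where x′ = log10(x).

```python
import math


def least_squares_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (intercept a, slope b) of the least squares line from summary sums."""
    s_xy = sum_xy - sum_x * sum_y / n   # S_xy = sum(xy) - (sum x)(sum y)/n
    s_xx = sum_x2 - sum_x ** 2 / n      # S_xx = sum(x^2) - (sum x)^2/n
    b = s_xy / s_xx                     # slope
    a = sum_y / n - b * sum_x / n       # intercept = y-bar - b * x-bar
    return a, b


# Summary sums quoted in 5.57(d); here "x" is the transformed value x' = log10(x).
a, b = least_squares_from_sums(
    n=5, sum_x=0.5740, sum_y=370.1, sum_x2=0.3706, sum_xy=73.2836
)


def predict_success(energy):
    """Predicted success percent for a given shock energy, using x' = log10(energy)."""
    return a + b * math.log10(energy)


print(round(a, 3), round(b, 2))
print(round(predict_success(1.75), 1), round(predict_success(0.8), 1))
```

Under these inputs the sketch should agree, up to rounding, with the intercept, slope and two predictions reported in 5.57(d) and 5.57(e). The same function applies to any exercise in this chapter that lists n, Σx, Σy, Σx² and Σxy.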
5.60
The relationship between age and canal length is not linear, but curvilinear. Transforming to 1/x produces a scatterplot that is much straighter than the plot above.

Exercises 5.61 – 5.65

5.61
Using x = peak intake and $y' = \ln\left(\frac{p}{1-p}\right)$:
$\sum x = 250$, $\sum x^2 = 16500$, $\sum xy' = 61.75$, $\sum y' = 6.4958$, n = 5, $\bar{x} = 50$, $\bar{y}' = 1.3$

$$S_{xy'} = \sum xy' - \frac{(\sum x)(\sum y')}{n} = 61.75 - \frac{(250)(6.4958)}{5} = -263.04$$
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 16500 - \frac{(250)^2}{5} = 4000$$

The slope, $b = S_{xy'}/S_{xx} = -263.04/4000 = -0.06576$
The intercept, $a = \bar{y}' - b\bar{x} = 1.3 - (-0.06576)(50) = 4.588$
The equation: $\hat{y}' = 4.588 - 0.0658x$

Using the values of a and b in the logistic equation, the probability of survival for a hamster with a peak intake of 40 μg is

$$p = \frac{e^{4.588 - 0.0658(40)}}{1 + e^{4.588 - 0.0658(40)}} = .876$$

5.62
a. [Plot: "Probability of Hatching for Low and Mid Elevation Treatments", probability versus days (0 to 8) for the Low and Mid treatments.]
The plots have the characteristic "S-shape" of the logistic plot.

b.
Days   Proportion p   p/(1 − p)   y' = ln(p/(1 − p))
1      0.75           3           1.098612
2      0.67           2.030303    0.708185
3      0.36           0.5625      −0.57536
4      0.31           0.449275    −0.80012
5      0.14           0.162791    −1.81529
6      0.09           0.098901    −2.31363
7      0.06           0.06383     −2.75154
8      0.07           0.075269    −2.58669

The resulting best fit line is $\hat{y}' = a + bx = 1.513 - 0.587x$, where $y' = \ln\left(\frac{p}{1-p}\right)$, p is the proportion of eggs hatched, and x is the days of exposure. The negative slope (b < 0) indicates that the curve starts near 1 for small x values and then decreases as x increases. In other words, the greater the exposure time, the lower the probability of hatching.

c. For 3 days: $p = \frac{e^{1.513 - 0.587(3)}}{1 + e^{1.513 - 0.587(3)}} = .438$
For 5 days: $p = \frac{e^{1.513 - 0.587(5)}}{1 + e^{1.513 - 0.587(5)}} = .194$

d. Somewhere between two days (p = .584) and three days (p = .438).

5.63
a. It can be seen from the table that as the elevation increases, the lichen becomes less common.

b.
Elevation   Proportion p   p/(1 − p)   y' = ln(p/(1 − p))
400         0.99           99          4.59512
600         0.96           24          3.178054
800         0.75           3           1.098612
1000        0.29           0.408451    −0.89538
1200        0.077          0.083424    −2.48382
1400        0.035          0.036269    −3.31678
1600        0.01           0.010101    −4.59512

The resulting best fit line is $\hat{y}' = a + bx = 7.537 - 0.0079x$, where $y' = \ln\left(\frac{p}{1-p}\right)$, p is the proportion of plots with lichen, and x is the elevation.

c. To estimate the proportion of plots of land where the lichen is classified as "common" at an elevation of 900 m:

$$p = \frac{e^{7.537 - 0.0079(900)}}{1 + e^{7.537 - 0.0079(900)}} = .6052$$

5.64
a. The table with a row of proportions of mosquitoes killed:

Concentration       0.10    0.15   0.20    0.30    0.50    0.70    0.95
Proportion killed   .2083   .25    .4464   .6078   .8298   .9623   .9608

[Plot: "Proportion of Mosquitoes Killed with Different Concentrations of Pesticide", proportion killed versus concentration.]

b.
Concentration   Proportion p   p/(1 − p)   y' = ln(p/(1 − p))
0.1             0.2083         0.263105    −1.3352
0.15            0.25           0.333333    −1.09861
0.2             0.4464         0.806358    −0.21523
0.3             0.6078         1.54972     0.438074
0.5             0.8298         4.875441    1.58421
0.7             0.9623         25.5252     3.239666
0.95            0.9608         24.5102     3.19909

The resulting best fit line is $\hat{y}' = a + bx = -1.559 + 5.768x$, where $y' = \ln\left(\frac{p}{1-p}\right)$, p is the proportion of mosquitoes killed, and x is the concentration of pesticide. The positive slope, b > 0, shows that as the concentration of the pesticide increases, the proportion of mosquitoes killed also increases.

c. When the dose kills 50%, p = 0.5, so:

$$\ln\left(\frac{p}{1-p}\right) = a + bx$$
$$\ln\left(\frac{0.5}{1-0.5}\right) = -1.559 + 5.768x$$
$$x = \frac{\ln(1) + 1.559}{5.768} = \frac{1.559}{5.768} = 0.270$$

About 0.27 g/cc.

5.65
a. [Plot: "Proportion failing vs. Load on the Fabric", proportion failing versus load (0 to 90).]

b.
Load   Proportion failing p   p/(1 − p)   y' = ln(p/(1 − p))
5      0.02                   0.020408    −3.89182
15     0.04                   0.041667    −3.17805
35     0.2                    0.25        −1.38629
50     0.23                   0.298701    −1.20831
70     0.32                   0.470588    −0.75377
80     0.34                   0.515152    −0.66329
90     0.43                   0.754386    −0.28185

The resulting best fit line is $\hat{y}' = a + bx = -3.579 + 0.0397x$, where $y' = \ln\left(\frac{p}{1-p}\right)$, p is the proportion of fabrics failing, and x is the load applied. The positive slope, b > 0, shows that as the load or force increases, the proportion of fabrics that fail also increases.

c. When the load is 60 lbs per sq. in.:

$$p = \frac{e^{-3.579 + 0.0397(60)}}{1 + e^{-3.579 + 0.0397(60)}} = .232$$

d. When the failure rate is 5%, p = 0.05, so:

$$\ln\left(\frac{p}{1-p}\right) = a + bx$$
$$\ln\left(\frac{0.05}{1-0.05}\right) = -3.579 + 0.0397x$$
$$x = \frac{\ln(0.0526) + 3.579}{0.0397} = 15.98$$

To have less than a 5% chance of a wardrobe malfunction, a maximum force of 15.5 lbs/sq in. might be suggested.

Exercises 5.66 – 5.78

5.66
a. r = −0.981. There appears to be a strong negative linear relationship between the amount of catalyst added to a chemical reaction and the resulting reaction time.

b. There is a definite curvature to the plot. A line does not seem to be the best description of this relationship. This shows the importance of checking graphical evidence of a good fit as well as numerical measures.

5.67
a. For this data set, n = 13, $\sum x = 5.92$, $\sum x^2 = 3.8114$, $\sum y = 10.47$, $\sum y^2 = 9.885699$, $\sum xy = 5.8464$, $\bar{x} = 0.455$, $\bar{y} = 0.805$

$$b = \frac{5.8464 - \frac{(5.92)(10.47)}{13}}{3.8114 - \frac{(5.92)(5.92)}{13}} = \frac{5.8464 - 4.7679}{3.8114 - 2.6959} = \frac{1.0785}{1.1155} = 0.9668$$

a = 0.805 − 0.9668(0.455) = 0.3651
The least squares regression line is $\hat{y} = 0.3651 + 0.9668x$

b. For a value of x = 0.5, $\hat{y} = 0.3651 + 0.9668(0.5) = 0.8485$

5.68
n = 15, $\sum x = 82.82$, $\sum y = 12545$, $\sum x^2 = 459.9784$, $\sum y^2 = 12734425$, $\sum xy = 67703.9$
a.
$$\sum xy - \frac{(\sum x)(\sum y)}{n} = 67703.9 - \frac{(82.82)(12545)}{15} = -1561.227$$
$$\sum x^2 - \frac{(\sum x)^2}{n} = 459.9784 - \frac{(82.82)^2}{15} = 2.702$$
$$b = \frac{-1561.227}{2.702} = -577.895$$
$$a = 836.33 - (-577.8953)(5.5213) = 4027.083$$
$$\hat{y} = 4027.083 - 577.895x$$

b. The b value of −577.895 is the estimate of the average change in myoglobin level associated with a one unit increase in finishing time.

c. $\hat{y} = 4027.083 - 577.895(8) = -596.077$
The least squares equation yields a negative value for the estimated level of myoglobin when the finishing time is 8 h. This is clearly unreasonable, since myoglobin level cannot be negative.

5.69
$$r^2 = 1 - \frac{5987.16}{17409.60} = 1 - .3439 = .6561$$
So 65.61% of the observed variation in age is explained by a linear relationship between percentage of root with transparent dentine for premolars and age.

$$s_e^2 = \frac{5987.16}{36 - 2} = \frac{5987.16}{34} = 176.0929, \qquad s_e = \sqrt{176.0929} = 13.27$$
The typical amount by which an observed age deviates from the least squares line of percentage of root with transparent dentine and age is 13.27.

5.70
a. The least squares line is $\hat{y} = 32.08 + 0.5549x$

x    y    predicted   residual
15   23   40.4048     −17.4048
19   52   42.6245     9.3755
31   65   49.2837     15.7163
39   55   53.7231     1.2769
41   32   54.8330     −22.8330
44   60   56.4978     3.5022
47   78   58.1626     19.8374
48   59   58.7175     0.2825
55   61   62.6020     −1.6020
65   60   68.1513     −8.1513

b. SSResid = $(-17.4048)^2 + (9.3755)^2 + \dots + (-1.6020)^2 + (-8.1513)^2$ = 302.9271 + 87.9000 + . . . + 2.5664 + 66.4437 = 1635.6833

$$\text{SSTo} = 31993 - \frac{(545)^2}{10} = 31993 - 29702.5 = 2290.50$$
$$r^2 = 1 - \frac{1635.6833}{2290.5000} = 1 - .7141 = .2859$$

c. Only 28.59% of the observed variation in age is explained by the linear relationship between percent of root with transparent dentine and age. Also, $s_e = 14.3$, so a typical prediction error is quite large. The least squares line does not give very accurate predictions.

5.71
a.
$$\sum x^2 - \frac{(\sum x)^2}{n} = 62.600235 - \frac{(22.027)^2}{12} = 62.600235 - 40.43239 = 22.16784$$
$$\sum xy - \frac{(\sum x)(\sum y)}{n} = 1114.5 - \frac{(22.027)(793)}{12} = 1114.5 - 1455.61758 = -341.11758$$
$$b = \frac{-341.11758}{22.16784} = -15.38795$$
$$a = 66.08333 - (-15.38795)(1.83558) = 66.08333 + 28.24586 = 94.32919$$
The least squares equation is $\hat{y} = 94.33 - 15.388x$

b.
$$\text{SSTo} = \sum y^2 - \frac{(\sum y)^2}{n} = 57939 - \frac{(793)^2}{12} = 57939 - 52404.08333 = 5534.91667$$
SSResid = 57939 − 94.32919(793) − (−15.38795)(1114.5) = 57939 − 74803.04767 + 17149.87028 = 285.82261

c.
$$r^2 = 1 - \frac{285.82261}{5534.91667} = 1 - .05164 = .94836 \text{ or } 94.836\%$$

d.
$$s_e^2 = \frac{285.82261}{10} = 28.582261, \qquad s_e = \sqrt{28.582261} = 5.34624$$
A typical prediction error would be about 5.35 percent.

e. Since the slope of the fitted line is negative, the value of r is the negative square root of $r^2$. So $r = -\sqrt{r^2} = -\sqrt{.94836} = -.97384$.

5.72
a. [Plots not reproduced.]

b. It appears that (log x, log y) does the best job of producing an approximately linear relationship. The least squares equation for predicting $y' = \log y$ from $x' = \log x$ is $\hat{y}' = 1.61867 - .31646x'$.
When x = 25, x' = 1.39794, so
$\hat{y}' = 1.61867 - .31646(1.39794) = 1.17628$
$\hat{y} = 10^{1.17628} = 15.0064$

5.73
a. A value of .11 for r indicates a weak linear relationship between annual raises and teaching evaluations.

b. $r^2 = (.11)^2 = .0121$

5.74
a. The plot does not suggest a linear relationship. However, the one outlying value (51.3, 49.3) prevents an accurate interpretation.

b. $\hat{y} = -11.37 + 1.0906(40) = 32.254$

c. The value of $r^2$ is not very large and the value of $s_e$ is 4.70, which is large relative to the size of the y-values in the sample. A straight line is not very effective in summarizing the relationship.

d.
For the new data set, n = 9,
$\sum x = 388.8 - 51.3 = 337.5$, $\sum y = 310.3 - 49.3 = 261.0$
$\sum x^2 = 15338.54 - (51.3)^2 = 12706.85$, $\sum y^2 = 10072.41 - (49.3)^2 = 7641.92$
$\sum xy = 12306.58 - (51.3)(49.3) = 9777.49$

$$\sum x^2 - \frac{(\sum x)^2}{n} = 12706.85 - \frac{(337.5)^2}{9} = 50.60$$
$$\sum xy - \frac{(\sum x)(\sum y)}{n} = 9777.49 - \frac{(337.5)(261.0)}{9} = -10.01$$
$$b = \frac{-10.01}{50.60} = -0.1978, \qquad a = \frac{261}{9} - (-.1978)\left(\frac{337.5}{9}\right) = 36.4175$$
$$\hat{y} = 36.4175 - .1978x$$
$$\sum y^2 - \frac{(\sum y)^2}{n} = 7641.92 - \frac{(261)^2}{9} = 72.92, \qquad r^2 = \frac{(-10.01)^2}{(50.6)(72.92)} = .027$$

Without the observation (51.3, 49.3) there is very little evidence of a linear relationship between fire-simulation consumption and treadmill consumption. One would be very hesitant to use the prediction equation based on the data set including this observation, because this observation is very influential.

5.75
The summary values are: n = 13, $\sum x = 91$, $\sum x^2 = 819$, $\sum y = 470$, $\sum y^2 = 19118$, $\sum xy = 3867$

a.
$$\sum xy - \frac{(\sum x)(\sum y)}{n} = 577, \qquad \sum x^2 - \frac{(\sum x)^2}{n} = 182, \qquad \sum y^2 - \frac{(\sum y)^2}{n} = 2125.6923$$
$$b = \frac{577}{182} = 3.1703$$
$$a = 36.1538 - 3.1703(7) = 13.9617$$
The equation of the estimated regression line is $\hat{y} = 13.9617 + 3.1703x$

b. The plot with the line drawn in suggests that a simple linear regression model may not be appropriate. The scatterplot suggests that a curvilinear relationship may exist between flood depth and damage. The points for small x-values or large x-values are below the line, while points for x-values in the middle range are above the line.

c. When x = 6.5, $\hat{y} = 13.9617 + 3.1703(6.5) = 34.5687$

d. The scatterplot in part b suggests that the value of damage levels off at between 45 and 50 when the depth of flooding is in excess of 10 feet. Using the least squares line to predict flood damage when x = 18 would yield a very high value for damage and result in a predicted value far in excess of actual damage. Since x = 18 is outside of the range of x-values for which data have been collected, we have no information concerning the relationship in the vicinity of x = 18. All of these reasons suggest that one would not want to use the least squares line to predict flood damage when depth of flooding is 18 feet.

5.76
a. $\sum (x - \bar{x})^2 = 2$, $\sum (y - \bar{y})^2 = 2$, $\sum (x - \bar{x})(y - \bar{y}) = 0$

$$r = \frac{0}{\sqrt{2(2)}} = 0$$

b. If y = 1 when x = 6, then r = .509. (Comment: any y value greater than .973 will work.)

c. If y = −1 when x = 6, then r = −.509. (Comment: any y value less than −.973 will work.)

5.77
a. [Plot not reproduced.]

b.
$$\sum x^2 - \frac{(\sum x)^2}{n} = .2157, \qquad \sum y^2 - \frac{(\sum y)^2}{n} = 3.08, \qquad \sum xy - \frac{(\sum x)(\sum y)}{n} = 0.474$$
$$b = \frac{.474}{.2157} = 2.1975$$
$$a = 7.6 - (2.1975)(.93286) = 5.550$$
The least squares line is $\hat{y} = 5.550 + 2.1975x$

c. $s_x = \sqrt{.2157/6} = .1896$, $s_y = \sqrt{3.08/6} = .7165$

$$r = \frac{.474}{6(.1896)(.7165)} = .5815$$
This value of r suggests a moderate positive linear relationship between x and y.

d.
x rank   y rank   (x rank)(y rank)
6.5      5.0      32.5
6.5      7.0      45.5
4.0      2.0      8.0
3.0      1.0      3.0
1.0      3.5      3.5
2.0      3.5      7.0
5.0      6.0      30.0
                  129.5

$$r_s = \frac{129.5 - \frac{7(8)^2}{4}}{\frac{7(6)(8)}{12}} = \frac{17.5}{28} = .625$$
This value is very close to the value of r in part c.

5.78
a. [Plot not reproduced.]

b. Based on the plot in part a and Figure 5.34, a transformation going down the ladder on x or y is suggested. The transformation log(time) will produce a reasonably straight plot.
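The rank correlation in Exercise 5.77 d can be checked with a few lines of code. This is a minimal sketch that applies the same computing formula used in that solution to the ranks listed there. The function name `spearman_from_ranks` is an illustrative choice. Because this formula's constants assume untied ranks, its result can differ slightly from a Pearson correlation computed directly on ranks that contain ties, as these do.

```python
def spearman_from_ranks(x_ranks, y_ranks):
    """Rank correlation via the computing formula used in Exercise 5.77 d:

        r_s = [sum(rank_x * rank_y) - n(n+1)^2 / 4] / [n(n-1)(n+1) / 12]

    Hypothetical helper written for this illustration only.
    """
    n = len(x_ranks)
    sum_products = sum(rx * ry for rx, ry in zip(x_ranks, y_ranks))
    numerator = sum_products - n * (n + 1) ** 2 / 4
    denominator = n * (n - 1) * (n + 1) / 12
    return numerator / denominator

# Ranks as listed in the solution to Exercise 5.77 d
x_ranks = [6.5, 6.5, 4.0, 3.0, 1.0, 2.0, 5.0]
y_ranks = [5.0, 7.0, 2.0, 1.0, 3.5, 3.5, 6.0]

r_s = spearman_from_ranks(x_ranks, y_ranks)
print(r_s)
```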