Section V Additional Opportunities to Learn from Data

16 Understanding Relationships—Numerical Data Part 2

Preview
Chapter Learning Objectives
16.1 The Simple Linear Regression Model
16.2 Inferences Concerning the Slope of the Population Regression Line
16.3 Checking Model Adequacy
Are You Ready to Move On? Chapter 16 Review Exercises
Technology Notes
AP* Review Questions for Chapter 16

Preview

In Chapter 4, you learned how to describe relationships between two numerical variables. When the relationship was judged to be linear, you found the equation of the least squares regression line and assessed the quality of the fit using the scatterplot, the residual plot, and the values of the coefficient of determination ($r^2$) and the standard deviation about the least squares line ($s_e$). In this chapter you will learn how to make inferences about the slope of the population regression line.

Chapter Learning Objectives

Conceptual Understanding
After completing this chapter, you should be able to
C1 Understand how probabilistic and deterministic models differ.
C2 Understand that the simple linear regression model provides a basis for making inferences about linear relationships.

Mastering the Mechanics
After completing this chapter, you should be able to
M1 Interpret the parameters of the simple linear regression model in context.
M2 Use scatterplots, residual plots, and normal probability plots to assess the credibility of the assumptions of the simple linear regression model.
M3 Know the conditions for appropriate use of methods for making inferences about $\beta$.
M4 Compute the margin of error when the sample slope b is used to estimate a population slope $\beta$.
M5 Use the five-step process for estimation problems (EMC³) and computer output to construct and interpret a confidence interval estimate for the slope of a population regression line.
M6 Use the five-step process (HMC³) to test hypotheses about the slope of the population regression line.
M7 Use graphs to identify potential outliers and influential points.

Putting It into Practice
After completing this chapter, you should be able to
P1 Interpret a confidence interval for a population slope in context.
P2 Carry out the model utility test and interpret the result in context.

Preview Example: Premature Babies

Babies born prematurely (before the 37th week of pregnancy) often have low birth weights. Is a low birth weight related to factors that affect brain function? The authors of the paper “Intrauterine Growth Restriction Affects the Preterm Infant’s Hippocampus” (Pediatric Research [2008]: 438-43) hoped to use data from a study of premature babies to answer this question. They measured x = birth weight (in grams) and y = hippocampus volume (in mL) for 26 premature babies. The hippocampus is a part of the brain that is important in the development of both short- and long-term memory. The sample correlation coefficient for their data is r = 0.4722, and the equation of the least squares regression line is $\hat{y} = 1.67 + 0.0026x$.

The pattern in the scatterplot (Figure 16.1) suggests there may be a positive linear relationship. However, the correlation coefficient is not very large, and the value of the slope is close to zero. Could the pattern observed in the scatterplot (and the nonzero slope) be plausibly explained by chance? That is, is it plausible that there is no relationship between birth weight and hippocampus volume in the population of all premature babies? Or does the sample provide convincing evidence of a linear relationship between these two variables?

If there is evidence of a meaningful relationship between these two variables, the regression line could be used to predict the hippocampus volume. If the predicted volume was sufficiently small, early cognitive therapy could be recommended.
On the other hand, if there is no meaningful relationship between these variables, low birth weight should not automatically trigger potentially expensive therapy.

Figure 16.1 Scatterplot of birth weight versus hippocampus volume. (Horizontal axis: birth weight, 500 to 2500; vertical axis: hippocampus volume, 1.5 to 2.4.)

In this chapter, you will learn methods that will help you determine if there is a real and useful linear relationship between two variables or if the pattern in the data could be simply due to chance differences that occur when a sample is selected from a population.

Section 16.1 The Simple Linear Regression Model

A deterministic relationship between two variables x and y is one in which the value of y is completely determined by the value of the independent variable x. A deterministic relationship can be described, or “modeled,” using mathematical notation, such as $y = f(x)$, where $f(x)$ is a particular function of x. This relationship is deterministic in the sense that the value of the independent variable is all that is needed to determine the value of the dependent variable.

For example, you might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using $y = f(x)$, where $f(x) = \frac{9}{5}x + 32$. Once the centigrade temperature is known, the Fahrenheit temperature is completely determined. Or you might determine y = amount of money in a savings account after x years, using the compound interest formula,

$$y = P\left(1 + \frac{r}{n}\right)^{nx}$$

where P is the principal (the amount of money deposited), r is the interest rate, and n is the number of times each year the interest is compounded. The number of years you leave the principal in the bank determines the amount in the account.

In many situations the variables of interest are not deterministically related.
For example, the value of y = first-year college grade point average is not determined solely by x = high school grade point average, and y = crop yield is determined partly by factors other than x = amount of fertilizer used.

The relationship between two variables, x and y, that are not deterministically related can be described by extending the deterministic model to specify a probabilistic model. The general form of a probabilistic model allows y to be larger or smaller than $f(x)$ by a random amount e. The model equation for a probabilistic model has the form

$$y = (\text{deterministic function of } x) + (\text{random deviation}) = f(x) + e$$

In a scatterplot of y versus x, some of the data points will fall above the graph of $f(x)$ and some will fall below. Thinking geometrically, if $e > 0$, the corresponding point in the scatterplot will lie above the graph of the function $y = f(x)$. If $e < 0$, the corresponding point will fall below the graph of $f(x)$.

For example, consider the probabilistic model

$$y = \underbrace{50 - 10x + x^2}_{f(x)} + e$$

The graph of the function $y = 50 - 10x + x^2$ is shown as the orange curve in Figure 16.2. The observed point (4, 30) is also shown in the figure. Because

$$f(4) = 50 - 10(4) + 4^2 = 50 - 40 + 16 = 26$$

for this point, you can write $y = f(x) + e$, where $e = 4$. The point (4, 30) falls 4 above the graph of the function $y = 50 - 10x + x^2$.

Figure 16.2 A deviation from the deterministic part of a probabilistic model. (The graph of $y = 50 - 10x + x^2$, with the observation (4, 30) lying $e = 4$ above the curve, whose height at x = 4 is 26.)

Unless otherwise noted, all content on this page is © Cengage Learning.

Simple Linear Regression Model

The simple linear regression model is a special case of the general probabilistic model in which the deterministic function, $f(x)$, is linear (so its graph is a straight line).
Definition
The simple linear regression model assumes that there is a line with vertical or y intercept $\alpha$ and slope $\beta$, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$$y = \alpha + \beta x + e$$

Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation recognizes that points will deviate from the line by a random amount.

Figure 16.3 shows two observations in relation to the population regression line.

Figure 16.3 Two observations and deviations from the population regression line. (The population regression line has slope $\beta$ and vertical intercept $\alpha$. The observation at $x = x_1$ lies above the line, a positive deviation $e_1$; the observation at $x = x_2$ lies below the line, a negative deviation $e_2$.)

Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e. It could be negative, positive, or even 0. Also, e might be quite large in magnitude (resulting in a point far from the population regression line) or quite small (resulting in a point very close to the line). The simple linear regression model makes some assumptions about the distribution of e at any particular x value in the population.

Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.
2. The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by $\sigma_e$.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.
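The four assumptions can be made concrete with a small simulation. The sketch below is illustrative only: the parameter values ($\alpha = 2$, $\beta = 0.5$, $\sigma_e = 1$), the function name `simulate_y`, and the three x values are assumptions chosen for the demonstration and do not come from the text. At each x value, the simulated y values should average close to $\alpha + \beta x$ and have a standard deviation close to $\sigma_e$.

```python
import random
import statistics


def simulate_y(x, alpha, beta, sigma_e, rng):
    """Draw one observation y = alpha + beta*x + e, where e ~ Normal(0, sigma_e)."""
    e = rng.gauss(0.0, sigma_e)  # random deviation: mean 0, same spread at every x
    return alpha + beta * x + e


# Hypothetical parameter values, used only for illustration.
alpha, beta, sigma_e = 2.0, 0.5, 1.0
rng = random.Random(1)  # fixed seed so the run is repeatable

summary = {}
for x in (10, 20, 30):
    # Each draw is made independently of the others (assumption 4).
    ys = [simulate_y(x, alpha, beta, sigma_e, rng) for _ in range(20000)]
    summary[x] = (statistics.mean(ys), statistics.stdev(ys))
    print(f"x = {x}: mean y = {summary[x][0]:.2f}, sd of y = {summary[x][1]:.2f}")
```

Changing `sigma_e` changes how tightly the simulated points cluster around the line, but it does not change the line itself.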
The simple linear regression model assumptions about the variability in the values of e in the population imply that there is also variability in the y values observed at any particular value of x. Consider y when x has some fixed value x*, so that

$$y = \alpha + \beta x^* + e$$

Because $\alpha$ and $\beta$ are fixed (they are unknown population values), $\alpha + \beta x^*$ is also a fixed number. The sum of a fixed number and a normally distributed variable (e) is also a normally distributed variable (the bell-shaped curve is simply shifted), so y itself has a normal distribution. Furthermore, $\mu_e = 0$ implies that the mean value of y is $\alpha + \beta x^*$, the height of the population regression line for the value x = x*. Finally, because there is no variability in the fixed number $\alpha + \beta x^*$, the standard deviation of y is the same as the standard deviation of e. These properties are summarized in the following box.

At any fixed value x*, y has a normal distribution, with

$$(\text{mean } y \text{ value for } x^*) = (\text{height of the population regression line above } x^*) = \alpha + \beta x^*$$

and

$$(\text{standard deviation of } y \text{ for a fixed value } x^*) = \sigma_e$$

The slope $\beta$ of the population regression line is the mean or expected change in y associated with a 1-unit increase in x. The y intercept $\alpha$ is the height of the population line when x = 0. The value of $\sigma_e$ determines how much the (x, y) observations deviate vertically from the population line; when $\sigma_e$ is small, most observations will be close to the line, but when $\sigma_e$ is large, the observations will tend to deviate more from the line.

The key features of the model are illustrated in Figures 16.4 and 16.5. Notice that the three normal curves in Figure 16.4 have identical spreads. This is a consequence of $\sigma_e$ being the same at any value of x, which implies that the variability in the y values at a particular value of x is constant: the variability does not depend on the value of x.
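Because y at a fixed x* is normal with mean $\alpha + \beta x^*$ and standard deviation $\sigma_e$, probabilities about y can be read from a normal distribution. The following is a minimal sketch, assuming hypothetical values $\alpha = 2$, $\beta = 0.5$, $\sigma_e = 1$, and x* = 20; the helper name `y_distribution` is also made up for illustration.

```python
from statistics import NormalDist


def y_distribution(x_star, alpha, beta, sigma_e):
    """Distribution of y at a fixed x*: normal, mean alpha + beta*x*, sd sigma_e."""
    return NormalDist(mu=alpha + beta * x_star, sigma=sigma_e)


# Hypothetical population values, for illustration only.
dist = y_distribution(20, alpha=2.0, beta=0.5, sigma_e=1.0)

# Mean of y at x* = 20 is the height of the population line: 2 + 0.5(20) = 12.
mean_y = dist.mean

# Proportion of y values within 2 standard deviations of that mean.
p_within_2sd = dist.cdf(mean_y + 2 * dist.stdev) - dist.cdf(mean_y - 2 * dist.stdev)

# Proportion of y values more than one standard deviation above the mean.
p_above_13 = 1 - dist.cdf(13.0)

print(mean_y, round(p_within_2sd, 4), round(p_above_13, 4))
```

The same two-step pattern (find the mean from the line, then use $\sigma_e$ as the standard deviation) applies at any other x value, since only the mean changes with x.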
Figure 16.4 Illustration of the simple linear regression model. (The population regression line $y = \alpha + \beta x$ is the line of mean values. At each of three different x values, $x_1$, $x_2$, and $x_3$, a normal curve with standard deviation $\sigma_e$ is centered on the line at mean value $\alpha + \beta x_1$, $\alpha + \beta x_2$, and $\alpha + \beta x_3$, respectively.)

Figure 16.5 The simple linear regression model: (a) small $\sigma_e$; (b) large $\sigma_e$.

Example 16.1 Stand on Your Head to Lose Weight?

The authors of the article “On Weight Loss by Wrestlers Who Have Been Standing on Their Heads” (paper presented at the Sixth International Conference on Statistics, Combinatorics, and Related Areas, Forum for Interdisciplinary Mathematics, 1999, with the data also appearing in A Quick Course in Statistical Process Control, Mick Norton, 2005) state that “amateur wrestlers who are overweight near the end of the weight certification period, but just barely so, have been known to stand on their heads for a minute or two, get on their feet, step back on the scale, and establish that they are in the desired weight class. Using a headstand as the method of last resort has become a fairly common practice in amateur wrestling.”

Does this really work? Data were collected in an experiment where weight loss was recorded for each wrestler after exercising for 15 minutes and then doing a headstand for 1 minute 45 seconds. Based on these data, the authors of the article concluded that there was in fact a demonstrable weight loss that was greater than that for a control group that exercised for 15 minutes but did not do the headstand.
(The authors give a plausible explanation for why this might be the case based on the way blood and other body fluids collect in the head during the headstand and the effect of weighing while these fluids are draining immediately after standing.) The authors also concluded that a simple linear regression model was a reasonable way to describe the relationship between the variables

y = weight loss (in pounds)

and

x = body weight prior to exercise and headstand (in pounds)

Suppose that the actual model equation has $\alpha = 0$, $\beta = 0.001$, and $\sigma_e = 0.09$ (these values are consistent with the findings in the article). The population regression line is shown in Figure 16.6.

Figure 16.6 The population regression line for Example 16.1. (The line $y = 0.001x$, with the mean y value of 0.19 marked at x = 190.)

If the distribution of the random errors at any fixed weight (x value) is normal, then the variable y = weight loss is normally distributed with

$$\mu_y = 0 + 0.001x$$
$$\sigma_y = 0.09$$

For example, when x = 190 (corresponding to a 190-pound wrestler), weight loss has mean value

$$\mu_y = 0 + 0.001(190) = 0.19 \text{ pounds}$$

Because the standard deviation of y is $\sigma_y = 0.09$, the interval

$$0.19 \pm 2(0.09) = (0.01, 0.37)$$

includes y values that are within 2 standard deviations of the mean value for y when x = 190. Roughly 95% of the weight loss observations made for 190-lb wrestlers will be in this range. The slope $\beta = 0.001$ can be interpreted as the mean change in weight loss associated with each additional pound of body weight.

More insight into model properties can be gained by thinking of the population of all (x, y) pairs as consisting of many smaller subpopulations. Each subpopulation contains pairs for which x has a fixed value.
Suppose, for example, that in a large population of college students the variables

x = grade point average in major courses

and

y = starting salary after graduation

are related according to the simple linear regression model. Then you can think about the subpopulation of all pairs with x = 3.20 (corresponding to all students with a grade point average of 3.20 in major courses), the subpopulation of all pairs having x = 2.75, and so on. The model assumes that for each of these subpopulations, y is normally distributed with the same standard deviation, and that the mean y value (rather than y itself) is linearly related to x.

In practice, the judgment of whether the simple linear regression model is appropriate (that is, the judgment about the credibility of the assumptions underlying the linear model) must be based on knowledge of how the data were collected, as well as an inspection of various plots of the data and the residuals. The sample observations should be independent of one another, which will be the case if the data are from a random sample. In addition, the scatterplot should show a linear rather than a curved pattern, and the vertical spread of points should be very similar throughout the range of x values. Figure 16.7 shows plots with three different patterns; only the first pattern is consistent with the simple linear regression model assumptions.

Figure 16.7 Some commonly encountered patterns in scatterplots: (a) consistent with the simple linear regression model; (b) suggests a nonlinear probabilistic model; (c) suggests that variability in y changes with x.

Estimating the Population Regression Line

In Section 16.3, you will see how to check whether the basic assumptions of the simple linear regression model are reasonable.
When this is the case, the values of $\alpha$ and $\beta$ (y intercept and slope of the population regression line) can be estimated from sample data.

The estimates of $\alpha$ and $\beta$ are denoted by a and b, respectively. These estimates are the values of the intercept and slope of the least squares regression line. Recall that the least squares regression line is the line for which the sum of squared vertical deviations of points in the scatterplot from the line is smaller than for any other line.

The estimates of the slope and the y intercept of the population regression line are the slope and y intercept, respectively, of the least squares line. That is,

$$b = \text{estimate of } \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$a = \text{estimate of } \alpha = \bar{y} - b\bar{x}$$

The values of a and b are usually obtained using statistical software or a graphing calculator. If the slope and intercept are calculated by hand, you can use the following computational formula:

$$b = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$

The estimated regression line is the familiar least squares line

$$\hat{y} = a + bx$$

Let x* denote a specified value of the independent variable x. Then $a + bx^*$ has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.

Example 16.2 Mother’s Age and Baby’s Birth Weight

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. (Low birth weight in humans is generally defined as a weight below 2,500 grams.) Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother’s age for babies born to young mothers.

One such study is described in the article “Body Size and Intelligence in 6-Year-Olds: Are Offspring of Teenage Mothers at Risk?” (Maternal and Child Health Journal [2009]: 847-856).
The following data on

x = maternal age (in years)

and

y = birth weight of baby (in grams)

are consistent with summary values given in the article and also with data published by the National Center for Health Statistics.

Observation      1      2      3      4      5      6      7      8      9     10
x               15     17     18     15     16     19     17     16     18     19
y            2,289  3,393  3,271  2,648  2,897  3,327  2,970  2,535  3,138  3,573

A scatterplot of the data is given in Figure 16.8. The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Figure 16.8 Scatterplot of birth weight versus maternal age for Example 16.2. (Horizontal axis: mother’s age in years, 15 to 19; vertical axis: baby’s weight in grams, 2500 to 3500.)

For these data, the equation of the estimated regression line was found using statistical software, resulting in

$$\hat{y} = a + bx = -1{,}163.45 + 245.15x$$

An estimate of the mean birth weight of babies born to 18-year-old mothers results from substituting x = 18 into the estimated equation:

$$\text{estimated mean } y \text{ for 18-year-old mothers} = a + bx = -1{,}163.45 + 245.15(18) = 3{,}249.25 \text{ grams}$$

Similarly, you would predict the birth weight of a baby to be born to a particular 18-year-old mother to be

$$\hat{y} = \text{predicted } y \text{ value when } x = 18 = a + b(18) = 3{,}249.25 \text{ grams}$$

The estimate of the mean weight and the prediction of an individual baby weight are identical, because the same x value was used in each calculation. However, their interpretations differ. One is the prediction of the weight of a single baby whose mother is 18, whereas the other is an estimate of the mean weight of all babies born to 18-year-old mothers.

In Example 16.2, the x values in the sample ranged from 15 to 19.
The estimated regression equation should not be used to make an estimate or prediction for any x value much outside this range. Without sample data for such values, or some clear theoretical reason for expecting the relationship to be linear outside the observed range of x values, you have no reason to believe that the estimated linear relationship continues outside the range from 15 to 19. Making predictions outside this range can be misleading, and statisticians refer to this as the danger of extrapolation.

Estimating $\sigma_e^2$ and $\sigma_e$

The value of $\sigma_e$ determines the extent to which observed points (x, y) tend to fall close to or far away from the population regression line. A point estimate of $\sigma_e$ is based on

$$\text{SSResid} = \sum (y - \hat{y})^2$$

where $\hat{y}_1 = a + bx_1, \ldots, \hat{y}_n = a + bx_n$ are the fitted or predicted y values and the residuals are $y_1 - \hat{y}_1, \ldots, y_n - \hat{y}_n$. SSResid is a measure of the extent to which the sample data spread out around the estimated regression line.

Definition
The statistic for estimating the variance $\sigma_e^2$ is

$$s_e^2 = \frac{\text{SSResid}}{n - 2}$$

where

$$\text{SSResid} = \sum (y - \hat{y})^2 = \sum y^2 - a\sum y - b\sum xy$$

The subscript in $\sigma_e^2$ and $s_e^2$ is a reminder that you are estimating the variance of the “errors” or residuals.

The estimate of $\sigma_e$ is the estimated standard deviation

$$s_e = \sqrt{s_e^2}$$

The number of degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple linear regression is n − 2.

The estimates and number of degrees of freedom here have analogs in previous work involving a single sample $x_1, x_2, \ldots, x_n$. The sample variance $s^2$ had a numerator of $\sum (x - \bar{x})^2$, a sum of squared deviations (residuals), and denominator n − 1, the number of degrees of freedom associated with $s^2$ and s. The use of $\bar{x}$ as an estimate of $\mu$ in the formula for $s^2$ reduces the number of degrees of freedom by 1, from n to n − 1.
In simple linear regression, estimation of two quantities, $\alpha$ and $\beta$, results in a loss of 2 degrees of freedom, leaving n − 2 as the number of degrees of freedom associated with SSResid, $s_e^2$, and $s_e$.

Once the estimated regression equation has been found, the usefulness of this model is evaluated using a residual plot and the values of $s_e$ and the coefficient of determination, $r^2$. Recall from Chapter 4 that the values of $s_e$ and $r^2$ are interpreted as described in the following box.

The coefficient of determination, $r^2$, is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.
The value of $s_e$, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.

Example 16.3 Estimating Elk Weight

Wildlife biologists monitor the ecological health of animals. For large animals whose habitat is relatively inaccessible, this can present some practical problems. The Rocky Mountain elk is the fourth largest deer species and is a case in point. Males range up to 7.5 feet in length and over 500 pounds in weight. The equipment, manpower, and time needed to weigh these creatures make direct measurement of weight difficult and expensive. The authors of the paper “Estimating Elk Weight From Chest Girth” (Wildlife Society Bulletin [1996]: 58-611) found they could reliably estimate elk weights by a much more practical method: measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk in Custer State Park, South Dakota. The resulting data (from a scatterplot in the paper) are given in the accompanying table. The table also includes the predicted values and residuals for the estimated regression line.
Girth (cm)   Weight (kg)   Predicted y Value   Residual
    96            87           136.266         -38.2661
   105           196           161.069          34.9314
   108           163           169.336          -6.3361
   109           196           172.092          23.9080
   110           183           174.848           8.1522
   114           171           185.871         -14.8711
   121           230           205.162          24.8380
   124           225           213.429          11.5705
   131           211           232.720         -21.7203
   135           231           243.744         -12.7436
   137           225           249.255         -24.2553
   138           266           252.011          13.9889
   140           241           257.523         -16.5228
   142           264           263.034           0.9655
   157           284           304.372         -20.3720
   157           292           304.372         -12.3720
   159           300           309.884          -9.8837
   155           337           298.860          38.1397
   162           339           318.151          20.8488

The scatterplot (Figure 16.9) gives evidence of a strong positive linear relationship between x = chest girth (in cm) and y = weight (in kg).

Figure 16.9 Scatterplot of weight versus chest girth for Example 16.3. (Horizontal axis: girth in cm, 90 to 170; vertical axis: weight in kg, 100 to 350.)

Partial Minitab regression output is shown here.

    Regression Analysis: Weight versus Girth

    The regression equation is
    Weight = -136 + 2.81 Girth

    Predictor      Coef   SE Coef       T       P
    Constant    -135.51     35.75   -3.79   0.001
    Girth        2.8063    0.2686   10.45   0.000

    S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%

From the output,

$$\hat{y} = -136 + 2.81x$$
$$r^2 = 0.865$$
$$s_e = 23.6626$$

Approximately 86.5% of the observed variation in elk weight y can be attributed to the linear relationship between weight and chest girth. The magnitude of a typical deviation from the least squares line is about 23.7 kg, which is relatively small in comparison to the y values themselves.

Another important assumption of the simple linear regression model is that the random deviations at any particular x value are normally distributed. In Section 16.3, you will see how the residuals can be used to determine whether this assumption is plausible.
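The estimates reported in Examples 16.2 and 16.3 can be reproduced directly from the formulas for b, a, SSResid, and $s_e$ given in this section. The sketch below is one way to do so in plain Python; the function name `fit_line` and the variable names are illustrative choices, and $r^2$ is computed as 1 − SSResid/SSTo, the form used in Chapter 4. The data are typed in from the tables in the two examples.

```python
import math


def fit_line(xs, ys):
    """Return (a, b, s_e, r2) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                       # estimate of the population slope
    a = y_bar - b * x_bar               # estimate of the population intercept
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    s_e = math.sqrt(ss_resid / (n - 2))  # n - 2 degrees of freedom
    r2 = 1 - ss_resid / ss_total
    return a, b, s_e, r2


# Example 16.2: maternal age (years) and birth weight (grams).
age = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
birth_weight = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
a_bw, b_bw, _, _ = fit_line(age, birth_weight)
pred_18 = a_bw + b_bw * 18  # point estimate / point prediction at x = 18

# Example 16.3: chest girth (cm) and weight (kg) for 19 elk.
girth = [96, 105, 108, 109, 110, 114, 121, 124, 131, 135,
         137, 138, 140, 142, 157, 157, 159, 155, 162]
weight = [87, 196, 163, 196, 183, 171, 230, 225, 211, 231,
          225, 266, 241, 264, 284, 292, 300, 337, 339]
a_elk, b_elk, se_elk, r2_elk = fit_line(girth, weight)

print(f"Example 16.2: a = {a_bw:.2f}, b = {b_bw:.2f}, prediction at 18 = {pred_18:.2f}")
print(f"Example 16.3: a = {a_elk:.2f}, b = {b_elk:.4f}, s_e = {se_elk:.4f}, r^2 = {r2_elk:.3f}")
```

The printed coefficients should agree, up to rounding, with the estimated regression line in Example 16.2 and with the Coef, S, and R-Sq entries of the Minitab output in Example 16.3.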
Section 16.1 Exercises

Each exercise set assesses the following chapter learning objectives: C1, M1

Section 16.1 Exercise Set 1

16.1 Identify the following relationships as deterministic or probabilistic:
a. The relationship between the length of the sides of a square and its perimeter.
b. The relationship between the height and weight of an adult.
c. The relationship between SAT score and college freshman GPA.
d. The relationship between tree height in centimeters and tree height in inches.

16.2 Let x be the size of a house (in square feet) and y be the amount of natural gas used (therms) during a specified period. Suppose that for a particular community, x and y are related according to the simple linear regression model with

$\beta$ = slope of population regression line = 0.017
$\alpha$ = y intercept of population regression line = −5.0

Houses in this community range in size from 1000 to 3000 square feet.
a. What is the equation of the population regression line?
b. Graph the population regression line by first finding the point on the line corresponding to x = 1000 and then the point corresponding to x = 2000, and drawing a line through these points.
c. What is the mean value of gas usage for houses with 2100 sq. ft. of space?
d. What is the average change in usage associated with a 1 sq. ft. increase in size?
e. What is the average change in usage associated with a 100 sq. ft. increase in size?
f. Would you use the model to predict mean usage for a 500 sq. ft. house? Why or why not?

16.3 Suppose that a simple linear regression model is appropriate for describing the relationship between y = house price (in dollars) and x = house size (in square feet) for houses in a large city. The population regression line is y = 23,000 + 47x and $\sigma_e$ = 5000.
a. What is the average change in price associated with one extra square foot of space? With an additional 100 sq. ft. of space?
b. Approximately what proportion of 1800 sq. ft. homes would be priced over $110,000? Under $100,000?

Section 16.1 Exercise Set 2

16.4 Identify the following relationships as deterministic or probabilistic:
a. The relationship between height at birth and height at one year of age.
b. The relationship between a positive number and its square root.
c. The relationship between temperature in degrees Fahrenheit and degrees centigrade.
d. The relationship between adult shoe size and shirt size.

16.5 The flow rate in a device used for air quality measurement depends on the pressure drop x (inches of water) across the device’s filter. Suppose that for x values between 5 and 20, these two variables are related according to the simple linear regression model with population regression line y = −0.12 + 0.095x.
a. What is the mean flow rate for a pressure drop of 10 inches? A drop of 15 inches?
b. What is the average change in flow rate associated with a 1 inch increase in pressure drop? Explain.

16.6 The paper “Predicting Yolk Height, Yolk Width, Albumen Length, Eggshell Weight, Egg Shape Index, Eggshell Thickness, Egg Surface Area of Japanese Quails Using Various Egg Traits as Regressors” (International Journal of Poultry Science [2008]: 85–88) suggests that the simple linear regression model is reasonable for describing the relationship between y = eggshell thickness (in micrometers) and x = egg length (mm) for quail eggs. Suppose that the population regression line is y = 0.135 + 0.003x and that $\sigma_e$ = 0.005. Then, for a fixed x value, y has a normal distribution with mean 0.135 + 0.003x and standard deviation 0.005.
a. What is the mean eggshell thickness for quail eggs that are 15 mm in length? For quail eggs that are 17 mm in length?
b. What is the probability that a quail egg with a length of 15 mm will have a shell thickness that is greater than 0.18 mm?
c. Approximately what proportion of quail eggs of length 14 mm have a shell thickness greater than 0.175? Less than 0.178?

Additional Exercises

16.7 Tom and Ray are managers of electronics stores with slightly different pricing strategies for USB drives. In Tom’s store, customers pay the same amount, c, for each USB drive. In Ray’s store, it is a little more exciting. The customer pays an up-front cost of $1.00. Ray charges the same price per USB drive, c, but at the register the customer flips a coin. If the coin lands heads up, the customer gets his or her $1.00 back, plus another dollar off the total cost of the USB drives purchased.
a. Which of these pricing strategies can be expressed as a deterministic model?
b. Using mathematical notation, specify a model using Tom’s pricing strategy that relates y = total cost to x = number of USB drives purchased.
c. Using mathematical notation, specify a model using Ray’s pricing strategy that relates y = total cost to x = number of USB drives purchased.
d. Describe the distribution of e for the probabilistic model described above. What is the mean of the distribution of e? What is the standard deviation of e?

16.8 Identify the following relationships as deterministic or probabilistic:
a. The relationship between the speed limit and a driver’s speed.
b. The relationship between the price in dollars and the price in Euros of an object.
c. The relationship between the number of pages and the number of words in a textbook.
d. The relationship between the possible numbers of pennies and nickels in a pile if no other coins are in the pile and the amount of money in the pile is $3.00.

16.9 Hormone replacement therapy (HRT) is thought to increase the risk of breast cancer.
The accompanying data on x = percent of women using HRT and y = breast cancer incidence (cases per 100,000 women) for a region in Germany for 5 years appeared in the paper “Decline in Breast Cancer Incidence after Decrease in Utilisation of Hormone Replacement Therapy” (Epidemiology [2008]: 427–430). The authors of the paper used a simple linear regression model to describe the relationship between HRT use and breast cancer incidence.

HRT Use                    46.30   40.60   39.50   36.60   30.00
Breast Cancer Incidence   103.30  105.00  100.00   93.80   83.50

a. What is the equation of the estimated regression line?
b. What is the estimated average change in breast cancer incidence associated with a 1 percentage point increase in HRT use?
c. What would you predict the breast cancer incidence to be in a year when HRT use was 40%?
d. Should you use this regression model to predict breast cancer incidence for a year when HRT use was 20%? Explain.
e. Calculate and interpret the value of $r^2$.
f. Calculate and interpret the value of $s_e$.

16.10 Consider the accompanying data on x = advertising share and y = market share for a particular brand of soft drink during 10 randomly selected years.

x   0.103  0.072  0.071  0.077  0.086  0.047  0.060  0.050  0.070  0.052
y   0.135  0.125  0.120  0.086  0.079  0.076  0.065  0.059  0.051  0.039

a. Construct a scatterplot for these data. Do you think the simple linear regression model would be appropriate for describing the relationship between x and y?
b. Calculate the equation of the estimated regression line and use it to obtain the predicted market share when the advertising share is 0.09.
c. Compute $r^2$. How would you interpret this value?
d. Calculate a point estimate of $\sigma_e$. How many degrees of freedom are associated with this estimate?
16.11 The authors of the paper “Weight-Bearing Activity During Youth Is a More Important Factor for Peak Bone Mass than Calcium Intake” (Journal of Bone and Mineral Research [1994]: 1089–1096) studied a number of variables they thought might be related to bone mineral density (BMD). The accompanying data on x = weight at age 13 and y = bone mineral density at age 27 are consistent with summary quantities for women given in the paper.

Weight (kg)   BMD (g/cm^2)
54.4          1.15
59.3          1.26
74.6          1.42
62.0          1.06
73.7          1.44
70.8          1.02
66.8          1.26
66.7          1.35
64.7          1.02
71.8          0.91
69.7          1.28
64.7          1.17
62.1          1.12
68.5          1.24
58.3          1.00

The accompanying computer output is from JMP.

[Figure: scatterplot of BMD (g/cm^2) versus Weight (kg) with linear fit]

Linear Fit
BMD (g/cm^2) = 0.5584011 + 0.0094363*Weight (kg)

Summary of Fit
RSquare                       0.121081
RSquare Adj                   0.053472
Root Mean Square Error        0.155141
Mean of Response              1.18
Observations (or Sum Wgts)    15

Lack of Fit
Analysis of Variance

Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     0.5584011   0.466212    1.20      0.2524
Weight (kg)   0.0094363   0.007051    1.34      0.2038

a. What percentage of observed variation in BMD at age 27 can be explained by the simple linear regression model?
b. Give a point estimate of $\sigma_e$ and interpret this estimate.
c. Give an estimate of the average change in BMD associated with a 1 kg increase in weight at age 13.
d. Compute a point estimate of the mean BMD at age 27 for women whose age 13 weight was 60 kg.

16.12 The production of pups and their survival are the most significant factors contributing to gray wolf population growth. The causes of early pup mortality are unknown and difficult to observe. The pups are concealed within their dens for 3 weeks after birth, and after they emerge it is difficult to confirm their parentage. Researchers recently used portable ultrasound equipment to investigate some factors related to reproduction (“Diagnosing Pregnancy, in Utero Litter Size, and Fetal Growth with Ultrasound in Wild, Free-Ranging Wolves,” Journal of Mammalogy [2006]: 85–92). A scatterplot of y = length of an embryonic sac diameter (in cm) and x = gestational age (in days) is shown below. Computer output from a regression analysis is also given.

[Figure: Bivariate Fit of Emb Ves Diam (cm) By Gest Age (days); scatterplot with linear fit]

Linear Fit
Emb Ves Diam (cm) = −3.497279 + 0.1903121*Gest Age (days)

Summary of Fit
RSquare                       0.792803
RSquare Adj                   0.780615
Root Mean Square Error        0.450587
Mean of Response              2.482526
Observations (or Sum Wgts)    19

Lack of Fit
Analysis of Variance

Parameter Estimates
Term              Estimate    Std Error   t Ratio   Prob>|t|
Intercept         −3.497279   0.748605    −4.67     0.0002*
Gest Age (days)   0.1903121   0.023597    8.07      <.0001*

a. What is the equation of the estimated regression line?
b. What is the estimated embryonic sac diameter for a gestational age of 30 days?
c. What is the average change in sac diameter associated with a 1-day increase in gestational age?
d. What is the average change in sac diameter associated with a 5-day increase in gestational age?
e. Would you use this model to predict the mean embryonic sac diameter for all gestation ages from conception to birth? Why or why not?
Section 16.2 Inferences Concerning the Slope of the Population Regression Line

The slope coefficient $\beta$ in the simple linear regression model represents the average or expected change in the response variable y that is associated with a 1-unit increase in the value of the independent variable x. For example, consider x = the size of a house (in square feet) and y = selling price of the house. If the simple linear regression model is appropriate for the population of houses in a particular city, $\beta$ would be the average increase in selling price associated with a 1-square-foot increase in size. As another example, if x = amount of time per week a computer system is used and y = the resulting annual maintenance expense, then $\beta$ would be the expected change in expense associated with using the computer system one additional hour per week.

Because the value of $\beta$ is almost always unknown, it must be estimated from sample data. The slope of the least squares regression line, b, provides an estimate. In some situations, the value of the statistic b may vary greatly from sample to sample, and the value of b computed from a single sample may be quite different from the value of the population slope, $\beta$. In other situations, almost all possible samples result in a value of b that is quite close to $\beta$. The sampling distribution of b provides information about the behavior of this statistic.

AP* exam tip: Inferences about the slope of the population regression line are based on the sampling distribution of the statistic b.

The properties given here depend on the four basic assumptions of the linear regression model being met. In Section 16.3, you will see how to determine if these assumptions are reasonable.

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied:

1.
The mean value of the sampling distribution of b is $\beta$. That is, $\mu_b = \beta$, so the sampling distribution of b is always centered at the value of $\beta$. This means that b is an unbiased statistic for estimating $\beta$.

2. The standard deviation of the sampling distribution of the statistic b is

$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$

3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

The fact that b is unbiased tells you that the sampling distribution is centered at the right place, but it gives no information about variability. If $\sigma_b$ is large, the sampling distribution of b will be quite spread out around $\beta$ and an estimate far from the value of $\beta$ could result. For $\sigma_b = \sigma_e / \sqrt{\sum (x - \bar{x})^2}$ to be small, the numerator $\sigma_e$ should be small (little variability about the population line) and/or the denominator $\sqrt{\sum (x - \bar{x})^2}$ should be large. Because $\sum (x - \bar{x})^2$ is a measure of how much the observed x values spread out, $\beta$ tends to be more precisely estimated when the x values in the sample are spread out rather than when they are close together.

The normality of the sampling distribution of b implies that the standardized variable

$$z = \frac{b - \beta}{\sigma_b}$$

has a standard normal distribution. However, inferential methods cannot be based on this statistic, because the value of $\sigma_b$ is not known (the unknown $\sigma_e$ appears in the numerator of $\sigma_b$). One way to proceed is to estimate $\sigma_e$ with $s_e$ to obtain an estimate of $\sigma_b$.

The estimated standard deviation of the statistic b is

$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$

AP* exam tip: For inferences about the slope of the population regression line, df = n − 2.

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

$$t = \frac{b - \beta}{s_b}$$

is the t distribution with df = n − 2.
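As a rough illustration of the formula for the estimated standard deviation of b, the following Python sketch computes b, $s_e$, and $s_b$ directly from paired data. The function name `slope_summary` and the five data points are hypothetical, chosen only to keep the arithmetic easy to follow; they are not taken from this chapter.

```python
import math

def slope_summary(x, y):
    """Return (b, s_e, s_b) for the least squares line fit to paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of x: the quantity under the square root in s_b
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope of the least squares line
    a = y_bar - b * x_bar         # intercept of the least squares line
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))   # estimate of sigma_e, df = n - 2
    s_b = s_e / math.sqrt(sxx)            # estimated standard deviation of b
    return b, s_e, s_b

# Hypothetical illustrative data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, s_e, s_b = slope_summary(x, y)
print(f"b = {b:.3f}, s_e = {s_e:.4f}, s_b = {s_b:.4f}")
```

Because $\sum (x - \bar{x})^2$ sits in the denominator, the same $s_e$ paired with x values that are more spread out would give a smaller $s_b$.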
In the same way that $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ was used in Chapter 12 to develop a confidence interval for $\mu$, the t variable in the preceding box can be used to obtain a confidence interval for $\beta$.

Confidence Interval for $\beta$

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form

$$b \pm (t \text{ critical value}) \, s_b$$

where the t critical value is based on df = n − 2. Appendix Table 3 gives critical values corresponding to the most frequently used confidence levels.

The interval estimate of $\beta$ is centered at b and extends out from the center by an amount that depends on the sampling variability of b. When $s_b$ is small, the interval is narrow, implying that the investigator has relatively precise knowledge of the value of $\beta$. Calculation of a confidence interval for the slope of a population regression line is illustrated in Example 16.4.

In Section 7.2, you learned four key questions that guide the decision about what statistical inference method to consider in any particular situation. In Section 7.3, a five-step process for estimation problems was introduced. The four key questions of Section 7.2 were

Q  Question Type                      Estimation or hypothesis testing?
S  Study Type                         Sample data or experiment data?
T  Type of Data                       One variable or two? Categorical or numerical?
N  Number of Samples or Treatments    How many samples or treatments?

When the answers to these questions are

Q: estimation
S: sample data
T: two numerical variables
N: one sample

the method you will want to consider in a regression setting is the confidence interval for the slope of a population regression line.
Once you have selected the confidence interval for the slope of a population regression line as the method you want to consider, because this is an estimation problem you would follow the five-step process for estimation problems (EMC3).

Example 16.4 The Bison of Yellowstone Park

The dedicated work of conservationists for over 100 years has brought the bison in Yellowstone National Park from near extinction to a herd of over 3,000 animals. This recovery is a mixed blessing. Many bison have been exposed to the bacteria that cause brucellosis, a disease that infects domestic cattle, and there are many domestic cattle herds near Yellowstone. Because of concerns that free-ranging bison can infect nearby cattle, it is important to monitor and manage the size of the bison population and, if possible, keep bison from transmitting these bacteria to ranch cattle.

The article “Reproduction and Survival of Yellowstone Bison” (The Journal of Wildlife Management [2007]: 2365–2372) described a large multiyear study of the factors that influence bison movement and herd size. The researchers studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence reproduction is stress due to accumulated snow, which makes foraging more difficult for the pregnant bison. Data from 1981–1997 on y = spring calf ratio (SCR) and x = previous fall snow-water equivalent (SWE) are shown in the accompanying table. Spring calf ratio is the ratio of calves to adults, a measure of reproductive success. The researchers were interested in estimating the mean change in spring calf ratio associated with each additional cm in snow-water equivalent.

Let’s answer the four key questions for this problem.
SCR     SWE       SCR     SWE
0.19    1,933     0.22    3,317
0.14    4,906     0.22    3,332
0.21    3,072     0.18    3,511
0.23    2,543     0.21    3,907
0.26    3,509     0.25    2,533
0.19    3,908     0.19    4,611
0.29    2,214     0.22    6,237
0.23    2,816     0.17    7,279
0.16    4,128

Q  Question Type                      Estimation or hypothesis testing?                  Estimation
S  Study Type                         Sample data or experiment data?                    Sample data
T  Type of Data                       One variable or two? Categorical or numerical?     Two numerical variables
N  Number of Samples or Treatments    How many samples or treatments?                    One sample (regression)

The answers are estimation, sample data, two numerical variables, one sample. This combination of answers suggests considering a confidence interval for the slope of a population regression line.

You can now use the five-step process (EMC3) to estimate the slope of the population regression line.

Estimate: In this example, the value of $\beta$, the mean increase in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.

Method: Because the answers to the four key questions are estimation, sample data, two numerical variables, one sample, a confidence interval for $\beta$, the slope of the population regression line, will be considered. For this example, a 95% confidence level will be used.

Check: The four basic assumptions of the simple linear regression model need to be met in order to use the confidence interval. The investigators collected data from 17 successive years. To proceed, you would need to assume that these years are representative of yearly circumstances at Yellowstone, and that each year’s reproduction and snowfall are independent of previous years. You should keep this in mind when you get to the step that involves interpretation. A scatterplot of the data is shown here. The pattern in the plot looks linear and the spread does not seem to be different for different values of x.
[Figure: scatterplot of SCR versus SWE]

A boxplot of the residuals is also shown.

[Figure: boxplot of the residuals]

Because the boxplot is approximately symmetric and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.

Calculate: JMP regression output is shown here:

Linear Fit
SCR = 0.2606561 − 0.0136639*SWE

Summary of Fit
RSquare                       0.257644
RSquare Adj                   0.208153
Root Mean Square Error        0.033513
Mean of Response              0.209412
Observations (or Sum Wgts)    17

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   0.2606561    0.023885    10.91     <.0001*
SWE         −0.013664    0.005989    −2.28     0.0375*

(The Std Error entry in the SWE row is $s_b$.)

df = n − 2 = 17 − 2 = 15

The t critical value for a 95% confidence level and df = 15 is 2.13.

$$b \pm (t \text{ critical value}) \, s_b = -0.0137 \pm (2.13)(0.00599) = (-0.0265, -0.0009)$$

Communicate Results:
Confidence interval: You can be 95% confident that the true average change in spring calf ratio associated with an increase of 1 cm in the snow-water equivalent is between −0.0265 and −0.0009.
Confidence level: The method used to construct this interval estimate is successful in capturing the actual value of the slope of the population regression line about 95% of the time.

Hypothesis Tests Concerning $\beta$

Hypotheses about $\beta$ can be tested using a t test similar to the t tests introduced in Chapters 12 and 13. The null hypothesis states that $\beta$ has a specified hypothesized value. The t statistic results from standardizing b, the estimate of $\beta$, under the assumption that $H_0$ is true. When $H_0$ is true, the sampling distribution of this statistic is the t distribution with df = n − 2.
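Before turning to the test procedure, the interval arithmetic from Example 16.4 can be reproduced with a short Python sketch. The helper name `slope_confidence_interval` is hypothetical; the inputs are the rounded values used in the example, and the t critical value 2.13 is taken from the text rather than computed.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper) for b +/- (t critical value) * s_b."""
    margin = t_critical * s_b   # margin of error for the slope estimate
    return b - margin, b + margin

# Rounded values from Example 16.4
b = -0.0137         # sample slope
s_b = 0.00599       # estimated standard deviation of b
t_critical = 2.13   # 95% confidence level, df = 17 - 2 = 15 (from the text)

lower, upper = slope_confidence_interval(b, s_b, t_critical)
print(f"({lower:.4f}, {upper:.4f})")   # prints (-0.0265, -0.0009)
```

Both endpoints are negative, which agrees with the interpretation given in the Communicate Results step.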
Hypothesis Test for the Slope of the Population Regression Line, $\beta$

Appropriate when the four basic assumptions of the simple linear regression model are reasonable:
1. The distribution of e at any particular x value has mean value 0 (that is, $\mu_e = 0$).
2. The standard deviation of e is $\sigma_e$, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, e_3, \ldots, e_n$ associated with different observations are independent of one another.

When these conditions are met, the following test statistic can be used:

$$t = \frac{b - \beta_0}{s_b}$$

where $\beta_0$ is the hypothesized value from the null hypothesis.

Form of the null hypothesis: $H_0: \beta = \beta_0$

When the assumptions of the simple linear regression model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n − 2.

Associated P-value:

When the alternative hypothesis is…    The P-value is…
$H_a: \beta > \beta_0$                 Area to the right of the computed t under the appropriate t curve
$H_a: \beta < \beta_0$                 Area to the left of the computed t under the appropriate t curve
$H_a: \beta \neq \beta_0$              2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

This test is a method you should consider when the answers to the four key questions are hypothesis testing, sample data, two numerical variables, one sample. You would carry out this test using the five-step process for hypothesis testing problems (HMC3).

Inference for a population slope generally focuses on two questions:
(1) Is the population slope different from zero?
(2) What are plausible values for the population slope?

The question of plausible values can be addressed by calculating a confidence interval for the population slope.
The question of whether a population slope is equal to zero can be answered by using the hypothesis testing procedure with a null hypothesis $H_0: \beta = 0$. This test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$ is called the model utility test for simple linear regression. The default computer output for inference for a regression slope is for the model utility test.

When the null hypothesis of the model utility test is true, the population regression line is a horizontal line, and the value of y in the simple linear regression model does not depend on x. That is,

$$y = \alpha + \beta x + e = \alpha + 0x + e = \alpha + e$$

If $\beta$ is in fact equal to 0, knowledge of x will be of no use; it will have no “utility” for predicting y. On the other hand, if $\beta$ is different from 0, there is a useful linear relationship between x and y, and knowledge of x is useful for predicting y. This is illustrated by the scatterplots in Figure 16.10.

[Figure 16.10: (a) $\beta = 0$; (b) $\beta \neq 0$]

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of

$H_0: \beta = 0$ versus $H_a: \beta \neq 0$

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y. If $H_0$ is rejected, you can conclude that the simple linear regression model is useful for predicting y. The test statistic is the t ratio

$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$

It is recommended that the model utility test be carried out before using the estimated regression line to make inferences.

Example 16.5 The British (Musical) Invasion

Have you experienced a sudden flood of memory when scanning from station to station on your car radio and recognized a song from your past?
Perhaps you could remember the title of the song, the artist, and even when the song was released. From a seemingly small amount of information you were able to recover a great deal of the song’s context from memory. The article “Plink: ‘Thin slices’ of Music” (Krumhansl, C. Music Perception [2010]: 337–354) describes a study of this phenomenon. The investigator compiled a list of songs from Rolling Stone, Billboard, and Blender lists of songs plus some recent songs familiar to college students. Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week. After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released. The accompanying table shows the actual release year and the average of the release years given by the students.

Actual and Judged Release Years

Actual    Judged    Actual    Judged    Actual    Judged    Actual    Judged
Release   Release   Release   Release   Release   Release   Release   Release
1998      1997.2    1976      1983.3    1976      1988.0    1970      1985.4
1967      1973.7    2008      1995.0    2006      1996.7    1975      1985.9
1998      1996.3    1971      1979.8    1974      1985.4    1991      1993.3
1999      1993.3    1965      1976.8    2007      1999.8    2008      1995.4
1983      1985.4    1967      1975.0    1976      1987.2    1965      1977.6
1982      1988.0    1971      1978.0    1974      1977.6    1987      1990.7
1965      1970.2    1967      1978.0    1970      1982.8    1975      1986.3
1991      1992.8    1984      1983.3    1971      1976.3    1968      1986.7
1983      1984.1    1984      1989.8    1999      1988.5    1987      1988.0
1976      1979.3    1968      1976.7    1997      1994.1    2008      1990.2
1971      1975.4    1965      1978.5    2006      1995.4    1982      1991.1
1981      1984.6    1965      1977.2    1981      1989.3    1979      1983.7
1967      1973.7    1979      1986.7    2008      1993.7    2000      1989.8
2007      1997.2    1997      1996.3    1965      1981.1    2000      1991.1

The actual release years ranged from 1965 (The Beatles, “Help”) to 2008 (Katy Perry, “I Kissed a Girl”). Is there a relationship between the judged and actual release year for these songs?
A scatterplot of the data (Figure 16.11) suggests that there is a linear relationship between these two variables, but this can be confirmed using the model utility test. With x = actual release year and y = judged release year, the equation of the estimated regression line is $\hat{y} = 1095 + 0.449x$. The five-step process for hypothesis testing can be used to carry out the model utility test.

[Figure 16.11: Scatterplot of judged release year versus actual release year]

H Hypotheses: In the model utility test, the null hypothesis is that there is no useful relationship between the actual and the judged release year: $H_0: \beta = 0$. The alternative hypothesis specifies that there is a useful relationship: $\beta \neq 0$.
Hypotheses:
Null hypothesis: $H_0: \beta = 0$
Alternative hypothesis: $H_a: \beta \neq 0$

M Method: Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. The test statistic for this test is

$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$

The value of 0 in the test statistic is the hypothesized value from the null hypothesis. For this example, a significance level of 0.05 will be used.
Significance level: $\alpha = 0.05$

C Check: In Section 16.3, you will see how to check whether the four assumptions of the simple linear regression model are reasonable. For this example, you can assume that these assumptions are reasonable and proceed with the model utility test.
C Calculate: JMP output is shown here:

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                       0.771
RSquare Adj                   0.766759
Root Mean Square Error        3.59844
Mean of Response              1986.013
Observations (or Sum Wgts)    56

Lack of Fit
Analysis of Variance

Parameter Estimates
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        1095.1525    66.07159    16.58     <.0001*
Actual Release   0.449281     0.033321    13.48     <.0001*

(The Std Error entry in the Actual Release row is $s_b$.)

Test statistic:

$$t = \frac{b - 0}{s_b} = \frac{0.449 - 0}{0.0333} = 13.48$$

Associated P-value:
P-value = twice the area under the t curve to the right of 13.48 = $2P(t > 13.48) \approx 0$

C Communicate results: Because the P-value is less than the selected significance level, the null hypothesis is rejected.
Decision: Reject $H_0$.
Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the actual release year and the judged release year.

Because the model utility test confirms that there is a useful linear relationship between judged release year and actual release year, it would be reasonable to use the estimated regression model to predict the judged release year for a given song based on its actual release year. Of course, before you do this, you would also want to evaluate the accuracy of predictions by looking at the value of $s_e$.

When $H_0: \beta = 0$ cannot be rejected using the model utility test at a reasonably small significance level, the search for a useful model must continue. One possibility is to relate y to x using a nonlinear model, an appropriate strategy if the scatterplot shows curvature.
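The t ratio and P-value reported in Example 16.5 can be checked with a short Python sketch that uses only the standard library. The function names `t_pdf` and `two_sided_p_value` are hypothetical, and the P-value is approximated by numerically integrating the t density with Simpson's rule; statistical software uses more refined routines, so treat this as an approximation under those assumptions.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def two_sided_p_value(t, df, steps=2000):
    """Approximate 2 * P(T > |t|) using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3           # approximates P(0 < T < |t|)
    return max(0.0, 1 - 2 * central_area)  # guard against tiny negative round-off

# Values from the JMP output in Example 16.5
b, s_b, n = 0.449281, 0.033321, 56
t_ratio = (b - 0) / s_b                    # hypothesized slope is 0
p_value = two_sided_p_value(t_ratio, df=n - 2)
print(f"t = {t_ratio:.2f}, df = {n - 2}, P-value = {p_value:.4f}")
```

As a sanity check on the approximation, calling `two_sided_p_value(-2.28, 15)` with the t ratio from Example 16.4 gives a value close to the 0.0375 reported in that JMP output.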
Section 16.2 Exercises

Each exercise set assesses the following chapter learning objectives: C2, M3, M4, M5, M6, P1, P2

Section 16.2 Exercise Set 1

16.13 The standard deviation of the errors, $\sigma_e$, is an important part of the linear regression model.
a. What is the relationship between the value of $\sigma_e$ and the value of the test statistic in a test of hypotheses about $\beta$?
b. What is the relationship between the value of $\sigma_e$ and the width of a confidence interval for $\beta$?

16.14 A journalist is reporting about some research on appropriate amounts of sleep for people 9 to 19 years of age. In that research, a linear regression model is used to describe the relationship between alertness and number of hours of sleep the night before. The researchers reported a 95% confidence interval, but newspapers usually report an estimate and a margin of error.
a. In order to calculate a margin of error from the reported confidence interval, what additional conditions, if any, need to be verified?
b. In order to calculate a margin of error from the reported confidence interval, what additional information, if any, is needed?

16.15 A nursing student has completed his final project, and is preparing for a meeting with his project advisor. The subject of his project was the relationship between systolic blood pressure (SBP) and body mass index (BMI). The last time he met with his advisor he had completed his measurements, but only entered half his data into his statistical software. For the data he had entered, the necessary conditions for inference for $\beta$ were met. In a short paragraph, explain, using appropriate statistical terminology, which of the conditions below must be rechecked.
1. The standard deviation of e is the same for all values of x.
2. The distribution of e at any particular x value is normal.
16.16 Consider the accompanying data on x = research and development expenditure (thousands of dollars) and y = growth rate (% per year) for eight different industries.

x       y
2024    1.90
5038    3.96
905     2.44
3572    0.88
1157    0.37
327     −0.90
378     0.49
191     1.01

a. Would a simple linear regression model provide useful information for predicting growth rate from research and development expenditure? Use a .05 level of significance.
b. Use a 90% confidence interval to estimate the average change in growth rate associated with a $1000 increase in expenditure. Interpret the resulting interval.

16.17 The paper “The Effects of Split Keyboard Geometry on Upper Body Postures” (Ergonomics [2009]: 104–111) describes a study to determine the effects of several keyboard characteristics on typing speed. One of the variables considered was the front-to-back surface angle of the keyboard. Minitab output resulting from fitting the simple linear regression model with x = surface angle (degrees) and y = typing speed (words per minute) is given below.

Regression Analysis: Typing Speed versus Surface Angle

The regression equation is
Typing Speed = 60.0 + 0.0036 Surface Angle

Predictor        Coef       SE Coef    T         P
Constant         60.0286    0.2466     243.45    0.000
Surface Angle    0.00357    0.03823    0.09      0.931

S = 0.511766   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source            DF    SS        MS        F       P
Regression        1     0.0023    0.0023    0.01    0.931
Residual Error    3     0.7857    0.2619
Total             4     0.7880

a. Suppose that the basic assumptions of the simple linear regression model are met. Carry out a hypothesis test to decide if there is a useful linear relationship between x and y.
b. Are the values of $s_e$ and $r^2$ consistent with the conclusion from Part (a)? Explain.

16.18 Do taller adults make more money?
The authors of the paper “Stature and Status: Height, Ability, and Labor Market Outcomes” (Journal of Political Economy [2008]: 499–532) investigated the association between height and earnings. They used the simple linear regression model to describe the relationship between x = height (in inches) and y = log(weekly gross earnings in dollars) in a very large sample of men. The logarithm of weekly gross earnings was used because this transformation resulted in a relationship that was approximately linear. The paper reported that the slope of the estimated regression line was b = 0.023 and the standard deviation of b was $s_b = 0.004$. Carry out a hypothesis test to decide if there is convincing evidence of a useful linear relationship between height and the logarithm of weekly earnings. You can assume that the basic assumptions of the simple linear regression model are met.

16.19 The effects of grazing animals on grasslands have been the focus of numerous investigations by ecologists. One such study, reported in “The Ecology of Plants, Large Mammalian Herbivores, and Drought in Yellowstone National Park” (Ecology [1992]: 2043–2058), proposed using the simple linear regression model to relate y = green biomass concentration (g/cm³) to x = elapsed time since snowmelt (days).
a. The estimated regression equation was given as $\hat{y} = 106.3 - 0.640x$. What is the estimate of average change in biomass concentration associated with a 1-day increase in elapsed time?
b. What value of biomass concentration would you predict when elapsed time is 40 days?
c. The sample size was n = 58, and the reported value of the coefficient of determination was 0.470. What does this tell you about the linear relationship between the two variables?

Section 16.2 Exercise Set 2

16.20 Consider a test of hypotheses about $\beta$, the population slope in a linear regression model.
a.
If you reject the null hypothesis, $\beta = 0$, what does this mean in terms of a linear relationship between x and y?
b. If you fail to reject the null hypothesis, $\beta = 0$, what does this mean in terms of a linear relationship between x and y?

16.21 Researchers studying pleasant touch sensations measured the firing frequency (impulses per second) of nerves that were stimulated by a light brushing stroke on the forearm and also recorded the subject’s numerical rating of how pleasant the sensation was. The accompanying data were read from a graph in the paper “Coding of Pleasant Touch by Unmyelinated Afferents in Humans” (Nature Neuroscience, April 12, 2009).

Firing Frequency       23    24    22    25    27    28    34    33    36    34
Pleasantness Rating    0.2   1.0   1.2   1.2   1.0   2.0   2.3   2.2   2.4   2.8

a. Estimate the mean change in pleasantness rating associated with an increase of 1 impulse per second in firing frequency using a 95% confidence interval. Interpret the resulting interval.
b. Carry out a hypothesis test to decide if there is convincing evidence of a useful linear relationship between firing frequency and pleasantness rating.

16.22 The largest commercial fishing enterprise in the southeastern United States is the harvest of shrimp. In a study described in the paper “Long-term Trawl Monitoring of White Shrimp, Litopenaeus setiferus (Linnaeus), Stocks within the ACE Basin National Estuarine Research Reserve, South Carolina” (Journal of Coastal Research [2008]: 193–199), researchers monitored variables thought to be related to the abundance of white shrimp. One variable the researchers thought might be related to abundance is the amount of oxygen in the water. The relationship between mean catch per tow of white shrimp and oxygen concentration was described by fitting a regression line using data from ten randomly selected offshore sites. (The “catch” per tow is the number of shrimp caught in a single outing.) Computer output is shown below.
The regression equation is
Mean catch per tow = -5859 + 97.2 O2 Saturation

Predictor        Coef    SE Coef   T       P
Constant         -5859   2394      -2.45   0.040
O2 Saturation    97.22   34.63     2.81    0.023

S = 481.632   R-Sq = 49.6%   R-Sq(adj) = 43.3%

a. Is there convincing evidence of a useful linear relationship between the shrimp catch per tow and oxygen concentration density? Explain.
b. Would you describe the relationship as strong? Why or why not?
c. Construct a 95% confidence interval for $\beta$ and interpret it in context.
d. What margin of error is associated with the confidence interval in Part (c)?

16.23 The authors of the paper “Decreased Brain Volume in Adults with Childhood Lead Exposure” (Public Library of Science Medicine [May 27, 2008]: e112) studied the relationship between childhood environmental lead exposure and a measure of brain volume change in a particular region of the brain. Data were given for x = mean childhood blood lead level (μg/dL) and y = brain volume change (BVC, in percent). A subset of data read from a graph that appeared in the paper was used to produce the accompanying Minitab output.

Regression Analysis: BVC versus Mean Blood Lead Level
The regression equation is
BVC = -0.00179 - 0.00210 Mean Blood Lead Level

Predictor               Coef         SE Coef     T       P
Constant                -0.001790    0.008303    -0.22   0.830
Mean Blood Lead Level   -0.0021007   0.0005743   -3.66   0.000

Carry out a hypothesis test to decide if there is convincing evidence of a useful linear relationship between x and y. You can assume that the basic assumptions of the simple linear regression model are met.

Additional Exercises

16.24
a. Explain the difference between the line $y = \alpha + \beta x$ and the line $\hat{y} = a + bx$.
b. Explain the difference between $\beta$ and b.
c. Let x* denote a particular value of the independent variable. Explain the difference between $\alpha + \beta x^*$ and $a + bx^*$.
d. Explain the difference between $\sigma$ and $s_e$.

16.25 What is the distinction between $\sigma_e$ and $s_e$?

16.26 The accompanying data were read from a plot (and are a subset of the complete data set) given in the article “Cognitive Slowing in Closed-Head Injury” (Brain and Cognition [1996]: 429–440). The data represent the mean response times for a group of individuals with closed-head injury (CHI) and a matched control group without head injury on 10 different tasks. Each observation was based on a different study, and used different subjects, so it is reasonable to assume that the observations are independent.

        Mean Response Time
Study   Control   CHI
1       250       303
2       360       491
3       475       659
4       525       683
5       610       922
6       740       1044
7       880       1421
8       920       1329
9       1010      1481
10      1200      1815

a. Fit a linear regression model that would allow you to predict the mean response time for those suffering a closed-head injury from the mean response time on the same task for individuals with no head injury.
b. Do the sample data support the hypothesis that there is a useful linear relationship between the mean response time for individuals with no head injury and the mean response time for individuals with CHI? Test the appropriate hypotheses using $\alpha$ = 0.05.

16.27 The article “Photocharge Effects in Dye Sensitized Ag[Br,I] Emulsions at Millisecond Range Exposures” (Photographic Science and Engineering [1981]: 138–144) gave the accompanying data on x = % light absorption and y = peak photovoltage.

x   4.0    8.7    12.7   19.1   21.4   24.6   28.9   29.8   30.5
y   0.12   0.28   0.55   0.68   0.85   1.02   1.15   1.34   1.29

JMP output for these data is shown below.

Bivariate Fit of PeakPhotoVoltage By %LightAbsorption
[Scatterplot of PeakPhotoVoltage versus %LightAbsorption with the fitted line]

Linear Fit
PeakPhotoVoltage = -0.082594 + 0.0446485 * %LightAbsorption

Summary of Fit
RSquare                       0.982731
RSquare Adj                   0.980264
Root Mean Square Error        0.061117
Mean of Response              0.808889
Observations (or Sum Wgts)    9

Parameter Estimates
Term               Estimate    Std Error   t Ratio   Prob>|t|
Intercept          -0.082594   0.049093    -1.68     0.1364
%LightAbsorption   0.0446485   0.002237    19.96     <.0001*

a. What does the scatterplot suggest about the relationship between the peak photovoltage and the percent of light absorption?
b. What is the equation of the estimated regression line?
c. How much of the observed variation in peak photovoltage can be explained by the model relationship?
d. Predict peak photovoltage when percent absorption is 19.1, and compute the value of the corresponding residual.
e. The authors claimed that there is a useful linear relationship between the two variables. Do you agree? Carry out a formal test.
f. Give an estimate of the average change in peak photovoltage associated with a 1 percentage point increase in light absorption. Your estimate should convey information about the precision of estimation.

16.28 In anthropological studies, an important characteristic of fossils is cranial capacity. Frequently skulls are at least partially decomposed, so it is necessary to use other characteristics to obtain information about capacity. One measure that has been used is the length of the lambda-opisthion chord. The article “Vertesszollos and the Presapiens Theory” (American Journal of Physical Anthropology [1971]) reported the accompanying data for n = 7 Homo erectus fossils.

x (chord length in mm)   78    75    78    81    84    86     87
y (capacity in cm³)      850   775   750   975   915   1015   1030

Suppose that from previous evidence, anthropologists had believed that for each 1-mm increase in chord length, cranial capacity would be expected to increase by 20 cm³. Do these new experimental data provide convincing evidence against this prior belief?

16.29 Suppose you are given the computer output shown. You are interested in testing the null hypothesis $\beta = 1.0$ versus an alternative hypothesis of $\beta > 1.0$. Describe how you would use the given computer output to test these hypotheses.

Linear Fit
y = 5.6452776 + 0.9797401 * x

Summary of Fit
RSquare                       0.985289
RSquare Adj                   0.984954
Root Mean Square Error        12.48525
Mean of Response              0.791304
Observations (or Sum Wgts)    46

Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   5.6452776   1.84302     3.06      0.0037*
x           0.9797401   0.018048    54.29     <.0001*

Unless otherwise noted, all content on this page is © Cengage Learning.

Section 16.3 Checking Model Adequacy

Section 16.2 introduced methods for estimating and testing hypotheses about $\beta$, the slope in the simple linear regression model

$$y = \alpha + \beta x + e$$

In this model, e represents the random deviation of a y value from the population regression line $\alpha + \beta x$. The methods presented in Section 16.2 require that some assumptions about the random deviations in the simple linear regression model be met in order for inferences to be valid. These assumptions include:
1. At any particular x value, the distribution of e is normal.
2. At any particular x value, the standard deviation of e is $\sigma_e$, which is constant over all values of x (that is, $\sigma_e$ does not depend on x).

Inferences based on the simple linear regression model are still appropriate if model assumptions are slightly violated (for example, mild skew in the distribution of e). However, interpreting a confidence interval or the result of a hypothesis test when assumptions are seriously violated can result in misleading conclusions. For this reason, it is important to be able to detect any serious violations.
Residual Analysis

If the deviations $e_1, e_2, \ldots, e_n$ from the population line were available, they could be examined for any inconsistencies with model assumptions. For example, a normal probability plot of these deviations would suggest whether or not the normality assumption was plausible. However, because these deviations are

$$e_1 = y_1 - (\alpha + \beta x_1), \quad \ldots, \quad e_n = y_n - (\alpha + \beta x_n)$$

they can be calculated only if $\alpha$ and $\beta$ are known. In practice, this will almost never be the case. Instead, diagnostic checks must be based on the residuals

$$y_1 - \hat{y}_1 = y_1 - (a + b x_1), \quad \ldots, \quad y_n - \hat{y}_n = y_n - (a + b x_n)$$

which are the deviations from the estimated regression line. When all model assumptions are met, the mean value of the residuals at any particular x value is 0. Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition. Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals. Recall that a quantity is standardized by subtracting its mean value (0 in this case) and dividing by its actual or estimated standard deviation:

$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$$

The value of a standardized residual tells you the distance (in standard deviations) of the corresponding residual from its expected value, 0. Because residuals at different x values have different standard deviations (depending on the value of x for that observation)¹, computing the standardized residuals can be tedious. Fortunately, many computer regression programs provide standardized residuals.
Example 16.6 Revisiting the Elk

Example 16.3 introduced data on x = chest girth (in cm) and y = weight (in kg) for a sample of 19 Rocky Mountain elk. (See Example 16.3 for a more detailed description of the study.) Inspection of the scatterplot in Figure 16.12 suggests the data are consistent with the assumptions of the simple linear regression model.

Figure 16.12 Scatterplot for the elk data (weight in kg versus girth in cm)

¹ The estimated standard deviation of the i-th residual, $y_i - \hat{y}_i$, is

$$s_e \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum (x - \bar{x})^2}}$$

The data, residuals, and the standardized residuals (computed using Minitab) are given in Table 16.1. For the largest positive residual, 38.1397, the standardized residual is 1.81294. That is, the residual is approximately 1.8 standard deviations above its expected value of 0. This value is not particularly unusual in a sample of this size. Also notice that for the negative residual with the largest magnitude, −38.2661, the standardized residual is −1.92313, still not unusual in a sample of this size. On the standardized scale, no residual here is surprisingly large.
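The standardized residuals for the elk data can be reproduced from the footnote formula without a statistics package. The following is a minimal sketch in plain Python, not the Minitab procedure itself; the helper names `fit_line` and `residual_table` are hypothetical and introduced only for illustration, and the data are the girth and weight values listed in Table 16.1.

```python
import math

# Elk data as listed in Table 16.1: x = chest girth (cm), y = weight (kg).
girth = [96, 105, 108, 109, 110, 114, 121, 124, 131, 135,
         137, 138, 140, 142, 157, 157, 159, 155, 162]
weight = [98, 196, 163, 196, 183, 171, 230, 225, 211, 231,
          225, 266, 241, 264, 284, 292, 300, 337, 339]


def fit_line(x, y):
    """Least squares intercept a, slope b, and Sxx = sum of (x - x_bar)^2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b, sxx


def residual_table(x, y):
    """Return (residuals, standardized residuals) for the fitted line."""
    n = len(x)
    x_bar = sum(x) / n
    a, b, sxx = fit_line(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # s_e: standard deviation about the least squares line (n - 2 df).
    s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Each residual is divided by its own estimated standard deviation,
    # which depends on how far x_i is from x_bar (footnote formula).
    standardized = [
        e / (s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / sxx))
        for xi, e in zip(x, residuals)
    ]
    return residuals, standardized


residuals, standardized = residual_table(girth, weight)
for i, (e, z) in enumerate(zip(residuals, standardized), start=1):
    print(f"{i:2d}  residual = {e:9.4f}  standardized = {z:8.5f}")
```

Because the divisor shrinks as $x_i$ moves away from $\bar{x}$, two observations with similar residuals can have noticeably different standardized residuals, which is why the two largest residuals in the example standardize to about 1.8 and 1.9 in magnitude rather than to the same value.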
Table 16.1 Data, residuals, and standardized residuals for the elk data

Observation   Girth (cm) x   Weight (kg) y   ŷ         Residual   Standardized Residual
1             96             98              136.266   -38.2661   -1.92313
2             105            196             161.069   34.9314    1.68004
3             108            163             169.336   -6.3361    -0.30135
4             109            196             172.092   23.9080    1.13323
5             110            183             174.848   8.1522     0.38517
6             114            171             185.871   -14.8711   -0.69477
7             121            230             205.162   24.8380    1.14452
8             124            225             213.429   11.5705    0.53117
9             131            211             232.720   -21.7203   -0.99323
10            135            231             243.744   -12.7436   -0.58320
11            137            225             249.255   -24.2553   -1.11135
12            138            266             252.011   13.9889    0.64147
13            140            241             257.523   -16.5228   -0.75921
14            142            264             263.034   0.9655     0.04448
15            157            284             304.372   -20.3720   -0.97540
16            157            292             304.372   -12.3720   -0.59236
17            159            300             309.884   -9.8837    -0.47699
18            155            337             298.860   38.1397    1.81294
19            162            339             318.151   20.8488    1.01967

Next, consider the assumption of the normality of the e’s. Figure 16.13 shows boxplots of the residuals and standardized residuals. The boxplots are approximately symmetric and there are no outliers, so the assumption of normally distributed errors seems reasonable.

Figure 16.13 Boxplots of residuals and standardized residuals for the elk data.

Notice that the boxplots of the residuals and standardized residuals are nearly identical. While it is preferable to work with the standardized residuals, if you do not have access to a computer package or calculator that will produce standardized residuals, a plot of the unstandardized residuals should suffice.

A normal probability plot of the standardized residuals (or the residuals) is another way to assess whether it is reasonable to assume that $e_1, e_2, \ldots, e_n$ all come from the same normal distribution. An advantage of the normal probability plot, shown in Figure 16.14, is that the value of each residual can be seen, which provides more information about the distribution.
The patterns in the normal probability plots of the standardized residuals and of the residuals for the elk data are reasonably straight, confirming that the assumption of normality of the error distribution is reasonable. Also notice that the pattern in both normal probability plots is similar, so you don’t need to construct both. Either plot could be used.

Figure 16.14 Normal probability plots of residuals and standardized residuals for the elk data

Plotting the Residuals

AP* exam tip: When considering linear regression, your first step should be to study the scatterplot and a residual plot. These two plots provide important information about whether a linear model is appropriate.

A plot of the (x, residual) pairs is called a residual plot, and a plot of the (x, standardized residual) pairs is a standardized residual plot. Residual and standardized residual plots typically exhibit the same general shapes. If you are using a computer package or graphing calculator that calculates standardized residuals, the standardized residual plot is recommended. If not, it is acceptable to use the unstandardized residual plot instead.

A standardized residual plot or a residual plot is often helpful in identifying unusual or highly influential observations and in checking for violations of model assumptions. A desirable plot is one that exhibits no particular pattern (such as curvature or a much greater spread in one part of the plot than in another) and that has no point that is far removed from all the others.
A point in the residual plot falling far above or far below the horizontal line at height 0 corresponds to a large residual, which can indicate unusual behavior, such as a recording error, a nonstandard experimental condition, or an atypical experimental subject. A point with an x value that differs greatly from others in the data set could have exerted excessive influence in determining the estimated regression line.

A standardized residual plot such as the one pictured in Figure 16.15(a) is desirable, because no point lies much outside the horizontal band between −2 and 2 (so there is no unusually large residual corresponding to an outlying observation). There is no point far to the left or right of the others (which could indicate an observation that might greatly influence the estimated line), and there is no pattern to indicate that the model should somehow be modified.

When the plot has the appearance of Figure 16.15(b), the fitted model should be changed to incorporate curvature (a nonlinear model).

The increasing spread from left to right in Figure 16.15(c) suggests that the variance of y is not the same at each x value but rather increases with x. A straight-line model may still be appropriate, but the best-fit line should be obtained by using weighted least squares rather than ordinary least squares. This involves giving more weight to observations in the region exhibiting low variability and less weight to observations in the region exhibiting high variability. A specialized regression analysis textbook or a statistician should be consulted for more information on using weighted least squares.

The standardized residual plots of Figures 16.15(d) and 16.15(e) show an outlier (a point with a large standardized residual) and a potentially influential observation, respectively. Consider deleting the observation corresponding to such a point from the data set and refitting a line.
Substantial changes in estimates and various other quantities are a signal that a more careful analysis should be carried out before proceeding.

AP* exam tip: Notice the difference between an outlier (an observation that is far removed from the other observations in the y direction) and a potentially influential observation (an observation that is far removed from the other observations in the x direction).

Figure 16.15 Examples of residual plots: (a) satisfactory plot; (b) plot suggesting that a curvilinear regression model is needed; (c) plot indicating nonconstant variance; (d) plot showing a large residual; (e) plot showing a potentially influential observation.

Example 16.7 Snow Cover and Temperature

The article “Snow Cover and Temperature Relationships in North America and Eurasia” (Journal of Climate and Applied Meteorology [1983]: 460–469) explored the relationship between October–November continental snow cover (x, in millions of square kilometers) and December–February temperature (y, in °C). The following data refer to Eurasia during the n = 13 time periods (1969–1970, 1970–1971, …, 1981–1982):

x       y       Standardized Residual
13.00   -13.5   -0.11
12.75   -15.7   -2.19
16.70   -15.5   -0.36
18.85   -14.7   1.23
16.60   -16.1   -0.91
15.35   -14.6   -0.12
13.90   -13.4   0.34
22.40   -18.9   -1.54
16.20   -14.8   0.04
16.70   -13.6   1.25
13.65   -14.0   -0.28
13.90   -12.0   1.54
14.75   -13.5   0.58

A simple linear regression analysis described in the article included $r^2$ = 0.52 and r = −0.72 (the slope is negative, so r is negative as well), suggesting a significant linear relationship.
This is confirmed by a model utility test. The scatterplot and standardized residual plot are displayed in Figure 16.16. There are no unusual patterns, although one standardized residual, −2.19, is a bit on the large side. The most interesting feature is the observation (22.40, −18.9), corresponding to a point far to the right of the others in these plots. This observation may have had a substantial influence on the estimated regression line. The estimated slope when all 13 observations are included is b = −0.459, and $s_b$ = 0.133. When the potentially influential observation is deleted, the estimate of $\beta$ based on the remaining 12 observations is b = −0.228. The change in slope is

change in slope = original b − new b = −0.459 − (−0.228) = −0.231

The change expressed in standard deviations is −0.231/0.133 = −1.74. Because b has changed by substantially more than 1 standard deviation, the observation under consideration appears to be highly influential.

Figure 16.16 Plots for the data of Example 16.7: (a) scatterplot of TEMP versus SNOW; (b) standardized residual plot (STRESID versus SNOW), with the potentially influential observation labeled

In addition, $r^2$ based just on the 12 observations is only 0.13, and the t ratio for testing $\beta = 0$ is not significant. Evidence for a linear relationship is much less conclusive in light of this analysis.
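The delete-and-refit check used in this example can be sketched in a few lines. The code below is an illustrative sketch in plain Python using the 13 snow cover and temperature values from the example; the function name `slope_and_sb` is hypothetical, and choosing the point with the largest x value as the one to delete is an assumption that simply mirrors the observation singled out in the text.

```python
import math

# Eurasian data from Example 16.7: x = snow cover, y = temperature.
snow = [13.00, 12.75, 16.70, 18.85, 16.60, 15.35, 13.90,
        22.40, 16.20, 16.70, 13.65, 13.90, 14.75]
temp = [-13.5, -15.7, -15.5, -14.7, -16.1, -14.6, -13.4,
        -18.9, -14.8, -13.6, -14.0, -12.0, -13.5]


def slope_and_sb(x, y):
    """Return the least squares slope b and its estimated standard deviation s_b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    return b, s_e / math.sqrt(sxx)


# Fit with all 13 observations.
b_all, sb_all = slope_and_sb(snow, temp)

# Delete the point farthest to the right, (22.40, -18.9), and refit.
idx = snow.index(max(snow))
snow_del = snow[:idx] + snow[idx + 1:]
temp_del = temp[:idx] + temp[idx + 1:]
b_del, _ = slope_and_sb(snow_del, temp_del)

# Express the change in slope in units of the original s_b.
change = b_all - b_del
change_in_sd = change / sb_all
print(f"b (all 13) = {b_all:.3f}, s_b = {sb_all:.3f}")
print(f"b (12 obs) = {b_del:.3f}")
print(f"change = {change:.3f}, in standard deviations = {change_in_sd:.2f}")
```

Working from unrounded values gives a change of about −1.73 standard deviations, which agrees with the −1.74 obtained in the text from the rounded figures −0.231 and 0.133.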
The investigators should seek a climatological explanation for the influential observation and collect more data, which could be used to find a more useful model.

Example 16.8 Treadmill Time and Ski Time

The paper “Physiological Characteristics and Performance of Top U.S. Biathletes” (Medicine and Science in Sports and Exercise [1995]: 1302–1310) describes a study of the relationship between cardiovascular fitness (as measured by time to exhaustion running on a treadmill) and performance on a 20-kilometer ski race. Data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) for 11 athletes are shown in Table 16.2. Standardized residuals and residuals are also given.

AP* exam tip: Don’t forget to check assumptions. If you are used to checking assumptions before doing much in the way of calculation, it is sometimes easy to forget to check them in a regression setting. Be sure to step back and think about whether the four basic assumptions of the linear regression model are reasonable before making inferences about the population slope or using the estimated model to make predictions.

Is it reasonable to use the given data to construct a confidence interval or test hypotheses about $\beta$, the average change in ski time associated with a 1-min increase in treadmill time? It depends on whether two assumptions are reasonable: that the distribution of the deviations from the population regression line at any fixed x is approximately normal, and that the variance of this distribution does not depend on x. Constructing a normal probability plot of the standardized residuals and a standardized residual plot will provide insight into whether these assumptions are in fact reasonable.

Figure 16.17 Plots for Example 16.8: (a) normal probability plot of standardized residuals; (b) standardized residual plot
Table 16.2 Data, Residuals, and Standardized Residuals for Example 16.8

Observation   Treadmill   Ski Time   Residual   Standardized Residual
1             7.7         71.0       0.172      0.10
2             8.4         71.4       2.206      1.13
3             8.7         65.0       -3.494     -1.74
4             9.0         68.7       0.906      0.44
5             9.6         64.4       -1.994     -0.96
6             9.6         69.4       3.006      1.44
7             10.0        63.0       -2.461     -1.18
8             10.2        64.6       -0.394     -0.19
9             10.4        66.9       2.373      1.16
10            11.0        62.6       -0.527     -0.27
11            11.7        61.7       0.206      0.12

Figure 16.17 shows a normal probability plot of the standardized residuals and a standardized residual plot. The normal probability plot is quite straight, and the standardized residual plot does not show evidence of any patterns or of increasing spread.

Example 16.9 A New Pediatric Tracheal Tube

The article “Appropriate Placement of Intubation Depth Marks in a New Cuffed, Paediatric Tracheal Tube” (British Journal of Anaesthesia [2004]: 80–87) describes a study of the use of tracheal tubes in newborns and infants. Newborns and infants have small tracheas, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children aged 2 months to 14 years, the researchers examined the relationships between appropriate tracheal tube insertion depth and other variables such as height, weight, and age. A scatterplot and a standardized residual plot constructed using data on the insertion depth and height of the children (both measured in cm) are shown in Figure 16.18.

Figure 16.18 (a) Scatterplot for insertion depth vs. height data of Example 16.9; (b) standardized residual plot.

Residual plots like the one pictured in Figure 16.18(b) are desirable. No point lies much outside the horizontal band between −2 and 2 (so there are no unusually large residuals corresponding to outliers). There is no point far to the left or right of the others (no observation that might be influential), and there is no pattern of curvature or differences in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.

But consider what happens when the relationship between insertion depth and weight is examined. A scatterplot of insertion depth and weight (kg) is shown in Figure 16.19(a), and a standardized residual plot in Figure 16.19(b). While some curvature is evident in the original scatterplot, it is even more clearly visible in the standardized residual plot. A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights. When plots have this curved appearance and increasing variability in the residuals, the linear regression model is not appropriate.

Figure 16.19 (a) Scatterplot for insertion depth vs. weight data of Example 16.9; (b) standardized residual plot.

Example 16.10 Looking for Love in All the Right... Trees?

Treefrogs’ search for mating partners was examined in the article “The Cause of Correlations Between Nightly Numbers of Male and Female Barking Treefrogs (Hyla gratiosa) Attending Choruses” (Behavioral Ecology [2002]: 274–281).
A lek, in the world of animal behavior, is a cluster of males gathered in a relatively small area to exhibit courtship displays. The “female preference” hypothesis asserts that females will prefer larger leks over smaller leks, presumably because there are more males to choose from. The scatterplot and residual plot in Figure 16.20 show the relationship between the number of females and the number of males in observed leks of barking treefrogs. You can see that the unequal variance, which is noticeable in the scatterplot, is even more evident in the residual plot. This indicates that the assumptions of the linear regression model are not reasonable in this situation.

Figure 16.20 (a) Scatterplot for treefrog data of Example 16.10 (number of females versus number of males); (b) residual plot

Section 16.3 Exercises

Each exercise set assesses the following chapter learning objectives: M2, M7

Section 16.3 Exercise Set 1

16.30 The following graphs are based on data from an experiment to assess the effects of logging on a squirrel population in British Columbia (“Effects of Logging Pattern and Intensity on Squirrel Demography,” The Journal of Wildlife Management [2007]: 2655–2663). Plots of land, each nine hectares in area, were subjected to different percentages of logging, and the squirrel population density for each plot was measured after 3 years. The scatterplot, residual plot, and a boxplot of the residuals are shown here. Does it appear that the assumptions of the simple linear regression model are plausible? Explain your reasoning in a few sentences.

[Scatterplot of Squirrels per plot versus %Logged; residual plot versus %Logged; boxplot of the residuals]

16.31 The clutch size (number of eggs laid) for turtles is known to be influenced by body size, latitude, and average environmental temperature. Researchers gathered data on Gopher tortoises in Okeeheelee County Park in Florida to further understand the factors that affect reproduction in these animals (“Geographic Variation in Body and Clutch Size of Gopher Tortoises,” Copeia [2007]: 355–363). The scatterplot, residual plot, and a normal probability plot of the residuals for the least squares regression line with x = body length and y = clutch size are shown here. Does it appear that the assumptions of the simple linear regression model are plausible? Explain your reasoning in a few sentences.

[Scatterplot of ClutchSize versus Length (mm); residual plot; normal quantile plot of the residuals]

16.32 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical analysis of diesel engine exhaust, x = mass (μg/cm²) and y = elemental carbon (μg/cm²) were recorded (“Comparison of Solvent Extraction and Thermal Optical Carbon Analysis Methods: Application to Diesel Vehicle Exhaust Aerosol,” Environmental Science Technology [1984]: 231–234). The estimated regression line for this data set is $\hat{y} = 31 + 0.737x$. The accompanying table gives the observed x and y values and the corresponding standardized residuals.
x       y     St. resid.
164.2   181   2.52
161.8   170   1.72
118.7   106   -1.07
108.1   102   -0.75
78.9    86    -0.27
156.9   156   0.82
230.9   193   -0.73
248.8   204   -0.95
89.4    91    -0.51
387.8   310   -0.89
109.8   115   0.27
106.5   110   0.05
102.4   98    -0.73
76.4    97    0.85
135.0   141   0.91
111.4   132   1.64
87.0    96    0.08
97.6    94    -0.77
79.7    77    -1.11
64.2    76    -0.20
89.4    89    -0.68
131.7   128   0.00
100.8   88    -1.49
82.9    90    -0.18
117.9   130   1.05

a. Construct a standardized residual plot. Are there any unusually large residuals? Do you think that there are any influential observations?
b. Is there any pattern in the standardized residual plot that would indicate that the simple linear regression model is not appropriate?
c. Based on your plot in Part (a), do you think that it is reasonable to assume that the variance of y is the same at each x value? Explain.

16.33 The article “Vital Dimensions in Volume Perception: Can the Eye Fool the Stomach?” (Journal of Marketing Research [1999]: 313–326) gave the accompanying data on the dimensions of 27 representative food products (Gerber baby food, Cheez Whiz, Skippy Peanut Butter, and Ahmed’s tandoori paste, to name a few).

Product   Maximum Width (cm)   Minimum Width (cm)
1         2.50                 1.80
2         2.90                 2.70
3         2.15                 2.00
4         2.90                 2.60
5         3.20                 3.15
6         2.00                 1.80
7         1.60                 1.50
8         4.80                 3.80
9         5.90                 5.00
10        5.80                 4.75
11        2.90                 2.80
12        2.45                 2.10
13        2.60                 2.20
14        2.60                 2.60
15        2.70                 2.60
16        3.10                 2.90
17        5.10                 5.10
18        10.20                10.20
19        3.50                 3.50
20        2.70                 1.20
21        3.00                 1.70
22        2.70                 1.75
23        2.50                 1.70
24        2.40                 1.20
25        4.40                 1.20
26        7.50                 7.50
27        4.25                 4.25

a. Fit the simple linear regression model that would allow prediction of the maximum width of a food container based on its minimum width.
b. Calculate the standardized residuals (or just the residuals if you don’t have access to a computer program that gives standardized residuals) and make a residual plot to determine whether there are any outliers.
c. The data point with the largest residual is for a 1-liter Coke bottle. Delete this data point and refit the regression. Did deletion of this point result in a large change in the equation of the estimated regression line?
d. For the regression line of Part (c), interpret the estimated slope and, if appropriate, the intercept.
e. For the data set with the Coke bottle deleted, do you think that the assumptions of the simple linear regression model are reasonable? Give statistical evidence for your answer.

16.34 Models of climate change predict that global temperatures and precipitation will increase in the next 100 years, with the largest changes occurring during winter in northern latitudes. Researchers gathered data on the potential effects of climate change for flowering plants in Norway (“Climatic Variability, Plant Phenology, and Northern Ungulates,” Ecology [1999]: 1322–1339). The table below gives data for one flower species. Range of flowering dates and elevation for different sites in Norway were used to construct the given scatterplot. A potentially influential point is indicated on the scatterplot.

[Bivariate Fit of Flowering Date Range by Elevation: scatterplot of flowering date range versus elevation]

Flowering Range versus Elevation: Tussilago Farfara

Elevation (Meters Above Sea Level)   Flowering Date Range
23.3                                 33.4
5.6                                  32.0
55.6                                 31.9
140.0                                31.3
31.1                                 28.1
112.2                                29.3
106.7                                28.4
42.2                                 26.6
75.6                                 24.9
176.7                                25.7
126.7                                24.7
126.7                                23.5
176.7                                23.2
201.1                                21.8
133.3                                22.3
90.0                                 21.4
41.1                                 19.7
125.6                                17.6
477.8                                17.6

a. Fit a linear regression model using all 19 observations. What are the values of a, b, $r^2$, and $s_e$?
b. Fit a linear regression model with the indicated point omitted. What are the values of a, b, $r^2$, and $s_e$?
85241_ch16_ptg01.indd 769 20/12/12 6:39 PM 770 CHAPTER 16 Understanding Relationships—Numerical Data Part 2 Section 16.2 Exercise Set 2 16.35 In the study described in Exercise 16.31, the effect of latitude on mean clutch size was investigated. Data from various locations in Florida, Georgia, Alabama, and Mississippi on y 5 mean clutch size and x 5 latitude were measured. The scatterplot, standardized residual plot, and several graphs of the standardized residuals are shown below. Does it appear that the assumptions of the simple linear regression model are plausible? Explain your reasoning in a few sentences. Histogram of the Residuals 4 Frequency c. I n a few sentences, describe any differences you found in Parts (a) and (b). d. The researchers could use the estimated regression equation based on all 19 observations to make predictions for elevations ranging from 0 to 200 meters; or they could use the estimated regression equation based on the 18 observations (omitting the observation identified by an arrow) to make predictions for elevations ranging from 0 to 500 meters. Which strategy would you recommend, and why? 3 2 1 0 16.36 Exercise 6.21 gave data on x 5 nerve firing frequency and y 5 pleasantness rating when nerves were stimulated by a light brushing stoke on the forearm. The x values and the corresponding residuals from a simple linear regression are as follows: a. Construct a standardized residual plot. Does the plot exhibit any unusual features? 
Firing Frequency, x 23 24 22 25 27 28 34 33 36 34 8 7 5 4 4 26 27 28 26 27 28 Standardized Standardized Residual Residual 2 29 30 Latitude 29 30 31 32 33 31 32 33 0 –1 0 21 –1 –2 –2 26 27 28 26 27 28 29 30 Latitude 29 30 31 32 33 31 32 33 Latitude Normal Probability plot of the Residuals 22 22.0 21.5 21.0 20.5 0.0 0.5 Standardized residual 1.0 1.5 16.37 The accompanying scatterplot, based on 34 sediment samples with x 5 sediment depth (cm) and y 5 oil and grease content (mg/kg), appeared in the article “Mined Land Reclamation Using Polluted Urban Navigable Waterway Sediments” ( Journal of Environmental Quality [1984]: 415–422). 90 Percent 2 1 1 0 50 10 1 21.83 0.04 1.45 0.20 21.07 1.19 20.24 20.13 20.81 1.17 Latitude 2 1 99 Standardized Residual b. A normal probability plot of the standardized residuals follows. Based on this plot, do you think it is reasonable to assume that the error distribution is approximately normal? Explain. 6 5 Normal score MeanMean Clutch Clutch Size Size 8 7 6 –2.0 –1.5 –1.0–0.5 0.0 0.5 1.0 1.5 Residual –2 –1 0 Residual 1 2 Discuss the effect that the observation (20, 33,000) will have on the estimated regression line. If this point were omitted, what do you think will happen to the slope of the estimated regression line compared to the slope when this point is included? Unless otherwise noted, all content on this page is © Cengage Learning. 85241_ch16_ptg01.indd 770 20/12/12 6:39 PM Standardized R 2 0 –1 –2 Oil and grease (mg/kg) 0 771 40 60 80 Adequacy 100 16.3 Checking Model Locations/Pack 20 7 6 Frequency 32,000 28,000 5 4 3 2 24,000 1 20,000 0 –1 16,000 8,000 4,000 0 30 60 90 120 150 180 Subsample mean depth (cm) 16.38 Investigators in northern Alaska periodically monitored radio collared wolves in 25 wolf packs over 4 years, keeping track of the packs’ home ranges. (“Population Dynamics and Harvest Characteristics of Wolves in the Central Brooks Range, Alaska,” Wildlife Monographs, [2008]: 1–25). 
The home range of a pack is the area typically covered by its members in a specified amount of time. The investigators noticed that wolf packs with larger home ranges tended to be located more often by monitoring equipment. The investigators decided to explore the relationship between home range and the number of locations per pack. A scatterplot and standardized residual plot of the data are shown below, as well as plots of the standardized residuals. Does it appear that the assumptions of the simple linear regression model are plausible? Explain your reasoning in a few sentences.

[Figures: scatterplot of home range versus locations per pack; standardized residual plot versus locations per pack; additional plots of the standardized residuals, including a histogram.]

Additional Exercises

16.39 Carbon aerosols have been identified as a contributing factor in a number of air quality problems. In a chemical analysis of diesel engine exhaust, x = mass (mg/cm²) and y = elemental carbon (mg/cm²) were recorded (“Comparison of Solvent Extraction and Thermal Optical Carbon Analysis Methods: Application to Diesel Vehicle Exhaust Aerosol,” Environmental Science & Technology [1984]: 231–234). The estimated regression line for this data set is ŷ = 31 + 0.737x. A scatterplot of the data and a standardized residual plot are shown below.

[Figures: Bivariate Fit of carbon By mass (scatterplot of carbon versus mass); standardized residual plot versus mass.]

a. Are there any unusually large residuals? Do you think that there are any influential observations?
b. Is there any pattern in the standardized residual plot that would indicate that the simple linear regression model is not appropriate?
c. Based on the scatterplot and the standardized residual plot, do you think that it is reasonable to assume that the variance of y is the same at each x value? Explain.

16.40 Models of climate change predict that global temperatures and precipitation will increase in the next 100 years, with the largest changes occurring during winter in northern latitudes. Researchers recently gathered data on the potential effects of climate change for flowering plants in Norway (“Climatic Variability, Plant Phenology, and Northern Ungulates,” Ecology [1999]: 1322–1339). The table below gives data for one flower species. A scatterplot of the “range of flowering dates” versus latitude for different sites in Norway is also shown. Two points that are potentially influential are indicated on the scatterplot.

[Figure: scatterplot of mean flowering date range versus latitude (N), with two potentially influential points marked.]

Flowering Range Versus Latitude: Anemone Hepatica

Latitude (N)   Flowering Date Range
58.7           46.1
58.2           35.9
58.2           34.7
59.4           32.3
60.0           33.0
59.4           29.7
59.1           26.9
59.3           26.2
59.5           25.6
59.5           27.6
59.7           19.1
59.8           24.4
60.8           26.2
60.9           26.8
63.4           28.7
63.4           19.2
60.5           22.5
60.7           17.9
60.7           12.9
61.1           11.8

a. Fit a linear regression model using all 20 observations. What are the values of a, b, r², and sₑ?
b. Fit a linear regression model with the two observations identified by arrows omitted. What are the values of a, b, r², and sₑ?
c. In a few sentences, describe any differences you found in Parts (a) and (b).
d. The researchers could use the estimated regression equation based on all 20 observations to make predictions for latitudes ranging from 58 to 64, or they could use the estimated regression equation based on the 18 observations (omitting the two observations identified by arrows) to make predictions for latitudes ranging from 58 to 62. Which strategy would you recommend, and why?

16.41 The sand scorpion is a predator that always hunts from a motionless resting position outside its own burrow. When prey appears on the horizon, within say 20 cm, the scorpion assumes an alert posture; it determines the angular position of the prey, makes a quick rotation, and runs after it. In a recent study of the scorpion’s accuracy, the angular position of the prey (0 degrees = right in front) and the turning angle of the scorpion were recorded for 23 attacks. A simple regression model relating the response angle of the predator to the target angle position of the prey, r̂ = a + b(t), was fit. The resulting residual plot is shown. Describe the locations of any outliers you see in the residual plot.

[Figure: residual plot of residual versus target angle.]

16.42 The production of pups and their survival are the most significant factors contributing to gray wolf population growth. The causes of early pup mortality are unknown and difficult to observe. The pups are concealed within their dens for 3 weeks after birth, and after they emerge it is difficult to confirm their parentage. Researchers recently used portable ultrasound equipment to investigate some factors related to reproduction (“Diagnosing Pregnancy, in Utero Litter Size, and Fetal Growth with Ultrasound in Wild, Free-Ranging Wolves,” Journal of Mammalogy [2006]: 85–92). A scatterplot and linear regression of the length of a wolf fetus (in cm, measured from crown to rump) and gestational age (in days) is shown below. Identify the point that has the largest residual by giving its approximate coordinates.

[Figure: scatterplot with fitted line of crown–rump length (cm) versus gestational age (days).]
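Several exercises in this set ask for standardized residuals, and Exercise 16.43 lists them beside its data. As a rough illustration of where such values can come from, the sketch below assumes the common definition in which each residual is divided by its estimated standard deviation, sₑ·sqrt(1 − 1/n − (x − x̄)²/Sxx). The function name `standardized_residuals` is hypothetical, and statistical software may differ in details; the data are the ten (x, y) pairs given in Exercise 16.43.

```python
from math import sqrt

def standardized_residuals(x, y):
    """Fit y-hat = a + b*x by least squares and return (a, b, s_e, standardized residuals).

    Assumes the common definition: residual / (s_e * sqrt(1 - leverage)),
    where leverage = 1/n + (x_i - x_bar)^2 / Sxx.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                    # estimated slope
    a = y_bar - b * x_bar            # estimated intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    std = []
    for xi, r in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        std.append(r / (s_e * sqrt(1 - leverage)))
    return a, b, s_e, std

# Data from Exercise 16.43: x = stem density, y = vigor
density = [4, 5, 6, 9, 14, 15, 15, 19, 21, 22]
vigor = [0.75, 1.20, 0.55, 0.60, 0.65, 0.55, 0.00, 0.35, 0.45, 0.40]

a, b, s_e, std_resid = standardized_residuals(density, vigor)
for xi, zi in zip(density, std_resid):
    print(f"x = {xi:2d}   standardized residual = {zi:5.2f}")
```

Under this assumed definition, the printed values can be compared with the "St. resid." column in Exercise 16.43. A standardized residual beyond about ±2 is the kind of value the exercises describe as unusually large.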
16.43 The authors of the article “Age, Spacing and Growth Rate of Tamarix as an Indication of Lake Boundary Fluctuations at Sebkhet Kelbia, Tunisia” (Journal of Arid Environments [1982]: 43–51) used a simple linear regression model to describe the relationship between y = vigor (average width in centimeters of the last two annual rings) and x = stem density (stems/m²). The estimated model was based on the following data. Also given are the standardized residuals.

x     y      St. resid.
4     0.75   −0.28
5     1.20    1.92
6     0.55   −0.90
9     0.60   −0.28
14    0.65    0.54
15    0.55    0.24
15    0.00   −2.05
19    0.35   −0.12
21    0.45    0.60
22    0.40    0.52

a. What assumptions are required for the simple linear regression model to be appropriate?
b. Construct a normal probability plot of the standardized residuals. Does the assumption that the random deviation distribution is normal appear to be reasonable? Explain.
c. Construct a standardized residual plot. Are there any unusually large residuals?
d. Is there anything about the standardized residual plot that would cause you to question the use of the simple linear regression model to describe the relationship between x and y?

Are You Ready to Move On? Chapter 16 Review Exercises

All chapter learning objectives are assessed in these exercises. The learning objectives assessed in each exercise are given in parentheses.

16.44 (C1) Describe what distinguishes a deterministic model from a probabilistic model.

16.45 (C2) In the context of the simple linear regression model, explain the difference between α and a, between β and b, and between σₑ and sₑ.

16.46 (M1) The SAT and ACT exams are often used to predict a student’s first-term college grade point average (GPA). Different formulas are used for different colleges and majors. Suppose that a student is applying to State U with an intended major in civil engineering. Also suppose that for this college and this major, the following model is used to predict first-term GPA.

GPA = a + b(ACT)
a = 0.5
b = 0.1

a. In this context, what would be the appropriate interpretation of a?
b. In this context, what would be the appropriate interpretation of b?

16.47 (M2) Theropods were carnivorous dinosaurs, characterized by short forelimbs, living in the Jurassic and Cretaceous periods. (Tyrannosaurus rex is classified as a theropod.) What scientists know about theropods is based on studying incomplete skeletal remains. In a study described in the paper “My Theropod is Bigger than Yours…or not: Estimating Body Size from Skull Length in Theropods” (Journal of Vertebrate Paleontology [2007]: 108–115), researchers used data from skeletons to develop a model describing the relationship between body length and skull length. JMP was used to produce the following graphical displays and computer output. When you evaluate the fit of an estimated regression line, all of the information below is considered as a whole. However, the summary statistics in the computer output and the different plots each convey some specific information.

[Figures: Bivariate Fit of BodyLength By SkullLength (scatterplot); Bivariate Fit of Residuals By SkullLength (residual plot); Normal Quantile Plot and boxplot of the residuals.]

Linear Fit
BodyLength = 0.7061088 + 7.791973*SkullLength

Summary of Fit
RSquare                       0.953929
RSquare Adj                   0.951218
Root Mean Square Error        0.801042
Mean of Response              5.859474
Observations (or Sum Wgts)    19

Parameter Estimates
Term          Estimate     Std Error   t Ratio   Prob>|t|
Intercept     0.7061088    0.330485    2.14      0.0475*
SkullLength   7.791973     0.415318    18.76     <.0001*

a. Using only the scatterplot, do you think a linear model does a good job of describing the relationship? Explain why or why not.
b. Using only the residual plot, what can you determine about whether the basic assumptions of the linear regression model are met?
c. Using only the normal probability plot and boxplot of the residuals, what can you determine about whether the basic assumptions of the linear regression model are met?
d. Using only the values of r² and sₑ, what can you say about the quality of the fit of the linear model for these data?

16.48 (M3) There are four basic assumptions necessary for making inferences about β, the slope of the population regression line.
a. What are the four assumptions?
b. Which assumptions can be checked using sample data?
c. What statistics or graphs would be used to check each of the assumptions you listed in Part (b)?

16.49 (M3, M4, M5, P1, P2) Ruffed grouse are a species of birds that nest on the ground. Because of this, chick survival at night in the first few weeks of life depends on avoiding predators. Biologists have theorized that protection from predators might be supplied by the mother hen’s choice of brooding sites. One variable that biologists thought might be related to survival is the density of vegetation in the vicinity of the nest. Dense vegetation would possibly reduce the ability of predators to detect the nests. The paper “Nocturnal Roost Habitat Selection by Ruffed Grouse Broods” (Journal of Field Ornithology [2005]: 168–174) describes a study in which researchers monitored the survival of the brood (number of chicks surviving / number of eggs hatched) in 23 nests in different vegetation densities (thousands of stems / hectare). Computer output (from JMP) is shown below.

Linear Fit
BroodSurvival = 0.9468008 − 0.0261902*StemDensity

Summary of Fit
RSquare                       0.193788
RSquare Adj                   0.155397
Root Mean Square Error        0.287538
Mean of Response              0.436043
Observations (or Sum Wgts)    23

Parameter Estimates
Term          Estimate     Std Error   t Ratio   Prob>|t|
Intercept     0.9468008    0.235108    4.03      0.0006*
StemDensity   −0.02619     0.011657    −2.25     0.0355*

a. Is there convincing evidence of a useful linear relationship between brood survival and stem density? Explain.
b. Would you describe the relationship as strong? Why or why not?
c. Construct a 95% confidence interval for β and interpret it in context.
d. What margin of error is associated with the confidence interval in Part (c)?

16.50 (M7) Researchers in Hawaii have recently documented a large increase in the prevalence of a bird parasite known as chewing lice (“Explosive Increase in Ectoparasites in Hawaiian Forest Birds,” The Journal of Parasitology [2008]: 1009–1021). Current data suggest that the prevalence of chewing lice may be less for bird species with a high degree of bill overhang. A species is said to have bill overhang when the upper bill extends downward in front of the end of the lower bill. The following scatterplot shows the relationship between the prevalence of chewing lice and bill overhang for 8 bird species in the Hawaiian Islands. A residual plot is also shown. Use these plots to identify any outliers or potentially influential observations. For each point you identify, assess its influence on the estimated slope of the regression line.

[Figures: scatterplot of lice prevalence versus bill overhang; residual plot of residual versus bill overhang.]

16.51 (M6) Suppose you are given the computer output shown. You want to test the hypothesis β = 1.0. Describe how you would use the computer output to test this hypothesis.

Linear Fit
y = 5.6452776 + 0.9797401*x

Summary of Fit
RSquare                       0.985289
RSquare Adj                   0.984954
Root Mean Square Error        12.48525
Mean of Response              0.791304
Observations (or Sum Wgts)    46

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   5.6452776    1.84302     3.06      0.0037*
x           0.9797401    0.018048    54.29     <.0001*

Technology Notes

Regression Test

TI-83/84
1. Enter the data for the independent variable into L1. (To access lists, press the STAT key, highlight the option called Edit… and then press ENTER.)
2. Enter the data for the dependent variable into L2
3. Press STAT
4. Highlight TESTS
5. Highlight LinRegTTest… and press ENTER
6. Next to β & ρ select the appropriate alternative hypothesis
7. Highlight Calculate

TI-Nspire
1. Enter the data into two separate data lists. (To access data lists, select the spreadsheet option and press enter.)
Note: Be sure to title the lists by selecting the top row of the column and typing a title.
2. Press the menu key and select 4:Statistics then 4:Stat Tests then A:Linear Reg t Test… and press enter
3. In the box next to X List choose the list title where you stored your independent data from the drop-down menu
4. In the box next to Y List choose the list title where you stored your dependent data from the drop-down menu
5. In the box next to Alternate Hyp choose the appropriate alternative hypothesis from the drop-down menu
6. Press OK

JMP
1. Input the data for the dependent variable into the first column
2. Input the data for the independent variable into the second column
3. Click Analyze and select Fit Y by X
4. Select the dependent variable (Y) from the box under Select Columns and click on Y, Response
5. Select the independent variable (X) from the box under Select Columns and click on X, Factor
6. Click the red arrow next to Bivariate Fit of… and select Fit Line

MINITAB
1. Input the data for the dependent variable into the first column
2. Input the data for the independent variable into the second column
3. Select Stat then Regression then Regression…
4. Highlight the name of the column containing the dependent variable and click Select
5. Highlight the name of the column containing the independent variable and click Select
6. Click OK
Note: You may need to scroll up in the Session window to view the t-test results for the regression analysis.

SPSS
1. Input the data for the dependent variable into one column
2. Input the data for the independent variable into a second column
3. Click Analyze then click Regression then click Linear…
4. Select the name of the dependent variable and click the arrow to move the variable to the box under Dependent:
5. Select the name of the independent variable and click the arrow to move the variable to the box under Independent(s):
6. Click OK
Note: The p-value for the regression test can be found in the Coefficients table in the row with the independent variable name.
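The packages in these Technology Notes all report an estimated slope, its standard error, and a t ratio based on n − 2 degrees of freedom. As a hedged sketch of the arithmetic behind that output (not any package's actual implementation), the hypothetical function `slope_inference` below computes the test statistic t = (b − hypothesized value) / s_b and the margin of error (t critical value)·s_b. The inputs shown are the SkullLength row of the JMP output in Exercise 16.47; the critical value 2.110 (95% confidence, 17 df) is assumed to be read from a t table, and no P-value is computed because Python's standard library has no t distribution.

```python
def slope_inference(b, se_b, n, t_crit, beta0=0.0):
    """Summaries for inference about a population slope, from regression output.

    b      : estimated slope reported in the output
    se_b   : its standard error ("Std Error" or "SE Coef" in the output)
    n      : number of observations
    t_crit : t critical value for n - 2 df, looked up in a t table
    beta0  : hypothesized slope (0 for the model utility test)
    """
    df = n - 2
    t_stat = (b - beta0) / se_b          # test statistic
    margin = t_crit * se_b               # margin of error for the interval
    return {
        "df": df,
        "t": t_stat,
        "margin_of_error": margin,
        "interval": (b - margin, b + margin),
    }

# Values from the JMP output in Exercise 16.47 (n = 19 observations).
result = slope_inference(b=7.791973, se_b=0.415318, n=19, t_crit=2.110)

print(result["df"])                                      # 17
print(round(result["t"], 2))                             # 18.76, the t Ratio in the output
print(tuple(round(v, 2) for v in result["interval"]))    # (6.92, 8.67)
```

To test a hypothesized slope other than 0, pass it as `beta0`; the resulting statistic is then compared with a t distribution with n − 2 degrees of freedom, just as for the model utility test.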
Excel
1. Input the data for the dependent variable into the first column
2. Input the data for the independent variable into the second column
3. Select Analyze then choose Regression then choose Linear…
4. Highlight the name of the column containing the dependent variable
5. Click the arrow button next to the Dependent box to move the variable to this box
6. Highlight the name of the column containing the independent variable
7. Click the arrow button next to the Independent box to move the variable to this box
8. Click OK
Note: The test statistic and p-value for the regression test for the slope can be found in the third table of output. These values are listed in the row titled with the independent variable name and the columns entitled t Stat and P-value.

AP* Review Questions for Chapter 16

Use the following information for questions 1–6.
A study was carried out to investigate the relationship between x = the number of components needing repair and y = the time of the service call (in minutes) for a computer repair company. The number of components and the service time for a random sample of 20 service calls were used to fit a simple linear regression model. Partial computer output is shown below.

The regression equation is
Time = 37.2 + 9.97 Number

Predictor   Coef     SE Coef   T       P
Constant    37.213   7.985     4.66    0.000
Number      9.9695   0.7218    13.81   0.000

S = 18.7534   R-Sq = 89.7%   R-Sq(adj) = 89.2%

1. Which of the following statements is a correct interpretation of the value 9.97?
(A) The average number of components needing repair goes up 9.97 for each 1 minute increase in the service time of a call.
(B) On average, the service call time goes up 9.97 minutes for each additional component needing repair.
(C) The service call time is 9.97 minutes when there are 0 components to repair.
(D) Approximately 9.97% of the observed variation in the service call times can be explained by the linear relationship between service time and number of components requiring repair.
(E) If this regression equation is used to predict service call times, we can expect predictions to be within 9.97 minutes of the actual time.

2. Which of the following statements is a correct interpretation of the value 89.7%?
(A) On average, the service call time goes up 89.7 minutes for each additional component needing repair.
(B) The magnitude of a typical difference between an observed service call time and the service call time predicted by the linear model is approximately 89.7 minutes.
(C) The correlation between service call time and number of components needing repair is 89.7%.
(D) Approximately 89.7% of the observed variation in service call time can be explained by the linear relationship between service call time and number of components needing repair.
(E) If this regression equation is used to predict service call times, we can expect predictions to be within 89.7 minutes of the actual time.

3. The value of sₑ is 18.75. Which of the following is an appropriate interpretation of this value?
(A) 18.75% of the variability in service time can be explained by the linear relationship between service call time and number of components needing repair.
(B) There is a positive correlation between service call time and number of components needing repair.
(C) For every 1-component increase in the number of components needing repair, the predicted service call time increases by about 18.75 minutes.
(D) The magnitude of a typical difference between an observed service time and the service call time predicted by the linear model is approximately 18.75 minutes.
(E) The average service call time is 18.75 minutes.

4. The value of sₑ is 18.75. If the assumptions of the simple linear regression model are satisfied, which of the following is correct?
(A) The width of a 95% confidence interval for the slope of the population regression line is 2(18.75) = 37.50.
(B) It would be unlikely that a prediction based on the regression line will be greater than 18.75 minutes.
(C) It would be unlikely that a prediction based on the regression line will differ from the actual value by more than 2(18.75) = 37.50 minutes.
(D) Errors associated with predictions based on the regression line will always be less than 18.75 minutes.
(E) The value of sₑ does not provide any information about the anticipated magnitude of prediction errors.

5. Which of the following is a 95% confidence interval for the change in service time associated with a 1-unit increase in the number of components needing repair?
(A) 37.21 ± (1.96)(7.985)
(B) 37.21 ± (2.910)(7.985)
(C) 9.97 ± (1.96)(0.7218)
(D) 9.97 ± (2.10)(0.7218)
(E) 9.97 ± 2(18.7534)

6. If the basic assumptions of the simple linear regression model are reasonable, what conclusion should be reached regarding model utility if a significance level of 0.05 is used for the model utility test?
(A) There is convincing evidence of a negative linear relationship between service call time and number of components needing repair.
(B) There is convincing evidence that the model is not useful for predicting service call time.
(C) There is convincing evidence that the model is useful for predicting service call time.
(D) There is not convincing evidence that the model is useful for predicting service call time.
(E) A conclusion cannot be reached based on the given information.

7. If there is a positive linear relationship between two variables x and y, which of the following must be true of β, the slope of the population regression line?
(A) β < 0
(B) β > 0
(C) β = 0
(D) β > 1
(E) −1 < β < 1

8. The plots shown are residual plots resulting from fitting a linear regression. Which of these plots indicates that the relationship between the two variables used to fit the line may not be linear?

[Figures: three standardized residual plots (I, II, and III), each showing standardized residual versus x.]

(A) I only
(B) II only
(C) III only
(D) I and III only
(E) II and III only

Use the scatterplot below to answer questions 9 and 10.

[Figure: scatterplot of y versus x with four labeled points A, B, C, and D.]

9. Which of the labeled points would have the largest residual when a linear model is fit to the data?
(A) A
(B) B
(C) C
(D) D
(E) Both C and D

10. Which of the labeled points corresponds to a potentially influential observation if a linear model is to be fit to the data?
(A) A
(B) B
(C) C
(D) D
(E) Both C and D

11. If there is evidence of a linear relationship between x and y, what decision will be made in a test of H0: β = 0 versus Ha: β ≠ 0?
(A) Reject H0 and conclude that there is no evidence that the linear model is useful
(B) Reject H0 and conclude that there is evidence that the linear model is useful
(C) Fail to reject H0 and conclude that there is no evidence that the linear model is useful
(D) Fail to reject H0 and conclude that there is evidence that the linear model is useful
(E) Not enough information to say.

Use the following information to answer questions 12 and 13.
As part of a study of the swimming speed of sharks, a random sample of 18 lemon sharks (Triakis semifasciata) was observed in a laboratory sea tunnel. Body lengths and maximum sustainable swimming speeds (“MSSS,” reported in body lengths per second) were measured for each shark. The computer output from a regression with y = MSSS and x = body length is given below.

Linear Fit
MSSS = 1.8928955 − 0.0104278*Length

Summary of Fit
RSquare            0.526395
RSquare Adj        0.496794
S                  0.272031
Mean of Response   1.24
N Observations     18

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Ratio
Model    1    1.3159870        1.31599       17.7834
Error    16   1.1840130        0.07400       Prob > F
Total    17   2.5000000                      0.0007*

Parameter Estimates
Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    1.8928955   0.167575    11.30     <.0001*
Length(cm)   −0.010428   0.002473    −4.22     0.0007*

12. For this data set, the model utility test is based on how many degrees of freedom?
(A) 15
(B) 16
(C) 17
(D) 18
(E) 19

13. What is the P-value associated with the model utility test?
(A) 0.0001
(B) 0.0007
(C) 0.07400
(D) 0.167575
(E) 0.526395

14. Which of the following is not an assumption that is made about the random deviation e in a simple linear regression model?
(A) The distribution of e is normal.
(B) The standard deviation of e, σₑ, depends upon the particular value of x.
(C) The mean value of e is 0.
(D) The random deviations e1, e2, …, en associated with different observations are independent of one another.
(E) The standard deviation of e, σₑ, is the same for each x value.

15. The residual plot below indicates that one or more of the assumptions of the linear regression model may not be met. Which of the following is a reasonable conclusion based on this residual plot?

[Figure: standardized residual plot, standardized residual versus x.]

(A) The residual plot clearly indicates a non-linear model would be more appropriate.
(B) There is evidence that the residuals are not normally distributed.
(C) The slope of the regression line is non-zero.
(D) The correlation between x and y is non-zero.
(E) There is evidence the residuals do not have the same variance for all x values.

AP* and the Advanced Placement Program are registered trademarks of the College Entrance Examination Board, which was not involved in the production of, and does not endorse, this product.