CHAPTER 3 Relationships Between Two Quantitative Variables

Overview

Goals
The overall goal of Chapter 3 is to develop a systematic way to uncover information about bivariate data (two variables per case). We can pursue this goal as we did in Chapter 2 for univariate data: by using graphical displays and then finding a measure of center and spread to summarize the distribution. The basic plot used is the scatterplot. The basic summary measures are the regression line (which can be thought of as the measure of center) and the correlation (which can be thought of as the measure of spread).

The five sections of this chapter will teach students to
• make a scatterplot and determine what its shape tells about the relationship between the two variables
• find and interpret the equation of the least squares regression line
• find and interpret the correlation coefficient
• use diagnostic tools such as residual plots to determine whether the linear regression line is appropriate or a transformation is needed first
• make transformations that re-express a curved relationship as a linear one

Content Overview

Statistical Software
Statistical software is important for this chapter, one of the most computationally intensive in this textbook. Students can use statistical software to construct scatterplots quickly, accurately, and flexibly. But more to the instructional point, they can use the software to change the location of points and then observe the corresponding change in the correlation and the regression line to further their understanding of the properties of these two measures.

A Note on Terminology
In the statistical community, the language of correlation and regression has not been standardized. The words in the chart on the next page are sometimes used interchangeably.
Terminology (Concept: Words Used)
• A Bivariate Relationship: relationship, association, correlation (used only when the trend is linear)
• The Degree of Spread of the Points About the Line: correlation, measure of strength of the relationship, scatter, variation
• A Summary Line: line of best fit, regression line, trend line, model, fitted line, least squares line

Comparing Analyses of One-Variable and Two-Variable Distributions
Thinking of the parallels between exploring a one-variable distribution and exploring a two-variable relationship is helpful. You may want to discuss this table with your students, which you can show them on the overhead projector or give them as a handout (master on page 92).

Chapter 2: One Variable compared with Chapter 3: Two Variables
• Key Idea. One variable: distribution. Two variables: relationship (association).
• Plots. One variable: dot plot, stemplot, boxplot, histogram. Two variables: scatterplot.
• Shape. One variable: normal, uniform, or skewed; symmetric; clusters, gaps, and outliers. Two variables: linear or curved; constant strength; clusters, gaps, and outliers.
• Ideal Shape. One variable: normal. Two variables: linear (oval/ellipse).
• Measure of Center. One variable: mean, median. Two variables: regression line.
• Measure of Spread from the Center. One variable: standard deviation, interquartile range. Two variables: correlation.

Notes for AP Teachers

Preparing for the AP Exam Throughout the Year

Describing Distributions in Context: Shape, Trend, and Strength
As in Chapter 2, AP Exam free-response problems that cover the content of this chapter frequently ask the student to describe distributions of bivariate data in context. With bivariate data, instead of shape, center, and spread, we use shape, trend, and strength.

Statistics in Action Instructor’s Guide, Volume 1 © 2008 Key Curriculum Press

Section 3.1 is a great place to start students focusing on a complete description of a scatterplot in context. Even though they can’t yet compute the regression equation or the correlation, they can still describe a scatterplot’s trend and strength, just by looking at the data.
When they do learn how to do the calculations, the results can be used as additional evidence for the conclusions they reach from observing and analyzing a scatterplot as described in Section 3.1.

In the past, the question on the AP Exam has been stated along these lines: “Describe the nature of the relationship between the two variables.” (See AP Exam 2004 B Question 1, 2002 Question 4, 2000 Question 1.) For the answer to be “essentially correct,” the student needs to include four important pieces:
1. Shape: Is the relationship linear or nonlinear? If there is a pattern, it should be stated.
2. Trend: Is there a positive or negative trend?
3. Strength: If there is a trend, is it strong or weak? Does the relationship vary in strength or is it more constant?
4. Context: The context needs to be included!

For example, if you consider Display 3.1 on page 106 of the student book, there is a moderate positive association between the year an employee at Westvaco was hired and the year that employee was born, and that trend is fairly linear without curvature. There is more variability among values of y for larger values of x than for smaller. (The student book gives another example of a description.)

The student who loses points in this area usually either leaves out one of the big three (shape, trend, strength) or completely forgets the context. For example, for Display 3.1, the student could write, “There is a positive and moderate association between the variables. As one variable increases, so does the other. There are clusters in the x 75–90’s range for different y groupings, and a possible outlier at (43, 28).” Although this is a fairly nice description of most features, the student forgot to mention that there is a linear trend and never once mentioned the actual variables involved. This description would lose points for both omissions.
Comparing Distributions: Differences in Shape, Trend, and Strength
When students are asked to compare two scatterplots (or other distributions) on the AP Statistics Exam, they cannot simply describe each scatterplot separately. They should say how the shape, trend, and strength of one differ from those of the other. For example, “While both shapes are linear with a positive trend, the points in Scatterplot A cluster more closely to the regression line than the points in Scatterplot B.”

Thinking in Context
After the basic review of linear equations in Section 3.2, use the variable names instead of x and y. That way, the context is always part of the discussion, and that does seem to help students remember to include context, especially in interpreting the slope in context. For example, after the basic review of linear equations, the very first line the class discusses is the line they get from Activity 3.2a, “Pinching Pages.” Use for the slope:

slope = (change in response variable) / (change in predictor variable) = (change in response) / (change in predictor) = (change in thickness) / (change in number of pages)

and for the equation of the line:

thickness = (thickness-intercept) + slope · pages

This practice constantly encourages students to think “in context.” Instead of thinking in terms of x’s and y’s, they’ll be thinking in terms of thickness and pages. If students start off with both of these techniques, then by the time they are interpreting r and r², they will, ideally, add the context automatically. If students use x and y on the AP Exam, they must define each variable in context.

AP Exam Practice
As you work through the chapter, you may wish to assign questions from past AP Statistics Exams. The questions that correspond to topics in Chapter 3 are listed in the table below.
To get the free-response questions (FR), go to AP Central, apcentral.collegeboard.com, where the free-response questions and investigative tasks for all exams since 1997 are posted. (You will need to print the test and then literally cut and paste if you want to use only some questions.) The multiple-choice questions (MC) have been released only for 1997, 2002, and 2007. (The 2007 Exam was not yet available at the time of this printing, so those items are not listed in the table below.) These “released exams” may be purchased at AP Central. You can also use the sample multiple-choice and free-response questions in the AP Statistics Course Description, available as a download at AP Central. The College Board gives teachers permission to use all of these questions with their students. Section 1997 MC: 28 FR: 6a FR: 4a 3.4 MC: 31 FR: 6d–e 3.5 FR: 6b, c, e FR: 4b Chapter 3 Overview 2000 2001 FR: 1a–c FR: 1b–d FR: 6c 3.3 78 1999 FR: 2c 3.1 3.2 1998 FR: 1a, e MC: 20 FR: 6c FR: 6d 2002 2002 Form B 2003 FR: 6a, c MC: 6, 34 FR: 4b–c FR: 1b–d MC: 28 2004 2004 Form B 2005 2005 Form B 2006 2006 Form B FR: 1a MC: 31 FR: 4a, c MC: 17 FR: 4c 2003 Form B FR: 1a–b FR: 1a FR: 3b, d FR: 1b FR: 3c FR: 1c FR: 3a FR: 5a–b FR: 2a FR: 1a Statistics in Action Instructor’s Guide, Volume 1 © 2008 Key Curriculum Press Time Required Traditional Schedule Block 4 × 4 Block Section 3.1 1–2 days Day 1 Describing a scatterplot Day 2 Summary, exercises Day 1 Activity 3.2a, lines as summaries and prediction Day 2 Least squares regression, reading computer output Day 3 Summary, exercises Day 1 Activity 3.3a, estimating r, formula, appropriateness of linear model Day 2 Relation to slope, causation, interpreting r 2 Day 3 Regression to the mean (optional), summary, exercises Day 1 Activity 3.4a, influence, residual plots Day 2 Summary, exercises Day 1 Activity 3.5a, exponential growth and decay, log transformations Day 2 Log-log transformations, power functions Day 3 Summary, exercises 1 day 1 long 2 
days 1 long, 1 short 2 days 2 long Section 3.2 2–3 days Section 3.3 2–3 days Section 3.4 1–2 days 1.5 days 1 long, 1 short Section 3.5 2–3 days 2 days 2 long Review 1–2 days 1.5 days 1 long, 1 short Materials Section 3.1: None Section 3.2: For Activity 3.2a, a ruler with a millimeter scale and a textbook for each pair of students Section 3.3: For Activity 3.3a, a measuring tape, a yardstick, or a meterstick for each pair of students Section 3.4: For Activity 3.4a, a piece of paper for recording data Section 3.5: For Activity 3.5a, a paper cup and 200 pennies for each student (or each pair of students) Statistics in Action Instructor’s Guide, Volume 1 © 2008 Key Curriculum Press Chapter 3 Overview 79 Suggested Assignments Classwork Section Essential Recommended 3.1 D1, D2 P1, P2 3.2 Activity 3.2a D3–5, D8 P3–8 D6, D7 3.3 D9–13, D17, D18, D21 P9–19 Activity 3.3a D14, D20, D23 3.4 D26–30 P22–25 Activity 3.4a D31 3.5 Activity 3.5a D35, D40 D32, D33, D36–38, D41, D43 P30–32, P35, P37, P38 P26–29, P33, P34 Optional D15, D16, D19, D22, D24, D25 P20, P21 Chapter 3 Quiz 1 D34, D39, D42 P36, P39 Chapter 3 Quiz 2 Homework Section Essential Optional 3.1 E1, E3, E5, E7 E4 E2, E6, E8 3.2 E9, E11, E13, E15, E17, E19, E21 E14, E22, E23 E10, E12, E16, E18, E20, E24–26 3.3 E27, E31, E33, E34, E37, E40 E28, E29, E36, E39 E30, E32, E35, E38, E41, E42 3.4 E43, E45, E47, E49 E44, E50 E46, E48, E51–54 3.5 E55–57, E59 E60, E65 E58, E61–64, E66 E74–76, E78, E80 E71, E72, E77, E79, E82–83 Chapter E67–70, E73, E81 Summary For AP Students 80 Recommended Chapter 3 Overview AP1–10 Statistics in Action Instructor’s Guide, Volume 1 © 2008 Key Curriculum Press 3.1 Scatterplots Objectives • to make a scatterplot and describe its basic shape in terms of linearity, curvature, clusters, and outliers • to describe whether the trend in a scatterplot is positive or negative • to describe whether the strength of the relationship is strong, moderate, or weak and whether the strength is constant across all 
values of x
• to decide whether the pattern in a scatterplot can be generalized to other cases and to propose possible explanations for the pattern

Important Terms and Concepts
• bivariate data
• scatterplot
• variables and cases for bivariate data
• shape of a scatterplot: linear or curved
• strength of an association or relationship: strong, moderate, or weak
• constant strength versus varying strength
• lurking variable
• trend: positive or negative

Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline as described here. The actual text of the AP Statistics Topic Outline and the complete correlation begin on page xxi.
ID1 Students construct and interpret scatterplots.
ID2 Students examine bivariate data for correlation and linearity.

Lesson Planning
Class Time: One to two days
Materials: None

Suggested Assignments
Classwork: Essential, Recommended, Optional: D1, D2 P1, P2
Homework: Essential: E1, E3, E5, E7. Recommended: E4. Optional: E2, E6, E8.

Lesson Notes: Describing the Pattern in a Scatterplot

Making Comparisons
As mentioned in the Overview, when students are asked to compare two scatterplots (or other distributions) on the AP Statistics Exam, they cannot just describe the shape, trend, and strength of each scatterplot separately. They should say, for example, which is more linear, which has the greater slope, and which has the stronger relationship. To help students understand the concept of strength, ask them, “If you look at the plot, do you see more trend or more variation?”

Linearity
Students often find it very difficult to get used to the idea that any scatterplot where the points fall within an oval is called “linear.” Linear does not mean that all points lie on or even near a line.
For example, the points in the next scatterplot have a linear relationship even though they are not clustered closely about the line. On the other hand, the points on this next scatterplot do not follow a linear pattern even though they are clustered rather closely to the line. This pattern is curved.

Outliers in Two-Variable Relationships
The question of what is an outlier in a scatterplot is more complicated than with univariate data, where we could use the Q1 − 1.5 · IQR or Q3 + 1.5 · IQR rule as a guideline for identifying possible outliers. A scatterplot can have several kinds of outliers:
• a point with an extreme value of x or an extreme value of y or both
• a point that does not follow the general trend

In this next plot, the point represented by the x is an outlier in the sense that it has an extreme x-value. However, its y-value is not extreme and this point follows the curved pattern.
[Scatterplot: y versus x]

In the next plot, the point represented by the x has neither an extreme x-value nor an extreme y-value. However, it does not follow the general trend and is considered an outlier.
[Scatterplot: y versus x]

Varying Strength (Heteroscedasticity)
Instead of using the word heteroscedasticity, you can substitute “fan-shaped” or other synonyms for the word. But students like the way this word sounds, and it is fun to use in class. It is pronounced just as it is spelled: hetero-sce-das-ticity, where “sce” is pronounced like the “scu” part of scuff and “das” is pronounced so it rhymes with class. If a plot is not heteroscedastic, it is homoscedastic, which means “having the same variance.”

A summary of the various ways to describe the pattern in a scatterplot is shown on the next page, and a blackline master is provided on page 93.
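For teachers who want a quick numerical companion to the “fan-shaped” idea, the following minimal sketch simulates one data set with constant strength and one that fans to the right, then compares the spread of the residuals for the smaller and larger values of x. Everything here is an assumption made for illustration: the simulated data, the function names, and the choice to split the cases at the median x. It is not a procedure from the student book.

```python
import random
import statistics


def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_spread_by_half(xs, ys):
    """SD of residuals for the smaller-x half and for the larger-x half."""
    intercept, slope = least_squares(xs, ys)
    pairs = sorted(zip(xs, ys))  # order the cases by x
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    mid = len(residuals) // 2
    return statistics.stdev(residuals[:mid]), statistics.stdev(residuals[mid:])


rng = random.Random(1)  # fixed seed so the simulation is reproducible
xs = [rng.uniform(1, 10) for _ in range(400)]
# Simulated (hypothetical) responses: same scatter everywhere vs. scatter growing with x.
constant = [2 + 3 * x + rng.gauss(0, 2) for x in xs]
fan = [2 + 3 * x + rng.gauss(0, 0.5 * x) for x in xs]

print("constant strength (left SD, right SD):", residual_spread_by_half(xs, constant))
print("fan to the right  (left SD, right SD):", residual_spread_by_half(xs, fan))
```

In the homoscedastic data the two standard deviations should be similar; in the fan-shaped data the right-hand value should be clearly larger. A residual plot, introduced in Section 3.4, shows the same thing graphically.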
Describing Bivariate Data
• Trend: Positive, Negative, None
• Strength: Strong, Moderate, Weak
• Variability: Constant; Varying Strength: Fan to Right; Varying Strength: Fan to Left
• Pattern: Linear, Curved, None
• Influential Point: Outlier in x, Outlier in y, Outlier in the Residuals

Notes for AP Teachers

Modeling Good Answers
Once students have completed E4, show them the model answer as an example of what is expected on the AP Exam. This item is on the AP Quiz for Chapters 2–3, on pages 41–42 of the Instructor’s Resource Book. Don’t assign this problem yet if you plan to use that quiz; review the model answer as you go over the quiz. A PDF file containing the question and the model answer is available at www.keypress.com/keyonline.

Solutions

Discussion
D1. a. Display 3.1 illustrates the commonsense idea that people born earlier typically become employed before people who were born later. Display 3.2 shows the same idea but uses age instead of birth year. A larger birth year means a smaller age, so the relationship in Display 3.2 is the reverse of Display 3.1: Older people were hired earlier and younger people were hired later, so the association is negative.
b. All the points are in the lower-right half of the plot because people cannot be hired until they are about 18 years old. Specifically, a person’s year of birth must be at least 18 years before the year of hire, so all points (but one) are below the line y = x − 18. The one point above this diagonal line is a person who was hired into the company at a very young age during the 1940s. Although there is a lot of variation, in general people were born 25 or 30 years before they were hired.
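The boundary line in D1b can also be written out with named variables, in the spirit of using context instead of x and y. This is a minimal sketch: the constant and function names are made up for illustration, and the data points are hypothetical except for (43, 28), which is the possible outlier quoted in the chapter overview.

```python
MIN_HIRE_AGE = 18  # from the text: people are hired at about 18 or older


def latest_birth_year(year_of_hire):
    """Boundary line: year of birth = year of hire - 18."""
    return year_of_hire - MIN_HIRE_AGE


def is_on_or_below_line(year_of_hire, year_of_birth):
    """True when the point (year of hire, year of birth) is on or below the line."""
    return year_of_birth <= latest_birth_year(year_of_hire)


# (year of hire, year of birth) in two-digit years.
# The first three pairs are hypothetical; (43, 28) is the point cited in the overview.
employees = [(60, 30), (75, 48), (85, 62), (43, 28)]
for year_of_hire, year_of_birth in employees:
    age_at_hire = year_of_hire - year_of_birth
    print(year_of_hire, year_of_birth, age_at_hire,
          is_on_or_below_line(year_of_hire, year_of_birth))
```

A point fails the check exactly when the implied age at hire is under 18, which is the situation described for the one employee hired during the 1940s.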
Note on D1b: As you discuss the graph in the context of a real situation, introduce the practice of writing models with variable names that are in the context of the problem. For example, the line y = x − 18 could be stated as year of birth = year of hire − 18. The practice of using intelligible variable names is consistent with many statistics software programs but not with calculators. Mixing words and symbols can be a problem at lower levels, but should not present difficulty for your students.
c. No, this is not correct. The ages plotted are not the ages of the employees when they were hired but their ages when layoffs began. This idea will be explored further in E7.
D2. a. For these data, the cases are the states and the variables are the number of people per thousand living in dorms and the proportion of the state population living in cities.
The shape of the cluster is linear (roughly oval or elliptical), except for three points (VT, RI, and MA) that lie relatively far away from the main cloud of points. Vermont, a rural state, has a large number of colleges and a higher dorm proportion than would be anticipated. Rhode Island and Massachusetts also have a relatively high proportion living in dorms, but they are essentially urban states.
The trend is negative, as a larger proportion of people living in cities tends to mean a smaller proportion in dorms.
There is a lot of variability in dorm proportions for any particular proportion living in cities, and thus the strength of the association between the two variables is only moderate. Other than the three apparent outliers, the strength is relatively constant across all values of proportion of population living in cities.
This scatterplot shows the data for all of the 50 states, so there is no larger population to generalize to. What you see is all there is for this particular year.
However, it is reasonable to generalize to a previous or subsequent year.
A possible explanation for this pattern is that states with a high proportion of the population living in cities also have a high proportion of their colleges located in cities. In an urban area, there is little need for students to live in dorms. They can commute from home or get off-campus housing in nearby apartments. Thus, for highly urbanized states, a lower proportion of students need to live in dorms.
b. The positive trend in the original data comes from the fact that states with a large number of people tend to have a large number of colleges and universities and a large number of people living in cities. A possible explanation for the negative trend in the proportion data is given in part a.

Practice
P1. a. You may have to remind students that a scatterplot without labels and units on the axes is meaningless. Emphasize the importance of appropriate labeling. Here is the scatterplot.
[Scatterplot: Height (in.) versus Age (yr)]
b. These data are not very interesting to describe. The x-axis shows ages 2 to 7 years, and the y-axis shows the median height of children at each age. The shape is linear, the trend is positive, and the strength is very strong. That is, the scatterplot shows a very strong positive linear trend. Students may mention that a typical child grows about 2.7 inches per year.
c. The linear trend could reasonably be expected to hold for another year. However, median height could not be expected to increase at this rate to age 50, as people typically stop growing around age 20.
d. In the background is something called “growing up” that happens over time during the early years of life. That is, an increase in age is associated with an increase in height.
P2. a. The worst record for baggage handling during this period is Delta, and Northwest has the lowest on-time percentage among the airlines.
b.
Airlines with a high percentage of on-time arrivals and a low rate of mishandled baggage would fall in the upper left of the plot. United and America West are the top two in both categories, so they are the best overall.
c. False. American’s baggage mishandling rate of 6.5 mishandled bags per thousand was not twice Southwest’s rate of a little under 4.5. It appears to be more than twice because the scale on the x-axis starts at 3.75, not 0.
d. The relationship between the two variables is negative and weak. The negative relationship shows that an airline that is “bad” on one variable tends to be “bad” on the other as well.
e. No. These are the largest carriers in the United States. (The airlines included in this plot are all of the U.S. carriers with at least 1% of total domestic scheduled-service passenger revenues.) The other airlines that might be added would be small regional carriers. There is no reason to expect their pattern to be the same as that of the large national airlines. If we were to plot these same airlines for the previous or following year, the plot would probably look much the same but would likely be quite different from a plot of similar variables for ten years ago. With the increase in airport security since the terrorist attacks of 2001, you would expect the situation to have changed in some way.

Exercises
E1. Plot a shows a positive relationship that is strong and linear. There is fairly uniform variation across all values of x.
Plot b shows a negative relationship that is strong and linear, again with fairly uniform variation across all values of x.
Plot c shows a positive relationship that is moderate and linear with fairly uniform variation across all values of x. One point lies a short distance from the bulk of the data.
Plot d shows a negative relationship that is moderate and linear with fairly uniform variation across all values of x. Again, there is one outlier.
Plot e shows a positive relationship that is strong and linear except for the outlier. As students will learn, the one outlier has dramatic influence on the strength of this relationship. There is fairly uniform variation across all values of x.
Plot f shows a negative relationship that is very strong and curved. One point on the far right lies in the general pattern but far away from the remainder of the data, which accentuates the strong relationship. Another outlier lies below the bulk of the data on the left.
Plot g shows a negative relationship that is strong and curved. The two points at either end of the array accentuate the curvature. There is a bit more variability among values of y for smaller values of x than for larger values of x.
Plot h shows a positive relationship that is strong and curved. Again, the outlier on the extreme right accentuates the curved pattern and would have dramatic influence on where a trend line might be placed. The variability in y is fairly constant across all values of x.
E2. a. Positive and strong: As eggs get bigger, both length and width increase proportionally.
b. Positive and moderate: Most students tend to score relatively high on both parts of the exam, middling on both parts, or relatively low on both parts of the exam.
c. Positive and strong: Trees produce one new ring each year.
d. Negative and moderate: People tend to lose flexibility as they get older.
e. Positive and strong: The number of representatives is roughly proportional to the population of a state.
f. Positive and weak: Large countries tend to have large populations, but there are notable exceptions such as Canada and Australia. Also, some small countries in area have very large populations, such as Indonesia.
g. Negative and strong (but curved): Winning times tend to improve (get smaller) over the years.
E3. a. II
b. IV (The larger states, mainly in the West, tend to have fewer people per square mile.)
c. III
d. I (Heavier cars tend to get lower gas mileage.)
E4. a. Iowa 5% and Illinois 10%; about 92% of New York students took the SAT I, and they averaged 511.
b. The overall trend is negative, moderately strong, and curved. The gap between 40 and 50% in the middle of the scatterplot suggests two groups of states: one with low percentages and high average scores and another with high percentages and low average scores. One state stands out a bit from the rest: West Virginia, with only 20% taking the SAT I and a relatively low average of 511.
c. Yes, the distribution of the percentage taking the SAT I looks bimodal because there is a cluster of percentages around 10 and a second around 65 to 75.
[Histogram: Frequency versus Percentage Taking Exam]
The distribution of SAT I math scores also looks like it may be bimodal. There appears to be a cluster of scores around 510 and another cluster around 560. In the histogram of the average SAT I math scores, even though you can see two peaks around 510 and 560, the shape is more skewed toward the larger values than bimodal.
Note on E4c: Ask students to visualize all of the points dropping onto the x-axis so that they can see the distribution of percentage taking SAT. A histogram of percentages is shown above. It makes a nice demonstration to display the scatterplot in Fathom and remove the y attribute and watch the dots drop to the x-axis. (The data are available as a Fathom file on the Instructor’s Resource Book CD.) To visualize the distribution of SAT scores, ask students to imagine all the points dropping onto the y-axis. (Again, the Fathom file makes a nice demonstration.)
[Histogram: Average SAT Math Score versus Frequency]
d. There are no more states to add, so this is the complete picture for the given year.
You might generalize, however, to the previous year and the next year. In fact, the plots for the last 20 or so years look similar to this one. These numbers do not change rapidly from year to year. e. The Midwestern states are predominantly ACT states. In these states, only small percentages of students take the SAT I, and these tend to be the better students who are trying for admission to exclusive colleges, perhaps outside the Midwest. If only a few students in a state are taking the SAT I, they are probably the better students in the state and their average scores would then be higher than the average scores for other states. Thus, as the percentage of students taking the SAT I increases, the average score tends to decrease. Although this explanation makes sense, we cannot be sure from these data alone. E5. a. Plots A, B, and C are the most linear. Plot D is not linear because of the seven universities in the lower right, which may be different from the rest. Plot A, of graduation rate versus alumni giving rate, gives some impression of downward curvature. However, if you disregard the point in the upper right, the impression of any curvature disappears. Plots A, B, and C all have just one cluster. However, plot D, the plot of graduation rate versus top 10% in high school, has two clusters. Most of the points follow the upward linear trend, but the cluster of seven points in the lower right with the highest percentage of freshmen in the top 10% shows little relationship with the graduation rate. Plots A and C have possible outliers. Plot A, the plot of graduation rate versus alumni giving rate has a possible outlier in the upper right. The point is below the general trend and its x-value (but not its y-value) is unusually large. 
In plot C, the plot of graduation rate versus SAT 75th percentile, the points toward the upper left and the middle right should be examined because they are farther from the general trend than the other points, although neither their x-values nor their y-values are unusual. The point in the lower right of plot D should also be examined, along with the other six points nearby.
b. Plots A and C have similar moderate positive linear trends. Plot D shows wide variation in both variables, with little or no trend. Plot B is the only plot that shows a negative trend.
c. Among these four variables, it appears that the alumni giving rate is the best predictor of the graduation rate, and SAT scores (as measured by the 75th percentile) are second best. However, both of these relationships are moderate and neither is a strong predictor of graduation rate. Ranking in high school class (as measured by the top 10%) is almost useless as a predictor of college graduation rate. Plot A, of graduation rate versus alumni giving rate, owes part of the impression of a strong relationship to the point in the upper right. This plot shows some heteroscedasticity, with the graduation rate varying more with smaller alumni giving rates.
Students should understand that even though the relationship between, say, the graduation rate and the student/faculty ratio is negative, that’s not what makes the student/faculty ratio a poor predictor of the graduation rate. Given a specific student/faculty ratio, we can predict a graduation rate. The problem is the great deal of uncertainty about how close the actual rate would be to the predicted rate because the range of graduation rates is large for any given student/faculty ratio.
d. The relationships could change considerably when looking at all universities because these are highly rated universities, so the values of all variables tend to be “good.” A larger collection of universities may have more spread in the values of all variables and, possibly, a stronger relationship between graduation rate and other variables.
To explain to students how this could be, show students the following version of the graduation rate versus SAT 75th percentile plot. On this plot, the x’s represent the top 25 universities and the closed circles represent the next 25 most highly rated universities. Note that if you look at either group, there is very little upward trend. However, putting the two groups together gives a stronger linear trend. If another group of 25 universities were added, the trend probably would be stronger. This phenomenon is sometimes called the effect of a restricted range.
[Scatterplot: Graduation Rate versus SAT 75th Percentile; legend: Top 25%, 26%–50%]
Note on E5d: You may wish to mention that these relationships hold for universities and that the data provide no evidence about whether the relationships hold for individuals.
e. Graduation rates may increase as SAT scores increase because better prepared students may be more successful in college courses, so a university with a greater number of prepared students will, on average, graduate a higher percentage of students. Alumni giving rates may increase as graduation rates increase because the university has produced happy alumni. These data do not “prove” this claim, however, because of other possible explanations. These types of observational studies cannot prove claims; the proof of a claim requires an experiment, which is one of the topics of the next chapter.
Note on E6: If your students aren’t working with computers, you may wish to provide them with copies of the three scatterplots.
E6. All three have positive association, but circumference appears to have the strongest positive association with hat size. Students may discover this rule: hat size is equal to circumference divided by π and then rounded to the closest eighth. The measurements tend to come aligned in vertical strips, indicating that the students made the measurements only to the nearest quarter of an inch.
[Three scatterplots: Hat Size versus Circumference (in.), Hat Size versus Major Axis Length (in.), Hat Size versus Minor Axis Length (in.)]
E7. a. The cases are the individual employees at the time of the layoffs, and the variables are the employee’s age at hire and the year of hire. The triangular shape of this plot indicates heteroscedasticity. The direction of the cloud is upward to the right, showing weak positive association between age at hire and year of hire. There are no points in the upper-left side of the plot because these employees would have reached retirement age.
b. This plot does not help us decide this question. Because so many people who were hired in the early years (and even recently) would have retired, we do not know whether older people were hired then or not. To determine whether age discrimination in hiring may have existed, we need a plot of the age at hire of all people hired, not just those who remained at the time of layoffs.
c. From this plot, it appears that people hired earliest were more subject to layoff, not necessarily the older employees. Everyone hired before the early 1960s was laid off, but not all of the older people were laid off. Perhaps, then, it was higher salaries because of seniority or obsolete job skills that resulted in a greater proportion of older employees being laid off.
Note on E8: A computer should be used for this exercise. This open-ended investigation can be very time consuming if students do all three parts. We suggest that you assign just one of the three parts to each student (or small group of students) and have them share their results.
Or you could also give each student a copy of the scatterplot matrix on page 91 (also reproduced on page 94). This graphic was made with statistical software. All of the relationships requested in this exercise are shown in one large plot. For example, the plots requested in part a, with cost per hour on the vertical axis, are shown across the bottom row.

E8. a. cost per hour

i. Students might point out that all of the associations are positive. As the variable on the x-axis increases, so does the cost per hour. Although all relationships are strong, the strongest are cost per hour versus fuel consumption per hour and versus number of seats. The other three relationships are weaker and show less constant strength. The scatterplot of cost per hour against cargo space is the most fan-shaped. Relationships of cost per hour to fuel consumption per hour and to number of seats are the most linear. The relationship of cost per hour to speed is the most curved. As speed increases, the cost per hour stays relatively constant up to about 460 miles per hour and then increases rapidly with increasing speed. There appear to be no outliers in cost per hour. However, one plane is an outlier in flight length, the B747-400.

ii. The patterns in the plots do show that bigger planes cost more per hour to operate, but this may be perfectly efficient given that they carry more people and more cargo (and go faster). One variable that might measure cost efficiency for the airplanes is cost per hour per seat (cost/seat). (Students may come up with others.) A plot showing this variable plotted against its denominator is shown here. Notice cost per hour per seat remains somewhat constant (but with a lot of variability) across the number of passengers carried. That is, larger planes tend to cost about the same to fly a passenger for an hour as smaller planes.
However, larger planes also tend to go faster and take less time to travel the same number of miles. Considering that, larger planes may be more efficient.

[Scatterplot: Cost Per Hour Per Seat versus Seats]

The next scatterplot shows cost per passenger mile (cost per hour divided by speed in mi/h, then divided by the number of seats or passengers carried) versus number of seats.

[Scatterplot: Cost Per Passenger Per Mile versus Seats]

Now we see that larger planes do tend to be somewhat more cost efficient per passenger mile. However, the relationship is weak.

b. flight length

i. Students might discover some of these things from their scatterplots or from the scatterplots in the fourth column of the scatterplot matrix: Number of seats has a stronger relationship with flight length than does cargo. These are all passenger aircraft, and their primary purpose is to carry passengers. Cargo is added as space (and weight) is available. The weak relationship in the cargo versus flight length plot is due primarily to two aircraft, the large A300-600 (Airbus), used primarily to carry passengers and cargo over the short routes in Europe, and the B747-400, used to carry large numbers of passengers (and relatively less freight) over the very long international routes. Planes with the longest flight lengths (the B747’s) have the most seats but are not at the top of the cargo carried. This can be seen in the plots of seats and cargo versus length, but you have to look back at the data to identify the planes.

A description of the scatterplot of speed versus flight length follows.

Cases and variables: The cases are the planes in the data set, and the variables are the airborne speed in miles per hour and the length of flight in miles.

Shape: The shape shows a single cluster of points in a thin, curved cloud that opens downward, with one point (the B747-400) as an outlier on the length axis.

Trend: The direction of the relationship is positive.

Strength: The relationship is very strong; a pattern is quite obvious.

Generalization: It seems reasonable that a similar pattern might appear even if other planes were added to the study. The general pattern of speed versus flight length should not depend entirely on the specific planes being studied here.

Explanation: As the typical length of flight increases, airlines tend to use faster planes because it is inefficient to use a fast plane on a short flight and inconvenient for travelers to use a slow plane on a long flight. But there is a maximum speed that can be achieved by the designs used for commercial aircraft so that few planes fly much over 500 miles per hour, which causes the leveling off of the plot for the longer flights.

ii. The faster planes used on the longer flights use more fuel per hour, but they cover many more miles in an hour than do the slower planes. So perhaps they use less fuel per mile. To compute gallons of fuel used per mile, divide fuel in gallons per hour by speed in miles per hour. This plot shows fuel consumption in gallons per mile plotted against flight length.

[Scatterplot: Gallons Of Fuel Per Mile versus Flight Length]

Apparently, the planes that are capable of flying longer distances use more gallons of fuel per mile than do planes that fly shorter distances. This isn’t surprising, as they carry more passengers and cargo. Further, flying faster may take more gallons per mile (as with an automobile).

c. speed, seats, and cargo

i. Here are some things students may discover: The curvature is more pronounced in the relationship between speed and cargo. The slower (and smaller) planes carry little cargo, and the plane that carries the biggest cargo has only about medium speed.
The planes that carry the biggest cargo carry only a moderate number of passengers. The A300-600 (Airbus) is unusually slow, both for the amount of cargo it carries (it has the largest cargo capacity of all the planes on the list) and for the number of seats it has.

ii. The flat part to the left of the plot of cargo versus seats reveals that passenger planes with a relatively small number of seats (up to nearly 200) carry very little cargo. The fan shape to the right reveals that as the planes get bigger and the number of seats increases beyond 200, the amount of cargo carried by these planes also increases. However, the variation in the cargo-carrying capacity of a plane also increases as the number of seats increases.

[Scatterplot matrix of the aircraft data; variables: Seats, Cargo, Speed, FlLength, Fuel, Cost]

Distributions and Relationships

 | Chapter 2: One Variable | Chapter 3: Two Variables
Key Idea | Distribution | Relationship (association)
Plots | Dot plot, stemplot, boxplot, histogram | Scatterplot
Shape | Normal, uniform, or skewed; symmetric; clusters, gaps, and outliers | Linear or curved; constant strength; clusters, gaps, and outliers
Ideal Shape | Normal | Linear (oval/ellipse)
Measure of Center | Mean, median | Regression line
Measure of Spread from the Center | Standard deviation, interquartile range | Correlation

Describing Bivariate Data

Trend: Positive; Negative; None
Strength: Strong; Moderate; Weak
Variability: Constant; Varying Strength: Fan to Right; Varying Strength: Fan to Left
Pattern: Linear; Curved; None
Influential Point: Outlier in x; Outlier in y; Outlier in the Residuals

Matrix Plot of Aircraft Data

[Scatterplot matrix of the aircraft data; variables: Seats, Cargo, Speed, FlLength, Fuel, Cost]

3.2 Getting a Line on the Pattern

Objectives
After reviewing the definition of slope and the slope-intercept form of a linear equation, students will learn
• to interpret the slope and y-intercept in the context of the situation
• to understand when it is appropriate to use a fitted line to model a relationship and to predict y when the value of x is known
• to understand that interpolation is more trustworthy than extrapolation
• to compute and interpret residuals and draw them on the scatterplot
• to understand that the least squares regression line minimizes the sum of the squared errors (residuals)
• to compute the least squares regression line
• to read regression output from various statistical software packages
• to understand various properties of the least squares regression line

Important Terms and Concepts
• slope; y-intercept
• least squares regression line; fitted line
• predictor or explanatory variable x
• predicted value ŷ
• response variable y
• observed value y
• interpolation; extrapolation
• residual
• sum of squared errors (SSE)

Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline as described here. The actual text of the AP Statistics Topic Outline and the complete correlation begin on page xxi.
ID2 Students examine bivariate data for correlation and linearity.
ID3 Students examine least squares regression lines for bivariate data.
Lesson Planning

Class Time
Two to three days

Materials
For Activity 3.2a, a ruler with a millimeter scale and a textbook for each pair of students

Suggested Assignments

Classwork
Essential, Recommended: Activity 3.2a; D3–5, D8; P3–8
Optional: D6, D7

Homework
Essential: E9, E11, E13, E15, E17, E19, E21
Recommended: E14, E22, E23
Optional: E10, E12, E16, E18, E20, E24–26

Lesson Notes: Lines as Summaries
Students should be able to find the equation of a line, given two points or a point and a slope. However, many have not interpreted an equation of a line in the context of a situation, especially its slope. See Activity 3.2a step 5 for an example.

Why Do We Use $y = b_0 + b_1 x$ in Statistics Rather Than $y = mx + b$?
In multiple regression, there can be many explanatory variables: $x_1, x_2, x_3, \ldots, x_n$. For example, if you want to predict college grade point average, you might have a linear combination of variables such as high school GPA, number of Advanced Placement courses, SAT score, number of mathematics courses taken in high school, and so on. The fitted “plane” would then be of the form

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_n x_n$$

which is a straightforward generalization of

$$\hat{y} = b_0 + b_1 x$$

If we used the form $y = mx + b$, we would have no obvious way to generalize the symbols to the case of more than one explanatory variable.

Notes on Calculator Use
When using a calculator to get a least squares regression line, there is a new wrinkle: The equation of the regression line may be written $y = a + bx$. So if your students are used to $y = mx + b$, with m as the slope and b as the intercept, warn them to be careful: b can be the slope. Alternatively (or additionally), a calculator may write the equation in the form $y = ax + b$. When a graphing calculator computes the regression line, it gives values of a and b precise to many decimal places.
If you round these values, any calculations you do with them (such as interpolating or extrapolating) will retain that error. To avoid this, use the stored values for a and b from your calculator. For the TI-83 Plus or TI-84 Plus, they can be found in the variables menu (VARS). Select 5:Statistics… then arrow over to EQ. The a and b listed are the values from the most recently calculated regression equation.

The TI-84 Plus calculator has a statistical feature called Manual-Fit, which allows you to place a line on the screen and adjust it by changing the slope and y-intercept right from the graph screen. This feature could be used to quickly create the lines for parts b and c. You can access this command by pressing STAT, arrowing over to CALC, and selecting D: MANUALFIT. Pressing the appropriate key while the manual line is on the screen stores the equation into Y1.

Activity 3.2a: Pinching Pages
The Pinching Pages activity can be done quickly and requires only the textbook and a ruler marked with millimeters for each student or pair of students. This activity gives students experience with an easily comprehended interpretation of the slope and y-intercept in context.

Note: This activity is essential for AP students; it will help them realize that the line of best fit need not contain any of the data points. To determine the equation of the line, students must use points on the line, which are not necessarily points from the data set.

1–2. A set of data for five “pinches” of a standard textbook is given below. (The sheets may have different thickness from the book your students have.) The thickness measurements are in millimeters. Students who subtract page numbers instead of counting sheets will get estimates that are half the thickness of a sheet (because there are two “pages” per sheet).

Row   Sheets   Thickness (mm)
1     50       6.0
2     100      11.0
3     150      12.5
4     200      17.0
5     250      21.0

3–4. This scatterplot shows a nearly straight line. It should be linear because an increase of one sheet adds a fixed amount to the thickness.

[Scatterplot: Total Thickness (mm) versus Number of Sheets]

5. The line passes near the points (50, 6) and (250, 21), so its slope is about 0.075. Slope measures change in y per unit change in x, and because a unit change in x is one sheet and y is thickness, this slope is an estimate of the thickness of one sheet. The y-intercept can be found by using any point on the line and solving for $b_0$:

$$6 = 0.075(50) + b_0$$
$$b_0 = 2.25$$

The y-intercept is the thickness if no sheets are included, so it is the approximate thickness of the cover of the book. You can write this equation as

total thickness = y-intercept + slope × (number of sheets)

6. Because the points lie so close to the line, we believe the measurements are fairly precise. However, because we measured only to the nearest 0.5 millimeter, we expect some inaccuracy in each measurement: some will be too small and others too large. By using the line to estimate the slope, we are averaging out those errors.

7. If the pinched pages did not include the front cover, the y-intercept should be very close to 0. It may not be exactly 0 because of measurement error.

Minimum Wage Example
In the example on page 118, students are asked to estimate the slope of a line. The data for Display 3.14 are given in the accompanying table. The equation of the regression line of minimum wage against year is

minimum wage = 0.1009 · year − 197

When estimating the slope of the line we used two points on the line, (1960, 0.80) and (2000, 4.80), with the y-values approximated from the plot. Neither of these two points is an actual data value and, in fact, none of the actual data values are on the line for which we want to estimate the slope.
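The two-point estimates described above (for the pinching-pages line and for the minimum wage line) are the same small computation. What follows is a minimal sketch in Python, not something taken from the student book; the helper name line_through is hypothetical, and the points are the ones quoted in the text.

```python
def line_through(p, q):
    """Return (slope, intercept) of the line through two points p and q.

    Hypothetical helper for illustration only.
    """
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)   # change in y per unit change in x
    intercept = y1 - slope * x1     # solve y1 = intercept + slope * x1
    return slope, intercept


# Pinching Pages: two points read from the fitted line (not necessarily data points).
sheet_slope, cover = line_through((50, 6.0), (250, 21.0))
print(f"thickness per sheet: {sheet_slope:.3f} mm, y-intercept: {cover:.2f} mm")

# Minimum wage: two points on the line, with y-values approximated from the plot.
wage_slope, wage_intercept = line_through((1960, 0.80), (2000, 4.80))
print(f"minimum wage = {wage_intercept:.1f} + {wage_slope:.2f} * year")
```

The pinching-pages result should agree with the slope of about 0.075 and intercept of 2.25 found in step 5, and the minimum wage slope of about 0.10 is close to the regression slope of 0.1009 even though neither point is an actual data value.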
Year   Minimum Wage ($)
1960   1.00
1965   1.25
1970   1.60
1975   2.10
1980   3.10
1985   3.35
1990   3.80
1995   4.25
2000   5.15
2005   5.15

Why Points Farther Apart Give Better Estimates of the Slope
Here is an example to show why picking points with values of x that are farther apart works better when estimating the slope. Suppose the equation of the line is actually $y = x$, so the slope is 1. Suppose also that your estimate of y tends to be off by about 5 units. If you select 1 and 2 as values for x, you might estimate the y-values to be −4 rather than 1, and 7 rather than 2. These values give an estimate for the slope of

$$\frac{7 - (-4)}{2 - 1} = 11$$

which is wildly off. If you pick 1 and 101 as the values for x, you might estimate the y-values to be −4 and 106. This gives an estimate for the slope of

$$\frac{106 - (-4)}{101 - 1} = 1.1$$

which is fairly close.

Lesson Notes: Using Lines for Prediction
Students should be able to determine whether a regression model tends to overestimate or underestimate the value of y for particular values of x by viewing the residual plot. For example, if the observed value is above the line, the estimate is an underestimate because the predicted value on the line is below (less than) the observed value. As shown in Display 3.16 of the student book, a residual is the vertical distance between the actual (observed) value and the predicted value at a particular value of the explanatory variable.

Once students have found a regression line, some will rely on it to the point of forgetting about the variability in the original data. It is not uncommon for students to predict a value but then not be able to see that it may be an underestimate, overestimate, or about right.
Instead of going back and looking at how the data points vary around that predicted value, students might state that the value is a good estimate because it falls on the regression line and the line is an appropriate model for these data. Even after you show them various clusters that are below or above the line and tell them that the estimates from the line can be much too high or too low, students may still have a hard time accepting that a good model can produce estimates that are either too high or too low.

You may want to connect the term explanatory variable to the independent variable and the term response variable to the dependent variable to help students connect the terminology with what they understand about functions.

Lesson Notes: Least Squares Regression Lines

Why the SSE?
Why do we minimize the SSE, the sum of the squared errors (residuals), not just the sum of the errors or the sum of the absolute values of the errors?
• Minimizing the sum of the residuals does not give a unique line. As students will see in E20, any line through the point of averages has residuals that sum to 0, and some of these lines clearly are not good fits to the data.
• The same reason applies to the sum of the absolute errors. For example, consider the four points (0, 0), (0, 1), (1, 1), and (1, 2). The lines $y = 1$, $y = x$, $y = 1 + x$, $y = 2x$, and $y = 0.5 + 1.5x$ all have a sum of absolute residuals equal to 2. (See also D7.)
• The sum of squared errors is a recurring theme in statistics. Students first met this idea with the standard deviation, which is constructed from the sum of the squared “errors” of the data values from the mean. The mean, in fact, is the measure of the center that minimizes the sum of the squared errors, much as the regression line is the “center” line that minimizes that sum.
• A nice formula results when we minimize the SSE.
Although students may not yet appreciate it, the formula for the slope of the least squares regression line is easy to use, and it is easy to prove that it minimizes the sum of the squared errors. There is no simple formula for minimizing the sum of the absolute values of the residuals.
• A sum of squared errors is like our usual measure of Euclidean distance:

$$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$

To see how this applies to regression, think of the data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, as two vectors

$$\mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T \quad \text{and} \quad \mathbf{y} = [y_1\ y_2\ \cdots\ y_n]^T$$

in n-dimensional space. If the points $(x_i, y_i)$ are not collinear, then vector $\mathbf{y}$ does not lie in the plane spanned by the vectors $I = [1\ 1\ \cdots\ 1]^T$ and $\mathbf{x}$. That is, $\mathbf{y}$ is not equal to $b_0 I + b_1 \mathbf{x}$ for any $b_0$ and $b_1$. The SSE is the square of the Euclidean distance between the endpoint of vector $\mathbf{y}$ and the plane spanned by $I$ and $\mathbf{x}$.

These two articles give more on this geometric interpretation:
• “The Geometry of Linear Regression” by Richard Parris in Consortium 58 (Summer 1996): 8–9, or math.exeter.edu/rparris/documents.html
• “The ‘Naturalness’ of Squaring in Linear Regression,” by Dan Teague, at courses.ncssm.edu/math/TALKS

Properties of the Least Squares Regression Line
The boxed information on page 125 of the student book gives a concise summary of the important facts about residuals from a least squares regression line. In addition, it provides formulas and a procedure by which a student can calculate the equation of a least squares regression line. The third property of the least squares regression line is equivalent to saying that the sum of the squares of the deviations of the values of y from their predicted values ŷ is as small as possible, or that the SSE is as small as possible.

Note: The AP Statistics Exam is unlikely to ask a student to calculate a regression equation by hand in this way. However, it is important for all students to experience the process to deepen their understanding.
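The by-hand procedure and the "Why the SSE?" argument can be checked together in a few lines of code. The sketch below is an illustration under stated assumptions, not the student book's procedure verbatim: it assumes the usual deviation formulas (slope equals the sum of products of deviations divided by the sum of squared x-deviations, and the line passes through the point of averages), and the function names least_squares and residuals are hypothetical. The four data points and the five candidate lines are the ones listed in the discussion of absolute errors.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar           # the line passes through (xbar, ybar)
    return b0, b1


def residuals(b0, b1, xs, ys):
    """Observed value minus predicted value for each case."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# The four points (0, 0), (0, 1), (1, 1), (1, 2) from the "Why the SSE?" notes.
xs, ys = [0, 0, 1, 1], [0, 1, 1, 2]

b0, b1 = least_squares(xs, ys)
ls_res = residuals(b0, b1, xs, ys)
ls_sse = sum(e ** 2 for e in ls_res)
print("least squares line:", b0, "+", b1, "x; SSE =", ls_sse)

# Candidate lines from the notes, stored as (intercept, slope).
candidates = {
    "y = 1": (1, 0),
    "y = x": (0, 1),
    "y = 1 + x": (1, 1),
    "y = 2x": (0, 2),
    "y = 0.5 + 1.5x": (0.5, 1.5),
}
abs_sums, sses = {}, {}
for label, (c0, c1) in candidates.items():
    res = residuals(c0, c1, xs, ys)
    abs_sums[label] = sum(abs(e) for e in res)   # identical for every candidate
    sses[label] = sum(e ** 2 for e in res)       # differs, and exceeds ls_sse
    print(label, "sum |residual| =", abs_sums[label], "SSE =", sses[label])
```

Every candidate line ties on the sum of absolute residuals, so that criterion cannot choose among them, while each has a larger SSE than the line returned by least_squares. The residuals of that line also sum to 0, as expected for a line through the point of averages.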
Proof of the Regression Formula
Many multivariate calculus books include a proof that the procedure given in the text does indeed give the equation of the line that minimizes the sum of the squared errors. A readable one may be found in William G. McCallum et al., Multivariable Calculus, 3rd ed. (New York: Wiley, 2002), page 714. For a linear algebra proof, see David C. Lay, Linear Algebra and Its Applications, 2nd ed. (Reading, Mass.: Addison-Wesley, 2000), pages 404–416. For a noncalculus-based approach, see Dan Kalman, Elementary Mathematical Models (Washington, D.C.: Mathematical Association of America, 1997), Chapter 8.

Note on Calculator Use
Some graphing calculators automatically calculate residuals and store them every time a regression is performed. See Calculator Note 3D in Calculator Notes for the Texas Instruments TI-83 Plus and TI-84 Plus.

Lesson Notes: Reading Computer Output
Complete regression analyses are supplied even though students will not learn to interpret most of the details of these analyses until later. AP students should learn to pick out the needed computations from the entire analysis.

Typically, software writes equations using variable names instead of using x’s and y’s. For example, in the minimum wage example on pages 118–119 of the student book the equation $y = -195.20 + 0.10x$ could be written as minimum wage = −195.20 + 0.10 · year. Note how the software packages use the variable names supplied by the user.

Notes for AP Teachers

Regression Lines on the AP Exam
Be sure that students use ŷ (not y) when writing equations of regression lines. Points have been deducted for this error in the grading of AP Exams.

Interpreting the Slope
The student book uses two different wordings for interpreting the slope of a regression line.
To illustrate the first wording, consider the situation where the height of a child is the explanatory variable and her weight is the predicted value. The measurements of her height and weight are taken each month from her third birthday to her tenth birthday. A case (point on the scatterplot) is a specific month. The slope, 5, of the regression line can be interpreted as: “For every 1 inch increase in her height, her weight tended to increase by 5 pounds.” It makes sense to talk about her height increasing and how her weight tended to increase with it. On the other hand, in the situation where the adult students in a statistics class are the cases and the explanatory variable is height and the predicted value is weight, the slope, 5, of the regression line should be interpreted like this: “A student who is 1 inch taller than another student tends to be 5 pounds heavier.” It wouldn’t be quite right to say, “For every 1 inch increase in height, a student’s weight tends to increase by 5 pounds.” The height of a student isn’t increasing; nor is the weight. As always, a good interpretation depends on how a case is defined. Reading the Computer Output The AP Exam requires students to be adept at reading and interpreting computer printouts from popular statistical software packages. This chapter has many opportunities for students to develop skill in reading computer output. Statistics in Action with Fathom is a resource for using dynamic data software with the activities in the student book. As you give students opportunities to become comfortable using and interpreting statistical output from Fathom, Minitab, and other software packages, you might use resources on reading statistics software, such as Chapter 8 of AP Statistics: Preparing for the Advanced Placement Examination, by James F. Bohan (New York: AMSCO School Publications, 2000). 
Modeling Good Answers
Once students have completed P5, show them the answer as a model of what is expected on the AP Exam. The answer to P5 gives lots of details about the fitted line and the error. A PDF file containing the questions and the model answers is available at www.keypress.com/keyonline.

Solutions

Discussion

Note on D3: The interpretation of slope as a rate may be new to some students. During the discussion, provide several more examples of rate from students’ experience, for example, miles per gallon or miles per hour.

D3. a. cost of purchase versus number of gallons purchased
b. miles driven versus number of gallons used
c. weight in pounds versus volume of the liquid

D4. Answers will vary a little, but a summarizing line should come close to the points (1970, 110) and (2005, 590) for a slope of approximately

$$\frac{590 - 110}{2005 - 1970} \approx 13.7$$

This slope tells us that the CPI tended to increase at the rate of approximately $13.70 per year across this time span. The y-intercept can be found by solving

$$110 = b_0 + 13.7(1970)$$
$$b_0 = -26879$$

The equation of the line is

predicted CPI = −26879 + 13.7 · year

D5. a. Far below the line; the point lies on the line.
b. The fitted line is too low. It lies below all of the points. Move the line up so that there are both positive and negative residuals.
c. The fitted line is mean income = −8300.6 + 4.2248 · year. A literal interpretation of the intercept would be that in year 0, the mean income was a negative $8300. It makes no sense to extrapolate this far backward in time, so the intercept does not have a literal interpretation in context.

D6. The arithmetic is fine. The reasoning is an amusing example of the folly of extreme extrapolation. The equation is

length = 6460 − 1.375 · year

D7. a. The plot and a table that includes the absolute deviations are shown. This line fits best because it passes through the mean value of y at each value of x. It also minimizes the variation among the residuals.

x   y   ŷ   |y − ŷ|
0   0   1   1
0   2   1   1
2   2   3   1
2   4   3   1
Sum of |y − ŷ|: 4

[Plot of y versus x]

b. Another such line is $y = 1.5 + 0.5x$. The residuals are −1.5, 0.5, −0.5, and 1.5, which sum in absolute value to 4.
c. Another such line is $y = 0.2 + 1.6x$. The residuals are −0.2, 1.8, −1.4, and 0.6, which sum in absolute value to 4.
d. Such a line is $y = 2.5 + 0.5x$. The residuals are −2.5, −0.5, −1.5, and 0.5, which sum in absolute value to 5.
e. The original line, $y = 1 + x$, is the least squares line. The sum of squared residuals is 4. The sums of squared residuals for the lines in b, c, and d are, respectively, 5.0, 5.6, and 9.0. The least squares line minimizes the sum of squared residuals among these lines.
f. The standard deviation of the residuals for the least squares line is 1.15. For the lines in b, c, and d, the standard deviations are, respectively, 1.2910, 1.3466, and 1.2910. The least squares line also minimizes the standard deviation of the residuals among these lines.

D8. a. income = −8300.58 + 4.22478 · year. Estimating the SSE from the plot, you get about

$$(-4)^2 + 1^2 + (-1)^2 + (-2)^2 + (-3)^2 + 3^2 + 7^2 + 5^2 + 3^2 + (-8)^2$$

or about 182. From the printout, you get 182.340.
b. Minitab gives the equation of the regression line as well as giving the coefficients in a table. There are also a few differences in the analysis of variance section, but basically the same information is presented.

Note on D8: With Fathom, students find the equation of the least squares line by creating a scatterplot and choosing Least-Squares Line from the Graph menu. They find the SSE by choosing Show Squares from the Graph menu. The output looks like this.

[Fathom scatter plot: Mean Net Income versus Year. GeneralFamily_Practice = 4.2248Year − 8300.6; r² = 0.91; Sum of squares = 182.3]

Practice

P3. The points (40, 91) and (240, 88.3) lie on or near the regression line, so the slope is about −0.0135. Each day, the eraser tended to lose around 0.0135 gram of weight.

P4. a. About 0.8.
b. If one student has a hand length 1 inch longer than another student, we would expect the first student to have a hand width that is about 0.8 inch wider than that of the second student.
c. hand width = 1.7 + 0.8 · hand length
d. It looks like these students measured their hand spans with their fingers together rather than spread apart. If these points were removed, the regression line would move up slightly at the end for smaller hand lengths and move up a bit more at the end for longer hand lengths. In fact, the equation becomes hand width = 1.03 · hand length + 0.65.

P5. a. The student/faculty ratio is the predictor or explanatory variable, and the alumni giving rate (in percent) is the response variable.
b. As you can see from the plot, a run of 5 in the direction of the x-axis corresponds to a drop of about 10 percentage points in the direction of the y-axis.

[Scatterplot: Alumni Giving Percentage versus Student/Faculty Ratio]

The slope of the fitted line is −2, which is equal to

$$\frac{\text{rise}}{\text{run}} = \frac{-10}{5} = -2$$

c. The y-intercept would mean that a university that has a student/faculty ratio of 0 (i.e., no students) would have a giving rate of 55%. This makes no sense in this context. Extrapolation is not reasonable in this case.
d. The giving rate would be about $55 - 2(16)$, or 23%. The error probably is rather large because the points are not clustered closely about the line. There is quite a lot of variation around the line, especially for the universities with smaller student/faculty ratios. On the average, a prediction would be off by about 7 percentage points.
e. The point is just above the line, and the residual is about 1 or 2. The residual for the point with the highest giving rate is about $56 - 40 = 16$.
f.
The fitted line gives ŷ = 55 − 2(6) = 43. The residual is 32 − 43 = −11.
g. Because the line has a negative slope, the largest possible predicted giving rate occurs when the student/faculty ratio is as small as possible. The smallest the student/faculty ratio could be is 0 (if there were no students). This ratio gives a predicted giving rate of 55 − 2(0) = 55. This is the largest possible predicted giving rate. The rate at Piranha State is larger than 55%, so the residual for Piranha State will be positive.

P6. a. and b.

[Scatterplot: Calories versus Fat (g)]

b. The equation is ŷ = 279.75 + 2.75x. The slope and y-intercept are found using this table.

Pizza   x    y     x − x̄   y − ȳ   (x − x̄)(y − ȳ)   (x − x̄)²
1       9    305   −2       −5       10                4
2       11   309   0        −1       0                 0
3       13   316   2        6        12                4
Sum     33   930   0        0        22                8
Mean    11   310

b1 = 22/8 = 2.75
b0 = 310 − 2.75(11) = 279.75

c. The slope of 2.75 means that if one pizza has one gram of fat more than another, it tends to have 2.75 more calories. The y-intercept means 5 ounces of pizza with no fat is predicted to have 279.75 calories, which may be reasonable.
d. The point of averages is (11, 310), which satisfies the equation: 279.75 + 2.75(11) = 310.
e. The residuals are
305 − [279.75 + 2.75(9)] = 0.5
309 − [279.75 + 2.75(11)] = −1.0
316 − [279.75 + 2.75(13)] = 0.5
The sum is 0. (When computing, the sum of the residuals typically won’t be exactly 0 because the coefficients usually must be rounded.)

P7. [3.5, 8.5, 1, 65, 85, 1] The equation of the least squares regression line is % on time = 87.2 − 2.15 mishandled baggage. In order, the residuals are 4.08, 2.31, 0.71, 6.51, 1.56, 0.66, 0.11, 0.18, 2.99, and 8.47.

P8. Yes, the regression equation and the mean of the response variable, 310, are the same as you computed by hand. The SSE is 0.5² + (−1)² + 0.5² = 1.5 and is found in the Analysis of Variance table in row “Error,” column “Sum of Squares.”

Exercises

E9. a. I-E; II-C; III-A; IV-D; V-B
b. I-A; II-E; III-B; IV-D; V-C

E10. a. Pizza Hut’s Hand Tossed and Little Caesar’s Original Round have the fewest calories. They also have the least fat. The right side of the graph contains pizzas with the most fat.
b. I. E  II. D  III. A  IV. C  V. B
c. i. A: The line lies above all the points.
ii. E: The line lies below most of the points.
iii. B: The line lies over most points on the left and under most points on the right.
iv. D: The line lies under most points on the left and over most points on the right.
v. C: This line fits best overall, going through the middle of the points on both the left and right.

E11. a. The slope is about (67 − 37)/(14 − 2), or 2.5 inches per year.
b. The median height of boys of a given age tends to increase about 2.5 inches per year from the ages of 2 to 14.
c. Answers will vary, but using this slope and the point (3, 39), the equation is approximately median height = 31.5 + 2.5 age.
d. The y-intercept of 31.5 would mean that the median length of an average newborn is 31.5 inches. Because this is clearly too long, this extrapolation is not valid.

E12. a. The calorie prediction for a pizza with 10.5 grams of fat is about 270 calories. The calorie prediction for a pizza with 15 grams of fat is about 335 calories.
b. The slope of the line is approximately (335 − 270)/(15 − 10.5) ≈ 14.4. Using the point (15, 335), the equation is approximately ŷ = 119 + 14.4x.
c. The estimated slope is quite a bit higher than 9. Other ingredients must add calories, which also increase as the fat content increases.

E13. a. The points in this scatterplot lie perfectly on a straight line.

[Scatterplot: Reaction Distance (ft) versus Speed (mi/h)]

b. The y-intercept should be 0 because if the car has 0 speed, the reaction distance is 0.
c. The distance increases at the rate of 11 feet for every increase in 10 mi/h of the speed. Thus, the slope of the line is 11/10 = 1.1.
This means that for every 1 mi/h increase in speed, the reaction distance is an additional 1.1 feet.
d. The equation of the line is ŷ = 1.1x, where ŷ is predicted reaction distance in feet and x is speed in miles per hour.
e. The predictions are ŷ = 1.1(55) = 60.5 ft and ŷ = 1.1(75) = 82.5 ft.
f. If the reaction time were longer, the reaction distances would be greater, with more than 11 feet between each successive value. Thus, the slope of the line would increase and the predicted distances would be longer than the corresponding predicted distances for the model given here. The equation would be ŷ = 1.47x.

Note: The formula for the reaction distance in feet for a given speed in miles per hour and reaction time in seconds is

reaction distance = (reaction time (in s)) × (speed (in mi/h)) × (5280 ft/mi) / (3600 s/h)

E14. a. Fuel consumption rate is the explanatory variable. Operating cost is the response variable.
b. The slope is approximately 2.5. If one plane uses 1 gallon per hour more than another, its operating cost tends to be about $2.50 per hour more. This could be the cost of 1 gallon of fuel.
c. This value means that if an aircraft used no fuel, the cost would be $470 per hour. While it doesn’t make sense for an aircraft to use no fuel, it does make sense that there are costs besides fuel; the y-intercept would represent the cost per hour of running an aircraft in addition to fuel costs.
d. The cost per hour for a plane that consumes 1500 gallons per hour of fuel is approximately 470 + 2.50(1500), or $4220.

E15. a. The predictor variable is the arsenic concentration in the well water. The response variable is the concentration of arsenic in the toenails of people who use the well water.
b. There is a moderate positive linear relationship between arsenic concentration in the toenails of well water users and the arsenic concentration in the well water. There is a cluster around well water arsenic concentrations of 0 to 0.005 parts per million.
c.
The largest residual is about 0.3 parts per million.
d. The concentration of arsenic in this person’s toenails is about 0.4 parts per million.
e. Seven of the 21 wells are above this standard.

E16. a.
−40.58   Pizza Hut’s Pan
−17.66   Domino’s Deep Dish
−15.95   Pizza Hut’s Hand Tossed
−1.03    Little Caesar’s Original Round
14.28    Domino’s Hand Tossed
26.44    Little Caesar’s Deep Dish
34.50    Pizza Hut’s Stuffed Crust
b. The residual of −40.58 shows that Pizza Hut’s Pan Pizza has more than 40 fewer calories than would be predicted for a pizza with that pizza’s amount of fat.
c. You can see that both are negative because both points are below the regression line.

E17. a. x̄ = 2, ȳ = 26

b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
   = [(1 − 2)(31 − 26) + (2 − 2)(28 − 26) + (3 − 2)(19 − 26)] / [(1 − 2)² + (2 − 2)² + (3 − 2)²]
   = (−5 + 0 − 7) / (1 + 0 + 1)
   = −6

b0 = ȳ − b1 x̄ = 26 − (−6)(2) = 38

So the equation is ŷ = 38 − 6x, where ŷ is the predicted number of days with AQI > 100 in Detroit and x is the number of years after 2000.
b. The number of days with AQI greater than 100 in Detroit tended to decrease by 6 per year.
c. The residuals for each year are calculated in this table:

x    y     ŷ                  Residual y − ŷ
1    31    38 − 6 · 1 = 32    −1
2    28    38 − 6 · 2 = 26    2
3    19    38 − 6 · 3 = 20    −1

The largest residual is 2, for the year 2002.
d. SSE = (−1)² + 2² + (−1)² = 6
e. −1 + 2 + (−1) = 0
f. The equation for this line is y = 40 − 6x. (Since the residual for this point was 2, simply shift the line up two units.) The SSE for this line would be (−3)² + 0² + (−3)² = 18. This is larger than the SSE for the least squares line, so according to the least squares criterion, the first line was better. Students should agree that the least squares line fits better because it passes through the middle of the set of points. The line through the point for 2002 is too high.
g. This equation would be y = 37 − 6x. The fitted value for 2002 would be 37 − 6 · 2 = 25.
The only nonzero residual would be for this point, so the SSE would be 3² = 9.
h. As the plot shows, the least squares line is a better indicator of the trend of all the points than either of the others. For both other lines, all points are on or to the same side of the line.

[Scatterplot: Number of Days AQI > 100 versus Years Since 2000]

E18. a. x̄ = 13.1, ȳ = 307.143

x (Fat)   y (Calories)   x − x̄   y − ȳ      (x − x̄)(y − ȳ)   (x − x̄)²
9.0       230            −4.1     −77.143     316.2863          16.81
19.5      385            6.4      77.857      498.2848          40.96
14.0      280            0.9      −27.143     −24.4287          0.81
12.0      305            −1.1     −2.143      2.3573            1.21
8.0       230            −5.1     −77.143     393.4299          26.01
14.2      350            1.1      42.857      47.1427           1.21
15.0      370            1.9      62.857      119.4283          3.61

b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 1352.5 / 90.62 ≈ 14.93

So the equation is calories = 111.56 + 14.93 fat. The estimate in E12 was very close to this equation for the slope but not for the y-intercept.
b. From the scatterplot, Pizza Hut’s Pan Pizza has a residual of about −40. From this point alone, the SSE must be at least 40² = 1600. Thus, 4307 is the only possible SSE.

E19. a. The equation is height = 31.57 + 2.43 age. Here is the plot.

[Scatterplot with regression line; window [1, 15, 1, 30, 7, 5]]

b. The residual is positive because the point is above the line. The actual residual is 59.5 − 58.3 = 1.20, which is positive. (The residual is 1.21 with no rounding.)
c. The mean age is 8 years, and the mean height is 51 inches. Substitute into the regression equation to see that this point is on the regression line:
51 ≈ 31.57 + 2.43 · 8
51 ≈ 51.01

Note on E19c: This small discrepancy is due to rounding error. See the note on page 96.

d. This equation is very close to that for boys from E11.

Note on E20: For these proofs to make sense, students must realize the following about working with sums:
• For a constant a, Σa = na, since that means adding up n instances of a.
• The mean of a set of numbers is a constant, so Σx̄ = n x̄, where n x̄, in turn, is equal to the sum of the individual x values because n x̄ = n (Σxi / n) = Σxi.

E20. a. A horizontal line has an equation of the form y = a. The residual for a point (xi, yi) would be yi − a. Adding these up for n points, you get

Σ(yi − a) = Σyi − Σa = Σyi − na

Assume this sum is zero.

Σyi − na = 0
Σyi = na
Σyi / n = na / n
ȳ = a

So, a = ȳ. The horizontal line has equation y = ȳ, so it passes through (x̄, ȳ).
Conversely, if we start with the assumption that the horizontal line passes through the point (x̄, ȳ), the equation of the line is y = ȳ. The residual for a point (xi, yi) would be yi − ȳ. Summing these residuals we get

Σ(yi − ȳ) = Σyi − Σȳ = Σyi − n ȳ = Σyi − n (Σyi / n) = Σyi − Σyi = 0

b. Let the line be y = a + bx. For a point (xi, yi), the predicted y-value is a + b xi. So the residual is yi − a − b xi. Assume that the sum of these residuals is zero.

Σ(yi − a − b xi) = 0
Σyi − na − b Σxi = 0
Σyi = na + b Σxi
Σyi / n = na / n + b (Σxi / n)
ȳ = a + b x̄

This means that the point (x̄, ȳ) must satisfy the equation of the line if the residuals are to sum to zero.
Conversely, if we start with the assumption that the line passes through (x̄, ȳ), that means the equation ȳ = a + b x̄ must be true. So a = ȳ − b x̄. The residual for the point (xi, yi) would be

yi − (ȳ − b x̄) − b xi = yi − ȳ + b x̄ − b xi

The sum of these residuals then is

Σyi − n ȳ + n b x̄ − b Σxi = n ȳ − n ȳ + n b x̄ − n b x̄ = 0

c. As shown in parts a and b, any line that passes through the point (x̄, ȳ) makes the sum of the residuals equal to zero.

E21. a. height = 31.5989 + 2.47418 age. Answers will vary according to students’ original estimates. It should be fairly close.
b. The SSE is 2.4. This does seem reasonable from the scatterplot. The residuals are all quite small.

E22. a. percent alumni giving = 54.979 − 1.9455 student/faculty ratio
b. The largest residual is 22.31. The student/faculty ratio for that university is 13.
c. The fit should be 54.979 − 1.9455 · 13 = 29.6875. The table has the fit calculated correctly. The actual y-value (also given in the table) is 52. The residual, 52 − 29.69 = 22.31, is calculated correctly in the table.
d. The SSE is 3578.5. The value is large because there are many points and they are scattered widely around the regression line.

E23. height = 31.57 + 2.43 age

Age   Height   Predicted Height           Residual
2     35.1     31.57 + 2.43 · 2 = 36.43    35.1 − 36.43 = −1.33
8     51.7     31.57 + 2.43 · 8 = 51.01    51.7 − 51.01 = 0.69
14    63.6     31.57 + 2.43 · 14 = 65.59   63.6 − 65.59 = −1.99

With no rounding, the residuals are −1.325, 0.7, and −1.975, respectively. The first point is below the regression line, the second is above the line, and the third is below the line. This pattern of residuals suggests a curve in the trend of the data. In the scatterplot of all the data, the points at the far left are below the regression line, the points in the middle region lie mostly above the line, and the points at the right lie below the line. This suggests that a line may not be the best model for these data. You will learn more about this in Section 3.4.

E24. a. Yes; the ratio of price to gallons is the price per gallon, which is the same for all four purchasers. Another way to look at this is to observe that the price, y, increases at a constant rate for each additional gallon of gas purchased.
b. The relationship between average speed x, time y, and distance is speed · time = distance, or x · y = 80 in this scenario. Thus, plotting y (time) against x (speed) means plotting y = 80/x, which will be a curve (it’s a rotated hyperbola). Plotting y* = 1/y against x results in a straight line because

y* = 1/y = x/80 = (1/80) x

which is a linear equation.

E25. a. In the plot of calories versus fat, the association shows a positive trend that is moderately strong. Even though a few points lie relatively far from the pattern, fat content could be used as a reasonably good predictor of calories. The equation of the regression line is ŷ = 194.75 + 10.05x, where ŷ is the predicted number of calories and x is the number of grams of fat. The slope means that if one pizza has 1 more gram of fat than another, it tends to have 10.05 additional calories.

[Scatterplot: Calories versus Fat (g)]

b. In the plot of fat versus cost, there is a very weak positive association between fat and cost. The equation of the regression line is ŷ = 10.66 + 2.41x, where ŷ is the predicted number of grams of fat and x is the cost. The slope means that if one pizza costs $1 more than another, it tends to have 2.41 more grams of fat. (You will be able to check to see if this association is “real” in Chapter 11.)

[Scatterplot: Fat (g) versus Cost ($)]

c. In the plot of calories versus cost, there appears to be no linear association between calories and cost.

[Scatterplot: Calories versus Cost ($)]

d. In the analysis of the pizza data, calories has a moderately strong positive association with fat; the two tend to rise or fall together, which makes sense because fat has a lot of calories. Fat has a weak positive association with cost. There appears to be no association between cost and calories.

Note on E26: This open-ended investigation requires a computer. This exercise involves issues that might be sensitive for some students. You have the chance to emphasize strongly that “association does not imply causation.”

E26. The four scatterplots are shown here. There appears to be little or no association between the percentage living below the poverty line and the percentage living in metropolitan areas. Likewise, there appears to be little association between the poverty rate and the percentage of whites. The two outliers in this plot are regions with low percentages of whites, namely, Washington, D.C. (high poverty rate) and Hawaii (low poverty rate).
The poverty rate is negatively associated with the percentage of high school graduates; as the latter goes up, the percentage living in poverty generally goes down. Finally, poverty appears to be only weakly associated with percentage of families headed by a single parent (with Washington, D.C., again as an outlier). On the surface, it looks as though increasing graduation rates would have the largest effect on decreasing the poverty rate. Keep in mind, however, that the problem of poverty is much more complex than that, and many other variables are lurking in the background. Association is not the same as cause and effect. Simply increasing high school graduation rates, although it might be a good thing to do, will not automatically elevate all of those living below the poverty line to a better economic condition. It is even possible that a lower poverty rate may cause a higher high school graduation rate, or that a third “lurking” variable is influencing both of these variables.
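The hand computations in E17 can also be checked numerically. The following Python sketch is not part of the original solutions; the function names least_squares and sse are illustrative choices. It refits the Detroit AQI data (x = years after 2000, y = days with AQI > 100) with the formulas b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1 x̄, then compares the SSE of the least squares line with the two alternative lines from parts f and g.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Data from E17: years after 2000 and days with AQI > 100.
years = [1, 2, 3]
days = [31, 28, 19]

b0, b1 = least_squares(years, days)
print(b0, b1)                      # 38.0 -6.0
print(sse(years, days, b0, b1))    # 6.0  (least squares line)
print(sse(years, days, 40, -6))    # 18   (line through the 2002 point, part f)
print(sse(years, days, 37, -6))    # 9    (line through the other two points, part g)
```

Any other intercept and slope passed to sse for these three points should give a value larger than 6, which is what the least squares criterion asserts.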
[Scatterplots: Poverty (%) versus Metropolitan Residence (%); Poverty (%) versus White (%); Poverty (%) versus Graduates (%); Poverty (%) versus Single Parent (%)]

[Blackline master: Alumni Giving Percentage Versus Student/Faculty Ratio]

[Blackline master: Operating Cost Versus Fuel Consumption Rate; Cost ($/h) versus Fuel Consumption (gal/h)]

3.3 Correlation: The Strength of a Linear Trend

Objectives
• to estimate correlation from a scatterplot
• to understand that correlation should not be computed from data that are not linear and that a high correlation does not mean that the data are linear
• to understand r as the average product of the z-scores
• to use the relationship between r and the slope of the regression line
• to be aware of possible lurking variables and not assume that correlation implies causation
• to interpret r² as the proportion of the variation in the values of y that can be explained by x

The section “Regression Toward the Mean” is optional.
The objectives for that section are
• to visualize the regression line as the line of the means of the values of y for fixed values of x
• to recognize the regression effect

Important Terms and Concepts
• correlation coefficient
• lurking variable
• r², the coefficient of determination
• average product of z-scores
• correlation versus causation

Optional Terms and Concepts
• regression line as the line of means
• regression toward the mean (the regression effect)

Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline as described here. The actual text of the AP Statistics Topic Outline and the complete correlation begin on page xxi.
ID2 Students examine bivariate data for correlation and linearity.

Lesson Planning

Class Time
Two days if the optional section (“Regression Toward the Mean”) is not covered. Three days if the optional section is covered.

Materials
For Activity 3.3a, a measuring tape, a yardstick, or a meterstick for each pair of students

Suggested Assignments

Classwork
Essential: D9–13, D17, D18, D21; P9–19
Recommended: Activity 3.3a; D14, D20, D23
Optional: D15, D16, D19, D22, D24, D25; P20, P21; Chapter 3 Quiz 1

Homework
Essential: E27, E31, E33, E34, E37, E40
Recommended: E28, E29, E36, E39
Optional: E30, E32, E35, E38, E41, E42

Lesson Notes: Estimating the Correlation
Once a linear trend is established (positive or negative), the strength of the association can be measured using the correlation coefficient. The correlation is a useful measure of strength only if the data are linear, that is, clustered either loosely or tightly about a line.
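The two cautions in the objectives (do not compute a correlation for nonlinear data, and do not read a high correlation as proof of linearity) can be illustrated with a small computation. The Python sketch below is an illustration only, not part of the guide; the data sets are made up, and the function name correlation is a hypothetical choice. It assumes r is computed as the sum of the products of the z-scores divided by n − 1, with z-scores based on the sample standard deviation.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r as the 'average' product of z-scores: sum(zx * zy) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


# Made-up data: y = x^2 exactly, so y is completely determined by x.
symmetric_x = [-2, -1, 0, 1, 2]
symmetric_y = [x ** 2 for x in symmetric_x]

one_sided_x = [1, 2, 3, 4, 5]
one_sided_y = [x ** 2 for x in one_sided_x]

r_symmetric = correlation(symmetric_x, symmetric_y)
r_one_sided = correlation(one_sided_x, one_sided_y)

print(r_symmetric)   # approximately 0, despite a perfect curved relationship
print(r_one_sided)   # approximately 0.98, even though the points lie on a curve
```

Both data sets follow the same curve exactly. The first shows a strong relationship with a correlation near 0; the second shows a correlation near 1 for data that are not linear. In both cases a scatterplot, not r alone, reveals the shape.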
Types of Correlation
Francis Galton is credited with being the first to understand the idea of correlation (1888), which he called “co-relation.” The correlation coefficient used in the text is called Pearson’s correlation coefficient, after Karl Pearson, who introduced it in 1896. Other correlations not covered in this text include the rank correlation formulas of Charles Edward Spearman and Maurice G. Kendall. These give the correlation for paired observations that are ranks, such as the ranking of ten gymnasts by Judge A paired with the ranking of the same ten gymnasts by Judge B. If there are no ties in the ranks, Spearman’s formula gives the same value as Pearson’s.

What Happened to Normality?
The guiding features of analyzing data from Chapter 2 were
plot → shape → center → spread
In this section, they are interpreted for bivariate data as
scatterplot → shape → trend → strength
In Chapter 2, students were told that “normal” was the “ideal” shape: if a distribution is approximately normal, the mean and standard deviation generally are useful measures of center and variability. In this section, students are told that the ideal shape for a scatterplot is elliptical. With an elliptically shaped cloud of points, the regression line and the correlation are generally useful measures of center and variability.
What happened to normality? Look at Display 3.50 (page 153 of the student book) of a younger sister’s height plotted against her older sister’s height. The two “marginal” distributions, the separate distributions of the heights of the older sisters and the heights of the younger sisters, are approximately normal. The scatterplot of the joint behavior of the two variables shows an elliptical cloud that is more dense toward the middle than toward the edges. A three-dimensional plot of these data could show the density of data points everywhere in the x-y plane.
Such a plot could be modeled by a bivariate normal distribution similar to the one pictured below. (In this case, the distributions of x and y both have mean 0 and standard deviation 1; the correlation is 0.6.) This is called the joint distribution of x and y. One characteristic of bivariate normal distributions that will be important in Chapter 11: Inference for Regression, is that if you slice the distribution at any fixed value of x, the “conditional” distribution of the y’s corresponding to that x is normal. A three-dimensional diagram of this is shown in Display 11.4.

Activity 3.3a: Was Leonardo Correct?
This activity is highly recommended. To orchestrate the sharing of data, you may wish to provide the blackline master provided at the end of this section on page 124 and then have each student read his or her set of four values. Or you can make an overhead transparency and have students come up and write in their measurements.

1. Student measurements will vary. Sample results from one group of children and teens are given in the following answers. These measurements are in centimeters.

2. If Leonardo is correct, the points should lie near the lines:
arm span = height
kneeling height = (3/4) height
hand length = (1/9) height
Looking at the plots, these rules appear to be approximately correct. The lines are the least squares regression lines computed in step 3.

[Scatterplots: Arm Span (cm) versus Height (cm); Kneeling Height (cm) versus Height (cm); Hand Length (cm) versus Height (cm)]

3. The least squares regression line for predicting the arm span from the height has slope 1.03 and a y-intercept of magnitude 5.81; r = 0.99. The least squares regression line for predicting the kneeling height from the height has slope 0.73 and a y-intercept of magnitude 2.19; r = 0.989.
The least squares regression line for predicting the hand length from the height has slope 0.12 and a y-intercept of magnitude 2.97; r = 0.96. In each case, the correlation is quite high, at least 0.96. On many graphing calculators, you must perform the regression in order to calculate the correlation. See Calculator Note 3H in Calculator Notes for the Texas Instruments TI-83 Plus and TI-84 Plus.

4. For the first plot, the slope is 1.03. This means that if one student is 1 cm taller than another, his or her arm span tends to be 1.03 cm longer. Leonardo predicted a difference of 1 cm.
For the second plot, the slope is 0.73. This means that if one student is 1 cm taller than another, his or her kneeling height tends to be 0.73 cm taller. Leonardo predicted a difference of 0.75 cm.
For the third plot, the slope is 0.12. This means that if one student is 1 cm taller than another, his or her hand length tends to be 0.12 cm longer. Leonardo predicted a difference of 1/9, or 0.11 cm.
In each case, the points are packed tightly about the regression line, so there is a strong correlation.

5. Yes. The slopes are about what he predicted, the y-intercepts are close to 0 in each case, and the correlations are strong.

Lesson Notes: A Formula for the Correlation, r
You can use the spreadsheet capability of the TI-83 Plus or TI-84 Plus to speed the computation for the formula for the correlation on page 142 of the student book. See Calculator Note 3H in Calculator Notes for the Texas Instruments TI-83 Plus and TI-84 Plus.
The conversion of the x-y plane to the zx-zy plane places (x̄, ȳ) at the new origin. The relative position of the points does not change, but their coordinates become their respective z-scores.
This transformation of the plane facilitates a visual understanding of correlation as “the average of the products of the z-scores.” Depending on the backgrounds of your students, this is a great opportunity to discuss correlation as an application of transformations. Lesson Notes: Correlation Does Not Imply Causation Students will learn in Chapter 4 that to establish that one variable “causes” another, you must perform a randomized comparative experiment. Using observational data, such as the fact that smokers get more lung cancer than nonsmokers, does not establish that smoking causes lung cancer because other factors are not controlled, by randomization or otherwise. For example, to establish that breathing cigarette smoke causes tumors in rats, you would have to randomly assign two treatments—exposed to no smoke and exposed to smoke—to different rats and see whether the rats exposed to smoke develop significantly more tumors. If they do and the rats were otherwise treated alike, you can say that cigarette smoke causes tumors in rats. Causation is a tricky issue at this stage, not only because students haven’t yet studied experimental design but also because of the various meanings of the word “cause.” In some cases, known physical reasons do establish cause and effect, such as fire “causing” smoke. Scientists did not do a randomized comparative experiment to establish that fire “causes” smoke. In other cases, we can’t determine a physical link but rely on statistical evidence, as we do when we say that receiving love and praise during childhood “causes” good behavior. In the best of all worlds, causation is established using both kinds of evidence: a probable physical link along with statistical evidence, preferably in the form of a randomized comparative experiment. For example, both kinds of evidence have been used to establish that cigarette smoking causes cancer. Often the statistical evidence leads scientists to look for a physical link. 
Once that link is established, we do not need more experiments. The issues of causation, lurking variables, and confounding will reappear in Section 4.3.

Lesson Notes: Interpreting r²
In the regression examples in this textbook, there is one predictor variable, x, and one response variable, y. But many situations have more than one predictor variable. For example, in predicting the probability an entering student will graduate from a given college, the college may want to include such predictor variables as high school GPA, SAT scores, and number of hours worked per week in the regression. One way to account for many predictor variables in a prediction model is to use multiple regression. The formula for r does not generalize to regression where there is more than one predictor variable, but the idea of r² as the proportion of variance accounted for by the regression does.
One of the reasons that statistical software gives the value of r², or R-squared, rather than the correlation r is that R-squared does generalize to multiple regression. The values for SSE and SST can still be computed, the former as the sum of the squared residuals from the fitted plane, Σ(y − ŷ)², and the latter still as the sum of the squares of the differences of the response from its mean, Σ(y − ȳ)².
If you have Fathom, you can find the SST (SS Total) and SSE (SS Residual) by using multiple regression in the Model menu to find the regression equation.

What Is R-sq(adjusted)?
Statistical software gives not only the value of r² (sometimes seen as R-squared or R-sq) but also a value for “R-squared(adjusted),” which is slightly smaller. Although introductory students should ignore this for now, the adjusted value is actually the better value to use if your data are from a sample and you would like an approximately unbiased estimate of the r² for the entire population.
Just as r² looks at the proportional reduction in the total sum of squared errors, SST, by comparing SST to SSE, the adjusted R-squared, r²a, looks at the proportional reduction in total mean squared error. Mean squared errors are sums of squared errors divided by the appropriate degrees of freedom, so that in simple linear regression

MST = SST / (n − 1) and MSE = SSE / (n − 2)

Note also that MST = s²y, the variance of the y’s, and MSE = s², the variance of the residuals. Finally,

r²a = (MST − MSE) / MST = (s²y − s²) / s²y

Notes for AP Teachers

Interpreting r²
Even though r² is not currently in the AP Statistics Topic Outline, it has appeared on some AP Exams. Students should know the generic interpretation of r² is: <r²> of the variation in <response variable> can be explained by the linear relationship with <explanatory variable>. Virtually all technology tools return r². Many software packages return only r², rather than r.

Relationship Between r and the Slope
The AP Exam may expect students to calculate and interpret the linear regression equation for two variables if they are given only x̄, ȳ, sx, sy, and r.

Modeling Good Answers
Once students have completed E39, show them the model answer as an example of what is expected on the AP Statistics Exam. Students will see how the correlation coefficient is interpreted in context. A PDF file containing the question and the model answer is available at www.keypress.com/keyonline.

Solutions

Discussion

D9. a. 0.783 b. −0.783 c. 0.999 d. 0.906

D10. a. Relationships II and III are positive. Relationship IV is negative because the more socks in a bag, the cheaper they tend to be per sock. Relationship II is the strongest, with r = 1. Relationship I is the weakest; r is almost 0.
b. For I, there should be no relationship between these two variables.
That is, for any day of the month picked, the range of haircut costs will be about the same. This range will depend on local prices, and knowing the day of the month a person was born tells you nothing further about the cost of a haircut. Thus, there is virtually no correlation between these two variables.
For II, y does vary with x, but for a fixed x, there will be no variation in y. Each y will be exactly equal to px. The correlation is 1.
For III, generally, the more socks in a bag, the higher the price of the bag. There will be some variation in price because some brands of socks are more expensive than others, so there would be a strong correlation, but not a perfect one. Knowing the number of socks in the package does assist you in predicting a price range for the bag.
For IV, there will be some variation. Generally, the more socks in a bag, the cheaper the cost per sock. Knowing how many socks are in a bag can assist you in your prediction of the price range per sock.
All else being equal, the larger the variation in y at each value of x, the lower the correlation.

D11. a. For America West,

zx = (x − x̄)/sx = (4.36 − 5.739)/1.3977 = −0.98662
zy = (y − ȳ)/sy = (81.9 − 74.85)/5.0535 = 1.39507
zx · zy = (−0.98662)(1.39507) = −1.37640

b. Delta, in the lower right corner of Quadrant IV, has the largest product, zx · zy, in absolute value.
c. JetBlue, near the point (x̄, ȳ), has the smallest value of |zx · zy|. A point will make a small contribution if it is either near (x̄, ȳ) or near one of the lines x = x̄ and y = ȳ.

D12. a. The correlation measures the strength of a linear association by measuring how tightly packed the data points are about a straight line. Its size is affected most strikingly by points far away from (x̄, ȳ) and points not near the new coordinate axes, x = x̄ and y = ȳ.
b. Correlation is a unitless quantity because it is the average product of z-scores, which have no units. For example, the unit for mishandled bags is bags per thousand passengers. When computing the z-scores, the units cancel out. For America West, this would be

(x − x̄)/sx = (4.36 bags/thousand − 5.739 bags/thousand) / (1.3977 bags/thousand) = −0.98662

c. No. Because it is based on a symmetric calculation, zx · zy = zy · zx, r does not depend on which variable is chosen as x and which as y.

D13. a. For well-behaved data, the correlation will be positive because most of the products zx · zy are positive. It is not necessarily the case that r is positive, however. The scatterplot here has more points in Quadrants I and III, but the correlation is negative, r = −0.237.

[Scatterplot: y versus x, with the lines x̄ = 0 and ȳ = 0]

b. For well-behaved data, the correlation will be negative because most of the products zx · zy are negative. As in part a, it is not necessarily the case that r is negative.
c. For well-behaved data, the correlation will be near 0.

D14. When the correlation is small, the error in prediction will be larger than if the correlation were larger. A larger correlation (near 1 or −1) means the points are generally nearer the line, and predictions made using the line will be relatively close to the observed values. Even though the error in prediction is large when the correlation is small, having a regression line is better than having no line as long as there is a definite linear trend in the data. You could give an example from the quiz scores in Display 3.45. A student who scored 10 on the second quiz would be predicted to get about 15 on the third quiz, but we wouldn’t be surprised if this prediction was off by 10 points or so. A student who scored 25 on the first quiz would be expected to get about 23 on the second quiz, again with about the same estimated error.
These lead to two different predictions for the two students: 15 ± 10 is different from 23 ± 10 even though there is some overlap in what we would think of as reasonable bounds on the predictions.

Note on D14: Many students will say that the prediction from the regression equation is better than no prediction at all. To lead students toward the idea of $r^2$ introduced on page 148, ask them what they mean by “no prediction.” That is, what would their estimate be if they had full information about the scores on Quiz 2 and Quiz 3, but no information about the relationship between the two?

D15. a. Scenarios could be situations in which there is not a definite linear trend in the data (y appears to be unrelated to x), along with large variation in the y-values. Age versus month of birth for a large group of adults might be an example.
b. Scenarios could be situations in which there is a definite linear trend, but where there is much variation in the y-values at each level of x. SAT math score versus score on the first college calculus test for a group of college students could be an example. Family grocery bill per month versus number of people in the family could be another.
c. Scenarios could be situations in which the data points fit closely to a line but the cloud of points has considerable curvature, so as to make the straight line a poor measure of the center, or a poor description of the nature of the association between the variables. Height versus age for trees could be an example, because height levels off for older trees. You have seen other curved relationships earlier in this chapter.
d. Scenarios could be situations in which the data points fit closely to a line and the line has a slope that is not close to zero. Height versus age for growing children could be an example, as could height versus shoe size for adults.

D16. The growth rate will probably begin to slow down at some point, if it hasn’t already.
New blogs will continue to appear, but probably not at the same rate they did initially.

D17. An estimate of the slope is

$$b_1 = r \cdot \frac{s_y}{s_x} = 0.7 \cdot \frac{113}{115} \approx 0.69$$

To find the y-intercept, use the fact that the point (520, 508) is on the regression line:

$$y = \text{slope} \cdot x + y\text{-intercept}$$
$$508 = 0.69(520) + y\text{-intercept}$$
$$y\text{-intercept} = 149.2$$

The equation is critical reading = 149.2 + 0.69 · math.

D18. Having planes arrive late could result in more bags not making their connecting flight and so end up mishandled. This would result in a negative correlation between percentage of on-time arrivals and number of mishandled bags. It is also possible that there is no direct link between the variables; both could be a result of the lurking variable of whether or not the airline is generally well run.

D19. At first glance, it would seem that the more highly rated the university, the lower its acceptance rate. One possible explanation is that few students apply to the most selective of these universities unless they are pretty sure they will be admitted. You would not say that one variable causes the other, rather that they are both associated with the most savvy students. Another possible lurking variable is that the very best students apply to more colleges because they are shopping for the college that will offer them the best financial aid deal. As a result, the more highly rated colleges must accept a high percentage of the students who apply because they know the students have applied many places and are likely to go elsewhere even if they are admitted.

D20. a. Some might say that a high percentage of males “causes” higher salaries because men are more favorably treated than women when it comes to salary. On the other hand, the lurking variable may be how quantitative the subject is. There tend to be far fewer graduates in quantitative subjects, and business and industry want to hire them also; therefore, these faculty positions are harder to fill.
This competition for fewer graduates may be the reason the people in quantitative subjects (who tend to be male) are more highly paid.
b. Some people might say that a high number of hate groups “causes” a large number of people on death row because members of hate groups tend to commit murders. On the other hand, an obvious lurking variable here is the size of the state’s population. Larger states have more of everything. In fact, in percentage terms, hate crimes account for a small proportion of those on death row. To determine whether more hate groups in a state results in more people on death row because of hate crimes, you might look at a scatterplot of the percentage of people on death row because of hate crimes against the percentage of people in the state who belong to hate groups.
c. Some people might say that a high rate of gun ownership “causes” lower rates of violent crime by making criminals reluctant to commit violent crime for fear the person will protect himself or herself with a gun. On the other hand, the explanation may be the lurking variable of how rural the state is. Rural areas have higher rates of gun ownership, presumably for hunting, and have lower reported crime.

D21. a. The best prediction would be the mean IQ of 101.
b. approximately 1 point
c. 0.997(54) + 45 ≈ 98.8. You should not have much faith in this prediction because the variation around the line is so great.
d. Approximately 2% of the variation is accounted for by taking head circumference into account. The regression equation is not of much practical help in making the prediction.
e. Answers will vary. It seems likely that the regression line would become less steep and the correlation would decrease, indicating little to no correlation between head circumference and IQ.

D22.
Rewrite the ratio as follows:

$$r^2 = \frac{\text{SST} - \text{SSE}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$

Because SSE is less than or equal to SST and neither is negative, the ratio SSE/SST will always be between 0 and 1 inclusive. Thus, $r^2$ will always be between 0 and 1 inclusive. So r must always be between −1 and 1 inclusive.

D23. The number of “heating units” used by a house varies from year to year. A good predictor of the number of units is temperature. The investigator is saying that temperature accounts for 70% ($r^2 = 0.7$) of the year-to-year variation.

D24. The regression line is a “line of means” because it attempts to go through the mean value of the y’s at each fixed value of x. That is, the regression line estimates the mean value of y for each fixed value of x.

D25. Two older sisters 1 inch apart in height will, on the average, have two younger sisters who are only 0.337 inches apart. For the line y = x, the interpretation would be that if one older sister is 1 inch taller than another, her younger sister also tends to be 1 inch taller than the younger sister of the other older sister. The latter interpretation is what most people would expect. But the element of chance involved results in the regression effect that the younger sister is not as tall as expected.

Practice

P9. a. 0.5 b. 0.5 c. 0.95 d. 0 e. 0.95

P10. a. Guesses will vary but should be positive and close to 1.
b. r = 0.908

P11. This problem isn’t as much work as it looks like. For the first four tables, the means are all 0 and the standard deviations are all 1, so they have already been standardized. You can find the average product in your head. Table e has the same correlation as table a, table f has the same as table c, table g has the same as table b, and table h has the same as table c.
a. 1 b. 0.5 c. 0.5 d. 1 e. 1 f. 0.5 g. 0.5 h. 0.5

P12. In the table below, all but one of the products are positive, resulting in a positive correlation. The first and last pizza contribute a large amount, making the correlation quite strong.

         Original Units          Standard Units (z-scores)
         Fat (g)   Calories      z_x = (x − x̄)/s_x   z_y = (y − ȳ)/s_y   Product
         x         y                                                     z_x · z_y
         19.5      385            1.64681             1.21863             2.00685
         15        370            0.48890             0.98385             0.48100
         14        280            0.23158            –0.42478            –0.09837
         12        305           –0.28305            –0.03349             0.00948
         9         230           –1.05499            –1.20736             1.27375
         14.2      350            0.28305             0.67082             0.18987
         8         230           –1.31230            –1.20736             1.58442
Total    91.7      2150           0                   0                   5.44700
Mean     13.1      307.14         0                   0
SD       3.8863    63.8916        1                   1

$$r = \frac{5.44700}{6} = 0.9078$$

This is about the same value as in P10 and differs only because of rounding.

P13. a. positive
b. The point at the extreme upper right of the plot at about (4.8, 5) will make the largest positive contribution to the correlation because it is farthest away from the new origin $(\bar{x}, \bar{y})$ and from the new coordinate axes ($x = \bar{x}$ and $y = \bar{y}$) and so has a large $z_x \cdot z_y$.
c. In Quadrants I and III; 20 have a positive product.
d. In Quadrants II and IV; 7 have a negative product.

P14. The plot on the top has a strong curvature. A line would not be appropriate here. The plot on the bottom is linear. The cloud of points is roughly elliptical. A line would be appropriate for this plot.

P15. a. The correlation is 0.650:

$$b_1 = r \cdot \frac{s_y}{s_x}$$
$$0.368 = r \cdot \frac{7}{12.37}$$
$$r = 0.650$$

b. The regression equation is Exam 2 = 48.94 + 0.368 · Exam 1. The predicted Exam 2 score is 78.38.
c. The regression equation is Exam 1 = −14.1 + 1.149 · Exam 2.
d. [Scatterplot: Exam 2 versus Exam 1]

P16. a. the size of the city’s population
b. Divide each number by the population of the city to get the number of fast-food franchises per person and the proportion of the people who get stomach cancer.

P17. An obvious lurking variable is the age of the child. Parents tend to give higher allowances to older children, and vocabulary is larger for older children than for younger.

P18. A careless conclusion would be that people are too busy watching television to have babies. The lurking variable is how affluent the people in the country are. More affluent people tend to have more televisions and have fewer children.

P19. a. The formula relating these quantities is

$$r^2 = \frac{\text{SST} - \text{SSE}}{\text{SST}} = \frac{480.25 - 212.37}{480.25} = 0.5578$$

(as given in the output), so r = ±0.747. Because the slope of the regression line is negative (the scatterplot goes downhill), r = −0.747.
b. The value of $r^2$ means that 55.8% of the variability from state to state in the percentage of families living in poverty can be “explained” by the percentage of adults who are high school graduates. In other words, there is 55.8% less variability in the differences between y and ŷ than between y and $\bar{y}$. So by knowing the high school graduation percentage for a state and using the regression line, you tend to do a better job of predicting y than if you used just $\bar{y}$ as the prediction for that state.
c. No. Although that conclusion may seem reasonable, the existence of a negative correlation alone does not allow us to say that a state can reduce its poverty rate by increasing graduation rates. There might be a lurking variable behind the negative correlation, such as the type of industry in the state. Some industries may encourage people to stay in school to get needed training and also pay well enough to keep people above the poverty line.
d. Value x is in percentage of high school graduates, y is in percentage of families living in poverty, $b_1$ is in percentage of families living in poverty per percentage of high school graduates, and r has no units.

P20. The plot should look similar to one of the two options shown here.
[Scatterplot: Older Sister’s Height (in.) versus Younger Sister’s Height (in.)]
The regression line is flatter than the line connecting the endpoints of the ellipse. This plot shows the regression effect as well. This time, it is the older sisters of the taller younger sisters who tend to be less tall than their younger sisters!
[Scatterplot: Older Sister’s Height (in.) versus Younger Sister’s Height (in.)]

P21. There is evidence of regression to the mean. An ellipse around the cloud of points will have a major axis that is steeper than the regression line. The slope of the regression line is only about 0.4, much less than 1. Also, for the vertical strip containing exam 1 scores above 95, the mean exam 2 score is about 93. For exam 1 scores less than 70, the mean exam 2 score is about 76.

Exercises

E27. The correlations of the scatterplots are
a. 0.66 b. 0.25 c. 0.06 d. 0.40 e. 0.85 f. 0.52 g. 0.90 h. 0.74

E28. a. about 0.37. An estimate between 0.50 and 0.25 is a good one.
b. about 0.65. An estimate between 0.50 and 0.80 is a good one.
c. about 0.53. An estimate between 0.40 and 0.65 is a good one.

E29. a. r = 0.707 b. r = 0.707

E30. a. 1 b. 0.5 c. 0.5 d. 1 e. 1 f. 0.5 g. 0.5 h. –0.5

E31. a. about 0.94
b. All but three of the points lie in Quadrants I and III (based on the “origin” $(\bar{x}, \bar{y})$). In Quadrant I, $z_x$ and $z_y$ are positive, and in Quadrant III, both $z_x$ and $z_y$ are negative. In both of these quadrants the products $z_x \cdot z_y$ are positive. The three points whose products $z_x \cdot z_y$ are negative are all close to either $x = \bar{x}$ or $y = \bar{y}$, so their products are small negative values. Thus, the correlation is positive.
c. The point in the lower left corner of Quadrant III makes the largest contribution. It is the most extreme point for both the x- and the y-values, giving it the largest (in absolute value) z-score for both variables.
d. The point just below $(\bar{x}, \bar{y})$ makes the smallest contribution.
Both $z_x$ and $z_y$ are near zero, so the product of these z-scores will be quite small.

E32. a. I. B II. C III. A
b. i. ii.

E33. a. No, because the units will be different. For example, for the group that measures in chirps per second and uses temperature for x, the units of the slope will be chirps per second per degree temperature. For the group that measures in chirps per minute, the units will be chirps per minute per degree temperature. So the slope for the second group should be 60 times that of the first group. For a group that measures in chirps per minute and uses chirps for x, the units of the slope will be degrees temperature per chirps per minute. Even if they use the same units, groups that interchange x and y will get different slopes (chirps per minute per degrees Celsius, or degrees Celsius per chirps per minute).
b. Yes, the correlation is the same no matter what the units or what you use for x and for y. The correlation is the same because r is equal to the average product of z-scores, which have no units, and $z_x \cdot z_y = z_y \cdot z_x$.

E34. a. An estimate of the slope is

$$b_1 = r \cdot \frac{s_y}{s_x} = (-0.5) \cdot \frac{0.083}{4.3} = -0.00965$$

To find the y-intercept, use the fact that the point $(\bar{x}, \bar{y}) = (11.7, 0.827)$ is on the regression line:

$$y = \text{slope} \cdot x + y\text{-intercept}$$
$$0.827 = -0.00965(11.7) + y\text{-intercept}$$
$$y\text{-intercept} = 0.93991$$

The equation is ŷ = −0.00965x + 0.93991.
b. An estimate of the slope is

$$b_1 = r \cdot \frac{s_y}{s_x} = (-0.5) \cdot \frac{4.3}{0.083} = -25.90$$

To find the y-intercept, use the fact that the point $(\bar{x}, \bar{y}) = (0.827, 11.7)$ is on the regression line:

$$y = \text{slope} \cdot x + y\text{-intercept}$$
$$11.7 = -25.90(0.827) + y\text{-intercept}$$
$$y\text{-intercept} = 33.12$$

The equation is ŷ = −25.90x + 33.12.

E35. a. True. This is a direct result of the formula

$$b_1 = r \cdot \frac{s_y}{s_x}$$

If $s_y > s_x$, the factor on the right, $\frac{s_y}{s_x}$, will be greater than 1, which makes $b_1$ greater in absolute value than r.
b.
The formula

$$1.6 = 0.8 \cdot \frac{s_y}{s_x}, \quad \text{so} \quad \frac{s_y}{s_x} = 2$$

can be true only if $s_x = 25$ and $s_y = 50$.
c.
$$b_1 = r \cdot \frac{s_{\text{estimated}}}{s_{\text{actual}}}$$
$$0.36 = r \cdot \frac{4.12}{0.93}$$
$$r = 0.081$$
d.
$$b_1 = r \cdot \frac{s_{\text{actual}}}{s_{\text{estimated}}}$$
$$b_1 = 0.081 \cdot \frac{0.93}{4.12}$$
$$b_1 = 0.0183$$

E36. An estimate of the slope of the regression line is

$$b_1 = r \cdot \frac{s_y}{s_x} = 0.8 \cdot \frac{8}{30} \approx 0.21$$

To find the y-intercept, use the fact that $(\bar{x}, \bar{y}) = (280, 75)$ is on the regression line.

$$75 = b_0 + 0.21 \cdot 280$$
$$b_0 = 16.2$$

The regression equation is ŷ = 16.2 + 0.21x. Julie’s final exam score is predicted to be 16.2 + 0.21 · 300 = 79.2.

E37. a. A large brain helps animals live smarter and therefore longer. The lurking variable is overall size of the animal. Larger animals tend to develop more slowly, from gestation to “childhood” through old age, and larger animals tend to live longer than smaller ones.
b. If we kept the price of cheeseburgers down, college would be more affordable. The lurking variable is inflation over the years; all costs have gone up over the years.
c. The Internet is good for business. The lurking variable is years. Stock prices generally go up due to inflation over the years. The Internet is new technology, and so the number of Internet sites also is increasing over the years.

E38. a. (calories, fat), 0.95
(calories, saturated fat), 0.95
(calories, sodium), −0.5
(fat, saturated fat), 0.95
(fat, sodium), −0.5
(saturated fat, sodium), −0.5
b. If you use more salt in your food, it will reduce the calories and the fat.
c. The lurking variable is whether the cheese is low fat. Fat makes cheese taste good (and adds calories). To make up for this loss of flavor in low fat cheese, manufacturers may add more salt.

E39. a.
$$r^2 = \frac{\text{SST} - \text{SSE}}{\text{SST}} = \frac{0.088740 - 0.016304}{0.088740} = 0.8163$$

So r = 0.903. The value of $r^2$ above is equal to R-sq in the regression analysis.
b. The largest residual occurs for the point (55, 0.318). Its value is

$$y - \hat{y} = 0.318 - [0.202 + 0.00306(55)] = 0.318 - 0.3703 = -0.0523$$

c.
Yes, hot weather causes people to want to eat something cold, but the correlation alone does not tell us that.
d. degrees Fahrenheit, pints per person, pints per person per degree Fahrenheit, and no units
e. MS was computed by dividing SS by DF.

E40. The center of this cloud of points should not be modeled by a straight line. This scatterplot can be considered a plot with curvature or a plot with two very influential points in the upper right corner. In either case, some adjustments should be made to the data before attempting to fit a line to it. One possibility is to transform the data by techniques to be learned in Section 3.5. Because $r^2$ is a measure of how closely the points cluster about the regression line, it would not make sense in this context.

E41. Scoring exceptionally well, for example, on a test involves more than just studying the material. A certain amount of randomness is involved, too: the teacher asked questions about what the student knew, the student was feeling well that day, the student was not distracted, and so on. It’s unlikely that this combination of knowing the material and good luck will happen again on the next test for this same student. The student probably will get a lower, but still high, score on the next test even if he or she doesn’t slack off. However, it would appear as if doing well the first time and getting praised prompted the student to relax and study less. On the second test, the student’s place at the top of the class may be taken by another student who knew just as much for the first test but was also affected by randomness on the first test, perhaps unlucky in the questions the teacher chose or unlucky in another way at the time. At the other end, a student who scores exceptionally poorly on the first test also has a bit of randomness involved: bad luck this time.
Whether or not he or she is praised, the student scoring exceptionally poorly probably won’t have all of the random factors go against him or her on the next test and the student’s score will tend to be higher.

E42. This disappointing development is an example of the regression effect. The explanation is similar to that of E41. The same students are not likely to score at the top of their class two times in a row.

[Page: Activity 3.3a: Was Leonardo Correct? Recording table with columns Height, Kneeling Height, Arm Span, Hand Length]
[Page: Relationships Between Poverty Rates and Four Variables. Scatterplot matrix of Met Residence, White, Graduates, Poverty, and Single Parent]
[Page: Heights of Sisters. Scatterplot of Older Sister’s Height (in.) versus Younger Sister’s Height (in.)]

3.4 Diagnostics: Looking for Features That the Summaries Miss

Objectives
• to identify a potential influential point by examining a scatterplot
• to determine whether a point is influential by excluding it when computing the correlation and equation of the regression line
• to make and interpret a residual plot

Important Terms and Concepts
• influential point
• outlier
• residual plot

Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline as described here. The actual text of the AP Statistics Topic Outline and the complete correlation begin on page xxi.
ID4 Students explore residual plots, outliers, and influential points for bivariate data.
Lesson Planning

Class Time
One to two days

Materials
For Activity 3.4a, a piece of paper for recording data

Suggested Assignments
Classwork
  Essential: D26–30, P22–25
  Recommended: Activity 3.4a
  Optional: D31
Homework
  Essential: E43, E45, E47, E49
  Recommended: E44, E50
  Optional: E46, E48, E51–54

Lesson Notes: Which Points Have the Influence?
You identify an influential point by temporarily removing the candidate from the data set and then recomputing the correlation and regression equation to see how much they change. If they change only a little, the point isn’t influential. You should not infer that a large change means that the point should be permanently discarded. Typically, a point should remain in the data set, but you should conclude that correlation and linear regression might not be suitable summary statistics for those data. If a point is an outlier, like all outliers, it should be carefully examined for a possible error in recording data.

Outliers and influential points are two separate ideas. In particular, some outliers may not be influential points. The example on animal longevity on page 164 in the student book covers both outliers and influential points.

Activity 3.4a Near and Far
You can have each student bring an index card or a piece of paper and have students list their six locations, their estimates, and their actual step count in columns. Answers will vary because students analyze their own data. For some students the far point will be very influential on the slope (in either direction), and for others it won’t be. If the far point is far removed from the data, it will almost always have considerable influence on the correlation, even if it does not affect the slope.

1–3. A sample set of data from one student is shown next. He or she consistently underestimated distances.

Estimate (x)   Actual (y)
3              3
21             34
25             48
28             63
40             146
180            340

4–6.
The scatterplots with and without the “far” point are shown here, along with the regression summaries and the correlations. The trend appears linear in the first plot only because of the far point. In the second plot, the trend is a curve. Proportionally, the far point was not off by as much as some of the others, which show increasingly bad estimates with farther distance.

[Scatterplot: Actual versus Estimate, all six points]

Dependent variable is: Actual
No Selector
R squared = 94.6%   R squared (adjusted) = 93.2%
s = 32.40 with 6 − 2 = 4 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   73163.2          1    73163.2       69.7
Residual     4198.14          4    1049.53

Variable     Coefficient   s.e. of Coeff   t-ratio   prob
Constant     13.6176       17.22           0.791     0.4733
Estimate     1.85958       0.2227          8.35      0.0011

r = 0.972

[Scatterplot: Actual versus Estimate, far point removed]

Dependent variable is: Actual
No Selector
6 total cases of which 1 is missing
R squared = 84.8%   R squared (adjusted) = 79.7%
s = 24.14 with 5 − 2 = 3 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   9718.15          1    9718.15       16.7
Residual     1748.65          3    582.885

Variable     Coefficient   s.e. of Coeff   t-ratio   prob
Constant     −27.0973      23.65           −1.15     0.3349
Estimate     3.67083       0.8990          4.08      0.0265

r = 0.921

7. Yes. Removing the far point decreased the correlation and almost doubled the slope. Although the far point doesn’t follow the curved pattern of the other points, its distance from them results in a large $z_x \cdot z_y$ and so including it increases the correlation.

Lesson Notes: Residual Plots
Students can add a horizontal line to their handmade residual plots where the residual value equals 0. This line will make it easier to visualize and evaluate where the model overestimates and underestimates. On a TI-83 Plus or TI-84 Plus, the residual plot will display the “zero residual line” automatically.
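For classes that work outside the calculator, the Activity 3.4a comparison and the residuals behind a residual plot can be sketched in Python. This is only a minimal sketch and not part of the student book: `fit_line` is a hypothetical helper name, and the data are the sample estimates and step counts listed for Activity 3.4a. It fits the line with all six points, fits it again with the far point set aside, and lists the residuals that would be plotted about the zero line.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs.

    Hypothetical helper for illustration; uses the textbook formulas
    slope = Sxy / Sxx and intercept = y_bar - slope * x_bar.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

# Sample data from Activity 3.4a (one student's estimates and step counts).
estimate = [3, 21, 25, 28, 40, 180]
actual = [3, 34, 48, 63, 146, 340]

# Fit once with every point, then again with the far point (180, 340) set aside.
b0_all, b1_all, r_all = fit_line(estimate, actual)
b0_near, b1_near, r_near = fit_line(estimate[:-1], actual[:-1])

# Residual = observed - predicted. A residual plot graphs these values
# against x around the horizontal line residual = 0.
residuals_all = [y - (b0_all + b1_all * x) for x, y in zip(estimate, actual)]

print(f"all six points:    actual = {b0_all:.2f} + {b1_all:.3f} estimate, r = {r_all:.3f}")
print(f"without far point: actual = {b0_near:.2f} + {b1_near:.3f} estimate, r = {r_near:.3f}")
for x, e in zip(estimate, residuals_all):
    print(f"estimate {x:>3}: residual {e:+.1f}")
```

Comparing the two printed lines mirrors the activity's question 7: the change in slope and correlation shows how much influence the far point has. The residuals from any least squares fit sum to zero, which is why the zero line is the natural reference for the plot.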
Residual plot A in D30 presents a “good” residual plot, that is, one that displays random scatter and a fairly constant spread in the y’s across all values of x. The residual plots C and D in D30 are plots that show a pattern, whereas residual plot I in Display 3.77 is an example where the spread in y is not constant across all values of x. These residual plots show that a linear model is not the right model for the data.

Make sure students see the connection between the scatterplot with its regression line and the residual plot. Comparing the residual plot with the scatterplot will help students see the diagnostic information that a residual plot contains.

A misconception some students have as they look at residual plots (confusing them with the original scatterplot) is to think patterns are good in residuals. You might emphasize that if the residual plot still has a pattern, the model is not taking care of everything and students need to look for another model that will reduce or eliminate that pattern, too. With a good model, residuals should be without patterns and randomly scattered about the zero line. Students are beginning to conceptualize random error about a line, and they will get a chance to see this again in Chapter 11.

Notes for the AP Teacher

Modeling Good Answers
Once students have completed E45, show them the model answer with its good discussion of residuals. A PDF file containing the question and the model answer is available at www.keypress.com/keyonline.

Discussion

Note on D26–28: Students might work in groups of four, with each group calculating the correlation coefficient and the regression equation for one of the four Anscombe data sets. Groups can then share their results. Seeing similar summaries for quite different graphs will reinforce the fact that numerical summaries are not sufficient to describe any data set, either univariate or bivariate.
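To check the group results quickly, the four summaries can be computed side by side. The sketch below is an illustration under stated assumptions, not material from the student book: the data values are the standard published values of Anscombe's quartet, which are assumed (not confirmed here) to match the four plots in the student book, and `summarize` is a hypothetical helper name.

```python
from math import sqrt

def summarize(xs, ys):
    """Return (intercept, slope, r) for the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / sqrt(sxx * syy)

# Standard published values of Anscombe's quartet (assumed to match the
# student book's plots I-IV). Sets I-III share the same x-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One (intercept, slope, r) triple per data set.
summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}

for name, (b0, b1, r) in summaries.items():
    print(f"Plot {name:>3}: y-hat = {b0:.2f} + {b1:.3f}x, r = {r:.3f}")
```

The four printed lines come out nearly identical even though the scatterplots look nothing alike, which is the point of D26–28: the summaries alone cannot tell you which plot produced them.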
D26. Plot I shows a positive linear trend that is moderately strong. Plot II shows points that lie along a curve. Plot III shows all points lying on a straight line except for the one point near the right end. (This point will have the effect of raising that end of the regression line.) Plot IV has all but one of the points stacked up at the same value of x, although the outlier gives it a slight positive trend. (The regression line will go through the middle of the points on the left and through the isolated point on the right.)
a. A straight line is a good summary only for plot I. However, in all four plots the regression line has a slope of about 0.5 and an intercept of about 3.0. (See D27.)
b. The correlation is about 0.8 for plot I and should not be used to describe the others because it is a measure of the strength of a linear relationship.

D27. There is no way to tell which plot produced the given summary statistics. In fact, the summary statistics are essentially the same for all four plots. The moral? Draw a picture before you summarize data!

D28. a. Plots III and IV both have influential observations, but plot IV contains the more influential point. The influential point in plot IV is more isolated from the data and completely controls the slope of the regression line.
b. This influential point lies on the regression line through these data.
c. If the influential point (the isolated point) is removed from plot IV, all of the data points will stack up at the same value of x. Thus, a regression slope and correlation cannot be computed.

D29. Delta produces the residual on the extreme right of the plot, at point (8.03, 0.18). Northwest produces the fifth residual from the left and the largest in absolute value at point (5.36, 8.5).

D30. a. A: I; B: IV; C: II; D: III
b. The scatterplot shows the actual values of y and upward or downward trend but may obscure patterns in the residuals (or at least appear to diminish them a bit).
The residual plots do not show the values of y or trend in the original data, but they do show the values of the residuals, and they make departures from linearity easier to see.

D31. This question gives further information on plotting residuals versus x-values or predicted values, ŷ. The only difference between the two residual plots is a matter of scaling. The second residual plot has essentially the same scale on the horizontal axis as the scatterplot, with x beginning at zero. Thus, it must be the one with residuals plotted versus x. The points on the first residual plot have horizontal scale values of 0.5, 1.0, and 1.5, which are the fitted values that are obtained when ŷ = 0.5 + 0.5x and x is 0, 1, or 2, respectively. (See E52, page 178 of the student book, for more on this idea.)

Practice

P22. a. In the scatterplot with the regression line, the data show no pattern except for the one outlier in the upper right.
[Figure: scatterplot of International ($ millions) versus Domestic ($ millions), with regression line]
b. The regression equation is ŷ = −680 + 2.85x and the correlation is r = 0.7. The slope is positive, and the correlation is moderately strong.
c. The plot without the influential point (Titanic) is shown here. The regression equation is ŷ = 1350 − 2.14x and the correlation is r = −0.50. Now the slope is negative and the correlation is moderate. Titanic has a huge influence (as a big ship should).
[Figure: scatterplot of International ($ millions) versus Domestic ($ millions), with Titanic removed]

P23. a. The student did not predict very well. The estimates were consistently low.
b. (180, 350) appears to be the most influential point. It is an outlier in both variables and is not aligned with the other points.
c. With the point (180, 350) the regression equation is actual = 12.23 + 1.92 estimate, and r = 0.975. Without this point, the equation is actual = −27.10 + 3.67 estimate, with r = 0.921. This point pulls the right end of the regression line down, decreasing the slope and increasing the correlation.

P24. a. The scatterplot with the regression line is shown here.
[Figure: scatterplot of y versus x, with regression line]
b. The table is shown here.

x   y   Predicted Value   Residual
0   0   0.5               −0.5
0   1   0.5               0.5
1   1   1                 0
3   2   2                 0

c. The residual plot is shown here.
[Figure: residual plot, Residual versus x]
d. The residual plot straightens out the tilt in the scatterplot so that the residuals can be seen as deviations above and below zero rather than above and below a tilted line. The symmetry of the residuals in this example shows up better on the residual plot.

P25. a. A—IV; B—II; C—I; D—III
b. i. The residual plot will open upward, like a cup, as in plot II.
ii. The residual plot has a fan shape, as in plot I.
iii. The residual plot will open downward, like an inverted cup. No plot in this example shows this pattern.
iv. The residual plot will be V-shaped, as in plot III. The pattern is clearer if you ignore the point with a residual of about 10.
c. Plot D and residual plot III show a scatterplot that looks as though it should be modeled by two different straight lines. The V shape can be seen in the scatterplot, but it may be more obvious to many in the residual plot because the overall linear trend (tilt of the V) has been removed.

Exercises

E43. a. The scatterplot appears here (with the regression line). A line is not a good model because the cloud of points is not elliptical and has one extremely influential point.
[Figure: scatterplot of Minimum Temperature (°F) versus Maximum Temperature (°F), with regression line]
b. The regression equation is ŷ = −161.90 + 0.954x, and r = 0.49.
c. With Antarctica removed, the slope of the regression line changes from positive (0.954) to negative (−1.869) and the correlation becomes negative, r = −0.45. The plot appears as shown here. (Notice that a new potential influential point has appeared: Oceania.) Without a plot of the data, you might come to the following incorrect conclusion: In general, continents tend to be “warm” or “cold”; that is, continents with higher maximums also tend to have higher minimums. In fact, there is little relationship.
[Figure: scatterplot of Minimum Temperature (°F) versus Maximum Temperature (°F), with Antarctica removed]

E44. a. The point (0.137, 2.252) appears to be the most influential because it is an outlier in both variables. It appears that if this point were eliminated, the right end of the line would drop, decreasing the slope. The correlation would probably also decrease. Eliminating (0.137, 2.252) does indeed decrease the slope of the regression line, from 13.0 to 7.46. The correlation also decreases, from 0.896 to 0.634.
b. A point near the regression line and near (x̄, ȳ) is likely to have little effect on the slope or correlation. One such point would be (0.0194, 0.517). Eliminating this point actually leaves the regression equation almost the same and very slightly increases the correlation from 0.896 to 0.897.
c. Removing the point with the largest residual in absolute value, (0.0764, 0.433), would cause a slight increase in slope but a large increase in correlation. The actual result is to cause the slope to increase from 13.0 to 15.4 and the correlation to increase from 0.896 to 0.969.

E45. a. The plot with the regression line and the regression summary are shown here. The equation of the line is ŷ = −1.63 + 0.745x.
[Figure: scatterplot of Count After Treatment versus Count Before Treatment, with regression line]

Dependent variable is: y
No Selector
13 total cases of which 3 are missing
R squared = 58.4%   R squared (adjusted) = 53.2%
s = 3.177 with 10 − 2 = 8 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   113.348          1    113.348       11.2
Residual     80.7516          8    10.0939

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   −1.63057      2.299           −0.709    0.4984
x          0.745223      0.2224          3.35      0.0101

Here is the completed table:

x    y    Predicted Value   Residual
11   6    6.567             −0.567
8    0    4.331             −4.331
5    2    2.096             −0.096
14   8    8.803             −0.803
19   11   12.529            −1.529
6    4    2.841             1.159
10   13   5.822             7.178
6    1    2.841             −1.841
11   8    6.567             1.433
3    0    0.605             −0.605

b. The residual plot shown here is a little unusual in that it shows more variability in the middle than at either end. This is partly because there are more cases in the middle.
[Figure: residual plot, Residual versus Count Before Treatment]
c. The disinfectant appears to be unusually effective for the person with the large negative residual, the point (8, 0) on the original scatterplot. It is seemingly ineffective for the person with the large positive residual, the point (10, 13) on the scatterplot.

E46. a. The slope of the line is very close to 1, and the y-intercept is −3.57. Because the slope is about 1, the y-intercept means that textbooks bought online tend to cost about $3.57 less than those bought at the college bookstore. (In fact, the mean cost of the college textbooks is $47.04, and the mean cost of the online textbooks is $45.03. Their difference is not $3.57 because the slope is not exactly equal to 1.) The slope is 1.03, which means that for every $1.00 increase in price for a book sold through the college bookstore, there tends to be a $1.03 increase, on average, for the same book bought online.
b. The table for computing the residuals is shown here.

College   Online   Fitted    Residual
93.40     94.18    92.9281   1.2519
9.95      7.96     6.7092    1.2508
46.70     48.75    44.6786   4.0714
76.00     94.15    74.9508   19.1992
86.70     80.95    86.0058   −5.0558
7.95      6.36     4.6428    1.7171
24.00     16.80    21.2254   −4.4254
12.70     10.66    9.5505    1.1095
66.00     45.50    64.619    −19.119

The next residual plot shows that a line is a reasonable model for these data. The points are scattered randomly above and below 0, except that the points fan out to the right. This pattern indicates that the points lie farther from the regression line as the prices increase.
[Figure: residual plot, Residual versus College Bookstore Price ($)]
c. The plot with the line y = x is shown next. A point above the line represents a textbook that costs more at the online bookstore. A point below the line represents a textbook that costs less at the online bookstore. A point on the line represents a book that costs the same at both stores.
[Figure: scatterplot of Online Bookstore Price ($) versus College Bookstore Price ($), with the line y = x]
d. The boxplot is centered slightly above 0, indicating that most college bookstore prices tend to be slightly higher than the online prices. However, most of the differences are close to 0, so there is little difference in price. The median difference is about $2 more for the college bookstore. Two outliers are shown, which means that for two textbooks, the prices vary greatly, in one case being less expensive at the college bookstore and in the other case being less expensive at the online bookstore. The overall lesson is that with more expensive textbooks, it pays to shop around.

E47. a. The residual plot for these data looks like this:
[Figure: residual plot, Residual versus Fat (g)]
b. The largest positive residual belongs to Pizza Hut’s Stuffed Crust, which has more calories than would be predicted from a simple linear model. The largest negative residual belongs to Pizza Hut’s Pan. None appear so far away that it would be called an exception, or outlier. In fact, you can check this by making a boxplot of the residuals, as shown here.
[Figure: boxplot of the residuals]
c. The complete data set shows a moderately strong positive trend with a slope of about 14.9 calories per gram of fat and correlation of 0.908. The most influential data point would be the one farthest away from the main cloud of points on the x-axis (Domino’s Deep Dish). Removing Domino’s Deep Dish yields a slope of around 18 calories per gram of fat and correlation of 0.893. None of the other points have nearly as much influence on the slope.

E48. a. The pattern of the scatterplot is basically linear, so the slope is constant across the numbers of seats.
b. The spread in the flight lengths increases as the number of seats increases. The points fan out to the right.
c. A good guess is likely only in the first case. Predicting flight length for planes with fewer seats would be easier because there is less variation in flight lengths for the smaller planes than for the larger. When the number of seats is between 50 and 150, the values for flight length vary from about 175 to about 1065 miles, whereas when the number of seats is between 200 and 300, the values for flight length vary between 947 and 3559.
d. The residual plot for this scatterplot also fans out (spreads out more) as the number of seats increases. In fact, the fan shape may be seen better in the residual plot.
e. Planes that carry more passengers have more variation in their average flight lengths because they tend to fly longer distances and there is a larger spread of numbers over which to vary. In general, larger numbers usually show a larger absolute variation than smaller ones. But the relative variation may be fairly constant. If you make a new variable defined as (flight length)/(number of seats) and plot it versus number of seats, the fan shape disappears.

E49. A—I; B—IV; C—III; D—II. A linear model would be appropriate for C and D. Both C and D show a random scatter of points in the residual plot, but the slope of the regression line is almost zero for plot C, and there appears to be little correlation. Plot B does not have a random scatter around the line; the pattern appears to be cyclical. This pattern is typical of a situation in which something changes approximately linearly over time. What happens next usually depends on what just happened, causing this up-and-down pattern in the residuals.

E50. No for plot A. The pattern is impossible for residuals because the major linear trend would already have been removed by fitting the regression line. Residuals show what is left over after any linear trend is removed from the data. No for plot B. These residuals do not have a mean of zero.

E51. a. The regression equation is ŷ = 366.67 + 16x. The completed table is shown here.

Aircraft   Seats   Cost   Predicted   Residual
ERJ-145    50      1100   1166.67     −66.67
DC9        100     2100   1966.67     133.33
MD-90      150     2700   2766.67     −66.67

b. The plot of the residuals versus x (seats) is given first followed by the plot of the residuals versus ŷ. The only difference between the two plots is the scaling on the horizontal axis.
[Figure: residual plot, Residual versus Seats]
[Figure: residual plot, Residual versus Predicted]

E52.
The fitted value ŷ is a linear transformation of x; that is, ŷ = a + bx. Thus, using ŷ rather than x on the horizontal axis does not change the relative distance of the values from each other; it just translates and stretches the horizontal axis. Consider this example for a regression line ŷ = 1 + 2x with values of x = 0, 1, 3, 4, 6, 6. Then the values of ŷ are 1, 3, 7, 9, 13, 13. These two sets of points are plotted on horizontal scales in the next plot. Note that the relative spacing of the points is exactly the same on each scale.
[Figure: the x-values plotted on a scale from 0 to 6, above the ŷ-values plotted on a scale from 1 to 13]
If the regression line has a negative slope, then the larger ŷ’s correspond to the smaller x’s and one residual plot will be the mirror image of the other.

E53. Scatterplots of the original data, without and with the regression line, follow the commentary. To estimate the recommended weight for a person whose height is 64 inches, add the fitted weight (given to be 145 pounds) to the residual of about 1.2 to get 146.2. The slope of the regression line must be about (187 − 145)/(76 − 64) = 3.5. For the second height, 65 inches, the fitted weight would be 145 + 3.5(1) = 148.5. The residual is about 0.9 pounds. Thus, the recommended weight is about 148.5 + 0.9 = 149.4 pounds. You could continue point by point to get the next plot, but a rough sketch can be obtained by making use of the linear patterns in the residuals (and hence in the original scatterplot). The points on the scatterplot must form a straight line up to a height of 71, where the weight must be about 145 + 7(3.5) − 1.5 = 168.0. The remainder of the points must form (approximately) another straight line up to a height of 76, where the weight must be about 145 + 12(3.5) + 2.2 = 189.2.
[Figure: scatterplots of Weight (lb) versus Height (in.), without and with the regression line]
It is difficult to see the strong V shape of the residuals in a scatterplot.

E54. Scatterplots of the data, without and with the regression line, are shown here. Begin by drawing in the regression line. Then use the residual plot to determine how far above or below the regression line each point should be placed. Note that a linear model is not a good one for predicting life expectancy from GNP.
[Figure: scatterplots of Life Expectancy (yr) versus Gross National Product ($ thousands per capita), without and with the regression line]

3.5 Shape-Changing Transformations

Objectives
• to understand that removing the curvature from the shape of a scatterplot requires a nonlinear transformation
• to use a log transformation to straighten data that follow an exponential pattern
• to use a log-log transformation to straighten data that follow an unknown power relationship
• to use a power transformation to straighten data that follow a known or suspected power relationship

Important Terms and Concepts
• log-log transformation
• homogeneous residuals
• power transformations
• exponential growth and decay
• log transformation
• power function

Alignment with the AP Statistics Topic Outline
This section aligns with the listed items of the AP Statistics Topic Outline as described here. The actual text of the AP Statistics Topic Outline and the complete correlation begin on page xxi.
ID5 Students use logarithmic and power transformations to achieve linearity.
Lesson Planning

Class Time
Two to three days

Materials
For Activity 3.5a, a paper cup and 200 pennies per student (or pair of students)

Suggested Assignments

Classwork
Essential: Activity 3.5a; D35, D40; P26–29, P33, P34
Recommended: D34, D39, D42; P30–32, P35, P37, P38
Optional: D32, D33, D36–38, D41, D43; P36, P39
Chapter 3 Quiz 2

Homework
Essential: E55–57, E59
Recommended: E60, E65
Optional: E58, E61–64, E66

Lesson Notes: Exponential Growth and Decay
An exponential pattern of growth or decay is one of the most common nonlinear patterns seen in science and social science. Activity 3.5a introduces a common type of exponential decay so that students can study the pattern made by a quantity over time. This situation is fully analyzed after the activity.

Activity 3.5a Copper Flippers
Activity 3.5a is essential. This activity can be done in pairs: one student to toss the coins and count and one to record.
1–3. The data from one student’s results are shown here.

Toss   Heads
1      91
2      59
3      27
4      9
5      7
6      4

4. A scatterplot from the sample student results is shown in Display 3.91 on page 181 of the student book. The pattern appears to be exponential, not linear.
5. The explanation is given on pages 181–182 of the student book. Have students save their data to use in D22 on page 397.

Lesson Notes: Exponential Functions and Log Transformations
Although the AP Statistics Topic Outline does not explicitly require that students understand the algebra of log and log-log transformations, completing the conversions should be within the reach of AP students. Discuss these algebraic conversions, and expect students to complete them. This work provides an excellent opportunity for students to see that logarithms can be very useful. The basic rules of logarithms and powers are shown in this table. Students may need to be reminded of these rules.
Logarithms                        Powers
log_b(mn) = log_b m + log_b n     a^m · a^n = a^(m + n)
log_b(m^n) = n log_b m            (a^m)^n = a^(mn)
log_b(b^m) = m                    a^(log_a m) = m

Most commonly, scientists use natural logarithms, which have a base of e and are denoted by the symbol ln. The number e and functions of e are accessible on any graphing calculator.

To understand why a log transformation straightens points that follow an exponential pattern, begin with the exponential model
y = ab^x
Take the log of both sides:
ln y = ln(ab^x)
     = ln a + ln b^x
     = ln a + x ln b
You now have an equation that is linear in x, with slope ln b and intercept ln a. So a plot of ln y versus x will be linear. The value of ln b can be exponentiated to find an estimate of b. Students should also understand that b = 1 + (growth rate) for exponential growth and b = 1 − (decay rate) for exponential decay.

In the exponential growth example on pages 182–183, notice that in the computation, e^0.0148 ≈ 1.015. Whenever you are computing e^x, where x is close to 0, it always will be the case that e^x ≈ 1 + x. As another example, in E64, students will find that e^(−0.05645) ≈ 1 − 0.055 = 0.945, so the decay rate is 5.5%.

Lesson Notes: Log-Log Transformations of Power Functions
In this section students should recognize the form of a power model and the method of using a log-log transformation to linearize this curve, as shown in the box on page 188. It is virtually impossible to distinguish between a set of data best fit by an exponential model and one best fit by a power model just by observing the pattern in a scatterplot. The investigator should think, first, about which makes more sense in the context of the problem. Population growth is more logically exponential, while relationships between weight and height are more logically power relationships.
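The log-transformation algebra above (ln y = ln a + x ln b) can be tried out numerically. The following is a minimal sketch in Python, not a method prescribed by the student book: it regresses ln(heads) on toss number for the sample Copper Flippers results from Activity 3.5a, then exponentiates the slope to estimate b and the decay rate. The helper name least_squares is hypothetical, introduced only for this illustration. The last lines check the approximation e^x ≈ 1 + x for the two exponents mentioned above.

```python
import math


def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs.

    Hypothetical helper written for this illustration only.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Sample student results from Activity 3.5a: toss number and heads counted.
tosses = [1, 2, 3, 4, 5, 6]
heads = [91, 59, 27, 9, 7, 4]

# Fit ln y = ln a + x ln b by regressing ln(heads) on toss number.
ln_a, ln_b = least_squares(tosses, [math.log(h) for h in heads])

# Exponentiate to return to the original scale: y = a * b**x.
a = math.exp(ln_a)
b = math.exp(ln_b)
decay_rate = 1 - b  # b = 1 - (decay rate) for exponential decay

print(f"ln y = {ln_a:.3f} + ({ln_b:.3f})x")
print(f"y = {a:.1f} * {b:.3f}**x, estimated decay rate = {decay_rate:.3f}")

# For x close to 0, e**x is approximately 1 + x.
for x in (0.0148, -0.05645):
    print(f"e**({x}) = {math.exp(x):.4f}, 1 + x = {1 + x:.4f}")
```

With these sample counts the estimated decay rate comes out at roughly one-half. Students working with statistical software or a calculator would obtain the same coefficients from a linear regression on (x, ln y); the hand-coded least squares step is used here only to keep the sketch self-contained.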
The Algebra of Log-Log Transformations
If we take the log of both sides of a power function y = ax^b, we end up with
log y = log a + b log x
Again this is linear, but the independent variable here is the log of our original independent variable, x. And the new dependent variable is the log of the original dependent variable, y. The slope is b and the intercept is log a. Here are two examples.

1. To solve log y = −0.484 + 2.10 log x for y, exponentiate both sides and use the rules in the table on page 137.
10^(log y) = 10^(−0.484 + 2.10 log x)
y = 10^(−0.484) · 10^(2.10 log x)     using the 1st and 3rd power rules
y = 0.328 · 10^(log x^2.10)           using the 2nd log rule and calculating 10^(−0.484)
y = 0.328x^2.10                       using the 3rd power rule

2. Solve ln y = −25.61 + 0.0151x for y.
e^(ln y) = e^(−25.61 + 0.0151x)       exponentiating both sides
y = e^(−25.61) · e^(0.0151x)          using the 1st and 3rd power rules
y = e^(−25.61)(1.015)^x               using the 2nd power rule and calculating e^0.0151
Note: e^(−25.61) is a constant.

Lesson Notes: Power Transformations
Power transformations are appropriate when your data follow a power model of the form y = ax^b. However, a power transformation (as well as any other transformation) should be employed only when it is consistent with the physical situation in the problem. In thinking through the physical situation, sometimes a simple power transformation such as a square or square root is all that is needed. Unfortunately, some students will keep trying transformations until they get a “good” fit regardless of its appropriateness. A correlation r close to ±1 does not automatically imply a “good” fit. For example, in the study of tree diameter versus age, the correlation is 0.89, but the scatterplot shows enough curvature in the data to make the straight line a rather poor model.

Many graphing calculators have spreadsheet capabilities that make the transformation of data quite easy.
For example, on a TI-83 Plus or TI-84 Plus, you can load the values of x in list L1, the values of y in list L2, and then, say, define L3 = (L2)^2. A linear regression can then be done on (x, y) and (x, y^2) and appropriate comparisons made.

Notes on Software
Some software packages will show the regression as log y = . . . . Graphing calculators generally do not. That is, if students are doing a regression on (log x, log y), the calculator will return the coefficients of y = a + bx. It is up to the students to realize that they must use the appropriate meaning of those calculated coefficients, that is, log y = a + b log x.

Notes for the AP Teacher
Modeling Good Answers
Once students have completed E65, show them the model answer as an example of what is expected on the AP Exam. A PDF file containing the questions and the model answers is available at www.keypress.com/keyonline.

Solutions

Discussion

D32. If the chance of “living” was 0.6, then the rate of decay (death rate) would be 0.4 and the line would drop less steeply.

D33. The residuals indicate a lot of scatter around the trend line for the years up to 1850 and then a trend that increases faster than the model would indicate, up to about 1920, followed by an increase slower than the model would indicate. This same pattern can be seen in the original scatterplot if you look closely.

D34. In 1803, the Louisiana Purchase doubled the size of the United States but added relatively few people. Thus, the population density dropped dramatically below what any model would have predicted. Similarly, the drop in density from 1840 to 1850 may be accounted for by the addition of much of the Southwest to the United States.

D35. a. The overall trend is curved upward in the original data, so the correlation coefficient is not a useful measure of the strength of the relationship.
b. These are time series data, with one observation per time period. In time series data, each observation is highly correlated with the preceding observation (things cannot change very much from one time period to the next), so it is not surprising to see points closely clustered about the linear trend in the transformed data. Of much more interest is the nonlinear component of the overall trend and the fluctuations around this trend, as revealed in the residuals.

D36. From the equation ŷ = e^(−25.118)(1.0148)^x, you can see that the growth rate is estimated to be about 1.5% per year. For a 10-year period, think of going from x = 0 to x = 10. If at year 0 the population is ŷ = a, at year 10 it will be ŷ = a(1.015)^10 = a(1.161), for a 16% gain. Make sure students observe that the 10-year growth rate is not (10)(1.5%) = 15%; it is greater than that because of the compounding.

D37. Some of the flight lengths are in the low hundreds, whereas others are in the mid-thousands. A jump from 100 to 1000 is a jump of one order of magnitude. Note the effect that the log transformation has on these three flight lengths, which were picked because they are the longest, median, and shortest. The logs are more evenly spread out than the original flight lengths.

Flight Length   ln(flight length)
3960            8.28
904             6.81
175             5.16

D38. Both the scatterplot and the residual plot show an outlier on the speed variable. The other residuals have a linear trend. Once this outlier (the slowest plane) is removed, the data show a strong, positive linear trend and the residuals have no trend. Fitting a new line to the reduced data set will show that the outlier has some influence on the slope of the regression line and even more influence on the correlation.

D39. It doesn’t make sense to say that speed causes flight length. The reason for the positive association is the lurking variable of the size of the plane. In general, larger planes go faster and also have longer average flight lengths. In fact, the larger and faster planes tend to be deployed on longer flights intentionally; that is a management decision.

D40. The graph of the power function is shown below. The graph does not increase as quickly as the exponential function, once x gets above 3.21.
[Figure: graph of y = x^2 for x from 0 to 12]
[Figure: graphs of y = 2^x and y = x^2 on the same axes, for x from 0 to 12]
The graph above shows the power model and the exponential model together. Both functions increase at an increasing rate. However, the growth of the exponential model eventually dwarfs that of the power model. This is true in general. Exponential growth models (b > 1) will eventually outgrow any power model.

D41. We’ll calculate the weight of a 100-inch alligator using the base 10 log and compare to the example on page 189 where the base e log was used. The table shows the lengths and weights, along with the base 10 log of each.

Length   Weight   log(length)   log(weight)
94       130      1.97313       2.11394
74       51       1.86923       1.70757
147      640      2.16732       2.80618
58       28       1.76343       1.44716
86       80       1.9345        1.90309
94       110      1.97313       2.04139
63       33       1.79934       1.51851
86       90       1.9345        1.95424
69       36       1.83885       1.5563
72       38       1.85733       1.57978
128      366      2.10721       2.56348
85       84       1.92942       1.92428
82       80       1.91381       1.90309
86       83       1.9345        1.91908
88       70       1.94448       1.8451
72       61       1.85733       1.78533
74       54       1.86923       1.73239
61       44       1.78533       1.64345
90       106      1.95424       2.02531
89       84       1.94939       1.92428
68       39       1.83251       1.59106
76       42       1.88081       1.62325
114      197      2.0569        2.29447
90       102      1.95424       2.0086
78       57       1.89209       1.75587

Here is the scatterplot of log(weight) versus log(length).
[Figure: scatterplot of log(weight) versus log(length)]
The prediction is log(weight) = 3.286 log(100) − 4.42 = 2.152. The predicted weight is found by solving
log(weight) = 2.152
weight = 10^2.152 = 141.9 pounds
The small difference from the value 141.3 reported in the text is due to rounding error. A small rounding error in an exponent can have a rather large effect on the result. Calculating with rounded numbers is always risky, especially when one of those numbers is an exponent.
The properties of logarithms explain why this works. If you have an equation such as ln y = ln x + a and you want to rewrite it in base 10, first exponentiate each side and simplify.
e^(ln y) = e^(ln x + a)
e^(ln y) = e^(ln x) · e^a
y = x · e^a
Now take the log base 10 of each side.
log y = log(x · e^a)
log y = log x + log e^a
Note that e^a is a constant. In the alligator example, the constant in the first regression equation was −10.2. The equation above says to calculate log(e^(−10.2)) = −4.43. This value is very close to the constant in the regression equation using base 10 logs, or −4.42. Again, rounding error has caused a small difference in the last decimal place.

D42. Perhaps the cross-sectional area is growing at a constant rate across time. This area would be proportional to the square of the diameter.

D43. The diameters at the left end of the scatterplot show less variation than those at the right end. Thus, you would be able to get a more precise prediction of the diameter for 10-year-old trees than you would for 40-year-old trees. This should seem intuitively reasonable; older trees would naturally have more variability in size than younger trees. (The same is true of children.)

Practice

P26. a. This plot shows the data for Dying Dice.
[Figure: scatterplot of Population versus Roll Number]
b. The transformed data show a nice linear trend.
[Figure: scatterplot of ln(population) versus Roll Number]
c. The equation of the line in part b is
ln ŷ = 5.22 − 0.435x
or
ln(pop) = 5.22 − 0.435 roll number
Because e^(−0.435) = 0.647, the rate of dying is estimated to be 1 − 0.647, or about 0.353, or 35%, per time period. This rate is close to the theoretical probability of dying, set up to be 1/3.
d. The residual plot shows some curvature, indicating a death rate of a little more than 0.35 in the early stages and a little less than 0.35 in the later stages. This kind of pattern would not be unusual in data on real animals.
[Figure: residual plot, Residual versus Roll Number]

P27. The Florida population shows a definite nonlinear trend that could represent exponential growth.
[Figure: scatterplot of Population versus Year]
a. The log transformation transforms the pattern to a linear one that can be summarized by a straight line.
[Figure: scatterplot of ln(Population) versus Year]
b. The equation of the line is
ln(pop) = −54.9342 + 0.03583 year
so that
pop = e^(−54.9382)(e^0.03583)^year = e^(−54.9382)(1.036)^year
for a growth rate of 3.6% per year, which is a high rate of growth.
c. The residual plot shows a pattern. Florida grew less rapidly than the model predicts up to about 1845, then grew more rapidly than predicted, then less, then more. A big jump in growth occurred between 1950 and 1960. Then in 2000, there was a big drop in growth.
[Figure: residual plot, Residual versus Year]

P28.
x    y      log y
2    1000   3
1    100    2
0    10     1
−1   1      0

The plot does indeed show a straight line.
[Figure: plot of log y versus x]
The equation of this line is log y = 1 + x, so the slope is 1 and the y-intercept is 1.

P29. The tables and plots are shown here.
a.
xa   ya     log ya
6    1000   3
4    100    2
2    10     1
0    1      0

[Figure: plot of log ya versus xa]
The equation of the line is log ya = 0 + 0.5xa, so the slope is 0.5 and the y-intercept is 0.
b.
xb   yb       log yb
5    0.0001   −4
6    0.01     −2
7    1        0
8    100      2

[Figure: plot of log yb versus xb]
The equation of the line is log yb = −14 + 2xb, so the slope is 2 and the y-intercept is −14.

P30. If log y = c + dx, then by rules of logarithms, y = (10^c)(10)^(dx) = (10^c)(10^d)^x = ab^x, where a = 10^c and b = 10^d.
For P28, y = 10(10)^x.
For P29a, y = 1(10^0.5)^x = 3.16^x.
For P29b, y = 10^(−14)(10^2)^x = 10^(−14) · 100^x.

P31. Using a ln(flight length) transformation gives the following printout, which agrees with the first part of Display 3.98 in the student book.

The regression equation is
ln(FlightLength) = 1.57 + 0.0120 speed

Predictor   Coef        Stdev       t-ratio   p
Constant    1.5730      0.3281      4.79      0.000
Speed       0.0119958   0.0007396   16.22     0.000

s = 0.2691   R-sq = 89.5%   R-sq(adj) = 89.1%

Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    19.058   19.058   263.09   0.000
Error        31   2.246    0.072
Total        32   21.304

The answer is, then:
ln(flight length) = 1.57 + 0.012 speed
flight length = e^1.57 · e^(0.012 speed) = 4.807(1.012)^speed

P32. A plot of the data shows a cluster at the left with little trend and three points at the right that, as a group, are potentially influential. The points in the residual plot show curvature.
[Figure: scatterplot of Consumption (g) versus Trips, with its residual plot]
Taking the natural log of the consumption straightens the residual plot. However, these transformed points don’t form an elliptical cloud. Some fishers’ families eat essentially no fish, even if the person fishes as many as 11 times a month.
[Figure: scatterplot of ln(Consumption) versus Trips, with its residual plot]
The equation of the regression line is ln(consumption) = 0.39 + 0.143 trips with a correlation of about 0.69. As it happens, removing the three points to the right has a small effect on the slope, increasing it from 0.143 to 0.177. However, the correlation drops considerably, from 0.69 to 0.37, with the regression line now passing between the two remaining clusters of points.

P33. The prediction is ln(weight) = 3.29 ln(75) − 10.2 = 4.005. Solving,
ln(weight) = 4.005
weight = e^4.005 = 54.9 pounds (or 54.8 with no rounding)

P34. a. There is a strong positive curved relationship between depth and velocity. The curve is concave down with the rate of change in velocity decreasing as the depth increases.
b. The curve is not exponential, but could be a power function with a power less than 1. Taking the log of both variables results in this plot, shown with its residual plot. A reasonable model is ln(velocity) = 0.146 + 0.175 ln(depth). Solving for velocity gives velocity = 1.157 depth^0.175.
[Figure: scatterplot of ln(velocity) versus ln(depth) for the tidal velocity data, with fitted line lnVelocity = 0.175 lnDepth + 0.15, r^2 = 0.89, and its residual plot]

P35. a. i. [Figure: scatterplot of ya versus xa]
ii. The y-scale must be shrunk more for larger values of x than for smaller values of x. The cube root transformation will straighten them.
iii. [Figure: plot of ya^(1/3) versus xa]
b. i. [Figure: scatterplot of yb versus xb]
ii. The y-scale must be shrunk more for smaller values of x than for larger values of x. The reciprocal transformation (power of −1) will work.
iii. [Figure: plot of yb^(−1) versus xb]
c. i. [Figure: scatterplot of yc versus xc]
ii. The y-scale must be expanded more for larger values of x than for smaller values of x. A square transformation (power of 2) will straighten the points.
iii. [Figure: plot of yc^2 versus xc]

P36. For the points in P35a, the x-scale must be expanded. A cubic power transformation will straighten this relationship.
[Figure: plot of ya versus xa^3]
For the points in P35b, the x-scale must be shrunk a bit. A reciprocal transformation (power of −1) will do the trick.
[Figure: plot of yb versus xb^(−1)]
For the points in P35c, again, the x-scale must shrink. This time a square root transformation (power of 0.5) will straighten the points.
[Figure: plot of yc versus xc^0.5]
The exponents in P36 are the reciprocals of the exponents in P35.

P37. a. The area is proportional to the square of the radius (y = πx^2), so y would have to be raised to the power 1/2 (square root).
b. The volume is proportional to the cube of a side (y = x^3), so y would have to be raised to the power of 1/3 (cube root).
c. The volume is proportional to the square of the diameter (y = 8π(x/2)^2 or y = 2πx^2), so y would have to be raised to the power 1/2 (square root).

P38. The following plots show the diameter plotted against the square root of age and the residuals from the regression line. The regression analysis is also provided. This transformation results in a scatterplot with less of a fan shape than the one for diameter^2 versus age. If you want to predict diameters along the full range of ages, this transformation will allow more even precision in the predictions.
[Figure: scatterplot of Diameter versus Sqrt(age), with fitted line Diameter = 1.47 · Sqrt(age) − 1.86, r^2 = 0.83, and its residual plot]

Dependent variable is: Diameter
No Selector
R squared = 83.2%   R squared (adjusted) = 82.5%
s = 0.911 with 27 − 2 = 25 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   102.855          1    102.855       124
Residual     20.7969          25   0.831876

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   −1.85727      0.6175          −3.01     0.0059
√Age       1.46516       0.1318          11.1      ≤0.0001

P39. a. The plot of the data suggests that you must expand the y-scale or shrink the x-scale, so the power transformation on x will have to be a power less than 1. Students may suggest a square root transformation on y because it has been successful in the past.
b. The log-log transformation yields a nearly linear plot.
[Figure: scatterplot of log(brain weight) versus log(body weight)]

The regression equation is
logBrain = 0.908 + 0.760 logBody

Predictor   Coef      Stdev     t-ratio   p
Constant    0.90754   0.04967   18.27     0.000
logBody     0.76020   0.03162   24.04     0.000

s = 0.3156   R-sq = 92.6%   R-sq(adj) = 92.5%

Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    57.577   57.577   577.96   0.000
Error        46   4.583    0.100
Total        47   62.159

c. The regression equation is log(brain) = 0.908 + 0.76 log(body), or brain = 8.10(body)^0.76 (or 8.08 with no rounding). The slope of the line, 0.76, agrees with the insight that the x-scale must be transformed by a power less than 1.

Exercises

E55. The plot of the square root of flight length versus speed still retains obvious curvature; this transformation is less satisfactory than the log transformation.
[Figure: scatterplot of Sqrt(flight length) versus Speed (mi/h)]

E56. The plots of the data and the regression analysis are shown next.
[Figure: scatterplot of Population versus Roll Number]
[Figure: scatterplot of ln(population) versus Roll Number, with its residual plot]
The equation of the regression line is ln ŷ = 5.142 − 0.885x. Because e^(−0.885) = 0.413, the estimated rate of decay is 1 − 0.413 = 0.587. The curved, V-shaped residual plot shows that the rate of decay is greater than the estimated value during the first and last time periods and less than the estimated value over the middle time periods.

E57. The plot of the data shows curvature. Although the log-log transformation helps, it does not remove the curvature. (Neither will any other power transformation.) The residual plots suggest dividing the data into two groups, as the trends are more linear within each group. So a good way to model these data is to split them into two groups (perhaps with ages 2 through 8 in one group and 9 through 14 in the other).
Then although the points in each group still have some curvature, the residuals are much smaller. The younger group has a weight gain per inch of height that is lower than the overall average, whereas the older group has a weight gain per inch of height that is higher than the average.

[Plot: Median Heights, Weight (lb) versus Height (in.), with residual plot; Weight = 2.76Height − 80; r^2 = 0.96]
[Plot: ln(weight) versus ln(height), with residual plot]

Note on E58: Using the Fathom data file makes this question much easier. On the plot of cost/seat/mile versus flight length, highlight the points for one group. The same subjects will be highlighted on the other graphs and this question is more easily answered.

E58. The group on the right in the plot of cost/seat/mile versus flight length also happens to be the larger planes, judging from the numbers of seats, and they use more fuel and are the planes with the highest flight speeds. (Refer to the scatterplot matrix of airline data for E8 on page 91 of this Instructor’s Guide.)

E59.
a. There is a strong positive relationship between size of the hunting party and success rates. The plot looks fairly linear, but a look at the residual plot makes it apparent that there is some curvature in the data.

[Plot: Percent Successful versus Number of Chimps, with residual plot]

b. A line is not a bad fit here and would predict reasonably well for parties of 16 or fewer chimps because the residuals are small. However, a line is not the most appropriate model and would probably not work as well to predict for parties much more than 16 chimps strong.
c. A log-log transformation works pretty well. The model would be ln(percent) = 0.524 ln(chimps) + 2.9575, or percent = 19.25 chimps^0.524.
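The back-transformation of the log-log model in E59c can be checked numerically. The following is a minimal sketch that assumes only the fitted coefficients quoted in that solution (slope 0.524, intercept 2.9575); the function name predicted_percent is a hypothetical helper, not something from Fathom or the student book.

```python
import math

# Coefficients of the fitted log-log model quoted in E59c:
# ln(percent) = 0.524 * ln(chimps) + 2.9575
SLOPE = 0.524
INTERCEPT = 2.9575

def predicted_percent(chimps):
    """Back-transform the log-log fit to the power model percent = a * chimps**b."""
    a = math.exp(INTERCEPT)  # multiplicative constant of the power model
    return a * chimps ** SLOPE

# Print the constant and a few predictions for small, medium, and large parties.
print(round(math.exp(INTERCEPT), 2))
for party_size in (1, 4, 16):
    print(party_size, round(predicted_percent(party_size), 1))
```

Exponentiating the intercept gives the constant of the power model, and the slope on the log-log scale becomes the exponent. Predictions for parties much larger than those observed should still be treated with caution, for the reason given in part b.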
[Plot: ln(percent) versus ln(chimps), with residual plot]

Note that an exponential model fit using a log transformation does exactly the wrong thing, bending the plot in the wrong direction.

[Plot: ln(percent) versus Chimps, with residual plot]

d. The residual plot shown in part c shows a random scatter, which is good. There is more spread for the smaller hunting parties than for the larger ones, so a transformation that reduces this would be better.

E60. The data show a curved pattern of growth over time, which could perhaps be modeled as exponential growth as we often hear that population is growing exponentially.

[Plot: U.S. Population versus Year]

a. The next plot shows the growth in population for each decade. The increase in population wasn’t constant from decade to decade but increased in a linear way. When change in population grows linearly, the population grows quadratically. Thus, we can predict that an exponential model isn’t appropriate and that taking the square root of each population will linearize the original scatterplot.

[Plot: Population Growth versus Year]

b. Taking the log of the population overcompensates for the curve in the original data; the growth is not really exponential growth. A better transformation is the square root of the population. The plot of these data, along with the regression analysis and residual plot, is presented next. The residuals still have some pattern, as is expected for time series data, but it is not very pronounced.

[Plot: Sqrt(U.S. population) versus Year, with residual plot; Sqrt(Pop) = 77.7 Year − 138800; r^2 = 1.00]

The regression equation is √population = −138840 + 77.7 year.

c. The pattern in the immigration data is quite cyclical, which is another common time series pattern. No simple power transformation will straighten this out. There is more than one “bend” in the data; power transformations only work well for a single bend.

[Plot: Immigration (thousands) versus Year]

E61. The scatterplot of the original data appears here.

[Plot: Price ($) versus Days in Advance]

Because you would expect the price to go up as flight time gets nearer, a reciprocal transformation of the price (or 1/price) might linearize data such as these. Actually, this transformation does a good job. The plot and residual plot are shown here.

[Plot: Reciprocal of Price versus Days in Advance, with residual plot]

The regression equation is 1/price = 0.00541 + 0.000039 days. If you solve for price, you get

price = 1 / (0.00541 + 0.000039 days)

That is, the number of days affects the price very little according to this model. (As students will learn later, the slope of the regression equation is not significantly different from 0.)

You get a linear relationship with some negative trend by plotting (ln(days), ln(price)). First, substitute 1/2 day for 0 days before taking logs. This regression equation is ln(price) = 5.65 − 0.166 ln(days). The plot and residual plot are shown next. If you solve the equation for the price, you get price = 284.29 days^−0.166.
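As a quick numerical check of the solved log-log equation for E61, the sketch below evaluates price = 284.29 · days^−0.166 at two lead times. It assumes only the coefficients quoted in the solution; predicted_price is a hypothetical helper name.

```python
def predicted_price(days):
    """Price predicted by the back-transformed log-log fit quoted in E61."""
    if days == 0:
        # The solution substitutes 1/2 day for 0 days before taking logs.
        days = 0.5
    return 284.29 * days ** -0.166

# Evaluate the model at 10 and 30 days before the flight.
for lead_time in (10, 30):
    print(lead_time, round(predicted_price(lead_time), 2))
```

Because the exponent is small in magnitude, tripling the lead time changes the predicted price only modestly.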
Notice that for the range 10 days to 30 days, when most of the purchases were made, the range of prices is only from $161.64 to $193.98. Once again, days accounts for little of the variation in price. (And, again, the slope of the regression line is not statistically significant.)

[Plot: ln(price) versus ln(days), with residual plot]

Here’s the reason no transformation will give us a good model of predicting price from day: Five of the passengers paid a lot more for their tickets than did the other passengers, and they bought them 3, 4, 8, 9, and 9 days before the flight. (See the previous scatterplot.) But other passengers bought their tickets even closer to flight time and paid just about the same as passengers who bought their tickets months before. If the five passengers who paid extremely high prices are left out, the relationship is reasonably linear but flat. The correlation between days and price for the remaining passengers is 0.034, or practically nonexistent. Thus, the best model is to say that there is no relationship between the day these passengers bought their tickets and the price they paid, with the exception of five passengers who bought their tickets within 9 days of the flight and who paid more than double any other passenger.

E62.
a. It appears that brain oxygen versus body mass could be modeled by exponential decay, but a quick check will show that a log transformation does little to straighten the plot. The log-log transformation does well, once again.

[Plot: Oxygen Use in Brain versus Body Mass (kg)]
[Plot: ln(brain oxygen) versus ln(body mass), with residual plot]

The equation of this regression line is

ln(brain oxygen) = 3.26 − 0.07 ln(body mass)

which implies that ln(brain oxygen) decreases, on the average, by 0.07 units for every 1 unit increase in ln(body mass).
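One way to read a log-log slope on the original scale is to ask what happens when x doubles. The sketch below, which assumes only the coefficients quoted for E62a, converts the fitted line to the power form brain oxygen = e^3.26 · (body mass)^−0.07 and computes the factor by which the prediction changes when body mass doubles. Both function names are hypothetical.

```python
import math

# Coefficients quoted in E62a:
# ln(brain oxygen) = 3.26 - 0.07 * ln(body mass)
INTERCEPT = 3.26
SLOPE = -0.07

def predicted_brain_oxygen(body_mass_kg):
    """Power-model prediction obtained by exponentiating the log-log fit."""
    return math.exp(INTERCEPT) * body_mass_kg ** SLOPE

def doubling_factor(slope=SLOPE):
    """Multiplicative change in the prediction when body mass doubles."""
    return 2 ** slope

print(round(doubling_factor(), 3))
```

Under this model, each doubling of body mass lowers the predicted brain oxygen use by a little under 5%, whatever the starting mass.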
b. Using the log-log transformation, the relationship between lung oxygen consumption and body mass has a similar linear trend except for a couple of stray points.

[Plot: ln(lung oxygen) versus ln(body mass), with residual plot]

The equation of this line is

ln(lung oxygen) = 1.976 − 0.0951 ln(body mass)

which implies that ln(lung oxygen) decreases, on the average, by 0.0951 units for every 1 unit increase in ln(body mass).

c. If this theory is true and oxygen consumption depends on the relative size of the organ, then the lung oxygen consumption should decrease less rapidly than the brain oxygen consumption. But the data show that the lung oxygen consumption decreases more rapidly than that of the brain. There must be another explanation as to why the brain seems to use more oxygen, relative to its size, than do other organs.

E63.
a. The scatterplot of these data shows a marked decrease in the birthrate as the GNP increases. The relationship is nonlinear but does not look like exponential decay.

[Plot: GNP versus Birthrate]

b. The log transformation works quite well here and gives a plot that seems appropriate for a regression line. The residual plot looks rather like random scatter and further supports this choice of a statistical model.

[Plot: log(GNP) versus Birthrate, with residual plot; log(GNP) = −0.0674 Birthrate + 1.87; r^2 = 0.60]

Dependent variable is: logGNP
No Selector
R squared = 59.7%   R squared (adjusted) = 58.0%
s = 0.4383 with 25 − 2 = 23 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   6.54964          1    6.54964       34.1
Residual     4.41913          23   0.192136

Variable    Coefficient   s.e. of Coeff   t-ratio   prob
Constant    1.86903       0.2276          8.21      ≤ 0.0001
Birthrate   −0.067360     0.0115          −5.84     ≤ 0.0001

The regression equation is

log(GNP) = 1.87 − 0.0674 birthrate

or

GNP = 10^1.87 · (10^(−0.0674 birthrate)) = 74.13(0.856)^birthrate

To interpret the slope and intercept of the model we must use the linear version on the log scale. log(GNP) decreases, on the average, 0.0674 units for every 1 unit increase in birthrate.

E64.
a. The decreasing trend has a slight curvature, especially toward the later years, so perhaps a log transformation (exponential decay) will work. This would make the interpretation of the results quite easy.

[Plot: 18+ versus Year]
[Plot: ln(18+) versus Year, with residual plot]

The residual plot still has a good deal of curvature. Another look at the original plot shows that the curve seems to be approaching 21, not 0, as an exponential decay function would. An exponential model could be used by subtracting 21 from the percentage before taking the log. This graph shows the results.

[Plot: ln(18+ − 21) versus Year, with residual plot]

This exponential decay model fits well and shows that the level of smoking above 21% is decreasing at a rate of about 5.5% per year because e^−0.05645 = 0.945. (Remember, this rate of decrease is a percent of a percent because the original measurements are percentages.)

b. This plot poses a difficulty because the trend changes abruptly about 1991. One equation will not work. The rate of smoking seems to decrease linearly until about 1990, then it begins increasing linearly. These plots show lines for the two parts of the plot separately.

[Plots: 18–24 versus Year for 1965 to 1995 and for 1990 to 2002, each with residual plot]

The regression equation for the first plot is percentage = 2311 − 1.1492 year, and for the second plot it is percentage = −872.00 + 0.514 year.

c. The pattern of decrease for the 65 and older category is much more linear; in fact, the log transformation will make things worse instead of better.

[Plot: 65+ versus Year]

The regression equation is
percentage = 1062 − 0.5255 year

Predictor   Coef       Stdev     t-ratio   p
Constant    1061.83    53.06     20.01     0.000
Year        −0.52550   0.02666   −19.71    0.000

s = 1.100   R-sq = 95.8%   R-sq(adj) = 95.6%

Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    470.16   470.16   388.57   0.000
Error        17   20.57    1.21
Total        18   490.73

This nearly constant (linear) rate of decrease amounts to about half a percentage point per year.

E65.
a. See the plots below. The amount of CO2 is definitely increasing over the years, and the upward curvature makes it reasonable to suspect exponential growth. But note that the log transformation is not much help here.

[Plot: CO2 versus Year, with residual plot]
[Plot: ln(CO2) versus Year, with residual plot]

b. The fitted line appears in the first plot in part a. The residual plot from the original data shows that CO2 increased at a rate lower than the overall average from 1967 to about 1994 and at a higher rate than the overall average before 1967 and after 1994.

c. The pattern of the residuals suggests an abrupt change around 1976. A better way to model these data might be to use two straight lines with different slopes, one line covering the period from 1959 to about 1976 and the other from about 1977 to 2003. The first two plots below show the regression line and residual plots for years up to 1976 and for years after 1976, respectively. The third plot shows the two lines on the original plot.

[Plot: CO2 versus Year, 1958 to 1978, with residual plot]
[Plot: CO2 versus Year, 1975 to 2005, with residual plot]
[Plot: CO2 versus Year, 1955 to 2005, with both fitted lines]

The regression equation for the first plot is CO2 = −1559.3 + 0.957 year. For the second plot it is CO2 = −2776.7 + 1.573 year.

Another possibility would be to recognize that although an exponential model has an asymptote at zero, the CO2 level in the atmosphere was never near 0. One estimate is that pre-industrial levels of CO2 in the atmosphere were around 250 ppm. We can adjust for this by taking the natural log of (CO2 level − 250).

[Plot: Adjusted ln(CO2) versus Year, with residual plot]

The regression equation for this plot is ln(CO2 − 250) = −25.281 + 0.0150 year. Once again, notice that the residual plots have an oscillating pattern typical of time series data.

d. The linear model gives an average increase of about 1.57 ppm CO2 per year for years after 1976. Using the exponential model with an asymptote at 250 ppm, the amount of CO2 in the atmosphere above 250 ppm is multiplied by e^0.0150 = 1.015 each year, for a growth rate of about 1.5% per year.

E66. The plot of average SAT math score versus percentage taking exam shows a decreasing trend with a curvature. A log-log transformation straightens this out nicely, and the regression analysis of ln(SAT math score) versus ln(percentage taking exam) provides a good model for prediction, although the left end of this line is pulled down a bit by a couple of states that have both low percentages and relatively low scores.

[Plot: Score versus Percent]
[Plot: ln(score) versus ln(percent), with residual plot]

The complete regression analysis is shown here.

The regression equation is
lnScore = 6.46 − 0.0509 lnPercent

Predictor   Coef        Stdev      t-ratio   p
Constant    6.45586     0.01351    477.99    0.000
lnPercent   −0.050937   0.003967   −12.84    0.000

s = 0.02921   R-sq = 77.5%   R-sq(adj) = 77.0%

Analysis of Variance
SOURCE       DF   SS        MS        F        p
Regression   1    0.14072   0.14072   164.89   0.000
Error        48   0.04096   0.00085
Total        49   0.18168

Chapter Summary

Homework
Essential: E67–70, E73, E81
Recommended: E74, E75, E76, E78, E80
Optional: E71, E72, E77, E79, E82–83
For AP Students: AP1–10

AP Exam Practice
As you worked through this chapter, you may have assigned relevant items from past AP Statistics Exams, which are listed in the table on page 144. As you review and assess the chapter, you may wish to assign additional items. You can also use the Chapter 3 AP Practice Quiz or the Chapters 2–3 AP Practice Quiz on pages 39–42 of the Instructor’s Resource Book. Each quiz includes five multiple-choice items similar to those on AP Statistics Exams and one free-response question from the Statistics in Action student book. Once students have taken the quiz, go over the elements of a good answer. You might display the model answers to the free-response questions downloadable from www.keypress.com/keyonline.

Solutions

Review Exercises

E67.
a. If Leonardo is correct, the points should lie near the lines:

arm span = height
kneeling height = (3/4) height
hand length = (1/9) height

Looking at the next plots, these rules appear to be approximately correct.

[Plots: Arm Span (cm), Kneeling Height (cm), and Hand Length (cm), each versus Height (cm)]

The least squares regression equation for predicting the arm span from the height is ŷ = −5.81 + 1.03x. The least squares regression equation for predicting the kneeling height from the height is ŷ = 2.19 + 0.73x. The least squares regression equation for predicting the hand length from the height is ŷ = −2.97 + 0.12x.

b. If one student is 1 cm taller than another, their arm span tends to be 1.03 cm larger. If one student is 1 cm taller than another, their kneeling height tends to be 0.73 cm larger. If one student is 1 cm taller than another, their hand length tends to be 0.12 cm larger.

c. In each case, the points are packed tightly about the regression line and so there is a very strong correlation. The correlations are

arm span and height: 0.992 (strongest)
kneeling height and height: 0.989
hand length and height: 0.961 (weakest)

E68.
a. This scatterplot shows no obvious association between temperature and number of distressed O-rings. Although the highest number of distressed O-rings was at the lowest temperature, the second highest number was at the highest temperature.

b. It is difficult to look at the scatterplot of the complete set of data, shown next, and not see that any risk is almost entirely at lower temperatures. The correlations are r = −0.263 for the incomplete set of data but r = −0.567 after all points are included, which is only moderately strong at any rate. This is a tragic example of scientists and engineers not asking the right question: Do I have all of the data?

[Plot: Number of O-Rings with Some Distress versus Temperature (°F)]

Background information: After each launch, the two rocket motors on the sides of the shuttle were recovered and inspected. Each rocket motor was made in four pieces, which were fit together with O-rings to seal the small spaces between them. The O-rings were 37.5 feet in diameter and 0.28 inches thick. The Rogers Commission, which was appointed by President Reagan to find the cause of the accident, noted that the flights with zero incidents were left off the plot because it was felt that these flights did not contribute any information about the temperature effect. The Commission concluded that “A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature.” [Report of the Presidential Commission on the Space Shuttle Challenger Accident (Washington, D.C., 1986), page 148.]

E69.
a. Yes, the student who scored 52 on the first exam and 83 on the second lies away from the general pattern. This student scored much higher on the second exam than would have been expected. This point will be influential because the value of x is extreme on the low side and the point lies away from the regression line. The point sticks out on the residual plot. There is a pattern in the rest of the points; they have a positive correlation.

b. The slope should increase, and the correlation should increase. In fact, the slope increases from 0.430 to 0.540, and the correlation increases from 0.756 to 0.814.

c. The residual plot appears next. The residuals now appear scattered, without any obvious pattern, so a linear model fits the points well when point (52, 83) is removed.

[Plot: Residual versus Exam 1, without (52, 83)]

d.
Yes, there is regression to the mean in any elliptical cloud of points whenever the correlation is not perfect. For example, the student who scored lowest on Exam 1 did much better on Exam 2. The highest scorer on Exam 1 was not the highest scorer on Exam 2. A line fit through this cloud of points does have slope less than 1.

E70.
a. Use the formula b1 = r(sy/sx) to obtain

r = 0.51(11.6/7.0) = 0.845

b. Using the fact that (x̄, ȳ) is on the regression line,

ŷ = ȳ + b1(x − x̄)
  = 87.8 + 0.51(x − 82.3)
  = 45.83 + 0.51x

E71.
a. Each value should be matched with itself.
b. Match each value with itself, except match 0.5 (or −0.5) with 0: (−1.5, −1.5), (−0.5, −0.5), (0, 0), (0, 0.5), (0.5, 0), (1.5, 1.5) for a correlation of 0.950.
c. (−1.5, 0), (−0.5, −0.5), (0, 1.5), (0, −1.5), (0.5, 0.5), (1.5, 0) has a correlation of 0.1.
d. Match the biggest with the smallest, the next biggest with the next smallest, etc.: (−1.5, 1.5), (−0.5, 0.5), (0, 0), (0, 0), (0.5, −0.5), (1.5, −1.5).

Note on E72: Students should not have to actually compute the correlations to answer these questions.

E72.
a. Correlations of 0 occur between A and B, B and E, B and F, C and D, and E and F. The correlation between D and E is 0.02. The pairs with zero correlation all have scatterplots that are symmetric around a center vertical line or center horizontal line or both.
b. A and E, D and F, and A and C
c. A and D, A and F, B and C, and C and F
d. Student responses may include the following: The scatterplot of A and B shows that there can be a pattern in the points (a shape) even though the correlation is zero. That the scatterplots of A and D and of A and F have about the same correlation also shows that the correlation does not tell anything about their quite different shapes.

For your information, the complete correlation matrix is shown here.

     A       B       C       D       E
B    0
C    0.447   0.258
D    0.224   0.129   0
E    0.875   0       0.091   0.018
F    0.258   0       0.289   0.577   0

E73.
a. True. Both measure how closely the points cluster about the “center” of the data. For univariate data, that center is the mean; for bivariate data, the center is the regression line.
b. True. Refer to E72 for examples.
c. False. For example, picture an elliptical cloud of points with major axis along the y-axis. The correlation will be 0, but there will be a wide variation in the values of y for any given x.
d. True. Intuitively, a positive slope means that as x increases, y tends to increase. This is equivalent to a positive correlation. Similar statements can be made for a negative slope and a zero slope. Alternatively, the truth of this statement can be seen from the relationship

b1 = r(sy/sx)

Because the standard deviations are always positive, b1 and r must have the same sign.

E74.
a. The correlation is quite high, about −0.83.
b. There are two clusters of points: one of states with a small percentage of students taking the SAT and one of states with more than 50% taking the SAT. The second cluster has almost no correlation and would have a relatively flat regression line, whereas the first cluster has a strong negative relationship. Combining these two clusters results in summary statistics that do not adequately describe either one of them.
c. A residual plot would be U-shaped with points above zero on the left, below zero in the middle, and above zero again on the right. The actual residual plot is shown here.

[Plot: Residual versus Percentage Taking Exam]

E75.
a. Public universities that have the highest in-state tuition also tend to be the universities with the highest out-of-state tuition, and public universities that have the lowest in-state tuition also tend to be the universities with the lowest out-of-state tuition. This relationship is quite strong.
b. No, the correlation does not change with a linear transformation of one or both variables.
However, if you were to take logarithms of the tuition costs, the correlation would change.

c. The slope would not change with the first transformation. Consider the formula for the slope:

b1 = r(sy/sx)

The correlation remains unchanged with the change of units. The standard deviations would each be 1/1000 as large as previously, but the factor of 1/1000 would be in both the numerator and denominator and so would cancel out. But if you were to take logarithms of the tuition costs, the slope would change because the proportion sy/sx would be different.

E76.
a. Quadrant I: +, +, +; Quadrant II: −, +, −; Quadrant III: −, −, +; Quadrant IV: +, −, −
b. The points near the origin or near one of the new axes make the smallest contributions. The contributions are small because either zx or zy would be close to 0.

E77. You can compute the values of r using the formula

b1 = r(sy/sx)

Solving for r, the formula becomes

r = b1(sx/sy)

These correlations are A: 0.5; B: 0.3; C: 0.25. So from weakest to strongest, they are ordered C, B, A.

E78.
a. No. It is possible for one bookstore to be a lot more expensive than the other. For example, suppose your local bookstore sold each book for $10 less than the online price. The correlation would be 1.
b. The main reason for the high correlation is that the bookstores pay approximately the same wholesale cost for a book. They then add on an amount to cover their overhead costs and to give them a profit. To have a cause-and-effect relationship, a change in one variable should trigger a change in the other variable. That is not necessarily the case with these prices; however, if the online bookstore lowers its prices, it might force local bookstores to do the same.

E79. For example, stocks that do the best in one quarter may not be the ones that do the best in the next quarter. As another example, the best 20 and worst 20 hitters this year in Major League Baseball are not likely to repeat this kind of performance again next year.

E80. This matrix gives the correlations between all pairs of variables in this exercise.

            Ave Long   Max Long   Gestation
Max Long    0.769
Gestation   0.577      0.761
Speed       −0.215     0.237      0.018

a. In general, animals with longer maximum longevity have longer gestation periods. The trend is reasonably linear, with a moderate correlation. The plot shows heteroscedasticity, with the points fanning out as maximum longevity increases. The elephant is an outlier, although it follows the general linear trend.

[Plot: Gestation Period (days) versus Maximum Longevity (yr)]

b. The pattern here is similar: The animals with longer average longevity have longer gestation periods. However, the relationship is not as strong as in part a, so maximum longevity is the better predictor of gestation period. There are two outliers, the elephant and the hippopotamus.

[Plot: Gestation Period (days) versus Average Longevity (yr)]

c. The three relatively slow animals (elephant, hippopotamus, and grizzly bear) in the lower-right corner give these variables a slight negative correlation of −0.215. However, the rest of the animals show a positive trend, with longer average longevity associated with greater speed. The lurking variable is the size of the animal. Larger animals tend to live longer and be faster (unless they get very big, like an elephant).

[Plot: Speed (mi/h) versus Average Longevity (yr)]

E81.
a. The next scatterplot shows a very strong linear relationship between the expenditures for police officers and the number of police officers per state. This relationship makes sense.
There is one outlier and influential point, California, which has by far the largest population of any state listed.

[Scatterplot and residual plot: Spending ($ millions) versus Officers (in thousands)]

The correlation is 0.976, and the regression line for expPolice on number of police has slope 73.8 and an intercept of magnitude 402.8. So for every additional thousand police officers, costs tend to go up by $73,800,000, or $73,800 each. If the outlier, California, is removed, the slope of the line decreases to 57.2 but the correlation increases to 0.984. Here, California is very influential on the slope.

b. The scatterplot again shows a very strong linear relationship, with California as an outlier and influential point. The larger the population of the state, the more police officers. This time the correlation is 0.987, and the regression line for number of police on population has slope 2.91 and an intercept of magnitude 1.47. That is, for every increase of 1 million in the population, the number of police officers tends to go up by 2910. If the outlier, California, is removed, the slope of the line increases a little to 3.15 and the correlation decreases a little to 0.976. Here, California is not very influential on the slope or the correlation.

[Scatterplot and residual plot: Officers (in thousands) versus Population (in millions)]

c. The scatterplot shows that there is a moderate positive but possibly curved relationship. For this scatterplot, it is not appropriate to compute the correlation or equation of the regression line. But, in general, the larger the number of police, the higher the rate of violent crime. (Note that this is the rate per 100,000 people in the state, not the number of violent crimes.) Almost equivalently, the larger the population of a state, the higher the violent crime rate. It is not at all obvious why this should be the case. Why would larger states (more police, more population) tend to have higher rates of violent crime? (Because of the strong relationship between the number of police officers per state and the population, a scatterplot of the crime rate versus the number of police officers per 100,000 population looks about the same.)

[Scatterplot: Violent Crime Rate (per 100,000 population) versus Officers (in thousands)]

A log-log transformation straightens these points quite well, as shown in this scatterplot and residual plot.

[Scatterplot and residual plot: log(violent) versus log(officers)]

Note on E82: This exploration will be much more efficient using Fathom and the data on the Instructor’s Resource CD rather than a graphing calculator.

E82. a. A linear model works fairly well, as shown in this scatterplot and residual plot. There is some heteroscedasticity, however, so the residuals for houses with a large number of square feet are larger. The equation is \(\text{price} = -25.2 + 75.6 \cdot \text{area}\), and the correlation is 0.899.

[Scatterplot and residual plot: Price ($ thousands) versus Area (thousand square feet)]

Separating the houses into new houses and old houses, the corresponding regression equations are \(\text{price} = -48.4 + 96.0 \cdot \text{area}\) and \(\text{price} = -16.6 + 66.6 \cdot \text{area}\). These equations are very different, so you should not use one equation for both groupings. New houses cost quite a bit more per square foot. You can see the two relationships in this plot.

[Scatterplot: Price ($ thousands) versus Area (thousand square feet), by Age (New, Old)]

b. The two largest houses clearly are influential points, as could be the third largest house and the smallest house in the lower-left corner.
If the two largest are removed from the data set, the correlation drops to 0.867 and the regression equation changes to \(\text{price} = -10.3 + 65.7 \cdot \text{area}\). This is quite a change in the model: the price is now increasing $10,000 less per increase of 1,000 square feet.

c. Using the equation from part a, the price for an old house of 1,000 square feet would be

\[ \text{price} = -16.6 + 66.6 \cdot \text{area} = -16.6 + 66.6(1) = 50, \]

or $50,000. The price for a house of 2,000 square feet would be

\[ \text{price} = -16.6 + 66.6 \cdot \text{area} = -16.6 + 66.6(2) = 116.6, \]

or $116,600. You should have more confidence in the first prediction because the spread in the prices is less for the smaller houses than for the larger houses.

d. As seen in these boxplots, the number of bathrooms is strongly related to the selling price. A lurking variable here is the number of square feet in the house, which is very strongly related to both the price and the number of bathrooms. A regression line is not appropriate here mostly because of the skewness in the prices for houses with three bathrooms and because you can do something better. You can compute the mean (or the median) price of a house with one bathroom, with two bathrooms, and with three bathrooms: $40,320 ($33,400); $99,290 ($98,500); and $201,400 ($170,000). This process is equivalent to regression but does not require a linear relationship.

[Boxplots: Price ($ thousands) by Number of Bathrooms (1, 2, 3)]

E83. a. The next scatterplot shows that expenditures per pupil (ExpPP) and average teacher salary (TeaSal) are moderately correlated: \(r = 0.66\). The equation of the regression line is \(\text{ExpPP} = 1144.8 + 182.7 \cdot \text{TeaSal}\). So, if one state’s average teacher’s salary is $1000 more than another’s, its per-pupil expenditure tends to be $183 more. In this case, there is a cause-and-effect relationship because if teachers are paid more, on average, the cost per pupil has to go up unless class sizes are increased proportionally.
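The slope interpretation in E83 part a can be checked with a short computation. The sketch below assumes the regression line reads ExpPP = 1144.8 + 182.7 TeaSal, with TeaSal measured in thousands of dollars as on the plot axis. The function name and the two salary values are hypothetical, chosen only for illustration.

```python
def predict_exp_pp(tea_sal_thousands: float) -> float:
    """Predicted per-pupil expenditure ($) from average teacher salary ($ thousands).

    Assumes the fitted line ExpPP = 1144.8 + 182.7 * TeaSal from E83 part a.
    """
    return 1144.8 + 182.7 * tea_sal_thousands


# Two hypothetical states whose average salaries differ by $1000 (1 unit of TeaSal).
lower_salary = 40.0
higher_salary = 41.0

# The difference in predictions equals the slope, whatever the starting salary.
difference = predict_exp_pp(higher_salary) - predict_exp_pp(lower_salary)
print(f"Predicted difference in ExpPP: ${difference:.0f}")
```

Because the model is linear, the same difference appears for any pair of salaries that are $1000 apart. That is why the slope can be read as "about $183 more per pupil for each additional $1000 of average salary."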
[Scatterplot: ExpPP ($) versus TeaSal ($ thousands)]

b. This next scatterplot has per capita expenditure on schools plotted against the average teacher salary. Again, you would expect a positive relationship, and \(r = 0.577\), so it is a bit weaker than in part a. It isn’t surprising that the two correlations are so close because the number of pupils in the state is pretty much proportional to the number of people in the state, so you would expect expenditure per pupil and expenditure per capita to have roughly the same correlation with teacher salary (or any other variable).

[Scatterplot: ExpPC ($) versus TeaSal ($ thousands)]

c. Shown below are the correlations and a scatterplot matrix (the JMP version, which automatically draws ellipses). All of the relationships with percentage of dropouts have correlations close to zero and in the matrix plot, their ellipses are quite round and fat. So, no, none of the variables are good predictors of the percentage of dropouts.

Pearson Product-Moment Correlation
No selector

            Dropout   ExpPP   ExpPC   TeaSal   Enroll   Teachers
Dropout     1.000
ExpPP       0.179     1.000
ExpPC       0.060     0.912   1.000
TeaSal      0.024     0.660   0.577   1.000
Enroll      0.052     0.096   0.125   0.438    1.000
Teachers    0.098     0.165   0.175   0.437    0.975    1.000

[Scatterplot matrix (Chapter 3 Review, E83c): ExpPP, ExpPC, TeaSal, %Dropout, Enrollment, Teachers]

AP Practice Test

AP1. A. There aren’t any points in the upper left-hand corner because the oldest child has to be older (or the same age, in the case of twins) than the youngest child.
Thus all points must lie on or below the line \(y = x\).

AP2. C. The predicted birthrate is \(-0.38(60) + 53.5 = 30.7\), so the residual is the actual birthrate of 47 minus this prediction, \(47 - 30.7 = 16.3\).

AP3. C. Curvature in the residual plot of a linear regression is a sign of curvature in the original plot, so statement I is true. When points in the residual plot lie below the line \(y = 0\), the points in the original scatterplot lie below the regression line and so the prediction is too large. Thus, statement II is true. Statement III is false because, for example, the pattern could be exponential with a high correlation.

AP4. A. Outliers should not be removed permanently from a data set simply because they are outliers. Further investigation is needed, as described in B, C, and D.

AP5. A. B is incorrect because the slope of the regression equation is positive, so the correlation is 0.228. C is incorrect because the value of \(R^2\) doesn’t give any information about linearity versus curvature. E is incorrect because it implies that each person’s satisfaction tends to increase over their stay in the hospital. Instead, there may be a lurking variable of age: older people have to stay longer and they also tend to be more satisfied with their care. Or, the lurking variable might be severity of the problem. The more seriously ill a patient is, the longer they tend to have to stay, and the more grateful they are for the care they were given.

AP6. D. In the year 2000, \(t = 50\), so \(\log_{10}(\text{population}) = 0.01(50) + 7 = 7.5\), and thus \(\text{population} = 10^{7.5} \approx 31{,}622{,}777\) and 31,600,000 is the closest.

AP7. E. Choice A is a poor choice because each point represents a different Barbarian, and so does not establish trends in a particular Barbarian. Choice B is closer to an interpretation of the intercept than to the slope. Choice C might be close to correct if the y-intercept was near zero, but here it’s far from zero.
For choice D, you would have to know the scores on the two sections had equal standard deviations before making this interpretation.

AP8. D. Note that the correct statement E is equivalent to saying that 81% of the variation in the number of raids among Barbarians is explained by personal cleanliness.

AP9. a. i. From the plot, this line looks to be a good fit.

[Scatterplot: Body Fat (%) versus Density]

ii. The regression equation is \(\hat{y} = 505.254 - 460.678x\) and the analysis is as follows.

Dependent variable is: % Fat
No Selector
R squared = 99.9%   R squared (adjusted) = 99.9%
s = 0.2443 with 15 − 2 = 13 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   1206.48           1   1206.48       20223
Residual     0.775560         13   0.059658

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   505.254       3.386           149       ≤0.0001
Density    −460.678      3.239           −142      ≤0.0001

iii. The \(r^2\) value of 0.999 seems to confirm that this model is a good fit.

iv. The residual plot uncovers some problems; perhaps we could do better!

[Residual plot: Residual versus Density]

b. i. As the percentage of fat increases, the body density decreases. Perhaps the positive association between the reciprocal of density and the percentage of fat would be easier to model. The pertinent plots and the regression analysis are shown below.

[Scatterplot and residual plot: Body Fat (%) versus Reciprocal of Density]

Dependent variable is: % Fat
No Selector
R squared = 100.0%   R squared (adjusted) = 100.0%
s = 0.1690 with 15 − 2 = 13 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   1206.88           1   1206.88       42246
Residual     0.371389         13   0.028568

Variable    Coefficient   s.e. of Coeff   t-ratio   prob
Constant    −450.632      2.309           −195      ≤0.0001
1/Density   495.654       –0.412          206       ≤0.0001

The residuals show less pattern; the plot is more like one of random scatter, suggesting that this is a better model. The regression line has the equation

\[ \%\ \text{body fat} = -450.63 + 495.65 \cdot \frac{1}{\text{density}} \]

which is very close to the Siri equation.

ii. The correlation is close to 1 for both models, but the second proves to be a better fitting model than the first. Moral: Never use correlation as the only criterion for choosing a model.

iii. Percent body fat as a function of log(density) works almost as well as Siri’s model. The residual plot, however, has a hint of a pattern.

AP10. a. For women, the regression equation was

\[ \%\ \text{body fat} = -450.63 + 495.65 \cdot \frac{1}{\text{density}}, \]

almost identical to Siri’s model. (See AP9.) The relationship is very strong and linear, with correlation almost equal to 1 and no pattern in the residual plot. Using the same variables for men, the relationship is again extraordinarily linear (see the next scatterplot), with a correlation near 1. But this time the regression equation is

\[ \%\ \text{body fat} = -453.7 + 498.97 \cdot \frac{1}{\text{density}}, \]

which is similar to Siri’s model but not nearly as close as the equation for women. Thus, the model fits better for women than for men.

[Scatterplot: Body Fat (%) versus Reciprocal of Density]

b. For women, this scatterplot of density against skinfold is quite linear and does not require re-expression. Thus, a reasonable model is the regression equation \(\text{density} = 1.084 - 0.000311 \cdot \text{skinfold}\). The correlation is −0.897.

[Scatterplot: Density versus Skinfold (mm)]

For men, the relationship is less strong and has some curvature (see the scatterplot); however, the linear model is an adequate one. The equation is \(\text{density} = 1.105 - 0.000295 \cdot \text{skinfold}\). The residual plot shows some heteroscedasticity.

[Scatterplot and residual plot: Density versus Skinfold (mm)]
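The two fitted models for women in AP9 and AP10 can be chained to see how a skinfold measurement translates into an estimate of percent body fat. The sketch below is illustrative only. It assumes the skinfold line has a negative slope (density = 1.084 − 0.000311 skinfold) and that the reciprocal model has a negative intercept (% body fat = −450.63 + 495.65/density). The Siri constants (495/density − 450) come from the standard published form of that equation, not from this answer key, and the function names and the 100 mm skinfold value are hypothetical.

```python
def density_from_skinfold(skinfold_mm: float) -> float:
    """Women's fitted line from AP10 part b (negative slope assumed)."""
    return 1.084 - 0.000311 * skinfold_mm


def fat_from_density_fitted(density: float) -> float:
    """Fitted reciprocal model from AP9 part b: -450.63 + 495.65 / density."""
    return -450.63 + 495.65 / density


def fat_from_density_siri(density: float) -> float:
    """Siri equation in its standard form (an outside assumption): 495 / density - 450."""
    return 495.0 / density - 450.0


skinfold = 100.0  # hypothetical measurement in mm, inside the plotted range
density = density_from_skinfold(skinfold)
fitted_fat = fat_from_density_fitted(density)
siri_fat = fat_from_density_siri(density)

print(f"density: {density:.4f}")
print(f"fitted model: {fitted_fat:.2f}% body fat")
print(f"Siri model:   {siri_fat:.2f}% body fat")
```

For densities in the plotted range, the two predictions differ by only a small fraction of a percentage point, which is consistent with the remark that the fitted line is very close to the Siri equation. Chaining two regressions compounds their errors, so an estimate from skinfold is less reliable than one from a measured density.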