Chapter 4: Describing Bivariate Numerical Data
Created by Kathy Fritz

Topics: Correlation; Pearson's Sample Correlation Coefficient; Properties of r

Interpreting Scatterplots
To interpret a scatterplot, follow the basic strategy of data analysis from Chapters 1 and 2: look for patterns and for important departures from those patterns.

How to Examine a Scatterplot
As in any graph of data, look for the overall pattern and for striking departures from that pattern.
• You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
• An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.

Definition: Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together. Two variables have a negative association when above-average values of one tend to accompany below-average values of the other.

Consider the SAT example. Interpret the scatterplot (direction, form, strength): There is a moderately strong, negative, curved relationship between the percent of students in a state who take the SAT and the mean SAT math score. Further, there are two distinct clusters of states and two possible outliers that fall outside the overall pattern.

When the points in a scatterplot tend to cluster tightly around a line, the relationship is described as strong. Try to order scatterplots A–D (not shown; constructed using data from graphs in Archives of General Psychiatry, June 2010) from strongest relationship to weakest.
Pearson's Sample Correlation Coefficient
• Usually referred to as just the correlation coefficient
• Denoted by r
• Measures the strength and direction of a linear relationship between two numerical variables

Measuring Linear Association: Correlation
A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables. Linear relationships are important because a straight line is a simple pattern that is quite common. Unfortunately, our eyes are not good judges of how strong a linear relationship is.

Definition: The correlation r measures the strength of the linear relationship between two quantitative variables.
• r is always a number between -1 and 1
• r > 0 indicates a positive association.
• r < 0 indicates a negative association.
• Values of r near 0 indicate a very weak linear relationship.
• The strength of the linear relationship increases as r moves away from 0 toward -1 or 1.
• The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship.

Properties of r
1. The sign of r matches the direction of the linear relationship: r is positive for a positive association and negative for a negative association.
2. The value of r is always greater than or equal to -1 and less than or equal to +1. As a rough guide, |r| between 0.8 and 1 indicates a strong correlation, between 0.5 and 0.8 a moderate correlation, and between 0 and 0.5 a weak correlation.
3. r = 1 only when all the points in the scatterplot fall on a straight line that slopes upward. Similarly, r = -1 only when all the points fall on a downward-sloping line.
4. r is a measure of the extent to which x and y are linearly related. Compute the correlation coefficient for these points, then sketch the scatterplot:

x:  2   4   6   8  10  12  14
y: 40  20   8   4   8  20  40

Here r = 0, even though the scatterplot shows a strong (but curved) pattern.
5. The value of r does not depend on the unit of measurement for either variable.
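Properties 4 and 5 can be checked numerically. The sketch below (plain Python, standard library only) computes r for the seven V-shaped points above, then verifies that rescaling a variable's units leaves r unchanged, using the first seven mare/foal pairs from the weight table that follows; the helper name `pearson_r` is ours, not a library function.

```python
import math

def pearson_r(x, y):
    """Pearson's r: average product of z-scores, dividing by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
    sy = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    return sum(((xi - x_bar) / sx) * ((yi - y_bar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# Property 4: a perfect V-shape is a strong relationship,
# but it is not linear, so r comes out to (essentially) 0.
x = [2, 4, 6, 8, 10, 12, 14]
y = [40, 20, 8, 4, 8, 20, 40]
r_v = pearson_r(x, y)
print(r_v)  # essentially 0

# Property 5: converting mare weights from kg to lb
# (multiply every value by 2.2) does not change r.
mare_kg = [556, 638, 588, 550, 580, 642, 568]
foal_kg = [129.0, 119.0, 132.0, 123.5, 112.0, 113.5, 95.0]
mare_lb = [2.2 * w for w in mare_kg]
print(pearson_r(mare_kg, foal_kg), pearson_r(mare_lb, foal_kg))
```

Because r is built entirely from z-scores, any linear change of units cancels out of the calculation, which is exactly what property 5 asserts.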
Example (property 5): weights of mares and their foals, with mare weight recorded in kilograms and again in pounds. The correlation is the same either way.

Mare (kg)  Foal (kg)  Mare (lb)  Foal (kg)
  556        129.0      1223.2     129.0
  638        119.0      1403.6     119.0
  588        132.0      1293.6     132.0
  550        123.5      1210.0     123.5
  580        112.0      1276.0     112.0
  642        113.5      1412.4     113.5
  568         95.0      1249.6      95.0
  642        104.0      1412.4     104.0
  556        104.0      1223.2     104.0
  616         93.5      1355.2      93.5
  549        108.5      1207.8     108.5
  504         95.0      1108.8      95.0
  515        117.5      1111.0     117.5
  551        128.0      1212.2     127.5
  594        127.5      1306.8     127.5

Calculating the Correlation Coefficient
Suppose that we have data on variables x and y for n individuals: (x1, y1), (x2, y2), ..., (xn, yn). The means and standard deviations of the two variables are x̄ and sx for the x-values and ȳ and sy for the y-values, where

x̄ = (1/n) Σ xi   and   ȳ = (1/n) Σ yi

The correlation r between x and y is

r = (1/(n-1)) Σ [(xi - x̄)/sx][(yi - ȳ)/sy]

that is, the average product of the z-scores for x and y. The formula is a bit complex. It helps us to see what correlation is, but in practice you should use your calculator or software to find r.

The web site www.collegeresults.org (The Education Trust) publishes data on U.S. colleges and universities. The following six-year graduation rates and student-related expenditures per full-time student for 2007 were reported for the seven primarily undergraduate public universities in California with enrollments between 10,000 and 20,000.

Expenditures:     8810  7780  8112  8149  8477  7342  7984
Graduation rates: 66.1  52.4  48.9  48.1  42.0  38.3  31.3

Here is the scatterplot. Does the relationship appear linear? Explain.

College expenditures continued: to compute the correlation coefficient, first find the z-scores.
x     y     zx     zy     zx·zy
8810  66.1   1.52   1.74   2.64
7780  52.4  -0.66   0.51  -0.34
8112  48.9   0.04   0.19   0.01
8149  48.1   0.12   0.12   0.01
8477  42.0   0.81  -0.42  -0.34
7342  38.3  -1.59  -0.76   1.21
7984  31.3  -0.23  -1.38   0.32

How the Correlation Coefficient Measures the Strength of a Linear Relationship
For a point in the upper left region of a scatterplot (x below its mean, y above its mean), zx is negative and zy is positive, so zx·zy is negative. In the lower left region, both z-scores are negative, so zx·zy is positive; in the upper right region, both are positive, so zx·zy is again positive. For a positive linear relationship, most points fall in the lower left and upper right regions, so the sum of zx·zy will be positive. For a negative relationship, the sum will be negative. And when there is no linear relationship, the positive and negative products roughly cancel, so the sum will be near zero.

Facts about Correlation
How correlation behaves is more important than the details of the formula. Here are some important facts about r.
1. Correlation makes no distinction between explanatory and response variables.
2. r does not change when we change the units of measurement of x, y, or both.
3. The correlation r itself has no unit of measurement.

Cautions:
• Correlation requires that both variables be quantitative.
• Correlation does not describe curved relationships between variables, no matter how strong the relationship is.
• Correlation is not resistant: r is strongly affected by a few outlying observations.
• Correlation is not a complete summary of two-variable data.

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable? No. Association or correlation does NOT imply causation.
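Summing the zx·zy column and dividing by n − 1 completes the college-expenditure calculation begun above. A sketch from the raw data (working from the raw values avoids the small rounding in the table entries):

```python
import math

expenditures = [8810, 7780, 8112, 8149, 8477, 7342, 7984]
grad_rates = [66.1, 52.4, 48.9, 48.1, 42.0, 38.3, 31.3]
n = len(expenditures)

x_bar = sum(expenditures) / n
y_bar = sum(grad_rates) / n
sx = math.sqrt(sum((x - x_bar) ** 2 for x in expenditures) / (n - 1))
sy = math.sqrt(sum((y - y_bar) ** 2 for y in grad_rates) / (n - 1))

# z-score each observation, multiply pairwise, sum, divide by n - 1
zx = [(x - x_bar) / sx for x in expenditures]
zy = [(y - y_bar) / sy for y in grad_rates]
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
print(round(r, 2))  # a moderate positive correlation, roughly 0.58
```

The result agrees with the table: the zx·zy column sums to about 3.51, and 3.51/6 ≈ 0.59.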
Consider the following examples:
• The relationship between the number of cavities in a child's teeth and the size of his or her vocabulary is strong and positive. These variables are both strongly related to the age of the child. So does this mean I should feed children more candy to increase their vocabulary?
• Consumption of hot chocolate is negatively correlated with crime rate. Both are responses to cold weather. Should we all drink hot chocolate to lower the crime rate?

Causality can only be shown by carefully controlling the values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.

Linear Regression: Least Squares Regression Line
Suppose there is a relationship between two numerical variables. Let x be the amount spent on advertising and y be the amount of sales for the product during a given period. You might want to predict product sales (y) for a month when the amount spent on advertising is $10,000 (x).

The equation of a line is y = a + bx, where:
b – the slope of the line – is the amount by which y increases when x increases by 1 unit
a – the intercept (also called the y-intercept or vertical intercept) – is the height of the line above x = 0; in some contexts it is not reasonable to interpret the intercept

Regression Line
Linear (straight-line) relationships between two quantitative variables are common and easy to understand. A regression line summarizes the relationship between two variables, but only in settings where one of the variables helps explain or predict the other.

Definition: A regression line is a line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Shown is a scatterplot of the change in nonexercise activity (cal) and measured fat gain (kg) after 8 weeks for 16 healthy young adults.
The plot shows a moderately strong, negative, linear association between NEA change and fat gain with no outliers. The regression line predicts fat gain from change in NEA. When nonexercise activity = 800 cal, our line predicts a fat gain of about 0.8 kg after 8 weeks.

Interpreting a Regression Line
A regression line is a model for the data, much like density curves. The equation of a regression line gives a compact mathematical description of what this model tells us about the relationship between the response variable y and the explanatory variable x.

Definition: Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form

ŷ = a + bx

In this equation:
• ŷ (read "y hat") is the predicted value of the response variable y for a given value of the explanatory variable x.
• b is the slope, the amount by which y is predicted to change when x increases by one unit.
• a is the y-intercept, the predicted value of y when x = 0.

The Deterministic Model
Notice that the y-value is determined by substituting the x-value into the equation of the line, and that the points fall exactly on the line.

Interpreting a Regression Line
Consider the regression line from the example "Does Fidgeting Keep You Slim?":

fatgain = 3.505 - 0.00344(NEA change)

Identify the slope and y-intercept and interpret each value in context. The slope b = -0.00344 tells us that the amount of fat gained is predicted to go down by 0.00344 kg for each added calorie of NEA. The y-intercept a = 3.505 kg is the fat gain estimated by this model if NEA does not change when a person overeats.

Least-Squares Regression: Prediction
We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x.
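Prediction is just substitution of an x value into the fitted equation. A minimal sketch using the fat-gain line above (the function name `predict_fat_gain` is ours, for illustration):

```python
def predict_fat_gain(nea_change):
    """Predicted fat gain (kg) from the fitted line
    fatgain-hat = 3.505 - 0.00344 * (NEA change)."""
    return 3.505 - 0.00344 * nea_change

# For a person whose NEA increases by 400 cal when she overeats:
print(predict_fat_gain(400))  # about 2.13 kg of fat gained
```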
Use the NEA and fat gain regression line to predict the fat gain for a person whose NEA increases by 400 cal when she overeats.

Extrapolation
We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. The accuracy of the prediction depends on how much the data scatter about the line. While we can substitute any value of x into the equation of the regression line, we must exercise caution in making predictions outside the observed values of x.

Definition: Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate. Don't make predictions using values of x that are much larger or much smaller than those that actually appear in your data.

How do you find an appropriate line for describing a bivariate data set? Consider, for example, the line y = 10 + 2x drawn through a scatterplot.

Residuals
In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible.

Definition: A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,

residual = observed y - predicted y = y - ŷ

Points above the line have positive residuals; points below the line have negative residuals.

Least Squares Regression Line
Different regression lines produce different residuals. The least squares regression line is the line that minimizes the sum of squared deviations (residuals). Its equation is ŷ = a + bx, with y-intercept a = ȳ - bx̄.

We can use technology to find the equation of the least-squares regression line. We can also write it in terms of the means and standard deviations of the two variables and their correlation.
Definition: Equation of the least-squares regression line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means and standard deviations of the two variables and their correlation. The least squares regression line is the line ŷ = a + bx with slope

b = r(sy/sx)

and y-intercept

a = ȳ - bx̄

Let's investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0,0), (3,10), and (6,2). The least squares line for these three points is

ŷ = 3 + (1/3)x

Pomegranate, a fruit native to Persia, has been used in the folk medicines of many cultures to treat various ailments. Researchers are now investigating whether pomegranate's antioxidant properties are useful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded at several points in time. Let x = number of days after injection of cancer cells for the mice assigned to plain water and y = average tumor volume (in mm³).

x:  11   15   19   23   27
y: 150  270  450  580  740

Sketch a scatterplot for this data set. Then predict the average volume of the tumor 20 days after injection, and 5 days after injection.

Assessing the Fit of a Line
Topics: Residuals; Residual Plots; Outliers and Influential Points; Coefficient of Determination; Standard Deviation about the Line

Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions?
3. If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect predictions to be?
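For the pomegranate data, the least squares line can be computed directly from the sums of squared deviations. The sketch below also makes the two requested predictions; the day-5 result (a negative tumor volume) shows why extrapolating below the observed range of x fails:

```python
x = [11, 15, 19, 23, 27]        # days after injection
y = [150, 270, 450, 580, 740]   # average tumor volume (mm^3)
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
# slope: sum of cross-deviations over sum of squared x-deviations
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar           # intercept: a = y-bar - b * x-bar
print(f"y-hat = {a} + {b}x")    # y-hat = -269.75 + 37.25x

print(a + b * 20)  # day 20: 475.25 mm^3 (inside the observed x range)
print(a + b * 5)   # day 5: -83.5 mm^3 -- impossible; extrapolation
```

The day-20 prediction interpolates within the observed days (11 to 27) and is reasonable; day 5 lies outside that interval, and the line produces a nonsensical negative volume.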
Residuals
Recall, the vertical deviations of the points from the least squares regression line are called residuals:

residual = y - ŷ

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Distance from debris (x)  Distance traveled (y)
        6.94                     0.00
        5.23                     6.13
        5.21                    11.29
        7.10                    14.35
        8.16                    12.03
        5.50                    22.72
        9.19                    20.11
        9.05                    26.16
        9.36                    30.65

Residual Plots
One of the first principles of data analysis is to look for an overall pattern and for striking departures from the pattern. A regression line describes the overall pattern of a linear relationship between two variables. We see departures from this pattern by looking at the residuals.

Definition: A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can also be graphed against the predicted y-values.
• Isolated points or a pattern of points in the residual plot indicate potential problems.

Interpreting Residual Plots
A residual plot magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns.
1) The residual plot should show no obvious patterns.
2) The residuals should be relatively small in size.
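The deer mouse residuals can be computed with a short sketch; the fitted slope and intercept agree with the software output shown later (about 3.234 and −7.69) up to rounding:

```python
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
print(round(a, 2), round(b, 3))  # close to -7.69 and 3.234

# residual = observed y - predicted y
predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
print([round(r, 2) for r in residuals])

# sum of squared residuals, used later for r^2 and s_e
sse = sum(r ** 2 for r in residuals)
print(round(sse, 2))  # about 526
```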
A pattern in the residual plot indicates that a linear model is not appropriate.

Deer mice continued:

x (debris)  y (traveled)  predicted ŷ  residual
   6.94         0.00         14.76      -14.76
   5.23         6.13          9.23       -3.10
   5.21        11.29          9.16        2.13
   7.10        14.35         15.28       -0.93
   8.16        12.03         18.70       -6.67
   5.50        22.72         10.10       12.62
   9.19        20.11         22.04       -1.93
   9.05        26.16         21.58        4.58
   9.36        30.65         22.59        8.06

Plot the residuals against the distance from debris (x). [Residual plots, not shown: residuals vs. distance from debris, and residuals vs. predicted distance traveled.]

Residual plots continued
Let's examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts).

x:  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
y: 113 115 118 121 124 128 131 134 137 141 145 150 153 159 164

Let's also examine the data set for 12 black bears from the Boreal Forest, with x = age (in years) and y = weight (in kg). Sketch a scatterplot with the fitted regression line.

x: 10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y:   54   40    62    51   55   56   62   50   42    40   59   51

Coefficient of Determination
• The coefficient of determination is the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
• Denoted by r².
• The value of r² is often converted to a percentage.

The Role of r² in Regression
The standard deviation of the residuals gives us a numerical estimate of the average size of our prediction errors. There is another numerical quantity that tells us how well the least-squares regression line predicts values of the response y.
Definition: The coefficient of determination r² is the fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. We can calculate r² using the following formula:

r² = 1 - SSE/SST

where SSE = Σ(residual)² = Σ(yi - ŷi)² is the sum of squared residuals and SST = Σ(yi - ȳ)² is the total sum of squares.

Let's explore the meaning of r² by revisiting the deer mouse data set, where x = the distance from the food to the nearest pile of fine woody debris and y = the distance a deer mouse will travel for food.

x: 6.94  5.23  5.21  7.10  8.16  5.50  9.19  9.05  9.36
y: 0     6.13  11.29 14.35 12.03 22.72 20.11 26.16 30.65

Suppose you didn't know any x-values. What distance would you expect deer mice to travel? Now let's find how much variation there is in the distance traveled (y) from the least squares regression line.

The total amount of variation in the distance traveled (y) is SSTo = 773.95 m². The amount of variation in the y values from the regression line is SSResid = 526.27 m². Approximately what percent of the variation in distance traveled (y) can be explained by the linear relationship? r² = 1 - 526.27/773.95 ≈ 0.32, or 32%.

Standard Deviation about the Least Squares Regression Line
The standard deviation about the least squares regression line is

se = √(SSResid/(n - 2))

The value of se can be interpreted as the typical amount by which an observation deviates from the least squares regression line.
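Both summary measures follow directly from the two sums of squares quoted above; a sketch for the deer mouse data:

```python
import math

ss_total = 773.95   # SSTo: total variation in y (m^2)
ss_resid = 526.27   # SSResid: variation remaining about the line (m^2)
n = 9               # number of deer mice in the study

# Coefficient of determination: fraction of the variation in y
# accounted for by the least squares regression line
r_sq = 1 - ss_resid / ss_total
print(round(r_sq, 2))  # 0.32, i.e., about 32%

# Standard deviation about the least squares regression line
s_e = math.sqrt(ss_resid / (n - 2))
print(round(s_e, 2))   # about 8.67 m, matching S in the software output
```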
Standard Deviation about the Least Squares Regression Line
Definition: If we use a least-squares regression line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by

s = √(Σ residuals²/(n - 2)) = √(Σ(yi - ŷi)²/(n - 2))

Partial output from the regression analysis of the deer mouse data:

Predictor           Coef   SE Coef      T      P
Constant           -7.69     13.33  -0.58  0.582
Distance to debris  3.234     1.782   1.82  0.112

S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

Analysis of Variance
Source       DF      SS      MS     F      P
Regression    1  247.68  247.68  3.29  0.112
Resid Error   7  526.27   75.18
Total         8  773.95

Interpreting the Values of se and r²
A small value of se indicates that residuals tend to be small. This value tells you how much accuracy you can expect when using the least squares regression line to make predictions. A large value of r² indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y. This tells you that knowing the value of x is helpful for predicting y. A useful regression line will have a reasonably small value of se and a reasonably large value of r².

A study (Archives of General Psychiatry [2010]: 570-577) looked at how working memory capacity was related to scores on a test of cognitive functioning and to scores on an IQ test. Two groups were studied – one group consisted of patients diagnosed with schizophrenia and the other group consisted of healthy control subjects.

Putting It All Together: Describing Linear Relationships and Making Predictions

Steps in a Linear Regression Analysis
1. Summarize the data graphically by constructing a scatterplot.
2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed to the next step.
3. Find the equation of the least squares regression line.
4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.
5. Compute the values of se and r² and interpret them in context.
6. Based on what you have learned from the residual plot and the values of se and r², decide whether the least squares regression line is useful for making predictions. If so, proceed to the last step.
7. Use the least squares regression line to make predictions.

Revisit the crime scene DNA data. Recall the scientists were interested in predicting the age of a crime scene victim (y) using the blood test measure (x).

Step 4: A residual plot constructed from these data showed a few observations with large residuals, but these observations were not far removed from the rest of the data in the x direction. The observations were not judged to be influential. Also, there were no unusual patterns in the residual plot that would suggest a nonlinear relationship between age and the blood test measure.

Step 5: se = 8.9 and r² = 0.835. Approximately 83.5% of the variability in age can be explained by the linear relationship. A typical difference between the predicted age and the actual age would be about 9 years.

Step 6: Based on the residual plot, the large value of r², and the relatively small value of se, the scientists proposed using the blood test measure and the least squares regression line as a way to estimate the ages of crime victims.

Step 7: To illustrate predicting age, suppose that a blood sample is taken from an unidentified crime victim and that the value of the blood test measure is determined to be -10. The predicted age of the victim would be

ŷ = -33.65 - 6.74(-10) = 33.75 years

Common Mistakes
Avoid These Common Mistakes
1. Correlation does not imply causation.
A strong correlation implies only that the two variables tend to vary together in a predictable way, but there are many possible explanations for why this is occurring other than one variable causing change in the other.

2. A correlation coefficient near 0 does not necessarily imply that there is no relationship between two variables. Although the variables may be unrelated, it is also possible that there is a strong but nonlinear relationship. Be sure to look at a scatterplot!

3. The least squares regression line for predicting y from x is NOT the same line as the least squares regression line for predicting x from y. The ages (x, in months) and heights (y, in inches) of seven children are given:

x: 16  24  42  60  75  102  120
y: 24  30  35  40  48   56   60

To predict height from age: height = 20.40 + 0.34(age)
To predict age from height: age = -58.20 + 2.89(height)

4. Beware of extrapolation. Using the least squares regression line to make predictions outside the range of x values in the data set often leads to poor predictions. For example, predict the height of a child that is 15 years (180 months) old:

height = 20.4 + 0.34(180) = 81.6 inches

It is unreasonable that a 15-year-old would be 81.6 inches, or 6.8 feet, tall.

5. Be careful in interpreting the value of the intercept of the least squares regression line. In many instances, interpreting the intercept as the value of y that would be predicted when x = 0 is equivalent to extrapolating way beyond the range of x values in the data set. Consider the children's age/height data above: no child in the data set has an age anywhere near 0 months.

6. Remember that the least squares regression line may be the "best" line, but that doesn't necessarily mean that the line will produce good predictions.
In the working memory study, the fit has a relatively large se – thus we can't accurately predict IQ from working memory capacity.

7. It is not enough to look at just r² or just se when evaluating the regression line. Remember to consider both values. In general, you would like to have both a small value for se and a large value for r².

8. The value of the correlation coefficient, as well as the values of the intercept and slope of the least squares regression line, can be sensitive to influential observations in the data set, particularly if the sample size is small. Be sure to always start with a plot to check for potential influential observations.

Correlation and Regression Wisdom
Correlation and regression are powerful tools for describing the relationship between two variables. When you use these tools, be aware of their limitations.

1. The distinction between explanatory and response variables is important in regression.

Definition: An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals. An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.

2. Correlation and regression lines describe only linear relationships.

3. Correlation and least-squares regression lines are not resistant.

4. Association does not imply causation. An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
A serious study once found that people with two cars live longer than people who own only one car. Owning three cars is even better, and so on. There is a substantial positive correlation between number of cars x and length of life y. Why? Both variables are linked to a lurking variable, wealth: affluent people tend to own more cars and also tend to live longer. Association does not imply causation.
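Common mistake 3 can be verified numerically with the children's age/height data from earlier; note in particular that the x-from-y line is not simply the algebraic inverse of the y-from-x line (the helper name `least_squares` is ours):

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b

age = [16, 24, 42, 60, 75, 102, 120]    # months
height = [24, 30, 35, 40, 48, 56, 60]   # inches

a1, b1 = least_squares(age, height)  # predict height from age
a2, b2 = least_squares(height, age)  # predict age from height
print(round(a1, 2), round(b1, 2))    # close to 20.40 and 0.34
print(round(a2, 2), round(b2, 2))    # close to -58.20 and 2.89

# If the two lines were inverses of each other, b2 would equal 1/b1;
# it does not, because each line minimizes a different set of
# squared deviations (vertical for y-on-x, horizontal for x-on-y).
print(round(1 / b1, 2), "vs", round(b2, 2))
```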