Chapter 4: Describing Bivariate Numerical Data
Created by Kathy Fritz

Forensic scientists must often estimate the age of an unidentified crime victim. Prior to 2010, this was usually done by analyzing teeth and bones, and the resulting estimates were not very reliable. A study described in the paper "Estimating Human Age from T-Cell DNA Rearrangements" (Current Biology [2010]) examined the relationship between age and a measure based on a blood test. Age and the blood test measure were recorded for 195 people ranging in age from a few weeks to 80 years.

[Scatterplot: age versus the blood test measure for the 195 people. A line drawn through the plot can be used to estimate the age of a crime victim from a blood test.]

Do you think there is a relationship? If so, what kind? If not, why not?

Correlation
 • Pearson's Sample Correlation Coefficient
 • Properties of r

For each of several scatterplots, ask two questions: Does it look like there is a relationship between the two variables? If so, is the relationship linear?
 • Plot 1: Yes, there is a relationship, and it looks linear.
 • Plot 2: Yes, there is a relationship, and it looks linear.
 • Plot 3: Yes, there is a relationship, but it is not linear; it looks curved.
 • Plot 4: Yes, there is a relationship, but it is not linear; it looks parabolic.
 • Plot 5: No, there does not appear to be a relationship.

Linear relationships can be either positive or negative in direction. Are these linear relationships positive or negative? (The first is negative; the second is positive.)

When the points in a scatterplot tend to cluster tightly around a line, the relationship is described as strong. Try to order scatterplots A, B, C, and D from strongest relationship to weakest. (Answer: A, C, B, D.) These four scatterplots were constructed using data from graphs in Archives of General Psychiatry (June 2010).

Pearson's Sample Correlation Coefficient
 • Usually referred to as just the correlation coefficient.
 • Denoted by r.
 • Measures the strength and direction of a linear relationship between two numerical variables. (An important definition!)
 • The strongest possible values of the correlation coefficient are r = +1 and r = -1; the weakest possible value is r = 0.

Properties of r

1. The sign of r matches the direction of the linear relationship: r is positive when the relationship is positive and negative when the relationship is negative.

2. The value of r is always greater than or equal to -1 and less than or equal to +1. As a rough guide, values of |r| between 0.8 and 1 indicate a strong correlation, values between 0.5 and 0.8 a moderate correlation, and values between 0 and 0.5 a weak correlation.

3. r = 1 only when all the points in the scatterplot fall on a straight line that slopes upward. Similarly, r = -1 only when all the points fall on a downward-sloping line.

4. r is a measure of the extent to which x and y are linearly related. Compute the correlation coefficient for these points:

   x    2    4    6    8   10   12   14
   y   40   20    8    4    8   20   40

   Here r = 0. Does this mean that there is no relationship between these points? Sketch the scatterplot: the points trace out a definite U-shaped pattern. So r = 0, but the data set has a definite (nonlinear) relationship.

5. The value of r does not depend on the unit of measurement for either variable. Calculate r for the data set of mares' weights and the weights of their foals:

   Mare weight (kg):   556   638   588   550   580   642   568   642   556   616   549   504   515   551   594
   Foal weight (kg): 129.0 119.0 132.0 123.5 112.0 113.5  95.0 104.0 104.0  93.5 108.5  95.0 117.5 128.0 127.5

   Now change the mare weights to pounds by multiplying each kilogram value by 2.2 and recalculate r. The correlation is unchanged: r = -0.00359 in both cases.
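The two numerical claims above (r = 0 for the U-shaped data, and r unchanged when mare weights are converted from kilograms to pounds) are easy to check by computer. The short Python sketch below is not part of the original slides; it simply recomputes the correlations, assuming NumPy is available and using the mare/foal pairs as read from the table above.

import numpy as np

# Property 4: a perfect curved (U-shaped) relationship can still have r = 0.
x = np.array([2, 4, 6, 8, 10, 12, 14])
y = np.array([40, 20, 8, 4, 8, 20, 40])
print(np.corrcoef(x, y)[0, 1])   # prints 0.0 -- no *linear* association

# Property 5: r does not depend on the units of measurement.
mare_kg = np.array([556, 638, 588, 550, 580, 642, 568, 642,
                    556, 616, 549, 504, 515, 551, 594])
foal_kg = np.array([129.0, 119.0, 132.0, 123.5, 112.0, 113.5, 95.0, 104.0,
                    104.0, 93.5, 108.5, 95.0, 117.5, 128.0, 127.5])
mare_lb = 2.2 * mare_kg          # convert mare weights to pounds

# The two correlations are identical (the slides report r = -0.00359).
print(np.corrcoef(mare_kg, foal_kg)[0, 1])
print(np.corrcoef(mare_lb, foal_kg)[0, 1])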
Calculating the Correlation Coefficient

The correlation coefficient is calculated using the following formula:

   r = Σ(zx · zy) / (n − 1)

where zx = (x − x̄)/sx is the z-score for an x value and zy = (y − ȳ)/sy is the z-score for the corresponding y value.

The web site www.collegeresults.org (The Education Trust) publishes data on U.S. colleges and universities. The following six-year graduation rates and student-related expenditures per full-time student for 2007 were reported for the seven primarily undergraduate public universities in California with enrollments between 10,000 and 20,000.

   Expenditures       8810   7780   8112   8149   8477   7342   7984
   Graduation rate    66.1   52.4   48.9   48.1   42.0   38.3   31.3

Here is the scatterplot. Does the relationship appear linear? Explain.

College expenditures continued: To compute the correlation coefficient, first find the z-scores.

      x       y      zx      zy    zx·zy
    8810    66.1    1.52    1.74    2.64
    7780    52.4   -0.66    0.51   -0.34
    8112    48.9    0.04    0.19    0.01
    8149    48.1    0.12    0.12    0.01
    8477    42.0    0.81   -0.42   -0.34
    7342    38.3   -1.59   -0.76    1.21
    7984    31.3   -0.23   -1.38    0.32

To interpret the correlation coefficient, use the definition: summing the zx·zy column and dividing by n − 1 gives r ≈ 3.51/6 ≈ 0.58. There is a positive, moderate linear relationship between six-year graduation rates and student-related expenditures.

How the Correlation Coefficient Measures the Strength of a Linear Relationship

Think about the signs of the z-scores:
 • When zx is negative and zy is positive, zx·zy is negative.
 • When zx is negative and zy is negative, zx·zy is positive.
 • When zx is positive and zy is positive, zx·zy is positive.
For a scatterplot with a positive linear pattern, most points have zx and zy with the same sign, so most of the products zx·zy are positive and their sum is positive. For a negative pattern, most products are negative and the sum is negative. For a patternless scatterplot, the positive and negative products tend to cancel, and the sum is near zero.

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable? No. Association does NOT imply causation. Consider the following examples:
 • The relationship between the number of cavities in a child's teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? No: these variables are both strongly related to the age of the child.
 • Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink more hot chocolate to lower the crime rate? No: both are responses to cold weather.
Causality can only be shown by carefully controlling the values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
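To make the z-score calculation above concrete, here is a short Python sketch (not part of the original slides) that mirrors the formula r = Σ(zx·zy)/(n − 1) for the seven-university data.

# Pearson's r computed directly from the definition r = sum(zx * zy) / (n - 1),
# using the expenditure and six-year graduation rate data listed above.
expenditures = [8810, 7780, 8112, 8149, 8477, 7342, 7984]          # x
grad_rates   = [66.1, 52.4, 48.9, 48.1, 42.0, 38.3, 31.3]          # y

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

def correlation(x, y):
    n = len(x)
    zx = [(v - mean(x)) / sample_sd(x) for v in x]
    zy = [(v - mean(y)) / sample_sd(y) for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

print(round(correlation(expenditures, grad_rates), 2))   # about 0.58, a positive, moderate correlation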
Linear Regression
 • Least Squares Regression Line

Suppose there is a relationship between two numerical variables. Let x be the amount spent on advertising and y be the amount of sales for the product during a given period. You might want to predict product sales (y) for a month when the amount spent on advertising is $10,000 (x).

The equation of a line is y = a + bx, where
 • b is the slope of the line: the amount by which y increases when x increases by 1 unit, and
 • a is the intercept (also called the y-intercept or vertical intercept): the height of the line above x = 0. In some contexts, it is not reasonable to interpret the intercept.

The Deterministic Model

We often say x determines y. Notice that the y-value is determined by substituting the x-value into the equation of the line, and that all of the points fall exactly on the line. But when we fit a line to data, do all the points fall on the line?

How do you find an appropriate line for describing a bivariate data set? To assess the fit of a line, we look at how the points deviate vertically from the line. For example, for the line y = 10 + 2x, the point (15, 44) has a deviation of +4, because the observed y of 44 is 4 above the predicted value 10 + 2(15) = 40. What is the meaning of a negative deviation? (The point lies below the line.)

To assess the fit of a line, we need a way to combine the n deviations into a single measure of fit. The most widely used measure of the fit of a line y = a + bx to bivariate data is the sum of the squared deviations about the line.

Least squares regression line: the line that minimizes the sum of squared deviations.

Let's investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0, 0), (3, 10), and (6, 2).
 • Use a calculator to find the least squares regression line: ŷ = 3 + (1/3)x.
 • Find the vertical deviations from the line: they are -3, +6, and -3 for (0, 0), (3, 10), and (6, 2), respectively. The sum of the deviations is 0. Hmmmm . . . will the sum always be zero? Why does this seem so familiar?
 • Find the sum of the squares of the deviations from the line: (-3)² + 6² + (-3)² = 54.
The line that minimizes this sum of squared deviations is the least squares regression line.

Pomegranate, a fruit native to Persia, has been used in the folk medicines of many cultures to treat various ailments. Researchers are now investigating whether pomegranate's antioxidant properties are useful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded at several points in time. For the mice assigned to plain water, let x = number of days after injection of cancer cells and y = average tumor volume (in mm³):

   x   11   15   19   23   27
   y  150  270  450  580  740

Sketch a scatterplot for this data set. [Scatterplot: average tumor volume versus number of days after injection.]

Computer software and graphing calculators can calculate the least squares regression line. Interpretation of the slope: the average volume of the tumor increases by approximately 37.25 mm³ for each one-day increase in the number of days after injection. Does the intercept have meaning in this context? Why or why not?

Pomegranate study continued
 • Predict the average volume of the tumor for 20 days after injection.
 • Predict the average volume of the tumor for 5 days after injection. Can volume be negative?
It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values. This is the danger of extrapolation: the least squares line should not be used to make predictions for y using x-values outside the range in the data set.
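As noted above, software or a calculator is normally used to find the least squares line. The Python sketch below (not part of the original slides) fits the line to the plain-water tumor data with NumPy's polyfit, reproduces the slope of about 37.25 mm³ per day, and shows both the sensible prediction at x = 20 days and the nonsensical negative prediction at x = 5 days that extrapolation produces.

import numpy as np

days   = np.array([11, 15, 19, 23, 27])          # x = days after injection
volume = np.array([150, 270, 450, 580, 740])     # y = average tumor volume (mm^3)

# np.polyfit returns [slope, intercept] for a degree-1 (least squares) fit.
slope, intercept = np.polyfit(days, volume, 1)
print(slope, intercept)        # slope about 37.25, intercept about -269.75

def predict(x):
    return intercept + slope * x

print(predict(20))   # about 475 mm^3 -- inside the range of the data
print(predict(5))    # negative volume -- the danger of extrapolation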
Why is the line used to summarize a linear relationship called the least squares regression line? Because, of all possible lines, it is the one with the smallest sum of squared deviations. There is also a close relationship between the least squares line and the correlation coefficient: if r = 1, all of the points fall exactly on the line. What would happen if r = 0.4? . . . 0.3? . . . 0.2? (The points scatter more and more loosely about the line.)

If you want to predict x from y, can you use the least squares line of y on x? No. The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.

Assessing the Fit of a Line
 • Residuals
 • Residual Plots
 • Outliers and Influential Points
 • Coefficient of Determination
 • Standard Deviation about the Line

Once the least squares regression line is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y. Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions?
3. If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect those predictions to be?
This section looks at graphical and numerical methods to answer these questions.

Residuals

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters. Calculate the predicted y values and the residuals; a residual is the difference between an observed y and the corresponding predicted value, residual = y − ŷ. If the point is above the line, the residual will be positive; if the point is below the line, the residual will be negative.

   Distance from debris (x)   Distance traveled (y)   Predicted distance   Residual
   6.94                        0.00                   14.76                -14.76
   5.23                        6.13                    9.23                 -3.10
   5.21                       11.29                    9.16                  2.13
   7.10                       14.35                   15.28                 -0.93
   8.16                       12.03                   18.70                 -6.67
   5.50                       22.72                   10.10                 12.62
   9.19                       20.11                   22.04                 -1.93
   9.05                       26.16                   21.58                  4.58
   9.36                       30.65                   22.59                  8.06

Residual plots
 • A residual plot is a scatterplot of the (x, residual) pairs.
 • Residuals can also be graphed against the predicted y-values.
 • Isolated points or a pattern of points in the residual plot indicate potential problems.
A careful look at the residuals can reveal many potential problems.

Deer mice continued: Plot the residuals against the distance from debris (x). Are there any isolated points? Is there a pattern in the points?

[Residual plots: residuals versus distance from debris, and residuals versus predicted distance traveled.]

The points in the residual plot appear scattered at random. This indicates that a line is a reasonable way to describe the relationship between the distance from debris and the distance traveled. Residual plots can be plotted against either the x-values or the predicted y-values.
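The predicted values and residuals in the table above can be reproduced from the fitted line reported later in this section (ŷ = -7.69 + 3.234x, from the regression output for these data). The Python sketch below is not part of the original slides; it simply recomputes each residual as y − ŷ. Small differences from the table are due to rounding of the reported coefficients.

# Residuals for the deer mouse data, using the fitted line reported in the
# regression output later in this section: y-hat = -7.69 + 3.234 x.
distance_to_debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]          # x
distance_traveled  = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]   # y

a, b = -7.69, 3.234   # intercept and slope

for x, y in zip(distance_to_debris, distance_traveled):
    predicted = a + b * x
    residual = y - predicted
    print(f"x = {x:5.2f}  y = {y:6.2f}  predicted = {predicted:6.2f}  residual = {residual:6.2f}")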
Residual plots continued: Let's examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts).

   x (height)   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72
   y (weight)  113  115  118  121  124  128  131  134  137  141  145  150  153  159  164

The scatterplot appears rather straight, but the residual plot displays a definite curved pattern. Even though r = 0.99, it is not accurate to say that weight increases linearly with height.

Let's examine the data set for 12 black bears from the Boreal Forest, with x = age (in years) and y = weight (in kg).

   x (age)     10.5   6.5  28.5  10.5   6.5   7.5   6.5   5.5   7.5  11.5   9.5   5.5
   y (weight)    54    40    62    51    55    56    62    50    42    40    59    51

Sketch a scatterplot with the fitted regression line. Do you notice anything unusual about this data set? One observation (the 28.5-year-old bear) has an x-value that differs greatly from the others in the data set. What would happen to the regression line if this point were removed? If a point affects the placement of the least squares regression line, then the point is considered an influential point.

[Scatterplot: weight versus age for the 12 bears, with the fitted regression line.]

Black bears continued: Notice that one observation falls far away from the regression line in the y direction. An observation is an outlier if it has a large residual.
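One way to see whether the unusually old bear is influential is to fit the least squares line twice, once with and once without that observation, and compare the results. The Python sketch below is not part of the original slides; it uses the age/weight pairs as read from the (reconstructed) table above, and the point of the sketch is the comparison itself, not any particular numerical output.

import numpy as np

# Black bear data as read from the table above (x = age in years, y = weight in kg).
age    = np.array([10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5])
weight = np.array([54, 40, 62, 51, 55, 56, 62, 50, 42, 40, 59, 51])

# Fit with all 12 bears.
slope_all, intercept_all = np.polyfit(age, weight, 1)

# Refit with the unusually old bear (age 28.5) removed.
keep = age != 28.5
slope_reduced, intercept_reduced = np.polyfit(age[keep], weight[keep], 1)

print("with the 28.5-year-old bear:   ", slope_all, intercept_all)
print("without the 28.5-year-old bear:", slope_reduced, intercept_reduced)
# If removing the point changes the line substantially, the point is influential.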
Coefficient of Determination

Suppose that you would like to predict the price of houses in a particular city from the size of the house (in square feet). There will be variability in house price, and it is this variability that makes accurate price prediction a challenge. If you know that differences in house size account for a large proportion of the variability in house price, then knowing the size of a house will help you predict its price.
 • The coefficient of determination is the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.
 • It is denoted by r².
 • The value of r² is often converted to a percentage.

Let's explore the meaning of r² by revisiting the deer mouse data set, where x = the distance from the food to the nearest pile of fine woody debris and y = the distance a deer mouse will travel for food.

Suppose you didn't know any x-values. What distance would you expect deer mice to travel? (The mean distance traveled, ȳ.) To find the total amount of variation in the distance traveled (y), find the sum of the squared deviations of the y values from their mean. (Why do we square the deviations?) The total amount of variation in the distance traveled is SSTo = 773.95 m².

Now let's find how much variation there is in the distance traveled (y) about the least squares regression line. To find this amount of variation, find the sum of the squared residuals. (Why do we square the residuals?) The amount of variation in the distance traveled about the least squares regression line is SSResid = 526.27 m².

How does the variation in y change when we use the least squares regression line? Approximately what percent of the variation in distance traveled (y) can be explained by the linear relationship?

   r² = 1 − SSResid/SSTo = 1 − 526.27/773.95 ≈ 0.32, so r² = 32%.

Standard Deviation about the Least Squares Regression Line

The coefficient of determination (r²) measures the extent of variability about the least squares regression line relative to the overall variability in y. This does not necessarily imply that the deviations from the line are small in an absolute sense. Partial output from the regression analysis of the deer mouse data:

   Predictor            Coef     SE Coef      T      P
   Constant            -7.69      13.33    -0.58  0.582
   Distance to debris   3.234      1.782     1.82  0.112

   S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

   Analysis of Variance
   Source         DF      SS       MS      F      P
   Regression      1   247.68   247.68   3.29  0.112
   Resid Error     7   526.27    75.18
   Total           8   773.95

Reading the output:
 • The y-intercept (a = -7.69) has no meaning in context, since it doesn't make sense to have a negative distance.
 • The slope (b = 3.234): the distance traveled to food increases by approximately 3.234 meters for each 1-meter increase in the distance to the nearest debris pile.
 • The standard deviation about the line (s = 8.67) is the typical amount by which an observation deviates from the least squares regression line.
 • The coefficient of determination (R-sq = 32.0%): only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.
 • In the Analysis of Variance table, the Resid Error sum of squares is SSResid and the Total sum of squares is SSTo.
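The quantities in the output above can be recomputed directly from the data. The Python sketch below (not from the original slides) finds SSTo, SSResid, r², and se for the deer mouse data, using se = sqrt(SSResid/(n − 2)); it should reproduce values close to SSTo = 773.95, SSResid = 526.27, r² = 32%, and se = 8.67.

import numpy as np

x = np.array([6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36])                 # distance to debris
y = np.array([0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65])          # distance traveled

slope, intercept = np.polyfit(x, y, 1)        # least squares fit
predicted = intercept + slope * x
residuals = y - predicted

n = len(y)
ss_to    = np.sum((y - y.mean()) ** 2)        # total sum of squares
ss_resid = np.sum(residuals ** 2)             # sum of squared residuals
r_squared = 1 - ss_resid / ss_to
s_e = np.sqrt(ss_resid / (n - 2))             # standard deviation about the line

print(ss_to, ss_resid)     # about 773.95 and 526.27
print(r_squared, s_e)      # about 0.32 and 8.67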
Interpreting the Values of se and r²

A small value of se indicates that residuals tend to be small. This value tells you how much accuracy you can expect when using the least squares regression line to make predictions. A large value of r² indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y. This tells you that knowing the value of x is helpful for predicting y. A useful regression line will have a reasonably small value of se and a reasonably large value of r².

A study (Archives of General Psychiatry [2010]: 570-577) looked at how working memory capacity was related to scores on a test of cognitive functioning and to scores on an IQ test. Two groups were studied: one group consisted of patients diagnosed with schizophrenia, and the other group consisted of healthy control subjects. For the patient group, the typical deviation of the observations from the regression line is about 10.7, which is somewhat large, and only about 14% (a relatively small amount) of the variation in the cognitive functioning score is explained by the linear relationship. For the control group, the typical deviation of the observations from the regression line is about 6.1, which is smaller, and approximately 79% (a much larger amount) of the variation in the cognitive functioning score is explained by the regression line. Thus, the regression line for the control group would produce more accurate predictions than the regression line for the patient group.

Putting it All Together
 • Describing Linear Relationships
 • Making Predictions

Steps in a Linear Regression Analysis
1. Summarize the data graphically by constructing a scatterplot.
2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed to the next step.
3. Find the equation of the least squares regression line.
4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.
5. Compute the values of se and r² and interpret them in context.
6. Based on what you have learned from the residual plot and the values of se and r², decide whether the least squares regression line is useful for making predictions. If so, proceed to the last step.
7. Use the least squares regression line to make predictions.

Revisit the crime scene DNA data. Recall that the scientists were interested in predicting the age of a crime scene victim (y) using the blood test measure (x).
 • Step 1: The scientists first constructed a scatterplot of the data.
 • Step 2: Based on the scatterplot, it does appear that there is a reasonably strong negative linear relationship between age and the blood test measure.
 • Step 4: A residual plot constructed from these data showed a few observations with large residuals, but these observations were not far removed from the rest of the data in the x direction. The observations were not judged to be influential. Also, there were no unusual patterns in the residual plot that would suggest a nonlinear relationship between age and the blood test measure.
 • Step 5: se = 8.9 and r² = 0.835. Approximately 83.5% of the variability in age can be explained by the linear relationship. A typical difference between the predicted age and the actual age would be about 9 years.
 • Step 6: Based on the residual plot, the large value of r², and the relatively small value of se, the scientists proposed using the blood test measure and the least squares regression line as a way to estimate the ages of crime victims.

Modeling Nonlinear Relationships
 • Choosing a Nonlinear Function to Describe a Relationship

[Catalog of functions, each shown on the slides with its equation and the general shape of its graph: quadratic, square root, reciprocal, log (the common log, base 10, may also be used), exponential, and power.]

While statisticians often fit these nonlinear regressions directly, in AP Statistics we will linearize our data using transformations. Then we can use what we already know about the least squares regression line.

Models that Involve Transforming Only x

If the pattern in the scatterplot of (x, y) pairs looks like one of these curves, an appropriate transformation of the x values should result in transformed data that shows a linear relationship. (x′ is read "x prime.")

   Model         Transformation
   Square root   x′ = √x
   Reciprocal    x′ = 1/x
   Log           x′ = log(x)  or  x′ = ln(x)

Let's look at an example. Is electromagnetic radiation from phone antennae associated with declining bird populations? The accompanying data are x = electromagnetic field strength (volts per meter) and y = sparrow density (sparrows per hectare).

   Field strength   Sparrow density
   0.11             41.71
   0.20             33.60
   0.29             24.74
   0.40             19.50
   0.50             19.42
   0.61             18.74
   0.70             16.29
   0.80             14.69
   0.90             16.29
   1.01             24.23
   1.10             22.04
   1.20             16.97
   1.30             12.83
   1.41             13.17
   1.50              4.64
   1.80              2.11
   1.90              0.00
   3.01              0.00
   3.10             14.69
   3.41              0.00

First look at a scatterplot of the data. The data is curved and looks similar to the graph of the log model.
Field Strength vs. Sparrow Density Continued

Transform the x values using x′ = ln(field strength):

   ln(field strength)   Sparrow density
   -2.207               41.71
   -1.609               33.60
   -1.238               24.74
   -0.916               19.50
   -0.693               19.42
   -0.494               18.74
   -0.357               16.29
   -0.223               14.69
   -0.105               16.29
    0.010               24.23
    0.095               22.04
    0.182               16.97
    0.262               12.83
    0.344               13.17
    0.405                4.64
    0.588                2.11
    0.642                0.00
    1.102                0.00
    1.131               14.69
    1.227                0.00

Graph the scatterplot of y on x′. Notice that the transformed data now look linear, so we can find the least squares regression line:

   Sparrow density = 14.805 − 10.546 ln(field strength)

   Predictor            Coef     SE Coef      T      P
   Constant            14.805     1.238    11.96  0.000
   Ln(field strength) -10.546     1.389    -7.59  0.000

   S = 5.50641   R-Sq = 76.2%   R-Sq(adj) = 74.9%

A residual plot from the least squares regression line fit to the transformed data has no apparent patterns or unusual features, so it appears that the log model is a reasonable choice for describing the relationship between sparrow density and field strength. The value of R² for this model is 0.762 and se = 5.5.

This model can now be used to predict sparrow density from field strength. For example, if the field strength is 1.6 volts per meter, the predicted sparrow density is 14.805 − 10.546 ln(1.6) ≈ 9.8 sparrows per hectare.

Models that Involve Transforming y

Now consider the remaining nonlinear models, the exponential model and the power model. Using properties of logarithms, these models can also be linearized, but the transformations involve y (and, for the power model, x as well):

   Model         Form             Transformation
   Exponential   y = a·e^(bx)     y′ = ln(y), so y′ = ln(a) + bx
   Power         y = a·x^b        y′ = ln(y) and x′ = ln(x), so y′ = ln(a) + b·x′

In a study of factors that affect the survival of loon chicks in Wisconsin, a relationship between the pH of lake water and the blood mercury level in loon chicks was observed. The researchers thought it possible that the pH of the lake could be related to the type of fish that the loons ate. A scatterplot of the data suggests a model of this type, and the fitted transformed model is:

   ln(blood mercury level) = 1.06 − 0.396 (lake pH)

   Predictor    Coef      SE Coef     T      P
   Constant     1.0550    0.5535    1.91  0.065
   Lake pH     -0.3956    0.0826   -4.79  0.000

   S = 0.6056   R-Sq = 39.6%   R-Sq(adj) = 37.8%

Choosing Among Different Possible Nonlinear Models

Often there is more than one reasonable model that could be used to describe a nonlinear relationship between two variables. How do you choose a model?
1. Consider scientific theory. Does it suggest what form the relationship should take?
2. In the absence of scientific theory, choose a model that has small residuals (small se) and accounts for a large proportion of the variability in y (large R²). A sketch comparing several candidate models for the sparrow data appears below.
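To make guideline 2 concrete, the Python sketch below (not part of the original slides) refits the sparrow data using each of the three "transform x only" candidates from earlier (square root, reciprocal, and log) and prints R² for each, then uses the log model, the one chosen above with R² = 76.2%, to predict the sparrow density at a field strength of 1.6 volts per meter.

import numpy as np

field_strength = np.array([0.11, 0.20, 0.29, 0.40, 0.50, 0.61, 0.70, 0.80, 0.90, 1.01,
                           1.10, 1.20, 1.30, 1.41, 1.50, 1.80, 1.90, 3.01, 3.10, 3.41])
sparrow_density = np.array([41.71, 33.60, 24.74, 19.50, 19.42, 18.74, 16.29, 14.69, 16.29, 24.23,
                            22.04, 16.97, 12.83, 13.17, 4.64, 2.11, 0.00, 0.00, 14.69, 0.00])

def fit_and_r_squared(x_transformed, y):
    """Least squares fit of y on the transformed x; returns (slope, intercept, R^2)."""
    slope, intercept = np.polyfit(x_transformed, y, 1)
    residuals = y - (intercept + slope * x_transformed)
    r_sq = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r_sq

# Compare the three "transform x only" models by R^2.
candidates = {
    "square root (x' = sqrt(x))": np.sqrt(field_strength),
    "reciprocal  (x' = 1/x)":     1 / field_strength,
    "log         (x' = ln x)":    np.log(field_strength),
}
for name, x_prime in candidates.items():
    _, _, r_sq = fit_and_r_squared(x_prime, sparrow_density)
    print(name, "R-sq =", round(r_sq, 3))

# Use the log model (chosen above) to predict density at 1.6 V/m.
slope, intercept, _ = fit_and_r_squared(np.log(field_strength), sparrow_density)
print(intercept + slope * np.log(1.6))   # compare with the hand prediction above (about 9.8)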
Common Mistakes: Avoid these Common Mistakes

1. Correlation does not imply causation. A strong correlation implies only that the two variables tend to vary together in a predictable way; there are many possible explanations for why this occurs other than one variable causing change in the other. For example, the number of fire trucks at a house that is on fire and the amount of damage from the fire have a strong, positive correlation. So, to avoid a large amount of damage if your house is on fire, should you keep the fire trucks away? Don't fall into this trap!

2. A correlation coefficient near 0 does not necessarily imply that there is no relationship between two variables. Although the variables may be unrelated, it is also possible that there is a strong but nonlinear relationship (recall the U-shaped data from the Properties of r section, where r = 0). Be sure to look at a scatterplot!

3. The least squares regression line for predicting y from x is NOT the same line as the least squares regression line for predicting x from y. For example, consider the ages (x, in months) and heights (y, in inches) of seven children:

   x (months)   16   24   42   60   75  102  120
   y (inches)   24   30   35   40   48   56   60

   The line for predicting height from age is a different line from the one for predicting age from height. (A sketch illustrating this appears at the end of the section.)

4. Beware of extrapolation. Using the least squares regression line to make predictions outside the range of x values in the data set often leads to poor predictions. For the children's data above, predicting the height of a child that is 15 years (180 months) old gives 81.6 inches; it is unreasonable that a 15-year-old would be 81.6 inches, or about 6.8 feet, tall.

5. Be careful in interpreting the value of the intercept of the least squares regression line. In many instances, interpreting the intercept as the value of y that would be predicted when x = 0 is equivalent to extrapolating way beyond the range of x values in the data set. The children's age and height data above illustrate this: no child in the data set is anywhere near an age of 0 months.

6. Remember that the least squares regression line may be the "best" line, but that doesn't necessarily mean that the line will produce good predictions. In the working memory study, for example, the regression line has a relatively large se, so we can't accurately predict IQ from working memory capacity.

7. It is not enough to look at just r² or just se when evaluating the regression line. Remember to consider both values. In general, you would like to have both a small value for se and a large value for r²: a small se indicates that deviations from the line tend to be small, and a large r² indicates that the linear relationship explains a large proportion of the variability in the y values.

8. The value of the correlation coefficient, as well as the values of the intercept and slope of the least squares regression line, can be sensitive to influential observations in the data set, particularly if the sample size is small. Be sure to always start with a plot to check for potential influential observations.
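The sketch referenced in mistake 3: fitting the children's age/height data both ways shows that the two least squares lines are genuinely different lines (they would coincide only if the points fell exactly on a line). The Python code below is not part of the original slides.

import numpy as np

months = np.array([16, 24, 42, 60, 75, 102, 120])    # x = age in months
inches = np.array([24, 30, 35, 40, 48, 56, 60])      # y = height in inches

# Least squares line for predicting height (y) from age (x).
b_yx, a_yx = np.polyfit(months, inches, 1)

# Least squares line for predicting age (x) from height (y).
b_xy, a_xy = np.polyfit(inches, months, 1)

print("predict height from age:  height-hat =", a_yx, "+", b_yx, "* age")
print("predict age from height:  age-hat    =", a_xy, "+", b_xy, "* height")

# Rewriting the second line as height in terms of age does NOT give the first line:
print("slope of the y-on-x line:    ", b_yx)
print("1 / (slope of the x-on-y line):", 1 / b_xy)   # differs unless r is exactly +1 or -1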