Chapter 4: Describing the Relation between Two Variables
4.1 Scatter Diagrams and Correlation
4.2 Least Squares Regression
4.3 Diagnostics on the Least Squares Regression Line
October 11, 2008

Variable Association
Question: In a population, are two or more variables linked? For example, do Math 127A students with brown eyes have higher IQs than students with other eye colors?

Bivariate Data
Recall: A variable is any characteristic of the objects in the population that will be analyzed. Data are the values (categorical or quantitative) measured for a variable. If only one variable is measured, then we call this univariate data. If two variables are measured simultaneously, then we call it bivariate data.

Example 1
Consider the population of all cars in the State of Tennessee. Suppose we collect data on the number of miles on each car and the age of the car. One variable is the mileage of each car and a second variable is the age (in years) of each car. This forms a bivariate dataset.
Question: Is there a relationship between the age of a car and the number of miles on it?

Example 2
Consider the population of all undergraduates at Vanderbilt during the present academic year. Suppose that we survey each student to determine the number of hours they watch television each week and their GPA at the end of the Spring 2007 semester. The two variables are: (1) hours of TV watched per week and (2) GPA. This forms a bivariate dataset.
Question: For Vanderbilt undergraduates, is there a relationship between these two variables?

Response & Explanatory Variables
Definition: Suppose we have bivariate data for two variables in a population or sample. The response (or dependent) variable is the variable whose value can be explained by the values of the explanatory (or independent) variable.
Mathematics: in an equation such as $y = \frac{1}{5}x^2$, y is the dependent variable and x is the independent variable.

Association between Variables
Definition: Consider two variables associated with a population. We say that an association exists between the two variables if a particular value of one variable is more likely to occur with certain values of the other variable.

Association between Two Quantitative Variables
We now consider a sample that contains information about two quantitative variables. We want to determine if an association between these two variables exists. Consider a set of bivariate data. Let $S = \{x_1, x_2, \ldots, x_n\}$ denote the data for one variable and $T = \{y_1, y_2, \ldots, y_n\}$ the data for the second variable. Here $x_i$ and $y_i$ are numbers (values) for our two quantitative variables. Our main goal is to determine if there is a relationship between the two sets. Technically, it would be desirable to find a function $f$ such that $y_i = f(x_i)$. However, for bivariate data that vary from sample to sample, this is virtually impossible. Therefore, we look for an association that permits this variability.

Association between Sets (Variables) [figure]

One Approach
One approach is to look at the descriptive characteristics (statistics) of each set of the bivariate dataset separately.
Example: S = (-2,3,7,8,9) and T = (0,1,4,5,10).
Range: r = 11 (S), r = 10 (T)
Mean: m = 5 (S), m = 4 (T)
Median: 7 (S), 4 (T)
SD: s = 4.52 (S), s = 3.94 (T)
Conclusion: Not much help! These one-variable summaries say nothing about how the two variables vary together.
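As a quick check of the numbers above, here is a minimal Python sketch (purely illustrative; the course itself uses Excel) that computes each variable's summaries separately with the standard library:

```python
from statistics import mean, median, stdev

# The two univariate datasets from the "One Approach" example.
S = [-2, 3, 7, 8, 9]
T = [0, 1, 4, 5, 10]

for name, data in (("S", S), ("T", T)):
    print(name,
          "range =", max(data) - min(data),
          "mean =", mean(data),
          "median =", median(data),
          "sd =", round(stdev(data), 2))

# These one-variable summaries miss how S and T vary together,
# which is why the scatterplot introduced next is more useful.
```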
A Better Approach: Scatterplots
Suppose that we have bivariate data $S = \{x_1, x_2, \ldots, x_n\}$ and $T = \{y_1, y_2, \ldots, y_n\}$ for two variables. From these two sets we form a third set of ordered pairs:
$$A = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}.$$
Definition: The plot of the points of A in the xy-plane is called a scatterplot.
Remark: Although it technically doesn't matter, we choose the first set to be the explanatory variable (horizontal axis) and the second set to be the response variable (vertical axis).

Example
Suppose that we have bivariate data where one sample (the explanatory variable) is (1,3,4,6,9,12) and the other sample (the response variable) is (2,-1,3,0,1,4). [scatterplot]

Scatterplots & Excel
It is easy to create scatterplots in Excel. Assuming that your data are listed in two columns (or two rows), select Chart from the Insert menu and then choose XY Scatter from the different chart types. Then use the Chart Wizard to construct the scatterplot.

Example
Explanatory variable: GDP. Response variable: Internet use. [scatterplot]

Positive & Negative Associations
Definition: We say that two numerical variables (x & y) have a positive association if, as x increases, y also tends to increase. We say that they have a negative association if, as x increases, y tends to decrease. If there is neither a positive nor a negative association, we say that there is no association.

Positive & Negative Association in Scatterplots [figure]

Example
Consider the bivariate data:
• S = (0,1,2,3,4,5,6,7,8,9,10) (explanatory)
• T = (4,4,5,6,4,4,5,9,5,11,6) (response)
Is there an association? There appears to be a positive association between the explanatory and response variables.

Example
This example deals with the correlation between the Pat Buchanan and Ross Perot county-wide votes in the 1996 and 2000 elections in Florida. Each dot (x,y) is a county in Florida, with the first component the Perot vote and the second component the Buchanan vote, for the two different elections (1996 and 2000). [scatterplot]

Generic Scatterplot
Consider a bivariate set of quantitative data and suppose that we construct the scatterplot for this data. [figure]

Linear Response
Consider a bivariate set of quantitative data and suppose that we construct the scatterplot for this data. There appears to be a linear relationship between x and y in a "fuzzy" sense. [figure]

The Linear Correlation Coefficient
Consider a bivariate set of data. If we believe that there is a linear response between the two variables, then we can define a number (which we will denote by r) that measures how much the scatterplot varies from a linear relationship between the two variables (x & y): y = mx + b.
Remark: The correlation coefficient is sometimes called the Pearson correlation coefficient.

Calculation of r
Consider two sets (samples) of data: $S = \{x_1, x_2, \ldots, x_n\}$ and $T = \{y_1, y_2, \ldots, y_n\}$. Let $s_x$ denote the sample standard deviation of S and $s_y$ the sample standard deviation of T. Then we define the correlation coefficient r as
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}
         {\sqrt{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}\;
          \sqrt{\displaystyle\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}},$$
where $\bar{x}$ is the mean of S and $\bar{y}$ is the mean of T.

Remark
Recall that we have introduced the z-score of a data value: $z_x = \frac{x-\bar{x}}{s_x}$ and $z_y = \frac{y-\bar{y}}{s_y}$. The correlation coefficient is the same as
$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}.$$
Hence, it is a normalized product of the z-scores for both data sets. If x and y have a positive association, then as x increases, so will y. In this case, we expect that $z_{x_i}$ and $z_{y_i}$ will have the same sign and hence $z_{x_i} z_{y_i} > 0$.
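The z-score form of r translates directly into code. Below is a minimal sketch (assuming NumPy is available; pearson_r is our own hypothetical helper, not a library routine):

```python
import numpy as np

def pearson_r(x, y):
    # r = (1/(n-1)) * sum of z_x * z_y, the formula on the slide above.
    x, y = np.asarray(x, float), np.asarray(y, float)
    zx = (x - x.mean()) / x.std(ddof=1)   # ddof=1 gives the sample SD
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (len(x) - 1)

# Check against NumPy's built-in correlation on the small scatterplot example:
S = [1, 3, 4, 6, 9, 12]
T = [2, -1, 3, 0, 1, 4]
print(pearson_r(S, T), np.corrcoef(S, T)[0, 1])   # the two values agree
```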
If the association is negative, then we expect that $z_{x_i}$ and $z_{y_i}$ will have opposite signs and hence $z_{x_i} z_{y_i} < 0$. The strength of the association will depend on the magnitude of r, i.e., |r|, and the magnitude will depend on the sizes of the products $z_{x_i} z_{y_i}$. The smaller $z_{x_i} = \frac{x_i - \bar{x}}{s_x}$ or $z_{y_i} = \frac{y_i - \bar{y}}{s_y}$ is, the smaller $z_{x_i} z_{y_i}$ is. This will happen when $x_i \approx \bar{x}$ and/or $y_i \approx \bar{y}$, i.e., when the x-points and y-points do not differ much from their means.

What does r tell us?
Suppose we have a bivariate set of data with x arbitrary but y = mx + b. That is, the two sets are linearly related. What is the correlation coefficient for this type of set?
Example: Let m = 2 and b = 1. Then S = (1,2,3,4,…,12) and T = (3,5,7,9,…,25). Using the formula, we find r = 1.
Example: Let m = -2 and b = 25. Then S = (1,2,3,4,…,12) and T = (23,21,19,…,1). Using the formula, we find r = -1.

Linear Correlation: r
• If r = 1, then there is a perfect positive linear association between the variables.
• If r = -1, then there is a perfect negative linear association between the variables.
• If r = 0, then there is no linear correlation between the variables.
• If 0 < r < 1, then there is some positive correlation, although the nearer r is to zero, the weaker the correlation.
• If -1 < r < 0, then there is some negative correlation between the variables.
• If r = 0, it does not mean that there is no association, but rather no linear association.
In other words, r measures the strength of the linear association between the two variables. The relationship between two variables may be nonlinear, yet you may be able to approximate the nonlinear relationship by a linear one.

The Bottom Line
If you want to know whether there is a linear association between two quantitative variables in a bivariate set, compute the correlation coefficient r. Its sign (+ or -) will tell you whether the association is positive or negative, and the magnitude |r| will tell you the strength of the association.

Example
Consider bivariate data: S = (0,1,2,…,8,9) and T = (1.00,2.00,2.09,2.14,…,2.30,2.32). The data in set T were generated by the function f(x) = x^{1/8} + 1. Here r = 0.72.

Example
Consider the function y = f(x) = x^{10} and the points (0, 0.1, 0.2, 0.3, …, 0.9, 1.0). We form a bivariate set with these points: ((0,0), (0.1, 10^{-10}), …, (0.9, 0.348678), (1,1)). The correlation coefficient for this data is r = 0.669641. This indicates a medium-strength linear correlation. However, it is a perfect nonlinear correlation with the nonlinear function x^{10}.

Example

Animal     Gestation (days)   Life Expectancy (years)
Cat        63                 11
Chicken    22                 7.5
Dog        63                 11
Duck       28                 10
Goat       151                12
Lion       108                10
Parakeet   18                 8
Pig        115                10
Rabbit     31                 7
Squirrel   44                 9

Is there a linear association between gestation period and life expectancy?
Note: The explanatory variable is the gestation period and the response variable is the life expectancy. Also, dogs and cats have the same data. With n = 10, $\bar{x} = 64.3$, $\bar{y} = 9.55$, $s_x = 45.6704$, $s_y = 1.6403$:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = 0.725657.$$
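A quick check of this computation, as a sketch in Python (assuming NumPy; np.corrcoef is its built-in correlation routine):

```python
import numpy as np

# Gestation (days) and life expectancy (years), from the table above.
gestation = [63, 22, 63, 28, 151, 108, 18, 115, 31, 44]
life_exp  = [11, 7.5, 11, 10, 12, 10, 8, 10, 7, 9]

r = np.corrcoef(gestation, life_exp)[0, 1]
print(round(r, 6))   # ~0.725657, matching the hand computation above
```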
Example
The U.S. Federal Reserve Board provides data on the percentage of disposable personal income required to meet consumer loan payments and mortgage payments. The following table summarizes this yearly data over the past several years.

Consumer Debt   Household Debt
7.88            6.22
7.91            6.14
7.65            5.95
7.61            5.83
7.48            5.83
7.49            5.85
7.37            5.81
6.57            5.79
6.24            5.73
6.09            5.95
6.32            6.09
6.97            6.28
7.38            6.08
7.52            5.79
7.84            5.81

Question: Are consumer debt and household debt correlated?
Means: 7.22133 (consumer), 5.94333 (household)
Sample SDs: 0.623583 (consumer), 0.175526 (household)
Correlation coefficient: r = 0.117813, so there is at best a very weak linear association.

Excel and Correlation
Excel can be used to find the correlation coefficient for a set of bivariate data. In the Tools menu, select Data Analysis. In the Data Analysis window, select Correlation and follow the wizard. It produces what is called a correlation matrix. The number that occupies the 2nd row and 1st column is the correlation coefficient. It is possible to calculate the correlation coefficients between several variables using this tool.

Least Squares Regression (Section 4.2)
Suppose that we have bivariate data, $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$, and there is a linear association between the two variables x and y, that is, $r \neq 0$. (When r = 0, we say that there is no linear association between the variables, even though a horizontal line (slope zero) passes through the data.) We know then that there is a line with equation y = mx + b that in some sense approximates the relationship between x and y.
Note: In general, $y_i \neq mx_i + b$.

Reminder about Lines
The equation of a straight line is y = mx + b. The number m is called the slope of the line and the number b is called the y-intercept. If m > 0, then y increases with x, and if m < 0, then y decreases with x. Given two distinct points in the plane, one can find the numbers m and b. Points (x,y) that satisfy the same equation y = mx + b are said to be collinear.

Remark
Given a value for the explanatory variable, say $x_k$, we can compute an approximation for $y_k$, call it $\hat{y}_k$, by using the equation for the straight line, i.e., $\hat{y}_k = mx_k + b \approx y_k$. In fact, for any x (not necessarily a data point), we can infer a value for the corresponding y.
Objective: Find the equation of the line that best approximates the association between the two variables. This line is called a regression line for the bivariate data.

Problem
Given a set of points in the xy-plane, there are an infinite number of lines that can be drawn through the points if the points are not collinear.

Error and Residual
Consider a bivariate set of data, $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$, and suppose that we construct some straight line y = mx + b. Let $\hat{y}_i = mx_i + b$. The difference $\epsilon_i = y_i - \hat{y}_i$ is called the residual or prediction error of using $\hat{y}_i$ as an approximation for $y_i$.
Idea: Given a set of bivariate data, choose the straight line that minimizes the residuals over all points.

Least Squares Line
Consider a set of bivariate data, $(x_1,y_1), \ldots, (x_n,y_n)$, and consider the line y = mx + b. Let $\epsilon_i = y_i - \hat{y}_i$, where $\hat{y}_i = mx_i + b$. The number $\epsilon_i$ is called the residual of $y_i$. Consider
$$R = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2.$$
The values of m and b that make R as small as possible are
$$(1)\;\; m = r\,\frac{s_y}{s_x}, \qquad (2)\;\; b = \bar{y} - m\bar{x},$$
where r is the correlation coefficient for the bivariate set, $s_x$ and $s_y$ are the sample standard deviations, and $\bar{x}$ and $\bar{y}$ are the means of their particular sets. The straight line calculated in this way is called a least squares line. Alternately, it is called a regression line.
Remark: Your book uses the notation $y = b_1 x + b_0$ for the line.

Lots of Things to Compute!
To compute the least squares line one must calculate:
• the sample standard deviations of the two sets,
• the means of the two sets,
• the correlation between the two sets.
Fortunately, there is technology available to do this for us: http://www.shodor.org/unchem/math/lls/leastsq.html. A short code sketch is given below.
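As a sketch of the computation (assuming NumPy; least_squares_line is our own hypothetical helper, not a library function):

```python
import numpy as np

def least_squares_line(x, y):
    # Slide formulas: m = r * s_y / s_x and b = ybar - m * xbar.
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    m = r * y.std(ddof=1) / x.std(ddof=1)
    b = y.mean() - m * x.mean()
    return m, b

# The by-hand example worked on the next slide:
m, b = least_squares_line([-1, 0, 2, 3], [1, 2, -1, 0])
print(m, b)   # approximately -0.5 and 1.0
```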
Example (by hand)
Find the least squares line for the data set ((-1,1),(0,2),(2,-1),(3,0)).
• X = (-1,0,2,3) and Y = (1,2,-1,0).
• Means: 1 and 0.5, respectively
• Sample standard deviations: 1.83 and 1.29, respectively
• Correlation coefficient: r = -0.71
• m = -0.71(1.29/1.83) = -0.50
• b = 0.5 - (-0.50)(1) = 1.0
• y = -0.5x + 1.0

Example
Anthropologists use bones to predict the height of individuals. Let x = length of the femur (thighbone) in cm and y = height in cm, with regression line
$$\hat{y} = 2.4x + 61.4.$$
What is the predicted height of an individual with a 50 cm femur? The regression equation predicts (2.4)(50) + 61.4 = 181.4 cm ≈ 71.4 in.

Interpreting the y-intercept
The y-intercept is:
• the predicted value of y when x = 0;
• helpful in plotting the line, because it gives the point where the least squares regression line crosses the y-axis;
• possibly without any interpretative value if no observations had x-values near 0.

Interpreting the Slope
The slope measures the change in the predicted variable for every unit change in the explanatory variable. Hence, it is a rate of change between the explanatory variable and the predicted (response) variable. Note that the slope has units (units of the response variable divided by the units of the explanatory variable).

Slopes and Association [figure]

Example
The population of the Detroit metropolitan area from 1950 to 2000 is summarized in the following table:

Year                    1950  1960  1970  1980  1990  2000
Population (millions)   3.0   3.8   4.2   5.2   5.9   7.0

With n = 6, $\bar{x} = 1975$, $s_x = 18.7083$, $\bar{y} = 4.85$, $s_y = 1.46935$:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = 0.993121,$$
$$m = r\,\frac{s_y}{s_x} = 0.078, \qquad b = \bar{y} - m\bar{x} = -149.2, \qquad \hat{y} = 0.078x - 149.2.$$
Questions: (i) Give a prediction of the population in the year 2010. (ii) Why did the population growth slow during the 1980s?
Answers: (i) $\hat{y} = m(2010) + b = 7.58$ million. (ii) Recession, poor domestic auto sales.

Residuals
• Residuals measure the difference between a data point (observation) and a prediction: y - (mx + b).
• Every data point has a residual.
• A residual with a large absolute value indicates an unusual observation.
• Large residuals can be found by constructing a histogram of the residuals.

Example
Research at NASA studied the relationship between the right humerus and right tibia of 11 rats that were sent into space on Spacelab. Here is the data collected:

Right Humerus (mm)   Right Tibia (mm)
24.80                36.05
24.59                35.57
24.59                35.57
24.29                34.58
23.81                34.20
24.87                34.73
25.90                37.38
26.11                37.96
26.63                37.46
26.31                37.75
26.84                38.50

Find the least squares regression line with x being the right humerus and y the right tibia. With n = 11, $\bar{x} = 25.34$, $s_x = 1.04127$, $\bar{y} = 36.3409$, $s_y = 1.52162$:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = 0.951316,$$
$$m = r\,\frac{s_y}{s_x} = 1.39017, \qquad b = \bar{y} - m\bar{x} = 1.11395, \qquad \hat{y} = 1.39017x + 1.11395.$$
[scatterplot with fitted line]

Slope & Correlation
Correlation:
– describes the strength of a linear association between the two variables;
– does not change when the units of measurement of the variables change;
– does not require identifying which variable is the response variable and which is the explanatory variable.
Slope:
– its numerical value depends on the units used to measure the variables;
– does not indicate whether the association is strong or weak;
– requires identifying the explanatory and response variables;
– the regression equation (y = mx + b) can be used to predict values of the response variable.

Summing the Residuals
The residual of a data point $(x_i, y_i)$ is a measure of how well the regression line $\hat{y} = mx + b$ approximates the data, i.e., $\epsilon_i = y_i - \hat{y}_i = y_i - (mx_i + b)$. The smaller $|\epsilon_i|$ is, the better the approximation. Hence, we calculate the sum of the squares of the residuals and then take the square root of this sum:
$$\sqrt{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}.$$
This is an overall measure of the approximation.
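A minimal sketch of this overall measure (assuming NumPy; residual_norm is our own name, not a library routine):

```python
import numpy as np

def residual_norm(x, y, m, b):
    # Square root of the sum of squared residuals for the line y = m*x + b.
    x, y = np.asarray(x, float), np.asarray(y, float)
    resid = y - (m * x + b)
    return resid, np.sqrt((resid ** 2).sum())

# NASA rat data from the slides above: humerus (x) vs. tibia (y).
x = [24.80, 24.59, 24.59, 24.29, 23.81, 24.87, 25.90, 26.11, 26.63, 26.31, 26.84]
y = [36.05, 35.57, 35.57, 34.58, 34.20, 34.73, 37.38, 37.96, 37.46, 37.75, 38.50]
resid, norm = residual_norm(x, y, 1.39017, 1.11395)
print(norm)   # ~1.48, matching the residual analysis later in this section
```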
Example
Find the least squares regression line for the bivariate data ((-1,0),(1,2),(2,3),(4,3),(5,4)) and then calculate the square root of the sum of the squared residuals.
X = (-1, 1, 2, 4, 5) and Y = (0, 2, 3, 3, 4), with n = 5:
$$\bar{x} = \frac{11}{5}, \quad s_x = \sqrt{\frac{57}{10}} = 2.38747, \quad \bar{y} = \frac{12}{5}, \quad s_y = \sqrt{\frac{23}{10}} = 1.51658, \quad r = 0.939026,$$
$$m = r\,\frac{s_y}{s_x} = 0.596491, \qquad b = \bar{y} - m\bar{x} = 1.08772, \qquad \hat{y} = 0.596491x + 1.08772.$$
Residuals $\epsilon_i = y_i - \hat{y}_i$, i = 1, 2, …, n: {-0.491228, 0.315789, 0.719298, -0.473684, -0.070175}
$$\sqrt{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} = 1.04294.$$

Example
Data: {(6,4), (7,5), (8,2), (6,7), (1,6), (8,6), (2,5), (6,8), (7,6), (6,2), (3,0), (6,1), (4,7), (0,0), (3,3), (3,8), (3,3), (0,7), (8,2), (7,3), (7,3), (6,0), (0,5), (2,8), (1,3), (9,5), (7,0), (4,4), (8,5), (4,2), (2,7), (8,9), (0,3), (7,6), (7,8), (9,7), (5,4), (8,0), (7,4), (1,3), (9,7), (3,5), (2,9), (7,6), (7,5), (8,6), (7,6), (8,2), (8,3), (2,5)}
$\bar{x} = 5.14$, $\bar{y} = 4.5$, $s_x = 2.83571$, $s_y = 2.50917$, r = -0.00143411, m = -0.00126897, b = 4.50652.
Since r ≈ 0, there is essentially no linear association: the fitted line is nearly horizontal at the mean of the y-values.
Residuals: {-0.498909, 0.50236, -2.49637, 2.50109, 1.49475, 1.50363, 0.496015, 3.50109, 1.50236, -2.49891, -4.50272, -3.49891, 2.49855, -4.50652, -1.50272, 3.49728, -1.50272, 2.49348, -2.49637, -1.49764, -1.49764, -4.49891, 0.493477, 3.49602, -1.50525, 0.504898, -4.49764, -0.501447, 0.503629, -2.50145, 2.49602, 4.50363, -1.50652, 1.50236, 3.50236, 2.5049, -0.500178, -4.49637, -0.49764, -1.50525, 2.5049, 0.497284, 4.49602, 1.50236, 0.50236, 1.50363, 1.50236, -2.49637, -1.49637, 0.496015}

Excel and Regression Lines
Excel can be used to find the regression line for a set of bivariate data. In the Tools menu, select Data Analysis. In the Data Analysis window, select Regression and follow the wizard.

Be Careful in Analyzing Association between Two Variables
We have developed the regression line as a way to predict values of the response variable in terms of the explanatory variable. Can this lead to bad predictions? By all means!

Extrapolation
Consider the data sets $S = \{x_1, \ldots, x_n\}$ (explanatory variable) and $T = \{y_1, \ldots, y_n\}$ (response variable). Let us assume that we order S so that $x_1 \le x_2 \le \cdots \le x_n$. We compute a regression line for this data, $\hat{y} = mx + b$, where $\hat{y}$ is a prediction for y. Technically, the equation of the regression line was calculated under the assumption that $x_1 \le x \le x_n$. If we use the equation of the regression line to calculate predictions for $x < x_1$ and/or $x > x_n$, then this is called extrapolation.

Example
Consider S = (-1,0,3,4,5) and T = (1,0,2,3,2). The regression line is y = 0.351x + 0.828. Suppose that we want a prediction for x = -3 (which gives -0.224) or for x = 6 (which gives 2.932). Using our equation to obtain these predictions is called extrapolation.

Problems with Extrapolation
Extrapolation is using a regression line to predict values of the response variable for values of the explanatory variable that are outside the range of the data set.
– It is riskier the farther we move from the range of the given x-values.
– There is no guarantee that the same relationship (the regression line) will hold as a trend outside the range of x-values.
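A small sketch of a prediction helper that flags extrapolation (standard library only; predict is our own hypothetical function):

```python
def predict(x_new, xs, m, b):
    # Predict y at x_new with the fitted line, warning when x_new lies
    # outside the range of the data used to fit the line (extrapolation).
    lo, hi = min(xs), max(xs)
    if not (lo <= x_new <= hi):
        print(f"warning: x = {x_new} is outside [{lo}, {hi}]: extrapolation")
    return m * x_new + b

S = [-1, 0, 3, 4, 5]
print(predict(2, S, 0.351, 0.828))    # inside the data range: interpolation
print(predict(-3, S, 0.351, 0.828))   # ~ -0.225: extrapolation, use with caution
```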
Diagnostics on the Least Squares Regression Line (Section 4.3)
Given a set of bivariate data, we can compute the least squares regression line that "fits" the data. We now look at how well this line approximates the data.

Review
Given a sample of data from our population, $(x_1,y_1), \ldots, (x_n,y_n)$, we can construct a linear regression line y = mx + b. We have formulas for the constants: the slope m and the y-intercept b. Furthermore, we have the (linear) correlation coefficient r, which measures the strength of the linear relation between the two quantitative variables x and y:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{1}{(n-1)\,s_x s_y}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right),$$
where $\bar{x}$ is the mean and $s_x$ the sample standard deviation of $x_1, \ldots, x_n$, and $\bar{y}$ is the mean and $s_y$ the sample standard deviation of $y_1, \ldots, y_n$. Then
$$m = r\,\frac{s_y}{s_x} \qquad \text{and} \qquad b = \bar{y} - m\bar{x}.$$

Deviations
Consider a data set $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$. Let y = mx + b be the regression line, $\hat{y}_i = mx_i + b$, and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$. For a point $(x_i, y_i)$:
(i) $y_i - \bar{y}$ = total deviation at $(x_i, y_i)$;
(ii) $\hat{y}_i - \bar{y}$ = explained deviation at $(x_i, y_i)$;
(iii) $y_i - \hat{y}_i$ = unexplained deviation at $(x_i, y_i)$.

Variation
Consider any point in the data, $y_i$, and the prediction $\hat{y}_i = mx_i + b$. Then
$$y_i - \bar{y} = \left(\hat{y}_i - \bar{y}\right) + \left(y_i - \hat{y}_i\right),$$
that is, total deviation = explained deviation + unexplained deviation. Next, one can prove:
$$\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2 + \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$
We define:
$\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$ = total variation,
$\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2$ = explained variation,
$\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$ = unexplained variation.
Hence,
$$1 = \frac{\text{explained variation}}{\text{total variation}} + \frac{\text{unexplained variation}}{\text{total variation}}.$$
Definition: $R^2 = \dfrac{\text{explained variation}}{\text{total variation}}$. The constant $R^2$ is also called the coefficient of determination.

Interesting Fact
Theorem: Let $R^2$ be the coefficient of determination and r the correlation coefficient, i.e., $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$. Then $R^2 = r^2$.

Correlation r and Proportional Reduction in Error R²
Both r and R² describe the strength of an association between the quantitative variables x and y.
• -1 ≤ r ≤ 1, and r represents the slope of the regression line when x and y have been standardized.
• 0 ≤ R² ≤ 1, and R² summarizes the reduction in relative error between the errors with respect to the mean of the y-values in the sample and the errors with respect to the residuals of the sample. People often express R² as a percentage.
Bottom line: Both are measures of the strength of an association, but they have different interpretations.

Example
Data: ((-1,1),(0,2),(2,4),(3,2),(3,1),(4,-1)). Scatterplot: [figure]
Sample means: $\bar{x} = \frac{11}{6}$, $\bar{y} = \frac{3}{2}$
Sample standard deviations: $s_x = \sqrt{\frac{113}{30}} \approx 1.94$, $s_y = \sqrt{\frac{27}{10}} \approx 1.64$
Correlation coefficient: $r = -\frac{3}{\sqrt{113}} \approx -0.282$
Slope: $m = r\,\frac{s_y}{s_x} = -\frac{27}{113} \approx -0.239$
y-intercept: $b = \bar{y} - m\bar{x} = \frac{219}{113} \approx 1.938$
Regression line: $\hat{y} = -\frac{27}{113}\,x + \frac{219}{113}$
Predictions $\hat{y}_i$: $\frac{246}{113}, \frac{219}{113}, \frac{165}{113}, \frac{138}{113}, \frac{138}{113}, \frac{111}{113}$
Coefficient of determination: $R^2 = r^2 = \left(-\frac{3}{\sqrt{113}}\right)^2 = \frac{9}{113} \approx 0.079646$

Residual Analysis
We have seen in the previous section that the residual for each point in the bivariate set gives us a measure of how well the regression line "fits" the data. Overall, residual analysis can be used:
• to determine if a linear regression model is appropriate;
• to check the variation of the residuals (the variance or standard deviation of the residuals);
• to help isolate outliers.

Is a Linear Model Appropriate?
Question: For a given set of bivariate data, is a regression line, y = mx + b, an appropriate model for approximating the behavior of y as x varies?
Approach: Plot the residuals of the data set as a function of the explanatory variable: (x, residual). If this plot is "flat, but random", i.e., the residuals do not change much with x and there is no discernible pattern to the residuals, then a linear model is probably OK.
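A sketch of the variation decomposition in code (assuming NumPy; r_squared is our own helper), checked on the six-point example above:

```python
import numpy as np

def r_squared(x, y, m, b):
    # Coefficient of determination: explained variation / total variation.
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_hat = m * x + b
    explained = ((y_hat - y.mean()) ** 2).sum()
    total = ((y - y.mean()) ** 2).sum()
    return explained / total

x = [-1, 0, 2, 3, 3, 4]
y = [1, 2, 4, 2, 1, -1]
print(r_squared(x, y, -27/113, 219/113))   # ~0.079646, i.e., 9/113
print(np.corrcoef(x, y)[0, 1] ** 2)        # the same value: R^2 = r^2
```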
Example
Recall the NASA rat data (right humerus x vs. right tibia y) with n = 11, $\bar{x} = 25.34$, $s_x = 1.04127$, $\bar{y} = 36.3409$, $s_y = 1.52162$, r = 0.951316, and regression line $\hat{y} = 1.39017x + 1.11395$.
Residuals: {0.459784, 0.27172, 0.27172, -0.301229, -0.0139461, -0.957528, 0.260595, 0.548659, -0.674231, 0.0606241, 0.073833}. The square root of the sum of the squares of the residuals is 1.48307. The mean of the set of residuals is -2.58379 × 10⁻¹⁵ (zero, up to round-off) and the sample standard deviation is 0.468989. The residual plot is flat with no discernible pattern, so a linear model is appropriate here.

Variance of Residuals
If the spread of the residuals as a function of the explanatory variable increases or decreases as the explanatory variable increases, then the linear regression line may not be an appropriate model. In this case it violates the constant error variance criterion.

Example
Data: {(-1, -3.85725), (1, -1.89082), (3, -0.183631), (4, 0.806224), (5, 2.05)}
$\bar{x} = 2.4$, $\bar{y} = -0.615098$, $s_x = 2.40832$, $s_y = 2.3156$, r = 0.998419
$$m = r\,\frac{s_y}{s_x} = 0.959982, \qquad b = \bar{y} - m\bar{x} = -2.91905, \qquad \hat{y} = 0.959982x - 2.91905$$
Residuals: {0.0217823, 0.0682502, -0.144522, -0.11465, 0.16914}
Sample standard deviation of the residuals: 0.130165. Constant variance of residuals.

Example
Data: {(-1, -2.29705), (1, -2.06739), (3, 0.528931), (4, 4.45035), (5, 10.6457)}
$\bar{x} = 2.4$, $\bar{y} = 2.25211$, $s_x = 2.40832$, $s_y = 5.42233$, r = 0.877213
$$m = r\,\frac{s_y}{s_x} = 1.97504, \qquad b = \bar{y} - m\bar{x} = -2.488, \qquad \hat{y} = 1.97504x - 2.488$$
Residuals: {2.16599, -1.55443, -2.9082, -0.961828, 3.25847}
Sample standard deviation of the residuals: 2.60327. Non-constant variance of residuals.

Outliers
Outliers can dramatically affect the regression line parameters m and b. If outliers can be identified, then they should be removed from the calculation of the equation for the regression line.

Example
Data: ((-1,-2),(0,0),(1,2),(3,4))
Means: $\bar{x}$ = 0.75 and $\bar{y}$ = 1.0. SDs: $s_x$ = 1.708 and $s_y$ = 2.582. Correlation: r = 0.9827. Regression line: m = 1.486 and b = -0.114.
Data: ((-1,-2),(0,0),(1,2),(3,10))
Means: $\bar{x}$ = 0.75 and $\bar{y}$ = 2.5. SDs: $s_x$ = 1.708 and $s_y$ = 5.30. Correlation: r = 0.9833. Regression line: m = 3.029 and b = 0.229.
Notice that the correlation coefficient r hasn't changed much, but the slope and y-intercept have.

Box-Whisker Plot for Residuals and Outliers
Consider the following bivariate data: {(0,1),(1,-3),(2,3),(4,4),(5,6),(6,6),(7,15)}
r = 0.860226 and $\hat{y} = 1.81507x - 1.91096$.
It appears that (7,15) might be an outlier. We examine the set of residuals: {2.91096, -2.90411, 1.28082, -1.34932, -1.16438, -2.97945, 4.20548}. We calculate the five-number summary for the residuals without the last residual (4.20548): (-2.97945, -2.94178, -1.25685, 2.09589, 2.91096). The upper fence is Q3 + 1.5(IQR), where IQR = 5.03767; that is, UF = 9.6524. Hence, the residual corresponding to x = 7 would not be considered an outlier.

Influential Observations
An influential observation is a data point that significantly changes the equation of the linear regression line. Such a point may or may not be an outlier.
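A sketch of the fence test on the residuals (assuming NumPy; note that np.percentile's default quartile convention differs slightly from the hand computation above, which excluded the suspect residual before summarizing):

```python
import numpy as np

def fence_outliers(resids, k=1.5):
    # Flag residuals beyond Q1 - k*IQR or Q3 + k*IQR (box-whisker fences).
    q1, q3 = np.percentile(resids, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in resids if v < lo or v > hi]

resids = [2.91096, -2.90411, 1.28082, -1.34932, -1.16438, -2.97945, 4.20548]
print(fence_outliers(resids))   # []: no residual crosses a fence here either
```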
Example
Data: {(-2., -0.221996), (-1.75, -0.135721), (-1.5, -0.119657), (-1.25, -0.2215), (-1., -0.122489), (-0.75, -0.0977267), (-0.5, -0.112159), (-0.25, -0.0423737), (0., -0.0465509), (0.25, 0.121379), (0.5, -0.0336865), (0.75, 0.118555), (1., 0.080907), (1.25, 0.116514), (1.5, 0.181848), (1.75, 0.186804), (2., 0.170339), (2.25, 0.200667), (2.5, 0.233716), (2.75, 0.362922), (3., 0.365765), (3.25, 0.349446), (3.5, 0.445323), (3.75, 0.306702), (4., 0.38776), (4.25, 0.510167), (4.5, 0.41498), (4.75, 0.403201), (5., 0.410249)}
r = 0.967795, m = 0.1001504, b = 0.00696613, and $\hat{y} = 0.1001504x + 0.00696613$.

Add the point (1,3), an outlier:
r = 0.334904, m = 0.0904561, b = 0.10627, and $\hat{y} = 0.0904561x + 0.10627$.
The point (1,3) is an outlier, but it is not an influential point, since it doesn't significantly change the linear regression line.

Add the point (10,3), an outlier:
r = 0.850706, m = 0.184705, b = 0.0889436, and $\hat{y} = 0.184705x + 0.0889436$.
The point (10,3) is an outlier and an influential point (it almost doubles the slope of the regression line).

Rule of Thumb
If a point is an outlier with respect to the explanatory variable (x), then it will be an influential point.
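To see this behavior in code, here is a minimal sketch that refits the line with and without each added point (assuming NumPy; the base data below is a noiseless stand-in for the slide's data, whose slope is near 0.1):

```python
import numpy as np

def fit(points):
    # Least squares slope and intercept for a list of (x, y) pairs.
    x, y = np.asarray(points, float).T
    m = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
    return m, y.mean() - m * x.mean()

base = [(x, 0.1 * x) for x in np.arange(-2, 5.25, 0.25)]  # stand-in data
print(fit(base))               # slope 0.1 exactly for the stand-in
print(fit(base + [(1, 3)]))    # y-outlier near the middle: slope barely moves
print(fit(base + [(10, 3)]))   # x-outlier: slope nearly doubles -> influential
```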