Definition

The linear correlation coefficient r measures the strength of the linear relationship between the paired quantitative x- and y-values in a sample.

Requirements

1. The sample of paired (x, y) data is a simple random sample of quantitative data.
2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern.
3. Outliers must be removed if they are known to be errors. The effects of any other outliers should be considered by calculating r with and without the outliers included.

Notation for the Linear Correlation Coefficient

n = number of pairs of sample data.
$\sum$ denotes the addition of the items indicated.
$\sum x$ denotes the sum of all x-values.
$\sum x^2$ indicates that each x-value should be squared and then those squares added.
$(\sum x)^2$ indicates that the x-values should be added and then the total squared.
$\sum xy$ indicates that each x-value should first be multiplied by its corresponding y-value. After obtaining all such products, find their sum.
r = linear correlation coefficient for sample data.
$\rho$ = linear correlation coefficient for population data.

Formula

The value of r is given by:

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}} $$

Computer software or calculators can compute r.

Properties of the Linear Correlation Coefficient r

1. $-1 \le r \le 1$.
2. If all values of either variable are converted to a different scale, the value of r does not change.
3. The value of r is not affected by the choice of x and y. Interchange all x- and y-values and the value of r will not change.
4. r measures the strength of a linear relationship.
5. r is very sensitive to outliers; they can dramatically affect its value.

Example

Using software or a calculator, r is calculated automatically. Using the pizza and subway fare costs, we have found that the linear correlation coefficient is r = 0.988.
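The formula for r and two of its properties (unchanged under a change of scale, unchanged when x and y are interchanged) can be checked with a short script. This is only a minimal sketch: the function name `linear_correlation` and the five data pairs are hypothetical and chosen for illustration; they are not the pizza/subway fare data.

```python
import math


def linear_correlation(xs, ys):
    """Compute r from the raw sums: n, sum x, sum y, sum x^2, sum y^2, sum xy."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired values")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical illustrative data (an assumption, not taken from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = linear_correlation(xs, ys)

# Property 3: interchanging x and y leaves r unchanged.
r_swapped = linear_correlation(ys, xs)

# Property 2: converting x to a different scale (here, multiplying by 100)
# leaves r unchanged.
r_rescaled = linear_correlation([100 * x for x in xs], ys)

print(f"r = {r:.4f}, swapped = {r_swapped:.4f}, rescaled = {r_rescaled:.4f}")
```

The three printed values should agree up to floating-point rounding, which is what properties 2 and 3 claim.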
What proportion of the variation in the subway fare can be explained by the variation in the cost of a slice of pizza?

With r = 0.988, we get $r^2 = 0.976$. We conclude that 0.976 (or about 98%) of the variation in the cost of subway fares can be explained by the linear relationship between the costs of pizza and subway fares. This implies that about 2% of the variation in the cost of subway fares cannot be explained by the cost of pizza.

Common Errors Involving Correlation

1. Causation: It is wrong to conclude that correlation implies causality.
2. Averages: Averages suppress individual variation and may inflate the correlation coefficient.
3. Linearity: There may be some relationship between x and y even when there is no linear correlation.

Basic Concepts of Regression

Regression

The regression equation expresses a relationship between x (called the explanatory variable, predictor variable, or independent variable) and $\hat{y}$ (called the response variable or dependent variable). The typical equation of a straight line, y = mx + b, is expressed in the form $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope.

Definitions

Regression Equation: Given a collection of paired data, the regression equation $\hat{y} = b_0 + b_1 x$ algebraically describes the relationship between the two variables.

Regression Line: The graph of the regression equation is called the regression line (or line of best fit, or least-squares line).

Notation for Regression Equation

                                      Population Parameter        Sample Statistic
y-intercept of regression equation    $\beta_0$                   $b_0$
Slope of regression equation          $\beta_1$                   $b_1$
Equation of the regression line       $y = \beta_0 + \beta_1 x$   $\hat{y} = b_0 + b_1 x$

Requirements

1. The sample of paired (x, y) data is a random sample of quantitative data.
2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern.
3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.
Formulas for b0 and b1

$$ b_1 = r\,\frac{s_y}{s_x} \quad \text{(slope)} $$

$$ b_0 = \bar{y} - b_1 \bar{x} \quad \text{(y-intercept)} $$

Calculators or computers can compute these values.

Example

Refer to the sample data given in the table below. Use technology to find the equation of the regression line in which the explanatory variable (or x variable) is the cost of a slice of pizza and the response variable (or y variable) is the corresponding cost of a subway fare.

Requirements are satisfied: simple random sample; scatterplot approximates a straight line; no outliers.

Here are results from four different technologies.

[Figure: regression output from four different technologies]

All of these technologies show that the regression equation can be expressed as $\hat{y} = 0.0346 + 0.945x$, where $\hat{y}$ is the predicted cost of a subway fare and x is the cost of a slice of pizza.

Example

Graph the regression equation $\hat{y} = 0.0346 + 0.945x$ (from the preceding example) on the scatterplot of the pizza/subway fare data and examine the graph to subjectively determine how well the regression line fits the data.

Residuals

Definition: For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y-value that is predicted by using the regression equation. That is,

$$ \text{residual} = \text{observed } y - \text{predicted } y = y - \hat{y} $$

Definition: A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.

Definition: A residual plot is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value $y - \hat{y}$ (where $\hat{y}$ denotes the predicted value of y). That is, a residual plot is a graph of the points $(x, y - \hat{y})$.

Residual Plot Analysis

When analyzing a residual plot, look for a pattern in the way the points are configured, and use these criteria:

• The residual plot should not have an obvious pattern that is not a straight-line pattern.
• The residual plot should not become thicker (or thinner) when viewed from left to right.
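The slope and intercept formulas, the residual definition, and the least-squares property can be illustrated together. The following is a rough sketch under stated assumptions: the helper names (`fit_line`, `residuals`, `sum_of_squares`) and the five data pairs are hypothetical, and r is computed from deviations about the means, which is algebraically equivalent to the raw-sum formula for r.

```python
import statistics


def fit_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_x = statistics.stdev(xs)  # sample standard deviation of x
    s_y = statistics.stdev(ys)  # sample standard deviation of y
    # r written with deviations from the means (equivalent to the raw-sum form).
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Residual = observed y minus predicted y for each pair."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sum_of_squares(values):
    return sum(v * v for v in values)


# Hypothetical illustrative data (an assumption, not the pizza/subway table).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
res = residuals(xs, ys, b0, b1)
sse_fit = sum_of_squares(res)

# Least-squares property: nudging the intercept or the slope away from the
# fitted values should only increase the sum of squared residuals.
sse_shifted = sum_of_squares(residuals(xs, ys, b0 + 0.1, b1))
sse_tilted = sum_of_squares(residuals(xs, ys, b0, b1 + 0.05))

# Points of the residual plot: (x, y - y_hat).
residual_plot_points = list(zip(xs, res))

print(f"y_hat = {b0:.4f} + {b1:.4f} x")
print(f"SSE fitted = {sse_fit:.4f}, shifted = {sse_shifted:.4f}, tilted = {sse_tilted:.4f}")
```

Plotting `residual_plot_points` with any charting tool gives the residual plot described above; the two perturbed lines are just spot checks, not a proof of the least-squares property.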
[Figure: Residual Plot - Pizza/Subway]

[Figures: Residual Plots]

Definition

The coefficient of determination is the amount of the variation in y that is explained by the regression line.

$$ r^2 = \frac{\text{explained variation}}{\text{total variation}} $$

The value of $r^2$ is the proportion of the variation in y that is explained by the linear relationship between x and y.

Alternate Formula for r

$$ r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}} $$

$$ SS(x) = \text{"sum of squares for } x\text{"} = \sum x^2 - \frac{(\sum x)^2}{n} $$

$$ SS(y) = \text{"sum of squares for } y\text{"} = \sum y^2 - \frac{(\sum y)^2}{n} $$

$$ SS(xy) = \text{"sum of squares for } xy\text{"} = \sum xy - \frac{(\sum x)(\sum y)}{n} $$

Example

The table below presents the weight (in thousands of pounds) x and the gasoline mileage (miles per gallon) y for ten different automobiles. Find the linear correlation coefficient.

          x       y      x^2      y^2       xy
        2.5      40     6.25     1600    100.0
        3.0      43     9.00     1849    129.0
        4.0      30    16.00      900    120.0
        3.5      35    12.25     1225    122.5
        2.7      42     7.29     1764    113.4
        4.5      19    20.25      361     85.5
        3.8      32    14.44     1024    121.6
        2.9      39     8.41     1521    113.1
        5.0      15    25.00      225     75.0
        2.2      14     4.84      196     30.8
Sum    34.1     309   123.73    10665   1010.9

Completing the Calculation for r

$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449 $$

$$ SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9 $$

$$ SS(xy) = \sum xy - \frac{(\sum x)(\sum y)}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79 $$

$$ r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47 $$

The correlation is negative: in this sample, heavier automobiles tend to have lower gasoline mileage.

The Line of Best Fit Equation

• The equation is determined by:
  b0: y-intercept
  b1: slope
• Values that satisfy the least-squares criterion:

$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{SS(xy)}{SS(x)} $$

$$ b_0 = \frac{\sum y - b_1 \sum x}{n} = \bar{y} - b_1 \bar{x} $$

Example

A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represent the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals:

x    31   33   22   24   35   29   23   37
y    17   20   13   15   18   17   12   21

1) Draw a scatter diagram for this data.
2) Find the equation of the line of best fit.

Line of Best Fit

$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 7074 - \frac{(234)^2}{8} = 229.5 $$

$$ SS(xy) = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4009 - \frac{(234)(133)}{8} = 118.75 $$

$$ b_1 = \frac{SS(xy)}{SS(x)} = \frac{118.75}{229.5} = 0.5174 $$

$$ b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902 $$

(The value 1.4902 results from carrying the unrounded slope through the calculation.)

Solution

Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$

Scatter diagram:

[Figure: scatter diagram titled "Job Satisfaction Survey", with Job Satisfaction (12 to 22) on the vertical axis and Salary (21 to 37) on the horizontal axis]
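Both worked examples can be reproduced from the sum-of-squares formulas. The sketch below uses the automobile and job-satisfaction data given in the examples; the function names (`sum_of_squares`, `sum_of_cross_products`, `correlation`, `best_fit`) are hypothetical names chosen here. For the automobile data the computation gives SS(xy) = -42.79 and r ≈ -0.47; the sign is negative because mileage falls as weight rises.

```python
import math


def sum_of_squares(values):
    """SS(x) = sum of x^2 minus (sum of x)^2 / n."""
    n = len(values)
    return sum(v * v for v in values) - sum(values) ** 2 / n


def sum_of_cross_products(xs, ys):
    """SS(xy) = sum of xy minus (sum of x)(sum of y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def correlation(xs, ys):
    """r = SS(xy) / sqrt(SS(x) * SS(y))."""
    return sum_of_cross_products(xs, ys) / math.sqrt(
        sum_of_squares(xs) * sum_of_squares(ys))


def best_fit(xs, ys):
    """Return (b0, b1) with b1 = SS(xy)/SS(x) and b0 = (sum y - b1 * sum x)/n."""
    b1 = sum_of_cross_products(xs, ys) / sum_of_squares(xs)
    b0 = (sum(ys) - b1 * sum(xs)) / len(xs)
    return b0, b1


# Automobile example: weight (thousands of pounds) and mileage (mpg).
weight = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]
mileage = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]
r_auto = correlation(weight, mileage)

# Job satisfaction example: salary (x) and satisfaction score (y).
salary = [31, 33, 22, 24, 35, 29, 23, 37]
satisfaction = [17, 20, 13, 15, 18, 17, 12, 21]
b0, b1 = best_fit(salary, satisfaction)

print(f"automobiles: r = {r_auto:.2f}")
print(f"job satisfaction: y_hat = {b0:.2f} + {b1:.3f} x")
```

Working from SS(x), SS(y), and SS(xy) mirrors the hand calculation, so each intermediate value can be compared against the worked steps.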