Chapter 9 – Correlation and Regression

9.1 Correlation

A correlation is a relationship between two statistical variables measured from the same population. In this chapter, we will consider only linear correlation, which comes in three types:

Positive Linear Correlation - high values for one variable tend to correspond to high values for the second variable.
Examples:
- Height vs. Weight for adults
- Blood Alcohol Level vs. Reaction Time

Negative Linear Correlation - high values for one variable tend to correspond to low values for the second variable.
Examples:
- Age vs. Retail Value of a Ford F-150 Truck
- Blood Alcohol Level vs. Weight for adults after consuming one drink
- Reaction Time vs. Hours of Sleep for the previous night

No Linear Correlation - no relationship between the variables or a non-linear relationship.
Examples:
- Height vs. Number of Years of Education
- Natural Hair Color vs. Intelligence Quotient Score

Scatter Diagrams

One way to determine the type of linear correlation between two variables is by means of a scatter diagram. To construct a scatter diagram, we plot the value of one variable along the x-axis and the other along the y-axis, and then for each member of our population or sample group, we plot a point corresponding to the measurements of the individual. We can then determine the type of linear correlation as follows:

- Positive Linear Correlation: General trend in the plotted points is from bottom left to top right.
- Negative Linear Correlation: General trend in the plotted points is from top left to bottom right.
- No Linear Correlation: No general trend in plotted points, or a non-linear trend.

The strength of the linear correlation can be judged by looking at how closely the points approximate a straight line.

Example: The following table shows the Height (x) vs. Femur Length (y) measurements (both in inches) for 10 men:

| Height (x) | Femur Length (y) |
|---|---|
| 70.8 | 42.5 |
| 66.2 | 40.2 |
| 71.7 | 44.4 |
| 68.7 | 42.8 |
| 67.6 | 40 |
| 69.2 | 47.3 |
| 66.5 | 43.4 |
| 67.2 | 40.1 |
| 68.3 | 42.1 |
| 65.6 | 36 |

[Figure: Scatter Diagram for Height vs. Femur Length; x-axis: Height, y-axis: Length of Femur]

The diagram shows a ___________ linear correlation between the variables.

Example: The following table gives the weight (x) (in 1000 lbs.) and highway fuel efficiency (y) (in miles/gallon) for a sample of 13 cars.

| Vehicle | x | y |
|---|---|---|
| Chevrolet Camaro | 3.545 | 30 |
| Dodge Neon | 2.6 | 32 |
| Honda Accord | 3.245 | 30 |
| Lincoln Continental | 3.93 | 24 |
| Oldsmobile Aurora | 3.995 | 26 |
| Pontiac Grand Am | 3.115 | 30 |
| Mitsubishi Eclipse | 3.235 | 33 |
| BMW 3-Series | 3.225 | 27 |
| Honda Civic | 2.44 | 37 |
| Toyota Camry | 3.24 | 32 |
| Hyundai Accent | 2.29 | 37 |
| Mazda Protégé | 2.5 | 34 |
| Cadillac DeVille | 4.02 | 26 |

[Figure: Scatter Diagram for Weight vs. Highway MPG; x-axis: weight (1000 lbs), y-axis: MPG Highway]

The diagram indicates a _____________ linear correlation between the variables.

Coefficient of Correlation

A more precise method of determining the type and strength of a linear correlation is to calculate the coefficient of linear correlation (denoted by r) for the two variables using the formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

The coefficient of linear correlation will always be a number between -1 and 1, with a positive value indicating a positive correlation and a negative value a negative correlation. A coefficient of $r = 1$ for a data set indicates perfect positive linear correlation, and $r = -1$ indicates perfect negative linear correlation, while $r = 0$ would indicate no linear correlation. The closer the value of r is to $\pm 1$, the stronger the correlation, and the closer to zero, the weaker the correlation.
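The formula for r can be checked numerically. The following is a minimal sketch in Python, assuming the height vs. femur length data from the example above; the function name `correlation_coefficient` is our own choice for illustration and does not come from the notes.

```python
import math

def correlation_coefficient(xs, ys):
    """Coefficient of linear correlation r, computed from the column sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Height (x) and femur length (y) for the 10 men in the example
heights = [70.8, 66.2, 71.7, 68.7, 67.6, 69.2, 66.5, 67.2, 68.3, 65.6]
femurs = [42.5, 40.2, 44.4, 42.8, 40, 47.3, 43.4, 40.1, 42.1, 36]

r = correlation_coefficient(heights, femurs)
print(round(r, 3))  # 0.651
```

The function uses the same five sums (x, y, xy, x squared, y squared) that appear in the formula, so it can be applied to any paired data set of equal length.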
Calculating the Coefficient of Correlation

The coefficient of correlation between two variables is most easily calculated by constructing a table (see example below) with columns that contain the x and y variable values for each individual, the value of xy for each individual, and the values of $x^2$ and $y^2$ for each individual. The sum of each column is found, and these sums can then be substituted into the formula above to find r.

Example: Using our previous data set of height vs. femur length for 10 men, we get the table:

| Variable | x | y | xy | $x^2$ | $y^2$ |
|---|---|---|---|---|---|
| | 70.8 | 42.5 | 3009 | 5012.64 | 1806.25 |
| | 66.2 | 40.2 | 2661.24 | 4382.44 | 1616.04 |
| | 71.7 | 44.4 | 3183.48 | 5140.89 | 1971.36 |
| | 68.7 | 42.8 | 2940.36 | 4719.69 | 1831.84 |
| | 67.6 | 40 | 2704 | 4569.76 | 1600 |
| | 69.2 | 47.3 | 3273.16 | 4788.64 | 2237.29 |
| | 66.5 | 43.4 | 2886.1 | 4422.25 | 1883.56 |
| | 67.2 | 40.1 | 2694.72 | 4515.84 | 1608.01 |
| | 68.3 | 42.1 | 2875.43 | 4664.89 | 1772.41 |
| | 65.6 | 36 | 2361.6 | 4303.36 | 1296 |
| Sum | 681.8 | 418.8 | 28589.09 | 46520.4 | 17622.76 |

The coefficient of correlation for the variables is thus:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$

$$r = \frac{10(28589.09) - (681.8)(418.8)}{\sqrt{10(46520.4) - (681.8)^2}\,\sqrt{10(17622.76) - (418.8)^2}} = \frac{353.06}{\sqrt{352.76}\,\sqrt{834.16}} = \frac{353.06}{542.4558} \approx .651$$

Exercise: Calculate the coefficient of correlation for the vehicle weight and miles per gallon data sets. The table of variables is given below:

| Variable | x | y | xy | $x^2$ | $y^2$ |
|---|---|---|---|---|---|
| | 3.545 | 30 | 106.35 | 12.567025 | 900 |
| | 2.6 | 32 | 83.2 | 6.76 | 1024 |
| | 3.245 | 30 | 97.35 | 10.530025 | 900 |
| | 3.93 | 24 | 94.32 | 15.4449 | 576 |
| | 3.995 | 26 | 103.87 | 15.960025 | 676 |
| | 3.115 | 30 | 93.45 | 9.703225 | 900 |
| | 3.235 | 33 | 106.755 | 10.465225 | 1089 |
| | 3.225 | 27 | 87.075 | 10.400625 | 729 |
| | 2.44 | 37 | 90.28 | 5.9536 | 1369 |
| | 3.24 | 32 | 103.68 | 10.4976 | 1024 |
| | 2.29 | 37 | 84.73 | 5.2441 | 1369 |
| | 2.5 | 34 | 85 | 6.25 | 1156 |
| | 4.02 | 26 | 104.52 | 16.1604 | 676 |
| Sums | 41.38 | 398 | 1240.58 | 135.93675 | 12388 |

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} =$$

Significance of the Coefficient of Correlation

When the coefficient of correlation is calculated from sample data sets, there is a chance that a linear correlation will be found when, in fact, no correlation exists between the population variables.
Therefore, before deciding that a linear correlation exists between two variables when using sample data, we will run a test for significance. The population parameter representing the coefficient of correlation for population data is denoted by $\rho$ (rho), and we use the sample coefficient r to determine if the hypothesis $H_0: \rho = 0$ can be rejected. This is in fact a two-tailed t-test, but the resulting critical r values for the .05 and .01 levels of significance are listed in Table 11 on page A 28.

Example: The height vs. femur length data set (n = 10) has a coefficient of correlation of $r = .651$; thus this correlation is significant at the .05 level of significance (r > .632) but not at the .01 level of significance (which requires r > .765).

Note that as the sample size grows, r can be _________________ to reject the null hypothesis that $\rho = 0$ and conclude that the coefficient of correlation, r, is significant (i.e., not actually zero).

Exercise: Use Table 11 to determine the level(s) of significance for the vehicle weight vs. highway mpg data.

9.2 Linear Regression

If a pair of variables has a significant linear correlation, then the relationship between the data values can be roughly approximated by a linear equation. The process of finding the linear equation which best fits the data values is known as linear regression, and the line of best fit is called the regression line.

It is a fact of linear algebra and analysis that the least squares line of best fit to a set of data values has an equation of the form $\hat{y} = mx + b$ where:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad \text{and} \qquad b = \bar{y} - m\bar{x} = \frac{\sum y - m\sum x}{n}$$

Example: For the vehicle weight vs. highway mileage data set, we have:

$$m = \frac{13(1240.58) - (41.38)(398)}{13(135.937) - (41.38)^2} = \frac{-341.7}{54.877} \approx -6.23$$

and

$$b = \frac{398 - (-6.23)(41.38)}{13} = \frac{655.797}{13} \approx 50.45$$

so our regression line is given by the equation $\hat{y} = -6.23x + 50.45$. The graph of this line is shown on the scatter diagram for the data set below.

[Figure: Vehicle Weight vs. MPG Highway scatter diagram with regression line; x-axis: weight (1000 lbs), y-axis: MPG Highway]

Exercise: Find the equation of the regression line for the height vs. femur length data.

Using the Regression Line to Predict Data Values

The primary use for the regression equation is to predict values for one variable given a value for the other variable. Meaningful predictions can only be made for values within the range of the original data.

Example: Using our regression equation for the car data, we could estimate that a car that weighed 3000 lbs ($x = 3$) would have a highway mpg of $\hat{y} = -6.23(3) + 50.45 = 31.76$. Likewise, if we knew a car's highway mpg was 36 mpg, then we would estimate its weight by solving $36 = -6.23x + 50.45$ to get $x \approx 2.319$, or a car that weighs 2319 lbs. A margin of error could also be added on to these estimates to generate a confidence interval (using the t-distribution), but we will not cover this in this class.

Note: it would not be meaningful to either predict the mileage of a car weighing 5,000 lbs or predict the weight of a car getting 50 mpg, because neither of these values is within the range of the original data.

Exercise: Suppose a crime scene investigator digs up the femur of a man and finds that it is 38.5 inches long. Based on our regression line for the height vs. femur length data, what would we estimate the man's height to have been?

Finding Correlation and Regression Using the TI-83

The coefficient of correlation for a set of paired data values can be found using the TI-83 by placing the x values in L1 and the y values in L2, and then running the LinRegTTest program. To run the LinRegTTest program:

- Press the STAT key.
- Use the arrow keys to select the TESTS menu.
- Choose number E: LinRegTTest by pressing ENTER.
- When you are presented with the menu, choose:
  Xlist: L1
  Ylist: L2
  Freq: 1
  β & ρ: ≠0
  RegEQ: (blank)
- Choose Calculate and press ENTER.
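The following will be displayed:

  a = y-intercept of the regression equation (b)
  b = slope of the regression equation (m)
  r = coefficient of correlation (r)

To graph the scatter diagram together with the regression line:

- Press 2nd, then Y= (STAT PLOT).
- Highlight Plot 1 and press ENTER; turn the plot On, with Xlist: L1 and Ylist: L2.
- Press GRAPH, then ZOOM, and choose 9: ZoomStat.
- Press STAT, select the CALC menu, and choose 4: LinReg(ax+b). Enter L1, L2, and then Y1 (press VARS, select Y-VARS, choose 1: Function, and choose Y1).
- Press ENTER, then GRAPH.

Note that LinReg(ax+b) reports a as the slope and b as the y-intercept, the reverse of the LinRegTTest display.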
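As an alternative to the calculator, the regression line from Section 9.2 can be reproduced in a few lines of code. The following is a minimal sketch in Python, assuming the vehicle weight vs. highway mpg data from the earlier example; the helper names `regression_line` and `predict` are hypothetical and chosen only for illustration. Because the code carries full precision rather than rounding m to -6.23 before computing b, its results may differ from the hand calculation in the last decimal place.

```python
def regression_line(xs, ys):
    """Return slope m and intercept b of the least squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

def predict(x, m, b, xs):
    """Predict y for x, refusing values outside the range of the original data."""
    if not min(xs) <= x <= max(xs):
        raise ValueError("x is outside the range of the original data")
    return m * x + b

# Weight in 1000 lbs (x) and highway mpg (y) for the 13 cars in the example
weights = [3.545, 2.6, 3.245, 3.93, 3.995, 3.115, 3.235,
           3.225, 2.44, 3.24, 2.29, 2.5, 4.02]
mpg = [30, 32, 30, 24, 26, 30, 33, 27, 37, 32, 37, 34, 26]

m, b = regression_line(weights, mpg)
print(round(m, 2))                          # slope, close to -6.23
print(predict(3, m, b, weights))            # estimated mpg for a 3000 lb car
```

The range check in `predict` mirrors the note above: a request such as `predict(5, m, b, weights)` raises an error instead of returning an estimate for a 5,000 lb car.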