Math 3680 Lecture #19: Correlation and Regression

The Correlation Coefficient: Limitations

Consider the following data, where Y = X^2 exactly:

    X         -5        -4        -1        2         3         5        AVG = 0         SD = 4
    Y         25        16        1         4         9         25       AVG = 13.33333  SD = 10.36661
    X (su)    -1.25     -1        -0.25     0.5       0.75      1.25
    Y (su)    1.1254    0.2572    -1.1897   -0.9003   -0.418    1.1254
    Product   -1.40676  -0.25724  0.297429  -0.45016  -0.31351  1.40676

    r = -0.1447

[Scatter diagram: the points lie exactly on the parabola Y = X^2, yet r is nearly zero.]

Moral: Correlation coefficients only measure linear association.

    Film starring Matthew McConaughey   x, Minutes Shirtless   y, Opening Weekend Gross (millions of dollars)
    We Are Marshall                     0                      6.1
    ED tv                               0.8                    8.3
    Reign of Fire                       1.6                    15.6
    Sahara                              1.8                    18.1
    Fool's Gold                         14.6                   21.6

    Correlation = 0.733965

With the outlier (Fool's Gold) removed, the same data give Correlation = 0.966256.

[Two scatter diagrams: all five films, and the four films without the outlier.]

Moral: Correlation coefficients are most appropriate for football-shaped scatter diagrams and can be very sensitive to outliers.

Regression

The heights and weights from a survey of 988 men are shown in the scatter diagram.

    Avg height = 70 in, SD height = 3 in
    Avg weight = 162 lb, SD weight = 30 lb
    r = 0.47

Example. Suppose a man is one SD above average in height (70 + 3 = 73 inches). Should you guess his weight to be one SD above average (162 + 30 = 192 pounds)?

Solution: No. Notice that maybe 10 or 11 of the men 73 inches tall have weights above 192 pounds, while dozens have weights below 192 pounds. A better prediction is obtained by going up not a full SD but r SDs:

    Prediction = Average + (r)(# of SDs)(SD) = 162 + (0.47)(1)(30) = 176.1 lb

This is our second interpretation of the correlation coefficient.

[Scatter diagram: the prediction (73, 176.1) and the point of averages (70, 162) lie on a line that rises (0.47)(30) lb over 3 inches; its slope is r(s_y/s_x) = (0.47)(30/3).]

Example: Predict the height of a man who is 176.1 pounds. Does this contradict our previous example?
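The standard-units calculation of r above can be sketched in Python. This is a minimal sketch using only the standard library; `stdev` divides by n − 1, matching the SDs in the table. The same function reproduces both morals: r is near zero for the exact parabola, and the film correlation jumps once the outlier is dropped.

```python
# Correlation as the average (with n-1) of products of standard units.
from statistics import mean, stdev

def corr(xs, ys):
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # stdev() divides by n-1, as in the table
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Y = X^2 exactly, yet the correlation is nearly zero.
X = [-5, -4, -1, 2, 3, 5]
Y = [x**2 for x in X]
print(round(corr(X, Y), 4))                        # -0.1447

# Outlier sensitivity: the McConaughey data, with and without Fool's Gold.
shirtless = [0, 0.8, 1.6, 1.8, 14.6]
gross     = [6.1, 8.3, 15.6, 18.1, 21.6]
print(round(corr(shirtless, gross), 6))            # 0.733965
print(round(corr(shirtless[:-1], gross[:-1]), 6))  # 0.966256
```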
Example: Predict the weight of a man who is 5'6". Where does this prediction appear in the diagram?

Notice that these points are displayed on the solid line in the diagram. This line is called the regression line. To obtain this line, start at the point of averages and draw a line with slope r(s_y/s_x). In other words, the equation of the regression line is

    y-hat = y-bar + r (s_y / s_x)(x - x-bar)

Reverse the roles of x and y when predicting in the other direction.

Example: Find the equation of the regression line for the height-weight diagram.

Example: A university has made a statistical analysis of the relationship between SAT-M scores and first-year GPA. The results are:

    Average SAT-M score = 550, SD = 80
    Average first-year GPA = 2.6, SD = 0.6
    r = 0.4

The scatter diagram is football-shaped. Find the equation of the regression line. Then predict the first-year GPA of a randomly chosen student with an SAT-M score of 650.

Both Excel and TI calculators are capable of computing and visualizing regression lines. (See book p. 426.) In Excel 2007, highlight the x- and y-values and use Insert, Scatter to draw a scatter plot. Click the data points, and then right-click, Add Trendline to see the regression line.

    Age (mo)      24   48   60   96   63   39   63   39
    Height (cm)   87  101  120  159  135  104  126   96

[Age (mo) Line Fit Plot: Height (cm) and Predicted Height (cm) plotted against Age (mo).]

The Regression Effect

For a study of 1,078 fathers and sons:

    Average fathers' height = 68 in, SD = 2.7 in
    Average sons' height = 69 in, SD = 2.7 in
    r ≈ 0.5

Suppose a father is 72 inches tall. How tall would you predict his son to be? Repeat for a father who is 64 inches tall.

Notice that tall fathers tend to have tall sons – though sons who are not as tall. Likewise, short fathers on average will have short sons – just not as short.
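The regression-line prediction can be packaged as one small function of the five summary statistics. This is a sketch, not the book's notation: `predict` below simply evaluates y-bar + r(s_y/s_x)(x − x-bar) for the lecture's examples (the 5'6" man is 66 inches).

```python
# Regression prediction from summary statistics:
# y_hat = y_bar + r * (sy / sx) * (x - x_bar)
def predict(x, x_bar, sx, y_bar, sy, r):
    return y_bar + r * (sy / sx) * (x - x_bar)

# SAT-M -> first-year GPA: slope = 0.4 * (0.6/80) = 0.003, so 2.6 + 0.003*100
print(round(predict(650, 550, 80, 2.6, 0.6, 0.4), 2))   # 2.9

# Height -> weight for a 5'6" (66 in) man: 162 + (0.47)(30/3)(66 - 70)
print(round(predict(66, 70, 3, 162, 30, 0.47), 1))      # 143.2

# Fathers -> sons: the regression effect in action
print(round(predict(72, 68, 2.7, 69, 2.7, 0.5), 1))     # 71.0 (tall, but less so)
print(round(predict(64, 68, 2.7, 69, 2.7, 0.5), 1))     # 67.0 (short, but less so)
```

Note that the 72-inch father is 4 inches above average, but his son is predicted to be only 2 inches above average, since r = 0.5.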
Hence the term "regression." A pioneering but aristocratic statistician (Galton) called this effect "regression toward mediocrity," and the term has stuck. There is no biological cause for this effect – it is strictly statistical. Thinking that the regression effect is due to something important is called the regression fallacy.

Example: A preschool program attempts to boost students' IQs. The children are tested when they enter the program (pretest) and again when they leave the program (post-test). On both occasions, the average IQ score was 100, with an SD of 15. Also, students with below-average IQs on the pretest had scores that went up by an average of 5 points, while students with above-average scores on the pretest had their scores drop by an average of 5 points. What is going on? Does the program equalize intelligence?

Example. Suppose someone gets a score of 140 on the pretest. Does this mean that the student has an IQ of exactly 140?

Solution: No. There will always be chance error associated with the measurement. For the sake of argument, let's assume that the chance error is equal to 5 points. Then there are two likely explanations:

    Actual IQ of 135, with a chance error of +5
    Actual IQ of 145, with a chance error of -5

Which of the two is the more likely explanation? This explains the regression effect: if someone scores above average on the first test, we would estimate that the true score is probably a bit lower than the observed score.

Example: An instructor gives a midterm. She asks the students who score 20 points below average to see her regularly during her office hours for special tutoring. They all scored at least average on the final. Can this improvement be attributed to the regression effect?

Regression and Error Estimation

    n = 988
    Avg height = 70 in, SD height = 3 in
    Avg weight = 162 lb, SD weight = 30 lb
    r = 0.47

For a man 73 in tall, we predict a weight of 162 + (0.47)(1)(30) = 176.1 lb.

Next question: what is the error for this estimate?
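The chance-error explanation can be checked by simulation. The sketch below assumes a hypothetical model of the preschool example: each observed score is a fixed true IQ plus an independent chance error with SD 5 (true-IQ SD chosen as sqrt(200) so the observed SD is 15). Nothing changes the true IQs between the two tests, yet the below-average pretest group rises on average and the above-average group falls, purely by the regression effect.

```python
# Regression-effect simulation: observed score = true IQ + chance error.
# Model parameters here are illustrative assumptions, not data from the lecture.
import math
import random

random.seed(0)
N = 100_000
true_iq = [random.gauss(100, math.sqrt(200)) for _ in range(N)]   # observed SD = 15
pre  = [t + random.gauss(0, 5) for t in true_iq]                  # pretest
post = [t + random.gauss(0, 5) for t in true_iq]                  # post-test: same true IQs

below = [post[i] - pre[i] for i in range(N) if pre[i] < 100]
above = [post[i] - pre[i] for i in range(N) if pre[i] > 100]
print(sum(below) / len(below))   # positive: below-average pretest scores rise
print(sum(above) / len(above))   # negative: above-average pretest scores fall
```

No true IQ changed at all, so the apparent "equalizing" of scores is pure chance error, exactly the regression fallacy the lecture warns about.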
Based on the picture, is it 30 lb? Or less?

THEOREM. Assuming the data is normally distributed, we have

    se = s_y * sqrt(1 - r^2) * sqrt((n - 1)/(n - 2))

For the current example, that means

    se = 30 * sqrt(1 - (0.47)^2) * sqrt(987/986) ≈ 26.5

The weight is therefore estimated as 176.1 ± 26.5 lb.

Note. If n is large, the last factor sqrt((n - 1)/(n - 2)) may be safely ignored. In other words, if n is large, then

    se ≈ s_y * sqrt(1 - r^2)

Example: For a study of 1,078 fathers and sons:

    Fathers: Average height = 68 in, SD = 2.7 in
    Sons: Average height = 69 in, SD = 2.7 in
    r = 0.5

Suppose a father is 63 inches tall. What percentage of such fathers have sons who are at least 66 inches tall?

Testing Correlation

Recall the equation of the regression line:

    y-hat = y-bar + r (s_y / s_x)(x - x-bar)

so that the slope of the regression line is b = r (s_y / s_x). The standard error for the slope is given by

    SE_b = se / (s_x * sqrt(n - 1)) = (s_y / s_x) * sqrt((1 - r^2)/(n - 2)) = (b / r) * sqrt((1 - r^2)/(n - 2))

Under the assumptions of:
1. normality, and
2. homoscedasticity (see below),
the t distribution with df = n - 2 may be used to find confidence intervals and perform hypothesis tests. Homoscedasticity means that the variability of the data about the regression line does not depend on the value of x.

For the height-weight data (n = 988, avg height = 70 in, SD = 3 in, avg weight = 162 lb, SD = 30 lb, r = 0.47):

    b = (0.47)(30/3) = 4.7
    SE_b = (4.7/0.47) * sqrt((1 - (0.47)^2)/986) ≈ 0.281

For df = 986, the Student t distribution is almost normal. So a 95% confidence interval for the slope is

    4.7 ± (1.96)(0.281) = 4.7 ± 0.55

The corresponding confidence interval for the correlation coefficient for all men is 0.47 ± 0.055.

Example: For a study of 1,078 fathers and sons:

    Fathers: Average height = 68 in, SD = 2.7 in
    Sons: Average height = 69 in, SD = 2.7 in
    r = 0.5

Test the hypothesis that the correlation coefficient for all fathers and sons is positive.
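The error and slope-SE formulas above can be checked numerically. This sketch just plugs the lecture's n = 988 height-weight summary statistics into se and SE_b, then forms the 95% confidence interval for the slope; the final line applies the same SE_b formula to the fathers-sons study as a rough test statistic for the last example.

```python
# Error estimation and slope SE from the lecture's summary statistics.
import math

n, sx, sy, r = 988, 3.0, 30.0, 0.47

# se = sy * sqrt(1 - r^2) * sqrt((n-1)/(n-2))
se = sy * math.sqrt(1 - r**2) * math.sqrt((n - 1) / (n - 2))
print(round(se, 1))                        # 26.5 lb

# Slope and its standard error: SE_b = (b/r) * sqrt((1 - r^2)/(n - 2))
b = r * sy / sx
se_b = (b / r) * math.sqrt((1 - r**2) / (n - 2))
print(round(b, 2), round(se_b, 3))         # 4.7 0.281

# 95% CI for the slope (df = 986, so t is essentially normal)
lo, hi = b - 1.96 * se_b, b + 1.96 * se_b
print(round(lo, 2), round(hi, 2))          # 4.15 5.25

# Fathers-sons (n = 1078, r = 0.5, b = 0.5): t = b / SE_b is enormous,
# so the hypothesis of positive correlation is overwhelmingly supported.
t = 0.5 / ((0.5 / 0.5) * math.sqrt((1 - 0.5**2) / (1078 - 2)))
print(round(t, 1))                         # 18.9
```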