Regression

Correlation and regression are closely related in use and in math. Correlation summarizes the relation between two variables. Regression is used to predict values of one variable from values of the other (e.g., SAT to predict GPA).

Basic Ideas (2)
Sample value: \( Y_i = a + bX_i + e_i \)
Intercept (a): the value of Y where X = 0.
Slope (b): the change in Y if X changes 1 unit. Rise over run.
If error is removed, we have a predicted value for each person at X (the line): \( Y' = a + bX \)
Suppose on average houses are worth about $75.00 a square foot. Then the equation relating price to size would be Y' = 0 + 75X. The predicted price for a 2000 square foot house would be $150,000.

Linear Transformation
\( Y = a + bX \)
[Figure: two plots of Y against X, titled "Add a constant" and "Multiply by a constant".]
1 to 1 mapping of variables via a line.
Permissible operations are addition and multiplication (interval data).

Linear Transformation (2)
Centigrade to Fahrenheit
[Figure: line plot of Degrees F against Degrees C, marking the points (0 degrees C, 32 degrees F) and (100 degrees C, 212 degrees F). Note the 1 to 1 map. Intercept? Slope?]
Intercept is 32. When X (Cent) is 0, Y (Fahr) is 32.
Slope is 1.8. When Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212 - 32 = 180. Then 180/100 = 1.8 is rise over run, which is the slope. Y = 32 + 1.8X. F = 32 + 1.8C.

Regression Line (1) Basics
[Figure: scatterplot with regression line.]
1. Passes through both means.
2. Passes close to the points. Note errors.
3. Described by an equation.

Regression Line (2) Slope
Plot of Weight by Height (mean weight = 150.7 lbs.)
The equation for a line is Y = mX + b in algebra.
In regression, the equation is usually written Y = a + bX.
Regression line: Weight = -327 + 7.15*Height (mean height = 66.8 inches).
Y is the DV (weight), X is the IV (height), a is the intercept (-327), and b is the slope (7.15).
The slope, b, indicates rise over run. It tells how many units Y changes for a 1 unit change in X. In our example, the slope is a bit over 7, so a difference of 1 inch in height is expected to go with a difference of a bit more than 7 pounds in weight.

Regression Line (3) Intercept
[Figure: Plot of Weight by Height with the regression line Weight = -327 + 7.15*Height; mean weight = 150.7 lbs., mean height = 66.8 inches.]
The Y intercept, a, tells where the line crosses the Y axis; it's the value of Y when X is zero.
The intercept is calculated by: \( a = \bar{Y} - b\bar{X} \)
Sometimes the intercept has meaning; sometimes not. It depends on the meaning of X = 0. In our example, the intercept is -327. This means that if a person were 0 inches tall, we would expect them to weigh -327 lbs. Nonsense. But if X were the number of smiles, then a would have meaning.

Correlation & Regression
Correlation & regression are closely related.
1. The correlation coefficient is the slope of the regression line if X and Y are measured as z scores. It is interpreted as the change in Y, in SD units, for a change of 1 SD in X.
2. For raw scores, the slope is: \( b = r \frac{SD_Y}{SD_X} \)
The slope for raw scores is the correlation times the ratio of the 2 standard deviations. (These SDs are computed with (N-1), not N.) In our example, the correlation was .96, so the slope can be found by b = .96*(33.95/4.54) = .96*7.48 = 7.18, close to the fitted slope of 7.15; the small gap is consistent with rounding in r and the SDs.
Recall that \( a = \bar{Y} - b\bar{X} \). Our intercept is 150.7 - 7.15*66.8 ≈ -327.

Correlation & Regression (2)
3. The regression equation is used to make predictions. The formula to do so is just: \( Y' = a + bX \)
Suppose someone is 68 inches tall. Predicted weight is -327 + 7.15*68 = 159.2.
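As a minimal sketch, the three relations in this part (slope from r and the two SDs, intercept from the means, and prediction from the line) can be written as small Python functions. The function names are illustrative and do not come from the slides; the numbers are the rounded values quoted for the weight and height example, so the results agree with the slides only up to rounding.

```python
def slope_from_r(r, sd_y, sd_x):
    """Raw-score slope: b = r * SD_Y / SD_X."""
    return r * sd_y / sd_x


def intercept_from_means(mean_y, mean_x, b):
    """Intercept: a = mean(Y) - b * mean(X)."""
    return mean_y - b * mean_x


def predict(a, b, x):
    """Predicted value on the line: Y' = a + b * X."""
    return a + b * x


# Rounded values quoted in the weight/height example.
b_from_r = slope_from_r(0.96, 33.95, 4.54)    # near 7.18 with r rounded to .96
a = intercept_from_means(150.7, 66.8, 7.15)   # near -327
weight_68 = predict(-327, 7.15, 68)           # near 159.2

print(f"slope from r: {b_from_r:.2f}")
print(f"intercept:    {a:.2f}")
print(f"predicted weight at 68 inches: {weight_68:.1f}")
```

Because r is quoted to two decimals, the slope computed this way lands near 7.18 instead of exactly 7.15; the intercept and the prediction use the fitted slope of 7.15 directly.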
[Figure: Estimating Y for X = 3 on the regression line Y = 2 + .5*X (intercept = 2, slope = .5): Y = 2 + .5(3) = 3.5.]

Review
What is the slope? What does it tell or mean?
What is the intercept? What does it tell or mean?
How are the slope of the regression line and the correlation coefficient related?
What is the main use of the regression line?

Test Questions
[Figure: four scatterplots labeled A, B, C, D. Axis labels appearing in the panels: Miles per Gallon, Time to Accelerate from 0 to 60 mph (sec), Vehicle Weight (lbs.), Engine Displacement (cu. inches), Model Year (modulo 100).]
What is the approximate value of the intercept for Figure C?
a. 0
b. 10
c. 15
d. 20

In a regression line, the equation used is typically \( Y' = a + bX \). What does the value a stand for?
a. independent variable
b. intercept
c. predicted value (DV)
d. slope

Regression of Weight on Height
Ht    Wt
61    105
62    120
63    120
65    160
65    120
68    145
69    175
70    160
72    185
75    210
Ht: N=10, M=67, SD=4.57
Wt: N=10, M=150, SD=33.99
Correlation (r) = .94.
Regression equation: Y' = -316.86 + 6.97X
[Figure: scatterplot of weight by height with the regression line.]

Predicted Values & Errors
\( Y = a + bX + e \)
Numbers for linear part and error.
Note M of Y' and Residuals. Note variance of Y is V(Y') + V(res).
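The ten height and weight pairs listed above are enough to refit the line and check the variance note. The sketch below is one way to do that in plain Python, under these assumptions: the slope is computed with the least-squares formula b = Sxy/Sxx (equivalent to r times SD_Y/SD_X), variances use N - 1, and the helper name fit_line is made up for illustration.

```python
from statistics import mean, variance

heights = [61, 62, 63, 65, 65, 68, 69, 70, 72, 75]
weights = [105, 120, 120, 160, 120, 145, 175, 160, 185, 210]


def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx        # slope
    a = my - b * mx      # intercept: a = mean(Y) - b * mean(X)
    return a, b


a, b = fit_line(heights, weights)
predicted = [a + b * h for h in heights]                  # Y'
residuals = [w - p for w, p in zip(weights, predicted)]   # Y - Y'

var_y = variance(weights)       # sample variance (N - 1 denominator)
var_pred = variance(predicted)
var_res = variance(residuals)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"V(Y) = {var_y:.2f}, V(Y') + V(res) = {var_pred + var_res:.2f}")
```

On these data the fit gives a slope near 6.97 and an intercept near -316.86, and the variance of Y equals the variance of the predicted values plus the variance of the residuals up to floating-point error. The individual variances may differ in the second decimal from values computed on rounded predictions.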
          Ht      Wt       Y'        Error
1         61      105      108.19    -3.19
2         62      120      115.16    4.84
3         63      120      122.13    -2.13
4         65      160      136.06    23.94
5         65      120      136.06    -16.06
6         68      145      156.97    -11.97
7         69      175      163.94    11.06
8         70      160      170.91    -10.91
9         72      185      184.84    0.16
10        75      210      205.75    4.25
M         67      150      150.00    0.00
SD        4.57    33.99    31.85     11.89
Variance  20.89   1155.56  1014.37   141.32

Error variance
\( S^2_{Y'} = \frac{\sum (Y - Y')^2}{N} \)
\( S^2_{Y'} = S^2_Y (1 - r^2) \)
(Heiman's notation for error is not standard.)
In our example, r = .94; r^2 = .88 (.878 before rounding).
\( S^2_{Y'} = \frac{\sum (Y - Y')^2}{N} = 141.32 \) (the tabled variances divide by N - 1 = 9)
\( S^2_{Y'} = S^2_Y (1 - r^2) = 1156 \times (1 - .878) \approx 141 \)
Standard error of the estimate: roughly the average distance of the scores from their predicted values.
\( S_{Y'} = S_Y \sqrt{1 - r^2} \)
In our example, \( S_{Y'} = \sqrt{141.32} \approx 12 \)

Variance Accounted for
\( r^2 = 1 - \frac{S^2_{Y'}}{S^2_Y} \)
(Heiman's notation for error is not standard.)
The basic idea is to try to maximize r-square, the variance accounted for. The closer this value is to 1.0, the more accurate the predictions will be.

Sample Exam Data from Previous Class
Exam 1  Exam 2
86.00 98.00 70.00 84.00 82.00 92.00 92.00 72.00 96.00 82.00 56.00 70.00 76.00 82.00 74.00 94.00 78.00 56.00 66.00 72.00
A sample of 10 scores from both exams.
Assuming these are representative, what can you say about the exams? The students?

Scatterplot & Boxplots of 2 Exams
[Figure: scatterplot and boxplots of Exam 1 and Exam 2.]

Descriptive Stats
Descriptives
                    Exam1      Exam2
Mean                83.4412    70.7721
Std. Error of Mean  .89508     1.27332
Median              86.0000    72.0000
Variance            108.959    220.503
Std. Deviation      10.43837   14.84935
Minimum             52.00      24.00
Maximum             100.00     100.00
Range               48.00      76.00

Correlations
                              Exam1     Exam2
Exam1  Pearson Correlation    1         .420**
       Sig. (2-tailed)                  .000
       N                      165       136
Exam2  Pearson Correlation    .420**    1
       Sig. (2-tailed)        .000
       N                      136       139
**. Correlation is significant at the 0.01 level (2-tailed).

Scatterplot with means and regression line
Note that the correlation, r, is .42 and the squared correlation, R2, is .177.
R2 is also the variance accounted for. We can predict a bit less than 20 percent of the variance in Exam 2 from Exam 1.

Coefficients (Dependent Variable: Exam2)
              Unstandardized          Standardized
Model 1       B         Std. Error    Beta      t        Sig.
(Constant)    20.895    9.377                   2.228    .028
Exam1         .598      .112          .420      5.360    .000

Predicted Scores
\( Y' = a + bX \)
Predicted Exam 2 = 20.895 + .598*Exam1
For example, if I got 85 on Exam 1, then my predicted score for Exam 2 is 20.895 + .598*85 = 71.73, or about 72 percent.
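The prediction above, and the link between the raw slope and the standardized Beta, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the constant and function names are made up here, and the inputs are the rounded values from the coefficients and descriptives output, so results match the SPSS figures only to about three decimals.

```python
# Rounded values copied from the SPSS output for the exam data.
INTERCEPT = 20.895    # (Constant) B
SLOPE = 0.598         # Exam1 B (unstandardized)
SD_EXAM1 = 10.43837   # Std. Deviation of Exam 1
SD_EXAM2 = 14.84935   # Std. Deviation of Exam 2
R = 0.420             # Pearson correlation between the exams


def predict_exam2(exam1_score):
    """Predicted Exam 2 score: Y' = a + b * X."""
    return INTERCEPT + SLOPE * exam1_score


# Rearranging b = r * SD_Y / SD_X gives the standardized slope
# (Beta) as b * SD_X / SD_Y, which should be close to r.
beta = SLOPE * SD_EXAM1 / SD_EXAM2

pred_85 = predict_exam2(85)
r_squared = R ** 2    # variance accounted for, from the rounded r

print(f"predicted Exam 2 for 85 on Exam 1: {pred_85:.2f}")
print(f"beta from raw slope: {beta:.3f}")
print(f"r squared from rounded r: {r_squared:.3f}")
```

With one predictor, Beta recovered from the raw slope comes out close to the correlation of .420, which is the raw-score and z-score relation stated earlier for the slope.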