Regression Basics: Predicting a DV with a Single IV

Questions
• What are predictors and criteria?
• Write an equation for the linear regression. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• How do we find the slope and intercept for the regression line with a single independent variable? (Either formula for the slope is acceptable.)
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?

Basic Ideas
• Jargon
  – IV = X = Predictor (pl. predictors)
  – DV = Y = Criterion (pl. criteria)
  – Regression of Y on X, e.g., GPA on SAT
• Linear model = the relation between IV and DV is represented by a straight line: $Y_i = \alpha + \beta X_i + \varepsilon_i$ (population values).
• A score on Y has 2 parts: (1) a linear function of X and (2) error.

Basic Ideas (2)
• Sample value: $Y_i = a + bX_i + e_i$
• Intercept – the value of Y where X = 0.
• Slope – the change in Y if X changes 1 unit. Rise over run.
• If error is removed, we have a predicted value for each person at X (the line): $Y' = a + bX$

Suppose on average houses are worth about $75.00 a square foot. Then the equation relating price to size would be Y' = 0 + 75X. The predicted price for a 2000-square-foot house would be $150,000.

Linear Transformation
$Y = a + bX$
• 1-to-1 mapping of variables via a line
• Permissible operations are addition and multiplication (interval data)

[Figure: two panels plotting Y against X. Left panel, "Add a constant": adding a constant shifts the line up without changing its slope. Right panel, "Multiply by a constant": multiplying by a constant changes the slope.]

Linear Transformation (2)
$Y = a + bX$
Centigrade to Fahrenheit. Note the 1-to-1 map.

[Figure: line relating Degrees C (x-axis, 0 to 120) to Degrees F (y-axis, 0 to 240), with the points (0 °C, 32 °F) and (100 °C, 212 °F) marked. Intercept? Slope?]

• Intercept is 32. When X (Cent) is 0, Y (Fahr) is 32.
• Slope is 1.8. When Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212 − 32 = 180. Then 180/100 = 1.8, rise over run, is the slope.
• Y = 32 + 1.8X, i.e., F = 32 + 1.8C.

Review
• What are predictors and criteria?
• Write an equation for the linear regression with 1 IV. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?

Regression of Weight on Height

Ht (X)      Wt (Y)
61          105
62          120
63          120
65          160
65          120
68          145
69          175
70          160
72          185
75          210
N = 10      N = 10
M = 67      M = 150
SD = 4.57   SD = 33.99

Correlation (r) = .94. Regression equation: Y' = −316.86 + 6.97X.

[Figure: scatterplot of Wt against Ht with the fitted line Y' = a + bX.]

Illustration of the Linear Model. This concept is vital!

$Y_i = \alpha + \beta X_i + \varepsilon_i$
$Y_i = a + bX_i + e_i$
$Y' = a + bX$
$e_i = Y_i - Y_i'$

Consider Y as a deviation from the mean. Part of that deviation can be associated with X (the linear part) and part cannot (the error).

[Figure: scatterplot of Wt against Ht showing, for one point, the deviation from the mean of Y split into a linear part (mean to regression line) and an error part (line to point).]
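As a concrete check on the weight-on-height equation above, here is a minimal sketch in Python (assuming NumPy is available; the variable names are mine) that recovers the correlation, slope, and intercept from the raw data using the formulas b = r(SD_Y/SD_X) and a = M_Y − bM_X developed under "Finding the Regression Line" below:

```python
import numpy as np

# Height (X) and weight (Y) data from the slides, N = 10.
ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75], dtype=float)
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210], dtype=float)

r = np.corrcoef(ht, wt)[0, 1]              # correlation, about .94
b = r * wt.std(ddof=1) / ht.std(ddof=1)    # slope b = r * SD_Y / SD_X
a = wt.mean() - b * ht.mean()              # intercept a = M_Y - b * M_X

print(f"r = {r:.2f}; Y' = {a:.2f} + {b:.2f}X")  # prints r = 0.94; Y' = -316.86 + 6.97X
```

Note that ddof=1 matches the slides' SDs (4.57 and 33.99), which use N − 1; the slope and intercept come out the same under either convention because the ratio SD_Y/SD_X is unchanged.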
Predicted Values & Residuals

$Y = a + bX + e$, so the residual is $e = Y - Y'$. Numbers for the linear part and the error:

N    Ht     Wt       Y'        Resid
1    61     105      108.19     -3.19
2    62     120      115.16      4.84
3    63     120      122.13     -2.13
4    65     160      136.06     23.94
5    65     120      136.06    -16.06
6    68     145      156.97    -11.97
7    69     175      163.94     11.06
8    70     160      170.91    -10.91
9    72     185      184.84      0.16
10   75     210      205.75      4.25
M    67     150      150.00      0.00
SD   4.57   33.99    31.85      11.89
V    20.89  1155.56  1014.37    141.32

Note the means of Y' and the residuals: Y' has the same mean as Y, and the residuals average zero. Note also that the variance of Y is V(Y') + V(res).

Finding the Regression Line

Need to know the correlation, the SDs, and the means of X and Y. The correlation is the slope when both X and Y are expressed as z scores; to translate to raw scores, just bring back the original SDs for both:

$b = r_{XY}\frac{SD_Y}{SD_X}$, where $r_{XY} = \frac{\sum z_X z_Y}{N}$ (rise over run)

To find the intercept, use: $a = \bar{Y} - b\bar{X}$

Suppose r = .50, SD_X = .5, M_X = 10, SD_Y = 2, M_Y = 5.
Slope: b = .50(2/.5) = 2
Intercept: a = 5 − 2(10) = −15
Equation: Y' = −15 + 2X

Line of Least Squares

We have some points. Assume linear relations are reasonable, so the variables can be represented by a line. Where should the line go? Place the line so the errors (residuals) are small. The line we calculate has a sum of errors equal to zero and a sum of squared errors that is as small as possible; the line provides the smallest sum of squared errors, or least squares.

[Figure: scatterplot of Wt against Ht with the fitted line and the deviations (residuals) of individual points from the line marked.]

Least Squares (2)
[Figure.]

Review
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• What are predicted values and residuals?
• Suppose r = .25, SD_X = 1, M_X = 10, SD_Y = 2, M_Y = 5. What is the regression equation (line)?

Partitioning the Sum of Squares

$Y = a + bX + e = Y' + e$, where $Y' = a + bX$ and $e = Y - Y'$, so
$Y - \bar{Y} = (Y' - \bar{Y}) + (Y - Y')$

Definitions: $Y - \bar{Y} = y$ is the deviation from the mean; $Y' - \bar{Y}$ is the part due to regression; $Y - Y'$ is the error.

$\sum (Y - \bar{Y})^2 = \sum [(Y' - \bar{Y}) + (Y - Y')]^2$
$\sum y^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2$ (the cross products drop out)

Sum of squared deviations from the mean = sum of squares due to regression + sum of squares due to residuals. Analog: SS_tot = SS_B + SS_W.

Partitioning SS (2)

$SS_Y = SS_{Reg} + SS_{Res}$

Dividing through by SS_Y:
$\frac{SS_Y}{SS_Y} = \frac{SS_{Reg}}{SS_Y} + \frac{SS_{Res}}{SS_Y}$, that is, $1 = R^2 + (1 - R^2)$

Total SS is regression SS plus residual SS. We can also get the proportion of each, and can get variances by dividing each SS by N if we want. The proportion of total SS due to regression = the proportion of total variance due to regression = R² (R-square).

Partitioning SS (3)

Wt (Y)  (Y−Ȳ)²  Y'       Y'−Ȳ    (Y'−Ȳ)²    Resid (Y−Y')  Resid²
105     2025    108.19   -41.81   1748.076    -3.19         10.1761
120      900    115.16   -34.84   1213.826     4.84         23.4256
120      900    122.13   -27.87    776.7369   -2.13          4.5369
160      100    136.06   -13.94    194.3236   23.94        573.1236
120      900    136.06   -13.94    194.3236  -16.06        257.9236
145       25    156.97     6.97     48.5809  -11.97        143.2809
175      625    163.94    13.94    194.3236   11.06        122.3236
160      100    170.91    20.91    437.2281  -10.91        119.0281
185     1225    184.84    34.84   1213.826     0.16          0.0256
210     3600    205.75    55.75   3108.063     4.25         18.0625
Sum:
1500    10400   1500.01    0.01   9129.307    -0.01       1271.907
(M = 150)
Variance: 1155.56 (Y), 1014.37 (Y'), 141.32 (residuals)

Partitioning SS (4)

           Total     Regress   Residual
SS         10400     9129.31   1271.91
Variance   1155.56   1014.37    141.32

Proportion of SS:       10400/10400 = 9129.31/10400 + 1271.91/10400, i.e., 1 = .88 + .12
Proportion of variance: 1155.56/1155.56 = 1014.37/1155.56 + 141.32/1155.56, i.e., 1 = .88 + .12

R² = .88. Note that Y' is a linear function of X, so r_YY' = .94 = r_XY and r_Y'X = 1; therefore r²_YY' = .88 = R². Also r_YE = .35, so r²_YE = .12, and r_Y'E = 0.
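To make the partition concrete, here is a minimal sketch in Python (again assuming NumPy; names are mine) that reproduces the predicted values, residuals, and sum-of-squares split from the tables above:

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75], dtype=float)
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210], dtype=float)

b, a = np.polyfit(ht, wt, 1)   # least-squares slope and intercept
pred = a + b * ht              # predicted values Y' = a + bX
resid = wt - pred              # residuals e = Y - Y'; they sum to 0

ss_total = np.sum((wt - wt.mean()) ** 2)   # 10400
ss_reg = np.sum((pred - wt.mean()) ** 2)   # about 9128.2
ss_res = np.sum(resid ** 2)                # about 1271.8

r_squared = ss_reg / ss_total              # about .88
print(ss_total, ss_reg + ss_res, r_squared)  # SS_total = SS_reg + SS_res
```

The regression and residual SS here differ slightly from the table's 9129.31 and 1271.91, apparently because the table was built from Y' values rounded to two decimals; in exact arithmetic the partition SS_total = SS_reg + SS_res holds exactly.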
Significance Testing

Testing for the SS due to regression = testing for the variance due to regression = testing the significance of R². All are the same test.

$H_0: R^2_{population} = 0$

$F = \frac{SS_{reg}/df_1}{SS_{res}/df_2} = \frac{SS_{reg}/k}{SS_{res}/(N-k-1)} = \frac{9129.31/1}{1271.91/(10-1-1)} = 57.42$

Equivalent test using R-square instead of SS:

$F = \frac{R^2/k}{(1-R^2)/(N-k-1)} = \frac{.88/1}{(1-.88)/(10-1-1)} = 58.67$

k = number of IVs (here it's 1) and N is the sample size (number of people). F is evaluated with k and (N − k − 1) df. Results of the two tests will be the same within rounding error.

Review
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?
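For a quick numerical check of the two F formulas from the significance-testing slide, here is a minimal sketch in Python (assuming SciPy is available; names are mine):

```python
from scipy import stats

# F test for the regression, computed from the SS and from R-square.
ss_reg, ss_res = 9129.31, 1271.91   # from the partitioning tables
r_squared = 0.88
n, k = 10, 1                        # sample size and number of IVs

df1, df2 = k, n - k - 1             # 1 and 8 degrees of freedom
f_ss = (ss_reg / df1) / (ss_res / df2)              # about 57.4
f_r2 = (r_squared / df1) / ((1 - r_squared) / df2)  # about 58.7

p_value = stats.f.sf(f_ss, df1, df2)  # upper-tail p for F with (1, 8) df
print(f_ss, f_r2, p_value)            # p < .001, so reject H0: R-square = 0
```

The two F values differ only because .88 is a rounded R²; with an unrounded R² the two formulas are algebraically identical, since R²/(1 − R²) = SS_reg/SS_res.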