Linear Regression

PSYC 6130, Prof. J. Elder

Correlation vs Regression: What’s the Difference?
• Correlation measures how strongly related 2 variables are.
• Regression provides a means for predicting the value of one variable based on the value of a related variable.
• The underlying mathematics are the same.
• Here we are dealing only with linear correlation and linear regression.

Optimal Prediction using z Scores
• Consider 2 variables X and Y that may be related in some way.
  – e.g.,
    • X = midterm score, Y = final exam score
    • X = reaction time, Y = error rate
• Suppose you know X for a particular case (e.g., individual, trial). What is your best guess at Y?
• The answer turns out to be pretty simple: $z_{\hat{Y}} = r\,z_X$, where $\hat{Y}$ is the predicted value of Y.

Example: 6130A 2005-06 Assignment marks

                    Assignment 1 (X)    Assignment 2 (Y)
                    86.7%               81.8%
                    81.5%               82.4%
                    85.0%               84.3%
                    85.5%               86.8%
                    90.2%               83.6%
                    95.4%               87.4%
                    91.9%               93.1%
                    93.1%               93.1%
                    94.8%               91.8%
                    93.6%               93.7%
                    94.8%               93.1%
                    94.2%               94.3%
                    94.8%               95.6%
Mean                90.9%               89.3%
Sample Std. Dev.    4.66%               5.04%

Graphical Representation
[Figure: PSYC 6130A 2005-06. Scatter plot of Assignment 2 z-score against Assignment 1 z-score, with the regression line $z_{\hat{Y}} = 0.7998\,z_X$.]

The Raw-Score Regression Formula
In terms of population parameters:
$$\hat{Y} = \mu_Y + r\,\frac{\sigma_Y}{\sigma_X}\,(X - \mu_X)$$
or
$$\hat{Y} = a_{YX} + b_{YX}X, \quad \text{where } b_{YX} = \frac{\sigma_Y}{\sigma_X}\,r \text{ and } a_{YX} = \mu_Y - b_{YX}\,\mu_X$$
In terms of sample statistics:
$$\hat{Y} = \bar{Y} + r\,\frac{s_Y}{s_X}\,(X - \bar{X})$$
or
$$\hat{Y} = a_{YX} + b_{YX}X, \quad \text{where } b_{YX} = \frac{s_Y}{s_X}\,r \text{ and } a_{YX} = \bar{Y} - b_{YX}\,\bar{X}$$

Graphical Representation
[Figure: PSYC 6130 Section A 2005-2006. Scatter plot of Assignment 2 grade against Assignment 1 grade, with the regression line y = 0.867x + 10.5% (slope $b_{YX} = 0.867$, intercept $a_{YX} = 10.5\%$).]
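As a rough check on the two prediction rules, the sketch below fits the assignment-marks example with the sample-statistic formulas and confirms that predicting through z scores gives the same answer as the raw-score line. The function names are illustrative, and pairing the i-th Assignment 1 mark with the i-th Assignment 2 mark is an assumption about the table layout. Because the marks are rounded to one decimal place, the computed r, slope, and intercept should land close to, but not exactly on, the slide values (r = 0.7998, y = 0.867x + 10.5%).

```python
from statistics import mean, stdev

# Assignment marks (%) from the example table. Pairing row i of X with
# row i of Y is an assumption about how the table was laid out.
X = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9, 93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
Y = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1, 93.1, 91.8, 93.7, 93.1, 94.3, 95.6]


def pearson_r(x, y):
    """Pearson correlation: sample covariance divided by the product of sample SDs."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


def regression_coefficients(x, y):
    """Return (a_YX, b_YX) for the line Y_hat = a_YX + b_YX * X."""
    b = pearson_r(x, y) * stdev(y) / stdev(x)
    a = mean(y) - b * mean(x)
    return a, b


def predict_via_z(x_new, x, y):
    """Predict Y by converting X to a z score, scaling by r, and converting back."""
    z_x = (x_new - mean(x)) / stdev(x)
    z_y_hat = pearson_r(x, y) * z_x
    return mean(y) + z_y_hat * stdev(y)


a_yx, b_yx = regression_coefficients(X, Y)
print(f"r = {pearson_r(X, Y):.3f}, slope = {b_yx:.3f}, intercept = {a_yx:.1f}%")
print(f"raw-score prediction for X = 85%: {a_yx + b_yx * 85.0:.1f}%")
print(f"z-score prediction for X = 85%:   {predict_via_z(85.0, X, Y):.1f}%")
```

The last two printed lines should agree, since the raw-score formula is the z-score rule rewritten in the original units.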
Residuals
• The deviations of the actual Y values from the Y values predicted by the regression line are called residuals: residual $= Y - \hat{Y}$.
• The regression line minimizes the sum of squared residuals (and hence is called a least-squares, or minimum mean-squared-error, fit).
[Figure: PSYC 6130 Section A 2005-2006. Scatter plot of Assignment 2 grade against Assignment 1 grade, with one residual $Y - \hat{Y}$ marked.]

Variance of the Estimate
• Total prediction error is expressed as the variance of the estimate (or mean-squared error) $\sigma^2_{\text{est}\,Y}$:
In terms of population parameters:
$$\sigma^2_{\text{est}\,Y} = \frac{\sum (Y - \hat{Y})^2}{N}$$
In terms of sample statistics:
$$s^2_{\text{est}\,Y} = \frac{\sum (Y - \hat{Y})^2}{N - 2}$$
Note that $\sigma^2_{\text{est}\,Y} \le \sigma^2_Y$. Equality applies only when $r = 0$.
$\sigma_{\text{est}\,Y}$ ($s_{\text{est}\,Y}$) is called the standard error of the estimate.

Explained and Unexplained Variance
[Figure: PSYC 6130 Section A 2005-2006. Scatter plot of Assignment 2 grade against Assignment 1 grade, marking the explained and unexplained parts of one point's deviation from the mean of Y.]
Explained Variance:
$$\sigma^2_{\text{exp}} = \frac{1}{N}\sum (\hat{Y} - \mu_Y)^2$$
Unexplained Variance:
$$\sigma^2_{\text{est}\,Y} = \frac{1}{N}\sum (Y - \hat{Y})^2$$

Summary of Variances
Population:
Total Variance:
$$\sigma^2_Y = \frac{\sum (Y - \mu_Y)^2}{N}$$
Unexplained Variance:
$$\sigma^2_{\text{est}\,Y} = \frac{\sum (Y - \hat{Y})^2}{N}$$
Explained Variance:
$$\sigma^2_{\text{exp}} = \frac{\sum (\hat{Y} - \mu_Y)^2}{N}$$
• It can be shown that: $\sigma^2_Y = \sigma^2_{\text{est}\,Y} + \sigma^2_{\text{exp}}$
• i.e., the total variance is equal to the sum of the explained and unexplained variances.

Summary of Variances
Sample:
Total Variance:
$$s^2_Y = \frac{\sum (Y - \bar{Y})^2}{N - 1}$$
Unexplained Variance:
$$s^2_{\text{est}\,Y} = \frac{\sum (Y - \hat{Y})^2}{N - 2}$$
Explained Variance:
$$s^2_{\text{exp}} = s^2_Y - s^2_{\text{est}\,Y}$$
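The decomposition of the total variance into explained and unexplained parts can be checked numerically. The sketch below uses a small made-up data set (not from the lecture), chosen so that r = 0.8 and the sums of squares come out as round numbers. The function names are illustrative. The slope is computed as the sum of cross-products over the sum of squared X deviations, which is algebraically the same as $r\,s_Y/s_X$.

```python
from statistics import mean

# Small made-up data set (not from the lecture), chosen so the numbers are easy to check.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]


def fit_line(x, y):
    """Least-squares intercept and slope for Y_hat = a + b * X."""
    mx, my = mean(x), mean(y)
    sp = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_x = sum((xi - mx) ** 2 for xi in x)
    b = sp / ss_x
    return my - b * mx, b


def sums_of_squares(x, y):
    """Return (total, unexplained, explained) sums of squares for the fitted line."""
    a, b = fit_line(x, y)
    y_hat = [a + b * xi for xi in x]
    my = mean(y)
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_explained = sum((yh - my) ** 2 for yh in y_hat)
    return ss_total, ss_unexplained, ss_explained


ss_total, ss_unexplained, ss_explained = sums_of_squares(X, Y)
n = len(X)

# Population-style variances divide each sum of squares by N.
print("total variance:      ", ss_total / n)
print("unexplained variance:", ss_unexplained / n)
print("explained variance:  ", ss_explained / n)

# The sample variance of the estimate divides by N - 2 instead.
print("sample variance of the estimate:", ss_unexplained / (n - 2))
```

For this data set the total, unexplained, and explained sums of squares are 10, 3.6, and 6.4 (up to floating-point error), so the population-style variances add up as the identity above requires.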
Coefficient of Determination
• The fraction of the total variance explained by the regression line is called the coefficient of determination.
• It can be shown that this is just the square of the Pearson coefficient r:
• Population:
$$r^2 = \frac{\sum (\hat{Y} - \mu_Y)^2}{\sum (Y - \mu_Y)^2} = 1 - \frac{\sigma^2_{\text{est}\,Y}}{\sigma^2_Y}$$
• Sample:
$$r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{N - 2}{N - 1}\,\frac{s^2_{\text{est}\,Y}}{s^2_Y}$$

Coefficient of Nondetermination
• The fraction of the total variance that remains unexplained by the regression line is called the coefficient of nondetermination.
• It can be shown that this is just $1 - r^2$:
• Population:
$$1 - r^2 = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \mu_Y)^2} = \frac{\sigma^2_{\text{est}\,Y}}{\sigma^2_Y}$$
• Sample:
$$1 - r^2 = \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = \frac{N - 2}{N - 1}\,\frac{s^2_{\text{est}\,Y}}{s^2_Y}$$

Summary of Coefficients
Population:
• Coefficient of Determination: $r^2 = \dfrac{\sum (\hat{Y} - \mu_Y)^2}{\sum (Y - \mu_Y)^2} = 1 - \dfrac{\sigma^2_{\text{est}\,Y}}{\sigma^2_Y}$
• Coefficient of Nondetermination: $1 - r^2 = \dfrac{\sum (Y - \hat{Y})^2}{\sum (Y - \mu_Y)^2} = \dfrac{\sigma^2_{\text{est}\,Y}}{\sigma^2_Y}$
Sample:
• Coefficient of Determination: $r^2 = \dfrac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \dfrac{N - 2}{N - 1}\,\dfrac{s^2_{\text{est}\,Y}}{s^2_Y}$
• Coefficient of Nondetermination: $1 - r^2 = \dfrac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} = \dfrac{N - 2}{N - 1}\,\dfrac{s^2_{\text{est}\,Y}}{s^2_Y}$

Components of Variance: SPSS Output
Explained SS (Regression row): $\sum (\hat{Y} - \bar{Y})^2$
Unexplained SS (Residual row): $\sum (Y - \hat{Y})^2$
Total SS (Total row): $\sum (Y - \bar{Y})^2$
Unexplained Variance (Residual Mean Square): $s^2_{\text{est}\,Y} = \dfrac{\sum (Y - \hat{Y})^2}{N - 2}$

ANOVA (b)
Model 1       Sum of Squares    df       Mean Square    F           Sig.
Regression    861347.2          1        861347.186     7465.139    .000 (a)
Residual      1325861           11491    115.383
Total         2187209           11492
a. Predictors: (Constant), How tall are you without your shoes on (in cm.)
b. Dependent Variable: How much do you weigh (in kilograms)

Estimating the Variance of the Estimate
• Uncertainty in predictions can be estimated using the assumption of homoscedasticity.
  – (Etymology: hom- + Greek skedastikos able to disperse, from skedannynai to disperse)
  – Thought question: does this also explain the origin of the verb skedaddle?
  – In other words, homogeneity of variance in Y over the range of X.

Confidence Intervals for Predictions
$$Y = \hat{Y} \pm t_{\text{crit}}\; s_{\text{est}\,Y}\,\sqrt{1 + \frac{1}{N} + \frac{(X - \bar{X})^2}{(N - 1)\,s_X^2}}$$

Example: 6130A 2005-06 Assignment marks
Assignment 1 (X): mean 90.9%, sample std. dev. 4.66%
Assignment 2 (Y): mean 89.3%, sample std. dev. 5.04%
13 pairs of marks, r = 0.7998

Underlying Assumptions
• Independent random sampling
• Linearity
• Normal Distribution
• Homoscedasticity

Regressing X on Y
• Simply reverse the formulae, e.g.,
In terms of sample statistics:
$$\hat{X} = \bar{X} + r\,\frac{s_X}{s_Y}\,(Y - \bar{Y})$$
or
$$\hat{X} = a_{XY} + b_{XY}Y, \quad \text{where } b_{XY} = \frac{s_X}{s_Y}\,r \text{ and } a_{XY} = \bar{X} - b_{XY}\,\bar{Y}$$

When to Use Linear Regression
• Prediction
• Statistical Control
  – Adjust for effects of confounding variable.
  – Also known as partialing out the effect of the confounding variable.
• Experimental Psychology: modeling effect of continuous independent variable on continuous dependent variable.
  – e.g., reaction time vs set size in visual search.

Statistical Control Example: Mental Health
Women report more bad mental health days than men, t(8176) = -7.1, p < .001, 2-tailed.

Statistical Control Example: Physical Health
[Figure not recovered.]

Correlation
Pearson’s r = 0.31

After Partialing Out Physical Health
[Figure not recovered.]

Result of Partialing Out Physical Health
Controlling for physical health, women report more bad mental health days than men, t(8176) = -5.7, p < .001, 2-tailed.
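Returning to the confidence interval for predictions, a worked calculation may make the formula easier to read. The sketch below plugs in the summary statistics quoted for the assignment-marks example (N = 13, r = 0.7998). It rests on several assumptions: the function name and parameter names are illustrative, the standard error of the estimate is recovered from the sample relation $1 - r^2 = \frac{N-2}{N-1}\,s^2_{\text{est}\,Y}/s^2_Y$, and the critical value 2.201 is the two-tailed 95% t value for N − 2 = 11 degrees of freedom taken from a t table (the slides do not specify a confidence level).

```python
import math


def prediction_interval(x_new, n, mean_x, sd_x, mean_y, sd_y, r, t_crit):
    """Return (lower, predicted, upper) for a new Y observed at X = x_new."""
    # Raw-score regression line from the sample statistics.
    b_yx = r * sd_y / sd_x
    a_yx = mean_y - b_yx * mean_x
    y_hat = a_yx + b_yx * x_new
    # Sample standard error of the estimate, from
    # 1 - r^2 = (N - 2) / (N - 1) * s_est^2 / s_Y^2.
    s_est = sd_y * math.sqrt((n - 1) / (n - 2) * (1 - r ** 2))
    # Half-width grows as x_new moves away from the mean of X.
    half_width = t_crit * s_est * math.sqrt(
        1 + 1 / n + (x_new - mean_x) ** 2 / ((n - 1) * sd_x ** 2)
    )
    return y_hat - half_width, y_hat, y_hat + half_width


# Summary statistics quoted in the assignment-marks example (percent units).
STATS = dict(n=13, mean_x=90.9, sd_x=4.66, mean_y=89.3, sd_y=5.04, r=0.7998)
# Two-tailed 95% critical t for 11 degrees of freedom, from a t table;
# the 95% level is an assumption, not something the slides specify.
T_CRIT = 2.201

for x_new in (80.0, 90.9, 100.0):
    low, y_hat, high = prediction_interval(x_new, t_crit=T_CRIT, **STATS)
    print(f"X = {x_new:5.1f}%: predicted Y = {y_hat:.1f}%, interval {low:.1f}% to {high:.1f}%")
```

At the mean of X the predicted mark equals the mean of Y and the half-width is roughly 7.2 percentage points. The interval widens for X values further from the mean, which is what the $(X - \bar{X})^2$ term in the formula expresses.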