Regression and Correlation (and scatter plots)

Outline
• Making a Scatter plot
• Calculating the Regression Line
• The Correlation
• Alternative Procedures
• Further Considerations of Regressions and Correlations

Estimating the speed of a car

Estimated_i:    8   8  35  13   5   7  14  22  27   5
Actual_i:      12  29  24  16   7   5  24  21  26  24

Four Steps
1. Decide the scale and draw the x axis
2. Decide the scale and draw the y axis
3. Add all the points
4. Label axes

[Figure: scatter plot of Estimated velocity (mph) on the y axis against Actual velocity (mph) on the x axis, both scaled from 0 to 40.]

Regression Lines
• Different types
• Here, a straight line that gets close to most of the points.
• One way to define "close to most points" is to find the line that minimizes the sum of the squared vertical distances from the line to the points, that is, minimizing $\sum e_i^2$.
• This is called ordinary least squares (OLS) regression.

What is a straight line?
• In mathematics: $y = ax + b$
  a is the slope
  b is the intercept
• In statistics: $y_i = \beta_0 + \beta_1 x_i + e_i$
  Subscripts show that people have different values on the variables
  $\beta_0$ is the intercept
  $\beta_1$ is the slope
  $e_i$ are the residuals: the vertical distances between the line and the points.

Scatter plot with regression line and residuals

Fitted line: $\text{Estimated}_i = 3.62 + 0.57\,\text{Actual}_i$

[Figure: scatter plot of Estimated velocity (mph) against Actual velocity (mph) with the regression line and the residuals drawn in.]

How to find the OLS line?

$$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$

where $\bar{y}$ is the mean of the y variable and $\bar{x}$ is the mean of the x variable.
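The two OLS formulas can be checked directly on the car-speed data. The following is a minimal Python sketch, not a definitive implementation: the function name `ols_line` and the variable names are illustrative choices of this example, and it assumes, as the fitted line in the slides does, that Actual is the x variable and Estimated is the y variable.

```python
# Car-speed data from the slides (mph).
# Assumption: Actual is the predictor (x) and Estimated is the outcome (y).
actual = [12, 29, 24, 16, 7, 5, 24, 21, 26, 24]
estimated = [8, 8, 35, 13, 5, 7, 14, 22, 27, 5]


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of the deviations from the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: sum of squared deviations of x from its mean.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = ols_line(actual, estimated)

# Residuals are the vertical distances between each point and the line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(actual, estimated)]
ss_residual = sum(e * e for e in residuals)

print(f"intercept = {b0:.2f}, slope = {b1:.4f}")
print(f"sum of squared residuals = {ss_residual:.1f}")
```

Because the intercept is defined through the two means, the fitted line always passes through the point of means, and the residuals sum to zero up to floating-point rounding.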
Lots of calculations

Ē = 14.4 and Ā = 18.8 are the means of the Estimated and Actual values.

      E_i   A_i   E_i-Ē   A_i-Ā   (E_i-Ē)(A_i-Ā)   (E_i-Ē)²   (A_i-Ā)²   pred_i     e_i    e_i²
        8    12    -6.4    -6.8             43.5       41.0       46.2     10.5    -2.5     6.3
        8    29    -6.4    10.2            -65.3       41.0      104.0     20.3   -12.3   150.1
       35    24    20.6     5.2            107.1      424.4       27.0     17.4    17.6   310.4
       13    16    -1.4    -2.8              3.9        2.0        7.8     12.8     0.2     0.0
        5     7    -9.4   -11.8            110.9       88.4      139.2      7.6    -2.6     6.9
        7     5    -7.4   -13.8            102.1       54.8      190.4      6.5     0.5     0.3
       14    24    -0.4     5.2             -2.1        0.2       27.0     17.4    -3.4    11.4
       22    21     7.6     2.2             16.7       57.8        4.8     15.7     6.3    40.2
       27    26    12.6     7.2             90.7      158.8       51.8     18.5     8.5    71.8
        5    24    -9.4     5.2            -48.9       88.4       27.0     17.4   -12.4   153.3
Sum   144   188     0       0              358.8      956.4      625.6    144       0     750.6

Estimate for $\beta_1$ = 358.8/625.6 = 0.5735
Estimate for $\beta_0$ = 14.4 - 0.5735(18.8) = 3.62

SStotal = SSmodel + SSresidual
• SStotal is 956.4
• SSresidual is 750.6
• SSmodel = SStotal - SSresidual = 956.4 - 750.6 = 205.8
• $R^2$ = SSmodel / SStotal = 205.8/956.4 = 0.22
• $R^2$ = the proportion of variation accounted for by the model.
• The square root of $R^2$, R (or r), is also useful.

Testing the hypothesis $R^2 = 0$ in the population
(SSerror and dferror refer to the residual: SSerror = SSresidual.)
• dfmodel = number of variables in the model (here 1)
• dferror = n - dfmodel - 1 (here 10 - 1 - 1 = 8)
• MSSmodel = SSmodel / dfmodel (here 205.8/1 = 205.8)
• MSSerror = SSerror / dferror (here 750.6/8 = 93.8)
• F(dfmodel, dferror) = MSSmodel / MSSerror
  F(1, 8) = 205.8/93.8 = 2.19
• If a computer is doing the calculations, p = .18.

Scatter plots with several values at the same coordinate

Pearson's Correlation r

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

$$r = \frac{358.8}{\sqrt{(625.6)(956.4)}} = \frac{358.8}{773.5} = 0.46$$

This is the square root of $R^2$, but keeping the sign. It ranges from -1 to 1, with negative associations having negative r values.

Testing if r = 0 in the population

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad F = \frac{r^2 (n-2)}{1-r^2}$$

Also worth calculating confidence intervals (see text).

An Alternative Procedure: Spearman's rS
• Rank the data, and then run Pearson's
• Some complications and variations if there are lots of ties.
• Impact of univariate outliers (both those that increase the r and those that decrease it)

[Figure: two scatter plots of drug offences against theft offences, with the point "Regency" labelled in each. One panel, "Scatterplot on Raw data", plots Drug offences against Theft offences; the other, "Scatterplot on logs of data", plots ln(Drugs + 1) against ln(Thefts + 1). Panel annotations: r = .76, rS = .74; r = .94, rS = .78.]

Further Considerations
• What we have discussed has been for straight lines. Look at scatter plots to see whether other techniques, for curves, are necessary.
• Correlation does not imply causation (but it suggests that somewhere in the network of hypotheses that includes these two variables there are causal relationships).
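
As a cross-check on the car-speed example, Pearson's r, $R^2$, the F and t statistics, and the rank-then-correlate recipe for Spearman's rS can be recomputed in a few lines. This is a minimal Python sketch under stated assumptions: the helper names (`pearson_r`, `average_ranks`) are illustrative, and giving tied values the average of their ranks is one common convention, chosen here because the slides only note that ties bring complications and variations.

```python
import math

# Car-speed data from the slides (mph).
actual = [12, 29, 24, 16, 7, 5, 24, 21, 26, 24]
estimated = [8, 8, 35, 13, 5, 7, 14, 22, 27, 5]


def pearson_r(x, y):
    """Pearson correlation: cross-products over the root of the two sums of squares."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks.

    Assumption: average ranks are one convention for ties, not the only one.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


n = len(actual)
r = pearson_r(actual, estimated)
r_squared = r ** 2

# Test statistics for r = 0 (equivalently R^2 = 0) with one predictor.
f_stat = r_squared * (n - 2) / (1 - r_squared)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

# Spearman's rS: rank each variable, then run Pearson's on the ranks.
r_s = pearson_r(average_ranks(actual), average_ranks(estimated))

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")
print(f"F(1, {n - 2}) = {f_stat:.2f}, t({n - 2}) = {t_stat:.2f}")
print(f"Spearman rS = {r_s:.2f}")
```

With a single predictor the F statistic is the square of the t statistic, so the two tests give the same p value. The sketch deliberately stops at the test statistics; turning them into p values needs the F or t distribution, which is left to statistical software.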