Stat 330 (Spring 2015) Slide set 31
Last update: April 21, 2015

Goodness of fit (Cont'd)

♠ Review: sample correlation as a measure of goodness of fit.

♠ Second measure of goodness of fit: the coefficient of determination R². It is based on a comparison of the "variation accounted for" by the line versus the "raw variation" of y.

♠ Ideas: The quantity

    Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n yi² − (1/n)(Σ_{i=1}^n yi)² = SST : Total Sum of Squares

is a measure for the variability of y (figure below). Notice SST is (n − 1) · s²_y, where s²_y is the sample variance of y.

♠ After fitting the line ŷ = b0 + b1x, one no longer predicts y as ȳ and suffers the errors of prediction above, but only the errors ŷi − yi =: ei. The quantity

    Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − ŷi)² = SSE : Sum of Squares of Errors

is a measure for the remaining/residual/error variation.

♠ The fact is that SST ≥ SSE, so that SSR := SST − SSE ≥ 0 is taken as a measure of the "variation accounted for" in the fitting of the line.

Regression Sum of Squares, SSR

♠ Using previous notation, SSR can also be shown to equal

    Σ_{i=1}^n (ŷi − ȳ)² = SSR : Regression Sum of Squares,

the portion of the total variation explained by the fitted model.

♠ It is easier to compute it using the formula SSR = b1 · Sxy, where

    Sxy = Σ_{i=1}^n xi yi − (1/n)(Σ_{i=1}^n xi)(Σ_{i=1}^n yi).

♠ Example (Olympics, continued):

    Sxy = 9079.584 − (1100 · 175.518)/22 = 303.684
    SSR = b1 · Sxy = 0.0155 · 303.684 = 4.707

Coefficient of determination R²

♠ Definition: The coefficient of determination R² is defined as

    R² = SSR / SST.

♠ Obviously 0 ≤ R² ≤ 1; the closer R² is to 1, the better the linear fit. In other words, the more of the variability can be explained by the line (the linear model).

♠ Example (Olympics, continued):

    SST = Syy = Σ_{i=1}^n yi² − (1/n)(Σ_{i=1}^n yi)² = 1406.109 − 175.518²/22 = 5.810
    SSE = SST − SSR = 5.810 − 4.707 = 1.103
    R² = SSR / SST = 4.707 / 5.810 = 0.81

Goodness of fit data (Olympics example): observations, fitted values, and residuals

    x = year − 1900      y       ŷ      y − ŷ   (y − ŷ)²
          0           7.185   7.204   -0.019     0.000
          4           7.341   7.266    0.075     0.006
          8           7.480   7.328    0.152     0.023
         12           7.601   7.390    0.211     0.045
         20           7.150   7.513   -0.363     0.132
         24           7.445   7.575   -0.130     0.017
         28           7.741   7.637    0.104     0.011
         32           7.639   7.699   -0.060     0.004
         36           8.060   7.761    0.299     0.089
         48           7.823   7.947   -0.124     0.015
         52           7.569   8.009   -0.440     0.194
         56           7.830   8.071   -0.241     0.058
         60           8.122   8.133   -0.011     0.000
         64           8.071   8.195   -0.124     0.015
         68           8.903   8.257    0.646     0.417
         72           8.242   8.319   -0.077     0.006
         76           8.344   8.381   -0.037     0.001
         80           8.541   8.443    0.098     0.010
         84           8.541   8.505    0.036     0.001
         88           8.720   8.567    0.153     0.024
         92           8.670   8.629    0.041     0.002
         96           8.500   8.691   -0.191     0.036

Connection Between R² and r

♠ R² is SSR/SST - that is the squared sample correlation of y and ŷ!

♠ If - and only if! - we use a linear function in x to predict y, i.e. ŷ = b0 + b1x, the correlation between ŷ and x is 1.

♠ Then R² is equal to the squared sample correlation between y and x, which is exactly r²:

    R² = r²  if and only if  ŷ = b0 + b1x.

♠ Example (Olympics, continued): R² = 0.8095 = (0.8997)² = r².

♣ It is possible to go beyond simply fitting a line and summarizing the goodness of fit in terms of r and R² to doing inference, i.e. making confidence intervals, predictions, . . . based on the line fitting. But for that, we need a probability model.
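The computations above are easy to check by machine. The following is a minimal sketch in plain Python (standard library only; the variable names are ours, not from the slides) that recomputes Sxy, SST, SSR, SSE, R², and r from the tabled data and confirms R² = r² for the fitted line.

```python
# Minimal sketch: goodness-of-fit quantities for the Olympics example.
import math

# Data from the table above: x = year - 1900, y in meters.
x = [0, 4, 8, 12, 20, 24, 28, 32, 36, 48, 52,
     56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96]
y = [7.185, 7.341, 7.480, 7.601, 7.150, 7.445, 7.741, 7.639,
     8.060, 7.823, 7.569, 7.830, 8.122, 8.071, 8.903, 8.242,
     8.344, 8.541, 8.541, 8.720, 8.670, 8.500]
n = len(x)

# Building blocks Sxx, Syy (= SST), Sxy.
Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Syy = sum(v * v for v in y) - sum(y) ** 2 / n              # = SST
Sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = Sxy / Sxx                       # least squares slope
b0 = sum(y) / n - b1 * sum(x) / n    # least squares intercept

SST = Syy
SSR = b1 * Sxy                       # variation explained by the line
SSE = SST - SSR                      # residual variation
R2 = SSR / SST
r = Sxy / math.sqrt(Sxx * Syy)       # sample correlation of x and y

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")        # ~7.2037, ~0.0155
print(f"SST = {SST:.3f}, SSR = {SSR:.3f}, SSE = {SSE:.3f}")
print(f"R^2 = {R2:.4f}, r^2 = {r * r:.4f}")   # both ~0.8095, so R^2 = r^2
```

With the unrounded slope b1 = Sxy/Sxx ≈ 0.01549 the script gives SSR ≈ 4.703 and SSE ≈ 1.107; the slides' 4.707 and 1.103 come from using the rounded slope 0.0155.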
Simple linear regression model

♠ Idea: in words, for input x the output y is normally distributed with mean β0 + β1x = μ_{y|x} and standard deviation σ.

♠ In symbols: yi = β0 + β1xi + εi, with the εi i.i.d. normal N(0, σ²).

♠ β0, β1, and σ² are the parameters of the model and have to be estimated from the data (the data pairs (xi, yi)).

Estimates for regression model

♠ Point estimates: how to estimate β0, β1, and σ²?

♥ β̂0 = b0 and β̂1 = b1 from the Least Squares fit (which gives β̂0 and β̂1 the name Least Squares Estimates).

♥ What about σ²?

♠ σ² measures the variation around the "true" line β0 + β1x - we don't know that line, but only b0 + b1x. Should we base the estimation of σ² on this line?

♠ The "right" estimator for σ² turns out to be

    σ̂² = (1/(n − 2)) Σ_{i=1}^n (yi − ŷi)² = SSE / (n − 2).

♣ Example (Olympics, continued):

    β̂0 = b0 = 7.2037 (in m)
    β̂1 = b1 = 0.0155 (in m/year)
    σ̂² = SSE / (n − 2) = 1.103 / 20 = 0.055

Overall, we assume a linear regression model of the form

    y = 7.2037 + 0.0155x + e,  with e ∼ N(0, 0.055).

♣ Remark: Using the model, not only can we estimate β0, β1, and σ², we can also pursue further statistical inference such as confidence intervals and hypothesis tests. The tool for this inference is called ANOVA (analysis of variance).
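To make the fitted model concrete, here is a minimal sketch (plain Python; the chosen input x = 96, i.e. the year 1996, is just an illustration of ours) that plugs in the point estimates above and simulates one response from y = b0 + b1x + e. Note that N(0, 0.055) specifies the variance σ², while random.gauss expects the standard deviation.

```python
# Minimal sketch: point estimates and one simulated draw from the model.
import random

b0, b1 = 7.2037, 0.0155           # least squares estimates of beta0, beta1
SSE, n = 1.103, 22
sigma2_hat = SSE / (n - 2)        # = 1.103 / 20 = 0.055
sigma_hat = sigma2_hat ** 0.5     # random.gauss takes the std. deviation

x = 96                                          # year 1996
mu_hat = b0 + b1 * x                            # estimated mean of y given x
y_sim = mu_hat + random.gauss(0.0, sigma_hat)   # one draw from the model

print(f"sigma^2 hat = {sigma2_hat:.3f}")        # 0.055
print(f"mean at x = {x}: {mu_hat:.3f} m, simulated y: {y_sim:.3f} m")
```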