Lecture 14: Simple Linear Regression

Ordinary Least Squares (OLS)

Consider the following simple linear regression model
\[
Y_i = \alpha + \beta X_i + \varepsilon_i
\]
where, for each unit $i$,

• $Y_i$ is the dependent variable (response).
• $X_i$ is the independent variable (predictor).
• $\varepsilon_i$ is the error between the observed $Y_i$ and what the model predicts.

This model describes the relation between $X_i$ and $Y_i$ using an intercept and a slope parameter. The error $\varepsilon_i = Y_i - \alpha - \beta X_i$ is also known as the residual because it measures the amount of variation in $Y_i$ not explained by the model.

We saw last class that there exist $\hat{\alpha}$ and $\hat{\beta}$ that minimize the sum of the $\varepsilon_i^2$. Specifically, we wish to find $\hat{\alpha}$ and $\hat{\beta}$ such that
\[
\sum_{i=1}^{n} \bigl( Y_i - (\hat{\alpha} + \hat{\beta} X_i) \bigr)^2
\]
is minimized. Using calculus, it turns out that
\[
\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}, \qquad
\hat{\beta} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}.
\]

OLS and Transformation

If we center the predictor, $\tilde{X}_i = X_i - \bar{X}$, then $\tilde{X}_i$ has mean zero. Therefore,
\[
\hat{\alpha}^* = \bar{Y}, \qquad
\hat{\beta}^* = \frac{\sum \tilde{X}_i (Y_i - \bar{Y})}{\sum \tilde{X}_i^2}.
\]
Horizontally shifting the values of $X_i$ leaves the slope unchanged, $\hat{\beta}^* = \hat{\beta}$, but the intercept becomes the overall average of the $Y_i$.

Now consider the linear transformation $Z_i = a + bX_i$, with $\bar{Z} = a + b\bar{X}$, and the linear model
\[
Y_i = \alpha^* + \beta^* Z_i + \varepsilon_i.
\]
Let $\hat{\alpha}$ and $\hat{\beta}$ be the estimates using $X$ as the predictor. Then the slope and intercept using $Z$ as the predictor are
\[
\hat{\beta}^* = \frac{\sum (Z_i - \bar{Z})(Y_i - \bar{Y})}{\sum (Z_i - \bar{Z})^2}
= \frac{\sum (a + bX_i - a - b\bar{X})(Y_i - \bar{Y})}{\sum (a + bX_i - a - b\bar{X})^2}
= \frac{b\sum (X_i - \bar{X})(Y_i - \bar{Y})}{b^2\sum (X_i - \bar{X})^2}
= \frac{1}{b}\hat{\beta}
\]
and
\[
\hat{\alpha}^* = \bar{Y} - \hat{\beta}^*\bar{Z}
= \bar{Y} - \frac{\hat{\beta}}{b}(a + b\bar{X})
= \bar{Y} - \hat{\beta}\bar{X} - \frac{a\hat{\beta}}{b}
= \hat{\alpha} - \frac{a}{b}\hat{\beta}.
\]
We can get the same results by substituting $X_i = \frac{Z_i - a}{b}$ into the linear model:
\[
Y_i = \alpha + \beta X_i + \varepsilon_i
= \alpha + \beta\,\frac{Z_i - a}{b} + \varepsilon_i
= \Bigl(\alpha - \frac{a\beta}{b}\Bigr) + \frac{\beta}{b} Z_i + \varepsilon_i
= \alpha^* + \beta^* Z_i + \varepsilon_i.
\]
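The closed-form estimates and the transformation identities above are easy to check numerically. Below is a minimal sketch in Python, using numpy on simulated data (the sample size, the true intercept and slope, and the chosen values of $a$ and $b$ are illustrative, not from the lecture), that computes $\hat{\alpha}$ and $\hat{\beta}$ directly from the formulas and confirms that refitting on $Z_i = a + bX_i$ rescales the slope by $1/b$ and shifts the intercept by $a\hat{\beta}/b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative values only).
n = 50
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)


def ols(x, y):
    """Return (alpha_hat, beta_hat) from the closed-form OLS formulas."""
    xbar, ybar = x.mean(), y.mean()
    beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    alpha_hat = ybar - beta_hat * xbar
    return alpha_hat, beta_hat


alpha_hat, beta_hat = ols(X, Y)

# Refit using the linearly transformed predictor Z = a + b X.
a, b = 3.0, 2.0
Z = a + b * X
alpha_star, beta_star = ols(Z, Y)

print(beta_star, beta_hat / b)                     # slope is scaled by 1/b
print(alpha_star, alpha_hat - (a / b) * beta_hat)  # intercept shifts by a*beta_hat/b
```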
Properties of OLS

Given the estimates $\hat{\alpha}$ and $\hat{\beta}$, we can define (1) the fitted (predicted) value $\hat{Y}_i$ and (2) the estimated residual $\hat{\varepsilon}_i$:
\[
\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i, \qquad
\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \hat{\alpha} - \hat{\beta} X_i.
\]
The least squares estimates have the following properties.

1. $\sum_i \hat{\varepsilon}_i = 0$.
\[
\sum_{i=1}^{n} \hat{\varepsilon}_i
= \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} X_i)
= \sum_{i=1}^{n} Y_i - n\hat{\alpha} - \hat{\beta}\sum_{i=1}^{n} X_i
= n\bar{Y} - n\hat{\alpha} - n\hat{\beta}\bar{X}
= n\bigl(\bar{Y} - (\bar{Y} - \hat{\beta}\bar{X}) - \hat{\beta}\bar{X}\bigr) = 0
\]

2. $(\bar{X}, \bar{Y})$ is always on the fitted line.
\[
\hat{\alpha} + \hat{\beta}\bar{X} = (\bar{Y} - \hat{\beta}\bar{X}) + \hat{\beta}\bar{X} = \bar{Y}
\]

3. $\hat{\beta} = r_{XY} \times \dfrac{s_Y}{s_X}$, where $s_X$ and $s_Y$ are the sample standard deviations of $X$ and $Y$, and $r_{XY}$ is the correlation between $X$ and $Y$. Note that the sample correlation is given by
\[
r_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, s_X s_Y}.
\]
Therefore,
\[
\hat{\beta}
= \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
\times \frac{n-1}{n-1} \times \frac{s_X s_Y}{s_X s_Y}
= \frac{\mathrm{Cov}(X, Y)}{s_X^2} \times \frac{s_X s_Y}{s_X s_Y}
= \frac{\mathrm{Cov}(X, Y)}{s_X s_Y} \times \frac{s_Y}{s_X}
= r_{XY} \times \frac{s_Y}{s_X}
\]

4. $\hat{\beta}$ is a weighted sum of the $Y_i$.
\[
\hat{\beta}
= \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
= \frac{\sum (X_i - \bar{X})Y_i}{\sum (X_i - \bar{X})^2} - \frac{\bar{Y}\sum (X_i - \bar{X})}{\sum (X_i - \bar{X})^2}
= \frac{\sum (X_i - \bar{X})Y_i}{\sum (X_i - \bar{X})^2} - 0
= \sum w_i Y_i, \qquad w_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}
\]

5. $\hat{\alpha}$ is a weighted sum of the $Y_i$.
\[
\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}
= \sum \frac{1}{n} Y_i - \bar{X}\sum w_i Y_i
= \sum v_i Y_i, \qquad v_i = \frac{1}{n} - \frac{\bar{X}(X_i - \bar{X})}{\sum (X_i - \bar{X})^2}
\]

6. $\sum X_i \hat{\varepsilon}_i = 0$.
\[
\begin{aligned}
\sum X_i \hat{\varepsilon}_i
&= \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) X_i
= \sum \bigl((Y_i - \bar{Y}) - \hat{\beta}(X_i - \bar{X})\bigr) X_i
= \sum X_i (Y_i - \bar{Y}) - \hat{\beta}\sum X_i (X_i - \bar{X}) \\
&= \hat{\beta}\sum (X_i - \bar{X})^2 - \hat{\beta}\sum X_i (X_i - \bar{X})
= \hat{\beta}\sum \bigl(X_i^2 - 2X_i\bar{X} + \bar{X}^2 - X_i^2 + X_i\bar{X}\bigr)
= \hat{\beta}\sum \bigl(\bar{X}^2 - X_i\bar{X}\bigr)
= \hat{\beta}\,\bar{X}\sum (\bar{X} - X_i) = 0
\end{aligned}
\]
Note that $\hat{\beta} = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \dfrac{\sum X_i (Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$, so $\sum X_i (Y_i - \bar{Y}) = \hat{\beta}\sum (X_i - \bar{X})^2$, which justifies the second line above.

7. Partition of total variation.
\[
Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)
\]
• $(Y_i - \bar{Y})$: total deviation of the observed $Y_i$ from a null model ($\beta = 0$).
• $(\hat{Y}_i - \bar{Y})$: deviation of the modeled value $\hat{Y}_i$ ($\beta \neq 0$) from a null model ($\beta = 0$).
• $(Y_i - \hat{Y}_i)$: deviation of the modeled value from the observed value.

Total variations (sums of squares):
• $\sum (Y_i - \bar{Y})^2$: total variation, $(n-1) \times V(Y)$.
• $\sum (\hat{Y}_i - \bar{Y})^2$: variation explained by the model.
• $\sum (Y_i - \hat{Y}_i)^2$: variation not explained by the model; the residual sum of squares (SSE), $\sum \hat{\varepsilon}_i^2$.

Partition of total variation:
\[
\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2
\]
Proof:
\[
\sum (Y_i - \bar{Y})^2
= \sum \bigl[(\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)\bigr]^2
= \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 + 2\sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)
= \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2,
\]
where
\[
\sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)
= \sum \bigl[(\hat{\alpha} + \hat{\beta} X_i) - (\hat{\alpha} + \hat{\beta}\bar{X})\bigr]\hat{\varepsilon}_i
= \sum \hat{\beta}(X_i - \bar{X})\hat{\varepsilon}_i
= \hat{\beta}\sum X_i\hat{\varepsilon}_i - \hat{\beta}\bar{X}\sum \hat{\varepsilon}_i
= 0 + 0.
\]

8. Alternative formula for $\sum \hat{\varepsilon}_i^2$.
\[
\begin{aligned}
\sum (Y_i - \hat{Y}_i)^2
&= \sum (Y_i - \bar{Y})^2 - \sum (\hat{Y}_i - \bar{Y})^2
= \sum (Y_i - \bar{Y})^2 - \sum (\hat{\alpha} + \hat{\beta} X_i - \bar{Y})^2
= \sum (Y_i - \bar{Y})^2 - \sum (\bar{Y} - \hat{\beta}\bar{X} + \hat{\beta} X_i - \bar{Y})^2 \\
&= \sum (Y_i - \bar{Y})^2 - \sum \bigl[\hat{\beta}(X_i - \bar{X})\bigr]^2
= \sum (Y_i - \bar{Y})^2 - \hat{\beta}^2\sum (X_i - \bar{X})^2
\end{aligned}
\]
or
\[
\sum (Y_i - \hat{Y}_i)^2
= \sum (Y_i - \bar{Y})^2 - \frac{\bigl[\sum (X_i - \bar{X})(Y_i - \bar{Y})\bigr]^2}{\sum (X_i - \bar{X})^2}.
\]

9. Variation explained by the model. An easy way to calculate $\sum (\hat{Y}_i - \bar{Y})^2$ is
\[
\sum (\hat{Y}_i - \bar{Y})^2
= \sum (Y_i - \bar{Y})^2 - \sum (Y_i - \hat{Y}_i)^2
= \sum (Y_i - \bar{Y})^2 - \left[\sum (Y_i - \bar{Y})^2 - \frac{\bigl[\sum (X_i - \bar{X})(Y_i - \bar{Y})\bigr]^2}{\sum (X_i - \bar{X})^2}\right]
= \frac{\bigl[\sum (X_i - \bar{X})(Y_i - \bar{Y})\bigr]^2}{\sum (X_i - \bar{X})^2}.
\]

10. Coefficient of determination $r^2$:
\[
r^2 = \frac{\text{variation explained by the model}}{\text{total variation}}
= \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}
\]
(a) $r^2$ is related to the residual variance.
\[
\sum (Y_i - \hat{Y}_i)^2
= \sum (Y_i - \bar{Y})^2 - \sum (\hat{Y}_i - \bar{Y})^2
= \left[\sum (Y_i - \bar{Y})^2 - \sum (\hat{Y}_i - \bar{Y})^2\right] \times \frac{\sum (Y_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}
= (1 - r^2) \sum (Y_i - \bar{Y})^2
\]
Dividing the left-hand side by $n-2$ and the right-hand side by $n-1$, for large $n$,
\[
s^2 \approx (1 - r^2)\,\mathrm{Var}(Y).
\]
(b) $r^2$ is the square of the sample correlation between $X$ and $Y$.
\[
r^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}
= \frac{\bigl[\sum (X_i - \bar{X})(Y_i - \bar{Y})\bigr]^2}{\sum (Y_i - \bar{Y})^2 \sum (X_i - \bar{X})^2}
= \left[\frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\sqrt{\sum (Y_i - \bar{Y})^2}}\right]^2
= r_{XY}^2
\]
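These properties can also be verified numerically. Here is a minimal sketch (again using simulated data with illustrative values, in the same style as the earlier sketch) that checks that the residuals sum to zero and are orthogonal to $X$, that $\hat{\beta} = r_{XY}\, s_Y / s_X$, and that the total variation splits into explained plus residual sums of squares with $r^2 = r_{XY}^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (illustrative values only).
n = 50
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)

xbar, ybar = X.mean(), Y.mean()
beta_hat = np.sum((X - xbar) * (Y - ybar)) / np.sum((X - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar

Y_hat = alpha_hat + beta_hat * X
resid = Y - Y_hat

# Properties 1 and 6: residuals sum to zero and are orthogonal to X.
print(np.isclose(resid.sum(), 0.0), np.isclose(np.sum(X * resid), 0.0))

# Property 3: beta_hat = r_XY * s_Y / s_X.
r_xy = np.corrcoef(X, Y)[0, 1]
print(np.isclose(beta_hat, r_xy * Y.std(ddof=1) / X.std(ddof=1)))

# Properties 7 and 10: SST = SSR + SSE and r^2 = r_XY^2.
sst = np.sum((Y - ybar) ** 2)
ssr = np.sum((Y_hat - ybar) ** 2)
sse = np.sum(resid ** 2)
print(np.isclose(sst, ssr + sse), np.isclose(ssr / sst, r_xy ** 2))
```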
Statistical Inference for OLS Estimates

The estimates $\hat{\alpha}$ and $\hat{\beta}$ can be computed for any given sample of data. Therefore, we also need to consider their sampling distributions, because each sample of $(X_i, Y_i)$ pairs will result in different estimates of $\alpha$ and $\beta$.

Consider the following distributional assumption on the error:
\[
Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2).
\]
The above is now a statistical model that describes the distribution of $Y_i$ given $X_i$. Specifically, we assume the observed $Y_i$ is error-prone but centered around the linear model for each value of $X_i$:
\[
Y_i \sim N(\alpha + \beta X_i, \sigma^2), \text{ independently across } i.
\]
The error distribution results in the following assumptions.

1. Homoscedasticity: the variance of $Y_i$ is the same for any $X_i$.
\[
V(Y_i) = V(\alpha + \beta X_i + \varepsilon_i) = V(\varepsilon_i) = \sigma^2
\]
2. Linearity: the mean of $Y_i$ is linear with respect to $X_i$.
\[
E(Y_i) = E(\alpha + \beta X_i + \varepsilon_i) = \alpha + \beta X_i + E(\varepsilon_i) = \alpha + \beta X_i
\]
3. Independence: the $Y_i$ are independent of each other. For $i \neq j$,
\[
\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(\alpha + \beta X_i + \varepsilon_i,\ \alpha + \beta X_j + \varepsilon_j) = \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0.
\]

We can now investigate the bias and variance of the OLS estimators $\hat{\alpha}$ and $\hat{\beta}$. We assume that the $X_i$ are known constants.

• $\hat{\beta}$ is unbiased. With $w_i = \dfrac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$,
\[
E[\hat{\beta}] = E\Bigl[\sum w_i Y_i\Bigr] = \sum w_i E[Y_i] = \sum w_i(\alpha + \beta X_i) = \alpha\sum w_i + \beta\sum w_i X_i = \beta,
\]
because
\[
\sum w_i = \frac{1}{\sum (X_i - \bar{X})^2}\sum (X_i - \bar{X}) = 0
\]
and
\[
\sum w_i X_i = \frac{\sum X_i (X_i - \bar{X})}{\sum (X_i - \bar{X})^2}
= \frac{\sum X_i (X_i - \bar{X})}{\sum X_i (X_i - \bar{X}) - \bar{X}\sum (X_i - \bar{X})}
= \frac{\sum X_i (X_i - \bar{X})}{\sum X_i (X_i - \bar{X}) - 0} = 1.
\]

• $\hat{\alpha}$ is unbiased.
\[
E[\hat{\alpha}] = E[\bar{Y} - \hat{\beta}\bar{X}] = E[\bar{Y}] - \bar{X}E[\hat{\beta}]
= E\Bigl[\frac{1}{n}\sum Y_i\Bigr] - \beta\bar{X}
= \frac{1}{n}\sum (\alpha + \beta X_i) - \beta\bar{X}
= \frac{1}{n}(n\alpha + n\beta\bar{X}) - \beta\bar{X}
= \alpha + \beta\bar{X} - \beta\bar{X} = \alpha
\]

• $\hat{\beta}$ has variance $\dfrac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \dfrac{\sigma^2}{(n-1)s_X^2}$, where $\sigma^2$ is the residual variance and $s_X^2$ is the sample variance of $X_1, \dots, X_n$.
\[
V[\hat{\beta}] = V\Bigl[\sum w_i Y_i\Bigr] = \sum w_i^2 V[Y_i] = \sigma^2\sum w_i^2,
\qquad
\sum w_i^2 = \frac{\sum (X_i - \bar{X})^2}{\bigl[\sum (X_i - \bar{X})^2\bigr]^2} = \frac{1}{\sum (X_i - \bar{X})^2}
\]

• $\hat{\alpha}$ has variance $\sigma^2\Bigl(\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\Bigr)$. With $v_i = \dfrac{1}{n} - \dfrac{\bar{X}(X_i - \bar{X})}{\sum (X_i - \bar{X})^2}$,
\[
\begin{aligned}
V[\hat{\alpha}] = V\Bigl[\sum v_i Y_i\Bigr] = \sum v_i^2 V[Y_i]
&= \sigma^2 \sum \left(\frac{1}{n} - \frac{\bar{X}(X_i - \bar{X})}{\sum (X_i - \bar{X})^2}\right)^2 \\
&= \sigma^2 \left[\sum \frac{1}{n^2} + \frac{\bar{X}^2\sum (X_i - \bar{X})^2}{\bigl[\sum (X_i - \bar{X})^2\bigr]^2} - \frac{2\bar{X}}{n}\,\frac{\sum (X_i - \bar{X})}{\sum (X_i - \bar{X})^2}\right]
= \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right)
\end{aligned}
\]

• $\hat{\alpha}$ and $\hat{\beta}$ are negatively correlated (when $\bar{X} > 0$).
\[
\begin{aligned}
\mathrm{Cov}(\hat{\alpha}, \hat{\beta})
&= \mathrm{Cov}\Bigl(\sum v_i Y_i,\ \sum w_i Y_i\Bigr)
= \sum v_i w_i \,\mathrm{Cov}(Y_i, Y_i)
= \sigma^2 \sum v_i w_i \\
&= \sigma^2 \sum \left(\frac{1}{n} - \frac{\bar{X}(X_i - \bar{X})}{\sum (X_i - \bar{X})^2}\right)\frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}
= \sigma^2 \left[\frac{\sum (X_i - \bar{X})}{n\sum (X_i - \bar{X})^2} - \frac{\bar{X}\sum (X_i - \bar{X})^2}{\bigl[\sum (X_i - \bar{X})^2\bigr]^2}\right] \\
&= \sigma^2\left[0 - \frac{\bar{X}}{\sum (X_i - \bar{X})^2}\right]
= -\frac{\sigma^2\bar{X}}{\sum (X_i - \bar{X})^2}
\end{aligned}
\]

Therefore, the sampling distributions for the regression parameters are
\[
\hat{\alpha} \sim N\!\left(\alpha,\ \sigma^2\Bigl(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\Bigr)\right),
\qquad
\hat{\beta} \sim N\!\left(\beta,\ \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right).
\]
However, in most cases we don't know $\sigma^2$ and will need to estimate it from the data. One unbiased estimator of $\sigma^2$ is obtained from the residuals:
\[
s^2 = \frac{1}{n-2}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.
\]
By replacing $\sigma^2$ with $s^2$, we have the following sampling distributions for $\hat{\alpha}$ and $\hat{\beta}$:
\[
\hat{\alpha} \sim t_{n-2}\!\left(\alpha,\ s^2\Bigl(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\Bigr)\right),
\qquad
\hat{\beta} \sim t_{n-2}\!\left(\beta,\ \frac{s^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}\right).
\]
Note that the degrees of freedom are $n-2$ because we estimated two parameters ($\alpha$ and $\beta$) in order to calculate the residuals. Similarly, recall that $\sum (\hat{Y}_i - Y_i) = 0$.

Knowing that $\hat{\alpha}$ and $\hat{\beta}$ follow a $t$-distribution with $n-2$ degrees of freedom, we are able to construct confidence intervals and carry out hypothesis tests as before. For example, a 90% confidence interval for $\beta$ is given by
\[
\hat{\beta} \pm t_{5\%,\, n-2} \times \sqrt{\frac{s^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}.
\]
Similarly, consider testing the hypotheses $H_0\!: \beta = \beta_0$ versus $H_A\!: \beta \neq \beta_0$. The corresponding $t$-test statistic is given by
\[
\frac{\hat{\beta} - \beta_0}{\sqrt{\dfrac{s^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}} \sim t_{n-2}.
\]
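These formulas translate directly into code. The following is a minimal sketch (same simulated-data setup as the earlier sketches; the null value beta0 = 0 is an illustrative choice) that computes $s^2$, the standard errors of $\hat{\alpha}$ and $\hat{\beta}$, a 90% confidence interval for $\beta$, and the $t$ statistic and $p$-value for $H_0\!: \beta = \beta_0$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data (illustrative values only).
n = 50
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)

xbar, ybar = X.mean(), Y.mean()
sxx = np.sum((X - xbar) ** 2)
beta_hat = np.sum((X - xbar) * (Y - ybar)) / sxx
alpha_hat = ybar - beta_hat * xbar
resid = Y - (alpha_hat + beta_hat * X)

# Unbiased estimate of the residual variance (n - 2 degrees of freedom).
s2 = np.sum(resid ** 2) / (n - 2)

# Standard errors from the sampling-distribution formulas.
se_beta = np.sqrt(s2 / sxx)
se_alpha = np.sqrt(s2 * (1 / n + xbar ** 2 / sxx))

# 90% confidence interval for beta: beta_hat +/- t_{5%, n-2} * SE.
t_crit = stats.t.ppf(0.95, df=n - 2)
ci_beta = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)

# t-test of H0: beta = beta0 (beta0 = 0 is an illustrative null value).
beta0 = 0.0
t_stat = (beta_hat - beta0) / se_beta
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(se_alpha, se_beta, ci_beta, t_stat, p_value)
```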
Prediction

Based on the sampling distributions of $\hat{\alpha}$ and $\hat{\beta}$, we can also investigate the uncertainty associated with the fitted value $\hat{Y}_0 = \hat{\alpha} + \hat{\beta}X_0$ for any $X_0$. Specifically,
\[
\hat{Y}_0 \sim N\!\left(\alpha + \beta X_0,\ \sigma^2\Bigl(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\Bigr)\right).
\]
Note that $\alpha + \beta X_0$ is the mean of $Y$ given $X_0$; the textbook uses the notation $\mu_0$. Replacing the unknowns with our estimates, we have
\[
\hat{Y}_0 \sim t_{n-2}\!\left(\hat{\alpha} + \hat{\beta} X_0,\ s^2\Bigl(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\Bigr)\right).
\]

• The fitted value $\hat{Y}_0 = \hat{\alpha} + \hat{\beta}X_0$ has mean $\alpha + \beta X_0$ and variance $\sigma^2\Bigl(\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\Bigr)$.
\[
E(\hat{Y}_0) = E(\hat{\alpha} + \hat{\beta}X_0) = E(\hat{\alpha}) + X_0 E(\hat{\beta}) = \alpha + \beta X_0
\]
\[
\begin{aligned}
V(\hat{Y}_0) = V(\hat{\alpha} + \hat{\beta}X_0)
&= V(\hat{\alpha}) + X_0^2 V(\hat{\beta}) + 2X_0\,\mathrm{Cov}(\hat{\alpha}, \hat{\beta}) \\
&= \sigma^2\Bigl(\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\Bigr) + \frac{\sigma^2 X_0^2}{\sum (X_i - \bar{X})^2} - \frac{2\sigma^2 X_0\bar{X}}{\sum (X_i - \bar{X})^2}
= \sigma^2\Bigl(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\Bigr)
\end{aligned}
\]

We can also consider the uncertainty of a new observation $Y_0^*$ at $X_0$. In other words, what values of $Y_0^*$ do we expect to see at $X_0$? Note that this is different from $\hat{Y}_0$, which is the fitted value that represents the mean of $Y$ at $X_0$. Since
\[
Y_0^* = \alpha + \beta X_0 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),
\]
by replacing the unknown parameters with our estimates, we obtain
\[
Y_0^* \sim t_{n-2}\!\left(\hat{\alpha} + \hat{\beta} X_0,\ s^2 + s^2\Bigl(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X_i - \bar{X})^2}\Bigr)\right).
\]
Again, this distribution describes the actual observable data rather than model parameters; the resulting interval estimates are therefore known as prediction intervals, instead of confidence intervals.
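To make the distinction concrete, here is a minimal sketch (same simulated-data setup as the earlier sketches; X0 = 5.0 and the 95% level are illustrative choices) that computes a confidence interval for the mean response at $X_0$ and the wider prediction interval for a new observation at $X_0$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated data (illustrative values only).
n = 50
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)

xbar, ybar = X.mean(), Y.mean()
sxx = np.sum((X - xbar) ** 2)
beta_hat = np.sum((X - xbar) * (Y - ybar)) / sxx
alpha_hat = ybar - beta_hat * xbar
s2 = np.sum((Y - (alpha_hat + beta_hat * X)) ** 2) / (n - 2)

X0 = 5.0                              # illustrative prediction point
y0_hat = alpha_hat + beta_hat * X0    # estimated mean of Y at X0
t_crit = stats.t.ppf(0.975, df=n - 2)

# Variance of the fitted value (mean response) vs. a new observation.
var_mean = s2 * (1 / n + (X0 - xbar) ** 2 / sxx)
var_new = s2 + var_mean

ci_mean = (y0_hat - t_crit * np.sqrt(var_mean), y0_hat + t_crit * np.sqrt(var_mean))
pi_new = (y0_hat - t_crit * np.sqrt(var_new), y0_hat + t_crit * np.sqrt(var_new))

print(ci_mean)  # confidence interval for the mean of Y at X0
print(pi_new)   # prediction interval for a new observation at X0 (wider)
```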