This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site. Copyright 2006, The Johns Hopkins University and Karl W. Broman. All rights reserved. Use of these materials permitted only in accordance with license rights granted. Materials provided “AS IS”; no representations or warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for obtaining permissions for use from third parties as needed.

Example

Measurements of degradation of heme with different concentrations of hydrogen peroxide (H2O2), for different species of heme.

[Figure: OD versus H2O2 concentration (0, 10, 25, 50) for pf3d7 and pyoelii.]

[Figure: "Degradation": OD versus H2O2 concentration.]

[Figure: "Degradation [%]": OD [%] versus H2O2 concentration for pfhz, pbr, pov, pknow, pviv, pf3d7, pyoelii, and pgal.]

[Figure: the lines Y = 20 + 15X, Y = 40 + 8X, Y = 70 + 0X, and Y = 0 + 5X, plotted as Y versus X.]

[Figure: a line in the (X, Y) plane with the intercept $\beta_0$ and the slope $\beta_1$ marked.]

The regression model

Let X be the predictor and Y be the response. Assume we have n observations $(x_1, y_1), \ldots, (x_n, y_n)$ from X and Y.

The simple linear regression model is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim \text{iid } N(0, \sigma^2).$$

How do we estimate $\beta_0$, $\beta_1$, $\sigma^2$?

Fitted values and residuals

We can write

$$\epsilon_i = y_i - \beta_0 - \beta_1 x_i$$

For a pair of estimates $(\hat\beta_0, \hat\beta_1)$ for $(\beta_0, \beta_1)$ we define the fitted values as

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$$

The residuals are

$$\hat\epsilon_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$$

[Figure: "Residuals": Y versus X with the fitted values $\hat Y$ and the residuals $\hat\epsilon$ marked.]

Residual sum of squares

For every pair of values for β0 and β1 we get a different value for the residual sum of squares.
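$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

We can look at RSS as a function of $\beta_0$ and $\beta_1$. We try to minimize this function, i.e. we try to find

$$(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \mathrm{RSS}(\beta_0, \beta_1)$$

Not surprisingly, this method is called least squares estimation.

[Figures: "Residual sum of squares": RSS plotted against b0 and b1, and RSS over the $(\beta_0, \beta_1)$ plane.]

Notation

Assume we have n observations: $(x_1, y_1), \ldots, (x_n, y_n)$.

$$\bar x = \frac{\sum_i x_i}{n} \qquad \bar y = \frac{\sum_i y_i}{n}$$

$$\mathrm{SXX} = \sum_i (x_i - \bar x)^2 = \sum_i x_i^2 - n \bar x^2$$

$$\mathrm{SYY} = \sum_i (y_i - \bar y)^2 = \sum_i y_i^2 - n \bar y^2$$

$$\mathrm{SXY} = \sum_i (x_i - \bar x)(y_i - \bar y) = \sum_i x_i y_i - n \bar x \bar y$$

$$\mathrm{RSS} = \sum_i (y_i - \hat y_i)^2 = \sum_i \hat\epsilon_i^2$$

Parameter estimates

The function

$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$

is minimized by

$$\hat\beta_1 = \frac{\mathrm{SXY}}{\mathrm{SXX}} \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$

Useful to know

Using the parameter estimates, our best guess for any y given x is

$$y = \hat\beta_0 + \hat\beta_1 x$$

Hence, at $x = \bar x$,

$$\hat\beta_0 + \hat\beta_1 \bar x = \bar y - \hat\beta_1 \bar x + \hat\beta_1 \bar x = \bar y$$

That means every least squares regression line goes through the point $(\bar x, \bar y)$.

Variance estimates

As variance estimate we use

$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - 2}$$

This quantity is called the residual mean square. It has the property

$$(n - 2) \times \frac{\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-2}$$

In particular, this implies

$$E(\hat\sigma^2) = \sigma^2$$

Example

| H2O2 concentration | 0 | 10 | 25 | 50 |
|---|---|---|---|---|
| OD | 0.3399 | 0.3168 | 0.2460 | 0.1535 |
| OD | 0.3563 | 0.3054 | 0.2618 | 0.1613 |
| OD | 0.3538 | 0.3174 | 0.2848 | 0.1525 |

We get $\bar x = 21.25$, $\bar y = 0.27$, SXX = 4256.25, SXY = −16.48, RSS = 0.0013.

Therefore

$$\hat\beta_1 = \frac{-16.48}{4256.25} = -0.0039$$

$$\hat\beta_0 = 0.27 - (-0.0039) \times 21.25 = 0.353$$

$$\hat\sigma = \sqrt{\frac{0.0013}{12 - 2}} = 0.0115$$

(The value 0.0115 is obtained from the unrounded RSS.)

[Figure: pf3d7 OD versus H2O2 concentration with the fitted line Y = 0.353 − 0.0039X.]

The R function lm() does all these calculations for you. And more!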
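To make the arithmetic behind the example concrete, here is a minimal sketch that applies the formulas from the Notation, Parameter estimates, and Variance estimates slides to the twelve OD readings. It is written in plain Python rather than R, and the function name `simple_least_squares` and the dictionary keys are hypothetical names chosen for illustration; the slides themselves only refer to R's lm().

```python
from math import sqrt

# OD readings from the example table, entered row by row:
# each row holds one replicate at H2O2 concentrations 0, 10, 25, 50.
x = [0, 10, 25, 50] * 3
y = [0.3399, 0.3168, 0.2460, 0.1535,
     0.3563, 0.3054, 0.2618, 0.1613,
     0.3538, 0.3174, 0.2848, 0.1525]


def simple_least_squares(x, y):
    """Least squares estimates for y = b0 + b1 * x + error (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centered sums of squares and cross products, as on the Notation slide.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Slope and intercept that minimize RSS.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares and residual standard deviation (n - 2 df).
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = sqrt(rss / (n - 2))
    return {"x_bar": x_bar, "y_bar": y_bar, "sxx": sxx, "sxy": sxy,
            "b0": b0, "b1": b1, "rss": rss, "sigma_hat": sigma_hat}


fit = simple_least_squares(x, y)
for name, value in fit.items():
    print(f"{name:>9}: {value:.4f}")
```

The printed values should agree with the numbers quoted on the slide up to rounding. The fitted line also passes through $(\bar x, \bar y)$, which can be checked by evaluating `fit["b0"] + fit["b1"] * fit["x_bar"]`.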
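Comparing models

We want to test whether $\beta_1 = 0$:

$$H_0: y_i = \beta_0 + \epsilon_i \quad \text{versus} \quad H_a: y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

[Figure: y versus x, showing the fit under Ha and the fit under H0.]

Sum of squares

Under $H_a$:

$$\mathrm{RSS} = \sum_i (y_i - \hat y_i)^2 = \mathrm{SYY} - \frac{(\mathrm{SXY})^2}{\mathrm{SXX}} = \mathrm{SYY} - \hat\beta_1^2 \times \mathrm{SXX}$$

Under $H_0$ the least squares estimate of the intercept is $\hat\beta_0 = \bar y$, so

$$\sum_i (y_i - \hat\beta_0)^2 = \sum_i (y_i - \bar y)^2 = \mathrm{SYY}$$

Hence

$$\mathrm{SS}_{\mathrm{reg}} = \mathrm{SYY} - \mathrm{RSS} = \frac{(\mathrm{SXY})^2}{\mathrm{SXX}}$$

ANOVA

| Source | df | SS | MS | F |
|---|---|---|---|---|
| regression on X | 1 | SSreg | MSreg = SSreg / 1 | MSreg / MSE |
| residuals for full model | n − 2 | RSS | MSE = RSS / (n − 2) | |
| total | n − 1 | SYY | | |

David Sullivan’s pf3d7 data

| Source | df | SS | MS | F |
|---|---|---|---|---|
| regression on X | 1 | 0.06378 | 0.06378 | 484.1 |
| residuals for full model | 10 | 0.00131 | 0.00013 | |
| total | 11 | 0.06509 | | |

[Figure: pf3d7 OD versus H2O2 concentration with the fitted line Y = 0.353 − 0.0039X (fit under Ha) and the horizontal line Y = 0.271 (fit under H0).]

Remember: The R function lm() does the calculations for you!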
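The entries of the ANOVA table can be computed directly from SXX, SXY, and SYY. The following is a minimal sketch in plain Python (not R), using the twelve pf3d7 OD readings from the earlier example; all variable names are hypothetical and chosen for illustration. It stops at the F statistic and does not compute a p-value, which would require the F distribution with 1 and n − 2 degrees of freedom.

```python
# pf3d7 OD readings: three replicates at H2O2 concentrations 0, 10, 25, 50.
x = [0, 10, 25, 50] * 3
y = [0.3399, 0.3168, 0.2460, 0.1535,
     0.3563, 0.3054, 0.2618, 0.1613,
     0.3538, 0.3174, 0.2848, 0.1525]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Centered sums of squares and cross products.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Regression sum of squares and residual sum of squares under the full model.
ss_reg = sxy ** 2 / sxx
rss = syy - ss_reg

# Mean squares and the F statistic for testing beta1 = 0.
ms_reg = ss_reg / 1
mse = rss / (n - 2)
f_stat = ms_reg / mse

print(f"{'Source':<26}{'df':>4}{'SS':>10}{'MS':>10}{'F':>8}")
print(f"{'regression on X':<26}{1:>4}{ss_reg:>10.5f}{ms_reg:>10.5f}{f_stat:>8.1f}")
print(f"{'residuals for full model':<26}{n - 2:>4}{rss:>10.5f}{mse:>10.5f}")
print(f"{'total':<26}{n - 1:>4}{syy:>10.5f}")
```

The output should match the pf3d7 table above up to rounding in the last digit.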