Linear regression models

Simple Linear Regression

History
• Developed by Sir Francis Galton (1822–1911) in his article "Regression towards mediocrity in hereditary stature"

Purposes
• To describe the linear relationship between two continuous variables: the response variable (y-axis) and a single predictor variable (x-axis)
• To determine how much of the variation in Y can be explained by the linear relationship with X, and how much of the relationship remains unexplained
• To predict new values of Y from new values of X

The linear regression model is:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

• $X_i$ and $Y_i$ are paired observations (i = 1 to n)
• $\beta_0$ = population intercept (the value of $Y_i$ when $X_i = 0$)
• $\beta_1$ = population slope (measures the change in $Y_i$ per unit change in $X_i$)
• $\varepsilon_i$ = the random or unexplained error associated with the i-th observation. The $\varepsilon_i$ are assumed to be independent and distributed as $N(0, \sigma^2)$.

[Figure: a straight line with intercept $\beta_0$ and slope $\beta_1$, i.e., a rise of $\beta_1$ per 1.0 unit of run.]

Linear models approximate non-linear functions over a limited domain.

[Figure: a curved function with a fitted straight line; the fit is reasonable within the interpolation region but degrades in the extrapolation regions on either side.]

• For a given value of X, the sampled Y values are independent with normally distributed errors:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma^2)$, so that $E(\varepsilon_i) = 0$ and $E(Y_i) = \beta_0 + \beta_1 X_i$

[Figure: normal error distributions centered on $E(Y_1)$ and $E(Y_2)$ at two predictor values $X_1$ and $X_2$ along the regression line.]

Fitting data to a linear model:

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, with residual $Y_i - \hat{Y}_i$ estimating $\varepsilon_i$

The squared residual: $d_i^2 = (Y_i - \hat{Y}_i)^2$

The residual sum of squares: $RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

Estimating Regression Parameters
• The "best fit" estimates of the population regression parameters ($\beta_0$ and $\beta_1$) are the values that minimize the residual sum of squares ($SS_{residual}$) between each observed value and the value predicted by the model:

Choose $\hat{\beta}_0, \hat{\beta}_1$ to minimize $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) \right)^2$

Sum of squares: $SS_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})$

Sum of cross products: $SS_{XY} = \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})$

Least-squares parameter estimates:

$\hat{\beta}_1 = \dfrac{SS_{XY}}{SS_X} = \dfrac{s_{XY}}{s_X^2}$, where $SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2$

Sample variance of X: $s_X^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})$

Sample covariance: $s_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$

$\hat{\beta}_1 = \dfrac{s_{XY}}{s_X^2} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})} = \dfrac{SS_{XY}}{SS_X}$

Solving for the intercept: $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

Thus, our estimated regression equation is: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

Hypothesis Tests with Regression
• The null hypothesis is that there is no linear relationship between X and Y:

H0: $\beta_1 = 0$, so $Y_i = \beta_0 + \varepsilon_i$
HA: $\beta_1 \neq 0$, so $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

• We can use an F-ratio (i.e., a ratio of variances) to test these hypotheses, as computed in the sketch at the end of this section.

Variance of the error of regression:

$\hat{\sigma}^2 = \dfrac{SS_{residual}}{n-2} = \dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}$

NOTE: this is also referred to as the residual variance, mean squared error (MSE), or residual mean square ($MS_{residual}$).

Mean square of regression:

$MS_{regression} = \dfrac{SS_{regression}}{1} = \dfrac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{1}$

The F-ratio is $MS_{regression} / MS_{residual}$. This ratio follows the F-distribution with (1, n−2) degrees of freedom.

Variance components and coefficient of determination:

$SS_Y = SS_{reg} + RSS$

$r^2 = \dfrac{SS_{reg}}{SS_Y} = \dfrac{SS_{reg}}{SS_{reg} + RSS}$

ANOVA table for regression:

| Source     | Degrees of freedom | Sum of squares                                        | Mean square     | Expected mean square                                                    | F-ratio                          |
| Regression | 1                  | $SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$   | $SS_{reg}/1$    | $\sigma_\varepsilon^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$     | $\dfrac{SS_{reg}/1}{RSS/(n-2)}$  |
| Residual   | n−2                | $RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$            | $RSS/(n-2)$     | $\sigma_\varepsilon^2$                                                  |                                  |
| Total      | n−1                | $SS_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$             | $SS_Y/(n-1)$    | $\sigma_Y^2$                                                            |                                  |

Product-moment correlation coefficient:

$r = \dfrac{SS_{XY}}{\sqrt{SS_X \, SS_Y}} = \dfrac{s_{XY}}{s_X \, s_Y}$
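To tie the estimation and testing formulas above together, here is a minimal sketch in Python. It is a sketch only: it assumes NumPy and SciPy are available, and the data values are invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Invented paired observations (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])
n = len(x)

# Sums of squares and cross products
ss_x = np.sum((x - x.mean()) ** 2)               # SS_X
ss_y = np.sum((y - y.mean()) ** 2)               # SS_Y
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))  # SS_XY

# Least-squares estimates
b1 = ss_xy / ss_x              # slope: SS_XY / SS_X
b0 = y.mean() - b1 * x.mean()  # intercept: Ybar - b1 * Xbar

# ANOVA decomposition: SS_Y = SS_reg + RSS
y_hat = b0 + b1 * x
ss_reg = np.sum((y_hat - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)

r2 = ss_reg / ss_y   # coefficient of determination
ms_reg = ss_reg / 1  # mean square of regression
ms_res = rss / (n - 2)  # residual variance (MSE)
f_ratio = ms_reg / ms_res
p_value = stats.f.sf(f_ratio, 1, n - 2)  # upper-tail F(1, n-2) probability

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, r^2 = {r2:.3f}")
print(f"F = {f_ratio:.2f}, p = {p_value:.4g}")
```

The same estimates could be obtained directly from `scipy.stats.linregress(x, y)`; computing them by hand simply makes the correspondence with $SS_X$, $SS_{XY}$, and the ANOVA table explicit.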
Parametric Confidence Intervals
• If we assume our parameter of interest has a particular sampling distribution, and we have estimated its expected value and variance, we can construct a confidence interval for a given percentile.
• Example: if we assume Y is a normal random variable with unknown mean μ and variance σ², then $\dfrac{\bar{Y} - \mu}{\sigma_{\bar{Y}}}$ is distributed as a standard normal variable. But since we don't know σ, we must divide by the estimated standard error instead: $\dfrac{\bar{Y} - \mu}{s_{\bar{Y}}}$, giving us a t-distribution with (n−1) degrees of freedom.
• The 100(1−α)% confidence interval for μ is then given by:

$\bar{Y} - t_{(1-\alpha/2;\, n-1)} \, s_{\bar{Y}} \;\leq\; \mu \;\leq\; \bar{Y} + t_{(1-\alpha/2;\, n-1)} \, s_{\bar{Y}}$

• IMPORTANT: this does not mean "There is a 100(1−α)% chance that the true population mean μ lies inside this interval." It means that if we were to repeatedly sample the population in the same way, 100(1−α)% of the resulting confidence intervals would contain the true population mean μ.

Publication form of ANOVA table for regression:

| Source     | Sum of Squares | df | Mean Square | F      | Sig.    |
| Regression | 11.479         | 1  | 11.479      | 21.044 | 0.00035 |
| Residual   | 8.182          | 15 | 0.545       |        |         |
| Total      | 19.661         | 16 |             |        |         |

Variance of the estimated intercept:

$\hat{\sigma}^2_{\hat{\beta}_0} = \hat{\sigma}^2 \left[ \dfrac{1}{n} + \dfrac{\bar{X}^2}{SS_X} \right]$

$\hat{\beta}_0 - t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_0} \;\leq\; \beta_0 \;\leq\; \hat{\beta}_0 + t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_0}$

Variance of the slope estimator:

$\hat{\sigma}^2_{\hat{\beta}_1} = \dfrac{\hat{\sigma}^2}{SS_X}$

$\hat{\beta}_1 - t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_1} \;\leq\; \beta_1 \;\leq\; \hat{\beta}_1 + t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_1}$

Variance of the fitted value:

$\hat{\sigma}^2_{(\hat{Y}|X)} = \hat{\sigma}^2 \left[ \dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{SS_X} \right]$

$\hat{Y} - t_{\alpha, n-2} \, \hat{\sigma}_{(\hat{Y}|X)} \;\leq\; E(Y|X) \;\leq\; \hat{Y} + t_{\alpha, n-2} \, \hat{\sigma}_{(\hat{Y}|X)}$

Variance of the predicted value ($\tilde{Y}$) for a single new observation at $\tilde{X}$:

$\hat{\sigma}^2_{(\tilde{Y}|\tilde{X})} = \hat{\sigma}^2 \left[ 1 + \dfrac{1}{n} + \dfrac{(\tilde{X} - \bar{X})^2}{SS_X} \right]$

$\hat{Y} - t_{\alpha, n-2} \, \hat{\sigma}_{(\tilde{Y}|\tilde{X})} \;\leq\; \tilde{Y} \;\leq\; \hat{Y} + t_{\alpha, n-2} \, \hat{\sigma}_{(\tilde{Y}|\tilde{X})}$

(Both kinds of interval are computed in the first sketch at the end of these notes.)

[Figure: fitted regression line with confidence and prediction bands for the species–area data; x-axis: Ln(Island Area).]

Assumptions of regression
• The linear model correctly describes the functional relationship between X and Y
• The X variable is measured without error
• For a given value of X, the sampled Y values are independent with normally distributed errors
• Variances are constant along the regression line

[Figure: residual plot for the species–area relationship; x-axis: Unstandardized Predicted Value; y-axis: residuals.]
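The standard errors and t-based intervals above translate directly into code. Here is another minimal sketch, again with invented data; `x_new` is an arbitrary new X value chosen for illustration.

```python
import numpy as np
from scipy import stats

# Invented data (same hypothetical values as the earlier sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])
n = len(x)

ss_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
b0 = y.mean() - b1 * x.mean()
sigma2_hat = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)  # residual variance

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)  # two-sided critical value

# Confidence intervals for the slope and intercept
se_b1 = np.sqrt(sigma2_hat / ss_x)
se_b0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / ss_x))
print(f"slope:     {b1:.3f} +/- {t_crit * se_b1:.3f}")
print(f"intercept: {b0:.3f} +/- {t_crit * se_b0:.3f}")

# Interval for the fitted value E(Y|X) at a new X, and the wider
# prediction interval for a single new observation at the same X
x_new = 4.5  # hypothetical new X value
y_fit = b0 + b1 * x_new
se_fit = np.sqrt(sigma2_hat * (1 / n + (x_new - x.mean()) ** 2 / ss_x))
se_pred = np.sqrt(sigma2_hat * (1 + 1 / n + (x_new - x.mean()) ** 2 / ss_x))
print(f"fitted-value CI:     {y_fit:.3f} +/- {t_crit * se_fit:.3f}")
print(f"prediction interval: {y_fit:.3f} +/- {t_crit * se_pred:.3f}")
```

Note that the prediction interval is always wider than the confidence band for the fitted value, because it adds the full residual variance $\hat{\sigma}^2$ (the leading 1 in the bracket) on top of the uncertainty in the estimated line.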
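Finally, a residual plot like the one described above can be produced with matplotlib. This is a sketch assuming the same invented data; under the assumptions listed, the residuals should scatter around zero with no trend and roughly constant spread.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data (same hypothetical values as the earlier sketches)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

b1, b0 = np.polyfit(x, y, 1)  # least-squares slope and intercept
y_hat = b0 + b1 * x           # fitted (predicted) values
residuals = y - y_hat

plt.scatter(y_hat, residuals)
plt.axhline(0.0, color="gray", linestyle="--")
plt.xlabel("Unstandardized Predicted Value")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()
```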