STAT355 - Probability & Statistics
Chapter 12: Simple Linear Regression and Correlation
Fall 2011

Outline
  12.1 The Simple Linear Regression Model
  12.2 Estimating Model Parameters
  12.3 Inferences About the Slope Parameter β1

The Simple Linear Regression Model

The simplest deterministic mathematical relationship between two variables x and y is a linear relationship y = β0 + β1 x. The set of pairs (x, y) for which y = β0 + β1 x determines a straight line with slope β1 and y-intercept β0.

The objective of this section is to develop a linear probabilistic model. If the two variables are not deterministically related, then for a fixed value of x there is uncertainty in the value of the second variable.

For example, suppose we are investigating the relationship between a child's age and vocabulary size and decide to select a child of age x = 5.0 years. Before the selection is made, vocabulary size is a random variable Y. After a particular 5-year-old child has been selected and tested, a vocabulary of 2000 words may result; we would then say that the observed value of Y associated with fixing x = 5.0 was y = 2000.

More generally, the variable whose value is fixed by the experimenter will be denoted by x and called the independent, predictor, or explanatory variable. For fixed x, the second variable is random; we denote this random variable and its observed value by Y and y, respectively, and refer to it as the dependent or response variable.
Usually observations will be made for a number of settings of the independent variable. Let x1, x2, ..., xn denote the values of the independent variable for which observations are made, and let Yi and yi, respectively, denote the random variable and observed value associated with xi. The available bivariate data then consist of the n pairs (x1, y1), (x2, y2), ..., (xn, yn).

A picture of these data, called a scatter plot, gives a preliminary impression of the nature of any relationship. In such a plot, each (xi, yi) is represented as a point in a two-dimensional coordinate system.

A Linear Probabilistic Model

For the deterministic model y = β0 + β1 x, the actual observed value of y is a linear function of x. The appropriate generalization to a probabilistic model assumes that the expected value of Y is a linear function of x, but that for fixed x the variable Y differs from its expected value by a random amount.

Definition
There are parameters β0, β1, and σ² such that, for any fixed value of the independent variable x, the dependent variable is a random variable related to x through the model equation

    Y = β0 + β1 x + ε    (1)

The quantity ε in the model equation is a random variable, assumed to be normally distributed with E(ε) = 0 and V(ε) = σ².

The variable ε is usually referred to as the random deviation or random error term in the model. Without ε, any observed pair (x, y) would correspond to a point falling exactly on the line y = β0 + β1 x, called the true (or population) regression line.
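The model equation (1) can be illustrated with a small simulation. This is a sketch in Python (the slides' own examples use R), and the parameter values β0 = 1, β1 = 2, and σ = 0.5 are arbitrary choices for illustration, not taken from the text:

```python
import random

random.seed(1)  # reproducible illustration

beta0, beta1, sigma = 1.0, 2.0, 0.5  # hypothetical true parameters

def draw_y(x):
    """One observation from Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)."""
    eps = random.gauss(0.0, sigma)
    return beta0 + beta1 * x + eps

# For fixed x = 5.0, repeated observations scatter around the true regression
# line's height E(Y) = beta0 + beta1*5.0 = 11.0.
ys = [draw_y(5.0) for _ in range(10_000)]
mean_y = sum(ys) / len(ys)
print(round(mean_y, 2))  # close to 11.0
```

The sample mean of many simulated Y values at x = 5.0 approaches the expected value on the true regression line, which is exactly the sense in which E(Y) is a linear function of x.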
The inclusion of the random error term ε allows (x, y) to fall either above the true regression line (when ε > 0) or below the line (when ε < 0). The slope β1 of the true regression line is interpreted as the expected change in Y associated with a 1-unit increase in the value of x.

Estimating Model Parameters

Principle of Least Squares

The vertical deviation of the point (xi, yi) from the line y = b0 + b1 x is

    height of point − height of line = yi − (b0 + b1 xi)

The sum of squared vertical deviations from the points (x1, y1), ..., (xn, yn) to the line is then

    f(b0, b1) = Σ [yi − (b0 + b1 xi)]²   (sum over i = 1, ..., n)

The point estimates of β0 and β1, denoted by β̂0 and β̂1 and called the least squares estimates, are the values that minimize f(b0, b1). That is, β̂0 and β̂1 are such that f(β̂0, β̂1) ≤ f(b0, b1) for any b0 and b1.

The least squares estimate of the slope coefficient β1 of the true regression line is

    b1 = β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = Sxy / Sxx

The least squares estimate of the intercept β0 of the true regression line is

    b0 = β̂0 = (Σ yi − β̂1 Σ xi)/n = ȳ − β̂1 x̄

Remarks: Computing formulas for Sxy and Sxx are

    Sxy = Σ xi yi − (Σ xi)(Σ yi)/n   and   Sxx = Σ xi² − (Σ xi)²/n

Example
Consider the following data:

    x | 2.0  3.1  3.9  5.3   6.0   7.2   8.0
    y | 5.3  6.3  9.0  12.2  11.5  16.7  16.9

Fit a linear model to the data and obtain the regression line (using your calculator).

Fitted Values and Residuals

The parameter σ² determines the amount of variability inherent in the regression model.
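As a check on the calculator exercise, the least squares formulas above can be coded directly from the example data. A plain-Python sketch (the slides' own examples use R):

```python
x = [2.0, 3.1, 3.9, 5.3, 6.0, 7.2, 8.0]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

# Computing formulas for Sxy and Sxx
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n

b1 = Sxy / Sxx                   # least squares slope estimate
b0 = (sum(y) - b1 * sum(x)) / n  # least squares intercept, = ybar - b1*xbar

print(f"yhat = {b0:.3f} + {b1:.3f} x")  # yhat = 0.661 + 2.064 x
```

The slope is computed first because the intercept formula ȳ − β̂1 x̄ uses it.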
A large value of σ² will lead to observed (xi, yi)'s that are quite spread out about the true regression line, whereas when σ² is small the observed points will tend to fall very close to the true line.

Definition
The fitted (or predicted) values ŷ1, ŷ2, ..., ŷn are obtained by successively substituting x1, ..., xn into the equation of the estimated regression line:

    ŷ1 = β̂0 + β̂1 x1,  ŷ2 = β̂0 + β̂1 x2,  ...,  ŷn = β̂0 + β̂1 xn

The residuals are y1 − ŷ1, y2 − ŷ2, ..., yn − ŷn, the differences between the observed and fitted y values.

Estimating σ²

Definition
The error sum of squares (also called the residual sum of squares), denoted by SSE, is

    SSE = Σ (yi − ŷi)² = Σ [yi − (β̂0 + β̂1 xi)]²

and the estimate of σ² is

    σ̂² = s² = SSE/(n − 2) = Σ (yi − ŷi)² / (n − 2)

The Coefficient of Determination

Examples using R.

Definition
The coefficient of determination, denoted by r², is given by

    r² = 1 − SSE/SST

where SST = Syy = Σ (yi − ȳ)². It is interpreted as the proportion of observed y variation that can be explained by the simple linear regression model.

Remarks
- The higher the value of r², the more successful the simple linear regression model is in explaining y variation.
- When regression analysis is done by a statistical computer package, either r² or 100r² (the percentage of variation explained by the regression) is a prominent part of the output.
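Applying these definitions to the example data from the least squares slide gives SSE, s², and r² directly. A self-contained Python sketch (again standing in for the R examples):

```python
x = [2.0, 3.1, 3.9, 5.3, 6.0, 7.2, 8.0]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

# Least squares fit, as on the previous slides
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar)**2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# Fitted values, residuals, and the quantities defined above
yhat = [b0 + b1 * xi for xi in x]
resid = [yi - yhi for yi, yhi in zip(y, yhat)]
SSE = sum(e**2 for e in resid)         # error (residual) sum of squares
s2 = SSE / (n - 2)                     # estimate of sigma^2
SST = sum((yi - ybar)**2 for yi in y)  # total sum of squares, Syy
r2 = 1 - SSE / SST                     # coefficient of determination

print(round(SSE, 3), round(s2, 3), round(r2, 3))  # 5.131 1.026 0.96
```

Here r² ≈ 0.96, so about 96% of the observed y variation in this data set is explained by the fitted line.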
- If r² is small, an analyst will usually want to search for an alternative model (either a nonlinear model or a multiple regression model involving more than a single independent variable) that can more effectively explain y variation.

Inferences About the Slope Parameter β1

The estimators (statistics, and thus random variables) for β0, β1, and σ² are obtained by replacing yi by Yi in the expressions for β̂0, β̂1, and σ̂²:

    β̂1 = Σ(xi − x̄)(Yi − Ȳ) / Σ(xi − x̄)²

    β̂0 = (Σ Yi − β̂1 Σ xi)/n

    σ̂² = S² = (Σ Yi² − β̂0 Σ Yi − β̂1 Σ xi Yi)/(n − 2)

Proposition
1. The mean value of β̂1 is E(β̂1) = β1, so β̂1 is an unbiased estimator of β1.
2. The variance and standard deviation of β̂1 are

       V(β̂1) = σ²β̂1 = σ²/Sxx   and   σβ̂1 = σ/√Sxx

   where Sxx = Σ(xi − x̄)². Replacing σ by its estimate s gives an estimate of σβ̂1 (the estimated standard error of β̂1):

       σ̂β̂1 = sβ̂1 = s/√Sxx

3. The estimator β̂1 has a normal distribution (because it is a linear function of independent normal rv's).

Theorem
The assumptions of the simple linear regression model imply that the standardized variable

    T = (β̂1 − β1)/Sβ̂1 = (β̂1 − β1)/(S/√Sxx)

has a t distribution with n − 2 df.
Proposition
A 100(1 − α)% confidence interval for the slope β1 of the true regression line is

    β̂1 ± tα/2,n−2 · sβ̂1

Hypothesis-Testing Procedures

Null hypothesis: H0: β1 = β10
Test statistic value: t = (β̂1 − β10)/sβ̂1

    Alternative Hypothesis    Rejection Region for Level α Test
    Ha: β1 > β10              t ≥ tα,n−2
    Ha: β1 < β10              t ≤ −tα,n−2
    Ha: β1 ≠ β10              either t ≥ tα/2,n−2 or t ≤ −tα/2,n−2

A P-value based on n − 2 df can be calculated just as was done previously for t tests.

Regression and ANOVA

The total sum of squares Σ(yi − ȳ)² decomposes into a part SSE, which measures unexplained variation, and a part SSR, which measures the variation explained by the linear relationship.

    Source of    df       Sum of    Mean Square          F
    Variation             Squares
    Regression   1        SSR       SSR/1                SSR/s²
    Error        n − 2    SSE       s² = SSE/(n − 2)
    Total        n − 1    SST
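For the numerical example used earlier in the chapter, the estimated standard error sβ̂1, the t statistic for H0: β1 = 0, and the ANOVA F statistic can be computed from the formulas above. A Python sketch; the critical value tα/2,n−2 for the confidence interval or rejection region would still come from a t table:

```python
import math

x = [2.0, 3.1, 3.9, 5.3, 6.0, 7.2, 8.0]
y = [5.3, 6.3, 9.0, 12.2, 11.5, 16.7, 16.9]
n = len(x)

# Least squares fit and sums of squares, as on the earlier slides
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar)**2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi))**2 for xi, yi in zip(x, y))
SST = sum((yi - ybar)**2 for yi in y)
SSR = SST - SSE                  # explained (regression) sum of squares
s = math.sqrt(SSE / (n - 2))     # estimate of sigma

se_b1 = s / math.sqrt(Sxx)       # estimated standard error of b1
t = b1 / se_b1                   # test statistic for H0: beta1 = 0
F = (SSR / 1) / (SSE / (n - 2))  # ANOVA F statistic; equals t**2 here

print(round(se_b1, 4), round(t, 2), round(F, 1))
```

The identity F = t² for testing β1 = 0 ties the ANOVA table back to the t test: with 1 numerator df, the two procedures reject for exactly the same samples.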