From [JK10 Section 13.4, Linear Regression Analysis] – this is the later-chapter presentation; the introductory discussion is in Chapter 3. [Maybe revise this after revisiting Chapter 3.]

Line of Best Fit – Simple Linear Regression

The Method of Least Squares

The line of best fit:

Slope: $b_1 = \dfrac{SS(xy)}{SS(x)}$ (computational) or $b_1 = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ (definition),

where $SS(xy) = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$ and $SS(x) = \sum x^2 - \dfrac{(\sum x)^2}{n}$.

y-intercept: $b_0 = \dfrac{\sum y - b_1 \sum x}{n}$ (computational) or $b_0 = \bar{y} - b_1 \bar{x}$.

Is there a linear relationship between the two variables? What equation expresses that relationship? If $x$ and $y$ are unrelated, the slope is 0.

There are other regression methods:
- Curvilinear regression, involving powers of $x$ and other functions such as $e^x$ and $\ln x$.
- Multiple regression, with more than one input variable.

The Linear Model

$y = \beta_0 + \beta_1 x + \epsilon$

$\beta_0$ is the $y$-intercept. $\beta_1$ is the slope. $\epsilon$ is the random experimental error in the observed value of $y$ at a given value of $x$.

The Residual

$e = y - \hat{y}$

The residual $e$ is the estimate of the experimental error. It is the difference between the observed value of $y$ and the predicted value, $\hat{y}$. The sum of the residuals over all observed points is exactly zero, by design of the Least Squares method.

Statistics with the Residual

The mean value of the residual: $\mu_e = 0$. The variance of the residual: $\sigma_e^2$. We want to estimate what this is.

The Story

Suppose we observe several values of $y$ at a given value of $x$. Plot a distribution of the observed $y$ values (a distribution curve perpendicular to the plane at the $x$ value under consideration – a three-dimensional graph now, where the frequency is the distance it jumps off the page toward you). We need to make an assumption: the distribution of these $y$ values at each $x$ is approximately a normal distribution. We also assume that the variances of the distributions of $y$ at all values of $x$ are the same. The mean of the observed $y$ differs from one $x$ to another, but at each $x$ it is estimated by $\hat{y}$.

Variance of the Error

From [JK10 Section 13.4]. Variance of the error (the residual), $e$:

$s_e^2 = \dfrac{\sum (y - \hat{y})^2}{n - 2}$

Replace $\hat{y}$ with $b_0 + b_1 x$:

$s_e^2 = \dfrac{\sum (y - b_0 - b_1 x)^2}{n - 2}$

Then do a bunch of algebra and you can arrive at this computational form:

$s_e^2 = \dfrac{\sum y^2 - b_0 \sum y - b_1 \sum xy}{n - 2}$

Inferences about the Slope of the Regression Line

From [JK10 Section 13.5].

Estimate for the Variance of the Slope [JK10 page 718]

The sampling distribution of the slope: if random samples of size $n$ are repeatedly taken from a bivariate population, then the calculated slopes, the $b_1$'s, will form a sampling distribution that is normally distributed with a mean of $\beta_1$, the population value of the slope, and with a variance of

$\sigma_{b_1}^2 = \dfrac{\sigma_\epsilon^2}{\sum (x - \bar{x})^2}$

An appropriate estimator is

$s_{b_1}^2 = \dfrac{s_e^2}{\sum (x - \bar{x})^2} = \dfrac{s_e^2}{\sum x^2 - \frac{(\sum x)^2}{n}}$

Terminology: "variance of the slope" or "variance of the regression"; "standard deviation of the slope" or "standard error of the regression".

[JK10 page 719] Assumptions for inferences about the linear regression: the set of $(x, y)$ ordered pairs forms a random sample, and the $y$ values at each $x$ have a normal distribution. Since the population standard deviation is unknown and replaced with the sample standard deviation, the $t$-distribution will be used, with $n - 2$ degrees of freedom.
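As a worked illustration of the computational formulas above (the slope, the intercept, the variance of the error, and the standard error of the slope), here is a minimal Python sketch. The notes themselves contain no code, so the language, the variable names, and the small data set are all assumptions for illustration; only the standard library is used.

```python
import math

# Hypothetical sample data (not from the notes)
x = [2, 4, 5, 7, 8, 10]
y = [3, 5, 6, 9, 10, 13]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

SS_xy = sum_xy - sum_x * sum_y / n      # SS(xy), computational form
SS_x = sum_x2 - sum_x ** 2 / n          # SS(x), computational form

b1 = SS_xy / SS_x                       # slope of the line of best fit
b0 = (sum_y - b1 * sum_x) / n           # y-intercept

# Variance of the error (the residual), computational form, df = n - 2
s_e2 = (sum_y2 - b0 * sum_y - b1 * sum_xy) / (n - 2)

# Estimated variance and standard error of the slope:
# s_b1^2 = s_e^2 / SS(x), where SS(x) = sum x^2 - (sum x)^2 / n
s_b1 = math.sqrt(s_e2 / SS_x)

print(f"y-hat = {b0:.4f} + {b1:.4f} x")
print(f"s_e^2 = {s_e2:.4f}, s_b1 = {s_b1:.4f}")
```

Note how $SS(x)$ does double duty: it is the denominator of the slope and also the denominator of the slope's estimated variance.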
Confidence Interval for the Slope of the Regression Line

[JK10 page 719] The formula is $b_1 \pm t_{(n-2,\ \alpha/2)} \cdot s_{b_1}$.

Hypothesis Testing with the Slope of the Regression Line

The null hypothesis is $\beta_1 = 0$, that is, no linear relationship. We want to find whether the test statistic

$t^\ast = \dfrac{b_1 - \beta_1}{s_{b_1}}$

(the slope of our regression line, minus the hypothesized slope 0, divided by the standard error of our slope) is in the critical region. The critical value needs the usual significance level $\alpha$, and the degrees of freedom are $df = n - 2$.

Coefficient of Linear Correlation

Recall from somewhere else that $r = \dfrac{SS(xy)}{\sqrt{SS(x) \cdot SS(y)}}$.

Coefficient of Determination

[Blu4 page 550] says more: $r^2$ is the coefficient of determination, and it is equal to the explained variation divided by the total variation. It answers the question "How much of the variation is explained by the regression line?" And the coefficient of nondetermination is $1 - r^2$.

Confidence Intervals and Prediction Intervals of $\hat{y}$

Linear regression produces the linear equation $\hat{y} = b_1 x + b_0$. [JK10 page 727] When you are talking about one specific value of the independent variable, $x_0$, $\hat{y}$ is the best point estimate of $y$. [JK10 page 727] goes on to discuss the development leading up to the formulas for the confidence interval and the prediction interval.

Confidence Interval for the Mean of Population Values of $y$ at a Given $x_0$

[JK10] Notation: $\mu_{y|x_0}$. Formula and example on [JK10 page 728].

Prediction Interval for the Value of One Individual $y$

[JK10 page 730] Notation for this concept: $y_{x=x_0}$. Formula and example.

Comparison of These Two Concepts

[JK10 page 731] has an illustration – Figure 13.14 – showing confidence belts for the mean $y$ and prediction belts for individual values of $y$. The prediction belts are much wider (see the sketch at the end of these notes). [Blu4 page 553] discusses only the prediction interval, and only briefly, no more than a half page.

Comparing Linear Correlation and Linear Regression

[JK10 Section 13.7 page 737]

A difference:
- Linear correlation: use the linear correlation coefficient to answer the question "Are these two variables linearly related?"
- Linear regression: use regression analysis to answer questions about the relationship between two variables: "What is the relationship?" "How are the two variables related?"

A similarity:
- Linear correlation: the linear correlation coefficient can be used to indicate the usefulness of $x$ as a predictor of $y$ in the case where the linear model is appropriate.
- Linear regression: the inference test concerning the slope of the regression line ($H_0$: $\beta_1 = 0$) tests this same basic concept.
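To tie the inference pieces together – the confidence interval for the slope, the $t$ test of $H_0$: $\beta_1 = 0$, and the two intervals at a given $x_0$ – here is a minimal Python sketch. The notes cite the interval formulas on [JK10 pages 728 and 730] without reproducing them, so the confidence-interval and prediction-interval half-widths below are the standard textbook forms, included here as an assumption to be checked against the text. SciPy is assumed to be available for the $t$ critical value, and the data set is hypothetical.

```python
import math
from scipy import stats

# Hypothetical sample data (not from the notes)
x = [2, 4, 5, 7, 8, 10]
y = [3, 5, 6, 9, 10, 13]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares quantities (definition forms)
SS_x = sum((xi - x_bar) ** 2 for xi in x)
SS_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = SS_xy / SS_x
b0 = y_bar - b1 * x_bar
s_e = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                    for xi, yi in zip(x, y)) / (n - 2))
s_b1 = math.sqrt(s_e ** 2 / SS_x)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t(n-2, alpha/2)

# Confidence interval for the slope: b1 +/- t(n-2, alpha/2) * s_b1
ci_slope = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)

# Hypothesis test of H0: beta1 = 0, with df = n - 2
t_star = (b1 - 0) / s_b1
p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)

# At one specific x0: CI for the mean mu_{y|x0} and PI for one individual y
# (standard textbook half-widths, an assumption; see JK10 pp. 728 and 730)
x0 = 6
y_hat = b0 + b1 * x0
half_ci = t_crit * s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / SS_x)
half_pi = t_crit * s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / SS_x)

print(f"slope CI: {ci_slope}, t* = {t_star:.2f}, p = {p_value:.4f}")
print(f"mean-y CI at x0={x0}: {y_hat - half_ci:.3f} .. {y_hat + half_ci:.3f}")
print(f"prediction interval:  {y_hat - half_pi:.3f} .. {y_hat + half_pi:.3f}")
```

The only difference between the two half-widths is the extra 1 under the radical in the prediction interval, which is why the prediction belts in Figure 13.14 are so much wider than the confidence belts.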