The Simple Linear Regression Model

The purpose of regression analysis is to obtain a mathematical relationship between the values of two or more variables. This relationship is an equation or function which provides the framework to determine the extent or degree of association between the values of the variables. In a simple regression model there are only two variables:

Dependent, or explained, variable: y
One independent, or explanatory, variable: x

To what extent are the variations in the value of the dependent variable explained by the independent variable?

Example: To what degree are the variations in scores on a test related to the number of hours studied for the test? Here the test score is the dependent or explained variable (y) and the hours of study is the independent or explanatory variable (x).

To study the relationship between test scores and hours of study, the following data from a random sample of 10 students is shown. The "scatter diagram" for the data is also shown.

Test scores y:     52   56   56   72   72   80   88   92   96   100
Hours of study x:  2.5  1.0  3.5  3.0  4.5  6.0  5.0  4.0  5.5  7.0

To obtain a mathematical framework to study the various aspects of the relationship between the dependent and independent variables, we need to develop a regression equation from the sample data. The regression equation is the equation of the straight line fitted to the scatter diagram.

The general equation of a straight line: y = a + bx
The equation of the regression line fitted to the scatter diagram: ŷ = b₀ + b₁x

To develop the equation for the regression line we need to obtain the values of the vertical intercept (b₀) and the slope (b₁) of the line that best fits the scatter diagram. The mathematical method which provides the formulas to compute the values of b₀ and b₁ is called the least squares method.

The Least-Squares Method (LSM) to Determine the Values for b₁ and b₀

To explain the LSM, first we need to understand the difference between the symbols "y" (the plain or naked y) and "ŷ" (y-hat). The plain y represents the values of the dependent variable observed in the sample data that are associated with each x value. These are the diamond-shaped markers shown in the scatter diagram. Once we determine the values of the "coefficients" of the regression, b₀ and b₁, then for each value of x there will be a unique value of y which lies on the regression line. These values of y are denoted by ŷ and are called the "predicted values". This is why the equation for the regression line is written as ŷ = b₀ + b₁x.

y: Observed values of the dependent variable.
ŷ: Predicted values of the dependent variable.

In the diagram the predicted values ŷ are shown as circular markers located on the regression line.

[Figure: Observed and Predicted Values of the Dependent Variable, and the Regression Equation. Observed y values and predicted ŷ values on the fitted line; x-axis: hours of study, y-axis: test score.]

What does "least-squares" mean?

As the diagram above shows, for each value of x on the horizontal axis there is an observed value y and a predicted value ŷ. The difference between the observed y and the predicted ŷ is called the residual (or prediction error, or simply error) and is denoted by e:

e = y − ŷ

Squaring and summing all the errors gives the sum of squared errors (SSE):

SSE = ∑e² = ∑(y − ŷ)²
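To make the idea concrete, here is a minimal Python sketch of computing the SSE for any proposed line. The data are from the table above; the candidate coefficients are arbitrary guesses of mine, not the least-squares values:

```python
# Sample data from the notes: hours of study (x) and test scores (y).
x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]

def sse(b0, b1):
    """Sum of squared errors, SSE = sum((y - yhat)^2), for the line yhat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Two arbitrary candidate lines (my guesses). The least-squares method
# finds the (b0, b1) pair that makes this quantity as small as possible.
print(sse(40, 9))
print(sse(50, 5))
```

Different candidate lines give different SSE values; the least-squares line is the one with the smallest SSE of all.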
The least squares method uses a mathematical process involving partial derivatives and solving a system of equations to determine the formulas for the regression coefficients b₀ and b₁ such that the SSE is minimized. The least squares formulas for the coefficients of the regression equation are:

b₁ = (∑xy − n·x̅·y̅) / (∑x² − n·x̅²)
b₀ = y̅ − b₁x̅

The following calculations show how these formulas are used to obtain the values for the coefficients of the regression equation:

   x      y      xy       x²
  1.0    56      56      1.00
  2.5    52     130      6.25
  3.0    72     216      9.00
  3.5    56     196     12.25
  4.0    92     368     16.00
  4.5    72     324     20.25
  5.0    88     440     25.00
  5.5    96     528     30.25
  6.0    80     480     36.00
  7.0   100     700     49.00
 42.0   764    3438    205.00

x̅ = ∑x / n = 42 / 10 = 4.2
y̅ = ∑y / n = 764 / 10 = 76.4
∑xy = 3438
∑x² = 205

b₁ = (∑xy − n·x̅·y̅) / (∑x² − n·x̅²) = (3438 − 10(4.2)(76.4)) / (205 − 10(4.2²)) = 229.2 / 28.6 = 8.014
b₀ = y̅ − b₁x̅ = 76.4 − 8.014(4.2) = 42.741

The regression line is then written as: ŷ = 42.741 + 8.014x

Now, for x = 3 hours of study,
The observed value is: y = 72
The predicted value is: ŷ = 42.741 + 8.014(3) = 66.78
The residual is: e = y − ŷ = 72 − 66.78 = 5.22

The following table shows the calculation of all predicted values, the residuals (prediction errors), and SSE:

   x     y    ŷ = b₀ + b₁x   e = y − ŷ   e² = (y − ŷ)²
  2.5   52       62.78        −10.78        116.13
  1.0   56       50.76          5.24         27.51
  3.5   56       70.79        −14.79        218.75
  3.0   72       66.78          5.22         27.21
  4.5   72       78.80         −6.80         46.30
  6.0   80       90.83        −10.83        117.18
  5.0   88       82.81          5.19         26.92
  4.0   92       74.80         17.20        295.94
  5.5   96       86.82          9.18         84.31
  7.0  100       98.84          1.16          1.35
                                0.00        961.59

Note that the sum of the prediction errors equals zero: ∑e = ∑(y − ŷ) = 0. This means the regression line ŷ = 42.741 + 8.014x is the best-fitting line because it balances the sum of negative residuals against the sum of positive residuals. Consequently, the sum of squared deviations ∑e² = ∑(y − ŷ)² = 961.59 is the smallest possible. Hence the term least squares.

Variance of Prediction Error, and Standard Error of Estimate

The observed values y are scattered or dispersed around the fitted regression line. We need a summary measure of the dispersion of the y values around the regression line. The summary measure is the average deviation of y from ŷ. First, however, we must find the average squared deviation of y from ŷ. This average is called the variance of the prediction error:

var(e) = ∑(y − ŷ)² / (n − 2) = SSE / (n − 2) = MSE

var(e) is also called the mean square error (MSE).

var(e) = 961.59 / 8 = 120.199

The square root of var(e) is called the standard error of estimate:

se(e) = √[∑(y − ŷ)² / (n − 2)] = √[SSE / (n − 2)] = √MSE

se(e) = √120.199 = 10.964

This value, 10.96, tells us that on average the observed scores in the sample deviate from the predicted scores by about 11 score units. Given the scale of the y values, the smaller the se(e), the more closely clustered the y values are around the regression line, and hence the closer the fit of the regression line to the scatter diagram. The closer the fit of the regression line, the stronger the association between the variations in y and those in x. In the extreme, limiting case, if all variations in test scores were explained by the hours of study, se(e) would be zero.

[Figure: Standard Error of Estimate, se(e), Measures the Average Deviation of y from ŷ. Observed y values scattered around the regression line; x-axis: hours of study, y-axis: test scores.]

However, because the value of se(e) is affected by the scale of the data, it is not viewed by itself as a reliable measure of "closeness-of-fit".
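The entire calculation above can be verified with a short script. A minimal Python sketch using only the standard library (the variable names are mine, not from the notes):

```python
import math

x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]
n = len(x)

x_bar = sum(x) / n                                   # 4.2
y_bar = sum(y) / n                                   # 76.4
sum_xy = sum(xi * yi for xi, yi in zip(x, y))        # 3438
sum_x2 = sum(xi ** 2 for xi in x)                    # 205

# Least-squares coefficients.
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # 8.014
b0 = y_bar - b1 * x_bar                                         # 42.741

# Residuals, SSE, variance of prediction error, standard error of estimate.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)    # 961.59 (sum(residuals) is ~0)
mse = sse / (n - 2)                     # 120.199, var(e)
se_e = math.sqrt(mse)                   # 10.964

print(b1, b0, sse, mse, se_e)
```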
Coefficient of Determination, R²

Because the standard error of estimate is affected by the scale of the data, we need an alternative measure which indicates the degree of association between the values of y and x that is independent of the scale of the data. This alternative measure, called the coefficient of determination but more commonly known as R², is a relative (proportional or percentage) value. Simply, R² indicates the proportion or percentage of the variations in the y values explained by, or due to, x.

R² is based on the comparison of two deviation measures: 1) the deviation of y from the regression line (that is, from ŷ); 2) the deviation of y from the mean line (that is, from y̅).

In the diagram below, consider one of the observations in the sample where x = 5.5. Given 5.5 study hours, the observed score is y = 96 and the predicted score is ŷ = 42.741 + 8.014(5.5) = 86.8. The observed score y = 96 deviates from the mean score (the mean line), y̅ = 76.4, by y − y̅ = 96 − 76.4 = 19.6. This deviation is called the total deviation. Part of this total deviation is accounted for by the predicted value: ŷ − y̅ = 86.8 − 76.4 = 10.4. This portion of the total deviation is said to be predicted or explained by the regression (by hours of study), hence the term explained deviation. The remainder is the familiar residual or prediction error, e = y − ŷ = 96 − 86.8 = 9.2, and in this context it is called the unexplained deviation.

[Figure: Total Deviation, Explained Deviation, and Unexplained Deviation. For the observation at x = 5.5, the deviations of y = 96 from the mean line y̅ = 76.4 and from the regression line ŷ = 86.8; x-axis: hours of study, y-axis: test scores.]

Thus, the total deviation is made up of the explained deviation and the unexplained deviation:

Total Deviation  =  Explained Deviation  +  Unexplained Deviation
   (y − y̅)       =       (ŷ − y̅)        +        (y − ŷ)
 (96 − 76.4)     =    (86.8 − 76.4)      +     (96 − 86.8)
    19.6         =        10.4           +         9.2

In this one case the proportion of the total deviation explained by the regression (by x) is 10.4 / 19.6 = 0.53, or 53%.

We can repeat these steps for all 10 observations in the sample. However, we need a measure which uses the combined (summed) deviations of all the observations. But when we sum the deviations we see that all three sums of deviations are equal to zero:

∑(y − y̅) = ∑(ŷ − y̅) = ∑(y − ŷ) = 0

To remedy this problem we must square the deviations and obtain the following sums of squared deviations:

SST = ∑(y − y̅)²   Sum of Squares Total
SSR = ∑(ŷ − y̅)²   Sum of Squares Regression (explained)
SSE = ∑(y − ŷ)²   Sum of Squares Error (unexplained)

   y      ŷ       (y − y̅)²   (ŷ − y̅)²   (y − ŷ)²
  52    62.78      595.36     185.61     116.13
  56    50.76      416.16     657.65      27.51
  56    70.79      416.16      31.47     218.75
  72    66.78       19.36      92.48      27.21
  72    78.80       19.36       5.78      46.30
  80    90.83       12.96     208.09     117.18
  88    82.81      134.56      41.10      26.92
  92    74.80      243.36       2.57     295.94
  96    86.82      384.16     108.54      84.31
 100    98.84      556.96     503.52       1.35
                  2798.40    1836.81     961.59

Note that:

∑(y − y̅)² = ∑(ŷ − y̅)² + ∑(y − ŷ)²
   SST    =    SSR     +    SSE
 2798.40  =  1836.81   +   961.59

R² shows the proportion of the total deviation that is explained by the regression. Thus,

R² = SSR / SST = 1836.81 / 2798.40 = 0.6564

That is, nearly 66% of the variation in test scores is due to hours of study. Also note that,

R² = 1 − SSE / SST

This indicates that the more widely scattered the observed y values are around the regression line, the larger the SSE, and thus the smaller the R².
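A minimal Python sketch verifying the decomposition and R², using the rounded coefficients computed earlier (variable names are mine):

```python
x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]

b0, b1 = 42.741, 8.014                   # coefficients from the notes
y_bar = sum(y) / len(y)                  # 76.4
y_hat = [b0 + b1 * xi for xi in x]       # predicted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # 2798.40
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # ~1836.81
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # ~961.59

# SST = SSR + SSE, so the two forms of R² agree.
r_squared = ssr / sst                    # ~0.6564, same as 1 - sse/sst
print(sst, ssr, sse, r_squared)
```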
STATISTICAL INFERENCE FOR THE PARAMETERS OF POPULATION REGRESSION

To study the relationship between test scores and hours of study we obtained a random sample from the population of students taking the test. Therefore, the regression equation that is generated from the sample data is an estimated regression equation. The coefficients of the estimated regression, b₀ and b₁, are thus each a sample statistic that functions as an estimator of the population intercept and slope parameters, respectively. The population intercept parameter is denoted by β₀ and the slope parameter by β₁.

Sample regression equation:     ŷ = b₀ + b₁x
Population regression equation: E(y) = β₀ + β₁x

Similar to the parameters µ and π, where statistical inference is based on the sampling distribution of x̅ and p̅, respectively, inferences for β₀ and β₁ are based on the sampling distributions of the sample statistics b₀ and b₁. Here we will consider the sampling distribution of b₁ only. Comparing the sampling distribution of b₁ to that of x̅, you will see that the concept of a sampling distribution applies equally to any sample statistic.

There are gazillions of x̅ values obtained from the gazillions of samples. These x̅ values are normally distributed with an expected value (mean) equal to the population parameter µ and a measure of dispersion called the standard error of the sample statistic x̅, se(x̅).

Now, take the same sentence and change the symbol x̅ to b₁ and µ to β₁: There are gazillions of b₁ values obtained from the gazillions of samples. These b₁ values are normally distributed with an expected value (mean) equal to the population parameter β₁ and a measure of dispersion called the standard error of the sample statistic b₁, se(b₁).

Confidence Interval for β₁

For comparison, let's start with the confidence interval for µ:

L, U = x̅ ± MOE        MOE = tα/2, df · se(x̅)

Now, the confidence interval for the population parameter β₁ is:

L, U = b₁ ± MOE        MOE = tα/2, df · se(b₁)

Note that in simple regression the degrees of freedom to be used in the t distribution is df = n − 2.

We need a formula to compute se(b₁), and that is:

se(b₁) = se(e) / √∑(x − x̅)²

Given x̅ = 4.2, we can compute ∑(x − x̅)² as follows:

   x     (x − x̅)²
  1.0     10.24
  2.5      2.89
  3.0      1.44
  3.5      0.49
  4.0      0.04
  4.5      0.09
  5.0      0.64
  5.5      1.69
  6.0      3.24
  7.0      7.84
          28.60

se(b₁) = 10.964 / √28.6 = 2.05

Now we are set to construct a 95% confidence interval for the population slope parameter β₁:

b₁ = 8.014
df = n − 2 = 10 − 2 = 8
tα/2, df = t0.025, 8 = 2.306
MOE = tα/2, df · se(b₁) = 2.306(2.05) = 4.73
L, U = b₁ ± MOE = 8.014 ± 4.73 = (3.28, 12.74)
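A minimal Python sketch of the interval calculation, assuming SciPy is available for the t critical value (the notes themselves use t tables; the names below are mine):

```python
import math
from scipy import stats

x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
b1, se_e, n = 8.014, 10.964, len(x)      # slope and se(e) from the notes

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)   # sum of (x - x_bar)^2 = 28.6
se_b1 = se_e / math.sqrt(sxx)              # ~2.05

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # t(0.025, 8) = 2.306
moe = t_crit * se_b1                           # ~4.73
print(b1 - moe, b1 + moe)                      # ~ (3.29, 12.74)
```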
Test of Hypothesis for β₁

Generally, the purpose of the hypothesis test for the population slope parameter is to show that β₁ is significantly different from zero. Thus, the hypothesis test is a two-tailed test with the following hypotheses:

H₀: β₁ = 0
H₁: β₁ ≠ 0

Why do we want to prove that β₁ is significantly different from zero? That is, why should we be interested in rejecting the null hypothesis H₀: β₁ = 0? Our objective is to prove that test scores do respond to, or are related to, hours of study. If test scores were not related to hours of study at all, then the population regression line would be flat, that is, there would be no slope: β₁ = 0. Therefore, to prove that our model makes sense, we must reject H₀: β₁ = 0 and prove that β₁ ≠ 0 "beyond a reasonable doubt".

The slope coefficient computed from the sample is b₁ = 8.014. To prove that this value is significantly different from zero, we need to compute the test statistic to use in our decision rule:

Reject H₀: β₁ = 0 if test statistic > critical value

The test statistic is determined as follows:

TS: t = (b₁ − (β₁)₀) / se(b₁)

But, since from the null hypothesis statement β₁ = 0, then in the TS formula (β₁)₀ = 0, which makes,

TS: t = b₁ / se(b₁)
t = 8.014 / 2.05 = 3.909

The critical value for the test, using α = 0.05, is:

CV = tα/2, df = t0.025, 8 = 2.306

Since TS = 3.909 > CV = 2.306, reject H₀: β₁ = 0 and conclude that β₁ ≠ 0.

We can also use the probability value decision rule:

Reject H₀: β₁ = 0 if prob value < α

We can compute the prob value using Excel:

Excel 2010:           =T.DIST.2T(x, deg_freedom)      =T.DIST.2T(3.909, 8)
Excel 2007 or older:  =TDIST(x, deg_freedom, tails)   =TDIST(3.909, 8, 2)

prob value = 0.0045

Since prob value = 0.0045 < α = 0.05, reject H₀: β₁ = 0 and conclude that β₁ ≠ 0.

Excel Regression Summary Output

The following is the Excel regression summary output. Please see the main notes, Chapter 7 (Regression), for details.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8102
R Square            0.6564
Adjusted R Square   0.6134
Standard Error     10.9635
Observations       10

ANOVA
             df        SS         MS        F      Significance F
Regression    1    1836.806   1836.806   15.281        0.004
Error         8     961.594    120.199
Total         9    2798.400

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept      42.7413         9.2821       4.6047    0.0017    21.3368     64.1458
Hours           8.0140         2.0501       3.9091    0.0045     3.2865     12.7414
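For comparison, the same summary can be reproduced in Python, assuming the third-party statsmodels package is installed (the notes themselves use Excel; this is only an alternative check):

```python
import statsmodels.api as sm

x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]

X = sm.add_constant(x)          # adds the intercept column
model = sm.OLS(y, X).fit()      # ordinary least squares fit

# Matches the Excel output: b0 ~ 42.7413, b1 ~ 8.0140,
# t ~ 3.9091, p ~ 0.0045, R^2 ~ 0.6564.
print(model.summary())
```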