Simple Linear Regression

Previously we tested for a relationship between two categorical variables, but what if our interest involved two quantitative variables? You may recall from algebra the line equation Y = mX + b, where Y is the dependent variable, m is the slope of the line, X is the independent variable, and b is the y-intercept. In algebra you usually had a perfect line, i.e. each X and Y pair fell exactly on the line; in statistics, however, this is rarely the case.

The concept begins by considering some dependent variable, e.g. Final Exam. In looking at such scores you realize that not all of them are the same (i.e. not everyone scores the same on the final), so one may wonder what would explain this variation. One possibility is performance on the midterms. To see if there might be a linear relationship between these two variables one would start with a scatterplot.

[Scatterplot of Final vs Midterms: final exam scores (about 40 to 100) plotted against midterm averages (about 60 to 100)]

From this plot you can see that a potential linear relationship exists, as for the most part higher midterm averages appear to coincide with higher final exam scores. To put a number on a linear relationship one would consider the correlation, denoted r. From Minitab this correlation is 0.67. Thus correlation is a measure of the strength of a linear relationship between two quantitative variables. If a perfect linear relationship existed the correlation would be one. However, not all relationships are positive. For instance, consider the variables Weight and Exercise. One would think that the more one exercised, the more their weight would decrease; that is, increased exercise would result in decreasing weight. This would be a negative relationship, so correlations can also be negative. This gives the possible range for correlation: -1 ≤ r ≤ 1.

Once we establish that a potential linear relationship might exist, what in a line equation would one be interested in testing statistically to establish some "statistical proof" of an existing relationship? Looking at a line, the key piece is the slope. That is, if the slope were significantly different from 0 (a slope of 0 meaning no linear relationship), then one would conclude that the two variables are linearly related. But to get to this point we must first establish our general statistical line expression:

Y = B0 + B1X + e

As before with algebra, Y is the response or outcome variable, B0 is the population y-intercept, B1 is the population slope, and X is the predictor or explanatory variable. This being statistics, however, we refer to these B's or "betas" as the population y-intercept and population slope and use capital lettering. What, though, is the "e"? As you can see from the scatterplot, not all of the plotted points fall neatly on a line. That is, if we fit a line through the data not all dots would be on this line; there would be some distance between the line we draw and many of the plotted points. We call this distance from the plotted points to the best-fit line the error, or residuals. Below is a plot that includes the best fitted line and the line at the mean of Y (i.e. what we would use to predict a final exam if we did not have an X variable).

[Plot of the data with the best-fit regression line and a horizontal line at the mean of Final]

We are not going to get into the math behind how the y-intercept and slope are calculated. Just know that the line that fits the data best is the one that produces the least amount of error (which makes sense). Once we fit this line we have estimates for the y-intercept and slope.
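To make this concrete, here is a minimal sketch in Python of how software arrives at these quantities: the correlation and the least-squares intercept and slope. The midterm and final scores below are made up for illustration (the actual 30-student data set is not reproduced in these notes, so this toy data will not give the r = 0.67 reported above):

```python
import numpy as np

# Hypothetical midterm averages and final exam scores, for illustration only
midterms = np.array([62, 68, 71, 75, 78, 81, 85, 88, 92, 97])
finals = np.array([55, 50, 64, 60, 72, 70, 78, 74, 88, 93])

# Correlation r: strength and direction of the linear relationship
r = np.corrcoef(midterms, finals)[0, 1]

# Least-squares estimates: the slope b1 and y-intercept b0 of the line
# that minimizes the total squared error (the residuals)
b1, b0 = np.polyfit(midterms, finals, deg=1)

print(f"r = {r:.3f}")
print(f"fitted line: Final = {b0:.3f} + {b1:.3f} * Midterms")

# "Plugging in" a midterm average of 80 gives a predicted final exam score
print(f"predicted final at midterms = 80: {b0 + b1 * 80:.1f}")
```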
We call this the "fitted regression line" or "regression equation". The general form is:

ŷ = b0 + b1X

with the first symbol read "y-hat". Another popular notation is:

ŷ = a + bX

Note that there is no error term in this equation, as this is now a best-fit line, meaning that all of the "fitted" points would fall on this line. More on that later. In statistics we also have some alternative names for Y and X: we refer to Y as the response or outcome variable, and X as the predictor or explanatory variable.

With graphical support of a linear relationship, we can use the software to estimate or calculate the y-intercept and slope for this best-fit line. Using Minitab we get a regression equation of:

Final = -0.6 + 0.835 Midterms

From this equation we have a y-intercept of -0.6, and the 0.835 represents the slope of the line. To then estimate a Y value (in this case a final exam score) you would simply "plug in" some midterm average. For example, if you wanted to predict a final exam score for a midterm average of 80 you would get:

Final = -0.6 + 0.835*(80) = -0.6 + 66.8 = 66.2

Keeping with this, fitted or predicted values are the values of Y found by substituting the observed x-values into the equation. In this example these fits or predictions of Final would be found by subbing each individual midterm average into the equation.

Interpreting the slope: In general the slope is interpreted as "for a unit increase in X, the predicted Y will increase in its units by the value of the slope." In this example, then, the slope is interpreted as "for an increase of one point in midterm average, the predicted final exam score would increase by 0.835."

We should also know that a line can have a negative or positive slope. This is tied to the correlation and vice versa: the slope of the line and the correlation will have the same sign.

Testing the Slope

The hypotheses for testing if a significant linear relationship exists are:

Ho: B1 = 0 versus Ha: B1 ≠ 0

This test is done using a "t" test statistic, where this test statistic is found by:

t = (b1 - 0)/S.E.(b1)

with DF equal to n - 2. Again we will use Minitab to conduct this test, but we can see in the output how this is done. The output is as follows:

Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant    -0.6     15.4    -0.04    0.969
Midterms   0.835    0.175     4.78    0.000  1.00

Ignore VIF. The output gives two tests: one for the y-intercept (the row labeled Constant, something in this class we will disregard) and another for the slope (the row for the X-variable, in this case Midterms). The test statistic under T-Value, 4.78, is found by taking the slope estimate, 0.835, and dividing by its standard error, 0.175. This particular analysis included 30 students, so the DF for this t-test is N - 2, or 28. The p-value for the test, found under P-Value, is approximately 0.000, which is less than 0.05 and, as we should know, leads us to reject Ho and conclude that the population slope differs from 0. Thus we would conclude that a statistically significant linear relationship exists between Final and Midterms.

Confidence interval for slope: As with other confidence intervals, the structure follows:

sample stat ± Multiplier*SE

For the slope this is b1 ± T*S.E.(b1), where T is from the t-table with N - 2 degrees of freedom. E.g., for a 95% CI with 28 DF the t-table gives T = 2.048, so the interval is 0.835 ± 2.048*0.175 = 0.835 ± 0.358 = 0.477 to 1.193. We are 95% confident that the true slope for Midterms to estimate mean final exam scores is from 0.477 to 1.193. Note that this interval does NOT contain the hypothesized value of 0, further supporting our conclusion that the slope differs from 0 and that Midterms is a significant linear predictor of Final.
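To see where these numbers come from, here is a minimal sketch (assuming SciPy is available) that reproduces the slope t-test and the 95% confidence interval. Only the slope estimate 0.835, its standard error 0.175, and n = 30 are taken from the output above:

```python
from scipy import stats

b1, se_b1 = 0.835, 0.175  # slope estimate and its standard error (Minitab output)
n = 30                    # number of students
df = n - 2                # degrees of freedom for the slope t-test

# Test statistic for Ho: B1 = 0. Using the rounded estimates gives 4.77;
# Minitab reports 4.78 because it works from unrounded values.
t_stat = (b1 - 0) / se_b1

# Two-sided p-value, reported by Minitab as 0.000
p_value = 2 * stats.t.sf(abs(t_stat), df)

# 95% CI: b1 ± T*S.E.(b1), with T from the t-distribution with n-2 DF
t_mult = stats.t.ppf(0.975, df)
ci = (b1 - t_mult * se_b1, b1 + t_mult * se_b1)

print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
print(f"T multiplier = {t_mult:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```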
Other Regression Concepts

1. Is there a way to measure "how well" X is doing in explaining the variation in Y? There is, and it is called the Coefficient of Determination, symbolized R2 (R-sq in Minitab output). R2 is calculated by comparing the Sum of Squares Regression (SSR) to the Sum of Squares Total (SST or SSTO), i.e. SSR/SST. Alternatively, since SST = SSR + SSE, we can get R2 by taking 1 - SSE/SST. R2 is interpreted as "the percentage of the variation in Y that is explained by X." Since this is a percentage, we multiply by 100%. Again looking at the regression output:

Analysis of Variance

Source           DF  Adj SS   Adj MS  F-Value  P-Value
Regression        1    2674  2673.80    22.86    0.000
  Midterms        1    2674  2673.80    22.86    0.000
Error            28    3275   116.95
  Lack-of-Fit    17    1344    79.05     0.45    0.932
  Pure Error     11    1931   175.51
Total            29    5948

Model Summary

      S    R-sq  R-sq(adj)  R-sq(pred)
10.8142  44.95%     42.98%      38.43%

The SST is 5948 and the SSR is 2674, so R2 = 2674/5948 = 0.4495, and multiplying by 100% we get 44.95%, which we also see in the output as R-sq. For this course we will ignore the output values for S, R-sq(pred), and R-sq(adj). You may recognize that we used r for correlation and R2 here, and there is a connection, and it is the obvious one: squaring the correlation and multiplying by 100% gets us R2. In turn, taking the square root of R2 (after converting from % to decimal) gets us back to r, but remember that the square root of a number is + or -; we use the sign that corresponds to the slope of the line. For example, if we take the square root of 0.4495 we get ± 0.670, but since our slope is positive (i.e. as Midterms increases so, too, does Final) we take the positive square root to get 0.670 as the correlation, which we commented on above. (A short computational sketch of these calculations appears after this list.)

2. From the scatterplot we can also see that some points are far removed from the line and from the general scatter of the data. These values are called outliers, and they can be very informative. For example, we might see that one student scored very high on the final without doing very well on the midterms, or vice versa. We usually refer to them either as general outliers (meaning their removal would improve the regression, i.e. lower the error or increase the correlation) or as influential outliers, observation(s) whose removal would weaken the correlation or increase the error. Typically influential outliers can be identified in a scatterplot as points much further along the x-axis than the remainder of the data [in this example, a student whose midterm average fell far below the rest of the class could be influential, as that midterm average would be much more removed from the others].

3. A final concept of regression is extrapolation. Extrapolation is simply using the regression equation on x-values that are far removed from the range of values used in the analysis. For instance, if an analysis considered age as X, where the ages ranged from 18 to 22, then we would not want to apply the equation to ages outside this range. In our example here, midterms ranged from 60 to 100.
If someone had not taken the midterms (i.e. a midterm average of 0), applying the model would produce a predicted final of -0.6, a score that is not possible.
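As promised in item 1 above, here is a minimal sketch of the R2 and correlation arithmetic, using only the sums of squares and slope from the output shown earlier. The predict_final helper at the end is a hypothetical illustration of guarding against extrapolation beyond the observed 60-to-100 midterm range:

```python
import math

# Sums of squares from the Analysis of Variance table
ssr, sse, sst = 2674, 3275, 5948

# Coefficient of Determination, two equivalent ways; the results differ
# slightly (44.96% vs 44.94%) only because the table's sums of squares
# are rounded (Minitab reports 44.95% from the unrounded values)
r_sq = ssr / sst
r_sq_alt = 1 - sse / sst
print(f"R-sq = {100 * r_sq:.2f}%  (alternative: {100 * r_sq_alt:.2f}%)")

# Correlation r is the square root of R2, carrying the sign of the slope
slope = 0.835
r = math.copysign(math.sqrt(r_sq), slope)
print(f"r = {r:.3f}")  # +0.670, matching the correlation noted earlier

# Hypothetical guard against extrapolation: only predict final exam
# scores for midterm averages inside the observed 60-100 range
def predict_final(midterm, b0=-0.6, b1=0.835, lo=60, hi=100):
    if not lo <= midterm <= hi:
        raise ValueError(f"midterm = {midterm} is outside the observed range [{lo}, {hi}]")
    return b0 + b1 * midterm

print(f"{predict_final(80):.1f}")  # 66.2
# predict_final(0) would raise an error: a midterm average of 0 is far
# outside the data, and the -0.6 it would predict is not a possible score
```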