Relationship Between Two Quantitative Variables

Simple Linear Regression
Previously we tested for a relationship between two categorical variables, but what if our interest
involved two quantitative variables? You may recall from algebra the line equation:
Y = mX + b where Y is the dependent variable; m is the slope of the line; X is the
independent variable; and b is the y-intercept.
In algebra you usually had a perfect line, i.e. each X and Y pair fell exactly on the line; in
statistics, however, this is rarely the case. The concept begins by considering some dependent
variable, e.g. Final Exam. In looking at such scores you realize that not all of them are the same
(i.e. not everyone scores the same on the final), so one may wonder what would explain this
variation. One possibility is performance on the midterms. To see if there might be a linear
relationship between these two variables one would start with a Scatterplot.
[Scatterplot of Final vs Midterms: final exam score (40 to 100) on the y-axis against midterm average (60 to 100) on the x-axis.]
From this plot you can see that a potential linear relationship exists, as for the most part higher
midterm averages appear to coincide with higher final exam scores. To get a numerical measure
of a linear relationship one would consider the correlation, given the symbol r. From Minitab
this correlation is 0.67. Thus correlation is a measure of the strength of a linear relationship
between two quantitative variables. If a perfect linear relationship existed the correlation would
be one. However, not all relationships are positive. For instance, consider the variables Weight
and Exercise. One would think that the more one exercised, the more one's weight would
decrease; that is, increased exercise would result in decreasing weight. This would be a negative
relationship, so correlations can also be negative. This gives the possible range for correlation
as -1 <= r <= 1.
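
Both the plot and the correlation are one-liners in most software. Below is a minimal sketch in Python with matplotlib and numpy, using hypothetical midterm and final scores standing in for the class data (the actual class data gives r = 0.67 in Minitab):

    import matplotlib.pyplot as plt
    import numpy as np

    # Hypothetical scores standing in for the 30 students in the example
    midterms = [62, 70, 75, 78, 80, 85, 88, 92, 95, 99]
    finals = [55, 60, 72, 68, 75, 70, 85, 88, 82, 95]

    # Scatterplot of Final vs Midterms
    plt.scatter(midterms, finals)
    plt.xlabel("Midterms")
    plt.ylabel("Final")
    plt.title("Scatterplot of Final vs Midterms")
    plt.show()

    # Correlation r is the off-diagonal entry of the 2x2 correlation matrix
    r = np.corrcoef(midterms, finals)[0, 1]
    print(round(r, 2))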
Once we establish that a potential linear relationship might exist, what in a line equation would
one be interested in testing statistically to establish some "statistical proof" of an existing
relationship? Looking at a line, the key piece to establish this would be the slope. That is, if the
slope were significantly different from 0 (i.e. no linear relationship) then one would conclude that
the two variables are linearly related. But to get to this point we must first establish our general
statistical line expression.
Y = B0 + B1X + e

As before with algebra, Y is the response or outcome variable, B0 is the population y-intercept,
B1 is the population slope, and X is the predictor or explanatory variable. This being statistics,
however, we refer to these B's or "betas" as the population y-intercept and population slope and
use capital lettering. What, though, is the "e"? As you can see
from the Scatterplot, not all of the plotted points fall neatly on a line. That is, if we fit a line
through the data, not all dots would be on this line; there would be some distance between the
line we draw and many of the plotted points. We call this distance from the plotted points to the
best-fit line the error or residuals. Below is a plot that includes the best-fit line and the
line at the mean of Y (i.e. what we would use to predict a final exam if we did not have an X
variable). We are not going to get into the math behind how the y-intercept and slope are
calculated. Just know that the line that fits the data best is the line that produces the least
amount of error (which makes sense).

[Plot of Final vs Midterms showing the best-fit line and the horizontal line at the mean of Y.]
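
To make "least amount of error" concrete, the sketch below (hypothetical scores again) compares the squared error around the least-squares line to the squared error around the horizontal line at the mean of Y; the fitted line always comes out no worse:

    import numpy as np

    midterms = np.array([62, 70, 75, 78, 80, 85, 88, 92, 95, 99])
    finals = np.array([55, 60, 72, 68, 75, 70, 85, 88, 82, 95])

    # Least-squares slope and intercept (np.polyfit returns highest power first)
    b1, b0 = np.polyfit(midterms, finals, 1)

    # Sum of squared residuals around the fitted line ...
    sse_line = np.sum((finals - (b0 + b1 * midterms)) ** 2)
    # ... versus around the mean of Y (the no-X prediction)
    sse_mean = np.sum((finals - finals.mean()) ** 2)

    print(sse_line, sse_mean)  # sse_line is never larger than sse_mean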
Once we fit this line we have estimates for the y-intercept and slope. We call this the “fitted
regression line” or “regression equation”. The general form is:
ŷ = b0 + b1X

with the first symbol read "y-hat". Another popular notational expression is ŷ = a + bX.
Note that there is no error term in this equation, as this is now a best-fit line, meaning that
all of the "fitted" points would fall on this line. More on that later.
In statistics we also have some alternative names for Y and X. We refer to Y as the response or
outcome variable, and X as the predictor or explanatory variable.
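
Estimating b0 and b1 is something we leave to software. As a sketch of what the software does, scipy.stats.linregress returns both estimates (hypothetical scores again; the class data produces the Minitab equation shown next):

    from scipy import stats

    midterms = [62, 70, 75, 78, 80, 85, 88, 92, 95, 99]
    finals = [55, 60, 72, 68, 75, 70, 85, 88, 82, 95]

    fit = stats.linregress(midterms, finals)
    print(f"Final = {fit.intercept:.3f} + {fit.slope:.3f} Midterms")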
With graphical support of a linear relationship we can use the software to estimate or calculate the
y-intercept and slope for this best fit line. Using Minitab we get a regression equation of:
Final = -0.6 + 0.835 Midterms
From this equation we have a y-intercept of negative 0.6 and the 0.835 represents the slope of the
line. To then estimate a Y value (in this case a final exam) you would simply “plug in” some
midterm average. For example, if you wanted to predict a final exam for a midterm average of 80
you would get:
Final = -0.6 + 0.835*(80) = -0.6 + 66.8 = 66.2
Keeping with this, fitted or predicted values are the values of Y found by substituting the
observed x-values into the equation. In this example these fits or predictions of Final would be
found by substituting each student's midterm average into the equation.
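
The "plug in" step is just evaluating the fitted equation. A small helper using the actual coefficients from the Minitab output:

    # Coefficients from the fitted regression line above
    def predict_final(midterm_avg, b0=-0.6, b1=0.835):
        """Predicted final exam score for a given midterm average."""
        return b0 + b1 * midterm_avg

    print(predict_final(80))  # -0.6 + 0.835*80 = 66.2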
Interpreting the slope:
In general the slope is interpreted as "for a unit increase in X the predicted Y will increase in its
units by the value of the slope." In this example, then, the slope is interpreted as "for an increase
of one point in midterm average the predicted final exam score would increase by 0.835."
We also should know that a line can have a negative or positive slope. This will be tied into the
correlation and vice-versa. That is, the slope of the line and the correlation will have the same
sign.
Testing the Slope
The hypotheses for testing if a significant linear relationship exists are:
Ho: B1 = 0 versus Ha: B1 ≠ 0
This test is done using a “t” test statistic where this test statistic is found by:
t = (b1 - 0)/SE(b1), with DF equal to n - 2.
Again we will use Minitab to conduct this test, but we can see in the output how this is done.
The output is as follows:
Term        Coef    SE Coef   T-Value   P-Value   VIF
Constant    -0.6    15.4      -0.04     0.969
Midterms    0.835   0.175      4.78     0.000     1.00
Ignore VIF. The output gives two tests: one for the y-intercept (the row labeled Constant, which
is something in this class we will disregard) and another for the slope (the row for the X-variable,
in this case Midterms). The value under T-Value is the test statistic of 4.78, found by taking the
slope estimate, 0.835, and dividing by its standard error, 0.175. This particular analysis included
30 students, so the DF for this t-test is N - 2, or 28. The p-value for the test, found under
P-Value, is approximately 0.000, which is less than 0.05, which as we should know leads us to
reject Ho and conclude that the population slope differs from 0. Thus we would conclude that a
statistically significant linear relationship exists between Final and Midterms.
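
The pieces of this test can be reproduced directly from the output values; a sketch using scipy for the p-value:

    from scipy import stats

    b1, se_b1, n = 0.835, 0.175, 30  # slope, SE, and sample size from the output
    t = (b1 - 0) / se_b1             # about 4.77 (Minitab rounds to 4.78)
    df = n - 2                       # 28 degrees of freedom
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value, well below 0.05
    print(t, p)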
Confidence interval for slope:
As with other confidence intervals the structure follows: sample stat ± Multiplier*SE
For the slope, this is b1 ± t*·SE(b1), where t* is from the t-table with N-2 degrees of freedom.
E.g. a 95% CI would be 0.835 ± 2.00*0.175 = 0.835 ± 0.350 = 0.485 to 1.185. We are 95%
confident that the true slope for Midterms to estimate mean final exam scores is from 0.485 to
1.185. Note that this interval does NOT contain the hypothesized value of 0, further supporting
our conclusion that the slope differs from 0 and that Midterms is a significant linear predictor of
Final.
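
The same interval can be computed with the exact multiplier rather than the rounded 2.00 used above; a sketch:

    from scipy import stats

    b1, se_b1, df = 0.835, 0.175, 28
    t_star = stats.t.ppf(0.975, df)  # about 2.048 for 95% confidence
    lower = b1 - t_star * se_b1
    upper = b1 + t_star * se_b1
    print(lower, upper)  # about 0.477 to 1.193

The interval is slightly wider than the hand calculation because the table multiplier was rounded down to 2.00.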
Other Regression Concepts
1. Is there a way to measure "how well" X is doing in explaining the variation in Y? There is,
and it is called the Coefficient of Determination, symbolized by R2 (R-sq in Minitab output).
This R2 is calculated by comparing the Sum of Squares Regression (SSR) to the Sum of Squares
Total (SST or SSTO), i.e. SSR/SST. Alternatively, since SST = SSR + SSE, we can get R2 by
taking 1 - SSE/SST. This R2 is interpreted as "the percentage of variation in Y that is explained
by X." Since this is a percentage we multiply the above ratio by 100%. Again looking at the
regression output:
Analysis of Variance

Source           DF   Adj SS   Adj MS    F-Value   P-Value
Regression        1   2674     2673.80   22.86     0.000
  Midterms        1   2674     2673.80   22.86     0.000
Error            28   3275      116.95
  Lack-of-Fit    17   1344       79.05    0.45     0.932
  Pure Error     11   1931      175.51
Total            29   5948

Model Summary

S         R-sq     R-sq(adj)   R-sq(pred)
10.8142   44.95%   42.98%      38.43%
The SST is 5948 and SSR is 2674, so R2 = 2674/5948 = 0.4495; multiplying by 100% we get
44.95%, which we also see in the output as R-sq. For this course we will ignore the output values
for S, R-sq(pred) and R-sq(adj). You may recognize that we used r for correlation and R2 here,
and there is a connection, and it is the obvious one: squaring the correlation and multiplying by
100% gets us R2. In turn, taking the square root of R2 (after converting from % to decimal) gets
us r, but remember that the square root of a number is + or -. We would use the sign that
corresponds to the slope of the line. For example, if we take the square root of 0.4495 we get
±0.670, but since our slope is positive (i.e. as midterm average increases so, too, does the final
exam score) we use the positive square root to get 0.670 as the correlation, which we commented
on above. Alternatively, we can find this R-square value by taking 1 - SSE/SST.
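
Both routes to R2, and its link back to r, amount to a few lines using the sums of squares from the ANOVA table:

    import math

    ssr, sse, sst = 2674, 3275, 5948           # from the ANOVA table above
    r_sq = ssr / sst                           # 0.4495, i.e. 44.95%
    r_sq_alt = 1 - sse / sst                   # same value up to rounding

    slope = 0.835
    r = math.copysign(math.sqrt(r_sq), slope)  # sign taken from the slope
    print(r_sq, r)                             # 0.4495 and +0.670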
2. From the Scatterplot we can also see that some points are very far removed from the line and
the general scatter of the data. These values are called outliers, and they can be very informative.
For example, we might see that one student scored very high on the final without doing very well
on the midterms, or vice-versa. In regards to outliers we usually refer to them as general outliers
(meaning their removal would improve the regression, i.e. lower the error or increase the
correlation) or as influential outliers, observation(s) whose removal would weaken the correlation
or increase the error. Typically influential outliers can be identified in a Scatterplot as points that
lie further along the x-axis compared to the remainder of the data [in this case, a student whose
midterm average sat far below everyone else's would be a candidate, as that midterm average
would be much more removed from the other midterm averages].
3. A final concept of regression is extrapolation. Extrapolation is simply using the regression
equation on x-values that are far removed from the range of values used in the analysis. For
instance, if an analysis considered age as X, where the ages ranged from 18 to 22, then we would
not want to apply the equation to ages outside this range. In our example here, midterms ranged
from 60 to 100. If someone had not taken the midterms (a midterm average of 0), applying the
model to their final would produce a predicted score of negative 0.6, a score that is not possible.
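
A simple guard against extrapolation is to check a new x-value against the observed range before predicting. A sketch, reusing the fitted equation and the 60-to-100 midterm range from this example:

    def predict_final_checked(midterm_avg, x_min=60, x_max=100):
        """Predict a final exam score, warning if the input is an extrapolation."""
        if not x_min <= midterm_avg <= x_max:
            print(f"Warning: {midterm_avg} is outside the observed range "
                  f"[{x_min}, {x_max}]; this prediction is an extrapolation.")
        return -0.6 + 0.835 * midterm_avg  # fitted regression line from above

    predict_final_checked(0)  # triggers the warning and returns -0.6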