How to Find the Regression Equation

Problem Statement
Last year, five randomly selected students took a math aptitude test before they began their
statistics course. The Statistics Department has three questions.
- What linear regression equation best predicts statistics performance, based on math
  aptitude scores?
- If a student made an 80 on the aptitude test, what grade would we expect her to make in
  statistics?
- How well does the regression equation fit the data?
How to Find the Regression Equation
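In the table below, the xi column shows scores on the aptitude test, and the yi column
shows statistics grades. The last two rows show the sums and means that we will use to
conduct the regression analysis.
         xi     yi    (xi - x̄)   (yi - ȳ)   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
         60     70      -18         -7         324          49            126
         70     65       -8        -12          64         144             96
         80     70        2         -7           4          49            -14
         85     95        7         18          49         324            126
         95     85       17          8         289          64            136
Sum     390    385                             730         630            470
Mean     78     77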
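The sums and means described above can be computed directly from the five (x, y) pairs listed in the residual table later in this lesson. The sketch below assumes the standard least-squares formulas for a line ŷ = b0 + b1x, namely b1 = Σ[(xi - x̄)(yi - ȳ)] / Σ(xi - x̄)² and b0 = ȳ - b1 * x̄; the variable names are illustrative only.

```python
# Aptitude scores (x) and statistics grades (y) for the five students.
x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]

n = len(x)
x_mean = sum(x) / n  # mean aptitude score
y_mean = sum(y) / n  # mean statistics grade

# Sums of squared deviations and of cross-products.
sxx = sum((xi - x_mean) ** 2 for xi in x)
syy = sum((yi - y_mean) ** 2 for yi in y)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# Least-squares slope and intercept (standard formulas, assumed here).
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

print(f"means: x = {x_mean}, y = {y_mean}")
print(f"sums:  sxx = {sxx}, syy = {syy}, sxy = {sxy}")
print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.3f}")
```

Carried at full precision, the intercept comes out near 26.781. The 26.768 used in this lesson results from rounding the slope to 0.644 before computing the intercept; either way, the prediction at x = 80 rounds to 78.288.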
How to Use the Regression Equation
Once you have the regression equation, using it is a snap. Choose a value for the independent
variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent
variable.
In our example, the independent variable is the student's score on the aptitude test. The
dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test,
the estimated statistics grade would be:
ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288
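The same computation can be written as a small function. This is a minimal sketch; the function name `predict` and its default arguments are illustrative, with the coefficients taken from the equation above.

```python
def predict(x, b0=26.768, b1=0.644):
    """Estimated statistics grade for an aptitude score x."""
    return b0 + b1 * x


estimate = predict(80)
print(f"Estimated statistics grade: {estimate:.3f}")
```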
Warning: When you use a regression equation, do not use values for the independent variable
that are outside the range of values used to create the equation. That is called extrapolation, and
it can produce unreasonable estimates.
In this example, the aptitude test scores used to create the regression equation ranged from 60 to
95. Therefore, only use values inside that range to estimate statistics grades. Estimates for values
outside that range (less than 60 or greater than 95) are extrapolations and may be unreliable.
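One way to enforce this warning in code is to refuse to predict outside the observed range. The sketch below is only one possible design; the names `X_MIN`, `X_MAX`, and `predict_in_range` are hypothetical, and raising an error (rather than, say, returning a flagged value) is an assumption about what the caller wants.

```python
# Range of aptitude scores used to fit the equation in this lesson.
X_MIN, X_MAX = 60, 95


def predict_in_range(x, b0=26.768, b1=0.644):
    """Estimate a statistics grade, rejecting x values that would extrapolate."""
    if not X_MIN <= x <= X_MAX:
        raise ValueError(
            f"x = {x} is outside the fitted range [{X_MIN}, {X_MAX}]; "
            "the estimate would be an extrapolation"
        )
    return b0 + b1 * x


print(f"{predict_in_range(80):.3f}")  # inside the range, so this is allowed

try:
    predict_in_range(50)  # below the smallest observed score
except ValueError as err:
    print("Rejected:", err)
```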
How to Find the Coefficient of Determination
Whenever you use a regression equation, you should ask how well the equation fits the data. One
way to assess fit is to check the coefficient of determination.
The coefficient of determination (denoted by R²) is a key output of regression analysis. It is
interpreted as the proportion of the variance in the dependent variable that is predictable from the
independent variable.
- The coefficient of determination is the square of the correlation (r) between predicted y
  scores and actual y scores; thus, it ranges from 0 to 1.
- With linear regression (the type of regression we are using in this tutorial), the coefficient
  of determination is also equal to the square of the correlation between x and y scores.
- An R² of 0 means that the dependent variable cannot be predicted from the independent
  variable.
- An R² of 1 means the dependent variable can be predicted without error from the
  independent variable.
- An R² between 0 and 1 indicates the extent to which the dependent variable is
  predictable. An R² of 0.10 means that 10 percent of the variance in y is predictable from
  x; an R² of 0.20 means that 20 percent is predictable; and so on.
The formula for computing the coefficient of determination for a linear regression model with
one independent variable is given below.
Coefficient of determination. The coefficient of determination (R²) for a linear regression
model with one independent variable is:
R² = { ( 1 / N ) * Σ [ (xi - x̄) * (yi - ȳ) ] / (σx * σy ) }²
where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the
x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y
value, σx is the standard deviation of x, and σy is the standard deviation of y. Computations for
the sample problem of this lesson are shown below.
σx = sqrt [ Σ ( xi - x̄ )² / N ]
σx = sqrt( 730/5 ) = sqrt(146) = 12.083
σy = sqrt [ Σ ( yi - ȳ )² / N ]
σy = sqrt( 630/5 ) = sqrt(126) = 11.225
R² = { ( 1 / N ) * Σ [ (xi - x̄) * (yi - ȳ) ] / (σx * σy ) }²
R² = [ ( 1/5 ) * 470 / ( 12.083 * 11.225 ) ]² = ( 94 / 135.632 )² = ( 0.693 )² = 0.48
A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades
(the dependent variable) can be explained by the relationship to math aptitude scores (the independent
variable).
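The computation above can be checked with a short script. This sketch follows the lesson's formula, using population standard deviations (dividing by N); the variable names are illustrative.

```python
import math

# Aptitude scores (x) and statistics grades (y) from this lesson.
x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Population standard deviations (divide by N, as in the lesson).
sigma_x = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / n)
sigma_y = math.sqrt(sum((yi - y_mean) ** 2 for yi in y) / n)

# Correlation between x and y, then its square.
covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
r = covariance / (sigma_x * sigma_y)
r_squared = r ** 2

print(f"sigma_x = {sigma_x:.3f}, sigma_y = {sigma_y:.3f}")
print(f"r = {r:.3f}, R^2 = {r_squared:.2f}")
```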
Residual Plots
A residual plot is a graph that shows the residuals on the vertical axis and the independent
variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the
horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear
model is more appropriate.
The table below shows inputs and outputs from a simple linear regression analysis: the
independent variable (x), the observed value (y), the predicted value (ŷ), and the residual
(e = y - ŷ). Plotting the residual (e) against the independent variable (x) gives the residual plot.
x     y     ŷ         e
60    70    65.411     4.589
70    65    71.849    -6.849
80    70    78.288    -8.288
85    95    81.507    13.493
95    85    87.945    -2.945
The residual plot shows a fairly random pattern - the first residual is positive, the next two are
negative, the fourth is positive, and the last residual is negative. This random pattern indicates
that a linear model provides a decent fit to the data.
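The predicted values and residuals discussed above can be reproduced from the raw scores. This sketch assumes the standard least-squares formulas for the slope and intercept and carries them at full precision; the names are illustrative.

```python
# Aptitude scores (x) and statistics grades (y) from this lesson.
x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares slope and intercept, kept at full precision.
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean

# Predicted value and residual (observed minus predicted) for each student.
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

for xi, yi, yhat, e in zip(x, y, predicted, residuals):
    sign = "+" if e > 0 else "-"
    print(f"x={xi:3d}  y={yi:3d}  y_hat={yhat:7.3f}  e={e:7.3f}  ({sign})")
```

The printed signs give a quick text-only view of the pattern that a residual plot would show.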
Below, the residual plots show three typical patterns. The first plot shows a random pattern,
indicating a good fit for a linear model. The other plot patterns are non-random (U-shaped and
inverted U), suggesting a better fit for a non-linear model.
[Figures: three residual plots, titled "Random pattern", "Non-random: U-shaped", and "Non-random: Inverted U"]
Which of the following statements are true?
I. When the sum of the residuals is greater than zero, the data set is nonlinear.
II. A random pattern of residuals supports a linear model.
III. A random pattern of residuals supports a non-linear model.
(A) I only
(B) II only
(C) III only
(D) I and II
(E) I and III
Solution
The correct answer is (B). A random pattern of residuals supports a linear model; a non-random
pattern supports a non-linear model. For a least-squares regression line with an intercept, the sum
of the residuals is always zero, whether the data set is linear or nonlinear.
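The claim about the sum of the residuals can be checked numerically. In the sketch below, the curved data set (y = x squared) is hypothetical and not part of the lesson, and `fit_line` and `residual_sum` are illustrative helper names. Because of floating-point rounding, the sums print as values extremely close to zero rather than exactly zero.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def residual_sum(x, y):
    """Sum of residuals (observed minus predicted) for the fitted line."""
    b0, b1 = fit_line(x, y)
    return sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))


# Lesson data: roughly linear.
lesson_x = [60, 70, 80, 85, 95]
lesson_y = [70, 65, 70, 95, 85]

# Hypothetical curved data (y = x squared), not from the lesson.
curved_x = [1, 2, 3, 4, 5, 6]
curved_y = [xi ** 2 for xi in curved_x]

print("lesson data residual sum:", residual_sum(lesson_x, lesson_y))
print("curved data residual sum:", residual_sum(curved_x, curved_y))
```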
Influential Points
An influential point is an outlier that greatly affects the slope of the regression line. One way to
test the influence of an outlier is to compute the regression equation with and without the outlier.
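The with-and-without comparison might look like the sketch below. The extra point (30, 100) is hypothetical and was chosen only for illustration; it is added to the five observations from this lesson. `fit_line` is an illustrative helper that applies the standard least-squares formulas.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# The five observations from this lesson.
x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]

# Hypothetical outlier: a low aptitude score paired with a very high grade.
outlier_x, outlier_y = 30, 100

b0_without, b1_without = fit_line(x, y)
b0_with, b1_with = fit_line(x + [outlier_x], y + [outlier_y])

print(f"without outlier: y_hat = {b0_without:.3f} + {b1_without:.3f}x")
print(f"with outlier:    y_hat = {b0_with:.3f} + {b1_with:.3f}x")
```

A large change in the slope between the two fits suggests that the point is influential.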
Which of the following statements are true?
I. When the data set includes an influential point, the data set is nonlinear.
II. Influential points always reduce the coefficient of determination.
III. All outliers are influential data points.
(A) I only
(B) II only
(C) III only
(D) All of the above
(E) None of the above
Solution
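The correct answer is (E). Data sets with influential points can be linear or nonlinear. An
influential point can either raise or lower the coefficient of determination. And an outlier is
influential only if it has a large effect on the regression equation; some outliers do not.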