Chapter 7 Outline

The Simple Linear Regression Model
The purpose of regression analysis is to obtain a
mathematical relationship between the values of two or more
variables. The mathematical relationship is an equation or a
function which provides the framework to determine the
extent or degree of association between the values of the
variables.
In a simple regression model there are only two variables:
Dependent, or Explained Variable—y
One Independent, or Explanatory Variable—x
To what extent are the variations in the value of the dependent
variable explained by the independent variable?
Example
To what degree are the variations in scores on a test related
to the number of hours studied for the test? Here the test
score is the dependent or explained variable (y) and the
hours of study is the independent or explanatory variable (x).
To study the relationship between test scores and hours of
study, the following data from a random sample of 10
students is shown. The “scatter diagram” for the data is also
shown.
Test scores (y)   Hours of study (x)
52                2.5
56                1.0
56                3.5
72                3.0
72                4.5
80                6.0
88                5.0
92                4.0
96                5.5
100               7.0
To obtain a mathematical framework to study the various
aspects of the relationship between the dependent and
independent variables, we need to develop a regression
equation from the sample data.
The regression equation is the equation of the straight line
fitted to the scatter diagram.
The general equation of a straight line:
y = a + bx
The equation of the regression line fitted to the scatter
diagram:
ŷ = b₀ + b₁x
To develop the equation for the regression line we need to
obtain the values for the vertical intercept (b₀) and the slope
(b₁) of the line that fits the scatter diagram the best. The
mathematical method which provides the formulas to
compute the values for b₀ and b₁ is called the least squares
method.
The Least-Squares Method (LSM) to Determine the
Values for b₁ and b₀
To explain the LSM, first we need to understand the
difference between the symbols “y” (the plain or naked y) and
“ŷ” (y-hat). The plain y represents the values of the
dependent variable observed in the sample data that is
associated with each x value. These are the diamond-shaped
markers shown in the scatter diagram. Once we determine
the values of the “coefficients” of the regression, b₀ and b₁,
then for each value of x there will be a unique value of y
which lies on the regression line. These values of y are
denoted by ŷ and are called the “predicted values”. This is
why the equation for the regression line is represented by
ŷ = b₀ + b₁x.
y: Observed values of the dependent variable.
ŷ: Predicted values of the dependent variable.
In the diagram the predicted values ŷ are shown as circular
markers located on the regression line.
[Figure: Observed and Predicted Values of the Dependent Variable. Test score (y) is plotted against hours of study (x); the observed values are scattered markers, and the predicted values ŷ lie on the regression line.]
What does “least-squares” mean?
As the above diagram shows, for each value of x on the
horizontal axis there is an observed value y and a predicted
value ŷ. The difference between the observed y and the
predicted ŷ is called the residual (or prediction error, or
simply error) and is denoted by e.
e = y − ŷ
Squaring the errors and summing them, we obtain the sum of
squared errors (SSE):
SSE = Σe² = Σ(y − ŷ)²
The least squares method uses a mathematical process
involving partial derivatives and solving a system of
equations to determine the formulas for the regression
coefficients b₀ and b₁ such that the SSE is minimized.
The least squares formulas for the coefficients of the
regression equation are:
b₁ = (Σxy − n x̅ y̅) ∕ (Σx² − n x̅²)
b₀ = y̅ − b₁x̅
The following calculations show how these formulas are used
to obtain the values for the coefficients of the regression
equation:
x      y      xy     x²
1.0    56     56     1.00
2.5    52     130    6.25
3.0    72     216    9.00
3.5    56     196    12.25
4.0    92     368    16.00
4.5    72     324    20.25
5.0    88     440    25.00
5.5    96     528    30.25
6.0    80     480    36.00
7.0    100    700    49.00
Σ:     42.0   764    3438   205.00

x̅ = Σx ∕ n = 42 ∕ 10 = 4.2
y̅ = Σy ∕ n = 764 ∕ 10 = 76.4
Σxy = 3438
Σx² = 205
b₁ = (Σxy − n x̅ y̅) ∕ (Σx² − n x̅²) = (3438 − 10(4.2)(76.4)) ∕ (205 − 10(4.2²)) = 229.2 ∕ 28.6 = 8.014
b₀ = y̅ − b₁x̅ = 76.4 – 8.014(4.2) = 42.741
The regression line is then written as:
ŷ = 42.741 + 8.014x
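For readers who want to check these numbers, here is a minimal Python sketch of the least-squares formulas above; the variable names are only illustrative.

```python
# Least-squares coefficients for the test-score data,
# computed with the formulas above.
x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]  # hours of study
y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]           # test scores

n = len(x)
x_bar = sum(x) / n                               # 4.2
y_bar = sum(y) / n                               # 76.4
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 3438
sum_x2 = sum(xi ** 2 for xi in x)                # 205

b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)  # ~8.014
b0 = y_bar - b1 * x_bar                                        # ~42.741
print(f"y-hat = {b0:.3f} + {b1:.3f}x")
```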
Now, for x = 3 hours of study:
The observed value is: y = 72
The predicted value is: ŷ = 42.741 + 8.014(3) = 66.78
The residual is: e = y − ŷ = 72 − 66.78 = 5.22
The following table shows the calculation of all predicted
values, the residuals (prediction errors) and SSE:
x      y      ŷ = b₀ + b₁x   e = y − ŷ   e² = (y − ŷ)²
2.5    52     62.78          −10.78      116.13
1.0    56     50.76          5.24        27.51
3.5    56     70.79          −14.79      218.75
3.0    72     66.78          5.22        27.21
4.5    72     78.80          −6.80       46.30
6.0    80     90.83          −10.83      117.18
5.0    88     82.81          5.19        26.92
4.0    92     74.80          17.20       295.94
5.5    96     86.82          9.18        84.31
7.0    100    98.84          1.16        1.35
Σ                            0.00        961.59
Note that the sum of prediction errors equals zero:
Σe = Σ(y − ŷ) = 0. This means the regression line
ŷ = 42.741 + 8.014x is the best-fitting line because it balances
the sum of negative residuals against the sum of positive
residuals. Consequently, the sum of squared deviations
Σe² = Σ(y − ŷ)² = 961.59 is the smallest possible. Hence the
term least squares.
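The residual table can be reproduced with a short continuation of the earlier Python sketch (reusing x, y, b0, and b1 from above):

```python
# Predicted values, residuals, and SSE (continues the sketch above).
y_hat = [b0 + b1 * xi for xi in x]           # predicted scores
e = [yi - yh for yi, yh in zip(y, y_hat)]    # residuals

print(round(sum(e), 6))                      # ~0: residuals sum to zero
sse = sum(ei ** 2 for ei in e)
print(round(sse, 2))                         # ~961.59
```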
Variance of Prediction Error, and Standard Error of
Estimate
The observed values y are scattered or dispersed around the
fitted regression line. We need a summary measure of
dispersion of y values around the regression line. The
summary measure is the average deviation of y from ŷ. First,
however, we must find the average squared deviation of y
from ŷ. This average is called the variance of the prediction
error.
var(e) = Σ(y − ŷ)² ∕ (n − 2) = SSE ∕ (n − 2) = MSE
var(e) is also called mean square error (MSE).
var(e) = 961.59 ∕ 8 = 120.199
The square root of var(e) is called the standard error of
estimate.
se(e) = √[Σ(y − ŷ)² ∕ (n − 2)] = √[SSE ∕ (n − 2)] = √MSE
se(𝑒) = √120.199 = 10.964
This value, 10.96, tells us that on average observed scores in
the sample deviate from the predicted scores by about 11
score units. Given the scale of the y values, the smaller the
se(e), the more closely clustered the y values are around the
regression line, hence the closer the fit of the regression line
to the scatter diagram. The closer the fit of the regression
line, the stronger the association between the variations in y
and those in x. In the extreme (limiting) case, if all variations
in test scores were explained by the hours of study, se(e)
would be zero.
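Continuing the Python sketch (reusing n and sse from above), these two quantities can be computed as:

```python
import math

# MSE (variance of the prediction error) and the standard error
# of estimate, using sse and n from the running sketch.
mse = sse / (n - 2)        # ~120.199
se_e = math.sqrt(mse)      # ~10.964
print(round(mse, 3), round(se_e, 3))
```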
[Figure: Standard Error of Estimate, se(e), Measures the Average Deviation of y from ŷ Values. Test scores are plotted against hours of study, with the observed y values scattered around the regression line ŷ.]
However, because the value of se(e) is affected by the scale of
the data, it is not viewed by itself as a reliable measure of
“closeness-of-fit”.

Coefficient of Determination, R²

Because the standard error of estimate is affected by the
scale of the data, we need an alternative measure which
indicates the degree of association between the values of y
and x independently of the scale of the data. This alternative
measure, called the coefficient of determination but more
commonly known as R², is a relative (proportional or
percentage) value. Simply, R² indicates the proportion or
percentage of the variations in the y values explained by, or
due to, x.

R² is based on the comparison of two deviation measures:
1) the deviation of y from the regression line (that is, from ŷ);
2) the deviation of y from the mean line (that is, from y̅). In
the diagram below, consider one of the observations in the
sample, where x = 5.5. Given 5.5 study hours, the observed
score is y = 96 and the predicted score is
ŷ = 42.741 + 8.014(5.5) = 86.8. The observed score y = 96
deviates from the mean score (the mean line), y̅ = 76.4, by
y − y̅ = 96 − 76.4 = 19.6. This deviation is called the total
deviation. Part of this total deviation is accounted for by the
predicted value: ŷ − y̅ = 86.8 − 76.4 = 10.4. This portion of the
total deviation is said to be predicted or explained by the
regression (by hours of study), hence the term explained
deviation. The remainder, the residual, is
e = y − ŷ = 96 − 86.8 = 9.2. This is the familiar residual or
prediction error, and in this context it is called the
unexplained deviation.

[Figure: Total Deviation, Explained Deviation, and Unexplained Deviation. For the observation at x = 5.5, the observed score y = 96, the predicted score ŷ = 86.8, and the mean score y̅ = 76.4 are marked on a plot of test scores against hours of study.]
Thus, the total deviation is comprised of the explained
deviation and the unexplained deviation:
Total Deviation   =   Explained Deviation   +   Unexplained Deviation
(y − y̅)           =   (ŷ − y̅)               +   (y − ŷ)
(96 − 76.4)       =   (86.8 − 76.4)         +   (96 − 86.8)
19.6              =   10.4                  +   9.2
In this one case the proportion of total deviation explained by
the regression (by x) is 10.4 ∕ 19.6 = 0.53 or 53%.
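Using the running Python sketch, the decomposition for this single observation can be verified as follows (y_obs and the other names are illustrative):

```python
# Deviation decomposition for the observation at x = 5.5, y = 96
# (continues the running sketch).
y_obs = 96
y_pred = b0 + b1 * 5.5          # ~86.8
total = y_obs - y_bar           # 19.6
explained = y_pred - y_bar      # ~10.4
unexplained = y_obs - y_pred    # ~9.2
print(round(explained / total, 2))  # ~0.53 of this deviation is explained
```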
We can repeat these steps for all 10 observations in the
sample. However, we need a measure which uses the combined
(summed) deviations of all the observations. But when we sum
the deviations, we see that each sum of deviations equals
zero:
Σ(y − y̅) = Σ(ŷ − y̅) = Σ(y − ŷ) = 0
To remedy this problem we square the deviations and
obtain the following sums of squared deviations:
SST = Σ(y − y̅)²   Sum of Squares Total
SSR = Σ(ŷ − y̅)²   Sum of Squares Regression (explained)
SSE = Σ(y − ŷ)²   Sum of Squares Error (unexplained)
y      ŷ        (y − y̅)²   (ŷ − y̅)²   (y − ŷ)²
52     62.78    595.36     185.61     116.13
56     50.76    416.16     657.65     27.51
56     70.79    416.16     31.47      218.75
72     66.78    19.36      92.48      27.21
72     78.80    19.36      5.78       46.30
80     90.83    12.96      208.09     117.18
88     82.81    134.56     41.10      26.92
92     74.80    243.36     2.57       295.94
96     86.82    384.16     108.54     84.31
100    98.84    556.96     503.52     1.35
Σ               2798.40    1836.81    961.59
Note that:
Σ(y − y̅)²   =   Σ(ŷ − y̅)²   +   Σ(y − ŷ)²
SST         =   SSR         +   SSE
2798.40     =   1836.81     +   961.59
R² shows the proportion of total deviation that is explained
by the regression. Thus,
R² = SSR ∕ SST = 1836.81 ∕ 2798.40 = 0.6564
That is, nearly 66% of the variations in test scores are due to
hours of study.
Also note that,
R² = 1 − SSE ∕ SST
This indicates that the more widely scattered the observed y
are around the regression line, the larger the SSE, and thus
the smaller the R².
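These sums of squares and R² can be checked with the running Python sketch (reusing y, y_hat, y_bar, and sse from the earlier blocks):

```python
# Sums of squares and the coefficient of determination
# (continues the running sketch).
sst = sum((yi - y_bar) ** 2 for yi in y)       # ~2798.40
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)   # ~1836.81
r2 = ssr / sst                                 # ~0.6564
print(round(r2, 4), round(1 - sse / sst, 4))   # both forms give R²
```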
STATISTICAL INFERENCE FOR THE PARAMETERS OF
POPULATION REGRESSION
To study the relationship between test scores and hours of
study we obtained a random sample from the population of
students taking the test. Therefore, the regression equation
that is generated from the sample data is an estimated
regression equation. The coefficients of the estimated
regression, b₀ and b₁, are thus each a sample statistic which
function as estimators of the population intercept and slope
parameters, respectively. The population intercept
parameter is denoted by β₀ and the slope parameter by β₁.
Sample regression equation:      ŷ = b₀ + b₁x
Population regression equation:  ŷ = β₀ + β₁x
Similar to the parameters µ and π, where statistical inference
is based on the sampling distribution of x̅ and p̅, respectively,
inferences for β₀ and β₁ are based on the sampling
distribution of the sample statistics b₀ and b₁. Here we will
consider the sampling distribution of b₁ only.
Comparing the sampling distribution of b₁ to that of x̅, you
will see that the concept of sampling distribution applies
equally to any sample statistic.
There are gazillions of x̅ values obtained from the gazillions
of samples. These x̅ values are normally distributed with an
expected value (mean) equal to the population parameter µ
and a measure of dispersion called the standard error of the
sample statistic x̅, se(x̅).
Now, take the same sentence and change the symbol x̅ to b₁
and µ to β₁: There are gazillions of b₁ values obtained from
the gazillions of samples. These b₁ values are normally
distributed with an expected value (mean) equal to the
population parameter β₁ and a measure of dispersion called
the standard error of the sample statistic b₁, se(b₁).
Confidence Interval for β₁
For comparison, let’s start with the confidence interval for µ:
L, U = x̅ ± MOE
MOE = tα/2, df · se(x̅)
Now, the confidence interval for the population parameter β₁
is:
L, U = b₁ ± MOE
MOE = tα/2, df · se(b₁)
Note that in simple regression the degrees of freedom to be
used in the t distribution is
df = n − 2
We need a formula to compute se(b₁), and that is:
se(b₁) = se(e) ∕ √Σ(x − x̅)²
Now we are set to construct a 95% confidence interval for
the population slope parameter β₁.
b₁ = 8.014
df = n − 2 = 10 − 2 = 8
tα/2, df = t0.025, 8 = 2.306
se(e) = 10.964
Given x̅ = 4.2, we can compute Σ(x − x̅)² as follows:

x      (x − x̅)²
1.0    10.24
2.5    2.89
3.0    1.44
3.5    0.49
4.0    0.04
4.5    0.09
5.0    0.64
5.5    1.69
6.0    3.24
7.0    7.84
Σ(x − x̅)² = 28.6

se(b₁) = se(e) ∕ √Σ(x − x̅)² = 10.964 ∕ √28.6 = 2.05
MOE = tα/2, df · se(b₁) = 2.306(2.05) = 4.73
L, U = b₁ ± MOE = 8.014 ± 4.73 = (3.28, 12.74)
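The interval can be verified with the running Python sketch; here the t critical value is taken from SciPy, which is an assumption about the reader’s toolkit:

```python
from scipy.stats import t as t_dist

# se(b1) and a 95% confidence interval for beta_1 (continues the
# running sketch; assumes SciPy is installed for the critical value).
sxx = sum((xi - x_bar) ** 2 for xi in x)   # ~28.6
se_b1 = se_e / sxx ** 0.5                  # ~2.05
t_crit = t_dist.ppf(0.975, n - 2)          # ~2.306
moe = t_crit * se_b1                       # ~4.73
print(round(b1 - moe, 2), round(b1 + moe, 2))  # ~(3.28, 12.74)
```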
Test of Hypothesis for β₁
Generally, the purpose of the hypothesis test for the
population slope parameter is to show that β₁ is
significantly different from zero. Thus, the hypothesis test is
a two-tailed test with the following hypotheses:
H₀: β₁ = 0
H₁: β₁ ≠ 0
Why do we want to prove that β₁ is significantly different
from zero? That is, why should we be interested in rejecting
the null hypothesis H₀: β₁ = 0? Our objective is to prove that
test scores do respond to, or are related to, hours of study. If
test scores were not related to hours of study at all then the
population regression line would be flat, that is, there would
be no slope: β₁ = 0. Therefore, to prove that our model
makes sense, we must reject H₀: β₁ = 0 and prove that β₁ ≠ 0
“beyond a reasonable doubt”. The slope coefficient computed
from the sample is b₁ = 8.014. To prove that this value is
significantly different from zero, we need to compute the test
statistic to use in our decision rule:
Reject H₀: β₁ = 0 if |test statistic| > critical value
The test statistic is determined as follows:
TS: t = (b₁ − (β₁)₀) ∕ se(b₁)
But, since from the null hypothesis statement β₁ = 0, then in
the TS formula (β₁)₀ = 0, which makes,
TS: t = b₁ ∕ se(b₁)
t = 8.014 ∕ 2.05 = 3.909
The critical value for the test, using α = 0.05, is:
CV = tα/2, df = t0.025, 8 = 2.306
Since TS = 3.909 > CV = 2.306, reject H₀: β₁ = 0 and conclude
that β₁ ≠ 0.
We can also use the probability value decision rule:
Reject H₀: β₁ = 0 if prob value < α
We can compute the prob value using Excel:
Excel 2010:           =T.DIST.2T(x, deg_freedom)      → =T.DIST.2T(3.909, 8)
Excel 2007 or older:  =TDIST(x, deg_freedom, tails)   → =TDIST(3.909, 8, 2)
prob value = 0.0045
Since prob value = 0.0045 < α = 0.05, reject H₀: β₁ = 0 and
conclude that β₁ ≠ 0.
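SciPy can reproduce both the test statistic and the two-tailed prob value (mirroring Excel’s T.DIST.2T), again continuing the sketch:

```python
# Test statistic and two-tailed prob value for H0: beta_1 = 0
# (continues the running sketch; t_dist was imported above).
ts = b1 / se_b1                            # ~3.909
p_value = 2 * t_dist.sf(abs(ts), n - 2)    # ~0.0045
print(round(ts, 3), round(p_value, 4))
```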
Excel Regression Summary Output
The following is the Excel regression summary output. Please see the main notes Chapter 7—Regression for details.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8102
R Square            0.6564
Adjusted R Square   0.6134
Standard Error      10.9635
Observations        10

ANOVA
             df    SS         MS         F        Significance F
Regression   1     1836.806   1836.806   15.281   0.004
Error        8     961.594    120.199
Total        9     2798.4

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    42.7413        9.2821           4.6047   0.0017    21.3368     64.1458
Hours        8.0140         2.0501           3.9091   0.0045    3.2865      12.7414
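A similar summary can be generated in Python with the statsmodels package, if it is available; the layout differs slightly from Excel’s:

```python
import statsmodels.api as sm

# Reproduces a regression summary like the Excel output above,
# assuming the statsmodels package is installed.
X = sm.add_constant(x)        # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())        # coefficients, R², ANOVA-style statistics
```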