CHAPTER 7
REGRESSION
1. INTRODUCTION
2. THE REGRESSION EQUATION
   2.1. The Least Squares Regression Line
   2.2. The Predicted Value of y for a Given x Value, and the Residual e
   2.3. The Standard Error of Estimate
   2.4. Coefficient of Determination, R²
3. STATISTICAL INFERENCE FOR THE PARAMETERS OF POPULATION REGRESSION
   3.1. Confidence Interval for Population Slope Parameter β1
        3.1.1. More About the Sampling Distribution of b1
   3.2. Test of Hypothesis for Population Slope Parameter β1
   3.3. Confidence Interval for Predicted Value of y for a Given x
   3.4. Confidence Interval for Mean Value of y for a Given x
4. USING EXCEL FOR REGRESSION ANALYSIS
   4.1. Understanding the Computer Regression Output
1. INTRODUCTION
To explain regression simply, suppose you want to find out what factor affects the students’ grades in the
statistics departmental common finals. What determines the variations in test scores? Why do some
students have higher scores than others? Suppose a friend offers the explanation that the variations in scores
are related to students’ heights. Another friend proposes that score variations are related to the number of
hours a student studies for the test. Your task is to find out which theory is more realistic (duh!).
Here we have two models before us attempting to explain the variations in student statistics test scores. Each
model consists of two variables: the dependent variable and the independent variable. In both models the
dependent variable (also called the explained variable) is test scores. In Model 1, however, the independent
variable (also called the explanatory variable) is student height and in Model 2 the independent variable is
hours studied. Suppose you select a random sample of 10 students. You obtain the departmental final scores
for the dependent variable. For the independent variable in Model 1 you measure their heights. For Model 2,
you ask the students to state as accurately as possible the number of hours they studied for the test. The
following are the hypothetical data for each model:
Model 1
  Dependent: y (Score)    Independent: x (Height in inches)
        52                        72
        56                        65
        56                        70
        72                        74
        72                        64
        80                        62
        88                        71
        92                        75
        96                        74
       100                        69

Model 2
  Dependent: y (Score)    Independent: x (Hours studied)
        52                        2.5
        56                        1.0
        56                        3.5
        72                        3.0
        72                        4.5
        80                        6.0
        88                        5.0
        92                        4.0
        96                        5.5
       100                        7.0
Your task is to find out which model better explains the variations in student scores. You cannot observe the influence (if any) simply by looking at the numbers. In other words, it is hard to see any pattern or association in differences in scores in relation to differences in either height or hours studied. A visual aid is much more descriptive than the plain numbers. The visual aid in regression is called the scatter diagram. The following are the scatter diagrams for the two models. In each model the independent variable is measured on the horizontal axis and the dependent variable on the vertical axis.
The scatter diagram for Model 1 shows that there is no relationship between student height and score,
because there is no recognizable pattern. But in the scatter diagram for Model 2 there is a recognizable
pattern showing that, in general, scores increase with the number of hours studied.
[Scatter diagrams. Model 1: test score (vertical axis) plotted against student height in inches (horizontal axis). Model 2: test score plotted against hours studied.]
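For readers who want to reproduce the scatter diagrams with software rather than by hand, the short sketch below does so in Python with matplotlib. This is only an illustration added here, not part of the chapter's Excel-based workflow; the list names score, height, and hours are assumed.

# Illustrative sketch (assumed Python/matplotlib): scatter diagrams for Models 1 and 2
import matplotlib.pyplot as plt

score  = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]            # dependent variable y
height = [72, 65, 70, 74, 64, 62, 71, 75, 74, 69]             # Model 1: x = height in inches
hours  = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]   # Model 2: x = hours studied

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(height, score)                 # Model 1: no recognizable pattern
ax1.set(title="Model 1", xlabel="Student height (inches)", ylabel="Test score")
ax2.scatter(hours, score)                  # Model 2: scores rise with hours studied
ax2.set(title="Model 2", xlabel="Hours studied", ylabel="Test score")
plt.show()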
2. THE REGRESSION EQUATION
To determine a more precise depiction of the relationship between the dependent variable y and the independent variable x, we need to describe the relationship as a mathematical equation. This equation is derived as the equation of the line that best fits the scatter diagram. Regression analysis provides the tools for fitting a regression line onto the scatter diagram. To draw any line in the xy plane, you must have a vertical intercept and a slope. The general equation for a straight line is the following:

y = b0 + b1x

Here b0 represents the vertical intercept and b1 the slope of the line. The slope represents the change in the value of y per unit change in x:

b1 = Δy/Δx

[Diagram: a straight line in the xy plane, with the slope shown as the rise Δy over the run Δx.]
One can fit a line to the scatter diagram manually. There are many possible lines that could be fitted in this
manner. However, there is a mathematical approach to fitting the most accurate, or best-fitting line. This
method is called the least squares method. In explaining the method, you will see why it is called the least
squares. We will use the data for Model 2 to explain how to find the regression equation.
The model we are dealing with in this discussion is called a simple linear regression model. It is "simple" because there is only one independent variable. If a model contains more than one independent variable, then it is called a multiple regression model. For example, in your model explaining the student scores, in addition to the number of hours studied, you may include the students' SAT scores as a second independent variable. The current model is also a "linear" regression model because the regression equation provides a straight regression line. The line is not curved.
The general form of a simple linear regression equation is as follows:

ŷ = b0 + b1x

Note that in the regression equation we use ŷ (y-hat) rather than y. In regression models, the symbol y (hat-less) represents the observed values of the dependent variable, the actual values observed in the sample. The distinction between y and ŷ will become apparent below.
2.1. The Least Squares Regression Line
The mathematical method used to obtain the regression line is called the least squares method because, with the resulting regression line, the sum of the squared vertical distances between the observed y values and the regression line is minimized (is the least). In the following diagram, the diamond-shaped markers represent the y values observed in the sample. For each value of x (hours) there is an observed value of y (actual score). The circular markers on the regression line represent the predicted values. Once you find the regression equation and draw the regression line, for each value of x there will be a corresponding predicted value of y on the line, which we denote by ŷ.
[Diagram: observed values y (diamond markers) scattered above and below the regression line; for each x, the point on the line is the predicted value ŷ, and the vertical gap is the error e = y − ŷ.]
The vertical distance between the observed value (y) and the predicted value (ŷ) is called the prediction error and is denoted by e: e = y − ŷ. Squaring the error terms and summing them, we obtain the sum of squared errors.

Σe² = Σ(y − ŷ)²

The least squares line assures that this sum of squares is minimized. You cannot find any other line that would provide a smaller sum of squared errors than the least squares line.
How do you obtain the least squares regression line? As explained, the equation for any straight line is obtained by determining the vertical intercept and the slope. In simple linear regression, the slope and vertical intercept are obtained using the following formulas:

Slope:                b1 = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²)

Vertical Intercept:   b0 = ȳ − b1·x̄

Using the data in Model 2 we can now determine the values of b1 and b0.
   y         x        xy        x²
   52       2.5      130       6.25
   56       1.0       56       1.00
   56       3.5      196      12.25
   72       3.0      216       9.00
   72       4.5      324      20.25
   80       6.0      480      36.00
   88       5.0      440      25.00
   92       4.0      368      16.00
   96       5.5      528      30.25
  100       7.0      700      49.00
  ȳ = 76.4  x̄ = 4.2  Σxy = 3438  Σx² = 205.00
Having computed x̄ = 4.2 and ȳ = 76.4, we can fill in the values in the formulas:

b1 = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²) = (3,438 − 10(4.2)(76.4)) / (205 − 10(4.2)²) = 8.014

b0 = ȳ − b1·x̄ = 76.4 − 8.014(4.2) = 42.741

The least squares regression equation for Model 2 is then:

ŷ = 42.741 + 8.014x

What does this equation imply? The slope value of (rounded) 8.0 means that for each additional hour of study the model predicts that the score will increase by 8 points. The vertical intercept of (rounded) 43 means that the model predicts that if a student did not study at all the score would be 43.
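The arithmetic above can be checked with a few lines of code. The following is a minimal sketch (assumed Python, not part of the chapter) that applies the least squares formulas to the Model 2 data.

# Minimal sketch (assumed Python): least squares slope and intercept for Model 2
y = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]
x = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
n = len(x)
x_bar = sum(x) / n                                    # 4.2
y_bar = sum(y) / n                                    # 76.4
sum_xy = sum(xi * yi for xi, yi in zip(x, y))         # 3438
sum_x2 = sum(xi ** 2 for xi in x)                     # 205
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # slope, about 8.014
b0 = y_bar - b1 * x_bar                                          # intercept, about 42.741
print(round(b0, 3), round(b1, 3))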
2.2. The Predicted Value of y for a Given x Value, and the Residual e
An important function of the regression equation is that it enables us to predict the value of y for a given value of x. For example, if a student studies 6 hours, the model predicts that the score would be

ŷ = 42.741 + 8.014(6) = 90.8

You can thus predict the score for any number of hours studied. The difference between the predicted value ŷ for a given value of x and the observed value associated with that x value in the sample data is called the residual (or prediction error). For example, in the data when x = 6, the associated score y is 80. The residual is then

e = y − ŷ = 80 − 90.8 = −10.8

Now we compute all the predicted values and residuals for Model 2.
   y       x      ŷ = b0 + b1x     e = y − ŷ     e² = (y − ŷ)²
   52     2.5        62.78          −10.78          116.13
   56     1.0        50.76            5.24           27.51
   56     3.5        70.79          −14.79          218.75
   72     3.0        66.78            5.22           27.21
   72     4.5        78.80           −6.80           46.30
   80     6.0        90.83          −10.83          117.18
   88     5.0        82.81            5.19           26.92
   92     4.0        74.80           17.20          295.94
   96     5.5        86.82            9.18           84.31
  100     7.0        98.84            1.16            1.35
                                   Σe = 0.00      Σe² = 961.59
Note that the sum of squared residuals, or sum of squared errors (SSE), is:

SSE = Σe² = Σ(y − ŷ)² = 961.59

This value is the "least squares" mentioned above. There is no other line that would give you a smaller sum of squared errors. The least squares method of determining b0 and b1, the regression coefficients, guarantees that this sum will be the smallest possible (the least squares).
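Continuing the same sketch (again assumed Python rather than the chapter's worksheet), the predicted values, residuals, and SSE in the table above can be reproduced as follows.

# Predicted values, residuals, and SSE for Model 2 (continues the earlier sketch)
y_hat = [b0 + b1 * xi for xi in x]                   # predicted scores on the regression line
e = [yi - yhi for yi, yhi in zip(y, y_hat)]          # residuals e = y - y_hat; they sum to zero
sse = sum(ei ** 2 for ei in e)                       # sum of squared errors, about 961.59
print(round(sum(e), 2), round(sse, 2))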
2.3. Variance of Error and the Standard Error of Estimate
Note that the value of SSE is obtained by summing the squared deviations of the predicted values from the observed values of y. Using SSE we can obtain a summary measure similar to the variance and, as its square root, the standard deviation. The variance measure shows the average squared deviation of the observed y values from the regression line (ŷ). To compute this measure, denoted by var(e), divide the SSE by the degrees of freedom, which here is df = n − 2 = 8. This measure is also known as the Mean Square Error (MSE).

var(e) = MSE = Σ(y − ŷ)² / (n − 2)

The square root of var(e) is called the standard error of estimate and is denoted by se(e).

se(e) = √[Σe² / (n − 2)] = √[Σ(y − ŷ)² / (n − 2)] = √(SSE/df) = √MSE

se(e) = √[961.59 / (10 − 2)] = √120.198 = 10.964
In any given regression model, the more scattered the observed values of y around the regression line, the larger the se(e). If se(e) is large, we say that the regression line is not a good fit. The smaller the se(e), the better the fit. Compare the equation for se(e) to that of s, the standard deviation of y:

s = √[Σ(y − ȳ)² / (n − 1)]
These two equations are very similar. The standard error of estimate measures the deviations of the y values from the regression line. The standard deviation of y measures the deviations of the y values from the mean of y, ȳ. In the following diagram note that the mean ȳ is a horizontal line because there is a single value of ȳ for all values of x on the horizontal axis. Both s and se(e) measure deviations, or the degree of scatter, of the y values (the diamond-shaped markers) from a line: the standard error of estimate shows the average deviation of y from ŷ, and the standard deviation of y provides the average deviation of y from ȳ. Thus, the larger the se(e) and s values, the more scattered the data.
[Diagram: the observed y values scattered around both the upward-sloping regression line ŷ and the horizontal line ȳ.]
The standard error of estimate for Model 2 is:

se(e) = √(961.59 / 8) = 10.964
Compare this to the standard error for Model 1, where se(e) = 18.129 (this value is obtained directly, without going through the worksheet calculations, using the Excel function =STEYX(y range, x range)). Note that the standard error for Model 1 is significantly greater than that for Model 2. This indicates that the Model 2 regression line is a much better fit than Model 1.
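As a check on the worksheet numbers, the standard error of estimate can be computed directly from SSE. The sketch below (assumed Python; the one-step Excel equivalent is the =STEYX function mentioned above) continues from the previous code.

# Standard error of estimate for Model 2 (continues the earlier sketch)
import math
mse = sse / (len(y) - 2)        # variance of error var(e) = MSE, about 120.198
se_e = math.sqrt(mse)           # standard error of estimate, about 10.964
print(round(se_e, 3))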
2.4. Coefficient of Determination, R²

The fit of the regression line is a measure of the closeness of the relationship between x and y. The less scattered the observed y values are around the regression line, the closer the relationship between x and y. As explained above, se(e) is such a measure of the fit. However, se(e) has a major drawback. It is an absolute measure and, therefore, is affected by the absolute size, or the scale, of the data. The larger the values or scale of the data set, the larger the se(e).

To explain this drawback, consider the data in Model 2. Suppose the statistics test from which the scores are obtained is a 25-question multiple-choice test. For scoring purposes we can either assign 1 point to each question and measure the scores on a scale of 25, or assign 4 points to each question and measure the scores on a scale of 100. We can set up our Model 2 either way.
  Scores           Hours             Scores          Hours
  (Scale = 100)    Studied           (Scale = 25)    Studied
    52              2.5                13              2.5
    56              1.0                14              1.0
    56              3.5                14              3.5
    72              3.0                18              3.0
    72              4.5                18              4.5
    80              6.0                20              6.0
    88              5.0                22              5.0
    92              4.0                23              4.0
    96              5.5                24              5.5
   100              7.0                25              7.0
  se(e) = 10.964                      se(e) = 2.741
For the purpose of analyzing the impact of hours studied on test scores, it should make no difference which scale we use for test scores. But note that the standard error of estimate is higher when the scale is 100. Does this mean that the model is a better fit when the test score scale is 25? Of course not. Both versions of the model have exactly the same fit.

This discussion should make it clear that using se(e) as a measure of closeness of the fit suffers from the misleading impact of the absolute size or scale of the data used in the model. An alternative measure of the closeness of fit, which is not affected by the scale of the data, is the coefficient of determination, denoted by R² (r-square).

R-square measures the proportion of total variations in y (around the mean ȳ) explained by the regression (that is, by x). Mathematically, R² is the proportion of the total squared deviations of the y values from ȳ that is explained by the total squared deviations of the ŷ values (points on the regression line) from ȳ. To understand this statement consider the following diagram.
[Diagram: for the observation at x = 5.5, the observed value y = 96, the point on the regression line ŷ = 86.8, and the mean ȳ = 76.4. The total deviation (y − ȳ) is split into the explained deviation (ŷ − ȳ) and the unexplained deviation (y − ŷ).]
In the diagram, the horizontal line represents the mean of all the observed y values, ȳ = 76.4. The regression line is represented by the regression equation ŷ = 42.741 + 8.014x. A single observed value of y = 96 for a given x = 5.5 hours is selected. The vertical distance between this y value and ȳ is
π‘‡π‘œπ‘‘π‘Žπ‘™ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦 − 𝑦̅
𝑦 − 𝑦̅ = 96 − 76.4 = 19.6
The vertical distance between 𝑦̂ on the regression line and 𝑦̅ is
𝐸π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦̂ − 𝑦̅
𝑦̂ − 𝑦̅ = 86.8 − 76.4 = 10.4
As the diagram indicates, clearly this portion of the total deviation is due to (or explained by) the regression
model. That is, this deviation is explained by the independent variable π‘₯, hours of study.
The vertical distance between 𝑦 and 𝑦̂, the residual, is
π‘ˆπ‘›π‘’π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦 − 𝑦̂
𝑦 − 𝑦̂ = 96 − 86.6 = 9.2
Note that unexplained deviation is the familiar prediction error or residual.
Thus,
π‘‡π‘œπ‘‘π‘Žπ‘™ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝐸π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› + π‘ˆπ‘›π‘’π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘›
(𝑦 − 𝑦̅) = (𝑦̂ − 𝑦̅) + (𝑦 − 𝑦̂)
Repeating the same process for all values of y, squaring the resulting deviations, and summing the squared values, we have the following sums of squared deviations:

1. Sum of Squared Total Deviations
   Sum of Squares Total (SST): Σ(y − ȳ)²
2. Sum of Squared Explained Deviations
   Sum of Squares Regression (SSR): Σ(ŷ − ȳ)²
3. Sum of Squared Unexplained Deviations
   Sum of Squares Error (SSE): Σe² = Σ(y − ŷ)²

Mathematically and numerically it can be shown that

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²

That is,

SST = SSR + SSE
The following worksheet for Model 2 shows that this equality holds:
   y      x       (y − ȳ)²     (ŷ − ȳ)²     (y − ŷ)²
   52    2.5       595.36       185.61       116.13
   56    1.0       416.16       657.65        27.51
   56    3.5       416.16        31.47       218.75
   72    3.0        19.36        92.48        27.21
   72    4.5        19.36         5.78        46.30
   80    6.0        12.96       208.09       117.18
   88    5.0       134.56        41.10        26.92
   92    4.0       243.36         2.57       295.94
   96    5.5       384.16       108.54        84.31
  100    7.0       556.96       503.52         1.35
         Σ(y − ȳ)² = 2798.40   Σ(ŷ − ȳ)² = 1836.81   Σ(y − ŷ)² = 961.59

Note that:

2798.40 = 1836.81 + 961.59
Rearranging the equation, we can write SSR as the difference between SST and SSE:

SSR = SST − SSE
Σ(ŷ − ȳ)² = Σ(y − ȳ)² − Σ(y − ŷ)²

Dividing both sides by SST, we have:

SSR/SST = SST/SST − SSE/SST = 1 − SSE/SST

Σ(ŷ − ȳ)² / Σ(y − ȳ)² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)²

As stated at the beginning of this discussion, R² measures the proportion of total deviations in y explained by the regression. Thus the left-hand side of the above equation is R²:

R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)² = SSR/SST

On the right-hand side of the equation, the ratio

Σ(y − ŷ)² / Σ(y − ȳ)² = SSE/SST

is the proportion of total deviations in y that is due to error or residual, that is, not explained by the regression. Thus the larger the ratio SSE/SST, the smaller R² will be.
For Model 2:

R² = SSR/SST = 1,836.81 / 2,798.40 = 0.6564

and

SSE/SST = 961.59 / 2,798.40 = 0.3436
Thus, when R² = 0.6564, nearly 66 percent of the variations or deviations in y, the test scores, are explained by the regression model, that is, by the independent variable x, the hours of study. The remaining 34 percent of the variations are due to other unexplained factors (you may call these factors the "unmeasurable attributes of an individual"). Note that if all the variations in y were explained by hours studied, then R² = 1. Thus, the values of R² vary from 0 to 1:

0 ≤ R² ≤ 1
Using Excel function RSQ(𝑦 range, π‘₯ range), we can find the 𝑅2 for Model 1:
For Model 1,
𝑅2 = 0.0605
As expected, 𝑅2 for Model 1 is near zero. There is, practically, no relationship between student height and
statistics test score.
Also note that the value of 𝑅2 is not affected by the scale of the data. You can check this for Model 2 using
Excel with the scores based on the scale of 25.
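The decomposition SST = SSR + SSE and the resulting R² can be verified in the same way. The sketch below (assumed Python, continuing from the earlier code) also rescales the scores to the 25-point scale to confirm that R² is unchanged, which is the same check suggested with the Excel =RSQ function.

# R-square for Model 2, and a check that it is unaffected by the score scale (assumed sketch)
sst = sum((yi - y_bar) ** 2 for yi in y)             # about 2798.40
ssr = sum((yhi - y_bar) ** 2 for yhi in y_hat)       # about 1836.81
print(round(ssr / sst, 4))                           # R-square, about 0.6564

def r_square(y_vals, x_vals):
    # R-square from a fresh least squares fit: 1 - SSE/SST
    n = len(x_vals)
    xb, yb = sum(x_vals) / n, sum(y_vals) / n
    b1 = (sum(a * b for a, b in zip(x_vals, y_vals)) - n * xb * yb) / \
         (sum(a * a for a in x_vals) - n * xb ** 2)
    b0 = yb - b1 * xb
    sse = sum((yv - (b0 + b1 * xv)) ** 2 for yv, xv in zip(y_vals, x_vals))
    sst = sum((yv - yb) ** 2 for yv in y_vals)
    return 1 - sse / sst

y_25 = [yi / 4 for yi in y]                          # same test rescored on a 25-point scale
print(round(r_square(y, x), 4), round(r_square(y_25, x), 4))   # both about 0.6564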
3. STATISTICAL INFERENCE FOR THE PARAMETERS OF POPULATION
REGRESSION
Note that to check the validity of the proposition that there is a relationship between the test scores and the number of hours studied, we used the data from a sample of 10 students. Using the sample data we obtained the sample regression equation, the general form of which is

ŷ = b0 + b1x

To obtain the sample regression equation we have to determine the slope and the vertical intercept of the regression line from sample data. The sample regression equation is thus an estimate of the population regression equation. To construct the population regression equation we need to obtain the population slope and the population vertical intercept. But since we do not have access to the population data, we use the slope (b1) and the vertical intercept (b0) determined from the sample data as estimates of the population slope β1 and the population vertical intercept β0. This way the sample regression line becomes the estimator of the population regression line:

ŷ = β0 + β1x

Going back to the statistical inference for the population mean, we used the sample statistic x̄ as an estimator of the population parameter μ. Using x̄ we built a confidence interval for μ or performed a test of hypothesis. Similarly, in regression, we use the sample statistic b0 as the estimator of the population parameter β0, and the sample statistic b1 as the estimator of the population parameter β1. Using the two sample statistics in regression we can build confidence intervals or perform tests of hypotheses for the two population parameters.
3.1. Confidence Interval for Population Slope Parameter β1

To see the similarities between the confidence interval for μ, the population mean, and that for β1, first consider the formula for the confidence interval for μ from Chapter 5:

Confidence interval for μ:   L, U = x̄ ± t_{α/2, n−1} se(x̄)

where se(x̄) = s/√n. Note that the interval is built around x̄ using the margin of error, MOE = t_{α/2, n−1} se(x̄). The confidence interval for β1 has the same general characteristics. It is built around the sample statistic b1 with ±MOE. The MOE in all confidence intervals is always equal to the t score (or the z score) times the standard error of the relevant sample statistic. The confidence interval formula for β1 is then:

Confidence interval for β1:   L, U = b1 ± t_{α/2, n−2} se(b1)

Note that here the t score involves n − 2 degrees of freedom.² The term se(b1) is the standard error of the sampling distribution of b1. The formula for se(b1) is:

Standard error of the sampling distribution of b1:   se(b1) = se(e) / √Σ(x − x̄)²
3.1.1. More About the Sampling Distribution of b1
You should clearly recognize the fact that b1 is a summary characteristic obtained from a random sample. It is, therefore, like x̄, a sample statistic. In the discussion of the concept of sampling distribution in Chapter 4 we learned that the number of samples of size n obtainable from a parent population is infinite. There is, thus, an infinite number of sample statistics such as x̄ that one may obtain from these samples. Since the values of x̄ are obtained from randomly selected samples, x̄ is a random variable. The probability distribution of this random variable, we learned, is called a sampling distribution. We also learned that the center of gravity, or the mean, of the sample statistics is the corresponding parameter in the parent population. With respect to x̄, it was explained, the mean of the means is the population mean μ. Also, the measure of dispersion of the values of the sample statistic around their center of gravity is called the standard error. Thus, the measure of dispersion of the x̄ values around μ is se(x̄). Finally, in order to apply the sampling distribution of x̄ in statistical inference, the x̄ values must have a normal distribution.

We can apply the same concepts to the sample statistic b1. We can obtain an infinite number of b1 values from the infinite number of random samples. This makes b1 a random variable with a sampling distribution. The expected value, the center of gravity, or the mean of the b1 values is equal to the parameter of the parent population, β1. And the measure of dispersion of the b1 values is the standard error of b1, se(b1). Furthermore, in order to apply the sampling distribution of b1 for statistical inference, the b1 values must be normally distributed.³ The following diagram shows the similarities between the sampling distribution of x̄ and the sampling distribution of b1.
² The regression equation is obtained by estimating two population parameters, β0 and β1. For each population parameter estimated, we lose one degree of freedom.
³ In order for the sampling distribution of b1 to be normal, certain conditions must be present. They are not relevant to the discussion here. We assume these conditions are present for our discussion.
[Diagram: the sampling distribution of x̄ is centered at E(x̄) = μ; likewise, the sampling distribution of b1 is centered at E(b1) = β1.]
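To make the sampling distribution of b1 concrete, one can simulate it: draw many random samples from a population whose slope β1 is known, fit the least squares line to each sample, and examine how the resulting b1 values distribute around β1. The sketch below is purely illustrative (assumed Python with numpy; the population values β0 = 40, β1 = 8, and an error standard deviation of 11 are made-up assumptions, not taken from the text).

# Illustrative simulation (assumed values): the b1 values center on the population slope beta1
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n = 40.0, 8.0, 11.0, 10     # assumed population parameters and sample size
slopes = []
for _ in range(5000):                            # many random samples of size n
    x_s = rng.uniform(1.0, 7.0, n)               # hours studied
    y_s = beta0 + beta1 * x_s + rng.normal(0.0, sigma, n)
    slopes.append(np.polyfit(x_s, y_s, 1)[0])    # least squares slope b1 for this sample
print(np.mean(slopes), np.std(slopes))           # the mean is close to beta1 = 8: E(b1) = beta1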
Now, to build a 95% confidence interval for the population slope parameter in Model 2, in addition to the following quantities, we need to compute the standard error of b1:

b1 = 8.014
t_{α/2, n−2} = t_{0.025, 8} = 2.306
se(e) = 10.964
x̄ = 4.2

The following is the computation of Σ(x − x̄)², used in the denominator of se(b1).
π‘₯
2.5
1.0
3.5
3.0
4.5
6.0
5.0
4.0
5.5
7.0
∑(π‘₯ − π‘₯Μ…)2 =
se(𝑏1 ) =
10.964
√28.6
(π‘₯ − π‘₯Μ… )2
2.89
10.24
0.49
1.44
0.09
3.24
0.64
0.04
1.69
7.84
28.60
= 2.05
𝑀𝑂𝐸 = 𝑑𝛼⁄2,(𝑛−2) se(𝑏1 ) = 2.306(2.05) = 4.727
𝐿 = 𝑏1 − 𝑀𝑂𝐸 = 8.014 − 4.727 = 3.287
π‘ˆ = 𝑏1 + 𝑀𝑂𝐸 = 8.014 + 4.727 = 12.741
We are 95% confident that the population slope parameter 𝛽1 is between 3.287 and 12.741.
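The interval just computed can be verified in code. The following minimal sketch (assumed Python; scipy supplies the t value that the chapter reads from the t table) continues from the earlier Model 2 sketch, where x is the list of hours studied.

# 95% confidence interval for beta1 in Model 2 (assumed sketch)
import math
from scipy import stats

b1, se_e, x_bar, n = 8.014, 10.964, 4.2, 10
sxx = sum((xi - x_bar) ** 2 for xi in x)         # sum of squared deviations of x, about 28.6
se_b1 = se_e / math.sqrt(sxx)                    # standard error of b1, about 2.05
t_val = stats.t.ppf(0.975, n - 2)                # t(0.025, 8), about 2.306
moe = t_val * se_b1                              # margin of error, about 4.73
print(round(b1 - moe, 3), round(b1 + moe, 3))    # roughly 3.29 and 12.74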
3.2. Test of Hypothesis for Population Slope Parameter β1
Recall that to perform a test of hypothesis about the population parameters μ or π we stated a null and an alternative hypothesis and then compared a test statistic to a critical value. There the test could be either a two-tail, a lower-tail, or an upper-tail test. In performing a two-tail test for, say, μ, the critical value is t_{α/2, n−1} and the test statistic is t = (x̄ − μ0)/se(x̄). In regression analysis, performing a test of hypothesis for the population slope parameter β1, as you will see, is less complicated. The test is nearly always a two-tail test. Here is why!

In regression analysis we want to determine whether there is a relationship between x and y. If there is a relationship, then the variation in the value of y in response to changes in x is reflected in the slope of the regression line. The slope shows the change in the value of y per unit change in x. In the population regression equation, the slope is β1 = Δy/Δx. If there is no relationship between x and y, then there is no change in y in response to changes in x. Thus, the slope is zero. In performing a test of hypothesis about β1, the null hypothesis is that the slope is zero. Using inferential statistics, we want to reject the null hypothesis, to provide significant evidence that the slope is not zero.
The Null and Alternative Hypotheses for β1:

H0: β1 = 0
H1: β1 ≠ 0

To perform the test we need a critical value, which is

The Critical Value:   t_{α/2, n−2}

The test statistic for the hypothesis test has the same format as that for μ. The test statistic is again a t value, the numerator of which is the difference between the sample statistic b1 and the hypothesized value of the population parameter β1, and the denominator of which is the standard error of b1, se(b1):

t = (b1 − (β1)0) / se(b1)

However, note that the null hypothesis states that β1 = 0. Therefore, the test statistic simplifies as follows:

The Test Statistic:   t = b1 / se(b1)

To reject the null hypothesis that β1 = 0, the test statistic must exceed the critical value (in absolute value):

To reject the null hypothesis:   |t| > t_{α/2, n−2}
Example

Perform a test of hypothesis for the population slope parameter in Model 2.

H0: β1 = 0
H1: β1 ≠ 0

The critical value is:

t_{α/2, n−2} = t_{0.025, 8} = 2.306

The test statistic is:

t = b1 / se(b1) = 8.014 / 2.05 = 3.910

Since the test statistic exceeds the critical value, reject the null hypothesis that the population slope parameter is zero.
We can also use the test statistic to obtain the probability value. Using the Excel function T.DIST.2T, we compute 2 × P(t > |TS|).
=T.DIST.2T(x, deg_freedom)
=T.DIST.2T(3.91, 8) = 0.0045

This is a very small probability value. Therefore, H0: β1 = 0 is rejected for any α > 0.005.
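The test statistic and the probability value can be reproduced the same way. The sketch below (assumed Python; scipy's t.sf plays the role of Excel's =T.DIST.2T) continues from the confidence interval sketch above.

# Two-tail test of H0: beta1 = 0 for Model 2 (continues the earlier sketch)
from scipy import stats

t_stat = b1 / se_b1                              # about 3.91
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)     # two-tail prob value, about 0.0045
print(round(t_stat, 3), round(p_value, 4))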
4. USING EXCEL FOR REGRESSION ANALYSIS
Determining the regression equation and all the related analysis is a cumbersome process involving many calculations. In the discussion of regression above we went through all the calculations to explain the concepts. When conducting research, however, using a computer program is essential. The Excel spreadsheet provides a simple process for performing all the calculations shown above in one swoop. The following explains the steps in Excel:
1. Enter the data for the y and x variables.
2. Click on Data, then Data Analysis. Locate Regression in the provided list.
3. Click the box labeled Input Y Range and then select the cells containing the y data. Do the same in the box labeled Input X Range. Choose where you want Excel to show the output on the worksheet by clicking the box labeled Output Range and selecting the cell where you want the top left corner of the output to be printed.
4. Click OK and the following output will appear.
SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.8102
  R Square             0.6564
  Adjusted R Square    0.6134
  Standard Error      10.9635
  Observations        10

ANOVA
                 df     SS           MS           F          Significance F
  Regression      1     1836.8056    1836.806     15.2813    0.004487
  Residual        8      961.5944     120.1993
  Total           9     2798.4

                 Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
  Intercept        42.74126        9.28207       4.60471    0.00174    21.33676    64.14575
  X variable 1      8.01399        2.05007       3.90913    0.00449     3.28652    12.74145

4.1. Understanding the Computer Printout
Not all the items on the printout are familiar to you. Here we will consider only those we have discussed.

1. The vertical intercept b0: Shown under the column "Coefficients", in the row labeled "Intercept". b0 = 42.74126

2. The slope b1: Shown under the column "Coefficients", in the row labeled "X variable 1". b1 = 8.01399

3. Standard error of estimate se(e): Shown under "Regression Statistics", labeled "Standard Error". se(e) = √[Σ(y − ŷ)² / (n − 2)] = 10.9635

4. Sum of Squares Regression (SSR): Shown under the column labeled "SS" (meaning Sum of Squares) in the row labeled "Regression". SSR = Σ(ŷ − ȳ)² = 1836.8056

5. Sum of Squares Error (SSE): Shown at the intersection of column "SS" and row "Residual". SSE = Σ(y − ŷ)² = 961.5944

6. var(e): When SSE is divided by the degrees of freedom (the df associated with Residual, df = n − 2 = 8), the result is called the Mean Square Error (MSE), which is also known as var(e). This figure is shown under the column labeled "MS". Note that var(e) = MSE = Σ(y − ŷ)² / (n − 2) = SSE/df = 961.5944 / 8 = 120.1993

7. Sum of Squares Total (SST): Shown at the intersection of column "SS" and row "Total". SST = Σ(y − ȳ)² = 2798.4

8. R-Square (R²): Shown under "Regression Statistics". R² = SSR/SST = 0.65638

9. Standard error of b1, se(b1): Shown at the intersection of column "Standard Error" and row "X variable 1". se(b1) = se(e) / √Σ(x − x̄)² = 2.05007

10. 95% confidence interval for β1: Shown at the intersections of columns "Lower 95%" and "Upper 95%" with the row "X variable 1".
    L, U = b1 ± t_{α/2, df} se(b1)
    L = 8.01399 − 2.306(2.05007) = 3.28652
    U = 8.01399 + 2.306(2.05007) = 12.74145

11. Test statistic for H0: β1 = 0: Shown at the intersection of column "t Stat" and row "X variable 1".
    TS = b1 / se(b1) = 8.01399 / 2.05007 = 3.90913
12. Prob Value: Recall that an alternative approach to the test of hypothesis is the prob-value approach. In Chapter 6, in the discussion of the two-tail test of hypothesis for μ, it was stated that you reject the null hypothesis if the prob value is less than α (where α is the level of significance of the test). For a two-tail test the prob value is 2 × P(t > |TS|). The same argument applies to the test of hypothesis in the regression analysis. Let α = 0.05. Excel computes 2 × P(t > TS) = 2 × P(t > 3.90913) = 0.00449.⁴ This is shown at the intersection of column "P-value" and row "X variable 1". Note that this is a two-tail test. The prob value shown in the computer output is the area under the two tails of the t curve. Since 0.00449 < α = 0.05, reject the null hypothesis that the population slope β1 = 0.

⁴ The Excel command is =T.DIST.2T(x, deg_freedom).
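Outside of Excel, the same summary output can be produced in a single step with a statistics library. The sketch below (assumed Python with statsmodels, not part of the chapter) reports the coefficients, standard errors, t statistics, P-values, R Square, the ANOVA sums of squares, and the 95% confidence limits discussed above.

# Reproducing the regression output in code (assumed Python/statsmodels sketch)
import statsmodels.api as sm

hours  = [2.5, 1.0, 3.5, 3.0, 4.5, 6.0, 5.0, 4.0, 5.5, 7.0]
scores = [52, 56, 56, 72, 72, 80, 88, 92, 96, 100]

X = sm.add_constant(hours)            # adds the intercept column
model = sm.OLS(scores, X).fit()       # ordinary least squares fit
print(model.summary())                # coefficients, standard errors, t stats, P-values, R-square
print(model.conf_int(alpha=0.05))     # Lower 95% and Upper 95% limits for b0 and b1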