Chapter 8 SIMPLE LINEAR REGRESSION ANALYSIS

Correlation Analysis (page 579)
Given: Bivariate data = {(X1, Y1), (X2, Y2), …, (Xn, Yn)}
Note: Association in bivariate data means a systematic connection
between changes in one variable and changes in the other. If
both variables were measured on at least an ordinal scale then
the direction of the association can be described as either
positive or negative. When an increase in one variable tends
to be accompanied by an increase in the other, the variables
are positively associated. On the other hand, when an
increase in one variable tends to be accompanied by a
decrease in the other, the variables are negatively associated.
Objective of Correlation Analysis: to measure the strength and
direction of the linear association between two variables.
Scatter Diagram (page 579)
 The first step in correlation analysis is to plot the individual pairs of observations on a two-dimensional graph called the scatter diagram. This will help you visualize the possible underlying linear relationship between the two variables.
 Using Microsoft Excel:
Step 1. Highlight the data.
Step 2. Click Insert, then choose Scatter.
Linear Correlation Coefficient (page 580)
Definition 18.1.
The linear correlation coefficient, denoted by ρ (the Greek letter rho), is a measure of the strength of the linear relationship existing between two variables, say X and Y, that is independent of their respective scales of measurement. It is defined as:

ρ = Cov(X, Y) / (σX σY)
Properties of ρ (page 580)
 A linear correlation coefficient can only assume values between -1 and 1, inclusive of the endpoints.
 The sign of ρ describes the direction of the linear relationship between X and Y. A positive value for ρ means that the line slopes upward to the right, and so as X increases, the value of Y increases. On the other hand, a negative value for ρ means that the line slopes downward to the right, and so as X increases, the value of Y decreases.
 If ρ = 0, then there is no linear correlation between X and Y. However, this does not mean a lack of association. It is possible to obtain a zero correlation even if the two variables are related, though their relationship is nonlinear, such as a quadratic relationship.
 When ρ is -1 or 1, there is a perfect linear relationship between X and Y and all the points (x, y) fall on a line whose slope is not equal to 0. (ρ is undefined when the slope is 0 since Var(Y) = 0 in this case.) A ρ that is close to 1 or -1 indicates a strong linear relationship.
 A strong linear relationship does not necessarily imply that X causes Y or that Y causes X. It is possible that a third variable caused the change in both X and Y, producing the observed relationship. This is an important point to remember not just when studying relationships, but also when comparing two populations, say by using a t-test. Unless we collected our data using a well-designed experiment in which we were able to randomize the treatments and substantially control the extraneous variables, we need to use more complex "causal" models to study causality. Otherwise, we simply describe the observed relationship or the observed difference between means.
Pearson product moment correlation coefficient (page 581)
Definition 18.2
The Pearson product moment correlation coefficient between X and Y,
denoted by r, is defined as:
r = [n ∑XiYi − (∑Xi)(∑Yi)] / sqrt{[n ∑Xi² − (∑Xi)²][n ∑Yi² − (∑Yi)²]}
  = ∑(Xi − X̄)(Yi − Ȳ) / sqrt{[∑(Xi − X̄)²][∑(Yi − Ȳ)²]}

(all sums running from i = 1 to n)

This is a point estimator of ρ.
2
X)
n
i 1
( Yi
Y)2
Scatter Diagrams of Various Data Sets with Different Values of r (Figure 18.2)

[Figure 18.2: four scatter diagrams of Y against X, illustrating r = -1, r = 1, r = 0, and r = 0.87]

Guide: Strong correlation – around 0.6 to 1 or around -1 to -0.6
Medium correlation – around 0.3 to 0.6 or around -0.6 to -0.3
Weak correlation – around 0.1 to 0.3 or around -0.3 to -0.1
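This guide can be expressed as a small helper function. This is only a sketch of the rough rule of thumb above; the function name and the handling of values below 0.1 (labeled "negligible") are my own choices, not from the text.

```python
def correlation_strength(r: float) -> str:
    """Classify |r| using the rough guide above.

    Boundary values are assigned to the stronger category; |r| < 0.1
    is labeled 'negligible' (an assumption, since the guide stops at 0.1).
    """
    a = abs(r)
    if a >= 0.6:
        return "strong"
    elif a >= 0.3:
        return "medium"
    elif a >= 0.1:
        return "weak"
    return "negligible"

print(correlation_strength(0.87))   # strong
print(correlation_strength(-0.45))  # medium
```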
Remark
 If r = 1, then all the data points lie on a line whose slope is positive.
 If r = -1, then all the data points lie on a line whose slope is negative.
 If r = 0, then we cannot conclude that all the data points lie on a horizontal line (a line whose slope is 0).
Example:

  X     Y     XY    X²    Y²
 -4     4    -16    16    16
 -2     2     -4     4     4
  0     0      0     0     0
  2     2      4     4     4
  4     4     16    16    16
Sum:  0    12      0    40    40

With n = 5:

r = [(5)(0) − (0)(12)] / sqrt{[(5)(40) − (0)²][(5)(40) − (12)²]} = 0

Here r = 0 even though the points fall exactly on the V-shaped curve Y = |X|: the variables are perfectly related, but the relationship is not linear.
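The computation above can be checked with a short script; the column sums are taken straight from the table.

```python
# Pearson r from raw sums, matching the computational formula above.
from math import sqrt

X = [-4, -2, 0, 2, 4]
Y = [4, 2, 0, 2, 4]   # Y = |X|: perfectly related, but not linearly

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))   # sum of XiYi
sxx = sum(x * x for x in X)              # sum of Xi^2
syy = sum(y * y for y in Y)              # sum of Yi^2

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(r)  # 0.0
```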
Test of Hypothesis
Ho: ρ = 0 vs Ha: ρ ≠ 0

Test Statistic:

T = r / sqrt[(1 − r²)/(n − 2)]

Critical region: |t| > t_{α/2}(v = n − 2)
Even if we are able to establish that there is a linear relationship between
two variables, we still do not conclude that X causes Y. There may be a
third variable that is correlated with both X and Y that is responsible for
the apparent correlation.
Examples
Example 18.2 (page 581) and Example 18.3 (page 583)
Exercise 1 (page 584). Suppose a breeder of Thoroughbred horses wishes to determine
whether a linear relationship exists between the gestation period and the length of life
of a horse. The breeder collected the following data from various stables across the
region.
Horse   Gestation Period (in days)   Length of Life (years)
  1              416                        24
  2              280                        25.75
  3              290                        20
  4              309                        22
  5              365                        20
  6              356                        21.5
  7              403                        23.5
  8              300                        21.75
  9              265                        21
 10              400                        21

a.) Plot a scatter diagram of the data on the gestation period and the length of life of a horse. Does there appear to be a linear relationship between the variables?
b.) Compute the Pearson correlation coefficient between the gestation period and the length of life of a horse. What conclusion can you draw based on the value of the correlation coefficient? Does this support your observation in a.)?
c.) Test whether ρ is different from 0 using a 0.05 level of significance.
Computing for r

Horse    X      Y       X²         Y²         XY
  1     416    24      173056     576        9984
  2     280    25.75    78400     663.0625   7210
  3     290    20       84100     400        5800
  4     309    22       95481     484        6798
  5     365    20      133225     400        7300
  6     356    21.5    126736     462.25     7654
  7     403    23.5    162409     552.25     9470.5
  8     300    21.75    90000     473.0625   6525
  9     265    21       70225     441        5565
 10     400    21      160000     441        8400
Total  3384   220.5   1173632    4892.625   74706.5

r = [n ∑XiYi − (∑Xi)(∑Yi)] / sqrt{[n ∑Xi² − (∑Xi)²][n ∑Yi² − (∑Yi)²]}
  = [(10)(74706.5) − (3384)(220.5)] / sqrt{[(10)(1173632) − (3384)²][(10)(4892.625) − (220.5)²]}
  = 0.0956
Hypothesis Testing About ρ

Ho: ρ = 0 vs. Ha: ρ ≠ 0 at α = 0.05.

Test statistic:

T = r / sqrt[(1 − r²)/(n − 2)]

Decision rule: Reject Ho if |t| > t.025(v = 8). That is, reject Ho if t > 2.306 or t < -2.306.

Computed value of the test statistic:

t = 0.0956 / sqrt[(1 − 0.0956²)/8] = 0.2718
Do not reject Ho. There is insufficient evidence at 0.05 level of significance to conclude
that there is a linear relationship between gestation period and length of life of horses.
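The whole horse example, from r through the test statistic, can be reproduced in a few lines; the data are taken from the exercise table.

```python
# Check of the horse example: Pearson r and the t statistic for Ho: rho = 0.
from math import sqrt

gestation = [416, 280, 290, 309, 365, 356, 403, 300, 265, 400]
life      = [24, 25.75, 20, 22, 20, 21.5, 23.5, 21.75, 21, 21]

n = len(gestation)
sx, sy = sum(gestation), sum(life)
sxy = sum(x * y for x, y in zip(gestation, life))
sxx = sum(x * x for x in gestation)
syy = sum(y * y for y in life)

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
t = r / sqrt((1 - r**2) / (n - 2))

print(round(r, 4), round(t, 4))  # 0.0956 0.2718
```

Since |0.2718| is far below the critical value 2.306, the script agrees with the conclusion above: do not reject Ho.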
Simple Linear Regression Model (page 585)
Definition 18.3
The simple linear regression model is given by the equation

Yi = β0 + β1 Xi + εi

where
 Yi is the value of the response variable (continuous) for the ith element,
 Xi is the value of the explanatory variable (continuous) for the ith element,
 β0 is a regression coefficient that gives the Y-intercept of the regression line,
 β1 is a regression coefficient that gives the slope of the line,
 εi is the random error term for the ith element, where the εi's are independent, normally distributed with mean 0 and variance σ² for i = 1, 2, …, n,
 n is the number of elements.
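The model and its error assumptions can be made concrete with a small simulation sketch. The parameter values below (β0 = 40, β1 = 0.8, σ = 8) are illustrative choices of mine, not values from the text.

```python
# Simulate n observations from Yi = beta0 + beta1*Xi + eps_i,
# where eps_i ~ Normal(0, sigma^2), independently for each i.
import random

random.seed(1)
beta0, beta1, sigma = 40.0, 0.8, 8.0   # illustrative values only
n = 10
X = [random.uniform(20, 80) for _ in range(n)]
Y = [beta0 + beta1 * x + random.gauss(0, sigma) for x in X]

for x, y in zip(X, Y):
    print(f"X = {x:5.1f}  Y = {y:6.1f}")
```

Each simulated point scatters vertically around the line 40 + 0.8x by a normal error with the same variance at every x, which is exactly what the assumptions in the next remarks require.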
Remarks (page 586)
 Blue line (the population regression line): E(Y given that X = x) = μY|x = β0 + β1x, since E(ε) = 0.
 The random error, εi, is the vertical gap between the ith observation and the blue line. εi is a random variable and we will never know its realized value because β0 and β1 are unknown.
 The random error term accounts for all other factors that affect the value of Y that cannot be explained by the relationship between X and Y. This includes all other variables related to Y and also measurement errors.
 We require that the εi's are independent random variables. For any fixed value of X, these random variables are normally distributed. The mean of any εi is 0 and its variance is σ². (That is, we do not allow the variation in the values of the εi's to differ for different values of X.)
 Consequently, for a fixed value of X = x, Y ~ Normal(β0 + β1x, σ²).
 β0 is the mean value of Y when X = 0.
 β1 is the change in the average value of Y for every unit increase in the value of X.
Steps in Doing Simple Linear Regression Analysis (page 587)
Step 1. Obtain the equation that best fits the data.
Step 2. Evaluate the equation to determine the strength of the relationship for prediction and estimation.
Step 3. Determine if the assumptions on the error terms are satisfied.
Step 4. If the model fits the data adequately, use the equation for prediction and for describing the nature of the relationship between the variables.
Estimation Using the Method of Least Squares (page 588)
The estimated regression equation is given by:
Ŷ = b0 + b1X
We use this formula to compute the predicted value of Y when given the value of X. We also
use this to compute the predicted value of the ith observation in the sample data as follows:
Ŷi = b0 + b1Xi
The method of least squares derives the values of b0 and b1 that minimize

∑ (Yi − (b0 + b1Xi))² = ∑ ei²   (sums over i = 1 to n)

Based on this criterion, the following formulas for bo, the estimate for β0, and b1, the estimate for β1, are obtained:

b1 = [n ∑XiYi − (∑Xi)(∑Yi)] / [n ∑Xi² − (∑Xi)²]   and   bo = Ȳ − b1X̄
Graphical Representation
 The random error term: the vertical gap between the ith observation and the population regression line μY|x = β0 + β1x, that is,
εi = Yi – (β0 + β1Xi)
 The residual: the vertical gap between the ith observation and the estimated regression line Ŷ = b0 + b1X, that is,
ei = Yi – (b0 + b1Xi)
Example
Example 18.5 (page 589)
Example 11.12 (Mendenhall/Scheaffer) The data below represent a sample of mathematics
achievement scores and calculus grades for 10 independently selected college freshmen. Plot the
scatter diagram and use the method of least squares to fit a line to the given 10 points.
Student   Math Achievement Score   Calculus Grade
   1              39                     65
   2              43                     78
   3              21                     52
   4              64                     82
   5              57                     92
   6              47                     89
   7              28                     73
   8              75                     98
   9              34                     56
  10              52                     75
Computing for bo and b1

Student    X     Y     X²      XY
   1      39    65    1521    2535
   2      43    78    1849    3354
   3      21    52     441    1092
   4      64    82    4096    5248
   5      57    92    3249    5244
   6      47    89    2209    4183
   7      28    73     784    2044
   8      75    98    5625    7350
   9      34    56    1156    1904
  10      52    75    2704    3900
Total    460   760   23634   36854

b1 = [n ∑XiYi − (∑Xi)(∑Yi)] / [n ∑Xi² − (∑Xi)²]
   = [(10)(36854) − (460)(760)] / [(10)(23634) − (460)²]
   = 0.766

bo = Ȳ − b1X̄ = 76 – (0.766)(46) = 40.78

Estimated regression equation: Ŷ = 40.78 + 0.766X
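The least-squares computation above can be checked directly from the data in the table.

```python
# Least-squares estimates b1 and b0 for the calculus-grade example.
math_score = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
calc_grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(math_score)
sx, sy = sum(math_score), sum(calc_grade)
sxy = sum(x * y for x, y in zip(math_score, calc_grade))
sxx = sum(x * x for x in math_score)

b1 = (n * sxy - sx * sy) / (n * sxx - sx**2)
b0 = sy / n - b1 * sx / n   # b0 = Ybar - b1*Xbar

print(round(b1, 3), round(b0, 2))  # 0.766 40.78
```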
Using the Estimated Regression Equation
Estimated regression equation: Ŷ = 40.78 + 0.766X
where Y=calculus grade and X=math achievement score
If the model fits the data adequately, we can use this equation for prediction purposes and
for describing the nature of the relationship between the variables.
We can only predict the value of Y within the values of X in our data set, that is, for
math achievement scores (X) from 21 to 75 only. For example, when X=50, we
predict the calculus score to be Ŷ = 40.78 + (0.766)(50) = 79.06 .
b1=0.766 means that as the math achievement score (X) increases by 1 unit, the mean
calculus grade (Y) is estimated to increase by 0.766.
b0=40.78 has no meaningful interpretation because X=0 is not within the range of
values we used in our estimation.
Confidence Interval Estimation (pages 590-592)

An estimator for σ²:

MSE = SSE/(n − 2) = ∑ (Yi − Ŷi)² / (n − 2)

(1 − α) confidence interval estimator for β1:

(b1 − t_{α/2}(v = n − 2) Sb1 , b1 + t_{α/2}(v = n − 2) Sb1)   where   Sb1 = sqrt{ MSE / [∑Xi² − (∑Xi)²/n] }

(1 − α) confidence interval estimator for β0:

(b0 − t_{α/2}(v = n − 2) Sb0 , b0 + t_{α/2}(v = n − 2) Sb0)   where   Sb0 = sqrt{ MSE ∑Xi² / (n [∑Xi² − (∑Xi)²/n]) }
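As an illustration, a 95% confidence interval for β1 can be computed for the calculus-grade example; the critical value t.025(v = 8) = 2.306 is the one given in the text.

```python
# 95% confidence interval for beta1 in the calculus-grade example.
from math import sqrt

X = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
Y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))
b1 = (n * sxy - sx * sy) / (n * sxx - sx**2)
b0 = sy / n - b1 * sx / n

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
mse = sse / (n - 2)                       # estimate of sigma^2
s_b1 = sqrt(mse / (sxx - sx**2 / n))      # standard error of b1

t_crit = 2.306                            # t.025(v = 8), from the text
lo, hi = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(round(lo, 3), round(hi, 3))
```

The interval does not contain 0, consistent with the significant slope test that follows.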
Hypothesis Testing
To test if there is a significant linear relationship between Y and X:
Ho: β1 = 0 vs Ha: β1 ≠ 0

Test Statistic:

T = b1 / Sb1   where   Sb1 = sqrt{ MSE / [∑Xi² − (∑Xi)²/n] }

Critical region: |t| > t_{α/2}(v = n − 2)
Example

Predicted values: Ŷ = 40.78 + 0.766X.  Residuals: e = Y − Ŷ.

Student   X    Y     X²     Predicted Y    Residual       Squared residual
   1     39   65    1521    70.6410671     -5.6410671      31.821638
   2     43   78    1849    73.7033145      4.29668553     18.46150654
   3     21   52     441    56.8609539     -4.86095392     23.62887302
   4     64   82    4096    89.7801132     -7.78011318     60.53016105
   5     57   92    3249    84.4211803      7.578819725    57.43850843
   6     47   89    2209    76.7655618     12.23443816    149.681477
   7     28   73     784    62.2198868     10.78011318    116.2108401
   8     75   98    5625    98.2012935     -0.20129345      0.040519054
   9     34   56    1156    66.8132579    -10.8132579     116.926546
  10     52   75    2704    80.5933711     -5.59337106     31.2857998
Total   460  760   23634                          SSE =   606.025869

Ho: β1 = 0 vs Ha: β1 ≠ 0 at α = 0.05

Test Statistic: T = b1/Sb1 = 0.766/0.174985 = 4.375, where

Sb1 = sqrt{ [SSE/(n − 2)] / [∑Xi² − (∑Xi)²/n] } = sqrt{ (606.025869/8) / [23634 − (460)²/10] } = 0.174985

Critical region: |t| > t_{α/2}(v = n − 2); that is, t > 2.306 or t < -2.306

Since 4.375 > 2.306, we reject Ho and conclude at the 0.05 level of significance that there is a significant linear relationship between math achievement score and calculus grade.
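The residual column, SSE, and the test statistic can all be reproduced from the raw data.

```python
# Verifying SSE, Sb1, and the slope test for the calculus-grade example.
from math import sqrt

X = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
Y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(x * x for x in X)
sxy = sum(x * y for x, y in zip(X, Y))
b1 = (n * sxy - sx * sy) / (n * sxx - sx**2)
b0 = sy / n - b1 * sx / n

residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
sse = sum(e * e for e in residuals)
s_b1 = sqrt((sse / (n - 2)) / (sxx - sx**2 / n))
t = b1 / s_b1

print(round(sse, 4), round(s_b1, 6), round(t, 3))  # 606.0259 0.174985 4.375
```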
Coefficient of Determination (page 593)
Definition 18.4
The coefficient of determination, denoted by R2, is defined as the proportion of the variability in the observed values of the response variable that can be explained by the explanatory variable through their linear relationship.
Remarks
 We can use the coefficient of determination to assess the
goodness-of-fit of the linear regression model.
 The realized value of the coefficient of determination will
be from 0 to 1. Usually, this value is expressed in
percentage so that we may interpret this as the
percentage of the variation in the values of Y that is
explained by the explanatory variable X through the
model.
 If the model has perfect predictability then R2=1. If the
model has no predictive capability then R2=0.
Relationship between r and b1

b1 = [n ∑XiYi − (∑Xi)(∑Yi)] / [n ∑Xi² − (∑Xi)²]

and

r = [n ∑XiYi − (∑Xi)(∑Yi)] / sqrt{[n ∑Xi² − (∑Xi)²][n ∑Yi² − (∑Yi)²]}

Note: the two statistics share the same numerator, so

r = b1 · sqrt{ [n ∑Xi² − (∑Xi)²] / [n ∑Yi² − (∑Yi)²] }
Computing for R2

Student    X     Y     X²     Y²     XY
   1      39    65    1521   4225   2535
   2      43    78    1849   6084   3354
   3      21    52     441   2704   1092
   4      64    82    4096   6724   5248
   5      57    92    3249   8464   5244
   6      47    89    2209   7921   4183
   7      28    73     784   5329   2044
   8      75    98    5625   9604   7350
   9      34    56    1156   3136   1904
  10      52    75    2704   5625   3900
Total    460   760   23634  59816  36854

r = [n ∑XiYi − (∑Xi)(∑Yi)] / sqrt{[n ∑Xi² − (∑Xi)²][n ∑Yi² − (∑Yi)²]}
  = [(10)(36854) − (460)(760)] / sqrt{[(10)(23634) − (460)²][(10)(59816) − (760)²]}
  = 0.8398

R2(100%) = (0.8398)²(100%) = 70.52%

70.52% of the variability of the grades in calculus can be explained by the math achievement scores.
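The same R² figure can be reproduced by squaring the Pearson correlation computed from the data.

```python
# R^2 for the calculus-grade example, computed as the square of r.
from math import sqrt

X = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
Y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(X)

sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
r2 = r**2
print(round(r, 4), round(100 * r2, 2))  # 0.8398 70.52
```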