Statistics 2014, Fall 2001

The simple linear regression model makes the following
assumptions:
i) The relationship between the predictor variable and the
response variable is linear, apart from random error;
ii) The random error terms in the model are independent, and
identically distributed, having a distribution that is normal
with mean 0 and variance $\sigma^2$.
In any situation in which we want to use simple linear regression,
these assumptions need to be checked so that we can be confident
that inferences based on the model are valid.
We check the first assumption using the scatterplot of the response
variable against the predictor variable.
We will check the second assumption using the residuals from the
model:
If the data are collected in a time sequence, we may check the
assumption of independence using a time series plot of the residuals.
If we see any pattern, then we will not accept the assumption of
independence.
We will do a normal q-q plot of the residuals to check the
assumption of normality.
We will plot the residuals against the predictor variable to check the
assumption of constant variance. The values of the residuals should
be randomly distributed about the 0 line for all x. If we see any
pattern, then we will not accept the assumption of constant variance.
Example: (stainless steel stress fracture example, continued)
We have already done a scatterplot and seen that the relationship
between applied tensile stress and time to fracture seems to be
linear. The residuals for the model are given in the table below:
RESIDUAL OUTPUT

Observation   Predicted Y    Residuals
1             64.16548673    -1.165486726
2             61.91327434    -3.913274336
3             57.40884956    -2.408849558
4             52.90442478     8.095575221
5             50.65221239    11.34778761
6             48.4          -11.4
7             43.89557522    -5.895575221
8             39.39115044     5.608849558
9             34.88672566    11.11327434
10            30.38230088   -11.38230088
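As a quick arithmetic check on the table, the residuals can be summed and squared directly. The sketch below assumes the usual simple linear regression definition MSE = SSE/(n - 2); the variable names are illustrative only and do not come from the notes or from Excel.

```python
# Residuals copied from the RESIDUAL OUTPUT table.
residuals = [
    -1.165486726, -3.913274336, -2.408849558, 8.095575221, 11.34778761,
    -11.4, -5.895575221, 5.608849558, 11.11327434, -11.38230088,
]

n = len(residuals)

# Least-squares residuals from a model with an intercept sum to zero,
# apart from rounding in the tabulated values.
residual_sum = sum(residuals)

# Error sum of squares, and mean squared error under the assumed
# definition MSE = SSE / (n - 2).
sse = sum(r * r for r in residuals)
mse = sse / (n - 2)

print(f"n = {n}")
print(f"sum of residuals = {residual_sum:.6f}")
print(f"SSE = {sse:.4f}, MSE = {mse:.4f}")
```

Up to rounding, the MSE computed this way should agree with the value 83.25298673 that the notes use for this example in the confidence interval calculations.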
Excel gives residual plots and normal q-q plots as options for
regression. The normal q-q plot from this option actually is not very
informative. Hence we will do a normal q-q plot of the residuals
using the handout.
The plots are shown below:
[Figure: X Variable 1 Residual Plot. Vertical axis: Residuals (-15 to 15); horizontal axis: X Variable 1 (0 to 50).]
The plot of the residuals v. tensile stress shows no obvious pattern.
Hence we will accept the assumption of homoscedasticity.
From the normal q-q plot, we see no reason to doubt the assumption
of normality.
[Figure: Normal Q-Q Plot of Residuals. Vertical axis: Standardized Order Statistics (-2 to 2); horizontal axis: Standardized Normal Scores (-2 to 2).]
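The coordinates of a normal q-q plot can be computed directly from the residuals. The sketch below is one common construction and is not necessarily the one in the handout: it assumes plotting positions (i - 0.5)/n for the normal scores, and it standardizes the ordered residuals by dividing by their sample standard deviation. Both choices are assumptions.

```python
from statistics import NormalDist, stdev

# Residuals copied from the RESIDUAL OUTPUT table.
residuals = [
    -1.165486726, -3.913274336, -2.408849558, 8.095575221, 11.34778761,
    -11.4, -5.895575221, 5.608849558, 11.11327434, -11.38230088,
]

n = len(residuals)

# Standardized order statistics: sort the residuals and divide by their
# sample standard deviation (assumed scaling; the residual mean is ~0).
s = stdev(residuals)
standardized = [r / s for r in sorted(residuals)]

# Standardized normal scores at the assumed plotting positions (i - 0.5)/n.
std_normal = NormalDist()  # mean 0, standard deviation 1
normal_scores = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each pair is one point on the q-q plot: (normal score, order statistic).
qq_points = list(zip(normal_scores, standardized))
for z, r in qq_points:
    print(f"{z:8.3f} {r:8.3f}")
```

If the points fall close to a straight line, there is no reason to doubt normality, which is the judgement made from the plot in these notes.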
Confidence Intervals in Simple Linear Regression
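If the error terms in the model satisfy the assumptions of being i.i.d.
normal, then we have

$$\hat{\beta}_1 \sim \text{Normal}\left(\beta_1,\ \frac{\sigma^2}{SS_{xx}}\right), \quad\text{and}\quad \hat{\beta}_0 \sim \text{Normal}\left(\beta_0,\ \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right)\right).$$

Hence

$$\frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} \sim t_{n-2}, \quad\text{where}\quad se(\hat{\beta}_1) = \sqrt{\frac{MSE}{SS_{xx}}};$$

and

$$\frac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)} \sim t_{n-2}, \quad\text{where}\quad se(\hat{\beta}_0) = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right)}.$$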
Then a $100(1-\alpha)\%$ confidence interval estimate for $\beta_1$ is
$\hat{\beta}_1 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_1)$, and a $100(1-\alpha)\%$ confidence interval estimate for
$\beta_0$ is $\hat{\beta}_0 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_0)$.
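The standard errors that enter these intervals can be sketched in a few lines. The function names below are hypothetical, and the inputs (n = 10, mean stress 20, SSxx = 1412.5, MSE = 83.25298673) are the values these notes report for the stainless steel example.

```python
import math


def se_slope(mse, ss_xx):
    """Standard error of the slope estimate: sqrt(MSE / SSxx)."""
    return math.sqrt(mse / ss_xx)


def se_intercept(mse, ss_xx, n, x_bar):
    """Standard error of the intercept: sqrt(MSE * (1/n + x_bar^2 / SSxx))."""
    return math.sqrt(mse * (1.0 / n + x_bar ** 2 / ss_xx))


# Summary values reported in the notes for the stainless steel example.
n = 10
x_bar = 20.0
ss_xx = 1412.5
mse = 83.25298673

se_b1 = se_slope(mse, ss_xx)
se_b0 = se_intercept(mse, ss_xx, n, x_bar)

print(f"se(slope)     = {se_b1:.6f}")
print(f"se(intercept) = {se_b0:.6f}")
```

These should match, up to rounding, the standard errors 0.242775962 and 5.648129399 used in the example.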
Example: (stainless steel stress fracture example, continued)
A 95% confidence interval estimate for the slope of the line of best
fit is

$$\hat{\beta}_1 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_1) = -0.900884956 \pm 2.306(0.242775962) = \left(-1.4607\ \frac{\text{hrs.}}{\text{kg/mm}^2},\ -0.3410\ \frac{\text{hrs.}}{\text{kg/mm}^2}\right),$$

and a 95% confidence interval estimate for the intercept of the line of best fit is

$$\hat{\beta}_0 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_0) = 66.41769912 \pm 2.306(5.648129399) = (53.3931\ \text{hrs.},\ 79.4423\ \text{hrs.}).$$
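The arithmetic in this example can be reproduced with a small helper. This is only a sketch: `t_interval` is a hypothetical name, the critical value 2.306 is taken from the notes rather than computed, and the slope is entered as negative, which is consistent with the predicted values in the residual table decreasing as stress increases.

```python
def t_interval(estimate, se, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * se."""
    margin = t_crit * se
    return estimate - margin, estimate + margin


# Critical value t_{0.025; 8} as quoted in the notes.
T_CRIT = 2.306

# Point estimates and standard errors quoted in the example.
slope_ci = t_interval(-0.900884956, 0.242775962, T_CRIT)
intercept_ci = t_interval(66.41769912, 5.648129399, T_CRIT)

print(f"slope:     ({slope_ci[0]:.4f}, {slope_ci[1]:.4f}) hrs. per kg/mm^2")
print(f"intercept: ({intercept_ci[0]:.4f}, {intercept_ci[1]:.4f}) hrs.")
```

Because the whole slope interval lies below zero, the data are consistent with time to fracture decreasing as the applied stress increases.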
Sometimes we want an estimate of the conditional mean of the
response variable at a particular value of the predictor variable.
An unbiased estimator of the conditional mean at $x = x_0$ is
$\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$, which has variance

$$V\left(\hat{\mu}_{Y|x_0}\right) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}\right).$$
Then a $100(1-\alpha)\%$ confidence interval estimate of the conditional
mean at $x = x_0$ is $\hat{\mu}_{Y|x_0} \pm t_{\alpha/2;\,n-2}\, se(\hat{\mu}_{Y|x_0})$, where

$$se\left(\hat{\mu}_{Y|x_0}\right) = \sqrt{MSE\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}\right)}.$$
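The interval for the conditional mean can be sketched as a single function. The name `mean_response_interval` is hypothetical, the critical value 2.306 is taken from the notes rather than computed, and the inputs are the values reported in these notes for the stainless steel example, with the slope entered as negative to match the decreasing predicted values in the residual table.

```python
import math


def mean_response_interval(b0, b1, x0, mse, n, x_bar, ss_xx, t_crit):
    """Point estimate and t-interval for the conditional mean at x = x0."""
    fit = b0 + b1 * x0
    se = math.sqrt(mse * (1.0 / n + (x0 - x_bar) ** 2 / ss_xx))
    return fit, fit - t_crit * se, fit + t_crit * se


# Values reported in the notes for the stainless steel example.
fit, lower, upper = mean_response_interval(
    b0=66.41769912,
    b1=-0.900884956,
    x0=45.0,          # stress in kg/mm^2
    mse=83.25298673,
    n=10,
    x_bar=20.0,
    ss_xx=1412.5,
    t_crit=2.306,     # t_{0.025; 8}, as quoted in the notes
)

print(f"estimate = {fit:.4f} hrs.")
print(f"95% CI   = ({lower:.4f}, {upper:.4f}) hrs.")
```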
Example: (stainless steel stress fracture example, continued)
A point estimate of the mean time to fracture when the stress is at 45
kg/mm² is
$\hat{\mu}_{Y|45} = \hat{\beta}_0 + \hat{\beta}_1 x_0 = 66.41769912 - 0.900884956(45) = 25.8779$ hrs.
The mean stress for the sample is $\bar{x} = 20$ kg/mm², and
$SS_{xx} = (n-1)S_x^2 = 1412.5$, where $S_x^2$ is the sample variance of the stress values. Also MSE = 83.25298673.
Then a 95% confidence interval estimate for the mean time to
fracture when the stress is at 45 kg/mm² is

$$\hat{\mu}_{Y|x_0} \pm t_{\alpha/2;\,n-2}\, se\left(\hat{\mu}_{Y|x_0}\right) = 25.8779\ \text{hrs.} \pm 2.306\sqrt{83.25298673\left(0.10 + \frac{(45 - 20)^2}{1412.5}\right)} = (10.3808\ \text{hrs.},\ 41.3750\ \text{hrs.}).$$

Note: The standard error of $\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is an increasing
function of the squared difference between $x_0$ and $\bar{x}$. Hence the
confidence interval will be narrowest at the mean of x, and will
increase in width as the distance from the mean increases.
(See p. 279).
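The widening of the interval away from the mean can be seen numerically. The sketch below reuses the summary values reported in these notes for the stainless steel example; the list of stress values at which the width is evaluated is hypothetical and chosen only for illustration.

```python
import math

# Summary values reported in the notes for the stainless steel example.
MSE = 83.25298673
N = 10
X_BAR = 20.0
SS_XX = 1412.5
T_CRIT = 2.306  # t_{0.025; 8}, as quoted in the notes


def half_width(x0):
    """Half-width of the 95% interval for the conditional mean at x0."""
    se = math.sqrt(MSE * (1.0 / N + (x0 - X_BAR) ** 2 / SS_XX))
    return T_CRIT * se


# Hypothetical stress values, moving away from the sample mean of 20.
x_values = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
widths = [2.0 * half_width(x0) for x0 in x_values]

for x0, w in zip(x_values, widths):
    print(f"x0 = {x0:5.1f}  interval width = {w:7.4f} hrs.")
```

The width depends on x0 only through the squared distance from the mean, so values equally far above and below 20 give the same width.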