2.7 The analysis of variance to regression analysis

2.7 The Analysis of Variance (F-test) to Regression Analysis
$H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
We have the following 2 models:
Horizontal: $y = \beta_0 + \varepsilon \;\Rightarrow\; \hat{y} = \bar{y}$
Line: $y = \beta_0 + \beta_1 x + \varepsilon \;\Rightarrow\; \hat{y} = b_0 + b_1 x$
Note: The objective function for model 1 (the horizontal line) is
$$S(\beta_0) = \sum_{i=1}^{n} (y_i - \beta_0)^2 .$$
Thus, the estimate of the parameter $\beta_0$ can be obtained by solving
$$\frac{\partial S(\beta_0)}{\partial \beta_0} = 0 .$$
The solution is $\bar{y}$, so $\hat{y} = \bar{y}$.
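Fundamental Equation:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

("distance" between data and horizontal line) =
("distance" between data and line) +
("distance" between model line and horizontal line).

[Figure: the data points $y_i$, the fitted line $\hat{y}_i$ and the horizontal line $\bar{y}$, annotated with the three sums of squares $\sum_{i=1}^{n} (y_i - \bar{y})^2$, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$.]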
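[Derivation of Fundamental Equation]:
$$
\begin{aligned}
\sum_{i=1}^{n} (y_i - \bar{y})^2
&= \sum_{i=1}^{n} (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\
&= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
\end{aligned}
$$
since, writing $\hat{y}_i = \bar{y} + b_1 (x_i - \bar{x})$,
$$
\begin{aligned}
\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})
&= \sum_{i=1}^{n} \left[ y_i - \left( \bar{y} + b_1 (x_i - \bar{x}) \right) \right] \left[ \bar{y} + b_1 (x_i - \bar{x}) - \bar{y} \right] \\
&= \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \right] b_1 (x_i - \bar{x}) \\
&= b_1 \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - b_1 (x_i - \bar{x}) \right] (x_i - \bar{x}) \\
&= b_1 \left[ \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) - b_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 \right] = b_1 (s_{XY} - b_1 s_{XX}) \\
&= b_1 \left( s_{XY} - \frac{s_{XY}}{s_{XX}} s_{XX} \right) = b_1 (s_{XY} - s_{XY}) = 0 .
\end{aligned}
$$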
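The decomposition can also be checked numerically. The following is a minimal sketch in plain Python, assuming five illustrative (x, y) pairs (any data set would do); the variable names are chosen here for illustration only. It fits the least squares line and confirms that the cross term vanishes, so that the total sum of squares splits into the two parts.

```python
# Five illustrative (x, y) pairs; any data set would do.
x = [2, 3, 5, 1, 8]
y = [25, 25, 20, 30, 16]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares line and fitted values.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares and the cross term from the derivation.
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)
cross = sum((yi - fi) * (fi - y_bar) for yi, fi in zip(y, y_hat))

print(f"SST = {sst:.4f}, SSE + SSR = {sse + ssr:.4f}, cross term = {cross:.2e}")
```

Up to floating-point error, the printed cross term should be zero and SST should equal SSE + SSR.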
The ANOVA (Analysis of Variance) table corresponding to the fundamental equation:

| Source | df | SS | MS |
|---|---|---|---|
| Due to regression | $1$ | $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ | $MSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ |
| Residual (Error) | $n-2$ | $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ | $MSE = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$ |
| Total (corrected) | $n-1$ | $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ | |
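As a rough illustration of how the table is filled in from raw data, here is a small sketch in plain Python. The function name `anova_table` and the data are made up for this example and are not part of the notes; the rows follow the Source, df, SS, MS layout of the table.

```python
def anova_table(x, y):
    """Return the simple-linear-regression ANOVA rows as a list of dicts."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # due to regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total (corrected)

    return [
        {"source": "Due to regression", "df": 1, "ss": ssr, "ms": ssr / 1},
        {"source": "Residual (Error)", "df": n - 2, "ss": sse, "ms": sse / (n - 2)},
        {"source": "Total (corrected)", "df": n - 1, "ss": sst, "ms": None},
    ]


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

table = anova_table(x, y)
for row in table:
    ms = "" if row["ms"] is None else f"{row['ms']:.4f}"
    print(f"{row['source']:<18} {row['df']:>3} {row['ss']:>10.4f} {ms:>10}")
```

The degrees of freedom and the sums of squares in the first two rows add up to those in the last row, mirroring the fundamental equation.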
Let
$$f = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / 1}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n-2)} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{s^2},$$
where $s^2 = MSE$, be the ratio of the mean square due to regression to the residual mean square. Intuitively, a large value of $f$ suggests that the difference between the fitted line and the horizontal line is large relative to the random variation reflected by the residual mean square. That is, $\beta_1$ differs from 0 enough that the difference between the line and the horizontal line is apparent. Therefore, the value of $f$ provides important information about whether $H_0: \beta_1 = 0$ holds.
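Next question to ask: how large must the value of $f$ be to be considered large? To test $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$ at significance level $\alpha$,
$$f > f_{1,\, n-2,\, \alpha} \;\Rightarrow\; \text{reject } H_0 ,$$
where $f_{1,\, n-2,\, \alpha}$ is the upper $\alpha$ critical value of the F distribution with 1 and $n-2$ degrees of freedom.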
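Critical values are normally read from an F table or obtained from statistical software. As a rough, standard-library-only sketch (not the method used in these notes), the code below relies on the fact that an F variable with 1 and $\nu$ degrees of freedom is the square of a t variable with $\nu$ degrees of freedom, so $f_{1,\nu,\alpha} = t_{\nu,\alpha/2}^2$. The helper names `t_cdf` and `f_critical_1`, the Simpson step count and the bisection bracket are all assumptions made for this illustration.

```python
import math


def t_cdf(x, nu, steps=4000):
    """CDF of Student's t with nu degrees of freedom, by Simpson's rule."""
    const = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))

    def pdf(u):
        return const * (1 + u * u / nu) ** (-(nu + 1) / 2)

    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = total * h / 3  # integral of the density over [0, |x|]
    return 0.5 + area if x >= 0 else 0.5 - area


def f_critical_1(nu, alpha, upper=200.0):
    """Upper-alpha critical value of F(1, nu), as the square of t_{nu, alpha/2}."""
    target = 1 - alpha / 2
    lo, hi = 0.0, upper  # bisection bracket for the t quantile (assumed wide enough)
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < target:
            lo = mid
        else:
            hi = mid
    return ((lo + hi) / 2) ** 2


print(round(f_critical_1(8, 0.05), 2))
print(round(f_critical_1(3, 0.01), 2))
```

The two calls correspond to $f_{1,8,0.05}$ and $f_{1,3,0.01}$, which should land close to the tabulated values 5.32 and 34.12 used in the examples of this section.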
Note: Because the regression has 1 degree of freedom, the sum of squares due to regression and the mean sum of squares due to regression coincide:
$$MSR = \frac{SSR}{1} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 .$$
The total sum of squares is
$$SST = s_{YY} = \sum_{i=1}^{n} (y_i - \bar{y})^2 .$$
Thus, the f statistic is
$$f = \frac{MSR}{MSE} .$$
Note: For ease of computation, the following equations can be used:
$$MSR = SSR = b_1 s_{XY} = b_1^2 s_{XX} .$$
Note: $E(MSE) = \sigma^2$, $E(MSR) = E(SSR) = \sigma^2 + \beta_1^2 s_{XX}$.
Note: Let $t$ be the statistic for testing $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$. Then,
$$f = t^2 .$$
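The identity follows from $t = b_1 / (s^2 / s_{XX})^{1/2}$, so that $t^2 = b_1^2 s_{XX} / s^2 = SSR / MSE = f$. A quick numerical check, sketched in plain Python on made-up data (the numbers and variable names are illustrative assumptions only):

```python
import math

# Made-up data, for illustration only.
x = [1.0, 2.0, 4.0, 5.0, 7.0, 9.0]
y = [3.2, 4.1, 7.9, 9.6, 13.1, 17.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx
ssr = b1 * s_xy                  # computational shortcut SSR = b1 * s_XY
sse = s_yy - ssr                 # SSE = SST - SSR
mse = sse / (n - 2)              # s^2

f = (ssr / 1) / mse              # f = MSR / MSE
t = b1 / math.sqrt(mse / s_xx)   # t statistic for H0: beta_1 = 0

print(f"f = {f:.6f}, t^2 = {t ** 2:.6f}")
```

The two printed numbers should agree up to floating-point error.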
Motivating Example (continued):
Assume $\alpha = 0.05$. To test $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$, we have the following:
$$b_1 = 5, \quad s_{XY} = 2840, \quad SSR = b_1 s_{XY} = 5 \times 2840 = 14200,$$
$$SST = \sum_{i=1}^{10} (y_i - \bar{y})^2 = \sum_{i=1}^{10} y_i^2 - 10 \bar{y}^2 = 184730 - 10 \times 130^2 = 15730$$
$$\Rightarrow SSE = \sum_{i=1}^{10} (y_i - \bar{y})^2 - SSR = 15730 - 14200 = 1530 .$$
Thus, we have the following ANOVA table:

| Source | df | SS | MS | f |
|---|---|---|---|---|
| Regression | 1 | SSR = 14200 | MSR = SSR/1 = 14200 | f = MSR/MSE = 14200/191.25 ≈ 74.2 |
| Residual (Error) | n-2 = 8 | SSE = 1530 | MSE = SSE/8 = 191.25 | |
| Total (corrected) | 9 | 15730 | | |

Since
$$f \approx 74.2 > 5.32 = f_{1,8,0.05},$$
we reject $H_0: \beta_1 = 0$. Note that, up to rounding of $t$,
$$f \approx 74.2 \approx 8.61^2 = t^2 .$$
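The arithmetic of this example can be replayed from the summary quantities quoted above. This is only a sketch in plain Python: the inputs are the values given in the example, the critical value 5.32 is the tabulated $f_{1,8,0.05}$ quoted in the text, and the variable names are chosen here for illustration.

```python
# Summary quantities quoted in the motivating example.
n = 10
b1 = 5.0
s_xy = 2840.0
sum_y_sq = 184730.0   # sum of y_i^2
y_bar = 130.0

ssr = b1 * s_xy                   # SSR = b1 * s_XY
sst = sum_y_sq - n * y_bar ** 2   # SST = sum y_i^2 - n * ybar^2
sse = sst - ssr                   # SSE = SST - SSR

msr = ssr / 1
mse = sse / (n - 2)
f = msr / mse

f_crit = 5.32                     # f_{1,8,0.05}, as quoted in the text
print(f"SSR = {ssr}, SST = {sst}, SSE = {sse}")
print(f"MSE = {mse}, f = {f:.2f}, reject H0: {f > f_crit}")
```

Small differences in the last digit of $f$ compared with a hand calculation can arise from rounding intermediate values.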
Example 2 (continued):
Suppose the model is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, 20, \quad \varepsilon_i \sim N(0, \sigma^2),$$
and
$$\sum_{i=1}^{20} x_i = 1330, \quad \sum_{i=1}^{20} y_i = 1862.8, \quad \sum_{i=1}^{20} x_i^2 = 90662, \quad \sum_{i=1}^{20} y_i^2 = 173554.26, \quad \sum_{i=1}^{20} x_i y_i = 124206.9 .$$
(a) Provide an ANOVA table.
(b) Find the 95% confidence interval for $\beta_1$ and use the confidence interval to test $H_0: \beta_1 = 0$.
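[solution:]
(a)
Since
$$s_{YY} = SST = \sum_{i=1}^{20} y_i^2 - 20 \bar{y}^2 = 173554.26 - 20 \times \left( \frac{1862.8}{20} \right)^2 = 53.068$$
$$SSR = b_1 s_{XY} = \frac{s_{XY}^2}{s_{XX}} = \frac{330.7^2}{2217} \approx 49.329 \quad \left( b_1 = \frac{s_{XY}}{s_{XX}} = \frac{330.7}{2217} \approx 0.1492 \right)$$
$$SSE = SST - SSR = 53.068 - 49.329 = 3.739$$
The ANOVA table is

| Source | df | SS | MS |
|---|---|---|---|
| Regression | 1 | SSR = 49.329 | MSR = SSR/1 = 49.329 |
| Residual (Error) | n-2 = 18 | SSE = 3.739 | MSE = SSE/18 = 0.208 |
| Total (corrected) | 19 | 53.068 | |

(b) The 95% confidence interval for $\beta_1$ is
$$b_1 \pm t_{n-2,\, \alpha/2} \left( \frac{s^2}{s_{XX}} \right)^{1/2} = 0.1492 \pm t_{18,\, 0.025} \left( \frac{0.208}{2217} \right)^{1/2} = 0.1492 \pm 2.101 \times 0.00969 \approx (0.129,\ 0.170) .$$
Since $0 \notin (0.129,\ 0.170)$, we reject $H_0: \beta_1 = 0$.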
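The whole calculation can be replayed from the five sums given in the problem statement. The sketch below is plain Python with illustrative variable names; the critical value $t_{18,0.025} = 2.101$ is taken from a standard t table and is an assumption of this sketch, since the text does not quote it. Values computed at full precision may differ in the last digits from hand-rounded ones.

```python
import math

# Sums given in Example 2.
n = 20
sum_x, sum_y = 1330.0, 1862.8
sum_x2, sum_y2, sum_xy = 90662.0, 173554.26, 124206.9

x_bar, y_bar = sum_x / n, sum_y / n
s_xx = sum_x2 - n * x_bar ** 2
s_xy = sum_xy - n * x_bar * y_bar
s_yy = sum_y2 - n * y_bar ** 2        # SST

b1 = s_xy / s_xx
ssr = b1 * s_xy
sse = s_yy - ssr
mse = sse / (n - 2)                   # s^2

# (b) 95% confidence interval for beta_1.
t_crit = 2.101   # t_{18, 0.025} from a standard t table (assumed; not given in the text)
half_width = t_crit * math.sqrt(mse / s_xx)
ci = (b1 - half_width, b1 + half_width)
reject_h0 = not (ci[0] <= 0.0 <= ci[1])

print(f"SST = {s_yy:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}, MSE = {mse:.3f}")
print(f"95% CI for beta_1: ({ci[0]:.3f}, {ci[1]:.3f}), reject H0: {reject_h0}")
```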
Example 3:
Given are 5 observations for two variables x and y.

| $x_i$ | 2 | 3 | 5 | 1 | 8 |
|---|---|---|---|---|---|
| $y_i$ | 25 | 25 | 20 | 30 | 16 |

Suppose the model is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, 5, \quad \varepsilon_i \sim N(0, \sigma^2).$$
(a) Find the least squares estimates and the fitted regression equation.
(b) Provide an ANOVA table and use the F statistic to test $H_0: \beta_1 = 0$ at $\alpha = 0.01$.
(c) Use the t statistic to test $H_0: \beta_1 = -1.5$ at $\alpha = 0.01$.
(d) Find the 95% confidence interval for $\beta_0$ and use the confidence interval to test $H_0: \beta_0 = 30$.
[solutions:]
(a) Since
$$\sum_{i=1}^{5} x_i = 19, \quad \sum_{i=1}^{5} x_i^2 = 103, \quad \sum_{i=1}^{5} x_i y_i = 383, \quad \sum_{i=1}^{5} y_i = 116, \quad \sum_{i=1}^{5} y_i^2 = 2806,$$
thus,
$$s_{XX} = \sum_{i=1}^{5} x_i^2 - 5 \bar{x}^2 = 103 - 5 \times \left( \frac{19}{5} \right)^2 = 30.8$$
$$s_{XY} = \sum_{i=1}^{5} x_i y_i - 5 \bar{x} \bar{y} = 383 - 5 \times \left( \frac{19}{5} \right) \left( \frac{116}{5} \right) = -57.8$$
Then, the least squares estimates are
$$b_1 = \frac{s_{XY}}{s_{XX}} = \frac{-57.8}{30.8} = -1.8766, \quad b_0 = \bar{y} - b_1 \bar{x} = \frac{116}{5} - (-1.8766) \times \frac{19}{5} = 30.3311 .$$
The fitted regression equation is
$$\hat{y} = 30.3311 - 1.8766 x .$$
(b)
Since
$$s_{YY} = SST = \sum_{i=1}^{5} y_i^2 - 5 \bar{y}^2 = 2806 - 5 \times \left( \frac{116}{5} \right)^2 = 114.8$$
$$SSR = b_1 s_{XY} = (-1.8766) \times (-57.8) = 108.467$$
$$SSE = SST - SSR = 114.8 - 108.467 = 6.333$$
The ANOVA table is

| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | 1 | SSR = 108.467 | MSR = SSR/1 = 108.467 | f = MSR/MSE = 51.381 |
| Residual (Error) | n-2 = 3 | SSE = 6.333 | $s^2$ = MSE = SSE/3 = 2.111 | |
| Total (corrected) | n-1 = 4 | SST = 114.8 | | |

Since $f = 51.381 > 34.12 = f_{1,3,0.01}$, we reject $H_0: \beta_1 = 0$.
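(c)
$$t = \frac{b_1 - c}{\left( s^2 / s_{XX} \right)^{1/2}} = \frac{-1.8766 - (-1.5)}{\left( 2.111 / 30.8 \right)^{1/2}} = \frac{-0.3766}{\left( 2.111 / 30.8 \right)^{1/2}} = -1.438 .$$
Since
$$|t| = 1.438 < 5.841 = t_{3,\, 0.005} = t_{n-2,\, \alpha/2},$$
we do not reject $H_0: \beta_1 = -1.5$.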
(d)
The 95% confidence interval for $\beta_0$ is
$$b_0 \pm t_{n-2,\, \alpha/2} \left( \frac{s^2 \sum_{i=1}^{n} x_i^2}{n\, s_{XX}} \right)^{1/2} = 30.3311 \pm t_{3,\, 0.025} \left( \frac{2.111 \times 103}{5 \times 30.8} \right)^{1/2} = 30.3311 \pm (3.182 \times 1.188) = (26.551,\ 34.111) .$$
Since $30 \in (26.551,\ 34.111)$, we do not reject $H_0: \beta_0 = 30$.
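Parts (c) and (d) can be replayed from the five observations. This is a sketch in plain Python with illustrative variable names; the critical values 5.841 and 3.182 are the tabulated $t_{3,0.005}$ and $t_{3,0.025}$ quoted in the solution, entered by hand rather than computed.

```python
import math

# The five observations of Example 3.
x = [2, 3, 5, 1, 8]
y = [25, 25, 20, 30, 16]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sum_x2 = sum(xi ** 2 for xi in x)
s_xx = sum_x2 - n * x_bar ** 2
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
s2 = (s_yy - b1 * s_xy) / (n - 2)     # s^2 = MSE = (SST - SSR) / (n - 2)

# (c) t test of H0: beta_1 = c with c = -1.5 at alpha = 0.01.
c = -1.5
t_stat = (b1 - c) / math.sqrt(s2 / s_xx)
t_crit_c = 5.841                      # t_{3, 0.005}, as quoted in the text
reject_c = abs(t_stat) > t_crit_c

# (d) 95% confidence interval for beta_0 and test of H0: beta_0 = 30.
t_crit_d = 3.182                      # t_{3, 0.025}, as quoted in the text
se_b0 = math.sqrt(s2 * sum_x2 / (n * s_xx))
ci_b0 = (b0 - t_crit_d * se_b0, b0 + t_crit_d * se_b0)
reject_d = not (ci_b0[0] <= 30 <= ci_b0[1])

print(f"t = {t_stat:.3f}, reject H0 (beta_1 = -1.5): {reject_c}")
print(f"95% CI for beta_0: ({ci_b0[0]:.3f}, {ci_b0[1]:.3f}), reject H0 (beta_0 = 30): {reject_d}")
```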