Correlation_and_Regression_Analysis

Systems Engineering Program
Department of Engineering Management, Information and Systems
EMIS 7370/5370 STAT 5340 :
PROBABILITY AND STATISTICS FOR SCIENTISTS AND ENGINEERS
Correlation and Regression
Analysis – An Application
Dr. Jerrell T. Stracener, SAE Fellow
Leadership in Engineering
Montgomery, Peck, and Vining (2001) present data
concerning the performance of the 28 National Football
League teams in 1976. It is suspected that the number of
games won (y) is related to the number of yards gained
rushing by an opponent (x). The data are shown in the
following table:
Team              Games Won (y)   Yards Rushing by Opponent (x)
Washington        10              2205
Minnesota         11              2096
New England       11              1847
Oakland           13              1903
Pittsburgh        10              1457
Baltimore         11              1848
Los Angeles       10              1564
Dallas            11              1821
Atlanta           4               2577
Buffalo           2               2476
Chicago           7               1984
Cincinnati        10              1917
Cleveland         9               1761
Denver            9               1709
Detroit           6               1901
Green Bay         5               2288
Houston           5               2072
Kansas City       5               2861
Miami             6               2411
New Orleans       4               2289
New York Giants   3               2203
New York Jets     3               2592
Philadelphia      4               2053
St. Louis         10              1979
San Diego         6               2048
San Francisco     8               1786
Seattle           2               2876
Tampa Bay         0               2560
Correlation Analysis
• Statistical analysis used to obtain a quantitative
measure of the strength of the relationship between
a dependent variable and one or more independent
variables
Stracener_EMIS 7370/STAT 5340_Fall 08_11.18.08
Scatter Plot
Sample correlation coefficient
\[
\hat{\rho} = r = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
{\left\{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]\right\}^{1/2}}
\]
Notes:
$-1 \le r \le 1$
$R = r^2 \times 100\%$ = coefficient of determination
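\[
r = \frac{28 \times 386{,}127 - 59{,}084 \times 195}
{\left\{\left[28 \times 128{,}284{,}292 - 59{,}084^2\right]\left[28 \times 1{,}685 - 195^2\right]\right\}^{1/2}}
\]
\[
r = -0.738
\]
$R = r^2 \times 100\% = 54.47\%$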
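The sums in this calculation can be reproduced from the table. The following is a minimal sketch in Python (the choice of language and all variable names are assumptions for illustration; the slides themselves work in Excel), with the data pairs copied from the table above.

```python
import math

# (yards rushing by opponent x, games won y) for the 28 teams, in table order
data = [
    (2205, 10), (2096, 11), (1847, 11), (1903, 13), (1457, 10), (1848, 11),
    (1564, 10), (1821, 11), (2577, 4), (2476, 2), (1984, 7), (1917, 10),
    (1761, 9), (1709, 9), (1901, 6), (2288, 5), (2072, 5), (2861, 5),
    (2411, 6), (2289, 4), (2203, 3), (2592, 3), (2053, 4), (1979, 10),
    (2048, 6), (1786, 8), (2876, 2), (2560, 0),
]

n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_xy = sum(x * y for x, y in data)
sum_x2 = sum(x * x for x, _ in data)
sum_y2 = sum(y * y for _, y in data)

# Sample correlation coefficient from the computational formula
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator
R = r**2 * 100  # coefficient of determination, in percent

print(f"r = {r:.3f}, R = {R:.2f}%")  # prints: r = -0.738, R = 54.47%
```

The negative sign of r reflects that teams allowing more rushing yards tended to win fewer games.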
Correlation
To test for no linear association between x and y,
calculate
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
\]
where r is the sample correlation coefficient and n
is the sample size.
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.738\sqrt{28-2}}{\sqrt{1-(-0.738)^2}} = -5.5766
\]
Correlation
Conclude no linear association if
\[
-t_{\alpha/2,\,n-2} \le t \le t_{\alpha/2,\,n-2}
\]
then treat $y_1, y_2, \ldots, y_n$ as a random sample
Correlation
Take α=0.05 and check from the T-table, we get
-tα
,n  2
  t 0 .025 , 26   2 . 0555
2
Since t=-5.5766 < -2.0555, we conclude that there
is linear association between x and y and proceed
with regression analysis
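The test can be sketched in a few lines of Python. This is an illustrative sketch only: the variable names are assumptions, and the critical value 2.0555 is taken from the t-table lookup above rather than computed, since the Python standard library has no t-distribution quantile function.

```python
import math

n = 28           # sample size from the slides
r = -0.738       # sample correlation coefficient from the slides
t_crit = 2.0555  # t_{0.025, 26}, read from a t-table as on the slides

# Test statistic for H0: no linear association
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# H0 is retained only if t falls inside [-t_crit, t_crit]
no_linear_association = -t_crit <= t <= t_crit

print(f"t = {t:.4f}")
print("no linear association" if no_linear_association else "linear association")
```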
Linear Regression Model
Simple linear regression model
\[
Y = \beta_0 + \beta_1 X + \varepsilon
\]
where
Y is the response (or dependent) variable
$\beta_0$ and $\beta_1$ are the unknown parameters
$\varepsilon \sim N(0, \sigma)$ and data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
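Least squares estimates of $\beta_0$ and $\beta_1$
\[
b_1 = \hat{\beta}_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
\]
\[
b_0 = \hat{\beta}_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i\right)
\]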
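Estimate of $\beta_1$
\[
b_1 = \hat{\beta}_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
\]
\[
b_1 = \frac{28 \times 386{,}127 - 59{,}084 \times 195}{28 \times 128{,}284{,}292 - 59{,}084^2}
\]
\[
b_1 = -0.00703
\]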
Estimate of $\beta_0$
\[
b_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i\right)
\]
\[
b_0 = \frac{1}{28}\left[195 - (-0.00703) \times 59{,}084\right]
\]
\[
b_0 = 21.7883
\]
Least squares regression equation
Point estimate of the linear model
\[
Y = \beta_0 + \beta_1 x + \varepsilon
\]
is
\[
\hat{Y} = 21.78825 - 0.00703x
\]
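The slope and intercept follow directly from the sums quoted on the slides. A minimal Python sketch, with hypothetical variable names and a hypothetical helper `predict`:

```python
# Sums quoted on the slides for the 28 teams
n = 28
sum_x, sum_y = 59_084, 195
sum_xy, sum_x2 = 386_127, 128_284_292

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)  # slope
b0 = (sum_y - b1 * sum_x) / n                                # intercept


def predict(x):
    """Fitted number of games won for x opponent rushing yards."""
    return b0 + b1 * x


print(f"b1 = {b1:.5f}, b0 = {b0:.4f}")
print(f"Y_hat(2205) = {predict(2205):.3f}")
```

Carrying b1 at full precision rather than the rounded -0.00703 matters here, because it is multiplied by sums and x values in the thousands.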
Regression Fitted Line Plot
Point estimate of $\sigma^2$
\[
\hat{\sigma}^2 = S^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{Y}_i\right)^2
\]
\[
= \frac{1}{n-2}\left\{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2
- \frac{b_1}{n}\left[n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)\right]\right\}
\]
\[
= \frac{1}{n-2}\left\{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}
- \frac{b_1}{n}\left[n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)\right]\right\}
\]
\[
= 5.726
\]
Stracener_EMIS 7370/STAT 5340_Fall 08_11.17.08
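The point estimate of the error variance can be checked from the sums alone, using the computational form of the formula. A minimal Python sketch (variable names are assumptions for illustration):

```python
# Sums quoted on the slides for the 28 teams
n = 28
sum_x, sum_y = 59_084, 195
sum_xy, sum_x2, sum_y2 = 386_127, 128_284_292, 1_685

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)

syy = sum_y2 - sum_y**2 / n                          # total variation in y
explained = (b1 / n) * (n * sum_xy - sum_x * sum_y)  # variation explained by the line
s2 = (syy - explained) / (n - 2)                     # point estimate of sigma^2
s = s2 ** 0.5                                        # standard error S

print(f"S^2 = {s2:.3f}, S = {s:.4f}")
```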
Interval Estimates for y intercept ($\beta_0$)
The $(1 - \alpha)100\%$ confidence interval for $\beta_0$ is
\[
\left(\beta_{0L},\ \beta_{0U}\right)
\]
where
\[
\beta_{0L} = b_0 - t_{\alpha/2,\,n-2}\, S_{b_0}
\]
and
\[
\beta_{0U} = b_0 + t_{\alpha/2,\,n-2}\, S_{b_0}
\]
where
\[
S_{b_0} = S\left[\frac{\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
\]
Interval Estimates for y intercept ($\beta_0$)
Take α = 0.05. The standard error needed for the 95% confidence interval for $\beta_0$ is
\[
S_{b_0} = S\left[\frac{\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
= 2.3929 \times \left[\frac{128{,}284{,}292}{28 \times 128{,}284{,}292 - 59{,}084^2}\right]^{1/2}
= 2.696
\]
Interval Estimates for y intercept ($\beta_0$)
Substituting $S_{b_0}$ into the interval formulas gives the lower and upper
bounds for $\beta_0$:
\[
\beta_{0L} = b_0 - t_{\alpha/2,\,n-2}\, S_{b_0} = 21.7883 - 2.056 \times 2.696 = 16.246
\]
\[
\beta_{0U} = b_0 + t_{\alpha/2,\,n-2}\, S_{b_0} = 21.7883 + 2.056 \times 2.696 = 27.33
\]
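The intercept interval can be reproduced as follows. This is a sketch under stated assumptions: the values of b0, S, and the critical value 2.056 are taken as given from the slides, and the variable names are hypothetical.

```python
import math

n = 28
sum_x, sum_x2 = 59_084, 128_284_292  # sums quoted on the slides
b0 = 21.7883    # intercept estimate from the slides
s = 2.3929      # standard error S from the slides
t_crit = 2.056  # t_{0.025, 26} as used on the slides

# Standard error of the intercept
s_b0 = s * math.sqrt(sum_x2 / (n * sum_x2 - sum_x**2))

beta0_lower = b0 - t_crit * s_b0
beta0_upper = b0 + t_crit * s_b0

print(f"S_b0 = {s_b0:.3f}")
print(f"95% CI for beta_0: ({beta0_lower:.2f}, {beta0_upper:.2f})")
```

Small differences in the third decimal place relative to the slide values come from rounding of the inputs.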
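Interval Estimates for slope ($\beta_1$)
The $(1 - \alpha)100\%$ confidence interval for $\beta_1$ is
\[
\left(\beta_{1L},\ \beta_{1U}\right)
\]
where
\[
\beta_{1L} = b_1 - t_{\alpha/2,\,n-2}\, S_{b_1}
\]
and
\[
\beta_{1U} = b_1 + t_{\alpha/2,\,n-2}\, S_{b_1}
\]
where
\[
S_{b_1} = \frac{S}{\left[\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]^{1/2}}
\]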
Interval Estimates for slope ($\beta_1$)
\[
S_{b_1} = \frac{S}{\left[\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]^{1/2}}
= \frac{2.3929}{\left[128{,}284{,}292 - \dfrac{59{,}084^2}{28}\right]^{1/2}}
= 0.00126
\]
\[
\beta_{1L} = b_1 - t_{\alpha/2,\,n-2}\, S_{b_1} = -0.00703 - 2.056 \times 0.00126 = -0.00961
\]
\[
\beta_{1U} = b_1 + t_{\alpha/2,\,n-2}\, S_{b_1} = -0.00703 + 2.056 \times 0.00126 = -0.00444
\]
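A matching sketch for the slope interval, again in Python with hypothetical variable names. The slope is recomputed from the sums so that it is not rounded before use; S and the critical value 2.056 are taken from the slides.

```python
import math

n = 28
sum_x, sum_y = 59_084, 195
sum_xy, sum_x2 = 386_127, 128_284_292  # sums quoted on the slides
s = 2.3929      # standard error S from the slides
t_crit = 2.056  # t_{0.025, 26} as used on the slides

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)

# Standard error of the slope
s_b1 = s / math.sqrt(sum_x2 - sum_x**2 / n)

beta1_lower = b1 - t_crit * s_b1
beta1_upper = b1 + t_crit * s_b1

print(f"S_b1 = {s_b1:.5f}")
print(f"95% CI for beta_1: ({beta1_lower:.5f}, {beta1_upper:.5f})")
```

Because the whole interval lies below zero, it agrees with the earlier t-test conclusion that a linear association exists.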
Confidence interval for conditional mean of Y, given x=2205
Given x = 2205, the lower bound of the confidence
interval for the conditional mean of Y is
\[
\mu_L(x) = \hat{Y}(x) - t_{\alpha/2,\,n-2}\, S\left[\frac{1}{n}
+ \frac{n\left(x - \bar{x}\right)^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
\]
With $\bar{x} = 59{,}084/28 = 2110.14$ and $(x - \bar{x})^2 = (2205 - 2110.14)^2 = 8{,}997.878$:
\[
\mu_L(x) = 6.298 - 2.056 \times 2.3929 \times \left[\frac{1}{28}
+ \frac{28 \times 8{,}997.878}{28 \times 128{,}284{,}292 - 59{,}084^2}\right]^{1/2}
\]
\[
\mu_L(x) = 5.336
\]
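Confidence interval for conditional mean of Y, given x=2205
and the upper bound is
\[
\mu_U(x) = \hat{Y}(x) + t_{\alpha/2,\,n-2}\, S\left[\frac{1}{n}
+ \frac{n\left(x - \bar{x}\right)^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
\]
With $\bar{x} = 59{,}084/28 = 2110.14$ and $(x - \bar{x})^2 = (2205 - 2110.14)^2 = 8{,}997.878$:
\[
\mu_U(x) = 6.298 + 2.056 \times 2.3929 \times \left[\frac{1}{28}
+ \frac{28 \times 8{,}997.878}{28 \times 128{,}284{,}292 - 59{,}084^2}\right]^{1/2}
\]
\[
\mu_U(x) = 7.260
\]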
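The confidence interval for the conditional mean can be evaluated at any x with a small helper. The sketch below is in Python; the function name `mean_interval` and the variable names are hypothetical, the critical value 2.056 is taken from the slides, and the term $(x - \bar{x})^2$ is computed directly from the chosen x and $\bar{x} = 59{,}084/28$.

```python
import math

# Sums quoted on the slides for the 28 teams
n = 28
sum_x, sum_y = 59_084, 195
sum_xy, sum_x2, sum_y2 = 386_127, 128_284_292, 1_685
t_crit = 2.056  # t_{0.025, 26} as used on the slides

denom = n * sum_x2 - sum_x**2          # n*sum(x^2) - (sum x)^2
sxy_n = n * sum_xy - sum_x * sum_y     # n*sum(xy) - sum(x)*sum(y)
b1 = sxy_n / denom
b0 = (sum_y - b1 * sum_x) / n
s = math.sqrt((sum_y2 - sum_y**2 / n - (b1 / n) * sxy_n) / (n - 2))


def mean_interval(x):
    """Return (lower, upper) confidence bounds for the mean of Y at x."""
    x_bar = sum_x / n
    y_hat = b0 + b1 * x
    half_width = t_crit * s * math.sqrt(1 / n + n * (x - x_bar) ** 2 / denom)
    return y_hat - half_width, y_hat + half_width


mu_l, mu_u = mean_interval(2205)
print(f"mean of Y at x = 2205: ({mu_l:.3f}, {mu_u:.3f})")
```

The interval is narrowest at x equal to the sample mean of x and widens as x moves away from it.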
Prediction interval for a single future value of Y, given x
\[
Y_L(x) = \hat{Y}(x) - t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n}
+ \frac{n\left(x - \bar{x}\right)^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
\]
and
\[
Y_U(x) = \hat{Y}(x) + t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n}
+ \frac{n\left(x - \bar{x}\right)^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
\]
Prediction interval for a single future value of Y, given x=2000
Given x = 2000,
\[
\hat{Y}(2000) = 21.7883 - 0.00703 \times 2000 = 7.738
\]
\[
Y_L(x) = \hat{Y}(x) - t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n}
+ \frac{n\left(x - \bar{x}\right)^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
\]
With $\bar{x} = 59{,}084/28 = 2110.14$ and $(x - \bar{x})^2 = (2000 - 2110.14)^2 = 12{,}131.45$:
\[
Y_L(x) = 7.738 - 2.056 \times 2.3929 \times \left[1 + \frac{1}{28}
+ \frac{28 \times 12{,}131.45}{28 \times 128{,}284{,}292 - 59{,}084^2}\right]^{1/2}
\]
\[
Y_L(x) = 2.723
\]
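Prediction interval for a single future value of Y, given x=2000
and
\[
Y_U(x) = \hat{Y}(x) + t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n}
+ \frac{n\left(x - \bar{x}\right)^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}
\]
With $\bar{x} = 59{,}084/28 = 2110.14$ and $(x - \bar{x})^2 = (2000 - 2110.14)^2 = 12{,}131.45$:
\[
Y_U(x) = 7.738 + 2.056 \times 2.3929 \times \left[1 + \frac{1}{28}
+ \frac{28 \times 12{,}131.45}{28 \times 128{,}284{,}292 - 59{,}084^2}\right]^{1/2}
\]
\[
Y_U(x) = 12.753
\]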
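The prediction interval differs from the interval for the conditional mean only by the extra 1 inside the square root, which accounts for the variability of a single new observation. A Python sketch under the same assumptions as before (hypothetical names, critical value 2.056 taken from the slides, $(x - \bar{x})^2$ computed directly from the chosen x):

```python
import math

# Sums quoted on the slides for the 28 teams
n = 28
sum_x, sum_y = 59_084, 195
sum_xy, sum_x2, sum_y2 = 386_127, 128_284_292, 1_685
t_crit = 2.056  # t_{0.025, 26} as used on the slides

denom = n * sum_x2 - sum_x**2
sxy_n = n * sum_xy - sum_x * sum_y
b1 = sxy_n / denom
b0 = (sum_y - b1 * sum_x) / n
s = math.sqrt((sum_y2 - sum_y**2 / n - (b1 / n) * sxy_n) / (n - 2))


def prediction_interval(x):
    """Return (lower, upper) prediction bounds for one future Y at x."""
    x_bar = sum_x / n
    y_hat = b0 + b1 * x
    half_width = t_crit * s * math.sqrt(1 + 1 / n + n * (x - x_bar) ** 2 / denom)
    return y_hat - half_width, y_hat + half_width


y_l, y_u = prediction_interval(2000)
print(f"future Y at x = 2000: ({y_l:.3f}, {y_u:.3f})")
```

At any given x the prediction interval is wider than the confidence interval for the mean, because its half-width can never fall below t times S.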
Excel Calculation
     X     Y   XY     X^2        Y^2   Y^        (Y-Y^)^2  (x-xbar)^2
     2205  10  22050  4862025    100   6.297905  13.70551  8997.878
     2096  11  23056  4393216    121   7.063641  15.49492  200.0204
     1847  11  20317  3411409    121   8.812891  4.783447  69244.16
     1903  13  24739  3621409    169   8.419485  20.98112  42908.16
     1457  10  14570  2122849    100   11.55268  2.410815  426595.6
     1848  11  20328  3415104    121   8.805866  4.814226  68718.88
     1564  10  15640  2446096    100   10.80099  0.641591  298272
     1821  11  20031  3316041    121   8.995543  4.017847  83603.59
     2577  4   10308  6640929    16    3.684567  0.099498  217955.6
     2476  2   4952   6130576    4     4.394103  5.731727  133851.4
     1984  7   13888  3936256    49    7.850452  0.723268  15912.02
     1917  10  19170  3674889    100   8.321134  2.818592  37304.16
     1761  9   15849  3101121    81    9.417049  0.17393   121900.7
     1709  9   15381  2920681    81    9.782355  0.612079  160915.6
     1901  6   11406  3613801    36    8.433535  5.922094  43740.73
     2288  5   11440  5234944    25    5.714821  0.51097   31633.16
     2072  5   10360  4293184    25    7.232243  4.982909  1454.878
     2861  5   14305  8185321    25    1.689439  10.95981  563786.4
     2411  6   14466  5812921    36    4.850734  1.320812  90515.02
     2289  4   9156   5239521    16    5.707796  2.916568  31989.88
     2203  3   6609   4853209    9     6.311955  10.96905  8622.449
     2592  3   7776   6718464    9     3.579191  0.335462  232186.3
     2053  4   8212   4214809    16    7.36572   11.32807  3265.306
     1979  10  19790  3916441    100   7.885577  4.470783  17198.45
     2048  6   12288  4194304    36    7.400846  1.962368  3861.735
     1786  8   14288  3189796    64    9.241422  1.541128  105068.6
     2876  2   5752   8271376    4     1.584062  0.173004  586537.2
     2560  0   0      6553600    0     3.803994  14.47037  202371.4
SUM  59084 195 386127 128284292  1685  195       148.872   3608611

Summary cells:
x-bar                                     2110.1429
n * sum(XY) - sum(X) * sum(Y)             -709824
n * sum(X^2) - (sum(X))^2                 101041120
n * sum(Y^2) - (sum(Y))^2                 9155
square root of the product of the two     961785.6
  preceding values
r                                         -0.738027304
b1                                        -0.007025
b0                                        21.788251
S^2                                       5.725845085
S                                         2.392873813
Sb0                                       2.696233
b0l                                       16.2448
b0u                                       27.33171
Sb1                                       0.00126
Sb1l                                      -0.00961
Sb1u                                      -0.00444
Y(2205)                                   6.2979048
Y(2000)                                   7.7380503
mu-l                                      5.336
mu-u                                      7.260
y-l                                       2.723
y-u                                       12.753
Unlabeled cells                           34.54949, 14.0723
Excel Regression Analysis Output
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.738027
R Square            0.544684
Adjusted R Square   0.527172
Standard Error      2.392874
Observations        28

ANOVA
             df   SS         MS         F          Significance F
Regression   1    178.0923   178.0923   31.10324   7.381E-06
Residual     26   148.872    5.725845
Total        27   326.9643

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%    Lower 95.0%   Upper 95.0%
Intercept      21.78825       2.696233         8.080996   1.46E-08   16.246064   27.3304377   16.2460641    27.33044
X Variable 1   -0.00703       0.00126          -5.57703   7.38E-06   -0.009614   -0.0044359   -0.0096143    -0.00444

RESIDUAL OUTPUT
Observation   Predicted Y   Residuals
1             6.297905      3.702095
2             7.063641      3.936359
3             8.812891      2.187109
4             8.419485      4.580515
5             11.55268      -1.55268
6             8.805866      2.194134
7             10.80099      -0.80099
8             8.995543      2.004457
9             3.684567      0.315433
10            4.394103      -2.3941
11            7.850452      -0.85045
12            8.321134      1.678866
13            9.417049      -0.41705
14            9.782355      -0.78235
15            8.433535      -2.43354
16            5.714821      -0.71482
17            7.232243      -2.23224
18            1.689439      3.310561
19            4.850734      1.149266
20            5.707796      -1.7078
21            6.311955      -3.31195
22            3.579191      -0.57919
23            7.36572       -3.36572
24            7.885577      2.114423
25            7.400846      -1.40085
26            9.241422      -1.24142
27            1.584062      0.415938
28            3.803994      -3.80399
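The ANOVA quantities in the Excel output can be cross-checked from the same sums used throughout. The following Python sketch is illustrative only; the variable names are assumptions, and the significance level of F is not computed because the standard library has no F-distribution function.

```python
# Sums quoted on the slides for the 28 teams
n = 28
sum_x, sum_y = 59_084, 195
sum_xy, sum_x2, sum_y2 = 386_127, 128_284_292, 1_685

sxy = sum_xy - sum_x * sum_y / n   # corrected sum of cross products
sxx = sum_x2 - sum_x**2 / n        # corrected sum of squares of x
b1 = sxy / sxx

ss_total = sum_y2 - sum_y**2 / n   # df = n - 1
ss_regression = b1 * sxy           # df = 1
ss_residual = ss_total - ss_regression  # df = n - 2

ms_regression = ss_regression / 1
ms_residual = ss_residual / (n - 2)
f_stat = ms_regression / ms_residual

r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

print(f"SS: regression={ss_regression:.4f}, residual={ss_residual:.3f}, total={ss_total:.4f}")
print(f"F = {f_stat:.3f}, R Square = {r_square:.6f}, Adjusted R Square = {adj_r_square:.6f}")
```

In simple linear regression F equals the square of the slope's t statistic, which is why the two P-values in the Excel output coincide.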