Chapter 11 Slides (PPT)

Chapter 11
Linear Regression and Correlation
Linear Regression and Correlation
• Explanatory and Response Variables are Numeric
• Relationship between the mean of the response
variable and the level of the explanatory variable
assumed to be approximately linear (straight line)
• Model: $Y = \beta_0 + \beta_1 x + \varepsilon$, with $\varepsilon \sim N(0, \sigma_e^2)$
• $\beta_1 > 0 \Rightarrow$ Positive Association
• $\beta_1 < 0 \Rightarrow$ Negative Association
• $\beta_1 = 0 \Rightarrow$ No Association
Least Squares Estimation of $\beta_0$, $\beta_1$
• $\beta_0$: Mean response when x=0 (y-intercept)
• $\beta_1$: Change in mean response when x increases by 1 unit (slope)
• $\beta_0$, $\beta_1$ are unknown parameters (like $\mu$)
• $\beta_0 + \beta_1 x$: Mean response when explanatory variable takes on the value x
• Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight line:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2$
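One way to see where the least squares estimates come from is to set the partial derivatives of SSE to zero. The following is a sketch of that standard calculus argument, not a step shown on the slides; the second line assumes the first result has been substituted in.

```latex
\begin{align*}
\frac{\partial\, SSE}{\partial \hat{\beta}_0}
  &= -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
  \;\Longrightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\frac{\partial\, SSE}{\partial \hat{\beta}_1}
  &= -2 \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
  \;\Longrightarrow\; \hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```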
Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Predictor (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw Data and scatterplot of Score vs LSD concentration:

LSD Conc (x)    Score (y)
1.17            78.93
2.97            58.20
3.26            67.47
4.69            37.47
5.83            45.65
6.00            32.92
6.41            29.97

[Figure: scatterplot of SCORE (vertical axis, 20 to 80) against LSD_CONC (horizontal axis, 1 to 7)]
Source: Wagner, et al (1968)
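Least Squares Computations
• Summary Calculations:
$S_{xx} = \sum (x - \bar{x})^2 \qquad S_{xy} = \sum (x - \bar{x})(y - \bar{y}) \qquad S_{yy} = \sum (y - \bar{y})^2$
$SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$
• Parameter Estimates:
$\hat{\beta}_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$s_e^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} = \frac{SSE}{n-2}$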
Example - Pharmacodynamics of LSD
Score (y)   LSD Conc (x)   x-xbar    y-ybar     (x-xbar)^2   (x-xbar)(y-ybar)   (y-ybar)^2
78.93       1.17           -3.163     28.843    10.004569     -91.230409         831.918649
58.20       2.97           -1.363      8.113     1.857769     -11.058019          65.820769
67.47       3.26           -1.073     17.383     1.151329     -18.651959         302.168689
37.47       4.69            0.357    -12.617     0.127449      -4.504269         159.188689
45.65       5.83            1.497     -4.437     2.241009      -6.642189          19.686969
32.92       6.00            1.667    -17.167     2.778889     -28.617389         294.705889
29.97       6.41            2.077    -20.117     4.313929     -41.783009         404.693689
350.61      30.33          -0.001      0.001    22.474943    -202.487243        2078.183343
(Column totals given in bottom row of table; the last three totals are Sxx, Sxy, and Syy)

$\bar{y} = \frac{350.61}{7} = 50.087 \qquad \bar{x} = \frac{30.33}{7} = 4.333$
$\hat{\beta}_1 = \frac{-202.4872}{22.4749} = -9.01$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 50.09 - (-9.01)(4.33) = 89.10$
$\hat{y} = 89.10 - 9.01x$
$s_e^2 = 50.72$
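The summary calculations and parameter estimates for the LSD example can be reproduced with a few lines of plain Python. This is a minimal sketch: the function name `least_squares` and the dictionary keys are illustrative choices, not part of the slides, and only the seven data pairs come from the source.

```python
# LSD tissue concentration (x) and mean math score (y), copied from the data table.
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]


def least_squares(x, y):
    """Return the summary sums and least squares estimates for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = s_xy / s_xx                # slope estimate
    b0 = y_bar - b1 * x_bar         # intercept estimate
    sse = s_yy - s_xy ** 2 / s_xx   # error sum of squares
    s2_e = sse / (n - 2)            # estimated error variance
    return {"s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy,
            "b1": b1, "b0": b0, "sse": sse, "s2_e": s2_e}


fit = least_squares(x, y)
for name, value in fit.items():
    print(f"{name:>5} = {value:.4f}")
```

Because nothing is rounded along the way, the printed values can differ slightly in the last digits from the hand calculation.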
SPSS Output and Plot of Equation

Coefficients (a)
Model 1        Unstandardized Coefficients    Standardized Coefficients
               B          Std. Error          Beta                         t          Sig.
(Constant)     89.124     7.048                                            12.646     .000
LSD_CONC       -9.009     1.503               -.937                        -5.994     .002
a. Dependent Variable: SCORE

[Figure: Math Score vs LSD Concentration (SPSS). Scatterplot of score against lsd_conc with the fitted linear regression line, score = 89.12 + -9.01 * lsd_conc, R-Square = 0.88]
Inference Concerning the Slope ($\beta_1$)
• Parameter: Slope in the population model ($\beta_1$)
• Estimator: Least squares estimate: $\hat{\beta}_1$
• Estimated standard error: $SE_{\hat{\beta}_1} = s_e \sqrt{\frac{1}{S_{xx}}}$
• Methods of making inference regarding population:
– Hypothesis tests (2-sided or 1-sided)
– Confidence Intervals
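Hypothesis Test for $\beta_1$
• 2-Sided Test
– $H_0: \beta_1 = 0$
– $H_A: \beta_1 \ne 0$
– T.S.: $t_{obs} = \frac{\hat{\beta}_1}{s_e \sqrt{1/S_{xx}}}$
– R.R.: $|t_{obs}| \ge t_{\alpha/2,\,n-2}$
– P-value: $2P(t \ge |t_{obs}|)$
• 1-sided Test
– $H_0: \beta_1 = 0$
– $H_A^+: \beta_1 > 0$ or $H_A^-: \beta_1 < 0$
– T.S.: $t_{obs} = \frac{\hat{\beta}_1}{s_e \sqrt{1/S_{xx}}}$
– R.R. for $H_A^+$: $t_{obs} \ge t_{\alpha,\,n-2}$; R.R. for $H_A^-$: $t_{obs} \le -t_{\alpha,\,n-2}$
– P-value for $H_A^+$: $P(t \ge t_{obs})$; P-value for $H_A^-$: $P(t \le t_{obs})$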
$(1-\alpha)100\%$ Confidence Interval for $\beta_1$
$\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\, SE_{\hat{\beta}_1} \equiv \hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{\frac{1}{S_{xx}}}$
• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is same as 2-sided hypothesis test
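The test statistic and confidence interval for the slope can be sketched in Python using the summary values from the LSD example in this chapter (n = 7, slope estimate -9.01, s_e^2 = 50.72, Sxx = 22.475). The critical value 2.571 is taken from the slides as t_{.025,5} rather than computed, and the variable names are illustrative.

```python
import math

# Summary values from the LSD example (rounded as on the slides).
n = 7
b1 = -9.01        # least squares slope estimate
s2_e = 50.72      # estimated error variance
s_xx = 22.475     # sum of squared deviations of x
t_crit = 2.571    # t_{.025, 5}, quoted from the slides (assumed, not computed here)

s_e = math.sqrt(s2_e)
se_b1 = s_e * math.sqrt(1 / s_xx)      # estimated standard error of the slope
t_obs = b1 / se_b1                     # test statistic for H0: beta1 = 0
reject_h0 = abs(t_obs) >= t_crit       # 2-sided rejection region

half_width = t_crit * se_b1
ci = (b1 - half_width, b1 + half_width)  # 95% confidence interval for beta1

print(f"SE(b1) = {se_b1:.3f}, t_obs = {t_obs:.2f}, reject H0: {reject_h0}")
print(f"95% CI for beta1: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Since the interval lies entirely below 0, the sketch reaches the same conclusion as the 2-sided test: a negative association.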
Example - Pharmacodynamics of LSD
$n = 7 \qquad \hat{\beta}_1 = -9.01 \qquad s_e = \sqrt{50.72} = 7.12 \qquad S_{xx} = 22.475$
$SE_{\hat{\beta}_1} = 7.12 \sqrt{\frac{1}{22.475}} = 1.50$
• Testing $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \ne 0$
T.S.: $t_{obs} = \frac{-9.01}{1.50} = -6.01$
R.R.: $|t_{obs}| \ge t_{.025,5} = 2.571$
• 95% Confidence Interval for $\beta_1$:
$-9.01 \pm 2.571(1.50) \equiv -9.01 \pm 3.86 \equiv (-12.87, -5.15)$
Confidence Interval for Mean When x=x*
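• Mean Response at a specific level x* is
$E(y \mid x^*) = \mu_y = \beta_0 + \beta_1 x^*$
• Estimated Mean response and standard error (replacing unknown $\beta_0$ and $\beta_1$ with estimates):
$\hat{\mu}_y = \hat{\beta}_0 + \hat{\beta}_1 x^*$
$SE_{\hat{\mu}} = s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$
• Confidence Interval for Mean Response:
$\hat{\mu}_y \pm t_{\alpha/2,\,n-2}\, SE_{\hat{\mu}} \equiv \hat{\mu}_y \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$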
Prediction Interval of Future Response @ x=x*
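• Response at a specific level x* is
$y_{x^*} = \mu_y + \varepsilon = \beta_0 + \beta_1 x^* + \varepsilon$
• Estimated response and standard error (replacing unknown $\beta_0$ and $\beta_1$ with estimates):
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x^*$
$SE_{\hat{y}} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$
• Prediction Interval for Future Response:
$\hat{y} \pm t_{\alpha/2,\,n-2}\, SE_{\hat{y}} \equiv \hat{y} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$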
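The confidence interval for the mean response and the prediction interval for a future response differ only by the extra "1 +" under the square root. The sketch below computes both at a hypothetical level x* = 5.0, which is an illustrative choice and not a value from the slides. The fitted line, s_e, Sxx, and the critical value 2.571 are the rounded values from the LSD example.

```python
import math

# Rounded summary values from the LSD example.
n = 7
x_bar = 4.333
s_xx = 22.475
b0, b1 = 89.10, -9.01
s_e = 7.12
t_crit = 2.571    # t_{.025, 5}, quoted from the slides (assumed, not computed here)


def mean_response_ci(x_star):
    """95% confidence interval for the mean response at x_star."""
    fitted = b0 + b1 * x_star
    se = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    return fitted - t_crit * se, fitted + t_crit * se


def prediction_interval(x_star):
    """95% prediction interval for a single future response at x_star."""
    fitted = b0 + b1 * x_star
    se = s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / s_xx)
    return fitted - t_crit * se, fitted + t_crit * se


x_star = 5.0  # hypothetical concentration chosen for illustration
ci = mean_response_ci(x_star)
pi = prediction_interval(x_star)
print(f"CI for mean response at x*={x_star}: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Prediction interval at x*={x_star}:  ({pi[0]:.2f}, {pi[1]:.2f})")
```

Both intervals are centered at the same fitted value; the prediction interval is wider because it also accounts for the random error of an individual observation.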
Correlation Coefficient
• Measures the strength of the linear association
between two variables
• Takes on the same sign as the slope estimate from
the linear regression
• Not affected by linear transformations of y or x (apart from a sign change when one variable is multiplied by a negative constant)
• Does not distinguish between dependent and
independent variable (e.g. height and weight)
• Population Parameter: $\rho_{yx}$
• Pearson’s Correlation Coefficient:
$r_{yx} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \qquad -1 \le r \le 1$
Correlation Coefficient
• Values close to 1 in absolute value $\Rightarrow$ strong linear association, positive or negative from sign
• Values close to 0 imply little or no linear association
• If data contain outliers (are non-normal), Spearman’s coefficient of correlation can be computed based on the ranks of the x and y values
• Test of $H_0: \rho_{yx} = 0$ is equivalent to test of $H_0: \beta_1 = 0$
• Coefficient of Determination ($r_{yx}^2$) - Proportion of variation in y “explained” by the regression on x:
$r_{yx}^2 = (r_{yx})^2 = \frac{S_{yy} - SSE}{S_{yy}} = \frac{SS(Total) - SS(Residual)}{SS(Total)} \qquad 0 \le r^2 \le 1$
Example - Pharmacodynamics of LSD
$S_{xx} = 22.475 \qquad S_{xy} = -202.487 \qquad S_{yy} = 2078.183 \qquad SSE = 253.89$
$r_{yx} = \frac{-202.487}{\sqrt{(22.475)(2078.183)}} = -0.94$
$r_{yx}^2 = \frac{2078.183 - 253.89}{2078.183} = 0.88 = (-0.94)^2$
[Figure: two scatterplots of score against lsd_conc. Panel "Syy": deviations of the observations from the mean (Mean = 50.09). Panel "SSE": deviations of the observations from the fitted linear regression line, score = 89.12 + -9.01 * lsd_conc, R-Square = 0.88]
Example - SPSS Output
Pearson’s and Spearman’s Measures
[SPSS correlations output: Pearson and Spearman correlations between SCORE and LSD_CONC, N = 7]
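Pearson's and Spearman's measures for the LSD data can be recomputed without a statistics package. The sketch below uses only the standard library; the helper names `pearson`, `ranks`, and `spearman` are illustrative, and the ranking helper assumes there are no tied values, which holds for this data set.

```python
import math

# LSD tissue concentration (x) and mean math score (y), copied from the data table.
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]


def pearson(x, y):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


r = pearson(x, y)
rho_s = spearman(x, y)
print(f"Pearson r = {r:.3f}, Spearman = {rho_s:.3f}")
```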
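Hypothesis Test for $\rho_{yx}$
• 2-Sided Test
– $H_0: \rho_{yx} = 0$
– $H_A: \rho_{yx} \ne 0$
– T.S.: $t_{obs} = r_{yx} \sqrt{\frac{n-2}{1 - r_{yx}^2}}$
– R.R.: $|t_{obs}| \ge t_{\alpha/2,\,n-2}$
– P-value: $2P(t \ge |t_{obs}|)$
• 1-sided Test
– $H_0: \rho_{yx} = 0$
– $H_A^+: \rho_{yx} > 0$ or $H_A^-: \rho_{yx} < 0$
– T.S.: $t_{obs} = r_{yx} \sqrt{\frac{n-2}{1 - r_{yx}^2}}$
– R.R. for $H_A^+$: $t_{obs} \ge t_{\alpha,\,n-2}$; R.R. for $H_A^-$: $t_{obs} \le -t_{\alpha,\,n-2}$
– P-value for $H_A^+$: $P(t \ge t_{obs})$; P-value for $H_A^-$: $P(t \le t_{obs})$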
Analysis of Variance in Regression
• Goal: Partition the total variation in y into
variation “explained” by x and random variation
$(y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$
$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$
• These three sums of squares and degrees of freedom are:
• Total (TSS): DFT = n-1
• Error (SSE): DFE = n-2
• Model (SSR): DFR = 1
Analysis of Variance for Regression
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
Model                 SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 TSS              n-1

• Analysis of Variance - F-test
• $H_0: \beta_1 = 0 \qquad H_A: \beta_1 \ne 0$
T.S.: $F_{obs} = \frac{MSR}{MSE}$
R.R.: $F_{obs} \ge F_{\alpha,1,n-2}$
P-value: $P(F \ge F_{obs})$
Example - Pharmacodynamics of LSD
• Total Sum of squares:
$TSS = \sum (y_i - \bar{y})^2 = 2078.183 \qquad DFT = 7 - 1 = 6$
• Error Sum of squares:
$SSE = \sum (y_i - \hat{y}_i)^2 = 253.890 \qquad DFE = 7 - 2 = 5$
• Model Sum of Squares:
$SSR = \sum (\hat{y}_i - \bar{y})^2 = 2078.183 - 253.890 = 1824.293 \qquad DFR = 1$
Example - Pharmacodynamics of LSD
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Model                 1824.293         1                    1824.293      35.93
Error                 253.890          5                    50.778
Total                 2078.183         6

• Analysis of Variance - F-test
• $H_0: \beta_1 = 0 \qquad H_A: \beta_1 \ne 0$
T.S.: $F_{obs} = \frac{MSR}{MSE} = 35.93$
R.R.: $F_{obs} \ge F_{.05,1,5} = 6.61$
P-value: $P(F \ge 35.93)$
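The ANOVA table for the LSD example follows directly from TSS, SSE, and n. The sketch below rebuilds it from the values on the slides; the critical value 6.61 is the F_{.05,1,5} quoted there, not computed here, and the variable names are illustrative.

```python
# Sums of squares from the LSD example.
n = 7
tss = 2078.183    # total sum of squares
sse = 253.890     # error sum of squares
f_crit = 6.61     # F_{.05, 1, 5}, quoted from the slides (assumed, not computed here)

ssr = tss - sse               # model sum of squares
df_model = 1
df_error = n - 2
msr = ssr / df_model          # mean square for the model
mse = sse / df_error          # mean square error
f_obs = msr / mse             # F statistic for H0: beta1 = 0
reject_h0 = f_obs >= f_crit

print(f"SSR = {ssr:.3f}, MSR = {msr:.3f}, MSE = {mse:.3f}")
print(f"F_obs = {f_obs:.2f}, reject H0: {reject_h0}")
```

In simple linear regression the F statistic is the square of the slope's t statistic, so the value here should be close to (-5.994)^2 from the SPSS coefficients table.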
Example - SPSS Output
[SPSS ANOVA output for the regression of SCORE on LSD_CONC]
Linearity of Regression (SLR)
F-Test for Lack-of-Fit ($n_j$ observations at c distinct levels of "X")
$H_0: E\{Y_i\} = \beta_0 + \beta_1 X_i$
$H_A: E\{Y_i\} = \mu_i \ne \beta_0 + \beta_1 X_i$
Compute fitted value $\hat{Y}_j$ and sample mean $\bar{Y}_j$ for each distinct X level
Lack-of-Fit: $SS(LF) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_j)^2 \qquad df_{LF} = c - 2$
Pure Error: $SS(PE) = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 \qquad df_{PE} = n - c$
Test Statistic: $F_{LOF} = \frac{SS(LF)/(c-2)}{SS(PE)/(n-c)} = \frac{MS(LF)}{MS(PE)} \overset{H_0}{\sim} F_{c-2,\,n-c}$
Reject $H_0$ if $F_{LOF} \ge F(1-\alpha;\, c-2,\, n-c)$
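The lack-of-fit test needs repeated observations at some X levels, which the LSD data do not have, so the sketch below uses a small made-up data set (two observations at each of four levels). The data and the function name `lack_of_fit` are hypothetical and purely illustrative. The function returns the F statistic and its degrees of freedom; the critical value would still need to come from an F table.

```python
from collections import defaultdict


def lack_of_fit(x, y):
    """Return SS(LF), SS(PE), SSE, degrees of freedom, and F_LOF for a straight-line fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    # Group the responses by distinct X level.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    c = len(groups)

    ss_lf = 0.0   # lack-of-fit: level means vs fitted values
    ss_pe = 0.0   # pure error: observations vs their level mean
    for level, ys in groups.items():
        level_mean = sum(ys) / len(ys)
        fitted = b0 + b1 * level
        ss_lf += len(ys) * (level_mean - fitted) ** 2
        ss_pe += sum((yi - level_mean) ** 2 for yi in ys)

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df_lf, df_pe = c - 2, n - c
    f_lof = (ss_lf / df_lf) / (ss_pe / df_pe)
    return {"ss_lf": ss_lf, "ss_pe": ss_pe, "sse": sse,
            "df_lf": df_lf, "df_pe": df_pe, "f_lof": f_lof}


# Hypothetical data: two observations at each of four X levels.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [2.0, 2.4, 3.9, 4.3, 6.8, 7.2, 7.9, 8.3]
result = lack_of_fit(x, y)
print(result)
```

The two components add up to the regression's error sum of squares, SSE = SS(LF) + SS(PE), which gives a convenient sanity check on the computation.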