CHAPTER 7
Linear Correlation & Regression Methods
• 7.1 - Motivation
• 7.2 - Correlation / Simple Linear Regression
• 7.3 - Extensions of Simple Linear Regression
Testing for association between two POPULATION variables X and Y…

• Categorical variables  →  Chi-squared Test  (contingency table: Categories of X vs. Categories of Y)

  Examples:
  X = Disease status (D+, D–),  Y = Exposure status (E+, E–)
  X = # children in household (0, 1-2, 3-4, 5+),  Y = Income level (Low, Middle, High)

• Numerical variables  →  ???????

POPULATION PARAMETERS

  Means:       μ_X = E[X],   μ_Y = E[Y]
  Variances:   σ_X² = E[(X − μ_X)²],   σ_Y² = E[(Y − μ_Y)²]
  Covariance:  σ_XY = E[(X − μ_X)(Y − μ_Y)]
Parameter Estimation via SAMPLE DATA …

• Numerical variables:  {x1, x2, x3, x4, …, xn},  {y1, y2, y3, y4, …, yn}

SAMPLE STATISTICS  (estimating the corresponding POPULATION PARAMETERS)

  Means:       x̄ = Σx / n   estimates  μ_X = E[X]
               ȳ = Σy / n   estimates  μ_Y = E[Y]

  Variances:   s_x² = Σ(x − x̄)² / (n − 1)   estimates  σ_X² = E[(X − μ_X)²]
               s_y² = Σ(y − ȳ)² / (n − 1)   estimates  σ_Y² = E[(Y − μ_Y)²]

  Covariance:  s_xy = Σ(x − x̄)(y − ȳ) / (n − 1)   estimates  σ_XY = E[(X − μ_X)(Y − μ_Y)]
               (can be +, –, or 0)
The sample values can be displayed in a scatterplot of Y vs. X (n data points).
[Figure: scatterplot — JAMA. 2003;290:1486-1493]

Does this suggest a linear trend between X and Y?  If so, how do we measure it?
Testing for association between two population variables X and Y…  (numerical variables)

The population parameters above combine into a single measure of linear association:

  Linear Correlation Coefficient:   ρ = σ_XY / √(σ_X² σ_Y²)      (always between –1 and +1)
Parameter Estimation via SAMPLE DATA …

The corresponding sample statistic estimates ρ:

  Linear Correlation Coefficient:   r = s_xy / √(s_x² s_y²)      (always between –1 and +1)
Parameter Estimation via SAMPLE DATA …

Example in R (reformatted for brevity):

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
   1.1  1.8  2.1  3.7  4.0  7.3  9.1 11.9 12.4 17.1
> y = sample(pop, 10)
  13.1 18.3 17.6 19.1 19.3  3.2  5.6 13.6  8.0  3.0

> c(mean(x), mean(y))        # x̄ and ȳ
  7.05 12.08
> var(x)                     # s_x²
  29.48944
> var(y)                     # s_y²
  43.76178
> cov(x, y)                  # s_xy
  -25.86667
> cor(x, y)                  # r = s_xy / √(s_x² s_y²)
  -0.7200451

> plot(x, y, pch = 19)       # scatterplot, n = 10 data points
Parameter Estimation via SAMPLE DATA …

  Linear Correlation Coefficient:   r = s_xy / √(s_x² s_y²)      (always between –1 and +1)

r measures the strength of linear association:

  r close to –1  →  strong negative linear correlation
  r close to  0  →  little or no linear correlation
  r close to +1  →  strong positive linear correlation

For this sample:

> cor(x, y)
  -0.7200451
Testing for linear association between two numerical population variables X and Y…

  Linear Correlation Coefficient:   ρ = σ_XY / √(σ_X² σ_Y²),   estimated by   ρ̂ = r = s_xy / √(s_x² s_y²)

Now that we have r, we can conduct HYPOTHESIS TESTING on ρ:

  H0: ρ = 0    "No linear association between X and Y."
  HA: ρ ≠ 0    "Linear association between X and Y."

Test Statistic for p-value:

  T = r √(n − 2) / √(1 − r²)  ~  t_{n−2}

  Here, with r = −0.72 and n = 10:   t = −0.72 √(10 − 2) / √(1 − (0.72)²) = −2.935 on t_8

  p-value = 2 * pt(-2.935, 8) = .0189 < .05
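As a quick check (a minimal sketch, assuming the x and y vectors from the R example above), base R's cor.test() reproduces this test directly; output abbreviated:

> cor.test(x, y)       # Pearson correlation test of H0: rho = 0
  t = -2.935, df = 8, p-value = 0.01886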
Parameter Estimation via SAMPLE DATA …  →  SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

  Linear Correlation Coefficient:   r = s_xy / √(s_x² s_y²)
  r measures the strength of linear association;  here  cor(x, y) = -0.7200451.

If such an association between X and Y exists, then it follows that for some intercept β0
and slope β1 we have

  Y = β0 + β1 X + ε        "Response = Model + Error"

For each data point (x_i, y_i) the line gives a fitted point (x_i, ŷ_i); the residuals are
e_i = y_i − ŷ_i.

Find estimates β̂0 and β̂1 for the "best" line   Ŷ = β̂0 + β̂1 X … in what sense???
The "Least Squares Regression Line" is the one that minimizes   SS_Err = Σ e_i².

  β̂1 = s_xy / s_x² = −25.86667 / 29.48944 = −0.87715

  β̂0 = ȳ − β̂1 x̄ = 12.08 − (−0.87715)(7.05) = 18.26391        Check:  (x̄, ȳ) is on the line.
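A minimal R sketch (assuming the x and y vectors from the earlier example) that reproduces these least squares quantities directly from the formulas above:

> b1 = cov(x, y) / var(x)          # slope:      -0.87715
> b0 = mean(y) - b1 * mean(x)      # intercept:  18.26391
> yhat = b0 + b1 * x               # fitted responses
> e = y - yhat                     # residuals
> sum(e^2)                         # SS_Err (see below)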
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

  predictor X:            1.1   1.8   2.1   3.7   4.0   7.3   9.1  11.9  12.4  17.1
  observed response Y:   13.1  18.3  17.6  19.1  19.3   3.2   5.6  13.6   8.0   3.0
  fitted response Ŷ:     ~ EXERCISE ~
  residuals Y − Ŷ:       ~ EXERCISE ~

Least Squares Regression Line:   Ŷ = 18.26391 − 0.87715 X

This line minimizes   SS_Err = Σ e_i² = 189.6555.
Testing for linear association between two numerical population variables X and Y…

  Linear Regression Coefficients:   Y = β0 + β1 X + ε        "Response = Model + Error"

Now that we have these, we can conduct HYPOTHESIS TESTING on β0 and β1:

  H0: β1 = 0    "No linear association between X and Y."
  HA: β1 ≠ 0    "Linear association between X and Y."

  Estimates:   Ŷ = β̂0 + β̂1 X,   with   β̂1 = s_xy / s_x²   and   β̂0 = ȳ − β̂1 x̄
Test Statistic for p-value:

  T = (β̂1 − β1) / √( MS_Err / ((n − 1) s_x²) )  ~  t_{n−2},
  where   SS_Err = Σ(y − ŷ)²   and   MS_Err = SS_Err / (n − 2).

  Here (df_Err = 8):

  t = (−0.87715 − 0) / √( (189.6555 / 8) / ((9)(29.48944)) ) = −2.935 on t_8

  Same t-score as for H0: ρ = 0!    p-value = .0189
> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)      # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185,    Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???   Because this second method generalizes…
ANOVA Table        H0: β1 = 0   vs.   HA: β1 ≠ 0        Y = β0 + β1 X + ε

  Source        df    SS    MS    F-ratio    p-value
  Regression     1
  Error
  Total                –

(The "Treatment" source of one-way ANOVA is relabeled "Regression" here.)
Parameter Estimation via SAMPLE DATA …

Recall the sample statistics   x̄, ȳ,  s_x² = Σ(x − x̄)²/(n − 1),  s_y² = Σ(y − ȳ)²/(n − 1)
for the n data points of the scatterplot (JAMA. 2003;290:1486-1493).

  SS_Tot = Σ(y − ȳ)² = (n − 1) s_y²,   with   df_Tot = n − 1.
  SS_Tot is a measure of the total amount of variability in the observed responses
  (i.e., before any model-fitting).

  SS_Reg = Σ(ŷ − ȳ)²
  SS_Reg is a measure of the total amount of variability in the fitted responses
  (i.e., after model-fitting).

  SS_Err = Σ(y − ŷ)²
  SS_Err is a measure of the total amount of variability in the resulting residuals
  (i.e., after model-fitting).
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

For our example  (X, Y as tabulated above;  Ŷ = 18.26391 − 0.87715 X;  cor(x, y) = -0.7200451):

  SS_Reg = Σ(ŷ − ȳ)² = 204.200
  SS_Err = Σ(y − ŷ)² = 189.656      (the minimum possible, by least squares)
  SS_Tot = Σ(y − ȳ)² = (n − 1) s_y² = 9 (43.76178) = 393.856

  SS_Tot = SS_Reg + SS_Err
ANOVA Table        H0: β1 = 0   vs.   HA: β1 ≠ 0        Y = β0 + β1 X + ε

  Source        df    SS         MS         F-ratio     p-value
  Regression     1    204.200    204.200    8.61349     0.018857
  Error          8    189.656     23.707
  Total          9    393.856       –

  F = MS_Reg / MS_Err  ~  F_{k−1, n−k} = F_{1, 8},   0 < p < 1.    Same as before!
> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707
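Equivalently (a small usage note), the same regression ANOVA table can be printed straight from the lm fit:

> anova(lsreg)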
Coefficient of Determination

Moreover,

  SS_Reg / SS_Tot = 204.2 / 393.856 = 0.5185 = r²      (since r = −0.72,  r² = (−0.72)² = 0.5185)

The least squares regression line accounts for 51.85% of the total variability in the
observed response, with 48.15% remaining.
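A minimal check in R (assuming the earlier fit lsreg):

> cor(x, y)^2                  # 0.5184649
> summary(lsreg)$r.squared     # 0.5184649 -- the "Multiple R-squared" reported by summary(lsreg)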
Summary of Linear Correlation and Simple Linear Regression

Given:   X:  x1  x2  x3  x4  …  xn
         Y:  y1  y2  y3  y4  …  yn

  Means x̄, ȳ;   Variances s_x², s_y²;   Covariance s_xy.

  Linear Correlation Coefficient:   r = s_xy / √(s_x² s_y²),   –1 ≤ r ≤ +1;
  measures the strength of linear association.

  Least Squares Regression Line:   Ŷ = β̂0 + β̂1 X,   with   β̂1 = s_xy / s_x²,   β̂0 = ȳ − β̂1 x̄;
  minimizes   SS_Err = Σ(y − ŷ)² = SS_Tot − SS_Reg.

  Coefficient of Determination:   r² = SS_Reg / SS_Tot  =  proportion of total variability
  modeled by the regression line's variability.

  All point estimates can be upgraded to 95% confidence intervals for hypothesis testing,
  etc. (ANOVA). Plotting the 95% confidence interval for the mean response at each x gives
  upper and lower 95% confidence bands around the fitted line ŷ (see notes for "95%
  prediction intervals").
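A minimal sketch of how these bands can be computed in R for the fitted model above (the grid newx is illustrative):

> newx = data.frame(x = seq(0, 20, 0.5))
> conf = predict(lsreg, newx, interval = "confidence")   # 95% CI for the mean response at each x
> pred = predict(lsreg, newx, interval = "prediction")   # 95% prediction interval for a new response
> matlines(newx$x, conf[, c("lwr", "upr")], lty = 2)     # overlay the confidence bands on the scatterplot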
Testing for linear association between a population response variable Y and multiple
predictor variables X1, X2, X3, … etc.

Multilinear Regression        "Response = Model + Error"

  Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + β_{k−1} X_{k−1} + ε        ("main effects")

  H0: β1 = β2 = β3 = … = β_{k−1} = 0    "No linear association between Y and any of its
                                         predictors X1, X2, X3, …, X_{k−1}."
  HA: βi ≠ 0 for some i = 1, 2, …, k−1   "Linear association between Y and at least one
                                         of its predictors."

  Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + … + β̂_{k−1} X_{k−1}

For now, assume the "additive model," i.e., main effects only.

[Figure: with two predictors, Ŷ = β̂0 + β̂1 X1 + β̂2 X2 is a plane over the (X1, X2) plane;
at each predictor point (x_1i, x_2i) the true response y_i lies above or below the fitted
response ŷ_i, and the residual is  e_i = y_i − ŷ_i.]

Least Squares calculation of the regression coefficients is computer-intensive; the formulas
require Linear Algebra (matrices)!  Once calculated, how do we then test the null
hypothesis?   →  ANOVA
R code example (main effects only):   lsreg = lm(y ~ x1 + x2 + x3)
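A schematic sketch of fitting and testing such an additive model in R (the data frame dat and its columns y, x1, x2, x3 are hypothetical names, not data from this chapter):

> fit = lm(y ~ x1 + x2 + x3, data = dat)   # additive model: main effects only
> summary(fit)                             # t-test for each coefficient + overall F-test
> anova(fit)                               # sequential ANOVA table for the fitted terms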
The model can also include quadratic terms, cubes, etc.  ("polynomial regression"):

  Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + β_{k−1} X_{k−1}
        + β_{1,1} X1² + β_{2,2} X2² + … + β_{k−1,k−1} X²_{k−1} + …

R code example:   lsreg = lm(y ~ x + I(x^2) + I(x^3))
(inside an R formula, powers must be wrapped in I() so that ^ is treated arithmetically)
…and two-way interaction terms ("interactions"):

  + β_{1,2} X1 X2 + β_{1,3} X1 X3 + … + β_{1,k−1} X1 X_{k−1}
  + β_{2,3} X2 X3 + β_{2,4} X2 X4 + … + β_{2,k−1} X2 X_{k−1} + …

R code example:   lsreg = lm(y ~ x1*x2)        # equivalent to lm(y ~ x1 + x2 + x1:x2)
Recall…  Example in R (reformatted for brevity):
Multiple Linear Regression with interaction, using an indicator ("dummy") variable.
Suppose these are actually two subgroups, requiring two distinct linear regressions!

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
              Estimate
(Intercept)   6.56463
x             0.00998
I             6.80422
x:I           1.60858

Fitted model:   Ŷ = β̂0 + β̂1 X + β̂2 I + β̂3 X I  =  6.56 + 0.01 X + 6.80 I + 1.61 X I

  I = 1:   Ŷ = (6.56 + 6.80) + (0.01 + 1.61) X  =  13.36 + 1.62 X
  I = 0:   Ŷ = 6.56 + 0.01 X
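A small usage note (assuming the fit above): the two subgroup lines can be read off the stored coefficients directly.

> b = coef(lsreg)
> c(b["(Intercept)"] + b["I"], b["x"] + b["x:I"])   # intercept and slope when I = 1
> c(b["(Intercept)"], b["x"])                       # intercept and slope when I = 0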
ANOVA Table (revisited)

  Y = β0 + β1 X1 + β2 X2 + … + β_{k−1} X_{k−1} + ε

  H0: β1 = β2 = β3 = … = β_{k−1} = 0    "No linear association between Y and any of its
                                         predictors X1, X2, X3, …, X_{k−1}."
     Note that if true, then it would follow that  Y = β0 + ε,  with  β0 = μ_Y.

  HA: βi ≠ 0 for some i = 1, 2, …, k−1   "Linear association between Y and at least one
                                         of its predictors."

From a sample of n data points…   Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + … + β̂_{k−1} X_{k−1}
     Note that if H0 were true, then it would follow that  β̂0 = ȳ.

But how are these regression coefficients calculated in general?
"Normal equations" — solved via computer (intensive).
ANOVA Table (revisited)

  H0: β1 = β2 = β3 = … = β_{k−1} = 0    "No linear association between Y and any of its
                                         predictors X1, X2, X3, …, X_{k−1}."

  Ŷ = β̂0 + β̂1 X1 + β̂2 X2 + … + β̂_{k−1} X_{k−1}      (based on n data points)

  Source        df       SS               MS = SS / df    F = MS_Reg / MS_Err    p-value
  Regression    k − 1    Σ (ŷi − ȳ)²      MS_Reg          F ~ F_{k−1, n−k}       0 ≤ p ≤ 1
  Error         n − k    Σ (yi − ŷi)²     MS_Err
  Total         n − 1    Σ (yi − ȳ)²         –

  (each sum runs over i = 1, …, n)
*** How are only the statistically significant variables determined? ***

"MODEL SELECTION"  (BE: backward elimination)

Step 0.  Conduct an overall F-test of significance (via ANOVA) of the full model.
         If significant, then…

Step 1.  Conduct individual t-tests on the coefficients of the full model
         Ŷ = β̂1 X1 + β̂2 X2 + β̂3 X3 + β̂4 X4 + ……

           H0: β1 = 0    p1 < .05    Reject H0
           H0: β2 = 0    p2 < .05    Reject H0
           H0: β3 = 0    p3 ≥ .05    Accept H0
           H0: β4 = 0    p4 < .05    Reject H0
           ……

Step 2.  Are all coefficients significant at level α?  If not… delete that term,
         and recompute new coefficients!

         Ŷ = β̂1 X1 + β̂2 X2 + β̂4 X4 + ……

Step 3.  Repeat Steps 1-2 as necessary until all coefficients are significant → reduced model
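A schematic R sketch of this backward-elimination loop (the full model and its column names x1, …, x4 are hypothetical, not data from this chapter):

> full = lm(y ~ x1 + x2 + x3 + x4, data = dat)
> summary(full)$coefficients              # Step 1: inspect the individual t-tests / p-values
> reduced = update(full, . ~ . - x3)      # Step 2: drop the non-significant term, refit
> summary(reduced)$coefficients           # Step 3: repeat until all remaining terms are significant

R's built-in step(full, direction = "backward") automates a related backward elimination, but it ranks terms by AIC rather than by the individual p-values described above.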
Recall ~  k ≥ 2 independent, equivariant, normally-distributed "treatment groups"
Y1, Y2, …, Yk, with   H0: μ1 = μ2 = … = μk.

(For nonlinear trends: re-plot the data on a "log-log" scale, which linearizes a power-law
trend, or on a "log" scale of Y only, which linearizes an exponential trend.)
Binary outcome, e.g., "Have you ever had surgery?" (Yes / No)

"log-odds" ("logit") = example of a general "link function" g(π):

  ln( π̂ / (1 − π̂) ) = β̂0 + β̂1 X      ⟺      π̂ = 1 / (1 + e^{ −(β̂0 + β̂1 X) })

Coefficients are obtained via "MAXIMUM LIKELIHOOD ESTIMATION".
(Note: Not based on least squares — implies "pseudo-R²," etc.)
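A minimal R sketch of fitting such a logistic regression (the variables surgery and age are hypothetical names):

> fit = glm(surgery ~ age, data = dat, family = binomial)   # logit link is the default
> summary(fit)                                              # coefficients on the log-odds scale
> predict(fit, type = "response")                           # fitted probabilities π̂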
Binary outcome, e.g., "Have you ever had surgery?" (Yes / No)

"log-odds" ("logit") with several predictors:

  ln( π̂ / (1 − π̂) ) = β̂0 + β̂1 X1 + β̂2 X2 + … + β̂k Xk
      ⟺      π̂ = 1 / (1 + e^{ −(β̂0 + β̂1 X1 + β̂2 X2 + … + β̂k Xk) })

Suppose one of the predictor variables is binary, e.g.

  X1 = 1 if Age ≥ 50,   X1 = 0 if Age < 50.

  X1 = 1:   ln( π̂1 / (1 − π̂1) ) = β̂0 + β̂1 + β̂2 X2 + … + β̂k Xk
  X1 = 0:   ln( π̂0 / (1 − π̂0) ) = β̂0       + β̂2 X2 + … + β̂k Xk

SUBTRACT!
Subtracting, the terms in X2, …, Xk cancel:

  ln( π̂1 / (1 − π̂1) ) − ln( π̂0 / (1 − π̂0) ) = β̂1

  ⟹   ln[ ( π̂1 / (1 − π̂1) ) / ( π̂0 / (1 − π̂0) ) ] = β̂1

  ⟹   ln( odds of surgery given Age ≥ 50  /  odds of surgery given Age < 50 ) = β̂1

  ⟹   ln(OR) = β̂1 ………….. implies …………..   OR = e^{β̂1}
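In R (continuing the hedged sketch above, with a binary indicator age50 = 1 if Age ≥ 50 and 0 otherwise), the estimated odds ratio is just the exponentiated coefficient:

> fit = glm(surgery ~ age50 + x2, data = dat, family = binomial)
> exp(coef(fit)["age50"])      # OR = e^(β̂1), adjusted for the other predictors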
The logistic curve in population dynamics

Unrestricted population growth (e.g., bacteria):  population size y obeys the following law,
with constant a > 0.

  dy/dt = a y
  ∫ dy/y = ∫ a dt   ⟹   ln|y| = at + b   ⟹   y = e^{at+b} = e^{at} e^{b} = C e^{at}

  With initial condition y(0) = y0:    y = y0 e^{at}        Exponential growth

Restricted population growth (disease, predation, starvation, etc.):  population size y obeys
the following law, with constant a > 0 and "carrying capacity" M.  Let survival probability
π = y / M.

  dy/dt = (a/M) y (M − y)      ⟹      dπ/dt = a π (1 − π)

  ∫ dπ / [π(1 − π)] = ∫ a dt   ⟹   ∫ [ 1/π + 1/(1 − π) ] dπ = ∫ a dt

  ⟹   ln|π| − ln|1 − π| = at + b   ⟹   ln( π / (1 − π) ) = at + b

  With initial condition π(0) = π0:    π = π0 / ( π0 + (1 − π0) e^{−at} )        Logistic growth
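A small R sketch of this closed-form logistic growth solution (the constants a and pi0 below are purely illustrative):

> a = 0.8; pi0 = 0.1                                   # illustrative growth rate and initial proportion
> curve(pi0 / (pi0 + (1 - pi0) * exp(-a * x)),         # π(t) = π0 / (π0 + (1 − π0) e^(−a t))
+       from = 0, to = 15, xlab = "t", ylab = "pi(t)") # S-shaped logistic growth curve
> abline(h = 1, lty = 2)                               # carrying capacity (π = y/M = 1)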