Linear Regression and Correlation

• Explanatory and Response Variables are Numeric
• Relationship between the mean of the response
variable and the level of the explanatory variable
assumed to be approximately linear (straight line)
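• Model:
$$Y = \beta_0 + \beta_1 x + \varepsilon \qquad \varepsilon \sim N(0, \sigma)$$
• $\beta_1 > 0 \Rightarrow$ Positive Association
• $\beta_1 < 0 \Rightarrow$ Negative Association
• $\beta_1 = 0 \Rightarrow$ No Association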
Least Squares Estimation of $\beta_0$, $\beta_1$
• $\beta_0 \equiv$ Mean response when x=0 (y-intercept)
• $\beta_1 \equiv$ Change in mean response when x increases
by 1 unit (slope)
• $\beta_0$, $\beta_1$ are unknown parameters (like $\mu$)
• $\beta_0 + \beta_1 x \equiv$ Mean response when explanatory
variable takes on the value x
• Goal: Choose values (estimates) that minimize the
sum of squared errors (SSE) of observed values to
the straight line:
$$\hat{y} = b_0 + b_1 x$$
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2$$
Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Predictor (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw Data and scatterplot of Score vs LSD concentration:

LSD Conc (x)   Score (y)
1.17           78.93
2.97           58.20
3.26           67.47
4.69           37.47
5.83           45.65
6.00           32.92
6.41           29.97

[Figure: scatterplot of SCORE (y-axis, 20 to 80) vs LSD_CONC (x-axis, 1 to 7)]
Source: Wagner, et al (1968)
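Least Squares Computations
$$S_{xx} = \sum (x - \bar{x})^2 \qquad S_{xy} = \sum (x - \bar{x})(y - \bar{y}) \qquad S_{yy} = \sum (y - \bar{y})^2$$
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
$$s^2 = \frac{\sum (y - \hat{y})^2}{n-2} = \frac{SSE}{n-2}$$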
Example - Pharmacodynamics of LSD

Score (y)   LSD Conc (x)   x-xbar   y-ybar    (x-xbar)^2   (x-xbar)(y-ybar)   (y-ybar)^2
78.93       1.17           -3.163    28.843   10.004569     -91.230409         831.918649
58.20       2.97           -1.363     8.113    1.857769     -11.058019          65.820769
67.47       3.26           -1.073    17.383    1.151329     -18.651959         302.168689
37.47       4.69            0.357   -12.617    0.127449      -4.504269         159.188689
45.65       5.83            1.497    -4.437    2.241009      -6.642189          19.686969
32.92       6.00            1.667   -17.167    2.778889     -28.617389         294.705889
29.97       6.41            2.077   -20.117    4.313929     -41.783009         404.693689
350.61      30.33          -0.001     0.001   22.474943    -202.487243        2078.183343

(Column totals given in bottom row of table; the last three totals are Sxx, Sxy and Syy)
$$\bar{y} = \frac{350.61}{7} = 50.087 \qquad \bar{x} = \frac{30.33}{7} = 4.333$$
$$b_1 = \frac{-202.4872}{22.4749} = -9.01 \qquad b_0 = \bar{y} - b_1 \bar{x} = 50.09 - (-9.01)(4.33) = 89.10$$
$$\hat{y} = 89.10 - 9.01x \qquad s^2 = \frac{SSE}{n-2} = \frac{253.89}{5} = 50.78$$
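The hand computations on this slide can be checked with a few lines of plain Python. This is a minimal sketch using only the seven (x, y) pairs from the table; the variable names are illustrative. Because it carries full precision instead of rounding at each step, the intercept comes out near 89.12 rather than the hand-rounded 89.10.

```python
# Least squares fit of math score on LSD concentration (data from the table).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]          # LSD concentration
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]   # math score

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of squares and cross-products
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Slope and intercept estimates
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual sum of squares and error variance estimate (n - 2 degrees of freedom)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)

print(f"Sxx={s_xx:.4f}  Sxy={s_xy:.4f}  Syy={s_yy:.3f}")
print(f"b1={b1:.2f}  b0={b0:.2f}  SSE={sse:.2f}  s^2={s2:.2f}")
```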
SPSS Output and Plot of Equation
[SPSS regression coefficients table]
[Figure: "Math Score vs LSD Concentration (SPSS)". Scatterplot of score (y-axis) vs lsd_conc (x-axis) with the fitted linear regression line: score = 89.12 + -9.01 * lsd_conc, R-Square = 0.88]
Inference Concerning the Slope ($\beta_1$)
• Parameter: Slope in the population model ($\beta_1$)
• Estimator: Least squares estimate: $b_1$
• Estimated standard error: $SE_{b_1} = s / \sqrt{S_{xx}}$
• Methods of making inference regarding population:
– Hypothesis tests (2-sided or 1-sided)
– Confidence Intervals
Hypothesis Test for $\beta_1$
• 2-Sided Test
– H0: $\beta_1 = 0$
– HA: $\beta_1 \neq 0$
– T.S.: $t_{obs} = b_1 / SE_{b_1}$
– R.R.: $|t_{obs}| \geq t_{\alpha/2,\,n-2}$
– P-value: $2P(t \geq |t_{obs}|)$
• 1-Sided Test
– H0: $\beta_1 = 0$
– HA+: $\beta_1 > 0$ or HA-: $\beta_1 < 0$
– T.S.: $t_{obs} = b_1 / SE_{b_1}$
– R.R.: $t_{obs} \geq t_{\alpha,\,n-2}$ (HA+) or $t_{obs} \leq -t_{\alpha,\,n-2}$ (HA-)
– P-value: $P(t \geq t_{obs})$ (HA+) or $P(t \leq t_{obs})$ (HA-)
$(1-\alpha)100\%$ Confidence Interval for $\beta_1$
$$b_1 \pm t_{\alpha/2,\,n-2}\, SE_{b_1} \equiv b_1 \pm t_{\alpha/2,\,n-2} \frac{s}{\sqrt{S_{xx}}}$$
• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is same as 2-sided hypothesis test
Example - Pharmacodynamics of LSD
$$n = 7 \qquad b_1 = -9.01 \qquad s = \sqrt{50.78} = 7.13 \qquad S_{xx} = 22.475$$
$$SE_{b_1} = \frac{7.13}{\sqrt{22.475}} = 1.50$$
• Testing H0: $\beta_1 = 0$ vs HA: $\beta_1 \neq 0$
T.S.: $t_{obs} = \frac{-9.01}{1.50} = -6.01$
R.R.: $|t_{obs}| \geq t_{.025,5} = 2.571$
• 95% Confidence Interval for $\beta_1$:
$$-9.01 \pm 2.571(1.50) \equiv -9.01 \pm 3.86 \equiv (-12.87, -5.15)$$
Confidence Interval for Mean When x=x*
• Mean Response at a specific level x* is
$$E(y \mid x^*) = \mu_y = \beta_0 + \beta_1 x^*$$
• Estimated Mean response and standard error
(replacing unknown $\beta_0$ and $\beta_1$ with estimates):
$$\hat{\mu}_y = b_0 + b_1 x^* \qquad SE_{\hat{\mu}} = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
• Confidence Interval for Mean Response:
$$\hat{\mu}_y \pm t_{\alpha/2,\,n-2}\, SE_{\hat{\mu}}$$
Prediction Interval of Future Response @ x=x*
• Response at a specific level x* is
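$$y_{x^*} = \mu_y + \varepsilon = \beta_0 + \beta_1 x^* + \varepsilon$$
• Estimated response and standard error (replacing
unknown $\beta_0$ and $\beta_1$ with estimates):
$$\hat{y} = b_0 + b_1 x^* \qquad SE_{\hat{y}} = s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
• Prediction Interval for Future Response:
$$\hat{y} \pm t_{\alpha/2,\,n-2}\, SE_{\hat{y}}$$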
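To see how the two standard errors differ in practice, the sketch below evaluates both intervals for the LSD data at $x^* = 5.0$. That value is a hypothetical concentration chosen only for illustration and does not appear in the slides; the critical value 2.571 is $t_{.025,5}$ from the slope example. The prediction interval is always wider, because it adds the variance of a new observation to the uncertainty in the estimated mean.

```python
import math

x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]          # LSD concentration
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]   # math score
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

X_STAR = 5.0      # hypothetical concentration, chosen for illustration
T_CRIT = 2.571    # t_{.025, 5}, as quoted in the slope example

fitted = b0 + b1 * X_STAR   # same point estimate for both intervals

# Standard error of the estimated mean response at x*
se_mean = s * math.sqrt(1 / n + (X_STAR - x_bar) ** 2 / s_xx)
# Standard error for predicting a single future response at x*
se_pred = s * math.sqrt(1 + 1 / n + (X_STAR - x_bar) ** 2 / s_xx)

ci_mean = (fitted - T_CRIT * se_mean, fitted + T_CRIT * se_mean)
pi_future = (fitted - T_CRIT * se_pred, fitted + T_CRIT * se_pred)

print(f"fitted value at x*={X_STAR}: {fitted:.2f}")
print(f"95% CI for mean response: ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
print(f"95% prediction interval:  ({pi_future[0]:.2f}, {pi_future[1]:.2f})")
```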
Correlation Coefficient
• Measures the strength of the linear association
between two variables
• Takes on the same sign as the slope estimate from
the linear regression
• Not affected by linear transformations of y or x
• Does not distinguish between dependent and
independent variable (e.g. height and weight)
• Population Parameter - $\rho$
• Pearson’s Correlation Coefficient:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \qquad -1 \leq r \leq 1$$
Correlation Coefficient
• Values close to 1 in absolute value $\Rightarrow$ strong
linear association, positive or negative from sign
• Values close to 0 imply little or no association
• If data contain outliers (are non-normal),
Spearman’s coefficient of correlation can be
computed based on the ranks of the x and y values
• Test of H0: $\rho = 0$ is equivalent to test of H0: $\beta_1 = 0$
• Coefficient of Determination ($r^2$) - Proportion of
variation in y “explained” by the regression on x:
$$r^2 = (r)^2 = \frac{S_{yy} - SSE}{S_{yy}} \qquad 0 \leq r^2 \leq 1$$
Example - Pharmacodynamics of LSD
$$S_{xx} = 22.475 \qquad S_{xy} = -202.487 \qquad S_{yy} = 2078.183 \qquad SSE = 253.89$$
$$r = \frac{-202.487}{\sqrt{(22.475)(2078.183)}} = -0.94$$
$$r^2 = \frac{2078.183 - 253.89}{2078.183} = 0.88 = (-0.94)^2$$
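Both correlation measures can be computed directly from the seven data pairs. The sketch below is plain Python; the helper names `pearson` and `ranks` are illustrative, and the ranking helper assumes no tied values, which holds for this data set. Spearman's coefficient is obtained here as Pearson's r applied to the ranks.

```python
import math

x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]          # LSD concentration
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]   # math score

def pearson(u, v):
    """Pearson correlation r = Suv / sqrt(Suu * Svv)."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def ranks(values):
    """Ranks 1..n in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

r = pearson(x, y)
r_squared = r ** 2
r_spearman = pearson(ranks(x), ranks(y))   # Pearson's r on the ranks

print(f"Pearson r = {r:.3f}, r^2 = {r_squared:.2f}")
print(f"Spearman r_s = {r_spearman:.3f}")
```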
[Figure: two scatterplots of score (y-axis) vs lsd_conc (x-axis). Panel "Mean": deviations of the points from the horizontal line Mean = 50.09, illustrating Syy. Panel "Linear Regression": deviations from the fitted line score = 89.12 + -9.01 * lsd_conc (R-Square = 0.88), illustrating SSE.]
Example - SPSS Output
Pearson’s and Spearman’s Measures
[SPSS correlations output: Pearson and Spearman coefficients between lsd_conc and score, N = 7]
Analysis of Variance in Regression
• Goal: Partition the total variation in y into
variation “explained” by x and random variation
$$(y_i - \bar{y}) = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$
• These three sums of squares and degrees of freedom are:
• Total (SST): DFT = n-1
• Error (SSE): DFE = n-2
• Model (SSM): DFM = 1
Analysis of Variance in Regression
Source of     Sum of     Degrees of    Mean
Variation     Squares    Freedom       Square             F
Model         SSM        1             MSM = SSM/1        F = MSM/MSE
Error         SSE        n-2           MSE = SSE/(n-2)
Total         SST        n-1

• Analysis of Variance - F-test
• H0: $\beta_1 = 0$   HA: $\beta_1 \neq 0$
T.S.: $F_{obs} = \frac{MSM}{MSE}$
R.R.: $F_{obs} \geq F_{\alpha,1,n-2}$
P-value: $P(F \geq F_{obs})$
Example - Pharmacodynamics of LSD
• Total Sum of squares:
$$SST = \sum (y_i - \bar{y})^2 = 2078.183 \qquad DFT = 7 - 1 = 6$$
• Error Sum of squares:
$$SSE = \sum (y_i - \hat{y}_i)^2 = 253.890 \qquad DFE = 7 - 2 = 5$$
• Model Sum of Squares:
$$SSM = \sum (\hat{y}_i - \bar{y})^2 = 2078.183 - 253.890 = 1824.293 \qquad DFM = 1$$
Example - Pharmacodynamics of LSD
Source of     Sum of       Degrees of    Mean
Variation     Squares      Freedom       Square       F
Model         1824.293     1             1824.293     35.93
Error         253.890      5             50.778
Total         2078.183     6

• Analysis of Variance - F-test
• H0: $\beta_1 = 0$   HA: $\beta_1 \neq 0$
T.S.: $F_{obs} = \frac{MSM}{MSE} = 35.93$
R.R.: $F_{obs} \geq F_{.05,1,5} = 6.61$
P-val: $P(F \geq 35.93)$
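The ANOVA table can be rebuilt from the raw data as a check on the partition SST = SSE + SSM. This is a sketch in plain Python; the critical value 6.61 is the $F_{.05,1,5}$ quoted on the slide and is entered as a constant.

```python
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]          # LSD concentration
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]   # math score
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Partition of the total variation in y
sst = sum((yi - y_bar) ** 2 for yi in y)                  # total, df = n - 1
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))     # error, df = n - 2
ssm = sum((fi - y_bar) ** 2 for fi in y_hat)              # model, df = 1

msm = ssm / 1
mse = sse / (n - 2)
f_obs = msm / mse

F_CRIT = 6.61                 # F_{.05, 1, 5}, taken from the slide
reject_h0 = f_obs >= F_CRIT

print(f"SST={sst:.3f}  SSE={sse:.3f}  SSM={ssm:.3f}")
print(f"MSE={mse:.3f}  F_obs={f_obs:.2f}  reject H0: {reject_h0}")
```

With a single predictor, F_obs equals the square of the t statistic for the slope, so the F-test and the 2-sided t-test always agree.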
Example - SPSS Output
[SPSS ANOVA output for the regression of score on lsd_conc]
Multiple Regression
• Numeric Response variable (Y)
• p Numeric predictor variables
• Model:
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$$
• Partial Regression Coefficients: $\beta_i \equiv$ effect (on the
mean response) of increasing the ith predictor
variable by 1 unit, holding all other predictors
constant
Example - Effect of Birth weight on
Body Size in Early Adolescence
• Response: Height at Early adolescence (n =250 cases)
• Predictors (p=6 explanatory variables)
• Adolescent Age (x1, in years -- 11-14)
• Tanner stage (x2, units not given)
• Gender (x3=1 if male, 0 if female)
• Gestational age (x4, in weeks at birth)
• Birth length (x5, units not given)
• Birthweight Group (x6=1,...,6: <1500g (1), 1500-1999g (2), 2000-2499g (3), 2500-2999g (4), 3000-3499g (5), >3500g (6))
Source: Falkner, et al (2004)
Least Squares Estimation
• Population Model for mean response:
$$E(Y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
• Least Squares Fitted (predicted) equation, minimizing SSE:
$$\hat{Y} = b_0 + b_1 x_1 + \cdots + b_p x_p \qquad SSE = \sum \left( Y - \hat{Y} \right)^2$$
• All statistical software packages/spreadsheets can
compute least squares estimates and their standard errors
Analysis of Variance
• Direct extension to ANOVA based on simple
linear regression
• Only adjustments are to degrees of freedom:
– DFM = p
– DFE = n-p-1

Source of     Sum of     Degrees of    Mean
Variation     Squares    Freedom       Square               F
Model         SSM        p             MSM = SSM/p          F = MSM/MSE
Error         SSE        n-p-1         MSE = SSE/(n-p-1)
Total         SST        n-1

$$R^2 = \frac{SST - SSE}{SST} = \frac{SSM}{SST}$$
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are
associated with the response
• H0: $\beta_1 = \cdots = \beta_p = 0$ (None of the xs associated with y)
• HA: Not all $\beta_i = 0$
T.S.: $F_{obs} = \frac{MSM}{MSE} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$
R.R.: $F_{obs} \geq F_{\alpha,\,p,\,n-p-1}$
P-val: $P(F \geq F_{obs})$
Example - Effect of Birth weight on
Body Size in Early Adolescence
• Authors did not print ANOVA, but did provide following:
• n=250   p=6   $R^2 = 0.26$
• H0: $\beta_1 = \cdots = \beta_6 = 0$
• HA: Not all $\beta_i = 0$
T.S.: $F_{obs} = \frac{MSM}{MSE} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)} = \frac{0.26/6}{(1 - 0.26)/(250 - 6 - 1)} = \frac{.0433}{.0030} = 14.2$
R.R.: $F_{obs} \geq F_{\alpha,6,243} = 2.13$
P-val: $P(F \geq 14.2)$
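When only $R^2$, n and p are reported, the overall F statistic follows from the formula above. The sketch below uses a hypothetical helper name, `overall_f_from_r2`; the critical value 2.13 is the one quoted on the slide.

```python
def overall_f_from_r2(r2, n, p):
    """Overall F statistic: (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# Values reported for the birth weight study
f_obs = overall_f_from_r2(r2=0.26, n=250, p=6)

F_CRIT = 2.13                 # F_{alpha, 6, 243}, taken from the slide
reject_h0 = f_obs >= F_CRIT

print(f"F_obs = {f_obs:.1f}, reject H0: {reject_h0}")
```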
Testing Individual Partial Coefficients - t-tests
• Wish to determine whether the response is
associated with a single explanatory variable, after
controlling for the others
• H0: $\beta_i = 0$   HA: $\beta_i \neq 0$ (2-sided alternative)
T.S.: $t_{obs} = \frac{b_i}{SE_{b_i}}$
R.R.: $|t_{obs}| \geq t_{\alpha/2,\,n-p-1}$
P-val: $2P(t \geq |t_{obs}|)$
Example - Effect of Birth weight on
Body Size in Early Adolescence
Variable           b       SEb     t=b/SEb   P-val (z)
Adolescent Age     2.86    0.99     2.89     .0038
Tanner Stage       3.41    0.89     3.83     <.001
Male               0.08    1.26     0.06     .9522
Gestational Age   -0.11    0.21    -0.52     .6030
Birth Length       0.44    0.19     2.32     .0204
Birth Wt Grp      -0.78    0.64    -1.22     .2224

Controlling for all other predictors, adolescent age,
Tanner stage, and Birth length are associated with
adolescent height measurement
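The t ratios and P-values in the table can be reproduced from the reported estimates and standard errors. The sketch below assumes, as the "P-val (z)" column header suggests, that the P-values come from the standard normal distribution (reasonable with 243 error degrees of freedom), and that each ratio was rounded to two decimals before the P-value was looked up. The 0.05 significance level is also an assumption.

```python
import math

# (variable, estimate b, standard error SE_b) as reported in the table
coefs = [
    ("Adolescent Age", 2.86, 0.99),
    ("Tanner Stage", 3.41, 0.89),
    ("Male", 0.08, 1.26),
    ("Gestational Age", -0.11, 0.21),
    ("Birth Length", 0.44, 0.19),
    ("Birth Wt Grp", -0.78, 0.64),
]

def two_sided_p_normal(t):
    """Two-sided P-value under the standard normal: 2 * P(Z >= |t|)."""
    return math.erfc(abs(t) / math.sqrt(2))

results = {}
for name, b, se in coefs:
    t = round(b / se, 2)      # assumed: ratio rounded to 2 decimals first
    results[name] = (t, two_sided_p_normal(t))

ALPHA = 0.05                  # assumed significance level
significant = [name for name, (t, p) in results.items() if p < ALPHA]

for name, (t, p) in results.items():
    print(f"{name:<16} t={t:6.2f}  P={p:.4f}")
print("Significant at 0.05:", significant)
```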
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables
are associated with the response
• H0: $\beta_1 = \cdots = \beta_p = 0$ (None of Xs associated with Y)
• HA: Not all $\beta_i = 0$
T.S.: $F_{obs} = \frac{MSM}{MSE} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$
P-val: $P(F \geq F_{obs})$
The P-value is based on the F-distribution with p numerator and
(n-p-1) denominator degrees of freedom
Comparing Regression Models
• Conflicting Goals: Explaining variation in Y while
keeping model as simple as possible (parsimony)
• We can test whether a subset of p-g predictors
(including possibly cross-product terms) can be
dropped from a model that contains the remaining
g predictors. H0: $\beta_{g+1} = \cdots = \beta_p = 0$
– Complete Model: Contains all p predictors
– Reduced Model: Eliminates the predictors from H0
– Fit both models, obtaining the Error sum of squares for
each (or $R^2$ from each)
Comparing Regression Models
• H0: $\beta_{g+1} = \cdots = \beta_p = 0$ (After removing the
effects of X1,…,Xg, none of other predictors
are associated with Y)
• Ha: H0 is false
Test Statistic: $F_{obs} = \frac{(SSE_r - SSE_c)/(p - g)}{SSE_c / [n - p - 1]}$
$P = P(F \geq F_{obs})$
P-value based on F-distribution with p-g and n-p-1 d.f.
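The complete-versus-reduced comparison reduces to one line of arithmetic once both error sums of squares are available. The sketch below uses a hypothetical helper name, `partial_f`, and made-up numbers purely to show the calculation; none of the values come from the studies in these slides.

```python
def partial_f(sse_reduced, sse_complete, n, p, g):
    """F statistic for H0: beta_{g+1} = ... = beta_p = 0.

    sse_reduced  -- error sum of squares from the model with g predictors
    sse_complete -- error sum of squares from the model with all p predictors
    Returns (F_obs, numerator df, denominator df).
    """
    df_num = p - g
    df_den = n - p - 1
    f_obs = ((sse_reduced - sse_complete) / df_num) / (sse_complete / df_den)
    return f_obs, df_num, df_den

# Hypothetical illustration: n = 30 cases, complete model with p = 5
# predictors, reduced model keeping g = 3 of them.
f_obs, df_num, df_den = partial_f(sse_reduced=120.0, sse_complete=100.0,
                                  n=30, p=5, g=3)
print(f"F_obs = {f_obs:.2f} on {df_num} and {df_den} d.f.")
```

The observed value is then compared with the upper-tail critical value of the F-distribution with those degrees of freedom.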
Models with Dummy Variables
• Some models have both numeric and categorical
explanatory variables (Recall gender in example)
• If a categorical variable has k levels, need to create
k-1 dummy variables that take on the values 1 if
the level of interest is present, 0 otherwise.
• The baseline level of the categorical variable is the
level for which all k-1 dummy variables are set to 0
• The regression coefficient corresponding to a
dummy variable is the difference between the
mean for that level and the mean for baseline
group, controlling for all numeric predictors
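A minimal sketch of the k-1 coding scheme follows. The helper name `dummy_code` and the three-level example variable are hypothetical. Note that the helper builds its levels from the values actually observed, so a level that never occurs in the data gets no column.

```python
def dummy_code(values, baseline):
    """Return (levels, rows): one 0/1 dummy column per non-baseline level.

    Observations at the baseline level get 0 in every column.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical categorical predictor with k = 3 levels
group = ["control", "low", "high", "control", "high"]
levels, rows = dummy_code(group, baseline="control")

print("dummy columns:", levels)
for value, row in zip(group, rows):
    print(f"{value:<8} -> {row}")
```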
Example - Deep Cervical Infections
• Subjects - Patients with deep neck infections
• Response (Y) - Length of Stay in hospital
• Predictors: (One numeric, 11 Dichotomous)
– Age (x1)
– Gender (x2=1 if female, 0 if male)
– Fever (x3=1 if Body Temp > 38C, 0 if not)
– Neck swelling (x4=1 if Present, 0 if absent)
– Neck Pain (x5=1 if Present, 0 if absent)
– Trismus (x6=1 if Present, 0 if absent)
– Underlying Disease (x7=1 if Present, 0 if absent)
– Respiration Difficulty (x8=1 if Present, 0 if absent)
– Complication (x9=1 if Present, 0 if absent)
– WBC > 15000/mm3 (x10=1 if Present, 0 if absent)
– CRP > 100mg/ml (x11=1 if Present, 0 if absent)
Source: Wang, et al (2003)
Example - Weather and Spinal Patients
• Subjects - Visitors to National Spinal Network in 23 cities
Completing SF-36 Form
• Response - Physical Function subscale (1 of 10 reported)
• Predictors:
– Patient’s age (x1)
– Gender (x2=1 if female, 0 if male)
– High temperature on day of visit (x3)
– Low temperature on day of visit (x4)
– Dew point (x5)
– Wet bulb (x6)
– Total precipitation (x7)
– Barometric Pressure (x8)
– Length of sunlight (x9)
– Moon Phase (new, wax crescent, 1st Qtr, wax gibbous, full moon,
wan gibbous, last Qtr, wan crescent, presumably had 8-1=7
dummy variables)
Source: Glaser, et al (2004)
Analysis of Covariance
• Combination of 1-Way ANOVA and Linear
Regression
• Goal: Comparing numeric responses among k
groups, adjusting for numeric concomitant
variable(s), referred to as Covariate(s)
• Clinical trial applications: Response is Post-Trt
score, covariate is Pre-Trt score
• Epidemiological applications: Outcomes
compared across exposure conditions, adjusted for
other risk factors (age, smoking status, sex,...)
Nonlinear Regression
• Theory often leads to nonlinear relations
between variables. Examples:
– 1-compartment PK model with 1st-order
absorption and elimination
– Sigmoid-Emax S-shaped PD model
Example - P24 Antigens and AZT
• Goal: Model time course of P24 antigen levels
after oral administration of zidovudine
• Model fit individually in 40 HIV+ patients:
$$E(t) = E_0 (1 - A) e^{-k_{out} t} + E_0 A$$
where:
• E(t) is the antigen level at time t
• E0 is the initial level
• A is the coefficient of reduction of P24 antigen
• kout is the rate constant of decrease of P24 antigen
Source: Sasomsin, et al (2002)
Example - P24 Antigens and AZT
• Among the 40 individuals for whom the model was fit,
the means and standard deviations of the PK
“parameters” are given below:

Parameter   Mean     Std Dev
E0          472.1    408.8
A           0.28     0.21
kout        0.27     0.16

• Fitted Model for the “mean subject”
$$E(t) = 472.1 (1 - 0.28) e^{-0.27 t} + (472.1)(0.28)$$
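The "mean subject" curve can be evaluated numerically to see its shape. This sketch assumes the exponent is negative, which is what a decreasing antigen level with a positive rate constant requires. The function name and the time points are illustrative, and the slides do not state the time units.

```python
import math

def p24_level(t, e0, a, k_out):
    """Antigen level at time t: E0 * (1 - A) * exp(-k_out * t) + E0 * A."""
    return e0 * (1.0 - a) * math.exp(-k_out * t) + e0 * a

# Mean parameter values from the table
E0, A, K_OUT = 472.1, 0.28, 0.27

# Illustrative time points (units not given in the slides)
for t in [0, 1, 2, 5, 10, 20]:
    print(f"t={t:>2}  E(t)={p24_level(t, E0, A, K_OUT):7.1f}")

# The curve starts at E0 and levels off at E0 * A
start = p24_level(0, E0, A, K_OUT)
plateau = E0 * A
```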
Example - MK639 in HIV+ Patients
• Response: Y = log10(RNA change)
• Predictor: x = MK639 AUC0-6h
• Model: Sigmoid-Emax:
$$Y = \frac{\beta_0 x^{\beta_2}}{x^{\beta_2} + \beta_1^{\beta_2}}$$
• where:
• $\beta_0$ is the maximum effect (limit as $x \to \infty$)
• $\beta_1$ is the x level producing 50% of maximum effect
• $\beta_2$ is a parameter affecting the shape of the function
Source: Stein, et al (1996)
Example - MK639 in HIV+ Patients
• Data on n = 5 subjects in a Phase 1 trial:
Subject   log RNA change (Y)   MK639 AUC0-6h (x)
1         0.000                10576.9
2         0.167                13942.3
3         1.524                18235.3
4         3.205                19607.8
5         3.518                22317.1

• Model fit using SPSS (estimates slightly different from notes,
which used SAS)
$$\hat{Y} = \frac{3.52\, x^{35.60}}{x^{35.60} + 18374.39^{35.60}}$$
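The fitted curve can be compared with the five observations using the SPSS estimates above. The sketch writes the model in the algebraically equivalent form $\beta_0 / (1 + (\beta_1/x)^{\beta_2})$, which avoids raising values near 20000 to the power 35.6 in both numerator and denominator. The function name is illustrative.

```python
def sigmoid_emax(x, b0, b1, b2):
    """Sigmoid-Emax mean response, b0 * x**b2 / (x**b2 + b1**b2),
    written as b0 / (1 + (b1 / x)**b2). Requires x > 0."""
    return b0 / (1.0 + (b1 / x) ** b2)

# SPSS estimates reported on the slide
B0, B1, B2 = 3.52, 18374.39, 35.60

auc = [10576.9, 13942.3, 18235.3, 19607.8, 22317.1]    # MK639 AUC0-6h (x)
observed = [0.000, 0.167, 1.524, 3.205, 3.518]         # log RNA change (Y)

predicted = [sigmoid_emax(x, B0, B1, B2) for x in auc]
for x, y, y_hat in zip(auc, observed, predicted):
    print(f"x={x:>8.1f}  observed={y:5.3f}  fitted={y_hat:5.3f}")
```

Evaluating the function at x = B1 returns half of B0, matching the interpretation of $\beta_1$ as the level producing 50% of the maximum effect.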
Data Sources
• Wagner, J.G., G.K. Aghajanian, and O.H. Bing (1968). “Correlation of
Performance Test Scores with Tissue Concentration of Lysergic Acid
Diethylamide in Human Subjects,” Clinical Pharmacology and
Therapeutics, 9:635-638.