4A - Biostatistics

Lecture 4, part 1: Linear Regression
Analysis: Two Advanced Topics
Karen Bandeen-Roche, PhD
Department of Biostatistics
Johns Hopkins University
July 14, 2011
Introduction to Statistical
Measurement and Modeling
Data examples
 Boxing and neurological injury
 Scientific question: Does amateur boxing lead to
decline in neurological performance?
 Some related statistical questions:
 Is there a dose-response increase in the rate of cognitive
decline with increased boxing exposure?
 Is boxing-associated decline independent of initial
cognition and age?
 Is there a threshold of boxing that initiates harm?
Boxing data
[Figure: scatterplot of blkdiff (vertical axis, -20 to 20) against blbouts (horizontal axis, 0 to 400), with a lowess smoother, bandwidth = .8]
Outline
 Topic #1: Confounding
 Handling this is crucial if we are to draw correct
conclusions about risk factors
 Topic #2: Signal / noise decomposition
 Signal: Regression model predictions
 Noise: Residual variation
 Another way of approaching inference, precision of
prediction
Topic # 1: Confounding
 Confound means to “confuse”
 When the comparison is between groups that are
otherwise not similar in ways that affect the
outcome
 Lurking variables, ...
Confounding Example:
Drowning and Eating Ice Cream
[Figure: scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis)]
Confounding
Epidemiology definition: A characteristic “C” is a
confounder if it is associated (related) with both the
outcome (Y: drowning) and the risk factor (X: ice cream)
and is not causally in between
[Diagram: Ice Cream Consumption and Drowning rate, with the link between them marked "??"]
July 2010
JHU Intro to Clinical Research
Confounding
Statistical definition: A characteristic “C” is a confounder
if the strength of relationship between the outcome (Y:
drowning) and the risk factor (X: ice cream) differs with,
versus without, adjustment for C
[Diagram: Ice Cream Eaten and Drowning rate, each linked to Outdoor Temperature]
Confounding Example:
Drowning and Eating Ice Cream
[Figure: scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points grouped into warm temperature and cool temperature]
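The statistical definition of confounding can be illustrated with a small simulation. This is only a sketch on made-up data: the data-generating process, the coefficient values, and the helper `ols_coefs` are assumptions for illustration, not part of the lecture. Temperature drives both variables and ice cream has no direct effect on drowning, so the ice cream slope should change markedly once temperature is adjusted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data-generating process: temperature drives both variables,
# and ice cream has no direct effect on the drowning rate.
temperature = rng.normal(20.0, 8.0, size=n)
ice_cream = 2.0 * temperature + rng.normal(0.0, 5.0, size=n)
drowning = 0.5 * temperature + rng.normal(0.0, 3.0, size=n)


def ols_coefs(y, *predictors):
    """Least squares coefficients; an intercept column is added first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Slope on ice cream without and with adjustment for temperature
unadjusted = ols_coefs(drowning, ice_cream)[1]
adjusted = ols_coefs(drowning, ice_cream, temperature)[1]

print(f"ice cream slope, unadjusted:           {unadjusted:.3f}")
print(f"ice cream slope, temperature-adjusted: {adjusted:.3f}")
```

In this setup the unadjusted slope is clearly positive while the adjusted slope is close to zero, which is the "differs with, versus without, adjustment for C" pattern.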
Effect modification
A characteristic “E” is an effect modifier if the strength of
relationship between the outcome (Y: drowning) and the
risk factor (X: ice cream) differs within levels of E
[Diagram: Ice Cream Consumption, Drowning rate, and Outdoor temperature]
Effect Modification:
Drowning and Eating Ice Cream
[Figure: scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points grouped into warm temperature and cool temperature]
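Effect modification can be sketched the same way. The data below are simulated under an assumed model in which the ice cream slope is 0.1 on cool days and 0.6 on warm days; those values and the helper `slope` are hypothetical. The code fits the slope within each level of E and then fits a single model with an interaction term, whose coefficient equals the difference between the two within-level slopes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

warm = rng.integers(0, 2, size=n)              # E: 1 = warm, 0 = cool
ice_cream = rng.uniform(0.0, 10.0, size=n)
# Hypothetical truth: slope 0.1 when cool, 0.6 when warm
drowning = 1.0 + (0.1 + 0.5 * warm) * ice_cream + rng.normal(0.0, 1.0, size=n)


def slope(y, x):
    """Slope from a simple linear regression of y on x."""
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


# Strength of the X-Y relationship within each level of E
slope_cool = slope(drowning[warm == 0], ice_cream[warm == 0])
slope_warm = slope(drowning[warm == 1], ice_cream[warm == 1])

# One model with an interaction term: its last coefficient is the slope difference
X = np.column_stack([np.ones(n), ice_cream, warm, ice_cream * warm])
beta, *_ = np.linalg.lstsq(X, drowning, rcond=None)

print(f"slope when cool: {slope_cool:.3f}, slope when warm: {slope_warm:.3f}")
print(f"interaction coefficient: {beta[3]:.3f}")
```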
Topic #2: Signal/Noise Decomposition
 Lovely due to geometry of least squares
 Facilitates testing involving multiple parameters at
once
 Provides insight into R-squared
Signal/Noise Decomposition
 First step: decomposition of variance
 “Regression” part: Variance of $\hat{Y}$s
 “Error” or “Residual” part: Variance of e
 Together: These determine “total” variance of Ys
 “Sums of Squares” (SS) rather than variance per se
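 Regression SS (SSR): $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = Y'\left(H - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right)Y$
 Error SS (SSE): $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = Y'(I - H)Y$
 Total SS (SST): $\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = Y'\left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right)Y$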
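The three sums of squares can be checked numerically against their quadratic-form expressions. A minimal sketch, assuming a simulated design with an intercept and p = 3 predictors; the data, coefficient values, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3

# Design matrix: intercept column plus p simulated predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, -0.25, 0.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
J = np.ones((n, n)) / n                 # (1/n) 1 1'
I = np.eye(n)

Y_hat = H @ Y
Y_bar = Y.mean()

# Sums of squares as sums over observations
ssr_sum = np.sum((Y_hat - Y_bar) ** 2)
sse_sum = np.sum((Y - Y_hat) ** 2)
sst_sum = np.sum((Y - Y_bar) ** 2)

# The same quantities as quadratic forms in Y
ssr_quad = Y @ (H - J) @ Y
sse_quad = Y @ (I - H) @ Y
sst_quad = Y @ (I - J) @ Y

print(ssr_sum, ssr_quad)
print(sse_sum, sse_quad)
print(sst_sum, sst_quad)
```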
Signal/Noise Decomposition
 Properties
 SST = SSR + SSE
 SSR/SST = “proportion of variance explained” by
regression = R-squared
 Follows from geometry
 SSR and SSE are independent (assuming A1-A5) and
have easily characterized probability distributions
 Provides convenient testing methods
 Follows from geometry plus assumptions
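The identity SST = SSR + SSE can be written out in one step. This sketch assumes the model includes an intercept, so that the vector of ones lies in span(X); e denotes the residual vector Y - Ŷ.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2
     = \sum_{i=1}^{n} \big[(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\big]^2 \\
    &= \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{SSE}
     + \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{SSR}
     + 2 \sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}), \\
\sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})
    &= e'(\hat{Y} - \bar{Y}\mathbf{1}) = 0
    \quad \text{because } e \perp \mathrm{span}(X) \text{ and } \hat{Y},\ \mathbf{1} \in \mathrm{span}(X).
\end{align*}
```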
Signal/Noise Decomposition
 SSR and SSE are independent
 Define M = span(X) and take “Y” as centered at $\bar{Y}$
 It is possible to orthogonally rotate the coordinate axes so that the first p axes $\in M$; remaining n-p-1 axes $\in M^{\perp}$
 Gram-Schmidt orthogonalization
 Doing this transforms Y into TY := Z, for some orthonormal matrix T whose transpose T' has columns := {$e_1,\ldots,e_{n-1}$}
 Distribution of Z: $Z \sim N(T\,E[Y|X],\ \sigma^2 I)$
Signal/Noise Decomposition
 SSR and SSE are independent - continued
 TY = Z $\Rightarrow$ Y = T'Z = $\sum_{j=1}^{p} Z_j e_j + \sum_{j=p+1}^{n-1} Z_j e_j$
 SSE = squared length of $\sum_{j=p+1}^{n-1} Z_j e_j = \sum_{j=p+1}^{n-1} Z_j^2$
 SSR = squared length of $\sum_{j=1}^{p} Z_j e_j = \sum_{j=1}^{p} Z_j^2$
 Claim now follows: SSR & SSE are independent because $(Z_1,\ldots,Z_p)$ and $(Z_{p+1},\ldots,Z_{n-1})$ are independent
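The rotation argument can be sketched numerically. As an assumption for illustration, the orthonormal axes come from a complete QR factorization of the design matrix (QR plays the role of Gram-Schmidt here), and instead of centering Y the code simply sets aside the coordinate along the vector of ones. The simulated data and coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 2

predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)  # hypothetical coefficients

# Complete QR gives an orthonormal basis of R^n (the columns of Q):
# column 0 is proportional to the vector of ones, columns 1..p complete a
# basis of span(X), and the remaining n-p-1 columns span its orthogonal complement.
Q, _ = np.linalg.qr(X, mode="complete")
Z = Q.T @ Y                               # coordinates of Y on the rotated axes

ssr_rotated = np.sum(Z[1:p + 1] ** 2)     # p coordinates inside M (mean direction excluded)
sse_rotated = np.sum(Z[p + 1:] ** 2)      # n-p-1 coordinates in M-perp

# Direct computation for comparison
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ beta_hat
ssr_direct = np.sum((fitted - Y.mean()) ** 2)
sse_direct = np.sum((Y - fitted) ** 2)

print(ssr_rotated, ssr_direct)
print(sse_rotated, sse_direct)
```

Because SSR and SSE are built from disjoint sets of rotated coordinates, independence of the coordinates carries over to the two sums.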
Signal/Noise Decomposition
 Under A1-A5 SSE, SSR and their scaled ratio have
convenient distributions
 Under A1-A2: $E[Y|X] \in M$, so $E[Z_j|X] = 0$ for all j > p
 Recall {$Z_1,\ldots,Z_{n-1}$} are mutually independent normal with variance $\sigma^2$
 Thus SSE $= \sum_{j=p+1}^{n-1} Z_j^2 = \sigma^2 \sum_{j=p+1}^{n-1} \frac{Z_j^2}{\sigma^2} \sim \sigma^2 \chi^2_{n-p-1}$ under A1-A5
(a sum of k independent squared N(0,1) is $\chi^2_k$)
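A quick Monte Carlo check of the distribution of SSE is sketched below. It assumes normal errors with known sigma and a fixed simulated design; the coefficient values, sample size, and number of replicates are arbitrary choices for illustration. A chi-squared variable with k degrees of freedom has mean k and variance 2k, so SSE/σ² should have mean near n-p-1 and variance near 2(n-p-1).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, reps = 20, 2, 2.0, 4000

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.5, -1.0])          # hypothetical true coefficients
H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix

# Each row of Y is one simulated data set from the same design
Y = X @ beta + sigma * rng.normal(size=(reps, n))
residuals = Y - Y @ H                      # H is symmetric, so Y @ H applies H to each row
scaled_sse = np.sum(residuals ** 2, axis=1) / sigma ** 2

k = n - p - 1
print(f"mean of SSE/sigma^2: {scaled_sse.mean():.2f} (chi-squared mean: {k})")
print(f"variance of SSE/sigma^2: {scaled_sse.var():.2f} (chi-squared variance: {2 * k})")
```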
Signal/Noise Decomposition
 Under A1-A5 SSE, SSR and their scaled ratio have
convenient distributions
 For j ≤ p, $E[Z_j|X] \neq 0$ in general
 Exception: $H_0: \beta_1 = \cdots = \beta_p = 0$
 Then SSR $= \sum_{j=1}^{p} Z_j^2 \sim \sigma^2 \chi^2_p$ under A1-A5, and
$\frac{SSR/p}{SSE/(n-p-1)} \sim \frac{\chi^2_p/p}{\chi^2_{n-p-1}/(n-p-1)} \sim F_{p,\,n-p-1}$
with numerator and denominator independent.
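The null distribution of the scaled ratio can also be checked by simulation. In this sketch the data are generated with all slope coefficients equal to zero, and the simulated statistics are compared with reference draws from NumPy's `Generator.f`. The design, intercept value, sigma, and replicate count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, reps = 30, 3, 1.5, 20000

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix

# Under H0 only the intercept (hypothetical value 4.0) is nonzero
Y = 4.0 + sigma * rng.normal(size=(reps, n))
fitted = Y @ H                              # one fitted vector per row
ssr = np.sum((fitted - Y.mean(axis=1, keepdims=True)) ** 2, axis=1)
sse = np.sum((Y - fitted) ** 2, axis=1)
f_sim = (ssr / p) / (sse / (n - p - 1))

# Reference draws from F(p, n-p-1)
f_ref = rng.f(p, n - p - 1, size=reps)

q_sim = np.quantile(f_sim, 0.95)
q_ref = np.quantile(f_ref, 0.95)
print(f"95th percentile, simulated statistic: {q_sim:.3f}")
print(f"95th percentile, F reference draws:   {q_ref:.3f}")
```

The two percentiles should agree up to Monte Carlo error, and the ratio does not depend on the value of sigma chosen.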
Signal/Noise Decomposition
 An organizational tool: The analysis of variance
(ANOVA) table
SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
Error        SSE                   n-p-1                     SSE/(n-p-1) = $\hat{\sigma}^2$
Total        SST = SSR + SSE       n-1
F = MSR/MSE
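The ANOVA table can be assembled directly from a least squares fit. A minimal sketch: the function name `anova_table`, its return layout, and the simulated data are all hypothetical, and X is assumed to hold only the p predictors (the intercept column is added inside the function).

```python
import numpy as np


def anova_table(X, Y):
    """ANOVA quantities for a least squares fit; X holds the p predictors only."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    fitted = design @ beta
    ssr = np.sum((fitted - Y.mean()) ** 2)
    sse = np.sum((Y - fitted) ** 2)
    msr, mse = ssr / p, sse / (n - p - 1)
    return {
        "Regression": (ssr, p, msr),
        "Error": (sse, n - p - 1, mse),
        "Total": (ssr + sse, n - 1, None),
        "F": msr / mse,
    }


rng = np.random.default_rng(6)
X = rng.normal(size=(60, 2))
Y = 1.0 + X @ np.array([0.8, 0.0]) + rng.normal(size=60)   # made-up coefficients

table = anova_table(X, Y)
print(f"{'SOURCE':<12}{'SS':>12}{'df':>6}  {'MS':>10}")
for source in ("Regression", "Error", "Total"):
    ss, df, ms = table[source]
    ms_text = "" if ms is None else f"{ms:10.3f}"
    print(f"{source:<12}{ss:12.3f}{df:6d}  {ms_text}")
print(f"F = {table['F']:.3f}")
```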
“Global” hypothesis tests
 These involve sets of parameters
 Hypotheses of the form
H0: βj = 0 for all j in a defined subset of {j=1,...,p} vs.
H1: βj ≠ 0 for at least one of the j
Example 1: H0: βLATITUDE = 0 and βLONGITUDE = 0
Example 2: H0: all polynomial or spline coefficients
involving a given variable = 0.
Example 3: H0: all coefficients involving a variable = 0.
“Global” hypothesis tests
 Testing method: Sequential decomposition of sums of
squares
 Hypothesis to be tested is H0: βj1=...=βjpj = 0 in full model
 Fit model excluding xj1,...,xjpj: Save SSE = SSES
 Fit “full” (or larger) model adding xj1,...,xjpj to smaller model. Save SSE = SSEL, often = overall SSE
 Test statistic S = [(SSES-SSEL)/pj]/[SSEL/(n-p-1)]
 Distribution under null: F(pj,n-p-1)
 Define rejection region based on this distribution
 Compute S
 Reject or not as S is in rejection region or not
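The testing method above can be sketched as a pair of small functions. The names `sse_of_fit` and `global_f_statistic`, the simulated "age" and "location" columns, and the coefficient values are hypothetical; the rejection region would come from the F(pj, n-p-1) distribution, which is not computed here.

```python
import numpy as np


def sse_of_fit(design, Y):
    """Error sum of squares from a least squares fit of Y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return np.sum((Y - design @ beta) ** 2)


def global_f_statistic(X_small, X_extra, Y):
    """S = [(SSE_S - SSE_L)/p_j] / [SSE_L/(n-p-1)] and its degrees of freedom."""
    n = len(Y)
    small = np.column_stack([np.ones(n), X_small])   # model excluding the tested terms
    large = np.column_stack([small, X_extra])        # larger model including them
    p_j = X_extra.shape[1]
    p = large.shape[1] - 1                           # predictors in the larger model
    sse_s = sse_of_fit(small, Y)
    sse_l = sse_of_fit(large, Y)
    S = ((sse_s - sse_l) / p_j) / (sse_l / (n - p - 1))
    return S, (p_j, n - p - 1)


rng = np.random.default_rng(7)
n = 80
age = rng.normal(size=(n, 1))
location = rng.normal(size=(n, 2))        # hypothetical latitude and longitude columns
Y = 1.0 + 0.5 * age[:, 0] + 0.7 * location[:, 0] + rng.normal(size=n)

S, df = global_f_statistic(age, location, Y)
print(f"S = {S:.3f} on {df} degrees of freedom")
```

When the smaller model contains only the intercept, S reduces to the overall F = MSR/MSE.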
Signal/Noise Decomposition
 An augmented version for global testing
SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
  X1         SST-SSES              p1
  X2|X1      SSES-SSEL             p2                        (SSES-SSEL)/p2
Error        SSEL                  n-p-1                     SSEL/(n-p-1)
Total        SST = SSR + SSE       n-1
F = MSR(2|1)/MSE
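The sequential rows of the augmented table add up to the regression sum of squares. A brief sketch on simulated data, with hypothetical block sizes p1 = 2 and p2 = 3 and made-up coefficients:

```python
import numpy as np


def sse_of_fit(design, Y):
    """Error sum of squares from a least squares fit of Y on the design matrix."""
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return np.sum((Y - design @ beta) ** 2)


rng = np.random.default_rng(8)
n, p1, p2 = 100, 2, 3
X1 = rng.normal(size=(n, p1))
X2 = rng.normal(size=(n, p2))
Y = 0.5 + X1 @ np.array([1.0, -1.0]) + X2 @ np.array([0.3, 0.0, 0.0]) + rng.normal(size=n)

ones = np.ones((n, 1))
sst = np.sum((Y - Y.mean()) ** 2)
sse_s = sse_of_fit(np.hstack([ones, X1]), Y)        # smaller model: X1 only
sse_l = sse_of_fit(np.hstack([ones, X1, X2]), Y)    # larger model: X1 and X2

ss_x1 = sst - sse_s                 # row "X1"
ss_x2_given_x1 = sse_s - sse_l      # row "X2|X1"
ssr = sst - sse_l                   # row "Regression" for the larger model
f_2_given_1 = (ss_x2_given_x1 / p2) / (sse_l / (n - p1 - p2 - 1))

print(f"SS(X1) + SS(X2|X1) = {ss_x1 + ss_x2_given_x1:.3f}, SSR = {ssr:.3f}")
print(f"F for X2 given X1 = {f_2_given_1:.3f}")
```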
R-squared – Another view
 From last lecture: ECDF Corr(Y, $\hat{Y}$) squared
 More conventional: R2 = SSR/SST
 Geometry justifies why they are the same
 Cov(Y, $\hat{Y}$) = Cov(Y - $\hat{Y}$ + $\hat{Y}$, $\hat{Y}$) = Cov(e, $\hat{Y}$) + Var($\hat{Y}$)
 Covariance = inner product $\Rightarrow$ first term = 0
 A measure of precision with which regression model
describes individual responses
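The two views of R-squared can be compared numerically. A minimal sketch on simulated data, assuming a model with an intercept; the coefficient values are made up.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.6, -0.4]) + rng.normal(size=n)    # hypothetical coefficients

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ beta_hat
residuals = Y - fitted

# View 1: squared correlation between Y and the fitted values
r2_corr = np.corrcoef(Y, fitted)[0, 1] ** 2
# View 2: proportion of the total sum of squares due to regression
r2_ss = np.sum((fitted - Y.mean()) ** 2) / np.sum((Y - Y.mean()) ** 2)

# The cross term that vanishes by orthogonality (sample analogue of Cov(e, Y-hat))
cross = np.sum(residuals * (fitted - fitted.mean()))

print(f"Corr(Y, fitted)^2 = {r2_corr:.6f}")
print(f"SSR/SST           = {r2_ss:.6f}")
print(f"cross term        = {cross:.2e}")
```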
Outline: A few more topics
 Collinearity
 Overfitting
 Influence
 Mediation
 Multiple comparisons
Main points
 Confounding occurs when an apparent association
between a predictor and outcome reflects the association
of each with a third variable
 A primary goal of regression is to “adjust” for
confounding
 Least squares decomposition of Y into fit and residual
provides an appealing statistical testing framework
 An association of an outcome with predictors is
evidenced if SS due to regression is large relative to SSE
 Geometry: orthogonal decomposition provides
convenient sampling distribution, view of R2
 ANOVA