Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics
Karen Bandeen-Roche, PhD
Department of Biostatistics, Johns Hopkins University
July 14, 2011
Introduction to Statistical Measurement and Modeling

Data examples
- Boxing and neurological injury
- Scientific question: Does amateur boxing lead to decline in neurological performance?
- Some related statistical questions:
  - Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure?
  - Is boxing-associated decline independent of initial cognition and age?
  - Is there a threshold of boxing that initiates harm?

Boxing data
[Figure: Lowess smoother (bandwidth = .8) of blkdiff (vertical axis, -20 to 20) against blbouts (horizontal axis, 0 to 400).]

Outline
- Topic #1: Confounding
  - Handling this is crucial if we are to draw correct conclusions about risk factors
- Topic #2: Signal/noise decomposition
  - Signal: Regression model predictions
  - Noise: Residual variation
  - Another way of approaching inference, precision of prediction

Topic #1: Confounding
- Confound means to “confuse”
- When the comparison is between groups that are otherwise not similar in ways that affect the outcome
- Lurking variables, …

Confounding Example: Drowning and Eating Ice Cream
[Figure: Scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis).]

Confounding
- Epidemiology definition: A characteristic “C” is a confounder if it is associated (related) with both the outcome (Y: drowning) and the risk factor (X: ice cream) and is not causally in between
[Diagram: Ice cream consumption, drowning rate, and an unnamed third characteristic “??”.]
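The drowning and ice cream example can be mimicked with simulated data. The sketch below is illustrative only: it assumes a third characteristic (called temp here, standing in for something like outdoor temperature) that drives both variables, with no direct effect of ice cream on drowning. All variable names, coefficients, and the helper ols are hypothetical, not part of the lecture.

```python
import random

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns.
    Returns (coefficients, fitted values); coefficients[0] is the intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return beta, fitted

random.seed(0)
n = 2000
# Assumed data-generating process: temp affects both ice and drown;
# ice has NO direct effect on drown.
temp = [random.gauss(0, 1) for _ in range(n)]
ice = [t + random.gauss(0, 1) for t in temp]
drown = [t + random.gauss(0, 1) for t in temp]

crude, _ = ols([ice], drown)            # drowning on ice cream only
adjusted, _ = ols([ice, temp], drown)   # drowning on ice cream and temp

print("crude ice cream slope:   ", round(crude[1], 3))
print("adjusted ice cream slope:", round(adjusted[1], 3))
```

Under these assumptions the crude slope is clearly positive while the slope with temp in the model is near zero, so the apparent association reflects the shared dependence on the third characteristic.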
July 2010 JHU Intro to Clinical Research

Confounding
- Statistical definition: A characteristic “C” is a confounder if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs with, versus without, adjustment for C
[Diagram: Ice cream eaten, outdoor temperature, drowning rate.]

Confounding Example: Drowning and Eating Ice Cream
[Figure: Scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points distinguished by warm versus cool temperature.]

Effect modification
- A characteristic “E” is an effect modifier if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs within levels of E
[Diagram: Ice cream consumption, drowning rate, outdoor temperature.]

Effect Modification: Drowning and Eating Ice Cream
[Figure: Scatterplot of drowning rate (vertical axis) against ice cream eaten (horizontal axis), with points distinguished by warm versus cool temperature.]

Topic #2: Signal/Noise Decomposition
- Lovely due to geometry of least squares
- Facilitates testing involving multiple parameters at once
- Provides insight into R-squared

Signal/Noise Decomposition
- First step: decomposition of variance
  - “Regression” part: Variance of the $\hat{Y}$s
  - “Error” or “Residual” part: Variance of e
  - Together: These determine “total” variance of the Ys
- “Sums of Squares” (SS) rather than variance per se
  - Regression SS (SSR): $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = Y'\left(H - \frac{11'}{n}\right)Y$
  - Error SS (SSE): $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = Y'(I - H)Y$
  - Total SS (SST): $\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = Y'\left(I - \frac{11'}{n}\right)Y$

Signal/Noise Decomposition
- Properties
  - SST = SSR + SSE
    - SSR/SST = “proportion of variance explained” by regression = R-squared
    - Follows from geometry
  - SSR and SSE are independent (assuming A1-A5) and have easily characterized probability distributions
    - Provides convenient testing methods
    - Follows from geometry plus assumptions

Signal/Noise Decomposition
- SSR and SSE are independent
  - Define M = span(X) and take “Y” as centered at $\bar{Y}$
  - It is possible to orthogonally rotate the coordinate axes so that the first p axes $\in M$; the remaining n-p-1 axes $\in M^{\perp}$
    - Gram-Schmidt orthogonalization
  - Doing this transforms Y into $TY := Z$, for some orthonormal matrix T with rows $e_1', \ldots, e_{n-1}'$
  - Distribution of Z: $Z \sim N(T\,E[Y|X], \sigma^2 I)$

Signal/Noise Decomposition
- SSR and SSE are independent (continued)
  - $TY = Z \Rightarrow Y = T'Z = \sum_{j=1}^{p} Z_j e_j + \sum_{j=p+1}^{n-1} Z_j e_j$
  - SSE = squared length of $\sum_{j=p+1}^{n-1} Z_j e_j = \sum_{j=p+1}^{n-1} Z_j^2$
  - SSR = squared length of $\sum_{j=1}^{p} Z_j e_j = \sum_{j=1}^{p} Z_j^2$
  - Claim now follows: SSR and SSE are independent because $(Z_1, \ldots, Z_p)$ and $(Z_{p+1}, \ldots, Z_{n-1})$ are independent

Signal/Noise Decomposition
- Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions
  - Under A1-A2: $E[Y|X] \in M \Rightarrow E[Z_j|X] = 0$ for all $j > p$
  - Recall $\{Z_1, \ldots, Z_{n-1}\}$ are mutually independent normal with variance $\sigma^2$
  - Thus $SSE = \sum_{j=p+1}^{n-1} Z_j^2 = \sigma^2 \sum_{j=p+1}^{n-1} \frac{Z_j^2}{\sigma^2} \sim \sigma^2 \chi^2_{n-p-1}$ under A1-A5
    (a sum of k independent squared N(0,1) variables is $\chi^2_k$)

Signal/Noise Decomposition
- Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions
  - For $j \le p$, $E[Z_j|X] \ne 0$ in general
  - Exception: $H_0: \beta_1 = \cdots = \beta_p = 0$
  - Then $SSR = \sum_{j=1}^{p} Z_j^2 \sim \sigma^2 \chi^2_p$ under A1-A5, and
    $\frac{SSR/p}{SSE/(n-p-1)} \sim \frac{\chi^2_p / p}{\chi^2_{n-p-1}/(n-p-1)} \sim F_{p,\,n-p-1}$
    with numerator and denominator independent.

Signal/Noise Decomposition
- An organizational tool: The analysis of variance (ANOVA) table

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
Error        SSE                   n-p-1                     SSE/(n-p-1) = $\hat{\sigma}^2$
Total        SST = SSR + SSE       n-1

F = MSR/MSE

“Global” hypothesis tests
- These involve sets of parameters
- Hypotheses of the form $H_0: \beta_j = 0$ for all j in a defined subset of {j = 1, ..., p} vs. $H_1: \beta_j \ne 0$ for at least one of the j
- Example 1: $H_0: \beta_{LATITUDE} = 0$ and $\beta_{LONGITUDE} = 0$
- Example 2: $H_0$: all polynomial or spline coefficients involving a given variable = 0.
- Example 3: $H_0$: all coefficients involving a variable = 0.
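The ANOVA table quantities can be computed directly from a fitted model. The following is a minimal sketch on simulated data, assuming p = 2 predictors and hypothetical coefficients; the helper ols and all variable names are illustrative and not from the lecture. It checks SST = SSR + SSE numerically and forms F = MSR/MSE. Comparing F to the $F_{p,\,n-p-1}$ distribution is left out because the sketch uses only the standard library.

```python
import random

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns.
    Returns (coefficients, fitted values); coefficients[0] is the intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return beta, fitted

random.seed(1)
n, p = 100, 2  # p counts predictors, excluding the intercept
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
# Hypothetical true model, chosen only for illustration.
y = [1 + 2 * a - b + random.gauss(0, 1) for a, b in zip(x1, x2)]

_, fitted = ols([x1, x2], y)
ybar = sum(y) / n

ssr = sum((f - ybar) ** 2 for f in fitted)            # regression SS
sse = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # error SS
sst = sum((yi - ybar) ** 2 for yi in y)               # total SS

msr = ssr / p            # mean square, regression
mse = sse / (n - p - 1)  # mean square, error (estimates sigma^2)
f_stat = msr / mse
r2 = ssr / sst

print("SSR + SSE - SST:", ssr + sse - sst)  # zero up to rounding error
print("R-squared:", round(r2, 3), " F:", round(f_stat, 2))
```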
“Global” hypothesis tests
- Testing method: Sequential decomposition of sums of squares
  - Hypothesis to be tested is $H_0: \beta_{j_1} = \cdots = \beta_{j_{p_j}} = 0$ in the full model
  - Fit the model excluding $x_{j_1}, \ldots, x_{j_{p_j}}$: save SSE = $SSE_S$
  - Fit the “full” (or larger) model, adding $x_{j_1}, \ldots, x_{j_{p_j}}$ to the smaller model: save SSE = $SSE_L$, often = overall SSE
  - Test statistic: $S = \frac{(SSE_S - SSE_L)/p_j}{SSE_L/(n-p-1)}$
  - Distribution under the null: $F(p_j, n-p-1)$
  - Define the rejection region based on this distribution
  - Compute S
  - Reject or not as S is in the rejection region or not

Signal/Noise Decomposition
- An augmented version for global testing

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
  X1         SST - SSE_S           p1
  X2|X1      SSE_S - SSE_L         p2                        (SSE_S - SSE_L)/p2
Error        SSE_L                 n-p-1                     SSE_L/(n-p-1)
Total        SST = SSR + SSE       n-1

F = MSR(2|1)/MSE

R-squared – Another view
- From last lecture: ECDF $\mathrm{Corr}(Y, \hat{Y})$ squared
- More conventional: $R^2 = SSR/SST$
- Geometry justifies why they are the same
  - $\mathrm{Cov}(Y, \hat{Y}) = \mathrm{Cov}(Y - \hat{Y} + \hat{Y}, \hat{Y}) = \mathrm{Cov}(e, \hat{Y}) + \mathrm{Var}(\hat{Y})$
  - Covariance = inner product, so the first term = 0
- A measure of the precision with which the regression model describes individual responses

Outline: A few more topics
- Collinearity
- Overfitting
- Influence
- Mediation
- Multiple comparisons

Main points
- Confounding occurs when an apparent association between a predictor and outcome reflects the association of each with a third variable
- A primary goal of regression is to “adjust” for confounding
- Least squares decomposition of Y into fit and residual provides an appealing statistical testing framework
  - An association of an outcome with predictors is evidenced if SS due to regression is large relative to SSE
  - Geometry: orthogonal decomposition provides convenient sampling distribution, view of $R^2$
  - ANOVA
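The sequential sums-of-squares test and the two views of R-squared can be checked numerically. The sketch below is illustrative only: it assumes simulated data with p = 3 predictors and tests the last two jointly ($p_j$ = 2); the helpers ols and sse_of, the variable names, and the coefficients are all hypothetical. It computes the test statistic S but does not look up the F critical value, since only the standard library is used.

```python
import random

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns.
    Returns (coefficients, fitted values); coefficients[0] is the intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return beta, fitted

def sse_of(columns, y):
    """Error sum of squares from regressing y on the given columns."""
    _, fitted = ols(columns, y)
    return sum((yi - f) ** 2 for yi, f in zip(y, fitted))

random.seed(2)
n, p, pj = 200, 3, 2  # p predictors in the full model; pj of them tested
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
# Hypothetical true model: x2 matters, x3 does not.
y = [1 + 0.5 * a + 0.8 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# H0: coefficients of x2 and x3 are both zero in the full model.
sse_small = sse_of([x1], y)          # SSE_S: model excluding x2, x3
sse_large = sse_of([x1, x2, x3], y)  # SSE_L: full model
S = ((sse_small - sse_large) / pj) / (sse_large / (n - p - 1))
print("S =", round(S, 2), "on", pj, "and", n - p - 1, "degrees of freedom")

# Two views of R-squared in the full model: SSR/SST and Corr(Y, Yhat)^2.
_, fitted = ols([x1, x2, x3], y)
ybar, fbar = sum(y) / n, sum(fitted) / n
sst = sum((yi - ybar) ** 2 for yi in y)
ssr = sum((f - ybar) ** 2 for f in fitted)
sxy = sum((yi - ybar) * (f - fbar) for yi, f in zip(y, fitted))
sff = sum((f - fbar) ** 2 for f in fitted)
r2_ss = ssr / sst
r2_corr = sxy ** 2 / (sst * sff)
print("R2 (SSR/SST):", round(r2_ss, 4), " R2 (Corr^2):", round(r2_corr, 4))
```

S would then be compared with an upper quantile of $F(p_j, n-p-1)$, for example from statistical tables or a statistics library.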