4A - Biostatistics

Lecture 4, part 1: Linear Regression Analysis: Two Advanced Topics
Karen Bandeen-Roche, PhD
Department of Biostatistics, Johns Hopkins University
July 14, 2011
Introduction to Statistical Measurement and Modeling

Data examples
• Boxing and neurological injury
• Scientific question: Does amateur boxing lead to decline in neurological performance?
• Some related statistical questions:
  - Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure?
  - Is boxing-associated decline independent of initial cognition and age?
  - Is there a threshold of boxing that initiates harm?

Boxing data
[Figure: lowess smoother (bandwidth = .8) of blkdiff against blbouts]

Outline
• Topic #1: Confounding
  - Handling this is crucial if we are to draw correct conclusions about risk factors
• Topic #2: Signal / noise decomposition
  - Signal: Regression model predictions
  - Noise: Residual variation
  - Another way of approaching inference, precision of prediction

Topic #1: Confounding
• Confound means to "confuse"
• Arises when the comparison is between groups that are otherwise not similar in ways that affect the outcome
• Lurking variables,…

Confounding Example: Drowning and Eating Ice Cream
[Figure: scatterplot of drowning rate against ice cream eaten]

Confounding
Epidemiology definition: A characteristic "C" is a confounder if it is associated (related) with both the outcome (Y: drowning) and the risk factor (X: ice cream) and is not causally in between.
[Diagram: nodes "Ice Cream Consumption", "Drowning rate", and "??"]
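A minimal simulation sketch of the ice cream example, under made-up assumptions: outdoor temperature drives both ice cream eaten and drowning rate, and ice cream has no direct effect on drowning. All coefficients, noise levels, and variable names are hypothetical. The sketch compares the crude slope of drowning on ice cream with the slope after removing temperature from both variables (regressing residuals on residuals, which gives the same coefficient as including temperature in the regression).

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (simple regression with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after a simple least-squares regression on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

random.seed(0)
n = 2000
# Hypothetical data-generating model: temperature (C) affects both X and Y;
# ice cream (X) has NO direct effect on drowning (Y).
temp = [random.gauss(20, 8) for _ in range(n)]
ice_cream = [0.5 * t + random.gauss(0, 2) for t in temp]
drowning = [0.3 * t + random.gauss(0, 2) for t in temp]

# Crude association: ignores temperature
crude = slope(ice_cream, drowning)

# Adjusted association: remove temperature from both variables first
adjusted = slope(residuals(temp, ice_cream), residuals(temp, drowning))

print("crude slope:   ", round(crude, 3))
print("adjusted slope:", round(adjusted, 3))
```

In this setup the crude slope is clearly positive while the adjusted slope is close to zero, which is the pattern expected when temperature accounts for the whole apparent association.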
July 2010 JHU Intro to Clinical Research

Confounding
Statistical definition: A characteristic "C" is a confounder if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs with, versus without, adjustment for C.
[Diagram: nodes "Ice Cream Eaten", "Outdoor Temperature", and "Drowning rate"]

Confounding Example: Drowning and Eating Ice Cream
[Figure: scatterplot of drowning rate against ice cream eaten, with points marked as warm temperature or cool temperature]

Effect modification
A characteristic "E" is an effect modifier if the strength of relationship between the outcome (Y: drowning) and the risk factor (X: ice cream) differs within levels of E.
[Diagram: nodes "Ice Cream Consumption", "Drowning rate", and "Outdoor temperature"]

Effect Modification: Drowning and Eating Ice Cream
[Figure: scatterplot of drowning rate against ice cream eaten, with points marked as warm temperature or cool temperature]

Topic #2: Signal/Noise Decomposition
• Lovely due to geometry of least squares
• Facilitates testing involving multiple parameters at once
• Provides insight into R-squared

Signal/Noise Decomposition
• First step: decomposition of variance
  - "Regression" part: variance of the fitted values $\hat{Y}$
  - "Error" or "Residual" part: variance of e
  - Together: these determine the "total" variance of the Ys
• "Sums of Squares" (SS) rather than variance per se
  - Regression SS (SSR): $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = Y'\left(H - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right)Y$
  - Error SS (SSE): $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = Y'(I - H)Y$
  - Total SS (SST): $\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = Y'\left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right)Y$

Signal/Noise Decomposition
• Properties
  - SST = SSR + SSE
  - SSR/SST = "proportion of variance explained" by regression = R-squared
  - Follows from geometry
• SSR and SSE are independent (assuming A1-A5) and have easily characterized probability distributions
  - Provides convenient testing methods
  - Follows from geometry plus assumptions

Signal/Noise Decomposition
• SSR and SSE are independent
  - Define M = span(X) and take "Y" as centered at $\bar{Y}$
  - It is possible to orthogonally rotate the coordinate axes so that the first p axes ∈ M and the remaining n-p-1 axes ∈ M⊥
  - Gram-Schmidt orthogonalization
  - Doing this transforms Y into TY := Z, for some orthonormal matrix T with rows := {e1,...,en-1}
  - Distribution of Z = $N(T\,E[Y|X], \sigma^2 I)$

Signal/Noise Decomposition
• SSR and SSE are independent - continued
  - TY = Z ⇒ $Y = T'Z = \sum_{j=1}^{p} Z_j e_j + \sum_{j=p+1}^{n-1} Z_j e_j$
  - SSE = squared length of $\sum_{j=p+1}^{n-1} Z_j e_j = \sum_{j=p+1}^{n-1} Z_j^2$
  - SSR = squared length of $\sum_{j=1}^{p} Z_j e_j = \sum_{j=1}^{p} Z_j^2$
  - Claim now follows: SSR and SSE are independent because (Z1,…,Zp) and (Zp+1,…,Zn-1) are independent

Signal/Noise Decomposition
• Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions
  - Under A1-A2: E[Y|X] ∈ M, so E[Zj|X] = 0 for all j > p
  - Recall {Z1,...,Zn-1} are mutually independent normal with variance $\sigma^2$
  - Thus SSE = $\sum_{j=p+1}^{n-1} Z_j^2 = \sigma^2 \sum_{j=p+1}^{n-1} \frac{Z_j^2}{\sigma^2} \sim \sigma^2 \chi^2_{n-p-1}$ under A1-A5
    (a sum of k independent squared N(0,1) variables is $\chi^2_k$)

Signal/Noise Decomposition
• Under A1-A5, SSE, SSR and their scaled ratio have convenient distributions
  - For j ≤ p, E[Zj|X] ≠ 0 in general
  - Exception: H0: β1 = … = βp = 0
  - Then SSR = $\sum_{j=1}^{p} Z_j^2 \sim \sigma^2 \chi^2_p$ under A1-A5, and
    $\frac{SSR/p}{SSE/(n-p-1)} \sim \frac{\chi^2_p / p}{\chi^2_{n-p-1}/(n-p-1)} \sim F_{p,\,n-p-1}$
    with numerator and denominator independent.

Signal/Noise Decomposition
• An organizational tool: the analysis of variance (ANOVA) table

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
Error        SSE                   n-p-1                     SSE/(n-p-1) = $\hat{\sigma}^2$
Total        SST = SSR + SSE       n-1

F = MSR/MSE

"Global" hypothesis tests
• These involve sets of parameters
• Hypotheses of the form H0: βj = 0 for all j in a defined subset of {j=1,...,p} vs. H1: βj ≠ 0 for at least one of the j
  - Example 1: H0: βLATITUDE = 0 and βLONGITUDE = 0
  - Example 2: H0: all polynomial or spline coefficients involving a given variable = 0
  - Example 3: H0: all coefficients involving a variable = 0
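The decomposition and the ANOVA table can be checked numerically. The sketch below uses a simple regression (p = 1) on a small made-up data set; the function name `anova_simple` and the data are hypothetical, chosen only for illustration. It computes SSR, SSE and SST from their definitions, then the global F statistic MSR/MSE and R-squared.

```python
def anova_simple(x, y):
    """ANOVA quantities for least squares of y on an intercept and one predictor x."""
    n = len(y)
    p = 1  # number of predictors, excluding the intercept
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    fitted = [b0 + b1 * a for a in x]

    ssr = sum((f - my) ** 2 for f in fitted)            # regression SS
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))  # error SS
    sst = sum((b - my) ** 2 for b in y)                 # total SS

    msr = ssr / p
    mse = sse / (n - p - 1)  # estimate of sigma^2
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "F": msr / mse, "R2": ssr / sst}

# Made-up data for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]

table = anova_simple(x, y)
for name, value in table.items():
    print(name, round(value, 4))

# SST = SSR + SSE holds up to floating-point rounding
print(abs(table["SST"] - (table["SSR"] + table["SSE"])) < 1e-9)
```

Under H0: β1 = 0 and assumptions A1-A5, the printed F would be compared with an F distribution on (1, n-2) degrees of freedom. Obtaining the critical value or p-value requires tables or a statistics library, which this sketch leaves out.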
"Global" hypothesis tests
• Testing method: sequential decomposition of sums of squares
  - Hypothesis to be tested is H0: βj1 = ... = βjk = 0 in the full model
  - Fit the model excluding xj1,...,xjk: save SSE = SSE_S
  - Fit the "full" (or larger) model adding xj1,...,xjk to the smaller model: save SSE = SSE_L, often = overall SSE
  - Test statistic $S = \frac{(SSE_S - SSE_L)/k}{SSE_L/(n-p-1)}$
  - Distribution under the null: F(k, n-p-1)
  - Define the rejection region based on this distribution
  - Compute S
  - Reject or not according to whether S is in the rejection region

Signal/Noise Decomposition
• An augmented version for global testing

SOURCE       Sum of Squares (SS)   Degrees of freedom (df)   Mean square (SS/df)
Regression   SSR                   p                         SSR/p
  X1         SST - SSE_S           p1
  X2|X1      SSE_S - SSE_L         p2                        (SSE_S - SSE_L)/p2
Error        SSE_L                 n-p-1                     SSE_L/(n-p-1)
Total        SST = SSR + SSE       n-1

F = MSR(2|1)/MSE

R-squared: another view
• From last lecture: the squared sample correlation, $\mathrm{Corr}(Y, \hat{Y})^2$
• More conventional: $R^2$ = SSR/SST
• Geometry justifies why they are the same
  - $\mathrm{Cov}(Y, \hat{Y}) = \mathrm{Cov}(Y - \hat{Y} + \hat{Y}, \hat{Y}) = \mathrm{Cov}(e, \hat{Y}) + \mathrm{Var}(\hat{Y})$
  - Covariance = inner product ⇒ first term = 0
• A measure of the precision with which the regression model describes individual responses

Outline: A few more topics
• Collinearity
• Overfitting
• Influence
• Mediation
• Multiple comparisons

Main points
• Confounding occurs when an apparent association between a predictor and outcome reflects the association of each with a third variable
  - A primary goal of regression is to "adjust" for confounding
• Least squares decomposition of Y into fit and residual provides an appealing statistical testing framework
  - An association of an outcome with predictors is evidenced if SS due to regression is large relative to SSE
  - Geometry: orthogonal decomposition provides convenient sampling distribution, view of $R^2$
  - ANOVA
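The sequential test and the two views of R-squared can be illustrated with a short sketch. Everything here is an assumption made for illustration: the data are made up, the helper names `ols_fitted` and `sum_sq` are hypothetical, and least squares is solved with a small hand-written normal-equations routine rather than a statistics library. The smaller model contains x1 only; the larger model adds x2 and x3, so k = 2 coefficients are tested and p = 3.

```python
def ols_fitted(columns, y):
    """Fitted values from least squares of y on an intercept plus the given columns."""
    X = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    m = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(row[r] * row[c] for row in X) for c in range(m)]
         + [sum(row[r] * yi for row, yi in zip(X, y))]
         for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(m):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][m] / A[r][r] for r in range(m)]
    return [sum(b * v for b, v in zip(beta, row)) for row in X]

def sum_sq(values):
    return sum(v * v for v in values)

# Made-up data for illustration only
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x2 = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
x3 = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y = [2.1, 2.9, 4.2, 5.1, 6.8, 6.9, 9.2, 8.8, 10.1, 11.9]

n, p, k = len(y), 3, 2
ybar = sum(y) / n

fit_small = ols_fitted([x1], y)          # model excluding x2, x3
fit_full = ols_fitted([x1, x2, x3], y)   # full (larger) model

sse_s = sum_sq([a - f for a, f in zip(y, fit_small)])
sse_l = sum_sq([a - f for a, f in zip(y, fit_full)])
sst = sum_sq([a - ybar for a in y])

# Test statistic for H0: coefficients of x2 and x3 are both 0
S = ((sse_s - sse_l) / k) / (sse_l / (n - p - 1))

# Two views of R-squared for the full model
r2 = (sst - sse_l) / sst                 # SSR / SST
fbar = sum(fit_full) / n
cov = sum((a - ybar) * (f - fbar) for a, f in zip(y, fit_full))
corr_sq = cov ** 2 / (sst * sum_sq([f - fbar for f in fit_full]))

print("S =", round(S, 4))
print("SSR/SST =", round(r2, 6), " Corr(Y, Yhat)^2 =", round(corr_sq, 6))
```

S would be referred to the F(k, n-p-1) = F(2, 6) distribution to define the rejection region. The two R-squared values agree up to rounding, as the covariance argument above implies.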
