Linear Regression
Hein Stigum
Presentation, data and programs at: http://folk.uio.no/heins/courses
Apr-15 H.S.

Linear regression: CONCEPTS

Outcome and regression types
• Numerical data
  – Discrete, e.g. number of partners: Poisson regression
  – Continuous, e.g. weight: Linear regression
• Categorical data
  – Nominal, e.g. disease/no disease: Logistic regression
  – Ordinal, e.g. small/medium/large: Ordinal regression

Regression idea
[Figure: scatter plot with fitted line; x-axis: gestational age (days), 250 to 310; y-axis: 2500 to 5000]
Model: $y = b_0 + b_1 x + e$
  y = outcome
  x = covariate
  $b_1$ = coefficient, effect of x
  e = error, residual
Model with many cofactors: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
  $x_1, x_2$ = covariates

Measures and Assumptions
$\text{weight} = b_0 + b_1\,\text{gest} + b_2\,\text{sex} + e$
• Adjusted effects
  – $b_1$ is the increase in weight per day of gestational age
  – $b_1$ is adjusted for sex (the other covariate in the model)
• Assumptions
  – Independent errors
  – Linear effects
  – Constant error variance
• Robustness
  – Influence

Workflow
• DAG
• Plots: distribution and scatter
• Bivariate analysis
• Regression
  – Model estimation
  – Test of assumptions
    • Independent errors (discuss)
    • Linear effects (plot)
    • Constant error variance (plot)
  – Robustness
    • Influence (plot)

Continuous outcome: Linear regression, Birth weight
ANALYSIS

DAGs
[Figure: DAG with nodes E = gest age, D = birth weight, C1 = sex, C2 = parity]
• Associations: bivariate (unadjusted)
• Causal effects: multivariable (adjusted)
Draw your assumptions before your conclusions

Plot outcome by exposure
Effects on linear regression:
• OK
• Be clear on the research question:
  – overall birth weight: linear regression
  – low birth weight: logistic regression
  – linear and logistic can give opposite results
• May lead to non-constant error variance
• May have highly influential outliers

Plot outcome by exposure, cont.
Linear effects? Yes
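The single-covariate model from the "Regression idea" slide, y = b0 + b1 x + e, can be sketched in a few lines. The slides themselves use Stata; the following is a plain-Python illustration only. The function name `fit_line` and the toy numbers are assumptions made for this example and are not the course data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x + e with one covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y around their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope: change in y per unit of x
    b0 = mean_y - b1 * mean_x    # intercept: expected y at x = 0
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, residuals


# Hypothetical toy data (not the course data):
# gestational age in days and birth weight in grams.
gest = [260, 270, 280, 290, 300]
weight = [3100, 3350, 3500, 3700, 3850]

b0, b1, residuals = fit_line(gest, weight)
print(f"b0 = {b0:.1f}, b1 = {b1:.2f} per day")
```

Here b1 is read exactly as on the slides: the expected change in the outcome per one-unit increase in the covariate. The residuals returned by the function are what the later assumption checks are based on.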
Bivariate analysis
Outcome: birthweight

                     N     Mean   p-value
All                  564   3604
Gestational age                   <0.001
  <=280 days         230   3436
  >280 days          288   3744
Sex                               0.004
  Boy                291   3668
  Girl               273   3535
Parity                            <0.001
  0                  225   3485
  1                  215   3677
  2                  123   3695

Continuous outcome: Linear regression, Birth weight
REGRESSION

Categorical covariates
• 2 categories
  – OK, but know the coding
• 3+ categories
  – Use "dummies"
    • "Dummies" are 0/1 variables used to create contrasts
    • Want 3 categories for parity: 0, 1 and 2-7 children
    • Choose 0 as reference
    • Make dummies for the two other categories

    generate Parity1   = (parity==1) if parity<.
    generate Parity2_7 = (parity>=2) if parity<.

Model estimation
Syntax:
    regress weight gest sex Parity1 Parity2_7

Create meaningful constant
Expected birth weight:
$E(y) = b_0 + b_1\,\text{gest} + b_2\,\text{sex} + b_3\,\text{Parity1} + b_4\,\text{Parity2\_7}$
Expected birth weight at:
  gest=0, sex=0, parity=0:    $b_0 = 1972$ gr
  gest=280, sex=1, parity=0:  $b_0 + b_1 \cdot 280 + b_2 \cdot 1 = 3524$ gr
Alternative: center variables
    gen gest280=gest-280     (gest280 has a meaningful zero at 280 days)
    gen sex0=sex-1           (sex0 has a meaningful zero at boys)

Model results

                           coeff     95% conf. int.
Birth weight at ref        3524.3
Gestational age per day    6.0       (3.9, 8.2)
Sex
  Boy                      0
  Girl                     -139.2    (-228.9, -49.5)
Parity
  0                        0
  1                        232.0     (130.6, 333.5)
  2-7                      226.0     (106.9, 345)

Test of assumptions
• Discuss
  – Independent residuals?
• Plot residuals versus predicted y
  – Linear effects?
  – Constant variance?
[Figure: residuals (-1000 to 1500) versus linear prediction (3200 to 4000); outlier not included]

Violations of assumptions
• Dependent residuals
  – Use linear mixed models
• Non-linear effects
  – Add square term
  – Or use piecewise linear
• Non-constant variance
  – Use robust variance estimation
[Figures: one panel with gest (220 to 300) on the x-axis; one panel of res versus p (3400 to 3800)]

Influence
[Figure: scatter plot with gestational age on the x-axis, one outlier marked, and two fitted lines: regression with outlier and regression without outlier]
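The comparison on the Influence slide, one fit with the outlier and one without, can be sketched by refitting the line with each observation left out in turn. This is a plain-Python illustration rather than the Stata used in the slides. The helper names `slope` and `slope_changes` and the toy numbers (including the implausible gestational age of 600 days) are assumptions for this example, not the course data.

```python
def slope(x, y):
    """Least-squares slope b1 for y = b0 + b1*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_changes(x, y):
    """For each observation i: slope on all data minus slope with i removed."""
    full = slope(x, y)
    changes = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        changes.append(full - slope(x_rest, y_rest))
    return changes


# Hypothetical toy data; the last observation is a deliberate outlier.
gest = [260, 270, 280, 290, 300, 600]
weight = [3100, 3350, 3500, 3700, 3850, 3600]

changes = slope_changes(gest, weight)
# Index of the observation whose removal changes the slope the most.
most_influential = max(range(len(changes)), key=lambda i: abs(changes[i]))

print("slope with outlier:   ", round(slope(gest, weight), 2))
print("slope without outlier:", round(slope(gest[:-1], weight[:-1]), 2))
print("most influential observation (index):", most_influential)
```

In this toy example a single extreme point pulls the slope close to zero, and removing it restores a clearly positive slope, which is the pattern the two fitted lines on the slide illustrate.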
Measures of influence
[Figure: change plotted by Id (1, 2, ..., 10): remove obs 1, see change; remove obs 2, see change]
• Measure change in:
  – Predicted outcome
  – Deviance
  – Coefficients (beta)
    • Delta beta

Delta beta for gestational age
[Figure: delta beta (-10 to 0) versus weight (2000 to 6000); observation 539 is labelled]
beta for gestational age = 6.04
If obs nr 539 is removed, beta will change from 6 to 16

Removing outlier

                           Full data                 Outlier removed (final model)
                           coeff   95% conf. int.    coeff   95% conf. int.
Birth weight at ref        3524                      3531
Gestational age per day    6       (4, 8)            17      (13, 20)
Sex
  Boy                      0                         0
  Girl                     -139    (-229, -49)       -166    (-252, -80)
Parity
  0                        0                         0
  1                        232     (131, 333)        229     (132, 326)
  2-7                      226     (107, 345)        225     (112, 339)

One outlier affected two estimates

Summing up
• DAGs
  – Guide analysis
• Plots
  – Unequal variance, non-linearity, outliers
• Bivariate analysis
• Linear regression
  – Fit model
  – Check assumptions
  – Check robustness
  – Make meaningful constant
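The last point, making the constant meaningful, comes down to simple arithmetic on the reported coefficients. The sketch below uses the rounded values given in the slides (constant 1972 gr in the uncentered model, 6.04 per day for gestational age, -139.2 for sex). It is plain Python rather than the Stata used in the slides, and the variable and function names are assumptions for this example.

```python
# Rounded coefficients as reported in the slides.
b0 = 1972.0      # expected weight at gest=0, sex=0, parity=0 (not a realistic point)
b_gest = 6.04    # change in expected weight per day of gestational age
b_sex = -139.2   # coefficient for sex


def expected_weight(gest, sex):
    """Expected birth weight for parity 0 (both parity dummies equal 0)."""
    return b0 + b_gest * gest + b_sex * sex


# Reference point used in the slides: gest=280, sex=1, parity=0.
at_reference = expected_weight(gest=280, sex=1)
print(round(at_reference, 1))  # about 3524, the "Birth weight at ref" row up to rounding


def expected_weight_centered(gest280, sex0):
    """Same model after centering: gest280 = gest - 280, sex0 = sex - 1.

    Only the constant changes (it becomes the expected weight at the
    reference point); the slopes stay the same.
    """
    return at_reference + b_gest * gest280 + b_sex * sex0


# Both parameterisations give the same prediction for any covariate values.
print(expected_weight(290, 2), expected_weight_centered(10, 1))
```

Centering does not change the fit or the effect estimates; it only moves the zero point so that the constant describes a realistic reference child instead of a gestational age of 0 days.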