This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site. Copyright 2009, The Johns Hopkins University and John McGready. All rights reserved. Use of these materials permitted only in accordance with license rights granted. Materials provided "AS IS"; no representations or warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for obtaining permissions for use from third parties as needed.

Section B: More Details about Linear Regression (Optional)

MLR Allows One To . . .
- Evaluate the association between an outcome and multiple predictors in one model
- Evaluate relationships between predictors as they relate to the outcome
- Estimate the amount of variation in the outcome explained by multiple predictors
- Investigate interactions between pairs of predictors

(Figure: MLR Venn diagram)

Multiple Linear Regression Model
Points don't fall exactly on the fitted line, so to represent the observed values we add an error term, $\varepsilon$:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$

$\varepsilon$ is the noise (error, scatter): observed value = regression estimate + residual. In other words, $\varepsilon$ is the "residual variability."

How Do We Choose the "Right" Line?
The linear regression "line" is the line that gets "closest" to all of the points. But how do we measure closeness to more than one point?

Least Squares Regression
The least squares estimates $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2,$ etc., are the values that minimize the sum of squared residuals:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_p x_{pi}) \right]^2$

MLR: What Is the "Line"?
Any regression equation with more than one predictor (x) describes a multi-dimensional object in a multi-dimensional space. Two x's describe a "plane"; with three or more x's, we can't even "visualize" the object. So where is the "line"?

Assumptions
Linearity: the adjusted relationship between E[y] and each x is linear. So really, it is the adjusted relationship between the mean of y and each predictor that should be linear. To assess this, one would actually need to look at a scatterplot between y and a single x after both have been adjusted for all of the other x's.

Checking Linearity
Adjusted variable plots allow for viewing the relationship between E[y] and x after adjusting for all other model predictors. Such a plot graphs the residuals from regressing y on all of the other predictors against the residuals from regressing x on all of the other predictors.

Recall the MLR of hemoglobin (Hb) on packed cell volume (PCV) and age for 21 subjects. The linearity assumptions there are:
- The relationship between Hb and PCV is linear after adjusting for age
- The relationship between Hb and age is linear after adjusting for PCV

Adjusted variable plots allow these "adjusted scatterplots" to be visualized. The avplot command in Stata can be used after running any MLR to plot them; the syntax is "avplot var_name". A sketch of both the command and the by-hand construction follows.

(Figure: adjusted variable plot of Hb on PCV after adjusting for age; command: avplot PCV)
(Figure: adjusted variable plot of Hb on age after adjusting for PCV; command: avplot age)
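To make the adjusted variable plot construction concrete, here is a minimal Stata sketch for the Hb example. It is only a sketch: the variable names hb, pcv, and age are assumptions about how the dataset is coded, not names taken from the lecture.

    * Fit the MLR, then ask Stata for the adjusted variable plots
    regress hb pcv age
    avplot pcv                 // Hb vs. PCV, each adjusted for age
    avplot age                 // Hb vs. age, each adjusted for PCV

    * The same PCV plot built "by hand" from residuals
    regress hb age
    predict r_hb, residuals    // part of Hb not explained by age
    regress pcv age
    predict r_pcv, residuals   // part of PCV not explained by age
    scatter r_hb r_pcv         // plot the adjusted relationship

Regressing r_hb on r_pcv reproduces the PCV coefficient from the full two-predictor model, which is why these plots are a fair check of the adjusted linearity assumption.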
Multiple Linear Regression Model: Assumptions
In addition to the "adjusted linearity" assumptions, linear regression assumes a lot about the behavior of the residuals, the discrepancies between the observed values and their corresponding predicted values. The assumptions about the residuals:
- They are random noise: sometimes positive, sometimes negative, but on average 0
- They are normally distributed about 0

These assumptions can be investigated graphically by looking at a "residuals vs. predicted values" plot. This plot can be constructed in Stata by running the rvfplot command after using the regress command.

(Figure: residuals versus predicted values from the regression of Hb on PCV and age)

Why the Big "Fuss" about "Well-Behaved" Residuals?
The built-in standard error formulas work when the residuals are well behaved. If the residuals don't meet the assumptions, these formulas tend to underestimate the coefficient standard errors, giving overly "optimistic" p-values and too-narrow confidence intervals.

What to Do about Miscreant Residuals
- Bootstrapping
- Weighted regression

Collinearity
Sometimes, in an MLR situation, two or more of the x variables are "telling the same story"; that is, they contain the same information. This can lead to a situation called collinearity (or multicollinearity).

Collinearity occurs when two or more covariates are highly correlated. The model cannot "choose" which covariate "gets credit" for the association with y, and this yields unstable coefficient estimates.

A variable ($x_1$) may not be significantly associated with y because it is highly correlated with another variable ($x_2$) that is also in the multiple linear regression analysis. This can happen if $x_1$ and $x_2$ are highly correlated with each other in addition to each being correlated with y. This is collinearity: "hyper-confounding."

(Figure: an attempt to diagram collinearity)

Example of Collinearity
A possible scenario in which collinearity may exist:
- y = blood pressure
- $x_1$ = body mass index (BMI = weight/height²)
- $x_2$ = weight

In a simple linear regression with BMI, the BMI coefficient has p < .001; in a simple linear regression with weight, the weight coefficient also has p < .001. In a multiple linear regression with both BMI and weight, any of the following scenarios could happen:
- BMI p < .001, weight p = .76
- BMI p = .21, weight p = .01
- BMI p = .31, weight p = .15
- BMI p = .02, weight p = .01

Detecting Collinearity
- Sometimes it is obvious (BMI and weight): substantive knowledge
- "Weird" behavior: two predictors, each significant in simple linear regression, neither significant when together in a multiple regression
- Detection is not so cut and dried; one approach is to compare the p-values for each factor in a model alone to the p-values when both are in the same model. If significance changes drastically for both, the possibility of collinearity exists.

Solution: choose one of the two predictors for use in your final analysis. (A sketch of these checks in Stata follows below, after the discussion of correlation measures.)

Measuring Correlation
Partial correlation coefficients can be computed for each $x_j$ to measure the correlation between y and $x_j$ after adjusting for all of the other x variables in the model. Their interpretation is similar to that of r from simple linear regression.

The adjusted $R^2$ measures the amount of variability in y explained by $x_1, x_2, \ldots, x_p$. It is "adjusted" because ordinary $R^2$ automatically increases for each extra x; the adjusted value is "penalized" for the number of predictors and is generally slightly lower than the original. One standard form of the penalty, with n observations and p predictors, is

$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
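Returning to the residual checks and remedies above, here is a minimal Stata sketch; the variable names hb, pcv, and age are assumed, and the number of bootstrap replications is an arbitrary choice.

    * Residuals vs. fitted values: look for random scatter about 0
    regress hb pcv age
    rvfplot, yline(0)

    * If the residuals misbehave, bootstrap the standard errors instead
    bootstrap, reps(1000): regress hb pcv age

The bootstrap resamples subjects with replacement and refits the model each time, so the resulting standard errors do not rely on the well-behaved-residual formulas.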
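And here, as promised, is a sketch of the collinearity checks and partial correlations for the blood pressure example. The variable names bp, bmi, and weight are assumptions, and estat vif is an additional diagnostic not covered in the lecture.

    * Simple regressions: each predictor alone
    regress bp bmi
    regress bp weight

    * Both together: watch whether the p-values change drastically
    regress bp bmi weight
    estat vif               // variance inflation factors; large values suggest collinearity

    * Partial correlation of bp with each x, adjusting for the other
    pcorr bp bmi weight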
Correlated Outcomes
One last big assumption in linear regression is that observations are independent from subject to subject. Sometimes this assumption is violated:
- Repeated measures
- Cluster sampling: potential correlation within each cluster (town, school, etc.)

Example
A longitudinal study to evaluate the efficacy of a new drug to reduce seizures among epileptics:
- Subjects randomized to drug or placebo
- Baseline measurements taken, followed by four follow-up periods
- Outcome of interest: seizure counts in the four two-week follow-up periods

(Figure: the data in Stata)

Idea
For each subject, we have four observations. If the observations within a subject are correlated, then the 2nd through 4th observations are not giving fully new information. If we treat each observation as independent, we will underestimate the standard errors of the regression coefficients.

Correlated Outcomes
(Output: standard MLR assuming independence)
(Output: MLR using the "cluster" option)

What Is the Cluster Option?
Notice that the coefficient estimates don't change, just the standard errors (they get larger). The built-in regression coefficient standard error equations are not valid when observations are not independent. The cluster option invokes generalized estimating equation (GEE) estimates of the standard errors. (A Stata sketch appears below, after the methods for curved relationships.)

Dealing with Non-Linearity
What are some solutions for dealing with non-linear relationships between a continuous outcome and a continuous predictor?
- Categorize the predictor into a discrete number of groups
- Spline terms
- Quadratic terms

Example
Data on the number of visits to a fast food restaurant in the past six months and the hours of TV watched in the past month for 64 subjects.

(Figure: scatterplot of fast food visits versus hours of TV watched)

Methods for Non-Linear Relationships
- Categorizing the continuous predictor allows the "change" in outcome across groups to vary
- Quadratic terms allow the association between x and y to increase (or decrease) with each unit of x
- Splines allow the association between x and y to increase (or decrease) differently over different ranges of x values

Methods for Curved Relationships
Add a quadratic term. The resulting equation is

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2$

The estimated change in the mean of y per one-unit increase in x depends on the value of x!
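A minimal sketch of this quadratic fit for the TV example; the variable names visits and tvhours are assumptions about how the data are coded.

    * Quadratic term for hours of TV watched
    generate tvhours2 = tvhours^2
    regress visits tvhours tvhours2    // fits y = b0 + b1*x + b2*x^2

Since the slope of the fitted curve at any point is b1 + 2*b2*x, the estimated change in mean visits per extra hour of TV depends on the current number of hours, which is exactly the behavior described above.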
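And, as promised, a sketch of the cluster option for the seizure data. The variable names seizures, treat, and the subject identifier id are assumptions; vce(cluster id) is the modern spelling of the older cluster(id) option.

    * Naive analysis: treats all 4 rows per subject as independent
    regress seizures treat

    * Cluster option: same coefficients, standard errors corrected
    * for within-subject correlation (they get larger here)
    regress seizures treat, vce(cluster id)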
Spline Terms
Spline terms allow estimation of a different coefficient for x depending on the value of x; in effect, they allow for an interaction between x and itself! The formulation is a cross between a dummy variable and a continuous variable.

One needs to pick "cut points": the spline allows for the estimation of different slopes relating y to $x_1$ across the range of $x_1$ values. The cut point is usually driven by the research question:
- "Does fast-food advertising work better on those with above-average viewing habits?"
- "Did the introduction of a needle exchange program alter the relationship between the number of drug-related arrests and whether the arrestee was carrying only 'works'?"
- "Is there a threshold effect of oat bran intake on lowering cholesterol level in adults with type 2 diabetes?"
- "Is there a difference in the relationship between blood pressure and weight for those considered 'clinically obese' compared to others?"

Model with Spline
Define

$x_2 = \begin{cases} 0 & \text{if } x_1 < 75 \\ x_1 - 75 & \text{if } x_1 \geq 75 \end{cases}$

The spline is only "turned on" for values of $x_1 \geq 75$.

MLR model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

- If $x_1 < 75$: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$
- If $x_1 \geq 75$: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (x_1 - 75)$, so the slope becomes $\hat{\beta}_1 + \hat{\beta}_2$

$\hat{\beta}_2$ estimates the difference in the slope of $x_1$ when $x_1 \geq 75$ relative to the slope when $x_1 < 75$. (A sketch of this fit in Stata follows.)

(Output: regression results with the spline term)
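Finally, a minimal Stata sketch of fitting this spline model to the TV example. The variable names visits and tvhours are assumptions; the cut point of 75 hours is the one used in the lecture.

    * Spline term by hand: 0 below the cut point, (x - 75) above it
    generate tvspline = max(tvhours - 75, 0) if !missing(tvhours)
    regress visits tvhours tvspline    // coefficient on tvspline = change in slope at 75

    * Equivalent, using Stata's built-in spline maker
    mkspline tv1 75 tv2 = tvhours, marginal
    regress visits tv1 tv2             // with marginal, the tv2 coefficient is the slope change

The two fits are identical; mkspline just saves the arithmetic when there are several cut points.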