This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this
material constitutes acceptance of that license and the conditions of use of materials on this site.
Copyright 2009, The Johns Hopkins University and John McGready. All rights reserved. Use of these materials
permitted only in accordance with license rights granted. Materials provided “AS IS”; no representations or
warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently
review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for
obtaining permissions for use from third parties as needed.
Section B
More Details about Linear Regression (Optional)
MLR Allows One To . . .
Evaluate association between an outcome and multiple
predictors in one model
Evaluate relationships between predictors as they relate
to outcome
Estimate amount of variation in outcome explained by
multiple predictors
Investigate interactions between pairs of predictors
MLR
[Figure: Venn diagram]
Multiple Linear Regression Model
Points don’t fall exactly on the line, so to represent the observed values we add “ε”:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$
ε is
  Noise
  Error
  Scatter
Multiple Linear Regression Model
Points don’t fall exactly on the line, so for each observation:
Observed value = regression estimate + residual
$y_i = \hat{y}_i + \hat{\varepsilon}_i$
ε is the “residual variability”
How Do We Choose the “Right” Line?
The linear regression “line” is the line which gets “closest”
to all of the points
How do we measure closeness to more than one point?
Least Squares Regression
The least squares estimates are the values of $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$, etc., that minimize
$\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \cdots - \hat{\beta}_p x_{pi}\right)^2$
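Minimizing this sum is a calculus step the slides leave implicit: setting the partial derivative with respect to each coefficient to zero gives one “normal equation” per coefficient. A sketch, writing $x_{0i} \equiv 1$ for the intercept:

$\frac{\partial}{\partial \hat{\beta}_j} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = -2 \sum_{i=1}^{n} x_{ji}\left(y_i - \hat{y}_i\right) = 0, \quad j = 0, 1, \ldots, p$

Solving these p + 1 equations simultaneously yields the least squares estimates.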
MLR—What Is the “Line”?
Any regression equation with more than one predictor (x) describes a multi-dimensional object in a multi-dimensional space
  Two x’s—describes a “plane”
  Three or more x’s—can’t even “visualize”
So where is the “line”?
Assumptions
Linearity—adjusted relationship between E[y] and each x
is linear
So really, it’s the adjusted relationship between the mean
of y and each predictor that should be linear
  To assess this, one would actually need to look at
a scatterplot between y and a single x, after both
have been adjusted for all other x’s
Checking Linearity
Adjusted variable plots—allow for viewing of the relationship between E[y] and a given x after adjusting for all other model predictors
This plots the residuals from regressing y on all of the other predictors against the residuals from regressing x on those same predictors
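A minimal Stata sketch of constructing such a plot by hand, with hypothetical variable names (outcome y, predictors x1 and x2):

    regress y x2
    predict ry, residuals      // the part of y not explained by x2
    regress x1 x2
    predict rx1, residuals     // the part of x1 not explained by x2
    twoway scatter ry rx1 || lfit ry rx1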
Checking Linearity
Recall MLR of hemoglobin on PCV and age for 21
subjects
Checking Linearity
Assumptions
  Relationship between Hb and PCV is linear after
adjusting for age
  Relationship between Hb and age is linear after
adjusting for PCV
Checking Linearity
Adjusted variable plots allow for visualizing “adjusted
scatterplots”
The “avplot” command in Stata can be used after
running any MLR to plot these adjusted scatterplots
  Syntax “avplot var_name”
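For the hemoglobin regression, the session might look like the following sketch (the outcome variable name Hb is assumed; PCV and age are as shown on the next slides):

    regress Hb PCV age
    avplot PCV   // Hb vs. PCV, adjusted for age
    avplot age   // Hb vs. age, adjusted for PCV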
Checking Linearity
Adjusted variable plot—Hb on PCV after adjusting for
age
  Command: avplot PCV
Checking Linearity
Adjusted variable plot—Hb on age after adjusting for
PCV
  Command: avplot age
Multiple Linear Regression Model—
Assumptions
In addition to the “adjusted linearity” assumptions, linear regression assumes a lot about the behavior of the residuals, the discrepancies between the observed values and their corresponding predicted values
Assumptions about the residuals
  Random noise
  Sometimes positive, sometimes negative, but 0 on average
  Normally distributed about 0
Multiple Linear Regression Model—
Assumptions
These assumptions can be investigated graphically by
looking at “residuals vs. predicted values” plot
This plot can be constructed in Stata by running “rvfplot”
command after using the “regress” command
Multiple Linear Regression Model—
Assumptions
Residuals versus predicted values from the regression of Hb on PCV and age
[Figure: residual-versus-fitted plot]
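In Stata, a minimal sketch using the assumed variable names:

    regress Hb PCV age
    rvfplot, yline(0)   // residuals vs. fitted values, with a reference line at 0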
Why the Big “Fuss” about “Well-Behaved”
Residuals?
Built-in standard error formulas work when the residuals are well behaved. If the residuals don’t meet these assumptions, the formulas tend to underestimate the coefficient standard errors, giving overly “optimistic” (too small) p-values and too-narrow CIs
What to Do about Miscreant Residuals
Bootstrapping
Weighted regression
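As one example of the bootstrapping option, Stata can swap its built-in standard errors for bootstrapped ones; a sketch, assuming the Hb/PCV/age variables from earlier:

    regress Hb PCV age, vce(bootstrap, reps(1000))   // SEs from 1,000 bootstrap resamples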
Collinearity
Sometimes, in an MLR situation, two or more of the x
variables are “telling the same story”—that is, they
contain the same information
This can lead to a situation called collinearity or
multicollinearity
Collinearity
Collinearity occurs when two or more covariates are
highly correlated
The model cannot “choose” which covariate “gets credit” for the association with y—this yields unstable coefficient estimates
Collinearity
A variable ($x_1$) may not be significantly associated with y because it is highly correlated with another variable ($x_2$) that is also in the multiple linear regression analysis
This can happen if $x_1$ and $x_2$ are highly correlated with each other, in conjunction with each one’s correlation with y
This is called collinearity—“hyper-confounding”
Collinearity
[Figure: diagram attempting to illustrate collinearity]
Example of Collinearity
Possible scenario in which collinearity may exist:
y = blood pressure
$x_1$ = body mass index (BMI) (weight/height²)
$x_2$ = weight
Example of Collinearity
Simple linear regression with BMI:
  BMI p < .001
Simple linear regression with weight:
  Weight p < .001
Example of Collinearity
Multiple linear regression with BMI and weight together:
Example of Collinearity
Any of the following scenarios could happen:
  Scenario 1: BMI p < .001, weight p = .76
  Scenario 2: BMI p = .21, weight p = .01
  Scenario 3: BMI p = .31, weight p = .15
  Scenario 4: BMI p = .02, weight p = .01
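A sketch of the three fits in Stata (all variable names hypothetical: sbp for blood pressure, bmi, and weight):

    regress sbp bmi           // BMI alone
    regress sbp weight        // weight alone
    regress sbp bmi weight    // both together—compare p-values across the three fits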
Detecting Collinearity
Sometimes obvious (BMI and weight)
Substantive knowledge
Detecting Collinearity
“Weird” behavior
  Two predictors—each significant in simple linear
regression, neither significant when together in
multiple regression
Detecting Collinearity
Detection
  Not so “cut and dried”
  One approach—compare the p-value for each factor in a model alone to its p-value when both are in the same model
  If significance changes drastically for both, collinearity is a possibility
Detecting Collinearity
Solution
  Choose one of the two predictors for use in your
final analysis
Measuring Correlation
Partial correlation coefficients
  Can compute one for each $x_j$ to measure the correlation between y and $x_j$ after adjusting for all other x variables in the model
  Similar interpretation to r from simple linear regression
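Stata’s pcorr command reports these; a sketch with the assumed hemoglobin variables:

    pcorr Hb PCV age   // partial correlation of Hb with PCV and with age, each adjusted for the other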
Measuring Correlation
Adjusted $R^2$
  Amount of variability in y explained by $x_1, x_2, \ldots$
  Adjusted because $R^2$ automatically increases with each extra x
  The adjusted value is “penalized” for the number of predictors
  Generally slightly lower than the original $R^2$
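For reference, a standard formula for the adjustment (not shown on the original slide; n is the number of observations and p the number of predictors):

$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}$

Each extra predictor increases p and hence the penalty, so adjusted $R^2$ rises only when a new x improves the fit enough to offset it.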
Correlated Outcomes
One last big assumption in linear regression is that
observations are independent from subject to subject
Sometimes, this assumption is violated
  Repeated measures
  Cluster sampling—potential correlation within
each cluster (town, school, etc.)
Example
Longitudinal study to evaluate efficacy of a new drug to
reduce seizures among epileptics
  Subjects randomized to drug or placebo
  Baseline measurements taken, followed by four follow-up periods
  Outcome of interest—seizure counts in four
two-week follow-up periods
The Data
In Stata: [data listing—one row per subject per follow-up period]
Idea
For each subject, we have four observations
If observations within a subject are correlated, then the 2nd–4th observations are not giving fully “new” information
If we treat each observation as independent, we will underestimate the standard errors of the regression coefficients
Correlated Outcomes
Standard MLR assuming independence: [Stata output]
MLR using the “cluster” option: [Stata output]
What Is the Cluster Option?
Notice that the coefficient estimates don’t change; only the standard errors do (they get larger)
“Built-in” regression coefficient standard error equations are not valid when observations are not independent
The “cluster” option invokes Generalized Estimating Equation (GEE) estimates of the standard errors
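A sketch of the clustered fit (hypothetical variable names: seizures for the outcome, treat for the randomized arm, period for the follow-up period, id for the subject identifier):

    regress seizures treat period, vce(cluster id)   // cluster-robust SEs, clustered on subject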
Dealing with Non-Linearity
What are some ways of dealing with non-linear relationships between a continuous outcome and a continuous predictor?
  Categorize predictor into a discrete number of
groups
  Spline terms
  Quadratic terms
Example
Data on the number of visits to a fast food restaurant in
the past six months and hours of TV watched in past
month for 64 subjects
Example
[Scatterplot: fast-food visits vs. hours of TV watched]
Methods for Non-Linear Relationships
Categorizing the continuous predictor
  Allows the “change” in outcome across groups to vary
Quadratic terms
  Allow the association between x and y to increase (or decrease) steadily with each unit of x
Splines
  Allow the association between x and y to increase (or decrease) differently across different ranges of x
Methods for Curved Relationships
Add a quadratic term
Resulting equation:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_1^2$
The estimated change in the mean of y per one-unit increase in $x_1$ depends on the value of $x_1$!
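A sketch of this fit in Stata, with hypothetical variable names visits (fast-food visits) and tvhours (hours of TV watched):

    gen tvsq = tvhours^2          // quadratic term
    regress visits tvhours tvsq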
Spline Terms
Allow estimation of a different coefficient for x
depending on value of x
Allow for an interaction between x and itself!
Formulation—a cross between a dummy variable and a
continuous variable
Spline Terms
Need to pick “cut points”—the spline allows for estimation of different slopes relating y to x across its range of values
Spline Terms
“Cut point” usually driven by research question
  “Does fast-food advertising work better on those with ‘above-average’ viewing habits?”
  “Did the introduction of a needle exchange program alter the relationship between the number of drug-related arrests and whether the arrestee was carrying only ‘works’?”
  “Is there a threshold effect of oat bran intake on
lowering cholesterol level in adults with type two
diabetes?”
  “Is there a difference in the relationship between
blood pressure and weight for those considered
‘clinically obese’ compared to others?”
Model with Spline
Define
$x_2 = \begin{cases} 0 & \text{if } x_1 < 75 \\ x_1 - 75 & \text{if } x_1 \geq 75 \end{cases}$
Spline only “turned on” for values of $x_1 \geq 75$!
Model with Spline
MLR model:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$
MLR model if $x_1 < 75$:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$
MLR model if $x_1 \geq 75$:
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 (x_1 - 75) = (\hat{\beta}_0 - 75\hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_2)\,x_1$
$\hat{\beta}_2$ estimates the difference in the slope of y on $x_1$ when $x_1 \geq 75$, relative to the slope when $x_1 < 75$
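A sketch of creating and fitting the spline term in Stata, reusing the hypothetical visits and tvhours names; max() returns the larger of its two arguments, so the new variable is 0 below the cut point and tvhours − 75 above it:

    gen tvspline = max(tvhours - 75, 0)
    regress visits tvhours tvspline   // coefficient on tvspline is the change in slope at 75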
Results with Spline
[Stata regression output including the spline term]