Multiple Linear Regression
Linear regression with two or
more predictor variables
Introduction
Often in linear regression, you want to
investigate the relationship between more
than one predictor variable and some
outcome. In this case, your model will
contain more than one independent
variable. It is also often important to
investigate a possible interaction between
two or more independent variables.
Consider the following situation:
The file air.txt contains a subsample of data from a
study of the effect of air pollution on lung
function. The variables measured were age,
gender, height, weight, forced vital capacity
(FVC), and forced expiratory volume in 1 second
(FEV1). FVC is the total volume of air in liters
which an individual can expel regardless of how
long it takes. FEV1 is the volume of air expelled
during the first second when an individual has
been told to breathe in deeply and then expel as
much air as possible.
(Dunn and Clark (1987), Applied Statistics: Analysis of Variance and
Regression, p.354.)
Input the file air.txt into SAS with the following
code (adjusting the location of the file as
necessary):
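DATA air;
/* Space-delimited file; FIRSTOBS = 2 starts reading at the second line */
INFILE 'C:\air.txt' dlm = ' ' firstobs = 2;
INPUT sex age height weight fvc fev1;
/* Product term used later to test the height-by-age interaction */
height_age = height*age;
RUN;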
The statement height_age = height*age creates a
new variable, height_age, which represents the
interaction between height and age.
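Before going further, it can help to confirm that the file was read as intended. The following is a minimal sketch, assuming the DATA step above ran without errors. PROC PRINT and PROC MEANS are standard SAS procedures; the choice of five rows and of these particular statistics is only illustrative.

```sas
/* Print the first five observations to check the column order */
PROC PRINT DATA = air (OBS = 5);
RUN;

/* Counts, missing-value counts, and basic summaries for the numeric variables */
PROC MEANS DATA = air N NMISS MEAN STD MIN MAX;
VAR age height weight fvc fev1 height_age;
RUN;
```

If the values for a variable look implausible (for example, ages appearing in the height column), the order of names on the INPUT statement probably does not match the file.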
Exploring the Data
We are interested in what factors may
predict FVC. In order to explore this
before analyzing the data, create two
plots: one of FVC vs. height; the other of
FVC vs. age:
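PROC GPLOT DATA = air;
PLOT fvc * height; /* FVC on the vertical axis, height on the horizontal */
PLOT fvc * age; /* FVC on the vertical axis, age on the horizontal */
RUN;
QUIT;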
[Figure: Plot of FVC * Height]
[Figure: Plot of FVC * Age]
The plots suggest that a linear relationship
between FVC and height is reasonable, although it
is unclear whether a linear relationship exists
between FVC and age.
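The visual impression can be supplemented with correlation coefficients. This is a minimal sketch, assuming the air data set created earlier; PROC CORR reports Pearson correlations and their p-values by default.

```sas
/* Pairwise Pearson correlations among the outcome and the two candidate predictors */
PROC CORR DATA = air;
VAR fvc height age;
RUN;
```

A weak pairwise correlation between FVC and age would not by itself rule age out, since its contribution may differ once height is also in the model.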
Create a multiple linear regression model
using both height and age to predict FVC:
PROC REG DATA = air;
MODEL fvc = height age;
RUN;
QUIT;
[Output table: Multiple Regression Output]
Interpreting Output
• The multiple regression equation is:
Yhat = -6.67 + 0.18(height) - 0.03(age),
where Yhat is the predicted FVC.
• The R-Square value is interpreted the same as
with simple linear regression:
67% of the variance in FVC is explained by
height and age in the model.
• Because the model includes more than one
predictor variable, you may want to use the
adjusted R² (Adj R-Sq) value instead of the
R-Square when interpreting the amount of
variance explained by the independent
variables. The adjusted R² penalizes predictors
that add little explanatory power.
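For reference, the adjusted R² is commonly defined as shown here. The symbols n (the number of observations) and p (the number of predictors, here 2) are notation introduced for this sketch and do not appear in the SAS output.

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Because the fraction is larger than 1 whenever p is at least 1, the adjusted value is always somewhat smaller than R-Square, and the gap grows as predictors are added without a matching gain in fit.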
Overall F-test
To test whether all of the variables taken
together significantly predict the outcome
variable (FVC), use the overall F-test. The
test statistic (F* = 36.96) is found under F
Value. The associated p-value (<0.001) is
found under Pr > F.
Ho: β1 = β2 = 0 vs. Ha: At least one β ≠ 0.
Because the p-value is less than 0.05, we
reject the null hypothesis and conclude
that taken together, height and age are
significantly related to FVC.
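As a sketch of where the F Value comes from, the statistic is the ratio of the two mean squares in the Analysis of Variance table. The symbols SSR (model sum of squares), SSE (error sum of squares), n (number of observations) and p (number of predictors, here 2) are notation assumed for this illustration.

```latex
F^* = \frac{SSR / p}{SSE / (n - p - 1)} = \frac{MSR}{MSE}
```

Under the null hypothesis, F* follows an F distribution with p and n - p - 1 degrees of freedom, which is what SAS uses to compute Pr > F.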
Testing Significance of One Variable
To test the significance of an individual
variable in predicting FVC, use the test
statistic (t Value) and associated p-value
(Pr > |t|) for that particular variable.
For example, the test of whether height is
significantly related to FVC [Ho: β1 = 0 vs.
Ha: β1 ≠ 0] has t* = 8.15, p < 0.0001.
Reject the null hypothesis and conclude
that height is significantly related to FVC
after adjusting for age.
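The t Value for each variable is its parameter estimate divided by its standard error, both of which appear in the Parameter Estimates table. In this sketch, b_j denotes the estimate for the j-th predictor, and n and p are assumed notation for the number of observations and predictors.

```latex
t^* = \frac{b_j}{SE(b_j)}, \qquad df = n - p - 1
```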
Testing for an Interaction
Because we have more than one predictor
variable, it is important to consider whether they
interact in some way. To test whether the
interaction between height and age is significant,
create another model in SAS that contains both
the main effects of height and age as well as the
interaction term you created:
PROC REG DATA = air;
MODEL fvc = height age height_age;
RUN;
QUIT;
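As an alternative sketch (not the approach this tutorial uses), PROC GLM accepts the product term directly in the MODEL statement, so the interaction can be tested without creating height_age in the DATA step. This assumes the same air data set; with no CLASS statement, height and age are treated as continuous.

```sas
/* Same model as above, with the interaction written as height*age */
PROC GLM DATA = air;
MODEL fvc = height age height*age / SOLUTION;
RUN;
QUIT;
```

The estimate and t test for height*age should match those PROC REG reports for height_age, since both procedures fit the same model.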
[Output table: Output with Interaction Term]
Is the interaction significant?
Notice that the p-value for the interaction is 0.39,
which is greater than 0.05. Therefore, the
interaction between age and height is not
significant, and we do not need to include it in
the model.
Additionally, notice that the R-Square is 0.679,
indicating that 68% of the variability in FVC is
explained by height, age and height_age. This
is barely larger than the R-Square of 0.67 from
the model with just height and age. Because
R-Square cannot decrease when a term is added,
so small an increase is a further indication that
the interaction term is not necessary.
The final model needs to include only height and
age as predictors of FVC.
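Once the final model is chosen, the fitted equation can be used to predict FVC. This is a minimal sketch using the rounded coefficients reported earlier and a hypothetical person with height 65 and age 30, in whatever units the data file uses (the source does not state them). The data set name predict and the variable fvc_hat are made up for illustration.

```sas
DATA predict;
height = 65; /* hypothetical value */
age = 30; /* hypothetical value */
/* Rounded coefficients from the fitted model: -6.67, 0.18, -0.03 */
fvc_hat = -6.67 + 0.18*height - 0.03*age;
PUT fvc_hat=; /* writes the prediction to the SAS log */
RUN;
```

By hand, -6.67 + 11.70 - 0.90 = 4.13, so the predicted FVC for this hypothetical person is about 4.13 liters. Because the coefficients are rounded, the result is approximate.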
Conclusions
Multiple Linear Regression in SAS is very
similar to Simple Linear Regression. The
major difference is that more variables are
added to the model statement, and
interaction terms need to be considered.
Use the same MODEL statement options to create
confidence intervals (clb, cli, clm) and to
identify outliers (r) and influential
points (influence).
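The options named above can be combined on a single MODEL statement. This is a minimal sketch, assuming the final model with height and age; requesting all five options at once is just for illustration, and in practice you would ask only for the ones you need.

```sas
PROC REG DATA = air;
/* CLB: confidence limits for the parameter estimates
   CLI: prediction limits for an individual predicted value
   CLM: confidence limits for the mean predicted value
   R: residual analysis, useful for spotting outliers
   INFLUENCE: influence diagnostics for each observation */
MODEL fvc = height age / CLB CLI CLM R INFLUENCE;
RUN;
QUIT;
```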
Linear Regression is used with continuous
outcome variables. If the outcome
variable of interest is categorical, logistic
regression analysis is used. The next
tutorial is an introduction to logistic
regression.