Multiple Linear Regression
As we have seen, _______________ linear regression is used to describe the relationship between a
response variable (y) and a _______________ predictor variable (x). In this handout, we’ll discuss
_______________ linear regression. This is used when we have _______________ predictor variables.
Example: Berkeley Guidance Study
The data are excerpted from the Berkeley Guidance Study, a longitudinal study monitoring the growth of
children in Berkeley, CA, born between January 1928 and June 1929. The data can be found in the file
BGSgirls.jmp on the course website.
The variables in the girls’ data set are:
- WT2 – weight at age 2 (kg)
- HT2 – height at age 2 (cm)
- WT9 – weight at age 9
- HT9 – height at age 9
- LEG9 – leg circumference at age 9 (cm)
- STR9 – a composite measure of strength at age 9 (higher values – stronger)
- WT18 – weight at age 18
- HT18 – height at age 18
- LEG18 – leg circumference at age 18
- STR18 – strength at age 18
- SOMA – somatotype, on a seven-point scale, as a measure of fatness (1 = slender, 7 = fat); determined from a photograph at age 18
Objective: Develop a multiple linear regression model for predicting SOMA at age 18, using as
potential predictors the variables from ages 2 and 9 only.
We begin by examining the ___________________________________ of the potential predictors and
the response, somatotype. To do this in JMP, select Analyze → Multivariate Methods → Multivariate
and place the response (SOMA) and the predictors (WT2, HT2, WT9, HT9, LEG9, STR9) in the Y box. Click
OK.
The correlations between all the variables are given first.
Next the scatterplot matrix for the data is given.
To obtain the significance tests for all the pairwise correlations, click on the red drop-down arrow next
to Multivariate and choose Pairwise Correlations. You should then get the following output.
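JMP computes these tests for you, but the pairwise-correlation test is easy to reproduce by hand. The sketch below (plain Python with NumPy, using made-up numbers rather than the actual BGSgirls data) computes a Pearson correlation and the t statistic that underlies each pairwise p-value.

```python
import numpy as np

def pearson_r_and_t(x, y):
    """Pearson correlation r and the t statistic for testing H0: rho = 0.

    The pairwise-correlation test is based on
    t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
    (Assumes |r| < 1, so the denominator is nonzero.)
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    return r, t

# Toy stand-ins for (WT9, SOMA) pairs -- not the real study values.
wt9  = [25.1, 30.2, 28.4, 35.0, 27.3, 40.1, 33.3, 29.8]
soma = [3.0, 4.0, 3.5, 5.0, 3.0, 6.0, 4.5, 4.0]
r, t = pearson_r_and_t(wt9, soma)
print(f"r = {r:.3f}, t = {t:.3f}")
```

A large |t| (equivalently, a small p-value) indicates a correlation significantly different from zero.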
2
Questions:
1. Which predictor variables exhibit the strongest linear relationship with the response variable
SOMA? Explain.
2. Which predictor variables exhibit the weakest linear relationship with the response variable
SOMA? Explain.
Using JMP to fit the multiple linear regression model
We can use JMP to fit the multiple linear regression model. Choose Analyze → Fit Model, put SOMA
in the Y box, and put WT2, HT2, WT9, HT9, LEG9, and STR9 in the Construct Model Effects box as shown
below.
Click Run and JMP returns the following output.
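Under the hood, Fit Model solves an ordinary least-squares problem: it finds the coefficients minimizing the sum of squared residuals. A minimal sketch, assuming NumPy and a small toy data set standing in for the BGSgirls values:

```python
import numpy as np

# Toy design matrix: two predictors standing in for (HT2, WT9).
X = np.array([[86.0, 30.0],
              [88.5, 32.5],
              [90.1, 28.9],
              [85.2, 35.4],
              [89.0, 31.1],
              [87.3, 33.8]])
y = np.array([4.0, 4.5, 3.5, 5.5, 4.0, 5.0])

# Least squares: minimize ||y - b0 - b1*X1 - b2*X2||^2.
X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
fitted = X1 @ beta
resid = y - fitted
print("coefficients (b0, b1, b2):", beta)
```

The coefficient estimates, fitted values, and residuals computed this way correspond to the quantities in the JMP report.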
Understanding the output:
1. This tests the overall usefulness of the multiple regression model. If this p-value is significant, then we have evidence that at least one of the predictor variables has a significant linear relationship with the response, i.e. the model is useful.
2. The coefficient of determination, R², gives the percent of the total variation in the response that is explained by ALL of the predictor variables together. Note that when comparing multiple regression models, you should use the adjusted R².
3. These p-values test whether each predictor is useful in the model over and above all of the other predictor variables. For example, note that the p-value for LEG9 is 0.604. This does NOT mean that LEG9 is not in itself a significant predictor of SOMA; it just means that it is not contributing useful information above and beyond that contributed by the other predictor variables.
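The fit statistics in item 2 are simple to compute from the residuals. A sketch (Python with NumPy, not JMP's implementation):

```python
import numpy as np

def r2_and_adjusted(y, fitted, p):
    """R^2 and adjusted R^2 for a model with p predictors
    (not counting the intercept)."""
    y, fitted = np.asarray(y, float), np.asarray(fitted, float)
    n = len(y)
    sse = np.sum((y - fitted) ** 2)        # residual (error) sum of squares
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    r2 = 1 - sse / sst
    # Adjusted R^2 penalizes extra predictors, so it is the fair way
    # to compare models with different numbers of terms.
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj
```

Adding a predictor can never decrease R², which is exactly why the adjusted version is preferred for model comparison.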
Checking the regression assumptions
The multiple linear regression assumptions are as follows:
1. Linearity – The response variable (Y) can be modeled using the predictors in the following form:
E(Y|X) = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ
2. Constant Variance – The variability in the response variable (Y) must be the same for all
specified values of the X variables, i.e. Var(Y|X) = σ² or SD(Y|X) = σ.
3. Independence – The response measurements should be independent of each other.
4. Normality – The response measurements (Y) should follow a normal distribution.
5. You should also take the time to identify any outliers since outliers can be very problematic in
any regression model.
Plots needed to check regression assumptions:
- Plot of predicted values vs. residuals (this is provided in JMP).
- Plot of each X variable versus the residuals (save the residuals to your data set and then make a scatterplot for each X variable).
- Histogram or normal quantile plot of the residuals, to check for normality.
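The residuals behind these plots are just the differences between the observed and fitted responses. A small sketch (NumPy, toy data in place of the BGSgirls values) showing what "Save Columns → Residuals" stores and one property the residuals always satisfy:

```python
import numpy as np

# Toy one-predictor fit standing in for the SOMA model.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

X1 = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
fitted = X1 @ beta
resid = y - fitted          # what JMP saves as the residual column

# Plotting fitted vs. resid (e.g. with matplotlib, if available) should
# show no pattern under the model assumptions. One built-in check:
# least squares with an intercept forces the residuals to sum to zero,
# so any apparent trend in the plot is about spread, not level.
print("sum of residuals:", resid.sum())
```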
Example: Let’s check the assumptions for the BGSgirls example.
- The plot of the predicted/fitted values versus the residuals is provided automatically in JMP. From the red drop-down menu choose Save Columns → Residuals. We will need the saved residuals to construct the following plots.
- Create a plot of each X variable versus the residuals: choose Graph → Graph Builder, drag the residual to the y-axis, and drag each individual x variable to the x-axis.
- Finally, select Analyze → Distribution and put Residual SOMA in the Y, Columns box to obtain the histogram of the residuals. Click on the red drop-down arrow and choose Normal Quantile Plot as well.
Questions:
3. Does the assumption of linearity appear to be met? Explain.
4. Does the assumption of constant variance appear to be met? Explain.
5. Does the assumption of normality appear to be met? Explain.
6. Can you identify any outliers? Explain.
Simplifying the model with backwards elimination
The effects tests for the individual predictors suggest that the model could be simplified by removing
several terms.
The individual tests suggest that WT2, HT9, LEG9, and STR9 could potentially be removed from the
model. _______________________________ is a model development strategy where we first fit a
model that includes all potential predictors of interest and then we proceed to remove
__________________________ predictors/effects one at a time until no further terms can be removed.
We remove terms with the largest p-values first and continue removing until all remaining terms are
significant at some specified level of significance. Often we use α = ________ rather than the
usual α = 0.05 level for determining the significance of an individual predictor. We begin by taking out ____
because it has the largest associated p-value = __________. The results for this simpler model are given
below.
Next, LEG9 could be removed (p-value = 0.6100).
Finally, we remove WT2 (p-value = 0.5840) and obtain the following results.
Even though the predictor STR9 is not significant at the α = 0.05 level, we will leave it in the model. Therefore,
our final model for predicting average somatotype uses HT2, WT9, and STR9 as predictors.
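The backward-elimination loop described above can be sketched in code. This is an illustration, not JMP's implementation: computing exact p-values needs a t-distribution CDF (e.g. scipy.stats), so the sketch drops the predictor with the smallest |t| statistic instead, using a cutoff of about 2 as a rough stand-in for α = 0.05.

```python
import numpy as np

def backward_eliminate(X, y, names, t_cut=2.0):
    """Backward elimination using |t| statistics as a proxy for p-values.

    Repeatedly refit, find the predictor with the smallest |t| (i.e. the
    largest p-value), and drop it, stopping once every remaining
    predictor has |t| >= t_cut.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    names = list(names)
    while names:
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        df = len(y) - X1.shape[1]
        sigma2 = resid @ resid / df                      # error variance estimate
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
        t = np.abs(beta[1:] / se[1:])                    # skip the intercept
        worst = int(np.argmin(t))                        # smallest |t| = largest p-value
        if t[worst] >= t_cut:
            break                                        # everything left is significant
        names.pop(worst)
        X = np.delete(X, worst, axis=1)
    return names
```

Note the one-at-a-time structure: each removal changes the remaining coefficients and their tests, which is why terms are dropped sequentially rather than all at once.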
Questions:
7. Write the estimated regression equation for E(SOMA|HT2, WT9, STR9).
8. Interpret each of the regression coefficients.
Lastly, we need to check the assumptions for the final model.
Once again, no major model violations are suggested. However, there is a fairly extreme outlier. While
it is generally not acceptable to delete an observation without good reason (e.g. you KNOW a mistake
was made), it is interesting to see how the analysis might change when outliers are excluded from the
study.
When this outlier is deleted the same model (containing three predictors) is obtained via backward
elimination. The summary of the final model with the outlier deleted is shown below.
Question:
9. What changed?
A few more plots in JMP
JMP also produces plots called Effect __________________ plots. They are equivalent to a more
commonly employed graphical device called an _________________________ plot (AVP). These plots
show the relationship between the response variable (SOMA) and each of the predictors adjusted for all
other terms in the model.
The negative estimated coefficients for HT2 and STR9 are supported by the negative adjusted
relationships for these terms. If the dashed red lines do not completely contain the horizontal blue line,
then the term is deemed significant.
Clearly, ____ has the strongest adjusted relationship with somatotype.
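The "adjusted relationship" these plots display can be computed directly: regress the response on all predictors except Xⱼ, regress Xⱼ on those same predictors, and plot the two sets of residuals against each other. A sketch with NumPy (a hypothetical helper, not JMP's code); a handy consequence is that the least-squares slope through the plotted points equals Xⱼ's coefficient in the full multiple regression, which is exactly what "adjusted for all other terms" means.

```python
import numpy as np

def added_variable_data(X, y, j):
    """Residual pairs (e_x, e_y) for an added-variable plot of predictor j.

    e_y: residuals of y regressed on all predictors except column j.
    e_x: residuals of X[:, j] regressed on those same predictors.
    Both regressions include an intercept.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    e_y = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
    e_x = X[:, j] - others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
    return e_x, e_y
```

Plotting e_y against e_x (with any plotting tool) reproduces the adjusted relationship JMP shows for that term.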
Also, a plot of the actual somatotype (Y) vs. the fitted/predicted values from the model is given below.
Question:
10. What should this plot look like if the model fits the data perfectly?