Multiple Linear Regression

As we have seen, _______________ linear regression is used to describe the relationship between a response variable (y) and a _______________ predictor variable (x). In this handout, we'll discuss _______________ linear regression, which is used when we have _______________ predictor variables.

Example: Berkeley Guidance Study

The data are excerpted from the Berkeley Guidance Study, a longitudinal study monitoring the growth of children born in Berkeley, CA between January 1928 and June 1929. The data can be found in the file BGSgirls.jmp on the course website. The variables in the girls' data set are:

WT2 – weight at age 2 (kg)
HT2 – height at age 2 (cm)
WT9 – weight at age 9 (kg)
HT9 – height at age 9 (cm)
LEG9 – leg circumference at age 9 (cm)
STR9 – a composite measure of strength at age 9 (higher values – stronger)
WT18 – weight at age 18 (kg)
HT18 – height at age 18 (cm)
LEG18 – leg circumference at age 18 (cm)
STR18 – strength at age 18
SOMA – somatotype, on a seven-point scale, as a measure of fatness (1 = slender, 7 = fat); determined using a photograph at age 18

Objective: Develop a multiple linear regression model for predicting SOMA at age 18, using as potential predictors the variables from ages 2 and 9 only.

We begin by examining the ___________________________________ of the potential predictors and the response, somatotype. To do this in JMP, select Analyze > Multivariate Methods > Multivariate and place the response (SOMA) and the predictors (WT2, HT2, WT9, HT9, LEG9, STR9) in the Y box. Click OK. The correlations between all the variables are given first, followed by the scatterplot matrix for the data. To obtain the significance tests for all the pairwise correlations, click on the red drop-down arrow next to Multivariate and choose Pairwise Correlations. You should then get the following output.

Questions:

1. Which predictor variables exhibit the strongest linear relationship with the response variable SOMA? Explain.

2.
Which predictor variables exhibit the weakest linear relationship with the response variable SOMA? Explain.

Using JMP to fit the multiple linear regression model

We can use JMP to fit the multiple linear regression model. Choose Analyze > Fit Model, put SOMA in the Y box, and put WT2, HT2, WT9, HT9, LEG9, and STR9 in the Construct Model Effects box as shown below. Click Run and JMP returns the output annotated below.

Understanding the output:

1. The overall F test examines the usefulness of the multiple regression model as a whole. If this p-value is significant, then we have evidence that at least one of the predictor variables has a significant linear relationship with the response, i.e., the model is useful.

2. The coefficient of determination, R², gives the percentage of the total variation in the response that is explained by ALL of the predictor variables together. Note that when comparing multiple regression models, you should use the adjusted R².

3. These p-values test whether each predictor is useful in the model over and above all of the other predictor variables. For example, note that the p-value for LEG9 is 0.604. This does NOT mean that LEG9 is not in itself a significant predictor of SOMA; it just means that it does not contribute useful information above and beyond that contributed by the other predictor variables.

Checking the regression assumptions

The multiple linear regression assumptions are as follows:

1. Linearity – The response variable (Y) can be modeled using the predictors in the following form: E(Y|X) = β0 + β1X1 + β2X2 + … + βpXp.

2. Constant Variance – The variability in the response variable (Y) must be the same for all specified values of the X variables, i.e., Var(Y|X) = σ² or SD(Y|X) = σ.

3. Independence – The response measurements should be independent of each other.

4. Normality – The response measurements (Y) should follow a normal distribution.

5.
You should also take the time to identify any outliers, since outliers can be very problematic in any regression model.

Plots needed to check the regression assumptions:

- A plot of the predicted values vs. the residuals (this is provided in JMP).
- A plot of each X variable vs. the residuals (save the residuals to your data set and then make a scatterplot for each X variable).
- A histogram or normal quantile plot of the residuals, to check for normality.

Example: Let's check the assumptions for the BGSgirls example. The plot of the predicted/fitted values versus the residuals is automatically provided in JMP. From the red drop-down menu choose Save Columns > Residuals. We will need these residuals to construct the following plots.

To create a plot of each X variable versus the residuals, choose Graph > Graph Builder, then drag the residual to the y-axis and each individual x variable to the x-axis.

Finally, select Analyze > Distribution and put Residual SOMA in the Y, Columns box to obtain the histogram of the residuals. Click on the red drop-down arrow and choose Normal Quantile Plot as well.

Questions:

3. Does the assumption of linearity appear to be met? Explain.

4. Does the assumption of constant variance appear to be met? Explain.

5. Does the assumption of normality appear to be met? Explain.

6. Can you identify any outliers? Explain.

Simplifying the model with backwards elimination

The effect tests for the individual predictors suggest that the model could be simplified by removing several terms; in particular, WT2, HT9, LEG9, and STR9 could potentially be removed from the model. _______________________________ is a model development strategy where we first fit a model that includes all potential predictors of interest and then remove __________________________ predictors/effects one at a time until no further terms can be removed.
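This strategy can be sketched as a simple loop over a table of p-values. One important simplification in the sketch below: in practice the model is refit after every removal, so the p-values change at each step; here they are frozen, made-up values purely for illustration (they are NOT the JMP output), and the cutoff α = 0.10 is likewise just an assumed value.

```python
# Sketch of backwards elimination over a table of p-values.
# Simplification: real backwards elimination refits the model after
# every removal, so the remaining p-values change at each step.
ALPHA = 0.10  # assumed cutoff for illustration only

def backward_eliminate(pvalues, alpha=ALPHA):
    terms = dict(pvalues)  # predictor -> p-value from the current fit
    while terms:
        worst = max(terms, key=terms.get)   # largest p-value first
        if terms[worst] <= alpha:
            break                           # all remaining terms significant
        del terms[worst]                    # drop the least useful term
    return sorted(terms)

# Made-up p-values (hypothetical, not the handout's exact output):
full_model = {"WT2": 0.58, "HT2": 0.04, "WT9": 0.001,
              "HT9": 0.85, "LEG9": 0.60, "STR9": 0.07}
print(backward_eliminate(full_model))  # ['HT2', 'STR9', 'WT9']
```

With these illustrative numbers the loop drops HT9, LEG9, and WT2 in that order and stops, mirroring the shape of the elimination walked through below.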
We remove the term with the largest p-value first and then continue removing terms until all remaining terms are significant at some specified level of significance. Oftentimes we use α = ________ rather than the usual α = 0.05 level for determining the significance of an individual predictor.

We begin by taking out ____ because it has the largest associated p-value = __________. The results for this simpler model are given below. Next, LEG9 could be removed (p-value = 0.6100). Finally, we remove WT2 (p-value = 0.5840) and obtain the following results. Even though the predictor STR9 is not significant at the α = 0.05 level, we will leave it in the model. Therefore, our final model for predicting average somatotype uses HT2, WT9, and STR9 as predictors.

Questions:

7. Write the estimated regression equation for E(SOMA|HT2, WT9, STR9).

8. Interpret each of the regression coefficients.

Lastly, we need to check the assumptions for the final model. Once again, no major model violations are suggested. However, there is a fairly extreme outlier. While it is generally not acceptable to delete an observation without good reason (e.g., you KNOW a mistake was made), it is interesting to see how the analysis might change when outliers are excluded from the study. When this outlier is deleted, the same model (containing three predictors) is obtained via backward elimination. The summary of the final model with the outlier deleted is shown below.

Question:

9. What changed?

A few more plots in JMP

JMP also produces plots called Effect __________________ plots. They are equivalent to a more commonly employed graphical device called an _________________________ plot (AVP). These plots show the relationship between the response variable (SOMA) and each of the predictors, adjusted for all other terms in the model. The negative estimated coefficients for HT2 and STR9 are supported by the negative adjusted relationships for these terms.
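To make the coefficient interpretation asked about in question 8 concrete: in a fitted equation, each coefficient is the estimated change in mean SOMA per one-unit increase in that predictor, holding the other predictors fixed. A minimal sketch with HYPOTHETICAL coefficients and data (placeholders, not the JMP parameter estimates):

```python
# Hypothetical coefficients for E(SOMA | HT2, WT9, STR9) -- substitute
# the Parameter Estimates from the JMP output.
b0, b_ht2, b_wt9, b_str9 = 10.0, -0.10, 0.15, -0.02

def predict_soma(ht2, wt9, str9):
    """Predicted somatotype from the (hypothetical) fitted equation."""
    return b0 + b_ht2 * ht2 + b_wt9 * wt9 + b_str9 * str9

base = predict_soma(87.0, 30.0, 75.0)    # made-up predictor values
taller = predict_soma(88.0, 30.0, 75.0)  # HT2 up by 1 cm, others fixed

# The difference equals the HT2 coefficient: holding WT9 and STR9
# fixed, each extra cm of height at age 2 changes predicted SOMA
# by b_ht2.
print(round(taller - base, 6))
```

The same one-unit-shift comparison works for WT9 and STR9, which is exactly the "over and above the other predictors" reading of each coefficient.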
If the dashed red lines do not completely contain the horizontal blue line, then the term is deemed significant. Clearly, ____ has the strongest adjusted relationship with somatotype. A plot of the actual somatotype (Y) vs. the fitted/predicted values from the model is also given below.

Question:

10. What should this plot look like if the model fits the data perfectly?
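As a numeric companion to question 10: if the model fit the data perfectly, every actual value would equal its predicted value, so the points would fall exactly on the 45-degree line and R² would be 1. A small sketch with made-up numbers:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST: the fraction of variation explained."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y - f) ** 2 for y, f in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - sse / sst

actual = [3.0, 5.0, 4.0, 6.0]   # made-up somatotype values
perfect = list(actual)          # a perfect fit: every yhat equals y
rough = [3.5, 4.5, 4.5, 5.5]    # an imperfect fit

print(r_squared(actual, perfect))  # 1.0 -- all points on the 45-degree line
print(r_squared(actual, rough))
```

The second call returns a value below 1, reflecting the vertical scatter of the points around the 45-degree line.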