Stat 301 B – Lab 9

Goals: In this lab, we will see how to:
calculate standardized beta values
calculate VIF to assess multicollinearity
calculate the se of a predicted value (a reminder)
We will use the grandfather clocks data set (gfclocks.txt) to illustrate all three.
Standardized β values and Variance Inflation Factors:
Neither is shown by default but can be requested from JMP. Both are characteristics of parameters in the model, so they
are accessed through the Parameter Estimates box in the Fit Model results.
1. Load the clocks data and fit a multiple regression model using the Fit Model dialog.
2. Right-click inside the Parameter Estimates box of results. (Yes, the JMP user interface would be more consistent
if there were a red triangle, but there isn’t.) Mouse over Columns to bring up the menu of items shown in the
Parameter Estimates box (see figure below). The items that are shown in the table are already
checked (except for ~Bias, which is only relevant for some models; where relevant, it is shown by default).
Click on VIF to add a column with the VIF values for each parameter to the results.
Repeat and click on Std Beta to add a column with the Standardized Beta values.
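For readers who want to see what these two columns measure, here is a minimal Python sketch using the usual textbook definitions: a standardized beta is the slope rescaled by sd(X)/sd(Y), and VIF = 1 / (1 - R²), where R² comes from regressing one X variable on the others. It assumes NumPy is available and that JMP's columns follow these definitions. The function names and the data are made up for illustration; they are not the clocks data.

```python
import numpy as np

def standardized_betas(X, y):
    """Slopes from regressing y on X, rescaled by sd(x_j) / sd(y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    slopes = coef[1:]                                # drop the intercept
    return slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)

def vifs(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        total = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / total
        out[j] = 1.0 / (1.0 - r2)
    return out

# Made-up data: two correlated predictors and a response.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.6 * x1 + rng.normal(size=50)
y = 2 + 3 * x1 + 1.5 * x2 + rng.normal(size=50)
X = np.column_stack([x1, x2])

print("Std Beta:", standardized_betas(X, y))
print("VIF:     ", vifs(X))
```

With only two X variables, both VIF values are the same, because each is based on the same correlation between the two variables.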
Calculating the std. error of a predicted value
Reminder: this is a characteristic of the X value for the observation. When the regression has more than one variable,
“the X value” is the value for each variable. The se can be calculated for any X value, including X values for new
observations. In class we will talk about this as one way to assess whether we are extrapolating when we can’t just look
at the values.
1. Load the data set into JMP, then click on the next blank row and enter the value of each X variable for the point
you want to make a prediction for.
2. If JMP fusses that you have changed the data and need to refit the model, click the red triangle by Response (top
left of the Fit Model results) and select Script / Redo Analysis. This will rerun the same model with “new” data. You
will get a new results window. (You haven’t changed the data in a way that matters because you haven’t, or
shouldn’t have, entered a new response value, but JMP is being conservative.)
3. Click the red triangle by Response (top left of the Fit Model results), select Save Columns, then either Std Error
of Predicted or StdErr Pred Formula. Both give you the numbers for the X values currently in the data set. The
formula version saves a formula in the data set column, so you can add observations to get the se for those new
X values. The “non-formula” version gives you the numbers for the observations in the data set.
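The calculation behind this column can be sketched in Python. This assumes the saved value is the standard error of the fitted mean, s * sqrt(x0' (X'X)^-1 x0), where x0 includes a 1 for the intercept and s² is the error mean square. The function name and the data below are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def se_predicted(X, y, X_new):
    """se of the fitted mean at each row of X_new: s * sqrt(x0' (X'X)^-1 x0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    D = np.column_stack([np.ones(len(y)), X])           # design matrix with intercept
    D_new = np.column_stack([np.ones(len(X_new)), X_new])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    s2 = (resid @ resid) / (len(y) - D.shape[1])        # error mean square
    XtX_inv = np.linalg.inv(D.T @ D)
    # Quadratic form x0' (X'X)^-1 x0, one value per new row.
    quad = np.einsum("ij,jk,ik->i", D_new, XtX_inv, D_new)
    return np.sqrt(s2 * quad)

# Made-up data: two predictors and a response.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.6 * x1 + rng.normal(size=50)
y = 2 + 3 * x1 + 1.5 * x2 + rng.normal(size=50)
X = np.column_stack([x1, x2])

x_center = X.mean(axis=0)                       # the mean of each X variable
x_far = x_center + 5 * X.std(axis=0, ddof=1)    # a point far outside the data

se_mean = se_predicted(X, y, x_center)[0]
se_far = se_predicted(X, y, x_far)[0]
print("se at the mean X values:", se_mean)
print("se at the far point:    ", se_far)
print("ratio:                  ", se_far / se_mean)
```

The se is smallest at the mean X values, where it equals s / sqrt(n), so the ratio of an se to that smallest se gives a rough sense of how far a point sits from the bulk of the data.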
Self-assessment questions:
We continue to evaluate the relationships between home sales price and assessed values for 84 homes sold in one
neighborhood of Tampa, Florida between 2008 and 2009. This is a subset of the data considered in Case Study 2 in the
text.
Reminder: each row of data is for the sale of one house. The three variables are the sales price, the assessed value of
the land, and the assessed value of the improvements (buildings, e.g. the house, garage, or garden shed, or an in-ground
pool). The goal is to develop a model to predict sales price for a home about to be put on the market.
The data are in tamsales1.txt on the class web site. You will need to use text import / preview to read the file correctly.
All my analyses use untransformed variables for demonstration.
Questions:
The first three questions are based on the model that uses LAND and IMPROVE to predict PRICE.
1. Use standardized betas to decide which X variable, LAND or IMPROVE, is the more important predictor of price.
2. Do you have any concerns about multicollinearity in this model?
3. Any concerns about extrapolation if you use this model to predict sales price for a property with an assessed
LAND value of 400,000 and IMPROVEments of 400,000? What about a property with an assessed LAND value of
600,000 and IMPROVEments of 100,000?
Note: The average LAND value is 139,550 and the average IMPROVE value is 119,700.
4. Fit the quadratic model with LAND, IMPROVE, LAND², IMPROVE², and LAND*IMPROVE. Remember to turn off
“Center Polynomials” or create the quadratic and cross-product variables in the data set. Do you have any
concerns about multicollinearity in this model?
5. Now fit the quadratic model with LAND, IMPROVE, LAND², IMPROVE², and LAND*IMPROVE and allow JMP to
center the polynomials. You will have to create the quadratic and interaction variables “on the fly”, as discussed
in lab 8 instructions. Does centering the polynomials address any concerns about multicollinearity?
Answers:
1. LAND is the more important predictor (std beta = 0.55), but the difference is not huge (std beta for IMPROVE =
0.48).
2. No – the VIF values for both LAND and IMPROVE are 1.8, which is very much less than the suggested cutoff of
10.
3. No for the first property, yes for the second. The se of the predicted sales price at the mean LAND and mean
IMPROVE is 15,920, which is the smallest se. The se for the 1st property (400,000, 400,000) is 43,332, which is
2.7 times the smallest se, so no concerns about extrapolation for the 1st property. The se for the second property
(600,000, 100,000) is 86,837, which is about 5.5 times the smallest se, so yes, there is a concern about extrapolation.
4. Yes – the VIF values for all 5 regression slopes are above 10 and two are above 30.
5. Centering helps but does not eliminate multicollinearity for these data. After centering, both LAND² and
LAND*IMPROVE are still a concern. Their VIF values are 13.7 and 13.9, respectively.
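To see why centering changes the VIF values in questions 4 and 5, here is a small Python sketch. It computes VIFs as the diagonal of the inverse of the correlation matrix of the X variables, which is a standard identity. The data are made up (two roughly symmetric variables with means far from zero), not the Tampa sales data, so the numbers will not match the answers above. The function names are illustrative and NumPy is assumed to be available.

```python
import numpy as np

def vif_from_corr(X):
    """VIFs as the diagonal of the inverse correlation matrix of the predictors."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(R))

def quadratic_terms(a, b):
    """Columns a, b, a^2, b^2, a*b for a full quadratic model in two variables."""
    return np.column_stack([a, b, a ** 2, b ** 2, a * b])

# Made-up predictors whose means are far from zero.
rng = np.random.default_rng(1)
a = rng.normal(loc=10.0, scale=1.0, size=84)
b = rng.normal(loc=8.0, scale=1.0, size=84)

# Uncentered: squares and cross-product built from the raw values.
vif_raw = vif_from_corr(quadratic_terms(a, b))

# Centered: subtract each mean before squaring and multiplying.
# (Shifting the linear terms does not change any correlations, so this gives
# the same VIFs as centering only the squared and cross-product terms.)
vif_centered = vif_from_corr(quadratic_terms(a - a.mean(), b - b.mean()))

print("VIF, uncentered:", np.round(vif_raw, 1))
print("VIF, centered:  ", np.round(vif_centered, 1))
```

For symmetric made-up data like these, centering removes most of the correlation between a variable and its square. For skewed data, such as the home values in this lab, some of that correlation remains after centering, which is consistent with answer 5.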