Handout #6: Confidence and Prediction Intervals for Predictions

Section 6.1: Modeling Used Car Prices

Example 6.1: The first example in this handout uses the CarPrices dataset from our course website. This dataset includes various variables that are thought to influence the price of a vehicle. A snippet of the data is provided here.

Simple Linear Regression Setup

• Model to be fit using only used cars, i.e., New=No.
• Response Variable: Price
• Predictor Variable: Miles
• Assume the following structure for the mean and variance functions
    o \( E(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \beta_0 + \beta_1 \cdot \text{Miles} \)
    o \( Var(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \sigma^2 \)

The first step in running this analysis in JMP is to subset (filter) the original dataset so that it includes only the used vehicles. First, select Analyze > Distribution. Place New in the Y, Columns box and click OK.

Figure: Select Analyze > Distribution, place New in the Y, Columns box.

Double-clicking on New=No creates a new dataset that includes only the New=No vehicles.

JMP clearly names this new subset of the original data: it is called CarPrices (New=No). The JMP spreadsheet shows that this dataset includes 170 of the observations from the original CarPrices dataset.

First, consider a scatterplot to visualize the relationship between Price | Miles, New=No.

Questions
1. What is the general pattern/trend/relationship between Price and Miles?
2. Do you think the assumed mean function stated above is appropriate? How about the assumed form for the variance function? Discuss.

Figure: Lowess smoother for Price | Miles, New=No
Figure: Lowess smoother of the variability

Consider again the much simpler form for the mean function.

\[ E(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \beta_0 + \beta_1 \cdot \text{Miles} \]

To fit this model in JMP, select Analyze > Fit Model. Put Price, i.e., the response, in the Y box, and Miles, i.e., the predictor variable, in the Construct Model Effects box. Click Run.
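For readers who want to see the same two steps outside JMP (filter to New=No, then fit Price on Miles), here is a minimal sketch in plain Python. It is only an illustration: the rows in `cars` are made-up values, not the CarPrices data, and `fit_simple_linear` is a hypothetical helper that applies the usual least-squares formulas for a single predictor.

```python
def fit_simple_linear(x, y):
    """Return (b0, b1), the least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical rows for illustration only (NOT the CarPrices data).
cars = [
    {"Price": 30000, "Miles": 10, "New": "Yes"},
    {"Price": 18000, "Miles": 30000, "New": "No"},
    {"Price": 15500, "Miles": 52000, "New": "No"},
    {"Price": 12500, "Miles": 81000, "New": "No"},
    {"Price": 9000, "Miles": 120000, "New": "No"},
]

# Step 1: keep only the used vehicles (the New=No subset).
used = [car for car in cars if car["New"] == "No"]

# Step 2: fit Price on Miles using the subset.
miles = [car["Miles"] for car in used]
price = [car["Price"] for car in used]
b0, b1 = fit_simple_linear(miles, price)

print(f"Used vehicles kept: {len(used)}")
print(f"Estimated mean function: Price = {b0:.0f} + ({b1:.4f}) * Miles")
```

The slope and intercept printed here describe the made-up rows only; the estimates for the real data come from the JMP output.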
Figure: The distribution of Price | Miles, New=No with the estimated mean function, and the Summary of Fit output as provided by JMP.

Questions
3. Does a linear mean function appear to fit this data well? Discuss.
4. The New=No filtered dataset had 170 observations (see the JMP spreadsheet above), but it appears only 169 were used in our regression analysis (see Observations (or Sum Wgts) = 169). Why is this the case?
5. What is the best estimate for the variance of the conditional distribution, i.e., \( \widehat{Var}(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \hat{\sigma}^2 \)?
6. The standard deviation is simply the square root of the variance. Compute this quantity for the conditional distribution given here. That is, compute \( \sqrt{\widehat{Var}(\text{Price} \mid \text{Miles}, \text{New} = \text{No})} = \sqrt{\hat{\sigma}^2} \). Does this match the Root Mean Square Error computed by JMP?
7. What is the interpretation of the Root Mean Square Error quantity in the context of this problem?
8. What is the interpretation of the \( R^2 \) value for this model?

Consider again the form of the mean function and the Parameter Estimates portion of the output provided by JMP.

\[ E(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \beta_0 + \beta_1 \cdot \text{Miles} \]

Questions
9. What is the best estimate for the slope of the true mean function? That is, what is \( \hat{\beta}_1 \)?
10. What is the best estimate for the y-intercept of the true mean function? That is, what is \( \hat{\beta}_0 \)?
11. Write out the estimated mean function using the estimated parameters.
12. Interpret, in context and using layman's language, the slope in the above equation.
13. Interpret, in context and using layman's language, the y-intercept in the above equation.

The 95% confidence intervals for the parameters in the true mean function can be obtained by selecting Show All Confidence Intervals under Regression Reports from the red drop-down menu in JMP.

Questions
14. Interpret, in context and using layman's language, the 95% confidence interval for \( \beta_1 \).
15. Interpret, in context and using layman's language, the 95% confidence interval for \( \beta_0 \).
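The Summary of Fit quantities referred to in the questions above are linked by simple formulas: the estimated variance is the residual sum of squares divided by n − 2, the Root Mean Square Error is its square root, and R² is one minus the ratio of residual to total sum of squares. The sketch below shows those relationships in plain Python. The four data points are made up for illustration (miles in tens of thousands, price in thousands of dollars), and `summary_of_fit` is a hypothetical helper, not a JMP function.

```python
import math


def summary_of_fit(x, y):
    """Least-squares fit of y on x plus the usual summary-of-fit quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual and total sums of squares.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)

    sigma2_hat = sse / (n - 2)  # estimated variance (two parameters estimated)
    return {
        "b0": b0,
        "b1": b1,
        "sigma2_hat": sigma2_hat,
        "rmse": math.sqrt(sigma2_hat),  # Root Mean Square Error
        "r2": 1.0 - sse / sst,
    }


# Made-up data for illustration only (NOT the CarPrices data).
miles_10k = [1, 2, 3, 4]
price_1k = [8, 5, 4, 2]

fit = summary_of_fit(miles_10k, price_1k)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```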
The estimated mean function is given by

\[ \widehat{E}(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Miles} = 20889 - 0.10 \cdot \text{Miles} \]

This quantity can be used to estimate the average Price of a used car with 50,000 Miles. The math is shown here.

\[ \widehat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}) = 20889 - 0.10 \cdot 50000 = \$15{,}889 \]

The $15,889 value appears reasonable when we isolate vehicles near 50,000 miles in our original scatterplot.

Confidence Interval for Predictions

Akin to all other estimated quantities, we can expect variation to exist in this estimate. That is, a different random sample will produce a different estimated price for a vehicle with 50,000 miles. A 95% confidence interval for the quantity \( \widehat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}) \) can be obtained directly in JMP by selecting Mean Confidence Interval from the Save Columns menu item.

A 95% confidence interval for every observation in the dataset is provided in the JMP spreadsheet. The spreadsheet can be sorted by Miles to more easily find a 95% confidence interval for vehicles with 50,000 miles.

A review of the sorted spreadsheet shows there are no observations with exactly 50,000 miles. That is, none of the confidence intervals provided applies exactly to the quantity

\[ \widehat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}) \]

Obtaining Output for a New Observation

To obtain a prediction, standard error, and confidence interval for a new observation, use the Formula versions of these quantities as provided in JMP.

Next, create a pseudo observation in JMP with the desired characteristics. For our example, this pseudo observation will have Miles = 50000.

Questions
16. Does the predicted price (aside from rounding) provided by JMP agree with what we computed above? Discuss.
17. What is the standard error for \( \widehat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}) \)? Give a practical interpretation of this quantity.
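The hand calculation of the estimated mean price at 50,000 miles can be reproduced in a few lines of Python. This sketch uses the rounded coefficients shown above (20889 and −0.10) and compares the result with the predicted value of $15,882.72 that appears in the prediction-interval calculation later in this handout. The function name `estimated_mean_price` is a hypothetical helper introduced only for illustration.

```python
# Rounded coefficient estimates as written in the estimated mean function.
B0_ROUNDED = 20889.0
B1_ROUNDED = -0.10


def estimated_mean_price(miles, b0=B0_ROUNDED, b1=B1_ROUNDED):
    """Evaluate the estimated mean function b0 + b1 * miles."""
    return b0 + b1 * miles


by_hand = estimated_mean_price(50_000)

# Predicted value quoted later in the handout (from unrounded coefficients).
jmp_value = 15882.72

print(f"Hand calculation with rounded coefficients: ${by_hand:,.0f}")
print(f"Difference from the JMP value: ${by_hand - jmp_value:,.2f}")
```

The small gap between the two numbers comes from rounding the coefficients before plugging in 50,000 miles.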
The formula for the 95% normal-based confidence interval for the average predicted value is given by

    o \( \text{Lower Limit} = \text{Predicted Value} - c \cdot \text{Standard Error} \)
    o \( \text{Upper Limit} = \text{Predicted Value} + c \cdot \text{Standard Error} \)

where c is the 97.5th percentile from a t-distribution with n − 2 degrees of freedom.

Task: Verify the calculations for the 95% confidence interval for the average predicted value here.

Figure: t-distribution with df = 169 − 2 = 167

Lower Limit:

Upper Limit:

In Excel:

18. Interpret the 95% confidence interval for the average predicted value for a used vehicle with 50,000 miles, i.e., the 95% confidence interval for the quantity \( \widehat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}) \).

Comment

The above confidence interval is not appropriate when attempting to predict the Price for a single vehicle. Instead, it gives a reasonable range of values for the average predicted price for vehicles with 50,000 miles.

Confidence Interval               | Prediction Interval
Average Predicted Price           | Single Predicted Price
All vehicles with 50,000 Miles    | Single Vehicle with 50,000 Miles

Figure: An overlay plot of the data, the estimated mean function, and the 95% confidence interval for the average predicted price for vehicles with 50,000 miles.

Consider the plot of the estimated mean function over repeated samples. These plots were introduced in a previous handout.

Notice that the variation in the estimated mean function is smaller for some values of miles and larger for others. In particular, the variation is smallest near the average miles and increases as miles either increases or decreases. The reason for this is that all estimated linear mean functions must pass through the point (Average Miles, Average Price). The average number of miles for our dataset is 78,442; thus, the standard error for the average prediction will be smallest when making predictions for vehicles near 78,000 miles.

Figure: The 95% confidence interval bands for an average prediction across all miles.
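The interval formula and the shape of the confidence bands can both be sketched in Python. The first part applies Predicted Value ± c · Standard Error using numbers quoted elsewhere in this handout (predicted value 15,882.72, standard error 426.38, and c = 1.9743 for 167 degrees of freedom). The second part uses the standard simple-regression formula for the standard error of an estimated mean, s · sqrt(1/n + (x0 − x̄)² / Sxx), to show why the bands are narrowest at the average miles. The handout does not report Sxx, so `SXX_ASSUMED` is a made-up value and the printed standard errors are illustrative only; `t_interval` and `se_mean` are hypothetical helper names.

```python
import math


def t_interval(estimate, std_error, c):
    """Return (lower, upper) = estimate -/+ c * std_error."""
    margin = c * std_error
    return estimate - margin, estimate + margin


def se_mean(x0, s, n, x_bar, sxx):
    """Standard error of the estimated mean response at x0 (simple linear regression)."""
    return s * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)


# Part 1: confidence interval for the average price at 50,000 miles,
# using values quoted in this handout.
predicted = 15882.72
se_mean_50k = 426.38
c = 1.9743  # 97.5th percentile of a t-distribution with 167 df

ci_lower, ci_upper = t_interval(predicted, se_mean_50k, c)
print(f"95% CI for the average price: ${ci_lower:,.2f} to ${ci_upper:,.2f}")

# Part 2: how the standard error changes with miles.
RMSE = 4850.512       # from the handout
N = 169               # observations used in the fit
X_BAR = 78442         # average miles, from the handout
SXX_ASSUMED = 4.5e11  # ASSUMPTION: not reported in the handout, illustration only

for miles in (20_000, 50_000, 78_442, 110_000, 140_000):
    se = se_mean(miles, RMSE, N, X_BAR, SXX_ASSUMED)
    print(f"Miles = {miles:>7,}: SE of average prediction is about {se:,.0f}")
```

Whatever the true Sxx is, the term (x0 − x̄)² is zero at the average miles, so the standard error is smallest there and grows in both directions.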
Prediction Interval for Predictions

Recall, a confidence interval is the appropriate quantity when interested in the average predicted price for vehicles with 50,000 miles.

\[ E(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}) \]

However, a prediction interval is necessary when attempting to make predictions for a single vehicle.

\[ \text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No} \]

The 95% prediction intervals can be obtained in JMP by selecting Indiv Confidence Interval from the Save Columns menu.

The prediction intervals are placed in the JMP spreadsheet. Again, sort the spreadsheet by Miles to find prediction intervals for used vehicles with close to 50,000 miles.

Similar to what was done above, in order to obtain a 95% prediction interval for a new observation not currently in the dataset, you must use the Formula versions of these quantities.

Obtaining a prediction interval for a vehicle with 50,000 miles.

The formula for the 95% normal-based prediction interval for a single value is given by

    o \( \text{Lower Limit} = \text{Predicted Value} - c \cdot \text{Standard Error} \)
    o \( \text{Upper Limit} = \text{Predicted Value} + c \cdot \text{Standard Error} \)

where c is the 97.5th percentile from a t-distribution with n − 2 degrees of freedom.

JMP does not compute the standard error of an individual prediction for the new observation; notice the dot in the StdErr Indiv Price column for our new observation. The standard error for a prediction interval is computed as follows.

\[ \text{Standard Error} = \sqrt{ \underbrace{\widehat{Var}\big(\widehat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No})\big)}_{\text{Variability in Mean Function}} + \underbrace{\widehat{Var}(\text{Price} \mid \text{Miles}, \text{New} = \text{No})}_{\text{Variability in Individual Observations}} } = \sqrt{426.38^2 + 4850.512^2} = 4869.22 \]

The actual calculations for the 95% prediction interval.

Lower Limit: \( \$15{,}882.72 - (1.9743 \cdot 4869.22) = \$6{,}269.42 \)
Upper Limit: \( \$15{,}882.72 + (1.9743 \cdot 4869.22) = \$25{,}496.02 \)

Figure: t-distribution with df = 169 − 2 = 167

In Excel:

Questions
19. Interpret the above 95% prediction interval for a single used vehicle with 50,000 miles, i.e., the 95% prediction interval for \( \text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No} \).
20. In the context of this example, explicitly explain the difference in the scope of inference between the 95% confidence interval and the 95% prediction interval.

Figure: Visual contrast of the 95% confidence interval and 95% prediction interval for used vehicles with 50,000 miles.

Plotting the 95% confidence and prediction bands across all observations in our dataset can be done in JMP using an overlay plot.

Figure: Overlay plot in JMP
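The prediction-interval arithmetic above can be checked with a short Python sketch. It combines the two sources of variability (standard error of the estimated mean, 426.38, and the Root Mean Square Error, 4850.512) under a square root, then applies Predicted Value ± c · Standard Error with c = 1.9743. All numbers are taken from this handout; `se_individual` and `t_interval` are hypothetical helper names. Because the code carries the unrounded standard error forward, its limits may differ from the hand calculation by about a cent.

```python
import math


def se_individual(se_of_mean, rmse):
    """Standard error for predicting a single observation.

    Adds the variability in the estimated mean function to the
    variability of individual observations around that mean.
    """
    return math.sqrt(se_of_mean ** 2 + rmse ** 2)


def t_interval(estimate, std_error, c):
    """Return (lower, upper) = estimate -/+ c * std_error."""
    margin = c * std_error
    return estimate - margin, estimate + margin


# Values quoted in this handout.
predicted = 15882.72  # predicted price at 50,000 miles
se_of_mean = 426.38   # standard error of the average prediction
rmse = 4850.512       # Root Mean Square Error
c = 1.9743            # 97.5th percentile of a t-distribution with 167 df

se_indiv = se_individual(se_of_mean, rmse)
pi_lower, pi_upper = t_interval(predicted, se_indiv, c)
ci_lower, ci_upper = t_interval(predicted, se_of_mean, c)

print(f"Standard error, single vehicle: {se_indiv:,.2f}")
print(f"95% prediction interval: ${pi_lower:,.2f} to ${pi_upper:,.2f}")
print(f"95% confidence interval: ${ci_lower:,.2f} to ${ci_upper:,.2f}")
print(f"PI width / CI width: {(pi_upper - pi_lower) / (ci_upper - ci_lower):.1f}")
```

The width ratio printed on the last line gives a numeric sense of how much wider the interval for a single vehicle is than the interval for the average price.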