Confidence / Prediction Intervals

Handout #6: Confidence and Prediction Intervals for Predictions
Section 6.1 : Modeling Used Car Prices
Example 6.1: The first example in this handout will use the CarPrices dataset from our course website.
This dataset includes various variables that are thought to influence the price of a vehicle. A snippet of
the data is provided here.
Simple Linear Regression Setup
• Model to be fit using only used cars, i.e., New=No.
• Response Variable: Price
• Predictor Variable: Miles
• Assume the following structure for mean and variance functions
   o $E(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \beta_0 + \beta_1 \cdot \text{Miles}$
   o $\mathrm{Var}(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \sigma^2$
The first step in running an analysis in JMP is to subset or filter the original dataset to include only the
used vehicles. This can be done easily in JMP. First, select Analyze > Distribution. Place New in the Y,
Columns box and click OK. Double-clicking New=No in the resulting output creates a new dataset that
includes only the New=No vehicles.
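For readers working outside JMP, the same subsetting step can be sketched in a few lines of Python. This is only an illustration: the rows below are made-up stand-ins for the CarPrices data, and the column names Price, Miles, and New are assumed from the variable names used in this handout.

```python
import csv
import io

# Hypothetical rows standing in for the CarPrices data; the column names
# Price, Miles, and New are assumed from the handout, not read from the file.
RAW = """Price,Miles,New
21500,0,Yes
15200,48000,No
9800,102000,No
18900,12,Yes
12400,75000,No
"""


def used_only(rows):
    """Keep only the rows where New equals 'No' (the used vehicles)."""
    return [row for row in rows if row["New"] == "No"]


rows = list(csv.DictReader(io.StringIO(RAW)))
used = used_only(rows)
print(len(rows), "rows in total;", len(used), "used vehicles")
```

With the real data, the io.StringIO object would be replaced by an open file handle for the exported dataset.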
JMP clearly names this new subset of the original data. This new subset is called CarPrices (New=No).
We can see below this dataset includes 170 of the observations from the original CarPrices dataset.
First, consider a scatterplot to visualize the relationship between Price | Miles, New=No.
Questions
1. What is the general pattern/trend/relationship between Price and Miles?
2. Do you think the assumed mean function stated above is appropriate? How about the assumed
form for the variance function? Discuss.
Lowess smoother for Price | Miles, New=No
Lowess smoother of the variability
Consider again the much simpler form for the mean function.
$E(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \beta_0 + \beta_1 \cdot \text{Miles}$
To fit this model in JMP, select Analyze > Fit Model. Put Price, i.e. the response, in the Y box, and Miles,
i.e. the predictor variable, in the Construct Model Effect box. Click Run.
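As a rough illustration of what Fit Model computes for a single predictor, the least-squares estimates and the Root Mean Square Error can be sketched in plain Python. The function name fit_simple_linear and the five (Miles, Price) pairs are hypothetical; they are not the CarPrices data and will not reproduce the JMP output.

```python
import math


def fit_simple_linear(x, y):
    """Return least-squares estimates (b0, b1) and the root mean square error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # estimated slope
    b0 = y_bar - b1 * x_bar        # estimated y-intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))  # n - 2 degrees of freedom
    return b0, b1, rmse


# Hypothetical (Miles, Price) pairs, not the CarPrices data.
miles = [20000, 45000, 60000, 90000, 120000]
price = [19000, 16500, 15200, 11800, 9100]

b0, b1, rmse = fit_simple_linear(miles, price)
print(f"Estimated mean function: Price = {b0:.0f} + ({b1:.4f}) * Miles")
print(f"Root Mean Square Error: {rmse:.1f}")
```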
The distribution of Price | Miles, New=No with the estimated mean function and the summary of fit
output as provided by JMP.
Questions
3. Does a linear mean function appear to fit this data well? Discuss.
4. The New=No filtered dataset had 170 observations (see JMP spreadsheet on p2), but it appears
only 169 were used in our regression analysis (see Observations (or Sum Wgts) = 169). Why is
this the case?
5. What is the best estimate for the variance in the conditional distribution, i.e.,
$\widehat{\mathrm{Var}}(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \hat{\sigma}^2$?
6. The standard deviation is simply the square root of the variance. Compute this quantity for the
conditional distribution given here. That is, compute
$\sqrt{\widehat{\mathrm{Var}}(\text{Price} \mid \text{Miles}, \text{New} = \text{No})} = \sqrt{\hat{\sigma}^2}$.
Does this match the Root Mean Square Error computed by JMP?
7. What is the interpretation of the Root Mean Square Error quantity in the context of this
problem?
8. What is the interpretation of the $R^2$ value for this model?
Consider again the form of the mean function and the parameter estimate portion of the output
provided by JMP.
$E(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \beta_0 + \beta_1 \cdot \text{Miles}$
Questions
9. What is the best estimate for the slope of the true mean function? That is, what is $\hat{\beta}_1$?
10. What is the best estimate for the y-intercept of the true mean function? That is, what is $\hat{\beta}_0$?
11. Write out the estimated mean function using the estimated parameters.
12. Interpret, in context and using layman's language, the slope in the above equation.
13. Interpret, in context and using layman's language, the y-intercept in the above equation.
The 95% confidence intervals for the parameters in the true mean function can be obtained by selecting
Show All Confidence Intervals under Regression Reports from the red drop-down menu in JMP.
Questions
14. Interpret, in context and using layman's language, the 95% confidence interval for $\beta_1$.
15. Interpret, in context and using layman's language, the 95% confidence interval for $\beta_0$.
The estimated mean function is given by the quantity
$\hat{E}(\text{Price} \mid \text{Miles}, \text{New} = \text{No}) = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Miles} = 20889 - 0.10 \cdot \text{Miles}$
This quantity can be used to estimate the average Price of a used car with 50,000 Miles. The math is
shown here.
$\hat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}) = 20889 - 0.10 \cdot 50000 = \$15{,}889$
The $15,889 value appears to be reasonable when we isolate vehicles near 50,000 miles in our original
scatterplot.
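The same arithmetic can be written as a small Python function, which makes it easy to evaluate the estimated mean function at other mileages. The function name estimated_mean_price is made up for this sketch; the default coefficients are the rounded estimates quoted above.

```python
def estimated_mean_price(miles, b0=20889.0, b1=-0.10):
    """Estimated mean Price for used cars with the given Miles.

    The defaults are the rounded estimates from the handout (20889 and -0.10).
    """
    return b0 + b1 * miles


estimate = estimated_mean_price(50_000)
print(f"Estimated average Price at 50,000 Miles: ${estimate:,.0f}")
```

Because the coefficients are rounded, results may differ slightly from the predicted values that JMP reports.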
Confidence Interval for Predictions
Akin to all other estimated quantities, we can expect variation to exist in this estimate. That is, a
different random sample will produce a different estimated price for a vehicle with 50,000 miles.
A 95% confidence interval for the quantity
$\hat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No})$
can be obtained directly in JMP by selecting Mean Confidence Interval from the Save Columns menu
item.
A 95% confidence interval for every observation in the dataset is provided in the JMP spreadsheet.
The JMP spreadsheet can be sorted by Miles in order to more easily find a 95% confidence interval for
vehicles with 50,000 miles.
From a review of the table above, we can see there are no observations that have exactly 50,000 miles.
That is, none of the confidence intervals provided are correct for the quantity
$\hat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No})$
Obtaining Output for a New Observation
To obtain a prediction, standard error, and confidence interval for a new observation, use the
Formula versions of these quantities as provided in JMP.
Next, create a pseudo observation in JMP with the desired characteristics. For our example here, this
pseudo observation will have Miles = 50000.
Questions
16. Does the predicted price (aside from rounding) provided by JMP agree with what we computed
above? Discuss.
17. What is the standard error for $\hat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No})$? Give a practical
interpretation of this quantity.
The formula for the 95% normal-based confidence interval for the average predicted value is given by
   o $\text{Lower Limit} = \text{Predicted Value} - c \cdot \text{Standard Error}$
   o $\text{Upper Limit} = \text{Predicted Value} + c \cdot \text{Standard Error}$
where c is the 97.5th percentile from a t-distribution with n-2 degrees of freedom.
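The two limits can be computed with a short Python helper. This is a sketch rather than JMP's own procedure: the function name t_interval is made up, and the inputs are the values quoted elsewhere in this handout (predicted value 15882.72, critical value 1.9743 for 167 degrees of freedom). Treating 426.38 as the standard error of the average prediction is an assumption based on the mean-function variability term used in the prediction-interval calculation.

```python
def t_interval(center, std_error, critical_value):
    """Return (lower, upper) = center -/+ critical_value * std_error."""
    margin = critical_value * std_error
    return center - margin, center + margin


# Values quoted elsewhere in the handout. Using 426.38 as the standard
# error of the average prediction is an assumption of this sketch.
predicted = 15882.72   # JMP predicted Price at Miles = 50000
se_mean = 426.38
c = 1.9743             # 97.5th percentile of a t-distribution with 167 df

lower, upper = t_interval(predicted, se_mean, c)
print(f"95% confidence interval: (${lower:,.2f}, ${upper:,.2f})")
```

The critical value is passed in rather than computed because the Python standard library has no t-distribution quantile function; a statistics package or spreadsheet can supply it.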
Task: Verify the calculations for the 95% confidence
interval for the average predicted value here.
t-distribution with df = 169 - 2 = 167
Lower Limit:
Upper Limit:
In Excel:
18. Interpret the 95% confidence interval for the average predicted value for a used vehicle with 50,000
miles, i.e., the 95% confidence interval for the quantity $\hat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No})$.
Comment
The above confidence interval is not appropriate when attempting to predict the Price for a single
vehicle, but instead is a reasonable range of values for the average predicted price for vehicles with
50,000 miles.
Confidence Interval               Prediction Interval
Average Predicted Price           Single Predicted Price
All vehicles with 50,000 Miles    Single Vehicle with 50,000 Miles
An overlay plot of the data, the estimated mean function, and the 95% confidence interval for the
average predicted price for vehicles with 50,000 miles.
Consider the plot of the estimated mean function over repeated samples. These plots were introduced
in a previous handout.
Notice that the variation in the estimated mean function is smaller for some values and larger for
others. In particular, the variation is smallest near the average miles and increases as miles either
increases or decreases. The reason for this is that all estimated linear mean functions must pass
through the point (Average Miles, Average Price). The average number of miles for our dataset is
78,442; thus, the standard error for the average prediction will be smallest when making predictions for
vehicles near 78,000 miles.
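The textbook expression for the standard error of the estimated mean in simple linear regression makes this pattern explicit. The symbols below are notation introduced only for this illustration: $x_0$ is the Miles value of interest, $\bar{x}$ is the average Miles in the data, $n$ is the number of observations used, and $\hat{\sigma}$ is the Root Mean Square Error.

```latex
\[
SE\big(\hat{E}(\text{Price} \mid \text{Miles} = x_0, \text{New} = \text{No})\big)
  = \hat{\sigma}\,
    \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\]
```

The term $(x_0 - \bar{x})^2$ is zero when $x_0$ equals the average Miles and grows as $x_0$ moves away from it in either direction, which is why the standard error is smallest near the average.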
The 95% confidence interval bands for an average prediction across all miles.
Prediction Interval for Predictions
Recall, a confidence interval is the appropriate quantity when interested in the average predicted price
for a vehicle with 50,000 miles.
$E(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No})$
However, a prediction interval is necessary when attempting to make predictions for a single vehicle.
$\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}$
The 95% prediction intervals can be obtained in JMP by selecting Indiv Confidence Interval from the Save
Columns menu in JMP.
The prediction intervals are placed in the spreadsheet in JMP.
Again, sort the spreadsheet by Miles to obtain prediction intervals for used vehicles with close to
50,000 miles.
Similar to what was done above, in order to obtain a 95% prediction interval for a new observation not
currently in the dataset, you must use the Formula versions of these quantities.
Obtaining a prediction interval for a vehicle with 50,000 miles.
The formula for the 95% normal-based prediction interval for the single value is given by
   o $\text{Lower Limit} = \text{Predicted Value} - c \cdot \text{Standard Error}$
   o $\text{Upper Limit} = \text{Predicted Value} + c \cdot \text{Standard Error}$
where c is the 97.5th percentile from a t-distribution with n-2 degrees of freedom.
For some reason, JMP does not compute the standard error for an individual prediction; notice the dot
in the StdErr Indiv Price column for our new observation. The standard error for a prediction interval is
computed as follows.
$\text{Standard Error} = \sqrt{\underbrace{\mathrm{Var}\big(\hat{E}(\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No})\big)}_{\text{Variability in Mean Function}} + \underbrace{\mathrm{Var}(\text{Price} \mid \text{Miles}, \text{New} = \text{No})}_{\text{Variability in individual observations}}}$
$= \sqrt{426.38^2 + 4850.512^2}$
$= 4869.22$
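This calculation is easy to check in Python. The function name se_individual is made up for this sketch; its two inputs are the values used in the calculation above, the standard error of the mean function and the estimated standard deviation of individual observations.

```python
import math


def se_individual(se_mean, rmse):
    """Standard error for predicting a single observation.

    Adds the variance of the estimated mean function to the variance of
    individual observations and takes the square root of the sum.
    """
    return math.sqrt(se_mean ** 2 + rmse ** 2)


se_pred = se_individual(426.38, 4850.512)
print(f"Standard error for an individual prediction: {se_pred:.2f}")
```

Note that the individual-observation term dominates: the result is only slightly larger than 4850.512, so collecting more data would narrow a confidence interval far more than it would narrow a prediction interval.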
The actual calculations for the 95% prediction interval, using a t-distribution with df = 169 - 2 = 167.
Lower Limit: $15,882.72 - (1.9743 * 4869.22) = $6,269.42
Upper Limit: $15,882.72 + (1.9743 * 4869.22) = $25,496.02
In Excel:
Questions
19. Interpret the above 95% prediction interval for a single used vehicle with 50,000 miles, i.e., the
95% prediction interval for $\text{Price} \mid \text{Miles} = 50000, \text{New} = \text{No}$.
20. In the context of this example, explicitly explain the difference in the scope of inference between
the 95% confidence interval and the 95% prediction interval.
Visual contrast of the 95% confidence interval and 95% prediction interval for used vehicles with 50,000
miles.
Plotting the 95% confidence and prediction bands across all observations in our dataset in JMP can be
done using an overlay plot.
Overlay plot in JMP
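The numbers behind such an overlay plot can be sketched in plain Python by evaluating both intervals over a grid of Miles values. Everything here is illustrative: the function name interval_bands, the six data points, and the grid are hypothetical, and the critical value 2.776 is the 97.5th percentile of a t-distribution with 4 degrees of freedom, matching the six made-up observations rather than the CarPrices data.

```python
import math


def interval_bands(x, y, grid, critical_value):
    """Confidence and prediction bands for a simple linear regression.

    Returns one tuple per grid value:
    (x0, fit, ci_low, ci_high, pi_low, pi_high).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = sse / (n - 2)  # estimated variance of individual observations

    bands = []
    for x0 in grid:
        fit = b0 + b1 * x0
        var_mean = sigma2 * (1 / n + (x0 - x_bar) ** 2 / sxx)
        ci_margin = critical_value * math.sqrt(var_mean)
        pi_margin = critical_value * math.sqrt(var_mean + sigma2)
        bands.append((x0, fit, fit - ci_margin, fit + ci_margin,
                      fit - pi_margin, fit + pi_margin))
    return bands


# Hypothetical (Miles, Price) pairs, not the CarPrices data.
miles = [20000, 45000, 60000, 90000, 120000, 150000]
price = [19000, 16500, 15200, 11800, 9100, 7400]

bands = interval_bands(miles, price, [25000, 80000, 140000], 2.776)
for x0, fit, ci_lo, ci_hi, pi_lo, pi_hi in bands:
    print(f"Miles={x0:>6}: fit={fit:8.0f}  "
          f"CI=({ci_lo:8.0f}, {ci_hi:8.0f})  PI=({pi_lo:8.0f}, {pi_hi:8.0f})")
```

In this sketch the prediction band is wider than the confidence band at every grid value, and the confidence band is narrowest at the grid value closest to the average Miles, the same two features visible in the overlay plot.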