Prediction concerning the response Y

Where does this topic fit in?
• Model formulation
• Model estimation
• Model evaluation
• Model use

Translating two research questions into two reasonable statistical answers
• What is the mean weight, μ, of all American women, aged 18-24?
  – If we want to estimate μ, what would be a good estimate?
• What is the weight, y, of a randomly selected American woman, aged 18-24?
  – If we want to predict y, what would be a good prediction?

Could we do better by taking into account a person’s height?
[Figure: weight vs. height, with the fitted line $\hat{w} = -266.5 + 6.1h$ and the horizontal line $\bar{y} = 158.8$]

One thing to estimate (μY) and one thing to predict (y)
[Figure: College entrance test score vs. High school gpa, showing the mean line $\mu_Y = E(Y) = \beta_0 + \beta_1 x$ and an individual response $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$]

Two different research questions
• What is the mean response μY when the predictor value is xh?
• What value will a new observation Ynew be when the predictor value is xh?

Example: Skin cancer mortality and latitude
• What is the expected (mean) mortality rate for all locations at 40° N latitude?
• What is the predicted mortality rate for 1 new randomly selected location at 40° N?

Example: Skin cancer mortality and latitude
[Figure: Regression plot of Mortality vs. Latitude. Mortality = 389.189 - 5.97764 Latitude; S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%]
$$\hat{y} = 389.19 - 5.9776(40) = 150.1$$

“Point estimators”
$\hat{y}_h = b_0 + b_1 x_h$ is the best answer to each research question. That is, it is:
• the best guess of the mean response at xh
• the best guess of a new observation at xh
But, as always, to be confident in the answer to our research question, we should put an interval around our best guess.

It is dangerous to “extrapolate” beyond scope of model.
[Figure: Regression plot of colonies vs. conc, with conc from 0 to 6. colonies = 16.0667 + 1.61576 conc; S = 2.67546, R-Sq = 66.8%, R-Sq(adj) = 63.5%]

It is dangerous to “extrapolate” beyond scope of model.
[Figure: Regression plot of colonies vs. conc, with conc from 0 to 10. colonies = 15.0205 + 3.22113 conc - 0.276956 conc**2; S = 2.74819, R-Sq = 69.6%, R-Sq(adj) = 64.5%]

A confidence interval for the population mean response μY … when the predictor value is xh

Again, what are we estimating?
[Figure: College entrance test score vs. High school gpa, showing the mean line $\mu_Y = E(Y) = \beta_0 + \beta_1 x$ and an individual response $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$]

(1-α)100% t-interval for mean response μY
Formula in words:
Sample estimate ± (t-multiplier × standard error)
Formula in notation:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE\left(\frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right)}$$

Example: Skin cancer mortality and latitude

Predicted Values for New Observations
New Obs     Fit   SE Fit           95.0% CI           95.0% PI
      1  150.08     2.75   (144.56, 155.61)   (111.23, 188.93)

Values of Predictors for New Observations
New Obs   Lat
      1  40.0

Factors affecting the length of the confidence interval for μY
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE\left(\frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right)}$$
• As the confidence level decreases, …
• As MSE decreases, …
• As the sample size increases, …
• The more spread out the predictor values, …
• The closer xh is to the sample mean, …

Does the estimate of μY when xh = 1 vary more here …?
[Figure: y vs. x, with x from 1 to 10]
Variable     N  StDev
yhat(x=1)    5  0.320

… or here?
[Figure: y vs. x, with x from 1 to 10]
Variable     N  StDev
yhat(x=1)    5  2.127

Does the estimate of μY vary more when xh = 1 or when xh = 5.5?
[Figure: y vs. x, with x from 1 to 10]
Variable       N  StDev
yhat(x=1)      5  2.127
yhat(x=5.5)    5  0.512

Example: Skin cancer mortality and latitude

Predicted Values for New Observations
New     Fit   SE Fit         95.0% CI          95.0% PI
  1  150.08     2.75   (144.6, 155.6)   (111.2, 188.93)
  2  221.82     7.42   (206.9, 236.8)   (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs  Latitude
      1      40.0
      2      28.0
Mean of Lat = 39.533

When is it okay to use the confidence interval for μY formula?
• When xh is a value within the scope of the model
  – xh does not have to be one of the actual x values in the data set.
• When the “LINE” assumptions are met.
  – The formula works okay even if the error terms are only approximately normal.
  – If you have a large sample, the error terms can even deviate substantially from normality.

Prediction interval for a new response Ynew

Again, what are we predicting?
[Figure: College entrance test score vs. High school gpa, showing the mean line $\mu_Y = E(Y) = \beta_0 + \beta_1 x$ and an individual response $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$]

(1-α)100% prediction interval for new response Ynew
Formula in words:
Sample prediction ± (t-multiplier × standard error)
Formula in notation:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right)}$$

Example: Skin cancer mortality and latitude

Predicted Values for New Observations
New Obs     Fit   SE Fit           95.0% CI           95.0% PI
      1  150.08     2.75   (144.56, 155.61)   (111.23, 188.93)

Values of Predictors for New Observations
New Obs   Lat
      1  40.0

When is it okay to use the prediction interval for Ynew formula?
• When xh is a value within the scope of the model
  – xh does not have to be one of the actual x values in the data set.
• When the “LINE” assumptions are met.
  – The formula for the prediction interval depends strongly on the assumption that the error terms are normally distributed.

What’s the difference in the two formulas?
Confidence interval for μY:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE\left(\frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right)}$$
Prediction interval for Ynew:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right)}$$

Prediction of Ynew if the mean μY is known
Suppose it were known that the mean skin cancer mortality at xh = 40° N is 150 deaths per million (with variance 400). What is the predicted skin cancer mortality in Columbus, Ohio?
[Figure: Normal curve for Mortality (axis from 90 to 210), centered at 150, with the central area 0.95 marked]

And then reality sets in
• The mean μY is not known.
  – Estimate it with the predicted response ŷh.
  – The cost of using ŷh to estimate μY is the variance of ŷh.
• The variance σ² is not known.
  – Estimate it with MSE.

Variance of the prediction
The variation in the prediction of a new response depends on two components:
1. the variation due to estimating the mean μY with ŷh
2. the variation in Y
$$\sigma^2 + \sigma^2(\hat{Y}_h)$$
which is estimated by:
$$MSE\left[1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right] = MSE + MSE\left[\frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right]$$

What’s the effect of the difference in the two formulas?
Confidence interval for μY:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE\left(\frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right)}$$
Prediction interval for Ynew:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum (x_i-\bar{x})^2}\right)}$$
• A (1-α)100% confidence interval for μY at xh will always be narrower than a (1-α)100% prediction interval for Ynew at xh.
• The confidence interval’s standard error can approach 0, whereas the prediction interval’s standard error can never fall below √MSE because of the extra "1" under the square root.

Confidence intervals and prediction intervals for response in Minitab
• Stat >> Regression >> Regression …
• Specify response and predictor(s).
• Select Options…
  – In “Prediction intervals for new observations” box, specify either the X value or a column name containing multiple X values.
  – Specify confidence level (default is 95%).
• Click on OK. Click on OK.
• Results appear in session window.
[Screenshots: Minitab dialog boxes and worksheet; worksheet column C6 holds the new latitude values 40 and 28]

Example: Skin cancer mortality and latitude

Predicted Values for New Observations
New     Fit   SE Fit         95.0% CI          95.0% PI
  1  150.08     2.75   (144.6, 155.6)   (111.2, 188.93)
  2  221.82     7.42   (206.9, 236.8)   (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs  Latitude
      1      40.0
      2      28.0
Mean of Lat = 39.533

A plot of the confidence interval and prediction interval in Minitab
• Stat >> Regression >> Fitted line plot …
• Specify predictor and response.
• Under Options …
  – Select Display confidence bands.
  – Select Display prediction bands.
  – Specify desired confidence level (95% default)
• Select OK. Select OK.
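
Outside Minitab, both bands can be sketched by evaluating the confidence interval and prediction interval formulas over a grid of xh values. The following is a minimal Python sketch, not Minitab's implementation. The data are hypothetical, and the function name interval_bands and its t_mult argument are assumptions made for illustration. The t-multiplier is read from a t table and passed in rather than computed.

```python
import math

def interval_bands(x, y, x_grid, t_mult):
    """Return (x_h, fit, ci_low, ci_high, pi_low, pi_high) for each x_h.

    t_mult is the t-multiplier t_(alpha/2, n-2), supplied by the caller.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # least squares slope
    b0 = y_bar - b1 * x_bar             # least squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    rows = []
    for xh in x_grid:
        fit = b0 + b1 * xh
        dist_term = 1 / n + (xh - x_bar) ** 2 / sxx
        se_fit = math.sqrt(mse * dist_term)         # for the mean response
        se_pred = math.sqrt(mse * (1 + dist_term))  # for a new response
        rows.append((xh, fit,
                     fit - t_mult * se_fit, fit + t_mult * se_fit,
                     fit - t_mult * se_pred, fit + t_mult * se_pred))
    return rows

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_MULT = 3.182  # t_(0.025, 3) from a t table: 95% intervals with n = 5

bands = interval_bands(x, y, x, T_MULT)
for xh, fit, ci_lo, ci_hi, pi_lo, pi_hi in bands:
    print(f"x={xh:.1f} fit={fit:.2f} "
          f"CI=({ci_lo:.2f}, {ci_hi:.2f}) PI=({pi_lo:.2f}, {pi_hi:.2f})")
```

In this sketch the confidence band is narrowest at the sample mean of x (here x = 3), and the prediction band is wider than the confidence band at every xh.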
A plot of the confidence interval and prediction interval in Minitab
[Figure: Minitab fitted line plot of Mortality vs. Latitude showing the regression line, 95% CI bands, and 95% PI bands. Mortality = 389.189 - 5.97764 Latitude; S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%]
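
As a rough check on the two formulas, the reported intervals at latitude 40 can be approximately reproduced from the Minitab summary values alone, using the fact that the prediction standard error is sqrt(MSE + (SE Fit)²). This is a sketch under one loudly stated assumption: the slides do not give the sample size, so the t-multiplier 2.0117 (the value for 47 degrees of freedom, that is, n = 49) is assumed here. The variable names are illustrative.

```python
import math

fit = 150.08     # Fit at latitude 40, from the Minitab output
se_fit = 2.75    # SE Fit, from the Minitab output
s = 19.1150      # S = sqrt(MSE), from the regression output
t_mult = 2.0117  # ASSUMPTION: t_(0.025, 47), i.e. n = 49 (n is not stated)

# Standard error for a new response: sqrt(MSE + variance of the fit)
se_pred = math.sqrt(s ** 2 + se_fit ** 2)

conf_int = (fit - t_mult * se_fit, fit + t_mult * se_fit)
pred_int = (fit - t_mult * se_pred, fit + t_mult * se_pred)

print(f"95% CI: ({conf_int[0]:.2f}, {conf_int[1]:.2f})")
print(f"95% PI: ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
```

Under that assumption the results agree with the reported 95.0% CI (144.56, 155.61) and 95.0% PI (111.23, 188.93) to within the rounding of SE Fit.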