Topic 6: Estimation and Prediction of Yh

Outline
• Estimation and inference of E(Yh)
• Prediction of a new observation
• Construction of a confidence band for the entire regression line

Estimation of E(Yh)
• E(Yh) = μh = β0 + β1Xh, the mean value of Y for the subpopulation with X = Xh
• We will estimate E(Yh) by $\hat{\mu}_h = b_0 + b_1 X_h$
• KNNL use $\hat{Y}_h$ for this estimate; see equation (2.28) on p. 52

Theory for Estimation of E(Yh)
• $\hat{\mu}_h$ is Normal with mean μh and variance
  \[ \sigma^2(\hat{\mu}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \]
• The Normality is a consequence of the fact that b0 + b1Xh is a linear combination of the Yi's
• See KNNL pp. 52-54 for details

Application of the Theory
• We estimate $\sigma^2(\hat{\mu}_h)$ by
  \[ s^2(\hat{\mu}_h) = s^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \]
  where $s^2$ = MSE
• It then follows that
  \[ \frac{\hat{\mu}_h - E(Y_h)}{s(\hat{\mu}_h)} \sim t(n-2) \]
• Confidence intervals and significance tests follow from this result

95% Confidence Interval for E(Yh)
• $\hat{\mu}_h \pm t_c\, s(\hat{\mu}_h)$, where $t_c$ = t(.975, n-2)
• NOTE: significance tests can be constructed, but they are rarely used in practice

Toluca Company Example (p. 19)
• Manufactures refrigeration equipment
• One replacement part is manufactured in lots of varying sizes
• Company wants to determine the optimum lot size
• To do this, the company first needs to describe the relationship between work hours and lot size

[Figure: scatterplot of hours versus lotsize with the fitted regression line]

SAS CODE
  ***Generating the data set***;
  data toluca;
    infile '../data/CH01TA01.txt';
    input lotsize hours;
  /* two extra X values with hours missing: scored by the fit but not used in it */
  data other;
    lotsize=65; output;
    lotsize=100; output;
  data toluca1;
    set toluca other;
  proc print data=toluca1;
  run;

SAS CODE
  ***Generating the confidence intervals for all values of X in the data set***;
  proc reg data=toluca1;
    model hours=lotsize/clm;
    id lotsize;
  run;
The clm option generates confidence intervals for the mean.

Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1             62.36586         26.17743      2.38     0.0259
  lotsize      1              3.57020          0.34697     10.29     <.0001

Output Statistics
  Obs   lotsize   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Mean
    1        80             399.0000          347.9820                  10.3628   326.5449   369.4191
   25        70             323.0000          312.2800                   9.7647   292.0803   332.4797
   26        65                    .          294.4290                   9.9176   273.9129   314.9451
   27       100                    .          419.3861                  14.2723   389.8615   448.9106

Notes
• The standard error is affected by how far Xh is from $\bar{X}$ (see Figure 2.6)
• Recall the teeter-totter idea: a change in the slope has a bigger impact on Y as you move away from $\bar{X}$

Prediction of Yh(new)
• We want to predict the value of a new observation at X = Xh
• Model: Yh(new) = β0 + β1Xh + ε
• Since E(ε) = 0, the point prediction is the same value as the estimate of E(Yh)
• Note!! The prediction interval, however, relies heavily on the assumption that the ε are Normally distributed

Prediction of Yh(new)
• Var(pred) = Var(Yh(new) − $\hat{\mu}_h$) = Var($\hat{\mu}_h$) + Var(ε), which we estimate by
  \[ s^2(\text{pred}) = s^2 \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \]
• It then follows that
  \[ \frac{Y_{h(new)} - \hat{\mu}_h}{s(\text{pred})} \sim t(n-2) \]

Notes
• The procedure can be modified for the mean of m observations at X = Xh (see 2.39a and 2.39b on page 60)
• The standard error is affected by how far Xh is from $\bar{X}$ (see Figure 2.6)

SAS CODE
  ***Generating the prediction intervals for all values of X in data set***;
  proc reg data=toluca1;
    model hours=lotsize/cli;
    id lotsize;
  run;
The cli option generates a prediction interval for a new observation.

Output Statistics
  Obs   lotsize   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Predict
    1        80             399.0000          347.9820                  10.3628   244.7333   451.2307
   25        70             323.0000          312.2800                   9.7647   209.2811   415.2789
   26        65                    .          294.4290                   9.9176   191.3676   397.4904
   27       100                    .          419.3861                  14.2723   314.1604   524.6117
The Std Error Mean Predict values are wrong for prediction: they are the same as before and do not include the variability about the regression line.

Notes
• The standard error (Std Error Mean Predict) given in this output is the standard error of $\hat{\mu}_h$, not s(pred)
• The prediction interval is correct and wider than the previous confidence interval

Notes
• To get the correct standard error, add the variance about the regression line:
  \[ s(\text{pred}) = \sqrt{s^2(\hat{\mu}_h) + \text{MSE}} \]

Confidence band for regression line
• $\hat{\mu}_h \pm W s(\hat{\mu}_h)$, where $W^2 = 2F(1-\alpha;\, 2,\, n-2)$
• This gives combined "confidence intervals" for all Xh
• Boundary values of the confidence band define a hyperbola
• At any given Xh, the band is wider than the single CI

Confidence band for regression line
• Theory comes from the joint confidence region for (β0, β1), which is an ellipse (Stat 524)
• We can find an alpha for tc that gives the same results
• We find W² and then find the alpha for tc that will give W = tc

SAS CODE
  data a1;
    n=25; alpha=.10;
    dfn=2; dfd=n-2;
    tsingle=tinv(1-alpha/2,dfd);      /* t critical value for a single 90% CI */
    w2=2*finv(1-alpha,dfn,dfd);       /* W squared for the 90% confidence band */
    w=sqrt(w2);
    alphat=2*(1-probt(w,dfd));        /* alpha at which tc equals W */
    t_c=tinv(1-alphat/2,dfd);
    output;
  proc print data=a1;
  run;

SAS OUTPUT
   n   alpha   dfn   dfd   tsingle        w2         w     alphat       t_c
  25     0.1     2    23   1.71387   5.09858   2.25800   0.033740   2.25800
• tsingle is used for a single 90% CI
• w (equal to t_c) is used for the 90% confidence band

SAS CODE
  /* alphat = 0.0337, so the 90% band corresponds to roughly 97% pointwise limits */
  symbol1 v=circle i=rlclm97;
  proc gplot data=toluca;
    plot hours*lotsize;
  run;

[Figure: hours versus lotsize with fitted line and 90% confidence band]

Estimation of E(Yh) and Prediction of Yh
• $\hat{\mu}_h = b_0 + b_1 X_h$
• \[ s^2(\hat{\mu}_h) = s^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \]
• \[ s^2(\text{pred}) = s^2 \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \]

SAS CODE
  /* confidence intervals */
  symbol1 v=circle i=rlclm95;
  proc gplot data=toluca;
    plot hours*lotsize;
  run;
  /* prediction intervals */
  symbol1 v=circle i=rlcli95;
  proc gplot data=toluca;
    plot hours*lotsize;
  run;

[Figure: hours versus lotsize, confidence band]
[Figure: hours versus lotsize, confidence intervals]
[Figure: hours versus lotsize, prediction intervals]
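Worked check
As a check on the interval formulas, the 95% limits at Xh = 65 can be reproduced by hand from the numbers printed in the two Output Statistics listings. This is a sketch, not part of the original notes. The only outside value is the critical value t(.975; 23) ≈ 2.0687, taken from a t table, and all results are rounded.

```latex
\begin{align*}
\text{CI for } E(Y_h):\quad
  & \hat{\mu}_h \pm t_c\, s(\hat{\mu}_h)
    = 294.43 \pm 2.0687\,(9.9176)
    \approx 294.43 \pm 20.52
    = (273.91,\ 314.95) \\
\text{PI half-width}:\quad
  & (397.4904 - 191.3676)/2 \approx 103.06 \\
\text{implied } s(\text{pred}):\quad
  & 103.06 / 2.0687 \approx 49.82 \\
\text{implied MSE}:\quad
  & s^2(\text{pred}) - s^2(\hat{\mu}_h)
    \approx 49.82^2 - 9.9176^2
    \approx 2482.0 - 98.4 = 2383.6
\end{align*}
```

The first line agrees with the 95% CL Mean limits printed for observation 26. The last line shows that the gap between the two squared standard errors is the variance about the regression line, roughly 2384 up to rounding.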
Background Reading
• Program topic6.sas has the code for the various plots and calculations
• KNNL Sections 2.7, 2.8, and 2.9
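Sketch: saving both standard errors in one step
topic6.sas is not reproduced here. The following is a minimal sketch, not the course program, of one way to obtain s(pred) directly instead of adding MSE by hand. It assumes the same raw file path used in these notes and n = 25. The data set names extra and est and the variable names yhat, se_mean, se_pred, lo_mean, hi_mean, lo_pred, hi_pred, lo_band, and hi_band are hypothetical, chosen for illustration. STDP=, STDI=, LCLM=, UCLM=, LCL=, and UCL= are keywords of the PROC REG OUTPUT statement.

```sas
/* Read the Toluca data (path as given in the notes) */
data toluca;
  infile '../data/CH01TA01.txt';
  input lotsize hours;
run;

/* Hypothetical helper data set: two X values with hours missing, */
/* so they receive predictions but do not affect the fit          */
data extra;
  lotsize=65;  output;
  lotsize=100; output;
run;

data toluca1;
  set toluca extra;
run;

/* STDP = s(mu-hat), STDI = s(pred); limits default to 95% */
proc reg data=toluca1;
  model hours=lotsize;
  output out=est p=yhat stdp=se_mean stdi=se_pred
         lclm=lo_mean uclm=hi_mean lcl=lo_pred ucl=hi_pred;
run;
quit;

/* 90% confidence band limits, assuming n = 25 so that dfd = 23 */
data est;
  set est;
  w = sqrt(2*finv(0.90, 2, 23));
  lo_band = yhat - w*se_mean;
  hi_band = yhat + w*se_mean;
run;

/* Show only the two added rows */
proc print data=est;
  where hours = .;
run;
```

In the printed rows, se_pred**2 - se_mean**2 should be the same for every observation and equal to MSE, which is the relationship stated in the Notes on s(pred).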