Topic_6

Topic 6: Estimation and Prediction of Yh
Outline
• Estimation and inference of E(Yh)
• Prediction of a new observation
• Construction of a confidence band
for the entire regression line
Estimation of E(Yh)
• E(Yh) = μh = β0 + β1Xh, the mean value
of Y for the subpopulation with X=Xh
• We will estimate E(Yh) by $\hat{\mu}_h = b_0 + b_1 X_h$
• KNNL use Ŷh for this estimate; see
equation (2.28) on p. 52
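As a quick illustration, the estimate can be evaluated by hand. The sketch below plugs in the Toluca estimates b0 = 62.36586 and b1 = 3.57020 from the Parameter Estimates output later in these notes; Xh = 65 is chosen only because it also appears in that output.

```latex
\hat{\mu}_h = b_0 + b_1 X_h = 62.36586 + 3.57020(65) = 62.36586 + 232.06300 \approx 294.43
```

This agrees with the Predicted Value of 294.4290 reported for lotsize = 65.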
Theory for Estimation of E(Yh)
• $\hat{\mu}_h$ is Normal with mean μh and variance
$$\sigma^2(\hat{\mu}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$
• The Normality is a consequence of the fact
that b0 + b1Xh is a linear combination of
Yi’s
• See KNNL pp 52-54 for details
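The linear-combination claim and the variance formula can be sketched in two lines. The weights k_i are notation introduced here for illustration, and the second line assumes the Yi are independent with common variance σ², using the facts that the k_i sum to zero and their squares sum to 1/Σ(Xi − X̄)².

```latex
\begin{aligned}
\hat{\mu}_h &= b_0 + b_1 X_h = \bar{Y} + b_1 (X_h - \bar{X})
  = \sum_i \left[ \frac{1}{n} + (X_h - \bar{X}) k_i \right] Y_i,
  \qquad k_i = \frac{X_i - \bar{X}}{\sum_j (X_j - \bar{X})^2} \\
\sigma^2(\hat{\mu}_h) &= \sigma^2 \sum_i \left[ \frac{1}{n} + (X_h - \bar{X}) k_i \right]^2
  = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right]
\end{aligned}
```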
Application of the Theory
• We estimate $\sigma^2(\hat{\mu}_h)$ by
$$s^2(\hat{\mu}_h) = s^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$
• It then follows that
$$\frac{\hat{\mu}_h - E(Y_h)}{s(\hat{\mu}_h)} \sim t(n-2)$$
• Confidence intervals and significance
tests follow directly from this result
95% Confidence Interval for E(Yh)
• $\hat{\mu}_h \pm t_c \, s(\hat{\mu}_h)$
where tc = t(.975, n-2)
• NOTE: significance tests can be
constructed but they are rarely used in
practice
Toluca Company Example (p. 19)
• Manufactures refrigeration equipment
• One replacement part manufactured in
lots of varying sizes
• Company wants to determine the
optimum lot size
• To do this, company needs to first
describe the relationship between work
hours and lot size
Scatterplot w/ regr line
[Figure: scatterplot of hours (vertical axis, 100 to 600) against lotsize (horizontal axis, 20 to 120) with the fitted regression line]
SAS CODE
***Generating the data set***;
data toluca;
  infile '../data/CH01TA01.txt';
  input lotsize hours;
run;
* Extra X values with no response, so PROC REG reports estimates at them;
data other;
  lotsize=65; output;
  lotsize=100; output;
run;
data toluca1; set toluca other;
run;
proc print data=toluca1;
run;
SAS CODE
***Generating the confidence
intervals for all values of X in
the data set***;
proc reg data=toluca1;
  model hours=lotsize/clm;
  id lotsize;
run;
The clm option generates confidence intervals for the mean.
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1   62.36586             26.17743          2.38     0.0259
lotsize      1    3.57020              0.34697         10.29     <.0001
Output Statistics
Obs   lotsize   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Mean
  1      80     399.0000             347.9820          10.3628                  326.5449   369.4191
 25      70     323.0000             312.2800           9.7647                  292.0803   332.4797
 26      65     .                    294.4290           9.9176                  273.9129   314.9451
 27     100     .                    419.3861          14.2723                  389.8615   448.9106
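The 95% CL Mean column can be checked against the formula for the confidence interval, estimate ± tc × standard error. This is a minimal sketch, assuming n = 25 (observations 26 and 27 are the added X values) and copying the lotsize = 65 row by hand. The data set name ci_check and its variable names are made up for illustration.

```sas
* Sketch: rebuild the 95% CI for E(Yh) at lotsize=65 from printed values;
data ci_check;
  n = 25;                      /* number of observed lots (assumed from the output) */
  muhat = 294.4290;            /* Predicted Value at lotsize=65 */
  se_mean = 9.9176;            /* Std Error Mean Predict at lotsize=65 */
  tc = tinv(0.975, n-2);       /* t(.975, n-2) */
  lower = muhat - tc*se_mean;
  upper = muhat + tc*se_mean;
run;
proc print data=ci_check;
run;
```

The printed limits should agree with 273.9129 and 314.9451 up to rounding.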
Notes
• Standard error is affected by how far Xh
is from X̄ (see Figure 2.6)
• Recall the teeter-totter idea: a change in
the slope has a bigger impact on Y as
you move away from X̄
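The effect of distance from X̄ can be seen by evaluating the standard error at several values of Xh. This sketch assumes X̄ = 70, Σ(Xi − X̄)² = 19800 and MSE = 2383.716, which are the values KNNL report for the Toluca data and are not shown in these notes. The data set name se_by_xh is hypothetical.

```sas
* Sketch: s(muhat_h) as a function of Xh, under the assumed summary values;
data se_by_xh;
  n = 25; xbar = 70; sxx = 19800; mse = 2383.716;   /* assumed, from KNNL */
  do xh = 65, 70, 80, 100;
    dist = xh - xbar;                                /* distance from the mean of X */
    se_mean = sqrt(mse * (1/n + dist**2 / sxx));
    output;
  end;
run;
proc print data=se_by_xh;
  var xh dist se_mean;
run;
```

If the assumed values are right, se_mean should be smallest at xh = 70 and should reproduce the Std Error Mean Predict column of the output (9.9176, 9.7647, 10.3628, 14.2723) up to rounding.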
Prediction of Yh(new)
• Want to predict value for a new
observation at X=Xh
Note!!
• Model: Yh(new) = β0 + β1Xh + ε
• Since E(ε)=0, the predicted value is the
same as the estimate of E(Yh)
• Prediction interval, however, relies
heavily on the assumption that the ε are
Normally distributed
Prediction of Yh(new)
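• Var(Yh(new) − $\hat{\mu}_h$) = Var($\hat{\mu}_h$) + Var(ε), which we estimate by
$$s^2(\text{pred}) = s^2 \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$$
• It then follows that
$$\frac{Y_{h(new)} - \hat{\mu}_h}{s(\text{pred})} \sim t(n-2)$$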
Notes
• Procedure can be modified for the
mean of m observations at X=Xh (see
equations (2.39a) and (2.39b) on p. 60)
• Standard error is affected by how far Xh
is from X̄ (see Figure 2.6)
SAS CODE
***Generating the prediction intervals
for all values of X in data set***;
proc reg data=toluca1;
  model hours=lotsize/cli;
  id lotsize;
run;
The cli option generates a prediction interval for a new observation.
Output Statistics
Obs   lotsize   Dependent Variable   Predicted Value   Std Error Mean Predict   95% CL Predict
  1      80     399.0000             347.9820          10.3628                  244.7333   451.2307
 25      70     323.0000             312.2800           9.7647                  209.2811   415.2789
 26      65     .                    294.4290           9.9176                  191.3676   397.4904
 27     100     .                    419.3861          14.2723                  314.1604   524.6117
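Note: the Std Error Mean Predict values are the same as before.
They are the wrong standard errors for prediction because they
do not include the variability about the regression line.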
Notes
• The standard error (Std Error
Mean Predict) given in this output
is the standard error of $\hat{\mu}_h$,
not s(pred)
• The prediction interval is correct and
wider than the previous confidence
interval
Notes
• To get the correct standard error we need to
add the variance about the regression
line
$$s(\text{pred}) = \sqrt{s^2(\hat{\mu}_h) + MSE}$$
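Since PROC REG prints the prediction limits but not s(pred), one way to see the relationship is to back s(pred) out of the printed interval and then recover the MSE it implies. This is a sketch under the assumption n = 25, using the lotsize = 65 row of the output. The data set name pred_check and its variable names are hypothetical.

```sas
* Sketch: back out s(pred) and the implied MSE from the printed output;
data pred_check;
  n = 25;                                 /* assumed number of observed lots */
  se_mean = 9.9176;                       /* Std Error Mean Predict at lotsize=65 */
  lower = 191.3676; upper = 397.4904;     /* 95% CL Predict at lotsize=65 */
  tc = tinv(0.975, n-2);
  s_pred = (upper - lower) / (2*tc);      /* interval half-width divided by tc */
  mse_implied = s_pred**2 - se_mean**2;   /* s2(pred) = s2(muhat_h) + MSE */
run;
proc print data=pred_check;
run;
```

With these inputs s_pred should come out near 49.8 and mse_implied near 2384, which is far larger than s²(μ̂h). This is why the prediction intervals are so much wider than the confidence intervals.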
Confidence band for regression line
• $\hat{\mu}_h \pm W \, s(\hat{\mu}_h)$
where $W^2 = 2F(1-\alpha;\ 2,\ n-2)$
• This gives combined “confidence
intervals” for all Xh
• Boundary values of confidence
bands define a hyperbola
• Will be wider at Xh than single CI
Confidence band for regression line
• Theory comes from the joint
confidence region for (β0, β1), which
is an ellipse (Stat 524)
• We can find an alpha for tc that
gives the same results
• We find W2 and then find the alpha
for tc that will give W = tc
SAS CODE
data a1; n=25; alpha=.10;
dfn=2; dfd=n-2;
tsingle=tinv(1-alpha/2,dfd);
w2=2*finv(1-alpha,dfn,dfd);
w=sqrt(w2);
alphat=2*(1-probt(w,dfd));
t_c=tinv(1-alphat/2,dfd);
output;
proc print data=a1;
run;
SAS OUTPUT
 n   alpha   dfn   dfd   tsingle   w2        w         alphat     t_c
25   0.1     2     23    1.71387   5.09858   2.25800   0.033740   2.25800
tsingle is used for a single 90% CI; w (equivalently t_c) is used for the 90% confidence band.
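To make the difference between the two multipliers concrete, the sketch below applies each to the lotsize = 65 row of the earlier Output Statistics (estimate 294.4290, standard error 9.9176). The choice of that row is arbitrary and the arithmetic is rounded to three decimals.

```latex
\begin{aligned}
\text{90\% band at } X_h = 65: \quad & 294.4290 \pm 2.25800\,(9.9176) = 294.4290 \pm 22.394
  \;\Rightarrow\; (272.035,\ 316.823) \\
\text{single 90\% CI at } X_h = 65: \quad & 294.4290 \pm 1.71387\,(9.9176) = 294.4290 \pm 16.997
  \;\Rightarrow\; (277.432,\ 311.426)
\end{aligned}
```

The band is wider at this Xh, as expected, because it has to hold simultaneously for every Xh.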
SAS CODE
symbol1 v=circle i=rlclm97;
proc gplot data=toluca;
plot hours*lotsize;
run;
[Figure: output of the gplot step; hours (vertical axis, 100 to 600) against lotsize (horizontal axis, 20 to 120)]
Estimation of E(Yh) and Prediction of Yh
• $\hat{\mu}_h = b_0 + b_1 X_h$
• $s^2(\hat{\mu}_h) = s^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$
• $s^2(\text{pred}) = s^2 \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]$
SAS CODE
* Confidence intervals;
symbol1 v=circle i=rlclm95;
proc gplot data=toluca;
  plot hours*lotsize;
run;
* Prediction intervals;
symbol1 v=circle i=rlcli95;
proc gplot data=toluca;
  plot hours*lotsize;
run;
[Figure: hours against lotsize with the fitted line and the confidence band]
[Figure: hours against lotsize with the fitted line and confidence intervals]
[Figure: hours against lotsize with the fitted line and prediction intervals]
Background Reading
• Program topic6.sas has the code for
the various plots and calculations
• Sections 2.7, 2.8, and 2.9