predictY

Prediction concerning
the response Y
Where does this topic fit in?
• Model formulation
• Model estimation
• Model evaluation
• Model use
Translating two research questions into
two reasonable statistical answers
• What is the mean weight, μ, of all American
women, aged 18-24?
– If we want to estimate μ, what would be a good
estimate?
• What is the weight, y, of a randomly
selected American woman, aged 18-24?
– If we want to predict y, what would be a good
prediction?
Could we do better by taking into
account a person’s height?
[Figure: scatterplot of weight versus height, showing the fitted line $w = -266.5 + 6.1h$ and the sample mean weight $\bar{y} = 158.8$]
One thing to estimate (μY)
and one thing to predict (y)
[Figure: college entrance test score versus high school GPA, showing the population regression line $\mu_Y = E(Y) = \beta_0 + \beta_1 x$ and the individual responses $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$]
Two different research questions
• What is the mean response μY when the
predictor value is xh?
• What value will a new observation Ynew be
when the predictor value is xh?
Example:
Skin cancer mortality and latitude
• What is the expected (mean) mortality rate
for all locations at 40° N latitude?
• What is the predicted mortality rate for one
new randomly selected location at 40° N?
Example:
Skin cancer mortality and latitude
Regression Plot
Mortality = 389.189 - 5.97764 Latitude
S = 19.1150
R-Sq = 68.0 %
R-Sq(adj) = 67.3 %
$\hat{y} = 389.19 - 5.9776(40) = 150.1$
[Figure: fitted line plot of Mortality versus Latitude]
“Point estimators”
$\hat{y}_h = b_0 + b_1 x_h$
is the best answer to each research question.
That is, it is:
• the best guess of the mean response at xh
• the best guess of a new observation at xh
But, as always, to be confident in the answer to our research
question, we should put an interval around our best guess.
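As a quick check of the point estimate, the fitted skin cancer line from the regression output can be evaluated at xh = 40. This is only a minimal sketch; the function name `predict_point` is illustrative and not part of the original material.

```python
def predict_point(b0, b1, xh):
    """Return the point estimate y-hat_h = b0 + b1 * xh."""
    return b0 + b1 * xh

# Coefficients from the fitted line: Mortality = 389.189 - 5.97764 Latitude
b0, b1 = 389.189, -5.97764

y_hat = predict_point(b0, b1, 40.0)
print(round(y_hat, 2))  # prints 150.08
```

The same number serves as the best guess for both research questions; only the interval placed around it differs.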
It is dangerous to “extrapolate”
beyond scope of model.
Regression Plot
colonies = 16.0667 + 1.61576 conc
S = 2.67546
R-Sq = 66.8 %
R-Sq(adj) = 63.5 %
[Figure: fitted line plot of colonies versus conc]
It is dangerous to “extrapolate”
beyond scope of model.
Regression Plot
colonies = 15.0205 + 3.22113 conc - 0.276956 conc**2
S = 2.74819
R-Sq = 69.6 %
R-Sq(adj) = 64.5 %
[Figure: fitted quadratic plot of colonies versus conc, with the conc axis extended to 10]
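To see why extrapolation is risky, the two fitted equations above can be compared inside and outside the plotted range. This is a rough sketch: the observed range of conc is not stated in the slides, so treating conc = 10 as outside the data (the first plot's axis stops near 6) is an assumption, and the function names are illustrative.

```python
def linear_fit(conc):
    # colonies = 16.0667 + 1.61576 conc (first regression plot)
    return 16.0667 + 1.61576 * conc

def quadratic_fit(conc):
    # colonies = 15.0205 + 3.22113 conc - 0.276956 conc**2 (second plot)
    return 15.0205 + 3.22113 * conc - 0.276956 * conc ** 2

# conc = 3 is inside the plotted range; conc = 10 is assumed to be outside it
for conc in (3.0, 10.0):
    gap = linear_fit(conc) - quadratic_fit(conc)
    print(conc, round(linear_fit(conc), 2), round(quadratic_fit(conc), 2), round(gap, 2))
```

The two models have similar R-Sq values and give similar predictions at conc = 3, yet they disagree by more than 12 colonies at conc = 10, so the data alone cannot say which extrapolated value to trust.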
A confidence interval for
the population mean response μY
… when the predictor value is xh
Again, what are we estimating?
[Figure: college entrance test score versus high school GPA, showing the population regression line $\mu_Y = E(Y) = \beta_0 + \beta_1 x$ and the individual responses $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$]
(1-α)100% t-interval
for mean response μY
Formula in words:
Sample estimate ± (t-multiplier × standard error)
Formula in notation:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$$
Example:
Skin cancer mortality and latitude
Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
      1  150.08    2.75  (144.56, 155.61)  (111.23, 188.93)

Values of Predictors for New Observations
New Obs   Lat
      1  40.0
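The confidence interval formula for the mean response can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Minitab's implementation: the function name `mean_response_ci` and the small data set are hypothetical, and the t-multiplier is passed in by the caller (for example from a t-table) rather than computed.

```python
import math

def mean_response_ci(x, y, xh, t_mult):
    """CI for the mean response at xh in simple linear regression.

    t_mult is the t-multiplier t(alpha/2, n-2), supplied by the caller.
    Returns (fit, SE of fit, (lower, upper)).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # least squares slope
    b0 = y_bar - b1 * x_bar             # least squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    y_hat = b0 + b1 * xh
    se_fit = math.sqrt(mse * (1 / n + (xh - x_bar) ** 2 / sxx))
    return y_hat, se_fit, (y_hat - t_mult * se_fit, y_hat + t_mult * se_fit)

# Hypothetical data, NOT the skin cancer data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# For a 95% interval with n - 2 = 3 degrees of freedom, t(0.025, 3) is about 3.182
fit, se_fit, ci = mean_response_ci(x, y, xh=3.0, t_mult=3.182)
print(round(fit, 2), round(se_fit, 2), tuple(round(v, 2) for v in ci))
```

The three returned values play the same roles as the Fit, SE Fit, and 95.0% CI columns in the Minitab output.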
Factors affecting the length of the
confidence interval for μY
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$$
• As the confidence level decreases, …
• As MSE decreases, …
• As the sample size increases, …
• The more spread out the predictor values, …
• The closer xh is to the sample mean, …
Does the estimate of μY when xh = 1
vary more here …?
[Figure: plot of y versus x, with x running from 1 to 10]
Variable     N   StDev
yhat(x=1)    5   0.320
… or here?
[Figure: plot of y versus x, with x running from 1 to 10]
Variable     N   StDev
yhat(x=1)    5   2.127
Does the estimate of μY vary more
when xh = 1 or when xh = 5.5?
[Figure: plot of y versus x, with x running from 1 to 10]
Variable       N   StDev
yhat(x=1)      5   2.127
yhat(x=5.5)    5   0.512
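The standard error formula gives the same answer as the plots. The sketch below evaluates the factor $\sqrt{1/n + (x_h - \bar{x})^2 / \sum (x_i - \bar{x})^2}$ at xh = 1 and at xh = 5.5. The design x = 1, 2, ..., 10 is an assumption suggested by the plot axes, not something the slides confirm, and `se_fit_factor` is an illustrative name.

```python
import math

def se_fit_factor(x, xh):
    """Standard error of y-hat_h expressed in units of sqrt(MSE)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return math.sqrt(1 / n + (xh - x_bar) ** 2 / sxx)

x = list(range(1, 11))        # assumed design: one observation at each of 1..10

far = se_fit_factor(x, 1.0)   # xh at the edge of the data
near = se_fit_factor(x, 5.5)  # xh at the sample mean of x
print(round(far, 3), round(near, 3), round(far / near, 2))
```

Under this assumed design the estimate at xh = 1 has a standard error almost twice as large as the estimate at the sample mean, which is the same direction as the simulated StDev values above.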
Example:
Skin cancer mortality and latitude
Predicted Values for New Observations
New Obs     Fit  SE Fit        95.0% CI         95.0% PI
      1  150.08    2.75  (144.6, 155.6)  (111.2, 188.93)
      2  221.82    7.42  (206.9, 236.8)  (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs  Latitude
      1      40.0
      2      28.0
Mean of Lat = 39.533
When is it okay to use the
confidence interval for μY formula?
• When xh is a value within the scope of the
model – xh does not have to be one of the
actual x values in the data set.
• When the “LINE” assumptions are met.
– The formula works okay even if the error terms
are only approximately normal.
– If you have a large sample, the error terms can
even deviate substantially from normality.
Prediction interval for
a new response Ynew
Again, what are we predicting?
[Figure: college entrance test score versus high school GPA, showing the population regression line $\mu_Y = E(Y) = \beta_0 + \beta_1 x$ and the individual responses $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$]
(1-α)100% prediction interval
for new response Ynew
Formula in words:
Sample prediction ± (t-multiplier × standard error)
Formula in notation:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$$
Example:
Skin cancer mortality and latitude
Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
      1  150.08    2.75  (144.56, 155.61)  (111.23, 188.93)

Values of Predictors for New Observations
New Obs   Lat
      1  40.0
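The prediction interval formula can be sketched from summary statistics alone. This is a minimal illustration, not Minitab's implementation: `prediction_interval` is a hypothetical name, the input values are made up, and the t-multiplier is supplied by the caller.

```python
import math

def prediction_interval(y_hat, mse, n, xh, x_bar, sxx, t_mult):
    """Prediction interval for a new response at xh.

    sxx is the sum of squared deviations of the x values from x_bar;
    t_mult is the t-multiplier t(alpha/2, n-2).
    Returns (lower, upper, standard error of the prediction).
    """
    se_pred = math.sqrt(mse * (1 + 1 / n + (xh - x_bar) ** 2 / sxx))
    return y_hat - t_mult * se_pred, y_hat + t_mult * se_pred, se_pred

# Hypothetical summary values, NOT taken from the skin cancer data
lower, upper, se_pred = prediction_interval(
    y_hat=4.0, mse=0.8, n=5, xh=3.0, x_bar=3.0, sxx=10.0, t_mult=3.182
)
print(round(lower, 2), round(upper, 2), round(se_pred, 3))
```

The only change from the confidence interval calculation is the extra `1 +` inside the square root, which accounts for the variation of an individual response around its mean.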
When is it okay to use the
prediction interval for Ynew formula?
• When xh is a value within the scope of the
model – xh does not have to be one of the
actual x values in the data set.
• When the “LINE” assumptions are met.
– The formula for the prediction interval depends
strongly on the assumption that the error terms
are normally distributed.
What’s the difference
in the two formulas?
Confidence interval for μY :
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$$
Prediction interval for Ynew:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$$
Prediction of Ynew
if the mean μY is known
Suppose it were known that the mean skin cancer mortality at
xh = 40° N is 150 deaths per million (with variance 400).
What is the predicted skin cancer mortality in Columbus, Ohio?
[Figure: normal curve for Mortality centered at 150, with an area of 0.95 marked]
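With the mean and variance treated as known, the interval that captures 95% of new responses comes straight from the normal distribution. The sketch below assumes the response is normally distributed around the stated mean, as the normal curve suggests; it uses only Python's standard library.

```python
from statistics import NormalDist

mean = 150.0        # known mean mortality at 40° N (deaths per million)
sd = 400.0 ** 0.5   # variance 400 gives a standard deviation of 20

dist = NormalDist(mu=mean, sigma=sd)
lower = dist.inv_cdf(0.025)   # 2.5th percentile
upper = dist.inv_cdf(0.975)   # 97.5th percentile
print(round(lower, 1), round(upper, 1))  # prints 110.8 189.2
```

This is the familiar 150 ± 1.96 × 20 calculation. No estimation error enters here because nothing has been estimated.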
And then reality sets in
• The mean μY is not known.
– Estimate it with the predicted response ŷ.
– The cost of using ŷ to estimate μY is the
variance of ŷ.
• The variance σ² is not known.
– Estimate it with MSE.
Variance of the prediction
The variation in the prediction of a new response depends
on two components:
1. the variation due to estimating the mean μY with ŷh
2. the variation in Y
$$\sigma^2 + \sigma^2(\hat{Y}_h)$$
which is estimated by:
$$MSE + MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right) = MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)$$
What’s the effect of the
difference in the two formulas?
Confidence interval for μY :
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$$
Prediction interval for Ynew:
$$\hat{y}_h \pm t_{(\alpha/2,\, n-2)} \times \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$$
What’s the effect of the
difference in the two formulas?
• A (1-α)100% confidence interval for μY at xh
will always be narrower than a (1-α)100%
prediction interval for Ynew at xh.
• The confidence interval’s standard error can
approach 0, whereas the prediction interval’s
standard error cannot get close to 0.
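A small numerical sketch makes the second point concrete. It evaluates both standard errors at xh equal to the sample mean of x, where the $(x_h - \bar{x})^2$ term vanishes. The value S = 19.1150 comes from the skin cancer regression output; holding MSE fixed while n changes is a simplifying assumption made only for illustration.

```python
import math

mse = 19.1150 ** 2  # S = 19.1150 from the regression output, so MSE = S^2

def se_at_mean(n):
    """Standard errors at xh = x-bar for a sample of size n (MSE held fixed)."""
    se_ci = math.sqrt(mse * (1 / n))       # confidence interval for the mean
    se_pi = math.sqrt(mse * (1 + 1 / n))   # prediction interval for a new Y
    return se_ci, se_pi

for n in (10, 100, 10_000):
    se_ci, se_pi = se_at_mean(n)
    print(n, round(se_ci, 3), round(se_pi, 3))
```

As n grows, the confidence interval's standard error shrinks toward 0, while the prediction interval's standard error never drops below $\sqrt{MSE}$, about 19.1 here.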
Confidence intervals and prediction
intervals for response in Minitab
• Stat >> Regression >> Regression …
• Specify response and predictor(s).
• Select Options…
– In “Prediction intervals for new observations” box,
specify either the X value or a column name containing
multiple X values.
– Specify confidence level (default is 95%).
• Click on OK. Click on OK.
• Results appear in session window.
Confidence intervals and prediction
intervals for response in Minitab
[Screenshot: worksheet column C6 holding the new predictor values 40 and 28]
Example:
Skin cancer mortality and latitude
Predicted Values for New Observations
New Obs     Fit  SE Fit        95.0% CI         95.0% PI
      1  150.08    2.75  (144.6, 155.6)  (111.2, 188.93)
      2  221.82    7.42  (206.9, 236.8)  (180.6, 263.07) X
X denotes a row with X values away from the center

Values of Predictors for New Observations
New Obs  Latitude
      1      40.0
      2      28.0
Mean of Lat = 39.533
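The output above can be used to check the variance decomposition numerically: the standard error of a prediction should equal $\sqrt{MSE + SE(\text{fit})^2}$, with MSE = S² and S = 19.1150 from the regression output. The sketch below is an illustrative check rather than part of the Minitab procedure; it backs out the t-multiplier implied by each reported prediction interval.

```python
import math

s = 19.1150  # S from the regression output, so MSE = s ** 2

# (Fit, SE Fit, upper 95% PI limit) for the two new observations
rows = [(150.08, 2.75, 188.93), (221.82, 7.42, 263.07)]

results = []
for fit, se_fit, pi_upper in rows:
    se_pred = math.sqrt(s ** 2 + se_fit ** 2)  # sqrt(MSE + SE(fit)^2)
    implied_t = (pi_upper - fit) / se_pred     # multiplier implied by the PI
    results.append((se_pred, implied_t))
    print(round(se_pred, 2), round(implied_t, 3))
```

Both rows imply essentially the same multiplier, a little above 2, which is what the formula predicts: the two intervals share one t-multiplier and differ only through SE Fit.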
A plot of the confidence interval and
prediction interval in Minitab
• Stat >> Regression >> Fitted line plot …
• Specify predictor and response.
• Under Options …
– Select Display confidence bands.
– Select Display prediction bands.
– Specify desired confidence level (95% default)
• Select OK. Select OK.
A plot of the confidence interval and
prediction interval in Minitab
Regression Plot
Mortality = 389.189 - 5.97764 Latitude
S = 19.1150
R-Sq = 68.0 %
R-Sq(adj) = 67.3 %
[Figure: fitted line plot of Mortality versus Latitude, showing the regression line with 95% CI bands and 95% PI bands]