330.Lect7

advertisement
STATS 330: Lecture 7
3/22/2016
330 lecture 7
1
Prediction
Aims of today’s lecture
 Describe how to use the regression model
to predict future values of the response,
given the values of the covariates
 Describe how to estimate the mean
response, for particular values of the
covariates
3/22/2016
330 lecture 7
2
Typical questions
 Given the height and diameter, can we predict
the volume of a cherry tree? E.g.if a particular
tree has diameter 11 inches and height 85 feet,
can we predict its volume?
 What is the average volume of all cherry trees
having a given height and diameter? E.g. if we
consider all cherry trees with diameter 11 inches
and height 85 feet, can we estimate their mean
volume?
3/22/2016
330 lecture 7
3
The regression model
(revisited)
 The diagram shows the
true plane, which
describes the mean
response for any x1, x2
 The short “sticks”
represent the “errors”
 The y-observation is the
sum of the mean and the
error,
Y= m + e
3/22/2016
330 lecture 7
4
Prediction: how to do it
 Suppose we want to predict the response
of an individual whose covariates are
x1, … xk.
for the cherry tree example, k=2 and x1
(height) is 85, x2 (diameter) is 11
 Since the response we want to predict is
Y= m + e, we predict Y by making separate
predictions of m and e, and adding the
results together.
3/22/2016
330 lecture 7
5
Prediction:
how to do it (2)
 Since m is the value of the true regression
plane at x1,…xk, we predict m by the value of
the fitted plane at x1,…xk. This is
ˆ0 + ˆ1 x1 + ... + ˆk xk
 We have no information about e other than
the fact that it is sampled from a normal
distribution with zero mean. Thus, we predict
e by zero.
3/22/2016
330 lecture 7
6
Prediction:
how to do it (3)
Thus, the value of the predictor is
ˆ0 + ˆ1 x1 + ... + ˆk xk + 0
 ˆ0 + ˆ1 x1 + ... + ˆk xk
Thus, to predict the response y when the
covariates are x1,…xk, we use the fitted plane at
x1,…xk.as the predictor.
3/22/2016
330 lecture 7
7
Inner product form
 Inner product form of predictor:
ˆ
ˆ
ˆ
ˆ
  (  0 , 1 ,...,  k )
Inner
product
Vector of
regression
coefficients
x  (1, x1 ,..., xk )
ˆ
ˆ
ˆ
ˆ
x'    0 + 1 x1 + ... +  k xk
3/22/2016
Vector of predictor
variables
330 lecture 7
8
Prediction error
 Prediction error is measured by the expected
value of (actual-predictor)2
 Equal to Var (predictor) + s2
• First part comes from predicting the mean, second
part from predicting the error.
• Var(predictor) depends on the x’s and on s2, see
formula on slide 12
 Standard error of prediction is square root of
Var (predictor) + s2
 Prediction intervals approximately predictor
+/- 2 standard errors, or more precisely….
3/22/2016
330 lecture 7
9
Prediction interval
 A prediction interval is an interval which
contains the actual value of the response with
a given probability e.g. 0.95
 95% Interval is for x’ is
x'   ( Standard error of prediction)  t nk 1 (0.975)
3/22/2016
330 lecture 7
10
Var(predictor)
Arrange covariate data (used for fitting model)
into a matrix X
1 x11  x1k 


X  




1 x n1  xnk 
Var ( predictor )  s x' ( X' X) x
2
3/22/2016
330 lecture 7
1
11
qt
0.4
Density of t-distribution with 28 df
0.2
Density
0.3
Area = 0.975
0.0
0.1
Area = 0.025
-4
-2
0
2
4
t
qt(0.975,28)
3/22/2016
330 lecture 7
12
Doing it in R
 Suppose we want to predict the volume of 2
cherry trees. The first has diameter 11
inches and height 85 feet, the second has
diameter 12 inches and height 90 feet.
 First step is to make a data frame containing
the new data. Names must be the same as
in the original data.
 Then we need to combine the results of the
fit (regression coefs, estimate of error
variance) with the new data (the predictor
variables) to calculate the predictor
3/22/2016
330 lecture 7
13
Doing it in R (2)
# calculate the fit from the original data
cherry.lm
=lm(volume~diameter+height,data=cherry.df)
# make the new data frame
new.df = data.frame(diameter=c(11,12),
height=c(85,90))
# do the prediction
Results of fit
predict(cherry.lm,new.df)
1
2
22.63846 29.04288
New data
(predictor
variables)
This only gives the predictions. What about prediction
errors, prediction intervals etc?
3/22/2016
330 lecture 7
14
Doing it in R (3)
predict(cherry.lm,new.df,
se.fit=T,interval="prediction")
$fit
95% Prediction
fit
lwr
upr
intervals
1 22.63846 13.94717 31.32976
2 29.04288 19.97235 38.11340
$se.fit
Sqrt of var(predictor)
1
2
not standard error of
1.712901 2.130571
prediction
$df
[1] 28
Residual mean square
$residual.scale
[1] 3.881832
3/22/2016
330 lecture 7
15
Hand calculation
Prediction is 22.63846
SE(predictor) = sqrt(se.fit^2 + residual.scale^2)
Prediction interval is:
Prediction +/- SE(predictor)
*qt(0.975,n-k-1)
Hence SE(predictor) = sqrt(1.712901^2 + 3.881832^2) =
4.242953
qt(0.975,n-k-1) = qt(0.975,28) = 2.048407
PI is 22.63846 +\- 4.242953 * 2.048407 or
(13.94717, 31.32975)
3/22/2016
330 lecture 7
16
Estimating the mean
response
 Suppose we want to estimate the mean response of
all individuals whose covariates are x1, … xk.
 Since the mean we want to predict is the height of the
true regression plane at x1, … xk, we use as our
estimate the height of the fitted plane at x1, …, xk. This
is
ˆ0 + ˆ1 x1 + ... + ˆk xk
3/22/2016
330 lecture 7
17
Standard error of the
estimate
The standard error of the estimate is the
square root of Var ( Predictor )
Note that this is less than the standard error
of the prediction:
prediction SE  Var ( Predictor ) + s
2
Estimate SE  Var ( Predictor )
Confidence intervals: predictor +/- 2 standard
error of estimate * qt(0.975,n-k-1)
3/22/2016
330 lecture 7
18
Doing it in R
 Suppose we want to estimate the
mean volume of all cherry trees having
diameter 11 inches and height 85 feet.
 As before, the first step is to make a
data frame containing the new data.
Names must be the same as in the
original data.
3/22/2016
330 lecture 7
19
Doing it in R (2)
> new.df<-data.frame(diameter=11,height=85)
> predict(cherry.lm,new.df,se.fit=T,
interval="confidence")
$fit
fit
lwr
upr
[1,] 22.63846 19.12974 26.14718
Confidence interval
$se.fit
Note: shorter than
prediction interval
[1] 1.712901
$df
[1] 28
$residual.scale
[1] 3.881832
3/22/2016
330 lecture 7
20
Example: Hydrocarbon
data
When petrol is pumped into a tank, hydrocarbon vapours are
forced into the atmosphere. To reduce this significant source of
air pollution, devices are installed to capture the vapour. A
laboratory experiment was conducted in which the amount of
vapour given off was measured under carefully controlled
conditions. In addition to the response, there were four
variables which were thought relevant for prediction:
t.temp:
initial tank temperature ( degrees F)
p.temp:
temperature of the dispensed petrol ( degrees F)
t.vp:
initial vapour pressure in tank (psi)
p.vp:
vapour pressure of the dispensed petrol (psi)
hc: emitted dispensed hydrocarbons (g) (response)
3/22/2016
330 lecture 7
21
The data
> petrol.df
t.temp p.temp
1
28
33
2
24
48
3
33
53
4
29
52
5
33
44
6
31
36
7
32
34
8
34
35
9
33
51
t.vp
3.00
2.78
3.32
3.00
3.28
3.10
3.16
3.22
3.18
p.vp
3.49
3.22
3.42
3.28
3.58
3.26
3.16
3.22
3.18
hc
22
27
29
29
27
24
23
23
26
… 125 observations in all
3/22/2016
330 lecture 7
22
Pairs plot
60
80
3
4
5
6
7
70
90
40
80
30
50
t.temp
6
7
40
60
p.temp
6
7
3
4
5
t.vp
20 30 40 50
3
4
5
p.vp
hc
30
3/22/2016
50
70
90
3
4
5
6
7
330 lecture 7
20
30
40
50
23
Preliminary conclusion
 All variables seem related to the response
 p.vp and t.vp seem highly correlated
 Quite strong correlations between some of
the other variables
 No obvious outliers
3/22/2016
330 lecture 7
24
Fit the “full” model
>petrol.reg<-lm(hc~ t.temp + p.temp + t.vp + p.vp,
data=petrol.df)
> summary(petrol.reg)
Call:
lm(formula = hc ~ t.temp + p.temp + t.vp + p.vp,
data = petrol.df)
Residuals:
Min
1Q
-6.46213 -1.31388
Median
0.03005
3Q
1.23138
Max
6.99663
Continued on next slide...
3/22/2016
330 lecture 7
25
Fit the “full” model
(2)
Coefficients:
Estimate Std. Error t value
(Intercept) 0.16609
1.02198
0.163
t.temp
-0.07764
0.04801 -1.617
p.temp
0.18317
0.04063
4.508
t.vp
-4.45230
1.56614 -2.843
p.vp
10.27271
1.60882
6.385
--Signif. codes: 0 `***' 0.001 `**' 0.01
Pr(>|t|)
0.87117
0.10850
1.53e-05 ***
0.00526 **
3.37e-09 ***
`*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 2.723 on 120 degrees of freedom
Multiple R-Squared: 0.8959,
Adjusted R-squared: 0.8925
F-statistic: 258.2 on 4 and 120 DF, p-value: < 2.2e-16
3/22/2016
330 lecture 7
26
Conclusions




Large R2
Significant coefficients, except for temp
Model seems satisfactory
Can move on to prediction
Lets predict hydrocarbon emissions when
t.temp=28, p.temp=30, t.vp=3 and p.vp=3.5
3/22/2016
330 lecture 7
27
Prediction
> pred.df<-data.frame(t.temp=28, p.temp=30,
t.vp=3, p.vp=3.5)
> predict(petrol.reg,pred.df,interval="p" )
fit
lwr
upr
[1,] 26.08471 20.23045 31.93898
Hydrocarbon emission is predicted to be between 20.2
and 31.9 grams.
3/22/2016
330 lecture 7
28
Download