STATS 330: Lecture 12
4/8/2015
330 lecture 12
1
Diagnostics 4
Aim of today’s lecture
To discuss diagnostics for independence
Independence
• One of the regression assumptions is that the errors are independent.
• Data collected sequentially over time often have errors that are not independent.
• If the independence assumption does not hold, the standard errors will be wrong, and the tests and confidence intervals will be unreliable.
• Thus, we need to be able to detect lack of independence.
Types of dependence
• If large positive errors tend to follow large positive errors, and large negative errors tend to follow large negative errors, we say the data have positive autocorrelation.
• If large positive errors tend to follow large negative errors, and large negative errors tend to follow large positive errors, we say the data have negative autocorrelation.
Diagnostics
If the errors are positively autocorrelated,
• Plotting the residuals against time will show long runs of positive and negative residuals
• Plotting residuals against the previous residual (i.e. e_i vs e_{i-1}) will show a positive trend
• A correlogram of the residuals will show positive spikes, gradually decaying
Diagnostics (2)
If the errors are negatively autocorrelated,
• Plotting the residuals against time will show alternating positive and negative residuals
• Plotting residuals against the previous residual (i.e. e_i vs e_{i-1}) will show a negative trend
• A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying
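The patterns listed on the two Diagnostics slides can be reproduced by simulation. The following is a minimal sketch, not part of the original lecture code: it assumes AR(1) errors generated with `arima.sim`, and the helper name `sim_errors`, the seed and the series length are choices made only for this illustration.

```r
set.seed(330)  # arbitrary seed, for reproducibility only
n <- 500       # illustrative series length

# Hypothetical helper: simulate n errors whose lag 1 coefficient is rho
sim_errors <- function(rho) {
  if (rho == 0) return(rnorm(n))  # independent errors
  as.numeric(arima.sim(model = list(ar = rho), n = n))
}

errs <- lapply(c(positive = 0.9, none = 0, negative = -0.9), sim_errors)

# Sample correlation between each error and the previous one
lag1 <- sapply(errs, function(e) cor(e[-1], e[-length(e)]))
print(round(lag1, 2))

# In an interactive session, the three diagnostics for one series:
# e <- errs$positive
# plot(e, type = "b"); abline(h = 0, lty = 2)  # long runs above and below 0
# plot(e[-length(e)], e[-1])                   # positive trend
# acf(e)                                       # slowly decaying positive spikes
```

Swapping `errs$positive` for `errs$negative` in the commented plotting lines should show the alternating pattern instead.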
Residuals against time
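res <- residuals(lm.obj)
# The x vector can be omitted if it is just the sequence numbers:
# plot(res, type = "b", ...) gives the same plot.
plot(1:length(res), res,
     xlab = "time", ylab = "residuals", type = "b")  # type = "b": dots and lines
lines(1:length(res), res)   # join the dots
abline(h = 0, lty = 2)      # dashed line at 0 (the mean residual)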
[Figure: residuals plotted against time (0 to 100) in three panels: autocorrelation = 0.9, autocorrelation = 0.0, autocorrelation = -0.9]
Residuals against previous
res <- residuals(lm.obj)
n <- length(res)
plot.res <- res[-1]   # drop element 1: it has no previous residual
prev.res <- res[-n]   # drop element n so the two vectors have equal length
plot(prev.res, plot.res,
     xlab = "previous residual", ylab = "residual")
Plots for different degrees of autocorrelation
[Figure: residual plotted against previous residual in three panels: autocorrelation = 0.9, autocorrelation = 0.0, autocorrelation = -0.9]
Correlogram
acf(residuals(lm.obj))
• The correlogram (autocorrelation function, acf) is a plot of the lag k autocorrelation versus k.
• The lag k autocorrelation is the correlation between residuals k time units apart.
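To make the definition concrete, the sketch below computes the lag k autocorrelation by hand and compares it with what `acf` returns. It is an illustration under stated assumptions: the series `e` is simulated as a stand-in for `residuals(lm.obj)`, and `lag_k_autocorr` is a hypothetical helper name. The formula used (products of deviations k apart, divided by the total sum of squared deviations) is the one `acf` applies by default.

```r
set.seed(12)  # arbitrary seed
# Simulated stand-in for residuals(lm.obj)
e <- as.numeric(arima.sim(model = list(ar = 0.5), n = 200))

# Hypothetical helper: lag k autocorrelation of the series e
lag_k_autocorr <- function(e, k) {
  n <- length(e)
  d <- e - mean(e)                                  # deviations from the mean
  sum(d[(k + 1):n] * d[1:(n - k)]) / sum(d^2)
}

by_hand  <- sapply(0:5, function(k) lag_k_autocorr(e, k))
from_acf <- as.numeric(acf(e, lag.max = 5, plot = FALSE)$acf)
print(round(cbind(lag = 0:5, by_hand, from_acf), 3))
```

The lag 0 value is always 1, which is why every correlogram starts with a spike of height 1. The lag 1 value differs slightly from `cor(e[-1], e[-n])` because the two use different means and denominators.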
[Figure: correlograms (ACF against lag, 0 to 20) in three panels: autocorrelation = 0.9, autocorrelation = 0.0, autocorrelation = -0.9]
Durbin-Watson test
• We can also do a formal hypothesis test (the Durbin-Watson test) for independence.
• The test assumes the errors follow a model of the form
$$\varepsilon_i = \rho\,\varepsilon_{i-1} + u_i$$
where the u_i are independent, normal and have constant variance.
• NB: ρ is the lag 1 correlation; this is the autoregressive model of order 1.
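A short supplementary derivation links the autoregressive model of order 1 to the decaying correlograms seen earlier. It is a sketch under two assumptions not stated on the slide: that |ρ| < 1 and the error process is stationary (so every ε_i has the same variance), and that u_i is independent of all earlier errors.

```latex
\begin{aligned}
\varepsilon_i &= \rho\,\varepsilon_{i-1} + u_i, \\
\operatorname{Cov}(\varepsilon_i, \varepsilon_{i-1})
  &= \operatorname{Cov}(\rho\,\varepsilon_{i-1} + u_i,\ \varepsilon_{i-1})
   = \rho \operatorname{Var}(\varepsilon_{i-1}), \\
\operatorname{Corr}(\varepsilon_i, \varepsilon_{i-1}) &= \rho, \\
\operatorname{Corr}(\varepsilon_i, \varepsilon_{i-k}) &= \rho^{k}, \qquad k = 1, 2, \dots
\end{aligned}
```

For ρ = 0.9 the lag k correlations are 0.9, 0.81, 0.729, ..., a gradual decay. For ρ = -0.9 they are -0.9, 0.81, -0.729, ..., alternating in sign while decaying.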
Durbin-Watson test (2)
• When ρ = 0, the errors are independent.
• The DW test tests independence by testing ρ = 0.
• ρ is estimated by
$$\hat{\rho} = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=2}^{n} e_i^2}$$
Durbin-Watson test (3)
• The DW test statistic is
$$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=2}^{n} e_i^2} \approx 2(1 - \hat{\rho})$$
• The value of DW is between 0 and 4.
• Values of DW around 2 are consistent with independence.
• Values close to 4 indicate negative serial correlation.
• Values close to 0 indicate positive serial correlation.
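The estimate of ρ and the DW statistic can be computed directly from a residual vector. The sketch below is illustrative only: `e` is a simulated stand-in for `residuals(lm.obj)`, and both sums are taken over i = 2, ..., n, as on the slides. Many references instead sum all n squared residuals in the denominator; the two versions differ by a single term and are very close for moderate n.

```r
set.seed(2015)  # arbitrary seed
# Simulated stand-in for residuals(lm.obj), with positive autocorrelation
e <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))
n <- length(e)

cur  <- e[-1]   # e_2, ..., e_n
prev <- e[-n]   # e_1, ..., e_{n-1}

rho_hat <- sum(cur * prev) / sum(cur^2)        # estimate of rho
DW      <- sum((cur - prev)^2) / sum(cur^2)    # Durbin-Watson statistic

print(round(c(rho_hat = rho_hat, DW = DW, approx = 2 * (1 - rho_hat)), 3))
```

Because the simulated errors are positively autocorrelated, DW should come out well below 2, and the approximation 2(1 - rho_hat) should agree with it closely.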
Durbin-Watson test (4)
• There exist values dL, dU depending on the number of variables k in the regression and the sample size n (see the table on the next slide).
• Use the value of DW to decide on independence as follows:
  0 to dL: positive autocorrelation
  dL to dU: inconclusive
  dU to 4-dU: independence
  4-dU to 4-dL: inconclusive
  4-dL to 4: negative autocorrelation
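The decision rule can be written as a small function. This is a sketch, not part of the lecture code: `dw_decision` is a hypothetical helper, the bounds in the usage lines are made-up values (real ones come from the Durbin-Watson table), and how to treat a DW value that falls exactly on a boundary is an assumption made here.

```r
# Hypothetical helper: classify a DW value given the table bounds dL and dU
dw_decision <- function(dw, dL, dU) {
  stopifnot(dw >= 0, dw <= 4, dL <= dU)
  if (dw < dL) {
    "positive autocorrelation"
  } else if (dw < dU) {
    "inconclusive"
  } else if (dw <= 4 - dU) {
    "independence"
  } else if (dw <= 4 - dL) {
    "inconclusive"
  } else {
    "negative autocorrelation"
  }
}

# Made-up bounds, for illustration only
dL <- 1.3
dU <- 1.6
decisions <- sapply(c(1.1, 1.45, 2.0, 2.55, 2.9), dw_decision, dL = dL, dU = dU)
print(decisions)
```

With these illustrative bounds the five DW values fall, in order, into each of the five regions from left to right.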
Durbin-Watson table
[Table: values of dL and dU by sample size n and number of variables k]
Example: the advertising data
• Sales and advertising data
• Data on monthly sales and advertising spend for 35 months
• Model is sales ~ spend + prev.spend (prev.spend = spend in the previous month)
Advertising data
> ad.df
   spend prev.spend sales
1     16         15  20.5
2     18         16  21.0
3     27         18  15.5
4     21         27  15.3
5     49         21  23.5
6     21         49  24.5
7     22         21  21.3
8     28         22  23.5
9     36         28  28.0
10    40         36  24.0
11     3         40  15.5
12    21          3  17.3
… 35 lines in all
R code for residual vs previous plot
advertising.lm <- lm(sales ~ spend + prev.spend, data = ad.df)
res <- residuals(advertising.lm)
n <- length(res)
plot.res <- res[-1]   # residuals 2, ..., n
prev.res <- res[-n]   # residuals 1, ..., n-1
plot(prev.res, plot.res,
     xlab = "previous residual", ylab = "residual",
     main = "Residual versus previous residual \n for the advertising data")
# add the least squares line of residual on previous residual
abline(coef(lm(plot.res ~ prev.res)), col = "red", lwd = 2)
[Figure: Residual versus previous residual for the advertising data (residual against previous residual, both axes from about -5 to 5)]
Time series plot, correlogram – R code
par(mfrow = c(2, 1))   # two plots, one above the other
plot(res, type = "b", xlab = "Time Sequence", ylab = "Residual",
     main = "Time series plot of residuals for the advertising data")
abline(h = 0, lty = 2, lwd = 2, col = "blue")
acf(res, main = "Correlogram of residuals for the advertising data")
[Figure, top panel: Time series plot of residuals for the advertising data (Residual against Time Sequence, 0 to 35), annotated "Increasing trend?"]
[Figure, bottom panel: Correlogram of residuals for the advertising data (ACF against Lag, 0 to 15)]
Calculating DW
> rhohat<-cor(plot.res,prev.res)
> rhohat
[1] 0.4450734
> DW<-2*(1-rhohat)
> DW
[1] 1.109853
For n = 35 and k = 2, dL = 1.34. Since DW = 1.110 < dL = 1.34, there is strong evidence of positive serial correlation.
Durbin-Watson table
Averaging the two nearest table entries for dL: use (1.28 + 1.39)/2 = 1.34
Remedy (1)
• If we detect serial correlation, we need to fit special time series models to the data.
• For full details see STATS 326/726.
• Assuming that the AR(1) model is ok, we can use the arima function in R to fit the regression.
Fitting a regression with AR(1) errors
> arima(ad.df$sales,order=c(1,0,0),
        xreg=cbind(spend,prev.spend))
Call:
arima(x = ad.df$sales, order = c(1, 0, 0), xreg = cbind(spend, prev.spend))
Coefficients:
         ar1  intercept   spend  prev.spend
      0.4966    16.9080  0.1218      0.1391
s.e.  0.1580     1.6716  0.0308      0.0316
sigma^2 estimated as 9.476:  log likelihood = -89.16,  aic = 188.32
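Since only the first 12 rows of ad.df are listed, the sketch below uses simulated data to show how the estimates and standard errors can be pulled out of an `arima` fit for comparison with `lm`. Everything here is an assumption made for illustration: the data are made up, and the coefficients 16, 0.12, 0.14 and the AR(1) parameter 0.5 are chosen only to resemble the output above.

```r
set.seed(26)  # arbitrary seed
n <- 200      # illustrative sample size (the lecture data have 35 months)

# Made-up predictors: spend, and spend in the previous period
spend      <- runif(n, 0, 50)
prev.spend <- c(runif(1, 0, 50), spend[-n])

# AR(1) errors with lag 1 coefficient 0.5 and innovation sd 3
err   <- as.numeric(arima.sim(model = list(ar = 0.5), n = n, sd = 3))
sales <- 16 + 0.12 * spend + 0.14 * prev.spend + err

# Regression with AR(1) errors, as on the slide
fit <- arima(sales, order = c(1, 0, 0), xreg = cbind(spend, prev.spend))

est <- coef(fit)               # ar1, intercept, spend, prev.spend
se  <- sqrt(diag(vcov(fit)))   # their standard errors
print(round(rbind(estimate = est, std.err = se), 4))

# Ordinary least squares fit, which ignores the autocorrelation
print(round(coef(summary(lm(sales ~ spend + prev.spend))), 4))
```

The two printed tables give the same kind of side-by-side view as a comparison of the `lm` and `arima` fits for the advertising data.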
Comparisons
                        lm              arima
Const (std err)         15.60 (1.34)    16.90 (1.67)
Spend (std err)         0.142 (0.035)   0.122 (0.031)
Prev Spend (std err)    0.166 (0.036)   0.139 (0.031)
1st order correlation   0.442           0.497
Sigma                   3.652           3.078
Remedy (2)
• Recall that the time series plot of the residuals showed a trend: the residuals seem to be related to time.
• Thus, time is a "lurking variable": a variable that should be in the regression but isn't.
• Try the model
sales ~ spend + prev.spend + time
Fitting new model
time <- 1:35   # month number
new.advertising.lm <- lm(sales ~ spend + prev.spend + time, data = ad.df)
res <- residuals(new.advertising.lm)
n <- length(res)
plot.res <- res[-1]   # residuals 2, ..., n
prev.res <- res[-n]   # residuals 1, ..., n-1
DW <- 2 * (1 - cor(plot.res, prev.res))
DW Retest
• DW is now 1.73.
• For a model with 3 explanatory variables, dU is about 1.66 (refer to the table). Since DW = 1.73 lies between dU = 1.66 and 4 - dU = 2.34, there is no evidence of serial correlation.
• Time is a highly significant variable in the regression.
• Problem is fixed!