Module 2
Review of Regression Analysis Methods
with Applications to Time Series Analysis
Class notes for Statistics 451: Applied Time Series
Iowa State University
Copyright 2015 W. Q. Meeker.
November 11, 2015
Module 2
Segment 1
Fitting a Simple Linear Regression Line
To a Time Series Data Set
[Figure: International Airline Passengers 1949-1960. Thousands of passengers versus year.]
[Figure: International Airline Passengers 1949-1960 on log axis. Thousands of passengers versus year.]
[Figure: Log International Airline Passengers 1949-1960 with Model 1 linear regression trend line. Log thousands of passengers versus year.]
[Figure: International Airline Passengers 1949-1960 with Model 1 log-linear regression trend line. Thousands of passengers versus year.]
R airline.frame Data Frame
> airline.frame
    Passengers Month     Time
  1        112   Jan 1949.000
  2        118   Feb 1949.083
  3        132   Mar 1949.167
  4        129   Apr 1949.250
  5        121   May 1949.333
.......................
141        508   Sep 1960.667
142        461   Oct 1960.750
143        390   Nov 1960.833
144        432   Dec 1960.917
Note that “coded time” Time runs from 1949, 1949+1/12, ..., 1960+11/12.
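A data frame with this layout can be rebuilt in current R. The following is a minimal sketch, assuming that R's built-in AirPassengers series (monthly international airline passengers, 1949-1960, in thousands) is the same data used in these notes; the construction itself is an illustration, not necessarily how the course data frame was made.

```r
# Assumption: the built-in AirPassengers series stands in for the course data.
passengers.ts <- AirPassengers

airline.frame <- data.frame(
  Passengers = as.numeric(passengers.ts),
  # cycle() gives the position within the year (1 = Jan, ..., 12 = Dec)
  Month = factor(month.abb[as.integer(cycle(passengers.ts))]),
  # time() gives coded time: 1949, 1949 + 1/12, ..., 1960 + 11/12
  Time = as.numeric(time(passengers.ts))
)

print(head(airline.frame, 5))
print(tail(airline.frame, 4))
```

Because factor() orders its levels alphabetically, Apr becomes the first level of Month, which matters later when dummy variables are formed.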
Airline Model 1 R Output
> airline.fit1 <- lm(log(passengers.ts) ~ Time)
> summary(airline.fit1)
Call: lm(formula = log(passengers.ts) ~ Time)
Residuals:
     Min      1Q   Median      3Q    Max
 -0.3086 -0.1039 -0.01796 0.09737 0.2954
Coefficients:
                Value Std. Error  t value Pr(>|t|)
(Intercept) -230.1879     6.5389 -35.2029   0.0000
Time           0.1206     0.0033  36.0505   0.0000
Residual standard error: 0.139 on 142 degrees of freedom
Multiple R-Squared: 0.9015
F-statistic: 1300 on 1 and 142 degrees of freedom, the p-value is 0
Time Series Regression Model and
Ordinary Least Squares (OLS) Estimates
• Model:
$$y_t = \beta_0 + \beta_1 \mathrm{Time}_t + a_t$$
where
◮ $a_t \sim \mathrm{IID}(0, \sigma_a^2)$ or $a_t \sim \mathrm{NID}(0, \sigma_a^2)$. The NID assumption is needed for statistical efficiency, normal-theory inference, and normal-theory prediction intervals.
◮ Time is numerical time or coded time (e.g., day, week, or year), and
◮ $t = 1, 2, \ldots, n$ indexes the equally spaced observations.
• Model parameter estimates:
$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time}_t, \qquad S = \hat{\sigma}_a$$
Airline Model 1 and OLS Estimates
• Model:
$$\log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + a_t$$
Time is coded time (in this case Time = 1949, 1949+1/12, ..., 1960+11/12)
• Model parameter estimates:
$$\widehat{\log(\mathrm{Passengers})} = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time} = -230 + 0.12\,\mathrm{Time}$$
$$S = 0.139, \qquad R^2 = 0.9015$$
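The Model 1 estimates can be checked against the standard closed-form OLS formulas for a straight line. This is a sketch only: it assumes the built-in AirPassengers series matches the course data, and the names y, b0, b1, and a.hat are introduced here for illustration.

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
y    <- log(as.numeric(passengers.ts))
Time <- as.numeric(time(passengers.ts))
n    <- length(y)

# Closed-form OLS estimates for a simple linear regression
b1 <- sum((Time - mean(Time)) * (y - mean(y))) / sum((Time - mean(Time))^2)
b0 <- mean(y) - b1 * mean(Time)

# Residual standard deviation with p = 2 parameters
a.hat <- y - (b0 + b1 * Time)
S <- sqrt(sum(a.hat^2) / (n - 2))

# The same fit through lm(), for comparison
airline.fit1 <- lm(y ~ Time)
print(rbind(closed.form = c(b0, b1, S),
            lm = c(coef(airline.fit1), summary(airline.fit1)$sigma)))
```

The two rows should agree up to rounding error, since lm() solves the same least squares problem numerically.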
Module 2
Segment 2
Time Series Residuals,
Residual Statistics, and ANOVA
[Figure: International Airline Passengers 1949-1960 with Model 1 log-linear regression trend line. Thousands of passengers versus year.]
Time Series Residuals
• Residuals are deviations between the observed response (data) and the fitted value of the response (model).
Unobserved: $a_t = y_t - E(Y_t) = y_t - (\beta_0 + \beta_1 \mathrm{Time}_t)$
Observed: $\hat{a}_t = y_t - \hat{y}_t = y_t - (\hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time}_t)$
Residual Standard Deviation
• The residual standard deviation (estimate of $\sigma_a$) is
$$S = \sqrt{\frac{\text{Unexplained Sum of Squares}}{n-p}} = \sqrt{\frac{\sum_{t=1}^{n} \hat{a}_t^2}{n-p}} = \sqrt{\frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{n-p}}$$
where p is the number of model parameters.
Coefficient of Determination
• The coefficient of determination (proportion of variability explained by the model) is
$$R^2 = \frac{\text{Explained Sum of Squares}}{\text{Total Sum of Squares}} = 1 - \frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$$
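The residual standard deviation S and the coefficient of determination can both be computed directly from the residuals. A minimal sketch, assuming the built-in AirPassengers series as a stand-in for the course data and using Model 1 as the fitted model:

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
y    <- log(as.numeric(passengers.ts))
Time <- as.numeric(time(passengers.ts))
airline.fit1 <- lm(y ~ Time)

a.hat <- residuals(airline.fit1)      # observed residuals
n <- length(y)
p <- length(coef(airline.fit1))       # number of model parameters (2 here)

# Residual standard deviation: sqrt(unexplained SS / (n - p))
S <- sqrt(sum(a.hat^2) / (n - p))

# Coefficient of determination: 1 - unexplained SS / total SS
R2 <- 1 - sum(a.hat^2) / sum((y - mean(y))^2)

print(c(S = S, lm.sigma = summary(airline.fit1)$sigma,
        R2 = R2, lm.r.squared = summary(airline.fit1)$r.squared))
```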
Module 2
Segment 3
Dummy Variable Model
for Time Series Seasonality
Airline Model 2 R Output
> airline.fit2 <- lm(log(passengers.ts) ~ Time + Month)
> summary(airline.fit2)
Coefficients:
                Value Std. Error  t value Pr(>|t|)
(Intercept) -230.6755     2.7985 -82.4290   0.0000
Time           0.1208     0.0014  84.3990   0.0000
MonthAug       0.2144     0.0242   8.8549   0.0000
MonthDec      -0.0982     0.0242  -4.0541   0.0001
MonthFeb      -0.0990     0.0242  -4.0870   0.0001
.........................
MonthJun       0.1198     0.0242   4.9468   0.0000
MonthMar       0.0313     0.0242   1.2914   0.1989
MonthMay      -0.0024     0.0242  -0.0978   0.9222
MonthNov      -0.2121     0.0242  -8.7548   0.0000
MonthOct      -0.0684     0.0242  -2.8228   0.0055
MonthSep       0.0698     0.0242   2.8814   0.0046
Residual standard error: 0.0593 on 131 degrees of freedom
Multiple R-Squared: 0.9835
F-statistic: 649.4 on 12 and 131 degrees of freedom, the p-value is 0
Dummy (or Indicator) Variables
Ordered months:
> sort(unique(Month))
[1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
Letting April be the baseline month, we define
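$$\mathrm{Aug} = \mathrm{MonthAug} = \begin{cases} 1 & \text{in August} \\ 0 & \text{otherwise} \end{cases}$$
$$\mathrm{Dec} = \mathrm{MonthDec} = \begin{cases} 1 & \text{in December} \\ 0 & \text{otherwise} \end{cases}$$
$$\vdots$$
$$\mathrm{Sep} = \mathrm{MonthSep} = \begin{cases} 1 & \text{in September} \\ 0 & \text{otherwise} \end{cases}$$
Because April is the baseline month, it needs no dummy variable.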
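The dummy coding described above is what R produces by default for a factor (treatment contrasts with the first level as baseline). The sketch below inspects the design matrix; it assumes the built-in AirPassengers series for the calendar layout, and the object X is introduced only for illustration.

```r
# Assumption: AirPassengers supplies the monthly calendar (Jan 1949 - Dec 1960).
passengers.ts <- AirPassengers
Month <- factor(month.abb[as.integer(cycle(passengers.ts))])

# Levels are sorted alphabetically, so Apr is the baseline level
print(levels(Month))

# Design matrix: an intercept column plus 11 dummy columns (none for Apr)
X <- model.matrix(~ Month)
print(colnames(X))

# Rows 1-4 are Jan-Apr 1949; row 8 is Aug 1949
print(X[c(1:4, 8), c("MonthAug", "MonthDec", "MonthJan", "MonthSep")])
```

The April row is zero in every dummy column, so its fitted value comes from the intercept and Time terms alone.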
Airline Model 2
> sort(unique(Month))
[1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
• Model:
$$\log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_2 \mathrm{Aug} + \beta_3 \mathrm{Dec} + \cdots + \beta_{10} \mathrm{Nov} + \beta_{11} \mathrm{Oct} + \beta_{12} \mathrm{Sep} + a_t$$
• Apr: $\log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time}$
• Aug: $\log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_2 = (\beta_0 + \beta_2) + \beta_1 \mathrm{Time}$
• Sep: $\log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_{12} = (\beta_0 + \beta_{12}) + \beta_1 \mathrm{Time}$
Airline Model 2 OLS Estimates
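• Model parameter estimates:
$$\widehat{\log(\mathrm{Passengers})} = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{Time} + \hat{\beta}_2 \mathrm{Aug} + \hat{\beta}_3 \mathrm{Dec} + \cdots + \hat{\beta}_{10} \mathrm{Nov} + \hat{\beta}_{11} \mathrm{Oct} + \hat{\beta}_{12} \mathrm{Sep}$$
$$= -231 + 0.12\,\mathrm{Time} + 0.21\,\mathrm{Aug} - 0.10\,\mathrm{Dec} + \cdots - 0.21\,\mathrm{Nov} - 0.07\,\mathrm{Oct} + 0.07\,\mathrm{Sep}$$
$$S = 0.0593, \qquad R^2 = 0.9835$$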
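Model 2 gives every month the same slope but its own intercept. The sketch below fits the model and collects the month-specific intercepts. It assumes the built-in AirPassengers series matches the course data; the column name logPassengers and the vector intercepts are introduced here for illustration.

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
airline.frame <- data.frame(
  logPassengers = log(as.numeric(passengers.ts)),
  Time  = as.numeric(time(passengers.ts)),
  Month = factor(month.abb[as.integer(cycle(passengers.ts))])
)

airline.fit2 <- lm(logPassengers ~ Time + Month, data = airline.frame)
b <- coef(airline.fit2)

# Baseline (Apr) intercept is b0; every other month adds its dummy coefficient
month.terms <- b[grep("^Month", names(b))]
intercepts  <- c(Apr = b[["(Intercept)"]], b[["(Intercept)"]] + month.terms)

print(round(intercepts, 3))
print(c(S = summary(airline.fit2)$sigma, R2 = summary(airline.fit2)$r.squared))
```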
[Figure: International Airline Passengers 1949-1960, Model 2 fitted values. Thousands of passengers versus year, with fitted curves labeled Jun, Jul, Aug, Mar, and Nov.]
[Figure: International Airline Passengers 1949-1960, Model 2 residual time series plot. Residuals versus year.]
[Figure: International Airline Passengers 1949-1960, Model 2 residuals versus fitted values. residuals(airline.fit2) versus fitted.values(airline.fit2).]
[Figure: International Airline Passengers 1949-1960, Model 2 normal Q-Q plot of the residuals. residuals(airline.fit2) versus quantiles of the standard normal.]
[Figure: International Airline Passengers 1949-1960, Model 2 ACF of the residuals. Series: residuals(airline.fit2); ACF versus lag.]
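The four Model 2 diagnostic displays (residual time series plot, residuals versus fitted values, normal Q-Q plot, and residual ACF) can be drawn with base R graphics. This is a rough sketch, assuming the built-in AirPassengers series matches the course data; the plot titles and layout are illustrative choices, not the settings used for the original figures.

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
airline.frame <- data.frame(
  logPassengers = log(as.numeric(passengers.ts)),
  Time  = as.numeric(time(passengers.ts)),
  Month = factor(month.abb[as.integer(cycle(passengers.ts))])
)
airline.fit2 <- lm(logPassengers ~ Time + Month, data = airline.frame)

res <- residuals(airline.fit2)
fit <- fitted.values(airline.fit2)

op <- par(mfrow = c(2, 2))
# Residual time series plot
plot(airline.frame$Time, res, type = "l", xlab = "Year", ylab = "Residual",
     main = "Residual time series")
abline(h = 0, lty = 2)
# Residuals versus fitted values
plot(fit, res, xlab = "Fitted value", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
# Normal Q-Q plot
qqnorm(res)
qqline(res)
# Residual ACF (lags here are in months because res is a plain numeric vector)
res.acf <- acf(res, main = "ACF of the residuals")
par(op)
```

Slowly decaying residual autocorrelations in the last panel would indicate that the IID assumption on the errors is questionable.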
Module 2
Segment 4
Dummy Variable plus Interaction Model
for Time Series Seasonality
Airline Model 3 R Output
> airline.fit3 <- lm(log(passengers.ts) ~ Time + Month + Time:Month)
> summary(airline.fit3)
                 Value Std. Error  t value Pr(>|t|)
(Intercept)  -222.1037     9.1389 -24.3030   0.0000
Time            0.1164     0.0047  24.9058   0.0000
MonthAug      -27.7202    12.9255  -2.1446   0.0340
MonthDec        2.0172    12.9266   0.1561   0.8763
MonthFeb       13.9378    12.9238   1.0785   0.2830
...........................................
MonthOct      -10.3970    12.9260  -0.8043   0.4228
MonthSep       -7.2389    12.9258  -0.5600   0.5765
TimeMonthAug    0.0143     0.0066   2.1611   0.0327
TimeMonthDec   -0.0011     0.0066  -0.1634   0.8705
...........................................
TimeMonthOct    0.0053     0.0066   0.7991   0.4258
TimeMonthSep    0.0037     0.0066   0.5655   0.5728
Residual standard error: 0.05591 on 120 degrees of freedom
Multiple R-Squared: 0.9865
F-statistic: 382.4 on 23 and 120 degrees of freedom, the p-value is 0
Airline Model 3 and OLS Estimates
> sort(unique(Month))
[1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
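• Model:
$$\log(\mathrm{Passengers}) = \beta_0 + \beta_1 \mathrm{Time} + \beta_2 \mathrm{Aug} + \beta_3 \mathrm{Dec} + \cdots + \beta_{11} \mathrm{Oct} + \beta_{12} \mathrm{Sep} + \beta_{13} \mathrm{TimeMonthAug} + \cdots + \beta_{22} \mathrm{TimeMonthOct} + \beta_{23} \mathrm{TimeMonthSep} + a_t$$
• Where
$$\mathrm{TimeMonthAug} = \mathrm{Time} \times \mathrm{MonthAug}$$
$$\vdots$$
$$\mathrm{TimeMonthOct} = \mathrm{Time} \times \mathrm{MonthOct}$$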
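With the interaction terms, each month has its own slope as well as its own intercept. The sketch below fits Model 3 and collects the month-specific slopes. It assumes the built-in AirPassengers series matches the course data; note that current R labels the interaction coefficients Time:MonthAug and so on, rather than TimeMonthAug as in the output shown in these notes.

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
airline.frame <- data.frame(
  logPassengers = log(as.numeric(passengers.ts)),
  Time  = as.numeric(time(passengers.ts)),
  Month = factor(month.abb[as.integer(cycle(passengers.ts))])
)

airline.fit3 <- lm(logPassengers ~ Time + Month + Time:Month,
                   data = airline.frame)
b <- coef(airline.fit3)

# Baseline (Apr) slope is the Time coefficient; other months add their
# Time:Month interaction coefficient
interaction.terms <- b[grep("^Time:Month", names(b))]
slopes <- c(Apr = b[["Time"]], b[["Time"]] + interaction.terms)

print(round(slopes, 4))
```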
[Figure: International Airline Passengers 1949-1960, Model 3 fitted values. Thousands of passengers versus year, with fitted curves labeled July, Aug, Jun, Mar, and Nov.]
ANOVA Tables for Model 2 and Model 3
> anova(airline.fit2)
Analysis of Variance Table
Response: log(passengers.ts)
Terms added sequentially (first to last)
           Df Sum of Sq  Mean Sq  F Value Pr(F)
Time        1  25.12337 25.12337 7143.588     0
Month      11   2.28429  0.20766   59.047     0
Residuals 131   0.46072  0.00352
> anova(airline.fit3)
Analysis of Variance Table
Response: log(passengers.ts)
Terms added sequentially (first to last)
            Df Sum of Sq  Mean Sq  F Value       Pr(F)
Time         1  25.12337 25.12337 8037.763 0.000000000
Month       11   2.28429  0.20766   66.438 0.000000000
Time:Month  11   0.08564  0.00779    2.491 0.007494726
Residuals  120   0.37508  0.00313
Comparing Model 2 and Model 3
To test the statistical significance of the Time:Month terms:
$$F = \frac{(\mathrm{SSE}_{\mathrm{Reduced}} - \mathrm{SSE}_{\mathrm{Full}})/\#\text{parameters tested}}{\mathrm{SSE}_{\mathrm{Full}}/\mathrm{dof}_{\mathrm{full}}} = \frac{(0.46072 - 0.37508)/11}{0.37508/120} = 2.49$$
> qf(.95,11,120)
[1] 1.86929
Thus, because F = 2.49 > F(.95;11,120) = 1.87, we conclude that there is evidence of differing slopes at the 5% level of significance. Alternatively, the p-value is Pr(F > 2.49) = 0.0075.
> 1-pf(2.49,11,120)
[1] 0.007510861
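The same partial F test can be computed from the two fitted models and cross-checked with anova() applied to both fits. A sketch under the assumption that the built-in AirPassengers series matches the course data; the names sse.reduced, sse.full, and F.stat are introduced for illustration.

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
airline.frame <- data.frame(
  logPassengers = log(as.numeric(passengers.ts)),
  Time  = as.numeric(time(passengers.ts)),
  Month = factor(month.abb[as.integer(cycle(passengers.ts))])
)
airline.fit2 <- lm(logPassengers ~ Time + Month, data = airline.frame)
airline.fit3 <- lm(logPassengers ~ Time + Month + Time:Month,
                   data = airline.frame)

# Error sums of squares for the reduced (Model 2) and full (Model 3) fits
sse.reduced <- sum(residuals(airline.fit2)^2)
sse.full    <- sum(residuals(airline.fit3)^2)
df.tested   <- df.residual(airline.fit2) - df.residual(airline.fit3)
df.full     <- df.residual(airline.fit3)

F.stat  <- ((sse.reduced - sse.full) / df.tested) / (sse.full / df.full)
p.value <- pf(F.stat, df.tested, df.full, lower.tail = FALSE)

# Built-in comparison of the nested models
cmp <- anova(airline.fit2, airline.fit3)
print(cmp)
print(c(F = F.stat, critical = qf(0.95, df.tested, df.full), p = p.value))
```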
Module 2
Segment 5
Dummy Variable plus Interaction Model
without Transformation
Airline Model 4 R Output
> airline.fit4 <- lm(passengers.ts ~ Time + Month + Time:Month)
> summary(airline.fit4)
                   Value Std. Error  t value Pr(>|t|)
(Intercept)  -57192.9980  2618.7677 -21.8397   0.0000
Time             29.3951     1.3397  21.9417   0.0000
MonthAug     -26257.8983  3703.8125  -7.0894   0.0000
MonthDec       2005.7773  3704.1284   0.5415   0.5892
MonthFeb       9479.1992  3703.3389   2.5596   0.0117
...........................................
MonthOct      -2134.5245  3703.9704  -0.5763   0.5655
MonthSep      -9219.5218  3703.8915  -2.4891   0.0142
TimeMonthAug     13.4685     1.8946   7.1089   0.0000
TimeMonthDec     -1.0385     1.8946  -0.5481   0.5846
...........................................
TimeMonthOct      1.0839     1.8946   0.5721   0.5683
TimeMonthSep      4.7273     1.8946   2.4951   0.0140
Residual standard error: 16.02 on 120 degrees of freedom
Multiple R-Squared: 0.985
F-statistic: 343.4 on 23 and 120 degrees of freedom, the p-value is 0
[Figure: International Airline Passengers 1949-1960, Model 4 fitted values. Thousands of passengers versus year.]
[Figure: International Airline Passengers 1949-1960, Model 4 residual time series plot. Residuals versus year.]
[Figure: International Airline Passengers 1949-1960, Model 4 residuals versus fitted values. residuals(airline.fit4) versus fitted.values(airline.fit4).]
[Figure: International Airline Passengers 1949-1960, Model 4 normal Q-Q plot of the residuals. residuals(airline.fit4) versus quantiles of the standard normal.]
[Figure: International Airline Passengers 1949-1960, Model 4 residual ACF. Series: residuals(airline.fit4); ACF versus lag.]
Variability Often Depends on Level
• Possible model: Stock Price varies ±10% of E(Price).
• Example:
Stock #1 E(Price)=10
Price = 10 ±10%
Price = 10 ±1
Stock #2 E(Price)=100
Price = 100 ±10%
Price = 100 ±10
This implies nonconstant variance!
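• Percent change prediction:
$$\mathrm{Price} = [\widehat{\mathrm{Price}}/1.10,\; 1.10 \times \widehat{\mathrm{Price}}]$$
$$\log(\mathrm{Price}) = \widehat{\log(\mathrm{Price})} \pm \log(1.10)$$
$$\log(\mathrm{Price}) = \widehat{\log(\mathrm{Price})} \pm 0.0953$$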
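The effect of the log transformation in the two-stock example can be checked with a few lines of arithmetic. The sketch below uses the multiplicative limits [Price/1.10, 1.10 × Price]; the variable names are illustrative only.

```r
# Expected prices from the two-stock example
expected.price <- c(stock1 = 10, stock2 = 100)

# Multiplicative (percent change) limits: [Price/1.10, 1.10 * Price]
lower <- expected.price / 1.10
upper <- expected.price * 1.10

# On the original scale the interval width grows with the level
width.original <- upper - lower

# On the log scale the half-width is the same constant for both stocks
half.width.log <- log(upper) - log(expected.price)

print(width.original)
print(half.width.log)
print(log(1.10))
```

Both stocks have the same half-width, log(1.10), on the log scale, which is why taking logs can stabilize a variance that is proportional to the level.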
Module 2
Segment 6
Interpretation of Log-Linear Model Parameters and
Introduction to Prediction Intervals
Interpretation of Regression Coefficients
in a Log-Linear Model
(fit linear model after taking the logarithm of the response)
Example:
$$Y_t = \beta_0 + \beta_1 t + a_t$$
$$E(Y_t) = \beta_0 + \beta_1 t = E(Y_{t-1}) + \beta_1$$
$$\exp[E(Y_t)] = \exp[E(Y_{t-1})] \times \exp[\beta_1]$$
For the airline data, $\exp[\hat{\beta}_1] = \exp[0.1206] = 1.128$, or about 13% growth per year.
In general,
$$100(\exp(\text{slope}) - 1) = \text{percent growth per unit time}$$
and $100(\text{slope}) \approx$ percent growth per unit time if the slope is small (e.g., a few percent).
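The growth-rate arithmetic for the airline slope can be reproduced directly. A small sketch, using the Model 1 slope estimate 0.1206 reported earlier; the variable names are illustrative.

```r
# Model 1 slope estimate on the log scale (per year)
slope <- 0.1206

# Multiplicative growth factor per unit time
growth.factor <- exp(slope)

# Exact and approximate percent growth per unit time
pct.exact  <- 100 * (growth.factor - 1)
pct.approx <- 100 * slope

print(round(c(growth.factor = growth.factor,
              pct.exact = pct.exact,
              pct.approx = pct.approx), 3))
```

The approximation 100(slope) understates the exact figure, and the gap widens as the slope grows, which is why it is recommended only for small slopes.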
Prediction of the Response in a Regression Model
• A 100(1 − α)% prediction interval for a future value of $Y_t$ is
$$\hat{y}_t \pm t_{(1-\alpha/2,\,n-p)}\, S_{(Y-\hat{Y})}$$
where the prediction standard error $S_{(Y-\hat{Y})}$ is an estimate of $\sqrt{\mathrm{Var}(Y-\hat{Y})}$.
• Familiar formulas for the prediction variance for a regression response at $x_0$:
Simple:
$$\mathrm{Var}(Y-\hat{Y}) = \sigma^2\left[\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2} + 1\right]$$
Multiple:
$$\mathrm{Var}(Y-\hat{Y}) = \sigma^2\left[x_0'(X'X)^{-1}x_0 + 1\right]$$
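The simple-regression formula can be compared with the interval returned by predict() for Model 1. This is a sketch under stated assumptions: the built-in AirPassengers series stands in for the course data, and the future time x0 = 1961 is a hypothetical value chosen for illustration.

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
airline.frame <- data.frame(
  logPassengers = log(as.numeric(passengers.ts)),
  Time = as.numeric(time(passengers.ts))
)
airline.fit1 <- lm(logPassengers ~ Time, data = airline.frame)

x0 <- 1961                       # hypothetical future time point
n  <- nrow(airline.frame)
p  <- length(coef(airline.fit1))
S  <- summary(airline.fit1)$sigma
x  <- airline.frame$Time

# Prediction standard error from the simple-regression formula, with S for sigma
se.pred <- S * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2) + 1)

# 95% prediction interval on the log scale
y0.hat <- sum(coef(airline.fit1) * c(1, x0))
t.crit <- qt(0.975, n - p)
manual <- c(fit = y0.hat,
            lwr = y0.hat - t.crit * se.pred,
            upr = y0.hat + t.crit * se.pred)

builtin <- predict(airline.fit1, newdata = data.frame(Time = x0),
                   interval = "prediction", level = 0.95)

print(rbind(manual = manual, builtin = as.numeric(builtin)))
# Back-transformed limits, in thousands of passengers
print(exp(manual))
```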
Prediction Standard Error
for an Independent Future Observation
• There are two sources of variability in a statistical prediction.
$$\mathrm{Var}(Y-\hat{Y}) = \mathrm{Var}(Y) + \mathrm{Var}(\hat{Y})$$
An estimate of this variance is
$$S^2_{(Y-\hat{Y})} = S^2 + S^2_{\hat{Y}}$$
where $S^2 = \mathrm{MSE} = \sum_{t=1}^{n}(Y_t-\hat{Y}_t)^2/(n-p)$.
• As $n \to \infty$,
$$S^2_{\hat{Y}} \to 0, \qquad S^2 \to \sigma^2$$
Thus for large n
$$S^2_{(Y-\hat{Y})} \approx S^2$$
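The two variance components can be pulled out of predict() and combined by hand. A minimal sketch, assuming the built-in AirPassengers series as a stand-in for the course data and a hypothetical future time of 1961; se.fit and residual.scale are the components that predict() returns when se.fit = TRUE.

```r
# Assumption: AirPassengers stands in for the course data set.
passengers.ts <- AirPassengers
airline.frame <- data.frame(
  logPassengers = log(as.numeric(passengers.ts)),
  Time = as.numeric(time(passengers.ts))
)
airline.fit1 <- lm(logPassengers ~ Time, data = airline.frame)

# Hypothetical future time point
pr <- predict(airline.fit1, newdata = data.frame(Time = 1961),
              se.fit = TRUE, interval = "prediction", level = 0.95)

S2      <- pr$residual.scale^2   # S^2, estimate of Var(Y)
S2.yhat <- pr$se.fit^2           # S^2_Yhat, estimate of Var(Yhat)
S2.pred <- S2 + S2.yhat          # estimate of Var(Y - Yhat)

# The prediction-interval half-width should equal t * sqrt(S2.pred)
half.width <- pr$fit[, "upr"] - pr$fit[, "fit"]
t.crit <- qt(0.975, pr$df)

print(c(S2 = S2, S2.yhat = S2.yhat, S2.pred = S2.pred))
print(c(half.width = unname(half.width), check = t.crit * sqrt(S2.pred)))
```

With n = 144 the S2.yhat term is already small relative to S2, consistent with the large-n approximation.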