
Forecasting: Principles and Practice, selected exercise solutions

library(tsibble)
library(tsibbledata)
library(feasts)
library(fable)
library(tidyverse)
library(lubridate)
library(magrittr)
library(ggthemes)
theme_set(theme_light())
theme_update(plot.title = element_text(hjust = 0.5))
Question 1 (Exercise 2.10.4)
library(USgas)
# us_total is a plain data frame, so coerce it to a tsibble
ts = us_total %>%
  as_tsibble(index = year, key = state)
ts_new_england = ts %>%
  filter(state %in% c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts',
                      'Connecticut', 'Rhode Island'))
autoplot(ts_new_england, y) +
  labs(x = 'Year',
       y = 'Demand for natural gas',
       title = 'Demand for natural gas in New England over time')
Question 2-1 (Exercise 2.10.9)
aus = aus_production %>%
  filter(!is.na(Bricks))
aus %>% autoplot(Bricks) + labs(title = 'autoplot')
aus %>% gg_season(Bricks) + labs(title = 'gg_season')
aus %>% gg_subseries(Bricks) + labs(title = 'gg_subseries')
aus %>% ACF(Bricks) %>% autoplot() + labs(title = 'ACF %>% autoplot')
aus %>% gg_lag(Bricks, geom = 'point') + labs(title = 'gg_lag') +
  theme(legend.position = 'top')
There was a very strong upward trend up to 1980; then it stopped, and the series drifts downward after 1980.
There is also clear seasonality: production in the first quarter is almost always lower than in the other quarters, and the third quarter is usually the most productive one.
We have strong cyclicity as well, with a period of around 5 years. Interestingly, each cycle starts with a steep drop, followed by roughly 5 years of recovery before the next drop.
Question 2-2 (Exercise 5.11.1)
aus %>%
  model(SNAIVE(Bricks ~ lag('year'))) %>%
  forecast(h = 10) %>%
  autoplot(aus) +
  labs(title = 'Forecast with SNAIVE model')
Question 2-3 (Exercise 5.11.11)
A. STL decomposition
Let's try different window sizes for estimating the seasonal component and compare their residuals.
aus %>%
  filter(!is.na(Bricks)) %>%
  model(STL(Bricks ~ trend() +
              season(window = "periodic"),
            robust = FALSE)) %>%
  gg_tsresiduals() +
  labs(title = 'Season window = infinity')
aus %>%
  filter(!is.na(Bricks)) %>%
  model(STL(Bricks ~ trend() +
              season(window = 4),
            robust = FALSE)) %>%
  gg_tsresiduals() +
  labs(title = 'Season window = 4')
aus %>%
  filter(!is.na(Bricks)) %>%
  model(STL(Bricks ~ trend() +
              season(window = 10),
            robust = FALSE)) %>%
  gg_tsresiduals() +
  labs(title = 'Season window = 10')
The decomposition with a fixed seasonal component produces the least autocorrelated residuals, so let's choose that one.
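We can also back this claim up numerically with a Ljung-Box test on each fit's residuals. A minimal sketch, assuming augment() exposes the STL remainder as .innov (the same series gg_tsresiduals() plots):
# Compare residual autocorrelation across the three seasonal windows;
# a smaller lb_stat (larger p-value) means less autocorrelation
aus %>%
  model(periodic = STL(Bricks ~ trend() + season(window = "periodic")),
        win4     = STL(Bricks ~ trend() + season(window = 4)),
        win10    = STL(Bricks ~ trend() + season(window = 10))) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 8)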
dcmp = aus %>%
  filter(!is.na(Bricks)) %>%
  model(STL(Bricks ~ trend() +
              season(window = "periodic"),
            robust = FALSE)) %>%
  components()
autoplot(dcmp)
B + C. Seasonally adjusted data
dcmp %<>% select(-.model)
dcmp %>%
  model(NAIVE(season_adjust)) %>%
  forecast(h = 10) %>%
  autoplot(dcmp) +
  labs(title = 'Seasonally adjusted data with naive forecast')
D. decomposition_model()
dcmp_full = aus %>%
  model(decomposition_model(
    STL(Bricks ~ trend() + season(window = "periodic"), robust = FALSE),
    NAIVE(season_adjust)
  ))
dcmp_full %>%
  forecast(h = 10) %>%
  autoplot(aus)
E. Autocorrelation of residuals
dcmp_full %>%
  gg_tsresiduals()
At certain periods the residuals look autocorrelated, but in general not much.
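To quantify this, we can run a Ljung-Box test on the model's residuals (a sketch, under the same .innov assumption as before):
# Formal check of residual autocorrelation for the decomposition model
augment(dcmp_full) %>%
  features(.innov, ljung_box, lag = 8)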
F. Comparing with robust STL.
dcmp_robust = aus %>%
  model(decomposition_model(
    STL(Bricks ~ trend() + season(window = "periodic"), robust = TRUE),
    NAIVE(season_adjust)
  ))
dcmp_robust %>%
  gg_tsresiduals()
I don't see much difference here, probably because there aren't many outliers in the data.
G. Compare SNAIVE and decomposition_model() forecasts
train = head(aus, nrow(aus) - 8)
test = tail(aus, 8)
fc = train %>%
  model(SNaive = SNAIVE(Bricks),
        Decomposition = decomposition_model(
          STL(Bricks ~ trend() + season(window = "periodic"), robust = FALSE),
          NAIVE(season_adjust))) %>%
  forecast(h = 8)
fc %>%
  autoplot(tail(aus, 40), level = NULL)
accuracy(fc, test)
## # A tibble: 2 x 10
##   .model        .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>         <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Decomposition Test   8.00  18.1  13.8 1.82   3.36   NaN   NaN  0.0957
## 2 SNaive        Test   2.75  20    18.2 0.395  4.52   NaN   NaN -0.0503
The decomposition model looks better. The seasonal dip in winter 2004 was unusually small, which gave some points to the seasonal naive model, whose forecast implies a smaller winter dip; but most winters had a dip closer to the one in the decomposition model's forecast. And even with that winter included, RMSE, MAE and MAPE are all better for the decomposition model, so I would choose it over the seasonal naive model.
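A side note on the NaN values for MASE and RMSSE: those metrics scale errors by in-sample errors from the training period, so accuracy() cannot compute them when given only the 8 test rows. Passing the full series should fill them in (a sketch):
# MASE/RMSSE need the training period for their scaling factor,
# so evaluate against the full series instead of just the test slice
accuracy(fc, aus)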
Question 3 (Exercise 7.10.1)
A. Relationship of temperature and demand
jan14_vic_elec = vic_elec %>%
  filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
jan14_vic_elec %>%
  ggplot() +
  aes(Temperature, Demand) +
  geom_point(size = 2) +
  geom_smooth(method = 'lm') +
  labs(title = 'January, 2014',
       x = 'Maximum daily temperature')
Demand is positively correlated with temperature: on hot days people use air conditioning for cooling, which increases electricity demand.
B. Residuals plot
fit = jan14_vic_elec %>%
  model(TSLM(Demand ~ Temperature))
augment(fit) %>%
  ggplot(aes(Date)) +
  geom_line(aes(y = Demand, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  labs(title = 'Electricity demand for Victoria, Australia') +
  guides(colour = guide_legend(title = "Series"))
augment(fit) %>%
  ggplot() +
  geom_point(aes(.fitted, .resid)) +
  geom_hline(yintercept = 0) +
  labs(x = 'Fitted value',
       y = 'Residual')
report(fit)
## Series: Demand
## Model: TSLM
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -49978.2  -10218.9    -121.3   18533.2   35440.6
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  59083.9    17424.8   3.391  0.00203 **
## Temperature   6154.3      601.3  10.235 3.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
In general, this model isn't too bad: with an R² of 78% it captures most of the variability in demand. There are a few mild outliers, days with very low demand despite maximum temperatures of 26-27°C; these could be rainy days where a relatively high temperature maximum didn't last long.
C. Forecast
Our model equation is
Demand = 59083.9 + 6154.3 × Temperature
With a temperature of 15°C we predict demand of 151398; with 35°C, demand of 274484. The second forecast looks fine to me, but I'm not so confident about the forecast for 15°C, because I think there is some threshold below which people don't use air conditioning anymore, and maybe even turn on heaters, so the linear relationship won't hold below that threshold. Since there are no such low temperatures in January 2014, it's hard to guess where that threshold is.
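These point forecasts can be reproduced by hand from the coefficient table; a quick sketch (tidy() on the fitted mable returns one row per term, intercept first):
# Reproduce the two point forecasts from the fitted coefficients
b = tidy(fit)$estimate          # c(intercept, Temperature slope)
b[1] + b[2] * c(15, 35)         # roughly 151398 and 274484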
D. Prediction intervals
An easy way to get prediction intervals is the hilo() function:
fit %>%
  forecast(
    new_data(jan14_vic_elec, 2) %>%
      mutate(Temperature = c(15, 35))
  ) %>%
  hilo(level = c(80, 95)) %>%
  unpack_hilo(c(`80%`, `95%`)) %>%
  select(Temperature, `80%_lower`, `80%_upper`, `95%_lower`, `95%_upper`,
         `.mean`) %>%
  mutate_all(round)
## # A tsibble: 2 x 7 [1D]
##   Temperature `80%_lower` `80%_upper` `95%_lower` `95%_upper`  .mean Date
##         <dbl>       <dbl>       <dbl>       <dbl>       <dbl>  <dbl> <date>
## 1          15      117908      184889      100179      202617 151398 2014-02-01
## 2          35      242088      306880      224939      324029 274484 2014-02-02
E. Using the whole dataset
vic_elec_full = vic_elec %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
vic_elec_full %>%
  ggplot() +
  aes(Temperature, Demand) +
  geom_point(color = 'midnightblue') +
  labs(title = 'Daily electricity demand for Victoria, Australia',
       x = 'Maximum daily temperature')
As we suggested earlier, at some point all the air conditioners are already off, so a further decrease in temperature doesn't decrease energy demand. And when the temperature drops even lower, roughly below 20°C, people start using electricity for heating, so temperature and demand are negatively correlated below 20°C.
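This shape suggests a piecewise-linear fit. A minimal sketch, assuming a knot at 20°C (the knot location is my guess, not something the exercise specifies):
# Piecewise-linear regression: one slope below 20°C, another above it
vic_elec_full %>%
  model(TSLM(Demand ~ Temperature + I(pmax(Temperature - 20, 0)))) %>%
  report()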
Question 4 (Exercise 7.10.4)
A. Time plot
library(fpp3)
autoplot(souvenirs) +
  labs(title = 'Sales of a souvenir shop in Australia')
This dataset has clear yearly seasonality: very high peaks before every Christmas and smaller peaks during the local surfing festival. There is also a trend over time: sales grew over the years, and the peak before Christmas 1993 is roughly 5 times higher than the one before Christmas 1988.
B. Logging the data.
Let's take a look at the distribution of monthly sales.
ggplot(souvenirs) +
  geom_histogram(aes(Sales), bins = 45, fill = 'midnightblue')
As we can see, the distribution is heavily right-skewed. The differences between most observations are relatively small, while some observations on the right are very far away. A variable with such a distribution is unlikely to have a linear relationship with other variables, unless those variables have the same shape of distribution. If we take its log, it looks like this:
souv = souvenirs %>%        # souvenirs is already a monthly tsibble
  mutate(Sales = log(Sales))
ggplot(souv) +
  geom_histogram(aes(Sales), bins = 45, fill = 'midnightblue')
autoplot(souv) +
  labs(y = 'Log of sales')
Such a variable will probably have a more linear relationship with other variables. Also, the trend is much more linear now, which is important because we will use a linear trend as a predictor.
C. Fitting linear model
To add a linear trend over time, we can simply add the numbers 0 through 83 as a predictor (we have 84 observations); the regression then estimates the trend slope that minimizes the squared errors. We also need seasonal dummy variables, plus a 'surfing festival' dummy, which equals 1 every March and 0 in any other month.
To avoid perfect multicollinearity among the 4 seasons, we use spring as the baseline season, so the coefficients of the remaining 3 seasons reflect the difference compared to spring.
souv$trend = 0:83                                  # linear time trend
souv$fest = as.integer(month(souv$Month) == 3)     # surfing festival every March
# Season dummies; spring is the baseline, so it gets no column in the model
souv$winter = as.integer(month(souv$Month) %in% c(12, 1, 2))
souv$spring = as.integer(month(souv$Month) %in% c(3, 4, 5))
souv$summer = as.integer(month(souv$Month) %in% c(6, 7, 8))
souv$autumn = as.integer(month(souv$Month) %in% c(9, 10, 11))
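As an aside, fable's TSLM can generate a trend and a full set of monthly dummies automatically via its trend() and season() specials. A sketch (this fits monthly dummies rather than the four season dummies above, so it is a different, finer-grained model):
# Let TSLM build the trend and monthly seasonal dummies itself
souv %>%
  model(TSLM(Sales ~ trend() + season() + fest)) %>%
  report()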
souv_lm = souv %>%
  model(TSLM(Sales ~ trend + fest + winter + summer + autumn))
augment(souv_lm) %>%
  ggplot(aes(Month)) +
  geom_line(aes(y = Sales, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(values = c(Data = "black", Fitted = '#0055CC')) +
  labs(title = 'Fitted linear model', y = 'Log of sales') +
  guides(colour = guide_legend(title = "Series"))
D. Residual plots against time and fitted values
gg_tsresiduals(souv_lm) +
  labs(title = 'Residuals against time')
augment(souv_lm) %>%
  ggplot() +
  geom_point(aes(.fitted, .innov), color = 'midnightblue') +
  geom_hline(yintercept = 0) +
  labs(x = 'Fitted value',
       y = 'Residual',
       title = 'Residuals against fitted values')
The plot of residuals against fitted values doesn't reveal much of interest to me, but the plot against time definitely does. We can see strong yearly seasonality for some months, though not for every month. Also, the ACF plot of the residuals shows very strong autocorrelation at lag 12, which says the same thing once more: yearly seasonality in the residuals.
E. Boxplots of the residuals by month
augment(souv_lm) %>%
  ggplot() +
  geom_boxplot(aes(.innov,
                   factor(month.name[month(Month)], levels = month.name),
                   fill = as.factor(month(Month)))) +
  labs(x = 'Residual', y = 'Month',
       title = 'Residuals for each month') +
  coord_flip() +
  theme(legend.position = 'none',
        axis.text.x = element_text(angle = -90, hjust = 0))
Here we can see more clearly which months have biased residuals. Predictions for the period from November through February are very poor with our current model, especially December.
F. Interpretation of coefficients
report(souv_lm)
## Series: Sales
## Model: TSLM
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.97322 -0.24768 -0.04151  0.17522  1.41261
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.946636   0.160486  49.516  < 2e-16 ***
## trend       0.024344   0.002257  10.786  < 2e-16 ***
## fest        0.302670   0.231154   1.309  0.19425
## winter      0.339953   0.172277   1.973  0.05200 .
## summer      0.146501   0.172366   0.850  0.39796
## autumn      0.464935   0.172720   2.692  0.00869 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 78 degrees of freedom
## Multiple R-squared: 0.6247, Adjusted R-squared: 0.6007
## F-statistic: 25.97 on 5 and 78 DF, p-value: 2.4365e-15
Trend: each month the baseline prediction increases by 2.5% (starting from 2826 in the first month)
Surfing festival: increases the prediction by 35.3%
Winter: increases the prediction by 40.5%
Summer: increases the prediction by 15.8%
Autumn: increases the prediction by 59.2%
G. Ljung-Box test
Box.test(augment(souv_lm)$.innov, lag = 12, type = "Ljung")
##
## Box-Ljung test
##
## data: augment(souv_lm)$.innov
## X-squared = 81.2, df = 12, p-value = 2.435e-12
According to the Ljung-Box test, the residuals of our model are definitely autocorrelated (well, we already had no doubts about that). It means we can make our predictions more accurate.
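The same test can also be run fpp3-style through the ljung_box feature; this should reproduce the statistic above:
# fpp3-style equivalent of the Box-Ljung test on the residuals
augment(souv_lm) %>%
  features(.innov, ljung_box, lag = 12)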
H. Forecast
souv_lm %>%
  # reuse rows 1:36 shifted 84 months ahead: 84 months is exactly 7 years,
  # so the calendar months (and hence the fest/season dummies) still line up
  forecast(new_data = souv[1:36, ] %>%
             mutate(trend = 84:119, Month = Month + 84)) %>%
  mutate(Sales = exp(Sales)) %>%
  autoplot(souv %>% mutate(Sales = exp(Sales)))
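A more explicit alternative (a sketch) builds the future regressors from scratch with new_data() instead of recycling old rows:
# Build future regressors explicitly for the 36-month horizon
future = new_data(souv, 36) %>%
  mutate(trend  = 84:119,
         fest   = as.integer(month(Month) == 3),
         winter = as.integer(month(Month) %in% c(12, 1, 2)),
         summer = as.integer(month(Month) %in% c(6, 7, 8)),
         autumn = as.integer(month(Month) %in% c(9, 10, 11)))
souv_lm %>%
  forecast(new_data = future) %>%
  mutate(Sales = exp(Sales)) %>%
  autoplot(souv %>% mutate(Sales = exp(Sales)))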
I. Suggestions for improvement
We added a dummy variable for March, and the median residual for March is close to zero. We should do the same for December; it's absolutely necessary. Once we add it, the coefficient for winter will be reduced significantly and predictions for January/February will become much more accurate almost automatically. Still, it's probably worth adding a dummy variable for January as well, because after people buy a lot in December, they don't seem to buy much in January; a sketch of this refinement follows below.
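A minimal sketch of that refinement (hypothetical, not fitted in the original analysis):
# Hypothetical refinement: explicit December and January dummies
souv_dec = souv %>%
  mutate(dec = as.integer(month(Month) == 12),
         jan = as.integer(month(Month) == 1))
souv_dec %>%
  model(TSLM(Sales ~ trend + fest + winter + summer + autumn + dec + jan)) %>%
  gg_tsresiduals()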
Question 5
Introduction
We will analyze a dataset of monthly sales for a souvenir shop on the wharf at a beach resort town in Queensland, Australia. The shop opened in January 1987 and we have data up to December 1993, so 7 years in total. Over that time the shop was expanded and sales grew. Sales also depend on the time of year: for example, there is a local surfing festival every March, so people buy many gifts in that month, and people buy a lot of gifts before Christmas.
The plot below shows how sales of this shop changed every month:
We can see a high peak every December. Also, there is an upward trend over time.
Predictors
We want to model sales using linear regression. This means we take a few variables and say that sales in a given month are a linear combination of those variables in that month. We call those variables "predictors". We will use 6 predictors:
1. Baseline (intercept): an identical value for each month, reflecting the expected amount of sales if all other predictors are zero.
2. Trend: since the shop is growing over time, each month we add some value to the expected amount of sales, the same value every time.
3. Surfing festival: equals 1 if there was a surfing festival in that month, and 0 otherwise.
4. Winter: 1 for winter months and 0 otherwise.
5. Summer: 1 for summer months and 0 otherwise.
6. Autumn: 1 for autumn months and 0 otherwise.
Note that we do not use a variable for spring: if all 3 seasonal variables (winter, summer and autumn) are equal to 0, we already know it's spring. It also means that our baseline prediction, when all variables are zero, is the prediction as if it were spring, even if it's January, for example. We could choose any season as the baseline; I chose spring because it turns out to have the lowest coefficient among the 4 seasons, so the coefficients of the other seasons indicate how much sales increase in that season compared to spring, which I think makes the interpretation slightly clearer.
Log-transformation
Since each month we increase the expected value of sales by the same amount, we model the trend as a straight line. It would look like this:
As you can guess, such a model would not be very accurate, because in reality the trend isn't linear: it grows slowly at the beginning and then faster and faster. A convenient way to fix this is to model the logarithm of sales instead of the amount of sales itself. In this case it looks like this:
Of course it's not perfect, but it's definitely much better.
The starting point of this trend line is our first variable, the intercept. The "trend" variable reflects the slope of this line. The remaining 4 variables (surfing festival and the 3 seasonal variables) each increase the expected log of sales by some fixed amount. We estimate these amounts using OLS regression, i.e. we choose them so that the mean squared error is as small as possible.
Linear model summary
Once that is done, we have a linear model that can be described with the summary table below:
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.946636   0.160486  49.516  < 2e-16 ***
## trend       0.024344   0.002257  10.786  < 2e-16 ***
## fest        0.302670   0.231154   1.309  0.19425
## winter      0.339953   0.172277   1.973  0.05200 .
## summer      0.146501   0.172366   0.850  0.39796
## autumn      0.464935   0.172720   2.692  0.00869 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4993 on 78 degrees of freedom
## Multiple R-squared: 0.6247, Adjusted R-squared: 0.6007
## F-statistic: 25.97 on 5 and 78 DF, p-value: 2.4365e-15
The "Estimate" column shows how much each variable increases the logarithm of sales. The rightmost column, "Pr(>|t|)", shows the probability that the estimate for that variable should actually be zero and differs from zero only by random chance.
It's not very intuitive what an increase in the logarithm of sales represents. It's easier to look at the effects this way:

- Trend: each month the expected value increases by 2.5% (starting from 2826 in the first month)
- Surfing festival: increases the prediction by 35.3%
- Winter: increases the prediction by 40.5%
- Summer: increases the prediction by 15.8%
- Autumn: increases the prediction by 59.2%
In case you wonder how I calculated these percentage values, I used this formula:
((exp(estimate) - 1) * 100)
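For example, applied to the estimates from the table above:
# Convert the log-scale estimates into percentage effects on sales
(exp(c(trend = 0.024344, fest = 0.302670, winter = 0.339953,
       summer = 0.146501, autumn = 0.464935)) - 1) * 100
# roughly: 2.5, 35.3, 40.5, 15.8, 59.2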
The adjusted R² in the summary table shows how much of the total variability of sales our model was able to capture.
The p-value of the F-statistic (the last row in the summary table) shows the probability that all our predictors are useless. This probability is essentially zero.
The plot below shows what the fitted values of our model look like:
Residuals
Now let's analyze not what our model captures, but what it misses. The group of 3 plots below shows some information about the residuals of our model, i.e. how far each fitted value is from the real data.
From the first subplot we can see that every November and, especially, every December the real amount of sales is higher than what our model predicts. For January and February it's the opposite: the real data is always lower.
From the second subplot, the one on the left, we can see that each residual is very similar to the residual for the same month of the previous year. So, if we corrected each prediction using the residual from 12 months earlier, our predictions would become much better.
The last subplot shows the distribution of the residuals. In a perfect scenario it would look like a Gaussian distribution, but of course it doesn't.
We can also check the residuals for each month separately, using a plot like this:
Here we can see basically the same things, just in a different representation that is clearer in some respects.
Forecast
Now we will forecast sales for the next 3 years, 1994 to 1996 (the downsides of our current model notwithstanding). The shaded areas represent prediction intervals with 80% and 95% levels of certainty. Here it is:
Here is the same forecast as a table:
Date       Predicted sales   80% lower   80% upper   95% lower   95% upper
1994 Jan             30685       15746       59797       11061       85126
1994 Feb             31441       16125       61304       11324       87298
1994 Mar             31037       15443       62375       10672       90258
1994 Apr             23496       11946       46214        8351       66113
1994 May             24075       12233       47381        8549       67803
1994 Jun             28561       14631       55753       10268       79441
1994 Jul             29265       14983       57161       10512       81473
1994 Aug             29986       15343       58605       10761       83558
1994 Sep             42246       21641       82467       15188      117505
1994 Oct             43287       22162       84549       15548      120511
1994 Nov             44353       22694       86685       15917      123595
1994 Dec             40107       20442       78691       14308      112428
1995 Jan             41095       20931       80686       14645      115320
1995 Feb             42108       21432       82732       14990      118288
1995 Mar             41567       20524       84186       14126      122317
1995 Apr             31468       15873       62384       11050       89619
1995 May             32244       16253       63968       11309       91930
1995 Jun             38251       19441       75261       13587      107689
1995 Jul             39193       19905       77173       13906      110467
1995 Aug             40159       20380       79134       14232      113318
1995 Sep             56579       28756      111323       20097      159288
1995 Oct             57973       29443      114150       20569      163397
1995 Nov             59402       30146      117051       21052      167615
1995 Dec             53715       27131      106345       18899      152666
1996 Jan             55038       27777      109056       19340      156626
1996 Feb             56395       28437      111837       19792      160691
1996 Mar             55670       27232      113805       18650      166171
1996 Apr             42145       21057       84351       14584      121790
1996 May             43183       21557       86504       14924      124957
1996 Jun             51229       25788      101766       17932      146353
1996 Jul             52491       26401      104365       18349      150160
1996 Aug             53785       27028      107031       18776      154068
1996 Sep             75775       38145      150527       26524      216478
1996 Oct             77642       39051      154371       27141      222109
1996 Nov             79556       39978      158315       27773      227889
1996 Dec             71939       35951      143951       24902      207819