Forecasting

Handout for G12, Michaelmas term 2001
Lecturer: Stefan Scholtes
The aim of forecasting is to predict a future value of a variable such as temperature,
amount of rainfall, demand for a product, interest rate, air quality index, etc. A typical
forecasting process proceeds as follows:
i) Construct a parametric model of the data generating process. Where possible, this model should incorporate as much information as possible about the nature of the data generating process (e.g. diffusion models for weather forecasts).
ii) Use historic data to estimate the unknown parameters in the model (model fit).
iii) Compute the forecasted values.
iv) Subject the forecast to expert judgement, intuition and common sense and adjust if necessary.
Time series. To get an idea about the variable that one wishes to forecast, it is most
sensible to begin with a graphical plot of historic data, ordered by time points. Such
data is called a time series. A typical time series plot looks like this:
[Figure: time series plot of demand (vertical axis, 0 to 200) against time (horizontal axis, 0 to 50).]
The above time series seems to show an upwards trend. There is also apparently some
cyclical buying behaviour as well as two seasons, a downwards season in the middle
of the year and an upward season towards the end of the year. Finally, there is
obviously quite a bit of erratic buying behaviour present. It helps to analyse these
structural components of a time series separately. To do this, we start from a model
$Y_t = T_t + c_t + s_t + \varepsilon_t$,
where $T_t$, $c_t$, $s_t$ are the deterministic trend, cyclic and seasonal components of the series, while $\varepsilon_t$ is a random variable which models the erratic behaviour.
Trend. A trend is a monotonic (increasing or decreasing) function of time. The
simplest such function is a linear trend $T_t = a + bt$. Other popular trend models are quadratic $a + bt + ct^2$, exponential $a + b e^{ct}$, or logarithmic $a + b \ln(ct)$. If $T_t = a$, independently of time $t$, then we say that there is no trend in the time series.
Cycles. Cycles are typically modelled using sine and cosine functions
$c_t = c \sin(2\pi t / p) + d \cos(2\pi t / p)$,
where $p$ is the cycle length, i.e. $c_t = c_{t+p}$, $\sqrt{c^2 + d^2}$ is the maximal value of $c_t$ during a cycle, and the times at which this maximal value is achieved satisfy $\tan(2\pi t / p) = c/d$. Whilst $c, d$ can be estimated on the basis of historic data, using regression, the analyst has to provide the cycle length $p$. This is often done on the basis of knowledge of the underlying process. One may for instance know that one's customers tend to buy either at the end or the beginning of a month and that demand in the middle of the month is rather low.
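For illustration, once the cycle length $p$ is fixed the parameters $c$ and $d$ can be estimated by ordinary least squares on sine and cosine features. The following sketch (hypothetical data, numpy only) shows one way to set this up.

```python
import numpy as np

# Hypothetical demand series with a cycle of known length p
# (the analyst supplies p, e.g. from knowledge of monthly buying behaviour).
rng = np.random.default_rng(0)
t = np.arange(60)
p = 12
y = 100 + 3 * np.sin(2 * np.pi * t / p) + 1.5 * np.cos(2 * np.pi * t / p) \
    + rng.normal(0, 1, t.size)

# Regression features: a constant, sin(2*pi*t/p) and cos(2*pi*t/p).
X = np.column_stack([np.ones_like(t, dtype=float),
                     np.sin(2 * np.pi * t / p),
                     np.cos(2 * np.pi * t / p)])

# Least squares estimates of (a, c, d).
a_hat, c_hat, d_hat = np.linalg.lstsq(X, y, rcond=None)[0]
amplitude = np.hypot(c_hat, d_hat)   # maximal value sqrt(c^2 + d^2) of the cycle
print(a_hat, c_hat, d_hat, amplitude)
```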
Seasons. Seasons are periods where the behaviour of the time series is notably
different from the behaviour outside of the season. The difference between a season
and a cycle is that in a cycle the change is continuous over time and, at least in our
modelling of cycles, a regular up and down, whilst in a seasonal pattern, the change in
the time series can be discontinuous as time moves into a season. A popular way of
modelling seasons is through so-called dummy variables (sometimes also called
indicator variables) which take on values 0 or 1. Their value is 1 at times t during the
season and 0 at times t outside the season. The model for this is
$s_t = \sum_i e_i X_{it}$,
where $e_i$ is a parameter and $X_{it}$ is a variable under your control with
$X_{it} = \begin{cases} 1 & \text{if } t \text{ is in the season} \\ 0 & \text{if } t \text{ is outside the season} \end{cases}$
As in the case of the cycle length, the analyst has to specify the seasons and thereby
give $X_{it}$ their respective values for the historic time series at hand. Regression can
then be used to estimate the parameters $e_i$, which amount to a constant increase or
decrease of the time series variable (e.g. demand) during the season. Seasonal patterns
are more general than cycles and a seasonal model with dummy variables can be fit to
pure cyclic behaviour if sufficiently many dummy variables are used.
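As a small illustration of the dummy-variable idea, the sketch below (hypothetical seasons and data) adds 0/1 indicator columns for two seasons to a linear-trend regression and estimates the seasonal shifts $e_i$ by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(48)          # four years of monthly data (assumed for illustration)
month = t % 12

# Hypothetical seasons: a "low" season mid-year and a "high" season at year end.
X_low = ((month >= 5) & (month <= 7)).astype(float)   # dummy: 1 inside the season
X_high = (month >= 10).astype(float)

# Hypothetical data: linear trend plus seasonal shifts plus noise.
y = 50 + 0.8 * t - 15 * X_low + 20 * X_high + rng.normal(0, 3, t.size)

# Design matrix: constant, trend, and one dummy column per season.
X = np.column_stack([np.ones_like(t, dtype=float), t, X_low, X_high])
a, b, e_low, e_high = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b, e_low, e_high)   # e_low, e_high estimate the constant shift in each season
```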
The decomposition of a time series into trend, cycles, seasons and erratic behaviour
($\varepsilon_t = y_t - T_t - c_t - s_t$) is illustrated in the following graph:
[Figure: decomposition of the time series into trend, cycle, season and random components, plotted as demand against time.]
The random component. Our ability to forecast values of the time series variable
depends obviously on our knowledge of the distribution of the erratic behaviour.
The simplest assumption is that the random components $\varepsilon_t$ are independent and identically distributed (i.i.d.) for all $t$ and have zero means. This is
the main assumption for the statistical analysis of regression models. Because of the
independence assumption, earlier observations of the time series do not provide
information about subsequent observations. The best forecast for the pure random
component in the i.i.d. case is its expectation zero. The last graph has been generated
with an i.i.d. random component with a normal distribution and zero mean.
Unfortunately, many time series show correlations between the random components
$\varepsilon_t$. This has far-reaching consequences since it means that the statistical analysis of
time series cannot rely on the fairly simple analysis of regression models, which
assumed independent random errors. A popular model for the dependent random
components in time series is a so-called autoregressive process:
$\varepsilon_t = a + b_1 \varepsilon_{t-1} + \ldots + b_k \varepsilon_{t-k} + u_t$,
where the $u_t$'s are i.i.d. random variables. The difference to the pure i.i.d. model is that the expected value of $\varepsilon_t$ depends on the observed random effects at earlier times $t-1, \ldots, t-k$. Such a process is called AR(k), where $k$ is the number of
earlier time periods that are taken into consideration. Such autoregressive models are
the basis for a sophisticated forecasting technique called the Box-Jenkins method. The
presentation of this approach goes beyond the scope of this introduction and can be
found in most books on time-series analysis.
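To get a feel for dependent random components, the sketch below simulates a simple AR(1) process ($k = 1$, with hypothetical coefficients); unlike the i.i.d. case, each value is correlated with its predecessor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
a, b1 = 0.0, 0.7            # hypothetical AR(1) coefficients
u = rng.normal(0, 1, n)     # i.i.d. innovations u_t
eps = np.zeros(n)

for t in range(1, n):
    # epsilon_t = a + b1 * epsilon_{t-1} + u_t
    eps[t] = a + b1 * eps[t - 1] + u[t]

# For a long series the sample lag-1 autocorrelation is close to b1.
print(np.corrcoef(eps[:-1], eps[1:])[0, 1])
```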
Digestion. We have seen general models for trends, cycles, seasons and random
components of a time series. The models can be additively combined to a parametric
model of the time series. The unknown parameters of this model are typically
estimated by fitting the model to historic data, e.g., by using a least squares approach.
The statistical analysis of such an approach is more complicated than that of a
standard regression model since random errors in time series are typically correlated
and also since forecasting, by its very nature, aims to predict the value outside of the
range of the independent variable (time) of the historic data. We have seen that
regression estimators tend to become very volatile outside of the range of historic $x$-values (here $t$-values). Because of the difficulty of a sound statistical analysis for forecasting procedures, forecasting is much more an art than standard regression
modelling. Whenever a forecast is being made, it should be subjected to expert
judgement and common sense and adjusted if necessary.
Fitting a forecast model. As with regression models, forecasting models involve
unknown parameters $p$ that are usually determined so that the model fits the data
reasonably well.
i) Root mean square error (RMSE): Recall that the standard measure of fit for regression models was the sum of squared deviations of the model predictions from the observations. We called this measure the sum of squared errors, $SSE(p)$, and wanted it to be as small as possible. The same set of "optimal" parameters would be obtained if we minimised the so-called mean square error $SSE(p)/n$, since the number of data points $n$ is fixed and the minimizer of a function $f$ does not change if the function is multiplied by a positive constant. More generally, $g(f(x))$ has the same minimizers as $f(x)$ if $g(\cdot)$ is strictly increasing on the range of values of $f$. Therefore, the least squares parameters minimise not only the mean square error but also its root $\sqrt{SSE(p)/n}$, which is called the root mean square error. This measure is often used to compare fits for forecasting models. Bear in mind that it is just a monotonic transformation of the sum of squared errors.
ii) Forecast bias: Forecasting is often done by a heuristic procedure which may well be biased. The forecast bias is defined by
$FB = \frac{\sum_t (f(t,p) - y_t)}{\#\text{ of observations in the time series}}$.
If $FB$ is significantly positive then the forecast tends to be too high, while a negative value of $FB$ indicates that it is too low. It may sometimes be sensible to correct for bias by subtracting $FB$ from the forecast. However, it may be more sensible to adjust the parameters of the forecasting model to reduce the bias. (A small computational sketch of the RMSE and the forecast bias follows after this list.)
iii) It is very important to graphically check the forecasting model against the time series before it is used. You should in particular check the ability of the model to predict turning points in the series. The prediction of turning points is vitally important for companies since it allows them to adjust their production and inventory policies to future changes in the market. Forecasters are often willing to sacrifice small mean square errors and/or small bias for accurate predictions of turning points.
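The sketch below computes the root mean square error and the forecast bias from a series of observations and the corresponding model predictions $f(t,p)$; the function names and data are hypothetical.

```python
import numpy as np

def rmse(forecast, observed):
    # Root mean square error: sqrt(SSE(p)/n).
    forecast, observed = np.asarray(forecast, float), np.asarray(observed, float)
    return np.sqrt(np.mean((forecast - observed) ** 2))

def forecast_bias(forecast, observed):
    # FB = sum(f(t,p) - y_t) / number of observations; positive means "too high".
    forecast, observed = np.asarray(forecast, float), np.asarray(observed, float)
    return np.mean(forecast - observed)

# Hypothetical example.
y = np.array([100, 110, 105, 120, 130])
f = np.array([ 98, 112, 108, 118, 127])
print(rmse(f, y), forecast_bias(f, y))
```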
The fitting of a forecasting model to time series data can be conveniently done with
an Excel add-in called Solver. It allows you to solve (reasonably small) optimisation
problems. To invoke it, choose the solver command from the tools menu [1] and you
will see the solver command dialog box. In the case of curve fitting, input as target
cell the cell that computes the fitting measure, e.g. the sum of squared errors. Click on
Min for minimisation, and input under changing cells the array that contains the
parameters of your model. Now click on solve and the solver will return the set of
parameters that minimises your fitting measure [2]. Try this out with the spreadsheet
Time Series Fit.xls, workbook “Fitting a nonlinear model”. Here the fitting measure is
in cell K7 and the parameters are in cells B7:B13. Notice that the parameter p is not
variable; it has been fixed to the true value 10. Play with this parameter and re-solve to see how sensitively the result depends on p. In fact, you may want to include it as an
optimisation variable, i.e. re-run the optimisation with changing cells B7:B14. By
doing this you have reduced your fitting measure, i.e. the new model fits the data
better. But was that a good idea if you compare the results with the true parameters?
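If you prefer to carry out this kind of least squares fit outside Excel, the sketch below uses scipy.optimize.least_squares on a hypothetical trend-plus-cycle model. Like Solver, a nonlinear fit of this kind may only find a local optimum, so the starting values (the analogue of the initial contents of the changing cells) matter.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical observed series.
rng = np.random.default_rng(3)
t = np.arange(50, dtype=float)
y = 40 + 1.2 * t + 8 * np.sin(2 * np.pi * t / 10) + rng.normal(0, 2, t.size)

def model(params, t):
    # Linear trend plus one cycle with unknown length p.
    a, b, c, d, p = params
    return a + b * t + c * np.sin(2 * np.pi * t / p) + d * np.cos(2 * np.pi * t / p)

def residuals(params, t, y):
    return model(params, t) - y

# Starting values for (a, b, c, d, p).
start = np.array([30.0, 1.0, 5.0, 0.0, 9.0])
fit = least_squares(residuals, start, args=(t, y))
print(fit.x)                   # fitted parameters, including the cycle length p
print(np.sum(fit.fun ** 2))    # sum of squared errors at the optimum
```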
De-seasoning time series. We have seen above that seasonality can be modelled by
using dummy variables, one for each season. The parameters in such a model can then
be estimated using, e.g., a least squares criterion.
There is a simple alternative way of dealing with seasonality: We can “de-season” the
data first, then forecast values for the de-seasoned series and “re-season” again.
There are various ways of “de-seasoning” a time series. They basically all assume that
the analyst has identified the number and timing of the seasons, either from the
underlying data generating process or from inspection of the time series. Suppose we
are looking at seasons over a year and have several years of data. Then for each year
you compare the annual period average in year $i$,

$A_i = \frac{\sum_{t \text{ in year } i} y_t}{\#\text{ measurements in year } i}$,

with the period average in season $j$ of year $i$,

$S_{ij} = \frac{\sum_{t \text{ in season } j \text{ of year } i} y_t}{\#\text{ measurements in season } j \text{ of year } i}$.
The seasonal factor for season $j$ of year $i$ is then $S_{ij}/A_i$, and the overall seasonal factor for season $j$ is the average seasonal factor over all years in the time series,

$s_j = \frac{\sum_{\text{all years } i} S_{ij}/A_i}{\#\text{ years in the time series}}$.
A season that tends to overshoot the annual average will have a seasonal factor larger
than 1. To de-season the time series, you divide the time series values during the
season by the corresponding seasonal factor. Then you perform a forecast for the de-seasoned time series, e.g. by exponential smoothing. After this, you re-season again by multiplying the forecast by the respective seasonal factor [3].
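A minimal sketch of this de-seasoning recipe follows, assuming monthly data over complete years so that the "seasons" are simply the twelve months and each season contains one measurement per year; the function names and data are hypothetical.

```python
import numpy as np

def seasonal_factors(y, period=12):
    # y covers complete years; with one measurement per season, S_ij is just the value.
    y = np.asarray(y, float).reshape(-1, period)     # one row per year
    annual_avg = y.mean(axis=1, keepdims=True)       # A_i
    return (y / annual_avg).mean(axis=0)             # s_j, averaged over the years

def deseason(y, factors):
    # Divide each value by the seasonal factor of its season.
    y = np.asarray(y, float)
    return y / np.tile(factors, len(y) // len(factors))

def reseason(forecast, factors, season_index):
    # Multiply a forecast for a given season by its seasonal factor.
    return forecast * factors[season_index]

# Hypothetical three years of monthly demand.
rng = np.random.default_rng(4)
y = (100 + 20 * np.sin(2 * np.pi * np.arange(36) / 12)) * rng.normal(1, 0.05, 36)
s = seasonal_factors(y)
y_deseasoned = deseason(y, s)
print(s)
```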
Smoothing procedures. Since a statistical analysis of the “best fit” forecasting
models is difficult, heuristic procedures, based on the idea of “smoothing” the time
series, are often used for forecasting. The simplest procedure is a moving average:
Given the time series up to time $t-1$, the forecast for time $t$ is
$F_t = \frac{y_{t-1} + \ldots + y_{t-m}}{m}$.
Finding the right value of m is again not easy. The more volatile the time series is,
the larger m should be to allow for the cancellation of positive and negative
fluctuations. Considerable experience is required to obtain useful forecasts by a
moving average approach.
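A moving-average forecast is straightforward to compute; the sketch below (hypothetical data) forecasts each period as the average of the previous m observations.

```python
import numpy as np

def moving_average_forecast(y, m):
    # Forecast for time t is the mean of y[t-m], ..., y[t-1];
    # the first m periods have no forecast.
    y = np.asarray(y, float)
    return np.array([y[t - m:t].mean() for t in range(m, len(y) + 1)])

y = [102, 98, 110, 105, 99, 107, 112]
print(moving_average_forecast(y, m=3))   # last entry is the forecast for the next period
```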
A very popular alternative to moving averages is the so-called exponential smoothing
procedure. Given a time series $y_1, \ldots, y_T$, this procedure generates an associated "smoothed" time series of forecasts $F_1, \ldots, F_{T+1}$, starting with $F_1 = y_1$, according to the formula
$F_{t+1} = F_t + \alpha E_t$,
where
$E_t = y_t - F_t$
is the forecast error of the previous period and $\alpha$ is a constant between 0 and 1. Notice that for $\alpha = 0$ we never update our forecast, while for $\alpha = 1$ our forecast is the "last-value" forecast. The smaller the parameter $\alpha$, the "smoother" the series of predictions $F_t$.
In exponential smoothing, the time series is used to "train" the forecast. The forecast value $F_{t+1}$ depends only on the value $y_t$ and the previously generated forecast $F_t$. The "smoothed" series therefore ends with $F_{T+1}$, which is the forecast for the next period in the future.
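The smoothing recursion is easy to implement; this sketch (hypothetical data and smoothing constant) returns the series $F_1, \ldots, F_{T+1}$, whose last entry is the forecast for the next period.

```python
import numpy as np

def exponential_smoothing(y, alpha):
    # F_1 = y_1; F_{t+1} = F_t + alpha * (y_t - F_t).
    y = np.asarray(y, float)
    F = np.empty(len(y) + 1)
    F[0] = y[0]
    for t in range(len(y)):
        F[t + 1] = F[t] + alpha * (y[t] - F[t])
    return F

y = [100, 104, 101, 99, 103, 108, 106]
print(exponential_smoothing(y, alpha=0.3)[-1])   # forecast for the next period
```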
Exponential smoothing does not perform particularly well in the presence of a trend.
One way of addressing this problem is to first estimate the trend by fitting a suitable regression model to the time series data and then subtracting the estimated trend from the data. This new "de-trended" time series can then be forecast using e.g. exponential smoothing. After this, the trend is added back.
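A sketch of this de-trending recipe, assuming a linear trend estimated by least squares; the smoothing loop is repeated inline here so the snippet stands alone, and the data are hypothetical.

```python
import numpy as np

def smooth(y, alpha):
    # Single exponential smoothing, F_1 = y_1.
    F = np.empty(len(y) + 1)
    F[0] = y[0]
    for t in range(len(y)):
        F[t + 1] = F[t] + alpha * (y[t] - F[t])
    return F

rng = np.random.default_rng(5)
t = np.arange(40, dtype=float)
y = 20 + 1.5 * t + rng.normal(0, 3, t.size)    # hypothetical trending series

# 1. Estimate the linear trend a + b*t by least squares.
b, a = np.polyfit(t, y, 1)
# 2. Subtract the trend and smooth the de-trended series.
detrended = y - (a + b * t)
F_next = smooth(detrended, alpha=0.3)[-1]
# 3. Add the trend (evaluated at the next period, t = 40) back to the forecast.
forecast = F_next + a + b * 40
print(forecast)
```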
An extension of exponential smoothing is the so-called double exponential smoothing
which explicitly incorporates a trend. This procedure is based on a linear model
$E(y_t) = a + bt$.
Notice that $E(y_{t+1}) = E(y_t) + b$. This identity is the basis for the double exponential smoothing procedure, which forecasts $E(y_{t+1})$ as a sum of two components,
$F_{t+1} = A_t + B_t$,
where the base level $A_t$ is a proxy for $E(y_t)$ and the trend term $B_t$ is a proxy for $b = E(y_t) - E(y_{t-1})$. If we replace the expectation by the observation, i.e. set $A_t = y_t$ and $B_t = A_t - A_{t-1}$, then we track the time series data, lagging one time step behind. To incorporate a smoothing component, we update $A_t$ according to the exponential smoothing formula
$A_t = A_{t-1} + \alpha (y_t - (A_{t-1} + B_{t-1}))$.
Notice that $B_{t-1}$ occurs here since our prediction of $y_t$ is $A_{t-1} + B_{t-1}$. Having this update of $A_t$, we could use the formula $B_t = A_t - A_{t-1}$. However, it turns out that this incorporates substantial fluctuations into our prediction of the trend slope $b$. It is better to smooth these fluctuations out by smoothing the series of values $A_t - A_{t-1}$. This is done in the same way as the smoothing of the original time series $y_t$, i.e. rather than taking the last observation $A_t - A_{t-1}$ as a prediction of $b$, we use the last observation corrected by a multiple of the "prediction error":
$B_t = B_{t-1} + \beta (A_t - A_{t-1} - B_{t-1})$,
where $\beta$ is a suitable constant between 0 and 1, typically an order of magnitude smaller than $\alpha$. The method is fairly sensitive to the initial value of the trend at time
t=0. A sensible starting value is the slope of a linear regression line through the time
series. The following plot illustrates the advantage of two-parameter exponential
smoothing in the presence of trend.
[Figure: a trending time series plotted against time, together with single exponential smoothing forecasts, double exponential smoothing forecasts, and a fitted linear regression line.]
Both single and double exponential smoothing forecasts capture the cyclic pattern of
the time series reasonably well, but the single forecast is biased towards lower values,
as can be seen by comparing it with the regression line.
The double exponential forecast allows us not only to compute a forecast $F_{T+1}$ for one period ahead of the time series but also to forecast several periods ahead by
$F_{T+m} = F_{T+1} + (m-1) B_T$.
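The recursions above translate directly into code. This sketch follows the update formulas as stated in this handout (hypothetical data, hypothetical values of $\alpha$ and $\beta$, and a regression slope as the starting trend) and also produces the m-period-ahead forecast $F_{T+m} = F_{T+1} + (m-1)B_T$.

```python
import numpy as np

def double_exponential_smoothing(y, alpha, beta):
    # A_t: base level proxy, B_t: trend proxy; forecast F_{t+1} = A_t + B_t.
    y = np.asarray(y, float)
    A = np.empty(len(y))
    B = np.empty(len(y))
    A[0] = y[0]
    B[0] = np.polyfit(np.arange(len(y)), y, 1)[0]   # regression slope as starting trend
    for t in range(1, len(y)):
        A[t] = A[t - 1] + alpha * (y[t] - (A[t - 1] + B[t - 1]))
        B[t] = B[t - 1] + beta * (A[t] - A[t - 1] - B[t - 1])
    return A, B

rng = np.random.default_rng(6)
t = np.arange(50)
y = 30 + 2 * t + rng.normal(0, 4, t.size)       # hypothetical trending series

A, B = double_exponential_smoothing(y, alpha=0.3, beta=0.05)
F_next = A[-1] + B[-1]                          # one-period-ahead forecast F_{T+1}
F_m = F_next + (3 - 1) * B[-1]                  # m = 3 periods ahead
print(F_next, F_m)
```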
Not only are trends badly caught by one-parameter exponential smoothing; it also does not cope well with seasonal patterns. In practice, seasonality is often dealt with
by de-seasoning the data, e.g., as explained above. An alternative is the so-called
triple exponential smoothing method, which depends on three parameters and
incorporates trend as well as seasonality explicitly. Details can be found in most
books on forecasting.
Error estimates. One way of getting an idea of the potential error level involved in a
forecast is to do a theoretical analysis of the distribution of the forecast error. This can
be done under simplifying assumptions, e.g., along the lines of our probabilistic
analysis of predictions of simple regression models.
A simpler alternative to a theoretical analysis of the forecast error is to check how
well the exponential smoothing procedure predicted values in the past and project that
into the future. To do this, de-season (and in the case of single exponential smoothing
de-trends) the time series. Give the smoothing procedure a “run-in” time of several
periods (the number depends on the size of your time series). After the run-in time
record the errors made by the exponential smoothing forecast. Take the empirical distribution of these errors as a proxy for the distribution of the errors of your
forecast. A typical empirical cumulative distribution function looks like this (see
Champagne Sales.xls):
[Figure: empirical cumulative distribution function of forecast errors (cumulative percentage against error, roughly -1.6 to 2.2).]
An empirical 90% confidence interval can now be obtained by cutting off the 5% lowest and 5% highest errors, i.e. it is of the form
[Forecast - u, Forecast + v],
where $u$ is such that $F(u) = 5\%$ and $v$ is such that $F(v) = 95\%$. If you wish to
estimate the distribution of forecasting errors several periods ahead, you must
compute the historical forecasting errors several periods ahead. The further ahead you
forecast, the larger the confidence intervals become.
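A sketch of this empirical approach follows, assuming a list of recorded historical forecast errors after the run-in period and taking the errors to be defined as actual minus forecast (an assumption; with the opposite sign convention the two quantile offsets swap roles). The 5% and 95% points of their empirical distribution then bracket a 90% interval around a new forecast.

```python
import numpy as np

def empirical_interval(forecast, past_errors, coverage=0.90):
    # past_errors: historical (actual - forecast) errors recorded after the run-in time.
    # The empirical lower and upper quantiles bracket `coverage` of the observed errors.
    lo = (1 - coverage) / 2
    e_lo, e_hi = np.quantile(np.asarray(past_errors, float), [lo, 1 - lo])
    return forecast + e_lo, forecast + e_hi

errors = [-1.2, -0.4, 0.1, 0.3, -0.8, 0.9, 1.4, -0.2, 0.6, -0.5]   # hypothetical
print(empirical_interval(100.0, errors))
```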
Judgemental forecasting. We have mentioned already that forecasting is more of an
art than a science. Analytical forecasting techniques aim to recover the generating
mechanism for a time series from historic data. The two techniques that we have
considered, regression and exponential smoothing, suffer from different flaws.
Regression assumes that you start with the “right” model to begin with and only
adjust the parameters. Exponential smoothing only picks up information that is
“present” in the time series. Neither procedure includes potentially available “outside”
information such as, e.g., the knowledge that a new competitor is about to enter the
market, that a new technology is emerging, or that demand patterns are changing. To
account for this flaw, it is very important to subject the technical forecasting analysis
to expert knowledge and common sense, making clear all assumptions used in the
forecast. Expert opinion can be obtained by talking to individual managers or to a jury
of executives. An interesting approach is the so-called Delphi method: Ask the
members of a jury independently about their forecast [4]. Then reveal all these forecasts
(or some summary statistics) to all members of the jury and ask them to adjust their
forecast in the light of the forecast of the other jury members. Continue until no one
adjusts their forecast anymore. If you are lucky, this process converges somehow to a
consensus or close to a consensus. If not, then there is a substantial divergence of
opinion – which is important to bear in mind in decision making. To form their own
opinion, the decision maker can invite the jury to a meeting for a discussion of their
diverging opinions.
An often applied judgemental forecasting technique for demand estimates is the sales-force composite. Estimates are created by the sales force and sent up through the
hierarchy, possibly adjusted by the line managers. Obviously, consumer surveys or
test marketing can be invaluable complements to technical demand forecasts, in
particular if there is substantial uncertainty, as in the case of a substantially altered
product or price.
References. Please read Chapter 18 of Hillier and Lieberman, Introduction to
Operations Research. More in-depth information on forecasting and its applications in
a business context can be found in specialized texts such as J.E. Hanke, D.W.
Wichern, A.G. Reitsch, Business Forecasting, Prentice Hall 2001.
[1] If there is no solver command under the tools menu, click Add-ins and activate the solver add-in. If there is no solver add-in in the list of add-ins, you will have to load it from the Excel CD (your computer officer can probably help you with this).
[2] Generally speaking, the solver will only return a local optimum, which depends on the starting values for your parameter set. In the case of fitting a "base"-function type model $f(x,p) = \sum_i p_i f_i(x)$, the solution will be the globally optimal solution, independently of the initial values of the parameters. You will learn more about the solver in the second part of G12.
[3] Our computation of seasonal factors here was based on averaging the annual seasonal factors over all years in the time series. If there is a belief that the impact of seasons changes over time, then there are alternative ways of computing the seasonal factors to be used in the forecast. One such way is to look at the time series of seasonal factors $s_j$ and use variants of the exponential smoothing techniques explained in the sequel.
[4] It is better to ask for an interval than for a point forecast. Also, one of the jury members may well be a "technical" forecaster, who uses analytic techniques like the ones explained earlier.