Taming Uncertainty: All You Need to Know About Measuring Forecast Accuracy, But Are Afraid to Ask
“Television won’t be able to hold on to any market it captures after the first six months. People will soon
get tired of staring at a plywood box every night.”
A WELL-KNOWN MOVIE MOGUL IN THE EARLY DAYS OF TV.
This White Paper 08 describes
• Why it is necessary to first define what a forecast error is
• What bias and precision mean for accuracy measurement
• How, when and why to make accuracy measurements
• The systematic steps in a forecast evaluation process.
After reading this paper, you should be able to
• Understand the difference between fitting errors and forecast errors
• Recognize that there is no single best measure of accuracy across products, customers, hierarchies,
industry forecasts and time horizons
• Realize that simple averaging is not a best practice for summarizing accuracy measurements
• Engage with potential users of demand forecasts to clearly define their forecast accuracy
requirements
The Need for Measuring Forecast Accuracy
The Institutional Broker's Estimate System (I/B/E/S), a service that tracks financial analysts’ estimates,
reported that forecasts of corporate yearly earnings that are made early in the year are persistently
optimistic, that is to say, upwardly biased. In all but 2 of the 12 years studied (1979 - 1991), analysts
revised their earnings forecasts downward by the end of the year. In a New York Times article (January 10,
1992), Jonathan Fuerbringer writes that this pattern of revisions was so clear that the I/B/E/S urged stock
market investors to take early-in-the-year earnings forecasts with a grain of salt.
It is generally recognized in most organizations that accurate forecasts are essential in achieving
measurable improvements in their operations. In demand forecasting and planning organizations, analysts
use accuracy measures for evaluating their forecast models on an ongoing basis, and hence accuracy is
something that needs to be measured at all levels of a product and customer forecasting database. Some of
the questions you will need to ask include:
• Will the model prove reliable for forecasting units or revenues over a planned forecast horizon, such as item-level product demand for the next 12 weeks or aggregate revenue demand for the next 4 quarters?
• Will a forecast have a significant impact on marketing, sales, budgeting, logistics and production activities?
• Will a forecast have an effect on inventory investment or customer service?
In a nutshell, inaccurate forecasts can lead directly to inadequate safety stocks, ongoing capacity problems, massive rescheduling of manufacturing plans, chronic late shipments to customers, and the addition of expensive manufacturing flexibility resources (Exhibit 1). In addition, the Income Statement and Balance Sheet are also impacted by poor demand forecasts in financial planning (Exhibit 2).
Exhibit 1 Operational Impact of Poor Forecasts (Source: S. C. Conradie, Noeticconsulting 2013)
Exhibit 2 Financial Impact of Poor Forecasts (Source: S. C. Conradie, Noeticconsulting 2013)
Analyzing Forecast Errors
Whether a method that has provided a good fit to historical data will also yield accurate forecasts for future
time periods is an unsettled issue. Intuition suggests that this may not necessarily be the case. There is no
guarantee that past patterns will persist in future periods. For a forecasting technique to be useful, we must
demonstrate that it can forecast reliably and with consistent accuracy. It is not sufficient to simply produce a
model that performs well only in historical (within sample) fit periods. One must also measure forecasting
performance with multiple metrics, not just a single, commonly used metric like the Mean
Absolute Percentage Error (MAPE) or even a weighted MAPE (to be defined precisely later in this paper).
At the same time, the users of a forecast need forecasting results on a timely basis. Using a forecasting
technique and waiting one or more periods for history to unfold in future periods is not practical because
our advice as forecasters will not be timely. Consequently, we advocate the use of holdout test periods for
the analysis and evaluation of forecast errors on an ongoing basis.
Generally, the use of a single measure of accuracy based on simple averages is not a best practice, because you can misinterpret the results when the underlying distributions are not normal. Even with approximate normality, a simple average of numbers may lead to misleading conclusions. With minimal effort, you can complement a traditional measure with an outlier-resistant alternative to safeguard against the common presence of outliers or unusual values in accuracy measurements. It is always good practice to report accuracy measures in pairs to insure against misleading results. When the two are similar (in a practical sense), report the traditional result. Otherwise, go back and investigate the data going into the calculation, using domain information whenever possible.
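As a minimal sketch of this report-in-pairs practice (the data values below are made up purely for illustration), a traditional MAPE can be computed alongside its outlier-resistant companion, the median APE, in a few lines of Python:

import statistics

# Hypothetical actuals and forecasts; the last forecast is badly off (an outlier miss).
actuals = [100, 105, 98, 110, 102]
forecasts = [97, 108, 100, 107, 60]

apes = [100 * abs(a - f) / a for a, f in zip(actuals, forecasts)]
mape = statistics.mean(apes)     # traditional measure, pulled up by the outlier
mdape = statistics.median(apes)  # outlier-resistant companion measure

print(f"MAPE = {mape:.1f}%, MdAPE = {mdape:.1f}%")
# A large gap between the two is the signal to go back and inspect the data.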
Two important aspects of forecast accuracy measurement are bias and precision. Bias is a problem
of direction: Forecasts are typically too low (downward bias) or typically too high (upward bias). Precision
is an issue of magnitudes: Forecast errors can be too large (in either direction) using a particular forecasting
technique. Consider first a simple situation - forecasting a single product or item. The attributes that should
satisfy most forecasters include lack of serious bias, acceptable precision, and superiority over naive
models.
Lack of Bias
If forecasts are typically too low, we say that they are downwardly biased; if too high, they are upwardly
biased. If overforecasts and underforecasts tend to cancel one another out (i.e., if an average of the forecast
errors is approximately zero), we say that the forecasts are unbiased.
Bias refers to the tendency of a forecast to be predominantly toward one side of the truth.
Bias is a problem of direction. If we think of forecasting as aiming darts at a target, then a bias
implies that the aim is off-center. That is, the darts land repeatedly toward the same side of the target
(Exhibit 3). In contrast, if forecasts are unbiased, they are evenly distributed around the target.
Exhibit 3 Biased, Unbiased and Precise Forecasts
What is an Acceptable Precision?
Imprecision is a problem if the forecast errors tend to be too large. Exhibit 4 shows three patterns that differ
in terms of precision. The upper two forecasts are less precise - as a group, they are farther from the
target. If bias is thought of as bad aim, then imprecision is a lack of constancy or steadiness. The precise
forecast is generally right around the target.
Precision refers to the distance between the forecasts as a result of using a particular forecasting
technique and the corresponding actual values.
Exhibit 4 (a) Bar Charts and (b) Tables Showing Actuals (A), Forecasts and Forecast Errors for Three
Forecasting Techniques
Exhibit 4 illustrates the measurement of bias and precision for three hypothetical forecasting
techniques. In each case, the fit period is periods 1 - 20. Shown in the top row of Exhibit 4b are actual
values for the last four periods (21 – 24). The other three rows contain forecasts using forecasting
techniques X, Y, and Z. These are shown in a bar chart in Exhibit 4. What can we say about how good these
forecast models are? On the graph, the three forecasts do not look all that different.
But, what is a forecast error? It is not unusual to find different definitions and interpretations for
concepts surrounding accuracy measurement among analysts, planners and managers in the same company.
Exhibit 4 records the deviations between the actuals and their forecasts. Each deviation represents a
forecast error (or forecast miss) for the associated period defined by:
Forecast error (E) = Actual (A) - Forecast (F)
If (F – A) is the preferred usage in some organizations but not others, then demand forecasters and planners should name it something else, like forecast variance, which has a more conventional meaning for revenue-oriented planners. The distinction is important because of the interpretation of bias in under- and overforecasting situations. Consistency is paramount here.
Contrast this with a fitting error (or residual) of a model over a fit period, which is:
Fitting error = Actual (A) - Model fit
Forecast error is a measure of forecast accuracy. By convention, it is defined as A – F. On the other hand, fitting error (or residual) is a measure of model adequacy: Actual – Model Fit.
In Exhibit 4b, the forecast error shown for technique X is 1.8 in period 21. This represents the
deviation between the actual value in forecast period 21 (= 79.6) and the forecast using technique X (=
77.8). In forecast period 22, the forecast using technique X was lower than actual value for that period
resulting in a forecast error of 6.9. The period 24 forecast using technique Z was higher than that period's
actual value; hence, the forecast error is negative (- 3.4). When we overforecast, we must make a negative
adjustment to reach the actual value. Note that if the forecast is less than the actual value, the miss is a
positive number; if the forecast is more than the actual value, the miss is a negative number.
In this example, it appears that Model X is underforecasting (positive forecast errors); hence it is biased. Model Z appears to have a forecast (#22) that should be marked for investigation; it is somewhat large relative to the rest of the numbers.
To identify patterns of upward- and downward-biased forecasts, we start by comparing the number
of positive and negative misses. In Exhibit 4b, Technique X under-forecasts in all four periods, indicative of
a persistent (downward) bias. Technique Y underforecasts and overforecasts with equal frequency;
therefore, it exhibits no evidence of bias in either direction. Technique Z is biased slightly toward
overforecasting. As one measure of forecast accuracy (Exhibit 5), we calculate a percentage error PE =
100% * (A – F)/A.
Exhibit 5 (a) Bar chart and (b) Table of the Forecast Error as Percentage Error (PE) between Actuals and
Forecasts for Three Techniques
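As a small illustration (reusing the Exhibit 4b values quoted in the text as inputs; the function names are ours), the error and percentage-error definitions translate directly into code:

def forecast_error(actual, forecast):
    """Forecast error E = A - F (positive when we underforecast)."""
    return actual - forecast

def percentage_error(actual, forecast):
    """Percentage error PE = 100 * (A - F) / A."""
    return 100.0 * (actual - forecast) / actual

# Period 21 for technique X, using the values quoted in the text above.
print(forecast_error(79.6, 77.8))     # 1.8, an underforecast
print(percentage_error(79.6, 77.8))   # roughly 2.3%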
To reduce bias in a forecasting technique, we can either (1) reject any technique that projects with
serious bias in favor of a less-biased alternative (after we have first compared the precision and complexity
of the methods under consideration) or (2) investigate the pattern of bias in the hope of devising a bias
adjustment; for example, we might take the forecasts from method X and adjust them upward to try to offset
the tendency of this method to underforecast. Also, forecasts for Techniques Y and Z could be averaged
after placing Forecast X aside for the current forecasting cycle.
Ways to Evaluate Accuracy
A number of forecasting competitions were conducted by the International Institute
of Forecasters (IIF; www.forecasters.org) to assess the effectiveness of statistical
forecasting techniques and determine which techniques are among the best. Starting
with the original M competition in 1982, Spyros Makridakis and his co-authors
reported results in the Journal of Forecasting, in which they compared the accuracy
of about 20 forecasting techniques across a sample of 111 time series – a very small
dataset by today’s standards. A subset of the methods was tested on 1001 time series,
still not very large, but substantial enough for the computer processing power at the
time. The last 12 months of each series were held out and the remaining data were
used for model fitting. Using a range of measures on a holdout sample, the
International Journal of Forecasting (IJF) conducted a competition in 1997 comparing a range of
forecasting techniques across a sample of 3003 time series. Known as the M3 competition, these data and
results can be found at the website www.maths.monash.edu.au/∼hyndman/forecasting/. The IJF
published a number of other papers evaluating the results of these competitions. These competitions have
become the basis for how we should measure forecast accuracy in practice.
Forecast accuracy measurements are performed in order to assess the accuracy of a forecasting
technique.
The Fit Period versus the Test Period
In measuring forecast accuracy, a portion of the historical data is withheld, and reserved for evaluating
forecast accuracy. Thus, the historical data are first divided into two parts: an initial segment (the fit
period) and a later segment (the holdout sample). The fit period is used to develop a forecasting model.
Sometimes, the fit period is called the within-sample training, initialization, or calibration period. Next,
using a particular model for the fit period, model forecasts are made for the later segment. Finally, the
accuracy of these forecasts is determined by comparing the projected values with the data in the holdout
sample. The time period over which forecast accuracy is evaluated is called the test period, validation
period, or holdout-sample period.
A forecast accuracy test should be performed with data from a holdout sample.
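A minimal sketch of this fit/holdout split in Python (the series, the holdout length, and the placeholder forecasting function are all assumed for illustration; any real technique could stand in for fit_and_forecast):

def split_fit_holdout(series, holdout):
    """Split a history into a fit (training) period and a holdout (test) period."""
    return series[:-holdout], series[-holdout:]

def fit_and_forecast(fit_data, horizon):
    """Placeholder model: project the mean of the fit period (stand-in for any real technique)."""
    level = sum(fit_data) / len(fit_data)
    return [level] * horizon

history = [71.2, 72.8, 74.1, 73.5, 79.6, 83.0, 85.1, 88.4]  # assumed data for illustration
fit_period, test_period = split_fit_holdout(history, holdout=4)
forecasts = fit_and_forecast(fit_period, horizon=len(test_period))
forecast_errors = [a - f for a, f in zip(test_period, forecasts)]  # true forecast errors, not residuals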
When implementing software or spreadsheet code, we can minimize confusion and
misinterpretation if a common notation for the concepts in these calculations is used. Let Yt(1) denote the one-period-ahead forecast of Yt and Yt(m) the m-period-ahead forecast. For example, a forecast for t = 25 that is the one-period-ahead forecast made at t = 24 is denoted by Y24(1), and the two-period-ahead forecast for the same time, t = 25, made at t = 23 is denoted by Y23(2). Generally, forecasters are interested in multi-period-ahead forecasts because lead times longer than one period are required for business to act
on a forecast. We use the following conventions for a time series:
{Yt , t = 1, 2, …, T}          the historical dataset up to and including period t = T
{Ŷt , t = 1, 2, …, T}          the data set of fitted values that result from fitting a model to historical data
YT+m                           the future value of Yt , m periods after t = T
{YT(1), YT(2), …, YT(m)}       the one- to m-period-ahead forecasts made from t = T
Consequently, we see that forecast errors and fit errors (residuals) refer to:
Y1 - Ŷ1 , Y2 - Ŷ2 , . . . , YT - ŶT          the fit errors or residuals from a fit
YT+1 - YT(1), YT+2 - YT(2), . . . , YT+m - YT(m)          the forecast errors (not to be confused with the residuals from a fit)
In dealing with forecast accuracy, it is important to distinguish between forecast errors and fitting
errors.
Goodness-of-Fit versus Forecast Accuracy
We need to assess forecast accuracy rather than just calculate overall goodness-of-fit statistics for the
following reasons:
• Goodness-of-fit statistics may appear to give better results than forecasting-based calculations, but
goodness-of-fit statistics measure model adequacy over a fitting period that may not be
representative of the forecasting period.
• When a model is fit, it is designed to reproduce the historical patterns as closely as possible. This
may create complexities in the model that capture insignificant patterns in the historical data,
which may lead to over-fitting.
• By adding complexity, we may not realize that insignificant patterns in the past are unlikely to
persist into the future. More important, the more subtle patterns of the future are unlikely to have
revealed themselves in the past.
• Exponential smoothing models are based on updating procedures in which each forecast is made
from smoothed values in the immediate past. For these models, goodness-of-fit is measured from
forecast errors made in estimating the next time period ahead from the current time period. These
are called one-period-ahead forecast errors (also called one-step-ahead forecast errors). Because it
is reasonable to expect that errors in forecasting the more distant future will be larger than those
made in forecasting the next period into the future, we should avoid accuracy assessments based
exclusively on one-period-ahead errors.
When assessing forecast accuracy, we may want to know about likely errors in forecasting more than
one period-ahead.
Item-Level versus Aggregate Performance
Forecast evaluations are also useful in multi-series comparisons. Production and inventory managers
typically need demand or shipment forecasts for hundreds to tens of thousands of items (SKUs) based on
historical data for each item. Financial forecasters need to issue forecasts for dozens of budget categories in
a strategic plan on the basis of past values of each source of revenue. In a multi-series comparison, the
forecaster should appraise the method based not only on its performance for the individual item but also on
the basis of its overall accuracy when tested over various summaries of the data. How to do that most
effectively will depend on the context of the forecasting problem and can, in general, not be determined a
priori. In the next section, we discuss various measures of forecast accuracy.
Absolute Errors versus Squared Errors
Error measures based on absolute errors and those based on squared errors are both useful. There is a good argument for consistency, in the sense that a model's forecasting accuracy should be evaluated on the same basis used to develop (fit) the model. The standard basis for model fitting is the least-squares criterion, that is, minimizing the mean squared error (MSE) between the actual and fitted values. To be consistent, we should then evaluate forecast accuracy based on squared-error measures, such as the root mean squared error (RMSE). Still, it is useful to test how well different methods do in forecasting using a variety of accuracy measures. Forecasts are put to a wide variety of uses in any organization, and no single forecaster can dictate how they must be used and interpreted.
Sometimes costs or losses due to forecast errors are in direct proportion to the size of the error: double the error leads to double the cost. For example, when a soft drink distributor realized that the costs of shipping its product between distribution centers were becoming prohibitive, it made a study of the relationship between under-forecasting (not enough of the right product at the right place, thus requiring a transshipment, or backhaul, from another distribution center) and the cost of those backhauls. As shown in Exhibit 6, over-forecasts of 25% or higher appeared strongly related to an increase in the backhaul of pallets of product. In this case, measures based on absolute errors are more appropriate. In other cases, small forecast errors do not cause much harm and large errors may be devastating; then, we would want to stress the importance of avoiding large errors, which is what squared-error measures accomplish.
Exhibit 6 Backhauls in Pallets versus Forecast Error
Measures of Forecast Accuracy
Measures of Bias
There are two common measures of bias. The mean error (ME) is the sum of the forecast errors divided by
the number of periods in the forecast horizon (h) for which forecasts were made:
ME = [∑ (At - Ft)] / h   (sum from t = T+1 to t = T+h)
   = [∑ (Ai - Fi)] / h   (sum from i = 1 to i = h)

The mean percentage error (MPE) is

MPE = 100 * [∑ (At - Ft) / At] / h   (sum from t = T+1 to t = T+h)
    = 100 * [∑ (Ai - Fi) / Ai] / h   (sum from i = 1 to i = h)
The MPE and ME are useful supplements to a count of the frequency of under- and over-forecasts.
The ME gives the average of the forecast errors expressed in the units of measurement of the data; the MPE
gives the average of the forecast errors in terms of percentage and is unit-free.
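A direct translation of these two bias measures into Python (a sketch; actuals and forecasts are whatever values cover the h holdout periods):

def mean_error(actuals, forecasts):
    """ME: average of (A - F) over the h holdout periods, in the units of the data."""
    h = len(actuals)
    return sum(a - f for a, f in zip(actuals, forecasts)) / h

def mean_percentage_error(actuals, forecasts):
    """MPE: average of 100 * (A - F) / A over the h holdout periods; unit-free."""
    h = len(actuals)
    return sum(100.0 * (a - f) / a for a, f in zip(actuals, forecasts)) / h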
If we examine the hypothetical example in Exhibits 7 and 8, we find that for technique X, ME = 3.8 and MPE = 4.2%. This means that, with technique X, we underforecast by an average of 3.8 units per period. Restated in terms of the MPE, the forecasts from technique X were below the actual values by an average of 4.2%. A positive value for the ME or MPE signals downward bias in the forecasts using technique X.
The two most common measures of bias are the mean error (ME) and the mean percentage error
(MPE).
Compared to technique X, the ME and MPE for technique Y are much smaller. These results are
not surprising. Because technique Y resulted in an equal number of under-forecasts and overforecasts, we
expect to find that the average error is close to zero.
On the other hand, the low ME and MPE values for technique Z are surprising. We have
previously seen that technique Z over-forecasts in three of the four periods. A closer look at the errors from
technique Z shows that there was a very large under-forecast in one period and relatively small overforecasts in the other three periods. Thus, the one large underforecast offset the three small over-forecasts,
yielding a mean error close to 0. The lesson here is that averages can be distorted by one or just a few
unusually large errors. One should consider calculating non-conventional measures like the Median Error (MdE) and the Median Percentage Error (MdPE). We should always use multiple error measures as a supplement to the analysis of the individual errors, as unusual values can readily lead to an incorrect interpretation.
Measures of Precision
Certain indicators of precision are based on the absolute values of the forecast errors. By taking an absolute
value, we eliminate the possibility that under-forecasts and over-forecasts negate one another. Therefore, an
average of the absolute forecast errors reveals simply how far apart the forecasts are from the actual values.
It does not tell us if the forecasts are biased. The absolute errors are shown in Exhibit 7, and the absolute values of the percentage errors are shown in Exhibit 8.
Exhibit 7 (a) Bar chart and (b) Table Showing Forecast Error as the Absolute Difference |E| between Actuals
and Forecasts for Three Techniques
Exhibit 8 (a) Bar chart and (b) Table Showing Forecast Error as the Absolute Percentage Difference |PE(%)|
between Actuals and Forecasts for Three Techniques.
|PE| = 100 * |A - F| / A
The most common averages of absolute values are mean absolute error (MAE), mean absolute
percentage error (MAPE), and median absolute percentage error (MdAPE).
MAE = [∑ |At - Ft|] / h   (sum from t = T+1 to t = T+h)
    = [∑ |Ai - Fi|] / h   (sum from i = 1 to i = h)

MAPE = 100 * [∑ |At - Ft| / At] / h   (sum from t = T+1 to t = T+h)
     = 100 * [∑ |Ai - Fi| / Ai] / h   (sum from i = 1 to i = h)

MdAPE = Median value of {100 * |Ai - Fi| / Ai , i = 1, . . . , h}
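Carried into code, the three measures look like this (a sketch under the same conventions; the function names are ours):

from statistics import median

def mae(actuals, forecasts):
    """Mean absolute error, in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

def mape(actuals, forecasts):
    """Mean absolute percentage error."""
    return sum(100.0 * abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals)

def mdape(actuals, forecasts):
    """Median absolute percentage error (resistant to a single outlying miss)."""
    return median(100.0 * abs(a - f) / a for a, f in zip(actuals, forecasts))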
Interpretations of the averages of absolute error measures are straightforward. In Exhibit 7, we
calculate that the MAE is 4.6 for technique Z, from which we can conclude that the forecast errors from this
technique average 4.6 per period. The MAPE, from Exhibit 8, is 5.2%, which tells us that the period
forecast errors average 5.2%.
The MdAPE for technique Z, from Exhibit 8, is approximately 4.1% (the average of the two
middle values: 4.4 and 3.8). Thus, half the time the forecast errors exceeded 4.1%, and half the time they
were smaller than 4.1%. When there is a serious outlier among the forecast errors, as with technique Z, it is
useful to calculate the MdAPE in addition to the MAPE because medians are less sensitive (more resistant)
than mean values to distortion from outliers. This is why the MdAPE is a full percentage point below the
MAPE for technique Z. Sometimes, as with technique Y, the MdAPE and the MAPE are virtually identical.
In this case, we can report the MAPE because it is the far more common measure.
Exhibit 9 (a) Bar chart and (b) Table Showing Forecast Error as Squared Difference (E2) between Actuals and
Forecasts for Three Techniques
Exhibit 10 (a) Bar chart and (b) Table Showing Forecast Error as Squared Percentage Difference (PE(%))2
between Actuals and Forecasts for Three Techniques
In addition to the indicators of precision calculated from the absolute errors, certain measures are
commonly used that are based on the squared values of the forecast errors (Exhibit 10), such as the MSE
and RMSE. (RMSE is the square root of MSE.) A notable variant on RMSE is the root mean squared
percentage error (RMSPE), which is based on the squared percentage errors (Exhibit 10).
MSE = [∑ (At - Ft)2] / h   (sum from t = T+1 to t = T+h)
    = [∑ (Ai - Fi)2] / h   (sum from i = 1 to i = h)

RMSPE = 100 * √( [∑ {(At - Ft) / At}2] / h )   (sum from t = T+1 to t = T+h)
      = 100 * √( [∑ {(Ai - Fi) / Ai}2] / h )   (sum from i = 1 to i = h)
To calculate RMSPE, we square each percentage error in Exhibit 10. The squares are then averaged, and
the square root is taken of the average.
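The squared-error measures follow the same pattern in code (again a sketch, with our own function names):

from math import sqrt

def mse(actuals, forecasts):
    """Mean squared error, in squared units of the data."""
    return sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)

def rmse(actuals, forecasts):
    """Root mean squared error: back in the original units of the data."""
    return sqrt(mse(actuals, forecasts))

def rmspe(actuals, forecasts):
    """Root mean squared percentage error: square each PE, average, then take the square root."""
    squared_pes = [(100.0 * (a - f) / a) ** 2 for a, f in zip(actuals, forecasts)]
    return sqrt(sum(squared_pes) / len(squared_pes))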
The MSE is expressed in the square of the units of the data. This may make it difficult to interpret
when referring to dollar volumes, for example. By taking the square root of the MSE, we return to the units
of measurement of the original data.
We can simplify the interpretation of the RMSE and present it as if it were the mean absolute error (MAE); the forecasts of technique Y are in error by an average of 4. There is little real harm in doing this; just keep in mind that the RMSE is generally somewhat larger than the MAE (compare Exhibits 7 and 9).
Some forecasters present the RMSE within the context of a prediction interval. Here is how it might sound: "Assuming the forecast errors are normally distributed, we can be approximately 68% sure that technique Y's forecasts will be accurate to within 4.0 on the average." However, it is best not to make these statements too precise, because doing so requires us to assume a normal distribution of forecast errors, a rarity in demand forecasting practice.
The RMSE can be interpreted as a standard error of the forecasts.
The RMSPE is the percentage version of the RMSE (just as the MAPE is the percentage version of
the MAE). Because the RMSPE for technique Y equals 4.5%, we may suggest that "the forecasts of
technique Y have a standard or average error of approximately 4.5%."
If we desire a squared error measure in percentage terms, the following shortcut is sometimes
taken. Divide the RMSE by the mean value of the data over the forecast horizon. The result can be
interpreted as the standard error as a percentage of the mean of the data and is called the coefficient of
variation (because it is a ratio of a standard deviation to a mean). Based on technique Y, the average of the actual values over periods 21 - 24 was 88.2 (Exhibit 4), and the RMSE for technique Y was found to be 4.0 (Exhibit 9). Hence, the coefficient of variation is 4/88.2 = 4.5%, a result that is virtually identical to the RMSPE in this case.
Comparison with Naive Techniques
After reviewing the evidence of bias and precision, we may want to select technique Y as the preferred of
the three candidates. Based on the evaluation, technique Y's forecast for periods 21 - 24 shows no indication
of bias - two under-forecasts and two over-forecasts - and proves slightly more precise than the other two
techniques, no matter which measure of precision is used. The MAE for technique Y reveals that the
forecast errors over periods 21 - 24 average approximately 3.5 per period, and the MAPE indicates that
these errors come to just under 4% on average.
To provide additional perspective on technique Y, we might contrast technique Y's forecasting
record with that of a very simplistic procedure, one requiring little or no thought or effort. Such procedures
are called naive techniques. The continued time and effort investment in technique Y might not be worth the
cost if it can be shown to perform no better than a naïve technique. Sometimes the contrast with a naive
technique is sobering. The forecaster thinks an excellent model has been developed and finds that it barely
outperforms a naive technique.
If the forecasting performance of technique Y is no better or is worse than that of a naive technique, it
suggests that in developing technique Y we have not accomplished very much.
A standard naive forecasting technique has emerged for data that are yearly or otherwise lack
seasonality; it is called a NAÏVE_1 or Naive Forecast 1 (NF1). We see later that there are also naïve
forecasting techniques for seasonal data. A NAÏVE_1 forecast of the following time period is simply the
value in effect during this time period. Alternatively stated, the forecast for any one time period is the value
observed during the previous time period.
A NAIVE_1 is called a no change forecasting technique, because its forecasted value is unchanged from
the previously observed value.
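In code, a NAIVE_1 benchmark is little more than a one-period lag of the actuals (a sketch; the data values are assumed for illustration):

def naive1_forecasts(history):
    """One-period-ahead NAIVE_1 forecasts: the forecast for period t is the actual for period t - 1."""
    return history[:-1]  # aligns with the actuals history[1:]

history = [70.1, 71.4, 73.5, 79.6, 86.5, 91.2, 95.4]  # assumed data; steadily rising
forecasts = naive1_forecasts(history)
errors = [a - f for a, f in zip(history[1:], forecasts)]  # all positive: NAIVE_1 underforecasts a rising series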
Exhibit 11 Actuals and NAIVE_1 (one-period-ahead) Forecasts for Technique Y. (Exhibit 3)
Exhibit 11 shows the forecast for our example (Exhibit 3). Note that one-period-ahead forecasts
for periods 22 - 24 are the respective actuals for periods 21 - 23. The forecast for period 21 (=73.5) is the
actual value for period 20. When we appraise NAIVE_1's performance, we note it is biased. It
underestimated periods 21 - 23 and overestimated period 24. By always projecting no change, a NAIVE_1
will underforecast whenever the data increase and over-forecast whenever the data are declining. We do not
expect a NAIVE_1 to work very well when the data are steadily trending in one direction.
The error measures for NAIVE_1 in Exhibit 11 are all higher than the corresponding error
measures for technique Y, indicating that technique Y was more accurate than NAIVE_1. By looking at the
relative error measures, we can be more specific about the advantage of technique Y over NAIVE_1.
Because a naive technique has no prescription for a random error term, the NAIVE_1 technique
can be viewed as a no-change, no-chance method (in the sense of our forecasting objectives) to predict
change in the presence of uncertainty (chance). Hence, it is useful as a benchmark-forecasting model to
beat.
Relative Error Measures
A relative error measure is a ratio; the numerator is a measure of the precision of a forecasting method;
the denominator is the analogous measure for a naive technique.
Exhibit 12 Relative Error Measures for Technique Y Relative to NAIVE_1. (Exhibit 3)
Exhibit 12 provides a compendium of relative error measures that compare technique Y with the
NAIVE_1. The first entry (=0.73) is the ratio of the MAE for technique Y to the MAE for the NAIVE_1
(3.55/4.87 = 0.73). It is called the relative absolute error (RAE). The RAE shows that the average forecast
error using technique Y is 73% of the average error using NAIVE_1. This implies a 27% improvement of
technique Y over NAIVE_1.
Expressed as an improvement score, the relative error measure is sometimes called a forecast
coefficient (FC), as listed in the last column of Exhibit 12. When FC = 0, this indicates that the technique in
question was no better than NAIVE_1 technique. When FC < 0, the technique was in fact less precise than
the NAIVE_1, whereas when FC > 0 the forecasting technique was more precise than NAIVE_1. When FC
= 1, the forecasting technique is perfect.
The second row of Exhibit 12 shows the relative error measures based on the MAPE. The results
almost exactly mirror those based on the MAE, and the interpretations are identical. The relative error
measure based on the RMSE (which is the RMSE of technique Y divided by the RMSE of the NAIVE_1) was originally proposed in 1966 by the economist Henri Theil and is known as Theil's U or Theil's U2 statistic. As with the RAE, a score less than 1 suggests an improvement over the naive technique.
The relative errors based on the squared-error measures, RMSE and RMSPE, suggest that
technique Y in Exhibit 12 was about 40% more precise than the NAIVE_1. The relative errors based on
MAE and MAPE put the improvement at approximately 25 – 30%. Why the difference? Error measures
based on absolute values, such as the MAE and MAPE, give a weight (emphasis) to each time period in
proportion to the size of the forecast error. The RMSE and RMSPE, in contrast, are measures based on
squared errors and squaring gives more weight to large errors. A closer look at Exhibit 11 shows that the
NAIVE_1 commits a very large error in forecast period 22.
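A sketch of how these relative measures might be computed (the helper functions and names are ours; the forecast coefficient is expressed as FC = 1 - relative error, consistent with the interpretation above):

from math import sqrt

def _mae(actuals, forecasts):
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

def _rmse(actuals, forecasts):
    return sqrt(sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals))

def relative_absolute_error(actuals, f_model, f_naive):
    """RAE: MAE of the candidate technique divided by the MAE of the naive benchmark."""
    return _mae(actuals, f_model) / _mae(actuals, f_naive)

def forecast_coefficient(actuals, f_model, f_naive):
    """FC = 1 - RAE: 0 means no better than naive, 1 is perfect, negative is worse than naive."""
    return 1.0 - relative_absolute_error(actuals, f_model, f_naive)

def theils_u2(actuals, f_model, f_naive):
    """Theil's U2: RMSE of the candidate technique divided by the RMSE of the naive benchmark."""
    return _rmse(actuals, f_model) / _rmse(actuals, f_naive)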
Predictive Visualization Techniques
Predictive Visualization techniques for forecast accuracy measurement provide a perspective on how
serious a forecasting bias may be. If the bias exceeds a designated threshold, a "red flag" is raised
concerning the forecasting approach. The forecaster should then reexamine the data to identify changes in
the trend, seasonal, or other business patterns, which in turn suggest some adjustment to the forecasting
approach.
Ladder Charts
By plotting the current year’s monthly forecasts on a ladder chart, a forecaster can determine whether the
seasonal pattern in the forecast looks reasonable. The ladder chart in Exhibit 13 consists of six items of
information for each month of the year: the average over the past 5 years, the past year's performance, the 5-year low, the 5-year high, the current year-to-date, and the monthly forecasts for the remainder of the year.
The 5-year average line usually provides the best indication of the seasonal pattern, assuming this pattern is
not changing significantly over time. In fact, this is a good check for reasonableness that can be done before
submitting the forecast for approval.
A ladder chart is a simple yet powerful tool for monitoring forecast results.
The level of the forecast can be checked for reasonableness relative to the prior year (dashed line
in Exhibit 13). The forecaster can determine whether or not the actuals are consistently overrunning or
underrunning the forecast. In this example, the forecast errors are positive for 3 months and negative for 3
months. The greatest differences between the actual and forecast values occur in March and April, but here
the deviations are of opposite signs; additional research should be done to uncover the cause of the unusual
March-April pattern. The forecasts for the remainder of the current year look reasonable, although some
minor adjustments might be made. The ladder chart is one of the best predictive visualization tools for
quickly identifying the need for major changes in forecasts.
Exhibit 13 Ladder Chart
Prediction-Realization Diagram
Another useful visual approach to monitoring forecast accuracy is the prediction-realization diagram introduced by the economist Henri Theil in 1958. If the predicted values are indicated on the vertical axis and the actual values on the horizontal axis, a straight line with a 45° slope will represent perfect forecasts. This is called the line of perfect forecasts (Exhibit 14). In practice, the prediction-realization diagram is sometimes rotated so that the line of perfect forecasts appears horizontal.
The prediction-realization diagram indicates how well a model or forecaster has predicted turning points
and also how well the magnitude of change has been predicted given that the proper direction of change
has been forecast.
The diagram has six sections. Points falling in sections II and V are a result of turning-point errors.
In Section V, a positive change was predicted, but the actual change was negative. In Section II, a negative
change was predicted, but positive change occurred. The remaining sections involve predictions that were
correct in sign but wrong in magnitude. Points above the line of perfect forecasts reflect actual changes that
were less than predicted. Points below the line of perfect forecasts represent actual changes that were
greater than predicted.
Exhibit 14 Prediction-Realization Diagram
The prediction-realization diagram can be used to record forecast results on an ongoing basis.
Persistent overruns or underruns indicate the need to adjust the forecasts or to reestimate the model. In this
case, a simple error pattern is evident and we can raise or lower the forecast based on the pattern and
magnitude of the errors.
More important, the diagram indicates turning-point errors that may be due to misspecification or
missing variables in the model. The forecaster may well be at a loss to decide how to modify the model
forecasts. An analysis of other factors that occurred when the turning-point error was realized may result in
inclusion of a variable in the model that was missing from the initial specification.
Exhibit 15 depicts two prediction-realization diagrams for performance results from two
forecasters in their respective sales regions of a motorcycle manufacturer. By plotting percentage changes in
the monthly forecast versus the percentage change in the monthly actuals, you can see the effect of turning
points on performance. In the first region, six of the forecasts are close to the line of perfect fit (45 degree
line), indicating a good job. The explanation of the other two forecasts suggests that the June point can be attributed to a push to make up for lost retail sales in the first five months because of late production. In the second region, the forecasts for all eight months showed an inability to detect a continuing decline as well as some excessive optimism. Interestingly, on the basis of the MAPE, the worse performing region has the lower MAPE. On balance, it should be noted that the growth patterns are also quite different in the two regions, which may need to be taken into consideration.
Exhibit 15 Prediction-Realization Diagrams for Two Regions of a Motorcycle Manufacturer: Region A (top),
Region B (bottom)
What Are Prediction Intervals?
A major shortcoming in the demand forecaster's practice of forecasting techniques is the absence of error bounds on forecasts. By using the State Space Forecasting Methodology for exponential smoothing and ARIMA modeling, the demand forecaster can create prediction limits that reflect more realistic, unsymmetrical error bounds on forecasts, as well as the conventional symmetrical ones. Why is this important? A simple example will make the point.
Prediction Intervals for Time Series Models
When no error distribution can be assumed (it is this assumption that turns a method into a model), as with moving averages, we cannot construct model-based prediction intervals on forecasts to measure uncertainty.
However, empirical prediction intervals can be constructed using an approach by Goodman and Williams in
a 1971 article entitled “A simple method for the construction of empirical confidence limits for economic
forecasts” in the Journal of the American Statistical Association. With the Goodman/Williams procedure,
we go through the ordered data, making forecasts at each time point. Then, the comparisons of these
forecasts with the actuals (that are known) yield an empirical distribution of forecast errors. If the future
errors are assumed to be distributed like the empirical errors, then the empirical distribution of these
observed errors can be used to set prediction intervals for subsequent forecasts. In some practical cases, the
theoretical size and the empirical size of the intervals have been found to agree closely.
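A rough sketch of the empirical approach (the rolling one-step forecaster here is a simple moving average chosen only for illustration, and the window and coverage values are assumptions, not part of the Goodman-Williams article):

def empirical_interval(history, window=3, coverage=0.80):
    """Empirical prediction interval for the next one-step moving-average forecast.

    Walk through the ordered history making one-step-ahead forecasts, collect the
    observed forecast errors, and use their empirical quantiles as interval offsets
    (this assumes future errors are distributed like the past ones).
    """
    errors = []
    for t in range(window, len(history)):
        forecast = sum(history[t - window:t]) / window
        errors.append(history[t] - forecast)

    errors.sort()
    tail = (1.0 - coverage) / 2.0
    lower = errors[int(tail * (len(errors) - 1))]
    upper = errors[int((1.0 - tail) * (len(errors) - 1))]

    next_forecast = sum(history[-window:]) / window
    return next_forecast + lower, next_forecast + upper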
Prediction intervals are used as a way of expressing the uncertainty in the future values that are derived
from models, and they play a key role in tracking forecasts from these models.
In regression models, it is generally assumed that random errors are additive in the model equation
and have a normal distribution with zero mean and constant variance. The variance of the errors can be
estimated from the time series data. The estimated standard deviation (square root of the variance) is
calculated for the forecast period and is used to develop the desired prediction interval. Although 95%
prediction intervals are frequently used, the range of forecast values for volatile series may be so great that
it might also be useful to show the limits for a lower level of probability, say 80%. This would narrow the
interval. It is common to express prediction intervals about the forecast, but in the tracking process, it is
more useful to deal with prediction intervals for the forecast errors (Error = Actual – Forecast) on a period-by-period or cumulative basis.
A forecaster often deals with aggregated data and would like to know the prediction intervals about
an annual forecast based on a model for monthly or quarterly data. The annual forecast is created by
summing twelve (independent) monthly (or four quarterly) predictions. Developing the appropriate
probability limits requires that the variance for the sum of the prediction errors be calculated. This can be
determined from the variance formula
Var (∑ ei) = ∑ Var (ei) + 2 ∑ Cov (ei , ej )
If the forecast errors have zero covariance (Cov), in particular if they are independent of one another, the variance of the sum equals the sum of the prediction variances.
The most common form of correlation in the forecast errors for time series models is positive
autocorrelation. In this case, the covariance term is positive and the prediction intervals derived would be
too small. In the unusual case of negative covariances, the prediction intervals would be too wide. Rather
than deal with this complexity, most software programs assume the covariance is small and inconsequential.
In typical forecasting situations, prediction intervals are therefore probably too narrow, that is, not conservative enough.
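Under the independence assumption that most software makes, the interval for an annual total follows directly from the variance formula above; a sketch (normal errors and a 95% z-value of 1.96 are assumed):

from math import sqrt

def annual_interval(monthly_forecasts, monthly_error_sd, z=1.96):
    """Approximate 95% prediction interval for the annual total, assuming independent monthly errors.

    With zero covariances, Var(sum of errors) = sum of the variances, so the standard
    error of the annual total is monthly_error_sd * sqrt(12) for a constant monthly sd.
    """
    annual_forecast = sum(monthly_forecasts)
    annual_se = monthly_error_sd * sqrt(len(monthly_forecasts))
    return annual_forecast - z * annual_se, annual_forecast + z * annual_se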
Prediction Interval as Percentages
Many users of forecasts find prediction intervals difficult to interpret. In such cases, the forecaster may find more acceptance of this technique if the prediction intervals for forecasts are expressed in terms of percentages. For example, a 95% probability interval for a forecast might be interpreted as meaning that we are 95% sure that the forecast will be within ± 15% (say) of the actual numerical value. The percentage miss associated with any
prediction interval can be calculated from the formula:
Percent miss = 100% x (Estimated standard error of the forecast) x (t factor) / (Forecast)
where t is the tabulated value of the Student t distribution for the appropriate degrees of freedom and
confidence level. This calculation is based on the assumption of normality of forecast errors and provides an
estimate of the percent miss for any particular period. Values can be calculated for all periods (e.g., for each
month of the year), resulting in a band of prediction intervals spanning the year.
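A sketch of the percent-miss calculation (normality of the forecast errors is assumed, as noted above, and the t factor would be read from a t table for the chosen confidence level and degrees of freedom):

def percent_miss(forecast, std_error, t_factor):
    """Percent miss = 100% * (estimated standard error of the forecast) * (t factor) / forecast."""
    return 100.0 * std_error * t_factor / forecast

# For example, a forecast of 200 with a standard error of 15 and t = 2.2 gives roughly a 16.5% miss.
print(percent_miss(200.0, 15.0, 2.2))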
It could also be useful to determine a prediction interval for a cumulative forecast error. Under the
assumption that forecast errors are independent and random, the cumulative percentage miss will be smaller
than the percentage miss for an individual period because the positive and negative errors will cancel to some
extent. As an approximate rule, the cumulative (annual) percentage miss is calculated by dividing the average monthly percentage miss by √12 (the square root of the number of periods in the forecast) or the average quarterly percentage miss by √4 (= 2).
Using Prediction Intervals as Early Warning Signals
One of the simplest tracking signals is based on the ratio of the sum of the forecast errors to the mean
absolute error MAE. It is called the cumulative sum tracking signal (CUSUM). For certain CUSUM
measures, a threshold or upper limit of |0.5| suggests that the forecast errors are no longer randomly
distributed about 0 but rather are congregating too much on one side (i.e., forming a biased pattern).
More sophisticated tracking signals involve taking weighted or smoothed averages of the forecast errors. The best-known standard tracking signal, due to Trigg, dates back to 1964 and will be described shortly. Tracking signals are especially useful when forecasting a large number of products at a time, as is the case in spare parts inventory management systems.
A tracking signal is a ratio of a measure of bias to a companion measure of precision.
Warning signals can be seen as predictive visualizations by plotting the forecast errors over time
together with the associated prediction intervals and seeing whether the forecast errors continually lie above
or below the zero line. Even though the individual forecast errors may well lie within the appropriate
prediction interval for the period (say monthly), a plot of the cumulative sum of the errors may indicate that
their sum lies outside its prediction interval.
An early warning signal is a succession of overruns and underruns.
A type of warning signal is evident in Exhibits 16 and 17. It can be seen that the monthly forecast
errors lie well within the 95% prediction interval for 9 of the 12 months, with two of the three exceptions
occurring in November and December. Exhibit 16 suggests that the individual forecast errors lie within their
respective prediction intervals. However, it is apparent that none of the forecast errors are negative; certainly, they do not form a random pattern around the zero line. Hence, there appears to be a bias. To
determine whether the bias in the forecast is significant, we review the prediction intervals for cumulative
sums of forecast errors.
Exhibit 16 Monthly Forecast Errors and Associated Prediction Intervals for a Time Series Model
The cumulative prediction interval (Exhibit 17) confirms the problem with the forecast bias. The
cumulative forecast errors fall on the outside of the prediction interval for all twelve periods. This model is
clearly under-forecasting. Either the model has failed to capture a strong cyclical pattern during the year or
the data are growing rapidly and the forecaster has failed to make a proper specification in the model.
Using these two plots, the forecaster would probably be inclined to make upward revisions in the
forecast after several months. It certainly would not be necessary to wait until November to recognize the
problem.
Exhibit 17 Cumulative Forecast Errors and Associated Cumulative Prediction Intervals for a Time Series
Model
Another kind of warning signal occurs when too many forecast errors fall outside the prediction
intervals. For example, with a 90% prediction interval, we expect only 10% (approximately 1 month per
year) of the forecast errors to fall outside the prediction interval. Exhibit 18 shows a plot of the monthly
forecast errors for a time series model. In this case, five of the twelve errors lie outside the 95% interval.
Clearly, this model is unacceptable as a predictor of monthly values. However, the monthly error pattern
may appear random and the annual forecast (cumulative sum of twelve months) might be acceptable.
Exhibit 19 shows the cumulative forecast errors and the cumulative prediction intervals. It appears that the
annual forecast lies within the 95% prediction interval and is acceptable.
Exhibit 18 Monthly Forecast Errors and Prediction Intervals for a Time Series Model
Exhibit 19 Cumulative Forecast Errors and Cumulative Prediction Intervals for a Time Series Model
The conclusion that may be reached from monitoring the two sets of forecast errors is that neither
model is wholly acceptable, and that neither can be rejected. In one case, the monthly forecasts were good
but the annual forecast was not. In the other case, the monthly forecasts were not good, but the annual
forecast turned out to be acceptable. Whether the forecaster retains either model depends on the purpose for
which the model was constructed. It is important to note that, by monitoring cumulative forecast errors, the
forecaster was able to determine the need for a forecast revision more quickly than if the forecaster were
only monitoring monthly forecasts. This is the kind of early warning that management should expect from
the forecaster.
Trigg Tracking Signal
The Trigg tracking signal proposed by D. W. Trigg in 1964 indicates the presence of nonrandom forecast
errors; it is a ratio of two smoothed errors Et and Mt . The numerator Et is a simple exponential smooth of
the forecast errors et , and the denominator Mt is a simple exponential smooth of the absolute values of the
forecast errors. Thus,
Tt = Et / Mt
Et = α et + (1 - α) Et-1
Mt = α |et| + (1 - α) Mt-1
where et = Yt – Ft is the difference between the observed value Yt and the forecast Ft.
Trigg shows that when Tt exceeds 0.51 for α = 0.1 or 0.74 for α = 0.2, the forecast errors are
nonrandom at the 95% probability level. Exhibit 20 shows a sample calculation of the Trigg tracking signal
for an adaptive smoothing model of seasonally adjusted airline passenger data. The tracking signal correctly
provides a warning at period 15 after five consecutive periods in which the actual exceeded the forecast.
Period 11 has the largest forecast error, but no warning is provided because the sign of the forecast error
became reversed. It is apparent that the model forecast errors can increase substantially above prior
experience without a warning being signaled as long as the forecast errors change sign. Once a pattern of
over- or underforecasting is evident, a warning is issued.
Exhibit 20 Trigg’s Tracking Signal (α = 0.1) for an Adaptive Smoothing Model of Seasonally Adjusted Airline
Data
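The Trigg signal translates into a short routine (a sketch; alpha and the 0.51 threshold for alpha = 0.1 are the values quoted in the text, while the starting values for the two smooths are a pragmatic choice of ours):

def trigg_signal(errors, alpha=0.1, threshold=0.51):
    """Trigg tracking signal T_t = E_t / M_t, flagging periods where |T_t| exceeds the threshold.

    E_t is a simple exponential smooth of the forecast errors e_t, and M_t is a simple
    exponential smooth of |e_t|; 0.51 is the 95% threshold quoted for alpha = 0.1.
    """
    e_smooth = 0.0                      # pragmatic starting values for the two smooths
    m_smooth = abs(errors[0]) or 1e-9
    flags = []
    for e in errors:
        e_smooth = alpha * e + (1 - alpha) * e_smooth
        m_smooth = alpha * abs(e) + (1 - alpha) * m_smooth
        flags.append(abs(e_smooth / m_smooth) > threshold)
    return flags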
Spreadsheet Example: How to Apply a Tracking Signal
In these White Papers we discuss a variety of forecasting models, from simple exponential smoothing to
multiple linear regression. Once we start using these models on an ongoing basis, we cannot expect them to
supply good forecasts indefinitely. The patterns of our data may change, and we will have to adjust our
forecasting model or perhaps select a new model better suited to the data. For example, suppose that we are
using simple exponential smoothing, a model that assumes that the data fluctuate about a constant level.
Now assume that a trend develops. We have a problem because the level exponential smoothing forecasts
will not keep up with a trend and we will be consistently under-forecasting. How do we detect this
problem?
One answer is to periodically test the data for trends and other patterns. Another is to examine a
graph of the forecasts each time that we add new data to our model. Both methods quickly become
cumbersome, especially when we forecast a large number of product SKUs by locations. Visually checking
all these data after every forecast cycle is out of the question!
Quick and Dirty Control
The tracking signal, a simple quality control method, is just the tool for the job. When a forecast for a given
time period is too large, the forecast error has a negative sign; when a forecast is too small, the forecast
error is positive. Ideally, the forecasts should vary around the actual data and the sum of the forecast errors
should be near zero. But if the sum of the forecast errors departs from zero in either direction, the
forecasting model may be out of control.
Just how large should the sum of the forecast errors grow before we act? To answer this, we need a
basis for measuring the relative dispersion, or scatter, of the forecast errors. If forecast errors are widely scattered, a relatively large sum of errors is not unusual. A standard measure of variability such as the standard deviation could be used as a measure of dispersion, but more commonly the mean absolute error (MAE) is used to define the tracking signal:

Tracking signal = (Sum of forecast errors) / (MAE)
Typically the control limit on the tracking signal is set at ±4.0. If the signal goes outside this range
we should investigate. Otherwise, we can let our forecasting model run unattended. There is a very small
probability, less than 1%, that the signal will go outside the range ±4.0 due to chance.
The Tracking Signal in Action
Exhibit 21 is a graph of actual sales and the forecasts produced by simple exponential smoothing. For the
first 6 months, the forecast errors are relatively small. Then, a strong trend begins in month 7. Fortunately,
as shown in Exhibit 22, the tracking signal sounds an alarm immediately.
Exhibit 21 Monthly Forecasts and Sales for a Time Series Model
Exhibit 22 Monthly Forecast Errors and Tracking Signal
Column F in Exhibit 23 contains the mean absolute error. This is updated each month to keep pace
with changes in the errors. Column G contains the running sum of the forecast errors, and column H holds
the value of the tracking signal ratio. A formula in column I displays the label ALARM if the signal is
outside the range ±4.0. We can now take a step-by-step approach to building the spreadsheet. For those
unfamiliar with simple exponential smoothing (Download free White Paper 07 at
http://cpdftraining.org/PDFgallery2.htm), the forecasting formulas are explained as they are entered.
Exhibit 23 Actuals, Forecasts, and Forecast Tracking Signal
Step 1: Set up a spreadsheet. Enter the labels in rows 1 - 5. In cell B2, enter a smoothing weight of 0.1. In
cell F2, enter a control limit of 4.0. Number the months in column A and enter the actual sales in
column B.
Step 2: Compute the monthly forecasts and forecast errors. To get the smoothing process started, we
must specify the first forecast in cell C7. A good rule of thumb is to set the first forecast equal to the first
actual data value. Therefore, enter +B7 in cell C7. In cell D7, enter the formula +B7-C7, which calculates
the forecast error.
In cell C8, the first updating of the forecast occurs. Enter the formula +C7+$B$2*D7 in cell C8.
This formula states that the forecast for month 2 is equal to the forecast for month 1 plus the smoothing
weight times the error in month 1. Copy the updating formula in cell C8 to the range C9 to C19. Copy the
error-calculating formula in cell D7 to range D8 to D18. Because the actual sales figure is not yet available
for month 13, cells B19 and D19 are blank.
Exponential smoothing works by adjusting the forecasts in a direction opposite to that of the error,
working much like a thermostat, an automatic pilot, or a cruise control on an automobile. Examine the way
the forecasts change with the errors. The month 1 error is zero, so the month 2 forecast is unchanged. The
error is positive in month 2, so the month 3 forecast goes up. In month 3 the error is negative, so the month
4 forecast goes down.
The amount of adjustment each month is controlled by the weight in cell B2. A small weight
avoids the mistake of overreacting to purely random changes in the data. But a small weight also means that
the forecasts are slow to react to true changes. A large weight has just the opposite effect – the forecasts
react quickly to both randomness and true changes. Some experimentation is always necessary to find the
best weight for the data.
Step 3: Compute the Mean Absolute Error (MAE). Column E converts the actual errors to absolute
values. In cell E7, enter a formula for the absolute value of D7, and then copy this cell to range E8 to E18. The mean absolute error, in column F, is updated via exponential smoothing much like the forecasts. In cell F6, a
starting value for the MAE must be specified. This must be representative of the errors or the tracking
signal can give spurious results. A good rule of thumb is to use the average of all available errors to start up
the MAE. But here we will do an evaluation of how the tracking signal performs without advance
knowledge of the surge in sales in month 7. We use the average of the absolute errors for months 2 - 6 only
as the initial MAE (month 1 should be excluded because the error is zero by design). In cell F6, enter the
formula for the average of E8 to E12.
To update the MAE, in cell F7 enter $B$2*E7+(1-$B$2)*F6. Copy this formula to the range F8 to F18. This formula updates the MAE using the same weight as the forecast. The result is really a weighted
average of the MAE, which adapts to changes in the scatter of the errors. The new MAE each month is
computed as 0.1 times the current absolute error plus 0.9 times the last MAE.
Step 4: Sum the forecast errors. This simple operation is considered a separate step because a mistake
would be disastrous here. We must sum the actual errors, not the absolute errors. The only logical starting
value for the sum is zero, which should be entered in cell G6. In cell G7, enter +D7+G6 and copy this to
range G8 to G18.
Step 5: Compute the tracking signal. The tracking signal value in column H is the ratio of the error sum
to the MAE. In cell H6, enter +G6/F6 and copy this to range H7 to H18.
Step 6: Enter the ALARM formulas. In cell I6, enter an IF statement of the form
IF(ABS(H6)>$F$2,"ALARM",""), and copy it to the range I7 to I18. This formula takes the absolute value
of the tracking signal ratio. If that value is greater than the control limit in cell F2, the label ALARM is
displayed; otherwise, the cell is blank.
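For readers who want to check the logic outside the workbook, the sketch below reproduces Steps 1 - 6 in
Python. It is a minimal illustration, not part of the Exhibit 23 spreadsheet; the function name
tracking_signal_report, its arguments, and the warm-up convention are our own, although the start-up rules
follow the rules of thumb described above.

def tracking_signal_report(actuals, w=0.1, limit=4.0, warmup=6):
    forecast = actuals[0]                        # rule of thumb: first forecast = first actual
    errors, abs_errors = [], []
    for a in actuals:
        e = a - forecast                         # forecast error for the month (column D)
        errors.append(e)
        abs_errors.append(abs(e))
        forecast = forecast + w * e              # simple exponential smoothing update (column C)
    # Start-up MAE: average absolute error over months 2 through warmup (month 1 error is zero by design).
    mae = sum(abs_errors[1:warmup]) / (warmup - 1)
    err_sum, report = 0.0, []
    for e, ae in zip(errors, abs_errors):
        mae = w * ae + (1 - w) * mae             # smoothed mean absolute error (column F)
        err_sum += e                             # running sum of signed errors (column G)
        signal = err_sum / mae if mae else 0.0   # tracking signal ratio (column H)
        report.append((round(signal, 2), "ALARM" if abs(signal) > limit else ""))  # column I
    return report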
Adapting the Tracking Signal to Other Spreadsheets
The formulas in columns E through I in Exhibit 23 are not specific to simple exponential smoothing. They
can be used in any other forecasting spreadsheet as long as the actual period-by-period forecast errors are
listed in a column. If we use a forecasting method that does not involve smoothing weights, such as linear
regression, we still have to assign a weight for updating the MAE; a weight of 0.1 works well in such
applications. Unlike the weights used to update the forecasts, experience suggests there is little to be
gained by experimenting with the weight used in the MAE update.
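To illustrate, the short sketch below (hypothetical name monitor_errors) applies the same monitoring logic
to any list of period-by-period forecast errors, regardless of the model that produced them, using a fixed
weight of 0.1 to update the MAE.

def monitor_errors(errors, w=0.1, limit=4.0):
    # Start-up MAE from the first few available absolute errors.
    mae = sum(abs(e) for e in errors[:6]) / min(len(errors), 6)
    err_sum, flags = 0.0, []
    for e in errors:
        mae = w * abs(e) + (1 - w) * mae          # same MAE updating rule as before
        err_sum += e
        flags.append("ALARM" if mae and abs(err_sum / mae) > limit else "")
    return flags

# Example: flags = monitor_errors([a - f for a, f in zip(actuals, regression_fits)])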
In large forecasting spreadsheets, macros can be used to operate the monitoring control system.
For example, we could write a macro that checks the last cell in the tracking signal column of each data set.
The macro could sound a beep and pause if a signal is out of range. Or we could write a macro that prints
out a report showing the location of each offending signal for later investigation.
In Exhibit 23, the tracking signal gives an immediate warning when our forecasts are out of
control. Do not expect this to happen every time. The tracking signal is by no means a perfect monitoring
control device, and only drastic changes in the data will cause immediate warnings. More subtle changes
may take some time to detect.
Statistical theory tells us that the exact probability that the tracking signal will go outside the range
±4.0 depends on many factors, such as the type of data, the forecasting model in use, and the parameters of
the forecasting model, such as the smoothing weights. Control limits of ±4.0 are a good starting point for
most business data. If experiments show that these limits miss important changes in the data, reduce them
to 3.5 or 3.0. If the signal gives false alarms (cases in which the signal is outside the limits but there is no
apparent problem), increase the limits to 4.5 or 5.0. We recommend adding a tracking signal to every
forecasting model; tracking signals save a great deal of work by letting us manage by exception.
Mini Case: Accuracy Measurements of Transportation Forecasts
The “Hauls” data are proprietary but real. The problem is to investigate the need for data quality
adjustments prior to creating a useful forecasting model. Does it matter if a few data points look unusual?
For trend/seasonal data, how do such values affect the seasonal factors, the monthly bias and the overall
precision? We would like to assess the impact of a single outlier and decide how forecast accuracy should
be measured.
Preliminary Analysis. Based on the three years (36 monthly observations), the total variation can be
decomposed into trend, seasonal and other components using the Excel Data Analysis add-in (Anova:
Two-Factor Without Replication). Exhibit 24 shows Seasonality (44%), Trend (1%) and Other (55%).
Trend/seasonal models are therefore expected to show a significant amount of uncertainty in the prediction
limits. After an outlier adjustment (shown below), the total variation breaks down to Seasonality (47%),
Trend (1%) and Other (52%). But how accurate might the forecasts be?
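The sketch below shows one way to compute such a decomposition, assuming the 36 observations are
arranged as a 3-year by 12-month table; the function name variation_shares is our own, and the
between-year effects serve only as a rough proxy for trend.

import numpy as np

def variation_shares(hauls):
    """hauls: 2-D array with rows = years and columns = months (no missing cells)."""
    grand = hauls.mean()
    ss_total = ((hauls - grand) ** 2).sum()
    # Row (year) effects approximate trend; column (month) effects approximate seasonality.
    ss_trend = hauls.shape[1] * ((hauls.mean(axis=1) - grand) ** 2).sum()
    ss_seasonal = hauls.shape[0] * ((hauls.mean(axis=0) - grand) ** 2).sum()
    ss_other = ss_total - ss_trend - ss_seasonal
    return {name: ss / ss_total for name, ss in
            [("Trend", ss_trend), ("Seasonality", ss_seasonal), ("Other", ss_other)]}

# Example (hypothetical values, since the Hauls data are proprietary):
# shares = variation_shares(np.array(monthly_hauls).reshape(3, 12))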
Exhibit 24 ANOVA Pie Chart for Hauls Data
Exploratory Forecast
1. We proceed by modeling the unadjusted data with a trend/seasonal exponential smoothing model
and test first for within-sample outliers. The model is a State Space Model (N,M,M): Local Level
(No Trend), Multiplicative Seasonality, Multiplicative Error. The multiplicative error means that the
prediction limits are asymmetrical (see www.peerforecaster.com for a free download and license of
the PEERForecaster.xla Excel add-in).
2. In Exhibit 25, the model residuals are shown in Column 1. The Descriptive Statistics tool (Excel
Data Analysis add-in) is used to determine the bounds for outlier detection in the residuals.
Outlier Detection and Correction
3. For outlier detection, the traditional method is the Mean +/- a multiple of the St.Dev. The
traditional method is not outlier resistant, because both the Mean and the St.Dev are distorted when
the residuals contain one or more outliers. The outlier-resistant, nonconventional method is based on
the calculations (a) 75th percentile + 1.5 x (interquartile distance) and (b) 25th percentile - 1.5 x
(interquartile distance). (A sketch of both rules follows this list.)
4. Both outlier detection methods give the same conclusion (in this case): outlier in residuals = 867.
The outlier can be corrected (conservatively) by replacing the actual (= 5265) with the upper
bound, or more commonly with the fitted model value for that period (= 4397.62).
5. Before making adjustments, outliers should always be reviewed for reasonableness with a domain
expert familiar with the historical data. It is critical, however, to take data quality steps before
modeling, because outlier effects can be confounded with main effects such as seasonal factors,
special events and promotions.
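As noted in step 3, the sketch below applies both detection rules to a list of residuals. The function names
and the St.Dev multiple (set here to 3) are our own assumptions, not the PEERForecaster implementation.

import numpy as np

def outlier_bounds(residuals, k_sd=3.0, k_iqr=1.5):
    r = np.asarray(residuals, dtype=float)
    # Traditional rule: mean +/- k_sd standard deviations (not outlier resistant).
    traditional = (r.mean() - k_sd * r.std(ddof=1), r.mean() + k_sd * r.std(ddof=1))
    # Resistant rule: quartiles +/- k_iqr times the interquartile distance.
    q1, q3 = np.percentile(r, [25, 75])
    resistant = (q1 - k_iqr * (q3 - q1), q3 + k_iqr * (q3 - q1))
    return traditional, resistant

def flag_outliers(residuals, bounds):
    lo, hi = bounds
    return [i for i, e in enumerate(residuals) if e < lo or e > hi]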
Exhibit 25 Outlier Calculation with Traditional and Nonconventional Methods
Forecast Testing
6. We choose a hold-out sample of twelve months and create a 12-period-ahead forecast for
evaluation. The outlier (October of the second year) could have a significant impact on the
exponential smoothing forecasts because it occurs just before the 12-month hold-out period. The
first period in the data is a December.
7. Choosing the fitted value as the outlier replacement (Exhibit 26), we rerun the trend/seasonal
exponential smoothing model with the outlier-adjusted data. The Haul data are shown in the third
column. (A sketch of the hold-out split and the replacement follows this list.)
8. In Exhibit 27, we show a credible forecast (Change) with prediction limits as a measure of
uncertainty (Chance). The trend/seasonal exponential smoothing model is a State Space Model
(N,M,A): Local Level (No Trend), Multiplicative Seasonality, Additive Error. The additive error
means that the bounds of the prediction limits are symmetrical around the forecast.
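As referenced in step 7, the brief sketch below (hypothetical names hauls, fitted_values, outlier_index)
shows the hold-out split and the conservative replacement of the flagged outlier by its fitted value before
refitting.

def split_and_adjust(hauls, fitted_values, outlier_index, horizon=12):
    """Withhold the last horizon months and replace the flagged outlier with its fitted value."""
    adjusted = list(hauls)
    adjusted[outlier_index] = fitted_values[outlier_index]    # e.g., 5265 replaced by 4397.62
    return adjusted[:-horizon], list(hauls[-horizon:])        # (adjusted fit period, hold-out actuals)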
Preliminary Forecast
9. Exhibit 28 shows a predictive visualization of a preliminary forecast with the outlier-adjusted
Hauls data, using an (Ad,M,A) model: Local Trend (Damped), Multiplicative Seasonality, Additive
Error. Again, the additive error means that the bounds of the prediction limits are symmetrical
around the forecast.
10. The 12 actuals in the hold-out sample lie within the error bounds (not shown).
Exhibit 26 Outlier Replacement for October in 2nd Year: Actual = 5265, Adjusted Value = 4397.62
Exhibit 27 Predictive Visualization of Outlier-Adjusted Hauls Data with 12-Month Hold-out Period
Exhibit 28 Predictive Visualization of a Preliminary Forecast with Outlier-Adjusted Hauls Data
Model Evaluation
11. Exhibit 29 shows bias and precision measures for forecasts in the hold-out sample with and
without outlier-adjusted Haul data. The Mean Error (ME) and Mean Absolute Percentage Error
(MAPE) are not outlier resistant, unlike their median counterparts. The ME is severely distorted by
the October outlier, while the Median Error (MdE) gives a more reliable reading of the bias. The
effect of the outlier also appears to be greater on the MAPE than on the MdAPE.
12. In practice, both measures should always be calculated and compared. If the MAPE and MdAPE
appear to be quite different (from an "impact" perspective), then investigate the underlying
numbers for stray values. If they are comparable, then report the MAPE, as this measure is typically
better known. Here a MAPE of 6% would be a credible measure of performance, based on the
comparison in this one case. (A sketch of these measures follows Exhibit 30.)
13. Exhibit 30 shows a comparison of the seasonal factors from the analysis with and without the
outlier adjustment. The impact is greatest for October, but there is also some effect in December
and January.
Exhibit 29 Bias and Precision Measures for Forecasts in the Hold-Out Sample With and Without Outlier-Adjusted Haul Data
Exhibit 30 Comparison of Multiplicative Seasonal Factors (With and Without Outlier-Adjusted Haul Data)
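The sketch below (our own helper, hypothetical name bias_precision) computes the four measures compared
in Exhibit 29 from a set of hold-out actuals and forecasts.

import numpy as np

def bias_precision(actuals, forecasts):
    a = np.asarray(actuals, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    e = a - f                              # forecast errors
    pe = 100.0 * e / a                     # percentage errors (assumes nonzero actuals)
    return {
        "ME": e.mean(),                    # bias measure, distorted by outliers
        "MdE": np.median(e),               # outlier-resistant bias measure
        "MAPE": np.abs(pe).mean(),         # precision measure, distorted by outliers
        "MdAPE": np.median(np.abs(pe)),    # outlier-resistant precision measure
    }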
This leads to a standard forecast evaluation procedure:
1. Calculate the ME, MPE, MAPE, MAE and the RMSE. Contrast these measures of accuracy.
2. Recommend and calculate a resistant counterpart for each of these measures.
3. Create a Naïve_12 (12-period-ahead) forecast for a 12-month holdout sample, by starting with
January forecast = January (previous year) actual, February forecast = February (previous
year) actual, and so on. Construct a relative error measure, like the RAE, comparing the forecasts
with the Naïve_12 forecasts (see the sketch after this list). Can you do a better job with a model
than with the Naïve_12 method?
4. On a month-to-month basis these accuracy measures might not be meaningful to management
(too large, perhaps), so look at the results on a quarterly basis. Aggregate the original data and
forecasts into quarterly periods (or buckets) and repeat (1) and (2), using a Naïve_4 as the
benchmark forecast to beat.
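As referenced in step 3 of the procedure, the sketch below (assumed names naive_12 and
relative_absolute_error) builds the Naïve_12 benchmark and one common form of the RAE, the ratio of
the model's mean absolute error to the benchmark's over the hold-out sample.

import numpy as np

def naive_12(history, horizon=12):
    """Seasonal naive benchmark: each future month repeats the actual from 12 months earlier.
    Assumes monthly data with len(history) >= 12 and horizon <= 12."""
    return [history[len(history) - 12 + h] for h in range(horizon)]

def relative_absolute_error(actuals, model_forecasts, benchmark_forecasts):
    a = np.asarray(actuals, dtype=float)
    model_mae = np.mean(np.abs(a - np.asarray(model_forecasts, dtype=float)))
    bench_mae = np.mean(np.abs(a - np.asarray(benchmark_forecasts, dtype=float)))
    return model_mae / bench_mae           # values below 1 mean the model beats the benchmark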
Final Takeaways
One goal of an effective demand forecasting strategy is to improve forecast accuracy through the use of
forecasting performance measurement. Forecasters use accuracy measures to evaluate the relative
performance of forecasting models and assess alternative forecast scenarios. Accuracy measurement
provides management with a tool to establish credibility for the way the forecast was derived and to provide
incentives to enhance professionalism in the organization.
• It is always good practice to report accuracy measures in pairs to insure against misleading
results. When the pair is similar (in a practical sense), report the traditional result. Otherwise, go
back and investigate the data going into the calculation, using domain information whenever
possible.
• We recommend the use of relative error measures to see how much improvement is gained by
choosing a particular forecasting model over some benchmark or "naive" forecasting technique.
Relative error measures also provide a reliable way of ranking forecasting models on the basis of
their usual accuracy over sets of time series.
• Test a forecasting model with holdout samples by splitting the historical data into two parts and
using the first segment as the fit period. The forecasting model is then used to make forecasts for a
number of additional periods (the forecast horizon). Because actual values are withheld in the
holdout sample, forecast accuracy can be assessed by comparing the forecasts against the known
figures. Not only do we see how well the forecasting method has fit the more distant past (the fit
period), but also how well it would have forecast the test period.
• Tracking signals are useful when large numbers of items must be monitored for accuracy, as is
often the case in inventory systems. When the tracking signal for an item exceeds the threshold
level, the forecaster's attention should be promptly drawn to the problem.
• The tracking of results is essential to ensure the continuing relevance of the forecast. By properly
tracking forecasts and assumptions, the forecaster can inform management when a forecast
revision is required; it should not be necessary for management to inform the forecaster that
something is wrong with the forecast. The techniques for tracking include ladder charts,
prediction-realization diagrams, prediction intervals, and tracking signals that identify nonrandom
error patterns. Through tracking, we can better understand the models, their capabilities, and the
uncertainty associated with the forecasts derived from them.
Contents
Taming Uncertainty: All You Need to Know About Measuring Forecast Accuracy, But Are Afraid to Ask
The Need for Measuring Forecast Accuracy
Analyzing Forecast Errors
Lack of Bias
What is an Acceptable Precision?
Ways to Evaluate Accuracy
The Fit Period versus the Test Period
Goodness-of-Fit versus Forecast Accuracy
Item-Level versus Aggregate Performance
Absolute Errors versus Squared Errors
Measures of Forecast Accuracy
Measures of Bias
Measures of Precision
Comparison with Naive Techniques
Relative Error Measures
Predictive Visualization Techniques
Ladder Charts
Prediction-Realization Diagram
What Are Prediction Intervals?
Prediction Intervals for Time Series Models
Prediction Intervals as Percentages
Using Prediction Intervals as Early Warning Signals
Trigg Tracking Signal
Spreadsheet Example: How to Apply a Tracking Signal
Quick and Dirty Control
The Tracking Signal in Action
Adapting the Tracking Signal to Other Spreadsheets
Mini Case: Accuracy Measurements of Transportation Forecasts
Final Takeaways
Download