Taming Uncertainty: All You Need to Know About Measuring Forecast Accuracy, But Are Afraid to Ask

"Television won't be able to hold on to any market it captures after the first six months. People will soon get tired of staring at a plywood box every night." A WELL-KNOWN MOVIE MOGUL IN THE EARLY DAYS OF TV.

This White Paper 08 describes
• Why it is necessary to define first what a forecast error is
• What bias and precision mean for accuracy measurement
• How, when and why to make accuracy measurements
• The systematic steps in a forecast evaluation process.

After reading this paper, you should be able to
• Understand the difference between fitting errors and forecast errors
• Recognize that there is no one best measure of accuracy for products, customers and hierarchies, industry forecasts and time horizons
• Realize that simple averaging is not a best practice for summarizing accuracy measurements
• Engage with potential users of demand forecasts to clearly define their forecast accuracy requirements

The Need for Measuring Forecast Accuracy

The Institutional Brokers' Estimate System (I/B/E/S), a service that tracks financial analysts' estimates, reported that forecasts of corporate yearly earnings made early in the year are persistently optimistic, that is to say, upwardly biased. In all but 2 of the 12 years studied (1979 - 1991), analysts revised their earnings forecasts downward by the end of the year. In a New York Times article (January 10, 1992), Jonathan Fuerbringer writes that this pattern of revisions was so clear that the I/B/E/S urged stock market investors to take early-in-the-year earnings forecasts with a grain of salt.

It is generally recognized in most organizations that accurate forecasts are essential in achieving measurable improvements in their operations. In demand forecasting and planning organizations, analysts use accuracy measures for evaluating their forecast models on an ongoing basis, and hence accuracy is something that needs to be measured at all levels of a product and customer forecasting database. Some of the questions you will need to ask include:
• Will the model prove reliable for forecasting units or revenues over a planned forecast horizon, such as item-level product demand for the next 12 weeks or aggregate revenue demand for the next 4 quarters?
• Will a forecast have a significant impact on marketing, sales, budgeting, logistics and production activities?
• Will a forecast have an effect on inventory investment or customer service?

In a nutshell, inaccurate forecasts can have a direct effect on setting inadequate safety stocks, ongoing capacity problems, massive rescheduling of manufacturing plans, chronic late shipments to customers, and adding expensive manufacturing flexibility resources (Exhibit 1). In addition, the Income Statement and Balance Sheet are also impacted by poor demand forecasts in financial planning (Exhibit 2).

Exhibit 1 Operational Impact of Poor Forecasts (Source: S. C. Conradie, Noeticconsulting 2013)
Exhibit 2 Financial Impact of Poor Forecasts (Source: S. C. Conradie, Noeticconsulting 2013)

Analyzing Forecast Errors

Whether a method that has provided a good fit to historical data will also yield accurate forecasts for future time periods is an unsettled issue. Intuition suggests that this may not necessarily be the case. There is no guarantee that past patterns will persist in future periods. For a forecasting technique to be useful, we must demonstrate that it can forecast reliably and with consistent accuracy. It is not sufficient to produce a model that performs well only in the historical (within-sample) fit period. One must also measure forecasting performance with multiple metrics, not just with a single, commonly used metric like the Mean Absolute Percentage Error (MAPE) or even a weighted MAPE (to be defined precisely later in this paper). At the same time, the users of a forecast need forecasting results on a timely basis. Using a forecasting technique and waiting one or more periods for history to unfold is not practical because our advice as forecasters will not be timely. Consequently, we advocate the use of holdout test periods for the analysis and evaluation of forecast errors on an ongoing basis.

Generally, the use of a single measure of accuracy involving simple averages is not a best practice, because you will misinterpret the results when the underlying distributions are not normal. Even with approximate normality, a simple average of numbers may lead to misleading uses. With minimal effort, you can complement a traditional measure with an outlier-resistant alternative to safeguard against the common presence of outliers or unusual values in accuracy measurements. It should always be good practice to report accuracy measures in pairs to insure against misleading results. When the pair are similar (in a practical sense), then report the traditional result. Otherwise, go back and investigate the data going into the calculation, using domain information whenever possible.

Two important aspects of forecast accuracy measurement are bias and precision. Bias is a problem of direction: Forecasts are typically too low (downward bias) or typically too high (upward bias). Precision is an issue of magnitudes: Forecast errors can be too large (in either direction) using a particular forecasting technique. Consider first a simple situation - forecasting a single product or item. The attributes that should satisfy most forecasters include lack of serious bias, acceptable precision, and superiority over naive models.

Lack of Bias

If forecasts are typically too low, we say that they are downwardly biased; if too high, they are upwardly biased. If overforecasts and underforecasts tend to cancel one another out (i.e., if an average of the forecast errors is approximately zero), we say that the forecasts are unbiased. Bias refers to the tendency of a forecast to be predominantly toward one side of the truth. Bias is a problem of direction. If we think of forecasting as aiming darts at a target, then a bias implies that the aim is off-center.
That is, the darts land repeatedly toward the same side of the target (Exhibit 3). In contrast, if forecasts are unbiased, they are evenly distributed around the target.

Exhibit 3 Biased, Unbiased and Precise Forecasts

What is an Acceptable Precision?

Imprecision is a problem if the forecast errors tend to be too large. Exhibit 3 shows three patterns that differ in terms of precision. The upper two forecasts are the less precise - as a group, they are farther from the target. If bias is thought of as bad aim, then imprecision is a lack of constancy or steadiness. The precise forecast is generally right around the target. Precision refers to the distance between the forecasts resulting from a particular forecasting technique and the corresponding actual values.

Exhibit 4 (a) Bar Charts and (b) Tables Showing Actuals (A), Forecasts and Forecast Errors for Three Forecasting Techniques

Exhibit 4 illustrates the measurement of bias and precision for three hypothetical forecasting techniques. In each case, the fit period is periods 1 - 20. Shown in the top row of Exhibit 4b are the actual values for the last four periods (21 - 24). The other three rows contain forecasts using forecasting techniques X, Y, and Z. These are shown in a bar chart in Exhibit 4a. What can we say about how good these forecast models are? On the graph, the three forecasts do not look all that different. But, what is a forecast error? It is not unusual to find different definitions and interpretations for concepts surrounding accuracy measurement among analysts, planners and managers in the same company.

Exhibit 4 records the deviations between the actuals and their forecasts. Each deviation represents a forecast error (or forecast miss) for the associated period, defined by:

Forecast error (E) = Actual (A) - Forecast (F)

If (F - A) is the preferred usage in some organizations but not others, then demand forecasters and planners should name it something else, like forecast variance, a more conventional meaning for revenue-oriented planners. The distinction is important because of the interpretation of bias in under- and over-forecasting situations. Consistency is paramount here. Contrast this with a fitting error (or residual) of a model over a fit period, which is:

Fitting error = Actual (A) - Model fit

Forecast error is a measure of forecast accuracy. By convention, it is defined as A - F. On the other hand, fitting error (or residual) is a measure of model adequacy: Actual - Model fit.

In Exhibit 4b, the forecast error shown for technique X is 1.8 in period 21. This represents the deviation between the actual value in forecast period 21 (= 79.6) and the forecast using technique X (= 77.8). In forecast period 22, the forecast using technique X was lower than the actual value for that period, resulting in a forecast error of 6.9. The period 24 forecast using technique Z was higher than that period's actual value; hence, the forecast error is negative (-3.4). When we overforecast, we must make a negative adjustment to reach the actual value. Note that if the forecast is less than the actual value, the miss is a positive number; if the forecast is more than the actual value, the miss is a negative number. In this example, it appears that Model X is underforecasting (positive forecast errors), hence it is biased.
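As a quick check on the sign convention just defined, here is a minimal Python sketch; only the period 21 actual and forecast for technique X come from the text, and the remaining numbers are hypothetical stand-ins for Exhibit 4b.

```python
# Minimal sketch of the sign convention E = A - F.
# Only 79.6 and 77.8 (period 21, technique X) are from the text; the rest is hypothetical.
actuals    = [79.6, 81.2, 84.5, 87.1]   # periods 21-24
forecast_x = [77.8, 74.3, 80.0, 83.9]   # technique X

errors = [a - f for a, f in zip(actuals, forecast_x)]
print(errors)                               # positive errors signal under-forecasts
n_under = sum(e > 0 for e in errors)
n_over  = sum(e < 0 for e in errors)
print(f"under-forecasts: {n_under}, over-forecasts: {n_over}")
```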
Model Z appears to have a forecast error (period 22) that should be marked for investigation; it is somewhat large relative to the rest of the numbers.

To identify patterns of upward- and downward-biased forecasts, we start by comparing the number of positive and negative misses. In Exhibit 4b, technique X under-forecasts in all four periods, indicative of a persistent (downward) bias. Technique Y underforecasts and overforecasts with equal frequency; therefore, it exhibits no evidence of bias in either direction. Technique Z is biased slightly toward overforecasting. As one measure of forecast accuracy (Exhibit 5), we calculate a percentage error PE = 100% * (A - F)/A.

Exhibit 5 (a) Bar Chart and (b) Table of the Forecast Error as Percentage Error (PE) between Actuals and Forecasts for Three Techniques

To reduce bias in a forecasting technique, we can either (1) reject any technique that projects with serious bias in favor of a less-biased alternative (after we have first compared the precision and complexity of the methods under consideration) or (2) investigate the pattern of bias in the hope of devising a bias adjustment; for example, we might take the forecasts from method X and adjust them upward to try to offset the tendency of this method to underforecast. Also, forecasts for techniques Y and Z could be averaged after placing forecast X aside for the current forecasting cycle.

Ways to Evaluate Accuracy

A number of forecasting competitions were conducted by the International Institute of Forecasters (IIF; www.forecasters.org) to assess the effectiveness of statistical forecasting techniques and determine which techniques are among the best. Starting with the original M competition in 1982, Spyros Makridakis and his co-authors reported results in the Journal of Forecasting, in which they compared the accuracy of about 20 forecasting techniques across a sample of 111 time series - a very small dataset by today's standards. A subset of the methods was tested on 1001 time series, still not very large, but substantial enough for the computer processing power at the time. The last 12 months of each series were held out and the remaining data were used for model fitting. Using a range of measures on a holdout sample, the International Journal of Forecasting (IJF) conducted a competition in 1997 comparing a range of forecasting techniques across a sample of 3003 time series. Known as the M3 competition, these data and results can be found at the website www.maths.monash.edu.au/∼hyndman/forecasting/. The IJF published a number of other papers evaluating the results of these competitions. These competitions have become the basis for how we should measure forecast accuracy in practice. Forecast accuracy measurements are performed in order to assess the accuracy of a forecasting technique.

The Fit Period versus the Test Period

In measuring forecast accuracy, a portion of the historical data is withheld and reserved for evaluating forecast accuracy. Thus, the historical data are first divided into two parts: an initial segment (the fit period) and a later segment (the holdout sample). The fit period is used to develop a forecasting model. Sometimes, the fit period is called the within-sample training, initialization, or calibration period. Next, using a particular model for the fit period, model forecasts are made for the later segment.
Finally, the accuracy of these forecasts is determined by comparing the projected values with the data in the holdout sample. The time period over which forecast accuracy is evaluated is called the test period, validation period, or holdout-sample period. A forecast accuracy test should be performed with data from a holdout sample.

When implementing software or spreadsheet code, we can minimize confusion and misinterpretation if a common notation for the concepts in these calculations is used. Let Yt(1) denote the one-period-ahead forecast made at time t and Yt(m) the m-period-ahead forecast made at time t. For example, a forecast for t = 25 that is the one-period-ahead forecast made at t = 24 is denoted by Y24(1), and a two-period-ahead forecast for the same time, t = 25, made at t = 23 is denoted by Y23(2). Generally, forecasters are interested in multi-period-ahead forecasts because lead times longer than one period are required for a business to act on a forecast.

We use the following conventions for a time series:
• {Yt, t = 1, 2, …, T} the historical dataset up to and including period t = T
• {Ŷt, t = 1, 2, …, T} the dataset of fitted values that result from fitting a model to the historical data
• YT+m the future value of Yt, m periods after t = T
• {YT(1), YT(2), …, YT(m)} the one- to m-period-ahead forecasts made from t = T

Consequently, we see that fit errors (residuals) and forecast errors refer to:
• Y1 - Ŷ1, Y2 - Ŷ2, …, YT - ŶT the fit errors or residuals from a fit
• YT+1 - YT(1), YT+2 - YT(2), …, YT+m - YT(m) the forecast errors (not to be confused with the residuals from a fit)

In dealing with forecast accuracy, it is important to distinguish between forecast errors and fitting errors.

Goodness-of-Fit versus Forecast Accuracy

We need to assess forecast accuracy rather than just calculate overall goodness-of-fit statistics for the following reasons:
• Goodness-of-fit statistics may appear to give better results than forecasting-based calculations, but goodness-of-fit statistics measure model adequacy over a fitting period that may not be representative of the forecasting period.
• When a model is fit, it is designed to reproduce the historical patterns as closely as possible. This may create complexities in the model that capture insignificant patterns in the historical data, which may lead to over-fitting.
• By adding complexity, we may not realize that insignificant patterns in the past are unlikely to persist into the future. More important, the more subtle patterns of the future are unlikely to have revealed themselves in the past.
• Exponential smoothing models are based on updating procedures in which each forecast is made from smoothed values in the immediate past. For these models, goodness-of-fit is measured from forecast errors made in estimating the next time period ahead from the current time period. These are called one-period-ahead forecast errors (also called one-step-ahead forecast errors). Because it is reasonable to expect that errors in forecasting the more distant future will be larger than those made in forecasting the next period into the future, we should avoid accuracy assessments based exclusively on one-period-ahead errors.

When assessing forecast accuracy, we may want to know about likely errors in forecasting more than one period ahead.
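To make the notation above concrete, here is a minimal Python sketch of the fit/holdout split and of the distinction between residuals and forecast errors. The series values, the holdout length, and the NAIVE_1-style projection used as the "model" are hypothetical illustrations, not the paper's own data.

```python
# Minimal sketch: fit/holdout split, residuals vs. forecast errors.
# The data and the naive "model" are hypothetical illustrations only.
import numpy as np

y = np.array([112., 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
              115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140])

h = 6                                   # holdout length (hypothetical)
fit, holdout = y[:-h], y[-h:]           # fit period and test period

# "Model": NAIVE_1, i.e., each fitted value is the previous observation.
fitted = np.r_[fit[0], fit[:-1]]        # Ŷ1..ŶT (Ŷ1 set to Y1 by convention)
residuals = fit - fitted                # fit errors (within-sample)

# m-period-ahead forecasts from origin T: NAIVE_1 repeats the last observed value.
forecasts = np.repeat(fit[-1], h)       # Y_T(1), ..., Y_T(h)
forecast_errors = holdout - forecasts   # out-of-sample forecast errors

print("mean residual:", residuals.mean())
print("mean forecast error:", forecast_errors.mean())
```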
Item-Level versus Aggregate Performance

Forecast evaluations are also useful in multi-series comparisons. Production and inventory managers typically need demand or shipment forecasts for hundreds to tens of thousands of items (SKUs) based on historical data for each item. Financial forecasters need to issue forecasts for dozens of budget categories in a strategic plan on the basis of past values of each source of revenue. In a multi-series comparison, the forecaster should appraise the method not only on its performance for the individual item but also on the basis of its overall accuracy when tested over various summaries of the data. How to do that most effectively will depend on the context of the forecasting problem and can, in general, not be determined a priori. In the next section, we discuss various measures of forecast accuracy.

Absolute Errors versus Squared Errors

Measures based on absolute errors and measures based on squared errors are both useful. There is a good argument for consistency, in the sense that a model's forecasting accuracy should be evaluated on the same basis used to develop (fit) the model. The standard basis for model fit is the least-squares criterion, that is, minimizing the mean squared error (MSE) between the actual and fitted values. To be consistent, we should then evaluate forecast accuracy based on squared-error measures, such as the root mean squared error (RMSE). It is useful to test how well different methods do in forecasting using a variety of accuracy measures. Forecasts are put to a wide variety of uses in any organization, and no single forecaster can dictate how they must be used and interpreted.

Sometimes costs or losses due to forecast errors are in direct proportion to the size of the error: double the error leads to double the cost. For example, when a soft drink distributor realized that the costs of shipping its product between distribution centers were becoming prohibitive, it made a study of the relationship between under-forecasting (not enough of the right product at the right place, thus requiring a transshipment, or backhaul, from another distribution center) and the cost of those backhauls. As shown in Exhibit 6, under-forecasts of 25% or higher appeared strongly related to an increase in the backhaul of pallets of product. In this case, measures based on absolute errors are more appropriate. In other cases, small forecast errors do not cause much harm and large errors may be devastating; then, we would want to stress the importance of avoiding large errors, which is what squared-error measures accomplish.

Exhibit 6 Backhauls in Pallets versus Forecast Error

Measures of Forecast Accuracy

Measures of Bias

There are two common measures of bias. The mean error (ME) is the sum of the forecast errors divided by the number of periods in the forecast horizon (h) for which forecasts were made:

ME = (1/h) ∑ (At - Ft), summed over t = T+1 to T+h
   = (1/h) ∑ (Ai - Fi), summed over i = 1 to h

The mean percentage error (MPE) is

MPE = (100/h) ∑ [(At - Ft) / At], summed over t = T+1 to T+h
    = (100/h) ∑ [(Ai - Fi) / Ai], summed over i = 1 to h

The MPE and ME are useful supplements to a count of the frequency of under- and over-forecasts. The ME gives the average of the forecast errors expressed in the units of measurement of the data; the MPE gives the average of the forecast errors in terms of percentage and is unit-free. If we examine the hypothetical example in Exhibits 4 and 5, we find that for technique X, ME = 3.8 and MPE = 4.2%. This means that, with technique X, we underforecast by an average of 3.8 units per period.
Restated in terms of the MPE, this means that the forecasts from technique X were below the actual values by an average of 4.2%. A positive value for the ME or MPE signals downward bias in the forecasts using technique X. The two most common measures of bias are the mean error (ME) and the mean percentage error (MPE).

Compared to technique X, the ME and MPE for technique Y are much smaller. These results are not surprising. Because technique Y resulted in an equal number of under-forecasts and over-forecasts, we expect to find that the average error is close to zero. On the other hand, the low ME and MPE values for technique Z are surprising. We have previously seen that technique Z over-forecasts in three of the four periods. A closer look at the errors from technique Z shows that there was a very large under-forecast in one period and relatively small over-forecasts in the other three periods. Thus, the one large under-forecast offset the three small over-forecasts, yielding a mean error close to 0. The lesson here is that averages can be distorted by one or just a few unusually large errors. One should consider calculating non-conventional measures like the Median Error (MdE) and the Median Percentage Error (MdPE). We should always use multiple error measures as a supplement to the analysis of the individual errors, as unusual values can readily lead to an incorrect interpretation.

Measures of Precision

Certain indicators of precision are based on the absolute values of the forecast errors. By taking an absolute value, we eliminate the possibility that under-forecasts and over-forecasts negate one another. Therefore, an average of the absolute forecast errors reveals simply how far apart the forecasts are from the actual values. It does not tell us if the forecasts are biased. The absolute errors are shown in Exhibit 7, and the absolute values of the percentage errors are shown in Exhibit 8.

Exhibit 7 (a) Bar Chart and (b) Table Showing Forecast Error as the Absolute Difference |E| between Actuals and Forecasts for Three Techniques

Exhibit 8 (a) Bar Chart and (b) Table Showing Forecast Error as the Absolute Percentage Difference |PE(%)| between Actuals and Forecasts for Three Techniques; |PE| = 100 * |A - F| / A

The most common averages of absolute values are the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the median absolute percentage error (MdAPE):

MAE = (1/h) ∑ |At - Ft|, summed over t = T+1 to T+h
    = (1/h) ∑ |Ai - Fi|, summed over i = 1 to h

MAPE = (100/h) ∑ |At - Ft| / At, summed over t = T+1 to T+h
     = (100/h) ∑ |Ai - Fi| / Ai, summed over i = 1 to h

MdAPE = median value of {100 * |Ai - Fi| / Ai, i = 1, …, h}

Interpretations of the averages of absolute error measures are straightforward. In Exhibit 7, we calculate that the MAE is 4.6 for technique Z, from which we can conclude that the forecast errors from this technique average 4.6 per period. The MAPE, from Exhibit 8, is 5.2%, which tells us that the period forecast errors average 5.2%. The MdAPE for technique Z, also from Exhibit 8, is approximately 4.1% (the average of the two middle values: 4.4 and 3.8). Thus, half the time the forecast errors exceeded 4.1%, and half the time they were smaller than 4.1%.
When there is a serious outlier among the forecast errors, as with technique Z, it is useful to calculate the MdAPE in addition to the MAPE because medians are less sensitive (more resistant) than mean values to distortion from outliers. This is why the MdAPE is a full percentage point below the MAPE for technique Z. Sometimes, as with technique Y, the MdAPE and the MAPE are virtually identical. In this case, we can report the MAPE because it is the far more common measure.

Exhibit 9 (a) Bar Chart and (b) Table Showing Forecast Error as the Squared Difference (E²) between Actuals and Forecasts for Three Techniques

Exhibit 10 (a) Bar Chart and (b) Table Showing Forecast Error as the Squared Percentage Difference (PE(%))² between Actuals and Forecasts for Three Techniques

In addition to the indicators of precision calculated from the absolute errors, certain measures are commonly used that are based on the squared values of the forecast errors (Exhibit 9), such as the MSE and RMSE. (RMSE is the square root of MSE.) A notable variant on RMSE is the root mean squared percentage error (RMSPE), which is based on the squared percentage errors (Exhibit 10).

MSE = (1/h) ∑ (At - Ft)², summed over t = T+1 to T+h
    = (1/h) ∑ (Ai - Fi)², summed over i = 1 to h

RMSPE = 100 * √[(1/h) ∑ {(At - Ft) / At}²], summed over t = T+1 to T+h
      = 100 * √[(1/h) ∑ {(Ai - Fi) / Ai}²], summed over i = 1 to h

To calculate the RMSPE, we square each percentage error in Exhibit 10. The squares are then averaged, and the square root is taken of the average. The MSE is expressed in the square of the units of the data. This may make it difficult to interpret when referring to dollar volumes, for example. By taking the square root of the MSE, we return to the units of measurement of the original data. We can simplify the interpretation of the RMSE and present it as if it were the mean absolute error (MAE): the forecasts of technique Y are in error by an average of 4. There is little real harm in doing this; just keep in mind that the RMSE is generally somewhat larger than the MAE (compare Exhibits 7 and 9).

Some forecasters present the RMSE within the context of a prediction interval. Here is how it might sound: "Assuming the forecast errors are normally distributed, we can be approximately 68% sure that technique Y's forecasts will be accurate to within 4.0 on the average." However, it is best not to make these statements too precise, because doing so requires us to assume a normal distribution of forecast errors, a rarity in demand forecasting practice. The RMSE can be interpreted as a standard error of the forecasts.

The RMSPE is the percentage version of the RMSE (just as the MAPE is the percentage version of the MAE). Because the RMSPE for technique Y equals 4.5%, we may suggest that "the forecasts of technique Y have a standard or average error of approximately 4.5%." If we desire a squared-error measure in percentage terms, the following shortcut is sometimes taken. Divide the RMSE by the mean value of the data over the forecast horizon. The result can be interpreted as the standard error as a percentage of the mean of the data and is called the coefficient of variation (because it is a ratio of a standard deviation to a mean). Based on technique Y, the average of the actual values over periods 21 - 24 was 88.2 (Exhibit 4b), and the RMSE for technique Y was found to be 4.0 (Exhibit 9). Hence, the coefficient of variation is 4/88.2 = 4.5%, a result that is virtually identical to the RMSPE in this case.
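The bias and precision measures above are simple to compute. Here is a minimal Python sketch, assuming the holdout actuals and forecasts are available as arrays; the function name is our own and the sample numbers are hypothetical.

```python
# Minimal sketch of the bias and precision measures defined above.
# The function name is ours; the sample numbers below are hypothetical.
import numpy as np

def accuracy_measures(actuals, forecasts):
    a, f = np.asarray(actuals, float), np.asarray(forecasts, float)
    e = a - f                      # forecast errors, E = A - F
    pe = 100.0 * e / a             # percentage errors
    return {
        "ME":    e.mean(),                      # bias, in units of the data
        "MPE":   pe.mean(),                     # bias, in percent
        "MdE":   np.median(e),                  # outlier-resistant bias
        "MAE":   np.abs(e).mean(),              # precision, in units
        "MAPE":  np.abs(pe).mean(),             # precision, in percent
        "MdAPE": np.median(np.abs(pe)),         # outlier-resistant precision
        "RMSE":  np.sqrt((e ** 2).mean()),      # squared-error measure
        "RMSPE": np.sqrt((pe ** 2).mean()),     # squared percentage errors
    }

# Hypothetical holdout actuals and forecasts (four periods):
print(accuracy_measures([79.6, 86.0, 92.1, 95.2], [77.8, 79.1, 94.0, 98.6]))
```

Reporting the mean-based and median-based members of each pair side by side follows the practice recommended earlier in this paper.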
Comparison with Naive Techniques

After reviewing the evidence of bias and precision, we may want to select technique Y as the preferred of the three candidates. Based on the evaluation, technique Y's forecasts for periods 21 - 24 show no indication of bias - two under-forecasts and two over-forecasts - and prove slightly more precise than the other two techniques, no matter which measure of precision is used. The MAE for technique Y reveals that the forecast errors over periods 21 - 24 average approximately 3.5 per period, and the MAPE indicates that these errors come to just under 4% on average.

To provide additional perspective on technique Y, we might contrast technique Y's forecasting record with that of a very simplistic procedure, one requiring little or no thought or effort. Such procedures are called naive techniques. The continued time and effort invested in technique Y might not be worth the cost if it can be shown to perform no better than a naive technique. Sometimes the contrast with a naive technique is sobering. The forecaster thinks an excellent model has been developed and then finds that it barely outperforms a naive technique. If the forecasting performance of technique Y is no better or is worse than that of a naive technique, it suggests that in developing technique Y we have not accomplished very much.

A standard naive forecasting technique has emerged for data that are yearly or otherwise lack seasonality; it is called NAIVE_1 or Naive Forecast 1 (NF1). We see later that there are also naive forecasting techniques for seasonal data. A NAIVE_1 forecast of the following time period is simply the value in effect during this time period. Alternatively stated, the forecast for any one time period is the value observed during the previous time period. NAIVE_1 is called a no-change forecasting technique, because its forecasted value is unchanged from the previously observed value.

Exhibit 11 Actuals and NAIVE_1 (One-Period-Ahead) Forecasts for Technique Y (data from Exhibit 4)

Exhibit 11 shows the NAIVE_1 forecasts for our example (Exhibit 4). Note that the one-period-ahead forecasts for periods 22 - 24 are the respective actuals for periods 21 - 23. The forecast for period 21 (= 73.5) is the actual value for period 20. When we appraise NAIVE_1's performance, we note it is biased. It underestimated periods 21 - 23 and overestimated period 24. By always projecting no change, NAIVE_1 will under-forecast whenever the data are increasing and over-forecast whenever the data are declining. We do not expect NAIVE_1 to work very well when the data are steadily trending in one direction.

The error measures for NAIVE_1 in Exhibit 11 are all higher than the corresponding error measures for technique Y, indicating that technique Y was more accurate than NAIVE_1. By looking at the relative error measures, we can be more specific about the advantage of technique Y over NAIVE_1. Because a naive technique has no prescription for a random error term, the NAIVE_1 technique can be viewed as a no-change, no-chance method (in the sense of our forecasting objectives) to predict change in the presence of uncertainty (chance). Hence, it is useful as a benchmark forecasting model to beat.

Relative Error Measures

A relative error measure is a ratio; the numerator is a measure of the precision of a forecasting method, and the denominator is the analogous measure for a naive technique. A minimal sketch of this comparison follows.
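Here is a hedged Python sketch of the comparison; the data are hypothetical, and the MAE-based ratio shown is the relative absolute error (RAE) discussed next.

```python
# Minimal sketch: NAIVE_1 benchmark and a relative error measure (hypothetical data).
import numpy as np

history    = np.array([70.1, 71.4, 72.8, 73.5])      # ...up to period 20 (hypothetical)
actuals    = np.array([79.6, 86.0, 92.1, 95.2])      # periods 21-24 (hypothetical)
model_fcst = np.array([77.8, 82.5, 94.0, 98.6])      # technique under study (hypothetical)

# NAIVE_1: each one-period-ahead forecast is the previously observed value.
naive_fcst = np.r_[history[-1], actuals[:-1]]

def mae(a, f):
    return np.mean(np.abs(a - f))

rae = mae(actuals, model_fcst) / mae(actuals, naive_fcst)   # relative absolute error
fc  = 1.0 - rae                                             # improvement score (forecast coefficient)
print(f"RAE = {rae:.2f}, FC = {fc:.2f}")   # RAE < 1 (FC > 0) means better than NAIVE_1
```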
Exhibit 12 Relative Error Measures for Technique Y Relative to NAIVE_1 (data from Exhibit 4)

Exhibit 12 provides a compendium of relative error measures that compare technique Y with NAIVE_1. The first entry (= 0.73) is the ratio of the MAE for technique Y to the MAE for NAIVE_1 (3.55/4.87 = 0.73). It is called the relative absolute error (RAE). The RAE shows that the average forecast error using technique Y is 73% of the average error using NAIVE_1. This implies a 27% improvement of technique Y over NAIVE_1. Expressed as an improvement score, the relative error measure is sometimes called a forecast coefficient (FC), as listed in the last column of Exhibit 12. When FC = 0, the technique in question was no better than the NAIVE_1 technique. When FC < 0, the technique was in fact less precise than NAIVE_1, whereas when FC > 0 the forecasting technique was more precise than NAIVE_1. When FC = 1, the forecasting technique is perfect.

The second row of Exhibit 12 shows the relative error measures based on the MAPE. The results almost exactly mirror those based on the MAE, and the interpretations are identical. The relative error measure based on the RMSE (the RMSE of technique Y divided by the RMSE of NAIVE_1) was originally proposed in 1966 by the economist Henri Theil and is known as Theil's U or Theil's U2 statistic. As with the RAE, a score less than 1 suggests an improvement over the naive technique.

The relative errors based on the squared-error measures, RMSE and RMSPE, suggest that technique Y in Exhibit 12 was about 40% more precise than NAIVE_1. The relative errors based on the MAE and MAPE put the improvement at approximately 25 - 30%. Why the difference? Error measures based on absolute values, such as the MAE and MAPE, give a weight (emphasis) to each time period in proportion to the size of the forecast error. The RMSE and RMSPE, in contrast, are measures based on squared errors, and squaring gives more weight to large errors. A closer look at Exhibit 11 shows that NAIVE_1 commits a very large error in forecast period 22.

Predictive Visualization Techniques

Predictive visualization techniques for forecast accuracy measurement provide a perspective on how serious a forecasting bias may be. If the bias exceeds a designated threshold, a "red flag" is raised concerning the forecasting approach. The forecaster should then reexamine the data to identify changes in the trend, seasonal, or other business patterns, which in turn suggest some adjustment to the forecasting approach.

Ladder Charts

By plotting the current year's monthly forecasts on a ladder chart, a forecaster can determine whether the seasonal pattern in the forecast looks reasonable. The ladder chart in Exhibit 13 consists of six items of information for each month of the year: the average over the past 5 years, the past year's performance, the 5-year low, the 5-year high, the current year-to-date, and the monthly forecasts for the remainder of the year. The 5-year average line usually provides the best indication of the seasonal pattern, assuming this pattern is not changing significantly over time. In fact, this is a good check for reasonableness that can be done before submitting the forecast for approval. A ladder chart is a simple yet powerful tool for monitoring forecast results. The level of the forecast can be checked for reasonableness relative to the prior year (dashed line in Exhibit 13). The forecaster can determine whether or not the actuals are consistently overrunning or underrunning the forecast.
In this example, the forecast errors are positive for 3 months and negative for 3 months. The greatest differences between the actual and forecast values occur in March and April, but here the deviations are of opposite signs; additional research should be done to uncover the cause of the unusual March-April pattern. The forecasts for the remainder of the current year look reasonable, although some minor adjustments might be made. The ladder chart is one of the best predictive visualization tools for quickly identifying the need for major changes in forecasts.

Exhibit 13 Ladder Chart

Prediction-Realization Diagram

Another useful visual approach to monitoring forecast accuracy is the prediction-realization diagram, introduced by the economist Henri Theil in 1958. If the predicted values are indicated on the vertical axis and the actual values on the horizontal axis, a straight line with a 45-degree slope represents perfect forecasts. This is called the line of perfect forecasts (Exhibit 14). In practice, the prediction-realization diagram is sometimes rotated so that the line of perfect forecasts appears horizontal.

The prediction-realization diagram indicates how well a model or forecaster has predicted turning points and also how well the magnitude of change has been predicted, given that the proper direction of change has been forecast. The diagram has six sections. Points falling in sections II and V are a result of turning-point errors. In section V, a positive change was predicted, but the actual change was negative. In section II, a negative change was predicted, but a positive change occurred. The remaining sections involve predictions that were correct in sign but wrong in magnitude. Points above the line of perfect forecasts reflect actual changes that were less than predicted. Points below the line of perfect forecasts represent actual changes that were greater than predicted.

Exhibit 14 Prediction-Realization Diagram

The prediction-realization diagram can be used to record forecast results on an ongoing basis. Persistent overruns or underruns indicate the need to adjust the forecasts or to reestimate the model. In this case, a simple error pattern is evident and we can raise or lower the forecast based on the pattern and magnitude of the errors. More important, the diagram indicates turning-point errors that may be due to misspecification or missing variables in the model. The forecaster may well be at a loss to decide how to modify the model forecasts. An analysis of other factors that occurred when the turning-point error was realized may result in the inclusion of a variable in the model that was missing from the initial specification.

Exhibit 15 depicts two prediction-realization diagrams for performance results from two forecasters in their respective sales regions of a motorcycle manufacturer. By plotting the percentage changes in the monthly forecasts versus the percentage changes in the monthly actuals, you can see the effect of turning points on performance. In the first region, six of the forecasts are close to the line of perfect forecasts (the 45-degree line), indicating a good job. The explanation of the other two forecasts suggests that the June point can be attributed to a push to make up for lost retail sales in the first five months because of late production. In the second region, the forecasts for all eight months showed an inability to detect a continuing decline as well as some excessive optimism. Interestingly, on the basis of the MAPE, the worse-performing region has the lower MAPE.
On balance, it should be noted that the growth patterns are also quite different in the two regions, which may need to be taken into consideration.

Exhibit 15 Prediction-Realization Diagrams for Two Regions of a Motorcycle Manufacturer: Region A (top), Region B (bottom)

What Are Prediction Intervals?

A major shortcoming in the demand forecaster's practice of forecasting techniques is the absence of error bounds on forecasts. By using the State Space Forecasting Methodology for exponential smoothing and ARIMA modeling, the demand forecaster can create prediction limits that reflect more realistic, unsymmetrical error bounds on forecasts, as well as the conventional, symmetrical bounds. Why is this important? A simple example will make the point.

Prediction Intervals for Time Series Models

When no error distribution can be assumed for a method (the assumption that would turn the method into a model), as with moving averages, we cannot construct model-based prediction intervals on forecasts to measure uncertainty. However, empirical prediction intervals can be constructed using an approach described by Goodman and Williams in a 1971 article entitled "A simple method for the construction of empirical confidence limits for economic forecasts" in the Journal of the American Statistical Association. With the Goodman/Williams procedure, we go through the ordered data, making forecasts at each time point. Then, the comparisons of these forecasts with the actuals (which are known) yield an empirical distribution of forecast errors. If the future errors are assumed to be distributed like the empirical errors, then the empirical distribution of these observed errors can be used to set prediction intervals for subsequent forecasts. In some practical cases, the theoretical size and the empirical size of the intervals have been found to agree closely.

Prediction intervals are used as a way of expressing the uncertainty in the future values that are derived from models, and they play a key role in tracking forecasts from these models. In regression models, it is generally assumed that random errors are additive in the model equation and have a normal distribution with zero mean and constant variance. The variance of the errors can be estimated from the time series data. The estimated standard deviation (square root of the variance) is calculated for the forecast period and is used to develop the desired prediction interval. Although 95% prediction intervals are frequently used, the range of forecast values for volatile series may be so great that it might also be useful to show the limits for a lower level of probability, say 80%. This would narrow the interval.

It is common to express prediction intervals about the forecast, but in the tracking process, it is more useful to deal with prediction intervals for the forecast errors (Error = Actual - Forecast) on a period-by-period or cumulative basis. A forecaster often deals with aggregated data and would like to know the prediction intervals about an annual forecast based on a model for monthly or quarterly data. The annual forecast is created by summing twelve (independent) monthly (or four quarterly) predictions. Developing the appropriate probability limits requires that the variance for the sum of the prediction errors be calculated.
This can be determined from the variance formula

Var(∑ ei) = ∑ Var(ei) + 2 ∑ Cov(ei, ej)

If the forecast errors have zero covariance (Cov) - in particular, if they are independent of one another - the variance of the sum equals the sum of the prediction variances. The most common form of correlation in the forecast errors for time series models is positive autocorrelation. In this case, the covariance term is positive and the prediction intervals derived would be too small. In the unusual case of negative covariances, the prediction intervals would be too wide. Rather than deal with this complexity, most software programs assume the covariance is small and inconsequential. In typical forecasting situations, therefore, the reported prediction intervals are probably too narrow.

Prediction Intervals as Percentages

Many users of forecasts find the need for prediction intervals obtuse. In such cases, the forecaster may find more acceptance of this technique if the prediction intervals for forecasts are expressed in terms of percentages. For example, a 95% probability interval for a forecast might be interpreted to mean that we are 95% sure that the forecast will be within ± 15% (say) of the actual numerical value. The percentage miss associated with any prediction interval can be calculated from the formula:

Percent miss = [(Estimated standard error of the forecast) x (t factor) x 100%] / (Forecast)

where t is the tabulated value of the Student t distribution for the appropriate degrees of freedom and confidence level. This calculation is based on the assumption of normality of forecast errors and provides an estimate of the percent miss for any particular period. Values can be calculated for all periods (e.g., for each month of the year), resulting in a band of prediction intervals spanning the year.

It could also be useful to determine a prediction interval for a cumulative forecast error. Under the assumption that forecast errors are independent and random, the cumulative percentage miss will be smaller than the percentage miss for an individual period because the positive and negative errors will cancel to some extent. As an approximate rule, the cumulative annual percentage miss is calculated by dividing the average monthly percentage miss by √12 (the square root of the number of periods in the forecast), or the average quarterly percentage miss by √4 (= 2).

Using Prediction Intervals as Early Warning Signals

One of the simplest tracking signals is based on the ratio of the sum of the forecast errors to the mean absolute error (MAE). It is called the cumulative sum tracking signal (CUSUM). For certain CUSUM measures, a threshold or upper limit of |0.5| suggests that the forecast errors are no longer randomly distributed about 0 but rather are congregating too much on one side (i.e., forming a biased pattern). More sophisticated tracking signals involve taking weighted or smoothed averages of forecast errors. The best-known standard tracking signal, due to Trigg, dates back to 1964 and will be described shortly. Tracking signals are especially useful when forecasting a large number of products at a time, as is the case in spare parts inventory management systems. A tracking signal is a ratio of a measure of bias to a companion measure of precision.
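Before the tracking signals are described in detail, here is a minimal Python sketch of the percentage-miss calculation and the approximate square-root rule for the cumulative miss; the standard error, forecast level, t factor, and 12-month horizon are hypothetical illustrations.

```python
# Minimal sketch: percent miss for a single period and an approximate cumulative miss.
# The standard error, forecast, t factor, and 12-month horizon are hypothetical.
from math import sqrt

def percent_miss(std_error, forecast, t_factor):
    """Percent miss = (standard error of forecast * t factor * 100%) / forecast."""
    return 100.0 * std_error * t_factor / forecast

monthly_miss = percent_miss(std_error=45.0, forecast=1200.0, t_factor=2.0)  # ~95% level
# Approximate rule: if the 12 monthly errors are independent, the cumulative
# (annual) percentage miss is roughly the average monthly miss divided by sqrt(12).
annual_miss = monthly_miss / sqrt(12)

print(f"monthly percent miss ~ {monthly_miss:.1f}%, annual percent miss ~ {annual_miss:.1f}%")
```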
Warning signals can be seen as predictive visualizations by plotting the forecast errors over time together with the associated prediction intervals and seeing whether the forecast errors continually lie above or below the zero line. Even though the individual forecast errors may well lie within the appropriate prediction interval for the period (say, monthly), a plot of the cumulative sum of the errors may indicate that their sum lies outside its prediction interval. An early warning signal is a succession of overruns and underruns.

A type of warning signal is evident in Exhibits 16 and 17. It can be seen that the monthly forecast errors lie well within the 95% prediction interval for 9 of the 12 months, with two of the three exceptions occurring in November and December. Exhibit 16 suggests that the individual forecast errors lie within their respective prediction intervals. However, it is apparent that none of the forecast errors are negative; certainly, they do not form a random pattern around the zero line. Hence, there appears to be a bias. To determine whether the bias in the forecast is significant, we review the prediction intervals for cumulative sums of forecast errors.

Exhibit 16 Monthly Forecast Errors and Associated Prediction Intervals for a Time Series Model

The cumulative prediction interval (Exhibit 17) confirms the problem with the forecast bias. The cumulative forecast errors fall outside the prediction interval for all twelve periods. This model is clearly under-forecasting. Either the model has failed to capture a strong cyclical pattern during the year, or the data are growing rapidly and the forecaster has failed to make a proper specification in the model. Using these two plots, the forecaster would probably be inclined to make upward revisions in the forecast after several months. It certainly would not be necessary to wait until November to recognize the problem.

Exhibit 17 Cumulative Forecast Errors and Associated Cumulative Prediction Intervals for a Time Series Model

Another kind of warning signal occurs when too many forecast errors fall outside the prediction intervals. For example, with a 90% prediction interval, we expect only 10% (approximately 1 month per year) of the forecast errors to fall outside the prediction interval. Exhibit 18 shows a plot of the monthly forecast errors for a time series model. In this case, five of the twelve errors lie outside the 95% interval. Clearly, this model is unacceptable as a predictor of monthly values. However, the monthly error pattern may appear random, and the annual forecast (the cumulative sum of twelve months) might be acceptable. Exhibit 19 shows the cumulative forecast errors and the cumulative prediction intervals. It appears that the annual forecast lies within the 95% prediction interval and is acceptable.

Exhibit 18 Monthly Forecast Errors and Prediction Intervals for a Time Series Model

Exhibit 19 Cumulative Forecast Errors and Cumulative Prediction Intervals for a Time Series Model

The conclusion that may be reached from monitoring the two sets of forecast errors is that neither model is wholly acceptable, and that neither can be rejected. In one case, the monthly forecasts were good but the annual forecast was not. In the other case, the monthly forecasts were not good, but the annual forecast turned out to be acceptable. Whether the forecaster retains either model depends on the purpose for which the model was constructed.
It is important to note that, by monitoring cumulative forecast errors, the forecaster was able to determine the need for a forecast revision more quickly than if the forecaster were only monitoring monthly forecasts. This is the kind of early warning that management should expect from the forecaster.

Trigg Tracking Signal

The Trigg tracking signal, proposed by D. W. Trigg in 1964, indicates the presence of nonrandom forecast errors; it is a ratio of two smoothed errors, Et and Mt. The numerator Et is a simple exponential smooth of the forecast errors et, and the denominator Mt is a simple exponential smooth of the absolute values of the forecast errors. Thus,

Tt = Et / Mt
Et = α et + (1 - α) Et-1
Mt = α |et| + (1 - α) Mt-1

where et = Yt - Ft is the difference between the observed value Yt and the forecast Ft. Trigg shows that when Tt exceeds 0.51 for α = 0.1, or 0.74 for α = 0.2, the forecast errors are nonrandom at the 95% probability level.

Exhibit 20 shows a sample calculation of the Trigg tracking signal for an adaptive smoothing model of seasonally adjusted airline passenger data. The tracking signal correctly provides a warning at period 15, after five consecutive periods in which the actual exceeded the forecast. Period 11 has the largest forecast error, but no warning is provided because the sign of the forecast error became reversed. It is apparent that the model forecast errors can increase substantially above prior experience without a warning being signaled, as long as the forecast errors change sign. Once a pattern of over- or under-forecasting is evident, a warning is issued.

Exhibit 20 Trigg's Tracking Signal (α = 0.1) for an Adaptive Smoothing Model of Seasonally Adjusted Airline Data
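A minimal Python sketch of the Trigg signal as defined above; the forecast errors and the starting values for the smoothed terms are hypothetical, and in practice E0 and M0 would be initialized from early errors rather than chosen arbitrarily.

```python
# Minimal sketch of the Trigg tracking signal T_t = E_t / M_t (hypothetical errors).
def trigg_signal(errors, alpha=0.1, e0=0.0, m0=1.0):
    """Return the smoothed-error ratio T_t for each period (e0, m0 are arbitrary starts)."""
    e_s, m_s, signal = e0, m0, []
    for e in errors:
        e_s = alpha * e + (1 - alpha) * e_s          # smoothed error
        m_s = alpha * abs(e) + (1 - alpha) * m_s     # smoothed absolute error
        signal.append(e_s / m_s)
    return signal

errors = [2.1, -1.4, 0.8, 3.0, 2.6, 3.4, 2.9, 3.8]   # hypothetical forecast errors
for t, s in enumerate(trigg_signal(errors), start=1):
    flag = "WARNING" if abs(s) > 0.51 else ""        # 0.51 threshold for alpha = 0.1
    print(t, round(s, 2), flag)
```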
Spreadsheet Example: How to Apply a Tracking Signal

In these White Papers we discuss a variety of forecasting models, from simple exponential smoothing to multiple linear regression. Once we start using these models on an ongoing basis, we cannot expect them to supply good forecasts indefinitely. The patterns of our data may change, and we will have to adjust our forecasting model or perhaps select a new model better suited to the data. For example, suppose that we are using simple exponential smoothing, a model that assumes that the data fluctuate about a constant level. Now assume that a trend develops. We have a problem because the level exponential smoothing forecasts will not keep up with a trend and we will be consistently under-forecasting. How do we detect this problem? One answer is to periodically test the data for trends and other patterns. Another is to examine a graph of the forecasts each time we add new data to our model. Both methods quickly become cumbersome, especially when we forecast a large number of product SKUs by location. Visually checking all these data after every forecast cycle is out of the question!

Quick and Dirty Control

The tracking signal, a simple quality control method, is just the tool for the job. When a forecast for a given time period is too large, the forecast error has a negative sign; when a forecast is too small, the forecast error is positive. Ideally, the forecasts should vary around the actual data and the sum of the forecast errors should be near zero. But if the sum of the forecast errors departs from zero in either direction, the forecasting model may be out of control. Just how large should the sum of the forecast errors grow before we act?

To answer this, we need a basis for measuring the relative dispersion, or scatter, of the forecast errors. If forecast errors are widely scattered, a relatively large sum of errors is not unusual. A standard measure of variability such as the standard deviation could be used as a measure of dispersion, but more commonly the mean absolute error (MAE) is used to define the tracking signal:

Tracking signal = (Sum of forecast errors) / (MAE)

Typically, the control limit on the tracking signal is set at ±4.0. If the signal goes outside this range, we should investigate. Otherwise, we can let our forecasting model run unattended. There is a very small probability, less than 1%, that the signal will go outside the range ±4.0 due to chance.

The Tracking Signal in Action

Exhibit 21 is a graph of actual sales and the forecasts produced by simple exponential smoothing. For the first 6 months, the forecast errors are relatively small. Then, a strong trend begins in month 7. Fortunately, as shown in Exhibit 22, the tracking signal sounds an alarm immediately.

Exhibit 21 Monthly Forecasts and Sales for a Time Series Model

Exhibit 22 Monthly Forecast Errors and Tracking Signal

Column F in Exhibit 23 contains the mean absolute error. This is updated each month to keep pace with changes in the errors. Column G contains the running sum of the forecast errors, and column H holds the value of the tracking signal ratio. A formula in column I displays the label ALARM if the signal is outside the range ±4.0. We can now take a step-by-step approach to building the spreadsheet. For those unfamiliar with simple exponential smoothing (download free White Paper 07 at http://cpdftraining.org/PDFgallery2.htm), the forecasting formulas are explained as they are entered.

Exhibit 23 Actuals, Forecasts, and Forecast Tracking Signal

Step 1: Set up the spreadsheet. Enter the labels in rows 1 - 5. In cell B2, enter a smoothing weight of 0.1. In cell F2, enter a control limit of 4.0. Number the months in column A and enter the actual sales in column B.

Step 2: Compute the monthly forecasts and forecast errors. To get the smoothing process started, we must specify the first forecast in cell C7. A good rule of thumb is to set the first forecast equal to the first actual data value. Therefore, enter +B7 in cell C7. In cell D7, enter the formula +B7-C7, which calculates the forecast error. In cell C8, the first updating of the forecast occurs: enter the formula +C7+$B$2*D7 in cell C8. This formula states that the forecast for month 2 is equal to the forecast for month 1 plus the smoothing weight times the error in month 1. Copy the updating formula in cell C8 to the range C9 to C19. Copy the error-calculating formula in cell D7 to the range D8 to D18. Because the actual sales figure is not yet available for month 13, cells B19 and D19 are blank.

Exponential smoothing works by adjusting each new forecast to offset part of the most recent error, working much like a thermostat, an automatic pilot, or a cruise control on an automobile. Examine the way the forecasts change with the errors. The month 1 error is zero, so the month 2 forecast is unchanged. The error is positive in month 2, so the month 3 forecast goes up. In month 3 the error is negative, so the month 4 forecast goes down. The amount of adjustment each month is controlled by the weight in cell B2. A small weight avoids the mistake of overreacting to purely random changes in the data.
But a small weight also means that the forecasts are slow to react to true changes. A large weight has just the opposite effect - the forecasts react quickly to both randomness and true changes. Some experimentation is always necessary to find the best weight for the data.

Step 3: Compute the mean absolute error (MAE). Column E converts the actual errors to absolute values. In cell E7, enter the absolute value of D7, and then copy this cell to the range E8 to E18. The mean absolute error, in column F, is updated via exponential smoothing, much like the forecasts. In cell F6, a starting value for the MAE must be specified. This must be representative of the errors, or the tracking signal can give spurious results. A good rule of thumb is to use the average of all available errors to start up the MAE. But here we will evaluate how the tracking signal performs without advance knowledge of the surge in sales in month 7. We use the average of the absolute errors for months 2 - 6 only as the initial MAE (month 1 should be excluded because the error is zero by design). In cell F6, enter the formula for the average of E8 to E12. To update the MAE, in cell F7 enter $B$2*E7+(1-$B$2)*F6. Copy this formula to the range F8 to F18. This formula updates the MAE using the same weight as the forecast. The result is really a weighted average of the MAE, which adapts to changes in the scatter of the errors. The new MAE each month is computed as 0.1 times the current absolute error plus 0.9 times the last MAE.

Step 4: Sum the forecast errors. This simple operation is considered a separate step because a mistake would be disastrous here. We must sum the actual errors, not the absolute errors. The only logical starting value for the sum is zero, which should be entered in cell G6. In cell G7, enter +D7+G6 and copy this to the range G8 to G18.

Step 5: Compute the tracking signal. The tracking signal value in column H is the ratio of the error sum to the MAE. In cell H6, enter +G6/F6 and copy this to the range H7 to H18.

Step 6: Enter the ALARM formulas. In cell I6, enter an IF statement of the form IF(ABS(H6)>$F$2,"ALARM","") and copy it to the range I7 to I18. This formula takes the absolute value of the tracking signal ratio. If that value is greater than the control limit, the label ALARM is displayed; otherwise, the cell is blank.

Adapting the Tracking Signal to Other Spreadsheets

The formulas in columns E through I in Exhibit 23 are not specific to simple exponential smoothing. They can be used in any other forecasting spreadsheet as long as the actual period-by-period forecast errors are listed in a column. If we use a forecasting method that does not include weights, such as linear regression, we will also have to assign a weight for updating the MAE - we simply use a weight of 0.1 in such applications. Unlike the weights used to update forecasts, experience suggests that there is not much to be gained by experimenting with the weight used in the MAE.

In large forecasting spreadsheets, macros can be used to operate the monitoring control system. For example, we could write a macro that checks the last cell in the tracking signal column of each data set. The macro could sound a beep and pause if a signal is out of range. Or we could write a macro that prints out a report showing the location of each offending signal for later investigation.

In Exhibit 23, the tracking signal gives an immediate warning when our forecasts are out of control. Do not expect this to happen every time.
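For those who prefer code to a spreadsheet, here is a minimal Python sketch of the same tracking-signal logic (sum of errors divided by a smoothed MAE, with a ±4.0 control limit); the sales figures are hypothetical stand-ins for the Exhibit 23 data, while the 0.1 weight, starting-MAE rule, and limit follow the text.

```python
# Minimal sketch of the spreadsheet's tracking signal in Python.
# Sales figures are hypothetical stand-ins for Exhibit 23; weight and limit follow the text.
def tracking_signal_report(actuals, weight=0.1, control_limit=4.0):
    forecast = actuals[0]                     # first forecast = first actual
    errors = []
    for a in actuals:
        errors.append(a - forecast)
        forecast = forecast + weight * (a - forecast)   # simple exponential smoothing update

    # Starting MAE: average absolute error for months 2-6 (month 1 error is zero by design).
    mae = sum(abs(e) for e in errors[1:6]) / 5
    cum_error = 0.0
    for month, e in enumerate(errors, start=1):
        cum_error += e
        mae = weight * abs(e) + (1 - weight) * mae      # smoothed MAE
        signal = cum_error / mae
        alarm = "ALARM" if abs(signal) > control_limit else ""
        print(f"month {month:2d}  error {e:7.1f}  signal {signal:6.2f}  {alarm}")

# Hypothetical sales with a surge beginning in month 7:
tracking_signal_report([100, 102, 98, 101, 99, 103, 115, 124, 131, 140, 148, 155])
```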
The tracking signal is by no means a perfect monitoring control device; only drastic changes in the data will cause immediate warnings, and more subtle changes may take some time to detect. Statistical theory tells us that the exact probability that the tracking signal will go outside the range ±4.0 depends on many factors, such as the type of data, the forecasting model in use, and the parameters of the forecasting model, such as the smoothing weights. Control limits of ±4.0 are a good starting point for tracking much business data. If experiments show that these limits miss important changes in the data, reduce them to 3.5 or 3.0. If the signal gives false alarms (cases in which the signal is outside the limits but there is no apparent problem), increase the limits to 4.5 or 5.0. We recommend adding a tracking signal to every forecasting model. Tracking signals save a great deal of work by letting us manage by exception.

Mini Case: Accuracy Measurements of Transportation Forecasts

The "Hauls" data are proprietary but real. The problem is to investigate the need for data quality adjustments prior to creating a useful forecasting model. Does it matter if a few data points look unusual? For trend/seasonal data, how does it impact seasonal factors, monthly bias and overall precision? We would like an assessment of the impact of a single outlier and of how we would measure forecast accuracy.

Preliminary Analysis. Based on the three years of data (36 monthly observations), the total variation can be decomposed into trend, seasonal and other components using the Excel Data Analysis Add-in (ANOVA without replication). Exhibit 24 shows Seasonality (44%), Trend (1%) and Other (55%). Trend/seasonal models are therefore expected to show a significant amount of uncertainty in the prediction limits. After an outlier adjustment (shown below), the total variation breaks down to Seasonality (47%), Trend (1%) and Other (52%). But how accurate might the forecasts be?

Exhibit 24 ANOVA Pie Chart for Hauls Data

Exploratory Forecast

1. We proceed by modeling the unadjusted data with a trend/seasonal exponential smoothing model and test first for within-sample outliers. The model is a State Space Model (N,M,M): Local Level (No Trend), Multiplicative Seasonality, Multiplicative Error. The latter means that the prediction limits are asymmetrical (cf. www.peerforecaster.com for a free download and license of the PEERForecaster.xla Excel Add-in).

2. In Exhibit 25, the model residuals are shown in Column 1. The Descriptive Statistics tool (Excel > Data Analysis Add-in) is used to determine the bounds for outlier detection in the residuals.

Outlier Detection and Correction

3. For outlier detection, the traditional method takes the mean plus or minus a multiple of the standard deviation. The traditional method is not outlier resistant, as both the mean and the standard deviation are distorted when the residuals contain one or more outliers. The outlier-resistant, nonconventional method is based on the calculations (a) 75th percentile + 1.5 x (interquartile distance) and (b) 25th percentile - 1.5 x (interquartile distance). (Both rules are sketched in code after this list.)

4. Both outlier detection methods lead to the same conclusion (in this case): the outlier in the residuals = 867. The outlier can be corrected (conservatively) by replacing the actual (= 5265) with the upper bound, or more commonly with the fitted model value for that period (= 4397.62).

5. Before making adjustments, outliers should always be reviewed for reasonableness with a domain expert familiar with the historical data. It is critical, however, to take data quality steps before modeling, as outlier effects can be confounded with the main effects, like seasonal factors, special events and promotions.
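Below is a minimal sketch of the two detection rules in item 3, assuming the model residuals are available as a plain list of numbers. The function name and the multiplier of 2 in the traditional rule are illustrative assumptions; the resistant fences follow the quartile plus-or-minus 1.5 times the interquartile distance rule described above.

    import statistics

    def outlier_bounds(residuals, k=2.0):
        # Traditional rule: mean plus or minus k standard deviations.
        # Not outlier resistant: both statistics are pulled toward any outlier.
        mean = statistics.mean(residuals)
        sd = statistics.stdev(residuals)
        traditional = (mean - k * sd, mean + k * sd)

        # Resistant rule: quartiles plus or minus 1.5 times the interquartile distance.
        q1, _, q3 = statistics.quantiles(residuals, n=4)
        iqr = q3 - q1
        resistant = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

        return traditional, resistant

Residuals falling outside either pair of bounds are candidates for review with a domain expert before any replacement value is chosen.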
Exhibit 25 Outlier Calculation with Traditional and Nonconventional Methods

Forecast Testing

6. We choose a hold-out sample of twelve months and create a 12-period-ahead forecast for evaluation. The outlier (October of the second year) could have a significant impact on exponential smoothing forecasts because it occurs just before the 12-month hold-out period. The first period in the data is a December.

7. Choosing the fitted value as the outlier replacement (Exhibit 26), we rerun the trend/seasonal exponential smoothing model with the outlier-adjusted data. The Hauls data are shown in the third column.

8. In Exhibit 27, we show a credible forecast (Change) with prediction limits as a measure of uncertainty (Chance). The trend/seasonal exponential smoothing model is a State Space Model (N,M,A): Local Level (No Trend), Multiplicative Seasonality, Additive Error. The latter means that the bounds for the prediction limits are symmetrical around the forecast.

Preliminary Forecast

9. Exhibit 28 shows a predictive visualization of a preliminary forecast with the outlier-adjusted Hauls data, using a State Space Model (Ad,M,A): Damped Local Trend, Multiplicative Seasonality, Additive Error. Again, the additive error means that the bounds for the prediction limits are symmetrical around the forecast.

10. The 12 actuals in the hold-out sample lie within the error bounds (not shown).

Exhibit 26 Outlier Replacement for October in 2nd Year: Actual = 5265, Adjusted Value = 4397.62

Exhibit 27 Predictive Visualization of Outlier-Adjusted Hauls Data with 12-Month Hold-out Period

Exhibit 28 Predictive Visualization of a Preliminary Forecast with Outlier-Adjusted Hauls Data

Model Evaluation

11. Exhibit 29 shows bias and precision measures for forecasts in the hold-out sample with and without the outlier adjustment to the Hauls data. The Mean Error (ME) and Mean Absolute Percentage Error (MAPE) are not outlier resistant, unlike their median counterparts. The ME is severely distorted by the October outlier, while the Median Error (MdE) gives a more reliable result for the bias. The effect of an outlier also appears to be greater on the MAPE than on the MdAPE.

12. In practice, both measures should always be calculated and compared. If the MAPE and MdAPE appear to be quite different (from an "impact" perspective), then investigate the underlying numbers for stray values. If comparable, then report the MAPE, as this measure is typically better known. Here a MAPE of 6% would be a credible measure of performance based on the comparison in this one case. (These paired calculations are sketched in code after the exhibits below.)

13. Exhibit 30 shows a comparison of seasonal factors from the analysis with and without the outlier adjustment. The impact is greatest for October, but there is also some effect in December and January.

Exhibit 29 Bias and Precision Measures for Forecasts in the Hold-Out Sample With and Without Outlier-Adjusted Hauls Data

Exhibit 30 Comparison of Multiplicative Seasonal Factors (With and Without Outlier-Adjusted Hauls Data)
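Referring back to items 11 and 12, the paired bias and precision measures can be computed from the hold-out actuals and forecasts in a few lines. The sketch below is an illustration rather than the calculation behind Exhibit 29; the function name is hypothetical, and the forecast error is assumed to be actual minus forecast, with percentage errors taken relative to the actuals.

    import statistics

    def holdout_accuracy(actuals, forecasts):
        # Forecast error = actual - forecast; percentage errors are taken
        # relative to the actuals (assumed nonzero).
        errors = [a - f for a, f in zip(actuals, forecasts)]
        abs_pct_errors = [100 * abs(e) / a for e, a in zip(errors, actuals)]
        return {
            "ME":    statistics.mean(errors),            # bias, not outlier resistant
            "MdE":   statistics.median(errors),          # resistant counterpart for bias
            "MAPE":  statistics.mean(abs_pct_errors),    # precision, not outlier resistant
            "MdAPE": statistics.median(abs_pct_errors),  # resistant counterpart for precision
        }

If the MAPE and MdAPE differ materially, inspect the underlying hold-out errors for stray values before reporting either figure; if they are comparable, report the better-known MAPE.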
We come up with a standard forecast evaluation procedure as follows:

1. Calculate the ME, MPE, MAPE, MAE and the RMSE. Contrast these measures of accuracy.

2. Recommend and calculate a resistant counterpart for each of these measures.

3. Create a Naïve_12 (12-period-ahead) forecast for a 12-month holdout sample, starting with January forecast = January (previous year) actual, February forecast = February (previous year) actual, and so on. Construct a relative error measure, like the RAE, comparing the model forecasts with the Naïve_12 forecasts. Can you do a better job with a model than with a Naïve_12 method?

4. On a month-to-month basis these accuracy measures might not be meaningful to management (too large, perhaps), so you may suggest looking at the results on a quarterly basis. Aggregate the original data and forecasts into quarterly periods (or buckets), and repeat steps (1) to (3) using a Naïve_4 as the benchmark forecast to beat.

Final Takeaways

One goal of an effective demand forecasting strategy is to improve forecast accuracy through the use of forecasting performance measurement. Forecasters use accuracy measures to evaluate the relative performance of forecasting models and to assess alternative forecast scenarios. Accuracy measurement provides management with a tool to establish credibility for the way the forecast was derived and to provide incentives to enhance professionalism in the organization.

• It should always be good practice to report accuracy measures in pairs to insure against misleading results. When the pair are similar (in a practical sense), then report the traditional result. Otherwise, go back and investigate the data going into the calculation, using domain information whenever possible.

• We recommend the use of relative error measures to see how much of an improvement is made in choosing a particular forecasting model over some benchmark or "naive" forecasting technique. Relative error measures also provide a reliable way of ranking forecasting models on the basis of their usual accuracy over sets of time series. (A code sketch of a naive benchmark comparison follows this list.)

• Test a forecasting model with holdout samples by splitting the historical data into two parts and using the first segment as the fit period. The forecasting model is then used to make forecasts for a number of additional periods (the forecast horizon). Because actual values are withheld in the holdout sample, one can assess forecast accuracy by comparing the forecasts against the known figures. Not only do we see how well the forecasting method has fit the more distant past (the fit period), but also how well it would have forecast the test period.

• Tracking signals are useful when large numbers of items must be monitored for accuracy, as is often the case in inventory systems. When the tracking signal for an item exceeds the threshold level, the forecaster's attention should be promptly drawn to the problem.

• The tracking of results is essential to ensure the continuing relevance of the forecast. By properly tracking forecasts and assumptions, the forecaster can inform management when a forecast revision is required; it should not be necessary for management to inform the forecaster that something is wrong with the forecast. The techniques for tracking include ladder charts, prediction-realization diagrams, prediction intervals, and tracking signals that identify nonrandom error patterns. Through tracking, we can better understand the models, their capabilities, and the uncertainty associated with the forecasts derived from them.
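As a closing illustration of the evaluation procedure and the takeaway on relative error measures, here is a minimal sketch of a Naïve_12 benchmark and one simple relative-error summary (a ratio of summed absolute errors). The function names are illustrative, and other summaries of relative error, such as period-by-period ratios of absolute errors, are equally legitimate.

    def naive_12(prior_year_actuals):
        # Naive_12 benchmark: each month's forecast equals the same month's
        # actual from the previous year.
        return list(prior_year_actuals)

    def relative_absolute_error(actuals, model_forecasts, benchmark_forecasts):
        # One simple summary: the ratio of summed absolute errors.
        # A value below 1 means the model beat the benchmark.
        model_sum = sum(abs(a - f) for a, f in zip(actuals, model_forecasts))
        bench_sum = sum(abs(a - f) for a, f in zip(actuals, benchmark_forecasts))
        return model_sum / bench_sum

For step (4) of the evaluation procedure, the same functions apply after the data and forecasts have been aggregated into quarterly buckets, with a Naïve_4 built from the prior year's quarterly actuals.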
Contents

Taming Uncertainty: All You Need to Know About Measuring Forecast Accuracy, But Are Afraid to Ask
The Need for Measuring Forecast Accuracy
Analyzing Forecast Errors
Lack of Bias
What is an Acceptable Precision?
Ways to Evaluate Accuracy
The Fit Period versus the Test Period
Goodness-of-Fit versus Forecast Accuracy
Item-Level versus Aggregate Performance
Absolute Errors versus Squared Errors
Measures of Forecast Accuracy
Measures of Bias
Measures of Precision
Comparison with Naive Techniques
Relative Error Measures
Predictive Visualization Techniques
Ladder Charts
Prediction-Realization Diagram
What Are Prediction Intervals?
Prediction Intervals for Time Series Models
Prediction Interval as Percentages
Using Prediction Intervals as Early Warning Signals
Trigg Tracking Signal
Spreadsheet Example: How to Apply a Tracking Signal
Quick and Dirty Control
The Tracking Signal in Action
Adapting the Tracking Signal to Other Spreadsheets
Mini Case: Accuracy Measurements of Transportation Forecasts
Final Takeaways