FORECAST ERRORS: BALANCING THE RISKS AND COSTS OF BEING WRONG

Qiang Xu¹ Hilke Kayser Lynn Holland

March 2007

¹ Chief Econometrician and Director of Research. Mailing Address: New York State Executive Department, Division of the Budget, Room 135 – State Capitol, Albany, NY 12224, USA. Email Address: bdxu@budget.state.ny.us. Phone: (518) 474-1766.

In practice, there is no such thing as a perfect forecast. Forecast errors can arise from various sources, including an incorrect model specification, errors in the data, incorrect assumptions regarding the future values of explanatory variables, and shocks or events that, by nature, cannot be predicted at the time the forecast is made. Thus, even under a correct model specification and correct assumptions, forecasts will differ from the actual values. Forecast errors are typically assumed to be drawn from a zero-mean process, such as white noise. Errors of that nature are the best one can hope for, since no model can presume to capture all of the factors that affect the variable under consideration. Though the model specification may be correct, the analyst typically works with sample data rather than population data, making parameter estimates subject to sampling error. However, when a model is solved to produce a forecast, the coefficient estimates are treated as fixed numbers when, in fact, they are themselves random variables. The forecaster can only hope to estimate the "true" model parameters within a statistically acceptable margin of error. For example, though the true parameter value may be 0.85, an estimate of 0.75 may be judged to be statistically significant. Indeed, any value between 0.75 and 0.85 might reasonably be expected to pass the test of statistical significance. But when either 0.75 or 0.80 is used instead of 0.85 to predict future values, the forecast outcome will be different.
In light of the many sources of risk, the forecaster must be prepared to make an assessment of the risks to the forecast and evaluate the costs associated with those risks. After performing such an assessment, the forecaster may want to implement a feedback mechanism from the risk assessment back to the forecast. If the risk that the forecast will be too high is assessed to be greater than the risk of being too low, then the analyst may want to lower the forecast in order to restore balance. For example, it is unlikely that an econometric model can adequately capture the impact of geopolitical turmoil on oil prices. Consequently, when there is a war going on in the Middle East, the probability that actual oil prices will rise above the model forecast may be greater than the probability that they will fall below it. In such cases, the analyst may not only want to make explicit the asymmetric nature of the risks, but may also feel justified in making an upward adjustment to the model forecast. Even when the forecast risks are balanced, the costs associated with forecast errors may not be. In many situations, the cost of an overestimate may outweigh the cost of an underestimate, and, in such cases, the analyst may feel justified in making a downward adjustment to the model forecast in order to balance the costs. In estimating budgetary revenues and spending, the cost of overestimating tax receipts may include the risk of a fiscal crisis, while no such risk is inherent in underestimation. These concerns lead to a discussion of the forecaster's "loss function" and an evaluation of the costs of being wrong.

Section 1 of this chapter introduces various measures of forecast error, including the notion of symmetric vs. asymmetric error distributions. Section 2 presents methods for assessing forecast risks (prediction intervals and density forecasts) and for presenting those risks to other interested parties.
These methods include Monte Carlo simulation and the construction of fan charts. For simplicity of exposition, Sections 1 and 2 abstract from the forecaster's loss function, implicitly assuming that the forecaster's loss is simply proportional to the absolute value of the error itself. Section 3 introduces more general forms for the forecaster's loss function and discusses the choice of an optimal forecast under a given loss function and a given distribution of risks. Section 4 discusses methods for choosing among a menu of forecasts given a particular loss function.

1. Measures of Forecast Error

There are a number of statistics that are commonly used to measure forecast error. Suppose $Y_t$ is an observed time series and one is interested in forecasting its future values $H$ periods ahead. Define $e_{t+h,t}$ as the time $t+h$ forecast error for a forecast made at time $t$, such that

$$e_{t+h,t} = Y_{t+h} - \hat{Y}_{t+h,t}$$

where $Y_{t+h}$ is the actual value of $Y$ at time $t+h$ and $\hat{Y}_{t+h,t}$ is the forecast of $Y_{t+h}$ made at time $t$. Similarly, we define the percentage error as

$$p_{t+h,t} = \frac{Y_{t+h} - \hat{Y}_{t+h,t}}{Y_{t+h}}.$$

In addition, there are various statistics that summarize the model's overall fit. For a given value of $h$, these include the mean error,

$$\text{Mean Error:} \quad ME = \frac{1}{T}\sum_{t=1}^{T} e_{t+h,t}$$

which can be interpreted as a measure of bias. An $ME$ greater than zero indicates that the model has a tendency to underestimate. All else being equal, the smaller the $ME$, the better the model. We can also define the error variance,

$$\text{Error Variance:} \quad EV = \frac{1}{T}\sum_{t=1}^{T} \left(e_{t+h,t} - ME\right)^2$$

which measures the dispersion of the forecast errors. Squaring the errors amplifies the penalty for large errors and does not permit positive and negative errors to cancel each other out. All else being equal, the smaller the $EV$, the better the model. Popular measures also include:

$$\text{Mean Squared Error:} \quad MSE = \frac{1}{T}\sum_{t=1}^{T} e_{t+h,t}^2$$

and

$$\text{Mean Squared Percent Error:} \quad MSPE = \frac{1}{T}\sum_{t=1}^{T} p_{t+h,t}^2.$$
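These summary measures are one-line computations once the actual values and forecasts are in hand. The sketch below uses purely illustrative numbers (they are not from this chapter) to show the definitions in Python:

```python
# Illustrative data: actual values and their h-step-ahead forecasts (hypothetical numbers).
actual   = [100.0, 104.0, 110.0, 108.0, 115.0]
forecast = [ 98.0, 105.0, 107.0, 110.0, 113.0]

T = len(actual)
e = [a - f for a, f in zip(actual, forecast)]    # forecast errors e_{t+h,t}
p = [err / a for err, a in zip(e, actual)]       # percentage errors p_{t+h,t}

ME   = sum(e) / T                                # mean error (bias)
EV   = sum((err - ME) ** 2 for err in e) / T     # error variance
MSE  = sum(err ** 2 for err in e) / T            # mean squared error
MSPE = sum(pe ** 2 for pe in p) / T              # mean squared percent error
```

Note that these definitions imply the identity $MSE = EV + ME^2$: mean squared error decomposes into a dispersion component and a squared-bias component.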
Often the square roots of these measures are used to preserve units, yielding

$$\text{Root Mean Squared Error:} \quad RMSE = \sqrt{\frac{1}{T}\sum_{t=1}^{T} e_{t+h,t}^2}$$

and

$$\text{Root Mean Squared Percent Error:} \quad RMSPE = \sqrt{\frac{1}{T}\sum_{t=1}^{T} p_{t+h,t}^2}.$$

Some less popular but nevertheless common accuracy measures include:

$$\text{Mean Absolute Error:} \quad MAE = \frac{1}{T}\sum_{t=1}^{T} \left|e_{t+h,t}\right|$$

and

$$\text{Mean Absolute Percent Error:} \quad MAPE = \frac{1}{T}\sum_{t=1}^{T} \left|p_{t+h,t}\right|.$$

It is clear that the length of the forecast horizon, $H$, is of crucial importance, as longer-term forecasts tend to have larger errors than nearer-term forecasts.

2. Risk Assessment: Monte Carlo Simulation and Fan Charts

Since no forecast can be expected to be 100 percent accurate, risk assessment involving measures of expected forecast accuracy has become increasingly popular. The construction of such measures is usually simulation-based, and the availability of ample computing power has made these computations widely feasible. The most common constructs for assessing risk are prediction intervals and density forecasts. A prediction interval supplements a point forecast with a range and a probability that the actual value will fall within that range. A density forecast goes one step further by assigning varying degrees of likelihood to particular values as one moves further from the point forecast. The basic tool for constructing these measures is Monte Carlo simulation.

Monte Carlo Simulation

Applications of Monte Carlo methods have enjoyed a flowering in the econometrics literature. In these studies, data are generated using computer-based pseudo-random number generators, i.e., computer programs that generate sequences of values that appear to be strings of draws from a specified probability distribution. For a set of three given values {p, q, r}, the method of generation usually proceeds as follows:

0. Initialize the seed.
1. Update the seed according to: $seed_j = f(seed_{j-1}, p, q)$.
2. Calculate $x_j = seed_j / r$.
3.
Perform a distribution-specific transformation on $x$ if necessary (if the desired distribution is something other than a standard uniform distribution, or U[0,1]); then move $x$ into memory.
4. Return to step 1.

For example, the following simple pseudo-random number generator has been widely used for $x \sim U[0,1]$:

0. Initialize $seed_0$.
1. Update the seed according to: $seed_j = \text{mod}(p \cdot seed_{j-1},\, q)$.
2. Calculate $x_j = seed_j / r$.
3. Move $x$ into memory.
4. Return to step 1.

The modulus function, mod(a, b), is the integer remainder after a is divided by b. For example, mod(11, 3) = 2. The generator will produce several million pseudo-random draws from U[0,1]. For example, suppose the seed is initialized at 1234567.0 and let {p, q, r} = {16807.0, 2147483648.0, 2147483655.0}. Then, the first ten values produced by this random number generator are:

Iteration   SEED          X
0           1234567
1           1422014737    0.662177
2           456166167     0.212419
3           268145409     0.124865
4           1299195559    0.604985
5           2113510897    0.984180
6           250624311     0.116706
7           1027361249    0.478402
8           1091982023    0.508494
9           546604753     0.254533
10          1998521175    0.930634

The above sample is drawn from a standard uniform, or U[0,1], population. When sampling from a standard uniform population, the sequence is essentially a difference equation since, given the initial seed, $x_j$ is ultimately a function of $x_{j-1}$. In most cases, the result at step 2 is a pseudo draw from the continuous uniform distribution in the range of zero to one.

For a given model specification and a given set of exogenous inputs, Monte Carlo simulation studies evaluate the risk to the forecast due to variation in the dependent variable that cannot be explained by the model, as well as the random variation in the model parameters. By assumption, the model errors are considered to be draws from a normally distributed random variable with mean zero. For purposes of the simulation, the model parameters are also considered to be random variables that are distributed as multivariate normal.
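The linear congruential scheme above is easy to reproduce. A minimal sketch, using the chapter's values of {p, q, r} with integer arithmetic in place of the floating-point originals (the function name is ours):

```python
def lcg(seed, p=16807, q=2**31, r=2**31 + 7, n=10):
    """Generate n pseudo-random U[0,1] draws via seed_j = mod(p * seed_{j-1}, q)."""
    draws = []
    for _ in range(n):
        seed = (p * seed) % q      # step 1: update the seed
        draws.append(seed / r)     # step 2: x_j = seed_j / r
    return draws

xs = lcg(1234567)
# First draw matches the table above: 0.662177
```

Note that q = 2147483648 = 2³¹ and r = 2147483655 = 2³¹ + 7, so the draws are strictly inside (0, 1) whenever the seed is nonzero.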
The standard deviation of the regression errors and the means and standard deviations of the parameter distribution are derived from the regression analysis. In order to simulate values for the dependent variable, a random number generator is used to generate a value for the model error and values for the parameters from each of the above probability distributions. Based on these draws and values from the input data set, which for purposes of the simulation is assumed to be fixed, the model is solved for the dependent variable. This "experiment" is typically repeated thousands of times, yielding thousands of simulated values for each observation of the dependent variable. The means and standard deviations of these simulated values can be used to construct a prediction interval and provide the starting point for creating a density forecast, typically portrayed by a fan chart.

[Figure 1: Fan Chart for Partnership/S Corporation Income Growth, 90 percent prediction intervals, percent change, 1991-2007; Monte Carlo mean and DOB forecast shown. Note: With 90 percent probability, actual growth will fall into the shaded region. Bands represent 5 percent probability regions. Source: NYS Department of Taxation and Finance; DOB staff estimates.]

Density Forecasts and Fan Charts

Fan charts display prediction intervals as shown in Figure 1. It is estimated that with 90 percent probability, future values will fall into the shaded area of the fan. Each band within the shaded area reflects a five percent probability region. The chart "fans out" over time to reflect the increasing uncertainty and growing risk as the forecast departs further from the base year. Not only does the fan chart graphically depict the risks associated with a point forecast as time progresses, but it also highlights how realizations that are quite far from the point estimate can have a reasonably high likelihood of occurring.
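The simulation loop just described can be sketched for a simple one-regressor model. The model, coefficient estimates, and data below are hypothetical stand-ins, not DOB's actual model, and the parameter draws are taken as independent here for brevity (the text assumes a multivariate normal):

```python
import random

random.seed(0)  # fix the seed so the experiment is reproducible

# Hypothetical fitted model: y = b0 + b1 * x + error, with point estimates and
# standard errors taken as given from a prior regression (illustrative values).
b0_hat, b0_se = 2.0, 0.3
b1_hat, b1_se = 0.8, 0.1
resid_sd = 1.5             # standard deviation of the regression errors
x_future = 10.0            # exogenous input, treated as fixed

draws = []
for _ in range(10_000):
    b0  = random.gauss(b0_hat, b0_se)     # parameter draw
    b1  = random.gauss(b1_hat, b1_se)     # parameter draw (independence assumed)
    eps = random.gauss(0.0, resid_sd)     # model-error draw
    draws.append(b0 + b1 * x_future + eps)

draws.sort()
lo = draws[int(0.05 * len(draws))]
hi = draws[int(0.95 * len(draws))]
# (lo, hi) is a simulated 90 percent prediction interval for y given x_future
```

Stacking such intervals across forecast years, with the bands widening at longer horizons, produces exactly the raw material for a fan chart.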
Fan charts can exhibit skewness that reflects more downside or upside risk to the forecast, and the costs associated with erring on either side.

Theoretical Underpinnings of the Fan Chart

To capture the notion of asymmetric risk, the fan chart used by DOB assumes a two-piece normal distribution for each of the forecast years, following an approach inspired by Wallis (1999) and others. A two-piece normal distribution of the form

$$f(x) = \begin{cases} A \exp\left[-(x-\mu)^2 / 2\sigma_1^2\right] & x \le \mu \\ A \exp\left[-(x-\mu)^2 / 2\sigma_2^2\right] & x > \mu \end{cases}$$

with $A = \left(\sqrt{2\pi}\,(\sigma_1+\sigma_2)/2\right)^{-1}$, is formed by combining halves of two normal distributions having the same mean but different standard deviations, with parameters $(\mu, \sigma_1)$ and $(\mu, \sigma_2)$, and scaling them to give the common value $f(\mu)$. If $\sigma_1 < \sigma_2$, the two-piece normal has positive skewness, with the mean and median exceeding the mode. A smooth distribution $f(x)$ arises from scaling the discontinuous distribution $f(z)$ to the left of $\mu$ using $2\sigma_1/(\sigma_1+\sigma_2)$ and the original distribution $f(z)$ to the right of $\mu$ using $2\sigma_2/(\sigma_1+\sigma_2)$.

[Figure 2: Two halves of normal distributions with mean $\mu$ and standard deviations $\sigma_1$ and $\sigma_2$ (solid), and the resulting two-piece normal distribution with mean $\mu$ (dashed).]

One can determine the cutoff values for the smooth probability density function $f(x)$ from the underlying standard normal cumulative distribution functions by recalling the scaling factors. For $\alpha \le \sigma_1/(\sigma_1+\sigma_2)$, i.e., to the left of $\mu$, the point $x_\alpha$ of the two-piece normal distribution defined by $\text{Prob}(X \le x_\alpha) = \alpha$ is the same as the point $z_\beta$ defined by $\text{Prob}(Z \le z_\beta) = \beta$, with

$$\beta = \frac{(\sigma_1+\sigma_2)\,\alpha}{2\sigma_1} \quad \text{and} \quad x_\alpha = \mu + \sigma_1 z_\beta.$$

Likewise, for $(1-\alpha) \le \sigma_2/(\sigma_1+\sigma_2)$, i.e., to the right of $\mu$, the point $x_{1-\alpha}$ of the two-piece normal distribution defined by $\text{Prob}(X \ge x_{1-\alpha}) = \alpha$ is the same as the point $z_{1-\beta}$ defined by $\text{Prob}(Z \ge z_{1-\beta}) = \beta$, with

$$\beta = \frac{(\sigma_1+\sigma_2)\,\alpha}{2\sigma_2} \quad \text{and} \quad x_{1-\alpha} = \mu + \sigma_2 z_{1-\beta}.$$

For the two-piece normal distribution, the mode remains at $\mu$. The median of the distribution can be determined as the value defined by $\text{Prob}(X \le x) = 0.5$.
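These quantile relations translate directly into code. A sketch using only the Python standard library (the function name is ours; `NormalDist().inv_cdf` supplies the standard normal inverse c.d.f.):

```python
from statistics import NormalDist

def two_piece_quantile(p, mu, s1, s2):
    """Quantile x_p of the two-piece normal with mode mu, spreads s1 (left) and s2 (right)."""
    z = NormalDist().inv_cdf               # standard normal inverse c.d.f.
    if p <= s1 / (s1 + s2):                # point falls to the left of the mode
        return mu + s1 * z((s1 + s2) * p / (2 * s1))
    alpha = 1.0 - p                        # right-tail probability
    return mu + s2 * z(1.0 - (s1 + s2) * alpha / (2 * s2))

# With s1 = s2 the distribution collapses to an ordinary normal; with s1 < s2
# the median lies above the mode, reflecting positive skewness.
median = two_piece_quantile(0.5, 0.0, 1.0, 2.0)
```

Evaluating this function at tail probabilities 0.05, 0.10, ..., 0.95 for each forecast year yields the band edges of an asymmetric fan chart.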
The mean of the two-piece normal distribution depends on the skewness of the distribution and can be calculated as:

$$E(X) = \mu + \sqrt{2/\pi}\,(\sigma_2 - \sigma_1).$$

Choice of Parameters

In constructing its fan charts, DOB uses the means from the Monte Carlo simulation study as the mean, $\mu$, of the two underlying normal distributions. As mentioned above, if the two-piece normal distribution is skewed, the Monte Carlo mean becomes the mode, or most likely outcome, of the distribution and will differ from the median and the mean. In the sample fan chart above, the mode is displayed as the crossed line. Except in extremely skewed cases, the mode tends to fall close to the middle of the central 10 percent prediction interval. As Britton et al. (1998) point out in their discussion of the Bank of England's inflation fan chart, the difference between the mean and the mode provides a measure of the skewness of the distribution. Given the skewness parameter, $\gamma$, DOB determines the two standard deviations, $\sigma_1$ and $\sigma_2$, as $\sigma_1 = (1+\gamma)\sigma$ and $\sigma_2 = (1-\gamma)\sigma$, where $\sigma$ is the standard deviation from the Monte Carlo simulation study. By definition, the mean of the distribution is the weighted average of the realizations of the variable under all possible scenarios, with the weights corresponding to the probability or likelihood of each scenario. In its forecasts, DOB aims to assess and incorporate the likely risks. Though no attempt is made to strictly calculate the probability-weighted average, the forecast will be considered a close approximation of the mean. Thus the skewness parameter, $\gamma$, is determined as the difference between DOB's forecast and the Monte Carlo mean. DOB's fan chart shows central prediction intervals with equal tail probabilities. For example, the region in the darkest two slivers represents the ten percent region in the center of the distribution. DOB adds regions with 5 percent probability on either side of the central interval to obtain the next prediction interval.
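Putting the pieces together, a year's fan-chart parameters follow mechanically from the Monte Carlo mean and standard deviation plus a skewness parameter. The numbers below are illustrative, not actual DOB values:

```python
from math import pi, sqrt

# Illustrative inputs: Monte Carlo mean and standard deviation, and a
# skewness parameter gamma (hypothetical values, not DOB's).
mc_mean, mc_sd, gamma = 3.0, 1.2, 0.25

s1 = (1 + gamma) * mc_sd    # spread below the mode
s2 = (1 - gamma) * mc_sd    # spread above the mode

mode = mc_mean                            # Monte Carlo mean serves as the mode
mean = mode + sqrt(2 / pi) * (s2 - s1)    # E(X) of the two-piece normal
# With gamma > 0 we get s1 > s2, so by the E(X) formula the mean falls
# below the mode, i.e., the risks tilt to the downside.
```

With gamma = 0 the two spreads coincide and the mean, median, and mode all collapse to the Monte Carlo mean.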
If the distribution is skewed, the corresponding 5 percent prediction intervals will include different ranges of growth rates at the top and the bottom, thus leading to an asymmetric fan chart. The 5 percent prediction regions encompass increasingly wider ranges of growth rates as one moves away from the center because the probability density of the two-piece normal distribution decreases as one moves further into the tails. Thus the limiting probability for any single outcome is higher in the central prediction regions than in intervals further out, because a smaller range of outcomes shares the same cumulative probability. Over time, risks become cumulative and uncertainties grow. DOB uses its own forecast history to determine the degree to which $\sigma_1$ and $\sigma_2$ need to be adjusted upward to maintain the appropriate probability regions.

3. Generalizing the Forecaster's Loss Function

When the forecaster's loss function is more general than the simple one assumed in the prior sections, the forecaster's choice of an optimal forecast may deviate even further from the model forecast. Suppose a forecaster working for a private-sector manufacturing firm is asked to provide guidance as to whether the firm should raise its level of inventories based on the outlook for demand for the company's product. If demand is projected to be high, then the firm will proceed to build inventories; if low, then the firm will reduce inventories. There are costs to the firm of being wrong. If demand is unexpectedly low, the firm will have unplanned inventories, while if demand is higher than expected, the firm will lose market share. The simple tables below, which summarize the costs to the firm of bad planning under alternative loss structures, clearly illustrate that the loss structure will factor critically into the firm's decision.
Under Symmetric Losses
Forecast/Actual    Demand High    Demand Low
High Forecast      $0             $10,000
Low Forecast       $10,000        $0

Under Asymmetric Losses
Forecast/Actual    Demand High    Demand Low
High Forecast      $0             $10,000
Low Forecast       $20,000        $0

The construct for measuring the cost attached by the forecaster to an incorrect prediction is the loss function, $L(e_{t+h,t})$, where $e_{t+h,t}$ is defined as above. The cost associated with the forecast error is presumed to depend only on the size of the forecast error and to be positive unless the error is zero. Typically, $L(e)$ is constructed to satisfy three requirements:

1. $L(0) = 0$.
2. $L(e)$ is continuous, implying that two nearly identical forecast errors should produce nearly identical losses.
3. $L(e)$ increases as the absolute value of $e$ increases, implying that the bigger the absolute value of the error, the bigger the loss.

[Figure 3: Quadratic Loss Function.]

[Figure 4: Absolute Loss Function.]

Loss functions can be either symmetric or asymmetric. Depicted in Figure 3 is the quadratic loss function, where

$$L(e) = e^2.$$

The squaring associated with quadratic loss makes large errors much more costly than small ones. In addition, the loss increases at an increasing rate on each side of the origin, implying symmetry. The absolute loss function is depicted in Figure 4, where

$$L(e) = |e|.$$

This function is also symmetric, but the loss increases at a constant rate with the size of the error, producing its V-shape. In reality, the costs associated with being wrong may not always be symmetric. For example, if the costs associated with under- and over-predicting travel time to the airport were symmetric, we would expect many more missed flights than we actually observe.
That we observe few missed flights is an indication that the cost of a missed flight must outweigh the cost of arriving early and having to wait in the airport, implying that the loss function is not symmetric. As alluded to above, government budget analysts may also face asymmetric costs associated with over-predicting vs. under-predicting revenues. Indeed, the different branches of government may have asymmetric loss functions that are mirror images of each other. Industry analysts may also attach a higher cost to an overly pessimistic forecast than to an overly optimistic one. Here we present the two asymmetric loss functions that are most popular in the literature. A more detailed presentation can be found in Christoffersen and Diebold (1997). The first is the "linex" function,

$$L(e) = b\left[\exp(ae) - ae - 1\right], \quad a \in \mathbb{R}\setminus\{0\},\; b > 0.$$

The linex loss function is so named since, for $a$ greater than (less than) zero, it assigns a cost that is approximately linear in the forecast error if the error is negative (positive) and exponential in the forecast error if it is positive (negative). Thus, for $a > 0$, negative forecast errors ($Y_{t+h} < \hat{Y}_{t+h,t}$) are much less costly than positive errors. The linex loss function, which is depicted in Figure 5, may well pertain to forecasting the time it will take to get to the airport. A negative error implies a longer wait at the airport, while a large positive error could entail a missed flight. Under the linex loss function, the optimal h-step-ahead forecast solves the following minimization problem:

$$\min_{\hat{Y}_{t+h}} \; E_t\!\left[\, b\left( \exp\!\left(a(Y_{t+h} - \hat{Y}_{t+h})\right) - a(Y_{t+h} - \hat{Y}_{t+h}) - 1 \right) \right].$$
Differentiating and using the conditional moment-generating function for a conditionally normally distributed random variate yields

$$\hat{Y}_{t+h} = \mu_{t+h|t} + \frac{a}{2}\,\sigma^2_{t+h|t}$$

allowing for conditional heteroskedasticity.² Thus, the optimal predictor is a simple function of the conditional mean and a bias term that depends on the conditional h-step-ahead prediction-error variance and the degree of loss-function asymmetry, as measured by the parameter $a$. When $a$ is positive, the larger is $a$, the greater the bias toward negative errors (over-prediction). In addition, when $a$ is positive, the optimal predictor is increasing in the prediction-error variance.

[Figure 5: Linex Loss Function.]

A second commonly used asymmetric loss function is the "linlin" loss function, which can be expressed as follows:

$$L(e) = \begin{cases} a\,|e|, & \text{if } e > 0 \\ b\,|e|, & \text{if } e \le 0 \end{cases}$$

The linlin loss function is so called since it is linear on each side of the origin; it generalizes the absolute loss function depicted above by allowing the slopes to differ on either side of the origin. The optimal predictor solves the following minimization problem:

$$\min_{\hat{Y}_{t+h}} \; a\!\int_{\hat{Y}_{t+h}}^{\infty} \left(Y_{t+h} - \hat{Y}_{t+h}\right) f(Y_{t+h}|\Omega_t)\,dY_{t+h} \;+\; b\!\int_{-\infty}^{\hat{Y}_{t+h}} \left(\hat{Y}_{t+h} - Y_{t+h}\right) f(Y_{t+h}|\Omega_t)\,dY_{t+h}.$$

The first-order condition implies the following result:

$$F\!\left(\hat{Y}_{t+h}\,\middle|\,\Omega_t\right) = \frac{a}{a+b}$$

where $F(\cdot\,|\,\Omega_t)$ is the conditional cumulative distribution function (c.d.f.) of $Y_{t+h}$. If $Y_{t+h}$ is normally distributed, then the optimal predictor is

$$\hat{Y}_{t+h} = \mu_{t+h|t} + \sigma_{t+h|t}\,\Phi^{-1}\!\left(\frac{a}{a+b}\right)$$

where $\Phi(z)$ is the standard normal c.d.f.

² Christoffersen and Diebold derive a "pseudo-optimal" estimator by replacing the conditional h-step-ahead prediction-error variance $\sigma^2_{t+h|t}$ with the unconditional variance $\sigma^2_h$, the resulting estimator only being optimal under conditional homoskedasticity. However, under conditional heteroskedasticity, the "pseudo-optimal" estimator will fail to result in a lower conditionally expected loss than the conditional mean except during times of high volatility.
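Both closed-form predictors can be checked numerically by brute-force minimization of the expected loss over a grid of candidate forecasts. The parameter values below are illustrative; the quadrature and grid are coarse, so the numerical optima only approximate the closed forms:

```python
from math import exp
from statistics import NormalDist

mu, sd = 0.0, 1.0        # conditional mean and prediction-error std. dev. (illustrative)
a, b = 1.0, 0.5          # asymmetry parameters (illustrative)

# Quadrature points: equally spaced probabilities mapped through the inverse c.d.f.
dist = NormalDist(mu, sd)
ys = [dist.inv_cdf((i + 0.5) / 1000) for i in range(1000)]

def expected_loss(yhat, loss):
    return sum(loss(y - yhat) for y in ys) / len(ys)

linex  = lambda e: b * (exp(a * e) - a * e - 1)
linlin = lambda e: a * abs(e) if e > 0 else b * abs(e)

grid = [i / 50 for i in range(-150, 151)]            # candidate forecasts, step 0.02
best_linex  = min(grid, key=lambda f: expected_loss(f, linex))
best_linlin = min(grid, key=lambda f: expected_loss(f, linlin))

# Closed forms for comparison: mu + a * sd**2 / 2 and mu + sd * inv_cdf(a / (a + b))
```

With $a > 0$, both numerical optima land above the conditional mean, reproducing the upward bias the formulas predict.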
The above results pertain to two fairly simple loss functions. However, Christoffersen and Diebold also show how an optimal predictor can be approximated under a more general loss function using numerical simulation. Though less restrictive, this approach may be less accessible to the average practitioner. Moreover, the literature is silent on how to choose values for the parameters $a$ and $b$. It is hoped, however, that the above discussion has illustrated how the problem of asymmetric loss fits into the broader problem of forecasting and can provide a useful guideline as to how to proceed and communicate the central issue.

4. Statistical Comparison of Alternative Forecasts

Choosing Among Competing Models

Suppose one must choose between two competing models, A and B, given a particular loss function. This can be couched as a hypothesis-testing problem:

$$H_0: E\!\left[L(e^A_{t+h,t})\right] = E\!\left[L(e^B_{t+h,t})\right]$$
$$H_A: E\!\left[L(e^A_{t+h,t})\right] > E\!\left[L(e^B_{t+h,t})\right] \;\text{ or }\; E\!\left[L(e^A_{t+h,t})\right] < E\!\left[L(e^B_{t+h,t})\right]$$

Equivalently, one might test the hypothesis that the expected loss differential is zero:

$$E[d_t] = E\!\left[L(e^A_{t+h,t})\right] - E\!\left[L(e^B_{t+h,t})\right] = 0.$$

If $d_t$ is a stationary series, the large-sample distribution of the sample mean loss differential is

$$\sqrt{T}\,(\bar{d} - \mu) \sim N(0, f)$$

where

$$\bar{d} = \frac{1}{T}\sum_{t=1}^{T}\left[L(e^A_{t+h,t}) - L(e^B_{t+h,t})\right]$$

is the sample mean loss differential, $f$ is the asymptotic variance of $\sqrt{T}\,\bar{d}$, and $\mu$ is the population mean loss differential. Under the null hypothesis of a zero population mean loss differential, the standardized sample mean loss differential has a standard normal distribution:

$$\frac{\bar{d}}{\sqrt{\hat{f}/T}} \sim N(0,1)$$

where $\hat{f}$ is a consistent estimate of $f$.³

Forecast Combination

Suppose one has two competing models, A and B, and statistical test results indicate that they are equally accurate. Should you combine them?

Forecast Encompassing

Suppose models A and B produce forecasts $\hat{Y}^A_{t+h,t}$ and $\hat{Y}^B_{t+h,t}$.
The following regression can be performed:

$$Y_{t+h} = \beta_A \hat{Y}^A_{t+h,t} + \beta_B \hat{Y}^B_{t+h,t} + \varepsilon_{t+h,t}.$$

If $(\beta_A, \beta_B) = (1, 0)$, then Model A forecast-encompasses Model B. If $(\beta_A, \beta_B) = (0, 1)$, then Model B forecast-encompasses Model A. Otherwise, neither model encompasses the other, and one may want to combine the two forecasts.

Forecast Combination

The Blue Chip consensus forecast is a simple average of about 50 forecasts. However, under certain circumstances, equally weighting all of the participating forecasters may not be optimal. For example, suppose again there are two forecasts, $\hat{Y}^A_{t+h,t}$ and $\hat{Y}^B_{t+h,t}$. One might combine them in a weighted average:

$$\hat{Y}^C_{t+h,t} = \omega\,\hat{Y}^A_{t+h,t} + (1-\omega)\,\hat{Y}^B_{t+h,t}$$

where $\hat{Y}^C_{t+h,t}$ is the combination forecast. Alternatively, one can write the problem in terms of forecast errors:

$$e^C_{t+h,t} = \omega\, e^A_{t+h,t} + (1-\omega)\, e^B_{t+h,t}$$

with variance

$$\sigma_C^2 = \omega^2 \sigma_A^2 + (1-\omega)^2 \sigma_B^2 + 2\,\omega(1-\omega)\,\sigma_{AB}$$

where the variances and the covariance, $\sigma_{AB}$, are based on forecasters' past performances. The value of $\omega$ can be determined as the solution to an optimization problem whose objective is to minimize the variance of the combined forecast error. The first-order condition indicates that the simple Blue Chip weighting scheme is not necessarily optimal.

The above methods abstract from consideration of the form of the forecaster's loss function. Forecast combination under more general circumstances is discussed more rigorously in Elliott and Timmermann (2002). The authors show that as long as the forecast error density is elliptically symmetric, the forecast combination weights are invariant over all loss functions, leaving only the constant term to capture the tradeoff between the bias in the loss function and the variance of the forecast error.

³ Alternatively, the sophisticated practitioner might want to choose between competing density forecasts. This problem is treated rigorously in Tay and Wallis (2000), under loss functions of general form, but is beyond the scope of this chapter.
As to the importance of the shape of the loss function to the choice of weights, the authors offer the intuitive conclusion that the larger the degree of loss-function asymmetry, the larger the gains from optimally estimating the combination weights compared to equally weighting the forecasts.

Following Elliott and Timmermann (2002), we generalize the problem of forecast combination by defining $\mathbf{Y}_{t+h,t}$ as a vector of forecasts and assuming that $Y_{t+h}$ and $\mathbf{Y}_{t+h,t}$ are jointly distributed with the following first and second moments:

$$E\begin{bmatrix} Y_{t+h} \\ \mathbf{Y}_{t+h,t} \end{bmatrix} = \begin{bmatrix} \mu_y \\ \boldsymbol{\mu} \end{bmatrix} \quad \text{and} \quad \text{Var}\begin{bmatrix} Y_{t+h} \\ \mathbf{Y}_{t+h,t} \end{bmatrix} = \begin{bmatrix} \sigma_y^2 & \boldsymbol{\sigma}'_{21} \\ \boldsymbol{\sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}.$$

Assume that the optimal combination forecast is a linear combination of the elements of $\mathbf{Y}_{t+h,t}$, giving rise to the forecast error

$$e_{t+h,t} = Y_{t+h} - c - \boldsymbol{\omega}'\mathbf{Y}_{t+h,t}$$

where $\boldsymbol{\omega}$ is a vector of combination weights and $c$ is a scalar constant, and $e_{t+h,t}$ has the following first and second moments:

$$\mu_e = \mu_y - c - \boldsymbol{\omega}'\boldsymbol{\mu}, \qquad \sigma_e^2 = \sigma_y^2 + \boldsymbol{\omega}'\boldsymbol{\Sigma}_{22}\,\boldsymbol{\omega} - 2\,\boldsymbol{\omega}'\boldsymbol{\sigma}_{21}.$$

Under a symmetric quadratic loss function, the first-order conditions of the minimization problem imply the optimal population values

$$c^0 = \mu_y - \boldsymbol{\omega}'\boldsymbol{\mu}, \qquad \boldsymbol{\omega}^0 = \boldsymbol{\Sigma}_{22}^{-1}\,\boldsymbol{\sigma}_{21}.$$

Although Elliott and Timmermann (2002) present very general results, a common special class of cases is that of elliptically symmetric forecast errors but asymmetric loss. The solution values for the optimal weights have the convenient property that only the constant term $c$ depends on the shape of the loss function. Thus, if $E[L(e_{t+h,t})] = g(\mu_e, \sigma_e^2)$, then $c^0$ is the solution to $\partial g(\mu_e^*, \sigma_e^2)/\partial \mu_e = 0$, where $\mu_e^*$ is the optimal value of $\mu_e$. Under the assumption of normally distributed forecast errors and a linex loss function,

$$c^0 = \mu_y - \boldsymbol{\omega}^{0\prime}\boldsymbol{\mu} + \frac{a}{2}\,\sigma_e^2$$

while under linlin loss,

$$c^0 = \mu_y - \boldsymbol{\omega}^{0\prime}\boldsymbol{\mu} + \sigma_e\,\Phi^{-1}\!\left(\frac{a}{a+b}\right).$$

References

Britton, E., P. Fisher and J. Whitley (1998). "The Inflation Report projections: understanding the fan chart." Bank of England Quarterly Bulletin, 38, 30-37.

Christoffersen, P. and F.X. Diebold (1997). "Optimal prediction under asymmetric loss." Econometric Theory, 13, 806-817.

Elliott, G. and A.
Timmermann (2002). "Optimal forecast combinations under general loss functions and forecast error distributions." University of California at San Diego, Economics Working Paper Series 2002-08.

Granger, C. (1989). Forecasting in Business and Economics (2nd edition). San Diego: Academic Press.

Tay, A. and K. Wallis (2000). "Density forecasting: a survey." Journal of Forecasting, 19, 235-254.

Wallis, K. (1999). "Asymmetric density forecasts of inflation and the Bank of England's fan chart." National Institute Economic Review, no. 167, January, 106-112.