Generalised Density Forecast Combinations∗
N. Fawcett, Bank of England
G. Kapetanios, Queen Mary, University of London
J. Mitchell, Warwick Business School
S. Price, Bank of England
April 11, 2013

∗ Email addresses: Nicholas Fawcett (Nicholas.Fawcett@bankofengland.co.uk); George Kapetanios (g.kapetanios@qmul.ac.uk); James Mitchell (James.Mitchell@wbs.ac.uk); Simon Price (Simon.Price@bankofengland.co.uk)
Abstract
Density forecast combinations are becoming increasingly popular as a means
of improving forecast ‘accuracy’, as measured by a scoring rule. In this paper we
generalise this literature by letting the combination weights follow more general
schemes. Sieve estimation is used to optimise the score of the generalised density
combination where the combination weights depend on the variable one is trying to
forecast. Specific attention is paid to the use of piecewise linear weight functions that
let the weights vary by region of the density. We analyse these schemes theoretically,
in Monte Carlo experiments and in an empirical study. Our results show that the
generalised combinations outperform their linear counterparts.
JEL Codes: C53
Keywords: Density Forecasting; Model Combination; Scoring Rules
1 Introduction
Density forecast combinations or weighted linear combinations of prediction models are
becoming increasingly popular in econometric applications as a means of improving forecast ‘accuracy’, as measured by a scoring rule (see Gneiting & Raftery (2007)), especially
in the face of uncertain instabilities and uncertainty about the preferred model; e.g.,
see Jore et al. (2010), Geweke & Amisano (2012) and Rossi (2012). Geweke & Amisano
(2011) contrast Bayesian model averaging with linear combinations of predictive densities,
so-called “opinion pools”, where the weights on the component density forecasts are optimised to maximise the score, typically the logarithmic score, of the density combination
as suggested in Hall & Mitchell (2007).
In this paper we generalise this literature by letting the combination weights follow
more general schemes. Specifically, we let the combination weights depend on the variable
one is trying to forecast. We let the weights in the density combination depend on, for example, the region of the forecast density in which the outcome falls, which is often of interest to economists.
This allows for the possibility that while one model may be particularly useful (and receive
a high weight in the combination) when the economy is in recession, for example, another
model may be more informative when output growth is positive. The idea of letting the
weights on the component densities vary according to the (forecasted) value of the variable
of interest contrasts with two recent suggestions, to let the weights in the combination
follow a Markov-switching structure (Waggoner & Zha (2012)), or to evolve over time
according to a Bayesian learning mechanism (Billio et al. (2013)). Accommodating time-variation in the combination weights mimics our approach to the extent that over time
one moves into different regions of the forecast density.
The plan of this paper is as follows. Section 2 develops the theory behind the generalised density combinations. It proposes the use of sieve estimation (cf. Chen & Shen
(1998)) as a means of optimising the score of the generalised density combination over a
tractable, approximating space of weight functions on the component densities. We consider, in particular, the use of piecewise linear weight functions that have the advantage
of explicitly letting the combination weights depend on the region, or specific quantiles, of
the density. This means prediction models can be weighted according to their respective
abilities to forecast across different regions of the distribution. Section 3 draws out the
flexibility afforded by generalised density combinations by undertaking two Monte Carlo
simulations. These show that the generalised combinations are more flexible than their
linear counterparts and can better mimic the true but unknown density, irrespective of its
form. The benefits of using the generalised combinations are found to increase with the
sample size. But the advantage of more flexible weight functions can be outweighed by
estimation error in small samples. Consequently, the simulations show that in practice
it may be prudent to use parsimonious weight functions, despite their more restrictive
form. Section 4 then shows how the generalised combinations can work better in practice,
finding that they deliver more accurate density forecasts of the S&P500 daily return than
optimal linear combinations of the sort used in Hall & Mitchell (2007) and Geweke &
Amisano (2011). Section 5 concludes.
2 Generalised density combinations: theory

We provide a general scheme for combining density forecasts. Consider a covariance stationary stochastic process of interest $y_t$, $t = 1, \ldots, T$, and a vector of predictor variables $x_t$, $t = 1, \ldots, T$. Our aim is to forecast the density of $y_{t+1}$ conditional on $\mathcal{F}_t = \sigma\left(x_{t+1}, (y_t, x_t')', \ldots, (y_1, x_1')'\right)$.
We assume the existence of a set of density forecasts. Let there be N such forecasts denoted by $q_i(y|\mathcal{F}_t) \equiv q_{it}(y)$, $i = 1, \ldots, N$, $t = 1, \ldots, T$. We suggest a generalised combined density forecast, given by the generalised "opinion pool"
$$p_t(y) = \sum_{i=1}^{N} w_i(y)\, q_{it}(y), \qquad (1)$$
such that
$$\int p_t(y)\, dy = 1, \qquad (2)$$
where wi (y) are the weights on the individual density forecasts which themselves depend
on y. This generalises existing work on optimal linear density forecast combinations,
where wi (y) = wi and wi are scalars; see Hall & Mitchell (2007) and Geweke & Amisano
(2011).
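To fix ideas, the following sketch evaluates a generalised pool of the form (1) for two Gaussian component densities and weight functions that vary with y, and checks the normalisation condition (2) numerically. It is our own illustration, not the authors' code; the weight functions and parameter values are purely hypothetical.

```python
# A minimal sketch (not from the paper) of the generalised opinion pool in (1):
# p_t(y) = sum_i w_i(y) q_it(y), with combination weights that vary with y.
import numpy as np
from scipy.stats import norm

def generalised_pool(y, component_pdfs, weight_funcs):
    """Evaluate sum_i w_i(y) * q_i(y) pointwise on a grid."""
    y = np.asarray(y)
    return sum(w(y) * q(y) for q, w in zip(component_pdfs, weight_funcs))

# Two hypothetical component forecast densities.
q1 = norm(loc=0.0, scale=1.0).pdf
q2 = norm(loc=0.0, scale=2.0).pdf

# Hypothetical y-dependent weights: lean on q1 for positive y, on q2 otherwise.
w1 = lambda y: 1.0 / (1.0 + np.exp(-2.0 * y))
w2 = lambda y: 1.0 - w1(y)

grid = np.linspace(-8.0, 8.0, 2001)
p = generalised_pool(grid, [q1, q2], [w1, w2])

# The weights must be chosen so that the normalisation constraint (2) holds;
# here we simply check it numerically with a Riemann sum on the grid.
print("integral of the combined density:", p.sum() * (grid[1] - grid[0]))
```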
We need to provide a mechanism for deriving $w_i(y)$. Accordingly, we define a predictive loss function given by
$$L_T = \sum_{t=1}^{T} l\left(p_t(y_t); y_t\right). \qquad (3)$$
We assume that there exist $w_i^0(y)$ in the space of $q_i$-integrable functions $\Psi_{q_i}$, where
$$\Psi_{q_i} = \left\{ w(\cdot) : \int w(y)\, q_i(y)\, dy < \infty \right\}, \quad i = 1, \ldots, N, \qquad (4)$$
such that
$$E\left(l\left(p_t(y_t); y_t\right)\right) \equiv E\left(l\left(p_t\left(y_t; w_1^0, \ldots, w_N^0\right); y_t\right)\right) \leq E\left(l\left(p_t\left(y_t; w_1, \ldots, w_N\right); y_t\right)\right) \qquad (5)$$
for all $(w_1, \ldots, w_N) \in \prod_i \Psi_{q_i}$. We suggest determining $w_i$ by minimising $L_T$, i.e.,
$$\left\{\hat{w}_{1T}, \ldots, \hat{w}_{NT}\right\} = \arg\min_{w_i,\, i=1,\ldots,N} L_T. \qquad (6)$$
It is clear that the minimisation problem in (6) is difficult to solve unless one restricts the space over which one searches from $\Psi_{q_i}$ to a more tractable space. A general way forward à la Chen & Shen (1998) is to search over
$$\Phi_{q_i} = \left\{ w_\eta(\cdot) : w_\eta(y) = v_0 + \sum_{s=1}^{\infty} \nu_s\, \eta_s(y, \theta_s),\ \nu_s \geq 0,\ \eta_s(y, \theta_s) \geq 0 \right\}, \quad i = 1, \ldots, N, \qquad (7)$$
where $\{\eta_s(y, \theta_s)\}_{s=1}^{\infty}$ is a known basis, up to a finite-dimensional parameter vector $\theta_s$, such that $\Phi_{q_i}$ is dense in $\Psi_{q_i}$, and $\{\nu_s\}_{s=1}^{\infty}$ is a sequence of constants. Such a basis can be made of any of a variety of functions, including trigonometric functions, indicator functions, neural networks and splines.

$\Phi_{q_i}$, in turn, can be approximated through sieve methods by
$$\Phi_{q_i}^T = \left\{ w_{T\eta}(\cdot) : w_{T\eta}(y) = v_0 + \sum_{s=1}^{p_T} \nu_s\, \eta_s(y, \theta_s),\ \nu_s \geq 0,\ \eta_s(y, \theta_s) \geq 0 \right\}, \quad i = 1, \ldots, N, \qquad (8)$$
where $p_T \to \infty$ is either a deterministic or data-dependent sequence.
Note that a sufficient condition for (2) is
$$\int_Y \sum_{i=1}^{N}\left(v_{i0} + \sum_{s=1}^{p_T} \nu_{is}\,\eta_s(y, \theta_s)\right) q_{it}(y)\, dy = \sum_{i=1}^{N}\left(v_{i0} + \sum_{s=1}^{p_T} \nu_{is}\int_Y \eta_s(y, \theta_s)\, q_{it}(y)\, dy\right) = \sum_{i=1}^{N}\left(v_{i0} + \sum_{s=1}^{p_T} \nu_{is}\,\kappa_{is}\right) = 1, \qquad (9)$$
where $\kappa_{is} = \int_Y \eta_s(y, \theta_s)\, q_{it}(y)\, dy$.

Defining $\vartheta_T = \left(\nu_1', \ldots, \nu_m', \theta_1', \ldots, \theta_{p_T}'\right)'$, $\nu_i = \left(v_{i0}, \ldots, \nu_{ip_T}\right)'$, $w = w(\vartheta_T) = \{w_1, \ldots, w_m\}$ and
$$\hat{\vartheta}_T = \arg\min_{\vartheta_T} L_T = \arg\min_{\vartheta_T} L_T\left(w_{T\eta}(\vartheta_T)\right), \qquad (10)$$
we can use Corollary 1 of Chen & Shen (1998) to show the following result:
$$\frac{1}{T}\left( L_T\left(w_{T\eta}(\hat{\vartheta}_T)\right) - L_T\left(w^0\right) \right) = o_p(1). \qquad (11)$$
In fact, Theorem 1 of Chen & Shen (1998) also provides rates for the convergence of $\hat{\vartheta}_T$ to $w^0$. However, the proofs depend crucially on the choice of the basis $\{\eta_s(y, \theta_s)\}_{s=1}^{\infty}$. Chen & Shen (1998) discuss many possible bases. We will derive a rate for one such basis below. Unfortunately, this rate is not fast enough to satisfy the condition associated with (4.2) in Theorem 2 of Chen & Shen (1998) and, therefore, we cannot prove asymptotic normality for $\hat{v}_0$ when $p_T \to \infty$.
Of course, we can limit our analysis to less general spaces. In particular, we can search over
$$\Phi_{q_i}^p = \left\{ w_{p\eta}(\cdot) : w_{p\eta}(y) = v_0 + \sum_{s=1}^{p} \nu_s\, \eta_s(y, \theta_s) \right\}, \quad i = 1, \ldots, m, \qquad (12)$$
for some finite $p$. Let $\vartheta_p = \left(\nu_1', \ldots, \nu_p', \theta_1', \ldots, \theta_p'\right)'$. We make the following assumptions:
Assumption 1 There exists a unique $\vartheta_p^0 \in \operatorname{int} \Theta_p$ that minimises $E\left(L_T\left(w_{T\eta}(\vartheta_T)\right)\right)$, for all $p$.

Assumption 2 $l\left(p_t(\cdot, \vartheta_p); \cdot\right)$ has bounded derivatives with respect to $\vartheta_p$ uniformly over $\Theta_p$, $t$ and $p$.

Assumption 3 $y_t$ is an $L_r$-bounded ($r > 2$), $L_2$-NED process of size $-a$, on an $\alpha$-mixing process, $V_t$, of size $-r/(r-2)$, such that $a \geq (r-1)/(r-2)$.
Assumption 4 Let $l^{(i)}\left(p_t(y_t, \vartheta_p); y_t\right)$ denote the $i$-th derivative of $l$ with respect to $y_t$, where $l^{(0)}\left(p_t(y_t, \vartheta_p); y_t\right) = l\left(p_t(y_t, \vartheta_p); y_t\right)$. Let
$$\left| l^{(i)}\left(p_t\left(y^{(1)}, \vartheta_p^0\right); y^{(1)}\right) - l^{(i)}\left(p_t\left(y^{(2)}, \vartheta_p^0\right); y^{(2)}\right)\right| \leq B_t^{(i)}\left(y^{(1)}, y^{(2)}\right)\left| y^{(1)} - y^{(2)}\right|, \quad i = 0, 1, 2, \qquad (13)$$
where $B_t^{(i)}(\cdot, \cdot)$ are nonnegative measurable functions. Then, for all $m$, and some $r > 2$,
$$\left\| B_t^{(i)}\left(y_t, E\left(y_t \mid V_{t-m}, \ldots, V_{t+m}\right)\right)\right\|_r < \infty, \quad i = 0, 1, 2, \qquad (14)$$
$$\left\| l^{(i)}\left(p_t\left(y_t, \vartheta_p^0\right); y^{(1)}\right)\right\|_r < \infty, \quad i = 0, 1, 2, \qquad (15)$$
where $\left\|\cdot\right\|_r$ denotes the $L_r$ norm.
Assumption 5 qit (y) are bounded functions for i = 1, ..., N .
Then, it is straightforward to show that:
Theorem 1 Under Assumptions 1-5,
$$\sqrt{T}\left(\hat{\vartheta}_T - \vartheta_p^0\right) \to_d N(0, V), \qquad (16)$$
where
$$V = V_1^{-1} V_2 V_1^{-1}, \qquad V_1 = \frac{1}{T}\sum_{t=1}^{T}\left.\frac{\partial^2 l\left(p_t(y_t); y_t\right)}{\partial\vartheta\,\partial\vartheta'}\right|_{\hat{\vartheta}_T}, \qquad V_2 = \frac{1}{T}\sum_{t=1}^{T}\left.\left(\frac{\partial l\left(p_t(y_t); y_t\right)}{\partial\vartheta}\right)\left(\frac{\partial l\left(p_t(y_t); y_t\right)}{\partial\vartheta}\right)'\right|_{\hat{\vartheta}_T}.$$
Using this result one can test the null hypothesis
$$H_0: \nu_i^0 = \left(v_{i0}, 0, \ldots, 0\right)', \quad \forall i. \qquad (17)$$
This allows us to test whether it is useful to allow the weights to depend on y. Note
the need here for a basis that is fully known and does not depend on unknown parameters
as these would then be unidentified under the null and would require careful treatment.
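As an illustration of how such a test could be carried out in practice, the sketch below computes a standard Wald statistic based on the asymptotic normality in Theorem 1. It is our own hedged example, not the authors' code: the parameterisation, the placeholder estimates and the diagonal covariance matrix are all hypothetical.

```python
# A hedged sketch of testing H0 in (17) -- that the weights do not depend on y --
# using the asymptotic normality in Theorem 1. All numbers below are placeholders.
import numpy as np
from scipy.stats import chi2

def wald_test_constant_weights(theta_hat, V_hat, restriction_idx, T):
    """Wald test of H0: theta[restriction_idx] = 0.

    theta_hat:       estimated parameter vector (intercepts v_i0 and slopes nu_is)
    V_hat:           estimate of the asymptotic covariance V = V1^{-1} V2 V1^{-1}
    restriction_idx: indices of the nu_is coefficients set to zero under H0
    T:               sample size
    """
    R = np.zeros((len(restriction_idx), len(theta_hat)))
    R[np.arange(len(restriction_idx)), restriction_idx] = 1.0
    r = R @ theta_hat
    W = T * r @ np.linalg.solve(R @ V_hat @ R.T, r)   # chi2 with len(r) dof under H0
    return W, chi2.sf(W, df=len(restriction_idx))

# Hypothetical example: 2 models, one basis term each, parameters
# (v_10, nu_11, v_20, nu_21); H0 sets the y-dependent slopes nu_11, nu_21 to zero.
theta_hat = np.array([0.45, 0.30, 0.55, -0.25])
V_hat = np.diag([0.20, 0.25, 0.20, 0.25])             # placeholder covariance estimate
print(wald_test_constant_weights(theta_hat, V_hat, restriction_idx=[1, 3], T=500))
```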
It is clear from Theorem 1 that the generalised combination scheme can reduce the value
of the objective function compared to the linear combination scheme as long as ν 0i does
not minimise the population objective function. If ν 0i does minimise the loss function
then both schemes will achieve this minimum. And if one of the component densities
is, in fact, the true density, then the loss function is minimised when the true density is
given a unit weight, and all other densities are given a zero weight. However, there is
no guarantee, theoretically, that this will be the case in practice, as other sets of weights
potentially can also achieve the minimum objective function.
Furthermore, in general, the loss function may not have a single minimiser if $w_i \in \Psi_{q_i}$. We may then wish to modify the loss function in a variety of ways. We examine a couple of possibilities here. First, we can penalise weight functions $w_i$ that are far away from a constant function. This would lead to a loss function of the form
$$L_T = \sum_{t=1}^{T} l\left(p_t(y_t); y_t\right) + T^{\gamma} \sum_{i=1}^{N} \int_C \left| w_i(y) - f_C(y)\right|^{\delta} dy, \quad \delta > 0,\ 0 < \gamma < 1, \qquad (18)$$
where we impose the restriction $\int_C \left|w_i(y) - f_C(y)\right|^{\delta} dy < \infty$ and $f_C(y)$ is the uniform density over $C$. An alternative way to guarantee uniqueness for the solution of (6) is to impose restrictions that apply to the functions belonging to $\Phi_{q_i}$. Such a loss function can take the form
$$L_T = \sum_{t=1}^{T} l\left(p_t(y_t); y_t\right) + T^{\gamma} \sum_{i=1}^{N}\sum_{s=1}^{\infty} \left|\nu_{is}\right|^{\delta}, \quad \delta > 0,\ 0 < \gamma < 1, \qquad (19)$$
where we assume that $\sum_{s=1}^{\infty} \left|\nu_{is}\right|^{\delta} < \infty$. This sort of modification has parallels to the penalisations involved in methods such as penalised likelihood. Given this, it is worth noting that the above general method for combining densities can easily cross the boundary between density forecast combination and outright density estimation. For example, simply setting $N = 1$ and $q_{1t}(y)$ to the uniform density, and using some penalised loss function such as (18) or (19), reduces the density combination method to density estimation akin to penalised maximum likelihood.
2.1 Piecewise linear weight functions
In what follows we suggest, in particular, the use of indicator functions for η s (y), i.e.
η s = I (rs−1 ≤ y < rs ) where the boundaries r0 < r1 < ... < rs−1 < rs are either known a
priori or estimated from the data. The boundaries might be assumed known on economic
grounds. For example, in a macroeconomic application, some models might be deemed to
forecast better in recessionary than expansionary times (suggesting a boundary of zero),
or when inflation is inside its target range (suggesting boundaries determined by the
inflation target). Otherwise, they must be estimated, as Section 3 considers further.
With piecewise linear weight functions, the generalised combined density forecast is
$$p_t(y) = \sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\, q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right), \qquad (20)$$
where $\nu_{is}$ are constants (to be estimated) and
$$\kappa_{is} = \int_Y I\left(r_{s-1} \leq y < r_s\right) q_{it}(y)\, dy = \int_{r_{s-1}}^{r_s} q_{it}(y)\, dy. \qquad (21)$$
These piecewise linear weights allow for considerable flexibility, and have the advantage
that they let the combination weight depend on y; e.g. it may be helpful to weight some
models more highly when y is negative, say the economy is in recession, than when y is
high and output growth is strongly positive. This flexibility increases as s, the number of
regions, tends to infinity.
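The following sketch, our own and with assumed helper names such as region_masses, illustrates the piecewise scheme: it computes the region masses κ_is of (21) for two Gaussian components, rescales a set of illustrative weights ν_is so that the sufficient condition (9) holds, and evaluates the combined density (20).

```python
# A sketch (assumed helper names, not from the paper) of the piecewise scheme in
# (20)-(21): kappa_is is the mass of component i in region s, and the combined
# density integrates to one when sum_{i,s} nu_is * kappa_is = 1.
import numpy as np
from scipy.stats import norm

def region_masses(component_cdfs, boundaries):
    """kappa[i, s] = integral of q_i over [r_{s-1}, r_s), cf. equation (21)."""
    r = np.concatenate(([-np.inf], boundaries, [np.inf]))
    return np.array([[cdf(r[s + 1]) - cdf(r[s]) for s in range(len(r) - 1)]
                     for cdf in component_cdfs])

def combined_density(y, component_pdfs, boundaries, nu):
    """Evaluate equation (20) at points y for weights nu[i, s]."""
    y = np.asarray(y)
    r = np.concatenate(([-np.inf], boundaries, [np.inf]))
    s = np.searchsorted(r, y, side="right") - 1           # region index of each y
    return sum(nu[i, s] * pdf(y) for i, pdf in enumerate(component_pdfs))

cdfs = [norm(0, 1).cdf, norm(0, 2).cdf]
pdfs = [norm(0, 1).pdf, norm(0, 2).pdf]
boundaries = np.array([0.0])                               # p = 2 regions, r_1 = 0
kappa = region_masses(cdfs, boundaries)

nu = np.array([[1.6, 0.2],                                 # illustrative weights
               [0.4, 0.8]])
nu /= (nu * kappa).sum()                                   # impose the constraint (9)
print("sum of nu*kappa =", (nu * kappa).sum())             # equals 1 by construction
y = np.linspace(-6, 6, 5)
print("combined density at", y, ":", combined_density(y, pdfs, boundaries, nu))
```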
2.2 Scoring rules
Gneiting & Raftery (2007) discuss a general class of proper scoring rules to evaluate
density forecast accuracy, whereby a numerical score is assigned based on the predictive
density at time t and the value of y that subsequently materialises, here assumed without loss of generality to be at time t + 1. A common choice for the loss function LT ,
within the ‘proper’ class (cf. Gneiting & Raftery (2007)), is the logarithmic scoring rule.
More specific loss functions that might be appropriate in some economic applications can
readily be used instead (e.g. see Gneiting & Raftery (2007)) and again minimised via
our combination scheme. But an attraction of the logarithmic scoring rule is that, absent
knowledge of the loss function of the user of the forecast, by maximising the logarithmic
score one is simultaneously minimising the Kullback-Leibler Information Criterion relative
to the true but unknown density; and when this distance measure is zero, we know from
Diebold et al. (1998) that all loss functions are, in fact, being minimised.
Using the logarithmic scoring rule, with piecewise linear weights, the loss function $L_T$ is given by
$$L_T = \sum_{t=1}^{T} -\log p_t(y_{t+1}) = -\sum_{t=1}^{T} \log\left(\sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\, q_{it}(y_{t+1})\, I\left(r_{s-1} \leq y_{t+1} < r_s\right)\right). \qquad (22)$$
Minimising this over the full sample ($t = 1, \ldots, T$) with respect to $\nu_{is}$, subject to $\sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\,\kappa_{is} = 1$, gives the Lagrangian
$$\tilde{L}_T = \sum_{t=1}^{T} \log\left(\sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\, q_{it}(y_{t+1})\, I\left(r_{s-1} \leq y_{t+1} < r_s\right)\right) - \lambda_s\left(\sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\,\kappa_{is} - 1\right). \qquad (23)$$
Then,
$$\frac{\partial \tilde{L}_T}{\partial \nu_{is}} = \sum_{t=1}^{T}\left[\frac{q_{it}(y_{t+1})\, I\left(r_{s-1} \leq y_{t+1} < r_s\right)}{\sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\, q_{it}(y_{t+1})\, I\left(r_{s-1} \leq y_{t+1} < r_s\right)}\right] - \lambda_s\,\kappa_{is} = 0, \qquad (24)$$
$$\frac{\partial \tilde{L}_T}{\partial \lambda_s} = \sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\,\kappa_{is} - 1 = 0. \qquad (25)$$
In practice, in a time-series application, without knowledge of the full sample (t =
1, ..., T ) this minimisation would be undertaken recursively at each period t based on
information through (t − 1). In a related context, albeit for evaluation, Diks et al. (2011)
discuss the weighted logarithmic scoring rule, wt (yt+1 ) log qit (yt+1 ), where the weight
function wt (yt+1 ) emphasises regions of the density of interest; one possibility, as in (22),
is that wt (yt+1 ) = I (rs−1 ≤ yt+1 < rs ). But, as Diks et al. (2011) show, the weighted
logarithmic score rule is ‘improper’ and can systematically favour misspecified densities
even when the candidate densities include the true density.
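As one possible numerical implementation of the constrained optimisation in (22)-(23), and not the authors' own code, the sketch below minimises the negative logarithmic score over the piecewise weights subject to the normalisation constraint and non-negativity, using a generic constrained optimiser.

```python
# A hedged sketch of estimating the piecewise weights by minimising the negative
# log score (22) subject to sum_{i,s} nu_is kappa_is = 1 and nu_is >= 0.
import numpy as np
from scipy.optimize import minimize

def fit_piecewise_weights(q, region, kappa):
    """q: (T, N) component densities evaluated at the realisations; region: (T,)
    region index of each realisation; kappa: (N, S) region masses from (21)."""
    T, N = q.shape
    S = kappa.shape[1]

    def neg_log_score(nu_flat):
        nu = nu_flat.reshape(N, S)
        p = np.einsum("tn,nt->t", q, nu[:, region])       # p_t = sum_i nu_{i,s(t)} q_it
        return -np.sum(np.log(np.maximum(p, 1e-300)))     # guard against log(0)

    constraints = [{"type": "eq",
                    "fun": lambda v: np.sum(v.reshape(N, S) * kappa) - 1.0}]
    bounds = [(0.0, None)] * (N * S)
    nu0 = np.full(N * S, 1.0 / (N * S * kappa.mean()))    # feasible starting value
    res = minimize(neg_log_score, nu0, bounds=bounds, constraints=constraints,
                   method="SLSQP")
    return res.x.reshape(N, S)
```

In a recursive, real-time application the same optimisation would simply be re-run at each forecast origin using data through t − 1 only.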
We can then prove the following result using Theorem 1 of Chen & Shen (1998).
Theorem 2 Let Assumptions 1, 3 and 4 hold. Let
$$L_T = \sum_{t=1}^{T} -\log p_t(y_{t+1}) \qquad (26)$$
and
$$\eta_s = I\left(r_{s-1} \leq y < r_s\right), \qquad (27)$$
where $\{r_s\}_{s=1}^{p_T}$ are known constants. Then,
$$\nu^0 - \hat{\nu}_T = o_p\left(T^{-\varphi}\right) \qquad (28)$$
for all $\varphi < 1/2$ as long as $T^{1/2}/p_T \to 0$.

We also have the following Corollary of Theorem 1 for the leading case of using the logarithmic score as a loss function, piecewise linear sieves and Gaussian-like component densities.

Corollary 3 Let Assumptions 1 and 3 hold. Let $q_i(y)$ be bounded functions such that $q_i(y) \sim \exp\left(-y^2\right)$ as $y \to \pm\infty$ for all $i$. Let
$$L_T = \sum_{t=1}^{T} -\log p_t(y_{t+1}) \qquad (29)$$
and
$$\eta_s = I\left(r_{s-1} \leq y < r_s\right), \qquad (30)$$
where $\{r_s\}_{s=1}^{p_T}$ are known constants. Then, the results of Theorem 1 hold.
3 Monte Carlo study

3.1 Monte Carlo setup
In this Section we illustrate the performance of the generalised density forecast combination scheme using Monte Carlo experiments. We consider two different Monte Carlo settings, distinguished by the nature of the true density, which is known in the simulations but would be unknown in practice. In the first simulation a non-Gaussian true density is considered, and we compare the ability of linear and generalised combinations of misspecified Gaussian densities to capture the non-Gaussianity. In the second simulation we assume the true density is Gaussian, but again consider combinations of two misspecified Gaussian densities. In this case, we know that linear combinations will, in general, incorrectly yield non-Gaussian densities, given that they generate mixture distributions. It is therefore important to establish if and how the generalised combinations improve upon this. We implement the generalised density forecast combinations below using piecewise linear weight functions, as in (20), and we use the simulations to draw out the consequences both of estimation error about the boundaries $r_i$ and of uncertainty about the number of thresholds $p_T = p$.
Setting 1: The true density is a two-part normal (TPN) density. We note that a random variable $Y$ has a two-part normal density if the probability density function is given by
$$f(y) = \begin{cases} A \exp\left(-(y-\mu)^2/2\sigma_1^2\right) & \text{if } y < \mu \\ A \exp\left(-(y-\mu)^2/2\sigma_2^2\right) & \text{if } y \geq \mu \end{cases}, \qquad A = \left(\sqrt{2\pi}\,(\sigma_1 + \sigma_2)/2\right)^{-1}. \qquad (31)$$
We assume that the investigator combines the normal components given by the densities of two normals with the same mean µ and variances given by σ 21 and σ 22 , respectively.
These two densities are then combined via (a) the generalised combination scheme and (b)
optimised (with respect to the logarithmic score) but fixed weights, as in Hall & Mitchell
(2007) and Geweke & Amisano (2011). When using the generalised combination scheme
we consider indicator functions where ri is either known or unknown but estimated using
a grid search, to examine how much estimation error of boundaries matters. In particular,
we set $p_T = p = 2$ and $r_1 = 0$ if $r_i$ is assumed known, while we consider a grid with 11 equally spaced points between -1 and 1 if $r_i$ is assumed unknown. We fix $\sigma_1^2 = 1$ and consider various values for $\sigma_2^2$. In particular, we set $\sigma_2^2 = 1.5, 2, 4$ and $8$.
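For concreteness, a minimal sketch of the two-part normal density in (31) is given below, together with a numerical check that it integrates to one. The code and parameter names are our own illustration, not part of the paper.

```python
# Sketch of the two-part normal density (31) used as the DGP in Setting 1
# (illustrative only; parameter names are ours).
import numpy as np

def two_part_normal_pdf(y, mu=0.0, sigma1=1.0, sigma2=2.0):
    """Density with standard deviation sigma1 below mu and sigma2 above mu."""
    A = 1.0 / (np.sqrt(2.0 * np.pi) * (sigma1 + sigma2) / 2.0)
    y = np.asarray(y, dtype=float)
    sigma = np.where(y < mu, sigma1, sigma2)
    return A * np.exp(-(y - mu) ** 2 / (2.0 * sigma ** 2))

# Quick check that the density integrates to (approximately) one.
grid = np.linspace(-20.0, 20.0, 20001)
dy = grid[1] - grid[0]
print(two_part_normal_pdf(grid, sigma2=np.sqrt(8.0)).sum() * dy)
```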
Setting 2: For this setting, the true density is the standard normal. There are two
component densities each of which is a misspecified normal density because the mean
is incorrect. We fix the variances of the component densities to unity but allow for
different values for the means, µ1 and µ2 . Then, we either take a combination of these two
misspecified normal densities using the generalised combination scheme and a piecewise
linear weight function; or use fixed weights, as before. We allow for various values of p from
p = 2 to p = 21 where ri are assumed known. We consider (µ1 , µ2 ) = (−0.5, 0.5) , (−1, 1)
and (−2, 2) .
We consider sample sizes of 50, 100, 200, 400, and 1000 observations and carry out
1000 Monte Carlo replications for each experiment. We compare the performance of the
generalised combination scheme with the performance of the linear combination scheme
and that of the component densities. Our performance criterion to measure the distance
between the true density and the forecast density is their integrated mean squared error.
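The sketch below shows one way to compute this distance numerically. We read the reported integrated mean squared error as the average, across Monte Carlo replications, of the integrated squared distance between the estimated combination and the true density, with the integral approximated on a grid; the exact numerical implementation is an assumption on our part.

```python
# A sketch of the distance criterion used in the Monte Carlo: the integrated
# squared error between a forecast density and the true density, computed on a
# grid (averaging this over replications gives an integrated mean squared error).
import numpy as np

def integrated_squared_error(true_pdf, forecast_pdf, lo=-15.0, hi=15.0, n=4001):
    grid = np.linspace(lo, hi, n)
    dy = grid[1] - grid[0]
    return np.sum((forecast_pdf(grid) - true_pdf(grid)) ** 2) * dy
```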
3.2 Monte Carlo results
Results for the two settings are presented in Tables 1 and 2. We start by considering
the results for Setting 1. When the boundary is assumed known and set to 0 then the
generalised combination scheme dominates the linear scheme for all sample sizes and
values of $\sigma_2^2$, as expected. The difference in performance is substantial for larger sample sizes
and values of σ 22 . When the boundary is assumed unknown then the performance of the
generalised combination scheme deteriorates as expected. For small sample sizes and low
values of σ 22 the linear scheme works better; but as more observations become available
and for larger variances the generalised scheme is again the best performing method by
far. Using either of the two component densities alone is, as expected, a bad idea.
Moving on to Setting 2, where the boundary is estimated in all cases, we see a nonlinear trade-off between the complexity of the weight function and the performance of the
generalised scheme. It seems that a weight function of medium complexity (say p = 5)
works best on average. The linear scheme is outperformed unless both the misspecification of the means of the component densities and the sample size are small. Overall, it is clear that while there is benefit to be had from a more flexible weight function, this benefit is eroded by the difficulty of estimating the weight function once it becomes too complex.
Table 1: Monte Carlo results for Setting 1, reporting integrated mean squared errors. E1 denotes the generalised combination scheme with known boundary; E2 denotes the generalised combination scheme with unknown boundary; E3 denotes the optimal linear combination; N(0,1) and N(0,σ₂²) are the two marginal (component) densities.

σ₂²   T      E1      E2      E3      N(0,1)    N(0,σ₂²)
1.5   50     0.670   6.013   0.845   1.654     1.103
1.5   100    0.320   2.619   0.771   1.654     1.103
1.5   200    0.163   1.241   0.714   1.654     1.103
1.5   400    0.079   0.518   0.690   1.654     1.103
1.5   1000   0.030   0.177   0.673   1.654     1.103
2     50     0.551   4.477   1.696   4.421     2.211
2     100    0.277   1.826   1.587   4.421     2.211
2     200    0.128   0.888   1.528   4.421     2.211
2     400    0.068   0.393   1.504   4.421     2.211
2     1000   0.028   0.143   1.486   4.421     2.211
4     50     0.382   2.209   2.746   12.728    3.182
4     100    0.175   1.021   2.643   12.728    3.182
4     200    0.094   0.456   2.597   12.728    3.182
4     400    0.044   0.216   2.572   12.728    3.182
4     1000   0.019   0.079   2.556   12.728    3.182
8     50     0.224   1.041   2.283   19.413    2.427
8     100    0.106   0.484   2.218   19.413    2.427
8     200    0.052   0.223   2.191   19.413    2.427
8     400    0.026   0.112   2.175   19.413    2.427
8     1000   0.010   0.043   2.163   19.413    2.427
Table 2: Monte Carlo results for Setting 2, reporting integrated mean squared errors. E2 denotes the generalised combination scheme with unknown boundary (reported for p = 2 to p = 21); E3 denotes the optimal linear combination; N(µ1,1) and N(µ2,1) are the two marginal (component) densities.

(µ1, µ2)      T      p=2     p=3     p=5     p=7     p=19    p=21     E3       N(µ1,1)   N(µ2,1)
(−0.5, 0.5)   50     1.306   1.456   2.033   3.569   7.208   9.986    0.673    3.418     3.418
(−0.5, 0.5)   100    0.725   0.823   1.158   1.894   3.700   5.494    0.476    3.418     3.418
(−0.5, 0.5)   200    0.384   0.442   0.647   0.995   1.868   3.335    0.390    3.418     3.418
(−0.5, 0.5)   400    0.215   0.240   0.379   0.543   0.982   2.270    0.344    3.418     3.418
(−0.5, 0.5)   1000   0.119   0.116   0.179   0.276   0.449   1.618    0.316    3.418     3.418
(−1, 1)       50     2.184   2.053   2.608   3.893   7.233   9.877    3.997    12.480    12.480
(−1, 1)       100    1.529   1.288   1.553   2.137   3.761   5.591    3.792    12.480    12.480
(−1, 1)       200    1.104   0.801   0.890   1.225   1.995   3.415    3.676    12.480    12.480
(−1, 1)       400    0.927   0.565   0.478   0.723   1.072   2.315    3.617    12.480    12.480
(−1, 1)       1000   0.813   0.430   0.221   0.358   0.519   1.665    3.586    12.480    12.480
(−2, 2)       50     9.409   5.926   4.160   4.916   7.745   10.098   22.252   35.663    35.663
(−2, 2)       100    8.403   4.727   2.451   2.883   4.088   5.786    22.039   35.663    35.663
(−2, 2)       200    8.107   4.240   1.474   1.738   2.317   3.641    21.920   35.663    35.663
(−2, 2)       400    7.897   3.981   0.982   1.012   1.360   2.521    21.873   35.663    35.663
(−2, 2)       1000   7.687   3.774   0.690   0.525   0.742   1.841    21.838   35.663    35.663
4 Empirical Application
We consider S&P 500 daily percent log returns data from 3 January 1972 to 16 December
2005, the exact same dataset used by Geweke & Amisano (2010, 2011) in their analysis
of optimal linear pools.1 Following Geweke & Amisano (2010, 2011) we then estimate a
Gaussian GARCH(1,1) model and a Student t-GARCH(1,1) model via maximum likelihood (ML) using rolling samples of 1250 trading days (about five years). One day ahead
density forecasts are then produced recursively from the two models for the return on 15
December 1976 through to the return on 19 December 2005, delivering a full out-of-sample
period of 7325 daily observations. The predictive densities are formed by substituting the
ML estimates for the unknown parameters.

1 We thank Amisano and Geweke for providing both their data and computational details.
The two component densities are then combined using either a linear or generalised
combination scheme. We evaluate the generalised combination scheme in two ways.
Firstly, we fit the generalised and linear combinations over the whole dataset, in-sample (t = 1, ..., T). This involves replicating part of the empirical analysis in Geweke &
Amisano (2011) who compared the optimised linear combination with these two marginal
(component) densities and like us (see Table 3 below) find gains to linear combination.
But of course we then consider our generalised combination scheme, setting p = 2, 5, 7,
with threshold intervals fixed as shown in Table 4. We compare the predictive log score
(LogS) of these three generalised combinations with the optimal linear combination as well
as the two marginal component densities themselves. We test equal predictive accuracy
between the linear combination and the other densities based on their Kullback-Leibler Information Criterion (KLIC) differences. A general test is provided by Giacomini & White (2006); a Wald-type test statistic is given as
$$T\left(T^{-1}\sum_{t=1}^{T}\Delta L_t\right)' \Sigma^{-1}\left(T^{-1}\sum_{t=1}^{T}\Delta L_t\right), \qquad (32)$$
where $\Delta L_t$ is the difference in the logarithmic scores of the two prediction models at time $t$ (its expectation equals their KLIC, or distance measure) and $\Sigma$ is an appropriate heteroscedasticity and autocorrelation consistent (HAC) estimate of the asymptotic covariance matrix. Under the null hypothesis of equal accuracy, $E(\Delta L_t) = 0$, Giacomini & White (2006) show that the test statistic tends to $\chi_1^2$ as $T \to \infty$.
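A hedged sketch of this test for the scalar case used here is given below; it is our own code, with a Bartlett-kernel Newey-West estimate standing in for the HAC estimator and an arbitrary truncation lag, rather than the authors' implementation.

```python
# Sketch of the equal predictive accuracy test in (32): a Wald-type statistic on
# the mean log-score difference, with a Newey-West long-run variance, compared
# against a chi-squared distribution with one degree of freedom.
import numpy as np
from scipy.stats import chi2

def klic_test(logscore_a, logscore_b, lags=10):
    """Test E(Delta L_t) = 0, where Delta L_t = log score of A minus log score of B."""
    d = np.asarray(logscore_a) - np.asarray(logscore_b)
    T = d.size
    dbar = d.mean()
    u = d - dbar
    # Newey-West long-run variance with Bartlett weights.
    sigma2 = u @ u / T
    for k in range(1, lags + 1):
        w = 1.0 - k / (lags + 1.0)
        sigma2 += 2.0 * w * (u[k:] @ u[:-k]) / T
    stat = T * dbar ** 2 / sigma2
    return stat, chi2.sf(stat, df=1)
```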
Table 3 shows that the generalised combinations deliver, as expected, higher in-sample
scores than either of the component densities or the optimal linear combination. Table 4
helps to explain why by reporting the (in-sample) combination weights by region of the
density. These weights are obviously not constrained to sum to unity, and can take any
positive value. They vary considerably both across and within regions, with the effect
that the generalised densities are much more flexible than either component density or the
linear combination: the component densities are building blocks which are put together
statistically to maximise the logarithmic score of the ensuing combined density. Figure 1
illustrates this flexibility by plotting the density forecasts for the return on 19 December
2005 from the linear and the three generalised combinations (p = 2, 5, 7). Alongside
these forecasts we plot a nonparametric estimate of the (unconditional) returns density,
estimated over the full-sample. We see from Figure 1 that the generalised densities better
capture the fat-tails in the data, albeit with spikes in the generalised densities observed
at the thresholds.
Secondly, we estimate the combination weights using the first 6000 observations and
evaluate the resulting densities over the remaining sample of 1325 observations, 12 September 2000 to 19 December 2005. This means we are performing a real-time exercise as only
past data are used for optimisation. This is an important test given the extra parameters involved when estimating the generalised rather than linear combination. As before,
we consider three generalised schemes setting p = 2, 5, 7 and compare the predictive log
score (LogS) of these generalised combinations with the optimal linear combination as
well as the two marginal component densities themselves. We again test equal predictive
accuracy between the linear combination and the other densities based on their Kullback
Leibler Information Criterion (KLIC) differences.
Table 5 shows that real-time generalised combinations do deliver higher scores than
either of the component densities themselves. Combination improves accuracy. But real-time optimal linear combinations do not improve upon the normal-GARCH density. Thus
while, as Geweke & Amisano (2011) show on this same dataset, optimal linear combinations will at least match the performance of the best component density when estimated
over the full sample (t = 1, ..., T ), there is no guarantee that the optimal combination
will help out-of-sample when computed recursively. Importantly, inspection of the KLIC
test statistics shows that the gains associated with use of the generalised combinations
are statistically significant at the 99% confidence level. While there are improvements in the average logarithmic score associated with more flexible generalised combinations, such as when p = 7 rather than p = 2, the KLIC tests indicate that relative to the linear combination there are benefits to using a parsimonious weighting function. While a high p pays off in terms of delivering a higher logarithmic score on average, it is associated with higher variability, which is penalised in the KLIC test.
Table 3: Average predictive logarithmic score (LogS) and equal predictive accuracy test (KLIC), versus the optimal linear combination: 3 January 1972 - 19 December 2005 (7325 observations). In-sample.

                                              LogS         KLIC
Marginal                  Normal-GARCH        −9544.4154   -2.824
                          t-GARCH             −9363.5734   -1.205
Linear combination                            −9308.0246
Generalised combination   p = 2               −8502.2566   18.174
                          p = 5               −7681.9769   19.566
                          p = 7               −7354.2126   18.958
Table 4: Model weights for the three generalised combination schemes over the whole sample.

Intervals (in %):
Model            (−∞,−2.5)  (−2.5,−2)  (−2,−1)  (−1,0)  (0,1)   (1,2)   (2,2.5)  (2.5,∞)
Scheme 1: p = 2
Normal-GARCH     2.239      2.239      2.239    0.006   0.006   0.000   0.000    0.000
t-GARCH          0.692      0.692      0.692    0.831   0.831   2.736   2.736    2.736
Scheme 2: p = 5
Normal-GARCH     80.805     80.805     2.344    0.205   0.005   0.000   56.421   56.421
t-GARCH          6.407      6.407      0.000    0.619   0.846   2.242   1.178    1.178
Scheme 3: p = 7
Normal-GARCH     1035.967   50.326     2.346    0.116   0.005   0.022   32.019   993.512
t-GARCH          34.840     0.165      0.000    0.706   0.846   2.218   0.000    2.915
Table 5: Average predictive logarithmic score (LogS) and equal predictive accuracy test (KLIC), versus the optimal linear combination: 12 September 2000 - 19 December 2005 (1325 observations). Out-of-sample.

                                              LogS         KLIC
Marginal                  Normal-GARCH        −1902.6516   0.953
                          t-GARCH             −1912.1234   -4.564
Linear combination                            −1907.7278
Generalised combination   p = 2               −1874.0055   9.673
                          p = 5               −1777.4188   9.296
                          p = 7               −1725.7779   8.956

5 Conclusion
With the growing recognition that point forecasts are best seen as the central points of
ranges of uncertainty, more attention is now paid to density forecasts. Coupled with uncertainty about the best means of producing these density forecasts, and practical experience
that combination can render forecasts more accurate, density forecast combinations are
being used increasingly in macroeconomics and finance.
This paper extends this existing literature by letting the combination weights follow
more general schemes. It introduces generalised density forecast combinations, where the
combination weights depend on the variable one is trying to forecast. Specific attention
is paid to the use of piecewise linear weight functions that let the weights vary by region
of the density. These weighting schemes are examined theoretically, with sieve estimation
used to optimise the score of the generalised density combination. The paper then shows
both in simulations and in an application to S&P500 returns that the generalised combinations can deliver more accurate forecasts than linear combinations with optimised
but fixed weights as in Hall & Mitchell (2007) and Geweke & Amisano (2011). Their
use therefore seems to offer the promise of more effective forecasts in the presence of a
changing economic climate.
Figure 1: Linear and generalised combination density forecasts (for p = 2, 5, 7) for the
S&P500 daily return on 19 December 2005 compared with a nonparametric estimate of
the (unconditional) returns density (dashed-line)
References
Amemiya, T. (1985), Advanced Econometrics, Harvard University Press.
Billio, M., Casarin, R., Ravazzolo, F. & van Dijk, H. K. (2013), ‘Time-varying combinations of predictive densities using nonlinear filtering’, Journal of Econometrics .
Forthcoming.
Chen, X. & Shen, X. (1998), ‘Sieve extremum estimates for weakly dependent data’,
Econometrica 66(2), 289–314.
Davidson, J. (1994), Stochastic Limit Theory, Oxford University Press.
Diebold, F. X., Gunther, T. A. & Tay, A. S. (1998), ‘Evaluating density forecasts with applications to financial risk management’, International Economic Review 39, 863–883.
Diks, C., Panchenko, V. & van Dijk, D. (2011), ‘Likelihood-based scoring rules for comparing density forecasts in tails’, Journal of Econometrics 163(2), 215–230.
Geweke, J. & Amisano, G. (2010), ‘Comparing and evaluating Bayesian predictive distributions of asset returns’, International Journal of Forecasting 26(2), 216–230.
Geweke, J. & Amisano, G. (2011), ‘Optimal prediction pools’, Journal of Econometrics
164(1), 130–141.
Geweke, J. & Amisano, G. (2012), ‘Prediction with misspecified models’, American Economic Review: Papers and Proceedings 102(3), 482–486.
Giacomini, R. & White, H. (2006), ‘Tests of conditional predictive ability’, Econometrica
74, 1545–1578.
Gneiting, T. & Raftery, A. E. (2007), ‘Strictly proper scoring rules, prediction, and estimation’, Journal of the American Statistical Association 102, 359–378.
Hall, S. G. & Mitchell, J. (2007), ‘Combining density forecasts’, International Journal of
Forecasting 23, 1–13.
de Jong, R. M. (1997), ‘Central limit theorems for dependent heterogeneous random variables’, Econometric Theory 13, 353–367.
Jore, A. S., Mitchell, J. & Vahey, S. P. (2010), ‘Combining forecast densities from VARs
with uncertain instabilities’, Journal of Applied Econometrics 25, 621–634.
Rossi, B. (2012), Advances in Forecasting under Instabilities, in G. Elliott & A. Timmermann, eds, ‘Handbook of Economic Forecasting, Volume 2’, Elsevier-North Holland
Publications. Forthcoming.
Waggoner, D. F. & Zha, T. (2012), ‘Confronting model misspecification in macroeconomics’, Journal of Econometrics 171(2), 167 – 184.
Appendix
Proof of Theorem 1
$C$ or $C_i$, where $i$ takes integer values, denote generic finite positive constants.

We wish to prove that minimisation of
$$L_T(\vartheta_p) = \sum_{t=1}^{T} l\left(p_t(y_t, \vartheta_p); y_t\right),$$
where
$$p_t(y, \vartheta_p) = \sum_{i=1}^{N} w_{p\eta,i}(y, \vartheta_p)\, q_{it}(y), \qquad w_{p\eta,i}(y, \vartheta_p) = v_{i0} + \sum_{s=1}^{p} \nu_{is}\,\eta_s(y, \theta_{is}),$$
produces an estimate, denoted by $\hat{\vartheta}_T$, of the value of $\vartheta_p = \left(\nu_1', \ldots, \nu_p', \theta_1', \ldots, \theta_p'\right)'$ that minimises $\lim_{T\to\infty} E\left(L_T(\vartheta_p)\right) = L(\vartheta_p)$, denoted by $\vartheta_p^0$, and that this estimate is asymptotically normal with an asymptotic variance given in the statement of the Theorem. To prove this we use Theorem 4.1.3 of Amemiya (1985). The conditions of this Theorem are satisfied if the following hold:
$$L_T(\vartheta_p) \to_p L(\vartheta_p), \text{ uniformly over } \vartheta_p, \qquad (33)$$
$$\frac{1}{T}\left.\frac{\partial L_T(\vartheta_p)}{\partial \vartheta_p}\right|_{\vartheta_p^0} \to_d N(0, V_2), \qquad (34)$$
$$\frac{1}{T}\left.\frac{\partial^2 L_T(\vartheta_p)}{\partial \vartheta_p\,\partial \vartheta_p'}\right|_{\hat{\vartheta}_T} \to_p V_1. \qquad (35)$$
To prove (33), we note that by Theorems 21.9, 21.10 and (21.55)-(21.57) of Davidson (1994), (33) holds if
$$L_T(\vartheta_p) \to_p L(\vartheta_p), \qquad (36)$$
and
$$\sup_{\vartheta_p \in \Theta_p} \left\|\frac{\partial L_T(\vartheta_p)}{\partial \vartheta_p}\right\| < \infty. \qquad (37)$$
(37) holds by the fact that $l\left(p_t(\cdot, \vartheta_p); \cdot\right)$ has uniformly bounded derivatives with respect to $\vartheta_p$ over $\Theta_p$ and $t$, by Assumption 2. (35)-(36) and (34) follow by Theorem 19.11 of Davidson (1994) and de Jong (1997), respectively, given Assumption 4 on the boundedness of the relevant moments and if the processes involved are NED processes on some $\alpha$-mixing processes satisfying the required size restrictions. To show the latter we need to show that (A) $l\left(p_t(y_t, \vartheta_p); y_t\right)$ and $l^{(2)}\left(p_t(y_t, \vartheta_p); y_t\right)$ are $L_1$-NED processes on an $\alpha$-mixing process, where $l^{(i)}\left(p_t(y_t, \vartheta_p); y_t\right)$ denotes the $i$-th derivative of $l$ with respect to $y_t$, and (B) $l^{(1)}\left(p_t(y_t, \vartheta_p); y_t\right)$ is an $L_r$-bounded, $L_2$-NED process, of size $-1/2$, on an $\alpha$-mixing process of size $-r/(r-2)$. These conditions are satisfied by Assumption 4, given Theorem 17.16 of Davidson (1994).
Proof of Theorem 2
We need to show that Conditions A of Chen & Shen (1998) hold. Condition A.1 is satisfied by Assumption 3. Conditions A.2-A.4 have to be confirmed for the particular instance of the loss function and the approximating function basis given in the Theorem. We show A.2 for
$$L_T = \sum_{t=1}^{T} -\log p_t(y_{t+1}),$$
where
$$p_t(y) = \sum_{i=1}^{N}\sum_{s=1}^{p_T} \nu_{is}\, q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right).$$
Let $\nu^0 = \left(\nu_1^0, \ldots, \nu_N^0\right)'$, $\nu_i^0 = \left(\nu_{i1}^0, \ldots\right)$ denote the set of coefficients that maximise $E(L_T)$ and $\nu_T$ a generic point of the space of coefficients $\{\nu_{T,is}\}_{i,s=1}^{N, p_T}$. We need to show that
$$\sup_{\left\{\left\|\nu^0 - \nu_T\right\| \leq \varepsilon\right\}} \operatorname{Var}\left(\log p_t\left(y, \nu^0\right) - \log p_t(y, \nu_T)\right) \leq C\varepsilon^2. \qquad (38)$$
We have
$$\log p_t\left(y, \nu^0\right) - \log p_t(y, \nu_T) = \log\left(\frac{p_t\left(y, \nu^0\right)}{p_t(y, \nu_T)}\right) = \log\left(\frac{\sum_{i=1}^{N}\sum_{s=1}^{p_T}\nu_{is}^0\, q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)}{\sum_{i=1}^{N}\sum_{s=1}^{p_T}\nu_{T,is}\, q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)}\right)$$
$$= \log\left(1 + \frac{\sum_{i=1}^{N}\sum_{s=1}^{p_T}\left(\nu_{is}^0 - \nu_{T,is}\right) q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)}{\sum_{i=1}^{N}\sum_{s=1}^{p_T}\nu_{T,is}\, q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)}\right).$$
But,
$$\left|\log\left(1 + \frac{\sum_{i=1}^{N}\sum_{s=1}^{p_T}\left(\nu_{is}^0 - \nu_{T,is}\right) q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)}{\sum_{i=1}^{N}\sum_{s=1}^{p_T}\nu_{T,is}\, q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)}\right)\right| \leq C_1\,\varepsilon + \frac{\left|\sum_{i=1}^{N}\sum_{s=1}^{p_T}\left(\nu_{is}^0 - \nu_{T,is}\right) q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)\right|}{\sum_{i=1}^{N}\sum_{s=1}^{p_T}\nu_{T,is}\, q_{it}(y)\, I\left(r_{s-1} \leq y < r_s\right)}.$$
Then, (38) follows from Assumption 5. Condition A.4, requiring that
$$\sup_{\left\{\left\|\nu^0 - \nu_T\right\| \leq \varepsilon\right\}} \left|\log p_t\left(y, \nu^0\right) - \log p_t(y, \nu_T)\right| \leq \varepsilon^s\, U_T(y), \qquad (39)$$
where $\sup_T E\left(U_T(y_t)\right)^\gamma < \infty$, for some $\gamma > 2$, follows similarly.
Next we focus on Condition A.3. First, we need to determine the rate at which a function in $\Phi_{q_i}^T$, where $\eta_s = I\left(r_{s-1} \leq y < r_s\right)$, can approximate a continuous function, $f$, with finite $L_2$-norm defined on a compact interval of the real line denoted by $M = [M_1, M_2]$. To simplify notation and without loss of generality we denote the piecewise linear approximating function by $\sum_{s=0}^{p_T} \nu_s I\left(r_{s-1} \leq y < r_s\right)$ and would like to determine the rate at which
$$\left(\int_M \left(\sum_{s=1}^{p_T} \nu_s I\left(r_{s-1} \leq y < r_s\right) - f(y)\right)^2 dy\right)^{1/2}$$
converges to zero as $p_T \to \infty$. We assume that the triangular array $\left\{\{r_s\}_{s=0}^{p_T}\right\}_{T=1}^{\infty} = \left\{\{r_{Ts}\}_{s=0}^{p_T}\right\}_{T=1}^{\infty}$ defines an equidistant grid in the sense that $M_1 \leq r_{T0} < r_{Tp_T} \leq M_2$ and
$$\sup_s\, \left(r_s - r_{s-1}\right) = O\left(\frac{1}{p_T}\right) \quad \text{and} \quad \inf_s\, \left(r_s - r_{s-1}\right) = O\left(\frac{1}{p_T}\right).$$
For simplicity, we set $M_1 = r_{T0}$ and $r_{Tp_T} = M_2$. We have
$$\int_M \left(\sum_{s=1}^{p_T} \nu_s I\left(r_{s-1} \leq y < r_s\right) - f(y)\right)^2 dy = \sum_{s=0}^{p_T} \int_{r_{s-1}}^{r_s} \left(\nu_s - f(y)\right)^2 dy.$$
By continuity of $f$ we have that there exist $\nu_s$ such that
$$\sup_s \sup_{y \in [r_{s-1}, r_s]} \left|\nu_s - f(y)\right| = O\left(\frac{1}{p_T}\right).$$
This implies that
$$\sup_s \int_{r_{s-1}}^{r_s} \left(\nu_s - f(y)\right)^2 dy \leq \frac{C_1}{p_T^2} \sup_s \int_{r_{s-1}}^{r_s} dy \leq \frac{C_1}{p_T^3},$$
which implies that
$$\sum_{s=0}^{p_T} \int_{r_{s-1}}^{r_s} \left(\nu_s - f(y)\right)^2 dy \leq p_T \sup_s \int_{r_{s-1}}^{r_s} \left(\nu_s - f(y)\right)^2 dy \leq \frac{C_1}{p_T^2},$$
giving
$$\left(\int_M \left(\sum_{s=1}^{p_T} \nu_s I\left(r_{s-1} \leq y < r_s\right) - f(y)\right)^2 dy\right)^{1/2} = O\left(p_T^{-1}\right).$$
The next step involves determining $H_{\Phi_{q_i}^T}(\epsilon)$, which denotes the bracketing $L_2$ metric entropy of $\Phi_{q_i}^T$. $H_{\Phi_{q_i}^T}(\epsilon)$ is defined as the logarithm of the cardinality of the $\epsilon$-bracketing set of $\Phi_{q_i}^T$ that has the smallest cardinality among all $\epsilon$-bracketing sets. An $\epsilon$-bracketing set for $\Phi_{q_i}^T$, with cardinality $Q$, is defined as a set of $L_2$-bounded functions $h_1^l, h_1^u, \ldots, h_Q^l, h_Q^u$ such that $\max_j \left\|h_j^u - h_j^l\right\| \leq \epsilon$ and for any function $h$ in $\Phi_{q_i}^T$, defined on $M = [M_1, M_2]$, there exists $j$ such that $h_j^l \leq h \leq h_j^u$ almost everywhere. We determine the bracketing $L_2$ metric entropy of $\Phi_{q_i}^T$, where $\eta_s = I\left(r_{s-1} \leq y < r_s\right)$, from first principles. By the definition of $\Phi_{q_i}^T$, any function in $\Phi_{q_i}^T$ is bounded. We set
$$\sup_T \sup_{h \in \Phi_{q_i}^T} \sup_y |h(y)| = B < \infty.$$
Then, it is easy to see that an $\epsilon$-bracketing set for $\Phi_{q_i}^T$ is given by $\left\{h_i^l, h_i^u\right\}_{i=1}^{Q}$, where
$$h_i^l = \sum_{s=0}^{p_T} \nu_{is}^l\, I\left(r_{s-1} \leq y < r_s\right), \qquad h_i^u = \sum_{s=0}^{p_T} \nu_{is}^u\, I\left(r_{s-1} \leq y < r_s\right),$$
$\nu_{is}^l = \nu_{Tis}^l$ takes values in $\{-B, -B + \epsilon/p_T, -B + 2\epsilon/p_T, \ldots, B - \epsilon/p_T\}$ and $\nu_{is}^u = \nu_{Tis}^u$ takes values in $\{-B + \epsilon/p_T, -B + 2\epsilon/p_T, \ldots, B\}$. Clearly, for $\epsilon > C > 0$, $Q = Q_T = O\left(p_T^2\right)$ and so, then, $H_{\Phi_{q_i}^T}(\epsilon) = \ln\left(\frac{2Bp_T^2}{\epsilon}\right) = O\left(\ln(p_T)\right)$. For some $\delta$, such that $0 < \delta < 1$, Condition A.3 of Chen & Shen (1998) involves
$$\delta^{-2}\int_{\delta^2}^{\delta} H_{\Phi_{q_i}^T}^{1/2}(\epsilon)\, d\epsilon = \delta^{-2}\int_{\delta^2}^{\delta} \ln^{1/2}\left(\frac{2Bp_T^2}{\epsilon}\right) d\epsilon \leq \delta^{-2}\int_{\delta^2}^{\delta} \ln\left(\frac{2Bp_T^2}{\epsilon}\right) d\epsilon = \delta^{-2}\left(\delta \ln\left(\frac{2Bp_T^2}{\delta}\right) - \delta^2 \ln\left(\frac{2Bp_T^2}{\delta^2}\right)\right) < \delta^{-1}\ln\left(\frac{2Bp_T^2}{\delta}\right),$$
which must be less than $T^{1/2}$. We have that
$$\delta^{-1}\ln\left(\frac{2Bp_T^2}{\delta}\right) \leq T^{1/2}$$
or
$$\delta^{-1}\ln\left(\frac{p_T^2}{\delta}\right) = o\left(T^{1/2}\right).$$
Setting $\delta = \delta_T$ and parametrising $\delta_T = T^{-\varphi}$ and $p_T = T^{\phi}$ gives $\varphi < 1/2$. So using the result of Theorem 1 of Chen & Shen (1998) gives
$$\nu^0 - \hat{\nu}_T = O_p\left(\max\left(T^{-\varphi}, T^{-\phi}\right)\right) = O_p\left(T^{-\min(\varphi,\phi)}\right).$$
Proof of Corollary 3
We need to show that the conditions of Theorem 2 hold for the choices made in the statement of the Corollary. We have that
$$l\left(p_t(y_t, \vartheta_p); y_t\right) = \log\left(\sum_{i=1}^{N}\sum_{s=1}^{p} \nu_{is}\, q_{it}(y_t)\, I\left(r_{s-1} \leq y_t < r_s\right)\right).$$
Without loss of generality we can focus on a special case given by
$$l\left(p_t(y_t, \vartheta_p); y_t\right) = \log\left(\nu_1 q_1(y_t) + \nu_2 q_2(y_t)\right).$$
We have
$$l^{(1)}\left(p_t(y_t, \vartheta_p); y_t\right) = \frac{q_1(y_t)}{\nu_1 q_1(y_t) + \nu_2 q_2(y_t)},$$
and
$$l^{(2)}\left(p_t(y_t, \vartheta_p); y_t\right) = -\frac{q_1(y_t)^2}{\left(\nu_1 q_1(y_t) + \nu_2 q_2(y_t)\right)^2}.$$
It is clear that both $l^{(1)}\left(p_t(y_t, \vartheta_p); y_t\right)$ and $l^{(2)}\left(p_t(y_t, \vartheta_p); y_t\right)$ are bounded functions of $y_t$, so, by Theorem 17.13 of Davidson (1994), the NED properties of $y_t$ given in Assumption 3 are inherited by $l^{(1)}\left(p_t(y_t, \vartheta_p); y_t\right)$ and $l^{(2)}\left(p_t(y_t, \vartheta_p); y_t\right)$. So we focus on $l\left(p_t(y_t, \vartheta_p); y_t\right)$ and note that, since $q(y) \sim \exp\left(-y^2\right)$ as $y \to \pm\infty$,
$$\log\left(\exp\left(-y^2\right)\right) = -y^2.$$
Then, we use Example 17.17 of Davidson (1994), which gives that if $y_t$ is $L_2$-NED of size $-a$ and $L_r$-bounded ($r > 2$), then $y_t^2$ is $L_2$-NED of size $-a(r-2)/2(r-1)$. Since we need the NED size of $l\left(p_t(y_t, \vartheta_p); y_t\right)$ to be greater than $1/2$, the minimum acceptable value for $r$ is given by $a \geq (r-1)/(r-2)$.