Particle Learning and Smoothing
Carlos M. Carvalho, Michael Johannes, Hedibert F. Lopes and Nicholas Polson∗
First draft: December 2007
This version: February 2009
Abstract
In this paper we develop particle learning (PL) methods for state filtering, sequential parameter learning and smoothing in a general class of nonlinear state space models. The approach extends existing particle methods by incorporating static parameters and by utilizing sufficient statistics for the parameters and/or the states as particles. State smoothing with parameter uncertainty is also solved as a by-product of particle learning. In a number of applications, we show that our algorithms outperform existing particle filtering algorithms as well as MCMC.
Keywords: Particle Learning, Filtering, Smoothing, Mixture Kalman Filter, Bayesian
Inference, Bayes factor, State Space Models.
∗ Carvalho, Lopes and Polson are at The University of Chicago, Booth School of Business, 5807 S. Woodlawn, Chicago IL 60637. Email: carlos.carvalho@chicagobooth.edu, hlopes@chicagobooth.edu and ngp@chicagobooth.edu. Johannes is at the Graduate School of Business, Columbia University, 3022 Broadway, New York, NY 10027. Email: mj335@columbia.edu.
1 Introduction
There are two general inference problems in state space models. The first is inherently sequential and involves learning about fixed parameters (θ) and latent states (x_t) using observations up to time t, denoted by y^t = (y_1, . . . , y_t). This problem is solved by computing the joint posterior distribution p(θ, x_t | y^t) sequentially over time, as new data arrive. Most of the literature focuses exclusively on state filtering conditional on parameters, computing p(x_t | θ, y^t) using particle filtering methods. Extending these algorithms to handle parameter learning is an active area of research, with important developments appearing in Liu and West (2001), Storvik (2002) and Fearnhead (2002), to cite a few. The second problem is state smoothing, computing p(x^T | y^T), whilst accounting for parameter uncertainty.
On the theoretical side, we provide a new particle learning (PL) approach for computing the sequence of filtering and smoothing distributions, p(θ, x_t | y^t) and p(x^T | y^T), for a wide class of models, and document its performance against common competing approaches. The central idea behind PL is the creation of a particle filter that directly samples from the joint posterior distribution of x_t and the conditional sufficient statistics for θ, denoted s_t. This is achieved by updating the particle approximation with a resample-propagate framework, as opposed to the typical propagate-resample approaches. Moreover, we explicitly treat conditional sufficient statistics as states to improve efficiency.
On the empirical side, we show that the performance of particle learning can be further improved by marginalization of the states, i.e., by extending the use of sufficient statistics to states. We provide extensive simulation evidence comparing our approach to competing methods. Other concrete contributions are listed below.
Sequential Learning. For Gaussian dynamic linear models (DLMs) and conditionally Gaussian DLMs (CDLMs), particle filters are defined over both state and parameter sufficient statistics, such as the Kalman filter's mean and variance. Our approach extends Liu and Chen's (2000) mixture Kalman filter (MKF) method by allowing for parameter learning. By resampling first, it also provides a more efficient algorithm for pure filtering. We also show that nonlinearity in the state evolution, combined with linearity in the observation equation, is straightforward to handle within the PL framework. This dramatically extends the class of models to which MKF particle methods apply. Although we are no longer able to marginalize out the state variables and use state sufficient statistics, we can still utilize parameter sufficient statistics and generate exact samples from the particle approximations. In a comprehensive simulation study, we demonstrate that PL provides a significant improvement over current benchmarks in the literature, such as Liu and West (2001).
Smoothing. We extend Godsill, Doucet and West's (2004) smoothing results to all models considered above. In doing so, PL also extends Liu and Chen's (2000) MKF to the computation of the sequence of smoothed distributions, p(x^T | y^T). Posterior inference of this kind is typically carried out via Markov chain Monte Carlo (MCMC) methods, as p(x^T | y^T) is a difficult-to-compute marginal of p(x^T, θ | y^T). Key references are Carlin, Polson and Stoffer (1992), Carter and Kohn (1994) and Frühwirth-Schnatter (1994). Our simulations provide strong evidence that PL dominates the standard MCMC strategies in computing time and delivers similar accuracy when computing p(x^T | y^T). Gains are most remarkable in models with nonlinearities (CGDMs), where the common alternative of MCMC with single-state updates is well known to be inefficient (Papaspiliopoulos and Roberts, 2008). Another advantage of PL over MCMC in these classes of models is direct and recursive access to approximations of predictive distributions, a key element in sequential model comparison.
The paper is outlined as follows. The rest of this section reviews the most popular sequential Monte Carlo schemes for pure filtering and parameter learning. Section 2 presents the general filtering and smoothing strategy for PL. Section 3 considers the class of conditional dynamic linear models; in this context we present, in the Appendix, theoretical results that motivate and justify the good performance of PL approximations. Section 4 discusses the implementation of our methods in nonlinear specifications. Sections 5 and 6 compare the performance of PL to MCMC and to alternative particle methods.
1.1 Particle filtering review
First, we review the three main types of algorithm available for pure filtering. Traditionally, researchers have focused on propagate-resample (or resample-move) algorithms. Our approach will instead exploit resample-propagate algorithms and marginalization (Rao-Blackwellization) over states and parameters. Moreover, we show how using conditional sufficient statistics replaces the static sequential parameter learning problem with a state filtering problem. Traditional particle filters use sequential importance sampling (SIS), which we now describe.
Sequential importance sampling with resampling (SISR). This algorithm uses the following representation of the filtering density:

$$p(x_{t+1} | y^{t+1}) = \frac{p(y_{t+1} | x_{t+1}) \, p(x_{t+1} | y^t)}{p(y_{t+1} | y^t)} \quad\text{and}\quad p(x_{t+1} | y^t) = \int p(x_{t+1} | x_t) \, p(x_t | y^t) \, dx_t,$$
which leads to a particle approximation of the form

$$p^N(x_{t+1} | y^{t+1}) = \sum_{i=1}^N w_i(x_{t+1}, y_{t+1}) \, p(x_{t+1} | x_t^{(i)}),$$

and to the following blind importance sampling algorithm:

1. Draw x_{t+1}^{(i)} ~ p(x_{t+1} | x_t^{(i)}).
2. Compute weights w_i(x_{t+1}, y_{t+1}) ∝ p(y_{t+1} | x_{t+1}^{(i)}) and resample {x_{t+1}^{(i)}}_{i=1}^N.
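To make the mechanics concrete, the following minimal sketch implements one blind SISR step for a hypothetical AR(1)-plus-noise model; the model, parameter values and function name are illustrative assumptions, not part of the paper.

```python
import numpy as np

def bootstrap_step(x, y_next, rng, phi=0.9, tau=0.5, sigma=1.0):
    # 1. Blind propagation: draw x_{t+1}^(i) ~ p(x_{t+1} | x_t^(i))
    x_new = phi * x + tau * rng.standard_normal(x.size)
    # 2. Weight by the likelihood w_i ∝ p(y_{t+1} | x_{t+1}^(i)) and resample
    logw = -0.5 * ((y_next - x_new) / sigma) ** 2
    w = np.exp(logw - logw.max())
    idx = rng.choice(x.size, size=x.size, p=w / w.sum())
    return x_new[idx]

rng = np.random.default_rng(0)
particles = rng.standard_normal(1000)   # draws from a hypothetical p(x_0)
particles = bootstrap_step(particles, y_next=0.7, rng=rng)
```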
SISR with optimal importance function. Here, the approximation of the filtering density is the same, but we use a more efficient propagation distribution, which in turn changes the resampling weights:

1. Draw x_{t+1}^{(i)} ~ p(x_{t+1} | x_t^{(i)}, y_{t+1}).
2. Compute $w_i(x_t, y_{t+1}) \propto \frac{p(y_{t+1} | x_{t+1}) \, p(x_{t+1} | x_t)}{p(x_{t+1} | x_t, y_{t+1})} = p(y_{t+1} | x_t)$ and resample (x_{t+1}, x_t)^{(i)}.

Using y_{t+1} in the first step allows the algorithm to "look forward" when propagating particles, leading to efficiency gains.
Auxiliary particle filter (APF). The APF uses a different representation of the filtering distribution,

$$p(x_{t+1} | y^{t+1}) = \int \frac{p(y_{t+1} | x_t)}{p(y_{t+1} | y^t)} \, p(x_{t+1} | x_t, y_{t+1}) \, p(x_t | y^t) \, dx_t,$$

which leads to a different mixture approximation:

$$p^N(x_{t+1} | y^{t+1}) = \sum_{i=1}^N w_i(x_t, y_{t+1}) \, p(x_{t+1} | x_t^{(i)}, y_{t+1}).$$

1. Compute weights w_i(x_t, y_{t+1}) ∝ p(y_{t+1} | x_t^{(i)}) and resample {x_t^{(i)}}_{i=1}^N to get x_t^{k(i)}.
2. Propagate x_{t+1}^{(i)} ~ p(x_{t+1} | x_t^{k(i)}, y_{t+1}).
This requires that

$$p(y_{t+1} | x_t) = \int p(y_{t+1} | x_{t+1}) \, p(x_{t+1} | x_t) \, dx_{t+1}$$

is known or can be easily computed, and that p(x_{t+1} | x_t, y_{t+1}) is known. Notice that var(w(x_t, y_{t+1})) < var(w(x_{t+1}, y_{t+1})), so the weights are more evenly distributed. In Section 2.3 we show how to further marginalize, using state sufficient statistics, to improve efficiency.
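For comparison, here is a sketch of the corresponding fully adapted resample-propagate step for the same hypothetical AR(1)-plus-noise model, where both p(y_{t+1} | x_t) and p(x_{t+1} | x_t, y_{t+1}) happen to be Gaussian and available in closed form.

```python
import numpy as np

def apf_step(x, y_next, rng, phi=0.9, tau=0.5, sigma=1.0):
    # 1. Resample with w_i ∝ p(y_{t+1} | x_t^(i)) = N(phi*x, tau^2 + sigma^2)
    logw = -0.5 * (y_next - phi * x) ** 2 / (tau**2 + sigma**2)
    w = np.exp(logw - logw.max())
    x = x[rng.choice(x.size, size=x.size, p=w / w.sum())]
    # 2. Propagate from p(x_{t+1} | x_t, y_{t+1}), a Gaussian by conjugacy
    var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
    mean = var * (phi * x / tau**2 + y_next / sigma**2)
    return mean + np.sqrt(var) * rng.standard_normal(x.size)
```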
1.2 Sequential parameter learning review
Storvik's Algorithm. Storvik assumes that the posterior distribution of θ given x^t and y^t depends on a low-dimensional set of sufficient statistics that can be easily updated recursively, although as noted by Liu and Chen (2002) this is not necessary. The recursion for the sufficient statistics is given by s_{t+1} = S(s_t, x_{t+1}, y_{t+1}). This leads to the following SISR algorithm for parameter learning:

Storvik's Algorithm

1. Sample θ ~ p(θ | x^{t-1}, y^t).
2. Sample x_t ~ p(x_t | x_{t-1}, θ, y_t).
3. Resample x^t with weights

$$w_t \propto \frac{p(\theta | s_{t-1}) \, p(x_t | x_{t-1}, \theta) \, p(y_t | x_t, \theta)}{p(\theta | x^{t-1}, y^t) \, p(x_t | x_{t-1}, \theta, y_t)}.$$

4. Update the sufficient statistics s_t.

Here we have used the optimal importance functions for the parameters and states. Note, however, that Storvik's algorithm resamples the whole vector x^t and hence the particle approximation will decay very quickly.
Liu and West's algorithm. To solve the sequential state and parameter learning problem, Gordon, Salmond and Smith (1993) suggest incorporating artificial evolution noise for θ so that it can be treated as a state variable. The main drawback of their idea is simple: parameters are not states, and the introduction of this noise can degrade results. Their approach imposes a loss of information over time, as the artificial uncertainties added to the parameters eventually result in a very diffuse posterior density for θ. To overcome this, Liu and West (2001) suggest a kernel smoothing approximation of p(θ | y^t) which relies upon West's (1993) mixture modeling ideas, approximating a (possibly transformed) posterior density p(θ | y^t) by a mixture of multivariate normals. Specifically, let

$$\left\{ \left( x_t^{(l)}, \theta_t^{(l)}, w_t^{(l)} \right) \right\}_{l=1}^M \sim \hat{p}(x_t, \theta | y^t);$$

then the parameter posterior approximation is

$$\hat{p}(\theta | y^t) = \sum_{j=1}^M w_t^{(j)} \, N\!\left(\theta; \; a\theta_t^{(j)} + (1-a)\bar{\theta}_t, \; h^2 V_t \right)$$

with moments

$$\bar{\theta}_t = \sum_{j=1}^M w_t^{(j)} \theta_t^{(j)} \quad\text{and}\quad V_t = \sum_{j=1}^M w_t^{(j)} (\theta_t^{(j)} - \bar{\theta}_t)(\theta_t^{(j)} - \bar{\theta}_t)'.$$

The constants a and h measure, respectively, the extent of the shrinkage and the degree of over-dispersion of the mixture. As in Liu and West (2001), the choice of a and h depends on a discount factor δ in (0, 1], typically around 0.95-0.99, so that h² = 1 − ((3δ − 1)/(2δ))² and a = √(1 − h²). This is a way to specify the smoothing parameter h as a function of the amount of information preserved in each step of the filter. Analogously to Pitt and Shephard's (1999) algorithm, the sequential scheme proceeds as follows:
Liu and West's algorithm

1. Sample an index k^{(l)} with probability proportional to $p(y_{t+1} | \mu_{t+1}^{(j)}, \theta_t^{(j)}) \, w_t^{(j)}$, where $\mu_{t+1}^{(j)} = E(x_{t+1} | x_t^{(j)}, \theta^{(j)})$;
2. Sample $\theta_{t+1}^{(l)} \sim N(a\theta_t^{(k_l)} + (1-a)\bar{\theta}_t, \; h^2 V_t)$;
3. Sample $x_{t+1}^{(l)} \sim p(x_{t+1} | x_t^{(k_l)}, \theta_{t+1}^{(l)})$, which leads to weights

$$w_{t+1}^{(l)} \propto \frac{p(y_{t+1} | x_{t+1}^{(l)}, \theta_{t+1}^{(l)})}{p(y_{t+1} | \mu_{t+1}^{(k_l)}, \theta_t^{(k_l)})}$$

and samples from the posterior,

$$\left\{ \left( x_{t+1}^{(l)}, \theta_{t+1}^{(l)}, w_{t+1}^{(l)} \right) \right\}_{l=1}^M \sim \hat{p}(x_{t+1}, \theta | y^{t+1}).$$
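As a concrete illustration, a minimal sketch of the kernel-shrinkage refresh at the heart of step 2 above, for a scalar parameter, might look as follows; the function name and discount value are assumptions made for the example.

```python
import numpy as np

def lw_refresh(theta, w, rng, delta=0.98):
    # discount factor delta determines shrinkage a and kernel width h
    h2 = 1.0 - ((3.0 * delta - 1.0) / (2.0 * delta)) ** 2
    a = np.sqrt(1.0 - h2)
    w = w / w.sum()
    theta_bar = np.sum(w * theta)
    v = np.sum(w * (theta - theta_bar) ** 2)
    # shrink towards the weighted mean, then add kernel noise
    return a * theta + (1 - a) * theta_bar + np.sqrt(h2 * v) * rng.standard_normal(theta.size)
```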
2 Particle Learning and Smoothing
Consider a general state space model defined by the observation and evolution equations

$$y_{t+1} \sim p(y_{t+1} | x_{t+1}, \theta)$$
$$x_{t+1} \sim p(x_{t+1} | x_t, \theta),$$

with initial state distribution p(x_0 | θ) and prior p(θ). The sequential parameter learning and state filtering problem is characterized by the joint posterior distribution p(x_t, θ | y^t), whereas smoothing is characterized by p(x^T, θ | y^T), with y^t = (y_1, . . . , y_t), x^t = (x_1, . . . , x_t) and T denoting the last observation time.
Sequential sampling from the above posteriors is difficult due to their dimensionality and the complicated functional relationships between the parameters, states, and data. MCMC methods have been developed to solve the smoothing problem but are too slow for the sequential problem, which requires on-line simulation with a recursive or iterative structure. The classic example of recursive estimation is the Kalman filter (Kalman, 1960) for linear Gaussian models with known parameters. In this paper, we use a particle approach to sequentially filter p(x_t, θ | y^t) and obtain p(x^T | y^T) by backwards propagation, effectively solving the filtering, learning and smoothing problems.
Particle methods use a discrete representation of p(x_t, θ | y^t) via

$$p^N(x_t, \theta | y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{(x_t, \theta)^{(i)}},$$

where N is the number of particles, (x_t, θ)^{(i)} denotes a particle vector and δ_{(·)} is the Dirac delta function, representing the distribution degenerate at the particles. Given this particle approximation, the key problem is how to jointly propagate the parameter and state particles once y_{t+1} becomes available. This step is complicated because the state propagation depends on the parameters, and vice versa. To circumvent the co-dependence in a joint draw, it is common to use importance sampling. This, however, can lead to particle degeneracies, as the importance densities may not closely match the target densities.
Our approach relies on three main insights to deal with these problems: (i) conditional sufficient statistics are used to represent the posterior of θ, which allows us to think of the sufficient statistics as additional states that are sequentially updated; whenever possible, sufficient statistics for the latent states are also introduced, increasing the efficiency of the algorithm by reducing the variance of the sampling weights, in what can be called a Rao-Blackwellized filter. (ii) We use a resample-propagate framework to provide exact samples from our particle approximation when moving from p^N(x_t, θ | y^t) to p^N(x_{t+1}, θ | y^{t+1}). This avoids sampling importance resampling (SIR) (Gordon, Salmond and Smith, 1993) and the associated "decay" in the particle approximation. Finally, (iii) we extend the backward propagation smoothing algorithm of Godsill, Doucet and West (2004) to incorporate uncertainty about θ.
2.1 Parameter Learning
For all models considered, we assume that the posterior for the parameter vector θ admits a conditional sufficient statistics structure given the states and data (x^t, y^t), so that

$$p(\theta | x^t, y^t) = p(\theta | s_t),$$

where s_t is a recursively defined sufficient statistic,

$$s_t = S(s_{t-1}, x_t, y_t). \tag{1}$$
The use of sufficient statistics is foreshadowed in Gilks and Berzuini (2001), Storvik (2002) and Fearnhead (2002). Previous approaches to parameter learning include the addition of θ to the particle set, or the introduction of evolutionary noise, turning θ into an artificial state (Gordon et al., 1993). The first alternative breaks down quickly, as there is no mechanism for replenishing the particles of θ, whereas the second overestimates the uncertainty about θ by working with a different model. Liu and West (2001) propose a learning scheme adapted to the auxiliary particle filter that approximates p(θ | y^t) by a kernel mixture of normals. While widely used and broadly applicable, this strategy still degenerates fairly quickly, because the propagation mechanism for x_{t+1} and θ does not take into account the current information y_{t+1}. This is highlighted in our simulation studies of Section 6. Another alternative is to sample over (θ, x^t) with a Metropolis kernel, as in Gilks and Berzuini (2001) and Polson, Stroud and Müller (2008). The complexity of this strategy grows with t and suffers from the curse of dimensionality, as discussed in Bengtsson, Bickel and Li (2008). Our approach is of fixed dimension, as it targets the filtering distribution of (s_t, x_t).

The introduction of sufficient statistics provides the mechanism to break particle decay by sequentially targeting (s_t, x_t), where s_t is automatically replenished as a function of the current samples for x_t and the observation y_t. One way to interpret this strategy is to think of s_t as an additional latent state in the model, with a deterministic evolution determined by (1). This can be seen from the following factorization of the joint conditional posterior:

$$p(x_t, s_t, \theta | y^t) = p(\theta | s_t) \, p(x_t, s_t | y^t).$$
Our algorithm is based on developing a particle approximation p^N(x_t, s_t | y^t) to the joint distribution of states and sufficient statistics. Parameter learning for p(θ | y^t) is then performed "offline" by simulating particles from p(θ | s_t).
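As a toy instance of the recursion s_t = S(s_{t-1}, x_t, y_t), consider a local level model with an inverse-gamma prior on the observation variance: under that conjugacy assumption the conditional posterior p(σ² | x^t, y^t) is IG(n_t/2, d_t/2), and the sufficient statistic s_t = (n_t, d_t) updates deterministically. This is a sketch, not code from the paper.

```python
# Sufficient statistic update for sigma^2 in a local level model with an
# IG prior: p(sigma^2 | x^t, y^t) = IG(n_t/2, d_t/2). Illustrative sketch.
def update_suff_stats(s, x_t, y_t):
    n, d = s
    return (n + 1, d + (y_t - x_t) ** 2)

s = (5.0, 2.0)                 # hypothetical prior hyperparameters (n_0, d_0)
s = update_suff_stats(s, x_t=0.3, y_t=0.5)
```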
2.2 Re-sample-Propagate
Let z_t = (x_t, s_t, θ) and assume that at time t, i.e., after observing y^t, we have a particle approximation p^N(z_t | y^t), given by {z_t^{(i)}}_{i=1}^N. Once y_{t+1} is observed, our method updates this approximation using the following resample-propagate rule:

$$p(z_t | y^{t+1}) \propto p(y_{t+1} | z_t) \, p(z_t | y^t) \tag{2}$$

$$p(z_{t+1} | y^{t+1}) = \int p(s_{t+1} | x_{t+1}, s_t, y_{t+1}) \, p(x_{t+1} | z_t, y_{t+1}) \, p(z_t | y^{t+1}) \, dx_t \, ds_t. \tag{3}$$
From (2), we see that an updated approximation p^N(z_t | y^{t+1}) can be obtained by resampling the particles {z_t^{(i)}}_{i=1}^N with weights proportional to the predictive, leading to a particle approximation

$$p^N(z_t | y^{t+1}) = \sum_{i=1}^N w(z_t^{(i)}) \, \delta_{z_t^{(i)}}$$

with weights given by

$$w(z_t^{(i)}) = \frac{p(y_{t+1} | z_t^{(i)})}{\sum_{i=1}^N p(y_{t+1} | z_t^{(i)})}.$$

This updated approximation is used in (3) to generate propagated samples from the posterior p(x_{t+1} | z_t, y_{t+1}), which are then used to update s_{t+1}, deterministically, by the recursive map

$$s_{t+1} = S(s_t, x_{t+1}, y_{t+1}).$$
Notice that (3) is a slight abuse of notation, as p(s_{t+1} | x_{t+1}, s_t, y_{t+1}) is degenerate: it places all its mass on s_{t+1} = S(s_t, x_{t+1}, y_{t+1}). However, since s_t and x_{t+1} are random variables, the conditional sufficient statistics s_{t+1} are also random and are replenished, essentially as a state, in the filtering step. This is the key insight for handling the learning of θ. The particles for s_{t+1} are sequentially updated and replenished as a function of x_{t+1}, and updated samples from p(θ | s_{t+1}) can be obtained at the end of the filtering step.
Our approach departs from the standard particle filtering literature by inverting the order in which states are propagated and updated. By resampling first, we reduce the compounding of approximation errors, as the states are propagated after being "informed" by y_{t+1}. This is discussed in detail in the Appendix. In addition, borrowing the terminology of Pitt and Shephard (1999), PL is a fully adapted filter that starts from a particle approximation p^N(z_t | y^t) and provides direct (or exact) draws from the particle approximation p^N(z_{t+1} | y^{t+1}). This reduces the accumulation of Monte Carlo error, as there is no need to re-weight or compute importance sampling weights at the end of the filter. To clarify this notion, we can re-write the problem of updating the particles {z_t^{(i)}}_{i=1}^N to {z_{t+1}^{(i)}}_{i=1}^N as the problem of obtaining samples from the target p(x_{t+1}, z_t | y^{t+1}) based on draws from the proposal p(z_t | y^{t+1}) p(x_{t+1} | z_t, y^{t+1}), yielding importance weights

$$w_{t+1} \propto \frac{p(x_{t+1}, z_t | y^{t+1})}{p(z_t | y^{t+1}) \, p(x_{t+1} | z_t, y^{t+1})} = 1, \tag{4}$$

and therefore exact draws. Sampling from the proposal is done in two steps: first, draws z_t ~ p(z_t | y^{t+1}) are obtained simply by resampling the particles {z_t^{(i)}}_{i=1}^N with weights proportional to p(y_{t+1} | z_t); we can then sample x_{t+1} ~ p(x_{t+1} | z_t, y^{t+1}). Finally, updated samples for s_{t+1} are obtained as a function of the samples of x_{t+1}, with weights 1/N, which prevents particle degeneracies in the estimation of θ. This is a feature of the resample-propagate mechanism of PL; any propagate-resample strategy will lead to decay in the particles of x_{t+1}, with significant negative effects on p^N(θ | s_{t+1}). This strategy is only possible when both p(y_{t+1} | z_t) and p(x_{t+1} | z_t, y^{t+1}) are analytically tractable, which is the case for the classes of models considered here. Notice, however, that the above interpretation provides a clear intuition for the construction of effective filters, as the goal is to come up with proposals that closely match the densities in the denominator of (4).
Particle learning

1. Resample particles z_t = (x_t, s_t, θ) with weights w_t ∝ p(y_{t+1} | z_t).
2. Propagate new states x_{t+1} from p(x_{t+1} | z_t, y_{t+1}).
3. Update state and parameter sufficient statistics.
4. Sample θ from p(θ | s_{t+1}).
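The following sketch translates the four steps above into code, with the model-specific densities passed in as hypothetical callables (none of these function names come from the paper); it is a template for the update, not a complete implementation.

```python
import numpy as np

def pl_step(x, s, theta, y_next, rng,
            predictive, propagate, update_s, sample_theta):
    # 1. Resample particles z_t = (x_t, s_t, theta) with weights p(y_{t+1} | z_t)
    w = predictive(y_next, x, s, theta)
    idx = rng.choice(len(x), size=len(x), p=w / w.sum())
    x, s, theta = x[idx], [s[i] for i in idx], theta[idx]
    # 2. Propagate new states from p(x_{t+1} | z_t, y_{t+1})
    x = propagate(y_next, x, s, theta, rng)
    # 3. Deterministically update conditional sufficient statistics
    s = [update_s(s[i], x[i], y_next) for i in range(len(x))]
    # 4. Replenish parameters offline from p(theta | s_{t+1})
    theta = np.array([sample_theta(si, rng) for si in s])
    return x, s, theta
```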
2.3 State marginalization
A more efficient approach in dynamic linear models is to marginalize states and track only conditional state sufficient statistics. Here we use the fact that

$$p(x_t | y^t) = \int p(x_t | s_t^x) \, p(s_t^x | y^t) \, ds_t^x = E\left[ p(x_t | s_t^x) \mid y^t \right].$$

Thus, we are interested in the distribution p(s_t^x | y^t). The filtering recursions are given by

$$p(s_{t+1}^x | y^{t+1}) = \frac{1}{p(y_{t+1} | y^t)} \int p(s_{t+1}^x | s_t^x, x_{t+1}, y_{t+1}) \, p(s_t^x, x_{t+1} | y^{t+1}) \, ds_t^x \, dx_{t+1}$$
$$= \frac{1}{p(y_{t+1} | y^t)} \int p(s_{t+1}^x | s_t^x, x_{t+1}, y_{t+1}) \, p(y_{t+1} | s_t^x) \, p(x_{t+1} | s_t^x, y_{t+1}) \, p(s_t^x | y^t) \, ds_t^x \, dx_{t+1};$$

notice the double integral (extra marginalization): instead of marginalizing over x_t, we marginalize over s_t^x and x_{t+1}. Note that we need to know

$$p(x_{t+1} | s_t^x, y_{t+1}) = \int p(x_{t+1} | x_t, y_{t+1}) \, p(x_t | s_t^x) \, dx_t.$$
The algorithm is then:

1. Compute weights w(s_t^x, y_{t+1}) ∝ p(y_{t+1} | s_t^x) and resample {s_t^{x(i)}}_{i=1}^N.
2. Propagate x_{t+1}^{(i)} ~ p(x_{t+1} | s_t^x, y_{t+1}).

The weights in the first stage are flatter under p(y_{t+1} | s_t^x) than under p(y_{t+1} | x_t). Because of this Rao-Blackwellization, marginalization adds to the efficiency of the algorithm.
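For the first-order DLM of Experiment 1 in Section 5 (with σ² and τ² known), the marginalized predictive is available in closed form, p(y_{t+1} | s_t^x) = N(m_t, C_t + τ² + σ²), so the resampling weights of step 1 can be computed as in the sketch below.

```python
import numpy as np

def marginal_weights(m, C, y_next, sigma2, tau2):
    # p(y_{t+1} | s_t^x) = N(m_t, C_t + tau^2 + sigma^2), s_t^x = (m_t, C_t)
    var = C + tau2 + sigma2
    logw = -0.5 * (np.log(var) + (y_next - m) ** 2 / var)
    w = np.exp(logw - logw.max())
    return w / w.sum()
```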
2.4 Smoothing
After one sequential pass through the data, our particle approximation delivers samples from p^N(x_t, s_t | y^t) for all t ≤ T. However, in many situations we are required to obtain the full smoothing distribution p(x^T | y^T), a task typically carried out by an MCMC scheme. We now show that our filtering strategy provides a direct backward sequential pass to sample from the target smoothing distribution.
To compute the marginal smoothing distribution, we start by writing the joint posterior of (x^T, θ) as

$$p(x^T, \theta | y^T) = p(x_T, \theta | y^T) \prod_{t=1}^{T-1} p(x_t | x_{t+1:T}, \theta, y^T) = p(x_T, \theta | y^T) \prod_{t=1}^{T-1} p(x_t | x_{t+1}, \theta, y^t).$$

By Bayes' rule and conditional independence we have

$$p(x_t | x_{t+1}, \theta, y^t) \propto p(x_{t+1} | x_t, \theta) \, p(x_t | \theta, y^t). \tag{5}$$

From (5) we can derive a recursive backward sampling algorithm to jointly sample from p(x^T, θ | y^T) by sequentially sampling from filtered particles with weights proportional to p(x_{t+1} | x_t, θ). In detail: at time T, randomly choose (x̃_T, s̃_T) from the particle approximation p^N(x_T, s_T | y^T) and sample θ̃ ~ p(θ | s̃_T). For t = T − 1, . . . , 1, choose x̃_t = x_t^{(i)} from the filtered particles {x_t^{(i)}}_{i=1}^N with weights

$$w_{t|t+1}^{(i)} \propto p(\tilde{x}_{t+1} | x_t^{(i)}, \tilde{\theta}).$$
This algorithm is an extension of Godsill, Doucet and West (2004) to state space models
where the fixed parameters are unknown.
Particle smoothing

1. Forward filtering via particle learning: (x_T, θ)^{(i)}.
2. Backwards smoothing with weights $w_{t|t+1}^{(i)} \propto p(\tilde{x}_{t+1} | x_t^{(i)}, \tilde{\theta})$.
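A sketch of the backward pass follows, written for a hypothetical AR(1) state evolution x_{t+1} = φx_t + τε_{t+1} so that the transition density in the weights is Gaussian; filtered[t] is assumed to hold the N filtered particles at time t, and theta a draw obtained from p(θ | s_T).

```python
import numpy as np

def backward_sample(filtered, theta, rng):
    phi, tau = theta
    T = len(filtered)
    path = np.empty(T)
    path[-1] = rng.choice(filtered[-1])      # draw x_T from p^N(x_T | y^T)
    for t in range(T - 2, -1, -1):
        # w_{t|t+1}^(i) ∝ p(x̃_{t+1} | x_t^(i), theta)
        logw = -0.5 * ((path[t + 1] - phi * filtered[t]) / tau) ** 2
        w = np.exp(logw - logw.max())
        path[t] = rng.choice(filtered[t], p=w / w.sum())
    return path
```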
2.5 Discussion
Another way to see why the algorithm performs well is as follows. Consider the pure filtering problem (ignoring parameters). Traditionally, filtering is done in two steps, prediction and updating (see Kalman, 1960), following

$$p(x_{t+1} | y^t) = \int p(x_{t+1} | x_t) \, p(x_t | y^t) \, dx_t \tag{6}$$
$$p(x_{t+1} | y^{t+1}) \propto p(y_{t+1} | x_{t+1}) \, p(x_{t+1} | y^t). \tag{7}$$

This has led to numerous particle filtering algorithms that are first based on propagation (the prediction step) and then on resampling (the updating step). These algorithms have well-known degeneracy problems, most clearly discussed in Pitt and Shephard (1999). Our strategy uses Bayes' rule and reverses the logic:

$$p(x_t | y^{t+1}) \propto p(y_{t+1} | x_t) \, p(x_t | y^t) \tag{8}$$
$$p(x_{t+1} | y^{t+1}) = \int p(x_{t+1} | x_t, y_{t+1}) \, p(x_t | y^{t+1}) \, dx_t. \tag{9}$$
Therefore, our algorithm first resamples (smooths) and then propagates. Notice that this approach first solves a smoothing problem, resampling to characterize p(x_t | y^{t+1}), and then propagates these resampled states. Since the information in y_{t+1} is used in both steps, the algorithm is more efficient. Mathematically, the weights used for resampling are

$$p(y_{t+1} | x_t) = \int p(y_{t+1} | x_{t+1}) \, p(x_{t+1} | x_t) \, dx_{t+1}. \tag{10}$$

Key to the algorithm is the marginalization of x_{t+1} in the resampling step. This creates weights that are more uniformly distributed (when compared to the original approach of Storvik, 2002) in the resampling step, as

$$\mathrm{var}(p(y_{t+1} | x_t)) < \mathrm{var}(p(y_{t+1} | x_{t+1})).$$
The marginalization step in (10) is possible for a vast class of state space models, including the entire class of conditionally dynamic linear models (see Liu and Chen, 2000; Fearnhead and Clifford, 2003; Scott, 2002) and many nonlinear models. Recall that the introduction of conditional sufficient statistics allows us to think about the learning of θ as a filtering problem; the above discussion therefore carries over to our general approach where θ is unknown.

Throughout, we note that efficiency improvements can be made by marginalizing out as many variables as possible from the predictive distribution used in the resampling step. Most efficient is resampling with predictive p(y_{t+1} | s_t^x, s_t), that is, given just the conditional sufficient statistics for parameters and states (see Section 3). This point can be cast in terms of the effective sample size (see Kong, Liu and Wong, 1994) and its relationship to the resampling weights in the pure filtering context, conditional on θ. The weights for standard approaches are based on blind propagation and are given by w(x_{t+1}) = p(y_{t+1} | x_{t+1}, θ). Our algorithm reverses this logic, first resampling and then propagating, with weights w(x_t) = p(y_{t+1} | x_t, θ). The most efficient approach is to marginalize, whenever possible, over the state vector and condition on s_t^x, leading to weights w(s_t^x) (this is the case for the models presented in Section 3). Our efficiency result is based on an ordering of the variances of the weights (equivalently, of the effective sample sizes):

$$\mathrm{var}(w(s_t^x)) \le \mathrm{var}(w(x_t)) \le \mathrm{var}(w(x_{t+1})).$$

This is based on the identity

$$w(s_t^x) = p(y_{t+1} | s_t^x) = E_{x_t | s_t^x}\left( p(y_{t+1} | x_t) \right) = E(w(x_t) | s_t^x),$$

which, using the conditional variance identity, leads to

$$\mathrm{var}(w(x_t)) \ge \mathrm{var}\left( E(w(x_t) | s_t^x) \right).$$

Similarly, p(y_{t+1} | x_t) = E_{x_{t+1} | x_t}(w(x_{t+1})). In practice, the advantage is twofold: our strategy has better effective sample size, and the particles never decay to a point, as the propagation step is performed last.
3 Conditional Dynamic Linear Models
We now explicitly derive our PL algorithm for a class of conditional dynamic linear models (CDLMs), an extension of the models considered in West and Harrison (1997). This is a vast class that embeds many of the most commonly used dynamic models. MCMC via forward-filtering backwards-sampling (FFBS) and the mixture Kalman filter (MKF) (Liu and Chen, 2000) are the current methods of choice for the estimation of these models. As an approach to filtering, PL has a number of advantages over MKF. First, our algorithm is more efficient, as it is a fully adapted filter. Second, we extend MKF by including learning about fixed parameters and smoothing for states. Compared to MCMC, smoothing via PL can be faster for the same level of accuracy, as discussed in Section 5.
The conditional DLM defined by the observation and evolution equations takes the form of a linear system conditional on an auxiliary state λ_{t+1}:

$$y_{t+1} = F_{\lambda_{t+1}} x_{t+1} + \epsilon_{t+1}, \quad \epsilon_{t+1} \sim N(0, V_{\lambda_{t+1}})$$
$$x_{t+1} = G_{\lambda_{t+1}} x_t + \omega_{t+1}, \quad \omega_{t+1} \sim N(0, W_{\lambda_{t+1}}).$$

The marginal distributions of the observation error and state shock can be any combination of normal, scale mixture of normals, or discrete mixture of normals, depending on the specification of the distribution of the auxiliary state variable p(λ_{t+1} | θ), so that

$$p(\epsilon_{t+1} | \theta) = \int N(0, V_{\lambda_{t+1}}) \, p(\lambda_{t+1} | \theta) \, d\lambda_{t+1}.$$

Extensions to hidden Markov specifications, where λ_{t+1} evolves according to p(λ_{t+1} | λ_t, θ), are straightforward and are discussed in the example of Section 3.2.
3.1 Particle Learning in CDLMs
In CDLMs, the state filtering and parameter learning problem is equivalent to a filtering problem for the joint distribution of the respective sufficient statistics. This is a direct result of the factorization of the full joint posterior as

$$p(x_{t+1}, \theta, \lambda_{t+1}, s_{t+1}, s_{t+1}^x | y^{t+1}) = p(\theta | s_{t+1}) \, p(x_{t+1} | s_{t+1}^x, \lambda_{t+1}) \, p(\lambda_{t+1}, s_{t+1}, s_{t+1}^x | y^{t+1}),$$

where the conditional sufficient statistics for states (s_t^x) and parameters (s_t) satisfy the deterministic updating rules

$$s_{t+1}^x = K(s_t^x, \theta, \lambda_{t+1}, y_{t+1}) \tag{11}$$
$$s_{t+1} = S(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1}) \tag{12}$$

with K(·) denoting the Kalman filter recursions and S(·) the recursive update of the sufficient statistics as in (1). More specifically, define s_t^x = (m_t, C_t) as the Kalman filter first and second moments at time t. Conditional on θ,

$$p(x_{t+1} | s_t^x, \theta, \lambda_{t+1}) \sim N(a_{t+1}, R_{t+1}),$$

where $a_{t+1} = G_{\lambda_{t+1}} m_t$ and $R_{t+1} = G_{\lambda_{t+1}} C_t G_{\lambda_{t+1}}' + W_{\lambda_{t+1}}$. Updating the state sufficient statistics (m_{t+1}, C_{t+1}) is done by

$$m_{t+1} = G_{\lambda_{t+1}} m_t + A_{t+1} \left( y_{t+1} - F_{\lambda_{t+1}} G_{\lambda_{t+1}} m_t \right) \tag{13}$$
$$C_{t+1}^{-1} = R_{t+1}^{-1} + F_{\lambda_{t+1}}' V_{\lambda_{t+1}}^{-1} F_{\lambda_{t+1}} \tag{14}$$

where $A_{t+1} = R_{t+1} F_{\lambda_{t+1}}' Q_{t+1}^{-1}$ is the Kalman gain matrix and $Q_{t+1} = F_{\lambda_{t+1}} R_{t+1} F_{\lambda_{t+1}}' + V_{\lambda_{t+1}}$.
We are now ready to define the PL scheme for the CDLM. First, without loss of generality, assume that the auxiliary state variable is discrete, with λ_{t+1} ~ p(λ_{t+1} | λ_t, θ). We start, at time t, with a particle approximation for the joint posterior,

$$p^N(x_t, \theta, \lambda_t, s_t, s_t^x | y^t) = \frac{1}{N} \sum_{i=1}^N \delta_{(x_t, \theta, \lambda_t, s_t, s_t^x)^{(i)}},$$

which is then propagated to t + 1 by first resampling the current particles with weights proportional to the predictive p(y_{t+1} | (θ, s_t^x)). This provides a particle approximation to the smoothing distribution p(x_t, θ, λ_t, s_t, s_t^x | y^{t+1}). New states λ_{t+1} and x_{t+1} are then propagated through the conditional posterior distributions p(λ_{t+1} | λ_t, θ, y_{t+1}) and p(x_{t+1} | λ_{t+1}, x_t, θ, y_{t+1}). Finally, the conditional sufficient statistics are updated according to (11) and (12), and new samples for θ are obtained from p(θ | s_{t+1}). Notice that in conditional dynamic linear models all of the above densities are available for evaluation and sampling. For instance, the predictive is computed via

$$p(y_{t+1} | (\lambda_t, s_t^x, \theta)^{(i)}) = \sum_{\lambda_{t+1}} p(y_{t+1} | \lambda_{t+1}, (s_t^x, \theta)^{(i)}) \, p(\lambda_{t+1} | \lambda_t, \theta),$$

where the inner predictive distribution is given by

$$p(y_{t+1} | \lambda_{t+1}, s_t^x, \theta) = \int p(y_{t+1} | x_{t+1}, \lambda_{t+1}, \theta) \, p(x_{t+1} | s_t^x, \theta) \, dx_{t+1}.$$
Starting with the particle set {(x_t, θ, λ_t, s_t, s_t^x)^{(i)}}_{i=1}^N, the above discussion can be summarized in the following PL algorithm:

Algorithm 1: CDLM

Step 1 (Resample): Generate an index k(i) ~ Multi(w^{(i)}) where w^{(i)} ∝ p(y_{t+1} | (λ_t, s_t^x, θ)^{(i)}).

Step 2 (Propagate states):
$$\lambda_{t+1} \sim p(\lambda_{t+1} | (\lambda_t, \theta)^{k(i)}, y_{t+1})$$
$$x_{t+1} \sim p(x_{t+1} | (x_t, \theta)^{k(i)}, \lambda_{t+1}, y_{t+1})$$

Step 3 (Propagate sufficient statistics):
$$s_{t+1}^x = K(s_t^x, \theta, \lambda_{t+1}, y_{t+1})$$
$$s_{t+1} = S(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1})$$
When the auxiliary state variable λ_t is continuous, it is not always possible to integrate out λ_{t+1} from the predictive in Step 1. We extend the above scheme by adding to the current particle set a propagated particle λ_{t+1} ~ p(λ_{t+1} | (λ_t, θ)^{(i)}) and define the following PL algorithm:
Algorithm 2: Auxiliary State CDLM

Step 1 (Resample): Generate an index k(i) ~ Multi(w^{(i)}) where w^{(i)} ∝ p(y_{t+1} | (s_t^x, λ_{t+1}, θ)^{(i)}).

Step 2 (Propagate states):
$$x_{t+1} \sim p(x_{t+1} | (x_t, \lambda_{t+1}, \theta)^{k(i)}, y_{t+1})$$

Step 3 (Propagate sufficient statistics):
$$s_{t+1}^x = K(s_t^x, \theta, \lambda_{t+1}, y_{t+1})$$
$$s_{t+1} = S(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1})$$
Both of the above algorithms can be combined with the backwards propagation scheme of Section 2.4 to provide a full draw from the marginal posterior distribution of all the states given the data, namely p(x^T | y^T).
3.2 Example: Dynamic Factor Model with Switching Loadings
Consider data y_t = (y_{t1}, y_{t2})', t = 1, . . . , T, following a dynamic factor model with Markov switching in the loadings, driven by λ_{t+1}. This illustrates our algorithm with a discrete λ_{t+1} in a conditional dynamic linear model (see also Carvalho and Lopes, 2007). Specifically, we have

$$y_{t+1} = \begin{pmatrix} 1 \\ \beta_{\lambda_{t+1}} \end{pmatrix} x_{t+1} + \sigma \epsilon_{t+1}$$
$$x_{t+1} = x_t + \sigma_x v_{t+1}$$
$$\lambda_{t+1} \sim \mathrm{MS}(p, q),$$

where MS denotes a Markov switching process with transition probabilities (p, q), and ε_t and v_t are independent standard normals. The initial state is assumed to be x_0 ~ N(m_0, C_0). The parameters are θ = (β_1, β_2, σ², σ_x², p, q)'.

For this model we are able to develop the algorithm marginalizing over both (x_{t+1}, λ_{t+1}) by using the state sufficient statistics s_t^x = (m_t, C_t) as particles. From the Kalman filter recursions we know that p(x_t | λ^t, θ, y^t) ~ N(m_t, C_t). The mapping for the state sufficient statistics, (m_{t+1}, C_{t+1}) = K(m_t, C_t, λ_{t+1}, θ, y_{t+1}), is given by the one-step Kalman update as in (13) and (14). The prior distributions are conditionally conjugate:

$$p(\beta_i | \sigma^2) \sim N(b_{i0}, \sigma^2 B_{i0}) \text{ for } i = 1, 2, \quad p(\sigma^2) \sim IG\!\left(\frac{\nu_{00}}{2}, \frac{d_{00}}{2}\right) \quad\text{and}\quad p(\sigma_x^2) \sim IG\!\left(\frac{\nu_{10}}{2}, \frac{d_{10}}{2}\right).$$

For the transition probabilities, we assume that p ~ Beta(p_1, p_2) and q ~ Beta(q_1, q_2).
Assume that at time t we have particles {(x_t, θ, λ_t, s_t^x, s_t)^{(i)}}_{i=1}^N approximating p(x_t, θ, λ_t, s_t^x, s_t | y^t). The PL algorithm proceeds through the following steps:

Resampling: Draw an index k(i) ~ Multi(w^{(i)}) with weights w^{(i)} ∝ p(y_{t+1} | (m_t, C_t, λ_t, θ)^{(i)}), where

$$p(y_{t+1} | m_t, C_t, \lambda_t, \theta) = \sum_{\lambda_{t+1}=1}^2 f_N\!\left( y_{t+1}; \begin{pmatrix} 1 \\ \beta_{\lambda_{t+1}} \end{pmatrix} m_t, \; (C_t + \sigma_x^2) \begin{pmatrix} 1 & \beta_{\lambda_{t+1}} \\ \beta_{\lambda_{t+1}} & \beta_{\lambda_{t+1}}^2 \end{pmatrix} + \sigma^2 I_2 \right) p(\lambda_{t+1} | \lambda_t, \theta),$$

where f_N denotes the normal density function.
Propagating λ: Draw the auxiliary state variable

$$\lambda_{t+1}^{(i)} \sim p(\lambda_{t+1} | (s_t^x, \lambda_t, \theta)^{k(i)}, y_{t+1})$$

using

$$Pr(\lambda_{t+1} = j | s_t^x, \lambda_t, \theta, y_{t+1}) \propto f_N\!\left( y_{t+1}; \begin{pmatrix} 1 \\ \beta_j \end{pmatrix} m_t, \; (C_t + \sigma_x^2) \begin{pmatrix} 1 & \beta_j \\ \beta_j & \beta_j^2 \end{pmatrix} + \sigma^2 I_2 \right) p(\lambda_{t+1} = j | \lambda_t, \theta).$$

Propagating x: Draw states

$$x_{t+1}^{(i)} \sim p(x_{t+1} | \lambda_{t+1}^{(i)}, (s_t^x, \theta)^{k(i)}, y_{t+1}).$$
Updating sufficient statistics for states: The Kalman filter recursions yield

$$m_{t+1} = m_t + A_{t+1} \left( y_{t+1} - F_{\lambda_{t+1}} m_t \right)$$
$$C_{t+1} = C_t + \sigma_x^2 - A_{t+1} Q_{t+1} A_{t+1}'$$

where $F_{\lambda_{t+1}} = (1, \beta_{\lambda_{t+1}})'$ and

$$Q_{t+1} = (C_t + \sigma_x^2) \begin{pmatrix} 1 & \beta_{\lambda_{t+1}} \\ \beta_{\lambda_{t+1}} & \beta_{\lambda_{t+1}}^2 \end{pmatrix} + \sigma^2 I_2, \quad A_{t+1} = (C_t + \sigma_x^2) \, (1, \beta_{\lambda_{t+1}}) \, Q_{t+1}^{-1}.$$
Updating sufficient statistics for parameters: The conditional posterior p(θ | s_{t+1}) decomposes as

$$p(\beta_i | \sigma^2, s_{t+1}) \sim N(b_{i,t+1}, \sigma^2 B_{i,t+1}), \quad i = 1, 2,$$
$$p(\sigma^2 | s_{t+1}) \sim IG\!\left(\frac{\nu_{0,t+1}}{2}, \frac{d_{0,t+1}}{2}\right) \quad\text{and}\quad p(\sigma_x^2 | s_{t+1}) \sim IG\!\left(\frac{\nu_{1,t+1}}{2}, \frac{d_{1,t+1}}{2}\right),$$
$$p(p | s_{t+1}) \sim \mathrm{Beta}(p_{1,t+1}, p_{2,t+1}) \quad\text{and}\quad p(q | s_{t+1}) \sim \mathrm{Beta}(q_{1,t+1}, q_{2,t+1}),$$

where

$$B_{i,t+1}^{-1} = B_{i,t}^{-1} + x_{t+1}^2 \, I_{\lambda_{t+1}=i}$$
$$b_{i,t+1} = B_{i,t+1} \left( B_{i,t}^{-1} b_{i,t} + x_{t+1} y_{t+1,2} \, I_{\lambda_{t+1}=i} \right)$$
$$d_{0,t+1} = d_{0,t} + (y_{t+1,1} - x_{t+1})^2 + \sum_{j=1}^2 \left[ (y_{t+1,2} - b_{j,t+1} x_{t+1}) y_{t+1,2} + b_{j,t+1} B_{j0}^{-1} \right] I_{\lambda_{t+1}=j}$$
$$d_{1,t+1} = d_{1,t} + (x_{t+1} - x_t)^2, \quad\text{and}\quad \nu_{i,t+1} = \nu_{i,t} + 1.$$

For p_{i,t+1} and q_{i,t+1}, i = 1, 2, we have

$$p_{1,t+1} = p_{1,t} + I_{\lambda_t=1, \lambda_{t+1}=1} \quad\text{and}\quad p_{2,t+1} = p_{2,t} + I_{\lambda_t=1, \lambda_{t+1}=2},$$

and similarly for q_{i,t+1}.
Figures 1 and 2 illustrate our PL algorithm on a simulated two-dimensional Markov switching dynamic factor model. The first panel of Figure 1 displays the true underlying λ process along with filtered and smoothed estimates, whereas the second panel presents the same information for the common factor x. Figure 2 provides the sequential parameter learning plots.
4 Nonlinear Filtering and Learning
We now extend our PL filter to a general class of nonlinear state space models, namely conditional Gaussian dynamic models (CGDMs). This class generalizes conditional dynamic linear models by allowing nonlinear evolution equations. In this context we retain most of the efficiency gains of PL, as we are still able to follow the resample-propagate logic and filter sufficient statistics for θ.

Consider a conditional Gaussian state space model with nonlinear evolution equation,

$$y_{t+1} = F_{\lambda_{t+1}} x_{t+1} + \epsilon_{t+1}, \quad \epsilon_{t+1} \sim N(0, V_{\lambda_{t+1}}) \tag{15}$$
$$x_{t+1} = G_{\lambda_{t+1}} Z(x_t) + \omega_{t+1}, \quad \omega_{t+1} \sim N(0, W_{\lambda_{t+1}}), \tag{16}$$

where Z(·) is a given nonlinear function. Due to the nonlinearity in the evolution we are no longer able to work with state sufficient statistics s_t^x, but we are still able to evaluate the predictive p(y_{t+1} | x_t, λ_t, θ).
In general, take as the particle set {(x_t, θ, λ_t, s_t)^{(i)}}_{i=1}^N. For discrete λ we can define the following algorithm:

Algorithm 3: CGDM

Step 1 (Resample): Generate an index k(i) ~ Multi(w^{(i)}) where w^{(i)} ∝ p(y_{t+1} | (x_t, λ_t, θ)^{(i)}).

Step 2 (Propagate states):
$$\lambda_{t+1} \sim p(\lambda_{t+1} | (\lambda_t, \theta)^{k(i)}, y_{t+1})$$
$$x_{t+1} \sim p(x_{t+1} | (x_t, \theta)^{k(i)}, \lambda_{t+1}, y_{t+1})$$

Step 3 (Propagate sufficient statistics):
$$s_{t+1} = S(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1})$$
When λ is continuous, we propagate λ_{t+1} ~ p(λ_{t+1} | (λ_t, θ)^{(i)}) and then resample the particles (x_t, λ_{t+1}, s_t)^{(i)} with the appropriate predictive p(y_{t+1} | (x_t, λ_{t+1}, θ)^{(i)}), as in Algorithm 2. Finally, it is straightforward to extend the backward smoothing strategy of Section 2.4 to obtain samples from p(x^T | y^T).
4.1 Example: CGDM
Consider the fat-tailed nonlinear state space model given by

$$y_{t+1} = x_{t+1} + \sigma \sqrt{\lambda_{t+1}} \, \epsilon_{t+1}, \quad \lambda_{t+1} \sim IG\!\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$$
$$x_{t+1} = \beta g(x_t) + \sigma_x u_{t+1}, \quad g(x_t) = \frac{x_t}{1 + x_t^2},$$

where ε_{t+1} and u_{t+1} are independent standard normals and ν is known. The observation error term √λ_{t+1} ε_{t+1} is standard Student-t with ν degrees of freedom. Filtering strategies for this model without parameter learning are described in de Jong et al. (2007).

Assume that at time t we have particles {(x_t, θ, λ_{t+1}, s_t)^{(i)}}_{i=1}^N approximating p(x_t, θ, λ_{t+1}, s_t | y^t).
The PL algorithm proceeds through the following steps:

Resampling: Draw an index k(i) ~ Multi(w^{(i)}) with weights w^{(i)} ∝ p(y_{t+1} | (x_t, λ_{t+1}, θ)^{(i)}), where

$$p(y_{t+1} | x_t, \lambda_{t+1}, \theta) = f_N\!\left( y_{t+1}; \; \beta g(x_t), \; \lambda_{t+1} \sigma^2 + \sigma_x^2 \right).$$

Propagating x: Draw x_{t+1}^{(i)} ~ p(x_{t+1} | (λ_{t+1}, x_t, θ)^{k(i)}, y_{t+1}). This conditional distribution is given by

$$p(x_{t+1} | \lambda_{t+1}, x_t, \theta, y_{t+1}) = N\!\left( x_{t+1}; \; \mu(x_t, \lambda_{t+1}, \theta), \; V(x_t, \lambda_{t+1}, \theta) \right),$$
$$\mu(x_t, \lambda_{t+1}, \theta) = V(x_t, \lambda_{t+1}, \theta) \left( \frac{y_{t+1}}{\lambda_{t+1} \sigma^2} + \frac{\beta g(x_t)}{\sigma_x^2} \right),$$
$$V(x_t, \lambda_{t+1}, \theta)^{-1} = \lambda_{t+1}^{-1} \sigma^{-2} + \sigma_x^{-2}.$$

Updating sufficient statistics: Posterior parameter learning for θ = (β, σ², σ_x²) follows directly from conjugate normal/inverse-gamma updates.
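Putting the two steps together, a sketch of one PL update for this example is given below, with θ held fixed inside the update for readability and the sufficient statistic update omitted; the function name is an assumption made for the example.

```python
import numpy as np

def cgdm_step(x, lam_next, y_next, rng, beta, sigma2, sigma2_x):
    g = x / (1.0 + x**2)
    # Resample: w_i ∝ f_N(y_{t+1}; beta*g(x_t), lambda_{t+1}*sigma^2 + sigma_x^2)
    var = lam_next * sigma2 + sigma2_x
    logw = -0.5 * (np.log(var) + (y_next - beta * g) ** 2 / var)
    w = np.exp(logw - logw.max())
    idx = rng.choice(x.size, size=x.size, p=w / w.sum())
    g, lam_next = g[idx], lam_next[idx]
    # Propagate: x_{t+1} ~ N(mu, V) with V^{-1} = (lambda*sigma^2)^{-1} + sigma_x^{-2}
    V = 1.0 / (1.0 / (lam_next * sigma2) + 1.0 / sigma2_x)
    mu = V * (y_next / (lam_next * sigma2) + beta * g / sigma2_x)
    return mu + np.sqrt(V) * rng.standard_normal(x.size)
```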
Figure 3 illustrates the above PL algorithm in a simulated example where β = 0.9, σ² = 0.04 and σ_x² = 0.01. The algorithm uncovers the true parameters very efficiently, in a sequential fashion. In Section 5 we present a simulation exercise comparing the performance of PL with MCMC (Carlin, Polson and Stoffer, 1992) for this same example, while in Section 6 we compare PL to the filter of Liu and West (2001).
5 Comparing PL to MCMC
PL combined with the backward smoothing algorithm of Section 2.4 is an alternative to MCMC methods for state space models. In general, MCMC methods (see Gamerman and Lopes, 2006) use Markov chains designed to explore the posterior distribution p(x^T, θ | y^T) of states and parameters conditional on all the available information, y^T = (y_1, . . . , y_T). For example, an MCMC strategy would have to iterate through

$$p(\theta | x^T, y^T) \quad\text{and}\quad p(x^T | \theta, y^T).$$

However, MCMC relies on the convergence of very high-dimensional Markov chains. In purely conditional Gaussian linear models, or when the states are discrete, p(x^T | θ, y^T) can be sampled in one block using FFBS. Even in these ideal cases, achieving convergence is far from an easy task, and the computational complexity is enormous, as each iteration requires a full forward filtering and backward sampling pass for the entire state vector x^T. The particle learning algorithm presented here has two advantages: (i) it requires only one forward/backward pass through the data for all N particles, and (ii) its approximation accuracy does not rely on convergence results that are virtually impossible to assess in practice (see Papaspiliopoulos and Roberts, 2008).
In the presence of nonlinearities, MCMC methods suffer even further, as no FFBS scheme is available for the full state vector x^T. One then has to resort to univariate updates of p(x_t | x_{(-t)}, θ, y^T), as in Carlin, Polson and Stoffer (1992). It is well known that these methods generate very "sticky" Markov chains, increasing computational complexity and slowing down convergence.

The computational complexity of our approach is only O(NT), where N is the number of particles. PL is also attractive given the simple nature of its implementation (especially compared to more novel hybrid methods).
Another important advantage of PL is that it directly provides the filtered joint posteriors p(x_t, θ | y^t), whereas MCMC would have to be repeated T times to make these available. These distributions are not only important for sequential prediction problems but are also key to the computation of Bayes factors for model assessment in state space models. More specifically, the marginal predictive can be approximated via

$$\hat{p}(y_{t+1} | y^t) = \frac{1}{N} \sum_{i=1}^N p(y_{t+1} | (x_t, \theta)^{(i)}).$$

This then allows the computation of Bayes factors BF_t, or sequential likelihood ratios, for competing models M_0 and M_1 (see, for example, West, 1986):

$$BF_t = \frac{p(y^t | M_1)}{p(y^t | M_0)}, \quad\text{where}\quad p(y^t | M_i) = \prod_{j=1}^t p(y_j | y^{j-1}, M_i).$$
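In code, this is simple bookkeeping: at each step, average the predictive over the particle set and accumulate its logarithm to form log p(y^t | M_i); the log Bayes factor is then the difference of two such running sums. The predictive_density hook below is a hypothetical, model-specific assumption.

```python
import numpy as np

def update_log_evidence(log_evidence, y_next, particles, predictive_density):
    # p̂(y_{t+1} | y^t) = (1/N) * sum_i p(y_{t+1} | (x_t, theta)^(i))
    p_hat = np.mean(predictive_density(y_next, particles))
    return log_evidence + np.log(p_hat)

# After T steps per model: log BF_t = log_evidence_M1 - log_evidence_M0
```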
We now present three experiments comparing the performance of PL against MCMC
alternatives.
Experiment 1. First, we compare PL with FFBS in a first-order normal DLM, defined as

$$y_{t+1} = x_{t+1} + \epsilon_{t+1}, \quad \epsilon_{t+1} \sim N(0, \sigma^2)$$
$$x_{t+1} = x_t + \omega_{t+1}, \quad \omega_{t+1} \sim N(0, \tau^2).$$

This is a special case of the CDLM considered in Section 3.
We simulate data from the above model under different specifications of (σ², τ²), effectively varying the signal-to-noise ratio of the model. Table 1 presents the results of the comparison between PL and FFBS. It is clear that PL outperforms FFBS in computational time for similar levels of accuracy in estimating the state vector x^T. It is also important to highlight that FFBS does not necessarily converge geometrically, as shown by Papaspiliopoulos and Roberts (2008), so PL can be even more advantageous in more complex model specifications.

Figure 4 presents draws from the posterior distribution p(σ², τ² | y^T) for both PL and FFBS, indicating that PL is also successful in learning about the fixed parameters of the model.
Experiment 2. We now consider the first-order conditional Gaussian dynamic model with nonlinear state equation defined in Section 4.1. Once again, we generate data with different levels of signal-to-noise ratio and compare the performance of PL with MCMC. In this situation we are no longer able to implement FFBS and need to resort to single-move updates, as in Carlin, Polson and Stoffer (1992). Table 2 presents the results of the comparisons. Again, PL outperforms MCMC by a large margin and, in most cases, provides better estimates of x^T.
In addition, Figure 5 shows the effective sample size (ESS) from PL and MCMC for the samples of x_{100}. Notice that 100 separate MCMC runs are necessary to obtain the right-hand panel, whereas the left-hand panel is trivially obtained from one run of PL. Due to its high level of autocorrelation, the MCMC implementation has very small ESS relative to PL. This result emphasizes the vast computational advantage of PL: for the same level of ESS, one would need to run the MCMC for many more iterations.
Experiment 3. Finally, we consider the following example from Papaspiliopoulos and Roberts (2008). The model is defined as

$$y = x_1 + \epsilon_1, \quad \epsilon_1 \sim N(0, 1)$$
$$x_1 = x_0 + \sqrt{\lambda_1} \, \omega_1, \quad \omega_1 \sim N(0, 5),$$

where λ_1 ~ IG(1/2, 1/2) and the prior (or initial) distribution is p(x_0) ~ N(0, 1000). Hence we have a heavy-tailed Cauchy error in the evolution equation.

Only one observation is available: y = 0. The joint posterior exhibits behavior similar to the witch's hat distribution (Polson, 1991), as the posterior is a heavy-tailed Cauchy multiplied by a normal. Papaspiliopoulos and Roberts (2008, Example 1 and Proposition 3.7) show that the Gibbs sampler does not converge well, being highly dependent on its starting point.
In this case, PL works as follows: first, generate particles (x_0, λ_1)^{(i)} ~ p(x_0) p(λ_1). Then resample these particles with weights proportional to the predictive p(y | x_0, λ_1) ~ N(x_0, 1 + λ_1). Then propagate particles for the next state x_1, and backwards sample x_0, using

$$p(x_1 | (x_0, \lambda)^{k(i)}, y) \quad\text{and}\quad p(x_0 | (x_1, \lambda)^{k(i)}, y).$$

Figure 6 plots the joint smoothing posterior p(x_1, x_0 | y) for one data point y. Compared to the results in Papaspiliopoulos and Roberts (2008), PL provides a much improved approximation of this posterior.
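A sketch of this one-observation PL scheme is given below. Note that the text states both an evolution shock ω_1 ~ N(0, 5) and a predictive N(x_0, 1 + λ_1); the evolution variance is therefore kept as a parameter w_var (set to 1 here to match the predictive as printed), an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, y, w_var = 10_000, 0.0, 1.0
x0 = rng.normal(0.0, np.sqrt(1000.0), N)    # p(x_0) = N(0, 1000)
lam = 1.0 / rng.gamma(0.5, 2.0, N)          # lambda_1 ~ IG(1/2, 1/2)
# Resample with weights ∝ p(y | x_0, lambda_1) = N(x_0, 1 + w_var*lambda_1)
var = 1.0 + w_var * lam
logw = -0.5 * (np.log(var) + (y - x0) ** 2 / var)
w = np.exp(logw - logw.max())
idx = rng.choice(N, size=N, p=w / w.sum())
x0, lam = x0[idx], lam[idx]
# Propagate x_1 ~ p(x_1 | x_0, lambda_1, y), Gaussian by conjugacy
V = 1.0 / (1.0 + 1.0 / (w_var * lam))       # precision: obs + evolution
mu = V * (y + x0 / (w_var * lam))
x1 = mu + np.sqrt(V) * rng.standard_normal(N)
```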
6 Comparing PL to "Liu and West"
All the experiments presented so far provide strong evidence of PL's ability to effectively solve the learning and smoothing problems. We now compare the performance of PL against the most commonly used filter for situations where θ is unknown, i.e., the filter proposed by Liu and West (2001) (LW). This filtering strategy adapts the auxiliary particle filter of Pitt and Shephard (1999) and solves the learning of θ by incorporating a kernel density approximation of p(θ | y^t). We also use this section to discuss the robustness of PL as the sample size T grows. In the following two experiments, our focus is to assess how well each filter estimates the sequence of posteriors p(θ | y^t) for all t.
Experiment 4. First, we compare PL with LW in the same setup as Experiment 1 of Section 5. Here we assume that τ² = 0.1 is known and simulate with different values of σ² such that τ/σ = 0.5, 1 and 10. In this context we can analytically evaluate the true posterior distribution p(σ² | y^t) for all t. We perform a Monte Carlo experiment based on 1000 replications with T = 200 or T = 1000. In each of the six cases we compute the estimation risk for 5 percentiles of p(σ² | y^T). Figure 7 shows the risk ratios of PL versus LW. It is clear that PL is uniformly superior in all cases. Figures 8 and 9 are snapshots of the experiment in one particular replication with τ/σ = 1 and T = 1000. It is important to highlight that PL is able to correctly estimate p(σ² | y^T) even for significantly large T, showing the robustness of the filter to particle degeneration. This can also be seen by inspecting Figure 10, where we sequentially plot the distribution of the conditional sufficient statistics for σ², along with their true simulated values. Combined with the accurate approximation of p(σ² | y^T), this shows that filtering the conditional sufficient statistics does indeed provide a solution to the learning problem. LW, however, suffers from particle degeneracy fairly quickly, as illustrated in Figure 8.
Experiment 5. We revisit the nonlinear conditional Gaussian dynamic model of Section 4.1. In this case, θ = (σ², β, τ²). No "true" benchmark for comparison exists here, so we run a very long MCMC (for T = 100) and compare the posteriors for θ obtained from the outputs of PL and LW. Figure 11 presents both the sequential learning plots and the estimated densities for θ at time T = 100. Once again, PL outperforms LW and provides answers that are much closer to the MCMC output. It is important to remember that PL runs in a fraction of the time of MCMC while providing a richer output, i.e., the sequence of filtered distributions p(x_t, θ | y^t).
7 Final Remarks
In this paper we provide particle learning (PL) tools for a large class of state space models. Our methodology incorporates sequential parameter learning, state filtering and smoothing. This provides an alternative to the popular FFBS/MCMC (Carter and Kohn, 1994) approach for conditional dynamic linear models and to MCMC approaches for nonlinear, non-Gaussian models. It is also a generalization of the mixture Kalman filter (MKF) approach of Liu and Chen (2000) that adds parameter learning and smoothing. The key assumption is the existence of a conditional sufficient statistic structure for the parameters, which is available in many commonly used models.
We provide extensive simulation evidence and theoretical Monte Carlo convergence bounds to address the efficiency of PL versus standard methods, using computational time and accuracy to assess performance. Our approach compares very favorably with existing strategies and is robust to particle degeneracies as the sample size grows. Finally, PL has the additional advantage of being an intuitive and easy-to-implement computational scheme and should, therefore, become a default choice for posterior inference in this very general class of models.
8 Appendix A: Monte Carlo Convergence Bounds

The convergence of the particle approximation for mixture DLMs (see Section 3) can be described as follows. First, note that we have the current sufficient statistic posterior distribution p(s_t^x, s_t | y^t) and its particle approximation p^N(s_t^x, s_t | y^t). By definition, p^N(s_t^x, s_t | y^{t+1}) is given by

$$p^N(s_t^x, s_t | y^{t+1}) = \frac{p(y_{t+1} | s_t^x, s_t)}{p(y_{t+1} | y^t)} \, p^N(s_t^x, s_t | y^t),$$

from which we use the deterministic mapping to draw s_{t+1} = S(s_t, x_{t+1}, λ_{t+1}, y_{t+1}), and similarly s_{t+1}^x. We can estimate the denominator via

$$p^N(y_{t+1} | y^t) = E^N\left( p(y_{t+1} | s_t^x, s_t) \right) = \frac{1}{N} \sum_{i=1}^N p(y_{t+1} | (s_t^x, s_t)^{(i)}),$$

where (s_t^x, s_t)^{(i)} ~ p(s_t^x, s_t | y^t). Consider estimating the next filtering distribution for the conditional sufficient statistics (later we provide a bound for the sufficient statistics and parameters):

$$p(s_{t+1}^x, s_{t+1} | y^{t+1}) = \int p(s_{t+1}^x, s_{t+1} | s_t^x, s_t, \theta, y_{t+1}) \, p(s_t^x, s_t, \theta | y^{t+1}) \, d(s_t^x, s_t, \theta)$$
$$= \int p(s_{t+1}^x, s_{t+1} | s_t^x, s_t, \theta, y_{t+1}) \, \frac{p(y_{t+1} | s_t^x, s_t, \theta) \, p(s_t^x, s_t, \theta | y^t)}{p(y_{t+1} | y^t)} \, d(s_t^x, s_t, \theta),$$

and the algorithm provides a density estimate of the filtering distribution of (s_{t+1}^x, s_{t+1}), namely p(s_{t+1}^x, s_{t+1} | y^{t+1}). Here p(s_{t+1}^x, s_{t+1} | s_t^x, s_t, θ, y_{t+1}) is based on the deterministic mapping given λ_{t+1}, averaged over the appropriate conditional posterior for λ_{t+1}.
To complete the analysis of our algorithm we prove a convergence and Monte Carlo error bound. Define the variational norm as

$$\| p_t^N - p_t \| = \int \left| p_t^N(\phi | y^t) - p_t(\phi | y^t) \right| d\phi.$$

Theorem (Convergence): The particle-based estimate p^N(s_{t+1}^x, s_{t+1} | y^{t+1}) is consistent for p(s_{t+1}^x, s_{t+1} | y^{t+1}), namely

$$\left\| p^N(s_{t+1}^x, s_{t+1} | y^{t+1}) - p(s_{t+1}^x, s_{t+1} | y^{t+1}) \right\| \to_P 0,$$

with Monte Carlo convergence rate √N, if

$$C_t = \int \frac{p^2(s_{t+1}^x, s_{t+1}, s_t^x, s_t, \theta | y^{t+1})}{p(s_t^x, s_t, \theta | y^t)} \, d(s_t^x, s_t, \theta) < \infty.$$
Proof: First, notice that we can write

$$p^N(s_{t+1}^x, s_{t+1} | y^{t+1}) = \frac{\frac{1}{N} \sum_{i=1}^N p\left( s_{t+1}^x, s_{t+1} \mid (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}, y_{t+1} \right) \frac{p(y_{t+1} | (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)})}{p(y_{t+1} | y^t)}}{\frac{1}{N} \sum_{i=1}^N \frac{p(y_{t+1} | (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)})}{p(y_{t+1} | y^t)}}.$$

Taking expectations over the distribution of (s_t^x, s_t, λ_{t+1}, θ), the denominator converges to

$$E\left( \frac{1}{N} \sum_{i=1}^N \frac{p(y_{t+1} | (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)})}{p(y_{t+1} | y^t)} \right) = \int \frac{p(y_{t+1} | s_t^x, s_t, \lambda_{t+1}, \theta) \, p(s_t^x, s_t, \lambda_{t+1}, \theta | y^t)}{p(y_{t+1} | y^t)} \, d(s_t^x, s_t, \lambda_{t+1}, \theta) = 1.$$

The numerator converges to

$$E\left( \frac{1}{N} \sum_{i=1}^N p(s_{t+1}^x, s_{t+1} | (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}, y_{t+1}) \frac{p(y_{t+1} | (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)})}{p(y_{t+1} | y^t)} \right)$$
$$= \int p(s_{t+1}^x, s_{t+1} | s_t^x, s_t, \lambda_{t+1}, \theta, y_{t+1}) \, \frac{p(y_{t+1} | s_t^x, s_t, \lambda_{t+1}, \theta) \, p(s_t^x, s_t, \lambda_{t+1}, \theta | y^t)}{p(y_{t+1} | y^t)} \, d(s_t^x, s_t, \lambda_{t+1}, \theta)$$
$$= \int p(s_{t+1}^x, s_{t+1} | s_t^x, s_t, \theta, y_{t+1}) \, p(s_t^x, s_t, \theta | y^{t+1}) \, d(s_t^x, s_t, \theta)$$
$$= p(s_{t+1}^x, s_{t+1} | y^{t+1}).$$

Hence, in expectation,

$$E\left( p^N(s_{t+1}^x, s_{t+1} | y^{t+1}) \right) = p(s_{t+1}^x, s_{t+1} | y^{t+1}).$$
For the variance bound, we use the inequality Var(φ) ≤ E(φ²), leading to the error bound

$$\mathrm{Var}\left( p^N(s_{t+1}^x, s_{t+1} | y^{t+1}) \right) \le \frac{1}{N} \int p(s_{t+1}^x, s_{t+1} | s_t^x, s_t, \theta, y_{t+1}) \left( \frac{p(y_{t+1} | s_t^x, s_t, \theta)}{p(y_{t+1} | y^t)} \right)^2 p(s_t^x, s_t, \theta | y^t) \, d(s_t^x, s_t, \theta)$$
$$= \frac{1}{N} \int \frac{p^2(s_{t+1}^x, s_{t+1}, s_t^x, s_t, \theta | y^{t+1})}{p(s_t^x, s_t, \theta | y^t)} \, d(s_t^x, s_t, \theta) = \frac{C_{t+1}}{N},$$

where we have used the Bayes rule identity

$$\frac{p(s_t^x, s_t, \theta | y^{t+1})}{p(s_t^x, s_t, \theta | y^t)} = \frac{p(y_{t+1} | s_t^x, s_t, \theta)}{p(y_{t+1} | y^t)}.$$

Notice that the Monte Carlo error depends on how the sufficient statistics update over time through the joint distribution p(s_{t+1}^x, s_{t+1}, s_t^x, s_t, θ | y^{t+1}); dramatic shifts in this distribution will require larger N. This can be used to find a √N bound for the variational distance of p^N(s_t, s_t^x | y^t), due to the inequality E(|φ|)² ≤ E(φ²) for any functional φ.
Finally, the variational distance for the particle approximation of the state and parameter distribution satisfies the bound

$$\left\| p^N(x_t, \theta | y^t) - p(x_t, \theta | y^t) \right\| \le \left\| p^N(s_t, s_t^x | y^t) - p(s_t, s_t^x | y^t) \right\|.$$

This is due to the fact that the state and parameter posterior is a Rao-Blackwellized average,

$$p(x_t, \theta | y^t) = \int p(x_t, \theta | s_t, s_t^x) \, p(s_t, s_t^x | y^t) \, ds_t \, ds_t^x.$$

Therefore,

$$\left\| p^N(x_t, \theta | y^t) - p(x_t, \theta | y^t) \right\| = \left\| \int p(x_t, \theta | s_t, s_t^x) \left( p^N(s_t, s_t^x | y^t) - p(s_t, s_t^x | y^t) \right) ds_t \, ds_t^x \right\|$$
$$\le \int p(x_t, \theta | s_t, s_t^x) \left| p^N(s_t, s_t^x | y^t) - p(s_t, s_t^x | y^t) \right| \, d(x_t, \theta, s_t, s_t^x)$$
$$= \left\| p^N(s_t, s_t^x | y^t) - p(s_t, s_t^x | y^t) \right\|.$$

This can be used inductively, as in Godsill, Doucet and West (2004), since it clearly holds at time t = 0 by taking a random sample from p(x_0, θ). We note that we also have the typical Monte Carlo rate of convergence for expectations of functionals, E^N(φ), as well as for the density estimate p^N.
References

Bengtsson, T., Bickel, P. and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman. Institute of Mathematical Statistics Collections, 2, 316-334.

Carlin, B.P., Polson, N.G. and Stoffer, D.S. (1992). A Monte Carlo approach to nonnormal and nonlinear state-space modeling. Journal of the American Statistical Association, 87, 493-500.

Carter, C. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika, 81, 541-553.

Carvalho, C.M. and Lopes, H.F. (2007). Simulation-based sequential analysis of Markov switching stochastic volatility models. Computational Statistics and Data Analysis, 51, 4526-4542.

Doucet, A., de Freitas, N. and Gordon, N. (Eds.) (2001). Sequential Monte Carlo Methods in Practice. New York: Springer.

Fearnhead, P. (2002). Markov chain Monte Carlo, sufficient statistics, and particle filters. Journal of Computational and Graphical Statistics, 11, 848-862.

Fearnhead, P. and Clifford, P. (2003). On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society, B, 65, 887-899.

Frühwirth-Schnatter, S. (1994). Applied state space modelling of non-Gaussian time series using integration-based Kalman filtering. Statistics and Computing, 4, 259-269.

Gamerman, D. and Lopes, H.F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Boca Raton: Chapman and Hall/CRC.

Gilks, W. and Berzuini, C. (2001). Following a moving target: Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society, B, 63, 127-146.

Godsill, S.J., Doucet, A. and West, M. (2004). Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99, 156-168.

Gordon, N., Salmond, D. and Smith, A.F.M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140, 107-113.

Johannes, M., Polson, N.G. and Yae, S.M. (2007). Nonlinear filtering and learning. Working paper, The University of Chicago Booth School of Business.

Kalman, R.E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82, 35-45.

Kitagawa, G. (1987). Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association, 82, 1032-1041.

Kong, A., Liu, J.S. and Wong, W.H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89, 278-288.

Liu, J. and Chen, R. (2000). Mixture Kalman filters. Journal of the Royal Statistical Society, B, 62, 493-508.

Liu, J. and West, M. (2001). Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice (Eds. A. Doucet, N. de Freitas and N. Gordon). New York: Springer-Verlag.

Papaspiliopoulos, O. and Roberts, G. (2008). Stability of the Gibbs sampler for Bayesian hierarchical models. Annals of Statistics, 36, 95-117.

Pitt, M. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94, 590-599.

Polson, N.G. (1991). Comment on "Practical Markov Chain Monte Carlo" by C. Geyer. Statistical Science, 7, 490-491.

Polson, N.G., Stroud, J. and Müller, P. (2008). Practical filtering with sequential parameter learning. Journal of the Royal Statistical Society, B, 70, 413-428.

Scott, S. (2002). Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association, 97, 335-351.

Storvik, G. (2002). Particle filters in state space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50, 281-289.

West, M. (1986). Bayesian model monitoring. Journal of the Royal Statistical Society, B, 48, 70-78.

West, M. (1993). Approximating posterior distributions by mixtures. Journal of the Royal Statistical Society, B, 55, 409-422.

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models (2nd edition). New York: Springer-Verlag.
                 Time (in sec.)            MSE                      MAE
  T    σ/τ       SMC      MCMC       SMC       MCMC          SMC        MCMC
  50     5       1.7      59.2    0.053783  0.041281     0.1856861   0.159972
  50     1       1.4      63.4    0.003069  0.002965     0.0444318   0.044884
  50   0.1       1.8      73.9    0.000092  0.000092     0.0076071   0.007644
 200     5       7.5     253.9    0.031188  0.022350     0.1382585   0.116308
 200     1       6.3     219.5    0.004751  0.004590     0.0560816   0.055067
 200   0.1       5.9     219.3    0.000081  0.000081     0.0073801   0.007379
 500     5      18.7     565.7    0.038350  0.033724     0.1563062   0.146293
 500     1      17.8     544.8    0.004270  0.004192     0.0516373   0.050642
 500   0.1      17.1     540.3    0.000090  0.000090     0.0074950   0.007505

Table 1: Experiment 1. First-order normal dynamic linear model. Prior distributions for σ² and τ² are IG(a, b) and IG(c, d), respectively, where a = c = 10, b = (a − 1)σ²_true and d = (c − 1)τ²_true. Similarly, x_0 ~ N(x_{0,true}, 100). The PL algorithm is based on 5,000 particles; FFBS is based on 5,000 draws after 5,000 burn-in. For the MCMC runs, the initial values for σ² and τ² are the true values. For the SMC runs, the initial draws of σ² and τ² are taken from the priors, while KF_0 = (x_{0,true}, 100). Also, τ_true = 0.1. Finally, MSE and MAE refer to mean squared error and mean absolute error in estimating the state vector x^T.
                 Time (in sec.)            MSE                      MAE
  T    σ/τ       SMC      MCMC       SMC       MCMC          SMC       MCMC
  50     5       3.4      30.6    0.021718  0.022716     0.119374   0.114798
  50     1       2.7      30.4    0.005046  0.005994     0.058300   0.061849
  50   0.1       2.3      30.2    0.000080  0.000080     0.007538   0.007346
 200     5      13.1     121.5    0.033697  0.024991     0.147363   0.120748
 200     1      14.0     121.1    0.005901  0.006270     0.061113   0.061762
 200   0.1      13.3     119.7    0.000135  0.000134     0.008730   0.008644
 500     5      30.3     378.5    0.025721  0.022716     0.127526   0.119456
 500     1      35.8     441.8    0.005177  0.005712     0.056890   0.060391
 500   0.1      32.1     437.8    0.000122  0.000123     0.008661   0.008705

Table 2: Experiment 2. First-order conditionally normal nonlinear dynamic model of Section 4.1. The number of degrees of freedom is kept at ν = 10. True values for β and τ² are 0.95 and 0.01. Prior distributions for β|τ², σ² and τ² are N(β_0, τ²), IG(a, b) and IG(c, d), respectively, where β_0 is defined below, a = c = 10, b = (a − 1)σ²_true and d = (c − 1)τ²_true. Similarly, x_0 ~ N(d_0, 100), with d_0 defined below. SMC is based on 5,000 particles, with initial draws of β, σ², τ² and x_0 taken from the priors. MCMC is based on 5,000 draws after 5,000 burn-in, with initial values for β, σ² and τ² set at the true values. The standard deviation of the random-walk proposal for the states x is ξ = 0.01. The first 20 observations are used to create ỹ_t = (y_{t−2} + y_{t−1} + y_t + y_{t+1} + y_{t+2})/5 and x̃_t = ỹ_{t−1}, for t = 3, . . . , 18, and to run the regression ỹ_t ~ N(β x̃_t, σ̃²). Then β_0 = β̂ and d_0 = ȳ̃.
Figure 1: Filtering and Smoothing for States. Top panel (λ): true (red), filtered expected value (black) and smoothed expected value (blue). Bottom panel (x): true (red), filtered expected value (black) and smoothed expected value (blue).
Figure 2: Parameter Learning. The black lines represent the filtered posterior means of the parameters. The blue lines form 95% credibility intervals and the red lines denote the true parameter values.
Figure 3: Nonlinear CGDM. The first three panels present the mean, 95% credibility interval and true value (red line) in the sequential learning of the parameters. The last panel shows a scatter plot of the true state x against its filtered posterior means.
Figure 4: Experiment 1. Contour plots of the true posterior p(σ², τ² | y^T) together with draws from PL (panels a, c) and FFBS (panels b, d). The blue dots represent the true parameter values in all cases. In (a, b), T = 50; in (c, d), T = 500.
Figure 5: Experiment 2. Effective sample size from PL (left) and single-move MCMC (right). PL is based on 10,000 particles and MCMC on 10,000 iterations after 5,000 burn-in.
Figure 6: Experiment 3. PL draws (blue dots) from the posterior p(x_0, x_1 | y) in Experiment 3 of Section 5. The red contours represent the true posterior.
Figure 7: Experiment 4. Risk ratios for the estimation of percentiles of the distribution p(σ² | y^t).
Figure 8: Experiment 4. The left panel shows the sequential estimate of the mean and 95% credible region for σ². The right panel displays the true versus estimated p(σ² | y^1000). The horizontal line (left) and the black dot (right) represent the true value σ² = 0.1.
Figure 9: Experiment 4. True versus estimated p(σ² | y^T) for different values of T.
Figure 10: Experiment 4. Sequential estimates of the mean and 95% credible regions for the conditional sufficient statistics of σ². The black solid lines represent the true values.
Figure 11: Experiment 5. The left panels show the sequential estimates of the mean and 95% credible regions for all parameters. The right panels display the MCMC versus PL estimates of p(θ | y^T). The horizontal lines (left) and black dots (right) represent the true value of each parameter.