Monte Carlo Filtering of Piecewise Deterministic Processes

Nick Whiteley, Adam M. Johansen and Simon Godsill

CUED/F-INFENG/TR-592 (2007)
Abstract
We present efficient Monte Carlo algorithms for performing Bayesian inference in a broad class of models: those in which the distributions of interest may be represented by time marginals of continuous-time jump processes conditional on a realisation of some noisy observation sequence.
The sequential nature of the proposed algorithm makes it particularly suitable for real-time estimation
in time series. We demonstrate that two existing schemes can be interpreted as particular cases of the
proposed method. Results are provided which illustrate significant performance improvements relative
to existing methods.
keywords: Bayesian Filtering, Optimal Filtering, Particle Filters, Sequential Monte Carlo.
1 Introduction
Throughout statistics and related fields, there exist numerous problems which are most naturally addressed
in a continuous time framework. A great deal of progress has been made over the past few decades in online
estimation for partially-observed, discrete time series. This paper generalises the popular particle filtering
methodology to a broad class of continuous time models.
Within the Bayesian paradigm, the tasks of optimal filtering and smoothing correspond to obtaining,
recursively in time, the posterior distribution of the trajectory of an unobserved stochastic process at a
particular time instant, or over some interval, given a sequence of noisy observations made over time.
State space models, in which the unobserved process is a discrete-time Markov chain, are very flexible
and have found countless applications throughout statistics, engineering, biology, economics and physics. For
example in object tracking, the hidden process models the evolution of position and other intrinsic parameters
of a manoeuvring object. The aim is then to estimate the trajectory of that object, given observations made
by a noisy sensor.
In many cases of interest the state-space model is nonlinear and non-Gaussian, and exact inference is intractable. Approximation methods must, therefore, be employed. Sequential Monte Carlo (SMC) methods (Doucet et al., 2001; Cappé et al., 2007) approximate the sequence of posterior distributions by a collection of weighted samples, termed particles. The application of SMC methods to the discrete time filtering problem
has been the focus of much research in the last fifteen years, especially in the context of tracking. See, for
example, Gordon et al. (1993); Doucet et al. (2000). However, many physical phenomena are most naturally
∗ Nick Whiteley is a Ph.D. student in the Signal Processing Laboratory, Department of Engineering, University of Cambridge,
Trumpington Street, Cambridge, CB2 1PZ, UK. (email: npw24@cam.ac.uk)
† Adam Johansen is a Brunel Fellow in the Statistics Group, Department of Mathematics, University of Bristol, University Walk, Bristol, BS8 1TW, UK. (email: adam.johansen@bristol.ac.uk)
‡ Simon Godsill is Professor of Statistical Signal Processing, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, UK. (email: sjg@eng.cam.ac.uk)
modelled in continuous time and this motivates the development of inference methodology for continuous time
stochastic processes. Whilst it is possible to attempt discretisation, this can be computationally expensive
and introduces a bias which is not easy to quantify.
Estimating the trajectory of a general continuous-time process from statistical knowledge and a finite
collection of observations is not practical in any degree of generality. Davis (1984) proposed the use of piecewise deterministic processes (PDPs), which evolve deterministically in continuous time except at a countable collection of stopping times at which they randomly jump (in the sense of potentially introducing a discontinuity into the trajectory) to a new value. The connection between PDPs and marked point processes was investigated in Jacobsen (2006), and Jacod and Skorokhod (1996) used the term jumping process to describe a generalisation of these models to non-topological spaces.
Numerous problems can be straightforwardly and parsimoniously cast into this framework. We will
present examples in the following two areas:
Tracking: It has been demonstrated that, in some object tracking scenarios, the trajectory of a manoeuvring
object may be more parsimoniously modelled by a PDP than by traditional discrete-time models, see
Lasdas and Davis (1989); Sworder et al. (1995); Godsill et al. (2007) and references therein.
Mathematical Finance: Inferring the hidden intensity of a shot noise Cox process (SNCP) plays a role in
reinsurance and option pricing schemes. Exact inference for SNCP models is intractable. Approximate
methods based on weak convergence to Gaussian processes in specific parameter regimes (Dassios and
Jang, 2005) and Markov Chain Monte Carlo have been proposed (Centanni and Minozzo, 2006).
If such models are to be employed in practice, accurate and computationally efficient inference schemes
must be developed. The contribution of this paper is the development of efficient SMC algorithms for the
filtering of any PDP. A key factor which determines the efficiency of SMC methods is the mechanism by
which particles are propagated and re-weighted. If this mechanism is not carefully chosen, the particle
approximation to the target distribution will be poor, with only a small subset of the particles having any
significant weight. Over time, this in turn leads to loss of diversity in particle locations and instability of
the algorithm. The proposed approach combats these issues, maintaining considerably more diversity in the
particle system than previously-proposed approaches and therefore providing a better approximation to the
target distribution.
We are specifically interested in performing inference online, that is, as observations become available
over time and in such a manner that computational complexity and storage requirements do not increase
over time. This is essential in many applications.
In the next section, we specify the filtering model of interest more precisely. A representative example
is presented and existing approaches to inference for such models are discussed. In section 3 we describe
the design of an SMC algorithm which is efficient for the processes of interest here. Details of an auxiliary
variable technique which can further improve the efficiency of the proposed method are also presented. We
present results in section 4, for a SNCP model used in reinsurance pricing and a fighter aircraft tracking
model and demonstrate the improvement in performance over existing algorithms which is possible.
2 Background
Before introducing discrete time particle filtering methods, formalising the problem and describing approaches to its solution, it is useful to introduce some notation which will be used throughout.
Except where otherwise noted, distributions will be assumed to admit a density with respect to a suitable
dominating measure, denoted dx, and, with a slight abuse of notation, the same symbol will be used to refer
to both a measure and its associated density, so for example π(dx) = π(x)dx. We denote by δx (·) the Dirac
measure which takes all its mass at the point x. For a Markov kernel K(x, dy) acting from E1 to E2 and
some probability measure µ(dx) on E1 , we will employ the following shorthand for the integral operation:
µK(·) = ∫_{E1} K(x, ·) µ(dx).
When describing marginals of joint densities, we will use the notation of the following form:
πn(xk) = ∫ πn(x1:n) dx1:k−1 dxk+1:n.
2.1 Particle Filtering
The traditional setting of sequential Monte Carlo methods has been in the field of filtering discrete time
hidden Markov models. To motivate the present work it is convenient to review these particle filtering
algorithms – this summary is necessarily brief and the reader is referred to Doucet et al. (2001) for further
details.
Consider two discrete time stochastic processes: the signal process, denoted (Xn)n≥1, and the observation process, denoted (Yn)n≥1, defined on a common probability space (Ω, F, P) with the following conditional independence structure (where we assume that densities f and g exist):

P(Xn ∈ A | X1:n−1 = x1:n−1, Y1:n−1 = y1:n−1) ≡ ∫_A f(xn|xn−1) dxn,
P(Yn ∈ B | X1:n = x1:n, Y1:n−1 = y1:n−1) ≡ ∫_B g(yn|xn) dyn,

for any measurable sets A, B. That is, (Xn)n≥1 is a Markov chain and (Yn)n≥1 has the property that, for all n, conditional on the signal at time n, Yn is independent of all previous states and observations.
The filtering problem is to approximate the density p(xn|y1:n) at each time n (typically, it is necessary to perform this task sequentially and in real time, as one wishes to obtain estimates concerning the current state of the system as observations arise). The densities of interest may be written as time marginals of the smoothing densities, which are given recursively by:

p̂n(x1:n|y1:n−1) = pn−1(x1:n−1|y1:n−1) f(xn|xn−1),
pn(x1:n|y1:n) ∝ p̂n(x1:n|y1:n−1) g(yn|xn).
This is a difficult problem as, except in a few special cases, the normalising constant of pn (x1:n |y1:n ) cannot
be obtained in closed form. The solution offered by sequential Monte Carlo is to propagate a collection
of weighted samples through time so that they approximate each of the distributions of interest at the
appropriate time using importance sampling and resampling mechanisms. Resampling involves eliminating
particles with low weights and multiplying those with high weights, resulting in a set of samples which all
have the same weight. This is optionally applied at each iteration. Algorithm 1 provides a description of
a basic particle filter in the case that resampling is applied at every iteration. At step 10 the algorithm
provides a weighted sample from pn (xn |y1:n ). The resampling step is a crucial ingredient. If it were never
applied, the algorithm would collapse after several iterations, with all weight concentrated on a single particle,
giving a very poor approximation of the target distribution. Resampling alleviates this weight-degeneracy
phenomenon at the expense of introducing degeneracy in the historical locations of the particles. However,
for filtering, this latter degeneracy effect is not so problematic as the very recent history of the particle set,
from which estimates are to be drawn, can remain diverse. See Douc et al. (2005) and references therein for
details of a variety of resampling schemes.
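To make the mechanism concrete, here is a minimal Python sketch of systematic resampling, one common low-variance scheme from the family surveyed by Douc et al.; the choice of this particular scheme, and all function names, are our own illustrative assumptions, not taken from the report.

```python
import random

def systematic_resample(particles, weights, rng=None):
    """Systematic resampling: a single uniform offset u gives N evenly
    spaced pointers u + i/N into the cumulative normalised weights.
    Low-weight particles are eliminated, high-weight ones duplicated."""
    rng = rng or random.Random()
    n = len(particles)
    total = sum(weights)
    u = rng.random() / n
    out, cum, j = [], weights[0] / total, 0
    for i in range(n):
        target = u + i / n          # i-th evenly spaced pointer
        while cum < target:         # advance to the particle owning this pointer
            j += 1
            cum += weights[j] / total
        out.append(particles[j])
    return out
```

After the call, all N returned particles carry equal weight 1/N.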
Algorithm 1 A simple sequential Monte Carlo filter.
1: Set n = 1.
2: for i = 1 to N do
3:    X1^(i) ∼ h1   {where h1 is an instrumental density.}
4:    W1^(i) ∝ p1(X1^(i)) / h1(X1^(i))
5: end for
6: n ← n + 1
7: for i = 1 to N do
8:    Xn^(i) ∼ hn(·|Xn−1^(i))   {hn(·|Xn−1^(i)) is some instrumental density.}
9:    Wn^(i) ∝ f(Xn^(i)|Xn−1^(i)) g(yn|Xn^(i)) / hn(Xn^(i)|Xn−1^(i))
10: end for
11: Resample.
12: Go to step 6.
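As a concrete illustration of Algorithm 1, the following Python sketch runs the bootstrap variant (taking hn = f, so that the weight reduces to g(yn|Xn)) on a toy linear-Gaussian model. The model and all parameter values are illustrative assumptions, not taken from the report.

```python
import math
import random

def bootstrap_filter(ys, n_particles=500, rho=0.9, sig_x=1.0, sig_y=0.5, seed=0):
    """Bootstrap particle filter for the illustrative toy model
    X_n = rho*X_{n-1} + N(0, sig_x^2),  Y_n = X_n + N(0, sig_y^2).
    Proposing from the prior f (h_n = f), the weight is W_n ∝ g(y_n|X_n).
    Returns the filtering-mean estimates E[X_n | y_{1:n}]."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sig_x) for _ in range(n_particles)]   # X_1 ~ h_1
    means = []
    for y in ys:
        logw = [-0.5 * ((y - x) / sig_y) ** 2 for x in xs]     # log g(y_n|X_n)
        m = max(logw)
        ws = [math.exp(lw - m) for lw in logw]                 # stabilised weights
        s = sum(ws)
        ws = [w / s for w in ws]
        means.append(sum(w * x for w, x in zip(ws, xs)))       # weighted estimate
        xs = rng.choices(xs, weights=ws, k=n_particles)        # resample (step 11)
        xs = [rho * x + rng.gauss(0.0, sig_x) for x in xs]     # propagate via f (step 8)
    return means
```

With repeated observations near a fixed value, the estimates settle close to that value, as the Kalman-filter solution for this linear-Gaussian case predicts.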
2.2 Statement of the Problem
The problem which is treated in the present work can be summarised as follows. We first define the signal
process (ζt)t≥0, where each ζt takes a value in a state space Ξ (which is assumed to be Polish and endowed with a suitable Borel σ-field), typically some subset of R^dx, and a sequence of noisy observations (Yn)n∈N.
Given the signal at the observation time (or in some local neighbourhood if observations are received in
batches at fixed times as in the SNCP example), the observations are conditionally independent of one
another and the remainder of the signal process. Our aim is to obtain iteratively, as observations arrive, the
conditional distribution of the signal process given the collection of observations up to that time.
2.3 Model Specification
We now provide a specification of the class of models under consideration. Consider first a discrete-time pair
Markov process (τj , θj )j∈N , of non-decreasing times, τj ∈ R+ and parameters, θj ∈ Ξ with transition kernel
of the form:
p(d(τj , θj )|τj−1 , θj−1 ) = f (dτj |τj−1 )q(dθj |θj−1 , τj−1 , τj ).
(1)
There is no theoretical reason for limiting ourselves to the conditional independence structure as in (1),
but we here consider only such structure for ease of presentation. In some applications there may be a
requirement to make τj dependent on θj−1 , or even to define the process (τj , θj )j∈N as non-Markovian, and
in principle the proposed methods can accommodate such dependence structures.
We next define a random continuous time counting process (νt )t≥0 as follows:
νt = ∑_{j=1}^{∞} I_{[0,t]}(τj) = max{j : τj ≤ t}.
The right-continuous signal process, (ζt )t≥0 , which takes a value in Ξ at any time t and has known initial
distribution, ζ0 ∼ q0 (ζ0 ), is then defined by:
ζt = F (t, τνt , θνt ),
with the conventions that τ0 = 0 and θ0 = ζ0. The function F : R+ × R+ × Ξ → Ξ is deterministic and subject to the condition that F(τj, τj, θj) = θj for all j ∈ N.
From the initial condition ζ0 , a realisation of the signal process evolves deterministically according to F
until the time of the first jump τ1 , at which time it takes the new value θ1 . The signal continues to evolve
deterministically according to F until τ2 , at which time the signal acquires the new value θ2 , and so on.
The signal process is observed in the nth window (tn−1 , tn ] through a collection of random variables,
Yn , conditionally independent of the past, given the signal process on (tn−1 , tn ]. We denote the likelihood
function by g(yn |ζ(tn−1 ,tn ] ).
We will be especially interested in the number of jumps occurring in the interval [0, tn] and therefore set kn := νtn. Our signal model induces a joint prior distribution, pn(kn, τ1:kn), on the number of jumps in [0, tn] and their locations:

pn(kn, dτ1:kn) = S(tn, τkn) ∏_{j=1}^{kn} f(dτj|τj−1),   (2)
which has support on the disjoint union ⋃_{k=0}^{∞} {k} × Tn,k, where R^k ⊃ Tn,k = {τ1:k : 0 < τ1 < · · · < τk ≤ tn}, and where S(t, τ) is the survivor function associated with the transition kernel f(dτj|τj−1):

S(t, τ) = 1 − ∫_{τ}^{t} f(ds|τ).   (3)
Expression (3) is the conditional probability that, given that the most recent jump occurred at τ, the next jump occurs after t. Thus (2) is the probability that there are precisely kn jumps on [0, tn], with locations in infinitesimal neighbourhoods of τ1, τ2, . . . , τkn.
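To make (2) and (3) concrete, the prior over jump times can be simulated by drawing inter-jump times recursively from f until tn is exceeded. The sketch below assumes, purely for illustration, exponential waiting times with rate λ (so S(t, τ) = exp(−λ(t − τ))); the function names are our own.

```python
import math
import random

def survivor(t, tau, lam):
    """S(t, tau) = 1 - ∫_tau^t f(ds|tau), for the illustrative
    exponential waiting-time kernel with rate lam."""
    return math.exp(-lam * (t - tau))

def sample_jump_times(t_n, lam, rng):
    """Draw (k_n, tau_{1:k_n}) from the prior (2): sample inter-jump
    times recursively from f(d tau_j | tau_{j-1}) until t_n is exceeded,
    the final overshoot accounting for the survivor-function factor."""
    taus, tau = [], 0.0
    while True:
        tau += rng.expovariate(lam)   # next jump time from f
        if tau > t_n:
            return len(taus), taus    # k_n jumps on [0, t_n]
        taus.append(tau)
```

Under this exponential choice the jump count on [0, tn] is Poisson with mean λ·tn.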
Given the function F , the path (ζt )t∈[0,tn ] is completely specified by the initial condition ζ0 , the number of
jumps, kn , their locations τ1:kn and associated parameter values θ1:kn . We will be interested in the conditional
distributions of such quantities, given all observations made up to and including tn . It is consequently
necessary to specify a distinct collection of random variables Xn = (kn , ζn,0 , θn,1:kn , τn,1:kn ), which takes its
values in the disjoint union:
En = ⋃_{k=0}^{∞} {k} × Ξ^{k+1} × Tn,k,

with R^k ⊃ Tn,k = {τn,1:k : 0 < τn,1 < · · · < τn,k ≤ tn}. Note that En ⊂ En+1. We also write ζn,t = F(t, τνn,t, θνn,t).
In order to obtain the distribution of (ζt )t∈[0,tn ] , given the observations y1:n , it would suffice to find
πn (xn ), the posterior density of Xn , because, by construction, the signal process is a deterministic function
of the jump times and parameters. This posterior distribution, up to a constant of proportionality, has the
form:
πn(xn) ∝ pn(kn, τn,1:kn) q0(ζn,0) ∏_{j=1}^{kn} q(θn,j|θn,j−1, τn,j−1, τn,j) ∏_{p=1}^{n} g(yp|ζn,(tp−1,tp]).   (4)
As a filtering example, in order to obtain the distribution, p(ζtn |y1:n ), it suffices to obtain πn (τkn , θkn ). Exact
inference for all non-trivial versions of this model is intractable and in section 3 we describe Monte Carlo
schemes for obtaining sample-based approximations of posterior distributions.
2.4 A Motivating Example
We next provide an example application in order to illustrate how the above-specified model can be used
in practice. We consider the planar motion of a vehicle which manoeuvres according to standard piecewise-constant acceleration dynamics. Each parameter may be decomposed into x and y components, each containing a position, s, velocity, u, and acceleration value, a:
θj = (θj^x, θj^y)   and   F(t, τνt, θνt) = (F^x(t, τνt, θνt), F^y(t, τνt, θνt)).
Here Ξ = R^6, but the x and y components have identical parameters and evolutions; for brevity we describe only a single component: θj^x = [sj^x  uj^x  aj^x]^T and

F^x(t, τνt, θνt) = [ 1   (t − τνt)   (t − τνt)²/2 ] [ s^x_νt ]
                   [ 0       1         (t − τνt)  ] [ u^x_νt ]
                   [ 0       0             1      ] [ a^x_νt ] .
At time zero the vehicle has position, velocity and acceleration ζ0 . At subsequent times, τj , the acceleration
of the vehicle jumps to a new, random value according to q(dθj |θj−1 , τj−1 , τj ), which is specified by:
q(dθj^x|θj−1, τj−1, τj) = δ_{sj^{x,−}}(dsj^x) δ_{uj^{x,−}}(duj^x) q(daj^x),

where

sj^{x,−} = sj−1^x + uj−1^x (τj − τj−1) + aj−1^x (τj − τj−1)²/2,
uj^{x,−} = uj−1^x + aj−1^x (τj − τj−1).
The component of F in the y-direction is equivalent. This model is considered a suitable candidate for the benchmark fighter-aircraft trajectory described in Blair et al. (1998), shown in figure 1. The parameter transition kernel q embodies prior knowledge of how the acceleration of the vehicle changes. The kernel f(τj|τj−1) represents prior knowledge of the amount of time the vehicle spends executing a particular manoeuvre.
manoeuvre. As an example observation model, at each time tn the Cartesian position of the vehicle is
observed through some noisy sensor. An example inference task is then to recursively estimate the position
of the vehicle, and the time of the most recent jump in its acceleration, given observations from the noisy
sensor. We return to this example in the sequel.
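A minimal sketch of the single-coordinate dynamics above: F^x propagates (s, u, a) deterministically between jumps, and the jump update keeps position and velocity continuous (the Dirac deltas in q) while the acceleration takes a new value. Function names are our own illustrative choices.

```python
def F_x(t, tau, theta_x):
    """Deterministic constant-acceleration evolution of one coordinate
    between jumps: theta_x = (s, u, a) is the state at the last jump time tau."""
    s, u, a = theta_x
    dt = t - tau
    return (s + u * dt + 0.5 * a * dt * dt,  # position
            u + a * dt,                      # velocity
            a)                               # acceleration is constant between jumps

def jump(theta_prev, tau_prev, tau, new_accel):
    """Parameter update at a jump time tau: position and velocity are carried
    forward continuously (the deltas in q), the acceleration jumps to new_accel."""
    s_minus, u_minus, _ = F_x(tau, tau_prev, theta_prev)
    return (s_minus, u_minus, new_accel)
```

Note that the construction satisfies the consistency condition F(τj, τj, θj) = θj from section 2.3.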
2.5 Related Work
Inference schemes based upon direct extension of the particle filter have been devised for the process of
interest. The variable rate particle filter (VRPF) of Godsill and Vermaak (2004); Godsill et al. (2007);
Godsill (2007) is one such scheme, which samples a random number of jump times on the interval [0, tn ]
[Figure 1 about here: aircraft trajectory plotted as sy/km against sx/km.]
Figure 1: Benchmark 2D position trajectory and observations with additive Gaussian noise. The trajectory
begins around (66,29), with the aircraft executing several manoeuvres before the trajectory terminates after
a duration of 185 seconds.
by drawing recursively from f(τj|τj−1) until a stopping time. One could equivalently sample first from pn(kn) and then from fn(τ1:kn|kn). A similar method was presented independently in Maskell (2004). The relationship between the proposed method and these existing algorithms is made precise in the sequel.
In the context of particle filtering for standard discrete time state space models, it is well known that sampling from the prior distribution can be inefficient, yielding importance weights of high variance. This problem is exacerbated when constructing SMC algorithms for PDPs.
Resampling, which is an essential component of SMC algorithms for online filtering, results in multiple copies of some particles. Under the prior distribution for the class of models considered here, there
is significant positive probability that zero jumps will occur between one observation time and the next.
Proposing from the prior distribution after resampling can therefore result in multiple copies of some particles being propagated over several iterations of the algorithm, without diversification. This phenomenon
is most obvious when the expected jump arrival rate is low relative to the rate at which observations are
made (as is the case in applications of interest). This in turn leads to the accumulation of errors in the
particle approximation to the target distribution and instability of the algorithm. Computational methods
to avoid repeated calculations were proposed in Godsill et al. (2007), but these do not address the underlying
problem. The technique of “state regeneration” was introduced in Godsill and Vermaak (2005), but it too
involves proposing particles without taking into account new observations.
In Doucet et al. (2006a), the SMC samplers framework is used to enhance the efficiency of a discrete
time filtering algorithm. This method allows a recent history of the state trajectory for each particle to be
perturbed using an importance sampling mechanism, diversifying the particle set. The “adjustment” moves
proposed below achieve a similar effect in the context of filtering for PDPs.
Finally, we note that Chopin (2006) addressed a related but discrete-time change-point problem. Their
approach reformulates the standard state-space model, yielding a filtering scheme which propagates the
posterior distribution for the time since the most recent change-point. Chopin remarks that slow-mixing
properties make the inclusion of MCMC diversification moves in the accompanying SMC algorithm essential
for efficient operation. The “adjustment” moves proposed below address the same basic problem within the
framework of the present paper. MCMC diversification can also be applied in the proposed algorithm (see
details below).
3 Methodology
We begin with a brief summary of the sampling framework which we propose to employ before describing
its use to sample from the class of distributions of interest.
The SMC samplers framework of Del Moral et al. (2006a,b) is a very general method for obtaining a
set of samples from a sequence of distributions which exist on the same or different spaces. This can be
viewed as a generalisation of the standard SMC method in which the object distribution exists on a space
of strictly increasing dimension. SMC samplers have recently been applied to similar trans-dimensional
problems (Doucet et al., 2006b). The motivation for this approach is that it would be extremely useful to
have a generic technique for sampling from a sequence of distributions defined upon arbitrary spaces which
are somehow related and this is precisely what is required in the filtering of piecewise deterministic processes.
The principal innovation of the SMC sampler approach is to construct a sequence of synthetic distributions with the necessary properties. Given a collection of measurable spaces (En, ℰn)n∈N, upon which the sequence of probability measures from which we wish to sample, (πn)n∈N, is defined, it is possible to construct a sequence of distributions (π̃n)n∈N upon the sequence of product spaces (∏_{p=1}^{n} Ep)n∈N, endowed with the product σ-algebras, which have the target at time index n as a marginal distribution at that time index. As we are only interested in this marginal distribution, there is no need to adjust the position of earlier states in the chain and standard SMC techniques can be employed.
However, this approach suffers from the obvious deficiency that it involves conducting importance sampling upon an extremely large space, whose dimension is increasing with n. In order to ameliorate the situation, the synthetic distributions are constructed in the following way (assuming that a density representation exists):

π̃n(x1:n) = πn(xn) ∏_{p=1}^{n−1} Lp(xp+1, xp),   (5)
where (Ln )n∈N is a sequence of ‘backward’ Markov kernels from En into En−1 .
With this structure, an importance sample from π̃n is obtained by taking the path x1:n−1, a sample from π̃n−1, and extending it with a Markov kernel, Kn : En−1 × En → [0, 1], which acts from En−1 into En, providing samples from π̃n−1 × Kn and leading to the importance weight:

wn(xn−1, xn) ∝ π̃n(x1:n) / [π̃n−1(x1:n−1) Kn(xn−1, xn)] = πn(xn) Ln−1(xn, xn−1) / [πn−1(xn−1) Kn(xn−1, xn)].
Remark 1. Although, for simplicity, we have above followed the convention in the literature and assumed
that all distributions and kernels of interest may be described by a density with respect to some common
dominating measure, this is not necessarily the case.
All that is actually required is that the product distribution defined by
πn ⊗ Ln−1 (d(xn−1 , xn )) = πn (dxn )Ln−1 (xn , dxn−1 )
is dominated by πn−1 ⊗ Kn(d(xn−1, xn)); the importance weight is then the corresponding Radon–Nikodým derivative. Thus the backward kernel Ln−1 must be chosen such that this requirement is met.
This approach ensures that the importance weights at time index n depend only upon xn and xn−1 . This
also allows the algorithm to be constructed in such a way that the full history of the sampler need not be
stored. In the context of the filtering problem under consideration, recall that xn actually specifies a path of (ζt)t∈[0,tn], but we will construct samplers with importance weights which depend only upon some recent history of xn and xn−1, yielding a scheme with storage requirements which do not increase over time.
3.1 PDP Particle Filter
We next describe the design of an SMC scheme to target the distributions of interest in our sampling problem. By applying the SMC samplers method to the sequence of distributions (πn(xn))n∈N, see (4), we can obtain
recursive schemes which propagate particle approximations to the distributions of various quantities in the
recent history of the signal process. We will employ kernels which result in online algorithms, i.e. whose
computational complexity and storage requirements do not grow over time.
As a simple example we could maintain an approximation to each marginal distribution πn(τkn, θkn), and thus to p(ζtn|y1:n). Then at the nth iteration, the algorithm would yield a set of N particles, {(kn, τkn, θkn)^(i), wn^(i)}_{i=1}^{N}. This approximates the filtering distribution for the signal process via:

P^N(dζtn|y1:n) = ∑_{i=1}^{N} wn^(i) δ_{ζtn^(i)}(dζtn),

where ζtn^(i) = F(tn, τ^(i)_{kn^(i)}, θ^(i)_{kn^(i)}).
kn
kn
We next give specific details of how such algorithms can be constructed. The explicit treatment of the
dimensionality of the problem gives us control over the proposal of different numbers of jumps. Furthermore,
the SMC samplers framework accommodates a more efficient proposal mechanism than that of the VRPF
by permitting ‘adjustment’ moves as described below.
3.1.1 Choice of Kernels
The SMC samplers approach is clearly very flexible, and is perhaps too general: in addition to the choice of the proposal kernels Kn, it is now necessary to select a sequence of backward kernels Ln. The appearance of these kernels in the weight expression makes it clear that it is extremely important to select them carefully.
In the proposed algorithms we will employ proposal kernels which are mixtures, with, for example, each mixture component applying a different increment to the dimensionality parameter kn. In general it is valid to allow the mixture weights to be a function of the current state, for example:
Kn(xn−1, xn) = ∑_{m=1}^{M} αn,m(xn−1) Kn,m(xn−1, xn),   with   ∑_{m=1}^{M} αn,m(xn−1) = 1.
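Sampling from such a mixture kernel is straightforward: draw the move index m with state-dependent probability αn,m(xn−1), then propose from Kn,m. A generic sketch, in which the interface and names are our own illustrative assumptions:

```python
import random

def sample_mixture_kernel(x_prev, moves, alpha, rng):
    """Draw a move index m with probability alpha(x_prev)[m], then propose
    x_new ~ K_{n,m}(x_prev, ·). `moves` is a list of callables (the K_{n,m});
    `alpha` maps the current state to its mixture weights. Returns (m, x_new),
    since the weight (8) requires knowing which component was used."""
    probs = alpha(x_prev)
    m = rng.choices(range(len(moves)), weights=probs, k=1)[0]
    return m, moves[m](x_prev, rng)
```

Retaining m alongside the proposal is what permits the extended-space weight (8) below to be evaluated.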
Del Moral et al. (2006a) provide the following expression for the optimal backward kernel for such a proposal
kernel (in the sense that the variance of the importance weights is minimized if resampling is conducted at
every step), which also has a mixture form:
L^opt_{n−1}(xn, xn−1) = ∑_{m=1}^{M} β^opt_{n−1,m}(xn) L^opt_{n−1,m}(xn, xn−1)
                      = [ ∑_{m=1}^{M} αn,m(xn−1) πn−1(xn−1) Kn,m(xn−1, xn) ] / [ ∑_{m=1}^{M} ∫_{En−1} αn,m(xn−1) πn−1(xn−1) Kn,m(xn−1, xn) dxn−1 ].   (6)
When this optimal backward kernel is employed, the importance weight becomes:
wn(xn−1, xn) ∝ πn(xn) / ∫_{En−1} πn−1(xn−1) [ ∑_{m=1}^{M} αn,m(xn−1) Kn,m(xn−1, xn) ] dxn−1.   (7)
Whenever it is possible to use this optimal backward kernel, one should do so. However, in most instances the integral in the denominator of (7) will prove intractable (it is an expectation with respect to the target distribution πn−1, and the intractability of such integrals is the main motivation for adopting a sampling-based approach in the first place), and an approximation to the optimal kernel must be used instead. We note that this is an approximation of the optimal kernel rather than an algorithmic approximation: the algorithm remains exact; only the estimator variance suffers as a result of the approximation. In the context of mixture proposal kernels, it was suggested in Del Moral et al. (2006b) that importance sampling could be performed on a higher dimensional space, involving the space of mixture component indicators, in which case the incremental importance weight is specified by:
wn(xn, m, xn−1) ∝ πn(xn) βn−1,m(xn) Ln−1,m(xn, xn−1) / [ πn−1(xn−1) αn,m(xn−1) Kn,m(xn−1, xn) ].   (8)
The design of the proposal kernel also plays a significant role in the performance of the algorithm. In
order to minimise the variance of the importance weights, it must be well matched to the target distribution
and hence to the observations.
There are many different proposal kernels which could be employed in the context of PDP filtering. In order to provide some guidance, we next give a generic kernel, consisting of two moves, in order to highlight issues which affect the algorithm's performance. An algorithm built around this kernel, possibly also involving the application of a Metropolis-Hastings kernel after resampling (further details below), constitutes a general strategy for the filtering of PDPs. Another example can be found in Appendix A, where the algorithm of Maskell (2004) and the VRPF of Godsill et al. (2007) are specified in the PDP particle filter framework.
3.1.2 Standard Moves
We here consider two generic moves, an adjustment move and a birth move.
Adjustment Move Kn,a. The dimensionality is maintained and the most recent parameter, θn−1,kn−1, is redrawn from a distribution ηn(·|xn−1), yielding a new parameter value θn,kn:

Kn,a(xn−1, dxn) = δkn−1(kn) δτn−1,1:kn−1(dτn,1:kn) δθn−1,1:kn−1(dθn,1:kn−1) × ηn(dθn,kn|xn−1).   (9)
It can be shown that the optimal choice (in the sense of minimizing the conditional variance of the importance weights, given that an adjustment move is to be made and that the optimal backwards kernel is employed) of distribution from which to propose the new parameter is the full conditional distribution πn(·|xn \ θn,kn), where xn \ θn,kn denotes all components of xn other than θn,kn.
Birth Move Kn,b. The dimensionality is increased and additional jump times and parameters are sampled. In a simple case, the number of jumps is incremented, kn = kn−1 + 1, a new jump time, τn,kn, is proposed from a distribution hn(·) on (τn,kn−1, tn], and a new parameter is then drawn from a proposal distribution ηn(·|xn \ θn,kn):

Kn,b(xn−1, dxn) = δkn−1+1(kn) δτn−1,1:kn−1(dτn,1:kn−1) hn(dτn,kn) × δθn−1,1:kn−1(dθn,1:kn−1) ηn(dθn,kn|xn \ θn,kn).   (10)
It can be shown that the optimal choice (in the sense of minimizing the variance of the importance weights)
of distribution from which to propose the new parameter is the full conditional πn (θn,kn |xn \ θn,kn ), given
that a birth move is to be made, the optimal backwards kernel employed and τn,kn . If τn,kn ≤ tn−1 this
amounts to altering the trajectory (ζt )t∈[τn,kn ,tn−1 ] and extending the trajectory onto (tn−1 , tn ].
An alternative, suboptimal choice is to propose the new parameter from the prior, q(·|θj−1 , τj−1 , τj ). This
proposal has the advantage that it is simpler to use and does not require knowledge of the full conditional
distribution. Unfortunately, this simplicity comes at the cost of an increase in variance. When the full
conditional distributions are not available analytically, sensible approximations should be employed. We
note that such approximations do not affect the exactness of the algorithm; just the estimator variance.
It should be noted that, due to the nested structure of the sequence (En )n∈N and the support of the
sequence (πn )n∈N , there is no technical requirement to include a dimensionality reducing component in
the proposal. However, one may be advantageous in some scenarios, for example in models where later
observations can reduce the posterior probability of there being a change point in the recent history of the
hidden process.
Proposal Mixture Weights. A simple choice of mixture weights is in terms of prior probabilities. If employing a kernel which consists of a birth move and an adjustment move, we could define the mixture weight for the adjustment move, α_{n,a}(x_{n-1}), as follows:

α_{n,a}(x_{n-1}) = S(t_n, τ_{n-1,k_{n-1}}),   (11)

which is the conditional prior probability of zero jumps in (τ_{n-1,k_{n-1}}, t_n], given τ_{n-1,k_{n-1}}. Ideally we would
like to take into account information from the observations in order to adapt the mixture weights and
minimize the variance of the importance weights. Unfortunately this is usually intractable in practice.
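For the Poisson-process jump-time prior used in the shot noise Cox process example of Section 4.1, the survivor function is S(t, τ) = exp(−λ_τ(t − τ)), and the prior-based mixture weight (11) is easy to evaluate. The following is a minimal sketch under that exponential assumption; the function names are our own, not the paper's.

```python
import math

def survivor(t, tau, rate):
    """Prior probability of no jumps in (tau, t] for a Poisson-process
    jump-time prior with intensity `rate`: S(t, tau) = exp(-rate*(t - tau))."""
    return math.exp(-rate * (t - tau))

def mixture_weights(t_n, tau_last, rate):
    """Prior-based mixture weights as in (11): choose the adjustment move
    with probability S(t_n, tau_last) and the birth move otherwise."""
    alpha_adjust = survivor(t_n, tau_last, rate)
    return alpha_adjust, 1.0 - alpha_adjust

# with rate 1/40 and an elapsed time of 40, alpha_adjust = exp(-1)
a, b = mixture_weights(50.0, 10.0, 1.0 / 40.0)
```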
Backward Kernel and Importance Weights. Given a proposal kernel consisting of a number of different
moves, we need a strategy for designing a corresponding backward kernel. We adopt the approach as in Del
Moral et al. (2006b) and first design one backward kernel component corresponding to each proposal kernel
component, and then find a way to combine the backward kernel components.
Ideally we would employ an optimal backward kernel with components specified by (6); unfortunately this will rarely be possible in practice. Sensible approximations must therefore be employed, whilst ensuring that the requirement described in Remark 1 is met. For the forward kernel components described above,
we advocate the use of the following backward kernel components or approximations thereof:
L_{n-1,a}(x_n, x_{n-1}) = π_{n-1}(x_{n-1}) K_{n,a}(x_{n-1}, x_n) / π_{n-1}K_{n,a}(x_n),

L_{n-1,b}(x_n, x_{n-1}) = π_{n-1}(x_{n-1}) K_{n,b}(x_{n-1}, x_n) / π_{n-1}K_{n,b}(x_n).
We then need a way to combine these backward kernel components. Unfortunately it is difficult to approximate the mixture weights, β^{opt}_{n-1,m}(x_n), of the optimal backward kernel. For the two move types above, a strategy which has been found to work in practice is simply to set, for all m, β_{n-1,m}(x_n) = 1/M, where M is the total number of components in the proposal kernel. This approach can be expected to perform well when there is little overlap in the supports of π_{n-1}K_{n,b} and π_{n-1}K_{n,a}. Then, using the expression in (8), the corresponding importance weights are given by the following expressions.
For the adjustment move, the incremental weight is given by:

w_n(x_{n-1}, x_n) ∝ π_n(x_n) π_{n-1}(θ_{n-1,k_{n-1}}|x_{n-1} \ θ_{n-1,k_{n-1}}) / [α_{n,a}(x_{n-1}) π_{n-1}(x_{n-1}) π_n(θ_{n,k_n}|x_n \ θ_{n,k_n})],   (12)

and for the birth move:

w_n(x_{n-1}, x_n) ∝ π_n(x_n) / [α_{n,b}(x_{n-1}) π_{n-1}(x_{n-1}) h_n(τ_{n,k_n}) π_n(θ_{n,k_n}|x_n \ θ_{n,k_n})].   (13)
From the definition of π_n, (4), and the definition of the proposal kernels, it can be seen that (12) and (13) in fact depend only upon a recent history of x_n.
Further Considerations. A technical requirement of importance sampling schemes is that the support of the proposal distribution includes that of the posterior distribution. It appears never to have been mentioned in the literature that, in the context of trans-dimensional inference in time series, this imposes the requirement that a forward kernel capable of proposing any positive number of births in the interval (t_{n-1}, t_n] must be employed if there is a positive probability associated with such configurations under the target distribution.
In principle it might be sufficient to employ a birth move which introduces 1+B new jumps, with B a
Poisson random variable of low intensity, to ensure convergence (asymptotically in the number of particles)
to the target distribution. Of course, for finite samples, if the intensity is small enough it is equivalent to
never proposing more than a single new jump. However, if the intensity is made too small then the estimator
variance may become very large. In practice it is extremely unlikely that it will be possible to accurately
infer the position of several jumps between a pair of observations.
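A birth move of this kind could draw the number of new jumps as 1 + B with B Poisson distributed, for instance by inverting the Poisson CDF. The sketch below is ours; the intensity ν and the function name are choices of our own, and the accumulation is adequate only for the small intensities discussed above.

```python
import math
import random

def num_new_jumps(nu, rng=random):
    """Draw 1 + B, with B ~ Poisson(nu), by inverting the Poisson CDF.
    Guarantees at least one new jump while giving every positive count
    positive probability, as required for absolute continuity."""
    target = rng.random()
    p = math.exp(-nu)        # P(B = 0)
    acc, k = p, 0
    while acc < target:
        k += 1
        p *= nu / k          # recursion: P(B = k) from P(B = k - 1)
        acc += p
    return 1 + k
```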
Due to the sequential nature of the algorithm and the form of the above moves, the algorithm requires
storage of only a fixed-length history for each particle. By employing a mixture of moves which leave the
state unchanged and prior birth moves on the interval (tn−1 , tn ], in proportions specified by the prior on the
dimensionality parameter kn , we obtain algorithms as in Godsill (2007) and Maskell (2004). For details, see
Appendix A.
It is important to ensure that a significant fraction of the particle set have weights substantially above 0.
The effective sample size (ESS), introduced by Kong et al. (1994), is an approximation of a quantity which describes the effective number of iid samples to which the set corresponds. The ESS is defined as

ESS = [Σ_{i=1}^{N} (W^{(i)})²]^{-1},

where W^{(i)} are the normalized weights. Resampling should be carried out after any iteration which causes the ESS to fall below a reasonable threshold (typically around half of the total number of particles), to prevent the sample becoming degenerate with a small number
of samples having very large weights. After resampling at the nth iteration, a Markov kernel of invariant
distribution πn , for example a reversible jump Metropolis-Hastings kernel, can be applied to the particle
system. This is a standard technique for increasing the diversity of particle locations, whilst maintaining a
particle set which targets πn , see Gilks and Berzuini (2001).
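The ESS computation and systematic resampling (the variant used in the simulation study) can be sketched as follows; the function names are our own.

```python
import random

def ess(weights):
    """Effective sample size: inverse sum of squared normalized weights."""
    s = sum(weights)
    return 1.0 / sum((w / s) ** 2 for w in weights)

def systematic_resample(weights, rng=random):
    """Systematic resampling: a single uniform offset generates N evenly
    spaced pointers into the cumulative weight distribution."""
    n = len(weights)
    s = sum(weights)
    cumulative, acc = [], 0.0
    for w in weights:
        acc += w / s
        cumulative.append(acc)
    u = rng.random() / n
    indices, j = [], 0
    for i in range(n):
        while cumulative[j] < u + i / n:
            j += 1
        indices.append(j)
    return indices
```

Resampling would then be triggered whenever `ess(weights)` falls below, say, half the number of particles.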
The generic scheme for a PDP particle filter is given in Algorithm 2. As is demonstrated in the examples, inserting a mixture of the moves described in 3.1.2, together with the specified weight expressions, into this algorithm provides a generic technique for the filtering of PDPs.
Algorithm 2 PDP Particle Filter
1: n = 1
2: for i = 1 to N do
3:   X_1^{(i)} ∼ η {where η(x) is an instrumental distribution.}
4:   W_1^{(i)} ∝ π_1(X_1^{(i)}) / η(X_1^{(i)})
5: end for
6: n ← n + 1
7: for i = 1 to N do
8:   X_n^{(i)} ∼ K_n(X_{n-1}^{(i)}, ·)
9:   W_n^{(i)} ∝ W_{n-1}^{(i)} π_n(X_n^{(i)}) L_{n-1}(X_n^{(i)}, X_{n-1}^{(i)}) / [π_{n-1}(X_{n-1}^{(i)}) K_n(X_{n-1}^{(i)}, X_n^{(i)})]
10: end for
11: Resampling can be conducted at this stage.
12: Sample rejuvenation can be conducted by allowing the particles to evolve under the action of a Markov kernel of invariant distribution π_n.
13: Go to step 6
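One iteration of this generic loop can be sketched in code. The target densities and kernels enter only through two user-supplied callables; multinomial resampling is used for brevity (systematic resampling, as above, is preferable in practice). All names here are ours, and this is a sketch rather than the paper's implementation.

```python
import random

def pdp_particle_filter_step(particles, weights, sample_K, weight_ratio,
                             ess_threshold, rng=random):
    """One iteration of the generic PDP particle filter loop.

    sample_K(x)                -- draws X_n ~ K_n(x, .)
    weight_ratio(x_old, x_new) -- evaluates
        pi_n(x_new) L_{n-1}(x_new, x_old) /
        (pi_{n-1}(x_old) K_n(x_old, x_new))
    Resampling is triggered when the ESS drops below ess_threshold.
    """
    new_particles = [sample_K(x) for x in particles]
    new_weights = [w * weight_ratio(x, y)
                   for w, x, y in zip(weights, particles, new_particles)]
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    ess = 1.0 / sum(w * w for w in new_weights)
    if ess < ess_threshold:
        new_particles = rng.choices(new_particles, weights=new_weights,
                                    k=len(new_particles))
        new_weights = [1.0 / len(new_particles)] * len(new_particles)
    return new_particles, new_weights
```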
3.2 Auxiliary Methods
The auxiliary particle filter of Pitt and Shephard (1999) is motivated by the idea that in standard sequential
importance resampling algorithms, one takes a weighted sample which targets some distribution πn−1 and
resamples from this distribution before mutating the sample and weighting it to target πn . This seems
somewhat inefficient: it would be better to take into account how well each sample is likely to fit πn
after mutation, before resampling is performed. Pitt and Shephard (1999) essentially introduce a set of auxiliary particle weights before the resampling step, with the aim of ‘pre-selecting’ particles so as to reduce the variance of the importance weights at the next time step. In Godsill and Clapp (2001) it was noted
that such schemes can be formulated as importance sampling applied to the joint sequence of states from
1 : n. In Johansen and Doucet (2008), the auxiliary method was reinterpreted as being a standard sequential
importance resampling algorithm which targets an auxiliary sequence of distributions which are then used
as an importance sampling proposal to provide estimators for the distribution of interest. The generalisation
from particle filters to general SMC samplers has also been made in Johansen and Doucet (2007).
In the context of filtering and the class of models described in this paper, the idea is to use the observation
at time n + 1 to guide the weighting of the particle set at time n so that resampling does not eliminate
particles which will be more favourably weighted once that observation has been taken into account. This
pre-weighting is then corrected for by importance weighting after the resampling step.
We now describe an auxiliary version of the algorithm presented above. This amounts to an SMC sampler targeting an auxiliary sequence of distributions on (E_n)_{n≥0}, which will be denoted by (µ_n)_{n≥1}, and a sequence of importance weights, (W̃_n)_{n≥1}, which correct for the discrepancy between (µ_n)_{n≥1} and (π_n)_{n≥1}.
We will focus on auxiliary distributions of the following form:

µ_n(x_n) ∝ V_n(τ_{n,k_n}, θ_{n,k_n}, y_{n+1}) π_n(x_n),

where for each n, V_n : R_+ × Ξ → (0, ∞) is a potential function (the forthcoming observation y_{n+1} being held fixed), i.e. the algorithm will pre-select particles on the basis of the most recent jump time and associated parameter and their interaction with the next observation. This potential function would usually be chosen to favour states that fit well with the new observation.
As in the standard auxiliary particle filter, this approach allows us to use a small amount of information from the immediate future to ensure that particles which explain the forthcoming observation well produce offspring, even if the present observation does not provide them with much weight. This is a technique intended to improve the robustness of the filtering estimate to outlying observations. It is important to choose the potential function V_n sensibly; guidance is provided in Pitt and Shephard (1999) for the filtering case and in Johansen and Doucet (2007), taking into account some recent developments. The application of the auxiliary method to the example sampler above is described in Algorithm 3. Steps 5 and 13 in this algorithm indicate where the proposed scheme yields particle sets targeting the distributions of interest, (π_n)_{n∈N}, in the sense of approximating expectations with respect to these distributions.
Algorithm 3 Auxiliary PDP Particle Filter
1: n = 1
2: for i = 1 to N do
3:   X_1^{(i)} ∼ η {where η(x) is an instrumental distribution.}
4:   W̃_1(X_1^{(i)}) ∝ π_1(X_1^{(i)}) / η(X_1^{(i)})
5:   ∫ ϕ(x_1) π_1(x_1) dx_1 ≈ [Σ_{i=1}^{N} W̃_1(X_1^{(i)}) ϕ(X_1^{(i)})] / [Σ_{i=1}^{N} W̃_1(X_1^{(i)})]
6:   W_1(X_1^{(i)}) ∝ V_1(τ_{1,k_1}^{(i)}, θ_{1,k_1}^{(i)}, y_2) W̃_1(X_1^{(i)})
7: end for
8: n ← n + 1
9: Resample from the distribution defined by {W_{n-1}^{(i)}}_{i=1}^{N}
10: for i = 1 to N do
11:   X_n^{(i)} ∼ K_n(X_{n-1}^{(i)}, ·)
12:   W̃_n(X_{n-1}^{(i)}, X_n^{(i)}) ∝ [1 / V_{n-1}(τ_{n-1,k_{n-1}}^{(i)}, θ_{n-1,k_{n-1}}^{(i)}, y_n)] π_n(X_n^{(i)}) L_{n-1}(X_n^{(i)}, X_{n-1}^{(i)}) / [π_{n-1}(X_{n-1}^{(i)}) K_n(X_{n-1}^{(i)}, X_n^{(i)})]
13:   ∫ ϕ(x_n) π_n(x_n) dx_n ≈ [Σ_{i=1}^{N} W̃_n(X_{n-1}^{(i)}, X_n^{(i)}) ϕ(X_n^{(i)})] / [Σ_{i=1}^{N} W̃_n(X_{n-1}^{(i)}, X_n^{(i)})]
14:   W_n(X_{n-1}^{(i)}, X_n^{(i)}) ∝ V_n(τ_{n,k_n}^{(i)}, θ_{n,k_n}^{(i)}, y_{n+1}) W̃_n(X_{n-1}^{(i)}, X_n^{(i)})
15: end for
16: Go to step 8
3.3 Convergence and Theoretical Validity
As the above algorithms are a methodological application of the SMC samplers framework, the theoretical results of Chopin (2004) and Del Moral et al. (2006a) apply directly to our algorithm. As a simple example, by application of the results in Del Moral et al. (2006a), it is immediately apparent that a Central Limit Theorem for the error in estimation of the posterior mean of ζ_{t_n} applies for the proposed method. Furthermore, this implies theoretical justification of the VRPF algorithm as described in Appendix A. Theoretical validity of the auxiliary scheme can be readily established by considering the interpretation in Johansen and Doucet (2007).
4 Simulation Study
In this section, we present realistic examples from mathematical finance and signal processing. These illustrate the performance improvements which are provided by the algorithmic framework presented above, when combined with sensible choices of proposal distributions.
4.1 Shot Noise Cox Process
In this model the signal process is the intensity of a shot noise Cox process and takes values in Ξ = R_+. The model is specified by the following distributions:

f(τ_j|τ_{j-1}) = λ_τ exp(−λ_τ(τ_j − τ_{j-1})) × I_{[τ_{j-1},∞)}(τ_j),
q(ζ_0) = λ_θ exp(−λ_θ ζ_0) × I_{[0,∞)}(ζ_0),
q(θ_j|θ_{j-1}, τ_j, τ_{j-1}) = λ_θ exp(−λ_θ(θ_j − ζ_{τ_j}^−)) × I_{[ζ_{τ_j}^−,∞)}(θ_j),

where ζ_{τ_j}^− = θ_{j-1} exp(−κ(τ_j − τ_{j-1})), and:

F(t, τ, θ) = θ exp(−κ(t − τ)).
Given ζ_{(t_{n-1},t_n]}, the observations are an inhomogeneous Poisson process with intensity ζ_{(t_{n-1},t_n]}, i.e. Y_n is a collection of random point locations in (t_{n-1}, t_n]. The likelihood function is given by:

g(y_n|ζ_{(t_{n-1},t_n]}) = exp(−∫_{t_{n-1}}^{t_n} ζ_s ds) Π_i ζ_{y_{n,i}},

where y_{n,i} is the time of the ith event observed in (t_{n-1}, t_n].
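Data from this model can be simulated directly: jump times arrive with exponential gaps, jump sizes are exponential on top of the decayed level, and the observed points are generated by thinning (the intensity decays between jumps, so its value at the most recent jump bounds it). This is a sketch of our own, not the code used in the experiments.

```python
import math
import random

def simulate_sncp(t_max, lam_tau, lam_theta, kappa, rng=random):
    """Simulate the shot noise Cox process of Section 4.1: exponential
    inter-jump times (rate lam_tau), upward jumps of exponential size
    (rate lam_theta) added to the decayed level, and deterministic decay
    F(t, tau, theta) = theta*exp(-kappa*(t - tau)) between jumps.
    Observations are an inhomogeneous Poisson process with intensity
    zeta_t, generated by thinning against the bound max(thetas)."""
    taus, thetas = [0.0], [rng.expovariate(lam_theta)]  # zeta_0 ~ Exp(lam_theta)
    t = 0.0
    while True:
        t += rng.expovariate(lam_tau)
        if t > t_max:
            break
        level = thetas[-1] * math.exp(-kappa * (t - taus[-1]))
        taus.append(t)
        thetas.append(level + rng.expovariate(lam_theta))

    def intensity(s):
        j = max(i for i in range(len(taus)) if taus[i] <= s)
        return thetas[j] * math.exp(-kappa * (s - taus[j]))

    events, bound, s = [], max(thetas), 0.0
    while True:
        s += rng.expovariate(bound)               # dominating homogeneous process
        if s > t_max:
            break
        if rng.random() < intensity(s) / bound:   # accept with prob zeta_s / bound
            events.append(s)
    return taus, thetas, events
```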
In reinsurance pricing applications, the observed events model the claims on an insurance portfolio. The jumps of the signal process model ‘primary events’: incidents which result in insurance claims. An approximation to the optimal filter for this model was derived in Dassios and Jang (2005). They showed that as the rate of primary events grows, specific transformations of (ζ_t)_{t≥0} and the observation counting process converge weakly to Gaussian processes to which the Kalman-Bucy filter can be applied. The resulting approximation of the optimal filter is poor when the rate of the primary event process, λ_τ, is low, as it is in economically important settings. The method we propose suffers from no such restrictions. We note that similar models have application in the pricing of options, for example Centanni and Minozzo (2006).
We compare the performance of the basic VRPF approach (referred to as Scheme 1) to the proposed algorithm (Scheme 2), on data simulated from the model with the following parameter settings: κ = 0.01, λ_τ = 1/40, λ_θ = 2/3, ∆ = 50 and 500 particles. The true hidden trajectory and a histogram of the observed points are given in figure 2.
For this model, it was found to be sufficient to construct an algorithm with a proposal kernel consisting of a single move, with a trans-dimensional Metropolis-Hastings kernel applied after resampling. The proposal kernel is a birth move, proposing a new jump time in the interval (t_{n-1}, t_n] and then drawing the associated parameter from its full conditional distribution:

K_n(x_{n-1}, dx_n) = δ_{k_{n-1}+1}(k_n) δ_{τ_{n-1,1:k_{n-1}}}(dτ_{n,1:k_n-1}) δ_{θ_{n-1,1:k_{n-1}}}(dθ_{n,1:k_n-1}) h_n(dτ_{n,k_n}) π_n(dθ_{n,k_n}|x_n \ θ_{n,k_n}),
where h_n(τ_{n,k_n}) is the proposal distribution for the time of the new change point. This was built from a histogram of the data observed on (t_{n-1}, t_n] in the following manner: the differences in counts between subsequent bins of the histogram were computed and exponentiated to yield a piecewise constant distribution. More specifically, consider partitioning the interval (t_{n-1}, t_n] into m sub-intervals of equal length. Let c_p be the number of observed events in the pth bin. For p ∈ {1, 2, ..., m − 1} define d_p = c_{p+1} − c_p. Then define a piecewise constant probability density function on (t_{n-1}, t_n] by partitioning the interval into m − 1 sub-intervals of equal length. In the pth such sub-interval the density has the value:

h_n(τ) = exp(d_p) / Σ_{q=1}^{m-1} exp(d_q),   τ ∈ (t_{n-1} + (p − 1)δ, t_{n-1} + pδ],

where δ = (t_n − t_{n-1})/(m − 1).
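The construction of this histogram-based proposal, and sampling from it, can be sketched as follows; the function names are our own.

```python
import math
import random

def build_hn(counts, t_prev, t_n):
    """Piecewise-constant proposal for the new jump time, built as in the
    text: exponentiate differences of adjacent histogram counts and
    normalise over m - 1 equal sub-intervals of (t_prev, t_n]."""
    d = [counts[p + 1] - counts[p] for p in range(len(counts) - 1)]
    z = sum(math.exp(x) for x in d)
    probs = [math.exp(x) / z for x in d]   # mass of each sub-interval
    delta = (t_n - t_prev) / len(probs)    # sub-interval length
    return probs, delta

def sample_hn(probs, delta, t_prev, rng=random):
    """Draw tau by choosing a sub-interval, then a uniform point inside it."""
    p = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return t_prev + p * delta + rng.random() * delta
```

Sub-intervals whose event count rises relative to their predecessor thus receive exponentially more proposal mass, concentrating new jump times where the data suggest an intensity increase.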
The corresponding component of the backwards kernel is necessarily (see Remark 1):

L_{n-1}(x_n, dx_{n-1}) = δ_{k_n-1}(k_{n-1}) δ_{τ_{n,1:k_n-1}}(dτ_{n-1,1:k_{n-1}}) δ_{θ_{n,1:k_n-1}}(dθ_{n-1,1:k_{n-1}}).

Systematic resampling was applied when the ESS dropped below 40%. After resampling, a Metropolis-Hastings kernel consisting of a sequence of three moves was applied:
1. a birth/death reversible jump move, with the birth being made by drawing a new jump time from the
uniform distribution on (τn,kn , tn ] and then drawing the parameter from the full conditional distribution
2. a perturbation of the most recent jump time, τn,kn , drawing from a Gaussian kernel centred at τn,kn
and truncated to the interval (τn,kn −1 , tn ]
3. a perturbation of the parameter associated with the most recent jump time, drawing from the full
conditional distribution.
In practice resampling was found to occur at less than 40% of iterations. Despite the fact that the proposal kernel consisted only of a birth move, it was found that the algorithm did not significantly over-fit the data: for the data shown in figure 2, the true number of jumps is 34 and for a typical run, the MAP number of jumps was 39. The MMSE filtering estimate of ζ_t obtained using the proposed algorithm is shown in figure 2.
Figure 3 shows a count of the number of unique particles, over time, at the final iteration of the algorithm (i.e. t = 2000), where we have stored the entire history of each particle, ζ_{[0,2000]}^{(i)}, in order to compute these quantities. However, the diversity of particle locations alone does not tell us everything about the efficiency of the algorithm. Whilst there may be many unique particles, the importance weights may have high variance, in which case the quality of the particle approximation will be low. In order to portray this characteristic, we also plot the same quantities, but after resampling is applied (whilst resampling can increase the variance of an estimate made from the particle set, for systematic resampling this increase is typically extremely small). These quantities demonstrate the degree of degeneracy of both the particle locations and the importance weights. Pre-resampling, Scheme 1 exhibits some diversity in the recent history of particle locations; note that the number of unique particles is less than the total number. This is an example of the phenomenon described in section 2.5. Degeneracy of the importance weights means that this diversity is reduced when resampling occurs. Scheme 2 exhibits diversity in particle locations much further back in time, with all particles being unique in the recent history. The reduction in the number of unique particles when resampling occurs is less severe than for Scheme 1, demonstrating that the importance weights have lower variance.
Figure 2: SNCP model. Top: true intensity (solid) and MMSE filtering estimate (dotted). Bottom: histogram of realised observations with bin length 2.5.

Figure 3: Unique particles at the final iteration vs time, for (a) Scheme 1 and (b) Scheme 2. Solid: pre-resampling. Dotted: post-resampling.
4.2 Object Tracking
This example illustrates the performance of the proposed methods on a constant acceleration tracking model applied to the benchmark fighter aircraft trajectory depicted in figure 1, which has duration 185 seconds. The model is as specified in subsection 2.4, with acceleration components being drawn independently from an isotropic Gaussian distribution of mean zero and standard deviation σ_θ = 10 m/s². 37 additive, isotropic Gaussian observations of the position of the aircraft, with standard deviation σ_y = 200 m, were generated at intervals of ∆ = 5 s, also shown in figure 1. The inter-jump times were chosen to be Gamma distributed, with shape and scale parameters 10 and 2.5 respectively, corresponding to a mean inter-arrival time of 25 s. For this conditionally linear-Gaussian model it is possible to analytically integrate out the parameters θ_{0:k_n}, so only jump times need be sampled.
Comparisons were made between three schemes described below. In all cases, systematic resampling was
employed.
1. The basic VRPF algorithm, sampling from the prior; see Appendix A for the specification of the kernels. Resampling was performed when the ESS fell below 50%.
2. A forward kernel with two components. The first component is a birth move which adds a single point uniformly in (τ_{n,k_n-1}, t_n]:

K_{n,1}(x_{n-1}, dx_n) = δ_{k_{n-1}+1}(k_n) δ_{τ_{n-1,1:k_{n-1}}}(dτ_{n,1:k_n-1}) × [1/(t_n − τ_{n,k_n-1})] I_{(τ_{n,k_n-1},t_n]}(τ_{n,k_n}) dτ_{n,k_n}.

For this move, the corresponding component of the backwards kernel is necessarily:

L_{n-1,1}(x_n, dx_{n-1}) = δ_{k_n-1}(k_{n-1}) δ_{τ_{n,1:k_n-1}}(dτ_{n-1,1:k_{n-1}}).

The second component is an adjustment move in which a Gaussian random walk kernel, restricted to (τ_{n,k_n-1}, t_n], is applied to the most recent jump time. This component of the kernel is given by:

K_{n,2}(x_{n-1}, dx_n) ∝ I_{(τ_{n-1,k_{n-1}-1},t_n]}(τ_{n,k_n}) N(τ_{n,k_n}; τ_{n-1,k_{n-1}}, σ_a²) dτ_{n,k_n} δ_{k_{n-1}}(k_n) δ_{τ_{n-1,1:k_{n-1}-1}}(dτ_{n,1:k_n-1}).

A Gaussian kernel of the same variance, restricted to (τ_{n-1,k_{n-1}-1}, t_{n-1}], is used for the corresponding component of the backward kernel:

L_{n-1,2}(x_n, dx_{n-1}) ∝ I_{(τ_{n-1,k_{n-1}-1},t_{n-1}]}(τ_{n-1,k_{n-1}}) N(τ_{n-1,k_{n-1}}; τ_{n,k_n}, σ_a²) dτ_{n-1,k_{n-1}} δ_{k_n}(k_{n-1}) δ_{τ_{n,1:k_n-1}}(dτ_{n-1,1:k_{n-1}-1}).

The forward mixture weights were defined as in (11), with the backward mixture weights set to uniform. Resampling was performed when the ESS fell below 50%.
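The truncated Gaussian random walk used by the adjustment move can be realised, for example, by rejection sampling, with the normalised truncated density entering the incremental weight. This is a minimal sketch of our own; rejection is adequate when the interval carries non-negligible Gaussian mass.

```python
import math
import random

def sample_truncated_normal(mean, sd, low, high, rng=random):
    """Rejection-sample N(mean, sd^2) restricted to (low, high]."""
    while True:
        x = rng.gauss(mean, sd)
        if low < x <= high:
            return x

def truncated_normal_density(x, mean, sd, low, high):
    """Density of the truncated Gaussian proposal, as needed when
    evaluating the incremental importance weight."""
    if not (low < x <= high):
        return 0.0
    phi = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    cdf = lambda v: 0.5 * (1.0 + math.erf((v - mean) / (sd * math.sqrt(2.0))))
    return phi / (cdf(high) - cdf(low))  # renormalise over (low, high]
```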
3. As 2, but with an auxiliary scheme using an approximation of the following auxiliary weighting function:

V_n(τ_{n,k_n}, θ_{n,k_n}, y_{n+1}) = [S(t_{n+1}, τ_{n,k_n}) / S(t_n, τ_{n,k_n})] p(y_{n+1}|y_{1:n}, k_{n+1} = k_n, τ_{n,1:k_n})
+ [1/(S(t_n, τ_{n,k_n}))²] { ∫_{(t_n,t_{n+1}]} f(τ_{n+1,k_{n+1}}|τ_{n,k_n}) S(t_n, τ_{n+1,k_{n+1}}) p(y_{n+1}|y_{1:n}, k_{n+1} = k_n + 1, τ_{n+1,1:k_{n+1}}) dτ_{n+1,k_{n+1}} }.
The one-dimensional integral is approximated using standard numerical techniques (note that this does not affect the theoretical validity of the algorithm). The motivation for this auxiliary weighting function is that it approximates the predictive likelihood; further details are given in Appendix B. Resampling was applied at every iteration.
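One standard numerical technique for a one-dimensional integral of this kind is a midpoint rule on a grid over (t_n, t_{n+1}]. In the sketch below the factors of the integrand are user-supplied callables standing in for the inter-jump density, survivor function and predictive likelihood; the function and parameter names are ours.

```python
def approx_vn_integral(f, S, pred_lik, t_n, t_np1, m=100):
    """Midpoint-rule approximation of the one-dimensional integral in the
    auxiliary weighting function: the integral over (t_n, t_{n+1}] of
    f(tau) * S(tau) * pred_lik(tau) d tau, with the three factors
    supplied as callables."""
    h = (t_np1 - t_n) / m
    total = 0.0
    for i in range(m):
        tau = t_n + (i + 0.5) * h   # midpoint of the i-th grid cell
        total += f(tau) * S(tau) * pred_lik(tau)
    return total * h
```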
Figure 4: Benchmark 2D position trajectory (solid) and MMSE filtering estimate (dotted).
It was found that all three schemes yielded roughly the same error performance in estimating ζtn from each
πn and little decrease in estimation error was observed using more than 200 particles. This can be attributed
to the Rao-Blackwellised nature of the model. For an example of a model in which the parameters are not
integrated out, see Whiteley et al. (2007); in this instance the VRPF exhibits significantly greater error
in this filtering task, and is substantially outperformed by an algorithm which includes adjustment moves.
MMSE estimates of the position of the vehicle using Scheme 3, and made before resampling, are shown in
figure 4.
However, as more of the recent history of the trajectory is considered, including the times of the jumps
in acceleration, the superior performance of the proposed method becomes apparent.
Figure 5: Unique particles at the final iteration vs time, for (a) Scheme 1, (b) Scheme 2 and (c) Scheme 3. Pre-resampling (solid), post-resampling (dotted).
Figure 5 provides insight into the relationship between the particle diversity and weight degeneracy for
the three algorithms, at the final iteration of the algorithm.
In order to show the diversity of the particle set, we plot the number of unique particles at the final iteration, as a function of time. Again we consider the particle set before and after resampling, but note that, in the case of Scheme 3, for the purposes of these diagnostics, we applied resampling at the final iteration according to the importance weights from which an estimate over the target distribution would be drawn. The difference between the pre- and post-resampling plots for each algorithm indicates the variance of the importance weights: when the variance is high, the number of unique particles is relatively low post-resampling. For these tests the adjustment kernel standard deviation was set to σ_a = ∆/1000 s.
From figure 5 it can be seen that Scheme 1 exhibits particle diversity only in the very recent history, even before resampling. Scheme 2 exhibits much more diversity, especially between the times of the last and penultimate observations. However, much of this diversity is lost upon resampling. The plot of Scheme 3
shows that a significant proportion of the diversity arising from the adjustment move and the births in (τ_{n,k_n-1}, t_n] is not lost after resampling.
Figure 6: Algorithm 3. Unique particles vs time and standard deviation of adjustment kernel. (a) Pre-resampling, (b) post-resampling.
Figure 6 shows the effect of σ_a in Algorithm 3, averaged over 50 runs for each value of σ_a, using N = 50 particles. Pre-resampling, all settings produce large particle diversity. However, as σ_a approaches and exceeds the inter-observation time, the variance of the importance weights becomes high and consequently, post-resampling, the number of unique particles is dramatically reduced. Thus, when using a Gaussian random walk adjustment kernel, it is evident from figures 6 and 7 that there is a trade-off between diversity of particle locations and importance weight degeneracy.
Figure 7 compares the performance of the algorithms in terms of estimating jump locations. The use of the adjustment move in Scheme 2 yields a small improvement over Scheme 1. However, the results for Scheme 3 show a marked improvement. Note that, in the model, jumps occur in the acceleration but it is only the position of the vehicle which is observed. Therefore optimal estimates of jump times in the very recent history should exhibit large variance; the results for Scheme 3 are commensurate with this.
5 Conclusions
We have presented a broad class of algorithms for the filtering of piecewise deterministic processes and demonstrated that this class incorporates existing techniques as particular cases. The proposed framework falls within a very general sampling framework for which convergence and stability results exist.
The significance of this work is twofold. First, it allows existing convergence results, and a central limit theorem in particular, to be employed to provide a formal theoretical justification both of algorithms which are already in use and of a more general approach which has intuitively preferable characteristics. Second, the proposed algorithms allow for the use of principled techniques which lead to substantial improvements in efficiency when compared with existing techniques.
21
0.4
0.2
0
1
0.5
0
1
0.5
0
1
0.5
0
|a|/ms−2
1
0.5
0
100
50
0
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
t/s
(a) Algorithm 1
0.4
0.2
0
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0.5
0
0.4
0.2
0
0.2
0.1
0
|a|/ms−2
0.4
0.2
0
100
50
0
t/s
(b) Algorithm 2
0.1
0.05
0
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0
20
40
60
80
100
120
140
160
180
0.5
0
0.4
0.2
0
0.4
0.2
0
0.5
|a|/ms−2
0
100
50
0
t/s
(c) Algorithm 3
Figure 7: Weighted histograms of particle jump times at final iteration, pre-resampling. In each sub figure,
top to second bottom: τn,k , τn,k−1 , τn,k−2 , τn,k−3 , τn,k−4 and bottom: true acceleration magnitude. σa =
∆/1000. N = 5000.
22
References
Blair, W., Watson, G., Kirubarajan, T., and Bar-Shalom, Y. (1998), “Benchmark for radar allocation and
tracking in ECM,” IEEE Transactions on Aerospace and Electronic Systems, 34, 1097–1114.
Cappé, O., Godsill, S., and Moulines, E. (2007), “An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo,” Proceedings of the IEEE, 96, 899–924.
Centanni, S. and Minozzo, M. (2006), “Estimation and filtering by reversible jump MCMC for a doubly
stochastic Poisson model for ultra-high-frequency financial data,” Statistical Modelling, 6, 97–118.
Chopin, N. (2004), “Central limit theorem for sequential Monte Carlo methods and its application to Bayesian
inference,” Annals of Statistics, 32, 2385–2411.
— (2006), “Dynamic detection of change points in long time series,” Annals of the Institute of Statistical
Mathematics, 59, 349–366.
Dassios, A. and Jang, J. (2005), “Kalman-Bucy filtering for linear system driven by the Cox process with shot
noise intensity and its application to the pricing of reinsurance contracts,” Journal of Applied Probability,
42, 93–107.
Davis, M. (1984), “Piecewise Deterministic Markov Processes: a general class of non-diffusion stochastic
models,” Journal Royal Statistical Society B, 46, 353–388.
Del Moral, P., Doucet, A., and Jasra, A. (2006a), “Sequential Monte Carlo for Bayesian Computation,” in
Bayesian Statistics 8, Oxford University Press, pp. 115–148.
— (2006b), “Sequential Monte Carlo Samplers,” Journal of the Royal Statistical Society B, 63, 411–436.
Douc, R., Cappé, O., and Moulines, E. (2005), “Comparison of Resampling Schemes for Particle Filtering,”
in Proceedings of the IEEE 4th International Symposium on Image and Signal Analysis, pp. 64–69.
Doucet, A., Briers, M., and Sénécal, S. (2006a), “Efficient Block Sampling Strategies for Sequential Monte
Carlo Methods,” Journal of Computational and Graphical Statistics, 15, 693–711.
Doucet, A., de Freitas, N., and Gordon, N. (eds.) (2001), Sequential Monte Carlo Methods in Practice,
Statistics for Engineering and Information Science, New York: Springer Verlag.
Doucet, A., Godsill, S., and Andrieu, C. (2000), “On sequential Monte Carlo sampling methods for Bayesian
filtering,” Statistics and Computing, 10, 197–208.
Doucet, A., Montesano, L., and Jasra, A. (2006b), “Optimal Filtering For Partially Observed Point Processes
Using Trans-Dimensional Sequential Monte Carlo,” in Proceedings of IEEE International Conference on
Acoustics, Speech and Signal Processsing, (ICASSP).
Gilks, W. and Berzuini, C. (2001), “Following a moving target - Monte Carlo inference for dynamic Bayesian Models,” Journal of the Royal Statistical Society B, 63, 127–146.
Godsill, S. (2007), “Particle filters for continuous-time jump models in tracking applications,” in ESAIM:
Proceedings of Oxford Workshop on Particle Filtering (to appear).
Godsill, S. and Clapp, T. (2001), “Improvement Strategies for Monte Carlo Particle Filters,” in Sequential
Monte Carlo Methods in Practice, eds. Doucet, A., Freitas, J. F. G. D., and Gordon, N. J., New York:
Springer-Verlag, Statistics for Engineering and Information Science, chap. 7, pp. 139–158.
Godsill, S., Vermaak, J., Ng, K.-F., and Li, J.-F. (2007), “Models and Algorithms for Tracking of Manoeuvring Objects using Variable Rate Particle Filters,” Proceedings of the IEEE, 95, 925–952.
Godsill, S. J. and Vermaak, J. (2004), “Models and Algorithms for tracking using trans-dimensional sequential Monte Carlo,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), vol. 3, pp. 976–979.
— (2005), “Variable rate particle filters for tracking applications,” in Proceedings of IEEE Statistical Signal
Processing Workshop (SSP), pp. 1280–1285.
Gordon, N. J., Salmond, S. J., and Smith, A. F. M. (1993), “Novel approach to nonlinear/non-Gaussian
Bayesian state estimation,” IEE Proceedings-F, 140, 107–113.
Jacobsen, M. (2006), Point Process Theory and Applications: Marked Point and Piecewise Deterministic
Processes, Probability and its Applications, Boston: Birkhäuser.
Jacod, J. and Skorokhod, A. V. (1996), “Jumping Markov Processes,” Annales de l’Institut Henri Poincaré,
Probabilités et Statistiques, 32, 11–68.
Johansen, A. M. and Doucet, A. (2007), “Auxiliary Variable Sequential Monte Carlo Methods,” Tech. Rep.
07:09, University of Bristol, Department of Mathematics – Statistics Group, University Walk, Bristol, BS8
1TW, UK.
— (2008), “A Note on the Auxiliary Particle Filter,” Statistics and Probability Letters, to appear.
Kong, A., Liu, J. S., and Wong, W. H. (1994), “Sequential Imputations and Bayesian Missing Data Problems,” Journal of the American Statistical Association, 89, 278–288.
Lasdas, V. and Davis, M. (1989), “A piecewise deterministic process approach to target motion analysis
[sonar],” in Proceedings of the 28th IEEE Conference on Decision and Control, Tampa, FL, USA, vol. 2,
pp. 1395–1396.
Maskell, S. (2004), “Joint Tracking of Manoeuvring Targets and Classification of their manoeuvrability,”
EURASIP Journal on Applied Signal Processing, 15, 2339–2350.
Pitt, M. K. and Shephard, N. (1999), “Filtering via Simulation: Auxiliary Particle Filters,” Journal of the
American Statistical Association, 94, 590–599.
Sworder, D. D., Kent, M., Vojak, R., and Hutchins, R. G. (1995), “Renewal Models for Maneuvering
Targets,” IEEE Transactions on Aerospace and Electronic Systems, 31, 138–150.
Whiteley, N., Johansen, A. M., and Godsill, S. (2007), “Efficient Monte Carlo filtering for discretely observed
jumping processes,” in Proceedings of IEEE Statistical Signal Processing Workshop, Madison, Wisconsin,
USA, pp. 89–93.
A    The VRPF Algorithm
The VRPF of Godsill et al. (2007) can be obtained by constructing a PDP particle filtering algorithm, with
the following mixture proposal kernel:

    K_n(x_{n-1}, dx_n) = \sum_{m=0}^{\infty} \widetilde{K}_{n,m}(x_{n-1}, dx_n),    (14)
where the sub-probability kernels \widetilde{K}_{n,m} may be decomposed as the product of a mixture weight and a
probability kernel:

    \widetilde{K}_{n,m}(x_{n-1}, dx_n) := \alpha_{n,m}(x_{n-1}) K_{n,m}(x_{n-1}, dx_n).
The mixture weight \alpha_{n,m}(x_{n-1}) is the conditional prior probability that x_n has m jumps in (t_{n-1}, t_n],
given k_n \geq k_{n-1}, \tau_{n,k_{n-1}} = \tau_{n-1,k_{n-1}} and \sup\{p : \tau_{n,p} \leq t_{n-1}\} = k_{n-1}. Thus new jump times proposed from
(14) will always lie in (t_{n-1}, t_n].
For m = 0,

    \widetilde{K}_{n,0}(x_{n-1}, dx_n) = \frac{S(t_n, \tau_{n-1,k_{n-1}})}{S(t_{n-1}, \tau_{n-1,k_{n-1}})} \delta_{x_{n-1}}(dx_n)

and for m > 0,

    \widetilde{K}_{n,m}(x_{n-1}, dx_n) = \frac{S(t_n, \tau_{n,k_{n-1}+m})}{S(t_{n-1}, \tau_{n-1,k_{n-1}})} \delta_{k_{n-1}+m}(k_n)
        \times \delta_{\tau_{n-1,1:k_{n-1}}}(d\tau_{n,1:k_{n-1}}) \delta_{\theta_{n-1,0:k_{n-1}}}(d\theta_{n,0:k_{n-1}})
        \times \prod_{j=k_{n-1}+1}^{k_{n-1}+m} f(d\tau_j | \tau_{j-1}) q(d\theta_j | \theta_{j-1}, \tau_j, \tau_{j-1})
        \times \mathbb{I}_{(t_{n-1}, t_n]}(\tau_{n,k_{n-1}+1}).
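The two cases above admit a simple sampling interpretation: drawing from K_n amounts to simulating jump times from f (and parameters from q) until t_n is exceeded, with the overshooting jump discarded, which implicitly selects the mixture component m. A minimal sketch under illustrative assumptions not prescribed by the paper: an exponential inter-jump density f with rate `lam` (so, by memorylessness, the first candidate conditioned on exceeding t_{n-1} may be started from t_{n-1}) and a Gaussian random-walk q with standard deviation `sigma`:

```python
import numpy as np

def sample_vrpf_proposal(theta_last, t_prev, t_next, lam=1.0, sigma=0.5, rng=None):
    """Draw the new jumps proposed by the mixture kernel K_n of (14).

    Jump times are simulated from f until t_next is exceeded; the
    overshooting jump is discarded, which accounts for the survivor-function
    factor S(t_n, .) and implicitly selects m, the number of new jumps
    proposed in (t_prev, t_next].
    """
    rng = rng if rng is not None else np.random.default_rng()
    new_taus, new_thetas = [], []
    # Exponential f: by memorylessness, conditioning the next jump on
    # exceeding t_prev is equivalent to restarting the clock at t_prev.
    tau, theta = t_prev, theta_last
    while True:
        tau = tau + rng.exponential(1.0 / lam)   # candidate from f(. | tau_{j-1})
        if tau > t_next:                         # overshoot: stop with m = len(new_taus)
            break
        theta = theta + sigma * rng.normal()     # draw from q(theta_j | theta_{j-1}, ...)
        new_taus.append(tau)
        new_thetas.append(theta)
    return new_taus, new_thetas
```

A particle retaining zero new jumps (m = 0) keeps its previous path unchanged, matching the \widetilde{K}_{n,0} term.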
Proposition 1. For the proposal kernel (14), and in the case that the optimal backwards kernel is employed,
the importance weight is given by:

    w_n(x_{n-1}, x_n) \propto g_n(y_n | \zeta_{t_n}).
Proof. It is clear that \pi_n \ll \pi_{n-1} K_n. Using the optimal form of auxiliary kernel leads to the following
expression for the importance weight:

    w_n(x_{n-1}, x_n) \propto \frac{d\pi_n}{d(\pi_{n-1} K_n)}(x_n)
        = \frac{d\pi_n}{d\left[\pi_{n-1} \sum_m \widetilde{K}_{n,m}\right]}(x_n)
        = \frac{d\pi_n}{d\left[\sum_m \pi_{n-1} \widetilde{K}_{n,m}\right]}(x_n)

by bounded convergence. Furthermore, as (E_n, \mathcal{E}_n) is Polish and \pi_{n-1} \widetilde{K}_{n,m} has within its support only
configurations in which

    p(x_n) := \sum_{j=1}^{k_n} \mathbb{I}_{(t_{n-1}, t_n]}(\tau_{n,j}) = m,
for any q \neq p(x_n) there exists B such that \pi_{n-1} \widetilde{K}_{n,q}(B) = 0. Thus, from the definition of the
Radon–Nikodým derivative, we may write:

    w_n(x_{n-1}, x_n) = \frac{d\pi_n}{d(\pi_{n-1} \widetilde{K}_{n,p(x_n)})}(x_n) = g_n(y_n | \zeta_{t_n}).
Thus it can be seen that the use of (14), with its optimal backwards kernel, initialising from the
prior and resampling at every iteration, coincides with the VRPF algorithm of Godsill et al. (2007), with the
‘state-regeneration’ viewed as being performed at the beginning of the nth iteration rather than at the end
of the (n-1)th iteration. “Completing the neighbourhood”, as discussed in Godsill et al. (2007), corresponds
(for the models considered in this paper) to having obtained a sample from (14) by sampling from f(\tau_j | \tau_{j-1})
and q(\theta_j | \theta_{j-1}, \tau_j, \tau_{j-1}) until t_n has been exceeded. This also coincides with the algorithm of Maskell (2004),
with proposals made from the prior.
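Taken together with Proposition 1, one VRPF iteration therefore reduces to propagating each particle from the prior over (t_{n-1}, t_n], weighting by g_n(y_n | \zeta_{t_n}), and resampling. The following sketch assumes, purely for illustration, a scalar PDP with piecewise-constant deterministic motion (\zeta_t equal to the most recent jump parameter), exponential inter-jump times, a Gaussian random-walk jump kernel and a Gaussian likelihood; none of these modelling choices is prescribed by the paper:

```python
import numpy as np

def vrpf_step(particles, y_n, t_prev, t_next, lam=1.0, sigma=0.5, obs_std=1.0, rng=None):
    """One VRPF iteration: prior propagation, likelihood weighting, resampling.

    Each particle is a dict holding its jump times `taus` and the parameter
    `theta` of its most recent jump; by Proposition 1 the importance weight
    is g_n(y_n | zeta_{t_n}) up to a normalising constant.
    """
    rng = rng if rng is not None else np.random.default_rng()
    N = len(particles)
    weights = np.empty(N)
    for i, p in enumerate(particles):
        # Propagate from the prior until t_next is exceeded ("completing
        # the neighbourhood"); the overshooting jump is discarded.
        tau, theta = t_prev, p["theta"]
        while True:
            tau += rng.exponential(1.0 / lam)
            if tau > t_next:
                break
            theta += sigma * rng.normal()
            p["taus"].append(tau)
        p["theta"] = theta
        zeta = theta  # piecewise-constant flow: the state at t_n
        weights[i] = np.exp(-0.5 * ((y_n - zeta) / obs_std) ** 2)  # Gaussian g_n, up to a constant
    weights /= weights.sum()
    # Multinomial resampling at every iteration, as in the VRPF.
    idx = rng.choice(N, size=N, p=weights)
    return [dict(theta=particles[j]["theta"], taus=list(particles[j]["taus"])) for j in idx]
```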
B    Auxiliary Method
Consider the relationship between x_n and x_{n-1} induced by the VRPF proposal kernel. Denote by \check{k}_n the
number of jumps proposed at time step n, so that k_n = k_{n-1} + \check{k}_n. Then

    p(y_n | y_{1:n-1}, k_{n-1}, \tau_{n-1,1:k_{n-1}})
        = p(\check{k}_n = 0 | \tau_{n,k_{n-1}}) p(y_n | y_{1:n-1}, k_n, \tau_{n-1,1:k_{n-1}})
          + \sum_{\check{k}_n=1}^{\infty} \int_{S_n} p(\check{k}_n, \tau_{k_{n-1}+1:k_n} | \tau_{n,k_{n-1}}, \tau_{n,k_{n-1}+1} > t_{n-1})
              p(y_n | y_{1:n-1}, k_n, \tau_{n,1:k_n}) \, d\tau_{n,k_{n-1}+1:k_n}
        = \frac{S(t_n, \tau_{n-1,k_{n-1}})}{S(t_{n-1}, \tau_{n-1,k_{n-1}})} p(y_n | y_{1:n-1}, k_n = k_{n-1}, \tau_{n-1,1:k_{n-1}})
          + \frac{1}{(S(t_{n-1}, \tau_{n-1,k_{n-1}}))^2} \int_{(t_{n-1},t_n]} f(\tau_{n,k_n} | \tau_{n-1,k_{n-1}}) S(t_{n-1}, \tau_{n,k_n})
              p(y_n | y_{1:n-1}, k_n = k_{n-1} + 1, \tau_{n,1:k_n}) \, d\tau_{n,k_n}
          + \epsilon_n,    (15)

where \epsilon_n represents the summands for \check{k}_n = 2, 3, \ldots and
S_n = \{\tau_{n,k_{n-1}+1:k_n} : t_{n-1} < \tau_{n,k_{n-1}+1} < \cdots < \tau_{n,k_n} \leq t_n\}.
The auxiliary weighting function is obtained by truncating the series in (15) after its first two terms,
approximating the one-dimensional integral in the second term by a standard numerical technique such as
the trapezium rule.
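A hypothetical sketch of this computation follows, with the survivor function S(t, τ), the inter-jump density f and the two predictive likelihoods supplied as user-defined callables. All names here are illustrative, and the predictive likelihoods would themselves typically be approximated in practice:

```python
import numpy as np

def auxiliary_weight(tau_last, t_prev, t_next, S, f, pred_lik_0, pred_lik_1, n_grid=20):
    """Two-term truncation of (15).

    S(t, tau): survivor function; f(u, tau): inter-jump density;
    pred_lik_0: predictive likelihood with no new jump (k_n = k_{n-1});
    pred_lik_1(u): predictive likelihood with a single new jump at u.
    The one-dimensional integral in the second term is approximated by
    the trapezium rule on a regular grid over (t_{n-1}, t_n].
    """
    # Zero-jump term of (15).
    term0 = S(t_next, tau_last) / S(t_prev, tau_last) * pred_lik_0
    # One-jump term: trapezium rule for the integral over (t_prev, t_next],
    # mirroring the integrand displayed in (15).
    grid = np.linspace(t_prev, t_next, n_grid)
    integrand = np.array([f(u, tau_last) * S(t_prev, u) * pred_lik_1(u) for u in grid])
    h = (t_next - t_prev) / (n_grid - 1)
    integral = h * (0.5 * integrand[0] + integrand[1:-1].sum() + 0.5 * integrand[-1])
    term1 = integral / S(t_prev, tau_last) ** 2
    return float(term0 + term1)
```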