Monte Carlo Filtering of Piecewise Deterministic Processes
Nick Whiteley∗, Adam M. Johansen† and Simon Godsill‡
CUED/F-INFENG/TR-592 (2007)

Abstract

We present efficient Monte Carlo algorithms for performing Bayesian inference in a broad class of models: those in which the distributions of interest may be represented by time marginals of continuous-time jump processes conditional on a realisation of some noisy observation sequence. The sequential nature of the proposed algorithm makes it particularly suitable for real-time estimation in time series. We demonstrate that two existing schemes can be interpreted as particular cases of the proposed method. Results are provided which illustrate significant performance improvements relative to existing methods.

Keywords: Bayesian Filtering, Optimal Filtering, Particle Filters, Sequential Monte Carlo.

1 Introduction

Throughout statistics and related fields, there exist numerous problems which are most naturally addressed in a continuous time framework. A great deal of progress has been made over the past few decades in online estimation for partially-observed, discrete time series. This paper generalises the popular particle filtering methodology to a broad class of continuous time models.

Within the Bayesian paradigm, the tasks of optimal filtering and smoothing correspond to obtaining, recursively in time, the posterior distribution of the trajectory of an unobserved stochastic process at a particular time instant, or over some interval, given a sequence of noisy observations made over time. State space models, in which the unobserved process is a discrete-time Markov chain, are very flexible and have found countless applications throughout statistics, engineering, biology, economics and physics.
For example, in object tracking, the hidden process models the evolution of position and other intrinsic parameters of a manoeuvring object. The aim is then to estimate the trajectory of that object, given observations made by a noisy sensor. In many cases of interest the state-space model is nonlinear and non-Gaussian, and exact inference is intractable. Approximation methods must, therefore, be employed. Sequential Monte Carlo (SMC) methods (Doucet et al., 2001; Cappé et al., 2007) approximate the sequence of posterior distributions by a collection of weighted samples, termed particles. The application of SMC methods to the discrete time filtering problem has been the focus of much research in the last fifteen years, especially in the context of tracking. See, for example, Gordon et al. (1993); Doucet et al. (2000).

∗ Nick Whiteley is a Ph.D. student in the Signal Processing Laboratory, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, UK. (email: npw24@cam.ac.uk)
† Adam Johansen is a Brunel Fellow in the Statistics Group, Department of Mathematics, University of Bristol, University Walk, Bristol, BS8 1TW, UK. (email: adam.johansen@bristol.ac.uk)
‡ Simon Godsill is Professor of Statistical Signal Processing, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, UK. (email: sjg@eng.cam.ac.uk)

However, many physical phenomena are most naturally modelled in continuous time and this motivates the development of inference methodology for continuous time stochastic processes. Whilst it is possible to attempt discretisation, this can be computationally expensive and introduces a bias which is not easy to quantify. Estimating the trajectory of a general continuous-time process from statistical knowledge and a finite collection of observations is not practical in any degree of generality.
Davis (1984) proposed the use of piecewise deterministic processes (PDP’s) which evolve deterministically in continuous time except at a countable collection of stopping times at which they randomly jump (in the sense of potentially introducing a discontinuity into the trajectory) to a new value. The connection between PDP’s and marked point processes was investigated in Jacobsen (2006) and Jacod and Skorokhod (1996) used the term jumping process to describe a generalisation of these models to non-topological spaces. Numerous problems can be straightforwardly and parsimoniously cast into this framework. We will present examples in the following two areas: Tracking: It has been demonstrated that, in some object tracking scenarios, the trajectory of a manoeuvring object may be more parsimoniously modelled by a PDP than by traditional discrete-time models, see Lasdas and Davis (1989); Sworder et al. (1995); Godsill et al. (2007) and references therein. Mathematical Finance: Inferring the hidden intensity of a shot noise Cox process (SNCP) plays a role in reinsurance and option pricing schemes. Exact inference for SNCP models is intractable. Approximate methods based on weak convergence to Gaussian processes in specific parameter regimes (Dassios and Jang, 2005) and Markov Chain Monte Carlo have been proposed (Centanni and Minozzo, 2006). If such models are to be employed in practice, accurate and computationally efficient inference schemes must be developed. The contribution of this paper is the development of efficient SMC algorithms for the filtering of any PDP. A key factor which determines the efficiency of SMC methods is the mechanism by which particles are propagated and re-weighted. If this mechanism is not carefully chosen, the particle approximation to the target distribution will be poor, with only a small subset of the particles having any significant weight. 
Over time, this in turn leads to loss of diversity in particle locations and instability of the algorithm. The proposed approach combats these issues, maintaining considerably more diversity in the particle system than previously proposed approaches and therefore providing a better approximation to the target distribution.

We are specifically interested in performing inference online, that is, as observations become available over time and in such a manner that computational complexity and storage requirements do not increase over time. This is essential in many applications.

In the next section, we specify the filtering model of interest more precisely. A representative example is presented and existing approaches to inference for such models are discussed. In section 3 we describe the design of an SMC algorithm which is efficient for the processes of interest here. Details of an auxiliary variable technique which can further improve the efficiency of the proposed method are also presented. In section 4 we present results for a SNCP model used in reinsurance pricing and for a fighter aircraft tracking model, and demonstrate the improvement in performance over existing algorithms which is possible.

2 Background

Before introducing discrete time particle filtering methods, formalising the problem and describing approaches to its solution, it is useful to introduce some notation which will be used throughout. Except where otherwise noted, distributions will be assumed to admit a density with respect to a suitable dominating measure, denoted $dx$, and, with a slight abuse of notation, the same symbol will be used to refer to both a measure and its associated density, so for example $\pi(dx) = \pi(x)dx$. We denote by $\delta_x(\cdot)$ the Dirac measure which places all of its mass at the point $x$. For a Markov kernel $K(x, dy)$ acting from $E_1$ to $E_2$ and some probability measure $\mu(dx)$ on $E_1$, we will employ the following shorthand for the integral operation:

$$\mu K(\cdot) = \int_{E_1} K(x, \cdot)\,\mu(dx).$$

When describing marginals of joint densities, we will use notation of the following form:

$$\pi_n(x_k) = \int \pi_n(x_{1:n})\,dx_{1:k-1}\,dx_{k+1:n}.$$

2.1 Particle Filtering

The traditional setting of sequential Monte Carlo methods has been in the field of filtering discrete time hidden Markov models. To motivate the present work it is convenient to review these particle filtering algorithms; this summary is necessarily brief and the reader is referred to Doucet et al. (2001) for further details.

Consider two discrete time stochastic processes: the signal process, denoted by $(X_n)_{n\ge 1}$, and the observation process, denoted by $(Y_n)_{n\ge 1}$, defined on a common probability space $(\Omega, \mathcal{F}, P)$ with conditional independence structure (where we assume that densities $f$ and $g$ exist):

$$P(X_n \in A \mid X_{1:n-1} = x_{1:n-1}, Y_{1:n-1} = y_{1:n-1}) = \int_A f(x_n|x_{n-1})\,dx_n,$$

$$P(Y_n \in B \mid X_{1:n} = x_{1:n}, Y_{1:n-1} = y_{1:n-1}) = \int_B g(y_n|x_n)\,dy_n,$$

for any measurable sets $A$, $B$. That is, the signal process $(X_n)_{n\ge 1}$ is a Markov chain and $(Y_n)_{n\ge 1}$ has the property that, for all $n$, conditional on the signal at time $n$, $Y_n$ is independent of all previous states and observations.

The filtering problem is to approximate the density $p_n(x_n|y_{1:n})$ at each time (typically, it is necessary to perform this task sequentially and in real time as one wishes to obtain estimates concerning the current state of the system as observations arise). The densities of interest may be written as time marginals of the smoothing densities, which are given recursively by:

$$\hat{p}_n(x_{1:n}|y_{1:n-1}) = p_{n-1}(x_{1:n-1}|y_{1:n-1})\,f(x_n|x_{n-1}),$$

$$p_n(x_{1:n}|y_{1:n}) \propto \hat{p}_n(x_{1:n}|y_{1:n-1})\,g(y_n|x_n).$$

This is a difficult problem as, except in a few special cases, the normalising constant of $p_n(x_{1:n}|y_{1:n})$ cannot be obtained in closed form. The solution offered by sequential Monte Carlo is to propagate a collection of weighted samples through time so that they approximate each of the distributions of interest at the appropriate time using importance sampling and resampling mechanisms.
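The propagate, weight and resample cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the toy linear Gaussian model, the function name `bootstrap_filter` and all parameter values are hypothetical and do not come from the text, and the prior transition density is used as the instrumental density, so that each weight reduces to the likelihood $g(y_n|x_n)$.

```python
import math
import random


def bootstrap_filter(ys, n_particles, rng, a=0.9, sigma_x=1.0, sigma_y=0.5):
    """Estimate the filtering means E[x_n | y_{1:n}] for a toy model.

    Hypothetical model (an assumption, not taken from the text):
        x_1 ~ N(0, sigma_x^2),  x_n = a * x_{n-1} + N(0, sigma_x^2),
        y_n = x_n + N(0, sigma_y^2).
    """
    particles = [rng.gauss(0.0, sigma_x) for _ in range(n_particles)]
    means = []
    for n, y in enumerate(ys):
        if n > 0:
            # Propagate each particle through the transition density f.
            particles = [a * x + rng.gauss(0.0, sigma_x) for x in particles]
        # The proposal equals the prior, so the weight is g(y_n | x_n).
        log_w = [-0.5 * ((y - x) / sigma_y) ** 2 for x in particles]
        shift = max(log_w)  # subtract the maximum for numerical stability
        w = [math.exp(lw - shift) for lw in log_w]
        total = sum(w)
        w = [wi / total for wi in w]
        means.append(sum(wi * x for wi, x in zip(w, particles)))
        # Multinomial resampling: the resampled particles are equally weighted.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return means
```

Multinomial resampling is used here purely for brevity; other resampling schemes could be substituted without changing the structure of the loop.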
Resampling involves eliminating particles with low weights and multiplying those with high weights, resulting in a set of samples which all have the same weight. This is optionally applied at each iteration. Algorithm 1 provides a description of a basic particle filter in the case that resampling is applied at every iteration. At step 10 the algorithm provides a weighted sample from $p_n(x_n|y_{1:n})$. The resampling step is a crucial ingredient. If it were never applied, the algorithm would collapse after several iterations, with all weight concentrated on a single particle, giving a very poor approximation of the target distribution. Resampling alleviates this weight-degeneracy phenomenon at the expense of introducing degeneracy in the historical locations of the particles. However, for filtering, this latter degeneracy effect is not so problematic as the very recent history of the particle set, from which estimates are to be drawn, can remain diverse. See Douc et al. (2005) and references therein for details of a variety of resampling schemes.

Algorithm 1 A simple sequential Monte Carlo filter.
 1: Set $n = 1$.
 2: for $i = 1$ to $N$ do
 3:     $X_1^{(i)} \sim h_1$   {where $h_1$ is an instrumental density.}
 4:     $W_1^{(i)} \propto p_1(X_1^{(i)}) / h_1(X_1^{(i)})$
 5: end for
 6: $n \leftarrow n + 1$
 7: for $i = 1$ to $N$ do
 8:     $X_n^{(i)} \sim h_n(\cdot|X_{n-1}^{(i)})$   {$h_n(\cdot|X_{n-1}^{(i)})$ is some instrumental density.}
 9:     $W_n^{(i)} \propto f(X_n^{(i)}|X_{n-1}^{(i)})\,g(y_n|X_n^{(i)}) / h_n(X_n^{(i)}|X_{n-1}^{(i)})$
10: end for
11: Resample
12: Go to step 6.

2.2 Statement of the Problem

The problem which is treated in the present work can be summarised as follows. We first define the signal process $(\zeta_t)_{t\ge 0}$, where each $\zeta_t$ takes a value in a state space $\Xi$ (which is assumed to be Polish and endowed with a suitable Borel field), typically some subset of $\mathbb{R}^{d_x}$, and a sequence of noisy observations $(Y_n)_{n\in\mathbb{N}}$.
Given the signal at the observation time (or in some local neighbourhood if observations are received in batches at fixed times as in the SNCP example), the observations are conditionally independent of one another and the remainder of the signal process. Our aim is to obtain iteratively, as observations arrive, the conditional distribution of the signal process given the collection of observations up to that time.

2.3 Model Specification

We now provide a specification of the class of models under consideration. Consider first a discrete-time pair Markov process $(\tau_j, \theta_j)_{j\in\mathbb{N}}$, of non-decreasing times, $\tau_j \in \mathbb{R}^+$, and parameters, $\theta_j \in \Xi$, with transition kernel of the form:

$$p(d(\tau_j, \theta_j)|\tau_{j-1}, \theta_{j-1}) = f(d\tau_j|\tau_{j-1})\,q(d\theta_j|\theta_{j-1}, \tau_{j-1}, \tau_j). \tag{1}$$

There is no theoretical reason for limiting ourselves to the conditional independence structure as in (1), but we here consider only such structure for ease of presentation. In some applications there may be a requirement to make $\tau_j$ dependent on $\theta_{j-1}$, or even to define the process $(\tau_j, \theta_j)_{j\in\mathbb{N}}$ as non-Markovian, and in principle the proposed methods can accommodate such dependence structures.

We next define a random continuous time counting process $(\nu_t)_{t\ge 0}$ as follows:

$$\nu_t = \sum_{j=1}^{\infty} \mathbb{I}_{[0,t]}(\tau_j) = \max\{j : \tau_j \le t\}.$$

The right-continuous signal process, $(\zeta_t)_{t\ge 0}$, which takes a value in $\Xi$ at any time $t$ and has known initial distribution, $\zeta_0 \sim q_0(\zeta_0)$, is then defined by:

$$\zeta_t = F(t, \tau_{\nu_t}, \theta_{\nu_t}),$$

with the conventions that $\tau_0 = 0$, $\theta_0 = \zeta_0$. The function $F : \mathbb{R}^+ \times \mathbb{R}^+ \times \Xi \to \Xi$ is deterministic, $\Xi$-valued and subject to the condition that $F(\tau_j, \tau_j, \theta_j) = \theta_j$, $\forall j \in \mathbb{N}$. From the initial condition $\zeta_0$, a realisation of the signal process evolves deterministically according to $F$ until the time of the first jump $\tau_1$, at which time it takes the new value $\theta_1$. The signal continues to evolve deterministically according to $F$ until $\tau_2$, at which time the signal acquires the new value $\theta_2$, and so on.
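The construction of the signal from the jump times and parameters can be sketched as follows. This is a minimal sketch, assuming the jump times are held in a sorted Python list; the function names and the linear-drift choice of `flow` (standing in for $F$) are hypothetical and serve only to make the example runnable.

```python
from bisect import bisect_right


def jump_count(t, taus):
    """nu_t = max{j : tau_j <= t}, for jump times sorted in increasing order."""
    return bisect_right(taus, t)


def signal(t, taus, thetas, zeta0, flow):
    """Evaluate zeta_t = F(t, tau_nu, theta_nu) with tau_0 = 0, theta_0 = zeta_0."""
    nu = jump_count(t, taus)
    tau = taus[nu - 1] if nu > 0 else 0.0
    theta = thetas[nu - 1] if nu > 0 else zeta0
    return flow(t, tau, theta)


def linear_flow(t, tau, theta):
    """Hypothetical F: a (level, slope) state drifting linearly between jumps.

    It satisfies the required condition F(tau, tau, theta) = theta.
    """
    level, slope = theta
    return (level + slope * (t - tau), slope)
```

Because `bisect_right` counts jump times less than or equal to `t`, the evaluated path is right-continuous: at `t` equal to a jump time the signal already takes the new parameter value.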
The signal process is observed in the $n$th window $(t_{n-1}, t_n]$ through a collection of random variables, $Y_n$, conditionally independent of the past, given the signal process on $(t_{n-1}, t_n]$. We denote the likelihood function by $g(y_n|\zeta_{(t_{n-1},t_n]})$. We will be especially interested in the number of jumps occurring in the interval $[0, t_n]$ and therefore set $k_n := \nu_{t_n}$. Our signal model induces a joint prior distribution, $p_n(k_n, \tau_{1:k_n})$, on the number of jumps in $[0, t_n]$ and their locations:

$$p_n(k_n, d\tau_{1:k_n}) = S(t_n, \tau_{k_n}) \prod_{j=1}^{k_n} f(d\tau_j|\tau_{j-1}), \tag{2}$$

which has support on the disjoint union $\bigcup_{k=0}^{\infty} \{k\} \times T_{n,k}$, where $\mathbb{R}^k \supset T_{n,k} = \{\tau_{1:k} : 0 < \tau_1 < \dots < \tau_k \le t_n\}$, and where $S(t, \tau)$ is the survivor function associated with the transition kernel $f(d\tau_j|\tau_{j-1})$:

$$S(t, \tau) = 1 - \int_{\tau}^{t} f(ds|\tau). \tag{3}$$

Expression (3) is the conditional probability that, given the most recent jump occurred at $\tau$, the next jump occurs after $t$. Thus (2) is the probability that there are precisely $k_n$ jumps on $[0, t_n]$, with locations in infinitesimal neighbourhoods of $\tau_1, \tau_2, \dots, \tau_{k_n}$.

Given the function $F$, the path $(\zeta_t)_{t\in[0,t_n]}$ is completely specified by the initial condition $\zeta_0$, the number of jumps, $k_n$, their locations $\tau_{1:k_n}$ and associated parameter values $\theta_{1:k_n}$. We will be interested in the conditional distributions of such quantities, given all observations made up to and including $t_n$. It is consequently necessary to specify a distinct collection of random variables $X_n = (k_n, \zeta_{n,0}, \theta_{n,1:k_n}, \tau_{n,1:k_n})$, which takes its values in the disjoint union:

$$E_n = \bigcup_{k=0}^{\infty} \{k\} \times \Xi^{k+1} \times T_{n,k},$$

with $\mathbb{R}^k \supset T_{n,k} = \{\tau_{n,1:k} : 0 < \tau_{n,1} < \dots < \tau_{n,k} \le t_n\}$. Note that $E_n \subset E_{n+1}$. We also write $\zeta_{n,t} = F(t, \tau_{\nu_{n,t}}, \theta_{\nu_{n,t}})$.

In order to obtain the distribution of $(\zeta_t)_{t\in[0,t_n]}$, given the observations $y_{1:n}$, it would suffice to find $\pi_n(x_n)$, the posterior density of $X_n$, because, by construction, the signal process is a deterministic function of the jump times and parameters.
This posterior distribution, up to a constant of proportionality, has the form:

$$\pi_n(x_n) \propto p_n(k_n, \tau_{n,1:k_n})\, q_0(\zeta_{n,0}) \prod_{j=1}^{k_n} q(\theta_{n,j}|\theta_{n,j-1}, \tau_{n,j-1}, \tau_{n,j}) \prod_{p=1}^{n} g(y_p|\zeta_{n,(t_{p-1},t_p]}). \tag{4}$$

As a filtering example, in order to obtain the distribution $p(\zeta_{t_n}|y_{1:n})$, it suffices to obtain $\pi_n(\tau_{k_n}, \theta_{k_n})$. Exact inference for all non-trivial versions of this model is intractable and in section 3 we describe Monte Carlo schemes for obtaining sample-based approximations of posterior distributions.

2.4 A Motivating Example

We next provide an example application in order to illustrate how the above-specified model can be used in practice. We consider the planar motion of a vehicle which manoeuvres according to standard, piecewise constant acceleration dynamics. Each parameter may be decomposed into $x$ and $y$ components, each containing a position, $s$, velocity, $u$, and acceleration value, $a$:

$$\theta_j = \begin{bmatrix} \theta_j^x \\ \theta_j^y \end{bmatrix} \quad \text{and} \quad F(t, \tau_{\nu_t}, \theta_{\nu_t}) = \begin{bmatrix} F^x(t, \tau_{\nu_t}, \theta_{\nu_t}) \\ F^y(t, \tau_{\nu_t}, \theta_{\nu_t}) \end{bmatrix}.$$

Here $\Xi = \mathbb{R}^6$ but the $x$ and $y$ components have identical parameters and evolutions; for brevity we describe only a single component: $\theta_j^x = [s_j^x \; u_j^x \; a_j^x]^T$ and

$$F^x(t, \tau_{\nu_t}, \theta_{\nu_t}) = \begin{bmatrix} 1 & (t - \tau_{\nu_t}) & \tfrac{1}{2}(t - \tau_{\nu_t})^2 \\ 0 & 1 & (t - \tau_{\nu_t}) \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} s_{\nu_t}^x \\ u_{\nu_t}^x \\ a_{\nu_t}^x \end{bmatrix}.$$

At time zero the vehicle has position, velocity and acceleration $\zeta_0$. At subsequent times, $\tau_j$, the acceleration of the vehicle jumps to a new, random value according to $q(d\theta_j|\theta_{j-1}, \tau_{j-1}, \tau_j)$, which is specified by:

$$q(d\theta_j^x|\theta_{j-1}, \tau_{j-1}, \tau_j) = \delta_{s_j^{x,-}}(ds_j^x)\,\delta_{u_j^{x,-}}(du_j^x)\,q(da_j^x),$$

where

$$s_j^{x,-} = s_{j-1}^x + u_{j-1}^x(\tau_j - \tau_{j-1}) + a_{j-1}^x(\tau_j - \tau_{j-1})^2/2,$$

$$u_j^{x,-} = u_{j-1}^x + a_{j-1}^x(\tau_j - \tau_{j-1}).$$

The component of $F$ in the $y$-direction is equivalent. This model is considered a suitable candidate for the benchmark fighter-aircraft trajectory as described in Blair et al. (1998), shown in figure 1. The parameter transition kernel in this model embodies prior knowledge of how the acceleration of the vehicle changes.
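The single-component dynamics of this example can be sketched in Python as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the zero-mean Gaussian law (with standard deviation `accel_sd`) used for the new acceleration is an assumed stand-in, since the text leaves $q(da_j^x)$ unspecified at this point.

```python
import random


def flow_x(t, tau, theta):
    """F^x: propagate (position s, velocity u, acceleration a) from tau to t."""
    s, u, a = theta
    dt = t - tau
    return (s + u * dt + 0.5 * a * dt * dt, u + a * dt, a)


def sample_jump_x(theta_prev, tau_prev, tau, rng, accel_sd=10.0):
    """Draw theta_j given theta_{j-1} for one coordinate.

    Position and velocity are continuous across the jump (the two Dirac
    factors of the kernel); only the acceleration is redrawn. The Gaussian
    acceleration law and accel_sd are assumptions made for illustration.
    """
    s_minus, u_minus, _ = flow_x(tau, tau_prev, theta_prev)
    return (s_minus, u_minus, rng.gauss(0.0, accel_sd))
```

Reusing `flow_x` inside `sample_jump_x` makes explicit that the pre-jump position and velocity are simply the deterministic flow evaluated at the new jump time.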
The kernel $f(\tau_j|\tau_{j-1})$ represents prior knowledge of the amount of time the vehicle spends executing a particular manoeuvre. As an example observation model, at each time $t_n$ the Cartesian position of the vehicle is observed through some noisy sensor. An example inference task is then to recursively estimate the position of the vehicle, and the time of the most recent jump in its acceleration, given observations from the noisy sensor. We return to this example in the sequel.

Figure 1: Benchmark 2D position trajectory and observations with additive Gaussian noise (horizontal axis $s^x$/km, vertical axis $s^y$/km). The trajectory begins around (66,29), with the aircraft executing several manoeuvres before the trajectory terminates after a duration of 185 seconds.

2.5 Related Work

Inference schemes based upon direct extension of the particle filter have been devised for the process of interest. The variable rate particle filter (VRPF) of Godsill and Vermaak (2004); Godsill et al. (2007); Godsill (2007) is one such scheme, which samples a random number of jump times on the interval $[0, t_n]$ by drawing recursively from $f(\tau_j|\tau_{j-1})$, until a stopping time. One could equivalently sample first from $p_n(k_n)$ and then from $f_n(\tau_{1:k}|k_n)$. A similar method was presented independently in Maskell (2004). The relationship between the proposed method and these existing algorithms is made precise in the sequel.

In the context of particle filtering for standard discrete time state space models, it is well known that sampling from the prior distribution can be inefficient, yielding importance weights of high variance. This problem is exacerbated when constructing SMC algorithms for PDP’s. Resampling, which is an essential component of SMC algorithms for online filtering, results in multiple copies of some particles.
Under the prior distribution for the class of models considered here, there is significant positive probability that zero jumps will occur between one observation time and the next. Proposing from the prior distribution after resampling can therefore result in multiple copies of some particles being propagated over several iterations of the algorithm, without diversification. This phenomenon is most obvious when the expected jump arrival rate is low relative to the rate at which observations are made (as is the case in applications of interest). This in turn leads to the accumulation of errors in the particle approximation to the target distribution and instability of the algorithm. Computational methods to avoid repeated calculations were proposed in Godsill et al. (2007), but these do not address the underlying problem. The technique of “state regeneration” was introduced in Godsill and Vermaak (2005), but it too involves proposing particles without taking into account new observations.

In Doucet et al. (2006a), the SMC samplers framework is used to enhance the efficiency of a discrete time filtering algorithm. This method allows a recent history of the state trajectory for each particle to be perturbed using an importance sampling mechanism, diversifying the particle set. The “adjustment” moves proposed below achieve a similar effect in the context of filtering for PDP’s.

Finally, we note that Chopin (2006) addressed a related but discrete-time change-point problem. That approach reformulates the standard state-space model, yielding a filtering scheme which propagates the posterior distribution for the time since the most recent change-point. Chopin remarks that slow-mixing properties make the inclusion of MCMC diversification moves in the accompanying SMC algorithm essential for efficient operation. The “adjustment” moves proposed below address the same basic problem within the framework of the present paper.
MCMC diversification can also be applied in the proposed algorithm (see details below).

3 Methodology

We begin with a brief summary of the sampling framework which we propose to employ before describing its use to sample from the class of distributions of interest. The SMC samplers framework of Del Moral et al. (2006a,b) is a very general method for obtaining a set of samples from a sequence of distributions which exist on the same or different spaces. This can be viewed as a generalisation of the standard SMC method in which the object distribution exists on a space of strictly increasing dimension. SMC samplers have recently been applied to similar trans-dimensional problems (Doucet et al., 2006b). The motivation for this approach is that it would be extremely useful to have a generic technique for sampling from a sequence of distributions defined upon arbitrary spaces which are somehow related, and this is precisely what is required in the filtering of piecewise deterministic processes.

The principal innovation of the SMC sampler approach is to construct a sequence of synthetic distributions with the necessary properties. Given a collection of measurable spaces $(E_n, \mathcal{E}_n)_{n\in\mathbb{N}}$, upon which the sequence of probability measures from which we wish to sample, $(\pi_n)_{n\in\mathbb{N}}$, is defined, it is possible to construct a sequence of distributions $(\tilde{\pi}_n)_{n\in\mathbb{N}}$ upon the sequence of spaces $\left(\prod_{p=1}^{n} E_p\right)_{n\in\mathbb{N}}$ endowed with the product $\sigma$-algebras, which have the target at time index $n$ as a marginal distribution at that time index. As we are only interested in this marginal distribution, there is no need to adjust the position of earlier states in the chain and standard SMC techniques can be employed. However, this approach suffers from the obvious deficiency that it involves conducting importance sampling upon an extremely large space, whose dimension is increasing with $n$.
In order to ameliorate the situation, the synthetic distributions are constructed in the following way (assuming that a density representation exists):

$$\tilde{\pi}_n(x_{1:n}) = \pi_n(x_n) \prod_{p=1}^{n-1} L_p(x_{p+1}, x_p), \tag{5}$$

where $(L_n)_{n\in\mathbb{N}}$ is a sequence of ‘backward’ Markov kernels, with $L_n$ acting from $E_{n+1}$ into $E_n$. With this structure, an importance sample from $\tilde{\pi}_n$ is obtained by taking the path $x_{1:n-1}$, a sample from $\tilde{\pi}_{n-1}$, and extending it with a Markov kernel, $K_n : E_{n-1} \times \mathcal{E}_n \to [0, 1]$, which acts from $E_{n-1}$ into $E_n$, providing samples from $\tilde{\pi}_{n-1} \times K_n$ and leading to the importance weight:

$$w_n(x_{n-1}, x_n) \propto \frac{\tilde{\pi}_n(x_{1:n})}{\tilde{\pi}_{n-1}(x_{1:n-1})\,K_n(x_{n-1}, x_n)} = \frac{\pi_n(x_n)\,L_{n-1}(x_n, x_{n-1})}{\pi_{n-1}(x_{n-1})\,K_n(x_{n-1}, x_n)}.$$

Remark 1. Although, for simplicity, we have above followed the convention in the literature and assumed that all distributions and kernels of interest may be described by a density with respect to some common dominating measure, this is not necessarily the case. All that is actually required is that the product distribution defined by $\pi_n \otimes L_{n-1}(d(x_{n-1}, x_n)) = \pi_n(dx_n)L_{n-1}(x_n, dx_{n-1})$ is dominated by $\pi_{n-1} \otimes K_n(d(x_{n-1}, x_n))$, and the importance weight is then the corresponding Radon-Nikodým derivative. Thus the backward kernel $L_{n-1}$ must be chosen such that this requirement is met.

This approach ensures that the importance weights at time index $n$ depend only upon $x_n$ and $x_{n-1}$. This also allows the algorithm to be constructed in such a way that the full history of the sampler need not be stored. In the context of the filtering problem under consideration, recall that $x_n$ actually specifies a path of $(\zeta_t)_{t\in[0,t_n]}$, but we will construct samplers with importance weights which depend only upon some recent history of $x_n$ and $x_{n-1}$, yielding a scheme with storage requirements which do not increase over time.

3.1 PDP Particle Filter

We next describe the design of a SMC scheme to target the distributions of interest in our sampling problem.
By applying the SMC samplers method to the sequence of distributions $(\pi_n(x_n))_{n\in\mathbb{N}}$, see (4), we can obtain recursive schemes which propagate particle approximations to the distributions of various quantities in the recent history of the signal process. We will employ kernels which result in online algorithms, i.e. whose computational complexity and storage requirements do not grow over time. As a simple example we could maintain an approximation to each marginal distribution $\pi_n(\tau_{k_n}, \theta_{k_n})$, and thus to $p(\zeta_{t_n}|y_{1:n})$. Then at the $n$th iteration, the algorithm would yield a set of $N$ particles, $\{(k_n, \tau_{k_n}, \theta_{k_n})^{(i)}, w_n^{(i)}\}_{i=1}^{N}$. This approximates the filtering distribution for the signal process via:

$$P^N(d\zeta_{t_n}|y_{1:n}) = \sum_{i=1}^{N} w_n^{(i)}\,\delta_{\zeta_{t_n}^{(i)}}(d\zeta_{t_n}), \quad \text{where } \zeta_{t_n}^{(i)} = F\left(t_n, \tau_{k_n}^{(i)}, \theta_{k_n}^{(i)}\right).$$

We next give specific details of how such algorithms can be constructed. The explicit treatment of the dimensionality of the problem gives us control over the proposal of different numbers of jumps. Furthermore, the SMC samplers framework accommodates a more efficient proposal mechanism than that of the VRPF by permitting ‘adjustment’ moves as described below.

3.1.1 Choice of Kernels

The SMC samplers approach is clearly very flexible, and is perhaps too general: in addition to the choice of the proposal kernels $K_n$, it is now necessary to select a sequence of backward kernels $L_n$. The appearance of these kernels in the weight expression makes it clear that it will be extremely important to select these carefully. In the proposed algorithms we will be employing proposal kernels which are mixtures, for example each mixture component applying a different increment to the dimensionality parameter $k_n$. In general it is valid to allow the mixture weights to be a function of the current state, for example:

$$K_n(x_{n-1}, x_n) = \sum_{m=1}^{M} \alpha_{n,m}(x_{n-1})\,K_{n,m}(x_{n-1}, x_n), \qquad \sum_{m=1}^{M} \alpha_{n,m}(x_{n-1}) = 1.$$

Del Moral et al.
(2006a) provide the following expression for the optimal backward kernel for such a proposal kernel (in the sense that the variance of the importance weights is minimized if resampling is conducted at every step), which also has a mixture form:

$$L_{n-1}^{opt}(x_n, x_{n-1}) = \sum_{m=1}^{M} \beta_{n-1,m}^{opt}(x_n)\,L_{n-1,m}^{opt}(x_n, x_{n-1}) = \sum_{m=1}^{M} \frac{\alpha_{n,m}(x_{n-1})\,\pi_{n-1}(x_{n-1})\,K_{n,m}(x_{n-1}, x_n)}{\sum_{m'=1}^{M} \int_{E_{n-1}} \alpha_{n,m'}(x_{n-1})\,\pi_{n-1}(x_{n-1})\,K_{n,m'}(x_{n-1}, x_n)\,dx_{n-1}}. \tag{6}$$

When this optimal backward kernel is employed, the importance weight becomes:

$$w_n(x_{n-1}, x_n) \propto \frac{\pi_n(x_n)}{\int_{E_{n-1}} \pi_{n-1}(x_{n-1}) \left[\sum_{m=1}^{M} \alpha_{n,m}(x_{n-1})\,K_{n,m}(x_{n-1}, x_n)\right] dx_{n-1}}. \tag{7}$$

Whenever it is possible to use this optimal backward kernel, one should do so. However, in most instances the integral in the denominator of (7) will prove intractable (it is an expectation with respect to the target distribution $\pi_{n-1}$, and the intractability of such integrals is the main motivation for adopting a sampling-based approach in the first place), and an approximation to the optimal kernel must be used instead. We note that such a kernel is an approximation to the optimal kernel rather than an algorithmic approximation: the algorithm remains exact, only the estimator variance suffers as a result of the approximation.

In the context of mixture proposal kernels, it was suggested in Del Moral et al. (2006b) that importance sampling could be performed on a higher dimensional space, involving the space of mixture component indicators, in which case the incremental importance weight is specified by:

$$w_n(x_n, m, x_{n-1}) \propto \frac{\pi_n(x_n)\,\beta_{n-1,m}(x_n)\,L_{n-1,m}(x_n, x_{n-1})}{\pi_{n-1}(x_{n-1})\,\alpha_{n,m}(x_{n-1})\,K_{n,m}(x_{n-1}, x_n)}. \tag{8}$$

The design of the proposal kernel also plays a significant role in the performance of the algorithm. In order to minimise the variance of the importance weights, it must be well matched to the target distribution and hence to the observations. There are many different proposal kernels which could be employed in the context of PDP filtering.
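In practice, incremental weights of the form (8) are usually evaluated on the logarithmic scale to avoid numerical underflow. The following is a minimal sketch of that bookkeeping, assuming the six log-density terms for the selected mixture component $m$ have already been computed elsewhere; the function and argument names are hypothetical.

```python
import math


def log_incremental_weight(log_pi_new, log_pi_old, log_beta, log_back,
                           log_alpha, log_fwd):
    """Logarithm of the right-hand side of (8) for the chosen component m.

    Numerator terms: pi_n(x_n), beta_{n-1,m}(x_n), L_{n-1,m}(x_n, x_{n-1}).
    Denominator terms: pi_{n-1}(x_{n-1}), alpha_{n,m}(x_{n-1}),
    K_{n,m}(x_{n-1}, x_n).
    """
    return (log_pi_new + log_beta + log_back) - (log_pi_old + log_alpha + log_fwd)


def normalise(log_weights):
    """Convert unnormalised log-weights into weights that sum to one."""
    shift = max(log_weights)  # subtract the maximum before exponentiating
    w = [math.exp(lw - shift) for lw in log_weights]
    total = sum(w)
    return [wi / total for wi in w]
```

Since (8) is only specified up to proportionality, any constant common to all particles can be dropped before calling `normalise`.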
In order to provide some guidance, we next give a generic kernel, consisting of two moves, in order to highlight issues which affect the algorithm’s performance. An algorithm built around this kernel, and possibly also involving the application of a Metropolis-Hastings kernel after resampling (further details below), constitutes a general strategy for the filtering of PDPs. Another example can be found in Appendix A, where the algorithm of Maskell (2004) and the VRPF of Godsill et al. (2007) are specified in the PDP particle filter framework.

3.1.2 Standard Moves

We here consider two generic moves, an adjustment move and a birth move.

Adjustment Move $K_{n,a}$. The dimensionality is maintained and the most recent parameter, $\theta_{n-1,k_{n-1}}$, is replaced by a draw from a distribution $\eta_n(\cdot|x_{n-1})$, yielding a new parameter value $\theta_{n,k_n}$:

$$K_{n,a}(x_{n-1}, dx_n) = \delta_{k_{n-1}}(k_n)\,\delta_{\tau_{n-1,1:k_{n-1}}}(d\tau_{n,1:k_n})\,\delta_{\theta_{n-1,1:k_n-1}}(d\theta_{n,1:k_n-1})\,\eta_n(d\theta_{n,k_n}|x_{n-1}). \tag{9}$$

It can be shown that the optimal choice (in the sense of minimizing the conditional variance of the importance weights given that an adjustment move is to be made and that the optimal backwards kernel is employed) of distribution from which to propose the new parameter is the full conditional distribution $\pi_n(\cdot|x_n \setminus \theta_{n,k_n})$, where $x_n \setminus \theta_{n,k_n}$ denotes all components of $x_n$ other than $\theta_{n,k_n}$.

Birth Move $K_{n,b}$. The dimensionality is increased and additional jump times and parameters are sampled. In a simple case, the number of jumps is incremented, $k_n = k_{n-1} + 1$, a new jump time, $\tau_{n,k_n}$, is proposed from a distribution $h_n(\cdot)$ on $(\tau_{n,k_n-1}, t_n]$, and a new parameter is then drawn from a proposal distribution $\eta_n(\cdot|x_n \setminus \theta_{n,k_n})$:

$$K_{n,b}(x_{n-1}, dx_n) = \delta_{k_{n-1}+1}(k_n)\,\delta_{\tau_{n-1,1:k_{n-1}}}(d\tau_{n,1:k_n-1})\,\delta_{\theta_{n-1,1:k_{n-1}}}(d\theta_{n,1:k_n-1})\,h_n(d\tau_{n,k_n})\,\eta_n(d\theta_{n,k_n}|x_n \setminus \theta_{n,k_n}). \tag{10}$$

It can be shown that the optimal choice (in the sense of minimizing the variance of the importance weights) of distribution from which to propose the new parameter is the full conditional $\pi_n(\theta_{n,k_n}|x_n \setminus \theta_{n,k_n})$, given $\tau_{n,k_n}$, that a birth move is to be made and that the optimal backwards kernel is employed. If $\tau_{n,k_n} \le t_{n-1}$ this amounts to altering the trajectory $(\zeta_t)_{t\in[\tau_{n,k_n},t_{n-1}]}$ and extending the trajectory onto $(t_{n-1}, t_n]$. An alternative, suboptimal choice is to propose the new parameter from the prior, $q(\cdot|\theta_{j-1}, \tau_{j-1}, \tau_j)$. This proposal has the advantage that it is simpler to use and does not require knowledge of the full conditional distribution. Unfortunately, this simplicity comes at the cost of an increase in variance.

When the full conditional distributions are not available analytically, sensible approximations should be employed. We note that such approximations do not affect the exactness of the algorithm, just the estimator variance.

It should be noted that, due to the nested structure of the sequence $(E_n)_{n\in\mathbb{N}}$ and the support of the sequence $(\pi_n)_{n\in\mathbb{N}}$, there is no technical requirement to include a dimensionality reducing component in the proposal. However, one may be advantageous in some scenarios, for example in models where later observations can reduce the posterior probability of there being a change point in the recent history of the hidden process.

Proposal Mixture Weights. A simple choice of mixture weights is in terms of prior probabilities. If employing a kernel which consists of a birth move and an adjustment move, we could define the mixture weight for the adjustment move, $\alpha_{n,a}(x_{n-1})$, as follows:

$$\alpha_{n,a}(x_{n-1}) = S(t_n, \tau_{n-1,k_{n-1}}), \tag{11}$$

which is the conditional prior probability of zero jumps in $(\tau_{n-1,k_{n-1}}, t_n]$, given $\tau_{n-1,k_{n-1}}$. Ideally we would like to take into account information from the observations in order to adapt the mixture weights and minimize the variance of the importance weights.
Unfortunately this is usually intractable in practice.

Backward Kernel and Importance Weights. Given a proposal kernel consisting of a number of different moves, we need a strategy for designing a corresponding backward kernel. We adopt the approach of Del Moral et al. (2006b): first design one backward kernel component corresponding to each proposal kernel component, and then find a way to combine the backward kernel components. Ideally we would employ an optimal backward kernel with components specified by (6); unfortunately this will rarely be possible in practice. Sensible approximations must therefore be employed, whilst ensuring that the requirement described in Remark 1 is met. For the forward kernel components described above, we advocate the use of the following backward kernel components or approximations thereof:

$$L_{n-1,a}(x_n, x_{n-1}) = \frac{\pi_{n-1}(x_{n-1})\,K_{n,a}(x_{n-1}, x_n)}{\pi_{n-1}K_{n,a}(x_n)}, \qquad L_{n-1,b}(x_n, x_{n-1}) = \frac{\pi_{n-1}(x_{n-1})\,K_{n,b}(x_{n-1}, x_n)}{\pi_{n-1}K_{n,b}(x_n)}.$$

We then need a way to combine these backward kernel components. Unfortunately it is difficult to approximate the mixture weights, $\beta^{\mathrm{opt}}_{n-1,m}(x_n)$, of the optimal backward kernel. For the two move types above, a strategy which has been found to work in practice is simply to set, for all $m$, $\beta_{n-1,m}(x_n) = 1/M$, where $M$ is the total number of components in the proposal kernel. This approach can be expected to perform well when there is little overlap in the supports of $\pi_{n-1}K_{n,b}$ and $\pi_{n-1}K_{n,a}$. Then, using the expression in (8), the corresponding importance weights are given by the following expressions. For the adjustment move, the incremental weight is given by:

$$w_n(x_{n-1}, x_n) \propto \frac{\pi_n(x_n)\,\pi_{n-1}(\theta_{n-1,k_{n-1}}|x_{n-1} \setminus \theta_{n-1,k_{n-1}})}{\alpha_{n,a}(x_{n-1})\,\pi_{n-1}(x_{n-1})\,\pi_n(\theta_{n,k_n}|x_n \setminus \theta_{n,k_n})}, \tag{12}$$

and for the birth move:

$$w_n(x_{n-1}, x_n) \propto \frac{\pi_n(x_n)}{\alpha_{n,b}(x_{n-1})\,\pi_{n-1}(x_{n-1})\,h_n(\tau_{n,k_n})\,\pi_n(\theta_{n,k_n}|x_n \setminus \theta_{n,k_n})}. \tag{13}$$

From the definition of $\pi_n$, (4), and the definition of the proposal kernels, it can be seen that (12) and (13) in fact depend only upon a recent history of $x_n$.

Further Considerations. A technical requirement of importance sampling schemes is that the support of the proposal distribution includes that of the posterior distribution. It appears never to have been mentioned in the literature that, in the context of trans-dimensional inference in time series, this imposes the requirement that a forward kernel capable of proposing any positive number of births in the interval $(t_{n-1}, t_n]$ must be employed if there is a positive probability associated with such configurations under the target distribution. In principle it might be sufficient to employ a birth move which introduces $1 + B$ new jumps, with $B$ a Poisson random variable of low intensity, to ensure convergence (asymptotically in the number of particles) to the target distribution. Of course, for finite samples, if the intensity is small enough it is equivalent to never proposing more than a single new jump. However, if the intensity is made too small then the estimator variance may become very large. In practice it is extremely unlikely that it will be possible to accurately infer the position of several jumps between a pair of observations.

Due to the sequential nature of the algorithm and the form of the above moves, the algorithm requires storage of only a fixed-length history for each particle.

By employing a mixture of moves which leave the state unchanged and prior birth moves on the interval $(t_{n-1}, t_n]$, in proportions specified by the prior on the dimensionality parameter $k_n$, we obtain algorithms as in Godsill (2007) and Maskell (2004). For details, see Appendix A.

It is important to ensure that a significant fraction of the particle set have weights substantially above 0.
The effective sample size (ESS), introduced by Kong et al. (1994), is an approximation of a quantity which describes the effective number of iid samples to which the set corresponds. The ESS is defined as

$$\mathrm{ESS} = \left[\sum_{i=1}^{N} \left(W^{(i)}\right)^{2}\right]^{-1},$$

where $W^{(i)}$ are the normalized weights. Resampling should be carried out after any iteration which causes the ESS to fall below a reasonable threshold (typically around half of the total number of particles), to prevent the sample becoming degenerate with a small number of samples having very large weights.

After resampling at the $n$th iteration, a Markov kernel of invariant distribution $\pi_n$, for example a reversible jump Metropolis-Hastings kernel, can be applied to the particle system. This is a standard technique for increasing the diversity of particle locations, whilst maintaining a particle set which targets $\pi_n$; see Gilks and Berzuini (2001).

The generic scheme for a PDP particle filter is given in Algorithm 2. As is demonstrated in the examples, inserting a mixture of the moves described in Section 3.1.2 and the specified weight expressions into this algorithm provides a generic technique for the filtering of PDPs.

Algorithm 2 PDP Particle Filter
1: $n = 1$
2: for $i = 1$ to $N$ do
3:   $X_1^{(i)} \sim \eta$ {where $\eta(x)$ is an instrumental distribution.}
4:   $W_1^{(i)} \propto \pi_1(X_1^{(i)}) / \eta(X_1^{(i)})$
5: end for
6: $n \leftarrow n + 1$
7: for $i = 1$ to $N$ do
8:   $X_n^{(i)} \sim K_n(X_{n-1}^{(i)}, \cdot)$
9:   $W_n^{(i)} \propto W_{n-1}^{(i)} \dfrac{\pi_n(X_n^{(i)})\,L_{n-1}(X_n^{(i)}, X_{n-1}^{(i)})}{\pi_{n-1}(X_{n-1}^{(i)})\,K_n(X_{n-1}^{(i)}, X_n^{(i)})}$
10: end for
11: Resampling can be conducted at this stage.
12: Sample rejuvenation can be conducted by allowing the particles to evolve under the action of a Markov kernel of invariant distribution $\pi_n$.
13: Go to step 6

3.2 Auxiliary Methods

The auxiliary particle filter of Pitt and Shephard (1999) is motivated by the idea that in standard sequential importance resampling algorithms, one takes a weighted sample which targets some distribution $\pi_{n-1}$ and resamples from this distribution before mutating the sample and weighting it to target $\pi_n$. This seems somewhat inefficient: it would be better to take into account how well each sample is likely to fit $\pi_n$ after mutation, before resampling is performed. Pitt and Shephard (1999) essentially introduce a set of auxiliary particle weights before the resampling step, with the aim of 'pre-selecting' particles so as to reduce the variance of the importance weights at the next time step.

In Godsill and Clapp (2001) it was noted that such schemes can be formulated as importance sampling applied to the joint sequence of states from $1:n$. In Johansen and Doucet (2008), the auxiliary method was reinterpreted as a standard sequential importance resampling algorithm which targets an auxiliary sequence of distributions, which are then used as an importance sampling proposal to provide estimators for the distribution of interest. The generalisation from particle filters to general SMC samplers has also been made in Johansen and Doucet (2007).

In the context of filtering and the class of models described in this paper, the idea is to use the observation at time $n + 1$ to guide the weighting of the particle set at time $n$, so that resampling does not eliminate particles which will be more favourably weighted once that observation has been taken into account. This pre-weighting is then corrected for by importance weighting after the resampling step.

We now describe an auxiliary version of the algorithm presented above. This amounts to an SMC sampler targeting an auxiliary sequence of distributions on $(E_n)_{n\ge 0}$, which will be denoted by $(\mu_n)_{n\ge 1}$, and a sequence of importance weights, $(\widetilde{W}_n)$, which correct for the discrepancy between $(\mu_n)_{n\ge 1}$ and $(\pi_n)_{n\ge 1}$.

We will focus on auxiliary distributions of the following form:

$$\mu_n(x_n) \propto V_n(\tau_{n,k_n}, \theta_{n,k_n}, y_{n+1})\,\pi_n(x_n),$$

where for each $n$, $V_n : \mathbb{R}^+ \times \Xi \to (0, \infty)$ is a potential function, i.e. the algorithm will pre-select particles on the basis of the most recent jump time and associated parameter and their interaction with the next observation. This potential function would usually be chosen to favour states that fit well with the new observation. As in the standard auxiliary particle filter, this approach allows us to use a small amount of information from the immediate future to ensure that particles which explain the forthcoming observation well produce offspring, even if the present observation does not provide them with much weight. This is a technique intended to improve the robustness of the filtering estimate to outlying observations.

It is important to choose the potential function $V_n$ sensibly; guidance is provided in Pitt and Shephard (1999) for the filtering case and in Johansen and Doucet (2007), taking into account some recent developments. The application of the auxiliary method to the example sampler above is described in Algorithm 3. Steps 5 and 13 in this algorithm indicate where the proposed scheme yields particle sets targeting the distributions of interest, $(\pi_n)_{n\in\mathbb{N}}$, in the sense of approximating expectations with respect to these distributions.
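The two-stage weighting described in this section can be illustrated with a minimal sketch. It assumes the potential values $V_n$ and the incremental weights $\pi_n L_{n-1} / (\pi_{n-1} K_n)$ have already been evaluated per particle and are passed in as plain lists; the function names (`first_stage_weights`, `second_stage_weights`) are hypothetical and not taken from the paper.

```python
def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def first_stage_weights(w_tilde, potentials):
    """Pre-weighting before resampling: W_n is proportional to V_n * W~_n."""
    return normalise([w * v for w, v in zip(w_tilde, potentials)])


def second_stage_weights(incremental, parent_potentials):
    """Correction after mutation: W~_n is proportional to incremental / V_{n-1}.

    `incremental[i]` stands for pi_n L_{n-1} / (pi_{n-1} K_n) for particle i and
    `parent_potentials[i]` for the potential used to pre-select its parent.
    """
    return normalise([g / v for g, v in zip(incremental, parent_potentials)])


# If the potential predicted the incremental weight exactly, the corrected
# weights are uniform: all of the selection happened at the resampling step.
incremental = [0.2, 1.0, 3.0]
corrected = second_stage_weights(incremental, incremental)
```

In this sketch the estimators of expectations under $\pi_n$ would be formed from the corrected weights, and the first-stage weights would be used only to drive resampling.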
Algorithm 3 Auxiliary PDP Particle Filter
1: $n = 1$
2: for $i = 1$ to $N$ do
3:   $X_1^{(i)} \sim \eta$ {where $\eta(x)$ is an instrumental distribution.}
4:   $\widetilde{W}_1(X_1^{(i)}) \propto \pi_1(X_1^{(i)}) / \eta(X_1^{(i)})$
5:   $\int \varphi(x_1)\,\pi_1(x_1)\,dx_1 \approx \dfrac{\sum_{i=1}^{N} \widetilde{W}_1(X_1^{(i)})\,\varphi(X_1^{(i)})}{\sum_{i=1}^{N} \widetilde{W}_1(X_1^{(i)})}$
6:   $W_1(X_1^{(i)}) \propto V_1(\tau_{1,k_1}^{(i)}, \theta_{1,k_1}^{(i)}, y_2)\,\widetilde{W}_1(X_1^{(i)})$
7: end for
8: $n \leftarrow n + 1$
9: Resample from the distribution defined by $\{W_{n-1}^{(i)}\}_{i=1}^{N}$
10: for $i = 1$ to $N$ do
11:   $X_n^{(i)} \sim K_n(X_{n-1}^{(i)}, \cdot)$
12:   $\widetilde{W}_n(X_{n-1}^{(i)}, X_n^{(i)}) \propto \dfrac{1}{V_{n-1}(\tau_{n-1,k_{n-1}}^{(i)}, \theta_{n-1,k_{n-1}}^{(i)}, y_n)} \dfrac{\pi_n(X_n^{(i)})\,L_{n-1}(X_n^{(i)}, X_{n-1}^{(i)})}{\pi_{n-1}(X_{n-1}^{(i)})\,K_n(X_{n-1}^{(i)}, X_n^{(i)})}$
13:   $\int \varphi(x_n)\,\pi_n(x_n)\,dx_n \approx \dfrac{\sum_{i=1}^{N} \widetilde{W}_n(X_{n-1}^{(i)}, X_n^{(i)})\,\varphi(X_n^{(i)})}{\sum_{i=1}^{N} \widetilde{W}_n(X_{n-1}^{(i)}, X_n^{(i)})}$
14:   $W_n(X_{n-1}^{(i)}, X_n^{(i)}) \propto V_n(\tau_{n,k_n}^{(i)}, \theta_{n,k_n}^{(i)}, y_{n+1})\,\widetilde{W}_n(X_{n-1}^{(i)}, X_n^{(i)})$
15: end for
16: Go to step 8

3.3 Convergence and Theoretical Validity

As the above algorithms are methodological applications of the SMC samplers framework, the theoretical results of Chopin (2004) and Del Moral et al. (2006a) apply directly to our algorithm. As a simple example, by application of the results in Del Moral et al. (2006a), it is immediately apparent that a Central Limit Theorem for the error in estimation of the posterior mean of $\zeta_{t_n}$ applies for the proposed method. Furthermore, this implies theoretical justification of the VRPF algorithm as described in Appendix A. Theoretical validity of the auxiliary scheme can be readily established by considering the interpretation in Johansen and Doucet (2007).

4 Simulation Study

In this section, we present realistic examples from mathematical finance and signal processing. These illustrate the performance improvements which are provided by the algorithmic framework presented above, when combined with sensible choices of proposal distributions.

4.1 Shot Noise Cox Process

In this model the signal process is the intensity of a shot noise Cox process and takes values in $\Xi = \mathbb{R}^+$.
The model is specified by the following distributions:

$$f(\tau_j|\tau_{j-1}) = \lambda_\tau \exp(-\lambda_\tau(\tau_j - \tau_{j-1})) \times \mathbb{I}_{[\tau_{j-1},\infty)}(\tau_j),$$
$$q(\zeta_0) = \exp(-\lambda_\theta \zeta_0) \times \mathbb{I}_{[0,\infty)}(\zeta_0),$$
$$q(\theta_j|\theta_{j-1}, \tau_j, \tau_{j-1}) = \lambda_\theta \exp(-\lambda_\theta(\theta_j - \zeta_{\tau_j^-})) \times \mathbb{I}_{[\zeta_{\tau_j^-},\infty)}(\theta_j),$$

where $\zeta_{\tau_j^-} = \theta_{j-1}\exp(-\kappa(\tau_j - \tau_{j-1}))$, and:

$$F(t, \tau, \theta) = \theta \exp(-\kappa(t - \tau)).$$

Given $\zeta_{(t_{n-1},t_n]}$, the observations are an inhomogeneous Poisson process with intensity $\zeta_{(t_{n-1},t_n]}$, i.e. $Y_n$ is a collection of random point locations in $(t_{n-1}, t_n]$. The likelihood function is given by:

$$g(y_n|\zeta_{(t_{n-1},t_n]}) = \exp\left(-\int_{t_{n-1}}^{t_n} \zeta_s\,ds\right) \prod_i \zeta_{y_{n,i}},$$

where $y_{n,i}$ is the time of the $i$th event observed in $(t_{n-1}, t_n]$.

In reinsurance pricing applications, the observed events model the claims on an insurance portfolio. The jumps of the signal process model 'primary events': incidents which result in insurance claims. An approximation to the optimal filter for this model was derived in Dassios and Jang (2005). They showed that as the rate of primary events grows, specific transformations of $(\zeta_t)_{t\ge 0}$ and the observation counting process converge weakly to Gaussian processes to which the Kalman-Bucy filter can be applied. The resulting approximation of the optimal filter is poor when the rate of the primary event process, $\lambda_\tau$, is low, as it is in economically important settings. The method we propose suffers from no such restrictions. We note that similar models have application in the pricing of options; see, for example, Centanni and Minozzo (2006).

We compare the performance of the basic VRPF approach (referred to as Scheme 1) to the proposed algorithm (Scheme 2), on data simulated from the model with the following parameter settings: $\kappa = 0.01$, $\lambda_\tau = 1/40$, $\lambda_\theta = 2/3$, $\Delta = 50$ and 500 particles. The true hidden trajectory and a histogram of the observed points are given in figure 2.
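As an illustration of the generative model just specified, the following is a minimal sketch of how data of this kind might be simulated. It is not the authors' simulation code. It assumes that $\zeta_0$ is exponentially distributed with rate $\lambda_\theta$, that the observed events are generated by thinning a dominating homogeneous Poisson process on each inter-jump segment, and that the horizon is 2000 time units (the range of the time axes in the figures); the function names are hypothetical.

```python
import math
import random


def simulate_sncp(horizon, kappa, lam_tau, lam_theta, rng):
    """Simulate the SNCP signal and observed event times on [0, horizon].

    Returns (taus, thetas, events): jump times (with tau_0 = 0), the intensity
    immediately after each jump, and the sorted observation times.
    """
    taus = [0.0]
    thetas = [rng.expovariate(lam_theta)]  # zeta_0 (assumed Exp(lam_theta))
    while True:
        tau = taus[-1] + rng.expovariate(lam_tau)  # draw from f(tau_j | tau_{j-1})
        if tau > horizon:
            break
        pre_jump = thetas[-1] * math.exp(-kappa * (tau - taus[-1]))  # zeta at tau_j^-
        thetas.append(pre_jump + rng.expovariate(lam_theta))  # draw from q(theta_j | ...)
        taus.append(tau)

    events = []
    for j, (tau, theta) in enumerate(zip(taus, thetas)):
        end = taus[j + 1] if j + 1 < len(taus) else horizon
        if theta <= 0.0:
            continue  # zero intensity: no events on this segment
        t = tau
        while True:
            t += rng.expovariate(theta)  # candidate from the dominating rate theta
            if t > end:
                break
            # Thinning: accept with probability zeta_t / theta = exp(-kappa (t - tau)).
            if rng.random() < math.exp(-kappa * (t - tau)):
                events.append(t)
    return taus, thetas, events


def intensity(t, taus, thetas, kappa):
    """Evaluate zeta_t = F(t, tau_j, theta_j) for the last jump tau_j <= t."""
    j = max(i for i, tau in enumerate(taus) if tau <= t)
    return thetas[j] * math.exp(-kappa * (t - taus[j]))


rng = random.Random(1)
taus, thetas, events = simulate_sncp(2000.0, 0.01, 1 / 40, 2 / 3, rng)
```

Thinning is valid here because the intensity decays monotonically between jumps, so the post-jump value $\theta_j$ bounds $\zeta_t$ on each segment.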
For this model, it was found to be sufficient to construct an algorithm with a proposal kernel consisting of a single move, with a trans-dimensional Metropolis-Hastings kernel applied after resampling. The proposal kernel is a birth move, proposing a new jump time in the interval $(t_{n-1}, t_n]$ and then drawing the associated parameter from its full conditional distribution:

$$K_n(x_{n-1}, dx_n) = \delta_{k_{n-1}+1}(k_n)\,\delta_{\tau_{n-1,1:k_{n-1}}}(d\tau_{n-1,1:k_n-1}) \times \delta_{\theta_{n-1,1:k_{n-1}}}(d\theta_{n-1,1:k_n-1})\,h_n(d\tau_{n,k_n})\,\pi_n(d\theta_{n,k_n}|x_n \setminus \theta_{n,k_n}),$$

where $h_n(\tau_{n,k_n})$ is the proposal distribution for the time of the new change point. This was built from a histogram of the data observed on $(t_{n-1}, t_n]$ in the following manner. The differences in counts between subsequent bins of the histogram were computed and exponentiated to yield a piecewise constant distribution. More specifically, consider partitioning the interval $(t_{n-1}, t_n]$ into $m$ sub-intervals of equal length. Let $c_p$ be the number of observed events in the $p$th bin. For $p \in \{1, 2, \ldots, m - 1\}$ define $d_p = c_{p+1} - c_p$. Then define a piecewise constant probability density function on $(t_{n-1}, t_n]$ by partitioning the interval into $m - 1$ sub-intervals of equal length. In the $p$th such sub-interval the density has the value:

$$h_n(\tau) = \frac{\exp(d_p)}{\sum_{q=1}^{m-1} \exp(d_q)}, \qquad \tau \in (t_{n-1} + (p - 1)\delta,\; t_{n-1} + p\delta],$$

where $\delta = (t_n - t_{n-1})/(m - 1)$. The corresponding component of the backwards kernel is necessarily (see Remark 1):

$$L_{n-1}(x_{n-1}, dx_n) = \delta_{k_n-1}(k_n)\,\delta_{\tau_{n,1:k_n-1}}(d\tau_{n-1,1:k_{n-1}})\,\delta_{\theta_{n-1,1:k_n-1}}(d\theta_{n-1,1:k_{n-1}}).$$

Systematic resampling was applied when the ESS dropped below 40%. After resampling, a Metropolis-Hastings kernel consisting of a sequence of three moves was applied:

1. a birth/death reversible jump move, with the birth being made by drawing a new jump time from the uniform distribution on $(\tau_{n,k_n}, t_n]$ and then drawing the parameter from the full conditional distribution;
2. a perturbation of the most recent jump time, $\tau_{n,k_n}$, drawing from a Gaussian kernel centred at $\tau_{n,k_n}$ and truncated to the interval $(\tau_{n,k_n-1}, t_n]$;
3. a perturbation of the parameter associated with the most recent jump time, drawing from the full conditional distribution.

In practice resampling was found to occur at less than 40% of iterations. Despite the fact that the proposal kernel consisted only of a birth move, it was found that the algorithm did not significantly over-fit the data: for the data shown in figure 2, the true number of jumps is 34 and, for a typical run, the MAP number of jumps was 39. The MMSE filtering estimate of $\zeta_t$ obtained using the proposed algorithm is shown in figure 2.

Figure 3 shows a count of the number of unique particles, over time, at the final iteration of the algorithm (i.e. $t = 2000$), where we have stored the entire history of each particle, $\zeta^{(i)}_{[0,2000]}$, in order to compute these quantities. However, the diversity of particle locations alone does not tell us everything about the efficiency of the algorithm. Whilst there may be many unique particles, the importance weights may have high variance, in which case the quality of the particle approximation will be low. In order to portray this characteristic, we also plot the same quantities after resampling is applied (whilst resampling can increase the variance of an estimate made from the particle set, for systematic resampling this increase is typically extremely small). These quantities demonstrate the degree of degeneracy of both the particle locations and the importance weights.

Pre-resampling, Scheme 1 exhibits some diversity in the recent history of particle locations, but note that the number of unique particles is less than the total number. This is an example of the phenomenon described in section 2.5. Degeneracy of the importance weights means that this diversity is reduced when resampling occurs.
Scheme 2 exhibits diversity in particle locations much further back in time, with all particles being unique in the recent history. The reduction in the number of unique particles when resampling occurs is less severe than for Scheme 1, demonstrating that the importance weights have lower variance.

[Figure 2: SNCP model. (a) Top: true intensity $\zeta_t$ (solid) and MMSE filtering estimate (dotted), against $t$. (b) Bottom: histogram of realised observations with bin length 2.5, against $t$.]

[Figure 3: Unique particles at the final iteration vs time $t$. (a) Scheme 1. (b) Scheme 2. Solid: pre-resampling. Dotted: post-resampling.]

4.2 Object Tracking

This example illustrates the performance of the proposed methods on a constant acceleration tracking model applied to the benchmark fighter aircraft trajectory depicted in figure 1, which has a duration of 185 seconds. The model is as specified in subsection 2.4, with acceleration components being drawn independently from an isotropic Gaussian distribution of mean zero and standard deviation $\sigma_\theta = 10\,\mathrm{m/s^2}$. A total of 37 additive, isotropic Gaussian observations of the position of the aircraft, with standard deviation $\sigma_y = 200\,\mathrm{m}$, were generated at intervals of $\Delta = 5\,\mathrm{s}$; these are also shown in figure 1. The inter-jump times were chosen to be Gamma distributed, with shape and scale parameters 10 and 2.5 respectively, corresponding to a mean inter-arrival time of 25 s. For this conditionally linear-Gaussian model it is possible to integrate out the parameters $\theta_{0:k_n}$ analytically, so that only jump times need be sampled.

Comparisons were made between the three schemes described below. In all cases, systematic resampling was employed.

1. The basic VRPF algorithm, sampling from the prior; see Appendix A for specification of the kernels. Resampling was performed when the ESS fell below 50%.

2. A forward kernel with two components. The first component is a birth move which adds a single point uniformly in $(\tau_{n,k_n-1}, t_n]$:

$$K_{n,1}(x_{n-1}, dx_n) = \delta_{k_{n-1}+1}(k_n)\,\delta_{\tau_{n-1,1:k_{n-1}}}(d\tau_{n-1,1:k_n-1}) \times \frac{1}{t_n - \tau_{n,k_{n-1}}}.$$

For this move, the corresponding component of the backwards kernel is necessarily:

$$L_{n-1,1}(x_{n-1}, dx_n) = \delta_{k_n-1}(k_n)\,\delta_{\tau_{n,1:k_n-1}}(d\tau_{n-1,1:k_{n-1}}).$$

The second component is an adjustment move in which a Gaussian random walk kernel, restricted to $(\tau_{n,k_n-1}, t_n]$, is applied to the most recent jump time. This component of the kernel is given by:

$$K_{n,2}(x_{n-1}, dx_n) \propto \mathbb{I}_{(\tau_{n-1,k_{n-1}-1},\,t_n]}(\tau_{n,k_n}) \times \mathcal{N}(\tau_{n-1,k_{n-1}}, \sigma_a^2)\,d\tau_{n,k_n} \times \delta_{k_{n-1}}(k_n)\,\delta_{\tau_{n-1,1:k_{n-1}-1}}(d\tau_{n,1:k_n-1}).$$

A Gaussian kernel of the same variance, restricted to $(\tau_{k-1}, t_{n-1}]$, is used for the corresponding component of the backward kernel:

$$L_{n-1,2}(x_n, dx_{n-1}) \propto \mathbb{I}_{(\tau_{n-1,k_{n-1}-1},\,t_{n-1}]}(\tau_{n-1,k_{n-1}}) \times \mathcal{N}(\tau_{n,k_n}, \sigma_a^2)\,d\tau_{n-1,k_{n-1}} \times \delta_{k_n}(k_{n-1})\,\delta_{\tau_{n,1:k_n-1}}(d\tau_{n-1,1:k_{n-1}-1}).$$

The forward mixture weights were defined as in (11), with the backward mixture weights set to uniform. Resampling was performed when the ESS fell below 50%.

3. As 2, but with an auxiliary scheme using an approximation of the following auxiliary weighting function:

$$V_n(\tau_{n,k_n}, \theta_{n,k_n}, y_{n+1}) = \frac{S(t_{n+1}, \tau_{n,k_n})}{S(t_n, \tau_{n,k_n})}\,p(y_{n+1}|y_{1:n}, k_{n+1} = k_n, \tau_{n,1:k_n}) + \frac{1}{(S(t_n, \tau_{n,k_n}))^2} \int_{(t_n,t_{n+1}]} f(\tau_{n+1,k_{n+1}}|\tau_{n,k_n})\,S(t_n, \tau_{n+1,k_{n+1}}) \times p(y_{n+1}|y_{1:n}, k_{n+1} = k_n + 1, \tau_{n+1,1:k_{n+1}})\,d\tau_{n+1,k_{n+1}}.$$

The one-dimensional integral is approximated using standard numerical techniques (note that this does not affect the theoretical validity of the algorithm). The motivation for this auxiliary weighting function is that it approximates the predictive likelihood; further details are given in Appendix B. Resampling was applied at every iteration.
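All three schemes use systematic resampling, and Schemes 1 and 2 trigger it with an ESS threshold. The following is a minimal sketch of that mechanism, not the implementation used for the experiments. Particles are represented as an arbitrary list and weights as a list of non-negative floats; the function names and the `threshold` argument (a fraction of $N$, for example 0.5) are illustrative choices.

```python
import random


def effective_sample_size(weights):
    """ESS = 1 / sum_i (W_i)^2, where W_i are the normalised weights."""
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)


def systematic_resample(weights, rng):
    """Return N ancestor indices using one uniform draw shared by all strata."""
    n = len(weights)
    total = sum(weights)
    u0 = rng.random() / n  # single offset in [0, 1/N)
    indices = []
    i, cumulative = 0, weights[0] / total
    for j in range(n):
        u = u0 + j / n  # j-th evenly spaced point
        while u >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices


def resample_if_degenerate(particles, weights, threshold, rng):
    """Resample (and reset to uniform weights) only when ESS < threshold * N."""
    n = len(weights)
    if effective_sample_size(weights) < threshold * n:
        ancestors = systematic_resample(weights, rng)
        return [particles[a] for a in ancestors], [1.0 / n] * n
    total = sum(weights)
    return list(particles), [w / total for w in weights]
```

With `threshold=1.0` and non-uniform weights, the same function resamples at every iteration, as in Scheme 3.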
[Figure 4: Benchmark 2D position trajectory (solid) and MMSE filtering estimate (dotted). Axes: $s_x$/km (horizontal), $s_y$/km (vertical).]

It was found that all three schemes yielded roughly the same error performance in estimating $\zeta_{t_n}$ from each $\pi_n$, and little decrease in estimation error was observed using more than 200 particles. This can be attributed to the Rao-Blackwellised nature of the model. For an example of a model in which the parameters are not integrated out, see Whiteley et al. (2007); in that instance the VRPF exhibits significantly greater error in the filtering task, and is substantially outperformed by an algorithm which includes adjustment moves. MMSE estimates of the position of the vehicle using Scheme 3, made before resampling, are shown in figure 4. However, as more of the recent history of the trajectory is considered, including the times of the jumps in acceleration, the superior performance of the proposed method becomes apparent.

[Figure 5: Unique particles at the final iteration vs time $t$. (a) Scheme 1. (b) Scheme 2. (c) Scheme 3. Pre-resampling (solid), post-resampling (dotted).]

Figure 5 provides insight into the relationship between the particle diversity and weight degeneracy for the three algorithms, at the final iteration of the algorithm. In order to show the diversity of the particle set, we plot the number of unique particles at the final iteration, as a function of time. Again we consider the particle set before and after resampling, but note that in the case of Scheme 3, for the purposes of these diagnostics, we applied resampling at the final iteration according to the importance weights from which an estimate over the target distribution would be drawn.
The difference between the pre- and post-resampling plots for each algorithm indicates the variance of the importance weights: when the variance is high, the number of unique particles is relatively low post-resampling. For these tests the adjustment kernel standard deviation was set to $\sigma_a = \Delta/1000$. From figure 5 it can be seen that Scheme 1 exhibits particle diversity only in the very recent history, even before resampling. Scheme 2 exhibits much more diversity, especially between the times of the last and penultimate observations. However, much of this diversity is lost upon resampling. The plot for Scheme 3 shows that a significant proportion of the diversity arising from the adjustment move and the births in $(\tau_{n,k-1}, t_n]$ is not lost after resampling.

[Figure 6: Algorithm 3. Unique particles vs time $t$/s and standard deviation of adjustment kernel, $\log_{10}(\sigma_a/\Delta)$. (a) Pre-resampling. (b) Post-resampling.]

Figure 6 shows the effect of $\sigma_a$ in Algorithm 3, averaged over 50 runs for each value of $\sigma_a$, using $N = 50$ particles. Pre-resampling, all settings produce large particle diversity. However, as $\sigma_a$ approaches and exceeds the inter-observation time, the variance of the importance weights becomes high and consequently, post-resampling, the number of unique particles is dramatically reduced. Thus, when using a Gaussian random walk adjustment kernel, it is evident from figures 6 and 7 that there is a trade-off between diversity of particle locations and importance weight degeneracy.

Figure 7 compares the performance of the algorithms in terms of estimating jump locations. The use of the adjustment move in Scheme 2 yields a small improvement over Scheme 1. However, the results for Scheme 3 show a marked improvement.
Note that, in the model, jumps occur in the acceleration but it is only the position of the vehicle which is observed. Therefore optimal estimates of jump times in the very recent history should exhibit large variance; the results for Scheme 3 are commensurate with this.

5 Conclusions

We have presented a broad class of algorithms for the filtering of piecewise deterministic processes and demonstrated that this class incorporates existing techniques as particular cases. The proposed framework falls within a very general sampling framework for which convergence and stability results exist.

The significance of this work is twofold. First, it allows existing convergence results, and a central limit theorem in particular, to be employed to provide a formal theoretical justification both of algorithms which are already in use and of a more general approach which has intuitively preferable characteristics. Second, the proposed algorithms allow for the use of principled techniques which lead to substantial improvements in efficiency when compared with existing techniques.
[Figure 7: Weighted histograms of particle jump times at final iteration, pre-resampling. (a) Algorithm 1. (b) Algorithm 2. (c) Algorithm 3. In each sub-figure, top to second bottom: $\tau_{n,k}$, $\tau_{n,k-1}$, $\tau_{n,k-2}$, $\tau_{n,k-3}$, $\tau_{n,k-4}$; bottom: true acceleration magnitude $|a|/\mathrm{ms^{-2}}$. Horizontal axis: $t$/s. $\sigma_a = \Delta/1000$. $N = 5000$.]

References

Blair, W., Watson, G., Kirubarajan, T., and Bar-Shalom, Y. (1998), “Benchmark for radar allocation and tracking in ECM,” IEEE Transactions on Aerospace and Electronic Systems, 34, 1097–1114.
Cappé, O., Godsill, S., and Moulines, E. (2007), “An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo,” Proceedings of the IEEE, 96, 899–924.
Centanni, S. and Minozzo, M. (2006), “Estimation and filtering by reversible jump MCMC for a doubly stochastic Poisson model for ultra-high-frequency financial data,” Statistical Modelling, 6, 97–118.
Chopin, N. (2004), “Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference,” Annals of Statistics, 32, 2385–2411.
Chopin, N. (2006), “Dynamic detection of change points in long time series,” Annals of the Institute of Statistical Mathematics, 59, 349–366.
Dassios, A. and Jang, J.
(2005), “Kalman-Bucy filtering for linear system driven by the Cox process with shot noise intensity and its application to the pricing of reinsurance contracts,” Journal of Applied Probability, 42, 93–107.
Davis, M. (1984), “Piecewise Deterministic Markov Processes: a general class of non-diffusion stochastic models,” Journal of the Royal Statistical Society B, 46, 353–388.
Del Moral, P., Doucet, A., and Jasra, A. (2006a), “Sequential Monte Carlo for Bayesian Computation,” in Bayesian Statistics 8, Oxford University Press, pp. 115–148.
Del Moral, P., Doucet, A., and Jasra, A. (2006b), “Sequential Monte Carlo Samplers,” Journal of the Royal Statistical Society B, 63, 411–436.
Douc, R., Cappé, O., and Moulines, E. (2005), “Comparison of Resampling Schemes for Particle Filtering,” in Proceedings of the IEEE 4th International Symposium on Image and Signal Analysis, pp. 64–69.
Doucet, A., Briers, M., and Sénécal, S. (2006a), “Efficient Block Sampling Strategies for Sequential Monte Carlo Methods,” Journal of Computational and Graphical Statistics, 15, 693–711.
Doucet, A., de Freitas, N., and Gordon, N. (eds.) (2001), Sequential Monte Carlo Methods in Practice, Statistics for Engineering and Information Science, New York: Springer Verlag.
Doucet, A., Godsill, S., and Andrieu, C. (2000), “On sequential Monte Carlo sampling methods for Bayesian filtering,” Statistics and Computing, 10, 197–208.
Doucet, A., Montesano, L., and Jasra, A. (2006b), “Optimal Filtering For Partially Observed Point Processes Using Trans-Dimensional Sequential Monte Carlo,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Gilks, W. and Berzuini, C. (2001), “Following a moving target - Monte Carlo inference for dynamic Bayesian Models,” Journal of the Royal Statistical Society B, 63, 127–146.
Godsill, S. (2007), “Particle filters for continuous-time jump models in tracking applications,” in ESAIM: Proceedings of Oxford Workshop on Particle Filtering (to appear).
Godsill, S. and Clapp, T.
(2001), “Improvement Strategies for Monte Carlo Particle Filters,” in Sequential Monte Carlo Methods in Practice, eds. Doucet, A., Freitas, J. F. G. D., and Gordon, N. J., New York: Springer-Verlag, Statistics for Engineering and Information Science, chap. 7, pp. 139–158.
Godsill, S., Vermaak, J., Ng, K.-F., and Li, J.-F. (2007), “Models and Algorithms for Tracking of Manoeuvring Objects using Variable Rate Particle Filters,” Proceedings of IEEE, 95, 925–952.
Godsill, S. J. and Vermaak, J. (2004), “Models and Algorithms for tracking using trans-dimensional sequential Monte Carlo,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, pp. 976–979.
Godsill, S. J. and Vermaak, J. (2005), “Variable rate particle filters for tracking applications,” in Proceedings of IEEE Statistical Signal Processing Workshop (SSP), pp. 1280–1285.
Gordon, N. J., Salmond, S. J., and Smith, A. F. M. (1993), “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings-F, 140, 107–113.
Jacobsen, M. (2006), Point Process Theory and Applications: Marked Point and Piecewise Deterministic Processes, Probability and its Applications, Boston: Birkhäuser.
Jacod, J. and Skorokhod, A. V. (1996), “Jumping Markov Processes,” Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 32, 11–68.
Johansen, A. M. and Doucet, A. (2007), “Auxiliary Variable Sequential Monte Carlo Methods,” Tech. Rep. 07:09, University of Bristol, Department of Mathematics – Statistics Group, University Walk, Bristol, BS8 1TW, UK.
Johansen, A. M. and Doucet, A. (2008), “A Note on the Auxiliary Particle Filter,” Statistics and Probability Letters, to appear.
Kong, A., Liu, J. S., and Wong, W. H. (1994), “Sequential Imputations and Bayesian Missing Data Problems,” Journal of the American Statistical Association, 89, 278–288.
Lasdas, V. and Davis, M.
(1989), “A piecewise deterministic process approach to target motion analysis [sonar],” in Proceedings of the 28th IEEE Conference on Decision and Control, Tampa, FL, USA, vol. 2, pp. 1395–1396.
Maskell, S. (2004), “Joint Tracking of Manoeuvring Targets and Classification of their manoeuvrability,” EURASIP Journal on Applied Signal Processing, 15, 2339–2350.
Pitt, M. K. and Shephard, N. (1999), “Filtering via Simulation: Auxiliary Particle Filters,” Journal of the American Statistical Association, 94, 590–599.
Sworder, D. D., Kent, M., Vojak, R., and Hutchins, R. G. (1995), “Renewal Models for Maneuvering Targets,” IEEE Transactions on Aerospace and Electronic Systems, 31, 138–150.
Whiteley, N., Johansen, A. M., and Godsill, S. (2007), “Efficient Monte Carlo filtering for discretely observed jumping processes,” in Proceedings of IEEE Statistical Signal Processing Workshop, Madison, Wisconsin, USA, pp. 89–93.

A The VRPF Algorithm

The VRPF of Godsill et al. (2007) can be obtained by constructing a PDP particle filtering algorithm with the following mixture proposal kernel:

$$K_n(x_{n-1}, dx_n) = \sum_{m=0}^{\infty} \widetilde{K}_{n,m}(x_{n-1}, dx_n), \tag{14}$$

where the sub-probability kernels $\widetilde{K}_{n,m}$ may be decomposed as the product of a mixture weight and a probability kernel:

$$\widetilde{K}_{n,m}(x_{n-1}, dx_n) := \alpha_{n,m}(x_{n-1})\,K_{n,m}(x_{n-1}, dx_n).$$

The mixture weight $\alpha_{n,m}(x_{n-1})$ is the conditional prior probability that $x_n$ has $m$ jumps in $(t_{n-1}, t_n]$, given $k_n \ge k_{n-1}$, $\tau_{n,k_{n-1}} = \tau_{n-1,k_{n-1}}$ and $\sup\{p : \tau_{n,p} \le t_{n-1}\} = k_{n-1}$. Thus new jump times proposed from (14) will always lie in $(t_{n-1}, t_n]$. The components are:

$$\widetilde{K}_{n,0}(x_{n-1}, dx_n) = \frac{S(t_n, \tau_{n-1,k_{n-1}})}{S(t_{n-1}, \tau_{n-1,k_{n-1}})}\,\delta_{x_{n-1}}(dx_n),$$

and for $m > 0$,

$$\widetilde{K}_{n,m}(x_{n-1}, dx_n) = \frac{S(t_n, \tau_{n,k_{n-1}+m})}{S(t_{n-1}, \tau_{n-1,k_{n-1}})}\,\delta_{k_{n-1}+m}(k_n) \times \delta_{\tau_{n-1,1:k_{n-1}}}(d\tau_{n,1:k_{n-1}})\,\delta_{\theta_{n-1,0:k_{n-1}}}(d\theta_{n,0:k_{n-1}}) \times \prod_{j=k_{n-1}+1}^{k_{n-1}+m} f(d\tau_j|\tau_{j-1})\,q(d\theta_j|\theta_{j-1}, \tau_j, \tau_{j-1}) \times \mathbb{I}_{(t_{n-1},t_n]}(\tau_{n,k_{n-1}+1}).$$

Proposition 1.
For the proposal kernel (14), and in the case that the optimal backwards kernel is employed, the importance weight is given by:

$$w_n(x_{n-1}, x_n) \propto g_n(y_n \mid \zeta_{t_n}).$$

Proof. It is clear that $\pi_n \ll \pi_{n-1} K_n$. Using the optimal form of auxiliary kernel leads to the following expression for the importance weight:

$$w_n(x_{n-1}, x_n) \propto \frac{d\pi_n}{d\pi_{n-1} K_n}(x_n) = \frac{d\pi_n}{d\pi_{n-1}\left[\sum_m \widetilde{K}_{n,m}\right]}(x_n) = \frac{d\pi_n}{d\left[\sum_m \pi_{n-1} \widetilde{K}_{n,m}\right]}(x_n),$$

where the final equality follows by bounded convergence. Furthermore, as $(E_n, \mathcal{E}_n)$ is Polish and $\pi_{n-1} \widetilde{K}_{n,m}$ has within its support only configurations in which

$$p(x_n) := \sum_{j=1}^{k_n} \mathbb{I}_{(t_{n-1}, t_n]}(\tau_{n,j}) = m,$$

for any $q \neq p(x_n)$ there exists $B$ such that $\pi_{n-1} \widetilde{K}_{n,q}(B) = 0$. Thus, from the definition of the Radon-Nikodým derivative, we may write:

$$w_n(x_{n-1}, x_n) \propto \frac{d\pi_n}{d\pi_{n-1} \widetilde{K}_{n,p(x_n)}}(x_n) \propto g_n(y_n \mid \zeta_{t_n}).$$

It can thus be seen that the use of (14), with its optimal backwards kernel, initialising from the prior and resampling at every iteration, coincides with the VRPF algorithm of Godsill et al. (2007), with the ‘state-regeneration’ viewed as being performed at the beginning of the $n$th iteration as opposed to the end of the $(n-1)$th iteration. “Completing the neighbourhood” as discussed in Godsill et al. (2007) corresponds (for the models considered in this paper) to having obtained a sample from (14) by sampling from $f(\tau_j \mid \tau_{j-1})$ and $q(\theta_j \mid \theta_{j-1}, \tau_j, \tau_{j-1})$ until $t_n$ has been exceeded. This also coincides with the algorithm of Maskell (2004), with proposals made from the prior.

B Auxiliary Method

Consider the relationship between $x_n$ and $x_{n-1}$ induced by the VRPF proposal kernel. Denote by $\check{k}_n$ the number of jumps proposed at time step $n$, so that $k_n = k_{n-1} + \check{k}_n$.
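$$
\begin{aligned}
p(y_n \mid y_{1:n-1}, k_{n-1}, \tau_{n-1,1:k_{n-1}})
={}& p(\check{k}_n = 0 \mid \tau_{n,k_{n-1}}) \, p(y_n \mid y_{1:n-1}, k_n, \tau_{n-1,1:k_{n-1}}) \\
& + \sum_{\check{k}_n=1}^{\infty} \int_{S_n} p(\check{k}_n, \tau_{n,k_{n-1}+1:k_n} \mid \tau_{n,k_{n-1}}, \tau_{n,k_{n-1}+1} > t_{n-1}) \, p(y_n \mid y_{1:n-1}, k_n, \tau_{n,1:k_n}) \, d\tau_{n,k_{n-1}+1:k_n} \\
={}& \frac{S(t_n, \tau_{n-1,k_{n-1}})}{S(t_{n-1}, \tau_{n-1,k_{n-1}})} \, p(y_n \mid y_{1:n-1}, k_n = k_{n-1}, \tau_{n-1,1:k_{n-1}}) \\
& + \frac{1}{\left(S(t_{n-1}, \tau_{n-1,k_{n-1}})\right)^2} \int_{(t_{n-1}, t_n]} f(\tau_{n,k_n} \mid \tau_{n-1,k_{n-1}}) \, S(t_{n-1}, \tau_{n,k_n}) \, p(y_n \mid y_{1:n-1}, k_n = k_{n-1}+1, \tau_{n,1:k_n}) \, d\tau_{n,k_n} \\
& + \epsilon_n,
\end{aligned}
\tag{15}
$$

where $\epsilon_n$ represents the summands for $\check{k}_n = 2, 3, \ldots$ and $S_n = \{\tau_{n,k_{n-1}+1:k_n} : t_{n-1} < \tau_{n,k_{n-1}+1} < \cdots < \tau_{n,k_n} \leq t_n\}$. The auxiliary weighting function is obtained by truncating the series to the first two terms in (15) and approximating the one-dimensional integral in the second term by a standard numerical technique such as the trapezium rule.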
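The truncation described above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the function names `trapezium` and `auxiliary_weight`, the arguments `no_jump_term` and `one_jump_integrand`, and the toy numbers in the demonstration are all assumptions introduced here. The caller is assumed to supply the value of the zero-jump term of (15) and a function that evaluates the integrand of the one-jump term at a candidate jump time in $(t_{n-1}, t_n]$, in whatever form the model at hand defines them.

```python
import math
from typing import Callable


def trapezium(h: Callable[[float], float], a: float, b: float,
              num_panels: int = 64) -> float:
    """Composite trapezium rule for the integral of h over [a, b]."""
    if num_panels < 1:
        raise ValueError("num_panels must be at least 1")
    step = (b - a) / num_panels
    # End points carry half weight, interior nodes carry full weight.
    total = 0.5 * (h(a) + h(b))
    for i in range(1, num_panels):
        total += h(a + i * step)
    return step * total


def auxiliary_weight(no_jump_term: float,
                     one_jump_integrand: Callable[[float], float],
                     t_prev: float, t_now: float,
                     num_panels: int = 64) -> float:
    """Two-term truncation of the series: the zero-jump term plus a
    trapezium-rule estimate of the one-jump integral over [t_prev, t_now].
    Both inputs are hypothetical, user-supplied quantities."""
    return no_jump_term + trapezium(one_jump_integrand, t_prev, t_now, num_panels)


if __name__ == "__main__":
    # Toy numbers only: a stand-in zero-jump term and a smooth stand-in
    # integrand, not a model taken from the paper.
    t_prev, t_now = 1.0, 2.0

    def toy_integrand(tau: float) -> float:
        return 0.3 * math.exp(-(tau - 1.5) ** 2)

    print(auxiliary_weight(0.2, toy_integrand, t_prev, t_now))
```

The number of panels trades accuracy against the cost of evaluating the predictive likelihood at each node. Since the result serves only as an auxiliary weighting function, a coarse grid may well be adequate in practice, although that choice is model dependent.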